Corpus-based analysis and diachronic linguistics 9789027207708, 9027207704


English Pages 0 [300] Year 2012


Table of contents :
1. Message from the President (by Kameyama, Ikuo)
2. Center for Corpus-based Linguistics and Language Education (by Minegishi, Makoto)
3. Introduction (by Kawaguchi, Yuji)
4. The Atlas Linguarum Europae: A diachronic analysis of its data (by Viereck, Wolfgang)
5. Variationism and underuse statistics in the analysis of the development of relative clauses in German (by Lüdeling, Anke)
6. Variation and change in the Montferrand Account-books (1259-1367) (by Lodge, R. Anthony)
7. Cognitive aspects of language evolution and language change: The example of French historical texts (by Raible, Wolfgang)
8. The importance of diasystematic parameters in studying the history of French (by Schøsler, Lene)
9. The reorganisation of mood in the epistemic subsystem - The case of French belief predicates in diachronic dynamics (by Becker, Martin)
10. French liaison in the 18th Century - Analysis of Gile Vaudelin's texts (by Kawaguchi, Yuji)
11. Issues in the typographic representation of medieval primary sources (by Emiliano, Antonio)
12. An analysis of the misuse of the participle in old Russian texts (by Onda, Yoshinori)
13. A preliminary analysis of Arabic derived verbs in the Leeds Quran Corpus - With special reference to Stem III (CaaCaC) (by Ratcliffe, Robert R.)
14. On the narrow and open "e" contrast in Santali (by Minegishi, Makoto)
15. The classification of Apabhramsa - A corpus-based approach of the study of Middle Indo-Aryan (by Yamahata, Tomoyuki)
16. Changes in the meaning and construction of Polysemous words: The case of mieru and mirareru (by Shiba, Ayako)
17. Language change from the viewpoint of distribution patterns of standard Japanese forms (by Yarimizu, Kanetaka)
18. Index of proper nouns
19. Index of subjects
20. Contributors


Corpus-based Analysis and Diachronic Linguistics

Tokyo University of Foreign Studies (TUFS) Studies in Linguistics For an overview of all books published in this series, please see

Volume 3 Corpus-based Analysis and Diachronic Linguistics Edited by Yuji Kawaguchi, Makoto Minegishi and Wolfgang Viereck

Corpus-based Analysis and Diachronic Linguistics Edited by

Yuji Kawaguchi Makoto Minegishi Tokyo University of Foreign Studies

Wolfgang Viereck University of Bamberg

John Benjamins Publishing Company Amsterdam / Philadelphia



The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.

Library of Congress Cataloging-in-Publication Data
Corpus-based analysis and diachronic linguistics / edited by Yuji Kawaguchi, Makoto Minegishi, Wolfgang Viereck. p. cm. (Tokyo University of Foreign Studies (TUFS), studies in linguistics, ISSN 1877-6248 ; v. 3) Includes bibliographical references and index. 1. Corpora (Linguistics) 2. Language and languages--Variation. 3. Historical linguistics. I. Kawaguchi, Yuji, 1958- II. Minegishi, Makoto. III. Viereck, Wolfgang. P128.C68C654 2011 410.1’88--dc23 2011045661
ISBN 978 90 272 0770 8 (Hb ; alk. paper)
ISBN 978 90 272 7215 7 (Eb)

© 2011 – Tokyo University of Foreign Studies
No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher.
John Benjamins Publishing Co. · P.O. Box 36224 · 1020 ME Amsterdam · The Netherlands
John Benjamins North America · P.O. Box 27519 · Philadelphia PA 19118-0519 · USA

Contents
Message from the President (by Ikuo KAMEYAMA, President, Tokyo University of Foreign Studies)
Center for Corpus-based Linguistics and Language Education (by Makoto MINEGISHI, GCOE Project Leader)
Introduction (by Yuji KAWAGUCHI, Wolfgang VIERECK and Makoto MINEGISHI)
The Atlas Linguarum Europae: A Diachronic Analysis of Its Data (by Wolfgang VIERECK)
Variationism and Underuse Statistics in the Analysis of the Development of Relative Clauses in German (by Anke LÜDELING, Hagen HIRSCHMANN and Amir ZELDES)
Variation and Change in the Montferrand Account-books (1259-1367) (by Anthony LODGE)
Cognitive Aspects of Language Evolution and Language Change: The Example of French Historical Texts (by Wolfgang RAIBLE)
The Importance of Diasystematic Parameters in Studying the History of French (by Lene SCHØSLER)
The Reorganisation of Mood in the Epistemic Subsystem - The Case of French Belief Predicates in Diachronic Dynamics (by Martin BECKER)
French Liaison in the 18th Century - Analysis of Gile Vaudelin’s Texts (by Yuji KAWAGUCHI)
Issues in the Typographic Representation of Medieval Primary Sources (by António EMILIANO)
An Analysis of the Misuse of the Participle in Old Russian Texts (by Yoshinori ONDA)
A Preliminary Analysis of Arabic Derived Verbs in the Leeds Quran Corpus - With Special Reference to Stem III (CaaCaC) (by Robert R. RATCLIFFE)
On the Narrow and Open “e” Contrast in Santali (by Makoto MINEGISHI, Jun TAKASHIMA and Ganesh MURMU)
The Classification of Apabhraṃśa - A Corpus-based Approach of the Study of Middle Indo-Aryan (by Tomoyuki YAMAHATA)
Changes in the Meaning and Construction of Polysemous Words: The Case of mieru and mirareru (by Ayako SHIBA)
Language Change from the Viewpoint of Distribution Patterns of Standard Japanese Forms (by Kanetaka YARIMIZU)
Index of Proper Nouns
Index of Subjects
Contributors

Message from the President
Ikuo KAMEYAMA (President, Tokyo University of Foreign Studies)

It was a great honor for me to participate in the international symposium entitled “Corpus Analysis and Diachronic Linguistics”. This symposium was also attended by eight international scholars who traveled to join us from America, England, Germany and Sweden.

I will now dwell briefly on the Global COE Program, which began in April 2007. This program is an effort by the Japanese government to strengthen its support for research and educational institutions in which internationally renowned work is taking place. The program was developed to take advantage of world-class resources to help foster the development of creative researchers who can lead in their fields, and to strengthen research and education in Japan’s centers of graduate education. In 2007, proposals were solicited in five areas. The program submitted in the area of humanities by our university was one of 12 selected nationally. The humanities category encompasses fields as diverse as philosophy, art, psychology, education and archaeology. The submission from our university was the only one selected in the area of linguistics. We believe this reflects the high level of research and education at our institution.

Our submission, entitled “Corpus-based Linguistics and Language Education”, emphasizes a field of empirical linguistics based on the use of corpora. The program’s goal is to foster the growth of advanced researchers with international perspectives. This program continues the research conducted under the “Usage-based Linguistic Informatics” 21st Century COE Program, which concluded in March 2007. The new program will build on the international joint research framework that was created over the past five years to achieve two goals, with the support of the entire university:
1. To further develop a comprehensive education program for graduate students
2. To give graduate students opportunities to perform fieldwork, build and analyze corpora, and receive language education and training, both in Japan and overseas.

I am not an expert in linguistics, nor do I have a deep scholarly understanding of corpus linguistics. However, as a scholar of literature, I have a keen interest in the possibilities inherent in the field. The corpus concept was introduced into my area of specialization, Russian literature, in the late 1980s. As far as I know, this resulted in the creation of corpora for the works of authors such as Fyodor Dostoevsky and Andrei Platonov. However, it is not yet clear how effective the corpus concept will be in the development of the study of literature. In contrast, corpus-based linguistics seeks not to use linguistic data to understand the latent properties of a text as a closed system, but to understand the linguistic structure and function of a language within a larger context. So, I believe that corpus linguistics provides us with higher objectivity and richer possibilities in the field of humanities.

Still, it is my opinion that the greatest hurdles for corpus-based linguistics are yet to come. Humans are creatures that cannot help but seek out meaning and possibilities of systemizing matters. It is evident that corpus linguistics has not been a field that describes only the actual uses of languages, but one that finds ways to generalize creative discoveries and to extend its insights. Its value lies in its ability to push itself. For corpus-based linguistics to grow creatively as a human science, we must help young researchers to develop innovative and unique capabilities for analysis. I believe that this is where the real importance of the current G-COE Program lies.

In conclusion, I would like, as president of this institution, to express my sincere respect to all the leading researchers who attended this symposium, for their untiring efforts. More importantly, I hope that the young scholars who attended the symposium have imbibed some of the passion that was on display, and I hope that it will help them to grow into internationally competitive researchers.

July 12, 2011

Center for Corpus-based Linguistics and Language Education
Makoto MINEGISHI (GCOE Project Leader)

The Center for Corpus-based Linguistics and Language Education (CbLLE) was established with the express aim of building an education and research center with unique strengths in the study of linguistic diversity and in usage-based research on linguistic structure and language education. The Center builds on the strengths of the nationally high-ranking Graduate School of Area and Culture Studies of the Tokyo University of Foreign Studies (TUFS) and of the Research Institute of Languages and Cultures of Asia and Africa (ILCAA). Its educational and research uniqueness is achieved by integrating three core areas of activity: (a) collection and analysis of naturally occurring language use data through field research, (b) compilation and analysis of large-scale corpora of language use data from a wide range of languages, and (c) application of corpus-based linguistic analyses to language education and pedagogy. A few details of the work being done in these core areas follow.

Field linguistics: The field linguistics program supports fieldwork-based research on a typologically diverse set of languages, including not only the world’s major languages but also lesser-studied languages. They include languages of Africa, Eurasia, and North America. It also aims at advancing typological research on the basis of primary data from a broad range of languages. It provides students with solid training in the methodology of collecting, processing, and analyzing field data. The project has undertaken fieldwork-based study of a diverse range of languages of the world (lesser-studied languages in particular) and typologically informed description of these languages. Some of the projects under this category are: Compilation of a Word List for Field Research on Khwe Languages; Fieldwork-based Study of Under-studied Speeches of India; Collation of Spontaneous Conversational Data of Individual Languages such as Swahili, Russian, French, Spanish, and Turkish.

Corpus linguistics: The program in corpus linguistics supports analysis of a large amount of language use data and compilation of corpora, which feed into linguistic informatics research and also into descriptive and typological research. Some of the specific targets are: Building electronic corpora and developing analysis and processing tools in order to support new ways of analyzing language data and multipurposing of the data; Developing multilingual and multifunctional integrative corpora of language use for major languages on the basis of language use data collected in language teaching classrooms, blogs, etc.; Conducting international collaborative research and providing support in development and utilization of tools for corpus creation, morphological analysis, electronic dictionary creation, and text analysis. The projects undertaken here include: Development of Electronic Dictionaries for Russian as well as Thai (separately); Corpus Compilation of Data from Medium/Minor Language Groups; Development of Utility Manuals for German Corpus; Preparation of Introductory Text-book on Lexicology based on Corpus Data; Research on Corpora for Minor Language Group in EU Countries.

Linguistic informatics: The linguistic informatics program builds on the research in the field linguistics and corpus linguistics components to significantly advance research in language pedagogy. It seeks to do so by taking into account the factors of linguistic and cultural diversity through analysis of corpora of language use in actual contexts of language instruction, including naturally occurring conversations and learners’ language use. A few studies undertaken in this context are: Research on Lexicon/language-use based on Corpora for Various Fields; Language Processing/education Technology; POS Search Engine for Spoken French as well as Spoken Spanish; Basic Research on E-learning through Moodle; Corpora of Learners’ Language Use (both as an internal project and as an international collaborative project); Creation of Language Tests based on Error Analysis of Language Use of English Learners.

The GCOE trains researchers and educators who have a clear understanding of the nature and significance of linguistic and cultural diversity and can take a flexible research approach to language structure and language education. This project equips young researchers with a broad foundation for linguistic research by providing practical training in field research, corpus-based research and language education. These training programs support integrative research on linguistic and cultural diversity and usage-based linguistics by effectively connecting field data collection, data analysis, and educational application of the theoretical insights obtained from the analysis.

The specific projects and tasks listed above form part of the larger plan of building an international research and education center with the more general targets described below. The Center seeks to build a world-leading research and education center in the study of linguistic diversity and in the usage-based research of linguistic structure. The national and international infrastructure for supporting the GCOE is being built through the following activities:



Formation of an international network of collaborative research: Collaboration in corpus creation and in development of analysis tools (such as electronic dictionary systems); building a network of international collaboration and academic exchange in linguistic research and teaching within the framework of the ‘Consortium for Asian and African Studies’, which has its headquarters at the University.

Expansion of opportunities for academic interaction across institutions and across countries: Expanding opportunities for young researchers, as well as established scholars, within Japan and abroad to assemble and interact through visiting scholar programs and through employment.

Support program for young researchers: Providing young researchers with financial and technical support for linguistic field research, corpus creation, and education research in the field; and providing young researchers with financial support for professional development (including presentation at international conferences).

Active international dissemination of research results: Building an information technology infrastructure that supports active electronic dissemination of research results; and publishing the research results in a series of publications through international publishers that specialize in linguistics, the present volume being a contribution towards this aim.

Introduction
Yuji KAWAGUCHI, Wolfgang VIERECK and Makoto MINEGISHI

1. From dichotomy to hybrid dynamic synchrony

It is well known that the Cours de linguistique générale, the posthumous publication of Ferdinand de Saussure’s introductory course at the University of Geneva, which advocated the dichotomy between synchrony and diachrony, emphasized the synchronic description, rather than the historical vicissitudes, of language. Riedlinger’s note (dated December 17, 1909) makes it clear that synchrony took priority in Saussure’s lectures.

pour se rendre compte de ce qui existe dans un état de langue, le mieux est de faire abstraction du passé. Paradoxe, mais paradoxe vrai: les signes de la langue ont leur valeur fixée par ce qui coexiste, non par ce qui précède (exemples). Godel (1969) 70
“to explain what is going on in the state of a language, the best approach is to forget the past. This is a paradox, a real paradox: linguistic signs have their value fixed by what coexists, not by what precedes them (examples).”

The prioritization of synchrony over diachrony in the Cours probably originated from two motivations [1]. First, the synchronic reality of language is directly observable and less dependent on extralinguistic effects, which better suits the conception of the immanent principle of the Cours; in other words, “As its unique and veritable object, linguistics has the langue envisaged in itself and for itself [2].” Back then, this principle was essential to establish the foundation of linguistics as an independent domain of science that was based on the analysis of “langue”. Second, synchronic analysis was linked to the study of the systematic mechanism of language, which was a departure from the more psychologically oriented approach of earlier philological studies.
However, some linguists insisted on differentiating between linguistic rules and laws in the natural sciences, claiming that linguistic rules are no more than the results of historical observation, because the conditions of linguistic rules are unique and can never be reproduced in the same manner. This viewpoint, which contradicts our stance, underestimates the embryonic or ongoing changes that can often be observed in the same synchronic state of a language. Such an atomistic interpretation of linguistic phenomena consequently neglects the analysis of synchronic variation, since variation is identified through the recurrent occurrence of the variants in question. This holds true for the renowned French philologist Gaston Paris. He was keenly interested in the geographical variation of French, and created the post of Professor of Dialectology for Jules Gilliéron at the École Pratique des Hautes Études. However, Paris himself did not seem to show much interest in the ongoing variation of French.

Today, linguists would not question the existence of synchronic variation, which could be motivated regionally (diatopic variation), sociolinguistically (diastratic variation), or stylistically (diaphasic variation). In addition to the dichotomy of synchrony and diachrony, we can recognize the hybrid nature of synchrony that is referred to as “dynamic synchrony.” Such a conception of synchrony assumes that similar patterns of usage can coexist in a community during a certain period and that their mutual relations are not static but conflicting enough to give rise to a future systematic change through symptomatic synchronic variation. It is noteworthy that the appearance of large corpora of written texts for some languages has made it possible to analyze quantitatively as well as qualitatively the conditions of historical changes, not only over a long span of time, but also over a short span. This has already led to a number of studies on synchronic and diachronic variation, including the majority of the papers in this volume. [3]

[1] It is beyond our scope to verify whether this endorsement of the synchronic viewpoint dates back directly to the Cours. For example, in his Principes de grammaire générale, Louis Hjelmslev regards H. G. Wiwel as the pioneer of synchronic linguistics.
[2] “la linguistique a pour unique et véritable objet la langue envisagée en elle-même et pour elle-même.”, Cours, 317.
[3] See also Kawaguchi, Minegishi, and Durand (eds.) 2009.

2. Realia or Fictio in written documents

The main resources for diachronic studies are primarily written texts. However, from the very beginning of the research, the questions of how and why these texts have existed after they were written need to be confronted. The birth of a new genre such as prose literature in Old French, for instance, was closely linked to the cognitive evolution of medieval writers. Such a cognitive change is evident in the preface of the Chronique de Pseudo-Turpin, of which the earliest existing manuscript dates from the twelfth century.

Nus contes rimés n’est verais. Tot est mençongie ço qu’il en dient car il n’en sievent rienz fors quant par oïr dire. (Chronique de Pseudo-Turpin, ms.BN fr.124, fol.1)
“No rhymed tale is true. All that they tell are fictions because they know them only through hearsay.”

The writer presumes that historical descriptions should not be based on hearsay, i.e. fictio, but should represent realia, the reality. Verse is not a convenient tool for this purpose; prose is more often than not the preferable means to depict historical truth. Based on the evidence of existing documents, it would appear that medieval French writers rarely used prose before 1200. The emergence of prose was a historical event in the thirteenth century, which signaled a cognitive evolution. [4] The oral tradition represented by verse, which the medieval intelligentsia and nobility had been accustomed to, was losing ground, and reading literature, instead of listening to it, had become the general practice in the thirteenth century. [5] This ideological trend contributed to the creation of a new literary genre that focused on the construction of a virtual reality through written words. In the present volume, Wolfgang Raible treats this cognitive framework of medieval French texts by comparing the description of the Fourth Crusade (1202-04) by two contemporary writers, Geoffroy de Villehardouin and Robert de Clari.

Truth and reality as observed in written documents are polysemous, for they could be purely ideological, i.e. representing the writer’s wish to be true to reality, or they could represent an objective truth. In Japan, it is possible, even today, to hear a number of sociolinguistically different ways of saying the same thing (as was common in ancient times) in Kabuki, the Japanese traditional theater. For instance, samurai would say, “Itsu Edo-e mairareta?” Monks and doctors would use “Itsu Edo-e gozarimashita?” Housewives would say, “Itsu Edo-e okoshidegozarimashita?” Prostitutes would say, “Itsu Edo-e kiyashanshita?” Geishas would use “Itsu Edo-e kinasanshita?” Further, workers would say, “Itsu Edo-e oidenaseemashita?” All of these mean the same thing: “When did you come up to Edo (Tokyo)?” [6] It is really surprising that strikingly similar sociolinguistic variations can be found in the popular Japanese novels of Shikitei Samba, such as Ukiyoburo (1809-13) and Ukiyodoko (1813-14).
Today, the existence of such variations at the beginning of the nineteenth century is accepted without question in Japanese linguistics. Written documents remain the most valuable and reliable sources for linguistic historians, for the analysis of dynamic synchrony of the past. Of course, there were, and still continue to be, constant disputes about the authenticity of these stylistic variations in written documents. Rebecca Posner uses an interesting metaphor to describe the insolubility of this problem.

How far literary Romance has ever been identical with the spoken dialects is a moot point: To pose the question of the relationship between the two again calls out the hare we refused to pursue earlier. Posner (1972) 49.

[4] This historical development was not restricted to French. In Cornish, English and German, too, literature in verse preceded that in prose.
[5] Kawaguchi 2007b.
[6] Tanaka 1983, 173.



Linguists investigating written documents of the past are deeply indebted to the critical editing of philologists. António Emiliano, in his article in the present volume, remarks that the scholarly editing of primary sources cannot dispense with good practices and sound philological groundwork, and discusses several possible strategies concerning the typographic representation of medieval texts. Additionally, it is preferable to use tagged corpora and some kind of concordancer for the linguistic analysis of historical corpora. In the present volume, Anke Lüdeling, Hagen Hirschmann, and Amir Zeldes discuss their use of four comparable deeply annotated corpora representing different historical stages of German. Martin Becker’s analysis is based on the New Amsterdam Corpus (NAC) [7] and the Middle French subcorpus of Frantext [8]. An excellent parallel in English is Matti Rissanen et al., The Helsinki Corpus of English Texts (1991). It is a structured multi-genre diachronic corpus and covers the period c. 730-1710. As this is an early corpus, it comes as no surprise that there are offspring with considerable additions (and deletions) to the content, such as Ann Taylor et al., York-Toronto-Helsinki Parsed Corpus of Old English Prose (2003) and Anthony Kroch and Ann Taylor, Penn-Helsinki Parsed Corpus of Middle English (2000). Anthony Lodge and Yuji Kawaguchi use the electronic tools Loceme [9] and AntConc [10] respectively. The possibility of using such tools in linguistic analysis largely depends on the typological and grammatical characteristics of the languages under study. The development of tools useful for linguistic analysis is relatively easier for languages that are more or less isolating and written with spaces than for those that are agglutinative and written without spaces, like Japanese.

3. Ongoing changes in dynamic synchrony

The majority of the papers in this volume analyze ongoing changes, covering the phonetic, phonological, morphological, morphosyntactic, semantic, and pragmatic aspects of language, which appear as variation in the written documents. Section 7 of Lodge’s paper focuses on the phonetic features which characterize the North Auvergnat dialect. Spelling variation in documents represents not only dialectal but also phonological variation [11]. In their analysis of digitized data from Bodding, Makoto Minegishi, Jun Takashima, and Ganesh Murmu examine whether or not the distinction between narrow “-e-” and open “-e-” is phonological in Santali, the most widely spoken language of the Munda language family. Such ongoing changes are sometimes recognized and recorded by grammarians and foreigners. For instance, S. R. Brown, a Presbyterian missionary, vividly describes “vulgar” Tokyo pronunciations in his Colloquial Japanese (1863), published in Shanghai.

The vulgar in Tokyo say ai for ae, and oi for oe; thus mai, instead of mae, “before”; koi (which means “love”), instead of koe, “voice.” They also often contract ai into a long e:, as narane: for naranai, “it won’t do.” But this is as bad as the dropping of the letter h by cockneys. Matsumura (1957) 222.

[7] [8] [9] [10]
[11] Kawaguchi 2007a.

When compared to segmental variation, the variations in suprasegmental and linking phenomena are relatively less documented in written texts. Making use of the rare quasi-phonetic descriptions of Gile Vaudelin’s books, Kawaguchi analyzes the evolutionary stages of the liaison in French at the beginning of the eighteenth century.

Morphological or morphosyntactic variation is the most extensively treated kind of variation in this volume. Lodge analyzes variables in verb morphology, based on the corpus of the Montferrand Account-Books (1259-1367). The misuse of participles in Old Church Slavonic and Old Russian texts is analyzed in Yoshinori Onda’s contribution, while Robert Ratcliffe studies the phenomenon of semi-productivity in the so-called derived verb system of Classical Arabic, using data from the Leeds Quran Corpus. Morphological variation is sometimes closely related to ongoing syntactic change. Focusing on four main constructions of the two polysemous verbs mieru and mirareru “to see” in the Modern and Contemporary Japanese Corpus, Ayako Shiba finds that the existential construction of mieru is being progressively replaced by mirareru. These papers deal with the variation or ongoing change in dynamic synchrony.
The contributions of Lene Schøsler and of Lüdeling, Hirschmann, and Zeldes go beyond synchronic description: the former demonstrates models of the diachronic development of the composed past tense from Latin to Modern French, while the latter discusses models of the development of relative clauses from Old High German to Modern German. Using the New Amsterdam Corpus (NAC), Becker conducts a semantico-pragmatic analysis of mood selection in Old French belief verbs, such as cuidier and croire “to believe.” He describes a diachronic conflict in which the latter eventually expels the former, progressively encroaching on the contexts that the former had dominated for centuries. Finally, a corpus-based study of written documents can give us important clues about the classification of language groups. Through the corpus analysis of eight different texts of Apabhraṃśa, the language used in the literary works of northern India in the Middle Ages, Tomoyuki Yamahata finds little relationship between Apabhraṃśa and the New Indo-Aryan languages.

4. Linguistic atlas and diachronic linguistics

Can a linguistic atlas be considered a linguistic corpus? Jean-Philippe Dalbera, the French dialectologist, is in favor of this view¹². He distinguishes three different generations of geolinguistic works. The first generation establishes the concept of a linguistic atlas and its methodology. The second generation constitutes a reliable corpus of linguistic data by improving the atlas tool. He puts forward three key parameters for defining the corpus of a linguistic atlas: comparativism, diatopy, and lexicon. The analysis and use of these corpora are to be devised in the third generation. In this volume, two papers concern geolinguistic research of the third generation as per Dalbera’s definition. The construction of the Atlas Linguarum Europae (ALE)¹³, which began in 1970, was without doubt the first original research attempt arising from the existing linguistic atlases of Europe. According to Wolfgang Viereck, three aspects are important with regard to the interpretation of the word maps of ALE: loanword research, etymological research, and the study of the motivations in designating certain objects. Using ALE, he deduces the religious and cultural history of Europe, which provides a magnificent macro perspective for describing the linguistic landscape or scene (Sprachlandschaft) of Europe on both synchronic and diachronic axes. Kanetaka Yarimizu’s article takes a more micro perspective. His data comes from two recent dialect surveys: the Grammar Atlas of Japanese Dialects (GAJ), and the Glottogram survey in northern regions. Making full use of various statistical methods, he investigates the standardization processes of Japanese across five different historical stages.

5. Corpus-based analysis and diachronic linguistics

In this section, abstracts of 14 papers dealing with corpus-based analyses of different languages are presented in the order of their appearance in this volume.

1. Wolfgang Viereck, in “The Atlas Linguarum Europae: A Diachronic Analysis of Its Data,” discusses the vast linguistic diversity exhibited across the European continent, with 6 language families and 22 language groups, each with a large number of individual languages. Several short-lived projects


¹² See his “Linguistic Atlases: Objectives, Methods, Results, Prospects,” in Kawaguchi et al. 2007. 39-54.
¹³ Viereck (ed.) 2007.



had been launched since the mid-nineteenth century to study the European linguistic scene from different perspectives, yet the only one still in existence is the Atlas Linguarum Europae, which was begun in 1970. Seven fascicles have been published so far. The atlas is primarily an interpretative word atlas. Only a few typological maps have been published so far. One such map is presented in Viereck’s article. The study of the motivations in designating objects is an innovative attempt to interpret geolexical data. New insights into the cultural development of Europe are gained by applying this approach. The motivations are many, of course, but it is the history of religions that is of primary interest for Viereck, as religion is the basis of every culture. The cultural history of Europe is not made up of random elements and events; it follows a well-structured pattern that is described in some detail in his contribution.

2. Anke Lüdeling, Hagen Hirschmann, and Amir Zeldes, in “Variationism and Underuse Statistics in the Analysis of the Development of Relative Clauses in German,” introduce a corpus-based variationist approach to the study of language change, which hinges on the definition and explicit coding of variables and variants, or competing “ways of saying the same thing,” within their usage in corpus data. They use multiple extensible annotation levels to examine variants in the development of relative clauses from Old High German to Modern German, using four comparable deeply annotated corpora of different German language stages. They compare the frequencies of different grammatical categories, such as word forms, parts of speech, and syntactic constructions, to diagnose the most significant changes that are evident in their corpora, and show the advantages of dynamically re-examining quantitative results and categorization systems. Finally, they discuss how far their approach can support theories on language change, and lead to insights which enrich previous theoretical accounts.

3. According to Anthony Lodge, in “Variation and Change in the Montferrand Account-books (1259-1367),” the account books maintained between 1259 and 1388 by the consuls of Montferrand (Puy-de-Dôme, France), in the North Auvergnat dialect of Occitan, offer a rich source of material for linguists interested in language variation and change in the Romance languages before the onset of standardization. The transcriptions of these documents constitute an electronic corpus of 250,000 words. The documents are precisely dated throughout. They were all written in the same place, and all of them performed the same sociolinguistic function. With the help of an electronic tool, Loceme (designed by C. Mansfield, University of Plymouth, UK), Lodge is able to create a visual representation of the distribution of linguistic variants across the corpus, which allows him to rapidly identify the key diachronic variables, and to follow the process of linguistic change in the North Auvergnat dialect across more than a century.



4. In “Cognitive Aspects of Language Evolution and Language Change: The Example of French Historical Texts,” Wolfgang Raible begins with the assumption that every text token complies with a cognitive framework, called “genre” or “format.” Analyzing the earliest two historical texts written in Old French prose, both of which deal with the Fourth Crusade (1202-1204), the author puts forward two theses. (1) Somebody trying to write such texts for the first time will use already existing cognitive models (the only possible model in this case being Old French novels in verse, the so-called romances), that is, texts that are nowadays considered to be fiction, but were not treated as such at that point in time. (2) It will still take considerable time for the cognitive and linguistic framework for historical prose proper to develop. The material analyzed by Raible confirms both theses. In both texts, the intended addressees, the readers, are still thought of as hearers; history is conceived of as a tale consisting of a series of adventures where marvelous things happened; the technique of enhancing (sachiez que “let it be known to you”) comes directly from the romances; and so on. There is, in general, a lack of abstraction from, and distance to, the represented events. As was hypothesized by Raible, the appropriate cognitive framework appears much later; even what we would consider to be elementary terms like événement “event” are attested for the first time only around 250 years later.

5. Lene Schøsler proposes that hypotheses in diachronic linguistics can be confirmed or dismissed by means of corpora, in “The Importance of Diasystematic Parameters in Studying the History of French.” The hypotheses under investigation concern the nature of language change, and the appropriate models of change. The particular instance of language change that is used to illustrate her point is the creation of the composed past in Romance languages, especially in French, from the Latin present form habeo litteras scriptas “I have letters [that have been] written.” Several intriguing questions are raised in the paper:
• What is the function of the composed past in the old texts: is it a present form or a past form?
• Which are the phases of change?
• How does epic tense switching conform to analyses of the composed past?
• How can the conflicting evidence in the old texts be explained?
The focus is on the processes of reanalysis (i.e., innovation) and actualization in relation to the creation of the composed past, as these aspects have not yet been investigated or illustrated in texts. It is hypothesized that changes are always textually manifested in synchronic variation. The investigation of synchronic variation is related to the actualization process, and to well-known diasystematic parameters: diachronic, diatopic, diastratic, diaphasic, and diamesic variation.



6. Martin Becker, in “The Reorganization of Mood in the Epistemic Subsystem: the Case of French Belief Predicates in Diachronic Dynamics,” attempts to illustrate how a theory-based analytical framework such as modal semantics can be combined with corpus linguistics in order to gain deeper insights into the processes and mechanisms of language change. An interesting case in point is the Old French system of mood selection in the domain of belief (“doxastic”) predicates, and the principles of change in this subsystem from Old to Classical French. The database principally consists of the New Amsterdam Corpus (NAC) and the Middle French subcorpus of Frantext. Although Becker obtains interesting results regarding the stages and underlying principles of change, he points out the inherent limitations of the promising diachronic tools of corpus linguistics.

7. Yuji Kawaguchi’s paper, “French Liaison in the 18th Century: Analysis of Gile Vaudelin’s Texts,” discusses the corpus analysis of two of Gile Vaudelin’s texts, namely, Nouvelle manière d’écrire comme on parle en France (1713), and Instructions crétiennes, mises en orthografe naturelle, pour faciliter au peuple la lecture de la Sience du salut (1715). He analyzes the usage of Vaudelin’s new alphabets, describes liaison in the first quarter of the eighteenth century, and tries to place this synchronic situation in the evolutionary stages of French liaison. Liaison is realized almost without exception when the liaison consonant -z represents the plural morpheme in the personal pronouns nous and vous, as well as in the articles des and les. For verbs, liaison occurs most regularly when the liaison consonant -t represents the third person morpheme. Thus, the morphemic bound-form liaison consonants -z and -t are the most consistently pronounced, while the free-form type, including several adverbs such as moins, pas, plus, and toujours, occurs less regularly.

8. A bad philological approach can ruin an electronic corpus or archive, or severely diminish its value for research. The scholarly editing of primary textual sources in the digital age cannot dispense with good practices and sound philological groundwork. Traditional philologists often seem to take for granted the meaning, as well as the principles and procedures, of “transcription.” However, as it turns out, what most philologists think of as transcription is in fact “transliteration,” the replacement of a character set (such as a medieval character set) with a completely different set (the modern typographic version of the Roman alphabet). The aim of faithful scholarly editing should be transcription (at least at the early stages of the editorial process), for which several strategies can be devised. António Emiliano, in “Issues in the Typographic Representation of Medieval Primary Sources,” discusses several possible strategies regarding the typographic representation of medieval texts, and the preliminary aspects of corpus encoding, such as transliteration



and transcription, and character encoding procedures. He argues that direct typographic representation by means of a Unicode-compliant “medieval character set” is the optimal solution for both encoders and scholars.

9. In the history of the Slavonic literary language, the paradigm and function of the participle have changed considerably. In “An Analysis of the Misuse of the Participle in Old Russian Texts,” Yoshinori Onda focuses on the misuse of participles in Old Church Slavonic and Old Russian (OR) texts, which seem to attest the beginning of the change. Participles originally agreed with the related nouns, and represented the incidental action of the subject, only rarely appearing as predicates with conjunctions to connect participles and verbs equivalently. To explain the cause of this misuse, he hypothesizes that the similarity of the textual structures caused confusion regarding participle use, and that a reverential attitude towards the original texts greatly influenced the copyists. From research on the OR text Vita Constantini, he finds twenty-six cases of misuse. By analyzing these misuses from the point of view of usage, tense, and word order, he confirms that the misuses are caused by structural similarity. This finding supports the first hypothesis. However, the exact nature of the relationship between the text type and the attitude of the copyists could not be determined, despite the fact that the occurrences of misuse in quotations are fewer than in statements or conversations.

10. Robert Ratcliffe’s paper, “A Preliminary Analysis of Arabic Derived Verbs in the Leeds Quran Corpus, with Special Reference to Stem III (CaaCaC),” analyzes the data from the Leeds Quran Corpus in order to evaluate the phenomenon of semi-productivity in the so-called derived verb system of Classical Arabic. The preliminary statistical data turns out to be consistent with his earlier proposals that the stems are related to each other in a systematic derivational way, and that the core function of the stems is the marking of valence (Ratcliffe 2005, 2008). The latter point is investigated with particular reference to the so-called stem III, which has not been traditionally analyzed as having a primarily valence-changing function.

11. Santali, the most widely spoken member of the Munda language family, is spoken in the eastern states of the Indian subcontinent. It is said to have Southern and Northern dialectal varieties based on the difference in the number of phonemic vowels: the Southern dialect has six vowels, whereas the Northern one has eight. This is described in Bodding (1929-36), which is the largest dictionary of the language. In the paper “On the Narrow and Open “e” Contrast in Santali,” Makoto Minegishi, Jun Takashima, and Ganesh Murmu use digitized data from Bodding in order to examine whether or not the contrast between the narrow “e” and the open “e” is really phonologically distinct. First, the most frequent syllable patterns are examined, which are found to be disyllabic words that contain two open or two narrow



instances of “e”; the narrow and open instances of “e” rarely co-occur in a word. Attempts are then made to find candidates for minimal pairs in which the narrow “e” and the open “e” contrast in the same phonemic environment. The authors find that out of 285 such candidates, 153 cases refer to other words containing “e” with different height: open “e” to narrow “e”, or vice versa. Further, only seven cases of minimal pair candidates are found. They, therefore, conclude that the narrow and open “e” contrast is not a full-fledged phonemic one.

12. Apabhraṃśa was the language used in the literary works, such as poetry or narration, of northern India in the Middle Ages. Owing to considerable differences across documents written in Apabhraṃśa, numerous classifications of this language have been put forward from the Middle Ages to the present. In “The Classification of Apabhraṃśa: A Corpus-based Approach of the Study of Middle Indo-Aryan,” Tomoyuki Yamahata surveys these various classifications, and attempts to find classification criteria from a corpus of Apabhraṃśa. Tagare (1948) classified Apabhraṃśa texts into three groups: Eastern, Western, and Southern. Yamahata adds Kashmiri Apabhraṃśa to these groups. The inflectional forms of Apabhraṃśa fluctuate across multiple forms, which are generally pseudo-archaic and non-archaic. He proposes that the variation of Apabhraṃśa can be defined by style, specifically, by the degree of preference for pseudo-archaic forms. The corpus used for this paper consists of eight texts. Metrical lines divide the texts, but morphological or syntactical information is not tagged in them. Yamahata finds little relationship between the Apabhraṃśa groups and the New Indo-Aryan languages. He examines four characteristics of desinence, and finds considerable differences in each desinence. However, it is difficult to find consistent criteria for a choice of forms. The groups and the desinences show a more complicated relationship than was expected.

13. In “Changes in the Meaning and Construction of Polysemous Words: The Case of mieru and mirareru,” Ayako Shiba investigates the polysemous verbs mieru and mirareru “to see,” focusing on their construction types. She extracts four main construction types of both verbs from the Modern Japanese Corpus and the Present-day Japanese Corpus, and describes the actual use of each type in each corpus in order to explore how mieru and mirareru are extending their evidential meaning. The Modern Japanese Corpus contains many [[clause]-to mieru] constructions, and the incidence of the evidential inference type exceeds ten percent. However, it has decreased in present-day texts; that is, the evidential type with mieru is becoming antiquated. The evidential with mirareru, for its part, rarely appears in either corpus (critical essays). This suggests that mirareru has not replaced mieru in its evidential use. She proposes that the other evidential markers -yoda and -rashii have replaced



mieru, and the evidential type with mirareru has independently developed in the report genre. On the other hand, the incidence of mirareru has increased greatly in present-day texts because of the considerable use of the existence construction. It can be assumed that the mirareru construction is replacing the mieru construction in the existence type, considering the decrease in the existence use of mieru.

14. In “Language Change from the Viewpoint of Distribution Patterns of Standard Japanese Forms,” Kanetaka Yarimizu discusses the standardization processes of Japanese across five different historical stages. His data comes from two recent dialect surveys: the Grammar Atlas of Japanese Dialects (GAJ), and the Glottogram survey in the Tohoku and Hokkaido regions. The five stages of his standardization model are as follows: (1) the period until the mid-eighteenth century, (2) the period from the mid-eighteenth century to the end of the nineteenth century, (3) the period from the end of the nineteenth century to the mid-twentieth century, (4) the period from the mid-twentieth century to the present, and (5) the present. In the standardization process that occurred during the first two stages, the influence of the Kansai dialects, especially with Kyoto as the linguistic center, was strong. Standardization progressed after these stages through education, but did not affect the private domains of Japanese people, and the traditional dialects were maintained until the third stage, i.e. the beginning of the Meiji Era. Finally, from the mid-Meiji Era until today, standardization has been increasingly based on the standard forms used in Tokyo, and is continuously affected by the mass media. Thus, standardization and linguistic “Tokyonization” progress concurrently.

References

Bodding, Paul Olaf. 1929-36. A Santal dictionary. Oslo: I Kommisjon Hos Jacob Dybwad.
Desmet, Piet and Pierre Swiggers. 1991 [1992]. “Diachronie et continuité: les vues de Gaston Paris sur la Grammaire Historique du Français”. Folia Linguistica Historica XII/1-2. 181-196.
Godel, Robert. 1969. Les sources manuscrites du Cours de linguistique générale de F. de Saussure, 2e éd. Genève: Droz.
Hjelmslev, Louis. 1968. Principes de grammaire générale, 2e éd. København: Munksgaard.
Kawaguchi, Yuji. 2007a. “L’État actuel de la dialectologie du français médiéval: le cas des chartes champenoises méridionales”. Le Nouveau Corpus d’Amsterdam, P. Kunstmann and A. Stein (eds). Stuttgart: Franz Steiner. 188-200.
Kawaguchi, Yuji. 2007b. “Demonstratives in De Bello Gallico and Li Fet des Romains - Parallel corpus approach to medieval translation”. Corpus-Based Perspectives in Linguistics, Y. Kawaguchi et al. (eds). Amsterdam/Philadelphia: John Benjamins. 265-286.
Kawaguchi, Yuji, Toshihiro Takagaki, Nobuo Tomimori and Yoichiro Tsuruga (eds). 2007. Corpus-Based Perspectives in Linguistics. Amsterdam/Philadelphia: John Benjamins. 442p.
Kawaguchi, Yuji, Makoto Minegishi and Jacques Durand (eds). 2009. Corpus Analysis and Variation in Linguistics. Amsterdam/Philadelphia: John Benjamins. 399p.
Martinet, André. 1990. “La synchronie dynamique”. La Linguistique 26/2. 13-23.
Matsumura, Akira. 1957. Studies of languages of Edo and Tokyo (in Japanese). Tokyo: Tokyôdô.
Posner, Rebecca. 1972. “Positivism in Historical Linguistics”. Readings in Romance Linguistics, James M. Anderson and Jo Ann Creore (eds). The Hague/Paris: Mouton. 39-51.
Ratcliffe, Robert. 2005. “Semi-Productivity and Valence Marking in Arabic – the So-called ‘verbal themes’”. Corpus Approaches to Sentence Structure, Toshihiro Takagaki, Susumu Zaima, Yoichiro Tsuruga, Francisco Moreno Fernandez and Yuji Kawaguchi (eds). Amsterdam/Philadelphia: John Benjamins. 179-190.
Ratcliffe, Robert. 2008. “The Simple Math of Valence and Voice Ambiguity: Arabic Derived Verbs, Passive/Causative Overlap, and other problems”. Ambiguity of Morphological and Syntactic Analyses, Tokusu Kurebito (ed). Tokyo: ILCAA. 1-14.
de Saussure, Ferdinand. 1972. Cours de linguistique générale, publié par Charles Bally et Albert Sechehaye; avec la collaboration de Albert Riedlinger, éd. critique préparée par Tullio De Mauro. Paris: Payot.
Tagare, Ganesh Vasudev. 1948. Historical grammar of Apabhraṃśa. Poona: Deccan College, Post-Graduate and Research Institute.
Tanaka, Akio. 1983. Language of Tokyo: Its Establishment and Development (in Japanese). Tokyo: Meijishoin.
Viereck, Wolfgang. 2006. “The linguistic and cultural significance of the Atlas Linguarum Europae”. Gengojohogaku Kenkyuhokoku (Working Papers in Linguistic Informatics) 9, Kawaguchi et al. (eds). Tokyo: Tokyo University of Foreign Studies. 58-80.
Viereck, Wolfgang (ed). 2007 [2008]. Atlas Linguarum Europae, Vol. 1: Septième fascicule, Commentaires et Cartes. Rome: Istituto Poligrafico.

The Atlas Linguarum Europae: A Diachronic Analysis of Its Data Wolfgang VIERECK

1. A short presentation of the project

Bernardino Biondelli (1804-1886), today at least half forgotten, was in many ways an original scholar, also in the area of linguistic geography. Four decades before Jules Gilliéron published his Petit Atlas Phonétique du Valais Roman (1881) and sixty years before Gustav Weigand and, once more, Jules Gilliéron brought out their larger operations of this kind, Biondelli presented fascicle 1 of his programmatically challenging Atlante Linguistico d’Europa in 1841. The issues discussed were, according to the subtitle, “Nozioni preliminari, classificazione, carattere e regno delle lingue indoeuropee”. Given that the pan-European project, the Atlas Linguarum Europae (ALE), which was founded by the Dutch scholar Antonius Weijnen¹ at the University of Nijmegen, The Netherlands, came into existence only in 1970, Biondelli’s far-sighted anticipation of such dreams and his spadework on behalf of cartographic projections of linguistic facts on a European scale seem all the more remarkable. His map “Prospetto topografico delle lingue parlate in Europa” is interesting from a historical perspective. When it appeared in 1841, historico-comparative linguistics had already made some progress, but its greatest achievements were still to come in the second half of the nineteenth century. The map was, of course, constructed according to the knowledge of the time and it is, thus, not surprising that it shows a number of terminological and factual inaccuracies. On the other hand, the ALE map “Carte de distribution des familles et des groupes linguistiques” gives an accurate description of Europe’s linguistic situation. It can easily be consulted in the project’s publications, whose latest fascicle, fascicle 7, appeared fairly recently (Viereck 2007 [2008]). Fascicle 8 is with the publishers. The ALE map distinguishes between six language

¹ Weijnen notes: ‹L’idée d’un atlas linguistique du continent européen a été énoncée pour la première fois par Wilhelm Pessler en 1929 ... Il y propose d’entreprendre la recherche dans le domaine lexicologique. Un peu plus tard, les phonologues ont repris cette idée en lançant le projet d’un atlas phonologique de l’Europe ... Mais les événements de la seconde guerre mondiale ont empêché la réalisation du projet ...› (1981: 14). Apparently Weijnen was unaware of Biondelli’s 19th century project.



families: Altaic, Basque, Caucasian, Indo-European, Semitic and Uralic. In these language families, 22 language groups in total, such as Romance and Slavic, can be counted. These, in turn, consist of many individual languages. It thus becomes apparent that the demands on scholars to interpret the heterogeneous data collected from Iceland to the Ural mountains are very high indeed. Unfortunately, the ALE net is not uniform. Different countries collected materials in different ways, using new fieldwork, published sources (such as existing national linguistic atlases or dictionaries) and unpublished archives. While this is perhaps the only way in which such a large-scale project could have been carried out in practice, one must lament the loss of synchrony due to the chronological discrepancies involved in such a procedure. It is always the oldest vernacular words that are looked for in the various languages. “Dialects, and not only languages, since any comparative study of standard languages … by neglecting dialects, necessarily gives only a partial and incomplete reconstruction of the linguistic continuum; modern dialects, and not only ancient languages, as is traditional in Indo-European studies, for it is possible that modern dialects preserve more archaic features than the most archaic written documents” (Alinei 1983: XXII-XXIII). So far, commentaries on 62 notions and 84 computer-produced multi-colour maps have been published, large-format productions (74 cm×60 cm), each with an accompanying sheet of equal dimension explaining the various symbols employed. The objective here has been to create a symbology indicating conceptual congruity across language(-family) boundaries.

2. Presentation of a typological map

The ALE is, primarily, an interpretative word atlas. Typological maps are few in number. They deal with the presence vs. absence of the definite article, the position of the adjective with regard to the noun or with the obligatory vs. free use of subject pronouns. As to the definite article, Europe is divided roughly into two areas (see Map 1): the western area shows the article and the eastern area does not. More specifically, the whole Slavic area with the exception of Bulgarian and Macedonian, the whole Uralic area except for Hungarian, and the Altaic and Caucasian areas do not have the definite article. Within the area where the article does appear, there is an additional opposition between pre- and postposition of the definite article. Basque differs from the surrounding prepositive Romance areas; but within the Indo-European area itself not only the Scandinavian area (Danish², Norwegian, Swedish, Faroese and Icelandic),

² With the exception of the Danish dialects of West and South Jutland, which use a prepositive definite article.

Map 1. Typological Map No. 1: Languages with / without definite article




but also a compact area formed by Albanian, Romanian (the only Romance area with postposition), Bulgarian and Macedonian (the only Slavic areas with the definite article) have postposition. The picture is thus contradictory: on the one hand, postposition of the definite article isolates the Scandinavian area from its common Germanic ancestry; on the other hand, it contributes to unifying, despite their different origins, all Balkan groups: Romanian of Romance origin, Bulgarian and Macedonian of Slavic origin, and Albanian of Illyrian origin. This feature is one of the many on the basis of which the Balkan linguistic area forms a Sprachbund. The distributional area shows that the formation of the definite article is more recent than that of genetic branchings (Alinei 1997a: 33 with several corrections and additions). Generally speaking, the areal distribution of typological features does not seem to correspond to that of genetic features within the framework of language families or language groups. The interpretation of word maps follows different lines. Three aspects are important in this connection: loanword research, etymological research going back to prehistoric times and the study of motivations in designating certain objects.

3. Loanword research

Loanwords usually belong to the historical period, as they are connected with technology, culture and commerce. The ALE has important contributions to its credit in this area. Generally speaking, there are no problems with etymology. One such example is the set of expressions provided for the notion ‘ink’. A commentary on ink has not yet been published within the ALE framework. In ancient times black ink was mostly produced with lampblack. In the 3rd century AD, a mixture of soluble iron salt with tannic acid, often extracted from oak bark, came into use. This type of ink spread among the tribes of Europe. Therefore the word for ‘ink’ in present-day Germanic languages is identical with ‘black ink’, cf. German schwarz wie Tinte (‘black as ink’) or schwarz auf weiß (‘black on white [paper]’) or English atramentous ‘black as ink’. The same is true of the most widely diffused expressions for ‘ink’ in the Slavic area, such as Russian černila, Polish czernidło and Czech černidlo. They all go back to a Proto-Slavic root *čъrnidlo meaning ‘black colour, ink’. The words for ink in Finnish, Ingrian, Votic, Karelian, Mordvin, Lappish, Permic and Samoyed tšernila are all loans from Russian černila. Also Irish dubh goes back to Old Irish dub ‘black’. In addition to this most widespread colour, there were and there are also inks of different colours. In the southern Germanic area and in the British Isles, the use of ink goes back to the contact with the Romans during the first centuries AD. Ink came to Scandinavia from the British Isles with the introduction of Christianity.



Attestations written with ink in the Runic alphabet have come down to us from the 13th and 14th centuries. According to the OED (1989²) black is “a word of difficult history”. Frings (1966: 158) assumes that Old English blæc, blac was a translation of Latin atramentum ‘ink’, derived from Latin ater ‘black’. Old English blæc, blac came to Scandinavia from the British Isles with the introduction of Christianity. Whereas black meaning ‘ink’ is obsolete in English today (cf. OED 1989², s.v. ‘black’, sb. 2a), all the Scandinavian languages have retained it with this meaning (cf. Swedish bläck, Icelandic blek, Danish blæk, Norwegian and Faroese blekk). Finnish (b)läkki is a loan from Swedish and Lappish blækka is a loan from Swedish/Norwegian. The loans from Latin atramentum (librarium) are, of course, not restricted to the west and north Germanic area. They appear in direct form in Belorussian, Ukrainian, Czech, Slovak, Polish atrament and Lithuanian (a)tramentas. The loan process started from Polish. In the German-speaking area Tinte (with variants) dominates, going back to Latin tincta (aqua) ‘coloured (water)’. The word must have been borrowed after the second or High German consonant shift. Tinte predominated over the words going back to Latin atramentum as well as to Latin encaustum. From German, Tinte spread to a number of languages such as Polish (tint[a]), Lithuanian (tinta), Latvian (tinte), Estonian (tint), Livonian (tint) and Slovene (tinta). The Ukrainian form tinta could also have been borrowed from Hungarian tinta. This is a direct loan from Latin, as is the case with Spanish and Catalan, Portuguese and Italian tinta. In the western Germanic area, and in parts of the Romance and the Slavic areas, words prevailed that go back to Late Latin encau(s)tum which, in turn, derives from Greek ἔγκαυστον. Originally this term meant ‘purple ink’ used by the Roman emperors for signing documents. From there the general meaning ‘ink’ developed as we find it today in French encre, Italian inchiostro, Friulian ingiustri, Polish inkaust, Czech inkoust, English ink, Dutch inkt and Rheno-Westphalian dialectal forms. According to De Vries (1971) Latin encautum was adopted in the Rhineland when Roman emperors resided in Trier (Augusta Treverorum), the oldest city in Germany. From there it spread into Old Dutch, Old Low German and northern Old French, attested there as enque (11th century). Enque became Middle English enke (first attested in 1250) and Modern English ink. In the Old French form, the Greek accent was retained in this Latin loan, while Italian inchiostro and Old Occitan encaut follow the Latin stress pattern.

4. Etymological research: Faithfulness to reconstructed roots

Insights into the ethnolinguistic origins of Europe are also expected from



the ALE. This is a most lively and controversially debated field at present. In the area of Indo-European scholarship, scholars developed three theories during the last decades, the oldest being the Invasion Theory according to which there was a gigantic invasion at the beginning of the Metal Age that brought Proto-Indo-European to Europe. Archaeology and genetic research proved a little later, however, that there was irrefutable evidence for cultural continuity from the Paleolithic to the Bronze Age in Europe. These insights led to the so-called Neolithic Dispersal Theory, which assumes that Neolithic farmers coming from the Middle East introduced Proto-Indo-European into Europe, and the Paleolithic Continuity Theory, which assumes that there were no invasions from non-European peoples. With the following example I want to show that it is not without speculation to deal with aspects going so far back in time. Alinei, a strong supporter of the Paleolithic Continuity Theory, asks “Why has Indo-European a common word for ‘dying’, but not for ‘burying’ and ‘grave’?” (2008: 15) and concludes that only the Paleolithic Continuity Theory can account for this. He places his common word for ‘dying’ (Proto-IndoEuropean *mer- attested, according to him, in Celtic, Germanic, Italic, Greek and Balto-Slavic) to Middle Paleolithic, which must therefore be regarded as belonging to Common Indo-European, and the notions of ‘burying’ and ‘grave’ to the Upper Paleolithic and Mesolithic, when they were already expressed by different Indo-European words. In order to do this he had to manipulate the data. In addition to *mer- which, contrary to Alinei’s belief, is not attested in Celtic, nor in Albanian or Tocharian, the following verbal roots are listed in Mallory & Adams 1997: 150, s.v. ‘death’, with the meaning ’die, perish’: *nek-, *ųel- and *dheu-. They were as equally widespread as was *mer-, and, consequently, of the same age. 
In contrast, the distributions of *dhgʷhei- ‘perish’, attested only in Greek and Sanskrit, and *(s)ter- ‘kill’, attested only in Germanic and Old Irish, suggest late isoglosses in Indo-European. Thus, judging from the distributions of the verbal roots in Proto-Indo-European we can postulate at least a relative temporal difference between the two groups without pinpointing it to a specific period. If we are faithful to the data, as, of course, we should be, Alinei’s example does not prove what he says it proves. All too often scholars are so proud of their theory that they disregard the data when they do not fit the theory. This led Raven I. McDavid, Jr., who as a dialectologist had always been faithful to the data, to remark that “for many linguists, data has become the most obscene of all four-letter words” (1972: 192).

5. Motivational research

So much for loanwords and early etymological research within the frame of the ALE. But there is a third important aspect, namely the study of motivations.

The Atlas Linguarum Europae


Motivational mapping is an innovative manner of interpreting geolexical data. It goes beyond an interest in etymology and asks for the causes or the motives in designating certain objects. Only in a large-scale project such as the ALE can this approach be successfully pursued. In national, let alone regional linguistic atlases, the area is usually too small for the approach to be very productive. This may be one reason why it had aroused so little interest prior to the ALE. Another may be seen in De Saussure’s dominance in modern linguistics. The arbitrariness of the linguistic sign, important as it is for the functional aspect of language, left hardly any room for the genetic aspect of language, i.e. for the serious study of motivation. Seen more narrowly, however, the motivation of a linguistic sign is not in opposition to its arbitrariness, as the choice of a certain motive itself is not obligatory. As regards the ALE, insights into Europe’s cultural past follow less from loanwords and from reconstructed roots. Loanwords, as pointed out already, are too young, while reconstructed roots involve very early periods but are usually motivationally opaque and thus not very revealing for a cultural analysis. Insights into Europe’s cultural past rather follow from motivations in so far as they are transparent. This is an important point, as formal differences between languages can thus be eliminated and the focus is solely on semantic parallelisms. The motives for naming an object, of course, vary enormously. To give an example: Popular names for the plant Taraxacum Dens-leonis or Leontodon Taraxacum abound in Europe, which is no doubt due to its wide distribution. The names are not old, as the plant cannot safely be documented in the writings even of the early Middle Ages. Among the many motivational aspects there are those names referring to the shape of the leaves and to medical properties, i.e. to the effect the plant has on the bladder and the bowels. 
Dandelion, found everywhere in England, loan-translates medieval Latin dens leonis. According to the OED (1989²) it first appeared in English in 1513 in the form dent de lion. ‘Tooth of the lion’ is also attested, e.g., in German Löwenzahn, Danish løvetand, Norwegian løvetann, Spanish diente de león, Italian dente di leone and Welsh dant y llew. However, the standard French expression pissenlit refers to medical properties. It is interesting to note that pissenlit was taken over by neighbouring German and Dutch dialects as Bettpisser, Bettseicher, Seichblum and pisbloem, zeikbloem respectively (see Viereck 1997). In his study Les noms populaires des plantes dans les Pyrénées Centrales Jean Séguy concluded: « ... le chiffre le plus remarquable est celui du caractère forme ... en additionnant ... forme des feuilles, des fruits et des fleurs, on obtient 45,84% » (1953: 380). Both Séguy (1953) and Seidensticker (1997) describe the different motivations in designating plants that refer to the various forms of the leaves, the blossoms and the fruit, but completely exclude mythology and the history



of religion and culture that are in the centre of the discussion here. For elucidating Europe’s cultural past the frame of reference is the history of religions, as religion is the basis of every culture. Geolexical data show that the cultural history of Europe is not made up of random elements and events but follows a unified, well-structured pattern where three separate layers can be distinguished, namely a historical layer, i.e. a Christian/Muslim layer, and two prehistorical layers, i.e. an anthropomorphic layer going back to the Metal Age and an even earlier zoomorphic layer that also includes kinship representations. They are connected with more primitive societies of the Stone Age (cf. Alinei 1997b: 27). Cultural morphologists had already described the basics of the two prehistorical layers in the 1920s and 1930s (see, e.g., Frobenius 1929). In view of the atlas results the third, historical layer followed automatically. Unlike vertical dead archaeological stratigraphies, linguistic stratigraphies as presented on ALE motivational maps are horizontal and all the above layers are still alive. As my contribution to a volume edited by Professor Kawaguchi and published in 2006 (Viereck 2006a) contains many examples from many languages across Europe that fit this three-layer model, I will simply add one example now for the benefit of those to whom my paper is not known. The example is butterfly. Why is the butterfly called ‘butterfly’? The OED (1989²) informs its readers: “The reason of the name is unknown”. The dictionary editors can be helped in this respect. Assistance comes from earlier popular religious beliefs where butterflies played a prominent role. Starting with the middle layer, butterflies were given anthropomorphic designations in many European languages. We find, for instance, names of fairies for butterfly in Italian farfarello or French farfadet, both closely connected with Italian farfalla ‘butterfly’.
The brimstone butterfly is called Hex ‘witch’ in parts of northwestern Germany. In Austria the butterfly appears as Waldgeist ‘forest spirit’ and in Russia as babočka (derived from the goddess Baba). There were many names for the butterfly in German meaning ‘thief of milk’, namely Milchdieb, Milchstehler and Molkendieb. With this information we return to the question raised earlier. Especially in the Germanic area the belief was widespread that witches in the form of butterflies stole milk and butter. Compounds with butter- occur most often, such as Standard English butterfly, German Butterfliege and Dutch botervlieg, botervogel, but also boterwijf and boterhex, expressions that clearly show the belief in witches. The zoomorphic layer is also attested for this notion, e.g. ‘grandmother’ in Rhaeto-Romance and Basque, ‘mother’ in Austrian German and Sardinian and ‘(grand)father’ occasionally in Tat and Udmurt. The butterfly is christianized in Europe also, for example in Gaelic as ‘God’s bird’, in Norwegian as ‘Mary’s hen’, in Basque as ‘hen of the good God’, in Greek, referring to the Greek Orthodox Church, as the ‘pope’s wife’, in Finnish as ‘Brigit’s bird’ and ‘flying Brigit’


Map 2.


Map 3.



Map 4.


and in Komi-Zyryan as ‘God’s hen’. Viereck 2006b shows that there is much more to be said on the butterfly. The preceding three maps (Maps 2-4) present the distribution of the ALE data according to the three layers mentioned.³ They are based on the responses to the following nine notions of plants, animals and natural phenomena: blackberry, butterfly, cornflower, firefly, ladybird, lightning, rainbow, thunder and weasel. Generally speaking, the results are not surprising. Responses to the oldest layer are, of course, lowest in number. They are mainly to be found in the periphery of Europe, namely in Russia and parts of the Balkans. Answers that refer to the anthropomorphic layer are about twice as frequent as those of the zoomorphic layer. With the exception of Germany, the Netherlands and some regions in southeastern Europe (Hungary, Romania and Bulgaria) they are distributed fairly evenly over Europe, though with clear differences in frequency. Most of the anthropomorphic responses are, again, to be found in the periphery with Portugal in the West, Norway in the North, Sicily in the South and the Baltic states, Poland, Belorussia, the Ukraine and Russia in the East. In one locality in Lithuania five anthropomorphic answers were attested! Lithuania is in some respects a special case. This was the last European country that became christianised and that only in the late 14th century. Therefore pagan rituals are still very much alive there. The old pagan religion is known in Lithuania today as “Romuva”. Among its three main gods, Perkūnas, the god of thunder, is the most important one. His name often occurred in responses given by Lithuanian ALE informants. (For more information on Baltic, especially Lithuanian pagan religion cf. Trinkunas 2002.)
Another reason why Lithuania is a special case is provided by the great English philologist Joseph Wright who remarked: “From a linguistic point of view I love the Lithuanians more than any race under the sun” (Sladen 2010: 20). In contrast to Sladen who calls this, strangely enough, a “perhaps perverse claim” (2010: 20), Wright, of course, knew that Lithuanian was then and is now the most archaic among all the Indo-European languages spoken in Europe, and as a result it is very useful, indeed indispensable, in the study of Indo-European linguistics. In Hungary, Romania, Bulgaria and Albania zoomorphic and anthropomorphic responses are in complementary distribution: frequent zoomorphic answers show hardly any anthropomorphic ones there. The most equal distribution

³ The following information cannot be found on the maps: for the zoomorphic layer one attestation each in Malta and the gipsy languages for ladybird, for the anthropomorphic layer one attestation each in Malta for weasel and the gipsy languages for weasel and lightning and for the Christian layer, again, one attestation each in Malta for rainbow and the gipsy languages for rainbow and butterfly. Birgit Eder and I supplied the data and Arjen Versloot digitized them. Cordial thanks to both of them.



found today. The initials of Caspar/Kaspar+Melchior+Balthasar+the year are still written on the entrance doors of people’s houses in Catholic areas in Germany, in Italy and in Poland on Epiphany, January 6, to protect the people from evil of any kind and small pictures of St. Christopher are hung up by car drivers as a protection in many countries, such as the Ukraine and Germany. Apparently the Enlightenment has had no effect on people’s piety. The ALE relies, of course, on European dialects and languages. The motivational procedure unearthed some important elements in the mosaic of the cultural development of Europe. Unquestionably their consequences transcend the frontiers of the European continent. In the light of the complementarity of world cultures it would be highly desirable to complement the presented picture with insights into other cultures.⁴

References

Alinei, M. 1983. “Introduction”. Atlas linguarum Europae. Commentaires. Vol. I: premier fascicule. Assen: Van Gorcum. XV-XXXIX.
Alinei, M. 1997a. “The Atlas Linguarum Europae after a quarter century: a new presentation”. Perspectives nouvelles en géolinguistique, Alinei and Viereck (eds). Rome: Istituto Poligrafico. 1-40.
Alinei, M. 1997b. “Magico-religious motivations in European dialects: A contribution to archaeolinguistics”. Dialectologia et Geolinguistica 5. 3-30.
Alinei, M. 2008. “Forty years of ALE: memories and reflexions of the first general editor of its maps and commentaries”. Revue Roumaine de Linguistique 53. 5-46.
Ashley, L.R.N. 1974. “Uncommon names for common plants: The onomastics of native and wild plants of the British Isles”. Names 22. 111-128.
Biondelli, B. 1841. Atlante linguistico d’Europa. Vol.1, Part 1. Milano: Rusconie.

⁴ There are clear parallels with, e.g., the religion in Egypt, with the motivations of the rainbow in South America where in Brazilian Portuguese it is christianised differently than in the mother country as Arco-da-aliança de Jesus or Aliança de Cristo com os homens, in Japan, China, where the rainbow is a double-headed dragon that drinks water on both sides of the river, and in African languages (on the latter cf. Möhlig/Jungraithmayr 1998, s.v. ‘Regenbogen’ and ‘Flußschlange’). In all these regions the rainbow has zoomorphic motivations that are also attested in Europe, namely ‘drinking animal’, ‘(river, water) snake’ or ‘dragon’, with the numerous zoomorphic motivations of lightning in African languages (cf. Lagercrantz 2000), or with the motivation of thunder as ‘God’s cry’ in Karakere (Africa) or ‘turtle’ in Japan in the Kyoto area. What today in Europe is called Christmas belongs to the natural phenomena, namely to winter solstice. Winter solstice and summer solstice were of great importance also in Japan, as the excavations in Yoshinogari (Saga Prefecture) show (Yoshinogari Site 2000). For the weasel there is also a kinship term outside Europe, namely ‘bride’ or ‘son of the bride’ in Arabic.



Capelle, T. 2005. Heidenchristen. Mainz: von Zabern.
Frings, Th. 1966. Germania Romana, 2nd ed. Müller, G. Halle: Niemeyer.
Frobenius, L. 1929. Monumenta Terrarum: Der Geist über den Erdteilen. 2nd ed. of Festlandkultur. Frankfurt: Buchverlag.
Lagercrantz, S. 2000. “Om blixtdjur”. Annales Societatis Litterarum Humaniorum Regiae Upsaliensis. Årsbok 1999. Uppsala. 149-171.
Mallory, J.P. and D.Q. Adams. 1997. Encyclopaedia of Indo-European culture. London: Fitzroy Dearborn Publications.
McDavid, R.I., Jr. 1972. “Carry you home once more”. Neuphilologische Mitteilungen 73. 192-195.
Möhlig, W.J.G. and H. Jungraithmayr. 1998. Lexikon der afrikanistischen Erzählforschung. Köln: Köppe.
Müller-Karpe, H. 1998. Grundzüge früher Menschheitsgeschichte. 5 vols. Darmstadt: Wissenschaftliche Buchgesellschaft.
OED 1989² = The Oxford English Dictionary, prep. by J.A. Simpson and E.S.C. Weiner. Oxford: Clarendon Press.
Riegler, R. 1937/2000. ““Tiergestalt” and “Tiernamen””. Handwörterbuch des deutschen Aberglaubens, Bächtold-Stäubli and Hoffmann-Krayer (eds). Berlin: de Gruyter. 8. 819-842 and 863-901.
Séguy, J. 1953. Les noms populaires des plantes dans les Pyrénées Centrales. Barcelona: Monografias del Instituto de Estudios Pirenaicos.
Seidensticker, P. 1997. ‘Die seltzamen namen all’: Studien zur Überlieferung der Pflanzennamen. Stuttgart: Steiner.
Sladen, Ch. 2010. “Idle scholar who brought local language to book”. Oxford Today. Trinity Issue. 20-21.
Trinkunas, J. (ed). 2002. RASA: Götter und Rituale des baltischen Heidentums. Engerda: Arun-Verlag.
Viereck, W. 1997. “On some plant names in Britain and beyond”. Englishes around the world 1, Schneider (ed). Amsterdam: Benjamins. 227-234.
Viereck, W. 2006a. “The linguistic and cultural significance of the Atlas Linguarum Europae”. Gengojohogaku Kenkyuhokoku (Memoir for Linguistic Informatics) 9, Kawaguchi (ed). Tokyo: Gaikokugo Daigaku. 58-80.
Viereck, W. 2006b. “Chasing butterflies: Why is a butterfly called butterfly?” Japanische Kultur und Sprache. Studia Iaponica Wolfgango Viereck emerito oblata, Oebel (ed). München: Lincom Europa. 73-76.
Viereck, W. (ed). 2007 [2008]. Atlas Linguarum Europae. Vol.1: Septième fascicule. Commentaires and Cartes. Rome: Istituto Poligrafico.
Vries, J. de. 1971. Nederlands etymologisch woordenboek. Leiden: Brill.
Weijnen, A. 1981. “L’Atlas Linguarum Europae”. Union Académique Internationale. Compte rendu de la 55e session annuelle du comité. Budapest. 14ff.



The Yoshinogari Site. Japan’s largest site of an ancient moat-enclosed settlement. 2000. Saga City: Board of Education.

Variationism and Underuse Statistics in the Analysis of the Development of Relative Clauses in German¹

Anke LÜDELING, Hagen HIRSCHMANN and Amir ZELDES

1. Introduction

Many language change theories are quantitative, describing the gradual change from one form to another form. Any quantitative theory is, of course, built on a qualitative (categorical) analysis: one has to decide which forms to compare. It has often been noted (see among many others the discussion in Labov 2004) that here lies a crucial difficulty in diachronic analysis because categorization is difficult within and across language stages. Different categorizations lead to different analyses and often it is not the conclusions that differ but the basis on which they are built. In this paper we want to explore how a multi-layer corpus architecture, where different layers of analysis can be coded simultaneously, helps in understanding change phenomena. The main focus of this paper is methodological. In order to illustrate our point we investigate the development of the German relative clause from Old High German to New High German. We choose this phenomenon because it has facets in several layers of analysis: syntactic, morphological, and semantic, making it a challenging testing ground for a methodology analyzing language change. The complexity of the phenomenon in question also makes it necessary to have a corpus architecture capable of expressing different annotation formats. We must say at the outset that we will not find any qualitative conclusions that are radically new about relative clauses (which are well-researched and well-understood), but even though the diachronic corpus we use is small, our results fit with, and enrich, previous work on this subject, and show how such analyses can be performed. The paper is organized as follows: Section 2 introduces the general theoretical framework behind the study of quantitative variation, charting the competition between different variants realized in each language stage through the use of diachronic corpora.
Section 3 introduces and illustrates multi-layer architectures and Section 4 shows the use of overuse/underuse statistics as a corpus-based diagnostic. Section 5 then presents the corpus and the case study of German relative clauses, while Section 6 draws the final conclusions.

¹ We want to thank Eva Schlachter and Jürg Fleischer for interesting discussions and valuable comments.



of responses, though surprisingly not the most frequent occurrence, is shown by the youngest layer. Christian motivations occur mainly in Spain, central Europe, Hungary and the Baltic States. In the process of the cultural development of Europe we thus find recurrent structural patterns: the same reality was first given kinship and zoomorphic names to be followed by anthropomorphic names and finally by Christian and Islamic names, and this across all language and dialectal borders. The three periods mentioned, of course, do not end and begin abruptly. Archaeological finds show that there were fluid transitions also between the Stone Age on the one hand and the Metal Age on the other and that anthropomorphic representations were known also in the Neolithic period (cf. Müller-Karpe 1998). Also Riegler noted: “Remarkable are the many transition phases that led from the theriomorphic to the anthropomorphic apperception” (1937/2000: 826f.; translated from German). That the transitions between the pagan and the Christian layer can be better documented is to be explained by their greater temporal proximity to us. Up to the early 4th century the early Christian church had been an underground church and it took many centuries until the Christian faith had penetrated the whole of Europe. In Scandinavia heathendom and Christianity had co-existed down to the 11th century (cf. Capelle 2005, who calls his book characteristically “heathen Christians”) and Lithuania became christianised only in the late 14th century. Just as earlier pagan places of worship had turned into Christian places of prayer, so Christian churches turned later into mosques. The best-known example of such a transformation is no doubt the Hagia Sophia in Istanbul. Also Jewish synagogues were consecrated as Christian churches.
A good example of where the change was even kept in the name is the Sinagoga Santa María la Blanca in Toledo, which had become a Christian church already in 1405 long before the Jews were expelled from Spain in 1492. With new religious beliefs a wave of new designations followed, yet the old conceptions often remained the same. To take just one example out of many: When Christianity came to Britain, the bright yellow flowers of the plants in the Hypericum family that had been associated with the golden brightness of Baldur the sun-god came to be called St. John’s wort, as Baldur’s Day became St. John’s Day. The plant continued to be thought a cure for wounds and on St. John’s Eve good Christians wore a sprig of it to ward off evil spirits and especially to protect themselves against the stray thunderbolts of the gods (Ashley 1974: 116).

Saint John’s Day is the Christian equivalent of the summer solstice, one of the most important events in prehistoric times. In the early Christian period, pagan thought was alive and well. However, examples of this can easily be



2. Variation and variationism

For a long time theoretical linguistics has argued that linguistic systems are rather homogeneous and that variation is accidental and therefore not interesting for theory building.² Contrary to that view, many studies in sociolinguistics, historical linguistics and synchronic corpus-based linguistics have shown that variation is not random and that speakers of a language have very fine-grained and consistent knowledge of usage. Starting with Labov’s famous 1966 study of phonological variation in New York it has been shown again and again that variation happens on all linguistic levels and most of it is quantitative rather than qualitative. Variation is only possible if there are several ways of doing ‘the same thing’ from which the speaker can choose. If a speaker of German for example wants to express the fact that something is acceptable, she/he can say: X ist akzeptabel or X ist annehmbar or man kann X akzeptieren etc. This is only interesting if, as is argued, the choice between the variants is not random but triggered by grammatical and functional factors. The different variants of ‘the same thing’ are correlated with other linguistic and extralinguistic factors. Labov (2001, 2004), among many others, has linked variation to social variables. Other studies, such as those of Biber (1988, 2009), show that there is a lot of variation within a speaker and that this can be attributed to different functional needs: it is said that each speaker is able to vary his/her linguistic behavior according to the situation/purpose etc. of the utterance. The obvious and very difficult problem is, of course, to decide what counts as ‘the same thing’. Here we want to use the terms variable for ‘the same thing’ and variant for a possible realization of a variable. A variable is always an abstraction over several variants.
In addition to such functionally triggered synchronic variation there is diachronic variation: the idea that one variant may increasingly come to take on functions or contexts previously associated with another variant, over time. It is probably impossible to tease these two types of variation apart; most of the time a diachronically ‘new’ variant occurs first in a given register and then becomes ‘fashionable’. Those historical linguists who accept (quantitative) variation as a trigger for language change are called variationists (Labov 1994 & 2001, Rissanen 2008). Figure 1 illustrates the idea that language change cannot be described in terms such as ‘in period X people used A and in period Y people used B’; rather there is a gradual change where one variant is becoming stronger while another variant is slowly fading.

² We will not go into the long-standing debate between competence-based (generative) models and usage-based models for language change. See Wasow (2007) and Sag & Wasow (to appear) for a discussion.

Variationism and Underuse Statistics


Figure 1. An illustration of how variants of a variable change quantitatively over time. A, B, and C are all variants of a single variable (from Rissanen 2008, 59).
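The quantitative view behind Figure 1 can be made concrete with a minimal sketch: for each period, count the realizations of one variable and express every variant as a share of that total. The function name `variant_shares` and the toy counts below are invented for illustration only; they are not taken from Rissanen (2008) or from any corpus discussed here.

```python
from collections import Counter


def variant_shares(observations):
    """Return, per period, each variant's share of all realizations of the variable.

    `observations` is an iterable of (period, variant) pairs.
    """
    counts = {}
    for period, variant in observations:
        counts.setdefault(period, Counter())[variant] += 1

    shares = {}
    for period, counter in counts.items():
        total = sum(counter.values())
        shares[period] = {variant: n / total for variant, n in counter.items()}
    return shares


# Invented toy data: one variable with the variants A, B and C in three periods.
observations = (
    [("period 1", "A")] * 8 + [("period 1", "B")] * 2
    + [("period 2", "A")] * 5 + [("period 2", "B")] * 4 + [("period 2", "C")] * 1
    + [("period 3", "A")] * 1 + [("period 3", "B")] * 6 + [("period 3", "C")] * 3
)

shares = variant_shares(observations)
for period in sorted(shares):
    print(period, {v: round(s, 2) for v, s in sorted(shares[period].items())})
```

In this toy data variant A fades while B and C gain ground, which is the kind of gradual shift the figure depicts. Shares within a period always sum to one, so the sketch presupposes that the variable and its variants have already been delimited.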

In a variationist approach to language change³ one therefore needs to define a variable and its variant expressions (see Section 4). Because of the large amount of variation within a language stage (see above) it is crucial to use comparable corpora. Ideally the corpora should contain texts which differ only in one parameter (here: time) so that all differences can be attributed to that one parameter. While it might be possible to build contemporary corpora that fulfil (or come close to fulfilling) that requirement,⁴ historical texts are, of course, much more diverse and there are many parameters that cannot be controlled for because the information (e.g. about the author or the intended audience) is not known, or a given genre does not exist, or suitable texts (e.g. personal letters) have simply not survived. In this situation some authors use parallel corpora instead of comparable corpora (for European languages this usually means Bible corpora, see Resnik et al. 1999 for a discussion and Zeldes 2007 for an example), but parallel corpora, which necessarily involve translations, come with their own set of problems (see e.g. Baroni & Bernardini 2006 on translationese). There are many corpus-based studies of language change.⁵ While some of them focus on lexical categories that can be researched in unannotated corpora, many involve annotation of some kind. The most well-known annotated historical corpora are probably the treebanks built from the Helsinki corpus (and sometimes additional material, see hist-corpora/) which have been used for many studies. For German there are not (yet) many publicly available annotated historical corpora.⁶



³ Language change does not have to be ‘historical’. The same method can be applied to the study of recent or ongoing change, see Mair (2009) for an overview.

⁴ This has, for example, been the idea behind the Brown corpus family (see e.g. Leech et al. 2008) or the ICE corpora (Greenbaum & Nelson 2009,

⁵ In essence, all historical studies are corpus-based. We use the term corpus here only for electronic corpora.



However, while annotated corpora are enormously helpful, they could be even more helpful if some widespread problems were overcome.⁷ The annotation is usually done by a group of researchers according to a very specific annotation scheme and research question. The corpus architecture is then not flexible enough to handle annotations with different formats (such as trees, spans or pointing relations) or merge annotations made by different tools. Research questions that do not interest the original annotators or hypotheses that come up during an analysis are not, or cannot be, included. This means that categorization beyond the provided annotation and quantitative analysis is usually done in separate programs (e.g. spreadsheets) and not coded in the corpus (this is, in essence, the traditional way of working with historical documents, see Meyer 2008). It is therefore not directly available to other researchers and results are not easily reproducible or reusable. In the following we want to show how a flexible corpus architecture that allows various annotation formats, the addition of annotation layers at any point, and visualization of quantitative aspects can help in the analysis of linguistic change. Before we go into our case study we will briefly introduce our corpus and the phenomenon we will be looking at.

3. Data and corpus architecture

For our study we use the DeutscheDiachroneBaumbank (DDB, available at, a tiny, but deeply annotated, comparable diachronic corpus of German which consists of the following subcorpora:
◦ Subcorpus Old High German (OHG), containing the Gospel of Matthew, based on an edition by George Allison Hench (1890). The subcorpus is a part of the Monsee Fragments (written at the end of the 8th century). It consists of 3626 tokens.
◦ Subcorpus Middle High German (MHG), consisting of a collection of Middle High German sermons, called “Speculum ecclesiae” (written at the end of the 12th century), based on an edition by Gert Mellenbourn (1944). The subcorpus consists of 2483 tokens.


⁶ In addition to those described in Kroymann et al. (2004) we are aware of the following annotated historical corpora of German: The Early Modern German Mercurius Treebank (Demske 2007) which is not yet publicly available and the GermanC corpus (http://www. which is annotated on several levels (but not syntactically). The situation is changing, however: The projects Referenzkorpus Althochdeutsch (http:// and Mittelhochdeutsche Grammatik ( will make their material available shortly via ANNIS (Section 3).

⁷ The same problems pertain to most contemporary corpora as well.



◦ Subcorpus Early New High German (ENHG), consisting of a sermon by the preacher Veit Nuber (written 1544), called “Ein kurtze und einfeltige unterweisung zum sterben nutzlich und heilsam den krancken furzuhalten an irem letzten/aus der heiligen schriften zusamen gelesen”, extracted from the Bonner Frühneuhochdeutschkorpus (Diel et al. 2002). The subcorpus consists of 2673 tokens.
◦ Subcorpus New High German (NHG), comprising the first four chapters of the Acts of the Apostles from the Neue evangelistische Übertragung, a freely available translation of the entire Bible (New Testament 2003, Old Testament 2009) prepared by Karl-Heinz Vanheiden and available from The subcorpus consists of 3574 tokens.

The corpus is annotated as follows: The NHG corpus contains part of speech tags automatically generated using the TreeTagger (Schmid 1994) and constituency trees generated using the Stanford Parser (Klein & Manning 2003), but no morphological or dependency information. The historical corpora contain the following annotations, which were created manually (see Figure 2):
◦ part of speech annotation (POS), using the German STTS-tagset (http://
◦ morphological information based on the TIGER morphological tagset (inflectional morphology).
◦ syntactic annotation, using the annotation scheme of the Tiger Project (, which is a combination of dependency and constituency annotation.⁸
◦ normalized spelling of the original text based on editions, in order to ensure uniform searchability of word forms.
◦ hyper-lemmatization to create comparability between language stages, based on the morphologically or, in special cases, semantically corresponding New High German lemma.
◦ absolute and normalized frequencies for word forms, lemmas, POS, and POS-bigrams, as well as Underuse/Overuse ratios and statistical significance for each token as compared to the NHG corpus (see below).⁹


⁸ The annotation scheme was developed by Hagen Hirschmann and Sonja Linde. For synchronic corpora of German the TIGER annotation scheme (Brants et al. 2002) has come to be the most influential and widely accepted. In order to make the historical corpora comparable to the modern corpus it was decided to adhere as closely as possible to the original TIGER annotation and propose changes very conservatively.

⁹ Further annotation levels present in the historical corpora but not used in this study are:
- Normalized lemmatization according to standard dictionary norms for each language stage.
- Bibliographic annotation referring to the editions’ scheme for coding lines in the original manuscripts.
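The frequency annotations listed above (normalized frequencies and underuse/overuse ratios relative to the NHG subcorpus) can be sketched roughly as follows. The token totals are the subcorpus sizes given in the text; everything else is an assumption made for illustration: the per-thousand normalization, the definition of the ratio as a quotient of relative frequencies, the choice of a two-corpus log-likelihood statistic as the significance measure, and the example counts. The text at this point does not specify which ratio or test the corpus actually stores.

```python
import math

# Subcorpus sizes in tokens, as reported for the DDB.
SUBCORPUS_TOKENS = {"OHG": 3626, "MHG": 2483, "ENHG": 2673, "NHG": 3574}


def per_thousand(count, corpus):
    """Normalized frequency: occurrences per 1,000 tokens of the given subcorpus."""
    return 1000 * count / SUBCORPUS_TOKENS[corpus]


def usage_ratio(count, corpus, ref_count, reference="NHG"):
    """Quotient of relative frequencies (assumed definition).

    Values below 1 indicate underuse, values above 1 overuse, relative to
    the reference subcorpus. Returns infinity if the item is absent there.
    """
    if ref_count == 0:
        return math.inf
    return per_thousand(count, corpus) / per_thousand(ref_count, reference)


def log_likelihood(count, corpus, ref_count, reference="NHG"):
    """Two-corpus log-likelihood statistic (assumed significance measure)."""
    n1, n2 = SUBCORPUS_TOKENS[corpus], SUBCORPUS_TOKENS[reference]
    total = count + ref_count
    expected1 = n1 * total / (n1 + n2)
    expected2 = n2 * total / (n1 + n2)
    g2 = 0.0
    for observed, expected in ((count, expected1), (ref_count, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Hypothetical counts: some category found 40 times in OHG and 80 times in NHG.
ratio = usage_ratio(40, "OHG", 80)
g2 = log_likelihood(40, "OHG", 80)
print(f"ratio = {ratio:.2f}, G2 = {g2:.2f}")
```

With subcorpora of only a few thousand tokens, many items will have very low counts, so any such statistic should be read cautiously.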



Figure 2. Sample sentence (“peace be this house”) from DDB-OHG with all annotation layers: From top to bottom: syntactic annotation, bibliographic information, text representation in the original text edition, lemmatization, normalized word layer and statistical information for token annotations.

The representation of the heterogeneous types of data described above requires a special corpus architecture which is both searchable on all levels simultaneously (i.e. we can find all cases of certain syntax-tree structures overlapping certain spans of orthographic forms with significantly deviating frequencies) and extensible, so that further levels of annotation can be added, modified or removed in the course of the study, easily and independently. The currently most versatile technique for achieving these goals is the use of standoff XML formats, in which primary data and each annotation level are all kept in separate XML files (see Carletta et al. 2003, and especially Lüdeling/Poschenrieder/Faulstich 2005 in the context of historical corpus

Variationism and Underuse Statistics


architectures). In this case we used PAULA XML (Dipper 2005) to merge annotations from multiple source formats: TigerXML (see http://www.ims. for syntactic annotations and EXMARaLDA XML (see http://www.exmaralda. org/) for other span based annotations, as well as output from automatic tools like the TreeTagger (see above). Through the use of standoff XML it becomes possible for researchers to work concurrently on the same source data (the transcribed manuscripts) with multiple annotation tools without altering it, and to revise separate annotations later or even apply several versions of the same annotation layer.

To search through the annotated data and visualize our search results we use ANNIS2 (see Zeldes et al. 2009, annis/). This system grants corpus access to multiple users over a web-browser and provides a query language, AQL (ANNIS Query Language), for expressing the annotation graphs to be searched for. Query results are then visualized in multiple levels according to annotation types, e.g. with syntactic annotations receiving tree visualizations and span annotations being displayed as grids. For more detailed information on the corpus architecture, the reader is referred to Hirschmann/Lüdeling/Zeldes (submitted). In the following we show how multiple annotation layers are simultaneously needed to study the development of relative clauses.

4. Comparing quantities: under and overuse of corpus measurements

In order to compare the distribution of variants in different language stages it is necessary to code them in a way that makes them identifiable and extractable for researchers. Since more complex types of variation involve not just surface word forms but also higher-level categories, such as parts of speech or syntactic structures, these must be annotated wherever they occur.
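The idea of keeping primary data and annotation layers apart, and of searching several layers at once, can be illustrated with a minimal sketch. This is not PAULA XML or AQL: the `Span` class, the layer names, the `RC` label and the hand-assigned STTS tags below are illustrative assumptions, applied to the opening of the Acts 1:18 sentence used in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One standoff annotation: it points at token positions, never at text."""
    layer: str   # hypothetical layer name, e.g. "pos" or "syntax"
    start: int   # index of the first token covered (inclusive)
    end: int     # index of the last token covered (inclusive)
    value: str   # the annotation value, e.g. a tag or a node label


# Primary data: an immutable token sequence that no layer ever modifies.
TOKENS = ("Von", "der", "Belohnung", ",", "die", "er",
          "für", "seine", "Untat", "bekam")


def pos_layer(tags):
    """Build a one-token-per-span layer from a list of tags."""
    return [Span("pos", i, i, tag) for i, tag in enumerate(tags)]


def overlaps(a, b):
    """True if the two spans share at least one token position."""
    return a.start <= b.end and b.start <= a.end


def select(spans, layer, value):
    """All spans of a given layer carrying a given value."""
    return [s for s in spans if s.layer == layer and s.value == value]


def co_occurring(spans, layer_a, value_a, layer_b, value_b):
    """Pairs of spans from two layers that overlap on the token axis."""
    return [(a, b)
            for a in select(spans, layer_a, value_a)
            for b in select(spans, layer_b, value_b)
            if overlaps(a, b)]


# Illustrative tags assigned by hand for this sketch (STTS-style).
annotations = pos_layer(["APPR", "ART", "NN", "$,", "PRELS", "PPER",
                         "APPR", "PPOSAT", "NN", "VVFIN"])
# A later, independent layer: one clause-level span (label "RC" is assumed).
annotations.append(Span("syntax", 4, 9, "RC"))

hits = co_occurring(annotations, "pos", "PRELS", "syntax", "RC")
for pos_span, clause_span in hits:
    clause = " ".join(TOKENS[clause_span.start:clause_span.end + 1])
    print(TOKENS[pos_span.start], "->", clause)
```

Because every layer only refers to token positions, a layer can be added, replaced or dropped without touching the tokens or the other layers, which is the property the standoff architecture is chosen for.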
The idea behind such annotations is that researchers’ analyses of language data should be made explicit within the corpus, allowing them to search for and review occurrences of relevant phenomena, no matter how complex (see Leech 1993, Garside et al. 1997). Each annotation category or combination of categories can in a first approximation be seen as a variable in the sense introduced above: different surface forms or lower levels of annotation are the variants of a variable. In this sense, linguistic developments between language stages in a comparable diachronic corpus are already coded in the data itself. The normalized frequency of each phenomenon in each stage can then be extracted and compared. Once frequencies for a phenomenon have been collected in each comparable subcorpus, standard statistical tests such as the chi-square test or a test of equal proportions can be used to evaluate whether there is a significant


Anke LÜDELING, Hagen HIRSCHMANN and Amir ZELDES

Table 1. Comparison of part of speech frequencies in the subcorpora. Underuse and overuse are marked with arrows and progressively deeper shades for stronger deviations with respect to NHG.

POS     OHG         MHG         ENHG          NHG
PDAT    ▲0.046131   ▲0.011679   ▼0.007105     0.008954
PPER    ▲0.083545   ▼0.052759   ▲0.075916     0.075825
ART     0           ▲0.07934    ▲0.065445     0.061835
VVINF   ▼0.01126    ▼0.015707   ▼0.018325     0.022104
PRELS   ▼0.009444   ▼0.011679   ▼0.013837     0.016788
VAFIN   ▼0.03705    ▼0.035038   ▲0.04786836   0.045887
VAINF   ▼0.001453   ▼0.001208   ▲0.00411369   0.003078

deviation between the respective data samples or to compute a statistical model of the development. In the case of language data a particularly high significance is expected, since the assumption of statistical independence between linguistic phenomena in a text is not granted and since the usually large sample size (thousands or even millions of words) makes even small deviations in frequency appear significant (see Kilgarriff 2001; Evert 2005, 2006). In historical corpora, the amount of data is often quite small, as is the case here, though as we will show below, even small corpora can yield interesting quantitative results given appropriate annotation layers. In order to detect a change phenomenon we compare the frequencies of a variant of a given variable across the language stages. We choose one language stage (here New High German) as the reference frequency and calculate how frequencies in the other language stages differ from this. Here we use the terms overuse and underuse to describe the deviations.10 Since we do not know in advance which variants of a given variable are more or less widespread in each stage, we can initially test all variables in the corpus as an exploratory diagnostic to find the most extreme cases of overuse or underuse. For example, Table 1 shows normalized frequencies for several part-of-speech categories in the different subcorpora. The older stages’ frequencies are coded with ▲ to signify overuse (higher frequency) and ▼ 10

Overuse and underuse are defined as statistically significant deviations in frequency as compared to another language stage or stages serving as a control population. This strategy has been employed especially in contrastive interlanguage analysis (CIA), a paradigm comparing texts from language learners with different native tongues and native speakers (see Selinker 1972, Granger/Hung/Petch-Tyson 2002). The post-hoc nature of underuse/overuse diagnostics means that their results are not as compelling as pre-hoc hypothesis testing, but ideally results from such studies can then be tested in further data sets (for an underuse study of learner German along these lines see Zeldes/Lüdeling/Hirschmann 2008).



to signify underuse (lower frequency) with respect to the NHG subcorpus; the depth of the shading in each cell signifies the extent of the deviation. The same information is coded for each word and each category in the corpus itself so that it can be searched for in conjunction with other annotations. This information can be used as a diagnostic for finding interesting change candidates. Wherever we find a word or an annotation category that displays a uniform change pattern (all underuse or overuse, and a deep shade leading to a light shade across time) we can suspect that there could be a uniform, possibly ongoing, change. In Table 1 the categories VVINF (infinitive verbs) and PRELS (relativizers) show such a pattern. The seemingly gradient change of PRELS is directly related to the object of our case study of relative clauses and forms the starting point for our study in the following section.

However, these observations do not yet supply an interpretation of the data. To explain the development of one variant we must understand the variants with which it competes. Table 1 also shows the frequencies of articles (ART), which are rather frequent in NHG but not present in OHG.11 Like many of the older Indo-European languages, German developed a definite article from its demonstrative stem d- (akin to Eng. th- in the and this) and an indefinite article from the numeral ein- ‘one’ (on the development of the German articles see Oubouzar 1992). In OHG, these forms are only just forming, with many nominal phrases having no article where one would be expected in NHG (1), while other cases have a corresponding demonstrative (with the tag PDAT) which can still be interpreted as such (2):

(1) Hench 1890, ch. I, line 18 12
Mannes sunu habet gauualt in herdhu za forlazanne suntea
man’s son has power in earth to forgive sins
the son of man has the power on earth to forgive sins

(2) Hench 1890, ch. I, line 8
enti gasah iesus iro galaupin quhad dem lamin
and saw Jesus his faith said this paralytic
and Jesus saw his faith [and] said to this paralytic

The annotation scheme of the OHG corpus considers all such determiners to be demonstratives when they are present, thus the overuse of the PDAT tag in the OHG column in Table 1 directly expresses researchers’ interpretation of the data. In the MHG and ENHG corpora article use is similar to NHG 11


The situation is more complicated. For this article it suffices to say that the (few) forms that look like articles in OHG are often analyzed as demonstratives. The annotation here follows this analysis.

All citations in Hench 1890 refer to the Gospel According to Matthew in this edition.



(slightly overused), and PDAT also behaves similarly, meaning article use is quantitatively comparable for all these periods.

In other words: The category ART can be interpreted as a variable with several variants (the specific forms of articles in the different language stages) if one wants to see the development of article forms. If one wants to find out how the category ‘article’ evolved one has to assume a more abstract variable (something like ‘pre-nominal determiner’) with the variants ART, PDAT, Ø etc. Because the empty form is one of the variants of this variable, it is not possible to observe this directly from the part-of-speech annotation. It is possible to solve this problem by looking at the syntactic environment for article occurrence. If a category such as this turns out to be interesting in retrospect, it is possible to add an annotation layer especially for it (in our case, since we have syntactic annotation, we do not need to do so, as we could phrase a query for articleless nominal phrases using the syntactic environment).

5. Examining underuse close up: relative clauses

In this section we discuss how the phenomena diagnosed by rough underuse/overuse statistics can be evaluated more precisely using the rich annotation in the DDB corpora. Although the corpora at hand are extremely small for a quantitative study, comparisons with previous work on these phenomena will show the results to be plausible, while at the same time they provide estimates for the relative quantifications of competing variants, charting a gradual development in features sometimes thought to be categorical properties of particular language stages.
Categorically the development of relative clauses in German seems not especially interesting.13 From Old High German (OHG) to New High German (NHG) we find relative clauses in the form in (3) where the relative clause is introduced by a relative pronoun and the word order in the relative clause is that of a subordinate clause (V-final). They are sometimes considered to be the oldest dependent clauses in German (e.g. Schmidt 2004: 235). Some researchers argue that with respect to relative clauses German has not changed.

(3) Acts 1:18 (Vanheiden 2003)
Von der Belohnung, die er für seine Untat bekam, wurde dann in seinem Namen ein Acker gekauft
from the award that he for his misdeed received was then in his name a field bought
From the award that he received for his misdeed a field was bought.

13

For more comprehensive overviews of relative clauses in German see Lehmann (1984), Zifonun (2001), or Pittner (2009).
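The exploratory diagnostic of Section 4 can be retraced with the values printed in Table 1. The sketch below is an illustration under stated assumptions, not the authors' procedure: `direction` only looks at the sign of the deviation from NHG (whereas footnote 10 reserves the terms for statistically significant deviations), `uniform_change` is one possible reading of "a deep shade leading to a light shade across time", and the PRELS counts for ENHG and NHG are reconstructed by multiplying the Table 1 frequencies by the subcorpus sizes (2673 and 3574 tokens). The chi-square statistic is the plain 2x2 Pearson statistic without continuity correction.

```python
# Per-token frequencies copied from Table 1; order: OHG, MHG, ENHG, NHG.
TABLE1 = {
    "PDAT":  (0.046131, 0.011679, 0.007105, 0.008954),
    "PPER":  (0.083545, 0.052759, 0.075916, 0.075825),
    "ART":   (0.0,      0.07934,  0.065445, 0.061835),
    "VVINF": (0.01126,  0.015707, 0.018325, 0.022104),
    "PRELS": (0.009444, 0.011679, 0.013837, 0.016788),
    "VAFIN": (0.03705,  0.035038, 0.04786836, 0.045887),
    "VAINF": (0.001453, 0.001208, 0.00411369, 0.003078),
}


def direction(freq, reference):
    """Sign of the deviation from the reference stage (no significance test)."""
    if freq > reference:
        return "over"
    if freq < reference:
        return "under"
    return "equal"


def uniform_change(row):
    """All older stages deviate in the same direction and the gap to the
    reference stage shrinks from each stage to the next."""
    *older, reference = row
    directions = {direction(f, reference) for f in older}
    if len(directions) != 1 or directions == {"equal"}:
        return False
    gaps = [abs(f - reference) for f in older]
    return all(earlier > later for earlier, later in zip(gaps, gaps[1:]))


def chi_square_2x2(hits_a, size_a, hits_b, size_b):
    """Pearson chi-square for hits vs. non-hits in two samples (1 df)."""
    a, b = hits_a, size_a - hits_a
    c, d = hits_b, size_b - hits_b
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


candidates = [tag for tag, row in TABLE1.items() if uniform_change(row)]

# Assumption: absolute counts recovered as frequency * subcorpus size.
enhg_tokens, nhg_tokens = 2673, 3574
enhg_prels = round(TABLE1["PRELS"][2] * enhg_tokens)
nhg_prels = round(TABLE1["PRELS"][3] * nhg_tokens)
chi2 = chi_square_2x2(enhg_prels, enhg_tokens, nhg_prels, nhg_tokens)
CRITICAL_05 = 3.841  # chi-square critical value for 1 df at alpha = 0.05

print("uniform change candidates:", candidates)
print("ENHG vs NHG PRELS chi-square:", round(chi2, 3), "significant:", chi2 > CRITICAL_05)
```

On the Table 1 values this sign-based check singles out VVINF and PRELS, the two categories named in Section 4. With the reconstructed counts the ENHG/NHG difference for PRELS stays below the 0.05 threshold, which illustrates why the sign of a deviation alone should not be read as underuse in the sense of footnote 10.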



Table 2. Frequencies for PRELS in a clause-based normalization

Subcorpus   PRELS per 100 clauses
OHG         4.62
MHG         10.25
ENHG        12.85
NHG         13.35

Quantitatively, however, as the numbers in Table 1 suggest, there might be an interesting development. The category PRELS gradually increases over time. Does this mean that relative clauses become more frequent? If so, one would expect that they extend their domain over time, either grammatically or functionally. This will be discussed in Section 5.3, but first we need to consider a number of surface properties of relative clauses that might have a bearing on the numbers in Table 1.

5.1. Normalization

The data in Table 1 is normalized per token: The older language stages show an underuse of PRELS (relative pronouns) per token. However, if we are interested in the occurrence of relative clauses in each period and the ways in which they are realized, this is misleading. The token-based normalization may be inappropriate in this case, since it depends on the length of sentences, and not on how many sentences in fact contain relative clauses. Using the syntactic annotation, we can establish how many PRELS appear per 100 clauses (Table 2). Normalized to clauses, PRELS appear with roughly the same frequency from MHG to NHG and only OHG shows a significantly (p[d]
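The effect of the normalization base can be shown with a small sketch. The counts below are invented for illustration and are not taken from the DDB corpora; the two hypothetical stages contain the same number of relative pronouns and clauses and differ only in clause length.

```python
def per_token(hits, n_tokens):
    """Relative frequency of a category per token."""
    return hits / n_tokens


def per_100_clauses(hits, n_clauses):
    """Frequency of a category per 100 clauses."""
    return 100 * hits / n_clauses


# Hypothetical stages: identical PRELS and clause counts, different clause length.
stage_long = {"prels": 20, "clauses": 200, "tokens": 2000}   # 10 tokens per clause
stage_short = {"prels": 20, "clauses": 200, "tokens": 1000}  # 5 tokens per clause

for name, stage in (("long clauses", stage_long), ("short clauses", stage_short)):
    print(name,
          "per token:", per_token(stage["prels"], stage["tokens"]),
          "per 100 clauses:", per_100_clauses(stage["prels"], stage["clauses"]))
```

Per token, the stage with longer clauses appears to use relative pronouns half as often, although both stages have the same number of relative pronouns per clause. This is the kind of distortion that the clause-based figures in Table 2 are meant to avoid.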

Occitan                              French
abas, abitar                         abbé, habiter
pes, se, ser, tela                   poids, soi, soir, toile
lor, prezicador                      leur, prêcheur
festa, terra                         fête, terre
bes, pe                              bien, pied
obra                                 oeuvre
leit, pret, respeit                  lit, prix, répit
heu, meus, neut, peus, queur         hui, muid, nuit, puis, cuir
feulha                               feuille
cera, igleza                         cire, église
chami, chapela                       chemin, chapelain
achabat, nebot, ribeira, saber       achever, neveu, rivière, savoir
chadeira, chadena, donada            chaire, chaîne, donnée

Table 12. Features differentiating Northern Occitan from Southern Occitan

Feature                            Northern Occitan                                     French
Palatalisation of [ka]             chadeira, chadena                                    chaire, chaîne
Palatalisation of [ga]             jauzir                                               jouir
Evolution of [-kt]>[-jt]           fait, dreit, leit                                    fait, droit, lit
Evolution of the suffix -arium     carteir/carter, chalforneir/chauforner               charretier, chaufournier
Evolution of the suffix -ariam     archeira, bareira                                    archère, barrière
Effacement of intervocalic [d]     aerar (=adezar), afiet (=afizet), recrenssa          O.F. aeser, afier, recreance

Variation and Change in the Montferrand Account-books (1259-1367)


These features are consistent with what one would expect in a situation of a dialect continuum langue d’oc => langue d’oil.

7.1. Local (Auvergnat) features

The features listed in Tables 11 and 12 are not subject to significant variation and change within our corpus. However, we find in the corpus a number of variant spellings which show evidence of change.

• levelling of the diphthong [aw]

In medieval Auvergnat the group [a]+[l]+C was undergoing phonetic change: stage 1, vocalisation of pre-consonantal /l/>/w/, and stage 2, levelling of the diphthong /aw/>/o/. Examples: dos (=daus), os (=aus), ozel (=auzel), enpozat (=enpauzat), ozir (=auzir), pozar (=pauzar), trezorer (=trezaurer), Choriac (=Chauriac), Dorat (=Daurat), Joza (=Jauza), Morel (=Maurel). This change can be seen most clearly in the development of enclitic forms comprising the preposition a + plural definite article los. In our corpus we find three variants: als~aus~os, which we represent in Table 13 with the following symbols: ⊗=als, Δ=aus, ♦=os.

Table 13. Levelling: als → aus → os

The spelling als (⊗) predominates in the 13th century. The vocalised variant aus (Δ) takes over in the 14th century, to be gradually supplanted by the levelled variant os (♦).

• diphthongisation of [il]>[jal]

In Auvergnat the vowel [i] generally diphthongises to [ja] before the consonant [l]. Examples: abrial (=abril), cortial (=cortil), datial (=datil), fial (=fil), gential (=gentil), piala (=pila), sevial (=civil), viala (=villa).


Anthony LODGE

Table 14. vila~viala

The form viala emerges quite suddenly at the extreme end of the 13th century, and, after a short period of variation, when the two forms are in competition, the traditional form vila disappears.

• effacement of pre-consonantal [s]

Before voiced consonants the effacement of pre-consonantal [s] gradually spreads as we move from the 13th to the 14th century. Examples: carema (=caresma), dînar (=disnar), eleut (=esleut), emendat (=esmendat), temoyns (=tesmoyns). In Table 15 we can see the chronological distribution of the 4th person endings of the preterite, -esmes~-emes.

Table 15. -esmes~-emes

However, before unvoiced consonants, effacement of pre-consonantal [s] occurs only sporadically, e. g. chatel~chastel, ecriore~escriore, epital~espital, Pachas~Paschas, depens~despens, suggesting a change some distance away from completion.



Table 16. despens~depens

Certain spellings indicate the passage of pre-consonantal [s]>[j] before both voiced and unvoiced consonants, e.g. aitel, arbaleyteyras, areitat, Aygueiparssa, berteicha, careima, deidut, deilhoradas, deilhoranssa, deilocgeront, eimenda, Eipinassa, Eipital, empaiteir, eytatges, peycheir, preit, preita, segreita, trameymes. These forms are narrowly localised in the area around Montferrand and occur towards the end of our corpus.

• palatalisation of word-initial [l] before [i] and [y]

One of the stereotypical characteristics of Auvergnat speech is the unusually high rate of consonant palatalisation. Here I will look solely at palatalisation of word-initial [l] before the high front vowels [i] and [y]. Examples: [l]+[i] deilhoranssa, deslhiberasio, lhi, lhiams, lhiar, Lhimanha, Lhimotgas, lhiorar, relhiar and [l]+[u] lhui, lhugat, lhus.

Table 17. li~lhi

In the case of lhi~li the palatalised variants (lhi) predominate (65%), whereas in the case of lhu-~lu- the situation is reversed, with the palatalised variants occurring in only 30% of cases. However, the distribution of the variants across the text gives no indication of the direction of change.
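The kind of tabulation behind Tables 13 to 17 can be sketched as follows: dated attestations of competing variants are grouped into periods, and the share of each variant per period is computed. The attestations, the 25-year period length and the function name are illustrative assumptions; they are not the Montferrand data.

```python
from collections import Counter, defaultdict


def variant_shares(attestations, period_length=25, origin=1250):
    """Share of each variant per period.

    attestations: iterable of (year, variant) pairs.
    Returns {period_start: {variant: share}} with periods in ascending order.
    """
    counts = defaultdict(Counter)
    for year, variant in attestations:
        period_start = origin + ((year - origin) // period_length) * period_length
        counts[period_start][variant] += 1
    shares = {}
    for period_start in sorted(counts):
        total = sum(counts[period_start].values())
        shares[period_start] = {
            variant: n / total for variant, n in counts[period_start].items()
        }
    return shares


# Invented attestations of the li~lhi variable, for illustration only.
sample = [
    (1260, "li"), (1265, "lhi"), (1270, "lhi"), (1272, "li"),
    (1310, "lhi"), (1315, "lhi"), (1320, "li"), (1322, "lhi"),
]

shares = variant_shares(sample)
for period_start, by_variant in shares.items():
    print(period_start, by_variant)
```

A change in progress shows up as a steady shift of the shares from one period to the next, as with als → aus → os; shares that fluctuate without a trend correspond to the situation described for lhi~li.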



8. Conclusion

In this short paper I have not attempted an exhaustive analysis of linguistic features present in this corpus which were subject to variation and change. However, I hope I have shown that a set of documents of this type has considerable interest for the historical linguist. Plotting linguistic change on the ground involves observing shifts in the quantitative distribution of linguistic variants. The written material presented here does not directly mirror the speech of the writers, but the amount of variation and change we have seen suggests that it was closer to the spoken language of the writers than was to be the case in later documents, in which the consuls had to fall in with externally imposed linguistic norms.

References

Bec, P. 1967. La Langue occitane. Paris: PUF (Que Sais-Je?).
Chambon, J.-P. and P. Olivier. 2000. “L’histoire linguistique de l’Auvergne et du Velay”. Travaux de linguistique et de littérature, t.38. 83-153.
Lodge, R.A. 1985. Le Plus Ancien Registre de comptes des consuls de Montferrand en provençal auvergnat 1259-1272, Clermont-Ferrand, (Mémoires de l’Académie des Sciences, Belles Lettres et Arts de Clermont-Ferrand 49).
Lodge, R.A. 1994. “Okzitanische Skriptaformen II, Auvergne”. Lexikon der Romanistischen Linguistik t.II.2, Holtus, G., M. Metzeltin and C. Schmitt (eds). Tübingen: Niemeyer. 420-424.
Lodge, R.A. 1998. “The consular records of Montferrand (Puy-de-Dôme)”. De Mot en Mot: Aspects of Medieval Linguistics, S. Gregory and D. Trotter (eds). Cardiff: University of Wales Press. 105-125.
Lodge, R.A. 2006. Les Comptes des consuls de Montferrand (1273-1319), (Etudes et Rencontres de l’école des chartes, 23). Paris.
Lodge, R.A. forthcoming. Les Comptes des consuls de Montferrand (1346-1373), (Etudes et Rencontres de l’école des chartes, 23). Paris.
Lodge, R.A. in preparation. Les Comptes des consuls de Montferrand (1378-1386), (Etudes et Rencontres de l’école des chartes, 23). Paris.
Lodge, R.A. 2009. “Le français et l’occitan en Auvergne au XIVe siècle: l’exemple de Montferrand”. Le Français d’un continent à l’autre, Baronian, L. and F. Martineau (eds). Montréal: PUL. 269-289.
Mansfield, C. 2019. “Loceme” (
Paden, W.D. 1998. An Introduction to Old Occitan. New York: The Modern Language Association of America.

The Montferrand Project is supported by a generous grant from the Leverhulme Trust, London, UK.

Cognitive Aspects of Language Evolution and Language Change: The Example of French Historical Texts

Wolfgang RAIBLE

To the memory of Brigitte Schlieben-Lange (1943-2000)

1. A text linguistic approach

There are different kinds of linguistics. The opposition between sentence and text linguistics can define two of them. Text linguistics starts from the assumption that we do not speak in words or sentences, but in utterances, often called texts. This means that when we make use of the admittedly finite means a language offers us, this does not lead to an infinite series of sentences, but of utterances. Thus text linguistics deals with entire texts, where any text belongs to a particular text genre1. Among the text genres, some can be highly demanding, especially written ones; they should be formulated in a way that guarantees a widespread reception in any situation of communication. To the oral genres belong specimens like small talk, a genre whose realisation should be easier. At any rate, this means that when speaking or writing, we have to start from a conception fitting into the appropriate genre, giving it an adequate text-syntactic and semantic realisation leading to an appropriate organisation of the entire utterance.

The concept of genre is dealt with under different terms: in television we tend to call it a ‘format’; in the theory of action, genres come as ‘activity types’ or ‘types of acting’; in literary studies ‘genres’ is the preferred expression. Others use ‘traditions of speaking’ or ‘simple forms’ (André Jolles2) or, for daily communication, ‘communicative genres of everyday life’. A definition of this last term, inspired by anthropological phenomenology, shows what is meant in general: “Communicative genres are considered to be those communicative phenomena that have become socially rooted. Their basic social function consists of alleviating the burden of subordinate (communicative) action problems. Due to the fixed patterns they



A seminal paper in this respect was Daneš 1966. The basic level is syntax; above it comes semantics, and the topmost of the three levels in this approach is the organisation of the utterance. The genres and their cognitive demands have to be placed on top of Daneš’s third level.

Jolles 2006 (8th ed.).


constitute, genres are an orientation framework for the production and reception of communicative actions.3”

The expression ‘traditions of speaking’ or ‘discourse traditions’ clearly manifests the diachronic side of such genres4. Often they have a very long history and are sometimes only slightly, sometimes considerably, modified over time. Witness the example of the juridical judgement: In the Latin tradition the Praetor used to pronounce just one sentence. In principle, this tradition has not changed, e.g., in France, but now the single sentence can go over two or three pages with up to 10 degrees of subordination5. Such examples clearly show the amount of conceptualisation, of cognitive activity, behind the formulation of such texts. At the same time, they show how demanding textual genres can be in terms of the linguistic means we need in order to realise them: without the appropriate subordinating techniques (which, in French and other Romance languages, evolved only over time) such a sophisticated task cannot be fulfilled6. And French judges have to spend a considerable amount of time in training until they eventually master this textual genre. At the same time this means that we do not start from the idea, cherished by generativists, that language as a system exists per se in the heads of all speakers. (a) The possibilities language systems offer us develop over time according to the necessities of the society using the language; Gusiilay, e.g., wouldn’t offer us the possibility of writing a French style judgement7. (b) An individual actively acquires these possibilities only to the extent s/he needs them.

2. The particular evolution of Old French Texts: Telling the truth requires prose

These general considerations may serve as a background for the particular situation in Old French, which will be the object of this article. A characteristic of mediaeval cultures is diglossia: Content is written down in Latin, whereas spoken language makes use of the respective vernaculars. Now Old French literature starts, in specific contexts, with written texts already at the beginning of the 9th century. In the 11th and 12th centuries, these texts become truly abundant. A common denominator of nearly all of them is that they are realised in verse, not in prose. There exists a large number of novels belonging above all to the so-called chansons de geste (decasyllabic verse with assonances; they have no authors) and octosyllabic rhymed romances 3 4 5 6


Knoblauch and Luckmann 2004, 203.
Schlieben-Lange 1983.
Krefeld 1985.
A lot of historical and theoretical research has been done on the evolution of subordinating techniques: Klare 1958, Stempel 1964, Ehrliholzer 1965, Raible 1997 and Raible 2001.
As to this language see: Tendeng 2007.

Cognitive Aspects of Language Evolution and Language Change


(they tend to be linked with the name of an author, the most renowned one being Chrétien de Troyes8). Although these are novels in our eyes, contemporaries regarded the content of these texts as true9. Thus until the end of the 12th century, prose texts remained the domain of Latin. But as a result of a discussion that had lasted for about 25 years, things changed at the beginning of the 13th century. The basic question had been: how can a text written in verse express the truth? It stands to reason that historical persons whose speech is thought to be reported did not speak that way.

3. The first historical texts in Old French deal with the Fourth Crusade

This is why we are by now confronted with an ever-growing number of prose texts written in Old French. Two of the first ones deal with a contemporary event: Two participants tell us what happened in the Fourth Crusade (1202 to 1204). This means that these texts are historical texts in a true sense. They even constitute the most important information we have about this Crusade. The respective authors were Geoffroi de Villehardouin (1160-c. 1212), with De la Conquête de Constantinople (On the Conquest of Constantinople), who was “member of the general staff” of this undertaking, and a simple soldier, Robert de Clari (c. 1170-after 1216), with La Conquête de Constantinople.

Some information about the Crusades, especially the fourth one, is in order here. There were lots of Crusades, each one being, from today’s perspective, totally unnecessary, superfluous, harmful, deleterious and obnoxious. The fourth one was particularly catastrophic, albeit with advantageous economic consequences for Venice.

1st Crusade   1095-1099
2nd Crusade   1147-1149
3rd Crusade   1187-1192
------------ 1200 ------------
4th Crusade   1202-1204
5th Crusade   1217-1221
6th Crusade   1228-1229
7th Crusade   1248-1254
8th Crusade   1270
9th Crusade   1271-1272

8 c. 1140-c. 1190
9 There is no contemporary concept of history in our sense. Witness Geoffrey of Monmouth (c. 1100-c. 1155) with his Historia regum Britanniae (History of the Kings of Britain) that treats the Arthurian legends as belonging to history.



In the case of the Fourth Crusade, a predominantly French affair, the original idea was to transport the army by ship from Venice to Cairo. Some important points in telegraphic style: The crusaders needed about 100 ships for 35,000 persons, including 2000 knights with their horses; the Venetians charged 85,000 marks of silver, but the crusaders could not afford the whole sum; this led to negotiations and a contract: the crusaders had to do some services for the Venetians (the Doge himself took the cross and participated with a Venetian contingent). They had to neutralise the biggest competitor of Venice in Mediterranean trade on the Adriatic Sea, the (Christian) port of Zadar. There was a more subtle strategy as to Constantinople, the biggest competitor of Venice in the Mediterranean: The crusaders should enthrone Alexis as the righteous emperor of Constantinople, a man who lived exiled in Germany and promised a high sum for helping restore him (this money would have more than covered the sum the crusaders owed to Venice); this led to a first conquest of Constantinople. Alexis was enthroned, but failed to keep his promise. As a consequence, the city was conquered for a second time, followed by the sacking of a Christian metropolis by Christian crusaders. This laid the foundation for the domination of the Mediterranean trade by Venice, thus creating its incredible wealth accumulated during the following centuries10.

Thus we have two writing participants of this undertaking, with Villehardouin, as a member of the general staff, being well informed especially about all negotiations, while Robert de Clari has a simpler perspective and knows much less. For both of them, though, a major problem was that there didn’t exist a tradition of writing historical prose texts in Old French. Since the authors of the two texts I am interested in were not learned persons, they were not acquainted with any Latin tradition whatsoever. So what model could they follow?
An easily understandable example for the necessity of genres and their active knowledge are the reports on the conquest of the Americas by European colonists. What we know today rests, as regards the role of Spain, mainly upon reports written by the conquerors themselves. But these reports were justifications addressed to the Spanish Crown (relaciones) showing that all the respective acts complied with Spanish Law. (The conquerors had contracts with the Crown allowing them to take possession of foreign land under well-defined conditions.) Those who wrote these reports admirably mastered the respective text genre (which does not at all imply that they told the truth, quite the contrary). Nevertheless, there exist also reports of simple participants not 10

A late consequence was: Pope John Paul II had to apologise in public, in 2001, during a visit he paid to Greece, for the sacking of Greek orthodox Constantinople by Roman catholic crusaders which had happened almost 800 years before.



at all acquainted with the art of writing understandable texts. An example is Alonso Borregán, a writing clerk. What he describes is certainly authentic. He is even able to write correct sentences; but whoever is able to understand even small text-passages of his Crónica de la conquista del Perú should be awarded a prize. Up to now, any attempt was bound to fail11. While in 16th century Spanish there existed good generic models not mastered by simple scribes, at the beginning of the 13th century even the generic models are lacking in Old French. What happens next is set out in the following two theses:

Thesis I: If in such a situation authors try for the first time to write a historical text in prose, they will use already existing generic models, here romances in verse.

Thesis II: It will still take considerable time until the cognitive and linguistic framework for historical prose proper develops.

4. The structure of romances in verse Chrétien de Troyes was highly popular among his contemporaries and remains known for his well-structured texts. Two modern scholars, Karl D. Uitti and Michelle A. Freeman write: “With Érec and Énide (ca. 1170), a new era opens in the history of European story telling—an era whose effects are still very much with us today. This poem reinvents the genre we call narrative romance; in some important respects it also initiates the vernacular novel.”12

When we intend to tell a certain event to others, we have to bring a bulk of information chunks into a linear order. Where do we have to start? Where do we have to put an end? How many participants of an event should be mentioned in the narrative? What is necessary in the domain of descriptions of people, places, and objects? What can or should be omitted and what not? Chrétien was the first among vernacular storytellers to create an intelligible and seemingly self-evident chronological order in his narrative, resulting in a clear tripartition into beginning, middle and end. Causality plays an important part. Thus this author, apart from writing in verse, could to some extent be a model for historical writing. Typically, a young would-be knight starts from home in quest (quête) of important and marvellous encounters to happen during long travels. After 11 12

The last serious effort: Stoll 1997. Uitti & Freeman. 1995, p. 36.


Wolfgang RAIBLE

having overcome lots of dangerous situations, this chevalier errantt (knighterrant) arrives at his final vocation or destination. Thus the basic structure of such a romance is a chronologically ordered sequence of so-called adventures (avantures). This is what can be adapted in any event by our amateur historians. Villehardouin, one of the leaders, is better acquainted with the tradition of telling romances than the simple soldier Clari. This turns out to be an advantage and a disadvantage at the same time. 5. A comparison between Chrétien ((Percevall) and the two historians In order to show the dependence upon romances in the conceptualisation of actions and their sequence, I will use a corpus of three texts: the last work of Chrétien, the unaccomplished Percevall with its 9100 odd verses, and compare features of this text with the texts of the two historians. Chrétien’s text encompasses 51213 tokens representing 5527 types; for Villehardouin the respective numbers are 46718 and 4426, for Clari 32228 and 3589. This means that the relation between types and tokens is largely equivalent. How can the author of a text make his recipients believe hat he is telling the truth? For chansons de geste and romances en vers this was quite simple: they had to make believe that they depended on a written source (livres ‘book’ and estoire ‘history’). This is true already of the Chanson de Roland, a chanson de geste written down in the context of propaganda for the First Crusade. The author of the Song of Roland d refers six times to a Gesta Francorum that would contain the story he is telling. In one case, he happens to put such a reference into the mouth of a person acting in a battle itself, thus unmasking the device: Turpin could not know what would be written later about the battle he is participating in (Laisse CXI). CXI Franc i unt ferut de coer e de vigur;

CXI The Franks strike on; their hearts are good and stout.

Paien sunt morz a millers e a fuls:

Pagans are slain, a thousandfold, in crowds,

De cent millers n'en poent guarir dous.

Left of five score are not two thousands now.

Dist l’arcevesques: «Nostre hume sunt mult proz: Says the Archbishop: “Our men are very proud, Suz ciel n'ad home plus en ait de meillors.

No man on earth has more nor better found.

Il est escrit en la Geste Francor

In Chronicles of Franks is written down,

Que vassal sont a nostre empereür. »

What vassalage he has, our Emperour.”
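The statement that the relation between types and tokens is largely equivalent in the three texts can be checked with a few lines of code. The following is only a minimal sketch using the counts reported above; the function name `type_token_ratio` is an illustrative assumption, and it should be kept in mind that type/token ratios are sensitive to text length, so the comparison is approximate.

```python
# Token and type counts as reported in the text for the three works.
counts = {
    "Chrétien, Perceval": (51213, 5527),
    "Villehardouin": (46718, 4426),
    "Clari": (32228, 3589),
}


def type_token_ratio(tokens: int, types: int) -> float:
    """Return the share of distinct word forms (types) among running words (tokens)."""
    return types / tokens


# One ratio per text; names are the dictionary keys above.
ratios = {name: type_token_ratio(tok, typ) for name, (tok, typ) in counts.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

All three ratios lie close to 0.1, which is what "largely equivalent" amounts to here.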

In his Perceval, Chrétien refers five times to an estoire as his source, one time to a book (li livres); these sources testify (tesmoigner) or tell (conter, reconter, deviser) what he himself is telling:

2872 3327

»/La senestre, selonc l'estoire,/senefie la vainne gloire

bien m'an remenbre,/et l'estoire ensi le tesmoingne./Chascuns

d'ivoire./Ensi con reconte l'estoire,/ele estoit tote d'une

/Percevax, ce conte l'estoire,/a si perdue la memoire

/furent d'or fin, tesmoing l'estoire./L'une des portes fu d'ivoire

sont voires/tex con li livres les devise,/onques riens¹³

Apart from that, the author refers 13 times to truth (as vérité). Now, as eyewitnesses, Villehardouin and Clari have no source at all they could refer to, so what they tell us is their own story. Nonetheless, borrowing from the discourse tradition of romances in verse, Villehardouin thinks it is unavoidable to refer to such a source:

71 bonnes genz dont li livres ne fait mie mention.//

et maint autre dont li livres ne fait mie mention.//

87 prodome, dont li livres ore se taist.

765 Et tant vos retrait li livres que il ne furent que doze

908 por la honte; si que li livres tesmoigne bien que plus

1028 ne vos contera mie li livres; mais la somme del conseil

genz assez dont li livres ore se taist. En l'ost

Et bien tesmoigne li livres que onques nus n'eschiva

1178 et bien tesmoigne li livres que bien duroit demie lieur

1266 //LXXVII /Or conte li livres une grant merveille: que

1278 mult des autres dont li livres ne parole mie ci. Et li

1284 cinq chevaliers, que li livres ne raconte mie. Et einsi

1294 autre chevalier que li livres ne raconte mie. Maistre

1323 toz les noms raconter li livres. Une des graignors dolors

1375 Et bien tesmoigne li livres que onques à plus grant

Nevertheless, when Villehardouin uses the verb tesmoigner ‘to testify’ (a term we are already familiar with through Chrétien), the alternating but otherwise parallel occurrences show that it is he himself who guarantees the truth; in other words, ‘the book’ refers to himself:

908 951

au/vent. //[120] Et bien tesmoigne Joffrois li mareschaus

grant et merveilleus; et ce tesmoigne Joffrois de Ville-Harduin

mult viguerosement. Et bien tesmoigne Joffrois li mareschaus

honte; si que li livres tesmoigne/bien que plus de la moitié

ou mort ou pris. Et bien tesmoigne li livres que onques

1178 mervoille à regarder; et bien tesmoigne li livres que bien

1190 trové en terre. Et bien tesmoigne Joffrois de Vile-Hardoin

1258 Et bien li portèrent tesmoing cil qui là furent, que

1278 le laissassent mie: et bien tesmoignent cil qui là furent que

1371 qui erent el païs. Et bien tesmoigne Joffrois de Vile-Hardoin

mult bien. Et bien tesmoigne li livres que onques

¹³ For the technical and numeric side, Simple Concordance Program 4.0.9 for Mac was used.
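The listings in this section are keyword-in-context (KWIC) lines of the kind a concordance program produces. The following is only a minimal sketch of that technique, not the program actually used for this study; the function name `kwic`, the `width` parameter and the bracketed keyword are illustrative assumptions, and the sample string simply joins two of the hits quoted above.

```python
import re


def kwic(text: str, keyword: str, width: int = 30) -> list[str]:
    """Return one keyword-in-context line per occurrence of `keyword`."""
    lines = []
    # Whole-word, case-insensitive match; re.escape guards special characters.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the keywords line up in a column.
        lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return lines


# Sample built from two hits quoted in the text (hypothetical mini-corpus).
sample = ("Et bien tesmoigne li livres que onques nus n'eschiva. "
          "Et bien tesmoigne Joffrois li mareschaus")
hits = kwic(sample, "tesmoigne", width=12)
for line in hits:
    print(line)
```

Because the match is restricted to whole words, a search for tesmoigne does not return tesmoignent or tesmoing; a real concordance run would have to query each spelling variant separately.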

Less familiar with the respective discourse tradition, Clari instead uses istoire or estoire as a designation of his own text:

5 Robert de Clari//Istoire de ceus qui conquisent


//Ci commence l'istoire de ceus qui conquisent

85 si comme avés oï en l'istoire devant, pour ce qu'il eussent

At the end of his text, he is as explicit as one can be in this respect:

Ore avez oï la verité, comme faitement Constantinople fu conquise; et comme li cuens de Flandres Baudouins en fu empereres, et messire Henri ses freres après; que cil qui y fu et qui le vit, et qui l'oï, le tesmoigne, ROBERS DE CLARI li chevaliers, et a fait metre en escrit la verité si comme ele fu conquise. Et ja soit ce que il ne l'ait si belement contée comme maint bon diteur l'eussent contée, et assez de verités en a tues qu'il ne put mie toutes remembrer.

Now have ye heard the truth, in what manner Constantinople was conquered, and in what way Count Baldwin of Flanders became emperor thereof, and my Lord Henry his brother after him; for he who was there and who saw these things and who heard the testimony thereof, Robert of Clari, Knight, hath also caused the truth to be put down in writing, how the city was conquered; and albeit he may not have recounted the conquest in as fair a fashion as many a good chronicler would have recounted it, yet hath he at all times recounted the strict truth; and many true things hath he left untold, because, in sooth, he cannot remember them all.

With “Now have ye heard the truth ...” he nonetheless remains in the overall style of verse romances, referring to his public as hearers. We find this in Perceval as well as in Villehardouin’s text. Villehardouin uses ‘to hear’ not only as a back reference, but also when speaking of events he is going to tell us:

115 li six message com vos avez oï, et pristrent conseil

There are 22 more similar cases of ‘as you have heard’. Now the forward references:

528 aprocha.//XV. [070] Or oïez une des plus granz merveilles

1208 com vos avez oï. Or oïez se ceste gent devoient terre

1238 que il fu toz esmiez.//Or oïez une grant merveille: que en

Besides using the same perspective of recipients hearing his text, Clari is more modern because he also refers to himself as the speaker addressing his public (si vous dirons, ‘and we will tell you’):

8 Constantinople; si vous dirons après qui il furent et par

63 et de l'estoire, si vous dirons de ce vaslet et de l'empereeur

85 de Constantinople. Or vous dirons de cel enfant et des croisiés


ester de l'estoire; si vous dirons le mesfait dont li marquis


guerdon, si comme nous vous dirons après. Or avint, après que


leur chevauchiée. Si vous dirons ce qu'il font. Chascuns

193 à roi.//LXVI.//Or vous dirons d'une autre aventure qui

240 chier, si comme nous vous dirons après.//Si envoierent saisir

245 males voies comme nous vous dirons après.//LXXXII.//Quant la

250 en l'autre une toile: si vous dirons dont cil saintuaire estoien

Clari is even able to mark the beginning of a digression, again using estoire for his own text:

63 XVIII.//Or vous lairons ci ester des pelerins et de l'estoire (For the moment we will let the pilgrims and the story rest here…)

95 fait. Or vous lairons ici ester de l'estoire; si vous dirons

275 y sont, vous lairrons nous ester à dire. Car nus hom terriens

When trying to make large texts intelligible, an important task is accentuating certain passages (or rhematising them, to use a term belonging to the level of utterance organisation). One of the means employed by Chrétien is the imperative sachiez que, ‘let it be known to you’. Chrétien does not use it very frequently, though: only 18 times in the 9100-odd verses of his Perceval. Here are some examples:

1946 ice puis je bien afichier./Et sachiez que je sui sa niece,/mes

2169 vostre aaiges/n'est tex, ce sachiez de seür,/que vos a chevalier

/mes il reperdront, ce sachiez./Les ialz amedeus me sachiez

…

2941 a mout grant musardie;/et sachiez que par coardie/nel lait

2997 en avant tanra/la terre, ce sachiez de fi,/et se ele est morte


Villehardouin, by contrast, uses this device abundantly: 94 times, typically as “et sachiez que”. Some cases:

354 li pelerin de lor païs. Et sachiez que mainte lerme i/fu plorée

tant de belles. //[076] Et sachiez que il portèrent es nefs

le/departirent tote-voie. Et sachiez que ce fu la plus granz

728 volez asseurer devers/vos. Et sachiez que si haute convenance

775 /contre le roi de Hongrie. Et sachiez que li cuer des genz ne

1021 les autres ere soveraine. Et sachiez que il n'i ot/si hardi
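Since the two texts differ in length, the raw counts for sachiez (18 in Chrétien, 94 in Villehardouin) are easier to compare once they are related to the token totals given in section 5. The following is a minimal sketch of such a normalisation; the function name `per_thousand` and the choice of hits per 1000 tokens as the unit are illustrative assumptions.

```python
# Token totals (section 5) and sachiez counts (this section), as reported in the text.
tokens = {"Chrétien": 51213, "Villehardouin": 46718}
sachiez = {"Chrétien": 18, "Villehardouin": 94}


def per_thousand(hits: int, total_tokens: int) -> float:
    """Return the relative frequency of a form as hits per 1000 running words."""
    return 1000 * hits / total_tokens


rates = {name: per_thousand(sachiez[name], tokens[name]) for name in tokens}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} per 1000 tokens")

# How many times more frequent the device is in Villehardouin than in Chrétien.
factor = rates["Villehardouin"] / rates["Chrétien"]
```

On these figures the device is between five and six times as frequent in Villehardouin as in Chrétien, which supports the qualification "abundantly".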

How are actions and events themselves conceptualised? As we know, the heroes of verse romances encounter adventures whose nature is often marvellous. As a consequence, in his Perceval Chrétien uses avanture 25 times and the negative term mesavanture twice, while mervoilles, marvellous things, happen 34 times. Villehardouin, for his part, qualifies 25 events as aventures; in keeping with the nature of the Fourth Crusade, 15 of these are even characterised as mesaventures. At the same time, there are 23 cases of marvellous things, both terms occurring in passages like “Or oïez une des plus granz merveilles et des greignors aventures”, ‘Now hear (another rhematising device he often uses) one of the biggest marvels and of the greatest adventures’. Even Clari resorts to this same conceptualisation: there are 18 cases of mervoille, 6 of them in the syntagm une fine mervoille. In his case, avanture is less frequent; he prefers the etymologically related verb avint que (‘it happened that’, 24 cases), liked also by Villehardouin (29 tokens). This clumsy construction, familiar to those who know the style of the New Testament (‘Now it happened that Jesus entered into Jerusalem’ instead of ‘Jesus entered ...’), is extremely rare in Chrétien (only 8 occurrences in 9100 verses).

6. Clause linking

In contrast to, for instance, the chanson de geste, clause linking (by both coordinating and subordinating techniques) is well established in Chrétien.¹⁴ As may be expected, the most frequent relations are those of causality; generally speaking, there is also a lot of argumentation in Chrétien’s dialogues. As regards the sequence of events and actions in time, one would be inclined to think that a simple instrument could be the conjunction when. Throughout the text of Chrétien, though, we have only 110 cases of linking with quant ‘when’, cases where generally a long sequence of foregoing propositions is resumed: for example, about 40 verses describing a young lady (pucele) are followed in verse 1873 by “Et quant li chevaliers la voit”, ‘and when the knight saw her ...’. In Chrétien’s Perceval, only 27 when-occurrences open a new sentence. In the text of Clari, by contrast, quant reappears with high frequency and in a rather clumsy manner: he uses quant 363 times, predominantly in the scheme “X happened. When X had happened”, with the next proposition repeating part of the foregoing one: “[... the Doge] prendroit il la croix. Et quant il l'eut prise, si li donna ... ”, ‘... the Doge would take the cross. And when he had taken it ...’. Villehardouin uses this inelegant technique in a much less obstinate manner. Of his mere 154 instances of quant, only a small part is of this nature with its somewhat pedantic effect.

As has been said, causal linking is frequent in Chrétien’s narrative. In his Perceval, we encounter 46 tokens of coordinating car and 28 tokens of subordinating por ce que. There seem to be metrical reasons for the distribution: 24 of the 28 por ce que open a new verse, whereas this holds for only 28 of the 46 car tokens. In Villehardouin linking by a subordinate clause is nearly as frequent as coordination: 22 tokens of porce que against 28 of car. Clari clearly prefers coordination: 47 car against 6 pour ce que.

¹⁴ There is a seminal work on this topic cited already above, Stempel 1964.

7. First thesis appears plausible

Thesis I was: If in a situation as initially described authors try for the first time to write a historical text in prose, they will use already existing generic models, the most appropriate here being romances in verse. This thesis was confirmed as regards the overall conceptualisation of the events described in both historical texts: a sequence of marvellous events qualified as adventures, chronologically well ordered and linked, not least by introducing explicit causal relations besides the temporal ones. The author is supposed to rely upon a written source guaranteeing the truth; he addresses his recipients as hearers, using rhematising techniques reflecting the performativity, i.e., the situation of reception, of the said romances: formulae like sachiez que or oïez as well as direct address to the recipients (vos avez oï, vos dirons après ...). There was a difference in degree, though, between the two authors, with Villehardouin showing a stronger influence of the romance model; the text references of the author who served as a simple soldier are more innovative, too, not least in the way he signals digressions. Here Chrétien could not serve as a good model, and Villehardouin does not succeed in marking digressions well. On the other hand, Clari is quite clumsy in signalling time sequence, using the when-technique in an obstinate manner. Typical, too, is the considerable use made of direct speech in Chrétien and in the texts of the historians. If we compare the accomplishments of our authors with those of later historians, the biggest difference lies in the increasing awareness the authors have of the nature of their activity: What does ‘writing history’ mean?


Witness already the verbs they use for writing. For Chrétien it is to tell: he (or his estoire) has something to tell us:

68 peinne/a rimoier le meillor conte,/par le comandement le conte

69 comandement le conte,/qui soit contez an cort

… …

3325 une lee table d'ivoire. Ensi con reconte l'estoire, ele estoit tote d'une


de Perceval./Percevax, ce conte l'estoire,/a si perdue la


/Tex fu li liz, qui voir an conte,/c'onques ne por roi ne por

The same thing holds for the two historians: they (or their “source”) tell. Modern historians certainly would not like to see their writings qualified as tales.

Villehardouin

232 et belles ne vos puis tout/raconter; [pmossopm in h MS] mais


toz les morz ne vos pui mie raconter; mais ainz que li estorz


que li livres ne raconte mie. Et einsi en vinrent


que li livres ne raconte mie. Maistre Pierres de

1323 pris ne vos puet toz les noms raconter li livres. Une des

Clari: (ra)conter

253 vous sauroit mie descrire ne raconter. En cel palais on trouva

263 que on ne vous sauroit mie raconter la noblesse et la richesse

107 on.//XXXIX./Or vous avons conté le meffait dont li marquis

127 si comme je vous ai ci conté, li pelerin d'autre part


//CVI.//Or avions oublié à conter une aventure qui avint à


que il ne l'ait si belement contée comme maint bon diteur l'eussent


maint bon diteur l'eussent contée, et assez de verités en

Later on, historians know that they select their matter (recueillier, tirer, extraire), put pieces together (compiler, joindre, composer, assembler, copuler), put it into a linear order (ordonner, mettre en ordre), couch it in written form (couchier par écrit), redact it (rédiger).¹⁵ It is quite interesting for us to see that terms which would seem self-evident in the context of historical writing appear only much later in written texts (source: Trésor de la langue française électronique):

1220 relation        1283 coucher par écrit   1377 énoncer
1225 paragraphe      1290 extrait             1410 tirer
1230 compiler        1294 assertion           1455 rédiger
1230 constituer      1300 traité              1461 événement
1268 convention      1330 copuler             1498 récit
1272 effet           1362 relater
1275 affirmation     1370 résumer

¹⁵ Cf. Schlieben-Lange 1983: 154sqq.


That the first attestation of the neutral term événement, to us the most natural one, dates only from the second half of the 15th century (Engl. event, coming from the same source, is even later) will certainly come as a surprise to most of us.

8. What about the second thesis?

Thus, history becomes more than telling a story. It requires intellectual efforts on the part of the authors, resulting in a more abstract vocabulary, reflected, among other things, by the above list of verbs and nouns. It is reflected, on the other hand, by a far more intricate syntax with subordinate clauses serving as a grounding technique for the information in the main sentence. Essentially, this evolution starts with Jean Froissart (c. 1337-c. 1406). This topic has been admirably treated in a series of papers by Peter Blumenthal, who, by the way, made us rediscover Voltaire as a most remarkable historian.¹⁶ As a rule, introducing concepts of causality, of argumentation and valorisation makes these texts on historical events more easily understandable. But there is still a long way to go until, e.g., in the second half of the 19th century Johann Gustav Droysen publishes his Outline of the principles of history, i.e., a kind of metahistory, or Fernand Braudel invents his conception of a history beyond simple chronological order, leading to three different layers of time, aptly adapted, e.g., by Michel Foucault, who does not even bother to cite him.¹⁷ It is quite interesting that after a series of mental efforts going on for centuries, one can arrive at an insight formulated eventually by the German critic Theodor Lessing (1872-1933) in the title of a book: Writing history as giving sense to what has no sense.¹⁸ And modern literary critics like Hayden White insist on the fact that, despite telling us the contrary, historians continue using devices of literary texts when writing history.¹⁹

¹⁶ Blumenthal 1990, 1992, 1994, 2000.
¹⁷ Droysen 1882. Braudel 1949. Foucault 1969.
¹⁸ Lessing ⁴1927.
¹⁹ White 1973, 1987. In the same spirit: Koselleck & Stempel eds. 1973.


At least French historical writing started from literary text models, and even modern historians cannot but resort to them at least in part: this is the heritage of telling stories (generally speaking, of narrative) inherent in history. Modern historians need techniques of thematising and rhematising, too. Instead of “let it be known to you” and the like they use subordinating or clefting techniques, the interplay between abstraction and concretisation, the technique of resuming, etc. Nevertheless, we see that Thesis II is plausible, too: it took considerable time until the cognitive and linguistic framework for historical prose proper developed, to the extent that eventually even the critique of history, metahistory, became possible.

References

Blumenthal, Peter. 1990. “Textorganisation im Französischen: vom Mittelalter zur Klassik”. Zeitschrift für französische Sprache und Literatur 100. 25-60.
Blumenthal, Peter. 1992. “Zum Stil moderner Geschichtsschreibung”. Le français aujourd'hui, une langue à comprendre - französisch heute. Mélanges offerts à J. Olbert, Gilles Dorion et al. (eds). Frankfurt: Diesterweg. 171-181.
Blumenthal, Peter. 1994. “Schémas de cohésion et causalité dans Braudel: La Méditerranée”. Le français moderne 63. 1-19.
Blumenthal, Peter. 2000. “Textlinguistik und Geschichtswissenschaft”. Text- und Gesprächslinguistik: ein internationales Handbuch zeitgenössischer Forschung. Linguistics of text and conversation, Brinker, Klaus et al. (eds). (Handbücher zur Sprach- und Kommunikationswissenschaft; Bd. 16). Berlin; New York: de Gruyter. 797-803.
Braudel, Fernand. 1949. La Méditerranée et le Monde Méditerranéen à l'époque de Philippe II. Paris: Armand Colin. English translation: The Mediterranean and the Mediterranean World in the Age of Philip II. Berkeley: University of California Press. 1996.
Daneš, František. 1966. “A three level approach to syntax”. Travaux linguistiques de Prague 1. 225-240.
Droysen, Johann Gustav. 1882. Grundriss der Historik. 3te umgearbeitete Auflage. Leipzig: Veit. English translation: Outline of the principles of history. Boston: Ginn & Company. 1893.
Ehrliholzer, Hans-Peter. 1965. Der sprachliche Ausdruck der Kausalität im Altitalienischen. Winterthur: P.G. Keller.
Foucault, Michel. 1969. L’archéologie du savoir. (Bibliothèque des sciences humaines). Paris: Gallimard.
Jolles, André. ⁸2006. Einfache Formen. Legende, Sage, Mythe, Rätsel, Spruch, Kasus, Memorabile, Märchen, Witz. Tübingen: Niemeyer.
Klare, Johannes. 1958. Entstehung und Entwicklung der konzessiven Konjunktionen im Französischen. (Veröffentlichungen des Instituts für Romanische Sprachwissenschaften Nr. 13). Berlin: Akademie-Verlag.
Knoblauch, Hubert and Thomas Luckmann. 2004. “Genre analysis”. A companion to qualitative research, Uwe Flick et al. (eds). London etc.: Sage. 303-307.
Koselleck, Reinhart and Stempel, Wolf-Dieter (eds). 1973. Geschichte Ereignis und Erzählung. München: Fink.
Krefeld, Thomas. 1985. Das französische Gerichtsurteil in linguistischer Sicht. Zwischen Fach- und Standessprache. Frankfurt am Main, etc.: Lang.
Lessing, Theodor. ⁴1927. Geschichte als Sinngebung des Sinnlosen: Oder die Geburt der Geschichte aus dem Mythos. 4., völlig umgearb. Aufl. Leipzig: Reinicke.
Raible, Wolfgang. 1997. “Die Bildung neuer Konjunktionen. Ein Rückblick auf die Dissertation von Johannes Klare”. Studia historica in honorem Johannes Klare, Huberty, Maren and Perlick, Claudia (eds). (Abhandlungen zur Sprache und Literatur 90). Bonn: Romanistischer Verlag. 41-59.
Raible, Wolfgang. 2001. “Linking clauses”. Language Typology and Language Universals - Sprachtypologie und sprachliche Universalien - La Typologie des langues et les universaux linguistiques. An International Handbook, Haspelmath, Martin & König, Ekkehard & Oesterreicher, Wulf & Raible, Wolfgang (eds). (Handbücher zur Sprach- und Kommunikationswissenschaft 20(1)). Berlin; New York: de Gruyter. 590-617 (article 45).
Schlieben-Lange, Brigitte. 1983. Traditionen des Sprechens: Elemente einer pragmatischen Sprachgeschichtsschreibung. Stuttgart: Kohlhammer.
Stempel, Wolf-Dieter. 1964. Untersuchungen zur Satzverknüpfung im Altfranzösischen. (Archiv f. d. Studium d. neueren Sprachen u. Literaturen. Beih. 1.). Braunschweig: Westermann.
Stoll, Eva. 1997. Konquistadoren als Historiographen. Diskurstraditionelle und textpragmatische Aspekte in Texten von Francisco de Jerez, Diego de Trujillo, Pedro Pizarro und Alonso Borregán. Tübingen: Narr.
Tendeng, Odile. 2007. Le Gusiilay: un essai de systématisation: Une contribution à l'étude du Jóola. Bern; Berlin etc.: Lang.
Uitti, Karl D. and Michelle A. Freeman. 1995. Chrétien de Troyes revisited. New York: Twayne.
White, Hayden. 1973. Metahistory: The Historical Imagination in Nineteenth-Century Europe. Baltimore: The Johns Hopkins University Press.
White, Hayden. 1987. The Content of the Form: Narrative Discourse and Historical Representation. Baltimore: The Johns Hopkins University Press.

The Importance of Diasystematic Parameters in Studying the History of French Lene SCHØSLER

1. Introduction

This paper proposes that hypotheses in diachronic linguistics can be confirmed or dismissed by means of corpora. In order to do this, corpora must be composed in such a way that they permit an exploration and testing of the relevance of parameters for language change. The hypotheses under investigation concern the nature of language change and the appropriate models of change. The example of language change chosen to illustrate my topic is the creation of the composed past from the Latin present form habeo litteras scriptas, lit. ‘I have letters [that have been] written’. The main changes from the Latin present form to modern Romance, e.g. the Modern French composed past j’ai écrit les lettres, ‘I have written the letters’, are well known, but several intriguing questions are still unanswered. The first two questions are the basic ones; the latter two are more closely linked to understanding the data:

• Which is the function of the composed past in the old texts: is it a present or a past form?
• Which are the phases of change?
• How does epic tense switching conform to analyses of the composed past?
• How may we explain conflicting evidence in the old texts?

It will be shown that diasystematic parameters and the use of corpora permit answers to the questions under investigation. The paper is organised in the following way: Section 2 presents the models of change to be investigated. Section 3 introduces the research questions and methodology. The empirical data are presented and analysed in sections 4 and 5 and provide arguments for discussing the research questions. Section 6 sums up arguments concerning the relevance of the diasystematic parameters and section 7 contains my conclusion.

2. The model of change

The model of language change accepted here has been expressed most clearly by Andersen (2008: 32) in terms of repeated cycles of innovation and actualisation of innovative speech in the community. Here the focus will be on the processes of reanalysis (i.e. innovation) and actualisation in relation to the creation of the composed past, as these have so far been neither investigated nor illustrated in texts. It is hypothesised that changes are always textually manifested in synchronic variation. The investigation of synchronic variation will be related to the actualisation process and to well-known diasystematic parameters, i.e. to diachronic, diatopic, diastratic, diaphasic and diamesic variation (see Völker 2007: 209). The changes will be related to the analyses of the tense system proposed by Benveniste in 1959 (reprinted in Benveniste 1966) and later elaborated by Weinrich (1973).

3. Research questions and methodology

Language change is generally formalised according to the simple model A>{A,B}>B. This model may capture different types of change, both replacement of one form (A) by a new form (B) and change of meaning without change of forms, i.e. meaning A being replaced by meaning B.¹ Concerning the past tenses, both types of change are found in Romance languages. We find the original simple past which, in most Romance languages, has been replaced or is currently in the process of being replaced by the composed past, a process which is more or less advanced in the individual languages. This change is exemplified in Modern French and presented in Table 1. As indicated, the change conforms to the simple model: replacement of forms. Firstly, the form cantavi is used (A). Then a new form (B) is introduced (habeo cantatum). The coexistence of A and B in the function of past tense manifests itself as synchronic variation in texts. Later, the form B replaces A, which tends to disappear.

Table 1. Past tenses, from Latin to Modern French

stage of change | form(s) of the past tense
A: Latin | cantavi ‘I sang’
A, B: Early stages of Romance languages | habeo cantatum (have-PRES-1SG sing-PAST-PART); cantavi
A, B: Modern French (standard, written) | j’ai chanté ‘I sang’; je chantai ‘I sang’
B: Modern French (spoken) | j’ai chanté ‘I sang’

¹ In Hopper & Traugott (1993:36) this model of changes is used for change of forms. In Traugott & Dasher (2005:12) it is used for the representation of semantic change. In the following, I will use the term ‘meaning’ as short for ‘meaning and/or function’.
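The model A>{A,B}>B makes a checkable prediction about any series of stages: stages with only the old variant come first, stages with both variants follow, and stages with only the new variant come last. The following is a minimal sketch of that check; the names `stage_rank` and `follows_model` are illustrative assumptions, and the coding of the stages of Table 1 (A = simple past, B = composed past) is done by hand here rather than derived from a corpus.

```python
from typing import Iterable

# Position of each possible set of variants within the model A > {A,B} > B.
RANK = {frozenset({"A"}): 0, frozenset({"A", "B"}): 1, frozenset({"B"}): 2}


def stage_rank(variants: Iterable[str]) -> int:
    """Return 0 for A only, 1 for coexistence of A and B, 2 for B only."""
    return RANK[frozenset(variants)]


def follows_model(stages: Iterable[Iterable[str]]) -> bool:
    """Check that a chronological series of stages never moves backwards in the model."""
    ranks = [stage_rank(variants) for variants in stages]
    return ranks == sorted(ranks)


# Hand-coded reading of Table 1: A = simple past, B = composed past.
table1 = {
    "Latin": {"A"},
    "Early stages of Romance languages": {"A", "B"},
    "Modern French (standard, written)": {"A", "B"},
    "Modern French (spoken)": {"B"},
}

conforms = follows_model(table1.values())
```

A corpus study would replace the hand-coded sets with the variants actually attested in each period or text type, which is where the diasystematic parameters come in.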

The Importance of Diasystematic Parameters in Studying the History of French


The prerequisite of substituting the simple past with the composedd past, as shown in Table 1, is that the two forms have comparable temporal meaning. This implies another change; that of the meaning g of the composed form, from a present to a past tense. In Table 2 this change is presented according to the model of change shown in Table 1, and it is broken down into four phases. Although the general result of the change is well known, this manner of presentation is not generally accepted—because no consensus has so far been reached on this point. Instead, Table 2 represents the working hypothesis of this paper, which will be tested on corpora. Table 2. Development of the composed past tense: four phases Phases example form phase 1 habeo litteras form of the verb habeo with a direct object (=A) scriptas (litteras) and an object predicate (scriptas), free word order, S1 (of habeo) can, but must not be identical to S2 (of scribo)2 phase 2 habeo litteras the verb habeo/avoirr takes the function of (=A,B) scriptas/j’ai an auxiliary and écrire is the main verb écrit(es) les with a direct object, possibly concord of the lettres participle, free word order, S1 (of habeo) must be identical to S2 (of scribo) form of the verb écrire with a direct object, phase 3 j’ai écrit(es) (=A,B) les lettres free word order, possibly concord between object and participle phase 4 j’ai écrit les (B) lettres

meaning present tense

accomplished present

perfectum praesens, i.e. a (recent) past, which is relevant for the present form of the verb écrire with a direct object, perfectum fixed word order, standard rules of concord historicum, i.e. a between ante-posed object and participle past without relevance for the present

This paper focuses on the change of meaning in the composed form. According to the model of Table 1, this must be interpreted in the following way: Firstly, the form is used with a specific meaning in a specific context (A). Then a new meaning is introduced and found in new contexts; (B), later also spreading to the contexts of A. The coexistence of A and B in the same contexts manifests itself as synchronic variation in meaning. Finally, the meaning A eventually disappears and only B persists. When a new meaning arises, i.e. a new interpretation of the same string is found, a reanalysis has taken place. I here use the term reanalysis in accordance with the research 2

Please note that the use of must in phase 2: “S1 must not be identical to S2” is used in the sense that identity of subjects is possible, but not obligatory.



tradition found in Andersen (1973), i.e. a change in language structure, which happens by abduction and is motivated by the ambiguity of the expression produced by speakers.3 If we want to understand the language change under investigation, we must first determine the critical context that offers two possible analyses of the same expression.4 Once the reanalysis has taken place, there is a spread in usage of the reanalysed expression. The spread of meaning B is labelled actualisation. The situation of co-occurrence of A and B is labelled layering.5 According to Andersen (2001), it is hypothesised that the actualisation progresses in a systematic way, which manifests itself as predictable variation in corpora. It is my intention to show that this variation conforms to diasystematic parameters.

In accordance with the preceding remarks, the corpus analyses will address the following questions:

• Which are the critical contexts for each of the reanalyses?
• How does the subsequent actualisation proceed?
• Is the simple model A>{A,B}>B the relevant model of change?

The discussion of these methodological questions will permit me to investigate the first two research questions; I recall that these concern the function of the composed form in the old texts and the appropriateness of the four phases of change proposed.

4. The creation of the composed past

4.1. Phase 1–phase 2 transition, first reanalysis

The point of departure (phase 1, in Table 2) is the free combination of the verb habeo with a direct object (litteras) and an object predicate (scriptas), which is the past participle passive, in the form of the object (feminine, accusative, plural form). The word order is free. The subject of habeo (S1) need not be identical to that of scribo (S2): identity of subjects is possible, but not obligatory. According to e.g. Rubenbauer & Hofmann (1958: 193) and Thielmann (1885: 541), a use of the construction habeo litteras scriptas announcing the Romance accomplished present form is already found in Late Latin, but disappears and is not found again before the 6th century with Grégoire de Tours.

3 See also the presentation of this term in Hopper & Traugott (1993: 32-62).
4 Critical contexts are defined as contexts which are semantically and structurally ambiguous. See Diewald (2002).
5 See also Traugott & Dasher (2005: 12), who define layering as “coexistence of variants”.

The Importance of Diasystematic Parameters in Studying the History of French


The earliest French texts provide a few examples that represent phase 1. At the same time they are examples of critical contexts, i.e. of examples permitting two analyses, see (1-2):

1. Saint Léger v. 125: son quev que il a coronat (‘his head which he has tonsured’)
2. Alexis v. 116: Si at li emfes sa tendra carn mudede (‘At that point has the young man his gentle flesh changed’)

The original analysis of the construction found in e.g. (1), is that Saint Léger has his head, meaning his hair, in the way that monks normally have, i.e. tonsured. Now, the ambiguity of this example concerns who has tonsured Léger. Which of the two possibilities is most probable: is it Léger himself or another person? The point here is that both interpretations are possible in phase 1. Similarly, in example (2), it is unspecified whether life has changed the appearance of Alexis, or it is Alexis who has changed his appearance. Put differently, speakers may understand these as examples of phase 1 or of phase 2. These examples illustrate the ambiguity, which may have motivated speakers to reanalyse the expression ‘have’+past participle as an accomplished present tense.

In phase 2, the verb habeo/avoir takes the function of auxiliary and écrire turns into main verb with a direct object, possibly with concord of the participle and free word order. Example (3) shows that the reanalysis, from phase 1 to phase 2, has taken place. Here, Léger is not in the possession or absence of something, i.e. the verb ‘to have’ is not the main verb. The point is that he has lost something, i.e. his tongue has been cut off. Consequently, he is unable to praise God:

3. Saint Léger v. 161: hor a pordud dom Deu parlier/ja non podra mais Deu laudier (litt. now [he] has lost [to] God speak/never will praise God, ‘now he has lost the power to speak to God/he will never be able to praise him again’)

Moreover, example (3) helps us understand the second reanalysis. Indeed, the present situation (that Léger is unable to speak) is due to a preceding action, i.e. the cutting off of his tongue. In other words, example (3) is ambiguous with respect to the meanings accomplished present (phase 2) or perfectum praesens (phase 3). Thus, it is an example of critical context.

4.2. Phase 2–phase 3 transition, second reanalysis

In phase 3, the composed form takes the meaning of a perfectum praesens, i.e. a (recent) past, which is relevant for the present situation. Example (3) illustrated a situation of ambiguity for the speaker. The following example (4) provides a clear example of phase 3. By means of the adverb ore ‘now’, it opposes the present situation to another situation, clearly anchored in the past by means of the adverb tant, meaning ‘for such a long time’:

4. Alexis v. 353: “Ore ai trovét ço que tant avums quis” (‘now I have found what we have been searching for such a long time’)

However, if focus turns to the activity of the past and less on the relevance for the present situation, it becomes understandable that speakers reanalyse the perfectum praesens as a real past, i.e. a perfectum historicum. Example (5), especially the last words of the wife, provides an illustration of such a situation, where the family deplores the death of Saint Alexis:

5. Alexis v. 106-9: Ço dist li pedres: “Cher filz, cum t’ai perdut!”/Respont la medre: “Lasse! qu’est devenut?”/Ço dist la spuse: “Pechét le m’at tolut” (‘This said the father: “Dear son, I have lost you!”/The mother answers: “Alas, what has he become?”/This said the wife: “Sin took him away from me”’)

This reanalysis, which is the third and last one, is not generally found in texts before the 18th century. This implies that for a period of about 800 years, we find the co-occurrence, i.e. layering, of the composed past with the function of perfectum praesens and the simple past with the function of perfectum historicum. The following section sums up corpus investigations on this important period (see Schøsler 2004, Schøsler 2007, Caudal & Vetters 2007).

4.3. Phase 3–phase 4 transition, third reanalysis

Co-existence of the composed past with the meaning of perfectum praesens and the simple past with the meaning of perfectum historicum was codified by grammarians and Remarqueurs in the periods of the Renaissance and Classical French. In the 16th century, Henri Estienne expressed what was later known as the ‘rule of 24 hours’ in the following terms (6):

6. “Quand nous disons: j’ay parlé à luy et luy ay faict response, cela s’entend avoir esté faict ce jour là; mais quand on dit: je parlay à luy et luy fei response, cecy ne s’entend point avoir esté faict ce jour mesme auquel on raconte ceci, mais auparavant, sans qu’on puisse juger combien de temps est passé depuis. Car soit que j’aye faict ceste response le jour de devant seulement, soit qu’il y ait jà cinquante ans passez ou plus, je diray: je luy fei response, ou alors, ou adonc je fei response”.6 (‘When we say: I have talked to him and given him my answer, it should be understood that it has happened that same day; but when we say: I talked to him and gave him my answer, this should not be understood to have taken place the same day we say this, but previously and we cannot know how much time has passed since. Because it may be the case that I have given this answer the previous day, or that it happened 50 years ago or more, still I would say: I gave him my answer, or: at that moment, I gave him my answer.’)

Now, let us look for possible critical contexts that motivate speakers to perform what I have labelled the third reanalysis, i.e. from phase 3 to phase 4. I believe that negation may provide the appropriate contextual ambiguity, see the following quotation from Henri Estienne:

7. “... nous disons ordinairement: je luy ay faict souventesfois plaisir, et non pas je luy fei souventesfois plaisir. Et toutesfois, en la négative, nous usons de tous les deux: je ne luy ay jamais faict plaisir, je ne luy fei jamais plaisir. Mais ... en l’affirmative, ... j’ay faict est plus général ...” (‘we normally say: I have often pleased him/her, and not: I often pleased him/her. And however, with the negation, we use both: I have never pleased him/her, I never pleased him/her. But … in the affirmative, … I have pleased is more general ...’)

Corpus investigations confirm that negated contexts of this type tend to be anchored in the past by means of adverbs. Examining Montaigne, we find the adverb ‘never’, oncques, used thirty times more frequently with the simple past (the unequivocal perfectum historicum form), than with the composed past (see Wilmet (1998: 366)). In short: affirmative statements of the type quoted here, point to the present situation, whereas negated statements point to the past. When the composed past is used in negated contexts, it tends to acquire the meaning of a simple past, i.e. of a perfectum historicum. In the 17th century, grammarians and Remarqueurs insist on the rule of 24 hours (see the quotation in 6), and even famous authors such as Corneille must modify their use of tenses according to this.7 However, during the 18th century, texts reveal that the third reanalysis has taken place and that the subsequent actualisation process of the composed past with the meaning of perfectum historicum has started. Fortunately, many text types are available for this period and it is therefore possible to study in detail the period of layering (co-existence in texts of phase 3 and 4) and especially to investigate the actualisation process according to diasystematic parameters. I will discuss this further in section 5. Let us first introduce the temporal adverbs anchoring the tenses to the three dimensions of time. These are illustrated in (8): future, present and past, with the appropriate temporal adverbs: demain, aujourd’hui, hier.

8. Il fera demain ce qu’il fait aujourd’hui et ce qu’il fit hier; et il meurt ainsi après avoir vécu (La Bruyère, Les Caractères ou les mœurs de ce siècle. De la ville, 1688, ‘he will do tomorrow, what he does today, and what he did yesterday; and thus he will die after having lived’)

6 The citation of H. Estienne in (6) and (7) stems from his Traité de la conformité du langage français avec le grec, 1569, quoted here from Fournier 1998: 413.
7 See l’Académie française: Sentiments sur le Cid, 1637, cited in Fournier (1998: 414) and Wilmet (1998: 367) for a critical discussion of the consequences of the rule.

Let us consider more closely the context of the composed past with the meaning of perfectum praesens, i.e. phase 3, as opposed to that of phase 4. Let us label this context A. Here we find adverbs linked to the present: aujourd’hui, hier, avant-hier, il y a x jours. The actualisation process of the composed past, phase 4, manifests itself as the spreading from the original context linked to the present, to contexts which were originally restricted to the simple past, context B. Consequently, from the moment we find the composed past in context B, we know that the composed past has acquired the meaning of a perfectum historicum. Context B typically includes temporal adverbs such as alors, la veille, le lendemain. Private letters are commonly considered to be close to spoken language and consequently to provide reliable evidence concerning the spreading of new forms. Examples of private letters, quoted below, from Diderot (Frantext, ca. 1760) illustrate the use of the simple past (9) alternating with the composed past (10), i.e. cases of layering, in exactly the same context:

9. “Nous dînames hier ensemble depuis deux heures et demie jusqu’à neuf heures du soir” (‘Yesterday, we had dinner together from half past two till nine o’clock in the evening’)
10. “J’ai dîné hier avec tout une colonie angloise” (‘Yesterday, I have had dinner with an entire colony of Englishmen’)

Only in informal texts, close to spoken French, such as private letters, do we find composed past forms in context B. In standard literary texts, we do not find the composed past in context B before the 20th century. Camus is famous for having introduced the use of the composed past in literature in 1942, see typical examples in (11), from Frantext:

11. Camus, L’Etranger: J’ai pris l’autobus à deux heures (p. 8, ‘at 2 o’clock I took the bus’). ... A ce moment, le concierge est entré derrière mon dos (p. 12, ‘at that moment, the caretaker entered ...’).



In modern, spoken French, the composed past has almost completely replaced the simple past in the function of perfectum historicum. Evidence from Caudal & Vetters (2007: 126)8 illustrates the actualisation process around 1750 as it manifests itself in private letters with past tenses in contexts A and B, i.e. accompanied by temporal adverbs. The study shows that the simple past is the default form for the meaning perfectum historicum in contexts of type B (with temporal adverbs like hier, alors, jadis, lundi, mardi ... un/le jour ...), and that the composed past is used in the meaning perfectum praesens in contexts of type A (with temporal adverbs like aujourd’hui, maintenant, tantôt, présentement, ce jour ...). However, in the 18th century, the composed past progresses in contexts of type B, starting with temporal adverbs of recent past such as hier ‘yesterday’.

4.4. Summing up section 4

Section 4 has examined the function of the French composed past, in terms of time reference. It has been argued that the composed form started as a present tense, which became a tense of the past through four phases. Each of these phases presupposes ambiguity of the expression, which motivates speakers to analyse the form differently. Each reanalysis is followed by an actualisation of the new meaning of the form, i.e. a spreading in use to the new context. In section 3, I introduced the simple model of change A>{A,B}>B. With respect to the two past tenses, simple and composed, the investigation has shown that it adequately represents the change of forms, since the composed past has taken over, and the simple past has become restricted to high, mainly written style. As for the meaning of these forms, in terms of tense values (present tense, accomplished present, perfectum praesens and perfectum historicum), things are more complicated and will be studied in more detail in section 5. It should be observed, however, that the two latter functions, perfectum praesens and perfectum historicum, are now expressed in modern, spoken French by means of the composed past, implying that the model of change concerning the meaning of the composed form is best formalised by: A>{A,B}.9
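As a rough illustration of how the distinction between contexts of type A and type B in section 4.3 might be operationalised in a corpus query, the following Python sketch assigns a sentence to a context according to the temporal adverb it contains and tallies tense forms per context. The adverb lists are the ones cited above from Caudal & Vetters (2007), so hier is treated as type B; the function names, the tense labels and the simple pattern-matching strategy are assumptions made for illustration only, not the procedure used in the studies cited.

```python
import re
from collections import Counter

# Adverb lists as cited in the text; everything else here is hypothetical.
CONTEXT_A = ["aujourd'hui", "maintenant", "tantôt", "présentement", "ce jour"]
CONTEXT_B = ["hier", "alors", "jadis", "lundi", "mardi"]


def _pattern(adverbs):
    """Build a pattern matching any adverb as a free-standing word."""
    alternatives = "|".join(re.escape(adverb) for adverb in adverbs)
    # The lookarounds keep "hier" from matching inside "avant-hier".
    return re.compile(rf"(?<![\w-])(?:{alternatives})(?![\w-])")


PATTERN_A = _pattern(CONTEXT_A)
PATTERN_B = _pattern(CONTEXT_B)


def classify_context(sentence):
    """Return 'A', 'B', or None when no adverb or both kinds are present."""
    text = sentence.lower().replace("\u2019", "'")  # normalise apostrophes
    in_a = bool(PATTERN_A.search(text))
    in_b = bool(PATTERN_B.search(text))
    if in_a and not in_b:
        return "A"
    if in_b and not in_a:
        return "B"
    return None


def tally(tagged_sentences):
    """Count (context, tense) pairs for sentences already tagged for tense."""
    counts = Counter()
    for tense, sentence in tagged_sentences:
        context = classify_context(sentence)
        if context is not None:
            counts[(context, tense)] += 1
    return counts


letters = [
    ("simple", "Nous dînames hier ensemble depuis deux heures et demie"),
    ("composed", "J’ai dîné hier avec tout une colonie angloise"),
]
print(tally(letters))
```

Applied to the two Diderot sentences in (9) and (10), the sketch assigns both to context B, one with each tense form, which corresponds to the layering situation described in section 4.3. A real study would of course need tense annotation and a more careful treatment of adverbs with several readings, such as alors.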



8 The dating is, however, modified according to data from Caron & Liu (1999). The dating around 1750 is also confirmed by the detailed study by Le Guern (1986). He examines the concord between the subjunctive form and the composed past. Before ca. 1750 the composed past concords with the present subjunctive. Later, it concords with the past subjunctive.
9 As for rules of word order and of concord, these have become subject to strict regulation by grammarians, especially from the 16th century (see Clément Marot’s poetic rules from 1558, quoted by Wilmet (1998: 360-361)).



5. Discussion of the conflicting evidence from Old French texts

In section 4, I presented and argued for the model of change presented in four phases in Table 2. One might object that it is falsified by the intriguing use of past tenses in Old French, especially in epic texts. There has been much discussion about what has been labelled tense switching, which is a well known phenomenon in epics written in old languages. In the following, I intend to show that my hypothesis proposes a more coherent analysis of Old French tense use than other proposals. Put differently, the use of tenses in Old French epics constitutes a crucial test of validity for my hypothesis.

5.1. Tense switching

Experts in Old French have proposed different analyses in order to account for tense switching. Let me first provide examples of the phenomenon in (12), (13), and (14). Here, we find an apparently indiscriminate use of past forms: the simple past, the composed past and the so-called historical present form. Each of these forms expresses a series of actions that have happened in the past. In verse texts, the composed form is frequently used in assonance, which is strikingly illustrated in (14), where the simple past is used in the body of the verse, the composed past in assonance.10 However, this is certainly not the only parameter of distribution, as seen in prose texts and in (13), where the composed past is used in the body of the verse: De saint batesma l’unt fait regenerer (‘they had him reborn (composed past) by holy baptism’).

12. Passion v. 21-24: Cum cel asnez fu amenaz (‘when this donkey was (simple past) fetched’)/de lor mantelz ben l’ant parad (‘with their mantle [they] embellished (composed past) it’)/de lor mantelz de lor vestit (‘with their mantles and with their clothes’)/ben li aprestunt o s’assis (‘they prepared (historical present tense) well there where he sat (simple past) down’)

13. Alexis v. 26-30: Tant li (= to God) prierent par grant humilitét (‘they prayed (simple past) to God with much humility’)/Que la muiler dunat fecunditét (‘that [God] give (present subjunctive) fertility to the wife’)/Un filz lur dunet si l’en sourent bon gret. (‘[God] gave (historical present) them a son, they were (simple past) thankful to him for that’)/De saint batesma l’unt fait regenerer (‘they had him reborn (composed past) by holy baptism’)/Bel num li metent selunc cristïentét (‘they gave (historical present) him a beautiful Christian name’)

14. Aucassin VII v. 6-9: Vers le palés est alés/Il en monta les degrés/En une canbre est entrés/Si commença a plorer (‘Towards the castle he walked (composed past)/he walked (simple past) the steps/in the room he entered (composed past)/then he started (simple past) to weep’).

10 Buridant (2000: 370-381, 384) confirms this observation.



Manuscript variations also show the alternation of forms; in the following example from Saint Alexis v. 77, mss. L and A, we find indiscriminate use of the simple past and historical present in the same verse:11

Saint Alexis ms. L, v. 77: La nef est (historical present) preste ou il deveit (imperfect past) entrer
Saint Alexis ms. A, v. 77: La nef fu (simple past) prest u il dut (simple past) enz entrer
(‘the ship was ready, the one that he was about to enter’)

5.2. Conflicting analyses of tense switching

Tense switching existed in epic texts from the Middle Ages, but was explicitly rejected by Vaugelas (grammarian from the 17th century) and disappeared. Specialists of Old French have often discussed this phenomenon and have proposed different analyses of the apparently random use of tenses in old epic texts. The views can be reduced to three interpretations. Tense switching is:

1. Chaotic12
2. Pragmatically motivated
3. A sign of ongoing language change

I will briefly comment on these interpretations. The view that tense use in epic texts is chaotic and cannot be accounted for hardly needs further discussion, as it denies any regularity in the data. The pragmatically inspired view that: “...narrative constitutes a special category of linguistic performance whose grammar differs in certain significant respects from the grammar of non narrative language...” is for example found in Fleischman (1990: 313). Fleischman provides interesting insights on textual functions of tense use, drawing upon studies in narratology. As I understand it, Fleischman is inspired by Benveniste (1966) and Weinrich (1973) in her analyses of the ‘tempo’ of texts, as it is provided by means of tense switching (see concrete analyses e.g. p. 208-209). According to Benveniste (and to Weinrich, heavily inspired by Benveniste), we must distinguish between two models of discourse: ‘storytelling’ and ‘communication’. In ‘storytelling’, we find all past tenses, including the simple past and the imperfect. In ‘communication’, the simple past is excluded (op.cit. 114). Now, the problem here, as I see it, is that the model of Benveniste (and of Weinrich) is implicitly based on a tense system that corresponds to phase 3 of my Table 2, or possibly to the period of actualisation from phase 3 to phase 4, since the simple past is excluded from ‘communication’. However, as we have seen in private letters from the 18th century, the simple past is generally used in informal communication, at least until 1750. In other words, the tense system implied by Fleischman for epics in Old French is that of Post-classical French. Moreover, Fleischman fails to observe that tense switching presupposes identity of tense reference among alternating tenses. The composed past can alternate with the historical present and with the simple past because it is a present form, i.e. it is to be located at phase 2, not at phase 3.

Quite a different view is expressed by Foulet, according to whom the use of the composed past in tense switching is an indication of the change of the meaning for this form into perfectum historicum, i.e. to my phase 4. This view is accepted by other scholars, e.g. by Brunot and Clédat.13 It is in obvious contradiction to my hypothesis, according to which we do not find the composed form with the function of a perfectum historicum before the 18th century. Let us consider the arguments confirming or refuting the two hypotheses. According to Foulet, innovative tense use is found in epic texts:

“... il faudrait précisément chercher les débuts de l’usage moderne dans la langue poétique ou littéraire du XIIe et du XIIIe siècle. C’est là, semble-t-il, que pour la première fois le passé indéfini a pris, à côté de son sens traditionnel, la signification d’un prétérit.” (Foulet 1920: 273-274)14

11 Here, the historical present tense and the simple past have the same number of syllables, which permits the tense switch. The composed past has more syllables, and cannot alternate in manuscripts without more important reorganisation of the text.
12 Modern specialists of Old French such as Buridant (2000) and Perret (2008) decline to explain tense switching in Old French. I consider these to represent the ‘chaotic’ view. Buridant puts forward mainly metrical arguments in his presentation of tense switching.

According to my own investigations (see Schøsler 1973), tense use in (fictitious) direct discourse of Old French is different from that of epic texts, in that the composed past in such texts corresponds to phase 3. If Foulet’s analysis is correct, this implies that innovative use is found in literary narrative texts earlier than in text types close to spoken language. These are two conflicting views of how to evaluate text evidence in periods where no direct sources are available. I will question Foulet’s claim that epic texts are innovative, and that the tense value of phase 4 should start in these texts. Indeed, it is hard to understand that the composed past with the function of perfectum historicum disappears15 with this specific text type, at the end of the Middle Ages. It is also hard to understand that it does not reappear before the 18th century, in private letters, and that we find it only in literary style in the 20th century. These legitimate problems are not addressed by Foulet or by his adherents. Foulet admits, however, that this is the only case found where the language of literary texts is innovative.16

Arguments against Foulet’s view include the fact that tense switching in epic texts differs essentially from the use of composed past of phase 4, in that it often alternates with historical present forms; it does not form series of successive actions; it is not found with temporal adverbs anchoring the action in a past (i.e. the context of the simple past labelled context B above). Moreover, studies of independent changes in Old French clearly show that epics, especially in verse texts, do not constitute an innovative context. Old French language change appears in (fictitious) direct discourse earlier than in narrative parts, in prose earlier than in verse texts. Good examples are the disappearance of the case declension and the disappearance of null subjects (see Schøsler: 1984, 2001b, 2002).

According to my hypothesis, composed past in Old French texts had tense values corresponding to phase 2 and phase 3, respectively accomplished present and perfectum praesens. Phase 2 is used in conservative contexts (narrative texts, mainly verse texts), phase 3 in innovative contexts (fictitious direct discourse, prose texts). Phase 2 is a present form, so it is natural that it alternates with the historical present form, which takes the meaning of the simple past.17 This implies that the composed form in Old French has an essentially different meaning than the one found in phase 4.

13 See Foulet § 322, Brunot I: 240, Clédat § 455. Without quoting Foulet, Caudal & Vetters (2007: 124), referring to recent studies on Old French, accept the (incorrect) idea that the composed past takes the meaning of perfectum historicum in Old French epics.
14 ‘We probably find the start of the modern use [of past tenses] in the poetic or literary style from the 12th and the 13th centuries. It seems that this is where the composed past for the first time has taken the meaning of a preterit [i.e. perfectum historicum] besides its traditional meaning’ (author’s translation).
15 See also Martin 1971: 397. However, he proposes a slightly earlier dating of phase 4.
16 [Cette hypothèse] “attribue à la langue littéraire une influence sur l’évolution linguistique que ne confirme aucun autre témoignage” Foulet 1958: 274.
17 A similar analysis is proposed by Wilmet (1998: 364-365).

5.3. Summing up section 5

This section has presented and analysed the phenomenon of tense switching as a test of my hypothesis concerning the change of the composed form in French. It has been argued that proposing tense switching to be chaotic or pragmatically motivated does not account for the data. On the contrary, corpus evidence reveals ongoing language change, but not in the sense proposed by e.g. Foulet. Rather, data confirms my hypothesis that tense switching takes advantage of the different, but co-existing meanings of tense forms linked to different styles: an innovative one in prose, especially in direct discourse and a conservative one in verse, especially in narrative parts. The co-existence of meaning, i.e. the layering, reveals the ongoing language change with the spreading of innovative language use of the composed past from context A to context B.

6. The relevance of the diasystematic parameters

6.1. Diasystems

Just like sociolinguistics, variationist linguistics is based on the principle that languages are not wholly autonomous systems, but that they are related to extralinguistic factors. These are interpersonal or intrapersonal. The former are linked to the diachronic, topological and social anchoring of the speakers, the latter are linked to conditions of language production, e.g. style and medium of communication (oral/written). The concepts of variety and diasystems were introduced by Uriel Weinreich, Eugenio Coseriu, Ludwig Söll and Bodo Müller (see the presentation in Völker (2007) with references). In variationist linguistics it is hypothesised that it is possible to systematize extralinguistic (diasystematic) factors according to their impact on language structures. As I see it, it is possible to correlate the theory of actualisation (Andersen 2001) with diasystematic parameters. According to this hypothesis, we should expect innovative use to spread according to the diasystematic parameters. Due to the availability of relevant data, this is best tested on the actualisation process following reanalysis three, from phase 3 to phase 4. However, concerning the previous actualisation process, from phase 2 to phase 3, it is also possible to confirm the relevance of at least some of the diasystematic parameters. I will do this in the following section, on the basis of my own research (Schøsler 1973, 2001a, 2004, 2007) and on evidence from Caron & Liu (1999).

6.2. Test of the actualisation theory and of the diasystematic parameters

If we distinguish narrative/epic parts of texts and (fictitious) direct discourse, we find innovative use in direct discourse first. This suggests that “direct discourse” is closer to the source of innovation, which is mainly taken to be in informal speech.18 This holds for the observations concerning the actualisation process from phase 2 to phase 3. Concerning the actualisation process from phase 3 to 4, innovation is first found in the first and second person (implying closeness to direct discourse) and later in the third person, typical of narrative (see also Wilmet 1998: 366). Evidence from private letters (Caron & Liu 1999) permits us to follow the process of spreading of the innovative tense meaning of the composed past (perfectum historicum). Indeed, innovation spreads according to diasystematic parameters: first in the medium of private letters, i.e. the text type closest to oral language (the diamesic parameter); first in informal style, as opposed to literary, high style (the diaphasic parameter); first in private letters by women, and by women from the Paris region, later by men and by those living in the provinces (the diatopic parameter); and first by younger, later by older authors of letters (the diachronic parameter).

18 This holds for the internally motivated language change studied here. Externally motivated language change may follow the opposite direction, i.e. be imposed on informal speech from e.g. standard language.

7. Conclusion

The preceding sections have shown that the investigation of the change from habeo litteras scriptas to j’ai écrit les lettres by means of corpora has answered the research questions asked in the introduction of my paper regarding:

• the models of change
• the methodology applied
• the relevance of the diasystematic parameters

Concerning the model of change A > {A,B} > B, see section 4.4. for arguments in favour of the model A > {A,B} as an adequate description of the changes. Corpus evidence has indeed shown the relevance of distinguishing text types and text functions for phase 2>3. Due to a better access to diversified text types, phase 3>4 has been investigated in greater detail. The result has been the confirmation of the relevance of the diachronic, diatopic, diaphasic and diamesic parameters for the actualisation process of innovation. Due to the absence of relevant data, the diastratic parameter has not been investigated. However, according to Blumenthal (1986), this parameter is relevant for the use of past tenses in his diachronic corpus of French. This investigation has also shown that hypotheses in diachronic linguistics can be confirmed or dismissed by means of corpora, provided that corpora are composed in such a way that they permit an exploration of relevance for various parameters.
Moreover, it has been shown that new insights on the process of language change can be gained by combining variational linguistics with Andersen’s theory on actualisation. I have proposed above that change proceeds in three reanalyses and subsequent actualisation processes, phase 1>2 around the year 1000, phase 2>3 in the period 1150-1200 (see textual evidence in Saint Alexis) and phase 3>4 circa 1750, each phase being the prerequisite for the following one.

At the present stage of my research, I am not yet capable of answering the important why-question: why did the three reanalyses happen at that specific moment in the history of French and not before or after? More synchronic studies are a prerequisite for the diachronic considerations regarding a possible connection between language change and substantial changes in the corresponding society. It is well known that languages experience swifter changes under social and political conditions that imply intensive linguistic contact, while they may stagnate in isolated societies. Further research in this field must therefore include considerations that comprise the social conditions as they relate to language variation and theories on language change, which distinguish between internally and externally motivated linguistic change.

Another important question regards the connection between different language changes. These may be interpreted as chains, where one acts as a prerequisite link for the next and so forth. This question has not received much attention, in that most language change studies mainly consider isolated events. Above, I have referred to independent evidence concerning Old French changes that all confirm the relevance of diasystematic parameters (see section 5.2.) and further research should include the study of connectedness between these changes. In sum, further investigation and reliable dating of language change are needed, in order to study the possible co-occurrence of changes and the possible link between these changes.

References

Sources:
Les serments de Strasbourg, La cantilène de Sainte Eulalie, La Passion du Christ, La Vie de Saint Léger in Koschwitz ed (1964).
La vie de Saint Alexis in Christopher Storey (1934): Saint Alexis. Etude de la langue du manuscrit de Hildesheim, suivie d’une édition critique du texte d’après le manuscrit L, avec commentaire et glossaire. Paris: Librairie Droz.
Atilf, Frantext, La Nouvelle Base d’Amsterdam, BFM, la Base de Français Médiéval de l’UMR 8503, base électronique élaborée par Christiane Marchello-Nizia et son équipe.

Scientific References:
Andersen, H. 1973. “Abductive and deductive change”. Language 49:4. 765-793.
Andersen, H. (ed). 2001. Actualization. Linguistic Change in Progress. Amsterdam: Benjamins.



Andersen, H. 2008. “Grammaticalization in a speaker-oriented theory of change”. Grammatical Change and Linguistic Theory. The Rosendal papers, T. Eythórsson (ed). 11-44.
Blumenthal, Peter. 1986. Vergangenheitstempora, Textstrukturierung und Zeitverständnis in der französischen Sprachgeschichte, ZfZL Beiheft 12. Stuttgart: Franz Steiner Verlag.
Brunot, Ferdinand. 1905-1953. Histoire de la langue française des origines à 1900/à nos jours. Paris: Colin.
Benveniste, Emile. 1966. “Les relations de temps dans le verbe français”. Problèmes de linguistique générale 1. 237-250.
Buridant, Claude. 2000. Grammaire nouvelle de l’ancien français. SEDES: Paris.
Caron, Philipp et Yu-Chang Liu. 1999. “Nouvelles données sur la concurrence du passé simple et du passé composé dans la littérature épistolaire”. L’Information grammaticale 82. 38-50.
Caudal, Patrick et Carl Vetters. 2007. “Passé composé et passé simple: Sémantique diachronique et formelle”, Labeau, Emmanuelle, Carl Vetters et Patrick Caudal (ed). 121-151.
Clédat, Léon. 1887. Grammaire élémentaire de la vieille langue française. Paris: Garnier.
Diewald, Gabriele. 2002. “A model for relevant types of contexts in grammaticalization”. New Reflections on Grammaticalization, Wischer, Ilse and Gabriele Diewald (eds). Amsterdam: Benjamins. 103-120.
Fleischman, Suzanne. 1990. Tense and narrativity. From medieval performance to modern fiction. University of Texas Press.
Foulet, Lucien. 1958³. Petite syntaxe de l’ancien français. Paris: Champion.
Hopper, Paul J. and Elizabeth C. Traugott. 1993. Grammaticalization. Cambridge: CUP.
Labeau, Emmanuelle, Carl Vetters et Patrick Caudal (ed). 2007. Diachronie et sémantique du système verbal français. Cahiers Chronos 16. Amsterdam/NY: Rodopi.
Le Guern, Michel. 1986. “Notes sur le verbe français”. Sur le Verbe, Rémi-Giraud, Sylvianne (eds). Presses Universitaires de Lyon. 9-60.
Maiden, Martin. 2003. A Linguistic History of Italian. London/NY: Longman.
Martin, Robert. 1971. Temps et aspect. Essai sur l’emploi des temps narratifs en moyen français. Klincksieck: Paris.
Perret, Michèle. 2008. Introduction à l’histoire de la langue française. 3. éd. Paris: Colin.
Rubenbauer, Hans et J. B. Hofmann. 1958. Lectiones Latinae, Grammatik. 5e éd. München.



Schoch, J. 1912. Perfectum Historicum und Perfectum Praesens im Französischen von seinen Anfängen bis 1700. Halle.
Schøsler, Lene. 1973. Les temps du passé dans Aucassin et Nicolete. L’emploi du passé simple, du passé composé, de l’imparfait et du présent « historique » de l’indicatif. Odense: Odense University Press.
Schøsler, Lene. 1984. La déclinaison bicasuelle de l'ancien français, son rôle dans la syntaxe de la phrase, les causes de sa disparition. Etudes romanes de l'Université d'Odense, vol 19. Odense: Odense University Press.
Schøsler, Lene. 1994. “Did Aktionsart ever ‘compensate’ verbal aspect in Old and Middle French?”. Tense, Aspect and Action. Empirical and Theoretical Contributions to Language Typology, Carl Bache, Hans Basbøll and Carl-Erik Lindberg (eds). Berlin/New York: Mouton de Gruyter. 165-184.
Schøsler, Lene. 2001a. “From Latin to modern French: Actualization and markedness”. Andersen, Henning (ed). 169-185.
Schøsler, Lene. 2001b. “The coding of the subject-object distinction from Latin to Modern French”. Grammatical Relations in Change. (Studies in language companion series 56), Faarlund, Jan Terje (ed). Amsterdam: Benjamins. 273-302.
Schøsler, Lene. 2002. “La variation linguistique : le cas de l'expression sujet”. Interpreting the History of French, A Festschrift for Peter Rickard on the occasion of his eightieth birthday, Sampson, Rodney and Wendy Ayres-Bennett (eds). Amsterdam/NY: Rodopi. 187-208.
Schøsler, Lene. 2004. ““Tu eps l’as deit”/“Tut s’en vat declinant”. Grammaticalisation et dégrammaticalisation dans le système verbal du français illustrées par deux évolutions, celle du passé composé et celle du progressif”. Aemilianense. Revista Int. ...Génesis y Orígenes Históricos... Lenguas Romances Vol 1. 517-568.
Schøsler, Lene. 2007. “Grammaticalisation et dégrammaticalisation. Etude des constructions progressives en français du type Pierre va/vient/est chantant”. Labeau, Emmanuelle, Carl Vetters et Patrick Caudal (ed). 91-119.
Söll, Ludwig. 1974. Gesprochenes und geschriebenes Französisch. Berlin: Schmidt.
Traugott, Elizabeth C. and Richard B. Dasher. 2005. Regularity in Semantic Change. Cambridge: CUP.
Thielmann, Ph. 1885. “Habere mit dem Part. Perf. Passiv”. Archiv für lateinische Lexikographie 2. 373-423, 509-549.
Vising, Johan. 1888-9. “Die realen Tempora der Vergangenheit im Französischen und den übrigen romanischen Sprachen”. Französische Studien 6. 1-228. Französische Studien 7. 1-113.



Völker, Harald. 2007. “A ‘practice of the variant’ and the origins of the standard. Presentation of a variationist linguistics method for a corpus of Old French charters”. French Language Studies 17. 207-223.
Weinrich, Harald. 1973. Le Temps. Paris: Editions du Seuil.
Wilmet, Marc. 1998². Grammaire critique du français. Paris/Bruxelles: Duculot.

The Reorganisation of Mood in the Epistemic Subsystem—The Case of French Belief Predicates in Diachronic Dynamics Martin BECKER

1. Introduction

This article intends to illustrate how we can combine theoretical notions of language theory in the field of modal semantics with corpus-based empirical research in order to gain new insights into the evolution of language systems over the course of time. A particularly interesting case in point is the restructuring of the mood system in the so-called doxastic domain constituted by the belief predicates in the history of the French language. This modal domain underwent a profound change from Old French to the 17th century: not only were the principles of mood selection modified in a substantial way, restricting the use of the subjunctive mood to very specialised contexts, but the prototypical Old French belief predicate ‘cuidier’ was also superseded by the competing verbs ‘croire’ and ‘penser’. In the first part of my contribution I shall outline some basic modal-semantic notions which are appropriate for the description of belief (or doxastic) predicates. Subsequently, I will present and analyse some relevant data provided by the New Amsterdam Corpus and finally, I will attempt to make some interpretive remarks on the results, trying to join together relevant facets—at least tentatively—to form a comprehensive picture.

2. Mood and modality in the doxastic domain

2.1. Some basic notions and their relationship

In modern modal semantics the category of mood is conceived of as one specific mode of realising modality (cf. Palmer ²2001). In this perspective, mood can be interpreted as an inflectional modality marking device which differs from lexical devices such as modal verbs (can, might, should), modal adverbs (probably, perhaps, maybe) and modal adjectives (probable, possible, certain, etc.). Modality is a notion which dates back to Aristotle’s De interpretatione and has been defined in many ways (Aristoteles 1999, von Wright 1951). Since we cannot discuss the different approaches to modality (see e.g. Le Querler 1996, Portner 2009) in detail, we will simply note the uncontroversial fact that modality always implies the idea of alternatives (or ‘alternative worlds’) with respect to our base world—the world of reference we identify as the space in which we live and act. Hence, modality evokes possible worlds which may come true or which will be forever relegated to the realm of counterfactuality. One of the basic insights we can gain from a systematic study of the structure and diachronic evolution of mood systems is the fact that mood alternations in language systems are organised with respect to modal domains which correspond to conceptual domains (and may be represented as a conceptual map). Therefore, even though we may be able to ascribe an abstract value to the indicative and the subjunctive mood, their specific interpretation will rest on particular modal domains and their specific underlying conceptual structures.

2.2. The inherent structure of the ‘epistemic domain’

A particularly interesting area of modality is the so-called ‘epistemic domain’. A great deal of work has been undertaken in recent years by scholars looking to disentangle the manifold aspects of epistemicity. Without claiming to be exhaustive, we can quote seminal contributions presented by De Haan (1999), Nuyts (2001, 2006), Plungian (2001), Cornillie (2007), Pietandrea and—most recently—Leiss (2009). Notwithstanding some controversy around issues of classification, there is, at least, general consensus on the existence of three essential dimensions which come into play when dealing with epistemic modality and especially epistemic predicates. Therefore, we must distinguish: a) the dimension of epistemicity (in a narrow sense) which refers to the commitment of the speaker who evaluates—as Nuyts puts it—‘the chances that a certain hypothetical state of affairs under consideration will occur, is occurring, or has occurred in a possible world’ (Nuyts 2006: 21).
As we can see, the epistemic dimension of this class of predicates always implies a scale of probability which ranks possible worlds in accordance with their chances of realisation. Speakers compare their expectations of future outcomes of actual developments with their assumptions concerning a specific state of affairs. b) Secondly, the evidential dimension: this focuses on the source that justifies the assertion of a determined proposition. Therefore, an assertion can be based on different types of evidence. Plungian (2001), for instance, distinguishes the following basic kinds of evidence: - direct evidence which is either visual, or sensory or endophoric (which means that it reflects mental and physical inner states such as feelings or cognitive operations);



- personal indirect evidence which is based on inferences drawn by the speaker in sentences like ‘Given that Peter didn’t have dinner the whole day, he must be hungry’; - mediated evidence through the statements of others: this type of evidence refers to reportive readings which can be second-hand (something like ‘according to Mr. Miller’) or third-hand (‘there is a rumour that’), but it can also be based on common knowledge, on tradition or on authority. c) The third dimension, the dimension of subjectivity, concerns a particular aspect of epistemic modality, the so-called problem of accessibility. The bearer of a propositional attitude may hold beliefs which he shares with others—this amounts to the notion of ‘intersubjectivity’—but it is also possible that his beliefs belong to him alone (the idea of ‘subjectivity’). The attitude holder may be capable of producing good evidence or arguing in a conclusive way for his convictions (hence the notion of ‘objectivity’). However, he may not be able or willing to come up with sufficient evidence for his own convictions. The philosophy of language has been particularly interested in the problem of accessibility inherent to this third dimension, and philosophers of language like Kaplan (1969), Lewis (1979) and, later, Kamp (2003) commented on the opposition between de re and de dicto beliefs. A belief can be assumed to be de re if the belief holder does not only stand in a subjective acquaintance relation with the content of his belief—the so-called ‘res’—but if this acquaintance relation is ‘objective’, in the sense that it could be confirmed by an external and impartial observer (‘the eye of Horus’).¹ To put things differently, we can state that the belief holder has access to some possible worlds (his ‘belief worlds’) and that these possible worlds can either be accessed by other individuals (a ‘shared set of belief worlds’) or only by the believer alone (which corresponds to the empty set of joint belief worlds).
With regard to the problem of accessibility, it becomes apparent that the three dimensions in question are profoundly intertwined: the speaker may ascribe a certain value or degree of likelihood to a propositional content (or a ‘state of affairs’), but his commitment will be based on the available evidence he can provide to ‘testify’ to a certain state of affairs. Finally, the evidence the speaker may produce mirrors the accessibility relation he entertains with the content of his belief: the relationship may be confined to him alone or it may be shared by others and can even be subject to a procedure of objectivisation.

¹ H. Kamp puts forward a more empirical criterion to define a de re belief: according to him there must be a causal relationship between the attitude of a belief-holder and the res in question. (Kamp 2003: 254)

So the domain of epistemicity turns out to be an intricate conceptual complex which is composed of an epistemic, an evidential and a subjective/intersubjective dimension. Interestingly, the different lexical items in the field of epistemic modality single out or focus on some aspects or dimensions of the conceptual complex, defocalising or fading out other aspects or dimensions. For example, a modal adverb like French ‘peut-être’ (perhaps) in example (1)

(1) Pierre viendra, peut-être, demain.
Peter might come tomorrow

focuses exclusively on the epistemic dimension as it highlights the degree of probability the speaker ascribes to the realization of the proposition. However, the modal verb ‘devoir’ (English ‘must’)—as in example (2) (2) ‘Tiens on sonne à la porte—ça doit être le facteur’ (Dendale 1994: 38) Oh, the bell is ringing. It must be the postman

allows for an inferential reading, conflating two different aspects: for one thing, it conveys the degree of the speaker’s commitment (he is sure that the proposition p is the case), and then, it indicates that his conviction was derived by a reasoning procedure based on logical inferences, his source being world knowledge and logic (see Pietandrea: 81). So—according to what we know about the daily routine of the persons with whom we interact—the speaker can infer that at a certain hour of the day the person most likely to ring the bell must be the postman. Finally, if a speaker of French falls back on the Conditional (as in example 3) (3) Le président aurait l’intention de se rendre au Japon selon L’Agence France Presse. The president is allegedly paying a visit to Japan (according to the AFP)

he wants to emphasise the source of his belief or conviction and quotes, in this context, the relevant source of knowledge (Agence France Presse). With these considerations in mind, we can pass to the second part of the contribution in which some light is shed on the evolution of the mood system in the domain of belief predicates from Old to Classical French. For the sake of analysis and illustration I will focus on two basic belief predicates—the verbs ‘cuid(i)er’ and ‘croire’—which originate from the Latin verbs cogitare and credere, respectively. In what follows, it will be demonstrated that corpora can help us to gain some new interesting insights into the principles and the dynamics of linguistic change.



2.3. Mood selection of belief predicates in modern French In order to understand the different changes in the doxastic domain we should start with the endpoint of developments, the contemporary system of mood selection triggered in the context of belief predicates. The verb ‘croire’ turned into the prototypical belief verb in French having ousted its rival ‘cuidier’ by the beginning of the 17th century. In modern French ‘croire’ always selects the indicative mood in affirmative sentences, e.g.: (4) Pierre croit que Paris est la capitale d’Angleterre (Peter believes Paris to be (IND) the capital of England)

In sentences which are in the scope of negation or an interrogative operator, mood selection depends on whether the speaker commits himself to the truth of the subordinate proposition. If he does, the content of the complement clause is presupposed as true and therefore ‘scoped out’ of the intensional predicate, the indicative being chosen. Otherwise, the speaker falls back on the subjunctive mood. For the sake of illustration three examples are provided below:
(5) Il ne croit pas que Paris est la capitale de la France. (He does not believe that Paris is (IND) the capital of France)
(6) Il ne croit pas que Jean soit le meilleur ami de Luc. (He does not believe that Jean is (SUBJ) Luc’s best friend)
(7) Crois-tu que Jésus-Christ est/soit père, fils et saint esprit ? (Do you believe that J. C. is (IND/SUBJ) the Father, Son and Holy Spirit?)

The complement clause in (5) states common knowledge as to the capital of France, whereas in (7) the speaker has to decide whether he wants to endorse the doxa voiced in the subordinate proposition (a priest, for instance, should) or whether he remains at least neutral as to the truth of the proposition. The speaker of (6) does not take any stance as regards the truth of the subordinate clause. This operator-sensitive type of mood selection which is characteristic of weak intensional predicates (such as epistemic verbs)² has been described as the ‘polarity subjunctive’ (see Quer 1998: 31 ss.).


² For the opposition between ‘strong’ and ‘weak’ intensional predicates, see McCawley 1981: 326-340.
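The selection rule described for modern French ‘croire’ can be written down as a small decision procedure. The following is only a simplified sketch: the function name, the parameter names and the two-valued `Mood` type are illustrative assumptions, and the rule covers nothing beyond the three factors mentioned in the text (negation, interrogation, speaker commitment).

```python
from enum import Enum


class Mood(Enum):
    INDICATIVE = "IND"
    SUBJUNCTIVE = "SUBJ"


def select_mood_croire(negated: bool, interrogative: bool,
                       speaker_commits: bool) -> Mood:
    """Mood of the complement clause of modern French 'croire' (simplified)."""
    # Affirmative declaratives always take the indicative,
    # whatever the speaker thinks of the complement.
    if not (negated or interrogative):
        return Mood.INDICATIVE
    # Under negation or in a question, the indicative is kept only if the
    # speaker commits to the truth of the complement clause.
    return Mood.INDICATIVE if speaker_commits else Mood.SUBJUNCTIVE


# Examples (4)-(6) from the text, encoded by hand.
example_4 = select_mood_croire(negated=False, interrogative=False, speaker_commits=False)
example_5 = select_mood_croire(negated=True, interrogative=False, speaker_commits=True)
example_6 = select_mood_croire(negated=True, interrogative=False, speaker_commits=False)
print(example_4.value, example_5.value, example_6.value)  # prints: IND IND SUBJ
```

Example (7) corresponds to the interrogative case, where both outcomes are possible depending on the value chosen for `speaker_commits`.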



3. A case study: the development of mood selection of ‘cuidier’

3.1. Mood selection in Old French and the New Amsterdam Corpus: general tendencies

Tracing developments back to Old French we become aware that the doxastic domain was organised in a completely different way. The data of the New Amsterdam Corpus convey an interesting picture with regard to mood selection principles of Old French in the epistemic domain. However, before exposing and analysing the data, some remarks on the New Amsterdam Corpus (NCA) are required. The NCA is based on Anthonij Dees’ Amsterdam Corpus of Old French Literary Texts which resulted in the Atlas des formes linguistiques des textes littéraires de l'ancien français (Dees 1987). The original Corpus was rearranged (revised, lemmatised and XML-formatted) principally by Achim Stein, who facilitated corpus research thanks to a special TWIC query tool. This tool allows for pattern matching queries on the basis of regular expressions. The NCA now offers an extended bibliography elaborated by Martin D. Gleßgen which includes information about the author of the corresponding text, its date of composition, the date of the manuscript, place of composition, genre, number of words, punctuation and so forth.³ In a first approximation of the mood distribution in the doxastic domain we took into consideration different combinations of the relevant matrix subjects, tense features and mood selection in the context of ‘cuidier’. A comprehensive overview of distribution patterns of mood in Old French is given in table (1).

Table 1. ‘cuidier’ in Old French: Matrix subject and mood selection
Rows: Form; Total; FUT; COND; IND, -PAST; SUBPRES, -NEGATION; SUBIMP, -NEGATION; Total IND; Total SUB (values without a row label are given in their printed order)

Je/jou quit (cuit), 1 p. sg.: Total 131; FUT 12 (9.2%); COND 11 (8.4%); IND 29 (22.1%), -PAST 15 (11.5%); SUBPRES 55 (42%), -NEGATION 40 (30.5%); SUBIMP 24 (18.3%), -NEGATION 16 (12.2%); Total IND 52 (39.7%); Total SUB 79 (60.3%)
Cuide/quide, 3 p. sg.: Total 41; 1 (2.4%); 3 (7.3%); 2 (4.9%); 35 (85.4%); 5 (12.2%); 2 (4.9%); Total IND 4 (9.8%); Total SUB 37 (90.2%)
Cuident/Quident, 3 p. pl.: Total 22; 22 (100%); 3 (13.6%); Total IND 0 (0%); Total SUB 22 (100%)
Je/ge/ie cuidoie, 1 sg. past: Total 28; 28; 5 (17.9%); Total IND 0 (0%); Total SUB 28 (100%)
Cuidoi(en)t/Quidoient, 3 pl. past: Total 42; 1 (2.4%); 41 (97.6%); 5 (11.9%); Total IND 0 (0%); Total SUB 42 (100%)
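The percentages in Table 1 can be re-derived from the raw counts. The short check below uses only the first-person figures for ‘je/jou quit (cuit)’. It assumes that FUT and COND are counted together with IND under ‘Total IND’, a grouping inferred here from the fact that the three counts add up to the printed total; the variable and function names are illustrative.

```python
# Counts for 'je/jou quit (cuit)', 1 p. sg., copied from Table 1.
TOTAL = 131
counts = {"FUT": 12, "COND": 11, "IND": 29, "SUBPRES": 55, "SUBIMP": 24}


def share(n: int, total: int = TOTAL) -> float:
    """Percentage of `total`, rounded to one decimal as in the table."""
    return round(100 * n / total, 1)


# Assumed grouping: future and conditional forms count as indicative.
total_ind = counts["FUT"] + counts["COND"] + counts["IND"]
total_sub = counts["SUBPRES"] + counts["SUBIMP"]

print(total_ind, share(total_ind))  # prints: 52 39.7
print(total_sub, share(total_sub))  # prints: 79 60.3
```

Both results match the ‘Total IND’ and ‘Total SUB’ cells printed for this form.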

³ For further information see the homepage of Achim Stein on lingrom/stein/forschung/resource.html.



What can be inferred from table 1 is the fact that the matrix subject or, in general, the belief holder is paramount for mood selection in Old French. In contexts in which the believer coincides with the speaker, we encounter both moods in Old French, the subjunctive outweighing the indicative by approximately 60% to 40%. Hence, Old French displays considerable variation with regard to mood selection in affirmative clauses. Yet, the same does not hold for belief contexts where neither the speaker nor the interlocutor is the belief holder. In these—3rd person singular and plural—contexts, more than 90% of the respective subordinate propositions are marked by the subjunctive mood. Moreover, the subjunctive is overwhelmingly favoured in past contexts—irrespective of the nature of the matrix subject. This points to the fact that temporal distance between belief states—the actual state at utterance time and the past attitude alluded to—is considered a crucial factor with respect to the accessibility of a state of affairs. However, in order to obtain a more fine-grained picture of mood alternations we undertook a more detailed analysis, taking into consideration various contextual factors. The results of this will be discussed in the next section of this article.

3.2. Mood selection in Old French and the New Amsterdam Corpus: specific tendencies

In what follows, I would like to summarise some of the more fine-grained contextual factors which explain the mood selection in belief contexts, but which also illustrate pragmatic aspects related to the alternation of mood in contexts of ‘cuidier’. The following tendencies can be highlighted in order to take stock of mood choices as documented in the New Amsterdam Corpus: 1) First, adverbial modifications exert an important influence on mood choices. What our medieval texts reveal is that mood selection correlates in a significant way with stereotypical contexts. Interestingly, the form ‘je quit/cuit’ (I believe) is generally modified by adverbs which either strengthen the firm and straightforward commitment on the part of the speaker—the typical formula in these contexts is ‘je cuit/quit bien’ (something like ‘I believe well’, i.e. ‘I firmly believe that’)—or create a negative implicature suggesting that p is not the case. In the latter context we find the typical formula ‘a poinnes cuit que p’ (‘I hardly believe that p’). Take the following example:
(8) mes un pont passer li covient si foibel ainzqu’a la porte veingne qu’a poinnes cuit que le sosteingne (chret2, 2708)
(but he had to cross a bridge which was so shaky that before arriving at the gateway he did not really believe that it would hold/support him)
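Since the corpus is queried with regular expressions, the two adverbial formulae can be approximated as patterns. The sketch below uses Python's `re` module, not the syntax of the corpus query tool, and it only covers the spellings quoted in this section (the pronoun variants are taken from the headers of Table 1); real manuscripts show many more variants. The function name and labels are illustrative assumptions, and the second test string is a constructed fragment, not a corpus attestation.

```python
import re

# 'je cuit/quit bien': strengthened commitment of the speaker.
STRENGTHENING = re.compile(
    r"\b(?:je|jou|ge|ie)\s+(?:cuit|quit)\s+bien\b", re.IGNORECASE)
# 'a poinnes cuit que p': negative implicature (p is hardly believed).
WEAKENING = re.compile(
    r"\ba\s+poinnes\s+(?:cuit|quit)\s+que\b", re.IGNORECASE)


def classify_formula(line: str) -> str:
    """Label a line by the adverbial formula it contains, if any."""
    if WEAKENING.search(line):
        return "negative implicature"
    if STRENGTHENING.search(line):
        return "strengthened commitment"
    return "unclassified"


example_8 = ("mes un pont passer li covient si foibel ainzqu'a la porte "
             "veingne qu'a poinnes cuit que le sosteingne")
constructed = "je quit bien"  # made-up fragment, for illustration only

print(classify_formula(example_8))    # prints: negative implicature
print(classify_formula(constructed))  # prints: strengthened commitment
```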



In a belief context, the utterance of a minimal degree of commitment amounts to saying that the epistemic alternatives (or worlds) in which the proposition holds true are so remote that—under standard assumptions—the truth of the proposition p is de facto inconceivable. So for the most probable worlds (or epistemic alternatives) we can state that p is not the case. This description can be represented more formally as follows:⁴

$$\forall w' \,\big[\, \big(w' \in \mathrm{MB}_{dox}(w, t) \wedge w' \in \text{MOST PROB ALT}\big) \rightarrow p(w') = 0 \,\big]$$

2) Secondly, the belief predicate ‘cuidier’ occurs in a significant number of contexts in the scope of the negation operator when referring to the speaker. Additionally, the stereotypical formula ‘je ne cuit que’ combines regularly with negative polarity items like ‘jamais’ or ‘oncques’ (‘ever’) which lend a strong categorical flavour to the utterance, emphasising the contrary-to-fact status of the proposition with regard to all worlds, even the least probable. An example:
(9) Ne cuit onques mais veisse une feste mout miex aree (chauvency, 376)
(I do not believe that ever in my lifetime have I seen a more generous celebration)

The preference for domain-widening negative polarity expressions like ‘ever’ (‘james’, ‘oncques’), ‘anybody’ (‘nus hom’) or ‘anywhere (in the world/under the sky/in the Christian world)’ (‘al munt’, ‘soz ciel’, ‘en chrestiente’) squares neatly with the general counterfactual implicature conveyed by the belief propositions under the scope of negation. This configuration—the negation operator in the matrix clause (=main clause) combined with a negative polarity item in the subordinate clause—is characteristic of belief contexts in Old French texts: though the speaker utters, on the surface, a non-belief of the type “I do not believe that for any time or for anybody or at any place so-and-so is the case”, he actually implicates that he believes that the proposition p is absolutely not the case for all times or persons or places of reference. In other words, in his view the state of affairs described by the complement clause p is contrary to reality. This counterfactual status of the subordinate clause (and the propositional content it conveys) is all the more important, as it contributed to the strengthening of the contrary-to-fact character of the belief predicate ‘cuidier’ over the course of time and thus paved the way for the changes to be attested in the period of Middle French.

⁴ MBdox refers to a doxastic modal base (which contains the worlds accessible to the believer from an index of utterance (w, t)), MOST PROB ALT to the set of those worlds in the doxastic modal base which are the most probable alternatives to the reference world of the belief-holder.
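The formal condition given for the ‘a poinnes cuit que p’ reading can be tried out on a toy model. In the sketch below, worlds are plain labels, the doxastic modal base and the set of most probable alternatives are ordinary sets, and a proposition is a function from worlds to truth values. The three worlds and their values are invented for illustration; only the condition itself (p is false in every world that is both doxastically accessible and among the most probable alternatives) comes from the text.

```python
from typing import Callable, Set

World = str


def excluded_in_most_probable(modal_base: Set[World],
                              most_probable: Set[World],
                              p: Callable[[World], bool]) -> bool:
    """True iff p is false in every world that belongs both to the doxastic
    modal base and to the set of most probable alternatives."""
    return all(not p(w) for w in modal_base & most_probable)


# Toy model for example (8): does the bridge hold? (invented values)
bridge_holds = {"w1": False, "w2": False, "w3": True}
modal_base = {"w1", "w2", "w3"}   # worlds accessible to the believer
most_probable = {"w1", "w2"}      # w3 is a remote alternative

result = excluded_in_most_probable(
    modal_base, most_probable, lambda w: bridge_holds[w])
print(result)  # prints: True
```

The proposition is true in the remote world w3, so it is not strictly impossible for the believer; it is merely excluded from the alternatives he takes seriously, which is the point of the formula.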



3) Thirdly, there is another remarkable fact which should be mentioned in our analysis of ‘cuidier’: Rarely do we encounter a second person (one or several addressees) as a true belief-holder in free contexts of Old French texts. Rather, the second person singular or plural is confined to very stereotypical and even ritualised contexts of use. If we take a closer look, we can distinguish the following pragmatic conventions: a) The most conspicuous pattern of use is related to rhetorical questions: The speaker pretends to put a serious question about a propositional content which—given its contrary-to-common-sense status—could never become the common ground of knowledge for the interlocutors. Take the following example (10) Cuides-tu que hons qui ait cent ans en sa vieillesse ait enfant ? (malk, 182) (Do you really believe that a man who is one hundred years old might actually have a child at that age?)

By converting an irrelevant or even absurd hypothesis into a question of belief, the speaker makes it plain that he considers that nobody would take the trouble to defend the proposition p. In other words, by begging the question p, the speaker appeals to a common conviction concerning the contrary-to-fact status of the proposition in question. Once again, a typical subjunctive context in the domain of beliefs is associated with a counterfactual implicature: the speaker poses a belief question and his mood choice reveals that he presupposes the contrary-to-fact status of the relevant content of belief. b) Another construction should be mentioned in passing: the pattern ‘ne cuidiez pas que p’ , which is also used with some frequency, turns out to be a way of reinforcing or rather hammering home a belief to the contrary: For instance, the request (11) ‘ne quidez pas que je vos mente’ (best, 2848) (Don’t believe that I tell you lies!)

aims at reassuring the addressee that the speaker is telling the plain truth and therefore turns out to be a kind of sincerity topos. c) A third and quite frequent conventionalised pattern of use combines the affirmative imperative with the 2nd person plural of ‘cuidier’ (‘cuidez que p!’). The interlocutor is reminded of a fact which should be part of the common ground but, as the subjunctive indicates, is not. So the speaker exhorts the addressee to update his epistemic model in order to enable a felicitous communication.
(12) quidez que ci seie venuz senz la volentet vostre seignur (reis, 7208)
(Please believe that I have come here without the permission of your master)

Up to now, we have considered the conventionalised uses of ‘cuidier’. We must now turn to the basic question of how we can account for the mood selection and above all the mood alternations attested in the New Amsterdam Corpus.

3.3. The basic principles of mood selection in the doxastic domain

As the Corpus data indicate, we have to separate first-person contexts (which include the speaker) and third-person contexts. In the same vein, temporal distance between belief states at different evaluation times plays a crucial role for mood alternation. This raises the question: what are the basic principles which can account for mood selection in doxastic contexts? The first observation which must be made is that the data of the Amsterdam Corpus do not confirm the hypothesis according to which mood alternation in speaker-centred belief contexts is due to degrees of probability that p might be the case. It is not the scale of probability that is at the heart of mood alternations, but the question of whether the propositional content is more or less accessible to the speaker. In other words, what counts in the first place is the nature of the acquaintance relation that the speaker entertains with the state of affairs in question (the ‘res’). This crucial aspect is related to a second aspect, namely the question of what kind of evidence the speaker has at his disposal in order to access the propositional content and, whenever required, to argue for it in a determined conversational context. Let us take three examples which illustrate our analysis of the Corpus data:
(13) Entendez moi biaus sire chiers, or cuit ie que cist chevaliers est morz qu’il n’ot mais ne entent. (perceval 6847)
(Do you understand me, my precious beautiful master? I believe that this knight is dead given that he neither listens nor understands (what I am telling him).)

In this example, the speaker can produce first-hand evidence for his belief that the knight must be dead given that he enters into visual contact with him and becomes aware of the fact that the knight is no longer capable of exercising basic perceptual functions (like listening and understanding). (14) Ge cuit que deus li fait des ses pechiez pardon, mais ne doit pas atendre nuz hom si longement (moral 842-843) (I believe that God forgives him his sins, but nobody should wait too long)



In the second example the speaker voices a common belief of his time which is based on the general Christian doxa ‘the sinner is relieved of his sins provided that he repents at an appropriate moment of his life.’ This is a typical case of what we can classify as collective knowledge or presupposed cultural common ground knowledge. Another interesting quotation is taken from ‘The Death of King Arthur’:
(15) (...) si dist mes sire gauvains a la reine : “dame, nos ne savons pas tres bien qui cil fu qui veinqui le tornoiement, car nos cuidons que ce fust uns chevaliers estranges.” (mortartu, 473)
(and my master Gauvain told the queen: Madame, we do not know very well who was the person who won the tournament since we believe that he was (SUBJ) a knight from abroad)

Though aware of the identity of the mysterious knight (it is Lancelot), Gauvain, the speaker, pretends to believe that the person in question was a stranger. However, his words, or rather his mood choice, are revealing as to his real attitude: Gauvain downplays the value of his acquaintance relation with the content of the belief: he and his friends put forward a hypothesis to which they do not adhere too fiercely given their background knowledge. To get to the heart of the matter: Gauvain and his friends are, of course, liars, but they are not shameless liars given that they never pretended to have good evidence for the content of their lie. Turning to the third person contexts, we should stress the strong bias for the subjunctive in Old French, which in no way implies the falsehood of the subordinate proposition. Rather, the subjunctive indicates that the speaker has no access to the belief worlds of the matrix subject. In other words, the speaker emphasises that the matrix subject stands in a special acquaintance relation with the belief content, which is mediated by a personal doxastic source. There is no need to substantiate this personal doxastic source in a more specific way—it suffices that the speaker focuses on its subjective character. David Lewis (1979) has characterised these particular worlds as ‘centred worlds’—belief worlds anchored to an individual. The key to understanding mood alternations in the context of ‘cuidier’ lies in the aforementioned dimension of subjectivity/intersubjectivity, which highlights the problem of the specific acquaintance relation that the belief holder entertains with the propositional content (or the ‘res’).
To quote another example taken from the Amsterdam Corpus: (16) Et Hestors, qui cuide qui li rois ait dite ceste parole par mal de Boort, saut avant toz courrouciez (La Mort le roi Artu, 3240) (And Hector, who believes that King Arthur has (SUBJ) spoken these words in order to offend Boort, stands up and is upset more than anything)



Undoubtedly, the speaker zooms in on Hector’s strong conviction about the king’s evaluation of Boort’s loyalty. Therefore, the focus of the utterance cannot lie on the truth value of Hector’s belief (which may be right or wrong); it rather rests on its entirely subjective nature: Hector’s belief is based on his personal doxa—the way he perceives and conceives of the world. The basic criterion—the accessibility or non-accessibility of the proposition content—also applies to the other doxastic predicates in Old French, as for example, ’croire’ (‘believe’), ‘penser’ (‘to be of the opinion that’), ‘(mi) est vis/avis que’’ (‘I am of the opinion that’). An example is given below which illustrates the parallel mood selection patterns of ‘croire’ in a remarkable context: (17) Li rois artus qui entent ceste parole ne puet pas cuidier que ce soit voirs, einz croit veraiement que ce soit mensonge, si respont: (...).“ (La mort le roi Artu, 35) (King Arthur, who heard these words, cannot believe that this is true, instead he has the firm conviction that this is a lie and answers: (...).“)

The speaker presents King Arthur’s state of belief evoking the two possible doxastic alternatives, the truth of the proposition and its converse. By falling back on the subjunctive in both cases, the speaker marks the doxastic status of the propositions: the propositions mirror subjective convictions of the matrix subject and denote possible worlds centred around him. The speaker completely excludes himself as a belief holder: he commits himself neither to the truth of the proposition nor to its falsehood. A brief look at other Romance languages shows that they differ with respect to whether they systematically mark ‘centred worlds’ irrespective of whether these are also accessible to persons other than the belief holder. A case in point is Standard Italian, which categorically signals the anchoring of belief propositions through selection of the subjunctive mood. The reverse holds for Rumanian, which abides by the principle of ‘relativised veridicality’ (Giannakidou 1998) and therefore marks all belief propositions with the indicative as long as the belief holder takes them to be true (hence ‘relativised veridicality’).
(18) Gianni crede che Luigi sia venuto stamattina. (Gianni believes that Luigi came this morning)
(19) El crede că Ion a venit ieri, dar nu este aşa. (He believes that Ion came yesterday, but this is not true)

As we see, Old French, with its emphasis laid on the nature of the acquaintance relation with the propositional content, represents an intermediate stage with respect to other Romance languages which either focus on the ‘centred’ character of belief states (as is the case of Standard Italian) or emphasise the relative truth status of beliefs (e.g. in Rumanian). It would undoubtedly be worthwhile to complete this picture in a further study.

3.4. The development of ‘cuidier’ in Middle and 16th century French

The Middle French Subcorpus of Frantext reveals some remarkable changes with respect to mood selection in belief contexts. Taking into account the frequency and the distribution of mood in accordance with parameters such as ‘person’ (speaker, neither speaker nor interlocutor involved as ‘belief holder’), ‘number’ (singular/plural) and ‘tense’ we obtain the following overall results:

Table 2. Mood selection of ‘cuidier’ in Middle French (14th and 15th century)


Form                Person/tense   N    SUBIMP       SUBPRES      Total IND    Total SUB
je cuide            1. p. sg.      30   1 (3.3%)     9 (30%)      20 (66.7%)   10 (33.3%)
il/elle cuide       3. p. sg.      44   1 (2.3%)     41 (93.2%)   2 (4.5%)     42 (95.5%)
ils/elles cuident   3. p. pl.      49   2 (4.1%)     35 (71.4%)   12 (24.5%)   37 (75.5%)
je cuidoie          1. sg. past    8    8 (100%)     —            0 (0%)       8 (100%)
il/elle cuidoit     3. sg. past    44   38 (86.4%)   —            6 (13.6%)    38 (86.4%)

Sub-row labels, in printed order: +NEG; of which NEG; or implied NEG; restriction; IND; of which PAST; or implied NEG; FUT. Figures printed in these sub-rows, per column and in printed order:
je cuide: 2 (6.7%), 3 (10%), IND 15 (50%), 2 (6.7%), 2 (6.7%), 3 (10%), 2 (6.7%)
il/elle cuide: 1 (2.3%), 7 (16%), 1 (2.3%), 1 (2.3%)
ils/elles cuident: 2 (4.1%), 1 (2%), 4 (8.2%), 9 (18.4%), 2 (4.1%), 1 (2%)
je cuidoie: 3 (38%)
il/elle cuidoit: 8 (18.2%), 6 (13.6%), 1 (2.3%)
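As a quick check on the totals of Table 2, the share of each mood per context can be recomputed from the Total IND and Total SUB figures. The following Python sketch is only an illustration: the dictionary layout and the function names (`mood_shares`, `report`) are assumptions made here and are not part of the original study.

```python
# Totals per context as printed in Table 2: (Total IND, Total SUB).
TABLE2_TOTALS = {
    "je cuide": (20, 10),
    "il/elle cuide": (2, 42),
    "ils/elles cuident": (12, 37),
    "je cuidoie": (0, 8),
    "il/elle cuidoit": (6, 38),
}


def mood_shares(ind, sub):
    """Return the (indicative, subjunctive) shares of all occurrences."""
    total = ind + sub
    return ind / total, sub / total


def report(table):
    """Print one line per context with both shares as percentages."""
    for form, (ind, sub) in table.items():
        ind_share, sub_share = mood_shares(ind, sub)
        print(f"{form:18s} IND {ind_share:6.1%}  SUB {sub_share:6.1%}")


report(TABLE2_TOTALS)
```

For ‘je cuide’ the totals are 20 indicatives against 10 subjunctives, which is the 2:1 proportion referred to in the discussion of the table.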

When we take a closer look at the results, we can ascertain that the ratio between the indicative and the subjunctive has been reversed in belief contexts involving the speaker as belief holder: it is now the indicative which clearly outweighs the subjunctive (2:1), the latter losing much of its relevance in the corresponding contexts. On the other hand, the subjunctive continues to prevail in those contexts in which the belief holder is neither the speaker nor
the interlocutor. However, comparing the ratio of the two moods in Old French, it becomes apparent that the indicative gained at least some ground, as can be seen from the figures in the column referring to ‘ils/elles cuident’. Additionally, what is striking is that even in past tense contexts the indicative is able to make some incursions, particularly in those contexts where the speaker is not the belief holder. Surprisingly, the subjunctive remains completely dominant in the first person, a fact which hints at a particular convention of use: evidently, the speakers choose to fall back on the past subjunctive whenever they wish to deny their responsibility for a past conviction, emphasising the contrast between an erroneous attitude due to a lack of knowledge and an updated stance at utterance time based on new available facts. This mood selection pattern seems to be a systematic option of mood specialisation as can be seen from parallel examples in contemporary Portuguese.5 Yet, the question arises of how we can account for the general shift from the subjunctive to the indicative (though the main differences between the contexts of use remain, basically, the same). We have to pass, therefore, to a qualitative analysis of the data displayed by our Middle French Corpus. The overwhelming majority of examples point to a change of the underlying principle of mood selection. In particular, the subjunctive is no longer triggered in contexts of ‘subjectivity’ in which a special acquaintance relation between the believer and the ‘res’ holds. The new salient principle responsible for mood selection turns out to be the notion of ‘counterfactuality’. This development is not surprising given the general tendencies we have already observed in the extensive Old French data provided by the New Amsterdam Corpus. 
5 See for example Wherritt 1977: 64: ‘Ah, eu não sabia, pensei que fosse ele’ (‘Oh, I didn’t know, I thought it was him’). This specialised pattern of mood selection does not hold for contemporary Standard Spanish and Catalan.

As we saw, ‘cuidier’ combined with the subjunctive occurred predominantly in specialised and highly conventionalised contexts which implied the contrary-to-fact status of the proposition in question. In our corpus analysis we were able to track down a significant number of subjunctives in contexts of negation, rhetorical questions, negated imperatives and polemic belief ascriptions. In all of these contexts it was implicated by the speaker that the state of affairs described in the subordinate clause did not hold in the real world. This implicature of counterfactuality seems to have been generalised to most clauses marked by the subjunctive mood. For this reason, the subjunctive was no longer selected in order to characterise the special nature of a given belief (its ‘subject-centredness’), but was reinterpreted as a marker of counterfactuality with respect to the truth-conditional status of the complement clause. This general tendency is confirmed by the still important presence of hearer-oriented conventionalised uses in Middle French. In all of these uses, the subjunctive was selected exclusively, given that they imply the falsehood of the underlying propositions. The following table gives an idea of the frequency of the conventionalised patterns of use. It should be noted that there are always some cases of homonymy which do not pose any problems of interpretation given the unambiguous character of the relevant contexts:

Table 3. Hearer-oriented conventionalised uses

Total SUBPRES SUBIMP Indicative COND Homonymy

IMP: Q: ne cuidez pas que p ! Cuidez vous que p ? 15 6 9 4



Q: cuides tu que p ? 24 10 12 1 1

The ongoing change in the ‘logic of mood’ can be grasped in the following, very conspicuous, example which has been taken from Oresme’s translation of Aristotle’s Ethics: (20) ‘Et par imprudence, il cuide que les choses qui sont bonnes que il soient mauvaises (...).’ (Oresme, Le livre de ethiques d’Aristote, 1370, p. 369) (And devoid of wisdom, he believes that the things which are good are (SUBJ) actually bad)

Evidently, the speaker opposes right and wrong beliefs concerning the moral substance of things and marks the false assumptions by falling back on the subjunctive mood. On the other hand, we encounter many examples with a complementary mood selection. For instance, when commenting on two contrary beliefs or convictions in a controversy, Oresme, contrary to what was usual in Old French, no longer signals the particular character of the propositional attitudes by means of the subjunctive mood. Rather, the pros and cons are both put in the indicative given that they do not concern (or rather contradict) the speaker’s knowledge about the truth status of the propositional content. The following example corroborates this new trend of mood selection in the domain of belief predicates:


(21) les uns par aventure sont enformés et cuident certainnement que delectacion est tres mauvaise chose, et les autres cuident que non est et que c’est bonne chose; (Nicole Oresme, le livre de ethiques d’Aristote, 1370, livre IX, pages 485-486) (some people happen to be informed and they surely believe that pleasure is a very bad thing, and others think that it is not true and that it is a good thing)

However, the change of mood selection principles was not the only development underway after the Old French period. Another, more profound transformation would erode the original system in the doxastic domain: during the 15th and especially in the 16th century the prototypical belief predicate ‘cuidier’ first lost its prevalence in the doxastic domain and after a swift decline was completely superseded by competing ‘croire’ by the end of the second decade of the 17th century. A comparison between the frequencies of use from Middle French to the 16th century gives a vivid impression of the decreasing relevance of ‘cuidier’ and the final reversal of the statistical weight: while in the 14th and 15th century the impact of ‘cuidier’ is 82%, with 18% for ‘croire’, the proportion is reversed during the 16th century to 46% : 54%. The last attested use of ‘cuidier’ can be found in the writings of Agrippa d’Aubigné, an author known for the archaic flavour of his tragedies. For the sake of illustration I will quote the last piece of evidence which dates back to the year 1623: (22) Deux filles, qui cuidoyent que le noeu de la race/Au sein de leurs parents trouveroit quelque place (Agrippa d’Aubigné, Les Tragiques 3/1623, livre IV, p. 70, livre) (Two girls who believed that the node of the race would find some place at their parent‘s breast)

3.5. Some remarks on the competing verb ‘croire’

When analysing basic aspects of ‘croire’ (its frequency, occurrences and typical mood selection patterns), we can grasp some interesting differences between ‘croire’ and ‘cuidier’ in Old French. First of all, the belief predicate ‘croire’ is used to a much lesser extent in complement clauses than ‘cuidier’, the ratio being approximately 1:6. Another particular property consists of the affinity of ‘croire’ with present tense contexts, which matches its all but marginal presence in past occurrences. In addition, the verb appears far more frequently with the indicative (present or future) on condition that the belief holder turns out to be the speaker (52% subjunctive vs. 48% indicative). Finally, we do not find any occurrence of ‘croire’ in those conventionalised 2nd person patterns which were characteristic of ‘cuidier’ and were linked to an implicature of counterfactuality. One explanation for these peculiarities of ‘croire’ in Old French may be found in the tradition surrounding the predicate, which is deeply rooted in the Vulgate and its universe of belief. Against the



backdrop of this universe of belief, the speakers rely on ‘credere’/ ‘croire’ in order to emphasise their full commitment to the fundamental beliefs of the Christian doctrine (‘the Christian doxa’). In the relevant contexts ‘croire’ must be understood affirmatively as pointing to a set of beliefs accessible to all members of a community whose shared convictions constitute an uncontroversial ‘common ground’. Therefore, though belonging to the same lexical domain (of doxastic verbs), the verb ‘croire’ presupposes in Old French a more specific frame based on the background of the biblical text (the Vulgate) and related discourse traditions as well as on a common body of shared religious and cultural beliefs.6 For the purposes of illustration we quote two examples taken from characteristic religious text genres of the Middle Ages. The first one relates to hagiography (‘La vie de Saint Eustache’), the second one to the so-called ‘miracles’, the mariological ‘Les miracles de Notre Dame de Soissons’: (23) Croi que tu es le fix de diex et par ta mort sommes sauvé (eustache, 590) (I believe that you are the son of God and that we are saved by your death) (24) (...) bien croi que li sainz esperites en vos sainz flanz le roi concut qui mort en crois por nous recut (mir, 1164-1666) (I firmly believe that the holy spirit conceived the king in your holy womb who accepted his death for us at the cross)

However, the subjunctive contexts, though less frequent, are equivalent to those we found for ‘cuidier’. So we encounter ‘croire’ combined with the subjunctive in counterfactual contexts which are either created by simple negation or imply neg-raising, the latter being typically reinforced by negative polarity items as in example (25): (25) ainsi fait orguellors felon qui par menace et par felon cuit que nus l’osast contretenir (fablesL, 1460-1462) (in this way acts an arrogant traitor who believes that by resorting to threats and treason nobody would dare to stop him)

6 For the concept of ‘frame’ see Fillmore 1985: 223-233 as well as Croft & Cruse 2004.

By the same token, the very rare occurrences of affirmative subjunctive examples comply with the same basic principles of mood selection that we have already elucidated in analogous cases of ‘cuidier’. I do not want to repeat the detailed characterisation of possible acquaintance relations here. The basic idea is the element of ‘subjectivity’ which comes to bear in this type of context. For convenience, example (17) will be repeated as (26) in order to illustrate the perfect overlap between the semantic range of ‘croire’ and ‘cuidier’ in this particular class of subjunctive contexts: (26) Li rois artus qui entent ceste parole ne puet pas cuidier que ce soit voirs, einz croit veraiement que ce soit mensonge, si respont: (...).“ (La mort le roi Artu, 35) (King Arthur, who heard these words, cannot believe that this is true, instead he has the firm conviction that this is a lie and answers: (...).“)

A more fine-grained study could illustrate the process of how ‘croire’ encroaches upon the specific domains reserved for ‘cuidier’. This is, of course, a task beyond the scope of this study given limitations of space. However, we can at least show that ‘croire’, contrary to what had been the situation in Old French, succeeded in taking over the typical conventionalised pragmatic functions of ‘cuidier’ for the 2nd person from the second half of the 14th century onwards. In the Middle French subcorpus of Frantext many examples of the relevant formulae come to light, as for example: ‘crois-tu que p (+subjunctive)?’, ‘ne croyez pas que p (+subjunctive)!’ and finally ‘croyez que p!’. It is, nevertheless, remarkable that the last of these, in contrast to ‘cuidez que p!’, is principally associated with the indicative (and not with the subjunctive).7 The diachronic data show that ‘croire’ was able to oust ‘cuidier’ by progressively encroaching upon the contexts which were dominated for centuries by its rival. On the other hand, ‘croire’ showed more affinity with the indicative given its more ‘affirmative’ character linked to its ‘history of use’. This ‘history of use’ was related to specialised religious contexts and the corresponding textual traditions which were centred around the foundational text of the Vulgate. This background of ‘croire’ may have been the reason for a general drift of the belief predicates towards the indicative domain. At any rate, there is no doubt that there was considerable hesitation regarding the principles of mood selection in the 16th and 17th century.8 Finally, we must admit that with all the results and observations concerning the evolution of mood selection and verbal preferences in the doxastic domain, we are not able to account for the deeper reasons which might explain the developments that we have highlighted.
More precisely, we are for the time being not in a position to pin down what motivated speakers to switch systematically from one verb option (‘cuidier’) to the other (‘croire’). Here we become aware of the inherent limitations of diachronic corpora which may mirror the symptoms but do not reveal the innermost causes of linguistic change.

7 In the Middle French subcorpus we found 51 examples for ‘croyez que p!’, 7 examples for ‘ne croyez pas que p!’ and only 2 examples for ‘crois-tu/croyez-vous que p’.
8 See Brunot 1966, vol. 3: 565 ss. and vol. 4: 1000 ss.

4. Conclusion

In summary, we can say that the decisive factor of mood selection in Old French belief contexts was neither the degree of probability the speaker ascribed to a certain state of affairs (the dimension of epistemicity in a narrow sense) nor the indication of a certain source of evidence by the speaker (the dimension of evidentiality). Rather, the crucial factor turned out to be the dimension of subjectivity/intersubjectivity and, related to this, the character of the acquaintance relation a (matrix) subject entertained with a given state of affairs (the ‘res’). The quality of evidence undoubtedly played an important role in determining whether a certain belief was accessed by the matrix subject alone or whether it was shared by a relevant community including the speaker. Corpus data not only revealed the general mechanisms of mood selection in Old French belief contexts, but also helped to elucidate some factors which were responsible for the reorganisation of mood selection in the doxastic domain and finally its conventionalisation. Since most of the prototypical subjunctive belief contexts involved a counterfactual implicature, speakers generalised this implicature to all subjunctive contexts, reinterpreting the subjunctive as a marker of contrary-to-fact status of the subordinate clause. The competing predicate ‘croire’, finally, ousted its rival by giving up its specific religious sense and spreading to all contexts covered by ‘cuidier’. As it seems, ‘croire’ also imposed its stronger affinity with the indicative inherited from its more ‘affirmative’ history of use in the context of religious traditions of discourse.
Influential linguistic authorities in the second half of the 17th century, the famous ‘siècle classique’, especially Thomas Corneille and Andry de Bois-Regard, recommended the selection of the indicative for all affirmative belief sentences and foreshadowed with their normative remarks on appropriate mood selection the conventions for the proper use of belief predicates in Modern French.9 Our case study has thus illustrated how a theory-based analytical framework (as is the case of modal semantics) can be combined with the solid empirical basis of historical corpora in order to provide us with a deeper insight into the principles of how languages work and how they change over time. Yet, corpus linguistics is not a panacea given its inherent limitations: what it is not able to do is decipher the innermost motivations which drive speakers to give up one viable option of communication and to strengthen an alternative option at a certain period of time.

9 Brunot 1966, vol. 4: 1000s.



References

Aristoteles. 1999. De interpretatione, introduzione, testo greco, traduzione, commento a cura di Attilio Zadro. Napoli: Loffredo.
Brunot, F. 1966. Histoire de la langue française des origines à nos jours. Vol. 4. Paris: Colin.
Cornillie, B. 2007. Evidentiality and Epistemic Modality in Spanish (Semi-)Auxiliaries. A Cognitive-Functional Approach. Berlin/New York: Mouton de Gruyter.
Croft, W. and D.A. Cruse. 2004. Cognitive Linguistics. Cambridge: Cambridge University Press.
Dees, A. 1987. Atlas des formes linguistiques des textes littéraires de l’ancien français. Tübingen: Niemeyer.
De Haan, F. 1999. “Evidentiality and Epistemic Modality: Setting Boundaries”. Southwest Journal of Linguistics 18. 83-101.
Dendale, P. 1994. “Devoir épistémique, valeur modale ou évidentielle?”. Langue française 102. 24-40.
Fillmore, C. 1985. “Frames and the Semantics of Understanding”. Quaderni di semantica 6. 222-254.
Frantext. (
Giannakidou, A. 1998. Polarity sensitivity as (non)veridical dependency. Amsterdam/Philadelphia: John Benjamins.
Kamp, H. 2003. “Einstellungszustände und Einstellungszuschreibungen in der Diskursrepräsentationstheorie”. Intentionalität zwischen Subjektivität und Weltbezug, Haas-Spohn, U. (ed). Paderborn: Mentis. 209-289.
Kaplan, D. 1969. “Quantifying In”. Words and Objections, Davidson, D. and J. Hintikka (eds). Dordrecht: Reidel. 206-242.
Lewis, D.K. 1979. “Attitudes De Dicto and De Se”. Philosophical Review 88. 513-543.
Leiss, E. 2009. “Drei Spielarten der Epistemizität, drei Spielarten der Evidentialität und drei Spielarten des Wissens”. Modalität: Epistemik und Evidentialität bei Modalverb, Adverb, Modalpartikel und Modus (Studien zur deutschen Grammatik 77), Abraham, W. and E. Leiss (eds). Tübingen: Stauffenburg. 3-24.
Le Querler, N. 1996. Typologie des modalités. Caen: Presses Universitaires de Caen.
McCawley, J.D. 1981. Everything that linguists have always wanted to know about logic but were ashamed to ask. Chicago: University of Chicago Press.
NCA: Nouveau Corpus d’Amsterdam. Corpus informatique de textes littéraires d’ancien français (ca 1150-1350), établi par Anthonij Dees (Amsterdam 1987), remanié par Achim Stein, Pierre Kunstmann et Martin-D. Gleßgen. Universität Stuttgart, Institut für Linguistik/Romanistik, 2007.
Nuyts, J. 2001. “Subjectivity as an evidential dimension in epistemic modal expressions”. Journal of Pragmatics 33. 383-400.
Nuyts, J. 2006. “Modality: Overview and linguistic issues”. The expression of Modality, Frawley, W. et al. (eds). Berlin/New York: de Gruyter. 1-27.
Palmer, F.R. 2001. Mood and modality. Cambridge: Cambridge University Press.
Pietrandrea, P. 2005. Epistemic Modality. Amsterdam/Philadelphia: John Benjamins.
Plungian, V.A. 2001. “The place of evidentiality within the grammatical space”. Journal of Pragmatics 33. 349-357.
Portner, P. 2009. Modality. Oxford University Press.
Quer, J. 1998. Mood at the interface. The Hague: Holland Academic Graphics.
Wherritt, I.M. 1977. The Subjunctive in Brazilian Portuguese. Ph.D. Dissertation, University of New Mexico.
Wright, G.H. von. 1951. An essay in modal logic. Amsterdam: North-Holland.

French Liaison in the 18th Century —Analysis of Gile Vaudelin’s Texts— Yuji KAWAGUCHI

1. Introduction

In The Sounds of French: An Introduction, Bernard Tranel presented a brief history of French liaison:

“In Old French final consonants were pronounced, but from the twelfth to the sixteenth centuries, they progressively disappeared, first in preconsonantal position and then at the pause, leaving them to appear only in prevocalic position. Later, other restrictions came to reduce even more the contexts in which these consonants could appear, so much so that liaison today occurs far less than it used to (but probably more than tomorrow).” (Tranel 1987: 169)

Following Tranel’s explanation, three important stages will be distinguished in the evolution of French liaison. The first stage is the one in which the final consonant was dropped before the initial consonant of the following word. The second is the period during which the final consonant was lost in prepausal position. The last, still ongoing stage is characterized by the progressive restriction of phonetic contexts for liaison. In brief, from a developmental viewpoint, French liaison was established in the 17th century after the two previous stages from the 12th to the 16th centuries.1 French liaison is regarded as a type of sandhi, but in fact, it is composed of three different phonological processes: 1. Deletion, 2. Liaison, and 3. Insertion; see Figure 1.2

1. Deletion: Ex. six [sis] becomes [si] before consonants.
2. Liaison: A final consonant that is silent in prepausal position is pronounced in prevocalic position. This type is further divided into four subcategories, as shown in Figure 1: 2.1.1. Non-Inflectional, 2.1.2. Inflectional, 2.2.1. Non-Morphemic, and 2.2.2. Morphemic. Under free forms, non-inflectional and inflectional types are to be differentiated. For instance, deux [dø(z)] is a non-inflectional type, and the consonant -t in petit(e) [pti(t)] is an inflectional type. Under bound forms, two different categories can be observed. Liaisons in the pronoun en and possessive adjectives such as mon are considered non-morphemic, whilst those of the definite article les and the third person singular form est are considered morphemic because the liaison consonants -z and -t mark the plural morpheme and the third person, respectively.
3. Insertion: Ex. the inversion of il a becomes a-t-il with -t- inserted.

I would like to thank Yves-Charles Morin and Anthony Lodge for their important comments.
1 Latency of final consonants in the earliest Old French and its impact on subsequent liaison phenomena are out of the scope of the present analysis; see Chasle (2008).
2 Our classification is based on Morin (1986).

Figure 1. Sandhi in French

In this article, we will analyze French liaison and related phenomena in two texts written at the beginning of the 18th century and evaluate the situation of French linking phenomena in the 18th century.

2. Gile Vaudelin

Two interesting texts by Gile Vaudelin have been handed down from the 18th century. Nouvelle manière d’écrire comme on parle en France (hereafter NM) was published in Paris in 1713. Instructions crétiennes, mises en ortografe naturelle, pour faciliter au peuple la lecture de la sience du salut (hereafter IC) was published in Paris in 1715. Here, we will refer to the reproductions of Slatkine Reprints. For Vaudelin’s personality, we have some introductory remarks in Cohen (1946). Vaudelin’s texts are valuable sources for the analysis of phonology in the 17th and 18th centuries. From Cohen’s dissertation in 1946 to recent articles, many manuals and articles have cited his books. However, to our knowledge, no exhaustive analysis of liaison has been attempted for Vaudelin’s texts.



As the title of his first book shows, the objective of Vaudelin’s texts is to find “a new method of writing as people speak in France (Nouvelle manière d’écrire comme on parle en France).” He thus designed the following 29 letters, composed of 13 vowels and 16 consonants; see Figure 2.

Figure 2. New alphabet of Vaudelin (1713), NM. p. 4

In order to investigate in detail how Vaudelin used his new alphabet to depict the pronunciation of French around the first quarter of the 18th century, we digitalized3 his two texts as shown in Figure 3 and processed them through the concordancer AntConc 3.2.1w to obtain quantitative information on several grammatical categories involved in linking phenomena.
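The kind of query run through a concordancer can be illustrated with a minimal keyword-in-context (KWIC) search. The Python sketch below is not AntConc's implementation and makes no claim about how the study's queries were configured; the function name `kwic`, the context width and the ASCII apostrophes are assumptions made for illustration, and the three sample lines are the transcriptions of ont quoted in this chapter.

```python
import re


def kwic(lines, keyword, width=20):
    """Return (left, match, right) contexts for each whole-word hit."""
    # Lookarounds keep the keyword from matching inside a longer word.
    pattern = re.compile(rf"(?<!\w){re.escape(keyword)}(?!\w)")
    hits = []
    for line in lines:
        for m in pattern.finditer(line):
            left = line[max(0, m.start() - width):m.start()]
            right = line[m.end():m.end() + width]
            hits.append((left, m.group(), right))
    return hits


# Sample lines in Vaudelin's spelling, as quoted in this chapter.
sample = [
    "Pars c'ail ont eun mâim divinite",
    "S'âit an se c'il ont eun sosiete e",
    "c'il ont ojourd'ui",
]

for left, match, right in kwic(sample, "ont"):
    print(f"{left:>20} [{match}] {right}")
```

Counting the hits whose right-hand context begins with a vowel letter is then one simple way to tally potential liaison contexts for a given form.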

Figure 3. Original and digitalized texts; see NM. p. 17

3 The passages written in Latin or other languages are excluded from our digitalization. More exactly, they are the following pages: Approbation, Privilege du roy and Errata in NM, and Exhortation au peuple, Approbation, Privilege du roy, Preface, Abregé de l’orthgrafe naturelle, Alfabai, and Latin and other languages in 24-25, 27-28, 29-30, 40-41, 42-48, 51, 58, 62, 64-65, 73, 76, 77-79, 80, 85, 142, 143-172 and Priair Latin from 173-242 in IC.



3. Analysis of NM and IC

3.1. Verbs

3.1.1. Avoir

Practically, only 1 example of liaison consonant -z is found for the second person plural avez: e de grâs ce vouz i avez atahê (IC, p. 49) = et de grâce que vous y avez attachée. In our digitalization, we tried as far as possible to transliterate his new alphabet into ASCII-based ordinary fonts. There are 6 examples of ave without -z, IC, p. 57, 61, 63, 65, 66, 73. For the third person plural form, all 11 occurrences, NM, p. 17, 19, 23, 23 and IC, p. 37, 48, 54, 83, 91, 131, 139, are noted as ont with liaison consonant -t; see the 3 examples and Table 1.

Table 1. Liaison of the verb avoir

              Liaison               Non-liaison
2 pers. pl.   -z: avez, 1 occ.      ave, 6 occ.
3 pers. pl.   -t: ont, 11 occ.      on, not attested

Pars c’ail ont eun mâim divinite IC, p. 83. “Parce qu’elles ont une même divinité” S’âit an se c’il ont eun sosiete e IC, p. 91. “C’est en ce qu’ils ont une société et” c’il ont ojourd’ui NM, p. 23 “qu’ils ont aujourd’hui”
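The contrast between the two forms in Table 1 can be expressed as a liaison rate, i.e. the number of liaison tokens divided by all tokens of the form. The Python sketch below simply restates the counts given in Table 1; the dictionary keys and the function name `liaison_rate` are illustrative assumptions.

```python
# Counts from Table 1 (liaison vs. non-liaison tokens of avoir).
TABLE1 = {
    "avez": {"liaison": 1, "non_liaison": 6},   # 2 pers. pl., consonant -z
    "ont": {"liaison": 11, "non_liaison": 0},   # 3 pers. pl., consonant -t
}


def liaison_rate(counts):
    """Share of tokens realised with liaison; None if there are no tokens."""
    total = counts["liaison"] + counts["non_liaison"]
    return counts["liaison"] / total if total else None


for form, counts in TABLE1.items():
    print(f"{form}: {liaison_rate(counts):.1%}")
```

With such small counts (7 tokens of avez in total), the rates are best read as descriptive summaries rather than as estimates of a general tendency.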

It is not easy to speculate why liaison occurs more frequently in the third person than in the second person forms. Nevertheless, as we will observe the same tendency in the following lines, the third person marker -t supposedly had a certain linguistic significance in French linking phenomena. In fact, liaison consonant -t is always realized in the interrogatives with third person pronoun inversion; see Table 2.

Table 2. Liaison of avoir with the third person pronoun inversion

3 pers. sing.                 3 pers. pl.
at-il “a-t-il” 35 occ.        ont-i “ont-ils” 2 occ.
at-aill “a-t-elle” 3 occ.     ont-aill “ont-elles” not attested
at-on “a-t-on” 4 occ.

3.1.2. Être

In Vaudelin’s texts, liaison consonant -t of the third person singular form est is attested in 32 different contexts. However, its distribution is largely restricted to some fixed expressions. The most frequent 4 contexts alone account for 61.5 percent of its occurrences; see Figure 4. These contexts are est un, est-il, c’est-à-dire, and est une/un.4 However, we know that according to one recent analysis, only 30 percent of the occurrences of c’est are affected by liaison in contemporary spoken French (Durand and Lyche 2008: 46). In contrast, liaison of c’est was surprisingly regular in Vaudelin’s texts. The following interrogative is the only exception: Pourcoai dit-vou ce s’âi eun espri? (IC, p. 81) “Pourquoi dites-vous que c’est un esprit?”

Figure 4. Liaison of est

As a working hypothesis, we can postulate some sociolinguistic and pragmatic factors that may be responsible for the consistent liaison of c’est in the 18th century. As Vaudelin explained in the beginning of NM, his aim was to find an appropriate way to write down the pronunciation of French in the 18th century. Aside from this, there is no other way to validate his claim that he faithfully described the pronunciation of his contemporaries. At the same time, it is practically impossible to confirm what his linguistic norm for liaison was. Vaudelin was not a fieldworker who tried to draw a living picture of the pronunciation of French at that time. Pragmatically speaking, it remains unknown how the pronunciation of French was filtered through Vaudelin’s own linguistic norm and what happened when he passed from his initial phonetic considerations to the act of writing books. Sociolinguistically speaking, he was an intellectual who belonged to the reformed Augustinian school (Cohen 1946: vii).

4 In Vaudelin’s transcription, eun represents the masculine indefinite article un before a vowel as well as the feminine une in all positions (eune is also found, but only once in the feminine).



In any case, his intention was not to reproduce the pronunciation strictly but to allow common folk to read the instructions and prayers aloud. As the title of the second book shows, one of his objectives in publishing these works was “to help the people to read the science of salvation (pour faciliter au peuple la lecture de la sience du salut).” Based on these assumptions, the liaison of c’est in Vaudelin’s texts must represent a type of pronunciation used when 18th century intellectuals read written texts aloud. Thus, it is not surprising if the liaison rule that Vaudelin advocated in his texts is considerably different from that of contemporary spoken French.

For the liaison of the plural form sont, there are 5 examples with pronoun inversion sont-i “sont-ils,” 3 examples of sont-ail “sont-elles,” and 6 examples of other types. No example can be found without liaison for the third person singular form est. Liaison also occurs in the imperfect tense and in the conditional and subjunctive moods: il aitait alôr, NM p. 31; Insi so-a t-i, IC p. 29, etc.; srait-i onorabl, NM p. 25; and srâit âiz, NM p. 25.5 As remarked above, the final consonant -t carries a certain morphemic weight and functions as a trigger of liaison.

3.1.3. Other verbs

In the following examples, the consonant -t is inserted as the third person singular marker: paiht-on “pèche-t-on” (9 occ. IC, p. 119, 120, 120, 122, 122, 123, 123, 123, 128), apailt-on/apailt-ail “appelle-t-on/appelle-t-elle” (4 occ. IC, p. 84, 104, 114, 141), ofrt-on/ofrt-ail “offre-t-on/offre-t-elle” (3 occ. IC, p. 108, 108, 109), dont-ail “donne-t-elle” (2 occ. IC, p. 99, 113), dont-on “donne-t-on” (2 occ. IC, p. 97, 100); the others have only one occurrence.6 Liaison consonant -t of the third person of other verbs can be found in 74 cases. Among these, interrogatives with NP inversion, such as doait-on “doit-on,” fait-on “fait-on,” or fôt-i “faut-il,” constitute 50 occurrences, i.e., 67.6 percent of the total, and 24 examples are of the following types: doait ancor, doait avoair, etc. A closer look at verb types and the words that follow them reveals strongly biased occurrences. First, only 3 verbs (falloir, pouvoir, and devoir) cover 77.8 percent of all the occurrences; see Table 3. Second, Table 4 also shows a preference for certain words after the liaison consonant -t. In fact, 75 percent of the occurrences are followed by avoir, être, or encore.


5 An exception without liaison in the imperfect tense: (si dâija ail n’aitâi a l’Otail) e, IC, p. 40 = (si déjà elle n’était à l’autel) et. ( ) shows occurrence.

6 adort on = adore-t-on, IC, p. 121; consist-ail = consiste-t-elle, IC, p. 100; efface-t-il, IC, p. 97; frapt-i = frappe-t-il, IC, p. 101; justifit-ail = justifie-t-elle, IC, p. 110; menace-t-elle, IC, p. 132; raistt-i = reste-t-il, IC, p. 104; râist-ail = reste-t-elle, IC, p. 97; regardt-i = regarde-t-il, IC, p. 125; rsoaivt-i = reçoivent-ils, IC, p. 106; santifit-ail = sanctifie-t-elle, IC, p. 124.



Table 3. Verb types and liaison

fôt/fot “faut”        12 occ.   NM, p. 13, IC, p. 35, 53, 102, 106, 106, 106, 113, 115, 115, 117, 128
peut                   6 occ.   IC, p. 31, 82, 85, 100, 101, 116
fait                   3 occ.   IC, p. 94, 109, 111
doait “doit”           3 occ.   NM, OBSERVATION7, IC, p. 103, 131
                       1 occ.   IC, p. 40
fôdrait “faudrait”     1 occ.   NM, p. 19
refit “refit”          1 occ.   IC, p. 50

Table 4. Words following liaison consonant -t

avoair “avoir”            6 occ.   NM, OBSERVATION, IC, p. 31, 82, 85, 101, 106
âitr “être”               6 occ.   NM, p. 19, IC, p. 102, 106, 106, 113, 115
ancor “encore”            6 occ.   IC, p. 35, 103, 116, 117, 128, 131
o “au”                    1 occ.   IC, p. 50
ansanbl “ensemble”        1 occ.   IC, p. 40
antr “entre”              1 occ.   IC, p. 53
aple “appeler”            1 occ.   IC, p. 100
i “y”                     1 occ.   IC, p. 115
ojurdui “aujourd’hui”     1 occ.   NM, p. 27
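The percentages quoted for Tables 3 and 4 can be rechecked with a short calculation. The following is only an illustrative sketch: the variable and function names are hypothetical, the counts are read off the tables as printed, the one row of Table 3 whose label is not legible is keyed as "(unlabeled)", and it is an assumption that the 77.8 percent figure counts only the forms fôt, peut, and doait.

```python
# Counts read from Table 3 (verb forms with liaison consonant -t).
# "(unlabeled)" stands for the row whose label is not legible in the table.
table3 = {"fôt": 12, "peut": 6, "fait": 3, "doait": 3,
          "(unlabeled)": 1, "fôdrait": 1, "refit": 1}

# Counts read from Table 4 (words following liaison consonant -t).
table4 = {"avoair": 6, "âitr": 6, "ancor": 6, "o": 1, "ansanbl": 1,
          "antr": 1, "aple": 1, "i": 1, "ojurdui": 1}


def share(counts, keys):
    """Percentage of all occurrences in `counts` covered by `keys`."""
    total = sum(counts.values())
    return round(100 * sum(counts[k] for k in keys) / total, 1)


# Assumption: the three-verb figure counts fôt, peut, and doait only.
verbs_share = share(table3, ["fôt", "peut", "doait"])
words_share = share(table4, ["avoair", "âitr", "ancor"])

# 50 interrogatives with inversion out of 74 cases of liaison -t.
inversion_share = round(100 * 50 / 74, 1)

print(verbs_share, words_share, inversion_share)
```

Under these readings the three values agree with the 77.8, 75, and 67.6 percent reported in the text.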

There are only 3 examples in which peut and doit did not make liaison with avoir and à respectively. Based on these observations, it seems clear that liaison was respected, especially for the third person morpheme -t.

3.2. Pronouns

3.2.1. Nous/Vous

Nous and vous, whether in the nominative case or not, regularly cause liaison before vowels. Nominative nous has 21 occurrences, of which 8 are nouz avon. The others have 59 occurrences, of which 15 are nouz a and 11 nouz oblij. Nominative vous is attested in 48 cases, 16 of which are vouz âit and 5 vouz i. The others have 51 occurrences, of which 9 are vouz aim/vouz âim and 5 vouz ador “vous adore” or vouz e “vous ai.”

3.2.2. Il, Ils/Elle, Elles/On

As is the case in contemporary French, the third person masculine pronoun il is often pronounced without the final consonant -l. A close examination of some frequently attested words discloses the quasi-complementary distribution of these two variants: the pronoun il is reduced to i in preconsonantal position, whereas the consonant -l is pronounced in prevocalic position in Vaudelin’s texts; see Table 6.


7 In NM, OBSERVATION contains two pages and comes before the main text.


Table 6. Pronoun il in some frequent words

il i ann a “il y en a”   22 occ.   il âit “il est”    9 occ.   i fo “il faut”    28 occ.   i ne “il ne”   6 occ.
il âi “il est”           19 occ.   il i a “il y a”    6 occ.   i nou “il nous”   20 occ.   i se “il se”   3 occ.
il a                     10 occ.

This alternation closely resembles the liaison phenomenon. Ten out of 101 occurrences, however, retain final -l even in preconsonantal position: c’il confirm “qu’il confirme,” IC, p. 100; s’il n’a, IC, p. 132; c’il ne sra “qu’il ne sera,” NM, p. 23, etc. Quoting the description of Claude de Saint-Lien on French phonetics of the 16th century, Livet remarked that “(...) il vient, ils disent; but courtiers did not pronounce them (= final consonants -l): the choice is here permitted.”8 The variation of il and i seems to be sociolinguistically motivated. For the third person masculine plural pronoun, a variation is observed between il and iz: il ont vs. îz aprandron, îz oront, etc. The consonant -l drops before the liaison consonant -z in the second variant. No such variation can be found for the third person feminine pronouns, and Vaudelin always noted ail “elle/elles” with final consonant -l. The indefinite pronoun on has 81 occurrences, of which 73 are noted as on and 8 as onn.9 In conclusion, liaison is realized without exception in bound form pronouns, both in the nominative case, like nous, vous, ils, and on, and in the non-nominative case, like nous, vous, and les.

3.2.3. False liaison

According to Hindret, a grammarian of the end of the 17th century, there were some Parisians and courtiers who pronounced il leurz a dit instead of il leur a dit. He himself criticized this kind of false liaison (Thurot 1881: 38–39). All 8 examples of false liaison observed in Vaudelin’s texts are as follows:

ce je leuz ansainie. 5. Lôrc pour se done de grand comodite (NM, p. 19) “que je leur enseigne. 5. Lorsque pour se donner de grande commodité”
ce le Misionair e le Pôvr an srâit âiz, e leuz an sorâi bon gre! (NM, p. 25) “que les missionnaires et les pauvres en seraient aise et leur en seraient bon gré!”
ce Mesieu le Cure ne peuv leuz aprandr par ceur (NM, p. 31) “que Messieurs les curés ne peuvent leur apprendre par cœur”
un comairs de priair, ce nou leuz adraison (IC, p. 92) “un commerce de prière, que nous leur adressons”
Pour loue e rmairsie Dieu de victoair c’i leuz a fai ranporte par sa grâs (IC, p. 108) “Pour louer et remercier Dieu de victoire qu’il leur a fait remporter par sa grâce”
nou ne leuz adraison pa no priair (IC, p. 122) “nous ne leur adressons pas nos prières”
de le respaicte, de leuz obei, e de lez asiste dan le bzoin. (IC, p. 125) “de les respecter, de leur obéir, et de les assister dans le besoin.”
Pars ce pluzieur de Laitr Latin ne consairv pa toujou le Son ci leuz âi naturail (IC, p. 142) “Parce que plusieurs de lettres latines ne conservent pas toujours le son qui leur est naturel.”

8 (...) il vient, ils disent; mais les courtisans ne les prononcent pas: le choix ici est permis (Livet 1859: 503).

9 The transcription onn represents a single nasal vowel followed by a liaison consonant -n.

Krier (1993) explains that Vaudelin followed the habits of pronunciation of Parisians and courtiers in his time. Yet if this were the case, why do we not find false liaisons of the same type for the words quatre and cent, in which liaison consonant -z comes from the influence of the plural meaning of these numerals? Or are these liaisons of quatre or cent too colloquial or vulgar to be included in his texts in light of Vaudelin’s consciousness of the linguistic norm? In any case, the liaison consonant -z of leuz is also due to the plural meaning of leur. The reason we find the false liaison of leur exclusively in these texts may be explained in a different way. Livet quotes the phonetic description of Théodore de Bèze, a grammarian at the end of the 16th century: “Parisians, and even more the inhabitants of Auxerre and those of Vezelay, the author’s hometown, often change s into r and r into s, when they say courin, Masie, pese, mese, Théodose, instead of cousin, Marie, pere, mere, Theodore” (Livet 1859: 517). This phonetic habit must have more or less facilitated the passage from leur aprandr to leuz aprandr. The consonant -r could change easily into -z in intervocalic position.10

3.2.4. Pronouns En, Dont, Rien

Liaison of the pronoun en appears in 45 sentences. 26 cases are il i ann a “il y en a.” The other types have fewer than 3 occurrences each. The relative pronoun dont has 6 occurrences of liaison -t in prevocalic position, IC, p. 74, 108, 122, 127, 134, 136, and 19 non-liaison forms don in preconsonantal position. The indefinite pronoun rien has only 1 example of liaison ri-inn, NM, p. 17. Liaison occurs regularly in these pronouns. However, it should be noted that liaison is avoided at morphosyntactic boundaries, as in de rien il in the example: S’âi ce de riin il a fai tout hôz. “C’est que de rien il a fait toute chose” IC, p. 86.

10 According to a personal communication from Anthony Lodge, this vernacular pronunciation of Paris was salient in the 16th century and was probably not widespread in the 18th century (Lodge 2004: 140, 144-145, 188-189).



3.3. Articles, possessive adjectives

3.3.1. Indefinite and definite articles

The masculine singular indefinite article un is noted by a tailed vowel u as in Figure 5.

Figure 5. “il n’y en a qu’un” IC, p. 31.

In order to explain French liaison of the masculine singular indefinite article to his readers, Vaudelin used the letter for the nasal vowel un followed by a consonant -n, cf. unn ami, NM, p. 10. Curiously enough, however, a detailed analysis of his texts shows that he never used this form unn in other parts of his books, but eun: eun Abu, eun Alfabe, eun avantaj, etc. This form is to be interpreted as containing a denasalized vowel because it is transcribed not by a tailed vowel u, but by eun. Consequently, the spelling is similar to that of the feminine singular indefinite article: eun pâin “une peine” and eun urêuz fin “une heureuse fin.” The form un never appears in prevocalic position, and 2 examples of it appear in prepausal position: i n’i ann a c’un, IC, p. 31, 82, “il n’y en a qu’un”. The plural indefinite article is represented by de in preconsonantal position and dez in prevocalic position. The plural definite article is regularly le in preconsonantal position and lez in prevocalic position. The contracted form of the preposition à and the plural definite article is always ô in preconsonantal position and oz in prevocalic position. In these cases, the morphemic value of plurality is responsible for the liaison in prevocalic position. Liaison is here without exception.

3.3.2. Possessive adjectives

The denasalization of the nasal vowel on is expected to be the same as that for possessive adjectives mon, ton, son in the position of liaison: Mon Dieu, je vouz ofr mon âme, IC, p. 24. However, in addition to these denasalized spellings, there are also nasal vowels followed by liaison consonant -n; see sonn in Figure 6.11


11 We use the spelling -onn to represent a nasal vowel -on followed by a consonant -n.



& porcoa ne la pa joindr a sonn istoair ci sra toujou si curieuz e toujou bail? (NM, p. 29) “et pourquoi ne la pas joindre à son histoire qui sera toujours si curieuse et toujours belle?”

ou dan sa pairson, ou dan sonn oneur, ou dan se biin. (IC, p. 112) “ou dans sa personne, ou dans son honneur, ou dans ses biens.”

Plural possessive adjectives are all regular in liaison position: me, se, no/nô, and vo/vô in preconsonantal position and mez, sez, noz/nôz, and voz/vôz in prevocalic position. Leur and leurs are respectively leu and lêu in preconsonantal position and leur and leuz in prevocalic position. Cohen considered lêu an exception (Cohen 1946: §37, p. 53), but it seems more legitimate to interpret the spelling lêu as the plural form of the singular possessive leu because this better explains the presence of a long vowel ê.12 In the position of liaison, the singular form leur becomes leuz in the plural, and final consonant -r is deleted before liaison consonant -z, i.e., leur-z > leuz. Again, the final consonant -z of both articles and possessive adjectives is endowed with a morphemic weight, and therefore, it triggers liaison.

12 However, this is the only example in which Vaudelin used a long vowel for plural leurs.

3.4. Prepositions

Liaison of the preposition après is regular. There are 8 examples of aprai, NM, p. 21, IC, p. 60, 73, 97, 135, 135, 136, 137, and 4 of apraiz, IC, p. 65, 66, 124, 127, with liaison consonant -z. The preposition dès observes the same rule: 3 examples of dâi/dai before a consonant, NM, p. 25, 31, IC, p. 29, and 2 of dâiz in the collocation dâiz a prezan “dès à présent,” IC, p. 60, 138. The same results can be found for the prepositions dans and sans: dan (128 occ.) and san (8 occ.) in preconsonantal position and danz (3 occ.), IC, p. 61, 116, 120, and sanz (3 occ.), NM, OBSERVATION, IC, p. 49, 132, in prevocalic position. The preposition depuis is worth our attention. There are 6 examples of dpui in preconsonantal position, NM, OBSERVATION, p. 17, IC, p. 60, 69, 93, 134. No example is found for prevocalic position, but the form dpuis is attested, in which final consonant -s is retained even before a consonant: Je vou remairsi de tout le grâs ce vou m’ave fait dpuis ce vou m’ave cree jusq’a mintnan, IC, p. 36.13 “Je vous remercie de toutes les grâces que vous m’avez faites depuis que vous m’avez créé jusqu’à maintenant.” Preconsonantal deletion of the final consonant in some prepositions, if not restored thereafter, had not been completed in the first quarter of the 18th century. This inference is corroborated by the situation of another preposition. Though the final consonant -k would be restored in a later period, the most frequently attested form of the preposition avec was avai without final consonant in Vaudelin’s texts. There are 68 examples of avai in preconsonantal position and 19 of avaic in prevocalic position. We can also find avaic as a preconsonantal form: ou contr la verite, se c’onn apail parjur; ou avaic verite, mâi san nesaisite, IC, p. 123, “ou contre la vérité, ce qu’on appelle parjure; ou avec vérité, mais sans nécessité.” In liaisons where final consonants -t and -z function respectively as the third person morpheme (see Section 3.1.1) and the plural morpheme (see Section 3.3.1), the final consonants are retained in prevocalic position. By contrast, loss or maintenance of the final consonant seems variable in non-morphemic liaison of prepositions such as depuis and avec.

3.5. Adjectives, adverbs

3.5.1. Adjectives

Plural forms of some adjectives such as beaux, bons, mauvais(es), petits, and saints have regular liaison forms: bô, bon, movai, pti, and sin before a consonant and bôz, bonz, movaiz, ptiz, and sinz before a vowel. Like the possessive adjectives mon and son, the singular adjective bon is noted with the nasal vowel letter in preconsonantal position, whilst it is noted with an oral vowel in prevocalic position. The nasal vowel -on is thus denasalized in the position of liaison, as in mon bon Anj “mon bon ange,” IC, p. 24, 60. This liaison form is similar to the feminine form: bon foai “bonne foi,” NM, p. 25, and bon odeur “bonne odeur,” IC, p. 101; see also Section 3.3.1.
Adjectives tout and tous become tou and tôu before a consonant and tout and tôuz before a vowel. As already mentioned regarding the example of de rien in Section 3.2.4, liaison is impeded by the presence of a clear morphosyntactic boundary. Moreover, for the prepositions depuis and avec, loss and maintenance of the final consonant can coexist in Vaudelin’s texts; see Section 3.4. Morphosyntactic boundaries can probably explain why the final consonant -t is absent even in prevocalic position in the following two sentences:

J’e du l’anplo-aiie tou a vou glorifie e... (IC, p. 36) “J’ai dû l’employer tout à vous glorifier et...”
ou vouz ait tou a moai (IC, p. 76) “où vous êtes tout à moi.”

13 cf. par not prop volonte dpui ce nouz avon atin IC, p. 134.



3.5.2. Adverbs

The preconsonantal forms of bien, donc, fort, jamais, and très are biin, don, for, jamâi, and trâi, and the prevocalic forms are biinn, donc, fort, jamaiz, and trâiz. The alternation is very regular. However, the adverb moins can drop its final consonant -z in prevocalic position; see the following examples.

Ton Createur rsvrâ o moin a Pâc unblman? (IC, p. 131) “Ton créateur recevra au moins à Pâques humblement.”
Pars ce hac Mo doait avoair o moin eun Voaiiail. (IC, OBSERVATION) “Parce que chaque mot doit avoir au moins une voyelle.”
se Confâis e Comunî o moin eun foa tou le moâi. (IC, p. 68) “se confesse et communie au moins une fois tous les mois.”
A confaise o moin eun foai l’anê tôu nô pehe... (IC, p. 131) “a confessé au moins une fois l’année tous nos péchés.”
ci ont atin l’âj de discresion de comunie o moin eun foai l’an (IC, p. 131) “qui ont atteint l’âge de discrétion de communier au moins une fois l’an”
Tou te pehe confaisrâ a tou le moin eun foai l’an: (IC, p. 131) “Tous tes péchés confesseras à tout le moins une fois l’an:”

We can suppose that this final consonant deletion is realized in the collocations au moins and à tout le moins, but the formulation of a rule is complicated by a counter-example found in IC, p. 27: Tôu te pehê confaiserâ A tou le moinz eune fo-a l’an “Tous tes péchés confesseras à tout le moins une fois l’an.” The negative particle pas becomes pa in preconsonantal position and paz in prevocalic position, but for this negative particle, too, a non-liaison form can be found before a vowel, as in the first example below. In addition, even in prevocalic position, the final consonant -t of inmediatman and particuliairman is not pronounced in the first two examples, but it is pronounced in particuliairmant in the last.

Pars-ce hac Conson ci ne presaid pâ inmediatman eun Voaiiail dous port naturailman... (NM, OBSERVATION) “Parce que chaque consonne qui ne précède pas immédiatement une voyelle douce porte naturellement...”
Mon Dieu je cra ce vouz ait prezan par tou e particuliairman isi ou je vouz ador... (IC, p. 23) “Mon Dieu je crois que vous êtes présent partout et particulièrement ici où je vous adore...”
Pourcoai donc âit-ail particuliairmant atribue o Pair? (IC, p. 86) “Pourquoi donc est-elle particulièrement attribué au père?”



No liaison is attested for the negative particle point. There are 3 examples of poin in prepausal position, IC, p. 54, 118, 118, 27 cases in preconsonantal position, and 8 cases in prevocalic position, NM, OBSERVATION, IC, p. 31, 37, 81, 85, 115, 132, 139. The reason for the absence of liaison in the sentence ne sra poin un dzoneur “ne sera point un déshonneur,” IC, OBSERVATION, is ascribed by Krier (1993) to the presence of two accent groups.14 This interpretation seems possible for the cases ne comunî poin a Pâc “ne communie point à Pâques,” IC, p. 132, and ne nouz induize poin en tantasion “ne nous induisez point en tentation,” IC, p. 37, 139. However, it is quite unlikely in the following 3 cases: i n’a poin u de comansman “il n’a point eu de commencement,” IC, p. 31, 81; eun sucsaision ci n’a poin ete intaironpu “une succession qui n’a point été interrompue,” IC, p. 115. The adverb plus has plu/plû in preconsonantal position and pluz/plûz in prevocalic position, but the alternation is not so clear-cut because the non-liaison form plu/plû also appears before a vowel.

Liaison form pluz/plûz:
e ecsite-vou de pluz an plû a la contrision. (IC, p. 66) “et excitez-vous de plus en plus à la contrition.”
La tro-aijâim, fair un fairm propô de ne le pluz ofanse. (IC, p. 34) “La troisième, faire un ferme propos de ne le plus offenser.”

Non-liaison form plu/plû:
e je sui dan la rezolusion de ne le plû ofanse moaiiainan sa grâs: (IC, p. 65) “et je suis dans la résolution de ne le plus offenser moyennant sa grâce.”

The contexts of plus offenser are very similar in the last two examples above, such that liaison appears to be in free variation in these cases. The prevocalic form of the adverb toujours is not toujouz, but toujour. Toujou is its preconsonantal form in 16 examples and its prepausal one in 2. In Feraud’s dictionary (1761), an old and traditional pronunciation without final -r was recorded: “Tou-jou, the first short (vowel), the second long.” Curiously, Vaudelin never notes a long vowel in either toujou or toujour. Final consonant -r was systematically restored after Littré (Catach (dir.) 1995: 1038). There are 8 examples of toujour before a vowel: toujour an prezans “toujours en presence,” IC, p. 116; toujour anui-ieu “toujours ennuyeux,” NM, p. 33; toujour atahe “toujours attaché,” IC, p. 58; toujour ete “toujours été,” IC, p. 31; toujour obsairve “toujours observé,” IC, p. 136; se pronons toujour OU U “se prononce toujours OU”, IC, p. 147; se pronons toujour Egz. com, “se prononce toujours Egz. comme,” IC, p. 149; toujour un son sinpl “toujours un son simple,” NM, p. 21. By contrast, the non-liaison form appears in 7 examples in prevocalic position.

e supôz toujou avaic ail la Voaiiail e. (NM, OBSERVATION) “et suppose toujours avec elle la voyelle e.”
pars ce Jezu-Cri se trouv toujou o milieu de seu ci s’asanbl... (IC, p. 39) “parce que Jésus-Christ se trouve toujours au milieu de ceux qui s’assemblent...”
Souvne vou biin ce toujou eun Voaiiail avai M. ou N. suivi d’eun Conson (IC, p. 143) “Souvenez-vous bien que toujours une voyelle avec M. ou N. suivie d’une consonne”
Souvne-vou ancor biin c’a la fin du mo Am, an se pronons toujou Anm, Ann. (IC, p. 143) “Souvenez-vous encore bien qu’à la fin du mot Am, an se prononce toujours Anm, Ann.”
U gard toujou son propr Son dvant E, e dvant I. com, (IC, p. 148) “U garde toujours son propre son devant E, et devant I comme,”
U âi toujou muai dvant O, (IC, p. 148) “U est toujours muet devant O,”
toujou le Mecrdi, Vandrdi e Samdi 10. (IC, p. 161) “toujours le Mercredi, Vendredi et Samedi 10.”

14 (...) das Fehlen der Liaison in NM, p. 36: /nə sra poɛ̃ œ̃ dzonør/ “ne sera point un déshonneur” dürfte sich durch das Vorhandensein von zwei Akzentgruppen erklären lassen; (Krier 1993: 120).

As we have seen in Section 3.4, loss or maintenance of a final consonant seems variable for non-morphemic liaison of adjectives and adverbs.

3.6. Conjunctions, participles

The conjunctions quand and mais have regular liaison forms: can and mâi before a consonant and cant and mâiz/maiz before a vowel. No liaison is realized before or after the conjunction et. Vaudelin used a nasal vowel before et in the following example. There are 3 other examples of the same type.

Oui: il âi bon e util d’avoair ... (IC, p. 121) “Oui: il est bon et utile d’avoir ...”

For the participle faisant, the form fzan is attested once before a pause, IC, p. 65, and 4 times before a consonant, IC, p. 61, 61, 74, 92. In prevocalic position, the liaison form fzant appears in 2 cases, NM, p. 31, IC, p. 88, and non-liaison fzan in another 2, IC, p. 98, 123, such that they seem to be free variants.
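To make the variability of these free forms easier to compare, the prevocalic counts reported above can be turned into liaison rates. This is a minimal sketch rather than part of the original analysis: the names are hypothetical, and the counts are taken at face value from the text (8 prevocalic cases of poin without liaison; 8 toujour against 7 toujou; 2 fzant against 2 fzan).

```python
# Prevocalic counts as reported in the text: (with liaison, without liaison).
prevocalic = {
    "point": (0, 8),      # poin: no liaison attested
    "toujours": (8, 7),   # toujour vs. toujou
    "faisant": (2, 2),    # fzant vs. fzan
}


def liaison_rate(with_liaison, without_liaison):
    """Percentage of prevocalic tokens realized with a liaison consonant."""
    total = with_liaison + without_liaison
    return round(100 * with_liaison / total, 1)


rates = {word: liaison_rate(*counts) for word, counts in prevocalic.items()}

for word, rate in rates.items():
    print(f"{word}: {rate}% liaison in prevocalic position")
```

On these figures none of the three words approaches the exceptionless liaison observed for the morphemic -t and -z.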



3.7. Numerals

Some numerals have three different forms in preconsonantal, prevocalic, and prepausal positions. Numerals are important elements that cast light on the diachronic formation of linking phenomena. For instance, deux is noted as dêu/deu before a consonant, as dêuz/deuz with liaison consonant -z before a vowel, and as deus with final consonant -s before a pause: Il i ann a deus “Il y en a deux,” IC, p. 85. In the linguistic situation of the 18th century, the prepausal form deus still preserved an earlier stage of the language, before the establishment of liaison. In contemporary French, this final consonant -s is lost in both preconsonantal and prepausal positions. It thus seems significant that the liaison of deux with final consonant -z in prevocalic position was probably not established simultaneously with the loss of final -s before a consonant or a pause. The following examples indicate that the final consonant was without doubt maintained before a pause at the beginning of the 18th century.

Conbiin ann at-i? Il i ann a tro-ais (IC, p. 32). “Combien en a-t-il? Il y en a trois.”
Conbiin at-i de Comandman de l’Eglîz? Il i ann a sis. (IC, p. 130) “Combien a-t-il de commandements de l’Église? Il y en a six.”
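The three-way alternation described for deux can be pictured as a lookup keyed by the following context. The sketch below is purely illustrative: the function names and the vowel inventory are assumptions made for the example, only the short-vowel variants deu, deuz, and deus are encoded, and the letter h is treated as a consonant on the assumption that it notes the initial sound of hoz “choses.”

```python
# Forms of "deux" in the three contexts distinguished in the text.
FORMS = {
    "deux": {
        "preconsonantal": "deu",
        "prevocalic": "deuz",
        "prepausal": "deus",
    }
}

# Assumed inventory of vowel letters in the transcription (illustrative only).
VOWELS = set("aeiouâêîôûàéè")


def context_of(next_word):
    """Classify the position from the following word (None = before a pause)."""
    if next_word is None:
        return "prepausal"
    first = next_word[0].lower()
    return "prevocalic" if first in VOWELS else "preconsonantal"


def select_form(numeral, next_word=None):
    """Return the form of the numeral expected in the given context."""
    return FORMS[numeral][context_of(next_word)]


print(select_form("deux", "hoz"))   # before a consonant
print(select_form("deux", "ôtr"))   # before a vowel
print(select_form("deux"))          # before a pause
```

A third, prepausal entry is what sets numerals apart from the two-way alternations seen for prepositions and adverbs.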

The numeral dix also has a prepausal form dis. The final consonant is -t for the numeral sept. The consonant is pronounced before both a vowel and a pause.

Prepausal position
Conbi-in i at-i de Sacrman? Il i ann a sait (IC, p. 33). “Combien y a-t-il de sacrements? Il y en a sept.”
Conbiin i at-i de Sacrman. Il i ann a sait (IC, p. 96). “Combien y a-t-il de sacrements? Il y en a sept.”

Prevocalic position
e le sait ôtr regard le prohin (IC, p. 117). “et les sept autres regardent le prochain.”
e l’amour du prohin ranfairm le sait ôtr (IC, p. 117). “et l’amour du prochain renferme les sept autres.”

Loss or maintenance of final consonant -r was variable in Vaudelin’s texts. The numeral quatre was noted as cat without a final consonant in 4 examples such as cat hoz “quatre choses,” IC, p. 119, and cat maniâir “quatre manières,” IC, p. 91, 119, 134, but as catr in 2 examples: Il i ann a catr principô “Il y en a quatre principaux,” IC, p. 105; Il i ann a catr principal “Il y en a quatre principales,” IC, p. 115. As mentioned in Sections 3.4. and 3.5., final consonants can drop or remain for the non-morphemic liaison of adjectives, adverbs, and prepositions, and this phenomenon is more dynamic for numerals because of the presence of the third form in prepausal position, i.e., deus, troais, and sis.

4. Conclusion

Stylistic difference between spoken and written languages must be the most cumbersome parameter of variation in the linguistic study of past ages. As noted in Section 3.1.2, Vaudelin’s works seem far from colloquial in style. It is highly likely that his texts were written not to be spoken spontaneously, but to be read, whether aloud or silently. Thus, it is reasonable to suppose that his texts represent what happened when an 18th century intellectual read French texts. In other words, his texts do not give us straightforward and overall access to spoken French. However, it should be noted that the presence of false liaison (see Section 3.2.3) convincingly reflects the spoken French of the 18th century. In this way, the greater part of Vaudelin’s texts demonstrates the linking phenomenon in relatively formal contexts, but occasionally his works manifest some informal false liaison. We can also find an example of liaison consonant -p, which reminds us again of the formal style of his texts.

d’ou je conclu c’i fôdrait âitr bi-in delica, e trop ainmi de la pairfaicsion (NM, p. 19). “d’où je conclus qu’il faudrait être bien délicat, et trop ennemi de la perfection”

What, then, should we do in the analysis of these written documents, which are considered precious sources of spoken French in past ages but which, in reality, are inevitably a mixture of written and spoken language? In this respect, we fully agree with the opinion of Ayres-Bennett (2004). Methodologically speaking, it is important to examine “the extent to which it is possible to find reflections of spoken French in textual and other sources, and to evaluate their reliability” (Ayres-Bennett 2004: 17). As concluding remarks, we will recapitulate our findings regarding the liaisons in Vaudelin’s texts in order to grasp at what stage his texts should be situated in the diachronic evolution of French liaison. Consonant deletion had not been fully completed in the first quarter of the 18th century. It was accomplished for the consonants -s and -t, which are integrated into the voiced/voiceless opposition, although -c (= -k) of avaic was excluded; see Section 3.4. Consonants without voiced partners were not deleted. For instance, we can observe consonants -l and -r pronounced in preconsonantal position for the personal pronoun il and the adverb toujours respectively. In addition, the numeral deux has three different forms: deu before a consonant, deus before a pause, and deuz before a vowel.



Consonant deletion was not yet complete before a pause. Based on this evidence, the general hypothesis that final consonant deletion should be presupposed for the emergence of liaison does not seem to hold. Every phonetic change does not affect every vocabulary item at the same speed. On the contrary, consonant insertion did occur without exception; see Section 3.1. The regularity of this inserted -tt comes from its morphemic value as the third person marker. As far as Vaudelin’s texts are concerned, liaison of the bound form type is realized regularly and quickly; see the examples of personal pronouns, the pronoun en, articles, and prepositions. Among others, liaison is realized almost without exception when liaison consonant -zz represents the plural morpheme in personal pronouns nous and vous and articles des and les. For verbs, it will occur most regularly when liaison consonant -tt represents the third person morpheme. In this way, liaison consonants -zz and -t of morphemic bound form type are the most consistently pronounced, whilst those of the free form type occur less regularly; see the examples of several adverbs such as moins, pas, plus, and toujours. References Ayres-Bennett, Wendy. 2004. Sociolinguistic variation in seventeenth-century France: Methodology and case studies, Cambridge: Cambridge University Press. Catach, Nina. 1995. Dictionnaire historique de l’orthographe française, Paris: Larousse. Cohen, Marcel. 1946. Le français en 1700 d’après le témoignage de Gile Vaudelin, Paris: Librairie ancienne Honoré Champion. Chasle, Nathalie. 2008. “Manifestation de la latence en ancien français aux Xème et XIème siècles: liaison et redoublement syntaxique”. Congrès Mondial de Linguistique Française—CMLF’08, Paris, 1645-56. cf. http://www. =standard&Itemid=129&url=/articles/cmlf/pdf/2008/01/cmlf08175.pdf Durand, Jacques and Chantal Lyche. 2008. “French liaison in the light of corpus data”. French Language Studies 18. 33-66. Klausenburger, Jürgen. 1984. 
French liaison and linguistic theory. Stuttgart: Franz Steiner. Krier, Fernande. 1993. “Gile Vaudelin und die französische Orthographie”. Sprachwandel und Sprachgeschichte, Festschrift für Helmut Lüdtke, Schmidt-Radefeldt and Andreas Harder (eds). Tübingen: Narr. 117-122. Livet, Charles-Louis. 1859 (1967). La grammaire française et les grammairiens du XVIe siècle. Genève: Slatkine Reprints.



Lodge, Anthony. 2004. A sociolinguistic history of Parisian French. Cambridge: Cambridge University Press.
Morin, Yves Charles. 1986. “On the morphologization of word-final consonant deletion in French”. Sandhi phenomena in the languages of Europe, Andersen, Henning (ed). Berlin/New York/Amsterdam: Mouton de Gruyter. 167-210.
Morin, Yves Charles. 2005. “Liaison et enchaînement dans le vers aux XVIe et XVIIe siècles”. De la langue au style, Gouvard, Jean-Michel (ed). Lyon: Presses Universitaires de Lyon. 299-318.
Morin, Yves Charles. 2005. “La liaison relève-t-elle d’une tendance à éviter les hiatus? Réflexions sur son évolution historique”. Langages 158. 8-23.
Rosset, Théodore. 1911. Les Origines de la prononciation moderne étudiées au XVIIe siècle d’après les remarques des grammairiens et les textes en patois de la banlieue parisienne. Paris: Armand Colin.
Thurot, Charles. 1881 (1966). De la prononciation française depuis le commencement du XVIe siècle d’après les témoignages des grammairiens, 2nd vol. Genève: Slatkine Reprints.
Tranel, Bernard. 1987. The sounds of French. An introduction. Cambridge: Cambridge University Press.
Vaudelin, Gile. 1713 (1973). Nouvelle manière d’écrire comme on parle en France. Paris: Chez La Veuve de Jean Cot et Jean-Baptiste Lamesle, Slatkine Reprints, 33.
Vaudelin, Gile. 1715 (1973). Instructions crétiennes mises en ortografe naturelle. Paris: Chez Jean-Baptiste Lamesle, Slatkine Reprints, 247.
AntConc 3.2.1w (Windows). Laurence Anthony. Faculty of Science and Engineering, Waseda University, Japan.

Issues in the Typographic Representation of Medieval Primary Sources

António EMILIANO

The transcription of medieval primary sources for linguistic and philological study entails complex problems and decisions regarding the typographic representation of medieval characters.[1] The linguistic study of medieval texts requires highly conservative transcriptions, so that information about the medieval character sets is faithfully preserved in the editions: full access to the original character sets by researchers allows for grounded analyses of the graphemic systems and hence of the linguistic data present in manuscripts.

Many medieval characters, such as special lettershapes (with several case distinctions), abbreviation marks (combining characters) and signs (spacing characters), and punctuation signs, have fallen into disuse since the arrival of printing (although the early printers up to the 18th century made use of several medieval characters and abbreviations). Traditional philologists have considered the diversity of medieval lettershapes as a problem and have dealt with it by simply suppressing that diversity. The fact is that different medieval scripts (like the Visigothic, Carolingian or Gothic families of scripts which were continuously used in the Iberian Peninsula for several centuries) had some very distinct lettershapes which could be regarded as separate characters when compared to the contemporary versions of the Roman alphabet we use in print.

[1] The concept of ‘typographic representation’ encompasses both physical and digital media, although this paper focuses on editions to be included in digital textual corpora.

1. Some general assumptions

I submit that the following assumptions regarding the nature and aims of editions of medieval primary sources cannot be circumvented:

1) ‘transcription’, ‘transliteration’ and ‘edition’ of a text are different tasks and steps of the philological work, each with its specific set of goals and procedures;
2) an edition will represent a text better (i.e. more faithfully) to the extent that it entails the least amount of transliteration operations;
3) an interpretive/diplomatic edition must always be based on a good transcription and a highly conservative transliteration; ideally, every interpretive/diplomatic edition should be preceded by a conservative/palaeographic edition;
4) no single edition of a medieval primary source will ever meet the needs of all potential users and audiences;
5) the edition of medieval texts cannot be considered and carried out solely in terms of printed editions, and the study of medieval texts requires the creation of not just electronic text archives but also of electronic corpora.

In order to explain fully the conceptual and practical consequences of these assumptions I will first consider the distinction between graphetics and graphemics and the meaning of such terms as ‘letter’, ‘character’ (and the correlated term ‘character set’) and ‘glyph’ (and the correlated term ‘glyph set’). The following discussion will be confined to the Roman alphabet and its medieval and modern derivatives.

When one is faced with the analysis of an alphabet-based writing system the graphetic plane must crucially be distinguished and separated from the graphemic plane. On the former, a writing system is simply a code, a set of symbols (letters, diacritics, punctuation marks, auxiliary signs), or, to be more precise, a character set. Since graphetics is concerned with the description and the history of writing codes, the relation of characters to units of speech has no bearing on this plane of analysis. One can describe, analyse and discuss a historical set of characters and its development without ever referring to any language features or structures. In other words, alphabets qua character sets can be analysed without any reference to a particular language or spelling system. An alphabet is not an orthography and letters and characters are not graphemes.

Both graphetic and graphemic factors should be considered when adopting any given strategy for transcribing medieval texts and for representing them typographically.

2. Letters, characters, glyphs and graphs

A letter, the basic unit of the medieval and modern versions of the Roman alphabet, is in reality a class of characters. For instance, what we call the letter A is in fact a set that comprises the characters capital A, small A, superscript small A, small capital A. In medieval writing the set was larger and encompassed other types of A. In our modern printed version of the Roman alphabet some letters have strikingly different shapes and graphic attributes, as shown in table 1:



Table 1. Letters and lettershapes. [Header: LETTER; surviving row labels: subscript (a, g, m, q), small capital.]
Table 1 contains just four letters. However, the total count of lettershapes (characters) is twenty, because letters are not lettershapes: a letter can be manifested by more than one lettershape or character, and these may be markedly distinct yet are considered to be formally equivalent, that is, the “same” letter. Small A and capital A are different characters: to a user of a script not related to the Roman alphabet they may be regarded at first sight as two letters. For users of the Roman alphabet they are the same letter and most users do not even notice how different they are as lettershapes.

A character is an abstract shape regardless of any graphic rendering. It is an ideal form, in the sense that the triangle or the circle are ideal forms with no concrete features or attributes beyond those which define them as distinct geometric entities. The triangle and the circle are two-dimensional entities, i.e. surfaces, regardless of any specific trait they may acquire in the real world: the triangle is simply a polygon with three vertices and three sides which are line segments; the circle is simply a set of points in a plane which are all equidistant from a given point (called the centre) which is located on the same plane. These are basic and archetypal geometric shapes: their definition is not bound or constrained by any aspects pertaining to the physical materiality of their medium (size, colour, texture, material). In exactly the same way, the entity ‘capital A’ is an ideal or archetypal form, a basic writing symbol which can simply be defined as a shape made up of three line segments arranged in a distinctive manner.

The differences between ‹A› (regular capital A), ‹A› (bold capital A), ‹A› (italic capital A) and ‹A› (bold italic capital A) are not contrastive, because they are subject to the ‘style’ of graphic presentation of the capital A character (which is essentially neither regular, nor italic, nor bold, nor bold italic: characters have no presentation style). Whilst capital A is a simple literal character (or just a ‘simple literal’), Á, Â, Ã, etc. are composite characters: they are made up of two separate characters (a literal plus a diacritic). The association of the supralinear acute, circumflex and tilde combining characters with a literal results in new distinctive writing units.
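The analysis of composite characters as a literal plus a diacritic has a close counterpart in the Unicode canonical decomposition. The following is only a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `decompose` is introduced here for illustration and is not part of the original discussion.

```python
import unicodedata

def decompose(char):
    """Return the Unicode names of the units obtained by canonical decomposition (NFD)."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", char)]

# Precomposed capital A with acute, circumflex and tilde, written as escapes
# so that the source file itself cannot silently change their form.
for composite in ("\u00c1", "\u00c2", "\u00c3"):
    print(composite, "->", decompose(composite))
```

Each composite is reported as LATIN CAPITAL LETTER A followed by one combining mark, which mirrors the distinction drawn above between a simple literal and a combining diacritic.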



A character can be presented in manifold variant forms which are called ‘glyphs’. The glyph is no less abstract than the character but it is not simply an ideal shape: the systematic glyph is a model or blueprint for the presentation of a character in a writing medium. In other words the systematic glyph is an abstract entity that contains explicit information about the features of graphic (and typographic) rendering of a character. Table 2 shows the typographic representation of several glyphs and characters:

Table 2. Letters, characters and glyphs. [Rows (character name): capital A, small A, capital M, small M; surviving column labels: proportionally spaced sans serif typeface, monospaced typeface.]
Table 2 contains the typographic presentation of 2 letters, 4 characters and 24 glyphs. Each letter corresponds in this table to 2 characters and 12 glyphs. Alphabets and character sets are closed (finite) sets but any character set can be associated with an infinite number of glyph sets: a letter is a group of characters which are regarded as formally equivalent and each character can have an infinite number of systematic glyphs.

The actual materialization of writing in a specific medium is ‘graphic or graphetic implementation’ and any instance of writing qua implementation corresponds to a unique spatiotemporal event. In print (typographic implementation of writing), the ‘implementational glyphs’ or ‘graphs’ through which systematic glyphs are materialized tend to be identical. In chirographic (manual) writing the differences between graphs which manifest the same glyph are greater: there are differences between different hands (scribes) and a single writer can adopt different presentation styles. Furthermore, a lettershape is never drawn exactly the same way by the same individual, even in the more elaborate calligraphic styles, due to the biological/physiological nature of humans. This does not mean that we use, or that the medieval scribes used, an infinite number of glyphs: we use a finite set of characters, one or more finite sets of systematic glyphs (according to the adopted style of writing), and, yes, we do draw an infinite number of graphs for each character.
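The counts given for Table 2 follow from a simple containment hierarchy: a letter groups characters, and a character is presented through systematic glyphs. The sketch below is one possible way to model this, not the author's own formalization; the class names and the six style labels are hypothetical placeholders, chosen only so that each character has six glyphs as the totals imply.

```python
from dataclasses import dataclass, field

@dataclass
class Character:
    name: str
    glyphs: list = field(default_factory=list)      # systematic glyphs of this character

@dataclass
class Letter:
    name: str
    characters: list = field(default_factory=list)  # formally equivalent characters

# Hypothetical presentation styles standing in for the six glyph columns of Table 2.
STYLES = ["style-1", "style-2", "style-3", "style-4", "style-5", "style-6"]

def build_letter(name):
    """Build a letter with a capital and a small character, each with one glyph per style."""
    characters = [
        Character(f"{case} {name}", [f"{case} {name} in {style}" for style in STYLES])
        for case in ("capital", "small")
    ]
    return Letter(name, characters)

letters = [build_letter("A"), build_letter("M")]
n_letters = len(letters)
n_characters = sum(len(letter.characters) for letter in letters)
n_glyphs = sum(len(ch.glyphs) for letter in letters for ch in letter.characters)
print(n_letters, n_characters, n_glyphs)  # 2 4 24
```

Graphs are deliberately absent from the model: as individual spatiotemporal events they are unbounded in number and are not something an encoding would enumerate.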



Letters, characters, glyphs and graphs are thus basic units in the graphetic analysis of writing: they belong to distinct logical planes, have different properties and attributes and should never be confused. These distinctions should be particularly important to philologists undertaking the edition of a text (or group of texts) or planning the annotation of a corpus of medieval primary sources. Careful consideration should be given to which type of entities one wishes to represent or annotate and for which purpose. It goes without saying that trying to encode graphs is tantamount to creating a typographic facsimile and simply makes no sense. However there is ample room for discussion regarding the amount of ‘glyphic’ information one should encode or leave out in a palaeographic edition of a medieval text.

3. Graphemes, allography and writing systems

The graphemic plane of writing (which is the object of graphemics) requires the consideration of the relation between writing units and linguistic units. The graphemic plane is, therefore, a (scripto)linguistic plane; in other words, graphemics is a subdivision of linguistics. To make this clearer, consider the proposition:

A is a character of the Roman alphabet.

This proposition is of course true. No additional comments are needed. Now consider a similar proposition:

A is a grapheme.

It is neither true nor false, it is simply meaningless. Because the grapheme (like the phoneme) is a linguistic relational concept, this proposition can only have truth or falsehood content in relation to a specific language. The definition of ‘emic’ units, contrary to ‘etic’ units, is a function of their status in a given symbolic system. The proposition:

A is a grapheme of written Portuguese.

is true, whereas

A is a grapheme of written Arabic.

is false. Graphemes are minimal units of a writing system, which is in turn the minimal set of contrastive graphic elements, which, in association with a set of concatenation and mapping rules, allows for the written representation of a linguistic system and hence makes written linguistic communication possible. In a logographic system, mapping rules lay out the relation between lexemic units and graphemic units (‘logograms’ or ‘grapholexemes’) whereas in an alphabetic system (which tends to be phonographic) the mapping rules are basically grapheme-phoneme correspondence rules.[2]
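The relational nature of the grapheme can be made concrete with a small sketch: graphemic status is a two-place predicate over a unit and a writing system, and it is undefined when no system is given. Everything in the block is an illustrative assumption: the function name, the dictionary, and the tiny, deliberately incomplete inventories, which are in no way a graphemic analysis of either language.

```python
# Deliberately incomplete inventories, for illustration only (an assumption of this sketch).
WRITING_SYSTEMS = {
    "Portuguese": {"a", "b", "c", "\u00e7", "ch", "lh", "nh"},
    "Arabic": {"\u0627", "\u0628", "\u062a"},  # alif, ba, ta
}

def is_grapheme(unit, system=None):
    """Decide graphemic status relative to a writing system.

    Without a system the question has no truth value, so an error is raised
    instead of returning True or False.
    """
    if system is None:
        raise ValueError("graphemic status is only defined relative to a writing system")
    # Case is folded because the propositions above concern the letter A as a whole.
    return unit.lower() in WRITING_SYSTEMS[system]

print(is_grapheme("A", "Portuguese"))  # True
print(is_grapheme("A", "Arabic"))      # False
```

Raising an error rather than returning False when no system is supplied is the design choice that corresponds to "neither true nor false, simply meaningless".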



In an alphabet-based system graphemes are mostly ‘phonograms’ (or ‘graphophonemes’). A grapheme can be made up of one or more letters and it can be mapped to more than one phoneme. The grapheme, like the phoneme, is an abstract unit, whose value is defined in terms of the relation between elements of the same type in a system. A phoneme can be actualized in speech by distinct phones (allophones) according to the phonological context. A grapheme can also have different allographemes or allograms, although the factors that govern allophony and allography are of a distinct nature.

In reality, two types of allography must be considered: 1) character-allography or ‘deep allography’ and 2) glyph-allography or ‘shallow allography’. In deep allography the allographs are different characters (e.g. capitals vs. uncials vs. minuscules) whereas in shallow allography the allographs are variants of the same character, i.e. they correspond to similar glyphs whose occurrence is generally context-dependent (e.g. word-medial lettershapes vs. word-final lettershapes). Medieval texts present abundant examples of context-dependent allograms and careful thought should be given to the amount of allography that will be represented in an edition or encoded in a corpus.

[2] There is an obvious simplification in this description: there are no pure phonographic or logographic systems (the two categories phonography and logography overlap): alphabet-based systems are basically or originally phonographic but acquire over time logographic traits. Also, mature users of an alphabetic system process the written words holistically, i.e. they read logographically with no intermediate “letter-by-letter” sequential analysis of the written forms. There are graphemic elements in some alphabetic systems that represent syllables or parts of syllables and morphemes; in other cases there are graphemes with no phonemic, morphemic or lexemic status or content. For instance, the English verbal morpheme ‹ed› corresponds to three context-dependent phonetic possibilities; the two-letter sequence ‹ed› is a ‘graphomorpheme’ composed of two graphemes. Logographic systems can contain graphophonemes (grapholexemes/logograms used as graphophonemes) and contextual graphemes (determinatives) that do not correspond to any explicit linguistic element but define a linguistic category, like gender, animate/inanimate, type of object/entity, etc. Japanese writing is a mixed system: it contains logographic, syllabographic and phonographic subsystems.

4. Orthography vs. alphabet

An alphabet is a code, a set of symbols, created for the purpose of representing language. The letters have a basic or general phonographic value (which goes back to their Roman origin) but because the same Roman alphabet (with slight adaptations) is currently used in many orthographies associated with completely different languages the precise value of a letter can only be ascertained after careful examination of a language and its orthography (i.e. after graphemic analysis). For instance, one can say that the letter P represents a ‘p-sound’ (a voiceless bilabial stop); that is true in most instances, but in English receipt and European Portuguese recepção “reception” the letter P does not match any phonic unit; in English pit the P corresponds to an aspirated stop, but in tip it corresponds to an unaspirated unreleased stop. As for the letter C, it has so many strikingly different values throughout European orthographies (including Turkish), that it is impossible to state a ‘basic value’ for this letter regardless of a specific language.

Alphabets are not ‘writing systems’ proper, orthographies are. Letters and characters are the building blocks of orthographies but an orthography is much more than an inventory of symbols. An orthography is a normalized and codified writing system. It is a hard-learned protocol which is rigidly imposed either by social consensus or by law and the users have no say or choice in the way they spell. The concept of orthography entails the notions of norm, correctness and error; it also precludes the possibility of particular or regional practices in a public context. Thus in written languages endowed with an orthography a deviant spelling is always a mistake, not a variant possibility.

The situation of the vernacular languages of Europe in the Middle Ages and the Renaissance was different from the present: not only were different versions of the Roman alphabet developed and extensively used, but individual and regional diversity and variation in writing were also widespread. The concept of orthography was widely known and it was enforced in written Latin and Greek (by scholars) but not in the written vernaculars. Those languages did not have proper orthographies, although it has become common practice for many scholars to use ‘orthography’ as a neutral term for ‘spelling system’ when dealing with medieval and early modern printed texts.

5. Transcription vs. transliteration

The edition of a medieval text results from an editorial programme or agenda which presupposes an interpretation of textual data. An edition is always a process of mediation of a text. Through this mediation the text is stripped of its original mode of presentation, according to the editor’s agenda. There is no such thing as a definitive or objective edition of a medieval text. Peter Robinson, an eminent example of a philologist of the computer age, noted wisely:

Interpretation is fundamental to transcription. It cannot be eliminated, and must be accommodated. [...] Transcription of a primary textual source cannot be regarded as an act of substitution, but a series of acts of translation from one semiotic system (that of the primary source) to another semiotic system (that of the computer). Like all acts of translation, it must be seen as fundamentally incomplete and fundamentally interpretive. (Robinson 1994: 9; my emphasis)



The type of edition that linguists, language historians and philologists require is one that presents a high degree of faithfulness regarding textual data, graphemic data and graphetic data. Any other kind of edition will always require the direct examination of the manuscript or of a good facsimile. Taking for granted that there is no such thing as ‘philological truth’, one can still argue the case for ‘philological truthness’ or, better, ‘faithfulness’. But how does one measure or gauge the degree of faithfulness of an edition? In my view the issue of philological faithfulness is contingent on the adoption of a set of clear-cut principles and procedures which take into account the crucial distinction between transcription and transliteration.

There has always been a (serious) problem with the correct understanding of what ‘transcription’ is in medieval philology. In most editions the ‘transcription’ of the text(s) is generally preceded by a list of transcriptional criteria or procedures adopted by the editor(s). Most editors fail to realize that their criteria for transcription are in fact criteria for transliteration.

Transcription is the representation of a text by means of the original character set and the systematic glyph set: transcribing a text requires that both the original character and glyph sets are represented faithfully or in an unambiguous way. To me this is the measure of ‘truthness’ of an edition. Thus an egyptologist will use hieroglyphic, hieratic or demotic characters to transcribe an Egyptian text, a sanskritologist will use the devanagari script to transcribe Vedic or Classical Sanskrit texts and a Norse philologist will use some form of fuþark to transcribe runic texts, whether they are transcribing their specimens manually or using a computer. Their need for faithful transcriptions will not exempt those scholars from making transliterations using the Roman alphabet in order that their texts be more accessible to nonspecialists. Likewise the medieval philologist should use glyphs that unmistakably match the special medieval glyphs that are no longer in use in present times in order to transcribe the original character set of medieval texts. Ensuing transliterations should be grounded on faithful transcriptions.

We should bear in mind that transcription is not depiction of a text in its original medium, a facsimile. Facsimiles are no doubt useful and each and every Portuguese medievalist would be grateful if they could have direct access to good quality digital reproductions of the thousands of medieval manuscripts kept at the Lisbon National Archive. Facsimiles can be invaluable aids to research. But facsimiles are not editions; a corpus of digital images of manuscripts is just a corpus of images, not a textual corpus. Archives and corpora of images are not queryable objects like textual corpora. Philologists work with texts, not images: when a philologist transcribes the text their goal is to capture or represent the text (not an image of the manuscript). A highly conservative palaeographic edition will never be a facsimile nor should it ever purport to be one.



Transliteration implies substituting a different character set for the original one. To transliterate a text is to represent it by means of a character set and a glyph set that are structurally and formally different from those present in the manuscripts. If transliteration were always performed homothetically there should be no real problem, but the fact remains that most editions of medieval texts involve criteria for transliteration that deliberately mutilate and disfigure both the graphemic and graphetic reality of the texts: they do not just transliterate, instead they ‘adapt’, ‘normalize’, ‘modernize’ and whatnot. Most editions of medieval texts reduce the amount of graphemic and graphetic information present in the texts because of the limitations of the current typographic version of the Roman alphabet.

Most editors in fact transliterate their texts when they state that they are transcribing them. This common misunderstanding stems, in my view, from the fact that traditional philologists and historians (who are responsible for many editions of medieval sources which are unfortunately useless for certain types of linguistic analyses) fail to recognize that medieval character sets are different from their modern counterparts. Most editors do not seem to realize that medieval scripts and medieval scribal practices present a reality that is completely different from our contemporary printed versions of the Roman alphabet: the fact that a medieval Portuguese text makes use of the Roman alphabet does not mean that the structure of the script and of the spelling system thereof are the same as those of a contemporary Portuguese text. Replacing medieval characters (both literals and nonliterals) by modern print is transliteration, not transcription.

A noteworthy case is the way most philologists and historians handle medieval abbreviations and punctuation. Abbreviations were an important element of medieval writing and because our present form of the Roman alphabet does not contain the special signs that were used in abbreviations most editors feel that they must alter the texts by substituting modern characters for the old abbreviatory signs or marks. The procedure most commonly used is to expand abbreviations (i.e. to replace brachygraphs by sequences of letters) according to the interpretation and the intuition of the editor. As for punctuation, many editors simply ignore the original system of punctuation and insert punctuation in their editions according to the principles of their own language. Furthermore they separate words and text units (such as titles, paragraphs and verses) according to modern practice, ‘normalize’ capitalization and simply ignore intermediate letter cases such as uncials, enlarged minuscules or small capitals. All these editorial procedures are taken for granted as part of transcription.

If an edition of a medieval text is intended for use by an audience of nonspecialists all the aforementioned operations of transliteration are of course



legitimate because they ensure that the text be received by a contemporary audience of nonphilologists. People who just want to know and enjoy the works of the past as part of their education, self-improvement or entertainment are not willing to tackle a foreign or alien writing system (nor should they be forced to!). They need, nay, they demand a fully accessible rendition of the text. Scholarly editions are a whole different business: they are made for a scholarly community and for scholarly purposes in compliance with strict requirements of accuracy and faithfulness.

I wrote above «an edition will better represent a text to the extent that it entails the least amount of transliteration operations». A practical consequence of this statement which I regard as a crucial guideline is the need for special typefaces as a means of ensuring i) that the original character sets be faithfully preserved in editions and ii) that the original character set be unambiguously recognized by a (human) reader. Another consequence is that all special medieval characters and systematic glyphs should be recognized and encoded by the Unicode Consortium and ISO and should be included in the Universal Character Set (UCS). An interim solution, which several philologists and projects have adopted, is of course the use of provisional codepoints in the Private Use Area of a Unicode-compliant font.

6. The typographic representation of medieval primary sources

There are three basic strategies in the typographic representation of medieval texts:

1) indirect (deferred) representation
2) direct (straightforward) representation
3) normalization

All modern scholarly editions can be adequately labelled according to this simple scheme. Most fall under the heading of ‘normalization’ due to their extensive use of transliteration procedures. Only the first two editorial strategies allow for the representation of medieval character sets.
I present below in an addendum some examples of these different approaches to typographic representation by means of a short excerpt from a Latin-Portuguese 10th-century charter written in cursive Visigothic script. Strategy 1 (indirect representation) is especially suited for electronic processing because it requires the use of a text-encoding application, such as the Text Encoding Initiative (TEI), based on a markup language like XML (previously SGML). Editions produced according to this strategy thus make use of annotation and entities (cf. addendum—‘indirect representation’): word



abbreviations can be encoded with the TEI core elements and (among others available) whereas abbreviatory signs and modifications of literals for abbreviatory purposes can be encoded as entities (cf. Robinson 1994, Parkinson & Emiliano 1999 and 2003 and Chapter 11, Representation of Primary Sources, of TEI P53). All special medieval characters, both literal and 3

Chapter 11 of TEI P5 presents and discusses briefly the encoding of medieval abbreviations; several approaches, using different elements are proposed. They all seem, however, to rely on the (implicit) assumption that the ideal edition of a medieval text is a normalized edition: representation of medieval characters seems to be peripheral. The simple fact that abbreviation encoding is discussed in a section called “Altered, Corrected, and Erroneous Texts” (11.3) downplays the role of abbreviations as an important feature of medieval character sets. Abbreviations were not simply scribal devices used to shorten words and to speed up writing: they formed a rich subsystem of characters and were an integral and fundamental part of a scribe’s graphetic and graphemic competence. The proposed use for the and elements is not grounded in good and sound palaeographic doctrine: «The content of the abbr element should usually include the whole of the abbreviated word, while the expan element should include the whole of its expansion.» (11.3.2 Abbreviation and Expansion) This proposal does not take into account that there were distinct types of abbreviations, namely, lexical abbreviations, whereby a whole word was abbreviated by contraction and/or suspension and the simultaneous use of a generic brachygraph (usually an overline of variable length), abbreviation signs (modified letters or special spacing characters) and abbreviation marks (combining supra- and infralinear characters). Version 5 of the TEI Guidelines contains the additional elements (=glyph!), (=abbreviation mark) and (=editorial expansion) and uses the term ‘brevigraph’ to refer to special abbreviatory characters. This term is an unwarranted admixture of Latin and Greek formatives; ‘brachygraph’ or ‘brachygram’, of Greek origin, should be preferred since the words ‘brachygraphy’ and ‘brachygraphic’ already exist in the English language. 
As for the and elements they seem to be redundant, at least in some cases, with respect to and . To make this point clear: TEI illustrates the use of these elements with a sequence from an English medieval text that contains the word eu(er)y (the letters enclosed by round brackets are the expansion of a combining supralinear brachygraph). TEI proposes the following encoding solutions: euy> and euery, > among other possibilities. If the and elements are defined and used properly (i.e. associated strictly to the abbreviated letter-sequences, not to the whole word as a matter of principle), the use of and in abbreviation signs and marks is redundant; like this: euy and euery. The first example, with the element , is odd from a philological and a palaeographic point of view: the encoded text is not really text but a (=glyph) element whose attribute is a character reference (!). The Unicode block Latin Extended-D (range A720-A7FF) contains several medieval characters that were recently accepted into the UCS (cf. Everson et al. 1996a, 1996b). The designations proposed and accepted for several abbreviatory signs include the word letter; e.g. the abbreviatory sign small P with stroke (which stood for per, also parr in Medieval Portuguese) was deliberately encoded with the name LATIN SMALL LETTER P WITH STROKE THROUGH DESCENDER R (U+A751). Other abbreviatory (spacing) signs included in Latin Extended-D that bore no relation to an existing literal were also named letters. This stresses the fact that abbreviatory signs had the same importance as literal characters in medieval scripts and that they are not simply a problem to be solved by editorial policy. They possessed full “character-ness” and were not glyphs as the TEI element might suggest to the uncautious reader of the TEI Guidelines.



nonliteral, can be indirectly (but unambiguously) represented by XML entities listed in the corpus’ Document Type Definition (DTD). “Special” characters are transcribed as entity references in the body of the edition and no attempt is made to represent them directly (‘glyphically’). This procedure has been advocated by the Text Encoding Initiative (since its inception) and the Digital Scriptorium, among others.

Strategy 2 (direct representation) implies that each and every medieval character (abbreviation marks and signs included) is explicitly and unambiguously represented in the edition. There are two ways to achieve direct typographic representation:
1) anisomorphic direct representation
2) isomorphic direct representation
The terms ‘isomorphic’ and ‘anisomorphic’ in this context refer to the presence or absence of a direct and absolute match between characters in the edition and characters in the manuscript. Anisomorphic representation can in turn be implemented in two very distinct ways:
1) creation of a set of typographic conventions that map sequences of characters of the Basic Latin character set to medieval characters;
2) creation of a set of SGML/XML entities and use of entity references in the body of the edition; the entities are mapped to UCS or Private Use Area character codepoints for display purposes.
In Solution 1 each medieval character that has no direct match in the Basic Latin set is mapped to a combination of characters. It is a cumbersome way of transcribing and representing a medieval text (an edition made according to this strategy is not easy to read) but it is effective (cf. Parkinson 1983). With the widespread use of markup languages and the development of TEI and similar XML applications this strategy is clearly outdated. Solution 2 has been successfully adopted and used by the Medieval Nordic Text Archive (MENOTA)4, a project of outstanding quality and scholarship.
4 Medieval Nordic Text Archive (Arkiv for nordiske middelaldertekster), Forskergruppe for tekstteknologi / Avdeling for kultur, språk og informasjonsteknologi (AKSIS) / Universitetet i Bergen, Norge; cf. The Menota Handbook 2.0.

Issues in the Typographic Representation of Medieval Primary Sources

The actual text file of their editions is in XML format and contains entity references representing “special” medieval letters and signs. The editions are meant to be read with a web browser. The philologists who prepare the editions work solely with entity references, but the end-users of the editions see only glyphs by means of a Unicode-compliant font, not the entity references (cf. addendum, ‘anisomorphic direct representation with XML/TEI entities’). The MENOTA project uses the MUFI guidelines and the MUFI font (q.v. infra). In my view, an important downside of this solution is the fact that the base text contains entity references for special medieval characters, while the browseable version containing special glyphs is meant for display purposes only5. Any type of search or data extraction operation (such as generating wordlists and concordances) will have to be performed directly on the base text, and the search parameters will have to refer to entities, not to medieval characters. The upside is the fact that any change in the codepoints of medieval characters (for instance, when a character that was provisionally located in the Private Use Area is officially recognized and accepted into the UCS) requires a single correction in the corpus’ DTD instead of multiple corrections in the body of the edition. But this is an upside for the encoders/curators, not for the researchers who are the corpus’ end-users. I strongly believe that corpus building and corpus structure should always be user-oriented, not encoder- or curator-oriented. Isomorphic representation is achieved by designing, and using in the body of the edition, a Unicode-compliant computer typeface containing medieval characters and glyphs (cf. addendum, ‘isomorphic direct representation’).
The last two solutions for direct representation, entity-based (anisomorphic) and character-based (isomorphic) respectively, are not mutually exclusive, and the use of the latter does not preclude the philologist from adopting the former at any given point in time: substituting one solution for the other is just a matter of automatically replacing entity references with characters and vice versa. For methodological reasons I strongly prefer solution 2, which dispenses with entity references in character representation (Emiliano 2002, 2004a, 2004b). Thus I have collaborated with the Medieval Unicode Font Initiative (MUFI)6 in the development of an inventory of medieval characters, a Unicode-compliant ‘medieval’ font and two medievalist proposals submitted to the Unicode Consortium7.
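The claim that the entity-based and character-based representations can be converted into each other automatically can be illustrated with a short Python sketch. The entity names (pbardes, rum), the two-entry table and the sample string are assumptions made purely for illustration; they are not taken from the MENOTA or MUFI entity lists or from the charter in the addendum, and a real edition would load its own entity declarations.

```python
import re

# Hypothetical entity table: entity name -> character.
# U+A751 and U+A775 are Latin Extended-D characters mentioned in this chapter;
# the entity names themselves are illustrative assumptions.
ENTITIES = {
    "pbardes": "\ua751",  # LATIN SMALL LETTER P WITH STROKE THROUGH DESCENDER
    "rum": "\ua775",      # LATIN SMALL LETTER RUM
}
CHARACTERS = {char: name for name, char in ENTITIES.items()}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def to_isomorphic(text):
    """Replace known entity references with the characters they stand for."""
    def expand(match):
        # Entities missing from the table (e.g. &amp;) are left untouched.
        return ENTITIES.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(expand, text)

def to_entities(text):
    """Replace known special characters with entity references."""
    return "".join(f"&{CHARACTERS[c]};" if c in CHARACTERS else c for c in text)

base = "&pbardes;dono"               # made-up transcription, for illustration only
direct = to_isomorphic(base)
assert to_entities(direct) == base   # the round trip loses nothing
```

In the character-based form a search can target the character itself (for example, testing whether "\ua751" occurs in direct), whereas in the entity-based form the query has to name the entity reference.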



5 Actually there is no separate version of the editions with medieval characters: what a browser does is interpret the entity references as Unicode or Private Use Area codepoints and render them as medieval glyphs for display purposes only.
6 Medieval Unicode Font Initiative (MUFI), Forskergruppe for tekstteknologi / Avdeling for kultur, språk og informasjonsteknologi (AKSIS) / Universitetet i Bergen, Norge; cf. MUFI Character Recommendation 3.0.



The main advantage in adopting isomorphic typographic representation of medieval characters in a corpus is to make life easier for end-users, i.e. the researchers who actually use the editions as raw material for their work. After all, it is for their use that corpora are created in the first place: corpora are a means to an end, not an end in themselves; the end in this case is research and knowledge. Isomorphic representation, which relies on the creation of a Unicode-compliant font, does not require that entity references be converted into glyphs: this means that the edition can actually be read with any text editor. Also, searches and data extraction are simple to perform because the base text contains all the required special characters. A final noteworthy upside of isomorphic direct representation concerns the work of the corpus’ encoders and curators: this type of representation allows for the immediate visualization and verification of an edition in progress. Proof-reading and correction of the edition are also greatly simplified. This upside should not be underestimated when one is dealing with a large corpus and with entities whose names are long sequences of letters or just codepoints and thus easily subject to mistakes.

Finally, strategy 3 (normalization) can be implemented in a variety of ways, according to the aims and agenda of the editor. Normalization basically relinquishes any attempt at direct representation and results in extensive editorial intervention (cf. addendum for an example of what can be called “deep” normalization; the general outlook of the edition is modern, with insertion of modern capitalization, punctuation, word separation and generalized expansion of abbreviations). Scholarly editions of this type, usually called ‘diplomatic editions’, are extremely useful for extracting historical information, textual information and linguistic data concerning lexis and syntax.
These editions can also provide the basis for wordlists (indices verborum) and glossaries. Since normalization is in fact ‘non-representation’ (or ‘re-representation’, so to speak), it should ideally be done by philologists with a good palaeographic and diplomatic background. Normalized editions should rely on bona fide transcriptions.

7. To sum up

The most important requirement that must be met in the typographic representation of medieval texts is, in my view, accuracy of transcription (with clarity and consistency in the definition of editorial criteria). The actual strategy and the precise tactic or expedient adopted will ultimately depend i) on the type of study one intends to carry out and ii) on the nature of the corpus one wants to build.

7 See Emiliano and Pedro 2003, Everson et al. 2006a, 2006b, Everson et al. 2007, Unicode Consortium 2009 (Latin Extended D).



Digital typography has come a long way in the last decade. There are virtually no limits to what one can represent typographically in web-accessible plain text. As a philologist I wonder every day at how far we have come in terms of possibilities. But the edition of medieval texts in the computer age is not about what can be represented in print or on a computer screen: it is about what should be represented. Both over-representation and under-representation can be instances of bad philology: the former eludes the importance of characters and systematic glyphs as basic transcriptional units, the latter simply ignores the reality of medieval texts and thus disfigures them. In medio virtus.

This paper is respectfully and gratefully dedicated to Michael Everson.

Addendum

Latin-Portuguese 10th-century charter (a.D. 977): excerpt of a deed of gift to a monastery

Date: 0977/04/22
Place: Municipality of Santa Maria da Feira (Northern Portugal)
Archive ref.: Instituto dos Arquivos Nacionais / Torre do Tombo, Sé de Coimbra, maço 1, n.° 5 (reference code PT-TT-CSC/1/5)
Type: Private document (deed of gift); original ms. in cursive Visigothic script
Editions: Herculano, A. (ed.) 1867-73, vol. I, doc. CXX, 75; Santos, M.J. 1994: 323-4
Scribe: Inuenando
Subject: Penedruia donates several real estate items to the Monastery of S. João-de-Ver.

Facsimile (detail)8

Indirect representation

Editorial conventions
abbr     XML/TEI element = abbreviation
add      XML/TEI element = scribal addition
expan    attribute of XML/TEI element <abbr> = expansion
l        XML/TEI element = line
n        attribute of XML/TEI element <l> = line number
place    attribute of XML/TEI element <add> = place of scribal addition
punc     XML/TEI element = punctuation
#a775    Unicode character reference (U+A775, LATIN SMALL LETTER RUM)
_        underscore sign = word juncture
=        equals sign = word disjuncture9
bold     bold text is “visible text” (regular text is annotation)

8 Detail extracted from matrix No. PT-TT-CSC-1-5_1_m0001.tif of the project Origins of the Portuguese Language: Digitization, Edition and Linguistic Analysis of Charters from the 9th and 10th centuries (ORIGENS DO PORTUGUÊS: DIGITALIZAÇÃO, EDIÇÃO E ESTUDO LINGUÍSTICO DE DOCUMENTOS DOS SÉCULOS IX-X, POCI/LIN/58815/2004, funded by Fundação para a Ciência e a Tecnologia).

Anisomorphic direct representation with XML/TEI entities

List of entities used


9 No elements are provided in the TEI Guidelines to handle word juncture and word disjuncture in a straightforward way. The only elements that could be used in theory to that effect are <seg> (= segment) and . However, their suggested use encompasses many different situations, none of which corresponds to the common medieval problem of word separation. Cf. TEI Character Encoding Workgroup 2004 (CE W 12) and Bański & Przepiórkowski 2009 for short discussions of related issues.


Editorial conventions
abbr     XML/TEI element = lexical abbreviation
add      XML/TEI element = scribal addition
l        XML/TEI element = line
n        attribute of XML/TEI element <l> = line number
place    attribute of XML/TEI element <add> = place of scribal addition
punc     XML/TEI element = punctuation
&text;   XML entity reference
_        underscore sign = word juncture
=        equals sign = word disjuncture
bold     bold text is “visible text” (regular text is annotation)




Isomorphic direct representation

Editorial conventions
abbr     XML/TEI element = lexical abbreviation
add      XML/TEI element = scribal addition
l        XML/TEI element = line
n        attribute of XML/TEI element <l> = line number
place    attribute of XML/TEI element <add> = place of scribal addition
_        underscore sign = word juncture
=        equals sign = word disjuncture

“Deep” normalization

Editorial conventions
Normalized capitalization, punctuation and word separation (according to contemporary practice in print). Abbreviations are expanded. The text is divided into numbered paragraphs (a procedure which is unfortunately not current practice in diplomatic editions).
lb       XML/TEI element = line break
n        attribute of XML/TEI element <p> = paragraph number
p        XML/TEI element = paragraph
bold     bold text is “visible text” (regular text is annotation)



References

Bański, P. and A. Przepiórkowski. 2009. “Stand-off TEI Annotation: the Case of the National Corpus of Polish”. Proceedings of the Third Linguistic Annotation Workshop, LAW III (Suntec, Singapore, 6-7 August 2009), ACL-IJCNLP. 64-7. index.html, [30/06/2010]
Digital Scriptorium. 2007. “Technical Information”. Columbia University Libraries, technical/index.html [30/06/2010]
Emiliano, A. 2002. “Problemas de transliteração na edição de textos medievais”. Revista Galega de Filoloxía, 3. 29-64.
Emiliano, A. 2004a. “Tarefas da Filologia Portuguesa face à documentação antiga de Portugal”. Actas do XIX Encontro Nacional da Associação Portuguesa de Linguística (Faculdade de Letras da Universidade de Lisboa, 1-3 de Outubro de 2003). Lisboa: Colibri, APL. 58-68.
Emiliano, A. 2004b. “A edição e interpretação da documentação antiga de Portugal: problemas e perspectivas da Filologia Portuguesa face ao estudo das origens da escrita em Português”. Aemilianense. Revista Internacional sobre la génesis y los orígenes históricos de las lenguas romances 1. (Proceedings of I Congreso Internacional sobre «Las Lenguas Romances en su Origen», Fundación San Millán de la Cogolla, Logroño, Spain, Monastery of San Millán de la Cogolla, 16-20 December 2003). 33-63.
Emiliano, A. and S. Pedro. 2003. The Portuguese Medieval Font Project and the Medieval Unicode Font Initiative. TM_Unicode.pdf, [30/06/2010]
Everson, M., Baker, Emiliano, Grammel, Haugen, Luft, Pedro, Schumacher and Stötzner (ed). 2006a. Proposal to add medievalist characters to the UCS. Universal Multiple-Octet Coded Character Set, International Organization for Standardization, Organisation internationale de normalisation, Международная организация по стандартизации, ISO/IEC JTC1/SC2/WG2 N3027, L2/06-027, 2006-01-30 (Working Group Document), [30/06/2010].
Everson, M., Baker, Emiliano, Grammel, Haugen, Luft, Pedro, Schumacher and Stötzner (ed). 2006b. Response to UTC/US contribution N3037R, “Feedback on N3027 Proposal to add medievalist characters”. Universal Multiple-Octet Coded Character Set, International Organization for Standardization, Organisation internationale de normalisation, Международная организация по стандартизации, ISO/IEC JTC1/SC2/WG2 N3077, L2/06-116, 2006-03-31 (Expert Contribution for consideration by JTC1/SC2/WG2 and UTC), [30/06/2010].
Everson, M., Baker, Dohnicht, Emiliano, Haugen, Pedro, Perry, Pournader (ed). 2007. Proposal to add Medievalist and Iranianist punctuation characters to the UCS. Universal Multiple-Octet Coded Character Set, International Organization for Standardization, Organisation internationale de normalisation, Международная организация по стандартизации, Working Group Document for consideration by JTC1/SC2/WG2 and UTC, [30/06/2010].
Herculano, A. (ed). 1867-73. Portugaliae Monumenta Historica a Saeculo Octavo post Christum usque ad Quintum Decimum: Diplomata et Chartae, vol. I. Lisboa: Real Academia das Sciencias.
Medieval Nordic Text Archive. 2008. The Menota Handbook 2.0. http://www. [30/06/2010].
Medieval Unicode Font Initiative. 2009. MUFI Character Recommendation 3.0. [30/06/2010].
Parkinson, S. 1983. “Um arquivo computorizado de textos medievais portugueses”. Boletim de Filologia 28. 241-252.
Parkinson, S. and A. Emiliano. 1999. Encoding Medieval Portuguese and Latin Texts for Computer Analysis: development of TEI conformant tagging guidelines for linguistic study of electronic corpora of Medieval texts. ms., Oxford and Lisbon, Final Report of Project Ref. B-38/98 of the Anglo-Portuguese Joint Research Programme Treaty of Windsor 1998, [30/06/2010].
Parkinson, S. and A. Emiliano. 2002. “Encoding medieval abbreviations for computer analysis (from Medieval Latin-Portuguese and Portuguese nonliterary sources)”. Literary and Linguistic Computing 17. 345-360.
Robinson, P. 1994. The transcription of primary textual sources using SGML. Oxford: Office for Humanities Communication Publications, Oxford University Computing Services.
Santos, M.J. 1994. Da Visigótica à Carolina. A escrita em Portugal de 882 a 1172. Lisboa: Fundação Calouste Gulbenkian, Junta Nacional de Investigação Científica e Tecnológica.
TEI Character Encoding Workgroup. 2004. “The ‘end of word’ problem in Sanskrit: report of the workgroup”. CE W 12: Report from Sanskrit Workgroup, 2004, [30/06/2010]
TEI Consortium (eds). 2007. TEI P5: Guidelines for Electronic Text Encoding and Interchange. [TEI P5 1.0], [30/06/2010].
Unicode Consortium. 2009. “Latin Extended D, Unicode 5.2 Character Code Chart”. The Unicode Standard, Version 5.2. charts/PDF/UA720.pdf, [30/06/2010]

An Analysis of the Misuse of the Participle in Old Russian Texts
Yoshinori ONDA

1. Introduction

Throughout the history of the Slavonic literary language, its participles have changed in usage and form. Specifically, Russian as a literary language (Russ. литературный язык) has lost its formerly ordered paradigm and has confused the functions of participles and adverbs1. The aim of this study is to analyze the misuse of participles, which occurred in the early stage of this change, and to propose a functional explanation of its causes.

1.1. Old Church Slavonic and Old Russian

Old Church Slavonic (OCS) is the first Slavic literary language, which was established in the 9th century by Saint Cyril and his brother Saint Methodius (Russ. Кирилл и Мефодий) to spread Christianity among the Slavic peoples of Great Moravia and Pannonia. OCS is thought to preserve old forms very close to Proto-Slavic; it is classified as South Slavic and has Bulgaro-Macedonian dialect features. The earliest manuscripts have been lost, and only manuscripts dating from the tenth and eleventh centuries are extant. The following texts are listed as the standard texts (or canon)2: Codex Zographensis (Zo., tetraevangelia, 271 folia), Codex Marianus (Mar., tetraevangelia, 174 folia3), Codex Assemanianus (Ass., aprakos, 158 folia), Psalterium Sinaiticum (Ps., psalter, 177 folia), Euchologium Sinaiticum (Euch., eastern missal, 109 folia), Glagolita Clozianus (Cloz., homilies, 14 folia), Savva’s book (Sav., aprakos, 129 folia), Codex Suprasliensis (Supr., menaeum, 285 folia). The newly discovered Vatican palimpsest Cyrillic lectionary (Vat., aprakos, 96 folia) is also included among the OCS canonical texts4.

In Russia, OCS was introduced as the language of the church in the 10th century, when Christianity became the established religion. OCS was rapidly affected by Old East Slavic, and Russian Church Slavonic arose5. This language was in

1 Russ. причастие and деепричастие.
2 Kimura (1985: 25-27).
3 173 folia, Цейтлин (1994: 14).
4 For more details about OCS, see Lunt (2001: 1-5) or Comrie (2001: 125-126).



wide use as a literary language by the seventeenth century. In this paper, we shall refer to this language as Old Russian (OR).

1.2. Participles in OCS

The participles found in OCS texts have a rich morphology and typically agree with related nouns in gender, number, and case. They are typically used to express the incidental action of the subject of the sentence (the other usages of participles are detailed in 2.1.2.). Below is an example of a participle from the Codex Marianus, one of the OCS manuscripts6:

(1) i šedъ isъ
    and to go, Past-A Part. M. Sg. Nom. SF. Jesus, Noun. M. Sg. Nom.
    vidě narodъ mъnogъ. i mili emu byšę.
    to see, V. Aor. 3. Sg. people many and have compassion him were
    zane běxǫ ěko ovъcę ne imǫštę pastyrě. i načętъ učiti
    because were like sheep not had shepherd and began to teach
    ję mъnogo. (Mar.)7
    them many
    [Jesus came out, saw a great multitude, and he had compassion on them, because they were like sheep without a shepherd, and he began to teach them many things.] (Mk. 6.34)

In example (1), the participle šedъ (to go) represents the action of the sentence’s subject isъ (Jesus) and precedes the verbal action vidě (to see). The structure of this sentence can be simplified as follows:

SUBJECT + PARTICIPLE + VERB ...(STRUCTURE 1)

In this type of usage, a coordinating conjunction rarely appears between the participle and the verb. Example (2) below is taken from another OCS text, Savva’s Book, to illustrate the presence of a conjunction between a participle and a verb:

(2) razgněvavь že sę i ne xotěše
    to anger, Past-A Part. M. Sg. Nom. SF. but and not to want, V. Imp. 3rd. Sg.


5 Even in the Ostromir Gospel, which was written in 1056-57, features of the East Slavic dialect are found.
6 For the transliteration, see the appendix of this paper.
7 Jagić (1960: 138).



    vьniti. ocь že ego išъdъ
    to go in father then his to go, Past-A Part. M. Sg. Nom. SF.
    molěše i. (Sav.)8
    to entreat, V. Imp. 3rd. Sg. him
    [(But) he was angry, and would not go in. Therefore his father came out, and begged him.] (Lk. 15.28)

The structure of the type seen in example (2) can be simplified as follows:

SUBJECT + PARTICIPLE + CONJUNCTION + VERB ...(STRUCTURE 2)

Strictly speaking, statements in which participles are used in the above manner do not require a conjunction. Such structures are not common in OCS literature; they appear more frequently, however, in OR literature. The purpose of this study is to discuss the causes of this misuse of the participle and the reason for its appearance in OR texts.

1.3. The Definition of “misuse”

According to Frei (1929: 17-20), the definition of correct language varies with the point of view. From the standpoint of normative grammar, correct means conforming to social norms. From the standpoint of functional linguistics, on the other hand, correct means fulfilling the functions of language. In this study, following Frei, we divide incorrect into two distinct notions. The first is faute (misuse), indicating something which fails to fulfill a grammatical norm. The second is déficit (deficiency), describing that which fails to carry out a linguistic function. Our focus is mainly on misuse9.

2. Correct and incorrect usages of participles in OCS literature

2.1. The norm of the OCS participle

2.1.1. Morphology

OCS participles are morphologically divided into active/passive, present/past, and short/long forms. They follow paradigms of gender (masculine, feminine, and neuter), number (singular, dual, and plural), and case (nominative, genitive, dative, accusative, instrumental, and locative). We should note that present participles represent actions that occur simultaneously with the actions represented by the verbs in the sentences. The past participles, on the other

8 Щепкин (1959: 54).
9 In translating these terms, we also referred to the Russian ошибка (faute) and недостаточность (déficit).



hand, represent actions that occur prior to the actions represented by the verbs.

2.1.2. Usage

In addition to instances in which the OCS participle is used to represent an incidental action (example 1), the OCS participle can occur in a definitive usage, in which it qualifies the noun as if it were an adjective:

(3) ašte že sěno selъnoe dьnesъ sǫštee
    but then hay, Noun. N. Sg. Acc. field today to be, Pres-A Part. N. Sg. Acc. LF.
    a utrě vъ ognь vьmetomo. bъ tako
    but tomorrow in fire to contain, Pres-P Part. N. Sg. Acc. SF. God like this
    oděetъ. kolьmi pače vasъ malověri. (Mar.)10
    clothe how more you lack of faith
    [But if God so clothes the grass of the field, which today exists, and tomorrow is thrown into the oven, won’t he much more clothe you, you of little faith?] (Mt. 6.30)

Complementary usage complements some verbs, perception verbs in particular:

(4) i kъnižъnici farisei. viděvъše i
    and scribes Pharisees saw he, Pron. M. Sg. Acc.
    ědǫštъ sъ mytari i grěšъniky. glaaxǫ
    to eat, Pres-A Part. M. Sg. Acc. SF. with tax collectors and sinners said
    učenikomъ ego. čto ěko sъ grěšъniky ěstъ i pьetъ. (Mar.)11
    to disciples his why with sinners eats and drinks
    [The scribes and the Pharisees, when they saw that he was eating with the sinners and tax collectors, said to his disciples, “Why is it that he eats and drinks with tax collectors and sinners?”] (Mk. 2.16)

Participles are often used as nouns by themselves:

(5) mьně podobaatъ dělati děla posъlavъšaago mę.
    for me must to do work to send, Past-A Part. M. Sg. Gen. LF. me
    donьdeže denь estъ pridetъ noštъ. egda niktože ne možetъ
    while day is comes night when no one not can
    dělati. (Mar.)12
    work
    [I must work the works of him who sent me, while it is day. The night is coming, when no one can work.] (Jo. 9.4)

10 Jagić (1960: 18).
11 Jagić (1960: 121-122).
12 Jagić (1960: 353).

The participles can form adverbial clauses called dative absolutes, which describe conditions related to time, cause, and so on. In this usage, both the participle and its subject appear in the dative case:

(6) i vъstavъ ide kъ otčju svoemu. ešte že emu daleče
    and arose came for father his was but he, Pron. M. Sg. Dat. far
    sǫštju. uzьrě i otecъ ego. i milъ
    to be, Pres-A Part. M. Sg. Dat. SF. saw him father his and have compassion
    emu bystъ. i tekъ napade na vyjǫ ego i oblobyza i. (Mar.)13
    for him was and ran fell on neck his and kissed him
    [He arose, and came to his father. But while he was still far off, his father saw him, and was moved with compassion, and ran, and fell on his neck, and kissed him.] (Lk. 15.20)

2.2. Misuse of the participle

Let us take a closer look at the following incorrect structure in example (2):

(2) razgněvavь že sę i ne xotěše
    to anger, Past-A Part. M. Sg. Nom. SF. but and not to want, V. Imp. 3rd. Sg.
    vьniti. ocь že ego išъdъ
    to go in father then his to go, Past-A Part. M. Sg. Nom. SF.
    molěše i. (Sav.)
    to entreat, V. Imp. 3rd. Sg. him
    [(But) he was angry, and would not go in. Therefore his father came out, and begged him.] (Lk. 15.28, Reused)

In example (2), razgněvavь is a past active participle, meaning to anger, while xotěše is the past tense form of a verb that means to want. The conjunction i (and) links the participle and the verb coordinately, but it is not needed in this sentence. Compare the first half of the sentence with the second half. The latter includes a past active participle išъdъ, meaning to go, and a past tense verb molěše, meaning to entreat, with no conjunction between the participle and the verb. This is a grammatically correct sentence. Let us look at the same part of the Gospel in other OCS texts:


13 Jagić (1960: 269).


(7) razgněva že sę i ne xotěaše vъniti.
    to anger, V. Aor. 3rd. Sg. but and not to want, V. Imp. 3rd. Sg. to go in
    otcъ že ego išedъ molěaše i. (Mar.)14
    father then his to go, Past-A Part. M. Sg. Nom. SF. to entreat, V. Imp. 3rd. Sg. him

(8) razgněva že sę. i ne xotěaše vьniti.
    to anger, V. Aor. 3rd. Sg. but and not to want, V. Imp. 3rd. Sg. to go in
    ocь že ego išьdъ molěaše i. (Zogr.)15
    father then his to go, Past-A Part. M. Sg. Nom. SF. to entreat, V. Imp. 3rd. Sg. him

(9) razgněvavъ že sę. ne xotěaše vъniti.
    to anger, Past-A Part. M. Sg. Nom. SF. but not to want, V. Imp. 3rd. Sg. to go in
    ocъ že ego išedъ molěaše i. (Ass.)16
    father then his to go, Past-A Part. M. Sg. Nom. SF. to entreat, V. Imp. 3rd. Sg. him

In examples (7) and (8), razgněva (to anger) is represented as an aorist, which is a past-tense form, while xotěaše (to want) is an imperfect, also a past-tense form. They are connected by the conjunction i (and). The resultant structure can be simplified as STRUCTURE 3 and is also grammatically correct:

SUBJECT + VERB + CONJUNCTION + VERB ...(STRUCTURE 3)

As already mentioned, the incorrect participial STRUCTURE 2 (subject + participle + conjunction + verb) is found more frequently in OR than in OCS texts. Example (10) is taken from the OR text Arkhangel’skoe evangelie:

(10) ražněvavъ že sę i ne xotęaše vъniti.
    to anger, Past-A Part. M. Sg. Nom. SF. but and not to want, V. Imp. 3rd. Sg. to go in
    ocь že jego išьdъ molęše i. (Arkh.)17
    father then his to go, Past-A Part. M. Sg. Nom. SF. to entreat, V. Imp. 3rd. Sg. him

The conjunction i (and) appears between the participle ražněvavъ (to anger) and the verb xotęaše (to want).
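The three patterns discussed so far can be told apart mechanically once a clause has been tagged. The following Python sketch assumes a hypothetical tag set (PART for a nominative participle, CONJ for a coordinating conjunction, VERB for a finite verb, X for everything else, including particles such as že); neither the tags nor the function belong to the procedure actually used in this study.

```python
def classify(tags):
    """Classify a clause, given as a list of tags, as STRUCTURE 1, 2 or 3.

    Returns None when no finite verb follows the first predicate-like
    element, or when two finite verbs are not linked by a conjunction.
    """
    # Position of the first predicate-like element (participle or finite verb).
    first = next((i for i, t in enumerate(tags) if t in ("PART", "VERB")), None)
    if first is None:
        return None
    # Position of the next finite verb after it.
    second = next((i for i in range(first + 1, len(tags)) if tags[i] == "VERB"), None)
    if second is None:
        return None
    # Is there a coordinating conjunction between the two?
    has_conj = "CONJ" in tags[first + 1:second]
    if tags[first] == "PART":
        return "STRUCTURE 2" if has_conj else "STRUCTURE 1"
    return "STRUCTURE 3" if has_conj else None

# Hand-tagged clauses modelled on examples (2) and (7); the tagging is an assumption.
first_half_of_2 = ["PART", "X", "X", "CONJ", "X", "VERB", "X"]   # razgněvavь že sę i ne xotěše vьniti
second_half_of_2 = ["X", "X", "X", "PART", "VERB", "X"]          # ocь že ego išъdъ molěše i
first_half_of_7 = ["VERB", "X", "X", "CONJ", "X", "VERB", "X"]   # razgněva že sę i ne xotěaše vъniti
```

Under these assumptions the first half of (2) is classified as STRUCTURE 2, its second half as STRUCTURE 1, and the first half of (7) as STRUCTURE 3, which matches the discussion above.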

14 Jagić (1960: 270).
15 Jagić (1954: 116).
16 Kurz (1955: 140).
17 Жуковская (1997: 187).



3. Hypotheses and materials

3.1. Hypotheses

So far, we have seen some examples of incorrect participial structures. One noticeable factor common to all of them is that the participles are in the past tense. The past participle represents an antecedent verbal action. Therefore, the order of action in STRUCTURE 1' would be α followed by β:

SUBJECT + PAST PARTICIPLE α + VERB β ...(STRUCTURE 1')

This order of action can also be represented by STRUCTURE 3' with past-tense verbs:

SUBJECT + PAST TENSE VERB α' + CONJUNCTION + PAST TENSE VERB β' ...(STRUCTURE 3')

In terms of the order of action, STRUCTURE 1' and STRUCTURE 3' are similar. From examples (2), (7), (8), and (9), we see clearly that both structures are used in parallel in OCS texts. However, we cannot attribute this error to confusion alone. We need to consider the special conditions of these languages (OCS and OR): the fact that almost all the OCS and OR texts we are referring to are manuscripts. Manuscripts were always copied from the original texts as carefully and accurately as possible. Given these facts, we present the following two hypotheses:

HYPOTHESIS 1. The similarity of the syntactic structures caused confusion regarding participle use.
HYPOTHESIS 2. The copyists’ concern for fidelity to the original text greatly influenced the nature of the copied text.

Regarding the second hypothesis, it is important to remember that the texts in question belong to church literature, which, because of the crucial cultural importance the ancient Slavic world attached to it, commanded an attitude of deep reverence. On the other hand, the attitude toward non-church literature was less reverential, resulting in its frequent misrepresentation and misuse. On the basis of these facts, we may revise Hypothesis 2 as follows:

HYPOTHESIS 2'. Frequency of misuse depends on the genre of the text.



3.2. Material

To verify these hypotheses, we shall investigate the occurrences of participles in Vita Constantini (Vita Con.). The original Vita Con. text is thought to have been written in Greek in the late ninth century. However, the 48 extant manuscripts were written in and after the fifteenth century. For this study, we have used the text edited by Lavrov in 1930 (Лавров 1930: 1-39). Lavrov's text is based on a manuscript which is thought to have been written in the 15th century, and Eastern Slavic features can already be seen in this text.

Vita Con. is based on the life of Saint Constantine-Cyril (Constantinos-Kurillos), who created the Slavs' own alphabet (the Glagolitic alphabet) and the first Slavic literary language for the purpose of the Christian mission among the Slavic peoples. In addition, Vita Con. contains episodes of his upbringing, religious debates with pagans, and many quotations from the Bible. The conversations of the characters are for the most part represented in direct discourse.

4. Quantitative results and verification of hypotheses

Vita Con. comprises a total of 484 participles. We find 26 instances of misuse, of which 23 relate to the incorrect usage of incidental action, while three relate to incorrect participle usage in dative absolute constructions.

Table 1. Total number of participles in Vita Con.

Usage               Total   Misuse
Incidental action     332       23
Definitive             40        0
Complemental           20        0
As noun                52        0
Dative absolute        36        3
Others18                4        0
Total                 484       26
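The totals in Table 1 can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch: the counts are transcribed from Table 1, while the names `TABLE_1`, `column_sums`, and `misuse_rate` are illustrative and not part of the original study.

```python
# Counts transcribed from Table 1: usage -> (total participles, misused participles).
# The variable and function names are illustrative assumptions.
TABLE_1 = {
    "incidental action": (332, 23),
    "definitive": (40, 0),
    "complemental": (20, 0),
    "as noun": (52, 0),
    "dative absolute": (36, 3),
    "others": (4, 0),
}


def column_sums(table):
    """Return (sum of totals, sum of misuses) over all usage categories."""
    total = sum(t for t, _ in table.values())
    misuse = sum(m for _, m in table.values())
    return total, misuse


def misuse_rate(total, misuse):
    """Share of misused participles, in percent."""
    return 100.0 * misuse / total


if __name__ == "__main__":
    print(column_sums(TABLE_1))
    for usage, (total, misuse) in TABLE_1.items():
        print(f"{usage:18s} {misuse:2d}/{total:3d} = {misuse_rate(total, misuse):.1f}%")
```

The column sums agree with the "Total" row of the table, and the per-category rates make it easy to see that misuse is confined to two usages.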

4.1. Verification of HYPOTHESIS 1

In order to verify HYPOTHESIS 1, we need to focus on the differences in the structures containing the misused participles.

4.1.1. Usage

As shown in Table 1, instances of misuse can be seen only in the usage of participles representing incidental action (except for three cases of error in dative absolute constructions). Cases of misused participles do not appear in the structures in which the action of the subject is not successive.

18 Idioms and other exceptions.

An Analysis of the Misuse of the Participle in Old Russian Texts


Below (11) is an example of misuse from Vita Con. occurring in the usage of participles representing incidental action:

(11) sědъ že paky filosofъ cъ kaganomъ,
to sit, Past-A Part. M. Pl. Dat. LF. then again Philosov with Khan
i reče: (Vita Con.:X)19
and to speak, V. Aor. 3rd. Sg.
[Philosov then sat with Khan again, and said:]

The three cases of the erroneous use of participles in dative absolute constructions are examples (12) to (14) as follows:

(12) došedšimъ že imъ tamo, i
to reach, Past-A Part. M. Pl. Dat. LF. then he, Pron. M. Pl. Dat. there and
bęxu obradzi napisani děmonьstii vъněudu na
to be, V. Imp. 3rd. Pl. figures painted demons out side on
dverexъ vsěxъ xristianъ, divьi tvoręšte i rugajušte(sę). (Vita Con.:VI)20
doors all Christians surprising and taunting
[When they reached there, figures of demons were painted on every door of the Christians, which surprised and taunted them.]

(13) sьinu že sę ego abie krъštešu,
son then his soon to baptize, Pres-A Part. M. Sg. Dat. SF.
togda i sam sę po nemъ krъsti. (Vita Con.:VIII)21
then also himself after him to baptize, V. Aor. 3rd. Sg.
[As soon as his son was baptized, he himself was baptized after him.]

(14) i utišьšusę morju velьmi,
and to become quiet, Past-A Part. N. Sg. Dat. SF. sea, Noun. N. Sg. Dat. very
i došedše, načęša kopati, pojušte. (Vita Con.:VIII)22
and reached to start, V. Aor. 3rd. Pl. to dig reciting
[As the sea stilled, they reached and started digging, reciting (Psalms)]

19 Лавров (1966: 15)
20 Лавров (1966: 8)
21 Лавров (1966: 12)
22 Лавров (1966: 12)

4.1.2. Tense

In this section, we shall focus on the tense of the participles. Among the 23 cases of misuse of participles representing incidental action, 22 cases are related to the past participle, and only one case is related to the present participle. The present participle represents action that is coincident with the verbal action. This order can hardly be expressed through verbs and a conjunction. The following example (15) is the only instance of a misused present participle:

(15) kako vьi imušte upovanie na člověka, i
how you to have, Pres-A Part. M. Pl. Nom. SF. hope on man and
tvoritesę blagosloveni bьiti, a knigьi proklinajutь takovago? (Vita Con.:X)23
to do, V. Pres. 2nd. Pl. blessed to be but Bible curses such thing
[How do you think that you are blessed, placing your hope on a man (Christ) although the Bible curses this very thing?]

All three cases of misused participles in dative absolute constructions are related to the past participle.

4.1.3. Word order

Now, let us look at the word order. OCS and OR are so-called inflected languages, which are characterized by a relatively free word order. However, the participles precede the verbs in 25 cases of misused participles. The only exception is the following case, in which a verb precedes the participle:

(16) nasiliemъ mę sъgnaša, a ne
with force me to drive away, V. Aor. 3rd. Pl. but not
preprěvše mene. (Vita Con.:V)24
to persuade, Past-A Part. M. Pl. Nom. SF. me
[(They) expelled me with force, not refuting me.]

Giving a definitive explanation for the exceptions would be difficult. It is clear, however, that most instances of misuse occur in sentences that have a structure similar to STRUCTURE 1' (past tense participle + past tense verb), while they rarely occur in the other structures.

4.2. Verification of HYPOTHESIS 2'

To study the relationship between the text type and the attitude of the copyists, we shall segment the texts into (a) quotation, (b) statement, and (c) conversational sentences.

23 Лавров (1966: 19)
24 Лавров (1966: 6)



A quotation is a sentence taken verbatim from the Bible. It follows, therefore, that the copyist's reverential attitude toward the Bible would be reflected strongly with regard to such quotations. A conversational sentence, on the other hand, occurs within the direct discourse of characters and reflects the colloquial language used by them. Naturally, the copyist would take a less reverential attitude to this type of text. A statement represents all other sentences, to which type the copyist's attitude seems to be neutral. Our concern in this section is the usage of incidental action, as all three cases of misused participles in dative absolute constructions appear in the b) type of text (i.e. statements).

Table 2. Percentages of misuse in the various types of texts.

Types of text      Total   Misuse   Percentages of misuse
a) quotation          20        1   5%
b) statement         197       15   7.6%
c) conversation      115        7   6%
Total                332       23
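The percentages in Table 2 follow directly from the counts and can be recomputed as sketched below. The counts are those of the table; the pooled comparison of quotations against all other text and the one-sided hypergeometric tail (the lower tail of a Fisher-type exact test) are additions for illustration only and are not part of the original analysis. All names are hypothetical.

```python
from math import comb

# Counts transcribed from Table 2: text type -> (participles, misused participles).
TABLE_2 = {
    "quotation": (20, 1),
    "statement": (197, 15),
    "conversation": (115, 7),
}


def misuse_rate(total, misuse):
    """Share of misused participles, in percent."""
    return 100.0 * misuse / total


def lower_tail(population, marked, sample, observed):
    """P(X <= observed) when `sample` items are drawn without replacement
    from `population` items of which `marked` are misuses (hypergeometric).
    Illustrative assumption: this test is not used in the original study."""
    denom = comb(population, sample)
    favourable = sum(
        comb(marked, k) * comb(population - marked, sample - k)
        for k in range(observed + 1)
    )
    return favourable / denom


if __name__ == "__main__":
    for text_type, (total, misuse) in TABLE_2.items():
        print(f"{text_type:13s} {misuse_rate(total, misuse):.1f}%")

    # Pool statements and conversations and compare them with quotations.
    rest_total = sum(t for name, (t, _) in TABLE_2.items() if name != "quotation")
    rest_misuse = sum(m for name, (_, m) in TABLE_2.items() if name != "quotation")
    print(f"non-quotation {misuse_rate(rest_total, rest_misuse):.1f}%")

    # How likely is it to see at most 1 misuse among 20 quotation participles
    # if the 23 misuses were spread at random over all 332 participles?
    print(lower_tail(332, 23, 20, 1))
```

A large tail probability here would be in line with the caution expressed in the text that the difference between quotations and the other text types is too small to be conclusive.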

As we see in Table 2, only one misuse occurs in quotations. However, the percentages of misuse found in statements (7.6%) and conversations (6%) do not support HYPOTHESIS 2': conversational sentences, toward which the copyist's attitude should be least reverential, do not show the highest rate. We might consider HYPOTHESIS 2' supported if we were to compare the number of misused participles occurring in the quotations with those occurring in the statements and conversations. However, the difference is too small to be deemed conclusive.

5. Conclusion

In the history of the Slavonic literary language, the participle has considerably changed its paradigm and function. This study focuses on the misuse of participles in OCS and OR texts, which seems to occur at the beginning of the change. Participles originally agreed with the nouns and represented the incidental action of the subject. But in some cases they appear as predicates, with conjunctions connecting participles and verbs as equivalents.

To explain the cause of this misuse, we proposed two hypotheses. The first hypothesis is that the similarity of the syntactic structures caused confusion regarding participle use, and the second is that the reverential attitude of the copyists toward the original texts greatly influenced the copied texts.

From the research on the OR text Vita Constantini, we found 26 cases of misuse. By analyzing these misuses from the point of view of usage, tense, and word order, we confirmed that the misuses are caused by structural similarity. This finding supports the first hypothesis. Nevertheless, we could not precisely determine the nature of the relationship between the text type and the attitude of the copyists, despite the fact that occurrences of misuse in quotations are fewer than in statements or conversations. That means the second hypothesis is neither clearly supported nor refuted. For further research, we need to examine the misuses occurring in texts of other genres belonging to the same period as Vita Con.

Abbreviations

(OCS Texts) Mar.: Codex Marianus, Sav.: Savva's book, Zogr.: Codex Zographensis, Ass.: Codex Assemanianus.
(OR Text) Arkh.: Arkhangel'skoe evangelie, Vita Con.: Vita Constantini.
V.: Verb, Noun.: Noun, Pron.: Pronoun, Conj.: Conjunction, Imp.: Imperfect, Aor.: Aorist, Pres.: Present, 1st.: 1st person, 2nd.: 2nd person, 3rd.: 3rd person, M.: Masculine, F.: Feminine, N.: Neuter, Sg.: Singular, Du.: Dual, Pl.: Plural, Nom.: Nominative, Gen.: Genitive, Dat.: Dative, Acc.: Accusative, Inst.: Instrumental, Loc.: Locative, Pres-A Part.: Present Active Participles, Past-A Part.: Past Active Participles, Pres-P Part.: Present Passive Participles, Past-P Part.: Past Passive Participles, SF.: Short form, LF.: Long form.
: participles, : verbs, : conjunctions

Appendix: Table of transliteration



References

Comrie, B. and Corbett, G. G. (eds). 2001. The Slavonic Languages. London, New York: Routledge.
Diels, P. 1963. Altkirchenslavische Grammatik. Heidelberg: Carl Winter Universitätsverlag.
Frei, H. 1929. La grammaire des fautes. Paris, Genève, Leipzig: Paul GEUTHNER, KUNDIG, Otto HARRASSOWITZ. (Russian Trans. Пастернак, Е. Л. и Сичинава, Д. В. 2007. Грамматика ошибок. Москва: URSS. Japanese Trans. Kobayashi, Hideo. 1973. Goyo no bumpo. Tokyo: Misuzu-shobo.)
Grünenthal, O. 1909-1910. “Die Übersetzungstechnik der altkirchenslavischen Evangelienübersetzung”. Archiv für Slavische Philologie 31-32. Berlin. (Reprinted 1964. Hague.)
Jagić, V. 1879. Quattuor evangeliorum Codex Glagoliticus olim Zographensis nunc Petropolitanus. Berlin. (Reprinted 1954. Graz: Akademische Druck- u. Verlagsanstalt.)
Jagić, V. 1883. Quattuor evangeliorum versionis palaeoslovenicae Codex Marianus. Berlin. (Reprinted 1960. Graz: Akademische Druck- u. Verlagsanstalt.)
Kimura, S. 1985. Kodaikyokaisurabugo nyumon (Introduction to Old Church Slavonic). Tokyo: Hakusui-sha.
Kimura, S. and N. Iwai. 1984-85. “Konstantinos ichidai-ki (Vita Constantini)”. Slavic Studies 31-32. Hokkaido: Slavic Research Center.
Kurz, J. 1955. Evangeliarium Assemani; Codex Vaticanus 3. slavicus glagoliticus. Pragae: Československé academie věd.
Kurz, J. and Hauptova, Z. (eds). 1966-1997. Slovník jazyka staroslovĕnského. Praha: Československé akademie věd.
Lunt, H. G. 2001. Old Church Slavonic Grammar (7th ed.). Berlin, New York: Mouton de Gruyter.
Vajs, J. 1927. Evangelium Sv. Marka: A Jeho poměr k řecké předloze. Pragae: České Akademie Věd a Uměni.
Vajs, J. 1935. Evangelium s. Matthaei: palaeoslovenice. Pragae: Academiae Scientiarum et Artium.
Vajs, J. 1936a. Evangelium S. Lvcae: paleoslovenice. Pragae: Adiuvante Ministerio scholarum et instructionis publicae.
Vajs, J. 1936b. Evangelium s. Ioannis: palaeoslovenice. Pragae: Academiae Velehradensis.
Vlasto, A. P. 1986. A Linguistic History of Russian to the End of the Eighteenth Century. Oxford: Clarendon Press.
Булаховский, Л. А. 1958. Исторический комментарий к русскому литературному языку. Киев: Радянська Школа.
Жуковская, Л. П., Левочкин, И. и Милонова, Т. (Ред). 1997. Архангельское евангелие 1092 года; Исследования, древнерусский текст, словоуказатели. Москва: Скрипторий.
Лавров, П. А. 1930. Материалы по истории возникновения древнейшей славянской письменности. Ленинград. (Reprinted 1966. Hague: Mouton & Co.)
Ларин, Б. А. 2005. Лекции по истории русского литературного языка. СПб.: Азбука.
Цейтлин, Р. М., Вечерка, Р. и Благова, Э. 1994. Старославянский словарь (по рукописям X-XI веков). Москва: Русский язык.
Срезневскiй, И. И. 1893. Матерiалы для словаря древнерусскаго языка по письменнымъ памятникамъ. т.1-3. СПб. (Reprinted 1955. Graz: Akademische Druck- u. Verlagsanstalt; 2003. Москва: Знак.)
Щепкин, В. 1903. Саввина книга. СПб. (Reprinted 1959. Graz: Akademische Druck- u. Verlagsanstalt.)
Ягич, И. В. 1910. Исторiя славянской филологiи. СПб. (Reprinted 2003. Москва: Индрик.)

A Preliminary Analysis of Arabic Derived Verbs in the Leeds Quran Corpus - With Special Reference to Stem III (CaaCaC)

Robert R. RATCLIFFE

1. Introduction

1.1. Productivity

Productivity is commonly defined “as the extent to which a particular affix is used in the production of new words” (Aronoff & Anshen 1998: 242). For the sort of productivity problems we wish to investigate in Arabic this definition is unsatisfactory for a number of reasons. Firstly, not all morphology involves affixation. In Semitic languages in particular, the marking of morphological contrasts through differences in stem shape is common. Secondly, the definition does not distinguish between formal and functional productivity. The question is not simply whether a morphological operation can apply, but whether its operation yields a predictable change of meaning or function in the word which undergoes it.

The -s nominal plural suffix and the verbal noun suffix -(a)tion are both often cited as productive affixes in English, but there is a difference. With very few exceptions (lexicalized plurals like “scissors” and “pants”) the meaning of an English word ending in plural -s is entirely predictable from the meaning of the singular form without -s. The sequence V-(a)tion can perhaps most commonly be interpreted as “act of doing V,” as in “celebrate/celebration.” But words like “decoration” or “invitation” refer not to acts of decorating or inviting but rather to specific physical objects used for doing so. Other cases like “donation” or “creation” refer, or can refer, not to the action but to the object of the action. In still other cases like “foundation,” “constitution,” “institution,” the noun has acquired a lexicalized sense not predictable from the original source verb. This is an example of formal productivity combined with functional (or semantic) idiosyncrasy (or lack of productivity).
The reverse phenomenon of functional productivity combined with formal idiosyncrasy can be seen in English adjectives derived from nouns denoting nations or ethnic groups. There are a variety of suffixes, -ese, -(i)sh, -((i)a)n (Japanese, Turkish, Brazilian, Italian, Korean), as well as suffix deletions (Germany >> German), consonant alternations (Greece >> Greek) and consonant plus vowel alternations (France >> French, Wales >> Welsh). None of the individual affixes or processes involved in deriving national/ethnic adjectives is productive. But the system as a whole exhibits productivity in function. For any noun denoting a nation or ethnicity an adjective can be derived. The meaning of the adjective is transparent and predictable based on the meaning of the source noun. But the actual form which this adjective will take is largely unpredictable and idiosyncratic.

These examples also indicate another aspect of the issue which is important for understanding Arabic morphology: semiproductivity. According to authors such as Bybee (1985) and Bauer (2001), productivity is a continuum, statistically defined. A given morphological process may have a core function. Likewise there might be a core (preferred or default) morphological form for expressing a given function. Beyond the core there may be an inner periphery, more or less predictable on the basis of formal or semantic properties of the input word, and a more idiosyncratic outer periphery. Semiproductivity also includes the idea of local productivity: one can often predict the most likely choice among several allomorphs (or the most likely among several possible interpretations for a form or process) with greater accuracy when formal or functional properties of the source are taken into account. For example, for the English ethnonymic adjectives, the -n allomorph seems to be preferred for multisyllabic nouns ending in -a (Russian, Australian, Korean). For the -(a)tion verbal nouns, the action interpretation, as opposed to the object or instrument interpretation, seems to be more likely for intransitive verbs.

In Generative research in the 1970s and 1980s the difference between linguistic behavior which could be described as rule-based (in this case productive morphology) and that which could not (non-productive) was thought to be both categorical and terribly important. This reflected the computer model of mind which was popular at the time. The mind was thought of as a machine for processing algorithms.
Grammatical knowledge was thought to consist of a set of algorithms (rules). Semiproductivity therefore presented a problem. In the framework of the connectionist models of mind that have come to dominate neurocognitive linguistics in recent years, rules are not thought to have any neurocognitive reality (Lamb 1998). Speakers can establish connections among words on the basis of any shared feature of form or function, and these connections can then be extended to create new words. The continuous nature of productivity falls out naturally from this approach. So too does the sort of local productivity revealed by non-normative analogical formations like bring-*brang or moose-*meese, which cannot be explained under the hypothesis of analogy as extension of a (categorically productive) rule. Making use of the notational apparatus of Ford, Singh and Martohardjono (1997), we can represent ideal (formal and functional) productivity as the case where the formal change of a morphological operation correlates consistently with change in function.



[X]a ↔ [X']a' (for all X and all a)

In this schema the formal aspect of a morphological contrast is indicated by the part within brackets and the functional aspect is indicated by the part outside. Functional productivity without formal productivity (allomorphy) has the following schema:

[X]a ↔ [X1, X2…Xn]a'

Formal productivity without functional productivity has this schema:

[X]a ↔ [X']a1, a2…an

1.2. Semi-productive systems in Classical Arabic

Central to Classical Arabic morphology are two large, semi-productive sub-systems, the “broken” plural system and the derived verb system. From the point of view of the distinction between formal and functional productivity the systems are mirror images of each other. The broken plural system exhibits lack of productivity in form combined with productivity in function: the phonological form which the plural of a given singular will take is not entirely predictable, although there are statistical trends which allow us to predict the plural form of a given singular with an average of about 70% accuracy (Ratcliffe 1998). Nonetheless the function of the plural, that is, the sense [PLURAL] and the syntactic contexts in which a plural form will be required (after numbers from three to ten, for example), is transparent and predictable.

The derived verb system exhibits lack of productivity in function combined with productivity in form. The formal (phonological) shape of a particular derived verb stem (roughly nine of which were in common use in Classical Arabic) is entirely predictable (and can be freely created by any competent speaker), but the function (meaning and syntactic distribution) of the stems is to some extent idiosyncratic and lexicalized.

In order to try to quantify formal productivity, as in the case of the broken plural, a simple dictionary analysis would seem to be sufficient: noting and counting the plural allomorphs listed for each morphologically and phonologically defined singular category.
The problem has indeed been approached in this way (Levy 1971, see also Ratcliffe 1998 and references therein). However, this approach is not satisfactory when dealing with questions of meaning and syntactic behavior, where context is required. As a step toward trying to quantify the functional productivity of the derived stems in Classical Arabic, specifically Quranic Arabic, I undertook an analysis based on the Leeds Quranic corpus. The preliminary results of that analysis are reported here.



2. Some hypotheses

2.1. Word-based morphology as applied to Semitic

A data-based analysis is most useful (perhaps only useful) if it can be used to test a hypothesis. There are several hypotheses to test here. The first is the word-based approach to Semitic morphology (Heath 1987; McCarthy & Prince 1990, 1995; Ratcliffe 1997, 2003; Gafos 2003, etc.). There are various versions of this theory, but the common idea is that regularities in Semitic morphology can best be stated in terms of derivational relationships between one word and another. Such statements may involve a shared “root” in the traditional sense, but they need not necessarily do so in all cases. The “root” is not thought of as the starting point for the derivation.

The alternative, morpheme-based theory of Semitic morphology, the classic statement of which is generally attributed to Cantineau (1950a, b), says that all words are derived by combining bound morphemes: a (consonantal) root and a (syllabic/vocalic) pattern. This approach has its origins in the medieval lexicographic tradition and is carried over into most textbooks and descriptions of Arabic.

The major difficulty with the traditional approach is that it provides no way to talk about relationships among words except as realizations of the same root on different patterns. Yet as McCarthy (1993) first showed, there are a number of regularities in the nominal morphology (plurals and diminutives, for example) which only become visible when one looks at word-to-word relationships. The data is quite rich and the issue is complicated, but the following examples should suffice to illustrate the central problems. In many plural forms the plural shares phonological material with the singular beyond the consonantal string usually identified as the root, as is clear in (1):

(1)
             sg.       pl. CaCaaCiC             dictionary root
“scorpion”   ʕaqrab    ʕaqaarib        ʕqrb     ʕqrb
“desk”       maktab    makaatib        mktb     ktb
“pronoun”    damiir    damaaʔir        dmir     dmr
“mold”       qaalab    qawaalib        qalb     qlb
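The word-to-word mapping illustrated in (1) can be sketched as a small procedure: every onset and coda segment of the singular fills one C position of CaCaaCiC, and a vocalic coda (the second half of a long vowel) is repaired to a glide or glottal stop. This is only an illustrative sketch for the four nouns in (1); the function names, the doubled-letter transliteration of long vowels, and the simplified repair rule (w in second position, ʔ in third) are assumptions made here, not the author's formalization.

```python
VOWELS = set("aiu")


def consonantal_slots(singular):
    """List the segments of the singular that map to C positions.

    Short vowels are skipped. A long vowel (written as a doubled letter,
    e.g. "aa") contributes a vocalic coda, marked here as None.
    """
    slots = []
    i = 0
    while i < len(singular):
        ch = singular[i]
        if ch in VOWELS:
            is_long = i + 1 < len(singular) and singular[i + 1] == ch
            if is_long:
                slots.append(None)  # vocalic coda of a heavy syllable
                i += 2
            else:
                i += 1
        else:
            slots.append(ch)
            i += 1
    return slots


def broken_plural(singular):
    """Map a four-slot singular onto the plural template CaCaaCiC."""
    slots = consonantal_slots(singular)
    if len(slots) != 4:
        raise ValueError(f"{singular!r} does not provide exactly four slots")
    # Simplified repair rule (assumption): vocalic coda -> w in C2, ʔ in C3.
    repair = {1: "w", 2: "ʔ"}
    filled = []
    for position, segment in enumerate(slots):
        if segment is None:
            if position not in repair:
                raise ValueError(f"no repair defined for slot {position + 1}")
            segment = repair[position]
        filled.append(segment)
    c1, c2, c3, c4 = filled
    return f"{c1}a{c2}aa{c3}i{c4}"


if __name__ == "__main__":
    for noun in ["ʕaqrab", "maktab", "damiir", "qaalab"]:
        print(noun, "->", broken_plural(noun))
```

Note that the procedure never consults a dictionary root: the /m-/ of maktab and the vocalic codas of damiir and qaalab are carried into the plural simply because they occupy slots in the singular.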

It looks as though all of these plurals are formed on the same plural pattern CaCaaCiC, and that each onset and coda segment in the singular maps to a C position in the plural, with the vocalic codas of the last two examples undergoing predictable phonological change in the new syllabic environment provided by the plural pattern. On a classic root-and-pattern analysis, we would have to



recognize four different patterns here: CaCaaCiC, maCaaCiC, CaCaaʔiC and CawaaCiC. The facts that both singular and plural in the second example have /m-/, and that the extra consonant in the third and fourth examples happens to fall in the same relative position as the coda vowel of the heavy syllable of the singular, would have to be treated as purely coincidental.

There are other cases where there are clear semantic relationships between one word and another which cannot be reduced to the hypothetical meaning of the shared root (Larcher 1995, Watson 2006). An illustration of this in Classical Arabic is provided by the following:

maktab “desk, office” (place where writing is done), related to kataba “write”
maktaba “library, bookstore” (place where books are), related to kitaab “book”

Both of the words on the left side are nouns of place, with the prefix ma-. The -a on the second word is a feminine suffix arbitrarily attached to indicate a difference of meaning from the first word. The meanings “book” and “write” are not unrelated, of course, but productively and etymologically kitaab should mean “letter” or “correspondence.” The meaning of the word kitaab has shifted, and the meaning of the “root” k-t-b in the word has followed. The derivatives with place prefix ma- have in common with their source the string k-t-b, but the meaning of the derivative reflects the meaning of a particular source word rather than some more abstract, superordinate meaning common to “book” and “write” that we could assign to the string k-t-b.

2.2. Valence theory of the derived verbs

The second hypothesis I wish to test with this analysis is the Ratcliffe (2005, 2008) hypothesis regarding the derived stems. Traditional grammars and textbooks of Arabic (Wright 1899 [1979], Haywood & Nahmad 1965) present the system somewhat as in (2).

(2) Traditional characterization of the derived verbs

Stem   Pattern     Example     Traditional meaning
I      CVCVC       faʕal
II     CaCCaC      faʕʕal      intensive, trans. of intrans., causative of trans.
III    CaaCaC      faaʕal      relation of action to another person
IV     ʔaCCVC      ʔafʕal      trans. of intrans., causative
V      taCaCCaC    tafaʕʕal    consider/represent oneself as doing/being I
VI     taCaaCaC    tafaaʕal    reciprocal (mutual application of action)
VII    inCaCaC     infaʕal     passive of I
VIII   iCtaCaC     iftaʕal     reflexive of I
X      istaCCaC    istafʕal    ask for the act of I, esteem or think someone to be or do I
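The formal predictability of the stems in (2) can be made concrete with a template-filling sketch. The templates below are transcribed from the pattern column of (2); fixing the variable vowel of stems I and IV (the V of CVCVC and ʔaCCVC) as a is a simplifying assumption, and the names `TEMPLATES` and `derived_stems` are illustrative.

```python
# Stem templates following the patterns in (2); c1, c2, c3 are the three
# consonants of the base. The variable vowel of stems I and IV is set to
# "a" here as a simplifying assumption.
TEMPLATES = {
    "I": "{c1}a{c2}a{c3}",
    "II": "{c1}a{c2}{c2}a{c3}",
    "III": "{c1}aa{c2}a{c3}",
    "IV": "ʔa{c1}{c2}a{c3}",
    "V": "ta{c1}a{c2}{c2}a{c3}",
    "VI": "ta{c1}aa{c2}a{c3}",
    "VII": "in{c1}a{c2}a{c3}",
    "VIII": "i{c1}ta{c2}a{c3}",
    "X": "ista{c1}{c2}a{c3}",
}


def derived_stems(c1, c2, c3):
    """Return the nine stem shapes for a triconsonantal base."""
    return {
        stem: template.format(c1=c1, c2=c2, c3=c3)
        for stem, template in TEMPLATES.items()
    }


if __name__ == "__main__":
    for stem, form in derived_stems("f", "ʕ", "l").items():
        print(f"{stem:5s} {form}")
```

The sketch captures only the formal side of the system. It says nothing about which of the generated shapes are actually attested for a given verb or what they mean, which is exactly the functional question examined against the corpus.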



The problems with this presentation are that it does not give a consistent indication of the interrelationships among the stems and that it is rather vague about the overall function of the system. Ratcliffe (2005, 2008) makes two main proposals. The first is that there is a systematic pattern of interrelationship among the stems, captured in terms of derivational arrows. (This is the application of word-based theory to this area of morphology.) The second is that the core function of the system is the marking of a syntactic property, namely valence, and that many of the semantic nuances ascribed to the stems follow logically from this syntactic property.

The proposals are schematized in the following table. Stems II (CaCCaC), III (CaaCaC), and IV (ʔaCCaC) are derived from stem I and have the effect of increasing valence (adding an argument, hence the indication +1). Stems VII (inCaCaC) and VIII (iCtaCaC) are derived from stem I and have the effect of decreasing valence (indicated as -1). Stems V (taCaCCaC), VI (taCaaCaC), and X (istaCCaC) are second-order derivations derived from stems II, III, and IV, respectively. These also have the effect of decreasing the valence of the immediate source verb.

(3) Valence marking analysis of derived verbs (Ratcliffe 2005, 2008)

Source stem     Valence   Derived stem
CvCvC (I)       +1        CaCCaC (II), CaaCaC (III), ʔaCCaC (IV)
CvCvC (I)       -1        inCaCaC (VII), iCtaCaC (VIII)
CaCCaC (II)     -1        taCaCCaC (V)
CaaCaC (III)    -1        taCaaCaC (VI)
ʔaCCaC (IV)     -1        istaCCaC (X)

The description of stems II and IV as valence-adding is uncontroversial and quite consistent with the traditional designation “causative/factitive.” The function of stem III has not been well understood, or at least not well described, in the past. It has been labeled plural, reciprocal, or inchoative, none of which labels really fit the normal role or function of this stem. Larcher (2009: 641), following the Arabic grammatical tradition, says that “Form III has, in comparison to form I primarily an insistence value (mubaalaɣa).” But he only gives one example where this interpretation works. And he also acknowledges the valence-changing function, which I regard as central: “If form I is intransitive, form III becomes transitive, and the insistence focuses on the object.” Within the framework of valence theory, however, it is clear that stem III fills a logical niche. For a normal transitive verb the core verbal arguments of



subject and object normally correlate with the core semantic roles of agent and patient. If we are to increase verbal valence by bringing an external (semantic) actor (neither agent nor patient) into a core grammatical role, this can be done in only two logical ways, namely by making the external actor either the subject or the object. In causative constructions an external actor becomes the subject and the actual doer of the action is demoted to direct or indirect object. In the Arabic stem III (and similar constructions elsewhere, which are often termed applicative) the external actor becomes the object and the actual patient or object of the act is either deleted or demoted to the role of secondary object:

kataba risaalatan (“he wrote a letter,” stem I, object = patient)
kaataba sˤadiiqatan (“he wrote (to) a friend,” stem III, object = external actor)

(4) Different argument structures of II (CaCCaC) and III (CaaCaC), where I (CaCaC) is transitive (EA = external actor):

I     Grammatical   Subj.      DO
      Logical       Agent 1    Patient 2
II    Grammatical   Subj.      DO         (DO2)
      Logical       EA 3       Agent 1    (Patient) (2)
III   Grammatical   Subj.      DO         (DO2)
      Logical       Agent 1    EA 3       (Patient) (2)

Finally, with regard to the so-called “semantic nuances” often ascribed to the stems, we note that form/function relationships are rarely one-to-one, and that they are implemented through different linguistic levels. Morphological marking may not have direct semantic reference, but may indicate syntactic role, with a range of possible semantic interpretations.

morphological form : syntactic function
syntactic form : semantic reference

The semantic interpretation of stem III as incomplete action (e.g. qatala “kill,” but qaatala “to fight,” i.e. to try but not necessarily succeed in killing) follows from the syntactic property of demoting the semantic patient to a non-core grammatical role. In English there is no verbal morphology comparable to the Arabic stem III, but something similar can be accomplished by adding a preposition to a transitive verb: “to hit at” someone is not to



succeed in hitting him. Indeed in most cases the Arabic stem III has to be translated into English with a verb plus preposition.

2.3. Historical source of stem III

Within Semitic, Arabic is unique in having a stem III (also termed an L-stem in Semiticist literature) with a distinct function. It is possible that this stem with its Arabic function goes back to Proto-Semitic. It is also possible that the L-stem (III) was originally a phonologically conditioned variant of the D-stem (stem II) (Zaborski 1994, see also Lipinski 1997). Biblical Hebrew has an L-stem as a variant of the D-stem when the second consonant is one which resists gemination in that language, namely an r or a post-velar fricative. If Zaborski's hypothesis is right, then we might expect to find traces of the original situation in Quranic Arabic. Specifically, we would expect to find cases of doublets where both II and III were attested in the same meaning. We would also expect to find a disproportionately high percentage of stem III verbs with r or a post-velar as the second consonant.

3. Testing the hypotheses against the corpus data

Problem I: On a word-based analysis we might expect that derived words would be less frequent than their derivational sources. (Since derivational processes are optional, we would expect that not all theoretically possible types have in fact come into use.) The Leeds Corpus allows for a very quick search of the total number of verbs (token counts) of each stem. The results are laid out in (5), with the verbs organized according to the derivational schema in (3) above. The results would seem to bear out this prediction of the word-based derivational analysis. The underived base stem is by far the most frequent. All derived stems are less common.
Further, all of the stems hypothesized to be second-order derivations (taCaCCaC, taCaaCaC and istaCCaC) are less common than the first-order derivations from which they are hypothetically derived (CaCCaC, CaaCaC, and ʔaCCaC, respectively).

(5) Number of verbs of different stem types in the Quran based on the Leeds Corpus (search conducted March 2010)

Stem        Tokens   Hypothesized source
CvCvC       12,630   (underived base)
iCtaCaC        952   CvCvC
inCaCaC         49   CvCvC
CaCCaC       1,273   CvCvC
taCaCCaC       322   CaCCaC
CaaCaC         328   CvCvC
taCaaCaC        67   CaaCaC
ʔaCCaC       3,377   CvCvC
istaCCaC       368   ʔaCCaC
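The frequency prediction of Problem I can be checked mechanically against the token counts in (5). The following sketch is illustrative only: the counts are those reported in (5), the source assignments follow schema (3), and the names `COUNTS`, `SOURCE`, and `violations` are assumptions introduced here.

```python
# Token counts from (5).
COUNTS = {
    "CvCvC": 12630,
    "iCtaCaC": 952,
    "inCaCaC": 49,
    "CaCCaC": 1273,
    "taCaCCaC": 322,
    "CaaCaC": 328,
    "taCaaCaC": 67,
    "ʔaCCaC": 3377,
    "istaCCaC": 368,
}

# Derived stem -> hypothesized immediate source, following schema (3).
SOURCE = {
    "iCtaCaC": "CvCvC",
    "inCaCaC": "CvCvC",
    "CaCCaC": "CvCvC",
    "CaaCaC": "CvCvC",
    "ʔaCCaC": "CvCvC",
    "taCaCCaC": "CaCCaC",
    "taCaaCaC": "CaaCaC",
    "istaCCaC": "ʔaCCaC",
}


def violations(counts, source):
    """Derived stems that are NOT less frequent than their source."""
    return [d for d, s in source.items() if counts[d] >= counts[s]]


def token_ratio(counts, source_stem, derived_stem):
    """How many source tokens there are per derived token."""
    return counts[source_stem] / counts[derived_stem]


if __name__ == "__main__":
    print("violations:", violations(COUNTS, SOURCE))
    for derived, src in SOURCE.items():
        print(f"{src} : {derived} = {token_ratio(COUNTS, src, derived):.1f} : 1")
```

An empty list of violations corresponds to the observation in the text that every derived stem is less frequent than its hypothesized source, at least at the level of token counts.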



Of course, the prediction of relatively lower frequency for derived stems relates properly to type counts rather than token counts, and this is a token count. I counted types for stems III and VI, with the following results. The 328 tokens of stem III represent 64 types (of which one, xaaf, appears to be mislabeled). The 67 tokens of stem VI represent 34 types. The frequency of the hypothetically derived type is still lower, although less dramatically so (a ratio of 4.9:1 for tokens and 1.9:1 for types). The derivational theory would further lead us to predict that for any derived stem the hypothetical source should also be attested. This last prediction is not borne out in the case of stems III and VI. There are only 11 cases (or about a third of the 34 total types for VI) where both stem III and stem VI sharing the same root are attested in the corpus.

Problem II: What percentage of the verbs shows the predicted valence? I analyzed the stem III verbs in the corpus as transitive based exclusively on the criterion of whether an overt nominal or pronominal object is present. (Whether the English translation was transitive or not was not considered.) By this criterion, three of the 63 verbs (types) are ditransitive, 48 are transitive (of which 3 also occur as intransitive, that is, without an overt object, in some contexts), and 12 are exclusively intransitive. Approximately 81% of the verbs have the predicted valence. This degree of functional productivity is roughly equivalent to the formal productivity observed in broken plurals, as noted above.

(6) Numbers and percent of stem III verb types by valence

ditransitive    transitive    intransitive
3               48            12
(5%)            (76%)         (19%)
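The ratios and percentages reported above can be checked with a few lines of arithmetic. The following is only a minimal sketch: the counts are those given in the text, while the variable and function names are introduced here for illustration.

```python
# Token and type counts for stem III (CaaCaC) and stem VI (taCaaCaC),
# as reported in the text.
tokens = {"III": 328, "VI": 67}
types = {"III": 64, "VI": 34}


def ratio(source, derived):
    """Frequency of the hypothetical source relative to the derived stem."""
    return round(source / derived, 1)


token_ratio = ratio(tokens["III"], tokens["VI"])
type_ratio = ratio(types["III"], types["VI"])

# Valence classes of the 63 stem III verb types, as in (6).
valence = {"ditransitive": 3, "transitive": 48, "intransitive": 12}
total_types = sum(valence.values())
shares = {name: round(100 * n / total_types) for name, n in valence.items()}

# Verbs with the predicted valence: transitive or ditransitive.
predicted = valence["ditransitive"] + valence["transitive"]
predicted_share = round(100 * predicted / total_types)

print(token_ratio, type_ratio, shares, predicted_share)
```

Rounding to whole percentages reproduces the figures in (6); the 63 types used as the denominator exclude the one form described as mislabeled.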

A second question investigated was how many of the stems have a personal direct object, as such a criterion is often included in the traditional characterization of the stem. The answer is 42, or 82% of transitive/ditransitive verbs, 67% of all verbs.

Problem III: Is there any evidence that stem III was a phonologically conditioned variant of stem II? Two issues were investigated here. The first is whether any semantic overlap between stem II and III could be observed in cases where both stems sharing a given root are attested. The second is whether there was a statistically significant number of stem III verbs having as second stem consonant either a post-velar fricative or /r/. Pre-computational analysis (Leemhuis 1977) of the distribution of stems II and IV in the Quran showed considerable overlap in meaning/function (i.e.



plus-one argument causative/transitive) of these two stems, with a number of cases of nearly identical meaning for II and IV verbs sharing the same root. For stems II and III no such semantic overlap could be observed. For only six (9.5%) of the stem III verbs attested in the corpus was a stem II with the same root also attested. In all cases the meaning and/or argument structure was clearly distinct. These six cases are listed below with their English glosses.

(7) Semantic contrast between stem II and stem III verbs

III         Eng. trans                 II          Eng. trans
qaasam      swear (to)                 qassam      distribute
naaʤa(y)    consult w/ s.o.            naʤʤa(y)    deliver, save
ʕaaqab      retaliate                  ʕaqqab      adjust, return, look back
xaalaf      differ (from) s.o.         xallaf      leave behind
saawa(y)    level                      sawwa(y)    fashion
baaʃar      have relations w/ s.o.     baʃʃar      give good news

Of the 63 verbs attested in the corpus, 17 (or 27%) were found to have r, h, ʔ, ʕ, or ħ as C2. If all 28 consonants were equally distributed through the verb stems, these five consonants should occur in medial position on average about 18% (5/28) of the time by chance. This is lower but not dramatically lower than what is actually found. Possibly, moreover, these consonants are more frequent than average in this position. In an earlier project I calculated the frequency of consonants in all stems in a modern Arabic dictionary (Wehr 1979) and found the following frequencies for these five consonants in medial position:

(8) Relative frequency (%) of consonants in medial position in Arabic

r       h       ʕ       ħ       ʔ       total
8.26    3.55    3.18    3.18    1.43    19.6
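The text compares observed and expected proportions informally and does not report a significance test. One way such a comparison could be made explicit is an exact one-sided binomial tail probability, sketched below. This is an illustration under stated assumptions, not the procedure used in the study: it treats the 63 verb types as independent draws, and the function name is introduced here.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_verbs = 63      # stem III verb types in the corpus
observed = 17     # types with r, h, ʔ, ʕ, or ħ as C2

# Two baselines from the text: equal distribution over 28 consonants,
# and the dictionary-based frequency in (8).
uniform_rate = 5 / 28
dictionary_rate = 0.196

expected_uniform = n_verbs * uniform_rate
expected_dictionary = n_verbs * dictionary_rate

p_uniform = binomial_upper_tail(observed, n_verbs, uniform_rate)
p_dictionary = binomial_upper_tail(observed, n_verbs, dictionary_rate)

print(round(expected_uniform, 2), round(expected_dictionary, 2))
print(p_uniform, p_dictionary)
```

The higher dictionary-based baseline necessarily makes the observed count look less unusual than the uniform baseline does, which is the direction of the argument in the text.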

This calculation does not significantly alter our earlier conclusion. In conclusion, the preliminary analysis of stem III verbs in the Quranic corpus yields mildly positive results for the word-based derivation theory, mildly positive results for the valence theory of the function of the derived stems, and mildly negative results for the hypothesis of the origin of stem III as a phonologically conditioned variant of stem II. A list of the most frequent verbs of this type in the corpus is included as an appendix.

(9) Most frequent stem III verbs in the Quran (more than 5 tokens)

3rd perf.    Gloss                                   Predicted valence and argument structure?    Number of tokens
qaatal       fight (against s.o.)                    yes                                          53
naada(y)     call (to s.o.)                          yes                                          44
ʤaahad       strive (against s.o.)                   yes                                          27
ʤaadal       argue with, dispute                     yes                                          25
haaʤar       emigrate                                no                                           16
ʕaahad       promise, make a covenant (w/s.o.)       yes                                          11
dˤaaʕaf      double, multiply                        yes                                          9
saaraʕ       hasten                                  no                                           9
ʔaaxað       blame, take to task                     yes                                          9
raawad       try to seduce                           yes                                          8
baarak       bless                                   yes                                          8
ʕaaqab       retaliate                               yes and no                                   6
ðˤaahar      support, back                           yes                                          6

References

Aronoff, Mark and Frank Anshen. 1998. “Morphology and the Lexicon”. The Handbook of Morphology, Spencer, Andrew and Arnold M. Zwicky (eds). Oxford: Blackwell. 237-248.
Bauer, Laurie. 2001. Morphological Productivity. Cambridge: Cambridge University Press.
Bybee, Joan. 1985. Morphology: A study of the relation between meaning and form. Amsterdam: John Benjamins.
Cantineau, Jean. 1950a. “Racines et schèmes”. Mélanges, W. Marçais (ed). Paris: G.P. Maisonneuve. 119-124.
Cantineau, Jean. 1950b. “La notion de “schème” et son altération dans diverses langues sémitiques”. Semitica 3.
Ford, Alan, Rajendra Singh, and Gita Martohardjono. 1997. Pace Panini: Towards a Word-Based Theory of Morphology. New York: Peter Lang.
Gafos, Adamantios. 2003. “Greenberg’s Asymmetry in Arabic: A consequence of stems in Paradigms”. Language 79. 317-355.
Heath, Jeffrey. 1987. Ablaut and Ambiguity: Phonology of a Moroccan Arabic Dialect. Albany: SUNY Press.
Haywood, J.A. and H.M. Nahmad. 1965. A New Arabic Grammar of the Written Language. Cambridge, Ma.: Harvard University Press.
Lamb, Sydney. 1998. Pathways in the Brain. Amsterdam: John Benjamins.
Larcher, Pierre. 1995. “Où il est montré, qu’en arabe classique la racine n’a pas de sens et qu’il n’y a pas de sens à dériver d’elle.” Arabica 41. 291-314.
Larcher, Pierre. 2009. “Verb”. Encyclopedia of Arabic Language and Linguistics vol. 4. Versteegh, Kees (ed). Leiden: E.J. Brill. 638-645.
Leemhuis, F. 1977. The D and H stems in Koranic Arabic: a comparative study of the function and meaning of the fa’’ala and ’af’ala forms in Koranic usage. Leiden: E.J. Brill.
Levy, Mary M. 1971. The Plural of the Noun in Modern Standard Arabic. Doctoral Dissertation. University of Michigan.
Lipinski, Edward. 1997. Semitic Languages: Outline of a Comparative Grammar. Leuven: Peeters.
McCarthy, John J. 1983. “A Prosodic Account of Arabic Broken Plurals”. Current Trends in African Linguistics I, I Dihoff (ed). Dordrecht: Foris. 289-320.
McCarthy, John J. and Alan Prince. 1990. “Foot and Word in Prosodic Morphology: The Arabic Broken Plural”. Natural Language and Linguistic Theory 8. 209-283.
McCarthy, John J. and Alan Prince. 1995. “Prosodic Morphology”. The Handbook of Phonological Theory, Goldsmith, John (ed). Cambridge, Ma.: Blackwell. 318-366.
Ratcliffe, Robert. 1997. “Prosodic Templates in a Word-Based Morphological Analysis of Arabic”. Perspectives on Arabic Linguistics X, Eid, Mushira and Robert Ratcliffe (eds). Amsterdam/Philadelphia: John Benjamins. 147-171.
Ratcliffe, Robert. 1998. The ‘Broken Plural’ Problem in Arabic and Comparative Semitic. Amsterdam: John Benjamins.
Ratcliffe, Robert. 2003. “Toward a Universal Theory of Shape-Invariant (Templatic) Morphology: Classical Arabic reconsidered”. Explorations In Seamless Morphology, Singh, Rajendra and Stanley Starosta (eds). New Delhi, London, and Thousand Oaks: Sage Publications. 212-269.
Ratcliffe, Robert. 2005. “Semi-Productivity and Valence Marking in Arabic—the So-called ‘verbal themes’”. Corpus-Based Approaches to Sentence Structure, Takagaki, Toshihiro, Susumu Zaima, Yoichiro Tsuruga, Francisco Moreno Fernandez and Yuji Kawaguchi (eds). Amsterdam: John Benjamins. 179-190.
Ratcliffe, Robert. 2008. “The Simple Math of Valence and Voice Ambiguity: Arabic Derived Verbs, Passive/Causative Overlap, and other problems”. Ambiguity of Morphological and Syntactic Analyses, Tokusu, Kurebito (ed). Tokyo: ILCAA. 1-14.
Watson, Janet. 2006. “Arabic Morphology: Diminutive Verbs and Diminutive Nouns in San’ani Arabic”. Morphology 16. 189-204.
Wehr, Hans. 1979. A Dictionary of Modern Written Arabic. Wiesbaden: Harrassowitz.
Wright, William (ed and trans). 1896. A Grammar of the Arabic Language. 3rd edition. London: Cambridge University Press.
Zaborski, Andrzej. 1994. “Archaic Semitic in Light of Hamito-Semitic”. Zeitschrift für Althebraistik 7. 234-244.

On the Narrow and Open “e” Contrast in Santali

Makoto MINEGISHI, Jun TAKASHIMA and Ganesh MURMU

1. Introduction

The purpose of this paper is to examine the narrow and open “e” contrast in Santali on the basis of Bodding’s Santali data [1932-36], hereafter abbreviated as BSD. First, we will briefly introduce Santali, which is a member of the Munda language family within the Austroasiatic linguistic phylum. Historically, the Austroasiatic language family is important in both India and mainland Southeast Asia, because it is regarded as the oldest linguistic substratum in these regions. According to earlier documents, Santali is said to have at least two dialectal varieties; the northern dialect has eight or nine vowels and the southern, six vowels. The former is represented by Bodding’s monumental dictionary. We started to digitize it in the early 1990s as part of a Japanese-Indian joint research project that aims at describing the language.1 By using the corpus data, we will examine whether the contrast between narrow and open “e” is phonologically distinct in Santali.

1.1. Brief introduction to Santali

Santali is a language widely spoken in the eastern part of the Indian subcontinent: Jharkhand, West Bengal, Odisha (Orissa), Northeast India, and Bangladesh. It was formerly recognized as a tribal language, but since the establishment of the Jharkhand state in 2000, it has gained status as the official state language in addition to Hindi. The total number of Santali speakers is between four million and seven million according to Anderson (2006); thus, Santali speakers are the largest linguistic minority group in India. Santali belongs to the North Munda subgroup of the Munda family within the Austroasiatic linguistic phylum. Another subgroup of this phylum is the Mon-Khmer family of mainland Southeast Asia and Northeastern India.

1 Part of the statistical analysis of narrow and open “e” was presented at The Third International Conference on Austroasiatic Linguistics at Deccan College, Pune, Nov. 2007. Our research activities were supported in part by the project Grammatological Informatics based on Corpora of Asian Scripts (2001-2005), granted under the Center of Excellence Program of the Japanese Ministry of Education, Culture, Sports, Science & Technology (MEXT), and in part by the project Multilingual Concierge: An Interface for the Next Generation, granted under the Strategic Information and Communications R&D Promotion Program (SCOPE) of the Japanese Ministry of Internal Affairs and Communication.



According to Ethnologue (Summer Institute of Linguistics), the total number of Austroasiatic languages is 169. Most of these are minority languages, except Khmer and Vietnamese of the Mon-Khmer family; the former is the national language of Cambodia, the latter, of Vietnam. In the early twentieth century, Wilhelm Schmidt, an Austrian anthropologist, proposed the Austric Hypothesis, which holds that the Austroasiatic and Austronesian languages together form the Austric super-family. Few linguists now support the hypothesis, although the name “Austric” is still used in India in place of “Austroasiatic.” The Austroasiatic languages are important because they are regarded as the linguistic substrata both in the Indian subcontinent and in mainland Southeast Asia.

It should be noted that in terms of linguistic typology, the Munda and Mon-Khmer subgroups reflect areal linguistic features of each region. Santali, as well as Mundari and other languages of the Munda group, has agglutinating morphology with monosyllabic and disyllabic stems or words, to which one or more affixes are added to form polysyllabic grammatical units. Mon-Khmer and other Southeast Asian languages are isolating with monosyllabic stems; sometimes a prefix is added to form disyllabic words. The basic word order of the Munda languages is SOV, whereas that of Southeast Asian languages is SVO. These areal contrasts are not only found between the Munda and the Mon-Khmer subgroups, but are also shared in languages of other genetic origins, namely, between those in the Indian subcontinent (except for Assam, which is linguistically part of Southeast Asia) and mainland Southeast Asia. Indian languages, in principle, have a five-vowel system, whereas Southeast Asian languages have more numerous and complex vowel contrasts, including suprasegmental phonemes: some have a tonal system; others have a phonation contrast such as breathy vs. creaky. It is noteworthy that Bengali, spoken in almost the same region as Santali, has seven vowels despite being one of the Indo-European languages that have five vowels in principle. Interestingly, Burmese, spoken to the east of Bangladesh, has seven vowels. As shown below, Santali has dialectal varieties in its vowel system, with six or eight vowels, whereas other Munda languages have basically five or six vowels. Burmese, Bengali, and Santali can thus be regarded as a transitional language type from Southeast Asia and Bangladesh to India.

1.2. Varieties in Santali vowels in the earlier works

According to Anderson (2006), Santali has at least two dialectal groups (southern and northern), and these have slightly different sets of phonemes, different lexical items, and variable morphology. The southern dialect has six vowels, whereas the northern has eight or nine. Although the dialectal differences and boundaries have not been investigated so far, Campbell, the first editor of a Santali dictionary, mentioned



northern and southern regional or stylistic varieties. Bodding noted a difference between the pronunciation of vowels in Manbhum (present-day Purulia in West Bengal), where Campbell lived, and Benagalia near Dumka, where Bodding himself worked.2 In contrast to the six-vowel system adopted by Campbell, Bodding claimed that Santali has an eight-vowel system with narrow and open contrast in “e” and “o”.

Minegishi and Murmu began a description of Santali in 1988 as part of the joint research project between the Research Institute of Languages and Cultures of Asia and Africa (ILCAA), Tokyo and the Central Institute of Indian Languages (CIIL), Mysore, India. Based on Murmu’s native speaker pronunciation, we described a six-vowel system in which /ǝ/ is derived from /a/; thus, we assume that the original vowel system consisted of five vowels. Because Murmu was born in East Singhbhum, Jharkhand (former Bihar), we tentatively called this dialectal variety the Singhbhum dialect. Takashima later joined the project and has been in charge of the data processing of Indic scripts. The basic lexicon of the dialect with grammatical notes is available as Minegishi and Murmu (2001).

Since the earlier stage of our project, we have been particularly interested in the “dialectal difference” in the Santali vowel system because it may reflect historical changes in the language. There are two possibilities of historical phonemic change in the number of vowels: either reduction or increase. BSD may represent the original vowel system of Santali from which other dialectal varieties have derived. We should keep in mind, however, that the rest of the Munda languages as well as other languages in the Indian subcontinent have, in principle, the five-vowel system. Up to the present, Bodding’s description is the only reliable basis for assuming that the proto-Munda language may have had a seven-vowel system. Santali in BSD is therefore the only exceptional case exempt from such a reduction in the number of vowels. Although historic changes from a complex system to a simpler system have been attested in many languages, we may alternatively assume that the original five vowels may have been increased to seven for unknown reasons. It may be the case that Santali, affected by Bengali through long-time contact in the region, has adopted the latter’s seven-vowel system by accepting a large

2 In Macphail, R.M. (ed). [1953-1954], the revised edition of Campbell’s dictionary, Campbell gave the following description: “Northern Santali, or that spoken in Bhagarpur, Monghyr, the Santal Parganas, Birbhum, Bankura, Hazaribagh, Manbhum… is more polished than Southern Santali. The former is regarded as the Standard, and Southern Santali, or that spoken in the remaining districts [Midnapur, Singbhum, Mayurbhanj, and Balasore], as a dialect, or, possibly, a group of dialects of it” (preface). BSD is based on the language of the Santals in the southern parts of the Santal Parganas district of Bihar and the adjoining districts of Bengal (preface, vol. II).



amount of loanwords. If this is the case, we should examine the vowels in Bengali loanwords in Santali to determine whether narrow and open “e” and “o” in the loanwords correspond to those of Bengali.

2. Analysis of Bodding’s Santal dictionary

2.1. Santali corpus project

The Japanese-Indian joint research is now conducted between ILCAA and the Department of Tribal and Regional Languages at Ranchi University; part of it is the Santali corpus project started in the early 1990s. We decided to digitize BSD because it is the most important description of the Santali language and culture; it provides not only a detailed language description, but also rich cultural information about Santali society. In the 1990s, we completed data input, and in 2009, the third round of proofreading was completed, on which our following analysis is based. Now the data is available at our web site:

2.2. Structure of the dictionary

BSD consists of five volumes. The total number of pages is 3,406, from which we have made digital data consisting of about ten megabytes including mark-up tags. In the following, Santali word forms are given in transliterated forms within angle brackets < >, which we use as tags to indicate Santali or other foreign language forms in the data. The BSD data structure is as follows: In digitization, a Santali headword is defined as a string of characters that denotes a word (a free morpheme), a bound morpheme, or a combination of free forms (sometimes with bound forms) standing for compounds or even for idiomatic phrases. The total number of headwords is 39,945. Because the headwords comprise compounds and idioms, the same morpheme (type) appears many times as tokens in different headwords; for example, (an adverb-forming postposition) appears as 532 tokens.3 The total number of types is 29,851, and that of tokens, 59,387; the latter is almost twice as many as the former.

A headword is followed by a comma, an abbreviation of the part of speech, and a period. Its English translation is given in the second sentence. Santali sentence example(s), if any, follow the above, then their English translations. Additional information concerning etymology or loans is given within round brackets at the end of each description.

3 Words that appear more than 100 times are given below, followed by their English translation and the number of occurrences in parentheses. See the appendix for the transliteration rules. ‘equal to,’ , or < marte1>’ (532); ‘adverb-forming postposition’ (503); ‘postposition to show the manner or mode in which an action is performed or anything happens’ (502); ^ ‘water’ (118); ‘flower’ (108); ^ ‘vegetables’ (106); ‘rice plant’ (106).
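The distinction between types and tokens among the headwords can be illustrated with a short counting sketch. It is only an illustration: it assumes that the elements of a compound or phrasal headword are separated by spaces, and the headwords w1 to w4 are placeholders rather than Santali forms.

```python
from collections import Counter


def count_types_and_tokens(headwords):
    """Return (number of types, number of tokens, per-type counts)."""
    counts = Counter()
    for headword in headwords:
        # Assumption: elements of compounds and phrases are space-separated.
        counts.update(headword.split())
    return len(counts), sum(counts.values()), counts


# Placeholder headwords: single elements and two-element compounds.
sample_headwords = ["w1", "w1 w2", "w3", "w3 w1", "w4 w2"]
n_types, n_tokens, counts = count_types_and_tokens(sample_headwords)
most_frequent = counts.most_common(1)[0]

# Token/type ratio for the figures reported in the text.
reported_ratio = round(59387 / 29851, 2)

print(n_types, n_tokens, most_frequent, reported_ratio)
```

A list of the most frequent elements, like the one in footnote 3, corresponds to filtering `counts` by a frequency threshold.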



Bodding adopted a Latin alphabet system with various kinds of superscripts and subscripts instead of using the IPA. Indic scripts such as Devanagari or Bengali were not appropriate for transcription because the number of vowel markings they use is not sufficient to distinguish the Santali vowels.4 To digitize BSD, we have made transliteration rules that assign one single-byte character for each diacritic mark, principally in order to preserve the original information and to make processing easier; for example, underscored “e” and “o” are replaced with “e+” and “o+”, etc. For details of the transliteration rules, see the appendix.

2.3. Santali vowel notation in BSD

In the following, we first examine how Bodding describes the Santali vowels. In BSD, the vowels are listed as headwords in alphabetic order with descriptions of pronunciation as follows.

“a”, mid-back-wide, similar to a in English father.
“ạ”, a resultant vowel apparently peculiar to the Santal language; the “a” as pronounced when an “i” or a “u” is, or has been, found within the same stress-unit...
“e” has in Santali several values, the mid-front-narrow (like in Norw. fred), the mid-front-wide “e” (like in Engl. men), or the mid-mixed-narrow (or wide) “e”.
“e̱”, the low-front-narrow or low-front-wide sound, pronounced like the vowel in Engl. air or dead.
“i” is the high-front-narrow or high-front-wide sound, like “i” in police or the vowel of cheese or in hit.
“o” is the mid-back-narrow-round or the mid-back-wide-round vowel sound, something like the sound in “note.” The lips are not much protruded.
“o̱” is the low-back-narrow-round, the low-mixed-narrow, or the low-back-wide round sound, long or short, like in Engl. law, or not.
“u” represents the high-back-narrow-round sound (like in French tour), ... When “u” is in the same stress unit with other vowels these are changed into their resultant vowels (“a” to “ạ”, “o” to “ọ”, “e” or “e̱” to “ẹ”) or to the corresponding close vowel (“e” to “i”, “o” to “u”).

It should be noted that among the above three vowels “ạ”, “ọ” and “ẹ”, only “ạ” is listed as a headword. It means that although these are phonetic variants derived from “a”, “o”, and “e” when they co-occur with narrow

4 Because Santali is spoken across different states in India, it is written with the writing system of each state: the Devanagari script in Bihar and Jharkhand, the Bengali script in West Bengal, and the Odia (Oriya) script in Odisha. Christian missionaries adopted the Latin alphabet with diacritics. In addition, the “Ol Chiki” script was created by Ragunat Murmu in 1925. As a result, five writing systems are used for this one language.



vowels “i” or “u”, only “ạ” has gained a stable status as an independent vowel. The contrast between “e” and “e̱” and between “o” and “o̱” can be regarded as a height contrast; that is, they are narrow and open, respectively. Based on the descriptions above, we can now assume the eight-vowel system of BSD in modern phonemic symbols as /i, e, ɛ (=e̱), ə (=ạ), a, ɔ (=o̱), o, u/. In this paper, we hereafter refer to narrow e, open ɛ, narrow o, and open ɔ as “e1”, “e2”, “o1”, and “o2”, respectively, for convenience in print.

2.4. Bodding’s analysis of the vowels

In the preface of BSD, Bodding remarks that “e1” vs. “e2” and “o1” vs. “o2” are “distinctly separate sounds, and that it has sometimes led to confusion when no distinction has been made” (preface, p. VIII). He further states that Campbell’s dictionary is “not wholly satisfactory phonetically and grammatically” (ibid. p. IX) because Campbell did not distinguish them. Bodding also noticed dialectal differences between his material collected in the southern parts of the Santal Parganas and Campbell’s collected in Manbhum.

Although Bodding says the narrow and open “e”s and “o”s are distinct, it is not clear whether he thinks the difference is phonemic in the sense of modern linguistics. He remarks that “...the law of harmonic sequence demands that the open vowel sounds are used when the preceding vowel of the same stress-unit has an open sound (vol. V, preface),” which suggests that there is no phonemic distinction between “e1” and “e2” or “o1” and “o2” in a phonological unit. Further, “the law of harmonic sequence” in his remark means that Santali has a progressive assimilation of vowels rather than the vowel harmony found in Turkish, in which the vowels are divided into two groups according to their phonemic features, and the vowels in one group will not co-occur with those belonging to the other.

Bodding designed the BSD to represent prescriptive standards of Santali orthography as well as to provide descriptions of how each word is actually pronounced. He noted that throughout, he had followed the system decided on at a missionary conference in 1898 that stated that “in verbal suffixes, postpositions and the personal pronouns the open “e” and “o” sounds should not be marked.” This means that even if “e1” and “o1” in such bound forms are found as part of the headwords, they are written according to the orthographic rule that was decided upon; their actual pronunciations, therefore, may be “e2” and “o2” respectively, if the vowels in the preceding free lexeme are open. This is because the above bound morphemes form a stress unit with preceding free lexemes, which often happens in agglutinating languages. Thus, in BSD, a postposition such as is described by saying, “while it is always written , the pronunciation generally is , except when the law of harmonic sequence demands < re1>” (p. 58, vol. IV).



3. Analysis of Santali vowels

3.1. Methods of finding minimal pairs

In the following, we will first examine the major co-occurrence patterns of vowels. The most important characteristic of Santali vowels is that “e1” and “e2” rarely co-occur in a word. Our examination shows that patterns with repetitive “e2-e2” or “e1-e1” are the most common in the headwords. Table 1 below shows the most frequent syllable patterns among all types found in the headwords that appear at least ten times. Note that “C” in Table 1 denotes any consonant, and “e23” represents nasalized “e2”. The most frequent patterns are disyllabic words with the “e2-e2” pattern, such as /Ce2Ce2C/, /Ce2Ce2/, /Ce2CCe2C/, /Ce2CCe2/, /e2Ce2C/, etc., compared to which those with “e1-e1”, such as /Ce1Ce1C/, /Ce1Ce1/, /Ce1CCe1C/, or /Ce1CCe1/ are fewer in number. On the other hand, the most frequent syllable pattern containing both “e1” and “e2” is /Co2Ce2Ce1/ with vowels /o2-e2-e1/, which appears only seven times in the corpus and thus is not found in Table 1. The /o2-e2-e1/ pattern, however, appears in actual headwords , , , , , , and , which can be further divided into disyllabic words with or , both of which are postpositions frequently used and which were decided to be written with “e1” in the orthography.

Table 1. Frequent Syllable Patterns

Order    Syllable Pattern    Frequency    Co-occurring Vowels
1        Ce2Ce2C             327          e2-e2
2        Ce2Ce2              171          e2-e2
3        Ce2CCe2C            150          e2-e2
4        Ce2CCe2             92           e2-e2
5        Ce1Ce1C             83           e1-e1
6        Ce1Ce1              39           e1-e1
7        e2Ce2C              30           e2-e2
8        Ce23Ce23            25           e23-e23
9        Ce2Ce2Ce2C          17           e2-e2-e2
10       Ce1CCe1C            17           e1-e1
11       Ce23Ce23C           16           e23-e23
12       Ce2CCCe2C           15           e2-e2
13       e2Ce2               14           e2-e2
14       Ce1CCe1             13           e1-e1
15       e2CCe2C             13           e2-e2
16       e2CCe2              11           e2-e2
17       e1Ce1C              11           e1-e1
18       Cae1Ce1             10           e1-e1
19       Ce2Ce2CCe2C         10           e2-e2-e2
20       Ce1Ce1a             10           e1-e1a
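Syllable patterns of the kind listed in Table 1 can be derived mechanically from transliterated forms. The sketch below is illustrative only and rests on assumptions that are not taken from the appendix: a vowel is written as a lower-case vowel letter optionally followed by digits, every other letter counts as one consonant, and marks such as "^" or "+" are ignored. The two test forms are the four-"e" headwords quoted in the footnotes of this section.

```python
import re
from collections import Counter

# Assumed tokenisation: vowel letter plus optional digits, or a single letter.
TOKEN = re.compile(r"[aeiou]\d*|[A-Za-z]")


def skeleton(form):
    """Replace each consonant letter with 'C' and keep vowels as written."""
    parts = []
    for token in TOKEN.findall(form):
        parts.append(token if token[0] in "aeiou" else "C")
    return "".join(parts)


def e_sequence(form):
    """Return the sequence of e-vowels in a form, joined by hyphens."""
    return "-".join(re.findall(r"e\d+", form))


def pattern_frequencies(forms):
    """Count how many forms share each syllable pattern."""
    return Counter(skeleton(form) for form in forms)


forms = ["e2NDe2te1ge1", "ge1ge1te2re2n^"]
for form in forms:
    print(form, skeleton(form), e_sequence(form))
```

Applying `pattern_frequencies` to the full list of types and sorting by count would give a ranking of the same shape as Table 1, provided the tokenisation assumptions hold for the actual transliteration.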



Table 1 also shows that the most frequent tokens do not exceed three syllables. We may therefore be able to focus on such words as having one to three e’s in the following analysis.5 As is shown in Section 2.2, the total number of types of Santali free or bound morphemes in the headwords is 29,851. In order to obtain candidates for minimal pairs, we have extracted those containing “e2” or “e1,” not exceeding three instances, among the above types. For the types with three instances of “e2”, those with the vowel sequence “e2-e2-e2” are selected to compare with those with “e1-e1-e1”, “e2-e1-e1”, “e1-e2-e1”, “e1-e1-e2”, “e1-e2-e2”, “e2-e2-e1”, and “e2-e1-e2” in the same phonemic environment. Similarly, for disyllabic types, those with the vowel sequence “e2-e2” are selected to compare with those with “e1-e1”, “e2-e1” and “e1-e2”. We will examine below the narrow “e1” vs. open “e2” contrast in the same syllable pattern to determine whether they can be minimal pairs.

3.2. Examinations of “e1” and “e2”

We have extracted 285 candidates for e1-e2 minimal pairs6 and classified the pairs into the following seven groups.

1. Alternative round bracketed form: pairs that consist of a word and another form denoting alternative pronunciation within round brackets in the headwords; obviously, they are meant to be one word with variants in pronunciation.
2. Empty entry: pairs that contain a word that is shown with no definition and only refers to the other headword because the two have the same meaning. These empty entries are not real synonyms, but the same words with alternative pronunciations because they are described as “the same as...” in the original text. In some cases, both a word A with “e1” and another B with “e2” have a reference to a third word C that is different from both A and B.
3. Different part of speech: pairs that consist of a verbal suffix, postposition, or personal pronoun that follows free forms and a substantive word, such as a noun or a verb. As is shown in Bodding’s remarks, the former is written with “e1” regardless of its actual pronunciation; thus, these should be excluded from among the candidates.

5 Only two headwords contain four e1 or e2’s as follows. < e2NDe2te1ge1>, adv. Why, I have heard (with past tense); then (with fut.). < ge1ge1te2re2n^>, n. A certain crawling insect…, Spirocystus cilcylindricus. We have confirmed that only four headwords contain four o’s.
6 For o1 and o2, we have 698 candidates, though their examination is not given in the present paper.



4. Onomatopoeic: pairs that contain an onomatopoeic expression or interjection. Because onomatopoeia and interjections are difficult to define precisely, they should be excluded from among the candidates.
5. Loanword: pairs in which one headword is of foreign origin and the other is seemingly native. We need a close examination of loanwords because borrowing from Bengali, which has seven vowels, may have influenced the historical change in the Santali vowel system. Further, BSD includes words cited from the earlier dictionary compiled by Campbell, which are marked with “C.” They are of a different dialect with six vowels and thus should be excluded from among the candidates.
6. Different environment: pairs in which compounds or a few words (usually two words) are listed as headwords and the target elements appear in different environments, i.e., one appears as the first element of the sequence and the other as the last; we should avoid regarding these as minimal pairs.
7. Minimal pair candidates: pairs that consist of two words that have different meanings and the vowels occurring in the same phonemic environment; these should be regarded as a minimal pair. It should, however, be noted that we often find cases in which two words have similar meanings (or not very similar but are related in some senses).

Given the above classification, the words in Groups 1 and 2 are cases that clearly describe the existence of two pronunciations of the same word, and thus, they should be excluded from among the candidates for minimal pairs. We should also exclude the words in Group 3 due to the different phonemic environments in which they occur and those in Groups 4 and 5 because of the difficulty in examining their precise meanings.

In practical grouping, we first compared headwords that contain only one token. When such headwords were unavailable, we compared those that contain two repeated tokens because they are in the same environments. This is possible because BSD contains many headwords with repeated tokens; among all the entries, 3,028 headwords consist of two repeated tokens, most of which are adverbs or onomatopoeic expressions, and some of which are interjections. During the classification, we found that some words are simultaneously assigned to two of the above groups. For example, some words are grouped into both empty entries (Group 2) and onomatopoeic ones (Group 4), other words, into empty entries (Group 2) and loanwords (Group 5), and others, into onomatopoeic entries (Group 4) and loanwords (Group 5). In these cases, the entries were classified into the group with the smaller group number: Group 2 for those that belonged to Groups 2 and 4, Group 4 for those that belonged to Groups 4 and 5, etc.

The words classified in Groups 6 and 7 are problematic. In Group 6, some substantive words might be “grammaticalized” to function as postpositions, which happens frequently in agglutinating languages. It is difficult to determine whether certain words are really substantives or grammaticalized postpositions. Further, for Group 7, it is almost impossible to decide whether the entries are different words or a single polysemic word, because evidence from old written documents is not available. In some cases, misprinting is suspected in either the original or digitized data. BSD may contain overlooked misprints, as Bodding admits in the preface. Further, we must admit that oversights may still remain in our digitization, which is inevitable for ten megabytes of text data, even though we have completed three proofreadings. Tentatively, we will discuss the suspected cases in the following section.

3.3. Results and analysis

The results of our examination are as follows.

Group 1: Alternative pronunciation within round brackets. Since the words in Group 1 are given with an alternative pronunciation, they are not candidates for minimal pairs. Only four such cases, listed below, are found in BSD. In the first two cases, and , the form with “e2” is in brackets, whereas in the other cases, the forms with “e1” are in brackets.

, pers. pr. 2. pers. dual. You two. v. a. int. and m., come of one mind. Make, become one pair. [...]
(or < tabe2n>), n. Flattened rice; v. a. To prepare do.
, pers. pr. 2. p. pl. You, ye, yours; v. m. Be, become of one family or sept (used in addressing), intimate. [...]
, pers. pr., adj., v. m. You, your.
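The candidate search of Section 3.1 and the tie-breaking rule of Section 3.2 can be sketched in a few lines. This is a hypothetical illustration, not the procedure actually used on the BSD data: the function names are introduced here, the sample forms are invented placeholders, and the sketch only pairs forms that become identical once “e1” and “e2” are neutralised.

```python
import re
from collections import defaultdict
from itertools import combinations


def neutralise(form):
    """Collapse the e1/e2 distinction (e23 becomes E3, keeping nasalisation)."""
    return re.sub(r"e[12]", "E", form)


def candidate_pairs(forms):
    """Pair up distinct forms that differ only in e1 versus e2."""
    buckets = defaultdict(set)
    for form in forms:
        buckets[neutralise(form)].add(form)
    pairs = []
    for members in buckets.values():
        pairs.extend(combinations(sorted(members), 2))
    return pairs


def assign_group(group_numbers):
    """A pair that fits several groups goes to the smallest group number."""
    return min(group_numbers)


# Invented placeholder forms, not actual BSD headwords.
forms = ["pe1t", "pe2t", "ke1le1", "ke2le2", "ke2le1", "mon"]
pairs = candidate_pairs(forms)

print(pairs)
print(assign_group({2, 4}), assign_group({4, 5}))
```

Each resulting pair would still have to be inspected by hand to decide which of the seven groups it belongs to; only the final choice among several applicable groups is mechanical.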

Group 2: Empty entry referring to the other headword. Like the words in Group 1, those in Group 2 are not minimal pair candidates. We have found 149 such cases. Among them, fifty-four are empty entries with “e2” that refer to an entry with “e1”, and forty-nine are empty entries with “e1” that refer to the “e2” form; the direction of the reference does not appear to be significant. In addition, there are forty-six cases in which the reference is made to an identical word that differs from both the “e1” and the “e2” form, or in which the descriptions of the two words are the same or similar.

On the Narrow and Open “e” Contrast in Santali


In Group 2, we have found two cases where references are made from an empty entry to those with the alternative vowel, given below, which are onomatopoeic expressions. This means that the difference between “e1” and “e2” does not affect the meaning of the onomatopoeic expressions. , adv., the sound of belching, eructating. , adv. Twittering (maena youngs, before they can fly). (onomatop.).

The following are loanwords in Group 2. We have found cases in which references are made from an empty entry to a loanword from Hindi (ten cases), Bengali (eight cases), Hindi and Bengali (two cases), Persian and Hindi (five cases), Arabic and Hindi (two cases), Arabic and Bengali (one case), English (four cases), Desi (five cases), or Munda or Birhor (one case). Among these languages, only Hindi, Bengali, Desi, Mundari, and Birhor are spoken in the same area as Santali. The Arabic, Persian, and English words are supposedly borrowed into Santali via Hindi or Bengali. Only Bengali has seven vowels; the others have basically five or six.7 Hindi (ten cases): The loanwords from Hindi that originally contain the vowel “e”, given below, may be pronounced with either “e1” or “e2” because they are in Group 2. , v. a. m. Chase, pursue till death,[...]. (H. , run after) , n., v. a. m. Leg, foot of bedstead, [...]. (cf. H. , a small bedstead) , intj. to dogs. Come, here! (cf. H. < le1p>, adj., v. a. Unguent, ointment; [...]. (H. )

The above are cases in which the headwords with “e2” are empty, whereas those with “e1” are given meanings and etymological information. Below are cases in which the headwords with “e1” are empty, whereas those with “e2” are given meanings and etymologies. , n. Trip, time (single journey). (H. < khep>.) , n. Difference, discrepancy (in weighing). (H. < pher>.) , n., adj. Neighbourhood, vicinity;[...].(v. < aDe> and cf. H. < pa=s>; cf. < aRe>).


No detailed phonemic description is available for Desi, a local language of the region of Indo-Aryan origin, because any local language in a region can be called desi ‘local.’



It should be noted that in the following cases, the vowels in the Santali headwords do not correspond with those in Hindi. , adj., v. m. At variance with, on unfriendly terms; [...](H. < biba=d>). , adj. Contemptuous, mocking; [...]. (H. ) , adv. Sometimes. Generally repeated:[...] ([...] cf. H. < khan>, a moment).

Bengali (eight cases): The Bengali loanwords attract special attention because the Bengali vowel system is basically the same as that of BSD. Because the following loanwords are in Group 2, which means they are referred to by those with an alternative pronunciation, “e1” and “e2” in Santali do not correspond to those in the original Bengali pronunciation. , adv. So they say, it is said, that is to say.(B. < bo2le2>, he says). , adv. Immediately, at once, [...]. (B. < e2kha=l> + < te1>). , n. The name of a plant, Cotyledon laciniata, [...] (B. < he2msago2r>.) < je1>, indef. pr. Whoever, whatever, who, what. (B. < ye>.) < -re1>, postp. intj. (intensifying). Oh, oh dear (often not translatable). (B. < re>.)

It should be noted, however, that in the above cases, the meanings are given to the headwords whose vowels “e1” and “e2” correspond to those of Bengali, while this is not the case in the following loanwords.8 , adv. One by one, singly, separately. (B. ). , n. A fisherman, a caste of Hindu fishermen. (B. < ke2oT>.) , n. A small silver coin (...). (P. H. < rezagi=>, Desi < ricki>, B. < rejoki>.)

Further, for the loanwords below from Hindi and Bengali, Arabic and Bengali, Persian and Hindi, and Arabic and Hindi, it does not matter whether they are pronounced with “e1” or “e2”. Hindi and Bengali (two cases): , postp. By, after, for (giving the meaning of every or each). (B. H. .) , num. One ... (H. B. < ek>).

Arabic and Bengali (one case): , n. Choice, will, self-will;[...]. (A. B. < e2khtiya=ri=>).

8 Though Bengali orthography does not distinguish narrow and open “e” and “o”, BSD distinguishes them with the same diacritics as are used for Santali.



Persian and Hindi (five cases): , adv. Often, continually, again and again. (v. ; P. H. < ba=re>) , n., adj. Poor fellow (term of commiseration); [...]. (P. H. < beca=ra>). , n. An (Indian) official [...] (P. H. < peshka=r>;[...]) , n., adj., v. m. Vigour, strength, briskness; [...]. (P. H. ; v. < [email protected]>; also H. ) , n. A small silver coin (...). (P. H. < rezagi=>, Desi < ricki>, B. < rejo+ki>.)

Arabic and Hindi (two cases): , adj., v. m. Bristling, standing up (hair),[...]. (cf. A. H. < rafi=>, high)

The following English loanwords are supposedly borrowed via Hindi or Bengali; again, it does not matter whether they are pronounced with “e1” or “e2”. English (four cases): , n. Bench. (Engl., the more common pronunciation is < bin^ci>). , n. Jail; v. a. m. Imprison. (from Engl. jail) < ke1no2sTabo2l>, n. A constable. (Engl.). , n. A policeman. (Engl. constable).

With regard to the following cases, we are not sure whether Desi or other tribal languages like Munda (Mundari) or Birhor have borrowed the loanwords from Santali or vice versa; both directions are possible. Desi (five cases): , int. of remonstrance or warning. (Desi < ehe>). , n. Wretch, rascal, scamp, poor wretch;[...]. (Desi .) , adversative conjunction. But, however. (Desi ) , adv. Perhaps, possibly. (Desi < pase>.)

Munda or Birhor (one case): < bale>, adj., v. m. Tender, fresh, young, infantile;[...] (Mundari, Ho, Birho2R < bale>,[...]

Group 3: Different parts of speech We have found only two cases that fall into Group 3. 1. < baRe1 (dare1)> and < baRe2> < baRe1 (dare1)>, n. The banyan tree, Ficus bengalensis, L. [...] (H. < baR>). < baRe2>, opt. particle. Please, do, O that (may).



In (1), the former is a noun, the latter, a particle. Because the former is a loan from Hindi, it can be classified into Group 5. 2. < be1hal> and < be2hal> < be1hal>, adv. Very much, astonishingly (used like < behaj>). , v. a. m. Damage, make unserviceable, ruin, destroy: lose all, exceed, overstep, transgress. (P. H. < beHa=l>).

In (2), the former is an adverb, the latter, a verb. The latter, again, is a loan from Persian via Hindi; thus, we may exclude them and classify them into Group 5. Group 4: Onomatopoeic We have found eighteen cases, among which six are onomatopoeic and twelve are interjections. Because the precise meanings cannot be defined in either case, they should be excluded from among the candidates for minimal pairs. The following six cases are onomatopoeic. , adv. The call of [...], the black partridge. (onomatop.) , adv., v. a. Pass loose stools [...] (v. < cher cherao>, possibly onomat.). , adv. The sound of passing water (men standing). , adv., the sound of the < carkha>, spinning-wheel, [...];(... onomat.). , the same as , q. v. cf. < kereo2t^>, adv., v. a. With a scream; to scream (once) (onomat.). , adv. The call of the < citri> (partridge) mother [...] (onomat.).

The following four among twelve cases are interjections that may be borrowed from Hindi or Bengali. < bhe1R>, int. to ploughing cattle. Forward. (cf. H. < bheRna=>. Desi < bheR>). < de2>, int., an optative particle [...] (H. B. < de>; cf. H. < da=>). < de1k>, int. Look. (H.< dekh>). < he1>, int. part., the same as < h\o>, q. v. (used only in songs). (B. < he>.)

Group 5: Loanwords

We have found nineteen loanword cases. In addition, seventeen cases are found that are empty headwords marked with “C,” which means they are cited from Campbell’s dictionary; these should be excluded from among the candidates. Loanwords are mostly from Hindi (nine cases) or Bengali (four cases); some are found in both Hindi and Bengali (three cases). There is only one case in which a word is found in both Arabic and Hindi.

Given below are only the loanwords from Bengali or from Hindi and Bengali, for the purpose of examining whether any correspondence can be found between “e1” and “e2” in Santali and in Bengali. In the following four cases, the “e1” in Santali forms corresponds to the narrow “e” in Bengali.

, n. A fish, the same as < co2Dgo2c^>, q. v. (B. < cen*>).
, v. a. m. Cut the throat, kill by cutting the throat. (cf. B. < jo2ba=i>; Desi < jo2bho2e1>).
, n., v. m. Pain in the chest, [...] (H. < bitha=>; B. < betha=>).
, adj., adv., v. a. m. Much, many, plenty, abundant; [...] (H. B. < Dher>).

In the following case, however, the “e2” in Santali forms does not correspond to the narrow “e” in Bengali. , n. Country, land, the Bengal districts. (H. B. < des 'des^>).

The following is a case in which the vowel in Santali forms is different from that in Bengali. < be1as>, n. Diameter. (B. < bya=s>; only in books).

Group 6: Different environment Vowels in different morphological or grammatical environments should not be compared in the search for minimal pairs. We have found forty-nine cases to be classified in Group 6. One environment to be classified in Group 6 is the case in which a word A with “e1” appears as part of a compound headword combined with X, and another word with “e2” denoted as “B(e2)” appears as a headword in itself. In this case, we cannot specify whether the difference in meaning between the former and the latter is due to that between A(e1) and B(e2) or that between X and B(e2), as shown in the formula below. A(e1)-X vs. B(e2) An example using the above formula with and is as follows. < [email protected] Dabe1>, adv. Loiteringly, carelessly, unsatisfactorily (walk, work), here and there, wide apart. < Dabe2>, adj. Having large horns, spread out and curved upwards (buffaloes)...



Classified in Group 6 is another environment in which a word A(e1) appears as the first element of a compound with X, and another B(e2) appears as the second element in another compound with Y, shown as the following formula. A(e1)-X vs. Y-B(e2) An example with and d follows. , adv., v. m. Pushing and shoving, jostling; to push, jostle, elbow one's way, thrust oneself in, force one's way into. , adj., v. m. Short, of low stature, stunted; become do...

It should be noted that, with further investigation, some of the words in Group 6 may be included in Group 7 or other groups. In the following example, seems to be the same word as in because they share the meaning “to bring in.” They remain, however, in Group 6 because this identity is not explicitly described in BSD. If it were, they would be in Group 2. , v. a. m. Bring in, put in, insert, introduce; enter, penetrate; put (boat on water). < [email protected] ade2r>, n., v. a. m. Bringing in the bandis; in case they are heavy,...

Group 7: Minimal pair candidates There are only seven cases of potential minimal pairs, given as follows. 1. and (a) , n. A louse; v. a. To infect with lice. In people the are found on the head (Pediculus capitis); in buffaloes and pigs and fowls are found all over the body... (b) , v. m. To boil over, foam, well up, froth... (c) , postpos. particle, used to add emphasis, incitement, encouragement. I say; do, come! Often not translatable... (d) < se1>, demonstr. pr., v. . (Also pronounced .)

Among the above four entries, because (c) is a postposition that appears in a different position than (a), (b), and (d), we do not need to address it. Similarly, because (d) is a demonstrative pronoun that has an alternative pronunciation, we can exclude it from among the candidates. The remaining (a) and (b) may be a minimal pair, although they belong to different parts of speech; one is a noun, the other, a verb. Thus, they do not occur in completely the same grammatical environment.



2. and , v. m. n. Be angry, furious, to fume... , adv., v. a. m. Covered with ornaments, full (river to banks); to adorn, cover with ornaments; to be full...

Because both phrases are made of a word that is repeated, they are probably onomatopoeic expressions. The meanings of the phrases in the pair, however, are different; they may be a minimal pair or a case of polysemy. 3. and < bakRe1>, n. The flesh of the < kuiNDi> (Bassia latifolia) fruit, rind included... < bakRe2>, adj., v. a. Pervert, upset, spoil...

In the pair in (3), the former is a noun, the latter, an adjective; they may be a minimal pair, although they belong to different parts of speech. 4. and < be1so1ar>, v. a. d. Spice, season... < be2so1ar>, n. Unpleasantness, impolite behaviour. (< be2 + soar>).

In the above pair, the former is a verb, the latter, a noun; they may be a minimal pair, although they belong to different parts of speech. 5. and , adj. Worthless, useless; adv. Hang it. , v. a. Cross one's way, cross in front of, interrupt, stop, turn off; take the word from one.

In the above pair, the former is an adjective, the latter, a verb; they may be a minimal pair, although they belong to different parts of speech. 6. and , adj. Frequented, used (road); v. a. m. Make do., use. , n. Intercourse; adj. Familiar with, having intercourse with; v. a. m. Make oneself acquainted with. , v. a. m. Chip, cut, pare, trim, prune; wear away.

The adjectival meanings of the first and the second entries are similar. Furthermore, they have the same pronunciation; therefore, they must be the same word. The meaning of the third entry, however, is quite different from that of the other two; thus, they may be a minimal pair.



7. l and l , adj., v. a. m. Brimful, flush; fill brimful; be filled with. , the same as < ce2ple2>, q. v.

The latter refers to < ce2ple2>, which is another headword as follows. , adj., v. a. m. Flat, flattened, low, of small stature; make, become do. (v. < ce2pe2>).
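The notion of a candidate pair used throughout this section, two headwords that are identical except for “e1” versus “e2” in the same position, can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure actually applied to BSD: the function name `e_contrast_pairs` is hypothetical, and headwords are assumed to be plain strings in the transliteration used above, with the two vowels written as `e1` and `e2`.

```python
import re


def e_contrast_pairs(headwords):
    """Return sorted (e1_form, e2_form) pairs that differ in exactly one e1/e2 slot.

    Hypothetical helper: it only compares spellings and knows nothing about
    parts of speech, loanwords, or compounds (Groups 1-6 in the text).
    """
    words = set(headwords)
    pairs = set()
    for word in words:
        # Try swapping each single occurrence of "e1" for "e2".
        for match in re.finditer("e1", word):
            variant = word[:match.start()] + "e2" + word[match.end():]
            if variant in words:
                pairs.add((word, variant))
    return sorted(pairs)


# Headwords quoted in this section, plus two that have no partner in the list.
sample = ["be1hal", "be2hal", "bakRe1", "bakRe2", "baRe2", "de1k"]
print(e_contrast_pairs(sample))
```

Pairs found this way are only raw candidates; the grouping described in Section 3.2 still has to be applied by hand to each of them.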

Misprinted entries

Misprinting is suspected in 20 cases. First, it should be noted that BSD has many repeated tokens as headwords where both tokens have exactly the same phonemes. As noted in Section 3.2, 3,028 headwords consist of two repeated tokens. Most of them are adverbs and onomatopoeic expressions; some are interjections. There are some cases, however, in which “e2-e2” appears in one repeated-token headword but “e2-e1” in another. For example, there is a suspected case, , which means “adv. one by one, singly, separately,” where we expect a complete repetition such as .

Furthermore, because BSD follows the rule that “e2” in suffixes and postpositions should be written as “e1” regardless of the actual pronunciation, words such as , which can be further divided into and , should have been written as . Actually, appears only once in (vol. IV, p. 565), whereas appears 503 times; thus, the former must be misprinted. Similarly appears once in (vol. II, p. 122), whereas occurs 502 times. It is plausible to regard these rare cases as misprinted.

Another obvious case is that of and d . appears as an empty headword that refers to , which is not found in BSD. is not found as an independent headword either. Instead, d only as a compound is found. In this case, it is plausible to consider that and d are the same word. Similarly, misprinting is suspected if a headword that is referred to by another headword is not found.

4. Tentative conclusion

The examination in this paper is based on the BSD corpus after the third proofreading. We are planning a fourth proofreading, hopefully the final one, starting in the latter half of 2010, to be followed by the final data correction. The results of our analysis are therefore tentative and liable to minor statistical changes after the expected corrections.



First, the most frequent syllable patterns in BSD are disyllabic words that contain two open “e2”s or two narrow “e1”s, and “e2” and “e1” rarely co-occur in a word. We can therefore conclude that “e1” and “e2” do not phonemically contrast. Next, we chose 285 candidates for minimal pairs that have exactly the same phonemic environment except for “e1” and “e2” contrasting in the same position. We classified them into seven groups and excluded twenty cases for suspected misprinting. Table 2 below summarizes the classification given in Section 3.3.

Table 2. Classification of Headwords with “e1” and “e2”

Group Number    Number of Cases    Notes
Group 1         4                  Alternative form in round brackets
Group 2         149                Empty entry
Group 3         2                  Different part of speech
Group 4         18                 Onomatopoeic
Group 5         36                 Loanword (including Campbell’s material)
Subtotal        209
Group 6         49                 Different environment
Group 7         7                  Minimal pair candidates
Rest            20                 Misprinted
Total           285
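The arithmetic behind Table 2, together with the tie-breaking rule of Section 3.2 (an entry that fits several groups goes to the lowest-numbered one), can be checked with a short Python sketch. The counts are copied from the table and from the Group 2 description in Section 3.3; the names `EXCLUDED`, `REMAINING`, and `assign_group` are illustrative assumptions, not part of the BSD tooling.

```python
# Counts copied from Table 2. Groups 1-5 are excluded from the candidates.
EXCLUDED = {"Group 1": 4, "Group 2": 149, "Group 3": 2, "Group 4": 18, "Group 5": 36}
REMAINING = {"Group 6": 49, "Group 7": 7, "Rest (misprinted)": 20}

# Breakdown of Group 2 given in Section 3.3:
# e2 -> e1 references, e1 -> e2 references, and other references.
GROUP2_PARTS = (54, 49, 46)


def assign_group(matching_groups):
    """Tie-break from Section 3.2: the smallest group number wins."""
    return min(matching_groups)


subtotal = sum(EXCLUDED.values())
total = subtotal + sum(REMAINING.values())

print("Subtotal (Groups 1-5):", subtotal)
print("Total:", total)
print("Group 2 breakdown sums to:", sum(GROUP2_PARTS))
print("Entry in Groups 2 and 4 ->", assign_group({2, 4}))
print("Entry in Groups 4 and 5 ->", assign_group({4, 5}))
```

The subtotal and total agree with the figures printed in the table, and the Group 2 breakdown adds up to the 149 cases reported.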

As seen in Table 2, 209 cases in Groups 1-5 should be excluded. Group 6 has 49 cases that we should refrain from judging. The apparent minimal pair candidates number only seven. The loanwords from Bengali, which has seven vowels, can be pronounced in Santali using either “e1” or “e2”; no clear correspondence is found between narrow or open “e” in Bengali and Santali. From this fact, we may conclude that the existence of an “e1-e2” contrast in Santali is not the result of its long-term contact with adjacent Bengali. We tentatively conclude that the vowel contrast between “e1” and “e2” is not a full-fledged phonemic one.

References

Anderson, G.D.S. 2006. “Santali”. Encyclopedia of Languages and Linguistics. 2nd Edition, Brown, Keith (ed). Amsterdam: Elsevier. 749-751.
Bodding, P.O. 1929-1936. A Santal Dictionary, 5 vols. Oslo: Norwegian Academy of Science and Letters.



Lewis, M. Paul (ed). 2009. Ethnologue: Languages of the World, Sixteenth edition. Dallas, Tex.: SIL International.
Macphail, R.M. (ed). 1953-1954. Campbell’s Santali-English dictionary, 3rd edition. Santal Parganas: Santal Christian Council.
Minegishi, Makoto and Ganesh Murmu. 2001. Santali Basic Lexicon with Grammatical Notes. Tokyo: ILCAA, Tokyo University of Foreign Studies.
Schmidt, Wilhelm. 1906. “Die Mon-Khmer-Völker, ein Bindeglied zwischen Völkern Zentralasiens und Austronesiens”. Archiv für Anthropologie, Braunschweig, new series 5. 59-109. URL:

Appendix: Transliteration rules for BSD digitization (part)

Vowels
Character    Transliteration
e            e+
o            o+
ạ            @
ẹ            \e
ọ            \o
ā            a=
ī            i=
ū            u=
ã            a~
ĩ            i~
ũ            u~
ẽ            e~
õ            o~

Consonants
Character    Transliteration
ṭ            T
ḍ            D
ṇ            N
ṅ            n*
ń            n^
ṛ            R
ṣ            S
ṕ            p^
t́            t^
ḱ            k^
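The appendix rules amount to a character-by-character substitution, which can be sketched in Python as below. This is a hedged illustration rather than the conversion program used for the BSD digitization: the names `RULES` and `transliterate` are assumptions, and the first two vowel rows (printed here as plain "e" and "o") are deliberately left out, because mapping unmarked vowels would rewrite every ordinary "e" and "o". Input is decomposed with Unicode normalization so that a base letter and its diacritic are matched together whether or not a precomposed character exists.

```python
import unicodedata

# Character -> transliteration, copied from the appendix table
# (the plain "e" -> "e+" and "o" -> "o+" rows are intentionally omitted).
RULES = {
    "ạ": "@", "ẹ": "\\e", "ọ": "\\o",
    "ā": "a=", "ī": "i=", "ū": "u=",
    "ã": "a~", "ĩ": "i~", "ũ": "u~", "ẽ": "e~", "õ": "o~",
    "ṭ": "T", "ḍ": "D", "ṇ": "N", "ṅ": "n*", "ń": "n^",
    "ṛ": "R", "ṣ": "S", "ṕ": "p^", "t\u0301": "t^", "ḱ": "k^",
}

# Key the table by decomposed form so lookups do not depend on input encoding.
NFD_RULES = {unicodedata.normalize("NFD", k): v for k, v in RULES.items()}


def transliterate(text):
    """Replace each base letter plus its combining marks using RULES."""
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    i = 0
    while i < len(decomposed):
        j = i + 1
        # Gather the combining marks that belong to the current base letter.
        while j < len(decomposed) and unicodedata.combining(decomposed[j]):
            j += 1
        cluster = decomposed[i:j]
        # Clusters without a rule are passed through unchanged (recomposed).
        out.append(NFD_RULES.get(cluster, unicodedata.normalize("NFC", cluster)))
        i = j
    return "".join(out)


# "kạṭ" is an arbitrary test string, not a Santali headword.
print(transliterate("kạṭ"))
```

Because uppercase letters such as T, D, and N are used as outputs, this sketch assumes the source text contains no uppercase letters that must be kept distinct.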

The Classification of Apabhraṃśa: A Corpus-based Approach of the Study of Middle Indo-Aryan

Tomoyuki YAMAHATA

1. Introduction

The purpose of this paper is to investigate the variation found in texts written in Apabhraṃśa, an Indic (Indo-Aryan) language. Apabhraṃśa was used in literary works such as poetry and narrative in northern India from the 5th century to the 12th-13th century A.D. It has been classified in diverse ways so far, because some variation is observed in its extant texts. This variation, however, does not agree with the New Indo-Aryan languages. We therefore supposed that Apabhraṃśa spread over India as a literary language from a certain region, and that the morphological variation was then caused by authors who preferred archaic forms. To examine this supposition, we collected new forms and pseudo-archaic forms from a corpus and calculated their proportions. The result of the examination implies interrelations between regions and pseudo-archaic tendencies.

Indo-Aryan is a branch of Indo-European. The earliest documents of these languages are the Vedic texts, especially the Ṛgveda. The language of the Ṛgveda is called Vedic, and it shares some linguistic characteristics with other Indo-European languages. The grammarian Pāṇini described a grammar based on his own language in the 5th or 4th century B.C. Pāṇini’s grammar was treated as authoritative by scholars in later ages; the language it describes is classical Sanskrit. Both Vedic and classical Sanskrit are categorized as Old Indo-Aryan (OIA), distinguished by inflectional features from the other Indo-Aryan languages. Middle Indo-Aryan (MIA) is a term for the languages other than OIA and the New Indo-Aryan languages (NIA). OIA and NIA are defined more precisely than MIA, whose delimitation is ambiguous. This probably results from the fact that MIA is defined sociolinguistically rather than linguistically. We illustrate this below by surveying the MIA languages.
Classical Sanskrit was dominant over all other languages of India as a literary language for a long time. Provincial languages, however, were employed in parallel with Sanskrit. The edicts of King Aśoka (3rd century B.C.) are the oldest existing documents of the Indo-Aryan languages, and the languages of the edicts are different from classical Sanskrit. The edicts were written in languages that show some variation, and they are classified into eastern, western, and north-western groups according to phonological features. Pāli and Ardha Māgadhī were the typical literary languages other than classical Sanskrit. Pāli is the canonical language of Buddhists, and the oldest Pāli documents are thought to have been compiled in the 3rd century B.C. Pāli seems to have originated in western India, since it shares some phonological features with the western group of the Aśokan edicts. Ardha Māgadhī is used in the Jain canon, and Māgadhī derives from the Magadha region of eastern India. The phonemic and inflectional systems of Pāli and Ardha Māgadhī are simplified compared with OIA. Buddhist canons were also compiled in the Gandhāra region of north-western India. A simplification of inflection was under way but still limited

Figure 1. The map of Middle Indo-Aryan I and II



in these languages; for example, they retain the aorist and the perfect in conjugation, which disappeared later. This stage is called MIA (I).

MIA entered its second stage in the 2nd-3rd century A.D. Classical Sanskrit dominated literary writing, even that of Buddhists and Jains, but dramatists began to employ regional languages to strengthen rhetorical effects. The texts of dramaturgy prescribe which language should be used in a play according to social status. Mahārāṣṭrī, Śaurasenī, and Māgadhī are the main languages of MIA (II). Mahārāṣṭrī derives from the Mahārāṣṭra region in western India, and Śaurasenī from the Śūrasena region in central India. Scholars and rhetoricians called them Prakrit languages, meaning ‘natural languages’ in contrast to Sanskrit, the ‘perfect language’. ‘Prakrit’ may include all MIA languages in a broad sense, but we use the term narrowly, according to the literary usage of the languages.

The Nāṭyaśāstra, ‘The Book of Dance (Drama)’, by the dramatist Bharata is considered the oldest document that mentions the Prakrit languages, and it was often quoted by later grammarians. Bharata explained the uses of various languages, including Sanskrit and Prakrit, as a technique of dramaturgy. The names of various languages and their social status are therefore roughly described in the Nāṭyaśāstra. Bharata said:

A dramatist should choose deśabhāṣā (‘regional languages’) according to his taste, so that there are categories of poetry made in various regions. There are seven bhāṣās (‘major languages’), Māgadhī, Avantī, Prācyā, Śaurasenī, Ardhamāgadhī, Bāhlikā, and Dākṣiṇātyā. In drama, Śakāra, Ābhīra, Caṇḍāla, Śabara, Dramila, and Āndhraja are inferior vibhāṣās (‘minor languages’) of forest inhabitants.1

Table 1. Middle Indo-Aryan languages

Stage    Period                    Language
I        B.C. 4c? - A.D. 1c?       Languages of Aśokan Edicts, Pāli, Ardha Māgadhī, Gāndhārī
II       2nd - 12th century        Mahārāṣṭrī, Śaurasenī, Māgadhī
III      6th? - 13th century       Apabhraṃśa

Another important use of Prakrit is narrative literature. Jain authors in particular wrote narratives and hagiographies of Jain saints in the Prakrit languages called Jain Mahārāṣṭrī and Jain Śaurasenī. The differences between MIA (I) and MIA (II) are not great: the number of hiatuses increased as single intervocalic stops were gradually weakened. The two stages may be distinguished sociolinguistically rather than linguistically. The grammarians and rhetoricians of those days did not refer to the MIA (I) languages, and a discontinuity is observed between MIA (I) and MIA (II) in terms of literary tradition. These two stages could be integrated into one group linguistically, but we follow the traditional distinction in this paper.

1 Nāṭyaśāstra, 17.46–48. See Dolci (1938) pp. 68–69.

Apabhraṃśa is regarded as the last stage of MIA. It inherits the phonological features of MIA (II) but acquires new morphological and syntactic characteristics. According to Bubeník (1998), Apabhraṃśa has the following features:

i) The nominal system ended up with only one form each for Nominative vs. Accusative, Instrumental vs. Locative, and Ablative vs. Genitive.
ii) Certain disyllabic suffixes containing -n- became monosyllabic, like -eṇa- to -eṃ.
iii) i- and u-stems started to decline as a-stems.
iv) The pronominal system adopted some nominal suffixes.
v) Postpositional cases emerged.
vi) The past passive participle became the only means for expressing the past tense.
vii) A system of lexical aspect evolved, anticipating the later state of affairs: certain verbs like ‘go’ or ‘give’ started functioning as light verbs.
viii) The ergative construction emerged.

We can thus see linguistic differences between MIA (II) and MIA (III). However, the language of the Apabhraṃśa documents shows rich variation by time and region, particularly in morphology. For example, Tagare (1948) showed that the features of Eastern Apabhraṃśa differ markedly from the Apabhraṃśa of other regions in phonology and morphology. A definite relation between Eastern Apabhraṃśa and eastern NIA languages such as Bengali or Oriya, however, has not been proved. Most Apabhraṃśa documents were written by Jains in western India, although some Buddhists of eastern India also chose Apabhraṃśa as a literary language. Most MIA languages bear the names of regions. ‘Apabhraṃśa’, however, does not imply a birthplace; the name means only ‘corrupt’ in Sanskrit.
Therefore, the term has ambiguous meanings, among which are ‘languages other than Sanskrit’ and ‘a literary language used over northern India’. Hence, various classifications of Apabhraṃśa have been used until modern times. For example, Tagare (1948) insisted on the existence of linguistic differences between Western, Southern, and Eastern groups. In the following, we present such classifications, which have been proposed from the middle ages until the present. Before examining the classifications of Apabhraṃśa, we cite passages from the texts of Indian grammarians and rhetoricians from the second century B.C. to the tenth century A.D. that show how Apabhraṃśa was recognized as a language of literature. In the Mahābhāṣya, Patañjali wrote:2

2 Mahābhāṣya, 1.1.1. Commentary. See Kielhorn (1880) p. 2. Patañjali (second century B.C.) is a grammarian who wrote a commentary on Pāṇini’s grammar.



Most of the words (of Sanskrit) changed into corrupted (apabhraṃśaḥ) forms. For example, the word gaurr (cow) becomes gāvī, govī, gotā, or gopotālikā.

In Kāvyālaṃkāra, Bhāmaha3 considered Apabhraṃśa as a name of a language. There are three kinds (of languages, which are) Sanskrit, Prakrit, and Apabhraṃśa.

In Kāvyādarśa, Daṇḍin4 said: Brahmans say, ‘the literary works are comprised of the four kinds hereafter; they are Sanskrit, Prakrit, Apabhraṃśa, and mixed language.’

In Kāvyamīmāṃsā, Rājaśekhara5 wrote: The words and the meanings are your bodies. Sanskrit is your mouth, Prakrit is your arm, Apabhraṃśa is your back, Paiśācī is your foot, and the ‘mixed language’ is your chest.

The authors mentioned Apabhraṃśa along with Sanskrit and the Prakrit languages in poetics. Tagare set the first use in the fifth century A.D. based on some verses of the drama Vikramorvaśīya by Kālidāsa. There are many arguments about the genuineness of these verses.6 Shahidullah (1928), who edited the Dohākoṣa, estimated the date of this work to be around A.D. 700, whereas Chatterji (1926) insisted it was written about A.D. 1200. Hence, the upper limit of Apabhraṃśa works remains undecided, but we can say that the production of Apabhraṃśa literature peaked around the tenth century. The Jains played an important role in the flourishing of Apabhraṃśa literature. They created enormous works such as the Paumacariu, the Mahāpurāṇa, and other hagiographical literature. A famous Jain scholar named Hemacandra compiled a detailed Apabhraṃśa grammar in the twelfth century.

2. The traditional classification

In this section, we will examine various classifications of Apabhraṃśa. We divide the classifications into four types corresponding to the medieval grammarians, Jacobi and Tagare. The term ‘medieval’ indicates a long period in this paper, from the 5th century to the 18th century, because it is often difficult to date a work in India and the traditional style remained largely unchanged.

3 Kāvyālaṃkāra, 1.16. Bhāmaha is thought to be the oldest rhetorician, but his date is not determined.
4 Kāvyādarśa, 1.32. Daṇḍin (seventh century A.D.) is a poet and rhetorician.
5 Kāvyamīmāṃsā, 3. Rājaśekhara (tenth century A.D.) is a poet and rhetorician.
6 See Tagare (1948) p. 17 and Velankar (1961) Intro. pp. 55-80.



2.1. A classification (Nāgara, Upanāgara, Vrācaḍa) by the Medieval Grammarians

There have been many grammarians in India since Pāṇini. Dolci (1938) introduced the grammarians who mentioned Prakrit languages and examined their mutual relations. The Prākṛtaprakāśa ‘Illumination of Prakrit’ by Vararuci is considered the oldest among their works. It seems to have been compiled between the first century B.C. and the second century A.D. Vararuci described only one Prakrit language, which later grammarians called Mahārāṣṭrī. The style of the description follows the classical rules provided by Pāṇini: Vararuci’s rules for Prakrit prescribe how to convert a Sanskrit word or suffix into Prakrit. Later grammarians followed the same style.

Five medieval grammarians are important for investigating Apabhraṃśa variation: Hemacandra, Kramadīśvara, Puruṣottama, Rāmaśarman, and Mārkaṇḍeya. Hemacandra was an important grammarian who wrote a prescriptive grammar of Apabhraṃśa. He did not, however, mention any subcategories of the language, in contrast to the other grammarians, who gave names to different types of Apabhraṃśa. The other four grammarians prescribed the grammar of only three subcategories: Nāgara, Upanāgara, and Vrācaḍa. They treated Nāgara as the basic Apabhraṃśa and explained the others by describing their differences from Nāgara. Their descriptions, however, are not always applicable to the Apabhraṃśa seen in existing documents. Moreover, it is difficult to find diachronic relations in the Indian classical texts, whose authors often seem to take little interest in historical description. Grammatical texts show the same tendency; hence, they supply little information about the historical relations among the varieties of Apabhraṃśa.

Hemacandra wrote the grammatical text Siddhahemaśabdānuśāsana (twelfth century A.D.), which described the grammar of Apabhraṃśa for the first time. This work is considered a standard grammar of Jain Prakrit literature. However, it contains no allusion to a classification of Apabhraṃśa; Hemacandra differs from the other Prakrit grammarians on this point. His work is thought to have influenced the later grammarians immensely. The Apabhraṃśa grammar that he established agrees with the Nāgara Apabhraṃśa, which the other four grammarians described as the principal Apabhraṃśa. However, he gives unattested forms such as the second-person singular ablative pronoun tudhra. Sanskritisms and Prakritisms are often found in real documents, e.g., the genitive singular devassa, which seems to be borrowed from the Prakrit form. We can find these ‘pseudo-archaic’ forms even in famous narrative texts, but Hemacandra did not give them. It is therefore supposed that he reduced some characteristics of the language in order to create a ‘normal’ grammar of Apabhraṃśa, given that he had also written guides to literary composition other than grammar, for example on vocabulary, poetics, and metrics.

The Classification of Apabhraṃśa


2.2. Jacobi’s classification (Northern and Gurjara Apabhraṃśa)

Jacobi edited two Apabhraṃśa works, the Bhavisattakahā (1918) and the Nemināhacariu included in the Sanatkumāracarita (1921). He classified the Apabhraṃśa of these texts after considering the Apabhraṃśa portion of Kālidāsa’s Vikramorvaśīya, Mārkaṇḍeya, and Kramadīśvara, of which only a part had been discovered. Jacobi stated that the Nāgara of the medieval grammarians includes the languages of both the Bhavisattakahā and the Nemināhacariu. However, he found some differences between them, so he named the former Northern Apabhraṃśa (NAp) and the latter Gurjara Apabhraṃśa (GAp). Alsdorf (1928) edited another Apabhraṃśa work, the Kumārapālapratibodha, and described its language as ‘classical’ Apabhraṃśa in comparison with GAp. Alsdorf showed the differences between these two types of Apabhraṃśa. The following two points are important morphological differences between the two groups.

The instrumental singular of a-stem nouns has the desinences aiṃ and eṃ in NAp, whereas they are aiṇa and eṇa in GAp.
NAp: parihāsaiṃ bhiuḍipaloyaṇāiṃ / līlaeṃ addhāsaṇabhoyanāiṃ // [Bhavisattakahā 324. 3] “by smile (parihāsaiṃ) contracting eyebrows, and by charm (līlaeṃ) by which he can get half of a king’s seat”
GAp: muṇivi savvu nāṇena sūrihiṃ nibbhacchiu / [Kumārapālapratibodha S. 101. 3] “Scholars knew all by wisdom (nāṇena), and scolded”

The genitive singular of a-stem nouns has the desinences aho, ihu and uha in NAp, whereas they are aha, ihi, and uhu in GAp.
NAp: ema karevi samuccau gottaho / [Bhavisattakahā 43. 3] “The meeting of the whole families was held in this way”
GAp: riddhivihūṇaha māṇusaha / [Kumārapālapratibodha E 19a. 1] “Of a person who does not have wealth”

In addition, GAp has a considerable number of words and desinences borrowed from Sanskrit and the Prakrit languages, unlike NAp. Bubeník (1998) pointed out that the Kumārapālapratibodha shows a tendency to revert to the Prakrit languages.

2.3. The classification (Western, Southern, and Eastern) of Tagare

Owing to the limited number of documents available, Jacobi classified Apabhraṃśa within a restricted range. Tagare (1948) collected Apabhraṃśa documents broadly and traced the historical changes of Apabhraṃśa by examining these texts. He classified the Apabhraṃśa documents distributed over north India into Eastern Apabhraṃśa (EAp) of Bengal, Western Apabhraṃśa (WAp) of Gujarat and Rajasthan, and Southern Apabhraṃśa (SAp) of Maharashtra. The differences among them, as indicated by Tagare, are as follows.

Figure 2. The map of Apabhraṃśa groups according to Tagare (1948) with Kashmiri Apabhraṃśa.

WAp and SAp converted the three sibilants (ś, ṣ, and s) of Old Indo-Aryan into s, whereas EAp converted them into ś.
Sanskrit śāstra (weapon) >
WAp: sattha [Kumārapālapratibodha S. 31.1]
SAp: sattha [Karakaṇḍacariu 8. 8. 4]
EAp: śattha [Dohākoṣa of Kāṇha 12. 2]
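The sibilant correspondence described above can be sketched as a small lookup. The following Python fragment is only an illustration under stated assumptions: the function name `merge_sibilants` and the group labels used as dictionary keys are hypothetical, and the function models the sibilant merger alone, not the other sound changes (such as the treatment of consonant clusters) that separate the Sanskrit word from the attested Apabhraṃśa forms.

```python
import unicodedata


def _nfc(text: str) -> str:
    """Normalize to NFC so that each letter with a diacritic is one code point."""
    return unicodedata.normalize("NFC", text)


# The three Old Indo-Aryan sibilants that merge in Apabhraṃśa.
OIA_SIBILANTS = {_nfc(s) for s in ("ś", "ṣ", "s")}

# Outcome of the merger in each group, as reported for Tagare's classification.
MERGED_SIBILANT = {"WAp": _nfc("s"), "SAp": _nfc("s"), "EAp": _nfc("ś")}


def merge_sibilants(word: str, group: str) -> str:
    """Replace every OIA sibilant in `word` with the single sibilant of `group`."""
    target = MERGED_SIBILANT[group]
    return "".join(target if ch in OIA_SIBILANTS else ch for ch in _nfc(word))


for group in ("WAp", "SAp", "EAp"):
    print(group, merge_sibilants("śāstra", group))
```

The output differs from the attested sattha and śattha in everything except the sibilants, which is the only feature this sketch is meant to isolate.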

WAp and SAp have the verb desinences si and hi for the second person singular, but EAp has only ‘si’.



WAp: ahaha ari jiya karisi maa rosu / [Sanatkumāracarita 693.1] “Oh, Soul, you must not have anger”
SAp: kiṃ mahilahe kāraṇe khavahi dehu / [Karakaṇḍacariu 5.16.2] “Why do you torment your body for the lady?”
EAp: tāba ki dehānuttara pābasi / [Dohākoṣa of Saraha 62.3] “Why did you obtain a supreme body?”

The double consonant corresponding to kṣ in Sanskrit is both kkh (kh) and cch (ch) in WAp and SAp, but EAp has only kkh (kh).
WAp: kṣetra > khitta [Bhavisattakahā 349. 10], chitta [Bhavisattakahā 5. 3]
EAp: kṣetra > khetta [Dohākoṣa of Saraha 50. 1]

The desinences of the nominative and accusative singular of a-stem nouns are u and o in WAp and SAp, whereas EAp tends to show only a for these cases.

WAp differs from SAp and EAp in several respects. First, it has the desinence auṃ for the first person singular of verbs, whereas SAp and EAp have mi.
WAp: garuya maṇoraha jai karauṃ / [Kumārapālapratibodha J 8. 8] “Great master, if I do it”
SAp: ṇiyabuddhipavittharu ṇau rahami / [Mahāpurāṇa 69. 1. 3] “I, who have large innate wisdom, do not hide anything”
EAp: tāhara ṇāma ṇa jāṇami e sahi / [Dohākoṣa of Saraha 92. 4] “Oh Friend, I do not know the name of him”

WAp has the desinence ahiṃ for the third person plural of verbs, whereas SAp has anti.
WAp: ramahiṃ ramaṇi sayasahasa sundara / [Sanatkumāracarita 520. 5] “Hundreds and thousands of women gather, oh, Beauty!”
SAp: dahadoaṃgaiṃ risi je dharaṃti / paṃcasayaiṃ tāhaṃ vi vajjaraṃti // [Mahāpurāṇa 68. 8. 4] “It is said that there were five hundred ascetics endowed with twelve Angas”

SAp uses the desinence -ahuṃ for the infinitive of verbs, which is not seen in WAp. Hemacandra stated that -i, -iu, -ivi, -avi, -eppi, -eppiṇu, -evi, and -eviṇu are desinences for the infinitive.
SAp: pāraddhau thuṇahuṃ ṇarāhiviṇa bhuvaṇaṃbhoruhadivasayaru // [Mahāpurāṇa 2. 2. 14] “The King has begun to worship (at Jina), who is the lotus and the sun of this world”



In Tagare’s classification, WAp and SAp are not very different, whereas he regarded EAp as comparatively distant from the others in its words and morphemes.

2.4. The validity of the various classifications

The classifications of the medieval grammarians are scarcely supported by textual evidence. It is difficult to find systematic differences among their Apabhraṃśa languages. In addition, it is unclear whether the languages described in these classifications have any relation to NIA languages. Moreover, the texts of existing Apabhraṃśa documents scarcely correspond to the grammarians’ rules. Thus, we have no means of inferring the relations between their grammars and Apabhraṃśa literature. However, there are some differences between the four grammarians and Hemacandra in their descriptions of Apabhraṃśa.7 Dolci pointed out, ‘Prakrit grammarians classified the languages not by linguistic characteristics, but by literary usage.’8 For example, Rāmaśarman said, ‘Even if languages such as Śākāraka, Auḍra, and Drāviḍa possess the vocabulary of Apabhraṃśa, you should not consider them Apabhraṃśa languages when they are used in dramas.’9 Thus, the classification of the language did not depend solely on linguistic characteristics; it was based on style, including the quantity of borrowed words and the meters frequently used. Hence, the classifications are not explanations of the linguistic characteristics of Apabhraṃśa.

Jacobi suggested that there were regional differences among the Apabhraṃśa texts. However, there were not enough edited texts for an analysis of the differences. Tagare made use of many edited texts and classified them into three varieties of Apabhraṃśa: Eastern, Western, and Southern. Because there are considerable differences among them in phonology and morphology, it is valid to classify them in this way. However, it is unclear why such differences exist. I do not regard the ‘Apabhraṃśa languages’ as local languages that correspond to NIA languages, but as a literary language that, like Mahārāṣṭrī and others, originated in a particular region, because the differences among the Apabhraṃśa categories put forth by Tagare do not correspond to differences among NIA languages.

7 Bubeník (1998) pp. 43-47.
8 Dolci (1926) p. 122.
9 Prākṛtakalpataru, 2.3.31.



3. Corpus analysis

3.1. The texts used for the analysis

This study examines variation among texts written in Apabhraṃśa. The corpus consists of eight texts.

Eastern Apabhraṃśa (EAp)

Dohākoṣa of Kāṇha and Dohākoṣa of Saraha: These two works set out the dogmas of esoteric Buddhism. Kāṇha and Saraha are recorded as saints of late Buddhism, which had spread over eastern India, Nepal, and Tibet by the 7th century A.D. at the latest. It is estimated that Kāṇha lived sometime between the 8th and 12th centuries and Saraha between the 10th and 12th centuries.

Southern Apabhraṃśa (SAp)

Paumacariu: The Paumacariu is generally known as a Jain Rāmāyana. The Rāmāyana is one of the most famous epics in India. Jains often reworked Hindu epics and narratives according to their own doctrine, and the Paumacariu is a typical example of such a reworking. The author, Svayambhū, lived before the 10th century in Mahārāṣṭra. The text was edited, translated, and digitized by De Clerck (2005).

Hariseṇacariu: The deeds of the Sixty-three Great Persons are integral to Jain cosmography and legend. Jain writers therefore repeatedly compiled hagiographies of the Great Persons. Hariseṇa is one of the Great Persons and a king of the world. I assigned this work to Southern Apabhraṃśa, judging from its contents and the locations of its manuscripts. It is thought to postdate the Paumacariu.

Western Apabhraṃśa (WAp)

Sanatkumāracarita: The Sanatkumāracarita, a Jain narrative work, includes the Nemināhacariu, which is written in Apabhraṃśa. The author, Haribhadra, states that he wrote this work in 1159 A.D. in Gujarat, western India. He was a contemporary of Hemacandra, mentioned above.

Vikramorvaśīya: The author, Kālidāsa, is the most famous poet and dramatist of the classical period of India. Kālidāsa wrote nothing in Apabhraṃśa apart from a portion of the Vikramorvaśīya, so some scholars doubt the genuineness of the Apabhraṃśa portion. Following Tagare, Velankar (1961), and Ghosal (1972), this paper accepts it as genuine.

The texts are classified into three groups according to Tagare (Table 2).



Moreover, we added two texts to the corpus, the Tantrasāra and the Āgamadambara.

Kashmiri Apabhraṃśa (KAp)

Tantrasāra: This is a summary of the Tantrāloka, a philosophical text written by Abhinavagupta in the 10th or 11th century in Kashmir, north-western India. Abhinavagupta was a scholar of Kashmir Saivism, a sect of Hinduism. Each chapter of the Tantrasāra ends with an Apabhraṃśa portion.

Āgamadambara: A play that criticizes heretical sects from the point of view of Hinduism. The ascetics among its characters, who belong to the so-called Nīlāmbara sect, use Apabhraṃśa.

These two texts belong to Kashmir Saivism, and their linguistic characteristics are comparatively similar to those of Western Apabhraṃśa. However, their contents differ significantly from those of Jain narrative literature. Hence, these works are newly classified under the provisional name Kashmiri Apabhraṃśa (KAp).

Table 2. Texts included in the Apabhraṃśa corpus

Group                 Texts
Eastern Apabhraṃśa    Dohākoṣa of Kāṇha, Dohākoṣa of Saraha
Southern Apabhraṃśa   Paumacariu, Hariseṇacariu
Western Apabhraṃśa    Sanatkumāracarita, Vikramorvaśīya
Kashmiri Apabhraṃśa   Tantrasāra, Āgamadambara

Table 3 shows the number of word tokens and types in each text. It should be noted that compounds are not divided in the corpus, although noun compounds are used frequently in Indian classical literature and were important for classical composition. Compounds should therefore be divided before words are counted, but the corpus does not yet have this capability. In addition, the corpus has no syntactic or morphological tags. However, all the Apabhraṃśa documents are versified works, so we can infer linguistic information from metrical information such as the length of a syllable or its position in the foot.

3.2. Matters of research

The texts are classified according to Tagare, and Kashmiri Apabhraṃśa is added to his categories on the grounds of the contents of the texts and the regions of their authors (Tables 2 and 3). We then examined the validity of the classifications given above from the viewpoint of morphology and archaism.



Table 3. The token and type of the corpus

Region                Title                 Token    Type
Eastern Apabhraṃśa    Dohākoṣa of Kāṇha       444     319
Eastern Apabhraṃśa    Dohākoṣa of Saraha     2588    1399
Southern Apabhraṃśa   Hariseṇacariu          5439    2942
Southern Apabhraṃśa   Paumacariu            98651   32208
Western Apabhraṃśa    Sanatkumāracarita      7548    4528
Western Apabhraṃśa    Vikramorvaśīya          177     128
Kashmiri Apabhraṃśa   Āgamadambara            106      90
Kashmiri Apabhraṃśa   Tantrasāra              402     343
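Because the texts in Table 3 differ in size by several orders of magnitude, the raw counts are hard to compare directly. A minimal Python sketch of one common summary, the type/token ratio, is given below. This ratio is not reported in the paper; it is added here only as an illustration, and the names `TABLE_3` and `type_token_ratio` are hypothetical.

```python
# (title, tokens, types) as given in Table 3.
TABLE_3 = [
    ("Dohākoṣa of Kāṇha", 444, 319),
    ("Dohākoṣa of Saraha", 2588, 1399),
    ("Hariseṇacariu", 5439, 2942),
    ("Paumacariu", 98651, 32208),
    ("Sanatkumāracarita", 7548, 4528),
    ("Vikramorvaśīya", 177, 128),
    ("Āgamadambara", 106, 90),
    ("Tantrasāra", 402, 343),
]


def type_token_ratio(tokens: int, types: int) -> float:
    """Number of distinct word forms divided by the number of running words."""
    if tokens <= 0:
        raise ValueError("a text must contain at least one token")
    return types / tokens


for title, tokens, types in TABLE_3:
    print(f"{title:22s} {tokens:6d} tokens  TTR = {type_token_ratio(tokens, types):.3f}")
```

The ratio falls as a text grows longer, so the low value for the Paumacariu mainly reflects its size. Undivided compounds, noted above, also inflate the number of types in every text.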

The Apabhraṃśa declension has multiple forms for a single case as defined by traditional grammar. Such phenomena are mainly due to a pseudo-archaic tendency, especially in WAp. Hemacandra gave some archaic forms for the instrumental singular and the dative-genitive singular of a-stem nouns, namely deveṇa, devasu, and devassu (Table 4), whereas he did not describe devāṇa for the dative-genitive plural, which is used frequently in the texts.

Table 4. Declension of the a-stem noun ‘deva’ (god) according to Hemacandra’s grammar

Case              Singular                                               Plural
Nominative        deva, devā, devu, devo                                 deva, devā
Accusative        deva, devā, devu                                       deva, devā
Instrumental      deveṇa, deveṇaṃ, deveṃ                                 devahiṃ, devāhiṃ, devehiṃ
Dative-Genitive   deva, devā, devasu, devāsu, devaho, devāho, devassu    deva, devā, devahaṃ, devāhaṃ
Ablative          devahe, devāhe, devahu, devāhu                         devahuṃ, devāhuṃ
Locative          devi, deve                                             devahiṃ, devāhiṃ

We therefore took up four morphological categories for the analysis in this paper.
1. Instrumental singular masculine of a-stem nouns: -eṇa, -eṇaṃ, and -eṃ
2. Dative-genitive singular masculine of a-stem nouns: -aho/-āho10, -su, and -ssu
3. Dative-genitive plural masculine of a-stem nouns: -ahaṃ/-āhaṃ, -āṇa, and -āṇaṃ
4. Present first singular of a-stem verbs: -mi and -auṃ

10 The phonemes a and ā before the suffixes -ho and -haṃ may alternate with each other.



These items were chosen in order to examine the proportion of pseudo-archaic to non-archaic forms. The dative-genitive singular -aho and its plural -ahaṃ are not thought to be borrowings from Sanskrit or other MIA languages. However, nothing definite can be said about their origins, and the historical position of these forms remains unclear. I therefore use the term ‘non-archaic’ for -ho, -haṃ, and so on. The identification of pseudo-archaic forms is based on Table 5.

Table 5. The nominal system (a-stems) of Old and Middle Indo-Aryan

Number    Case          OIA Sanskrit  MIA(I) Pāli   MIA(II) Mahārāṣṭrī  MIA(III) Apabhraṃśa
Singular  Nominative    -aḥ           -o            -e                  -u
Singular  Accusative    -am           -aṃ           -aṃ                 -u
Singular  Instrumental  -eṇa          -eṇa, -aṃ     -eṇa                -eṃ
Singular  Dative        -āya          -assa, -āya   -āa                 -aho
Singular  Genitive      -asya         -assa         -assa               -aho, -ahu
Singular  Ablative      -āt           -asmā         -āo                 -ahe, -ahu
Singular  Locative      -e            -asmiṃ        -ammi, -e           -i, -e
Plural    Nominative    -āḥ           -ā            -a                  -a
Plural    Accusative    -an           -e            -e                  -a
Plural    Instrumental  -ebhiḥ        -ehi          -ehiṃ               -a, -ehiṃ
Plural    Dative        -ebhyaḥ       (→genitive)   (→genitive)         (→genitive)
Plural    Genitive      -āṇāṃ         -ānaṃ         -āṇaṃ               -ahaṃ
Plural    Ablative      -ebhyaḥ       -ehi          -āhinto             -ahuṃ
Plural    Locative      -eṣu          -esu          -esu                -ahiṃ

Instrumental Singular

The instrumental singular in Sanskrit is -eṇa. Apabhraṃśa keeps this form alongside -eṇaṃ and -eṃ. EAp uses -eṃ frequently, in contrast to WAp and KAp, which do not use it at all. SAp lies between the two sides. Therefore, we can put these texts in the order EAp > SAp > WAp, KAp from the -eṃ group to the -eṇa group.

Table 6. Numbers of forms of a-stem Instrumental Singular listed by regions and titles

Region   -eṃ    -eṇa   -eṇaṃ
EAp       181     19      0
SAp      2254   3235    110
WAp         0    104      0
KAp         0      6      0

Title                 -eṃ    -eṇa   -eṇaṃ
Dohākoṣa of Kāṇha       15      4      0
Dohākoṣa of Saraha     166     15      0
Hariseṇacariu           62     74      0
Paumacariu            2192   3161    110
Sanatkumāracarita        0    104      0
Āgamadambara             0      2      0
Tantrasāra               0      4      0
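The ordering from the -eṃ group to the -eṇa group can be made explicit by turning the regional counts of Table 6 into proportions. The Python sketch below is an illustration only; the paper reports raw counts, and the names `TABLE_6`, `share`, and `ranking` are hypothetical.

```python
EM, ENA, ENAM = "-eṃ", "-eṇa", "-eṇaṃ"

# Regional counts from Table 6.
TABLE_6 = {
    "EAp": {EM: 181, ENA: 19, ENAM: 0},
    "SAp": {EM: 2254, ENA: 3235, ENAM: 110},
    "WAp": {EM: 0, ENA: 104, ENAM: 0},
    "KAp": {EM: 0, ENA: 6, ENAM: 0},
}


def share(counts: dict, form: str):
    """Proportion of `form` among all counted forms, or None if nothing was counted."""
    total = sum(counts.values())
    return counts[form] / total if total else None


em_share = {region: share(counts, EM) for region, counts in TABLE_6.items()}

# Sort regions from the highest to the lowest share of non-archaic -eṃ.
ranking = sorted(em_share, key=em_share.get, reverse=True)

for region in ranking:
    print(f"{region}: {em_share[region]:.1%} -eṃ")
```

WAp and KAp both have a share of zero, so their relative position in `ranking` carries no information, which matches the way the text lists them together.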



Dative-Genitive Singular

The forms -asu and -assu are probably related to -asya, the genitive singular form in Sanskrit. WAp has no -aho. Thus, we can arrange the groups in the order SAp > EAp > WAp, KAp from the -aho group to the -asu group.

Table 7. Numbers of forms of a-stem Dative-Genitive Singular listed by regions and titles

Region   -aho   -asu   -assu
EAp         9     18      0
SAp      3709    745      0
WAp         0    108     50
KAp         0      5      2

Title                 -aho   -asu   -assu
Dohākoṣa of Kāṇha        3      2      0
Dohākoṣa of Saraha       6     16      0
Hariseṇacariu          125     51      0
Paumacariu            3584    694      0
Sanatkumāracarita        0    108     50
Āgamadambara             0      2      0
Tantrasāra               0      3      2
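Table 7 gives the same counts twice, by region and by title, so the two halves can be checked against each other. The following Python sketch sums the per-title counts into regional totals and compares them with the reported figures. The assignment of titles to regions follows Table 2; the names `BY_TITLE`, `REPORTED_BY_REGION`, and `totals_by_region` are hypothetical.

```python
from collections import defaultdict

# (region, title, -aho, -asu, -assu) for each text, as in Table 7.
BY_TITLE = [
    ("EAp", "Dohākoṣa of Kāṇha", 3, 2, 0),
    ("EAp", "Dohākoṣa of Saraha", 6, 16, 0),
    ("SAp", "Hariseṇacariu", 125, 51, 0),
    ("SAp", "Paumacariu", 3584, 694, 0),
    ("WAp", "Sanatkumāracarita", 0, 108, 50),
    ("KAp", "Āgamadambara", 0, 2, 0),
    ("KAp", "Tantrasāra", 0, 3, 2),
]

# (-aho, -asu, -assu) for each region, as in Table 7.
REPORTED_BY_REGION = {
    "EAp": (9, 18, 0),
    "SAp": (3709, 745, 0),
    "WAp": (0, 108, 50),
    "KAp": (0, 5, 2),
}


def totals_by_region(rows):
    """Sum the per-title counts of every form within each region."""
    totals = defaultdict(lambda: [0, 0, 0])
    for region, _title, *counts in rows:
        for index, count in enumerate(counts):
            totals[region][index] += count
    return {region: tuple(values) for region, values in totals.items()}


computed = totals_by_region(BY_TITLE)
mismatches = {
    region: (computed.get(region), reported)
    for region, reported in REPORTED_BY_REGION.items()
    if computed.get(region) != reported
}
print(mismatches or "per-title counts agree with the regional totals")
```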

Dative-Genitive Plural

The dative-genitive plural form in Sanskrit is -ānām, and the Prakrit form is -āṇaṃ. Apabhraṃśa has -āṇaṃ and -āṇa alongside the non-archaic form -ahaṃ. The form -ahaṃ appears to be scarce in EAp. Thus, we can arrange the groups in the order WAp > SAp > EAp from the -ahaṃ group to the -āṇa group.

Table 8. Numbers of forms of a-stem Dative-Genitive Plural listed by regions and titles

Region   -ahaṃ   -āṇa   -āṇaṃ
EAp          0     25       0
SAp        319     77       1
WAp        137     16       0
KAp          0      0       0

Title                 -ahaṃ   -āṇa   -āṇaṃ
Dohākoṣa of Kāṇha         0      2       0
Dohākoṣa of Saraha        0     23       0
Hariseṇacariu            25      9       0
Paumacariu              294     68       1
Sanatkumāracarita       137     16       0
Vikramorvaśīya            0      0       0
Āgamadambara              0      0       0
Tantrasāra                0      0       0

First Singular of Verbs

The a-stem verbal desinence of the present first singular is -ami in Sanskrit. Apabhraṃśa has another form, -auṃ, which WAp uses comparatively often. Hence, we can posit WAp > SAp, EAp, KAp. The Vikramorvaśīya does not have -auṃ, and in this regard it resembles SAp; the Tantrasāra also has no -auṃ. The Vikramorvaśīya, although its sample is very small, seems to belong to SAp rather than WAp.

Table 9. Numbers of forms of a-stem First Singular of Verbs listed by regions and titles

Region   -auṃ   -ami
EAp         1      9
SAp        28    696
WAp        29     20
KAp         0      2

Title                 -auṃ   -ami
Dohākoṣa of Kāṇha        1      0
Dohākoṣa of Saraha       0      9
Hariseṇacariu           22     60
Paumacariu               6    636
Sanatkumāracarita       29     15
Vikramorvaśīya           0      5
Tantrasāra               0      2
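One way to check whether a difference in preference such as that between WAp and SAp in Table 9 is larger than chance variation is a chi-square test on the 2×2 table of counts. The test is not part of the analysis reported here; the Python sketch below is a hedged illustration using the standard Pearson statistic without continuity correction, and the names `chi_square_2x2`, `WAP`, and `SAP` are hypothetical.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denominator


# Regional counts from Table 9, in the order (-auṃ, -ami).
WAP = (29, 20)
SAP = (28, 696)

# Critical value of the chi-square distribution, 1 degree of freedom, 5% level.
CRITICAL_5_PERCENT_DF1 = 3.841

statistic = chi_square_2x2(*WAP, *SAP)
print(f"chi-square = {statistic:.1f}")
print("exceeds the 5% critical value:", statistic > CRITICAL_5_PERCENT_DF1)
```

The expected count of -auṃ in WAp is below five, so the chi-square approximation is rough here and an exact test may be preferable. The small EAp and KAp samples would need similar caution.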

4. Conclusion

Apabhraṃśa is the last stage of the MIA languages and shows rich variation, especially in its inflectional system. Various classifications have therefore been attempted. This paper examines Tagare’s classification from the viewpoint of the frequency of pseudo-archaic forms. The corpus is divided into four groups: Tagare’s three groups and the Kashmiri group. However, there is no proof that these varieties of Apabhraṃśa are related to the NIA languages of each region. This paper therefore supposes that all four Apabhraṃśa groups originated from a single MIA language that was probably spoken in Rajasthan or Gujarat in western India, because western India brought forth copious Apabhraṃśa texts. In addition, the rhetoricians and dramatists of those days referred to Apabhraṃśa as related to Sanskrit and the Prakrit languages. Hence we can assume that this language spread over north India for a time as a literary language.

On this assumption, we expected that the varieties of Apabhraṃśa should be classified on the grounds of style, that is, the degree of preference for pseudo-archaic forms. The examination was designed according to this supposition. We inspected whether each group shows a constant tendency for three nominal suffixes and one verbal suffix. These four suffixes were selected because it is relatively easy to distinguish their pseudo-archaic forms from the non-archaic ones. The forms were counted for each group.

Table 6 shows that EAp prefers non-archaic -eṃ in the Instrumental Singular, whereas this form is not seen in WAp and KAp. I therefore classified EAp as ‘non-archaic’ and WAp and KAp as ‘pseudo-archaic’, with SAp located between the two. Table 7 also indicates that WAp and KAp have pseudo-archaic characteristics, whereas EAp and SAp show both forms. Table 8 shows the reverse: WAp has non-archaic forms while EAp has pseudo-archaic forms in the Dative-Genitive Plural. The same tendency is observed in Table 9, though WAp has both forms.



Table 10. From non-archaic to pseudo-archaic

Table     Category         Non-archaic    Pseudo-archaic
Table 6   inst. sg.                       WAp, KAp
Table 7   dat-gen. sg.                    WAp, KAp
Table 8   dat-gen. pl.                    EAp
Table 9   pres. 1st. sg.                  KAp, EAp

Table 10 summarizes the tendencies of the four groups for each suffix. Two points emerge from Table 10. First, SAp seems to be located between the non-archaic and pseudo-archaic ends. Second, WAp is characteristically pseudo-archaic and EAp non-archaic in the Instrumental Singular and Dative-Genitive Singular. In contrast, EAp is pseudo-archaic and WAp is non-archaic in the Dative-Genitive Plural. WAp and EAp thus clearly stand in contrast, and SAp has both characteristics. It is therefore difficult to classify these groups solely on the basis of their adoption of pseudo-archaic forms. It has become clear, however, that WAp and EAp are opposed to each other with respect to pseudo-archaic forms. Thus, a tendency based on the degree of preference for pseudo-archaic forms has been shown, but it is insufficient for the classification of Apabhraṃśa. The corpus needs to be tagged for both phonological and syntactic analysis; we should then measure these characteristics by statistical methods.

References

Alsdorf, Ludwig. 1928. Der Kumārapālapratibodha. Hamburg: De Gruyter & Co.
Bagchi, Prabodh Chandra. 1935. “Dohakoṣa with notes and Translations”. Journal of the Department of Letters 28. Calcutta University Press. i-ii 1-180.
Banerjee, Satya Ranjana. 1980. Saṃkṣiptasāragataḥ Prākṛtādhyāyaḥ. Prakrit Text Society.
Bhayani, Harivallabh Chunilal. 1957-70. Paumacariu. New Delhi: Bharatiya Jnanpith (second edition 1989-2001).
Böhtlingk, Otto. 1890. Daṇḍin’s Poetik (Kâvyâdarça). Leipzig: Haessel.
Bubeník, Vít. 1998. A Historical Syntax of Late Middle Indo-Aryan (Apabhraṃśa). Amsterdam: John Benjamins Publishing Company.
Chatterji, Suniti Kumar. 1926. Origin and development of the Bengali language. Calcutta.
Chatterji, Suniti Kumar. 1983. On the Development of Middle Indo-Aryan. Calcutta: Sanskrit College.
Cowell, Edward Byles. 1854. The Prákṛita-prakáśa: or the Prákṛita Grammar of Vararuci. Hertford: Stephen Austin.
Dalal, C. D. and Shastry, Anantakrishna R. 1916. Kāvyamīmāṃsā of Rājaśekhara. Baroda: Central Library.
De Clerck, Eva. 2005. Een Kritische studie van Svayambhūdeva’s Paümacariu. Proefschrift ingediend voor het behalen van de graad van doctor in de Oosterse talen en culturen. Gent: Universiteit Gent, Faculteit Letteren en Wijsbegeerte, Vakgroep Talen en culturen van Zuid- en Oost-Azië.
Dezső, Csaba. 2005. Much ado about religion. New York: New York University Press.
Dvivedī, Revāprasād. 2005. Bharatamunipraṇīta Svāyambhuva Nāṭyaśāstra. New Delhi: Aryan Books International.
Ghosal, S. N. 1972. The Apabhraṃśa verses of the Vikramorvaśīya from the Linguistic Standpoint. Calcutta: The World Press Private Ltd.
Ghosh, Manomohan. 1954. Rāmaśarman’s Prākṛtakalpataru: with introduction, translation and notes, and also with Puruṣottama-Deva’s Prākṛtānuśāsana, Laṅkeśvara’s Prākṛta-Kāmadhenu, and the Prākṛta-Lakṣaṇa in the Viṣṇudharmottara. Calcutta: Asiatic Society.
Jacobi, Hermann. 1918. Bhavisatta Kaha von Dhaṇavāla. München: Bayerischen Akademie der Wissenschaften.
Jacobi, Hermann. 1921. Sanatkumāracaritam. München: Bayerischen Akademie der Wissenschaften.
Jain, Hīrālāl. 1964. Karakaṇḍacariu. Delhi: Bhāratīya Jñānapīṭha.
Jain, Snehlatā. 2006. Hariseṇacariu. Jaypur: Apabhraṃśa Sāhitya Akādemī.
Kale, M. R. 1967. The Vikramorvaśīya of Kālidāsa. New Delhi: Motilal Banarsidass.
Kielhorn, Franz. 1880. The Vyâkaraṇa-Mahâbhâṣya of Patañjali. Bombay: Government Central Depôt.
Krishna Chandra Acharya. 1968. Prākṛta-Sarvasva. Ahmedabad: Prakrit Text Society.
Miśra, Paramasiṃha. 1996. Tantrasāra. Vārāṇasī: Caukhambā Surabhāratī Prakāśan.
Nara, Tsuyoshi. 1979. Avahaṭṭha and Comparative Vocabulary of NIA. Tokyo: Institute for the Study of Languages and Cultures of Asia and Africa.
Nitti-Dolci, Luigia. 1938. Les grammairiens prakrits. Paris: Adrien-Maisonneuve.
Pischel, Richard. 1880. Hemacandra’s Grammatik der Prâkritsprachen. Halle: Verlag der Buchhandlung des Waisenhauses.



Sāṃkṛtyāyana, Rāhula. 1957. Dohākoṣa. Paṭnā: Sarvādhikār Prakāśakādhīn Surakṣit.
Sastry, P. V. Naganatha. 1970. Kāvyālaṃkāra of Bhāmaha. Delhi: Motilal Banarsidass Publishers.
Sen, Sukumar. 1960. A Comparative Grammar of Middle Indo-Aryan. Poona: Linguistic Society of India.
Shankar Pandurang Pandit. 1936. Kumārapālacarita of Hemacandra Illustrating The Eighth Chapter of His Siddha-Hemacandra or Prakrit Grammar. Poona.
Shahidullah, Mohammed. 1928. Les Chants Mystiques de Kāṇha et de Saraha Les Dohā-Koṣa et Les Caryā. Paris: Adrien-Maisonneuve.
Tagare, Ganesh Vasudev. 1948. Historical Grammar of Apabhraṁśa. Poona: Motilal Banarsidass.
Vaidya, P. L. and Jain, Devendra Kumar. 1979-1999. Mahākavi Puṣpadanta’s Mahāpurāṇa. New Delhi: Bhāratīya Jñānapīṭha.
Velankar, H. D. 1961. The Vikramorvaśīya of Kālidāsa. New Delhi: Sahitya Akademi.
Verbeke, Saartje. 2006. Middel-Indische grammatica. Een kritisch vergelijkende studie van de Prākṛta-Sarvasva, de Prākṛta-Kalpataru en de Prākṛtānuśāsana. Gent: Universiteit Gent.

Changes in the Meaning and Construction of Polysemous Words: The Case of mieru and mirareru

Ayako SHIBA

1. Introduction

Mieru and mirareru, the spontaneous or possibility/potential verb forms of miru ‘to look, to watch,’ basically mean ‘to see’ or ‘to be able to see.’

(1) Koko-kara umi-ga mieru/mirareru.
    here-from sea-NOM see/can see
    “We (can) see the sea from here.”

In present-day Japanese texts, both mieru and mirareru appear in interesting expressions, as can be seen in examples (2) and (3).

(2) Madogarasu-ga ware-teiru. Hannin-wa koko-kara shinnyu-shita-to mirare-ru.
    window-NOM broken-RES criminal-TOP here-from raid-PAS-QUOT see-NPAS
    “The window is broken. It seems that the criminal entered from here.”
(3) Taro-wa sono shatsu-ga kini-itta-to mie-te, itsumo sore bakari ki-tei-ru.
    Taro-TOP the shirt-NOM like-PAS-QUOT see-CF always it only wear-HAB-NPAS
    “Taro seems to like the shirt. He always wears it.”

In sentence (2), -to mirareru means “the speaker (and other indefinite agents) deduces that the criminal entered from here based on the evidence of the broken window.” Similarly, in (3), -to mieru means that “the speaker deduces that Taro likes the shirt based on the evidence that he always wears it.” Thus, both forms appear to have an evidential use. Although a considerable amount of study has focused on the Japanese evidential markers -yoda and -rashii, little attention has been paid to the evidentiality of mieru and mirareru. Is the evidential use of these verbs a new expression or an already established one in present-day Japanese? Or is it specific to certain text genres, so that the verbs are not recognized as evidential markers? Shiba 2009 demonstrated that the evidential construction with mirareru occurs frequently in newspaper texts. In this study, extracting examples from corpus data from two different periods, we examine the construction types of each verb form and show the distribution of the construction types in order to reveal how the verbs have recently extended their evidential meaning.



2. Theoretical background

2.1. How should polysemy be described?

Okuda 1967 argues that the description of polysemous words in dictionaries must be given along with their syntactic construction patterns. A word has a sentence-free meaning, typically a basic sense, and for polysemous words, most of the other senses are not determined when the word is considered independently from constructions in which it is an element. If each sense of a polysemous word is treated without the construction pattern, its classification becomes quite arbitrary. The word is not absolutely, but rather relatively, independent of the sentence. Additionally, the construction is not entirely independent from the meaning of its elements.

The recent Construction Grammar (Goldberg 1995) decisively lacks this idea. It repeatedly emphasizes the independence of the construction and considers the polysemy of the construction to be a natural property. For example, the English ditransitive construction typically implies “that the agent argument acts to cause transfer of an object to a recipient,” but Goldberg argues, “that this case of actual successful transfer is the basic sense of the construction” (Goldberg 1995: 32); thus, the construction itself can have other senses in certain cases. She lists some cases in which the ditransitive construction does not strictly imply the transfer of the object: when the expressions involve verbs of creation (e.g., bake, make, build, or cook) or obtaining (e.g., get, grab, win, or earn) or imply that the agent undertakes an obligation (e.g., promise, guarantee, or owe). It is obvious, however, that the sentence meaning comes from the interaction between the construction and its elements (in this case, the verbs). With these verbs, the final meaning of the sentence does not imply a successful transfer because the verb groups are not typical ‘give (dative) verbs,’ and they affect the meaning of the construction. It is impossible to ascribe the final meaning of the sentence to either of them, as the relation between a construction and its elements is always dynamic.

Besides, the extended senses of polysemous words are results of the interaction between the construction and its elements (Okuda 1968-72). Okuda proposes the steps by which a word acquires a new sense through interaction with a construction. First, the word appears in a casual combination (construction) with an imaginal-metaphorical use. When the combination is used habitually, it becomes an idiomatic expression. Finally, when the word sense as it is used in the idiom is recognized and the word is then used in another combination, the sense is registered in the lexicon as a free meaning. It is important to note that the construction consists of elements, and these elements do not exist without the construction. Moreover, there is always a dynamic interaction between a construction and its elements (Okuda 1980-81).

Changes in the Meaning and Construction of Polysemous Words


For a dictionary description of a word, the valence approach can also be useful in that it mainly addresses a property of a lexical item. Valence is the potential or ability of a word to combine with another in syntactic constructions. A word exercises this potential when it becomes an element of a construction. That is, the valence is only a property of the word. In contrast, a description of a grammar requires the construction approach because it concerns the sentence, and the sentence meaning is the result of the dialectic relation between the properties (both syntactic and semantic) of a construction and those of its elements. We must not neglect the description of a construction’s properties. Following Okuda’s approach, we analyze the Japanese polysemous words mieru and mirareru by extracting the meaning-construction types from two corpora. In Section 2.2, I will explain what we call the ‘meaning-construction type.’

2.2. What is the ‘meaning-construction type?’

The meaning-construction type is a generalized and abstract pattern registered in the lexicon as a result of repeatedly uttered sentences. It has a meaning and a form (=construction). We regard a ‘construction’ not only as a case structure with no substance, but also as the architecture containing materials (Okuda 1980-81), which are a verb’s lexical (categorical) meaning, a noun’s lexical (categorical) meaning and semantic roles. Needless to say, a primary ground of the materials is the lexical meaning of the verb, but that of the noun, such as animate or inanimate, concrete or abstract, etc., also plays an important role for the construction. The construction is the order or method of organic relations between these elements.

The meaning-construction type has a relation with another type, and a certain constructional condition switches from one to another. For example, with the polysemous verb ageru ‘to lift, to give,’ the phrase hon-o tana-ni ageru ‘to put a book on the shelf’ is an instance of the change of location construction, which has the form [ConcreteNP-ACC(Theme) SpatialNP-LOC(Place) Change of Location Verb]. When the -ni-marked NP is an animate noun, it becomes the dative construction hon-o Hanako-ni ageru ‘to give Hanako a book,’ which has the form [ConcreteNP-ACC(Theme) AnimateNP-DAT(Recipient) DativeVerb]. Thus, each construction type changes to another in certain constructional conditions and forms a paradigmatic system, that is, a network. We will return to a discussion of this relationship between constructions in Section 5.
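The switch between the two construction types of ageru can be pictured as a small decision on the category of the -ni-marked noun. The Python sketch below is an illustration only, not a formalization proposed in this paper: the class `Construction` and the function `ageru_construction` are hypothetical names, and the sketch assumes that a non-animate -ni-marked noun is always spatial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Construction:
    """A meaning-construction type: a label paired with its formal pattern."""
    name: str
    pattern: str


CHANGE_OF_LOCATION = Construction(
    "change of location",
    "[ConcreteNP-ACC(Theme) SpatialNP-LOC(Place) Change of Location Verb]",
)
DATIVE = Construction(
    "dative",
    "[ConcreteNP-ACC(Theme) AnimateNP-DAT(Recipient) DativeVerb]",
)


def ageru_construction(ni_np_is_animate: bool) -> Construction:
    """Choose the construction of 'NP-o NP-ni ageru' from the animacy of the -ni NP."""
    return DATIVE if ni_np_is_animate else CHANGE_OF_LOCATION


# hon-o tana-ni ageru: the -ni NP 'tana' (shelf) is inanimate.
print(ageru_construction(False).name)
# hon-o Hanako-ni ageru: the -ni NP 'Hanako' is animate.
print(ageru_construction(True).name)
```

A fuller model of the network would need further conditions, such as the concrete or abstract status of the accusative noun, which this sketch leaves out.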



3. Corpus data

The analysis of this study is based on corpus data from two different periods, the Modern Japanese Corpus (MJC) and the Present-day Japanese Corpus (PJC). We selected the Taiyo Corpus as a representative of modern Japanese written text (1868-1944). It is a full-text corpus of Taiyo, a general interest magazine that was very popular among educated people in pre-war times. For this study, we only used the volumes of the year 1895 from the Meiji period (1868-1912), which comprise 729 articles (written by writers assumed to be born in the nineteenth century). They include over three million letters. The PJC contains approximately two million letters in the works of twenty-two writers born during the twentieth century. While both corpora consist mainly of critical essays on history, science, and culture, only the MJC (Taiyo Corpus) includes works of fiction.

Table 1. Corpus Data

Corpus                              Component                              Size
Modern Japanese Corpus (MJC)        729 articles from the 1895 volumes     3,340,000 letters
Present-day Japanese Corpus (PJC)   Works by 22 writers born during 20c.   2,090,000 letters

4. Morphological features and case structures of mieru and mirareru

Both mieru and mirareru come from the same basic verb mi-ru ‘to look’ or ‘to watch.’ Mi-ru is a transitive verb, as it requires a nominative agent (=subject) and an accusative theme (=object) for its valency.

(4) [AniNP-ga    NP-o        miru]
    agent-NOM   theme-ACC
    SUBJECT     OBJECT      PREDICATE
(5) Taro-ga Picasso-no e-o mi-ta.
    Taro-NOM Picasso-GEN art-ACC look-PAS
    “Taro looked at Picasso’s art.”

Mieru consists of mi from the stem of miru, -e from an old suffix -yu, and -ru as a conjugational form.

(6) mi   - e      - ru    (< mi-yu)
    stem - suffix - NPAS

The old suffix -yu originally meant “although the agent has no intention to do so, the event occurs spontaneously (spontaneous)” or “it is (im)possible for the agent that the event occurs because of his/her situation (possibility).”

Changes in the Meaning and Construction of Polysemous Words


Whether the sentence represents the spontaneous or the possibility meaning depends on its own construction and on the construction of the context in which it becomes an element. The basic case structure of mieru is shown in (7). Here, the theme is marked by the nominative case. The agent is normally omitted, but when it is not, it is marked by the dative (typically topicalized).

(7) [(AniNP-ni-wa)    NP-ga       mieru]
    (agent-DAT-TOP)  theme-NOM

Because mieru produces a valence change, that is, a change in the relation between semantic role and case marker, it is clear that -yu originally belonged to the voice category. Mirareru differs from mieru in that it includes the suffix -rare in place of -e.

(8) mi   - rare   - ru
    stem - suffix - NPAS

-Rare replaced -yu from the tenth to the eleventh century. In present-day Japanese, -rare is quite productive, whereas -yu remains only in a limited number of verbs. They are similar in meaning in that they both correspond to the spontaneous and possibility meanings. However, only -rare has the salient meaning of passive. The basic case structure of mirareru is a bit more complicated than that of mieru. When a mirareru construction represents the possibility meaning, the theme NP is occasionally marked by the accusative and the agent by the nominative, which is a transitive construction (see (10)). When it represents the passive, the theme NP marked by the nominative typically precedes the agent NP marked by the dative or -niyotte ‘by’ (see (11)).

(9)  [(AniNP-ni-wa) NP-ga mirareru]: Spontaneous
     (agent-DAT-TOP) theme-NOM
(10) [(AniNP(-ni)-wa) NP-ga/-o mirareru]: Possibility
     (agent-DAT-TOP) theme-NOM/ACC
(11) [NP-ga (AniNP-ni/-niyotte) mirareru]: Passive
     theme-NOM (agent-DAT/by)
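As a rough illustration of how the case frames in (9)-(11) constrain interpretation, the following Python sketch encodes the markers each frame allows and lists the readings compatible with a given surface marking. The table FRAMES and the function compatible_readings are hypothetical names introduced here; the sketch looks only at case markers and ignores word order and context, which the text treats as equally decisive.

```python
# Case markers allowed by frames (9)-(11); None stands for an omitted agent.
FRAMES = {
    "spontaneous": {"theme": {"ga"}, "agent": {None, "ni-wa"}},
    "possibility": {"theme": {"ga", "o"}, "agent": {None, "ni-wa", "wa"}},
    "passive": {"theme": {"ga"}, "agent": {None, "ni", "niyotte"}},
}


def compatible_readings(theme_marker, agent_marker=None):
    """Return the readings whose frame admits the given theme and agent markers."""
    return sorted(
        reading
        for reading, frame in FRAMES.items()
        if theme_marker in frame["theme"] and agent_marker in frame["agent"]
    )


print(compatible_readings("o"))              # accusative theme
print(compatible_readings("ga", "niyotte"))  # agent marked by -niyotte
print(compatible_readings("ga"))             # nominative theme, agent omitted
```

Under these assumptions, an accusative theme or a -niyotte agent singles out one reading, while a bare nominative theme with no overt agent leaves all three open.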


The main difference between the spontaneous/possibility and passive meanings is that the former carries the speaker’s empathy on the agent, whereas the latter defocuses the agent (Jacobsen 1991). That is, in the spontaneous/possibility construction, the speaker construes the event from the agent’s viewpoint. Hence, the typical spontaneous meaning is “although I have no intention to make it happen, the event occurs spontaneously,” and the typical (im)possibility, “it is impossible for me that the event occurs because of my situation (though I’m trying to make it happen).” The word order of each construction corresponds to this difference in meaning. In the spontaneous/possibility construction, an agent NP, when it appears, generally precedes a theme NP and tends to be topicalized, but in the passive construction, it always follows a theme NP and is rarely topicalized.

The passive construction can be divided into two principal types, the affective passive (when the theme/patient is an animate noun) and the detransitive voice passive (when the theme/patient is an inanimate noun) (Kuroda 1979, Shiba 2005). In the affective passive construction, an animate theme (=subject) has the speaker’s empathy, whereas in the detransitive voice passive, the speaker’s empathy is neutral.

This difference in speaker empathy concerns the evidential meaning of each verb. The evidential construction with mieru is a more subjective expression than that with mirareru: in the mieru construction, the speaker construes the event from the agent’s standpoint, and it is apparent that he or she has made the inference/deduction in question. In contrast, the evidential construction with mirareru defocuses agents through the use of the detransitive voice passive, and it is quite obscure who bears responsibility for the inference, even if the speaker is included as an agent.

5. Classification of meaning-construction types

We distinguish several senses of each verb with reference to the description of the meaning-construction types of miru ‘to look, to watch’ in Okuda (1967, 1968-72).
Before moving on to an explanation of each meaning-construction type, we note the difference between mieru and mirareru in terms of their constructional properties. A notable feature of the mieru construction is that the agent NP, which is marked by the dative and rarely appears, retains the speaker’s empathy, because the construction represents only the spontaneous/possibility meaning. Therefore, the theme, that is, the subject of the sentence, is regarded as an inanimate NP. Although an animate NP can be a theme, its personality or emotion is disregarded. On the other hand, as mentioned above, empathy in the mirareru construction is complicated. When an animate theme carries a speaker’s empathy, the sentence represents a use of the affected passive. When the theme is inanimate and an agent carries a speaker’s empathy, the sentence has the spontaneous or possibility meaning. When the theme is inanimate and empathy is neutral, the sentence represents the detransitive voice passive. The position of empathy depends on the word order and the construction of the context in which the sentence becomes an element. Next, I will detail the form (construction) and meaning of the meaning-construction types and describe their properties of actual use in our corpus data.

5.1. Classification of mieru

We extract the four principal meaning-construction types of mieru (simple perception, existence, appearance, and judgment) and arrange them from the simplest in meaning and construction to the most complex. The following are the simplified constructions of each type.

1. Simple Perception: [NP-ga mieru]
   theme-NOM
2. Existence: [NP-ni NP-ga mieru]
   place-LOC theme-NOM
3. Appearance: [NP-ga ADJ-ku/(so)-ni mieru]
   theme-NOM Appearance
4. Judgment: [[PHRASE/CLAUSE]-ni/-to mieru]
   Content of Judgment
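The four simplified constructions above can be read as a small decision procedure over annotated clause slots. The Python sketch below is one possible rendering, not the procedure used in this study: the class MieruClause, its slot names, and the order of the checks (most complex frame first) are assumptions made for illustration, and the slots are assumed to be filled in by hand.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MieruClause:
    """Hand-annotated slots of a clause headed by mieru (hypothetical schema)."""
    theme: Optional[str] = None           # NP-ga
    place: Optional[str] = None           # NP-ni referring to a place
    appearance: Optional[str] = None      # ADJ-ku or ADJ-so-ni phrase
    judged_content: Optional[str] = None  # PHRASE/CLAUSE marked by -ni or -to


def classify_mieru(clause: MieruClause) -> str:
    """Assign one of the four principal types, checking the most complex frame first."""
    if clause.judged_content is not None:
        return "judgment"
    if clause.appearance is not None:
        return "appearance"
    if clause.place is not None:
        return "existence"
    return "simple perception"


examples = {
    "Yama-ga mieru": MieruClause(theme="yama"),
    "Ki-no ue-ni kodomo-ga mieru": MieruClause(theme="kodomo", place="ki-no ue"),
    "Mizu-ga ao-ku mieru": MieruClause(theme="mizu", appearance="ao-ku"),
    "Sono shukan-wa hukkatsu-sita-to mieru": MieruClause(
        theme="sono shukan", judged_content="hukkatsu-sita-to"
    ),
}
for sentence, clause in examples.items():
    print(f"{sentence}: {classify_mieru(clause)}")
```

A rule of this kind captures only the clear cases; the borderline examples discussed in the following subsections, where one type approaches another, would still need manual judgment.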

To summarize the main point of each construction, the simple perception construction has only one argument that is a theme; the existence construction has a -ni-marked NP that refers to a place; the appearance construction has an adjective phrase; and the judgment construction includes a subordinate clause marked by -ni or -to.

5.1.1. Simple perception

The simple perception meaning-construction type of mieru means that “an agent perceives an object without intention” or “an agent can perceive an object.” The structural feature of this type is that it has only one argument or complement, the theme.

(12) Simple Perception: [NP-ga mieru]
     theme-NOM

There are two subtypes here, visual perception and mental cognition. Visual perception requires a theme NP to be a concrete noun, whereas in the case of mental cognition, mieru combines with an abstract noun.



a. Visual Perception [ConcreteNP-ga mieru]
(13) Yama-ga mie-ru.
     mountain-NOM see-NPAS
     “We see a mountain.”

b. Mental Cognition [AbstractNP-ga mieru]
(14) Arata-na mondai-ga mie-ta.
     new-CF problem-NOM see-PAS
     “We saw a new problem.”

Visual perception is the basic type of the mieru construction because it represents the basic meaning of mieru and has the simplest structure. The sentences that contain this type in the Taiyo Corpus (MJC) tend to appear in negative form; this is the case in seventy-three of the 106 occurrences (around 70%). However, it is not the case in the PJC, where only thirty-one occurrences out of seventy-eight (around 40%) appear in negative form. Of the instances in the MJC, the expressions representing ‘not to exist’ stand out.

(15) Akidsuki-no hanshi-wa dokoe-to mi-reba, tokuni nigeuse-ke-n kage-dani mie-zu.
     Akidsuki-GEN samurai-TOP where-QUOT look-when, quickly escape-PAS-SPEC shadow-even see-NEG
     “When I looked for the samurai, Akidsuki, he may have run away quickly. There wasn’t even a shadow.”
     (“Ko Arimura Renjuin-no Katei [The Family of the late Arimura Renjuin]” by Kaieda Nobuyoshi in the MJC)

This sort of expression is similar to the existence type of meaning-construction. Actually, some of them appear with the place NP. Some examples of the mental cognition type are accompanied by a -ni-marked NP that refers to an aspect of something/someone, as can be seen in kubomi-taru-ni in (16). These examples are similar to (approach) the existence type when we read the aspect (=-ni-marked NP) as the place where the theme exists ((16) can be read as “There is anxiety in her pale face and sunken eyes”).

(16) Omote-no iro yaya aozame-te, sasimo suzushige-naru me-no, sukoshi kubomi-taru-ni,
     face-GEN color little pale-CF such cool-be eyes-NOM little sink-RES-LOC
     kono higorono sinro-mo mie-te…
     this daily anxiety-TOP see-CF
     “I see her daily anxiety in that her face is rather pale, and her eyes have sunk slightly...”
     (“Shohkunon [Zhaojun yuan]” by Iwaya Sazanami in the MJC)

At the same time, (16) has a characteristic of the evidential type which is that the aspect (-ni-marked NP) can be the evidence from which the speaker derives the inference that “she is anxious.”



5.1.2. Existence

The existence meaning-construction type means that “an agent perceives the existence of an object in something.” The construction consists of NP-ni, which refers to the place, and NP-ga, the theme.

(17) Existence: [NP-ni NP-ga mieru]
     place-LOC theme-NOM

This construction is similar to a typical existence construction with the verb aru ‘to be.’ Example (18) can be expressed as wakai hito-ni humin-no keiko-ga aru “Young people have a tendency toward insomnia.”

(18) Wakai hito-ni huminno keiko-ga mie-ru.
     young people-LOC sleepless tendency-NOM see-NPAS
     “There appears a tendency toward insomnia in young people.”

The existence construction with mieru in the MJC has a peculiar structure. Many examples have the construction [PAPER-ni INFORMATION-ga mieru], in which the phrase/clause referring to ‘information’ is sometimes marked by -to (quotative) (see (20)).

(19) Kono-koto-wa Toshoshinkunhu-ni-mo mie-te...
     this-thing-TOP Toshoshinkunhu-LOC-also see-CF
     “This can be seen also in “Toshoshinkunhu”...”
     (“Okubo Sagamishu Tadachika” by Fukuchi Ouchi in the MJC)
(20) Hokke-hiyuhon-no go-ni sudeni sangai-no kataku-o de-te
     LotusSutra-similitude-GEN words-LOC already world-GEN trouble-ACC leave-CF
     roji-ni zasuru-to mie...
     enlightenment-LOC sit-QUO see
     “There appears in the words of “Lotus Sutra (Similitude)” that he has already left the world and sits in the state of enlightenment...”
     (“Cha-no yu (1) [Tea Ceremony]” by Jono Saigiku in the MJC)

When the theme is a concrete noun, the existence construction resembles (approaches) that of visual perception in that it retains the meaning of ‘visual perception.’

(21) Ki-no ue-ni kodomo-ga mie-ru.
     tree-GEN top-LOC child-NOM see-NPAS
     “We see a child on top of the tree.”

However, here the -ni-marked NP needs to be a concrete spatial noun, as (21) would be less natural if expressed as Ki-ni kodomo-ga mie-ru with ue-ni ‘on top’ omitted. The -ni-marked NP in the typical existence construction refers to a thing (or person) rather than a spatial place.

5.1.3. Appearance

The appearance construction means that “an agent perceives an object with some appearance.” It includes an adjective phrase.

(22) Appearance: [NP-ga ADJ-ku/(so)-ni mieru]
     theme-NOM Appearance
(23) Mizu-ga ao-ku mie-ru.
     water-NOM blue-CF see-NPAS
     “The water looks blue.”
(24) Sono tokei-wa taka-so-ni mie-ru.
     the watch-TOP expensive-like-that see-NPAS
     “The watch looks expensive.”

Some examples contain a continuative form of the state verb instead of an adjective phrase.

(25) Me-ga kagayai-te mie-ru.
     eyes-NOM shine-CF see-NPAS
     “The eyes look (are) shining.”

Example (23) contains an objective adjective (which modifies shapes or colors) and retains the meaning of ‘visual perception,’ whereas (24), which has a subjective adjective that represents an agent’s (=speaker’s) evaluation, approaches the judgment meaning-construction type. The objective adjective ao-ku in (23) is a continuative form, whereas the subjective adjective taka-so in (24) is a predicative form, and -ni is the same particle as that used in sensory judgment. Therefore, sono tokei in (24), which is the subject of the predicate mieru, can also be regarded as that of taka-so. Thus, the appearance usage with an objective adjective is similar to the visual perception usage, and the appearance usage with a subjective adjective is similar to the sensory judgment usage in meaning and construction.

5.1.4. Judgment

The judgment meaning-construction type means “an agent perceives an object with some judgment about it.” It includes a subordinate clause that refers to the content of the judgment.

(26) Judgment: [[PHRASE/CLAUSE]-ni/-to mieru]
     Content of Judgment

We identified three subcategories of this type: sensory judgment, cognitive judgment, and evidential inference. In the sensory judgment construction, the judgment clause is marked by (-yo/-ka)-ni, and in the cognitive judgment construction, by -to.

a. Sensory Judgment [[PHRASE/CLAUSE]-yo/-ka-ni mieru]
(27) Nihon-de-wa hinpu-no kakusa-ga kakudai-sita-you-ni mie-ru.
     Japan-LOC-TOP rich.and.poor-GEN gap-NOM widen-PAS-like-that see-NPAS
     “It seems that the gap between rich and poor has widened in Japan.”

b. Cognitive Judgment [[PHRASE/CLAUSE]-to mieru]
(28) Sengo sono shukan-wa hukkatsu-sita-to mie-ru.
     after.war the custom-TOP revive-PAS-QUOT see-NPAS
     “It seems that the custom revived after the war.”

Despite the difference in the subordinate clause marker, no major difference exists in meaning between the types (see (27) and (28)). Only when the sentence is expressed as [CLAUSE]-ka-ni mieru does it imply that “it seems so; however, the fact is different.”

(29) Sengo sono shukan-wa hukkatsu-sita-ka-ni mie-ta.
     after.war the custom-TOP revive-PAS-INTERR-that see-PAS
     “It seemed that the custom had revived after the war (but the fact is it had not).”

The evidential inference type is represented by a unique construction in which mieru appears as a continuative form (mie-te), and another clause that refers to evidence follows. The inference and evidence clauses can also occur in successive sentences. This construction means that “the speaker deduces that A is B from the evidence.”

c. Evidential Inference
(30) a. [[PHRASE/CLAUSE]-to mie-te, [CLAUSE]]
        Content of Inference       Evidence
     b. [[CLAUSE]. [PHRASE/CLAUSE]-to mieru]
        Evidence   Content of Inference
(31) Taro-wa sono shatsu-ga kini-itta to mie-te, itsumo sore bakari ki-teiru.
     Taro-TOP the shirt-NOM like-PAS QUOT see-CF always it only wear-HAB
     “Taro seems to like it. He always wears the shirt.”



The difference between cognitive judgment and evidential inference is that the latter has a clause or sentence that refers to the evidence from which the speaker derives the inference. This evidence is typically an individual and concrete event. In the case of cognitive judgment, the grounds for judgment can be general knowledge or the speaker’s personal feelings. Moreover, constructions using cognitive judgment can take a third person as an agent and can be expressed in the past (see (32)), whereas the evidential inference typically reflects a speaker’s present attitude.

(32) Kodomo-ni-wa Nihon-ga kat-ta-to mie-ta.
     child-DAT-TOP Japan-NOM win-PAS-QUOT see-PAS
     “It seemed to the children that Japan had won.”

Nevertheless, it is difficult to distinguish them in many cases because the agent of the cognitive judgment is typically the speaker.

5.1.5. Others

The minority types and those that are restricted in meaning and construction are classified as ‘others.’ There are four such types: invisible, meet, appear, and idiom. The invisible type always appears in negative form (mie-nai) and in nexus position, where it modifies a noun and commonly combines with me-ni ‘on eyes.’

a. Invisible [(me-ni) mie-nai NP]
(33) mie-nai sekai
     see-NEG world
     “an invisible world”

The ‘meet’ type is quite different from the others in its construction in that it maintains an agent NP in the subject position (nominative). This type rarely appears in present-day Japanese.

b. Meet [AniNP-ga AniNP-ni mieru]
   agent-NOM theme-DAT
(34) Mekake-wa katsute hito-tabi kimi-ni mie-shi-kotoari.
     mistress-TOP formerly one-time you-DAT see-PAS-EXP
     “The mistress has met you once.”
     (“Roka” by Chizuka Reisui in the MJC)

The ‘appear’ type has only one argument, an animate noun, and it cannot be determined whether this noun is an agent or a theme. We can infer that this animate noun originally referred to a theme, as does the visual perception type, but then because of its animacy, it came to be regarded as an agent.

c. Appear [(SpatialNP-ni) AniNP-ga mieru]
(35) Mina atsumari-si-ga kano musume-nomi-wa mie-zu.
     everyone gather-PAS-but that lady-only-TOP see-NEG
     “Everyone gathered, but only that lady didn’t appear.”

We consider the following to be a case of an idiom that behaves as an adverb phrase and means ‘obviously.’

d. Idiom
(36) me-ni mie-te
     eye-LOC see-CF
     “obviously”
(37) Gakusei-no kazu-wa me-ni mie-te gennsho-shi-teiru.
     student-GEN number-TOP eye-LOC see-CF decrease-NPAS-PFV
     “The number of students has obviously declined.”

5.2. Classification of mirareru

Now turning to the meaning-construction types of mirareru, we can see that the classification is almost the same. We must simply set the affected passive type as the fifth type. The main difference between the other four types and the affected passive is that the former include an inanimate theme in their construction, whereas the latter has an animate theme as the subject, with the speaker’s empathy.

5.2.1. Simple perception [NP-ga mirareru]

a. Visual Perception [CNP-ga mirareru]
(38) Intanetto-de-wa takusan-no doga-ga mi-rare-tei-ru.
     internet-INS-TOP many-GEN movie-NOM watch-PASS-HAB-NPAS
     “Many movies are watched on the internet.”

b. Mental Cognition [ANP-ga mirareru]
(39) Chiho-de-wa iegara-ga mi-rare-ru.
     provinces-LOC-TOP descent-NOM look-PASS-NPAS
     “Descent is considered in the provinces.”

Although we have not generalized the property of the inanimate theme, the abstract nouns that combine with mirareru in the mental cognition construction are different from those that combine with mieru. For example, mirareru in (39) means ‘to be considered,’ and it cannot be expressed with mieru as in *Chiho-de-wa iegara-ga mieru. On the other hand, mieru as it is used in (40) means ‘to know / to understand,’ and mirareru cannot be used in the sentence.

(40) Kenpokitee-ga kanzen-demo-nai-koto-wa yato-gawa-ni-mo mie-tei-ru-hazuda-ga...
     constitution-NOM perfect-be-NEG-thing-TOP opposition-side-DAT-also see-HAB-NPAS-should-but
     “The opposition parties should know that the constitution is not perfect but...”
     (“Kenpo-o yomu [To read the constitution]” by Nakagawa Go in the PJC)

5.2.2. Existence [NP-ni NP-ga mirareru]

(41) Ookuno kuni-ni kyotsuten-ga mi-rare-ru.
     many country-LOC similarity-NOM see-POS/PASS-NPAS
     “There are similarities in many countries.”

The existence construction that contains mirareru is well developed in present-day texts, and most of the subjects for this type are abstract nouns.

(42) Noo-no hatsuiku-ni-wa ikutsuka-no hushime-ga mi-rare-masu.
     brain-GEN growth-LOC-TOP several-GEN turning.point-NOM look-POS/PASS-NPAS
     “There are several turning points in the growth of the brain.”
     (“Zennogata Benkyo-ho-no Susume” by Shinagawa Yoshiya in the PJC)

5.2.3. Appearance [NP-ga ADJ-ku/(so)-ni mirareru]

(43) Tsuki kagayaki-te niwa-no omoshiro-ku mi-rareru-mama...
     moon shine-CF garden-NOM charming-CF see-SPO-CF
     “The moon shines, and the garden can be seen as charming…”
     (“Furosen” by Ohhashi Otowa in the MJC)

Mirareru seems to be inadequate for this construction type with the inanimate theme because it rarely appears in either the MJC or the PJC. Sentence (23) is less natural with mirareru.

(44) ?Mizu-ga ao-ku mi-rare-ru.
     water-NOM blue-CF see-POS-NPAS
     “We can see the water as blue.”

5.2.4. Judgment [[PHRASE/CLAUSE]-ni/-to mirareru]

a. Sensory Judgment [[PHRASE/CLAUSE]-you/-ka-ni mirareru]
(45) Sengo sono shukan-wa hukkatsu-sita-you-ni mi-rare-ru.
     after.war the custom-TOP revive-PAS-like-that see-SPO/POS/PASS-NPAS
     “It is seen that the custom revived after the war.”



b. Cognitive Judgment [[PHRASE/CLAUSE]-to mirareru]
(46) Kono tan’itsusei-ga haiseki-undo-o motarasi-ta-to mi-rare-ru.
     this unity-NOM rejection-movement-ACC bring-PAS-QUOT see-SPO/POS/PASS-NPAS
     “It seems that the unity has brought a rejection movement.”

c. Evidential Inference [[CLAUSE], [PHRASE/CLAUSE]-to mirareru]
(47) Nuime-ga aru koto kara, kore-wa zubon-no ichibu-to mi-rare-ru.
     seam-NOM have thing as this-TOP pants-GEN part-QUOT see-SPO/POS/PASS-NPAS
     “This cloth seems to be part of some pants, as it has a seam.”

In these four types (simple perception, existence, appearance, judgment) with an inanimate theme, whether the sentence represents a spontaneous, possibility, or detransitive voice passive construction is difficult to determine in many cases, as is the case with a perception verb like mirareru. When the verb form is mirareru as in (45)-(47), -rare can represent every meaning. Nevertheless, for example, when mirareru takes the -teiru form, which refers to habitual meaning, the sentence represents only the detransitive voice passive (see (38)). In this case, the speaker is excluded from the agent because the speaker’s empathy becomes neutral in habitual aspect. Hence, (48) below can have only the cognitive judgment meaning.¹ In order to represent the evidential meaning, the speaker must be included as the agent.

(48) Hanin-wa koko-kara shinnyu-shita-to mi-rare-tei-ru. (cf. (2))
     criminal-TOP here-from raid-PAS-QUOT see-PASS-HAB-NPAS
     “It is generally seen that the criminal entered from here.”

Moreover, in the cognitive judgment construction, the place or time can be definite, which is the main difference between it and the evidential type in terms of constructional properties.

(49) Nihon-de-wa, kono tan’itsusei-ga haiseki-undo-o motarasi-ta-to mi-rare-tei-ta.
     Japan-LOC-TOP this unity-NOM rejection-movement-ACC bring-PAS-QUOT see-PASS-HAB-PAS
     “It was generally seen in Japan that the unity had brought a rejection movement.”

5.2.5. Affected passive [AniNP-ga AniNP-ni (Adj-ku/NP-ni/-to) mirareru]

(50) Hanako-wa eki-mae-de Kazuo-ni mi-rare-ta.
     Hanako-TOP station-front-LOC Kazuo-DAT see-PASS-PAS
     “Hanako was seen by Kazuo in front of the station.”


¹ This is not the case with the judgment type with mieru, which cannot take the form -teiru. Only when the agent is an explicit third person can mieru take the form -teiru.


Ayako SHIBA

(51) Kare-wa itsumo minna-ni waka-ku mi-rare-ru.
     he-TOP always everyone-DAT young-CF see-PASS-NPAS
     “He is always seen as young by everyone.”

The affected passive is a construction based on a different classification viewpoint. It is a meaning mainly carried by -rare and its construction, whereas the other types are classified by a verb meaning and its construction. Hence, the affected passive type can be subcategorized by verb sense, such as visual perception, appearance, judgment, etc. However, because this study focuses on the evidential use of each verb and is concerned with the inanimate theme, we do not address in detail the construction with an animate theme subject.

In this section, we have explained the meaning-construction types of mieru and mirareru in regard to their constructional features and properties in actual use. The relationship between the types has also been shown, albeit incompletely. The visual perception sense is the basic, primary type, as it simply represents the verb’s basic meaning. It becomes a mental cognition construction if the theme NP is an abstract noun. The existence type includes a ‘visual perception’ meaning when the theme NP is a concrete noun. The appearance type also implies ‘visual perception’ when the adjective phrase is an objective one, whereas it comes to resemble (approaches) the judgment type when the adjective phrase is a subjective one that refers to an evaluation of the speaker.

6. Frequency of each meaning-construction type

Now, I will show the statistical results, that is, the percentage of each meaning-construction type. The results demonstrate how the usage of each type or verb has changed from the Meiji era (1868-1912) to the present day. Table 2 displays the frequency and percentage of mieru as it occurs in the MJC, and Table 3 shows the same for the PJC. We see from Table 2 that the texts from the Meiji era contain a high incidence of the existence type (see (52)). As mentioned above, they especially contain a high number of [PAPER-ni INFORMATION-ga mieru] constructions. This is not the case in present-day texts, where such constructions are scarce.

(52) Hurui shiryo-ni sono na-ga mie-ru.
     old papers-LOC the name-NOM see-NPAS
     “The name appears in old papers.”

Switching our attention to the judgment type, we find that the incidence of the cognitive judgment and the evidential inference types exceeds ten percent in Meiji texts, while they appear to be decreasing in present-day texts. The characteristic constructional feature of these types is that the subordinate clause is marked by -to (see (53)). In contrast, the incidence of the sensory judgment type, in which the subordinate clause is marked by -ni (see (54)), is increasing in present-day texts. An examination of the genres in which each type occurs suggests that there is no genre bias in their occurrence.

(53) Ei-futsu-no ikusa-wa kokusen-to-koso mie-ni-kere.
     England-France-GEN battle-TOP fierce-QUO-EMP see-PFV-PAS
     “It seemed that the battle between England and France was fierce.”
     (“Wootoruro Gassen-no ki [Report of The Battle of Waterloo]” by Togawa Zanka in the MJC)
(54) Igirisu-no Indo-shihai-wa nao tyokini watatte antai-dearu-yo-ni mie-ta.
     England-GEN India-rule-TOP still longtime over secure-be-like-that see-PAS
     “It seemed that British rule in India would still remain secure for a long time.”
     (“Nijusseeki-no Sekai [The Twentieth-Century World]” by Imazu Akira in the PJC)

Table 2. Mieru in the MJC
TYPE                     frequency   ratio
1) Simple Perception     134         31%
2) Existence             99          23%
3) Appearance            31          7%
4) Judgment
   Sensory               31          7%
   Cognitive             51          12%
   Evidential            53          12%
5) Others                29          7%
TOTAL                    428         100%

Table 3. Mieru in the PJC
TYPE                     frequency   ratio
1) Simple Perception     109         26%
2) Existence             21          5%
3) Appearance            42          10%
4) Judgment
   Sensory               125         30%
   Cognitive             15          4%
   Evidential            7           2%
5) Others                95          23%
TOTAL                    414         100%
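The ratios in Tables 2 and 3 can be recomputed from the raw frequencies, which also makes the direction of change for each type explicit. The following Python sketch is offered only as a convenience for the reader; the dictionary and function names are introduced here for illustration, and the frequencies are copied from the two tables.

```python
# Frequencies of mieru by type, copied from Table 2 (MJC) and Table 3 (PJC).
MIERU_MJC = {
    "Simple Perception": 134, "Existence": 99, "Appearance": 31,
    "Sensory": 31, "Cognitive": 51, "Evidential": 53, "Others": 29,
}
MIERU_PJC = {
    "Simple Perception": 109, "Existence": 21, "Appearance": 42,
    "Sensory": 125, "Cognitive": 15, "Evidential": 7, "Others": 95,
}


def ratios(freq):
    """Percentage share of each type, rounded to a whole percent as in the tables."""
    total = sum(freq.values())
    return {label: round(100 * count / total) for label, count in freq.items()}


mjc, pjc = ratios(MIERU_MJC), ratios(MIERU_PJC)
for label in MIERU_MJC:
    change = pjc[label] - mjc[label]
    print(f"{label:18s} MJC {mjc[label]:3d}%  PJC {pjc[label]:3d}%  change {change:+d} points")
```

Read this way, the existence, cognitive, and evidential shares fall while the sensory share rises, which is the pattern described in the text.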

Turning to the ‘others’ category in each corpus, we see that its content is quite different. In the MJC, the ‘appear’ and ‘meet’ types each have eleven instances, but there are no examples found in the PJC. The ‘meet’ type has fallen out of present-day use. The ‘appear’ type remains for an honorific use, which normally appears in the spoken language. The other seven examples in the MJC are all of the invisible type. In the PJC, the invisible type amounts to seventy-eight examples, and the idiom (‘obviously’) type to twelve. A comparison of the total number of occurrences shows no major difference in raw counts: 428 examples occur in around three million letters (MJC), and 414 in around two million letters (PJC).

Tables 4 and 5 show the percentage of each type with mirareru. What is interesting here is that the incidence of mirareru has increased greatly in present-day texts. This is due to the considerable use of the existence type, while the others remained the same. Moreover, considering the decrease in the existence use of mieru, it can be understood that the mirareru construction is replacing the mieru construction in this type category.

Table 4. Mirareru in the MJC
TYPE                     frequency   ratio
1) Simple Perception     10          24%
2) Existence             3           7%
3) Appearance            1           2%
4) Judgment
   Sensory               2           5%
   Cognitive             10          24%
   Evidential            0           0%
5) Affected Passive      11          27%
6) Others                4           10%
TOTAL                    41          100%

Table 5. Mirareru in the PJC
TYPE                     frequency   ratio
1) Simple Perception     16          6%
2) Existence             240         83%
3) Appearance            0           0%
4) Judgment
   Sensory               1           0%
   Cognitive             14          5%
   Evidential            0           0%
5) Affected Passive      17          6%
6) Others                2           1%
TOTAL                    290         100%
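Because the two corpora differ in size (Table 1), raw totals may be easier to compare once they are normalized. The sketch below converts the totals and the existence-type counts from Tables 2 to 5 into occurrences per million letters. The normalization itself is an addition made here for illustration and is not part of the original analysis; the corpus sizes are the figures given in Table 1, and all names in the code are hypothetical.

```python
# Corpus sizes in letters, from Table 1.
CORPUS_SIZE = {"MJC": 3_340_000, "PJC": 2_090_000}

# Counts copied from Tables 2-5: (verb, corpus) -> (total, existence type).
COUNTS = {
    ("mieru", "MJC"): (428, 99),
    ("mieru", "PJC"): (414, 21),
    ("mirareru", "MJC"): (41, 3),
    ("mirareru", "PJC"): (290, 240),
}


def per_million(count, corpus):
    """Occurrences per million letters in the given corpus."""
    return count / CORPUS_SIZE[corpus] * 1_000_000


for (verb, corpus), (total, existence) in COUNTS.items():
    print(
        f"{verb:9s} {corpus}: total {per_million(total, corpus):6.1f}, "
        f"existence {per_million(existence, corpus):6.1f} per million letters"
    )
```

On these figures the normalized rate of the existence type falls for mieru and rises sharply for mirareru, in line with the replacement described in the text, while the normalized totals give a somewhat different picture from the raw totals.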

On the other hand, no evidential use of mirareru in present-day texts is found, despite the decrease in mieru. Therefore, it is unlikely that mirareru has replaced mieru in the evidential construction. This leads to our presumption that the other evidential forms -yoda and -rashii have replaced the evidential use of mieru in present-day Japanese, as it feels more natural to use -yoda or -rashii in expressions that contain mieru. This, however, is the case with regard to critical essays. In analyzing newspaper texts, we find quite a high incidence of the evidential construction with mirareru. Table 6 below shows the percentage of the occurrence of mirareru with an inanimate theme in the Asahi Newspaper throughout May 2006 (This data is culled from Shiba 2009).

Table 6. ‘Mirareru with inanimate theme’ in the Asahi Newspaper, May 2006 (Shiba 2009)
TYPE                               frequency   ratio
Simple Perception                  39          4.6%
Existence                          114         13.5%
Appearance                         0           0.0%
Judgment (Sensory and Cognitive)   96          11.3%
Evidential                         598         70.6%
Idiom                              0           0.0%
TOTAL                              847         100.0%



The evidential type with mirareru has developed in the report genre. Regrettably, it is at present quite difficult to acquire a newspaper corpus of Modern Japanese. The evolution of the evidential type with mirareru in newspaper texts is therefore a crucial and interesting subject for further investigation. The reason that the evidential construction with mirareru has developed in newspapers is that it is quite an objective expression. In the case of the evidential with mieru, it is obvious that the speaker has deduced or inferred the judgment concerned because it carries a speaker’s empathy on the agent. In the case of the other evidential markers, -yoda and -rashii, it is the speaker who deduces the event as well, whereas in the case of mirareru, the agent is defocused, and it is quite obscure who makes the inference. There, the speaker is included in an indefinite number of agents, and the responsibility for the inference or judgment is indirect. This is why the evidential with mirareru is preferred in report texts in which the writer’s standpoint is backgrounded.

When comparing mieru and mirareru, another important point is that the usage of the [clause-to mieru] construction (cognitive or evidential) is decreasing, while that of the [clause-ni mieru] (sensory) construction is increasing, and at the same time, the [clause-to mirareru] construction is used often in present-day texts (especially newspapers). In other words, the constructional patterns [-ni mieru] and [-to mirareru] have become common fixtures in present-day texts.

(55) Sengo sono shukan-wa hukkatsu-sita-yo-ni mie-ru.
     after.war the custom-TOP revive-PAS-like-that see-NPAS
     “It seems that the custom revived after the war.”
(56) Sengo sono shukan-wa hukkatsu-sita-to mirare-ru.
     after.war the custom-TOP revive-PAS-QUOT see-NPAS
     “It seems that the custom revived after the war.”

Now, turning to the appearance type, we note that mieru is a more adequate element for this construction because mirareru rarely appears in this sense in either period. This is reasonable if we compare examples (57) and (58), in which the appearance type with mirareru sounds less natural than that with mieru.

(57) Mizu-ga ao-ku mieru/?mirareru.
     water-NOM blue-CF see
     “The water looks blue.”
(58) Sono tokei-wa taka-so-ni mieru/?mirareru.
     the watch-TOP expensive-like-that see
     “The watch looks expensive.”



However, this is only the case when the theme is an inanimate noun. Mirareru is appropriate in the appearance construction with an animate theme, as can be seen in (59). (59) Hanako-wa itsumo waka-ku mieru/mirareru. Hanako-TOP always young-CF see/be seen “Hanako always looks young / Hanako is always seen as young.”

7. Conclusion

Until now, we have given an account of the difference between mieru and mirareru in their meaning-construction types and demonstrated the distribution of each type in the Modern Japanese Corpus and Present-day Japanese Corpus. We have extracted the four main meaning-construction types in which both mieru and mirareru become elements: simple perception (visual perception and mental cognition), existence, appearance, and judgment. We have also briefly suggested the relation between each. The essential points that concern the evidential use are as follows. In the Modern Japanese Corpus, there are many [[clause]-to mieru] constructions, and the incidence of both the cognitive judgment and the evidential inference types exceeds ten percent, but it has decreased in present-day texts. In contrast, the [[clause]-ni mieru] construction (sensory judgment) has increased in the present day. On the other hand, mirareru has a tendency to be an element of the [[clause]-to VP] construction in both periods. In addition to this change, the evidential with mieru is becoming antiquated. However, the fact that the evidential with mirareru does not appear in critical essays of either period suggests that mirareru has not replaced mieru in its evidential use. We have conjectured that the other evidential markers -yoda and -rashii have taken the place of mieru. The above describes the case of the critical essay genre. The evidential type with mirareru has developed considerably in the report genre in the present day. This phenomenon is not irrelevant to its detransitive passive use, in which the agent is defocused. The function of defocusing the agent results in obscuring who has made the evidential inference; then, the agent’s (=speaker’s) responsibility for the inference is backgrounded. This is why the evidential use with mirareru is preferred in report texts, in which more objective expression is required.

There are some interesting facts about the other meaning-construction types. Mirareru appears infrequently as compared to mieru in the Modern Japanese Corpus. Today, however, its frequency has increased considerably due to a large increase in the existence type (especially in critical essays). In



contrast, usage of the existence type with mieru has become obsolete. Also, the appearance construction with an inanimate theme chooses mieru more than mirareru for its element in both periods. Henceforth, we hope to investigate the evidential use of both verbs by dividing the text genre more rigidly. A comparison with the other evidential markers -yoda and -rashii in modern and present-day Japanese is also a matter that is essential to reveal the paradigmatic system of Japanese evidential mood.

Abbreviations

ACC=Accusative; ADJ=Adjective; Ani=Animate; CF=Continuative form; DAT=Dative; EMP=Emphasis; EXP=Experience; GEN=Genitive; HAB=Habitual; INS=Instrumental; INTERR=Interrogative; LOC=Locative; NEG=Negative; NOM=Nominative; NPAS=Non-past; PAS=Past; PASS=Passive; PFV=Perfective; POS=Possibility; QUOT=Quotative; RES=Resultative; SPEC=Speculative; SPO=Spontaneous; TOP=Topic

References

Aikhenvald, A. Y. 2004. Evidentiality. Oxford: Oxford University Press.
Goldberg, A. E. 1995. Constructions: A Construction Grammar Approach to Argument Structure. Chicago: Chicago Univ. Press.
Haspelmath, M. 2003. “The Geometry of Grammatical Meaning: Semantic Maps and Cross-Linguistic Comparison”. The New Psychology of Language; Cognitive and Functional Approaches to Language Structure 2, Tomasello, M. (ed). Lawrence Erlbaum Associates, Inc. 211-242.
Iida, T. 1997. “‘Mieru’ ‘Mirareru’ Saiko [Reconsideration of Mieru and Mirareru]”. Bulletin of International Center, The University of Tokyo 7. 43-65.
Jacobsen, W. M. 1991. The Transitive Structure of Events in Japanese (Studies in Japanese Linguistics 1). Tokyo: Kurosio.
Kawamura, F. 2008. “‘Miyu’ ‘Kikoyu’ ‘Omohoyu/oboyu’-no Kakutaisei: Dooshi rarerukei-to-no taisho-no kanten-kara [The Case Structure of Miyu, Kikoyu and Omohoyu/Oboyu: comparing to the verbal rareru form]”. Area and Culture Studies, Tokyo University of Foreign Studies 77. 351-370.
Kuroda, S.-Y. 1979. “On Japanese Passives”. Explorations in Linguistics: Papers in Honor of Kazuko Inoue, Bedell, G., E. Kobayashi, and M. Muraki (eds). Tokyo: Kenkyusha. 305-347.
Li, J. L. 1994. “‘Mieru’ ‘Mirareru’ ‘Mirukoto-ga dekiru’ nitsuite [On Mieru, Mirareru, Mirukoto-ga dekiru]”. Japanese-language education around the globe 4. 185-191, 231, 238.
Okuda, Y. 1967. “Goiteki-na Imi-no Arikata [The Way How Lexical Meanings Are]”. Kyoiku-Kokugo 8. (Reprint in Okuda Y. Kotoba-no Kenkyu Josetsu. 1985. 3-20.)
Okuda, Y. 1968-72. “Okaku-no Meishi-to Doshi-no Kumiawase [The Combination of an Accusative Noun and a Verb]”. Kyoiku-Kokugo 12, 13, 15, 20, 21, 23, 25, 26, 28. (Reprint in Gengogaku Kenkyukai (ed). Nihongo Bunpo Rengoron: Shiryohen. 1983. 21-149.)
Okuda, Y. 1980-81. “Gengo-no Taikeisei [The Systematicness of Language]”. Pedagogy of Japanese 63-66. (Reprint in Okuda Y. Kotoba-no Kenkyu Josetsu. 1985. 189-226.)
Palmer, F. R. 1986. Mood and Modality. Cambridge: Cambridge Univ. Press.
Shiba, A. 2009. “Ninshiki-dooshi-no Hijo-shugo Ukemibun [The Inanimate-Subject Passives of Cognitive Verbs]”. Japanese Studies: Research Education Annual Report 13. 1-24.
Vanhove, M. 2008. “Semantic associations between sensory modalities, prehension and mental perceptions; A crosslinguistic perspective”. From Polysemy to Semantic Change, Vanhove, M. (ed). Amsterdam: John Benjamins. 341-370.
Yamamoto, M. 1988. “Kokugoka Kyookasho niokeru ‘Mieru’/‘Mirareru’ nitsuite: Kanoo-hyoogen-o megutte [On Mieru/Mirareru in Japanese readers: With Special Reference to Potential Expressions]”. Memoirs of the Faculty of Liberal Arts & Education. Part I, Humanities & Social Sciences 39. 67-72.

Language Change from the Viewpoint of Distribution Patterns of Standard Japanese Forms¹
Kanetaka YARIMIZU

1. Objective of the study

This study aims to analyze the standardization process of Japanese by using data collected from dialect research. Two different data sets will be used. The first is a national large-scale database of dialect research called the Grammar Atlas of Japanese Dialects (GAJ), edited by the National Institute for Japanese Language. The second is the “Glottogram survey” in the Tohoku and Hokkaido regions, hereafter referred to as the TH survey, which is limited in area but focused on generational differences. The present study, quantitative in nature, will employ multivariate analysis to examine the data of the GAJ and TH survey. Previous studies on linguistic standardization based on the data of dialect atlases using multivariate analysis include Inoue and Kasai (1982) and Inoue (1990) on Japanese and Yarimizu, Kawaguchi, and Ichikawa (2005) and Kawaguchi (2007) on French.

2. Research data

2.1. Grammar Atlas of Japanese Dialects (GAJ)

The GAJ is a large-scale dialect research project conducted from 1979 to 1982 at 807 research points all over Japan. At each research point, one informant born around the year 1910 was asked to answer 267 questions, mostly regarding Japanese grammar. The results were published in six volumes that contain 350 maps. Our study will use 144 items from Vols. 1 to 3 of the GAJ. Vol. 1, entitled “Particle Items,” contains 60 maps, but only 54 of them will be used, and Vols. 2 and 3, “Conjugation Items,” cover 90 maps. Sawaki (2002) has classified the research points of Vol. 1 using a cluster analysis.

2.2. Glottogram survey in the Tohoku and Hokkaido region (the TH survey)

The Glottogram survey is a research method that enables the observation of language spread by limiting the research points linearly and examining the language use of different generations living at those points.


¹ I express my gratitude to Norie Yazu of Kanda University of International Studies and Yuji Kawaguchi of Tokyo University of Foreign Studies for their helpful comments.



Figure 1. Research points of the TH survey

As Figure 1 shows, the TH survey was conducted at 69 research points in southern Hokkaido and the Tohoku Pacific region in the years 2001 and 2002. A total of 229 informants answered 193 items regarding grammar and vocabulary, of which 30 items are common to both the TH survey and the GAJ Vols. 1 to 3. At each research point, four informants from different generations participated. They were people born around 1930, 1950, 1970, and 1990, who were in their 70s, 50s, 30s, and teens, respectively, at the time of the survey (Inoue, Tamai, and Yarimizu 2003).

3. The standardization of Japanese

It is well known that the standardization of Japanese has progressed from Tokyo, the capital and center of Japan. Historically speaking, however,



this is a recent trend. As the capital of Japan had been Kyoto for a long time, it is presumed that the linguistic center had also been Kyoto. The following quotation, written by a foreign missionary at the beginning of the seventeenth century, supports this presumption. Rodriguez (1604-08) writes:

The real Japanese in spoken language is what the court noble (Cungues) and the aristocrats speak in Kyoto (Miyaco). In the speech of these people are preserved the pure and precise way of saying things, whereas all the speeches deviating from this can be regarded as vulgar and deficient. (Arte da Lingoa de Iapam, Volume 2, 1604-08)

Figure 2 shows the locations of Tokyo and Kyoto. Tokyo, or Edo, the name of the city before the Meiji Restoration in 1868, has been the center of Japan since the seventeenth century. As the standardization process evolves in speed and quality according to the social changes of different periods, the standardization process of Japanese should be categorized into different stages. In this study,

Figure 2. Locations of Tokyo (Edo) and Kyoto



the standardization process of Japanese is categorized into the following five stages:

First Stage: The period until the first half of the Edo Era (until the mid-eighteenth century)
Second Stage: From the latter half of the Edo Era to the early years of the Meiji Era (from the mid-eighteenth century to the end of the nineteenth century)
Third Stage: From the mid-Meiji Era to the mid-Showa Era (from the end of the nineteenth century to the mid-twentieth century)
Fourth Stage: The latter half of the Showa Era to the present (from the mid-twentieth century to the present)
Fifth Stage: The present

These five stages will be explained in Section 4.

4. The five stages of the standardization

4.1. First stage (until the first half of the Edo Era)

It is presumed that the standardization of Japanese until the first half of the Edo Era was largely influenced by the Kyoto dialect (see the quotation from Arte da Lingoa de Iapam by Rodriguez in Section 3). Kyoto was the capital of Japan for over 1,000 years from 794 to 1868 and the political center of Japan until the government was first established in Edo in 1603. It is generally presumed that Kyoto was also the linguistic center of Japan. This point cannot be proven directly from the modern dialect data. But old Kyoto dialect forms can be observed in the old written materials. These forms can also be found in the modern dialect data. The Linguistic Atlas of Japan shows many examples of the peripheral distribution of the dialect forms, which suggest that the forms used in Kyoto eventually spread to its surrounding areas. Therefore, it is supposed that the language used in Kyoto had authority. Given that the samurai class all over Japan was ordered to move to Edo, the new capital, with the transfer of the capital from Kyoto to Edo in 1603, the seventeenth century can be regarded as a period in which the foundation for a new standard form was being built.

4.2. Second stage (from the latter half of the Edo Era to the early years of the Meiji Era)

The standardization process spreading from Tokyo started to build its foundation by the early years of the Meiji Era, or the end of the nineteenth century. The Edo dialect served as a lingua franca for the samurai class across



Japan. Therefore, the Edo dialect, originally based on the Kanto dialect, was influenced by the dialect of Kansai (Kyoto and Osaka), the former political center and still then the economic center of Japan, and evolved into the so-called “standard language of Edo,” in which characteristics of the Kanto dialect decreased gradually (Komatsu 1985). By the eighteenth century, Edo, now Tokyo, with a population of over one million, had become the political center of Japan, and the Edo dialect gained authority. This can be seen in the dialect dictionary of those years, in which the written language shifted from the Kyoto dialect to the Edo dialect from the middle of the eighteenth century (Sanada 1991). It is also presumed that a spoken language close to the modern standard language was established at the beginning of the nineteenth century (Tanaka 1983, Tsuchiya 2009).

In the following, the influence of the dialects of Tokyo and Kyoto (and Osaka) on their surrounding areas will be examined using data from the GAJ. The data by informants from Tokyo in the GAJ are referred to as the “Tokyo dialect” and those by informants from Kyoto (and Osaka) as the “Kyoto-Osaka dialect.” How the answers correspond in each research point will be discussed. Figures 3 and 4 show the results from west to east by prefecture. The method is adopted from Inoue (2004). The x-axis shows the railway distance between the railway station of the capital city in each prefecture and Kyoto Station (in Figure 3) and Tokyo Station (in Figure 4). The y-axis represents the use rate of the Kyoto-Osaka and Tokyo dialects (Figures 3 and 4, respectively) at each research point.
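The use rate on the y-axis can be read as the share of survey items at a research point whose answer coincides with the answer of the reference informant (Tokyo or Kyoto-Osaka). The following Python sketch illustrates the calculation under that reading. It is not the procedure of Inoue (2004): the function name use_rate, the item labels, the response forms, and the railway distances are all hypothetical placeholders rather than GAJ data.

```python
def use_rate(responses, reference):
    """Share of shared items whose form matches the reference dialect's form."""
    shared = [item for item in reference if item in responses]
    if not shared:
        return 0.0
    matches = sum(1 for item in shared if responses[item] == reference[item])
    return matches / len(shared)


# Hypothetical reference answers (stand-in for the Tokyo informant).
reference = {"item_1": "a", "item_2": "a", "item_3": "a", "item_4": "a"}

# Hypothetical research points: (railway distance in km, answer per item).
points = {
    "point_A": (50, {"item_1": "a", "item_2": "a", "item_3": "a", "item_4": "a"}),
    "point_B": (300, {"item_1": "a", "item_2": "a", "item_3": "b", "item_4": "b"}),
    "point_C": (600, {"item_1": "b", "item_2": "b", "item_3": "b", "item_4": "a"}),
}

# One (distance, use rate) pair per point, ordered by distance as on the x-axis.
profile = sorted((dist, use_rate(resp, reference)) for dist, resp in points.values())
for dist, rate in profile:
    print(f"{dist:>4} km  use rate = {rate:.2f}")
```

Plotting such pairs for every research point, once against the distance from Kyoto Station and once against the distance from Tokyo Station, would give the kind of distance profile that Figures 3 and 4 display.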

Figure 3. Use rate of the Kyoto-Osaka dialect (GAJ, Vols. 1-3)



Figure 4. Use rate of the Tokyo dialect (GAJ, Vols. 1-3)

In both figures, the use rates of the two dialects drop as the distance becomes greater. Upon closer examination, however, we can see that the areas with a higher use rate of the Tokyo dialect are limited to the surrounding areas of Tokyo, whereas those with a higher use rate of the Kyoto-Osaka dialect spread much more widely. In other words, the influence of the Tokyo dialect on the surrounding cities is limited. In contrast, the influence of the Kyoto-Osaka dialect is quite expansive. The GAJ reflects the use of dialects in the first half of the twentieth century. Even in this period, the expansive influence of the Kyoto-Osaka dialect and the limited influence of the Tokyo dialect can be observed. We can speculate, therefore, that the Kyoto-Osaka dialect has recently influenced the standard language.

4.3. Third stage (from the mid-Meiji Era to the mid-Showa Era)

The period from the mid-Meiji Era to the mid-Showa Era, or the mid-twentieth century, can be regarded as the early stage of standardization. Upon becoming a “nation-state” in 1868, after the end of the Edo Era, the Japanese government pursued a new language policy concerning Japanese as the “national language.” It is generally believed that the language spoken by the educated people living in Tokyo was suggested to be the norm for Japanese (Sanada 1991). In fact, however, the language of Tokyo was diverse at the beginning of the Meiji Era. Tanaka (1983) and Matsumura (1998) argue that the spoken language of Tokyo was established through the influence of written language produced at the beginning of the Meiji Era



by the literary circles, which promoted the unification of the written and spoken styles. Moreover, Yasuda (1999) points out that the establishment of a national language suitable for a great empire was advocated following the rise of Japanese militarism at the end of the nineteenth century, which led the government to eradicate dialects in the education system. Although the imperative influence of education, the radio, and newspapers meant that people could write and understand the standard language, it is still believed that dialects remained in use in everyday spoken language. Figure 5 shows the use of standard forms in the GAJ by prefecture. The region with the highest use rate of standard forms is the Kanto region (Tokyo and the vast area around it), which implies that the Kanto dialect was adopted as the standard. On the other hand, we can see that the use rate of standard forms is also high in the Kansai region (Kyoto, Osaka, and the vast area around them). Thus, the Kansai dialect also seems to have a close relation with the standard language. To examine this further, we will use Hayashi’s Quantification Method Type III to classify the patterns of the use of standard forms shown in the GAJ. Fifty-four maps of particle items and 90 maps of conjugation items will be analyzed separately, and the values of Axes 1-3 by prefecture will be shown.
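Hayashi’s Quantification Method Type III is generally regarded as mathematically equivalent to correspondence analysis. One classical way to obtain its first axis is reciprocal averaging: row scores are repeatedly set to the mean score of the columns they select, and column scores to the mean score of the rows that select them. The Python sketch below applies this to a small 0/1 table in which rows stand for areas and columns for standard forms. It is an illustration only, not the author’s implementation: the function name first_axis, the labels, and the table are hypothetical, and only the first axis is computed.

```python
from math import sqrt


def first_axis(table, iterations=100):
    """First non-trivial axis of a 0/1 table via reciprocal averaging.

    Assumes every row and every column contains at least one 1.
    """
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(n_rows)) for j in range(n_cols)]
    grand = sum(row_tot)

    def row_scores_from(col_scores):
        # Each row score is the mean score of the columns that the row uses.
        return [sum(table[i][j] * col_scores[j] for j in range(n_cols)) / row_tot[i]
                for i in range(n_rows)]

    col_scores = [float(j) for j in range(n_cols)]  # arbitrary non-constant start
    for _ in range(iterations):
        rows = row_scores_from(col_scores)
        col_scores = [sum(table[i][j] * rows[i] for i in range(n_rows)) / col_tot[j]
                      for j in range(n_cols)]
        # Remove the trivial constant solution, then rescale to unit weighted variance.
        mean = sum(w * s for w, s in zip(col_tot, col_scores)) / grand
        col_scores = [s - mean for s in col_scores]
        scale = sqrt(sum(w * s * s for w, s in zip(col_tot, col_scores)) / grand)
        col_scores = [s / scale for s in col_scores]
    return row_scores_from(col_scores), col_scores


# Hypothetical table: 1 = the standard form of that item is used in the area.
areas = ["east_1", "east_2", "west_1", "west_2"]
table = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]

row_scores, col_scores = first_axis(table)
for area, score in zip(areas, row_scores):
    print(area, round(score, 3))
```

In this toy table the two hypothetical eastern areas receive scores of one sign and the two western areas scores of the other, which mirrors how an axis value per prefecture can separate regional patterns in the use of standard forms.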

Figure 5. Use rate of standard forms (GAJ vols. 1-3)



Figure 6. Values of Axis 1 of Hayashi’s Quantification Method Type III (GAJ particle items) by prefecture

Figure 7. Values of Axis 2 of Hayashi’s Quantification Method Type III (GAJ particle items) by prefecture



Figure 8. Values of Axis 3 of Hayashi’s Quantification Method Type III (GAJ particle items) by prefecture

Figures 6-8 show the pattern classification of the use of standard forms of particle items shown in the GAJ.

Axis 1: Nationwide (excluding Okinawa)
Axis 2: Eastern Japan (excluding Tohoku)
Axis 3: Western Japan (excluding Kyushu)

This implies that many common forms of Japanese particles are used in the vast central area of Japan, from the Kanto to Kansai regions, and that this area could be separated from the three areas where the use of non-standard forms is high (Okinawa, Tohoku, and Kyushu). Figures 9-11 show the different patterns of the use of standard conjugation forms shown in the GAJ.

Axis 1: Kanto area
Axis 2: The vast Kansai area, including the Kanto area
Axis 3: Kyushu area + Kanto area (excluding Kansai)

As the Ryukyu (Okinawa) dialect has hardly any standard forms in most of the items, it should be excluded from our analysis.



Figure 9. Values of Axis 1 of Hayashi’s Quantification Method Type III (GAJ conjugation items) by prefecture

Figure 10. Values of Axis 2 of Hayashi’s Quantification Method Type III (GAJ conjugation items) by prefecture



Figure 11. Values of Axis 3 of Hayashi’s Quantification Method Type III (GAJ conjugation items) by prefecture

The value of Axis 1 shows the standard forms distributed only in Kanto, the center of standardization. The value of Axis 2 demonstrates the standard forms distributed mainly in Kansai and its surrounding areas, but also in some areas of Kanto. Moreover, the peculiar distribution of the standard forms of Axis 3, excluding the Kansai area, shows that a considerable period of time has passed since Kansai ceased being the center. Some conjugation forms have changed in the Kansai region during this period of time: for instance, mire-ba ‘if you see’ and maka-seta ‘(I) will leave (it)’ have become mi-tara and makashita, respectively. In conclusion, the pattern classification of the standard forms of particle items in the GAJ more faithfully represents the early stage of the standardization process, in which common grammatical characteristics can be observed in both the Kansai and Kanto areas. In addition, the standard forms of conjugation items in the GAJ also show the same tendency, especially in Axis 2.

4.4. Fourth stage (the latter half of the Showa Era to the present)

The latter half of the twentieth century is marked by the completion of the standardization process. Rapid economic growth since the 1960s has greatly influenced the standardization of Japanese. Additionally, the development of transportation and television has facilitated communication throughout Japan,



eventually contributing to the standardization. Mase (1981) emphasizes the temporal overlap of the intensive standardization period and the diffusion of television. The period of the increasing diffusion of television in the 1960s and 1970s corresponds precisely to the strongest progress of standard forms in the survey of standardization by the National Institute for Japanese Language (2007). To understand this rapid standardization, it is important to examine data that show the difference by generations and regions, such as that of the TH survey.² Figure 12 shows the use rate of the standard forms of 168 items from the TH survey by region. We can see that the younger generation uses the standard forms more than the older generation. Hokkaido shows a different tendency because it is an island settled by immigrants from the main island. Therefore, standardization in Hokkaido began long ago: Figure 12 suggests that the use of the standard is high even for the generation born in the 1930s. In the following analysis, Hokkaido is therefore excluded because of its high use rate of standard forms.
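A comparison by region and generation of the kind shown in Figure 12 amounts to grouping the answers by region and birth cohort and taking, within each group, the share of answers that use the standard form. The Python sketch below shows that tabulation on invented records. The region labels, items, and values are hypothetical and do not reproduce the TH survey, and the function name use_rate_by_group is introduced here only for illustration.

```python
from collections import defaultdict

# Hypothetical records: (region, birth cohort, item, used_standard_form).
answers = [
    ("region_X", 1930, "item_1", False),
    ("region_X", 1930, "item_2", True),
    ("region_X", 1990, "item_1", True),
    ("region_X", 1990, "item_2", True),
    ("region_Y", 1930, "item_1", False),
    ("region_Y", 1930, "item_2", False),
    ("region_Y", 1990, "item_1", True),
    ("region_Y", 1990, "item_2", False),
]


def use_rate_by_group(records):
    """Return {(region, cohort): share of answers that use the standard form}."""
    counts = defaultdict(lambda: [0, 0])  # [standard answers, all answers]
    for region, cohort, _item, is_standard in records:
        counts[(region, cohort)][0] += int(is_standard)
        counts[(region, cohort)][1] += 1
    return {group: std / total for group, (std, total) in counts.items()}


rates = use_rate_by_group(answers)
for (region, cohort), rate in sorted(rates.items()):
    print(region, cohort, f"{rate:.2f}")
```

Reading the resulting rates across cohorts within one region gives the generational trend, and reading them across regions within one cohort gives the regional difference.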

Figure 12. Use rate of the standard forms from the TH Survey by region.

² It must be noted, however, that the Tohoku region is adjacent to the Kanto region, which is the center of the standard, and therefore, it is strongly influenced by the standard. In the case of western Japan, the structure of the forms is more complicated because the Kansai dialect is still the linguistic center of western Japan.



Figure 13. Use of the standard forms of 11 Items from the TH Survey by domain

Figure 13 shows the use of standard forms of 11 items from the TH survey by domain. We examined the use rate of standard forms when speakers are speaking at home (i.e., the private domain) and when they are being interviewed on TV (i.e., the public domain). The overall tendency is that the use rate of standard forms is lower in the private domain than in the public domain. Further, although the younger generation uses the standard forms more than the older generation, they tend to choose their code according to the domain. To analyze the 30 items common to both the TH survey and the GAJ, we used Hayashi’s Quantification Method Type III. We examined 30 standard forms and 41 major dialect forms. Also, we selected the GAJ research points that correspond roughly to the research points in the TH survey. The results are as follows:

Axis 1: difference by generation
Axis 2: difference by region
Axis 3: the use of new dialect forms

Figure 14, which analyzes Axes 1 (y-axis) and 2 (x-axis), shows that the regional difference decreases as the generation becomes younger, which means that the younger generation does not show as much regional difference as the



Figure 14. Values of Axes 1 and 2 of Hayashi’s Quantification Method Type III (GAJ and TH Surveys)

older generation. Note the tendency of the values of younger generations (i.e., 70 and 90 on the graph) to approach the y-axis. Figure 15, which analyzes Axes 1 (x-axis) and 3 (y-axis), shows that middle-aged people (see 50 and 70 on the graph) have peculiar trends compared with the other generations. The values of Axis 3 correspond to the use of new dialect forms such as koko-sa aru ‘it’s here’, kuru-daba ‘if you come’, and se ‘do it’, which appeared in the Tohoku region recently and are different from the traditional dialects. Inoue (1994) named the dialect forms “new dialect”³ forms in the following cases:


³ “Neo dialect” in Sanada (1996) is a similar concept. It is defined as a speech style that deviates from the standard and is also different from the traditional dialects. Inoue’s (1994) “new dialect” is a term applied to conjugation forms. However, Sanada’s “neo dialect” should be identified not according to conjugation forms, but according to the context.



Figure 15. Values of Axes 1 and 3 of Hayashi’s Quantification Method Type III (GAJ and TH Surveys)

(1) if they are used frequently among the younger generations
(2) if they are considered as non-standard forms
(3) if the users are aware that they are “dialects”

The forms -daba, -sa, and se can be considered as new dialect forms. In each region, these new dialects have appeared, but the younger generation born in the 1990s has a tendency to avoid them. It must be noted, however, that the younger generation uses the standard forms, whereas the older generation uses the traditional dialect forms. From this analysis, it can be argued that the standardization of Japanese has progressed rapidly in the Fourth Stage, with the variety of dialect forms of each region becoming increasingly limited.

4.5. Fifth stage (the present)

The present stage can be regarded as an era of so-called “linguistic Tokyonization,” which represents “the phenomenon of the nationwide spread of the non-standard



forms used in Tokyo.” Tanaka (1983) argues that although the Tokyo dialect is basically the Eastern Japan dialect, upon closer inspection, we see that it also has many elements of the Kansai dialect, which is consistent with my analysis of the standard in Section 4.3. Through television and comics, Japanese people across the nation have been strongly exposed to the spoken Japanese of Tokyo. It has even entered the private domains of Japanese people’s lives. Our multivariate analysis of the TH survey and the GAJ in Section 4.4 showed the gradual disappearance of the Tohoku dialects (see Figure 12). It further showed that standardization has progressed to the private domains among the younger generation born in the 1990s, and that the regions close to Tokyo show a higher rate of standardization (see Figure 13). Although the spoken Japanese of Tokyo is certainly close to the standard, there are some expressions that differ from the standard; these may be called forms of the “Tokyo dialect.” Figure 16 represents the use rate of 21 new expressions used in Tokyo⁴ examined in the TH survey. The figure shows that

Figure 16. Use rate of the new dialects of Tokyo (TH survey)

⁴ Some of the examples include uzattai ‘unpleasant’, kenken ‘hopping on one leg’, shiasatte ‘three days from now’, chigakatta ‘it’s different’, and yosageda ‘it seems good’.



the closer a region is to Tokyo, the more the forms of the Tokyo dialect are used among the younger generations born in the 1970s and 1990s. As these forms are not standard forms, but dialect forms, this can be called linguistic Tokyonization rather than the standardization of Japanese. However, as Figure 13 demonstrates, dialects and the standard have different domains of usage. For example, the dialect is used in private domains, whereas the standard is found in public domains. In some television programs such as soap operas and entertainment shows, the spoken language of Tokyo tends to be used even in a public domain such as “television.” As the Tokyo dialect is increasingly used as the “standard in private domains,” differences in the way people speak in private versus public domains are expected to decrease.

5. Conclusion

In this study, the standardization of Japanese was analyzed in five historical stages using data from two recent dialect surveys, the GAJ and the TH survey. A model of standardization based on multivariate analysis, shown in Figure 17, concludes our study. This model depicts the process of decline of the traditional dialect forms used nationwide, mainly in the private domains, until the beginning of the Meiji Era, through the influence of the standardization process centered in Tokyo. The “area,” shown horizontally, describes the diverse variation of the dialects, and the “generation,” shown

Figure 17. Model of the standardization of Japanese



vertically, represents the time axis. This model explains the five stages of standardization discussed in this study. The linguistic center in the pre-modern First Stage was Kyoto, and in the early modern Second Stage, Edo (Tokyo). During both stages, the influence of the language of the center was not strong, and traditional dialect forms, constantly evolving, were used in each region. In the modern Third Stage, standardization progressed through education but did not affect the private domains, in which the traditional forms were used. The model in Figure 17 suggests that the traditional dialects, depicted on top, were maintained until the Third Stage. As standardization progressed, however, standard forms started to infiltrate the private domains. This is the Fourth Stage, in which not only the number of users of the dialect forms but also the number of form variations decreased. At the same time, a reverse trend, what may be called the spread of the “new dialect forms,” was observed. In the Fifth Stage, the present, the standardization process approaches completion, and the new dialect forms also decrease, which results in the nationwide homogenization of spoken language. Furthermore, it is presumed that standardization is strongly affected by the mass media, which is broadcast mainly from Tokyo. At the same time, nationwide use of the Tokyo dialect has been observed in private domains. Thus, standardization and linguistic Tokyonization progress concurrently. In this study, the state of Japanese from the point of view of the distribution of standard forms was discussed historically. Further, generational differences combined with the distribution of dialect forms were used to explain language change in detail.

References

Hida, Yoshifumi. 1992. Studies in the Formation of Tokyo Japanese (in Japanese). Tokyodo Shuppan.
Inoue, Fumio. 1990. “Quantitative characteristics and geographical distribution patterns of standard Japanese forms”. GENGO KENKYU 97. Linguistic Society of Japan.
Inoue, Fumio. 2004. “Geographical factors of communication on the basis of usage rate of the standard Japanese forms and railway distance”. The Japanese Journal of Language in Society 7-1. The Japanese Association of Sociolinguistic Sciences.
Inoue, Fumio and Hisako Kasai. 1982. “Dialect classification by standard Japanese forms”. Mathematical Linguistics 27-3. The Mathematical Linguistic Society of Japan.
Inoue, Fumio, Koji Tamai and Kanetaka Yarimizu. 2003. Age-Area Distribution of the Tohoku and Hokkaido Dialects (TH Glottograms). Tokyo University of Foreign Studies.



Kawaguchi, Yuji. 2007. “Is it possible to measure the distance between near languages? A case study of French dialects”. Langues proches - Langues collatérales, L’Harmattan. 81-88.
Komatsu, Hisao. 1985. Japanese in the Edo Period: Edo Japanese (in Japanese). Tokyodo Shuppan.
Mase, Yoshio. 1981. “The Influence of television and language of a city on language acquisition (in Japanese)”. Studies in the Japanese Language (Kokugogaku) 125.
Matsumura, Akira. 1998. Study of the Edo and Tokyo Japanese (Enlarged Edition) (in Japanese). Tokyodo Shuppan.
Nakamura, Michio. 1948. Characteristics of the Tokyo Dialect (in Japanese). Kawada Shobo.
Rodriguez, João. 1604-08. Arte da Lingoa de Iapam (facsimile edition, Benseisha, 1976).
Sakono, Fuminori. 1998. Study of Philological Study of Dialects (in Japanese). Seibundo.
Sanada, Shinji. 1991. How was the Standard Japanese Language formed? (in Japanese). Sotakusha.
Sanada, Shinji. 1996. Dynamism of the Local Language: Kansai (in Japanese). Ohfu.
Sawaki, Motoei. 2002. “Application of linguistic atlas data: Cluster analysis of research points of the GAJ (in Japanese)”. Hogenchirigaku no Kadai, MASE, Yoshio (ed). Meiji Shoin.
Sibata, Takesi. 1988. Dialectology (in Japanese). Heibonsha.
Tanaka, Akio. 1983. “Tokyo Japanese: Its formation and development (in Japanese)”. Meiji Shoin.
The National Institute for Japanese Language. 1966-1974. Linguistic Atlas of Japan (in Japanese) Vols. 1-6.
The National Institute for Japanese Language. 1989-2006. Grammar Atlas of Japanese Dialects (in Japanese) Vols. 1-6.
The National Institute for Japanese Language. 2007. Language Activity in Local Society: Three continuous surveys every twenty years at Tsuruoka (in Japanese).
Tsuchiya, Shinichi. 2009. Study of Edo-Tokyo Japanese: The Way to the Standard Japanese (in Japanese). Bensei Shuppan.
Yarimizu, Kanetaka. 2006. The Standardization Process of the Dialect Grammar in Tohoku and Hokkaido (in Japanese), Linguistic Informatics VI. Tokyo University of Foreign Studies.
Yarimizu, Kanetaka, Yuji Kawaguchi and Masanori Ichikawa. 2005. “Multivariate analysis in dialectology: A case study of the standardization in the environs of Paris”. Linguistic Informatics: State of the Art and the Future. The First International Conference on Linguistic Informatics. John Benjamins.
Yasuda, Toshiaki. 1999. Between “the national language” and “dialects”: Politics of construction of a language (in Japanese). Jimbun Shoin.


Index of Proper Nouns

Apabhraṃśa 223, 225-230, 232-239
Arabic 189, 191-196, 198, 213-215, 217
Atlas Linguarum Europae 12
Austric 204
Austroasiatic 203
bas-auvergnat 60-62, 71
Bengal 229
Bengali 204, 213, 214, 216, 217, 221
Bihar 205
Birhor 213, 215
Burmese 204
Central Institute of Indian Languages (CIIL) 205
Chanson de Roland 80
Clermont-Ferrand 59
Constantinople 80-87
Desi 213, 215
Deutsche Diachrone Baumbank (DBB) 40
Document Type Definition (DTD) 164, 165
Eastern Apabhraṃśa (EAp) 226, 229, 231-233, 235-239
English 158, 159, 189, 190, 195-198, 213, 215
Frantext 123
French 60, 62
Grammar Atlas of Japanese Dialects 12, 265
Gujarat 230, 233
Gurjara Apabhraṃśa (GAp) 229
Gusiilay 76
Hayashi’s Quantification Method Type III 271-275, 277-279
Helsinki Corpus 39
Hindi 203, 213, 214, 216
Iberian Peninsula 153
Indo-Aryan 223, 224, 232
ISO (International Organization for Standardization) 162
Jharkhand 203
Kashmir 234
Kashmiri Apabhraṃśa (KAp) 230, 234-239
langue d'oc 60
langue d'oïl 60
Latin 76, 91, 92, 94
Latin Extended-D 163, 166
Le Corpus d’Amsterdam 116
Leeds 189, 191, 196
Maharashtra 230
Medieval Nordic Text Archive (MENOTA) 164
Medieval Unicode Font Initiative (MUFI) 165
Middle Ages 159
Middle French 118, 122-128
Middle Indo-Aryan (MIA) 223, 225, 226, 236, 238
Modern French 115, 129
Mon-Khmer 203
Montferrand 60
Munda 203, 213, 215
National Institute for Japanese Language 265, 276
New Amsterdam Corpus 111
New Indo-Aryan (NIA) 223, 226, 232, 238
Northern Apabhraṃśa (NAp) 229
Occitan 60
Old Church Slavonic (OCS) 175-181, 184, 185
Old French 12, 13, 75-85, 111, 116-129
Old Indo-Aryan (OIA) 223, 224, 230, 236



Old Russian (OR) 175-177, 180, 181, 184, 185
Origins of the Portuguese Language project 167
Paris 64
Perceval 80
Persian 213-216
Portuguese 124, 159
Prakrit 225, 227-229, 232, 238
Private Use Area 162, 164, 165
Quran 189, 191, 196-199
Rajasthan 230
Ranchi University 206
Renaissance 159
Research Institute of Languages and Cultures of Asia and Africa (ILCAA) 205, 206
Romance (languages) 91, 92, 94
Romance languages 76
Rumanian 122
Sanskrit 223-229, 231, 236-238
Santali 203
Semitic 189, 192, 196
SGML (Standard Generalized Markup Language) 162, 164
Siècle classique 129
Simple Concordance Program 4.0.9 for Mac 81
Southern Apabhraṃśa (SAp) 230-233, 235-239
Spanish 78, 79
Standard Italian 122
TEI Character Encoding Workshop (CE W 12) 168
TEI Guidelines / TEI P5 163, 168
Text Encoding Initiative (TEI) 162-165
Ukiyoburo 9
Ukiyodoko 9
Unicode 165
Unicode Consortium 162, 165
Universal Character Set (UCS) 162-165
Vita Constantini (Vita Con.) 175, 182, 183, 185, 186
Vulgate 126-128
Western Apabhraṃśa (WAp) 229-239
XML (Extensible Markup Language) 162, 164


Names

Adams, D. Q. 26
Agrippa d'Aubigné 126
Alinei, M. 22, 24, 26, 28
Alsdorf, L. 229
Andersen, H. 91, 94, 104, 105
Anderson, G.D.S. 203, 204
Andry de Bois-Regard 129
Aronoff, M. 189
Ashen, F. 189
Bauer, L. 190
Benveniste, E. 92, 101, 102
Biondelli, B. 21
Blumenthal, P. 87
Bodding, P.O. 203, 205, 208
Bodou, J. 68
Borregán, A. 79
Braudel, F. 87
Bubeník, V. 226, 229, 232
Bybee, J. 190
Campbell, A. 204, 205, 208, 211
Cantineau, J. 192
Chatterji, S.K. 227
Chretien of Troyes 80-87
Cohen, M. 134, 137, 143
Corneille, T. 129
Cornillie 112
Dalbera, J.-P. 12
Daneš, F. 75
De Clerck, E. 233
De Haan, F. 112
De Saussure, F. 7, 27
De Vies. 25
Dees, A. 116
Digital Scriptorium 164
Droysen, J. 87
Ehrliholzer, H. 76
Emiliano, A. 163, 165, 166
Everson, M. 163, 166
Fleischman, S. 101, 102
Ford, A. 190
Foucault, M. 87
Foulet, L. 102, 103
Freeman, M. 79
Frei, H. 177
Froissart, J. 87
Gafos, A. 192
Geoffroy of Monmouth 78
Giannakidou, A. 122
Gilliéron, J. 21
Godel, R. 7
Goldberg, A. E. 244
Haywood, J.A. 193
Heath, J. 192
Hemacandra 227, 228, 235
Hjelmslev, L. 7
Inoue, F. 265, 266, 269, 278
Jacobi, H. 227, 229, 232
Jagić, V. 176, 177, 180
Jolles, A. 75
Kamp, H. 113
Klare, J. 76
Knoblauch, H. 76
Koselleck, R. 87
Krefeld, T. 76
Krier, F. 141, 146
Kuroda, S. 248
Labov, W. 37, 38
Lamb, S. 190
Larcher, P. 193, 194
Lavrov (Лавров), П. 182-184
Le Querler, N. 111
Leemhuis, F. 197
Leiss, E. 112
Lessing, T. 87
Levy, M. 191
Lewis, D.K. 113, 121




Lipinski, E. 196
Lodge, R. A. 59, 133, 141
Luckmann, T. 76
Luigia, N.-D. 228, 232
MacPhail, R.M. 205
Mallory, J.P. 26
Mansfield, C. 65
Martohardjono, G. 190
Matsumura, A. 11
McCarthy, J. 192
Minegishi & Murmu 205
Morin, Y. C. 133
Nahmad, H.M. 193
Nuyts, J. 112
Okuda, Y. 244, 245, 248
Palmer, F.R. 111
Pāṇini 223, 228
Paris, G. 8
Parkinson, S. 163, 164
Pedro, S. 166
Pietandrea, P. 112
Plungian, V.A. 112
Pope John Paul II. 78
Portner, P. 111
Posner, R. 9
Prince, A. 192
Ragunat, M. 207
Raible, W. 76
Ratcliffe, R. 191-194
Rissanen, M. 10
Robert de Clari 80-87
Robinson, P. 159, 163
Rodriguez, J. 267, 268
Schlieben-Lange, B. 75, 76, 86
Schmidt, W. 204
Singh, R. 190
Stein, A. 116
Stempel, W. 76, 84, 87
Stoll, E. 79
Tagare, G.V. 226, 227, 229, 230, 232-234
Tendeng, O. 76
Tranel, B. 133
Uitti, K. 79
Vaudelin, G. 18, 134-141, 144, 146, 148
Villehardouin, G. 80-87
Voltaire 87
Watson, J. 193
Weigand, G. 21
Weijnen, A. 21
Weinrich, H. 92, 101, 102
White, H. 87
Wright, J. 32
Wright, W. 193
Zaborski, A. 196


Index of Subjects

element 163, 167, 169, 170
abbreviations 161, 163
abstraction 88
accessibility 113
acquaintance relation 113, 120-122
activity type 75
actualisation 92, 94, 97-99, 102, 104, 105
element 167, 169, 170
adventure 80, 84, 85
agent 195
allograph(s) 158
allographemes/allograms 158
allography 157, 158
allophones 158
allophony 158
alphabet(s) 154, 156, 158, 159
element 163
analogy 190
anisomorphic direct representation 164, 165, 168
anthropomorphic layer 28, 30, 32
bad philology 167
belief predicates (doxastic predicates) 111, 114-129
brachygraph(s) 161, 163
brachygraphy 163
brevigraph 163
broken plural 191, 197
BSD 203, 206
causality 79, 84, 87
causative 193-195, 198
centred worlds 121, 122
chansons de geste 76
character set(s) 154, 156, 160, 161
character(s) 154-157, 159, 162, 163, 167
chirographic writing 156
Christian/Muslim layer 28, 31, 32
chronological order 79, 80, 87
class of characters 154
clause linking 84, 85
clefting 83, 88
cognitive framework 88
common ground 119
communicative genres 75
conceptualisation 76, 85
conceptualisation of action 80, 84, 85, 87
conceptualisation of history 77, 87, 88
conceptualisation of writing 86
connectionism 190
conventionalized patterns of use 118, 119, 125, 128
coordination 84, 85
corpus 223, 233-235, 238, 239
corpus architecture 39, 40, 42, 43
corpus linguistics 116, 129
corpus/corpora 153, 154, 160, 165, 166
correct 177, 179, 180
counterfactual implicature 119, 124, 126
counterfactuality 118, 124, 127
critical context 94, 95, 97
croire 111, 114, 115, 122, 126-129
cuidier 111, 115-129
de dicto/de re 113
derived verbs 189-194, 196-198
diachrony 7
diasystematic parameters 91, 92, 94, 97, 104-106
digital typography 167
diglossia 76
digression 83, 85
diphthongisation 71
diplomatic edition 154, 166
direct representation 162, 164



direct speech 85
discourse traditions 76, 81
ditransitive 197
domain-widening expressions 118
dynamic synchrony 8-11
edition of medieval texts/primary sources 153, 154, 159
edition(s) 153, 159-162, 164, 167
emic and etic units 157
empathy 247-249, 255, 261
endogenous linguistic change 64
entity reference(s) 164-166
entity(ies) 164, 165
epic tense switching 91, 100-103
epistemicity 112, 113
ethnolinguistic origins of Europe 25, 26
etymology 25, 26
evidential 243, 248, 250, 253, 254, 257, 258, 260-263
evidentiality 112
element 163
exogenous linguistic change 64
element 163, 167
external actor 195
facsimile(s) 157, 160
format 75
frequency 197, 198
function 175, 177, 185
element 163
generic model 79
glottogram survey 265
glyph set(s) 154, 160, 161
glyph(s) 154, 156, 157, 162, 163
graph(s) 156, 157
grapheme(s) 157, 158
grapheme-phoneme correspondence rules 157
graphemic 154, 157, 158, 160, 161
graphemics 154, 157
graphetic 154, 160, 161
graphetic implementation 156
graphetics 154
grapholexemes/logograms 157, 158
graphomorphemes 158
graphonemes/phonograms 158
grounding 87
historical corpora 10
historical texts 77, 79-88
history of religions 28
implicature 118, 126
incorrect 177, 179-182
indicative 112, 115, 117, 122-129
indirect representation 162, 167
inferences 114
intersubjectivity 113
intransitive 190, 193, 194, 197
invasion theory 26
isomorphic direct representation 164-166, 170
element 168
element 168-170
language change 37-40, 44, 50
Latin-Portuguese charter 162, 167
layering 94, 96-98, 104
element 170
letter(s) 154-157, 159
lettershape(s) 155, 156
levelling of diphthongs 71
lexical change 66
linguistic atlas 12
linguistic framework 88
linguistic norms 65
linguistic significance 136
literary models 88
loanword 24, 25
Loceme 65
logography 158
manuscript(s) 160, 161
meaning-construction type 245, 248-252, 255, 258, 262
medieval script(s) 161
metahistory 87, 88

minimal pair 209, 218
misuse 175, 177, 179, 181-186
modality 111, 112
mood 12-19, 111-121
morphemic weight 138, 143
morphological change 68
morphology 189-192, 194, 195
motivation 26-33
multi-layer annotation 37, 43, 49, 54
attribute 168-170
narrative 79, 88
negation 68
negative polarity expressions 118, 127
neolithic dispersal theory 26
new dialect 277-279, 282
non-archaic 236-239
norm 177
normalization 42-44, 47, 49, 162, 166, 170
novel 76, 77, 79
onomatopoeic 216
organisation of the utterance 75
orthography 154, 158, 159
element 170
palaeographic edition 154, 160
palatalisation of consonants 72
paleolithic continuity theory 26
participle 175-185
patient 195
peripheral distribution 268
philological faithfulness 160
philological truth 160
philological truthness 160
phoneme 157, 158
phonetic change 70
phonograms 158
phonographic vs. logographic systems 158
phonography 158
attribute 168-170
polarity subjunctive 115
Prakritism 228
preterite tense 69
productivity 189-191, 197
prose 76-88
pseudo-archaic 223, 228, 236, 238, 239
public addressed as hearers 82
element 168, 169
railway distance 269
reanalysis 92-97, 99, 104
reciprocal 193
reconstructed roots 26
reflexive 193
relaciones 78
rhematising 83, 84, 88
rhetorical questions 119
rhotacisation 65
Roman alphabet 158, 159, 161
romances in verse 76, 81, 84, 85
root 192, 193, 197, 198
sandhi 133, 134
Sanskritism 228
element 168
semi-productivity 190, 191
sociolinguistic and pragmatic factors 137
Sprachbund 24
standardization process 265, 267, 268, 275, 281, 282
subjectivity 113, 114, 121, 122, 124, 127, 129
subjunctive 111, 112, 115, 117-129
subordination 76, 84, 85, 87
synchrony 7
syntactic annotation 40-43, 47, 48, 51
syntactic change 67
systematic glyph(s) 156, 167
tale 86
text genre 75, 78, 79
text linguistics 75
Tokyo dialect 269, 270, 280-282
traditions of speaking 75, 76, 78



transcription 153, 159-161
transitive 193-195, 197, 198
transliteration 153, 159-162
transliteration rules 207
treebank 39, 40
typographic representation 153, 156, 162, 164, 166
typology 22-24
valence 193-195, 197, 198
variationism 38, 39
vernaculars 76, 79
verse 76, 77, 79, 81, 82, 84, 85
Visigothic script 162, 167
word juncture/disjuncture 168
word-based morphology 192, 194, 196, 198
writing system 154, 157, 159, 207
written source 80
zoomorphic layer 28, 29, 32

Contributors

Wolfgang VIERECK

University of Bamberg


Humboldt-University Berlin


Humboldt-University Berlin


Humboldt-University Berlin

Anthony LODGE

University of St Andrews

Wolfgang RAIBLE

University of Freiburg


University of Copenhagen


University of Cologne


Tokyo University of Foreign Studies


New University of Lisbon

Yoshinori ONDA

Tokyo University of Foreign Studies


Tokyo University of Foreign Studies


ILCAA, Tokyo University of Foreign Studies


ILCAA, Tokyo University of Foreign Studies


Ranchi University


ILCAA, Tokyo University of Foreign Studies


GCOE, Tokyo University of Foreign Studies


National Institute for Japanese Language and Linguistics