The Phraseological View of Language: A Tribute to John Sinclair 9783110257014, 9783110256888

This volume presents the results of the international symposium Chunks in Corpus Linguistics and Cognitive Linguistics,

English Pages 336 Year 2011

Table of contents :
I John McH. Sinclair and his contribution to linguistics
Preface
A tribute to John McHardy Sinclair (14 June 1933 – 13 March 2007)
Corpus, lexis, discourse: a tribute to John Sinclair
II The concept of collocation: theoretical and pedagogical aspects
Choosing sandy beaches – collocations, probabemes and the idiom principle
Sinclair revisited: beyond idiom and open choice
Accessing second-order collocation through lexical co-occurrence networks
From phraseology to pedagogy: challenges and prospects
Chunks and the effective learner – a few remarks concerning foreign language teaching and lexicography
Exploring the phraseology of ESL and EFL varieties
III Variation and change
Writing the history of spoken standard English in the twentieth century
Prefabs in spoken English
Observations on the phraseology of academic writing: local patterns – local meanings?
Collocational behaviour of different types of text
IV Computational aspects
Corpus linguistics, generative grammar and database semantics
Chunk parsing in corpora
German noun+verb collocations in the sentence context: morphosyntactic properties contributing to idiomaticity
Author index
Subject index

The Phraseological View of Language A Tribute to John Sinclair

Edited by Thomas Herbst, Susen Faulhaber and Peter Uhrig

De Gruyter Mouton

ISBN 978-3-11-025688-8 e-ISBN 978-3-11-025701-4 Library of Congress Cataloging-in-Publication Data The phraseological view of language : a tribute to John Sinclair / edited by Thomas Herbst, Susen Faulhaber, Peter Uhrig. p. cm. Includes bibliographical references and index. ISBN 978-3-11-025688-8 (alk. paper) 1. Discourse analysis. 2. Computational linguistics. 3. Linguistics — Methodology. I. Sinclair, John McHardy, 1933 — 2007. II. Herbst, Thomas, 1953— III. Faulhaber, Susen, 1978— IV. Uhrig, Peter. P302.P48 2011 401'.41—dc23 2011036171

Bibliographic information published by the Deutsche Nationalbibliothek

The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available in the Internet at http://dnb.d-nb.de.

© 2011 Walter de Gruyter GmbH & Co. KG, Berlin/Boston
Printing: Hubert & Co. GmbH & Co. KG, Göttingen
Printed on acid-free paper
Printed in Germany
www.degruyter.com

Preface

This volume goes back to a workshop entitled Chunks in Corpus Linguistics and Cognitive Linguistics held at the Friedrich-Alexander-Universität Erlangen-Nürnberg in October 2007 on the occasion of awarding an honorary doctorate by the Philosophische Fakultät II to Professor John McHardy Sinclair. With this honorary degree the Faculty wished to express its respect and admiration for John Sinclair's outstanding scholarly achievement and the enormous contribution he has made to linguistic research. This concerns both his role in the development of corpus linguistics and his innovative approach towards the making and designing of dictionaries, which culminated in the Cobuild project, but also the fact that for him such applied work was always connected with the analysis of language as such and thus with theoretical insights about language. For instance, at a time when it was commonly held that words have more than one meaning and that when listening to a sentence listeners pick out the appropriate meanings of the words on the basis of the context in which they are used, it took Sinclair to point out that, taking the meanings listed in a common English dictionary, a sentence such as The cat sat on the mat would potentially have more than 41 million interpretations from which listeners choose one as being correct (Sinclair 2004: 138).2 It is observations such as these - and especially the many demonstrations of the complex character of prefabricated chunks in language - that make his books and articles so enlightening and engaging to read. Unconventional thought and provocative ideas based on unparalleled empirical evidence - these are the qualities that made this faculty regard John Sinclair as an outstanding individual and scholar.
His approach to linguistics encompassed all facets without any division between theory and practical description and has provided an important impetus for linguists worldwide, including many associated with the Erlangen Interdisciplinary Centre for Research on Lexicography, Valency and Collocation. On a personal note, we are grateful to him for his decisive support of the valency dictionary project. That John Sinclair, who had planned to take part in the colloquium, died in March 2007, came as a great shock; at least, the news of this honorary doctorate had reached him when he was still well. Nevertheless, we felt it was appropriate to hold the workshop, in which a number of John's colleagues and friends participated, to discuss his ideas and research projects that were inspired by and related to his work and we are very honoured and grateful that Professor Elena Tognini Bonelli came to take part in the workshop and to receive this distinction on his behalf.

We owe a great deal of gratitude to Michael Stubbs and Stig Johansson for providing very detailed surveys of Sinclair's outstanding contribution to the development of the subject, which form section I of this volume. In "A tribute to John McHardy Sinclair (14 June 1933 - 13 March 2007)", Michael Stubbs provides a detailed outline of John Sinclair's academic career and demonstrates how he has left his mark in various linguistic fields which are, in part, intrinsically connected to his name. Of particular significance in this respect is his focus on refining the notion of units of meaning, the role of linguistics in language learning and discourse analysis and, of course, his contributions in the field of lexicography and corpus linguistics in the context of the Cobuild project. We are particularly grateful to Stig Johansson, who, although he was not able to take part in the workshop for reasons of his own health, wrote a tribute to John Sinclair for this occasion. His article "Corpus, lexis, discourse: a tribute to John Sinclair" focuses in particular on Sinclair's work in the context of the Bank of English and his influence on the field of corpus linguistics.

The other contributions to this volume take up different issues which have featured prominently in Sinclair's theoretical work. The articles in section II focus on the concept of collocation and the notions of open choice and idiom principle. Thomas Herbst, in "Choosing sandy beaches - collocations, probabemes and the idiom principle", discusses different types of collocation in the light of Sinclair's concepts of single choice and extended unit of meaning, drawing parallels to recent research in cognitive linguistics.
The notions of open choice and idiom principle are also taken up by Dirk Siepmann, who in "Sinclair revisited: beyond idiom and open choice" suggests introducing a third principle, which he calls the principle of creativity, and discusses its implications for translation teaching. Eugene Mollet, Alison Wray, and Tess Fitzpatrick widen the notion of collocation in their article "Accessing second-order collocation through lexical co-occurrence networks". By second-order collocation they refer to word combinations depending on the presence or absence of other items, modelled in the form of networks. Thus they address questions such as to what extent the presence of several second-order collocations can influence the target item.

A number of contributions focus on aspects of collocation in foreign language teaching. In "From phraseology to pedagogy: challenges and prospects" Sylviane Granger discusses the implications of the interdependence of lexis and grammar and the idiom principle with respect to the learning and teaching of languages. While bearing in mind that such an approach needs to be reconciled with the realities of language classrooms, she gives a detailed overview of the pros and cons of a lexical as opposed to a structural grammar-based syllabus. In "Chunks and the effective learner - a few remarks concerning foreign language teaching and lexicography" Dieter Götz illustrates approaches towards the treatment of chunks in lexicography and discusses strategies foreign learners should develop to expand their repertoire of prefabricated items. Nadja Nesselhauf's article "Exploring the phraseology of ESL and EFL varieties" addresses questions of foreign language use (EFL) and compares them with English as a native language and English as a second language (ESL) with respect to chunks and phraseological phenomena.

With her view of learner English as a variety of English, Nesselhauf's contribution provides a link to section III, which focuses on aspects of variation and change. "Writing the history of spoken standard English in the twentieth century" by Christian Mair takes a diachronic perspective and focuses on the role of the spoken language - as opposed to that of the written language - in language change. By comparing data from the Diachronic Corpus of Present-Day Spoken English (DCPSE) with corpora from the "Brown family", he shows that speech and writing can develop autonomously and that changes which develop in one mode need not necessarily be taken over into the other. In her article "Prefabs in spoken English", Brigitta Mittmann looks at regional variation and presents results of a corpus-based study of phraseological units in British and American English. By providing empirical evidence for regional differences, e.g. in the sense of different preferences for phraseological units with the same pragmatic function, she underlines the arbitrariness and conventionality of such chunks in language.
Ute Römer investigates variation with respect to genre in her article "Observations on the phraseology of academic writing: local patterns - local meanings?". Examining what she calls the "phraseological profile" in this specific type of written English with the help of a corpus of linguistic book reviews, she shows that certain chunks are often associated with positive or negative evaluative meaning even if the lexical items themselves are rather neutral in this respect. The fact that this association in part differs from the use of the same chunks in more general English is taken as evidence for a "local grammar" and the genre-specificity of units of meaning and thus their conventionality. Peter Uhrig and Katrin Götz-Votteler also place a focus on genre-specific differences in their article "Collocational behaviour of different types of text". They look at different samples of fictional, non-fictional and learner texts using a computer program to determine different degrees of collocational strength and to explore to what extent it is possible to relate these to factors such as perceived difficulty, text type or idiomaticity.

Section IV concentrates on computational aspects of phraseological research. In "Corpus linguistics, generative grammar, and database semantics", Roland Hausser compares the information given in entries of the COBUILD English Language Dictionary to the data structures needed in Database Semantics, which is aimed at enabling an artificial cognitive agent to understand natural language. Günther Görz and Günter Schellenberger show in "Chunk parsing in corpora" that chunks also play an important role in Natural Language Processing (NLP). They present a method for chunk parsing and for evaluating the performance of chunk parsers. Finally, Ulrich Heid's contribution "German noun+verb collocations in the sentence context: morphosyntactic properties contributing to idiomaticity" combines computational aspects of analysis with the theoretical insights gained in a Sinclairean framework and shows that a thorough description of collocations goes beyond the level of lexical co-occurrence and has to include morphological, syntactic, semantic and pragmatic properties. He proposes a computational architecture for the extraction of such units from corpora that makes use of syntactic dependency parsing.

We hope that the contributions in this volume give some idea of the wide spectrum of areas in which research has been carried out that has been either directly inspired by the work of Sinclair or that addresses issues that were also important to him.
The spirit of this volume is to pay tribute to one of the most outstanding linguists of the twentieth century, who (in German) could rightly be called a first-class Querdenker, an independent thinker who, in a rather down-to-earth way, raised questions that were not always in line with the research fashion of the day but instead pointed towards unconventional solutions which in the end seemed perfectly straightforward. Susen Faulhaber Thomas Herbst Peter Uhrig

Notes

1 We would like to thank Barbara Gabel-Cunningham for her invaluable help in preparing the manuscript, Kevin Pike for his linguistic advice and also Christian Hauf for his assistance in preparing the index and proofreading.
2 See Sinclair (2004: 138), The lexical item. In Trust the Text: Language, Corpus and Discourse, John McH. Sinclair with Ronald Carter (eds.). London/New York: Routledge. First published in Edda Weigand, Contrastive Lexical Semantics: Current Issues in Linguistic Theory, Amsterdam/Philadelphia: Benjamins, 1998.

Contents

I John McH. Sinclair and his contribution to linguistics
Preface
  Susen Faulhaber, Thomas Herbst and Peter Uhrig
A tribute to John McHardy Sinclair (14 June 1933 - 13 March 2007)
  Michael Stubbs
Corpus, lexis, discourse: a tribute to John Sinclair
  Stig Johansson

II The concept of collocation: theoretical and pedagogical aspects
Choosing sandy beaches - collocations, probabemes and the idiom principle
  Thomas Herbst
Sinclair revisited: beyond idiom and open choice
  Dirk Siepmann
Accessing second-order collocation through lexical co-occurrence networks
  Eugene Mollet, Alison Wray and Tess Fitzpatrick
From phraseology to pedagogy: challenges and prospects
  Sylviane Granger
Chunks and the effective learner - a few remarks concerning foreign language teaching and lexicography
  Dieter Götz
Exploring the phraseology of ESL and EFL varieties
  Nadja Nesselhauf

III Variation and change
Writing the history of spoken standard English in the twentieth century
  Christian Mair
Prefabs in spoken English
  Brigitta Mittmann
Observations on the phraseology of academic writing: local patterns - local meanings?
  Ute Römer
Collocational behaviour of different types of text
  Peter Uhrig and Katrin Götz-Votteler

IV Computational aspects
Corpus linguistics, generative grammar and database semantics
  Roland Hausser
Chunk parsing in corpora
  Günther Görz and Günter Schellenberger
German noun+verb collocations in the sentence context: morphosyntactic properties contributing to idiomaticity
  Ulrich Heid

Author index
Subject index

A tribute to John McHardy Sinclair (14 June 1933 - 13 March 2007)

Michael Stubbs

1. Abstract

This Laudatio was held at the University of Erlangen, Germany, on 25 October 2007, on the occasion of the posthumous award of an honorary doctorate to John McHardy Sinclair. The Laudatio briefly summarizes some facts about his career in Edinburgh and Birmingham, and then discusses the major contributions which he made to three related areas: language in education, discourse analysis and corpus-assisted lexicography. A major theme in his work throughout his whole career is signalled in the title of one of his articles: "the search for units of meaning". In the 1960s, in his early work on corpus analysis, he studied the relation between objectively observable collocations and the psychological sensation of meaning. In the 1970s, in his work on classroom discourse, he studied the prototypical units of teacher-pupil dialogue in school classrooms. And from the 1980s onwards, in his influential work on corpus lexicography, for which he is now best known, he studied the kinds of patterning in long texts which are observable only with computational help. The Laudatio provides brief examples of the kind of innovative findings about long texts which his work has made possible: his development of a "new view of language and the technology associated with it". Finally, it situates his work within a long tradition of British empiricism.

2. Introduction

John Sinclair is one of the most important figures in modern linguistics. It will take a long time before the implications of his ideas, for both applied linguistics and theoretical linguistics, are fully worked out, because many of his ideas were so original and innovative. But some points are clear. Most of us would be pleased if we could make one recognized contribution to one area of linguistics. Sinclair made substantial contributions to three areas: language in education, discourse analysis, and corpus linguistics and lexicography. In the case of discourse analysis and corpus-assisted lexicography, he created the very areas which he then developed.

The three areas are closely related. In the 1960s, in Edinburgh, he began his work on spoken language, based on the belief that spoken English would provide evidence of "the common, frequently occurring patterns of language" (Sinclair et al. [1970] 2004: 19). One central topic in his work on language in education was classroom discourse: authentic audio-recorded spoken language. The work on classroom language emphasized the need for the analysis of long texts, as opposed to the short invented sentences which were in vogue at the time, post-1965. In turn, this led to the construction of large text collections, machine-readable corpora of hundreds of millions of running words, and to studying patterning which is visible only across very large text collections. This allowed the construction of the series of COBUILD dictionaries and grammars for learners of English. The COBUILD dictionaries, which were produced from the late 1980s by using such corpus data, were designed as pedagogic tools for advanced learners of English. There were always close relations between his interest in spoken language, authentic language and language in the classroom, and between his theoretical and applied interests.

3. Education and career

Sinclair was a Scot, and proud of it. He was born in 1933 in Edinburgh, attended school there, and then studied at the University of Edinburgh, where he obtained a first class degree in English Language and Literature (MA 1955). He was then briefly a research student at the University, before being appointed to a Lectureship in the Department of English Language and General Linguistics, where he worked with Michael Halliday. His work in Edinburgh centred on the computer-assisted analysis of spoken English and on the linguistic stylistics of literary texts. In 1965, at the age of 31, he was appointed to the foundation Chair of Modern English Language at the University of Birmingham, where he then stayed for the whole of his formal university career. His inaugural lecture was entitled "Indescribable English" and argued that "we use language rather than just say things" and that "utterances do things rather than just mean things" (Sinclair 1966, emphasis in original). His work in the 1970s focussed on educational linguistics and discourse analysis, and in the 1980s he took up again the corpus work and developed his enormously influential approach to lexicography.

He took partial retirement in 1995, then formal retirement in 2000, but remained extremely active in both teaching and research. With his second wife Elena Tognini-Bonelli, he founded the Tuscan Word Centre, and attracted large numbers of practising linguists to teach large numbers of students from around the world.1 Throughout his career he introduced very large numbers of young researchers and PhD students - who were spread across many countries - to work in educational topics, discourse analysis and corpus analysis. He travelled extensively in many countries, but nevertheless spent most of his career based in Birmingham. One reason, he once told me, was that he had built up computing facilities there, and, until the 1990s, it was simply not possible to transfer such work to other geographical locations. He died at his home in Florence in March 2007.

4. "The search for units of meaning"

It is obviously rather simplistic to pick out just one theme in all his work, but "the search for units of meaning" (Sinclair 1996) might not be far off. This is the title of one of his articles from 1996. In his first work in corpus linguistics in the 1960s, he had asked: "(a) How can collocation be objectively described?" and "(b) What is the relationship between the physical evidence of collocation and the psychological sensation of meaning?" (Sinclair, Jones and Daley 2004: 3). In his work on discourse in the classroom, he was looking for characteristic units of teacher-pupil dialogue. In his later corpus-based work he was developing a sophisticated model of extended lexical units: a theory of phraseology. And the basic method was to search for patterning in long authentic texts. Along with this went his impatience with the very small number of short invented, artificial examples on which much linguistics from the 1960s to the 1990s was based. The title of a lecture from 1990, which became the title of his 2004 book, also expresses an essential theme in his work: "Trust the Text" (Sinclair 2004). He argued consistently against the neglect and devaluation of textual study, which affected high theory in both linguistic and literary study from the 1960s onwards (see Hoover 2007).

5. Language in education

One of his major contributions in the 1960s and 1970s was to language in education and educational linguistics. In the 1970s, he was very active in developing teacher-training in the Birmingham area. He regularly made the point that knowledge about language is "sadly watered down and trivialized" in much educational discussion (Sinclair 1971: 15), and he succeeded in making English language a compulsory component of teacher-training in BEd degrees in Colleges of Education in the West Midlands. Also in the early 1970s, along with Jim Wight, he directed a project entitled Concept 7 to 9. This project produced innovative teaching materials. They consisted of a large box full of communicative tasks and games. They were originally designed for children of Afro-Caribbean origin, who spoke a variety of English sometimes a long way from standard British English, but it turned out that they were of much more general value in developing the communicative competence of all children. The tasks focussed on "the aims of communication in real or realistic situations" and "the language needs of urban classrooms" (Sinclair 1973: 5). In the late 1970s, he directed a project which developed ESP materials (English for Specific Purposes) for the University of Malaya. The materials were published as four course books, entitled Skills for Learning (Sinclair 1980). In the early 1990s, he became Chair of the editorial board for the journal Language Awareness, which started in 1992. One of his last projects is PhraseBox. This is a project to develop a corpus linguistics programme for schools, which Sinclair worked on from around 2000. It was commissioned by Scottish CILT (Centre for Information on Language Teaching and Research) and funded by the Scottish Executive, Learning and Teaching Scotland and Canan (the Gaelic College on Skye). 
The software gives children in Scottish primary schools resources to develop their vocabulary and grammar by providing them with real-time access to a 100-million-word corpus. The project is described in one of Sinclair's more obscure publications, in West Word, a community newspaper for the western highlands in Scotland (Sinclair 2006a). In a word, Sinclair did not just write articles about language, but helped to develop training materials for teachers and classroom materials for students and pupils.

6. Discourse analysis

His contribution to discourse analysis continued his early interests in both spoken language and in language in education. The work started formally in 1970 in a funded research project on classroom discourse. The work was published in 1975 in one book written with the project's co-director, Malcolm Coulthard, Towards an Analysis of Discourse (Sinclair and Coulthard 1975), and then a second book in 1982 with David Brazil, Teacher Talk (Sinclair and Brazil 1982). It was through this work that I got to know John Sinclair personally. I'd been doing my PhD in Edinburgh on classroom discourse, and joined him and his colleague Malcolm Coulthard on a second project on discourse analysis, which studied doctor-patient consultations, telephone conversations, and trade union / management negotiations. My job was to make and analyse audio-recordings in the local car industry. The work from this project was never published as a book, but individual journal articles and book chapters appeared.

It is difficult to remember how little was available on discourse analysis at the time of these projects, and therefore how innovative Sinclair's approach was. J. R. Firth had long ago pointed out that "conversation is much more of a roughly prescribed ritual than most people think" (Firth 1935: 66). But in the early 1970s, the published version of John Austin's How to Do Things with Words was still quite recent (published only ten years before in 1962). John Searle's Speech Acts had been published only two or three years before (1969). Paul Grice had given his lectures on Logic and Conversation some five years before (in 1967), but they were formally published only in 1975 and seem not to have been known to the project. This Oxford (so-called) "ordinary language" work was, however, not based on ordinary language at all, but on invented data: anathema to Sinclair's approach.
Michael Halliday's work was known of course, and provided a general functional background, but even Language as Social Semiotic was not published until several years after the project (in 1978). Anthropological and sociological work also provided general background: Dell Hymes on the ethnography of speaking (available since the early 1960s), Erving Goffman on "behaviour in public places" (from the 1960s); William Labov on narratives and ritual insults (from the early 1970s). Harvey Sacks' lectures were circulating in mimeo form; I arrived in Birmingham with a small collection, and had heard him lecture in Edinburgh around 1972; but little had been formally published. Otherwise, in the early 1970s, work on classroom discourse (by educationalists such as Douglas Barnes) provided insightful observation, but little systematic linguistic description.

There was a general feeling that discourse should somehow be studied, but there were few if any attempts to develop formal models of discourse structure. The two Birmingham projects were the first of their kind, but just a few years later, "discourse analysis" had become a clearly designated area, with its own courses and textbooks. One of the first student introductions was by Malcolm Coulthard (1977). Sinclair was following the principle proposed by J. R. Firth in the 1930s, that conversation is "the key to a better understanding of what language really is and how it works" (Firth 1935: 71), but Sinclair's work on discourse was some ten years ahead of the avalanche of work which it helped to start.

The aspect of the Birmingham discourse model which everyone immediately grasped was the stereotypical teacher-pupil exchange. In classic structuralist manner, Sinclair proposed that classroom discourse is hierarchic: a classroom lesson consists of transactions which consist of exchanges which consist of moves which consist of acts. It was probably the prototypical exchange structure which everyone immediately recognized: an IRF sequence of initiation - response - feedback (Sinclair and Coulthard 1975: 64):

I   Teacher: What is the name we give to those letters? Paul?
R   Pupil: Vowels.
F   Teacher: They're vowels, aren't they.
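The rank scale described above lends itself to a nested data structure. The following Python sketch is only an illustration: the class and field names are assumptions introduced here, not Sinclair and Coulthard's own notation, and splitting the teacher's initiation into two acts is likewise an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List

# Rank scale from the text: lesson > transaction > exchange > move > act.
# All class and field names below are hypothetical, chosen for illustration.


@dataclass
class Act:
    speaker: str
    text: str


@dataclass
class Move:
    kind: str  # "I" (initiation), "R" (response) or "F" (feedback)
    acts: List[Act] = field(default_factory=list)


@dataclass
class Exchange:
    moves: List[Move] = field(default_factory=list)

    def pattern(self) -> str:
        """Return the sequence of move labels in this exchange."""
        return "".join(move.kind for move in self.moves)


@dataclass
class Transaction:
    exchanges: List[Exchange] = field(default_factory=list)


@dataclass
class Lesson:
    transactions: List[Transaction] = field(default_factory=list)


# The exchange quoted from Sinclair and Coulthard (1975: 64).
# Treating "Paul?" as a separate act is an assumption of this sketch.
exchange = Exchange([
    Move("I", [Act("Teacher", "What is the name we give to those letters?"),
               Act("Teacher", "Paul?")]),
    Move("R", [Act("Pupil", "Vowels.")]),
    Move("F", [Act("Teacher", "They're vowels, aren't they.")]),
])
lesson = Lesson([Transaction([exchange])])

print(exchange.pattern())
```

Each rank holds a list of units of the rank below it, which mirrors the "consists of" relation in the model; nothing in the sketch enforces that an exchange must follow the IRF order.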

Nowadays, the IRF model is widely taken for granted, though I suspect that many people who use it no longer know where it comes from. In addition, Sinclair never abandoned an interest in literature, and his work on text and discourse analysis always included literary texts. In the 1970s, along with the novelist David Lodge, who was his colleague in the English Department at Birmingham, he developed a course on stylistics. The title of an early article was "The integration of language and literature in the English curriculum" (Sinclair 1971). For the course, they selected extracts of literary texts in which a specific linguistic feature was foregrounded: such as repetition, verbless sentences, complex noun phrases, and the like. From one end, they taught grammar through literature, and from the other end, they showed that grammatical analysis was necessary to literary interpretation.

The analysis of literary texts was part of Sinclair's demand that linguistics must be able to handle all kinds of authentic texts. He argued further that, if linguists cannot handle the most prestigious texts in the culture, then there is a major gap in linguistic theory. Conversely, of course, the analysis of literary texts must have a systematic basis, and not be the mere swapping of personal opinions. In an analysis of a poem by Robert Graves, he argued that the role of linguistics is to expose "the public meaning" of texts in a language (Sinclair 1968: 216). He similarly argued that "if literary comment is to be more than exclusively personal testimony, it must be interpretable with respect to objective analysis" (Sinclair 1971: 17). In all of this work there is a consistent emphasis on long texts, authentic texts, including literary texts, and on observable textual evidence of meaning.

7. Corpus linguistics and lexicography

7.1. The "OSTI" Report

Post-1990, Sinclair was mainly known for his work in corpus linguistics. This work started in Edinburgh, in the 1960s, and was informally published as the "OSTI Report" (UK Government Office for Scientific and Technical Information, Sinclair, Jones and Daley 2004). This is a report on quantitative research on computer-readable corpus data, carried out between 1963 and 1969, but not formally published until 2004. The project was in touch with the work at Brown University: Francis and Kucera's Computational Analysis of Present Day American English, based on their one-million-word corpus of written American English, had appeared in 1967. But again, it is difficult to project oneself back to a period in which there were no PCs, and in which the university mainframe machine could only handle with difficulty Sinclair's corpus of 135,000 running words of spoken language. Yet the report worked out many of the main ideas of modern corpus linguistics in astonishing detail.

This work in the 1960s formulated explicitly several principles which are still central in corpus linguistics today. It put forward a statistical theory of collocation in which collocations were interpreted as evidence of meaning. It asked: What kinds of lexical patterning can be found in text? How can collocation be objectively described? What size of span is relevant? How can collocational evidence be used to study meaning? Some central principles which are explicitly formulated include: The unit of lexis is unlikely to be the word in all cases. Units of meaning can be defined via statistically defined units of lexis. Homonyms can be automatically distinguished by their collocations. Collocations differ in different text-types. Many words are frequent because they are used in frequent phrases. One form of a lemma is regularly much more frequent than the others (which throws doubt on the lemma as a linguistic unit). It proposed that there is a relation "between statistically defined units of lexis and postulated units of meaning" (Sinclair, Jones and Daley 2004: 6). As Sinclair puts it in the 2004 preface to the OSTI Report, we have a "very strong hypothesis [that] for every distinct unit of meaning there is a full phrasal expression ... which we call the canonical form". And he formulates one of his main ambitious aims: a list of all the lexical items in the language with their possible variants would be "the ultimate dictionary" (Sinclair, Jones and Daley 2004: xxiv).

In a word, the OSTI Report makes substantial progress with a question which had never had a satisfactory answer: How can the units of meaning of a language be objectively and formally identified? It is important to emphasize that this tradition of corpus work was concerned, from the beginning, with a theory of meaning. The work then had to be shelved, because the machines were simply not powerful enough in the 1970s to handle large quantities of data. It was started again in the 1980s as the COBUILD project in corpus-assisted lexicography.
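The report's questions about span and collocational patterning can be made concrete with a small sketch. The Python below simply counts the words that fall within a fixed window on either side of a chosen word. The function name `collocates`, the toy sentence and the window size are assumptions made for illustration only; they are not taken from the OSTI Report, which should be consulted for the statistical procedures actually used.

```python
from collections import Counter


def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens on either side of `node`.

    `span` is an illustrative default, not a value prescribed by the source.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` words before the node
        right = tokens[i + 1:i + 1 + span]     # up to `span` words after the node
        counts.update(left + right)
    return counts


# Toy data (invented for this sketch, not corpus evidence).
text = "there is still a long way to go and this will go a long way towards a solution"
tokens = text.lower().split()
counts = collocates(tokens, "way", span=2)

for word, freq in counts.most_common():
    print(word, freq)
```

Raw counts of this kind are only a starting point: very frequent words will dominate them, which is one reason a statistical measure of association is needed on top of the counting.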

7.2. The COBUILD project

In the 1980s, Sinclair became the Founding Editor in Chief of the COBUILD series of language reference materials. He built up the Birmingham corpus, which came to be called the Bank of English, and along with a powerful team of colleagues - many of whom have made important contributions to corpus linguistics in their own right - the first COBUILD dictionary was published in 1987: the first dictionary based entirely on corpus data. The team for this and later dictionaries and grammars included Mona Baker, Joanna Channell, Jem Clear, Gwynneth Fox, Gill Francis, Patrick Hanks, Susan Hunston, Ramesh Krishnamurthy, Rosamund Moon, Antoinette Renouf and others. The first dictionary, Collins COBUILD English Language Dictionary (Sinclair 1987b), was followed by a whole series of other dictionaries and grammars, plus associated teaching materials, including Collins COBUILD English Grammar (Sinclair 1990) and Collins COBUILD Grammar Patterns (Francis, Hunston and Manning 1996).


A recent account of the COBUILD project is provided by Moon (2007), one of the senior lexicographers in COBUILD, who worked with the project from the beginning. She analyses why the "new methodology and approach" of the project had such "a catalytic effect on lexicography" (176). When the project started there simply was no "viable lexicographic theory" (177), whereas lexicography is now part of mainstream linguistics. Yet, it was in some ways just too innovative to be a total commercial success. There was a clash between commercial priorities and academic rigour, and the purist approach to examples turned out to be confusing and not entirely right for learners. Only the most advanced learners and language professionals could handle the authentic examples. By 1995 other major British dictionary publishers (Cambridge University Press, Longman and Oxford University Press) had copied the ideas. This was imitation as the sincerest form of flattery, but they subtly changed the attitude to modifying attested corpus examples and made the dictionaries more user-friendly. It remains true however that it is the COBUILD project which developed the lexicographic theory. Many of the principles of corpus compilation and analysis are set out in Looking Up (Sinclair 1987a). The title is of course a play on words: you look things up in dictionaries, and dictionary making is looking up, that is, improving with new data and methods. Many of Sinclair's main ideas are formulated in what is now a modern classic: Corpus Concordance Collocation (Sinclair 1991). The corpus, the concordance and the collocations are chronologically and logically related. First, you need a corpus: a machine-readable text collection, ideally as large as possible. Second, you need concordance software in order to identify patterns. Third, these patterns involve collocations: the regular co-selection of words and other grammatical features. We have had paper concordances since the Middle Ages.
But modern concordance software can search large corpora very fast, re-order the findings, and help to identify variable extended units of meaning. It is difficult to illustrate the power of this idea very briefly, because it depends on the analysis of very large amounts of data. But a simple example is possible. In both the OSTI Report and in a 1999 article Sinclair points out that way is "a very unusual word", and that "the very frequent words need to be ... described in their own terms", since they "play an important role in phraseology" (Sinclair 1999: 157, 159). It doesn't make much sense to ask what the individual word way means, since it all depends on the phraseology: all the way to school, half way through, the other way round, by the way, a possible way of checking, etc. Few of the very common words in the language "have a clear meaning independent of the cotext". Nevertheless, "their frequency makes them dominate all text" (Sinclair 1999: 158, 163).

Here is a fragment of output from some modern concordance software: all the examples of the three words way - long - go co-occurring in a six-million-word corpus.2 The concordance lines were generated by software developed by Martin Warren and Chris Greaves (Cheng, Greaves and Warren 2006), in a project that Sinclair was involved in.

01   added that there was still a  long  way to go in overcoming Stalinist structure
02  ges in 1902 there was still a  long  way to go: A. M. Fairbairn warned Sir Alfre
03  e," he said. There is still a  long  way to go, however, to reach the 1991 high
04  handicapped. There is still a  long  way to go before the majority of teachers 1
05   on peas. But we still have a  long  way to go. [...] we imagine our blizzard rag
06  hem to Church-we still have a  long  way to go to reach our African Church stand
07   other we still have an awful  long  way to go. TRADES UNIONS AND THE EUROPEAN
08  real good [...] Gemma's got a  long  way to go before she gets to eighty You're
09  s demonstrates that we have a  long  way to go before we have true democracy in
10  ice, but I'm afraid we have a  long  way to go before we catch up to the Japanes
11  dications are that there is a  long  way to go before the Algerian problem is fl
12  f small atomic reactors has a  long  way to go before it becomes a commercial pr
13   seventy. I mean he's, he's a  long  way to go. [...]ould you cut me a slice of t
14  aeli occupation, Israel has a  long  way to go to convince anyone that it is ser
15  ichael Caine." But he's got a  long  way to go. "David Who? Never heard of him,"
16  teen. My turn to draw. A long  long  way to go though. Difficult. Well look wher
17  y the earth's gases will go a  long  way toward bolstering or destroying cosmic
18  nd and motion, fountains go a  long  way toward selling themselves in showrooms
19  ntion. Vaccinations also go a  long  way toward eliminating the spread of more v
20   and ready to act, would go a  long  way toward making the 'new world order' mor
21  s father owned, it might go a  long  way toward explaining why she was reluctant
22   these very simple cases go a  long  way towards explaining puzzling features of
23  duction in wastage would go a  long  way to easing the manpower problems. In gen
24  and guineas. That should go a  long  way to easing the strain on an amateur team
25  ll, if voters are loyal, go a  long  way to ensuring the election of one or even
26  s in partnership, we can go a  long  way. [...]ou do believe that?" The expression 1
27  ius scarf, and a boy can go a  long  way with those things. You got a job yet? W
28   Conolly said. "We could go a  long  way on this. I didn't know Major Fitzroy wa
29  Of course we go back an awful  long  way don't we? Yes. Yeah. Are you going to t
30  I go to London. And we go any  damn  way I please, as long as I don't interfere

Figure 1. Six-million-word corpus: all examples of way - long - go
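The general shape of such a concordance display can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of the software by Cheng, Greaves and Warren: the function names (`concordance`, `with_collocates`, `display`), the context width and the toy text are all invented here. The sketch finds every occurrence of a node word, keeps the lines whose context also contains the required collocates, re-orders them by right context, and aligns the node in a column.

```python
def concordance(tokens, node, width=5):
    """Return (left context, node, right context) for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def with_collocates(lines, required):
    """Keep only the lines whose context contains every word in `required`."""
    kept = []
    for left, node, right in lines:
        context = set((left + " " + right).lower().split())
        if required <= context:
            kept.append((left, node, right))
    return kept


def display(lines, left_width=24):
    """Right-align the left context so that the node words form a column."""
    return [f"{left[-left_width:]:>{left_width}}  {node}  {right}"
            for left, node, right in lines]


# Toy text invented for this sketch; a real study would read a corpus file.
text = ("We still have a long way to go . A small change can go a long way "
        "towards fixing it . Which way did they go ?")
tokens = text.split()

hits = with_collocates(concordance(tokens, "way"), {"long", "go"})
hits.sort(key=lambda line: line[2])  # re-order the findings by right context
for row in display(hits):
    print(row)
```

Sorting by the right (or left) context is what brings recurrent sequences together on the page, so that repeated phrasings become visible to the eye.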

Data of this kind make visible the kinds of patterning which occur across long texts, and provide observable evidence of the meaning of extended lexical units. Several things are visible in the concordance lines. They show that way "appears frequently in fixed sequences" (Sinclair 2004: 110), and that the unit of meaning is rarely an individual word. Rather, words are co-selected to form longer non-compositional units. Also, the three words way - long - go tend to occur in still longer sequences, which are not literal, but metaphorical. There are two main units, which have pragmatic meanings: (a) is used in an abstract extended sense to simultaneously encourage hearers about progress in the past and to warn them of efforts still required in the future; (b) is also used in exclusively abstract senses.

(a) there BE still a long way to go before ...
(b) (modal) go a long way to(wards) VERB-ing ...

Now, the word way is just one point of origin for a collocation, and it is shown here with just two collocates. Imagine doing this with say 20,000 different words and all their frequent collocates in a corpus of 500 million words, and you have some small impression of the ambitious range of Sinclair's aim of creating an inventory of the units of meaning in English. In a series of papers from the 1990s onwards (Sinclair 1996, 1998, 2005), he put forward a detailed model of semantic units of a kind which had not previously been described. In these articles, he argued consistently that "the normal carrier of meaning is the phrase" (Sinclair 2005), and that the lack of a convincing theory of phraseology is due to two things: the faulty assumption that the word is the primary unit of meaning, and the misleading separation of lexis and grammar. The model is extremely productive, and many further examples have been discovered by other researchers. It's all to do with observable empirical evidence of meaning, and what texts and corpora can tell us about meaning. The overall finding of this work is that the phraseological tendency in language use is much greater than previously suspected (except perhaps by a few scholars such as Dwight Bolinger, Igor Mel'cuk and Andrew Pawley), and its extent can be quantified.

8. Publications

Sinclair's work was for a long time not as well known as it deserved to be. This was partly his own fault. He often published in obscure places, not always as obscure as community newspapers from the Scottish Highlands, but nevertheless frequently in little known journals and book collections, and it was only post-1990 or so that he began to collect his work into books with leading publishers (Oxford University Press, Routledge, Benjamins). He once told me that he had never published an article in a mainstream refereed journal. I questioned this and cited some counter-examples, which he argued were not genuine counter-examples, since he had not submitted the articles: they had been commissioned. He was always very sceptical of journals and their refereeing and gate-keeping processes, which he thought were driven by fashion rather than by standards of empirical research. He was also particularly proud of the fact that, when he was appointed to his chair in Birmingham, he had no PhD and no formal publications. His first publication was in 1965, the year when he took up his chair: it was an article on stylistics entitled "When is a poem like a sunset?", which was published in a literary journal (Sinclair 1965). It is a short experimental study of the oral poetic tradition which he carried out with students. He got them to read and memorize a ballad ("La Belle Dame Sans Merci" by Keats) and then studied what changes they introduced into their versions when they tried to remember the poem some time later.

His last book Linear Unit Grammar, co-authored with Anna Mauranen, is typical Sinclair (Sinclair and Mauranen 2006). It is based on one of his most fundamental principles: if a grammar cannot handle authentic raw texts of any type whatsoever, then it is of limited value. The book points out that traditional grammars work only on input sentences which have been very considerably cleaned up (or simply invented). Sinclair and Mauranen demonstrate that analysis of raw textual data is possible. On the one hand, the proposals are so simple as to seem absolutely obvious: once someone else has thought of them. On the other hand, they are so innovative that it will take some time before they can be properly evaluated. I will not attempt this here, and just note that the book develops the view that significant units of language in use are multiword chunks. But here, the approach is via a detailed discussion of individual text fragments as opposed to repeated patterns across large text collections. Either way, it is a significant break with mainstream linguistic approaches.

9. In summary

First, Sinclair's work belongs to a long tradition of British empiricism and British and European text and corpus analysis, derived from his own teachers and colleagues (especially J. R. Firth and Michael Halliday), but represented in a broader European tradition (for example by Otto Jespersen) and in a much more restricted American tradition (for example by Charles Fries). This work, based on the careful description of texts, is very different from the largely American tradition of invented introspective data which provided a short interruption to this empirical tradition. As Sinclair pointed out in a characteristically ironic aside, "one does not study all of botany by making artificial flowers" (Sinclair 1991: 6).

Second, the description of meaning has always been at the centre of the British Firth-Halliday-Sinclair tradition of linguistics. It is in Sinclair's work that one finds the most sustained attempt to develop an empirical semantics. As he said in a plenary at the AAAL (American Association for Applied Linguistics), "corpus research, properly focussed, can sharpen perceptions of meaning" (Sinclair 2006b).

Third, he is one of the very few linguists whose work has changed the way we perceive language. In the words of one of his best known observations: "The language looks rather different when you look at a lot of it at once" (Sinclair 1991: 100).

Fourth, Sinclair is one of the very few linguists who have made substantial discoveries. As Wilson (1998: 61) has argued: "The true and final test of a scientific career is how well the following declarative sentence can be completed: He (or she) discovered that ..." Sinclair's work is full of new findings about English, things that people had previously simply not noticed, despite thousands of years of textual study. But then they are only observable with the help of the computer techniques which he helped to invent, and which the rest of us can now use to make further discoveries. These include not only individual phraseological units, but also methods of analysis - how to extract patterns from raw data - and principles: in particular the extent of phraseology in language use.

Sinclair's vision of linguistics was always long-term: "a new view of language and the technology associated with it" (Sinclair 1991: 1). He developed some of his main ideas in the 1960s, and then waited till the technology - and everyone else's ideas - had caught up with him. As he remarked with some satisfaction:

Thirty years ago [in the 1960s] when this research started it was considered impossible to process texts of several million words in length. Twenty years ago [in the 1970s] it was considered marginally possible but lunatic. Ten years ago [in the 1980s] it was considered quite possible but still lunatic. Today [in the 1990s] it is very popular. (Sinclair 1991: 1)

John Sinclair's work has shown how to use empirical evidence to tackle the deepest question in the philosophy of language: the nature of units of meaning.

Like many other people, I owe a very large part of my own academic development to John Sinclair's friendship and inspiring ideas. I knew him for over thirty years: from 1973 when he appointed me to my first academic job (on the second project in discourse analysis) at the University of Birmingham. In October 2007 in Erlangen, he was due to receive his honorary doctorate personally, and then take part in a round table discussion, where he would have responded, courteously but firmly, to our papers; and shown us when we had strayed from his own rigorous standards of empirical research. I was so much looking forward to seeing him again in Erlangen, and to continuing unfinished discussions with him. I will miss him greatly, as will friends and colleagues in many places in the world. But I am very grateful that I had the chance to know him.

Notes

1 Some biographical details are from the English Department website at Birmingham University and from obituaries in The Guardian (3 May 2007), The Scotsman (10 May 2007) and Functions of Language 14 (2) (2007). Special issues of two journals are devoted to papers on Sinclair's work: International Journal of Corpus Linguistics 12 (2) (2007) and International Journal of Lexicography 21 (3) (2008). I am grateful to Susan Hunston and Michaela Mahlberg for comments on a previous version of this paper.
2 The corpus consisted of Brown, LOB, Frown and FLOB plus BNC-baby: five million words of written data and one million words of spoken data.

References

Cheng, Winnie, Chris Greaves and Martin Warren 2006 From n-gram to skip-gram to congram. International Journal of Corpus Linguistics 11 (4): 411-433.
Coulthard, R. Malcolm 1977 An Introduction to Discourse Analysis. London: Longman.
Firth, John Rupert 1935 The technique of semantics. Transactions of the Philological Society 34 (1): 36-72.
Francis, Gill, Susan Hunston and Elizabeth Manning 1996 Collins COBUILD Grammar Patterns. 2 Vols. London: HarperCollins.
Hoover, David 2007 The end of the irrelevant text. DHQ: Digital Humanities Quarterly 1 (2). http://www.digitalhumanities.org/dhq/vol/001/2/index.html, accessed 3 Nov 2007.
Kucera, Henry and W. Nelson Francis 1967 Computational Analysis of Present Day American English. Providence: Brown University Press.
Moon, Rosamund 2007 Sinclair, lexicography and the COBUILD project. International Journal of Corpus Linguistics 12 (2): 159-181.
Sinclair, John McH. 1965 When is a poem like a sunset? A Review of English Literature 6 (2): 76-91.
Sinclair, John McH. 1966 Indescribable English. Inaugural lecture, University of Birmingham. Abstract in Sinclair and Coulthard 1975: 151.
Sinclair, John McH. 1968 A technique of stylistic description. Language and Style 1: 215-242.
Sinclair, John McH. 1971 The integration of language and literature in the English curriculum. Educational Review 23 (3). Page references to reprint in Literary Text and Language Study, Ronald Carter and Deirdre Burton (eds.). London: Arnold, 1982.
Sinclair, John McH. 1973 English for effect. Commonwealth Education Liaison Newsletter 3 (11): 5-7.
Sinclair, John McH. (ed.) 1980 Skills for Learning. Nelson: University of Malaya Press.
Sinclair, John McH. (ed.) 1987a Looking Up. London: Collins.
Sinclair, John McH. (ed.) 1987b Collins COBUILD English Language Dictionary. London: HarperCollins.
Sinclair, John McH. (ed.) 1990 Collins COBUILD English Grammar. London: HarperCollins.
Sinclair, John McH. 1991 Corpus Concordance Collocation. Oxford: Oxford University Press.
Sinclair, John McH. 1996 The search for units of meaning. Textus 9 (1): 75-106.
Sinclair, John McH. 1998 The lexical item. In Contrastive Lexical Semantics, Edda Weigand (ed.), 1-24. Amsterdam: Benjamins.
Sinclair, John McH. 1999 A way with common words. In Out of Corpora, Hilde Hasselgard and Signe Oksefjell (eds.), 157-179. Amsterdam: Rodopi.
Sinclair, John McH. 2004 Trust the Text. London: Routledge.
Sinclair, John McH. 2005 The phrase, the whole phrase and nothing but the phrase: Plenary. Phraseology 2005, Louvain-la-Neuve, October 2005.
Sinclair, John McH. 2006a A language landscape. West Word. January 2006. http://road-to-the-isles.org.uk/westword/jan2006.html, accessed 3 Nov 2007.
Sinclair, John McH. 2006b Small words make big meanings: Plenary. AAAL (American Association for Applied Linguistics).
Sinclair, John McH. and David Brazil 1982 Teacher Talk. Oxford: Oxford University Press.
Sinclair, John McH. and R. Malcolm Coulthard 1975 Towards an Analysis of Discourse. London: Oxford University Press.
Sinclair, John McH. and Anna Mauranen 2006 Linear Unit Grammar. Amsterdam: Benjamins.
Sinclair, John McH., Susan Jones and Robert Daley 2004 English Collocation Studies: The OSTI Report. R. Krishnamurthy (ed.). London: Continuum. Original mimeoed report 1970.
Wilson, Edward O. 1998 Consilience: The Unity of Knowledge. London: Abacus.

Corpora

BNC Baby: The BNC Baby, version 2. 2005. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. URL: http://www.natcorp.ox.ac.uk/.
BROWN: A Standard Corpus of Present-Day Edited American English, for use with Digital Computers (Brown). 1964, 1971, 1979. Compiled by W. N. Francis and H. Kucera. Brown University. Providence, Rhode Island.
FROWN: The Freiburg-Brown Corpus ('Frown') (original version) compiled by Christian Mair, Albert-Ludwigs-Universität Freiburg.
LOB: The LOB Corpus, original version (1970-1978). Compiled by Geoffrey Leech, Lancaster University, Stig Johansson, University of Oslo (project leaders) and Knut Hofland, University of Bergen (head of computing).
FLOB: The Freiburg-LOB Corpus ('F-LOB') (original version) compiled by Christian Mair, Albert-Ludwigs-Universität Freiburg.

Corpus, lexis, discourse: a tribute to John Sinclair

Stig Johansson*

* Sadly, Stig Johansson died in April 2010. The editors of this volume would like to express their thanks to Professor Hilde Hasselgard for taking care of the final version of his paper.

It is an honour to have been asked to give this speech for John Sinclair, pioneer in corpus linguistics, original thinker and a source of inspiration for countless numbers of language students. The use of corpora, or collections of texts, has a venerable tradition in language studies. Many important works have drawn systematically on evidence from texts. To take just two examples, the great grammar by Otto Jespersen was based on collections of several hundred thousand examples. The famous Oxford English Dictionary could use several million examples collected from English texts. There is no doubt that the data collections, or rather the intelligent use of evidence from the collections, contributed greatly to the success of these monumental works. But these data collections had the drawback that the examples had been collected in a more or less impressionistic manner, and there is no way of knowing what had been missed. Working in this way, there is a danger that the attention is drawn to oddities and irregularities and that what is most typical is overlooked. Just as important, the examples were taken out of their context.

When we talk about corpora these days, we think of collections of running text held in electronic form. Given such computer corpora, we can study language in context, both what is typical and what is idiosyncratic. This is where we have an edge on Jespersen and the original editors of the Oxford English Dictionary. With the computational analysis tools which are now available we can observe patterns that are beyond the capacity of ordinary human observation. The compilation and use of electronic corpora started about forty to fifty years ago. At that time, corpora were small by today's standards, and they were difficult to compile and use. There were also influential linguists who rejected corpora, notably Noam Chomsky and his followers. Those who worked with corpora were a small select group. One of them was John Sinclair.

In the course of the last few decades there has been an amazing development, made possible by technological advances but also connected with the foresight and ability of linguists like John Sinclair to see the possibilities of using the new tools for the study of language. We now have vast text collections, numbering several hundred million words, and analysis tools that make it possible to use these large data sources. The number of linguists working with computer corpora has grown from a select few to an ever increasing number, so that Jan Svartvik, another corpus pioneer, could say in the 1990s that "corpora are becoming mainstream" (Svartvik 1996). We also have a new term for the study of language on the basis of computer corpora: corpus linguistics. As far as I know, this term was first used in the early 1980s by another pioneer, Jan Aarts from the University of Nijmegen in Holland (see Aarts and Meijs 1984). Now it has become a household word. A search on the Internet provides over a million hits.

Many people working with corpora probably associate the beginnings of corpus linguistics with Randolph Quirk's Survey of English Usage, a project which started in the late 1950s, but the corpus produced by Quirk and his team was not computerised until much later. What really got the development of computer corpora going was the Brown Corpus, compiled in the early 1960s by W. Nelson Francis and Henry Kucera at Brown University in the United States. The Brown Corpus has been of tremendous importance in setting a pattern for the compilation and use of computer corpora. Not least, it was invaluable that the pioneers gave researchers across the world access to this important data source, which has been used for hundreds of language studies: in lexis, grammar, stylistics, etc.
Around this time John Sinclair was engaged in a corpus project in Britain. The reason why this is less known is probably that the corpus was not made publicly available. We can read about the project in a book published a couple of years ago: John M. Sinclair, Susan Jones and Robert Daley, English Collocation Studies: The OSTI Report, edited by Ramesh Krishnamurthy, including a new interview with John M. Sinclair, conducted by Wolfgang Teubert. The book is significant both because it gives access to the OSTI Report, which had been difficult to get hold of, and because of the interview, which gives insight into the development of John Sinclair's thinking.

The OSTI Report was, as we can read on the title page, the final report to the Office for Scientific and Technical Information (OSTI) on the Lexical Research Project C/LP/08 for the period January 1967 - September 1969, and it was dated January 1970, but the project had started in 1963. There are two things which I find particularly significant in connection with this project. In the first place, it included the compilation of a corpus of conversation, probably the world's first electronic corpus of spoken language compiled for linguistic studies. The corpus was fairly small, about 135,000 words, but considering the difficulties of recording, transcribing and computerising spoken material, this was quite an achievement. In addition, some other material was used for the project, including the Brown Corpus. The most significant aspect of the project was that the focus of the study was on lexis. We should remember that at this time lexis was disregarded, or at least underestimated, by many - perhaps most - linguists, who regarded the lexicon as a marginal part attached to grammar. Schematically, we could represent it in this way:

[Diagram: Lexicon | Grammar]

Perhaps the most enduring contribution of John Sinclair's work is that he has redefined lexis and placed it at the centre of the study of language. This is how he views the relationship between lexis and grammar in a paper published as Sinclair (1999: 8):2

[Diagram: Residual grammar (no independent semantics) | Lexical items]

I will come back later to the notion of lexical item. Let's return to the origin of John Sinclair's thinking on lexis. We find it in the OSTI Report and in a paper with the title "Beginning the study of lexis", published in a collection of papers in memory of his mentor J. R. Firth (Bazell et al. 1966). Firth had stressed the importance of collocations, representing the significant co-occurrence of words. But he did not have the means of exploring this beyond typical examples, such as dark night (Firth 1957: 197). What is done in the OSTI Report is that systematic procedures are devised for defining collocations in the corpus. Here we find notions such as node, collocate and span, which have become familiar later:

A node is an item whose total pattern of co-occurrence with other words is under examination; a collocate is any one of the items which appears with the node within the specified span. (Sinclair, Jones and Daley [1970] 2004: 10)

In the interview with Wolfgang Teubert, John Sinclair reports that the optimal span was calculated to be four words before and four words after the node, and he says that, when this was re-calculated some years ago based on a much larger corpus, they came to almost the same result (Sinclair, Jones and Daley 2004: xix). It was a problem that the corpus was rather small for a systematic study of collocations. In the opening paragraph of the paper I just referred to, John Sinclair says:

[...] if one wishes to study the 'formal' aspects of vocabulary organization, all sorts of problems lie ahead, problems which are not likely to yield to anything less imposing than a very large computer. (Sinclair 1966: 410)

Later in the paper we read that "it is likely that a very large computer will be strained to the utmost to cope with the data" (Sinclair 1966: 428). There was no way of knowing what technological developments lay ahead, and that we would get small computers with an infinitely larger capacity than the large computers at the time this was written. John Sinclair says that he did very little work on corpora in the 1970s (Sinclair, Jones and Daley 2004: xix), frustrated by the laboriousness of using the corpus and by the poor analysis programs which were available.
But he and his team at Birmingham did ground-breaking work on discourse, leading to an important publication on the English used by teachers and pupils (Sinclair and Coulthard 1975). As I have understood it, what was foremost for John Sinclair was his concern with discourse and with studying discourse on the basis of genuine data. We must "trust the text", as he puts it in the title of a recent book (Sinclair 2004). This applies both to the discourse analysis project and to his corpus work. Around 1980 John Sinclair was ready to return to corpus work. We were fortunate to have him as a guest lecturer at the University of Oslo in February 1980, and a year later he attended a conference in Bergen, the

Corpus, lexis, cUscourse: a tribute to John Sinclcur

21

second conference of ICAME, the International Computer Archive of Modern English, as it was called at the time. A spinoff from the conference was the publication of a little book on Computer Corpora in English Language Research (Johansson 1982). The opening contribution was a visionary paper by John Sinclair called "Reflections on computer corpora in English language research" (Sinclair 1982). In just a few pages he outlines a program for corpus studies: he draws attention to the new possibilities of building large corpora using typesetting tapes and optical scanning; he stresses that we need very large corpora to cope with lexis; and this is, I believe, where he first introduces his idea of monitor corpora, large text collections changing with the development of the language. The 1980s represents the breakthrough of the use of corpora in lexical studies. John Sinclair and his team in Birmingham started the building of a large corpus and initiated the COBUILD project (Sinclair 1987) which led to the first corpus-based dictionary: The Collins COBUILD English Language Dictionary. There were a number of innovative features of this dictionary: it was based on fresh evidence from the corpus; the selection of words was corpus-based, and so was the selection and ordering of senses; there were large numbers of examples drawn from the corpus; a great deal of attention was given to collocations; definitions were written in a new way which simultaneously defined the meaning of words and illustrated their collocational patterns, etc. Later dictionaries have not followed suit in all respects, but it is to the credit of the work of John Sinclair and his team that English dictionaries these days cannot do without corpora. Later John Sinclair developed his ideas in a steady stream of conference papers, articles and books. I cannot comment on all of these, but would like to give a couple of illustrations from his work. 
The first is from a paper called "The computer, the corpus, and the theory of language" (Sinclair 1999), the source of the diagram shown above (p. 19). Consider the noun brink. If we examine its collocations, we discover a consistent pattern. I have made a collocation study based on the British National Corpus, which contains a hundred million words:

No.  Word          Total no. in     As          In no. of   Mutual information
                   the whole BNC    collocate   texts       value
 1   teetering               68          15          15     8.734477
 2   teetered                44           9           9     8.658970
 3   poised                 683          10          10     6.022025
 4   starvation             466           6           5     5.893509
 5   hovering               416           5           5     5.824688
 6   extinction             562           5           5     5.523871
 7   bankruptcy             999           7           7     5.285090
 8   collapse              2568          17          17     5.228266
 9   disaster              2837          13          12     4.860382
10   destruction           2360           5           5     4.088956
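The mutual information values in the last column can be illustrated with a short sketch. It assumes the common pointwise definition, the base-2 logarithm of observed over expected co-occurrence. The source does not state which variant or span setting produced the figures above, so the function name, its parameters and the toy frequencies are illustrative only and are not BNC counts.

```python
import math


def mutual_information(joint, freq_node, freq_collocate, corpus_size, span=1):
    """Pointwise mutual information: log2(observed / expected).

    joint          -- co-occurrences of node and collocate within the span
    freq_node      -- corpus frequency of the node word
    freq_collocate -- corpus frequency of the collocate
    corpus_size    -- number of tokens in the corpus
    span           -- number of positions searched around the node (assumed)
    """
    # Expected co-occurrences if the two words were distributed independently.
    expected = freq_node * freq_collocate * span / corpus_size
    return math.log2(joint / expected)


# Toy figures (hypothetical): two words with 1,000 hits each in a
# one-million-word corpus are expected to meet once by chance.
print(mutual_information(joint=4, freq_node=1000, freq_collocate=1000,
                         corpus_size=1_000_000))
```

With these toy figures, four observed co-occurrences against one expected give a score of 2. A rare word such as teetering scores highly because even a handful of co-occurrences far exceeds what chance would predict.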

Here we find some verbs: teetering, teetered, poised, hovering and some nouns denoting disasters: starvation, extinction, bankruptcy, collapse, disaster, destruction. These are the words which most typically co-occur with brink, identified by a measure of co-occurrence called mutual information. The pattern could have been shown more clearly if I had given the lists for left and right contexts separately, but there should be no need to do this for the present purpose. The results agree very well with the findings presented in John Sinclair's article, though he used a different corpus. He summarises the results in this way (Sinclair 1999: 12):

[A] M/I prep the E of D

This is a lexical item. It is used about some actor (A) who is on (I), or is moving towards (M), the edge (E) of something disastrous (D). It has an invariable core, brink, and there are accompanying elements which conform to the formula. By using the item "the speaker or writer is drawing attention to the time and risk factors, and wants to give an urgent warning" (loc. cit.). There is a negative semantic prosody, reflecting the communicative purpose of the item.

Let's take a second example, from a book called Reading Concordances (Sinclair 2003: 141-151). This is a bit more complicated. How do we use the sequence true feelings?

After examining material from his corpus, John Sinclair arrives at the following analysis:

PROSODY (semantics)       reluctance
PREFERENCE (semantics)    expression | possession
COLLIGATION (grammar)     verb | poss. adj.
COLLOCATION               hide, reveal, express | his, their, your
CORE                      true feelings
EXAMPLES                  less open about showing their true feelings
                          you'll be inclined to hide your true feelings

The communicative purpose is to express reluctance. There is a semantic preference for some expression of reluctance plus a reference to the person involved. Colligation defines the grammatical structure. And the collocations show how the forms may vary. The main claim is that text is made up of lexical items of this kind, where there is not a strict separation between form and meaning. Lexis is more than it used to be. Grammar is seen as residual, adjusting the text after the selection of the lexical items.

The preoccupation with lexis does not mean that John Sinclair has neglected grammar. He has also produced grammar books, most recently Linear Unit Grammar, co-authored with Anna Mauranen, a new approach designed to integrate speech and writing (Sinclair and Mauranen 2006).

It is time to sum up. I will start by quoting a passage from the introduction to a festschrift for John Sinclair:

The career of John McH. Sinclair, Professor of Modern English Language at the University of Birmingham, has been characterised by an unending stream of original ideas, sometimes carefully worked out in detail, sometimes casually tossed out in papers in obscure volumes, occasionally developed in large research teams, often passed to a thesis student in the course of conversation. Indeed it is hard to imagine him writing or saying anything derivative or dull, and, while the reader may on occasion be driven to disagree with him, he or she is never tempted to ignore him. (Hoey 1993: v)

The author of these lines, Michael Hoey, professor of English language at the University of Liverpool, is one of many former students of John Sinclair who now hold important academic posts. Another one is Michael Stubbs, professor of English linguistics at the University of Trier. I whole-heartedly agree with Michael Hoey. In my talk I have not been able to do justice to the whole of John Sinclair's contribution to linguistics. He was a multifaceted man. He was concerned both with linguistic theory and its applications, above all in lexicography. The three words I selected for the title of my talk - corpus, lexis, discourse - are keywords in John Sinclair's work, as I see it. There is a remarkable consistency, all the way back to his paper on "Beginning the study of lexis" (Sinclair 1966). By consistency I do not mean stagnation. What is consistent is his way of thinking - original, always developing, yet never letting go of the thought that the proper concern of linguistics is to study how language is actually used and how it functions in communication - through corpora, to lexis and discourse.

Notes

1. This text was prepared for the ceremony in connection with the award of an honorary doctorate to John Sinclair (Erlangen, November 2007). As the audience was expected to be mixed, it includes background information which is well-known among corpus linguists. The text is virtually unchanged as it was prepared early in 2007, before we heard the sad news that John had passed away. The tense forms have not been changed.
2. See also Sinclair (1998).

References

Aarts, Jan and Willem Meijs 1984 Corpus Linguistics: Recent Developments in the Use of Computer Corpora in English Language Research. Amsterdam: Rodopi.
Bazell, Charles Ernest, John Cunnison Catford, Michael Alexander Kirkwood Halliday and Robert H. Robins (eds.) 1966 In Memory of J. R. Firth. London: Longman.
Firth, John Rupert 1957 Papers in Linguistics 1934-1951. London: Oxford University Press.
Hoey, Michael (ed.) 1993 Data, Description, Discourse: Papers on the English Language in Honour of John McH. Sinclair on his Sixtieth Birthday. London: HarperCollins.
Johansson, Stig (ed.) 1982 Computer Corpora in English Language Research. Bergen: Norwegian Computing Centre for the Humanities.
Sinclair, John McH. 1966 Beginning the study of lexis. In In Memory of J. R. Firth, Charles Ernest Bazell, John Cunnison Catford, Michael Alexander Kirkwood Halliday and Robert H. Robins (eds.), 410-430. London: Longman.
Sinclair, John McH. 1982 Reflections on computer corpora in English language research. In Computer Corpora in English Language Research, Stig Johansson (ed.), 1-6. Bergen: Norwegian Computing Centre for the Humanities.
Sinclair, John McH. (ed.) 1987 Looking Up: An Account of the COBUILD Project in Lexical Computing. London: Collins ELT.
Sinclair, John McH. 1998 The lexical item. In Contrastive Lexical Semantics, Edda Weigand (ed.), 1-24. Amsterdam: John Benjamins.
Sinclair, John McH. 1999 The computer, the corpus and the theory of language. In Transiti Linguistici e Culturali. Atti del XVIII Congresso Nazionale dell'A.I.A. (Genova, 30 Settembre - 2 Ottobre 1996), Gabriele Azzaro and Margherita Ulrych (eds.), 1-15. Trieste: E.U.T.
Sinclair, John McH. 2003 Reading Concordances: An Introduction. London: Pearson Education.
Sinclair, John McH. 2004 Trust the Text: Language, Corpus and Discourse. Edited with Ronald Carter. London/New York: Routledge.
Sinclair, John McH. and R. M. Coulthard 1975 Towards an Analysis of Discourse: The English Used by Teachers and Pupils. London: Oxford University Press.
Sinclair, John McH., Susan Jones and Robert Daley 2004 English Collocation Studies: The OSTI Report, Ramesh Krishnamurthy (ed.), including a new interview with John M. Sinclair, conducted by Wolfgang Teubert. London/New York: Continuum. First published in 1970.
Sinclair, John McH. and Anna Mauranen 2006 Linear Unit Grammar. Amsterdam: John Benjamins.
Svartvik, Jan 1996 Corpora are becoming mainstream. In Using Corpora for Language Research, Jenny Thomas and Mick Short (eds.), 3-13. London/New York: Longman.

Corpus

BNC  The British National Corpus. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk/.

Choosing sandy beaches - collocations, probabemes and the idiom principle

Thomas Herbst

... a language user has available to him or her a large number of semipreconstructed phrases that constitute single choices (Sinclair 1991: 110)

... patterns of co-selection among words, which are much stronger than any description has yet allowed for, have a direct connection with meaning. (Sinclair 2004: 133)

... a text is a unique deployment of meaningful units, and its particular meaning is not adequately accounted for by any organized concatenation of the fixed meanings of each unit. (Sinclair 2004: 134)

1. The idiom principle

In a volume compiled in honour of John Sinclair, there is no need to explain or to defend these positions, which, after all, are central to his approach. In this article, I would like to raise a few questions concerning the notions of single choice and meaningful units and relate them to approaches addressing similar issues that have gained increasing popularity recently. Single choice can be interpreted to mean that several words - or, as will be argued below, a word and a particular grammatical construction - are chosen simultaneously by a speaker to express a particular meaning.2 This presumably implies that this meaning (or something resembling this meaning) exists before it is being expressed by the speaker. In the case of the following sentence (taken from a novel by David Lodge)

(1) The winter term at Rummidge was of ten weeks' duration ...

one could argue that the choice of the word duration entails a choice of the preposition of and the use of the verb be and that this single choice expresses a particular meaning or perhaps a "conceptual unit". The same - or a very similar meaning - could be expressed using the verb last

(1a) The winter term at Rummidge lasted for ten weeks.

Be + of + duration can thus be seen as a "simultaneous choice of ... words", which is a fundamental component of Sinclair's (1991: 110) idiom principle. By formulating the idiom principle, Sinclair has given a theoretical perspective to the insights into the extent of recurrent co-occurrences of words in texts that has been brought to light by large-scale corpus analyses which had become possible through access to large computerized corpora such as the COBUILD corpus.3 This has opened up a new view on the phraseological element in language by making it central to language description and no longer regarding it as peripheral, marginal or somehow special, as was the case in generative grammar and, in a different way, also in traditional phraseology.4 At the same time, the focus has shifted away from true idioms, proverbs etc. to other types of phraseological chunks.

The distinction between the open-choice principle and the idiom principle has had considerable impact on recent linguistic research. One of the reasons for this may be that it is the outcome of empirical corpus analysis and thus has been arrived at on the basis of the analysis of language use, which distinguishes it from the inductive approach taken, for example, by Chomsky. A second reason can be found in the fact that the notion of the idiom principle is attractive to foreign language linguistics because it can serve to explain what is wrong with particular instances of learner language or translated text (Hausmann 1984; Granger 1998, this volume; Gilquin 2007; Nesselhauf 2005; Herbst 1996 and 2007). The emphasis put on aspects of language such as collocation or valency phenomena in this area goes hand in hand with a focus on specific lexical properties in terms of the idiom principle. This in turn provides an interesting parallel to some approaches within the framework of cognitive linguistics, especially construction grammar.
In a way, one could say that what all these approaches have in common is that they take the phenomenon of what one could call irregularity rather seriously by saying that idiosyncratic properties of lexical items, especially the tendency to occur together with certain other linguistic units, are a central phenomenon of language and not something that could be relegated to the appendix of an otherwise neatly organized grammar or to the periphery of linguistic theory.5 This is accompanied by the recognition of units larger than the individual word. The parallels between different approaches even show in the phrasing: thus Sinclair (1991: 110) speaks of "semipreconstructed phrases" and Hausmann (1984: 398) of "Halbfertigprodukte der Sprache". Nevertheless, it must not be overlooked that the motivation for the interest in such units is slightly different:

- corpus linguistics provides the ideal tool for identifying recurrent combinations of words, in particular the occurrence of n-tuples;
- foreign language linguistics looks at chunks from the point of view of production difficulty caused by unpredictability;
- construction grammar and related approaches are interested in chunks in terms of form-meaning pairings,6 investigate the role chunks play in the process of first language acquisition and look at the way one could imagine chunks to be represented in the mental lexicon.

These differences in approach may also result in differences with respect to the concepts developed within the various frameworks. Thus it is worth noting that Franz Josef Hausmann (2004: 320), one of the leading collocation specialists of the foreign language linguistics orientation, recognizes the value of the research on collocation in corpus linguistics but at the same time speaks of "terminological war" and suggests that corpus linguists should find a different term for what they commonly refer to as collocation. Similarly, despite obvious points of contact between at least some conclusions arrived at by Sinclair and some concepts of construction grammar, "the main protagonists appear to see the similarity between the approaches as merely superficial", as Stubbs (2009: 27) puts it. It may thus be worthwhile discussing central units identified in these different frameworks with respect to criteria such as

- delimitation and size;
- semantic compositionality and semantic autonomy of the component parts, and, related to that,
- predictability for the foreign learner.

2. Collocations - compounds - concepts

2.1. Unpredictability and single choice

The different uses of the term collocation referred to by Hausmann focus on two types of combination - the sandy beaches and the false teeth type (Herbst 1996). Sandy beaches is a typical example of the kind of collocation identified by corpus research because it is significant as a combination on the basis of the frequency with which the two words co-occur in the language, i.e. statistically significant. False teeth, on the other hand, represents the type of collocation that foreign language linguistics has focussed on because the combination of the two items is unpredictable for a foreign learner of the language, i.e. semantically significant. This distinction between a statistically-oriented and a significance-oriented approach to collocation is also made for example by Nadja Nesselhauf (2005: 12), who distinguishes between a frequency-based and a phraseological approach, or Dirk Siepmann (2005: 411), who uses the terms frequency-based and semantically-based.7

Collocations such as false teeth have been called "encoding idioms" (a term used, for example, by Makkai (1972) or by Croft and Cruse (2004: 250)), which refers to the fact that they can be easily interpreted but there is no way of knowing that the established way of expressing this meaning is false teeth and not artificial teeth (as opposed to artificial hip versus ?false hip). Similarly, foreign learners of English have no way of predicting combinations such as lay the table or strong tea. Hausmann uses semantic considerations to distinguish between the two components of such combinations as the base, which is semantically autonomous ("semantisch autonom", Hausmann 1984: 401), and the collocate, which cannot be defined, learnt or translated without the base (Hausmann 2007: 218).8 This is particularly relevant to foreign language teaching and lexicography - at least in dictionaries which aim to be production dictionaries for foreign learners. The distinction between Basis (base) and Kollokator (collocate) introduced by Hausmann (1984: 401) forms the basis for an adequate lexicographical treatment of such collocations: the foreign learner looking for adjectives to qualify tea or gale will have to find weak and strong under tea and light and strong under gale. The unpredictability of such combinations arises from the fact that out of a range of possible collocates for a base, only one or several - but not all - can be regarded as established in language use.
Thus, heavy applies to rain, rainfall, storm, gale, smoker or drinking but not to tea, coffee or taste, whereas strong can be used for wind, gale, tea, coffee and taste but not storm, rain, rainfall, smoker or drinking etc. Interestingly, the semantic contribution of these collocates can be characterized in terms of degree of intensity in some way or another.9

BNC        rain  rainfall  wind  storm  gale  smoker  drinking  tea  coffee  taste
heavy       254        21     5      5     1      47        43    0       0      0
strong        0         0   223      0     4       0         0   28      19      3
severe        2         0     1     15    10       0         0    0       0      0
light        47         1    61      0     0       1         0    2       0      2
slight        0         0     3      0     0       0         0    0       0      2
moderate      0         5     8      0     3       2        11    0       0      1
weak          0         0     0      0     0       0         0   16       3      0

What this table illustrates, however, is the limited combinability of certain adjectives (or adjective meanings) with certain nouns, which provides (at least when we consider established usage) an interesting case of restrictions on the operation of the open-choice principle.10 This sort of situation finds a direct parallel in the area of valency, where, while certain generalizations with respect to argument structure constructions are certainly possible (Goldberg 1995, 2006), restrictions on the co-occurrence of particular valency carriers with particular valency patterns will also have to be accounted for (Herbst 2009; Herbst and Uhrig 2009; Faulhaber 2011):11

valency pattern

consider

judge

call

count

NPV„ t NPNP

+

+

+

+

NPV„ t NPasNP

+

+

NPV„,ofNPasNP

+

regard

think

see

+ +

+ +

etc.

Thus if one takes the definition of manage provided by the Cobuild English Language Dictionary (mi)

(2) If you manage to do something, you succeed in doing it.

one could argue that the choice of a particular valency carrier such as manage or succeed entails a simultaneous choice of a particular valency pattern (and obviously the fact that manage combines with a [to INF]-complement and succeed with an [in V-ing]-complement must be stored in the mind and is unpredictable for the foreign learner of the language). If we consider heavy and strong or the valency patterns listed (and the list of patterns and verbs could be expanded to result in an even more complex picture), then we are confronted with important combinatorial properties. Methodologically, it is not easy to decide whether (or in which cases) these should be described as restrictions or merely as preferences: while the BNC does not show any occurrences of strong + storm/s or strong + rainfall/s (span ±5), for example, it does contain 5 instances of heavy wind/s (and artificial teeth). Similarly, extremely rare occurrences of a verb in a pattern in which it does not "normally" occur such as

(3) ... they regard management a very important ingredient within their strategy

cannot be regarded as a sufficient reason to see this as an established use and to make this pattern part of the valency description of that verb. Nevertheless, acceptability judgments are highly problematic in this area, which is why it may be preferable to speak of established uses, which however means that frequency of occurrence is taken into account. In any case, the observation that the collocational and colligational properties of words display a high degree of idiosyncrasy is highly compatible with usage-based models of cognitive linguistics (e.g. Tomasello 2003; Lieven forthcoming; Behrens 2007; or Bybee 2007).12

Although cases such as heavy rain or heavy drinking represent an argument in favour of storage, it is not necessarily the case that we have to talk about a single choice. In some cases it is quite plausible to assume, as Hausmann (1985: 119, 2007: 218) does, that the base is chosen first and then a second choice limited by the collocational options of that base takes place. This choice, of course, can be relatively limited as in the case of Hausmann's (1984: 401-402, 2007: 218) examples schütteres Haar and confirmed bachelor/eingefleischter Junggeselle or, on a slightly different line, physical attack, scientific experiment and full enquiry discussed by Sinclair (2004: 21). In other cases, however, accounting for a collocation in terms of a single choice option may be more plausible. This applies to examples such as white wine or red wine, also commented on by Sinclair (2004). These show great similarity to compounds but they allow interpolation

(4) white Franconian wine

and predicative uses

(5) Wines can be red, white or rosé, and still or sparkling - where the natural carbon dioxide from fermentation is trapped in the wine.

and also, of course, uses such as

(6) 'I always know when I'm in England,' said Morris Zapp, as Philip Swallow went off, 'because when you go to a party, the first thing anyone says to you is, "Red or white?"'

Compared with the heavy rain examples, white wine presents a much stronger case for conceptualization - partly because the meaning associated with the adjective is much more item-specific and more complex in its semantics than mere intensification. To what extent white wine (which, as everybody is aware, is not wine with milk) is a description of wine or a classification of wine is difficult to decide on linguistic grounds. It thus seems that amongst semantically significant collocations one can distinguish - at least prototypically - between collocations which like white wine represent a unified concept and thus can be described in terms of a single choice and collocations such as heavy rain or strong wind where the relationship of the two components can be seen as that of a base and a modifying collocate. In both cases, however, there is an element of storage involved.

2.2. Sandy beaches and high tide - concepts?

In this respect, the case of sandy beaches seems to be quite different. One could argue that the mere fact that beaches and sandy co-occur with a log-likelihood value of 3089.83 ({beach/N} ±3) has to do with the fact that beaches are often sandy and therefore tend to be described with the adjective sandy. In this respect, sandy beaches can be compared to the collocates of winds, where the fact that the BNC contains 22 instances of westerly wind/winds, 12 of south-westerly wind/winds and only 2 of southerly wind/winds ({wind/N} -1) is a reflection of the facts of the world discussed in the texts of the corpus rather than of the language. Sandy beaches could then be analysed as a free combination - the type that Hausmann (1984: 399-400) refers to as a "Ko-Kreation" - and this could be taken as an argument for not including sandy beaches in a dictionary - at least not as a significant combination, although perhaps, like in Cobuild1 and LDOCE5,13 as an example of a rather typical use. On the other hand, in German the meaning of sandy beaches is usually expressed by a compound - Sandstrände. The question to be asked is whether it is realistic to imagine that the same meaning or concept is realized by a free combination in one language and by a compound in another, in other words, whether the meaning of the German lexicalized compound can be seen as the same as that of the collocation in English.
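The log-likelihood value quoted in the discussion of sandy beaches can be illustrated with a sketch of the statistic as it is commonly computed for collocations, namely the G2 score over a 2x2 contingency table. The text gives neither the formula nor the underlying counts, so the function below, its parameter names and the toy numbers are assumptions for illustration only and do not reproduce the value 3089.83.

```python
import math


def log_likelihood(o11, o12, o21, o22):
    """G2 statistic for a 2x2 contingency table of co-occurrence counts.

    o11 -- node with collocate        o12 -- node without collocate
    o21 -- collocate without node     o22 -- neither word
    """
    observed = ((o11, o12), (o21, o22))
    n = o11 + o12 + o21 + o22
    row_totals = (o11 + o12, o21 + o22)
    col_totals = (o11 + o21, o12 + o22)

    total = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of the two words.
            expected = row_totals[i] * col_totals[j] / n
            if observed[i][j] > 0:  # a zero cell contributes nothing
                total += observed[i][j] * math.log(observed[i][j] / expected)
    return 2 * total


# Toy counts (hypothetical): the two words co-occur far more often than
# their separate frequencies would predict, so the score is well above zero.
print(log_likelihood(o11=30, o12=70, o21=20, o22=9880))
```

A score of zero means the observed counts match what independence predicts. Unlike mutual information, the measure also grows with the amount of evidence, which is one reason why frequent but unremarkable pairs such as sandy beaches can receive very high values.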

Furthermore, sandy beaches tend to come as a relatively fixed chunk. Of the 258 occurrences14 of sandy and beach ({beach/N} ±3) in the BNC, there is only one predicative use

(7) The main beach is sandy and normally sheltered but an easterly storm will bring up masses of seaweed which occasionally has to be carted away.

and only 7 uses of the kind:

(8) They surround a 45-acre lake which is bordered with sandy white beaches, seven swimming pools, children's playground and pool, and lots of shops and restaurants.
(9) Our day cruises visit sandy and colourful beaches ...

This finds an interesting parallel in high tide and low tide. Although these are listed in dictionaries, which points at compound status, they do not carry stress on the first component and like sandy beach they also allow predicative use:

(10) Before then it was served with a ford, and a ferry when the tide was high.

One point to be considered in this context is that the tide is high is not necessarily synonymous with high tide.

(11) By the time they entered the harbour it was high tide and the launch, with the Wheel riding her stern, lay almost level with the quay.

If you get high tide in the sense of the German Hochwasser at, say, 5 p.m., then the tide can presumably be called high between 3 and 7 or even between 2 and 8. In the same way it seems appropriate to state a difference in meaning between the predicative and the attributive uses of sandy in (12) and (13):

(12) The beaches on the North Coast of Cornwall are sandy.
(13) There are many sandy beaches on the North Coast of Cornwall.

Irrespective of whether in the light of these facts sandy beaches should be analyzed as a collocation or a compound, I would argue that at least from the foreign learner's point of view, the combination has to be accounted for. For the foreign learner, it is by no means obvious why Sandbank, Sandburg, Sandkasten, Sandsturm or Sandmann, Sandpapier and Sandstein could be translated by equivalent English compounds with sand as their first element, but not Sandstrand. From this point of view, the fact that sandy beach is the equivalent of Sandstrand makes it an encoding idiom, or, if you like, a Hausmann-type of collocation in the same way as weak tea or guilty conscience. Even from the L1 perspective it seems rather idiosyncratic that there are sandcastles but not sandy castles and that sandy banks exist alongside (in the language, not geographically speaking) sandbanks but with a different meaning.

To complicate matters further, there is the question of sand beach. Sand beach is not normally listed in dictionaries (with the exception of the OED where it is given under "general combinations"), three of four native speakers consulted said it did not exist and one pointed out a technical geological sense; the BNC contains 23 instances of sand beach/es in comparison with 249 for sandy ({beach/N} -1).15 In German, the situation seems to be the opposite - both Sandstrand and sandige Strände can be found:

(14) Sylt verfügt über 38,3 km Sandstrand mit mehr als 13.000 Strandkörben16
(15) Wale benutzen ein Sonarsystem zur Orientierung und können verwirrt werden, wenn sie in die Nähe sandiger Strände kommen.

The DeReKo corpus of the Institut für Deutsche Sprache (W-öffentlich) yields 5,247 instances of the different forms of Sandstrand (≈ 2.01 ipm) as opposed to 30 of those of sandige Strände (≈ 0.01 ipm), which can be taken as an indication of the fact that Sandstrand is the established way of referring to a sandy beach. The fact that a Google search produced several thousand instances of the latter can perhaps be explained by assuming that many of the texts in which these forms occur have been translated from or modelled on English, and indeed they often seem to refer to non-German beaches. Although this is by far outnumbered by millions of Sandstrände (and the corresponding morphological forms) in Google, sandige Strände seems to be acceptable in German and often as a synonym of Sandstrand - something also suggested by the Duden Universalwörterbuch's (2001) definition of Sandstrand as "sandiger Strand". Irrespective of the question of whether there is a difference in meaning between the different uses of Sandstrand and sandiger Strand in German and sand beach and sandy beach or whether such a difference in meaning is always intended by the speakers or perceived in the interpretation of a text,17 it is probably fair to say that the vast majority of uses of Sandstrand in German corresponds to the vast majority of uses of sandy beach in English. This, however, raises the question of whether the fact that we have a compound in one language and a collocation in the other can or must be taken to mean that we also have different representations at the cognitive level.
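The normalised frequencies given above for DeReKo (instances per million words, ipm) follow from simple arithmetic, sketched below. The size of the W-öffentlich archive is not stated in the text; the figure used here is merely back-calculated from the reported 5,247 hits at roughly 2.01 ipm and should be read as an approximation, and the function name is illustrative.

```python
def instances_per_million(hits, corpus_tokens):
    """Normalise a raw frequency to instances per million tokens (ipm)."""
    return hits / corpus_tokens * 1_000_000


# Corpus size implied by the reported figures (an inference, not a quoted
# value): 5,247 hits at about 2.01 ipm correspond to roughly 2.6 billion tokens.
implied_tokens = 5247 / 2.01 * 1_000_000

for label, hits in [("Sandstrand", 5247), ("sandige Strände", 30)]:
    print(f"{label}: {instances_per_million(hits, implied_tokens):.2f} ipm")
```

Rounded to two decimals, the two results agree with the values reported for the compound and the adjective-noun combination. Normalising in this way is what makes the counts comparable with frequencies from corpora of a different size, such as the BNC.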

It would be tempting to associate unified concepts of meaning with individual lexical items - either single words or compounds - and to see collocations such as sandy beaches as composite units, both at the formal and semantic levels. On the one hand, this is underscored by the fact that in the case of sandy and beaches, predicative uses or even superlative forms such as sandiest beaches can be found.18 On the other hand, Schmid (2011: 132) also says:19

From a cognitive perspective we can say that compounds represent new conceptual forms that are stored as integrated units in the mental lexicon, as opposed to syntactic groups, which are put together during on-going processing of individual concepts in actual language use as the need arises. While theoretically this criterion is without doubt the most important one, in practice it is exceedingly difficult to implement, not least because there are competing linguistic units that also consist of several words and are stored as gestalts, namely fixed expressions and phraseological units. These include not just classical idioms such as to bite the dust ('to die') and to eat humble pie ('to back down/give in'), which are almost certainly stored as units in the mental lexicon, but also phrasal verbs such as to get up, to walk out and many others.

It seems unlikely that in a situation in which speakers want to describe a beach more closely, the process that happens in German when the compound Sandstrand is chosen is much different from that taking place in English when sandy beaches is chosen. In fact, it could be argued that collocations of this type are subject to the same or at least similar mechanisms of entrenchment as compounds.20 There are obvious parallels between compounds and collocations with respect to the criterion of unpredictability or idiomaticity: there does not seem to be any great difference between compounds such as lighthouse or Leuchtturm and collocations such as lay the table or set the table in this respect, or, for that matter, if cross-linguistic evidence is legitimate in this kind of argument, between Sandstrand and sandy beach.

These examples show that the identification of discrete units of meaning is by no means unproblematic. It is relatively obvious that combinations such as low tide and Niedrigwasser are units that express a particular concept, but it could be argued that the meaning of the tide is high also presents a semantic unit or concept in the sense of an identifiable state of affairs or situation. This concept can be expressed in a number of different ways, but one could not argue that these were predictable in any way. Why is the tide high or in but not up? Why is the tide going out but not decreasing? It seems that, again, the co-occurrence of particular lexical items creates a particular meaning, and as such they can be considered a single choice. However, the fact that high tide and the tide is high do not represent identical concepts shows that we have to consider not only the individual words that make up an "extended unit of meaning", to use Sinclair's (2004: 24) term, but that the meaning of the construction must also be taken into account, where one could speculate that in the case of adjective-noun combinations there is a gradient from compounds to attributive collocations (such as sandy beach, confirmed bachelor, high tide) to predicative uses (the tide is high) as far as the concreteness and stability of the concepts is concerned.

Even if there are no clear-cut criteria for the identification of concepts or semantic units, it has become clear that they do not coincide with the classification of collocations on the basis of the criteria of semantic or statistical significance. While semantically significant collocations such as guilty conscience or white wine can be seen as representing concepts, this is not necessarily the case in the same way with other such collocations like heavy rain or heavy smoker, for instance. Similarly, the fact that a statistically significant collocation such as sandy beach can be seen as representing a concept does not necessarily mean that all frequent word combinations represent concepts in this way. It must be doubted, for instance, whether the fact that the most frequent collocate of the verbs buy and sell in the BNC is house should be taken as evidence for claiming that "buying a house" has concept status for native speakers of English. This means that traditional distinctions such as the ones between different types of collocation or between collocations and compounds are not necessarily particularly helpful when it comes to identifying single choices or semantic units in Sinclair's sense.

2.3.

The scope of unpredictability and the open choice paradox

The case of sandy beaches shows how scope and perspective of linguistic analysis influence its outcome. If one studies combinations of two words, as Hausmann (1984) does, then weak tea and heavy storm will be classified as semantically significant collocations since tea and storm do not combine with semantically similar adjectives such as feeble (tea) or strong (storm). On the other hand, one could argue that the uses of the adjectives sandy and sandig follow the open-choice principle: sandy occurs with nouns such as beach, heath, soil etc.; sandig with nouns such as Boden, but also with Schuhe, Strümpfe etc. If one identifies two senses of sandig in German - one meaning 'consisting of sand', one 'being covered by sand' - then of course all of the uses are perfectly "regular". Seen in this light, such combinations can be attributed to the principle of open choice, which Sinclair - at least in (1991: 109) - believed to be necessary alongside the idiom principle "in order to explain the way in which meaning arises from language text". The open-choice principle can be said to operate whenever there are no restrictions in the combinations of particular lexical items. However, as demonstrated above, the fact that sandy beach is frequently used in English (because sand beach seems to be restricted to technical language) where Sandstrand is used in German means that - at least when we talk about established language use - the choice is not an entirely open one.

In any case, open choice must not be seen as being identical with predictability. For example, verbs such as buy, sell, propose or object seem to present good examples of the principle of open choice since there do not seem to be any restrictions concerning the lexical items that can occur as the second valency complement of the verbs. The fact that shares, house and goods are the most frequent noun collocates of buy and of sell in the BNC (span ±3)21 can be seen as a reflection of the facts of the world (or of the world discussed in the texts of the BNC) but one would hardly regard this as a sufficient reason to consider these as phraseological or conceptual units. So there is open choice in the language, but how speakers know that (and when) this is the case, is a slightly different matter.
In a way this is a slightly paradoxical situation in that speakers will only produce combinations such as buy shares or buy a book (a) because they have positive evidence that the respective meanings can be expressed in this way and/or (b) because they lack evidence that this is not the case - facts that cognitive grammar would account for in terms of entrenchment or pre-emption.22 This open choice paradox can be taken as evidence for the immense role of storage even in cases where the meaning of an extended unit of meaning can be analysed as being entirely compositional. A further complication about deciding whether a particular combination of words can be attributed to the open-choice or the idiom principle is due to the unavoidable element of circularity caused by the fact that this decision is based on our analysis of the meanings of the component parts of the combination, which in turn however is based on the combinations in which these words occur.
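The collocate counts within a span of ±3 referred to in this section can be illustrated with a minimal sketch. The function name `collocates`, the toy sentence and the whitespace tokenization are assumptions made purely for illustration; they do not reproduce the BNC query or the lemmatization behind the figures quoted above.

```python
from collections import Counter


def collocates(tokens, node, span=3):
    """Count the words found within +/- span tokens of each occurrence of node."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` tokens before the node
        right = tokens[i + 1:i + 1 + span]     # up to `span` tokens after the node
        counts.update(left + right)
    return counts


# Hypothetical toy text, not corpus data.
toy = "they want to buy a house and then buy shares in a firm".split()
result = collocates(toy, "buy", span=3)
```

Raw counts of this kind only show which words occur near the node word. As argued above, they are not in themselves evidence that a combination such as buy + house is a phraseological or conceptual unit.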

3. Probabemes

If we consider sandy beaches in this light, then it appears as a free combination because there is no element of semantic unpredictability involved. However, if one investigates how a particular concept (or meaning) is expressed in a language, we can observe an element of unpredictability caused by the fact that the concept of sandy beaches is expressed by a combination of two words and not by a one-word compound in English, which provides further evidence for the impact of the idiom principle. It seems worthwhile to pursue this line of investigation further and extend the analysis beyond formally defined types of combination. So far, the study of collocations and other phraseological units has concentrated on analysing certain formally defined types of combination. It is to be expected that the scope of idiomaticity to be found in language will be increased considerably if we take an onomasiological approach and study the ways in which different concepts are realized in different languages. Despite the problems concerned with questions such as synonymy or near-synonymy and the representativeness of corpora, it will be argued here that it may be rewarding to combine an onomasiological analysis of this kind with statistical corpus analysis. For instance, if we take such words as year, half and three-quarters, one would hardly doubt that there are equivalents in German which express the same meanings as the English words, namely Jahr, halb and dreiviertel. However, if we combine these meanings, we find that there is a considerable amount of idiomatization involved: in German, there is ein halbes Jahr and ein dreiviertel Jahr, which could be seen as free combinations (Langenscheidt Collins Großwörterbuch Englisch 2004 gives Dreivierteljahr as a compound, Duden Deutsches Universalwörterbuch 2001 does not), in English, there is only half a year but not three quarters of a year.
Furthermore, the BNC yields some 4000 instances of six months (or 6 months), but only 46 of half a year, which suggests that the usual way of referring to a period of some 180 days in English is six months rather than half a year.23 In bilingual dictionaries, this is indicated by the fact that in the entry of Jahr, the unit halbes Jahr is included with an extra translation.

It is in this context that the concept of probabeme (Herbst and Klotz 2003) may turn out to be useful. If one combines the insights gained by traditional phraseology and corpus linguistics into the importance of multiword units with an onomasiological approach, then we have to look for all possible formal expressions of a particular meaning in a language irrespective of the fact whether these expressions take the form of one word or several words. The term probabeme can then be used to refer to units such as six months, i.e. the (most) likely (or established) verbalizations of a particular meaning in a language. If we talk about the idiom principle, we talk about what is established, usual in a speech community, and thus identifying probabemes is part of the description of the idiom principle. This aspect of idiomaticity was highlighted by Pawley and Syder (1983: 196), who point out that utterances of the type

(16) I desire you to become married to me
(17) Your marrying me is desired by me

"might not achieve the desired response". More recently, Adele Goldberg (2006: 13) observes that "it's much more idiomatic to say

(18) I like lima beans

than it would be to say

(19) Lima beans please me."

These examples show that if we follow the frame-semantics approach outlined by Fillmore (1977: 16-18),25 which includes such factors as perspectivization, choice of verb and choice of construction, the idea of a single choice may apply to larger units than the ones discussed so far. A further case in point concerning probabemes is presented by the equivalent of combinations such as wir drei oder die drei, which in English tends to be the three of us (BNC: 137) and the three of them (BNC: 189) rather than we three (BNC: 22) or they three (BNC: 4). The three of us is a very good example of how difficult it is to account for the recurrent chunks in the language. What we have here is a kind of construction which could not really be called item-specific since it can be described in very general terms: the + numeral + of + personal pronoun. Again, bilingual dictionaries seem more explicit than monolingual dictionaries. Langenscheidt's Power Dictionary (1997) and Langenscheidt Collins Großwörterbuch Englisch (2004) give the three of us under wir, but no equivalent under ihr or die.

4. Meaning-carrying units

The types of formal realizations to be considered in this context should not be restricted in any way. The point of an onomasiological approach as suggested here is not only to include "extended units of meaning" that consist of several words but not to treat them any differently from single words.26 Units of meaning in this sense include elements that traditionally could be classified as single words (beach), compounds (lighthouse), collocations (sandy beaches, weak tea, set the table) or units such as bear resemblance to or be of duration. Basically, all items that could be regarded as constructions (or at least item-based constructions) in the sense of form-meaning pairings should be included in such an approach.27 There may be relatively little point in attempting a classification of all the types of meaning-carrying units to be identified in the language (or a language) in that little seems to be gained by expanding the lists of phraseological units identified so far (Granger and Paquot 2008: 43-44; or Gläser 1990), especially since many of the units observed do not neatly fall into any category. In the light of the range of phraseological units identified it may rather be necessary to radically rethink commonly established principles and categories in syntactic analysis.
Thus Sinclair (1991: 110-111) points out that the "of in of course is not the preposition of that is found in grammar books" and likewise Fillmore, Kay and O'Connor (1988: 538) ask whether we have "the right to describe the the" in the the Xer the Yer construction "as the definite article".28 Similarly, from a semantic point of view it seems counterintuitive to analyse number and deal as heads of noun phrases in cases such as

(21) a number of novels
(22) a great deal of time

where one could also argue for an analysis in terms of complex determiners (Herbst and Schuller 2008: 73-74).29 A further case in point is presented by examples such as

(23) The possibilities, I suppose, are almost endless

where an analysis which takes suppose as the governing verb and the clause as a valency complement of the verb is not entirely convincing (which is why such cases are given a special status in the Valency Dictionary of English, for instance). In fact, there are good arguments for treating I suppose as a phraseological chunk that has the same function as an adverb such as presumably and can occur in the same positions of the sentence as adverbials with the same pragmatic function. It has to be said, however, that the precise nature of the construction is more complex: it is characterized by the quasi-monovalent use of the verb (or a particular class of verbs comprising, for example, suppose, assume, know or explain) under particular contextual or structural conditions, and at the same time one has to say that the subject of the verb need not be a first person pronoun as in (23), although statistically it very often is:

(24) Philip Larkin, one has to assume, was joking when he said that sexual intercourse began in 1962.

The precise identification and delimitation of such units raises a number of problems, however, both at the levels of form and meaning. In the case of example (1), for instance, one might ask whether the unit to be identified is [of duration] or [be@ of duration] or [be@ of 'time span' duration], where be@ stands for different forms of the verb be and 'time span' indicates a slot to be filled by expressions such as ten weeks or short. One might also argue in favour of a compositional account and not see be as part of the unit at all. At the semantic level, one would have to ask to what extent we are justified in treating expressions such as [be@ of 'time span' duration] or [last 'time span'] as alternative expressions of the same meaning or not. In his description of the idiom principle, Sinclair (1991: 111-112) himself points at the indeterminate nature of such "phrases" by pointing out their "indeterminate extent", "internal lexical" and "syntactic variation" etc.30 Nevertheless, the question of to what extent formal and semantic criteria coincide in the delimitation of chunks deserves further investigation, in particular with respect to the notions of predictability and storage relevant in foreign language linguistics and cognitive linguistics.

5. Creativity and open choice

There can be no doubt that in the last thirty or forty years, "the analysis of language has developed out of all recognition", as John Sinclair wrote in 1991, and, indeed, the "availability of data" (Sinclair 1991: 1) has contributed enormously to this breakthrough. On the other hand, the use and the interpretation of the data available is easier in some areas than in others. For instance, we can now find out from corpora such facts as that there is a very similar number of occurrences of the verbs suppose (11,493) and assume (10,956) and that 66 % of the uses of suppose (7,606) are first person, but only 7 % (773) of those of assume; the verb agree (span ±1) shows 113 co-occurrences with entirely, 31 with fully, 27 with wholeheartedly (or whole-heartedly), 25 with totally, 20 with completely and 9 with wholly; and that 88 % of the entirely agree-cases are first person singular, but only 52 % of the fully agree-cases, compared with only 16 % of all uses of agree.31

With respect to the relevance of these data, one will probably have to say that some of these findings can be explained in terms of general principles of communication such as the fact that one tends to ask people questions of the type

(25) Do you agree?

rather than

(26) Do you agree entirely?

Like the statistically significant co-occurrence of the verb buy with particular nouns such as house or the predominance of westerly winds in the British National Corpus, such corpus data can be seen as a reflection of certain types of human behaviour or facts of the world described in the corpus analysed. Although Sinclair (1991: 110) mentions "the recurrence of similar situations in human affairs" in his discussion of the idiom principle, co-occurrences of this type may be of more relevance with respect to psycholinguistic phenomena concerning the availability of certain prefabricated items to a speaker than to the analysis of a language as such. However, the fact that I suppose is much more common than I assume makes it a probabeme which is relevant to foreign language teaching and foreign language lexicography. This is equally true of the co-occurrence of entirely and agree, where a comparison of the collocations of agree in the BNC with the ICLE learner corpus shows a significant underuse of entirely by learners. Thus, obviously, the insights to be gained from this kind of analysis have to be filtered and evaluated with regard to particular purposes of research.
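As a rough check on the proportions quoted for suppose and assume, and as one possible way of operationalizing the probabeme idea, the following sketch recomputes the percentages from the counts given above and picks the most frequent of a set of competing verbalizations. The helper names `share` and `probabeme` are hypothetical, and treating the probabeme simply as the variant with the highest relative frequency is a simplifying assumption rather than a definition taken from the chapter.

```python
def share(part, whole):
    """Percentage of `whole` accounted for by `part`, rounded to an integer."""
    return round(100 * part / whole)


# First-person uses and total occurrences, as quoted in the text (BNC).
first_person = {"suppose": (7606, 11493), "assume": (773, 10956)}
shares = {verb: share(part, whole) for verb, (part, whole) in first_person.items()}


def probabeme(variants):
    """Return the most frequent variant and its relative frequency among the variants."""
    total = sum(variants.values())
    best = max(variants, key=variants.get)
    return best, variants[best] / total


# Competing verbalizations with the BNC counts cited earlier in the chapter.
candidates = {"the three of us": 137, "we three": 22}
best, relative_frequency = probabeme(candidates)
```

On these figures the first-person shares come out at 66 % for suppose and 7 % for assume, and the three of us accounts for roughly 86 % of the two competing forms. Such a ranking only captures the frequency side of the concept; whether two expressions really express the same meaning still has to be decided on semantic grounds.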
While the occurrence of certain combinations of words or the overall frequency of particular words or combinations of words may be relevant to some research questions, what is needed in foreign language teaching and lexicography is information about the relative frequency of units expressing the same meaning in the sense of the probabeme concept.32 Recognizing the idiom principle thus requires a considerable amount of detailed and item-specific description, which is useful and necessary for applied purposes and this needs to be given an appropriate place in linguistic theory. At the same time, it is obvious that when we discuss the idiom principle, we are not concerned with what is possible in a language but with what is usual in a language - with de Saussure's (1916) parole, Coseriu's (1973) Norm or what in British linguistics has been called use. In other words, we are not necessarily concerned with the creativity of speakers, which no doubt exists and equally has to be accounted for in linguistic theory. While it would be futile to discuss to what extent and in which sense Goldberg's (2006: 100) equivalent of colourless green ideas

(27) She sneezed the foam off the cappuccino

or

(28) He pottered off pigwards

quoted from P.G. Wodehouse by Hockett (1958: 308) can be regarded as creative uses of language,33 it is certainly true that established language use provides the background to certain forms of creativity. Thus whereas (27) can be seen as an atypical use of the verb sneeze but one which describes a rather unusual situation for which no more established form of expression would come to mind, (28) is a conscious and intended deviation from more established ways of expressing the same meaning, which can perhaps also be said of Ian McEwan's avoidance of shingle beach in the formulation

(29) ... Chesil Beach with its infinite shingle.

It may be debatable whether one should make a distinction between a principle of open choice and a principle of creativity, as suggested by Siepmann (this volume), but it is certainly true that for such purposes deviation from established use cannot be measured purely in terms of frequency (Stefanowitsch 2005). Thus one would hesitate to argue that using buy in combination with door as in

(29) We bought er [pause] a solid door [pause] for the front.

is more creative than using it in combination with car

(30) 'Why did you buy a foreign car?' he said.

despite the fact that car is more than 50 times as frequent as a direct object of buy (span ±3) in the BNC.34

6. Choices: from meaning to form

As far as the question of single choice is concerned, what I wanted to demonstrate here can be summed up as follows:

1. Some but not all semantically significant collocations can be seen in terms of a single choice. Collocations involving intensification such as weak tea can readily be analysed as structured in terms of base and collocate, often with a restricted set of possible collocates, which could be an indication of storage. Other combinations such as white wine or bear resemblance to, raise objections, have a swim seem to represent unified semantic concepts and thus single choices in Sinclair's terms.

2. What appears as a free combination in the categories of descriptive linguistics can be analysed as a single choice if it represents a conceptual unit. The decision not to use sand beach may in fact be a decision in favour of sandy beach, which would then be a single choice. Such an analysis seems more convincing for sandy beach than for, say, a combination such as blue bus, but, of course, our arguments for claiming that a particular combination represents a unified concept are based on language use and thus there is a certain danger of circularity, although Elena Tognini-Bonelli (2002) has provided a list of criteria to identify what she calls functionally complete units of meaning - interestingly in the context of translation, which is another area where AWphenomena feature prominently.

3. If we agree that concepts can be expressed by single words or by chunks of words, then these units must be given equal status in a semantic description as possible single choices. This can be supplemented by the factor of frequency as in the probabeme concept encompassing the preferred choice. In any case, recognizing the role of multi-word units in the creation of meaning in language text shows how misguided some of the structuralist work on word fields that only comprise simple lexemes is (or was). The analysis of such items further shows that the same amount of arbitrariness can be observed as with traditional simple lexemes.

The identification of the idiom principle and the evidence provided for its essential role in creating language text has thus opened up far-reaching perspectives for further research.
From a cognitive point of view it will be important to see what sort of evidence can be found for storage and accessibility of multi-word units and whether differences between different types of multi-word units identified in traditional phraseology and corpus linguistics can be shown to exist in this respect.35 Furthermore, the role of "extended units of meaning" in texts as demonstrated by Sinclair shows that what is required now is a concentration on the paradigmatic dimension in terms of an identification of the units that carry meaning - ranging from morphemes or words to collocations and multi-word units or item-based constructions - and the meanings or concepts expressed by these units. In fact, the collocation and thesaurus boxes to be found in more recent editions of many modern English learners' dictionaries can be taken to apply that sort of insight to lexicographical practice.36

A further consequence is that at least for certain descriptive and theoretical purposes one should be quite radical in overcoming traditional or established types of categories or classification. If, for instance, Granger and Paquot (2008: 43) describe sequences such as depend on and interested in as grammatical collocations but exclude other valency patterns such as avoid -ing-form because they do not consider them "to be part of the phraseological spectrum", this obscures the fact that valency patterns constitute single choices, too. In the spirit of a lexical approach to valency or construction grammar, Sinclair's (1991: 110) creed that "a language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices" can be applied equally to both phenomena of lexical co-occurrence and the co-occurrence of a word (or a group of words) and a particular construction. If we take choice in terms of a choice to express a particular meaning, then language consists of a rather complex system of choices (some of which may even determine the meanings we tend to express in a particular situation of utterance).37 Recognizing that some meanings can be expressed by single words, by words linguists tend to call complex or by combinations that one can refer to as collocations, multi-word units or item-based constructions means understanding the generally idiomatic character of language.
Thus drawing a line between phraseology and lexis proper then seems as inadequate as drawing a sharp line between grammar and lexis, of which John Sinclair (2004: 164) wrote:

Recent research into the features of language corpora give us reason to believe that the fundamental distinction between grammar, on the one hand, and lexis, on the other hand, is not as fundamental as it is usually held to be and since it is a distinction that is made at the outset of the formal study of language, then it colours and distorts the whole enterprise.

If we approach the choices speakers have for expressing particular meanings from an onomasiological perspective, this is equally true of the distinction between lexis and phraseology.


Notes 1 2

3 4

5 6 7

8

9

10 11

I would like to thank Susen Faulhaber, Eva Klein, David Heath, Michael Klotz, Kevin Pike and Peter Uhrig for their valuable comments. Compare Gries's (2008: 6) definition of a phraseologism as "the co-occurrence of a form or a lemma of a lexical item and one or more additional linguistic elements of various kinds which functions as one semantic unit in a clause or sentence and whose frequency of co-occurrence is larger than expected on the basis of chance". See, for instance, Altenberg (1998) and Johansson and Hofland (1989). See also Biber (2009). Cf. also Mukherjee (2009: 101-116). For the role of phraseology in linguistic theory see Gries (2008). For a comparison of traditional phraseology and the Sinclairean approach see Granger and Paquot (2008: 28-29). For parallels with pattern grammar and construction grammar see Stubbs (2009: 31); compare also Gries (2008: esp. 12-15). Cf. Croft and Cruse (2004: 225), who point out that "construction grammar grew out of a concern to find a place for idiomatic expressions". For a discussion of definitions of construction see Fischer and Stefanowitsch (2006: 5-7). See also Goldberg (1995: 4). For a detailed discussion of different concepts of collocation cf. Nesselhauf (2005: 11-40) or Handl (2008). For a discussion of the frequency-based, the semantically-based approach and a pragmatic approach to collocation see Siepmann (2005: 411). See also Cowie (1981) or Schmid (2003). Handl (2008: 54) suggests a multi-dimensional classification in terms of a semantic, lexical and statistical dimension. "Der Kollokator ist ein Wort, das beim Formulieren in Abhängigkeit von der Basis gewählt wird und das folglich nicht ohne die Basis definiert, gelernt und übersetzt werden kann" (Hausmann 2007: 218). This table shows the number of occurrences of the adjectives listed with the corresponding nouns (query adjective + {noun/N}).
It must be stressed that the figures given refer to absolute frequencies of occurrence and should in no way be taken as a measure of collocational strength. Grey highlighting means that the respective collocation is listed under the noun in the Oxford Collocations Dictionary (2002). Obviously, not all possible collocates of the nouns have been included. Furthermore, one has to bear in mind that in some cases - such as weak tea and light tea - the adjectives refer to different lexical units. Cf. also Herbst (2010: 133-134). For the criterion of "begrenzte Kombinationsfähigkeit" see Hausmann (1984: 396). See also the pattern grammar approach taken by Hunston and Francis (2000). The patterns in the table can be illustrated by sentences such as the following: NP Vact NP NP: I wasn't really what you'd call a public school boy... (VDE); NP


12 13 14 15 16 17

18

19

20

21

22

Vact NP as NP: Many commentators have regarded a stable two-party system as the foundation of the modern British political system (VDE); NP Vact of NP as NP: One always thinks of George Orwell as a great polemicist (VDE). See also Behrens (2009: 390) or Lieven, Behrens, Speares and Tomasello (2003). Cf., however, OALD8 under beach: "a sandy/pebble/shingle beach". Excluded are three cases where sandy and beach co-occur but where sandy is not in a direct relationship to beach. The symbol @ is used to denote all morphological forms. Source: syltinfo.de. (www.syltinfo.de/content/view/286/42/; August 2011). Of course, the question of whether Sandstrand and sandiger Strand are interchangeable or synonymous is notoriously difficult to answer. There may certainly be cases where sandige Strände is chosen consciously and for a special reason, such as example (15) perhaps. On the other hand, a large number of sandige Strände from the Google text search seem to occur in travel itineraries or adverts for hotels or holiday cottages, where no semantic difference to Sandstrände can be discerned (but then, as pointed out above, many of these occurrences may have been influenced by texts written in English). Even if they were not necessarily produced by native speakers of German, pragmatically they would certainly be perceived as having the same meaning as Sandstrände (and not evoke the negative connotations the German adjective sandig seems to have in combinations such as sandige Schuhe, sandiges Haar etc.). "This is one of the sunniest, driest areas of the United Kingdom with some of the sandiest beaches in the land." (www.campsites-uk.co.uk/details.php?id=1741; March 2011) This does not mean that such combinations should be classified as compounds. For instance, it can be doubted that they have the same power of hypostatization. For a detailed discussion of "the conceptual effect of one single word" see Schmid (2008: 28).
It seems that one can either argue that sandy beaches is stored in the mind as a unit in a similar way as Sandstrand is or that when speakers feel they want to specify or describe a beach more closely, the German lexicon provides a compound whereas in English a compositional process has to take place. For a discussion of factors resulting in entrenchment see Schmid (2008: 19-22). Compare also Langacker (1987: 59). Log-likelihood values for buy + shares (1708.2759), house (1212.157) and goods (990.8394); for sell + shares (1706.8992), house (467.7887) and goods (1933.0683) in the BNC. Compare e.g. Tomasello (2003: 178-131), Goldberg (1995: 122-123, 2006: 94-96) and Stefanowitsch (2008).

23 The picture is complicated somewhat by the fact that six months can occur as a premodifier in noun phrases. Excluding uses of the type 4 to 6 months the BNC yields the following figures: six months (3750), 6 months (187), six month (135), 6 month (6), six-month (214), 6-month (14) versus half a year (46), half year (171), half-year (81) and halfyear (1). However, it is worth noting that in the BNC the verbs last and spend do not seem to co-occur with half a year but that there are over 60 co-occurrences of the verb spend with six months and 27 of the verb last (span ±5).

24 Example numbers added by me; running text in original.

25 Cf. also Fillmore (1976) and Fillmore and Atkins (1992).

26 This seems very much in line with the following statement by Firth (1968: 18): "Words must not be treated as if they had isolate meaning and occurred and could be used in free distribution".

27 Cf. e.g. Goldberg (2006: 5) or Fillmore, Kay and O'Connor (1988: 534).

28 Compare also the list of complex prepositions given in the Comprehensive Grammar of the English Language (1985: 9.10-11) including items such as ahead of, instead of, subsequent to, according to or in line with, whose constituents, however, are analyzed in traditional terms such as adverb, preposition etc.

29 On the other hand, within a lexically-oriented valency approach these of-phrases could be seen as optional complements, which would have to be part of the precise description of the corresponding units.

30 Note, however, the considerable amount of variation of idiomatic expressions indicated in the Oxford Dictionary of Current Idiomatic English by Cowie, Mackin and McCaig (1983). For the related problem of defining constructions in construction grammar see Fischer and Stefanowitsch (2006: 4-12).

31 For a discussion of the collocates of agree on the basis of completion tests see Greenbaum (1988: 118) and Herbst (1996).

32 For instance, it could be argued that specialised collocation dictionaries such as the Oxford Collocations Dictionary would be even more useful to learners if they provided some indication of relative frequency in cases where several synonymous collocates are listed.

33 Compare also all the sun long, a grief ago or farmyards away discussed by Leech (2008: 15-17).

34 The frequencies of door (27,713) and car (33,942) cannot account for this difference. The analysis is based on the Erlangen treebank.info project (Uhrig and Proisl 2011).

35 See Underwood, Schmitt and Galpin (2004: 167) for experimental "evidence for the position that formulaic sequences are stored and processed holistically". Compare also the research carried out by Ellis, Frey and Jalkanen (2009). See also Schmitt, Grandage and Adolphs (2004: 147), who come to the conclusion that "corpus data on its own is a poor indication of whether those clusters are actually stored in the mind".

36 The latest editions of learner's dictionaries such as the Longman Dictionary of Contemporary English (LDOCE5), the Oxford Advanced Learner's Dictionary (OALD8) and the Macmillan English Dictionary for Advanced Learners (MEDAL2) make use of rather sophisticated ways of covering multi-word units such as collocations; cf. Herbst and Mittmann (2008) and Götz-Votteler and Herbst (2009), which can be seen as a direct reflection of the developments described. Similarly, dictionaries such as the Longman Language Activator (1993), the Oxford Learner's Thesaurus (2008) or the thesaurus boxes of LDOCE5 list both single words as well as word combinations under one lemma.

37 Compare the approach of constructional analysis presented by Stefanowitsch and Gries (2003).

References

Altenberg, Bengt 1998 On the phraseology of spoken English: The evidence of recurrent word-combinations. In Phraseology: Theory, Analysis, and Applications, Anthony P. Cowie (ed.), 101-122. Oxford: Clarendon Press.
Behrens, Heike 2007 The acquisition of argument structure. In Valency: Theoretical, Descriptive and Cognitive Issues, Thomas Herbst and Katrin Götz-Votteler (eds.), 193-214. Berlin/New York: Mouton de Gruyter.
Behrens, Heike 2009 Usage-based and emergentist approaches to language acquisition. Linguistics 47 (2): 383-411.
Biber, Douglas 2009 A corpus-driven approach towards formulaic language in English: Extending the construct of lexical bundle. In Anglistentag 2008 Tübingen: Proceedings, Christoph Reinfandt and Lars Eckstein (eds.), 367-377. Trier: Wissenschaftlicher Verlag Trier.
Bybee, Joan 2007 The emergent lexicon. In Frequency of Use and the Organization of Language, Joan Bybee (ed.), 279-293. Oxford: Oxford University Press.
Coseriu, Eugenio 1973 Probleme der strukturellen Semantik. Tübingen: Narr.
Cowie, Anthony Paul 1981 The treatment of collocations and idioms in learners' dictionaries. Applied Linguistics 2: 223-235.
Croft, William and D. Alan Cruse 2004 Cognitive Linguistics. Cambridge: Cambridge University Press.
Ellis, Nick C., Eric Frey and Isaac Jalkanen 2009 The psycholinguistic reality of collocation and semantic prosody (1): Lexical access. In Exploring the Lexis-Grammar Interface, Ute Römer and Rainer Schulze (eds.), 89-114. Amsterdam/Philadelphia: Benjamins.
Faulhaber, Susen 2011 Verb Valency Patterns: A Challenge for Semantics-Based Accounts. Berlin/New York: De Gruyter Mouton.
Fillmore, Charles 1976 Frame semantics and the nature of language. In Origins and Evolution of Language and Speech: Annals of the New York Academy of Sciences, Stevan R. Harnad, Horst D. Steklis and Jane Lancaster (eds.), 20-32. New York: The New York Academy of Sciences.
Fillmore, Charles 1977 The case for case reopened. In Kasustheorie, Klassifikation, semantische Interpretation, Klaus Heger and János S. Petöfi (eds.), 3-26. Hamburg: Buske.
Fillmore, Charles, and Beryl T. Atkins 1992 Toward a frame-based lexicon: The semantics of RISK and its neighbors. In Frames, Fields, and Contrasts: New Essays in Semantic and Lexical Organization, Adrienne Lehrer and Eva Feder Kittay (eds.), 75-188. Hillsdale/Hove/London: Lawrence Erlbaum Associates.
Fillmore, Charles, Paul Kay and Catherine M. O'Connor 1988 Regularity and idiomaticity in grammatical constructions: The case of let alone. Language 64: 501-538.
Firth, John Rupert 1968 Linguistic analysis as a study of meaning. In Selected Papers by J. R. Firth 1952-59, Frank R. Palmer (ed.), 12-26. London/Harlow: Longmans.
Fischer, Kerstin and Anatol Stefanowitsch 2006 Konstruktionsgrammatik: Ein Überblick. In Konstruktionsgrammatik: Von der Theorie zur Anwendung, Kerstin Fischer and Anatol Stefanowitsch (eds.), 3-17. Tübingen: Stauffenburg.
Gilquin, Gaëtanelle 2007 To err is not all: What corpus and elicitation can reveal about the use of collocations by learners. In Collocation and Creativity, Zeitschrift für Anglistik und Amerikanistik 55 (3): 273-291.
Gläser, Rosemarie 1990 Phraseologie der englischen Sprache. Leipzig: Enzyklopädie.
Götz-Votteler, Katrin and Thomas Herbst 2009 Innovation in advanced learner's dictionaries of English. Lexicographica 25: 47-66.
Goldberg, Adele E. 1995 A Construction Grammar Approach to Argument Structure. Chicago/London: Chicago University Press.
Goldberg, Adele E. 2006 Constructions at Work: The Nature of Generalizations in Language. Oxford/New York: Oxford University Press.
Granger, Sylviane 1998 Prefabricated patterns in advanced EFL writing: Collocations and formulae. In Phraseology: Theory, Analysis and Applications, Anthony Paul Cowie (ed.), 145-160. Oxford: Oxford University Press.
Granger, Sylviane 2011 From phraseology to pedagogy: Challenges and prospects. This volume.
Granger, Sylviane and Magali Paquot 2008 Disentangling the phraseological web. In Phraseology: An Interdisciplinary Perspective, Sylviane Granger and Fanny Meunier (eds.), 37-49. Amsterdam/Philadelphia: Benjamins.
Greenbaum, Sidney 1988 Good English and the Grammarian. London/New York: Longman.
Gries, Stefan 2008 Phraseology and linguistic theory. In Phraseology: An Interdisciplinary Perspective, Sylviane Granger and Fanny Meunier (eds.), 3-35. Amsterdam/Philadelphia: Benjamins.
Handl, Susanne 2008 Essential collocations for learners of English: The role of collocational direction and weight. In Phraseology in Foreign Language Learning and Teaching, Fanny Meunier and Sylviane Granger (eds.), 43-66. Amsterdam/Philadelphia: Benjamins.
Hausmann, Franz Josef 1984 Wortschatzlernen ist Kollokationslernen. Praxis des neusprachlichen Unterrichts 31: 395-406.
Hausmann, Franz Josef 1985 Kollokationen im deutschen Wörterbuch: Ein Beitrag zur Theorie des lexikographischen Beispiels. In Lexikographie und Grammatik, Henning Bergenholtz and Joachim Mugdan (eds.), 118-129. Tübingen: Niemeyer.
Hausmann, Franz Josef 2004 Was sind eigentlich Kollokationen? In Wortverbindungen mehr oder weniger fest, Kathrin Steyer (ed.), 309-334. Berlin/New York: Walter de Gruyter.
Hausmann, Franz Josef 2007 Die Kollokationen im Rahmen der Phraseologie: Systematische und historische Darstellung. Zeitschrift für Anglistik und Amerikanistik 55 (3): 217-234.
Herbst, Thomas 1996 What are collocations: Sandy beaches or false teeth? English Studies 77 (4): 379-393.
Herbst, Thomas 2007 Filmsynchronisation als multimediale Translation. In Sprach(en)kontakt - Mehrsprachigkeit - Translation: Innsbrucker Ringvorlesungen zur Translationswissenschaft V 60 Jahre Innsbrucker Institut für Translationswissenschaft, Lew N. Zybatow (ed.), 93-105. Frankfurt: Lang.
Herbst, Thomas 2009 Valency: Item-specificity and idiom principle. In Exploring the Lexis-Grammar Interface, Ute Römer and Rainer Schulze (eds.), 49-68. Amsterdam/Philadelphia: John Benjamins.
Herbst, Thomas 2010 English Linguistics. Berlin/New York: De Gruyter Mouton.
Herbst, Thomas and Michael Klotz 2003 Lexikografie. Paderborn: Schöningh (UTB).
Herbst, Thomas and Brigitta Mittmann 2008 Collocation in English dictionaries at the beginning of the twenty-first century. In Lexicographica 24: 103-119. Tübingen: Max Niemeyer Verlag.
Herbst, Thomas and Susen Schüller [now Faulhaber] 2008 An Introduction to Syntactic Analysis: A Valency Approach. Tübingen: Narr.
Herbst, Thomas and Peter Uhrig 2009 Erlangen Valency Patternbank. Available online at: http://www.patternbank.uni-erlangen.de.
Hockett, Charles 1958 A Course in Modern Linguistics. New York: Macmillan.
Hunston, Susan and Gill Francis 2000 Pattern Grammar: A Corpus-Driven Approach to the Lexical Grammar of English. Amsterdam/Philadelphia: Benjamins.
Johansson, Stig and Knut Hofland 1989 Frequency Analysis of English Vocabulary and Grammar, Based on the LOB Corpus. Vol. 2: Tag Combinations and Word Combinations. Oxford: Clarendon Press.
Langacker, Ronald W. 1987 Foundations of Cognitive Grammar. Volume 1: Theoretical Prerequisites. Stanford, CA: Stanford University Press.
Leech, Geoffrey 2008 Language in Literature. Harlow: Pearson Longman.
Lieven, Elena forthc. First language learning from a usage-based approach. Patterns and Constructions, Thomas Herbst, Hans-Jörg Schmid and Susen Faulhaber (eds.). Berlin/New York: de Gruyter Mouton.
Lieven, Elena, Heike Behrens, Jennifer Speares and Michael Tomasello 2003 Early syntactic creativity: A usage-based approach. Journal of Child Language 30: 333-370.
Makkai, Adam 1972 Idiom Structure in English. The Hague/Paris: Mouton.
Mukherjee, Joybrato 2009 Anglistische Korpuslinguistik: Eine Einführung. Berlin: Schmidt.
Nesselhauf, Nadja 2005 Collocations in a Learner Corpus. Amsterdam: Benjamins.
Pawley, Andrew and Frances Hodgetts Syder 1983 Two puzzles for linguistic theory. In Language and Communication, Jack C. Richards and Richard W. Schmidt (eds.), 191-226. London: Longman.
Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech and Jan Svartvik 1985 The Comprehensive Grammar of the English Language. London/New York: Longman. [CGEL]
Saussure, Ferdinand de 1916 Cours de linguistique générale, Charles Bally and Albert Sechehaye (eds.). Paris/Lausanne: Payot.
Schmid, Hans-Jörg 2003 Collocation: Hard to pin down, but bloody useful. Zeitschrift für Anglistik und Amerikanistik 51 (3): 235-258.
Schmid, Hans-Jörg 2008 New words in the mind: Concept-formation and entrenchment of neologisms. Anglia 126: 1-36.
Schmid, Hans-Jörg 2011 English Morphology and Word Formation. Berlin: Schmidt. 2nd revised and translated edition of Englische Morphologie und Wortbildung 2005.
Schmitt, Norbert, Sarah Grandage and Svenja Adolphs 2004 Are corpus-derived recurrent clusters psychologically valid? In Formulaic Sequences, Norbert Schmitt (ed.), 127-151. Amsterdam/Philadelphia: Benjamins.
Siepmann, Dirk 2005 Collocation, colligation and encoding dictionaries. Part I: Lexicological aspects. International Journal of Lexicography 18: 409-443.
Siepmann, Dirk 2011 Sinclair revisited: Beyond idiom and open choice. This volume.
Sinclair, John McH. 1991 Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Sinclair, John McH. 2004 Trust the Text: Language, Corpus and Discourse. London/New York: Routledge.
Stefanowitsch, Anatol 2005 New York, Dayton (Ohio) and the Raw Frequency Fallacy. Corpus Linguistics and Linguistic Theory 1 (2): 295-301.
Stefanowitsch, Anatol 2008 Negative entrenchment: A usage-based approach to negative evidence. Cognitive Linguistics 19 (3): 513-531.
Stefanowitsch, Anatol and Stefan Th. Gries 2003 Collostructions: Investigating the interaction between words and constructions. International Journal of Corpus Linguistics 8 (2): 209-243.
Stubbs, Michael 2009 Technology and phraseology: With notes on the history of corpus linguistics. In Exploring the Lexis-Grammar Interface, Ute Römer and Rainer Schulze (eds.), 15-31. Amsterdam/Philadelphia: John Benjamins.
Tognini-Bonelli, Elena 2002 Functionally complete units of meaning across English and Italian: Towards a corpus-driven approach. In Lexis in Contrast: Corpus-Based Approaches, Bengt Altenberg and Sylviane Granger (eds.), 73-95. Amsterdam/Philadelphia: Benjamins.
Tomasello, Michael 2003 Constructing a Language: A Usage-based Theory of Language Acquisition. Cambridge, MA/London: Harvard University Press.
Uhrig, Peter and Thomas Proisl 2011 The treebank.info project. Paper presented at ICAME 32, Oslo, 4 June 2011.
Underwood, Geoffrey, Norbert Schmitt and Adam Galpin 2004 The eyes have it: An eye-movement study into the processing of formulaic sequences. In Formulaic Sequences, Norbert Schmitt (ed.), 153-172. Amsterdam/Philadelphia: Benjamins.

Dictionaries

A Valency Dictionary of English 2004 by Thomas Herbst, David Heath, Ian Roe and Dieter Götz. Berlin/New York: Mouton de Gruyter. [VDE]
Collins COBUILD English Language Dictionary 1987 edited by John McH. Sinclair. London: Collins. [Cobuild1]
Duden Deutsches Universalwörterbuch 2001 edited by Dudenredaktion (Annette Klosa, Kathrin Kunkel-Razum, Werner Scholze-Stubenrecht and Matthias Wermke). Mannheim: Dudenverlag. 4th edition.
Langenscheidt Collins Großwörterbuch Englisch 2004 edited by Lorna Sinclair Knight and Vincent Docherty. Berlin et al.: Langenscheidt. 5th edition.
Langenscheidt's Power Dictionary Englisch-Deutsch Deutsch-Englisch 1997 edited by Vincent J. Docherty. Berlin/München: Langenscheidt.
Longman Dictionary of Contemporary English 2009 edited by Michael Mayor. Harlow: Pearson Longman. 5th edition. [LDOCE5]
Longman Language Activator 1993 edited by Della Summers. Harlow: Longman.
Macmillan English Dictionary for Advanced Learners 2007 edited by Michael Rundell. Oxford: Macmillan. [MEDAL2]
Oxford Advanced Learner's Dictionary of Current English 2010 by A. S. Hornby, edited by Joanna Turnbull. Oxford: Oxford University Press. 8th edition. [OALD8]
Oxford Collocations Dictionary for Students of English 2002 edited by Jonathan Crowther, Sheila Dignen and Diana Lea. Oxford: Oxford University Press.
Oxford Dictionary of Current Idiomatic English. Vol 2: Phrase, Clause and Sentence Idioms 1983 edited by Anthony Paul Cowie, Ronald Mackin and Isabel R. McCaig. Oxford: Oxford University Press.
Oxford English Dictionary 1989 edited by John Simpson and E. S. C. Weiner. Oxford: Clarendon. 2nd edition. [OED2]
Oxford Learner's Thesaurus. A Dictionary of Synonyms 2008 edited by Diana Lea. Oxford: Oxford University Press.

Corpora and further sources used

BNC     British National Corpus
DeReKo  Das Deutsche Referenzkorpus DeReKo, http://www.ids-mannheim.de/kl/projekte/korpora/, am Institut für Deutsche Sprache, Mannheim (using: COSMAS I/II (Corpus Search, Management and Analysis System), http://www.ids-mannheim.de/cosmas2/, © 1991-2010 Institut für Deutsche Sprache, Mannheim).
ICLE    International Corpus of Learner English, Version 1.1. 2002. Sylviane Granger, Estelle Dagneaux, Fanny Meunier, eds. Université catholique de Louvain: Centre for English Corpus Linguistics.
NW      Nice Work. By David Lodge (1989). Harmondsworth: Penguin. First published 1988.
OCB     On Chesil Beach. By Ian McEwan. London: Cape.
VDE     A Valency Dictionary of English (see bibliography).

Sinclair revisited: beyond idiom and open choice

Dirk Siepmann

1. Introduction

In the present article I have set myself a triple goal. First, I would like to suggest a new take on the principles of idiom and open choice. Second, I wish to highlight the need to complement these principles by what I have chosen to term "the principle of creativity". Third, I shall endeavour to show how these three principles can be applied conjointly to the teaching of translation.

2. The principles of idiom and open choice

In 1991 the late John Sinclair, who is renowned for his pioneering work in the field of corpus-based lexicography, propounded an elegantly simple theory. In Sinclair's view, the prime determinants of our language behaviour are the principles of idiom and open choice, and the principle of idiom takes precedence over the principle of open choice: "The principle of idiom is that a language learner has available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments" (Sinclair 1991: 110). The principle in question finds its purest expression in what Sinclair (1996) terms the lexical item. Here is a straightforward example:

the text     can be divided   into three parts
the work     is arranged      in four sections
the book     is organized     into eight thematic sections
the volume   consists         of ten chapters

Sinclair's lexical item displays marked surface variations which conceal one and the same semantic configuration. Although none of the words in this configuration is obligatory, there is no doubt that all the lexical realizations of the configuration are semantically related.


Broadly speaking, we can distinguish four principal levels in this kind of configuration:

the lexical level: preferential collocations (book + be organized + sections, etc.)
the syntactic level: preferential colligations (subject + verb + preposition + noun phrase, normally with a passive verb)
the semantic level: preferential semantic fields (nouns denoting publications or parts of these publications, verbs describing the way in which something is constructed)
the pragmatic level: the discursive function, the speaker's attitude

(cf. Sinclair 1996, 1998; Stubbs 2002: 108-121)

Incidentally, Sinclair and Stubbs were no doubt the first to provide a systematic description of these collocational configurations, but long before the studies of Sinclair and Stubbs appeared in print the linguistic phenomena in question were familiar to translator trainers and specialists in foreign language teaching. Specimens can be found in translation manuals published in the 1970s and 1980s. Here is a representative example from a book on German-English translation (Gallagher 1982: 47):

{fed with | consoled with | put off with} empty promises

A few years before the principle of idiom was enunciated by Sinclair, Hausmann (1984/2007), working independently of his British colleague, set up a typology of word combinations which is illustrated in the following table:

co-creation = words with wide combinability which enter into relations with each other in accordance with minimal semantic rules
Examples: a red suitcase, a nice house, to wait for the postman

collocation = words with limited combinability which enter into relations with each other in accordance with refined semantic rules
Examples: to curb one's anger, a peremptory tone, a cracked wall

counter-creation = words with limited combinability which enter into relations with lexical items beyond their normal combinatory profile in accordance with minimal semantic rules
Examples: the lisping bay, a loose procession of slurring feet, the water chuckled, the body ebbs

Although Hausmann's typology and Sinclair's principles originated in different intellectual contexts,1 the degree of overlap between the conceptual systems evolved by Hausmann and Sinclair is striking. Co-creations and collocations can be explained by the principle of idiom, while counter-creations can be accounted for by the principle of open choice. The only difference between the two systems resides in the way Hausmann defines the collocational phenomenon (Hausmann 2007). In our view, Hausmann's definition is unduly restrictive. I have shown elsewhere (Siepmann 2006a, 2006b) that the collocational phenomenon is not limited to binary units, that there is no clear dividing line between collocations on the one hand and colligations and morpheme combinations on the other, and that Hausmann's hypothesis concerning a difference in semiotactic status between the constituents of a collocation, though pedagogically very useful, is seriously flawed. This view is shared by Frath and Gledhill, who use the concepts "denominator" (dénomination) and "interpretant" (interprétant) in order to demonstrate that the phraseological unit is an "essentialist artefact selected arbitrarily by linguistic tradition from a continuum of referential expressions ranging from the lexical unit to the sentence, the paragraph, and even the entire text" (Frath and Gledhill 2005a; my translation).2 Frath and Gledhill (2005b) apply the term "denominator" to all multi-word units which are more or less frozen:

A denominator (symbolised here by N) is a word or a string of words which refers globally to elements of our experience which are lumped into a category by the N. Ns are not usually created by the individual, they are given to us by our community. They are what Merleau-Ponty (1945) called parole instituée, i.e. institutionalised language. Whenever we get acquainted with an N, we naturally suppose that it refers to an object (or O), even if we know nothing about it. [my emphasis]

Thus, for instance, units such as strong tea, coffee grinder and psychoanalysts all allegedly refer jointly and globally to an object defined more or less arbitrarily by the linguistic community. In our view the term object is inapposite in the above definition; it would be preferable to speak of a concept. Linguistic sequences such as mind your own damn business, I can't believe that, history never repeats itself or I love you function as indices pointing towards concepts or patterns which are familiar to the linguistic community. The link between the linguistic sequence and its semantic extension or "reference" is not, as Frath and Gledhill apparently assume, a direct one. The social "value" of a linguistic sign is constituted not by what it refers to, but rather by the conventional manner in which it is used. In the present instance we should speak of self-reference rather than reference, for we are here concerned with cases where one act of communication refers to another rather than to an object. Using the terminology of Gestalt psychology, we might say that a set phrase is a linguistic "figure" or "fore-ground" which refers to a situational "ground" or "back-ground" (cf. Feilke 1994: 143-151).

3. Extending the principle of idiom

The validity of the principle of idiom has now been proved beyond dispute. In several of his own publications Sinclair clearly demonstrated that his assumption was correct (cf. Sinclair 1991, 1996, 1998); his ideas have apparently been taken up by exponents of construction grammar as well as by scholars subscribing to other schools of thought; and the author of this article has added a small stone to this vast edifice by extending the principle of idiom in two directions. First, I have shown that Sinclair's principle applies not only to isolated words, but also to syntagmas comprising several units. Thus, for instance, the word group with this in mind collocates with syntagmas such as let us turn to and let us consider (for further details, see Siepmann 2005: 100-105). Second, I have followed up on the idea that semantic features exert a collocational attraction on each other - an idea which, in my opinion, is implicit in the postulate that there are such things as collocational configurations. Thus, in configurations such as the work is arranged in eight sections or the second volume consists of five long chapters, the nouns which occupy the subject position contain the semantic feature /text/. Semantic features, like living creatures, may be attracted to each other even when they are separated by considerable distances. In extreme cases there may be dozens of words between the semic elements involved. Convenient examples are provided by structures extending over formal sentence or even paragraph boundaries, e.g. certainly [...] but, it often seems that [...]. Not so and you probably think that [...]. Not so. I have termed these lexical dependencies long-distance collocations (cf. Siepmann 2005b). In order to take account of such phenomena, I have reformulated Sinclair's idiom principle as follows: "One of the main principles of the organisation of text is that the choice of one word or phraseological unit affects the choice of other words or phraseological units, usually within a maximum span of several paragraphs. Collocation is one of the patterns of mutual choice, and idiom is another" (Siepmann 2005a: 102, based on Sinclair 1991: 173).

Pursuing this idea further, we might hazard the hypothesis that certain syntactic phenomena which appear to be free choices are in fact determined by the principle of idiom. A good example is provided by pseudo-cleft sentences. Functional grammarians generally take the line that this type of sentence marks a turning point in an argument. Thus, for instance, a pseudo-cleft sentence is often used to mark a transition to a new topic, introduce a comment or highlight a contrast. The rhematic element in the pseudo-cleft sentence is thrown into sharp relief, and at the same time it is presented as the topic which is to be dealt with in the text segment that follows. Here are a few examples from journalistic texts:

1. Contrast

There are important debates to be had in this area. Marc even makes his tiny contribution, saying seriously for a moment that he does not believe in the values that dominate contemporary art, citing his dislike of its "surprise" and "novelty". Yet that contribution stuck out so strangely that I wondered for a second whether it had been slipped in by Christopher Hampton, the translator, to make the original more topical for the British. What is certain is that by importing a reactionary French diatribe against the legacy of modernism, the presenters of Reza's play have muddied the waters of debate, almost certainly preventing a far more significant discussion from taking place. (The Guardian 29.10.1996: 7)

So it is clear what the advantage of a merger would be for Warburg. What is less obvious is what Morgan Stanley hopes to gain. (The Economist 10.12.1994)

2. Topic shifting

Dr Narin is also able to break patents down by nationality. He discovered, for example, that in 1985 Japanese companies, for the first time, filed more patents in America than American companies did. They have continued to do so every year since. Numbers are significant, of course, but what really counts is quality: only a few inventions end up as big money-spinners. Dr Narin thinks he can spot these, too. Patent applications often refer to other patents. By looking at the number of times a particular patent is cited in subsequent applications, and comparing this to the average number of citations for a patent in that industry, he gains a measure of its importance to the development of the field. (The Economist 20.11.1992)

3. Commentary

At 9pm, with the lights, telly, dishwasher and washing machine on, Wattson is registering a whopping £3,000. But what most shocks me is the power being drawn from the socket when my appliances are supposedly turned off. (Times Online 15.7.2004)

It would be counter-intuitive to postulate a correlation between the use of specific terms and a non-specific linguistic phenomenon like topic shifting. Yet it is precisely this postulate that is borne out by our statistics. If we look at the adjectives and adjectival phrases that crop up repeatedly in pseudo-cleft sentences hinging on the copula to be, we find that some 70% of these sentences contain a limited number of specific lexical items. This can be seen from the following table:

Speaker's attitude       what +
"what is in question"    is at stake / in question / involved
clarity                  is sure / certain / clear / obvious
necessity                is important / counts
"what is striking"       is remarkable / exceptional / characteristic / evident / odd (etc.)
comparison               is less sure / more important

It follows from this that even the use of certain syntactic constructions is, to a certain extent, determined by the principle of idiom, even though writers may have some leeway regarding the meanings that have to be expressed. Does this mean that all linguistic behaviour boils down to the principle of idiom? The answer is "no". Nonetheless, there is no escaping the fact that language users have little room for manoeuvre in situations where they have to arrange morphemes, individual words and syntagmas consisting of several lexical items. In such situations the principle of open choice only comes into play when language users display their incapacity to conform to linguistic norms or happen to be motivated by the desire to flout such norms by breaking valency, collocational or semantic moulds.3 The language user's natural bent is to follow collocational norms (or "denominational" norms, to use the terminology of Frath and Gledhill). This is confirmed by the fact that many translators are inclined to "normalise" texts which run counter to linguistic norms (cf. Chevalier and Delport 1995; Kenny 2001; Gallagher 2007: 213-15). When our attention is no longer absorbed by short and relatively rigid word groups we enjoy a certain amount of freedom at the sentence, paragraph and text levels, yet even here we have less room for manoeuvre than is generally assumed (cf. Stein 2001). In our view the principle of open choice normally comes into play whenever we have to link up several valency patterns, collocations or probabemes.4 This means that open choices are made primarily at the semantic level and the level of unspoken or deverbalised thoughts (if such entities really exist). Once we have made an open choice we generally find ourselves in the domain of prefabricated language.

4. The principle of creativity

So far I have endeavoured to circumscribe the range of the principles of idiom and open choice. Closer examination reveals the need to distinguish two types of open choice: those which are completely abnormal in every respect, and those which constitute a more or less deliberate deviation from accepted norms that can be accounted for in terms of a set of semantic relations such as analogical transfer, lexical substitution, metaphor or metonymy. We can bring these phenomena into sharper focus by examining a phrase from a novel by Colin Dexter: [...] a paperback entitled The Blue Ticket, with a provocative picture of an economically clad nymphet on the cover. (Dexter 1991: 201; [emphasis added])5 In the present instance we have to do with a kind of analogical transfer. Under normal circumstances, clad and its more modern synonym dressed both collocate with the adverb scantily. Since economically is a partial synonym of scanty, this is neither a collocation nor an open choice (or, to use Hausmann's terminology, a counter-creation). It follows therefore that the expression economically clad cannot be explained by either of the two

66

DirkSiepmann

principles enunciated by Sinclair. It is, so to speak, a humorous extension of the principle of idiom. In view of the special characteristics of this and other examples which are too numerous to cite (cf. for example Partington 1998), I feel it is necessary to postulate a third principle which stands in a complementary relationship to the two principles enunciated by Sinclair. I shall call this the principle of creativity. Taking Hausmann's classification as my starting point, I therefore suggest that co-occurrences should be divided into four groups: co-creations, collocations, analogical creations and counter-creations. Co-creations and collocations can be explained by the principle of idiom, analogical creations by the principle of creativity, and counter-creations by the principle of open choice. The distinctions I have drawn can be justified on grounds of frequency and distribution (see the table below).

principle of idiom | principle of creativity | principle of open choice
co-creation; collocation | analogical creation | counter-creation
figure-ground relation | figure with a text-specific ground (or a ground specific to a limited number of texts) | groundless figure
denominator | interpretant | interpretant
co-creation: beautifully clad (350 hits in Google Books6); collocation: scantily clad (799 hits in Google Books) | economically clad (24 hits in Google Books) | thirstily clad (no hits in Google Books)
co-creation: real hate (365 hits in Google Books); collocation: naked hate (207 hits in Google Books) | bare hate (5 hits in Google Books) | loving hate (Shakespeare)
co-creation: two-hour drive; collocation: scenic drive | 27-hour meander by sledge | sledge meander
dizzy with shock | his eyes widened with shock | arrive with shock
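The frequency argument behind the table can be illustrated with a small sketch. It is only a rough approximation under stated assumptions: the cut-off of 100 hits is hypothetical and does not come from the chapter, and raw frequency alone cannot separate co-creations from collocations, which both fall under the principle of idiom. The hit counts are the Google Books figures quoted in the table.

```python
from dataclasses import dataclass

# Hypothetical cut-off, NOT taken from the chapter: chosen only so that the
# hit counts quoted in the table fall into the bands the chapter assigns them.
IDIOM_MIN_HITS = 100


@dataclass(frozen=True)
class Cooccurrence:
    phrase: str
    hits: int  # Google Books hit count as quoted in the table


def governing_principle(item: Cooccurrence) -> str:
    """Map a hit count onto one of the three principles (illustrative only)."""
    if item.hits == 0:
        return "open choice"  # counter-creation: unattested
    if item.hits < IDIOM_MIN_HITS:
        return "creativity"   # analogical creation: rare but attested
    return "idiom"            # co-creation or collocation: well attested


TABLE_EXAMPLES = [
    Cooccurrence("beautifully clad", 350),
    Cooccurrence("scantily clad", 799),
    Cooccurrence("economically clad", 24),
    Cooccurrence("thirstily clad", 0),
    Cooccurrence("real hate", 365),
    Cooccurrence("naked hate", 207),
    Cooccurrence("bare hate", 5),
]

if __name__ == "__main__":
    for item in TABLE_EXAMPLES:
        principle = governing_principle(item)
        print(f"{item.phrase}: {item.hits} hits, principle of {principle}")
```

A fuller treatment would also need distributional evidence (how many different texts and authors use a combination), since the table distinguishes a text-specific ground from a general one.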

Such analogical transfers underlie the phenomena which Hoey (2005) describes as "semantic associations". Hoey argues that a word combination such as a two-hour drive is based on an associative pattern of the type number-time-journey by vehicle. It is this kind of pattern which generates analogical creations such as a 27-hour meander by sledge. Valency patterns can be extended in the same way. A characteristic example is provided by English verbs and adjectives which are combined with the preposition with and a noun denoting an emotion (e.g. anger or grief). Thus, for instance, we can say she was dizzy with shock or he was shaking with rage. Francis, Hunston and Manning (1998: 336) are correct in claiming that only verbs and adjectives admit this construction, but they restrict their attention to word combinations which are explainable by the idiom principle and ignore analogical creations such as his eyes widened with shock, her eyes sparkled with happiness and out of his mind with grief.7

One might object that all these word combinations can be explained by the principle of idiom. This is partly true if one takes account of the fact that the idiom principle extends to semantic features (Siepmann 2005)8 and semantic associations (Hoey 2005), but one ought to bear in mind that Sinclair only takes account of the lexical surface. What I would like to stress here is that there are other types of analogical creation which cannot be explained by the principle of idiom. Good examples are provided by metaphors in general and synaesthetic metaphors in particular. It is particularly enlightening to analyse examples from Iris Murdoch's novels and compare Murdoch's sentences with their French translations. Let us begin with a sentence from a novel entitled The Red and the Green.

English original: The wind blew the light rain against the windows in intermittent sighing gusts that were like a soft ripple of waves. (Murdoch 1965: 247)

French translation: Le vent poussait légèrement la pluie contre les fenêtres par bourrasques, comme des soupirs intermittents: on aurait dit un doux clapotis de vagues (Murdoch 1967: 233, tr. Anne-Marie Soulac)

By combining the verb sigh with an inanimate noun (gust), Murdoch personifies the wind. But how can we classify a word combination like sighing gusts? It is neither a co-creation ("a regularly formed, normal-sounding combination"9) nor a collocation ("a manifestly current combination"). If we adopt Hausmann's classification, we are therefore forced to conclude that sighing gusts is a counter-creation, i.e. a rare or unique combination which can be explained by the open-choice principle. However, if we compare Hausmann's counter-creative examples10 with sighing gusts, it immediately becomes apparent that the word combinations which Hausmann classes as counter-creations are much more daring and much less natural than sighing gusts. Hausmann's expressions are very rare, probably even unique.11 Not so with sighing gusts. Our intuition told us that this word combination and its underlying semantic configuration are not infrequent in literary texts, and this intuitive insight was confirmed by a number of Google searches. In Google Books alone there were 675 hits for sighing winds, 660 for sighing wind and 119 for sighing gusts. We may therefore conclude that the word combination in question is a metaphor that happens to have taken the fancy of certain authors. As a search on Google Books readily shows, it occurs several times, for instance, in the works of Fitzgerald and Murdoch.

There is sufficient evidence to show that the distinction I have just made is directly relevant to practical translation work. Soulac, the French translator of Murdoch's The Red and the Green, has "normalised" the passage quoted above by applying the principle of idiom (Murdoch 1965: 247). Par bourrasques is a prepositional phrase of medium frequency12 - in other words a collocation - and soupirs intermittents is a co-creation (intermittent + any noun denoting a discontinuous phenomenon). What Soulac failed to notice is that there is a grey zone between the principle of idiom and the principle of open choice - a zone in which the creativity principle holds sway. By applying this principle, she might have succeeded in producing a more satisfactory rendering of Murdoch's poetic prose.

After examining a number of well-written literary texts by native speakers of French, I decided that it would be preferable to translate the passage in question as follows:

Le vent qui soupirait en bouffées (or: en bourrasques) intermittentes rabattait une pluie fine contre les fenêtres: on aurait dit un doux clapotis de vagues.

Le vent qui soupirait en bourrasques plaquait le crachin contre les fenêtres: on aurait dit un doux clapotis de vagues.

Le vent qui soupirait en bourrasques plaquait de douces ondées de pluie fine contre les fenêtres.

In all these sentences the verb soupirer is connected to vent in the same way as sigh is linked to gust in the English sentence.


It is interesting to compare the extract from The Red and the Green (1965) with a passage from a much later novel entitled The Good Apprentice (1985).

English original: The wind, which tired him so by day, came at night in regular sighing gusts, sounding like some great thing deeply and steadily breathing. (Murdoch 2001: 152)

French translation: Le vent, qui le fatiguait tant le jour, venait la nuit en rafales de soupirs obstinés, comme une grande chose respirant profondément et régulièrement (Murdoch 1987: 187, tr. Anny Amberni)

Here we have exactly the same poetic word combination as in The Red and the Green, but Amberni's translation is quite different from the rendering suggested by Soulac. Instead of applying the creativity principle, Amberni has gone to the opposite extreme by opting for an open choice which savours of affectation (en rafales de soupirs). This calls for a number of comments. Although the pattern en NP de NP is quite common in contemporary French (e.g. en cascades de diamants), en NP de soupirs is an extremely rare pattern, and en rafales de NP is subject to severe selectional restrictions. In French prose we often come across syntagms where the slot following the preposition en is occupied by a noun denoting a sudden gush of fluid or semi-liquid matter (e.g. en cascades d'eau claire, en torrents de boue), but such syntagms sound distinctly odd as soon as we replace cascades or torrents by rafales. In meteorological contexts syntagms such as en rafales de 45 nœuds or en rafales de pluies torrentielles sound perfectly normal, but these patterns are only marginally acceptable when nouns like nœuds, pluie or grêle are replaced by words denoting sounds or emotions (soupirs, râles, haine, rage). Amberni's phrase is not un-French, but it is a counter-creation and therefore sounds much less natural than Murdoch's phrase.

Our third and final example provides even more convincing evidence of the workings of the creativity principle.
Here is a sentence which shows how a vivid stylistic effect can be achieved by means of a syntactic transformation: Sarah Harrison, a slimly attractive, brown-eyed brunette in her late twenties [= slim and attractive] (Dexter 2000: 29)


Dexter's adverb + adjective construction is based on a common-or-garden co-creation consisting of a pair of coordinated adjectives (slim and attractive). Since the underlying semantic association is not modified in any way when slim and attractive is transformed into slimly attractive, the latter cannot be classed as a counter-creation. Nor can it be categorized as a collocation, for it is neither normal-sounding nor particularly frequent. We must therefore conclude that it is an analogical creation which can only be explained by the creativity principle.

5. The implications for translation teaching

It remains to consider the relevance of the aforementioned principles to translation teaching. Translation teachers should explain these principles to their students and encourage them to replicate stylistic effects wherever possible. In many cases the creativity principle can be applied in both the source and the target language. Thus sighing gusts can be rendered adequately by a personificatory expression containing the words vent, soupirer and bourrasques (see the aforementioned examples from Murdoch's novels). Stylistic normalization should only be attempted when the application of the creativity principle would violate target language norms. Slimly attractive is a good example of an English stylistic device which cannot be replicated in French. Since the French adjective mince cannot be converted into an adverb (*mincement), we have no choice but to normalize Dexter's expression by rendering it as mince et séduisante.

I believe that the translation of syntagms is amenable to a systematic presentation, but I concede that the systematic treatment of such translation problems may be hindered by obstacles such as polysemy, transpositional irregularities, unpredictable stylistic and textual factors, differences in frequency, and collocational gaps. Let us now consider each of these obstacles in turn.

5.1. Polysemy

Since collocations are often polysemous, translators have to look carefully at the contexts in which they occur in order to find out exactly what they mean. Thus, for instance, the collocation élever + niveau can be used with reference to racehorses as well as debates and conversations.13 Similarly, the English collocation have an interest (+ in) can mean either to be interested or to have a stake.14 Even Sinclair's (1996) prototypical example (the postulated link between the phraseological combination with / to the naked eye and the notions of difficulty and visibility) can pose problems for the translator since the French expression à l'œil nu is not always used in the same way as with/to the naked eye.15

with/to the naked eye | à l'œil nu

Sense 1 (semantic equivalence between French and English)

Prototypical example (cf. Sinclair 1996)
[...] just visible to the naked eye [...] | [...] à peine visible à l'œil nu [...]

Neutral examples
At night at their house they sat on the deck and watched the stars with the naked eye (there was no telescope). Egypt may be the best spot on earth to see the stars with the naked eye [...]. | Jérusalem est une ville d'où l'on peut encore voir à l'œil nu l'épaisse couverture d'étoiles.

Counter-examples
The brightest star in it is o Velorum (3.6). It is easily visible with the naked eye [...] In astronomy, the naked-eye planets are the five planets of our solar system that can be discerned with the naked eye without much difficulty. | Pour les dissuader, chaque nouvelle coupure porte sept signes de sécurité parfaitement visibles à l'œil nu, liés à la fabrication du papier et à l'impression.

Sense 2 (present in French but not in English) = strikes the eye

à l'œil nu: [...] le contraste avec les États-Unis se voit à l'œil nu / Les « accords de Matignon » sont devenus fragiles. Les fêlures se voient à l'œil nu. / [...] presque à l'œil nu
à l'œil nu: Il est vrai que quiconque peut constater à l'œil nu que les dégâts sont conséquents [...] (= sans regarder de près)
à l'œil nu: Sans cette échappée lusitanienne, on voit bien, à l'œil nu, que dans [...] et [...], il y a « jou », comme dans « joujou » [...].

5.2. Problems associated with the systematic presentation of transpositions

Translation manuals often contain transpositional rules. Chuquet and Paillard (1987: 18), for instance, claim that an English adjective modified by an adverb ending in -ly often has to be rendered by means of a double transposition (adverb + adjective - noun + adjective). To support their assertion they cite the following example:

remarkably white (skin) - (teint) d'une blancheur frappante

This kind of transposition is more dependent on collocational and co-creational constraints than appears at first sight. After examining all the contexts in which collocations of the type remarkably white might occur, I have reached the following conclusions:

1. It is not necessary to resort to a transposition in order to express the notion of intensification carried by a degree adverb. Remarkably white skin and remarkably white teeth can be rendered respectively as une peau très blanche (or une peau toute blanche) and des dents très blanches (or des dents toutes blanches). This shows that translation work requires a perfect mastery of both the source and the target language - a mastery that can only be attained with the aid of large corpora.16

2. Transposition is impossible in cases where the qualities expressed by the adverb and the adjectives do not add up (cf. Ballard 1987: 189; Gallagher 2005: 16): previously white areas - des zones jusqu'alors blanches

3. Other kinds of transposition might be envisaged (e.g. éblouissant de blancheur).

4. The example cited by Chuquet and Paillard is atypical, for the colour adjective white is generally combined with other adjectives (e.g. pure, dead, bright, brilliant) or with nouns and adjectives designating white substances (e.g. milk(y), cream(y), chalk(y)). Moreover, white often occurs in comparative expressions like as white as marble.

It is this type of word combination that ought to constitute the starting-point for any systematic contrastive study of the combinatorial properties of white, whiteness, blanc and blancheur. When we embark on this kind of study we soon notice that the French almost invariably use constructions such as d'une blancheur / d'un blanc + ADJECTIVE (absolu(e), fantomatique, laiteux, laiteuse, etc.) or d'une blancheur de + NOUN / d'un blanc de (craie, écume, porcelaine, etc.). The English, by contrast, use a variety of expressions such as pure white, ghostly white, milky white or as white as foam.

5.3. Unpredictable stylistic factors

I shall restrict my attention to two typically French phenomena: synonymic variation and subjectivism. The Gallic preference for synonymic variation may be illustrated by means of the English expression golden age and its French equivalents. While an English-speaking author will have no qualms about repeating golden age several times within the same paragraph, a French author will avoid such flat-footed repetition by using a stylistic variant (époque dorée - âge d'or). Any translator worth his salt will do the same.

French subjectivism manifests itself in a marked tendency to set facts in relation to an active subject, whereas English tends to represent reality as clusters of facts which are unrelated to the creatures which observe them. If we compare the collocations of the French noun impression with their English equivalents, we find that avoir l'impression is rarely rendered by its direct equivalent have the impression. One of the reasons for this is that the word combination avoir l'impression frequently occurs in subjectivist constructions. Here is an example from Multiconcord (Groß, Mißler and Wolff 1996):

Comment se peut-il qu'en l'espace d'une demi-heure, alors qu'on s'est borné à déposer les bagages dans le vestibule, à préparer un peu de café, à sortir le pain, le beurre et le miel du réfrigérateur, on ait une telle impression de chaos?

How in the world did it happen that within half an hour - though all they had done was to make some coffee, get out some rye crisp, butter, and honey, and place their few pieces of baggage in the hall - chaos seemed already to have broken loose, [...].

The same kind of interlingual divergence can be observed when we examine the English translation equivalents of French noun phrases where impression is followed by the preposition de and another noun:

cette impression de vertige disparut - this giddiness disappeared

5.4. Unpredictable textual factors

Literal translation is often impossible for textual reasons. A good example is provided by the word combination jours heureux. This is normally rendered directly as happy days (cf. Memories of Happy Days, the title of a book by the Franco-American author Julian Green). However, if the preceding context implies a comparison, jours heureux has to be translated as happier days.17 This translation shift is due to the fact that the French generally prefer jours heureux to jours plus heureux. This is true even when jours heureux is immediately preceded by a verb implying a transition from unhappiness to happiness. Witness the following quotation from a blog:

Côté boulot enfin, Julie a reçu son contrat donc pas de problème, et moi je me retrouve dès mardi à supmeca paris dans le cadre de mon stage de secours, en attendant des jours heureux. (http://juheetjeremieapans.blogspot.com/2006_04_01_archive.html)


It is interesting to note that the word combination en attendant des jours heureux is used interchangeably with en attendant des jours meilleurs.18 Both word combinations are so common that they have virtually attained the status of set phrases.19

5.5. Differences in frequency

A little consideration shows that equivalence problems may arise because a high-frequency word combination in one language may correspond to a word combination with a much lower frequency in another language. While word combinations such as illegal download and téléchargement illégal are equally common in English and French, the same cannot be said of ambiance de plomb and leaden atmosphere. The expression ambiance de plomb can often be heard in French radio broadcasts, and over 4,000 occurrences can be found on the Internet, but its direct English equivalent, leaden atmosphere, is comparatively rare. The English prefer expressions such as a brooding atmosphere, an oppressive atmosphere or an atmosphere heavy with tension (and menace).

5.6. Collocational gaps

This brings us to the subject of collocational gaps, which occur wherever the languages under consideration use different collocates although there is an exact correspondence between the collocational bases. Consider, for instance, the French economic term créneau and its English equivalents (gap in the market and market gap). Since créneau is frequently combined with prometteur but gap in the market (like market gap) rarely collocates with the adjective promising,20 English translators would be well advised to render un créneau prometteur by means of a word combination such as a profitable gap in the market or a potentially profitable gap in the market.

It will be evident from the foregoing that the automatic21 translation of collocations and collocational configurations is fraught with often unpredictable problems. Nonetheless, we can pave the way to a systematic treatment of translation equivalences by studying collocational gaps and frequency differences between various languages.

In order to make this perfectly clear, I shall round off my study with a detailed analysis of the combinatorial properties of the French noun impression. The table in the appendix juxtaposes the collocations of the French noun impression recorded in the Robert des combinaisons de mots and the collocations of the corresponding English noun listed in the Oxford Collocations Dictionary for Students of English. The words in small capitals are additional collocations which I added after a thorough corpus investigation.

A cursory examination of the verb-noun collocations shows that the English collocation record + impression has no simple direct equivalent in French. In order to fill this collocational gap, translators have to resort to a kind of translation shift which Vinay and Darbelnet (1958) termed "modulation":

she recorded her impressions (of the city) in her diary - elle a confié ses impressions à son journal / elle a livré (raconté) ses impressions dans son journal

If we draw up systematic lists of collocational equivalents and collocational gaps we can predict cases where translators have to resort to modulatory shifts, for there is clear evidence of a direct correlation between the meaning(s) of words and their ability to combine with other lexical items.

The systematic description of words' combinatorial properties reduces the risks involved in a purely intuitive approach to translation teaching. Translation into foreign languages can be dealt with in a more objective manner if translation work proper is preceded by a systematic comparison of articles from collocation dictionaries. This can be effectively demonstrated by examining the adjectives that combine with the noun impression. The Dictionnaire des combinaisons de mots lumps together adjectives like désagréable and navrant on the one hand and défavorable on the other. However, a French-English comparison shows that in the present instance we are dealing with two distinct categories: impression défavorable can be classified under the sense we might label "opinion", while word combinations like impression navrante, impression épouvantable and impression horrible belong to the sense we might label "feeling".

Since noun-adjective collocations belonging to the second category are specific to French, they cannot be rendered directly into English. The following example from Multiconcord (Groß, Mißler & Wolff 1996) is not above criticism, but it illustrates the kind of modulation technique which has to be used in such cases:

les maigrichons me donnent toujours l'impression désagréable de ne pas être à la hauteur - thin men always make me feel inadequate somehow

6. Conclusion

I would like to conclude by recapitulating the main points discussed in this article:

- the extension of the idiom principle to semantic features and word groups
- the need to establish a new principle which I have termed the creativity principle
- the need to operationalize the practice of translation on the basis of objectively verifiable principles, concepts and research findings: (a) the principles of idiom, creativity and open choice, (b) the results obtained by a systematic study of the collocational equivalences between source and target languages.

In order to teach translation more effectively, we need new translation manuals in which information has been organized in accordance with these principles. The books currently available on the market generally contain a mere hotchpotch of ad-hoc observations which fail to take account of recent advances in lexicography. In order to improve translation books we need to draw a clear distinction between two distinct categories of lexical items: those to which the creativity principle applies, and those which need to be dealt with in accordance with the idiom principle. In order to resolve the difficulties posed by words belonging to the latter category, we require translation-oriented lexicogrammatical reference works (cf. Salkoff 1999) containing in-depth analyses of typical transpositional problems (cf. our discussion of the colour adjective white). However, there is no need to treat modulations in minute detail since most of these translation shifts can be predicted with the aid of existing collocation dictionaries.22 These reference works, however, need to be expanded and reorganized.

Notes

1 Hausmann's typology has its origins in traditional phraseology, which seeks to define and categorize different types of word combinations; Sinclair's principles, by contrast, are firmly rooted in British contextualism, which considers the concept of collocation primarily as a heuristic tool for constructing a new language theory.

2 "C'est un artefact essentialiste, une entité arbitrairement sélectionnée par la tradition linguistique dans un continuum d'expressions référentielles qui vont de l'unité lexicale à la phrase, au paragraphe, au texte tout entier. Elle n'a ainsi pas d'existence 'en soi'" (Frath/Gledhill 2005a).

3 Michael Hoey describes these moulds as "primings" (Hoey 2005).

4 Herbst and Klotz (2003: 145-149) use the term probabeme to denote multiword units which speakers are likely to use to express standard thought configurations. Thus, according to Herbst and Klotz, a native speaker of English might say grind a cigarette end into the ground, while a native speaker of German would probably use the verb austreten to express the same idea. The kinetic verb grind evokes a vivid image of a cigarette butt being crushed beneath a foot, while the less graphic verb austreten gives more weight to the idea of extinction. One might postulate a gradient ranging from valencies to probabemes via collocations (in the traditional sense of the term).

5 This example, like the one that follows, was suggested by John D. Gallagher.

6 The search was carried out on 11 March 2008.

7 His eyes widened with shock and her eyes sparkled with happiness have a distinctly literary flavour, but an expression like out of his mind with grief might occur in everyday conversation. This shows that the creativity principle operates at every level.

8 The hypothesis that semantic features exert a powerful attraction on each other offers a plausible explanation for word combinations like his eyes sparkled with joy / happiness / glee and he was light-headed / almost unconscious with tiredness.

9 Cf. Hausmann (1984) and (2007).

10 Hausmann cites word combinations such as la route se rabougrit and le jour est fissuré.

11 When we looked for la route se rabougrit and le jour est fissuré we were unable to find any occurrences on the Internet or in our corpus.

12 It should, however, be noted that en bourrasque is more common.

13 We can say le cheval est encore capable d'élever son niveau or il faut élever le niveau du débat.

14 For further examples see Siepmann (2006b).

15 According to Sinclair, the phraseological combination the naked eye consists of a semantic prosody ("difficult"), a semantic preference ("see"), a colligation (preposition) and an invariable core (the collocation "the naked eye"). Our own findings indicate that the semantic prosody postulated by Sinclair is not always present in English (cf. our counter-examples). In our opinion "degree of difficulty" would be a more appropriate expression here.

16 Similar remarks apply to the word combination critically ill, which can be rendered as dans un état grave or gravement malade (cf. Chuquet and Paillard 1987: 18).

17 A particularly enlightening example can be found in a translation manual by Mary Wood (Wood 1995: 106, 109 [note 24]).

18 Cf. the following example from a French daily: Si tel devait être le cas, Primakov ne ferait que retrouver, dans l'ordre intérieur, la fonction que lui avait dévolue en son temps son véritable patron historique, Youri Andropov, chef du KGB de 1967 à 1982, sur la scène diplomatico-stratégique du monde arabe: organiser, canaliser et modérer, en attendant des jours meilleurs, les bouffées néo-staliniennes en provenance de l'Orient compliqué. (Le Monde 5.11.1998: 17)

19 It should however be noted that en attendant des jours meilleurs is more common than en attendant des jours heureux.

20 When we searched the Web we found fewer than ten examples in texts written by native speakers of English.

21 We use automatic in the fullest sense of the word.

22 We have demonstrated this by means of a detailed analysis of the collocations of Fr. impression and their English translation equivalents.

References Ballard, Michel 1987 La traduction de I'anglais aufrancais. Paris: Nathan. Chevalier, Jean-Claude and Marie-France Delport 1995 Problemes linguistiques de la traduction: L'horlogerie de Saint Jerome. Paris: L'Harmattan. Chuquet,Helene and Michel Paillard 1987 Approche linguistique des problemes de traduction. AnglaisFrancars. Paris: Ophrys. Crowther, Jonathan, Sheila Dignen and Diana Lea (eds.) 2002 Oxford Collocations Dictionary for Students of English. Oxford: Oxford University Press. Feilke, Helmut 1994 Common sense-Kompetenz: Uberlegungen zu einer Theorie "sympathischen" und "naturlichen" Meinens und Verstehens. Frankfurt a. M.: Suhrkamp. Francis, Gill, Susan Hunston and Elizabeth Manning (eds.) 1998 Collins Cobuild Grammar Patterns. 2: Nouns and Adjectives. London: HarperCollins. Frath, Pierre and Christopher Gledhill 2005a Qu'est-ce qu'une unite phraseologique? In La phraseologie dans tous ses etats: Actes du collogue "Phraseologie 2005" (Louvam, 13-15 Octobre 2005), Catherine Bolly, Jean Rene Klein and Beatrice Lamiroy (eds.). Louvain-la-Neuve: Peeters. [cf. www.res-pernomen.org/respernomen/pubs/lmg/SEM12-Phraseo-louvam.doc].

80

DirkSiepmann

Frath, Rerre and Christopher Gledhill 2005b Free-range clusters or frozen chunks? Reference as a defining criterion for linguistic units. In RANAM (Recherches Anglaises et NordAmericaines) 38 [cf www.res-per-nomen.org/respernomen/pubs/ lmg/SEM09-Chunks-new3.doc]. Gallagher, John D. 1982 German-English Translation: Texts on Politics and Economics. Munich: Oldenbourg. Gallagher, John D. 2005 Stilistik und tibersetzungsdidaktik. In Linguistische und didaktischpsychologische Grundlagen der Translation, Bogdan Kovtyk (ed.), 15-36. Berlin: Logos. Gallagher, John D. 2007 Traduction litteraire et etudes sur corpus. In Les corpus en linguistique et en traductologie, Michel Ballard and Carmen PineiraTresmontant(eds.), 199-230. Arras: Artois Presses University. GroB, Annette, Bettina MiBler and Dieter Wolff 1996 MULTICONCORD: Em Multilmguales Konkordanz-Programm. In Kommunikation und Lernen mit alten und neuen Medien: Beitrage zum Rahmenthema "Schlagwort Kommunikationsgesellschaft" der 26. Jahrestagung der Gesellschaft fur Angew andte Linguistik, Bernd Riischoff and Ulnch Schmitz (eds.), 4 9 ^ 3 . Frankfurt: Peter Lang. Guillemm-Flescher, Jacqueline 2003 Theonser la traduction. Revue francaise de linguistique appliquee VIII (2): 7-18. Hausmann, Franz Josef 2007 Apprendre le vocabulaire, c'est apprendre les collocations. In Franz Josef Hausmann: Collocations, phraseologie, lexicographie: Etudes 1977-2007 et Bibliographie, Elke Haag, (ed.), 49-61. Aachen: Shaker. First published in 1984 as Wortschatzlernen ist Kollokationslernen: Zum Lehren und Lernen franzosischer Wortverbmdungen. Praxis des neusprachlichen Unterrichts 31: 395^06. Herbst, Thomas and Michael Klotz 2003 Lexikografie: Eine Einfuhrung. Paderborn: Schomngh. Hoey, Michael 2005 Lexical Priming: A New Theoiy of Words and Language. London: Routledge. Kenny, Dorothy 2001 Lexis and Creativity in Translation: A Corpus-Based Study. Manchester: St Jerome.

Sinclair revisited: beyond idiom and open choice

81

Le Fur, Dominique (ed.) 2007 Dictionnaire des combinations de mots: Les synonymes en contexte. Pans: Le Robert. Partington, Alan 1998 Patterns and Meanings: Using Corpora for English Language Research and Teaching. Amsterdam/Philadelphia: Benjamins. Salkoff, Morris 1999 A French-English Grammar: A Contrastive Grammar on Translation^ Principles. Amsterdam: Benjamins. Siepmann,Dirk 2002 Ergenschaften und Formen lexrkalrscher Kollokatronen: Wider em zu enges Verstandms. Zeitschrift fur franzosische Sprache und Literatur 3: 240-263. Siepmann,Dirk 2005a Discourse Markers across Languages: A Contrastive Study of Second-Level Discourse Markers in Native and Non-Native Text with Implications for General and Pedagogic Lexicography. Abingdon/New York: Routledge. Siepmann,Dirk 2005b Collocation, colligation and encoding dictionaries. Part I: Lexicological aspects. International Journal of Lexicography 18 (4): 409444. Siepmann,Dirk 2006a Collocation, colligation and encoding dictionaries. Part II: Lexicographical aspects. International Journal of Lexicography 19 (1): 139. Siepmann,Dirk 2006b Collocations et dictionnaires d'apprentissage onomasiologiques: questions aux theonciens et pistes pour l'avenir. Langue francaise 150:99-117. Sinclair, John McH. 1991 Corpus, Concordance, Collocation. Oxford: Blackwell. Sinclair, John McH. 1996 The search for units of meaning. Textus IX: 75-106. Sinclair, John McH. 1998 The lexical item. In Contrastive Lexical Semantics, Edda Weigand (ed.), 1-24. Amsterdam: Benjamins. Stem, Stephan 2001 Formelhafte Texte: Musterhaftigkeit an der Schmttstelle zwischen Phraseologie und Textlmguistik. In Phraseologie und Phraseodidaktik, Marline Lorenz-Bourjot and Hemz-Helmut Liiger (eds.), 21-40. Vienne:Praesens.


Stubbs, Michael 2002 Words and Phrases: Corpus Studies of Lexical Semantics. Oxford: Blackwell.
Vinay, Jean-Paul and Jean Darbelnet 1958 Stylistique comparée du français et de l'anglais. Paris: Didier.

Sources used

Dexter, Colin 1991 The Third Inspector Morse Omnibus. London: Pan Books.
Dexter, Colin 2000 The Remorseful Day. London: Pan Books.
Green, Julian 1942 Memories of Happy Days. New York: Harper.
Murdoch, Iris 1965 The Red and the Green. New York: The Viking Press.
Murdoch, Iris 1967 Pâques sanglantes. Paris: Mercure de France.
Murdoch, Iris 1987 L'apprenti du bien. Paris: Gallimard.
Murdoch, Iris 2001 The Good Apprentice. London: Penguin.
Wood, Mary 1995 Thème anglais: Filière classique. Paris: Presses Universitaires de France.

Appendix: Collocations of impression

Collocations printed in small capitals have not been recorded in Le Fur (2007) and Crowther, Dignen & Lea (2002). For some collocations, additional information has been provided in the form of examples or explanations. Question marks indicate that the dictionary in question offers no equivalent for a particular sense. Arrows provide cross-references to other types of equivalents.

1. V + impression

Le Robert des combinaisons de mots

Oxford Collocations Dictionary for Students of English

avoir, éprouver, ressentir, retirer

form, gain, get, have, obtain, receive, be under
convey, create, give (sb), leave sb with, provide (sb with), MAKE (he really made an impression on me)

causer, créer, produire, provoquer, susciter, dégager, donner, faire, laisser, procurer
?
confirmer, conforter, corroborer
aggraver, accentuer, accroître, ajouter à
atténuer, tempérer
corriger, rectifier
contredire, démentir
dissiper, effacer, gommer
céder à, se fier à, garder, conserver, rester sur
ÉVITER DE DONNER, ~ se défaire de (on a du mal à se défaire de cette impression)
?
communiquer, confier, décrire, échanger, exprimer, livrer, raconter
entendre, recueillir (recueillir les impressions de M. Mitterrand après le voyage)
RAPPORTER DE (je rapporte une impression favorable de cet entretien),

maintain
confirm, BEAR OUT
heighten, reinforce, strengthen, INTENSIFY

correct
BELIE

?
-> BE LEFT WITH ~
avoid (It was difficult to avoid the impression that he was assisting them for selfish reasons)
record (She recorded her impressions of the city in her diary)
EXCHANGE

?

COME AWAY WITH (I came away with a favourable impression of that meeting)

REPARTIR AVEC

ÉMOUSSER (l'habitude avait émoussé chez lui cette impression d'aventure qu'il ressentait dix ans auparavant)

DULL

2. impression + ADJ

Le Robert des combinaisons de mots

Oxford Collocations Dictionary for Students of English

personnelle

personal
subjective
DOMINANT, main, overriding, overwhelming

SUBJECTIVE

dominante
générale, D'ENSEMBLE, GLOBALE

general (1), overall

NOUVELLE (Qui sait ce que seront ses pensées, lorsque de nouvelles impressions naîtront en elle !)
DERNIÈRE(S)

final

répandue (rare)
première
DEUXIÈME (rare)
confuse, diffuse, vague
JUSTE
fugace, fugitive
légère (j'ai une légère impression de déjà-vu)

general (2), widespread, COMMON

early, first, immediate, initial, INSTANT
SECOND
vague, CONFUSED, CERTAIN

accurate, right
fleeting, BRIEF

superficial (cf. also I have a slight / vague feeling of déjà-vu)

contrastée (toujours avec de : à la fois juvénile et maternelle, Pauline Arnoult laissait une impression contrastée de jeune femme fatale et de maturité [...])

?

contraire (X, qui est certainement un très haut magistrat, ne nous a pas donné une impression contraire)

opposite, CONTRARY (we apologise if the contrary impression was conveyed)

bonne, excellente, positive

excellent, favourable, good (better, best), great, right

défavorable, désagréable, DÉPLAISANTE, DOULOUREUSE, fâcheuse, frustrante, mauvaise, navrante, négative, pénible, piètre, catastrophique, déplorable, désastreuse, détestable, épouvantable, horrible, TERRIBLE, SALE, TRISTE (j'ai la triste impression d'une occasion manquée)
erronée, fausse, illusoire, trompeuse

bad, poor, unfavourable, negative, NASTY

distorted, erroneous, false, mistaken, misleading, spurious, wrong, MERETRICIOUS, CONTRADICTORY, DAMAGING, UNFAIR, UNFORTUNATE

énorme, grosse, profonde, vive, forte, grande, intense, nette, PUISSANTE

BIG, DEFINITE, REAL, considerable, deep, powerful, profound, strong, tremendous, ENORMOUS, MASSIVE, SERIOUS, SIGNIFICANT; clear, vivid, UNMISTAKEABLE; distinct, firm, strong; DISTINCT; FORCEFUL; KEEN

étonnante, frappante, incroyable, saisissante, étrange
ambiguë, bizarre, curieuse, drôle de, étrange, singulière, troublante, trouble
indéfinissable, indescriptible, inexplicable
indélébile, inoubliable, mémorable, TENACE
agréable, délectable, délicieuse, douce, enivrante, grisante, voluptueuse, apaisante, rassurante, HEUREUSE, SALUTAIRE
convaincante

EERIE -> feeling

DANGEROUS, UNCOMFORTABLE, UNNERVING

abiding, ENDURING, indelible, lasting, LINGERING, PERMANENT
-> feeling

convincing

PEU DE (rare), MOINDRE (Chirac s'est bien gardé de laisser transparaître dans son discours la moindre impression de frilosité ou de doute)

LITTLE (this made very little impression on me) (très fréquent), NOT THE SLIGHTEST

COLLOCATIONS « GRAMMATICALES » / COLLIGATIONS

DE

OF + N

DE + INF
QUELLE + IMPRESSION
QUELQUE (Son discours ne laissa pas de faire quelque impression sur moi)
QUELQUES
PRON + PROPRE(S)

OF + V-ING (he gives every impression of soldiering on)
OF S.O. AS S.TH. (the abiding impression is of Alan Bates as a wonderfully clenched Hench)
FROM (my abiding impression from the Matrix Churchill documents was that [...])
NOT MUCH OF AN
S.O'S IMPRESSION IS OF S.TH. (my chief impression was of a city of retiree [...])
THE (ADJ) IMPRESSION IS ONE OF (e.g. great size)
A (ADJ)/THE IMPRESSION IS THAT + CLAUSE
EVERY
OWN (my own impression from the literature is that [...])
THE SAME
SOME
CLINICAL (special.) (the clinical impression of hepatic involvement)

Accessing second-order collocation through lexical co-occurrence networks

Eugene Mollet, Alison Wray and Tess Fitzpatrick

1. Introduction²

Before the computer age, there were many questions about patterns in language that could not be definitively answered. Dictionaries were the primary source of information about word meaning, and said little if anything about the syntagmatic aspect of semantics - how words derive meaning from their context. Now, with computational approaches to language description and analysis, we have an Aladdin's cave of valuable information, and can pose questions, the answers to which derive from the sum of many instances of a word in use. However, there remain questions with elusive answers, and there is still an onward journey for linguistic research that is dependent on new advances in computation. This paper offers one contribution to that journey, by proposing a method by which new and useful questions can be posed of text. It addresses a challenge that Sinclair identified: "What we still have not found is a measure relating the importance of collocation events to each other, but which is independent of chance" (Sinclair, Jones and Daley [1970] 2004: xxii, emphasis added).

Specifically, we have applied an existing analytic method, network modelling, to the challenge of finding out about patterns of lexical co-occurrence. To date, networks have been used little, if at all, as a practical means of extracting collocation, and for good reason: they are computationally heavy and render results that, even if different enough to be worth using, remain broadly compatible with those obtained using the existing statistical approaches such as T-score (see later). However, the very complexity that makes networks a disadvantageous option for exploring basic co-occurrence patterns also presents a valuable opportunity. For encoded in a network is information that is not available using other methods, regarding notable patterns of behaviour by one word within the context of another word's co-occurrence patterns.

It is this phenomenon, which we term 'second-order collocation',³ that we propose offers very valuable opportunities not only for investigating subtle aspects of the meanings of words in context, but also for work in a range of applied domains, including critical discourse analysis, authorship and other stylistics studies, and second language acquisition research.⁴

1.1. Conceptualising 'second-order collocation' analysis

The information available through second-order collocation analysis can be illustrated with an analogy that reflects, appropriately, social networking, for the network modelling we will apply derives from that sphere of enquiry (compare Milroy 1987). Let us take as our 'corpus' a large academic conference lasting several days. Our 'texts' are the many different events that take place: plenary and parallel presentations, meals, coffee breaks, visits to the book stall and bathroom, walks back to the hotel, and so on. Our target 'word' is a particular academic attending the conference, and our analysis aims to ascertain his interests and friendships, by examining whom he spends his time with. Of course, he will be found in the proximity of many more people than he knows and will probably talk to many people of no personal or professional significance to him. But in the course of the conference it should be possible to establish, on the basis of the emerging patterns, whom he most likes to spend time with, and which other people must share his interests even if he never speaks to them, because they turn up in the same rooms at the same time, to hear the same papers.

Were the academic in question to be asked what sort of impression such an analysis would give of his profile of professional acquaintances and interests, he would no doubt first of all say that no single conference could capture everything about him - as no corpus can capture everything about a word. Next, he might observe that his behaviour at a conference depended on more than just what papers were on offer and whom he spotted in the crowd: there are other dynamics, ruled by factors less directly or entirely beyond his control. For example, if he had his talented post-doctoral researcher with him, he might make a point of introducing her to people he thought might have a job coming up in their department - such people he might otherwise not prioritize speaking to. Meanwhile, as a result of current politics in his department, he might need to be careful about being seen to fraternize with anyone his Department Chair viewed as 'the enemy' - so a lot would depend on where his Department Chair was at the time. Furthermore, if he were very candid, he might admit that the amount of time he spends with a certain female acquaintance was determined in large measure by whether or not his wife's brother was at the conference.

We see that the pattern of strong 'collocations' associated with this academic will depend not only on who is there, whom he knows, and what local 'semantic' contexts he finds himself in, but also on how his own interactions are affected by the presence of others and their own acquaintances, allies and enemies. In short, we cannot isolate this one academic's relationships from the deep and broad context of all the other relationships at the conference.

In corpus analysis, although some measures, such as Mutual Information, take into account not only how much word A is attracted to word B but also the reverse - something that has a clear parallel in our academic's conference experience - overall, analyses of word collocation are not all that sophisticated in their ability to reflect the dynamic of secondary relationships - how one word's behaviour is influenced by the collocational behaviour of other words. Sinclair's (2004: 134) discussion of 'reversal' is highly relevant here:

Situations frequently arise in texts where the precise meaning of a word or phrase is determined more by the verbal environment than the parameters of a lexical entry. Instead of expecting to understand a segment of text by accumulating the meanings of each successive meaningful unit, here is the reverse; where a number of units taken together create a meaning, and this meaning takes precedence over the 'dictionary meanings' of whatever words are chosen.

Sinclair goes on to say that coherence in text is achieved naturally if there is an obvious relationship between the meaning of the individual item and that added by the environment. But if there is not, the reader must work harder, variously inferring that a rare meaning of the item is intended, or a metaphorical or ironic one.

Capturing second-order patterns is computationally extremely greedy.
Nevertheless, our method does it by economizing in ways relevant to the analysis. To understand how, it is useful to return to the analogy. Were we to have unlimited time and resources, we could track every single person at the conference, rather than just the one, to gain the perfect picture of all interactions. However, this computationally expensive procedure would render much more information than we need. Our interest in the Department Chair, for instance, extends only to those instances when he is located close to our target academic, since we want to know how our target's behaviour in relation to some third party is influenced by his presence.
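The economy described here can be sketched in a few lines of Python. This is a minimal sketch and not the authors' network procedure: plain co-occurrence counts stand in for the network, and the function name, the window size of four words and the toy sentences are all invented for illustration. The function keeps only those contexts of a target word in which a chosen influencing word is also present, and counts the other words found there.

```python
from collections import Counter


def second_order_collocates(sentences, target, influencer, window=4):
    """Count collocates of `target`, using only windows that also contain `influencer`.

    Hypothetical helper for illustration; simple counts replace the network model.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            span = tokens[lo:i] + tokens[i + 1:hi]
            if influencer not in span:
                continue  # economise: ignore contexts without the influencing item
            counts.update(t for t in span if t != influencer)
    return counts


# Invented toy sentences, not corpus data.
toy_corpus = [
    "the result was highly statistically significant in both groups",
    "a statistically significant difference was highly unusual",
    "the festival is a socially significant event for the town",
    "marriage was highly significant for the family",
]

with_stat = second_order_collocates(toy_corpus, "significant", "statistically")
with_soc = second_order_collocates(toy_corpus, "significant", "socially")
print(with_stat["highly"], with_soc["highly"])
```

Comparing the two counters shows how the presence of one influencing item rather than another changes which words turn up around the target, which is the kind of contrast the method is designed to expose.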

For this reason, in our approach to networking we constrain the model to provide information only about situations in which the target item and the influencing item co-occur. For instance, if we were interested in the co-occurrence patterns associated with the word SIGNIFICANT, we might wish to establish how its collocation with words such as HIGHLY is affected by the presence, in the same window, of the word STATISTICALLY on the one hand and SOCIALLY on the other. We would conduct one network analysis to capture information about the interrelationships of all the collocates occurring with SIGNIFICANT when STATISTICALLY is also present, and another for when SOCIALLY is present. By comparing the two, we would be able to ascertain how the presence of the 'influencing' items affects the collocation behaviour of the primary item. Although this does require computational power because the entire 'space' of all the interrelationships is calculated, the computational costs are reduced relative to computing all of the interrelationships in the entire text or corpus, by extracting information only from the contexts where the primary and secondary items co-occur.

1.2. Outline of the paper

In the remainder of this paper we develop the case for using network models and exemplify the process in detail. In section 2, we review the existing uses of lexical networks in linguistic analysis and consider the potential and limitations of using them to examine first-order collocations. We also explain what a network is and how decisions are taken about the parameters for its construction. In section 3 we describe the network model adopted here and illustrate the model at work using a corpus consisting of just two lines of text by Jane Austen. In section 4 we demonstrate what the model can do, exploring the lexical item ORDER in the context of the lexical item SOCIAL on the one hand and of the tag on the other. Finally, we suggest how second-order collocation information might be used in linguistic analysis.

2. Modelling language using lexical networks

Network models have been employed for a surprisingly diverse variety of linguistic data, but seldom to extract information about collocations directly. Two useful overviews of network modelling in language study are those of Solé, Murtra, Valverde and Steels (2005) and Steyvers and Tenenbaum (2005). Meara and colleagues use network models to interpret word association experiments (e.g. Meara 2007; Meara and Schur 2002; Schur 2007; Wilks and Meara 2002, 2007; Wilks, Meara and Wolter 2005). Others have used them to express thesaural relations (Holanda, Pisa, Kinouchi, Martinez and Ruiz 2004; Kinouchi, Martinez, Lima, Lourenco and Risau-Gusman 2002; Motter, de Moura, Lai and Dasgupta 2002; Sigman and Cecchi 2002; Zlatic, Bozicevic, Stefancic and Domazet 2006), phonological neighbourhoods (Vitevitch 2008), syntactic dependencies in Chinese (Liu 2008) and in English (Ferrer i Cancho 2004, 2005, 2007; Ferrer i Cancho, Capocci and Caldarelli 2007; Ferrer i Cancho, Solé and Köhler 2004), lemma, type and token co-occurrence (Antiqueira, Nunes, Oliveira and Costa 2007; Caldeira, Petit Lobao Andrade, Neme and Miranda 2006; Masucci and Rodgers 2006), syllables in Portuguese (Soares, Corso and Lucena 2005) and Chinese characters (Li and Zhou 2007; Zhou, Hu, Zhang and Guan 2008). Network studies of collocation include Ferrer i Cancho and Solé (2001), Magnusson and Vanharanta (2003) and Bordag (2003). However, the tendency has been to construct networks of collocations previously extracted rather than using the network model as the basis for the extraction,⁵ something which fails to encode the additional layers of information that we exploit in our procedure. Ferret's (2002) approach does extract collocations on the basis of the network, using them to sort text extracts by topic. Park and Choi (1999) experiment with thesaurus-building using a collocation map constructed from probabilities between all items on the map. These approaches nevertheless differ from our method, because we are able to interpret relationships between two collocates relative to a third.

At its simplest, a network consists of a collection of nodes connected by lines. Depending on the purpose of the model, the nodes may represent, inter alia, words in an individual's receptive or productive vocabulary, sounds, graphemes, concepts, morphemes, or something else. For example, in Schur's (2007) word association research, the nodes are a finite set of stimulus words, joined to indicate which other stimulus word or words the subject selected as plausible associative partners. Mapping lexical knowledge in this way offers two different sorts of opportunity. Semantic network models focus on what is similar across individuals' knowledge, so that one can talk about the associative properties of sets of words in a language. Such research may seek to explain typical patterns of interference between words or concepts in the same semantic field, such as in terms of competition during spreading activation across the network (e.g. Abdel Rahman and Melinger 2007: 604-605). In contrast, word association studies like Schur's are typically used to seek differences between individuals' knowledge networks.

The detail of how a network is constructed depends upon decisions about what should serve as a node and the parameters that should apply for connecting nodes. For a given set of nodes, the more connections there are between them, the denser the network will be (figure 1). One of the major challenges in network modelling is selecting parameters that reveal the most useful amount of information. Much as in the more standard approaches to studying collocation in corpora, decisions must be made about the length of string under scrutiny and about frequency. In both approaches, thresholds are applied, to thin out the representation until it is manageable.

Figure 1. (a) a sparsely connected network; (b) a densely connected network
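The relation between connections and density, and the effect of applying a threshold, can be made concrete with a small worked example. The sketch below is an illustration only: it assumes an undirected network, uses the standard definition of density as the share of possible node pairs that are actually connected, and works with invented node labels, weights and a threshold value that do not come from the authors' data.

```python
def density(nodes, edges):
    """Share of the possible undirected connections that are actually present."""
    n = len(nodes)
    possible = n * (n - 1) // 2
    return len(edges) / possible if possible else 0.0


def thin(weighted_edges, threshold):
    """Keep only the connections whose weight reaches the threshold."""
    return {pair: w for pair, w in weighted_edges.items() if w >= threshold}


# Invented five-node network; each weight could stand for a co-occurrence frequency.
nodes = ["a", "b", "c", "d", "e"]
weighted = {
    frozenset(("a", "b")): 5,
    frozenset(("a", "c")): 1,
    frozenset(("b", "c")): 3,
    frozenset(("c", "d")): 1,
    frozenset(("d", "e")): 4,
    frozenset(("a", "e")): 2,
}

before = density(nodes, weighted)   # 6 of 10 possible connections
thinned = thin(weighted, 3)
after = density(nodes, thinned)     # 3 of 10 possible connections
print(before, after)
```

Raising the threshold removes the weaker connections first, so the same set of nodes yields a sparser and more manageable picture.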

In network models, connection strengths can be expressed through weighting, on the basis of, for instance, frequency. In our analyses, weightings are determined on the basis of distance from the primary focus word, as described later. Deciding whether or not to include weighting is contingent on one's specific aims in an analysis. The same applies to the question of directionality: should the model encode information about whether (and, consequently, how often) a given word precedes or follows the reference word? In the case of text analysis, the decision may depend on the analyst's views about what the language under analysis permits in terms of semantic relationships based on order. Encoding directionality could, for example, lead to different profiles for pairs of lexical items in asymmetrical relationships of attraction in opposite directions, such as DAMSEL -> DISTRESS and HIGH [...]

Chunks and the effective learner

Dieter Götz

[...] passport + expire, linking both expressions to the same situation. The option of storing it simply as ablaufen = expire is not chosen, since this would or could mean that ablaufen can always be translated or rendered by expire (an example of the commonest strategic mistake in language learning). In other words, what you assemble in your memory is something like (Pass ablaufen) > situation < (expire passport), where Pass ablaufen would be optional since, as a German, you know it and could produce it anyway.

Imagine now that you want to produce something that might be phrased in German as Ihr Pass muss erneuert werden (i.e., literally, renewed). If passport > renew is available, the learner will have no difficulty. If this particular collocation is not at hand though, you might check whether there is something similar to Pass > erneuern, hit upon Pass > ablaufen, classify Your passport > expire as contextually synonymous (not strictly synonymous, of course) for your purposes and then render Ihr Pass muss erneuert werden as Your passport will expire. We then get

(Pass > ablaufen) > Situation < expire > passport
(Pass > erneuern) > Situation < expire > passport

and similarly

(Pass > verlängern) > Situation < expire > passport

Clearly, the concept of collocation is present here. But for a learner these "collocations" are primarily such chains of words, more or less adjacent words, that attract attention, regardless of further linguistic subclassification. It is unlikely that learners spend much energy on deciding whether e.g. raise an objection is primarily a collocation, a subcollocation, a valency, or whatever, as long as it comes in handy. I would conclude this from my own personal experience as a learner of Italian, where I noted down the following chunks, taken from an Easy Reader, with the meanings I (as a learner!) attribute to them, in brackets, without paying any attention to their linguistic "status": nel mese Febbraio ('in February'), quell' anno ('in that year'), il preside entrando ('the President, while coming in'), fino al mezzanotte ('until midnight'), Lei si deve calmare ('Do calm down'), raccontare tutto con ordine ('tell everything just as/in the order it happened (in good order)'), era cosa conosciuta ('everybody knew'), sentire un botto ('hear a shot'). What is common to them all is that I think they might be useful to me personally when I want to express myself in Italian and when I feel that I would not be able to produce them. (If I had time to learn ten such items a day, they would add up to some 10,500 in three years, a formidable collection.)
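The situation-linked storage sketched in this section, (Pass ablaufen) > situation < (expire passport), can be pictured as a small lookup structure. The following is only one hedged way of modelling it: the situation label, the `render` function and the `related` mapping are hypothetical names introduced for illustration, and the fallback step simply mimics the learner who, lacking passport > renew, falls back on a chunk judged contextually synonymous.

```python
# Situation label (invented) -> chunks met in that situation, per language.
store = {
    "passport no longer valid": {
        "de": ["Pass ablaufen"],
        "en": ["passport expire"],
    },
}


def render(german_chunk, store, related=None):
    """Return an English chunk stored for the same situation as `german_chunk`.

    If the chunk itself is not stored, try a related German chunk that the
    learner treats as contextually synonymous (hypothetical `related` mapping).
    """
    related = related or {}
    for candidate in (german_chunk, related.get(german_chunk)):
        if candidate is None:
            continue
        for chunks in store.values():
            if candidate in chunks["de"]:
                return chunks["en"][0]
    return None


# The learner's own judgement of what counts as "similar enough".
related = {"Pass erneuern": "Pass ablaufen", "Pass verlängern": "Pass ablaufen"}

direct = render("Pass ablaufen", store)
fallback = render("Pass erneuern", store, related)
unknown = render("Pass verlieren", store, related)
print(direct, fallback, unknown)
```

The point of keying the store on situations rather than on single words is that nothing in it licenses the word-for-word equation ablaufen = expire.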

4. Metalingual faculties

Awareness of chunks - or the role of the idiom principle in the operation of language - is important for developing metalingual faculties. "Metalingual" shall here refer to the way in which foreign language learners process the pieces of the target language. An example: when inexperienced German learners want to store foreign language items, they link them to items in their native language, e.g. choose to wählen, carry to tragen, afraid to ängstlich, extend to verlängern etc. That is, when they produce an utterance in English, what they produce is a roughly translated equivalent of German, hopefully, and they use their kind of English as a metalanguage for German. Effective metalinguistic competence, however, is important for enlarging your general linguistic competence.

I have over the years asked many groups of students to give me a list of sentences which illustrate the different senses of carry. The result was always the same. A group of 10-20 students would produce three or four different types of meaning. Nearly all of them come up with carry as in She carried a basket, some with carry as in The train carried passengers and perhaps one or two with Flies can carry diseases. Usually, carry a gun, carry the roof, carry oneself like a ... and others, do not appear. Most likely, this is due to the students tagging German words with English ones (tragen = carry). What learners need is that they store several kinds of carry: one that is tied up with e.g. bags, babies, another that goes with train and people, a third with disease, a fourth with e.g. pillars and roof. Fossilization in learners (see Selinker 1972; Han and Selinker 2005) may of course be a partly psychological phenomenon - but learners who do not develop a complex awareness of situation, recurrence and chunks will never become achievers.

Any piece of language that is situationally correct and that learners have stored in their memory and that they can retrieve is one that allows metalinguistic inspection. So if you know that poach collocates with egg, tomato, fish then you might be able to paraphrase poach as 'cook in hot water in such a way that the shape of the food is preserved' - which would be very close to a native speaker's intuition. Or if you know that repair collocates with damage and words that signify it (such as leak), you will perhaps use the word mend with trousers (which is the usual collocation) since you have not yet met My trousers were repaired. Your analysis may not be quite correct (clothes etc. does collocate with repair, though rarely), but considerations like these will make you suspicious and lead you to choose something else if you want to play safe. It is up to you to run risks or not. While some people think that trusting native speakers is too risky, learners who trust themselves and their own poor translation run a much greater risk. Metalinguistic inspection may of course be applied to any level of linguistic description (from phonology to discourse analysis).

5. Skills and chunks

The function of repetition when acquiring language skills is more than obvious. Clearly, one of the most important keys to listening comprehension is repetition. Repetition equals redundancy and redundancy will raise the degree of expectability. Learners cannot learn listening by listening, but they can learn listening by detecting co-occurrent vocabulary. Fast reading is another skill that needs chunk stores. When writing, learners can choose to play safe and use only those stretches which they know to be correct,² and should they leave firm ground they will at least know that they are doing just that. Advanced conversational skills are another point. Here, repetition facilitates quick comprehension (and quick comprehension is necessary) and it is also the basis for producing prefabricated items as quickly as is normal. These items also help learners to gain and compete for the speaker's role. Moreover, chunks allow a non-native speaker to monitor their production and to know that what they said was what they meant.

6. Exploiting chunks

The concept of chunks, together with the implications for language learning, has been around for quite a number of years, cf. e.g. Braine (1971), Asher, Kusudo and de la Torre (1974), Götz (1976). Chunkiness, however, was not really a popular idea in advanced generative grammar - but for some time, and perhaps due to the idiomatic principle, it has no longer been frowned upon (see e.g. Sylviane Granger, this volume).

6.1. Bridge dictionaries

One particularly important field in this respect is, of course, lexicography.³ Surprisingly, even some modern dictionary-makers might need to catch up on chunks. OALD8 s.v. watch, illustrates the pattern ~ where, what, etc... by Hey, watch where you're going! This is a good example, but only under very favourable circumstances. It can reinforce the learner's knowledge in case he or she already knows the phrase. Learners who do not know it cannot decide what it really means, to what kind of situation it really refers. Does it mean a) 'make sure you take a direction/road etc. that leads to where you want to go', perhaps, or specifically, b) '... where you want to go in life', or c) 'be inquisitive about the things around you!' or perhaps d) 'look where you set your foot, might be slippery, muddy, etc.'. Learners cannot know intuitively that d) is correct, and hence this example needs some comment or a translation, e.g. Pass auf, wo du hintrittst! in German. (Admittedly, the hey would be a kind of hint for those that know.) Examples of usage might be chosen and translated in such a way that they indicate clearly what sort of situation they refer to - and can show how coselection (see e.g. Sinclair 1991) works.

In short, we are approaching the idea of a bridge dictionary - one of the many ideas suggested by John Sinclair. In a bridge dictionary, foreign language items are presented in the learner's native language. Using this kind of metalanguage will ensure that a learner has no difficulty understanding what is said even if it is fairly subtle.⁴ Incidentally, a COBUILD-style explanation is one that tries to depict a situation, cf. "1 If you watch someone or something, you look at them, usually for a period of time and pay attention to what is happening" (COBUILD4). To my knowledge, various lexicographers (including myself) have tried to find publishers for bridge dictionaries (such as English - German, English - Italian etc.), but they have tried in vain. However, a dictionary that contains information like the following article (based here on an OALD version) need not necessarily become a flop:

watch [...] Verb 1 mit Aufmerksamkeit schauen, zuschauen, beobachten: watch + N watch television/watch football fernsehen, Fußball schauen, gucken: Watch (me) carefully Schau gut zu, Pass gut auf, wie ich es mache; "Would you like to play?" - "No, I'll just watch" ... Nein, ich kucke bloß zu; watch + to + Verb He watched to see what would happen Er schaute hin um mitzukriegen, was passieren würde; watch + wh- She watched where they went Sie schaute wohin sie gingen; watch + N + -ing She watched the children playing Sie schaute den Kindern beim Spielen zu; watch + Infinitiv She watched the children cross the road Sie sah, wie die Kinder über die Straße gingen 2 watch (over) + N sich um etwas oder jemanden kümmern, indem man darauf aufpasst: Could you watch (over) my clothes while I swim Passt du auf meine Kleider auf, während ich beim Schwimmen bin? 3 (Umgangssprache) auf das aufpassen, was man tut, etwas mit Sorgfalt tun: Watch it! Pass auf! Watch yourself. Pass auf und ... (fall nicht hin, sag nichts Falsches, lass dich nicht erwischen) You'd better watch your language Überleg dir, wie du es formulierst Watch what you say Pass auf, was du sagst! watch + for + N (ausschauen und) warten, dass jemand kommt oder dass etwas passiert: You'll have to watch for the right moment Du musst den richtigen Zeitpunkt abpassen; watch + out (besonders im Imperativ) aufpassen, weil Vorsicht nötig ist Watch out! There's a car coming Achtung! ...; watch + out + for + N 1 konzentriert zuschauen, hinschauen, damit einem nichts Wichtiges entgeht: The staff were asked to watch out for forged banknotes Die Angestellten mussten sorgfältig auf gefälschte Geldscheine achten 2 bei etwas sehr vorsichtig sein: Watch out for the steps they're rather steep Pass bei der Treppe auf ...

6.2. Collocations and patterns

Concomitance of words is due to the fact that some situations are alike, or viewed alike. It is imperative for a learner to be aware of this phenomenon, and hence it should be an integral part of learner's dictionaries. Although the recent editions of English learners' dictionaries such as the Longman Dictionary of Contemporary English have made great progress in this direction, Langenscheidt's Großwörterbuch Deutsch als Fremdsprache (1993) and its derivatives can be seen as one of the first really systematic treatments of collocations in this type of dictionary as it provides lists of collocations for many headwords: the entry for e.g. Sturm contains a section in < >, namely . A list like this need not be representative or exhaustive or meticulously structured - its main purpose is to demonstrate concomitant words (some of them certainly useful) and remind the learner of concomitance.

A surface syntactic pattern, such as Noun + Verb + Noun, is of course much too general to make sense as a chunk. Usually, however, such a surface pattern is in reality a kind of cover term for "chunkable" items, provided the pattern is filled semantically. In the case of e.g. the verb fly we get several distinguishable semantic subpatterns of the syntactic pattern Noun + Verb + Noun, depending on the actants' reference, such as pilot + fly + plane, passenger + fly + airline, plane + fly + distance/direction, pilot + fly + people (+ direction); and others (see Götz-Votteler 2007). Coverage of this kind of pattern must probably be restricted to specialized dictionaries such as the Valency Dictionary of English (2004) or textbooks such as Heringer's (2009) collection of "valency chunks".⁵ Such collections of chunks are of course not lists of items to be learnt by heart. They serve as a range of offers from which you can choose if need be and, more importantly, they can serve as evidence of how language works and offer effective ways of learning a foreign language. In any case, dealing with collocations and item-specific constructions, and thus doing justice to Sinclair's idiom principle, will remain one of the great challenges in the future - in language teaching and lexicography.⁶

Notes

1 For the distribution of chunks across registers see view.byu.edu.
2 See de Cock (2000).
3 See e.g. Siepmann (2005).
4 Lexicographers of English often underrate the difficulties that learners have in understanding explanations. According to LDOCE, the meaning of charge a price is 'to ask someone for a particular amount of money for something you are selling': Does ask mean 'put a question' or 'beg' or 'request'? Who is you? Can services be sold?
5 It is very attractive to assume that "valency chunks" might be useful material for learners. Heringer (2009) is a collection of such syntactic-semantic chunks in German, some thirty chunks for about eighty verbs each. Here is, in a simplified form, a selection of chunks containing antworten: Da hab ich geantwortet; Was würden Sie antworten, wenn; antwortete er ... er wisse nicht ...; was soll man darauf antworten; zögert kurz und antwortet dann; antwortet er auf die Frage warum; auf einen Brief geantwortet. The chunks themselves were determined by co-occurrence analyses (cf. Belica 2001-2006).
6 I would like to thank Tony Hornby and the editors of this volume for their comments on an earlier draft of this article.

References

Asher, James J., Jo Anne Kusudo and Rita de la Torre 1974 Learning a second language through commands: The second field test. Modern Language Journal 58 (1/2): 24-32.
Belica, Cyril 2001-06 Kookkurrenzdatenbank CCDB. Eine korpuslinguistische Denk- und Experimentierplattform für die Erforschung und theoretische Begründung von systemisch-strukturellen Eigenschaften von Kohäsionsrelationen zwischen den Konstituenten des Sprachgebrauchs. Institut für Deutsche Sprache, Mannheim.
Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad and Edward Finegan 1999 Longman Grammar of Spoken and Written English. Edinburgh: Longman.
Braine, Martin Dan Isaac 1971 On two types of models of the internalization of grammars. In The Ontogenesis of Grammar: A Theoretical Symposium, D. I. Slobin (ed.), 153-186. New York: Academic Press.
de Cock, Sylvie 2000 Repetitive phrasal chunkiness and advanced EFL speech and writing. In Corpus Linguistics and Linguistic Theory, Christian Mair and Marianne Hundt (eds.), 51-68. Amsterdam/Atlanta: Rodopi.
Götz, Dieter 1976 Textbezogenes Lernen: Aspekte des Fremdsprachenerwerbs fortgeschrittener Lernender. DNS 75: 471-478.
Götz, Dieter, Günther Haensch and Hans Wellmann (eds.) 2010 Großwörterbuch Deutsch als Fremdsprache. Berlin/München: Langenscheidt.
Götz-Votteler, Katrin 2007 Describing semantic valency. In Valency: Theoretical, Descriptive and Cognitive Issues, Thomas Herbst and Katrin Götz-Votteler (eds.), 37-50. Berlin/New York: Mouton de Gruyter.
Granger, Sylviane 2011 From phraseology to pedagogy: Challenges and prospects. This volume.
Han, ZhaoHong and Larry Selinker 2005 Fossilization in L2 Learners. In Handbook of Research in Second Language Teaching and Learning, Eli Hinkel (ed.), 455-470. Mahwah, NJ: Erlbaum.
Herbst, Thomas, David Heath, Ian F. Roe and Dieter Götz (eds.) 2004 A Valency Dictionary of English: A Corpus-Based Analysis of the Complementation Patterns of English Verbs, Nouns and Adjectives. Berlin/New York: Mouton de Gruyter.
Heringer, Hans Jürgen 2009 Valenzchunks: Empirisch fundiertes Lernmaterial. München: Iudicium.
Selinker, Larry 1972 Interlanguage. IRAL 10 (2): 209-231.
Siepmann, Dirk 2005 Collocation, colligation and encoding dictionaries. Part I: Lexicological aspects. International Journal of Lexicography 18: 409-443.
Sinclair, John McH. 1991 Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Tognini-Bonelli, Elena and Elena Manca 2004 Welcoming children, pets and guests: Towards functional equivalence in the languages of 'Agriturismo' and 'Farmhouse Holidays'. In Advances in Corpus Linguistics: Papers from the 23rd International Conference on English Language Research on Computerized Corpora (ICAME 23), Göteborg 22-26 May 2002, Karin Aijmer and Bengt Altenberg (eds.), 371-385. Amsterdam/New York: Rodopi.

Dictionaries

Longman Dictionary of Contemporary English 2009 edited by Michael Mayor. Harlow: Pearson Longman. 5th edition. [LDOCE5]
Oxford Advanced Learner's Dictionary of Current English 2010 by A. S. Hornby, edited by Sally Wehmeier. Oxford: Oxford University Press. 8th edition. [OALD8]

Corpus

BNC The British National Corpus, version 3 (BNC XML Edition). 2007. Distributed by Oxford University Computing Services on behalf of the BNC Consortium, http://www.natcorp.ox.ac.uk/.

Exploring the phraseology of ESL and EFL varieties

Nadja Nesselhauf

1. Introduction

John Sinclair was among the leading proponents of the centrality of phraseology, or what he referred to as the "idiom principle", in language. He also advocated that this aspect of language be investigated with a corpus approach. These convictions have been proved right over and over again by what must by now be tens of thousands of corpus-based studies on the phraseology of (L1) English (and other languages). The study of the phraseology of EFL varieties1 has also intensified over the past few years, although only a relatively small proportion of this work is corpus-based. What are rare to date, however, are studies, in particular corpus-based ones, on the phraseology of ESL varieties. What are practically non-existent (in any type of approach) are comparisons of the phraseology of ESL and EFL varieties. Given the pervasiveness of the phenomenon in any variety and the relatedness of the two types of variety, this is a gap that urgently needs to be filled. The present paper therefore explores phraseological features in ESL and EFL varieties and investigates to what degree and in what respects the phraseology of the two types of variety is similar.2

The paper starts out by providing a brief overview of previous research as well as an overview of the corpora and methodology used for the investigation. Then, three types of analyses into the phraseology of ESL and EFL varieties will be presented. In Section 3.1, it will be investigated how "competing collocations" (or collocations that share at least one lexical element and are largely synonymous) are dealt with in the two types of varieties. In 3.2, the treatment of internally variable collocations will be considered. Finally, I am going to look at what have been referred to as "new prepositional verbs" (Mukherjee 2007), i.e. verbs that are simple verbs in L1 English but have become or are treated as if they were verb-preposition collocations (or prepositional verbs) in ESL or EFL varieties (Section 3.3).


2. Previous research and methodology

2.1. Previous research

Studies on phraseological features of any kind in ESL varieties are rare so far. Usually, a number of such features are included in studies of individual varieties, but only very few investigations actually focus on phraseology. Of those that do, many have a cultural impetus and are restricted to the examination of collocates of culturally loaded terms (e.g. Schmied 2004; Wolf and Polzenhagen 2007). Only a few studies are not restricted in this way, as for example Skandera (2003), who looks at idioms in Kenyan English, or Schilk (2006), who investigates collocations in Indian English. Comparisons of phraseological features across ESL varieties are rarer still and mostly consist of surveys of existing studies on individual varieties (which in turn are mostly based on anecdotal evidence; e.g. Crystal [1997] 2003; Ahulu 1995; Platt, Weber and Lian 1984). The number and scope of systematic, corpus-based studies on phraseological features across several ESL varieties are highly restricted to date (Schneider 2004; Mair 2007; and the papers by Hoffmann, Hundt and Mukherjee 2007 and Sand 2007). What is more, existing comparative studies tend to focus on differences between different ESL varieties rather than on common points. In spite of this, a few phraseological features or tendencies have been reported as occurring in several varieties: Schneider (2004), for example, finds that the omission of particles in phrasal verbs occurs in East African English, Indian English, Philippine English and Singapore English. Platt, Weber and Lian (1984) report that collocating the verbs open and close with electrical switches or equipment as in open the radio or close the light is common in six different ESL varieties of English (Hawaiian English, East African English, Hong Kong, Malaysian, Philippine and Singaporean English), and Sand (2007) finds occurrences of in the light of, discuss about and with regards to in four different ESL varieties (Singaporean, East African, Jamaican and Indian English).

The investigation of phraseological phenomena in learner language has a longer tradition than in ESL varieties, but is also usually limited to one specific learner variety and often restricted to the investigation of a small number of phraseological items by elicitation. One of the few exceptions is Kaszubski (2000), who investigates a great number of collocations in the language of learners with different L1 backgrounds. A great number of phraseological deviations from L1 English is also listed in the Longman Dictionary of Common Errors (Turton and Heaton [1987] 1996), which is based on analyses of a huge learner corpus (the Longman Learners' Corpus), which contains writings from foreign learners with a wide range of different L1 backgrounds. As in the case of phraseological ESL studies, a number of common points can also be inferred from a close reading of the existing investigations. For example, collocations involving high-frequency verbs appear to be a source of deviation across different L1 backgrounds (Kaszubski 2000; Shei 1999; Nesselhauf 2003, 2005). Various studies also reveal that learners are often unaware of collocational restrictions in L1 English and at the same time unaware of the full combinatory potential of words they know (cf. e.g. Herbst 1996; Howarth 1996; Channell 1981; Granger 1998).

Comparisons of ESL and EFL varieties, finally, are rare in general, with Williams (1987) and Sand (2005) being two notable exceptions. It seems that an important reason for this neglect is that the two fields of ESL variety research and research into foreign learner production (or "interlanguage") have remained separate to a large degree so far - a situation that clearly needs to be remedied.

2.2. Corpora and methodology

Three types of corpora were needed for the present investigation: ESL corpora, learner corpora, and, as a point of reference, L1 corpora. As a corpus representing several ESL varieties of English, I used the ICE-corpus (International Corpus of English). The varieties included in the present study are Indian English, Singaporean English, Kenyan English and Jamaican English (cf. table 1). It is important to note that the degree of institutionalization of English varies in these four countries and, in particular, that Jamaican English occupies a special position among the four, in that it may also with some justification be classified as an ESD (English as a second dialect) variety rather than as an ESL variety.3 The composition of all the ICE-corpora, with the exception of ICE-East Africa, is the same, with each subcorpus containing 1 million words in total, of which 60 % are spoken and 40 % written language. In the case of ICE-East Africa, only the Kenyan part was included in the present investigation, which contains slightly less than one million words and about 50 % of spoken and written language each.

Table 1. ICE-subcorpora used in the analyses4

Corpus                              Number of words
ICE-India                           1.14 million
ICE-Singapore                       1.11 million
ICE-Jamaica (version 11 June 07)    1.02 million
ICE-East Africa (Kenya only)        0.96 million

As a corpus representing foreign learner language of learners with different first language backgrounds, I used ICLE (International Corpus of Learner English). ICLE contains 2.5 million words of learner language from learners with the following 11 L1 backgrounds: Bulgarian, Czech, Dutch, Finnish, French, German, Italian, Polish, Russian, Spanish and Swedish. The text type predominantly represented in the corpus is argumentative essays. An important limitation of the study presented here therefore is the different composition of the ESL and the EFL corpora, in particular the fact that the ESL corpora contain both written and spoken language of various text types, while the learner corpus is restricted to one kind of written language. For better comparison, therefore, in some analyses only the written parts of the corpora will be considered. The difference in size of the ESL and EFL corpora is less problematic. When comparable size seemed desirable, I only used a one-million word (precisely: 0.98 million) subcorpus of ICLE, in which only writings by learners with the L1s Finnish, French, German and Polish (i.e. one from each language family) were included. This corpus will be referred to as ICLE-4L1 in what follows.

A further limitation of the study is the size of the corpora in general, which is small for many types of phraseological features. However, for a first exploration of the topic the corpora were considered to be adequate.

As a point of comparison, I chose British English as represented in ICE-GB and in the BNC (British National Corpus). ICE-GB has the same size and composition as the other ICE-corpora, and the BNC contains 100 million words of a great range of text types, of which 10 % are spoken and 90 % written texts. For better comparison, whenever frequencies from the whole of the BNC are provided, I also provide the frequencies for a fictitious corpus based on the BNC labelled "BNC"-ICE-comp., which is of the same size and general composition as the ICE-corpora.
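The text does not spell out how the "BNC"-ICE-comp. figures are derived. One plausible reading, sketched below, is to scale the spoken and written BNC frequencies separately to a one-million-word corpus with 60 % spoken and 40 % written material. The function name, the rounded word totals and the example counts are assumptions made for illustration and are not taken from the study.

```python
def scale_to_ice_composition(spoken_hits, written_hits,
                             bnc_spoken_words=10_000_000,
                             bnc_written_words=90_000_000,
                             ice_spoken_words=600_000,
                             ice_written_words=400_000):
    """Expected hits in a hypothetical 1-million-word corpus with ICE proportions.

    The default sizes are rounded from the figures given in the text
    (BNC: 100 million words, 10 % spoken; ICE: 1 million words, 60 % spoken).
    """
    spoken_rate = spoken_hits / bnc_spoken_words      # hits per spoken word
    written_rate = written_hits / bnc_written_words   # hits per written word
    return spoken_rate * ice_spoken_words + written_rate * ice_written_words


# Hypothetical counts, for illustration only (not taken from the study).
expected = scale_to_ice_composition(spoken_hits=500, written_hits=9000)
print(round(expected, 1))
```

Scaling the two modes separately matters because the BNC is mostly written while the ICE-corpora are mostly spoken; simply dividing the overall BNC frequency by 100 would carry the BNC's written bias into the comparison.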


3. Analyses and results

3.1. Competing collocations

The starting point for the analysis of competing collocations was an observation I made in my analysis of collocations in German learner language (e.g. Nesselhauf 2005). The learners tended to overuse the collocation play a role, while they hardly used the largely synonymous and structurally similar collocation play a part. In L1 English, on the other hand, both of these competing collocations occur with similar frequencies. The use of the two expressions was thus investigated in L1, L2 and learner varieties, to find out whether the behaviour of the ESL varieties in any way resembled the behaviour of the learner varieties and if so, to what degree and why. The results of this investigation (with only the written parts of the corpora considered) are provided in figure 1.

[Figure 1: bar chart with one bar per corpus (BNC, ICE-Jam, ICE-Sing, ICE-Ind, ICE-Ken, ICLE-4L1), showing the shares of PLAY + PART and PLAY + ROLE on an axis from 0% to 100%]

Figure 1. PLAY + ROLE / PART in the written parts of the BNC, ICE and ICLE

Here and elsewhere, the bars in the graphs indicate the relative frequencies of the relevant expressions, but the absolute frequencies are also given on each individual bar. The results confirm my earlier observations. In the written part of the BNC the two expressions have almost the same frequencies, and in ICLE-4L1, play a role is used in about 80 % and play a part in about 20 % of the occurrences of either expression. In the ESL varieties, the proportion of play a role is also consistently greater than that of play a part. Except in the case of Jamaican English, the proportion of play a role is even greater in the ESL varieties than in the learner varieties. So it seems that overuse of play a role at least partly at the expense of play a part is a feature of both types of varieties under investigation.

To find out whether this result is restricted to this particular pair of competing collocations or whether it reveals a more general tendency, another group of competing collocations was investigated: take into consideration, take into account and take account of. The results are displayed in figure 2.


!4 uhm do at Cambridge because < > from Agnieszka's point of view it was so difficult despite the fact that she's really good ()

Such structures are common enough to consider them as instances of a conventionalized grammatical construction, albeit one that is restricted to the spoken language. As always in corpus analyses of spontaneous speech, however, it is difficult to set the limits of what one usefully includes in one's counts. Thus, while the following example is clearly very interesting in terms of Paul Hopper's (1998) notion of emerging grammar (and similar constructions are in fact discussed in Hopper 2001 and 2004), it was not counted in the present analysis, as too much material intervenes between the two parts of an arguable cleft-sentence-like focus construction:

(5) what I like doing is uhm < > with the Pakistani children and the Indian children the infants when their tooth falls out in school and they cry < > and if they've got enough English I explain to them that in England < > you put it under the pillow ()

"Orderly" notions of grammatical structure inspired by written English fail here, leading to an analysis of the passage as an anacoluthon, with what I like doing is... representing a false start which is not taken up again. Alternatively, we could assume contamination of what I like doing with the Pakistani children and the Indian children the infants when their tooth falls out in school and they cry... and if they've got enough English I explain to them that in England you put it under the pillow. Seen in its discourse context and from the point of view of the speaker, however, this is an act of focussing, functionally equivalent to a cleft sentence What I like doing is [to] explain .... In other words, something which is partly ungrammatical at the syntactic level turns out to be a very successful instance of attention-getting and competent floor-holding in discourse-analytical terms.

But let us return to the realm of "grammar proper" even in our analysis of the spoken data (if only to ensure comparability with the findings obtained from the "Brown family"). Table 2 gives the frequencies of the four recurrent types of specificational clefts, i.e. those which could be considered conventional grammatical constructions, in the spoken corpus. "LLC" indicates that examples are from the "old" (London-Lund Corpus, 1958-1977) part of the DCPSE, whereas ICE-GB indicates origin in the "new" (ICE-GB, 1990-1992) part.

Table 2. Four types of specificational clefts in the DCPSE

          to-infinitive   unmarked infinitive   -ing   finite "echo" clause
LLC       24              9                     1      11
ICE-GB    18              31                    0      6

(Chi square to- vs. bare infinitive: p=0.0030)

Let us focus on the three constructions familiar from the written corpora first, that is the two types of infinitival complements and the rare -ing-complement. Here the most striking result is that the reversal of preferences in British English spoken usage is virtually simultaneous with the one observed in writing. Clearly, this is not what one would expect given the generally conservative nature of writing. As for -ing-complements, they are as marginal in this small diachronic spoken corpus as in the written corpora of the Brown family. The exclusively spoken finite-clause complement (All I did was I asked), on the other hand, is amply attested and apparently even on the rise in terms of frequency.

This raises an interesting question: Why is it that one innovative structure, the bare infinitival complement (All I did was ask), should show up in written styles so immediately and without restriction, whereas the other, the finite-clause type, should be blocked from taking a similar course? The reason is most likely that the finite-clause variant is not a grammatically well-formed and structurally complete complex sentence and therefore not felt to be fully acceptable in writing. That is, writers refrain from using it for essentially the same reasons that they shun left- and right-dislocation structures or the use of copy pronouns (e.g. this man, I know him; that was very rude, just leaving without a word). And just as such dislocation structures are presumably very old, the finite type of specificational cleft, unlike the unmarked infinitive, may not really be an innovation but an old and established structure which merely failed to register in our written sources.
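The significance value quoted for Table 2 can be approximated with a standard 2x2 chi-square test on the to-infinitive and unmarked-infinitive counts (LLC: 24 and 9; ICE-GB: 18 and 31). The following is a minimal sketch; the use of Yates' continuity correction is an assumption, since the text does not say which variant of the test was applied, and the function name is illustrative.

```python
import math


def chi_square_2x2_yates(a, b, c, d):
    """Chi-square test with Yates' continuity correction for the table [[a, b], [c, d]].

    Returns (chi2, p). The correction is an assumption; the source does not
    state which variant of the test was used.
    """
    n = a + b + c + d
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    chi2 = n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With one degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Rows: LLC, ICE-GB; columns: to-infinitive, unmarked infinitive.
chi2, p = chi_square_2x2_yates(24, 9, 18, 31)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

With the continuity correction the sketch yields a p-value of roughly 0.003, in line with the figure given under the table; without the correction the value would be somewhat smaller.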

Further corpus-based research on specificational clefts should proceed in two directions. On the basis of much larger corpora of (mostly written) English, it should be possible to determine the history and current status of the -ing-complement, which is not attested in sufficient numbers either in the Brown family or in the DCPSE. Possibly small but specialized corpora of speech-like genres (informal letters, material written by persons with little formal education, Old Bailey proceedings, etc.) are needed, on the other hand, to establish the potentially quite long history of the finite-clause type.

3. "Modality on the move" (Leech 2003)

Modal verbs, both the nine central modals and related semi-auxiliaries and periphrastic forms, have been shown to be subject to fairly drastic diachronic developments in twentieth and twenty-first century written English. The point has been made in several studies based on the Brown family (e.g. Leech 2003; Smith 2003; Mair and Leech 2006; Leech et al. 2009). Other studies, such as Krug (2000), have explored the bigger diachronic picture since the Early Modern English period and show that such recent changes are part of a more extended diachronic drift. Considering the central role of modality in speech and writing, modals are thus a top priority for research in the DCPSE.5

Table 3 shows the frequency of selected modal verbs and periphrastic forms in the oldest (1958-1960) and most recent (1990-1992) portions of the DCPSE. The restriction to the first three years, at the expense of the intervening period from 1961 to 1977, was possible because modals are sufficiently frequent. It was also desirable because in this way the extreme points of the diachronic developments were highlighted. What is a potential complication, though, is the fact that it is precisely the very earliest DCPSE texts which contain the least amount of spontaneous conversation, so that a genre bias might have been introduced into the comparison.

Table 3. Real-time evidence from spoken English - frequencies of selected modals and semi-modals of obligation and necessity in the DCPSE (Klein 2007)

DCPSE            1958-1960            1990-1992            Log lklhd   Diff (%)
                 total   n/10,000     total   n/10,000
must             38      10.21        195     4.63         16.61**     -54.67
(HAVE) got to    24      6.45         185     4.39         2.84        -31.90
HAVE to          34      9.13         555     13.17        4.97*       +44.21
need (aux.)      0       0.00         1       0.02         0.17
NEED to          1       0.27         116     2.75         13.15**     +924.80
Total            97      26.06        1052    24.97        0.16        -4.19

Log likelihood: a value of 3.84 or more equates with chi-square values of p < 0.05; a value of 6.63 or more equates with chi-square values of p < 0.01. *HAVE to 1958-60 vs. 1990-92: significant at p < 0.05; **must, need to 1958-60 vs. 1990-92: significant at p < 0.01. CAPITALIZED forms represent all morphological variants.

Much of what emerges from these spoken data is familiar from the study of contemporaneous written English: the dominant position of have to among the present-day exponents of obligation and necessity, the decline of must, the marginal status of need in auxiliary syntax and the phenomenal spread of main-verb need to in modal functions. Note, for example, that in the span of a little more than 30 years the normalized frequency of must drops from around 10 instances per 10,000 words to a mere 5, thus leading to its displacement as the most frequent exponent of obligation and necessity. By the early 1990s this position has been clearly ceded to have to. Note further that main-verb need to, which barely figured in the late 1950s data, has firmly established itself 30 years later. However, as table 4 shows, normalized frequencies (per 10,000 words of running text in this case) and, more importantly, relative rank of the investigated forms still differ considerably across speech and writing.
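The log-likelihood scores reported for the modals can be approximated from the raw and normalized frequencies. The sketch below uses the common two-corpus log-likelihood formula; this choice of formula is an assumption, as is the size of each subcorpus, which is not stated in the text and is back-calculated here from the must row (raw count divided by the rate per 10,000 words). All names are illustrative.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Two-corpus log-likelihood (G2) for one item observed freq1 and freq2 times."""
    total_freq = freq1 + freq2
    total_size = size1 + size2
    expected1 = size1 * total_freq / total_size
    expected2 = size2 * total_freq / total_size
    g2 = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Assumed subcorpus sizes, back-calculated from the "must" row of the table.
size_1958 = round(38 / 10.21 * 10_000)
size_1990 = round(195 / 4.63 * 10_000)

g2_must = log_likelihood(38, size_1958, 195, size_1990)
g2_need_to = log_likelihood(1, size_1958, 116, size_1990)
print(round(g2_must, 2), round(g2_need_to, 2))
```

Under these assumptions the results for must and NEED to come out close to the reported 16.61 and 13.15. Small deviations for other rows are to be expected, because the back-calculated corpus sizes inherit the rounding of the published rates.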

Table 4. Modals and semi-modals of obligation and necessity in their order of precedence in speech and writing (Smith 2003: 248; Klein 2007)

Rank    LOB             n/10,000   F-LOB           n/10,000   DCPSE 1958-60   n/10,000   DCPSE 1990-92   n/10,000
1       must            11.41      HAVE to         8.17       must            10.21      HAVE to         13.17
2       HAVE to         7.53       must            8.07       HAVE to         9.13       must            4.63
3       (HAVE) got to   4.11       NEED to         1.96       (HAVE) got to   6.45       (HAVE) got to   4.39
4       NEED to         0.54       (HAVE) got to   0.27       NEED to         0.27       NEED to         2.75
Total                   23.59                      18.50                      26.06                      24.94

The decline of must is less pronounced in writing than in speech - as would be expected for changes originating in the spoken language. By contrast, the drop in the frequency of (have) got to is sharper in writing than in speech. Growing reluctance to use this form in writing may be due to two factors. First, it has an informal stylistic flavour, and secondly it is one of the very few clear syntactic Briticisms. Its near elimination from written British English may thus be a sign of a trend towards greater homogenisation of formal and written language use in an age of globalization - an analysis which is consistent with the oft-proved sociolinguistic dichotomy of Schreibeinheit vs. Sprechvielfalt (Besch 2003; Mair 2007), roughly to be translated as "unity in writing" vs. "diversity in speech".

4. Autonomous change in writing: information compression in the noun phrase

Numerous corpus-based studies (e.g. Raab-Fischer 1995; Hinrichs and Szmrecsanyi 2007) have provided overwhelming evidence to show that the absolute frequency of s-genitives has increased in the recent past in written English corpora. What remains controversial is the question whether the observed statistical increase is due to more common occurrence in the traditional range of uses (the point of view defended in Mair 2006), or whether it is partly the result of an additional trend towards greater use of the s-genitive with inanimate nouns (cf. Rosenbach 2002: 128-176). The details of this controversy need not preoccupy us here; the major point relevant to the present discussion is that, as will be shown, all and any changes observed in genitive usage seem to be confined to writing (or writing-related formal genres of speech such as broadcast news). This emerges in striking clarity from a comparison of genitive usage in the Brown family and in the DCPSE. For ease of comparison, DCPSE figures have been normalized as "N per million words", with absolute frequencies given in brackets:6

Table 5. S-genitives in selected spoken and written corpora

                                                  "1930s"   "1960s"      "1990s"
DCPSE (spoken British English)                    n.a.      2037 (861)   1786 (775)
B-LOB, LOB & F-LOB (written British English)      4625      4962         6194
Brown & Frown (written American English)          n.a.      5063         7145
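The normalized figures in Table 5 can be compared across periods as relative changes. The short sketch below does this for the "1960s" and "1990s" columns; the function name and the dictionary labels are illustrative and not part of the original study.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


# S-genitives per million words, "1960s" vs. "1990s" (figures from Table 5).
rates = {
    "DCPSE (spoken BrE)": (2037, 1786),
    "LOB / F-LOB (written BrE)": (4962, 6194),
    "Brown / Frown (written AmE)": (5063, 7145),
}

for corpus, (old, new) in rates.items():
    print(f"{corpus}: {percent_change(old, new):+.1f}%")
```

Expressing the change as a percentage makes the three rows directly comparable even though their base rates differ considerably.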

The table shows that genitives in spoken language are consistently less frequent than in writing in both periods compared, which is an expected spin-off from the general fact that noun phrases in spontaneous speech tend to be much shorter and less structurally complex than in writing. More interesting, though, is the fact that while nothing happens diachronically in speech (with the frequency of genitives hovering around the 2,000 instances per million word mark), there are steep increases in the written corpora, which in the thirty-year interval of observation even document the emergence of a significant regional difference between American and British English.7 In other words, on the basis of the recent diachrony of the s-genitive (and a number of related noun-phrase modification structures),8 we can make the point that written language has had a separate and autonomous history from spoken English in the recent past. This history is apparently a complex one as the observed development manifests itself to different extents in the major regional varieties. How this partial diachronic autonomy of writing can be modelled theoretically is a question which we shall return to in the following section.

5. Conclusion: theoretical issues

Even in the case of English, a language endowed with a fantastic corpuslinguistic working environment, the real-time corpus-based study of ongoing grammatical change in the spoken language is a fascinatingly novel

Writing the history of spoken standard English in the twentieth century 189 perspective. It was opened up only a very short while ago with the publication of the DCPSE and is currently still restricted to the study of one single variety, British Standard English. As I hope to have shown in the present contribution, it is definitely worth exploring. Change which proceeds simultaneously in speech and writing is possible but rare. In the present study, it was exemplified by the spread of unmarked infinitives at the expense of ^-infinitives in specificational clefts. The more common case by far is change which proceeds broadly along parallel lines, but at differential speed in speech and writing. This was illustrated in the present study by some ongoing developments involving modal expressions of obligation and necessity. The recent fate o? have got to in British English, for example, shows very clearly that local British usage may well persist in speech while it is levelled away in writing as a result of the homogenizing influences exerted by globalized communication. Conversely, must, which decreases both in speech and writing, does so at a slower rate in the latter. The potential for autonomous developments in speech and writing was shown by finite-clause clefts (the Allldidwas Iasked-typc) and s-gemtives respectively. Of course, the fact that there are developments in speech which do not make it into writing (and vice versa) does not mean that there are two separate grammars for spoken and written English. Genuine structural changes, for example the grammaticalization of modal expressions, usually arise in conversation and are eventually taken up in writing - very soon, if the new form does not develop any sociolinguistic connotations of informality and non-standardness, and with a time lag, if such connotations emerge and the relevant forms are therefore made the object of prescriptive concerns. 
What leads to autonomous grammatical developments is the different discourse uses to which a shared grammatical system may be put in speech and writing. Spoken language is time-bound and dialogic in a way that formal edited writing cannot be. On the whole, spoken dialogue is, of course, as grammatical as any written text, but this does not mean that the grammatical-structural integrity of any given utterance unit is safe to the same extent in spontaneous speech as that of the typical written sentence. Structurally complete grammatical units are the overwhelming norm in writing but much more easily given up in the complex trade-offs between grammatical correctness, information distribution and rhetorical-emotional effects which characterize the online production of speech. This is witnessed by "dislocation" patterns such as that kind of people, I really love them or - in the context of the present study - the finite "echo clause" subtype of specificational clefts (All I did was I asked).9 This structure shows a sufficient degree of conventionalisation to consider it a grammatical construction. However, it is not a grammatical construction which is likely to spread into writing because the subordinate part of the cleft construction is not properly embedded syntactically. Conversely, compression of information as it is achieved by expanding noun heads by modifiers such as genitives, prepositional phrases or attributively used nouns is not a high priority in spontaneous speech. However, it is a central functional determinant of language use in most written genres. More than ever before in history, writers of English today are having to cope with masses of information, which will give a tremendous boost to almost any structurally economical compression device in the noun phrase, as has been shown for the s-genitive in the present study.

Thus, even if spoken and written English share the same grammar, as soon as we move to the discourse level and study language history as a history of genres or as the history of changing traditions of speaking and writing, it makes sense to write a separate history of the written language in the Late Modern period. This history will document the linguistic coping strategies which writers have been forced to develop to come to terms with the increasing bureaucratization of our daily lives, the complexities introduced by the omnipresence of science and technology in the everyday sphere and the general "information explosion" brought about by the media.

Above and beyond all this, however, close attention to the spoken language in diachronic linguistics is salutary for a more general reason. It keeps challenging us to question and re-define our descriptive categories.
As was shown in the case of specificational clefts, the variable and its variants were easy to define in the analysis of the written language, and difficulties of classification of individual corpus examples were rare. This was entirely different in the spoken material, where we were constantly faced with the task of deciding which of the many instances of discourse-pragmatic focusing which contain chunks such as what X did was or all X did was represented a token of the grammatical construction "specificational cleft sentence" whose history we had set about to study. Grammar thus "emerges" in psychological time in spontaneous discourse long before it develops as a structured system of choices in historical time.


Notes

1. That is the familiar array of the Brown Corpus (American English, 1961), its British counterpart LOB (1961), their Freiburg updates (F-LOB, British English 1991; Frown, American English 1992) and - not completed until recently - B-LOB ("before LOB"), a matching corpus illustrating early 1930s British English. I am grateful to Geoff Leech, Lancaster, and Nick Smith, Salford, for allowing me access to this latter corpus, which is not as yet publicly available.
2. Note that this example has the speaker correcting an unmarked infinitive into an -ing form.
3. This admittedly unsophisticated strategy secures relatively high precision and even higher recall (although of course a very small number of instances with material intervening between be and do, such as All I did to him was criticise, will be missed).
4. In particular, the following two issues are in need of clarification, on the basis of much larger corpora than the Brown family: (1) Are there -ing complements without the preceding trigger (type All I did was asking), and (2) are there unmarked infinitival complements following a preceding progressive (type All I was doing was ask)? The one instance found of the latter, quoted as (3) above, shows instant self-correction by the speaker.
5. And this research was duly carried out by Barbara Klein in an MA thesis (Klein 2007). The author wishes to thank Ms. Klein for her meticulous work in one of the first DCPSE-based studies undertaken.
6. The DCPSE consists of matching components of London-Lund (1958-1977) and ICE-GB (1990-1992) material, totalling ca. 855,000 words.
7. Judging from the B-LOB data, it also seems that the trend picked up speed in the second half of the twentieth century in British English. Pending the completion of a "pre-Brown" corpus of 1930s written American English, it is, however, difficult to determine the precise significance of the B-LOB findings.
8. Chiefly, these are nouns used in attribute function, for which similarly drastic increases have been noted in Biber (2003), for example. See also Biber (1988) and (1989). Indeed, in terms of information density, a noun phrase such as Clinton Administration disarmament initiative could be regarded as an even more compressed textual variant of the Clinton Administration's disarmament initiative, which in turn is a compressed form of the disarmament initiative of the Clinton Administration. Raab-Fischer (1995) was the first to use corpus analysis to prove that the increase in genitives went hand in hand with a decrease in of-phrases post-modifying nominal heads. Her data was the then available untagged press sections of LOB and F-LOB. Analysis of the POS-tagged complete versions of B-LOB, LOB and F-LOB shows that her provisional claims have stood the test of time quite well. Of-phrases decrease from 31,254 (B-LOB) through 28,134 (LOB) to 27,115 (F-LOB). Like genitives, noun+noun sequences, or more precisely: noun+common noun (= tag sequence N* NN*) sequences, increase - from 17,023 in B-LOB through 21,393 in LOB to 25,774 in F-LOB.
9. Here, additional evidence is provided by the "emergent" structures briefly illustrated in example (5) above, which were excluded from consideration as they would have distorted the statistical comparison between speech and writing.

References

Besch, Werner 2003 Schrifteinheit - Sprechvielfalt: Zur Diskussion um die nationalen Varianten der deutschen Standardsprache. In Deutsche Sprache im Wandel: Kleinere Schriften zur Sprachgeschichte, Werner Besch (ed.), 295-308. Frankfurt: Lang.
Biber, Douglas 1988 Variation Across Speech and Writing. Cambridge: Cambridge University Press.
Biber, Douglas 2003 Compressed noun-phrase structures in newspaper discourse: The competing demands of popularization vs. economy. In New Media Language, Jean Aitchison and Diana M. Lewis (eds.), 169-181. London: Routledge.
Biber, Douglas and Edward Finegan 1989 Drift and evolution of English style: A history of three genres. Language 65: 487-517.
Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad and Edward Finegan 1999 Longman Grammar of Spoken and Written English. Harlow: Longman.
Hinrichs, Lars and Benedikt Szmrecsanyi 2007 Recent changes in the function and frequency of standard English genitive constructions: A multivariate analysis of tagged corpora. English Language and Linguistics 11: 437-474.
Hoffmann, Sebastian 2005 Grammaticalization and English Complex Prepositions: A Corpus-Based Study. London: Routledge.
Hopper, Paul 1998 Emergent grammar. In The New Psychology of Language, Michael Tomasello (ed.), 155-175. Mahwah, NJ: Lawrence Erlbaum.
Hopper, Paul 2001 Grammatical constructions and their discourse origins: Prototype or family resemblance? In Applied Cognitive Linguistics I: Theory and Language Acquisition, Martin Pütz and Susanne Niemeier (eds.), 109-129. Berlin: Mouton de Gruyter.
Hopper, Paul 2004 The openness of grammatical constructions. Chicago Linguistic Society 40: 239-256.
Klein, Barbara 2007 Ongoing morpho-syntactic changes in spoken British English: A study based on the DCPSE. Unpublished Master's thesis. University of Freiburg.
Krug, Manfred 2000 Emerging English Modals: A Corpus-Based Study of Grammaticalization. Berlin/New York: Mouton de Gruyter.
Leech, Geoffrey 2003 Modality on the move: The English modal auxiliaries 1961-1992. In Modality in Contemporary English, Roberta Facchinetti, Manfred Krug and Frank R. Palmer (eds.), 223-240. Berlin: Mouton de Gruyter.
Leech, Geoffrey, Marianne Hundt, Christian Mair and Nicholas Smith 2009 Change in Contemporary English: A Grammatical Study. Cambridge: Cambridge University Press.
Mair, Christian 2006 Inflected genitives are spreading in present-day English, but not necessarily to inanimate nouns. In Corpora and the History of English: Festschrift for Manfred Markus on the Occasion of his 65th Birthday, Christian Mair and Reinhard Heuberger (eds.), 243-256. Heidelberg: Winter.
Mair, Christian 2007 British English/American English grammar: Convergence in writing, divergence in speech. Anglia 125: 84-100.
Mair, Christian and Geoffrey Leech 2006 Current changes. In The Handbook of English Linguistics, Bas Aarts and April McMahon (eds.), 318-342. Oxford: Blackwell.
Raab-Fischer, Roswitha 1995 Löst der Genitiv die of-Phrase ab? Eine korpusgestützte Studie zum Sprachwandel im heutigen Englisch. Zeitschrift für Anglistik und Amerikanistik 43: 123-132.
Rohdenburg, Günter 2000 The complexity principle as a factor determining grammatical variation and change in English. In Language Use, Language Acquisition and Language History: (Mostly) Empirical Studies in Honour of Rüdiger Zimmermann, Ingo Plag and Klaus Peter Schneider (eds.), 25-44. Trier: WVT.
Rosenbach, Anette 2002 Genitive Variation in English: Conceptual Factors in Synchronic and Diachronic Studies. Berlin: Mouton de Gruyter.
Sinclair, John McH. 1972 A Course in Spoken English Grammar. Oxford: Oxford University Press.
Sinclair, John McH. and Richard M. Coulthard 1975 Towards an Analysis of Discourse: The English Used by Teachers and Pupils. Oxford: Oxford University Press.
Smith, Nicholas 2003 Changes in the modals and semi-modals of strong obligation and epistemic necessity in recent British English. In Modality in Contemporary English, Roberta Facchinetti, Manfred Krug and Frank R. Palmer (eds.), 241-266. Berlin: Mouton de Gruyter.
Traugott, Elizabeth 2008 'All that he endeavoured to prove was ...': On the emergence of grammatical constructions in dialogic contexts. In Language in Flux: Dialogue Coordination, Language Variation, Change and Evolution, Robin Cooper and Ruth Kempson (eds.), 143-177. London: Kings College Publications.

Corpora

BLOB: The BLOB-1931 Corpus (previously called the Lancaster-1931 or B[efore]-LOB Corpus). 2006. Compiled by Geoffrey Leech and Paul Rayson, University of Lancaster.
BROWN: A Standard Corpus of Present-Day Edited American English, for use with Digital Computers (Brown). 1964, 1971, 1979. Compiled by W. N. Francis and H. Kucera. Brown University. Providence, Rhode Island.
DCPSE: The Diachronic Corpus of Present-Day Spoken English. 2006. Compiled by Bas Aarts, Survey of English Usage, University College London.
FROWN: The Freiburg-Brown Corpus ('Frown'), original version. Compiled by Christian Mair, Albert-Ludwigs-Universität Freiburg.
LOB: The LOB Corpus, original version. 1970-1978. Compiled by Geoffrey Leech, Lancaster University, Stig Johansson, University of Oslo (project leaders) and Knut Hofland, University of Bergen (head of computing).
FLOB: The Freiburg-LOB Corpus ('F-LOB'), original version. 1996. Compiled by Christian Mair, Albert-Ludwigs-Universität Freiburg.

Prefabs in spoken English

Brigitta Mittmann

1. Introduction

This article discusses the method of using parallel corpora from what are arguably the two most important regional varieties of English - American English and British English - for finding prefabricated or formulaic word combinations typical of the spoken language. It sheds further light on the nature, shape and characteristics of the most frequent word combinations found in spoken English as well as the large extent to which the language can be said to consist of recurrent elements. It thus provides strong evidence supporting John Sinclair's idiom principle (1991: 110-115). The article is in parts an English synopsis of research that was published in German in monograph form in Mittmann (2004) and introduces several new aspects of this research hitherto not published in English. This research is highly relevant in connection with several issues discussed elsewhere in this volume.

2. A study of British-American prefab differences

2.1. Material

The study is based upon two corpora which both aim to be representative of natural every-day spoken English: for British English, the spoken demographic part of the British National Corpus (BNCSD) and for American English, the Longman Spoken American Corpus (LSAC). Both corpora contain recordings of people from all age groups, both sexes and from a variety of social and regional backgrounds of the two countries. The size of the two corpora is similar: the BNCSD contains about 3.9 million and the LSAC about 4.9 million words of running text. Despite some minor differences, these corpora are similar enough to be compared as parallel corpora. When the research was carried out, they were among the largest corpora of spoken English available for this purpose. Nonetheless, frequency-based and comparative studies of word combinations demand a certain minimum of occurrence. This meant that it was necessary to concentrate upon the most frequent items since otherwise the figures become less reliable.

2.2. Method

For determining the most frequent word combinations in the BNCSD and the LSAC, a series of programs was used which had been specially written for this purpose by Florian Klampfl. These programs are able to extract n-grams or clusters (i.e. combinations of two, three, four, or more words) from a text and count their frequency of occurrence. For example, the sentence Can I have a look at this? contains the following trigrams: CAN I HAVE, I HAVE A, HAVE A LOOK, A LOOK AT and LOOK AT THIS (each with a raw frequency of one). The idea to use n-grams - and the term cluster - was inspired by the WordSmith concordancing package (Scott 1999). With the help of the χ²-test a list was created which sorted the clusters from those with the most significant differences between the two corpora to those with the greatest similarity. In addition to this, a threshold was introduced to restrict the output to those items where the evidence was strongest. This minimum frequency is best given in normalized figures (as parts per million or ppm), as the LSAC is somewhat larger than the BNCSD. It was set at 12.5 ppm in at least one of the corpora (this corresponds to around 49 occurrences in the BNCSD and about 61 occurrences in the LSAC).

2.3. Types of clusters/prefabs found
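Before turning to the findings, the procedure of section 2.2 can be made concrete with a minimal Python sketch. It is not the original software written for the study, which is not reproduced in this article. The function names, the simple regular-expression tokenizer, the use of the total corpus size as the denominator and the uncorrected Pearson statistic are all assumptions made purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (apostrophes kept, as in "don't")."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ppm(count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size


def chi_square(count_a, size_a, count_b, size_b):
    """Pearson chi-square for the 2x2 table 'cluster vs. rest' by 'corpus A vs. corpus B'.

    Assumption: the corpus size in words stands in for the number of n-gram slots.
    """
    total = size_a + size_b
    observed = [[count_a, size_a - count_a], [count_b, size_b - count_b]]
    row_totals = [size_a, size_b]
    col_totals = [count_a + count_b, total - count_a - count_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat


def rank_clusters(counts_a, size_a, counts_b, size_b, min_ppm=12.5):
    """Keep clusters reaching min_ppm in at least one corpus, most divergent first."""
    rows = []
    for cluster in set(counts_a) | set(counts_b):
        a, b = counts_a.get(cluster, 0), counts_b.get(cluster, 0)
        if max(ppm(a, size_a), ppm(b, size_b)) >= min_ppm:
            rows.append((chi_square(a, size_a, b, size_b), cluster, a, b))
    return sorted(rows, reverse=True)


if __name__ == "__main__":
    trigrams = ngrams(tokenize("Can I have a look at this?"), 3)
    # Each of the five trigrams has a raw frequency of one.
    print(Counter(trigrams))
```

Applied to Can I have a look at this?, `ngrams` returns the five trigrams listed in section 2.2. With corpus sizes of 3.9 and 4.9 million words, the 12.5 ppm threshold works out at 48.75 and 61.25 occurrences, which matches the rounded figures of 49 and 61 given there.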

The cluster lists contained ample evidence of word combination differences between the two varieties. Some of them had previously been recorded elsewhere as being more typical of either American or British English, but a considerable number appeared to be new. Most notably, the material contained many conversational routines or parts of them. They are of very different kinds, ranging from greeting formulas, such as how are you? or how are you doing? (esp. LSAC) to multifunctional expressions like here you are (esp. BNCSD) or here you go (esp. LSAC), complex discourse markers such as mind you or the trouble is (both esp. BNCSD), hedges like kind of (esp. LSAC) and sort of (esp. BNCSD), general extenders (cf. Overstreet 1999) such as and stuff (like that) and shit (like that) (both esp. LSAC), or multi-word expletives like bloody hell, oh dear (esp. BNCSD) or (oh) my gosh, oh boy, oh man, oh wow (esp. LSAC).

Apart from conversational routines, there was a wide range of other types of word combinations to be found. Most of the material is unlikely to fall under the heading of 'classical idioms', but nonetheless a substantial part of it can be seen as idiomatic in the sense that its constituents are either semantically or syntactically odd. Borrowing Fillmore, Kay and O'Connor's (1988: 508) expression, one could say that they are mostly "familiar pieces unfamiliarly arranged". Amongst these pre-assembled linguistic building blocks, there are idioms such as be on about (sth), have a go (both esp. British English),2 but also expressions such as be like, a reporting construction used predominantly by younger American speakers, as exemplified in the following stretch of conversation:

(1) Call him up and he gets kind of snippy with me on the phone. Well he's sending in mixed messages anyway but uh I called him and he's snippy and he's like no I can't go. And I'm like fine that's all I need to know. And so let him go. Tuesday he comes in and he's like hi how are you? and everything and I'm just like comes up and talks to me like twice and he's like you don't believe me do you? I'm like no. (LSAC 150701)

Further types of frequent word combinations include phrasal and prepositional verbs - e.g. British get on with, go out with; American go ahead (and ...), figure out, work out ('exercise') - and certain quantifiers such as a bit of (esp. BNCSD), a little of (esp. LSAC) or a load of (BNCSD). In the corpora, there were also instances showing that individual words can have quite different collocational properties in different varieties. For example, a lot tends to co-occur much more often with quite and not in British English than it does in American English.

2.4. The 'fuzzy edges' of phraseology

A large number of clusters points to what one might call the 'fuzzy edges' of traditional phraseology. On the borderline between phraseology and syntax there are, for example, tag questions,3 periphrastic constructions such as the present perfect (which is used more frequently in British English), semi-modals such as have got (esp. BNCSD) and going to/gonna (esp. LSAC) or valency phenomena such as the fact that in spoken American English feel/look/seem take complements introduced by like much more often than they do in British English. Recent research in several linguistic paradigms, most notably perhaps in the context of construction grammar, has tended to bridge the traditional divide between grammar and lexis (see e.g. Römer and Schulze 2009) and also emphasizes the continuum between phraseology and word formation (e.g. Granger and Paquot 2008). Here, again, the comparison of BNCSD and LSAC provided a number of relevant combinations such as adjectival ones consisting of participle and particle (e.g. fed up with, screwed up in a whole load of screwed up ones, cf. BNCSD file ked), complex prepositions (apart from, in terms of), complex conjunctions (as if, as though), complex adverbs (of course, as well, at all), or complex pronouns (you guys, y'all/you all).

A type of prefabricated expression that is typically ignored in studies of formulaic word combinations is that of time adverbials - or parts of time adverbials - such as British at the moment, in a minute or American right now, at this point, (every once) in a while, the whole time. In the BNCSD the time of day is usually given with the help of expressions such as half past, quarter to, (quarter) of an hour, while the LSAC contains more combinations such as six thirty. The reason why these sequences are mostly ignored by researchers is likely to be that in many cases they appear to have been generated according to simple semantic and syntactic rules. A similar problem exists with respect to frequent responses. As will be discussed below, however, all of these combinations are of great significance in that they are typical ways of expressing the relevant concepts and are preferred over other expressions which might have been used instead.

2.5. Evaluation

In sum, studying phraseological differences between varieties of spoken English with the help of clusters has proved very successful. It covered most of the rather scattered and often unsystematic descriptions of British-American word combination differences previously found in the literature, but also brought a large number of new phenomena to light which had hitherto mainly - if at all - been mentioned in dictionaries. It goes without saying that certain kinds of word combinations are not caught in the net of this procedure. One example of this is collocations of the type identified by Hausmann (1985, 1989), which may be more than just a few words apart; i.e. combinations of lexical words such as schütteres Haar, 'thin hair', as in Hausmann's example Das Haar ist nicht nur bei alten Menschen sondern auch bei relativ jungen Menschen bereits recht häufig schütter (Literally: 'The hair not just of older people but also of relatively young ones is quite often thin already.') (1985: 127). However, this phenomenon tends to be rare in comparison with the large amount and wide variety of other material which can be collected. The approach chosen is largely data-driven and casts the net wide without either restricting or anticipating results. It proved very useful for what was effectively a pilot study, as there had not been any systematic treatment of such a wide variety of word combination differences between spoken American and British English.

In 2006, John Algeo published a book on British word and grammar patterns and the ways in which they differ from American English. His focus and methods are different from the ones reported upon here and he based his research upon other data (including different corpora). Nonetheless, there is some overlap and in these areas, his findings generally corroborate those from Mittmann (2004). Another recently finished study which has some connection with the present research is the project of Anne Grimm (2008) in which she studied differences between the speech of men and women, including amongst other things the use of hedging expressions and expletives. This project is also based upon the BNCSD and the LSAC and observes differences between the regional varieties. Again, the results from Mittmann (2004) are generally confirmed, while Grimm differentiates more finely between the statistics for different groups of speakers for the items that she focuses on. However, in these - and other - works a number of theoretical issues had to be left undiscussed and it is some of these points that will be explored in the next sections.

3. Theoretical implications

3.1. The role of pragmatic equivalence

In comparing two corpora, the problem arises of what the basis for the comparison (or the tertium comparationis) should be. If one combination of words occurs, for example, five times as frequently in one corpus as it does in the other one, then this may be interesting, but it leaves open the question how the speakers in the other corpus would express the same concept or pragmatic function instead. Therefore it is highly relevant to look for what one might call "synonymous" word combinations - and take this to include combinations with the same pragmatic function. Sometimes such groups of expressions with the same function or meaning can be found relatively easily, as in figure 1 (taken from Mittmann 2005), which gives a number of comment clauses which have the added advantage of having similar or identical structures:

[Bar chart not reproduced: it compares the LSAC and the BNCSD on a scale from 0% to 100% for the comment clauses I reckon, I should think, I expect, I suppose, I think, I believe, I figure and I guess.]

Figure 1.

However, finding such neat groups can be difficult and similar surface structures do not guarantee functional equivalence. For example, it has been pointed out in the literature on British-American differences (Benson, Benson and Ilson 1986: 20) that in a number of support verb constructions such as take a bath vs. have a bath, American English tends to use take, while British speakers typically use have. While there is no reason to doubt this contrast between take and have in support verb constructions in general, a very different situation obtains with respect to certain specific uses in conversation. It is remarkable that while HAVE a look does indeed appear quite frequently in the BNCSD, this is not true of take a look in the LSAC. Instead, expressions such as let's see or let me see appear to be used. Moreover, both let me see and let's see as well as let me have a look and let's have a look are often used synonymously, as can be seen from the following extract from a conversation between a childminder and a child:

(2) ... you've got a filthy nose. Let's have a look. (BNCSD, kb8)

Quite feasibly, a German speaker might use a very different construction such as Zeig mal (her) - which consists of an imperative (zeig, 'show') + pragmatic particle (mal) + adverb (her; here 'to me') - in similar situations. Pragmatic equivalence is therefore context-dependent and comparisons between varieties can be made at very different levels of generality. This is also a problem for anyone studying what Herbst and Klotz (2003: 145-149) have called probabemes, i.e. likely linguistic realizations of concepts. If one opts for the level of pragmatic function, then very general functions such as expressing indirectness are very difficult to take into account, as they can be realised by such a variety of linguistic means, from modal verbs and multi-word hedging expressions to the use of questions rather than statements in English, or the use of certain pragmatic particles (e.g. vielleicht) in German. Any statement about whether, for example, the speakers of one group are more or less indirect than those of another will have to take all those features into account. And while it appears to be true, for example, that British speakers use certain types of modal verb more frequently than their American counterparts (Mittmann 2004: 101-106), there are a number of speakers in the LSAC who use many more hedging expressions such as like (as in And that girl's going to be like so spoiled, LSAC 130801, or it's like really important, LSAC 161902).
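The kind of proportional comparison displayed in figure 1 can be sketched in a few lines of Python. This is an illustration only: it assumes that the shares are computed from frequencies normalized to ppm, which the article does not spell out, and the function names are hypothetical. The counts in the usage lines are invented placeholders, not corpus results.

```python
def ppm(count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size


def variety_shares(count_a, size_a, count_b, size_b):
    """Split one expression between two corpora on the basis of normalized frequencies.

    Returns (share_a, share_b); the two shares sum to 1 unless the expression
    is absent from both corpora, in which case (0.0, 0.0) is returned.
    """
    rate_a, rate_b = ppm(count_a, size_a), ppm(count_b, size_b)
    total = rate_a + rate_b
    if total == 0:
        return (0.0, 0.0)
    return (rate_a / total, rate_b / total)


if __name__ == "__main__":
    # Invented placeholder counts: 39 hits in a 3.9-million-word corpus and
    # 49 hits in a 4.9-million-word corpus are both 10 ppm, so the split is even
    # although the raw counts differ.
    print(variety_shares(39, 3_900_000, 49, 4_900_000))  # (0.5, 0.5)
```

Working from normalized rather than raw figures matters here because the two corpora differ in size, so equal raw counts would overstate the share of the smaller corpus.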

3.2. Variety differences can be used to identify prefabs

For a number of word combinations, it was the comparison of two parallel corpora from regional varieties which was vital in drawing attention to their fixedness or what Alison Wray (2002: 5) would call their formulaicity. This is particularly interesting in those cases where the cluster frequencies show different tendencies to the frequencies of certain single words. For example, the verb WANT (in all its inflectional forms) occurs more frequently in the American corpus, while the question Do you want...? can be found more frequently in the British one. Vice versa, the words oh, well and sorry can be found more often in the British corpus while the clusters oh my gosh, oh man, well thank you and I'm sorry are more typical for the American texts.

On the other hand, certain clusters show what may be called a microgrammar in that certain grammatical phenomena which are otherwise typical for a variety do not apply to them. For example, the semi-modal have got is typical for spoken British English (see above), but does not normally occur together with no idea. Both the forms I've no idea and No idea are much more frequent than I've got no idea. This also means that the external form of prefabs can be crucial, as there may be small, but established differences between varieties. They can relate to the use of words belonging to 'minor' word classes or to conventional ellipses. For example, the use of articles can differ between the two varieties, as with get a hold of (something), an expression which is more typical of the American corpus, versus get hold of (something), which is its British counterpart. Sometimes, the meaning of a phrase depends crucially on the presence of the article, as in the combinations the odd [+ noun] or a right [+ noun] (both esp. BNCSD). In this use, odd typically has the meaning 'occasional', as in We are now in a country lane, looking out on the odd passing car (BNCSD, ksv), whereas right is used as an intensifier for nouns denoting disagreeable or bad situations, personal qualities or behaviour, as in There was a right panic in our house (BNCSD, kcl). However, as seen above with get (a) hold of (something), there does not have to be any such change of meaning.

In other cases, interjections are a characteristic part of certain expressions. For example, in both corpora in around 80 % of all cases my god is preceded by oh. Again, there are a number of clusters containing interjections which are far more typical of one of the two varieties. Examples of this are well thank you, no I won't or yes you can. Often these are responses, which will be discussed again below. A further formal characteristic of certain formulaic sequences is that they are frequently elliptical, such as No idea, which appears on its own in almost half the cases in the BNCSD, Doesn't matter (one third of cases without subject), or Course you can (more than two thirds of cases). All of them are often used as responses, as in the following examples:

(3a) Oh ah! Can I take one? Course you can. I'll take two. (...) (BNCSD, kbs)

(3b) (...) Can I use your phone? Yeah, course you can. (BNCSD, kbc)

3.3. Variety differences show the extent to which language is prefabricated

The fixedness of some word combinations may seem debatable, as they simply appear to be put together according to syntactic and semantic rules. Amongst these, there are recurrent responses such as No, it isn't or Yes, it is. It may seem at first sight that these are ordinary, unspectacular sequences, but in fact the difference between the corpora is highly significant here. These expressions are highly context-dependent and grammatically elliptical. A number of responses contain interjections:

no, I won't; no, I don't; no, it's not; no it isn't; yes it is; yes/yeah, you can; yes/yeah I know; yes, please; oh alright; oh I see (esp. BNCSD)
oh okay (esp. LSAC)

However, there are also many combinations without interjections which are typically used in response to another speaker's turn. The following clusters appear directly after a speaker change in more than 50 % of all cases: I don't mind; never mind; it's up to you; it's alright/all right; that's alright/all right; that's/that is right; that'll do/that will do; that's it; that's not bad; that would be ...; that's a good ...; don't be silly. In some cases such as Course you can or Never mind, these responses tend to be conventionally elliptic, while in others such as That's it they are syntactically well-formed but highly context-dependent, with an anaphoric reference item such as that. There are also responding questions such as the following ones:

Why's that?; Why not? (esp. BNCSD)
How come?; Like what? (esp. LSAC)

Again, they are either elliptical or contain anaphoric reference items. It is notable that many of the responses mentioned above appear much less frequently in the American corpus than they do in the British one. Presumably, Americans tend to respond differently, for example using single words such as Sure.
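The share of occurrences that follow a speaker change directly, as reported above, could be computed along the following lines. This is a minimal sketch under stated assumptions: the transcript is represented as a list of tokenized speaker turns, "directly after a speaker change" is read as "at the very start of a turn", and both the function name and the toy dialogue are hypothetical and not taken from either corpus.

```python
def turn_initial_share(turns, cluster):
    """Count a cluster in tokenized turns and report how often it opens a turn.

    turns   -- list of token lists, one list per speaker turn
    cluster -- tuple of tokens, e.g. ("never", "mind")
    Returns (total occurrences, turn-initial occurrences, share).
    """
    n = len(cluster)
    total = initial = 0
    for tokens in turns:
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == cluster:
                total += 1
                if i == 0:  # the cluster is the first thing the new speaker says
                    initial += 1
    share = initial / total if total else 0.0
    return total, initial, share


if __name__ == "__main__":
    # Invented toy dialogue, for illustration only.
    toy_turns = [
        ["can", "i", "take", "one"],
        ["never", "mind"],
        ["oh", "never", "mind"],
        ["never", "mind", "then"],
    ]
    total, initial, share = turn_initial_share(toy_turns, ("never", "mind"))
    print(total, initial, round(share, 2))  # 3 2 0.67
```

A cluster would pass the 50 % criterion if the returned share exceeded 0.5. Whether a preceding interjection such as oh or yeah should still count as turn-initial is a design decision that this sketch deliberately answers in the negative.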
In her above-mentioned recent detailed empirical study of similarities and differences between the language of women and men, Grimm (2008: 301) found that the American speakers generally used more minimal responses than their British counterparts, which would seem to confirm the hypothesis that Americans tend to respond differently. Arguably, if speakers of different varieties typically use different word combinations for verbalizing the same concepts, there is an indication that the expressions they use are - at least to some extent - formulaic and retrieved from memory. This means that even highly frequent utterances such as No, it isn't or Yes, it is, which seem banal in that they are fully analyzable and can be constructed following the grammatical rules of the language, can be regarded as prefabricated, which should put them at the centre of any theory of language. Authors such as Wray have argued persuasively in favour of seeing prefabricated word combinations (or, as she puts it, formulaic sequences) as central to linguistic processing (2002: 261), although using varieties of a language as support for this position appears to be an approach which had not actually been put into practice before Mittmann (2004).

4. Scope for further research

As a consequence of the richness and great variety of the material found in the clusters, some potentially interesting findings had to be left unexplored. These issues might be relevant for further investigations into the interplay between prefabs and grammatical, semantic and pragmatic rules in speech production, which is why they will be outlined briefly in this section.

4.1. Chunk boundaries and 'wedges'

One problem which has also been noted by other authors is that it is often difficult to determine where the boundaries between chunks are. Many chunks show what one might term 'crystallization', having a stable core and more or less variable periphery. And while some chunks are comparatively easy to delimit, others are not. This is, for example, partly reflected in Sinclair and Mauranen's distinction between O and M units (2006: 59). The M units contain what is being talked about whereas the O units (e.g. hedges, discourse markers and similar items) organize the discourse. The latter tend to be particularly stable in their form. On top of this, there are sometimes intriguing differences between the varieties. For example, in the American corpus certain items, notably certain discourse markers such as you know or the negative particle not, can interrupt verb phrases or noun phrases by squeezing in between their constituents like a wedge. In examples (4.1) and (4.2) below, the wedge is placed between the infinitive particle and the verb, in (4.3), it is between the article and the premodifier, and in (4.4) between an adverbial and the verb.

(4.1) Yeah, I don't like them either. No that's supposed to be a good school. I'll just try to you know cheer along. Be supportive. (LSAC, 155401)
(4.2) It's so much easier to not think, to have somebody else to do it for you. (LSAC, 150102)
(4.3) Well and he was saying I was riding on the sidewalk which you can do outside of the you know, downtown area. (LSAC, 125801)
(4.4) and uh, my brother, just you know cruises around on his A T V and his snowmobile when it's snowmobile season (...) (LSAC, 144301)

In a similar manner, other items such as kind of or I think can function as wedges in these positions. Apparently, there is a greater tendency in American English to insert items just in front of the verb or between certain other closely linked clause or phrase constituents. These places would appear to be where the speaker conventionally takes time for sentence planning and there may well be differences between the varieties - as indeed there are between languages - in this respect. Anybody who has ever studied English films dubbed into German will probably agree that hesitation phenomena (notably repetitions and pauses) are somewhat odd in comparison to non-scripted, everyday conversational German.

4.2. Differences in rhythm

The wedges also affect the rhythmic patterning of sentences in the American corpus. Further research should investigate the links between stress (and, thus, rhythm) and intonation, pauses, hesitation phenomena and 'chunks'. Sometimes, interesting rhythmical patterns seem to appear in other portions of the material. For example, many of the responses which are overwhelmingly found in the BNCSD have a stress pattern of two unstressed syllables followed by a stressed syllable (in other words, an anapaest), as in I don't mind; yes you can; course you can, etc. In addition to this, there appear to be differences in the use of contracted forms. For simple modal verbs, for example, the BNCSD has more contractions involving the negative particle (e.g. can't, couldn't), whereas there is a stronger tendency towards using the full form not (e.g. cannot, could not) in the LSAC. The same applies to the use of 'll versus will. However, since the study of such contractions depends crucially on transcription conventions, further research in this field would need to include the audio files.

5. Conclusion

The project described in this article has shown that American and British spoken English differ markedly in the word combinations which they typically use. These word combinations span a wide range of types - from various kinds of routine formulae to frequently recurring responses. A few formulaic sequences are grammatically or semantically odd, but many more are neither of those, although they typically have a special pragmatic or discourse-related function. Nonetheless, the fact that they are typical of one variety of a language but not of another indicates that they are to some extent formulaic. Thus, the British-American differences reported on here provide further evidence that everyday language is to a great extent conventionalised. Idiomaticity (or formulaicity) pervades language. It consists largely of recurring word combinations which are presumably stored in the speaker's memory as entities. The comparison of parallel corpora offers compelling evidence confirming Sinclair's idiom principle. In the words of Franz Josef Hausmann, we can say that there is "total idiomaticity" (1993: 477).

Notes

1 The author is grateful to S. Faulhaber and K. Pike for their comments on an earlier version of this article.
2 There is of course an overlap with some of the routine formulae such as the pragmatic marker mind you - classified by Moon (1998: 80-81) as an "ill-formed" fixed expression.
3 Tag questions show fossilization in that there are invariant forms such as innit? in some varieties.

References

Algeo, John
2006 British or American English? A Handbook of Word and Grammar Patterns. Cambridge: Cambridge University Press.

Benson, Morton, Evelyn Benson and Robert Ilson
1986 Lexicographic Description of English. Amsterdam: John Benjamins.

Fillmore, Charles J., Paul Kay and Mary C. O'Connor
1988 Regularity and idiomaticity in grammatical constructions. Language 64 (3): 501-538.

Granger, Sylviane and Magali Paquot
2008 Disentangling the phraseological web. In Phraseology: An Interdisciplinary Perspective, Sylviane Granger and Fanny Meunier (eds.), 27-49. Amsterdam/Philadelphia: Benjamins.

Grimm, Anne
2008 "Männersprache" "Frauensprache"? Eine korpusgestützte empirische Analyse des Sprachgebrauchs britischer und amerikanischer Frauen und Männer hinsichtlich Geschlechtsspezifika. Hamburg: Kovač.

Hausmann, Franz Josef
1985 Kollokationen im deutschen Wörterbuch: Ein Beitrag zur Theorie des lexikographischen Beispiels. In Lexikographie und Grammatik (Lexicographica Series Maior 3), Henning Bergenholtz and Joachim Mugdan (eds.), 118-129. Tübingen: Niemeyer.

Hausmann, Franz Josef
1989 Le dictionnaire de collocations. In Wörterbücher, Dictionaries, Dictionnaires, vol. 1, Franz Josef Hausmann, Oskar Reichmann, Herbert Ernst Wiegand and Ladislav Zgusta (eds.), 1010-1019. Berlin/New York: Walter de Gruyter.

Hausmann, Franz Josef
1993 Ist der deutsche Wortschatz lernbar? Oder: Wortschatz ist Chaos. DaF 5: 471-485.

Herbst, Thomas and Michael Klotz
2003 Lexikographie. Paderborn: Schöningh.

Mittmann, Brigitta
2004 Mehrwort-Cluster in der englischen Alltagskonversation: Unterschiede zwischen britischem und amerikanischem gesprochenen Englisch als Indikatoren für den präfabrizierten Charakter der Sprache. Tübingen: Gunter Narr.

Mittmann, Brigitta
2005 'I almost kind of thought well that must be more of like British English or something': Prefabs in amerikanischer und britischer Konversation. In Linguistische Dimensionen des Fremdsprachenunterrichts, Thomas Herbst (ed.), 125-134. Würzburg: Königshausen und Neumann.

Moon, Rosamund
1998 Fixed Expressions and Idioms in English: A Corpus-Based Approach. Oxford: Oxford University Press.

Overstreet, Maryann
1999 Whales, Candlelight, and Stuff Like That: General Extenders in English Discourse. New York/Oxford: Oxford University Press.

Römer, Ute and Rainer Schulze (eds.)
2009 Exploring the Lexis-Grammar Interface. Amsterdam/Philadelphia: Benjamins.

Scott, Mike
1999 WordSmith Tools, version 3. Oxford: Oxford University Press.

Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Oxford University Press.

Sinclair, John McH. and Anna Mauranen
2006 Linear Unit Grammar: Integrating Speech and Writing. Amsterdam/Philadelphia: John Benjamins.

Wray, Alison
2002 Formulaic Language and the Lexicon. Cambridge: Cambridge University Press.

Corpora

BNC    The British National Corpus. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk/.

LSAC   The Longman Spoken American Corpus is copyright of Pearson Education Limited and was created by Longman Dictionaries. It is available for academic research purposes, and for further details see http://www.longman-elt.com/dictionaries/.

Observations on the phraseology of academic writing: local patterns – local meanings?

Ute Römer

1. Introduction

The past few years have seen an increasing interest in studies based on new kinds of specialized corpora that capture an ever-growing range of text types, especially from academic, political, business and medical discourse. Now that more and larger collections of such specialized texts are becoming available, many corpus researchers seem to be switching from describing the English language as a whole to the description of a number of different language varieties and community discourses (see, for example, Biber 2006; Biber, Connor and Upton 2007; Bowker and Pearson 2002; Gavioli 2005; Hyland 2004; and the contributions in Connor and Upton 2004; and in Römer and Schulze 2008). This paper takes a neo-Firthian approach to academic writing and examines lexical-grammatical patterns in the discourse of linguistics. It is in many ways a tribute to John Sinclair and his groundbreaking ideas on language and corpus work. One of the things I learned from him is that, more often than not, it makes sense to "go back" and see how early ideas on language, its structure and use, relate to new developments in resources and methodologies. So, in this paper, I go back to some concepts introduced and/or used by John Sinclair and by John Rupert Firth, a core figure in early British contextualism, who greatly influenced Sinclair's work. Continuing Sinclair's (1996: 75) "search for units of meaning" and using new-generation corpus tools that enable us to explore corpora semi-automatically (Collocate, Barlow 2004; ConcGram, Greaves 2005; kfNgram, Fletcher 2002-2007), the aim of this paper is to uncover the phraseological profile of a particular sub-type of academic writing and to see how meanings are created in a 3.5-million word corpus of linguistic book reviews written in English, as compared to a larger corpus of a less specialized language. After an explanation of the concept of "restricted language" and a discussion of ways in which meaningful units can be identified in corpora, the paper will focus on a selection of common phraseological items in linguistic book review language, and investigate how specific (or how "local") these items are for the type of language under analysis and whether the identified local patterns are connected to local, text-type specific meanings. It will conclude with a few thoughts on "local grammars" and recommendations for future research in phraseology and academic discourse.

2. Taking a neo-Firthian approach to academic writing

The context of the analysis reported on in this paper is a large-scale corpus study of academic discourse. Central aims of the study are to investigate how meanings (in particular evaluative meanings) are created in academic writing in the discipline of linguistics, and to develop a local lexical grammar of book review language. The approach taken in the larger-scale study and described in the present paper is neo-Firthian in that it picks up some central notions developed and used by Firth (and his pupil Sinclair) and uses new software tools and techniques which lend themselves to investigating these notions but which Firth did not have at his disposal. The notions discussed here are "restricted language" (e.g. Firth [1956] 1968a), "collocation" (e.g. Firth [1957] 1968c; Sinclair 1991), "unit of meaning"/"meaning-shift unit" (Sinclair 1996, 2007 personal communication), "lexical grammar" (e.g. Sinclair 2004) and "local grammar" (e.g. Hunston and Sinclair 2000).

2.1. The discourse of linguistics as a "restricted language"

In the following, I will report on an analysis of a subset of the written English discourse among linguists regarded as a global community of practice. This type of discourse, the discourse of linguistics, is only one of the many types of specialized discourses that are analyzed by researchers in corpus linguistics and EAP (English for Academic Purposes). In Firthian terms, all these specialized discourses constitute "restricted languages". As Léon (2007: 5) notes, "restricted languages ... became a touchstone for Firth's descriptive linguistics and raised crucial issues for early sociolinguistics and empiricist approaches in language sciences". Firth himself states that "descriptive linguistics is at its best when dealing with such [restricted] languages" (Firth 1968a: 105-106), mainly because the focus on limited systems makes the description of language more manageable. A restricted language can be defined as the language of a particular domain (e.g. science, politics or meteorology) or genre that serves "a circumscribed field of experience or action and can be said to have its own grammar and dictionary" (Firth [1956] 1968b: 87). That means that we are dealing with a subset of the language, with "a well defined limited type or form of a major language, let us say English" (Firth 1968a: 98). A restricted language thus has a specialized grammar and vocabulary, "a micro-grammar and a micro-glossary" (Firth 1968a: 106, emphasis in original).

An alternative concept to restricted language would be that of sublanguage. Sublanguage is a term used by Harris (1968) and Lehrberger (1982) to refer to "subsets of sentences of a language" (Harris 1968: 152) or languages that deal with "limited subject matter" and show a "high frequency of certain constructions" (Lehrberger 1982: 102). The concept of sublanguage also occurs in modern corpus-linguistic studies, for example in a study on the language of dictionary definitions by Barnbrook, who considers the concept "an extremely powerful approach to the practical analysis of texts which show a restricted use of linguistic features or have special organisational properties" (Barnbrook 2002: 94). I will now turn to looking at the language of academic book reviews (a language of a particular domain with its own lexical microgrammar) and at some typical constructions in this sublanguage or restricted language.

The restricted language I am dealing with here is captured in a 3.5-million-word corpus of 1,500 academic book reviews published in Linguist List issues from 1993 to 2005: the Book Reviews in Linguistics Corpus (henceforth BRILC). The language covered in BRILC constitutes part of the discourse of linguistics (in an English-speaking world). BRILC mirrors how the global linguistic research community discusses and assesses publications in the field.
For a corpus of its type, BRILC is comparatively large, at least by today's standards, and serves well to represent the currently common practice in linguistic review writing. However, the corpus can of course not claim to be representative of review writing in general, and certainly not of academic discourse in its entirety, but it helps to provide insights into the language of one particular discourse community: the community of a large group of linguists worldwide.

2.2. The identification of meaningful units in a corpus of linguistic book reviews

Continuing Sinclair's search for units of meaning, the question I would like to address here is: How can we find meaningful units in a corpus? Or, more specifically (given that BRILC contains a particularly evaluative type of text), how can we find units of evaluative meaning in a corpus? Evaluation, seen as a central function of language and broadly defined (largely in line with Thompson and Hunston 2000) as a term for expressions of what stance we take towards a proposition, i.e. the expression of what a speaker or writer thinks of what s/he talks or writes about, comes in many different shapes, which implies that it is not easy to find it through the core means of corpus analysis (doing concordance searches or word lists and keyword lists). As Mauranen (2004: 209) notes, "[i]dentifying evaluation in corpora is far from straightforward. ... Corpus methods are best suited for searching items that are identifiable, therefore tracking down evaluative items poses a methodological problem". On a similar note, Hunston (2004: 157) states that "the group of lexical items that indicate evaluative meaning is large and open", which makes a fully systematic and comprehensive account of evaluation extremely difficult. In fact, the first analytical steps I carried out in my search for units of evaluative meaning in BRILC (i.e. the examination of frequency word lists and keyword lists, see Römer 2008) did not yield any interesting results, which, at that point in the analysis, led me to conclude that words are not the most useful units in the search for meaning ("the word is not enough", Römer 2008: 121) and that we need to move from word to phrase level. So, instead of looking at single recurring words, we need to examine frequent word combinations, also referred to as collocations, chunks, formulaic expressions, n-grams, lexical bundles, phrase-frames, or multi-word units.
In Römer (2008), I argued that the extraction of such word combinations or phrasal units from corpora, combined with concordance analysis, can lead to very useful results and helps to highlight a large number of meaningful units in BRILC. In the present paper, however, I go beyond the methodology described in the earlier study, in which I only extracted contiguous word combinations from BRILC (n-grams with a span of n=2 to n=7), using the software Collocate (Barlow 2004). I use two additional tools that enable the identification of recurring contiguous and non-contiguous sequences of words in texts: kfNgram (Fletcher 2002-2007) and ConcGram (Greaves 2005). Like Collocate, kfNgram generates lists of n-grams of different lengths (i.e. combinations of n words) from a corpus, e.g. 3-grams like as well as or the book is. In addition to that, the program creates lists of so-called "phrase-frames" (short "p-frames"). P-frames are sets of n-grams which are identical except for one word, e.g. at the end of, at the beginning of and at the turn of would all be part of the p-frame at the * of. P-frames hence provide insights into pattern variability and help us see to what extent Sinclair's Idiom Principle (Sinclair 1987, 1991, 1996) is at work, i.e. how fixed language units are or how much they allow for variation. Examples of p-frames in BRILC, based on 5-gram and 6-gram searches, are displayed in figure 1.

p-frame / variant                      tokens   variants
it would be * to                          101         10
    it would be interesting to             44
    it would be useful to                  14
    it would be nice to
    it would be better to
    it would be possible to
    it would be helpful to
    it would be fair to
    it would be difficult to
    it would be necessary to
    it would be good to
it * be interesting to                     58          3
    it would be interesting to             44
    it will be interesting to
    it might be interesting to              6
it * be interesting to see                 33          3
    it would be interesting to see         23
    it will be interesting to see           7
    it might be interesting to see          3

Figure 1. Example p-frames in BRILC, together with numbers of tokens and numbers of variants (kfNgram output)
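The grouping of n-grams into p-frames described above can be illustrated with a minimal Python sketch. It is not kfNgram's implementation: the function names `ngrams` and `pframes` are hypothetical, the toy sentences are invented rather than taken from BRILC, and the sketch assumes that only internal slots of an n-gram may vary.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pframes(ngram_counts):
    """Group n-grams that are identical except for one internal slot.

    Returns {frame: Counter({filler: tokens})}, where frame is the
    n-gram with the variable slot replaced by "*".
    """
    frames = defaultdict(Counter)
    for gram, count in ngram_counts.items():
        # Assumption of this sketch: the first and last word stay fixed.
        for slot in range(1, len(gram) - 1):
            frame = gram[:slot] + ("*",) + gram[slot + 1:]
            frames[frame][gram[slot]] += count
    # A frame is only of interest if at least two variants fill the slot.
    return {f: c for f, c in frames.items() if len(c) > 1}


# Toy input, invented for illustration (not BRILC data).
text = (
    "it would be interesting to see . it would be useful to add . "
    "it would be interesting to know . it will be interesting to see ."
)
tokens = text.split()
counts = Counter(ngrams(tokens, 5))
frames = pframes(counts)

frame = ("it", "would", "be", "*", "to")
fillers = frames[frame]
# Frame, number of tokens, number of variants, and the fillers themselves.
print(" ".join(frame), sum(fillers.values()), len(fillers), dict(fillers))
```

For each frame the sketch reports the same two figures as the kfNgram output in figure 1: the number of tokens (summed over all variants) and the number of variants.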

Together with the types and the token numbers of the p-frames, kfNgram also lists how many variants are found for each of the p-frames (e.g. 10 for it would be * to). The p-frames in figure 1 exhibit systematic and controlled variation. The first p-frame (it would be * to) shows that, of a large number of possible words that could theoretically fill the blank, only a small set of (mainly positively) evaluative adjectives actually occur. In p-frames two and three, modal verbs are found in the variable slot; however, not all modal verbs but only a subset of them (would, will, might). ConcGram allows an even more flexible approach to uncovering repeated word combinations in that it automatically identifies word association patterns (so-called "concgrams") in a text (see Cheng, Greaves and Warren 2006). Concgrams cover constituency variation (AB, ACB) and positional variation (AB, BA) and hence include phraseological items that would be missed by Collocate or kfNgram searches but that are potentially interesting in terms of constituting meaningful units. Figure 2 presents an example of a BRILC-based concgram extraction, showing constituency variation (e.g. it would be very interesting, it should also be interesting).

67  pon are backward anaphora, and  it  would be interesting to see how his theory can
68  spective of grammaticalisation  it  would be very interesting to have a survey of the
69   semantic transparency; again,  it  would be very interesting to see this pursued in
70  from a theoretical standpoint,  it  would be very interesting to expand this analysis
71  r future research, noting that  it  would be especially interesting to follow the
72   ift from OV to VO in English.  It  would be particularly interesting to see if this
73  ook as exciting as I had hoped  it  might be, although Part 4 was quite interesting,
74  d very elegantly in the paper,  it  would be interesting to discuss the
75  s in semantics. In my opinion,  it  would be interesting to see how this ontological
76  oun derivatives are discussed,  it  would be interesting at least to mention verbal
77  felt most positively" (p. 22).  It  should be noted that some interesting results
78  rs also prove a pumping lemma.  It  woul! d be interesting to see further
79  pear in Linguist List reviews:  it  wouldn't be very interesting, I didn't make a
80   is given on this work, though  it  seems to be very interesting for the linguist's
81   and confined to the endnotes.  It  would also be interesting to set Hornstein's view
82  n of a book title was omitted.  It  would also be interesting to see if some of the
83  erative work on corpora. Maybe  it  would also be interesting to test the analyses in
84   second definition. Of course,  it  would also be interesting to find out that
85  ub-entries, for instance). So,  it  should also be interesting to find, among the
86  olved in dictionary-making and  it  should also be interesting to all dictionary
87   n a constituent and its copy.  It  might however be interesting to seek a connection
88  ity and their self-perception.  It  might prove to be interesting to compare the
89  iteria seem fairly reasonable.  It  would, however, be interesting to study the
90   is, rhetoric, semantics, etc.  It  would certainly be very interesting to see what
91  nages to carry out the action.  It  would most certainly be interesting to look at
92  CTIC THEORY" by Alison Henry).  It  seems to me that it would be interesting to

Figure 2. Word association pattern (concgram) of the items it + be + interesting in BRILC (ConcGram output; sample)
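The two kinds of variation that concgrams capture can be made concrete with a small Python sketch. This is a deliberately simplified stand-in and not ConcGram's actual algorithm: the function name `cooccurrences`, the `span` parameter and the toy sentences are assumptions made for illustration only. It looks for two words within a fixed window and records whether they occur in the order AB or BA, regardless of what intervenes.

```python
def cooccurrences(tokens, origin, associate, span=4):
    """Find origin/associate pairs that are at most `span` tokens apart.

    Returns a list of (left_index, right_index, kind), where kind is
    "AB" if the origin precedes the associate and "BA" otherwise.
    Intervening words are allowed, which covers constituency variation
    (AB vs. ACB); the two orders cover positional variation (AB vs. BA).
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok != origin:
            continue
        lo, hi = max(0, i - span), min(len(tokens), i + span + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] == associate:
                kind = "AB" if i < j else "BA"
                hits.append((min(i, j), max(i, j), kind))
    return hits


# Toy sentences, invented for illustration (not BRILC data).
sentences = [
    "it would be interesting to see",
    "it would be very interesting to see",
    "interesting it would certainly be",
]
results = []
for sentence in sentences:
    toks = sentence.split()
    for left, right, kind in cooccurrences(toks, "would", "interesting"):
        between = " ".join(toks[left + 1:right])
        results.append((kind, between))
        print(f"{kind}  intervening: [{between}]  {sentence}")
```

Processing the text sentence by sentence, as in the usage example, is one simple way to keep the window from pairing words across a sentence boundary.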

All three tools (Collocate, kfNgram and ConcGram) can be referred to as "phraseological search engines" as they facilitate the exploration of the phraseological profile of texts or text collections. The extraction of n-grams (of different spans), p-frames and concgrams was complemented by manual filtering of the output lists and extensive concordancing of candidate phraseological items. These semi-automatic BRILC explorations resulted in a database of currently a little over 800 items (i.e. types) of evaluative meaning. Some of these items are inherently evaluative (e.g. it is not clear, wonderful, or a lack of), while others appear "neutral" in isolation but introduce or frame evaluation (e.g. at the same time or on the one hand). This type of implicit or "hidden" evaluation is much more pervasive than we would expect and will be focused on in the remainder of the paper. In the next section, we will look at items that prepare the ground for evaluation to take place and examine their use in linguistic book reviews. The items that will be discussed are all frequent in BRILC and appeared at the top of the n-gram and p-frame lists.

3. Uncovering the phraseological profile of linguistic book reviews

3.1. Central patterns and their meanings

Before I turn to some of the high-frequency n-grams from my lists and their use in BRILC, I would like to look at an item that came up in a discussion I had about evaluation with John Sinclair (and that is also quite common in BRILC, though not as common as the other items that will be described here). In an email to me, he wrote: "Re evaluation, I keep finding evaluations in what look like "ordinary" sentences these days. ... I came across the frame "the - - lies in - -"" (Sinclair 2006, personal communication). I think lies in is a fascinating item and I am very grateful to John Sinclair for bringing it up. I examined lies in in my BRILC data and found that gap 1 in the frame is filled by a noun or noun group with evaluative potential, e.g. the main strength of the book in example (1). Gap 2 takes a proposition about action, usually in the form of a deverbal noun (such as coverage), which is pre-evaluated by the item from the first gap.

(1) The main strength of the book lies in its wide coverage of psycholinguistic data and models...

This is a neat pattern, but what type of evaluation does it mainly express? An analysis of all instances of lies in in context shows that 16 out of 133 concordance lines (12 %) express negative evaluation; see examples in (2) and (3). We find a number (27.8 %) of unclear cases with "neutral" nouns like distinction or difference in gap 1 (see examples [4] and [5]), but most of the instances of lies in (80, i.e. 60.2 %) exhibit positive evaluation, as exemplified in (1) and (6). The BRILC concordance sample in figure 3 (with selected nouns/noun groups in gap 1 highlighted in bold) and the two ConcGram displays of word association patterns in figure 4 serve to illustrate the dominance of positively evaluative contexts around lies in. This means that a certain type of meaning (positive evaluation) is linked to the lies in pattern. In section 3.2 we will see if this is a generally valid pattern-meaning combination or whether this combination is specific to the restricted language under analysis.

(2) The obvious defect of such an approach lies in the nature of polysemy in natural language.
(3) Probably, the only tangible limitation of the volume lies in some typographical errors...
(4) The main difference lies in first person authority ...
(5) This distinction lies in the foregrounded nature of literary themes.
(6) The value of this account lies in the detail of its treatment of the varying degrees and types of givenness and newness relevant to these constructions.

 89     outstanding contribution made by Saussure  lies in  his theory of general linguisti
 90  tions. Kennedy concludes that a solution (to  lies in  maintaining a purely syntactic
 91          y ) ' (J.K's ex. (8a)). The solution  lies in  the exploitation of Generalized
 92      mplex). Evidence for the above statement  lies in  the following linguistic facts:
 93        of word geography is a task that still  lies in  future'' ' (p. 405). In ''Diachro
 94      the scope of the term. Hinkel's strength  lies in  the fact that she led her resea
 95      entation is convincing, and its strength  lies in  that it concentrates on one lan
 96      n book. As a textbook, its main strength  lies in  the presentation of the details
 97       curious aspect of this agreement system  lies in  the fourth available agreement
 99      ish linguistic history through its texts  lies in  part with the wealth of textual
100      selective loss. One explanation for that  lies in  the hypothesis that identificat
101      of the strengths of "Language in Theory"  lies in  offering an opportunity for dis
102      facing the author of a work such as this  lies in  where to set the limits of scho
103        covert. Suranyi's explanation for this  lies in  the nature of the features at
105      he importance of providing this training  lies in  the fact that simultaneous inte
106      eculiar trait of the hymn as a text type  lies in  "the degree of 'openness' of te
107      iscernible stress. Aguaruna's uniqueness  lies in  the following two properties ha
108      s true for the present volume. Its value  lies in  the fact that we can select fro
109       related fields of study. Its true value  lies in  its compact though penetrating

Figure 3. BRILC concordance sample of lies in, displaying predominantly positive evaluation
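A concordance display of the kind shown in figure 3 (a node such as lies in with a fixed amount of left and right context) can be produced with a few lines of Python. The following is a minimal sketch, not the tool used in the study: the function name `kwic` and the `width` parameter are hypothetical, and the input reuses shortened versions of examples (1) and (4).

```python
def kwic(tokens, node, width=5):
    """Return keyword-in-context lines for a (possibly multi-word) node.

    Each line is a (left, node, right) triple with up to `width` tokens
    of context on either side.
    """
    node_toks = node.split()
    n = len(node_toks)
    lines = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == node_toks:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + n:i + n + width])
            lines.append((left, node, right))
    return lines


# Shortened versions of examples (1) and (4) from the text.
text = (
    "The main strength of the book lies in its wide coverage . "
    "The main difference lies in first person authority ."
)
lines = kwic(text.split(), "lies in", width=3)
for left, node, right in lines:
    # Right-align the left context so that the node forms a column.
    print(f"{left:>25}  {node}  {right}")
```

Classifying each resulting line as positive, negative or neutral evaluation remains a manual step, as in the analysis reported here.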

 1   The strength, then, of The Korean Language,  lies  in its encyclopedic breadth of cover
 2       The main strength of this book probably  lies  in the fact that it incorporates int
 3       A particular strength of Jackson's book  lies  in its relevant biographical informa
 4   synopsis, a major strength of this textbook  lies  in the integration of essential sema
 5  ransformations. The strength of this chapter  lies  in the discussion where the authors
 6     ASSESSMENT The main strength of this book  lies  in the personal testimonies and stor
 7         SUMMARY The main strength of the book  lies  in its wide coverage of psycholingui
 8  rgumentation is convincing, and its strength  lies  in that it concentrates on one langu
 9        book. As a textbook, its main strength  lies  in the presentation of the details.
10    s the scope of the term. Hinkel's strength  lies  in the fact that she led her researc

 1      the preface that the value of the reader  lies  in bringing together work from VARIOU
 2    and in my view the main value of the paper  lies  in the mono- and multi-factorial anal
 3  that is terribly new in this book; its value  lies  rather in how it selects, organizes a
 4    there is an intellectual value in exposing  lies  and deceptions, and here I think even
 5      addressed. The value of his contribution  lies  in the realization of the power imbed
 6    startling claim. The value of this account  lies  in the detail of its treatment of the
 7   enge, a further added value of this chapter  lies  in the close link with Newerkla's ch
 8    ins strong"(471). The value of this volume  lies  a) in its bringing together in one pi
 9    syntacticians. The real value of this book  lies  in its treatment of the larger issues
10    al related fields of study. Its true value  lies  in its compact though penetrating dis
11     not an easy read. Despite this, its value  lies  in how it still manages to demonstrat
12    lds true for the present volume. Its value  lies  in the fact that we can select from t
13    the compound prosodic word), and its value  lies  mainly in demonstrating how some rece

Figure 4. Word association patterns (concgrams) of the items lies + in + strength and lies + in + value in BRILC (ConcGram output; sample)

Let us now take a closer look at three items from the frequency-sorted n-gram and p-frame lists: at the same time, it seems to me (it seems to *) and on the other hand. In linguistic book review language as covered in BRILC, at the same time mainly (in 56 % of the cases) triggers positive evaluation, as exemplified in (7) and in the concordance sample in figure 5. With only 5 % of all occurrences (e.g. number [8]), negative evaluation is very rare. In the remaining 39 % of the concordance lines at the same time is used in its temporal sense, meaning "simultaneously" (not "also"); see example (9).

(7) Dan clearly highlights where they can be found and at the same time provides a good literature support.
(8) At the same time, K's monograph suffers from various inadequacies ...
(9) At the same time, some new words have entered the field...

142  e animal world.  At the same time, it includes a careful and honest discussion of wh
143  n October 1997.  At the same time, it is a state-of-the-art panorama of the (sub-)fi
144  e at times, but  at the same time it is an almost encyclopedic source of information
145  s corpus data).  At the same time, it is clear that not every author has been using
146  ghout the book.  At the same time, it is flexible enough in organisation to allow th
147  ian and Hebrew;  at the same time it is never the case that, say, accomplishments sh
148  ard Macedonian.  At the same time, it is notable that Mushin's results are consisten
149    by _his_, but  at the same time it is the subject of the Japanese predicate phrase
150  anguage change.  At the same time, it may equally be used by college teachers who wi
151     of the base.  At the same time, it must be no larger than one syllable (as discus
152      taste, and,  at the same time, it provides the research with steady foundations
153  osition itself.  At the same time, it was cliticised to an immediately following ver
154  ertainly rigid.  At the same time, King claims, we can easily account for such utter
155  a events, while  at the same time leading to interesting questions about the often
156  ge history, but  at the same time maintains an engaging and entertaining style throu

Figure 5. BRILC concordance sample of at the same time, displaying predominantly positive evaluation

The next selected item, it seems to me, prepares the ground for predominantly negative evaluation (281 of 398 instances, i.e. 70.5 %), as exemplified in (10) and the concordance sample in figure 6. Positive evaluation, as shown in (11), is rare and accounts for only 4.9 % of all cases. About 24.6 % of the BRILC sentences with it seems to me constitute neutral observations; see e.g. (12).

(10) Finally, it seems to me that the discussion of information structure was sometimes quite insensitive to the differences between spoken and written data.

(11) In general, it seems to me this book is a nice conclusion to the process started in the Balancing Act...

(12) It seems to me that it is a commonplace that truth outstrips epistemic notions...

11  new.  It seems to me, nevertheless, that there are some difficulties related to this
12  lem;  it seems to me, rather, that precedence is always transitive; it is the particu
13  ail,  it seems to me that a more explicit definition of word would be needed to handl
14  68).  It seems to me that a high price has been paid in terms of numbers of categorie
15  ude,  it seems to me that as for theoretical results, much more [...] should be said
16  ies.  It seems to me that both fields would benefit from acting a little more like th
17  ath;  it seems to me that Copper Island Aleut is not a good example of such process
18  per.  It seems to me that in some cases this could lead M to certain misinterpretatio
19  ry).  It seems to me that it would be interesting to examine such problems in a more
20  3c).  It seems to me that M-S conjures up notions of abstract constructs that are not
21   Yet  It seems to me that one can likewise make a strong case for claiming that espec
22  VIEW  It seems to me that one of the central questions being analyzed in this book is
23  ary.  It seems to me that some additional topics could have been incorporated into th
24  ded.  It seems to me that such a term is used in more than one sense, having to do bo
25  lly,  it seems to me to be a weakness of this approach that it will not easily handle

Figure 6. BRILC concordance sample of it seems to me, displaying predominantly negative evaluation

Finally, if we look at on the other hand, positive evaluation follows the 4-gram in only 8 % of the 567 BRILC examples, as in (13). Negative evaluations (54 %) and neutral observations (38 %) are considerably more frequent. This is illustrated in figure 7 and in examples (14) and (15) below.

(13) Other chapters, on the other hand, provide impressively comprehensive coverage of the topics...

(14) but on the other hand, it is obvious that the book under review fails in various regards to take into account major developments in research into Indian English over the last 25 years.

(15) Prepositional clauses, on the other hand, do not allow stranding.

85  M.,  on the other hand, comes to the opposite conclusion on the same point. It would
86  on,  on the other hand, concerns the marking of event sequences through lexical and s
87  ts,  on the other hand, consider the legitimate explanations to be those that do not
88  on.  On the other hand, context in CA is not a priori but something that emerges from
89  nd.  On the other hand, corpus linguists who want to develop their own tailor-made so
90  t).  On the other hand, C denies the existence of the notion of "subject" as a univer
91  un.  On the other hand, C importantly neglects other hypotheses on the origin of pers
92  re,  on the other hand, denote type shifted, generalized quantifier-like or ,>-type e
93  an,  on the other hand, despite its importance in the United States, left very little
94  acy  on the other hand develops more slowly, influenced by production ease, salience,
95  ns,  on the other hand, do not contribute to the truth conditions of the utterance bu
96  R ,  on the other hand, do provide support for D, and 0 and L (2002), while finding f
97  en,  on the other hand do seem to have such restrictions, lengthening such words and
98   B,  on the other hand, does not view optimization of language very seriously. Instea
99       On the other hand, [...], articles are missing on subjects that FG did attend to

Figure 7. BRILC concordance sample of on the other hand, displaying examples of negative evaluation and neutral observation

3.2. Corpus comparison: How "local" are these patterns and meanings?

The items we have just analyzed clearly show interesting patterns and pattern-meaning relations. Their existence in BRILC alone, however, does not say much about their status as "local" patterns, i.e. patterns that are characteristic of linguistic book review language as a restricted language (in Firth's sense). In order to find out how restricted-language-specific the above-discussed phraseological items (lies in, at the same time, it seems to me, on the other hand) are, I examined the same items and their patterns and meanings in a larger reference corpus of written English, the 90-million word written component of the British National Corpus (BNC_written). In a first step, I compared the frequencies of occurrence (normalized per million words, pmw) of the four items in BRILC with those in BNC_written. As we can see in table 1, all units of evaluative meaning are more frequent in BRILC than in BNC_written, which may not be all that surprising if we consider the highly evaluative type of texts included in BRILC.

Moving on from frequencies to functions, the next step then involved an analysis of the meanings expressed by each of the phraseological items in BNC_written. For lies in I did not find a clear preference for one type of evaluation (as in BRILC). Instead, there was a roughly equal distribution of examples across the three categories "positive evaluation" (34.5 %), "negative evaluation" (32.5 %) and "neutral/unclear" (33 %). While negative evaluation was rather rare in the context of lies in in the book review corpus, the item forms a pattern with nouns like problem and difficulty in BNC_written, as the concordance samples in figure 8 show.

Table 1. Frequencies of phraseological items in BRILC and BNC_written

                     BRILC     BNC_written
lies in              38 pmw    19 pmw
on the other hand    162 pmw   57 pmw
at the same time     100 pmw   73 pmw
it seems to me       19 pmw    5 pmw
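The normalization behind table 1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure actually used in the study: the function and variable names are hypothetical, and the only data taken from the text are the pmw values reported in table 1.

```python
def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalize a raw frequency to occurrences per million words (pmw)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


# pmw values as reported in table 1: item -> (BRILC, BNC_written)
TABLE_1 = {
    "lies in": (38, 19),
    "on the other hand": (162, 57),
    "at the same time": (100, 73),
    "it seems to me": (19, 5),
}


def frequency_ratios(table):
    """Ratio of the BRILC pmw value to the BNC_written pmw value per item."""
    return {item: brilc / bnc for item, (brilc, bnc) in table.items()}


ratios = frequency_ratios(TABLE_1)
for item, ratio in sorted(ratios.items(), key=lambda pair: -pair[1]):
    print(f"{item}: {ratio:.2f} times as frequent in BRILC")
```

Comparing ratios rather than raw counts keeps the two corpora comparable despite their very different sizes.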

 360   s without saying; my difficulty  lies in  knowing how defensible they are in the fo
 361   s. The chief cause of difficulty  lies in  the fact that confessions are typically o
 362    olved. The practical difficulty  lies in  deciding how to value the external effects
 363     on CD-ROM. The real difficulty  lies in  the fact that CD-ROM can only process one
 364    n Fig. 8.5. A second difficulty  lies in  the uncertainty in our knowledge of the to
 365   , on the contrary the difficulty  lies in  obtaining sufficient evidence to identify
 366   ists the cause of the difficulty  lies in  an institution, central planning, which c
 367    opinion. Part of the difficulty  lies in  the developments which have taken place in
 368    d spellcheckers. The difficulty  lies in  building real quality into the products. D
 369   urn, is apparent. The difficulty  lies in  providing an adequate theoretical framewor
 370     experimentally. The difficulty  lies in  heating the fuel to temperatures of about
 371    y extraordinary. The difficulty  lies in  finding an acceptable implied limitation.
 372    ilm. At present, the difficulty  lies in  understanding how this relates smdash if a
 373     mess of things. The difficulty  lies in  convincing yourself of that! If all is we

1007   e root of the innovation problem  lies in  a dilemma : sbquo Curriculum innovation re
1008    rations. sequo The main problem  lies in  the amount of translation the software wil
1009    p to my Martin D-16. My problem  lies in  the fact that I can sbquo t get the same o
1010    way from the house. One problem  lies in  the fact that the space is considerably „i
1011    asonably good; the only problem  lies in  reaching the RAM upgrade s lots, which are
1012   rly where the particular problem  lies in  the case of this ruler. Books about Mary t
1013   ould argue that the real problem  lies in  the fact that shares had been overvalued f
1014    rpose, whereas the real problem  lies in  the adjustment of the model 's control lin
1015   equo. Perhaps the Met 's problem  lies in  the present state of museum affairs, where
1016   If Radiohead 's singular problem  lies in  the sheer obviousness of their line of att
1017    taining prose. Here the problem  lies in  the generality of the terms squot descript
1018   believes the root of the problem  lies in  a fault with the child 's immune cells in
1019    FAO) 1985). Part of the problem  lies in  the fact that much of this produce is expo
1020   umour? sequo Part of the problem  lies in  his opening statement : sbquo Eighty seven

Figure 8. BNC_written concordance samples of lies in, displaying patterns of negative evaluation

For at the same time we also find a lower share of positive contexts in the BNC_written than in the BRILC data. While authors of linguistic book reviews use the item predominantly to introduce positive evaluation, this meaning is (with 9 %) very rare in "general" written English (i.e. in a collection of texts from a range of different text types). An opposite trend can be observed with respect to it seems to me. Here, positive contexts are much more frequent in BNC_written than in BRILC, where negative evaluation dominates (with 70.5 %; only 30 % of the BNC_written examples express negative evaluation). Finally, with on the other hand positive evaluation or a positive semantic prosody is (with 33 %) also much more common in BNC_written than in BRILC (see [16] and [17] for BNC_written examples). For book reviews, I found that on the other hand mostly introduces negative evaluation and that only 8 % of the BRILC concordance lines express positive evaluation.

These findings indicate that the examined patterns and their meanings are indeed quite "local", i.e. specific to the language of linguistic book reviews. Not only do we find certain phraseological items or patterns to occur with diverging frequencies across text types and to be typical of a particular kind of restricted language, we also observe that the same items express different meanings in different types of language.

(16) Jennie on the other hand was thrilled when the girls announced [...] wedding <BNC_written: B34 914>

(17) On the other hand, he at last gains well-deserved riches and a life of comfort.

4. Concluding thoughts

Referring back to the groundbreaking work of John Firth and John Sinclair, this paper has stressed the importance of studying units of meaning in restricted languages. It has tried to demonstrate how a return to Firthian and Sinclairian concepts may enable us to better deal with the complex issue of meaning creation in (academic) discourse and how corpus tools and methods can help identify meaningful units in academic writing or, more precisely, in the language of linguistic book reviews. We saw that the identification of units of (evaluative) meaning in corpora is challenging but not a hopeless case and that phraseological search-engines like Collocate, kfNgram and ConcGram can be used to automatically retrieve lists of meaningful unit candidates for further manual analysis. It was found to be important to complement concordance analyses by n-gram, p-frame and concgram searches and to go back and forth between the different analytic procedures, combining corpus guidance and researcher intuition in a maximally productive way. In the analysis of high-frequency items from the meaningful unit candidate lists, it then became clear that a number of "innocent" n-grams and p-frames have a clear evaluative potential and that apparently "neutral" items have clear preferences for either positive or negative evaluation.

The paper has also provided some valuable insights into the special nature of book review language and highlighted a few patterns that are particularly common in this type of written discourse. One result of the study was that it probably makes sense to "think local" more often because the isolated patterns were shown to be actually very restricted-language-specific. In a comparison of BRILC data with data retrieved from a reference corpus of written English (the written component of the British National Corpus), we found that not only the patterns but also the identified meanings for each of the patterns (and their distributions) are local. I would suggest that these local patterns be captured in a "local lexical grammar" which "is simply a logical extension of the concept of pattern grammar" (Hunston 1999) in that it, being text-type specific, covers the patterns that are most typical of the text type (or restricted language) under analysis and links these patterns with the most central meanings expressed in the specialized discourse. I think that a considerable amount of research on disciplinary phraseology still needs to be done, and see the development of local lexical grammars based on restricted languages as an important future task for the corpus linguist. These text-type specific grammars will help us get a better understanding of how meanings are created in particular discourses and come closer to capturing the full coverage of Sinclair's (1987) idiom principle.

Notes

1 I would like to thank the participants at the symposium on "Chunks in Corpus Linguistics and Cognitive Linguistics: In Honour of John Sinclair", 25-27 October 2007, at the University of Erlangen-Nuremberg for stimulating questions and suggestions after my presentation.

References

Barlow, Michael
  2004  Collocate 1.0: Locating Collocations and Terminology. Houston, TX: Athelstan.
Barnbrook, Geoff
  2002  Defining Language: A Local Grammar of Definition Sentences. Amsterdam: John Benjamins.
Biber, Douglas
  2006  University Language: A Corpus-based Study of Spoken and Written Registers. Amsterdam: John Benjamins.
Biber, Douglas, Ulla Connor and Thomas A. Upton
  2007  Discourse on the Move: Using Corpus Analysis to Describe Discourse Structure. Amsterdam: John Benjamins.
Bowker, Lynne and Jennifer Pearson
  2002  Working with Specialized Language: A Practical Guide to Using Corpora. New York/London: Routledge.
Cheng, Winnie, Chris Greaves and Martin Warren
  2006  From N-gram to Skipgram to Concgram. IJCL 11 (4): 411-433.
Connor, Ulla and Thomas A. Upton (eds.)
  2004  Discourse in the Professions: Perspectives from Corpus Linguistics. Amsterdam: John Benjamins.
Firth, John R.
  1968a Descriptive linguistics and the study of English. In: Selected Papers of J. R. Firth 1952-59, Frank Robert Palmer (ed.), 96-113. Bloomington: Indiana University Press. First published in 1956.
Firth, John R.
  1968b Linguistics and translation. In: Selected Papers of J. R. Firth 1952-59, Frank Robert Palmer (ed.), 84-95. Bloomington: Indiana University Press. First published in 1956.
Firth, John R.
  1968c A synopsis of linguistic theory. In: Selected Papers of J. R. Firth 1952-59, Frank Robert Palmer (ed.), 168-205. Bloomington: Indiana University Press. First published in 1957.
Fletcher, William H.
  2002-07 KfNgram. Annapolis, MD: United States Naval Academy.
Gavioli, Laura
  2005  Exploring Corpora for ESP Learning. Amsterdam: John Benjamins.
Greaves, Chris
  2005  ConcGram Concordancer with ConcGram Analysis. Hong Kong: Hong Kong University of Science and Technology.
Harris, Zellig S.
  1968  Mathematical Structures of Language. New York: Interscience Publishers.
Hunston, Susan
  1999  Local Grammars: The Future of Corpus-driven Grammar? Paper presented at the 32nd BAAL Annual Meeting, September 1999, University of Edinburgh.
Hunston, Susan
  2004  Counting the uncountable: Problems of identifying evaluation in a text and in a corpus. In: Corpora and Discourse, Alan Partington, John Morley and Louann Haarman (eds.), 157-188. Bern: Peter Lang.
Hunston, Susan and John McH. Sinclair
  2000  A local grammar of evaluation. In: Evaluation in Text: Authorial Stance and the Construction of Discourse, Susan Hunston and Geoff Thompson (eds.), 74-101. Oxford: Oxford University Press.
Hyland, Ken
  2004  Disciplinary Discourses: Social Interactions in Academic Writing. Ann Arbor, MI: University of Michigan Press.
Lehrberger, John
  1982  Automatic translation and the concept of sublanguage. In: Sublanguage: Studies of Language in Restricted Semantic Domains, Richard Kittredge and John Lehrberger (eds.), 81-106. Berlin: Walter de Gruyter.
Léon, Jacqueline
  2007  From linguistic events and restricted languages to registers. Firthian legacy and Corpus Linguistics. Henry Sweet Society Bulletin 49: 5-25.
Mauranen, Anna
  2004  Where next? A summary of the round table discussion. In: Academic Discourse: New Insights into Evaluation, Gabriella Del Lungo Camiciotti and Elena Tognini Bonelli (eds.), 203-215. Bern: Peter Lang.
Römer, Ute
  2008  Identification impossible? A corpus approach to realisations of evaluative meaning in academic writing. Functions of Language 15 (1): 115-130.
Römer, Ute and Rainer Schulze (eds.)
  2008  Patterns, Meaningful Units and Specialized Discourses (special issue of International Journal of Corpus Linguistics). Amsterdam: John Benjamins.
Sinclair, John McH.
  1987  The nature of the evidence. In: Looking Up: An Account of the COBUILD Project in Lexical Computing, John McH. Sinclair (ed.), 150-159. London: HarperCollins.
Sinclair, John McH.
  1991  Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Sinclair, John McH.
  1996  The search for units of meaning. Textus IX (1): 75-106.
Sinclair, John McH.
  2004  Trust the Text: Language, Corpus and Discourse. London: Routledge.
Thompson, Geoff and Susan Hunston
  2000  Evaluation: An introduction. In: Evaluation in Text: Authorial Stance and the Construction of Discourse, Susan Hunston and Geoff Thompson (eds.), 1-27. Oxford: Oxford University Press.

Corpora

BNC    The British National Corpus. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. URL: http://www.natcorp.ox.ac.uk/.
BRILC  Book Reviews in Linguistics Corpus. Compiled by the author of this paper.

Collocational behaviour of different types of text

Peter Uhrig and Katrin Götz-Votteler1

1. Introduction

Ever since John Sinclair introduced his idiom principle and his notion of collocation (see Sinclair 1991: 109-121), there has been an increasing interest in the study of different aspects of phraseology. In this article we would like to present a work-in-progress report on a project exploring the collocational behaviour of text samples.2 By this term we refer to the extent to which a text relies on or uses collocations. A text that is classified as "collocationally strong" can therefore be defined as a text in which a substantial number of statistical collocations can be found, a "collocationally weak" text as a text which contains fewer statistical collocations and consists of more free combinations (in the sense of Sinclair's open choice principle).

In order to determine the collocational behaviour of a text, a computer program was designed to compare the co-occurrence of words within a certain text with co-occurrence data from the British National Corpus (BNC). For the analysis outlined here, eight different text samples representing different text types were compiled. This selection of samples was chosen in order to test whether certain interrelations between different texts (or text types) and their collocational behaviours can be found. The following three hypotheses summarize three kinds of interrelation that we expected to occur:

Hypothesis 1: There is an interrelation between the collocational behaviour of a text and its perceived difficulty.

Research on collocation claims that collocations are stored in the mind as prefabricated items (Underwood, Schmitt, and Galpin 2004: 167; Ellis, Frey, and Jalkanen 2009). It is therefore to be expected that the collocationally stronger a text, the easier it should be to process, as the text follows expected linguistic patterns. If, on the other hand, the reader is presented with a text that consists of a considerable number of new combinations, no already established linguistic knowledge can be used for processing the text. In order to test Hypothesis 1, we selected texts which represent a range of difficulty; those texts which we ranked as more difficult should therefore consist of more new combinations and be collocationally weaker.

Hypothesis 2: There is an interrelation between the collocational behaviour of a text and the text type.

Different text types might show different collocational behaviour. Fictional texts, for example, are generally assumed to be linguistically quite creative, i.e. these texts might to a higher degree rely on unusual word combinations in order to create certain effects on the reader and should therefore be collocationally weaker than newspaper articles or academic writing, which use a larger number of standardized expressions.3

Hypothesis 3: There is an interrelation between the collocational behaviour of a text and its idiomaticity.

Knowing the collocates a word is associated with is one of the crucial steps towards advanced foreign language proficiency (Hausmann 2004, Granger this volume). Texts produced by learners of English are therefore likely to be collocationally weaker than texts written by native speakers.

2. Software and text processing

In order to evaluate word co-occurrences in a given text to find out whether they are strong collocations, some sort of reference is needed, which, for our purposes, is the British National Corpus (BNC). The first version of the computer program that was designed to test our hypotheses queried SARA4 at runtime, which resulted in very long response times. It only compared bigram frequencies, so the insights to be gained from it were rather limited (see below). The current version makes use of a precomputed database of all co-occurrences in the BNC for every span from 1 to 5, both for word forms and for lemmata. It offers a user-definable span, allows the use of lemmatisation and can ignore function words. The software uses tagged and lemmatised text as input. To ensure compatibility with the database based on the BNC, the same tools that were used to annotate the BNC were applied to the text samples. All input texts were thus PoS-tagged with CLAWS (see Leech, Garside, and Bryant 1994) and lemmatised with LEMMINGS via Wmatrix (Rayson 2008).5
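The kind of co-occurrence extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: the function name, its parameters and the decision to let ignored function words still occupy a position within the span are all hypothetical choices made for the example.

```python
from collections import Counter


def cooccurrences(tokens, span=1, ignore=frozenset()):
    """Count ordered pairs (w1, w2) in which w2 follows w1 within `span` tokens.

    Pairs involving a word in `ignore` (e.g. function words) are skipped.
    Ignored words still take up a position in the span, which is an
    assumption of this sketch rather than documented behaviour.
    """
    if not 1 <= span <= 5:
        raise ValueError("span must be between 1 and 5")
    counts = Counter()
    for i, left in enumerate(tokens):
        if left in ignore:
            continue
        for right in tokens[i + 1 : i + 1 + span]:
            if right not in ignore:
                counts[(left, right)] += 1
    return counts


# Toy input: word forms (lemmata would be handled in the same way).
sample = ["organic", "food", "is", "under", "pressure"]
bigrams = cooccurrences(sample, span=1)
wider = cooccurrences(sample, span=5, ignore={"is", "under"})
```

Each pair collected this way would then be looked up in the reference data, which is where a precomputed table of corpus co-occurrences avoids slow queries at runtime.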

Our software computes all co-occurrence frequencies and association measures, and the resulting lists are imported into Microsoft Excel where graphs are plotted. An example list is given in the appendix (table 1).

3. Text samples

Each sample contains 20,000 words of English text. The non-fictional category is composed of newspaper texts from the Guardian and articles from academic journals by British authors. The fictional category comprises Elizabeth George,6 P. D. James, Ian McEwan, and Virginia Woolf. Additionally, a sample of EFL essays by German students7 was compiled. The last sample is an automatic translation of a 19th century German novel (Theodor Fontane's Effi Briest) by AltaVista Babelfish (now called Yahoo! Babelfish8). This sample was included to double-check the results against some unnatural and unidiomatic language.

The criteria for the selection of texts were a) varying degrees of difficulty in order to test Hypothesis 1, b) coverage of different text types in order to test Hypothesis 2, c) different levels of idiomaticity in order to test Hypothesis 3. A few words have to be said about criterion a): even though it is very common to describe a certain article, story or novel as "difficult to read", it is - from a linguistic point of view - hard to determine the linguistic features that support this kind of subjective judgment. A quantitative analysis of fictional literature, using some of the authors above, has shown that the degree of syntactic complexity seems to correspond to the evaluation of difficulty (Götz-Votteler 2008); the inclusion of Hypothesis 1 can be seen as a complementation of that study.

4. Results

The first run of the software provided calculations based on nonlemmatised word forms, including grammatical words.9 Figure 1 is a plot of all mutual information scores.

Figure 1. Word forms, span 1, function words included

As mutual information boosts low-frequency highly specific combinations (as opposed to MI3 for instance; see Evert 2005: 243), the two top dots ("Helter Skelter" and "Ehud Olmert") are of no particular relevance. What is represented as zero in the diagram are those combinations for which no association score could be computed because one or both of the items do not occur in the BNC or their co-occurrence frequency in the BNC is zero (such as "primitive hut"). The curves in figure 1 are strikingly similar and do not permit any conclusions that there are differences between text types. The only curve that deviates slightly is the automatic translation of Fontane's Effi Briest, which is quite strong in the negative numbers, indicating that there are many pairs of adjacent words which are much less common than expected from their individual frequencies ("anti-collocations"). Similar results were found using different association measures, which is not surprising as the absolute frequency of co-occurrence did not vary much across text types.
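The association measures mentioned here can be illustrated with a short Python sketch. It uses the standard textbook definitions, which happen to reproduce the values for supermarket + food and organic + food in appendix table 1 to within rounding when the corpus size is set to roughly 97.6 million tokens; that figure is an assumption of this sketch, not something stated in the text, and the function names are hypothetical. Unscorable pairs are mapped to zero, as in the plots.

```python
import math

BNC_TOKENS = 97_600_000  # assumed corpus size N; not given in the text


def expected(f1, f2, n=BNC_TOKENS):
    """Expected co-occurrence frequency if the two items were independent."""
    return f1 * f2 / n


def _scorable(f12, f1, f2):
    """A pair can only be scored if both items and the pair itself occur."""
    return f12 > 0 and f1 > 0 and f2 > 0


def mutual_information(f12, f1, f2, n=BNC_TOKENS):
    """MI = log2(O / E); returns 0.0 for pairs that cannot be scored."""
    if not _scorable(f12, f1, f2):
        return 0.0
    return math.log2(f12 / expected(f1, f2, n))


def mi3(f12, f1, f2, n=BNC_TOKENS):
    """MI3 = log2(O^3 / E), which gives less weight to rare pairs than MI."""
    if not _scorable(f12, f1, f2):
        return 0.0
    return math.log2(f12 ** 3 / expected(f1, f2, n))


def log_log(f12, f1, f2, n=BNC_TOKENS):
    """Log-log = MI * log2(O)."""
    if not _scorable(f12, f1, f2):
        return 0.0
    return mutual_information(f12, f1, f2, n) * math.log2(f12)


def t_score(f12, f1, f2, n=BNC_TOKENS):
    """t = (O - E) / sqrt(O)."""
    if not _scorable(f12, f1, f2):
        return 0.0
    return (f12 - expected(f1, f2, n)) / math.sqrt(f12)


def z_score(f12, f1, f2, n=BNC_TOKENS):
    """z = (O - E) / sqrt(E)."""
    if not _scorable(f12, f1, f2):
        return 0.0
    e = expected(f1, f2, n)
    return (f12 - e) / math.sqrt(e)


# supermarket + food in appendix table 1: F1 = 1040, F2 = 18674, F1,2 = 10
print(round(mutual_information(10, 1040, 18674), 2))
# organic + food: F1 = 2112, F2 = 18674, F1,2 = 54
print(round(mutual_information(54, 2112, 18674), 2))
```

Because MI divides by the expected frequency, a pair of rare words that co-occurs only a handful of times can outrank a well-established collocation, which is why very high MI scores for proper names carry little weight.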

Figure 2. Word forms, span 1, function words included

Figure 2 gives the number of hits in mutual information score bands of width 1 (apart from

Lemmata, span 5, only noun-adjective/adjective-noun combinations

Discussion and evaluation

We will now return to our hypotheses and evaluate them critically in the light of our findings:

Hypothesis 1: There is an interrelation between the collocational behaviour of a text and its perceived difficulty.

As mentioned above, the text samples were chosen to cover a certain range of difficulty. The preceding discussion of the three queries showed that the results do not support Hypothesis 1. We would even go a step further and claim that the collocational behaviour of a text does not seem to contribute to the perceived difficulty.

Hypothesis 2: There is an interrelation between the collocational behaviour of a text and the text type.

Neither did the data provide any evidence for Hypothesis 2: from our charts no interrelation between the text type and the collocational strength of a text was visible. However, the text samples did display a difference in lexical density, i.e. text types such as newspaper articles or academic writing contain a larger number of lexical items, whereas a text type such as fiction consists of a larger percentage of function words.12

Hypothesis 3: There is an interrelation between the collocational behaviour of a text and its idiomaticity.

For our third hypothesis the results proved to be the most promising ones. The largely nonsensical text sample generated by an automatic translation device showed differences in behaviour for the span -5 - +5. The same is true of the texts written by EFL learners, even though much less obviously than we would have assumed.

As the discussion of the three hypotheses reveals, our findings are far less conclusive than expected. This is partly due to some technical and methodological problems which shall be briefly outlined in the following:

A whole range of problems is associated with tokenisation, PoS-tagging, and lemmatisation. Even though excellent software was made available for the present study, there are still errors.13 These errors would only have been a minor problem, had they been consistent, but over the past 15 years, CLAWS has been improved, so the current version of the tagger does not produce the same errors it consistently produced when the BNC was annotated.14 In addition, multi-word units are problematic in two respects: firstly, the tagger recognizes many of them as multi-word units while the lemmatiser lemmatises every orthographic word, rendering mappings of the two very difficult. Besides they distort the results, even if function words are excluded, as they always lead to really high association scores.15

The most serious problem, though, is related to the size of the reference corpus, the BNC. Even if all proper names are ignored and only lemmatised combinations of nouns and adjectives in a five-word span to either side are taken into account, there are still up to 40% of combinations in the samples which do not exist at all in the BNC. Up to 60% occur less than 5 times - a limit below which sound statistical claims cannot be maintained. This is of course partly due to the automatic procedure, which looks at words in a five-word span and thus may try to score two words which are neither syntactically nor semantically related in any way.16 It therefore seems as if the BNC is still too small for this kind of research by an order of magnitude or two. This problem may at least be partially solved by augmenting the BNC dataset with data from larger corpora such as Google's Web 1T 5-gram or by limiting the research to syntactically related combinations in parsed corpora.

Despite (or perhaps even because of) the inconsistencies and inconclusive results of the present study, some of the aspects presented above seem to very much deserve further investigation: as we have seen that there are slight differences between native and non-native usage, at least for noun-adjective collocations, it might be interesting to see whether it is possible to automatically determine the level of proficiency of learners looking at the collocational behaviour of their text production.17
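The sparsity check behind these percentages amounts to simple counting, as the following sketch shows. It is only an illustration: the function name is hypothetical, and the toy input is the F1,2 column of appendix table 1, not the sample-level data on which the 40% and 60% estimates are based.

```python
def sparsity(reference_frequencies, minimum=5):
    """Return the shares of pairs that are unattested and below `minimum`.

    `reference_frequencies` holds, for each word pair found in a sample,
    its co-occurrence frequency in the reference corpus.
    """
    total = len(reference_frequencies)
    if total == 0:
        raise ValueError("no word pairs given")
    unattested = sum(1 for f in reference_frequencies if f == 0)
    too_rare = sum(1 for f in reference_frequencies if f < minimum)
    return unattested / total, too_rare / total


# F1,2 column of appendix table 1 (22 lemma pairs), used here as toy data.
F12 = [0, 1, 10, 0, 1, 5, 0, 54, 1, 0, 6,
       17, 2, 60, 205, 60, 21, 150, 4, 2, 28, 1]
unattested_share, rare_share = sparsity(F12)
print(f"{unattested_share:.0%} unattested, {rare_share:.0%} below 5")
```

Note that the second share includes the unattested pairs, since a frequency of zero is also below the threshold.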

Appendix

Table 1. Output of the database query tool

Lemma 1            F1     Lemma 2         F2      F1,2  MI     Log-like  Log-log  Z-Score  Mi3     T-Score
supermarket_SUBST  1040   accuse_VERB     2629    0     0      0         0        0        0       0
supermarket_SUBST  1040   organic_ADJ     2112    1     5,474  5,635     0        6,517    5,474   0,978
supermarket_SUBST  1040   food_SUBST      18674   10    5,652  58,843    18,774   21,975   12,295  3,099
supermarket_SUBST  1040   pressure_SUBST  11790   0     0      0         0        0        0       0
accuse_VERB        2629   organic_ADJ     2112    1     4,136  3,848     0        3,955    4,136   0,943
accuse_VERB        2629   food_SUBST      18674   5     3,314  13,983    7,694    6,342    7,958   2,011
accuse_VERB        2629   pressure_SUBST  11790   0     0      0         0        0        0       0
organic_ADJ        2112   food_SUBST      18674   54    7,063  423,033   40,644   84,324   18,572  7,293
organic_ADJ        2112   pressure_SUBST  11790   1     1,971  1,243     0        1,475    1,971   0,745
organic_ADJ        2112   ease_VERB       2357    0     0      0         0        0        0       0
organic_ADJ        2112   standard_SUBST  15079   6     4,201  23,613    10,86    9,934    9,371   2,316
food_SUBST         18674  pressure_SUBST  11790   17    2,914  39,22     11,912   9,819    11,089  3,576
food_SUBST         18674  ease_VERB       2357    2     2,149  2,862     2,149    2,307    4,149   1,095
food_SUBST         18674  standard_SUBST  15079   60    4,379  250,367   25,864   33,631   16,192  7,374
food_SUBST         18674  say_VERB        317539  205   1,755  211,398   13,477   18,51    17,114  10,076
pressure_SUBST     11790  ease_VERB       2357    60    7,72   524,5     45,599   111,926  19,533  7,709
pressure_SUBST     11790  standard_SUBST  15079   21    3,528  64,392    15,494   14,212   12,312  4,185
pressure_SUBST     11790  say_VERB        317539  150   1,968  186,978   14,224   18,03    16,425  9,116
pressure_SUBST     11790  expert_SUBST    7099    4     2,222  6,039     4,444    3,394    6,222   1,571
ease_VERB          2357   standard_SUBST  15079   2     2,458  3,544     2,458    2,711    4,458   1,157
ease_VERB          2357   say_VERB        317539  28    1,869  32,051    8,984    7,344    11,484  3,843
ease_VERB          2357   expert_SUBST    7099    1     2,545  1,871     0        2,001    2,545   0,829

Notes

1  The order of authors is arbitrary.
2  This project is carried out by Thomas Herbst, Peter Uhrig, and Katrin Götz-Votteler at the University of Erlangen-Nürnberg. It was supported with a grant by the Sonderfonds für wissenschaftliche Arbeiten an der Universität Erlangen-Nürnberg.
3  For a characterization of varying linguistic behaviour of different text types see also Biber (1988) and Biber et al. (1999).
4  SGML Aware Retrieval Application; the software shipped with the original version of the BNC.
5  Thanks to Paul Rayson of Lancaster University, who kindly allowed us to use Wmatrix for this research project.
6  Elizabeth George is our only American author. This did not have any effect on the results, despite our British reference corpus.
7  The students attended a course preparing them for an exam roughly on level C1 of the Common European Framework of Reference.
8  http://de.babelfish.yahoo.com/
9  According to Sinclair's definition (1991: 170), "only the lexical cooccurrence of words" counts as collocation.
10  Cf. the following sentence: "Man is too addicted to this intoxicating mixture of adolescent buccaneering and adult perfidy to relinquish it [spying] entirely."
11  There was no calculation of co-occurrences across sentence boundaries; thus sentence length may also be held responsible for this finding. However, an analysis of mean sentence length did not confirm this assumption.
12  For the distribution of some types of function words in different types of texts see Biber et al. (1999: ch. 2.4).
13  If we assume that the success rate of CLAWS in our study is roughly 97% (as published in Leech and Smith 2000), we still get about 600 ambiguous or wrongly tagged items per 20,000 word sample.
14  The word organic*, for instance, is tagged as plural in the BNC and as singular by the current version of CLAWS. Thus no combinations containing the word organic* were found by our automatic procedure, which always queries word/tag combinations. (Since the XML version of the BNC was not yet available when the present study was started, the database is based on the BNC World Edition.)
15  A case in point would be Prime Minister.
16  In "carnivorous plant in my office", carnivorous and office are found within a 5-word span. It is not surprising, though, that they do not occur within a 5-word span in the BNC.
17  The software may also be used for a comparison of different samples from "New Englishes" in order to find out whether these show similar results to British usage or have a distinct collocational behaviour. (Thanks to Christian Mair for suggesting this application of our methodology.) In addition, it is capable of identifying non-text, which means it could be used to find automatically generated spam emails or web pages. So in the end this could spare us the trouble of having to open emails which, on top of trying to sell dubious drugs, contain a paragraph which serves to trick spam filters and reads like the following excerpt: "Interview fired attorney david Iglesias by Shockwave something."

References

Biber, Douglas
1988  Variation across Speech and Writing. Cambridge/New York/New Rochelle/Melbourne/Sydney: Cambridge University Press.

Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad and Edward Finegan
1999  Longman Grammar of Spoken and Written English. Harlow: Pearson Education Limited.

Ellis, Nick, Eric Frey and Isaac Jalkanen
2009  The psycholinguistic reality of collocation and semantic prosody (1): Lexical access. In Exploring the Lexis-Grammar Interface: Studies in Corpus Linguistics, Ute Römer and R. Schulze (eds.), 89-114. Amsterdam: John Benjamins.

Evert, Stefan
2005  The Statistics of Word Cooccurrences: Word Pairs and Collocations. Dissertation, Institut für maschinelle Sprachverarbeitung, University of Stuttgart, URN urn:nbn:de:bsz:93-opus-23714.

Götz-Votteler, Katrin
2008  Aspekte der Informationsentwicklung im Erzähltext. Tübingen: Gunter Narr Verlag.

Granger, Sylviane
2011  From phraseology to pedagogy: Challenges and prospects. This volume.

Hausmann, Franz Josef
2004  Was sind eigentlich Kollokationen? In Wortverbindungen - mehr oder weniger fest, Kathrin Steyer (ed.), 309-334. Berlin: Walter de Gruyter.

Leech, Geoffrey, Roger Garside and Michael Bryant
1994  CLAWS4: The tagging of the British National Corpus. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 94), 622-628. Kyoto, Japan.

Leech, Geoffrey and Nicholas Smith
2000  Manual to accompany The British National Corpus (Version 2) with improved word-class tagging. Lancaster. Published online at http://ucrel.lancs.ac.uk/bnc2/.

Nesselhauf, Nadja
2005  Collocations in a Learner Corpus. Amsterdam/Philadelphia: Benjamins.

Rayson, Paul
2008  Wmatrix: A web-based corpus processing environment. Computing Department, Lancaster University, http://ucrel.lancs.ac.uk/wmatrix/.

Sinclair, John McH.
1991  Corpus, Concordance, Collocation. Oxford: Oxford University Press.

Underwood, Geoffrey, Norbert Schmitt and Adam Galpin
2004  The eyes have it: An eye-movement study into the processing of formulaic sequences. In Formulaic Sequences, Norbert Schmitt (ed.), 153-172. Amsterdam/Philadelphia: Benjamins.

Corpus

BNC  The British National Corpus. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk/.

Corpus linguistics, generative grammar and database semantics

Roland Hausser

1. Learner's dictionary and statistical tagging

In British Corpus Linguistics (CL), two schools may be distinguished: one is associated with the University of Birmingham and its mentor John Sinclair, the other with the University of Lancaster and its mentors Roger Garside and Geoffrey Leech. The Birmingham approach has been characterized as "CL-as-theory" and "doing language", the Lancaster approach as "CL-as-method" and "doing computing" (Kirk 1998). The difference between the two approaches is apparent in their respective analyses of a word. Take for example the word decline, analyzed in the Collins COBUILD English Language Dictionary (CCELD, Sinclair 1987) as a lexical entry with several readings:

decline /di'klain/, declines, declining, declined.
1 If something declines, it becomes less in quantity, importance, or strength. [examples]
2 If you decline something or decline to do something, you politely refuse to accept it or do it; a fairly formal word. [examples]
3 Decline is the condition or process of becoming less in quantity, importance, or quality. [examples]

Figure 1. Entry of decline in Collins COBUILD ELD 1987 (excerpt)

Because the dictionary is intended for learners rather than fluent speakers of English, the forms declines, declining and declined are explicitly listed (instead of naming the paradigm). In a separate column (not shown in figure 1), the CCELD characterizes reading 1 as an intransitive verb with the hypernym decrease, the cognate diminish and the antonym increase. Reading 2 is characterized as V V+O OR V+to-INF, whereby V+O indicates a transitive verb. Reading 3 is characterized as N UNCOUNT/COUNT:USU SING, i.e. as a noun which is usually used in the singular. Chapter 3 of Sinclair (1991) provides a detailed discussion of this entry to explain the form and purpose of entries in the CCELD in general.

Next consider the corresponding Lancaster analysis:

3682  decline            NN1
 451  decline            VVI
 381  decline            NN1-VVB
 121  decline            VVB-NN1
  38  decline            VVB
   1  decline-and-fall   AJ0-NN1
   1  decline/withdraw   VVB
 800  declined           VVN
 610  declined           VVD
 401  declined           VVD-VVN
 206  declined           VVN-VVD
   1  declinedtocomment  NN1
 249  declines           VVZ
  26  declines           VVZ-NN2
  22  declines           NN2
   7  declines           NN2-VVZ
 446  declining          AJ0
 284  declining          VVG-AJ0
 234  declining          AJ0-VVG
 138  declining          VVG
   1  declining-cost     AJ0
   1  declining-in       AJ0

Figure 2. Forms of decline as analyzed in the BNC 2007 XML Edition

To evaluate the tagging, we have to look up the definitions of the relevant tag-set 2 in order to see which classifications are successful. For example, declining is assigned four different tags (ambiguity), which are defined as follows:

446  declining  AJ0      adjective (unmarked) (e.g. GOOD, OLD)
284  declining  VVG-AJ0  -ing form of lexical verb and adjective (unmarked)
234  declining  AJ0-VVG  adjective (unmarked) and -ing form of lexical verb
138  declining  VVG      -ing form of lexical verb (e.g. TAKING, LIVING)
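The extent of the ambiguity can be made concrete with a short tally. The sketch below is only an illustration: the counts are copied from the four tags of declining listed above, and the variable names are hypothetical. It treats every hyphenated tag as an ambiguity tag and, in a second step, credits each hit to the first tag of the pair, on the assumption that the first tag is the one the tagger considers more likely.

```python
from collections import Counter

# Frequencies of the form "declining" per tag, copied from the list above.
declining_tags = {"AJ0": 446, "VVG-AJ0": 284, "AJ0-VVG": 234, "VVG": 138}

total = sum(declining_tags.values())

# A hyphenated tag means the tagger could not decide between two classes.
ambiguous = sum(count for tag, count in declining_tags.items() if "-" in tag)
ambiguous_share = ambiguous / total

# Credit every hit to the first tag of an ambiguity tag (an assumption
# about which reading is preferred).
by_first_tag = Counter()
for tag, count in declining_tags.items():
    by_first_tag[tag.split("-")[0]] += count

print(f"{ambiguous} of {total} hits carry an ambiguity tag ({ambiguous_share:.1%})")
print(dict(by_first_tag))
```

On these figures almost half of all occurrences of declining are left undecided between adjective and verb form.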

From a linguistic point of view, it would be better to classify declining unambiguously as the progressive form of the verb and leave the standard uses of the progressive as a predicate, a modifier, or a noun to the rules of syntax. Critical remarks on the accuracy 3 and usefulness of statistical tagging aside, the Birmingham and the Lancaster approaches share the same methodological issues of corpus linguistics, namely sampling, representativeness, size, format (and all their many sets of choices) as well as the basic techniques such as the use of frequency lists, the generation of concordances, the analysis of collocations and the question of tagging and other kinds of in-text annotation. And both raise the question of whether their computational analysis of machine-readable texts is just a methodology (extending the tool box) or a linguistic theory.

This question is addressed by Teubert and Krishnamurthy (2007: 1) as follows:

corpus linguistics is not a branch of linguistics, but the route into linguistics
corpus linguistics is not a distinct paradigm in linguistics but a methodology
corpus linguistics is not a linguistic theory but rather a methodology
corpus linguistics is not quite a revolt against an authoritarian ideology, it is nonetheless an argument for greater reliance on evidence
corpus linguistics is not purely observational or descriptive in its goals, but also has theoretical implications
corpus linguistics is a practice, rather than a theory
corpus linguistics is the study of language based on evidence from large collections of computer-readable texts and aided by electronic tools
corpus linguistics is a newly emerging empirical framework that combines a firm commitment to rigorous statistical methods with a linguistically sophisticated perspective on language structure and use
corpus linguistics is a vital and innovative area of research

Regarding the Birmingham "CL-as-theory" and "doing language" approach, Sinclair is quite adamant about the authority of real data over examples invented by linguists in the Chomskyan tradition, which is a methodological issue. But when it comes to writing lexical entries, Sinclair is pragmatic, with readability for the learner as his topmost priority. For example, in his introduction to the CCELD (1987: xix) Sinclair writes:

Within each paragraph the different senses are grouped together as well as the word allows. Although the frequency of a sense is taken into account, the most important matter within a paragraph is the movement from one sense to another, giving as clear as possible a picture.

2. The place of lexical meanings

The aim of corpus linguists and dictionary builders is to provide an accurate description of "the language" at a certain point in time or in a certain time interval. It seems to follow naturally from this perspective that a language is viewed as an object "out there in the world". As Teubert (2008) puts it:

Language is symbolic. A sign is what has been negotiated between sign users. The meaning of a sign is not my (non-symbolic) experience of it. Meanings are not in the head as Hilary Putnam4 never got tired of repeating. The meaning of a sign is the way in which the members of a discourse community are using it. It is what happens in the symbolic interactions between people, not in their minds.

On the one hand, it is uncontroversial that language meanings should not be treated as something personal left to the whim of individuals. On the other hand, simply declaring meanings to be real external entities is an irrational method for making them "objective". The real reason why the conventionalized surface-meaning relations are shared by the speech community is that otherwise communication would not work. Even if we accept for the sake of the argument that language meanings may be viewed (metaphorically) as something out there in the world, they must also exist in the heads of the members of the language community. How else could speaker-hearers use language surface forms and the associated meanings to communicate with each other?

That successful natural language interaction between cognitive agents is a well-defined mechanism is shown by the attempt to communicate in a foreign language environment. Even if the information we want to convey is completely clear to us, we will not be understood by our hearers if we fail to use their language adequately. Conversely, we will not be able to understand our foreign communication partners who are using their language in the accustomed manner unless we have learned their language.

Given that natural language communication is a real and objective procedure, it is a legitimate scientific goal to model this procedure as a theory of how natural language communication works. Such a theory is not only of academic interest but is also the foundation of free human-machine communication in natural language. The practical implications of having machines which can freely communicate in natural language are enormous: instead of having to program the machines we could simply talk with them.

3. Basic structure of a cognitive agent with language

Today, talking robots exist only in fiction, such as C-3PO in the Star Wars movies (George Lucas 1977-2005) and Roy, Rachael, etc. in the movie Blade Runner (Ridley Scott 1982). The first and so far the only effort to model the mechanism of language communication as a computational linguistic theory is Database Semantics (DBS).


DBS is developed at a level of abstraction which applies to natural agents (humans) and artificial agents (talking robots) alike. In its simplest form, the interfaces, components and functional flow of a talking agent may be characterized schematically as follows:

Figure 3. Structuring central cognition in agents with language (borrowed from Hausser 2006: 26)

According to this schema, the cognitive agent has a body out there in the world5 with external interfaces for recognition and action. Recognition is for transporting content from the external world into the agent's cognition, action is for transporting content from the agent's cognition into the external world.6 In this model, the agent's immediate reference7 with language to corresponding objects in the agent's external environment is reconstructed as a purely cognitive procedure. An example of immediate reference in the hearer mode is following a request, based on (i) language recognition, (ii) transfer of language content to the context level based on matching and (iii) context action. An example in the speaker mode is reporting an observation, based on (i) context recognition, (ii) transfer of context content to the language level based on matching and (iii) language production including sign synthesis.8

From the viewpoint of building a talking robot, the language signs existing in the external reality between communicating agents are merely acoustic perturbations (speech) or doodles on paper (writing) which are completely without any grammatical properties or meaning (cf. Hausser 2006: Sect. 2.2; Hausser 2009b). The latter arise via the agent's word form recognition, based on matching the shapes of the external surface forms with corresponding keys in a lexicon stored in the agent's memory. This lexicon must be acquired by each member of the language community. The learning procedure is self-correcting because using a surface form with the wrong conventional meaning leads to communication problems.

If there is anything like Teubert's and Putnam's notion of language (a position known as linguistic externalism), it is a reification of the intuitions of members of the associated language community, manifested as signs produced by speakers (or writers) in a certain interval of time. These manifestations may then be selected, documented and interpreted by corpus linguists.

4. Automatic word form recognition

The computer may be used not only for the construction of dictionaries, e.g. by using a machine-readable corpus for improving the structure of the lexical entries, but also for their use: instead of finding the entry for a word like decline in the hardcopy of a dictionary using the alphabetical order of the lemmata, the user may type the word on a computer containing an online version of the dictionary - which then returns the corresponding entry on its screen. Especially in the case of large dictionaries with several volumes and extensive cross-referencing, the electronic version is considerably more user-friendly to the computer-literate end-user than the corresponding hardcopy. Electronic lexical lookup is based on matching the unanalyzed surface of the word in question with the lemma of the online entry, as shown in the following schema:

Figure 4. Matching an unanalyzed surface form onto a key

There exist several techniques for matching a given surface form automatically with the proper entry in an electronic lexicon.9
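One of the simplest such techniques is a hash table that maps every listed surface form to its entry. The following sketch illustrates the idea under stated assumptions: the entry format, the abridged readings and the function names `build_index` and `look_up` are hypothetical and are not taken from any actual online dictionary.

```python
# Toy online dictionary: each entry lists its lemma, its inflected forms
# and abridged readings, loosely following the CCELD entry for decline.
ENTRIES = [
    {
        "lemma": "decline",
        "forms": ["declines", "declining", "declined"],
        "readings": [
            "become less in quantity, importance, or strength",
            "politely refuse to accept or do something",
            "the condition or process of becoming less",
        ],
    },
]


def build_index(entries):
    """Map the lemma and every listed surface form to the entry."""
    index = {}
    for entry in entries:
        for surface in [entry["lemma"], *entry["forms"]]:
            index[surface] = entry
    return index


def look_up(surface, index):
    """Return the entry whose key matches the unanalyzed surface, or None."""
    return index.get(surface.lower())


INDEX = build_index(ENTRIES)
entry = look_up("Declined", INDEX)
print(entry["lemma"], len(entry["readings"]))
```

Listing the inflected forms explicitly, as the learner's dictionary does, keeps the lookup a plain key match; a lexicon that names only the paradigm would need a morphological analysis step before matching.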

The method indicated in figure 4 is also used for the automatic word form recognition in a computational model of natural language communication, e.g. Database Semantics. It is just that the format and the content of the lexical descriptions are different.10 This is because the entries in a dictionary are for human users who already have natural language understanding, whereas the entries in an online lexicon are designed for building language understanding in an artificial agent.

5. Concept types and concept tokens

The basic concepts in the agent's head are provided by the external interfaces for recognition and action. Therefore, an artificial cognitive agent must have a real body interacting with the surrounding real world. The implementation of the concepts must be procedural because natural organisms as well as computers require independence from any metalanguage.11 It follows that a truth-conditional or Tarskian semantics cannot be used.12 According to the procedural approach, a robot understands the concept of shoe, for example, if it is able to select the shoes from a set of different objects, and similarly for different colours, different kinds of locomotion like walking, running, crawling, etc. The procedures are based on concept types, defined as patterns with constants and restricted variables, and used at the context level for classifying the raw input and output data.13 As an example, consider the following schema showing the perception of an agent-external square (geometric shape) as a bitmap outline which is classified by a corresponding concept type and instantiated as a concept token at the context level:

Figure 5. Concept types at the context and language level
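The relation between a concept type and a concept token may also be sketched procedurally. The example below assumes that a square is described by four edge attributes and four angle attributes, that the type fixes the angles to a constant and binds all edges to one variable, and that a token replaces the variable by a constant. The class name `SquareToken` and the function `matches_square_type` are hypothetical and serve only as an illustration, not as the implementation used in DBS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SquareToken:
    """A concept token: every attribute carries a constant value."""
    edges_cm: tuple    # four edge lengths
    angles_deg: tuple  # four angles


def matches_square_type(token, tolerance=1e-9):
    """Check a token against the concept type 'square'.

    The type fixes all four angles to the constant 90 degrees and binds all
    four edges to one variable: they must be equal and positive, but the
    shared value itself is left open.
    """
    if len(token.edges_cm) != 4 or len(token.angles_deg) != 4:
        return False
    edge = token.edges_cm[0]  # the value the variable is bound to
    return (
        edge > 0
        and all(abs(e - edge) <= tolerance for e in token.edges_cm)
        and all(abs(a - 90) <= tolerance for a in token.angles_deg)
    )


small = SquareToken(edges_cm=(2, 2, 2, 2), angles_deg=(90, 90, 90, 90))
large = SquareToken(edges_cm=(7.5, 7.5, 7.5, 7.5), angles_deg=(90, 90, 90, 90))
oblong = SquareToken(edges_cm=(2, 3, 2, 3), angles_deg=(90, 90, 90, 90))

print(matches_square_type(small), matches_square_type(large), matches_square_type(oblong))
```

Because the edge length is a variable in the type, tokens of any size are accepted, while a shape with unequal edges is rejected.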

The necessary properties,14 shared by the concept type and the corresponding concept token, are represented by four attributes for edges and four attributes for angles. Furthermore, all angle attributes have the same value, namely the constant "90 degrees" in the type and the token. The edge attributes also have the same value, though it is different for the type and the token. The accidental property of a square is the edge length, represented by the variable α in the type. In the token, all occurrences of this variable have been instantiated by a constant, here 2 cm. Because of its variable, the type of the concept square is compatible with infinitely many corresponding tokens, each with another edge length.

At the language level, the type is reused as the literal meaning of the English surface form square, the French surface form carré and the German surface form Quadrat, for example. The relation between these different surface forms and their common meaning is provided by the different conventions of these different languages. The relation between the meaning at the language level and the contextual referent at the context level is based on matching using the type-token relation.

The representation of a concept type and a concept token in figure 5 is of a preliminary holistic nature, intended for simple explanation.15 How such concepts are exactly implemented as procedures and whether these procedures are exactly the same in every agent is not important. All that is required for successful communication is that they provide the same results (relative to a suitable granularity) in all members of a language community.

6. Proplets

Defining a basic meaning like square as a procedure for recognition and action is only the first step to make an artificial agent understand. Leaving aside questions of whether or not there is a small set of "semantic primitives" (Wierzbicka 1991: 6-8) from which all other meanings can be built, and of whether or not all natural languages code content in the same way (Nichols 1992), let us turn to the form of lexical entries in DBS. Starting from a basic meaning, the lexical entries add morpho-syntactic properties such as part of speech, tense in verbs, number in nouns, etc., needed for grammaticalized aspects of meaning, syntactic agreement, or both. These properties are coded (i) in a way suitable for computational interpretation and (ii) as a data structure fulfilling the following requirements: First, the lexical entries of DBS are designed to provide for an easy computational method to code the semantic relations of functor-argument and coordination structure between word forms. Second, they support a computationally straightforward matching procedure, needed (i) for the application of rules to their input and (ii) for the interaction between the language and the context level inside the cognitive agent. Third, they code the semantic relations in complex expressions in an order-free manner, so that they can be stored in a database in accordance with the needs of storage in the hearer mode and of retrieval in the speaker mode. The format satisfying these linguistic and computational requirements is that of flat (non-recursive) feature structures called proplets. As an example consider the lexical analysis of the English word surface form square as a noun (as in "Anna drew a square"), as a verb (as in "Lorenz squared his account") and as an adjective (as in "Jacob has a square napkin").

These proplets contain the same concept type square (illustrated in figure 5) as the value of their respective core attributes, i.e. noun, verb and adj providing the part of speech. Different surface forms are specified as values of the surface attribute and different morpho-syntactic properties16 are specified as values of the category and semantics attributes. For example, the verb forms are differentiated by the combinatorially relevant cat values ns3' a' v, n-s3' a' v, n' a' v and a' be, whereby ns3' indicates a valency slot (Herbst et al. 2004; Herbst and Schüller 2008) for a nominative 3rd person singular noun, n-s3' for a nominative non-3rd person singular noun, n' for a nominative of any person or number, and a' for a noun serving as an accusative. They are further differentiated by the sem values pres, past/perf and prog for tense and aspect.
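A flat feature structure of this kind is easy to model as a dictionary. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the attribute names sur, cat and sem abbreviate the surface, category and semantics attributes, and the pairing of the four surface forms with the cat and sem values named above is an assumption, since the text lists the values but not their assignment. The cat and sem values of the noun are not given in the text and are therefore left empty.

```python
CORE_ATTRIBUTES = ("noun", "verb", "adj")


def make_proplet(sur, core_attribute, concept, cat=(), sem=()):
    """Build a flat proplet: attributes map to atoms or lists of atoms."""
    assert core_attribute in CORE_ATTRIBUTES
    return {"sur": sur, core_attribute: concept, "cat": list(cat), "sem": list(sem)}


def part_of_speech(proplet):
    """The core attribute doubles as the part of speech."""
    return next(attr for attr in CORE_ATTRIBUTES if attr in proplet)


def is_flat(proplet):
    """True if no value is itself a feature structure (no recursion)."""
    for value in proplet.values():
        items = value if isinstance(value, list) else [value]
        if not all(isinstance(item, str) for item in items):
            return False
    return True


# Assumed pairing of surface forms with the cat and sem values in the text.
verb_forms = [
    make_proplet("squares", "verb", "square", ["ns3'", "a'", "v"], ["pres"]),
    make_proplet("square", "verb", "square", ["n-s3'", "a'", "v"], ["pres"]),
    make_proplet("squared", "verb", "square", ["n'", "a'", "v"], ["past/perf"]),
    make_proplet("squaring", "verb", "square", ["a'", "be"], ["prog"]),
]
noun_form = make_proplet("square", "noun", "square")  # cat and sem left empty

for proplet in verb_forms + [noun_form]:
    print(part_of_speech(proplet), proplet)
```

All variants share the same core value; only the core attribute and the morpho-syntactic values differ, which is what allows one concept to serve several parts of speech.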

This method of characterizing variations in lexical meaning by inserting the same concept as a core value into different proplet structures applies also to the word decline:

Figure 7. Lexical analysis of decline in DBS

The intransitive and the transitive verb variants are distinguished by the absence versus presence of the a' valency position in the respective cat values. The verbs and the noun are distinguished by their respective core attributes verb and noun as well as by their cat and sem values. The possible variations of the base form surface forms correspond to those in figure 6.

7. Grammatical analysis in the hearer mode of DBS

Compared to the CCELD (1987) dictionary entries for decline (cf. figure 1), the corresponding DBS proplets in figure 7 may seem rather meagre. However, in contrast to dictionary entries, proplets are not intended for being read by humans. Instead, proplets are a data structure designed for processing by an artificial agent. The computational processing is of three kinds, (i) the hearer mode, (ii) the think mode and (iii) the speaker mode. Together, they model the cycle of natural language communication.17 In the hearer mode, the processing establishes (i) the semantic relations of functor-argument and coordination structure between proplets (horizontal relations) and (ii) the pragmatic relation of reference between the language and the context level (vertical relations, cf. figure 3). In the think mode, the processing is a selective activation of content in the agent's memory (Word Bank) based on navigating along the semantic relations between proplets and deriving new content by means of inferences.18 In the speaker mode, the navigation is used as the conceptualization for language production.

Establishing semantic relations in the hearer mode is based solely on (i) the time-linear order of the surface word forms and (ii) a lexical lookup provided by automatic word form recognition. As an example, consider the syntactic-semantic parsing of "Julia declined the offer.", based on the DBS algorithm of LA-grammar.

Figure 8. Time-linear derivation establishing semantic relations

The analysis is surface compositional in that each word form is analyzed as a lexical proplet (cf. lexical lookup, here using simplified proplets). The derivation is time-linear, as shown by the stair-like addition of a lexical proplet in each new line. Each line represents a derivation step, based on a rule application. The semantic relations are established by no more and no less than copying values, as indicated by diagonal arrows.19

The result of this derivation is a representation of content as an order-free set of proplets. Given that the written representation of an order-free set requires some order, though arbitrary, the following example uses the alphabetical order of the core values:

Figure 9. Content of "Julia declined the offer."

The proplets are order-free because the grammatical relations between them are coded solely by attribute-value pairs (for example, [arg: Julia offer] in the decline proplet and [fnc: decline] in the Julia proplet) - and not in terms of dominance and precedence in a hierarchy. As a representation of content, the language-dependent surface forms are omitted. Compared to figure 8, the proplets are shown with additional cat and sem features.
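How copying values can establish functor-argument relations in a strictly left-to-right pass may be sketched as follows. This is a deliberately simplified illustration and not the rule format of LA-grammar: the toy lexicon, the function `parse` and the decision to skip the determiner and the full stop are assumptions made to keep the example short.

```python
# Toy lexicon with simplified lexical proplets. Function words and
# punctuation are skipped in this sketch (an assumption).
LEXICON = {
    "Julia": {"noun": "Julia", "fnc": None},
    "declined": {"verb": "decline", "arg": []},
    "the": None,
    "offer": {"noun": "offer", "fnc": None},
    ".": None,
}


def parse(surfaces):
    """Time-linear pass: one lexical lookup and one combination per word."""
    content = {}   # resulting set of proplets, keyed by core value
    verb = None    # core value of the verb read so far
    pending = []   # nouns read before the verb
    for surface in surfaces:
        entry = LEXICON[surface]
        if entry is None:
            continue
        # Copy the lexical proplet so the lexicon itself stays unchanged.
        proplet = {k: (list(v) if isinstance(v, list) else v) for k, v in entry.items()}
        if "verb" in proplet:
            verb = proplet["verb"]
            content[verb] = proplet
            for noun in pending:          # cross-copy with earlier nouns
                proplet["arg"].append(noun)
                content[noun]["fnc"] = verb
            pending.clear()
        else:
            noun = proplet["noun"]
            content[noun] = proplet
            if verb is None:
                pending.append(noun)
            else:                         # cross-copy with the verb
                content[verb]["arg"].append(noun)
                proplet["fnc"] = verb
    return content


content = parse(["Julia", "declined", "the", "offer", "."])
for core in sorted(content, key=str.lower):
    print(content[core])
```

The relations can be recovered from the attribute values alone, whatever order the proplets are stored in: following the fnc value of the Julia proplet leads to the decline proplet, whose arg values lead back to Julia and on to offer.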

8. Abstract coding of semantic relations

Linguistically, the DBS derivation in figure 8 and the result in figure 9 are traditional in that they are based on explicitly coding functor-argument (or valency) structure 20 as well as morpho-syntactic properties. Given that other formal grammar systems, even within Chomsky's nativism, have been showing an increasing tendency to incorporate traditional notions of grammar, there arises the question of whether DBS is really different from them. After all, Phrase Structure Grammar, Categorial Grammar, Dependency Grammar and their many subschools 21 have arrived at a curious state of peaceful coexistence 22 in which the choice between them is more a matter of local tradition and convenience than a deliberate research decision.

DBS is essentially different from the current mainstream grammars mentioned above because DBS hearer mode derivations map the lexical analysis of a language surface form directly into an order-free set of proplets which is suitable (i) for storage in and retrieval from a database and thus (ii) for modelling the cycle of natural language communication.23 This would be impossible without satisfying the following requirements:

Requirements for modeling the cycle of communication:
1. The derivation order must be strictly time-linear.
2. The coding of semantic relations may not impose any order on the items of content.
3. The items of content (proplets) must be defined as flat, non-recursive structures.

These DBS requirements are incompatible with the other grammars for the following reasons: (1) and (2) preclude the use of grammatically meaningful tree structures and as a consequence of (3) there is no place for unification. Behind the technical differences of method there is a more general distinction: the current mainstream grammars are sign-oriented, whereas DBS is agent-oriented.

For someone working in sign-oriented linguistics, the idea of an agent-oriented approach may take some getting used to.24 However, an agent-oriented approach is essential for a scientific understanding of natural language, because the general structure of language is determined by its function 25 and the function of natural language is communication.

Like any scientific theory, the DBS mechanism of natural language communication must be verified. For this, the single most straightforward method is implementing the theory computationally as a talking robot. This method of verification is distinct from the repeatability of experiments in the natural sciences and may serve as a unifying standard for the social sciences. Furthermore, once the overall structure of a talking robot (i.e., interfaces, components and functional flow, cf. figures 3 and 5, Hausser 2009a) has been determined, partial solutions may be developed without the danger of impeding the future construction of more complete systems.26 For example, given that the procedural realization of recognition and action is still in its infancy in robotics, DBS currently makes do with English words as placeholders for core values. As an example, consider the following lexical proplets, which are alike except for the values of their sur and noun attributes:

Figure 10. Different core values in the same proplet structure

These proplets represent a class of word forms with the same morpho-syntactic properties. This class may be represented more abstractly as a proplet pattern.27

Figure 11. Representing a class of word forms as a proplet pattern
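The relation between an explicit class of lexical proplets (figure 10) and a proplet pattern with a restricted variable (figure 11) can be sketched in Python. This is only an illustration under assumptions, not part of the DBS software: the attribute cat, its value "sn", the sample core values and the names Proplet and PropletPattern are hypothetical; only the attributes sur and noun are named in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proplet:
    """A lexical proplet modelled as a flat, non-recursive record."""
    sur: str    # surface form
    noun: str   # core value (an English word used as a placeholder)
    cat: str    # assumed attribute: category shared by the whole class


@dataclass(frozen=True)
class PropletPattern:
    """A proplet in which the core value is a restricted variable."""
    allowed: frozenset  # the values the variable may take
    cat: str

    def instantiate(self):
        """Expand the pattern into the explicit class of proplets."""
        return {Proplet(sur=v, noun=v, cat=self.cat) for v in self.allowed}

    def matches(self, proplet):
        """True if the proplet is a member of the class the pattern describes."""
        return (proplet.sur == proplet.noun
                and proplet.noun in self.allowed
                and proplet.cat == self.cat)


# Hypothetical core values; the proplets differ only in sur and noun.
CORE_VALUES = frozenset({"book", "table", "garden"})
explicit_class = {Proplet(v, v, "sn") for v in CORE_VALUES}
pattern = PropletPattern(allowed=CORE_VALUES, cat="sn")

# The pattern and the explicit listing describe the same set of proplets.
print(pattern.instantiate() == explicit_class)  # True
```

In this sketch the pattern stores the shared attribute values once and the restriction once, whereas the explicit class repeats the shared values for every word form, which is where the saving in space would come from.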

By restricting the variable α to the core values used in figure 10, the representation in figure 11 as a proplet pattern is equivalent to the explicit representation of the proplet class in figure 10. Proplet patterns with restricted variables are used for the base form lexicon of DBS, making it more transparent and saving a considerable amount of space. In concatenated (non-lexical) proplets, (i) the core meaning and (ii) the compositional semantics (based on the coding of morpho-syntactic properties) are clearly separated. This becomes apparent when the core values of any given content are replaced by suitably restricted variables, as shown by the following variant of figure 9:


Figure 12. Compositional semantics as a set of proplet patterns

By restricting the variable α to the values decline, buy, eat, or any other transitive verb, β to the values Julia, Susanne, John, Mary or any other proper name, and γ to the values it, offer, proposal, invitation, etc., this combinatorial pattern may be used to represent the compositional semantics of a whole set of English sentences, including figure 9.

9.

Collocation

At first glance, figure 12 may seem open to the objection that it does not prevent meaningless or at least unlikely combinations like Susanne ate the invitation, i.e. that it fails to handle collocation (which has been one of Sinclair's main concerns). This would not be justified, however, because the hearer mode of DBS is a recognition system taking time-linear sequences of unanalyzed surface forms as input and producing a content, represented by an order-free set of proplets, as output. In short, in DBS the collocations are in the language, not in the grammar. The Generative Grammars of nativism, in contrast, generate tree structures of possible sentences by means of substitutions, starting with the S node. Originally a description of syntactic wellformedness, Generative Grammar was soon extended to include world knowledge governing lexical selection. For example, according to Katz and Fodor (1963), the grammar must characterize ball in the man hit the colorful ball as a round object rather than a festive social event. In this sense, nativism treats collocations as part of the Generative Grammar and Sinclair is correct in his frequent protests against nativist linguists' modelling their own intuitions instead of looking at "real" language. In response, generative grammarians have turned to annotating corpora by hand or statistically (treebanks) for the purpose of obtaining broader data coverage. For example, the University of Edinburgh and various other


universities are known to have syntactically parsed versions of the BNC. The parsers used are the RASP, the Minipar, the Charniak and the IMS parser. Unfortunately, the resulting analyses are not freely available. Yet even if one of them succeeded in achieving complete data coverage (according to some still to be determined standard of wider acceptance) there remains the fact that constituent-structure-based Generative Grammars and their tree structures were never intended to model communication and are accordingly unsuitable for it.

In DBS, the understanding of collocations by natural and artificial agents is based on interpreting (i) the core values and (ii) the functor-argument and coordination structure of the compositional semantics (as in figure 8) - plus the embedding into the appropriate context of use and the associated inferencing. This is no different from the understanding of newly coined phrases (syntactic-semantic neologisms), which are as much a fact of life as are collocations. Speakers or writers utilize the productivity of natural language in word formation and compositional semantics to constantly coin new phrases. Examples range from politics (calling the US stimulus package "the largest generational theft bill on record") via journalism (creative use of navigate in "President Obama has to navigate varying advice on Afghanistan") to advertising (contrived alliteration in "Doubly Choc Chip, Bursting with choc chips in a crunchy chocolaty biscuit base").

Another matter are idioms, such as a drop in the bucket or a blessing in disguise. As frozen non-literal uses, they are either interpretable by the same inferencing as spontaneous non-literal uses (e.g. metaphor, cf. Hausser 2006: 75-78) or they must be learned. For example, an ax(e) to grind may be viewed as similarly opaque (non-compositional or non-Fregean) in syntax-semantics as cupboard is in morphology.
Just as cupboard must be equated with kitchen cabinet in the agent's cognition, an axe to grind (attributed to Benjamin Franklin) must be equated with expressing a serious complaint.

10.

Context

The attempt of Generative Grammar to describe the tacit knowledge of the speaker-hearer without the explicit reconstruction of a cognitive agent has led not only to incorporating lexical selection into the grammar, but also the context of use. Pollard and Sag (1994), for example, propose a treatment of context in HPSG which consists in adding an attribute to lexical entries (see


also Green 1997). The values of this attribute are called constraints and have the form of such definitions28 as

(a) "the use of the name John is legitimate only if the intended referent is named John."
(b) "the complement of the verb regret is presupposed to be true."

For a meaningful computational implementation this is sadly inadequate, though for a self-declared "sign-based" approach it is probably the best it can do. Instead of cramming more and more phenomena of language use into the Generative Grammar, Database Semantics clearly distinguishes between the agent-external real world and the agent-internal cognition. The goal is to model the agent, not the external world.29 Whether the model is successful or not can be verified, i.e. determined objectively, (i) by evaluating the artificial agent's behaviour in its interaction with its environment and with other agents and (ii) by observing the agent's cognitive operations directly via the service channel (cf. Hausser 2006: Sect. 1.4).

In the agent's cognition, DBS clearly separates the language and the context component (cf. figure 3) and defines their interaction via a computationally viable matching procedure based on the data structure of proplets (cf. Hausser 2006: Sect. 3.2). In addition, DBS implements three computational mechanisms of reference for the sign kinds symbol, indexical and name.30 This is the basis for handling the HPSG context definition (a), cited above, as part of a general theory of signs, whereas definition (b) is treated as an inference by the agent.

For systematic reasons, DBS develops the context component first, in concord with ontogeny and phylogeny (cf. Hausser 2006: Sect. 2.1). To enable easy testing and upscaling, the context component is reconstructed as an autonomous agent without language. The advantage of this strategy is that practically all constructs of the context component can be reused when the language component is added. The reuse, in turn, is crucial for ensuring the functional compatibility between the two levels.
For example, the procedural definitions of basic concepts, pointers and markers provided by the external interfaces of the context component are reused by the language component as the core meanings of symbols, indexicals and names, respectively. The context component also provides for the coding of content and its storage in the agent's memory, for inferencing on the content and for the derivation of adequate actions, including language production.


In human-machine communication, the context component is essential for reconstructing two of the most basic forms of natural language interaction. One is telling the artificial cognitive agent what to do, which involves contextual action. The other is the artificial cognitive agent's telling what it has perceived, which involves contextual recognition.

11.

Conclusion

From the linguists' perspective, the learner is for an English learner's dictionary what the artificial cognitive agent is for Database Semantics: each raises the question of what language skills the learner/artificial agent should have. However, the learner already knows how to communicate in a natural language. Therefore, the goal is to provide her or him with information on how to speak English well, which requires the compilation of an easy to use, accurate representation of contemporary English. Database Semantics, in contrast, has to get the artificial agent to communicate with natural language in the first place. This requires the reconstruction of what evolution has produced in millions of years as an abstract theory which applies to natural and artificial agents alike. In other words, Database Semantics must start from a much more basic level than a learner's dictionary.

For DBS, any given natural language requires automatic word form recognition for the expressions to be analyzed, syntactic-semantic interpretation in the hearer mode, resulting in content which is stored in a database and selectively activated and processed in the think mode and appropriately realized in natural language in the speaker mode. On the one hand, each of these requirements constitutes a sizeable research and software project. On the other hand, the basic principles of how language communication works are the same for different languages. Therefore, once the software components for automatic word form recognition, syntactic-semantic parsing, etc. have been developed in principle, they may be applied to different languages with comparatively little effort.31

Because the theoretical framework of DBS is more comprehensive than that of a learner's dictionary, DBS can provide answers to some basic questions. For example, DBS allows one to treat basic meanings in terms of recognition and action procedures, phenomena of language use with the help of an

explicitly defined context component and collocations produced in the speaker mode in terms of what the agent was exposed to in the hearer mode. Conversely, a learner's dictionary as a representation of a language is much more comprehensive than current DBS and thus provides a high standard of what DBS must accomplish eventually.

Notes

1. This paper benefited from comments by Thomas Proisl, Besim Kabashi, Johannes Handl and Carsten Weber (CLUE, Erlangen), Haitao Liu (Communication Univ. of China, Beijing), Kiyong Lee (Korea Univ., Seoul) and Brian MacWhinney (Carnegie Mellon Univ., Pittsburgh).

2. The UCREL CLAWS5 tag-set is available at http://ucrel.lancs.ac.uk/claws5tags.html.

3. Cf. Hausser ([1999] 2001: 295-299).

4. Putnam attributes the same ontological status to the meanings of language as Mathematical Realism attributes to mathematical truths: they are viewed as existing eternally and independently of the human mind. In other words, according to Putnam, language meanings exist no matter whether they have been discovered by humans or not. What may hold for mathematics is less convincing in the case of language. First of all, there are many different natural languages with their own characteristic meanings (concepts). Secondly, these meanings are constantly evolving. Thirdly, they have to be learned and using them is a skill. Treating language meanings as pre-existing Platonic entities out there in the world to be discovered by the members of the language communities is especially doubtful in the case of new concepts such as transistor or ticket machine.

5. The importance of agents with a real body (instead of virtual agents) has been emphasized by emergentism (MacWhinney 2008).

6. While language and non-language processing use the same interfaces for recognition and action, figure 3 distinguishes channels dedicated to language and to non-language interfaces for simplicity: sign recognition and sign synthesis are connected to the language component; context recognition and context action are connected to the context component.

7. Cf. Hausser (2001: 75-77); Hausser (2006: 27-29).

8. For a more extensive taxonomy see the 10 SLIM states of cognition in Hausser (2001: 466-473).

9. See Aho and Ullman (1977: 336-341).

10. Apart from their formats, a dictionary and a system of automatic word form recognition differ also in that the entries in a dictionary are for words (represented by their base form), whereas automatic word form recognition analyzes inflectional, derivational and compositional word forms on the basis of a lexicon for allomorphs or morphemes (cf. Hausser 2001: 241-257). Statistical tagging also classifies word forms, but uses transitional likelihoods rather than a compositional analysis based on a lexical analysis of the word form parts.

11. Cf. Hausser (2001: 82-83).

12. Cf. Hausser (2001: 375-387).

13. For a more detailed discussion of the basic mechanisms of recognition and action see Hausser (2001: 53-61) and Hausser (2006: 54-59).

14. Necessary as opposed to accidental (kata sumbebekos), as used in the philosophical tradition of Aristotle.

15. For a declarative specification of memory-based pattern recognition see Hausser (2005).

16. For simplicity, proplets for the genitive singular and plural forms of the noun and any comparative and superlative forms of the adjective are omitted. Also, the attributes nc (next conjunct) and pc (previous conjunct) for the coordination of nouns, verbs and adjectives have been left out.

17. For a detailed explanation of the lexical analysis in Database Semantics see Hausser (2006: 51-54, 209-216).

18. For a concise description of this cycle see Hausser (2009a).

19. Cf. Hausser (2006: 71-74).

20. For more detailed explanations, especially the function word absorptions in line 3 and 4, see Hausser (2006: 87-90) and Hausser (2009a, 2009b).

21. In addition, the DBS method is well-suited for handling extrapropositional functor-argument structure (subclauses) and intra- and extrapropositional coordination including gapping, as shown in Hausser (2006: 103-160).

22. Known by acronyms such as TG (with its different manifestations ST, EST, REST and GB), LFG, GPSG, HPSG, CG, CCG, CUG, FUG, UCG, etc. This state is being justified by a whole industry of translating between the different grammar systems and proposing conjectures of equivalence. An early, pre-statistical instance is Sells (1985), who highlights the common core of GB, GPSG and LFG. More recent examples are Andersen et al. (2008), who propose a treebank based on Dependency Grammar for the BNC and Liu and Huang (2006) for Chinese; Hockenmaier and Steedman (2007) describe CCGbank as a translation of the Penn Treebank (Marcus, Santorini and Marcinkiewicz 1993) into a corpus of Combinatory Categorial Grammar derivations.

23. A formal difference is that LA-grammar is the first and so far the only algorithm with a complexity hierarchy which is orthogonal to the Chomsky hierarchy (Hausser 1992).

24. Also, there seems to be an irrational fear of creating artificial beings resembling humans. Such homunculi, which occur in the earliest of mythologies, are widely regarded as violating the taboo of doppelganger similarity (Girard 1972). Another matter is the potential for misuse - which is a possibility in any basic science with practical ramifications. Misuse of DBS (in some advanced future state) must be curtailed by developing responsible guidelines for clearly defined laws to protect privacy and intellectual property while maintaining academic liberty, access to information and freedom of discourse.

25. This is in concord with Darwin's theory of evolution in which anatomy, for example, will be structured according to functions associated with use.

26. The recent history of linguistics contains numerous examples of naively treating morphological as well as semantic phenomena in the syntax, pragmatic phenomena in the semantics, etc. These are serious mistakes, some of which have derailed scientific progress for decades.

27. MacWhinney (2005) describes "feature-based patterns" arising from "item-based patterns", which resembles our abstraction of proplet patterns from classes of corresponding proplets.

28. These definitions are reminiscent of Montague's (1974) meaning postulates for constraining a model structure of possible worlds, defined purely in terms of set theory. Supposed to represent spatio-temporal stages of the actual world plus counterfactual worlds with unicorns, etc., a realistic definition or programming of such a model structure is practically impossible. Therefore, it is always defined "in principle" only. Cf. Hausser (2001: 392-395).

29. This is in contrast to the assumptions of truth-conditional semantics, including Montague Grammar, Situation Semantics, Discourse Semantics, or any other metalanguage-based approach. Cf. Hausser (2001: 371-426), Hausser (2006: 25-26).

30. Cf. Hausser (2001: 103-107), Hausser (2006: 29-34). The type-token relation between corresponding concepts at the language and the context level illustrated in 5.1 happens to be the reference mechanism of symbols.

31. For example, given (i) an on-line dictionary of a new language to be handled and (ii) a properly trained computational linguist, an initial system of automatic word form recognition can be completed in less than six months. It will provide accurate, highly detailed analyses of about 90% of the word form types in a corpus.

References

Aho, Alfred Vaino and Jeffrey David Ullman
  1977  Principles of Compiler Design. Reading, MA: Addison-Wesley.
Andersen, Øivind, Julien Nioche, Edward John Briscoe and John A. Carroll
  2008  The BNC parsed with RASP4UIMA. In Proceedings of the Sixth Language Resources and Evaluation Conference (LREC'08), Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odjik, Stelios Piperidis and Daniel Tapias (eds.), 865-860. Marrakech, Morocco: European Language Resources Association (ELRA).
Girard, René
  1972  La violence et le sacré. Paris: Bernard Grasset.
Green, Georgia
  1997  The structure of CONTEXT: The representation of pragmatic restrictions in HPSG. Proceedings of the 5th Annual Meeting of the Formal Linguistics Society of the Midwest, James Yoon (ed.), 215-232. Studies in the Linguistic Sciences.
Hausser, Roland
  1992  Complexity in left-associative grammar. Theoretical Computer Science 106 (2): 283-308.
Hausser, Roland
  2001  Foundations of Computational Linguistics: Human-Computer Communication in Natural Language. Berlin/Heidelberg/New York: Springer. 2nd edition. First published in 1999.
Hausser, Roland
  2005  Memory-based pattern completion in database semantics. Language and Information 9 (1): 69-92.
Hausser, Roland
  2006  A Computational Model of Natural Language Communication: Interpretation, Inference and Production in Database Semantics. Berlin/Heidelberg/New York: Springer.
Hausser, Roland
  2009a  Modeling natural language communication in database semantics. In Proceedings of the APCCM 2009, Markus Kirchberg and Sebastian Link (eds.), 17-26. Australian Computer Science Inc., CIPRIT, Vol. 96. Wellington, New Zealand: ACS.
Hausser, Roland
  2009b  From word form surfaces to communication. In Information Modelling and Knowledge Bases XXI, Hannu Kangassalo, Yasushi Kiyoki and Tatjana Welzer (eds.), 37-58. Amsterdam: IOS Press Ohmsha.
Herbst, Thomas, David Heath, Ian Roe and Dieter Goetz
  2004  A Valency Dictionary of English: A Corpus-Based Analysis of the Complementation Patterns of English Verbs, Nouns and Adjectives. Berlin: Mouton de Gruyter.
Herbst, Thomas and Susen Schüller
  2008  Introduction to Syntactic Analysis: A Valency Approach. Tübingen: Gunter Narr.
Hockenmaier, Julia and Mark Steedman
  2007  CCGbank: A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics 33 (3): 355-396.
Katz, Jerrold Jacob and Jerry Alan Fodor
  1963  The structure of a semantic theory. Language 39: 170-210.
Kirk, John M.
  1998  Review of T. McEnery and A. Wilson 1996 and of G. Barnbrook 1996. Computational Linguistics 24 (2): 333-335.
Liu, Haitao and Wei Huang
  2006  A Chinese dependency syntax for treebanking. In Proceedings of the 20th Pacific Asia Conference on Language, Information, Computation, 126-133. Beijing: Tsinghua University Press.
MacWhinney, Brian James
  2005  Item-based constructions and the logical problem. Association for Computational Linguistics (ACL), 46-54. Morristown, NJ: Association for Computational Linguistics (ACL).
MacWhinney, Brian James
  2008  How mental models encode embodied linguistic perspective. In Embodiment, Ego-Space and Action, Roberta L. Klatzky, Marlene Behrmann and Brian James MacWhinney (eds.), 360-410. New York: Psychology Press.
Marcus, Mitchell P., Beatrice Santorini and Mary Ann Marcinkiewicz
  1993  Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19: 313-330.
Montague, Richard
  1974  Formal Philosophy. New Haven: Yale University Press.
Nichols, Johanna
  1992  Linguistic Diversity in Space and Time. Chicago: University of Chicago Press.
Pollard, Carl and Ivan Sag
  1994  Head-Driven Phrase Structure Grammar. Stanford: CSLI.
Putnam, Hilary
  1975  The meaning of "meaning". In Mind, Language and Reality: Philosophical Papers, vol. 2, Hilary Putnam (ed.), 215-271. Cambridge: Cambridge University Press.
Sinclair, John McH. (ed.)
  1987  Collins COBUILD English Language Dictionary. London/Glasgow: Collins.
Sinclair, John McH.
  1991  Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Sells, Peter
  1985  Lectures on Contemporary Syntactic Theories: An Introduction to GB Theory, GPSG and LFG. Stanford: CSLI.
Teubert, Wolfgang
  2008  [Corpora-List] Bootcamp: 'Quantitative Corpus Linguistics with R' - re Louw's endorsement, http://mailman.uib.no/public/corpora/2008-August/007089.html.
Teubert, Wolfgang and Ramesh Krishnamurthy (eds.)
  2007  Corpus Linguistics. London: Routledge.
Wierzbicka, Anna
  1991  Cross-Cultural Pragmatics: The Semantics of Human Interaction. Berlin: Mouton de Gruyter.

Corpus

BNC  The British National Corpus, version 3 (BNC XML Edition). 2007. Distributed by Oxford University Computing Services on behalf of the BNC Consortium, http://www.natcorp.ox.ac.uk/.

Chunk parsing in corpora

Günther Görz and Günter Schellenberger

1.

What on earth is a chunk?

1.1.

Basic features of chunks

A fundamental analytical task in Natural Language Processing (NLP) is the segmentation and labeling of texts. In a first step, texts are broken up into sentences as sequences of word forms (tokens). Chunking in general means to assign a partial structure to a sentence. Tagging assigns to the tokens labels which represent word specific and word form specific information such as the word category and morphological features. Chunk parsing regards sequences of tokens and tries to identify structural relations within and between segments. Chunk parsing as conceived by Abney (1991) originated from a psycholinguistic motivation (1991: 257):

I begin with an intuition: when I read a sentence, I read it a chunk at a time. For example, the previous sentence breaks up something like this:

(1) [I begin] [with an intuition]: [when I read] [a sentence], [I read it] [a chunk] [at a time]

These chunks correspond in some way to prosodic patterns. It appears, for instance, that the strongest stresses in the sentence fall one to a chunk, and pauses are most likely to fall between chunks. Chunks also represent a grammatical watershed of sorts. The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template ... There is psychological evidence for the existence of chunks ...

In the context of corpus linguistics, chunk parsing is regarded as an efficient and robust approach to parsing at the cost of not trying to deal with all of language. Hence, a good coverage on given corpora can be achieved, also in the presence of errors as it is the case with (transcribed) speech. Although the problem is defined on the semantic-pragmatic level, it can be captured automatically only by syntactic means - notice the analogy to the


problem of collocations. Chunks are understood as "non-overlapping regions of text, usually consisting of a head word (such as a noun) and the adjacent modifiers and function words (such as adjectives or determiners)" (Bird, Klein and Loper 2006). Technically, there are two main motivations for chunking: to locate information - for information retrieval - or to ignore information, e.g. to find evidence for linguistic generalizations in lexicographic and grammatical research. The grouping of adjacent words into a single chunk, i.e. a subsequence, should be faithful regarding the meaning of the original sentence. The sentence (1)

The quick brown fox jumps over the lazy dog.

will be represented as: ((FOX) (JUMPS) over (DOG)) where (FOX) and (DOG) represent the noun chunks (the quick brown fox) with head fox and (the lazy dog) with head dog, respectively. In a similar manner, (JUMPS) is a one-word verb chunk headed by jumps. The results of chunk parsing are shorter and easier to handle, e.g. for computer aided natural language tools. Briefly, chunk parsing follows a divide-and-conquer principle as illustrated in the following commutative diagram:

In this diagram, LF(x,y,...) represents some logical form, e.g. a predicate like jumps_over(who, over-whom), which can be extracted much more easily from the compact sequence on the lower left side of the diagram. Chunking serves as a normalization step; the meaning of the original sentence can be derived from that of the chunked result by substituting the chunks (bottom line) with their origin. As the chunked sequence is much shorter than the original sentence, the implementation of m(.) is easier.

1.2.

Chunk parsing and full parsing

In view of examples such as (1) above it is intuitive to think of chunk parsing as an intermediate step towards full parsing. The type and head of a chunk are taken as non-terminal symbol and literal substitute, respectively,


in a recursive process. After the first stage of base segmentation, adjacent chunks and isolated words can be wrapped up to chunks of a higher level, finally reaching a full constituent tree. The advantages of introducing an additional layer of processing are based on the assumption:

Proposition 1 Chunks have a much more simple internal structure than the sequence of chunks inside higher level constructions, including sentences.

In order to support the intentions mentioned above, some constraints are imposed on chunks:

Proposition 2 Chunks ...
1. never cross constituent boundaries;
2. form true compact substrings of the original sentence;
3. are implemented based on elementary features of the word string, like the POS tags or word types, avoiding deep lexical or structural parameterization incorporated in their implementation;
4. are not recursive.

Rule (3) is required to allow for fast and reliable construction of chunking structures based on their simple nature. A closer inspection of rule (4), which gives a formal definition of simplicity, shows some consequences:

- Chunks do not contain other chunks;
- Recursive rules like the following are not allowed (NC: Noun Chunk):
    NC → Det NC
- Recursive rule systems, e.g. (where ADVC: Adverbial Chunk)
    ADVC → Adj ADVC?
    NC → Det ADVC? N
  can be 'flattened' to regular structures like
    NC → Det Adj? Adj? Adj? N
  (? indicates that the preceding element is optional)
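How a flattened, non-recursive rule can be applied with purely regular means may be sketched in Python. The sketch is an illustration under assumptions, not the authors' implementation: the POS tag names (Det, Adj, N, V, P), the hand-tagged input for example (1), the one-word verb chunk rule VC → V and the function names chunk and heads are all hypothetical. Taking the last token of a chunk as its head is likewise a simplification that happens to fit this example.

```python
import re

# Hypothetical hand-tagged input for example (1); the tag names are assumptions.
TAGGED = [("The", "Det"), ("quick", "Adj"), ("brown", "Adj"), ("fox", "N"),
          ("jumps", "V"), ("over", "P"), ("the", "Det"), ("lazy", "Adj"),
          ("dog", "N")]

# One character per tag, so that a match position equals a token position.
CODE = {"Det": "D", "Adj": "A", "N": "N", "V": "V", "P": "P"}

# Flat chunk rules written as a regular expression over the tag string:
#   NC -> Det Adj? Adj? Adj? N      VC -> V
RULES = re.compile(r"(?P<NC>DA?A?A?N)|(?P<VC>V)")


def chunk(tagged):
    """Return a list of chunks (type, tokens) interleaved with unchunked words."""
    tags = "".join(CODE[tag] for _, tag in tagged)
    result, pos = [], 0
    for m in RULES.finditer(tags):
        # Words between two chunks are passed through unchanged.
        result.extend(word for word, _ in tagged[pos:m.start()])
        tokens = [word for word, _ in tagged[m.start():m.end()]]
        result.append((m.lastgroup, tokens))
        pos = m.end()
    result.extend(word for word, _ in tagged[pos:])
    return result


def heads(chunks):
    """Replace every chunk by its head, taken here to be its last token."""
    return [c if isinstance(c, str) else "(" + c[1][-1].upper() + ")"
            for c in chunks]


chunks = chunk(TAGGED)
print(chunks)
print("(" + " ".join(heads(chunks)) + ")")  # ((FOX) (JUMPS) over (DOG))
```

Because the rules contain no recursion, a single left-to-right pass of a regular expression matcher suffices, and no chunk can contain another chunk.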

1.3.

Use of chunks in spoken language processing

Our work in the area of speech processing systems, including dialogue control, gave us additional motivation for chunk parsing. In the special case of (transcribed) speech corpora, usually we don't have many well-formed sentences and grammatically perfect constituents. A lot of additional ambiguities come up, because the usual differences in spelling cannot be found,

e.g. May vs. may or the German verb essen vs. the German proper noun (name of the city) Essen. Furthermore, there are no punctuation marks, including marks for the beginning and end of sentences, which again raises many reading ambiguities. This means that dialogue systems have to follow multiple different paths of interpretation. Therefore, search spaces tend to become much bigger and we are facing time and memory limitations due to combinatorial explosion. In practical systems, recognition errors have to be taken into account and they should be identified as soon as possible. Base chunks meeting the requirements of Proposition (2) are therefore often the ultimate structures for spoken language systems as input to higher analysis levels which have to assign semantic roles to the parts of chunked input. To summarize, reliable syntactic structures other than chunks are often not available in speech processing.

1.4.

Limitations and problems of chunking in English

Although at first glance the idea of chunk parsing promises to push natural language understanding towards realistic applications, things are not as easy as they seem. Named entities such as John Smith or The United Kingdom should be identified as soon as possible in language processing. As a consequence, named entity recognition has to be included in chunk parsing. But merging subsequent nouns into the same chunk cannot be introduced as a general rule. In fact, the famous example (1) already contains a trap making it easy for chunk parsers to stumble over:

- jumps could be taken as a noun (plural of jump) and merged into the preceding noun chunk which is prohibitive for semantic analysis.
- Some more examples of this kind are:
  (2) The horse leaps with joy.
  (3) This makes the horse leap with joy.
  (4) Horse leaps are up to eight meters long.
- Similar considerations hold for named entities comprising a sequence of nouns or terms such as
  (5) System under construction, matter of concern, ...
- Chunk parsers usually do not wrap compound measurement expressions like
  (6) twenty meters and ten centimetres
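The trap in example (1) can be made concrete with a small Python sketch. Everything in it is hypothetical: the toy lexicon, the tag names and the naive rule that extends a noun chunk over every following word that could be a noun. The sketch only illustrates why such a rule, which gives the desired result for a shortened variant of example (4), goes wrong on example (1).

```python
# Toy lexicon (an assumption for illustration): each word form lists the
# categories it may have; "jumps" and "leaps" are ambiguous between N and V.
LEXICON = {
    "the": {"Det"}, "quick": {"Adj"}, "brown": {"Adj"}, "fox": {"N"},
    "jumps": {"N", "V"}, "over": {"P"}, "lazy": {"Adj"}, "dog": {"N"},
    "horse": {"N"}, "leaps": {"N", "V"}, "are": {"V"}, "long": {"Adj"},
}


def naive_noun_chunks(words):
    """Greedy rule: Det? Adj* N, then absorb every following possible noun."""
    chunks, i = [], 0
    while i < len(words):
        start = i
        if "Det" in LEXICON[words[i]]:
            i += 1
        while i < len(words) and "Adj" in LEXICON[words[i]]:
            i += 1
        if i < len(words) and "N" in LEXICON[words[i]]:
            i += 1
            # The problematic general rule: merge subsequent nouns.
            while i < len(words) and "N" in LEXICON[words[i]]:
                i += 1
            chunks.append(words[start:i])
        else:
            i = start + 1  # no noun chunk starts here; move on
    return chunks


# Shortened variant of example (4): merging the two nouns is what we want.
print(naive_noun_chunks("horse leaps are long".split()))
# [['horse', 'leaps']]

# Example (1): the verb "jumps" is swallowed by the preceding noun chunk.
print(naive_noun_chunks("the quick brown fox jumps over the lazy dog".split()))
# [['the', 'quick', 'brown', 'fox', 'jumps'], ['the', 'lazy', 'dog']]
```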

These examples reveal a fundamental problem of implementing rule (1) in Proposition (2): this rule expresses a semantic constraint - how can it be implemented consistently with the other rules? How should a chunk parser respect boundaries of structures which are to be built later based upon its own results? Problems which go even deeper show up on closer inspection of verb phrases as jumps over... or leap with... in the examples (1) and (2)-(4), resp. In example (1), it might not be helpful to merge the preposition "over" with the chunk (DOG) into a prepositional phrase. Here, the preposition qualifies the verb, not the object. More precisely, "over" qualifies the object's role: the dog is not a location, not a temporal unit etc., but the affected entity of the predicate. Again, whether or not a preposition following a verb forms a phrasal expression with that verb and should be merged into a chunk cannot be decided by following the rules for chunks above. This holds, of course, for verbal expressions spanning several words (e.g. "put up with" or "add up to"), but also for expressions like "as far as I know". Missing or wrong chunking could in such cases lead later processing steps into a trap again. At least, in English, prepositions or other material that considerably change the reading of a verb directly follow that verb or are kept close to it.

1.5.

Limitations of chunking in German

The definition of chunks as subsequences of the original sentence provides a serious coverage limitation for German with its relatively frequent discontinuous structures. The burden of ambiguities between verbs and nouns in German is somehow eased by capitalization rules, which of course do not help in the case of speech processing. In addition, German sentences may start with a verb; furthermore it is not common to combine the words which make up geographical place names with dashes. So, instead of

(7) Stratford-upon-Avon

there exist

(8) Weil der Stadt
(9) Neustadt an der Waldnaab
(10) Neuhausen auf den Fildern

274 Gunther Gorz and Gunter Schellenberger

What is even worse is that the constraints easing the process of verb phrase chunking in English do not hold for German prefixed verbs which are separable in present tense. On the contrary, the space between verbal stem and split prefix not only allows for constituents of arbitrary length, it even has to include at least passive objects:

(11) Ich hebe Äpfel, welche vom Baum gefallen sind, niemals auf. 'I never pick up apples fallen down from the tree.'
(12) Ich hebe Äpfel niemals auf, welche vom Baum gefallen sind.
(13) ??Ich hebe niemals auf Äpfel, welche vom Baum gefallen sind.

The facts concerning German separable composite verbs in present tense compared with phrasal verbs in English can be extended without exceptions to auxiliary and modal constructions, including past, perfect and future. This, amongst other linguistic phenomena, drastically limits the coverage of chunk parsing for German. However, even for German dialogue systems, chunking of at least noun phrases is necessary at the beginning of processing.

2. Using corpora: from chunking to meaning

Our introduction suggests that a good start for chunk parsing would be to commence with regular rules. Unfortunately, purely rule-based approaches lack sufficient coverage. Therefore, to improve the performance of chunk parsing, examples from corpora have to be included.

2.1. IOB tagging

A general and well-established approach to the problems introduced in the preceding section is to incorporate annotated samples which are compared with the text to be analysed. To make this approach applicable to base chunking, the task of chunk parsing is first transformed into a problem of tagging and prototyping as follows: Given a sequence of words $w_1 w_2 \dots w_n$, assign to it a corresponding sequence of tags $t_1 t_2 \dots t_n$ with $t_i \in \{I, O, B\}$, where

- I: $w_i$ is inside a chunk;
- O: $w_i$ is outside any chunk;
- B: $w_i$ is the first word of a chunk.

Optionally, a syntactic classification can be attached to B-tags: B-NP, B-VP, etc. Alternatively, this subclassification is done separately.
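The transformation of chunking into tagging can be made concrete with a small sketch. It assumes, purely for illustration, that chunks are given as (start, end, label) spans with an exclusive end index and that the syntactic label is attached to the B-tag only; the function names and the sample sentence are hypothetical.

```python
def chunks_to_iob(words, chunks):
    """Map chunk spans to one IOB tag per word.

    `chunks` holds (start, end, label) triples with `end` exclusive;
    this span format is an assumption made for the illustration.
    """
    tags = ["O"] * len(words)            # default: outside any chunk
    for start, end, label in chunks:
        tags[start] = f"B-{label}"       # first word of the chunk
        for i in range(start + 1, end):
            tags[i] = "I"                # remaining words inside the chunk
    return tags


def iob_to_chunks(tags):
    """Recover (start, end, label) spans from a sequence of IOB tags."""
    chunks, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag != "I" and start is not None:
            chunks.append((start, i, label))   # close the chunk that was open
            start, label = None, None
        if tag.startswith("B"):
            start, label = i, tag.partition("-")[2]
    if start is not None:                      # chunk running to the end
        chunks.append((start, len(tags), label))
    return chunks


words = ["the", "quick", "fox", "jumps", "over", "the", "dog"]
chunks = [(0, 3, "NP"), (3, 4, "VP"), (5, 7, "NP")]
tags = chunks_to_iob(words, chunks)
```

Because the two functions are inverses of each other on well-formed input, a chunker can be trained and evaluated entirely on tag sequences and its output converted back into chunks afterwards.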

2.2. Solutions by means of statistical inference

One class of solutions of the IOB-tagging problem is to introduce numerical functions based on numerical features of the $w_i$: For each word in $w_1, w_2, \dots, w_n$ assign a vector $v_i = (v_{i,1}, v_{i,2}, \dots, v_{i,m})$ of features, i.e. numerical functions. Examples of definitions for $v_{i,j}$:

- $v_{i,1} = 1$ if $w_i$ is a noun, 0 otherwise
- $v_{i,2} = 1$ if $w_i$ is a verb, 0 otherwise
- $v_{i,30} = 1$ if $w_i$ = 'Company', 0 otherwise
- POS-tag of $w_i$
- Prefix or suffix

With the exception of the beginning and end of a sentence or a sequence, the features $v_{i,j}$ are independent of $i$. Each feature is a function of a window $w_{i-l}, w_{i-l+1}, \dots, w_{i+l}$ around $w_i$ of length $l$.

$t_i$ is expressed as a function of the feature vector assigned to $w_i$, parameterized by a parameter vector $\alpha$ independent of $i$: $t_i = F_\alpha(v_i)$. The parameters $\alpha$ are calculated so as to generate optimum results given a corpus equipped with IOB-tags which in turn are assumed to be correct:

Proposition 3. Given a tagged corpus $\mathcal{R} = (w_0, t_0), (w_1, t_1), \dots, (w_N, t_N)$, called the training corpus, find $\alpha$ to minimize $\sum_{i=0}^{N} \lVert F_\alpha(v_i) - t_i \rVert$.
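A minimal sketch of this setup follows. Everything specific in it is an assumption chosen for illustration rather than taken from the chapter: the six toy features, the linear form of the tagging function, the numerical coding of the tags (1.0 for a word inside a noun chunk, 0.0 otherwise), the squared difference as the error term, and the simple least-mean-squares update used to reduce it.

```python
def feature_vector(words, i):
    """Return v_i for w_i; each feature looks at a window of one word either side."""
    def capitalized(j):
        return 1.0 if 0 <= j < len(words) and words[j][:1].isupper() else 0.0

    return [
        1.0,                                               # constant (bias) feature
        capitalized(i),                                    # w_i starts with a capital
        capitalized(i - 1),                                # left neighbour does
        capitalized(i + 1),                                # right neighbour does
        1.0 if words[i].lower() in {"the", "a"} else 0.0,  # w_i is an article
        1.0 if words[i].endswith("s") else 0.0,            # crude suffix feature
    ]


def F(alpha, v):
    """Linear choice for the tagging function: dot product of alpha and v_i."""
    return sum(a * x for a, x in zip(alpha, v))


def objective(alpha, corpus):
    """Sum of squared differences between F(alpha, v_i) and t_i over the corpus."""
    return sum((F(alpha, v) - t) ** 2 for v, t in corpus)


def fit(corpus, epochs=200, rate=0.1):
    """Reduce the objective by repeated small corrections of alpha."""
    alpha = [0.0] * len(corpus[0][0])
    for _ in range(epochs):
        for v, t in corpus:
            error = F(alpha, v) - t
            alpha = [a - rate * error * x for a, x in zip(alpha, v)]
    return alpha


words = ["The", "dog", "chased", "Peter", "Smith", "home"]
tags = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0]   # toy coding: 1.0 = word inside a noun chunk
corpus = [(feature_vector(words, i), t) for i, t in enumerate(tags)]
alpha = fit(corpus)
```

In this toy corpus "dog" and "home" receive identical feature vectors but different tags, so no parameter vector can fit the data exactly; this is the situation in which richer features or larger windows become necessary.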


2.3. Preparing an annotated corpus

The task of selecting the function $F_\alpha(\cdot)$ is far from trivial, and so is the minimization task for the parameters; there are several approaches to finding a satisfactory result. In general, the problem is given a geometric interpretation: the vectors of evaluated features are taken as points in a multidimensional space, each equipped with a tag, namely the assigned IOB-tag. The task then is to:

- Identify clusters of points carrying identical tags;
- Express membership to or distance from clusters by appropriate functions.
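The two tasks just listed can be illustrated with a deliberately simple stand-in, a nearest-centroid classifier. It is not the method of any system discussed here; the two-dimensional feature vectors and their tags are invented for the illustration.

```python
import math
from collections import defaultdict


def centroids(points):
    """Average the feature vectors that carry the same tag (one cluster per tag)."""
    sums, counts = {}, defaultdict(int)
    for vector, tag in points:
        if tag not in sums:
            sums[tag] = [0.0] * len(vector)
        sums[tag] = [s + x for s, x in zip(sums[tag], vector)]
        counts[tag] += 1
    return {tag: [s / counts[tag] for s in total] for tag, total in sums.items()}


def nearest_tag(vector, centres):
    """Express cluster membership as Euclidean distance to each cluster centre."""
    return min(centres, key=lambda tag: math.dist(vector, centres[tag]))


# Hypothetical tagged points in a two-dimensional feature space.
points = [
    ([1.0, 0.0], "B"), ([0.9, 0.1], "B"),
    ([0.0, 1.0], "I"), ([0.1, 0.9], "I"),
    ([0.0, 0.0], "O"),
]
centres = centroids(points)
predicted = nearest_tag([0.8, 0.2], centres)
```

Each cluster is reduced to a single centre point here, which is the crudest possible membership function; more refined classifiers differ mainly in how they draw the boundaries between the tagged regions.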

Example: The Support Vector Machine (SVM) separates areas of different tags with hyperplanes of maximum margin. Calculating is the true proficiency of computers, so as soon as appropriate features and $F_\alpha(\cdot)$ are selected, chunk parsing can be done as efficiently as required. But how can the laborious work of preparing a training corpus $\mathcal{R}$ be facilitated? There are two main options:

1. An automatically annotated corpus is corrected manually. To begin with, a chunk parser based on a few simple rules incorporating basic features, e.g. POS-tags, is used (cf. Proposition (2)). This initial parser is called the "baseline".
2. Chunk structures are derived from a corpus already equipped with higher-level analyses, for example constituency trees (treebank). For an example, cf. Tjong Kim Sang and Buchholz (2000).

At this point, it is worthwhile to point out that there is no formal and verifiable definition of correct chunking. Tagging a corpus to train a chunker also means defining the chunking task itself.

2.4. Transformation-based training

In view of option (2), another approach to the tagging task in general, including IOB-tagging, opens up. The method outlined in the following is called Transformation-Based Learning (cf. Ramshaw and Marcus 2005).

1. Start with a baseline tagger. Example: Use POS-tags of words alone to define the mapping into IOB-space.
2. Identify the set of words in the input $w_1, w_2, \dots$ which have been mistagged.
3. Add one or more rules to correct as many errors as possible.
4. Retag the corpus and restart at step 2 until no or not enough errors remain.
5. Given a sequence of words $u_1, u_2, \dots$ outside the corpus to be tagged, do baseline tagging, then apply the rules found in the steps above.

Example: "Adjectives ... that are currently tagged I but that are followed by words tagged O have their tags changed to O" (Ramshaw and Marcus 2005: 91).
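The five steps can be sketched in a few lines. The sketch rests on several assumptions that are not part of the method as published: a two-tag scheme (I and O only), a toy POS-to-tag baseline, a single rule template ("a word with POS p that is tagged x and followed by a word tagged y is retagged z"), and a greedy choice of one rule per round. The training sentence is invented.

```python
def baseline(pos_tags):
    """Step 1: derive chunk tags from POS tags alone (toy mapping)."""
    inside = {"DT", "JJ", "NN"}
    return ["I" if p in inside else "O" for p in pos_tags]


def apply_rule(rule, pos_tags, tags):
    """Retag words with POS `pos` and tag `old` that precede a word tagged `nxt`."""
    pos, old, nxt, new = rule
    out = list(tags)
    for i in range(len(tags) - 1):
        if pos_tags[i] == pos and tags[i] == old and tags[i + 1] == nxt:
            out[i] = new
    return out


def errors(tags, gold):
    """Step 2: count the mistagged words."""
    return sum(t != g for t, g in zip(tags, gold))


def learn(pos_tags, gold, max_rules=10):
    """Steps 3 and 4: add the best rule, retag, repeat while errors decrease."""
    tags, rules = baseline(pos_tags), []
    candidates = [(p, old, nxt, new)
                  for p in sorted(set(pos_tags))
                  for old in "IO" for nxt in "IO" for new in "IO" if old != new]
    for _ in range(max_rules):
        best = min(candidates,
                   key=lambda r: errors(apply_rule(r, pos_tags, tags), gold))
        retagged = apply_rule(best, pos_tags, tags)
        if errors(retagged, gold) >= errors(tags, gold):
            break                      # no rule reduces the error count further
        rules.append(best)
        tags = retagged
    return rules


def tag(pos_tags, rules):
    """Step 5: baseline tagging followed by the learned rules, in order."""
    tags = baseline(pos_tags)
    for rule in rules:
        tags = apply_rule(rule, pos_tags, tags)
    return tags


# "the dog is hungry now": the predicative adjective lies outside the noun chunk.
train_pos = ["DT", "NN", "VBZ", "JJ", "RB"]
train_gold = ["I", "I", "O", "O", "O"]
rules = learn(train_pos, train_gold)
```

On this single training sentence the procedure arrives at a rule of the same shape as the one quoted above: an adjective tagged I and followed by a word tagged O is retagged O.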

3. Assessment of results

3.1. Measuring the performance of chunk parsers

A manually tagged corpus (see $\mathcal{R}$ in Proposition (3)) is passed to a chunker for automatic tagging. The performance is measured in terms of precision and recall. Informally, precision is the number of segments correctly labelled by the chunker (true positives), divided by the total number of segments found by the chunker, i.e. the sum of true positives and false positives, the latter being items incorrectly labelled as belonging to the class. Recall is defined in this context as the number of true positives divided by the total number of elements that actually belong to the class, i.e. the sum of true positives and false negatives, the latter being items which were not labelled as belonging to that class but should have been. The formal definitions of precision and recall are as follows:

- True Positives (TP): segments that are identified and correctly labelled by the chunker;
- False Positives (FP): segments that are labelled by the chunker, but not in $\mathcal{R}$;
- False Negatives (FN): segments that are labelled in $\mathcal{R}$, but not by the chunker;
- True Negatives (TN): segments that are labelled neither in $\mathcal{R}$, nor by the chunker;
- Precision: percentage of selected items that were correctly labelled: $\text{Precision} = \frac{TP}{TP + FP}$
- Recall: percentage of segments that were detected by the chunker: $\text{Recall} = \frac{TP}{TP + FN}$

The results of the "CoNLL-2000 Shared Task in Chunking" (Tjong Kim Sang and Buchholz 2000) are still representative of the state of the art: precision and recall just below 94 % for pure IOB-tagging have been achieved. Bashyam and Taira (2007) report lower results when training a special domain chunker for anatomical phrases.
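As a worked illustration of the two definitions, the following sketch compares a set of gold-standard segments with a chunker's output. Representing a segment as a (start, end, label) triple is an assumption made here, and the two segment sets are invented.

```python
def precision_recall(gold, predicted):
    """Compute precision and recall over sets of labelled segments."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)     # found by the chunker and present in the gold data
    fp = len(predicted - gold)     # found by the chunker only
    fn = len(gold - predicted)     # present in the gold data only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


gold = {(0, 3, "NP"), (3, 4, "VP"), (5, 7, "NP")}
predicted = {(0, 3, "NP"), (3, 5, "VP"), (5, 7, "NP"), (7, 8, "NP")}
precision, recall = precision_recall(gold, predicted)
```

Two of the four predicted segments match the gold data exactly, so precision is 2/4; two of the three gold segments are found, so recall is 2/3. A segment with a wrong boundary, such as the verb chunk here, counts as a false positive and a false negative at the same time.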

3.2. How far can statistical chunking reach?

At the Conference on Natural Language Learning in 2004, the CoNLL-2004 Shared Task of Semantic Role Labeling (SRL) was introduced. SRL can be understood as the minimum requirement for automated semantic analysis of free input text and addresses questions such as the following (cf. Carreras and Marquez 2004):

- Who is the agent addressed by the verb of a sentence?
- Who or what is the patient or instrument?
- What adjuncts specify location, manner or cause belonging to the verb?

The particular challenge was to restrict the machinery involved in solving the task to levels below chunk parsing (i.e. words and POS-tags) and more basic chunk parsing applications: pure segmentation, i.e. IOB-tagging without further labelling of segments, or named entity recognition, i.e. identification of noun chunks representing names of persons, organizations etc. together with a label indicating the type of the entity: person, organization, location and other. The results achieved for language-dependent named entity recognition are:

           Precision    Recall
German     below 84%    below 65%
English    below 90%    below 90%

As a consequence of the constraints mentioned, the results of the CoNLL-2004 competition can be taken as a realistic benchmark for the applicability of chunk parsing, in the sense of a realistic use of automatic language analysis for whatever specific application. So, the overall precision of participants in SRL hardly exceeds 75 %; for correct identification of the agent of a sentence, precision reaches 94 %.

4. Conclusion

The authors of the CoNLL-2004 Shared Task (Carreras and Marquez 2004) conclude: ... state-of-the-art systems working with full syntax still perform substantially better, although far from a desired behavior for real-task applications. Two questions remain open: which syntactic structures are needed as input for the task, and what other sources of information are required to obtain a real-world, accurate performance.

Appendix: Some publicly available chunk parsers

There are several chunk parsers which can be downloaded for free from the World Wide Web. We present a small selection of those we tried with test data. On the one hand, there are natural language processing toolkits and platforms such as NLTK (Natural Language ToolKit),2 or GATE (General Architecture for Text Engineering)3 which contain part-of-speech taggers and chunk parsers. In particular, NLTK offers building blocks for ambitious readers who want to develop chunkers or amend existing ones on their own (in Python). SCP is a Simple rule-based Chunk Parser by Philip Brooks (2003), which is part of ProNTo, a collection of Prolog Natural language Tools. A POS tagger and a chunker for English with special features for parsing a "huge collection of documents"4 have been developed by Tsuruoka and Tsujii (2005). A state-of-the-art pair of a tagger and a chunker, along with parameter files trained for several languages, has been developed at the University of Stuttgart.5 The "Stuttgart-Tübingen Tagset" STTS has become quite popular in recent years for the analysis of German and other languages.

Finally, chart parsers running in bottom-up mode and equipped with an appropriate chunk grammar can be used for chunk parsing as well. This technique has been used in our dialogue system CONALD (Ludwig, Reiss and Görz 2006).

Notes

1 The authors are indebted to Martin Hacker for critical remarks on an earlier draft of this paper.
2 http://nltk.sourceforge.net/index.php/Main_Page, accessed 31-10-2008.
3 http://gate.ac.uk/, accessed 31-10-2008.
4 available for download from http://www-tsujii.is.s.u-tokyo.ac.jp/tsuruoka/chunkparser/, accessed 31-10-2008.
5 available for download from http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/, accessed 31-10-2008.

References

Abney, Steven. 1991. Parsing by chunks. In Principle-Based Parsing, Robert Berwick, Steven Abney and Carol Tenny (eds.), 257-278. Dordrecht: Kluwer.

Bashyam, Vijayaraghavan and Ricky K. Taira. 2007. Identifying anatomical phrases in clinical reports by shallow semantic parsing methods. In Proceedings of the 2007 IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), Honolulu, 210-214. Honolulu, Hawaii: IEEE.

Bird, Steven, Ewan Klein and Edward Loper. 2006. Chunk Parsing. (Tutorial Draft) University of Pennsylvania.

Brooks, Philip. 2003. SCP: A Simple Chunk Parser. University of Georgia. ProNTo (Prolog Natural Language Tools), http://www.ai.uga.edu/mc/ProNTo, accessed 31-10-2008.

Carreras, Xavier and Lluis Marquez. 2004. Introduction to the CoNLL-2004 Shared Task: Semantic Role Labeling. CoNLL-2004 Shared Task Web Page: http://www.lsi.upc.edu/srlconll/st04/papers/intro.pdf, accessed 31-10-2008.

Ludwig, Bernd, Peter Reiss and Günther Görz. 2006. CONALD: The configurable plan-based dialogue system. In Proceedings of the 2006 IAR Annual Meeting. German-French Institute for Automation and Robotics, Nancy, November 2006. David Brie, Keith Burnham, Steven X. Ding, Luc Dugard, Sylviane Gentil, Gerard Gissinger, Michel Hassenforder, Ernest Hirsch, Bernard Keith, Thomas Leibfried, Francis Lepage, Dirk Söffker and Heinz Wörn (eds.), A2.15-18. Nancy: IAR.

Ramshaw, Lance A. and Mitchell P. Marcus. 2005. Text chunking using transformation-based learning. In Proceedings of the Third Workshop on Very Large Corpora, David Yarowsky and Kenneth Church (eds.), 82-94. Cambridge, MA: MIT, Association for Computational Linguistics.

Tjong Kim Sang, Erik F. and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. Proceedings of the Fourth Conference on Computational Natural Language Learning, CoNLL-2000, Lisbon, Claire Cardie, Walter Daelemans, Claire Nedellec and Erik Tjong Kim Sang (eds.), 127-132. New Brunswick, NJ: ACL.

Tsuruoka, Yoshimasa and Jun'ichi Tsujii. 2005. Chunk parsing revisited. Proceedings of the 9th International Workshop on Parsing Technology (IWPT 2005), Vancouver, British Columbia: Association for Computational Linguistics, Harry Bunt (ed.), 133-140. New Brunswick, NJ: ACL.

German noun+verb collocations in the sentence context: morphosyntactic properties contributing to idiomaticity

Ulrich Heid

1. Idiomaticity, multiwords, collocations

1.1. Objectives

In linguistic phraseology and in computational linguistics, many different definitions of idiomatic expressions and collocations have been given. The traditional view involves a difference between compositional word combinations and non-compositional or semi-compositional ones, classifying the latter as idiomatic expressions (cf. e.g. Burger 1998). In work on collocations, it has been observed that the degree of opacity (i.e. non-compositionality) differs between the two elements of a collocation (Hausmann 1979, Hausmann 2004) and within the range of word pairs commonly denoted by the term "collocation": Grossmann and Tutin (2003: 8) distinguish regular, transparent and opaque collocations. For them, collocations are semi-compositional and located between full idioms and word combinations only governed by semantic selection criteria; the three subtypes show the boundary between collocations where the meaning of the collocate is opaque for decoding and collocations where only lexical selection is idiosyncratic (i.e. they pose an encoding problem).

In many studies in Natural Language Processing (henceforth: NLP), emphasis has so far mainly been on identifying idiomatic expressions and less on classifying them. Thus the rather general term 'multiword expression' has been used, which denotes a wide variety of phenomena, ranging from multiword function words (e.g. in spite of) through collocations and idioms in the traditional phraseological sense (e.g. eat humble pie, give a talk) to multiword names (e.g. Rio de Janeiro, New York; for a more detailed list, see Heid 2008: 340).

In the following, we will start from the main trends in research about collocations (section 1.2), and we argue that knowledge about collocations used in lexicography and in NLP should be more than just knowledge about the combinability of lexemes. In fact, the idiomatic nature not only of collocations, but also of other idiomatic multiword expressions, is characterized by a considerable number of morphological, syntactic, semantic and pragmatic preferences, which contribute to the peculiarity of these word combinations (section 2); these properties are observable in corpus data. From the point of view of language learning, they should be learned along with the word combination, in much the same way as the corresponding properties of single words (cf. Heid 1998 and Heid and Gouws 2006 for a discussion of the lexicographic implications of this assumption). From the viewpoint of NLP, they should be part of a lexicon.

If we accept this assumption, the task of computational linguistic data extraction from corpora goes far beyond the identification of significant word pairs. We will not only show which additional properties may play a role for German noun+verb combinations (section 2), but also sketch a computational architecture (section 3.3) that allows us to extract data from corpora which illustrate these properties. Some such properties can also be used as criteria for the identification of idiomatic or collocational noun+verb combinations; others just provide the knowledge one needs in order to write a text and to insert the multiword expressions into the surrounding sentence in a morphologically and syntactically correct way.

Our extraction architecture relies on more complex preprocessing than most of the tool setups used in corpus linguistics: it presupposes a syntactic analysis down to the level of grammatical dependencies. We motivate this by comparison with a flat approach, namely the one implemented in the Sketch Engine (Kilgarriff et al. 2004), for English and Czech (section 3.2).

1.2. Collocations in linguistics and NLP

The Firthian notion of collocation (cf. Firth 1957) is mainly oriented towards lexical cooccurrence ("You shall know a word by the company it keeps" (Firth 1957: 11)). British contextualism soon discovered cooccurrence statistics as a device to identify word combinations which are collocational in this sense. John Sinclair places himself in this tradition in Corpus, Concordance, Collocation (Sinclair 1991), emphasizing however the idiomatic nature of the combinations by contrasting the idiom principle and the open choice principle. The range of phenomena covered by his approach as presented in Sinclair (1991: 116-118) includes both lexical collocations and grammatical collocations (in the sense of Benson, Benson and Ilson 1986): for example, pay attention and back to both figure in the lists of relevant data he gives.

The lexicographically and didactically oriented approach advocated, among others, by Hausmann (1979), Hausmann (2004), Mel'cuk et al. (1984, 1988, 1992, 1999), Bahns (1996) is more oriented towards a syntactic description of collocations: Hausmann distinguishes different syntactic types, in terms of the category of the elements of the collocation: noun+verb collocations, noun+adjective, adjective+adverb collocations, etc. Moreover, Hausmann and Mel'cuk both emphasize the binary nature of collocations, distinguishing between the base and the collocate. Hausmann (2004) summarizes earlier work by stating that bases are typically autosemantic (i.e. have the same meaning within collocations as outside), whereas collocates are synsemantic and receive a semantic interpretation only within a given collocation. Even though this distinction is not easy to operationalize, it can serve as a useful metaphor, also for the analysis of longer multiword chunks, where several binary collocations are combined.

Computational linguistics and NLP have followed the contextualist view, in so far as they have concentrated on the identification of collocations within textual corpora, designing different types of tools to assess the collocation status of word pairs. Most simply, a sorting of word pairs by their number of occurrences (observed frequency) has been used on the assumption that collocations are more frequent than non-collocational pairs (cf. Krenn and Evert 2001). Alternatively, association measures are used to sort word pairs by a statistical measure of the 'strength' of their association (cf. Evert 2005); to date over 70 different formulae for measuring the association between words have been proposed.
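The difference between sorting by observed frequency and sorting by an association measure can be illustrated with pointwise mutual information, one widely used measure, picked here only as an example. All counts below are invented; they stand for a hypothetical sample of 1,000 verb+object pairs.

```python
import math


def pmi(pair_count, noun_count, verb_count, total_pairs):
    """Pointwise mutual information: log2 of observed over expected pair frequency."""
    expected = noun_count * verb_count / total_pairs   # expected count under independence
    return math.log2(pair_count / expected)


TOTAL = 1000                                  # hypothetical number of verb+object pairs
noun_counts = {"Frage": 100, "Buch": 50}
verb_counts = {"stellen": 100, "haben": 400}
pair_counts = {
    ("Frage", "stellen"): 80,
    ("Frage", "haben"): 20,
    ("Buch", "haben"): 20,
}

scores = {
    (noun, verb): pmi(count, noun_counts[noun], verb_counts[verb], TOTAL)
    for (noun, verb), count in pair_counts.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these counts, Frage + haben and Buch + haben are equally frequent, yet only the former occurs less often than chance would predict and therefore receives a negative score. A frequency ranking cannot make this distinction.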
An important issue in the context of collocation identification from texts is that of defining the kinds of word pairs to be counted and statistically analyzed: by which procedures can we extract the items to be counted? Simple approaches operate on windows around a keyword, e.g. by looking at items immediately preceding or following that item. Wordsmith Tools (Scott 2008) is a well-known piece of software which embeds this kind of search as its 'collocation retrieval' function (fixed distance windows, left and right of a given keyword). Smadja (1993) combines the statistical sorting of word pair data with a grammatical filter: he only accepts as collocation candidates those statistically relevant combinations which belong to a particular syntactic model, e.g. combinations of an adjective and a noun, or of a verb and a subsequent noun; in English, such a sequence mostly implies that the noun is the direct object of the verb.
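A fixed-distance window search of the kind described first can be sketched as follows. The function name, the default window size and the sample sentence are assumptions made for the illustration; the sketch does not reproduce the behaviour of any particular tool.

```python
from collections import Counter


def window_cooccurrences(tokens, keyword, left=2, right=2):
    """Count the tokens occurring within a fixed window around each keyword hit."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        start = max(0, i - left)                       # do not run past the text start
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts


tokens = "she asked a question and he asked another question".split()
counts = window_cooccurrences(tokens, "question")
```

The window picks up "asked" twice, but also "and" and "he", which stand in no grammatical relation to the keyword. This is the noise that a grammatical filter or a dependency-based extraction is meant to remove.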

For German and other languages with a more variable word order than English, the extraction of pairs of grammatically related items is more demanding (see below, section 3.2). The guiding principle for collocation extraction for such languages is to extract word pairs which all homogeneously belong to one syntactic type, in Hausmann's (1979) sense, e.g. verbs and their object nouns. Proposals for this kind of extraction have been made, among others, by Heid (1998), Krenn (2000), Ritz and Heid (2006). This syntactic homogeneity has two advantages: on the one hand, it provides a classification of the word pairs extracted in terms of their grammatical categories, and on the other hand, it leads to samples of word pairs from which the significance of the association can be computed with respect to a meaningful subset of the corpus (e.g. all verb+object pairs).

More recent linguistic work on multiword expressions has questioned some of the restrictions inherent in the lexico-didactic approach. At the same time, the late John Sinclair suggested that quantitative and structural properties seem to jointly characterize most such expressions, cf. Tognini-Bonelli, this volume. In this sense, a combination of the two main lines of tradition can be seen as an appropriate basis for computational linguistic data extraction work.

In terms of a modification of the lexico-didactic approach, Schafroth (2003: 404-409) has noted that there are many idiomatized multiword expressions which cannot be readily accounted for in terms of the strictly binary structure postulated by Hausmann (1979). Siepmann (2005) has given more such examples. Some of them can be explained by means of recursive combinations of binary collocations: e.g. scharfe Kritik üben ('criticize fiercely', Schafroth 2003: 408, 409) can be seen as a combination of Kritik üben ('criticize', lit. 'carry out criticism') and the typical adjective+noun collocation for Kritik, namely scharfe Kritik (cf. Heid 1994: 231; Hausmann 2004; Heid 2005). But other cases are not so easy to explain and the question of the 'size' of collocations is still under debate.

The notion of collocation has, however, not only been widened with respect to the size of the chunks to be analyzed; researchers have also found significant word combinations which follow other syntactic patterns than those identified e.g. by Hausmann (1989). Examples of this are combinations of discourse particles in Dutch (e.g. maar even, 'a bit'), where the distinction between an autosemantic base and a synsemantic collocate is hard to draw.

In conclusion, it seems that the different strands of collocation research, in the tradition of both the early work of John Sinclair and of the lexicographic and didactic approach, can jointly contribute to a better understanding of the properties of idiomatized multiword expressions. This is especially the case for computational linguistic work on such multiwords, i.e. on their extraction from large corpora and on their detailed linguistic description, for formal grammars, text understanding or high quality information extraction. For the purpose of this article (and for the work underlying it), we take Sinclair's idiom principle as a theoretical starting point and we are interested in identifying linguistic properties of multiword items which contribute to their idiomaticity.

2. Linguistic properties of collocations: German verb + object pairs as a case in point

Noun+verb collocations and verbal idioms have been discussed in some detail in the literature (cf. e.g. Fellbaum, Kramer and Neumann 2006). As they have a number of interesting linguistic properties, we nevertheless use them again to illustrate the importance of a detailed and comprehensive description of those context-related properties which govern the insertion of the collocations into the sentence. At the same time, these properties can be seen as an expectation horizon for data extraction from corpora.

2.1. Determination and modification of the noun

The literature on German noun+verb collocations contains discussions of aspects of morphosyntactic fixedness. Helbig (1979) already mentions a number of morphosyntactic properties of German 'Funktionsverbgefüge' (support verb constructions, svc), which he uses to distinguish what he calls 'lexicalized' as opposed to 'non-lexicalized' support verb constructions. These properties include the morphosyntactic correlates of referentially fully available nouns (and thus noun phrases) in the case of non-lexicalized support verb constructions, and, vice versa, morphosyntactic restrictions in lexicalized ones. Examples of such properties are the use of articles or the possibility to pluralize the noun of a support verb construction, to modify it with an adjective, a relative clause, etc., or to make reference to it with a personal or interrogative pronoun (cf. Heid 1994: 234 for a summary).

Examples of these properties are given in (1) to (4), where we use the collocation Frage + stellen ('ask + question') as an example of a combination where the noun is referentially available (a non-lexicalized svc, in Helbig's terms), and zur Sprache bringen ('mention', 'bring to the fore') as an example of a lexicalized support verb construction. In line with e.g. Burger's (1998) discussion of the modifiability and fixedness of idioms, Helbig's distinction can also be recast in terms of more vs. less idiomatization.

(1) Fragen stellen ('[to] ask questions')
    *zu Sprachen bringen
(2) eine Frage stellen, die Frage stellen, diese Frage stellen ('[to] ask a/the/this question')
    *zu einer Sprache bringen
    *zu der/dieser Sprache bringen
(3) eine relevante Frage stellen ('[to] ask a relevant question')
    *zur relevanten Sprache bringen
(4) eine Frage stellen, die enorm wichtig ist ('[to] ask a question which is enormously important')
    *zur Sprache bringen, die enorm wichtig ist

The examples in (1) to (4) show the morphosyntactically fixed nature of zur Sprache bringen, which does not accept any of the operations which are perfectly possible with Frage + stellen. A task for data extraction from corpora is thus to test each collocation candidate for these properties (number, determination, modifiability of the noun) and to note the respective number of occurrences of each option in the corpus.
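The bookkeeping this task requires can be sketched as follows. The sketch assumes that a preceding analysis step has already reduced each corpus occurrence to a small record; the record layout, the property values and the five sample records are all hypothetical.

```python
from collections import Counter, defaultdict

PROPERTIES = ("number", "determiner", "modifier")


def profile(occurrences):
    """Count, per collocation, how often each value of each property is observed."""
    profiles = defaultdict(lambda: defaultdict(Counter))
    for occurrence in occurrences:
        key = occurrence["collocation"]
        for prop in PROPERTIES:
            profiles[key][prop][occurrence[prop]] += 1
    return profiles


# Hypothetical, already analysed corpus occurrences.
occurrences = [
    {"collocation": "Frage + stellen", "number": "sg", "determiner": "indef", "modifier": "none"},
    {"collocation": "Frage + stellen", "number": "pl", "determiner": "none", "modifier": "none"},
    {"collocation": "Frage + stellen", "number": "sg", "determiner": "def", "modifier": "adjective"},
    {"collocation": "zur Sprache bringen", "number": "sg", "determiner": "fused", "modifier": "none"},
    {"collocation": "zur Sprache bringen", "number": "sg", "determiner": "fused", "modifier": "none"},
]
profiles = profile(occurrences)
```

A candidate whose counts are spread over several values of a property, as with Frage + stellen here, behaves like a referentially available noun. A candidate with a single observed value per property is a possible case of morphosyntactic fixedness, provided it is frequent enough for the absence of variation to be meaningful.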

2.2. Syntactic subcategorization

In Burger's work on idioms, examples are given of verbal idiomatic expressions which have their own syntactic subcategorization behaviour, different from that of any of their components: for example, (einen) Bären aufbinden ('[to] pull someone's leg') takes a subject and an indirect object, and it can have a dass-clause which expresses the contents of the statement made (cf. also Keil 1997).

(5) Der Kollege hat dir einen Bären aufgebunden.
    Er hat mir den Bären aufgebunden, dass ...

Like verbal idioms, verb+object collocations can also have their own valency, but this property went unrecognized for a long time. Krenn and Erbach (1994) state that the subcategorization of the noun is taken over in the support verb construction. This is true in many cases, but not all: Lapshinova and Heid (2007) show examples of collocations which have their own subcategorization properties (cf. (6), below).

(6) zum Ausdruck bringen ('[to] express', lit. 'bring to the expression'), zum Ausdruck kommen ('[to] be expressed'), zur Sprache bringen ('[to] mention'), in Abrede stellen ('[to] deny'), zu Protokoll geben ('[to] state').

The collocations in (6) all subcategorize for a sentential complement, while the nouns Ausdruck, Protokoll, Sprache and Abrede (except in another reading) as well as the verbs involved do not allow a complement clause. Thus, even if the number of cases is relatively small, it seems that some collocational multiwords require their own subcategorization description.

A related fact is nominal 'complementation' of the nominal element of noun+verb collocations, i.e. the presence of genitive attributes. Many collocations contain relational nouns or other nouns which have a preference for a genitive attribute. An example is the collocation im Mittelpunkt (von X) stehen ('[to] be at the centre [of X]'). Other examples, extracted from large German newspaper corpora, are listed in (7), below.1

(7) (a) in + Mittelpunkt + GEN + stellen/rücken ('[to] put into the centre of ...', Frankfurter Rundschau 40354632)
        ... die die Sozialpolitik mehr in den Mittelpunkt des öffentlichen Interesses stellen will ('which wants to put social politics more into the centre of public interest');
    (b) auf + Boden + GEN + stehen ('[to] be on the solid ground of ...', Frankfurter Rundschau 440129)
        [...] fragte er seinen schlaftrunkenen Kollegen, der mit einem Mal wieder auf dem Boden der Realität stand ('[...] he asked his sleepy colleague who, all of a sudden, was back to reality');
    (c) sich auf + Niveau + GEN + bewegen ('[to] be at the level of ...', Stuttgarter Zeitung 4245965S)
        [...] während sich der Umfang des Auslandsgeschäfts auf dem Niveau des Vorjahres bewegte ('whereas the amount of foreign trade was at the level of the previous year').
Some of the noun+verb collocations where the noun tends to have a genitive (or a von-phrase) seem to have, in addition to this syntactic specificity, also particular lexical preferences: for example, we find auf dem Boden der Realität stehen, auf dem Boden der Verfassung stehen ('[to] be rooted in the constitution'), auf dem Boden (seiner) Überzeugungen stehen ('[to] be attached to one's convictions') more frequently than other combinations of auf + Boden + stehen with genitives. Moreover, these combinations often come with the adverb fest, such that the whole expression is similar to 'be firmly rooted in ...'.

The analysis of such combinations of collocations requires very large corpora and ideally corpora of different genres: our data only come from newspapers and administrative texts. A more detailed analysis should show larger patterns of relatively fixed expressions, likely specific to text types, genres etc. At the same time, it would show how the syntactic property of the nouns involved (to take a genitive attribute) interacts with lexical selection, and how collocational selection properties of different lexemes interact to build larger idiomatic chunks of considerable usage frequency.

On the other hand, there are collocations which hardly accept the insertion of a genitive after the noun, and if it is inserted, the construction seems to be a result of linguistic creativity rather than of typical usage. Examples are given in (8): the collocations have no genitives in over 97 % of all observed cases (the absolute frequency of the collocation in 240 M words is given in parentheses), and our examples are the only ones with a genitive:

(8) (a) in Flammen aufgehen ('[to] go up in flames', Die Zeit 39261518, 433):
        LaButes Monolog ist ein Selbstrechtfertigungssystem von so trockener Vernunft, dass es jederzeit in den Flammen des Wahnsinns aufgehen könnte ('LaBute's monologue is a self-justification system of such arid reason, that it could go up, at any moment, in the flames of madness');
    (b) in die Irre führen ('[to] mislead', Frankfurter Allgemeine Zeitung 62122S15, 344):
        Ihr 'Requiem' versucht sich in unmittelbarer Emotionalität, von der die Kunst ja immer wieder träumt, und die doch so oft in die Irre der Banalität führt ('Her 'Requiem' makes an attempt at immediate emotionality, which art tends to dream of every now and then, and which nevertheless misleads quite often towards banality').
Similar strong preferences for the presence or absence of modifying elements in noun+verb collocations are also found with respect to adjectival modification. This phenomenon has however mostly been analyzed in line with the above-mentioned morphosyntactic properties which depend on the referential availability of the noun. Examples which require modifying adjectives are listed in (9) and a few combinations which do not accept them are given in (10).

(9)
eine gute/brillante/schlechte/... Entwicklung nehmen ('[to] progress well/brilliantly/not to progress well')
eine gute/schlechte/traurige/... Figur abgeben ('[to] cut a good/bad/poor figure')
im frühen/letzten/entscheidenden/... Stadium sein ('[to] be in an early/the last/the decisive stage')
aus guten/kleinen/geordneten/... Verhältnissen stammen ('[to] be of .../humble/... origin')

(10)
Gebrauch machen ('[to] make use')
Platz nehmen ('[to] take a seat')
Stellung beziehen ('[to] position oneself')
Schule machen ('[to] find adherents')

The examples in (10) can only be modified with adverbs (eindeutig Stellung beziehen) and are, at least in our corpora, not used with adjectives. With many other collocations, both options are available (cf. Storrer 2006 on examples such as brieflich in Kontakt stehen vs. in brieflichem Kontakt stehen, both meaning 'be in (postal) correspondence with sb.'). The examples above seem to suggest that it is necessary to keep track of the syntactic valency behaviour of nouns in noun+verb collocations in more detail than is often done in the literature and in lexicography. Moreover, the valency behaviour seems to be one of the factors contributing to the development of larger collocational clusters.

2.3. Negation and coordination in noun+verb combinations

Preferences for negation, as well as for coordinated nouns, are a strong indicator of idiomatization, and noun+verb combinations with strong preferences of this kind tend to be full idioms. But collocations where the nominal part can be seen as 'auto-semantic' in Hausmann's sense also show such preferences. Many collocations show up both in a positive and in a negated form. For German data, the difference between verb negation (nicht) and NP negation (kein) needs to be accounted for in addition. For a lexicographic description, it is necessary to indicate preferences in this respect. Similarly, it makes sense to indicate which collocations show a marked tendency towards negation; for example, the proportion of negated vs. non-negated instances in a big corpus could be indicated. Typical examples of combinations which are idiomatic and most often found in the negated form are given in (11) below, in order of decreasing frequency in our corpus.

(11)
keinen Hehl machen aus ('[to] make no secret of ...')
[einer Sache] keinen Abbruch tun ('not to spoil sth.')
keine Grenze(n) kennen ('[to] know no bounds')
kein Ende nehmen (cf. 'there is no end to ...')
kein Wort verlieren ('not to waste any words on ...')

A similar sign of idiomatization is the presence of coordinated noun phrases in the multiwords; the items in (12) are typical examples of this phenomenon; they are not perceived as correct if one of the conjuncts is missing. On the other hand, coordinated NPs in 'normal' collocations (e.g. pay attention, ask + question, etc.) are rather rare or a matter of creative use.

(12)
in Fleisch und Blut übergehen ('[to] become second nature [to s.o.]')
in Lohn und Brot stehen/bringen ('[to] be employed by [s.o.]')
hinter Schloss und Riegel sitzen/bringen/... ('[to] be kept under lock and key')
in Sack und Asche gehen ('[to] repent in sackcloth and ashes')

2.4. Preferences with respect to word order

The properties discussed so far mainly have to do with a collocation's form (cf. 2.1, 2.3) or with its syntactic embedding in a sentence (2.2). Many collocations also seem to have preferences with respect to word order: this property does not affect their form directly, but it is part of the linguistic knowledge governing their correct use in a sentence. It is likely that these preferences have to do with the status of the nominal element of the collocations and with the nature of some of the sequential positions in German sentences.

German has three different word order patterns, defined with respect to the model of topological fields. This model distinguishes two areas where verbal elements (finite or non-finite verbs, verb complexes, or verb particles) or conjunctions can be placed: 'linke Satzklammer' and 'rechte Satzklammer', LK and RK, respectively, in table 1, below. Furthermore, it identifies three areas for other types of constituents, 'Vorfeld' (VF, in table 1), 'Mittelfeld' (MF) and 'Nachfeld' (right of RK, left out of table 1 for the sake of simplicity). These areas can be filled in different ways, depending on the place of the finite verb. It can be in RK, as happens in subclauses; this model is the 'verb-final' model (VL, for 'verb-letzt', in table 1); it accounts for approx. 25 % of all occurrences of finite verbs in our corpora. Alternatively, the finite verb is in LK, with a full constituent in Vorfeld ('verb-second' model, V2 in table 1), or with an expletive or no constituent at all in Vorfeld ('verb-first' model, used in interrogatives and conditionals, V1). V2 sentences are by far the most frequent ones (approx. 65 %), the verb-first model accounting at most for approx. 10 % of our data. These facts are summarized using the example sentence in (13) and its word order variants, in table 1.

(13) Die Frage (kann/wird) (in Darmstadt) (dann/wohl) gestellt (werden).
This question (can/will) (in Darmstadt) (then/maybe) asked (be)
'This question will/could (then/maybe) be asked (in Darmstadt)'.

Table 1.

German verb placement models

Model  VF         LK    MF              RK
V1     (Es)       wird  die Frage dann  gestellt
V1                Kann  die Frage dann  gestellt werden?
V2     Die Frage  wird  in Darmstadt    gestellt
V2     Die Frage  kann                  gestellt werden
VL                weil  die Frage dann  gestellt wird
VL                dass  die Frage dann  gestellt werden kann
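The three placement models can be told apart from the way the fields are filled. The following is a minimal sketch, assuming that a clause has already been segmented into its topological fields; the `Clause` class, the `placement_model` function and the one-word expletive list are illustrative assumptions, not part of the tools used for this paper.

```python
from dataclasses import dataclass

# Hypothetical container for a clause that is already segmented into fields.
@dataclass
class Clause:
    vf: str                  # Vorfeld
    lk: str                  # linke Satzklammer
    mf: str                  # Mittelfeld
    rk: str                  # rechte Satzklammer
    lk_is_finite_verb: bool  # False when LK holds a conjunction (weil, dass)

# Assumption: only 'es' is treated as an expletive in this sketch.
EXPLETIVES = {"es"}

def placement_model(clause: Clause) -> str:
    """Return 'VL', 'V1' or 'V2' for a field-segmented clause."""
    if not clause.lk_is_finite_verb:
        # Conjunction in LK: the finite verb sits in RK (verb-final).
        return "VL"
    vf = clause.vf.strip().strip("()").lower()
    if vf == "" or vf in EXPLETIVES:
        # Empty Vorfeld or expletive only: verb-first.
        return "V1"
    # Full constituent in Vorfeld: verb-second.
    return "V2"

if __name__ == "__main__":
    examples = [
        Clause("(Es)", "wird", "die Frage dann", "gestellt", True),
        Clause("", "Kann", "die Frage dann", "gestellt werden?", True),
        Clause("Die Frage", "wird", "in Darmstadt", "gestellt", True),
        Clause("", "weil", "die Frage dann", "gestellt wird", False),
    ]
    for c in examples:
        print(placement_model(c), "|", c.vf, c.lk, c.mf, c.rk)
```

The decision only looks at LK and VF, which mirrors the definition of the models in terms of the position of the finite verb and the filling of Vorfeld.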

The table shows that a noun+verb collocation like Frage + stellen ('ask + question') can appear under all three word order models. At the time of writing this article, we are in the process of investigating in more detail which collocation candidates readily appear in all three models and which ones do not. This was prompted by work on word order constraints for Natural Language Generation (cf. Cahill, Weller, Rohrer and Heid 2009) and by the following observation: some collocations which are highly idiomatized and morphosyntactically relatively fixed, such as Gebrauch machen ('make use'), tend not to have their nominal element in Vorfeld (cf. Heid and Weller 2008). This is illustrated with the examples in (14), which are contrasted with instances of Frage + stellen.

(14)
(a) [...,] weil der Chef eine relevante Frage stellt. ('because the boss asks a relevant question', VL)
(b) [...,] weil er davon Gebrauch macht. ('because he makes use of it', VL)
(c) Eine relevante Frage stellt der Chef (V2).
(d) *Gebrauch macht er davon (V2).

In fact, what is marked with an asterisk in (14d) is maybe not ungrammatical, but at least dispreferred. There are examples of NP or PP fronting (i.e. the NP or PP in Vorfeld), under contrastive stress, as in the (invented) example (15):

(15) Das Medikament ist seither im Haus. Gebrauch hat davon noch niemand gemacht. ('We have had this medicine at home since then. Nobody has so far made use of it')

It should be noted that the example in (15) has an auxiliary in Vorfeld position (instead of Gebrauch macht davon niemand, in simple present), which seems, according to our preliminary data, a factor which makes this particular word order a bit more likely. Another variant, which is also more likely under contrastive stress, is partial VP fronting, i.e. a case like in (16), where the NP or PP and a non-finite verb form of the collocation find themselves left of the finite auxiliary.

(16)

Zur Kenntnis genommen wurde das Häuflein Bamberger Demonstranten kaum. ('The small group of demonstrators from Bamberg was barely noticed.')

A comparison between idiomatized verb+pp collocations and non-idiomatic combinations of the same syntactic form (verb + prepositional phrase) seems to indicate that the collocations hardly accept the fronting of their prepositional phrase, whereas this is normal for trivial combinations. In (17) we reproduce data from a preliminary test (carried out by Marion Weller, IMS Stuttgart, unpublished), where idiomatized collocations are listed, along with the total frequency with which they were found in the corpus ('tot' in [17]), as well as the absolute observed frequencies of pp fronting with simple tense forms ('pp-frt'), pp fronting with complex tense forms (i.e. the auxiliary being in VF, 'pp-aux') and partial VP fronting ('vp-frt').

(17)
collocation            tot   pp-frt  pp-aux  vp-frt
zu Verfügung stehen    9825  41      0       1
zu Verfügung stellen   7209  0       2       0
in Mittelpunkt stehen  5172  1613    171     0
in Anspruch nehmen     4984  2       2       2
um Leben kommen        4691  3       1       0
in Kraft treten        3886  9       0       17
in Frage stellen       3884  1       0       9
in Vordergrund stehen  3173  511     61      0
zu Ende gehen          3087  5       0       0
in Frage kommen        3080  106     3       3
zu Kenntnis nehmen     2835  0       0       3

In (18), we reproduce a similar table, with top frequency combinations which do allow pp fronting. This table contains almost exclusively combinations of verbs and prepositional phrases which either do not semantically belong together (as the PP is an adjunct, cf. in Monat + steigen, 'rise + in [the] month [of]'), or which are not to be considered as idiomatized collocations (cf. mit [dem] Bau beginnen, '[to] start the building work').

(18)
collocation                tot   pp-frt  pp-aux  vp-frt
an Stelle treten           1566  341     248     0
in Monat steigen           1050  150     190     0
mit Bau beginnen           864   2       130     2
auf Seite haben            568   0       59      0
in Quartal steigen         545   69      95      0
in Halbjahr steigen        512   73      88      0
nach Ansicht haben         470   0       138     0
in Zusammenhang hinweisen  465   109     78      0
an Ende haben              455   0       124     0
in Zeit haben              438   0       81      0
in Preis enthalten         167   0       65      48

In (17), im Mittelpunkt stehen and im Vordergrund stehen stand out as having a considerable number of pp fronting cases. These two collocations have a very strong preference for a genitive attribute (see example [7], above), which makes the fronted pp longer and 'more contentful'. Table (18) contains im Preis enthalten (sein) ('[to] be included in the price'), which has particularly and unexpectedly high counts of partial VP fronting.
This is due to text-type specificities: the expression is typically used when offers from hotels or travel agencies are described, and the respective sentences almost invariably come as im Preis (von X Euro) enthalten sind Übernachtung, Frühstück, Abendessen und Hoteltransfer ('the price [of X Euro] includes the hotel room, breakfast, dinner and bus transfer to the hotel'). A preliminary conclusion which can be drawn from the data in (17) and (18) is that idiomatic verb+object and verb+pp combinations tend to show restrictions with respect to the position of their nominal or prepositional phrases in Vorfeld, much more so than non-idiomatic combinations of the same structure.2 The Vorfeld test seems thus to some extent to be usable, at least for items of a medium or high frequency, to distinguish idiomatic verb+object and verb+pp combinations from non-idiomatic ones.
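The contrast between (17) and (18) can be made explicit by turning the absolute counts into proportions. The sketch below is only an illustration of this arithmetic: the counts are copied from (17) and (18), while the function names and the decision to count 'pp-frt' and 'pp-aux' together as pp fronting (keeping 'vp-frt' apart) are assumptions made here, not a procedure taken from the test itself.

```python
# Counts per collocation: (tot, pp-frt, pp-aux, vp-frt), copied from (17).
IDIOMATIZED = {
    "zu Verfügung stehen": (9825, 41, 0, 1),
    "zu Verfügung stellen": (7209, 0, 2, 0),
    "in Mittelpunkt stehen": (5172, 1613, 171, 0),
    "in Anspruch nehmen": (4984, 2, 2, 2),
    "um Leben kommen": (4691, 3, 1, 0),
    "in Kraft treten": (3886, 9, 0, 17),
    "in Frage stellen": (3884, 1, 0, 9),
    "in Vordergrund stehen": (3173, 511, 61, 0),
    "zu Ende gehen": (3087, 5, 0, 0),
    "in Frage kommen": (3080, 106, 3, 3),
    "zu Kenntnis nehmen": (2835, 0, 0, 3),
}

# Same layout, copied from (18).
NON_IDIOMATIC = {
    "an Stelle treten": (1566, 341, 248, 0),
    "in Monat steigen": (1050, 150, 190, 0),
    "mit Bau beginnen": (864, 2, 130, 2),
    "auf Seite haben": (568, 0, 59, 0),
    "in Quartal steigen": (545, 69, 95, 0),
    "in Halbjahr steigen": (512, 73, 88, 0),
    "nach Ansicht haben": (470, 0, 138, 0),
    "in Zusammenhang hinweisen": (465, 109, 78, 0),
    "an Ende haben": (455, 0, 124, 0),
    "in Zeit haben": (438, 0, 81, 0),
    "in Preis enthalten": (167, 0, 65, 48),
}

def pp_fronting_rate(counts):
    """Share of occurrences with the PP in Vorfeld (simple or complex tense)."""
    tot, pp_frt, pp_aux, _vp_frt = counts
    return (pp_frt + pp_aux) / tot

def pooled_pp_fronting_rate(table):
    """PP fronting rate over all items of a table taken together."""
    fronted = sum(c[1] + c[2] for c in table.values())
    total = sum(c[0] for c in table.values())
    return fronted / total

if __name__ == "__main__":
    for name, table in (("(17)", IDIOMATIZED), ("(18)", NON_IDIOMATIC)):
        print(name, f"pooled pp fronting rate: {pooled_pp_fronting_rate(table):.3f}")
        for colloc, counts in table.items():
            print(f"  {colloc:28s} {pp_fronting_rate(counts):.3f}")
```

Per-item rates make the two outliers of (17), im Mittelpunkt stehen and im Vordergrund stehen, immediately visible, which a pooled figure alone would hide.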

2.5. Variation by text types, regions and registers

In addition to the formal properties discussed so far, several layers of variation need to be taken into account. Obviously, there is variation related to text types and domains. An analysis of legal journals from the field of intellectual property law and trademark legislation shows that this domain has not only its own specialized phraseology (cf. Heid et al. 2008), but also preferences with respect to the use of certain collocations from general language.

Regional variation is also very clearly observable; in a preliminary study, we analyzed German newspaper texts from Germany, Switzerland, Austria and South Tyrol, using part of the corpus material gathered by the Institut für Deutsche Sprache (Mannheim) and the universities of Tübingen and Stuttgart, in the framework of the DeReKo project (Deutsches Referenzkorpus), as well as texts from the South Tyrolean newspaper Dolomiten (made available to us under a specific contract). Obviously, a corpus containing different Swiss newspapers (and different amounts of material from the individual newspapers) is nothing more than an opportunistic 'archive'-like collection, and only partial generalizations about regional differences in collocational behaviour can be derived from that material; but it nevertheless provides data which indicate a few tendencies.

A simple comparison of relative frequencies of collocations in the news texts from different regions suggests that there are considerable differences between, say, Swiss and German texts. These differences do not (only) have to do with regional objects, institutions or procedures (cf. e.g. Kanton, kantonal, ein Nein in die Urne legen ['{to} vote no']). Nor are they exclusively due to specific regional lexemes which are synonyms of items used in Germany, such as CH Entscheid for DE Entscheidung ('decision'), or South Tyrolean Erhebungen aufnehmen for DE Ermittlungen aufnehmen ('[to] take up investigations').

There are also regional differences in collocate selection: Swiss texts have for example a much higher proportion of the collocation tiefer Preis ('low price'), which is not completely absent from texts from Germany, but very rare (2 occurrences in 100 million words from the tageszeitung); German texts from Germany tend to have niedriger Preis instead, which itself exists in the Swiss data, but is less frequent than tiefer Preis.

Moreover, first analyses of the above-mentioned data suggest that there may also be some regional differences with respect to the idiomatic use of determiners in the noun phrase of noun+verb collocations. Concerning the collocation Geschäft + machen ('[to] make [a good] bargain'), for example, our data provide evidence for a preference, in texts from Germany, for an indefinite plural: Geschäfte machen, gute Geschäfte machen; this form is the most frequent one also in Switzerland, but, contrary to the data from Germany, Swiss news texts also have a high proportion of indefinite singulars: ein Geschäft machen, ein gutes Geschäft machen. More analyses are certainly necessary to get a real picture of such differences. But it would not be all that surprising to see the same types of variation in collocation use as are found in the use of single word items. And collocations not only show preferences at the morphosyntactic and syntactic level, but these may also be subject to regional and text type-related variation.

2.6. Intermediate summary

In this second part of the paper, a few linguistic properties of noun+verb collocations have been briefly discussed: determination and modification, subcategorization, preferences for negation or coordination, compatibility with the three German word order models and regional variation. All of these are preferential in character, not categorical. All of them play a role in the insertion of collocations into sentences and texts, but not all are relevant as indicators of the idiomatic status of the combinations, even though the observable preferences are likely due to idiomatization.

In fact, restrictions with respect to determination, modification and preferences for negation or coordination have been used as indicators of non-compositionality in NLP research (cf. Fazly and Stevenson 2006). We have the impression that the compatibility with NP or PP fronting is also a good criterion to separate collocations from non-idiomatic, fully compositional combinations. Moreover, as the description of collocations should cover both lexical selection and morphosyntactic preferences, both would likely need to be analyzed for variation.

The idea underlying e.g. Fazly and Stevenson's (2006) work is that idiomaticity and fixedness are correlated: the more fixed the morphosyntax of a multiword, the more likely it is idiomatic. Fazly and Stevenson (2006) analyze fixedness phenomena and build a sort of "fixedness vector", by adding points for all pieces of evidence that speak for fixedness. Beyond a certain threshold, they classify the respective multiword as idiomatic. This procedure does not keep track of the individual properties of the multiwords, but it leads to a relatively adequate classification. With extraction tools that keep track of the type of preference, both functions could be achieved: a broad classification into [± idiomatic] and a detailed description.

This inventory of phenomena to be analyzed leads to rather complex expectations for (semi-)automatic data extraction from corpora. Beyond lexical selection, there is also a need to extract evidence for the properties discussed above. And in the ideal case, careful corpus design and a detailed classification of the corpus data used should allow for the variational analysis suggested here.

3. Procedures for extracting collocation candidates from texts

In this section, a brief overview of the main approaches to the extraction of collocation candidates from text corpora will be given. More details can be found in Evert (2005) and Evert (2009). In this article, we will mainly analyze the existing approaches and contrast them with the expectation horizon presented in section 2. This will be done by comparing English and German. At the end of this section, we discuss the architecture used in our own work.

3.1. Properties of collocations used for their extraction

There are three groups of properties of collocations which are used to extract collocational data from text corpora, at least for English. These are (i) cooccurrence frequency and significance, (ii) linear order and adjacency and (iii) morphosyntactic fixedness. All three properties as well as their use for data extraction have been briefly mentioned above (in section 1.2).

Extraction procedures based exclusively on statistics (frequency, association measures) run the risk of picking up at least two types of noise. The first consists of word combinations that are typical of a certain text type, but phraseologically uninteresting. An example is die Polizei teilt mit(, dass) ('the police informs [that...]'): this word combination is particularly frequent in German newspaper corpora (because newspapers report about many events where the police have to intervene, and to inform the public thereafter), but it is not particularly idiomatic, since any noun denoting a human being or an institution composed of human beings can be a subject of mitteilen. Next to these semantically and lexically trivial word combinations, pairs of words could also be extracted which are not even in a grammatical relationship. Thus, association measures have been supplemented with filters based on a modelling of grammatical relations. Here, language specific problems arise (see below, section 3.2): the effort that needs to be spent in order to extract correct relational pairs from large corpora with a satisfactory recall differs from one language to the other. As mentioned above, both sequences of statistical and symbolic procedures have been used (cf. section 1.2: Smadja [1993] vs. e.g. Krenn [2000]).

The third type of properties is morphosyntactic fixedness, i.e. the properties discussed above in the second section. Fazly and Stevenson (2006) take morphosyntactic preferences or formal fixedness as an indicator of idiomaticity. Ritz (2006), Ritz and Heid (2006) used a similar procedure on a homogeneous set of verb+pp combinations extracted from German prenominal participle constructions (e.g. der eingereichte Antrag - Antrag + einreichen, 'the submitted proposal, submit + proposal'), and they evaluated which percentage of the extracted noun+verb combination types could be classified as idiomatized collocations on the basis of being morphosyntactically restricted or fixed; in fact, only 35 % of the combinations were accepted as being idiomatic or collocational in manual evaluation (2 evaluators). Obviously, most practical approaches to collocation candidate extraction use two or all three of the above types as extraction criteria, as none of them, taken in isolation, performs well enough.

3.2. Extracting grammatical relationships

As mentioned before, there are several types of noun+verb collocations: verb+object and verb+subject collocations and verb+pp collocations. In addition to direct objects, German also has indirect objects, and we believe that these also can be part of collocational combinations with verbs; examples are given in (19). These have been extracted from a corpus of legal journals, but could also be found in other text types.

(19) dem Zweifel unterliegen ('[to] be doubtful'), den Anforderungen genügen ('[to] satisfy [the] requirements'), einem Antrag stattgeben ('[to] accept a proposal')

A task within collocation extraction is thus to identify the different syntactic subtypes of noun+verb collocations. For English, this task is relatively easy, as typically the subject of an English verb can be found to its left and the object to its right. To identify English noun+verb combinations, part-of-speech tagged corpora and (regular) models about the adjacency of noun and verb phrases are mostly sufficient; it is possible to account for non-standard word order types (e.g. passives, relative clauses) with relatively simple rules.

For inflecting languages (like the Slavonic languages, or Latin), nominal inflection gives a fair picture of case and thus of the grammatical relation between a nominal and its governing verb. Even though many inflectional forms of e.g. nouns in Czech are case-ambiguous in isolation, a large percentage of noun groups (e.g. adjectives plus nouns) is unambiguous. Thus, inflecting languages, allowing for flexible constituent order, still lend themselves fairly well to a morphology-based extraction approach of relational word pairs. In fact, the collocation extraction within the lexicographic tool Sketch Engine (Kilgarriff et al. 2004) is based on the above-mentioned principles for English and Czech: the extraction of verb+object pairs from English texts relies on sequence patterns of items described in terms of parts-of-speech and the tool for Czech on patterns of cooccurrence of certain morphological forms, in arbitrary order, and within a window of up to five words.

German is different from both English and Czech, for that matter. Due to its variable constituent order (see table 1), sequence patterns and the assumption of verb+noun adjacency do not provide acceptable results. German has four cases and nominals inflect for case; but nominal inflection shows a great deal of syncretism, such that, for example, nominative and accusative, or genitive and dative are formally identical in several inflection paradigms.

Evert (2004) has extracted noun and noun phrase data from the manually annotated Negra corpus (a subset of the newspaper Frankfurter Rundschau) and he found that only approx. 21 % of all noun phrases in that corpus are unambiguous for case, with roughly the same amount not giving any case information (i.e. being fully four-way ambiguous, as is the case with feminine plural nouns!), and 58 % being 2- or 3-way ambiguous (cf. table 2, below).

Table 2. Case syncretism: Evert (2004: 1540) on Negra

Nouns           unambiguous  2/3 alternatives  no information
forms alone     7%           3?%               5?%
NP + agreement  21%          58%               21%
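The effect of agreement on case ambiguity can be illustrated with a toy example. The sketch below intersects the case readings of the words of a noun phrase and assigns the result to one of the three classes of table 2. The small `READINGS` table is a hand-written, simplified assumption (singular readings only, pooled across genders), not a real morphological lexicon, and the function names are illustrative.

```python
ALL_CASES = frozenset({"nom", "gen", "dat", "akk"})

# Toy table of possible cases per word form (hand-written assumption,
# singular readings only, simplified across genders and inflection classes).
READINGS = {
    "dem": {"dat"},
    "das": {"nom", "akk"},
    "kleinen": {"gen", "dat", "akk"},
    "kleine": {"nom", "akk"},
    "Haus": {"nom", "akk", "dat"},
}

def np_cases(tokens):
    """Cases compatible with every word of the noun phrase (agreement)."""
    cases = set(ALL_CASES)
    for token in tokens:
        cases &= READINGS[token]
    return cases

def ambiguity_class(cases):
    """Map a set of remaining cases onto the column labels of table 2."""
    if len(cases) == 1:
        return "unambiguous"
    if len(cases) == len(ALL_CASES):
        return "no information"
    return "2/3 alternatives"

if __name__ == "__main__":
    for np in (["Haus"], ["das", "kleine", "Haus"], ["dem", "kleinen", "Haus"]):
        cases = np_cases(np)
        print(" ".join(np), "->", sorted(cases), ambiguity_class(cases))
```

In this toy setting the bare noun is three-way ambiguous, while the full noun phrase dem kleinen Haus is reduced to a single reading by the determiner; this is the kind of gain that the row 'NP + agreement' reflects.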

We have tested different ways to improve verb+object precision in the extraction of data from German text corpora. One option is to use full syntactic analysis. This is what we do in the work underlying the data presented in this paper. Another option is partial syntactic analysis, in the sense of chunking (cf. Abney 1991). This involves the recognition of the pre-head part of nominal, adjectival and prepositional phrases, and of the head; it does not account for post-head modification, pp attachment and overall sentence structure. Such an approach is useful when the chunks are annotated with hypotheses about case (cf. Kermes 2003), from where to derive grammatical relations.

Yet other approaches rely on an approximative modelling of case and grammatical relations by means of data on case endings. As mentioned, these are too syncretistic to be used in isolation; they need at least to be combined with knowledge about a noun's grammatical gender. In our experiments (Ivanova et al. 2008), this knowledge was inferred from derivational affixes (e.g. 'all nouns in -heit/-keit ['-ity'] are feminine'). The experiments carried out were performed on the Sketch Engine, with different versions of a sketch grammar for German; the different versions are used to show and to compare the impact of the use of different amounts of linguistic knowledge. The different versions of the grammar contain the types of information listed in (20), below.

(20) Data available to different German sketch grammar versions:
(a) case guessed from inflection forms in affix sequences: dem (dative-sg) kleinen (dative-sg) Haus ([nom|akk|dat]-sg);
(b) like (a), plus gender guessed from derivational affixes (-heit: fem);
(c) inflection-based guessing as in (a) plus adjacency of np and verb under the verb-final constituent order model;
(d) like (c), plus gender guessing (as in [b]);
(e) inflection-based guessing plus explicit sentence structure models for sentences which contain only and exactly a subject and object, an indirect object and/or a prepositional object (sentence patterns);
(f) like (e), plus gender guessing (as in [b]).

The results of an evaluation against a small set of sentences manually annotated for the case of noun phrases showed that condition (20f), i.e. the most complex one, gave the best results, for precision as well as for recall; the other conditions with restrictive patterns also provided better results than the more 'sloppy' extractors (Ivanova et al. 2008). This seems to indicate that the extraction of verb+object pairs from German data is harder than, e.g. from English or Czech data. Our conclusion is that instead of the approximative modelling summarized in (20), full syntactic analysis seems to be the most appropriate preprocessing of corpora for collocation extraction from German texts.

A similar conclusion is reached by Seretan (2008) who evaluated pattern-based methods against parsing-based collocation extraction for both English and French. She finds that in particular the recall of the extraction procedures can be increased, if parsed data are used, instead of only pos-tagged material. In a mini-experiment on the comparison of a chunker (Kermes 2003) with parsing-based extraction, we found a similar discrepancy: on the top 250 collocation candidates identified by each method (top 250 by frequency and significance), on one and the same corpus, almost no differences in precision could be found between the chunking-based extractor and the parsing-based one. But the parsed data provided almost twice as much material, i.e. a massively higher recall (cf. Heid et al. 2008). It seems to be easier for a linguist to write sufficiently restrictive extraction rules (i.e. to control precision) than to have a clear view of the cases missed by the extractor (i.e. to avoid losses in recall).

3.3. An architecture for the extraction and classification of German noun+verb collocations

On the basis of the results of the experiments described above, it seems natural to use syntactically preprocessed text corpora for collocation candidate extraction. For the work described in this paper (e.g. the data discussed above, in section 2), we thus used Schiehlen's (2003) dependency parser. It produces a tabular representation of dependency structures as an output and it has an acceptable coverage; firrthermore, it annotates all cases

German noun+verb collocations in the sentence context 303 of local syntactic ambiguity and of non-attachment into the analysis result: if necessary, one can thus skip those parsing results which seem to be ambiguous (e.g. with respect to case) or which may not have been assigned enough structure (i.e. where most items are directly attached to the top node of the dependency tree). Figure 1 contains atree representation of the sentence in (21), as well as the tabular output produced by the parser. (21) die zweite Studie lieferte ahnliche Ergebnisse ('the second study provided similar results')

0  Die         ART    d         |                 2   SPEC
1  zweite      ADJA   2.        |                 2   ADJ
2  Studie      NN     Studie    Nom:F:Sg          3   NP:nom
3  lieferte    VVFIN  liefern   3:Sg:Past:Ind*   -1   TOP
4  ähnliche    ADJA   ähnlich   |                 5   ADJ
5  Ergebnisse  NN     Ergebnis  Akk:N:Pl          3   NP:akk
6  .           $.     .         |                -1   TOP

Figure 1. Dependency structure output used for collocation extraction: tree representation and internal format of the parser by Schiehlen (2003) (from: Fritzinger et al. 2009)

To extract collocation candidates from the parsed output, we use pattern matching at the level of grammatical functions: for example, we extract combinations of main verbs and nouns, where the nouns are in a direct object relation to the respective main verbs. Such data can be read off the parsing output. As the parser marks ambiguities, we could also work with non-ambiguous sentences only (so far, however, the quality achieved by using both unambiguous and ambiguous data has been satisfactory).

The parsing output not only contains the word form found in the sentence analyzed (second column of figure 1), but also its lemma and its morphosyntactic features (case, number, etc., cf. third and fourth column of figure 1). In the extraction work, we rely on these data, as they give hints on the morphosyntactic properties of the collocations extracted: the morphosyntactic features, as well as the form of the determiner, possible negation elements in the noun phrase or in the verb phrase, possible adverbs, etc. are extracted along with the lemma and form of base and collocate. This multiparametric extraction is modular: new features or context partners can be added if necessary. Similarly, additional patterns, which in this case cover the sentence as a whole, can be used to detect passives and/or to identify the constituent order models involved. In this way, we get data for the specific analyses discussed in section 2.

All extracted data for a given pair of base and collocate are stored in a relational database, along with the sentence in which these data were found. An example of a data set for a sentence is given in (22), below, for sentence (23). For this example, the database contains information about the noun and verb lemma (the verb here being a compound: geltend machen, 'put forward'), but also about the number, the kind of determiner present in the NP (here: null, i.e. none), the presence of the passive (including the lemma of the passive auxiliary, here werden), the sentence type (verb-second), modifiers found in the sentence (adverbs and prepositional phrases) and about the fact that the verb is embedded under a modal (for details on the procedures, see Heid and Weller 2008).

(22)
n_lemma         | Grund
v_lemma         | geltend machen
number          | Pl
type_of_det     | null
active/passive  | passive
pass.auxiliary  | werden
sent_type       | v-2
modifiers       | auch (ADV), PP:für:Errichtung, PP:für:Land
modal           | können
preposition     | null
chunk           | Solche Gründe können auch für die Errichtung eines gemeinsamen Patentamtes für die Länder geltend gemacht werden

(23) Solche Gründe können auch für die Errichtung eines gemeinsamen Patentamtes für die Länder geltend gemacht werden. ('Such reasons can also be put forward for the installation of a common patent office for the Länder').
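The pattern matching over grammatical functions described above can be illustrated with a minimal sketch. It is not the extraction tool used in this work: it assumes that the rows of figure 1 are whitespace-separated and carry, in this order, token index, word form, part-of-speech tag, lemma, morphosyntactic features, head index and grammatical function, and that the plural object is tagged Akk:N:Pl. The names Token, read_tokens and verb_object_pairs are hypothetical.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    index: int
    form: str
    pos: str
    lemma: str
    morph: str      # "|" marks an empty feature slot in the parser output
    head: int       # index of the governing token, -1 for top-level nodes
    function: str


# Example sentence from figure 1 (column layout is an assumption, see lead-in).
PARSE = """\
0 Die ART d | 2 SPEC
1 zweite ADJA 2. | 2 ADJ
2 Studie NN Studie Nom:F:Sg 3 NP:nom
3 lieferte VVFIN liefern 3:Sg:Past:Ind* -1 TOP
4 ähnliche ADJA ähnlich | 5 ADJ
5 Ergebnisse NN Ergebnis Akk:N:Pl 3 NP:akk
6 . $. . | -1 TOP
"""


def read_tokens(text: str) -> list:
    """Turn one parsed sentence into a list of Token records."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue
        idx, form, pos, lemma, morph, head, func = line.split()
        tokens.append(Token(int(idx), form, pos, lemma, morph, int(head), func))
    return tokens


def verb_object_pairs(tokens: list) -> list:
    """Collect noun lemmas that are accusative objects of a main verb."""
    by_index = {t.index: t for t in tokens}
    pairs = []
    for t in tokens:
        head: Optional[Token] = by_index.get(t.head)
        if head is None:
            continue
        if t.pos == "NN" and t.function == "NP:akk" and head.pos.startswith("VV"):
            # The last feature of e.g. "Akk:N:Pl" is taken to be the number.
            number = t.morph.split(":")[-1] if t.morph != "|" else None
            pairs.append({"n_lemma": t.lemma, "v_lemma": head.lemma, "number": number})
    return pairs


pairs = verb_object_pairs(read_tokens(PARSE))
print(pairs)
```

Further features of the kind listed in (22), such as the determiner type or the presence of a passive auxiliary, could be added to each record in the same way by inspecting the dependents of the noun and of the verb.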


When it comes to interpreting the data for a given collocation, we extract all individual records, sentence by sentence, for a given pair of base and collocate from the database. We sum up over the features and, if necessary, combine these observed frequencies with data from association measures (to identify those lemma combinations which are significant) and with a calculus of the relative proportions of individual feature values (e.g. the relationship between singular and plural). For the latter, we use a calculus proposed by Evert (2004). Such quantitative analyses provide a lower bound of a confidence interval for the percentage of cases that display a certain feature, e.g. a preference for singular null articles. An example is given in figure 2: it shows absolute frequencies of different parameter distributions for the collocations Rechnung ausstellen ('[to] make out a bill') and Rechnung tragen ('[to] take into account'), in data from the Acquis Communautaire corpus of the European Joint Research Centre (JRC, Ispra). Rechnung tragen clearly prefers a null article in the singular, but it allows both active and passive.

f     n lemma   v lemma     type of det  number  active/passive
5     Rechnung  ausstellen  def          Sg      passive
4     Rechnung  ausstellen  indef        Sg      active
4     Rechnung  ausstellen  def          Sg      active
1     Rechnung  ausstellen  def          Pl      active
1387  Rechnung  tragen      null         Sg      active
262   Rechnung  tragen      null         Sg      passive
136   Rechnung  tragen      null         Sg      passive
1     Rechnung  tragen      dem          Sg      active
1     Rechnung  tragen      poss         Sg      active
1     Rechnung  tragen      def          Sg      passive

Figure 2. Sample cumulative database entry for Rechnung ausstellen and Rechnung tragen in the collocations database
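As an illustration of how such cumulative entries can be summed up and turned into a lower confidence bound, the following sketch loads the rows of figure 2 into an in-memory SQLite table and computes the share of singular null-article uses of Rechnung tragen. The table and column names are hypothetical, and the Wilson score interval is used here only as a common stand-in: it is not claimed to be the calculus of Evert (2004).

```python
import math
import sqlite3

# Rows copied from figure 2: (f, n_lemma, v_lemma, det, num, voice).
ROWS = [
    (5, "Rechnung", "ausstellen", "def", "Sg", "passive"),
    (4, "Rechnung", "ausstellen", "indef", "Sg", "active"),
    (4, "Rechnung", "ausstellen", "def", "Sg", "active"),
    (1, "Rechnung", "ausstellen", "def", "Pl", "active"),
    (1387, "Rechnung", "tragen", "null", "Sg", "active"),
    (262, "Rechnung", "tragen", "null", "Sg", "passive"),
    (136, "Rechnung", "tragen", "null", "Sg", "passive"),
    (1, "Rechnung", "tragen", "dem", "Sg", "active"),
    (1, "Rechnung", "tragen", "poss", "Sg", "active"),
    (1, "Rechnung", "tragen", "def", "Sg", "passive"),
]


def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if total == 0:
        return 0.0
    p = successes / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE colloc "
    "(f INTEGER, n_lemma TEXT, v_lemma TEXT, det TEXT, num TEXT, voice TEXT)"
)
conn.executemany("INSERT INTO colloc VALUES (?, ?, ?, ?, ?, ?)", ROWS)


def feature_share(conn, n_lemma, v_lemma, det, num):
    """Return (hits, total): frequency of one feature combination vs. all uses."""
    total = conn.execute(
        "SELECT COALESCE(SUM(f), 0) FROM colloc WHERE n_lemma = ? AND v_lemma = ?",
        (n_lemma, v_lemma),
    ).fetchone()[0]
    hits = conn.execute(
        "SELECT COALESCE(SUM(f), 0) FROM colloc "
        "WHERE n_lemma = ? AND v_lemma = ? AND det = ? AND num = ?",
        (n_lemma, v_lemma, det, num),
    ).fetchone()[0]
    return hits, total


hits, total = feature_share(conn, "Rechnung", "tragen", "null", "Sg")
lower = wilson_lower_bound(hits, total)
print(hits, total, round(lower, 4))
```

A lower bound is more cautious than the raw proportion: for a pair like Rechnung ausstellen, with only 14 observations, the same computation yields a much wider interval, so that an apparent preference is not over-interpreted.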

From the database, we can read off relevant combinations of morphosyntactic features and combine these data for manual inspection.

4. Conclusions

In this paper, we have discussed current approaches to the extraction of collocation candidates from text corpora. Emphasis was on the linguistic properties which need to be described in detail, in addition to mere knowledge about lexical combinatorics. We discussed the differences between configurational languages (e.g. English), inflecting languages (e.g. Czech) and German with respect to the devices necessary to extract grammatical relations between base and collocate (e.g. verb+object pairs), and we presented a parsing-based architecture for German which allows us to extract such relational pairs. We feed all extraction results into a database, so as to be able to investigate the behaviour of collocational items in a multiparametric way.

Even though this work is still at its beginning, the database can already be used as a tool for research: with its help, we were able to analyze the preferences of collocations with respect to the three German word order models, discussed in section 2.4.

In the future, work on large corpora from different sources (e.g. regional variants, different text types, different degrees of formality, different domains, etc.) and a thorough awareness of these metadata should also allow us to undertake investigations into the variation potential of the German language with respect to collocations.

Furthermore, we expect to be able to analyze in detail the interplay between idiomatic fixedness (at the morphosyntactic level) and grammatical constraints: if a noun in a collocation like zu + Schluss + gelangen ('[to] arrive at + conclusion') subcategorizes for a dass-clause, the presence of this clausal complement will enforce the definiteness of the noun, i.e. its definite article, and, in the particular case, even a non-fused preposition+article group: zu dem Schluss kommen, dass... (and not: zum Schluss kommen, dass...); by looking at other cases of the same parameter constellation (singular, preposition, sentence complement after the collocation), we hope to be able to inspect more closely such cases of the interplay between grammar and collocation, i.e. between open choice and idiomaticity.

Notes

1. The examples are taken from ongoing work by Marion Weller (IMS Stuttgart) on a corpus of German newspaper texts from 1992 to 1998, comprising material from Stuttgarter Zeitung, Frankfurter Rundschau, Die Zeit and Frankfurter Allgemeine Zeitung, a total of ca. 240 million words. These sources are indicated by the title and the onset of the citation in the IMS version of the respective corpora. The text of Frankfurter Rundschau (1993/94) has been published by the European Corpus Initiative (ELSNET, Utrecht, The Netherlands) in its first multilingual corpus collection (ECI-MC1). The other newspapers, as well as the juridical corpus cited below, have been made available to the author under specific contracts for research purposes.

2. A related observation has to do with verb-final contexts: there, the support verb and the pertaining noun tend to be adjacent. Only a few types of phrases can be placed between the two elements, e.g. adverbs, pronominal adverbs or prepositional phrases. However, in the data used for our preliminary investigations, this criterion does not help much to distinguish idiomatic groups from non-idiomatic ones.

References

Abney, Steven
1991 Parsing by chunks. In Principle-Based Parsing, Robert Berwick, Steven Abney and Carol Tenny (eds.), 257-278. Dordrecht: Kluwer.

Bahns, Jens
1996 Kollokationen als lexikographisches Problem: Eine Analyse allgemeiner und spezieller Lernerwörterbücher des Englischen. Lexicographica Series Maior 74. Tübingen: Max Niemeyer.

Benson, Morton, Evelyn Benson and Robert Ilson
1986 The Lexicographic Description of English. Amsterdam/Philadelphia: John Benjamins.

Burger, Harald
1998 Phraseologie: Eine Einführung am Beispiel des Deutschen. Berlin: Erich Schmidt Verlag.

Cahill, Aoife, Marion Weller, Christian Rohrer and Ulrich Heid
2009 Using tri-lexical dependencies in LFG parse disambiguation. In Proceedings of the LFG09 Conference, Miriam Butt and Tracy Holloway King (eds.), 208-221. Stanford: CSLI Publications.

Evert, Stefan
2004 The statistical analysis of morphosyntactic distributions. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), 1539-1542. Lisbon: ELRA.

Evert, Stefan
2005 The Statistics of Word Cooccurrences: Word Pairs and Collocations. Stuttgart: University of Stuttgart and http://www.collocations.de/phd.html.

Evert, Stefan
2009 Corpora and collocations. In Corpus Linguistics: An International Handbook, Anke Lüdeling and Merja Kytö (eds.), 1212-1248. Berlin/New York: Walter de Gruyter.

Fazly, Afsaneh and Suzanne Stevenson
2006 Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006), 337-344. Trento, Italy, April 2006. http://www.cs.toronto.edu/~suzanne/papers/FazlyStevenson2006.pdf.

Fellbaum, Christiane, Undine Kramer and Gerald Neumann
2006 Corpusbasierte lexikographische Erfassung und linguistische Analyse deutscher Idiome. In Phraseology in Motion I: Methoden und Kritik, Annelies Buhofer and Harald Burger (eds.), 43-56. Basel: Baltmannsweiler.

Firth, John Rupert
1957 Modes of Meaning. In Papers in Linguistics 1934-51, John Rupert Firth (ed.), 190-215. Oxford: Oxford University Press.

Fritzinger, Fabienne, Ulrich Heid and Nadine Siegmund
2009 Automatic extraction of the phraseology of a legal subdomain. In Proceedings des XVII European Symposium on Languages for Specific Purposes, Århus, Danmark. http://www.ims.uni-stuttgart.de/~fritzife/pub.html.

Grossmann, Francis and Agnès Tutin
2003 Quelques pistes pour le traitement des collocations. In Les collocations: analyse et traitement, Francis Grossmann and Agnès Tutin (eds.), 5-21. Amsterdam: De Werelt.

Hausmann, Franz Josef
1979 Un dictionnaire des collocations est-il possible? Travaux de linguistique et de littérature XVII (1): 187-195.

Hausmann, Franz Josef
1989 Le dictionnaire de collocations. In Wörterbücher, Dictionaries, Dictionnaires: Ein Internationales Handbuch, Franz Josef Hausmann, Oskar Reichmann, Herbert-Ernst Wiegand and Ladislav Zgusta (eds.), 1010-1019. Berlin: De Gruyter.

Hausmann, Franz Josef
2004 Was sind eigentlich Kollokationen? In Wortverbindungen mehr oder weniger fest, Institut für Deutsche Sprache Jahrbuch 2003, Kathrin Steyer (ed.), 309-334. Berlin: De Gruyter.

Heid, Ulrich
1994 On ways words work together: Topics in lexical combinatorics. In Proceedings of the VIth Euralex International Congress, Willy Martin et al. (eds.), 226-257. Amsterdam: Euralex.

Heid, Ulrich
1998 Building a dictionary of German support verb constructions. In Proceedings of the 1st International Conference on Linguistic Resources and Evaluation, Granada, May 1998, 69-73. Granada: ELRA.

Heid, Ulrich
2005 Corpusbasierte Gewinnung von Daten zur Interaktion von Lexik und Grammatik: Kollokation - Distribution - Valenz. In Corpuslinguistik in Lexik und Grammatik, Friedrich Lenz and Stefan Schierholz (eds.), 97-122. Tübingen: Stauffenburg.

Heid, Ulrich
2008 Computational phraseology: An overview. In Phraseology: An Interdisciplinary Perspective, Sylviane Granger and Fanny Meunier (eds.), 337-360. Amsterdam: John Benjamins.

Heid, Ulrich, Fabienne Fritzinger, Susanne Hauptmann, Julia Weidenkaff and Marion Weller
2008 Providing corpus data for a dictionary for German juridical phraseology. In Text Resources and Lexical Knowledge: Selected Papers from the 9th Conference on Natural Language Processing, KONVENS 2008, Angelika Storrer, Alexander Geyken, Alexander Siebert and Kay-Michael Würzner (eds.), 131-144. Berlin: Mouton de Gruyter.

Heid, Ulrich and Rufus H. Gouws
2006 A model for a multifunctional electronic dictionary of collocations. In Proceedings of the XIIth Euralex International Congress, 979-989. Alessandria: Edizioni dell'Orso.

Heid, Ulrich and Marion Weller
2008 Tools for collocation extraction: Preferences for active vs. passive. In Proceedings of LREC-2008: Linguistic Resources and Evaluation Conference, Marrakech, Morocco. CD-ROM.

Helbig, Gerhard
1979 Probleme der Beschreibung von Funktionsverbgefügen im Deutschen. Deutsch als Fremdsprache 16: 273-286.

Ivanova, Kremena, Ulrich Heid, Sabine Schulte im Walde, Adam Kilgarriff and Jan Pomikálek
2008 Evaluating a German sketch grammar: A case study on noun phrase case. In Proceedings of LREC-2008: Linguistic Resources and Evaluation Conference, Marrakech, Morocco. CD-ROM.

Keil, Martina
1997 Wort für Wort: Repräsentation und Verarbeitung verbaler Phraseologismen (Phraseolex). Tübingen: Niemeyer.

Kermes, Hannah
2003 Offline (and Online) Text Analysis for Computational Lexicography. Dissertation, IMS, University of Stuttgart.

Kilgarriff, Adam, Pavel Rychlý, Pavel Smrž and David Tugwell
2004 The sketch engine. In Proceedings of the XIth EURALEX International Congress, G. Williams and S. Vessier (eds.), 105-116. Lorient: Université de Bretagne Sud.

Krenn, Brigitte
2000 The Usual Suspects: Data-Oriented Models for the Identification and Representation of Lexical Collocations. Saarbrücken: DFKI, Universität des Saarlandes.

Krenn, Brigitte and Gregor Erbach
1994 Idioms and Support Verb Constructions. In German in Head Driven Phrase Structure Grammar, John Nerbonne, Klaus Netter and Carl Pollard (eds.), 297-340. Stanford, CA: CSLI Publications.

Krenn, Brigitte and Stefan Evert
2001 Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL Workshop on Collocations, 39-46. Toulouse: Association for Computational Linguistics.

Lapshinova, Ekaterina and Ulrich Heid
2007 Syntactic subcategorization of noun+verb-multiwords: Description, classification and extraction from text corpora. In Proceedings of the 26th International Conference on Lexis and Grammar, Bonifacio, 26 October 2007.

Mel'čuk, Igor A., Nadia Arbatchewsky-Jumarie, Léo Elnitsky, Lidija Iordanskaja and Adèle Lessard
1984-99 Dictionnaire explicatif et combinatoire du français contemporain: Recherches Lexico-Sémantiques I-IV. Montréal: Presses Universitaires de Montréal.

Ritz, Julia
2006 Collocation extraction: Needs, feeds and results of an extraction system for German. In Proceedings of the Workshop on Multi-word-expressions in a Multilingual Context, 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy, April 2006, 41-48. Trento: Association for Computational Linguistics.

Ritz, Julia and Ulrich Heid
2006 Extraction tools for collocations and their morphosyntactic specificities. In Proceedings of the Linguistic Resources and Evaluation Conference, LREC 2006, Genova, Italia, 2006. CD-ROM.

Schafroth, Elmar
2003 Kollokationen im GWDS. In Untersuchungen zur kommerziellen Lexikographie der deutschen Gegenwartssprache I. "Duden: Das große Wörterbuch der deutschen Sprache in zehn Bänden", Herbert Ernst Wiegand (ed.), 397-412. Tübingen: Niemeyer.

Schiehlen, Michael
2003 A cascaded finite-state parser for German. In Proceedings of the Research Note Sessions of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2003), Budapest, April 2003, 133-166. Budapest: Association for Computational Linguistics.

Scott, Mike
2008 WordSmith Tools, version 5. Liverpool: Lexical Analysis Software.

Seretan, Violeta
2008 Collocation Extraction Based on Syntactic Parsing. Dissertation No. 653, Dépt. de linguistique, Université de Genève, Genève.

Siepmann, Dirk
2005 Collocation, colligation and encoding dictionaries. Part I: Lexicological Aspect. International Journal of Lexicography 18 (4): 409-444.

Sinclair, John McH.
1991 Corpus, Concordance, Collocation. Oxford: Oxford University Press.

Smadja, Frank
1993 Retrieving collocations from text. Computational Linguistics 19 (1): 143-177.

Storrer, Angelika
2006 Zum Status der nominalen Komponente in Nominalisierungsverbgefügen. In Grammatische Untersuchungen: Analysen und Reflexionen, Eva Breindl, Lutz Gunkel and Bruno Strecker (eds.), 275-295. Tübingen: Narr.

Corpora

JRC Acquis
Acquis Communautaire, described in the following paper: Steinberger, Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş and Dániel Varga (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation. Genoa, Italy, 24-26 May 2006. The corpus is available at http://langtech.jrc.it/JRC-Acquis.html.

See note 1 for further sources.

Author index Aarts,Jan,18 Abdel Rahman, Rasha, 92 Abney, Steven, 269, 301 Adolphs,Svenja,49 Aho, Alfred Vamo, 262 Ahulu, Samuel, 160 Algeo, John, 201 Altenberg,Bengt,47,127 Andersen, 0rvm, 263 Andrade,R.F.S.,91 Antiqueira, Lucas, 91 Asher,JamesJ.,153 Atkins, Beryl T., 49 Bahns, Jens, 285 Ballard, Michel, 73 Barfield, Andy, 131, 137 Barlow, Michael, 211, 214 Barnbrook, Geoff, 213 Bashyam,Vijayaraghavan,278 Bauer, Laurie, 136 Behrens,H ei ke,32,48 Belica, Cyril, 156 Benson, Evelyn, 202, 284 Benson, Morton, 202, 284 Besch, Werner, 187 Biber, Douglas, 47, 148, 179, 191, 211,239,240 Bird, Steven, 270 Bordag, Stefan, 91 Bowker,Lynne,211 Bozicevic,Miran,91 Braine, Martin Dan Isaac, 153 Brazil, David, 5 Brooks, Philip, 279 Bryant, Michael, 231 Buchholz, Sabine, 276, 278 Burger, Harald, 283, 288 Bybee,Joan,32 Cahill,Aiofe,293 Caldarelli,Guido,91

Calde 1 ra,S 1 lv 1 aM.G.,91 Capocci, Andrea, 91 Carreras,Xav ie r,278,279 Carter, Ronald, 115 Cecchi,GuillennoA.,91 Channell, Joanna, 8, 161 Cheng, Winnie, 10, 216 ChevaHer, Jean-Claude, 65 Choi, Key-Sun, 91 Chuquet,Helene,72,73,78 ColHer, Alex, 114-116 Conklm,Kathy,127 Connor, Ulla, 41, 49, 199,211 Conzett, Jane, 128, 132 Corso,Grlberto,91 Cosenu,Eugemo,43,147 Coulthard,R.Malcom,5,6,20,179 Cowie, Anthony Paul, 47, 49, 132 Coxhead,Averil,127,138 Croft, William, 30, 47 Cruse, D.Alan, 30, 47 Crystal, David, 160 Csardi,Gabor,107 Cullen, Richard, 140 Dagneaux,Estelle,138 Daley, Robert, 3, 7, 8, 18, 20, 87, 113 Darbelnet,Jean,76 Dasgupta,Partha,91 De Cock, Sylvie, 137 de la Torre, Rita, 153 DeCarrico,JeanetteS.,127,129,133 Dechert,Hans-Wilhelm,128 Delport, Mane-France, 65 Deveci,Tanju,132 Domazet,Mladen,91 Ellis, Nick, 49, 127, 129, 229 Erbach,Gregor,288 Evert, Stefan, 101, 232, 285, 298, 300,301,305

314 Author index Faulhaber,Susen,31 Fazly,Afsaneh,297,299 Feilke, Helmut, 62 FeUbaum, Christians 287 Ferrer iCancho, Ramon, 91 Ferret, Olivier, 91 Fillmore, Charles, 40, 41, 49, 199 Firth, John Rupert, 5, 6, 12, 19, 49, 147,211,212,221,223,284 Fischer, Kerstm, 47, 49 Fuzpatnck,Tess,136 Fletcher, William H., 211, 214 Fodor, Jerry Alan, 258 Francis, Gill, 8, 47, 67 Francis, W.Nelson, 7, 18 Frath,Pierre,61,62,65,78 Frey,Enc,49,229 Fritzinger,Fabienne,303 Gabnelatos,Costas,133,140 Gallagher, John D., 60, 65, 73, 78 Galpm,Adam,49,229 Garside, Roger, 231, 243 Gavioli, Laura, 211 Gilquin,Gaetanelle,28,127,138 Girard, Rene, 264 Gledhill, Christopher, 61, 62, 65, 78 Glaser,Rosemarie,41 G6rz,Gunther,280 Gotz, Dieter, 50, 153, 156, 239 G6tz-Votteler,Katrm,50,156,231 Goldberg, AdeleE., 31, 40, 44,4749 Gouverneur, Celine, 133 Gouws,RufusH.,284 Grandage, Sarah, 49 Granger,Sylvrane,28,41,46,47, 127,134,138,153,161,200,230 Grant, Lynn, 136 Greaves, Chris, 10,211,214,216 Green, Georgia, 67-69, 74, 260 Greenbaum, Sidney, 49 Gries, Stefan Th., 47, 50 Grimm, Anne, 201, 205

Grossmann,Franc1S,283 GroB, Annette, 74, 76 Guan,Jihong,91 Hamilton, Nick, 132 Han,ZhaoHong,152 Handl,Susanne,47,262 Harris, ZelhgS., 213 Harwood, Nigel, 124, 126, 131 Hasselgren, Angela, 130 Hausmann, Franz Josef, 28-30, 3234,37,47,60,61,65-68,77,78, 200,208,230,236,283,285, 286,291 Hausser, Roland, 247, 248, 256, 259, 260,262-264 Heaton, John B., 161 Heid,Ulrich, 283, 284, 286-288, 293,296,299,302,304 Helbig, Gerhard, 287, 288 Herbst,Thomas,28,29,31,39,41, 47,49,50,78,161,203,252 Heringer, Hans Jurgen, 156 Hinrichs, Lars, 187 Hockemnaier, Julia, 263 Hoey, Michael, 24, 66, 67, 78, 123, 128 Hoffmann, Sebastian, 160, 180 Hofland,Knut,47 Hoover, David, 3 Hopper, Paul, 183 Howarth, Peter, 127, 161 Hu,Guobiao,91 Huang, Wei, 263 Hugon,Cla1re,134 Hundt, Marianne, 160 Hunston,Susan,8,14,47,67,123, 212,214,224 Hyland, Ken, 137,211 Ilson, Robert, 202, 285 Ivanova,Kremena,301,302 Jackson, Dunham, 116,218

Author index 315 Jalkanen, Isaac, 49, 229 Janulevieiene,Violeta,132,133 Johansson, Stig, 17, 21, 47 Johns, Tim, 132 Jones, Susan, 3, 7, 8, 18, 20, 87, 113

Loper, Edward, 270 Lourenco,G.M.,91 Lowe, Charles, 125, 126, 130, 136 Ludw lg ,Bernd,280 Leon, Jaquelme, 212

Kaszubski,Przemyslaw,160 Katz,Jerrold Jacob, 258 Kavaliauskiene,Galina,132,133 Kay, Paul, 41, 49, 199 Keil, Martina, 288 Kenny, Dorothy, 65 Kermes, Hannah, 301, 302 Kilgarriff, Adam, 284, 300 Kinouchi,Osame,91 Kirk, John M., 243 Klein, Barbara, 186, 187, 191 Klein, Ewan, 270 Klotz, Michael, 39, 78, 203 K6hler,Remhard,91 Kramer, Undine, 287 Krenn,Bn gl tte,285,286,288,299 Knshnamurthy,Ramesh,8,18,93, 123,245 Krug, Manfred, 185 Kundert,K.R.,116 Kusudo,JoAnne,153 Kucera,Henry,7,18

MacWhinney, Brian James, 262, 264 Magnusson, Camilla, 91 Mair, Christian, 160, 174, 185, 187, 240 Makkai,Adam,30 Manca, Elena, 149 Manning, Elizabeth, 8, 67 Marcmkiewicz, Mary Ann, 263 Marcus, Mitchell P., 263, 276, 277 Marquez,Llms,278,279 Martinez, Alexandre Souto, 91 Masucci,A.P., 91 Mauranen,Anna,12,23,206,214 Maynard, Carson, 127, 129 Meara,PaulM.,91 Meehan, Paul, 133 Mehler, Alexander, 115 Melmger,Atissa,92 Meunier, Fanny, 138 Milroy, Lesley, 88 Milton, John, 137 Miranda, J. G. V., 91 Mittmann,Bngitta,50,197,201,

Lai,Ying-Cheng,91 Langacker, Ronald W., 48 Lapshmova,Ekaterma,288 Leech, Geoffrey, 49, 180, 185, 231, 240,243 Lehrberger, John, 213 Lennon, Paul, 128 Lewis, Michael, 124-126, 129, 131133,136,137,139 Li,Jianyu,91 Lian,HoMian,160 L ie ven,Elena,32,48 Lima,GilsonFranzisco,40,91 Lm,Haitao,91,262,263 Lobao, Thierry C, 91

202,203,206 MiBler,Bettma,74,76 Montague, Richard, 264 Moon, Rosamund, 8, 9, 208 Motter,AdilsonE.,91 Moura,AlessandroP.S.de,91 Mukherjee,Joybrato,47,159,160, 170 Murtra,BernatCorommas,90 Myles, Florence, 136 Nattinger, James, 127, 129, 133 Neme, Alexis, 91 Nesselhauf,Nadja,28,30,47,128, 137,161,163,170,171,174,236

316 Author index Neumann, Gerald, 287 Nicewander, W.Alan, 116 Nichols, Johanna, 251 01iveira,OsvaldoN.,91 Overstreet,Maryann,198 Pacey, Mike, 114-116 Paillard,Michel,72,73,78 Paquot,Magali,41,46,47,127,134, 138,200 Park, Young C, 91 Partington, Alan, 66 Pawley, Andrew, 11,40 Pearson, Jennifer, 116,211 Piatt, John, 160 Pollard, Carl, 259 Polzenhagen, Frank, 160 Porto, Melma, 125, 127, 129 Proisl, Thomas, 49, 262 Pulverness, Alan, 130, 139 Putnam, Hilary, 246, 248, 262 Quirk, Randolph, 18 Raab-F1scher,Rosw1tha,187,191 Ramshaw,LanceA.,276,277 Rayson, Paul, 115,231,239 Reiss, Peter, 280 Renouf, Antoinette, 8, 114-116, 130, 134,135,138 Richards, Jack C, 128 Risau-Gusman, Sebastian, 91 Ritz, Julia, 286, 299 Rodgers,G.J.,91 Rodgers, Joseph Lee, 116 Romer.Ute, 174,200,211,214 Rogers, Ted, 131 Rohdenburg,Gunter,180 Rohrer, Christian, 293 Rosenbach,Anette,187 Sag, Ivan, 259 Salkoff, Morris, 77

Sand, Andrea, 35, 160, 161, 165, 174 Santorini, Beatrice, 263 Saussure, Ferdinand de, 43, 218 Schafroth,Elmar,286 Schiehlen, Michael, 302, 303 Schilk, Marco, 160 Schm 1 d,Hans-J6rg,36,47,48 Schmied, Josef, 160 Sch mi tt,Norbert,49,127,229 Schneider, Edgar W., 160 Schuller,Susen,41,252 Schur, Ellen, 91 Scott, Mike, 198, 246, 285 Selinker, Larry, 152 Sells, Peter, 263 Seretan,Violeta,302 Shei,Chi-Chiang,161 S 1 epmann,D 1 rk,30,44,47,61-63, 67,78,156,286 Sigman, Mariano, 91 Simpson-Vlach, Rita, 127, 129 Sinclair, John McH., 1-14, 17-24, 27-29,32,37,38,41-43,45,5963,66,67,71,77,78,87,89,93, 103,109,113,115,116,123, 124, 126, 128, 130, 132, 134, 135, 138, 139, 147, 154, 156, 159,179,197,206,208,211, 212,214,215,217,223,224, 229,240,243,245,258,284,286 Siyanova, Anna, 127 Skandera, Paul, 160, 170 Smadja, Frank, 285 Smith, Nicholas, 185, 187, 240, 272 Scares, MarcioMedeiros, 91 Sole, Richard V., 90 Speares, Jennifer, 48 Steedman, Mark, 263 Steels, Luc, 90 Stefanci6,Hrvoje,91 Stefanowitsch,Anatol, 44, 47-50 Stein, Stephan, 65 Stevenson, Suzanne, 297, 299 Steyvers,Mark,90

Author index 317 Starrer, Angelika, 291 Stubbs,Michael,24,29,47,60,96, 115 Svartvrk,Jan,18 Syder, Frances Hodgetts, 40 Szmrecsanyi,Benedikt,187 Taira, Ricky K., 278 Tenenbaum, Joshua B., 91 Teubert, Wolfgang, 18, 20, 245, 248 Thompson, Geoff, 117,214 Thornbury, Scott, 136 Timmis, Ivor, 139, 140 Tjong Kim Sang, Erik F., 276, 278 Togmm-Bonelh,Elena,3,45,149, 286 Tomasello, Michael, 32, 48 Traugott, Ehzabeth, 181 Tremblay,Antome,129 Tsujii,Jun'ichi,279 Tsuruoka,Yosh im asan,279 Turton,N lg elD.,161 Tutin, Agnes, 283 Uhng,Peter,31,47,49 Ullman, Jeffrey David, 262 Underwood, Geoffrey, 49, 229 Upton, Thomas A., 211

Valverde,Ser gl ,90 Vanharanta,Hannu,91 Vinay, Jean-Paul, 76 Vitevitch, MichaelS., 91 Waibel,Birgit,137 Warren, Martin, 10, 216 Weber, Heidi, 160, 262 Weller, Marion, 293, 294, 304, 306 Widdowson, Henry G., 140 Wierzbicka, Anna, 251 Wiktorsson, Maria, 137 Wilks, Clarissa, 91 Williams, Jessica, 161 Willis, Dave, 132, 134, 136, 138 Wilson, Edward O., 13 Wolf, Hans-Georg, 160 Wolff, Dieter, 74, 76 Wolter, Brent, 91 Woolard, George, 126, 130, 132 Wray, Alison, 129, 135, 136, 203, 206 Zhang, Zhongzhi, 91 Zhou,Jie,91 Zhou,Shuigeng,91 Zlatic,Vmko,91

Subject index active, 3, 4, 73, 133, 304, 305 adjective, 30, 31, 33, 37, 47, 48, 64, 67,70,72,73,75-77,128,138, 215,236,238,244,251,263, 270,285-287,290,291,300 adjunct, 278, 295 adverb, 41, 49, 65, 70, 72, 73, 95, 133,200,203,285,290,291, 304,307 adverbial, 41, 200, 206 affected, 273 agent, 278, 279 algorithm, 93, 100, 105, 117,254, 263 ambigmty, 132, 240, 244, 300, 303 annotation, 179, 237, 244, 274, 276, 300-302 argument, 31, 251, 253, 255, 259, 263 argument structure, 31, 263 aspect, 252 association measure, 231, 232, 285, 298,299,305 base, 30, 32, 33, 44, 219, 253, 257, 259,262,271,274,285,286, 304-306 case, 300-304 category, 41, 45, 46, 61, 76, 77, 133, 190,221,231,252,269,285,286 Chinese, 91, 263 choice, 27, 29, 31-33, 37, 38, 40, 42, 44-46,59,63,65,69,77,133, 165,190,306 chunk, 12, 28, 29, 34, 40-42, 45, 124, 129, 132, 147, 149, 151156,179,180,190,206,207, 214,269-274,276-280,285,286, 290,301,304 chunking, 270-274, 276, 278, 301, 302

cluster, 50, 74, 198-200, 203-206, 276,291 co-occurrence, 20, 22, 28, 31, 36, 42, 43,46,47,49,66,87,90,91,94, 95,97,98,100-102,104,108, 111,112,115-117,134,156, 229-232,240 co-selection, 9, 27, 154 Cobuild,31,130,139 cognitive, 28, 32, 35, 36, 38, 42, 45, 103,113,246,247,249,251, 259-261 coherence, 89 colligation, 78, 179 collocate, 11,20,22,30,33,37,38, 44,47,49,62,65,75,90,91,94, 100,102-105,107,109,111-114, 138,152,160,234,283,285, 286,296,304-306 collocation, 1,3,7,9, 11, 19-21,23, 28-30,32-37,39,41,43-45,47, 49,50,60,61,63,65-68,70,72, 74-79,82,85,87-95,101-104, 107,108,112-116,123,125, 128, 133, 136, 137, 147, 149, 151, 152, 155, 156, 159, 160, 163-166, 168, 169, 172-174, 179, 200,212,214,229,230,232, 233,235,236,238,240,244, 258,259,262,270,283-300, 302-306 collocator,230 constructional, 50 competence, 4, 113, 126, 128, 152 complement, 31, 38, 41, 49, 59, 126, 132,181,183-185,191,200, 223,260,289,306 complementation, 167-169,231,289 complex conjunction, 200 complex preposition, 49, 180,200 compound, 29, 32-37, 39, 41, 48, 219,272,304

320 Subject Mex computational linguistics, 115,283 concordance, 9, 10, 96, 114, 166, 214,217-223,244 constituent, 49, 61, 199, 206, 207, 216,259,271,274,292,300, 301,304 construction, 2, 27-29, 31, 37, 40, 41,45-47,49,62,64,67,70,73, 74,90,126,136,156,172,174, 181,183,184,190,199,200, 202,203,213,218,248,256, 271,272,274,287,288,290,299 construction grammar, 28, 29, 46, 47,49,62,126,200 context, 17, 22, 42, 69, 70, 72, 74, 87,89,90,94,99,101,103,109, 111,112,114,116,123,128, 129,183,203,205,217,221, 222,247,249-251,253,259-262, 264,287,304 contrasts, 73, 128, 294 core, 22, 78, 125, 126, 206, 211, 214,252,253,255-257,259, 260,263 corpus-based, 3, 21, 59, 134, 137, 159, 160, 179, 185, 187, 188 corpus-driven, 123, 132 culture, 7, 160 database, 216, 230, 239, 240, 251, 255,261,304-306 declarative, 13, 263 dependency, 72, 87, 103, 203, 205, 255,278,302 descriptive, 45, 46, 190, 212, 245 determiner, 41, 270, 296, 304 diachronic, 180, 182, 184, 185, 188, 190 dialect, 161 dictionary, 2, 8, 9, 21, 30, 33-35, 39, 40,46,49,50,76,77,82,89, 103,111,130,133,153-156, 200,213,216,243,245,248, 249,253,261,262,264

discourse, 1-3, 5, 6, 13, 20, 24, 88, 95, 104, 127, 153, 183, 189, 190, 198,206,208,211-213,223, 246,264,286 discourse analysis, 1-3, 5, 6, 13, 20, 88, 153 Dutch, 162, 286 elicitation, 160 emergence, 188 emergentism,262 encoding idiom, 30, 34 English, 2, 4, 6-8, 11, 13, 14, 17, 18, 20,21,23,24,30,31,33-41,46, 48-50,60,67-76,78,79,83,84, 91,104,117,137-139,148-150, 152, 154-156, 159-166, 169-171, 173,174,179-191,197-204,207, 208,211-213,216,220-222,224, 230,231,243,250,251,256, 258,261,272-274,278,279, 284-286,298-300,302,306 error, 113, 128, 138,218,237,269, 272,277 extended units of meaning, 9, 40, 45 extraction, 91, 102, 214, 216, 284, 286-288,297-306 figurative, 136 fixedness, 134, 203, 205, 287, 288, 297-299,306 foreign language, 28-30, 42, 43, 60, 76,113,123,130,133,149,150, 152,154,156,230,246 foreign language teaching, 30, 43, 60, 149 formula, 22, 49, 90, 103-106, 108110,130,197,200,204,205, 208,214,285 frame, 40, 138, 214-217, 219, 223 free combination, 33, 39, 45, 229 French, 63, 67-76, 79, 162, 250, 302 frequency, 8-11, 29, 30, 32, 37, 38, 43-45,47,49,66,68,70,75,92,

Subject index 321 94,95,98,100,102,106,110, 115,116,129-131,134,135, 138, 147, 148, 161-166, 169, 184-188,197-200,203,204,206, 213,214,217,219-223,230-232, 234,236,244,245,258,273, 285,290,293-298,302,305 function, 24, 41, 47, 60, 62, 95, 117, 123, 127, 130, 133, 134, 136, 138, 140, 147, 153, 165, 186, 191,201,203,207,208,214, 221,230,232-235,237,238, 240,256,263,264,269,270, 275,276,283,285,298 functor, 251, 253, 255, 259, 263 generalization, 31, 270, 296 generative grammar, 28, 153, 258 German, 33-36, 38, 39, 48, 60, 78, 149-152, 154, 156, 162, 163, 166,170,197,203,207,231, 250,272-274,278,279,284, 286,287,289,291-293,296-302, 306 grammar, 2, 4, 6, 8, 11, 12, 17-19, 23,28,38,41,46,47,123-126, 132, 133, 139, 140, 179, 180, 183,189,190,200,201,212, 213,224,254-256,258,259, 263,280,287,301,306 grammatical function, 303 grammatical relation, 255, 299-301, 306 grammaticatization,189,216 head, 41, 190, 191, 270, 301 headword, 155 idiom, 27, 28, 36, 38-40, 42, 43, 45, 59,62,63,65,67,77,123,127, 133, 136, 149, 152, 156, 159, 160,197,199,208,224,229, 259,283,284,287,288,291

idiom principle, 27, 28, 38-40, 42, 43,45,59-64,66-68,77,123, 147,152,156,159,197,208, 224,229,284,287 idiomatic expression, 47, 49, 137, 283,288 idiomaticity,36,39,40,208,230, 231,236,237,287,297,299,306 idiomatization,39,288,291,292, 297 idiosyncratic, 17, 28, 35, 283 imperative, 203 instrument, 278 intransitive, 243, 253 intuition, 68, 103, 113, 152,223,269 Italian, 149, 151, 154, 162 item-specific, 33, 40, 43, 156 Japanese, 63, 219 language acquisition, 29, 88, 127, 129, 130, 135 Latin, 300 learner, 2, 9, 28, 30, 31, 34, 43, 46, 49,50,59,123,124,126-140, 149-156, 160-164, 166, 168-171, 174,230,237,238,243,245,261 learning, 110, 113, 123, 128-130, 132, 133, 135, 136, 139, 149, 151,153,156,248,284 lemma, 8, 47, 50, 91, 96, 216, 248, 303-305 lexeme, 45, 284, 290, 296 lexical bundle, 136, 137, 148,214 lexical grammar, 212, 224 lexical item, 8, 19, 22, 23, 28, 35, 36, 38,47,59,60,64,76,77,90,92, 93,96,103,107,112,114,116, 130,131,133,137,214,237 lexical unit, 3, 10, 47, 61, 96 legalization, 139 lexicography, 1-2, 7-9, 24, 30, 43, 59,77,153,156,283,291

322 Subject Mex lexicon, 19, 36, 48, 130, 179, 248, 249,257,263,284 lexis, 7, 8, 11, 18, 19, 21, 23, 24, 46, 123-126,130,131,133,140,200 Hght verb, 174 meaning, 1,3, 7, 8, 10-12, 21, 23, 27,29-31,33-39,41-46,48,49, 64,76,87-89,102,103,109, 111-114,116,123,128,130, 133, 134, 137, 138, 140, 147, 150-152,156,165,172,202, 204,211,212,214,216,217, 219,221-224,245,246,248, 250,251,253,257,260-262, 264,270,274,283,285,291 mental lexicon, 29, 36 metaphor, 65, 67, 68, 136, 259, 285 metaphorical, 10, 89 metonymy, 65 modifier, 167, 190, 244, 270, 304 motivation, 28, 127, 135, 137, 139, 269,271 multi-word, 39, 45, 50, 61, 78, 124, 131-134,199,203,214,237 multilingual, 307 n-gram, 198, 214, 216, 217, 219, 223 Natural Language Processing (NLP), 269 negation, 291, 297, 304 network, 87, 88, 90-95, 98-102, 105108,110,112-115,123 New Englishes, 240 non-compositionality, 10, 127, 134, 259,283,297 noun, 22, 31, 37, 43, 47, 60, 62, 69, 73, 128, 155, 165, 167, 168, 187, 190,191,204,217,221,234, 238,251,263,271-273,286, 287,289-291,300,301,303 object, 44, 149, 273, 274, 285-288, 295,299-303,306

objective, 7, 76, 246
obligatory, 59
OED, 35
omission, 96, 160
opacity, 283
opaque, 259, 283
open choice principle, 28, 31, 37, 38, 44, 59, 61, 64-66, 68, 229, 284
optional, 49, 271
paradigmatic, 45
paraphrase, 152
parole, 43, 61
parser, 259, 272, 273, 276, 277, 279, 280, 302, 303
parsing, 254, 261, 269-272, 274, 276, 278-280, 302, 303, 306
participant, 96, 224, 279
particle, 160, 200, 203, 206, 207, 286, 292
passive, 60, 125, 133, 274, 304, 305
patient, 278
pattern, 2, 9, 12, 13, 17, 18, 20-22, 27, 31, 32, 46, 47, 62, 63, 65-67, 69, 87-91, 94, 101-103, 110, 112, 113, 116, 123, 126, 129, 138, 154, 155, 165-167, 189, 201, 207, 211, 212, 215-217, 219, 221-224, 229, 249, 257, 258, 263, 264, 269, 286, 290, 292, 300, 302-304
performance, 147, 274, 277, 279
periphery, 28, 127, 206
phrasal, 8, 36, 123, 127, 137, 139, 160, 199, 214, 273, 274
phrasal verb, 36, 123, 127, 137, 160, 274
phrase, 6, 8, 11, 27, 28, 41, 42, 46, 49, 59, 60, 62, 64, 65, 68, 69, 74, 75, 78, 89, 125, 127, 129, 131-138, 150, 154, 180, 187, 188, 190, 191, 204, 206, 207, 214, 215, 219, 259, 273, 274, 278, 287, 289, 292, 294-296, 300-302, 304, 307
phraseological, 11, 13, 28, 30, 36, 38, 39, 41, 46, 61, 63, 71, 78, 126, 147, 159, 160, 162, 165, 170, 173, 174, 200, 211, 212, 216, 217, 221-223, 283
phraseological unit, 13, 36, 39, 41, 61, 63, 147, 170
phraseologism, 47
phraseology, 3, 9, 11, 13, 28, 39, 45-47, 77, 123, 127, 131, 132, 134, 136, 137, 159, 160, 165, 174, 199, 200, 212, 224, 229, 283, 296
pragmatic, 10, 41, 47, 60, 123, 190, 201, 203, 206, 208, 245, 253, 264, 269, 284
predicate, 219, 244, 270, 273
predicative, 32, 34, 36, 37
prefab, 197, 198, 203, 204, 206
preference, 23, 31, 78, 138, 165, 181, 182, 184, 221, 223, 284, 289-292, 295-299, 305, 306
premodification, 167, 168
preposition, 27, 41, 49, 60, 67, 69, 74, 78, 117, 133, 159, 170, 172, 173, 273, 304, 306
prepositional verb, 159, 170-174, 199
prescriptive, 189
probabeme, 39, 40, 43, 45, 65, 78, 203
processing, 36, 114, 136, 139, 206, 229, 230, 253, 262, 271-274, 279
productivity, 259
pronoun, 40, 42, 184, 200, 287
proplet, 251-258, 260, 263, 264
proposition, 214, 217
prosody, 78
proverb, 28, 127
quantitative, 7, 171, 231, 286, 305
recurrence, 43, 134, 147, 152

regularity, 165
routine formula, 208
rule, 60, 72, 129, 132, 200, 205, 206, 244, 251, 254, 271-274, 276, 277, 279, 300, 302
Russian, 162
salience, 102
schema, 247-249
selection, 21, 23, 112, 124, 128, 134, 135, 138, 258, 259, 283, 290, 296-298
semantic prosody, 22, 78, 123, 222
semantic role, 149, 272
semantics, 12, 19, 33, 40, 87, 216, 249, 252, 257-259, 264
sense, 10, 21, 34-36, 38, 76, 82, 103, 104, 111, 152, 166, 173, 219, 245
significance, 30, 37, 88, 95, 191, 200, 286, 298, 302
speech act, 147
spoken, 2, 5, 7, 14, 19, 113, 123, 127, 161, 162, 165, 179-181, 183, 184, 186-190, 197, 199-201, 204, 208, 220, 271, 272
statistics, 7, 37, 39, 47, 64, 87, 102, 181, 183, 187, 192, 201, 229, 238, 243-245, 263, 275, 278, 284, 285, 298, 299
storage, 32, 33, 38, 42, 45, 136, 139, 150, 251, 255, 260
structural, 42, 123, 124, 130, 181, 189, 269, 271, 286
structure, 6, 10, 23, 62, 132, 183, 184, 188, 190, 192, 202, 211, 220, 245, 246, 248, 251-253, 255-260, 264, 269, 271-273, 276, 279, 286, 295, 301-303
style, 2, 6, 12, 18, 69, 70, 73, 88, 112, 117, 135, 154, 187, 219, 234
subject, 42, 60, 62, 73, 204, 288, 298-300, 302
synchronic, 181
syntagmatic, 87, 127

tagging, 152, 237, 243, 244, 263, 274, 276-278
text, 1-3, 6, 7, 9-13, 17, 18, 20-24, 27, 28, 33, 35, 38, 45, 48, 49, 59, 61-63, 65, 66, 68, 70, 74, 79, 87-93, 96, 97, 100, 101, 103, 104, 109, 111-113, 115, 116, 123, 125, 128, 136, 138, 150, 162, 185, 186, 189, 191, 197, 198, 203, 211-214, 216, 218, 221, 222, 224, 229-232, 234-240, 244, 245, 269, 270, 274, 278, 284, 285, 287, 290, 295-302, 305, 306
theme, 1, 3, 180
token, 91, 94, 100, 104, 190, 215, 249, 250, 264, 269
transitive, 220, 243, 253, 258
translation, 39, 45, 59-61, 67-70, 73-79, 149, 153, 154, 222, 231-237, 263
type, 91, 93, 94, 102, 104, 215, 216, 249, 250
unit of meaning, 1, 3, 8, 11, 13, 36, 45, 211, 214, 223
usage, 31, 32, 147, 154, 170, 181, 184, 188, 189, 238, 240, 290
usage-based, 32
utterance, 2, 40, 46, 147, 150, 152, 189, 206, 221

valency, 28, 31, 32, 38, 41, 46, 49, 65, 149, 151, 156, 199, 252, 253, 255, 288, 291
variability, 166, 169, 173, 174, 215
variant, 8, 73, 165, 169, 170, 184, 186, 190, 191, 215, 253, 293, 294, 306
variation, 42, 49, 73, 149, 181, 215, 216, 296, 297, 306
variety, 4, 137, 159-161, 163, 165, 168-171, 173, 174, 181, 188, 189, 197-201, 203-208, 211
verb, 22, 23, 27, 31, 32, 37, 38, 40-44, 49, 60, 67, 68, 74, 76, 78, 89, 123, 134, 138, 155, 156, 159-161, 165, 170-174, 185, 186, 202, 203, 206, 207, 216, 243, 244, 251-253, 258, 260, 263, 270, 272-275, 278, 284-297, 299-304, 306, 307
word class, 204
word form, 123, 200, 230, 231, 233, 248, 249, 251, 254, 257, 259, 261-264, 269, 273, 303
written, 7, 14, 113, 123, 127, 161-165, 179-181, 183-191, 212, 220-223