New Language Technologies and Linguistic Research : A Two-Way Road [1 ed.] 9781443858632, 9781443853774

This book is a collection of the papers presented and discussed at the 11th Corpus Linguistics Symposium (ELC 2012), hel


English Pages 238 Year 2014


New Language Technologies and Linguistic Research: A Two-Way Road

Edited by

Sandra Maria Aluisio and Stella E. O. Tagnin

New Language Technologies and Linguistic Research: A Two-Way Road
Edited by Sandra Maria Aluisio and Stella E. O. Tagnin

This book first published 2014

Cambridge Scholars Publishing
12 Back Chapman Street, Newcastle upon Tyne, NE6 2XX, UK

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Copyright © 2014 by Sandra Maria Aluisio, Stella E. O. Tagnin and contributors

All rights for this book reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owner.

ISBN (10): 1-4438-5377-1, ISBN (13): 978-1-4438-5377-4

TABLE OF CONTENTS

Introduction

Part I: Corpus Linguistics and Language Description

Chapter One
Stance Bundles in Learner Corpora
Deise Prina Dutra, Barbara Malveira Orfano and Tony Berber Sardinha

Part II: Translation, Terminology and Corpora

Chapter Two
A Bilingual Glossary of Collocations Typical of the Hotel Industry: A Model in Light of Corpus Linguistics
Sandra Navarro

Part III: Spoken Language and Corpora

Chapter Three
Dialogic Units in Spoken Brazilian and Italian: A Corpus-Based Approach
Maryualê Malvessi Mittmann and Tommaso Raso

Chapter Four
Segmentation Tags: A Proposal for the Analysis of Subtitles
Élida Gama Chaves and Vera Lúcia Santiago Araújo

Part IV: Natural Language Processing and Corpora

Chapter Five
Automatic Extraction of Subcategorization Frames from Portuguese Corpora
Leonardo Zilio, Adriano Zanette and Carolina Scarton

Part V: Corpus Annotation

Chapter Six
Espanhol-Acadêmico-Br: A Corpus of Academic Portuguese Learners Produced by Native Speakers of Spanish
Lianet Sepúlveda Torres, Roana Rodrigues and Sandra Maria Aluísio

Chapter Seven
The Challenges of the Annotation of a Soccer Language Corpus with Semantic Frames
Rove Chishman, Anderson Bertoldi, João Gabriel Padilha and Diego Spader de Souza

Chapter Eight
Sparkling Vampire … LOL! Annotating Opinions in a Book Review Corpus
Cláudia Freitas, Eduardo Motta, Ruy Luiz Milidiú and Juliana César

Part VI: Corpora and Multiple Documents

Chapter Nine
Manual Alignment of News Texts and their Multi-Document Human Summaries
Verônica Agostini, Renata Tironi de Camargo, Ariani Di Felippo and Thiago Alexandre Salgueiro Pardo

Chapter Ten
Corpus Annotation of Textual Aspects in Multi-Document Summaries
Ariani Di Felippo, Lucia H. M. Rino, Thiago A. S. Pardo, Paula C. F. Cardoso, Eloize R. M. Seno, Pedro P. Balage Filho, Amanda P. Rassi, Márcio S. Dias, Maria Lúcia R. Castro Jorge, Erick G. Maziero, Andressa C. I. Zacarias, Jackson W. C. Souza, Renata T. Camargo and Verônica Agostini

Part VII: Plenary Papers

Chapter Eleven
Podemos Contar com as Contas?
Diana Santos

Chapter Twelve
The Dialogue between Man and Machine: The Role of Language Theory and Technology
Sara Candeias and Arlindo Veiga

List of Authors

INTRODUCTION*

This book features 12 papers on six topics addressed at the 11th Corpus Linguistics Symposium (ELC 2012), held at the Instituto de Ciências Matemáticas e de Computação (Institute of Mathematics and Computer Science) of the University of São Paulo, in São Carlos, state of São Paulo, Brazil. The topics are: Corpus Linguistics and Language Description; Translation, Terminology and Corpora; Spoken Language and Corpora; Natural Language Processing and Corpora; Corpus Annotation; and Corpora and Multiple Documents. The overall theme of the conference was Technological convergence for language processing and analysis: new technologies for linguistic research and linguistic research for new technologies.

Natural Language Processing (NLP), also known as Computational Linguistics, and Corpus Linguistics (CL) have developed significantly in recent decades, particularly in Europe and the United States, and mainly for the English language. In Brazil, despite considerable advances in the last decade, these areas are not yet widespread and are restricted to just a few universities, often as subdivisions of broader areas such as Computing or Linguistics. With the proposed convergence we hope to call attention to the importance of activities that arise under the broader scope of NLP with CL. For this reason, ELC 2012 intended to bring together researchers in Linguistics, Computer Science, Historical Linguistics, Applied Linguistics, Cognitive Linguistics, Information Sciences, and other Corpus Linguistics-related fields to submit studies that have been completed or are in progress in that multidisciplinary area.

ELC 2012 took place right after the 6th Brazilian School of Computational Linguistics, which opened with a plenary talk on speech synthesis and recognition by Sara Candeias.
Two more plenary sessions were held at the 11th ELC, given by Diana Santos and Nancy Ide, along with a roundtable on Speech recognition and synthesis for Portuguese and interactions between linguists and engineers: experiences and opportunities. Two papers, described below, give an overview of the topics dealt with in these sessions.

* We would like to thank Danilo Murakami, from the Department of Modern Languages at the University of São Paulo, for helping us put this book together. He revised every single article to make sure they all conformed to the formatting guidelines.

The first paper in this collection, on the topic Corpus Linguistics and Language Description, is by Deise Prina Dutra, Barbara Malveira Orfano and Tony Berber Sardinha. The authors present the analysis of four-word bundles extracted from native and non-native corpora of argumentative essays. In addition, the study highlights the differences among the corpora as far as stance expressions are concerned and seeks to determine whether these differences are mainly structural or related to frequency within a specific function. As a final aim, the paper discusses the use of stance expression bundles in academic writing. Such bundles are important as they carry the opinion or judgment of the writer or speaker.

The second paper is by Sandra Navarro and addresses the topic Translation, Terminology and Corpora. The author describes the process of building a bilingual English-Portuguese corpus for the hotel industry and extracting equivalents in both languages. She discusses four collocations with the word “room” and shows how collocational differences affect the translation of the terminology. These differences are highlighted in the entries presented to constitute the glossary.

The next two papers are by Maryualê Malvessi Mittmann, Tommaso Raso, Adriellen Arruda and by Élida Gama Chaves and Vera Lúcia Santiago Araújo. Both papers address Spoken Language and Corpora. The first paper presents a cross-linguistic study on the usage of dialogic units in spoken Brazilian Portuguese and Italian. It aims to show the distribution of information units in both languages and to discuss some interesting aspects regarding the usage of dialogic units in Brazilian Portuguese and Italian. The second paper puts forward a proposal of customized tags to investigate the segmentation in intralingual Brazilian Portuguese subtitles for the deaf and the hard-of-hearing.
The fifth paper is by Leonardo Zilio, Adriano Zanette and Carolina Scarton on the topic Natural Language Processing and Corpora. The authors present a system to extract subcategorization frames (SCFs) from corpora written in Portuguese and describe how the system is used in a verb classification task and in the labeling of semantic roles.

The following three papers deal with Corpus Annotation. The first one, by Lianet Sepúlveda Torres, Roana Rodrigues and Sandra Maria Aluísio, presents the compilation of a corpus composed of texts written in Portuguese by Spanish speakers enrolled in Brazilian graduate programs. The purpose is to formalize a typology of the main errors committed by these speakers when writing theses and dissertations in Portuguese. The second paper, by Rove Chishman, Anderson Bertoldi, João Gabriel Padilha, and Diego Spader de Souza, discusses the challenges encountered in annotating a Brazilian Portuguese corpus of soccer language with semantic frames. The third paper in this group, by Cláudia Freitas, Eduardo Motta, Ruy Luiz Milidiú and Juliana César, describes the construction of ReLi – a corpus of book reviews manually annotated with respect to the expression of opinion. Specifically, it reports on the decisions made during the annotation process based on the results of an inter-annotator agreement study, and briefly explores the corpus created.

The following papers address Corpora and Multiple Documents. The first one, by Renata T. Camargo, Verônica Agostini, Ariani Di Felippo, and Thiago A. S. Pardo, discusses Multi-Document Summarization (MDS), which involves the production of a single summary from a group of texts that deal with the same subject. The paper mainly addresses the alignment of source texts to their human (manually produced) multi-document summaries. This alignment allows a linguistic analysis of human summarization strategies, which may support the creation of rules and models for more linguistically motivated MDS methods. The second one, by Di Felippo et al., aims to contribute to the linguistic characterization of summaries. For that purpose, the group developed a corpus-based analysis of aspects conveyed by human multi-document summaries. The corpus used was the CSTNews corpus, with texts from various online Brazilian news agencies. Each human summary of the corpus was manually annotated with 17 aspects resulting from the refinement of a previous set of aspects.

And last, but by no means least, there are two papers by our invited speakers Diana Santos and Sara Candeias. Santos presents a personal view of quantitative, and especially statistical, approaches to language study, filling a pedagogical gap in this area. In addition, she introduces a new grammar of Portuguese, which is being developed at Linguateca, based on the AC/DC (Access to corpora/Availability of corpora) project. Candeias discusses the role of language theory for designing and developing applications in speech technology, providing an overview of the typical architecture of speech technology systems (recognizers and synthesizers). The author also introduces the core application areas in speech technology, which are raising new challenges for linguistic research.

We hope this collection will show that New Language Technologies and Linguistic Research really constitute a two-way road.

Sandra M. Aluísio and Stella E. O. Tagnin
Organizers

PART I CORPUS LINGUISTICS AND LANGUAGE DESCRIPTION

CHAPTER ONE
STANCE BUNDLES IN LEARNER CORPORA
DEISE PRINA DUTRA†, BARBARA MALVEIRA ORFANO‡ AND TONY BERBER SARDINHA§

† UFMG – Federal University of Minas Gerais.
‡ UFSJ – Federal University of São João Del Rey.
§ PUC-SP – Catholic University at São Paulo.

1. Introduction

This article is part of a broader research project, which aims at analyzing in detail the many functions and uses of lexical bundles (Biber et al. 1999 et seq.) from quantitative and qualitative views. “Lexical bundles are combinations of three or more words which are identified empirically in a corpus of natural language” (Cortes 2006, p. 392). The identification of words that ‘go together’, from a corpus linguistic perspective, has been addressed under different terms, such as fixed collocations, extended collocations, cue phrases, clusters, and n-grams, to name a few. As Cortes (2006) notes, lexical bundles might be related to proficient and fluent language production (e.g. Hyland 2008a,b). As our understanding of lexical bundles increases, so does the need for more research on how they operate in different discourse contexts, such as in second language (L2) writing.

In this article, we look at stance lexical bundles in learner corpora, and contrast their frequency and functional associations with native speaker writing. The article is organized as follows. First, we review some previous studies (Section 1.1). Secondly, we introduce the corpora and methods used. We then present the findings and draw the paper to a close with considerations on the role of lexical bundles in shaping L2 writing.

1.1. Lexical bundles in corpus research

Lexical bundles are defined as “simply sequences of word forms that commonly go together in natural discourse” (Biber et al. 1999, p. 990). They are generally not structurally complete,1 and perform different discourse functions, the main ones being referential, stance and discourse organizing (Biber et al. 2004). Above all, they act as building blocks of discourse (Biber et al. 2004), and, therefore, “serve the most important communicative needs of a register” (Biber 2009, p. 285).

Hyland (2008a) explores the structure and function of 4-word bundles in a corpus of academic discourse. The data for his study consisted of three corpora (research articles, doctoral dissertations and master’s theses) comprising 3.5 million words. The results show significant differences in frequencies across registers: for instance, there are fewer lexical bundles in doctoral dissertations than in articles. Surprisingly, there are more bundles in master’s theses. This might be explained in a number of ways: novice writers might rely on formulaic expressions more than experienced ones because of a restricted vocabulary, or they might incorporate more fixed expressions with the intent of being more readily accepted in the community.

In a similar direction and adopting a frequency-driven approach, Chen and Baker (2010) identify the most frequent lexical bundles in three written corpora: 1) a sub-corpus from FLOB (academic prose section), 2) BAWE-CH (Chinese students of English) and 3) BAWE-EN (English students). Their comparative study shows both differences and similarities between native and learner academic writing. The use of lexical bundles in non-native and native student essays, for example, is very similar from a structural point of view.
They both have more VP-based bundles and discourse organizers than native expert writing, whereas native professional writers exhibit a wider range of NP-based bundles and referential markers.

Following a pragmatic-functional approach, Simpson-Vlach and Ellis (2010) also looked at the most common lexical bundles in academic discourse in both oral and written corpora. Their data comprised the Michigan Corpus of Academic Spoken English (MICASE), the oral academic part of the British National Corpus (BNC), the Hyland 2004 corpus and the written BNC files of various academic subjects. They extracted 3- and 4-word n-grams and had ESP instructors judge whether those were chunks, meaning in that study that the n-gram was associated with a particular function and was worth teaching. As a result, they proposed the Academic Formulas List (AFL) with 435 lexical bundles distributed in 18 subcategories (see Table 1 with its subcategories and examples). Since it attempts to cover the most frequent lexical bundles in academic written discourse, the Simpson-Vlach and Ellis (2010) list has contributed significantly to our research, serving as a coding tool and as a basis for the analysis.

1 This characteristic of lexical bundles has recently been challenged in Cortes (2013: 38), as she found bundles that are “complete structures, complete clauses, and sometimes even sentences. That is why the following new category was created: d. Lexical bundles that include noun phrases and verb phrases (fragments or whole phrases or clauses) in bundles such as the rest of the paper is organized as follows, and the objective of this study was to evaluate.”

Table 1: Functional classifications of lexical bundles based on Simpson-Vlach and Ellis (2010)

Group A – Referential expressions              Examples
1. specification of attributes
   1.a intangible framing attributes           form of the; in terms of a
   1.b tangible framing attributes             a list of; both of these
   1.c quantity specification                  a high degree; a large number of
2. identification and focus                    [an/the] example of (a); does not have; there has been
3. contrast and comparison                     be related to; (on) the other (hand) (the)
4. deictics and locatives                      at this point; at the time of
5. vagueness markers                           and so on

Group B – Stance expressions                   Examples
1. hedges                                      are likely to; it appears that
2. epistemic stance                            we can see; assumed to be
3. obligation and directive                    (it should) be noted; take into account (the)
4. expressions of ability and possibility      allows us to; can also be, most likely to
5. evaluation                                  important role in; it is necessary (to)
6. intention/volition, prediction              to do so; we do not

Group C – Discourse organizing expressions     Examples
1. metadiscourse and textual reference         as shown in; in the present study
2. topic introduction and focus                for example (if/in/the); what are the
3. topic elaboration
   3.a non-causal                              are as follows; see for example
   3.b cause and effect                        as a consequence; for this reason
4. discourse markers                           in other words; even though the

In Dutra & Berber Sardinha (2013: 121) we presented the general counts of the three broad categories (referential, stance and discourse organizing expressions) across three corpora (ICLE – the International Corpus of Learner English, Br-ICLE – the Brazilian subcorpus of ICLE, and LOCNESS – Louvain Corpus of Native English Essays), which are reproduced below (Figure 1), together with a summary of the statistical analysis (Chi-square) (Table 2). These results were obtained after applying the frequency cut-off point of 20 times per million words (pmw), which left a total of 676 four-word bundle types in the data across the three corpora. The category with the highest count of bundles for all three corpora is referential expressions, followed by stance expressions and discourse organizers (Figure 1). There are statistically significant differences across the corpora (Chi-square 17.126, df=4, p=0.002).

Similar results have been found in Biber et al. (2004), but in Chen & Baker (2010) the discourse organizing bundles were the second highest group of frequent bundles. This difference may be due to the fact that Biber et al. (2004) used a native speaker corpus, the T2K-SWAL, containing over 2 million words from oral and written registers (classroom teaching and textbooks). On the other hand, Chen & Baker (2010)’s corpus included a mixture of expert native speaker, native speaker student, and non-native learner data. Another reason that might explain this difference in stance expression frequency is the fact that argumentative essays may require more expressions of opinion (stance bundles), while expert academic texts, such as the ones in the academic prose section of the Freiburg-Lancaster-Oslo/Bergen (FLOB) corpus, might rely less on stance bundles and more on referential ones.
Other differences have been found in Hyland (2008b), in which stance bundles were the least frequent across the four disciplinary corpora investigated (Biology, Electrical Engineering, Applied Linguistics and Business Studies).2 These studies suggest that the relative frequency of functional associations is sensitive to the genre of texts included in the corpora, and consequently the rank of functional categories in a particular study cannot be predicted ahead of time.

2 This paper, which investigates theses, dissertations and research articles, uses a different terminology from the one adopted in our paper, and stance bundles are called participant-oriented bundles.

[Figure 1: bar chart of bundle counts (vertical axis from 0 to 200) for Br-ICLE, ICLE and LOCNESS in three categories: A = referential expressions, B = stance expressions, C = discourse organizing expressions.]

Fig. 1. Lexical bundle category count by corpus.

Table 2: Chi-square test – Category count by corpus

                      Value     Df    Asymp. Sig. (2-sided)
Pearson Chi-Square    17.126    4     0.002
Likelihood Ratio      17.508    4     0.002
N of Valid Cases      676
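As a consistency check on the significance values reported in Table 2, the sketch below recomputes the upper-tail probability of the chi-square distribution from the reported statistics. It is an illustration only, not the procedure used in the study: the function name is hypothetical, and the closed form it uses holds only for even degrees of freedom (here df = 4).

```python
import math


def chi2_upper_tail_even_df(statistic, df):
    """Upper-tail probability P(X >= statistic) for a chi-square variable.

    Hypothetical helper. Uses the closed form valid only for even df:
    exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = statistic / 2.0
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms


# Statistics taken from Table 2 (df = 4 in both cases).
pearson_p = chi2_upper_tail_even_df(17.126, 4)
likelihood_ratio_p = chi2_upper_tail_even_df(17.508, 4)

print(round(pearson_p, 3))           # 0.002
print(round(likelihood_ratio_p, 3))  # 0.002
```

Both values round to the 0.002 shown in the table, so the reported statistics and significance levels are mutually consistent.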

1.2. Research goals

The main goal of this paper is to show the relevance of analyzing and contrasting types of stance bundles produced by native and non-native speakers in argumentative essays. It is believed that the discussion encompassing the frequency of lexical bundle types that appear under the category “stance expressions” can bring interesting insights into what lies behind quantitative and qualitative corpus studies. In addition, as a secondary goal, we investigate whether the similarities or differences in the frequency of stance bundles are mainly structural or functional. The final goal of this study is to discuss the role of stance expression bundles in L2 academic writing.


2. Materials and Methods

This study comprises three corpora of argumentative essays: a) LOCNESS (Louvain Corpus of Native English Essays), which contains 324,006 words written by American and British university students; b) ICLE, the International Corpus of Learner English, containing 3.7 million words – this corpus is composed of 16 subcorpora of written essays by international students from many different countries (Japan, China, Italy, Finland, etc.; for a full description, see Granger et al. (2009)); and c) Br-ICLE, which has approximately 159,000 words of academic argumentative essays written by Brazilian university students, and which does not yet form part of the consolidated ICLE distribution.

Having described the data used in this research, we proceed to the methodological framework. First, sequences of four words were extracted from each corpus with scripts specially developed for our research project. Only bundles that occurred in five different essays were included in our list. These sequences were categorized both manually and semiautomatically according to the AFL framework, which includes its three major categories, namely referential expressions, stance expressions and discourse organizing functions, and also according to its 18 different subcategories. Secondly, the sequences were counted, their frequencies normed, and those sequences that occurred at least 20 times pmw were selected for further inspection. Finally, frequency comparisons across corpora and categories were carried out to determine which categories were predominant in the data.

As part of our broader investigation we have extracted lexical bundles of 3, 4 and 5 words. In this paper we focus on the analysis of 4-word bundles, as a follow-up on our previous work (Dutra & Berber Sardinha 2013). In this research, we use a conservative cut-off point and focus on bundles that occurred at least 20 times pmw (see Cortes 2008).
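The extraction and filtering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scripts developed for the project: the function names are hypothetical, tokenisation is a naive lowercase whitespace split, and the dispersion criterion is read here as "at least five different essays".

```python
from collections import Counter, defaultdict


def four_word_bundles(essays, corpus_size, min_texts=5, min_pmw=20.0):
    """Return {bundle: frequency per million words} for 4-word sequences.

    Hypothetical sketch: a sequence is kept only if it occurs in at least
    `min_texts` essays and at least `min_pmw` times per million words.
    """
    freq = Counter()
    essays_containing = defaultdict(set)
    for essay_id, text in enumerate(essays):
        tokens = text.lower().split()  # naive tokenisation (assumption)
        for i in range(len(tokens) - 3):
            bundle = " ".join(tokens[i:i + 4])
            freq[bundle] += 1
            essays_containing[bundle].add(essay_id)

    kept = {}
    for bundle, count in freq.items():
        pmw = count / corpus_size * 1_000_000  # normed frequency
        if len(essays_containing[bundle]) >= min_texts and pmw >= min_pmw:
            kept[bundle] = pmw
    return kept


def raw_threshold(corpus_size, min_pmw=20.0):
    """Raw occurrence count that corresponds to `min_pmw` in a given corpus."""
    return min_pmw * corpus_size / 1_000_000
```

Norming matters because the three corpora differ greatly in size: under this reading, 20 pmw corresponds to about 3 raw occurrences in Br-ICLE (approximately 159,000 words), about 6.5 in LOCNESS (324,006 words) and 74 in ICLE (3.7 million words).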

3. Results

In a previous paper (Dutra & Berber Sardinha 2013) we argued that referential expression bundles have a crucial role in how learners present their argumentation. In this article, we turn to stance expression bundles since it is also relevant to investigate how judgments and opinions are conveyed in the three corpora, contributing to a better understanding of how learners express themselves in the academic register. Sample 1 shows an example of the presence of stance expressions in Br-ICLE (all samples are shown exactly as they were produced by the students, without proofreading).

Sample 1
Unfortunatelly, we live in a society where a lot of people still think that the certificate is more important than all others things in the life. I don’t think that the study is not important but I think that it is not vital. Today, there are a lot of Universities graduating people, but there are not enough work to every these people. The graduation will not help people to get a work. There are a lot of people that know how to embroider, to sew, people that work in the field planting our food. There are millions of people that know to do many kinds of works that are very important to our life even not being in the universities’ curriculum.

In Sample 1, the learner uses I + think, a very common construction employed by users of English to express opinion.3 We shall return to this construction in more detail later in this section. After running Fisher’s Exact test on the count of different bundles for the stance expression subcategories (hedges, epistemic stance, obligation/directives, ability/possibility, evaluation, intention/prediction/volition), no statistically significant difference was found (Data from Table 2 and p = 0.09365). Yet, it seemed to us that it was worth investigating closely the choices made by the users (native and non-native speakers), as the bundles classified in the stance subcategories were not always the same in structure, which affects the meaning they convey. Table 3 shows a breakdown of the frequencies by stance type within each corpus.

Table 3: Stance bundle type normed frequency4

                                          LOCNESS    ICLE    Br-ICLE
Hedges                                    9.26       0.26    6.28
Epistemic stance                          33.95      3.98    75.38
Obligation and directives                 27.77      2.38    87.94
Expressions of ability and possibility    18.52      2.65    62.82
Evaluation                                37.04      5.57    75.38
Intention, volition and prediction        6.17       0.79    31.41

3 Simpson-Vlach and Ellis (2010) included the bundle I think this is in their Academic Formulas List as it appeared as a frequent epistemic stance bundle in primarily academic spoken discourse.
4 The frequency counts refer to bundle types in each subcategory.
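The figures in Table 3 can be read as counts of bundle types normed per million words of each corpus. The sketch below illustrates that reading for one cell only, and it rests on assumptions: the function name is hypothetical, the corpus size is the LOCNESS word count given in the Materials and Methods section, and the count of three hedging types is taken from the three LOCNESS hedging bundles discussed in this chapter.

```python
def normed_per_million(count, corpus_size):
    """Norm a raw count to a rate per million words (hypothetical helper)."""
    return count / corpus_size * 1_000_000


LOCNESS_WORDS = 324_006     # corpus size reported in Materials and Methods
LOCNESS_HEDGE_TYPES = 3     # hedging bundle types discussed for LOCNESS

print(round(normed_per_million(LOCNESS_HEDGE_TYPES, LOCNESS_WORDS), 2))  # 9.26
```

The result matches the LOCNESS value for hedges in Table 3, which supports this interpretation of the norming, at least for that corpus.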

Despite the lack of a statistically significant difference across corpora in terms of the stance bundle types, there may be relevant differences if these bundles are qualitatively analyzed. Due to space restrictions we will focus our discussion on three stance subcategories, namely hedge, epistemic stance, and obligation & directives. The following subsections present, first, a description of the stance bundle structure in the data. Secondly, we discuss register adequacy in the learner corpora (mainly in Br-ICLE) from the point of view of argumentation structure. Finally, we argue that analyses that also consider bundles from a qualitative standpoint have advantages over studies that are limited to a purely quantitative approach to bundle use.

3.1. Stance Bundle Structure

We analyzed all instances of the stance lexical bundles in all three corpora and established a cut-off point of at least 20 times pmw for a bundle to be on our shortlist for further consideration. We categorized these shortlisted bundles structurally as shown in Table 4. These patterns are neither complete structural units nor idiomatic expressions (Biber et al., 1999). In five out of the six bundle types there is a verb, mainly a VP with a modal (e.g. would have to be). Two of these verbal patterns include a pronoun: anticipatory it + VP/AdjP (it should not be) and (NP) + VP + (that- or to-clause) (e.g. think that it is, we need to be). Prepositional phrases and noun phrases (e.g. to a certain extent, my point of view) also appear, though less frequently.

Table 4: Bundle structure

Bundle structure                       Example
(Preposition) + NP                     to a certain extent
Passive                                can be seen to
(NP) + VP + (that- or to-clause)       think that it is, we need to be
VP (Modal + V)                         would have to be
Copula be + NP or AdjP                 is a kind of
Anticipatory it + VP/AdjP              it should not be

3.2. Pragmatic functional subcategories

To fully understand the choice of bundles to express stance meanings, we turn now to how the structures presented in Table 4 are distributed among the following subcategories: hedge, epistemic stance, and obligation & directives, across the three corpora (Table 5).

Table 5: Bundle examples (subcategories by corpus)

Hedge

LOCNESS to a certain extent could be used to can be seen to is shown to be I think that the I feel that the can be seen as is seen to be

ICLE is a kind of

I think it is I do not think I think that the my point of view seems to be a would have to be do not want to Obligation and directives it should not be they do not should be able have to to think that it is should not be do not have to allowed should be should be able to allowed to Epistemic stance

Br-ICLE is a kind of

it has been argued that some people think that think that it is my point of view what they want to you do not have we need to be do not need to

The results in Table 5 show that learners (non-native speakers, NNSs) typically choose the bundle is a kind of as a hedging device, which indicates that their repertoire for this function might be very limited compared to the range of bundles put into play by the native speaker students (LOCNESS). This bundle is formed with one of the most common stance adverbials in English (Biber et al., 1999), a kind of. Biber et al. (1999) note that this stance adverbial is more frequent in conversation, being a marker of imprecision mainly preferred in American English.5 The overuse of the lexical bundle is a kind of in both ICLE and Br-ICLE might signal some form of ‘mode crossover’ in learner discourse, that is, the use in writing of bundles most commonly associated with speech. Sample 2 illustrates this use in a text written by a Brazilian student.

5 “… BrE conversation shows a preference for sort of.” (Biber et al., 1999: 867)

Sample 2
Beliefs like these are part of a great range of concepts which comes in and goes out throughout minds that constitute our society. Sexism is a kind of prejudice which is neither too weak to be ignored nor too strong to complaint about it. Prejudice does not favor women or men, both just have to deal with the limitations it puts in their lives.

While modal verbs such as can and could are chosen by native speakers to convey a hedging function, non-native speakers’ overuse of the bundle is a kind of is an example of learners’ restricted linguistic hedging choices. The LOCNESS results indicate that native speakers use three different hedging bundles with high frequency (to a certain extent, could be used to, can be seen to) (Table 5). This range of bundles gives native speakers the possibility of varying the way they present an opinion or judgment. Sample 3, from LOCNESS, shows how native speakers (NSs) hedge their arguments with a modal (could) while making a claim about a particular topic.

Sample 3
This treatment is costly. On the NHS, treatment costs could be used to fund more life-saving treatments rather than to bring more children into an overpopulated world. There are many children who are unwanted and who need adopting.

In LOCNESS, another hedging lexical bundle is to a certain extent, which combines an adjective expressing likelihood (certain) with a stance noun (extent) (see Sample 4). A lexical bundle similar in form and function (to some extent) has been reported by Simpson-Vlach and Ellis (2010) as being frequent in both written and oral academic discourse, and is therefore included in the Academic Formulas List (AFL).

Sample 4
With health experts, however, our marvellous machine provokes problems to our health which must not be ignored. Apart from being a necessity to some, it makes the human being lazy to a certain extent and thus, allows us to use our brains even less and less. It is true to say that a computer is able to perform many hundreds of useful tasks which our brains are completely indifferent to.

While hedging lexical bundles are more frequent in LOCNESS, expressions that convey an obligation or an intention of being more assertive are more frequent in Br-ICLE and ICLE (Figure 2 and Tables 3 and 5). Assertiveness can be seen as the opposite of hedging; in the NNS corpora, assertive bundles such as they do not have to and we do not need are used more often than the hedge is a kind of, itself a frequent choice in those corpora. The difference in the use of hedges in the three corpora suggests that native speakers might make use of more ‘inexplicit language’, here characterized by hedges, which, in turn, lends more modalization to their assertions. Even when NSs choose to be more assertive, they do so by using bundles such as would have to be, should not be, should be able to and should not be allowed, all of which incorporate a modal (should, would) that is less assertive than the periphrastic modals (have to, need to) preferred by the NNSs. In addition, the passive construction is common in LOCNESS as a means of adding impersonality to the obligation and directive bundles. Conversely, NNSs seem to lack this ability and express their arguments in different ways. The results in Table 3 show that the Br-ICLE students use obligation markers more than twice as often as NSs. This overuse of directives by learners shapes the way their writing is interpreted by readers in general. The apparent excess of obligation might be seen as odd in written academic registers, since those are usually associated with a less assertive style.

3.3. Going beyond frequency

The discussion presented in the previous section shows how a corpus can reveal the different ways in which particular groups present arguments: NSs seem to have more types of hedges at their disposal, and they also use them more frequently than learners. The stance lexical bundles in LOCNESS are more polite and impersonal (e.g. with anticipatory it + passive). On the other hand, the Brazilian students display an overuse of obligation and directive bundles. The native speakers use bundles in similar functions, yet the proportion is half of that of the Brazilians, and the structure types are again more impersonal (passive + anticipatory it).

Personal pronouns (I, we) appear in lexical bundles in all three corpora analysed. Both personal and possessive pronouns (I, my) are characteristic features of spoken discourse, revealing personal engagement of the participants.6 Personal pronouns such as I and you are usually among the top ten words in most spoken corpora. Combined with the choice of the periphrastic modal verb have to, this points to the inclusion of oral features in literate discourse, especially in the writing of Brazilian students. Spoken communication implies a more dyadic interaction between participants, which necessitates frequent use of first and second person pronouns. It seems that learners attempt to replicate this engagement in writing by making use of characteristics of oral communication, such as the bundle I think that it is. The choice of this bundle to express epistemic stance instead of bundles that include the passive voice (e.g. can be seen as) might suggest that learners do not possess a wide range of forms to perform this function, or that hedging is not common in academic writing in their mother tongue, among other reasons.

4. Final remarks

The analysis of the types of stance bundles across the three corpora shows that the Brazilian student corpus presents a less diverse use of these bundles than ICLE or LOCNESS, especially with hedge bundles. The qualitative analysis reveals that Br-ICLE and ICLE bundles are more personal and are formed less frequently with anticipatory it and passive structures. There seems to be an overuse of directive and obligation bundles in the Br-ICLE corpus, which may mark the texts as peremptory. We would like to see more qualitative analyses in lexical bundle research to complement the more mainstream quantitative analyses. If we had considered only the statistically significant differences across the corpora, we would probably have missed both the influence of oral discourse on written stance bundles and the choice of less complex yet more personal bundles in the learner corpora.

6 Although we claim that oral features might influence the Brazilian learners’ written discourse, there may be other explanations for their choices. For instance, there is a recent tendency for writers of scientific articles to stand up for their results and, therefore, to use personal pronouns (e.g. we) (Dayrell and Candido Jr, 2013).


References

Biber, D. 2009, A corpus-driven approach to formulaic language in English: Multi-word patterns in speech and writing. International Journal of Corpus Linguistics 14(3), 275-311.
Biber, D., Conrad, S. & Cortes, V. 2004, If you look at... Lexical bundles in university teaching and textbooks. Applied Linguistics 25(3), 371-405.
Biber, D., Johansson, S., Leech, G., Conrad, S. & Finegan, E. 1999, Longman Grammar of Spoken and Written English. Essex: Longman.
Chen, Y. & Baker, P. 2010, Lexical bundles in L1 and L2 academic writing. Language Learning & Technology, v. 14, n. 2, 30-49.
Cortes, V. 2004, Lexical bundles in published and student disciplinary writing: Examples from history and biology. English for Specific Purposes, v. 23, 397-423.
Cortes, V. 2006, Teaching lexical bundles in the disciplines: An example from a writing intensive history class. Linguistics and Education, v. 17, 391-406.
Cortes, V. 2008, A comparative analysis of lexical bundles in academic history writing in English and Spanish. Corpora 3(1), 43-57.
Cortes, V. 2013, The purpose of this study is to: Connecting lexical bundles and moves in research article introductions. Journal of English for Academic Purposes, v. 12, 33-43.
Dayrell, C. & Candido Jr., A. 2013, Textual patterns and rhetorical moves in English scientific abstracts: comparing student and published writings. Paper presented at Learner Corpus Research 2013. Bergen, Norway.
Dutra, D. P. & Berber Sardinha, T. 2013, Referential expressions in English learner argumentative writing. In S. Granger, G. Gilquin & F. Meunier (eds) Twenty Years of Learner Corpus Research: Looking back, Moving ahead. Corpora and Language in Use – Proceedings 1, Louvain-la-Neuve: Presses universitaires de Louvain, 117-127.
Granger, S. et al. 2009, International Corpus of Learner English: Version 2. Louvain-la-Neuve: UCL Presses Universitaires de Louvain.
Hyland, K. 2008a, Academic clusters: text patterning in published and postgraduate writing. International Journal of Applied Linguistics, v. 18, 41-62.
Hyland, K. 2008b, As can be seen: lexical bundles and disciplinary variation. English for Specific Purposes, v. 27, 4-21.
O’Keeffe, A., McCarthy, M. & Carter, R. 2007, From Corpus to Classroom: Language Use and Language Teaching. Cambridge: Cambridge University Press.
Simpson-Vlach, R. & Ellis, N. 2010, An academic formulas list: New methods in phraseology research. Applied Linguistics, v. 31, n. 4, 487-51.

PART II TRANSLATION, TERMINOLOGY AND CORPORA

CHAPTER TWO
A BILINGUAL GLOSSARY OF COLLOCATIONS TYPICAL OF THE HOTEL INDUSTRY: A MODEL IN LIGHT OF CORPUS LINGUISTICS
SANDRA NAVARRO*

* University of São Paulo.

1. Introduction

Tourism is a major economic and cultural sector around the world. Fueled largely by international travel, this field has a great impact on several other sectors, especially the hotel industry. Whereas in the early days travelers stopped at roadside homes seeking shelter, today the purpose of many trips is to stay in luxury hotels, which bring together all sorts of entertainment. The hotel industry has thus evolved considerably over the years and has become increasingly multicultural. To mediate the relations of individuals of different nationalities and cultures, communication, usually in English, plays a crucial role and translation is a constant need.

One of the major sources of demand for translation in this area is the Internet. In order to cater to a larger audience, many hotel websites, booking websites and travel guides, among others, choose to translate their content into different languages. At this point, the translator working in this specialized field faces numerous challenges. One of them concerns the close relationship the hotel industry holds with cultural aspects of each country and region. In the words of experts, “Tourism, besides being an economic activity, is culture in its essence. Hotels lie within this context” (Gregson, 2009: IX). Indeed, hotel descriptions encompass topics as varied as cuisine, architecture, décor, sports, entertainment, different types of establishments and services and even historical and geographic features. Therefore, translating hotel texts is an attempt to bring closer together two distinct realities, two culturally different worlds.

Despite this difficulty and the great demand for translations, the area lacks terminological materials, especially bilingual English-Portuguese publications. I observed this over two years of personal experience as a translator of hotel websites. During that period, my team and I did not make use of any bilingual dictionary simply because the publications available did not meet our needs and were completely outdated. In fact, none of the publications available is targeted exclusively at translators, and consequently none helps these professionals in their fundamental task: to produce a technically accurate and natural translation that can be read as an original text. Hence, in order to meet the needs of a text producer, reference materials should go beyond a list of terms and their equivalents. For the professional translator, it is important to know how terms and surrounding words function in the target language. Therefore, translators would benefit from examples of usage, suggestions of equivalents and translation solutions, information about the typical ways of expression in a given context, the frequency of equivalents and cultural information, to name a few options.

This scenario gave rise to my desire to narrow the gap between the high demand for translations in this field and the shortage of specialized reference materials to support the challenging task of translators. To this end, I conducted master’s research (Navarro, 2011) whose objective was to propose a model for a bilingual unidirectional glossary (English - Portuguese) of collocations typical of the hotel industry, aimed at the translator. This paper presents some of the results from this recently completed research. First, in section 2, we briefly outline the theoretical background that guided the research. Section 3 presents the study corpus and summarizes the procedures to identify the collocations and their equivalents. Section 4 discusses the results and presents glossary entries for some collocations. The paper concludes with final remarks, in section 5, about the main findings of this study.

2. Theory

The theoretical framework of this study comprises three related fields: Corpus Linguistics (Halliday, 1991; Sinclair, 1991), Terminology (Hoffmann, 1999; Krieger and Finatto, 2004) and Translation (Chesterman, 1998). These areas provided the basis for empirical linguistic research focused on the observation of language in use.


The empiricist approach underpins Corpus Linguistics (CL), a method of linguistic investigation based on the analysis of a collection of authentic texts in a corpus. CL considers language a probabilistic system, whereby although there are a number of possible lexical choices and combinations, they do not occur with the same frequency. In other words, some lexical combinations are more recurrent than others. Thus we can say that language is conventionalized, following a phraseological tendency, or the “idiom principle” as put forward by Sinclair (1991), and this view of language gives rise to collocations. For the purpose of this study, a collocation is the recurrent association of lexical items (Sinclair, 1991). Even though collocations do not necessarily represent a comprehension problem, they can certainly pose a challenge to translators. Take the case of “complimentary breakfast”, which could be literally translated into Portuguese as “café da manhã de cortesia”, but according to our corpus research, “café da manhã incluído na diária” [breakfast included in the daily rate] is the most common equivalent. Moreover, studies demonstrate that collocations constitute approximately 70% of terminological occurrences in specialized languages (Krieger and Finatto, 2004: 81); hence the importance of developing terminology reference materials with a special focus on collocations. Regarding Terminology, this study adopts a descriptive approach, in line with the principles of socio-communicative theories (Krieger and Finatto, 2004: 78), such as Textual Terminology (Hoffmann, 1988). This in vivo approach acknowledges the fundamental role of context in terminological research, represented by the texts that make up the corpus. Corpus-based translation studies have influenced the notion of equivalence, a crucial element for this study. In this sense, we adopt the notion of context equivalence put forward by Chesterman (1998: 31). 
Under this view, equivalents must fulfill the same role, within the same context, in both source and target languages. This perspective expands the concept of equivalence beyond the formula “word x = word y”, reflecting more accurately the complexity of terminological correspondence. It is worth mentioning again the example of “complimentary breakfast”, whose functional equivalent in Portuguese is “café da manhã incluído na diária” [breakfast included in the daily rate] instead of the literal translation “café da manhã de cortesia”.

3. Methodology

To carry out this study, I compiled a comparable corpus (texts written originally in English and in Portuguese) of texts taken from websites of hotel establishments in Brazil and the United States. This corpus is divided into five categories in each language and contains 546,106 words and 321 texts in English and 514,449 words and 710 texts in Portuguese, totaling 1,060,555 words and 1,031 texts. Figure 1 summarizes our study corpus.

Fig. 1: The Hotel Industry corpus

The categorization above was based on the classification adopted by two hotel reservation websites: www.hotels.com, for categories in English, and www.hoteis.com.br for categories in Portuguese. These sites have similar purposes: to provide a wide variety of establishments, subdivided into the aforementioned categories, so that the customer can obtain information, select a hotel and make a reservation through the website. Notably, the categories are not exactly equivalent. We chose to keep these differences so as to show that we are dealing with two culturally distinct worlds. Even in the case of corresponding categories like hotels and resorts, cultural differences between the two countries arise, because many types of establishments are typical of their regions and find no equivalent in another cultural context. For instance, we have luxurious casino hotels in Las Vegas, while in Brazil this type of property is illegal. In Brazil, there are sophisticated jungle lodges in the region of the Amazon forest or historical ranch hotels in the state of Minas Gerais, whereas in the US we find Disney theme resorts in Florida or ski resorts in Denver, establishments very typical of that culture.

One of the desirable characteristics of a comparable corpus is that it is balanced, i.e., that it has a similar amount of data (texts and words) in both languages. The numbers in Figure 1 show that our corpus is balanced by the number of words (a little over 500 thousand in each language), while the number of texts in Portuguese is more than twice the number of texts in English (710 vs. 321, respectively). This suggests that the English texts are, on average, longer and more detailed than the ones in Portuguese.

All those texts received special treatment before analysis. Firstly, files were named and numbered according to language and categories, such as EN-H01 (ENglish, category Hotel, file 01); next they were identified by headers, which provided metadata such as the source website, name of the hotel, category and access date. Topics within the texts were also classified by means of discursive tags, such as “introduction”, “accommodations”, “policies”, “dining”. Lastly, the corpus was automatically tagged to allow searches considering word classes, the so-called part-of-speech tagging. For that, we used the software TreeTagger.1 This corpus was then explored with the aid of WordSmith Tools (Scott, 2007, version 5) and its major tools: wordlists, keyword lists, collocation lists and concordance lines.
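The observation about text length can be checked directly from the corpus figures reported above. The short sketch below does only that arithmetic; the variable and function names are illustrative and not part of the original study.

```python
# Corpus figures as reported in the text above.
corpus = {
    "English": {"words": 546_106, "texts": 321},
    "Portuguese": {"words": 514_449, "texts": 710},
}


def mean_text_length(words: int, texts: int) -> float:
    """Average number of words per text."""
    return words / texts


means = {lang: mean_text_length(**figures) for lang, figures in corpus.items()}
total_words = sum(figures["words"] for figures in corpus.values())
total_texts = sum(figures["texts"] for figures in corpus.values())

for lang, mean in means.items():
    print(f"{lang}: {mean:.0f} words per text on average")
print(f"Total: {total_words:,} words in {total_texts:,} texts")
```

The averages come to roughly 1,700 words per English text against roughly 725 per Portuguese text. An average says nothing about the spread of individual text lengths, so it supports a claim about typical length rather than about every text.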
In order to identify the most recurrent collocations in English and establish their equivalents in Portuguese, we followed the procedures summarized below.

Finding collocations in English
• generate a list of keywords of the corpus
• generate a list of collocates and clusters of the selected keyword
• analyse concordance lines
• validate collocation candidates as per established criteria (minimum 10 occurrences; Mutual Information (MI) score ≥ 3 in a 4 x 4 window;2 context analysis of concordance lines)

Finding equivalents in Portuguese
• identify a prima facie equivalent (list of keywords and collocates)
• analyse concordance lines using the tagged corpus (when needed)
• analyse concordance lines of immediate context words

Creating the glossary entry
• organize collected information in terminological indexes
• sort equivalents in order of frequency
• select examples
• include additional information, where appropriate

1 Available at. Access date: August, 2011.
2 Mutual Information (MI score): this statistic (Church and Hanks, 1991) provides a measure of association between two lexical items. The higher the MI score, the lower the chance that the co-occurrence of node and collocate is random, i.e., the higher the MI score, the stronger the collocation. The authors suggest that an MI score ≥ 3 indicates a significant association pattern.

Our glossary proposal comprises all kinds of collocations, i.e., verbal, adjectival, nominal and adverbial collocations, consisting of two or more words, provided that they contain a keyword and are statistically significant (minimum of 10 occurrences in the corpus, Mutual Information score (MI score) ≥ 3). We present in Section 4 a detailed investigation of the word “room” to demonstrate that the theoretical and methodological approaches described above are effective in achieving the desired results.
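The two numerical criteria above (at least 10 occurrences and an MI score of at least 3) can be sketched as follows. This is a minimal illustration using the textbook pointwise formulation of Mutual Information, log2 of observed over expected co-occurrence; it does not model the 4 x 4 window, and the values computed by a concordancer such as WordSmith Tools may differ. The function names and the toy counts are assumptions made for the example.

```python
import math


def mi_score(f_pair: int, f_node: int, f_collocate: int, corpus_size: int) -> float:
    """Pointwise Mutual Information: log2(observed / expected co-occurrence).

    expected = f_node * f_collocate / corpus_size, so the ratio simplifies to
    f_pair * corpus_size / (f_node * f_collocate).
    """
    return math.log2(f_pair * corpus_size / (f_node * f_collocate))


def passes_thresholds(f_pair: int, f_node: int, f_collocate: int, corpus_size: int,
                      min_freq: int = 10, min_mi: float = 3.0) -> bool:
    """Apply the two numerical criteria; context analysis remains a manual step."""
    if f_pair < min_freq:
        return False
    return mi_score(f_pair, f_node, f_collocate, corpus_size) >= min_mi


# Toy counts (hypothetical): the pair occurs 16 times, each word 32 times,
# in a corpus of 4,096 words.
print(mi_score(16, 32, 32, 4096))           # 6.0
print(passes_thresholds(16, 32, 32, 4096))  # True
print(passes_thresholds(8, 16, 32, 1024))   # False: MI is 4.0 but only 8 occurrences
```

Keeping the frequency floor separate from the MI threshold matters because MI tends to reward rare pairs; the last call shows a pair with a high score that still fails for lack of occurrences.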

4. Results

Results will be illustrated with four collocations of the word “room”. This word was chosen because it is the most frequent keyword in the English corpus, with 6,383 occurrences, being therefore very representative of the hotel domain. We opted to present just a few collocations, but to describe in more detail the procedures to identify the English collocations and their equivalents in Portuguese. For the sake of better understanding, equivalents and examples in Portuguese will be followed by a literal translation in English in an attempt to highlight the linguistic variations between the two languages. After the description of each collocation, the full glossary entry is presented, containing the following information:
• Collocation in English and equivalents in Portuguese
• Variant / related extended collocations
• Frequency information (equivalents are shown from the most frequent (three stars) to the least frequent (one star))
• Notes to translators (common translation pitfalls, alternative translation suggestions)
• Additional information in the section “Você Sabia?” [“Did you know?”]
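The frequency information in the entry (three stars down to one star) could be derived from corpus counts along the following lines. The text does not state how frequencies are mapped to stars, so the rule used here, thirds of the top equivalent's frequency, is purely an assumption for illustration, as are the function name and the placeholder data.

```python
def rank_equivalents(frequencies: dict) -> list:
    """Sort equivalents by descending frequency and attach a star rating.

    Assumed rule (not from the source): at least two thirds of the top
    frequency earns three stars, at least one third earns two, otherwise one.
    """
    if not frequencies:
        return []
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    top = ranked[0][1]
    rated = []
    for equivalent, freq in ranked:
        if 3 * freq >= 2 * top:
            stars = "***"
        elif 3 * freq >= top:
            stars = "**"
        else:
            stars = "*"
        rated.append((equivalent, freq, stars))
    return rated


# Placeholder equivalents with hypothetical frequencies.
for row in rank_equivalents({"equivalent B": 12, "equivalent A": 30, "equivalent C": 5}):
    print(row)
```

A rank-based rule (first, second and third place) would be an equally plausible reading of the star scheme; the relative-frequency rule was chosen here only because it keeps near-ties at the same rating.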


4.1. Collocation “guest room”

“Guest” is the first content word in the list of collocates of “room”, making up “guest room(s)” with 530 occurrences. We can add 161 instances of the variant “guestroom”, spelled as one word. Also, “room” by itself is used as a synonym for “guest room” in several contexts. Therefore, the three forms (“guest room”, “guestroom” and “room”) designate the guest accommodation at a hotel, as in:

“All guest rooms have a full kitchen, which includes a refrigerator, two-burner stove and chairs. Each room also features a private bath with tub and shower.”

Fig. 2: Partial list of collocates of “room”

One may criticize the collocational status of “guest room”, since the same meaning can be conveyed by a single word (“guestroom” or “room”). In our study, we consider “guest room” a collocation because it fulfills the requirements we set for determining a collocation: 1) a collocation must contain a keyword (in this case, “room”); 2) the keyword collocate must be among the first 200 words in the list of collocates (“guest” is the first collocate in the list); 3) the collocation must reach a minimum frequency of 10 (there are 530 instances of “guest room”); 4) the collocation must have an MI score greater than or equal to 3 in a 4 x 4 window (the MI score for “guest room” is more than 6, as seen in the column Relation in Figure 2). By analyzing the lists of collocates and clusters and the concordance lines, we observed that “guest room” also forms part of a larger phrase, “guest rooms and suites” (33 occurrences).

The next step was to establish the equivalent in Portuguese. We began with the analysis of the keywords from the Portuguese corpus, where we selected “apartamento(s)” [room / apartment] (2621 hits) and “quarto(s)” [room / bedroom] (1061 hits). The first alternative is the more direct equivalent, given its greater frequency and also the fact that it presents similar collocates, such as “apartamentos e suítes” (equivalent to “guest rooms and suites” mentioned above). See the example below:

“O hotel possui 174 apartamentos e suítes com ar condicionado, música ambiente, cofre e fechadura eletrônicos, telefone com discagem direta, TV a cabo, internet gratuita, telefone no banheiro, secador de cabelo, espelho para maquiagem e banheira.” [The hotel has 174 rooms and suites with air conditioning, piped music, electronic safe and locks, direct dial telephone, cable TV, free internet, telephone in bathroom, hair dryer, makeup mirror and bathtub.]

We also validated “quarto” as a second equivalent for “guest room”. The examples below show the collocations in English and in Portuguese in a very similar context:

“Four Seasons makes no additional charge for children 18 years old and under occupying the same guest room with parents or guardians (space permitting).”

“Diária família: uma criança com menos de 10 anos no mesmo quarto que os pais não paga.” [Family rate: a child less than 10 years old staying in the same room as their parents does not pay.]

Despite the synonymous meanings of “apartamento” and “quarto”, some concordance lines suggested a more specific use for “quarto”. By analyzing 540 concordance lines, we could observe the following patterns:

a) “quarto” is one of the sleeping rooms within an “apartamento”, “suíte” or “chalet” (65% of contexts): “O hotel possui 23 amplos apartamentos, 19 com um quarto com cama de casal (para 2 pessoas) e 4 apartamentos duplos, com dois quartos com cama de casal (para 4 pessoas).” [The hotel has 23 spacious guest rooms, 19 one-bedroom guestrooms with double bed (sleeps 2) and 4 two-bedroom guest rooms with double beds (sleeps 4).]
b) “quarto” is used in the same sense as “apartamento” (35% of contexts), but most of them are found in pousadas (inns/bed and breakfasts) or refer to a lower category of accommodation, as in: “A pousada possui 33 quartos com banheiro privativo, ventilador de teto, televisão, roupas de cama e frigobar.” [The inn has 33 rooms with private bathroom, ceiling fan, television, fridge and bed linen.]


“Nossos apartamentos contam com confortáveis instalações, pois possuem ar-condicionado, TV a cores e banheiro. Temos apartamentos duplos, triplos, quádruplos e nossa diária inclui café da manhã. Dispomos também de quartos duplos e triplos mais simples: com ventilador e TV.” [Our guest rooms offer comfortable facilities, as they have air-conditioning, color TV and bathroom. We have double, triple, quadruple guest rooms and our rate includes breakfast. We also have simpler double and triple bedrooms: with fan and TV.]

These nuances of meaning are included in the section “Did you know?” of our glossary entry in Figure 4. The data presented raised the following questions: if there is a slight difference in meaning between “apartamento” and “quarto” in Portuguese, are there different uses between “guest room” and “bedroom” in English? What about “apartment”? Is it the same as its Portuguese cognate “apartamento”? And what is a “suite”? Our corpus research revealed the following.

“Bedroom” appears 1084 times in the corpus. Its main collocates are “one” (221), “two” (211), “suite(s)” (234). Concordance lines indicate the term is mainly associated with quantifying expressions. The cluster tool made this finding even more evident (see Figure 3). We also noticed that “bedroom” is used to specify the kind of room within the suite, such as “master bedroom”.

Cluster                 Freq.
ONE BEDROOM SUITE       64
THE MASTER BEDROOM      57
TWO BEDROOM SUITE       51
ONE BEDROOM SUITES      50
TWO BEDROOM SUITES      45
AND TWO BEDROOM         45
IN THE MASTER           36
BEDROOM HAS A           36
MASTER BEDROOM AND      35
THE SECOND BEDROOM      34
ONE AND TWO             34
KING SIZE BED           32

Fig. 3: Clusters of “bedroom”
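A cluster list of the kind shown in Figure 3 can be approximated by counting three-word sequences in the neighbourhood of each occurrence of the node word. The sketch below is not the procedure implemented in WordSmith Tools; it assumes that clusters are counted within a fixed horizon around each hit, which would explain why some clusters in the figure (e.g. KING SIZE BED) do not contain the node itself. The function name, the tokenization and the sample text are illustrative.

```python
import re
from collections import Counter


def clusters_near(text: str, node: str, size: int = 3, horizon: int = 5) -> Counter:
    """Count word sequences of length `size` within `horizon` words of each hit."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    counts = Counter()
    counted_starts = set()  # avoid counting a sequence twice when windows overlap
    for i, token in enumerate(tokens):
        if token != node:
            continue
        lo = max(0, i - horizon)
        hi = min(len(tokens), i + horizon + 1)
        for start in range(lo, hi - size + 1):
            if start in counted_starts:
                continue
            counted_starts.add(start)
            counts[" ".join(tokens[start:start + size])] += 1
    return counts


# Invented sample text, for illustration only.
sample = ("The one bedroom suite has a master bedroom. "
          "The two bedroom suite has a master bedroom too.")
for cluster, freq in clusters_near(sample, "bedroom").most_common(4):
    print(cluster.upper(), freq)
```

Tracking the starting positions already counted keeps overlapping windows from inflating the totals, which matters for a word as frequent as “bedroom”, whose occurrences often fall within a few words of one another.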


Therefore, we may conclude that “bedroom” is used mainly to refer to the sleeping room that is part of a guest room or suite. This is closest in meaning to “quarto” mentioned above.

The word “apartment” appears 57 times in the corpus. This information in itself is revealing: “apartment” is a lot less frequent than “guest room” (over 6,000 occurrences) and “bedroom” (over 1,000). Regarding its usage, the examples below show that “apartment” is used to designate a self-serviced independent unit, which may even be a house with several rooms, intended for extended stays.

“Just opened for extended stay, this apartment features one bedroom with queen sized poster bed and a full separate office equipped with full size desk. It includes a living room and a fully equipped and functional kitchen with large side by side refrigerator, gas stove, double sinks, dishwasher, etc. This unit features a separate entry for privacy.”

“The apartment features gourmet kitchen with custom cabinets, granite countertops, stainless steel appliances. Additional features include: mirror armoires in bedroom, large living area, French doors, brand new carpet, fireplace, laundry and a large courtyard.”

It is worth noting that “apartment” has a different meaning from its cognate “apartamento”. “Apartamento” is the most common term in Portuguese to describe hotel accommodations (in fact, it is the equivalent for “guest room”, as described above); for this reason it is very common to see “apartamento” translated as “apartment” in English, which is actually a translation pitfall. This kind of information is included in the “Note to Translator” section of the entry so as to advise readers about the most appropriate translation.

Finally, we analyze the term “suite”, the second most frequent keyword in the corpus after “room” (3,559 occurrences). Due to the high number of occurrences, it was not feasible to conduct a detailed analysis.
However, it was possible to analyze lists of collocates and clusters, as well as the data previously obtained. We also looked for specialized definitions on the web, where the term “suite” is described as a unit characterized by having a sleeping room and a bathroom in a separate area from the living room, as well as a kitchen.3 It is, therefore, a higher accommodation category.

“Our suites are 50% larger than traditional hotel rooms with separate spaces for living, dining, sleeping and working. There are studio, one and two bedroom floor plans to maximize efficiency and ease. A fully equipped kitchen enables our guests to enjoy relaxing meals within the privacy of their own suite.”

All this information is organized in the entry shown in Figure 4 (literal translation provided as footnote).

3 Available at: http://www.cvent.com/en/resources/suite-hotel.shtml. Access on 19/09/2011.

Fig. 4: Entry of “guest room”4

4 Did you know? (EN) Guest room: most common designation for guest accommodation in the hotel. / Bedroom: refers to the dormitory in a suite or guest room. / Suite: characterized by having a certain number of bedrooms, bathroom, living room and sometimes kitchen. It is, therefore, a category superior to the guest room. / Apartment: a self-service unit with several rooms, especially for extended stays. Did you know? (PT) “Quarto” is a synonym for “apartamento” in several contexts. But note some differences: 1) “Quarto” can be part of an “apartamento”, “suíte” or “chalet”. The hotel has 23 spacious guest rooms, 19 one-bedroom with double bed (sleeps 2) and 4 two-bedroom guest rooms with double beds (sleeps 4). 2) “Quarto” may also be a simple type of accommodation, such as in inns and bed and breakfasts. The inn has 33 rooms with private bathroom, ceiling fan, television, fridge and bed linen / Our guest rooms offer comfortable facilities, such as air-conditioning, color TV and bathroom. We have double, triple, quadruple guest rooms and our rate includes breakfast. We also have simpler double and triple bedrooms: with fan and TV. Note to translator: When translating “apartamento” into English, avoid “apartment” and prefer “guest room”.

A Bilingual Glossary of Collocations Typical of the Hotel Industry

4.2. Collocation “in-room safe”

The collocation “in-room safe” occurs 100 times in the corpus. The analysis of the concordance lines revealed that “in-room safe” is often part of a listing of the articles offered in the room, as can be seen in Figure 5:

Fig. 5: concordance lines of “in-room safe”

Chapter Two


Interestingly, “in-room” is used even when the context makes it very clear that the safe is in the room, as in:

“Services/amenities in all rooms: cable/satellite TV, flat screen/plasma TV, free high speed internet, free local calls, hair dryer, in-room coffee maker, in-room desk, in-room safe, individual a/c & heat.”

This repetition would seem redundant and be considered somewhat poor style in Portuguese; in English, however, it is acceptable. This can pose a problem when translating from English into Portuguese, since translators may inadvertently reproduce the English structure in Portuguese, producing a rather unnatural translation. This warning is included in our glossary entry (Figure 7).

The next step was to determine the equivalent of the collocation. We first analyzed the list of keywords from the Portuguese corpus, where we identified the word “cofre” [safe], with 444 occurrences. We then generated a list of collocates of “cofre”, where we identified some adjectives used to describe it: “individual” [individual], “eletrônico” [electronic], “digital” [digital]. Other collocates in the list point to exactly the same context as “in-room safe”: words such as “mini bar”, “telephone”, “cable TV” and “hair dryer”, that is, items that are normally included in the hotel room. Finally, by cross-checking the list of collocates, the list of clusters and the concordance lines, we identified several candidate equivalents; the following were validated given their higher frequency, as shown in Figure 6:

in-room safe:
    Cofre [safe]
    Cofre individual [individual safe]
    Cofre eletrônico [electronic safe]
    Cofre digital [digital safe]

Fig. 6: Validated equivalents for “in-room safe”
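The collocate list used to arrive at these equivalents can be approximated with a simple window-based count. The following is only a minimal sketch: the function name `collocates`, the `window` parameter and the three sample lines are illustrative assumptions, not the tool or the data used in the study.

```python
import re
from collections import Counter


def collocates(texts, node, window=3):
    """Count words found within `window` tokens to the left or right of `node`."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        for i, token in enumerate(tokens):
            if token != node:
                continue
            start = max(0, i - window)
            end = i + window + 1
            for j in range(start, min(end, len(tokens))):
                if j != i:  # skip the node word itself
                    counts[tokens[j]] += 1
    return counts


# Made-up sample lines, modelled on the kind of amenity listings quoted in the text.
sample = [
    "frigobar, cofre individual, telefone",
    "TV a cabo, cofre eletrônico, secador de cabelos",
    "frigobar, cofre, telefone, TV a cabo",
]

result = collocates(sample, "cofre", window=1)
for word, freq in result.most_common():
    print(word, freq)
```

With a real corpus, the adjectives that modify the node word and the neighbouring amenity nouns would both surface in such a list, which is why the concordance lines still have to be read to confirm the context.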

Notice that the most frequent equivalent is plainly “cofre”, with no other modifier. This finding was only possible through the analysis of each concordance line of “cofre”, so as to make sure that we were dealing with the same context of the English collocation (a safe in the room), as in:


“calefação, TV LCD, aparelho de DVD, frigobar, cofre, telefone, acesso gratuito a internet/wireless” [heating, LCD TV, DVD player, minibar, safe, telephone, complimentary wireless internet access] The entry is organized as shown in Figure 7.

Fig. 7: Entry for “in-room safe”5

4.3. Collocation “reserve [a, your, this] room”

Of the 167 occurrences of the verb “to reserve”, 56 are part of the collocation “to reserve a/your/one room”:

“Our online booking service is the most convenient way to reserve a room.”

“In order to reserve your room a valid credit card number will be required.”

5 [Note to translator: there is no need to translate “in-room”; in most cases, reference to the room is implicit in the context.]


In addition to the collocates of the word “room”, we identified 32 other collocational patterns, such as:

To reserve [a suite (7), a meeting room (4), accommodations (3), a stay (3), tee times (2), seating (2), a package (1), bicycles (1), tickets (1), a spot (1), a date (1), a table (1)].

The verb “to reserve” is also widely used intransitively (31 occurrences): “reserve now” (10), “reserve online” (8), “reserve by phone” (5), “reserve early” (1), as in the example below:

“Reserve now! We look forward to having you as our guest!”

Thus we can say that “reserve” occurs mainly with “room”, but also with several other generic terms (“package”, “date”, “table”, etc.). This preliminary research provided data for comparison with a synonymous verb, “to book”, also found in the list of collocates of “room”. We found 300 occurrences of the verb “to book”, i.e., almost twice as many as “to reserve” (167). The most common collocate of “book” is also “room”, with 46 occurrences. Examples:

“Click here to check availability and book your room online.”

“Ready to book your room! Click here to take advantage of great rates!”

At this point it is possible to draw an important parallel. “To book” is more frequent than “to reserve” in the corpus and both are collocates of “room”. Nevertheless, the association between “to reserve” and “room” is stronger than that between “to book” and “room”: 33% (56 out of 167) of the occurrences of “to reserve” are with “room”, while only 15% (46 out of 300) of the occurrences of “to book” are with “room”. “To book” also occurs with several other nouns; we identified more than 40. The main ones are related to accommodations:

To book [a reservation (28), a suite (10), a stay (9), nights (9), accommodations (5), a package (4), an appointment (4), a hotel (3), an event (2), a massage (2), a vacation (1)]

“A two-night deposit is required to book a reservation.”

“Book two nights and get 50% off the third night.”
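The comparison of association strength above is a simple proportion: the share of each verb's occurrences that take “room” as object. As a minimal sketch (the function name `collocation_share` and the dictionary layout are illustrative assumptions; the counts are those reported in the text), it can be reproduced as follows:

```python
def collocation_share(pair_count: int, verb_count: int) -> float:
    """Share of a verb's occurrences that fall in a given collocation."""
    if verb_count <= 0:
        raise ValueError("verb_count must be positive")
    return pair_count / verb_count


# Counts reported in the text: total occurrences and occurrences with "room".
counts = {
    "reserve": {"total": 167, "with_room": 56},
    "book": {"total": 300, "with_room": 46},
}

shares = {
    verb: collocation_share(c["with_room"], c["total"])
    for verb, c in counts.items()
}

for verb, share in shares.items():
    print(f"{verb}: {share:.1%} of occurrences collocate with 'room'")
```

The unrounded values are about 33.5% and 15.3%, which the text reports as 33% and 15%. Note that this is a relative frequency, not a statistical association measure such as mutual information.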


One of the main uses of “book”, though, is in the intransitive form (78 occurrences), such as “book online” (35), “book now” (29), “book early” (3), as can be seen in the example below:

“Save up to 30% on your reservation at the Bay Inn & Suites SeaWorld San Diego when you book online! Book Now!”

The same patterns occur with both verbs, but with a marked difference in frequency: “reserve now” (10) versus “book now” (29); “reserve online” (8) versus “book online” (35). In a nutshell, this comparative analysis allows us to state that:

• both verbs are synonymous, because they share the same meaning and several collocates;
• both verbs have “room” as their main collocate, but “to reserve a room” is more frequent and stronger than “to book a room”;
• “to book” is more commonly used in the intransitive form than “to reserve”.

This information is summarized in the section “Did you know?” of our glossary entry (Figure 9).

The next step was to establish the equivalents in Portuguese. We began by analyzing the cognate verbs “reserve” and “reservar”, one in the imperative form and the other in the infinitive form, respectively; and then the noun “reserva(s)” [reservation(s)]. The imperative form “reserve” (46 occurrences) features various noun and adverbial collocates. For example:

Reserve (um/uma, sua/seu(s), o/a(s)) [estadia (4), apartamento (1), suíte (1), quarto (1), pacote (1), horário (1), café da manhã (1), espaço (1), férias (1)]
(Reserve (a/one, his/her(s), the/a) [stay, apartment, suite, bedroom, package, time, breakfast, space, holiday])

The verb also occurs with some adverbs, such as “aqui” [here] (12), “agora” [now] (7) and “já” [now] (6). The infinitive form, “reservar”, is less frequent, occurring 34 times; it occurs with “apartamento” [guest room] and “quarto” [bedroom] (1), generally preceded by the expression “mais de um” [more than one]. Example:


“Para reservar mais de um quarto, basta fazer outra reserva.” [To reserve more than one room, just make another reservation.]

Up to this point, we had identified only 7 occurrences of the verbs “reserve” and “reservar” followed by “apartamento” and “quarto”, which would be the literal translation of “reserve a room”. For this reason, we decided to investigate the noun “reserva(s)” [reservation(s)], which is far more frequent than the verb forms, with 1,939 occurrences. Given the large number of occurrences, we considered only the contexts related to reserving accommodations (rooms). Here the tagged corpus enabled a more precise investigation by distinguishing the verb form “reserva” [third person singular] from the noun “reserva”, the focus of our search. It also allowed us to look specifically for the verbs that co-occur with “reserva”. Figure 8 shows concordance lines using the tagged corpus.

Fig. 8: Concordance lines of verbs followed by “reserva” (noun) using the Portuguese tagged corpus
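The kind of search a tagged corpus makes possible can be sketched as follows. This is an illustration only: the `word_TAG` token format, the tag labels (`V`, `N`, etc.), the function name `verbs_before_noun` and the sample lines are all assumptions, since this section does not specify the tagset or query syntax actually used.

```python
from collections import Counter


def split_tag(token):
    """Split a 'word_TAG' token; tokens without a tag get an empty tag."""
    word, sep, tag = token.rpartition("_")
    return (word, tag) if sep else (token, "")


def verbs_before_noun(tagged_lines, noun="reserva", max_gap=2):
    """Count verbs that precede `noun` (tagged N) with at most `max_gap` tokens between."""
    counts = Counter()
    for line in tagged_lines:
        pairs = [split_tag(tok) for tok in line.split()]
        for i, (word, tag) in enumerate(pairs):
            if word.lower() == noun and tag == "N":
                for prev_word, prev_tag in pairs[max(0, i - max_gap - 1):i]:
                    if prev_tag == "V":
                        counts[prev_word.lower()] += 1
    return counts


# Hypothetical tagged lines; the third one contains "reserva" as a verb and is ignored.
tagged = [
    "Faça_V sua_PRON reserva_N online_ADV",
    "Para_PREP efetuar_V sua_PRON reserva_N",
    "O_ART hotel_N reserva_V o_ART direito_N",
    "Preencha_V o_ART formulário_N para_PREP solicitar_V uma_ART reserva_N",
]

result = verbs_before_noun(tagged)
print(result.most_common())
```

Filtering on the noun tag is what keeps the verb form “reserva” out of the counts, which is the distinction the untagged corpus could not make.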


Finally, the following collocations were identified as equivalents for “to reserve a room”:

1) “faça sua reserva” [make your reservation] (56) / “faça [agora, já] sua reserva” [make (now) your reservation] (26) / “fazer [uma, sua] reserva” [to make (one, your) reservation] (27)
“Faça sua reserva on-line e ganhe 10% de desconto.” [Make your reservation online and get 10% discount.]
“Para fazer a sua reserva no Victory Suites, efetue o cadastro em nosso sistema.” [To make your reservation at Victory Suites, register in our system.]

2) “efetuar [a, sua, uma] reserva” [carry out (the, your, a) reservation] (19)
“Para efetuar sua reserva ou solicitar informações adicionais, favor entrar em contato conosco pelo email.” [To carry out your reservation or request additional information, please contact us by email.]

3) “solicitar [a, sua, uma] reserva” [request (the, your, a) reservation] (18)
“Preencha esse formulário para solicitar uma reserva.” [Fill out this form to request a reservation.]

4) “realizar [a, sua, uma] reserva” [carry out (the, your, a) reservation] (6)
“Para realizar uma reserva, o cliente deverá fazer um depósito bancário de 25% do valor total das diárias.” [To carry out a reservation, the customer must make a deposit of 25% of the total rates.]

We noticed that neither “apartamento” [guest room] nor “quarto” [bedroom] formed part of these collocations (“reserva(s) de apartamento(s)/quarto(s)” [reservation of guest room/bedroom] occurred only 3 times). Thus, we can say that “room” is implicit in the context in Portuguese, since all collocates listed above refer to reservations of accommodation in the hotel. The data showed that in Portuguese there is a preference for the structure “verb + reserva” [verb + reservation] instead of just the verb “reservar”. In other words, the most common equivalent for “Book now!”, for instance, is not the literal translation “Reserve agora”, but “Faça já sua reserva” [Make your reservation now!].
All these findings are included in the entry shown in Figure 9.


Fig. 9: Entry for “reserve a room”6

6 Did you know? “Book” or “reserve”? Both are synonyms, but some usages are more characteristic. “To book” is used more often in expressions such as “book a reservation”, “book now”, “book online”. “A two night deposit is required to book a reservation.” “Save up to 30% on your reservation at the Bay Inn & Suites SeaWorld San Diego when you book online. Book now!” “To reserve” is more common in the expression “reserve a/your/this room”. “In order to reserve your room, a valid credit card number will be required.” Note to translator: When translating the imperative form of “book” and “reserve” into Portuguese, opt for “Faça sua reserva” [make your reservation]. It is not necessary to include the word “apartamento” [room]. “Faça agora mesmo sua reserva!” [Make your reservation now!]


4.4. Collocation “room amenities”

“Room amenities” is listed 74 times in the corpus (see Figure 10). We also found the synonymous variants “guest room amenities” (16) and “in-room amenities” (14). This collocation is used to refer to several different items found in the accommodation, such as safe, coffee maker, internet, cable TV, DVD, hair dryer, bathrobes, etc.

N   Concordance
24  living area. There is a full kitchen, wood burning fireplace, and balcony. Guest room Amenities Daily housekeeping In room pay per view movies Pay for Use at this
25  our hotel even more memorable, we offer the following hotel and guest room amenities and services: Hotel Amenities Ski-in/Ski-out access Spencer's
26  SeaWorld, the Wild Animal Park, Legoland and Other Local Attractions Guest Room Amenities: • Safety deposit box • Coffee-maker with complimentary tea
27  and crafts and yoga with elephants. We can tailor an experience just for you! In-Room Amenities All rooms have plush terry robes, oversized bath linens, hair
28  Our two bedroom Family Cottage is ideal for families traveling together. Guest Room Amenities 100% cotton sheets & fluffy towels Turkish Towel Company
29  parking. And especially...the pampered touch...a private candlelight breakfast. Room Amenities • 2nd Floor of a Private Villa, Upstairs • Glass King-size canopy
30  five distinctly decorated rooms in a relaxing, private and comfortable setting. Room amenities include high-quality queen size beds, private modern baths,
31  romance of a fireplace, jetted tub, or both. Click here for a complete list of guest room amenities. Your stay at The Oliver Inn always includes a Full, Hot, Gourmet
32  available for guests use, each guest quarters includes the following: Guest Room Amenities * Beds are triple sheeted with down comforters * Thick,
33  indoors relaxing with a glass of port by the warm glow of the living room fire. In-Room Amenities All of our guest rooms have luxury linens, spa robes and
34  trickle of water (below the vanity window). The dormer window has a lake view. Room amenities include a king-size bed, private bath, and a flat-screen cable TV.
35  a menu that is ready for your feast. Our Suites Included are all of the many fine room amenities plus a charming living room with sleeper love seat, antique and

Fig. 10: Concordance lines of “room amenities”.
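Concordance lines of this kind are keyword-in-context (KWIC) views: each hit is shown with a fixed amount of text to its left and right. The sketch below illustrates the idea only; it is not the concordancer used in the study, and the function name `kwic` and the `width` parameter are assumptions. The sample text reuses two of the sentences shown in Figure 10.

```python
import re


def kwic(text, phrase, width=30):
    """Return one keyword-in-context line per occurrence of `phrase` (case-insensitive)."""
    lines = []
    for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context and left-align the right context,
        # so that the search phrase lines up in a single column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


sample_text = (
    "Room amenities include a king-size bed, private bath, and a flat-screen cable TV. "
    "Click here for a complete list of guest room amenities."
)

lines = kwic(sample_text, "room amenities")
for line in lines:
    print(line)
```

Aligning the hits in one column is what makes it easy to see, by eye, whether the collocation is followed by a listing of items.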

By analyzing the concordance lines, we observed that in 52 out of 74 occurrences, the collocation precedes a listing of the items included in the room, as in: “Room amenities: cable television with multi channels, microwave, refrigerator, writing table with chairs, telephone for room to room and local calls, air conditioner, in-room safe, hair dryer, wireless internet access.” In Portuguese, our intuition led us to start the search for the equivalent by “comodidades”, the prima facie translation of “amenities”, according to our background knowledge. We found 19 occurrences of this word, of which only five referred to the same context in English: “São 396 acomodações distribuídas entre apartamentos e suítes, com todas as comodidades: wireless, workstation no quarto, TV a cabo, cofres individuais, telefone com discagem direta, frigobar, room-service 24 horas por dia (...).” [There are 396 types of accommodations, rooms and suites, with all the amenities: wireless workstation in the bedroom, cable TV, individual safe, direct dial telephone, minibar, 24 hour room service.] From the example above, we can consider “comodidades” as an equivalent for “room amenities”. However, due to the very low number of occurrences (5) for such a common context in hotel descriptions, we continued the search to identify other possibilities. We resorted to the


strategy of searching for words in the immediate context of the word we were looking for. We then analyzed the concordance lines of words such as “cofre” [safe], “cafeteira” [coffee maker], “frigobar” [minibar], and others, i.e., examples of room amenities, so as to observe how these items were introduced in the text in Portuguese. This led to the following equivalents in Portuguese: Facilidades [facilities] – 122 occurrences “Facilidades: acesso à internet banda larga, aquecimento central, ar condicionado com controle individual, TV a cabo.” [Facilities: broadband Internet access, central heating, air conditioning with individual control, cable TV.] Características [features] – 38 occurrences “Características: armário, minibar, telefone com secretária eletrônica, mesa com 8 cadeiras, ar condicionado central, sofás, poltronas e chaise longues, TVs a cabo.” [Features: closet, minibar, telephone with answering machine, table with 8 chairs, central air conditioning, sofas, ottomans and chaise longues, cable TVs.] In addition to the equivalents above, the search showed that in Portuguese the articles that are provided in the rooms are also introduced by other expressions: “estar equipado com” [to be equipped with], “possui” [has], “conta com” [features (verb)], “apartamento com” [guest room with], as in the examples: “E para o seu total conforto, o Costa dos Coqueiros oferece 52 apartamentos equipados com frigobar, internet wireless, ar condicionado, varanda e um delicioso café da manhã, incluso na diária.” [And for your total comfort, Costa dos Coqueiros offers 52 guest rooms equipped with minibar, wireless internet, air conditioning, balcony and a delicious breakfast, included in the rate.] 
“As acomodações do Mabu possuem ar condicionado, TV LCD, TV a cabo, WC com ducha, secador de cabelos, telefone, frigobar, amenities, internet wireless.” [Mabu’s accommodations have air conditioning, LCD TV, cable TV, bathroom with shower, hairdryer, telephone, minibar, toiletries, wireless internet.]


In conclusion, the following structure is more common (in bold) in English:

“Room amenities include a king-size bed, private bath, and a flat-screen cable TV.”

In Portuguese, by contrast, the most common structure is (in bold):

“Apartamentos equipados com ar condicionado split, ventilador de teto, frigobar, TV 32 polegadas LCD, TV a cabo com 60 canais, cofre.” [Guest rooms equipped with air conditioning, ceiling fan, minibar, 32-inch LCD TV, cable TV with 60 channels, safe.]

Therefore, the collocation “room amenities” can be translated into Portuguese in most contexts by using an expression that does not even include an equivalent for “amenities” (such as “comodidades”), i.e., it can be translated as “apartamentos equipados com” [guest rooms are equipped with].

Finally, it was curious to find that the word “amenities” is also used in Portuguese (64 occurrences in the Portuguese corpus), but to designate what is known as “toiletries” in English, that is, the complimentary articles found in the hotel bathrooms for guest use, such as shampoo, conditioner, shower cap, lotions, etc.

Roupão de banho, toalha de banho, chinelo, amenities (shampoo, condicionador). [Bathrobe, towels, slippers, toiletries (shampoo, conditioner).]

All the information discussed above is summarized in the glossary entry shown in Figure 11.


Fig. 11: Entry for “room amenities”7

7 Did you know? The word “amenities” is used in Portuguese to refer to the items in the bathroom (shampoo, conditioner, lotion, etc.). Bathroom with bathtub, separate shower, telephone, dryer, bathrobe, scale and full toiletries kit. / All units are equipped with: hair dryer, toiletries (shampoo and conditioner). In English, the word “toiletries” is used to describe these articles. “Bathroom toiletries include shampoo, hair conditioners, body gels and lotions, facial cotton towels and shower caps.” Note to translator: In Portuguese, the articles included in the guest room are commonly introduced in the text by using the following structures: “[ESTAR] equipado com” [(BE) equipped with] All rooms are equipped with minibar, telephone, cable TV, heater, ceiling fan, safe (…). “Possuir” [have] All of Mabu’s accommodations have air conditioning, LCD TV, cable TV, WC with shower, hair dryer.

5. Final remarks

This article highlights some of the findings from a larger study whose aim was to put forward a proposal for a bilingual glossary of collocations targeted at the translator working in the hotel domain. The study confirmed some of the propositions found in the literature regarding collocations. First, it showed that collocations are indeed a widespread phenomenon in specialized languages, confirming the phraseological tendency of language. It also demonstrated that the meaning of a word emerges fundamentally from the relationship it holds with the surrounding words and context of use, making collocations a crucial element to be included in terminological reference materials.

We discussed four collocations of the word “room”, outlining the procedures to establish their equivalents in Portuguese. The methodology used proved adequate to achieve the expected results and also expanded the scope of the research, i.e., it enabled us not only to determine the most frequent collocational equivalents (as in the case of “in-room safe”), but also to identify different nuances in the meanings of related words (as in “guest room”, “bedroom”, “suite”), as well as possible translation pitfalls (“apartamento” x “apartment”), alternative translation suggestions (such as using the expression “apartamento está equipado com” [guest room is equipped with] as an equivalent for “room amenities”), and even curious facts about current language usage (the word “amenities” being used in Portuguese with a different meaning than in English). Finally, the entries presented for each collocation illustrate what we believe to be a glossary that could actually help the translator in the crucial task of producing a technically accurate and natural translation.

References

Chesterman, A. 1998. Contrastive functional analysis. Amsterdam: John Benjamins.
Church, K. and Hanks, P. 1991. Word Association Norms, Mutual Information and Lexicography. Computational Linguistics 16(1).
Gregson, P. W. (ed.) 2009. Hotelaria na Prática. Barueri: Editora Manole.
Halliday, M. A. K. 1991. Corpus studies and probabilistic grammar. In: Aijmer, K. and Altenberg, B. (eds.). English corpus linguistics: studies in honor of Jan Svartvik. London: Longman. p. 30-43.
Hoffmann, L. 2004 [1999]. Por uma terminologia textual. [translated by Sandra Dias Loguercio]. In: Krieger and Araújo (eds.) 2004. A Terminologia em foco. Cadernos de Tradução 17. Porto Alegre: Instituto de Letras da UFRGS, Oct.-Dec.
Krieger, M. G. and Finatto, M. J. B. 2004. Introdução à Terminologia: Teoria e Prática. São Paulo: Editora Contexto.
Navarro, S. 2011. Glossário bilíngue de colocações da hotelaria: um modelo à luz da Linguística de Corpus. Master’s Dissertation – Faculdade de Filosofia, Letras e Ciências Humanas da Universidade de São Paulo, São Paulo. Available at: http://www.teses.usp.br/teses/disponiveis/8/8147/tde-16082012-122119/pt-br.php
Scott, M. 2007. WordSmith Tools, version 5. Oxford: Oxford University Press.
Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.

PART III SPOKEN LANGUAGE AND CORPORA

CHAPTER THREE

DIALOGIC UNITS IN SPOKEN BRAZILIAN AND ITALIAN: A CORPUS-BASED APPROACH

MARYUALÊ MALVESSI MITTMANN* AND TOMMASO RASO*

1. Introduction

In this paper we present a cross-linguistic study on the usage of dialogic units in spoken Brazilian Portuguese and Italian. The data come from two comparable spontaneous speech corpora and the analysis is based on the Language into Act Theory (Cresti 2000; Cresti and Moneglia 2010), which defines dialogic units as information units (IU) dedicated to regulating communication. Such units largely correspond to what other theoretical approaches commonly call discourse markers. We support the argument that the speech flow can be properly analyzed if segmented into utterances and intonation units. This segmentation is based on the perception of prosodic boundaries, which also allow the identification of the pragmatic functions of each intonation unit.

Our main goal is to discuss some aspects of the usage of dialogic units in Brazilian Portuguese and Italian. We describe the functions performed by these units in spoken language and present the most frequent lexical items associated with different dialogic functions in Brazilian Portuguese and Italian. We also investigate the distribution of dialogic units across the utterance in order to detect possible language-specific usages.

In the next section, we first discuss the notion of discourse markers according to several researchers who have been developing usage-based research. Then, we introduce the main concepts of Language into Act Theory and examine in detail the types and functions of information units in this approach. Research methods are described in Section 3. Section 4 presents the results with quantitative data and a few qualitative analyses, showing examples extracted from the corpora. All examples in this text can be obtained online through the IPIC database (IPIC 2012), where the audio signal is available together with the text transcription and recording session metadata.

* UFMG – Federal University of Minas Gerais.

2. Conceptual issues and theoretical framework

2.1. Some conflicting issues on Discourse Markers

In the Linguistics literature, Discourse Markers are often defined as linguistic expressions that lose their semantic meaning and also their original morphosyntactic value. Such expressions do not belong to the semantic and syntactic structure of the utterance, do not affect its truth value (Schneider 1999), and are not part of the propositional content of the message conveyed; therefore they do not contribute to the meaning of the proposition itself (Fraser 2006).

According to different traditions, discourse markers acquire different pragmatic functions, which can be either textual or meta-textual. Some textual features usually attributed to discourse markers are turn-taking, silence filling, phatic function, request for attention, agreement and confirmation. Meta-textual functions can be focus, demarcation, indication of paraphrase or reformulation, modality, among others (Fisher 2006). However, one can argue that several of the “textual” functions, such as turn-taking or request for attention, are actually pragmatic functions, since they do not contribute to the actual content of the spoken text, but rather regulate the interaction between speakers.

Additionally, there is little agreement in the literature regarding the number of discourse markers, their functions and the criteria to define them. Discourse markers are often related to concepts such as form, attitude and emotion (Traugott 2007), but there is no agreement regarding these concepts either. The main point of agreement among scholars is that discourse markers are not semantically and syntactically compositional with the rest of the utterance, but there is no proposal for an operational definition that would allow predicting whether a given lexeme is a discourse marker.


Given a certain lexical item, we have two possibilities:

(i) it is per se semantically and syntactically non-compositional, e.g. it is an interjection;
(ii) it can be compositional or non-compositional, depending on its use in the text: for instance a verb (look) or an adverb (but) or a noun (god) can be used either as a discourse marker or as a compositional item inside an utterance.

Taking these possibilities into consideration, we must look into three different issues concerning discourse markers that are not solved in the studies mentioned. First, we must consider that any lexeme can be used in spontaneous speech to perform a certain speech act. Therefore, interjections (such as “oh!”) should not be automatically classified as discourse markers, as they can be used in communication as a complete utterance, conveying an illocutionary value. We must then have a way of predicting if the interjection (or any given word) is a discourse marker or an illocution. Second, if an expression can be either part of a compositional string or a discourse marker, we must also have a positive parameter to predict when it is one case or the other. Third, once we know that a certain expression is a discourse marker, we must be able to identify its specific pragmatic function (such as phatic or conative or other). Some authors note a strong correlation between discourse markers and some prosodic properties, such as the fact that they tend to be uttered in a dedicated intonation unit that can be eliminated without any effect on the utterance (Bazzanella et al. 2008). We argue that prosody provides the necessary parameters to solve the problems pointed out above. The observation that discourse markers are produced with a specific prosodic enveloping is consistent with the theoretical model adopted for our studies, the Language into Act Theory, described in the following paragraphs.

2.2. Language into Act Theory

The theoretical framework of Language into Act Theory (L-AcT) (Cresti 2000; Cresti and Moneglia 2010; Cresti 2011) was developed through empirical corpus research. L-AcT grew out of long-term observation of corpora of spontaneous speech, which led to progressive generalizations about the organization of speech structure. In Language into Act Theory, the reference unit for the analysis of spoken language is the utterance, defined as the linguistic counterpart of a speech act. An utterance is the shortest linguistic unit that can be pragmatically interpreted and is delimited in the speech flow by prosodic


boundaries (Crystal 1975) that bear a conclusive value. In spontaneous speech, prosody plays a fundamental role in parsing the speech flow into discrete intonation units. Intonation units can be prosodically autonomous or prosodically non-autonomous. Autonomous intonation units are those delimited by prosodic boundaries perceived by the hearer as having a conclusive value. Prosodically delimited linguistic sequences (intonation units) convey information. The information is pragmatically autonomous if it bears an illocutionary value. Intonation units that convey illocutionary value are both prosodically and pragmatically autonomous and are associated with the Comment function. Intonation units that are delimited by prosodic boundaries with a non-conclusive value convey other types of information and are associated with different functions.

An utterance may be produced as a single intonation unit (simple utterance) or it can be prosodically parsed into two or more intonation units (compound utterance), creating a prosodic pattern (Hart, Collier, and Cohen 1990). The units of a prosodic pattern are associated with information functions, through which information is patterned in the utterance. Examples (1) and (2) show simple utterances in Brazilian Portuguese and Italian respectively, while examples (3) and (4) illustrate compound utterances in each language. The examples are identified with the filename followed by the number of the utterance in the corpus.

(1) bfamdl01, 5101
FLA: seu dinheiro tá caindo hhh //
your money is falling [out of your pocket]

(2) ifamdl04, 47
ART: le quattro componenti son queste //
the four components are these

(3) bfamdl02, 194
BEL: uhn / talvez na parte maior / hm não //
maybe in the bigger section no

(4) ifammn17, 16
SAR: lo username / è a erre / sessantanove / mentre la password / è yyy //
the username is ‘a’ ‘ar’ sixty-nine while the password is yyy

1 All examples provided are available in the online database http://lablita.dit.unifi.it/app/dbipic/
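The distinction between simple and compound utterances can be sketched in a few lines of code. This is only an illustration, assuming a transcription in which `//` closes the utterance and `/` separates intonation units, as in examples (2) and (4); the function names are hypothetical and are not part of any C-ORAL tool.

```python
def split_intonation_units(utterance: str) -> list[str]:
    """Split one transcribed utterance into its intonation units."""
    body = utterance.strip()
    if body.endswith("//"):          # drop the terminal boundary
        body = body[:-2]
    # every remaining "/" is a non-terminal boundary between units
    return [unit.strip() for unit in body.split("/") if unit.strip()]


def utterance_type(utterance: str) -> str:
    """Return 'simple' for one intonation unit, 'compound' for two or more."""
    n_units = len(split_intonation_units(utterance))
    if n_units == 0:
        raise ValueError("no intonation unit found")
    return "simple" if n_units == 1 else "compound"


example_2 = "le quattro componenti son queste //"
example_4 = "lo username / è a erre / sessantanove / mentre la password / è yyy //"

print(utterance_type(example_2), split_intonation_units(example_2))
print(utterance_type(example_4), len(split_intonation_units(example_4)))
```

The sketch deliberately ignores the other boundary symbols used in the corpora and only counts units; assigning information functions to the units requires the manual annotation described in the text.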

Chapter Three


The Informational Patterning Hypothesis (Cresti and Moneglia 2010; Scarano 2009) proposes that there is a systematic correspondence between the prosodic pattern and the information pattern of an utterance. Information Units (IU) are classified into textual and dialogic. Textual units participate in the construction of the semantic content of the utterance. Dialogic units are devoted to the successful pragmatic performance of the utterance (e.g. to regulate the relationship between speakers). Every utterance has at least one Comment unit, since it is the Comment that bears the utterance’s illocutionary force. The Comment is the only unit that is both necessary and sufficient to constitute an utterance. Textual functions are:

a) Comment (COM): It accomplishes the illocutionary force of the Utterance and is necessary and sufficient to perform an Utterance.
b) Multiple Comments (CMM): A chain of two or more Comments forming an illocutionary pattern, which accomplishes the illocutionary force of the utterance. The illocutionary pattern performs only one utterance, which develops a general conventional rhetorical effect at the locutionary level (such as comparison or listing).
c) Bound Comments (COB): A sequence of Comments forming not an utterance, but another type of perceptually concluded² linguistic sequence (also called “terminated” sequence): the stanza. A stanza is produced by progressive adjunctions of Comments, following the flow of thought. The illocutionary force of Bound Comments is weak and homogeneous, and is preferably assertive or expressive.
d) Topic (TOP): It represents the pragmatic domain of application for the linguistic activity accomplished by the Comment, allowing the Comment’s displacement from the context.
e) List of Topics (TPL): It develops the function of one single Topic, but corresponds to a chain of two or more Topics.
f) Appendix of Comment (APC): It integrates the Comment text either with fillers, following a repetition strategy, or by adding more specific information for the addressee, always intentionally seeking the addressee’s agreement.
g) Appendix of Topic (APT): It integrates the Topic text, adding delayed information, amendments, or, rarely, repetitions.

2 Perceptually concluded linguistic sequences are linguistic expressions that are autonomous from both the acoustic and the pragmatic point of view.


h) Parenthesis (PAR): A meta-linguistic appreciation of the Utterance’s content, with a backward or forward scope.
i) Locutive Introducer (INT): It signals that the subsequent set of IUs, including the Comment, has a unitary point of view often diverging from the Utterance’s one. The subsequent IUs can correspond to reported speech, a spoken thought, a list, a narration, or an emblematic exemplification.

In L-AcT, discourse markers are identified as dialogic information units. In this paper, discourse markers and dialogic units are considered to be the same. The denomination Dialogic Unit emphasizes the pragmatic function of these units and justifies their insertion in a frame that can account for pragmatic speech structuring as a whole. Dialogic units, as a whole, are defined as prosodically delimited linguistic expressions whose function is to regulate the communicative interaction (Cresti 2000; Frosali 2008). The dialogic functions are described in more detail below. All information properties (i.e. functional, distributional and prosodic) take the Comment (the core of the utterance) as their reference unit.

2.2.1. Incipit (INP)

An Incipit is a dialogic information unit that has the function of opening the communicative channel while signaling a contrastive value in relation to the previous utterance. The contrast is not logical in nature (it does not depend on the lexical item), but rather affective, and is signaled by prosodic features. The distribution of the INP in the utterance is not free: it must be positioned at the beginning of the utterance. Prosodically, INP shows a rapid rising-falling movement of the fundamental frequency (F0)³ that reaches a high F0 value. The movement can also be just rising (reaching a high F0 value) or just falling (starting from a high F0 value); typically an Incipit is produced with high intensity and short syllabic duration in comparison with the Comment information unit⁴.

3 Fundamental frequency (F0) is responsible for a person’s pitch. It is defined as the frequency (cycles per second) at which the vocal cords vibrate to produce voiced speech sounds. One complete cycle on a periodic waveform corresponds to one complete cycle of opening and closure of the vocal cords (Hewlett and Beck 2006; Davenport and Hannahs 2013).
4 A Comment is the nuclear information unit of any utterance.


(5) bfamdl02, 254
BEL: não / e ainda cheguei falando com a minha mãe / toda empolgada / mãe / minha mala / tá super bem arrumada //
%inf: INP, COM
no / and besides I came talking with my mother / all excited / mom / my bag / is really well organized

2.2.2. Conative (CNT)

A Conative is a dialogic information unit that has the function of engaging the listener in the communicative situation, so that he/she starts or stops doing something. A Conative expresses the speaker’s wish that the listener take part in the interaction in an adequate way. It is therefore more frequent with directive illocutions, but can also appear with other illocutionary values. It has a very characteristic falling F0 contour. The syllabic duration is short or average, and the intensity is high in comparison with the Comment unit. The position of CNT is free, but it occurs mostly in the initial and final positions of the utterance.

(6) bfamcv01, 56
EVN: o' / o Arnaldinum é caro / e tem aqueles problemas //
%inf: CNT, COM
look / Arnaldinum [Stadium] is expensive / and has those problems

2.2.3. Phatic (PHA)

A Phatic is a dialogic information unit that has the function of ensuring the maintenance of the communicative channel. It is prosodically performed with flat or falling F0 contours, very low intensity and very short syllabic duration. Its distribution in the utterance is free, which means it can appear in any position with respect to the Comment. PHA is the most frequent dialogic unit.

(7) bfamcv02, 9
TER: é que ea ganhou tudo / né //
%inf: COM, PHA
it’s ‘cause she was given all [the house gifts] / right

Dialogic Units in Spoken Brazilian and Italian


2.2.4. Allocutive (ALL)

An Allocutive is a dialogic information unit that has the function of specifying to whom the message is directed, and it marks the type of social relation between interlocutors. Traditional studies would place ALL in the vocative category together with the illocution of recall. We do not consider this appropriate: a unit featuring the illocution of recall is autonomously interpretable, whereas an ALL unit is not interpretable in isolation. Prosodically, ALL has a flat or falling F0 contour, with short or medium syllabic duration and low intensity. ALLs have free distribution, but the preferred position changes depending on the language/culture. We cannot say that ALLs lose their lexical value, since they are normally proper nouns, nicknames, titles or epithets, and the lexicon is very important for recognizing their function.

(8) bfamdl01, 496
FLA: vai esse / né / Rena //
%inf: COM, ALL
this one / right / Rena [=short for Renata]

2.2.5. Expressive (EXP)

An Expressive is a dialogic information unit that has the function of providing emotional support for the utterance and serves as a way of sharing social cohesion with the interlocutor. Prosodically, EXP shows various possibilities because of its function of expressing emotional value. The F0 can have a rising, a modulated or a flat contour. EXPs have free distribution, but their preferred position is the initial one. Their frequency is very much language/culture specific, as we will show later.

(9) bfammn05, 35
CAR: ah / a história dele é muito bonita também //
%inf: EXP, COM
ah / his history is very nice too

2.2.6. Discourse Connector (DCT)

A Discourse Connector is a dialogic information unit that has the function of signaling the continuity of the discourse while establishing a relation between the previous and following units. A DCT links two utterances without emotional contrast (in this respect it functions in the opposite way to the Incipit). Prosodically, DCT is characterized by a flat or rising F0 contour. The syllabic duration is long and the intensity is high in relation to the Comment. It can occur at the beginning of the dialogic turn or utterance.

Dialogic units are always produced in dedicated intonation units, i.e. an expression that performs any of the above-mentioned functions is entirely contained within a single prosodic unit, preceded and followed by a prosodic boundary. The pragmatic function is performed by the entire unit, and not by one lexical item inside a larger sequence hosted by a single intonation unit. No dialogic unit has prosodic or pragmatic autonomy, which means that it is not interpretable in isolation; autonomy is a characteristic of the Comment only. Therefore dialogic units must be inserted in an utterance that features a Comment unit and that may also feature other information units (with textual functions).

(10) bpubmn01, 72-73
CAR: então / a orientadora ela nũ quer fazer o papel da coordenadora //
%inf: COM, COM
so / the counselor she doesn’t wanna play the role of the coordinator
CAR: e / vice-versa //
%inf: DCT, COM
and / vice versa

3. Methods

We present two samples of spoken corpora that were tagged at the information structure level according to the Language into Act Theory. The Italian sample comes from the Italian section of C-ORAL-ROM (Cresti and Moneglia 2005) and the Brazilian sample comes from C-ORAL-BRASIL (Raso and Mello 2012). The samples come from the informal sections of oral corpora containing a broad variety of communicative situations and were selected to allow a strict comparison with each other. The corpora are representative of spontaneous speech, recorded in natural, uncontrolled communicative situations. Recordings are transcribed in CHAT format (MacWhinney 2000) with annotation of prosodic boundaries (Moneglia and Cresti 1997). The prosodic boundary annotation was validated in both corpora (Moneglia, Scarano, and Spinu 2005; Raso and Mittmann 2009). Table 3-1 shows the symbols used in the annotation of prosodic phenomena.


Table 3-1: Prosodic break annotation scheme in C-ORAL corpora

Signal   Meaning
?        Delimits a prosodically autonomous sequence (utterance) with a clear interrogative prosodic contour.*
…        Delimits a prosodically autonomous sequence (utterance) voluntarily interrupted by the speaker with a suspensive prosodic contour.*
+        Signals unintentionally interrupted sequences. In this case, the speaker’s program is broken and the interpretability of the sequence can be compromised.
//       Indicates a terminal prosodic boundary; marks all prosodically autonomous sequences (utterances) that do not belong to the previous classes.
/        Signals a non-terminal prosodic boundary; it delimits an intonation unit.
[/n]     Represents retracting phenomena (i.e. false starts), where n corresponds to the number of retracted words. Retracting marks can be considered a type of non-terminal break, but the words involved in false starts contribute neither to the informational patterning nor to the semantic content of the utterance.

* Used only in C-ORAL-ROM. For the same cases, C-ORAL-BRASIL uses the // sign.
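The annotation scheme in the table lends itself to a small tokenizer. The following is a minimal sketch, assuming the symbols appear in a transcript exactly as listed; the grouping into four break types and all names (`BREAK`, `break_type`, `count_breaks`) are assumptions made for illustration, and the sample string is made up apart from its first utterance, which repeats example (8).

```python
import re
from collections import Counter

# The alternatives are tried in order, so "[/n]" and "//" are matched
# before the single "/" that they both contain.
BREAK = re.compile(r"\[/\d+\]|//|/|\?|…|\+")


def break_type(mark: str) -> str:
    """Map one annotation symbol to a (hypothetical) break category."""
    if mark.startswith("[/"):
        return "retracting"
    if mark == "/":
        return "non-terminal"
    if mark == "+":
        return "interrupted"
    return "terminal"  # "//", "?" and "…"


def count_breaks(transcript: str) -> Counter:
    """Count the prosodic breaks of each category in a transcript."""
    return Counter(break_type(m.group()) for m in BREAK.finditer(transcript))


# Illustrative input only: everything after the first "//" is invented.
sample = "vai esse / né / Rena // o [/1] o que sabe ? então +"
print(count_breaks(sample))
```

Counting terminal breaks gives the number of concluded utterances, while non-terminal breaks plus terminal breaks approximate the number of intonation units, which is how figures such as those reported for the two samples could be derived.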

The Italian sample⁵ contains 29,414 words, 5,286 utterances and 11,517 prosodic/information units. The Brazilian Portuguese sample⁶ has 31,318 words, 5,483 utterances and 9,825 prosodic/information units. Samples are balanced for type of communicative interaction (see Table 3-2).

Table 3-2: Number of recorded sessions according to communicative context and type of interaction

Communicative context   Type of interaction   Brazilian   Italian
Family/Private          Monologue             6           5
Family/Private          Dialogue              5           6
Family/Private          Conversation          4           3
Public                  Monologue             1           2
Public                  Dialogue              2           2
Public                  Conversation          2           2
Total                                         20          20

5 The Italian sample is identified in the online database (IPIC) as MiniCorpus_Ita.
6 The Brazilian Portuguese sample is identified in the online database as Brasiliano.
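The sample sizes reported above can be turned into per-utterance averages, which makes the two samples easier to compare. The sketch below uses only the figures given in the text; the dictionary layout and the helper name `per_utterance` are assumptions for illustration.

```python
# Figures as reported for the two samples.
samples = {
    "Italian": {"words": 29_414, "utterances": 5_286, "units": 11_517},
    "Brazilian Portuguese": {"words": 31_318, "utterances": 5_483, "units": 9_825},
}


def per_utterance(sample: dict, key: str) -> float:
    """Average number of `key` items (words or units) per utterance."""
    return sample[key] / sample["utterances"]


for name, sample in samples.items():
    units = per_utterance(sample, "units")
    words = per_utterance(sample, "words")
    print(f"{name}: {units:.2f} units and {words:.2f} words per utterance")
```

On these figures the Italian sample averages about 2.18 information units per utterance against about 1.79 for the Brazilian Portuguese sample, while the number of words per utterance is similar in both.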


The samples were manually annotated for information functions, according to the information units proposed by the Language into Act Theory. We extracted the data through IPIC (first release), a theoretically bound XML database designed for the study of linear relations among Information Units in spoken language corpora (Panunzi and Gregori 2011). The tagset is available at the database website (IPIC 2012). The data were tabulated, and we analyzed the frequencies and distribution of all information units in both samples.

4. Results

According to the L-AcT framework, pragmatics operates at two levels:

(i) The macro-pragmatic level is related to the production of Speech Acts (Austin 1962). It organizes the speech flow into pragmatically autonomous linguistic sequences (utterances).
(ii) The micro-pragmatic level is related to the patterning of information within the utterance. It organizes utterances into patterns of information units (IU).

Results show a prevalence of compound utterances in Italian (36%) in comparison with Brazilian Portuguese (29%) that is statistically significant (chi-square=52,848 – p

Subtitle 5 (00:02:59,012 --> 00:03:02,607)
E logo descobri que não pertencia mais ao mundo dos vivos. [Soon I found I no longer belonged to the world of the living]
TAGGED SUBTITLES:
E logo descobri que não pertencia
mais ao mundo dos vivos.

Subtitle 11 (00:03:36,349 --> 00:03:38,977)
E minha história estava apenas recomeçando. [My story was just starting over]
TAGGED SUBTITLES:
E minha história estava
apenas recomeçando.

Subtitle 98 (00:15:44,109 --> 00:15:48,375)
que vinha do fundo da minha alma, mas fui ouvido. [that came from the bottom of my soul, but I was being heard.]
TAGGED SUBTITLES:
que vinha do fundo
da minha alma, mas fui ouvido.
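The subtitles above follow the usual SubRip layout: an index, a timing line, and one or two lines of text. A minimal sketch of how such a file could be read before tagging is given below; the `Subtitle` class and `parse_srt` function are hypothetical names, and the parser assumes well-formed blocks separated by blank lines.

```python
import re
from dataclasses import dataclass


@dataclass
class Subtitle:
    index: int
    start: str
    end: str
    lines: list[str]  # one entry per displayed line


TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(text: str) -> list[Subtitle]:
    """Parse SubRip text into Subtitle records, skipping malformed blocks."""
    subtitles = []
    for block in re.split(r"\n\s*\n", text.strip()):
        rows = [row.strip() for row in block.splitlines() if row.strip()]
        if len(rows) < 3 or not rows[0].isdigit():
            continue
        timing = TIMING.fullmatch(rows[1])
        if timing is None:
            continue
        subtitles.append(
            Subtitle(int(rows[0]), timing.group(1), timing.group(2), rows[2:])
        )
    return subtitles


sample = """5
00:02:59,012 --> 00:03:02,607
E logo descobri que não pertencia
mais ao mundo dos vivos.

11
00:03:36,349 --> 00:03:38,977
E minha história estava
apenas recomeçando.
"""

for sub in parse_srt(sample):
    print(sub.index, sub.start, sub.end, " / ".join(sub.lines))
```

Keeping the displayed lines as separate list entries preserves the line break, which is the position that segmentation tags describe.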

4. Tagging and Analyzing a Subtitle File

This section presents our complete tag-set for the analysis of segmentation in subtitling, along with a brief analysis of the segmentation in the film Our Home.

4.1. Visual segmentation problems

Visual segmentation problems occur when the subtitle does not follow the cut and continues in the next frame or, in the language of subtitlers, when the caption “leaks”. Based on these problems, a label for visual segmentation problems was created.

4.2. Rhetorical segmentation problems

Rhetorical segmentation problems occur when the subtitle anticipates or delays information because it does not follow the flow of speech, including hesitations and pauses. For Diaz-Cintas and Remael (2007, 179), this type of segmentation may reflect the dynamics of dialogues: “Good rhetorical segmentation helps conveying surprise, suspense, irony, hesitation, etc.” Two tags were defined to address these problems: one for anticipations and one for delays.

4.3. Linguistic segmentation problems

In order to help viewers grasp the meaning quickly, subtitlers write phrases, sentences and clauses that represent a complete thought, so that viewers have time to read the subtitle, look at the image and feel comfortable with this movement, which is repeated quite often when one watches a subtitled film. Linguistic segmentation problems occur when the internal constituents of these phrases, sentences and clauses are broken in a way that does not allow them to be read as a complete thought. Based on these problems, it was possible to define a tag for linguistic segmentation problems, which indicates that the line break occurred in a noun phrase (SN), verb phrase (SV), prepositional phrase (SP), adverbial phrase (SAdv), adjective phrase (SAdj), coordinate clause (COORD) or subordinate clause (SUBORD). Figure 3 shows a shot from Our Home illustrating the segmentation problem at the level of the noun phrase: Aquela (that) should have been shown in the second line.

Segmentation Tags


Fig. 3: Segmentation problem in the noun phrase. Source Our Home (2010)

The problems regarding subordinate and coordinate clauses occur when the conjunction is not shown together with the second clause. It can also be observed when there is a break between the negative particle and the clause. Figure 4 shows a segmentation problem at the clause level. The conjunction quando (when) should be in the second line.

Fig. 4: Segmentation problem in the subordinate clause. Source Our Home (2010)

Chapter Four


4.3.1. Noun phrase segmentation problems

- When there is a break between pre-modifiers and nouns
e.g.: Pelo menos tira aquela/sensação de fome.
At least it eliminates that/feeling of hunger.

- When there is a break between noun and modifier, or in reverse order, modifier and noun
e.g.: Seu aparelho/gastrointestinal estava...
Your gastrointestinal/tract was...

- When there is a break between superlative and adjective
e.g.: Muito mais/bem disposto, pelo visto.
In a much/better mood, I presume.

- When there is a break between relative pronoun and incomplete clause
e.g.: A vida na Terra é que/é uma cópia daqui, André.
It is life on earth which/is a copy of here, André.

- When there is a break between title and proper noun
e.g.: Você pode procurar o irmão/Genésio no Ministério do Auxílio.
You may look for brother/Genésio in the Ministry of Aid.

- When there is a break in the internal structure of collocations, conventional expressions and idioms
e.g.: que vinha do fundo/da minha alma, mas fui ouvido.
that came from the bottom/of my soul, but I was heard.

4.3.2. Prepositional phrase segmentation problems

- When there is a break between preposition and noun
e.g.: O que sabe sobre/a medicina espiritual?
What do you know about/spiritual medicine?

4.3.3. Verb phrase segmentation problems

- When there is a break between two or more verbs, whether they are auxiliary, modal or main
e.g.: O amigo parece ter/compreendido o sentido da água,
You seem to have/understood the meaning of the water

- When there is a break between verb and adverb
e.g.: Já me perdi/muito por essas trilhas.
I’ve gotten lost/a lot on these paths

- When there is a break between negative particle (not, nor etc.) and verb
e.g.: E se eu não/quiser entender, vó?
And if I do not/want to understand, grandma?

- When there is a break between object pronoun (preceded or not by verb) and verb
e.g.: O ministro vai/nos receber em breve.
The Minister will/meet us soon.

4.3.4. Adverbial phrase segmentation problems

- When there is a break inside the adverb structure
e.g.: Um ato realizado durante/longos e longos anos,
An act performed during/many many years,

4.3.5. Adjective phrase segmentation problems

- When there is a break between noun and adjective
e.g.: com a separação/temporária da morte,
with temporary/separation of death,


4.3.6. Coordinate clause segmentation problems

- When there is a break between conjunction (and, but, so etc.) and coordinate clause
e.g.: vamos entrar e você/fala com o governador.
let us go inside and you/speak to the governor

4.3.7. Subordinate clause segmentation problems

- When there is a break between subordinate conjunction (when, while, that, why etc.) and clause
e.g.: Tudo perde o sentido quando/a gente acorda depois de morrer.
Everything loses its meaning when/we wake up after dying
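Some of the break types listed above leave a function word stranded at the end of the first line, which a simple lookup can flag as a candidate for manual tagging. The sketch below is a rough heuristic, not the tagging procedure used in this study: the word list is a tiny hypothetical sample drawn from the examples, and the function name `flag_line_break` is an assumption. Only the category labels (SN, SP, SV, SUBORD) come from the text.

```python
from typing import Optional

# Hypothetical, deliberately tiny lexicon built from the examples above.
DANGLING = {
    "aquela": "SN",      # pre-modifier separated from its noun
    "sobre": "SP",       # preposition separated from its noun
    "não": "SV",         # negative particle separated from the verb
    "quando": "SUBORD",  # subordinating conjunction separated from its clause
}


def flag_line_break(first_line: str, second_line: str) -> Optional[str]:
    """Return a candidate category if the first line ends in a stranded word."""
    words = first_line.strip().split()
    if not words or not second_line.strip():
        return None  # one-line subtitles have no line break to check
    last_word = words[-1].strip(".,;:!?").lower()
    return DANGLING.get(last_word)


print(flag_line_break("O que sabe sobre", "a medicina espiritual?"))
print(flag_line_break("Tudo perde o sentido quando", "a gente acorda depois de morrer."))
print(flag_line_break("Pelo menos.", ""))
```

A lookup of this kind misses most of the cases described in this section, such as breaks between two verbs or inside collocations, which need syntactic analysis or human judgment; it is useful at most for pre-selecting subtitles to inspect.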

Table 2 gives a synopsis of the segmentation tags.

Table 2: Synoptic table with proposed segmentation tags

INDICATIVE TAG OF LINGUISTIC SEGMENTATION PROBLEM (Grammatical)

INDICATIVE TAGS OF RHETORIC SEGMENTATION PROBLEM

INDICATIVE TAG OF VISUAL SEGMENTATION PROBLEM

TAGS FOR NOUN PHRASE ANALYSIS (SN)

TAG FOR PREPOSITIONAL PHRASE ANALYSIS (SP)

TAGS FOR VERB PHRASE ANALYSIS (SV)


TAG FOR ADVERBIAL PHRASE ANALYSIS (SAdv)

TAG FOR ADJECTIVE PHRASE ANALYSIS (SAdj)

TAGS FOR COORDINATE CLAUSE ANALYSIS (COORD)

TAGS FOR SUBORDINATE CLAUSE ANALYSIS (SUBORD)

For the film Our Home subtitlers produced 1132 subtitles. The analysis of these subtitles revealed 88 segmentation problems. They were largely concentrated at the level of phrases, especially in verb phrases, when there is a break in the linguistic constituent verb + verb. Figure 5 illustrates the results.

[Chart “Segmentation Problems”; labels as extracted: SP, PROSEGR, SUBORD (2%, 8%, 4%); SAdv 7%; SUBORD 8%; SV 41%; SAdj 10%; SN 26%]

Fig. 5: Results of segmentation problems in the film Our Home

In our view, the graph in Figure 5 shows that the tagging process was relevant for a systematic analysis of the segmentation and for perceiving patterns and regularities in the corpus, such as the fact that segmentation problems are less frequent in clauses than in phrases. This regularity makes us suppose that clauses, i.e. more complex units of language, are more intuitive to segment than phrases, which are less complex units of language.


5. Final remarks

The tag-set presented in this study was first defined in Chaves’s Master’s thesis (2012), which also proposed tags for the analysis of technical subtitling parameters such as number of lines, number of characters per second and subtitle rate. The tagging process was of utmost importance because it made it possible to carry out a systematic analysis of segmentation and to perceive patterns and regularities in the corpus, such as the fact that segmentation problems are less frequent in clauses than in phrases. These results lead us to think that subtitlers have some difficulty analyzing phrases, given the dynamics of linguistic constituents, which may perform similar syntactic roles depending on the context in which they are inserted. Coordinate and subordinate clauses, however, appear to have been analyzed with ease, because they present an intuitive structure formed of clauses linked by easily identifiable conjunctions. These findings suggest that more attention must be given to the analysis of linguistic segmentation at the level of phrases, with a view to a better understanding of the features of these constituents. This should be helpful for the training of novice subtitlers.

A proposal of segmentation tags for the analysis of subtitles is seminal and of great value for both Corpus Linguistics and Translation Studies, since it serves as a resource for expert and novice translators and scholars who aim to develop greater awareness and better founded translation practices. Given the scope of corpus-based research, as well as the role played by tags in this study, other corpora could receive the same empirical treatment, for instance in the analysis and description of other subtitling parameters such as subtitle rate, condensation and explicitation. Moreover, this study offers resources for solving problems and cases previously investigated.

To conclude, without exhausting the subject, it is reasonable to state that Corpus Linguistics is multifaceted, because a corpus-based analysis can stimulate discussion, develop methodologies and systematize analyses.

References

Baker, M. 1996. Corpus-based translation studies: the challenges that lie ahead. In Somers, H. (ed.). Terminology, LSP and translation. Amsterdam, Philadelphia: John Benjamins, 175-187.
Chaves, E. G. 2012. Legendagem para Surdos e Ensurdecidos: Um estudo Baseado em Corpus da Segmentação nas legendas de filmes brasileiros em DVD. Unpublished MA Thesis, State University of Ceará, Fortaleza, Brazil.
Chaves, E. G. 2009. Legenda para Surdos no Brasil: uma análise baseada em corpus. Unpublished paper. State University of Ceará, Fortaleza, Brazil.
Diaz-Cintas, J.; Remael, A. 2007. Audiovisual Translation: Subtitling. Manchester: St. Jerome Publishing.
Feitosa, M. P. 2009. Legendagem comercial e legendagem pirata: um estudo comparado. Unpublished Doctoral Dissertation, Federal University of Minas Gerais, Belo Horizonte, Brazil.
Kalantzi, D. 2008. Subtitling for the Deaf and Hard of Hearing: A corpus-based methodology for the analysis of subtitles with a focus on segmentation and deletion. Unpublished Doctoral Dissertation, School of Languages, Linguistics and Cultures of the University of Manchester.
Karamitroglou, F. 1998. A Proposed Set of Subtitling Standards in Europe. In Translation Journal, 2(2):1-15. http://translationjournal.net/journal//04stndrd.htm
Nosso Lar. 2010. Director: Wagner de Assis. Brasil: Fox do Brasil.
Perego, E. 2008. What Would We Read Best? Hypotheses and Suggestions for the Location of Line Breaks in Film Subtitles. In The Sign Language Translator and Interpreter. Manchester: St. Jerome Publishing, 35-63.
Perego, E. 2003. Evidence of explicitation in subtitling: towards a characterization. In Across Languages and Cultures, A Multidisciplinary Journal for Translation and Interpreting Studies. Budapest: Akadémiai Kiadó, 4(1): 63-88.
Perini, M. A. 2010. Gramática do português brasileiro. São Paulo: Parábola Editorial.
Scott, M. 2008. Wordsmith Tools. http://www.lexically.net/wordsmith/index.html
SubRip 1.50 Beta 4. 2006. www.divx-digest.com/software/subrip.html

PART IV NATURAL LANGUAGE PROCESSING AND CORPORA

CHAPTER FIVE

AUTOMATIC EXTRACTION OF SUBCATEGORIZATION FRAMES FROM PORTUGUESE CORPORA

LEONARDO ZILIO¹, ADRIANO ZANETTE² AND CAROLINA SCARTON³

1. Introduction

The task of extracting subcategorization frames (SCFs) from corpora, which can be seen as a task of lexical acquisition, is a challenge for Natural Language Processing (NLP) that has been addressed for many languages. Besides being useful for the classification of linguistic elements (such as verbs and nouns), addressing this task may help with several other tasks of language description, such as improving a parser’s performance (Korhonen, Krymolowski, and Briscoe 2006). For example, the best-known and most accurate parser for Portuguese is PALAVRAS (Bick 2000). Since it is rule-based, its precision and recall depend on the rules implemented, and, since it is impracticable to define rules for all phenomena in a language, automatic lexical acquisition could be an alternative way to improve its results. In addition, automatic lexical acquisition could be useful for automatic verb classification (Sun and Korhonen 2009; Sun et al. 2010; Scarton and Aluísio 2012; Scarton 2013), information extraction (Surdeanu et al. 2003), and other tasks.

Bearing this perspective in mind, this paper presents a system for the identification, extraction, and organization of subcategorization frames for Portuguese (Zanette 2010; Zanette, Scarton and Zilio 2012; Zilio, Zanette and Scarton 2012), which is an adaptation for Portuguese of previous work by Messiant (2008). We also briefly describe the use of this system in two corpus-based studies of language description. The objectives of this paper are thus: 1) to describe the system for extracting SCFs; and 2) to report its application in two corpus-based studies.

In the next section we present a definition of SCF and single out some studies on the subject. In Section 3 we introduce the concept of semantic roles. Section 4 briefly describes how the system for extracting SCFs works. This is followed, in Section 5, by the two corpus-based studies and their results. The last section is reserved for our final remarks.

1 Institute of Linguistics/UFRGS.
2 Institute of Informatics/UFRGS.
3 NILC/ICMC/USP.

2. Subcategorization Frames

A subcategorization frame (SCF) is the syntactic representation of a sentence or phrase which, when observed across large amounts of text, allows for the classification of certain linguistic elements. In this paper the relevant linguistic elements are verbs. Thus, the subcategorization frames represent the syntactic structures of sentences in one or more corpora. Figure 1 shows some examples of SCFs of the verb “encontrar” [to find] extracted from a Cardiology corpus, which will be described later in this paper.

Fig. 1: SCFs of the verb “encontrar” [to find] in a Cardiology corpus

SUBJ[NP] = subject; V = verb; NP = nominal phrase; PP[prep] = prepositional phrase (the relevant preposition is in square brackets); REFL = reflexive pronoun; ADJP = adjective phrase.


With this type of information, it is possible to classify verbs (or other elements) into groups that share all or some of their properties. That is why it is important to develop systems that can extract SCFs from corpora for as many languages as possible. So far, systems for the extraction of subcategorization frames have been developed for languages such as English (Briscoe and Carroll 1997; Korhonen, Krymolowski, and Briscoe 2006), French (Messiant 2008), German (Schulte im Walde 2002), and Italian (Ienco, Villata, and Bosco 2008), among others. In English, these studies have already been applied, for example, to the expansion of VerbNet (Kipper 2005; Kipper et al. 2006), a repository of verbs with semantic role annotation. Most of these studies focus on verbs, but some also deal with other types of SCFs. For Portuguese, where such developments are still in their infancy, there is a system developed by Augustini (2006), which addresses nouns, prepositions, and adjectives in addition to verbs. The only restriction in this system is that the syntactic patterns must be given a priori by the researcher. Thus it acts more as a matcher of superficial structures to syntactic ones. In this paper, the SCFs are frequently seen as a base to which semantic information is added. This semantic information appears in the form of semantic roles, which are described in the next section.

3. Semantic Roles

As this topic will appear recurrently, we believe it is important to explain what semantic roles are. In short, semantic roles are a means of identifying an abstract and schematic meaning of the arguments connected to a linguistic element (in our case, a verb). Semantic roles act as a semantic level that can be added to the SCFs of the verb. As an example, let us take the very simple sentence “John saw Mary”. In this sentence, we recognize that there is a verb (see) and two arguments (John and Mary). At the SCF level, we have NP_V_NP (the arguments being a subject and a direct object). But if we go to the semantic role level (which can be situated somewhere between the syntactic and the semantic levels), we find that the argument John carries the role of an experiencer, because the verb see indicates that someone is experiencing something through one of his or her sensory functions. The argument Mary is a bit more complicated to analyze, since there is more than one possibility, depending on the set of semantic roles; for now, let us assume that the role is that of an experienced (making explicit that it is the direct complement to the experiencer role).
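The two levels of description in the “John saw Mary” example can be represented with a small data structure. The sketch below is purely illustrative: the `Argument` class and the `scf` function are hypothetical names, and the role labels simply repeat the ones assumed in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    text: str      # surface form of the argument
    category: str  # syntactic label used at the SCF level
    role: str      # semantic role added on top of the SCF


def scf(before_verb: list[Argument], after_verb: list[Argument]) -> str:
    """Build the frame string from the arguments around the verb."""
    labels = [a.category for a in before_verb] + ["V"] + [a.category for a in after_verb]
    return "_".join(labels)


john = Argument("John", "NP", "experiencer")
mary = Argument("Mary", "NP", "experienced")

print(scf([john], [mary]))                   # syntactic level
print([(a.text, a.role) for a in (john, mary)])  # semantic role level
```

Keeping the role as a separate field reflects the idea that semantic roles are a layer added to the SCF rather than part of the frame itself.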


There are several theories and sets of semantic roles and there is no agreement about which one is the best. Therefore, each researcher decides to use the set that best fits his or her research. Having said that, the semantic roles used in this study were taken from VerbNet (Kipper 2005).

4. System for Extracting Subcategorization Frames

As said before, the SCFs Extractor system extracts verbal SCFs from texts written in Portuguese. It was developed by Zanette (2010) and is an adaptation of Messiant’s system for French verbs (Messiant 2008). The system is built in four modules: Reader, Extractor, Builder and Filter. It uses corpora annotated by the parser PALAVRAS (Bick 2000), and, after the input is processed by all four modules in sequence, a list of SCFs is produced for each verb in the corpus. The output also contains frequency information for all verbs and SCFs. All the extracted information is stored in a database, allowing for future searches. In Figure 2 we can see how the modules are connected in the system.

Fig. 2: System overview

Reader: This module has the sole function of retrieving each sentence in the annotated corpus and delivering it to the Extractor module, which then processes it. This is done so that the system can work with many input formats (text files, databases, etc.).

Extractor: For each finite verb (verbs labeled with the “VFIN” tag) in the sentence, this module extracts all its dependencies (i.e. elements linked to the verb, according to parser annotation) and tries to classify each dependency as one of the following arguments:
• SUBJ[NP]: subject of the sentence;
• NP: for phrases that have nouns as head elements;

Chapter Five


• PP[prep]: for phrases that have prepositions as head elements (the main preposition of the phrase is also presented in square brackets);
• REFL: for reflexive pronouns;
• ADJP: for phrases that have adjectives as head elements;
• SINF: for infinitive clauses.

For each argument, its position in the sentence is extracted and a relevance value is attributed to it according to the relation between verb and argument as follows:
• 1 for subject;
• 2 for direct object and reflexive pronoun;
• 3 for indirect object;
• 4 for adjective (separated from a noun);
• 5 for adverbial adjunct.

The Extractor module further identifies if a frame is in the passive or active form.

Builder: Using the extracted arguments from the output of the previous module, this module constructs an SCF and stores it in a database. An SCF is composed of the target verb, the frame (which contains verb arguments extracted by the previous module), and absolute and relative frequencies. Arguments can be distributed in an SCF according to two distinct criteria:

• Position: arguments are sorted by position of argument in the sentence;

• Relevance: arguments are sorted by the argument relevance value attributed by the Extractor.
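The two ordering criteria can be illustrated with a minimal sketch. It is not the system's actual code: the `Argument` class, the `build_frame` function, the underscore-joined frame string (which leaves out the verb slot), and the tie-breaking by position are all assumptions made for illustration. Only the relevance ranks and the “INTRANS” marker for empty frames come from the chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    label: str      # e.g. "SUBJ[NP]", "NP", "PP[com]"
    position: int   # position in the sentence (hypothetical unit: token index)
    relevance: int  # 1 = subject ... 5 = adverbial adjunct, as ranked by the Extractor


def build_frame(arguments, criterion="relevance"):
    """Join argument labels into a frame string, ordered by the chosen criterion."""
    if not arguments:
        # The Builder marks frames without arguments as INTRANS.
        return "INTRANS"
    if criterion == "position":
        key = lambda a: a.position
    elif criterion == "relevance":
        # Assumption: ties on relevance are broken by sentence position.
        key = lambda a: (a.relevance, a.position)
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return "_".join(a.label for a in sorted(arguments, key=key))


# Hypothetical arguments of one verb occurrence.
args = [
    Argument("PP[com]", position=5, relevance=3),
    Argument("NP", position=0, relevance=2),
    Argument("SUBJ[NP]", position=3, relevance=1),
]
by_position = build_frame(args, "position")
by_relevance = build_frame(args, "relevance")
```

With relevance ordering the subject always comes first, whatever its place in the sentence, which is the behaviour the annotation study in Section 5.2 relies on.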

When the Extractor outputs no arguments, the Builder marks the SCF as “INTRANS”.

Filter: Since all steps are automated, the Builder outputs some wrong SCFs. This happens for many reasons, such as parsing errors or a wrong distinction between verb arguments and adjuncts (a task that is hard even for human beings (Messiant 2008)). In this paper, we follow the hypothesis put forward by Messiant (2008) that all real arguments tend to occur in positions of arguments more frequently than adjuncts. This means that if a linguistic element is an argument, it will appear much more often than if it


is an adjunct. For example, in Portuguese, direct transitive verbs normally present “NP_V_NP” as the most frequent SCF; examples of such verbs are “encontrar” (to find), “mostrar” (to show), and “realizar” (to accomplish). This can also be seen in the case of intransitive verbs, which normally present “NP_V” as the most frequent SCF, like “ocorrer” (to occur), “acontecer” (to happen), or “morrer” (to die). Following Messiant’s assumption, the simple filtration of SCFs with lower frequencies should improve the output. The system has three filter types:

• Absolute verb frequency filter: this filter eliminates all SCFs that belong to a verb with a frequency lower than a predefined threshold;
• Absolute frame frequency filter: this filter eliminates all SCFs that have a frequency lower than a predefined threshold; and
• Relative frame frequency filter: this filter eliminates all SCFs that have a relative frequency lower than a predefined threshold. The relative frequency of an $\mathit{scf}_i$ with a $\mathit{verb}_j$ is calculated as follows:

$$\mathrm{RelFreq}(\mathit{scf}_i, \mathit{verb}_j) = \frac{|\mathit{scf}_i, \mathit{verb}_j|}{|\mathit{verb}_j|}$$

where $|\mathit{scf}_i, \mathit{verb}_j|$ is the number of occurrences of $\mathit{scf}_i$ with $\mathit{verb}_j$, and $|\mathit{verb}_j|$ is the total number of occurrences of $\mathit{verb}_j$ in the corpus.
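As an illustration of the relative frame frequency filter, the sketch below computes RelFreq for every verb and frame pair and keeps the pairs at or above a threshold. The function name, the input format (one `(verb, scf)` pair per verb occurrence) and the sample counts are assumptions for this example, not part of the described system.

```python
from collections import Counter


def relative_frequency_filter(observations, threshold):
    """Keep (verb, scf) pairs whose relative frequency is at least `threshold`.

    observations: iterable of (verb, scf) pairs, one per verb occurrence.
    Returns a dict mapping each kept pair to its relative frequency.
    """
    pair_counts = Counter(observations)
    verb_counts = Counter()
    for (verb, _scf), count in pair_counts.items():
        verb_counts[verb] += count

    kept = {}
    for (verb, scf), count in pair_counts.items():
        rel_freq = count / verb_counts[verb]  # |scf_i, verb_j| / |verb_j|
        if rel_freq >= threshold:
            kept[(verb, scf)] = rel_freq
    return kept


# Hypothetical counts: 10 occurrences of "encontrar" spread over three frames.
observations = (
    [("encontrar", "NP")] * 8
    + [("encontrar", "PP[em]")]
    + [("encontrar", "NP_PP[com]")]
)
kept = relative_frequency_filter(observations, threshold=0.15)
```

In this toy input only the NP frame survives, since the other two frames each account for 10% of the verb's occurrences.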

The system was evaluated using the NILC lexicon (Muniz and Nunes 2004) as the gold standard. This lexicon has more than 1.5 million inflected entries for the Portuguese language. For verbs, it has special annotations like TD (direct transitive verbs), TI (indirect transitive verbs), INT (intransitive verbs), and BI (bitransitive verbs). For each verb with an indirect transitive element, it also presents the preposition that appears as complement of the verb. From these annotations, one can infer the following mappings for SCFs:
• TD => NP
• TI => PP[prep]
• BI => NP_PP[prep]
• INT => INTRANS

The “prep” element is replaced by each preposition that appears in the verb entry during the mapping.
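The mapping from lexicon annotations to gold-standard SCFs could be expressed roughly as follows. The function name and the shape of the input (a list of tags plus a list of prepositions) are assumptions; the real NILC lexicon format is not described here.

```python
def gold_frames(annotations, prepositions=()):
    """Map transitivity tags (TD, TI, BI, INT) to the SCF labels listed above.

    prepositions: the prepositions given in the verb entry; each one
    replaces the "prep" placeholder for the TI and BI tags.
    """
    frames = set()
    for tag in annotations:
        if tag == "TD":
            frames.add("NP")
        elif tag == "INT":
            frames.add("INTRANS")
        elif tag == "TI":
            frames.update(f"PP[{prep}]" for prep in prepositions)
        elif tag == "BI":
            frames.update(f"NP_PP[{prep}]" for prep in prepositions)
        else:
            raise ValueError(f"unknown tag: {tag}")
    return frames


# Hypothetical entry: a verb tagged TD and BI with the prepositions "a" and "para".
example = gold_frames(["TD", "BI"], prepositions=["a", "para"])
```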


To test the automatic extraction, we used the Bosque corpus, which belongs to the Floresta Sintá(c)tica project (Afonso et al. 2002). Bosque consists of 9,368 sentences in European and Brazilian Portuguese annotated with PALAVRAS in many formats. The system obtained 1,411 distinct verbs, and 6,024 frames for these verbs. Using a relative frequency filter over the 15 most frequent verbs with a cutoff point of 0.008, the system got 57% precision, 61% recall, and 59% f-measure. Figure 3 shows the system’s performance associated with various cutoff points along the relative frequency line.
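For reference, precision, recall and f-measure over extracted versus gold-standard frames can be computed as in the sketch below. Treating the evaluation as a comparison of sets of `(verb, scf)` pairs is an assumption, since the chapter does not spell out the counting scheme. The last line only checks that the reported f-measure follows from the reported precision and recall.

```python
def evaluate(extracted, gold):
    """Return (precision, recall, f_measure) for two collections of (verb, scf) pairs."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical toy data for one verb.
extracted = {("dar", "NP"), ("dar", "NP_PP[a]"), ("dar", "PP[de]")}
gold = {("dar", "NP"), ("dar", "NP_PP[a]"), ("dar", "NP_PP[para]"), ("dar", "INTRANS")}
precision, recall, f_measure = evaluate(extracted, gold)

# Harmonic mean of the reported precision (0.57) and recall (0.61).
f_reported = 2 * 0.57 * 0.61 / (0.57 + 0.61)
```

The harmonic mean of 0.57 and 0.61 rounds to 0.59, which matches the f-measure given in the text.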

Fig. 3: Filter evaluation varying along the relative frequency

The system is currently being adapted to allow for other types of annotation and to identify other types of arguments, like reflexive objects, pronominal indirect objects, clausal direct objects and predicatives, which are sometimes misclassified in the current version. We are also developing other types of filters to better cope with the noise extracted by the Extractor module.

5. Studies Developed with the System

As we stated in the Introduction, this paper shows not only a system for extracting SCFs, but also two applications of this system in different studies. First of all, it is important to make it clear that both studies were carried out almost entirely independently from one another and both of them have made some modifications to the base system, so that sometimes it is possible to recognize elements present in one of the studies that are


not present in the other. Despite these minor modifications, the basic system was the same, and the results we show extend to different fields. In this section we start by presenting a study that deals with a verb lexicon and its results. After that we introduce the second study, which revolves around the annotation of semantic roles in corpora.

5.1. Verb Lexicon

The first study involves the building of a verb lexicon for Brazilian Portuguese (Scarton and Aluísio 2012; Scarton 2013), called VerbNet.Br, which is based on VerbNet (Kipper 2005). VerbNet contains information about SCFs, semantic roles, semantic predicates and selectional preferences of verbs. These verbs are grouped together according to Levin’s verb classes (Levin 1993). So, following Levin’s hypothesis, VerbNet turns out to be flexible across languages because the classes bear a cross-linguistic potential (as shown by Sun et al. (2010) for French, by Merlo et al. (2002) for Italian, and also by Kipper (2005) herself for Portuguese). An example of a class in VerbNet (class leave-51.2) can be seen in Table 1.

The first version of VerbNet.Br was built automatically: the members of Portuguese classes were selected through the use of VerbNet, WordNet (Fellbaum 1998), and WordNet.Br (Dias-da-Silva et al. 2008), which are aligned one to the other. As such, if a synset in WordNet.Br was aligned with another synset in WordNet, the Portuguese synset was imported automatically to VerbNet.Br, along with the semantic roles existing in WordNet (which are aligned with VerbNet). For example, let us consider the class leave-51.2. The verb “to abandon” is a member of this class. This member is mapped to a synset in WordNet, which presents an inter-lingual index 02163637 and whose gloss is the following: (forsake, leave behind) “We abandoned the old car in the empty parking lot”. Then, we search WordNet.Br for one or more synsets in Brazilian Portuguese aligned with the synset of “to abandon”. There is only one such synset in WordNet.Br, with three verbs: deixar “to leave”, abandonar “to abandon” and largar (with the sense of “to leave”). Therefore, these three verbs are selected to fit in the class leave-51.2 of the first version of VerbNet.Br.
However, it was impossible to use the results of the alignments directly, mainly because there is a gap between the VerbNet and WordNet theories. Therefore, we proposed a method to validate the candidate members, thus making a better selection. This method was divided into four stages.
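The alignment-based import described above (VerbNet member to WordNet synset to WordNet.Br synset) can be sketched as a chain of lookups. The dictionaries stand in for the real resources, the function name is made up for this example, and the WordNet.Br identifier `br-0001` is hypothetical; only the index 02163637 and the three Portuguese verbs come from the text.

```python
def candidate_members(class_members, verbnet_to_wordnet, wordnet_to_wordnetbr, wordnetbr_synsets):
    """Collect the Portuguese verbs reachable from a VerbNet class via the alignments."""
    candidates = set()
    for member in class_members:
        for en_synset in verbnet_to_wordnet.get(member, ()):
            for br_synset in wordnet_to_wordnetbr.get(en_synset, ()):
                candidates.update(wordnetbr_synsets.get(br_synset, ()))
    return candidates


# Toy alignment tables (stand-ins for VerbNet, WordNet and WordNet.Br).
verbnet_to_wordnet = {"abandon": ["02163637"]}
wordnet_to_wordnetbr = {"02163637": ["br-0001"]}  # hypothetical WordNet.Br id
wordnetbr_synsets = {"br-0001": ["deixar", "abandonar", "largar"]}

leave_candidates = candidate_members(
    ["abandon"], verbnet_to_wordnet, wordnet_to_wordnetbr, wordnetbr_synsets
)
```

The result is only a set of candidates; as the text explains, these still have to be validated because the alignments are approximations.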


Table 1: Syntactic-Semantic Frames Leave-51.2

Semantic roles and selectional restrictions: Theme [+concrete] and Source [+location - region]
Members: abandon, split
Frames:
Name: Basic Transitive
Example: We abandoned the area.
Syntax: Theme V Source
Semantic:
• motion(during(E), Theme)
• location(start(E), Theme, Initial_Location)
• not(location(end(E), Theme, Initial_Location))
• direction(during(E), from, Theme, Initial_Location)

In the first stage some SCFs were manually translated from English classes into Portuguese. For example, the class equip-13.4.2 presents the frame NP_V_NP_PP[with] (John equipped Brown with a camera). This frame could be translated into Portuguese by replacing the preposition with with com: NP_V_NP_PP[com] (John equipou Brown com uma câmera). We only translated those frames which were subject to a direct translation (such as in the example above).

The second stage dealt with the search for SCFs in corpora, and, for that, we used the system presented in Section 4. Three corpora were considered: Lácio-Ref (Aluísio et al. 2004), which has approximately 9 million words and is divided into four genres: scientific, informative, legal and literary; PLN-BR-FULL (Muniz et al. 2007), which has around 29 million words of the informative genre; and Revista Fapesp Corpus (Aziz and Specia 2011), which has approximately 6 million words of the popular science genre. Among the results of the SCF extractor, we identified 3,779 verbs (with frequencies above 10) and 3,578 different syntactic frames (with frequencies above 5 and parameterized by preposition). Two examples of syntactic frames for the verb encantar (to delight/fascinate) are shown below:

• SUBJ[NP] V NP - Mozart, Villa-Lobos, Britten e Grieg encantam centenas de crianças. [Mozart, Villa-Lobos, Britten and Grieg fascinate hundreds of children.];


• SUBJ[NP] V NP PP[com] - Conseguiram fazer dinheiro com o animal, porque o Gordo bom contador de histórias foi encantando o público com suas narrativas cheias de anjos e peripécias. [They were able to make money with the animal, because Fat Man, a very good story teller, fascinated the public with his tales full of angels and adventures.]

This stage was useful to compare SCFs presented by verbs in corpora with the ones translated manually for each class (as explained in the fourth stage below).

In the third stage members were inherited by VerbNet.Br taking into account the alignments between the other resources. Since each member of VerbNet can be aligned to one or more synsets of WordNet and each synset of WordNet can be aligned to one or more synsets of WordNet.Br, all Portuguese verbs that fit in these alignments were selected. However, the candidate members identified in the third stage could not be considered directly as members of VerbNet.Br. Since there are differences between VerbNet and WordNet theories, so that some alignments of the resources are approximations, it was necessary to filter the candidate verbs in order to identify the real members of VerbNet.Br classes.

Therefore, the fourth stage was responsible for the validation of the selected candidate members. To that end, all the previous stages were combined, following the algorithm:

1. For each class in VerbNet.Br do:
   a. For each candidate member in the current class do:
      i. Search for all SCFs identified for the verb in stage 2.
      ii. Compare these SCFs with the SCFs manually translated in stage 1 for this class:
         1. If the candidate member presents at least 10% of the class’s (manually translated) SCFs in corpora:
            a. Select the verb.
            b. Else, the verb was not considered as a member of this class.

VerbNet.Br presented an f-measure of 60.62%. Table 2 shows an example of a class in VerbNet.Br⁴ (the same class presented in Table 1 for VerbNet). By analyzing Table 2, one would probably question whether the main sense of the verbs abalar and vazar is “to leave”. This example shows that this resource still needs a linguistic post-validation to evaluate if these verbs should be kept in this class or not. One strong point in favor of this inheritance is that not only the semantic roles, but also the selectional restrictions and the semantic predicates were inherited from English, accruing a decent amount of information using a single process. This was possible because of the cross-linguistic potential of Levin’s verb classes.

4 To access VerbNet.Br: http://www.nilc.icmc.usp.br/verbnetbr

Table 2: Syntactic-Semantic Frames Leave-51.2-VerbNet.Br

Semantic roles and selectional restrictions: Theme [+concrete] and Source [+location - region]
Members: abalar, abandonar, deixar, fugir, largar, partir, sair, vazar
Frames:
Name: Basic Transitive
Example: We abandoned the area / Nós abandonamos a área
Syntax: Theme V Source
Semantic:
• motion(during(E), Theme)
• location(start(E), Theme, Initial_Location)
• not(location(end(E), Theme, Initial_Location))
• direction(during(E), from, Theme, Initial_Location)
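The validation algorithm of the fourth stage can be sketched as below. This is one possible reading of the 10% criterion: the share of a class's manually translated SCFs that were also found in corpora for the candidate verb. The function name, the dictionary-based inputs and the sample data (including the candidate verb munir and its frame) are assumptions for illustration.

```python
def validate_members(candidates_by_class, translated_scfs_by_class, corpus_scfs_by_verb, min_share=0.10):
    """Select, for each class, the candidate verbs that show enough of the class's SCFs."""
    validated = {}
    for class_name, candidates in candidates_by_class.items():
        class_scfs = set(translated_scfs_by_class.get(class_name, ()))
        selected = []
        for verb in candidates:
            if not class_scfs:
                continue  # no translated frames to compare against
            verb_scfs = set(corpus_scfs_by_verb.get(verb, ()))
            share = len(class_scfs & verb_scfs) / len(class_scfs)
            if share >= min_share:
                selected.append(verb)
        validated[class_name] = selected
    return validated


# Hypothetical inputs for a single class.
candidates_by_class = {"equip-13.4.2": ["equipar", "munir"]}
translated_scfs_by_class = {"equip-13.4.2": ["NP_V_NP_PP[com]", "NP_V_NP"]}
corpus_scfs_by_verb = {
    "equipar": ["NP_V_NP_PP[com]"],
    "munir": ["NP_V_PP[de]"],
}
validated = validate_members(candidates_by_class, translated_scfs_by_class, corpus_scfs_by_verb)
```

Here equipar shows one of the two translated frames (a share of 50%) and is kept, while munir shows none of them and is dropped.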

5.2. Semantic Role Labelling

The second study that makes use of the system for extracting SCFs deals with the manual annotation of sentences extracted from corpora. The main goal of this study is to generate a list of verbs with their respective SCFs, semantic roles and as many examples as there are relevant sentences in the corpus. So, to put it simply, the study is a mix between VerbNet (Kipper 2005) and PropBank (Palmer, Gildea and Kingsbury 2005), but it is still in its early stages, as we have just conducted some tests with different sets of semantic roles and different rules for the extraction of SCFs.

To annotate real sentences in Brazilian Portuguese, we selected two corpora: one built with Cardiology texts (representative of a specialized language) and the other with texts from Diário Gaúcho, a newspaper which aims at using a language and style that are accessible to most people (representative of non-specialized language). The Cardiology corpus was built by Zilio (2009) and has ca. 1.5 million words; the Diário Gaúcho corpus was compiled by the PorPopular project and totals ca. 1 million


words. The use of corpora representing different instances of language (specialized vs. non-specialized) is related to a broader objective, which is to compare both corpora so as to identify their similarities and differences.

In this study the system for extracting SCFs does not work only as an extractor, but also as an organizer of SCFs and as an interface for the manual annotation process. By organizer we mean that it also has the function of classifying the arguments and displaying them according to their syntactic function, following a predetermined relevance rank. So the system not only extracts the SCFs, but also, for example, always presents the subject arguments before the object arguments, independently of their order in the sentence, as briefly explained in Section 4. With this organization by relevance, it is much easier to consider the syntactic patterns of the sentence for the purpose of semantic role labeling. In addition, there is also an interface for the manual annotation, which is of great help for the human annotator, speeding up the task of looking for the arguments, finding the “normal” structure in the sentence and assigning the semantic roles. In Figure 4 we present the interface, with its ranked disposition of arguments and the drop-down menu for choosing a semantic role.

Fig. 4: Interface for the annotation of semantic roles

As can be inferred from Figure 4, the system organizes the examples of each verb according to the SCF patterns, which were already classified by a ranking system. In this way the system not only provides the information needed for the annotation, but also presents it in a way that speeds up the work. The semantic role list can also be easily modified or changed by the


user, since it is stored in a TXT file where each role is separated by a comma.

The annotation process was divided into two pilot studies, both using the same corpora and the same system for the extraction and annotation, but changing the sets of semantic roles and some rules for the identification of SCFs. We will present both short case studies separately, so as to make it clear which one is which.

First pilot study. In this study we used a set of 46 semantic roles proposed by Brumm (2008). This set was later enlarged by Gelhausen (2010) to 70 roles, but we decided to keep the smaller one for our pilot experiment purposes. The length and scope of this paper do not allow for long explanations, so we will not discuss the roles in particular (they are explained in detail in the studies mentioned above), saving the space to report some results. For this first experiment we chose only four verbs to be annotated in both corpora. Being a very small sample, we decided to annotate verbs which had a similar frequency and, at the same time, were among the fifty most frequent verbs in both corpora. The verbs chosen were encontrar [to find/meet], levar [to bring/carry], receber [to receive], and usar [to use]. After annotating up to ten examples for each SCF with a frequency of ten or above, we ended up with 482 sentences annotated with semantic roles, which were distributed among 138 semantic role patterns. Regarding the comparison between the semantic roles in both corpora, we did not find great differences. Only the verb levar [to bring/carry] showed some differences, by presenting more sentences with roles such as agent and patient in the Diário Gaúcho corpus, while roles such as creator and result were preferred in the Cardiology corpus. Unfortunately, as the sample was too small, we could not draw relevant conclusions.
What we were able to do, on the other hand, was to improve the rules for extraction of SCFs, since we had already looked at many sentences and could spot some problems in them. We also felt that the set of semantic roles was too large, including roles that might never be used. Also, some of the roles had a very narrow meaning, almost approaching the lexical semantic level, which is, from our point of view, not the intention of a semantic role annotation. Some of them occur only in pairs, like qualifitiens and qualitas (i.e. characterized and characteristic), which are very specific to some situations, but overlap with other, broader semantic roles like habitum or favor (i.e. possession or benefit).

Second pilot study. Following our first experiment, we conducted a second one, with a completely different set of semantic roles, and


improved extraction rules, which were the same as those used in the study with the verb lexicon, described in Section 5.1. In this second pilot study, instead of trying to tweak the previous set of 46 semantic roles until it fitted our needs, we decided to go for a completely new list of roles. So we chose to adopt the ones used by Kipper (2005) in VerbNet (version 3.2). We carried out a long study evaluating the roles and the annotation present in VerbNet, and made only minor tweaks, most of them concerning the hierarchy of semantic roles, and not the roles themselves. As to the roles, we only added two to the existing group of 35. Those added were verb (used for describing cases of support verbs or some fixed expressions, in which the complement of the verb has the real function of a verb or builds a complex meaning with the verb) and comparative (used for some adjuncts which denoted comparisons). One important difference between our study and VerbNet is that, provided that they appear with a sufficiently high frequency, we also annotated adjuncts, not only arguments.

This time a broader selection of verbs was annotated, totaling 50 verbs in each corpus. The verbs were the same in both corpora, so that we could make the same comparison between specialized and non-specialized language as in the first pilot study. The jump from four to fifty verbs was possible because of some new resources available for the visualization of the results, such as their conversion to XML (a more user-friendly format) and their presentation in the form of syntactic-semantic frames (which can be seen in Figure 5). We also made only minor changes in the methodology of the annotation in comparison to our first study: this time, the annotation was always made on exactly ten samples for each of the annotated SCFs of each verb.

Fig. 5: Syntactic-Semantic Frames

After the work was completed with the fifty verbs, the annotated sentences added up to 3,400. From this set, 1,790 come from the Cardiology corpus and the remaining 1,610 from the Diário Gaúcho


corpus. With this larger dataset, we were able to use a Kendall’s tau-b⁵ rank correlation test to see if there was a significant correlation between the rankings of semantic roles in the two corpora. The answer was negative for the list of syntactic-semantic frames (e.g. SUBJ+ DIR.OBJ, SUBJ+DIR.OBJ), with τb = 0.031, but positive for the individual syntactic-semantic arguments (e.g. SUBJ, SUBJ , DIR.OBJ), with τb = 0.523. This means that the arguments, taken individually, showed some correlation, but when we looked at the broader sentence frames, this correlation disappeared.

Besides the statistical analysis, there were some interesting qualitative aspects that should not go unnoticed. For example, in the Cardiology corpus we found a much greater tendency to write in the passive voice and to use verbs in their intransitive form (for those which allow for unergative diathesis) than in the Diário Gaúcho corpus. We also observed many more instruments assuming the position of subject, instead of an agent. This leads to the conclusion that the Cardiology corpus has a tendency to suppress the real agents, replacing them with tools and medical examination methods, which confirms the findings of Swales (1990). In fact, the agent is explicit in only 11.06% of the sentences annotated in the Cardiology corpus, against 45.59% in the newspaper corpus.

After this second pilot study we are confident that the set of semantic roles used in VerbNet (version 3.2) is a good choice as a base to carry out a large-scale annotation of Brazilian Portuguese sentences. There are some modifications that need to be addressed, like the still small number of roles for adjuncts, but those can be borrowed and added from the PropBank (Palmer, Gildea and Kingsbury 2005) set without much effort. Therefore the next step for the large-scale annotation is well underway.
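For readers unfamiliar with the statistic, the following is a minimal pure-Python sketch of Kendall's tau-b, the tie-corrected variant of the rank correlation used above. It is a textbook O(n²) formulation, not the tool used in the study, and the frequency lists are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: (concordant - discordant) / sqrt((n0 - ties_x) * (n0 - ties_y))."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # pairs tied on x or y count as neither concordant nor discordant
    n_pairs = len(x) * (len(x) - 1) // 2
    ties_x = sum(c * (c - 1) // 2 for c in Counter(x).values())
    ties_y = sum(c * (c - 1) // 2 for c in Counter(y).values())
    denominator = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denominator == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denominator


# Hypothetical frequencies of the same four items in two corpora.
freq_corpus_a = [1, 2, 2, 3]
freq_corpus_b = [1, 2, 3, 3]
tau = kendall_tau_b(freq_corpus_a, freq_corpus_b)
```

A value near zero, like the 0.031 reported for whole frames, indicates that the two rankings are essentially unrelated. A significance test would need an additional step that this sketch does not cover.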

6. Final remarks

In this paper, we presented an extractor of SCFs as well as two studies that used this system for different purposes. The system, consisting of four modules, has proven to be much more than a simple extractor, since it organizes the information in different ways, and is very easy to tweak if necessary. Regarding the extraction of SCFs itself, the system achieved an f-measure score of 59%, which is a very good performance for a first implementation in Portuguese.

5 This correlation test results in values ranging from -1 (inverse correlation) to +1 (perfect correlation), with zero meaning completely uncorrelated variables.


The two studies, although they shared some common features, were very different in structure. One aimed at building a new resource based on VerbNet, while the other aimed at annotating the semantic roles directly in sentences from the corpora. In the first study, a semi-automatic method was used to build a VerbNet for Portuguese. The SCFs were used to validate the candidate members inherited from the alignments between VerbNet, WordNet and WordNet.Br. This method achieved an f-measure of 60.62%. We know that this is not enough for a resource to be useful; therefore, we are considering the linguistic validation of VerbNet.Br as our future work. The second study we presented involved annotation of semantic frames in sentences from specialized and non-specialized corpora. Two pilot studies were conducted with two different sets of semantic roles. The first one, using a very small sample, served more as an introduction to the annotation, and its results were mostly used to improve the annotation methodology per se. The second study, on the other hand, was more robust, presenting results from which some preliminary conclusions could be drawn.

All that has been presented is, in some way or another, still work in progress. The system for extracting SCFs is currently being modified to accept types of annotation other than the one from PALAVRAS. VerbNet.Br is not finished yet, and could use more information from other sources to expand its base of verbs. The annotation of semantic roles is now moving to another level, where we intend to annotate many more than fifty verbs so that it might serve as one of the sources for a future expansion of VerbNet.Br.

Acknowledgements

We would like to thank FAPESP (process number: 2010/03785-0), CAPES and CNPq for funding, and also the CAMELEON Project (CAPES/Cofecub 707/11) and NILC-ICMC-USP for all the support.

References

Afonso, Susana, Eckhard Bick, Renato Haber, and Diana Santos. 2002. “‘Floresta sintá(c)tica’: a treebank for Portuguese.” Third International Conference on Language Resources and Evaluation (LREC 2002). Las Palmas, Canary Islands, Spain. 1698-1703.
Aluísio, Sandra, Gisele M. Pinheiro, Aline A. M. P. Manfrim, Leandro H. M. de Oliveira, Luiz C. Genoves Jr., and Stella E. O. Tagnin. 2004. “The Lácio-Web: Corpora and Tools to advance Brazilian Portuguese Language Investigations and Computational Linguistic Tools.” 4th International Conference on Language Resources and Evaluation (LREC 2004). Lisboa, Portugal. 1779-1782.
Augustini, Alexandre. 2006. Aquisição Automática de Subcategorização Sintáctico-Semântica e sua Utilização em Sistemas de Processamento da Língua Natural. PhD Thesis, Lisboa, Portugal: Universidade Nova de Lisboa.
Aziz, Wilker, and Lucia Specia. 2011. “Fully Automatic Compilation of Portuguese-English and Portuguese-Spanish Parallel Corpora.” 8th Brazilian Symposium in Information and Human Language Technology (STIL 2009). Cuiabá-MT, Brazil.
Bick, Eckhard. 2000. The Parsing System Palavras: Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus: Aarhus University Press.
Briscoe, Ted, and John Carroll. 1997. “Automatic extraction of subcategorization from corpora.” Fifth Conference on Applied Natural Language Processing. Washington, D.C. 356-363.
Brumm, Torben. 2008. “Erstellung eines Systems thematischer Rollen mit Hilfe einer multiplen Fallstudie.”
Dias-da-Silva, Bento C., Ariani Di Felippo, and Maria das Graças V. Nunes. 2008. “The automatic mapping of Princeton WordNet lexical-conceptual relations onto the Brazilian Portuguese WordNet database.” 6th International Conference on Language Resources and Evaluation (LREC 2008). Marrakech, Morocco. 1535-1541.
Fellbaum, Christiane. 1998. WordNet: An electronic lexical database. Cambridge, Massachusetts: MIT Press.
Gelhausen, Tom. 2010. Modellextraktion aus natürlichen Sprachen: eine Methode zur systematischen Erstellung von Domänenmodellen. KIT Scientific Publishing.
Ienco, Dino, Serena Villata, and Cristina Bosco. 2008. “Automatic extraction of subcategorization frames for Italian.” Sixth International Conference on Language Resources and Evaluation (LREC 2008). Marrakech, Morocco. 2094-2100.
Kipper, Karin. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon. PhD Thesis, Philadelphia, PA: University of Pennsylvania.
Kipper, Karin, Anna Korhonen, Neville Ryant, and Martha Palmer. 2006. “Extending VerbNet with novel verb classes.” Fifth International Conference on Language Resources and Evaluation (LREC 2006). Genoa, Italy. 1027-1032.
Korhonen, Anna, Yuval Krymolowski, and Ted Briscoe. 2006. “A large subcategorization lexicon for natural language processing applications.” Fifth International Conference on Language Resources and Evaluation (LREC 2006). Genoa, Italy. 1015-1020.
Levin, Beth. 1993. English Verb Classes and Alternations: A Preliminary Investigation. The University of Chicago Press.
Merlo, Paola, Suzanne Stevenson, Vivian Tsang, and Gianluca Allaria. 2002. “A multilingual paradigm for automatic verb classification.” 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002). Philadelphia, PA. 207-214.
Messiant, Cédric. 2008. “A subcategorization acquisition system for French verbs.” 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Student Research Workshop. Columbus, OH. 55-60.
Muniz, Marcelo C. M. 2004. A construção de recursos linguistico-computacionais para o portugues do brasil: o projeto de Unitex-PB. Master Thesis, São Carlos-SP, Brazil: University of São Paulo.
Muniz, Marcelo C. M., et al. 2007. “Taming the tiger topic: an XCES compliant corpus Portal to generate subcorpus based on automatic text topic identification.” Corpus Linguistics Conference (CL 2007). Birmingham, UK.
Palmer, Martha, Daniel Gildea, and Paul Kingsbury. 2005. “The proposition bank: An annotated corpus of semantic roles.” Computational Linguistics Journal 31, no. 1 (2005): 71-106.
Scarton, Carolina. 2013. VerbNet.Br: Construção semiautomática de um léxico verbal online e independente de domínio para o Português do Brasil. Master Thesis, São Carlos-SP, Brazil: University of São Paulo.
Scarton, Carolina, and Sandra Aluísio. 2012. “Towards a cross-linguistic VerbNet-style lexicon for Brazilian Portuguese.” LREC 2012 Workshop on Creating Cross-language Resources for Disconnected Languages and Styles. Istanbul, Turkey.
Schulte im Walde, Sabine. 2002. “A subcategorisation lexicon for German verbs induced from a lexicalised PCFG.” Third Conference on Language Resources and Evaluation (LREC 2002). Las Palmas, Canary Islands, Spain. 1351-1357.
Sun, Lin, and Anna Korhonen. 2009. “Improving verb clustering with automatically acquired selectional preferences.” 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009). Singapore. 638-647.
Sun, Lin, Anna Korhonen, Thierry Poibeau, and Cédric Messiant. 2010. “Investigating the cross-linguistic potential of VerbNet-style classification.” 23rd International Conference on Computational Linguistics (COLING 2010). Beijing, China. 1056-1064.
Surdeanu, Mihai, Sanda Harabagiu, John Williams, and Paul Aarseth. 2003. “Using predicate-argument structures for information extraction.” 41st Annual Meeting on Association for Computational Linguistics (ACL 2003). Sapporo, Japan. 8-15.
Swales, John. 1990. Genre Analysis - English in academic and research settings. Cambridge University Press.
Zanette, Adriano. 2010. Aquisição de subcategorization frames para verbos da língua portuguesa. Porto Alegre-RS, Brazil: Federal University of Rio Grande do Sul.
Zanette, Adriano, Carolina Scarton, and Leonardo Zilio. 2012. “Automatic extraction of subcategorization frames from corpora: an approach to Portuguese.” 2012 International Conference on Computational Processing of Portuguese-Demo Session. Coimbra, Portugal.
Zilio, Leonardo. 2009. Colocações especializadas e Komposita: um estudo contrastivo alemão-português na área de Cardiologia. Master Thesis, Porto Alegre-RS, Brazil: Federal University of Rio Grande do Sul.
Zilio, Leonardo, Adriano Zanette, and Carolina Scarton. 2012. “Extração Automática de Estruturas de subcategorização a partir de Corpora em Português.” XI Encontro de Linguística de Corpus (ELC 2012). São Carlos-SP, Brazil.

PART V CORPUS ANNOTATION

CHAPTER SIX
ESPANHOL-ACADÊMICO-BR: A CORPUS OF ACADEMIC PORTUGUESE LEARNERS PRODUCED BY NATIVE SPEAKERS OF SPANISH
LIANET SEPÚLVEDA TORRES∗, ROANA RODRIGUES† AND SANDRA MARIA ALUÍSIO∗

∗ Interinstitutional Center for Research and Development in Computational Linguistics, ICMC, University of São Paulo, São Carlos, SP, Brazil.
† Language Department at the Federal University of São Carlos (UFSCar), SP, Brazil.

1. Introduction

With approximately 272.9 million speakers, Portuguese is the fifth most spoken language in the world (Carvalho et al. 2010). In the last decade, as a result of Brazilian economic growth and the increased presence of multinationals in the country, foreigners' interest in learning Portuguese has risen, and most of these foreigners are Spanish speakers. This interest is also reflected in the number of students enrolled in CELPE-Bras¹, the Portuguese proficiency exam, which jumped from 1155 to 6139 (Foreque 2011). This scenario has led to more research dedicated to developing more efficient methods for teaching and learning Portuguese for Spanish speakers (Evers et al. 2011; Carvalho et al. 2010; Henriques 2004; Grannier and Carvalho 2001), as well as to several initiatives to create new resources, such as the Portuguese for Spanish Speakers Project, developed at the University of Arizona (Carvalho et al. 2010).

¹ This test assesses proficiency in Portuguese as the ability to act in the world in situations similar to real ones. It measures language use considering grammatical knowledge or specific vocabulary, but mostly the appropriate forms of language use in a communicative environment (Schoffen 2009).

Since Portuguese and Spanish are the closest Romance languages (Henriques 2004), several investigations (Da Silva 2010; Mohr 2007; Gomes 2002; Furtoso and Gimenez 2000; Jandyra 1999) have concluded that Spanish speakers differ from other learners of Portuguese (Mohr 2007). For that reason, institutions offer specialized courses for native Spanish speakers. In the literature, the similarity between the two languages is considered a positive element that often becomes an obstacle: similarity and closeness frequently conceal differences and hinder learners from mastering Portuguese, so that interference from their native Spanish shows up both when they speak and when they write in Portuguese (Gomes 2002; Furtoso and Gimenez 2000). However, according to Scaramucci, Ricardi and Rodrigues (2004), Spanish speakers' productions also present problems common to speakers of other foreign languages as well as to Brazilians themselves.

Grannier and Carvalho (2001) conducted a study that shows the main errors of Spanish speakers when writing in Portuguese. Additionally, Durão (1999) carried out another study on the learning of Portuguese by Spanish speakers. The problems identified in both studies were extracted from texts on general topics, produced by native Spanish speakers who were learners of Portuguese. Although these studies illustrate several problems that a Spanish speaker faces when writing in Portuguese, they neither formalized the creation of a corpus according to the principles of Corpus Linguistics nor presented a typology for the errors identified.

In general, research on teaching and learning a second or foreign language (Hana et al. 2012; Rosner et al. 2012; Da Silva 2010; Allen 2009; Dayrell and Aluísio 2008; Dayrell 2008) uses learner corpora. Learner corpora are text collections written by foreign learners of a language. Several learner corpora have been created with the purpose of studying the real problems that foreign speakers face when they study English. For example, Br-ICLE² is the Brazilian Portuguese subcorpus of the ICLE³ project. For languages other than English, which have been less explored, such as Czech and Maltese, the number of projects dedicated to creating learner corpora is also increasing (Hana et al. 2012; Rosner et al. 2012). On the other hand, the number of learner corpora for Brazilian Portuguese is still limited, and this number is even lower if we consider corpora that are specific to Spanish speakers.

² http://www2.lael.pucsp.br/corpora/bricle/
³ http://www.uclouvain.be/en-cecl-icle.html

The present research is set in this context. Its main objective is to formalize a typology of the main errors that Spanish speakers enrolled in Brazilian graduate programs make when writing their theses and dissertations in Portuguese. This paper presents the compilation of the Espanhol-Acadêmico-Br corpus, which consists of introductions of academic texts written in Portuguese by students enrolled in the School of Engineering and the Institutes of Physics, Chemistry, Mathematics and Computer Science, and Architecture and Urbanism at USP in São Carlos, Brazil. As future work, we hope to extend the compilation of the Espanhol-Acadêmico-Br corpus, so that it can become a formal resource showing the real problems that learners of Portuguese face when writing academic texts, as these require a more formal and technical language.

In Section 2, we introduce two studies which analyze several errors that native speakers of Spanish make when they learn Portuguese. In Section 3, we review some studies related to the creation and annotation of learner corpora. In Section 4, we present the Espanhol-Acadêmico-Br learner corpus, as well as the typology of errors in that corpus. Section 5 concludes with the final remarks of our study and the future work we envisage.

2. Error Typologies for Brazilian Portuguese

Grannier and Carvalho (2001) present several error categories based on the analysis of 15 CELPE-Bras tests of various genres. The tests were divided into three groups (basic, intermediate and advanced) according to the proficiency level of the candidates, and the error categories were defined according to a purely linguistic criterion, as shown in Table 1.

Table 1: Error categories identified by Grannier and Carvalho (2001)

Lexical errors: Errors related to lexical selection, either in form or in semantic appropriateness.
Morphosyntactic errors: Errors affecting the internal structure of the word and its links with other words (e.g. colligation errors and gender agreement or gender choice errors).
Syntactic errors: Errors related to the order of constituents or connectives.
Lexical-syntactic-semantic errors: Errors that involve two or more lexical items in at least one of the two languages. These errors involve semantic differences and/or the use of different syntactic structures.
Inadequacies: Inadequacies of structure and register (e.g. misuse of unstressed pronouns⁴ or relative clauses).

On the other hand, Durão (1999), from a pedagogical perspective, conducted a study on the learning of Portuguese by Spanish-speaking learners and the learning of Spanish by native Portuguese speakers. The part of the study on Portuguese was based on the analysis of texts written by 24 first-year students of a Portuguese course for foreigners offered at the University of Valladolid, Spain. All the students, with proficiency levels between basic and intermediate, were natives of Spanish-speaking countries. Table 2 shows the general categories, the number of errors and their respective percentages.

Table 2: Error categories identified by Durão (1999)

Category                           Errors   Percentage
Phonological and graphic errors    595      51%
Lexical errors                     386      33.10%
Grammatical errors                 186      15.90%
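The percentages in Table 2 can be recomputed from the raw counts. The short Python sketch below is only an illustration of that arithmetic; the variable and function names are ours and do not come from Durão (1999).

```python
# Error counts per category as reported in Table 2 (Durão 1999).
counts = {
    "Phonological and graphic errors": 595,
    "Lexical errors": 386,
    "Grammatical errors": 186,
}


def percentages(counts):
    """Return each category's share of the total, in percent (one decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = percentages(counts)
for name, share in shares.items():
    print(f"{name}: {share}%")
```

With a total of 1167 errors, the shares come out at about 51%, 33.1% and 15.9%, in line with the table.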

Durão (1999) concluded that, both for native Portuguese speakers learning Spanish and for native Spanish speakers learning Portuguese, it is useful for learners to know the similarities between the two languages as well as the critical points that can trigger errors. Although the classification proposed by Grannier and Carvalho (2001) is more general than Durão's (1999), both studies point to a significant number of problems. However, there are still other types of problems, more specific and more difficult to deal with, such as cohesion, coherence, genre appropriateness and text structure. The errors identified in those studies were the basis for the error typology presented in Section 4 of this paper.

⁴ Words of two or more syllables also contain unstressed syllables, that is, syllables that are not emphasised.

3. Learner Corpora

According to Duran (2008), learner corpora are resources widely used to identify differences or similarities in the learning processes of native and non-native speakers. Additionally, learner corpora make it possible to analyze the interference of the mother tongue in the process of learning a second language. For that purpose, according to Granger (2004), two analyses are used: a contrastive analysis and an error analysis. A contrastive analysis extracts the following information from the learner texts: most frequent vocabulary, discourse markers, collocations or expressions, grammatical items and syntactic structure (Da Silva 2010; Allen 2009; Dayrell and Aluísio 2008; Dayrell 2008). An error analysis, on the other hand, consists of annotating a learner corpus with a predefined set of error tags (Dahlmeier and Tou Ng 2011; Genoves et al. 2007). It is therefore a more time-consuming and costly task, because it requires the manual annotation of errors. Some of the most studied errors are article use, prepositions, and lexical and grammatical errors; some studies also address the problem of verb tense use.

Simple automatic tools have been proposed to improve the analysis of a learner corpus and to provide linguistic and computational resources (Granger et al. 2007; Granger 2004). An automatic analysis of texts can extract basic information, such as the total number of words in the text, a list of the most frequent words and the number of paragraphs or sentences, as well as more sophisticated information, such as the n-grams that occur in the corpus and their frequency (Granger et al. 2007). Beyond these measures, automated tools such as morphosyntactic taggers and stemmers bring advantages for both linguistic and computational analysis. Additionally, in the manual annotation of learner corpora, the tags most often used are discourse tags, semantic tags and error tags.

There are countless corpora created for learners of English, and one can find extensive corpora that represent the real problems that foreigners of various nationalities have when they study English as a second or foreign language. An example is the corpus of native Japanese speakers writing in English, ALESS (Active Learning of English for Science Students)⁵, presented by Allen (2009). The ALESS corpus consists of 847 academic texts in English, produced by students of the scientific writing course at the University of Tokyo. The corpus, with a total of 731,612 words, was used to analyze the occurrence of lexical expressions of the scientific genre produced by Japanese learners of English. Allen (2009) carried out a contrastive analysis in which the results for the learner corpus were compared with those for a reference corpus of academic texts written by native English speakers, in order to evaluate the use of lexical expressions in learner texts and in native texts.

⁵ http://aless.ecc.u-tokyo.ac.jp/

Another example is the NUS Corpus of Learner English (NUCLE), compiled by Dahlmeier and Tou Ng (2011) in collaboration with the National University of Singapore (NUS). The NUCLE corpus is composed of 1400 texts on various topics, produced by students of that university. The corpus contains approximately one million words and is fully annotated with error tags and their corrections.

Dayrell (2008) presents a learner corpus of abstracts of academic texts in English. The corpus consists of 33,836 words and was created to examine texts produced by Brazilian students writing in English and to compare these learner texts with a set of published English abstracts. The focus of the analysis was the lexical choices and collocations used in each of the academic abstracts. Another important project, covering Brazilian learners of several languages and not only English, is the CoMAprend learner corpus (Tagnin 2006). This corpus contains essays by learners of various foreign languages, such as German, Spanish, French, English and Italian.

One of the studies that investigated Portuguese texts produced by native speakers of Spanish was conducted by Da Silva (2010), who created a corpus with the oral productions of six Spanish-speaking learners of Portuguese in order to analyze the peculiarities of pronouns functioning as verbal complements. Another initiative was led by Evers et al. (2011), who compiled a learner corpus of 16 texts (8,873 words) of different genres, written by learners of Portuguese.
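The basic measures mentioned in this section (number of words and sentences, most frequent words, n-gram frequencies) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the tooling of any of the cited projects; the tokenisation rule, the function names and the sample text are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens (a crude tokeniser)."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def basic_statistics(text, n=2, top=3):
    """Compute word and sentence counts plus the most frequent words and n-grams."""
    tokens = tokenize(text)
    # Sentences are approximated by splitting on ., ! and ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "words": len(tokens),
        "sentences": len(sentences),
        "top_words": Counter(tokens).most_common(top),
        "top_ngrams": Counter(ngrams(tokens, n)).most_common(top),
    }


# Invented sample text, used only to exercise the functions.
sample = "O recurso está sendo desenvolvido. O recurso é um córpus pequeno."
stats = basic_statistics(sample)
print(stats)
```

A real study would replace the regular-expression tokeniser and sentence splitter with tools suited to Portuguese, but the counting logic stays the same.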

4. Annotation of the Espanhol-Acadêmico-Br Corpus

The Espanhol-Acadêmico-Br learner corpus, presented in this section, is composed of introductions written by different students of USP graduate programs for their thesis and dissertation exams. The corpus contains 13 texts, with a total of 617 sentences and 17,795 words. Table 3 shows the corpus statistics and the number of texts in each research area.

Table 3: Statistics of the academic writing corpus produced by Spanish learners of Portuguese

Research Area           Introductions            Number of Sentences   Number of Words
Hydraulic Engineering   I1                       25                    827
Mathematics             I2                       22                    699
Civil Engineering       I3, I4, I5, I8, I9, I13  246                   7,599
Computer Science        I6, I7, I10, I11, I12    324                   8,670
Total                   13                       617                   17,795
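As a quick consistency check, the totals in Table 3 can be recomputed from the per-area rows. The Python sketch below only re-adds the figures given in the table; the data layout and names are ours.

```python
# Rows of Table 3: (research area, introduction ids, sentences, words).
rows = [
    ("Hydraulic Engineering", ["I1"], 25, 827),
    ("Mathematics", ["I2"], 22, 699),
    ("Civil Engineering", ["I3", "I4", "I5", "I8", "I9", "I13"], 246, 7599),
    ("Computer Science", ["I6", "I7", "I10", "I11", "I12"], 324, 8670),
]

total_texts = sum(len(ids) for _, ids, _, _ in rows)
total_sentences = sum(sentences for _, _, sentences, _ in rows)
total_words = sum(words for _, _, _, words in rows)

# Average sentence length, a simple derived measure not reported in the table.
words_per_sentence = total_words / total_sentences

print(total_texts, total_sentences, total_words, round(words_per_sentence, 1))
```

The sums match the totals reported in the table (13 texts, 617 sentences, 17,795 words).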


After compiling the corpus, we administered a questionnaire to find out the main difficulties that Spanish speakers, enrolled in graduate programs at USP São Carlos, face when they write academic texts in Portuguese. A total of 66 graduate students at USP São Carlos completed the questionnaire, which consisted of 17 items related to the problems identified by Grannier and Carvalho (2001) and Durão (1999) and some aspects related to language proficiency. Each item of the questionnaire made reference to one of the problems that Spanish speakers face when learning Portuguese. Fig. 1 shows the errors included in the form, as well as their level of difficulty, according to the learners of Portuguese.

Fig. 1: Quantification of errors that appeared in the questionnaire

Some of the problems shown in Fig. 1 can be handled by most spellcheckers in word processors, and all of them were identified as difficulties in writing texts in Portuguese. The most common ones were: false cognates, compound words, pronoun complements, verb conjugations, gender disagreement, word-order errors, spelling errors, use of informal vocabulary, use of Spanish words or expressions in the texts and use of prepositions.

After quantifying the items of the questionnaire, the Espanhol-Acadêmico-Br corpus was manually annotated based on the errors identified in the questionnaire and on the annotations used in other studies. Two annotators analyzed each type of error separately; if a new type of mistake appeared, a new error category was created and identified by a number. For each error found in the annotation process, the type of error and a suggested correction were recorded. One problem of the annotation process was that some errors could be included in several categories. In that case, we chose to classify them in the category considered more "serious" or relevant for learning. In this first step, we considered as more serious or relevant for learning those errors that represent interference from the native language, for example the use of Spanish words or expressions in the texts. These types of errors show that learners still use an interlanguage when producing texts in the second language. In the future, we expect to conduct a study to improve the choice of these criteria. An alternative solution to this problem would be to use multilevel annotations, which might be considered in the future.

At the beginning of the annotation process, we expected the corpus to contain few errors, because the essays were written with the help of tools for checking grammar and spelling. These essays were produced for Masters and PhD exams, which require advanced levels of proficiency in Portuguese. However, during the annotation, we identified errors that spelling and grammar checkers can detect, as well as others that had not been corrected, mostly because they are errors specific to learners of Portuguese. This observation is supported by a survey by Duran (2008), who found that the most widespread grammar and spelling tools were not built to identify errors made by language learners.

Table 4 shows the typology of errors derived from the learner corpus. Some of the categories have an example taken from the corpus to illustrate the identified mistake. In the examples, the correction and the number of the error category appear in parentheses after the error. Fig. 2 shows a chart with the occurrences of each mistake in the corpus.

Table 4: Typology of errors in the learner corpus

1- Spelling error (stress marks; nasalization marks; inappropriate phonemes; inappropriate graphemes for the same phoneme)
2- Absence or inappropriate use of crasis
3- Hyphen (separation and union of words)
4- Pronouns (position of pronouns)
5- False cognates
e.g. Também, métodos de patrões (padrões/5) foram empregados através do uso de templates ... [Also, methods of bosses (patterns/5) were employed through the use of templates ...]
6- Use of informal language in the academic genre
e.g. ... essas referências são para mencionar uns poucos (alguns/6) artigos em diferentes contextos. [... these references are to name a few (some/6) articles in different contexts.]
7- Gender disagreement
8- Inappropriate use of article
9- Verb conjugation
e.g. O Desenvolvimento Sustentável é definido como um modelo econômico, político, social, cultural e ambiental equilibrado, que satisfaze (satisfaz/9) as necessidades das gerações atuais. [Sustainable development is defined as a balanced economic, political, social, cultural and environmental model, which satisfy (meets/9) the needs of current generations.]
10- Inappropriate use of preposition
e.g. O recurso está sendo desenvolvido na base de (com base em/10) um pequeno córpus, ... [The resource is being developed in the base of (on the basis of/10) a small corpus, ...]
11- Word order
e.g. A identificação modal operacional com só (só com/11) medições das saídas é um atrativo. [The modal operational identification with only (only with/11) measurements of the outputs is attractive.]
12- Use of Spanish words or expressions
e.g. Neste aspecto ainda não tem sido muito abordada a validez dessa assunção (afirmação/12) ... [In this aspect the validity of this assumption (assertion/12) has not been frequently addressed yet ...]
13- Nominal agreement
14- Fuzzy categories ((i) long sentences without connectives and punctuation marks and with poor vocabulary)
e.g. Felizmente, verifica-se que gradualmente se vai fomentando a procura deste tipo de materiais, e maior é a tendência dos pesquisadores de estimular a busca de novas matérias-primas que sejam provenientes de fontes renováveis, menos poluentes, e locais, seja porque está a surgir uma mudança de mentalidade da sociedade, seja por uma questão de moda ou mesmo pela simples necessidade de mudança [Fortunately, it appears that the demand for such materials has been gradually increasing, and greater is the tendency of researchers to stimulate the search for new raw materials that come from renewable, less polluting, and local sources, either because a change in the mindset of society is arising, or because it is in fashion or even simply due to a need for change] (Suggestions: use connectives, rephrase the long sentence into shorter sentences/14).
15- Change of grammatical category due to spelling error
e.g. A digestão anaeróbia sofre grande influencia (influência/15) do regime hidráulico. [Anaerobic digestion is greatly influence (influence/15) of the hydraulic regime.]

Note that, in the example for category 12, the word "validez" is also incorrect in addition to the annotated error. It could be corrected to "validade" or "veracidade". In that example, only the error related to the category's name was annotated.
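The examples in Table 4 mark each error with its correction and category number in parentheses, as in "(padrões/5)". Assuming annotated texts are stored in that same inline notation (an assumption; the chapter does not describe the storage format of the corpus), a frequency count per category such as the one behind Fig. 2 could be sketched in Python as follows. The regular expression and all names are ours.

```python
import re
from collections import Counter

# A subset of the categories in Table 4, keyed by their number.
CATEGORIES = {
    5: "False cognates",
    6: "Use of informal language",
    9: "Verb conjugation",
    10: "Inappropriate use of preposition",
    11: "Word order",
    12: "Use of Spanish words or expressions",
}

# Matches "(correction/number)"; this pattern is an assumption about the notation.
ANNOTATION = re.compile(r"\(([^()/]+)/(\d+)\)")


def extract_annotations(text):
    """Return (suggested correction, category number) pairs found in a text."""
    return [(corr.strip(), int(num)) for corr, num in ANNOTATION.findall(text)]


def count_by_category(texts):
    """Count how many annotations fall into each category number."""
    counts = Counter()
    for text in texts:
        for _, number in extract_annotations(text):
            counts[number] += 1
    return counts


# Example sentences taken from Table 4.
examples = [
    "Também, métodos de patrões (padrões/5) foram empregados através do uso de templates",
    "que satisfaze (satisfaz/9) as necessidades das gerações atuais",
    "O recurso está sendo desenvolvido na base de (com base em/10) um pequeno córpus",
]

counts = count_by_category(examples)
for number, total in sorted(counts.items()):
    print(number, CATEGORIES[number], total)
```

The notation records only the correction and the category, not where the erroneous span begins, so the sketch does not attempt to recover the original error text.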


Fig. 2: Quantification of errors in the learner corpus

As seen in Fig. 2, the most frequent mistakes do not always correspond to those indicated by the learners of Portuguese in the questionnaire (Fig. 1). For instance, the use of false cognates, which was the most worrying item in the questionnaire, had the lowest number of occurrences in the corpus. We believe that this problem is more common in spoken than in written language. Sepúlveda-Torres and Aluísio (2011) present an initial study on the automatic creation of dictionaries of cognates and false cognates between Spanish and Portuguese. This type of resource is very useful when a learner studies a second language that is similar to his/her native language.

The mistakes in the fuzzy categories show that learners have problems expressing clear and concise ideas: they produce long sentences without connectives or punctuation and use poor vocabulary. Some verb conjugation errors (category 9) cannot be corrected by spelling or grammar checkers. Spanish expressions sometimes appear in the corpus, and this interference could be identified with automatic methods such as an n-gram language model. An n-gram model is one of the most important tools in speech and language processing; it estimates the probability of a word or sequence of words given a certain number of previous words (Jurafsky and Martin 2007). Furthermore, the inappropriate use of crasis is one of the most frequent problems in the corpus, but there are several rules which allow these mistakes to be corrected automatically.

Fig. 3 shows graphs of the errors identified in three texts of the corpus. In each text there are problems that are common among learners and others that are more specific to one learner. The inappropriate use of crasis (category 2 in Table 4) is common to all three texts, and texts 9 and 11 have errors related to verb conjugation (9), misuse of prepositions (10) and use of Spanish expressions (12). Fig. 3.a shows an essay with few errors, while the essay in Fig. 3.b shows a higher number of errors. Finally, the essay in Fig. 3.c has the highest number of errors, a total of 26. This analysis allows texts to be assessed in terms of their mistakes.
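To make the n-gram idea mentioned above concrete, the following Python sketch trains a bigram model with add-one smoothing on a tiny, invented set of Portuguese sentences and compares the scores of two word sequences. It is a toy illustration under our own assumptions (training sentences, smoothing choice, function names), not a method used in this study; a real detector would need a large Portuguese corpus.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Collect context and bigram counts, with <s> and </s> as sentence boundaries."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return contexts, bigrams, vocab


def log_probability(sentence, model):
    """Sum of log P(word | previous word) with add-one smoothing."""
    contexts, bigrams, vocab = model
    size = len(vocab) + 1  # one extra slot for words unseen in training
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + size))
        for prev, word in zip(tokens, tokens[1:])
    )


# Invented training sentences, far too few for real use.
training = [
    "a validade dessa afirmação",
    "a validade do método",
    "essa afirmação é válida",
]
model = train_bigram_model(training)

portuguese = log_probability("a validade dessa afirmação", model)
with_spanish = log_probability("a validez dessa assunção", model)
print(portuguese > with_spanish)
```

Sequences containing words or word pairs never seen in the Portuguese training data receive a lower score, which is the signal such a model could use to flag possible Spanish interference.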

Fig. 3.a: Essay with 3 errors

Fig. 3.b: Essay with 15 errors

Fig. 3.c: Essay with 26 errors

Fig. 3: Types of errors identified in individual texts

5. Final remarks

Although Spanish speakers use tools that assist in the process of writing texts in Portuguese, there are some problems that these tools cannot identify. Along these lines, the present study aimed to: (i) identify categories of errors that are not covered by computational tools, (ii) identify gaps or weaknesses of these systems and (iii) select, based on the learner corpus, the problems whose treatment can be automated. The present study makes available an annotated corpus that illustrates the real problems that native Spanish speakers face when they write in Portuguese. The analysis of these problems will serve to create a tool that will automatically identify these errors, offer suggestions to improve the quality of the texts and assess the writing of abstracts of theses and dissertations according to the number and type of errors. In addition to this tool, the present learner corpus will be enlarged with more texts and will be made publicly available for further research.

As future work, we intend to extend the compilation of the corpus until it becomes a formal resource that shows a greater number of the real problems that Spanish speakers learning Portuguese face when writing academic papers, a genre which requires a more technical and formal language. In addition to these specific applications, our study could also contribute to pedagogical studies of Portuguese as a foreign language. The error typology could help in preparing teaching materials for the teaching and learning of Portuguese for Spanish speakers.

Acknowledgments

The authors are grateful to FAPESP for supporting this work.

References

Allen, David. 2009. Lexical bundles in learner writing: An analysis of formulaic language in the ALESS learner corpus. Komaba Journal of English Education, 1:105–127.
Carvalho, Ana M., Freire, Juliana L. and Da Silva, Antonio J.B. 2010. Teaching Portuguese to Spanish speakers: a case for trilingualism. In Hispania, 70–75.
Da Silva, Laís. 2010. As formas de preenchimento do objeto direto na aprendizagem de português/LE por Argentinos. Universidade Federal de São Carlos. Centro de Educação e Ciências Humanas, Departamento de Letras. Cursos de Letras.
Dahlmeier, Daniel and Tou Ng, Hwee. 2011. Grammatical error correction with alternating structure optimization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, HLT '11, 915–923, Stroudsburg, PA, USA. Association for Computational Linguistics.
Dayrell, Carmen and Aluísio, Sandra M. 2008. Using a comparable corpus to investigate lexical patterning in English abstracts written by non-native speakers. In Proceedings of the 6th International Conference on Language Resources and Evaluation.
Dayrell, Carmen. 2008. Sense-related items and lexical patterning in English and Portuguese scientific abstracts. In International Symposium on Using Corpora in Contrastive and Translation Studies.


Durão, Adja B.A.B. 1999. Análisis de errores e interlengua de brasileños aprendices de español y españoles aprendices de portugués. Editora UEL. Universidade Estadual de Londrina.
Duran, Magali. 2008. Customização de corretores ortográficos para aprendizes de línguas estrangeiras. In Anais do VII Encontro de Linguística de Córpus.
Evers, Aline, Finatto, Maria J., and Pasqualini, Bianca. 2011. Córpus de aprendizes de português como língua adicional (PLA): Compilação inicial e primeiros resultados. In 18 InPLA – Intercâmbio de Pesquisas em Linguística Aplicada.
Foreque, Flávia. 2011. Crescimento do Brasil leva estrangeiros a aprenderem português. In Folha.com.
Frunza, Oana and Inkpen, Diana. 2009. Identification and Disambiguation of Cognates, False Friends, and Partial Cognates Using Machine Learning Techniques. International Journal of Linguistics, vol. 1, no. 1, 1-37, Ottawa, Canada.
Furtoso, Viviane B. and Gimenez, Telma N. 2000. Ensino e pesquisa em português para estrangeiros: programa de ensino e pesquisa em português para falantes de outras línguas (PEPPFOL). DELTA: Documentação de Estudos em Linguística Teórica e Aplicada, 16:443–447.
Genoves, Luiz Carlos, Jr., Lizotte, R., Schuster, Ethel, Dayrell, Carmen, and Aluísio, Sandra M. 2007. A two-tiered approach to detecting English article usage: an application in scientific paper writing tools. In Recent Advances in Natural Language Processing, 2007, Borovets.
Gomes, Gloria P.F.V. 2002. Características da interlíngua oral de estudantes de letras espanhol nos dois últimos semestres de estudo. In Congresso Brasileiro de Hispanistas, São Paulo, Brasil.
Granger, Sylviane. 2004. Computer learner corpus research: Current status and future prospects. Language and Computers, 52:123–145.
Granger, Sylviane, Kraif, Olivier, Ponton, Claude, Antoniadis, Georges, and Zampa, Virginie. 2007. Integrating learner corpora and natural language processing: A crucial step towards reconciling technological sophistication and pedagogical effectiveness. ReCALL, 19(3):252–268.
Grannier, Daniele M. and Carvalho, Elzamária A.C. 2001. Pontos críticos no ensino de português a falantes de espanhol: da observação do erro ao material didático. In Anais do IV Congresso da SIPLE, PUC Rio, Rio de Janeiro, Brasil.
Henriques, Eunice R. 2004. Intercompreensão de Texto Escrito por Falantes Nativos de Português e de Espanhol. DELTA: Documentação de Estudos em Linguística Teórica e Aplicada, 16:263–295.
Jandyra, Maria and Santos, Percília. 1999. O ensino de português como segunda língua para falantes de espanhol: teoria e prática. Ensino e Pesquisa em Português para Estrangeiros. Brasília: Editora Universidade de Brasília, 49-57.
Jurafsky, Daniel and Martin, James H. 2007. Speech and Language Processing: An Introduction to Speech Recognition, Computational Linguistics and Natural Language Processing. Pearson International Edition, 2009.
Hana, Jirka, Rosen, Alexandr and Stindlova, Petr, J. 2012. Building a learner corpus. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 3228-3232. Istanbul, Turkey.
Mohr, Denise. 2007. Português para hispanofalantes: Uma alternativa para o ensino de gêneros escritos. In Professores de Línguas Estrangeiras do Paraná. Línguas: culturas, diversidade, integração (XV EPLE), 372–387.
Rosner, Mike, Gatt, Albert, Joachimsen, Jan and Attard, Andrew. 2012. Incorporating an Error Corpus into a Spellchecker for Maltese. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 743-750. Istanbul, Turkey.
Scaramucci, Matilde, Ricardi and Rodrigues, Meirelén S. A. 2004. Compreensão (oral e escrita) e produção escrita no exame CELPE-Bras: análise do desempenho de candidatos hispanofalantes. In: Simões, A. R. M. et al. (Eds.). Português para falantes de espanhol: artigos selecionados escritos em português e inglês. Campinas: Pontes, 153-174.
Sepúlveda-Torres, Lianet and Aluísio, Sandra M. 2011. Using machine learning methods to avoid the pitfall of cognates and false friends in Spanish-Portuguese word pairs. In: The 8th Brazilian Symposium in Information and Human Language Technology, 2011, Cuiabá/MT. v. 1. p. 67-76.
Schoffen, Juliana. 2009. Gêneros do Discurso e Parâmetros de Avaliação de Proficiência em Português como Língua Estrangeira no Exame CELPE-Bras. Tese de Doutorado, Universidade Federal do Rio Grande do Sul. Instituto de Letras.
Tagnin, Stella. 2006. A Multilingual Learner Corpus in Brazil. Language and Computers, 56(1):195–202.

CHAPTER SEVEN
THE CHALLENGES OF THE ANNOTATION OF A SOCCER LANGUAGE CORPUS WITH SEMANTIC FRAMES
ROVE CHISHMAN*, ANDERSON BERTOLDI*, JOÃO GABRIEL PADILHA* AND DIEGO SPADER DE SOUZA*

* Universidade do Vale do Rio dos Sinos (UNISINOS).

1. Introduction

The purpose of this paper is to discuss the main challenges faced during the annotation of a soccer language corpus with the semantic frames developed in the scope of the Kicktionary Project (Schmidt 2009). The Kicktionary is a multilingual database in German, French and English, which describes the language of soccer according to the principles of Frame Semantics (Fillmore 1982, 1985). In this paper, we discuss how the metaphorical language of soccer, support verbs and the polysemy of lexical items interfered in the annotation of a Brazilian Portuguese corpus carried out in order to develop semantic frames.

Since the Kicktionary covers soccer language only in English, German and French, the semantic annotation of the Portuguese corpus first required finding equivalents in one of these languages for the Portuguese lexical units. Once an English equivalent had been found, it was possible to search the Kicktionary database and check which frame the lexical unit evoked. In this paper we highlight that the differences found were more related to linguistic structure and lexicon than to conceptual differences between the languages. As the rules of soccer are universal, the frames could be used without major difficulties for the annotation of the corpus in Portuguese.

This paper is structured as follows. In Section 2 we discuss Frame Semantics and FrameNet, the theory and the application of this theory that guide our work. In Section 3 we present the Kicktionary, which inspired our project. Section 4 reports the annotation process of the corpus. In Section 5 we discuss the metaphors we encountered during the annotation. In Section 6 we discuss the support verbs in the corpus and, in Section 7, the cases of polysemy. Our final remarks are presented in Section 8.

2. Frame Semantics and FrameNet

Frame Semantics grew out of a concept much discussed during the 1970s, the frame. Initially, Fillmore (1975) distinguished between the concepts of “scene” and “frame”. The scene was considered a broad concept which encompassed not only visual scenes but also bodily, social and cultural experiences. The frame, in turn, was seen as a system of linguistic choices, including words and grammar rules, associated with certain prototypical instances of a scene.

Although the classical texts of Frame Semantics date back to the 1980s (Fillmore 1982, 1985), its roots can already be found in the 1960s. In his article The case for case (Fillmore 1968), Fillmore studies the semantic roles that could be considered universal. To do so, he adopts the position of Tesnière (1959), who states that the subject/predicate distinction is not appropriate for describing language. This is when case structure (the case frame) emerges, with the six cases, or semantic roles, from which all studies of thematic relations originated: agentive, instrumental, dative, factitive, locative and objective.

Using the event of a commercial transaction, Fillmore (1977) demonstrates that the verbs to buy, to sell and to cost represent different perspectives on the same event. The seller gives the product in exchange for money and the buyer gives the money in exchange for the product. An event such as a commercial transaction thus marks the exchange of ownership of two goods: the money goes from the buyer to the seller and the product goes from the seller to the buyer. The analysis of perspective in an event already offers an early draft of what would later be called “frame elements” by the FrameNet Project. These frame elements came to replace the case proposal (Fillmore 1968). The distinction between scene, as a cognitive structure, and frame, as a linguistic structure, was later abandoned (Fillmore 1982, 1985).


According to Fillmore (1982, p. 111), “By the term ‘frame’ I have in mind any system of concepts related in such a way that to understand any one of them you have to understand the whole structure into which it fits”. For Frame Semantics, words have the ability to evoke world knowledge organized in a cognitive structure called a frame: “A frame is evoked by the text if some linguistic form or pattern is conventionally associated with the frame in question” (Fillmore 1985, p. 232).

Fillmore and Atkins (1992) present a first exercise of semantic analysis based on frames and point to the future creation of an online frame-based dictionary. Based on a study of lexical units that express risk, like risk, danger and hazard, Fillmore and Atkins propose eleven categories to describe the participants of the Risk frame. These categories are: chance, harm, victim, valued object, risky situation, deed, actor, intended gain, purpose, beneficiary and motivation.

Frame Semantics inspired the development of FrameNet (Fillmore, Johnson and Petruck 2003). FrameNet is a lexical database which describes the meaning of lexical items by relating them to frames. FrameNet treats lexical items as “lexical units”, where a lexical unit is the pairing of a form (word) with a meaning. Each meaning of a word is related to a distinct frame. For example, FrameNet presents three lexical units for the verb accuse, each evoking a different frame: Judgment, Judgment_communication and Notification_of_charges.

FrameNet describes the concepts related to each frame as “frame elements”. These elements are considered situational roles, and not semantic roles as in case grammar (Fillmore 1968). According to Fillmore and Baker (2010), frame elements represent properties or entities that can or must be present in any instance of a frame. FrameNet classifies frame elements as “core”, “peripheral” or “extrathematic”, although, according to the same authors, the distinction between these types is not always clear. In general, frame elements that must be expressed are core. Peripheral frame elements usually function as adjuncts: they carry additional, non-central information, indicating time, place or manner. The difference between core and peripheral elements depends on whether the lexical unit requires the complement. Extrathematic frame elements introduce information that refers to another frame, motivated by some event or action. The peripheral and extrathematic frame elements are grouped in FrameNet under the label of non-core elements. For example, the frame Commercial_transaction has the following core frame elements: BUYER, GOODS, MONEY and SELLER.
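The core/non-core distinction described above can be pictured as a small data structure. The following is a minimal sketch, not FrameNet's actual data model: the class name `Frame`, the method `unexpressed_core` and the non-core labels `TIME`, `PLACE` and `MANNER` are assumptions chosen for illustration; only the four core elements of Commercial_transaction come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Illustrative container for a frame and its frame elements."""
    name: str
    core: set[str] = field(default_factory=set)
    non_core: set[str] = field(default_factory=set)  # peripheral + extrathematic

    def unexpressed_core(self, annotated: set[str]) -> set[str]:
        """Core frame elements that were not annotated in a sentence."""
        return self.core - annotated


commercial_transaction = Frame(
    name="Commercial_transaction",
    core={"BUYER", "GOODS", "MONEY", "SELLER"},   # from the text
    non_core={"TIME", "PLACE", "MANNER"},         # hypothetical examples
)

# "The seller sold the product to the buyer": no MONEY phrase is expressed.
annotated = {"SELLER", "GOODS", "BUYER"}
print(sorted(commercial_transaction.unexpressed_core(annotated)))
```

A core element that is left unexpressed in a particular sentence is not necessarily an annotation error; the check only reports which core roles have no annotated phrase.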


3. The Kicktionary

The Kicktionary revives the distinction between scenes and frames. Frames group the lexical units in German, French and English, and scenes represent the typical scenarios related to each frame. The Pass scene, for example, has 11 frames: Pass, Pass_Back, Mark, Being_Free, Control, Connect, Flick_On, Intercept, Bad_Pass, Pass_Combination and Supply_Pass. According to Schmidt (2009), a frame is a structural entity used to group linguistic expressions that share a common perspective on a certain conceptual scene.

Unlike FrameNet, the Kicktionary does not divide frame elements into core and non-core. The frame Pass, for example, has twelve frame elements: PASSER (JOGADOR_QUE_PASSA), TARGET (ALVO), RECIPIENT (JOGADOR_QUE_RECEBE), SOURCE (FONTE), DIRECTION (DIREÇÃO), BALL (BOLA), PART_OF_BODY (PARTE_DO_CORPO), DISTANCE (DISTÂNCIA), MOVING_BALL (BOLA_EM_MOVIMENTO), PASS (PASSE), SHOT (CHUTE) and PATH (TRAJETÓRIA).

Below, we show two sentences extracted from the Kicktionary database and annotated with frame elements, to illustrate the annotation of sentences whose lexical units evoke the Pass frame:

(1) [Gilewicz PASSER] played [the ball BALL] [to Ivica Vastic RECIPIENT] who was tackled as he lined up a shot.

(2) With ten minutes left [Vranješ PASSER] passed [to substitute Ivan Leko RECIPIENT] just outside the area.
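The bracket notation used in examples (1) and (2), where each annotated phrase ends with an uppercase frame element label, is regular enough to be read by a short script. The sketch below is only an illustration of that notation as printed in this chapter; the function name `parse_annotation` and the regular expression are assumptions, not part of the Kicktionary.

```python
import re

# "[some phrase LABEL]": the last whitespace-separated token is the label.
FE_PATTERN = re.compile(r"\[([^\[\]]+)\s+([^\s\[\]]+)\]")


def parse_annotation(sentence):
    """Return (plain_text, [(phrase, label), ...]) for a bracketed sentence.

    Bracketed spans whose last token is not fully uppercase (for example
    "[Scene: Pass/Frame: Pass]") are left untouched.
    """
    elements = []

    def strip_label(match):
        phrase, label = match.group(1), match.group(2)
        if not label.isupper():
            return match.group(0)
        elements.append((phrase, label))
        return phrase

    plain = FE_PATTERN.sub(strip_label, sentence)
    return plain, elements


example = ("[Gilewicz PASSER] played [the ball BALL] "
           "[to Ivica Vastic RECIPIENT] who was tackled as he lined up a shot.")
plain, elements = parse_annotation(example)
print(plain)
print(elements)
```

Because the label test relies on `str.isupper()`, Portuguese labels with accented characters, such as DIREÇÃO, are handled in the same way as the English ones.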

4. The annotation process

The objective of this study was to assess the applicability of the semantic tags developed in the context of the Kicktionary Project to the annotation of a representative corpus of soccer texts in Brazilian Portuguese. To do so, the following methodological steps were adopted: (a) creation of a specialized corpus, (b) development of an annotation manual containing the translation of all scenes and frames from the Kicktionary, (c) segmentation of the corpus into sentences and (d) analysis and annotation of the sentences by pairs of annotators to guarantee some level of accuracy.
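Steps (c) and (d) can be sketched in a few lines. This is a rough illustration under stated assumptions, not the procedure actually used in the project: the chapter does not say how sentences were segmented or how the pairs of annotators compared their decisions. The functions `split_sentences` and `disagreements`, the sample text and the sample labels are all hypothetical.

```python
import re


def split_sentences(text):
    """Naive segmentation: split after ., ! or ? when followed by
    whitespace and an uppercase letter (including accented capitals)."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-ZÀ-Ý])", text.strip())
    return [part for part in parts if part]


def disagreements(labels_a, labels_b):
    """Indices of sentences for which two annotators chose different frames."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


text = ("Mas a zaga cortou. Ramirez fez o passe para Battión. "
        "Aos 44, Carlos Alberto avançou pela direita.")
sentences = split_sentences(text)

# Hypothetical frame choices made independently by two annotators.
annotator_a = ["Intercept", "Supply_Pass", "Pass"]
annotator_b = ["Intercept", "Pass", "Pass"]

for index in disagreements(annotator_a, annotator_b):
    print("To be discussed:", sentences[index])
```

Listing the sentences on which a pair disagrees is one simple way of deciding what to review together; a real workflow would also need to handle abbreviations and scores such as "3 x 1" during segmentation.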


The corpus of this study is composed of 100 texts containing descriptions of soccer games. The texts were collected from the websites of Brazilian clubs on the days following games in championships such as Copa do Brasil or Copa Libertadores. These texts were divided into eleven folders, according to the website from which they were collected, totaling 3307 sentences.

An initial manual analysis of the corpus showed the complexity of the sentences to be annotated, which contained many occurrences of support verbs and of metaphorical constructions evoking frames. The difficulties reported here do not reflect “conceptual” differences between the moments of a soccer game as described in Brazilian Portuguese and in English, since the rules of this sport, as well as the events related to it, are known internationally. What we noticed was that the two languages rely on different constructions to refer to the same events. The differences therefore lie at the lexical level rather than at the conceptual one.

To illustrate one of the difficulties encountered during the annotation process, consider the lack of a direct English equivalent for the lexical item bicicleta [bicycle], which refers to a type of kick in which the player kicks the ball over his own head. The same occurs with the lexical item bomba [bomb], a noun that refers to a kick more powerful than usual. Its indirect equivalent in English, that is, a metaphorical construction with this sense of “powerful kick”, is the noun thunder [trovão], a metaphor which, as far as we have observed, is not part of the soccer lexicon in Brazilian Portuguese.

We noticed a large number of occurrences of the Shot, Goal and Pass scenes. We therefore decided to annotate only part of the corpus at first, selecting its first 1000 sentences for the annotation of these three most recurrent scenes. In annotating these sentences, we first identified the lexical unit evoking the frame and the frame it evokes. Next, the main phrases of the sentence were annotated with frame elements, as shown in (3).

(3) Em seguida, [Ricardinho JOGADOR_SUBSTITUÍDO] foi substituído [pelo estreante Neto Berola SUBSTITUTO]. [Cena: Substituição/Frame: Substituir]


Next, [Ricardinho SUBSTITUTED_PLAYER] was substituted [by the debutant Neto Berola SUBSTITUTE]. [Scene: Substitution/Frame: Substitute]1

The manually annotated files were loaded into the SALTO software (Burchardt et al. 2006) so that we could obtain them in XML format, which makes them usable in computational applications: this kind of file allows machines to read the information entered manually during the annotation process. Figure 1 exemplifies the use of the SALTO tool for the annotation of semantic frames.

Fig. 1: Annotation of frames with the SALTO tool
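To give an idea of what a machine-readable version of such an annotation might look like, the sketch below serializes example (3) to XML with Python's standard library. The element and attribute names (`sentence`, `frame`, `fe`, `scene`, `target`) form a simplified, hypothetical schema invented for this illustration; they are not the format actually produced by SALTO.

```python
import xml.etree.ElementTree as ET


def annotation_to_xml(sentence_id, text, scene, frame, target, elements):
    """Build an XML element for one annotated sentence (hypothetical schema)."""
    sentence = ET.Element("sentence", id=sentence_id)
    ET.SubElement(sentence, "text").text = text
    frame_node = ET.SubElement(
        sentence, "frame", name=frame, scene=scene, target=target
    )
    for label, phrase in elements:
        fe_node = ET.SubElement(frame_node, "fe", name=label)
        fe_node.text = phrase
    return sentence


node = annotation_to_xml(
    sentence_id="s3",
    text="Em seguida, Ricardinho foi substituído pelo estreante Neto Berola.",
    scene="Substituição",
    frame="Substituir",
    target="substituído",  # frame-evoking unit, as read from example (3)
    elements=[
        ("JOGADOR_SUBSTITUÍDO", "Ricardinho"),
        ("SUBSTITUTO", "pelo estreante Neto Berola"),
    ],
)
xml_string = ET.tostring(node, encoding="unicode")
print(xml_string)
```

Once the annotation is stored this way, the scene, the frame and each frame element can be retrieved by a program instead of being read off the bracketed text.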

5. The metaphors of soccer

The metaphors in the soccer corpus require special attention from the annotators. In these cases, it is necessary to interpret the metaphorical meaning of the linguistic expression in order to identify the evoked frame. As examples (4) and (5) show, the lexical units roubar [steal], which literally evokes a frame of stealing, and cortar [cut], which literally evokes a frame of cutting, are used metaphorically to express an interception.

(4) [Camacho INTERCEPTADOR] roubou [grande bola BOLA] [no meio LOCAL_DA_INTERVENÇÃO] e deixou Diego Maurício livre para chutar colocado, mas a bola passou raspando a trave. [Cena: Passe/Frame: Interceptar]

1 All translations of the examples are literal translations solely meant to make the meaning clear.


[Camacho INTERCEPTOR] stole [big ball BALL] [in the middle INTERVENTION_LOCATION] and left Diego Maurício free to kick, but the ball grazed the post. [Scene: Pass/Frame: Intercept]

(5) Mas a [zaga INTERCEPTADOR] cortou. [Cena: Passe/Frame: Interceptar]
But the [defense INTERCEPTOR] cut. [Scene: Pass/Frame: Intercept]

6. Support verbs

The corpus also presented a large number of support verbs, as in examples (6) and (7). Support verbs are verbs with reduced semantic content, as in “Mary gave her a hug”. In this sentence, the verb gave needs the complement a hug to define its exact meaning. Therefore, in support verb constructions, the unit that evokes the frame is not the verb but the noun. The examples below show that the lexical item passe [pass] incorporates the frame element PASSE into the frame-evoking unit.

(6) [Ramirez JOGADOR_QUE_PASSA] fez [o passe PASSE] [para Battión JOGADOR_QUE_RECEBE]. [Cena: Passe/Frame: Fornecer_passe]
[Ramirez PASSER] made [the pass PASS] [to Battión RECIPIENT]. [Scene: Pass/Frame: Supply_Pass]

(7) [O autor do primeiro gol JOGADOR_QUE_PASSA] deu [passe milimétrico PASSE] [para Adriano JOGADOR_QUE_RECEBE] apenas tocar na saída do goleiro chileno e fazer 2 a 0 aos 13. [Cena: Passe/Frame: Fornecer_passe]
[The author of the first goal PASSER] gave [a millimetric pass PASS] [to Adriano RECIPIENT], who only had to touch it past the advancing Chilean goalkeeper and make it 2-0 at 13 minutes. [Scene: Pass/Frame: Supply_Pass]

It is also important to note that support verb constructions in Portuguese are often communicatively more adequate because they allow the noun complement to be modified by an adjective. This is not possible with regular verbs, as exemplified below:

(8) Aos 44, [Carlos Alberto JOGADOR_QUE_FAZ_O_PASSE] avançou pela direita e deu [ótimo passe PASSE] para [Renan Oliveira JOGADOR_QUE_RECEBE_O_PASSE], que finalizou para a defesa de Magrão.


At 44, [Carlos Alberto PASSER] advanced on the right and gave [great pass PASS] to [Renan Oliveira RECIPIENT], who shot for Magrão to save.

If the support verb construction dar passe [give pass] were replaced by the regular verb passar [pass], there would be no place for the adjective ótimo [great]. Instead, the adjective would have to become an adverb, otimamente [greatly], which does not sound natural in the popular discourse of soccer.

7. Polysemy

The polysemy of lexical items also represented a challenge for the annotation process. Examples (9) to (16) show two different senses of the verb tocar. From (9) to (12), tocar is used in the sense of chutar (equivalent to the English verb to kick):

(9) Aos sete minutos, [Tardelli JOGADOR_QUE_RECEBE] recebeu [ótimo passe PASSE] [de Muriqui JOGADOR_QUE_PASSA] e tocou na saída do goleiro para fazer o terceiro gol: Galo 3 x 1. [Scene: Shot/Frame: Shot]
At seven minutes, [Tardelli RECIPIENT] received [an excellent pass PASS] [from Muriqui PASSER] and touched past the advancing goalkeeper to score the third goal: Galo 3 x 1.

(10) Após bola recuada, [o atacante da equipe baiana JOGADOR_QUE_CHUTA] chegou solando para ganhar a jogada do goleiro Marcelo e tocar para [o gol ALVO]. [Scene: Shot/Frame: Shot]
After a back pass, [the Bahia team’s striker SHOOTER] came in sole-first to win the ball from goalkeeper Marcelo and touch towards [the goal TARGET].

(11) [Bruno Mineiro JOGADOR_QUE_CHUTA] tocou [de cabeça PARTE_DO_CORPO] mas [o goleiro Edson GOLEIRO] salvou [em cima da linha LOCAL_DA_INTERVENÇÃO]. [Scene: Shot/Frame: Shot]
[Bruno Mineiro SHOOTER] touched [with his head PART_OF_BODY] but [goalkeeper Edson GOALKEEPER] saved [on the line INTERVENTION_LOCATION].


(12) O Timão quase passou à frente outra vez aos 9 minutos, com [Danilo JOGADOR_QUE_CHUTA], que tocou [a bola BOLA] [para o gol ALVO] após cruzamento rasteiro de Defederico, mas [o goleiro da equipe do Rio GOLEIRO] espalmou. [Scene: Shot/Frame: Shot]
Timão almost took the lead once again at 9 minutes, with [Danilo SHOOTER], who touched [the ball BALL] [towards the goal TARGET] after a low cross from Defederico, but [the Rio team’s goalkeeper GOALKEEPER] palmed it away.

From (13) to (16), the verb tocar has the sense of passar [to pass]:

(13) [O Furacão JOGADOR_QUE_PASSA] tentou tocar mais [a bola BOLA] para tirar a velocidade do Timão, mas não adiantou. [Scene: Pass/Frame: Pass]
[The Hurricane PASSER] tried to pass [the ball BALL] more in order to reduce Timão’s speed, but it did not work.

(14) No lance, [Muriqui JOGADOR_QUE_PASSA] arrancou pela esquerda e tentou tocar para [Diego Tardelli JOGADOR_QUE_RECEBE], mas o [passe PASSE] foi novamente interceptado pela [zaga do Ceará INTERCEPTADOR]. [Scene: Pass/Frame: Pass]
In this move, [Muriqui PASSER] broke down the left and tried to pass to [Diego Tardelli RECIPIENT], but the [pass PASS] was again intercepted by [Ceará’s defense INTERCEPTOR].

(15) Aos dez minutos, após boa jogada de Correa [na entrada da área LUGAR], [Muriqui JOGADOR_QUE_PASSA] tocou [de calcanhar PARTE_DO_CORPO] e [Diego Tardelli JOGADOR_QUE_RECEBE] finalizou rente à trave. [Scene: Pass/Frame: Pass]
At the ten-minute mark, after a good move from Correa [at the edge of the area PLACE], [Muriqui PASSER] passed [with his heel PART_OF_BODY] and [Diego Tardelli RECIPIENT] finished close to the post.

(16) No ataque seguinte, [Tardelli JOGADOR_QUE_PASSA] avançou [pela direita DIREÇÃO] e tocou para [Muriqui JOGADOR_QUE_RECEBE], mas o [passe PASSE] foi interceptado. [Scene: Pass/Frame: Pass]


In the following offensive move, [Tardelli PASSER] advanced [through the right side DIRECTION] and passed to [Muriqui RECIPIENT], but the [pass PASS] was intercepted.

This pattern of polysemy totaled 21 occurrences in our corpus: 11 related to the Pass scene (tocar means passar) and 10 to the Shot scene (tocar means chutar).

It is worth mentioning that polysemy did not affect only verbs, but also nouns, as in the case of defesa [defense]. This noun appears in two distinct frames. The first is the frame Defender (Save), within the Chute (Shot) scene. In this frame, the noun defesa, which is the target, i.e., the lexical item that evokes the frame, appears in sentences in which the subject NP is the goalkeeper, who prevents his opponent from scoring by holding the ball, or by touching it with any other part of his body, in such a way that the ball does not enter the goal. We can see this in sentences (17) to (19) below:

(17) [Felipe GOLEIRO] ainda teve de fazer mais uma boa defesa antes do fim do primeiro tempo. [Scene: Shot/Frame: Save]
[Felipe GOALKEEPER] still had to perform one more good defense before the end of the first half.

(18) Aos 33min, foi a vez de [Elias JOGADOR_QUE_CHUTA] arriscar [de longa distância ORIGEM], mas [o bom goleiro Bobadilla GOLEIRO] estava lá novamente para fazer outra defesa. [Scene: Shot/Frame: Save]
At 33 minutes, it was [Elias’ SHOOTER] turn to try his luck [from a long distance ORIGIN], but [the good goalkeeper Bobadilla GOALKEEPER] was there again to perform another defense.

(19) Com a vantagem no marcador, o Timão voltou ainda melhor na segunda etapa e quase ampliou duas vezes com Dentinho – na primeira, a bola passou tirando tinta do gol de Contreras; na segunda, [o arqueiro uruguaio GOLEIRO] fez uma grande defesa. [Scene: Shot/Frame: Save]
With the advantage on the scoreboard, Timão came back even better in the second half and almost extended the lead twice through Dentinho: the first time, the ball passed scraping the paint off Contreras’ goal; the second time, [the Uruguayan goalkeeper GOALKEEPER] performed a great defense.


(20) Instantes depois do gol, [Leandro JOGADOR_QUE_CHUTA] chutou [de fora da área ORIGEM] e exigiu boa defesa d[o goleiro Fernando GOLEIRO]. [Scene: Shot/Frame: Save]
Moments after the goal, [Leandro SHOOTER] shot [from outside the area ORIGIN] and demanded a good defense from [goalkeeper Fernando GOALKEEPER].

We found that 36 sentences in our corpus presented patterns of polysemy like the ones shown above in relation to the Shot scene. The second pattern of polysemy identified in our corpus is related to the scene Passe (Pass), more precisely to the frame Interceptar (Intercept), which belongs to this scene. In this case, the noun defesa is not a target but a frame element of the frame Interceptar, as sentences (21) to (24) show (where the noun being discussed is underlined):

(21) O Atlético retornou do intervalo com Diego Macedo no lugar de Júnior e teve a primeira boa chance no cruzamento de [Coelho JOGADOR_QUE_PASSA] [pela direita DIREÇÃO], interceptado pel[a defesa carioca INTERCEPTADOR]. [Scene: Pass/Frame: Intercept]
Atlético returned from the break with Diego Macedo in place of Júnior and had its first good chance in [Coelho’s PASSER] cross [from the right DIRECTION], intercepted by [the carioca defense INTERCEPTOR].

(22) O Galo2 tentou reagir no cruzamento de [Muriqui JOGADOR_QUE_PASSA] [pela direita DIREÇÃO], cortado pel[a defesa do Fluminense INTERCEPTADOR]. [Scene: Pass/Frame: Intercept]
Galo tried to react with [Muriqui’s PASSER] cross [from the right side DIRECTION], intercepted by [Fluminense’s defense INTERCEPTOR].

(23) O Atlético deu a saída e foi a primeiro a levar perigo na tabela entre [Ricardinho JOGADOR_QUE_PASSA] e [Muriqui JOGADOR_QUE_RECEBE], mas o cruzamento d[o atacante JOGADOR_QUE_PASSA] foi interceptado pel[a defesa paranaense INTERCEPTADOR]. [Scene: Pass/Frame: Intercept]

2 In English, Galo is a rooster. It is common in Brazil to refer to the Atlético team as “the rooster”, since its mascot is one.


Atlético kicked off and was the first team to threaten, in a pass combination between [Ricardinho PASSER] and [Muriqui RECIPIENT], but [the striker’s PASSER] cross was intercepted by [the Paraná defense INTERCEPTOR].

(24) Aos 31 minutos, [Carlos Alberto JOGADOR_QUE_PASSA] cruzou com perigo [pela direita DIREÇÃO] e [a defesa do Grêmio Prudente INTERCEPTADOR] fez o corte. [Scene: Pass/Frame: Intercept]
At 31 minutes, [Carlos Alberto PASSER] crossed dangerously [from the right side DIRECTION] and [Grêmio Prudente’s defense INTERCEPTOR] made the cut.

Defesa, in this sense, does not refer to the goalkeeper’s activity but to the group of players known as zagueiros (fullbacks, or defensive players, in English). This sense exists in English, too, but it is not related to the frame Intercept. Instead, it is related to the frame Team, which belongs to the Actors scene, as the figure below shows:

Fig. 2: The noun defence in the Kicktionary site (source: www.kicktionary.de/protectedData/SCENARIOS/LUs/Team/LU_1355.html)

In relation to the frame Defender (in the Shot scene), the English equivalent of defesa is the noun save (there is no lexical unit defense, as there is in Portuguese; see Figure 3). Thus, we can observe that Portuguese lexicalizes the Defesa frame differently from English. Now that these cases of polysemy have been listed, we can return to the challenges faced in the annotation process (mentioned at the beginning of this section).


Fig. 3: English lexical units listed in the Kicktionary (source: http://www.kicktionary.de/protectedData/SCENARIOS/StartFrame.html)

When these sentences were annotated, the ambiguity of these lexical items emerged, leaving annotators unsure which frame to select. The criterion adopted was the context in which the sentences occurred. In the case of tocar, for instance, the syntactic pattern itself, VP+PP (tocar para o gol) or VP+NP (tocar a bola), does not vary from one context to another (it does not change when this verb means to pass the ball or to kick subtly), so context played an important role. The case of the noun defesa was slightly different, considering that, in the VP+PP case, the verbs are targets and, in the VP+NP case, one of the nouns is a target (which evokes the frame Defesa) and the other one is a frame element: the presence of the noun goleiro indicated which frame was involved, in the same way that PPs like pela defesa paranaense helped to identify the group of defensive players.

Polysemy was also present in the annotation of the Goal scenario. Probably the most apparent case in this scenario concerns the verb marcar. This lexical unit appears in two different frames of two different scenarios: Goal, from the Goal scenario, and Referee_decision, from the Foul scenario, as the examples below show:

(25) Após uma defesa de Felipe, [Renato Cajá JOGADOR_QUE_FEZ_O_GOL] pegou o [rebote EVENTO_PREPARATÓRIO] e marcou, igualando [o placar PLACAR]. [Scene: Goal/Frame: Goal]
After Felipe’s save, [Renato Cajá SCORER] got the rebound and marked, equalizing [the score SCORE].

(26) Aos 40 minutos, [o árbitro JUIZ] marcou [pênalti COMPENSAÇÃO] de Jairo Campos em Marcelo Oliveira. [Scene: Foul/Frame: Referee_decision]
At 40 minutes, [the referee REFEREE] marked [a penalty COMPENSATION] by Jairo Campos on Marcelo Oliveira.

In the first example, marcar carries the sense of scoring a goal, while in the second this lexical unit refers to an event in which the referee [árbitro] awards a penalty in a foul situation. As pointed out earlier in this section, such cases created difficulties for the annotators, because marcar, as a frame evoker, can relate to different situations.

In the Kicktionary, however, this polysemy does not occur. In the context of the Goal scenario, marcar corresponds to score. In the Foul scenario, two lexical units correspond to Portuguese marcar: award and rule. While in Portuguese the same lexical unit, in the Foul scenario, serves two perspectives (that of the offender and that of the offended player), in English the Kicktionary has a lexical unit for each of these perspectives: award.v for the offended player and rule.v for the offender. The sentences below, taken from the Kicktionary, illustrate this point:

(27) But [the referee REFEREE] had already awarded [a freekick COMPENSATION] after adjudging Ciprian Marica to have fouled in the build up.

(28) [Owen OFFENDER] also turned in a Beckham centre but was ruled [offside OFFENSE].

In the first sentence, the lexical unit award.v is related to a compensation, that is, to an opportunity that the referee gives the offended player to compensate for the offense. In the second example, ruled is connected to an offense, the violation committed by the offender.
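The context criterion discussed in this section was applied manually by the annotators. Purely as an illustration of the idea, the sketch below encodes a few context cues as rules: a polysemous item is assigned a scene and frame depending on whether certain cue words occur in the same sentence. The cue lists, the default choices and the function name `choose_frame` are assumptions made for this example, and such crude cues would miss cases like (17), where the goalkeeper is referred to only by name.

```python
import re

# lemma -> list of (cue words, (scene, frame)); the first matching rule wins.
RULES = {
    "marcar": [({"árbitro", "juiz", "pênalti", "falta"}, ("Foul", "Referee_decision"))],
    "defesa": [({"goleiro", "arqueiro"}, ("Shot", "Save"))],
}

# Fallback reading when no cue word is found (hypothetical defaults).
DEFAULTS = {
    "marcar": ("Goal", "Goal"),
    "defesa": ("Pass", "Intercept"),  # here defesa is a frame element, not the target
}


def choose_frame(lemma, sentence):
    """Pick a (scene, frame) pair for a polysemous lemma from context cues."""
    tokens = set(re.findall(r"\w+", sentence.lower()))
    for cues, scene_frame in RULES.get(lemma, []):
        if cues & tokens:
            return scene_frame
    return DEFAULTS.get(lemma)


print(choose_frame("marcar", "Aos 40 minutos, o árbitro marcou pênalti de Jairo Campos em Marcelo Oliveira."))
print(choose_frame("marcar", "Renato Cajá pegou o rebote e marcou, igualando o placar."))
print(choose_frame("defesa", "Leandro chutou de fora da área e exigiu boa defesa do goleiro Fernando."))
print(choose_frame("defesa", "O cruzamento do atacante foi interceptado pela defesa paranaense."))
```

A rule table of this kind makes the annotators' reasoning explicit, but it is no substitute for reading the wider context, which is what the manual annotation relied on.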

8. Final remarks

The annotation process reported in this paper represents the first stage of a lexicographic project which aims to create a bilingual dictionary of the soccer domain. The challenges faced during the annotation of a monolingual corpus of the language of this sport lead to reflections on the creation of bilingual lexicons for the soccer domain: considering the difficulties mentioned in relation to translation equivalents, we realized the necessity, as well as the importance, of updating the data offered by the pioneering project by adding representative lexical units in Brazilian Portuguese. It is worth mentioning that the development of bilingual computational lexicons also demands finding equivalents for metaphorical expressions and for support-verb constructions, as the examples from our corpus have shown.

Acknowledgments

This paper was supported by the agencies CAPES, CNPq and FINEP, through Edital number 001/2010, MCT/CNPq/FINEP – Post-Ph.D National Program (PNPD), and FAPERGS – Programa de Complementação de Bolsas de Pós-Doutorado – process nr. 1612/12-1.


References

Burchardt, A., Erk, K., Frank, A., Kowalski, A., Padó, S. and Pinkal, M. 2006, “SALTO – A Versatile Multi-Level Annotation Tool”, in: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006). Genoa: ELRA, 517-520.
Goffman, E. 1974, Frame Analysis. New York: Harper.
Fillmore, C. J. 1968, “The case for case”, in: Bach, E. and Harms, R. T. (Eds.). Universals in Linguistic Theory, vol. 67. New York: Holt, Rinehart and Winston, 1-88.
Fillmore, C. J. 1975, “An alternative to checklist theories of meaning”, in: Proceedings of the First Annual Meeting of the Berkeley Linguistics Society. Berkeley: Berkeley Linguistics Society, 123-131.
Fillmore, C. J. 1977, “Scenes-and-frames semantics”, in: Zampolli, A. (Ed.). Linguistic Structures Processing: Fundamental Studies in Computer Science, no. 59. Amsterdam: North Holland Publishing, 55-88.
Fillmore, C. J. 1982, “Frame semantics”, in: Linguistics in the Morning Calm. Seoul: Hanshin Publishing Co., 111-137.
Fillmore, C. J. 1985, “Frames and the semantics of understanding”, Quaderni di Semantica, vol. 6, no. 2, 222-254.
Fillmore, C. J. and Atkins, B. T. 1992, “Toward a frame-based lexicon: The semantics of RISK and its neighbors”, in: Lehrer, A. and Kittay, E. F. (Eds.). Frames, Fields and Contrasts: New Essays in Semantic and Lexical Organization. Hillsdale, New Jersey: Erlbaum, 75-102.
Fillmore, C. J. and Baker, C. 2010, “A frames approach to semantic analysis”, in: Heine, B. and Narrog, H. (Eds.). The Oxford Handbook of Linguistic Analysis. Oxford: Oxford University Press, 313-339.
Fillmore, C. J., Johnson, C. R. and Petruck, M. R. L. 2003, “Background to FrameNet”, International Journal of Lexicography, vol. 16, no. 3, 235-250.
Minsky, M. 1974, A Framework for Representing Knowledge. Artificial Intelligence Memo No. 306. Cambridge, MA: Massachusetts Institute of Technology.
Schmidt, T. 2009, “The Kicktionary – a multilingual lexical resource of football language”, in: Boas, H. C. (Ed.). Multilingual FrameNets Methods and Applications. Berlin/New York: Mouton de Gruyter, 101-132.
Tesnière, L. 1959, Éléments de syntaxe structurale. Paris: Klincksieck.

CHAPTER EIGHT

SPARKLING VAMPIRE … LOL! ANNOTATING OPINIONS IN A BOOK REVIEW CORPUS

CLÁUDIA FREITAS*, EDUARDO MOTTA†, RUY LUIZ MILIDIÚ† AND JULIANA CÉSAR*

* PUC-Rio/Department of Modern Languages.
† PUC-Rio/Department of Computer Science.

1. Introduction

Literary and non-literary texts are a rich repository of our praxis, culture and language, carrying factual and subjective information. Since the advent of the internet, the amount of text produced and made available has increased continuously, and so has the need to handle this large bulk of unstructured information. However, in the form of raw text, corpora, understood as considerable amounts of documents available for electronic processing (Leech, 2005; Sinclair, 2005; Sampson, 2001; Santos, 2008), are of limited use for research. When enriched with linguistic information, especially semantic information, they turn into valuable resources for investigating language and culture and for implementing systems dedicated to natural language tasks.

Natural Language Processing (NLP) and language studies, specifically with relation to language description, share an interest in the use of corpora. They can be related to each other in two ways:

(i) A description which is motivated by NLP applications, by the need to solve language tasks. Palmer et al. (2005), for instance, present a corpus annotated with semantic roles, designed mainly to implement machine learning systems for the task of semantic role labeling. As for the Portuguese language, Freitas et al. (2005)

introduce a noun phrase annotation model designed to support supervised learning. (ii) A description that benefits from Computational Linguistic tools (concordancers, lemmatizers, morphosyntactic taggers) when investigating language, as Biber et al.’s (1999) corpus-based grammar shows. Clearly, both approaches benefit from each other – a description motivated by an application enlightens the understanding of our habits of speech, and a description motivated by the understanding of our habits of speech contributes to the development of resources and tools dedicated to language dependent tasks. This paper presents ReLi (in Portuguese, Resenhas de Livros) – a usergenerated content corpus, comprised of book reviews posted on the Internet, manually annotated with respect to the expression of opinion about the books. The creation of ReLi is closely linked to the first approach mentioned above. The main interest in building this corpus relies on solving a specific problem: identifying opinions about certain entities in the texts. As a linguistic resource, ReLi aims to build systems for Sentiment Analysis/ Opinion Mining, a research area related to the automatic identification of opinion about and evaluation of entities such as people, products and organizations (Pang & Lee, 2008). From a linguistic perspective, there is a growing interest not only in compiling corpora with non-journalistic texts containing non-standardized language used on the web, but also in corpus-based studies that deal with evaluative language. Thus we believe ReLi is also a valuable resource for language studies. Lexical resources are acknowledged as a core part of NLP systems. They can be either inspired by the traditional dictionary model, such as Princeton WordNet (Fellbaun, 1998) and subsequent wordnets, or assume the format of a deeply annotated corpus, such as PropBank (Palmer et al., 2005). 
However, while creating a rich, manually annotated corpus is time-consuming, robust semantic interpretation is facilitated when based on shallow parsing methods, which in turn generally rely on deeply annotated corpora. ReLi is a corpus of book reviews manually annotated for opinions about books and their polarities (positive or negative). It is annotated at the sentence and phrase levels, and opinions and polarities were considered in the context of the whole text, rather than judged out of context. The instructions given to the annotators emphasized the importance of considering the context, but did not specify any particular words or parts of speech, so several different types of constituents were annotated.


Chapter Eight

2. Opinion Mining, Sentiment Analysis and related work

As Pang & Lee (2008) state, “what other people think” has always been an important piece of information during decision-making processes. Where we used to ask friends for recommendations, we can now also rely on online advice, even if given by strangers (Pang & Lee, 2008). This context explains the need for systems capable of processing subjective information. The challenges posed by this kind of task include determining (i) which documents or portions of documents contain opinion material, and (ii) the semantic orientation of these opinions.

The term “Sentiment Analysis” is broadly used to mean the computational treatment of opinion, sentiment, and subjectivity in text. However, when it first appeared, the term “sentiment” was used to refer to the automatic analysis of evaluative texts, owing to the interest in analyzing market sentiment. In turn, the term “Opinion Mining” was strongly associated with web searching or information retrieval: processing a set of search results, generating a list of product attributes (such as quality), and aggregating opinions about each of them (excellent, poor) (Pang & Lee, 2008). Currently, Sentiment Analysis has a wide range of interests, from the “traditional” characterization of reviews as positive or negative to the identification of viewpoints in politics, thus posing striking challenges from a linguistic perspective as well.

Most research in Sentiment Analysis starts from lexicons of emotions/sentiments, which can be general, such as SentiWordNet (Esuli & Sebastiani 2006), or specific to certain domains or tasks (Riloff & Wiebe 2003; Poirier et al. 2011). For European Portuguese, there is SentiLex (Silva et al. 2012), a lexicon derived from Senti-Corpus (Carvalho et al. 2011)¹. Aside from word classes, these lexicons contain the polarity associated with each item (positive, negative and, in some cases, neutral).
Words are a cluster of many kinds of information (grammatical, phonological, ideological, etc.), and affective information, simply treated here as polarity, is one example. A word or expression is considered to have polarity when it is systematically used to express an opinion or sentiment about something. “Perfect”, “lovely” and “like” are examples of words with positive polarity.

Polarity lexicons can be hand-crafted or automatically built, the latter being preferable. Automatic construction is underpinned by (i) hand-crafted lexical resources, such as WordNet (Fellbaum, 1998), or (ii) corpora, which may, in turn, be manually or automatically annotated with polarity information. Yet studies based on annotated corpora are few, because such corpora are costly to prepare. For Portuguese, we highlight Senti-Corpus (Carvalho et al. 2011), “a corpus of comments manually annotated with polarity and opinion targeting human entities, in particular politicians”. Regarding sentiment and the Portuguese language, we are aware of Maia & Santos (2012), who use corpora to explore the ways in which English and Portuguese express fear, based on the BNC and the Linguateca AC/DC corpora, the latter being automatically annotated with the lexicon of fear. The MPQA Opinion Corpus (Wiebe et al. 2005) is the corpus that most closely resembles ReLi. Specifically in relation to the identification of opinion in product evaluations, we highlight Zagibalov et al. (2010), who built parallel corpora in English and Russian from book reviews published on Amazon.com. The corpora were manually annotated for each review’s overall polarity.

¹ Although the explicit mention of a language variety is of no use when considering computational lexicons, we choose to highlight the “European” origin of SentiLex because, in a superficial analysis, we noticed that some words/expressions have polarity values different from what we would expect in Brazilian Portuguese, and we wonder whether these differences relate to language variety or to other aspects.

3. Annotation choices: Facts and opinions

By dealing with reviews, we expected to find only opinion texts, or texts with a fairly clear boundary between opinion and description/information, but this was not the case. There are reviews which summarize the book; long “reviews” with only one or two opinion sentences; and reviews with scarce opinions alongside random digressions unrelated to the book under review. Free-text reviews are unpredictable, unlike evaluations in which the evaluator is asked to write the product’s pros and cons separately, or to write a detailed evaluation in addition to the pros and cons, and the reviewers may be more creative than we would expect. However, no matter how difficult it might be, research in Sentiment Analysis is based on the distinction between opinion and fact, that is, between subjective and factual information. Although we have no doubt about considering (a) and (b) below as examples of opinion and fact, respectively, determining exactly where description/information ends and where opinion begins is not an easy task.


(a) I loved my new computer.
(b) The product is available in black, green and white.

In the book review domain, our main difficulty was to determine whether a given segment was an opinion or a piece of description, which is also a kind of attribute (as in “the ending is sad”). The examples below, taken from ReLi, illustrate the thin line between descriptions/attributes and opinions. The segments in italics correspond to some doubtful opinion candidates:

1. O vampiro é o cara. Altruísta, elegante, charmoso, não perde a compostura nunca... [The vampire is the man. Altruistic, elegant, charming, never gets carried away…]
2. (...) o imaginário encontro com um vampiro todo charmoso não é algo que me desperte a vontade de acompanhar a história. [(...) the imaginary meeting with a charming vampire is not something that makes me want to read the book]
3. Eis o inicio desta trama sobrenatural. Um vampiro charmoso, cavaleiro, um digno lorde, que se apaixona por uma “pobre” mortal. [Here is the beginning of this supernatural plot. A charming vampire, a gentleman, a lord, who falls in love with an ordinary girl]
4. Continuo achando a Bella Swan uma Emo idiota e oferecida. [I still think that Bella Swan is an idiot and loose emo]
5. É um livro estranho, tudo se passa tão rápido. [It’s an odd book, it all goes really fast.]

In (1), it is impossible for us to know whether there is an opinion or a description of the character, even though the first sentence suggests the existence of a positive feeling. In (2), we are under the impression that “charming” is a trait of the character; therefore, we do not consider it an opinion. In (3), we have a clear description of the vampire. In (4), there is no doubt that “idiot” and “loose” are opinions about the character, and in (5) we have a clear opinion, but were unable to assign a polarity value.
However, the occurrence of utterances that strongly convey opinions is undeniable, as in (4), and these certainly interest us given the opinion mining task. To account for this fragile distinction, we return to the application that motivates us: we want to know whether or not people liked the book or parts of the book. Thus, in order to be considered a piece of opinion, a sentence must answer, even weakly, the hypothetical question “How did you like the book/this part of the book?” If, for example, the answer is “I think it is sad/it is sad/it is a complex reading/it’s odd”, it is not possible to know whether s/he liked it or not, and so it is not possible to assign polarity to the sentence. On the other hand, if the answer is “It was horrible/I thought it was horrible”, we are clearly facing a negative opinion.

Thus, in order to avoid fuzzy decisions, we merged opinion mining and sentiment analysis: in the annotation schema, we only considered as opinions those statements containing attributes associated with evaluative judgments. An opinion sentence should simultaneously carry opinion and polarity (positive or negative). During the annotation process, we took a conservative view: if we were not sure whether the content was a description or an opinion, we did not annotate it. Thus, even though “charming” (in sentences 1-3) is often taken as expressive language, it is only through its use that we are able to validate this trait (and in our examples it was not validated). This shows the importance of context in the annotation process.

We did not consider prior, or intrinsic, polarities. Taboada et al. (2011), for instance, create a polarity lexicon based on the assumption that individual words have a prior polarity, understood as a context-independent semantic orientation. In our approach we take the opposite view. Instead of prior polarities, we consider “agreed” polarities: speakers tend to agree in considering some words positive or negative, although these polarities are not intrinsic. Yet, even considering contextual clues, there were cases in which polarity identification was not trivial, and we had to rely mainly on careful reading strategies, as sentences 6-8 illustrate:

6. É claro que o livro não é perfeito, mas é muito bom. [positive] [Of course the book is not perfect, but it is really good.]
7. É um livro bom, mas não é excelente. [positive] [It’s a good book, but it’s not excellent.]
8. Bom, mas meio melancólico. [positive] [It’s good, but a bit melancholic.]

3.1 Delimiting the opinion target

The opinion annotated in ReLi refers exclusively to the opinion about the book being reviewed. As such, opinions about other books or films (even films based on the reviewed books) were not considered. (However obvious this may be, it is an aspect to be taken into account when using ReLi as a training corpus for opinion mining.) We also did not consider opinions that involved comparisons, even if one of the objects compared was the book reviewed. We know this is an arbitrary choice, but nothing prevents us from extending the annotation to comparisons in the future.

The need to specify the limits of our opinion target was another unforeseen aspect. While the occurrence of book parts such as chapters, characters and language is predictable, mentions of parts specific to each book, such as “romantic part” in (9), are frequent as well:

9. “Não contente em derrapar na parte romântica, Stephenie Meyer também resolve deturpar o clássico e marcante vampiro...”. [Not happy with ruining the romantic part, Stephenie Meyer also misrepresents the classic and striking vampire...]

Even though our first intention was to discard those parts precisely because of their specificity, we soon realized that this would impose an artificial boundary on the opinion mining task. In other words, we were making an arbitrary decision based only on the ease of the annotation process. Therefore, we decided to consider opinions about the book and all of its parts, no matter how specific they were.

Another important point is that it is not always simple (or possible) to identify the exact passage that carries the opinion. In these cases, we decided to annotate the whole passage (or sentence), as examples 10-12 show:

10. Mas em um contexto todo, o livro conseguiu suprir minhas expectativas. [All in all, the book met my expectations.]
11. impossível abandonar o livro pela metade [it was impossible to stop reading the book before its end]
12. a tradução deveria ser CRAPúsculo². [The translation should be CRAPusculo]

Although rare, the presence of opinions in questions was another difficulty we faced in the annotation process. We decided not to annotate these cases, since the opinion/polarity was beyond the scope of the sentence (13-15):

13. Se o livro tem méritos? Eu diria que sim. [Does the book have merits? I would say so.]
14. O que falta para uma boa história? Nada! [What else is needed for a good story? Nothing!]
15. Ler os outros?!? Nem se eu tivesse a eternidade toda pela frente... [Read the other books?!! Not even if I had all the time in the world…]

In ReLi, then, the sentence is the unit of analysis. In order to analyze a sentence and verify its polarity, however, it may be necessary to read the whole paragraph or the entire review.

² In Portuguese, the book Twilight was translated as Crepúsculo.

4. The ReLi corpus

Liu et al. (2005) classify the evaluation of products into 3 types: in type 1, the evaluator is asked to write the pros and cons separately; in type 2, in addition to the pros and cons, the evaluator is asked to write a detailed evaluation; type 3 evaluation has an open format, in which the evaluator can write freely and there is no formal separation between pros and cons. The reviews that make up ReLi fit into the third type, and were extracted from the Skoob.com site, a social network of books and readers in which readers/collaborators actively participate, commenting on books they have read. The textual material varied widely in style, amount of subjective content and grammaticality. It showed a strong presence of alternative spellings, emoticons and other features typical of internet writing, posing additional challenges for automatic language processors.

The corpus is composed of 1600 reviews of 13 books (7 authors), comprising about 260,000 words and 12,000 sentences. There are around 200 reviews for each book; when this number could not be reached, we added other books by the same author until we arrived at a number close to 200. Table 1 presents ReLi’s content. Books and authors were chosen based on the number of reviews available per book. However, some authors/books were discarded due to a large number of identical reviews. The variety of book styles led to a variety of language styles in ReLi: from very informal writing with heavy use of slang expressions, abbreviations, neologisms and emoticons, to more formal reviews with a more refined vocabulary.


Table 1: The content of ReLi

Author | Title | Reviews | Sentences | Words
Stephenie Meyer | Crepúsculo (Twilight) | 409 | 3,266 | 62,268
Thalita Rebouças | Fala sério, amiga!; Fala sério, amor!; Fala sério, mãe!; Fala sério, pai!; Fala sério, professor! | 161 | 910 | 16,864
Sidney Sheldon | O Outro lado da meia noite (The Other Side of Midnight); O Reverso da Medalha (Master of the Game); Se houver Amanhã (If Tomorrow Comes) | 230 | 1,569 | 31,712
Jorge Amado | Capitães da Areia (Captains of the Sands) | 187 | 1,348 | 32,117
George Orwell | 1984 | 202 | 2,228 | 51,320
José Saramago | Ensaio sobre a Cegueira (Blindness) | 271 | 1,991 | 42,152
J.D. Salinger | O Apanhador no Campo de Centeio (The Catcher in the Rye) | 140 | 1,158 | 23,725
TOTAL | | 1,600 | 12,470 | 259,978

4.1. Annotation Scheme

Annotation consists of adding linguistic information (tags) to a corpus, according to the annotation goals. The definition of the tagset is a decision-making process related to the way the problem (or task) will be modeled. Morphosyntactic annotation attaches traditional linguistic categories, such as verb, noun or preposition, to words or expressions. In semantic annotation, semantic information is added to single words or larger portions of text. The kinds of tags are potentially infinite; examples include semantic classes of proper nouns, semantic roles, and polarities.


In ReLi, opinion was annotated at the phrase and sentence levels, in three parts:

(i) identification and annotation of the polarity of sentences that contain opinion;
(ii) identification of the target of opinion;
(iii) identification and annotation of the polarity of the segment (words or expressions) that contains opinion.

The annotation schema underlying the ReLi corpus involves:

Opinion identification: Identification of the segment which expresses an opinion. In (16), there is no opinion about the book; in (17), the opinion segment is underlined; and in (18), the whole sentence conveys opinion:

16. Tenho que admitir que li esse livro com um pouco de receio, não pelas críticas, mas pelo próprio gênero que não é minha leitura habitual. [I must admit I read this book with some caution, not because of the critiques, but because this is not the genre I usually read]
17. Os primeiros capítulos foram um tanto tediosos, mas como odeio abandonar um livro continuei. [The first chapters were kind of tedious, but as I hate abandoning a book, I kept going.]
18. Quando comecei a ler não consegui parar mais. [When I started reading, I couldn’t stop]

Sentiment classification/orientation: In (17) and (18) above, the semantic orientations are negative and positive, respectively.

Identification of the opinion target: In (17), the target is “primeiros capítulos” [first chapters]; in (18), the “implicit” target is the book.

We did not consider neutral polarities, and each sentence (and each target) might convey more than one opinion, as in (19):

19. Romance adolescente bonitinho e meloso que poderia ter acabado no primeiro livro. [A cute and mellow teenage novel that could have ended in the first book]

With annotation at the phrase level, we were able to deal with contrastive opinions in the same sentence. At the sentence level, however, only one overall opinion was considered. In (20), we have two targets (underlined): “story” and “description”. Different opinions and polarities are associated with each target. “Gostei” [“I liked it”] and “prendia” [“catchy”] are positive, and related to the story; “cansava” [“tiresome”] is negative and refers to “description”. However, even with the two distinct polarities, we interpreted the whole sentence as positive with respect to the book. Therefore, at the sentence level, the polarity is positive, even though at the phrase level there is a negative opinion.

20. Gostei da história em si, ela me prendia, mas toda essa descrição me cansava. [I liked the story itself, it was catchy but the whole description was tiresome.]

When the sentence did not explicitly mention the opinion target (for instance, “Fantastic!”, and in (18), above), we considered that, by default, the opinion was related to the book.
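As an illustration only, the two annotation levels described in this section could be represented with a small data structure such as the one below. This is a minimal sketch, not the format actually used in ReLi: the class names, field names and the `is_contrastive` helper are assumptions introduced here, and the segments are taken from examples (18) and (20).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Opinion:
    """Phrase-level annotation: an opinion segment, its polarity and its target."""
    text: str                     # word or expression that carries the opinion
    polarity: str                 # "+" or "-"; the scheme has no neutral value
    target: Optional[str] = None  # None stands for the default target, the book


@dataclass
class Sentence:
    """Sentence-level annotation: one overall polarity plus phrase-level opinions."""
    text: str
    polarity: Optional[str] = None  # None when there is no opinion about the book
    opinions: List[Opinion] = field(default_factory=list)

    def is_contrastive(self) -> bool:
        # True when both positive and negative opinions occur in the sentence.
        return {o.polarity for o in self.opinions} == {"+", "-"}


# Example (20): two targets, mixed phrase polarities, positive sentence polarity.
example_20 = Sentence(
    text="Gostei da história em si, ela me prendia, "
         "mas toda essa descrição me cansava.",
    polarity="+",
    opinions=[
        Opinion("Gostei", "+", target="história"),
        Opinion("prendia", "+", target="história"),
        Opinion("cansava", "-", target="descrição"),
    ],
)

# Example (18): the whole sentence is the opinion and the target is implicit.
example_18 = Sentence(
    text="Quando comecei a ler não consegui parar mais.",
    polarity="+",
    opinions=[Opinion("Quando comecei a ler não consegui parar mais.", "+")],
)
```

Keeping the sentence polarity separate from the phrase-level opinions mirrors the decision above: a sentence such as (20) can be positive overall while still recording a negative opinion about one of its targets.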

4.2. Annotation process

Fine-tuning among annotators is a crucial aspect of the annotation process, especially when the task depends on a high level of semantic interpretation. ReLi was initially annotated by three annotators (A, B and C), although at the end there was only one left. All the annotators were undergraduate students in the Language and Literature program at PUC-Rio. They all went through a training process until they were familiar with the task, the instructions, and the annotation tool. The relevance of always making interpretation decisions within the context of the review was also emphasized. During the training period, as well as throughout the whole annotation process, annotators were encouraged to ask questions and discuss their options and, when unexpected cases arose, solutions were discussed and incorporated into the manual (Freitas & Cesar, 2012). Once the manual was finished, the corpus went through a revision process to spot inconsistencies. A general revision of the content was also made by one of the authors of this article. The annotation tool was adapted from a tool already developed by our research group. It had a “Help” function designed to highlight sentences and excerpts that annotators found difficult to annotate.

5. Inter-Annotator Agreement Study

After around 400 annotated reviews, we conducted a study of the agreement between the annotators. The study considered 2 sets of data: (i) the annotation of 2 annotators in 390 reviews (48,000 words); (ii) the annotation of the 3 annotators in a subset of 107 reviews (1,200 words). The following points were considered in the evaluation of agreement: sentences selected; polarity of the sentences selected; objects selected; opinions selected; polarity of the opinions selected.

Although the annotation instructions gave some information with respect to segmentation, some variation was expected as to the extent of the units selected. One of the challenges, therefore, was defining agreement in cases in which the annotators identified the same opinion or target of opinion, but diverged on the limits of the unit. In fact, this is a task whose evaluation is more complex than judging the assignment of polarity to sentences. We relied on the evaluation process of Wiebe et al. (2005) in two major points: (i) we considered expressions such as “final part” and “final” as equivalent expressions, and (ii) we used the agr metric (Wiebe et al., 2005), whose objective was to evaluate whether the annotators identified the same set of objects and opinions, that is, how much of what A annotated was also annotated by B.

Table 2 shows the results of the agreement study between annotators A, B and C. The first row (A|B) indicates the agreement between A and B, taking A as the baseline (in other words, how many segments identified by A were also identified by B). The second row reflects the opposite situation: considering B as the baseline, how many choices made by B were also made by A. The same applies to A|C; C|A; C|B; B|C. We separated the agreement study into two groups: agreement related to identification (whether the annotators identified the same set of expressions, as to opinion target and opinion itself) and agreement related to polarity assignment (once annotators agreed on the selected part, we measured whether they agreed about polarity).

As expected, agreement on polarity assignment (almost 100%) was higher than agreement on the identification of the set of expressions that contained opinion (around 80%). Once annotators agreed on what to annotate, they tended to agree on the polarity orientation as well. The qualitative analysis showed that the rare cases of disagreement were due to adversative sentences in which the same sentence conveys contrastive opinions, and that the disagreement was precisely in the assignment of the overall polarity at the sentence level. Phrase-level disagreement occurred only once. It took place with the adjective “mirabolante” (we could not find a faithful translation for this word; some suggestions are “fantastic, lavish, extravagant”).
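The directional agr measure described in this section (how much of what the baseline annotator marked was also marked by the other annotator) can be sketched as follows. This is an illustrative sketch, not the evaluation script used for ReLi: representing annotations as character-offset spans, and counting any overlap as a match (in the spirit of treating “final part” and “final” as equivalent), are assumptions made here, and the spans are invented.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True when the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def agr(baseline: List[Span], other: List[Span]) -> float:
    """Share of the baseline annotator's spans also found by the other annotator."""
    if not baseline:
        return 0.0
    matched = sum(1 for a in baseline if any(overlaps(a, b) for b in other))
    return matched / len(baseline)


# Hypothetical spans for two annotators over the same review.
spans_a = [(0, 10), (15, 30), (40, 52), (60, 70)]
spans_b = [(0, 5), (18, 30), (80, 90)]

agr_a_b = agr(spans_a, spans_b)    # A as baseline (row "A|B"): 2 of 4 spans matched
agr_b_a = agr(spans_b, spans_a)    # B as baseline (row "B|A"): 2 of 3 spans matched
average = (agr_a_b + agr_b_a) / 2  # corresponds to an "Average" row

print(f"A|B = {agr_a_b:.3f}, B|A = {agr_b_a:.3f}, average = {average:.3f}")
```

Because the measure is directional, the two values for a pair generally differ: an annotator who selects more segments lowers the score in which he or she is the baseline.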


Analyzing the identification agreement, we observed that annotator A tended to consider more elements during the annotation process. Given the complexity of the task, agreement of around 80% is a very good result. Although the type of annotation is not exactly the same, the agreement reported in Wiebe et al. (2005) for the identification of opinion phrases is 72%, which corroborates the idea of uniformity among the annotators of ReLi.

Table 2: Inter-annotator agreement results

Annotators      Identification                  Polarity assignment
                Sentence   Target   Opinion     Sentence   Phrase
A|B             82.9       69.9     73.9
B|A             87.2       72.3     83.1
Average A, B    85         71.1     78.5        98.1       99.9
A|C             77.7       74.4     78.6
C|A             86.1       76.1     84.5
Average A, C    81.9       75.3     81.5        98.4       99.6
B|C             78.5       72.9     84.6
C|B             78.4       69.6     75.5
Average B, C    78.5       71.3     79.8        98.4       100

Polarity assignment agreement is given once per annotator pair.

6. Exploring the corpus

Table 3 shows the polarity distribution in ReLi. Of the roughly 12,000 sentences in the corpus, almost 24% are opinion sentences³. About 80% of them are positive, and only 5% of them convey contrastive opinions.

Table 3: Distribution of polarities in ReLi

ReLi data | Quantity
n. sentences | 12,514
n. sentences [+] | 2,433
n. sentences [-] | 544
n. contrastive sentences | 169
n. opinions [+] | 4,268
n. opinions [-] | 1,023

³ 24% may seem low, but we must keep in mind we annotated only opinions about the book reviewed and we were interested in opinions that conveyed positive or negative polarities.
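The proportions quoted in this section can be rechecked from the counts in Table 3. The short calculation below is only a sketch of that arithmetic; it assumes that the opinion sentences are the positive plus the negative sentences, and that the contrastive sentences are already counted among them, which the text does not state explicitly.

```python
# Counts copied from Table 3.
n_sentences = 12514
n_sent_pos = 2433
n_sent_neg = 544
n_contrastive = 169
n_op_pos = 4268
n_op_neg = 1023

# Assumption: opinion sentences = positive + negative sentences.
n_opinion_sentences = n_sent_pos + n_sent_neg                  # 2977

share_opinion = n_opinion_sentences / n_sentences              # about 23.8%
share_positive = n_sent_pos / n_opinion_sentences              # about 81.7%
share_contrastive = n_contrastive / n_opinion_sentences        # about 5.7%
share_positive_opinions = n_op_pos / (n_op_pos + n_op_neg)     # about 80.7%

print(f"opinion sentences: {share_opinion:.1%}")
print(f"positive among opinion sentences: {share_positive:.1%}")
print(f"contrastive among opinion sentences: {share_contrastive:.1%}")
print(f"positive among phrase-level opinions: {share_positive_opinions:.1%}")
```

Under this reading, the figures agree with the rounded values given in the text (“almost 24%”, “about 80%”, roughly 5%).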


Briefly exploring ReLi from a qualitative point of view, we underscore the importance of annotation in context for determining polarity. We noted the presence of lexical “false signs”, that is, words or expressions which, out of context, could be considered negative but were used in a positive way, as examples (21-25) illustrate:

21. Lamentável tê-lo lido somente agora. [It was sad to have read it only now]
22. O livro é sufocante. [The book is suffocating]
23. Fui a nocaute, sem direito a (re)contagem. Chorei. Fiquei meio deprê. [I was knocked out, with no right to a (re)count. I cried. I got a bit down]
24. Tenso. É o único livro que me deixou nervoso e apreensivo enquanto lia. [Tense. It was the only book that got me nervous and apprehensive while reading it]
25. Dolorido, pavoroso, nojento, repugnante e nauseante. Por todos esses adjetivos que o livro nos causa, ele consegue ser bom. [Painful, awful, disgusting, repulsive and nauseating. For all these adjectives that the book makes us feel, it is a good one.]

There were instances when the same lexical item conveyed opposite polarities:

26. Nunca sofri tanto para ler um livro. [I’ve never suffered like that to read a book.] → negative
27. Eu sofria a cada vez que tinha que adiar a leitura. [I suffered every time I had to postpone reading it] → positive

Other words or expressions, generally considered neutral, appear with a clearly defined polarity, reinforcing the pertinence of a corpus-based approach. The verb to swallow and the adjective incredible, for example, are used only in a negative and a positive way, respectively.

We were surprised by the substantial number of unclear texts, poorly organized and punctuated, which sometimes made reading very difficult. We were also surprised by words used in a sense different from the licensed use (28-30), leading us to reflect upon the concept of “error”. From a practical perspective, this makes the annotation task more difficult.

28. “mamão com açúcar” [a piece of cake] (literally “papaya with sugar”) instead of “água com açúcar” (something light, naïve, literally “water with sugar”)


29. “estilo simplista” [simplistic style] instead of “estilo simples” (simple style)
30. “seu texto é pegajoso até o final.” [the text is clingy to the end], when the intention was to say the text grabs the reader.

As these were rare cases, we expect them to be diluted in the corpus, but they serve as a point to be considered when we work with unrevised texts⁴. Neologisms (“eu realmente não gosto dela, acho ela muito tonga” [I really don’t like her, I think she is very tonga]), slang expressions and internet expressions (“rá!”; “Nah neh noh”) contribute to the description of the Portuguese language, and also show the need for robust POS taggers.

To conclude, although superficial, the corpus analysis revealed interesting facts about the creativity with which we express opinions, in particular concerning irony. In 31-34 below, we offer a small sample of what we found, as an invitation for the reader to explore ReLi.

31. Vampiro que brilha... rá ! Daqui a pouco ele vive de tofu! [Sparkling vampire… lol! In a while he will live on tofu!]
32. A vida de Bella passa a ser o pior pesadelo que ela já teve: viver em um lugar onde o sol aparece menos que foto do Roberto Carlos de sunga. [Bella’s life becomes her worst nightmare: to live in a place where the sun comes out less often than a photo of Roberto Carlos in briefs⁵.]
33. Stephenie Meyer acabou com o mito do vampiro, depois desse livro só o que o Conde Vlad deseja é ter uma estaca de madeira no peito. [Stephenie Meyer has killed the vampire myth; after that book, the only thing Count Vlad wants is a wooden stake in his chest.]
34. Tudo o que vi foi blábláblá os vampiros eram lindos blábláblá e eram alvos como o leite blábláblá e Edward era a coisa mais linda que Bella já vira e blábláblá eram realmente brancos blábláblá ... [Everything I saw was blablabla the vampires were so beautiful blablabla and were as white as milk blablabla and Edward was the most beautiful creature that Bella had ever seen blablabla and they were really white blablabla…]

⁴ In these cases, the annotation corresponds to the author’s intention. So, “pegajoso” was considered positive.
⁵ Roberto Carlos is a famous Brazilian singer who has an artificial leg.
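The “false signs” and opposite polarities discussed in this section can be made concrete with a toy comparison between a context-free lexicon lookup and the in-context labels of examples (22), (26) and (27). This is a deliberately naive sketch: the three-entry lexicon and its values are hypothetical, invented here for illustration, and do not come from SentiLex or any other published resource.

```python
import re
from typing import Optional

# Hypothetical prior-polarity lexicon (illustrative values only).
PRIOR_POLARITY = {"sofri": "-", "sofria": "-", "sufocante": "-"}


def prior_label(sentence: str) -> Optional[str]:
    """Label a sentence from word-level prior polarities, ignoring context."""
    tokens = re.findall(r"\w+", sentence.lower())
    labels = {PRIOR_POLARITY[t] for t in tokens if t in PRIOR_POLARITY}
    # Only decide when all lexicon hits point the same way.
    return labels.pop() if len(labels) == 1 else None


# (sentence, polarity assigned in context), following examples 26, 27 and 22.
annotated = [
    ("Nunca sofri tanto para ler um livro.", "-"),
    ("Eu sofria a cada vez que tinha que adiar a leitura.", "+"),
    ("O livro é sufocante.", "+"),
]

# Sentences where the context-free label disagrees with the contextual one.
mismatches = [s for s, gold in annotated if prior_label(s) != gold]

for sentence in mismatches:
    print("lexicon lookup disagrees with the annotation:", sentence)
```

In this toy setting the lookup matches the annotation only for (26); for (27) and (22) the contextual polarity is the opposite of what the word alone suggests, which is the situation that motivates annotating polarity in context.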


7. Final remarks

We have presented ReLi, a manually annotated corpus of book reviews with opinions and their polarities. ReLi was built to support the development of NLP systems, but also to support language investigations concerning the way we express our opinions and emotions, based on naturally occurring texts. As the corpus is a user-generated content corpus, its reviews were written by non-specialists in an online social network of readers. Thus we expect ReLi to contribute to the investigation of web genres as well, joining the still sparse Portuguese corpora consisting of non-journalistic texts.

If, on the one hand, the lack of material about evaluative language in Portuguese⁶ heightened our challenge, on the other hand, it forced us to produce extremely detailed documentation. The high level of agreement between the annotators, for a task known to be complex, is a good indication of how clear the annotation guidelines were⁷.

We point out that the granularity of the proposed annotation can be abstracted away in cases in which there is no need for complete semantic information, such as the training of systems capable of detecting the polarity of sentences. However, from the point of view of the description of Portuguese, the granularity of annotation can be of great value. Polarity at the phrase level allows us to capture information/opinions that would otherwise be difficult to examine, such as sentences with mixed polarity, as well as words or expressions that tend to carry positive or negative values.

With the present paper, then, we seek to combine the creation of linguistic resources for use by both NLP and Linguistics. A good corpus must not only be well documented and, once annotated, human-revised. Considering mainly the linguistic community, it must also be available on the web and associated with a search interface/search tool, so researchers can find what they are looking for (Santos, 2008b). We believe ReLi is on the right path: it has been documented⁸, the manual annotation has undergone human revision (although we are aware a corpus is never 100% perfect), and it is now also available through the AC/DC project⁹ (Santos, 2011).

In conclusion, a corpus is (a portion of) language in use, with its regularities and irregularities. An annotated corpus is the product of an analysis, and a corpus annotated with semantic information is the result of an interpretation process. Thus, the semantic annotation task can be described as an attempt to formalize our understanding about something, from a certain point of view, with certain interests, and with the hesitations and variations that might naturally exist.

⁶ When we searched for papers on expressing opinions in Portuguese, we found a study related to expressing contrary opinions from an L2 perspective (Almeida, 2007), and another one related to an analysis of the act of giving second evaluations or opinions (Oliveira, 1996), both studies with approaches and interests different from ours.
⁷ http://www.linguateca.pt/Repositorio/ReLi/
⁸ ReLi corpus and guidelines: http://www.linguateca.pt/Repositorio/ReLi/

9 http://www.linguateca.pt/ACDC

References

Almeida, Patricia Maria Campos de 2007, A elaboração da opinião desfavorável em português do Brasil e sua inserção nos estudos de português como segunda língua para estrangeiros. Tese de doutorado em Letras. Rio de Janeiro: Pontifícia Universidade Católica do Rio de Janeiro.
Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad and Edward Finegan 1999, Longman grammar of spoken and written English. New York: Pearson Education.
Carvalho, Paula, Sarmento, Luís, Teixeira, Jorge and Silva, Mário J. 2011, “Liars and Saviors in a Sentiment Annotated Corpus of Comments to Political Debates”. ACL (Short Papers), 564-568.
Esuli, Andrea and Sebastiani, Fabrizio 2006, “SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining”, in: Proceedings of the 5th Conference on Language Resources and Evaluation, 417-422.
Fellbaum, Christiane 1998, WordNet: An Electronic Lexical Database, MIT Press.
Freitas, M. Cláudia, Milena Uzeda-Garrão, Claudia Oliveira, Cícero Nogueira dos Santos and Maria Cândida Silveira 2005, “A anotação de um corpus para o aprendizado supervisionado de um modelo de SN”. In: Anais do XXV Congresso da Sociedade Brasileira de Computação, São Leopoldo, Brasil, 2178-2187.
Freitas, Cláudia and Cesar, Juliana 2012, “Diretivas para a anotação da opinião em resenha de livros”, Release 2.3. http://www.linguateca.pt/Repositorio/ReLi/
Leech, Geoffrey 2005, “Adding Linguistic Annotation”, in Developing Linguistic Corpora: a Guide to Good Practice, ed. M. Wynne. Oxford: Oxbow Books: 17-29.

Sparkling Vampire … LOL!

Liu, Bing, Hu, Minqing and Cheng, Junsheng 2005, “Opinion Observer: Analyzing and Comparing Opinions on the Web”, in: Proceedings of the 14th International World Wide Web Conference. Chiba, Japan.
Maia, Belinda and Santos, Diana 2012, “Who is afraid of... what? - In English and in Portuguese”. In Signe Oksefjell Ebeling, Jarle Ebeling & Hilde Hasselgård (eds.), Aspects of corpus linguistics: compilation, annotation, analysis. Studies in Variation, Contact and Change in English, 2012. e-ISSN: 1797-4453. http://www.helsinki.fi/varieng/journal/volumes/12/maia_santos/
Oliveira, Maria do Carmo Leite de 1996, “A organização de preferência em cartas de pedido de empresas estatais brasileiras”. D.E.L.T.A., vol. 12, nº 2, pp. 265-280.
Palmer, Martha, Kingsbury, Paul and Gildea, Daniel 2005, “The Proposition Bank: An Annotated Corpus of Semantic Roles”. Computational Linguistics 31 (1): 71–106.
Pang, Bo and Lee, Lillian 2008, Opinion Mining and Sentiment Analysis, Foundations and Trends in Information Retrieval, Vol. 2, No. 1–2, pp. 1–135.
Poirier, Damien, Bothorel, Cécile, Guimier, Emilie De Neef and Boullé, Marc 2011, “Automating Opinion Analysis in Film Reviews: the Case of Statistic versus Linguistic Approach”, in: Ahmad, K. (ed.), Affective Computing and Sentiment Analysis: Metaphor, Ontology, Affect and Terminology.
Riloff, Ellen and Wiebe, Janyce 2003, “Learning Extraction Patterns for Subjective Expressions”, in: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing.
Sampson, Geoffrey 2001, Empirical Linguistics. London: Continuum.
Santos, Diana 2008, “Corporizando algumas questões”. In Stella E. O. Tagnin & Oto Araújo Vale (orgs.), Avanços da Lingüística de Corpus no Brasil, Editora Humanitas/FFLCH/USP, São Paulo, pp. 41-66.
Santos, Diana 2008b, “Linguística com corpos na era da internetização”. Invited talk, VII Encontro de Lingüística de Corpus (6-7 November 2008), Universidade Estadual Paulista - UNESP - Campus de São José do Rio Preto, Brasil. Available at http://www.linguateca.pt/Diana/download/SantosELC2008.pdf [Accessed 27/05/2013]
Santos, Diana 2011, “Linguateca’s infrastructure for Portuguese and how it allows the detailed study of language varieties”, in J.B. Johannessen (ed.), Language Variation Infrastructure. OSLa: Oslo Studies in Language 3, 2, pp. 113-128.

Chapter Eight

Sinclair, John 2005, “Corpus and Text - Basic Principles”, in Developing Linguistic Corpora: a Guide to Good Practice, ed. M. Wynne. Oxford: Oxbow Books: 1-16.
Silva, Mário J., Carvalho, Paula and Sarmento, Luís 2012, “Building a Sentiment Lexicon for Social Judgement Mining”, in: Lecture Notes in Computer Science (LNCS), International Conference on Computational Processing of the Portuguese Language (PROPOR), Springer, pp. 218-228.
Taboada, Maite, Brooke, Julian, Tofiloski, Milan, Voll, Kimberly and Stede, Manfred 2011, “Lexicon-based methods for sentiment analysis”. Computational Linguistics 37 (2): 267-307.
Wiebe, Janyce, Wilson, Theresa and Cardie, Claire 2005, “Annotating expressions of opinions and emotions in language”, in: Language Resources and Evaluation, vol. 39, issue 2-3, pp. 165-210.
Zagibalov, Taras, Belyatskaya, Katerina and Carroll, John 2010, “Comparable English-Russian book review corpora for sentiment analysis”, in: Proceedings of the 1st Workshop on Computational Approaches to Subjectivity and Sentiment Analysis. Lisboa, Portugal, pp. 67-72.

PART VI CORPORA AND MULTIPLE DOCUMENTS

CHAPTER NINE
MANUAL ALIGNMENT OF NEWS TEXTS AND THEIR MULTI-DOCUMENT HUMAN SUMMARIES
VERÔNICA AGOSTINI*†, RENATA TIRONI DE CAMARGO*‡, ARIANI DI FELIPPO*‡ AND THIAGO ALEXANDRE SALGUEIRO PARDO*†

* Núcleo Interinstitucional de Linguística Computacional (NILC).
† Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo (USP).
‡ Departamento de Letras, Universidade Federal de São Carlos (UFSCar).

1. Introduction

Generally speaking, alignment is the process of relating textual segments (e.g., words, sentences or paragraphs) from different texts. This task is used in many applications developed in the Natural Language Processing (NLP) area: translation, question answering, textual simplification, and summarization, among others. In most of the applications, the alignment (sometimes referred to as “indexing”) usually supports the acquisition of knowledge about the task/phenomenon under study in order to allow its linguistic description and/or automation. In machine translation, for instance, correspondences between the passages of a text in its original language and its version in one or more different languages are identified. The data resulting from this kind of alignment represent important linguistic resources, since they account for lexical equivalences and translation rules (see, e.g., the works of Gale and Church, 1991; Yamada and Knight, 2001; Caseli, 2003). In text simplification, a similar reasoning is possible: by aligning an original text to its simplified version, one may learn simplification rules for some intended application, such as making reading easier for people with aphasia (e.g., Specia, 2010). More daring approaches are also possible. In question answering, there are attempts to align questions to their answers in order to learn how to generate appropriate answers to new questions (e.g., Soricut and Brill, 2004).

In Automatic Summarization (AS), the focus of this paper, the alignments are established between summaries (usually abstracts) and one or more corresponding source texts (see, e.g., the works of Marcu, 1999; Hirao et al., 2004). In single document AS, the parts of a summary are aligned to their original passages in the source text. In multi-document AS (in which a single summary is built from a group of texts on the same topic), the process is more difficult because one passage may be aligned to several passages from different texts. Figure 1 illustrates the alignment of a sentence from a multi-document human summary to two sentences from source texts on the same topic, which contribute to the information in the sentence in the summary.

Sentence in the summary
O Brasil não fará parte do trajeto de 20 países do revezamento da tocha. [Brazil is not part of the path of 20 countries of the torch relay.]

Sentence in source text 1
A tocha passará por vinte países, mas o Brasil não estará no percurso olímpico. [The torch will pass through twenty countries, but Brazil will not be on the Olympic journey.]

Sentence in source text 2
O Brasil não faz parte do trajeto da tocha olímpica. [Brazil is not part of the path of the Olympic torch.]

Fig. 1: Example of sentential alignment in AS
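The alignment in Figure 1 can be pictured as a small data structure linking one summary sentence to the source-text sentences that support it. The sketch below is illustrative only: the class names, field names and sentence identifiers are assumptions and are not part of the CSTNews annotation.

```python
from dataclasses import dataclass, field


@dataclass
class SentenceRef:
    """Hypothetical pointer to a sentence: which source text, which sentence."""
    text_id: int
    sentence_id: int


@dataclass
class SummaryAlignment:
    """One summary sentence and the source-text sentences aligned to it."""
    summary_sentence_id: int
    sources: list[SentenceRef] = field(default_factory=list)

    def alignment_type(self) -> str:
        # A summary sentence aligned to N source sentences is a 1-N alignment.
        return f"1-{len(self.sources)}"


# Figure 1: one summary sentence aligned to one sentence in each of two texts.
# The identifiers below are made up for the example.
fig1 = SummaryAlignment(
    summary_sentence_id=1,
    sources=[SentenceRef(text_id=1, sentence_id=1), SentenceRef(text_id=2, sentence_id=1)],
)
print(fig1.alignment_type())  # prints 1-2
```

Keeping the list of sources on the summary side mirrors the direction of the task described here, in which each summary sentence is traced back to its origins.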

In the example presented in Figure 1, the alignment is said to be a 1-2 alignment, because one summary sentence was aligned to two different sentences in the source texts. More generally, a sentence in the summary may be aligned to N sentences in the source texts, characterizing a 1-N alignment (where N equals 2 in the example above). Specifically in AS, the alignment of human summaries and source texts reveals the origin of the information that makes up the summary, which allows the investigation of its characteristics. Thus, it is possible to obtain human summarization strategies that may support AS, making it more linguistically motivated. Such strategies may be obtained by manual analysis of the alignments or by machine learning algorithms (i.e., procedures that provide computers with the ability to learn from data without being explicitly programmed, looking in the data of interest for patterns of occurrence of the phenomenon being studied). For instance, one may discover that sentences in certain positions in the texts are preferred over other sentences (in news texts, this is true for the first sentences, since they express the most important information of the main event), that some news agencies are more represented in the summaries (also true for some mainstream agencies), and that information from the texts is generalized in the summaries, for example, by using hypernyms.

In this scenario, with the above motivations, this paper presents and discusses the work on sentence alignment between multi-document human summaries and their source texts in a corpus of news texts written in Portuguese. The corpus, called CSTNews (Cardoso et al., 2011), is the only corpus available for multi-document AS purposes for Portuguese and has been the basis for the work in the area for this language. To the best of our knowledge, the summary-text alignment presented for Portuguese is also original, since no related initiative is known for this language. From a theoretical perspective, the contributions of this research are the definition of a systematic procedure for summary-text alignment and the typification of such alignment in news texts, as well as the characterization of the information in the summaries in relation to the available sources. From a practical perspective, this work may support new linguistically-enriched methods for AS.

In Section 2, we present some related work. Section 3 introduces the corpus that was used for this study. In Section 4, we present the alignment task and its results. Section 5 concludes this chapter.

2. Related work

A range of authors have tried to perform alignments between texts on the same topic; however, each of them has used different criteria to connect the text passages. Marcu (1999) and Jing and McKeown (1999), for instance, relied on intuition to accomplish the task in their studies, favoring the meaning of the passages to decide what to align. Hatzivassiloglou et al. (1999), in turn, defined two passages as similar if they shared the same focus on a common concept, actor, object, or action. Unlike the previous authors, Barzilay and Elhadad (2003) defined two sentences as aligned if they contained at least one clause that expressed the same information.

Specifically in relation to AS, the first efforts date back to the late 1990s, when just a few studies tried to explicitly align summaries and texts. For single document AS, Marcu (1999) reported that 14 judges randomly selected 10 source texts from the Ziff-Davis corpus1 and aligned them to their corresponding single document human summaries, considering both clauses and sentences as basic text passages. The judges also had to annotate the degree of overlap between the passages considering one of 5 possibilities: (i) perfect match; (ii) perfect match, but the summary passage presents more information; (iii) perfect match, but the source text passage presents more information; (iv) none of the former options could be applied to the alignment, although the pair of passages shared some meaning; and (v) the passage was inferred by the summary author. The author measured the agreement among the annotators using the traditional kappa statistic, which is generally thought to be a more robust measure than a simple percent agreement calculation, since κ takes into account the agreement occurring by chance (Carletta, 1996)2. Although it highly depends on the task that is under evaluation, a kappa value of 0.6 is usually accepted as the minimum value for which the annotation may be considered reliable. Marcu (1999) evaluated both sentential and clausal level agreement, resulting in the kappa values of 0.52 and 0.48, respectively. In the work by Barzilay and Elhadad (2003), two annotators aligned pairs of texts and their summaries. In the authors’ guidelines, they defined two sentences as aligned if they contained at least one clause that expressed the same information. According to the authors, there was agreement for most of the cases and, when there was disagreement, the case was decided by a third judge. The agreement between the annotators was not computed.
1 A collection of newspaper articles that advertise products related to computers, in English.
2 If the annotators are in complete agreement then κ = 1. If there is no agreement among the annotators other than what would be expected by chance (as defined by Pr(e)), κ = 0.

Two other studies, Daumé III and Marcu (2004, 2005), also performed the alignment between texts and their single document summaries. In these, two annotators manually aligned 45 pairs of source texts and their summaries taken from the Ziff-Davis corpus, which is the same corpus used in the work by Marcu (1999) and Jing and McKeown (1999). As a training phase, five of the 45 pairs were annotated individually and discussed later, in order to better understand the studied phenomenon and to achieve
agreement in the task; the other 40 pairs were annotated independently, without consultation between the annotators, producing the final annotation. The authors considered phrases instead of sentences, and the judges had to classify each alignment into one of two categories: (i) possible or (ii) sure. Moreover, they used the kappa measure to calculate the agreement among the annotators and the result was 0.63. Carletta (1996) suggests that kappa values over 0.80 reflect very strong agreement and that kappa values between 0.60 and 0.80 reflect good agreement.

Regarding multi-document AS, we may highlight the work by Hirao et al. (2004), for which the authors used the TSC3 corpus (Okumura et al., 2003). For each cluster, i.e., a collection of texts written about the same topic, three judges created three long and three short abstracts for both single document and multi-document clusters. Then, the summary sentences were aligned to the source text sentences. The authors observed that the alignment for multi-document summaries was a more complex phenomenon, since the sentences from these summaries might result from compaction (by compressing passages from the source texts), combination and integration of other sentences.

As we can see, some studies have assigned types to the alignments. Specifically, typification consists of classifying the alignments according to some criteria; labels or tags are commonly used to express/codify the types of the alignments (see, e.g., Marcu, 1999; Daumé III and Marcu, 2004, 2005). Regarding typification, we may also mention the work carried out by Clough et al. (2002), in which the authors tried to figure out the types of rewriting operations that might occur between different versions of a news text. At the text level, trained journalists were required to assign one of the following three labels: (i) wholly-derived, (ii) partially-derived and (iii) non-derived.
At the word or sentence level, they were to assign one of the three labels: (i) verbatim, (ii) rewrite and (iii) new. In addition, some authors tried to categorize the summary production operations, which were directly related to the types of alignments mentioned before. For instance, Jing and McKeown (1999) defined six operations, namely: (i) sentence reduction, (ii) sentence combination, (iii) syntactic transformation, (iv) lexical paraphrasing, (v) generalization/specification and (vi) reordering. Hasler (2007), in turn, proposed five summary production operations: (i) deletion, (ii) insertion, (iii) replacement, (iv) reordering and (v) merging of passages. As a matter of fact, both studies examined linguistic operations used by a human to transform extracts into abstracts in order to improve their coherence.

3 The TSC (Text Summarization Challenge) corpus contains data related to single (30 documents) and multi-document AS (224 documents in 30 clusters).
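Several of the studies reviewed in this section report agreement as a kappa value, which corrects the observed agreement Pr(a) for the agreement expected by chance Pr(e): κ = (Pr(a) − Pr(e)) / (1 − Pr(e)). The sketch below computes the two-annotator (Cohen) form of the statistic on invented labels. It is an illustration under stated assumptions: the cited studies may have used a multi-annotator variant, and the function name and toy data are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Pr(a): observed proportion of items on which the annotators agree.
    pr_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Pr(e): agreement expected by chance, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    pr_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if pr_e == 1.0:
        # Kappa is undefined here; returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (pr_a - pr_e) / (1 - pr_e)


# Invented decisions of two annotators about six candidate sentence pairs.
annotator_a = ["aligned", "aligned", "not", "not", "aligned", "not"]
annotator_b = ["aligned", "not", "not", "not", "aligned", "not"]
print(round(cohen_kappa(annotator_a, annotator_b), 2))  # prints 0.67
```

In this toy case the annotators agree on five of six items (Pr(a) ≈ 0.83) while chance agreement is 0.5, which gives κ ≈ 0.67, within the 0.60 to 0.80 range that Carletta (1996) describes as good agreement.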

3. The CSTNews corpus

The corpus we used for this study is CSTNews (Cardoso et al., 2011), which has 50 clusters of 2 or 3 news texts and their human (manual) and automatic single and multi-document summaries. The corpus includes other annotations as well, such as the identification of temporal expressions, informative aspects, topics and subtopics, single and multi-document discourse structures, and word senses for nouns, among others, making it a rich corpus for AS and deep NLP in general. Each cluster in the corpus has, on average, 42 sentences (10 to 89 sentences) and each manual multi-document summary has, on average, 7 sentences (3 to 7). The texts of each cluster are on the same topic, as is usual in multi-document AS. The news texts, manually gathered from the online newspapers Folha de São Paulo, Gazeta do Povo, Estadão, Jornal do Brasil and O Globo, are divided into six sections which, together with the number of texts in each category, can be seen in Figure 2. The clusters were named according to the sections of the newspapers.

Fig. 2: Distribution of sections and their texts in the corpus

Regarding multi-document human summaries, we emphasize that they were built manually and in an abstractive way, i.e., allowing rewriting operations of passages extracted from the texts. Accordingly, the identification of the origin of the information that constitutes a summary is not always clear in the texts. In addition, the summary production was guided by a compression rate of 70%. Thus, the summaries contain 30%
of the number of words of the longest text in each cluster. Figure 3 shows an example of a human summary from the CSTNews corpus.

Maradona voltou a ter problemas de saúde no fim de semana e foi internado novamente em um hospital em Buenos Aires. Ele teve uma recaída da hepatite aguda. Ele melhorou e está estável, mas continuará internado. Maradona desenvolveu hepatite por excesso de álcool, mas, nesta recaída, ele não estava bebendo e a causa ainda é indeterminada, segundo seu médico.
[Maradona again had some health problems on the weekend and he was again admitted to a hospital in Buenos Aires. He had a relapse of acute hepatitis. He has improved and is stable but will remain hospitalized. Maradona developed hepatitis from alcohol abuse, but in this relapse, he was not drinking and the cause is still undetermined, according to his doctor.]

Fig. 3: Example of a human document summary
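The 70% compression rate described above fixes the size of each summary relative to the longest text of its cluster. A minimal sketch of that arithmetic follows; the function name is hypothetical and whitespace tokenisation is an assumption, since the way words were counted for CSTNews is not specified here.

```python
def summary_word_budget(cluster_texts, compression_rate=0.70):
    """Approximate word budget for the summary of one cluster."""
    # Whitespace tokenisation is an assumption made for this sketch.
    longest = max(len(text.split()) for text in cluster_texts)
    # A compression rate of 70% leaves 30% of the longest text's words.
    return round(longest * (1 - compression_rate))


# Toy cluster: two placeholder texts of 200 and 120 words.
cluster = [" ".join(["palavra"] * 200), " ".join(["palavra"] * 120)]
print(summary_word_budget(cluster))  # prints 60
```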

Since the CSTNews corpus is widely used in multi-document AS for Portuguese, we selected it as the basis of our research. In this case, we used its source texts and its multi-document human summaries in order to perform the alignment between the sentences.

4. The manual alignment

In what follows we characterize the task of alignment, present the alignment rules that were developed and followed during this work, report our main analysis results, and end by discussing the types of alignments that we were able to identify.

4.1. Characterization

The alignment task was performed by two computational linguists over approximately two months, in daily meetings of about two hours. Overall, each researcher was in charge of aligning half of the CSTNews clusters. The annotators did not use any specialized annotation tool for the task. Instead, they adopted simple text editing tools of their preference to perform the alignments, which were later automatically mapped into XML format. To the best of our knowledge, there are no tools to align multiple texts to a single target (the summary) in the way we needed. However, for regular parallel text alignment, such tools do exist, e.g., TagAlign (Caseli et al., 2002) and Yawat (Germann, 2008). The alignment was performed according to two main guidelines: (i) the type of text passages to be aligned and (ii) the criterion for the
identification of correspondences. Regarding the type of text passages, we opted for the sentence since it is a well-defined unit of information. The choice of sentences as information units for alignments may imply losing some more refined information about which sentence parts (clauses, ngrams, etc.) were aligned. However, it simplifies the task and makes it more manageable, allowing a more systematic annotation. Concerning the alignment criterion, it should be pointed out that the correspondences among the summaries and their source texts were identified based solely on content overlap of the whole sentences or their parts. By pursuing content-oriented alignments, the indexing process does not rely only on the overlap of word forms, i.e., the lexical units themselves. Accordingly, sentences which contain information in common but with low word overlap are also aligned. Next we show some examples of alignments. In the example in Figure 4, one can see that the summary sentence is entirely contained in the sentence in the text (shown in bold). This may be easily identified for establishing an alignment. Sentence in the summary O motim começou durante a festa do Dia das Crianças. [The riot began at the Children’s day party.]

Sentence in the text O motim começou durante a festa do Dia das Crianças, realizada na terça-feira (16). [The riot began at the Children’s day party, on Tuesday]

Fig. 4: Example of alignment based on content overlap
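Figure 4 shows the easy case, in which every word of the summary sentence reappears in the text sentence. One rough way to picture this kind of surface overlap is the fraction of distinct summary words that also occur in the text sentence. The sketch below is an illustration only: the function names and the regular-expression tokenisation are assumptions, and the annotators worked manually on content rather than with any such score.

```python
import re


def tokens(sentence):
    """Lowercased word tokens; a crude stand-in for proper tokenisation."""
    return re.findall(r"\w+", sentence.lower())


def summary_coverage(summary_sentence, text_sentence):
    """Fraction of the summary sentence's distinct words found in the text sentence."""
    summary_words = set(tokens(summary_sentence))
    if not summary_words:
        return 0.0
    return len(summary_words & set(tokens(text_sentence))) / len(summary_words)


# The sentence pair from Figure 4.
ss = "O motim começou durante a festa do Dia das Crianças."
ts = "O motim começou durante a festa do Dia das Crianças, realizada na terça-feira (16)."
print(summary_coverage(ss, ts))  # prints 1.0
```

A low score on such a measure does not rule out an alignment, since the criterion adopted in this work is content overlap, which can hold even when few words are shared.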

In the example in Figure 5, one can observe that the summary sentence and the text sentence show partial content overlap, so that it is much harder to identify the alignment. In this case, the summary segment “preparing for the hurricane” expresses more general information than the text segment “stored food, water, flashlights and candles”, since “storage” may be interpreted as a “kind of preparation” for the arrival of the hurricane. Besides, the reader has to recognize that “the regions” refers to parts of Jamaica that will probably be hit by the hurricane. Overlapping content could not be identified based exclusively on lexical units, because the two sentences have no content words (nouns, verbs, adjectives and adverbs) in common. Considering the general guidelines, we engaged in a training phase that occurred before the major alignment task. For that purpose, two clusters were randomly selected and individually aligned by each annotator. Next, the alignments were compared and the cases of disagreement were discussed in order to adjust the alignment criteria for the computational linguists. After this training, we formulated some general and some
specific rules, which were refined throughout the process. At the end, we compiled an alignment handbook, whose rules are presented in what follows.

Sentence in the summary
Vários moradores e turistas nas regiões, inclusive brasileiros, foram retirados dos locais, enquanto outros estão se preparando para a passagem do furacão. [Many residents and tourists in the regions, including Brazilians, were evacuated from the places, while others are preparing themselves for the hurricane.]

Sentence in the text Na Jamaica, muitos estocaram alimentos, água, lanternas e velas. [In Jamaica, many people stored food, water, flashlights and candles.]

Fig. 5: Example of alignment based on content overlap
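Section 4.1 mentions that the alignments made in plain text editors were later mapped automatically into XML, but the schema is not described. The following is only a sketch of what such a mapping could look like. The element and attribute names (`cluster`, `summary_sentence`, `aligned_to`, `source_text`, `sentence`), the input dictionary and the identifiers are all assumptions made for illustration, not the format actually used for CSTNews.

```python
import xml.etree.ElementTree as ET


def alignments_to_xml(cluster_id, alignments):
    """Serialise {summary_sentence_id: [(text_id, sentence_id), ...]} to XML.

    Element and attribute names are invented for illustration only.
    """
    root = ET.Element("cluster", id=str(cluster_id))
    for summary_id, sources in sorted(alignments.items()):
        node = ET.SubElement(root, "summary_sentence", id=str(summary_id))
        for text_id, sentence_id in sources:
            ET.SubElement(
                node, "aligned_to", source_text=str(text_id), sentence=str(sentence_id)
            )
    return ET.tostring(root, encoding="unicode")


# Hypothetical cluster "C1": summary sentence 1 aligned to one sentence in each of two texts.
xml_text = alignments_to_xml("C1", {1: [(1, 3), (2, 1)]})
print(xml_text)
```

Grouping the `aligned_to` elements under each summary sentence keeps a 1-N alignment together in one place, which is convenient when counting alignment types afterwards.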

4.2. The alignment rules

Throughout the alignment process, we created eight rules: four general and four specific rules. Next, we present these rules and exemplify them by using actual CSTNews alignments. We also highlight the linguistic criteria that supported the formulation of the rules. Hereafter, the text sentences are referred to by the abbreviation TS and the summary sentences by the abbreviation SS.

4.2.1. General rules

Here, we present each general rule, some actual examples and the explanation of each case.

RULE 1: Align based on content overlap

This rule states that the alignment is performed based on content overlap between a summary sentence (SS) and one or more text sentences (TSs) and not due to the occurrence of common lexical items or similar syntactic structures. Accordingly, sentences that convey content in common by different linguistic expressions should be aligned. Furthermore, it is noteworthy that the content in common is not always directly identified, since some inferences may be required. In (1) and (2), we present some alignment examples based on Rule 1. In (1), the sentences share the same main contents, i.e., “the number of dead people in the plane crash”, expressed with different words. In (2), the alignment is due to inference, identified by the expression “give up” in the
SS and the “resignation” of reported congressmen in TS, besides other content overlappings. (1) SS: 17 pessoas morreram após a queda de um avião na República Democrática do Congo. [17 people died after a plane crash in the Democratic Republic of Congo.] TS: Um acidente aéreo na localidade de Bukavu, no leste da República Democrática do Congo (RDC), matou 17 pessoas na quinta-feira à tarde, informou nesta sexta-feira um porta-voz das Nações Unidas. [A plane crash in the town of Bukavu, in the east of the Democratic Republic of Congo (DRC), killed 17 people on Thursday afternoon, said a United Nations spokesman on Friday.] (2) SS: A expectativa de lideranças da Câmara e do Conselho de Ética é que pouco mais de 10% dos 69 deputados denunciados no relatório parcial da CPI abrirão mão de seus mandatos. [The expectation of leaders from the House of Representatives and the Ethics Committee is that just over 10% of the 69 reported congressmen in the CPI partial report will give up their mandates.] TS: As renúncias têm que ser publicadas até terça-feira, quando o presidente do Conselho de Ética, deputado Ricardo Izar (PTB-SP), vai instaurar os processos de perda de mandato contra os 69 deputados acusados pela CPI dos Sanguessugas de envolvimento com a máfia das ambulâncias. [The resignations must be published by Tuesday, when the chairman of the Ethics Committee, Congressman Ricardo Izar (PTB-SP), will initiate the impeachment process against the 69 congressmen accused by the “CPI of Sanguessugas” of being involved with the ambulance mafia.]

RULE 2: Align based on main information overlap This rule states that the alignment is performed based on the main content conveyed by the sentences. Thus, a SS is aligned to some TSs when there is a central idea overlap, expressed by the main verb. In (3), we present an example in which the sentences were aligned according to Rule 2. Thus, the verbs “assert” and “say” have the same meaning in the example, since the subjects in both sentences (the president and Lula) are the same person. (3) SS: O presidente também afirmou que o critério para os municípios e estados contemplados com obras é técnico. [The president also asserted that the criterion for the cities and the states which were awarded with construction works is technical.] TS: Lula disse que o critério para o investimento nas cidades será técnico, não partidário. [Lula said that the criterion for investment in the cities will be technical, and not related to political parties.]

In (4), we show an example in which the sentences were not aligned according to Rule 2, despite the subjects overlap. The main verbs “find out” and “report”, which express the main information of each sentence, are not similar. (4) SS: Usando telescópios do Observatório Europeu Sul (ESO), Ray Jayawardhana, da Universidade de Toronto, e Valentin D. Ivanov, do ESO, descobriram um planemo com sete vezes a massa de Júpiter, o planeta mais pesado do Sistema Solar, e outro com o dobro desse peso, que giram um ao redor do outro, denominado Oph 162225-240515, o primeiro planemo duplo. [Using telescopes from the Southern European Observatory (SEO), Ray Jayawardhana, from the University of Toronto, and Valentin D. Ivanov, from the SEO , discovered a planemo with seven times the mass of Jupiter, the Solar System’s heaviest planet, and another one with double this weight, which rotate around each other, called Oph 162225-240515, the first double planemo.] TS: Os pesquisadores Ray Jayawardhana e Valentin D. Ivanov informam a descoberta na edição de quinta-feira do serviço online Science Express, mantido pela revista Science. [The researchers Ray Jayawardhana and Valentin D. Ivanov report the discovery in the Thursday edition of Science Express online service, maintained by the Science journal.]

RULE 3: Align based on secondary information overlap This rule specifies that the pair of sentences must be aligned based on content overlap of secondary information. Thus, a SS must be aligned to one or more TSs not only for the main content, but also when they share peripheral information. In (5) and (6), we present two alignments that illustrate the application of Rule 3. In (5), for instance, the SS and the TS were aligned due to the sharing of secondary information expressed by the passages “rotate around each other” and “rotate around one another”, despite not having main content overlap. In (6), the alignment is due to the fact that the SS and the TS share the cause (“payment of personal expenses”) of the main event conveyed by SS (“Renan is target of a lawsuit for breach of decorum”). (5) SS: Usando telescópios do Observatório Europeu Sul (ESO), Ray Jayawardhana, da Universidade de Toronto, e Valentin D. Ivanov, do ESO, descobriram um planemo com sete vezes a massa de Júpiter, o planeta mais pesado do Sistema Solar, e outro com o dobro desse peso, que giram um ao redor do outro, denominado Oph 162225-240515, o primeiro planemo duplo. [Using telescopes from the Southern European Observatory (SEO), Ray Jayawardhana, from the University of Toronto, and Valentin D. Ivanov, from

the SEO , discovered a planemo with seven times the mass of Jupiter, the Solar System’s heaviest planet, and another one with double this weight, which rotate around each other, called Oph 162225-240515, the first double planemo.] TS: Ambos os mundos têm massa semelhante à de outros exoplanetas já catalogados, mas não giram em torno de uma estrela – na verdade, giram em torno um do outro. [Both worlds have a mass similar to other already cataloged exoplanets but do not revolve around a star – in fact, they rotate around one another.] (6) SS: Renan é alvo de um processo por quebra de decoro acusado de receber recursos da construtora Mendes Junior para pagamento de despesas pessoais, como aluguel e pensão para a jornalista Mônica Veloso, com quem tem uma filha. [Renan is the target of a lawsuit for breach of decorum accused of receiving funds from the Mendes Junior construction company to pay personal expenses such as rent and allowance to journalist Monica Veloso, with whom he has a daughter.] TS: Isso permitiria que os peritos da Polícia Federal pudessem trabalhar durante o período de descanso dos senadores e, no retorno das férias, apresentarem um relatório detalhado sobre o conjunto de documentos – notas fiscais, recibos de vacinação, extratos bancários, guias de transporte de animais – que o senador apresentou para justificar o pagamento da pensão informal à jornalista Mônica Veloso. [This would allow Federal Police experts to work during the period when senators rest, and upon returning from their vacation, they could submit a detailed report on the set of documents – invoices, vaccination receipts, bank statements, animal transport guides – that the Senator presented to justify the payment of an informal allowance to journalist Monica Veloso.]

RULE 4: Align whenever it is possible to align

This rule states that a SS is aligned when a TS with content overlap is identified, even if the SS has already been aligned before due to sharing the same content. In (7), we illustrate the application of Rule 4. In this case, the SS, which was already aligned to a text sentence in TS1 due to sharing secondary information (according to (4)), is aligned again to TS2, since content overlap was identified once more.

(7) SS: Usando telescópios do Observatório Europeu Sul (ESO), Ray Jayawardhana, da Universidade de Toronto, e Valentin D. Ivanov, do ESO, descobriram um planemo com sete vezes a massa de Júpiter, o planeta mais pesado do Sistema Solar, e outro com o dobro desse peso, que giram um ao redor do outro, denominado Oph 162225-240515, o primeiro planemo duplo.
[Using telescopes from the Southern European Observatory (SEO), Ray Jayawardhana, from the University of Toronto, and Valentin D. Ivanov, from the SEO, discovered a planemo with seven times the mass of Jupiter, the Solar System’s heaviest planet, and another one with double this weight, which rotate around each other, called Oph 162225-240515, the first double planemo.]
TS1: Astrônomos do Observatório Europeu Austral, localizado no Chile, anunciaram a descoberta de uma dupla de planetas errantes (sem estrela-mãe) que giram ao redor deles mesmos e que vagam livremente pelo espaço.
[Astronomers from the European Southern Observatory, located in Chile, announced the discovery of a pair of wandering planets (without a parent star) that rotate around themselves and wander freely through space.]
TS2: O fato extraordinário é que ele não gira em volta de uma estrela, mas em torno de outro corpo frio com o dobro de sua massa.
[The extraordinary fact is that it does not rotate around a star, but around another cold body with twice its mass.]

4.2.2. Specific rules

The set of specific rules consists of four instructions, which have been prepared based on particular cases of content overlap. One of them, in particular, specifies when not to align two sentences.

RULE 5: Align even when there is contradictory numerical data

This rule states that a SS should be aligned with one or more TSs due to an overlap of the same central idea even when there is contradictory numerical data, which may be related, for instance, to the time of occurrence of a particular event. In (8), the example depicts an alignment performed based on Rule 5. Specifically, the SS and TS share the main content, i.e., both mention the fact that “the city of São Paulo presents points of flooding,” but they show contradictory information about the time when this was observed. In situations with such contradictions, sentences are still aligned.

(8) SS: Às 9h, a cidade tinha oito pontos de alagamento, sendo dois intransitáveis.
[At 9am, the city had eight points of flooding, two impassable.]
TS: O CGE (Centro de Gerenciamento de Emergências) da Prefeitura de São Paulo registrava oito pontos de alagamento na cidade, às 9h30 desta segunda-feira.
[The CGE (Center of Emergency Management) from São Paulo’s city council reported eight points of flooding in the city at 9:30 am on Monday.]

RULE 6: Align even when there are different levels of generalization

Rule 6 states that a SS should be aligned with one or more TSs due to an overlap of the same central idea even when the information is displayed with different levels of generalization. In (9), one can observe that the alignment occurs because the SS and the TS share information, i.e., “the traffic jam rate (in São Paulo) above average”, despite the fact that the SS specifies this information by registering (i) the exact extent of the traffic jam in kilometers and (ii) the exact times of these occurrences. In (10), SS1 and SS2, from the same summary, were aligned to the TS based on Rule 6, since the SSs present more specific information than the TS. In this case, both summary sentences specify in percentages the “intensification of the surveillance”, the main content of the TS. Regarding example (11), we point out that the SS contains the more general information, whereas the TSs contain the more specific information.

(9) SS: A Companhia de Engenharia de Tráfego (CET) anunciou que o índice de congestionamento era de 54 quilômetros às 8h, 113 km às 9h e 110 km meia hora depois, valores bem acima das médias para os horários, que eram de 36, 82 e 76 quilômetros respectivamente, mas não havia registro de acidentes graves, apesar de haver feridos.
[The Traffic Engineering Company (TEC) announced that the traffic jam rate was 54 km at 8am, 113 km at 9am and 110 km half an hour later, values far above the averages for those times, which were 36, 82 and 76 km respectively, but there was no information about serious accidents, although there were wounded people.]
TS: Com o asfalto molhado, o trânsito ficou mais lento e o congestionamento ficou o dobro da média.
[With the wet pavement, the traffic slowed down and the traffic jam was double the average.]

(10) SS1: O balanço divulgado mostra que as autuações cresceram 316,5% nos sete primeiros meses deste ano e chegaram a R$ 1,339 bilhão.
[The published balance shows that the fines grew 316.5% in the first seven months of this year and reached R$ 1.339 billion.]
SS2: Foram autuados 208.471 contribuintes, um crescimento de 104,47% em relação ao mesmo período do ano passado.
[208,471 taxpayers were fined, an increase of 104.47% over the same period of last year.]
TS: BRASÍLIA – A Receita Federal intensificou a fiscalização e o resultado foi um aumento do número de contribuintes que caíram na malha fina.
[BRASILIA – The IRS has intensified audits and the result was an increase in the number of taxpayers who have fallen into the “malha fina”.]

(11) SS: A Receita Federal intensificou a fiscalização sobre as declarações das pessoas físicas neste ano.
[The IRS has intensified audits of statements of individual taxpayers this year.]
TS1: Balanço da fiscalização, divulgado nesta segunda-feira pela Receita mostra que as autuações cresceram 316,5% nos sete primeiros meses deste ano e chegaram a R$ 1,339 bilhão.
[The balance of audits, published this Monday by the IRS shows that the fines grew 316.5% in the first seven months of this year and reached R$ 1.339 billion.]
TS2: O volume de recursos recolhido com multas passou de R$ 326,1 milhões para R$ 1,339 bilhão.
[The volume of funds collected through fines increased from R$ 326.1 million to R$ 1.339 billion.]

RULE 7: Align even when there are different levels of assertiveness

Rule 7 states that a SS should be aligned with one or more TSs due to an overlap of the same central idea even when they present different levels of the speaker’s assertiveness regarding the main information being conveyed. In example (12), the sentences were aligned due to the overlap of central information (in this case, “the authorship of the criminal actions”), even though the SS presents a higher level of assertiveness about the main fact than the TS. In the TS, the lower level of assertiveness is signaled by the modal verb “may” (“they may have been ordered”).

(12) SS: As ações são atribuídas à facção criminosa Primeiro Comando da Capital (PCC), que já comandou outros ataques em duas ocasiões.
[The actions are assigned to the criminal gang Capital’s First Command (CFC), which has led attacks on two other occasions.]
TS: As ações criminosas podem ter sido ordenadas pelos líderes do Primeiro Comando da Capital (PCC), que haviam prometido retomar os ataques no Estado de São Paulo no Dia dos Pais, no próximo domingo.
[The criminal actions may have been ordered by the leaders of the Capital’s First Command (CFC), who had promised to resume attacks in São Paulo on Father’s Day, next Sunday.]

RULE 8: Do not align when one segment expresses the “whole” and the other “a part” in a whole-part relation between the segments

This rule states that one SS and one or more TSs should not be aligned if there is any intensity or quantity difference related to the main information shared by them. In (13), we illustrate a case in which the sentences were not aligned based on Rule 8. The SS and the TS present similar main information, i.e., “senator hospitalization”. Nevertheless, the SS presents an adverbial phrase that indicates the number of times that the main fact occurred, i.e., “three times,” while the TS presents the adverbial phrase “in April,” which indicates the date of one “hospitalization”. Thus, the SS expresses the repetition of the same event (a sequence), while the TS describes a single occurrence of the event, so the sentences are not aligned because they convey different information.

(13) SS: Somente neste ano, o senador se internou por três vezes no Incor.
[Only this year, the senator was hospitalized three times at Incor.]
TS: Em abril, o senador foi internado no InCor com insuficiência cardíaca.
[In April, the senator was hospitalized with heart failure at InCor.]

Overall, the rules used in this work have much in common with the guidelines used by previous authors. For instance, the authors who describe alignments in the summarization area take content overlap into account, i.e., they do not consider only lexical correspondences. However, we may assume that the rules we created for this work are less specific than those established by Hatzivassiloglou et al. (1999), for instance. Besides, the possibility of aligning one sentence to several others (1-N) may also be compared to previous work (see, e.g., Barzilay and Elhadad, 2003).

4.3. Alignment analysis

The results from the CSTNews alignment encompass the quantification of the different types of alignments and the exemplification of some cases. We obtained 1007 alignments in the 50 clusters. Their types may be seen in Table 1. Approximately 78% of the summary sentences were aligned to more than one sentence of the source texts. This result was expected, since it was possible for a multi-document summary to be connected to 2 or 3 related source texts of its cluster.

Table 1: Alignment types in the corpus

Alignment type    Number of alignments
1-0               2
1-1               71
1-2               91
1-3               72
1-4               33
1-5               37
1-6               13
1-7               6
1-8               6
1-9               1
1-10              1
1-11              2
1-12              1
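The figures quoted in the text can be cross-checked against Table 1. The short sketch below copies the counts from the table and recomputes the number of summary sentences, the total number of alignments and the share of summary sentences aligned to more than one text sentence. The variable names are illustrative only, and the sketch assumes that each table entry counts summary sentences of the given 1-N type.

```python
# Counts copied from Table 1: key = number of text sentences (N) aligned to
# one summary sentence, value = number of summary sentences of that 1-N type.
table1 = {0: 2, 1: 71, 2: 91, 3: 72, 4: 33, 5: 37, 6: 13,
          7: 6, 8: 6, 9: 1, 10: 1, 11: 2, 12: 1}

# Every summary sentence appears exactly once in the table.
summary_sentences = sum(table1.values())

# A 1-N entry contributes N sentence-to-sentence alignments.
total_alignments = sum(n * count for n, count in table1.items())

# Summary sentences aligned to more than one text sentence (N > 1).
multi_aligned = sum(count for n, count in table1.items() if n > 1)
share_multi = 100 * multi_aligned / summary_sentences

print(summary_sentences)       # 336
print(total_alignments)        # 1007
print(round(share_multi, 1))   # 78.3
```

Under this reading, the table reproduces the 1007 alignments and the "approximately 78%" reported in this section.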

Based on Table 1, one can observe two extreme cases: (i) summary sentences which were not aligned (1-0) and (ii) summary sentences which were aligned to a large number of text sentences (1-10, 1-11 or 1-12). The two cases of non-alignment are explained by the fact that both sentences contain information that is not present in the text sentences and was, therefore, inferred by the summarizers. One case of non-alignment occurred in a cluster that belongs to the “sports” category, whose texts deal with “games of the Brazilian soccer and volleyball teams during the Pan-American tournament in 2007”. In the human multi-document summary, the sentence “On Sunday, the Brazilian sport cheered up the Brazilian fans” was not aligned to any sentence of the three source texts because the information is not explicit, i.e., it was inferred by the human summarizer.

There is a single case of the 1-12 alignment type, which means that 1 summary sentence was aligned to 12 text sentences. This occurred in a cluster that also belongs to the “sports” category and comprises 3 news texts about “the Brazilian soccer team beating the Ecuador soccer team in qualifying games in World Cup-2010”. In Figure 6, one may observe an excerpt of this case, in which the summary sentence “The game featured beautiful performances by good players like Ronaldinho and Kaká” was aligned to 12 different text sentences.

Sentence in the summary:
O jogo contou com belas atuações de craques como Ronaldinho e Kaká.
[The game featured beautiful performances by good players like Ronaldinho and Kaká.]

Sentences in the texts:
Kaká acertou um belíssimo chute de longe no ângulo aos 31 e fez 3 a 0.
[Kaká buried a stunning long-range kick at 31 minutes to make it 3-0.]
Kaká fez excelente jogada na direita e virou o jogo para Robinho na esquerda.
[Kaká made a brilliant move from the right and crossed to Robinho.]
Cinco minutos depois, aos 31, Kaká fez o gol mais bonito da partida.
[Five minutes later, at 31 minutes, Kaká scored the most beautiful goal of the match.]
(…)

Fig. 6: Excerpt of a 1-12 alignment

These alignments were performed based on Rule 6 because the SS and the 12 TSs share the same central idea with different degrees of generalization, i.e., all the 12 TSs provide details about the general information in the SS.

Of the 2067 sentences in the source texts, 877 (42.43%) were aligned to some summary sentence, but this does not mean that each of them was aligned only once. A sentence of a summary may be aligned to more than one sentence of the source texts, and the sentences of the source texts may be redundant or even identical. Of the 336 summary sentences, 334 (99.4%) were aligned to some source sentence.

To ensure the reliability of the annotation, we computed the agreement between the annotators by using the kappa statistic (Carletta, 1996). The annotator agreement was computed once a week. To do so, the annotators individually aligned the same cluster and compared the results of their alignment to verify the agreement. Overall, five clusters were used in this task and the kappa result was 0.831 (the maximum value, 1, means perfect agreement). This value indicates that the task, although subjective, was well-defined and that the results are reliable. It is worth noting that this result is higher than those reported in the literature.

From the alignment results, we could draw some relevant statistical data. For instance, in Figure 7, one may observe that 56% of the aligned sentences are from Folha de São Paulo. In this case, we may say that, in order to produce multi-document summaries of the CSTNews corpus, the human summarizers relied mostly on texts from Folha de São Paulo. Humans probably preferred this source because it is a mainstream and very influential news agency.
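To make the agreement measure mentioned above concrete, the following is a minimal sketch of a kappa computation. It assumes Cohen's kappa for two annotators and treats each candidate SS-TS pair as a binary decision (aligned or not aligned). Both choices are assumptions for illustration, since the chapter does not state which kappa variant or which unit of decision was used. The toy labels are hypothetical and are not taken from CSTNews.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical toy data: 1 = pair aligned, 0 = pair not aligned.
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.8
```

In this toy case the annotators agree on 9 of 10 pairs (observed agreement 0.9) while chance agreement is 0.5, which gives a kappa of 0.8.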

Fig. 7: Percentage of alignments for each online newspaper

4.4. The typification of alignments

In addition to the alignment task, we also conducted the typification of the CSTNews alignments. This task, which is reported in more detail in Camargo et al. (2013), consisted of assigning types to all the alignments that we performed in this study. All the labels used to characterize the alignments can be seen in Figure 8. The alignment types were divided into two main categories, namely, form and content.

Regarding form, the alignments could be: (i) identical, when the aligned sentences were completely identical, (ii) partial, when the sentences had some, or several, words in common, or (iii) different, if they had few or no words in common.

Concerning the content, the pair of sentences could receive the following types: (i) specification, when the SS contained some specific information related to the original content of the aligned TS; (ii) generalization, when the SS generalized the content of the TS; (iii) contradiction, when SS and TS presented some contradictory information; (iv) inference, when the SS expressed information that was inferred from the corresponding TS; (v) neutral, when the SS contained some information that did not result from any special process on the TS content or when some unknown process was carried out, and (vi) other, when the annotators did not agree with the prior alignment types, considering that alignment is a subjective task.

Besides, the annotators also classified the alignments based on the occurrence of onomastic elements, i.e., proper nouns, divided into two types: (i) toponomastics, when names of places occurred in the aligned sentences, and (ii) anthroponomastics, when names of persons occurred in the aligned sentences.

Form types    Content types     Onomastics
Identical     Specification     Toponomastics
Partial       Generalization    Anthroponomastics
Different     Contradiction
              Inference
              Neutral
              Other

Fig. 8: Labels for the typification of the alignment
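The label scheme in Figure 8 can be sketched as a small data structure. The class and field names below are hypothetical and do not reflect the XML format actually used in the corpus. The sketch assumes one form label per alignment, one or more content labels, and optional onomastic labels, which is how the chapter describes the annotation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class Form(Enum):
    IDENTICAL = "identical"
    PARTIAL = "partial"
    DIFFERENT = "different"


class Content(Enum):
    SPECIFICATION = "specification"
    GENERALIZATION = "generalization"
    CONTRADICTION = "contradiction"
    INFERENCE = "inference"
    NEUTRAL = "neutral"
    OTHER = "other"


class Onomastics(Enum):
    TOPONOMASTICS = "toponomastics"
    ANTHROPONOMASTICS = "anthroponomastics"


@dataclass
class TypedAlignment:
    """Labels attached to one SS-TS alignment (hypothetical record layout)."""
    form: Form                       # exactly one form label
    content: Set[Content]            # one or more content labels
    onomastics: Set[Onomastics] = field(default_factory=set)

    def __post_init__(self):
        if not self.content:
            raise ValueError("an alignment needs at least one content label")


# The tag combination the chapter reports for example (14):
# partial, neutral, generalization and toponomastics.
example = TypedAlignment(
    form=Form.PARTIAL,
    content={Content.NEUTRAL, Content.GENERALIZATION},
    onomastics={Onomastics.TOPONOMASTICS},
)
print(sorted(label.value for label in example.content))
```

Keeping content labels in a set reflects the fact that a single alignment may carry several content tags at once.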

In example (14), one may see an example of a partial alignment, since the two sentences (TS and SS) have some words in common but are not identical. It is also possible to notice that there was a generalization from the TS to the SS: the Brazilian states “Amazonas, Distrito Federal, Mato Grosso, Acre and Rondônia” were simply expressed by “many states”. Furthermore, we also identified the presence of some names of places (states), which were labeled with the tag “toponomastics”. Therefore, the alignment in (14) received the tags: (i) partial, (ii) neutral, (iii) generalization and (iv) toponomastics.

(14) TS: A PF divulgou que mais de 300 policiais federais do Amazonas, Distrito Federal, Mato Grosso, Acre e Rondônia fazem parte das investigações da “Operação Dominó”.
[The FP reported that more than 300 federal police officers from Amazonas, Distrito Federal, Mato Grosso, Acre and Rondônia took part in the “Operação Dominó”.]
SS: Mais de 300 policiais federais de vários estados participaram das buscas e prisões durante a operação.
[More than 300 federal police officers from many states took part in the searches and arrests during the operation.]

As a result, we can observe in Table 2 the distribution of the alignment types and subtypes in the corpus. Out of a total of 1007 alignments, we identified 867 partial alignments (86%), 58 identical alignments (5.7%) and 82 different alignments (8.1%). Regarding the content, we identified 949 neutral alignments (94.2%), 37 contradiction alignments (3.6%), 82 generalization alignments (8.1%), 48 specification alignments (4.7%), 33 inference alignments (3.2%) and 6 other alignments (0.5%). We emphasize that, of the 867 partial alignments, 714 were classified as neutral (70.9%), without any other content tag. Considering the alignments that presented cases related to onomastics, we identified 4 toponomastics (0.3%) and 20 anthroponomastics (1.9%).

Table 2: Distribution of alignment types and subtypes in the corpus

Types         Subtypes             Occurrences in the corpus    Percentage
Form          Partial              867                          86%
              Identical            58                           5.7%
              Different            82                           8.1%
Content       Neutral              949                          94.2%
              Contradiction        37                           3.6%
              Generalization       82                           8.1%
              Specification        48                           4.7%
              Inference            33                           3.2%
              Other                6                            0.5%
Onomastics    Anthroponomastics    20                           1.9%
              Toponomastics        4                            0.3%
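As a consistency check on Table 2, the sketch below recomputes each percentage from the reported counts and the total of 1007 alignments. It assumes that the published percentages were cut to one decimal place rather than rounded. That is an inference from the figures, not something the chapter states. The dictionary and function names are illustrative.

```python
TOTAL_ALIGNMENTS = 1007

# Counts copied from Table 2.
form_counts = {"partial": 867, "identical": 58, "different": 82}
content_counts = {"neutral": 949, "contradiction": 37, "generalization": 82,
                  "specification": 48, "inference": 33, "other": 6}
onomastic_counts = {"anthroponomastics": 20, "toponomastics": 4}


def pct_truncated(count, total=TOTAL_ALIGNMENTS):
    """Percentage cut (not rounded) to one decimal place."""
    return (1000 * count) // total / 10


for name, count in {**form_counts, **content_counts, **onomastic_counts}.items():
    print(f"{name:18s} {count:4d} {pct_truncated(count):5.1f}%")

# Form labels are mutually exclusive, so they add up to the total.
print(sum(form_counts.values()))      # 1007
# Content labels can co-occur, so their sum exceeds the total.
print(sum(content_counts.values()))   # 1155
```

The form counts add up to exactly 1007, while the content counts add up to 1155, which is consistent with alignments receiving more than one content tag.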

In general, we may assume that humans do not conduct many transformations that are covered by our alignment types, since 818 alignments (81.2%) received only the neutral content tag. On the other hand, there are not many extractive sentences, i.e., sentences taken entirely from the documents, since only 58 of 1007 alignments (5.7%) were annotated as identical. The high number of neutral alignments is due to the possibility of n-gram overlap and the likelihood of co-occurrence of this tag with other content tags. We may also infer that humans seem to prefer generalizations over specifications, as there are 82 generalizations (8.1%) and 48 specifications (4.7%) in the results. We may explain this difference by considering that generalizing a piece of information is a way to remove unnecessary details and reduce content in order to create a summary.

As in the alignment process, we calculated the kappa statistic considering five clusters. We obtained a kappa of 0.452 over all the tags. The kappa result for the form types was 0.717; for the content tags, it was 0.318, because the alignments could receive more than one content tag and this is a very subjective task.

5. Final remarks

In this article we presented the process of aligning multi-document human summaries to their source texts in the CSTNews corpus. We have not only systematized a set of rules to perform the task, but have also carried out a first characterization of the summarization process. Future work includes both investigating how to automate the alignment process (for which the annotated corpus may provide training and testing data) and finding summarization patterns that may support new AS methods. Concerning the difficulties we encountered while performing the alignment task, we highlight the need for domain knowledge to accomplish the alignment of texts and summaries of some clusters, especially those in the “politics” and “sports” categories. The alignments are incorporated as XML annotations in the CSTNews corpus, enriching it for linguistic research and for NLP studies of Portuguese. The corpus is available on the Sucinto⁴ project page.

Acknowledgments The authors are grateful to FAPESP, CAPES, and CNPq for supporting this work.

⁴ http://www2.icmc.usp.br/~taspardo/sucinto/

References

Barzilay, Regina, and Noemie Elhadad 2003, “Sentence Alignment for Monolingual Comparable Corpora.” Proceedings of the Empirical Methods for Natural Language Processing, 25-32.
Camargo, Renata T., Verônica Agostini, Ariani Di Felippo, and Thiago A. S. Pardo Forthcoming 2013, “Manual Typification of Source Texts and Multi-document Summaries Alignments.” V International Conference on Corpus Linguistics (CILC2013).
Caseli, Helena M., Valéria D. Feltrim, and Maria das Graças V. Nunes 2002, “TagAlign: Uma ferramenta de pré-processamento de textos.” NILC Technical Report, NILC-TR-02-09. June, 41p.
Cardoso, Paula C. F., Erick G. Maziero, Maria Lucía R. Castro Jorge, Eloize R. M. Seno, Ariani Di Felippo, Lucia H. M. Rino, Maria das Graças V. Nunes, and Thiago A. S. Pardo 2011, “CSTNews – A Discourse-Annotated Corpus for Single and Multi-Document Summarization of News Texts in Brazilian Portuguese.” Proceedings of the 3rd RST Brazilian Meeting, 88-105.
Carletta, Jean 1996, “Assessing Agreement on Classification Tasks: The Kappa Statistic.” Computational Linguistics, v. 22, n. 2, 249-254.
Caseli, Helena de Medeiros 2003, “Alinhamento sentencial de textos paralelos português-inglês.” Universidade de São Paulo (USP).
Clough, Paul, Robert Gaizauskas, Scott S. L. Piao, and Yorick Wilks 2002, “METER: MEasuring TExt Reuse.” Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, 152-159.
Daumé III, Hal, and Daniel Marcu 2004, “A Phrase-Based HMM Approach to Document/Abstract Alignment.” Empirical Methods in Natural Language Processing (EMNLP).
Daumé III, Hal, and Daniel Marcu 2005, “Induction of Word and Phrase Alignments for Automatic Document Summarization.” Computational Linguistics, v. 31, n. 4, 505-530.
Gale, William A., and Kenneth W. Church 1991, “A program for aligning sentences in bilingual corpora.” Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL), Berkeley, 177-184.
Germann, Ulrich 2008, “Yawat: Yet Another Word Alignment Tool.” Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), 20-23.
Hasler, Laura 2007, “From extracts to abstracts: human summary production operations for computer-aided summarisation.” RANLP 2007 Workshop on Computer-Aided Language Processing (CALP). Borovets, Bulgaria, 30 September, 11-18.
Hatzivassiloglou, Vasileios, Judith L. Klavans, and Eleazar Eskin 1999, “Detecting Text Similarity over Short Passages: Exploring Linguistic Feature Combinations via Machine Learning.” Proceedings of the Empirical Methods for Natural Language Processing, 203-212.
Hirao, Tsutomu, Jun Suzuki, Hideki Isozaki, and Eisaku Maeda 2004, “Dependency-based Sentence Alignment for Multiple Document Summarization.” COLING ’04 Proceedings of the 20th International Conference on Computational Linguistics, 446-452.
Jing, Hongyan, and Kathleen R. McKeown 1999, “The Decomposition of Human-Written Summary Sentences.” Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 129-136.
Marcu, Daniel 1999, “The automatic construction of large-scale corpora for summarization research.” Proceedings of the 22nd Conference on Research and Development in Information Retrieval, 137-144.
Okumura, Manabu, Takahiro Fukusima, and Hidetsugu Nanba 2003, “Text Summarization Challenge 2 - Text Summarization Evaluation at NTCIR Workshop 3.” HLT-NAACL 2003 Workshop: Text Summarization (DUC03), 49-56.
Soricut, Radu, and Eric Brill 2004, “Automatic Question Answering: Beyond the Factoid.” Proceedings of HLT-NAACL, 57-64.
Specia, Lucia 2010, “Translating from Complex to Simplified Sentences.” Proceedings of PROPOR, 30-39.
Yamada, Kenji, and Kevin Knight 2001, “A syntax-based statistical translation model.” Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), 523-530.

CHAPTER TEN

CORPUS ANNOTATION OF TEXTUAL ASPECTS IN MULTI-DOCUMENT SUMMARIES

ARIANI DI FELIPPO*†, LUCIA H. M. RINO*‡, THIAGO A. S. PARDO*§, PAULA C. F. CARDOSO*‡, ELOIZE R. M. SENO* **, PEDRO P. BALAGE FILHO*, AMANDA P. RASSI*†, MÁRCIO S. DIAS*‡, MARIA LÚCIA R. CASTRO JORGE*‡, ERICK G. MAZIERO*‡, ANDRESSA C. I. ZACARIAS*†, JACKSON W. C. SOUZA*†, RENATA T. CAMARGO*† AND VERÔNICA AGOSTINI*‡

* Núcleo Interinstitucional de Linguística Computacional (NILC).
† Departamento de Letras, Universidade Federal de São Carlos.
‡ Departamento de Computação, Universidade Federal de São Carlos.
§ Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo.
** Instituto Federal de São Paulo.

1. Introduction

Research on Automatic Summarization (AS) aims at automatically producing a summary from one or more source texts (Mani, 2001; Nenkova and McKeown, 2011). Current approaches to AS include linguistic and statistical-based ones, and target either the so-called single or multi-document summarization (MDS). A single document summary is one that is produced for just one source text; a multi-document one, in turn, is produced for multiple source texts on the same topic. In both cases, modeling AS artifacts can benefit from identifying text features. Usually, human or manual summaries provide the means for that.

In human single document summaries of news texts, especially informative and generic ones, very typical features that help modeling have been identified, namely: (i) the presence of keywords, which are usually associated with the most frequent words in the source text (Luhn, 1958); (ii) information conveyed by title or subtitles; (iii) information found at the beginning of the corresponding source, etc. (Cremmins, 1996; Endres-Niggemeyer, 1998). Informative and generic summaries are those that embed the main content of the source text and, as such, may replace reading the complete source; generic ones aim at a broad and non-specific audience. Usually, extractive AS summarizers¹ select the sentences that present those features for composing the final summary.

In contrast, multi-document news summaries often convey information that is most common to the majority of texts in the collection in focus. In this case, that is considered the main information (for example, Mani, 2001; Nenkova, 2006). Such a feature has been broadly used for extractive MDS. Based on diverse strategies, these systems select the most recurrent content to compose the multi-document summary of the collections. Using recurrence as a sentence selection criterion has lately been a significant means for state-of-the-art MDS (see, e.g., Zhang et al., 2002; Castro Jorge and Pardo, 2012). This has also been achieved by dealing with linguistic phenomena that are typical of the multi-document scenario, such as the existence of redundant information itself, and the identification of complementary or contradictory information that should be processed for good MDS results.

Despite the notable advances in MDS, many performance problems that undermine summary quality still have to be tackled. Amongst them, those that deserve special attention are: lack of cohesion (which ultimately results in lack of coherence), inclusion of irrelevant units of information and, in turn, exclusion of relevant ones, and absence of suitable referential chaining. Several initiatives have been proposed to tackle these problems.
¹ These systems produce summaries by copying and pasting full segments from the source texts. That is why they are extractive.

Of special interest in this paper is the so-called Guided Multi-document Summarization, first proposed in the 2010 Text Analysis Conference (hereafter TAC 2010). Guided MDS is based on the empirical evidence that multi-document summaries of news texts convey several units of information, called aspects, which can be found in the corresponding source texts. These aspects are usually expressed through well delimited text spans, and may be signaled by text spans as large as needed to convey the idea they are related to. They may also be general, i.e., they may encompass several text categories, or they may be typical of a text category (Li et al., 2011). Category is the term used in TAC 2010 to refer to the domain, or subject matter, of a text or group of texts. Owczarzak and Dang (2011), for example, found that summaries of the “Natural disasters” category present the following aspects: WHAT, WHEN, WHO_AFFECTED, DAMAGES, COUNTERMEASURES, WHERE, and WHY. Among them, WHAT, WHEN, WHERE and WHY are general, since they may occur in texts from other categories, and WHO_AFFECTED, DAMAGES and COUNTERMEASURES are typical aspects of texts from the “Natural disasters” category.

Following TAC 2010 guidelines, several analyses were performed to model aspects for MDS. Steinberger et al. (2010) performed semantic analyses aiming at multilingual AS. Makino et al. (2011) and Li et al. (2011) compiled aspects found in Wikipedia summaries. Barrera et al. (2011) produced a Question Answering system based on aspect identification for diverse categories. In MDS, aspects can be used for both determining relevant information in the source texts and depicting constraints for discourse organization. Therefore, schemata for structuring content can be drawn from descriptions of text categories. In TAC 2010, pre-defined aspects were derived as a result of analyses carried out by participants, for each category suggested. Results are better described in Genest et al. (2009).

Aiming at reproducing the TAC task for MDS of texts in Portuguese, a manual annotation of aspects was carried out by a group of native language speakers as a preliminary task for MDS modeling, following the initial efforts reported in previous works by the authors of this chapter (see, e.g., Camargo et al., 2012; Castro Jorge et al., 2012; Rassi et al., 2012; Zacarias et al., 2012). Manually produced multi-document summaries were analyzed and annotated with aspects.
These summaries were taken from the CSTNews corpus² (Aleixo and Pardo, 2008; Cardoso et al., 2011), which will be briefly described later in this chapter. The annotation was carried out by circa fifteen people, who thoroughly examined the complexities of identifying and determining aspects for conceptual segments expressed on the text surface. Whenever possible, analysis was pursued in an objective way. However, it is a well-known problem that identifying concepts by means of surface choices is quite a subjective task.

The annotation contributed to the field in different ways. A quite complete characterization of aspects found in the data was derived, which also includes formal definitions for each of them. The new corpus of manual summaries, annotated with aspects, is in itself an extension of the CSTNews corpus with novel and richer data. Patterns of organization that capture the occurrence of aspects for the CSTNews categories were derived. Overall, these contributions are very significant for modeling MDS, the first goal of the task reported in this article.

In what follows, the annotation of aspects in CSTNews multi-document summaries is described in Section 2, along with the aspects identified in four different categories of texts. Section 3 presents the aspect distribution in the summary corpus, and describes the derived organizational patterns for MDS of texts in Portuguese. Some final remarks are made in the last section.

² Available for download at www2.icmc.usp.br/~taspardo/sucinto/cstnews.html

2. Identifying and defining aspects in the CSTNews corpus

The CSTNews corpus consists of 50 clusters (i.e., collections of texts written about the same topic) of news texts from some of the main Brazilian online news agencies. Each cluster comprises 2 or 3 texts about the same issue, together with their human and automatic single- and multi-document generic informative summaries. The news texts are clustered according to the newspaper sections where they appear. These same sections gave rise to the categories, as suggested in TAC 2010. This was possible because the news sections mirror the usual content organization of subject matter provided by the Brazilian news agencies. Therefore, CSTNews comprises the following categories: “World” (14 clusters), “Daily News” (14 clusters), “Politics” (10 clusters), “Sports” (10 clusters), “Finances” (1 cluster), and “Science” (1 cluster). Summaries of the “Finances” and “Science” categories were excluded from the annotation task due to their nonrepresentative sizes (only 1 summary each, which makes it impossible to identify recurrent patterns).

CSTNews summaries do not cover all the categories defined at TAC 2010, namely, “Accidents and natural disasters”, “Attacks”, “Health and safety”, “Endangered resources”, and “Investigations and trials”. However, CSTNews does present a certain correspondence of content with that targeted by the TAC categories. For example, the category “Daily News” addresses attacks and trials, and “World” addresses natural disasters. Despite these intersections of categories and the existence of aspects that turned out to be generic enough to coincide with some of those defined in TAC 2010, it was necessary to establish clear directions to delimit the aspects in the CSTNews summaries. This was quite complex to pursue because the TAC 2010 guidelines for the guided MDS track³ were too subjective and general. It was even more important to have clear directions for the category-dependent aspects. Important results of the annotation reported in this section include more precise and practical guidelines for aspect annotation and the definition of each aspect identified during analysis.

2.1. Guidelines for aspect annotation

The sentence was adopted as the smallest unit for aspect identification and tagging, for two reasons: a sentence has clear boundaries and it conveys a complete message. Each sentence of a multi-document summary may thus be annotated with several aspects. These, in turn, could also be depicted at different levels of the discourse structure. While some aspects were well delimited at the intrasentential level, others were identified through the relations that text segments hold with other segments in the same sentence or in other sentences of the text. Linguistic markers attached to the spans could also signal aspect boundaries and thus helped to identify them. For instance, the DECLARATION aspect is usually found with the expression according to.

It was also relevant to distinguish aspects related to the main topic of a text from those related to secondary information, in which case aspects appear with the suffix extra. For instance, the main topic of a text is signaled by the aspect WHAT. When there is a topic shift, the corresponding sentence is annotated with the tag WHAT_EXTRA, and its other intertwined aspects also receive the suffix.

Procedures and formats for both annotating and storing the annotated summaries were also established. First, each sentence was delimited by square brackets; then a list of all the aspects that it conveyed was inserted at the end. Aspect tags appeared in capital letters; when more than one aspect appeared in the list, they were delimited by “/”. Whenever possible, the order of tags in a list followed the order of their corresponding text spans in the sentence. Tag names were kept in English, to make their correspondence with the tags suggested in TAC 2010 easier to identify, when applicable. An example of an annotated sentence is given below.
A literal translation follows for clarity and to help readers identify the aspects that emerge from the corresponding text spans.

³ The NLP tasks in TAC are known as “tracks”.

[A equipe brasileira, comandada por Bernardinho, venceu a Finlândia por 3 sets a 0, em Tampere (FIN), mantendo sua invencibilidade na Liga Mundial de Vôlei-06.]WHO_AGENT/WHAT/SCORE/WHERE/CONSEQUENCE/SITUATION

[The Brazilian team, coached by Bernardinho, beat Finland by 3 to 0, in Tampere (FIN), keeping its leadership in the 2006 World Volleyball League.]WHO_AGENT/SCORE/WHERE/CONSEQUENCE/SITUATION/WHAT
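The storage format described above (a sentence in square brackets followed by a “/”-delimited list of upper-case tags) is regular enough to be read back mechanically. The following is a minimal sketch in Python, not the tooling used by the annotators; the function name `parse_annotated` and the regular expression are assumptions made for illustration, and the sketch tolerates a stray space after a “/”.

```python
import re

# One annotated unit: "[sentence]TAG/TAG/...". Tag names are upper-case
# letters and underscores; whitespace around "/" is tolerated.
# Assumption: sentences never contain a literal "]".
UNIT = re.compile(r"\[([^\]]+)\]\s*([A-Z_]+(?:\s*/\s*[A-Z_]+)*)")


def parse_annotated(summary):
    """Return a list of (sentence, [tags]) pairs in order of appearance."""
    units = []
    for match in UNIT.finditer(summary):
        sentence = match.group(1).strip()
        tags = [tag.strip() for tag in match.group(2).split("/")]
        units.append((sentence, tags))
    return units


example = (
    "[The Brazilian team, coached by Bernardinho, beat Finland by 3 to 0, "
    "in Tampere (FIN), keeping its leadership in the 2006 World Volleyball "
    "League.]WHO_AGENT/SCORE/WHERE/CONSEQUENCE/ SITUATION/WHAT"
)
for sentence, tags in parse_annotated(example):
    print(tags)
```

Keeping the tags as an ordered list rather than a set preserves the guideline that tag order follows, whenever possible, the order of the corresponding text spans.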

2.2. Definition of aspects

The only categories of the CSTNews corpus considered for annotation are “World”, “Daily News”, “Politics”, and “Sports”, as mentioned above. Each aspect identified in the multi-document summaries was given a clear definition and a prototypical example. It is important to note that, although the aspects are general enough to be used in all categories, their occurrence is not uniform across them, as we discuss later. Definitions were pursued not only for consistency of annotation, but also for documentation purposes. This is a novel contribution to the field in general and to AS in particular: to the best of our knowledge, there has so far been no initiative to define them. It is also worth noting that, although the definitions were drawn from data written in Portuguese, they may well be applicable to other languages. Table 1 presents definitions for the aspects identified in the corpus. Aspects with the extra suffix, which refer to events other than the main ones in the texts, have the same meaning as the corresponding aspects in the table and are therefore not listed separately. Examples are provided throughout the chapter.

Table 1: Aspects identified in the CSTNews corpus

| Aspect | Definition |
|---|---|
| COMMENT | Author’s commentary about a fact or event |
| COMPARISON | Different data or statistics aimed at comparing two or more entities |
| CONSEQUENCE | Fact or event caused by another fact or event |
| COUNTERMEASURES | Measures aimed at solving or anticipating or preventing problems related to a fact or event |
| DECLARATION | Statement by someone or by a source through direct or indirect speech |
| GOAL | Goal or reason for a fact or event still to occur |
| HISTORY | Context information about the history or past of a fact or event |
| HOW | The way a fact or event occurs |
| PREDICTION | Information about the feasibility of future facts or events (which may also be sure to occur) |
| SITUATION | Occasion of a fact or event; it may involve a transaction, a championship, an agenda, or other types of situations for which date or place are not necessarily specified |
| SCORE | Numerical result of a fact or event (such as score, time, and distance), mainly related to sports |
| WHAT | Fact or event described in the text |
| WHEN | Date or period of time (strictly temporal) of occurrence of a fact or event |
| WHERE | Position (physical or geographical) of a fact or event |
| WHO_AGENT | Entity (person or organization) responsible for causing or provoking the occurrence of a fact or event |
| WHO_AFFECTED | Entity (person or organization) that suffers the effects of a fact or event |
| WHY | Explanation of why a fact or event happens (or happened) |
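Because every tag is either one of the aspects in Table 1 or one of them followed by the extra suffix, the inventory can be encoded once and used to check annotations for misspelled or unknown tags. The sketch below is illustrative only; the names `BASE_ASPECTS`, `split_tag`, and `is_known_tag` are hypothetical and do not come from the annotation project.

```python
# Base aspects as listed in Table 1 (17 in total).
BASE_ASPECTS = frozenset({
    "COMMENT", "COMPARISON", "CONSEQUENCE", "COUNTERMEASURES",
    "DECLARATION", "GOAL", "HISTORY", "HOW", "PREDICTION", "SITUATION",
    "SCORE", "WHAT", "WHEN", "WHERE", "WHO_AGENT", "WHO_AFFECTED", "WHY",
})

EXTRA_SUFFIX = "_EXTRA"


def split_tag(tag):
    """Split a tag into (base aspect, is_extra)."""
    if tag.endswith(EXTRA_SUFFIX):
        return tag[: -len(EXTRA_SUFFIX)], True
    return tag, False


def is_known_tag(tag):
    """True if the tag is a Table 1 aspect, with or without the suffix."""
    base, _ = split_tag(tag)
    return base in BASE_ASPECTS


print(split_tag("WHO_AGENT_EXTRA"))
print(is_known_tag("DAMAGES"))
```

Under this encoding a TAC 2010 aspect such as DAMAGES, which is not part of Table 1, is rejected, while WHAT_EXTRA is accepted as the secondary counterpart of WHAT.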

2.3. A summary annotated with aspects

The summary below, from the “Sports” category, is fully annotated according to the above definitions and guidelines. As one may see, the following aspects were identified in the first sentence: WHO_AGENT (The Brazilian team, coached by Bernardinho), SCORE (by 3 to 0), WHERE (in Tampere (FIN)), CONSEQUENCE (keeping its leadership), SITUATION (2006 World Volleyball League), and WHAT (the whole event of winning, lexically headed by beat). Notice that the aspect WHAT signals the main topic of the summary presented; it appears in the first sentence (the “lead” sentence), as is expected for news texts. Extra aspects in other sentences are linked to secondary information.


[A equipe brasileira, comandada por Bernardinho, venceu a Finlândia por 3 sets a 0, em Tampere (FIN), mantendo sua invencibilidade na Liga Mundial de Vôlei-06.]WHO_AGENT/SCORE/WHERE/CONSEQUENCE/SITUATION/WHAT
[Amanhã as equipes voltarão a se enfrentar, no mesmo local.]WHEN_EXTRA/WHO_AGENT_EXTRA/PREDICTION/WHERE_EXTRA
[Com o resultado, o Brasil está na liderança do grupo B, perto da classificação para a próxima fase do campeonato.]WHO_AGENT/CONSEQUENCE
[A seleção brasileira ainda enfrentará portugueses e finlandeses na fase de classificação.]WHO_AGENT/PREDICTION/SITUATION_EXTRA
[A equipe brasileira já conquistou cinco vezes a Liga Mundial.]WHO_AGENT/HISTORY
[A fase final deste ano acontecerá na Rússia.]PREDICTION/WHERE_EXTRA

[The Brazilian team, coached by Bernardinho, beat Finland by 3 to 0, in Tampere (FIN), keeping its leadership in the 2006 World Volleyball League.]WHO_AGENT/SCORE/WHERE/CONSEQUENCE/SITUATION/WHAT
[Tomorrow the teams will meet again at the same place.]WHEN_EXTRA/WHO_AGENT_EXTRA/PREDICTION/WHERE_EXTRA
[With the result, Brazil is the Group B leader, close to qualifying for the next phase of the championship.]WHO_AGENT/CONSEQUENCE
[The Brazilian team will still meet the Portuguese and Finnish teams in the qualifying phase.]WHO_AGENT/PREDICTION/SITUATION_EXTRA
[The Brazilian team has already won the World League five times.]WHO_AGENT/HISTORY
[This year’s finals will be held in Russia.]PREDICTION/WHERE_EXTRA

Given the aspect definitions and the annotated multi-document summaries of the four categories, a synthesis was produced for AS modeling purposes, which is addressed next.

3. Representativeness of aspects and organizational patterns

In this section, the annotation is synthesized aspect by aspect for all the categories. The representativeness of the aspects is analyzed by examining their frequency in each set of summaries, followed by an analysis of their relative distribution. Based on that, general patterns of organization were derived for each category, aiming at guided MDS of texts in Portuguese.


3.1. Representativeness for each category

3.1.1. “Daily News”

Table 2 shows the distribution of aspects for the “Daily News” category. The aspects that occurred in this category are ordered by their frequency, which is shown in the “Total” column. The table also shows the percentage of occurrence of each aspect in the category, as well as the average number of occurrences per summary (Avg). Only aspects with non-zero frequency are shown.

Table 2: Representativeness of aspects in the “Daily News” category

| Aspect | Total | % | Avg |
|---|---|---|---|
| DECLARATION | 28 | 13% | 2.0 |
| WHAT_EXTRA | 25 | 11% | 1.8 |
| CONSEQUENCE | 22 | 10% | 1.6 |
| WHO_AGENT | 22 | 10% | 1.6 |
| WHO_AGENT_EXTRA | 18 | 8% | 1.3 |
| WHAT | 13 | 6% | 0.9 |
| WHEN | 13 | 6% | 0.9 |
| WHERE | 12 | 5% | 0.9 |
| WHEN_EXTRA | 11 | 5% | 0.8 |
| HISTORY | 10 | 5% | 0.7 |
| WHERE_EXTRA | 9 | 4% | 0.6 |
| WHO_AFFECTED | 9 | 4% | 0.6 |
| COMMENT | 6 | 3% | 0.4 |
| HOW | 5 | 2% | 0.4 |
| COUNTERMEASURES | 4 | 2% | 0.3 |
| SITUATION | 4 | 2% | 0.3 |
| WHY | 2 | 1% | 0.1 |
| WHY_EXTRA | 2 | 1% | 0.1 |
| COMPARISON | 1 | 0% | 0.1 |
| GOAL | 1 | 0% | 0.1 |
| HOW_EXTRA | 1 | 0% | 0.1 |
| PREDICTION | 1 | 0% | 0.1 |
| WHO_AFFECTED_EXTRA | 1 | 0% | 0.1 |
| Total | 220 | | |
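The “%” and “Avg” columns of Table 2 follow directly from the raw totals: the percentage is an aspect’s share of all tag occurrences in the category, and the average divides its total by the number of summaries. The sketch below recomputes both from the “Total” column. It assumes 14 summaries for “Daily News” (one per cluster, as the cluster counts given earlier suggest) and rounding to the nearest integer and to one decimal place; neither assumption is stated explicitly by the authors, and the function name `representativeness` is hypothetical.

```python
# Raw totals copied from Table 2 ("Daily News").
DAILY_NEWS = {
    "DECLARATION": 28, "WHAT_EXTRA": 25, "CONSEQUENCE": 22, "WHO_AGENT": 22,
    "WHO_AGENT_EXTRA": 18, "WHAT": 13, "WHEN": 13, "WHERE": 12,
    "WHEN_EXTRA": 11, "HISTORY": 10, "WHERE_EXTRA": 9, "WHO_AFFECTED": 9,
    "COMMENT": 6, "HOW": 5, "COUNTERMEASURES": 4, "SITUATION": 4,
    "WHY": 2, "WHY_EXTRA": 2, "COMPARISON": 1, "GOAL": 1, "HOW_EXTRA": 1,
    "PREDICTION": 1, "WHO_AFFECTED_EXTRA": 1,
}
N_SUMMARIES = 14  # assumption: one multi-document summary per cluster


def representativeness(counts, n_summaries):
    """Map each aspect to (total, percentage of all tags, average per summary)."""
    total = sum(counts.values())
    return {
        aspect: (count, round(100 * count / total), round(count / n_summaries, 1))
        for aspect, count in counts.items()
    }


table = representativeness(DAILY_NEWS, N_SUMMARIES)
for aspect, (count, pct, avg) in table.items():
    print(f"{aspect:<20}{count:>4}{pct:>4}%{avg:>5}")
```

The same function applies to the other categories once their totals and summary counts are substituted.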

One can see that DECLARATION (13%), WHAT_EXTRA (11%), CONSEQUENCE (10%), and WHO_AGENT (10%) were the most frequent aspects. The high frequency of WHAT and WHAT_EXTRA is justified because they refer to the main and secondary news, respectively. DECLARATION aspects are also frequent, as news texts usually embed statements by people involved in the reported events. The statistical relevance of CONSEQUENCE suggests that “Daily News” texts usually report consequences of events related to accidents (fortuitous or induced) or criminal attacks. In that case, the consequences are actually problems resulting from the accidents or attacks. Reports of measures to solve these problems accompany them less often in the summaries analyzed, as evidenced by the low representativeness of COUNTERMEASURES (2%) in the table. HISTORY, which appears to have a moderate frequency (5%), deserves attention: 7 of its 10 occurrences appear in a single summary. Therefore, its frequency cannot be considered significant enough to typify summaries of this category.

3.1.2. “Sports”

Table 3 shows the distribution of aspects for the “Sports” category. One can see that the most frequent aspects are WHO_AGENT (11%), HOW (10%), COMMENT (9%), WHAT_EXTRA (8%), and WHO_AGENT_EXTRA (8%). The outstanding representativeness of WHO_AGENT and WHAT (and their extra versions) reveals that sports news reports a fact or event (thus tagged with WHAT or WHAT_EXTRA) in which athletes and teams (tagged with WHO_AGENT or WHO_AGENT_EXTRA) presented a specified performance. WHAT_EXTRA is more frequent than WHAT (8% against 6%) because these news texts convey more information that does not refer to the main topic than information that does. For example, news about winning a gold medal usually also reports the performance of the other athletes that competed. The aspects HOW and COMMENT are also frequent (10% and 9%, respectively). This shows that the way a result is achieved in a game, or the performance of an athlete or team, and the opinion of the writer are very common information in sports news. The high frequency of HOW, though, is due to several occurrences in just one summary. Some aspects present a moderate frequency, such as CONSEQUENCE (5%), SCORE (4%), and SITUATION (4%). This suggests that sports news less frequently conveys a result (or score), the consequences of the result, and the place of the games or championships. These three aspects, however, appear in most summaries of the “Sports” category. SCORE, incidentally, is typical only of this category.
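The remarks on HISTORY in “Daily News” and on HOW in “Sports” show that a raw total can be inflated by a single summary. One way to make this visible is to report, next to the total, the number of summaries in which an aspect occurs and the largest share of its occurrences found in any one summary. The sketch below illustrates the idea; the function name `concentration` is hypothetical, and the data are invented toy values (only the 7-out-of-10 figure for HISTORY echoes the text; how the remaining three occurrences are spread is made up).

```python
from collections import Counter


def concentration(per_summary_tags, aspect):
    """Return (total occurrences, summaries containing the aspect,
    largest share of the total held by a single summary)."""
    counts = [Counter(tags)[aspect] for tags in per_summary_tags]
    total = sum(counts)
    coverage = sum(1 for c in counts if c > 0)
    max_share = max(counts) / total if total else 0.0
    return total, coverage, max_share


# Toy data (invented): five summaries, each given as a flat list of tags.
toy = [
    ["WHAT"] + ["HISTORY"] * 7,
    ["WHAT", "HISTORY"],
    ["WHAT", "HISTORY", "WHEN"],
    ["WHAT", "HISTORY"],
    ["WHAT", "WHERE"],
]
print(concentration(toy, "HISTORY"))
print(concentration(toy, "WHAT"))
```

In the toy data, HISTORY has twice as many occurrences as WHAT, yet WHAT is spread evenly over all five summaries while most HISTORY occurrences sit in one of them, which is the situation the text warns about.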


Table 3: Representativeness of aspects in the “Sports” category

| Aspect | Total | % | Avg |
|---|---|---|---|
| WHO_AGENT | 19 | 11% | 1.9 |
| HOW | 17 | 10% | 1.7 |
| COMMENT | 16 | 9% | 1.6 |
| WHAT_EXTRA | 14 | 8% | 1.4 |
| WHO_AGENT_EXTRA | 13 | 8% | 1.3 |
| WHAT | 10 | 6% | 1.0 |
| WHEN | 10 | 6% | 1.0 |
| CONSEQUENCE | 8 | 5% | 0.8 |
| SCORE | 7 | 4% | 0.7 |
| SITUATION | 6 | 4% | 0.6 |
| PREDICTION | 5 | 3% | 0.5 |
| SITUATION_EXTRA | 5 | 3% | 0.5 |
| WHERE | 5 | 3% | 0.5 |
| WHO_AFFECTED | 5 | 3% | 0.5 |
| COMMENT_EXTRA | 4 | 2% | 0.4 |
| HISTORY | 4 | 2% | 0.4 |
| WHEN_EXTRA | 4 | 2% | 0.4 |
| WHERE_EXTRA | 4 | 2% | 0.4 |
| SCORE_EXTRA | 3 | 2% | 0.3 |
| COMPARISON | 2 | 1% | 0.2 |
| CONSEQUENCE_EXTRA | 2 | 1% | 0.2 |
| DECLARATION | 2 | 1% | 0.2 |
| WHY | 2 | 1% | 0.2 |
| GOAL | 1 | 1% | 0.1 |
| WHO_AFFECTED_EXTRA | 1 | 1% | 0.1 |
| WHY_EXTRA | 1 | 1% | 0.1 |
| Total | 170 | | |

3.1.3. “World”

Table 4 shows the distribution of aspects for the “World” category. The aspect WHAT_EXTRA is the most recurrent (12%) in this category, which reemphasizes the high frequency of secondary events in the related summaries. WHAT is also relatively frequent (7%) and appears in every summary of the “World” category. WHERE and WHO_AGENT_EXTRA are also very frequent (8% each).


Table 4: Representativeness of aspects in the “World” category

| Aspect | Total | % | Avg |
|---|---|---|---|
| WHAT_EXTRA | 27 | 12% | 1.9 |
| WHERE | 17 | 8% | 1.2 |
| WHO_AGENT_EXTRA | 17 | 8% | 1.2 |
| WHAT | 16 | 7% | 1.1 |
| WHO_AFFECTED | 16 | 7% | 1.1 |
| COUNTERMEASURES | 14 | 6% | 1.0 |
| DECLARATION | 13 | 6% | 0.9 |
| HISTORY | 13 | 6% | 0.9 |
| WHEN_EXTRA | 12 | 6% | 0.9 |
| WHY | 12 | 6% | 0.9 |
| CONSEQUENCE | 10 | 5% | 0.7 |
| WHEN | 10 | 5% | 0.7 |
| WHO_AFFECTED_EXTRA | 10 | 5% | 0.7 |
| WHERE_EXTRA | 8 | 4% | 0.6 |
| PREDICTION | 5 | 2% | 0.4 |
| WHO_AGENT | 5 | 2% | 0.4 |
| CONSEQUENCE_EXTRA | 2 | 1% | 0.1 |
| GOAL | 2 | 1% | 0.1 |
| GOAL_EXTRA | 2 | 1% | 0.1 |
| SITUATION | 2 | 1% | 0.1 |
| WHY_EXTRA | 2 | 1% | 0.1 |
| COUNTERMEASURES_EXTRA | 1 | 0% | 0.1 |
| DECLARATION_EXTRA | 1 | 0% | 0.1 |
| PREDICTION_EXTRA | 1 | 0% | 0.1 |
| Total | 218 | | |

As most of the summaries of this category are basically about accidents, disasters, and attacks, the aspects WHO_AFFECTED and COUNTERMEASURES are frequent, conveying information about victims, measures to help victims, and measures of precaution and reconstruction.

3.1.4. “Politics”

Table 5 shows the distribution of aspects for the “Politics” category. WHAT_EXTRA is the most frequent aspect (27%), followed by WHO_AGENT_EXTRA (17%). This suggests that political discourse usually mentions secondary information.


Table 5: Representativeness of aspects in the “Politics” category

| Aspect | Total | % | Avg |
|---|---|---|---|
| WHAT_EXTRA | 51 | 27% | 5.1 |
| WHO_AGENT_EXTRA | 31 | 17% | 3.1 |
| DECLARATION | 16 | 9% | 1.6 |
| WHO_AFFECTED_EXTRA | 13 | 7% | 1.3 |
| WHAT | 12 | 6% | 1.2 |
| WHO_AGENT | 9 | 5% | 0.9 |
| WHY_EXTRA | 9 | 5% | 0.9 |
| WHEN | 7 | 4% | 0.7 |
| WHEN_EXTRA | 7 | 4% | 0.7 |
| COMPARISON | 5 | 3% | 0.5 |
| PREDICTION | 5 | 3% | 0.5 |
| WHO_AFFECTED | 5 | 3% | 0.5 |
| CONSEQUENCE | 3 | 2% | 0.3 |
| GOAL_EXTRA | 3 | 2% | 0.3 |
| HOW | 3 | 2% | 0.3 |
| HISTORY | 2 | 1% | 0.2 |
| SITUATION | 2 | 1% | 0.2 |
| WHY | 2 | 1% | 0.2 |
| GOAL | 1 | 1% | 0.1 |
| Total | 186 | | |

The relatively high frequency of WHO_AGENT_EXTRA (17%), WHO_AFFECTED_EXTRA (7%), and WHO_AGENT (5%) is expected in texts on Politics: they frequently convey information about events involving active characters in the national scenario and about related organizations, which are mentioned as agents. The high frequency of DECLARATION (9%), in addition, points to the tendency of this discourse to indicate to whom statements are attributed.
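The observation that political summaries lean on secondary information can be quantified by the share of tag occurrences that carry the extra suffix. The sketch below derives that share from the totals in Table 5; the figure it yields (roughly 61% of all tag occurrences) is a by-product of the table and is not reported by the authors. The function name `extra_share` is hypothetical.

```python
# Raw totals copied from Table 5 ("Politics").
POLITICS = {
    "WHAT_EXTRA": 51, "WHO_AGENT_EXTRA": 31, "DECLARATION": 16,
    "WHO_AFFECTED_EXTRA": 13, "WHAT": 12, "WHO_AGENT": 9, "WHY_EXTRA": 9,
    "WHEN": 7, "WHEN_EXTRA": 7, "COMPARISON": 5, "PREDICTION": 5,
    "WHO_AFFECTED": 5, "CONSEQUENCE": 3, "GOAL_EXTRA": 3, "HOW": 3,
    "HISTORY": 2, "SITUATION": 2, "WHY": 2, "GOAL": 1,
}


def extra_share(counts):
    """Fraction of all tag occurrences whose tag ends in _EXTRA."""
    total = sum(counts.values())
    extra = sum(c for tag, c in counts.items() if tag.endswith("_EXTRA"))
    return extra / total


print(f"{extra_share(POLITICS):.0%}")
```

Applying the same function to the totals of the other categories would allow the categories to be compared on a single number, under the assumption that the suffix is applied consistently across annotators.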

3.2. Possible patterns of organization for AS

In this section we present the results of the analysis of each category, synthesized as patterns of organization that may be followed for AS. It is important to note that, for some categories, it was possible to generalize (e.g., for the “Daily News” category) and, for others, it was necessary (and sometimes desirable) to identify sub-categories (e.g., for the “World” category). In what follows, we first present the analysis of each category and then the general observations that we could draw from them.


3.2.1. “Daily News”

In annotating daily news texts, it has been observed that there is no common group of aspects for all the respective summaries. This may be due to the great diversity of themes in this category. However, there is a group of aspects that may typify most of them. There is also a partial order between them, which is described in Table 6 through the symbol ‘