Linguistic Resources for Natural Language Processing: On the Necessity of Using Linguistic Methods to Develop NLP Software


Table of Contents
Foreword
Preface
About This Book
Contents
Part I: Introduction
The Limitations of Corpus-Based Methods in NLP
1 Introduction
1.1 Training Corpora
1.2 Limited Tag Sets
1.3 Reliability of Training Corpora
2 NLP Software Results
2.1 Incorrect Tags
2.2 Concepts, Themes, and Terms
2.3 Word Clouds
2.4 Semantic Networks
3 Principles at the Basis of Empirical Methods
3.1 Flawed Principle: The Notion of Similar Contexts
3.2 Flawed Principle: The Units of Processing
4 The Scientific Approach
4.1 The Real Value of a Training Corpus
5 Conclusion
References
Part II: Developing Linguistic-Based NLP Software
Linguistic Resources for the Automatic Generation of Texts in Natural Language: The Elvex Formalism
1 Introduction
2 A Hypothetico-deductive Approach of NLG
2.1 A Deterministic Model
2.2 A Declarative and Constraint-Based Approach
2.3 A Monotonic Model
3 Writing a Grammar with Elvex
3.1 Feature Structure
3.2 Constituent-Structure (C-Structure)
3.3 Syntactic Rules
3.4 The Lexicon
The Morphological Lexicon Defined by Extension
The Pattern Lexicon
3.5 The Language Rules
3.6 The Speaker's Rules and Language
4 Conclusion
References
Towards a More Efficient Arabic-French Translation
1 Introduction
2 Related Works
3 Problems in Arabic Named Entities
3.1 Problems in Recognition
3.2 Problems with the Translation of Arabic NEs
4 Implementation of the Set of Transducers
4.1 Recognition Phase
4.2 Translation Phase
5 Experimentation and Evaluation
5.1 Experimentation of Recognition Phase
5.2 Experimentation of Translation Phase
6 From the NEs Recognition to the Translation of Complex Sentences
6.1 NEs and Relative Clauses
6.2 Implemented Resources
6.3 Experimentation of Implemented Resources
7 Conclusion and Perspectives
References
Linguistic Resources and Methods for Belarusian Natural Language Processing
1 Introduction
2 Text-to-Speech Synthesizer
3 The Transcription Generator
4 Word Paradigm Generator
5 Conclusion
References
Part III: Linguistic Resources for Low-Resource Languages
A New Set of Linguistic Resources for Ukrainian
1 Introduction
2 Theoretical Basis
2.1 Approaches to Designing Natural Language Processing Applications
3 Statistical Tools for Ukrainian
3.1 Sketch ENGINE
3.2 TreeTagger and RNNTagger
4 Ukrainian Linguistic Resources
4.1 Dictionary
4.2 Morphological Grammars
5 Conclusion and Perspectives
References
Formalization of the Quechua Morphology
1 Introduction
2 Constructing Electronic Dictionaries
2.1 Formalization of Quechua Noun Inflections
2.2 Formalizing Quechua Verb Morphology
2.3 Formalizing Adjective Morphology
2.4 Formalizing Adverbs Morphology
2.5 Formalizing Pronouns Morphology
3 Conclusion and Perspectives
References
The Challenging Task of Translating the Language of Tango
1 Introduction
2 The Project: Automatic Machine Translation
3 Translation of Terms
4 Translating Syntactic Structures
5 Conclusion
References
A Polylectal Linguistic Resource for Rromani
1 Introduction
1.1 Rromani Dialectology
1.2 Rromani Alphabet
1.3 Rromani Dictionaries
1.4 Rromani Language Lessons
1.5 Rromani Grammar
2 Empirical NLP Software
2.1 Rromani, a Low-Resource Language
3 Rromani Online Resources
3.1 Russian Romani Corpus
3.2 ROMLEX
3.3 Online Rromani Dictionaries
3.4 Need for Coherent Linguistic Resources
4 Rromani Linguistic Resources
4.1 Dictionary
4.2 Morphology
5 Evaluation and Perspectives
References
Part IV: Processing Multiword Units: The Linguistic Approach
Using Linguistic Criteria to Define Multiword Units
1 Introduction
2 The Corpus-Based Approach and Collocations
3 Semantic Atomicity
4 Term Usage
5 Idiosyncratic Transformational Analyses
6 Conclusion
References
A Linguistic Approach to English Phrasal Verbs
1 Introduction
2 English Phrasal Verbs: Particle vs. Preposition
2.1 Further Distinctions: Prepositional and Phrasal Prepositional Verbs
3 Lexicon-Grammar of Phrasal Verbs
3.1 Using Lexicon-Grammar in Tandem with NooJ
3.2 Accuracy of Discontinuous Phrasal Verbs
4 Removing False Phrasal Verbs Automatically
4.1 PV Disambiguation Grammar 1: Environment to Right of "PV"
4.2 PV Disambiguation Grammar 2: Environment to Left of "PV"
4.3 PV Disambiguation Grammar 3: Locative Environment to Right of "PV"
5 Conclusion
References
Analysis of Indonesian Multiword Expressions: Linguistic vs Data-Driven Approach
1 Introduction
2 POS Tagging
3 Syntactic Parsing
4 Machine Translation
5 Conclusion
References


Max Silberztein, Editor

Linguistic Resources for Natural Language Processing: On the Necessity of Using Linguistic Methods to Develop NLP Software

Editor: Max Silberztein, Université de Franche-Comté, Paris, France

ISBN 978-3-031-43810-3    ISBN 978-3-031-43811-0 (eBook)
https://doi.org/10.1007/978-3-031-43811-0

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.

In honor of Peter

This volume is dedicated to the memory of Peter Aloysius Machonis, a member of the Department of French and Linguistics at Florida International University, who passed away on March 8, 2023. Peter joined Maurice Gross's LADL laboratory in the early 1980s. He constructed a set of lexicon-grammar tables to describe the syntax of English frozen expressions and then that of operator and support verbs. More recently, Peter built a dictionary describing 1200 English phrasal verbs, which is integrated into the NooJ English module. His last contribution, describing his research project, is presented in this volume. Peter was a friend and an honest scientist whom all his colleagues and students appreciated. His absence is and will be deeply felt by members of the NooJ community and his colleagues worldwide.

Foreword

At the end of his preface, Max Silberztein writes that "this book aims at rehabilitating the linguistic approach to NLP." Is there a need for that rehabilitation, and if yes, does this book successfully attain this goal? I would like to answer "yes" to the first question, with some added criticism of those who in fact ostracize proponents of methods based on linguistic knowledge: not only is that ostracism totally unethical, but it often rests on patently false arguments. Concerning the second question, I would like to answer "yes, but not totally," while stressing the intrinsic interest of all the articles in the book, and the power and conspicuousness of NooJ, a descriptive and executable formalism created by Max Silberztein, usable in principle to describe all languages (and polylects), in particular low-resource languages such as Quechua and Tango. More than 30 languages have detailed grammars and large dictionaries in NooJ.

Max calls "empirical" the methods based exclusively on learning from "large enough" annotated corpora, and "linguistic" those based exclusively on grammars and dictionaries. I would prefer to call the second kind "expert" methods. Why? Because expert methods are not always based exclusively on "carefully manually encoded linguistic knowledge." For instance, in machine translation (MT), enriched corpora have been used since the very beginning of the field, both for development and for testing. In 1961–1967, the first CETA Russian-French system was developed using a large corpus of texts (400,000 wordforms, about 1600 standard pages) on satellites and rocket engines, prepared and provided by the Rand Corporation. The morphological analyzer produced not lemmas and morphological attributes, but "lexical units" (LUs), in fact derivational families, together with attributes including the morphosyntactic category (POS), number, case, person, tense, etc., plus the LU and a derivation code. Many lemmas found in the texts were not contained directly in the dictionary but were deduced from the LU and the potential derivations listed in the dictionary. As a derivation contains not only a morphological part but also a syntactic and a semantic part (e.g., verb → action noun), two advantages resulted: the dictionary was smaller, and the information in the result was "multilevel."


Now to the "ostracization" mentioned above. First, this is not the first time we see that kind of behavior in computational linguistics (CL) and natural language processing (NLP). Remember the contempt that proponents of knowledge-based methods for automatic speech recognition (ASR) showed toward proponents of inherently stochastic HMM methods. But, at the end of the 1970–1975 DARPA project on ASR, it appeared that the empirical HMM-based Harpy system (CMU) clearly dominated the other competing systems, all knowledge-based (Hearsay-II from CMU, HWIM from BBN, SUS from SRI). Then the "mainstream" swung to the opposite, so that papers on "expert" systems were rejected by reviewers of conferences and journals. Same for jobs: Jelinek at IBM wrote, "Every time I fire a linguist, the performance of my ASR system jumps up by 10%"!

In MT, this happened in the late 1990s. There was a claim in the early 2000s that statistical MT (SMT), derived from the early (1980s) work of Jelinek, Brown, and others at IBM, was "beating" linguistic-based MT systems such as Systran, Reverso (Softissimo & ProMT), ATLAS-II (Fujitsu, JP ↔ EN), AS-Transac (Toshiba), METAL (DE → EN, LRC & Siemens), or METEO. Very often, the claim is based on an invalid evaluation method. Most SMT systems were and are "objectively" evaluated with a similarity measure (erroneously called a "metric") such as BLEU, Orange, NIST, or WER. But, contrary to popular belief, these measures do not and cannot measure translation quality. This has been clearly stated and demonstrated in a famous paper by Callison-Burch, Osborne, and Koehn.1

There is a far better measure of translation usage quality, if the goal is to produce high-quality translation results, especially in technical domains, by post-editing MT results. J. Slocum seems to have introduced it back in 1984, for evaluating the GE→EN METAL system on Siemens texts.2 Using the same units (post-editing time in minutes per standard page of 1430 characters),3 I proposed around 2005 a PEMT quality score on a scale of 20 (the scale used in French schools), given by the simple formula:

PEMTquality = (20 - 2/5 × mnPE/page) / 20

1 Chris Callison-Burch, Miles Osborne, and Philipp Koehn (2006) Re-evaluating the role of BLEU in MT research. In 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 249–256, Trento, Italy. https://aclanthology.org/E06-1032. A sentence that was actually translated correctly can still receive a low score, depending on the human reference. Moreover, BLEU cannot evaluate the importance of errors. For a BLEU score, an error is just that: an error.
2 Jonathan Slocum, Winfield S. Bennett, Lesley Whiffin, Edda Norcross (1985) An evaluation of METAL: The LRC Machine Translation System. Proc. Second Conference of the European Chapter of the Association for Computational Linguistics, Geneva, Switzerland.
3 About 250 words in English, French, Spanish, etc., but in German there are many long compound words, so the character count is more universal for languages using alphabets. For ideogram-based writing systems, a standard page (having approximately the same semantic content) is about 400–440 characters.

mnPE/page:        0        5          10         15     20     25       30
PEMTquality:      20       18         16         14     12     10       8
Interpretation:   Perfect  Excellent  Very good  Good   Fair   Average  Bad
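For readers who want to check the numbers, here is a minimal Python sketch of this score (the function name and the printed layout are mine; the formula and the interpretation labels come from the text and table above, with the score expressed out of 20 as in the table):

```python
def pemt_quality(mn_pe_per_page: float) -> float:
    """PEMT quality score out of 20: 20 minus 2/5 of the post-editing time
    (in minutes) needed per standard page of 1430 characters."""
    return 20 - 0.4 * mn_pe_per_page

# Reproduce the interpretation table above.
labels = ["Perfect", "Excellent", "Very good", "Good", "Fair", "Average", "Bad"]
for minutes, label in zip(range(0, 35, 5), labels):
    print(f"{minutes:>2} mnPE/page -> {pemt_quality(minutes):>4.1f}/20 ({label})")
```

For instance, the 6 mnPE/page reported below for METEO yields 20 - 0.4 × 6 = 17.6, i.e., the "nearly 18/20" mentioned in the text.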

It gives a good idea of MT results compared to human professional translation, where the first draft typically takes 1 h/page and the expert revision 20 min/page. A reasonable goal of MT+PE is to automate the production of the first draft and to spend less than 20 min/page on post-editing. In Slocum's experiments with METAL, still at an early stage, mnPE/page was on average 11.85 ((10.9 + 12.8)/2), meaning better than just fair. Let us also consider the case of METEO. This system, specialized in translating weather bulletins, ran for nearly 20 years at Environnement Canada, on microcomputers, giving almost perfect outputs. PE time (by expert translators from the Bureau des Traductions) was about 1 min/bulletin, or about 6 mnPE/page (a typical bulletin is 40 tokens long). That gives a PEMTquality of nearly 18/20, that is, excellent. In 2005, the RALI ran experiments to test whether that quality could be approached or improved by reconstructing this system using empirical approaches, starting from a translation memory of 40M EN→FR sentence pairs produced (by MT+PE) during the preceding years. In their article at EAMT-2005,4 they wrote:

We show how a combination of a sentence-based memory approach, a phrase-based statistical engine, and a neural-network rescorer can give results comparable to those of the current system while offering a faster development cycle and arguably better customization possibilities.

We see here undue optimism, to say the least. First, what does “comparable” mean? The only measure mentioned in the paper is “70% of acceptability.” If this means that a post-editor must start from scratch for 30% of the sentences, then PE time was more than 18 min/page (60×0.3) for that part. Adding an optimistic 6 min/page for the rest (70%), we get 24 min/page, or a PEMTquality of less than average. In reality, from a study of E. Macklovitch,5 we know that only 11% of the sampled sentences were different from the revised ones, which means that only 11% were post-edited by the senior translators, the rest being left as perfect. The authors then wrote: Third, an informal evaluation on a random sample of translations that differed from the reference showed that 77% of these bad translations were found acceptable by humans.

But, in such a constrained situation, the fact that a sentence was accepted does not mean it is acceptable! Even if its meaning is quite exact, it must fit into the sublanguage at

4 Philippe Langlais, Thomas Leplus, Simona Gandrabur, and Guy Lapalme (2005) From the Real World to Real Words: The METEO case. Proc. EAMT-2005, Budapest.
5 Elliott Macklovitch (1985) A linguistic performance evaluation of METEO-2. Technical report, Canadian Translation Bureau, Aug. 1985.


hand. That is so true that, when J. Chandioux got a contract for adapting METEO-2 (EN→FR) in view of the Atlanta Olympic Games, he had not only to introduce a lot of proper names (toponyms and patronyms) but also to modify the analyzer to take into account the American sublanguage of weather bulletins, somewhat different from the Canadian English one.

To continue on the false pretenses of quality that lead researchers and funders to ostracize proponents of expert approaches to NLP: in the case of MT, it has been widely said and written that neural MT produces outputs of the same or even higher quality than professional translators. The following excerpt demonstrates that these affirmations are quite false. Plus, this METAL output dates back to February 1984 and is clearly better than DeepL's output today, 39 years later! And no professional translator, even a very bad one, could produce (like GT) "The development of semiconductor technology, especially microprocessors, has in the opened up new prospects for EDP in recent years."

Original German text: CSE SpracheingabeGeraete Einfuehrungsschritt 1 Einleitung. Die Entwicklung der Halbleitertechnik, insbesondere der Mikroprozessoren, hat in den vergangenen Jahren neue Perspektiven fuer die EDV eroeffnet. Im Bereich der Datenerfassung wurde mit der Spracheingabe in den Computer ein langgehegter Wunsch erfuellt.

GE→EN by METAL (1984): CSE voice data entry devices Introduction 1 Introduction. The development of semiconductor technology, in particular the microprocessors has opened the new prospects for EDV in the last years. In the range of data acquisition, a long-cherished wish was filled with voice data entry into the computer.
Remarks: EDV → EDP (electronic data processing); voice data entry → correct.

GE→EN by DeepL (2023): CSE Language Input Devices Introduction Step 1 Introduction. The development of the semiconductor technology, in particular of the microprocessors, has opened new opened new perspectives for the EDP in the last years. In the field of data acquisition, a long-cherished wish has been fulfilled with the voice input into the computer.
Remarks: voice input → voice data entry; "has opened new opened new" → repetition.

GE→FR by Google (2023): Étape d'introduction des périphériques d'entrée vocale CSE 1. Introduction. Le développement de la technologie des semi-conducteurs, en particulier des microprocesseurs, a dans le ouvert de nouvelles perspectives à l'informatique ces dernières années. Dans le domaine de l'acquisition de données, un vœu de longue date a été exaucé avec l'entrée vocale dans l'ordinateur.
Remarks: "le ouvert" → nonsense; "entrée vocale" → saisie vocale; "informatique" → not in source.

A similar discussion could be given for other applications, such as POS tagging. Performance in the 1970s was about 95%, with an apparent limit at 96%, for tag sets of 100–250 tags. Empirical systems do not do better. Often, claims are made that a performance of 90% or even 85% would be very good. Not so! Already, with 96% success, one gets 10 errors per page of 250 words (about 20 sentences), which means that further processing starts with about 50% of sentences containing a POS


error. With 90% success, we get 25 errors per page, meaning that, on average, every sentence contains a POS error.

Max is quite right in wanting to distinguish CL and NLP. The goal of CL is to improve our knowledge about languages, while NLP is about producing useful applications. In a sense, research in CL is the "fundamental research" of NLP. Two claims are possible at this point in defense of expert methods. First, one can observe that, in the history of science, fundamental research has often led, sometimes much later, to unexpected practical discoveries. Second, it is also true that empirical work has often led to the appearance of new insights at the fundamental level. Rather than deny the evidence, it is better to ask oneself why a given empirical system works better than its expert-based competitors. For example, Yves Lepage proposed a new insight after working from 1998 to 2006 on "analogical MT":6,7 "96% of analogies of form are also analogies of meaning."

Proponents of expert methods should not claim that empirical methods cannot do a given application (such as MT) at some quality level without really proving it, either by demonstrating a better expert system (e.g., a full MT system, not only a part translating some very short simple sentences better than DeepL) or by proposing a new, interesting NLP application for which no empirical approach is possible for some strong reason, such as "the learning corpora would necessarily contain the high-level expert knowledge the application in view must deliver, while producing them is simply unfeasible."

An interesting example is given in the preface, the "Joe loves Lea" project, rejected twice by the Institut Universitaire de France. It shows two things: (1) our reviewers are often not very competent in linguistics, so they simply cannot understand the fundamental interest of a research topic in CL, and (2) given that state of affairs, one should present such a fundamental CL research topic through some potential application(s) in NLP. In this case, one aims at generating and recognizing sublanguages defined (in the way of Zellig Harris) inductively as a pair (B, R), where the base B is a finite set of sentences and R is a finite set of transformation rules that generate sentences containing the same main elements (theme and rheme, in Jean-Marie Zemb's statutory analysis), with possible variations, such as adding modalities and replacing names by pronouns. An interesting practical application would be, in the context of the war against organized crime, to search the web, e-mail logs, or social network logs for sentences in the sublanguage generated from seed sentences (gathered by investigative means). Another application could be to "inflate" bilingual corpora by taking as base a translation memory (a set of bi-sentences) and as rules pairs of monolingual transformational rules.
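As a rough, purely illustrative sketch of that inductive definition (B, R), the snippet below (toy code of my own, not part of any system mentioned in this foreword) closes a small base of seed sentences under two string-level rules; a real implementation would apply transformation rules to syntactic analyses rather than to raw strings:

```python
from typing import Callable

Rule = Callable[[str], str]

def generate_sublanguage(base: set[str], rules: list[Rule], depth: int = 2) -> set[str]:
    """Close the base B under the rule set R, up to a fixed derivation depth."""
    sentences, frontier = set(base), set(base)
    for _ in range(depth):
        new = {rule(s) for s in frontier for rule in rules} - sentences
        sentences |= new
        frontier = new
    return sentences

# Toy rules: add a modality, pronominalize a proper name.
rules: list[Rule] = [
    lambda s: s if s.startswith("Maybe") else "Maybe " + s,
    lambda s: s.replace("Lea", "her"),
]
print(sorted(generate_sublanguage({"Joe loves Lea."}, rules)))
# ['Joe loves Lea.', 'Joe loves her.', 'Maybe Joe loves Lea.', 'Maybe Joe loves her.']
```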

6 He worked with the classical proportional analogies on written strings, of the form "a : b :: c : d" (a is to b what c is to d), and solved analogical equations to get translations (strings with the same meaning) between two languages, possibly not using the same writing system, like Japanese, English, and French.
7 Y. Lepage, E. Denoual (2006) Purest ever example-based machine translation: Detailed presentation and assessment. Machine Translation 19, pages 251–282. See also Proc. Workshop on Example-Based Machine Translation, Phuket, 2005.


To conclude, I strongly hope and expect that the articles in this volume will contribute to the "rehabilitation of linguistics in NLP" and, in concrete terms, will help stop the ostracization described above and bring funding back to research projects in linguistics. Another contribution is to demonstrate that expert tools and resources can be developed in a very homogeneous environment (NooJ) for many, and potentially all, languages and at many levels of linguistic description, including non-continuous and context-dependent constructions.

Christian Boitet is emeritus professor at the Université Grenoble Alpes and a member of the LIG lab, after having been full professor of computer science at Université Joseph Fourier from 1977 to 2016. He is one of the authors of Ariane-G5, GETA's generator of MT systems.

Université Grenoble Alpes, Grenoble, France

Christian Boitet

Preface

Today, empirical—data-driven, neural-network-based, probabilistic, or statistical—methods are trendy, thanks partly to DeepMind's AlphaGo victories over professional Go players in 2016,8 followed by the wins of its successor AlphaZero against the strongest chess programs. Recently, OpenAI's ChatGPT,9 Google's Bard, and Microsoft's Sydney chatbots have been garnering a lot of attention for their detailed answers across many knowledge domains. Consequently, most researchers in artificial intelligence today develop systems that use empirical methods to extract solutions from massive databases used as cheat sheets,10 instead of trying to understand and formalize what common intelligence is11 or how intelligent agents construct scenarios to solve various problems.12 In the same manner, natural language processing (NLP) software that uses training corpora processed by empirical methods is being used daily: people regularly talk to their smartphone thanks to OK Google or Apple Siri, or to their home thanks to Amazon Alexa, and use machine translation applications such as Google Translate or DeepL for their personal and business needs.

8 Cf. https://en.wikipedia.org/wiki/DeepMind.
9 Cf. https://en.wikipedia.org/wiki/ChatGPT.
10 There is mounting resistance to AI's empirical approaches; see for instance: Chomsky, Noam, Ian Roberts, & Jeffrey Watumull, 2023. The False Promise of ChatGPT. http://Portside.org/2023-03-08/noam-chomsky-false-promise-chatgpt; or: Robot, Jean-Christophe, Cécile Dumas, 2022. Autopsie d'une intelligence artificielle [Autopsy of an artificial intelligence]. Prod. Look at Sciences.
11 See for example: Minsky, Marvin, 2007. The emotion machine: Commonsense thinking, artificial intelligence, and the future of the human mind. Simon and Schuster.
12 Schank, Roger C., and Robert P. Abelson, 1975. Scripts, plans, and knowledge. IJCAI, Vol. 75.


The success of these software applications has led many proponents of empirical approaches to infer that linguistic methods are henceforth obsolete; for example, according to Eric Brill:13

Automatic part of speech tagging is an area of natural language processing where statistical techniques have been more successful than rule-based methods.

As a consequence, researchers in NLP today promote the exclusive use of training corpora, to the detriment of the development of linguistic methods and resources.14 Most linguistic laboratories have abandoned the scientific goal of formalizing natural languages with handcrafted electronic dictionaries and grammars; international scientific institutions equate the notion of linguistic resource with that of training corpus; and almost all papers presented at leading conferences such as COLING ("COmputational LINGuistics") and ACL ("Association for Computational Linguistics") present systems or methods that involve no linguistic resources or methods.

The apparent competition between linguistic and empirical approaches to NLP is not new and can be traced back to Noam Chomsky's criticism that statistical properties could not define grammaticality:15

The real import of Mandelbrot's work for linguistics seems to be that it shows that rank-frequency distributions of the type that Zipf and others have found are consistent with a very wide class of plausible assumptions about linguistic structure, and consequently, that we learn practically nothing about words when we discover this rank-frequency relation. In other words, this way of looking at linguistic data is apparently not a very fruitful one.

However, today in France as well as in many countries, the situation is dire. Linguistics departments in universities looking for NLP specialists favor scientists who follow empirical approaches based on training corpora, at the expense of computational linguists who would have helped them formalize their work. TAL is the premier NLP journal in France: its gatekeepers apparently reject articles that present systems not based on the use of training corpora... for the reason that they lack roots in the corpus-training approach community.16 More generally, scientific projects that aim at formalizing natural languages by developing handcrafted dictionaries and grammars are systematically rejected as obsolete by "expert" reviewers.

13 Brill, Eric, 1992. A simple rule-based part of speech tagger. University of Pennsylvania, Department of Computer and Information Science. https://apps.dtic.mil/sti/pdfs/ADA460532.pdf.
14 Cf. https://www.youtube.com/watch?v=QIdB6M5WdkI.
15 Chomsky, Noam, 1958. Review of Vitold Belevitch, Langage des machines et langage humain, 1958. Language #34. Extract cited by Jacqueline Léon, 2017. The statistical studies of vocabulary in the 1950–1960s in France: Theoretical and institutional issues. In Quantitative Linguistics in France, ed. S. Loiseau and J. Léon, 9–28. Lüdenscheid: RAM-Verlag.
16 ". . . le principal reproche qu'on peut faire à ce texte est, paradoxalement, son peu d'ancrage dans la communauté HN. . ." [. . . the main criticism of this text is, paradoxically, its lack of roots in the Digital Humanities community . . .]. TAL's criticism rings false considering that this same article was later honored at the opening of the 16th International Conference on Statistical Analysis of Textual Data (JADT 2022), which is the main conference on Digital Humanities in Europe.


As an example: an ambitious project aimed at implementing an exhaustive transformational grammar that would allow a computer to automatically link all transformed sentences that contain one predicate, e.g., Joe loves Lea ⟺ Joe's love for Lea ⟺ It was not him who fell madly in love with Lea ⟺ Joe might not have stopped loving her, etc., was twice rejected by the Institut Universitaire de France.17 It was deemed obsolete (it has never been done), based on an outdated technology (its engine has the power of a Turing machine), and limited to a niche system (it is used in over 30 universities, cited by over 500 scientific publications, and has been downloaded over 1,000 times per year since September 2020). This ambitious project, for which a proof of concept had already been implemented,18 would have constituted the initial step of the larger project of formalizing a natural language in its entirety, which would be the linguistic equivalent of the Human Genome Project in biology.

In this volume, we are not questioning the intrinsic value of software applications based on empirical methods, in the same manner that no one contests the value of Hammond and Horn's famous 1954 statistical study, which unearthed the correlation between smoking and cancer deaths and thus helped save millions of lives.19 However, we note that it is only thanks to much later biological research, such as Yunlong Ma and Ming Li's 2017 study,20 that we have recently started to understand the mechanisms by which carcinogens damage a DNA sequence, an understanding that will help biologists design new and efficient treatments against cancer.

In conclusion, we believe that empirical- and linguistic-based NLP software do not share the same goals: whereas the former aims at providing users with applications that produce reasonably good results and hence are commercially successful, the primary goal of computational linguists is to describe natural languages formally, in a reproducible and cumulative way, and to use this formal description to perform various linguistic analyses automatically. Of course, all linguists believe that

17 IUF's two-line rejection expertise for the project's second and final submission: "Software based on outdated methodology; sticks to a small niche (his own); lexicon-grammars of Spanish/Italian less available than is said here."
18 Silberztein, Max. Joe loves Lea: Transformational Analysis of Direct Transitive Sentences. In: Automatic Processing of Natural-Language Electronic Texts with NooJ: 9th International Conference, NooJ 2015, Minsk, Belarus, June 11–13, 2015, Revised Selected Papers 9. Springer International Publishing, 2016, pp. 55–65.
19 Hammond, E. C., & Horn, D., 1954. The relationship between human smoking habits and death rates: A follow-up study of 187,766 men. Journal of the American Medical Association, 155(15), 1316–1328.
20 Ma, Y., & Li, M. D., 2017. Establishment of a strong link between smoking and cancer pathogenesis through DNA methylation analysis. Scientific Reports, 7(1), 1–13.


understanding how natural languages work will eventually lead to developing better NLP software applications. Still, they consider the "mere" project of formalizing natural languages a worthwhile scientific goal. This volume, therefore, aims at rehabilitating the linguistic approach to NLP.

Paris, France

Max Silberztein

About This Book

In the first part of the volume, Max Silberztein's contribution "The Limitations of Corpus-Based Methods in NLP" uncovers several technical limitations and theoretical flaws of using training corpora to develop NLP applications, even the simplest ones, such as automatic taggers.

The second part of the volume is dedicated to showing how carefully handcrafted linguistic resources can be successfully used to enhance NLP software applications:

– Lionel Clément's contribution "Automatic generation of texts in natural language" presents the Elvex platform, which uses precisely formalized handcrafted linguistic resources to produce high-quality generated texts in a deterministic way.
– Hela Fehri's contribution "Towards a more efficient Arabic-French translation" presents a machine translation system that can process named entities (NEs) and text segments that should be transliterated rather than translated, such as proper names. The system uses handcrafted local grammars to recognize and process these text segments. The contribution compares the translations of relative clauses automatically produced by the system with the translations produced by Google Translate and Reverso.
– Yuras Hetsevich's and Mikita Suprunchuk's contribution "Linguistic resources and methods for Belarusian natural language processing" presents a set of web services that can be used to synthesize speech from written texts in Belarusian. The contribution focuses on three services: a text-to-speech synthesizer, a transcription generator, and a word paradigm generator, which can be used together or individually, for instance in other NLP applications. These tools rely on rules in the form of regular expressions carefully handcrafted by the members of the Speech Synthesis and Recognition Laboratory of the United Institute of Informatics Problems of the National Academy of Sciences (Minsk).

The third part of the volume presents case studies where data-driven approaches cannot be implemented because there is not enough data available: low-resource languages.


– Maximiliano Duran's contribution "Formalization of the Quechua morphology" describes the linguistic resources he has developed to formalize how affixes combine inside wordforms in Quechua to perform various linguistic transformations, thus considerably augmenting Quechua's vocabulary. He also shows how these linguistic resources can be used to compute translations into French and Spanish.
– Olena Saint-Joanis' contribution "A new set of linguistic resources for Ukrainian" uncovers the mistakes produced by empirical NLP software in Ukrainian and then describes the linguistic resources she has developed to formalize the lexicon and morphology of Ukrainian.
– Andrea Fernanda Rodrigo's and Mariana González's contribution "The challenging task of translating the language of Tango" shows that sublanguages, such as the language of Tango in Argentinian Rioplatense Spanish, can be processed correctly using precisely handcrafted linguistic resources, whereas traditional MT methods that use training corpora produce many mistakes.
– Masako Watabe's contribution "A polylectal linguistic resource for Rromani" shows how current corpus-based resources for Rromani are deficient and proceeds to describe the construction of a set of linguistic resources for Rromani comprising a dictionary, an inflectional grammar, and an agglutinative grammar. One specificity of these resources is that they account for the different Rromani dialects, both at the lexical and at the morphological levels.

The fourth part of the volume addresses the problem of how to treat multiword units in NLP software.

– Max Silberztein's contribution "Using linguistic criteria to define multiword units" presents a set of linguistic criteria that can be used operationally to differentiate sequences of wordforms that must be processed as atomic units (the multiword units) from sequences of wordforms that must be analyzed.
– Peter Machonis' contribution "A linguistic approach to English phrasal verbs" presents a set of linguistic resources in the form of electronic dictionaries and grammars used to recognize and disambiguate phrasal verbs automatically.
– Prihantoro's contribution "Analysis of Indonesian multiword expressions" addresses the problem of tagging multiword expressions in Indonesian, showing that data-driven systems relying on training corpora produce inaccurate results that can be prevented by using handcrafted electronic dictionaries and grammars.

We believe that readers interested in natural language processing will appreciate the importance of this volume, both for its questioning of the training corpus-based approach and for the intrinsic value of the linguistic formalization and the underlying methodology presented here.

Max Silberztein, Editor

Contents

Part I: Introduction
  The Limitations of Corpus-Based Methods in NLP (Max Silberztein) ... 3

Part II: Developing Linguistic-Based NLP Software
  Linguistic Resources for the Automatic Generation of Texts in Natural Language: The Elvex Formalism (Lionel Clément) ... 27
  Towards a More Efficient Arabic-French Translation (Héla Fehri) ... 49
  Linguistic Resources and Methods for Belarusian Natural Language Processing (Yuras Hetsevich and Mikita Suprunchuk) ... 69

Part III: Linguistic Resources for Low-Resource Languages
  A New Set of Linguistic Resources for Ukrainian (Olena Saint-Joanis) ... 87
  Formalization of the Quechua Morphology (Maximiliano Duran) ... 109
  The Challenging Task of Translating the Language of Tango (Andrea Fernanda Rodrigo and Mariana González) ... 127
  A Polylectal Linguistic Resource for Rromani (Masako Watabe) ... 147

Part IV: Processing Multiword Units: The Linguistic Approach
  Using Linguistic Criteria to Define Multiword Units (Max Silberztein) ... 175
  A Linguistic Approach to English Phrasal Verbs (Peter A. Machonis) ... 189
  Analysis of Indonesian Multiword Expressions: Linguistic vs Data-Driven Approach (Prihantoro) ... 201

Part I: Introduction

The Limitations of Corpus-Based Methods in NLP

Max Silberztein

Abstract Nowadays, most Natural Language Processing software applications use empirical "black box" methods associated with training corpora to analyze texts written in natural languages. To analyze a sequence of text, they look for similar sequences in a corpus, select among them the most similar one according to some statistical measurement or some neural-network-based optimization state, and then bring forth its analysis as the analysis of the new sequence. Here, I first show that the limited size of the corpora used and their questionable quality explain why most NLP applications produce unreliable results. Next, I examine the principles which are at the basis of corpus-based methods and uncover their linguistic naiveté. I finally dispute the scientific validity of empirical approaches. I propose solutions to various problems that are based on the use of carefully handcrafted linguistic methods and resources.

Keywords Computational Linguistics · Natural Language Processing · Corpus Linguistics · Training corpus · Statistical methods · Neural methods · Empirical methods · Multiword units · Lexical ambiguity · Automatic disambiguation

1 Introduction

1.1 Training Corpora

Today, most Natural Language Processing (NLP) software uses probabilistic, statistical, or neural-network-based methods. These methods rely on the use of training corpora: a training corpus contains texts previously analyzed and annotated by human annotators. The most fundamental tool that all empirical NLP applications rely on is the Part-Of-Speech (POS) tagger, associated with a training corpus that contains a text in which all graphical wordforms are followed by their POS category, e.g., Noun, Verb,


Battle-tested/SingularProperName Japanese/SingularProperName industrial/Adjective managers/PluralNoun here/Adverb always/Adverb buck/Verb up/Preposition nervous/Adjective newcomers/PluralNoun with/Preposition the/Determiner tale/SingularNoun of/Preposition the/Determiner first/Adjective of/Preposition their/Possessive countrymen/PluralNoun to/TO visit/Verb Mexico/SingularProperName ,/, a/Determiner boatload/SingularNoun of/Preposition samurai/PluralNoun warriors/PluralNoun blown/PastParticiple ashore/Adverb 375/Number years/PluralNoun ago/Adverb ./. From/Preposition the/Determiner beginning/SingularNoun ,/, it/Pronoun took/VerbPreterit a/Determiner man/SingularNoun with/Preposition extraordinary/Adjective qualities/PluralNoun to/TO succeed/Verb in/Preposition Mexico/SingularProperName ,/, "/" says/VerbPresent3rdSingular Kimihide/SingularProperName Takimura/SingularProperName ,/, president/SingularNoun of/Preposition Mitsui/PluralNoun group/SingularNoun 's/Possessive Kensetsu/SingularProperName Engineering/SingularProperName Inc./SingularProperName unit/SingularNoun ./.

Fig. 1 Extract of the Penn Treebank (in this figure, I have replaced codes such as “NNP” with more readable ones such as “SingularProperName”)

Adjective, etc. To process any new text, a POS tagger uses coefficients derived from its training corpora to add a POS category after each wordform of the text. Examples of English training corpora used by taggers are the Brown Corpus,1 the Corpus of Contemporary American English (COCA),2 the Open American National Corpus (OANC),3 or the Penn Treebank.4 They contain texts in which each wordform5 has been tagged, i.e., associated with a code that represents its Part-Of-Speech (POS) category (Noun, Verb, Adjective, etc.), as seen in Fig. 1. Empirical taggers are unanimously rated as producing excellent results by the scientific community, as it is not rare to read that this or that system's recall and accuracy rates are over 95%. The "excellence" of a system that has 95% accuracy should be contested (it means one mistake every 20 words, i.e., every two sentences), but even these results deserve careful study.

1.2 Limited Tag Sets

The tag sets used in most training corpora are equivalent to very small and low-quality dictionaries.6 For example:

1 Available at: https://www1.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/private/brown/brown.html. See (Nelson & Kucera, 1979).
2 Available at: https://www.english-corpora.org/coca. See (Davies 2009).
3 Available at: https://anc.org.
4 Available at: https://paperswithcode.com/dataset/penn-treebank. See (Taylor et al. 2003).
5 By wordforms, we mean contiguous sequences of alphabetic characters delimited by non-letters (a.k.a. delimiters). For instance, "Monday", "soon" and "stepfather" are wordforms, whereas "week end", "as soon as possible" and "father-in-law" contain respectively two, four and three wordforms.
6 For instance, (Kupść and Abeillé 2008) describe the extraction of all the tags of the Université Paris 7 treebank to construct the treeLex dictionary automatically, whose size represents only a few percent of a handcrafted dictionary such as the Lexicon-Grammar.


– Tag sets typically do not distinguish between abstract, concrete, determiner, and human nouns. This is a problem, because certain verbs require human object complements (amuse, babysit, persuade, etc.), whereas certain verbs require abstract object complements (ascertain, cancel, etc.). If NLP software does not have access to this distinction, it will not be able to correctly parse sentences such as "His presentation amused the room" (where room should be interpreted as the people in the room) or "She believed him" (where him should be interpreted as what he said). Even more basic NLP applications need to distinguish at least nominal determiners from regular nouns to correctly parse sentences such as: This group of friends slept too much; she drank a full can of beer.

Linking sleep to group (rather than to friends) or drink to can (rather than to beer) would make Information Retrieval systems produce useless information.
– Tag sets do not distinguish predicative verbs (e.g., They are going to Paris) from auxiliary verbs (e.g., They are going to buy cheese in Paris).7 We will see below that incorrectly treating auxiliary verbs as meaningful units makes current software applications in the digital humanities unreliable at best.
The limited size of the tag sets explains taggers' high precision rates: when a wordform can be associated with only one tag, the tagger will never produce any mistake. For example, the wordforms my, his, the (always determiners), at, from, of, with (always prepositions), him, himself, it, me, she, them, you (always pronouns), and, or (always conjunctions), again, always, not, rather, too (always adverbs), am, be, do, have, is (always verbs), and day, life, moment, thing (always nouns) are extremely frequent and will always be tagged correctly. Moreover, most wordforms associated with more than one potential tag have a statistically preferential analysis. For example, for most of their occurrences, the wordforms age, band, card, detail, eye should be tagged as nouns rather than as the verbs to age, to band, to card, to detail, to eye, etc. Simply ignoring these rarer cases is therefore very efficient from a statistical point of view, but not reliable in general. Charniak (1997) showed that a simple program that just copies the most frequent tag associated with each wordform produces results with a precision greater than 90%. This is equivalent to using a very low-quality dictionary that contains only one usage for each of its entries. Obtaining a 95% precision rate when a simple lookup operation already produces 90% precision is not that spectacular.

7 There are approximately 200 auxiliary verbs used frequently in English, including aspectual verbs (e.g., She keeps on drinking milk), modal verbs (e.g., He needs to drink milk) and movement verbs (e.g., They ran out to buy some milk).
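To make Charniak's observation concrete, here is a rough sketch of such a most-frequent-tag baseline (a toy example of my own, assuming the training corpus is available as (wordform, tag) pairs):

from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_corpus):
    """tagged_corpus: iterable of (wordform, tag) pairs from a tagged training corpus."""
    counts = defaultdict(Counter)
    for wordform, tag in tagged_corpus:
        counts[wordform.lower()][tag] += 1
    # Keep only the single most frequent tag per wordform.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(tokens, best_tag, default="Noun"):
    return [(t, best_tag.get(t.lower(), default)) for t in tokens]

# Toy corpus: "detail" is more often a noun than a verb, so its rarer
# verbal occurrences will always be mistagged by this baseline.
toy_corpus = [("the", "Determiner"), ("detail", "Noun"),
              ("detail", "Noun"), ("detail", "Verb")]
print(tag(["They", "detail", "the", "plan"], train_most_frequent_tag(toy_corpus)))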


Finally, note that the typical 95% accuracy rate is only correct if one ignores the large number of multiword units that occur in texts. Silberztein (2018) showed that over 10% of the wordforms in corpora such as the COCA or the OANC are in fact constituents of multiword units, which makes their tags irrelevant at best. For example, there is no point in tagging "matter" and "fact" as nouns in the sequence "as a matter of fact": a machine translation system should translate this adverb as "en fait" in French, and an Information Retrieval system should not be allowed to index occurrences of "matter" in this multiword unit as a meaningful term (as in dark matter, dead matter, gray matter, organic matter).

1.3 Reliability of Training Corpora

Most training corpora contain a considerable number of mistakes. For instance, in the extract of the Penn Treebank8 shown in Fig. 1, Battle-tested and Japanese should have been tagged as adjectives rather than as proper names. Multiword units that occur in this text have been ignored, which will make any NLP software application that uses this corpus produce misleading results. For example:
– Processing industrial managers as a sequence of an adjective and a noun will trigger the same analysis as the one generally associated with the adjective industrial in industrial diamond, industrial food, industrial soap, industrial chicken, etc. However, industrial managers are not managers made or raised in a factory.
– The phrasal verb to buck up (which means to encourage) should not be analyzed word for word as to buck (which has 11 meanings listed in Wiktionary, including to copulate like a buck, to throw a rider, to leap upward, and to resist obstinately) followed by the locative preposition up.
– The head word of the noun phrase a boatload of samurai warriors is warriors, not boatload; therefore, a boatload of should be tagged as a determiner.
– The sequence "Kimihide Takimura" should not be treated as the adjunction of two different proper names: it represents one single proper name; idem for "Kensetsu Engineering Inc."
More recent training corpora also contain many mistakes.9 For example, by applying handcrafted English dictionaries to the OANC and to the COCA, Silberztein (2016) unearthed many problems:

8 See (Taylor et al. 2003).
9 See for instance (Green and Manning 2010) about tagging errors in Arabic corpora, (Dickinson and Ledbetter 2012) about errors in a Hungarian corpus, (Kulick et al. 2011) and (Volokh and Neumann 2011) about errors in treebanks, and (Dickinson 2015) about methods to detect annotation errors. See http://gate.ac.uk.


– 20% of the vocabulary extracted from these corpora is associated with incorrect tags.10 For example, the OANC contains the following mistakes:
abbreviate, abduct, abhor, abhors, etc. (tagged incorrectly as nouns)
about, agonized, bible, cactus, California, etc. (tagged incorrectly as adjectives)
expenditures, Japanese, many, initiatives, wimp, etc. (tagged incorrectly as verbs)
anomaly, back, because, by, of, out, upon, etc. (tagged incorrectly as adverbs)

– 15% of wordforms in uppercase are incorrectly tagged as proper names, e.g.: Abacuses, Abandoned, ABATEMENT, Abattoir, Abbreviated, Ablaze, Abnormal, Abolished, Abuse, Abstract, Accidental, ALMOST...

Linguistic solution Looking up a carefully handcrafted English dictionary would allow the software application to detect these incorrect tags and replace them with the correct ones.
– Typos seem to be tagged systematically as common nouns if in lowercase and as proper names if in uppercase:
absentionists, achives, accrossthe, afteryou, etc. (tagged as common nouns)
Aconfession, Afew, AffairsThe, Allpolitics, etc. (tagged as proper names)

Linguistic solution Looking up carefully handcrafted English dictionaries would allow the software application to recognize these typos as such, since most of them are not valid dictionary entries.11
– Multiword units have been systematically ignored. For example, the OANC's "Slate" sub-corpus of 4 million wordforms contains over 160,000 occurrences of ignored multiword units, including commonly used nouns (e.g., bulletin board), adjectives (e.g., born again), adverbs (e.g., about time), prepositions (e.g., for the sake of) and verbs (e.g., break even). Ignored multiword units correspond to 400,000 occurrences of wordforms, which means that 10% of the wordforms occurring in this corpus have been incorrectly tagged as linguistic units, and 4% of the linguistic units (i.e., the multiword units) have not been tagged.

10 It is easy to find these mistakes: extract all the pairs (wordform, tag) from the tagged corpus; sort them and remove duplicates; compare them to an English dictionary to extract all the pairs that are not listed in the dictionary.
11 Such as Wiktionary, the dictionary included in NooJ (Silberztein 2016), or JRC-Names for proper names (Steinberger et al. 2013).
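The checking procedure described in footnote 10 can be sketched as follows (a toy example with a hypothetical five-entry lexicon; a real check would use a full-coverage dictionary):

# Extract (wordform, tag) pairs from a tagged corpus, deduplicate them, and
# flag the pairs whose tag is not licensed by the dictionary.
tagged_corpus = [("abduct", "Noun"), ("about", "Adjective"), ("detail", "Noun"),
                 ("Japanese", "Verb"), ("the", "Determiner")]

# Toy dictionary: wordform -> set of valid POS categories (hypothetical entries).
lexicon = {"abduct": {"Verb"}, "about": {"Preposition", "Adverb"},
           "detail": {"Noun", "Verb"}, "japanese": {"Adjective", "Noun"},
           "the": {"Determiner"}}

pairs = {(w.lower(), t) for w, t in tagged_corpus}
suspicious = sorted(p for p in pairs if p[1] not in lexicon.get(p[0], {p[1]}))
print(suspicious)  # [('abduct', 'Noun'), ('about', 'Adjective'), ('japanese', 'Verb')]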


These corpora do contain a few multiword units that were tagged correctly, but not in a consistent way. For example, in the COCA, "a-capella" is correctly tagged as one unit, whereas its spelling variant "a capella" is tagged as a sequence of two linguistic units. Conversely, many wordforms that should have been processed as sequences of multiple linguistic units have been incorrectly tagged as single units, e.g., "adoption-related," "Afghan-based," "autodialed," and "barklike."
Linguistic solution Looking up a carefully handcrafted dictionary of multiword units would allow the software application to recognize them as such.12 Applying morphological grammars to analyze unlisted wordforms would allow the system to recognize and analyze agglutinations correctly.
12 For example, Wiktionary contains many multiword units. NooJ's English dictionary contains over 200,000 multiword units that cover the standard English vocabulary (Silberztein 2016). Domain-specific multiword units that occur in new texts could be recognized and stored in specialized dictionaries, which would in turn be applied to the texts to tag them.

2 NLP Software Results

As most current empirical NLP applications use training corpora at one step or another of their processing, one cannot expect these applications to provide reliable results.

2.1 Incorrect Tags

For instance, Fig. 2 displays a French text tagged by Sketch Engine, one of the most popular corpus-processing software applications, used in Europe by language teachers and linguists, as well as by many scientists in the social sciences and humanities. In this extract, "cours" [during] has been incorrectly tagged as "NCFP000/cour", i.e., as a form of the feminine noun cour [courtyard]; "rencontre," "tombe," "doute," and "pense" have been incorrectly tagged as "VMIP1S0", i.e., as verbs conjugated in the first person singular (instead of the third person); "Aimée" [a first name] has twice been incorrectly tagged as "VMP00SF", i.e., as a past participle of the verb aimer; "A" [at] has been incorrectly tagged as "VMPI3S0/avoir", i.e., as a conjugated form of the verb avoir [to have]; and "qu'" has been tagged as "CS", i.e., as a conjunction instead of a relative pronoun. That makes nine mistakes. Moreover, the adverbs "d'ailleurs" [by the way], "dès le premier regard" [at first glance], and "de plus en plus" [more and more], as well as the preposition "au cours de" [during], have been ignored; that makes 4 missing tags, plus 13 irrelevant tags. There are therefore 26 mistakes in total out of 58 wordforms, which corresponds to a 55% accuracy rate, far from the 95% accuracy rate universally considered as typical for taggers.


Fig. 2 French text tagged with Sketch Engine

As we saw earlier, this 55% accuracy rate is not even a spectacular result, considering that among the 33 correctly tagged wordforms, 18 can receive only one possible tag: à (4 occurrences), c', d' (4 occurrences), de, et, il (4 occurrences), lui, pour, un.
The presence of so many erroneous tags has consequences: researchers who use Sketch Engine to study introspection or dialogue in a novel might be led to think that it contains a very high frequency of verbs conjugated in the first person singular, which is exceptional in novels. Researchers who study an author's style or syntactic signature might be led to believe incorrectly that a text contains very few adverbs. Linguists who study the usage of locative nouns (such as courtyard) might get many incorrect occurrences, etc. Most statistical data produced by Sketch Engine, even when displayed in beautiful graphs, are probably misleading.
Linguistic solution Looking up a French dictionary of multiword units would have allowed the software to tag "au cours de" as a preposition and "d'ailleurs," "dès le premier regard" and "de plus en plus" as adverbs, while avoiding incorrectly tagging their wordform constituents. Simple grammar rules, such as:
il ⟶ ["il" is followed by verbs conjugated in the third person singular only]

would have allowed the software to correctly tag the four verbal forms “rencontre,” “tombe,” “doute,” and “pense” in the third person singular.
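As a toy illustration of such a rule (my own sketch, not the actual machinery of Sketch Engine or NooJ), one can filter the candidate analyses of the wordform that follows the subject pronoun "il":

# Each wordform is associated with its set of candidate analyses.
candidates = [("il", {"Pronoun+3sg"}),
              ("rencontre", {"Verb+Pres+1sg", "Verb+Pres+3sg", "Noun+fem+sg"})]

def apply_il_rule(tokens):
    """After the subject pronoun 'il', keep only third-person-singular verb readings."""
    result = []
    for i, (form, analyses) in enumerate(tokens):
        if i > 0 and tokens[i - 1][0].lower() == "il":
            kept = {a for a in analyses if not a.startswith("Verb") or "3sg" in a}
            analyses = kept or analyses  # never remove every analysis
        result.append((form, analyses))
    return result

print(apply_il_rule(candidates))
# "rencontre" keeps {"Verb+Pres+3sg", "Noun+fem+sg"}; the 1sg reading is discarded.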


2.2 Concepts, Themes, and Terms

In France, many researchers in linguistics, literature, and political studies use statistical text analyzers such as Hyperbase,13 IRaMuTeQ,14 Lexico15 or TXM16 to help them perform various discourse analyses and find interesting concepts or themes, or correlations between concepts and themes. Because these software applications do not have access to linguistic resources that would allow them to process terms and their variants, they are reduced to processing graphical wordforms. As an example, if one looks for the theme "amour" [love] in Zola's series of novels "Les Rougon-Macquart", one gets the typical display shown in Fig. 3.

Fig. 3 Looking for terms associated with the term “amour” [love]

13 Cf. (Brunet 2010).
14 Cf. (Loubère et al. 2014).
15 Cf. (Lamalle et al. 2002).
16 Cf. (Heiden 2010).


The software brings up 8 related wordforms such as "amour", "amours", "amouracher" [to develop a crush], "amoureux" [lover] and "amoureusement" [lovingly], but it also proposes 11 unrelated wordforms, such as "amphigourique" [amphigoric], "amphithéâtre" [amphitheater], "ample" [ample], "ampleur" [magnitude], "amplifier" [to amplify], and "amuser" [to amuse]. At the same time, none of the conjugated forms of the verb "aimer" [to love] is offered to the user. In consequence, users who do not check the list of terms generated by the software before performing statistical measurements, or who do not have the patience to manually enter the 40+ correct conjugated and derived forms of amour and aimer, will not get a reliable statistical analysis of the theme amour in their corpus.
Linguistic solution Give the software application access to a dictionary in which entries are associated with their inflectional and derivational paradigms,17 such as:
aimer,VERB+FLX=AIDER+DRV=AMOUR+DRV=EUX+DRV=EUSEMENT
where property "FLX" refers to formalized conjugation paradigms and "DRV" to derivation paradigms. This would have brought forward all the inflected forms of the words amour, amoureux, amoureusement and aimer, while avoiding unrelated wordforms such as amphithéâtre.
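To make the idea concrete, here is a toy sketch (my own notation, not NooJ's actual FLX/DRV machinery) of expanding such an entry into the forms that a search for the theme amour should match:

# Toy inflection/derivation paradigms; real paradigms are far richer.
FLX_AIMER = ["er", "e", "es", "ons", "ez", "ent", "ait", "é"]  # a few conjugation suffixes
DRV = {"AMOUR": ["amour", "amours"],
       "EUX": ["amoureux", "amoureuse", "amoureuses"],
       "EUSEMENT": ["amoureusement"]}

entry = {"lemma": "aimer", "stem": "aim", "FLX": FLX_AIMER, "DRV": ["AMOUR", "EUX", "EUSEMENT"]}

forms = {entry["stem"] + suffix for suffix in entry["FLX"]}
for paradigm in entry["DRV"]:
    forms.update(DRV[paradigm])
print(sorted(forms))  # inflected and derived forms of aimer/amour, no unrelated wordforms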

2.3 Word Clouds

Software applications used to perform discourse analyses often display word clouds to help users detect central themes in their corpus. Figure 4 shows a typical word cloud. This cloud shows that the term "abriter" [to shelter] is somehow central to the corpus. But this verb has five meanings:
abriter #1: Joe protects Eva with an umbrella.
abriter #2: Eva shelters sick children.
abriter #3: This building houses the ministry of defense.
abriter #4: Joe hides behind pretexts so as not to help them.
abriter #5: The dikes shield the harbor from waves.
It is not useful to display abriter as a central term in a corpus without clarifying which meaning its occurrences correspond to, or at least without checking that its occurrences all belong to the same meaning. For example, if the goal is to compute the opinion rating of a political figure, meanings #1 and #2 could be considered "positive", whereas meaning #4 is "negative". If the software does not distinguish between these different meanings, how can one draw a positive or negative opinion on this political figure?

17 Such as NooJ's dictionary, described in (Silberztein 2016).


Fig. 4 A word cloud for the homograph “abriter”

Linguistic solution Disambiguating among the different meanings of the verb abriter involves applying a dictionary that precisely describes the syntactic and semantic properties of each meaning and associates it with a syntactic grammar that precisely describes the corresponding syntactic and semantic contexts.18

2.4 Semantic Networks

Software applications used to perform discourse analyses typically compute "semantic networks" to help users find important themes and relations between themes in a text. For example, most corpus analyzers in the digital humanities will produce a typical "semantic" network such as the one in Fig. 5. Notice how the terms displayed as most "important" (larger, in the center) are in fact auxiliary (aspectual, modal or support) verbs, i.e., grammatical words that carry no meaning: aller, arrêter, continuer, mettre, prendre, venir.19

18 See (Dubois and Dubois-Charlier 1997) for a presentation of the LVF dictionary, and (Silberztein 2014) for a presentation of a system capable of retrieving specific meanings of this verb.
19 The verbs aller and venir are French auxiliary verbs, e.g., Il va manger [He is going to eat] and Elle vient de manger [She has just eaten]. The verbs arrêter and continuer are aspectual verbs, e.g., Il a arrêté de boire [He stopped drinking] and Elle continue de fumer [She continues to smoke]. The verbs mettre and prendre are support verbs, e.g., Il met la table [He sets the table] and Elle prend une douche [She is taking a shower].


Fig. 5 A semantic network

Linguistic solution A simple lookup of dictionaries such as the lexicon-grammar tables of verbs20 or the LVF dictionary21 would have allowed the software to exclude these verbs from the list of semantically relevant terms.
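That filtering step can be pictured as follows (a toy stop list of six verbs; the actual lexicon-grammar tables and the LVF list several hundred):

# Toy list of French auxiliary, aspectual, and support verbs to exclude from
# "semantic" term statistics; a real list would come from the LG tables or the LVF.
GRAMMATICAL_VERBS = {"aller", "arrêter", "continuer", "mettre", "prendre", "venir"}

term_frequencies = {"aller": 120, "prendre": 95, "pluie": 40, "parapluie": 12}
content_terms = {t: f for t, f in term_frequencies.items() if t not in GRAMMATICAL_VERBS}
print(content_terms)  # {'pluie': 40, 'parapluie': 12}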

3 Principles at the Basis of Empirical Methods

As we have seen, empirical methods use annotated training corpora as "references" to perform analyses on texts.

20 Cf. (Gross 1968). LG1 contains 100 auxiliary, aspectual, and modal verbs; LG2 contains 200 movement verbs used as auxiliary verbs, e.g., Joe ran out to buy some milk.
21 Les Verbes Français, see (François et al. 2007).
22 (Brill 1998).
23 (Schmid 1994).
24 (Bontcheva et al. 2003).
25 (Toutanova et al. 2003).


To analyze a sequence of text, NLP systems such as Brill's tagger,22 TreeTagger,23 GATE,24 and the Stanford POS tagger25 look for the most similar sequence according to some probabilistic or statistical measurement, or to some optimized neural-network state, and then bring forth the analysis associated with the most similar sequence found in the training corpus as the resulting analysis for the sequence to analyze. There is no intelligence in this process: it is just a matter of comparing sequences of words and then copying some result, without any linguistic understanding.26
But teachers know from experience that students who use cheat sheets or copy from each other during an exam without understanding the topic get mediocre grades on average. Even then, for copying to be a somewhat successful strategy, there are three important requirements:
1. Copiers must copy from good students or reliable cheat sheets.
2. Copiers must copy from students who have the same test questions as they do.
3. Copiers must be able to recognize the relevant pieces of information they copy.
We have already shown that (1) training corpora contain many mistakes: an analyzer that uses faulty corpora as a cheat sheet cannot possibly produce perfect results. We now show that (2) training corpora cannot possibly be similar enough to the texts to be analyzed, and that (3) the units that these NLP systems process (i.e., look for, evaluate, copy) are not the right ones.

3.1 Flawed Principle: The Notion of Similar Contexts

For an empirical method to correctly analyze a sentence, it must find in the training corpus some sentences similar enough to it, which in practice is seldom the case.27 Consider the following trivial sentence:
Our colleagues have offered a beautiful bouquet to the head of the institute.

To compute an estimate of the potential number of such sentences in English, one can use the fact that the English vocabulary contains on the order of 10³ determiners, 10⁴ human nouns, 10² auxiliary verbs, 10² dative verbs, 10⁴ adjectives, 10⁴ concrete nouns, 10² prepositions, and 10⁴ organization/collective nouns.28 According to this estimate, there are:

26 When analyzing the meaning of a sentence, there are multiple levels of understanding. At the purely linguistic level, understanding a sentence means being able to perform transformational operations on it, for example, linking a complex sentence such as "Joe might not have fallen in love with Lea" with the elementary predicate "Joe loves Lea"; see (Silberztein 2015).
27 The largest available training corpora, such as the texts produced by the European Parliament or by the Canadian government, contain mostly legal texts that are not similar to the typical texts MT users need to translate.
28 These orders of magnitude are based on the following evaluations: there are 24,000 adjectives and 17,000 simple human nouns listed in NooJ's dictionary, 13,000 compound human nouns listed in the DELAC dictionary, 300 auxiliary verbs listed in the Lexicon-Grammar tables LG1 and LG2, and 300 dative verbs listed in Table 36DT.


10³ × 10⁴ × 10² × 10² × 10³ × 10⁴ × 10⁴ × 10² × 10³ × 10⁴ × 10² × 10³ × 10⁴ = 10³⁸ potential sequences made of the same pattern.

The largest current training (gigaword) corpora29 contain only on the order of 10⁸ sentences. Even if every single sentence in these corpora had exactly the same length and structure as the sentence above, the probability of finding this sentence in the corpus would be 10⁻³⁰, i.e., it is practically impossible. That is why empirical algorithms do not rely on exact matches between sentences: they operate at the POS level. At this level, the sentence above is represented as a sequence of POS categories (Determiner, Noun, Verb, Verb, Determiner, Adjective, Noun, Preposition, Determiner, Noun, Preposition, Determiner, Noun).30

Neither the OANC nor the COCA contains even one sentence with this structure.31 And we have not found any occurrence of this sequence in available samples of other gigaword corpora either.32 Because it is not possible to match a text to a corpus at the sentence level, empirical analyzers process smaller contexts, typically sequences of a few words and POS tags. For instance, to disambiguate the wordform "place" (noun or verb) in the sentence:
We shall always place education side by side with instruction. . .

they look for the most "similar" left and/or right context of the wordform place in the training corpus, e.g.:
. . . we shall always place/Verb a cone inside the work area. . .
and then copy the analysis they found in this context, in that case: Verb. But this definition of context overlooks the basic linguistic principle that sentences are structured sequences of linguistic units, rather than linear sequences of POS categories; in fact, at the superficial level, any POS category can be followed by any POS category. For example, the wordform place (either a noun or a verbal form) can occur with any of the following right contexts:33

29 See for instance (Oravecz et al. 2014), (Derczynski et al. 2021), and (Hong and Huang 2006).
30 A more precise pattern would use the finer-grained categories listed above (human noun, auxiliary verb, dative verb, concrete noun, organization noun), but training corpora and their tag sets do not have this level of precision.
31 115 text files w_acad*, w_fic*, w_mag, w_news* and w_spok*. The closest sentence we found is in file w_news2005: "No one has had a better view of the capriciousness of the NBA", where the subject is the pronoun "no one" instead of a Determiner + Noun sequence, and the second and third nouns are abstract nouns, rather than concrete and human nouns.
32 As of June 2023.
33 All these examples were found with Google Search.


Verb + Adjective: Swedish women place nice significance on the value of schooling.
Noun + Adjective: It is a place nice to visit.
Verb + Adverb: They place very high demand on their suppliers.
Noun + Adverb: . . . that have snapped into place suddenly over the past month.
Verb + Determiner: you can use the system to place that order.
Noun + Determiner: I stumbled across this gem of a place the weekend before.
Verb + Conjunction: make money no matter what they place but only first place is.
Noun + Conjunction: LED walls have their place but there's still room for projectors.
Verb + Noun: It is allowed to place chickens in a barn with cattle.
Noun + Noun: These wedding place card ideas are sure to get ideas flowing.
Verb + Preposition: . . . debris which local drivers often place behind their cars.
Noun + Preposition: Everyone must visit this place in their life.
Verb + Pronoun: People in our country place those with specialized skills.
Noun + Pronoun: we could sit on another empty table near the place they showed us.

The same is true with its left context:
Adjective + Noun: It occupies a huge place in the lives of people.
Adjective + Verb: Players feeling lucky place their bet. . .
Adverb + Noun: I started to notice how quickly places that I found interesting. . .
Adverb + Verb: They quickly place the cross on the wall.
Determiner + Noun: . . . cover the bet and the place and win are how you make ...
Determiner + Verb: The place bet payoffs are lower than a win bet. . .
Conjunction + Noun: The time and place that gave me life. . .
Conjunction + Verb: How to choose and place art.
Noun + Noun: Her birth place was Basel.
Noun + Verb: Authors place their books under the spotlight.
Preposition + Noun: The Places API lets you search for place information.
Preposition + Verb: A purchase manager has decided to place an order.
Pronoun + Noun: When you give them place they will take yours.
Pronoun + Verb: They place him in a tomb of stone.

If the wordform "place" can occur before or after a wordform of any POS category, then looking in a corpus for the POS category next to it to disambiguate it cannot be reliable: as the corpus cannot be large enough to contain all potential contexts for each wordform, the software will not be aware that some tag sequences are valid (e.g., Pronoun + Noun in When you give them place).


And even if a particular corpus contained all possible contexts, the software could only pick the "most probable" tag sequence according to its frequency in this particular corpus (which is to say, not in all potential texts); i.e., results will be "optimized", but not guaranteed to be 100% correct.
Note finally that even if the very sentence that the software application wishes to process occurred in the training corpus, copying its analysis would not produce reliable results, as most sentences are, in fact, ambiguous. For example, in one given training corpus, the following sentence:
There is a round table in room A32

might be about a table with a round shape (a piece of furniture), which results in the tagged sequence "round/Adjective table/Noun", whereas the text being analyzed might be about a meeting, which should produce the tag "round table/Noun" for the multiword unit.34 In the general case, lexical ambiguity cannot be resolved reliably without performing a syntactic and semantic analysis of the sentence.
Linguistic solution Do not try to disambiguate wordforms before (1) retrieving all their potential analyses by looking up dictionaries, including dictionaries of multiword units, and (2) performing syntactic and semantic analyses to eliminate the impossible analyses.
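A toy sketch of this two-step strategy (loosely inspired by the idea of a text annotation structure, not an actual implementation): first store every candidate analysis of the ambiguous span, then let a later syntactic or semantic constraint discard the incompatible reading:

# Candidate analyses for the span "round table" in "There is a round table in room A32".
candidates = [
    [("round", "Adjective"), ("table", "Noun")],  # a piece of furniture
    [("round table", "Noun")],                    # the multiword unit: a meeting
]

def prune(candidates, about_a_meeting):
    """Hypothetical later-stage rule: keep only the reading compatible with the context."""
    wanted_length = 1 if about_a_meeting else 2
    return [reading for reading in candidates if len(reading) == wanted_length]

print(prune(candidates, about_a_meeting=False))  # keeps the Adjective + Noun reading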

3.2 Flawed Principle: The Units of Processing

Nowadays, NLP software applications process texts as if they were linear sequences of graphical wordforms, on the assumption that these wordforms represent the basic pieces of information that need to be processed and are therefore useful to their users: concordances display wordforms in context; search engines compile indices of wordforms to answer users' queries (themselves processed as sequences of wordforms); semantic networks typically display networks of wordforms and wordform clouds; statistical analyses are used by these applications to detect interestingly high or low frequencies of wordforms, retrieve frequent collocations of wordforms, or classify sub-corpora into clusters based on differences in wordform frequencies; taggers associate each wordform of a text with a tag supposed to represent its syntactic or semantic properties; etc.
However, wordforms rarely correspond to the units of meaning (concepts, entities, predicates, and relations) that are the real components of sentences.

34 Over 100,000 multiword units are structurally ambiguous, like round table. Multiword units are extremely frequent in texts, as Silberztein (2016) showed.


Fig. 6 An extract of the Wiktionary English dictionary

For example, it makes little sense for an NLP application to process the wordform "plant" (tag it, index it, compute its frequency, explore its context, translate it, etc.), because it corresponds to over a dozen homographs that have very different syntactic and semantic properties. One can find 17 meanings in Wiktionary for this noun: meaning #1 refers to a botanical organism (e.g., marijuana plant), meaning #5 to an industrial facility (e.g., nuclear plant), meaning #7 to a person (an undercover agent), meaning #16 to a technical term in the domain of control theory, etc. (Fig. 6). Any software application that does not distinguish between these meanings will produce meaningless results. For example:

– A search engine that computes only one index for all occurrences of the wordform "plant" will aggregate these different meanings, producing a set of documents that have nothing in common.
– An automatic alert system that notifies its users that the frequency of the wordform plant was abnormally high in social media during a certain week, without specifying whether its occurrences were related to marijuana plants, to electric plants, or to undercover operations, or even whether its high frequency is an artifact due to adding together occurrences of different meanings that occur in articles about unrelated events, will produce false alerts.
– No Machine Translation system should even try to translate the wordform "plant" without first distinguishing between its meanings. For example, plant #1 should be translated in French by "plante", whereas plant #5 should be translated by "centrale" and plant #7 by "infiltré" or "taupe".


Rather than being constituted by wordforms, sentences are in fact constituted by unambiguous elements of the vocabulary of the language: the Atomic Linguistic Units (ALUs).35 ALUs should not be analyzed: speakers have learned them by heart, second-language learners often cannot produce them correctly, computer software cannot infer all their syntactic or semantic properties from the properties of their constituents, and machine-translation software should access their translation in a multilingual dictionary rather than try to compute their translation word for word.
Morphemes Many wordforms represent agglutinated sequences of multiple ALUs and therefore should be associated with sequences of multiple tags. For instance, the wordform audienceless means "without an audience", and fishlike means "similar to a fish".
Multiword units Conversely, no NLP software application should ignore multiword units. For instance, an MT system should translate "tank top" into the single French noun débardeur. Multiword units should therefore be tagged as atomic units and their wordform constituents ignored. The standard English vocabulary contains over 100,000 multiword units.36
Discontinuous expressions A large part of the vocabulary of a language is constituted by discontinuous sequences of wordforms. In English, there are tens of thousands of idiomatic expressions (e.g., take the bull by the horns, buy the farm, take for granted), idiosyncratic associations between predicate nouns and their support verb (e.g., have a baby, give a hand, take a shower), and phrasal verbs (e.g., fill out, give up, turn off). Most of the properties of these ALUs cannot be computed from their constituents; for example, Joe took a shower does not usually mean that he dismounted the shower from the bathroom wall and took it with him, and nothing is "turning" when someone turns the light off. Tagging the constituents of these ALUs as if they were independent units therefore misleads any NLP application that involves translating texts, or even looking for basic information in texts.

35 See (Silberztein 2016). In the NooJ framework, all ALUs are listed in dictionaries and their morphological, syntactic, and semantic properties must be described explicitly. ALUs are classified into four formal categories: simple words (e.g., "table"), morphemes (e.g., "dis_", "_ization"), multiword units (e.g., "as a matter of fact", "round table") and discontinuous expressions (e.g., "take . . . into account", "turn . . . on"). Instead of using a tag system, NooJ annotates every potential ALU in a text with its linguistic properties, including morphemes (inside wordforms) and discontinuous expressions, and stores all annotations in the Text Annotation Structure (TAS). When syntactic and semantic grammars are applied to the text, impossible annotations are automatically removed from the TAS.
36 See (Silberztein 2015).


Linguistic solution Rather than processing sentences as if they were linear sequences of graphical wordforms, NLP applications should first detect all the potential vocabulary units that constitute each sentence, then store and represent all potential analyses in a structure that will be accessed by subsequent syntactic and semantic analyzers, which will use contextual information to remove the invalid ones.37
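For instance, a greedy longest-match lookup against a multiword-unit dictionary (sketched below with a hypothetical three-entry dictionary; real dictionaries contain over 100,000 such units) already groups sequences such as "as a matter of fact" into single units before any tagging takes place:

MWU_DICTIONARY = {("as", "a", "matter", "of", "fact"): "Adverb",
                  ("tank", "top"): "Noun",
                  ("buck", "up"): "Verb"}
MAX_LENGTH = max(len(key) for key in MWU_DICTIONARY)

def detect_units(tokens):
    units, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LENGTH, len(tokens) - i), 1, -1):  # longest match first
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in MWU_DICTIONARY:
                units.append((" ".join(tokens[i:i + n]), MWU_DICTIONARY[key]))
                i += n
                break
        else:
            units.append((tokens[i], None))  # single wordform; POS left to later analysis
            i += 1
    return units

print(detect_units("as a matter of fact she bought a tank top".split()))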

4 The Scientific Approach

Beyond the flawed methodology, empirical approaches pose several theoretical problems: how can one evaluate the scientific value of a training corpus? What is the scientific value of results obtained by a black-box approach (even if they are correct)?

4.1 The Real Value of a Training Corpus

When they are in the process of tagging wordforms in a corpus, human taggers encounter three cases.
1. Some graphical wordforms must always be associated with one single tag, such as "the", which is always a determiner. For these wordforms, a 1-billion-word or even a 100-billion-word corpus brings no more value to the system than a single command such as:
sed "s:\bthe\b:the/DETERMINER:g" corpus.txt

or a single lexical entry such as: the,DETERMINER

2. Wordforms that have more than one potential analysis need to be disambiguated, either manually or automatically. It would be extremely costly to ask a person to disambiguate all the occurrences of these wordforms one by one: even if this person could disambiguate one occurrence per second, disambiguating a billion wordforms would take over 13 years, without breaks or vacations. And the result would certainly contain a significant number of mistakes, because no human being can be expected to work continuously 8 hours a day and make no mistakes.

37 NooJ uses a Text Annotation Structure (TAS) to represent all linguistic units of a text. Typically, lexical and morphological analyses add annotations to the TAS, whereas syntactic and semantic analyses both remove lexical annotations (i.e., disambiguate words) and add syntactic and semantic annotations.


3. In practice, therefore, human taggers use tools to help them tag multiple occurrences of ambiguous wordforms in batches. These tools recognize certain context patterns and propose the corresponding tags as a result: these (pattern/result) pairs function exactly like grammar rules. It is the set of these rules that constitutes the real value of these corpora, not the corpora themselves.
Unfortunately, in practice, these disambiguation rules are automatically generated and therefore fundamentally incorrect. For example, none of the ten disambiguation rules (called "patches") computed automatically by the tagger presented by Brill (1998) is correct:
– It is not true that if a wordform in uppercase is followed by a verbal form that can be tagged either as a preterit or as a past participle, then the verbal form should be tagged as a preterit. Here is a counterexample: . . . A used tool . . . ("used" is not a preterit form).

– It is not true that if a wordform close to the wordform “had” can be tagged as a past participle form, then it should be tagged as a past participle form. Here is a counterexample: . . .She had to come. . . (“come” is not a past participle form)

– Etc.
Explicitly giving the list of rules used to disambiguate a training corpus, rather than the corpus itself, would be more useful to the scientific community, as these rules could be examined, corrected, and maintained, while at the same time being applied to any text, and therefore be checked, falsified, and refined at will. In essence, this is what the linguistic approach proposes.
Most of the rules automatically generated by current empirical analyzers have no scientific value, even if they could produce perfectly tagged texts inside a given finite-sized corpus: they are similar to stating that in all of Shakespeare's plays, all nouns starting with an "a" are followed on the next page by a verbal form in the gerund. Such a remark would have no generality, would teach us nothing about English literature or the English language, and would quite simply have no scientific value.
Proponents of the use of training corpora might argue that the errors mentioned here will be avoided when larger and larger training corpora become available. But, because the rules automatically generated by the taggers are fundamentally incorrect, adding new texts to a given corpus will inevitably invalidate these rules. Correcting these rules to take the new texts into account will, in turn, break the initial system by inserting mistakes into the initial corpus. For example, if, in the initial corpus, all the occurrences of the French wordform "la" were correctly associated with their Pronoun or Determiner POS categories, then adding a text in which "la" is tagged as a noun (the musical note) will modify the patch responsible for the correct tagging of the wordform "la" in the initial text, and therefore risks introducing new mistakes into the initial corpus.


Accumulating more and more texts does not guarantee that the accuracy of the training corpus will improve.38
Linguistic solution It is not the size of a training corpus that represents its value: it is the granularity of its tag set, the size of its vocabulary, and the robustness of its disambiguation rules. Dictionaries carefully handcrafted by language experts have a much better level of granularity and exhaustiveness than training corpora. Grammars carefully handcrafted by language experts are much more robust than automatically generated disambiguation rules based on local contexts found in corpora of limited size.

5 Conclusion

Using training-corpus-based "black box" methods that produce satisfactory results, but without any power of explanation or generalization, is a viable approach to non-critical NLP software engineering problems. However, it is time for all scientists, both linguists and computer scientists, to realize that this approach is the opposite of what scientific approaches require: a reliable accumulation of knowledge.
Even if we stay at the engineering level and only discuss the problem of producing NLP software that yields reliable results, the existing NLP software applications that rely on training-corpus approaches have too many flaws. The solutions to enhance their reliability are quite simple.
– Give them access to handcrafted dictionaries so that they can recognize and process all types of linguistic units correctly (including affixes, simple words, multiword units, and discontinuous expressions), to avoid mistakes such as tagging many as a verb. Good electronic English dictionaries are already available. The initial dictionaries might contain unsuitable (or even erroneous) lexical entries for a specific NLP application, but they will be easy to correct. If these dictionaries contain gaps, the missing units will be flagged as unknowns when parsing texts, so it will be easy to add them to the dictionaries: their quality will improve over time, and the linguistic knowledge they contain will accumulate.
– Give them access to handcrafted grammars that can be applied to texts deterministically and produce robust analysis results. If a text contains word sequences that are not analyzed correctly by the grammar, then correct or enhance the grammar, and make sure it does not produce new mistakes by re-applying it to

38 The situation is similar in neural-network-based AI systems. For instance, new updates of Tesla's full-self-driving beta software often bring problems that were solved in previous versions (e.g., phantom braking in version 11.3.6).


previously analyzed texts. Both recall and precision will be enhanced, and the linguistic knowledge contained in the grammars will accumulate over time. Standard software engineering techniques, such as version control, repository sharing and unit testing, can be used to help enhance both dictionaries and grammars in an accumulative way.

References

Bontcheva, Kalina, Diana Maynard, Valentin Tablan, and Hamish Cunningham, 2003. GATE: A Unicode-based infrastructure supporting multilingual information extraction.
Brunet, E., 2010. HYPERBASE: Manuel de référence. ⟨hal-01362721⟩.
Brill, Eric, and Jun Wu, 1998. Classifier combination for improved lexical disambiguation. In ACL 36/COLING 17, pages 191–195.
Charniak, Eugene, 1997. Statistical techniques for natural language parsing. AI Magazine, vol. 18, no. 4, p. 33.
Dickinson, Markus, and Scott Ledbetter, 2012. Annotating Errors in a Hungarian Learner Corpus. LREC 2012.
Dickinson, Markus, 2015. Detection of Annotation Errors in Corpora. Language & Linguistics Compass, vol. 9, issue 3. Wiley Online Library, https://doi.org/10.1111/lnc3.12129.
Davies, Mark, 2009. The 385+ million-word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights. International Journal of Corpus Linguistics 14.2: 159–190.
Derczynski, Leon, et al., 2021. The Danish Gigaword Corpus. Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa).
Dubois, Jean, and Françoise Dubois-Charlier, 1997. Les Verbes français. Larousse: Paris.
Francis, W. Nelson, and Henry Kucera, 1979. Brown Corpus Manual: Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English for Use with Digital Computers. http://icame.uib.no/brown/bcm.html
François, Jacques, Denis Le Pesant, and Danielle Leeman, 2007. Présentation de la classification des Verbes français de Jean Dubois et Françoise Dubois-Charlier. Langue française 1: 3–19.
Green, Spence, and Christopher D. Manning, 2010. Better Arabic parsing: Baselines, evaluations, and analysis. Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics.
Gross, Maurice, 1968. Grammaire transformationnelle du français: syntaxe du verbe.
Heiden, S., J.-P. Magué, and B. Pincemin, 2010. TXM: Une plateforme logicielle open-source pour la textométrie – conception et développement. 10th International Conference on the Statistical Analysis of Textual Data – JADT 2010, Jun 2010, Rome, Italie, pp. 1021–1032. ⟨halshs-00549779⟩.
Hong, Jia-Fei, and Chu-Ren Huang, 2006. Using Chinese Gigaword Corpus and Chinese Word Sketch in linguistic research. Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation.
Lamalle, C., W. Martinez, S. Fleury, A. Salem, B. Fracchiolla, A. Kuncova, and A. Maisondieu, 2002. Lexico 3, Outils de statistique textuelle. Université de la Sorbonne Nouvelle.
Loubère, Lucie, and Pierre Ratinaud, 2014. Documentation IRaMuTeQ 0.6 alpha 3 version 0.1. http://www.iramuteq.org/documentation/fichiers/documentation_19_02_2014.pdf.
Kupść, Anna, and Anne Abeillé, 2008. Growing TreeLex. In International Conference on Intelligent Text Processing and Computational Linguistics, pp. 28–39. Springer, Berlin, Heidelberg.
Oravecz, Csaba, Tamás Váradi, and Bálint Sass, 2014. The Hungarian Gigaword Corpus: 1719–1723.


Schmid, Helmut, 1994. TreeTagger – a language independent part-of-speech tagger. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger
Kulick, Seth, Ann Bies, and Justin Mott, 2011. Further Developments in Treebank Error Detection Using Derivation Trees. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 693–698, Portland, Oregon, USA. Association for Computational Linguistics.
Silberztein, Max, 2014. The DEM and LVF dictionaries in NooJ. Formalising Natural Languages with NooJ.
Silberztein, Max, 2015. Joe Loves Lea: Transformational Analysis of Direct Transitive Sentences. In Automatic Processing of Natural-Language Electronic Texts with NooJ. 9th International Conference, NooJ 2015, Minsk, Belarus, June 11–13, 2015, Revised Selected Papers. Springer International Publishing.
Silberztein, Max, 2016. Formalizing Natural Languages: The NooJ Approach. Wiley: Hoboken, NJ.
Silberztein, Max, 2018. Using linguistic resources to evaluate the quality of annotated corpora. In Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing, pp. 2–11.
Steinberger, Ralf, Bruno Pouliquen, Mijail Kabadjov, and Erik Van der Goot, 2013. JRC-Names: A freely available, highly multilingual named entity resource. arXiv preprint arXiv:1309.6162.
Taylor, Ann, Mitchell Marcus, and Beatrice Santorini, 2003. The Penn Treebank: an overview. Treebanks: 5–22.
Toutanova, Kristina, Dan Klein, Christopher D. Manning, and Yoram Singer, 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 252–259.
Volokh, Alexander, and Günter Neumann, 2011. Automatic detection and correction of errors in dependency treebanks. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers, vol. 2, pages 346–350.

Part II

Developing Linguistic-Based NLP Software

Linguistic Resources for the Automatic Generation of Texts in Natural Language: The Elvex Formalism

Lionel Clément

Abstract I present a Natural Language Generation system based on symbolic methods and linguistic knowledge. We call both the formalism and the program that implements it Elvex. My goal in developing such a system is not to try to build an application as talkative and relevant as pre-trained Transformer chatbots such as ChatGPT, launched by OpenAI in 2022, or, more recently, the BLOOM project. Instead, my aim is to follow a hypothetico-deductive approach and implement a text generation system that supports a linguistic study.

Keywords Elvex · Natural Language Generation · Linguistic Formalization

1 Introduction

In the recent literature, most studies have focused on using neural methods for NLG, with very good results.1 However, the approach presented here is quite different, as it relies on linguistic knowledge rather than on neural networks trained from texts.
The objective of this study is to use a hypothetico-deductive approach to test a text generation system. The system generates sentences whose grammaticality is then assessed by native speakers, in order to evaluate the grammatical proficiency of the system. This approach involves formulating hypotheses, conducting experiments, and using the findings to refine and enhance our understanding of the abilities of the system. With this approach, the generated text adheres to the language rules and matches the input exactly. In contrast, a neural approach to natural language generation often generates text that complements, or can merely be inferred from, the input. Our approach may correspond to an industrial need where it is necessary to

1 See (Brown et al. 2020; Floridi and Chiriatti 2020; Thomson and Reiter 2021; Teven et al. 2022).

L. Clément (✉)
Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, Talence, France
e-mail: [email protected]


ensure that the texts produced contain precisely all the information to be communicated and nothing more; for example, this could apply to a forecasting or mailing service. Comparing the two approaches in more detail is challenging because they differ greatly, and it is difficult to evaluate them on anything other than the quality of the texts they produce; a content comparison is not really feasible because of the differences in their underlying methodologies.
Several linguistic realization systems based on linguistic models have been used. The input of these systems is an abstract syntactic structure, the result of a "document planner". KPML2 and SURGE (Systemic Unification Realization Grammar of English)3 are based on Systemic Functional Grammar,4 while RealPro5 is based on Meaning-Text Theory.6 EasyText7 is an NLG system based on the G-TAG formalism.8 G-TAG follows the standard architecture and uses the Tree-Adjoining Grammar model for the sentence planner module. More recently, Danlos et al. (2014) take advantage of the formal properties and algorithms of Abstract Categorial Grammar, using the G-TAG formalism with more expressive linguistic descriptions. Although my approach shares similarities with these approaches, it differs in that we use a non-compositional formal model that enables us to modify a word or phrase based on its context, like Lexical-Functional Grammar.9
Moreover, a typical rule-based NLG system architecture is modular and consists of a series of modules, including a "document planner" that handles content determination and document structuring tasks, as well as a "tactical component" that determines words and syntactic structures and ultimately maps an abstract representation to the surface realization.10 In contrast, our system considers all levels of linguistic analysis simultaneously, allowing them to be articulated together. For example, the structuring of a text into discursive units can be done concurrently with the analysis of nouns. If a noun refers to something that has already been mentioned in previous paragraphs, it can either be replaced by an anaphoric reference, or it can be marked by a definite determiner to indicate that it refers to a specific, previously mentioned element.
The Elvex specifications have been formally described as a domain-specific language that defines a grammar and a lexicon. An algorithm and software have been developed that allow us to produce texts. We carried out several experiments in French to examine different linguistic phenomena: control and complementation,

2 See (Bateman 1997).
3 See (Elhadad and Robin 1998).
4 See (Halliday 1985).
5 See (Lavoie and Rambow 1997).
6 See (Mel'čuk 2006).
7 See (Danlos et al. 2011).
8 See (Danlos 1992).
9 See (Bresnan 2001).
10 See (Reiter and Dale 2000).


tense agreement between sentences, participle agreement, phraseological expressions, causative clauses with tense agreement, and anaphora resolution. Additionally, Elvex makes it easy to create language exercises such as randomly gapped phrases and conjugation paradigms.
In the first part of this presentation, I present the formal aspects of our approach. From there, I explore how to use Elvex to write the lexicon and grammatical rules, without delving into all the technical details. Before concluding with the results, I provide examples that demonstrate how we can take into account both the language rules and the speaker's choices to create texts that accurately reflect the given input.

2 A Hypothetico-deductive Approach of NLG

A hypothetico-deductive approach to the text generation process could involve creating a system that generates texts based on a specific syntactic theory, and then evaluating the generated texts to see whether they confirm the theory.
When it comes to creating texts according to the rules of syntax, morphology, and orthographic realization, a high-quality encoder-decoder system like ChatGPT seems to have a good knowledge of the language and makes very few mistakes. Such a system is pre-trained on a vast amount of text and excels at identifying morphological and syntactic patterns, as well as linguistic regularities (such as idiomatic expressions, collocations, agreement, morphology, and lexicon). The grammatical knowledge is inferred from the high redundancy of the linguistic phenomena contained in these very voluminous texts. But an encoder-decoder text generator does not make it possible to extract such specific knowledge in order to describe it or to infer grammatical rules. For example, we know that the canonical order of English qualitative adjectives depends on their category, which includes opinion, size, age, shape, color, origin, and material. This explains why the noun phrase "The beautiful large Italian building" is grammatically correct, while the sequence *The Italian large beautiful building is not. An encoder-decoder text generator can generate grammatically correct text sequences, including correct adjective order. However, a future challenge for these systems will be to explain how they determine the correct order of the adjectives.
The empirical model of language is based solely on raw text data, or on tokenized or tagged texts. This means that no higher-level language model is used in the training of the empirical model, which has significant implications for its accuracy and limitations. On the contrary, my approach assumes a formal system that explicitly describes the language model used to generate the texts. Furthermore, in my approach, the theory is considered falsified if the data contradicts the hypotheses. In other words, if the theory produces irrelevant texts or texts that contain ungrammatical sentences, the grammar is incorrect. Conversely, it can be argued that a coherent and complete


linguistic knowledge will always generate grammatically correct sentences and relevant texts.
A perfect text generation system based on the hypothetico-deductive theory would have ideal accuracy, with a precision of 1, meaning it would never give false positive results. This is obviously not possible for an empirical model. However, the hypothetico-deductive model is based on a specific grammar, and its recall is limited by the completeness of this grammar in the target domain; this means that the larger the domain, the lower the recall. As a result, symbolic methods are highly effective for generating texts in small domains such as weather or market forecast generators. However, they often produce disappointing results when applied to unrestricted domains such as chatbots for customer service.

2.1 A Deterministic Model

The Elvex representation of the input conveys the intended meaning of the input data. However, it is impossible to anticipate the system's linguistic choices, which depend on the grammar and lexicon and are not solely determined by the input. This means that the input only contains predicate structures and a communication intention, rather than specific words, phrases, or sequences of text.
The generation process takes as input a communication intention that operates at multiple linguistic levels, including rhetorical and syntactic choices. For instance, the text generation intention can convey several speech acts, such as informing, persuading, entertaining, or even deceiving the reader. As in Segmented Discourse Representation Theory (SDRT),11 we can employ a hierarchical structure of rhetorical relations between discourse segments. For example, the causality meaning indicating that a1 is the cause of a2 will be written CAUSE(a1, a2), where a1 and a2 are discourse segments; it may correspond to the verb "to cause", to the nouns "causation", "result", "effect", etc., or to a light verb such as "to drive" or "to make" ("to drive one crazy", "to make someone sick"). It may also correspond to a paratactic construction of two clauses ("He forgot his cellphone in his vehicle, he can't be reached") or to a coordination construction ("The students worked hard, and they passed all the exams"). At another level, the generation process can evoke a specific object, person, or situation by focusing the text on it.
The syntactic predicates can often be easily identified as words, such as nouns, verbs, adjectives, and adverbs, in the output text. However, these syntactic predicate structures can also correspond to more complex linguistic elements, such as phraseological clauses or structural layouts.
My approach relies on a symbolic method because we aim to leverage the knowledge gained from linguistic research to develop a framework for natural

11 See (Asher and Lascarides 2003).


language generation. One advantage of this approach is that it enables us to define a deterministic model: a symbolic approach enables us to create a text that consistently corresponds to the input provided. By definition, a symbolic method creates, from a given input, a text that adheres to the language model defined by the grammar, in generative terms. This approach is highly flexible, as we can describe the language model that precisely corresponds to a specific application, such as spoken or slang language for a video game, or formal legal language for a mailing application. However, as mentioned above, while the precision can be absolute with such a model, the recall may be low. Therefore, the deterministic model we propose should be presented as an alternative to a classical data-driven generator, in order to improve recall and to ensure adequacy between the result and the given input.
For example, suppose that the meaning of the text is that the fact that it is raining has caused Marc to take an umbrella, and that the communication intent is to assert this situation in a vulgar register. An input in Elvex may be this:
HEAD: CAUSE(a1, a2), a1:[HEAD:RAIN], a2:[HEAD:UMBRELLA(a3), a3:[HEAD:MARC]], LANGUAGE_REGISTER: SLANG
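Purely as an illustration (this is neither Elvex syntax nor its implementation), such an input can be pictured as a nested structure paired with a register-sensitive lexicon for the RAIN predicate:

# Hypothetical rendering of the Elvex input above as a nested Python structure.
elvex_input = {
    "HEAD": ("CAUSE", "a1", "a2"),
    "a1": {"HEAD": ("RAIN",)},
    "a2": {"HEAD": ("UMBRELLA", "a3"), "a3": {"HEAD": ("MARC",)}},
    "LANGUAGE_REGISTER": "SLANG",
}

# Toy register-sensitive lexical choices for the RAIN predicate.
RAIN_REALIZATIONS = {"SLANG": "it is raining cats and dogs out there",
                     "NEUTRAL": "it is raining"}

register = elvex_input["LANGUAGE_REGISTER"]
print(RAIN_REALIZATIONS.get(register, RAIN_REALIZATIONS["NEUTRAL"]))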

Similar to functional unification grammar (FUG),12 the formalism uses a feature structure as a single framework to represent information of a heterogeneous nature. Here we use the feature structure to represent four predicates: CAUSE(a1, a2), RAIN, UMBRELLA(a3), and MARC. The predicate CAUSE(a1, a2) is the "head" of the text: it is the main focus, because the goal of communication is to convey its meaning within the context of the discourse. CAUSE is a predicate with two arguments and could correspond to the verb "to cause", to a paratactic construction, or to a coordinated sentence. RAIN is a predicate with zero arguments and corresponds to the verb "to rain", which requires an impersonal subject in an English utterance ("it rains", "it is raining"). UMBRELLA(a3) is a predicate with one argument. Although it does not correspond to any verb in English (*Someone umbrellas to protect themselves from the rain), it can be constructed with the light verb "to take" to create a verbal utterance (Someone takes an umbrella to protect themselves from the rain).

To express this situation in a more informal way, we will make several choices:

1. We will use the idiomatic phrase "to rain cats and dogs" as the lexical entry for the RAIN predicate.
2. We will use the construction "...and that's why..." to actualize the CAUSE predicate.

12 See (Kay 1984).


3. We will use the interjection "Yo" as a marker to situate the discourse towards the speaker at a slang language level.
4. We will use the light verb construction "someone takes an umbrella" to actualize the UMBRELLA predicate.
5. We will use the present progressive tense to describe the ongoing, atelic situation. Since the aspect of the predicates is not given, we can assume a neutral aspect.

The output corresponding to these choices is:

(1) Yo, it is raining cats and dogs out there and that's why Marc is taking an umbrella.

Nevertheless, it is important to ensure that all relevant information is included in the input, even if some of it may be redundant, obvious, or inferable from the context or co-text. After certain choices have been made in the input, the remaining choices are related to the language being used and must meet the requirements of that language. These choices might include selecting the appropriate grammatical structures, choosing the right vocabulary, and following the conventions of the language in terms of syntax and word order. In other words, we focus on microplanning and surface realization ("How to say"), following the terminology of (Reiter and Dale 2000), rather than macroplanning ("What to say").

No choice corresponds to a meaning that can be inferred from the input data. Even if we trivially evoke modus ponens (⊢A, ⊢(A→B)) in the input, the system should not infer ⊢B, which is, however, the logical conclusion. Indeed, the formalism does not allow processing the inputs, but only translating them into texts. On the other hand, the inferences that can be made from the linguistic elements themselves must be considered. For example, the concessive construction "P but Q" leads to the conclusion R, which is not always linguistically marked in the discourse, but which is implied by P. The argument Q, on the other hand, leads to the opposite conclusion, non-R, since it too triggers a whole network of ideas which, because of the presence of "but", enters into opposition with P.13

(2) a. It was pouring rain, but Marc went out without opening his umbrella.
b. While it was pouring rain, Marc went out without opening his umbrella.
c. ? It was pouring rain, but Marc found a coin on the sidewalk and picked it up.
d. While it was pouring rain, Marc found a coin on the sidewalk and picked it up.

To use the word "but" appropriately, as in 2-a and 2-c, it is important to consider that, when it is raining, people typically take an umbrella to protect themselves when going out, but they do not usually pick up coins.

13 See (Ducrot 1972).


Sentence 2-a may be used to contradict the interlocutor's expectation that Marc would take his umbrella when going out, while sentence 2-b simply states that Marc did not take his umbrella in the same situation. The difference between the inputs of 2-a and 2-b relates to the fact that, in 2-a, the interlocutor's unspoken expectations are contradicted. This also explains why sentence 2-c, marked with a question mark, is difficult to interpret.

This extensive knowledge that enriches the lexicon is primarily composed of lexical definitions from a language. However, it is important to note that this linguistic knowledge does not constitute pragmatic knowledge or knowledge derived from an ontological analysis of the domain. In other words, it does not provide knowledge about the underlying concepts and entities the language refers to. This explains why 2-c is difficult to interpret in a general setting, where one does not usually find coins when it is raining. However, the sentence is not grammatically incorrect, and it is possible to generate it with a natural language generation (NLG) system when this particular situation is expected. Our observation is that sentence 2-c can be considered acceptable in cases where the lexicon does not include information about this pragmatic aspect.

This is very different from the approach that consists in using a chatbot to provide an answer to a question, to query the world, or to complete a text from the information it contains. In practice, these are essentially the limits of the model: everything that is said is contained within the manually constructed lexicon, without the use of a semantic network or ontology. The output text precisely corresponds to the input, without any additional information. It is challenging to represent the complete reality of a particular domain within a lexicon. The model is therefore suited to very restricted domains and will not cover a wide range of topics.

2.2 A Declarative and Constraint-Based Approach

As in earlier NLG works, we selected a linguistic model close to a functional theory of grammar, which sees the functionality of language as the way to produce a text that corresponds to a meaning and intent in a given communication situation. In this approach, the produced text corresponds to choices made by the speaker to achieve a communication goal. These choices are categorized as rhetorical (e.g., asking a question to get someone to do something), enunciative (e.g., negation, asking a question, giving an order), syntactic (e.g., using a phrasal verb to express modality, using a complement to complete a valency, morphological agreement), and lexical (e.g., using a complex or simple lexical entry that corresponds to a meaning).

The syntactic organization of the text consists of a declarative description of functional dependencies. For example, a verbal predicate may be composed of arguments that are syntactic functions (i.e., the dependency between a verb and its subject and


complements). A theme may be related to a rheme in a rhetorical structure, and the argument structure (Agent, Patient) may correspond to a syntactic distribution for a given diathesis (Subject, Object), etc. This first level of organization is described in both the grammar and the lexicon as declarative constraints, without considering word order and phrasal structure; it corresponds to relations in absentia, in Saussure's terminology.14 The second level of organization is the syntagmatic description (in præsentia), which corresponds to the phrasal structure of word order and morphological constructions.

To create a formalism for the two levels, we opted for an approach similar to the LFG model.15 LFG is a non-transformational generative grammar formalism which distinguishes between two types of syntactic structures: constituent structure (c-structure) and functional structure (f-structure). Unlike LFG, which uses the f-structure as a universal description for syntax and semantics, we use more general feature structures (or Kay's functional descriptions) to describe several linguistic levels. Furthermore, our approach does not rely on other LFG results from morphology (optimality theory) or semantics (glue semantics).16

Each production rule in the c-structure describes a constituency relation. This relation defines a phrase and its constituency, that is, the hierarchical arrangement of its constituent elements and the word order. It may also be used to define the arrangement of a text at several levels of linguistic analysis:

• Syntactic and morphosyntactic level.
• Macrosyntax, prosodic units, illocutionary units.17,18
• Rhetorical structuration (Rhetorical Structure Theory).19

At the syntactic level, the feature structure represents the grammatical information associated with a sentence or phrase, such as its tense, aspect, voice, and case marking. Additionally, it represents the syntactic relationships between words in a sentence, such as subject and object, in relation to the local predicate. The feature structure is typically used to represent the micro-syntactic organization of a sentence: valency, government (rection), and agreement are represented in the structure. The feature structure is also used to represent the macro-syntactic and rhetorical dependencies between words, phrases, and sentences. As in the LFG formalism, the c-structure and the f-structure are connected through a mapping process called functional mapping and described by equations.

14 See (Saussure 1916).
15 See (Bresnan 2001).
16 See (Dalrymple et al. 1995).
17 See (Berrendonner 1990).
18 See (Blanche-Benveniste et al. 1990).
19 See (Mann and Thompson 1988).


Technically, the grammar is an attribute grammar20 whose attributes are feature structures. In other words, the mapping process is a way to narrow the language L(G) = {ω ∈ Σ* | S ⟹* ω} by adding semantic rules to each production of the grammar. To each symbol X ∈ V of the grammar, we associate a pair (H(X), S(X)) of inherited and synthesized feature structures. Both feature structures have a (possibly infinite) set of possible values, from which one value will be selected (by means of the semantic rules) for each appearance of X in a derivation tree.

The grammar is composed of a set of production rules that are used to create c-structures, along with a set of semantic rules that are used to create feature structures. A set of rules is projected onto the lexicon, which is made up of terminal symbols that are associated with a set of semantic rules; these semantic rules represent all the lexical properties of the lexical entries. Both the construction of c-structures and the mapping process are performed using a declarative formalism that relies on constraints within the grammar and the lexicon.

2.3 A Monotonic Model

The model we want for a formal representation of linguistic constraints is monotonic, in the sense that no rule can erase the effects of an existing set of rules. The lexicon is constructed in the same way: adding an entry to the lexicon only adds possible outputs, without erasing or modifying the other lexical entries. In this way, a linguistic description made with our formalism is easy to extend, by adding more complex rules to an existing grammar core. It is worth noting that there are no non-monotonic logical inference strategies in the system; if they are required, they must be explicitly described in the lexicon or in the grammar.

3 Writing a Grammar with Elvex

Creating a grammar involves writing both a lexicon and grammatical rules, which include linguistic constraints that are specific to the domain. These rules are formulated using formal components, namely feature structures and c-structures, which we define below. To create an Elvex grammar, I developed a domain-specific language. Although I will not go into technical details, I will use a simplified notation in what follows.

20 See (Knuth 1968).


Fig. 1 Equation on feature structures

3.1 Feature Structure

To bring together different levels of linguistic analysis into a unified structure, we draw on the functional descriptions first introduced by Martin Kay in his seminal work.21 These descriptions have since been incorporated into various unification-based linguistic formalisms, such as LFG,22 HPSG,23 and TAG.24 Although the feature structures in Elvex are very similar and the formalisms can certainly be compared to one another, Elvex takes a different approach to unification by allowing linked and free variables to be defined within feature structures. In Elvex, we use the "$" sign to represent free or linked variables and the "_" sign for anonymous free variables. Linked variables in the same feature structure are the way to represent conflicting paths. A "unifier" is a mapping between variables and values that is used to replace variables with values in order to solve equations; in other words, the purpose of unification is to substitute variables with their corresponding values.

An equation on feature structures may be, for example, the one shown in Fig. 1. The equation involves two feature structures, shown on either side of the "=" sign. A solution to this equation is a unifier that replaces all variables: $TENSE is replaced with the constant PRESENT, $REST is replaced with [NEG +], which is the complement of the feature structure, and $I is replaced with the constant 564.

The "head" of the Elvex feature structure input for a sentence may be the main predicate of the sentence. It can also serve as the head of an "illocutionary unit", which refers to the communicative intention behind the sentence.
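To illustrate the role of the unifier, here is a minimal Python sketch of unification over feature structures with "$" variables and the anonymous "_" variable. It is a simplified illustration of the idea, not the Elvex implementation: in particular, it does not handle the $REST-style variable that captures the remainder of a feature structure.

```python
def is_var(x):
    # Variables are strings starting with "$"; "_" is the anonymous variable.
    return isinstance(x, str) and (x.startswith("$") or x == "_")

def resolve(x, subst):
    # Follow variable bindings until a value or an unbound variable is reached.
    while is_var(x) and x in subst:
        x = subst[x]
    return x

def unify(a, b, subst=None):
    """Return a substitution (unifier) that makes a and b equal, or None."""
    subst = dict(subst or {})
    a, b = resolve(a, subst), resolve(b, subst)
    if a == b:
        return subst
    if is_var(a):
        if a != "_":
            subst[a] = b
        return subst
    if is_var(b):
        if b != "_":
            subst[b] = a
        return subst
    if isinstance(a, dict) and isinstance(b, dict):
        # Unify feature structures attribute by attribute; attributes present
        # in only one of the two structures are simply kept.
        for attr in set(a) & set(b):
            subst = unify(a[attr], b[attr], subst)
            if subst is None:
                return None
        return subst
    return None  # constant clash, e.g. PRESENT vs PAST

# Toy equation: [TENSE:$TENSE, NEG:+] = [TENSE:PRESENT, NEG:+]
print(unify({"TENSE": "$TENSE", "NEG": "+"},
            {"TENSE": "PRESENT", "NEG": "+"}))
# -> {'$TENSE': 'PRESENT'}
```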

21 See (Kay 1984).
22 See (Bresnan 2001).
23 See (Pollard and Sag 1985).
24 See (Joshi and Schabes 1997).


Fig. 2 Feature structure for the sentence: Sales prices further decreased. We have seen the result: turnover declined by 50%

A sentence like "We have seen the result" does not always correspond to the predicate "to see". Instead, it may be understood as a rhetorical function through which the speaker uses their own words to assert a subjective clause, as shown in (3-a) to (3-d).

(3) a. Sales prices further decreased. We have seen the result: turnover declined by 50%.
b. Turnover declined by 50% because sales prices further decreased.
c. We have seen the result: investment banking, as we knew it, has disappeared before our eyes, and all banking institutions have come under severe pressure. (http://www.ecb.europa.eu/, consulted on 17 April 2020)
d. Without something to test our faith, we have grown complacent... and we have seen the result with the two who tried to kill you. (Crusade 1999, J. Michael Straczynski, TV series)

In (3-a), the speaker makes a personal and subjective assertion that the cause of the decline in turnover is the decrease in sales prices. This distinguishes it from sentence (3-b), where there is no such subjective assertion for the same clause. We use feature structures at different levels, from a description of the meaning and illocutionary act down to the syntactic level. Figure 2 illustrates and summarizes such a feature structure, which provides the rhetorical function: the head called "SPEAKER_ASSERTS_A_RESULT" does not correspond to a semantic or syntactic predicate, but to an illocutionary act from which the phrase "We have seen the result" will be added to the final clause in our example (Fig. 2).

At each level, I use the term "HEAD". At the syntactic level, this term refers to a predicate, which represents the relationship between a linguistic sign and grammatical functions, as "PRED" does in LFG or "ARG-ST" in HPSG. At other levels, it refers to the argument structure between a head and its peripherals (as "CONTENT" does for the semantic representation in HPSG). This relationship may be expressed using a word or an idiomatic phrase. However, as we have observed, it can also be conveyed


through syntactic constructions, such as word order or the use of a specific morpheme.

To summarize, the feature structure includes the predicate structure with grammatical functions such as Subject, Object, Oblique Object, Verbal complement, and Sentence complement. It also involves the argument structure that identifies thematic roles such as Agent, Patient, Theme, and Recipient. Additionally, there is the rhetorical structure that specifies illocutionary functions, including Focus, Topic, Theme, Rheme, and Predicate. Lastly, there is the macro-syntactic content, which articulates a central core and surrounding peripherals with prosodic marks. It is important to note that, at this point, neither the word order nor the phrase categories have been described yet. What is still needed, therefore, is a way to represent the structure that determines the order of words and how sentences are formed.

3.2 Constituent-Structure (C-Structure)

The Elvex formalism describes c-structure constructions with context-free production rules. In Elvex, a c-structure is a "shared forest", that is, a set of trees that can also be viewed as an utterance of a context-free grammar. Elvex allows for the definition of empty rules, optional symbols, and alternative symbols. Additionally, circular rules can be expressed as α→α, which applies a syntactic rule without modifying its c-structure; instead, it modifies the associated feature structure.

Similar to LFG theory, Elvex distinguishes between the feature structure and the c-structure, and also defines a mapping function to ensure their relationship. However, in LFG each node of the c-structure must be directly associated with one or more f-structures, and conversely each part of an f-structure must be associated with a c-structure node.25 The Elvex c-structure has been extended to build empty phrases, i.e., phrases that are not directly mapped from a part of a feature structure. It is thus possible to have a feature structure in Elvex that does not correspond to any phonological element, just as parts of the produced text may not be directly described in the feature structure. For example, the layout of a document can be described in the c-structure depending on the document type itself (formal letter, newspaper article, email, etc.), and not on the input represented in the feature structure.

Conversely, a part of a feature structure may not correspond to specific words or phrases, but to syntactic constraints without any associated lexical projection. For example, a feature structure may provide information that shows a relatively high correlation between two predicates, without explicitly indicating a causality or

25 See (Bresnan 2001).


opposition (as seen in 4-b and 4-c). However, it is important to note that this correlation may not correspond to a specific lexical entry. Instead, it could be related to the sequencing of tenses and the use of anaphora, as demonstrated in 4-a.

(4) a. We should continue to study hunter-gatherer societies without having received any funding for it.
b. We should continue our study on hunter-gatherer societies, even though we have not received any funding for it.
c. Despite not receiving any funding for our study, we are continuing our research on hunter-gatherer societies.

3.3 Syntactic Rules

In Elvex, a production rule has the form

A → B1 B2 ⋯ Bn
(1) Γ(H(A))
(2) H(Bi) = ϕ(H(A), H(Bk), S(Bk)), with k ≠ i
(3) S(A) = ψ(H(A), H(Bi), S(Bi))

where H(X) and S(X) denote the inherited and synthesized feature structures of a symbol X, as introduced above. (1) is the guard of the rule, which may be empty; (2) is the operational semantics; (3) is the synthesized semantics of the rule. This rule defines both the c-structure and the mapping process that connects the c-structure with feature structures.

1. The guard is a condition on feature structures on which the rule depends. In simpler terms, it depends on the context in which the rule must be used. This context is defined by the inherited feature structure H(A).
2. Throughout the production process, the operational semantics is repeatedly used to compute the feature structures and identify the deep dependencies among the Bi, relative to both their own feature structures and the inherited feature structure, contingent on the rule context. To ensure that the process terminates and a fixed point is reached, the set of operational semantic rules must be acyclic: there cannot exist a sequence (Φ1, Φ2, ⋯) such that both x = Φi(y) and y = Φj(x).
3. The synthesized semantics of the rule defines the feature structure of A from the realization of the Bi. In simpler terms, it defines the result that the rule passes up to the rules that use it.

• H(A) (resp. H(Bi)) is the inherited feature structure of A (resp. Bi). An inherited feature structure is calculated top-down over the c-structure.
• S(A) (resp. S(Bi)) is the synthesized feature structure of A (resp. Bi). A synthesized feature structure is calculated bottom-up over the c-structure.


• Γ(H(A)) is a logical predicate that must evaluate to true (a guard, in computational terms).
• The functions ϕ and ψ operate on n-tuples of feature structures and produce a resulting feature structure.

In unification grammar formalisms (FUG, GPSG, LFG, HPSG, TAG), it is usual to define the equations between feature structures with the unification operation, which provides a unique solution to each univocal relationship. In Elvex, we use two different feature structures, H(Bi) and S(Bi), for each embedded node in the syntagmatic description. This is because the syntagmatic description is no longer a tree in Elvex, but rather a graph (specifically, a shared forest). As a result, H(Bi) may not denote the same feature structure as S(Bi), because the ith c-structure node may not be unique and may carry incompatible feature structures. To illustrate this point, consider the following rules:

A → B    Γ1(H(A))    H(B) = ϕ1    S(A) = ψ1
B → C    Γ2(H(B))    H(C) = ϕ2    S(B) = ψ2
B → D    Γ3(H(B))    H(D) = ϕ3    S(B) = ψ3

This system of equations may not have a single solution satisfying both ϕ1 = ψ2 and ϕ1 = ψ3. This means that "C" and "D" may represent different words or phrases with the same meaning as "B", but they cannot be combined into a single term. As a result, we separate the inherited feature structures and the synthesized feature structures into two different sets. Following this grammar, the c-structure for a set of paraphrases is the shared forest A[1, 2]→B[1, 2], B[1, 2]→C[1, 2], B[1, 2]→D[1, 2], where the synthesized feature structure of D[1, 2] is ψ3, while that of C[1, 2] is ψ2.
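The following Python sketch illustrates, in a highly simplified form, how such a pair of inherited and synthesized feature structures can be evaluated: inherited structures are passed top-down (here computed from already-realized siblings), and synthesized structures are computed bottom-up by each production's semantic rule. It illustrates the general mechanism only, not the Elvex engine, and it works on a single tree rather than a shared forest.

```python
class Node:
    def __init__(self, symbol, children=(), inherit=None, synthesize=None):
        self.symbol = symbol
        self.children = list(children)
        # Computes H(node) from the parent's inherited structure and the
        # synthesized structures of already-evaluated siblings (rule (2)).
        self.inherit = inherit or (lambda parent_h, siblings_s: dict(parent_h))
        # Computes S(node) from H(node) and the children's synthesized
        # structures (rule (3)).
        self.synthesize = synthesize or (lambda h, children_s: dict(h))

def evaluate(node, inherited):
    """Return S(node), the synthesized feature structure of the node."""
    children_s = []
    for child in node.children:
        h_child = child.inherit(inherited, list(children_s))
        children_s.append(evaluate(child, h_child))
    return node.synthesize(inherited, children_s)

# Toy grammar S -> NP VP: the VP inherits the agreement synthesized by the NP.
np = Node("NP", synthesize=lambda h, cs: {"number": "sg", "person": 3})
vp = Node("VP",
          inherit=lambda parent_h, siblings_s: {"agr": siblings_s[0]},
          synthesize=lambda h, cs: {"verb_agr": h["agr"]})
s = Node("S", [np, vp],
         synthesize=lambda h, cs: {"subject": cs[0], "predicate": cs[1]})

print(evaluate(s, {}))
# -> {'subject': {'number': 'sg', 'person': 3},
#     'predicate': {'verb_agr': {'number': 'sg', 'person': 3}}}
```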

3.4 The Lexicon

The Elvex lexicon is organized in two parts:

The Morphological Lexicon Defined by Extension
The data consists of tuples (Form, Category, Lemma, Feature structure). The "Form" is a text string representing a morphological analysis; the details of how it was derived are not included in the data and are supposed to be calculated from an


Fig. 3 The pattern lexicon entry for the phrase: tomber dans les pommes

intensional lexicon that is not explicitly referenced here.26 The “Category” refers to the part of speech or grammatical category, such as Verb, Noun, Adjective, etc. The “Lemma” is the name of a class that includes multiple inflected forms of the same lemma. Finally, the “Feature structure” is a structure consisting of either simple or complex features that contain only constants and linked variables as values.

The Pattern Lexicon
This data is a set of tuples (Lexeme, Category, Lemma, Feature structure), where "Lexeme" is an identifier for a lexical entry; it may correspond to a set of lemmas and phrases. The complete Elvex lexicon is created by merging the two lexicons on entries that share the same category and lemma, and then unifying their feature structures. This representation gives a very detailed lexicon and allows for the description of idiomatic expressions and phrases. For instance, in French, "fainting" can be expressed as "falling/passing out" or "going out cold", and this is commonly done through the phrase "tomber dans les pommes" (literally, "falling in the apples") (Fig. 3).

A lexical function defined by a collocation27 can be added to a lexical entry simply by creating a new entry. In French, the adjective "gros" (big) is used to magnify a noun such as "ennui" (trouble). The lexical entry for "ennui" will include the features [mod:GROS, lexical_function:magn], where GROS is the lexeme for "magnify" in this lexical entry. It should be noted that the lexeme is used for the co-occurrence, and not the lemma, because a lexicon entry may be composed of different lemmas of the same lexeme. For example, in French slang, the expression "prendre la tête" means to bore someone and literally translates to "to take the head". This phraseological expression

26 See (Sagot et al. 2006).
27 See (Polguère and Mel'čuk 2003).


Fig. 4 The simplified Elvex lexical entry for the verb to suggest

can be productively constructed using any slang expression for "tête" (such as "prendre le chou" [to take the cabbage], "prendre la calebasse" [to take the gourd], "prendre la caboche" [to take the noggin], etc.). In the French slang lexicon, the lexeme "TÊTE" corresponds to all of these slang expressions. Furthermore, a phraseological expression is not always an identifiable sequence of words; it can also be a constructed expression, as in this example from spoken French:

(5) Dans les pommes, qu’il est tombé, dans les pommes, je te dis ! [In the apples], [that] [he fell], [in the apples], [I tell you]! He really fainted, I’m telling you, he fainted!
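As a rough illustration of how the two lexicons described above can be combined, the Python sketch below joins morphological entries and pattern entries on their shared (Category, Lemma) pair and merges their feature structures when they are compatible. The concrete data layout, the feature names, and the FAINT lexeme are assumptions made for the example; this is not the actual Elvex format.

```python
morphological_lexicon = [
    # (Form, Category, Lemma, feature structure)
    ("tombe", "V", "tomber", {"tense": "present", "person": 3, "number": "sg"}),
    ("tombé", "V", "tomber", {"mode": "participle"}),
]
pattern_lexicon = [
    # (Lexeme, Category, Lemma, feature structure) -- "tomber dans les pommes"
    ("FAINT", "V", "tomber",
     {"pobjC": {"HEAD": "POMME", "number": "pl", "pcas": "dans"}}),
]

def compatible(fs1, fs2):
    # Two flat feature structures are compatible if their shared attributes agree.
    return all(fs1[k] == fs2[k] for k in fs1.keys() & fs2.keys())

full_lexicon = []
for form, cat_m, lemma_m, fs_m in morphological_lexicon:
    for lexeme, cat_p, lemma_p, fs_p in pattern_lexicon:
        if (cat_m, lemma_m) == (cat_p, lemma_p) and compatible(fs_m, fs_p):
            # Merge the two descriptions into a single, richer entry.
            full_lexicon.append((form, lexeme, cat_m, lemma_m, {**fs_m, **fs_p}))

for entry in full_lexicon:
    print(entry)
```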

3.5 The Language Rules

Before defining the Elvex grammar that translates an input feature structure into a c-structure and then into texts, let us say a few words about the constraints that the language applies to sentences by propagating agreement, valency properties, and all kinds of lexical and syntactic constraints. Both the lexicon and the grammar provide a set of constraints defined by the declarative and constraint-based model that we have presented. Let us provide a simple example. It is well known that the verb of a main clause conditions the modality of a subordinate clause, as in the following examples from a formal text:

(6) a. The company suggests that the lawyer pursue the lawsuit.
b. *The company suggests that the lawyer pursues the lawsuit.
c. The company suggests that the lawyer is pursuing the lawsuit.
d. The company knows that the lawyer is pursuing the lawsuit.

In 6-a, the present subjunctive is used after "suggest", as after certain other verbs. The subjunctive is used to express a suggestion, desire, or hypothetical situation. However, this expression is not conveyed by the speaker but by the language itself. In fact, the equivalent sentence in the simple present indicative is not grammatically correct, as demonstrated in 6-b (Figs. 4 and 5).


Fig. 5 The rule that will collapse this lexical constraint from the lexicon

1. This rule acts as a guard by ensuring that the inherited feature structure always includes the attribute scomp. By applying this constraint, the local environment is updated to include both $scomp and $Rest.
2. This rule defines several variables, including $scompSynt, $mainClauseVtense, $mainClauseMode, and $mainClauseAuxVtense. $scompSynt is derived from the lexicon projection of the main verb, while the remaining variables are determined from the realization of the verb phrase (VP). Both of these components are represented by synthesized feature structures.
3. This rule defines the inherited feature structure of the verb phrase (VP), specifically by assigning a value to the $Rest variable. This value represents all of the features contained in the inherited feature structure except for the scomp feature.
4. This rule defines the inherited feature structure of the complement clause (SComp). This involves assigning values to all of the features contained in the inherited feature structure, as well as any synthesized feature structures resulting from $scompSynt. These synthesized feature structures place constraints on the presence of a complementizer such as "THAT", or on the subjunctive mode, both of which are drawn from the bottom of the shared forest. By considering these additional constraints, this rule helps to ensure that the resulting parse tree is as accurate and complete as possible.

Given the lexicon entry for the verb "to suggest", this rule will produce 7-a and never 7-b, 7-c, or 7-d.

(7) a. [. . .] suggests that the lawyer pursue the lawsuit.
b. [. . .] suggests that the lawyer pursues the lawsuit.


c. [. . .] suggests if the lawyer pursue the lawsuit.
d. [. . .] suggests if the lawyer pursues the lawsuit.
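To give an intuition of how such a lexical constraint filters realizations, here is a small Python sketch: the complement-clause candidates and the feature names are invented for the example, and the filtering is of course far simpler than the actual rule of Fig. 5.

```python
# Assumed lexical constraints on the complement clause (scomp) of two verbs.
lexicon = {
    "suggest": {"scomp": {"complementizer": "that", "mode": "subjunctive"}},
    "know":    {"scomp": {"complementizer": "that", "mode": "indicative"}},
}

# Candidate realizations of the complement clause, with their properties.
candidates = [
    ("that the lawyer pursue the lawsuit",
     {"complementizer": "that", "mode": "subjunctive"}),
    ("that the lawyer pursues the lawsuit",
     {"complementizer": "that", "mode": "indicative"}),
    ("if the lawyer pursues the lawsuit",
     {"complementizer": "if", "mode": "indicative"}),
]

def realize_scomp(main_verb):
    # Keep only the complement clauses whose features satisfy the constraints
    # projected from the main verb's lexical entry.
    constraints = lexicon[main_verb]["scomp"]
    return [text for text, feats in candidates
            if all(feats.get(k) == v for k, v in constraints.items())]

print(realize_scomp("suggest"))  # ['that the lawyer pursue the lawsuit']
print(realize_scomp("know"))     # ['that the lawyer pursues the lawsuit']
```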

3.6 The Speaker's Rules and Language

The formalism does not split the generative process into separate strategic and tactical steps; it is designed to prevent the generation gap between text planning and linguistic realization described by Meteer (1991). We integrate planning into the formalism itself, following some NLG works such as (Bateman 1997). The generation process can be summarized as behavior that achieves a communicative goal by producing well-formed utterances.

A text generation process involves determining how a speaker's choices about what to say, in a given context, are expressed in language. We have discussed how these choices are defined in a feature structure. Now, we need to understand how a shared forest is created to generate text that reflects these choices. In this process, both the inherited feature structures and the synthesized feature structures are propagated through the shared forest according to the rule set. On each rule of the grammar, a set of constraints is defined that says how the inherited feature structure will be realized. On the one hand, the inherited feature structures are propagated down to their components according to the content that has to be realized; on the other hand, the output resulting from each component's realization is propagated up to the calling rules to provide a context for the rest of the text.

We will provide an overview of how the rules are written using the following example, without going into too much detail. Let us assume that the goal is to convey the meaning expressed by a sentence such as:

(8) The students were over the moon when they found out that they had been accepted to study abroad next year.

This sentence uses a causal relationship between two predicates, and the syntax used to express this causation is the construction "X WHEN Y". We introduce the Elvex rule that enables the generation of a sentence "X when Y". In some cases, Y's arguments may refer to certain elements in X. The full rule covers all scenarios and incorporates pronouns, synonyms, and hyponyms to prevent repetition. We cannot provide the complete rule at this point, as it would require a fuller explanation; however, we clarify the meaning of the rule through comments marked with "//" (Fig. 6).

In the first clause, the main predicate is "EUPHORIC" and the agent is "STUDENT". In the second clause, the main verb is "ACCEPT", with a passive diathesis that is used to avoid designating the agent of this action. This diathesis may be expressed with the passive voice.


Fig. 6 The Elvex rule for the causative expression “X when Y”

In the third clause, the main predicate is "STUDY". The first clause also includes an idiomatic expression using the verb "to be" and the phrase "over the moon" as a constant prepositional complement. This sentence refers to an idiomatic expression, which is explained in the lexicon under the entry for the word "EUPHORIC" and the lemma "to be". The description is provided in an informal language register:

EUPHORIC verb to_be [pobjC:[HEAD:MOON, number:sg, def:yes, pcas:over], language_register:informal]

The second part of the sentence contains a passive complement clause that includes an infinitive. The use of the infinitive in this clause avoids repeating the agent and prevents the verb from being put into the active voice, as demonstrated by the following, less acceptable sentence:

(9) The students were over the moon when they found out that they had been accepted that they could study abroad next year.

Consider now the infinitive phrase "to study abroad next year", in which the agent is the "STUDENT". The input feature structure at this level can be represented as in Fig. 7, where both IDs refer to the same entity. The syntactic realization of "ACCEPT" is driven by a rule where the c-structure is unchanged, but where the arguments are transformed from II, III to SUBJECT


Fig. 7 The input feature structure with a co-indexation ID

Fig. 8 A simplified rule in Elvex for the passive transformation

and VCOMP (Subject and Infinitive functions), and where the diathesis is realized as the passive voice (Fig. 8).
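The argument remapping can be pictured with a toy sketch such as the following; the labels II, III, SUBJECT, and VCOMP follow the description above, while everything else is an assumption made for illustration.

```python
# Toy illustration of the passive diathesis mapping described above:
# argument II becomes the SUBJECT and argument III the infinitive complement
# (VCOMP); the c-structure rule itself is left unchanged.
def passivize(pred):
    mapped = dict(pred)
    mapped["SUBJECT"] = mapped.pop("II")
    mapped["VCOMP"] = mapped.pop("III")
    mapped["voice"] = "passive"
    return mapped

accept = {"HEAD": "ACCEPT",
          "II": {"HEAD": "STUDENT"},
          "III": {"HEAD": "STUDY"}}
print(passivize(accept))
# -> {'HEAD': 'ACCEPT', 'SUBJECT': {'HEAD': 'STUDENT'},
#     'VCOMP': {'HEAD': 'STUDY'}, 'voice': 'passive'}
```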

4 Conclusion

We have developed a system that can generate sentences and texts from a given meaning and communication intention. The system is based on linguistic knowledge, which is described using a domain-specific language called Elvex, together with the formalism defined to implement it. This language provides a clear and accurate description of the grammar, including the lexicon and rules, based on declarative and constraint-based methods. As a result, one can produce texts in a consistent and reliable manner, which makes it possible to compare the generated results with real-life language use in a testable and repeatable way.

This deterministic approach is distinct from using an encoder-decoder natural language generator, which can be employed to answer questions or complete a text based on the information it has. Indeed, an empirical system does not follow a specific grammatical model and is not deterministic. While it can generate text that adheres to various grammatical models, its functioning is not based on linguistic theory; instead, it relies on statistical patterns that have emerged from large corpora of data.


Furthermore, the results generated by Elvex can be used in situations where one or more paraphrased texts are required to convey a particular meaning or communication intention. In summary, this system can serve as a valuable tool for linguistic research, as well as for developing advanced text generation systems with high-quality output in restricted domains.

Acknowledgements I would like to thank Max Silberztein and Joan Busquets for reviewing this text. It should be noted that they are not responsible for the views expressed in this document.

References

Asher, N., Lascarides, A.: Logics of Conversation. Cambridge University Press (2003)
Bateman, J.A.: Enabling technology for multilingual natural language generation: the KPML development environment. Natural Language Engineering 3(1), 15–55 (1997)
Berrendonner, A.: Pour une macro-syntaxe. Travaux de linguistique 21, 25–31 (1990)
Blanche-Benveniste, C., et al.: Le français parlé : études grammaticales. Éditions du CNRS, Paris (1990)
Bresnan, J.: Lexical-Functional Syntax. Blackwell Publishers, Oxford (2001)
Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)
Mann, W.C., Thompson, S.A.: Rhetorical structure theory: Toward a functional theory of text organization. Text – Interdisciplinary Journal for the Study of Discourse 8(3), 243–281 (1988). DOI: https://doi.org/10.1515/text.1.1988.8.3.243
Dalrymple, M., Kaplan, R.M., Maxwell, J.T., Zaenen, A.: Mathematical and computational issues. In: M. Dalrymple, R.M. Kaplan, J.T. Maxwell, A. Zaenen (eds.) Formal Issues in Lexical-Functional Grammar, pp. 331–338. CSLI Publications, Stanford, CA (1995)
Danlos, L.: Generation with tree adjoining grammars. In: Proceedings of the 15th International Conference on Computational Linguistics (COLING), pp. 386–392 (1992)
Danlos, L., Maskharashvili, A., Pogodalla, S.: An ACG analysis of the G-TAG generation process. In: INLG. Philadelphia (2014)
Danlos, L., Meunier, F., Combet, V.: EasyText: an operational NLG system. In: Proceedings of the European Workshop on Natural Language Generation (ENLG). Nancy, France (2011)
Ducrot, O.: Dire et ne pas dire. Hermann, Paris (1972)
Elhadad, M., Robin, J.: SURGE: a comprehensive plug-in syntactic realization component for text generation. Tech. rep., Computer Science Dep., Ben-Gurion University, Beer Sheva, Israel (1998)
Floridi, L., Chiriatti, M.: GPT-3: Its nature, scope, limits, and consequences. Minds and Machines 30, 1–14 (2020). DOI: https://doi.org/10.1007/s11023-020-09548-1
Halliday, M.A.K.: An Introduction to Functional Grammar. Edward Arnold (1985)
Joshi, A.K., Schabes, Y.: Tree adjoining grammars. Handbook of Formal Languages 3, 69–123 (1997)
Kay, M.: Functional unification grammar: A formalism for machine translation. In: Annual Meeting of the Association for Computational Linguistics (1984)
Knuth, D.E.: Semantics of context-free languages. Mathematical Systems Theory 2(2), 127–145 (1968)


Lavoie, B., Rambow, O.: A fast and portable realizer for text generation systems. In: Fifth Conference on Applied Natural Language Processing, pp. 265–268. Association for Computational Linguistics, Washington, DC, USA (1997)
Mel'čuk, I.: Explanatory combinatorial dictionary. In: Open Problems in Linguistics and Lexicography, pp. 225–355. Giandomenico Sica, Monza (Italy) (2006)
Meteer, M.W.: Bridging the generation gap between text planning and linguistic realization. Computational Intelligence 7(4), 296–304 (1991). DOI: https://doi.org/10.1111/j.1467-8640.1991.tb00402.x
Polguère, A., Mel'čuk, I.: Lexique combinatoire et explicatif: le nouvel outil de consultation du lexique informatisé. In: Traitement Automatique des Langues Naturelles, vol. 2, pp. 436–439 (2003)
Pollard, C., Sag, I.A.: Information-based syntax and semantics. In: Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics, pp. 1–10 (1985)
Reiter, E., Dale, R.: Building Natural Language Generation Systems. Cambridge University Press, United Kingdom (2000)
Sagot, B., Clément, L., de la Clergerie, E.V., Boullier, P.: The Lefff 2 syntactic lexicon for French: architecture, acquisition, use. In: International Conference on Language Resources and Evaluation (2006)
de Saussure, F.: Cours de Linguistique Générale. Payot, Paris (1916; reprinted 1995)
Le Scao, T., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A.S., Yvon, F., et al.: Bloom: A 176b-parameter open-access multilingual language model. arXiv (2022). URL https://arxiv.org/abs/2211.05100
Thomson, C., Reiter, E.: Generation challenges: Results of the accuracy evaluation shared task. In: Proceedings of the 14th International Conference on Natural Language Generation, pp. 240–248. Association for Computational Linguistics, Aberdeen, Scotland, UK (2021). URL https://aclanthology.org/2021.inlg-1.23

Towards a More Efficient Arabic-French Translation
Héla Fehri

Abstract Even though word-to-word translation is a relatively trivial task, translation becomes more and more difficult once the sentences to be translated contain compound elements such as named entities (NEs), or elements that must be kept unmodified, such as proper names. We present a system that translates Arabic NEs using local grammars. We have developed specific linguistic resources for the recognition of NEs, to which we have added a transliteration module. We then evaluate our system and show that it gives better results than other translators such as Google and Reverso. After that, we show the importance of NE recognition in relative clause translation, comparing the obtained results with Google and Reverso. We finally argue for the viability of rule-based linguistic approaches to machine translation.

Keywords Named Entity recognition, Machine translation, Transducer, Local grammar, Dictionary, Relative clause

1 Introduction

While implementing fully automatic machine translation (MT) systems has become a realistic goal for researchers and private companies, only literal translations reach an accuracy of 100%. When translation needs to take into account more complicated phenomena, such as multiword expressions (MWEs and NEs) and the complex morphology found in the Arabic language, it presents greater challenges. We focus here on the subtask of recognizing, annotating, and translating NEs from Arabic to French in order to obtain reliable full-sentence translations. Subsequently, we show how to use the NE recognition module to translate complex sentences such as relative clauses. Given the complexity of this subtask, fully statistical approaches reach many limits due to the lack of lexical and morpho-syntactic resources. We provide

H. Fehri (✉) MIRACL Laboratory, University of Sfax, Sfax, Tunisia © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 M. Silberztein (ed.), Linguistic Resources for Natural Language Processing, https://doi.org/10.1007/978-3-031-43811-0_3



arguments and concrete results that demonstrate the advantages of using handcrafted linguistic resources. Our approach is based on the use of rule-based local grammars represented as transducers. The transducers we have developed perform powerful text analyses, which have allowed us to solve several problems during NE recognition. Connecting these transducers to obtain a deep analysis is not a trivial task; we present the finalized linguistic module.

First, we present a brief overview of related works. Then, we describe the main problems encountered during the NE recognition and translation processes. Next, we present several resources in detail, along with their implementation and their evaluation. After that, we describe the important role of the NE recognition module in obtaining a correct translation of complex sentences including relative clauses. Finally, we outline future works.

2 Related Works

The literature review shows that Web-based machine translation tools, such as Google Translate and Reverso, have gained popularity among translators. Nevertheless, they produce low-quality results when NEs (especially compound nouns) occur in the source texts. Despite the progress made by translation technologies, the accuracy of human translation remains unmatched, because these tools do not understand the meaning of the sentences they try to translate and therefore rely solely on wordforms and their contexts. Moreover, despite the use of new technologies such as deep learning to improve the quality of their results, these systems remain inaccurate, especially when dealing with frozen expressions and NEs. The main reason for these failures is the fact that these translators do not have access to precise linguistic resources.

Work on NEs revolves around two complementary axes: the first focuses on NE recognition, and the second on their translation. Identification, tagging, and translation of NEs have been implemented for several languages, using linguistic, statistical, and hybrid approaches. Here, we present a linguistic approach to solve both the recognition and the translation problems. Regarding NE recognition, we can cite the system presented in (Friburger 2002), which performs automatic extraction of proper names in French. The proposed method involves multiple syntactic transformations and is implemented with transducers. Other related works have tackled various topics, including the recognition of elliptical expressions (Hasni et al. 2009), Arabic compound nouns (Khalfallah et al. 2009), broken plurals (Ellouze et al. 2009), and Arabic phonological changes (Kassmi et al. 2020).

Other works have been dedicated to the translation of NEs from one language to another. For instance, (Barreiro 2008) presents a system that performs translation of simple sentences from English to Portuguese; (Wu 2010) presents a system that


translates certain noun phrases from French to Chinese, and (Torjmen and Haddar 2022) presents a system that translates NEs from Tunisian Dialect to Modern Standard Arabic.

3 Problems in Arabic Named Entities

In this section, we describe some of the problems that must be addressed by the recognition and translation processes.

3.1 Problems in Recognition

The recognition of Arabic NEs needs to address problems related to proper names and syntax. – It is a challenge to locate proper names in Arabic texts, because they do not start with a capital letter, and there is no special sign in Arabic to identify and distinguish them from other wordforms. Moreover, the length of persons’ proper names cannot be known in advance, as it depends on the traditions of the region where the person was born. For example, in the education domain, universities can be composed of one noun such as "‫[ "ﺟﺎﻣﻌﺔ ﺍﻷﺳﺪ‬university alasad] or of a noun and a forename such as "‫[ "ﺟﺎﻣﻌﺔ ﻣﺤ ّﻤﺪ ﺍﻷﻭﻝ‬University Mohammed Premier] or of a noun and a forename preceded by a title of nobles "‫[ "ﺟﺎﻣﻌﺔ ﺍﻷﻣﻴﺮ ﻣﺤ ّﻤﺪ ﺑﻦ ﻓﻬﺪ‬Prince Mohammad Bin Fahd University]. Finally, it is not possible to store all persons’ proper names, with all their variants, in a dictionary. – It is also a challenge to construct a grammar of Arabic NEs. Their lengths cannot be known in advance, and a given category of word, such as adjectives, can be found in various positions in the NE. As a matter of fact, adjectives are not always placed after the nouns they modify. For example in the NE "‫"ﺟﺎﻣﻌﺔ ﺗﻮﻧﺲ ﺍﻻﻓﺘﺮﺍﺿﻴﺔ‬ [University Tunis Virtual], the adjective "‫[ "ﺍﻻﻓﺘﺮﺍﺿﻴﺔ‬Virtual] is not placed after the noun that it modifies "‫[ "ﺟﺎﻣﻌﺔ‬University]. However, in the NE "‫ﺍﻟﺠﺎﻣﻌﺔ ﺍﻟﻮﻃﻨ ّﻴﺔ‬ ‫[ "ﻟﻠﻌﻠﻮﻡ ﻭﺍﻟﺘﻜﻨﻮﻟﻮﺟﻴﺎ‬University National of Science and Technology], the adjective "‫[ "ﺍﻟﻮﻃﻨﻴّﺔ‬National] is placed after the noun that it modifies "‫[ "ﺟﺎﻣﻌﺔ‬University].

3.2 Problems with the Translation of Arabic NEs

Translation of NEs from Arabic to French presents several problems, including: – Words in the Arabic text are highly ambiguous. For example, in Arabic, the word “stadium” can represent an athletic facility or a name of a sportive club.


– Arabic words and their French equivalent do not always have the same gender. For example, the word "‫[ "ﻣﺴﺒﺢ‬swimming pool] is masculine but its translation into French piscine is feminine. As both Arabic and French have gender agreements between nouns and their modifiers (adjectives and determiners), producing a correct translation involves setting gender properties for all its modifiers. – There are systematic ambiguities between Arabic country names and city names. For example, the toponym Tuwnis must be translated in French by Tunisie (the country) or Tunis (the city). – Adjectives are positioned differently in Arabic and in French. For example,” ‫ﻋﺒﺪ‬ ‫[ ”ﺍﻟﻌﺰﻳﺰ ﺍﻻﻭﻟﻤﺒﻲ ﻣﻠﻌﺐ‬Abdelaziz Olympic stadium] must be translated by “Stade olympique Abdelaziz”. Moreover, the translation of NEs are complicated when they contain a person’s name because person’s names must be transliterated rather than translated.

4 Implementation of the Set of Transducers

Our system performs NE translation in a two-phase process: first, a recognition phase for Arabic NEs, and then a translation phase that includes transliteration. Each phase relies on a set of specific transducers. Transducers represent sets of patterns that may include lexical and morpho-syntactic constraints, and they are applied in cascade.
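The cascade principle can be sketched as follows in Python; the patterns below are deliberately toy English regular expressions, not the actual NooJ transducers, and only the ENAMEX+PERS tag name is taken from the figures that follow.

```python
import re

# Toy transducers: each one annotates its input text and passes it on.
def recognize_dates(text):
    # Hypothetical pattern: tag a month name as a date entity.
    return re.sub(r"\b(March|April|May)\b", r"<TIMEX>\1</TIMEX>", text)

def recognize_persons(text):
    # Hypothetical pattern: a trigger word followed by a capitalized name.
    return re.sub(r"\b(Prince|President) ([A-Z]\w+)",
                  r"<ENAMEX+PERS>\1 \2</ENAMEX+PERS>", text)

CASCADE = [recognize_dates, recognize_persons]

def apply_cascade(text):
    # The output of each transducer is the input of the next one.
    for transducer in CASCADE:
        text = transducer(text)
    return text

print(apply_cascade("President Smith arrived in March."))
# -> <ENAMEX+PERS>President Smith</ENAMEX+PERS> arrived in <TIMEX>March</TIMEX>.
```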

4.1 Recognition Phase

The recognition phase allows the recognition of Arabic NEs in the built corpus. It consists of two essential steps: building the corpus and developing the linguistic resources. These steps are represented in Fig. 1.

Building the corpus
This step consists in building a cross-domain corpus. As far as we know, there is no available, accessible on-line corpus for NEs, so we have built our own corpus using web scraping techniques applied dynamically to various journalistic sites. Note that the resulting corpus is not annotated. Thereafter, we have cleaned up the corpus by eliminating non-pertinent information and by regrouping other information. On the one hand, this step has allowed us to focus on useful information only and to prepare the corpus for the next process; on the other hand, it has allowed us to optimize the efficiency of the NE recognition process.

Next, we have studied the corpus to enumerate the different patterns used to locate NEs, and we have gathered the corresponding vocabulary in terminological dictionaries. By studying the corpus, we have identified the context in which each term appears, which highlights the related domains and identifies the corresponding list of trigger words. We have developed several recognition patterns and


Fig. 1 NEs recognition approach

dictionaries. The result of this step is a set of words that are turned into dictionary entries, and a set of patterns that will be represented by grammars.

Construction of dictionaries and grammars
Because of the massive morphological variation in Arabic, we need to construct dictionaries that associate each entry with an exhaustive and precise morpho-syntactic description. For each entry, we have indicated the category (e.g., N), its semantic features (e.g., Country, Location, DateMonthName), and its French translation. Figure 2 shows an example of the dictionaries we built, describing dates and their translations.

Morphological phenomena, including agglutination, are formalized by morphological grammars. Figure 3 shows an extract of the grammar that solves the agglutination problem. This grammar takes compound nouns into account; in fact, Arabic compound nouns can be preceded by prepositions. Let us note that morphological grammars use dictionaries as input.

The main transducer in Fig. 4 is used to recognize all person NEs. It contains eight embedded graphs. Each path in each embedded graph corresponds to a rule that was extracted from our corpus. The NE recognition process is based on 56 graphs that cover all the rules identified in our corpus.
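For illustration, the kind of information carried by these dictionary entries can be pictured as follows; the romanized keys and the exact layout are assumptions made for readability, since the real resources are NooJ dictionaries rather than Python data.

```python
# Toy bilingual entries: category, semantic features, and French translation.
DICTIONARY = {
    "tuwnis": [{"cat": "N", "features": ["+Country"], "fr": "Tunisie"},
               {"cat": "N", "features": ["+City"],    "fr": "Tunis"}],
    "nisan":  [{"cat": "N", "features": ["+DateMonthName"], "fr": "avril"}],
}

def lookup(wordform):
    """Return all analyses of a wordform, or an empty list if it is unknown."""
    return DICTIONARY.get(wordform, [])

print(lookup("tuwnis"))   # two competing analyses: the country and the city
```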


Fig. 2 Arabic-French dictionary

Fig. 3 Morphological grammar

4.2 Translation Phase

Translation is performed in three steps; each step uses as its input the output produced by the previous step (Fig. 5).


Fig. 4 Main transducer of Person name recognition (ENAMEX+PERS)

Fig. 5 Steps of the translation process

In what follows, we describe the resources we have developed for each of these three steps.

Word-to-word translation
To implement the word-to-word translation, we have constructed a syntactic grammar that translates each word inside the NE sequence, skipping words that are not found in our dictionaries or that cannot be translated (numbers, special characters,


Fig. 6 Word-to-word translation grammar

Fig. 7 Sub-transducer “MOTDIC”

etc.). This grammar takes as its input the list of NEs that were produced by the recognition phase. The grammar is described by the transducer in Fig. 6. This transducer processes wordforms in the NE that keep the same values in the target language. These wordforms include numbers (symbol ), special characters (symbol

), wordforms not found in our dictionaries (symbol ), as well as wordforms that are described in our dictionaries as requiring a specific treatment. For example, if the wordform to translate is a first name such as “‫”ﺍﻟﺒﺎﺳﻞ‬ [albaasil], then, it must keep the same value in the target language; this wordform will be treated by the transliteration process. The transducer MOTDIC in Fig. 7 produces as its output the translation of each wordform, as well as its linguistic annotation. These annotations help in the reorganization and agreement phase. For example, if we apply this transducer to the NE “‫[ ”ﻣﺴﺒﺢ ﻣﺪﻳﻨﺔ ﺍﻟﺒﺎﺳﻞ ﺍﻟﺮﻳﺎﺿﻴﺔ‬sport city pool Al Bacel], we get as a result ‫< ﺍﻟﺒﺎﺳﻞ‬sportif A+f+s>.


Fig. 8 Disambiguation transducer

The specific problems during this phase are related to the fact that a wordform might have multiple potential translations. For example, the word madynat has two different translations: "cité" [city] and "ville" [town]. Therefore, we need to eliminate the incorrect translations. This is achieved by applying the transducer in Fig. 8. In this example, the adjective "sportif" helps us decide between the two potential translations cité [city] and ville [town], by eliminating the latter.

Note that the word-to-word translation phase involves managing priority levels. This phase is actually applied twice: the first time to process compound nouns, and the second time to process simple words. If a NE contains a compound noun, then we need to translate the compound noun first, to avoid translating each of its components separately. For example, if the NE contains the compound noun "بحمام الأنف" (biHammam al'anf), which is a town name, we do not want to translate it as "hammam de nez" [hammam of nose].

Reorganization and agreement
Several reorganization and agreement rules based on a sub-categorization mechanism have been applied to improve the word-to-word translation phase. These rules formalize the order of the components inside NEs and the agreement of adjectives and nouns in gender and number. For instance, if a NE in the source language contains an adjective, then we must know whether this adjective modifies the trigger word or a noun that occurs just before it. For example, in the NE الباسل <sportif A+f+s>, the adjective "sportif" is feminine and singular in the source language, the noun cité [city] is also singular and feminine in the target language, but the source noun translated by piscine is masculine. We can thus deduce that the adjective sportif modifies the noun cité rather than the noun piscine.

To obtain the desired result, we apply the transducer shown in Fig. 9 recursively, until the process no longer produces any annotation. In the example الباسل <sportif A+f+s>, we have applied the transducer in Fig. 9 twice. During the first iteration, the grammar


Fig. 9 Reorganization and agreement transducer

Fig. 10 Extract of the transducer “Ns+A”

produces two annotations for the two nouns, followed by an annotation for the adjective. Therefore, the embedded transducer "Ns+A" is selected; it is illustrated in Fig. 10. This transducer produces as a result "piscine الباسل <sportif A+f+s>", which contains one annotated noun followed by one annotated adjective. Therefore, during the second iteration, the transducer "N+chaine+A" is applied. The result produced after the second iteration is then "piscine cité sportive الباسل". Note that the resulting NE no longer has any annotation: the halting condition is reached.

Readjustment
If a NE in the source language contains a sequence of nouns, or an adjective followed by a noun, then we need to apply some grammars that take Arabic contracted forms into account. These rules are presented in Fig. 11. Readjustment rules are implemented with transducers, which are applied after the reorganization and agreement phase. For example, in the NE "piscine cité sportive الباسل", the French noun cité [city] is singular and feminine and starts with a consonant; therefore, we should insert the preposition "de la" before it to obtain the correct French translation "piscine de la cité sportive الباسل".
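The iteration strategy used for the reorganization phase (apply the rules again and again until no annotation is rewritten) can be sketched as follows; the string-level rewrite rules below are stand-ins for the transducers of Figs. 9 and 10, and the annotations are simplified.

```python
def apply_until_fixpoint(sequence, rules, max_iterations=50):
    """Apply rewrite rules repeatedly until no rule fires (halting condition)."""
    for _ in range(max_iterations):
        changed = False
        for rule in rules:
            new_sequence = rule(sequence)
            if new_sequence != sequence:
                sequence, changed = new_sequence, True
        if not changed:        # no annotation was rewritten during this pass
            return sequence
    return sequence

# Toy rules: annotated noun + adjective are merged with agreement applied.
def ns_plus_a(sequence):
    return sequence.replace("<cité N+f+s> <sportif A+f+s>", "cité sportive")

def n_chaine_a(sequence):
    return sequence.replace("<piscine N+f+s> cité sportive",
                            "piscine cité sportive")

print(apply_until_fixpoint("<piscine N+f+s> <cité N+f+s> <sportif A+f+s>",
                           [ns_plus_a, n_chaine_a]))
# -> "piscine cité sportive"
```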


Fig. 11 Extract of the readjustment transducer

Transliteration process
The transliteration is performed at the very end of the process. It consists of transliterating all the remaining untranslated wordforms still written in Arabic, using specific morphological transducers. For this process, we have used the El Qalam transliteration system and implemented each of its rules as a morphological transducer. This transliteration task is described in Fehri et al. (2009).
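The principle of the transliteration step can be illustrated with a toy character table; the actual system implements the El Qalam rules as morphological transducers, which are considerably richer than this sketch.

```python
# Toy transliteration table (not the El Qalam rules).
TRANSLIT = {
    "ب": "b", "ا": "a", "س": "s", "ل": "l", "م": "m",
    "ن": "n", "ت": "t", "د": "d", "ر": "r",
}

def transliterate(word):
    # Map each Arabic letter to a Latin sequence; keep unknown characters as-is.
    return "".join(TRANSLIT.get(ch, ch) for ch in word)

print(transliterate("باسل"))   # -> "basl" with this toy table
```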

5 Experimentation and Evaluation

We have used the NooJ linguistic platform with its Arabic linguistic module1 to develop our linguistic resources (dictionaries, morphological and syntactic grammars), as well as to conduct experiments and evaluate the results of the two phases (NE recognition and translation). On top of the standard Arabic module, we have added various linguistic information, including the French translation of each term, as well as distributional properties (e.g., +person, +city) for some specific entries. Note however that our “First Name” dictionary is monolingual, because its entries must be transliterated rather than translated.

1 The NooJ software and this module of linguistic resources are freely available for download at: www.nooj4nlp.org.


Table 1 Performance
Newspaper texts: 42,600 texts (943.7 MB)
Occurrences (NEs): more than 800,000
Precision: 98%
Recall: 90%
F-measure: 94%

5.1 Experimentation of Recognition Phase

To evaluate the recognition phase of our system, we applied our resources to an evaluation corpus of 42,600 texts, which is different from our study corpus. Because we use transducers that implement productive rules, each recognized NE satisfies at least one of the described rules and is therefore associated with an annotation. We evaluated the results by computing the following measures: Precision, Recall and F-measure (F1). The results are shown in Table 1. Although very high, the measures in Table 1 show that there are still unsolved problems. Some problems are related to the lack of a single standard for spelling proper names and to the absence of some words from our dictionaries, which lower the recall rate. Other problems are related to the use of metaphors in the Arabic source text. These measures demonstrate that our resources can be applied to texts regardless of their domain, provided these texts use the same features as the ones implemented in our dictionaries. Of course, we will need to enhance our grammars with new transducers when we want to process texts in new specialty domains, but we will not need to rebuild the resources we have already developed.
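As a quick sanity check, the F-measure reported in Table 1 can be recomputed from the precision and recall values, F1 being their harmonic mean:

# Recomputing the F-measure of Table 1 from precision and recall.
precision, recall = 0.98, 0.90
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.938, i.e., approximately 94%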

5.2 Experimentation of Translation Phase

The translation phase is applied to the Arabic NEs that were recognized during the first phase. Note that the first phase might have produced some erroneous results, which we filtered out using fairly reliable heuristic methods. Untranslated words are transliterated later. In our experiment, our system produced 97% correctly translated NEs. Although this result is promising, it shows that there are still some unsolved problems. These problems are mostly related to toponyms that have multiple potential translations, e.g., the same wordform “tuwnis” can be translated as Tunis (the city) or Tunisie (the country). We have compared our system to Google and Reverso, which are considered among the best translators offering Arabic-French translation. The results are shown in Table 2. According to this table, these two translators produced incorrect translations in all the cases, whereas our system produced the correct results. Among these examples, note that some NEs were translated by Google into English rather than French, cf. Table 2 (a and c).


Table 2 Experimental results

(a) Arabic NE: مسبح مدينة الباسل الرياضية
Transliteration: masbah madynat al bacel al riadhiya
Our system: Piscine de la cité sportive al Bacel
Google: Piscine Al-Basel Sports City
Reverso: Piscine Basel Sports City

(b) Arabic NE: انتقل الى حمام الأنف
Transliteration: intakala ila Hammam al’anf
Our system: Déménager à Hammam lîf
Google: Aller au bain de nez
Reverso: Prendre un bain de nez

(c) Arabic NE: مدينة تشرين بدمشق
Transliteration: madynat techrin – dimachq
Our system: Ville de Techrine – Damas
Google: Tishreen Ville de Damas
Reverso: Tin City, Damas

(e) Arabic NE: ملعب 7 نيسان بحلب
Transliteration: mal`ab 7 nysaan biHalab
Our system: Stade 7 nisan à Alep
Google: Stade du 7 avril à Alep
Reverso: 7 Stade Nissan Alep

Note also that Google and Reverso do not process compound nouns correctly, cf. Table 2 (b), whereas our system produced the correct translation in all cases. These results demonstrate the accuracy of our proposed post-processing phase, which performs the reorganization, the agreement check and the readjustment of the NE constituents, and which translates compound nouns before processing simple wordforms. In this modular system, the recognition and translation of NEs are two independent processes; it is therefore possible to reuse most of the resources for other applications. For example, if we want to translate Arabic NEs into a language other than French, we can reuse all the linguistic resources used by the recognition phase. Conversely, if we want to translate NEs from another language into French, we can reuse the resources used by the translation phase, because this phase only contains resources specific to the French language.

6 From the NEs Recognition to the Translation of Complex Sentences

NE recognition plays a very important role in obtaining a reliable translation. These entities can be found in simple sentences as well as in complex sentences such as those containing relative clauses. In the following, we explain how the recognition module can be exploited to translate relative clauses into another language, such as English.

6.1 NEs and Relative Clauses

As we have already mentioned, named entities can be essential components of complex sentences, in different positions. Let us take the relative clause in Fig. 12 as an example: the sentence “ذهب الرجل الذي تراه الى حمام الأنف” contains the compound noun “حمام الأنف” (Hammam al'anf), which is a town name. If this compound noun is not recognized as a NE, its translation is incorrect. In the following, we focus on the translation of the two relative pronouns “that” and “who”. We therefore start with a comparison of the relative pronouns of the two languages concerned, Arabic and English, described in Table 3. As mentioned in Table 3, relative pronouns in Arabic are of two types: special and common (Hamdallah and Tushyeh 2019). The special relative pronouns are those that have distinct singular, dual and plural forms. There are six of them: الذي، اللذان، الذين، التي، اللتان، اللاتي, as listed in the table. اللذان and اللتان become اللذين and اللتين in the accusative and genitive cases. The common relative pronouns have the same form for the singular, dual, and plural; they are من and ما. The first (من) is used for humans, while ما is used for non-humans. In the following, we give an example of the use of each relative pronoun in sentences containing NEs.

Fig. 12 Example of an Arabic relative clause


Table 3 Comparison between Arabic and English relative clauses

Arabic: There is a difference between masculine singular “الذي”, feminine singular “التي”, masculine dual “اللذان”, feminine dual “اللتان”, masculine plural “الذين” and feminine plural “اللاتي”. The pronouns are used to speak about people, objects or animals without any difference.

English: The notion of masculine and feminine does not exist. The choice of the suitable relative pronoun depends on the nature of the word the relative pronoun refers to, e.g., who → people (especially the agent of an action); that → people on which the action is performed, objects and animals.

– ‫ ﺍ ٰﻟ ِﺬﻱ‬is used for human and non-human masculine singular nouns, as in these examples: ‫ ﻡ‬605 ‫ ﺣﺎﺗﻢ ﺍﻟﻄﺎﺋﻲ ﺍﻟﺬﻱ ﻋﺮﻑ ﺑﻜﺮﻣﻪ ﻣﺎﺕ ﺳﻨﺔ‬: Hatim Al-Ta'i, who was known for his generosity, died in 605 AD ‫ﺼﻞ ﻋﻠﻰ ﺟﺎﺋﺰﺓ ﻋﺎﻟﻤﻴﺔ‬ ّ ‫ ﺍﻟﻜﺘﺎﺏ ﺍﻟﺬﻱ ﺃﻟّﻔﻪ ﻧﺠﻴﺐ ﻣﺤﻔﻮﻅ ﺗﺤ‬: The book that is written by Najib Mahfoudh won an international award ‫ ﺍﻟﺤﺼﺎﻥ ﺍﻟﺬﻱ ﻛﺎﻥ ﺟﺰﺀﺍ ﻣﻦ ﺣﺮﺏ ﻃﺮﻭﺍﺩﺓ ﻏﺪﺍ ﺃﺳﻄﻮﺭﺓ ﺗﺎﺭﻳﺨﻴﺔ‬: The horse that was part of the Trojan War has become a historical legend – ‫ ﺍﻟ ٰﻠﺬﺍﻥ‬is used for human and non-human masculine dual nouns, as in these examples: ‫ ﺍﻟﺒﻴﺘﺎﻥ ﺍﻟﻠّﺬﺍﻥ ﻳﻘﻌﺎﻥ ﻓﻲ ﺍﻟﻬﻨﺪ ﺃﻧﻴﻘﺎﻥ‬: The two houses that are robbed are located in India ‫ ﺍﻟﻄﺒﻴﺒﺎﻥ ﺍﻟﻠّﺬﺍﻥ ﺃﺷﺮﻓﺎ ﻋﻠﻰ ﻋﻼﺝ ﻣﺮﺽ ﺍﻟﻘﻠﺐ ﻣﺸﻬﻮﺭﺍﻥ‬: The two doctors who supervised the treatment of heart disease are famous ‫ ﺍﻷﺳﺪﺍﻥ ﺍﻟﻠّﺬﺍﻥ ﻫﺮﺑﺎ ﻣﻦ ﺣﺪﻳﻘﺔ ﺍﻟﺤﻴﻮﺍﻧﺎﺕ ﻗﺒﺾ ﻋﻠﻴﻬﻤﺎ‬: The two lions who escaped from the zoo were caught – ‫ ﺍ ٰﻟ ِﺬﻳﻦ‬is used for human masculine plural nouns only, as in these examples: ‫ ﺍﻷﻃﻔﺎﻝ ﺍﻟﺬﻳﻦ ﻫﺮﺑﻮﺍ ﻣﻦ ﺍﻟﺤﺮﺏ ﺗ ّﻢ ﺍﻳﻮﺍﺅﻫﻢ ﻓﻲ ﻣﻠﺠﺄ ﺍﻷﻣﻢ ﺍﻟﻤﺘّﺤﺪﺓ‬: Children who fled the war are sheltered in the UN shelter – ‫ ﺍ ٰﻟ ِﺘﻲ‬is used for human and non-human feminine singular nouns as well as for non-human masculine and feminine plural nouns, as in these examples: ‫ ﺍﻟﺒﻨﺖ ﺍﻟﺘﻲ ﺗﻠﻌﺐ ﻓﻲ ﺣﺪﻳﻘﺔ ﺍﻟﺒﻠﻔﺪﻳﺮﺟﻤﻴﻠﺔ‬: The girl who plays in the Belvedere Park is pretty ‫ ﺍﻟﻜﺘﺐ ﺍﻟﺘﻲ ﻋﺮﺿﺖ ﻓﻲ ﻣﻌﺮﺽ ﺍﻟﻜﺘﺎﺏ ﺍﻟﺪﻭﻟﻲ ﺑﺘﻮﻧﺲ ﻗﻴﻤﺔ ﺟﺪﺍ‬: The books that are presented at the International Book Fair in Tunis are very valuable ‫ ﺍﻟﺴﻴﺎﺭﺓ ﺍﻟﺘﻲ ﺍﺷﺘﺮﻳﺘﻬﺎ ﻣﻦ ﻣﻌﺮﺽ ﺍﻟﺴﻴﺎﺭﺍﺕ ﺑﺼﻔﺎﻗﺲ ﺣﻤﺮﺍﺀ ﺍﻟﻠﻮﻥ‬: The car that I bought at the car showroom in Sfax is red


‫ ﺍﻟﺴﻴﺎﺭﺍﺕ ﺍﻟﺘﻲ ﺭﺃﻳﺘﻬﺎ ﻓﻲ ﻣﺮﻛﻦ ﺍﻟﻜﻠﻴﺔ ﻓﺎﺧﺮﺓ‬: The cars that I saw in college parking are luxurious ‫ ﺍﻟﻜﻼﺏ ﺍﻟﺘﻲ ﺗﺪ ّﺭﺏ ﻓﻲ ﻭﺯﺍﺭﺓ ﺍﻟ ّﺪﺍﺧﻠﻴﺔ ﺫﻛﻴﺔ ﺟ ّﺪﺍ‬: The dogs that are trained in the Ministry of Interior are very smart ٰ – ‫ ﺍﻟﻠﺘﺎﻥ‬is used for human and non-human feminine dual nouns, as in these examples: ‫ ﺍﻟﺒ ّﻄﺘﺎﻥ ﺍﻟﻠّﺘﺎﻥ ﺗﺴﺒﺤﺎﻥ ﻓﻲ ﺑﺮﻛﺔ ﺍﻟﺤﺪﻳﻘﺔ ﺍﻟﻌﻤﻮﻣﻴﺔ ﻛﺒﻴﺮﺗﺎﻥ‬: The two ducks that swim in the public garden pond are big ‫ ﺍﻟﺴﻴّﺪﺗﺎﻥ ﺍﻟﻠّﺘﺎﻥ ﺗﻘﻔﺎﻥ ﻓﻲ ﻣﺤ ّﻄﺔ ﺑﺎﺏ ﺳﻌﺪﻭﻥ ﻣﻌﻠّﻤﺘﺎﻥ‬: The two women who stand at the Bab Saadoun station are teachers ‫ ﺍﻟﻄﺎﻭﻟﺘﺎﻥ ﺍﻟﻠّﺘﺎﻥ ﺗﻮﺟﺪﺍﻥ ﻋﻠﻰ ﺭﻛﻦ ﻣﺴﺮﺡ ﺍﻷﻭﺑﺮﺍ ﻣﺰﺭﻛﺸﺘﺎﻥ‬: The two tables that exist on the corner of the opera stage are decorated – ‫ ﺍﻟ ٰﻼ ِﺗﻲ‬is used for human feminine plural nouns, as in these examples: ‫ ﺍﻟﻤﻤ ّﺮﺿﺎﺕ ﺍﻟ ّﻼﺗﻲ ﻳﻮﺟﺪﻥ ﻓﻲ ﺍﻟﻤﺴﺘﺸﻔﻰ ﺍﻟﺠﺎﻣﻌﻲ ﺍﻟﻬﺎﺩﻱ ﺷﺎﻛﺮ ﺑﺼﻔﺎﻗﺲ ﺧﺒﻴﺮﺍﺕ‬: The nurses who exist at the University Hospital Hedi Chaker in Sfax are experts – ‫ ﻣﻦ‬is used for human masculine and feminine singular, dual, and plural nouns. The kind of nouns it refers to is known from the relative clause that follows, as in these examples: ‫ ﺃﺣﺘﺮﻡ ﻣﻦ ﻳﻌﺘﻨﻲ ﺑﺎﻷﻃﻔﺎﻝ ﻓﻲ ﺩﺍﺭ ﺍﻷﻳﺘﺎﻡ‬: I respect who takes care of the children in the orphanage ‫ ﺃﺣﺘﺮﻡ ﻣﻦ ﻳﻌﺘﻨﻮﻥ ﺑﺎﻷﻃﻔﺎﻝ ﻓﻲ ﺩﺍﺭ ﺍﻷﻳﺘﺎﻡ‬: I respect those who take care of the children in the orphanage ‫ ﺃﺣﺘﺮﻡ ﻣﻦ ﺗﻌﺘﻨﻲ ﺑﺎﻷﻃﻔﺎﻝ ﻓﻲ ﺩﺍﺭ ﺍﻷﻳﺘﺎﻡ‬: I respect who takes care of the children in the orphanage ‫ ﺃﺣﺘﺮﻡ ﻣﻦ ﺗﻌﺘﻨﻴﺎﻥ ﺑﺎﻷﻃﻔﺎﻝ ﻓﻲ ﺩﺍﺭ ﺍﻷﻳﺘﺎﻡ‬: I respect the two ladies who take care of the children in the orphanage If we replace the common relative pronoun in the above sentences by the special ones, they respectively become: ‫ ﺃﺣﺘﺮﻡ ﺍﻟﺬﻱ ﻳﻌﺘﻨﻲ ﺑﺎﻷﻃﻔﺎﻝ ﻓﻲ ﺩﺍﺭ ﺍﻷﻳﺘﺎﻡ‬: I respect the one (man) who takes care of the children in the orphanage ‫ ﺃﺣﺘﺮﻡ ﺍﻟﺬﻳﻦ ﻳﻌﺘﻨﻮﻥ ﺑﺎﻷﻃﻔﺎﻝ ﻓﻲ ﺩﺍﺭ ﺍﻷﻳﺘﺎﻡ‬: I respect those (men) who take care of the children in the orphanage ‫ ﺃﺣﺘﺮﻡ ﺍﻟﺘﻲ ﺗﻌﺘﻨﻲ ﺑﺎﻷﻃﻔﺎﻝ ﻓﻲ ﺩﺍﺭ ﺍﻷﻳﺘﺎﻡ‬: I respect the one (woman) who takes care of the children in the orphanage ‫ ﺃﺣﺘﺮﻡ ﺍﻟﻠّﺘﻴﻦ ﺗﻌﺘﻨﻴﺎﻥ ﺑﺎﻷﻃﻔﺎﻝ ﻓﻲ ﺩﺍﺭ ﺍﻷﻳﺘﺎﻡ‬: I respect the two (women) who take care of the children in the orphanage – ‫ ﻣﺎ‬is used for non-human masculine and feminine singular, dual and plural nouns. Like ‫ﻣﺎ ﻳﻮﺟﺪ ﻓﻲ ﻣﺘﺤﻒ ﺑﺎﺭﺩﻭ ﻣﻦ ﺗﺤﻒ ﺛﻤﻴﻦ‬ ‫ﻣﺎ ﺗﺒ ّﺜﻪ ﻗﻨﺎﺓ ﺍﻟﺠﺰﻳﺮﺓ ﻣﻦ ﺧﺒﺮ ﺃﺧﺎﻓﻨﻲ‬


Similarly, if we replace ما with a special pronoun, the above sentences respectively become: التّحف التي توجد في متحف باردو ثمينة: The artifacts that exist in the Bardo Museum are precious; الخبر الذي تبثّه قناة الجزيرة أخافني: The news that was transmitted on the Al-Jazeera channel scared me. To reach this objective, the proposed method is based on transducers and bilingual dictionaries. These resources are detailed in the next section.
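The decision procedure summarized in Table 3 and illustrated by the examples above can be sketched as follows (simplified: the mapping below ignores, for instance, the fact that التي also covers non-human plurals; the feature names are hypothetical):

# Simplified choice of the relative pronoun on both sides. The Arabic side
# depends on gender and number; the English side depends on whether the
# antecedent is human and whether it is the agent of the action (Table 3).
SPECIAL = {("m", "sg"): "الذي", ("f", "sg"): "التي",
           ("m", "du"): "اللذان", ("f", "du"): "اللتان",
           ("m", "pl"): "الذين", ("f", "pl"): "اللاتي"}

def arabic_pronoun(gender, number):
    return SPECIAL[(gender, number)]

def english_pronoun(is_human, is_agent):
    # "who" for human agents, "that" otherwise
    return "who" if is_human and is_agent else "that"

print(arabic_pronoun("f", "pl"), english_pronoun(True, True))   # اللاتي who
print(arabic_pronoun("m", "sg"), english_pronoun(False, False)) # الذي that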

6.2 Implemented Resources

Relative clauses in Arabic can occur in verbal or nominal sentences. Figure 13 represents an extract of the different structures studied for verbal sentences. The sub-graph “V+DefArt+N+RelPRO+V+PREP+DefArt+N” describes relative clauses embedded in verbal sentences whose structure is a verb, followed by a definite noun, a relative pronoun, a verb, and a place complement. Figure 14 describes the different transformations that must be performed on sentences matching the “V+DefArt+N+RelPRO+V+PREP+DefArt+N” structure in order to obtain the correct translation.

Fig. 13 Example of the studied structures


Fig. 14 Example of a studied structure

As an example, we can cite the following sentence: انتقل المعلم الذي درسنا إلى مدينة صفاقس "The teacher who taught us moved to the town of Sfax". To solve the agglutination problem, we implemented in NooJ the morphological grammar described in Fig. 3.
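A rough sketch (outside NooJ) of the reordering applied to this structure is shown below; the tags and English glosses are hypothetical and stand for the output of the recognition and dictionary lookup phases:

# Reordering sketch for the "V+DefArt+N+RelPRO+V+PREP+DefArt+N" structure:
# the Arabic verb-initial clause is rebuilt with the noun phrase and the
# relative clause before the main verb. Tags and glosses are hypothetical.
def reorder(tagged):
    # tagged: list of (gloss, tag) pairs in the Arabic word order
    tags = [t for _, t in tagged]
    if tags == ["V", "N", "RelPRO", "V", "PREP", "N"]:
        v1, n, rel, v2, prep, place = [g for g, _ in tagged]
        return f"The {n} {rel} {v2} {v1} {prep} {place}"
    return " ".join(g for g, _ in tagged)   # unknown structure: no reordering

clause = [("moved", "V"), ("teacher", "N"), ("who", "RelPRO"),
          ("taught us", "V"), ("to", "PREP"), ("the town of Sfax", "N")]
print(reorder(clause))  # The teacher who taught us moved to the town of Sfax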

6.3 Experimentation of Implemented Resources

After studying many structures and applying dictionaries and syntactic local grammars, we examined the concordances of the resulting sentences to test the effectiveness of our approach and of the method we followed. We found that our method produced 85% correctly translated relative clauses. The results are very promising, even though some problems remain unsolved. To evaluate our work more precisely, we have compared our system with Google and Reverso, which are considered among the best translators offering Arabic-French translation. The results are shown in Table 4. According to this table, these two translators produced incorrect translations in most cases, whereas our system produced correct results. As we can see, the two translators do not process the NEs, cf. Table 4 (a and d): the toponym "حمام الأنف" is translated into "nose bath" and the person name "عمر المختار" is translated into "the age of the chosen". Furthermore, in most cases, the relative pronoun is omitted in the translation, cf. Table 4 (a and b).


Table 4 Experimental results

(a) Arabic relative clause: ذهب الرجل الذي تراه الى حمام الأنف
Transliteration: dhahaba alrajolu alladhi tarahu ilaa Hammam al’anf
Our system: The man who you see went to Hammam Lîf
Google: The man you see went to the nose bath
Reverso: The man you see went to the nose bath

(b) Arabic relative clause: سافرت البنات اللاتي تسكن بجوارنا الى مدينة تشرين بدمشق
Transliteration: safarat albanatu allaati taskonna bijiwaarinaa ila madynati dimachka
Our system: The girls who live next to us travelled to the city of Tishreen in Damascus
Google: The girls who live next to us traveled to Tishreen, Damascus
Reverso: The girls living next to us travelled to Taytoun, Damascus

(c) Arabic relative clause: انتقل المعلم الذي درسنا إلى مدينة صفاقس
Transliteration: intakala almo`llimu alladhy darrasanaa ilaa madiinati Safakos
Our system: The teacher who taught us moved to the city of Sfax
Google: The teacher who taught us moved to the city of Sfax
Reverso: The teacher who studied us moved to the city of Sfax

(d) Arabic relative clause: الجميع يتذكر عمر المختار بأفعاله التي أعطته هالة أسطورية
Transliteration: aljamy`u yatadhakkaru `umar almokhtaar biaf`aalihi allaty a`Tathu haalatan osTuuriyyatan
Our system: Everyone remembers Omar AlMukhtar by his deeds that distinguishes him as a legendary figure
Google: Everyone remembers Omar Al-Mukhtar with his actions that gave him a legendary aura
Reverso: Everyone remembers the age of the chosen by his actions that gave him a mythical aura

Moreover, the problem of multiple translations for the same word is not solved: the word "مدينة" is translated into "city" instead of "town", cf. Table 4 (c). In addition, the agglutination problem is not solved.

7 Conclusion and Perspectives

In this chapter, we have followed a linguistic approach to build an automatic translation system for Arabic NEs into French, using large-coverage dictionaries and morphological and syntactic grammars. The translation process is implemented in two phases: a recognition phase and a translation phase. The wealth of these resources in terms of lexical and morpho-syntactic properties has allowed us to formalize reliable linguistic patterns in a much more robust way than what empirical methods can accomplish.
– If our system produces an incorrect translation for a particular NE, we can simply correct the description of the corresponding lexical entry; if a certain pattern of wordforms has been incorrectly processed, we can correct the corresponding grammar.


– If we need to process texts from a specialized terminological domain not covered by the linguistic resources we have already developed, we can add new resources to cover the new domain, without any risk of destroying the previously obtained results.
To highlight the NE recognition process, we have shown how it can be exploited in the translation of relative clauses: the recognition of NEs leads to correct translations. On a more ambitious note, we would like to generalize our approach to other complex structures such as MWEs, frozen expressions and elliptical expressions. We also want to treat other structures by studying other relative pronouns, such as “where” referring to places, “when” referring to time, and “whose” referring to possession.

References Barreiro Anabella, 2008. Port4NooJ: an open source, ontology-driven Portuguese linguistic system with applications in machine translation. NooJ’08, Budapest Ellouze Samira, Haddar Kais, Abdelwahed, Abdelhamid, 2009. Étude et analyse du pluriel brisé avec la plateforme NooJ. Tozeur, Tunisia Fehri Hela, Haddar Kais, Ben Hamadou Abdelmajid, 2009. Translation and Transliteration of Arabic Named Entities. LTC Conference, Pologne: 275-279. Friburger Nathalie, 2002. Reconnaissance automatique des noms propres. PhD thesis, François Rabelais university. Hamdallah Rami and Tushyeh Hanna, 2019. A Contrastive Analysis of English and Arabic in Relativization. Papers and Studies in Contrastive Linguistics 34, 141-152. Hasni Elyes, Haddar Kais, Abdelhamid Abdelwahed, 2009. Reconnaissance des expressions elliptiques arabes avec NooJ. In proceedings of the 3rd International Conference on Arabic Language Processing (CITALA’09) sponsored by IEEE Morocco Section, Rabat, Morocco: 83-88 Khalfallah Faten, Haddar Kais, Abdelhamid Abdelwahed, 2009. Construction d’un dictionnaire de noms composés en arabe. In proceedings of the 3rd International Conference on Arabic Language Processing (CITALA’09), Rabat, Morocco: 111-116 Kassmi Rafik, Mohamed Mourchid, Abdelaziz Mouloudi and Samir Mbarki, 2020. Recognition of Arabic Phonological Changes by Local Grammars in NooJ. In Formalising Natural Languages with NooJ 2019 and its Natural Language Processing Applications. Fehri, Mesfar, Silberztein Eds. Springer CCIS Series. Torjmen Roua and Haddar Kais, 2022. Translation system from Tunisian Dialect to Modern Standard Arabic. Concurr. Comput. Pract. Exp. 34(6) Wu Mei, 2010. Integrating a dictionary of psychological verbs into a French-Chinese MT system. In Finite-State Language Engineering with NooJ: Selected Papers from the NooJ 2009 International Conference (Tozeur, Tunisia). Edited by Abdelmajid Ben Hamadou, Slim Mesfar and Max Silberztein. Centre de publication Universitaire : Sfax., Tunisia: 315-324

Linguistic Resources and Methods for Belarusian Natural Language Processing

Yuras Hetsevich and Mikita Suprunchuk

Abstract We present several newly developed services, methods, and algorithms for Belarusian in the field of NLP, which provide users with a set of tools for text, speech and other multimedia processing. Services such as the speech synthesizer, the transcription generator and the word paradigm generator are grounded in a rule-based approach. The algorithms, databases and lists of rules used in their development are described in detail. Each tool carries out a specific task in the computational processing of textual information and speech. The proposed resources are also used to collect targeted thematic content to develop and refine natural language processing systems for Belarusian. As a consequence, we show that linguistic resources have not lost their relevance to NLP.

Keywords Automatic language processing · Belarusian · Concordance · Dictionary · Natural language processing · Phonetics · Transcription · Text-to-speech synthesizer · Word paradigm

1 Introduction In the second half of the 2010s, methods based on the use of neural networks and deep machine learning gained popularity in the sphere of Natural Language Processing (NLP). However, these methods have some drawbacks: they require large quantities of textual materials to build their language models. Unfortunately, the amount of required textual material is not always available. In addition, a language model trained on a certain dataset may not be able to process linguistic phenomena correctly if their frequency is too low in the dataset. Machine learning requires a long running time, which developers may not have. Neural models show excellent results at processing wordforms, but results at higher linguistic levels are less impressive. Rule-based NLP approaches are still relevant today: they achieve greater accuracy than empirical approaches when solving a lot of problems in computational

Y. Hetsevich (✉) · M. Suprunchuk United Institute of Informatics Problems, Minsk, Belarus © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 M. Silberztein (ed.), Linguistic Resources for Natural Language Processing, https://doi.org/10.1007/978-3-031-43811-0_4


linguistics, and the linguistic resources created for a specific NLP application can further contribute to the general collection of language knowledge. For the last 50 years, the Speech Synthesis and Recognition Laboratory of the United Institute of Informatics Problems of the National Academy of Sciences (Minsk) has been working on developing software tools and linguistic resources to process texts and other data for the Belarusian language. In 2014, the Institute published the website www.corpus.by to offer web services for voice and text processing. The services are sorted into thematic groups for convenient usage in specific fields of application, as presented in Hetsevich et al. (2021). In the following, we present three services that use carefully handcrafted linguistic resources for Belarusian: the Text-to-Speech Synthesizer, the Transcription Generator, and the Word Paradigm Generator.

2 Text-to-Speech Synthesizer

Our Text-to-Speech Synthesizer system (TTS), implemented in C++, processes a written text and constructs an audio file that users can listen to, download and save. It is publicly available at: www.corpus.by/TextToSpeechSynthesizer. Its model is based on theoretical and experimental data specific to Belarusian: the linguistic resources formalize the phonetic and prosodic structure of speech as well as the articulatory and acoustic phenomena involved in speech formation. TTS uses a multi-wave approach to synthesis, i.e., it compiles segments of a natural speech wave, correlated with elements of various phonetic lengths: allophones, diphones, and triphones. TTS uses a handcrafted phonetic-acoustic database that describes the intra- and inter-language specifics of phonetic systems, as well as the positional-combinatorial phenomena that generate allophonic speech, cf. (Taylor 2009). TTS performs a lexical and grammatical analysis of the input text by modeling the process of speech formation, considering the pronunciation and intonation features of Belarusian. The input text undergoes a sequence of processing steps performed by specialized processors: a text processor, a prosodic processor, a phonetic processor, and an acoustic processor; each processor is associated with a specific database that contains handcrafted rules. For more details see (Lobanov and Tsirulnik 2008).
– The text processor processes the input text in the following sequence: text cleaning, character conversion (abbreviations, acronyms, numbers, etc.), placement of accents, and POS tagging of wordforms. The resulting annotated text is then processed by the prosodic processor, which divides it into syntagmatic phrases and accent units (AU); these are further marked up into pre-core, core, and post-core intonation elements, and the values of amplitude (A), phoneme duration (D), and pitch frequency (F0) are then set for each AU, in accordance with a database of prosodic “portraits”.

Linguistic Resources and Methods for Belarusian Natural Language Processing

71

Fig. 1 The structure of the TTS software implementation

– The phonetic processor converts the text into a phonemic transcription and generates positional and combinatorial allophones. It uses rules to convert the text into a sequence of phonemes (letter-to-phoneme conversion) and 392 rules to convert phoneme sequences into allophone sequences (phoneme-to-allophone conversion).
– The acoustic processor generates a speech signal by compiling segments of natural sound waves corresponding to the allophones and multiphones. It uses information about which allophones need to be synthesized, as well as which prosodic characteristics should be attributed to each allophone.
The text, prosodic, phonetic, and acoustic processors of the speech synthesizer are basically language-independent; the language specifics (in our case, Belarusian) are set by the appropriate databases and knowledge, i.e., linguistic rules. We show the architecture of the TTS system in Fig. 1. Modules that control the sequence of actions of other modules are controllers, while modules that implement algorithms for processing a text or speech signal are processors. The system's main controller performs the sequence of transformations on the input data, receiving intermediate results at each stage and transmitting them to the next stage. The text normalization controller removes characters that are unnecessary for speech synthesis as well as accidental duplications of punctuation marks, standardizes character variants, and removes invalid characters using character replacement rules. The resulting text constitutes the input of the linguistic controller, whose output is a prosodically marked text that feeds the phonetic processor, which applies the “letter-phoneme” and “phoneme-allophone” rule transformations. The next processing step is performed by the prosodic processor, which sets the current values of the amplitude, pitch frequency, and duration of each allophone.
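The overall chain of Fig. 1 can be summarized schematically as follows (the bodies of the processors below are placeholders given for illustration, not the actual C++ implementation):

# Schematic view of the processing chain of Fig. 1; processor bodies are
# placeholders only.
def text_processor(text):       # cleaning, expansion of numbers/abbreviations,
    return text.strip()         # stress placement, POS tagging

def prosodic_processor(text):   # split into syntagmas and accent units,
    return [text]               # assign amplitude, duration and F0 "portraits"

def phonetic_processor(units):  # letter-to-phoneme and phoneme-to-allophone rules
    return [u.lower() for u in units]

def acoustic_processor(allophones):  # concatenate natural sound-wave segments
    return b"..."                    # an audio buffer in the real system

def synthesize(text):
    # the main controller passes intermediate results from stage to stage
    units = prosodic_processor(text_processor(text))
    return acoustic_processor(phonetic_processor(units))

audio = synthesize("Груша цвіла апошні год.")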


Table 1 Belarusian Electronic Dictionary (BED) database architecture and CVocReader software

BED database architecture (field name → type):
Spelling word → String (≤255 characters)
Stress positions → Vector of integers
Tag → String (≤255 characters)

Architecture of the CVocReader software for reading a dictionary (function → behavior):
Open/close → The dictionary is switched to read/share mode
Get the number of words in the dictionary → Real number of words
Word Search → A record (three fields) is selected from the database table
Search for the next word → The remaining entries are selected (there may be homographs in the dictionary, so this function will show all homographs stored)
Get the latest error → While reading the dictionary, this function must be constantly checked for the correct operation of the entire speech synthesis process

A prosodically marked allophone text is then processed by the acoustic processor, which generates the speech signal using a handcrafted database of sound waves of allophones and multiphones. The result of the conversions goes to the output data format controller, which converts it to a sound file in the desired format (wav or mp3), as described in Hetsevich and Lobanov (2010). The text processor performs some preprocessing of the input text (morphological and accent marking of the words), covering 2,097,967 Belarusian wordforms, based on the dictionary (Biryla 1987). The Belarusian Electronic Dictionary contains three types of information for each entry: the spelling of the word, the stress positions in the word, and the word tags. The program interface CVocReader is used to manage it; its architecture is illustrated in Table 1. For example, when the text processor processes the sentence Груша цвіла апошні год. . ., it produces the following tagging result:
w гру+ша_НевядомаяКатэгорыя_sbm2012initial_гру+ша_NNIFO_sbm1987_гру+ша_NFN1_noun2013_3
w цвіла+_VIIPF_sbm1987_цвіла+_дзеяслоў_verb2013_цвіла+_?_words_processed_3
w апо+шні_НевядомаяКатэгорыя_sbm2012initial_апо+шні_JJMO_sbm1987_апо+шні_JJMA_sbm1987_апо+шні_прыметнік_adjective2013_апо+шні_прыметнік_adjective2013_5
w го+д_НевядомаяКатэгорыя_sbm2012initial_го+д_NNIMO_sbm1987_го+д_NNIMA_sbm1987_го+д_NMN1_noun2013_го+д_NMA1_noun2013_5
p. p newline
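The three-field record described in Table 1 and the handling of homographs can be sketched as follows (the entries and stress values below are invented examples, not the actual BED content):

# The three-field dictionary record of Table 1, with a lookup that returns
# all homographs sharing a spelling. The entries below are invented examples.
from dataclasses import dataclass

@dataclass
class Entry:
    spelling: str   # string, up to 255 characters
    stress: list    # vector of stress positions
    tag: str        # grammatical tag, up to 255 characters

LEXICON = [Entry("груша", [3], "NFN1_noun2013"),
           Entry("год", [2], "NMN1_noun2013")]

def lookup(word):
    # homographs share a spelling, so all matching records are returned
    return [e for e in LEXICON if e.spelling == word]

print(lookup("груша"))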


The interface of TTS is shown in Fig. 2. To get synthesized speech, users type in a text in the input field and click “Generate synthesized speech!”. They can then click “Listen to generated speech” or “Download generated speech file”. To get better results, users can insert the following marks in the input text:
– a plus sign (+) or an acute accent, to mark the main stress (for example, “звыча+йны”);
– an equals sign (=) or a grave accent, to mark the secondary stress (for example, “тэ=леперада+ча”);
– a circumflex (^) between two words, to combine them into a complex phonetic form (for example, “на^стале+”, “сказа+ў^бы”).
While processing a text, the system produces intermediate results, including a normalized text, a phonemic text, an allophonic text, etc. (Fig. 3). These results may be used to solve other computer-linguistic problems, such as transcribing the text in Cyrillic, in IPA, or in X-SAMPA. The “Tokens” window displays five types of characters: alphabet (characters of the target alphabet, i.e., of the language selected for synthesis); other letters (not of the target alphabet); digits; whitespaces (whitespace characters: space, line feed, tab, etc.); and other characters. The “Text” window displays data on the wordforms, POS categories, and certain morphological features. The “Stressed tokens” window displays the list of words with user-placed accents, and the “Unknown tokens” window the list of words that are not listed in the system database. To determine the position of stress in a wordform, the speech synthesizer checks each word of the input text in its dictionaries; indeed, words with the same spelling might have different stresses. The system shows the information about words with an ambiguous accent in the “Homographs” window.

Fig. 2 The interface of the Internet version of the Text-to-speech synthesizer


Fig. 3 Intermediate results produced by the Text-to-speech synthesizer

The “Intonation markers” window displays information about the intonation markup. Speech synthesis is carried out sentence by sentence; sentences have a sufficient degree of intonational autonomy in the text and are in turn divided into separate syntagmas. As a rule, a syntagma consists of one word or a combination of words that have a certain semantic and intonational completeness. Because the rules for extracting syntagmas are complex and insufficiently developed, it is only possible to perform a superficial syntactic analysis using the available morphosyntactic information about the phrases and the punctuation. To perform the text-to-speech conversion automatically, we determine the intonation types of the syntagmas in narrative, interrogative, and exclamative sentences, according to the formal markers presented in Table 2. For each punctuation sign, we compute its formal marker and intonation portrait, which replaces the punctuation sign (Table 2). Thus, the sentence Стары бабёр з палёгкаю ўздыхнуў – след вады не абмане, прывядзе да вады! (Алесь Жук) [The old Beaver sighed with relief—the trace of water will not deceive, it will lead to water!] is analyzed by the synthesizer as: Стары бабёр з палёгкаю ўздыхнуў (C4) след вады не абмане (C3) прывядзе да вады (E2). The “Phonemic Text” and “Allophonic Text” windows present the user with the list of words in phonemic and allophonic form, respectively. The durations and frequencies of the allophones are displayed in the “Allophone characteristics” window. This is the only open-access TTS system available for the Belarusian language. The system offers users the possibility not only to listen to a synthesized audio file, but also to download it. Access to the intermediate results allows users to understand the intricacies of the sequential text-to-speech process, as presented in Hetsevich et al. (2019).


Table 2 Correspondence between type of intonation and formal marker (type of utterance / type of intonation: punctuation signs and intonation portraits)

Narrative / Finality: P1—intonation of “colon” [:], P2—intonation of “introduction” [)], P3—intonation of “semicolon” [;], P4—intonation of “dot” [.], P5—intonation of “ellipsis” [...], P6—intonation of “paragraph” [#]

Narrative / Non-finality: C1—intonation of “conjunction AND”, C2—intonation of “conjunction OR”, C3—intonation of “comma” [,], C4—intonation of “dash” [–], C5—intonation of “pre-introduction” [(], C6—intonation of lexical syntagmas

Question / Interrogative: Q1—single-syntagma question with a question word, Q1-1—a question with a question word containing two or more syntagmas, Q2—single-syntagma question without a question word, Q2-1—a question without a question word containing two or more syntagmas

Exclamation / Exclamative-imperative: E1—single-syntagma exclamation with an interjection, E1-1—an exclamation with an interjection containing two or more syntagmas, E2—single-syntagma exclamation, E2-1—an exclamation containing two or more syntagmas
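A minimal sketch of the punctuation-to-marker substitution applied to the example sentence above is given below (only a few markers from Table 2 are covered; the actual rules also take the sentence type and lexical cues into account):

# Substituting intonation markers for punctuation signs, as in the analysis
# of the example sentence above. Only a few markers of Table 2 are covered.
MARKERS = {",": "C3", "–": "C4", "—": "C4", ".": "P4", "!": "E2", "?": "Q2"}

def mark_intonation(sentence):
    out = []
    for token in sentence.split():
        # replace a punctuation sign by its formal marker
        out.append(f"({MARKERS[token]})" if token in MARKERS else token)
    return " ".join(out)

print(mark_intonation("след вады не абмане , прывядзе да вады !"))
# след вады не абмане (C3) прывядзе да вады (E2)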

3 The Transcription Generator

The Transcription Generator is the most illustrative proof of the importance of a linguistic, rule-based approach in NLP. The service converts graphical (alphabetical) Belarusian wordforms into a phonetic transcription, following the norms of modern pronunciation. TTS systems necessarily include several transcription generation algorithms, and usually contain several processors, one for each stage of the transcription, as we have seen in Fig. 1. An incorrect result produced at any stage by one of the processors significantly degrades the final result; it is therefore crucial to be able to detect the mistake produced by a given process and to correct it. This is possible if all processes are based on handcrafted data that can easily be corrected. The “grapheme-to-phoneme” conversion process is described in Hetsevich et al. (2014). Its algorithm determines the sequence of phonemes that corresponds to the input text. In Belarusian, many word spellings are close to their pronunciation and can thus be described by rules; our system uses these rules to convert graphic forms into their phonetic representation. However, it is not possible to produce the phonetic transcription of a wordform directly from its graphical form: we use an intermediate level that represents allophones.


This intermediate representation consists of a sequence of allophones and pauses separated by commas, cf. (Zahariev et al. 2016), as shown in the following:
• Original text: Напэўна не скажу, над якою рэчкаю векаваў стары дуб: ці то над Нёманам, ці то над Свіслаччу (Якуб Колас).
• Translation: “I don't think I can say for certain beside which river the old oak-tree stood,—whether it was the river Nioman, or whether it was the Śvisłač” (Yakub Kolas);
• TTS format: >,N004,A221,>,P001,E011,W013,>,N004,A223,/,>,N'002,E042,/,>,S002,K004,A232,>,ZH002,U020,/,>,#C3,>,N002,A022,T002,/,>,J'002,A242,>,K001,O033,>,J'012,U242,/,>,R002,E022,>,CH002,K004,A333,>,J'012,U343,/,>,V'012,E342,>,K004,A231,>,V011,A011,W013,/,>,S002,T002,A222,>,R002,Y022,/,>,D002,U021,P000,/,>,#P1,>,C'001,I042,/,>,T001,O022,/,>,N002,A022,T002,/,>,N'002,O141,>,M002,A112,>,N002,A121,M000,/,>,#C3,>,C'001,I042,/,>,T001,O022,/,>,N002,A022,T002,/,>,S'002,V'001,I042,>,S002,L004,A312,>,CH102,U320,/,>,#P4,
• Cyrillic transcription: [напэўна] [н'э] [скажу] | [нат] [йакойу] [рэчкайу] [в'экаваў] [стары] [дуп] || [ц'і] [то] [нат] [н'оманам] | [ц'і] [то] [нат] [с'в'іслач:у] ||
• International Phonetic Alphabet transcription: [naˈpɛwna] [ˈnʲɛ] [skaˈʐu] | [ˈnat] [jaˈkɔju] [ˈrɛʧkaju] [vʲɛkaˈvaw] [staˈrɨ] [ˈdup] || [ˈʦʲi] [ˈtɔ] [ˈnat] [ˌnʲɔˌmaˌnam] | [ˈʦʲi] [ˈtɔ] [ˈnat] [ˈsʲvʲislaʧʧu] ||
To produce the final transcription, it is necessary to access a database of correspondences of the form ⟨allophone, transcription⟩. Each allophone is represented by a code consisting of one to three Latin letters, an optional apostrophe (marking softness), and three Arabic numerals that characterize the type of junction of the phoneme. The initial allophone database contained 960 different allophones, described manually. An analysis of the correspondence between allophones and their phonetic transcription showed that abbreviated allophone designations—the phoneme name, the softness sign, and the first index—are sufficient for the transcription. Thanks to this observation, we were able to decrease the number of ⟨allophone, transcription⟩ correspondences to 99. Linguists developed this list of correspondences following the guide (Padluzhny 1989). We give a fragment of the resulting list in Table 3. Thus, the transcription of Belarusian texts uses the following resources:
– the set of correspondences “punctuation mark—intonation mark” P = {(pk, intk)}, where pk is the k-th punctuation mark, intk is the k-th intonation mark, and k is the number of correspondences;
– a database of “grapheme-phoneme” conversion rules;
– a database of “phoneme-allophone” conversion rules;
– a database of correspondences “allophone-transcription”.

Table 3 Fragment of the list of correspondences “allophone-transcription”
Abbreviated allophone code → transcription: A0 → а, A1 → À, A2 → a, B0 → б, B’0 → б', B1 → Б:, B’1 → б':


For example, the most common approach during the “grapheme-to-phoneme” conversion is to process character sequences from left to right; for each character of the input, we apply one or more rules to generate its corresponding phoneme. Two assimilation effects on the preceding consonant phoneme are typical in Belarusian: deafness-sonority (voicing) and hardness-softness (palatalization). Moreover, the assimilation effect on deafness-sonority can be intra-word or inter-word. The distribution of these effects to neighboring graphemes goes from right to left. Since the two effects do not affect each other, it is possible to process the “grapheme-phoneme” transformation in four consecutive stages:
1. Verify that the grapheme complies with the rules that treat canonical changes, check for the effects of assimilation of consonant phonemes by deafness-sonority, and replace it (in case of a match) with the corresponding phoneme group.
2. Replace the letter with a phoneme according to the standard rules.
3. Check the softness of the previous grapheme (a necessary but insufficient condition for softness).
4. Check the grapheme for compliance with the softening rules, and add softness to the phoneme in case of a match.
The structure of the expert rules consists of four blocks:
1. “Standard” rules replace a grapheme with a phoneme. For example, the grapheme “A” in Belarusian is often replaced with the phoneme “A”.
2. Exceptions to the standard replacement rules are expressed with regular expressions. They can represent an assimilation effect, a replacement, etc. For example, in Belarusian, the grapheme “Г” can turn into the phoneme “G” or “GH”, depending on its right context: горка [slide; hill]—GH,O,+,R,K,A; гузік [button]—G,U,+,Z',I,K.
3. Softening graphemes, i.e., those graphemes before which a soft sound may appear.
4. Softening rules, expressed as regular expressions, describe the conditions under which a grapheme turns into a soft phoneme. For example, the grapheme Н turns into the soft phoneme N’ before the consonants ДЗ, Ц, Й when followed by one of the vowels Е, Ё, Ю, Я, І, Ь, or before С, Л, Ц, З.
Examples of formalized grapheme-phoneme conversion rules are presented in Table 4.


Table 4 Fragment of “grapheme-phoneme” conversion rules for the Belarusian language

Standard “grapheme-to-phoneme” rules: Ж–ZH, З–Z, І–I, Й–J', К–K
Exceptions to the standard rules: Д(С)ТВ–C, (Д)[КСПТФХЦЧШ]–T, (Т)[БГДЗЖ]–D, (З)ДЖ–ZH, (З)[КПСТФХЦЧШ]–S
Softening graphemes: Е, Ё, Ю, Я, І
Standard softening rules: (Н)[ДЗЦЙ][ЕЁЮЯІЬ], (Н)[СЛЦЗ], (Л)[Л], (М)[М], ([ЗСН])[Д]
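A minimal sketch of how expert rules such as those in Table 4 can be applied with regular expressions is given below (only two context-dependent rules are shown; stage 3, checking the softness of the previous grapheme, is omitted):

# Applying context-dependent and standard grapheme-to-phoneme rules with
# regular expressions; the two context rules come from Table 4, the rest of
# the rule base is omitted.
import re

EXCEPTIONS = [(r"З(?=[КПСТФХЦЧШ])", "S")]   # З is devoiced before a voiceless consonant
SOFTENING  = [(r"Н(?=[СЛЦЗ])", "N'")]       # Н is softened before С, Л, Ц, З
STANDARD   = {"З": "Z", "Ж": "ZH", "К": "K", "А": "A", "Н": "N"}

def graphemes_to_phonemes(word):
    word = word.upper()
    phonemes = []
    for i in range(len(word)):
        rest = word[i:]
        for pattern, phoneme in EXCEPTIONS + SOFTENING:
            if re.match(pattern, rest):          # stages 1 and 4: context-dependent rules
                phonemes.append(phoneme)
                break
        else:
            phonemes.append(STANDARD.get(word[i], word[i]))  # stage 2: standard rules
    return phonemes

print(graphemes_to_phonemes("казка"))  # ['K', 'A', 'S', 'K', 'A'] — З devoiced before К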

Fig. 4 Text-to-speech format, Cyrillic, IPA and X-Sampa outputs

4 Word Paradigm Generator

The “Word Paradigm Generator” service is free and available at https://corpus.by/WordParadigmGenerator. It receives a word as input and returns its corresponding paradigm. If the word is not listed in the dictionary, the service returns the paradigms associated with words that are similar to the input word; for more details see (Hetsevich et al. 2016).


This service uses the grammatical dictionary of the Belarusian language (https://corpus.by/VoicedElectronicGrammaticalDictionary/?lang=be), which covers a large vocabulary but has certain gaps, primarily in technical terminology; it therefore needs to be updated. We have used this service to compile terminological dictionaries in the legal domain as well as in the medical domain, as presented in Varanovich et al. (2021). Texts that need to be processed often contain words unknown to the system. When voicing a text, it is very important to determine the stress of each word, because stress in Belarusian is unfixed and mobile. The main goal of this service is to compute not only the POS category of a word, but also its whole paradigm. The algorithm our team designed for the automatic generation of word paradigms consists of 11 consecutive, interdependent steps, which produce the most suitable paradigms for a given word; it looks for the closest paradigm(s) based on the last letters of the word, and is presented in Fig. 5. The dictionary used by the automatic synthesizer might contain mistakes in the location of the stress. It must therefore be updated when the system encounters words unknown to the automatic speech synthesizer, or when the services “Spell checker” and “Voiced Electronic Grammatical Dictionary” produce new words. The service “Unknown Words Processor” (https://corpus.by/UnknownWordsProcessor) presents the list of unknown words and allows users to set their stress and their POS category. However, this service does not consider the lemma and the other variants of the wordform (for example, the singular and plural forms in six cases for nouns). The “Word Paradigm Generator” service is used to completely close these gaps in the dictionary: it sets not only the POS category of the word, but also produces one or several most suitable paradigms for it. Figure 6 displays the graphical interface of the service, which contains the following areas:
– an input field for the wordform(s);
– a choice of dictionary (about the NooJ format, see Silberztein 2003);
– an optional selection of a tag and/or a POS category;
– the button “Generate possible paradigms!”, which starts the processing and returns the results.

Tags displayed after the “_” symbol show the grammatical meaning of the word (POS category, gender, number, case, etc.). If the input word cannot be found in the service dictionaries, the service generates paradigms of the input word based on similar words with the same grammatical meaning, as described in Zanouka (2017). For example, there is no paradigm of the word аўдыягід in the dictionary. When the service is executed, it produces paradigms based on all the words similar in writing that are found in the dictionary—гід, альдэгід, поліфармальдэгід, фармальдэгід, агід, эгід, see Fig. 7.
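The idea behind the paradigm search can be sketched as follows: when a word is absent from the dictionary, candidate paradigms are taken from the dictionary words that share the longest final letter sequence with it (the real algorithm has 11 steps and also uses grammatical tags; the dictionary entries below are invented):

# Suffix-based search for candidate paradigms: when a word is missing from
# the dictionary, the paradigms of the entries sharing the longest final
# letter sequence with it are proposed. The dictionary below is invented.
DICTIONARY = {"гід": "paradigm of гід", "альдэгід": "paradigm of альдэгід",
              "эгід": "paradigm of эгід", "год": "paradigm of год"}

def candidate_paradigms(word):
    best, candidates = 0, []
    for entry, paradigm in DICTIONARY.items():
        # length of the common suffix between the unknown word and the entry
        k = 0
        while k < min(len(word), len(entry)) and word[-1 - k] == entry[-1 - k]:
            k += 1
        if k > best:
            best, candidates = k, [paradigm]
        elif k == best and k > 0:
            candidates.append(paradigm)
    return candidates

print(candidate_paradigms("аўдыягід"))  # paradigms of гід, альдэгід and эгід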


Fig. 5 The algorithm of the service “Word Paradigm Generator”


Fig. 6 The graphical interface of the service «Word Paradigm Generator»

Fig. 7 Potential paradigms generated for a missing word


Difficulties arise when generating paradigms of rarely used or new words: for the word агмень, the service offers 17 potential paradigms. In such cases, users must rely on their own knowledge and on reference materials (dictionaries, reference books, etc.) to choose the correct paradigm. The “Word Paradigm Generator” service is the first open and free Belarusian online service that allows anyone to create and manage their own electronic dictionaries. The system is continually being improved.

5 Conclusion

We have presented three tools to process Belarusian in the form of web services: the “Text-to-Speech Synthesizer”, the “Transcription Generator”, and the “Word Paradigm Generator”. These tools are based on handcrafted, rule-based linguistic resources that contain grammatical rules, dictionaries and databases. They can be used to develop web applications that process large volumes of text and speech. In particular, for the “Transcription Generator” service, we had to develop a special linguistic resource to transcribe allophone text, in the form of a list of correspondences “letter—phoneme—allophone—transcription”. This service produces an accurate result in 98% of cases and is used in the “Text-to-Speech Synthesizer”. The “Text-to-Speech Synthesizer” service is based on a set of databases and linguistic rules. It voices Belarusian texts entered by users and produces a corresponding audio file that can be listened to, downloaded and saved to a computer. This service also generates intermediate results, including a normalized text, a phonemic text and an allophonic text. Potential applications of this service include call systems and information kiosks, voice and alarm notification systems, book reading systems, pedagogical applications, and talking computers for the visually impaired. The “Word Paradigm Generator” returns the paradigm of a word; if this paradigm is not already described, it proposes potential paradigms associated with words with similar endings. The service is used to develop and update dictionaries, including the dictionary used by the speech synthesizer, as well as dictionaries for specific domains (legal, medical). Because these services are based on carefully, meticulously handcrafted linguistic resources, they produce results with high accuracy. They offer efficient management interfaces, and the linguistic resources can easily be tested, fixed and updated. The above-mentioned services are available on the computational platform at http://corpus.by. They are free and complemented by courses in language processing and data analysis for the digital humanities in Belarusian. These linguistic resources are available to scholars, researchers and scientists from all spheres through single sign-on access.


Acknowledgments We would like to express our special gratitude to all the employees of the Speech Synthesis and Recognition Laboratory of the United Institute of Informatics Problems for the development of the presented resources. We also warmly thank our colleagues Valery Varanovich, Yauheniya Zianouka, Maryia Slesarava and Anna Dolgova for their great help in preparing the article for this volume.

References Biryla, Mikalaj V. (ed.), 1987. The dictionary of the Belarusian language: Orthography. Orthoepy. Accentuation. Inflection. BelSE, Minsk. In Belarusian: Слоўнік беларускай мовы: Арфаграфія. Арфаэпія. Акцэнтуацыя. Словазмяненне / пад рэд. М. В. Бірылы. Мінск: БелСЭ, 1987. 902 с. https://clarin-belarus.corpus.by/the-dictionary-of-the-belarusian-lan guage-1987-is-provided-for-clarin-vlo/ Hetsevich, Yuras S., and Boris M. Lobanov, 2010. The system of synthesis of Belarusian speech by text. In: Speech technologies, 1: 91–100. In Russian: Гецевич, Ю. С. Система синтеза белорусской речи по тексту / Ю. С. Гецевич, Б. М. Лобанов // Речевые технологии. 2010. № 1. С. 91–100. Hetsevich, Yuras S., Vladimir A. Zhitko, Sviatlana A. Hetsevich, Lesia I. Kajharodava, and Kiryl A. Nikalaenka, 2019. Designing natural language interfaces for reference systems. In Informatics, 16-3: 37–47. In Belarusian: Гецэвіч, Ю. С. Праектаванне натуральна-моўных інтэрфейсаў для даведкавых сістэм / Ю. С. Гецэвіч, У. А. Жытко, С. А. Гецэвіч, Л. І. Кайгародава, К. А. Нікалаенка // Інфарматыка. 2019. Т. 16, № 3. С. 37–47. Hetsevich, Yuras, Veronika Mandik, Valentina Rusak, Tatsiana Okrut, Boris Lobanov, Stanislau Lysy, and Dzmitri Dzenisiuk, 2014. The system of generation of phonetic transcriptions for input electronic texts in Belarusian. In: Pattern Recognition and Information Processing: Proceedings of the 12th International Conference, eds. A. Tuzikov, and V. Kovalev, pp. 81–85. UIIP NASB, Minsk. Hetsevich, Yuras, Yauheniya Zianouka, Siarhej Majeuski, Zmicier Dzienisiuk, and Anastasija Drahun, 2021. Computational platform for electronic text and speech processing in Belarusian, Russian and English. In Speech Technologies, 1-2: 37–46. In Russian: Гецевич, Ю. С. Компьютерная платформа для обработки электронного текста и речи на белорусском, русском и английском языках / Ю. С. Гецевич, Я. С. Зеновко, С. С. Маевский, Д. А. Денисюк, А. Е. Драгун // Речевые технологии. 2021. № 1-2. С. 37–46. Hetsevich, Yury, Valery Varanovich, Evgenia Kachan, Ivan Reentovich, and Stanislau Lysy, 2016. Semi-automatic part-of-speech annotating for Belarusian dictionaries enrichment in NooJ. In: Automatic processing of natural-language electronic texts with NooJ. NooJ 2016, eds. L. Barone, M. Monteleone, M. Silberztein. Communications in Computer and Information Science, vol. 667. Springer, Cham. Lobanov, Boris M., and Liliya I. Tsirulnik, 2008. Computer synthesis and cloning of speech. Belaruskaya Navuka Publ., Minsk. In Russian: Лобанов, Б. М., Цирульник, Л. И. Компьютерный синтез и клонирование речи. Минск: Бел. наука, 2008. 344 с. Padluzhny, Aliaksandar I. (ed.), 1989. Phonetics of the Belarusian standard language. Navuka i Tehnika Publ., Minsk. In Belarusian: Фанетыка беларускай літатурнай мовы / рэд. А. І. Падлужны. Мінск: Навука і тэхніка, 1989. 335 с. Rusak, Valentina P. (ed.), 2017. Orthoepical dictionary of the Belarusian language. Bielaruskaja Navuka Publ., Minsk. In Belarusian: Арфаэпічны слоўнік беларускай мовы / рэд. В. П. Русак. Мінск: Бел. навука, 2017. 757 c. Silberztein, Max 2003: NooJ Manual. https://nooj.univ-fcomte.fr/downloads.html. Taylor, Paul, 2009. Text-to-Speech synthesis. Cambridge University Press, N. Y.


Varanovich, Valery, Mikita Suprunchuk, Yauheniya Zianouka, Tsimafei Prakapenka, Anna Dolgova, and Yuras Hetsevich, 2021. Creation of a legal domain corpus for the Belarusian module in NooJ: Texts, Dictionaries, Grammars. In: 15th International Conference NooJ 2021: book of abstracts, eds M. Bigey, A. Richton, M. Silberztein, I. Thomas: 36–37. Besançon, France. Zahariev, Vadim, Stanislau Lysy, Alena Hiuntar, Yury Hetsevich, 2016: Grapheme-to-phoneme and phoneme-to-grapheme conversion in Belarusian with NooJ for TTS and STT systems. In: Automatic Processing of Natural-Language Electronic Texts with NooJ. NooJ 2015, eds. T. Okrut, Y. Hetsevich, M. Silberztein, H. Stanislavenka. Communications in Computer and Information Science, vol. 607. Springer, Cham. Zanouka, Evgenia, 2017. The enlargement of electronic lexical database by computational on-line free system. In: Open Semantic Technologies for Intelligent Systems: Proceedings of the International Conference, ed. V. Golenkov: 179–182. BSUIR, Minsk.

Part III

Linguistic Resources for Low-Resource Languages

A New Set of Linguistic Resources for Ukrainian

Olena Saint-Joanis

Abstract We have constructed a set of linguistic resources for Ukrainian that has allowed us to build various NLP applications for this language, including information retrieval and extraction, morphological, syntactic, semantic and statistical analysis, spell checking, and machine translation. Our goal was to develop a reliable tool that would allow students and teachers of the Ukrainian language to explore simple texts, as well as to allow researchers in the social sciences to analyze their own corpora of Ukrainian texts. We first review the various existing NLP software applications that can process Ukrainian texts, their functionalities, and their performance. We then describe the linguistic resources we have developed, and finally compare the results produced by both approaches.

Keywords Corpus Linguistics · Natural Language Processing · NooJ · Ukrainian language

1 Introduction Although interest in the Ukrainian language has increased greatly in recent years, it remains poorly described and formalized. The few Natural Language Processing (NLP) software applications available do not necessarily meet the needs of students or researchers. These tools have been developed using empirical approaches (e.g., statistical- and neural-network-based) and therefore their outputs do not give access to a solid linguistic interpretation. Also, they produce too many errors. In consequence, their usefulness is at least questionable.

O. Saint-Joanis (✉) Université de Franche-Comté, Besancon, France CREE, INALCO, Paris, France e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 M. Silberztein (ed.), Linguistic Resources for Natural Language Processing, https://doi.org/10.1007/978-3-031-43811-0_5


After studying the available empirical NLP applications, we chose to use the NooJ linguistic platform1 to process Ukrainian, because it contains the tools we need to develop linguistic resources in the form of dictionaries and orthographic, morphological, syntactic and semantic grammars. NooJ also contains tools to manage corpora and perform various statistical analyses on them, and it is well suited to building pedagogical applications. In the following, we give an overview of the theoretical approaches in NLP, present several tools based on empirical methods, and describe the linguistic resources we have developed for Ukrainian.

2 Theoretical Basis

2.1 Approaches to Designing Natural Language Processing Applications

Since the 1950s, numerous NLP applications capable of analyzing textual data have emerged. These tools often contain resources for several languages and can combine multiple functionalities that allow users to:
– display the frequency of units (occurrences, lemmas, parts of speech),
– find matches (examples of use in context) of single or collocated units,
– find collocations in parallel corpora,
– search by lexical field (semantic analysis),
– tag the text with information (morphological, syntactic, semantic analysis),
– integrate or create electronic dictionaries,
– manage corpora,
– generate texts in natural language,
– generate document summaries,
– automatically translate words or phrases,
– recognize and generate speech.

However, tools that provide all these features for multiple languages, and particularly for Ukrainian, are rare. For an NLP application to perform well, it must be able to identify textual units, their grammatical forms, their roles in the sentence, and their semantics, as closely as possible to what human speakers can do. It must be able to remove ambiguities, recognize proper nouns, multiword units, and inflected and derived forms, create syntactic links between the words of a sentence, locate lexical fields, and link together spelling variants. There are two approaches to designing NLP software applications:

1 NooJ is a free and open-source linguistic development environment, cf. www.nooj4nlp.org. NooJ’s theoretical and methodological frameworks are presented in Silberztein (2016).


– rule-based approaches rely on carefully handcrafted dictionaries and grammars – empirical (example-, statistical-, neural-network-based) approaches rely on data extracted from large training corpora. Rule-based models NLP software applications that use the rule-based approach rely on linguistic knowledge that has been previously described in the form of formalized dictionaries and grammar rules. These software applications are typically the fruit of the cooperation between linguists who elaborate dictionaries and describe linguistic rules (paradigms, relations between linguistic units) and computer scientists who construct software that can use these rules to perform various computations on texts. The rule-based approach derived from after studies initiated by Chomsky, with the help of the Fenchmathematician Schützenberger, who defined mathematical models called “generative grammars”2 to formalize natural languages, i.e., to describe with the help of mathematical equations. Therefore, Chomsky placed linguistics and the knowledge of grammar at the center of his research. I cite (Chomsky 1984): I approach the study of language with the assumption that linguistic knowledge can be properly characterized through a generative grammar, a system of rules and principles that assigns structural descriptions to linguistic expressions. According to this view, the basic concepts are those of “grammar” and “knowledge of grammar.”

Rule-based models involve work by linguists who analyze the phenomena of the language and build dictionaries and rules to describe them. As a result, NLP tools based on rule-based models are expensive and time-consuming to implement. However, the advantages of the rule-based approach are not negligible:

– Rules are readable and modifiable, and can therefore be corrected and enhanced,
– Errors are easy to detect and correct,
– Ambiguities can be removed with great precision,
– It is possible to process and locate linguistic units that are not limited to wordforms,
– Derived forms, including those that take into account palatalization and phonetic variants, can be linked to the corresponding dictionary entry,
– It is possible to perform analyses at various linguistic levels.

As of now, two NLP software applications available online contain linguistic resources for Ukrainian: GRAC3 and LanguageTool.4

– GRAC is a large collection of Ukrainian texts comprising 130,000 texts of different genres, dating from 1816 to the most recent period. Users can access

2. Regular, Context-Free, Context-Sensitive and Unrestricted Grammars are described in Chomsky and Schützenberger (1963).
3. GRAC (General Regionally Annotated Corpus of Ukrainian) is available at: https://uacorpus.org/Kyiv/en. Cf. Shvedova et al. (2021).
4. LanguageTool is available at: https://community.languagetool.org/analysis?lang=uk


the corpus through two web interfaces. KonText5 provides access to versions 10 and 11, and NoSketch Engine6 gives access to versions 10–16. Both interfaces allow simple or combined searches by lemma, by inflected form, by partial form, or by tag (morphological or semantic). The second interface also allows users to locate phrases, display their frequency, and create sub-corpora. A third interface, Plots (CQL7), allows users to build frequency graphs based on GRAC.
– LanguageTool is another corpus processor; it does not contain any pre-compiled corpus, but allows users to annotate their own small texts (size limited to 998 characters, spaces included).

However, both tools have serious limitations: they do not link spelling variants; therefore, these variants are indexed, located, and counted as if they were independent words. In the same way, perfective and imperfective forms of the same verb are not linked; therefore, users need to systematically enter two queries to find all the occurrences of a verb. Derived forms are not linked either, which in Ukrainian poses systematic problems, as masculine and feminine forms of occupation names are constructed via derivational operations, e.g., "викладач" [teacher] and "викладачка" [teacher]. The same problem occurs with derived adjectives such as "Настусин," which are not linked with their original nominal form "Настуся". Moreover, these tools do not solve lexical and morphological ambiguities,8 which makes their search functionality produce a large number of false positives. In conclusion, GRAC and LanguageTool bring significant value to linguists, but they cannot satisfy the needs of students who learn Ukrainian.

Empirical methods

NLP software applications based on an empirical method9 rely on knowledge obtained empirically from training corpora. These training corpora must have been previously tagged manually, and are typically constituted by large collections of texts, so that the software can reference as many morphological, syntactic and/or semantic properties associated with a text unit as possible. Note that the text units processed by software based on training corpora are graphical wordforms, i.e., sequences of characters, not linguistic units.

5. https://parasol.vmguest.uni-jena.de/grac/corpora/corplist
6. https://parasol.vmguest.uni-jena.de/grac_crystal/#dashboard?corpname=grac16
7. Corpus Query Language, a query language used to search annotated corpora.
8. GRAC and LanguageTool can be interfaced with the statistical tagger TagText.groovy, which aims at analyzing wordforms not listed in their dictionaries and solving lexical and morphological ambiguities. However, as we will see in the next section, the resulting tagged texts contain many mistakes.
9. Under the name of "empirical" models, we include statistical, probabilistic, and neural-network-based methods.


When parsing a text, these tools first split the text into sequences of sentences and graphical wordforms via a process of tokenization,10 then compare the contexts of wordforms with the ones found in the training corpus to infer their most probable analysis. Typical analyses are lemmatization11 and stemming.12 NLP applications based on training corpora that can process texts in Ukrainian include Sketch Engine, TreeTagger, and RNNTagger. One great advantage of the empirical approach is that it is inexpensive and quick to implement. As a result, software applications based on training corpora constitute the majority of the available NLP tools. However, these tools are not without flaws:

– they need large training corpora; some languages, including Ukrainian, lack open-access corpora, therefore companies wishing to develop NLP tools for these languages must first build the training corpora;
– the available training corpora contain many mistakes, including even spelling mistakes, because they were often obtained via automatic optical character recognition software, and then tagged by people who are not always qualified linguists;13
– automatic taggers rely on rules that were automatically computed from the training corpus and are mostly incorrect.14
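To make the empirical approach concrete, the following minimal sketch (in Python, with invented toy counts; it is not the implementation of any particular tagger) shows how such a tool picks the most probable tag for a wordform from frequencies observed in a manually tagged training corpus; wordforms that never occurred in the corpus can only be guessed.

from collections import Counter, defaultdict

# Toy "training corpus": (wordform, tag) pairs produced by a human annotator.
training = [("мов", "CONJ"), ("мов", "NOUN"), ("мов", "CONJ"),
            ("бог", "NOUN"), ("що", "PRON"), ("що", "CONJ")]

counts = defaultdict(Counter)
for wordform, tag in training:
    counts[wordform][tag] += 1

def most_probable_tag(wordform):
    # Pick the tag most often seen with this wordform in the training corpus;
    # for an unseen wordform the tagger has no evidence and must guess.
    if wordform in counts:
        return counts[wordform].most_common(1)[0][0]
    return "GUESS"

print(most_probable_tag("мов"))   # CONJ (2 out of 3 occurrences in the toy data)
print(most_probable_tag("брят"))  # GUESS: a misspelling receives no reliable analysis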

3 Statistical Tools for Ukrainian

We now present some available NLP tools based on empirical methods that perform linguistic analyses of Ukrainian texts.

3.1 Sketch Engine

Sketch Engine15 hosts more than 700 corpora for over 100 languages. It can perform morphological and semantic tagging, statistical analysis, word searches according to different criteria (word, lemma, phrase, character sequence), collocation searches, and can produce graphs, word clouds, dictionaries from texts, and word and n-gram

10. Process that segments the sentence into a sequence of wordforms.
11. Process that links a wordform to the corresponding lexical entry.
12. Process that computes the root, prefix and suffix of a wordform.
13. Cf. Silberztein (2018): "As a result, these corpora contain approximately one error every twenty words, which is well known (see e.g., Volokh & Neumann 2011; Dickinson 2015)."
14. Silberztein (2018) gives various examples of incorrect rules used by Brill's tagger as well as by GATE (http://gate.ac.uk) to tag the Penn Treebank and the OANC corpora.
15. Cf. Kilgarriff et al. (2004), https://app.sketchengine.eu.


Fig. 1 Errors related to lemmatization in Sketch Engine

frequency tables as well as bilingual term extraction. Sketch Engine is free for 30 days. The Ukrainian Web 2020 and 2014 corpus (ukTenTen20) is constituted by a large collection of varied texts obtained from the Web. It is possible to download the corpus (paying option) and to create one's own corpus. This corpus was lemmatized by CSTLemma, and then annotated by RFTagger.16 The probabilistic model uses the MULTEXT-East Ukrainian and Universal Dependencies Part of Speech (POS) tag set. Sketch Engine's lemmatizer was trained on the UK-Brown dictionary available on GitHub, and its tagger was trained on the Universal Dependencies data with a support dictionary harvested from the UK-Brown dictionary. To test Sketch Engine's reliability, we downloaded and compiled our own corpus, composed of 10,434 forms.17

Lemmatization errors in Sketch Engine

Sketch Engine displays the 100 most frequent lemmas of the corpus. However, out of these 100 lemmas, we have identified 21 errors, which can be observed in Fig. 1. For example, the wordforms "баронес" and "настус" are presented as lemmas, whereas the first one should be analyzed as the genitive plural of "баронесa" and the

16. The RFTagger (developed by Helmut Schmid) is a tool for the annotation of text with fine-grained part-of-speech tags: https://www.cis.uni-muenchen.de/~schmid/tools
17. Our test text contains two parts: a recent newspaper excerpt and an excerpt from a literary work from the early twentieth century.


Павлусь/павлитися/Ncmsnn Малинка/малинк/Mlcmsn був/бути/Vapis-sm гарний/гарний/Afpmsnf з/з/Spsg лиця/лиц/Ncnsgn ,/,/X мов/мова/Css чорнявий/чорнявий/Aomsnf та/та/Ccs кучерявий/кучерявий/Ao-msnf бог/бог/Ncmsny Аполлон/аполлон/Aomsns././X Настуся/наститися/Ao-fsns Самусівна/самусівний/Ncfsnn була/бути/Vapis-sf пишна/пишний/Ap-fsns-ep на/на/Spsa вроду/врода/Ncfsan ,/,/X мов/мова/Css Афродіта/афродіт/Ncfsnn ,/,/X котра/котрий/Pr--f-sna тільки/тільки/Q що/що/Pr--nnsnn вихопилась/вихопитися/Vmpip3s з/з/Spsg морської/морський/Ao-fsgf піни/піна/Ncfsgn на/на/Spsl хвилях/хвиля/Ncfpln ,/,/X але/але/Ccs ще/ще/R не/не/Q помолилася/помолитися/Vmpip3s богу/бог/Ncfsan ,/,/X не/не/Q вмилась/вмитися/Vmeis-sf і/і/Ccs не/не/Q обтерла/обтерти/Vmeis-sf гаразд/гаразд/Spsi рушником/рушник/Ncnsin морської/морський/Ao-fsgf піни/піна/Ncfsgn на/на/Spsl виду/вид/Nc

Fig. 2 Morphological tagging errors in Sketch Engine

second is not an independent form: it is the root of the word "настуся", which does not exist without an ending. The wordform "калатайти" has also been presented as a lemma (instead of "калатати"). In the text, this verb is used in the singular imperative; therefore, the lemmatizer should have first removed the imperative ending -й and then added the infinitive ending -ти. The affective proper names "Павлусь" and "Настуся" were incorrectly lemmatized as the verbs "павлитися" and "наcтитися"; the common masculine plural noun "боєприпaси" was incorrectly lemmatized as a verb "боєприпасти".

Morphological tagging errors in Sketch Engine

Sketch Engine offers the possibility to search for wordforms and then display their tags in context. After performing several tests, we noticed many mistakes. Figure 2 shows an extract composed of 2 sentences and 45 wordforms. Lemmas are written in italics (incorrect lemmas in blue) and tags are in bold (wrong tags in red). We can see that 15 wordforms have been associated with incorrect morphological tags, and 8 wordforms have been associated with incorrect lemmas.

Errors related to spelling mistakes

Statistical tools are designed to process all wordforms in the training corpus. Therefore, they treat spelling errors as correct words, and tag and lemmatize them. In our corpus, the following sentence of six wordforms contains three spelling mistakes:

Сeстра четає книгу, а брят слихає

instead of:

Сeстра читає книгу, а брат слухає

The incorrect wordforms do not exist. Yet, Sketch Engine has tagged them as follows:

четає/V-pip3s-/четати
брят/Ncmsnn/бря
слихає/V-pip3s-/слихати


Fig. 3 ALUs in Sketch Engine

In consequence, if one needs to build a dictionary from this corpus, Sketch Engine will produce a dictionary that contains erroneous lexical entries. Moreover, as these incorrect wordforms are treated as correct, the results of the statistical analyses are incorrect. Note that these problems would not exist with software applications that access a properly handcrafted dictionary: if the software does not find a wordform, it would tag it as unknown rather than compute an incorrect lemma. If the software could process morphological grammars, then it could recognize, tag and lemmatize wordforms absent from the dictionary but constructed via productive morphological phenomena, as we will see in Sect. 4.

Spelling variants

We also checked how spelling variants are treated. Unfortunately, we realized that they were lemmatized with different lemmas. For example, for the variants "грунт/ґрунт", instead of producing a single lemma for these two variants ("ґрунт" according to the current orthographic standard), Sketch Engine produced two different lemmas. This is a problem because the frequency of the word "ґрунт" (at the basis of all statistical analyses) will be underestimated.

Atomic Linguistic Units

The vocabulary of a natural language is the set of its atomic linguistic units (ALUs). In a linguistic NLP software application, each ALU must correspond to one, and only one, lexical entry, which behaves syntactically and semantically as a single unit. In Ukrainian, for example, many adverbs or prepositions are composed of several wordforms. We wanted to see if Sketch Engine can distinguish these units. For this purpose, we looked for the two adverbs "кінець кінцем" and "нога в ногу" in the corpus. Unfortunately, we found that these two units are treated as sequences of single units, see Fig. 3: here, the first ALU is represented as if it corresponded to a sequence of a noun in the Nominative, followed by a preposition, followed by a noun in the Accusative; the second adverb is represented as if it corresponded to a noun in the Nominative followed by a noun in the Instrumental. Note that a linguistic-based NLP software application would have tagged these adverbs correctly, as we will see below.

In conclusion, Sketch Engine is a very useful tool to process many texts in many languages, thanks to its large number of functionalities, but it cannot be used reliably for Ukrainian: it produces too many lemmatization and tagging mistakes, it does not differentiate spelling errors from legitimate wordforms, it does not process spelling variants, and it does not recognize multiword units. Therefore, the statistical analyses it produces are unreliable.
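The contrast with a dictionary-based lexical parser can be sketched as follows (a simplified Python illustration with a three-entry toy dictionary, not NooJ's actual implementation): a wordform absent from the dictionary is explicitly flagged as unknown instead of receiving an invented lemma, so spelling mistakes such as "брят" never contaminate frequency counts.

# Toy handcrafted dictionary: wordform -> (lemma, tag). Real dictionaries are
# generated from lexical entries and their inflectional paradigms.
dictionary = {
    "сестра": ("сестра", "NOUN"),
    "читає":  ("читати", "VERB"),
    "книгу":  ("книга",  "NOUN"),
}

def lookup(wordform):
    # Return the attested analysis, or flag the wordform as UNKNOWN.
    return dictionary.get(wordform.lower(), (None, "UNKNOWN"))

for w in ["Сестра", "четає", "книгу", "брят"]:
    print(w, lookup(w))
# "четає" and "брят" are flagged as UNKNOWN rather than lemmatized incorrectly.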

3.2 TreeTagger and RNNTagger

As the RFTagger used by Sketch Engine for the Ukrainian module is not available for testing, we checked how other NLP software applications developed by the same author work. These are TreeTagger and RNNTagger, two well-known tools in the NLP world, which have Ukrainian models freely available.

Description of the tools

TreeTagger, one of the first tools developed by Helmut Schmid (1994), annotates texts with Part of Speech (POS) information and links inflected forms to their lemma. However, it does not assign tags for grammatical subcategories, e.g., case, conjugation, gender. There are resources for 28 languages in the current version,18 among them several Slavic languages, including Ukrainian. Online interfaces are available for 17 languages,19 but not for Ukrainian; however, one can tag Ukrainian texts on a Linux machine. The tool uses the Universal Dependencies POS tag set. After tagging the text, it displays the result in three columns: inflected form, POS, lemma. It also does partial recognition of unknown units, that is, it assigns a POS tag to all units, but when it does not find any corresponding lemma, it displays the special tag <unknown>.

RNNTagger uses neural networks to perform the tagging. It was developed by Helmut Schmid in 2019 (Schmid 2019). Like TreeTagger, RNNTagger tags wordforms with their POS and lemma. It supports 30 languages, including Ukrainian, as well as 11 ancient languages. Its website states that RNNTagger has a better accuracy than TreeTagger and can lemmatize all wordforms; however, larger parameter files require PyTorch; without a GPU, the processing is rather slow, and lemmas of unknown tokens are guessed and not guaranteed to be always correct.20 RNNTagger uses the Universal Dependencies POS tag set combined with the MULTEXT-East Ukrainian tag set, which gives very detailed morphological information.

TreeTagger

We tested TreeTagger with the same text we used for testing Sketch Engine. At first sight, the result may seem better: we detected only 315 tagging errors (3% of the text). However, TreeTagger cannot be considered a reliable NLP tool.

– First, 557 of the inflected forms were not lemmatized, even though they are very common words. We also noticed that some inflected forms of the same lemma were incorrectly associated with different POS tags.

18. Helmut Schmid, https://www.cis.lmu.de/~schmid/tools/TreeTagger
19. https://cental.uclouvain.be/treetagger/ Service provided by the Centre de Traitement automatique du Langage (CENTAL), Université catholique de Louvain.
20. https://www.cis.uni-muenchen.de/~schmid/tools/RNNTagger/


Table 1 POS tags for "Настуся" in TreeTagger

   Inflected form   POS     Lemma
   Настусю          PROPN   —
   Настуся          VERB    —
   У                ADP     у
   Настусі          NOUN    —

Table 2 Lemmatization of "Київщині" and "Київщини" in TreeTagger

   Inflected form   POS     Lemma
   і                CCONJ   і
   на               ADP     на
   Київщині         PROPN   Київщина
   Педагогів        NOUN    педагог
   Київщини         PROPN   —

Table 3 Lemmatization of "Настусина" and "Настусині" in TreeTagger

   Inflected form   POS     Lemma
   Настусина        NOUN    Настусин
   Настусині        ADJ     Настусиній

We can observe this in Table 1. The first wordform has been labeled as PROPN (proper name), the second as a verb (probably because of its ending in "ся"), and the third as a common noun, whereas all three wordforms are inflected forms of the proper name "Настуся". An NLP tool that has access to a dictionary associated with a morphological grammar would not have had any problem processing these wordforms, as we will see below. It is also surprising that the inflected forms of the same word are not even linked to the same lemma, as can be seen in Tables 2 and 3. Table 2 shows that the lemma of "Київщинi" (Locative preceded by the preposition на) was correctly computed, whereas the lemma of "Київщини" (Genitive sg. preceded by a noun in Genitive pl.) was not. One guess is that, as the training corpus was not large enough to contain any occurrence of this wordform, it was processed as a proper name because of its capitalization. In Table 3, we see two forms of the same word "Настусин". This word is an adjective (not a noun, as noted for the first form). The second form was correctly associated with the POS ADJ, but its lemma is incorrect: "Настусиній" is itself an inflected form of "Настусин". Software that has access to a dictionary associated with a morphological grammar describing the inflectional paradigm of each lexical entry would not have produced these mistakes. Note also that linguistic-based NLP software can also process derivations, which is a much more complex operation, as it leads to various transformations of the lemma (via addition of suffixes or prefixes), and sometimes even a change of POS category (e.g., from a noun, create an adjective or a verb which have different


inflectional paradigms). Here, the inflected forms "Настусина" and "Настусині" should be linked to the lemma "Настуся", because they are derived from the noun "Настуся". There are many other tagging mistakes in the training corpus. For example, the word "нема" was incorrectly tagged as a verb and lemmatized as "немати", even though this "verb" does not exist. In fact, "нема" should be tagged as a predicate, but this category does not exist in the tag set used by TreeTagger. These mistakes would not have been produced by linguistic-based NLP software.

RNNTagger

We also checked RNNTagger, which is based on the much newer neural-network-based approach, with the same corpus of 10,434 inflected forms. We found that all wordforms were lemmatized, but 284 of them were lemmatized incorrectly. Table 4 displays typical mistakes. The first two wordforms are proper names (a diminutive of a Ukrainian feminine name and the name of the French city of Nice). The following two wordforms have been incorrectly tagged as feminine nouns; in fact, they are adverbs. We suspect that the tool was misled because their endings look like endings of feminine nouns. The wordform "трубами-димарями", composed of two independent and variable units, was incorrectly lemmatized (the second part was reduced to its root). The sixth and seventh wordforms were lemmatized incorrectly, probably because the system did not consider the alternation of the vowels -e-/-i- and the mobility of the letter -e-. The wordform "матері" was incorrectly lemmatized as a regular feminine noun, although its inflectional paradigm is an exception. The last two wordforms have been tagged as masculine nouns rather than feminine nouns: the first was used in the text in the genitive plural and therefore has no ending (like masculine nouns in the nominative), whereas the second was used in the text in the nominative and has the ending of masculine words in the genitive. Note that these mistakes are similar to the ones produced by TreeTagger. We analyzed the same sentence by using Sketch Engine associated with RFTagger (see Fig. 2) and still found numerous mistakes: out of 45 wordforms, 11 were incorrectly tagged. We can observe the result of this test in Table 5. In conclusion, both TreeTagger and RNNTagger are not satisfactory for analyzing Ukrainian texts because they produce too many tagging and lemmatization mistakes, which makes any NLP software application using them produce unreliable and often linguistically incorrect results.

Table 4 Lemmatization errors in RNNTagger

      Inflected form     Lemma in RNNTagger   Tag in RNNTagger   Correct lemma and tag
   1  Настуся            настися              Verb               Настуся, Feminine Proper Name
   2  Ніцці              Ніцець               Masculine noun     Ніццa, Feminine Proper Name
   3  кудою              куда                 Feminine noun      кудою, Adverb
   4  змалку             змалка               Feminine noun      змалку, Adverb
   5  трубами-димарями   трубами-димар        Masculine noun     труба-димар, Masculine noun
   6  шатрами            шатр                 Masculine noun     шатeр, Masculine noun
   7  льоду              льод                 Masculine noun     лiд, Masculine noun
   8  матері             матеря               Feminine name      мати, Feminine noun
   9  царівен            царівен              Masculine noun     царівнa, Feminine noun
  10  неня               нінь                 Masculine noun     неня, Feminine noun



Table 5 Tagging errors in RNNTagger

    #  Wordform     Tag        Lemma
    1  Павлусь      Npmsny     Павлусь
    2  Малинка      Npmsgy     Малинко
    3  був          Vapis-sm   бути
    4  гарний       Afpmsnf    гарний
    5  з            Spsg       з
    6  лиця         Ncnsgn     лице
    7  мов          Css        мов
    8  чорнявий     Ao-msnf    чорнявий
    9  та           Ccs        та
   10  кучерявий    Ao-msnf    кучерявий
   11  бог          Ncmsny     бог
   12  Аполлон      Npmsny     Аполлон
   13  Настуся      Ncfsnn     настуся
   14  Самусівна    Npmsgy     Самусівн
   15  була         Vapis-sf   бути
   16  пишна        Afpfsns    пишний
   17  на           Spsa       на
   18  вроду        Ncfsan     врода
   19  мов          Css        мов
   20  Афродіта     Npfsny     Афродіта
   21  котра        Pr--f-sna  котрий
   22  тільки       Q          тільки
   23  що           Q          що
   24  вихопилась   Vmeis-sf   вихопитися
   25  з            Spsg       з
   26  морської     Ao-fsgf    морський
   27  піни         Ncfsgn     піна
   28  на           Spsl       на
   29  хвилях       Ncfpln     хвиля
   30  але          Ccs        але
   31  ще           R          ще
   32  не           Q          не
   33  помолилася   Vmeis-sf   помолитися
   34  богу         Ncmsgn     біг
   35  не           Q          не
   36  вмилась      Vmeis-sf   вмитися
   37  і            Ccs        і
   38  не           Q          не
   39  обтерла      Vmeis-sf   обтерти
   40  гаразд       R          гаразд
   41  рушником     Ncmsin     рушник
   42  морської     Ao-fsgf    морський
   43  піни         Ncfsgn     піна
   44  на           Spsl       на
   45  виду         Ncmsln     вид

4 Ukrainian Linguistic Resources

We are in the process of developing a set of linguistic resources for Ukrainian, whose first version has been published as open source and can be freely downloaded from https://nooj.univ-fcomte.fr/resources.html. This module is composed of an electronic dictionary associated with morphological grammars that describe inflectional and derivational paradigms, morphological grammars that describe productive morphology and can recognize wordforms absent from the dictionary, as well as syntactic grammars that can solve various lexical and morphological ambiguities.

4.1 Dictionary

In its first version, the dictionary contains 167,098 entries (lemmas). It covers the current normative language, according to the 1992 and the 2019 spelling standards. The main source for the list of entries was the Open Source dictionary in its version 2.9.1 (Rysin 2016). We manually described each lexical entry, and then


supplemented the source with additional adverbs, prepositions, and interjections from the Goroh online dictionary.21 The dictionary contains 13,176 verbal entries, 74,838 nouns, 534 deciphered abbreviations, 51,135 adjectives, 13,484 adverbs, 144 numerals, 105 pronouns, 119 conjunctions, 143 interjections, 175 prepositions, 135 particles, and 39 predicates. Each lexical entry is a lemma, associated with its POS category and properties. We have described all POS categories, property names and their potential values in a meta-grammar. Each lexical entry is potentially associated with its inflectional (FLX) and derivational (DRV) paradigms.22 The following is an example of a lexical entry:

новина,NOUN+Feminine+Inanimate+FLX=ГРУПА

The lexical entry "новина" is associated with the category "NOUN"; its gender value is "Feminine" and its distributional value is "Inanimate".23 The property "FLX=ГРУПА" states that the inflectional paradigm for the lexical entry is described by the inflectional grammar rule named "ГРУПА":

ГРУПА = <E>/Nominative+Singular | <B>и/Genitive+Singular | <B>у/Accusative+Singular | <B>і/Dative+Singular | <B>ою/Instrumental+Singular | <B>і/Locative+Singular | <B>о/Vocative+Singular | <B>и/Nominative+Plural | <B>/Genitive+Plural | <B>и/Accusative+Plural | <B>ам/Dative+Plural | <B>ами/Instrumental+Plural | <B>ах/Locative+Plural | <B>и/Vocative+Plural ;

In this grammar rule, for instance, the term "<B>о/Vocative+Singular" states that if one deletes the last letter of the lexical entry (operator "backspace" <B>) and then adds an "о", one produces a Vocative Singular form.24 When the lexical parser applies this dictionary and its associated paradigms to a text, it produces a Text Annotation Structure (TAS) that contains annotations representing all potential linguistic analyses, for all recognized linguistic units, such as the following one for "новина" (Fig. 4). Lexical entries may also be associated with one or more derivational paradigms (property DRV). Describing derivation allows us to associate a lexical entry with its derived forms, which themselves can be of a different POS category, and have their

21. Goroh is available at: https://goroh.pp.ua
22. The FLX and DRV paradigms are described by context-free transduction grammars.
23. In the NooJ platform, the grammar "properties.def" links the names and values of each property, e.g., "Gender = Masculine | Feminine". Therefore, the lexical feature "+Feminine" is equivalent to the property name-value pair "Gender=Feminine".
24. NooJ offers a dozen basic operators such as <E> (empty string) and <B> (delete the current character), as well as an operator that duplicates the current character, etc. Specific operators have been added for several languages, e.g., to finalize a consonant (in Hebrew) or to remove an accent (in Spanish).


Fig. 4 Tagging of an arrow shape, linked to the “новина” entry

own inflectional paradigms. This opens many applications; for example, one may want to link feminine occupation names to masculine ones, or feminine nationality names to masculine ones, which in Ukrainian are expressed by nouns rather than by adjectives. The derivation from "викладач" [teacher] to "викладачка" [teacher] is described as follows:

викладач,NOUN+Masculine+Animate+FLX=ТОВАРИШ+DRV=PROFESSION_КА:ТІТКА

Here, the inflectional paradigms ТОВАРИШ and ТІТКА are similar to the inflectional paradigm ГРУПА presented above, whereas the derivational paradigm is described as follows:

PROFESSION_КА = ка/NOM+Feminine ;

This rule states that by adding to the original lexical entry the suffix "к" followed by the ending "а", one obtains a new form which is a feminine noun. The subsequent inflectional rule ТІТКА is then applied to generate all the inflected forms of the derived noun. We have used this mechanism of inflectional and derivational paradigms to link the two forms of the verbal aspectual pair (Imperfective/Perfective). As far as we know, no available NLP software tool manages aspect, and yet linking the Imperfective and Perfective forms is very important for many NLP applications; for instance, these two forms are translated by the same verb in non-Slavic languages. Linking them has allowed us to treat them as two facets of the same linguistic unit and to create one single entry in the dictionary; the link between the two forms of the verb is formalized through a derivation such as the following:

читати,VERB+FLX=ЧИТАТИ+DRV=ПРО

Here, the imperfective lexical entry "читати" is associated with the category VERB. The property FLX=ЧИТАТИ is used to associate the lexical entry with its inflected forms. The property DRV=ПРО is used to associate the imperfective lexical entry "читати" with its perfective form "прочитати". This mechanism has allowed us to describe 6604 pairs of imperfective and perfective verbs. Thanks to this pairing, a simple query such as <читати>, when applied to any Ukrainian text, will find all occurrences of all conjugated


Fig. 5 Concordance produced by the query <читати>

Fig. 6 Tagging an abbreviation

forms of both the perfective and imperfective forms of the verb "читати", as we can see in Fig. 5. Pairing perfective and imperfective forms of verbs will also allow us to construct a transformational grammar capable of automatically producing imperfective sentences from the corresponding perfective sentences, and vice versa, using the methodology described by Silberztein (2016). Transformational grammars also constitute a great pedagogical tool to teach Ukrainian as a second language. Electronic dictionaries offer the possibility to link lexical entries with their spelling variants. We especially appreciate this possibility, because Ukrainian has several spelling standards. The latest one, from 2019, for example, replaces the letter -e- with -є- after vowels, as in "проект" = "проєкт". To link two spelling variants, we can enter double entries in the dictionary, like this:

проект,проєкт,NOUN+Masculine+Inanimate+FLX=ЛИСТ

We can use this mechanism to link abbreviations with their explicit forms in the same way:

АЕС,атомна електростанція,NOUN+Abbreviation+Feminine+Invariable

We can see how this abbreviation will be analyzed in Fig. 6. Linking abbreviations and their meaning is a useful pedagogical tool, as learners of Ukrainian can locate both forms in texts, click an abbreviation and get the corresponding explicit form, etc.
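A minimal sketch of the effect of such double entries (with invented toy data; this is not NooJ's internal representation): every surface variant is mapped to one and the same lexical entry, so occurrences of either spelling, or of an abbreviation and its explicit form, are counted together.

from collections import Counter

# Double entries: every surface variant points to a single lexical entry.
variants = {
    "проект": "проєкт",          # pre-2019 spelling -> current standard
    "проєкт": "проєкт",
    "АЕС": "атомна електростанція",
    "атомна електростанція": "атомна електростанція",
}

tokens = ["проект", "проєкт", "АЕС", "проєкт"]
frequencies = Counter(variants.get(t, t) for t in tokens)
print(frequencies)   # Counter({'проєкт': 3, 'атомна електростанція': 1})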


Fig. 7 Tagging spelling variants

This mechanism can also be used to construct a dictionary of dialects or normative variants, as we might want to link dialectal forms with their corresponding normative form. For example, we can enter the orthographic norm zhelexivka25 in the dictionary; this norm represents orthographic variants such as: "лїс" for "ліс", "сьмішно" for "смішно", "з 'уміти" for "з 'уміти", "житє" for "житя".26 Variants can then be associated with a special tag to indicate their origin. The result is shown in Fig. 7. Other spelling variations can be formalized. For example, before 2019, the word "міні-спідниця" was spelled with a hyphen, whereas it must now be spelled as a single word, "мініспідниця". Current texts can therefore contain both spellings, especially as Microsoft's spell checker has not yet been updated. We link these variants in the following way:

міні_спідниця,NOUN+Feminine+Inanimate+FLX=ҐУЛЯ

The special character "_" stands for "hyphen or concatenated". Other special characters are available to represent various productive variations.
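To illustrate how an inflectional paradigm such as ГРУПА (presented above) can be interpreted, here is a minimal Python sketch assuming only the two NooJ-like operators used in that rule: <E> keeps the lemma unchanged and <B> deletes its last letter. It is an emulation of the mechanism for illustration, not NooJ's implementation.

# Each term of the paradigm: (operator, suffix, morphological information).
# "<B>" deletes the last letter of the lemma; "<E>" leaves it unchanged.
GRUPA = [("<E>", "",   "Nominative+Singular"),
         ("<B>", "и",  "Genitive+Singular"),
         ("<B>", "у",  "Accusative+Singular"),
         ("<B>", "ою", "Instrumental+Singular"),
         ("<B>", "о",  "Vocative+Singular"),
         ("<B>", "ам", "Dative+Plural")]        # excerpt of the full rule

def inflect(lemma, paradigm):
    forms = []
    for op, suffix, info in paradigm:
        stem = lemma[:-1] if op == "<B>" else lemma
        forms.append((stem + suffix, info))
    return forms

for form, info in inflect("новина", GRUPA):
    print(form, info)
# новина Nominative+Singular, новини Genitive+Singular, новину Accusative+Singular, ...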

4.2 Morphological Grammars

The available Ukrainian dictionaries are quite large; for example, VESUM,27 the largest online dictionary, currently contains more than 418,000 entries. Moreover, they are constantly updated, as their authors add neologisms, derived lemmas, named entities or metaphoric expressions. There is also a common misconception that empirical systems can better cover the existing vocabulary of a given language than linguistic systems, and that they can better handle new or unknown wordforms. However, we have seen that TreeTagger and RNNTagger have failed to lemmatize and tag correctly the inflected forms of the proper noun "Настуся" (see Figs. 1 and 2; Tables 1, 4, and 5). In fact, only human speakers with a good background in linguistics can describe new elements of a vocabulary reliably. To reliably recognize and tag wordforms that

25. Phonetic spelling in Western Ukraine, 1892–1922.
26. Examples given in Matviyas (2011).
27. VESUM is available at https://r2u.org.ua/vesum/ and https://github.com/brown-uk/dict_uk


Fig. 8 Grammar for affective forms of feminine nouns

Fig. 9 Annotation for the affective “Настуся”

are not yet listed in our dictionary, we have therefore developed a set of morphological grammars that use productive patterns to infer their properties.

Grammars for affective nouns

We have constructed three morphological grammars that produce all affective forms for masculine, feminine, and neuter nouns. For example, for the feminine nouns, the left side of Fig. 8 displays the main graph of the grammar. This graph recognizes feminine nouns in the nominative (thanks to lexical constraints), then replaces their suffix (e.g., "а") with the suffixes described in the 17 embedded graphs, which correspond to affective suffixes. The embedded graph for the affective suffix "-уся" is displayed on the right side of Fig. 8; it recognizes 14 inflected forms. By combining the following lexical entry:

Настя,Анастасія,NOUN+Feminine+Animate+FLX=МАТУСЯ

with the grammar shown in Fig. 8, the lexical parser can recognize, tag and lemmatize all inflected forms of the affective "Настуся", as we can see in Fig. 9.

Grammars for adjectives of belonging

We have also seen that empirical tools incorrectly tag adjectives of belonging that are derived from nouns (see Table 3). In Ukrainian, adjectives of belonging express a connection with a living being (e.g., "мамин": the one that belongs to Mom). They can be productively constructed from nouns designating a person, including proper names, so it is impossible to list them in a dictionary. These adjectives are created by adding the specific suffixes -ин-/-iн- for feminine or neuter nouns, with potential alternations of consonants (г/ж, х/ш, к/ч, ґ/дж), and


Fig. 10 Grammar for adjectives of belonging

suffixes -ів-/-їв- for masculine nouns. Our dictionary (together with its inflectional and derivational grammars) describes forms derived from common nouns that designate living beings, but does not describe those derived from proper names, e.g., “Настусин”: the one that belongs to Настуся. To recognize, tag and lemmatize these adjectives, we have constructed the grammar shown in Fig. 10. This grammar consists of 41 embedded graphs.28 The main graph recognizes proper names ending with several characteristic suffixes (e.g., “xa”); the grammar then replaces these suffixes with specific suffixes that handle alternation of consonants. The adjectival paradigms are described by eight embedded graphs. By applying this grammar to a text, the lexical parser can recognize, tag and lemmatize the wordforms, as can be seen in Fig. 11. Here, the lexical parser first recognized the wordform “Настуся” thanks to the grammar in Fig. 8, then recognized the adjective “Настусин” and associated it with the property “Meaning=Belonging”.

28. In a graph, yellow nodes correspond to embedded graphs and are equivalent to auxiliary symbols.


Fig. 11 Annotation for adjectives of belonging “Настусин”

Fig. 12 Reflexive verbs

Fig. 13 Annotation for reflexive verbs

Reflexive verbs

Goroh's online dictionary lists 50,686 reflexive verbs. However, in Ukrainian, one can productively create reflexive forms of verbs by adding certain postfixes such as "-ся". We have seen in Sect. 3 that empirical tools tag all unknown forms ending with "-ся" as verbs, which is incorrect. There is a linguistic solution that avoids both having to list all reflexive verbs explicitly and analyzing all forms that end with "-ся" incorrectly. A simple grammar can be created to verify that the base word (before the postfix "-ся") is indeed a verbal form, as seen in Fig. 12. This grammar recognizes all wordforms that end with one of the six possible reflexive postfixes, and then checks that their base can be analyzed as a verbal form (via a lexical constraint). It then computes the resulting lemma by adding the postfix "-ся" to the base verb. The resulting annotation contains the property "+СЯ", as seen in Fig. 13. Here the inflected form of the reflexive perfective verb is linked to its imperfective lemma and associated with the reflexive property "+ся". In conclusion, we have developed 20 morphological grammars to complement our dictionary. We have applied these resources to a corpus of 100 novel extracts from


the nineteenth to twentieth centuries, as well as various article extracts; our testing corpus contains 199,996 graphical wordforms. As a result, we have obtained a very good recall rate, as the percentage of unrecognized wordforms went from 5% to 1%. The wordforms that are still not recognized belong to the dialectal vocabulary. Among the wordforms recognized by these new grammars, we have not detected any mistake, and we can confidently state that the accuracy rate is higher than 99%.
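The check performed by the reflexive-verb grammar of Fig. 12 can be emulated in a few lines (a sketch under the assumption of a toy verb lexicon; the real grammar handles six reflexive postfixes and uses NooJ lexical constraints): a wordform ending in "-ся" is accepted only if its base is an attested verbal form.

# Toy lexicon of verbal wordforms with their lemmas.
verb_forms = {"вихопила": "вихопити", "вмила": "вмити", "читала": "читати"}

def analyze_reflexive(wordform):
    # Accept the wordform only if, after removing the postfix "-ся",
    # the remaining base is an attested verbal form.
    if wordform.endswith("ся") and wordform[:-2] in verb_forms:
        lemma = verb_forms[wordform[:-2]] + "ся"
        return {"lemma": lemma, "POS": "VERB", "features": "+СЯ"}
    return None

print(analyze_reflexive("вмилася"))   # base "вмила" is a verb -> reflexive verb "вмитися"
print(analyze_reflexive("Настуся"))   # base "Насту" is not a verbal form -> rejected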

5 Conclusion and Perspectives

Currently, more and more researchers in literature, history, social sciences, and linguistics are interested in computer tools that allow them to process texts, explore their corpora, and perform various analyses on them. Teachers of Ukrainian as a second language are increasingly offering their students practical exercises based on existing corpora. For these applications, reliable NLP tools are needed. There are two types of tools: some are based on empirical models (statistical, neural); others are based on rule-based models. Today, most existing tools are based on empirical methods, probably because they are easier and faster to implement. They perform tagging and lemmatization thanks to predictions from training corpora. They produce quite good results, but as we have seen, they also produce many mistakes, such as confusions between POS categories, bad analyses of wordforms not listed in dictionaries, and incorrect lemmatizations. Several morphological phenomena (e.g., vowel alternation, mobile vowels, derivations, or spelling variants) are not and cannot be covered by these tools. These systematic mistakes compromise statistical results and, more generally, the reliability of most NLP applications. Rule-based methods allow us to process these phenomena in a reliable way. A well-designed set of linguistic resources, comprising electronic dictionaries and various types of grammars (describing inflections, derivations, as well as productive phenomena), can achieve better recall and accuracy. To date, we have published the first version of our Ukrainian module, composed of one dictionary associated with a set of morphological grammars, as well as a first set of syntactic grammars to solve many lexical and morphological ambiguities; we are in the process of developing this set of syntactic grammars, and are planning to add grammars that recognize and tag metaphorical expressions and named entities, as well as a new set of morphological grammars to recognize wordforms that will not be listed in our dictionary, such as wordforms spelled in the Latin alphabet or containing Russian letters. The Ukrainian module is freely available and can be downloaded from the page: https://nooj.univ-fcomte.fr/resources.html.


Tools:
Goroh: http://goroh.pp.ua
GRAC: https://parasol.vmguest.uni-jena.de/grac_crystal/#dashboard?corpname=grac16
LanguageTool: https://community.languagetool.org/analysis/analyzeText
NooJ: http://www.nooj4nlp.org
Plots (CQL): https://parasol.vmguest.uni-jena.de/grac_batch/NGyears.html?fbclid=IwAR2wj2HFi6X4WX7EqVoefnu-7tNHf_PST99F3N_tnmj4XP2NL3z3OMQp8gA
RNNTagger: https://www.cis.uni-muenchen.de/~schmid/tools/RNNTagger/
Sketch Engine: https://www.sketchengine.eu/
TagText.groovy: https://github.com/brown-uk/nlp_uk/blob/master/src/main/groovy/ua/net/nlp/tools/TagText.groovy
TreeTagger: https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
Ukrainian Dictionary Open Source: http://extensions.services.openoffice.org/en/project/dict-uk_UA
VESUM: https://r2u.org.ua/vesum/ and https://github.com/brown-uk/dict_uk

References

Chomsky, Noam and Schützenberger, Marcel-Paul, 1963. The algebraic theory of context-free languages. In Braffort and Hirschberg (eds), Computer Programming and Formal Systems, Studies in Logic series, pp. 119–161. Amsterdam: North-Holland.
Chomsky, Noam, 1984. La connaissance du langage. In: Communications, 40. Grammaire générative et sémantique, pp. 7–24.
Dickinson, Markus, 2015. Detection of Annotation Errors in Corpora. Language and Linguistics Compass, Vol. 9 (3), pp. 119–138.
Kilgarriff, Adam, Rychly, Pavel, Smrz, Pavel and Tugwell, David, 2004. The Sketch Engine. Proceedings of the Eleventh EURALEX International Congress, pp. 105–116. Lorient, France.
Matviyas, Ivan, 2011. Особливості фонетичної системи в західноукраїнському варіанті літературної мови [Features of the phonetic system in the Western Ukrainian variant of the literary language]. Мовознавство [Linguistics], № 4, pp. 16–21.
Rysin, Andriy, 2016. Ukrainian Dictionary Open Source (version 2.9.1). http://extensions.services.openoffice.org/en/project/dict-uk_UA
Schmid, Helmut, 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.
Schmid, Helmut, 2019. Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts. DATeCH, May 2019, Brussels, Belgium.
Shvedova, Maria, Rysin, Andriy and Starko, Vasyl, 2021. Handling of Nonstandard Spelling in GRAC. In 2021 IEEE 16th International Conference on Computer Sciences and Information Technologies (CSIT), pp. 105–108.
Silberztein, Max, 2016. Formalizing Natural Languages: The NooJ Approach. Hoboken, USA: Wiley.
Silberztein, Max, 2018. Using lexical resources to evaluate the quality of annotated corpora. Proceedings of the First Workshop on Linguistic Resources for Natural Language Processing (COLING 2018), Santa Fe, New Mexico, USA, August 20, 2018, pp. 2–11.
Volokh, Alexander and Neumann, Günter, 2011. Automatic detection and correction of errors in dependency treebanks. 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, June 19–24, 2011, Vol. 2: Short Papers, pp. 346–350.

Formalization of the Quechua Morphology

Maximiliano Duran, Université de Franche-Comté, CRIT, Besançon, France

Abstract  I present a method to formalize the morphology of Quechua nouns, verbs, and other Part Of Speech (POS) categories to develop Natural Language Processing (NLP) applications. First, I constructed an electronic corpus comprising several digitized texts and electronic dictionaries. After a detailed inventory of all Quechua suffixes, I classified them into specific sets corresponding to their POS category. Next, I formalized their grammatical behavior separately, using elementary matrices. The resulting tables describe valid combinations of two, three, and four suffixes. Finally, I formalized the inflection and derivation of each POS category.

Keywords  Formalizing Quechua · Electronic dictionary · Suffix agglutination · Natural Language Processing · Computational Linguistics

1 Introduction

Until a few years ago, Quechua was a poorly endowed language. Now, because many internet sites referring to Quechua have appeared, we may think that the situation for this language has changed. However, if we try to find open-source electronic lexical resources ready to use for some NLP project, we see that the situation has only slightly evolved. The central question is: are there available open-source linguistic resources for Quechua?

– Available Quechua corpora. The written corpus published up to the middle of the twentieth century contained around half a million Quechua tokens. It included, in the majority, religious publications such as the translation of the Christian Bible (New Testament, 492,859 tokens, including less than 50,000 different tokens); various Cuzco stories (12,223 tokens) cited by Monson et al. (2006); stories of Urubamba of Lira (31,986 tokens); prayers (around 3000 tokens), legends and stories of Huarochiri's manuscript (around 11,000 tokens), stories of Sta Cruz Pachakuti (around 2500 tokens), Guaman Poma's chronicle



Nueva Cronica y buen Gobierno (1496 different tokens), and several grammars and dictionaries such as Sto Thomas' (around 4000 tokens), Holguín's (around 15,000 tokens), Blas Valera's (around 1000 tokens), Betanzos' (around 1000 tokens), and P. Meneses' novels (around 5000 tokens).

– Quechua Dictionaries. Printed or electronic versions of Quechua dictionaries, such as the following, cannot be downloaded nor even displayed, and do not contain Part of Speech (POS) information: https://issuu.com/idiomaquechua/docs/diccionarioquechua?pageNumber=1 and https://www.crisol.com.pe/libronuevo-diccionario-espanol-quechua-quechua-espanol-9789972607448. Some websites allow users to get translations of simple words or simple canonical phrases, such as http://www.sankaypillo.com/2014/07/diccionarioselectronicos-aymara.html, but the dictionaries used in these applications are neither available nor displayed.

– Quechua in Universities. Universities that propose Quechua in their curricula have inactive websites. For instance, UCLA's quechua.ucla.edu website displays the message: "Digital Resources for the Study of Quechua is being upgraded and moved to new hosting. We will post new information here when it is available". The website of one of the prominent universities of Peru, Universidad de San Marcos, https://www.facebook.com/CatedradeLenguaQuechuaUNMSM, contains information about the Quechua courses, but it does not offer Quechua linguistic resources. The Universidad Católica in Peru, https://idiomas.pucp.edu.pe/programas/quechua/curso-de-quechua/#nopresencial, offers basic Quechua lessons online, but it does not present any dictionary or corpus.

– Natural Language Processing Applications. Some projects have seen the light of day in recent decades. The University of Zurich MT project, for instance, has developed several linguistic resources aiming to develop a Quechua-Spanish (QU-SP) Machine Translation system. Their resources include bilingual dictionaries and monolingual texts of the Cuzco variant. A. Rios (2011, 2016), one of its leading contributors, proposes in her thesis and subsequent articles a formalization for a grammar, a morphological analyzer, a syntax analyzer, and an initial Spanish-to-Quechua (SP-QU) translator. The Siminchik project, led by R. Cardenas et al. (2018), has developed "a speech corpus suitable for training and evaluating speech recognition systems." They remark: "Peruvian native languages, amongst which Quechua is included, present scarce written footprint and are predominantly orally transmitted even today. Even worse, the amount of digital content in Peruvian languages is meager", and all of them are considered "under-resourced." In consequence, they "introduce the first speech corpus of Southern Quechua, Siminchik, suitable for training and evaluating speech recognition systems."1 On the algorithm they use to perform tokenization and

1. Called QuBERT, it appeared in 2022 claiming to be a "large combined corpus for deep learning of an indigenous South American low-resource language . . . created from text gathered from the southern region of Peru . . . the entire data set consists of 4,408,953 tokens and 384,184 sentences, including what are known as Chanka and Collao variants."


transcription (BPE), they state: "BPE represents text at the character level and then merges the most frequent pairs iteratively until a pre-determined number of merge operations has been reached . . . we note that the accuracy scores of our results are somewhat lower than the state-of-the-art for high-resource languages on the named-entity recognition (NER)." These poor results come from their choice of using statistics rather than grammar rules and from the introduction of a class of sui generis "prefix" and "postfix" particles. To use statistical or neural-network algorithms to process and translate texts in Quechua automatically, one would need access to a large enough corpus and to sets of aligned bilingual wordforms and expressions. Unfortunately, these resources are not yet available,2 as there is still a scarcity of written sources for Quechua (less than 1 million tokens if we include the recently added texts). Because of the impossibility of obtaining the resources and tools needed to develop a Machine Translation system for Quechua, I decided to build linguistic resources from zero. I started 30 years ago to construct several electronic dictionaries (simple words, multiword units, named entities, technical and scientific terms) with their bilingual and bidirectional versions: SP-QU, QU-SP, French to Quechua (FR-QU) and Quechua to French (QU-FR). I also had to develop a new set of electronic grammars: there are many available classical grammars for Quechua, but they describe the language as if it were a Romance language, following the works initiated by the Spanish linguist Nebrija in the fifteenth century. As the morphology and syntax of Quechua differ from those of Romance languages, these grammars do not help study an agglutinative, SOV-type language. Quechua is a logical and polysynthetic morpho-syntactic language, and we must approach its study as such. Here, I present some details of the formalized Quechua electronic grammar.

2 Constructing Electronic Dictionaries

Because of the agglutinative nature of the language, we first need to formalize the ways affixes agglutinate to roots and combine with each other. First, I have inventoried all the Ayacucho Quechua suffixes, taken from different authors (Perroud 1970; Pino 1980; Soto 1976) and from my own introspection. Each Part Of Speech (POS) category possesses a set of suffixes, which may have allomorphs. I present, for each POS category, their potential agglutinations and the corresponding formalized paradigms for inflection (FLX) and derivation (DRV); these formalized paradigms will be associated with each entry of the dictionary, following Silberztein (2010).

2. Recently, Cardenas et al. (op. cit.) state that they have obtained a monolingual data set of around 1,200,000 words, from the transcription of 97 h of speech in the Ayacucho and Cuzco dialects. This data set is not available yet. For monolingual corpora, see the preceding footnote.


Fig. 1 QU-SP and QU-FR dictionaries

I now present the suffixes associated with nouns, adjectives, verbs, pronouns, and adverbs separately.

2.1 Formalization of Quechua Noun Inflections

In 2012, I presented my first electronic dictionary of nouns in the article “Formalizing Quechua Noun Inflection” (Duran 2012, 2014), containing several hundreds of simple nouns. Duran (2021) presented an enhanced dictionary containing actualized noun inflection grammars FLX=NVOCAL or FLX=NCONSO, as shown in Fig. 1. By analyzing the nominal morphology for the Ayacucho-Chanka Quechua variant, I gathered all the nominal suffixes shown in the set Suf_N SUF_N = {-ch , -cha, -cha ,-chik, -chiki, -chu, -chu(?),3 -hina, -kama, -kuna, -lla, -má, -man, -manta, -m, -mi, -mpa, -nimpa, naq, -nta, -ninta,4 -nintin, -ntin, -niraq, -niyuq, -niq, -ña, -p, -pa, -paq, -pas, -pi, -poss(7v+7c), -puni, -pura,

3. The interrogation and exclamation marks indicate an ascending intonation followed by a pause for the former, and a descending intonation followed by a pause for the latter.
4. -ninta is in fact a composition of the phonic support particle "-ni" and the suffix "-nta", applicable to nouns ending in a consonant. This same "-ni" intervenes in the compositions -ninka, -ninta, -nintin, etc.


-qa, -rayku, -raq, -ri, -s, -si, -sapa, -su, -ta, -taq, -wan, -y(!), -ya(!), -yá, -yupa, -yuq}5 (50+7v+7c)

where (7v+7c) represents the set of possessive suffixes {-i, -iki, -n, -nchik, -iku, -ikichik, -inku; -nii, -niiki, -nin, -ninchik, -niiku, -niikichik, -niinku}. The first seven suffixes are added to nouns ending with a vowel, and the remaining seven to nouns ending in a consonant. A detailed description of the semantics induced on a noun by each nominal suffix can be found in Duran (2017, chapter 2). There is another set of suffixes used to derive nouns into verbs: S_N_V = {y, yay, chay}. Some nouns accept only one, some two, and some all three of these suffixes, e.g.:

taki / song → takiy / to sing
wasi / house → wasiyay / to become a shelter
wasi / house → wasichay / to cover a house
rumi / stone → rumichay / to cobble

Figure 1 presents an excerpt of both the QU-SP and QU-FR noun dictionaries, in which each inflectional (FLX) and derivational (DRV) grammar is specified. In the QU-SP dictionary shown in Fig. 1, we have used morpho-syntactic codes such as "N" (for nouns), "Nc" (common nouns), "Nhum" (human nouns), "Anim" (animal), "Mamiph" (mammal), "zoo" (zoology), "Alim" (food), "Ana" (anatomy), "Tec" (technical), etc. "FLX=NVOCAL" is the inflectional paradigm used for nouns that end with a vowel, and "FLX=NCONSO" is the inflectional paradigm used for nouns that end with a consonant. The full tag set contains around 50 non-exhaustive codes. Depending on whether the noun ends in a vowel or a consonant, the FLX grammar that generates the corresponding inflections may be one of the following grammar rules:

NVOCAL = :NOM | :N_V_1 | :N_V_2 | :N_3_GEN ;
NCONSO = :NOM | :N_C_1 | :N_C_2 | :N_3_GEN ;

where NOM symbolizes a sub-grammar involving 64 paradigms, each containing a single nominal suffix. Grammar rule N_V_2 involves more than 600 paradigms containing combinations of two suffixes, whereas N_3_GEN involves several thousand paradigms containing combinations of three suffixes. These combinations are allowed according to the morpho-syntactic properties of the lexical entry. For example, one can have wasikunamanchá [towards the houses] (3 suffixes). Applying these grammar rules to the dictionary will automatically generate thousands of inflected forms for each noun, as can be seen in the extract shown in Fig. 2 for the noun wasi.

5. Found in Guardia Mayorga (1973), Perroud (1970), Pino (1980), and Soto (1976).
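As a minimal illustration of how nominal suffixes agglutinate (a Python sketch using a tiny subset of SUF_N, not the full NVOCAL grammar), the form wasikunamanchá discussed above can be produced by concatenating a root with an ordered sequence of suffixes:

# Tiny subset of nominal suffixes: -kuna (plural), -man (towards),
# and -chá, the third suffix of the example above.
suffixes_example = ["kuna", "man", "chá"]

def agglutinate(root, suffixes):
    # Quechua nominal suffixes are simply concatenated in order after the root.
    return root + "".join(suffixes)

print(agglutinate("wasi", ["kuna", "man"]))         # wasikunaman
print(agglutinate("wasi", suffixes_example))        # wasikunamanchá [towards the houses], cf. above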


Fig. 2 One-layer inflections of wasi [house]

2.2 Formalizing Quechua Verb Morphology

Our Quechua dictionary contains fewer than 1500 simple verbs, which is very modest compared to the 9000+ verbs listed in French and Spanish dictionaries. Fortunately, Quechua follows an interesting strategy, using derivation suffixes and agglutination to enlarge its set of verbs considerably. Therefore, after making an inventory of these suffixes and studying their morpho-syntactic behavior, I propose a primary classification of verbal suffixes as follows: interposition suffixes (IPS),6 postpositional suffixes (SPP),7 and verb-nominalizing suffixes (N_S),8 with additional classes:

6. IPS = {chaku, chi, chka, ykacha, ykachi, ykamu, ykapu, ykari, yku, ysi, kacha, kamu, kapu, ku, lla, mpu, mu, na, naya, pa, paya, pti, pu, ra, raya, ri, rpari, rqa, rqu, ru, spa, sqa, stin, tamu, wa}.
7. SPP = {ch, chá, chik, chiki, chu, chusina, má, man, m, mi, ña, pas, puni, qa, raq, s, si, taq, yá}.
8. N_S = {y, -na, -q, -sqa}. N_S suffixes will be used to produce nominalized forms under certain linguistic conditions.


Fig. 3 Part of the bi-dimensional combination matrix of IPS

– A subset of IPS, coded IPS_DV,9 will be used to generate new derived verbs,
– SPP suffixes will be used to generate various agglutinated inflections, both with simple and derived verbs.

This classification has been advantageous for formalizing verbal morphology,10 as it has helped us recognize and analyze verb forms and thus avoid producing mistakes such as those signaled by Zevallos et al. (2022): "To statistically determine which branch of morphemes a verb phrase falls under can be difficult with Quechua since there are so few resources". Following are examples of the V-V derivation of simple verbs; in this case, I take the verb maskay [to search] to produce 27 new verbs by agglutinating a single IPS:

maska-ku-y    to search for oneself
maska-yku-y   to search with determination
maska-pa-y    to search zealously
maska-ri-y    to search superficially
maska-chi-y   to make someone search

Following similar steps for the nouns, I constructed tables containing the matrices of grammatical agglutinations of two or more suffixes. For instance, Fig. 3 presents a Boolean matrix representing the combinations of two IPS suffixes. The corresponding Boolean matrix for the grammatical combinations of three IPS has, as its first row, the set of 27 IPSs and, as its first column, the 295 valid binary combinations I have obtained in Fig. 3. Here are some resulting combinations:

9

IPS_DV= {chaku, chi, chka, ykacha, ykachi, ykamu, ykapu, ykari, yku, ysi, kacha, kamu, kapu, ku, lla, mpu, mu, naya, pa, paya, pu, raya, ri, rpari, rqu, ru, tamu}. 10 Zevallos: “A short example sentence of how complex morpheme determination can be depicted in Table 1. In some cases, there are hundreds of options to choose from when choosing which suffix to use for a given Quechua word”.

116

M. Duran

Fig. 4 Extract of the electronic dictionary of Simple QU_FR verbs

rikuchka, rikuyku, rikuysi, rikukapu,rikulla, rikupa, rikupu, rikura, rikurqa, rikurqu, rikuru, rillaykacha, rillaykachi, rillaykari, rillaysi, rillara, rillaraya, rillarqa, rimuchka, rimuykari, rimulla, rimurqa, rimurqu

There are also combinations of four IPS suffixes, such as cha-ku-na-lla , cha-ku-llawa, cha-mu-chka-pti The QU_FR electronic dictionary The electronic QU_FR dictionary of simple verbs, shown in Fig. 4, contains 1181 entries. In this extract: – There are neither compound verbs nor phrasal verbs, which are represented in another dictionary. – Each verb has an inflectional paradigm FLX= V_TR for transitive verbs or FLX= V_ITR for intransitive ones. For instance, the entry unit rimay / to talk

Formalization of the Quechua Morphology

117

inflects according to the paradigm V_TR; thus, the entry becomes rimay,V+tr +FR= “parler” +FLX=V_TR. – It also contains some syntactic and semantic information, like the two main classes of verbs: Transitive (tr): rimay / to talk, Intransitive (it): mikuy / to eat. – The intransitive class is relatively small. It contains less than 100 verbs. The class of impersonal verbs (imp) includes mainly those relating to the weather: paray to rain; lastay to snow. Following is inflectional grammar V_TR:11 V_TR =:V_SPP | :V_TR_SIP | :V_CONJ_TR | :I_TR; V_TR_SIP = :SIP1_G | :SIP2_G | :SIP3_G; SIP1_G = /INF | :SIP1 | :V_SIP1_INF | :SIP1_N; V_SIP1_INF = (:CHAKU | :CHI | :CHKA | :YKACHA | :YKACHI | :YMANA | :YKAMU | :YKAPU | :YKARI | :YKU | :YSI | :KACHA | :KAMU | :KAPU :KU | :LLAV | :MU | :NAYA | :PAV | :PAYA | :PU | :RAYAV | :RIV | :RPARI | :RQU | :RU | :TAMU )y/INF;

This grammar formalizes, among other paradigms, the conjugation in the present, past, and future tenses. For example, the following is the detailed definition of PR and FUT: PR = (ni/PR+s+1 | nki/PR+s+2 | n/PR+s+3 | nchik/PR+pin+1 | nkichik/PR+p+2 |nku/PR+p+3 |niku/PR+pex+1);

Examples: taki-ni [I sing], taki-nki [you sing], taki-n [he sings] FUT = (saq/F+s+1 | nki/F+s+2 | nqa/F+s+3 | saqku/F+pex+1 | sunchik/F+Pin+1 | nkichik/F+p+2 | nqaku/F+p+3);

Example: taki-saq [I will sing], taki-nki [you will sing], taki-n-qa [he will sing] By processing the dictionary of 1180 simple verbs with the V_TR grammar, we obtained 31,860 new verbs; an extract of them is shown in Fig. 5. Duran (2017, 2021) verified that all the produced verbs have an actual meaning and translated them to French, using the (Dubois and Dubois-Charlier’s 2007) French LVF dictionary. Mixed verbal phrases A remarkable phenomenon in the inflection of Quechua verbs is the behavior of Present tense PR-ENDINGS, which act as fixed points around which IPS or PPS suffixes may be agglutinated to obtain a verbal form that represents long phrases in Indo-European languages. The following examples illustrate this property for the ni ending:

11

This context-free grammar is represented in the NooJ format presented in Silberztein (2003, 2010, 2016). Auxiliary symbols are prefixed with the colon character “:”; the pipe “|” character represents the disjunction.

118

M. Duran

Fig. 5 Derived verbs with one layer of IPS suffixes

Miku-ni [I eat] Miku-chka-ni [I am eating] Miku-chka-ni-raq [I am eating before anything else) Miku-chi-chka-ni-raq-mi [I am eating before anything else indeed] Miku-chi-yku-chka-ni-lla-raq-mi [I am carefully helping him to eat before anything else indeed] These sequences are described by the expression: , where:

Formalization of the Quechua Morphology

– – – –

119

represents the Verb stem rima (comes from rimay[ to talk]); 12 the suffixes placed between the verb stem and the ending, 13 the endings for the Present tense and their transformations and :14 Postposed suffix (placed after the ending).

represents seven Present tense personal endings that behave as fixed points for inflection. This set may be topologically transformed into nine set of endings, as detailed in Duran (2017), to obtain the gerunds or other aspect forms. For example: rima-nki, [you talk]: Present tense, first person, singular rima-ri-nki, [you start talking]: the IPS suffix -ri is interposed rima-nki_man, [you should talk]: the PPS suffix -man is postposed rima-ri-nki-man, [you should perhaps start talking]: IPS suffix -ri and PPS suffix man are mixed. In these examples, each suffix class intervenes in the inflection only once at both sides of the ending. But Quechua allows combining several instances of IPS and PPS. I have represented the mixed agglutinations with the following grammar rule: V_MIX1 = (:SIP1_PR_V) (:SPP1_V) | (:SIP1_PR_C)(:SPP1_C) | (: SIP1_PRM_V ) (:SPP1_V) | (:SIP1_PRM_C)(:SPP1_C); Example: Miku-chi--raq V_MIX12 = :SIP1_PR_V) (:SPP2_V) | (:SIP1_PR_C) (:SPP2_C) | (: SIP1_PRM_V ) (:SPP2_V) | (:SIP1_PRM_C) (:SPP2_C);

Example: Miku-chka-ni-raq-mi V_MIX21= (:SIP2_PR_V)(:SPP1_V)|(:SIP2_PR_C)(:SPP1_C);

Example: Miku-chi-chka-ni-raq V_MIX22= (:SIP2_PR_V)(:SPP2_V)|(:SIP2_PR_C)(:SPP2_C);

Example: Miku-chi-chka-ni-raq-mi where V_MIX12 describes the mixed verbal agglutination of one IPS suffix and two SPP suffixes, and SIP1_PR_V describes the derivation using one IPS suffix and the conjugation in the Present, etc. Applying these grammar rules for the verb rimay [to talk] produces automatically 289,413 mixed verbal forms.

12

IPS={chaku, chi, chka, ykacha, ykachi, ykamu, ykapu, ykari, yku, ysi, kacha, kamu, kapu, ku, lla, mpu, mu, na, naya, pa, paya, pti, pu, ra, raya, ri, rpari, rqa, rqu, ru, spa, sqa, stin, tamu, wa}. 13 ENDINGS = {ni, nki, n, nchik, niku, nkichik, nku, . . . }. 14 PPS={ch, chaa, chik, chiki, chu(?), chu, chusina, má, man, m, mi, ña, pas, puni, qa, raq,ri, si, s, taq, yá}.

120

M. Duran

The LVF_QU dictionary LVF_QU is a bilingual electronic dictionary containing around 8600 thousand French verbs, which I translated to Quechua from the (Dubois-Dubois Charlier LVF 2007) dictionary. Consequently, I used it to build the QU_LVFQ (QuechuaFrench) dictionary. Automatic Translation of derived verbs As seen in section “The QU_FR electronic dictionary”, the dictionary for simple verbs contains around 1200 QU-FR entries. Applying the V_TR_INF grammar, we have produced about 31,800 derived verbs. Some of the resulting new verbs are already lexicalized in some dictionaries, but most of them do not have written translations. I have therefore added the translation into French and Spanish for 7000 derived verbs. For example, the simple verb asiy [to laugh], when derived by the “ri” IPS suffix, produces the verb asiriy [to smile], which is already a lexical entry in some dictionaries, whereas the derived verb asichakuy [to laugh ridiculously] derived by the “chaku” IPS suffix, is not present in any existing dictionaries. Other examples of derived verbs that are not listed in existing dictionaries: ripuy [to leave]; ripu-ku-y [to move]; rakiy [to split]; raki-nakuy-y [to divorce]; samay [to rest]; sama-rqu-y [to bivouac].

2.3

Formalizing Adjective Morphology

The formalization of the adjective inflection and derivation consists in describing the paradigms of grammatical agglutinations of the adjectival suffixes Suf_A: Suf_A = {-ch , chá, -cha ,-chik, -chiki, -chu, -chu?, - hina, -kama, -kuna, -lla, -má, -man, -manta, -masi, -m, -mi, -naq, -nka, -ninka, -nta , -ninta , -nintin, -ntin, -niraq, -niq, -ña, -p, -pa, -paq, -pas, -pi, -puni, pura, -qa, -rayku, -raq , -ri, -s, -si, -su, -ta, -taq, -wan, -yá, -yupa } I present a portion of the table that includes the Boolean matrix of its corresponding bi-suffix combinations in Fig. 6. The general grammar to inflect or derive an adjective ending in a vowel is: AVOCAL = :A_V_1 | :A_V_2 | :A_V_3;

The first component is the paradigm involving only one suffix, the second one two suffixes, and the last three suffixes. The grammar rule A_V_2 corresponding to the matrix of Figure is the following: A_V_2 = :DCHACH| :DCHACHAA| :DCHACHIK| :DCHACHIKI| :DCHACHUI | :HINACHIKI | :HINACHUI | :HINACHUN | :HINAGEPA | :HINAKAMA | :HINALLA | :HINAMAA | :HINAMAN | :KAMACHUI | :KAMACHUN | :KAMAGEPA | | :LLANIRAQ | :LLAñA | :LLAPAQ | :LLAPAS | :LLAPI | :LLA | :LLA| :LLAPUNI| :LLAPURA| :LLATA

Formalization of the Quechua Morphology

121

Fig. 6 Matrix of the bi-suffix combinations of Suf_A

| :NIRAQKAMA| :NIRAQLLA| :NIRAQMAA| :NIRAQMAN | :NIRAQMANTA | :NIRAQMI | :NIRAQWAN | :NIRAQYAA| :ñAMAA | :ñAMM | :ñARI | :ñASISV | :ñATAQ | :ñAYAA | :PAQCHAA | :PAQCHIK | :PAQCHIKI | :PUNITA | :PUNITAQ | :PUNIWAN | :PUNIYAA| :PURACH| :PURACHAA | :PURACHIK | :TACH | :TACHAA | :TACHIK | :TACHIKI | :TACHUI | :TACHUN | :TAñA | :TAPAS | :TAPUNI | :TAQA | :TARAQ | : . . . | :YUPAQA | :YUPARI | :YUPASISV | :YUPAWAN| YUPAYAA;

For adjectives that end with a consonant or with the particle ai, the grammar rule is: ACONSO = :A_C_1 | :A_C_2 | :A_C_3;

The first is the paradigm involving only one suffix, and the last two components originate from similar matrices as of Fig. 6 manually constructed.

2.4

Formalizing Adverbs Morphology

To formalize the adverb inflection and derivation, one also needs to build the Boolean matrices of combinations of two or three of the adverbial suffixes in the following set Suf_ADV: Suf_ADV = {ch , chaa, chik , chiki, chun, chui, kama, lla, maa , manta, m, mi, ña, paq, pas, pi, puni, qa, hina, raq , ri , sisc, siv, nta, ninta, taq, wan, yá} and then build the formal grammar rules that describe the corresponding paradigms:

122

M. Duran

ADV_V=/ADV | :ADV_V_1 | :ADV_V_2 | :ADV_V_3; for adverbs ending in a vowel ADV_C=/ADV | :ADV_C_1 | :ADV_C_2 | :ADV_V_3; for adverbs ending in a consonant.

Their first component, for both the vowel and the consonant cases, looks as below: ADV_V_1 = :CH | :CHAA | :CHIK | :CHIKI | :CHUN | :CHUI | :KAMA | :LLA | :MAA | :MAN | :MANTA | :MM | :ñA | :PAQ | :PAS | :PI | :PUNI | :QA | :HINA | :RAQ | :RI | :SIV | :NTA | :TAQ | :WAN | :YAA; ADV_C_1 = :CHAA | :CHIK | :CHIKI | :CHUN | :CHUI | :KAMA | :LLA | :MAA | :MAN | :MANTA | :MI | :ñA | :PAQ | :PAS | :PI | :PUNI | :QA | :HINA | :RAQ | :RI | : SISC | :NINTA | :TAQ | :WAN | :YAA; And the special inflection formula for the adverb of negation mana [non]: ADV_MANA = :CH | :CHAA | :CHIK | :CHIKI | :CHUI | :MAA | :MM | :ñA | :PAS | : PUNI | :RAQ | :RI | :SIV | :TAQ | :YAA;

Following is an extract of the 733 bi-suffixed adverbial forms automatically generated from the adverb paqarin [tomorrow]: paqarinhinach,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+CMP+DINT paqarinhinachá,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+CMP+DINT paqarinhinachu?,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+CMP+ITG paqarinhinakama,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+CMP+MET paqarinhinalla,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+CMP+ISO paqarinhinaman,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+CMP+DIR paqarinhinam,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+CMP+ASS paqarinhinaqa,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+CMP+THE paqarinkamachu,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+MET+NEG paqarinkamalla,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+MET+ISO paqarinkamamá,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+MET+CTR paqarinkamam,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+MET+ASS paqarinkamaña,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+MET+TRM paqarinpihina,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+LOC+CMP paqarinpiwan,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+LOC+INS paqarinpiya,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+LOC+IVOC paqarintawan,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+ACC+INS paqarinyupata,paqarin,ADV+EN=tomorrow+FLX=ADV_C_2+CPR+ACC The electronic dictionary of QU_FR adverbs contains 565 entries.

Formalization of the Quechua Morphology

2.5

123

Formalizing Pronouns Morphology

Pronouns also can be inflected and derived in Quechua, e.g., Qampas [also you], paypaq [for him], qamlla [only you], etc. To formalize pronoun inflection and derivation, I made an inventory of the set of pronominal suffixes Suf_PRO, then I constructed the matrix that represents the grammatical combinations of two or three suffixes, then deduced from the matrix the corresponding paradigms of inflection of pronouns that end with a vowel or a consonant, and then constructed the formal grammar (samples presented below), apply it to the dictionary of 565 simple and composed pronouns QU-FR. The result produced more than 117,346 inflected and derived pronominal wordforms, as shown in Fig. 7. Figure 7 displays extracts of the electronic dictionary of QU_FR pronouns (on the left side) and of the pronominal phrases automatically generated (on the right side). 117,346 forms have been generated automatically. This process has allowed me to implement an automatic translation (QU-SP, QU-FR) of the generated inflected words from the POS Nouns, Verbs, adjectives, pronouns, and adverbs. For example, the simple adjective yuraj [white] is now automatically inflected as yurajniraj, and then is automatically translated as “near to white (pale),” which is the correct translation.

3 Conclusion and Perspectives The dictionaries, automatically generated by applying morphological grammars to the initial set of elementary dictionaries, contain several millions of Quechua wordforms. I have already associated around 400,000 of these wordforms with

Fig. 7 Extract of the QU-FR dictionary of pronouns and their 117,346 inflected forms

124

M. Duran

their Spanish or French translation. In addition, I am working on associating all the conjugated Quechua verbs with their corresponding French conjugated forms and thereafter with their corresponding Spanish conjugated forms. I have presented a method to formalize the morphology of Quechua nouns, verbs, and other Part Of Speech (POS) categories in a format that can be used to develop Natural Language Processing (NLP) applications. I have shown a sample of the obtained electronic dictionary of nouns built. Next, I presented a detailed inventory of all Quechua verb suffixes, how I classified them into specific sets corresponding to each POS category, and how I formalized their grammatical behavior, using elementary matrices describing their valid combinations of two, three, and four suffixes. Then, I showed how I converted these matrices into formalized grammars, thus formalizing the inflection and derivation of each POS category. These works allowed me to obtain the electronic dictionaries corresponding to each POS.

References Cardenas R., Zevallos R., Baquerizo R. and Camacho L., 2018. Siminchik: A speech Corpus for Preservation of Southern Quechua. lrec-conf.org. http://lrec-conf.org › workshops › lrec2018. Dubois, Jean. et Dubois-Charlier F., 2007. Dictionnaire Linguistique et Sciences du langage, Editions Larousse, Paris. Duran, Maximiliano, 2012. Formalizing Quechua verbs Inflexion, Proceedings of the NooJ 2013 International Conference, Saarbrücken. Cambridge Scholars. Duran, Maximiliano, 2014. Morphological and syntactic grammars for the recognition of verbal lemmas in Quechua. Proceedings of the 2014 International Conference and Workshop. Sassari. Duran, Maximiliano, 2017. Dictionnaire électronique français-quechua des verbes pour le TAL., 2017 : Thèse Doctorale. Duran, Maximiliano, 2021. Morfología y diccionario electrónico de nombres en Quechua MarchDOI: https://doi.org/10.35305/an.vi1.4 Conference: Proceedings of the Linguistic Resources for Automatic Natural Language Generation. Université de Franche-Comté. Mars 2017 (2017) Guardia Mayorga, César, 1973. Gramatica Kechwa, Ediciones Los Andes. Lima Peru. Monson, C., Llitj os, A. F., Aranovich, R., Levin, L., Brown, R., Peterson, E., Carbonell, J., and Lavie, A. 2006. Building NLP systems for two resource-scarce indigenous languages: Mapudungun and Quechua. Strategies for Developing machine translation for minority languages. aclanthology.org. https://aclanthology.org › LREC-2006-Monson. Perroud, Pedro Clemente, 1970. Diccionario castellano kechwa, kechwa castellano. Dialecto de Ayacucho. Santa Clara, Peru. Seminario San Alfonso. Pino, Duran, A. German,1980. Uchuk Runasimi (Jechua - Quechua). Conversación y vocabulario Castellano-Quechua Ocopa, Concepción Perú. Rios, Annette, 2011. Spell checking an agglutinative language Quechua. The University of Zurich. Zurich Open Repository and Archive. Rios, Annette. 2016. A basic language technology toolkit for Quechua. A Basic Language Technology Toolkit for Quechua. Thesis. DOI: https://doi.org/10.5167/uzh-119943. Silberztein, Max, 2010. La formalisation du dictionnaire LVF avec NooJ et ses applications pour l’analyse automatique de corpus. Langages 3/2010 (n° 179-180), pp. 221-241. Silberztein, Max, 2016. Formalizing Natural Languages: the NooJ Approach. Wiley Editions. Hoboken. Silberztein, Max, 2003. NooJ Manual. https://nooj.univ-fcomte.fr/downloads.html.

Formalization of the Quechua Morphology

125

Soto Ruiz, Clodoaldo, 1976. Gramática quechua: Ayacucho-Chanca. Lima: Ministerio de Educación, Instituto de Estudios Peruanos. Zevallos, R., et al., 2022. Introducing QuBERT: A Large Monolingual Corpus and BERT Model for Southern Quechua. Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language. Association for Computational Linguistics

The Challenging Task of Translating the Language of Tango Andrea Fernanda Rodrigo and Mariana González

Abstract It is necessary to review the performance of Machine Translation software when they process low-resource languages. Our case study will be the language used in Tango songs. Tango is a challenging subject: in its lyrics, appear customary beliefs, social forms, and idiosyncratic expressions typical of the Argentinian culture from the late nineteenth to the mid-twentieth century, using Lunfardo as the predominant sublanguage. We address the problem of translating a Tango song lyrics to English. We look at the translations produced by Google Translator and DeepL and we compare them with translations produced by accessing handcrafted linguistic resources specifically developed for Rioplatense Spanish. Keywords Rioplatense Spanish · Lunfardo · Tango · Machine translation · Google Translate · DeepL · NooJ

1 Introduction As second-language teachers, one question we want to answer is whether linguistic analyzers, online dictionaries, Machine Translation software and other Natural Language Processing software can help us teach languages. Can we design software tools to process language without undermining the boundless linguistic production at our disposal? This work focuses on Rioplatense Spanish1 from a pedagogical perspective. Today’s major constraint is that many expressions used in Tango songs belong to Lunfardo, a sublanguage used extensively in Argentina, especially around the Río de la Plata basin.2 In the following, we proceed with testing Machine

1

See Rodrigo and Bonino (2019). As indicated that “critical analyses which had focused their attention on alleged or effective deficiencies of each of these works, detecting the persistent Eurocentrism in the way how the Diccionario Panhispánico de Dudas describes errors and recommendations (Méndez García de Paredes 2012), and the way how the Diccionario de Americanismos deals with selection and 2

A. F. Rodrigo (✉) · M. González CETEHIPL, Facultad de Humanidades y Artes, UNR, Rosario, Argentina © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 M. Silberztein (ed.), Linguistic Resources for Natural Language Processing, https://doi.org/10.1007/978-3-031-43811-0_7

127

128

A. F. Rodrigo and M. González

Translation systems Google Translate (Google Translate 2006) and DeepL (DeepL 2017) by applying to a Tango song lyrics,3 and compare their results with the results produced by a linguistic method that consists in accessing handcrafted dictionaries and grammars.

2 The Project: Automatic Machine Translation Generally, Machine translation software that use training corpora and empirical methods, such as Google Translate and DeepL, can produce satisfactory translations from most source languages to target languages, in real time. However, since most of these systems depend on large training corpora, their accuracy is degraded when they perform translation for low-resource languages. In this study, we compare their results with the ones produced by using purely rule-based resources adapted to a low-resource language, without any need for training corpora. We seek to translate the lyrics of a Tango song4 to English. Tango language best embodies slang, idiosyncratic expressions, and other words reflecting customary beliefs and forms typical of the Rioplatense culture. We chose the song “Mi noche triste”. This song was performed by the celebrated Tango singer Carlos Gardel and the lyrics were written by Pascual Contursi (Contursi 1917). Here are the lyrics of “Mi noche triste”:5 Percanta que me amuraste en lo mejor de mi vida, dejándome el alma herida y espina en el corazón, sabiendo que te quería, que vos eras mi alegría y mi sueño abrasador, para mí ya no hay consuelo y por eso me encurdelo pa´olvidarme de tu amor. Cuando voy a mi cotorro y lo veo desarreglado, todo triste, abandonado, me dan ganas de llorar;

description of entries and entries (Lara 2012) as well as the lack of normative values which reach down to the lack of information of the native speaker” were immediate p. 59, Sebastian Greußlich (2015). 3 Although it is not the only dialect in Argentina, we believe it is an important point of departure for this work. 4 Neither machine translation software propose Rioplatense Spanish as an option. Instead, Spanish stands as the only option to be selected as source or target language. 5 The words and phrases shown underlined were incorrectly translated mostly by Google Translate or DeepL.

The Challenging Task of Translating the Language of Tango

129

me detengo largo rato campaneando tu retrato pa’poderme consolar. Ya no hay en el bulín aquellos lindos frasquitos, arreglados con moñitos todos del mismo color. El espejo está empañado y parece que ha llorado por la ausencia de tu amor. De noche, cuando me acuesto no puedo cerrar la puerta, porque dejándola abierta me hago ilusión que volvés.6 Siempre llevo bizcochitos pa'tomar con matecitos como si estuvieras vos, y si vieras la catrera cómo se pone cabrera cuando no nos ve a los dos. La guitarra, en el ropero todavía está colgada: nadie en ella canta nada ni hace sus cuerdas vibrar. Y la lámpara del cuarto también tu ausencia ha sentido porque su luz no ha querido mi noche triste alumbrar.

We have implemented a linguistic-based translation system that computes the English translation of a text written in Lunfardo. For a description of this translation system, its architecture, methodology and the form of its linguistic resources, see for instance (Barreiro 2008) for a Portuguese-English MT system, (Maisto and Guarasci 2016) for an automatic English-Italian translation of medical terms, (Essid and Fehri 2019) for a translation of Arabic verbs into French, or (ElFqih et al. 2022) for an automatic translation of Arabic legal terms into English. Our system computes the translation of each stanza of the poem in three steps: – First, we perform a lexical analysis of the Spanish input text, using a (Spanish) Lunfardo to English dictionary associated with a morphological grammar; lexical entries of this dictionary are associated with their English translation. – Then, we apply a bilingual Spanish-English syntactic grammar to reorder the words and insert grammatical words in the English result when needed. – Finally, a English morphological grammar is used to inflect the English terms.

The Diccionario Panhispánico de Dudas includes the “voseo” as a variant for the Spanish second person singular pronoun “tú”.

6

130

A. F. Rodrigo and M. González

The aim of this study is to compare the results produced by the generic translators Google Translate and DeepL with the results produced by our handcrafted system. In the following, we discuss some typical mistakes produced by Google Translate or DeepL. First, we focus on terms and then on syntactic structures.

3 Translation of Terms For each mistake, we first show the output produced by Google Translate, check the terms in the Diccionario de Lunfardo and in Todo Tango, a WEB site declared of National Interest of Argentina (Rodríguez 1989), or the Diccionario de la Real Academia Española (Real Academia Española 2005, 2010 and 2014), then present the translation produced by DeepL, and finally the translation produced by our linguistic-based system. (1) percanta and amuraste The words percanta and amuraste belong to the Lunfardo vocabulary. The Diccionario de Lunfardo contains the following definition for percanta:7 percanta: mujer, considerada desde el punto de vista amatorio [a woman, from an amatory viewpoint] Todo Tango presents the following definition for this word:8 percanta: (pop.) Mujer (LCV.), amante (LCV.), querida (LCV) [concubine]. For the first line of the poem “Percanta que me amuraste”, Google Translate produces the following result: As we can see in Fig. 1, Google Translate has not translated the word percanta.9 We believe this word is so rare that it has probably no occurrence in the corpus used by Google Translate. Google Translate incorrectly analyzes amurar as if it were a form of enamora, as it probably considers the first two letters of amurar (a and m) as prefixes of the Spanish word amor [love]. However, the Diccionario de Lunfardo provides the following definition: amurar: 1-Abandonar. 2-Aprisionar, encerrar en la cárcel, cerrar, clausurar. 3-Clavar. 4-Engañar a una persona, empeñar, estafar. [1-abandon. 2-imprison, in jail, close, shut down 3-to Nail, 4-deceive a person, pawn, swindle].10

7

https://fattiditango.files.wordpress.com/2007/05/diccionario-del-lunfardo-jerga-del-espanol-debuenos-aires-vocabulario-tango.pdf, p. 25. (Fattiditango, 2007) 8 https://www.todotango.com/buscar/?kwd=percanta 9 The term percanta also appears in the Tango song Percanta arrepentida. The song features music by Francisco and Julio de Caro and the lyrics were written by Juan Feilberg, cf.: https://www. todotango.com/musica/tema/5463/Percanta-arrepentida/ 10 https://fattiditango.files.wordpress.com/2007/05/diccionario-del-lunfardo-jerga-del-espanol-debuenos-aires-vocabulario-tango.pdf, p. 1.

The Challenging Task of Translating the Language of Tango

131

Fig. 1 Google Translation for the words percanta and amuraste

Fig. 2 DeepL Translation for the words percanta and amuraste

and Todo Tango:11 amurar: (lunf.) Abandonar, dejar sin protección// arrestar, detener (LCV), aprisionar (LCV), encarcelar, encerrar (AD)// robar, estafar (JFP)// engañar (JFP)// empeñar (AD), dar una cosa en prenda por un préstamo (AD)// arrinconar, cercar// (jgo.) en el billar, dejar una de las bolas junto a la banda; (pop.) no pagar una deuda// cerrar, clausurar. DeepL does not translate these two words, as seen in Fig. 2: Note that entering the correct correspondence in a handcrafted bilingual dictionary Lunfardo-English allows any linguistic-based software to produce the correct translation automatically, just by a simple lookup procedure (Fig. 3): (2) me encurdelo (encurdelarme) Google Translate does not translate me encurdelo, as we can see in Fig. 4:

11

https://www.todotango.com/buscar/?kwd=amurar

132

A. F. Rodrigo and M. González

Fig. 3 Translation of the words percanta and amuraste obtained after a dictionary lookup

Fig. 4 Google Translation for the word me encurdelo

The Diccionario de Lunfardo contains the following definition for the word encurdelar: encurdelar(se): emborrachar (se) [to get drunk]12 Todo Tango:13 encurdelar: encurdelar (lunf.) Embriagar (LCV.) DeepL translates me encurdelo incorrectly as I’m dazed (Fig. 5): However, a simple lookup of a bilingual Lunfardo-English dictionary produces the correct translated term (Fig. 6): (3) cotorro Google Translate translates the word cotorro incorrectly as parlor (Fig. 7):

12

https://fattiditango.files.wordpress.com/2007/05/diccionario-del-lunfardo-jerga-del-espanol-debuenos-aires-vocabulario-tango.pdf, p.10. 13 https://www.todotango.com/buscar/?kwd=encurdelar

The Challenging Task of Translating the Language of Tango

133

Fig. 5 Translation output for the phrase me encurdelo using DeepL

Fig. 6 Correct translation produced by a lookup of a bilingual dictionary

Fig. 7 Google Translation for the word cotorro

However, the correct meaning of the term cotorro is described in the Diccionario de Lunfardo:14

14

https://fattiditango.files.wordpress.com/2007/05/diccionario-del-lunfardo-jerga-del-espanol-debuenos-aires-vocabulario-tango.pdf, p. 8.

134

A. F. Rodrigo and M. González

Fig. 8 DeepL Translation for the word cotorro

Fig. 9 Translation obtained by a lookup of a bilingual dictionary

cotorro: Aposento, cuarto pobre, habitación, [bedchamber, bedroom, poor room] as well as in Todo Tango:15 cotorro: (pop.) Habitación de soltero; habitación para citas amorosas. DeepL incorrectly analyzes cotorro by replacing its suffix with the inflectional suffix for the feminine –a: cotorra [blabbermouth], which is incorrect (Fig. 8). Looking up our bilingual dictionary produces the correct result (Fig. 9): (4) bulín Google Translate does not translate the wordform bulín.16 We believe that there is probably no occurrence of this wordform in the corpus used by Google Translate (Fig. 10): However, the Diccionario de Lunfardo contains the following definition:17

15

https://www.todotango.com/buscar/?kwd=cotorro The word bulín is also used in another famous tango: El bulín de la calle Ayacucho. With lyrics written by Celedonio Flores and music composed by José and Luis Servidio. Lyrics available at the following link: https://www.letras.com/carlos-gardel/524498/ 17 Link to Diccionario de lunfardo: https://fattiditango.files.wordpress.com/2007/05/diccionariodel-lunfardo-jerga-del-espanol-de-buenos-aires-vocabulario-tango.pdf, p.4. 16

The Challenging Task of Translating the Language of Tango

135

Fig. 10 No Google Translation for the word bulín

Fig. 11 DeepL does not translate the word bulín

bulín: cuarto, habitación, término cargado de afectividad con que se designa la habitación en que se vive o que se reserva para entrevistas amorosas [bachelor’s apartment, tryst-room, a room where lovers meet, often regarded as a secret place].

and Todo Tango:18 bulín: (lunf.) Cotorro (JAS), habitación (AD), cuarto de soltero para citas amorosas, lugar donde se duerme o vive// (carc.) Celda. DeepL, just like Google Translate, does not translate the wordform (Fig. 11):

18

In Todotango: https://www.todotango.com/buscar/?kwd=bul%c3%adn

136

A. F. Rodrigo and M. González

Fig. 12 Translation of the word bulín produced by a dictionary lookup

Fig. 13 Google Translation for me hago ilusión

Fig. 14 DeepL Translation for me hago ilusión

A correct translation can be automatically obtained by looking up our bilingual dictionary (Fig. 12): (5) me hago ilusión (hacerme ilusión, ilusionarme) This is a Spanish verbal locution, so we have treated it as a lexical item.

The Challenging Task of Translating the Language of Tango

137

Fig. 15 Translation of me hago ilusión thanks to a dictionary lookup

Fig. 16 Google does not translate matecitos

Fig. 17 DeepL does not translate matecitos

Google Translate translates me hago ilusión incorrectly as I delude myself (Fig. 13). However, the locution hacerme ilusión is synonymous with ilusionar, i.e., to hope, to look forward to. DeepL produces the correct translation in this case (Fig. 14). This translation is also easily produced by the linguistic method, as the expression is lexicalized (Fig. 15): (6) matecitos The wordform “matecitos” is the diminutive form of mate, an infused herbal drink typical of the Argentinean and the Rioplatense culture. Neither Google Translate, DeepL nor NooJ translate it (Figs. 16, 17 and 18): The problem is that matecitos is not a lexical entry in our Lunfardo-English dictionary, and therefore is not recognized by the translator’s lexical analysis. In

138

Fig. 18 Looking up our bilingual dictionary fails

Fig. 19 Google translates catrera as cot

Fig. 20 DeepL does not translate catrera

A. F. Rodrigo and M. González

The Challenging Task of Translating the Language of Tango

139

Fig. 21 NooJ translates catr as cot

Fig. 22 Sample of our bilingual Spanish-English syntactic grammar

order to fix this problem, we will need to associate the lexical entry “mate” with a derivational grammar and an inflectional grammar, to link this entry with the form “matecitos”, and then add to the Spanish to English syntactic grammar a transducer that translates Spanish diminutive nouns into English nouns modified by an adjective such as “little” or “small”. (7) catrera Google Translate produces the correct translation (Fig. 19, 20 and 21):

140

A. F. Rodrigo and M. González

4 Translating Syntactic Structures Machine Translation systems that follow the linguistic approach not only rely on bilingual dictionaries: they also need to reorder words in the sentence, and also add grammatical words in the target language. Following is an extract of the Spanish to English syntactic grammar developed specifically to translate Contursi’s Tango lyrics to English (Fig. 22). This grammar recognizes Spanish morpho-syntactic structures that occur in the song. For instance, recognizes the stanza constituted by a preposition, a determiner, and a noun. The grammar then produces the corresponding English translation using the bilingual dictionary. For example, the operator $THIS$EN looks up the current lexical item in the bilingual dictionary and returns the value of the property +EN of the lexical entry, i.e., its English translation. (1) ponerse cabrera The lyrics of Mi noche triste contains a personification of a bed (catrera). An extra hurdle to the translation task is that since Spanish is a pro-drop language, subjects can be implicit, i.e., unexpressed. In the song, catrera is the implicit subject of se pone cabrera: y si vieras la catrera cómo se pone cabrera cuando no nos ve a los dos.

Therefore, to produce the correct English translation, one needs first to describe the Spanish verbal locution (ponerse cabrera). The Diccionario de Lunfardo provides the following description for this adjective:19 cabrero: enojado [angry]. and Todo Tango: lanza cabrera: (delinc.) Substracción de los bolsillos en la que el ladrón es sorprendido por la víctima. mina cabrera: (pop.) Mujer desconfiada (ERDELV.). punga cabrera: (delinc.) Igual que Lanza cabrera. The subject of the verb poner [get] is catrera [cot]. Tango songwriter Pascual Contursi uses personification as a poetic device to add an agentive feature to catrera. thus, to produce the correct English translation, one needs to add an explicit subject, in this case, the subject she, as in she gets mad (ella se pone cabrera), because catrera is an feminine adjective (the subject’s gender has to be in agreement with the gender of catrera).

19

https://fattiditango.files.wordpress.com/2007/05/diccionario-del-lunfardo-jerga-del-espanol-debuenos-aires-vocabulario-tango.pdf, p. 4.

The Challenging Task of Translating the Language of Tango

141

Fig. 23 Google Translation of cómo se pone cabrera

Fig. 24 DeepL Translation for cómo se pone cabrera

Fig. 25 Translation output of cómo se pone cabrera using our bilingual grammar

Google Translate produces a proper subject-verb agreement; DeepL does not translate the adjective Cabrera; our bilingual grammar produces the correct translation (Figs. 23, 24 and 25):

142

A. F. Rodrigo and M. González

Fig. 26 Google Translation of pa’poderme consolar

Fig. 27 Translation output of pa’poderme consolar using DeepL

Fig. 28 Translation of pa’poderme consolar using our Spanish-English grammar

(2) Colloquial expressions with word clipping: pa’ instead of para Colloquial expressions such as pa’poderme consolar, pa’olvidarme, pa’poderme, pa’tomar deserve a specific treatment. The Spanish preposition para and its variant pa’ are translated as to (Fig. 26). DeepL uses to (or in order to as an expanded option) (Fig. 27):

The Challenging Task of Translating the Language of Tango

143

Table 1 Results produced by Google Translate, DeepL, and NooJ Word/ structure in the source language percanta amuraste encurdelarme cotorro bulín me hago ilusión matecitos catrera se pone cabrera

pa’poderme consolar

Google Translate/target language Not translated

DeepL/target language Not translated

Incorrect translation Not translated

Not translated

Incorrect translation Not translated

Incorrect translation

Incorrect translation Not translated Correct translation Correct translation of the lexical entry and the syntactic structure Correct translation

Correct translation

Incorrect translation

Not translated

Not translated Not translated Not translated from the lexical entry. Correct agreement subject-verb in syntactic structure Correct translation

NooJ (Linguistic approach) Correct translation Correct translation Correct translation Correct translation Correct translation Correct translation Not translated Correct translation Correct translation of the lexical entry and the syntactic structure Correct translation

Term described in Lunfardo Lunfardo Lunfardo Lunfardo Lunfardo Spanish colloquialism Quechua Lunfardo Lunfardo

American colloquialism

Following is the translation produced thanks to our bilingual syntactic grammar (Fig. 28): The grammar processes the word pa’ just like para, therefore translating pa’poderme consolar and para poderme consolar produces the same result. The only difference is that as we have access to the bilingual dictionary, we can modify the word translation in the lexical entry: pa',PREP+AMER+EN="to"

5 Conclusion The purpose of this work was to account for the translation mistakes produced while translating Tango lyrics from Spanish to English. We analyzed a paradigmatic text of Argentine culture, the lyrics of the Tango song “Mi noche triste”, and used Google Translate and DeepL to translate them to English. We then compared each result with the one produced by a translation system that relies on handcrafted specialized

144

A. F. Rodrigo and M. González

Comparison of the perfomance of machine translators 12 10 8 6 4 2 0

Google

DeepL Inc and Not Transl

NooJ Correct

Fig. 29 Performance of the three systems

linguistic resources (bilingual dictionaries and grammars) (see Silberztein 2003 and 2016). Table 1 below presents the results produced by these three systems, and Fig. 29 resumes the performance of the three systems in a chart. As Fig. 29 shows, the number of mistakes and untranslated terms is higher for Google Translate and DeepL, which proves the need to develop linguistic resources specifically designed for Rioplatense Spanish. As a perspective for our research team, we intend to build a larger corpus of texts associated with more complete bilingual dictionaries and grammars, to refine the translation results. This task will be undertaken in the future to broaden the scope of this research.

References Barreiro, Anabela, 2008. Port4NooJ: Portuguese linguistic module and bilingual resources for machine translation. In Workshop on Language Resources for Teaching and Research. Faculdade de Letras da Universidade do Porto. Contursi, Pascual, Mi noche triste, 1917. Available at: https://www.todotango.com/musica/tema/1 78/mi-noche-triste-lita/ [last accessed February 12, 2023] DeepL, 2017. Available at: https://www.deepl.com/translator [last accessed February 12, 2023] ElFqih, K.A., di Buono, M.P. and Monti, J., 2022, June. Automatic Translation of Arabic Legal Terminology Using NooJ. In 16th International NooJ 2022 conference, Revised Selected Papers. Springer International Publishing. (p. 26). Essid, M. and Fehri, H., 2019. A Semantico-Syntactic Disambiguation System of Arabic Movement and Speech Verbs and Their Automatic Translation to French Using NooJ. In Formalizing Natural Languages with NooJ 2018 and Its Natural Language Processing Applications. 12th International Conference, NooJ 2018, Palermo, Italy, June 20–22, 2018, Revised Selected Papers. Springer International Publishing (pp. 167-179).

The Challenging Task of Translating the Language of Tango

145

Fattiditango, 2007. Available at: https://fattiditango.files.wordpress.com/2007/05/diccionario-dellunfardo-jerga-del-espanol-de-buenos-aires-vocabulario-tango.pdf [last accessed February 12, 2023] Google Translate, 2006. Available at: https://translate.google.com [last accessed February 12, 2023] Greußlich, Sebastian 2015. El pluricentrismo de la cultura lingüística hispánica: política lingüística, los estándares regionales y la cuestión de su codificación, Available at: http://www.scielo.org. pe/pdf/lexis/v39n1/a02v39n1.pdf [last accessed February 12, 2023] Lara, Luis Fernando, 2012. Reseña: ASALE: Diccionario de americanismos. Panace XIII, 36:352– 355 Maisto, A. and Guarasci, R., 2016. Morpheme-based recognition and translation of medical terms. In Automatic Processing of Natural-Language Electronic Texts with NooJ: 9th International Conference, NooJ 2015, Minsk, Belarus, June 11-13, 2015, Revised Selected Papers. Springer International Publishing (pp. 172-181). Méndez García de Paredes, Elena, 2012. Los retos de la codificación normativa del español: Cómo conciliar los conceptos de español pluricéntrico y español panhispánico. En Lebsanft 2012:181– 212 Rodrigo, Andrea and Bonino, Rodolfo, 2019. Aprendo con NooJ: de la lingüística computacional a la enseñanza de la lengua. Rosario: Ciudad Gótica. Translation Silvia Reyes. ISBN 978-987597-398-5 Real Academia Española, 2005. Diccionario Panhispánico de Dudas, Modelos de conjugación verbal. Available at: https://www.rae.es/dpd/ayuda/modelos-de-conjugacion-verbal [last accessed February 12, 2023] Real Academia Española, 2010. Diccionario de americanismos, Available at: https://www.rae.es/ obras-academicas/diccionarios/diccionario-de-americanismos [last accessed February 12, 2023] Real Academia Española, 2014. Diccionario de la lengua española. Available at: https://dle.rae.es/ [last accessed February 12, 2023] Rodríguez, Adolfo Enrique, 1989. Lexicón, 12500 voces y locuciones lunfardas, populares, jergales y extranjeras. Buenos Aires: Editorial Policial. Available at: https://www.todotango.com/ [last accessed February 12, 2023] Silberztein, Max, 2003. NooJ Software. Available at: https://atishs.univ-fcomte.fr/nooj/downloads. html [last accessed March 8, 2023] Silberztein, Max, 2016. Formalizing Natural Languages: The NooJ Approach. Wiley Eds.: Hoboken, NJ.

A Polylectal Linguistic Resource for Rromani Masako Watabe

Abstract We describe the characteristics and problems of Rromani to develop linguistic resources and then examine existing applications using empirical and linguistic methods. Empirical methods can rapidly develop some resources, but they always contain several types of errors; the linguistic methods are more timeconsuming, but more reliable. We then present a polylectal linguistic resource for Rromani that covers all basic dialectal variants, both at the lexical and morphological levels. This resource is composed of a lexicon, a grammar that describes inflectional and derivational morphology, and a grammar that describes agglutination. It can be used both for Rromani language studies and for developing Natural Language Processing (NLP) applications. We show that the same architecture can describe other low-resource languages. Keywords Corpus Linguistics · Natural Language Processing · Polylectal · Rromani language

1 Introduction Rromani is the language of the Rromani people. It is a people “without compact territory” (Courthiade, M. 2004), characterized by geographical dispersion, mainly in Europe and in the Americas. It is the most significant low-resource language “without compact territory” in Europe. However, the fact that it does not belong to any state, in other words, that any state does not protect it, is not negligible in various fields, not only political, economic, or social but also cultural, linguistic, and computational. However, the Rromani people have a proper representative organization since 1971: the International Rromani Union (IRU), an NGO recognized by the United Nations and having a social consultative status.1

1

Member United Nations (ECO-SOC, No EE-3377) Social Consultative Status (NGO No D9424).

M. Watabe (✉) University of Franche-Comté, C.R.I.T., Besançon, France © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 M. Silberztein (ed.), Linguistic Resources for Natural Language Processing, https://doi.org/10.1007/978-3-031-43811-0_8

147

148

M. Watabe

The number of Rromani speakers is estimated at 5.5 million, cf. (Gurbetovski, M. et al. 2010). Nevertheless, the “Atlas of the World’s Languages in Danger” of UNESCO classifies the Rromani language as a “definitely endangered language.2” That means “children no longer learn the language as a mother tongue in the home.” Admittedly, there are very few Rromani lessons in schools for children; nevertheless, the Romanian government recognizes the teaching of the Rromani language and culture from elementary school through to university. Regarding higher education, only three universities offer Rromani studies: INALCO in Paris, the University of Bucharest, and Charles University in Prague. Taking Rromani language lessons as optional at some European faculties is also possible. Multilingualism is a frequent phenomenon among Rroms in some regions. For example, Rroms in Kosovo may have four mother tongues: Rromani, Albanian, Serbian and Turkish. It is not uncommon for Rromani to be the first language before the local one(s) that Rromani children learn at home. Of course, we cannot ignore the influence of local languages on Rromani, but this phenomenon has no impact on the common origin of Rromani. Rromani is an Indo-Aryan language. It is very similar to modern North Indian languages in terms of vocabulary and grammar, especially the nominal phrase system. The Rromani language contains a common lexical origin: over 800 Indian stems, around 200 Greek stems, around 60 Persian stems, and so on. These stems are common to all dialects regardless of where they live and of the local majority languages. In addition, there are local borrowings: German, Hungarian, Romanian, and Slavic, among others, whose use is local (Études tsiganes 2005).

1.1

Rromani Dialectology

We propose a three-level dialect classification: superdialects, dialects, and vernaculars. Firstly, the entire Rromani language is grouped into two superdialects: O and E. Then, there are two dialectal subgroups according to phonetic mutations within each superdialect. That makes four dialects (Courthiade, M. 2016): – – – –

O superdialect without mutation (O-bi3) O superdialect with mutation (O-mu4) E superdialect without mutation (E-bi) E superdialect with mutation (E-mu)

The UNESCO list has five categories of endangerment: vulnerable, definitely endangered, severely endangered, critically endangered, and extinct. 3 bi means without in Rromani. 4 mu is the initial of the word mutation. 2

A Polylectal Linguistic Resource for Rromani

149

Table 1 Interdialectal equivalents between the two superdialects: O and E Morphology Phonology

Lexicon Syntax

O superdialect daràndilǎs pani phen ćhaj daj puzgal ćulal fededer po śukar

E superdialect daràjas paj phej ćhej dej istral pitǎl maj miśto maj feder

Translation (he/she) was afraid water sister Rromani girl, daughter mother to slipa to drip better

a

There is no infinitive in Rromani. The form of the third person singular in the present tense is used as an entry (headword) in dictionaries

Each dialect includes certain vernaculars which names are often related to professional terms, such as ursàri, related to bear leaders, and kelderaś, related to metal workers. One of the characteristics of Rromani dialectology is that two main isoglosses are not areal. The first isogloss criteria concern the opposition “o” vs. “e.” This opposition is mainly marked in morphology: the verb ending of the first person singular in the past tense, the copula ending of the first person singular in the present tense, and the definite article in the plural. phirdom vs. phirdem (I) walked som vs. sem (I) am o Rroma vs. e Rroma the Rroms In addition to the opposition “o” vs. “e,” this isogloss concerns other morphological, phonological, lexical, and syntactic phenomena. This fact substantiates a diasystem (i.e., systematic correspondences of interdialectal equivalents) in the Rromani language (Table 1). The criteria of the second isogloss concern the phonetic mutation of two consonants: alveolar affricates [ʧʰ] and [ʤ] turn into alveolo-palatal fricatives [ɕ] and [ʑ]. ćhavo [ʧʰavo] without mutation → ćhavo [ɕavo] with mutation ʒukel [ʤukel] without mutation → ʒukel [ʑukel] with mutation

Rromani son dog

boy,

These two crossing isoglosses form the four main dialects. Rromani speakers spread throughout Europe and across the Atlantic regardless of the different dialects. Then contact with local languages resulted in different vernaculars. This is how several layers of isoglosses are superimposed on the geographical map. This means that those who live in nearby regions do not necessarily speak nearby dialects, and it is not surprising to hear the same dialect in distant countries. In dialectological terms of Rromani, the geographical distance does not correspond to the linguistic distinction between dialects (Fig. 1).

150

M. Watabe

Fig. 1 Rromani dialects map

Thanks to the diasystem in Rromani shown in the table above, mutual understanding between good speakers of different dialects is possible when the topic of conversation concerns everyday life.

1.2

Rromani Alphabet

The graphic standard was defined at the IRU Congress in 1990 at Warsaw. There is no other standardization: neither lexical, grammatical, nor phonetic. The Rromani alphabet contains 35 letters. Its basis is the Latin alphabet, including two special characters: “ʒ5” and “θ.6” There are five vowels: “a,” “e,” “i,” “o,” and “u.” There are mainly two types of diacritics added to the vowels. The grave accent (e.g., “à”) marks syllables with stress. No borrowed words in Rromani are oxytonic (i.e., have the stress on the last

5 6

Its pronunciation is [ʤ] or [ʑ] according to dialects. Its pronunciation is [t] or [d] according to preceding phonemes.

A Polylectal Linguistic Resource for Rromani

151

Fig. 2 Standardized Rromani alphabet

syllable). In this case, there is no mark of stress in the writing. If it is not an oxytone, a grave accent will mark its position. The caron (e.g., “ǎ”) causes the prejotization; “a” [a] turns into “ǎ” [ja] with a caron. Moreover, the diaeresis (e.g., “ä”) transcribes the pronunciation of some dialects; “a” [a] turns into “ä” [ə] with a diaeresis. Three letters at the bottom of Fig. 2, “ç,” “q,” and “θ” are used only at the beginning of postpositions. The pronunciation of these letters depends on the preceding phonemes. For example, “θ” will be pronounced as [d] if there is an “n” just before it (e.g., manθe [at my place]). On the other hand, “θ” will be pronounced as [t] if there is no “n” just before it (e.g., laθe [at her place], lesθe [at his place], tuθe [at your place]). The four digraphs “ćh,” “kh,” “ph,” and “th” are aspirated. Rromani alphabet distinguishes single “r” and double “rr.” Single “r” is pronounced as a “rolled R,” while the pronunciation of double “rr” depends on dialects. Single “r” and double “rr” make minimal pairs at the beginning, middle, and end of words. rani lady vs. rrani branch ćoripen stealing vs. ćorripen poverty

152

M. Watabe

Fig. 3 60 possible spellings for the word ćhib [language/tongue] (This picture is one of the panels created for the exhibition “La langue rromani – un atout pour l’éducation et la diversité” (Council of Europe 2014). We have added number of possible transcriptions)

bar

stone

vs. barr

enclosure

Double “rr” is also used in the name of the people Rroma [Rroms], as the “Universal Declaration of linguistic rights” (Barcelona 1998) says, Everyone has the right to the use of his/her own name in his/her own language (. . .) to the most accurate possible phonetic transcription.

In a dialectal sense, this standardized alphabet gives flexibility to all speakers. For example, the word ćhib [language/tongue] has four possible pronunciations according to dialects (i.e., [ʧʰb], [ʧʰp], [ɕib] and [ɕip]). If all speakers transcribe this word using the graphic standard of their countries, there will be 60 ways of spelling, and mutual understanding will be difficult. On the other hand, the standardized Rromani alphabet enables mutual understanding in writing between speakers of different dialects by giving them comfort in pronunciation (Fig. 3). Rromani speakers are scattered across many different countries, and they use computer keyboards corresponding to their majority languages. How can they quickly write in the standardized Rromani alphabet, which includes special

A Polylectal Linguistic Resource for Rromani

153

characters and diacritics on the computer keyboard? There are two virtual keyboards7 available online8 to resolve this obstacle.

1.3

Rromani Dictionaries

There are a few Rromani editorial dictionaries. Only three dictionaries (Courthiade, M. et al. 2009; Sarău, G. 2012) adopted the standardized alphabet. Almost all these dictionaries are dialect-specific. Sailley, Robert. 1979. Vocabulaire fondamental du tsigane d’Europe. Maisonneuve et Larose, Paris. Calvet, Georges. 1982. Lexique tsigane : Dialecte des Erlides de Sofia. Publications Orientales de France, Paris. Calvet, Georges. 1993. Dictionnaire tsigane-français: dialecte kalderash. L’Asiathèque, Paris. Boretzky, Norbert, and Igla, Birgit. 1994. Wörterbuch Romani Deutsch Englisch. Harrassowitz Verlag, Wiesbaden. Courthiade, Marcel. et al. 2009. Morri angluni rromane ćhibǎqi evroputni lavustik (My first European dictionary of the the Rromani language). Romano Kher, Budapest. Sarău, Gheorghe. 2012. Dicţionar Romăn-Rrom. Sigma publishing house, Bucharest. Sarău, Gheorghe. 2012. Dicţionar Rrom-Romăn. Sigma publishing house, Bucharest. De Gouyon Matignon, Louis. 2012. Dictionnaire tsigane: Dialecte des Sinté français-tsigane / tsigane- français. L’Harmattan, Paris. Mégret, Jean-Claude. 2016. Dictionnaire de la Romani Commune: (langue tsigane). L’Harmattan, Paris. Lush, Harald. S. 2017. Dictionnaire Romani Trilingue: Anglais - Français - Romani. L’Harmattan, Paris. Rotaru, Julieta, Shapoval, Viktor, and Tirard, Aurore. 2022. Romani Lexicography in the Nineteenth Century. Volume 1: Vasile Pogor. Lexicon Româno-Țigănesc / Romanian-Gypsy Dictionary. Lincom Europe, München.

The “EuroUniv” driver corresponds to “qwerty” keyboards, and the “EuroLatin” driver corresponds to “azerty keyboards. These are universal keyboards that allow users to enter all the letters of main European languages, including letters with diacritics. 8 Available on http://www.red-rrom.com/ 7

154

1.4

M. Watabe

Rromani Language Lessons

Similarly, there are few Rromani language manuals except school manuals in Romania. Only three manuals (Heinschink, M, Krasa, D and Gurbetovski, M. 2010. Sumi, Y. 2018. Sarău, G. 2022) adopted the standardized alphabet. Some manuals are dialect-specific. De Gila-Kochanowski, Vania. 1994. Parlons tsigane. Histoire, culture et langue du peuple tsigane. L’Harmattan, Paris. Hancock, Ian. 1995. A Handbook of Vlax Romani. Slavica, Indiana. Heinschink, Mozes, Krasa, Daniel and Gurbetovski, Medo. 2010. Guide de conversation rromani de poche. Assimil, Paris. Machida, Ken. et al. 2010. ニューエクスプレス・スペシャル ヨーロッパのお もしろ言語. Hakusuisha, Tokyo. De Gouyon Matignon, Louis. 2014. Apprendre le tsigane. L’Harmattan, Paris. Sumi, Yusuke. 2018. ニューエクスプレス ロマ(ジプシー)語. Hakusuisha, Tokyo. Sarău, Gheorghe. 2012. Practical course of Rromani, Sigma publishing house, Bucharest.

1.5 Rromani Grammar

Pobożniak, Tadeusz. 1964. Grammar of the Lovari Dialect. Polska Akademia Nauk, Krakow.
Hancock, Ian. 1993. A Grammar of Vlax Romani. Romanestan Publications, London & Austin.
Matras, Yaron. 2002. Romani: A Linguistic Introduction. Cambridge University Press, Cambridge.

2 Empirical NLP Software

Today, publications on social media are written in Rromani as commonly as in local languages among young and adult Rroms. While the standardized Rromani alphabet exists, almost all Rromani speakers, unfortunately, do not know it. The Rromani language is taught using the graphic standard only at schools and universities in Romania and at INALCO in France. Those who do not know the Rromani alphabet write in their own way, often using the graphic standard of the local majority language. How do Rromani speakers who live in different countries and speak different dialects communicate easily in writing on social media? How can Rromani native speakers and learners understand corpora that include dialectal variants without being disturbed by them? We will examine Natural Language Processing (NLP) software applied to the Rromani language following two different approaches: empirical and linguistic. Regarding the linguistic approach, we will define the specific needs and the best solutions for the Rromani language.

2.1 Rromani, a Low-Resource Language

Rromani is not processed by any of the following NLP software applications, which are the most popular today:
– Google Translate: 133 supported languages.9
– DeepL: 31 supported languages.10
– Sketch Engine: 147 supported languages.11
– NLLB-200 by Meta AI: 200 supported languages.12

Rromani is not supported by Facebook, but when one posts something in Rromani, Facebook will automatically “recognize” the source language and “translate” it. The quality of this translation is obviously catastrophic. For example, the text in Fig. 4 is a message written in Rromani to celebrate April 8, the International Rromani Day:13 “Baxtalo 8 Aprilo. Savore Romenge, oven saste!”, meaning “Happy April 8. For all the Rroms, may you be healthy!”.
Aprilo [April] → avril [April] = correct
savore [all] (PL) → au flaveur [with savor] = incorrect
oven [to become] (PRS.2PL or 3PL) → de four [of oven] = incorrect
The translated text at the bottom of Fig. 4, “Baxtalo 8 avril romenge au flaveur saste de four”, is the supposed translation into French,14 but it makes no sense. Note that only three words are translated into French: the word Aprilo is correctly translated as avril, but the others are completely mistranslated. savore is an adjective in the plural; it is translated as au flaveur, which is not a correct French sequence. oven is a verb conjugated in the second or third person plural in the present tense, but it is translated as de four, i.e., by a preposition followed by a noun, probably because Facebook thinks that oven is an English word. These mistakes are typical of corpus-based methods that do not have access to large enough corpora.

9 Cf. https://en.wikipedia.org/wiki/Google_Translate
10 Cf. https://www.deepl.com/fr/translator
11 Cf. https://www.sketchengine.eu/corpora-and-languages/
12 Cf. https://ai.facebook.com/research/no-language-left-behind/#200-languages-accordion
13 The day was declared at the IRU Congress in 1990.
14 French is the author’s interface language on Facebook.


Fig. 4 “Translated” text on Facebook

3 Rromani Online Resources

We now present some available NLP tools for Rromani developed by research scientists.

3.1 Russian Romani Corpus

The “Russian Romani Corpus”15 contains approximately 720,000 tokens from texts published in the USSR in the 1920s and 1930s. The target language is English. The development of the corpus was headed by Kirill Kozhanov from 2014 to 2015 (Fig. 5). If one enters a lexeme, this application will find all occurrences of the lexeme, including its inflected forms.16 For example, for the query дад [father], this application finds 1031 occurrences in 82 documents. If one places the cursor over a

15 Cf. http://web-corpora.net/RomaniCorpus/search/?interface_language=en
16 Rromani is an inflectional language. Verbs, nouns, pronouns, and adjectives are inflected.


Fig. 5 Search by query дад [father] on “Russian Romani Corpus”

wordform, a bubble shows the basic form, its Part Of Speech category (POS), and its grammatical properties, but the translation is not always given: note that there is no translation given for the noun дад.
дадэскиро [of (the) father] → дад (N) M.OBL.SG.GEN.M.DIR.SG
дад [father] → дад (N) M.DIR.SG
дадэскэ [for (the) father] → дад (N) M.OBL.SG.DAT

Genitive forms in Rromani are composed of a noun in the oblique case followed by a variable postposition -qo;17 there is no space between a noun and a postposition. The genitive postposition declines according to the number, gender, and case of its determinate. This is why certain morphological values are repeated in this application. For instance, two values, “M” and “SG,” are produced twice for the genitive form дадэскиро [of (the) father]. In fact, дадэскиро [of (the) father] is composed of the oblique singular form дадэс [father] followed by the genitive postposition of the singular masculine direct form18 -киро [of], but this software application does not represent the wordform as composed of two linguistic units. For didactic applications, showing how linguistic units compose to form wordforms is useful to learners. Unfortunately, the software developers have decided to process postpositions as inflectional endings, probably because Russian has many inflected cases. Furthermore, отдэла [(he/she) gives] is composed of the preverb от- and the verb дэла [to give] in the third person singular, present tense. Rromani is not a

17 -киро is a variant of the genitive postposition used in Russia.
18 The genitive postposition agrees with the number, gender, and case of the possessed noun in Rromani.


preverb-rich language. However, some Rromani dialects use preverbs because their local languages have preverbs. There is no mention of any Rromani dialect in this corpus, and the spelling uses the Russian Cyrillic alphabet. This resource, collected in a limited area and period, may be valuable for users who would like to study Rromani as written in Russia. Still, for users who want to learn about the dialectal diversity of Rromani using the standardized Rromani alphabet, this application is not very useful.

3.2 ROMLEX

“ROMLEX19” is a lexical database in which 27 Rromani dialects and 16 target languages can be cross-referenced. The translation into the 16 target languages is not systematic, but the English translation is always given. Search directions can be reversed: it is possible to search by querying a lexeme in either Rromani or English. However, the lexical properties (i.e., lexeme, POS, lexical values, and translation into English) are the same regardless of the search direction. Morphological values, such as the inflectional endings of nouns, are indicated only for a few dialects. ROMLEX contains 134,676 entries; the number of entries listed for each dialect varies from 1101 to 11,028. If all dialectal variants of a term were linked to one single lexical entry, the number of lexical entries would not exceed 20,000. Dialect databases are independent; therefore, it is not possible to view information about two different dialects at the same time. The software interface for ROMLEX accepts only lemmas as inputs. It is not possible to look for inflected forms; therefore, users are expected to master morphology to search for a term. There is an operator, “prefix matching,” meant to help users unfamiliar with Rromani morphology. However, it will find all lexemes that share the same prefix. For instance, if one searches for me [I] in the “Banatiski Gurbet” dialect using the operator “prefix matching”, the application will produce a list that contains mećka [female bear], medicina [medicine (science)], etc. (Fig. 6). This application uses the Latin alphabet with some diacritics, which does not correspond to the standardized Rromani alphabet.

3.3 Online Rromani Dictionaries

There are three Rromani dictionaries available online. Unfortunately, they are far from exhaustive, their content is not consistent, they do not offer any search functionality, and they do not use the Rromani standard alphabet.

19 Cf. http://romani.uni-graz.at/romlex/


Fig. 6 Query me [I]

– The Romany-English Glossary20 contains about 400 entries. It presents each entry in Rromani with its English translation, its etymology, the language of origin of borrowings, and, for certain entries, an example of concrete usage. It does not contain any linguistic properties.
– The Romany-English Dictionary21 contains about 280 entries, either in English or in Rromani. The author of this dictionary states that “it is sometimes easier to learn from one language to the other and sometimes other way around.” Consequently, some entries are in the English-to-Rromani direction and others in the opposite direction. Most entries contain a translation. A few entries mention dialectal variants and etymology. There are no linguistic properties.
– The Romano Language22 dictionary contains about 220 entries. Each line begins with a Rromani word followed by its English translation. There are no linguistic properties.

3.4 Need for Coherent Linguistic Resources

The “Russian Romani Corpus” allows users to study the vocabulary specific to an area, period, and dialect, and offers access to inflected forms, but without showing how they are composed of linguistic units. It is not possible to process new texts using this resource.

20 Created by Fergus Smith in 1998 and last updated in 1998. Available at: http://www2.arnes.si/~eusmith/Romany/glossary.html
21 Created by Angela Ba'Tal Libal and Will Strain, last updated in 1997. Available at: https://geocities.restorativland.org/SoHo/3698/rom.htm
22 Available at https://www.larp.com/jahavra/language.html


“ROMLEX” is useful for studying dialectal variants at the phonological level, since it shows the differences between phonetic variants. But users who do not know Rromani morphology cannot look up terms by their inflected forms. Note that users cannot find correspondences between dialectal variants, as there are no links between dialects in this database. The “Universal Declaration of Linguistic Rights” says:
In the field of information technology, all language communities are entitled to have at their disposal equipment adapted to their linguistic system and tools and products in their language.

However, Rromani speakers have not yet benefitted from tools developed for many languages (such as spell checkers, electronic dictionaries): they are not yet available for Rromani. Developing a simple dictionary of wordforms would not be sufficient to develop any NLP software: one needs to implement a formalized morphosyntactic description of Rromani. Only a person with linguistic knowledge can construct the lexicon and associated morphological grammar to develop NLP software. The Rromani language has a complex dialectal structure; therefore, we need to develop a diasystem (not only for lexical entries but also for morphological equivalences) that takes all Rromani variants into account.

4 Rromani Linguistic Resources

We are in the process of developing a set of linguistic resources for Rromani, using the NooJ platform.23 The first version has already been implemented, using two small corpora24 that contain 798 wordforms in total. Our current objective is to adapt an editorial dictionary25 (Courthiade, M. et al. 2009) to enrich our lexicon. Once this task is complete, we will publish the linguistic resources as open-source. Our linguistic resources consist of several dictionaries, morphological grammars, and syntactic grammars. The aim of creating these resources for Rromani is not only to describe this language for academic research but also to offer a useful tool to all Rromani speakers and learners.

23 NooJ is a linguistic development environment, see: www.nooj4nlp.org
24 An essay by Duka, Jeta, Deś berś vaś-i rromani ćhib and-o INALCO (Ten years for the Rromani language in INALCO), ms., and a poem by Rajko Đurić, Rromani ćhib (Rromani language), in La littérature des Rroms, Sintés et Kalés, 2006.
25 This dictionary covers all four Rromani dialects (i.e., it is polylectal) and is written in the Rromani standard alphabet.

4.1 Dictionary

In its first version, our dictionary contains 640 entries (i.e., lemmas). It is far from exhaustive, but it covers variants of the four dialects and some vernaculars according to the Rromani standardized alphabet. We have manually described each lexical entry and added grammatical words: pronouns, prepositions, postpositions, and determiners such as definite articles, possessives, and demonstratives extracted from the editorial dictionary. Each lexical entry is a lemma associated with its POS category and properties. We have described all POS categories, property names, and their potential values for Rromani. Each lexical entry is potentially associated with its inflectional (FLX) and derivational (DRV)26 paradigms. Here are some examples of lexical entries:
phral,N+hum+m+EN="brother"+FLX=rrom+DRV=rromorro:ćhavo
phej,N+hum+f+rre+EN="sister"+RRO="phen"+FLX=phej+DRV=ćhajorri:rromni
phen,N+hum+f+rro+EN="sister"+RRE="phej"+FLX=phen+DRV=phenǒrri:rromni
bakro,N+ani+m+EN="sheep"+FLX=ćhavo+DRV=ćhavorro
ćhib,N+ina+f+EN="language,tongue"+FLX=phuv+DRV=phenǒrri:buti

One must not confuse lemmas and paradigm names. The words at the beginning of each line (e.g., phral, phej, phen, bakro, and ćhib) are lexical entries, whereas the words following “FLX=” or “:” (e.g., rrom, ćhavo, phej, phen, rromni, phuv, and buti) are inflectional paradigm names, and the words following “DRV=” (e.g., rromorro, ćhajorri, phenǒrri, and ćhavorro) are derivational paradigm names. The names of paradigms are arbitrary: it would be possible to use paradigm names such as Nm1 (for an inflectional paradigm of masculine nouns), but we have chosen to use a representative lemma as the name of each paradigm, which makes it more concrete and easier to understand for all users. The lemma phral [brother] is associated with the POS category “N” (i.e., noun); its gender value is “m” (i.e., masculine); its semantic value is “hum” (i.e., human27); its translation28 into English (EN) is brother; its inflectional paradigm name is “rrom”; and its derivational paradigm name is “rromorro”. The lemma phral and

26 Only the diminutive is programmed in the present version.
27 We have defined three semantic values for nouns: human, animal, and inanimate object. Animal nouns in Rromani are divided into two subgroups: superior and inferior. Superior animal nouns usually have the same inflectional morphology as human nouns, while inferior animal nouns have the same inflectional morphology as inanimate object nouns. For the Rromani module, we have decided to process all animal nouns as human nouns so as not to limit the possibility of the oblique form without postposition, which does not exist for inanimate object nouns.
28 It is possible to add other target languages in a single module.


its derivative phralorro are not associated with the same inflectional paradigm:29 this is why the inflectional paradigm of the derivative is indicated as “ćhavo”. The lemma bakro [sheep] and its derivative bakrorro are associated with the same inflectional paradigm “ćhavo”; consequently, it is not necessary to indicate the inflectional paradigm of this derived form. The innovative characteristic of these resources is that they are polylectal, meaning that they cover all four dialects in a single database: we have therefore added dialectal values30 to each variant of a lexical entry. If a variant is shared by two dialects that belong to one superdialect (e.g., the “O-bi” and “O-mu” dialects belonging to the “O” superdialect), we encode its dialect value with the single tag “rro.” If a variant is used only in one dialect (e.g., the “E-mu” dialect), its dialect value is defined by a double tag “rre+rrmu.” We use dialect tags in lower- or uppercase. A dialect tag in lowercase corresponds to the dialect value of the lemma, whereas a dialect tag in uppercase precedes the dialectal equivalent of the lemma. For example, the lemma phej [sister] is used in superdialect “E” (i.e., its dialect value is “rre”), and its equivalent in superdialect “O”, phen [sister], is preceded by the tag “RRO”. We use dialect tags both in the dictionary and in the corresponding grammars. If one annotates a dialect-specific inflected form, one also obtains its dialect values.
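To make the structure of these entries concrete, here is a minimal Python sketch; it is purely illustrative (the parse_entry function and field names are ours, not part of NooJ) and simply splits an entry of the format shown above into its lemma, POS category, features, and attribute values:

# Minimal sketch (illustrative only, not part of the NooJ Rromani module):
# splitting one lexical entry of the form lemma,CATEGORY+feature+...+KEY=value
# into its fields.
def parse_entry(entry: str) -> dict:
    lemma, rest = entry.split(",", 1)              # "phral", "N+hum+m+EN=..."
    parts = rest.split("+")
    info = {"lemma": lemma, "pos": parts[0], "features": [], "attributes": {}}
    for part in parts[1:]:
        if "=" in part:                            # e.g. EN="brother", FLX=rrom
            key, value = part.split("=", 1)
            info["attributes"][key] = value.strip('"')
        else:                                      # e.g. hum, m, rre
            info["features"].append(part)
    return info

print(parse_entry('phral,N+hum+m+EN="brother"+FLX=rrom+DRV=rromorro:ćhavo'))
# {'lemma': 'phral', 'pos': 'N', 'features': ['hum', 'm'],
#  'attributes': {'EN': 'brother', 'FLX': 'rrom', 'DRV': 'rromorro:ćhavo'}}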

4.2 Morphology

The dictionary is associated with a morphological grammar that describes inflectional and derivational paradigms. A simple lookup procedure can then annotate any text simply by applying the combined automaton dictionary+grammar to the text. The present resource for Rromani contains 196 grammar rules. Here are two examples of inflectional paradigms:
rrom = <E>/sg+dr | a/pl+dr | es/sg+ob | en/pl+ob ;
ćhavo = <E>/sg+dr | <B>e/pl+dr | <B>es/sg+ob | <B>en/pl+ob ;

In the example above, “rrom” and “ćhavo” are the names of inflectional paradigms. The paradigm “rrom” is applied to oxytonic and consonantal human masculine nouns. The paradigm “ćhavo” is applied to oxytonic and vocalic human masculine nouns. The paradigm name is followed by the equal sign “=” and then by a regular expression. “<E>” stands for “Empty string” and “<B>” stands for “Backspace” (i.e., delete one letter).
29 The paradigm “rrom” represents oxytonic human masculine nouns ending with a consonant, while the paradigm “ćhavo” is for oxytonic human masculine nouns ending with a vowel.
30 We have two values for superdialects, “rro” and “rre,” and two values for two dialectal subgroups, “rrbi” and “rrmu.”


Fig. 7 Inflectional grammar “rrom” represented by a graph

Fig. 8 Annotation of the inflected form phrales

The slash character “/” separates the suffix of each wordform from its properties. For instance, the term “a/pl+dr” means that if one adds the letter “a” to the original lexical entry, one gets a plural (pl) direct (dr) wordform. The disjunction operator “|”, which separates the terms of the regular expression, is used to associate the original entry with multiple forms. If one applies the inflectional grammar “rrom” to the lexical entry phral, the grammar will generate four inflected forms: phral (sg+dr), phrala (pl+dr), phrales (sg+ob), and phralen (pl+ob).31 Morphological paradigms can also be defined graphically. The graph “rrom” in the figure below is equivalent to the regular expression “rrom” above (Fig. 7). When a lexical parser applies the dictionary and its morphological grammar to a text, it automatically annotates all recognized wordforms. The following figure shows that the wordform phrales has been annotated as the singular oblique form of the human masculine noun phral, which translates into English as brother (Fig. 8). One can then locate in a corpus the occurrences of a wordform (e.g., Rromes) simply by entering the form in a query. If one wants to locate the occurrences of all inflected and/or derived forms associated with a lexical entry, one can enter the lexical entry between angle brackets (e.g., “<Rrom>”). The example below is the concordance produced by the query “<Rrom>” applied to the above-mentioned text by Duka, J. (Fig. 9).
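To illustrate how such a paradigm produces inflected forms, here is a minimal Python sketch; it is only a simplified interpretation of the notation (not the NooJ engine itself), in which <E> adds nothing and <B> deletes one final letter:

# Minimal sketch (illustrative only, not the NooJ engine): interpreting the
# paradigm notation above, where <E> is the empty string, <B> is a backspace
# (delete one letter), and each "suffix/properties" term yields one form.
RROM = "<E>/sg+dr | a/pl+dr | es/sg+ob | en/pl+ob"

def inflect(lemma: str, paradigm: str) -> list:
    forms = []
    for term in paradigm.split("|"):
        suffix, properties = term.strip().split("/")
        stem = lemma
        while suffix.startswith("<B>"):        # delete one final letter
            stem, suffix = stem[:-1], suffix[len("<B>"):]
        suffix = suffix.replace("<E>", "")     # empty string: add nothing
        forms.append((stem + suffix, properties))
    return forms

print(inflect("phral", RROM))
# [('phral', 'sg+dr'), ('phrala', 'pl+dr'), ('phrales', 'sg+ob'), ('phralen', 'pl+ob')]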

31 In Rromani, there are two genders (masculine and feminine), two numbers (singular and plural), and two morphological cases (direct and oblique).


Fig. 9 Concordance produced by the query <Rrom>
Fig. 10 Derivational grammar “rromorro”

Multiple derivations can be associated with a given lemma. As of now, we have formalized only the diminutive derivation; we are planning to describe other derivations soon (Fig. 10). The derivational paradigm “rromorro” concerns consonantal masculine nouns (e.g., phral [brother]), regardless of the position of stress. Its derivation value is diminutive. Since the lemma phral is consonantal, its inflectional paradigm is “rrom.” However, its diminutive phralorro [little brother] is vocalic; therefore, its inflectional paradigm is “ćhavo.”
Specific morphological operators for Rromani
We needed to implement two operators specific to the Rromani language: one that removes the grave accent from a vowel and one that adds a grave accent to a vowel. These operators allow us to regularize inflectional and derivational grammar rules (Silberztein, M. 2015). Even non-oxytonic nouns become oxytonic in the diminutive, so the accent-removing operator is applied there. The paradigm “rromorro” produces four inflected forms in the diminutive: phralorro (sg+dr), phralorre (pl+dr), phralorres (sg+ob), and phralorren (pl+ob); e.g., phralorre is the plural direct form of the diminutive of the lemma phral as a masculine human noun (Fig. 11).
Rromani Verbs
The following is an example of a verbal entry:
kinel,V+tr+EN="to buy"+FLX=kerel


Fig. 11 Annotation of the derived and inflected form phralorre

Fig. 12 Inflectional paradigm “kerel”

The lemma kinel [to buy] is a transitive verb; its inflectional paradigm name is “kerel”, which represents transitive verbs whose third person singular present tense ending is “-el” and whose past tense morpheme is “-d-.” Inflectional grammars for verbs are represented by context-free grammars (or, equivalently, by embedded graphs), i.e., they are defined recursively. For example, the paradigm “kerel” is defined using ten “embedded grammar rules”: one each for the indicative present, the future, the past, the imperfective and the pluperfect, the imperative, the medio-passive present and past, the gerund, and the past passive participle (Fig. 12). The main graph “kerel” contains a reference to the embedded graph “PRESENTel” that formalizes the conjugation in the present tense (Fig. 13). For example, when one applies the graph “PRESENTel” to the verb kinel [to buy], the morphological analyzer first deletes from kinel its two final letters (“<B2>”), then adds an “a” and then a “v”. This produces the wordform kinav [(I) buy], associated with the properties “1+sg” (first person singular form). Note that there is an optional extra suffix that can be added to the wordform kinav: the “a” suffix will subsequently add a grave accent and then add an “a”, which


Fig. 13 Inflectional grammar for the Present tense “PRESENTel”

produces the final form kinàva [(I) buy], with the stress indicated on the penultimate syllable. The “a” suffix is associated with the dialectal values “rro+rrbi,” meaning the “O-bi” dialect; this can be followed by two vernacular values, “rrs” or “rrn”. This represents the fact that the wordform kinàva corresponds to the first person singular in vernaculars used in the south of the Balkans (i.e., “rrs”) and in the north of Russia (i.e., “rrn”). The paradigm generates 12 forms for the lexical entry kinel [to buy] in the present tense: kinav (1+sg), kines (2+sg), kinel (3+sg), kinas (1+pl), kinen (2+pl), and kinen (3+pl) as basic forms; kinàva (1+sg), kinèsa (2+sg), kinèla (3+sg), kinàsa (1+pl), kinèna (2+pl), and kinèna (3+pl) in the vernaculars “rrs” and “rrn”. Dialectal values are introduced in the grammar of the past tense as well; they concern the verbal endings of the first person singular. The paradigm “PASTǎs” will delete the last two letters of the lexical entry (see the command “<B2>” in the graph “kerel”), add a “d” as the past tense morpheme, and add one of three variant endings: “-om” or “-ǒm” in the “O” superdialect (i.e., “rro”) and “-em” in the “E” superdialect (i.e., “rre”). This produces three wordforms of the first person singular in the past tense, all of which mean [(I) bought] (Fig. 14). This paradigm represents 13 inflected forms for the verb kinel [to buy]: kindom, kindǒm (1+sg+rro), kindem (1+sg+rre), kindan, kindǎn (2+sg), kindas, kindǎs, kinda, kindǎ (3+sg), kindam, kindǎm (1+pl), kinden (2+pl), kinde (3+pl).
Agglutinative Morphology
There are five postpositions in Rromani: four invariable (ablative, dative, instrumental, and locative) and one variable (possessive). The possessive inflects according to the number, gender, and case of its possessed noun. There are two ways of processing postpositions: as a morphological ending or as an agglutinated linguistic unit. We have tested both ways.


Fig. 14 Inflectional graph for the past tense “PASTǎs”

Postpositions in Rromani follow the oblique form of a noun, with no space in writing. The possible combinations of the lemma phral [brother] in the singular with postpositions are as follows:
mire32 phraleça [with my brother]
mire phralesqe [for my brother]
mire phralesθar [from my brother]
mire phralesθe [at my brother’s place]
mire phralesqo ćhavo [my brother’s son]
mire phralesqi ćhaj [my brother’s daughter]
mire phralesqe ćhave [my brother’s sons]
The final “s” of the oblique form phrales is contracted in the instrumental case; therefore, the instrumental form is not phralesça but phraleça. In other cases, the postposition is simply glued to the oblique form. The possessive endings are “-o” (masculine singular), “-i” (feminine singular), and “-e” (plural and oblique). There are dialectal variants of the possessive postpositions: short forms (e.g., -qo in the masculine singular) and long forms (e.g., -qoro in the masculine singular). If one represents all postpositions in inflectional grammars, there is no need to describe them as entries in the dictionary, and consequently the module will not recognize postpositions as regular lexical units. If one applies the inflectional

32 mire is the oblique (of any number and any gender) form of the possessive miro [my]. There are several dialectal variants of the possessive.


Fig. 15 Inflectional grammar “rrom” with postpositions

Fig. 16 Annotation of the “inflected” form phralesθe

grammar33 “rrom” (Fig. 15) to the lemma phral [brother], the lexical analyzer generates 519 forms, which seems unnecessary and, above all, not consistent from the linguistic and didactic points of view. Moreover, if one represents forms that contain a postposition as if they were inflected forms, the analyzer will not mark the separation between the oblique form and the postposition (Fig. 16), and it will not distinguish the morphological values of the noun from those of the possessive (Fig. 17).
phral,N+hum+m+EN="brother"+FLX=rrom+DRV=rromorro:ćhavo
qe,PSTP+dat+EN="for"
ça,PSTP+ins+EN="with"
θar,PSTP+abl+EN="from"
θe,PSTP+loc+EN="at"
qo,PSTP+poss+EN="of"+FLX=qo

33 There are embedded grammars (“possS,” “possL,” “pstpS,” and “pstpX”) inside the main grammar “rrom”; they serve as the “inflectional” grammar for the possessive.


Fig. 17 Annotation of the “inflected” form phralesqi

Fig. 18 Productive morphology grammar for postpositions

An alternative method is to process postpositions as proper lexical units and to use agglutinative morphological rules to represent the agglutinated forms. In that case, all postpositions are entries of the dictionary, and the analyzer represents them as full lexical units. If one applies the inflectional paradigm “rrom” without postpositions (Fig. 7) to the lemma phral [brother], the analyzer generates only four inflected forms. Similarly, if one applies the inflectional paradigm “qo” to the postposition -qo [of], the analyzer generates 56 inflected forms. Therefore, the module can potentially agglutinate these 56 forms to any noun in the oblique case. Thanks to agglutinative morphological rules, if one needs to annotate a form with a postposition, the module will mark the separation between the oblique form and the postposition and will distinguish the morphological values of the noun from those of the possessive. That is a much better solution, as it is consistent from the computational, didactic, and linguistic points of view. In the agglutinative morphological grammar for postpositions (Fig. 18), each path corresponds to a postpositional phrase that starts with a sequence of letters, stored in the variable “$Subs”. The value of the variable “$Subs” is restricted by a constraint34 that forces the sequence of letters to represent a noun in the

34 The operator “=:” is the matching operator. For example, “=:N” matches any noun in any inflected form.


Fig. 19 Annotation of the lemma phral with an invariable postposition -θe

Fig. 20 Annotation of the lemma phral with a variable postposition -qo

oblique case. For example, phrales [brother] is a noun in the oblique case; therefore, the constraint is satisfied. The graph contains two other variables, “$PstpInvar1” (invariable postposition 1) and “$PstpVar” (variable postposition), that must satisfy other constraints. A third variable, “$PstpInvar2” (invariable postposition 2), must be appended to “$PstpVar.” The grammar finally produces a sequence of two annotations, <$1L,$1C+$1S+$1F> <$2L,$2C+$2S>,35 each of which corresponds to a linguistic unit. The variable $1L represents the lemma of the first constraint, i.e., phral; $1C its category (N); $1S its syntactic and semantic properties (hum); and $1F its inflectional properties (m+sg+ob). The variable $2L corresponds to the lemma of the second constraint, i.e., θe; $2C its category (PSTP); and $2S its syntactic and semantic properties (loc). As a result, each recognized wordform will be annotated by a sequence of two annotations. The analyzer recognizes the postpositional phrase phralesθe [at brother’s place] and annotates it as seen in Fig. 19. Note that this single wordform is clearly annotated as a sequence of two annotations. Thus, there is no risk of mixing and confusing the properties of the noun and the postposition, such as the morphological values of the noun and those of the possessive postposition36 (Fig. 20). Thanks to this agglutinative grammar, nouns with a postposition are correctly annotated, and this module would be useful for all users: academic and non-academic, native and non-native speakers.
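As a rough illustration of this agglutinative analysis, here is a minimal Python sketch; the tiny dictionaries and the analyze function are hypothetical toy examples, not the actual NooJ grammar:

# Minimal sketch (illustrative only): splitting an agglutinated postpositional
# phrase into a noun in the oblique case followed by a postposition, in the
# spirit of the constraint-based rule above. The dictionaries are toy data.
OBLIQUE_NOUNS = {"phrales": ("phral", "N", "hum", "m+sg+ob")}
POSTPOSITIONS = {"θe": ("θe", "PSTP", "loc"), "qe": ("qe", "PSTP", "dat"),
                 "θar": ("θar", "PSTP", "abl"), "ça": ("ça", "PSTP", "ins")}

def analyze(wordform: str):
    # Try every split point: the left part must be a known oblique noun,
    # the right part a known postposition (the "$Subs" constraint above).
    for i in range(1, len(wordform)):
        noun, pstp = wordform[:i], wordform[i:]
        if noun in OBLIQUE_NOUNS and pstp in POSTPOSITIONS:
            return [OBLIQUE_NOUNS[noun], POSTPOSITIONS[pstp]]
    return None

print(analyze("phralesθe"))
# [('phral', 'N', 'hum', 'm+sg+ob'), ('θe', 'PSTP', 'loc')]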

35 Each constraint is numbered, $1 being the first constraint. The various fields of the lexicon are named “L” (Lemma), “C” (Category), “S” (Syntactic and semantic properties), and “F” (inFlectional properties). For instance, “$1L” represents the lemma that satisfies the first constraint.
36 Values preceded by a “D” in capital letter are those of the possessed noun.


5 Evaluation and Perspectives

Our project is to create a polylectal set of linguistic resources that covers all Rromani dialects while respecting the diasystem of this language. Furthermore, we do not want to prioritize any dialect, since the declaration of the first IRU Congress in 1971 states: “no dialect is better than others.” We have tested the two existing linguistic resources for Rromani, the “Russian Romani Corpus” and “ROMLEX”. Neither of these resources uses the standardized Rromani alphabet, and neither represents dialectal variants. Moreover, they cannot be used to query or annotate new texts. The current set of resources we have implemented is still small at the dictionary level, but it is almost complete at the grammar level, as the main inflectional paradigms of nouns, adjectives, and verbs are already programmed. Our next objective is to import an editorial polylectal dictionary to enrich our dictionary. This step will also allow us to classify lemmas and implement a derivational morphological grammar. Rromani speakers do not yet fully benefit from new technology adapted to their own language. Contrary to corpus-based NLP applications, the linguistic resources we have developed for Rromani seem useful both for developing NLP software applications and as a pedagogical tool. Moreover, as our applications are not data-driven “knowledge-unaware” software applications, but applications built by expert linguists who carefully construct the lexicon and the grammars according to detailed linguistic knowledge, we can rely on their results. The Rromani module will soon be freely available and can be downloaded from the page: https://nooj.univ-fcomte.fr/resources.html.
Tools
The NooJ software. Available at https://nooj.univ-fcomte.fr
ROMLEX. Available at http://romani.uni-graz.at/romlex/
R.E.D.-RROM. Available at http://www.red-rrom.com/home.page
Russian Romani Corpus. Available at http://web-corpora.net/RomaniCorpus/search/?interface_language=en

References
Courthiade, Marcel. 2006. La littérature des Rroms, Sintés et Kalés. INALCO, Paris.
Courthiade, Marcel. 2016. The nominal flexion in Rromani, in Professor Gheorghe Sarău: a life devoted to the Rromani language, p. 157-211. Editura Universității din București, Bucharest.
Courthiade, Marcel. 2004. Les Rroms dans le contexte des peuples européens sans territoire compact, in Bulletin de l’Institut national des langues et civilisations orientales, octobre 2004, p. 31-37. INALCO, Paris.
Courthiade, Marcel. et al. 2009. Morri angluni rromane ćhibǎqi evroputni lavustik (My first European dictionary of the Rromani language). Romano Kher, Budapest.
Duka, Jeta. Deś berś vaś-i rromani ćhib and-o INALCO (Ten years for the Rromani language in INALCO), ms.


Gurbetovski, Medo, Heinschink, Mozes, and Krasa, Daniel. 2010. Guide de conversation rromani de poche. Assimil, Paris.
Silberztein, Max. 2015. La formalisation des langues : l’approche de NooJ. ISTE Eds., Londres.
Atlas of the World’s Languages in Danger. UNESCO, Paris. Third edition 2010.
Études tsiganes, n°22 : Langue et culture : approche linguistique. Le Centre de documentation, Paris, 2005.
Universal Declaration of Linguistic Rights. Barcelona, 1998.
La langue romani – un atout pour l’éducation et la diversité. Exhibition, Council of Europe, Strasbourg, 2014.
Sarău, Gheorghe. 2012. Dicţionar Rrom-Romăn. Sigma publishing house, Bucharest.

Part IV

Processing Multiword Units: The Linguistic Approach

Using Linguistic Criteria to Define Multiword Units
Max Silberztein

Abstract To describe the infinite set of sentences expressed in a Natural language, one needs to define the finite set of its atomic units, i.e., its vocabulary, and the rules that combine these atomic units to construct sentences, i.e., its grammar. However, separating the vocabulary from the grammar is not straightforward; one crucial problem is defining multiword units. Here, I present three reproducible criteria to characterize them and thus separate the vocabulary from the grammar in an operational way.
Keywords Multiword Units · Linguistic analysis · Machine Translation · Information Retrieval

1 Introduction

To describe the infinite set of sentences that can be expressed in a Natural language, one needs first to describe the set of its atomic linguistic units (ALUs) and then the set of rules used to combine these ALUs to construct the infinite number of sentences. The set of ALUs constitutes the vocabulary of the natural language; the set of combination rules constitutes its grammar. For this formalization to be feasible, the number of ALUs and the number of grammar rules must be finite.1 Making sure the vocabulary is finite in size requires overcoming several obstacles; the two most important problems are:
– As natural languages evolve, it is difficult to restrain their vocabulary size: indeed, every day, new terms appear, and obsolete ones are forgotten.
1 Indeed, if a so-called “vocabulary” were not finite in size, it would not be possible to describe it in extenso. In all natural languages, there are productive mechanisms that locutors use to create new words at will. These productive mechanisms must be described by some generative rules, which is another way of saying that these new words are not atomic, i.e., are not ALUs.



To bypass this issue, linguists describe the vocabulary in a synchronic approach, i.e., at a particular instant. To account for the evolution of languages, they regularly release a new edition of their dictionaries, typically every year.
– The vocabularies of natural languages contain many different sets, as every scientific, technological, or artistic domain has its specialized vocabulary. For example, medical doctors must master many technical terms denoting diseases, medicines, symptoms, anatomical parts, and names of molecules. Architects, chemists, physicists, painters, plumbers, etc., all must master concepts, devices, and methods which have their own names and terms. Hobbyists (chess players, cinephiles, stamp collectors, etc.) master specialized vocabularies as well. Moreover, vocabularies have regional variants: Americans in South Carolina use different words and expressions (e.g., “crank the car” = start the car) than Americans from New York (e.g., “a whip” = a nice car); Londoners and Mancunians have their own slang, as do Parisians and Marseillais, etc.
One solution to the many-vocabularies problem is to construct a base dictionary for each language that covers the language’s standard vocabulary; this standard vocabulary is mastered and shared by all the language’s locutors. For example, the standard British English vocabulary could be defined as the one used in the media, such as The Guardian newspaper, the ITV TV channel, and the BBC radio channel, and shared by the British population. On top of the base dictionary, one could construct other dictionaries adapted to a specific NLP application, such as a dictionary of medical terms or a dictionary of terms and expressions specifically used in Scotland.2
One remaining problem must be solved to formalize a natural language and construct reliable NLP software:3 how to set the limit between the vocabulary and the grammar? For example, when processing a sequence of graphical wordforms in a text, how does one decide whether a software application should parse this sequence using rules or just look up a dictionary to access its properties directly? In some instances, the answer is straightforward. For instance, many NLP software applications (including information retrieval and machine translation) must process the word sequence a small foot as a +BodyPart noun phrase exactly as they process the word sequences a giant foot, a dirty foot, or an injured foot. In all these noun phrases, the property +BodyPart can be inferred automatically from their head, i.e., the noun foot, which is a lexical entry described as:
foot,N+BodyPart

together with an inheritance rule such as:4
<DET> <A> <N>/$N$Properties

2 See for example Aoughlis (2007) for a French dictionary of computer science terms, and Kocijan et al. (2021) for a Croatian dictionary of medical terms.
3 See (Sag et al. 2002).
4 In the following, we are using NooJ’s symbols and formalism to describe linguistic analyses. <DET> matches any determiner; <A> any adjective; <N> any noun; $N is a variable that


However, to process the word sequence a Black foot as a +Human noun,5 one must deactivate the previous general computation on the one hand and, on the other hand, activate an alternative mechanism, which must be activated only for the word sequence Black foot. The following lexical entry can represent this idiosyncratic mechanism:
Black foot,N+Human

which is equivalent to considering the sequence Black foot as a multiword unit, i.e., as a common element of the vocabulary,6 an ALU. But many cases are not as clear-cut as these simple examples: how must an NLP software application process “business card,” “shopping center,” or “budget amendment”? In other words, can an NLP software application compute all the properties of these sequences to produce the correct answer to a query, their proper RDF semantic representation, their accurate translation, etc.? Or should one process these sequences as ALUs, to access their properties directly from a dictionary? Before answering these questions, we first describe and criticize the approach usually followed in NLP to recognize, represent, and process these word sequences.
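As a rough illustration of this dictionary-lookup mechanism, here is a minimal Python sketch; the toy dictionaries and the semantic_class function are ours, purely for illustration, and are not an actual English resource:

# Minimal sketch (illustrative only): by default, the semantic class of a noun
# phrase is inherited from its head noun; a multiword unit listed in the
# dictionary (an ALU) overrides that computation. Toy dictionaries only.
NOUNS = {"foot": "+BodyPart", "collar": "+Concrete", "card": "+Concrete"}
MULTIWORD_UNITS = {"black foot": "+Human", "blue collar": "+Human"}

def semantic_class(noun_phrase: str) -> str:
    words = noun_phrase.lower().split()
    # Longest-match lookup of multiword units first (idiosyncratic ALUs).
    for length in range(len(words), 1, -1):
        candidate = " ".join(words[-length:])
        if candidate in MULTIWORD_UNITS:
            return MULTIWORD_UNITS[candidate]
    # Otherwise, inherit the semantic class from the head noun.
    return NOUNS.get(words[-1], "+Unknown")

print(semantic_class("a small foot"))   # +BodyPart, inherited from the head noun
print(semantic_class("a Black foot"))   # +Human, accessed directly as an ALU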

2 The Corpus-Based Approach and Collocations

Most researchers in Corpus linguistics and NLP separate multiword units into two classes:7
– Idioms, such as black foot, whose meaning cannot be understood from the words black and foot.
– Collocations, compositional phrasemes, or phraseological collocations, such as shopping center, that can be understood from the words that make them up8 but are constituted by wordforms that occur together more often than expected by chance.
We will discuss in Sect. 3 what understanding the meaning of these word sequences involves and show that most of these collocations are, in fact, multiword units, i.e., plain ALUs. This section focuses on the methodology used in Corpus linguistics and NLP to treat these objects.

contains the lexeme matching <N>; $N$Properties represents all $N’s properties. The “/” character associates an input (here: <DET> <A> <N>) with an output that represents the result of the analysis.
5 A member of a North American Indian tribe, or a French person living in Algeria during the colonization.
6 The spelling variant Blackfoot is an entry in the Encyclopedia Britannica.
7 See for instance (Gledhill and Frath 2005).
8 We will discuss in Sect. 3 what semantic computation the understanding of these word sequences involves and show that most of these word sequences are, in fact, multiword units.


Indeed, if collocations are defined by the fact that they are constituted by wordforms that occur together more often than expected by chance, it is reasonable to try to identify and recognize them by applying statistical tools to large corpora of texts. Looking in the Corpus of Contemporary American English (COCA),9 we get the following ten most frequent digrams:

Rank  Count  Digram
1     9166   of the
2     7349   in the
3     7009   n't
4     3970   to the
5     3210   on the
6     2549   to be
7     2549   and the
8     2342   for the
9     2142   at the
10    1956   in a

These digrams consist of grammatical words, are not multiword units, and do not carry meaningful information. The most frequent digram consisting of non-grammatical words (i.e., an adjective and a noun) is “New York” (rank: 61, count: 732), but most of its occurrences are, in fact, substrings of larger multiword units, e.g., Bank of New York, New York City, New York Daily News, New York Giants, New York Jets, New York Police Department, New York Post, New York Rangers, New York State, New York Times, New York University, New York Yankees, University of New York. These more explicit sequences are the units that should be considered as the processing units, not “New York.” “New York” frequently occurs, not because it is a multiword unit but because it is a component of many multiword units. By applying the DELAC dictionary10 of multiword units to the corpus, we get the following ten most frequent multiword units:

Rank  Count  Multiword unit
1     1042   out of (Preposition)
2     557    kind of (Adverb)
3     553    that is (Conjunction)
4     501    a lot of (Determiner)
5     526    a little (Adverb)
6     464    at least (Adverb)
7     448    of course (Adverb)
8     326    as if (Conjunction)
9     322    according to (Preposition)
10    296    in fact (Adverb)

9 Our corpus consists of 116 files available from the COCA corpus: w_acad_*, w_fic_*, w_mag_*, w_news_*, w_spok_*, and contains 4,603,154 occurrences of wordforms. Silberztein (2018) describes other statistical experiments performed by applying various dictionaries to the Open American National Corpus (OANC). OANC’s Slate sub-corpus latest version, available in April 2023, contains 4,887,276 occurrences of wordforms.
10 The first version of the DELAC dictionary included in NooJ was developed by Chrobot et al. (1999).

The most frequent multiword unit (the preposition out of) has a count of 1042, which corresponds to the digram with rank #33; in other words, before identifying it as a phraseological collocation, one must explain why 32 more frequent collocations should be rejected. If we look for the most frequent compound nouns, we get the following ten most frequent nouns:

Rank  Count  Noun
1     167    high school
2     102    health care
3     95     little bit
4     90     White House
5     90     vice president
6     80     social studies
7     75     academic freedom
8     72     general manager
9     70     prime minister
10    67     Middle East

The most frequent compound noun, high school, occurs 170 times,11 i.e., rank #592 among digrams. Clearly, multiword units do not occur more often than digrams and cannot be characterized simply by their frequency. More sophisticated statistical criteria have been designed to try to distinguish multiword units and collocations: Point-wise Mutual Information, T-Test, Chi-Squared Test, etc., but none of them produces reliable criteria.12 For example, the Point-wise Mutual Information index compares the probability of the two wordforms high and school to occur together if they are unrelated, with the actual frequency of the sequence high school:

11 For proper counting, one should add to these 167 occurrences the occurrences of the plural form high schools and remove the occurrences that are substrings of more complete compounds such as junior high school(s). These corrections would require linguistic analyses, which is what statistical criteria are designed to avoid.
12 For example, Lambert (2004) shows that the “collocations” produced by these measures cannot be used reliably for NLP software applications that need to perform tasks similar to question answering.


MI = P(high school) / (P(high) × P(school)) = 170 / (910 × 816) × 4,603,154 = 1054

This high index could be interpreted as a sign that these two wordforms constitute a multiword unit, as, for instance, the MI measure for a frequent digram such as could have is much lower:

MI = P(could have) / (P(could) × P(have)) = 9166 / (41,984 × 90,851) × 4,603,154 = 11

but this is expected, as one cannot compare the frequency of prepositions and determiners with the frequency of adjectives and nouns (grammatical words do not have the same behavior as meaningful terms). Now let us compute the MI measure for a digram constituted by two nouns, such as freshman Georgia:

MI = P(freshman Georgia) / (P(freshman) × P(Georgia)) = 1 / (33 × 98) × 4,603,154 = 1423

MI(freshman Georgia) is higher than MI(high school), even though the two wordforms freshman and Georgia do not constitute a multiword unit. The COCA corpus contains 12,657 different forms of recognized multiword units, including 6030 hapaxes that occur only once, making them useless for any statistical method. Thus, any statistical software that would mine the COCA corpus to extract its multiword units would hope to recognize, at best, 6627 of the 136,482 lexical entries, i.e., less than 5% of the standard English vocabulary. More generally, one should challenge the principle of using frequency measurements to define what the elements of a vocabulary are. After all, no one questions that abstention or absolutism are legitimate English words, even though they do not occur in the COCA, and the fact that musculature and superficially appear only once does not make them lesser English words than bone or deeply. In the same way, there are multiword units that occur in the corpus frequently (e.g., high school), others that occur only once (e.g., white flag, wet suit, word processor), and some that do not occur even once in the corpus (e.g., abandoned ship, absolute zero, academic program). There is no reason to infer that absolute zero is less of a multiword unit than high school. In conclusion, the frequency of a sequence of wordforms cannot be used to decide if it corresponds to a multiword unit. Determining whether a sequence of graphical wordforms should be analyzed or lexicalized must be based on some linguistic criterion. We now present the set of three criteria used to draw the line between word sequences whose properties can be computed using grammar rules and word sequences whose properties must be described in dictionaries: Semantic Atomicity, Term Usage, and Idiosyncratic Transformational Analysis. This set of criteria is arbitrary, but it is reproducible. It has allowed linguists who use the NooJ platform to


work together using an accumulative methodology and to construct large-coverage dictionaries of multiword units for over 30 natural languages.13 More importantly, these criteria are designed to be well adapted to most NLP applications, including Information Retrieval, paraphrase generators, and Machine Translation.
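For readers who want to reproduce the computation discussed above, here is a minimal Python sketch of the point-wise mutual information index; the counts are those reported in this section, and the mi function itself is only an illustration:

# Minimal sketch (illustrative only): the point-wise mutual information index
# computed from raw corpus counts, with N the total number of wordform
# occurrences in the corpus.
def mi(count_xy: int, count_x: int, count_y: int, n: int) -> float:
    # MI = P(xy) / (P(x) * P(y)), with P(w) = count(w) / N
    return (count_xy / n) / ((count_x / n) * (count_y / n))

N = 4_603_154
print(round(mi(170, 910, 816, N)))         # high school      -> 1054
print(round(mi(9166, 41_984, 90_851, N)))  # could have       -> 11
print(round(mi(1, 33, 98, N)))             # freshman Georgia -> 1423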

3 Semantic Atomicity

If a software application cannot compute all the semantic properties of a word sequence it needs from its constituents, then it needs to access these properties in a dictionary, i.e., this sequence must be treated as an ALU. Consider the sentence “The company expects its blue collars to go on strike.” Because the verbal expression “to go on strike” expects a subject in the distributional class +Human, the parser needs to associate the word sequence “its blue collars” with this property. However, if it tries to compute the distributional property of the noun phrase “its blue collars” from the properties that are associated with its constituents, it will give the noun phrase the semantic class of its head noun “collar,” which is described as +Concrete in the English dictionary, rather than +Human. For the NLP software to correctly analyze this noun phrase as +Human, it thus needs to deactivate its usual semantic computation and instead access its semantic property +Human directly. This can be done by a simple dictionary lookup in which “blue collar” is associated with the semantic feature +Human. In other words, “blue collar” is treated as an ALU. This is not to say that there is no semantic relationship between manual workers, the blue color, and the piece of cloth: many native English locutors know why certain manual workers have been traditionally referred to as blue collars. However, their understanding not only involves knowing what the blue color is and what piece of cloth a collar is: they also know that some workers must wear uniforms during work, that uniform colors used to be meaningful, that blue uniforms were worn by manual workers, etc. In conclusion, one must consider “blue collar” as an ALU, not because its meaning is unrelated to the meanings of its constituents, but because the meanings of its constituents are not the only ones that need to be brought into play to fully comprehend it.

13 NooJ is a free, open-source linguistic development environment, see www.nooj4nlp.org. See Silberztein (2016a) for a presentation of NooJ’s theoretical and methodological bases. See Silberztein (1990) for the presentation of the DELAC dictionary of French multiword units: the first available large-coverage electronic dictionary of multiword units specifically designed for NLP applications. Silberztein (1993) presents the set of criteria used to define French compound nouns. See Chrobot et al. (1999) for a description of the first version of a DELAC dictionary for English compound nouns included in NooJ and Machonis (2012) for its dictionary of English Phrasal Verbs. Dictionaries for other languages have been constructed with NooJ, e.g., see Chadjipapa et al. (2010) for a dictionary of Greek multiword units and Najar (2016) for a dictionary of Arabic multiword units. Linguistic Resources for over 30 languages are available at: https://nooj.univ-fcomte.fr/resources.html


In the same manner, when analyzing the following word sequences: a business card, a credit card, a memory card

one might think that they are semantically analyzable: a business card is indeed a card that has something to do with business; a credit card is a card used in relation to credit; and a memory card is indeed a card used in relation to some memory. However, these “analyses” would not indicate what these cards are used for or what synonyms or expansions one should expect (e.g., credit card =: Mastercard, Visa card, American Express card). A more useful description would be:
– a business card is a card used to exchange persons’ professional contact information.
– a credit card is a magnetized plastic card used for payment.
– a memory card is an electronic chip used to save or move computer files.
These analyses are all different and cannot be inferred by any linguistic method: no rule in English grammar could stop a business card from referring to a means of payment used, for example, in certain business centers; no English grammar rule could stop the term credit card from referring to an ID card used by people to get credit; no grammar rule could stop a memory card from referring to a playing card used in a Memory game, etc. Any NLP software needs to deactivate the potential incorrect analyses of these sequences and instead access their description directly from a dictionary to process these word sequences correctly. Note that these word sequences must be processed differently from the following ones:
a big card, a plastic card, an interesting card

in which the kind of card mentioned is undefined, because their adjectives do not force any particular interpretation. In conclusion: as certain word sequences are associated with some idiosyncratic semantic properties, NLP software must deactivate any computation that would produce other potential but incorrect analyses and access the correct properties directly, i.e., in a dictionary, which is equivalent to treating these sequences as ALUs.

4 Term Usage

Even when the semantic properties of a word sequence can be correctly computed from its constituents, NLP applications that need to generate texts from some semantic representation, Automatic Summarization applications, or Machine Translation software that generates texts in a target language must produce natural word sequences, i.e., the ones that are usually used by locutors. For example, consider the term washing machine. Its meaning is transparent, and any English learner will have no difficulty understanding what a washing machine


is if they know the meaning of the verb to wash and the noun machine. However, in theory, English grammar should allow us to refer to this object using expressions such as a linen-washer, a clothes-cleaning device, a personal cleaner, a soap-and-rinse automatic apparatus, a textile centrifuge, a deterging appliance, etc. But in reality, no English locutor will ever use these word sequences: everyone uses the term washing machine. Therefore, an NLP application that would produce the resulting sequence “textile centrifuge” instead of the term “washing machine” will not be satisfactory. More generally, consider the following pairs of term/expression:

a professional musician     a musical worker
a shopping center           a walk-in purchase building village
a hurricane lamp            an emergency portable electrical lighting gadget
a word processor            a piece of textual writing software

Any NLG software must produce the terms in the first column rather than the semantically equivalent word sequences in the second column, which implies that it distinguishes terms from the infinite number of potential paraphrases. Many concepts and objects have a term in one language and not in another one.14 For example, the French term “livre de chevet” can only be roughly translated into English as “a book kept at the bedside that helps people fall asleep.” Conversely, there is no term in French for “coffee-table books”: a French person will describe them as “Beaux livres que certains présentent sur leur table basse” [Beautiful books that some people present on their coffee tables], but that is not a French term. Many concepts and objects do not even have a term. For example, there is no term for shirts that display forests (*forest shirt), even though the term flowery shirt exists. There is no such term as a *door poster even though there are wall posters, etc. Describing the vocabulary of a natural language necessarily requires distinguishing terms such as washing machine from non-terms such as “textile centrifuge”. Any NLP application that generates texts must produce the terms, even if it could generate semantically equivalent expressions. The only way to distinguish terms from free word sequences is to list and describe them in a dictionary, which, in effect, is equivalent to treating them as ALUs.

5 Idiosyncratic Transformational Analyses

Formalizing the vocabulary of a natural language does not constitute a goal in itself: linguists need it to analyze the sentences that are constructed from its components. The most ambitious NLP project is to construct a system capable of automatically analyzing the meaning expressed by each sentence, producing its semantic

14 See (Rheingold 2000).

representation. There are many frameworks that aim at representing meaning formally: Lambda Calculus; First-Order Logic formalisms such as Prolog, e.g.:

Eve gave an apple to Adam ⟶ GIVE(Eve, apple, Adam)

Semantic Web XML, RDF, or Turtle notations, e.g.: New York has the postal abbreviation NY ⟶ "NY".

etc. In the following discussion, I will use the transformational grammar framework, in which the meaning of a complex sentence is represented by a series of elementary sentences.15 For example:

Joe’s cousin promised him her book ⟶ Joe has a cousin; the cousin has a book; sometime in the past, this cousin promised Joe something: that promise is that Joe’s cousin would offer Joe this book.

Before constructing a system capable of parsing such complex sentences, one should start by implementing a system that analyzes more straightforward word sequences.16 For example, NLP software should be able to link word sequences of the structure Noun Noun to their equivalent structure Noun Preposition Determiner Noun, as follows:

Noun Noun → Noun Preposition Determiner Noun
a border dispute → a dispute about a border
a tax fraud → a fraud against the tax (law)
a hospital stay → a stay at a hospital
the media coverage → the coverage by the media
an election campaign → a campaign during the election
a shock therapy → a therapy consisting of (electric) shocks
an acquisition agreement → an agreement on an acquisition
a budget amendment → an amendment to the budget
a computer fraud → a fraud using a computer

Such a system would need to contain hundreds of possible transformation rules, since numerous potential combinations of prepositions (about, against, at, etc.) and determiners (empty, definite, generic, or indefinite) exist. But how could such a system know when to prevent these productive rules from being applied to the wrong word sequences?

a border dispute → *a dispute (before, behind, below, by . . .) a border
a tax fraud → *a fraud by the tax (law)
a hospital stay → *a stay using a hospital

15 By elementary sentence, I mean a sentence that contains only one predicate and its arguments, with no modifiers.
16 See Silberztein (2016b) for an implementation of an automatic transformational analysis of the elementary sentence Joe loves Lea, which produces several million sentences such as “It is not Joe who suddenly fell in love with her”.

There are hundreds of other transformational analyses that will need to be implemented as well,17 e.g.:

Noun1 Noun2 → Noun1 Verb2 =: a student protest → students protest
Noun1 Noun2 → Someone Verb2 Noun1 =: a heart transplant → someone transplants a heart
Adjective1 Noun2 → Someone Verb2 Adverb1 =: an accidental death → someone dies accidentally
Noun1 Noun2 → Noun2 Verb1 Noun3 =: a compensatory amount → an amount (of money) compensates (someone) (for something)
Noun1 Noun2 → Noun2 is Location1 =: a brain injury → an injury is in the brain
...

Once the full set of transformational rules for English has been implemented, it will be possible to find regularities that will help implement an automatic system that can analyze these word sequences correctly. Until then, the only workable solution is to list each word sequence whose analysis is irregular or exceptional and associate it with its correct analysis, which is equivalent to processing it as an ALU.
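To make the point concrete, here is a minimal, hypothetical sketch, in Python, of what “listing each irregular word sequence with its correct analysis” might look like; it is not code from this chapter and not a NooJ resource, the entries simply reuse the examples above, and the names (ALU_LEXICON, analyze_noun_noun) are invented for illustration.

# A tiny ALU-style lexicon: each Noun-Noun sequence is listed with the
# analysis it actually licenses; nothing here is computed by a general rule.
ALU_LEXICON = {
    "border dispute": "a dispute about a border",
    "tax fraud": "a fraud against the tax (law)",
    "hospital stay": "a stay at a hospital",
    "student protest": "students protest",
    "heart transplant": "someone transplants a heart",
    "credit card": "a magnetized plastic card used for payment",
}

def analyze_noun_noun(sequence: str) -> str:
    """Return the listed analysis of a Noun-Noun sequence, or refuse to guess."""
    analysis = ALU_LEXICON.get(sequence.lower())
    if analysis is None:
        return f"'{sequence}': not listed; do not apply a productive rule blindly"
    return f"'{sequence}' -> {analysis}"

if __name__ == "__main__":
    for s in ["border dispute", "hospital stay", "memory card"]:
        print(analyze_noun_noun(s))

The point of the sketch is only that the correct preposition or predicate cannot be predicted from the Noun Noun structure itself: it has to be recorded entry by entry, exactly as it would be for any other ALU.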

6 Conclusion

For an NLP software application to process the infinite number of different sentences that can be expressed in a natural language, it needs to be able to analyze each sentence as a sequence of Atomic Linguistic Units combined according to a set of grammar rules. One crucial problem is then to separate word sequences that can be processed by grammar rules from word sequences that must be described directly in a dictionary and thus are treated as multiword units. We have presented three criteria that can be used to define multiword units: Semantic Atomicity, when a word sequence cannot be fully understood from the meaning of its constituents; Term Usage, when all locutors use a particular word sequence to name a concept or object; and Idiosyncratic Transformational Analyses, when a word sequence accepts certain transformation rules in an unpredictable way. These criteria are particularly well adapted to the construction of NLP software applications:
– Processing multiword units that are semantically atomic allows information retrieval systems to retrieve and expand concepts correctly; for example, link the multiword unit credit card with visa card, master card, etc., rather than link the simple word card with discount card, ID card, memory card, penalty card, playing card, etc.

17 Silberztein (1993) describes other transformations for French; Silberztein (2016a) describes other transformations for English.

– Accessing the exhaustive list of terms of a language allows any NLG application or Machine Translation system to generate the correct terms, e.g., generate washing machine rather than cloth cleaner.
– Accessing the list of transformational rules that can be applied to a particular word sequence allows an automatic paraphraser or semantic analyzer to produce its correct analyses, e.g., linking presidential election to People elect the president rather than The presidents elect someone.
These criteria are based neither on probabilistic nor on statistical measures, nor do they imply the recognition of collocations: multiword units can be frequent in a particular corpus of texts, or rare, or even occur only once, just like simple words; what defines them is their linguistic properties, not their frequency. This set of criteria, although arbitrary, is reproducible, which has allowed the community of NooJ users to construct large-coverage dictionaries for many languages together, in a consistent way.18

References Aoughlis, Farida: 2007. A Computer Science Electronic Dictionary for NooJ. Lecture Notes in Computer Science (LNCS) 4592, Springer Verlag 2007: 341-351 Chadjipapa Elina, Lena Papadopoulou and Zoe Gavriilidou, 2010. New data in the Greek NooJ module: Compounds and Proper Nouns. Applications of Finite-State Language Processing: Selected Papers from the NooJ 2008 International Conference (Budapest, Hungaria). Edited by Kuti Judit, Silberztein Max, Varadi Tamas. Cambridge Scholars Publishing, Newcastle., UK: 93-100 Chrobot Agata, Blandine Courtois, Mary Hammani-McCarty et al., 1999. Dictionnaire électronique DELAC anglais : noms composés. Technical Report 59, LADL, Université Paris 7, 1999. Gledhill, Christopher, Pierre Frath, 2005. Free-Range Clusters or Frozen Chunks? Reference as a defining criterion for linguistic units. Recherches Anglaises et Nord Américaines n°38. Kocijan Kristina, Krešimir Šojat and Silvia Kurolt, 2021. Multiword Expressions in the Medical Domain: Who Carries the Domain Specific Meaning. In Formalizing Natural Languages: Applications to Natural Language Processing and Digital Humanities. CCIS Series. Springer-Verlag: Berlin Heidelberg. Lambert, Benjamin, 2004. Statistical Identification of Collocations in Large Corpora for Information Retrieval. Available at: http://www.cs.cmu.edu/~belamber/pdf/CollocForIR.pdf Machonis, Peter A, 2012. Sorting NooJ out to take multiword expressions into account. In Automatic Processing of Various Levels of Linguistic Phenomena: Selected Papers from the NooJ 2011 International Conference, pp. 152-165. Newcastle upon Tyne: Cambridge Scholars Publishing. Najar Dhekra, Slim Mesfar, 2016. A large terminological dictionary of Arabic Compound Words. In Automatic Processing of Natural Language Electronic Texts with NooJ: Selected Papers from the International NooJ2015 Conference. Springer CCIS Series #607. Rheingold, Howard, 2000. They have a word for it: A lighthearted lexicon of untranslatable words & phrases. Sarabande Books, 2000.

18 Dictionaries for other languages, or for specialized technical vocabularies, are available for download at: https://nooj.univ-fcomte.fr/resources.html

Sag, Ivan A., Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger, 2002: Multiword expressions: A pain in the neck for NLP. In International conference on intelligent text processing and computational linguistics, pp. 1-15. Springer, Berlin, Heidelberg. Silberztein, Max, 1990. Le dictionnaire électronique des mots composes DELAC. In Courtois, Silbertein Eds. Dictionnaires électroniques du français, Larousse, Paris pp. 11-22. Silberztein, Max, 1993. Groupes nominaux libres et noms composés lexicalisés. In Linguisticae Investigationes, vol. XVII, no. 2, pp. 405–425. Silberztein, Max, 2016a. Formalizing Natural Languages: the NooJ approach. Wiley Eds., Hoboken: New Jersey. Silberztein, Max, 2016b. Joe loves Lea: Transformational Analysis of Transitive Sentences. in Formalising Natural Languages with NooJ (9th International NooJ conference, Minsk, Belarus 2015), CCIS Series. Springer Verlag: Heidelberg. Silberztein, Max, 2018. Using linguistic resources to evaluate the quality of annotated corpora. In Proceedings of the LR4NLP Workshop at COLING2018. Available at: http://www.aclweb.org/ anthology/W18-38

A Linguistic Approach to English Phrasal Verbs

Peter A. Machonis

Abstract This presentation shows how a lexicon-grammar dictionary of English phrasal verbs (PV) can be transformed into an electronic dictionary, in order to accurately identify PV in large corpora within the linguistic development environment NooJ. The NooJ program is an alternative to statistical methods commonly used in NLP: all PV are listed in a dictionary and then located by means of a PV grammar in both continuous and discontinuous format. Results are then refined with a series of dictionaries, disambiguating grammars, filters, and other linguistic resources. The main advantage of such a program is that all PV can be identified, not just collocations of higher-than-normal frequency.

Keywords English phrasal verbs · Corpus linguistics · Lexicon-grammar · NooJ

1 Introduction

Maurice Gross first laid the groundwork of what would become the theory of lexicon-grammar in 1968 in his Grammaire transformationnelle du français: Syntaxe du verbe. Following the theoretical framework of Zellig Harris (1956) where transformations are simply “equivalence relations among sentences or certain constituents of sentences,” Gross added the extra element of incorporating a significant coverage of the language, maintaining that there could be no theory without an associated accumulation of data. During the 1970s, linguists at the Laboratoire d’Automatique Documentaire et Linguistique (LADL) at the Université de Paris 7 were building practical large-scale formal classifications for French. For example, each French verb at the time was discussed by a team of linguists and marked either as plus (+) or minus (-) for every relevant transformation. This exhaustive process revealed many classes of French verbs (e.g., Boons et al. 1976), showed the

P. A. Machonis (✉) Department of Modern Languages, Florida International University, Miami, FL, USA e-mail: machonis@fiu.edu © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 M. Silberztein (ed.), Linguistic Resources for Natural Language Processing, https://doi.org/10.1007/978-3-031-43811-0_10

enormous complexity of language, but also challenged the Chomskian model (Gross 1979). In the early 1980s, when Maurice Gross began systematically classifying French idiomatic expressions, linguists started to realize that rather than being exceptions, idioms represented a basic and widespread linguistic phenomenon. Based on systematic studies of the French lexicon, Gross (1994:233) claimed that there were 12,000 free sentences as opposed to 20,000 frozen ones. By the end of the twentieth century, idioms played an important role in many syntactic theories (e.g., Everaert et al. 1995). “There are too many idioms and other fixed expressions for us to simply disregard them as phenomena ‘on the margin of language,’” said Jackendoff (1997: 177), who extends the meaning of idioms to include Wheel of Fortune puzzles, where we find not only idioms per se, but compounds, famous names, clichés, and song titles. Research in Natural Language Processing (Sag et al. 2002) has also confirmed not only the scale and the importance of idioms, but also the difficulties involved in their identification. Here, I focus on a particularly difficult idiom of English, the phrasal verb (PV), using a lexicon-grammar framework, along with the linguistic development environment NooJ for automatic recognition in large corpora. Our corpora consist of a good number of nineteenth century British and American novels, including the complete works of Charles Dickens and Herman Melville, as well as a contemporary oral corpus consisting of 25 transcribed Larry King Live programs from January 2000.

2 English Phrasal Verbs: Particle vs. Preposition

Although many “phrasal verb lists” include syntactically different multiword expressions, we use the linguistic definition for English PV. They involve the combination of a verb with a particle (usually a preposition or adverb) that behaves as one syntactic and semantic unit. They are also known as two-word verbs, multiword verbs, or verb-particle combinations, which can be expressed in either the continuous or the discontinuous form:
(1) a. Max looked up the telephone number [= find in a list, etc.]
    b. Max looked the telephone number up
(2) a. Ann turned on the computer [= start]
    b. Ann turned the computer on
We distinguish PV from simple verb plus prepositional phrase constructions, where the prepositional phrase acts as one syntactic unit separate from the verb and which cannot be transposed as in (1) and (2):
(3) a. Max looked up Mary’s dress [= look + prepositional phrase]
    b. *Max looked Mary’s dress up

(4) a. The bus driver turned on Main Street [= turn + prepositional phrase]
    b. *The bus driver turned Main Street on
Particles and prepositions are indeed a very problematic area for Natural Language Processing. As Talmy (1985: 105), who refers to particles as satellites, states: “a problem arises for English which, perhaps alone among Indo-European languages, has come to regularly position satellite and preposition next to each other in a sentence.”

2.1 Further Distinctions: Prepositional and Phrasal Prepositional Verbs

Furthermore, PV can be differentiated from prepositional verbs and phrasal prepositional verbs, although both of these are included in most PV dictionaries, as well as under the rubric “phrasal verbs” for many researchers in NLP. Prepositional verbs are multiword verbs that function as one semantic unit, but do not allow movement of the preposition, such as:
(5) a. Women make up half of the applicants [= comprise]
    b. *Women make half of the applicants up
(6) a. Max called on his neighbor [= visit]
    b. *Max called his neighbor on
Moreover, we also find what are called phrasal prepositional verbs, which appear to be followed by a particle and a preposition, but likewise function as one semantic unit:
(7) a. The students looked up to the teacher [= admire]
    b. *The students looked the teacher up (E + to)
(8) a. His kindness made up for his mean remarks [= compensate]
    b. *His kindness made his mean remarks up (E + for)
However, many “phrasal verb lists” include all of these syntactically different multiword expressions. Furthermore, most researchers working in NLP treat all of them in the same way: if the verb and the following word frequently co-occur, they are tagged as an MWE. Cholakov and Kordoni (2014:198), who are working in statistical machine translation (SMT), are starting to look at encoding some of the linguistic properties of MWEs. One of their suggestions is to identify “whether a particle can be separated from the PV in particle verb constructions.” As an example, they compare fell off and turn on.
(9) a. He fell off his bike
    b. *He fell his bike off
(10) a. She turns on the engine
     b. *She turns the engine on

However, our examples (1)–(4) clearly illustrate the problem: turn on is a true PV, while fell off is simply a verb followed by a prepositional phrase. The verb fall could very well be followed by many other prepositions: fall from the mountain, fall on the sidewalk, fall across the ice, fall through the roof, etc. That is to say, all MWEs, or what some dictionaries claim to be “phrasal verbs”, are not the same linguistically speaking.

3 Lexicon-Grammar of Phrasal Verbs

Presently, the lexicon-grammar tables of English PV include an exhaustive list of transitive phrasal verbs followed by the particle up (over 700 entries), a substantial list of transitive phrasal verbs followed by the particles out (200 entries), down (100 entries) and off (90 entries), and a sampling of other particles, such as away, back, in, on, out, over, etc. A sample of the lexicon-grammar table of English PV with the particle up is displayed in Table 1. The first two columns represent potential subjects, or N0, which can be either Human or Non-human or both. This is followed by the verb and the particle (e.g., up) as well as an example of a direct object, N1, which is also classified by the properties of Human and Non-human.

Table 1 Sample lexicon-grammar table: phrasal verbs with the particle up

The next column, N0 V N1, takes into consideration cases where the verb can have a similar meaning even if the particle is not used, with a plus indicating that both the empty sequence (E) and the particle are possible, such as:
(11) a. The criminal beat (E + up) the child
     b. The chef boiled (E + up) some water
In the case of the particle up, optional particle usage occurs in 40% of the data, and thus it can be argued that these phrasal verbs should be considered compositional. Machonis (2009) showed that the compositional phrasal verbs with up keep their regular meaning, but the particle is viewed as an intensifier (beat up the child), an aspect marker (boot up the computer), or in some cases an adverbial noting direction (drive up prices). However, since our objective is to automatically identify all phrasal verbs in a corpus, we have chosen to list all phrasal verbs in one dictionary, rather than listing optional particle usage as part of the simple verb entry.
The next column, N1 V Part, considers neutral or ergative verbs, with a plus attesting to the fact that the verb has both a transitive and intransitive linked use, such as:
(12) a. The terrorists blew up the building
     b. The building blew up
Verbs that can undergo this alternation represent 25% of phrasal verbs with up. Consequently, although we have classified only transitive phrasal verbs, many intransitive phrasal verbs will also be considered in our lexicon-grammar tables due to this type of alternation. The last column of pluses and minuses, N1 V, indicates if a verb can be neutral even if the particle is not expressed:
(13) a. The cook was boiling up the potatoes
     b. The potatoes were boiling.
For example, the last entry in the sample Table 1 shows that the verb boot can undergo all three alternations - transitive use without particle, and neutral use with and without particle:
(14) a. Max booted (up + E) the computer
     b. The computer booted (up + E)
The very last column of the sample Table 1, Synonym, gives a paraphrase of the phrasal verb in order to highlight ambiguity, as in the four meanings associated with the expression blow up. These phrasal verb lexicon-grammar tables of the various English particles were subsequently combined to create a single NooJ phrasal verb dictionary.
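As an illustration only, and not the actual table format or the NooJ dictionary, one row of such a lexicon-grammar table could be represented roughly as follows in Python; the field names mirror the column labels discussed above, and the property values for boot up follow the description of example (14), while the subject properties and the synonym are placeholders I have supplied.

from dataclasses import dataclass

@dataclass
class PVEntry:
    """One lexicon-grammar row for a transitive phrasal verb (sketch only)."""
    n0_human: bool       # subject can be human
    n0_nonhuman: bool    # subject can be non-human
    verb: str
    particle: str
    example_n1: str      # sample direct object
    n0_v_n1: bool        # verb keeps a similar meaning without the particle
    n1_v_part: bool      # neutral/ergative use: "The computer booted up"
    n1_v: bool           # neutral use without the particle: "The computer booted"
    synonym: str         # paraphrase used to highlight ambiguity

boot_up = PVEntry(
    n0_human=True, n0_nonhuman=False,       # illustrative values, not from the table
    verb="boot", particle="up", example_n1="the computer",
    n0_v_n1=True, n1_v_part=True, n1_v=True,
    synonym="start (placeholder)",
)

print(boot_up)

Encoding the pluses and minuses as explicit fields like this is only meant to show how the table's information could be carried over, entry by entry, into an electronic dictionary.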

3.1 Using Lexicon-Grammar in Tandem with NooJ

The NooJ platform (Silberztein 2016) allows linguists to describe several levels of linguistic phenomena and then apply formalized descriptions to any corpus of texts. Instead of relying on a part of speech tagger that obligatorily produces a certain percentage of tagging mistakes, NooJ uses a Text Annotation Structure (TAS) that holds all unsolved ambiguities. Furthermore, these annotations can represent discontinuous linguistic units, such as phrasal verbs. Figure 1 is a sample of the NooJ PV Dictionary, which mirrors all the syntactic information contained within the highlighted area of the Lexicon-Grammar entry seen in Table 1. The NooJ PV Grammar in Fig. 2 works in tandem with the dictionary to annotate PV in large corpora. The NooJ functionality $THIS=$V$Part assures that a particular particle must be associated with a specific verb in the PV dictionary in order for it to be recognized as a PV. That is, NooJ only recognizes verb-particle combinations listed in the PV dictionary. It also guarantees that a noun phrase is between the verb and the corresponding particle in discontinuous constructions.

Fig. 1 NooJ PV Dictionary corresponding to Table 1

Fig. 2 NooJ PV grammar
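As a rough, hypothetical illustration of what the dictionary and grammar accomplish together, the following plain-Python sketch accepts a verb-particle pair only if it is listed, in either continuous or discontinuous form. It is not the NooJ resources shown in Figs. 1 and 2; the toy dictionary, the tag format, and the five-token window are assumptions made for the example.

# Toy PV dictionary: lemma -> particles listed for it (illustrative subset only).
PV_DICT = {"look": {"up"}, "turn": {"on"}, "figure": {"out"}}
PARTICLES = {"up", "on", "out", "off", "down"}

def find_pv(tokens):
    """tokens: list of (lemma, pos) pairs.
    Yields (verb, particle, continuous) for verb-particle pairs that are
    listed in PV_DICT, in continuous or discontinuous form (here we allow
    up to four intervening tokens for the noun phrase)."""
    for i, (lemma, pos) in enumerate(tokens):
        if pos != "V" or lemma not in PV_DICT:
            continue
        for j in range(i + 1, min(i + 6, len(tokens))):
            cand = tokens[j][0]
            if cand in PARTICLES:
                if cand in PV_DICT[lemma]:
                    yield lemma, cand, (j == i + 1)
                break  # stop at the first particle-like token either way

# "Max looked the telephone number up", lemmatized and tagged by hand
tokens = [("Max", "N"), ("look", "V"), ("the", "DET"),
          ("telephone", "N"), ("number", "N"), ("up", "PART")]
print(list(find_pv(tokens)))   # [('look', 'up', False)]

The essential constraint is the same one the NooJ grammar enforces: a particle is only annotated as part of a PV when the dictionary lists that particle for that specific verb.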

3.2 Accuracy of Discontinuous Phrasal Verbs

Torres-Martínez (2015:18) claims that “the COCA search mode for MWVs may yield false positives, especially when the number of intervening words is up to three.” In our search for PV in Dickens and Melville (Machonis 2021), however, NooJ correctly identified many discontinuous PV with up to six intervening wordforms (WF), with very few false positives:
(15) and laid her weary head down, weeping (Dombey and Son) [3 WFs]
(16) she drove her wheeled chair rapidly back (Little Dorrit) [4 WFs]
(17) Shall I put any of those little things up with mine? (The Pickwick Papers) [5 WFs]
(18) felt his empty sleeve all the way up, from the cuff, (Great Expectations) [6 WFs]

4 Removing False Phrasal Verbs Automatically

In addition to the PV grammar and dictionary that work in tandem to annotate PV in large corpora, three other types of algorithms help to remove noise (falsely identified PV): (1) disambiguation grammars, which examine the immediately preceding and following environments of potential PV, and eliminate nouns that are mistaken for verbs (e.g., take a run down to Spain ≠ run down, his hands still in his pockets ≠ hand in), prepositions that are identified as PV (what a comfort I take in it ≠ take in), and prepositions that introduce locative expressions (asked you in Rome); (2) adverbial and adjectival expression filters; and (3) idiom dictionaries that identify certain fixed expressions as “unambiguous” and thus cannot be given the TAS of PV (e.g., asked in a low tone ≠ ask in, put on one’s guard ≠ put on, take an interest in ≠ take in). The goal of these extra grammars and dictionaries is to remove potential noise without creating silence. More details on removing false PV automatically using NooJ are described in Machonis (2016), but let us examine three disambiguation grammars which are linguistically motivated. Each of the three disambiguation grammars specifies certain structures that are not to be assigned the PV annotation in the TAS. The specification under the relevant node in each grammar means “not a PV.” Some grammars furthermore specify that the particle must in fact be a preposition by listing that category below the particle in the graph.

4.1 PV Disambiguation Grammar 1: Environment to Right of “PV”

The first disambiguation grammar (Fig. 3) examines the environment to the right of a candidate PV string. This syntactically motivated grammar states that if the PV occurs with a pronoun object, the PV must be in the discontinuous format (e.g., figure it out, look him up, take them away). Thus, if an object pronoun follows a supposed particle, it must be a preposition, as in the following: what sort of pressure is put on them back in Cuba. The first disambiguation grammar specifies that this instance of put on (e.g., put on my T-shirt) is not a PV. The PV put on is very common in our oral corpus (e.g., put on nine pounds, put on my wedding dress, put on a prayer shawl, put my jeans on), yet shows enormous potential for overlapping with prepositional phrases. The pronoun her, since it is also the possessive adjective her, is not included with the other pronouns since it would produce too much silence (e.g., locked up her apartment, pulling off her gloves, took up her parasol, etc.). But if followed by punctuation, then we can be sure that her is not an adjective, but a pronoun as in the following, and should therefore not be identified as a PV:
(19) showing the same respectful interest in Isabel’s affairs that Isabel was so good to take in hers.
(20) rising and bending over her, as she rose from the bench.
Finally, using the NooJ functionality +EXCLUDE, this graph does not remove good PV, as in the following where the pronoun introduces another sentence (21) or is part of a that clause (with that deleted) followed by the verb be (22). Both would be excluded from the automatic PV removal process, thus not creating silence:
(21) but from what I can make out you’re not embarrassed
(22) pundits [...] pointed out it was Bradley doing poorly

Fig. 3 PV Disambiguation Grammar 1
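A minimal, hypothetical Python rendering of the rule just described might look like the sketch below. It is not the NooJ graph in Fig. 3: the pronoun list, the punctuation test for her, and the function name are simplifying assumptions, and the +EXCLUDE exceptions of (21) and (22) are not modeled.

OBJECT_PRONOUNS = {"me", "you", "him", "it", "us", "them"}   # "her" deliberately omitted

def keep_pv_annotation(particle_index, tokens):
    """Return False (i.e., remove the PV annotation) when an object pronoun
    immediately follows the supposed particle, since the particle must then
    be a preposition ("pressure is put on them back in Cuba")."""
    nxt = particle_index + 1
    if nxt >= len(tokens):
        return True
    word = tokens[nxt].lower()
    if word in OBJECT_PRONOUNS:
        return False
    # "her" only counts as a pronoun here when punctuation follows it
    if word == "her" and nxt + 1 < len(tokens) and tokens[nxt + 1] in {".", ",", ";"}:
        return False
    return True

tokens = "what sort of pressure is put on them back in Cuba".split()
print(keep_pv_annotation(tokens.index("on"), tokens))   # False: not a PV here

The same kind of local check, looking left instead of right, underlies the second and third grammars described below.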

4.2 PV Disambiguation Grammar 2: Environment to Left of “PV”

The second disambiguation grammar (Fig. 4) aims to identify verbs that are obviously nouns by examining the environment to the left of a supposed PV. Again, NooJ can automatically remove the PV status in the TAS if the preceding linguistic environment justifies it. Essentially, if a determiner or adjective appears immediately before the supposed PV, then the disambiguation grammar correctly assumes that it is a noun and removes the PV status from the TAS. This grammar successfully eliminates much noise derived from PV that overlap with nouns, such as break in, check out, cheer up, figure out, hand in, head up, play out, time in, etc.

4.3 PV Disambiguation Grammar 3: Locative Environment to Right of “PV”

Our third disambiguation grammar (Fig. 5) examines the environment to the right of a supposed “PV,” but specifically focuses on prepositions introducing locative prepositional phrases that are clearly not part of a PV. This disambiguation grammar relies on a supplemental dictionary of locative nouns, NLoc. This dictionary contains some frequent locatives found in our corpora, such as brewery yard, church, city, garden, library, sitting-room, street, yard, etc., as well as place names such as America, London, Paris, Rome—these nouns are all marked as N+Loc. This graph eliminates the PV status in the TAS of many incorrectly identified PV, mostly involving the prepositions in and off. For example, the top part of the graph removes the PV TAS in continuous PV such as: to be asked in church, while the

Fig. 4 PV Disambiguation Grammar 2

Fig. 5 PV Disambiguation Grammar 3

bottom portion of the graph (i.e., the NP and ADV loop) does the same for discontinuous PV like: was asked three times in Italy. However, the +EXCLUDE path will not create silence by removing the status of genuine PV with the particle up such as: clean up New York, shut up her house, etc.

5 Conclusion

The NooJ PV dictionary and grammar are great resources for identifying a difficult, yet characteristic feature of the English language. While PV are indeed a “pain in the neck for NLP,” what we have described is a linguistically accurate way of identifying them in large corpora, while automatically removing as much noise as possible based on syntactic criteria. In fact, many researchers are realizing the importance of linguistic criteria. As Cholakov and Kordoni (2014:200) write, “the addition of linguistically informative features to a phrase-based SMT model improves the translation quality of a particular type of MWEs, namely phrasal verbs.”

References Boons, Jean-Paul, Alain Guillet, Christian Leclère, 1976. La structure des phrases simples en français: Constructions intransitives. Geneva: Droz. Cholakov Kostadin, Valia Kordoni, 2014. Better Statistical Machine Translation through Linguistic Treatment of Phrasal Verbs. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 196–201. Doha, Qatar: Association for Computational Linguistics. Everaert, Martin, Erik-Jan van der Linden, André Schenk, Rob Schreuder (eds.), 1995. Idioms: Structural and Psychological Perspectives. Hillsdale, N.J.: Erlbaum. Gross, Maurice, 1979. On the failure of generative grammar. Language 55(4): 859-885. Gross, Maurice, 1994. Constructing Lexicon-Grammars. In Atkins and Zampolli (eds.) Computational Approaches to the Lexicon, 213-263. Oxford: Oxford University Press.

Harris, Zellig S., 1956. Introduction to transformations. In Papers in Structural and Transformational Linguistics (1970): 383-389. Dordrecht-Holland: D. Reidel Publishing Co. Jackendoff, Ray, 1997. The Architecture of the Language Faculty. Cambridge, MA: The MIT Press. Machonis, Peter A., 2009. Compositional phrasal verbs with up: Direction, aspect, intensity. Lingvisticae Investigationes 32.2: 253-264. Machonis, Peter A., 2016. Phrasal Verb Disambiguating Grammars: Cutting Out Noise Automatically. In L. Barone, M. Monteleone, M. Silberztein (eds.), Automatic Processing of NaturalLanguage Electronic Texts with NooJ. CCIS, vol. 667: 169-181. Cham, Switzerland: Springer. Machonis, Peter A., 2021. Where the Dickens are Melville’s Phrasal Verbs? In. B. Bekavac, K. Kocijan, M. Silberztein, K. Šojat K. (eds.), Formalizing Natural Languages: Applications to Natural Language Processing and Digital Humanities CCIS, vol. 1389, 99-110. Cham, Switzerland: Springer. Sag, Ivan A., Timothy Baldwin, Francis Bond, Ann Copestake, Dan Flickinger, 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, 1-15. Mexico City: CICLING. Silberztein, Max, 2016. Formalizing Natural Languages: The NooJ Approach. London: Wiley-ISTE. Talmy, Leonard, 1985. Lexicalization patterns: Semantic structure. In Timothy Shopen (ed.), Lexical Forms in Language Typology and Syntactic Description, 57-149. New York: Cambridge University Press. Torres-Martínez S., 2015. A constructionist approach to the teaching of phrasal verbs: Dispelling the verb particle myth in multiword verb instruction. English Today, 31(3): 46-58.

Analysis of Indonesian Multiword Expressions: Linguistic vs Data-Driven Approach

Prihantoro

Abstract Tagging systems developed using a data-driven approach are often considered superior to those produced using a linguistic approach [Brill (A Simple Rule-Based Part of Speech Tagger. Applied Natural Language Processing Conference, 1992, p.152)]. The creation of dictionaries and grammars (resources typically used in a linguistic approach) is considered costly compared to the creation of a training corpus (a resource typically used in a data-driven approach) [Silberztein (Formalizing Natural Languages: The NooJ Approach, 2016, p.22)]. In this contribution, I argue that such a view needs to be reconsidered. Focusing on MWE, I will show that some data-driven systems which rely on training corpora may produce inaccurate results, leading to incorrect automatic POS tagging, syntactic parsing and machine translation. I also show that such errors can be prevented using dictionaries and grammars for systems developed using a linguistic approach, which is principally in line with Silberztein’s (Formalizing Natural Languages: The NooJ Approach, 2016) view.

Keywords Data-driven · Linguistic · Multiword expression · Annotation · Translation · Indonesian language

1 Introduction

There are cases where a sequence of words may function as a single unit of meaning, as reported by many scholars (Baldwin and Kim 2010; Calzolari et al. 2002; Masini 2019; Silberztein 2016). See the examples below (obtained from BNC XML at CQPweb Lancaster1 (Hardie 2012)), in which such sequences are underlined:

1 https://cqpweb.lancs.ac.uk/

Prihantoro (✉) Universitas Diponegoro, Semarang, Indonesia e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 M. Silberztein (ed.), Linguistic Resources for Natural Language Processing, https://doi.org/10.1007/978-3-031-43811-0_11

(1) I decided to give up my London flat.
(2) Will you bury the hatchet for a while and dance with me?
(3) But by and large these are only examples of the gesture.
The underlined sequences in these examples are well-known expressions for English speakers, where the meaning of each word is not always semantically compositional with respect to the meaning/function of the sequence to which it belongs. These sequences are often referred to as Multiword Expressions (MWEs). How MWEs are processed by NLP systems (POS taggers, syntactic parsers, machine translation, QnA systems, among many others) may determine their success. As for this contribution, approaches to building an NLP system can be distinguished into two types: empirical or data-driven2 (such as statistical- and neural-network-based) and linguistic (often called rule-based), following Voutilainen (1999, p.9). A number of scholars differ slightly when it comes to defining MWEs. Calzolari et al. (2002) refer to an MWE as a sequence of words that acts as a single unit at some level of linguistic analysis. This definition is very similar to the MWE concept I offered at the beginning of this section. Carpuat and Diab (2010) incorporate quantitative approaches to MWE; they define an MWE as a multiword unit or a collocation of words that co-occur together statistically more than by chance. Another definition emphasizes idiosyncrasies across word boundaries, as Sag et al. (2002) propose. Baldwin and Kim (2010) hold that MWEs are lexical items that: (a) can be decomposed into multiple lexemes; and (b) display lexical, syntactic, semantic, pragmatic and statistical idiomaticity. This definition incorporates both linguistic non-compositionality and frequent co-occurrence. MWEs can be categorized still further using different schemes. Constant et al. (2017, pp.840–841) propose a non-exhaustive categorization of MWEs. Their categories are: idiom (e.g. to kick the bucket), light verb construction (e.g. to take a shower), verb-particle structure (to give up), compound, whether or not a space separates each component (e.g. dry run, banknote), complex function words (e.g. as soon as), named entity (e.g. International Business Machines) and multiword term used in a specific subject field (e.g. short-term scientific mission). From the above subcategories, one of the most common in existing annotation schemes is the compound. Consider the data below, in which customer service is analyzed as a compound using the Universal Dependencies (UD) scheme (McDonald et al. 2013; Nivre et al. 2017). See lines 4 and 5 (Table 1). As for this presentation, in terms of definition, we will not discuss MWEs from frequency aspects, like some scholars mentioned earlier. Silberztein (2016, p.71) argues that being highly dependent on frequency to define a vocabulary element (simple words or MWE) can be misleading. For instance, in a data-driven approach, the task of compiling MWEs typically starts with extracting n-grams from a large corpus. See Suhardijanto (2020, p.562). However, frequent co-occurrences

2 Can be implemented by adopting certain models/techniques/algorithms, such as the Markov model, neural networks, or Viterbi decoding, among many others.

Table 1 UPOS and Deprel analysis performed by Stanza (Qi et al. 2020)a

    Token      UPOS   Deprel
1   this       PRON   nsubj
2   is         AUX    cop
3   a          DET    det
4   customer   NOUN   compound
5   service    NOUN   compound
6   area       NOUN   root

a From all the Universal Dependencies fields, only three are presented in the table: token, UPOS and deprel


Table 2 Selected bigrams that begin with wrist in the BNC

Rank   Bigram          Raw frequency
1      Wrist .         251
3      Wrist and       111
9      Wrist watch     14
179    Wrist torques   1

sometimes are not meaningful MWEs. See the bigrams table below for MWUs beginning with wrist (extracted from BNC XML CQPweb (Hardie 2012), Lancaster version) (Table 2). The bigram with the highest raw frequency3 is wrist and, which is not an MWE, or at least not a meaningful one. Indeed, we have the MWU wrist watch within the top ten (14 times), but another MWU, wrist torque, only appears once (hapax). As hapaxes may also include MWEs, one must be very careful when relying on frequency to define MWUs. Brezina (2018, p.74) argues that at least two aspects should be considered when choosing the most appropriate collocation measure according to the extraction aim(s): frequency and exclusivity. Note that a collocation measure like log-likelihood tends to score frequent collocations higher, which contrasts with Mutual Information (MI), where a low-frequency collocation, which is exclusive, may score highly. However, even when MI is used, there can still be a problem. If an MWE is not present in the corpus (for instance, wrist shackle, which is absent from the BNC despite its large size4), then the MWE will not be considered. As for the language, here we focus on Indonesian MWEs. Suhardijanto et al. (2020) propose a framework to identify Indonesian MWEs, and they plan to create an MWE lexicon (which has previously been attempted by Gunawan et al. (2016)). Arwidarasti et al. (2020) adjusted some Indonesian MWEs to the Penn Treebank format. However, for a project like this, other than MWE extraction and format adaptation, it is also important to anticipate issues that may arise in the implementation.

3 Punctuation is considered as tokens in BNC XML in CQPweb Lancaster; thus, wrist . is present as the most frequent bigram.
4 The size of BNC XML in CQPweb Lancaster is 112,102,325 (punctuation is considered as tokens).
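For readers unfamiliar with these measures, the short sketch below (my own illustration, not code or data from this chapter; the corpus size and word counts are invented toy numbers) shows why Mutual Information can rank an exclusive but rare pair above a frequent but uninformative one such as wrist and.

import math

def pmi(pair_count, w1_count, w2_count, corpus_size):
    """Pointwise Mutual Information for a bigram (base-2 logarithm)."""
    p_pair = pair_count / corpus_size
    p_w1 = w1_count / corpus_size
    p_w2 = w2_count / corpus_size
    return math.log2(p_pair / (p_w1 * p_w2))

N = 1_000_000                         # invented toy corpus size
# frequent but not exclusive: the second word occurs everywhere
print(round(pmi(111, 400, 30_000, N), 2))   # approx. 3.2
# a single occurrence, but highly exclusive pairing
print(round(pmi(1, 400, 2, N), 2))          # approx. 10.3

Even so, as the text notes, no association measure can score a candidate that never occurs in the corpus at all, which is the deeper limitation of purely frequency-based MWE extraction.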

Suhardijanto (2020, p.587) touches on a number of potential issues, but does not offer methods or approaches to handle MWEs properly. I will present cases in which data-driven systems fail to adequately address Indonesian MWEs in different NLP tasks (POS tagging, syntactic parsing, machine translation). I will also propose methods typically used in systems developed via a linguistic approach (often called rule-based) to successfully analyze MWEs and improve the quality of the aforementioned NLP tasks. To implement these methods I use NooJ,5 a program that is designed to support a linguistic approach, and can be used to build a tagging system. NooJ is computationally powerful as it provides users with computational machines to process all four types of Chomsky’s hierarchy (Chomsky 1956) grammars, namely, regular, context-free, context-sensitive and unrestricted grammars. Other than computational power, NooJ has several advanced features as compared to other similar programs which adopt a linguistic approach, such as foma (Hulden 2009) or xfst (Beesley and Karttunen 2003), which are discussed in Silberztein (2016, pp.28–29), and thus not repeated here.

2 POS Tagging

Automatic Parts of Speech (POS) tagging is often viewed as a successful implementation of a data-driven approach (Brill 1992). Statistical POS taggers rely on language models built on their resources. Typically, these POS taggers require a large pre-tagged corpus as one of their core resources (Silberztein 2016, p.19). The corpus is typically split into two sections. One section is usually used as a training corpus, the other as a testbed corpus, as shown in Bird et al.6 (2009). Statistical methods to create POS tagging systems may vary, but the idea of using a training corpus as one of its core resources remains substantial. The better the quality of a training corpus used by a system, typically, the better the system performs, and vice versa. Below is a demo of a transformer-based Indonesian POS tagger, trained on the Roberta-based model, using Indonlu (Wilie et al. 2020) POSP data set tag-labelled news (Fig. 1).7
(4) Budi sedang pergi ke pasar
    PN in-progress go to market
    ‘Budi is on the way to a market’

5 https://atishs.univ-fcomte.fr/nooj/downloads.html
6 Also, freely available from https://www.nltk.org/book/. A relevant page to this is https://www.nltk.org/book/ch05.html
7 See its demo page at: https://huggingface.co/w11wo/indonesian-roberta-base-posp-tagger?text=Budi+sedang+pergi+ke+pasar

Fig. 1 Annotation of an Indonesian sentence which contains no MWEs

Fig. 2 The POS annotation of an Indonesian sentence which contains an MWE

Some errors are already in the result; the token pasar [market] and full stop marker are tagged as proper nouns. Now, we will try incorporating an Indonesian MWE rumah makan [restaurant] as an oblique. The outcome is as follows (still using the same tagger as earlier) (Fig. 2).
(5) Budi sedang makan di rumah makan
    PN in-progress eat PREP house eat
    ‘Budi is eating in a restaurant.’

The changes have caused more errors. First, the verb, which was correctly analyzed as a verb, is now incorrectly interpreted as a noun. Second, the MWE rumah makan is still analyzed as two separate tokens. Let us look at the result of another data-driven based tagging system. Below is the output from IPOStagger, a

Hidden Markov Model (HMM) based POS tagger written by Wicaksono and Purwarianti (2010), applied to the same sentence. Budi/NNP sedang/RB makan/VBT di/IN rumah/NN makan/VBT ./. Here the MWE rumah makan is not tagged as a single sequence but as two separate ones. The analysis does not make any sense because makan is inaccurately tagged as a transitive verb (VBT). One possible reason for this is the absence of this MWE from the training corpus. Thus, the system makes a best guess. Another reason is the absence of MWE annotation. The training corpus does have rumah makan in it but is not correctly annotated. These two problems imply an issue in the quality of a training corpus8 (see Silberztein (2016, p.19)). This error can easily be prevented by incorporating the MWE into one of the entry lines in the dictionary used by a tagging system. Consider the entry lines from the dictionary below9 (Silberztein 2003, pp.78–91, 2016, pp.99–106). Budi,NOUN+Pr sedang,ADV makan,VERB di,PREP rumah makan,NOUN+MWE+UNAMB The format of each entry line is a token, a delimiter, and one or more morphosyntactic analytic labels. In the dictionary, rumah and makan are treated as two separate entry lines (noun and verb, respectively). However, at the bottom, rumah makan is written as a single-entry line corresponding to a nominal MWE. We also see label +UNAMB, used to prioritize this lexical entry (Silberztein 2003, p.134). This operator is often used for disambiguation. See the result below (Fig. 3). The way the dictionary works is straightforward. It does one-on-one pattern matching and uses the matching entry line to annotate each token in the text. The entry line with +UNAMB (in this case, rumah makan) is prioritized. Once the lexical parser finds a perfect match between a string in the text and an item among prioritized dictionary lines, the matching entry line is used to annotate the string as

8 The reasons for errors in a training corpus may vary. They may be caused by an annotator’s inaccurate tagging. In some cases, a training corpus may be semi-automatically tagged, which may lead to errors in some of its tagging. Errors in the training corpus are quite common so that they become a specific area of research (Silberztein 2016).
9 Available as a NooJ project from https://tinyurl.com/mb929zy3 (retrieved 3/2/2023).
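The priority given to the multiword entry can be imitated outside NooJ with a longest-match lookup. The sketch below is only an illustration of that idea, not the actual NooJ lexical parser: the dictionary reuses the five entry lines quoted above, and the function name and greedy strategy are assumptions made for the example.

# Entry lines quoted above, as a token-sequence -> analysis mapping.
DICTIONARY = {
    ("rumah", "makan"): "NOUN+MWE",   # prioritized (+UNAMB) multiword entry
    ("budi",): "NOUN+Pr",
    ("sedang",): "ADV",
    ("makan",): "VERB",
    ("di",): "PREP",
    ("rumah",): "NOUN",
}
MAX_MWE_LEN = max(len(k) for k in DICTIONARY)

def annotate(words):
    """Greedy longest-match annotation: multiword entries win over the
    single-word entries they overlap with."""
    out, i = [], 0
    while i < len(words):
        for span in range(min(MAX_MWE_LEN, len(words) - i), 0, -1):
            key = tuple(w.lower() for w in words[i:i + span])
            if key in DICTIONARY:
                out.append((" ".join(words[i:i + span]), DICTIONARY[key]))
                i += span
                break
        else:                      # unknown word: leave it unanalyzed
            out.append((words[i], "?"))
            i += 1
    return out

print(annotate("Budi sedang makan di rumah makan".split()))
# [('Budi', 'NOUN+Pr'), ('sedang', 'ADV'), ('makan', 'VERB'),
#  ('di', 'PREP'), ('rumah makan', 'NOUN+MWE')]

Once rumah makan is matched as a single token, the separate noun and verb readings of its parts are never produced, which is exactly the effect the prioritized dictionary entry is meant to have.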

Fig. 3 Tagging of a sentence in Indonesian containing an MWE

a single token. Once done, the process stops searching for lexical entries. For this reason, we do not see separate annotations of rumah and makan. That a statistical POS tagger does not use a dictionary is incorrect because it is implicitly present in the training corpus (Silberztein 2016, p.23). A dictionary may be generated automatically from the training corpus and used by the system. However, its quality is typically not as good as a dictionary created by professional linguists (Silberztein 2016, p.24). He explains further that taggers with implicit dictionaries tend to produce very low-quality results, as the quality of the training corpus solely determines its performance. The application of TreeTagger (Schmid 1994, 1995) may serve as an example. This statistics-based program may be used to create a morphosyntactic tagging system for many languages, including Indonesian.10 I tried making a parameter file for Indonesian by relying solely on a training corpus, and a dictionary generated automatically from the training corpus. When applied on a testbed, it gave me 56% accuracy.11 When I used the parameter file to annotate an actual sentence containing an MWE, it also gave me an erroneous result. Budi sedang makan di rumah makan

NN NN VB IN NN VB

Next, I incorporated an actual dictionary from The Great Dictionary of Indonesian12 (third edn).13 The accuracy improved to 96% (parameter file available from 10

While the work has been completed, an article about it is in preparation. POS tagger accuracy should not be used as the only performance measure. Silberztein (2016, pp.18–20) argues that 95% or above accuracy is not very impressive because most of the words in a testbed corpus are not ambiguous. Second, many testbed corpora are extracted from a reference corpus. Thus, they may fall into the same domain, and the quantity is typically not better than the training corpus. That is why, here, I test the parameter file on a sentence that is not present in the training corpus to see how the program performs. 12 Also known as Kamus Besar Bahasa Indonesia in Indonesian, currently fourth edition https:// kbbi.kemdikbud.go.id/ (retrieved 3/2/2023). 13 The data are obtained from Kateglo (http://kateglo.lostfocus.org/ retrieved 3/2/2023). 11

208

Prihantoro

Table 3 Typical N1 V N2 patterns to form a sentence in Indonesian Substantive gajah [elephant] manusia [human] ayah [father]

Verb makan [eat] minum [drink] baca [read]

Substantive rumput [grass] air [tea] buku [book]

Translation Elephant eats grasses Human drinks water Father read a book

TreeTagger web site). TreeTagger also allows an MWE wordlist to be used for tokenization. The output is shown below when all these resources (training corpus, dictionary, MWE list) are used. Budi Sedang Makan di rumah makan

NN RB VB IN NN

We can see that the MWE is tokenized as a single token and correctly analyzed as a noun. The above output shows that the presence of an explicit dictionary (with MWE) is substantial in determining the success of a tagging system.

3 Syntactic Parsing A POS tagger may be used as a subsystem of a syntactic parser, as shown in Chen and Manning (2014). Thus, an error in the POS tagging an MWE may lead to a syntactic parsing error. If we use our earlier MWE rumah makan, which was incorrectly analyzed as a combination of a noun and a verb, this MWE might be parsed as a sentence (at least intransitive). This is because a sentence in Indonesian may consist of a substantive (noun or pronoun), a verb and another substantive (noun pronoun). The string rumah may be parsed as a subject, with makan as its verb, and if another substantive follows, it may be analyzed as an object (Table 3). Now, consider this utterance, which consists of an MWE, rumah makan [restaurant], followed by a third-person pronoun. By disregarding the MWE, the string may resemble a sentence as it is composed of N1 V N2. However, this reading is inaccurate. The construction is a genitive phrase. The third person at the end is the possessor of rumah makan. It is not a complete sentence. (6) rumah makan saya House eat 3 my restaurant

Analysis of Indonesian Multiword Expressions: Linguistic vs Data-Driven. . .

209

I applied Stanza14 (Qi et al. 2020), a neural network-based NLP pipeline, and ran a dependency parsing analysis (2013). The system incorporates MorphInd’s15 (Larasati et al. 2011) morphosyntactic analysis to fill the slot in xpos. See below. The parser mistook this utterance as an SVO sentence. This is because it disregards the MWE. Thus rumah and makan are considered as a combination of a substantive, which functions as a subject (UPOS: NOUN, deprel:nsubj) and a verb which serves as its predicate (UPOS:VERB, deprel:root). [ [ { "id": 1, "text": "rumah", "lemma": "rumah", "upos": "NOUN", "xpos16": "NSD", "feats": "Number=Sing", "head": 2, "deprel": "nsubj", "start_char": 0, "end_char": 5 }, { "id": 2, "text": "makan", "lemma": "makan", "upos": "VERB", "xpos": "VSA", "feats": "Mood=Ind|Voice=Act", "head": 0, "deprel": "root", "start_char": 6, "end_char": 11 }, { "id": 3, "text": "saya", "lemma": "saya", "upos": "PRON", "xpos": "PS1", "feats": "Number=Sing|Person=1|Polite=Form|PronType=Prs", "head": 2, "deprel": "det",

14

https://stanfordnlp.github.io/stanza/ MorphInd is a hybrid tagger (POS tagger + morphological analyser; data-driven + linguistic approaches), accessible at https://septinalarasati.com/morphind/ (retrieved 13/2/2023). For a purely morphological analyzer, built only using a linguistic approach, see SANTI-morf (Prihantoro 2021a, 2021b, 2022). 16 XPOS tag from Larasati et al. (2011). 15

210

Prihantoro

Fig. 4 Correct analysis of an Indonesian MWE (Task Annotation Structure) Fig. 5 Correct analysis of an Indonesian MWE (Syntax Tree)

"start_char": 12, "end_char": 16 } ] ]

Now, let us look at the syntactic parsing using NooJ.17 The system reads the correct annotation of MWE. It correctly analyses rumah makan saya as a phrase, not a sentence. The system also successfully identifies rumah makan as the root of the phrase and the first-person pronoun as a determiner (Figs. 4 and 5). 17

Available as a NooJ project https://tinyurl.com/yeyt3nen (retrieved 3/2/2023).

Analysis of Indonesian Multiword Expressions: Linguistic vs Data-Driven. . .

211

Fig. 6 Incorrect analysis of an Indonesian MWE (Task Annotation Structure) Fig. 7 Incorrect analysis of an Indonesian MWE (Syntax Tree)

If we disregard the multiword expression, even in NooJ, the parsing will be inaccurate as it reads incorrect POS tagging. Consider the syntactic parsing result below in which rumah and makan are analyzed as separate noun and verb tokens. In consequence, it concludes that rumah, makan and saya is a sentence composed of a subject, verb, and object, respectively (Figs. 6 and 7).

212

Prihantoro

4 Machine Translation The analysis of MWE is essential in determining the success of a machine translation application. MWE resources in both source and target language must correspond and be equally good in terms of quality. Note that an MWE in a source language may correspond to another MWE, a single word in a target language, or both. This factor must also be considered (Table 4). Now, let us look at an example provided by Google Translate18 (GT). GT is one of the most popular machine translation applications today. The application relies heavily on statistical methods and large parallel corpora.19 In some cases, GT can successfully recognize some MWEs in Indonesian and correctly translate them. For instance, in Indonesian, the MWE rumah sakit [hospital] (literally from rumah [house] and sakit [sick]) is always correctly translated to [hospital]. However, not all Indonesian MWEs are always accurately translated. See the example below. (7) cantik=nya bukan main beautiful=3 no play 'she is very/incredibly beautiful' The phrase bukan main is a fixed expression which may correspond to the adverbs very or really in English. This MWE is not frequently used, particularly in a formal register. In LCC Indonesian CQPweb version (Hardie 2012) rumah sakit and bukan main appear 34,267 and 1090 times (raw frequency). Proportionally, the former counts20 for 97%, while the latter is only 3%. Now, let us consider how GT translates the latter into English (Table 5). Even if one does not speak any Indonesian, just by looking at the English translation (right), one can see a problem. While the sentence is grammatically correct, it is semantically nonsense. The translation here is literal but inaccurate (bukan [not], main [play]). This reflects a failure of GT’s statistical analysis to capture a complex linguistic description of an Indonesian MWE or a paucity in GT’s resources (most likely the parallel corpora).

Table 4 A sample of English MWEs & non-MWEs to Indonesian MWEs

18

No 1 2

EN Hospital In general (MWE)

3

Responsibility

https://translate.google.com/ https://www.youtube.com/watch?v=_GdSC1Z1Kzs 20 Figures are rounded. 19

ID Rumah sakit (MWE) Umumnya Secara umum (MWE) tanggung jawab (MWE)

Analysis of Indonesian Multiword Expressions: Linguistic vs Data-Driven. . .

213

Table 5 Inaccurate translation (ID > EN) from Google Translate ID Cantiknya bukan main Table 6 Typical rightheaded evaluative expression structure in Indonesian (Non-MWE)

EN Beautiful is not playing

ADV (intensifier) terlalu [too] agak [somewhat] sangat [very]

ADJ (evaluation) cantik [beautiful] cantik [beautiful] cantik [beautiful]

Translation too beautiful somewhat beautiful very beautiful

Fig. 8 Annotation of an Indonesian MWE and clitic

The MWE bukan main in this context is an intensifier which corresponds to very or really in English, thus it is an adverb. This is the first failure. In the English translation, bukan and main are still translated literally as two separate words. Second, it also fails to capture the syntactic and morphosyntactic context in which this MWE is used, thus leading to an incorrect translation (Table 6). Another failure of the analysis is to identify the presence of a pronoun, which in this case is cliticized (=nya). We can see in the GT output that there is no thirdperson pronoun in the English translation. Thus, to accurately translate it, a system must be able to identify both the clitics as well as the MWE, including its unique syntactic construction. Consider the image below, also from NooJ.21 We can see that the MWE bukan main is annotated as a single adverbial token instead of a combination of two tokens. To analyze target language input correctly is the first step towards a successful machine translation (Fig. 8). The annotation is obtained from an entry line in a dictionary in use. Here, I show a simplified version of the dictionary, which consists of only six lines. The entry line used to annotate bukan main is entry line number 4. This entry line is prioritized (see +UNAMB operator). Because a full match is found for this entry line, other matching entry lines (numbers 1 and 2) are not used. The format is very similar. The only difference is the attribute value style in the analytic label. The attribute (EN) refers to the target language, and the value refers to the translation of the entry line’s item.

21

Available as a NooJ project https://tinyurl.com/4swkn4z5 (retrieved 3/2/2023).



1 bukan,ADV+EN="no" 2 main,VERB+EN="play" 3 cantik,ADJ+EN="beautiful" 4 bukan main,ADV+MWE+EN="very"+UNAMB 5 sangat,ADV+EN="very" 6 dia,PRON+EN="s/he" We do not see an entry line corresponding to the analysis of =nya as a pronoun. This is because the analysis is not obtained from the dictionary, but from a grammar. As I mentioned earlier, NooJ’s primary resources are not only dictionaries but also grammar. The line in the grammar file that allows the annotation of =nya as a pronoun is as follows.22 1 Main = $(X * $) / / 2 nya/ The first line is the start of a grammar rule. It checks whether a string begins with a combination of letters that match one of the entry line items (in this case, cantik). The string must be followed by nya. If these two conditions are fulfilled, then an annotation for cantik and nya will be given (code in bold in line 1 to take labels from the dictionary entry line, and code in bold in line 2 to annotate nya). As for the clitic, the free form dia is used as its lemma annotation, as shown in the earlier figure. Now that all elements are correctly analyzed, let us turn to the automatic machine translation. The grammar below is used to perform a simple machine translation. Two lines are essential here: lines 2 (input) and 4 (output). Line 2 allows the grammar to identify the targeted sentence containing the MWE. It simply checks whether the atypical syntactic construction,23 here a combination of an adjective, a pronoun and an adverbial MWE, is present. 1 Main = (:MWE | :nonMWE) :trans ; 2 MWE = $(adj $) $(pron $) $(adv $) ; 3 nonMWE = $(pron $) $(adv $) $(adj $) ; 4 out= /TRANS=$pron$EN /" is " /$adv$EN /" " /$adj$EN; Line 4 is dedicated to producing the translation. The grammar line checks the dictionary entry lines and output. First, a match is found for the three tokens: cantik, nya and bukan main. Subsequently, the English translation for each token is copied, but reordered to the target language syntactic construction, namely pronoun, adverb and adjective. A copula “is” is inserted between the pronoun and the adverb to ensure the sentence is grammatically correct in English (Fig. 9). This shows that a machine translation performed by a rule-based system can be comparable to or even better than a data-driven system. For other successful machine 22 In the actual grammar, the two parts must be presented in line. The division into two lines and the numbering in this paper are only for reading convenience. 23 The typical syntactic construction of PRON ADV ADJ is written in line 3.

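The clitic rule and the transfer grammar can be imitated in the same spirit. The sketch below (again plain Python, not the actual NooJ grammar; the glosses, the =nya splitting rule, the sample input cantiknya bukan main, and the output template are illustrative assumptions) separates the clitic from its adjectival host, recognizes the atypical ADJ PRON ADV construction, reorders the English glosses to pronoun, adverb, adjective, and inserts the copula "is".

# A minimal sketch (not NooJ): clitic splitting plus rule-based transfer.
# Glosses, the splitting rule, and the output template are illustrative assumptions.

GLOSS = {
    "cantik":     ("ADJ",  "beautiful"),
    "bukan main": ("ADV",  "very"),
    "dia":        ("PRON", "s/he"),
}

def tokenize(text):
    """Split =nya off its host and rejoin the MWE bukan main into one unit."""
    units = []
    for word in text.lower().split():
        if word.endswith("nya") and word[:-3] in GLOSS:   # cantiknya -> cantik + nya
            units.append(word[:-3])
            units.append("dia")   # the free form dia serves as the clitic's lemma
        else:
            units.append(word)
    joined = " ".join(units).replace("bukan main", "bukan_main")
    return [u.replace("_", " ") for u in joined.split()]

def translate(text):
    tagged = [(tok,) + GLOSS[tok] for tok in tokenize(text) if tok in GLOSS]
    if [pos for _, pos, _ in tagged] == ["ADJ", "PRON", "ADV"]:   # atypical construction
        en = {pos: gloss for _, pos, gloss in tagged}
        return f"{en['PRON']} is {en['ADV']} {en['ADJ']}"         # reorder, insert copula
    return " ".join(gloss for _, _, gloss in tagged)              # literal fallback

print(translate("Cantiknya bukan main"))   # -> "s/he is very beautiful"

This mirrors the division of labour in the grammar above: line 2 of the grammar corresponds to the construction check, and line 4 to the reordering and copula insertion.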

Fig. 9 Translation of Indonesian sentences with (upper) and without (lower) MWE

This shows that a machine translation performed by a rule-based system can be comparable to, or even better than, one performed by a data-driven system. For other successful machine translations using rule-based systems in different languages (English, Vietnamese, Arabic, and French), see Silberztein (2016, pp. 12–18).

5 Conclusion

I have shown a number of cases in which some data-driven systems fail to handle Indonesian MWEs properly in the areas of POS tagging, syntactic parsing, and machine translation (from Indonesian to English). I have also offered methods from NooJ, a rule-based system, to resolve these issues. Unlike data-driven systems, which rely heavily on a training corpus, the methods I have described rely primarily on dictionaries and grammars, the two key resources of systems developed using a linguistic approach. This means that rule-based systems and methods should not be underestimated: they can be comparable to, or even better than, empirically based systems.

References

Arwidarasti, J. N., Alfina, I., & Krisnadhi, A. (2020). Adjusting Indonesian Multiword Expression Annotation to the Penn Treebank Format. 2020 International Conference on Asian Language Processing (IALP), 75–80.
Baldwin, T., & Kim, S. N. (2010). Multiword Expressions. Handbook of Natural Language Processing.
Beesley, K. R., & Karttunen, L. (2003). Finite-state morphology: Xerox tools and techniques. CSLI, Stanford.
Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media. https://books.google.co.id/books?id=KGIbfiiP1i4C
Brezina, V. (2018). Statistics in Corpus Linguistics. Cambridge University Press. https://doi.org/10.1017/9781316410899
Brill, E. (1992). A Simple Rule-Based Part of Speech Tagger. Applied Natural Language Processing Conference.


Calzolari, N., Fillmore, C. J., Grishman, R., Ide, N., Lenci, A., MacLeod, C., & Zampolli, A. (2002, May). Towards Best Practice for Multiword Expressions in Computational Lexicons. Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'02). http://www.lrec-conf.org/proceedings/lrec2002/pdf/259.pdf
Carpuat, M., & Diab, M. T. (2010). Task-based Evaluation of Multiword Expressions: A Pilot Study in Statistical Machine Translation. North American Chapter of the Association for Computational Linguistics.
Chen, D., & Manning, C. D. (2014). A fast and accurate dependency parser using neural networks. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 740–750.
Chomsky, N. (1956). Three models for the description of language. IEEE Transactions on Information Theory, 2(3), 113–124. https://doi.org/10.1109/TIT.1956.1056813
Constant, M., Eryiğit, G., Monti, J., van der Plas, L., Ramisch, C., Rosner, M., & Todirascu, A. (2017). Multiword Expression Processing: A Survey. Computational Linguistics, 43(4), 837–892. https://doi.org/10.1162/COLI_a_00302
Gunawan, D., Amalia, A., & Charisma, I. (2016). Automatic extraction of multiword expression candidates for Indonesian language. 2016 6th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), 304–309. https://doi.org/10.1109/ICCSCE.2016.7893589
Hardie, A. (2012). CQPweb — combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380–409. https://doi.org/10.1075/ijcl.17.3.04har
Hulden, M. (2009). Foma: a finite-state compiler and library. Proceedings of the Demonstrations Session at EACL 2009, 29–32.
Larasati, S. D., Kubon, V., & Zeman, D. (2011). Indonesian morphology tool (MorphInd): Towards an Indonesian corpus. International Workshop on Systems and Frameworks for Computational Morphology, 119–129.
Masini, F. (2019). Multi-Word Expressions and Morphology. Oxford Research Encyclopedia of Linguistics.
McDonald, R. T., Nivre, J., Quirmbach-Brundage, Y., Goldberg, Y., Das, D., Ganchev, K., Hall, K. B., Petrov, S., Zhang, H., Täckström, O., Bedini, C., Bertomeu, N., & Lee, J. (2013). Universal Dependency Annotation for Multilingual Parsing. Annual Meeting of the Association for Computational Linguistics.
Nivre, J., Zeman, D., Ginter, F., & Tyers, F. (2017, April). Universal Dependencies. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts. https://aclanthology.org/E17-5001
Prihantoro. (2021a). An evaluation of MorphInd's morphological annotation scheme for Indonesian. Corpora, 16(2), 287–299. https://doi.org/10.3366/COR.2021.0221
Prihantoro. (2022). SANTI-morf dictionaries. Lexicography, 9(2), 175–193. https://doi.org/10.1558/lexi.23569
Prihantoro, P. (2021b). An automatic morphological analysis system for Indonesian (Doctoral Thesis). Lancaster University.
Qi, P., Zhang, Y., Zhang, Y., Bolton, J., & Manning, C. D. (2020). Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. https://nlp.stanford.edu/pubs/qi2020stanza.pdf
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), 1–15.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees.
Schmid, H. (1995). Improvements in part-of-speech tagging with an application to German. ftp://ftp.ims.uni-stuttgart.de/pub/corpora/tree-tagger2.pdf
Silberztein, M. (2003). NooJ Manual. Available at: https://nooj.univ-fcomte.fr/downloads.html


Silberztein, M. (2016). Formalizing Natural Languages: The NooJ Approach.
Suhardijanto, T., Mahendra, R., Nuriah, Z., & Budiwiyanto, A. (2020). The Framework of Multiword Expression in Indonesian Language. Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation, 582–588. https://aclanthology.org/2020.paclic-1.67
Voutilainen, A. (1999). A Short History of Tagging. In H. van Halteren (Ed.), Syntactic Wordclass Tagging (pp. 9–21). Springer Netherlands. https://doi.org/10.1007/978-94-015-9273-4_2
Wicaksono, A. F., & Purwarianti, A. (2010). HMM based part-of-speech tagger for Bahasa Indonesia. Fourth International MALINDO Workshop, Jakarta.
Wilie, B., Vincentio, K., Winata, G. I., Cahyawijaya, S., Li, X., Lim, Z. Y., Soleman, S., Mahendra, R., Fung, P., Bahar, S., & Purwarianti, A. (2020). IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding. AACL.