Persian Computational Linguistics and NLP 9783110619225, 9783110616545

This companion provides an overview of current work in the areas of Persian Computational Linguistics (CL) and Natural Language Processing (NLP).


English, 268 pages, 2023


Table of contents :
Introduction
Contents
1 Research in Persian Natural Language Processing – History and State of the Art
2 Challenges of Persian NLP: The Importance of Text Normalization
3 Dabire: A Phonemic Orthography for Persian
4 Speech Recognition for Persian
5 Syntactic Parsing of Persian: From Theory to Practice
6 Persian Named Entity Recognition with Structural Prediction Methods
7 Multiword Expressions in Persian
8 Machine Translation: From Paper Tapes to Neural Networks
9 Distributional Word Representation for Persian
Index


Persian Computational Linguistics and NLP

The Companions of Iranian Languages and Linguistics



Editor Alireza Korangy

Volume 2

Persian Computational Linguistics and NLP

Edited by Katarzyna Marszałek-Kowalewska

ISBN 978-3-11-061654-5
e-ISBN (PDF) 978-3-11-061922-5
e-ISBN (EPUB) 978-3-11-061671-2
ISSN 2627-0765
Library of Congress Control Number: 2023931700

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

© 2023 Walter de Gruyter GmbH, Berlin/Boston
Cover image: DamienGeso / iStock / Getty Images Plus
Typesetting: VTeX UAB, Lithuania
Printing and binding: CPI books GmbH, Leck
www.degruyter.com

Introduction

When you ask people, "What is language to you?" they usually come up with a few responses; it is rather rare that a person gives only one answer. The most frequent ones are:
– Most efficient way of communication.
– Basic human feature.
– Cornerstone of civilization.
– Complex and difficult to grasp.
– Connection with our ancestors.
– Tool of manipulation.
– Expression of emotions and thoughts.
– Part of personality.

Language is open-ended. It helps us collate our thoughts and transmit them, allows us to communicate and express emotions, transfers knowledge and information, and acts as a medium that binds people who use the same lullabies to put their kids to sleep. Furthermore, language allows us not only to describe the surrounding reality but also to discuss the past, predict the future, or even create new realities, like fantasy worlds with kingdoms full of elves and goblins. It helps us create art, music, and literature and serves an entertainment purpose when approached creatively while playing Scrabble or Mad Libs. It is difficult to imagine the development of, for example, medicine or space exploration without language. Of course, to cure a diabetic person or investigate life on Mars, other tools and methods are needed, but the discovery of these tools or procedures could not be imagined without language. You would not be reading this Introduction if it were not for the development of language. Human civilization, as we know it, would clearly be different without language: it would probably be less advanced; maybe it would not even exist. Therefore, language, in a way, has brought humanity to where it is today. Without language, we would not be able to discuss quantum mechanics, negotiate treaties, clarify misunderstandings, or simply say that we want this coffee with oat milk. In fact, the world would probably be the same; it would be us – humans – who would presumably not understand its complexities.
People appear to have always had an interest in language, starting in Antiquity (Pāṇini or Aristotle), continuing through the Middle Ages (Ibn Abi Ishaq or Sibawayh), and persisting till today. Recently, owing to the advent of computers, the interest in language has shifted to the areas of computational linguistics (CL) and natural language processing (NLP). Both fields try to bridge the communication between humans and a new communication partner – the computer. In order to talk to people, a computer should understand syntax (grammar), semantics (meaning), or morphology (the structure of words).

In recent years, the interest in CL and its engineering domain, NLP, has exploded. Asking Siri to show you the way to the nearest coffee shop, conversing with a chatbot about your bank account, using a translation button on Facebook, using the autocorrect feature on your mobile, targeting your business ad to reach your potentially best customers, auto-completing your thoughts while writing an email, and filtering spam are just some examples of everyday usage of NLP. A decade ago, talking to your mobile seemed like a sci-fi whim. Now, it is a daily activity. The reason CL and NLP have gained recognition in recent years and continue to gain momentum is evident: human civilization is drowning in data and striving for insights. In 2008, Google reported that the World Wide Web had one trillion pages.1 The total amount of data created and captured in 2018 was 33 zettabytes (ZB).2 In 2020, it reached about 59 ZB. International Data Corporation projects that by 2025, the available data may expand to 175 ZB.3 It is estimated that about 2.5 quintillion bytes (2,500,000,000 GB) of data are produced daily.4 Although these estimates include video, image data, and databases, most of it is plain old text. Unstructured data (also known as free-form text) comprises 70–80 % of the data available on computer networks. The information content of this resource is unavailable to governments, public services, businesses, and individuals unless humans read these texts or devise means to derive information value from them. Here enter CL and NLP, which are applied to extract, characterize, interpret, and understand the meaningful information content of free-form text.

https://doi.org/10.1515/9783110619225-201
1 https://www.itpro.co.uk/604911/google-says-the-web-hits-a-trillion-pages
2 1 ZB equals 1,099,511,627,776 gigabytes (GB).
3 https://www.networkworld.com/article/3325397/idc-expect-175-zettabytes-of-data-worldwide-by2025.html
4 https://techjury.net/blog/how-much-data-is-created-every-day
5 As of February 2022, https://www.worldatlas.com/articles/where-is-farsi-spoken.html
6 Until the twentieth century, all native speakers of Persian used the term "Farsi" to refer to the Persian language. Due to political reasons, the terms "Dari" and "Tajiki" are now used to refer to the Persian spoken in Afghanistan and Tajikistan, respectively, while the Persian of Iran retains the name "Farsi".
7 https://w3techs.com/technologies/history_overview/content_language

Persian, the main protagonist of this companion, is spoken by more than 110 million5 people worldwide, mainly in Iran (Farsi), Afghanistan (Dari), and Tajikistan (Tajiki).6 According to W3Techs, it is the fifth most used content language of the Web, used by 3.5 % of all the websites whose content language we know today.7 Even more interesting, it is classified as the second fastest-growing content language (after Russian). In 2015, Persian was not even in the top 10 content languages, yet it secured the eighth position in 2019, and it has continuously moved up since then. This positive change shows the significance and expansion of the Persian language on the Internet. For the NLP community, this means that there is a robust amount of Persian data that can be used for various tasks, such as text summarization, sentiment analysis, or information extraction.

This volume intends to provide an overview of the current work in the areas of Persian CL and NLP. It is designed as a reference and source text for graduate students and researchers from the fields of NLP, computer science, linguistics, translation, psychology, philosophy, and mathematics who are interested in this topic. The volume is structured in two parts. Part I (Chapters 1–3) covers the fundamentals of Persian CL and NLP and aims to provide the reader with a general background of the two. Part II (Chapters 4–9) introduces applications of CL and NLP in Persian language studies. It covers several topics and describes the most innovative works of distinct academics analyzing the Persian language.

Persian Computational Linguistics and NLP comprises nine chapters, each describing a separate sub-field of CL and NLP. Chapter 1 presents the history of research on Persian NLP. Magdalena Borowczyk describes the steps of the Persian NLP pipeline and lists applications, from spell-checking to information retrieval and sentiment analysis. Moreover, this chapter presents and summarizes the most significant Persian corpora and linguistic resources. In Chapter 2, Marszałek-Kowalewska discusses the main challenges that the Persian language poses for NLP tasks.
The chapter shows how important it is to clean and normalize Persian input texts before using them, underscoring the impact of normalization steps on downstream NLP tasks. Building on the preprocessing topic, the transliteration task is presented in Chapter 3. Jalal Maleki briefly discusses Persian phonology to introduce Dabire – a Romanized, phonemic writing scheme for Persian.

In Chapter 4, Mahsa Vafaie and Jon Dehdari discuss the task of speech recognition in Persian. They present its history, techniques, and useful resources, as well as state-of-the-art Persian automatic speech recognition (ASR) systems. This chapter also lists some applications that use Persian ASR.

Masood Ghayoomi looks at the field of syntactic analysis – the process of analyzing natural language with the rules of a formal grammar – in Chapter 5. Here, the author focuses on part-of-speech tagging and various parsing algorithms used for the syntactic analysis of a Persian sentence.

In the sixth chapter, Hanieh Poostchi and Massimo Piccardi describe the problem of named entity recognition (NER) in Persian. They present classic and recent techniques, discuss the available datasets, and assess state-of-the-art approaches to Persian NER.

Chapter 7 addresses the topic of multiword expressions (MWEs). Marszałek-Kowalewska discusses different approaches to this phenomenon and its properties, typologies, and two main processing tasks – MWE discovery and MWE identification – focusing, of course, on Persian MWEs.

Adel Rahimi delineates the topic of machine translation in Chapter 8. He describes the history of the field, discusses the most recent approaches in more detail, and presents resources useful for Persian machine translation.

In Chapter 9, Saeedeh Momtazi expands on the topic of distributional word representations. The chapter addresses the importance of context in distributional semantic models, presents embedding techniques, and evaluates datasets available for the Persian language.

Persian Computational Linguistics and NLP aims to be a ready reference, providing access to selected concepts in Persian CL and NLP. This volume is not exhaustive, given that CL and NLP are developing rapidly and that what is state-of-the-art today might be considered a "traditional approach" tomorrow. That is why this volume is accompanied by a web page – a companion wiki8 – to keep track of recent developments and, hopefully, bring together new generations of researchers who will contribute to advancements in the study of the Persian language.

8 https://www.persian-nlp-companion.com

Contents

Introduction V

Magdalena Borowczyk
1 Research in Persian Natural Language Processing – History and State of the Art 1

Katarzyna Marszałek-Kowalewska
2 Challenges of Persian NLP: The Importance of Text Normalization 25

Jalal Maleki
3 Dabire: A Phonemic Orthography for Persian 47

Mahsa Vafaie and Jon Dehdari
4 Speech Recognition for Persian 85

Masood Ghayoomi
5 Syntactic Parsing of Persian: From Theory to Practice 105

Hanieh Poostchi and Massimo Piccardi
6 Persian Named Entity Recognition with Structural Prediction Methods 149

Katarzyna Marszałek-Kowalewska
7 Multiword Expressions in Persian 185

Adel Rahimi
8 Machine Translation: From Paper Tapes to Neural Networks 219

Saeedeh Momtazi
9 Distributional Word Representation for Persian 235

Index 255

Magdalena Borowczyk

1 Research in Persian Natural Language Processing – History and State of the Art

Abstract: This chapter synthesizes the most prominent Natural Language Processing (NLP) studies conducted on Persian, focusing on text processing. The first section covers selected tasks from the NLP pipeline, such as text preprocessing, tokenization, POS tagging, syntactic parsing, treebank annotation, and semantic analysis, along with examples of how researchers approached each problem for Persian and, where applicable, examples of tools developed to perform the given tasks. The following section discusses applications of Persian NLP, such as spell-checking, information retrieval, machine translation, and sentiment analysis. Finally, the last section summarizes the Persian NLP corpora and other resources.

1.1 Introduction

Natural Language Processing (NLP) is an interdisciplinary field that combines the achievements of linguistics, computer science, and artificial intelligence, aiming to aid human-computer interaction, whereby a series of tasks performed on a text or spoken language leads to a computer system analyzing, attempting to understand, or producing human language (Ralston, Reilly, and Hemmendinger 2003).

The characteristics of the Persian language introduce a number of challenges to NLP: the writing system proves problematic in terms of character encoding, word boundary detection, and word sense disambiguation, and free word order increases the complexity of parsing. Persian also has many multiword expressions, compounds, and fixed phrases, which make computational analysis more difficult than for many other languages. Moreover, the Persian variants in Iran, Afghanistan, and Tajikistan all present strong diglossia (Megerdoomian 2018). These complexities, and the fact that Persian is a scarce-resourced language in comparison to English, make the amount and outcomes of the available Persian Natural Language Processing research quite impressive.

Most major universities in Iran have departments, laboratories, or research groups contributing to Persian NLP, to name a few: the Department of Linguistics, the NLP Lab, and the School of Electrical and Computer Engineering at the University of Tehran; the NLP Lab at Shahid Beheshti University; the Natural Language Processing Group at the Department of Computer Engineering of the Amirkabir University of Technology; the Natural Language Processing and Intelligent Planning Laboratory at the Sharif University of Technology; the Web Technology Laboratory at the Ferdowsi University of Mashhad; the Computer Engineering and IT Department at the University of Qom; and the Department of Electrical and Computer Engineering at the Isfahan University of Technology. There is also a governmental research center that covers NLP: the Institute for Humanities and Cultural Studies. Moreover, there are multiple researchers of Persian Natural Language Processing across Australia, Canada, Europe, and the USA.

This chapter is divided into three sections: the first presents research tools developed for language processing itself, the second showcases some of the Persian NLP applications, and the third lists the most popular Persian corpora and other NLP resources. It is worth noting that this chapter does not aspire to be fully comprehensive when it comes to Persian NLP and presents only selected works, mostly focusing on text analysis.

https://doi.org/10.1515/9783110619225-001

1.2 Natural Language Processing Pipeline

Natural Language Processing tasks can incorporate supervised machine learning, which relies on labeled input data, or unsupervised machine learning, which does not require data labels. For both types of methods, there are a number of steps that may need to be applied to an unstructured text in order to make it easier for a computer to process. These include, but are not limited to:
– text preprocessing
– tokenization
– part-of-speech tagging
– syntactic parsing
– treebank annotation
– semantic analysis

This sub-section presents how Persian NLP researchers approached the above tasks.1
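As a rough sketch, the steps listed above compose into a pipeline. The function bodies below are trivial placeholders (they are not any of the tools discussed in this chapter); the point is only the chaining of stages:

```python
# Placeholder pipeline: each stage is a stub standing in for a real tool
# (normalizer, tokenizer, tagger) described later in this chapter.

def preprocess(text: str) -> str:
    """Stand-in for text normalization (encoding/spacing cleanup)."""
    return text.strip()

def tokenize(text: str) -> list:
    """Stand-in for a real tokenizer: naive whitespace split."""
    return text.split()

def pos_tag(tokens: list) -> list:
    """Stand-in for a statistical tagger: assigns a dummy tag 'X'."""
    return [(t, "X") for t in tokens]

def pipeline(text: str) -> list:
    return pos_tag(tokenize(preprocess(text)))
```

A real system plugs corpus-trained components into each slot; the composition itself stays the same.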

1.2.1 Text Preprocessing

In order to process a digital text, software solutions need standardized input data, such as corpora with unified encoding, where all elements are properly defined.

1 There are, of course, many more fields of Persian NLP research than the ones listed and discussed in this chapter. Many of the omitted ones are discussed as standalone chapters in this volume, e. g., the challenges of processing Persian texts or Named Entity Recognition.


The omnipresent ambiguities of natural languages and writing systems often stand in the way of computational text analysis. This is where text preprocessing comes into use. It is the task of converting a digital text into a sequence of linguistically meaningful entities. Text preprocessing is essential to all further stages of digital text analysis, as this is where graphemes, words, and sentences are defined (Palmer 2010). The overall need to prepare Persian texts for processing arises from the fact that, like many other languages, Persian uses several registers with different writing styles. Seraji (2013) has defined the challenges specific to Persian and, based on that, created PrePer – a Persian text preprocessing tool. According to Seraji, there are three main issues Persian text processors face. Firstly, although Persian uses its own alphabet and has dedicated Unicode code points, many pieces of software use the codes for Arabic instead (Seraji 2013). This causes discrepancies in the encoding of some Persian letters, especially in online texts. Secondly, official Persian (e. g., mass media) uses the so-called zero-width non-joiner (ZWNJ, Unicode U+200C) for bound morphemes, inflectional affixes, and compound nouns. Blogs and less formal registers tend to use a regular whitespace (U+0020) or no space at all, adjoining the morphemes to the preceding words (Seraji 2013). This causes difficulties in identifying the basic text units. Thirdly, although separate Unicode code points exist for Persian digits, software and online sources tend to mix Western digits into Persian texts, as the digits are not processed as numerical data (Esfahbod 2004).

PrePer, Seraji's open-source software developed in the Ruby programming language, provides solutions to the above-mentioned issues. It aligns the Unicode representation of Persian across the input text to address the letter and digit representation discrepancies and introduces the zero-width non-joiner in all applicable use cases (Seraji 2013). Moreover, PrePer users can add custom rules to clean up and standardize Persian texts derived from different sources for more efficient text analysis and processing.
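The character-level part of such normalization can be sketched as follows. This is a minimal illustration of the kinds of mappings described above (unifying Arabic vs. Persian letter code points and digit sets), not PrePer itself, and the mapping covers only the most common cases:

```python
# Minimal sketch of Persian character normalization (not PrePer).
# Arabic yeh/kaf often appear in Persian text instead of the Persian forms.
CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}

# Map ASCII and Arabic-Indic digits to Extended Arabic-Indic (Persian)
# digits, e.g. '1' and U+0661 both become U+06F1.
DIGIT_MAP = {}
for i in range(10):
    DIGIT_MAP[chr(ord("0") + i)] = chr(0x06F0 + i)   # ASCII -> Persian
    DIGIT_MAP[chr(0x0660 + i)] = chr(0x06F0 + i)     # Arabic-Indic -> Persian

def normalize(text: str) -> str:
    """Apply character- and digit-level unification to a Persian string."""
    table = {**CHAR_MAP, **DIGIT_MAP}
    return "".join(table.get(ch, ch) for ch in text)
```

The harder ZWNJ-insertion step (deciding where a plain space or missing space should become U+200C) needs lexical knowledge and is therefore not shown here.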

1.2.2 Tokenization

Text analysis on the word level is referred to as lexical analysis or tokenization. It is the task of transforming strings of characters into sequences of tokens, or meaningful lexemes. Tokenization may be considered part of the parsing process. Its primary step is to map a string of morphological variants to their lemma (Hippisley 2010). The results of tokenization can be further used in other NLP tasks such as text preprocessing, text generation, parsing, machine translation, information retrieval, etc. Sagot and Walther (2010) have used tokenization principles to develop PerLex, a morphological lexicon for Persian based on the Alexina framework (Sagot

2010). Alexina helps represent lexical information comprehensively and aims to be universal and independent of grammatical formalisms. The Alexina model has a two-level representation: the intensional lexicon (a mapping of morphological and semantic information onto lemmas) and the automatically generated extensional lexicon, which maps inflected forms to deep morphological and semantic information. The latter may be used by parsers. While developing PerLex, Sagot and Walther (2010) started with creating the intensional lexicon and left aside the semantic information mapping. They created a description of Persian morphology, built a list of verbal lemmas with inflection classes based on online resources, collected proper nouns from Wikipedia and assigned them noun inflection classes, cleaned the list, and manually added missing entries. They added a lexicon of Persian nouns to add nominal lemmas, to which they assigned plural forms. Other categories were retrieved from the BijanKhan corpus (BijanKhan 2004; Hojjat and Oroumchian 2007). The next stage of the lexicon development was the clean-up: entries with typographic errors or grammatical mistakes were eliminated. The result is what constitutes PerLex – a morphological lexicon for Persian, containing 35,914 lemma-level entries that generate 524,700 form-level entries corresponding to 494,488 distinct forms. Recently, Ghayoomi (2019a) proposed an algorithm to resolve the problem of Persian tokenization. The proposed method, based on language modelling, was tested on the Farsi Linguistic Database (FLDB) and achieved a 72.04 % correction rate on the errors in the test set, with 97.8 % accuracy and 0.02 % error production in the spelling.
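As a toy illustration of the boundary-resolution problem that Persian spacing errors create (this is not Ghayoomi's language-model method), the sketch below recovers token boundaries in a string whose spaces were lost, using a made-up vocabulary (English stand-ins for readability) and dynamic programming over split points:

```python
# Toy word segmentation: recover token boundaries from a space-less string
# by dynamic programming over split points. VOCAB is a stand-in lexicon.

VOCAB = {"the", "cat", "sat", "on", "a", "mat"}

def segment(s: str, vocab=VOCAB):
    """Return one segmentation of s into vocabulary words, or None."""
    n = len(s)
    ok = [False] * (n + 1)      # ok[i]: prefix s[:i] is segmentable
    back = [None] * (n + 1)     # back[i]: start of the word ending at i
    ok[0] = True
    for i in range(1, n + 1):
        for j in range(i):
            if ok[j] and s[j:i] in vocab:
                ok[i], back[i] = True, j
                break
    if not ok[n]:
        return None
    words, i = [], n
    while i > 0:                # walk back-pointers to read off the words
        words.append(s[back[i]:i])
        i = back[i]
    return list(reversed(words))
```

A statistical tokenizer replaces the boolean "is this a word?" test with probabilities from a language model, so that the most likely segmentation wins rather than an arbitrary valid one.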

1.2.3 Part of Speech Tagging

Part-of-speech (POS) tagging is the task of labeling words in a sentence with their categories. It is a basic step in NLP and constitutes a part of many other NLP tasks such as spell checking, text-to-speech, automatic speech recognition systems, machine translation, etc. (Voutilainen 2003). Persian researchers identify a number of challenges for automated POS tagging, e. g., word disambiguation, recognizing compound verbs, named entity recognition (identifying proper names), or the presence of multiword expressions, as well as the fact that Persian has free word order (Pozveh, Monadjemi, and Ahmadi 2016). As POS tagging is part of many NLP tasks, numerous research groups have developed taggers for their work based on various approaches. TagPer (Seraji 2011) is a reimplementation of the statistical HunPOS tagger (Halácsy, Kornai, and Oravecz 2007). Seraji included a morphological analyzer in order to address the morphological complexity of the Persian language and trained the system on the Bijankhan corpus. Mohseni and Minaei (2010) also recognized Persian morphology to be an


obstacle in efficient POS tagging of Persian texts, so they reduced the number of tags defined in the Bijankhan corpus and lemmatized the words in order to avoid a situation where words with the same lemma are interpreted differently. A similar approach was taken to develop FarsiTag (Rezai and Miangah 2016), whose authors defined morphological rules along with syntactic rules in text preprocessing. An additional output of this research is a POS-tagged parallel English-Persian corpus. Yet another idea for POS tagging is to use artificial neural networks (ANN) (Pozveh, Monadjemi, and Ahmadi 2016). The group assumed that deep learning would be more efficient in handling Persian complexities. They used the Elman (1990) network to tag unknown and ambiguous words, the Fuzzy C-means algorithm to cluster the documents, and tag sequence statistics within clusters to annotate the POS. Combining the algorithm and statistics with a neural network reduced the ANN complexity and its learning time, and semantic input improved disambiguation. Table 1.1 presents the accuracy results of the above taggers:

Table 1.1: Taggers comparison.

POS Tagger          Accuracy
TagPer              96.9 %
Mohseni & Minaei    90.2 %
FarsiTag            98.6 %
Hosseini Pozveh     96.17 %

It needs to be noted, however, that the evaluation of these taggers was based on different data sets and should therefore not be used to conclusively prove the advantage of one tagger over the others.
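For orientation, statistical taggers like those above are conventionally compared against a most-frequent-tag baseline: each word receives the tag it carried most often in training, and unseen words fall back to the corpus-wide majority tag. A minimal sketch, trained on a tiny made-up set of romanized Persian word-tag pairs (not any of the corpora discussed here):

```python
from collections import Counter, defaultdict

# Most-frequent-tag baseline. TRAIN is invented toy data in romanized
# Persian, standing in for a POS-annotated corpus such as Bijankhan.
TRAIN = [
    [("in", "DET"), ("ketab", "N"), ("ast", "V")],    # "this book is"
    [("an", "DET"), ("ketab", "N"), ("nist", "V")],   # "that book is-not"
]

def train(tagged_sentences):
    per_word = defaultdict(Counter)   # tag counts per word form
    overall = Counter()               # corpus-wide tag counts
    for sent in tagged_sentences:
        for word, tg in sent:
            per_word[word][tg] += 1
            overall[tg] += 1
    model = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return model, overall.most_common(1)[0][0]

def tag(words, model, fallback):
    """Tag each word with its most frequent training tag, else fallback."""
    return [(w, model.get(w, fallback)) for w in words]

MODEL, FALLBACK = train(TRAIN)
```

Real taggers add context (tag-sequence probabilities, morphology, embeddings); the baseline shows how much of the accuracy numbers above comes from lexical frequency alone.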

1.2.4 Syntactic Parsing

Syntactic parsing is the analysis of a string of words (usually a sentence) with respect to a formal grammar, aiming to determine the string's structural description. Parsing is usually a step in more complex analysis processes rather than an end in itself. Prior to syntactic parsing, the text usually undergoes two steps of preprocessing: tokenization and lexical analysis. The output of this step is an interpretable syntactic structure (Ljunglöf and Wirén 2010). Seraji, Megyesi, and Nivre (2012a) trained two existing open-source dependency parsers: MaltParser (Nivre, Hall, and Nilsson 2006) and MSTParser (McDonald et al. 2005), both acknowledged by the community of the Conference on Computational

Natural Language Learning as the most accurate and applied for numerous languages. The Uppsala Persian Dependency Treebank (Seraji, Megyesi, and Nivre 2012b), a syntactically annotated corpus of 1,000 sentences from various sources, was used as the input data set, annotated with gold-standard POS tags from Université Paris-Est Créteil (UPEC) using TagPer, an automatic morphological annotator (Seraji, Megyesi, and Nivre 2012a).

MaltParser is a data-driven dependency parser developed by Nivre, Hall, and Nilsson (2006) based on the inductive dependency parsing approach. It is universal and can be optimized and customized to develop a parser for a new language; all it needs is a treebank. Prior to parsing the Persian data set, the research team used MaltOptimizer (Ballesteros and Nivre 2012) to train the tool and ensure the most applicable algorithms and features were used. MSTParser (McDonald et al. 2005) takes a different approach to parsing. It was developed with a view to graph-based dependency parsing. This parser "extracts the highest-scoring spanning tree from a complete graph containing all possible dependency arcs, using a scoring model that decomposes into scores for smaller subgraphs of a tree" (Seraji, Megyesi, and Nivre 2012b: 3).

During the validation phase, 90 % of the data was used to train the parsers and establish the most efficient and accurate algorithms and feature sets. The remaining 10 % of the data was used for the final test. In all experiments, MaltParser proved to be more accurate. Both parsers achieved better results in the final test when using the UPEC gold-standard POS tags in comparison to the automatically generated tags; however, the tags generated by TagPer gave better results in training (Seraji, Megyesi, and Nivre 2012a). While working on converting an HPSG-based treebank into a parallel dependency-based treebank, Ghayoomi and Kuhn (2014) trained the MATE dependency parser, which significantly outperformed MaltParser.
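The graph-based strategy quoted above can be illustrated by brute force on a tiny example: every word picks a head (0 is the artificial ROOT), cyclic assignments are discarded, and the candidate tree whose summed arc scores are highest wins. Real parsers such as MSTParser use the Chu-Liu/Edmonds algorithm rather than enumeration, and the arc scores below are invented purely for the illustration:

```python
from itertools import product

# Brute-force "highest-scoring spanning tree" over all head assignments
# for a tiny sentence. score maps (head, dep) -> arc score.

def best_tree(n, score):
    """n words (1..n); returns {dep: head} for the best-scoring tree."""
    best, best_score = None, float("-inf")
    for heads in product(range(n + 1), repeat=n):  # heads[i-1]: head of word i
        tree = {dep: heads[dep - 1] for dep in range(1, n + 1)}
        if any(dep == head for dep, head in tree.items()):
            continue  # no self-loops
        def reaches_root(dep):
            # following head links from dep must reach ROOT without a cycle
            seen = set()
            while dep != 0:
                if dep in seen:
                    return False
                seen.add(dep)
                dep = tree[dep]
            return True
        if not all(reaches_root(d) for d in tree):
            continue
        total = sum(score.get((h, d), float("-inf")) for d, h in tree.items())
        if total > best_score:
            best_score, best = total, tree
    return best

# Hypothetical arc scores for a toy two-word sentence (words 1 and 2).
scores = {(0, 1): 10.0, (1, 2): 8.0, (0, 2): 3.0, (2, 1): 2.0}
```

Here the winning tree attaches word 1 to ROOT and word 2 to word 1 (score 18), beating the flat alternative that attaches both words to ROOT (score 13). Enumeration is exponential in sentence length, which is why production parsers need the polynomial-time algorithm.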

1.2.5 Treebank Annotation

A treebank is a syntactically annotated corpus (Hajičová et al. 2010). In addition to part-of-speech and other morphological annotations, treebanks can also contain syntactic, semantic, and sometimes even intersentential relations. As word order in Persian is free, researchers stipulate that dependency grammar is a reasonable choice as a basis for Persian treebanks (Rasooli et al. 2011; Seraji et al. 2014). In order to create a Persian dependency treebank, Rasooli et al. (2011) set out to develop a verb valency dictionary. It contains obligatory and optional complements of 4,282 Persian verbs. The results show the total number of valencies to be 5,429, the


average number of distinct valencies per verb to be 1.268, and the maximum number of valencies per verb to be 5. The process of creating the dictionary involved preprocessing of the corpora (POS tagging and lemmatization), choosing candidates for compound verbs, and manual valency tagging. The same research team then created a Persian dependency treebank (Rasooli, Kouhestani, and Moloodi 2013) containing 30,000 sentences annotated with syntactic roles, morpho-syntactic features, lemma, POS tags, person, number, and tense-mood-aspect. Statistics of the treebank show that 39.24 % of the words in the treebank are tagged as nouns, 12.62 % as verbs, 11.64 % as prepositions, and 7.39 % as adjectives. The most frequent dependency relations are post-dependent (15.08 %) and the Ezafeh construction (10.17 %). Rasooli, Kouhestani, and Moloodi (2013) have also carried out comprehensive research on Persian grammar from the Dependency Grammar perspective. The Uppsala Persian Dependency Treebank was developed by Seraji et al. (2014). It contains 6,000 sentences extracted from various sources, such as news, fiction, technical texts, or descriptions, from the Uppsala Persian Corpus (Seraji, Megyesi, and Nivre 2012b). The annotation scheme in this treebank is based on the Stanford Typed Dependencies, customized in order to address Persian-specific issues. As byproducts of this work, the research group has developed a text normalizer, a sentence segmenter and tokenizer, a part-of-speech tagger, and a parser (Seraji et al. 2014).
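Dependency treebanks of this kind are typically distributed in tab-separated, CoNLL-style column formats. The sketch below reads a simplified record (a reduced, hypothetical field set: ID, FORM, LEMMA, POS, HEAD, DEPREL) for the romanized Persian sentence man raftam 'I went'; actual treebank schemes, including the Uppsala treebank's, carry more columns:

```python
# Reading a simplified CoNLL-style sentence block: one token per line,
# tab-separated fields ID, FORM, LEMMA, POS, HEAD, DEPREL (reduced set).

SAMPLE = """\
1\tman\tman\tPRON\t2\tnsubj
2\traftam\traftan\tVERB\t0\troot"""

def read_sentence(block: str):
    tokens = []
    for line in block.splitlines():
        idx, form, lemma, pos, head, rel = line.split("\t")
        tokens.append({"id": int(idx), "form": form, "lemma": lemma,
                       "pos": pos, "head": int(head), "rel": rel})
    return tokens

def root(tokens):
    """The token attached to the artificial ROOT (HEAD == 0)."""
    return next(t for t in tokens if t["head"] == 0)
```

The HEAD column encodes the dependency tree: here the verb raftam heads the sentence, and the pronoun man depends on it as subject.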

1.2.6 Semantic Analysis

Semantic analysis in NLP serves the purpose of understanding utterances through the analysis of words, fixed expressions, and sentences in context. The process of automatic understanding of meaning is commonly based on extracting the most important words in the utterance and identifying the relations between them, considering the neighboring utterances (Goddard and Shalley 2010). Semantic analysis has a large number of applications: understanding user queries and matching user requirements to available data, information retrieval, text summarization, data mining, machine translation, translation aids, etc. Unsupervised methods of semantic analysis, as well as of other NLP tasks, are more cost-effective and less demanding, as no annotated corpora are required to complete them (Aminian, Rasooli, and Sameti 2013). The first semantic classification of Persian verbs was developed by Aminian, Rasooli, and Sameti (2013). Verb classification commonly requires syntactic information; however, as it often proves to be insufficient, the researchers complemented syntactic information with restrictions on verb selection imposed both on the syntactic and semantic levels. Assuming Levin's verb taxonomy (verbs in the same semantic classes are expected to have similar syntactic behaviors), the

research team clustered Persian verbs based on syntactic information from dependency trees and classified 246 verbs into 43 classes. Two algorithms were examined during the experiment, and Spectral Clustering turned out to be more effective than K-means in this case. The results also show that prepositions carry valuable information for distinguishing semantic clusters of Persian verbs.

Semantic analysis also benefits from specifying semantic roles (situation-specific relations between predicates and their arguments) (Goddard and Shalley 2010). The first semantic role labeling (SRL) of Persian verbs was carried out by Saeedi, Faili, and Shakery (2014). The researchers proposed an unsupervised, probability-based semantic role induction model. They used generative models for statistical semantic parsing. The process of role annotation was based on a Bayesian network. They also used the expectation-maximization (EM) algorithm to estimate the probability of linking arguments to semantic clusters. The output of the research is a link table of argument features mapped to semantic clusters according to predicates for 32 Persian verbs.

An interesting approach to automatically identifying word senses in Persian was presented by Ghayoomi (2019b). He used word embeddings and proposed two models – a sentence-based and a context-based one – in order to obtain vectors of sentences containing the target (ambiguous) word. An external evaluation showed that the context-based model obtained better results.
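The intuition behind such syntactic verb clustering can be sketched with cosine similarity over preposition co-occurrence vectors. The counts below are made up, and the greedy grouping is only a stand-in for the spectral clustering used in the cited work; it merely illustrates why verbs with similar prepositional frames end up in the same class:

```python
import math

# Toy verb clustering by syntactic context. VERB_VECTORS holds invented
# preposition co-occurrence counts (English glosses for readability).
VERB_VECTORS = {
    "go":    {"to": 9, "from": 4},
    "come":  {"to": 8, "from": 5},
    "talk":  {"about": 7, "with": 6},
    "argue": {"about": 6, "with": 5},
}

def cosine(u, v):
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def cluster(vectors, threshold=0.8):
    """Greedy grouping: a verb joins the first cluster whose every member
    is at least `threshold`-similar to it, else starts a new cluster."""
    clusters = []
    for verb, vec in vectors.items():
        for group in clusters:
            if all(cosine(vec, vectors[m]) >= threshold for m in group):
                group.append(verb)
                break
        else:
            clusters.append([verb])
    return clusters
```

With these vectors, the motion verbs group together and the communication verbs group together, because their prepositional contexts barely overlap.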

1.3 Applications of Persian Natural Language Processing

The following section discusses some of the interesting applications of Persian NLP.

1.3.1 Spell-Checking

Spell checking is one of the most widely used NLP applications. At the same time, it is often part of text normalization in the NLP pipeline. Spell-checking tools are almost always incorporated in software that involves typing. The task consists of two main steps: error detection and error correction. The latter is realized in three steps: generation of substitute words, validation of substitutes against a dictionary, and ranking of suggestions (Naseem 2004). Originally, researchers proposed to incorporate models of common misspellings (e. g., Damerau 1964); nowadays, however, probabilistic models are more popular.

1 Research in Persian Natural Language Processing – History and State of the Art


Based on error analysis, scientists have created a number of spell-checking techniques, such as Edit Distance by Wagner and Fischer (1974), which identifies the minimum number of changes needed to convert a misspelled string into one defined in a dictionary. Damerau (1964) created a similar concept, but also carried out research showing that 80 % of single-character errors are caused by inserting a character, deleting a character, substituting a character, or swapping a character with its neighboring character. Erikson (1997) built on Damerau's idea and developed a technique allowing multiple errors within one word. Kukich (1992) then proved that, contrary to her predecessors' assumptions, character deletion and insertion should not be treated as equally probable. According to her research, most substitution errors are caused by pressing neighboring keys on the keyboard. A number of other approaches followed that facilitated the creation of more spell-checking techniques: phonetics-based, similarity-key, or probabilistic. All these paved the way to spell-checking models based on probable error (Kukich 1992). Jurafsky and Martin (2008) described the use of the Noisy Channel Model for spell checking. The model itself was developed by Shannon (1948) and has since been amended by researchers from various fields. It currently has many applications in machine translation, speech recognition, and spell checking. The above methods address isolated word errors as defined by Kukich. Non-word errors and context-dependent errors require incorporating additional syntactic and semantic analysis (Kukich 1992).

There are a number of spell checkers available for Persian, such as Virastyar, used in Microsoft Word, which is generally well received, and Vira by Sepanta Institute, which according to Miangah (2013) requires some improvements in the area of Persian morphology, as it tends to return false-positive errors when dealing with Persian compounds.
Another example of an MS Word add-on is Vafa Spell Checker (Faili et al. 2014), a hybrid system using statistical approaches to detect and correct spelling and real-word errors and a rule-based approach to detect and correct grammar mistakes. Yet another spell checker for Persian is FarsiSpell by Miangah (2013), based on a large monolingual Persian corpus and aiming to use context for improved results. Based on their evaluation, Sheykholeslam, Minaei-Bidgoli, and Juzi (2012) proposed the Noisy Channel Model (NCM) as a Persian spell-checking framework. The research group compared the Noisy Channel Model with the Jaro-Winkler (involving frequency-based ranking) and Damerau-Levenshtein techniques, of which NCM showed the best first-suggestion accuracy for Persian. Based on the assumption that users may have different error patterns, Rasooli, Kashefi, and Minaei-Bidgoli (2011) proposed an interesting approach to developing an adaptive spell-checking method. It is dynamic in that it can adapt to different policies for generating respellings and ranking suggestions. The method is enhanced by additional analysis of the user's idiolect: for ranking the misspelling suggestions, it

considers word frequency in current and processed documents as well as statistics of typographical errors in processed documents. In order to reduce the complexity of the tool, Rasooli et al. (2011) set the limit of considered misspellings to 30, updated after every 10 misspelling corrections. The research presented in this work proves the method to be more accurate than non-adaptive spell-checking techniques.
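The edit-distance idea underlying several of the spell checkers above can be sketched as follows: a minimal illustration of the Wagner-Fischer dynamic program extended with Damerau's adjacent transposition, ranking dictionary candidates by distance. The romanized lexicon is invented for illustration.

```python
# Toy spell-checking sketch: Damerau-Levenshtein distance plus a
# distance-ranked suggestion list over an invented romanized lexicon.
def damerau_levenshtein(s, t):
    """Minimum number of insertions, deletions, substitutions, and
    adjacent transpositions needed to turn s into t."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(s)][len(t)]

def suggest(word, lexicon, n=3):
    """Rank lexicon entries by edit distance to the (misspelled) input."""
    return sorted(lexicon, key=lambda w: damerau_levenshtein(word, w))[:n]

lexicon = ["ketab", "kebab", "kar", "ketabxane"]  # invented romanized entries
print(suggest("ketba", lexicon))  # "ketab" is one transposition away
```

A probabilistic spell checker in the noisy-channel style would replace the plain distance ranking with a score combining a channel model (how likely the typo is) and a language model (how likely the candidate word is).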

1.3.2 Information Retrieval

Information retrieval (IR) can be defined as a domain dealing with "representation, storage, organization of, and access to information items. These information items could be references to real documents, documents themselves, or even single paragraphs, as well as Web pages, spoken documents, images, pictures, music, video, etc." (Baeza-Yates and Ribeiro-Neto 1999: 1). An IR model defines the format of the source and queries as well as the strategy used to obtain the information. The numerous IR models can be divided into the following types: set-theoretic (treating documents and queries as sets of words and phrases), algebraic (considering documents and queries to be vectors, matrices, etc.), probabilistic (computing similarities between documents and queries as probabilities), and feature-based (where documents are defined as vectors of feature values) (Hobbs and Riloff 2010). The classical and probably most widely adopted model is the set-theoretic standard Boolean model (Cleverdon 1984), in which documents are represented by a set of manually indexed keywords.

The available research shows that the use of NLP techniques in information retrieval improves search results (Sheridan and Smeaton 1992). NLP tasks can be used to improve monolingual information retrieval. According to Karimpour et al. (2008), Persian IR systems perform better when they incorporate part-of-speech tagging and stemming. In their experiment, the group trained a Trigrams'n'Tags (TnT) (Brants 2000) POS tagger with 40 tags they identified as the most important for Persian document retrieval. Additionally, both the corpus and the queries were stemmed, which significantly improved precision. It has been observed that the classical Boolean model for information retrieval lacks flexibility.
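The set-theoretic Boolean model can be sketched in a few lines; this is a toy illustration with invented documents, not a production IR engine.

```python
# Minimal sketch of the set-theoretic Boolean model: documents become
# sets of index terms and a conjunctive query is a set intersection.
# The documents below are invented.
docs = {
    1: "tehran news economy",
    2: "tehran football news",
    3: "economy inflation report",
}

index = {}  # inverted index: term -> set of document ids
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    """Documents containing every query term (Boolean AND)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

print(boolean_and("tehran", "news"))  # documents 1 and 2
```

Queries here are strict set intersections, which illustrates the rigidity noted above: a document either matches or it does not, with no ranking and no partial matches.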
In order to address this issue in Persian IR, Nayyeri and Oroumchian (2006) experimented with incorporating fuzzy logic into information retrieval and developed the FuFaIR (Fuzzy Farsi Information Retrieval) system. The results of the experiment show improved performance in comparison with other available models. The accessibility of multilingual information sources inspires researchers to develop and improve cross-language information retrieval (CLIR) systems, i. e., systems in which the query language is different from the document language, especially for low-resource languages such as Persian. Such systems enable multilingual


users to enter queries in one language and retrieve documents in another. Combined with machine or human translation, CLIR can also benefit monolingual users. Alizadeh, Davarpanah, and Fattahi (2010) propose to use NLP techniques in order to improve Persian-English CLIR. They introduce morphological analysis into the query translation process in order to address the poor retrieval results for Persian queries containing affixed words, which are not identified by CLIR dictionaries and are therefore omitted in the process of translation. Additionally, the group also examines the effectiveness of tokenization and Persian part-of-speech tagging in query translation and shows that CLIR benefits from these tasks.
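The role of morphological analysis in dictionary-based query translation can be illustrated with a hedged toy: an affixed query term that misses the bilingual dictionary is retried after stripping common suffixes. The dictionary and suffix list below are invented (here only a romanized plural "-ha" and the suffix "-i"); a real analyzer covers far more Persian affixes.

```python
# Toy sketch of dictionary-based CLIR query translation with suffix
# stripping. FA_EN and SUFFIXES are invented illustrations, not a real
# lexical resource.
FA_EN = {"ketab": "book", "eqtesad": "economy", "xane": "house"}
SUFFIXES = ["ha", "i"]  # toy list; real morphological analysis is richer

def translate_term(term):
    """Look the term up directly, then retry with suffixes stripped."""
    if term in FA_EN:
        return FA_EN[term]
    for suf in SUFFIXES:
        if term.endswith(suf) and term[: -len(suf)] in FA_EN:
            return FA_EN[term[: -len(suf)]]
    return None  # out-of-vocabulary: would otherwise be silently dropped

query = ["ketabha", "eqtesad"]  # romanized "books economy"
print([translate_term(t) for t in query])  # ['book', 'economy']
```

Without the stripping step, the affixed form "ketabha" would miss the dictionary entirely, which is exactly the failure mode the morphological analysis is meant to fix.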

1.3.3 Machine Translation

With the world becoming a global village, there is increasing demand for machine translation. Easy access to resources and international relations create the need to speak multiple languages. For the majority of the population, machine translation is a convenient choice, hence researchers' interest in fulfilling this requirement. There have been several milestones in the history of machine translation. The very idea dates back to 1947, when Weaver proposed the statistical approach in an informal letter to his acquaintance, Norbert Wiener. The Mathematics of Statistical Machine Translation (Brown et al. 1993) is considered one of the most important works in the field. While researchers around the world came up with many interesting solutions to the problems set out by the statistical approach to machine translation, Bengio et al. (2003) developed the first language model based on a neural network, which paved the way for Neural Machine Translation (NMT). Another step that followed was the development of the encoder-decoder architecture, using a Convolutional Neural Network to encode the source text and a Recurrent Neural Network to decode the source vector into the target language (Kalchbrenner and Blunsom 2013). NMT has been developed and improved ever since. While there is still room for improvement, according to recent research, neural machine translation outperforms statistical machine translation on many grounds: it can train multiple features, does not require prior domain knowledge, produces better sentence structure, and reduces the morphology, syntax, and word-order errors commonly seen in SMT. Research involving Persian has followed both of the above approaches.

1.3.3.1 Statistical Machine Translation

There are at least five English-Persian SMT systems developed so far:
– Shiraz project (Amtrup et al. 2000), a Persian-to-English translation tool composed of a dictionary, a morphological analyzer, and a syntactic parser, which uses a unification-based formalism
– A rule-based English-to-Persian translation system developed by Faili and Ghassem-Sani (2004), Faili and Ghassem-Sani (2005), and Faili (2009), based on a tree-adjoining grammar (TAG) formalism with a word-sense disambiguation module and a statistical parser
– PEnTrans, a rule-based English-to-Persian and Persian-to-English translator by Saedi, Shamsfard, and Motazedi (2009), which also uses semantic approaches for better results
– Google Translate (available as an SMT system from 2009 to 2016), a web-based, bilateral Persian/English SMT engine
– PersianSMT by Pilevar and Faili (2010), a system developed by training statistical models on monolingual and parallel corpora derived from movie subtitles; it involves phrase and lexical translation probabilities, a lexicalized distortion model, word and phrase penalties, and a target language model

Problems commonly observed in English/Persian statistical machine translation include the systems' inability to translate compound verbs, which are extensively used in Persian; according to Pilevar and Faili (2010), this is caused by Persian's highly inflectional morphology. Also, the use of pronouns in Persian is optional, which additionally complicates the task. Word order proves to be problematic for English/Persian SMT systems as well. Interestingly, Pilevar and Faili (2010) did manage to train PersianSMT to correctly identify and translate idioms and to detect the right word sense.

1.3.3.2 Neural Machine Translation

There is significantly less research available on Persian Neural Machine Translation (NMT), as the approach only became popular several years ago. The first attempt to adapt Google's TensorFlow MT model (2015) to Persian-English translation was made by Bastan, Khadivi, and Homayounpour (2017). The researchers introduced some Persian-specific text preprocessing, such as tokenization of adherent words. Additionally, they implemented an alignment feature typically used in SMT in the NMT model and developed a cost function that indicates the difference between the NMT and SMT alignments. By doing this, Bastan, Khadivi, and Homayounpour (2017) improved the accuracy of the translation results and reduced the convergence time. Another interesting approach was taken by Zaremoodi and Haffari (2018). They proposed a forest-to-sequence attentional NMT model for Persian, which introduces


syntactic information into NMT models in the form of multiple parse trees of the source sentence. The purpose of this addition is to build on the preceding work of Eriguchi, Hashimoto, and Tsuruoka (2016) and to address translation mistakes caused by using only the 1-best parse tree of the source sentence. Experiments carried out by the research group show that this method is superior to the tree-to-sequence model.
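The attention mechanism at the heart of such attentional NMT models can be sketched with a toy dot-product example: the decoder state queries the encoder states and receives a softmax-weighted summary of the source. The vectors here are invented two-dimensional toys; real systems use learned, high-dimensional states.

```python
# Minimal dot-product attention sketch: score each encoder state against
# the decoder query, softmax the scores, and mix the values accordingly.
import math

def attend(query, keys, values):
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]  # softmax over source positions
    context = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
    return weights, context

enc = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # invented encoder states
dec = (1.0, 0.0)                            # invented decoder query
weights, context = attend(dec, enc, enc)    # keys double as values here
```

Source positions whose states align with the current decoder state receive higher weight, which is what lets the model attend to different parts of the source (or, in the forest-to-sequence case, to nodes of several parse trees) at each output step.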

1.3.4 Sentiment Analysis

Sentiment analysis (SA) is, in other words, the study of opinions. People's opinions about products and services support businesses' and individuals' decision-making, hence the growing interest in the SA domain. Researchers tend to choose one of the two following methodologies: machine learning or lexicon-based techniques, the latter using lexicons of sentiment-bearing words. Both have advantages and drawbacks. Machine learning methods enable the identification of non-sentiment phrases that imply sentiment, but require manually annotated corpora for training, which is time-consuming. Lexicon-based techniques can be developed automatically; however, according to Basiri, Naghsh-Nilchi, and Ghassem-Aghaee (2014), they are not as efficient, at least in the case of the Persian language.

There is no doubt that a good-quality data set is the foundation of a well-designed and efficient SA model. SentiPers (Hosseini et al. 2018) is an open-source corpus of formal and informal contemporary Persian created for sentiment analysis. It contains 26,000 sentences extracted from the Internet and manually annotated. It is very thorough, as it provides polarity analysis on the document, sentence, and entity/aspect levels. Another resource used in the process of sentiment analysis is a polarity lexicon, which contains sentiment-bearing words rated accordingly. Researchers have taken different approaches to creating Persian polarity lexicons. PersianClues (Shams, Shakery, and Faili 2012) was created by iterative machine translation of an English polarity lexicon. Afterwards, the research group derived polar sets from the clues and assigned each document a polarity using a classification algorithm. SentiFars (Dehkharghani 2019) is a Persian lexicon created by machine translation; it contains English polarity lexicons automatically translated into Persian.
Potential translation errors were compensated for by combining several English resources instead of one and comparing the results for the same words. All terms were translated using a supervised machine learning method, and afterwards the Persian terms were manually annotated with polarity. The lexicon is domain-independent.

Researchers have also developed sentiment analysis frameworks for Persian. Bagheri and Saraee (2014) combined the NLP task of lemmatization (removing inflectional affixes) with the Naive Bayes machine learning method as a classifier.

They also proposed a novel technique for feature selection, Modified Mutual Information (MMI), which uses positive and negative factors on features and classes. A framework developed by Basiri, Naghsh-Nilchi, and Ghassem-Aghaee (2014) also presents a combined approach: an unsupervised, lexicon-based method with NLP tasks added to address Persian-specific issues. One such issue is the fact that the same words may be spelled differently, some characters may be omitted, and word boundaries are not always clear in Persian due to the popular use of the semi-space within words. In order to mitigate this issue, they created a word normalization module. Another obstacle to text normalization is the presence of spelling mistakes and informal spellings in the source documents, so they included a spell checker. The next step of document pre-processing in this framework is stemming, which deals with Persian's complex morphology and inflection. Additionally, the research group proposed removing stop words that do not affect sentiment, to improve accuracy and reduce the size of the analyzed document. In this framework, each sentence of a document is analyzed separately, so the research group added a step to split multi-sentence documents. Once all the described preprocessing steps are completed, sentence polarity is detected using the SentiStrength library (Thelwall et al. 2010) translated into Persian; it contains a lexicon, a booster list, emoticons, idioms, and negations. The final step of the framework is multi-step data fusion from all sentences and the production of an overall score (Basiri, Naghsh-Nilchi, and Ghassem-Aghaee 2014).

More recently, Ghasemi, Ashrafi Asl, and Momtazi (2020) have developed a cross-lingual deep learning framework that benefits from English training data in order to improve sentiment analysis in Persian.
Their experiments with Amazon (English) and Digikala (Persian) datasets showed that the proposed method significantly outperformed state-of-the-art monolingual approaches.
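The lexicon-based pipeline described above (sentence splitting, lexicon lookup with simple negation handling, and fusion of sentence scores) can be sketched as follows; the polarity lexicon is invented and far smaller than SentiStrength's actual resources.

```python
# Toy lexicon-based sentiment scorer: split the document into sentences,
# score each against a tiny invented polarity lexicon with basic negation
# handling, then fuse the sentence scores into an overall polarity.
import re

LEXICON = {"good": 1, "great": 2, "bad": -1, "awful": -2}
NEGATORS = {"not", "never"}

def sentence_score(sentence):
    score, negate = 0, False
    for tok in sentence.lower().split():
        if tok in NEGATORS:
            negate = True           # flip the polarity of the next hit
        elif tok in LEXICON:
            score += -LEXICON[tok] if negate else LEXICON[tok]
            negate = False
    return score

def document_polarity(text):
    sentences = re.split(r"[.!?]+", text)
    return sum(sentence_score(s) for s in sentences if s.strip())

print(document_polarity("The screen is great. The battery is not good."))
```

Even this toy shows why the preprocessing steps matter for Persian: without normalization of spelling variants and semi-space word boundaries, the lexicon lookup would simply miss many sentiment-bearing tokens.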

1.4 Persian Corpora and NLP Resources

A corpus is a data set consisting of structured texts, usually annotated. In NLP, corpora can be used to train and test solutions or to analyze language structure. Corpora can be monolingual or parallel (containing the same content in two languages). With Persian NLP on the rise, multiple useful language resources have been created, many of which are freely available for research purposes (Megerdoomian 2018). Many of the resources discussed below were mentioned in the previous sections of this chapter as examples of NLP task applications. The purpose of this section is to gather and summarize them for reference.


1.4.1 Corpora

The majority of corpora created by NLP practitioners are monolingual. They can be used, e. g., for information retrieval, semantic analysis, and defining language structures, as well as for identifying cross-register differences or new language trends. Below is a list of the Persian corpora commonly referenced in NLP research:
– Farsi Linguistic Database (FLDB) (Assi 1997) is the first Persian corpus, developed at the Institute for Humanities and Cultural Studies in Tehran; its more recent version (known as the Persian Linguistic Database) contains 56 million words (Assi 2005)
– BijanKhan (BijanKhan 2004), intended for training and evaluating part-of-speech tagging systems; it consists of news and common texts with 2.6 million tokens tagged manually
– Peykare (Bijankhan et al. 2011), a continuation of the work on the Bijankhan corpus, intended for language modeling; it contains 35,058 texts with over 100 million words and covers five language varieties: standard-formal, standard-informal, super-standard-formal, super-standard-informal, and sub-standard-informal; it was lemmatized, tokenized, POS-tagged, and labeled by topic
– Hamshahri (created by Darrudi, Hejazi, and Oroumchian 2004 and later improved by Aleahmad et al. 2009), intended for information retrieval systems, consisting of Hamshahri newspaper articles collected between 1996 and 2002; its second version contains over 300,000 texts with 417,339 unique words; the texts are tagged by category
– Tehran Monolingual Corpus,2 an open-source corpus developed at the University of Tehran; it is the Hamshahri corpus improved by tokenization and spell-checking, combined with resources from the ISNA news agency; it contains 250 million tokens (300,000 unique words)
– SentiPers (Hosseini et al.
2018), intended for sentiment analysis, an open-source corpus of formal and informal contemporary Persian; it contains 26,000 sentences extracted from the Internet and manually annotated; it provides polarity analysis on the document, sentence, and entity/aspect levels
– MirasText (Sabeti et al. 2018), an automatically generated Persian corpus created by crawling 250 websites; the topic categories are mostly news, but also economic, technological, industrial, and social; the corpus contains 2,835,414 documents with a total of 1,429,878,960 words

2 https://ece.ut.ac.ir/en/node/940

Multilingual corpora are helpful in machine translation applications and cross-language comparative analysis. Some examples of Persian-English parallel data sets are:
– TEP: Tehran English-Persian (Pilevar, Faili, and Pilevar 2011), the first freely released large-scale English-Persian parallel corpus, created from mined movie subtitles; it contains 1,200 subtitle pairs (one Persian and one English version of the subtitles of the same movie version) with 120,405 Persian and 72,474 corresponding English unique words
– Shiraz corpus (Amtrup et al. 2000), a parallel tagged corpus consisting of 3,000 Persian sentences collected from the Hamshahri newspaper together with their corresponding, manually translated English sentences

1.4.2 Other Resources

Aside from the corpora presented above, NLP practitioners also create other resources useful for research and machine learning purposes, for example treebanks (syntactically annotated corpora), such as those mentioned in this chapter:
– Uppsala Persian Dependency Treebank (Seraji et al. 2014) contains 6,000 sentences extracted from various sources, such as news, fiction, technical texts, and descriptions, from the Uppsala Persian Corpus (Seraji, Megyesi, and Nivre 2012c); the annotation scheme in this treebank is based on the Stanford Typed Dependencies, customized in order to address Persian-specific issues
– Persian dependency treebank by Rasooli, Kouhestani, and Moloodi (2013), containing 30,000 sentences annotated with syntactic roles, morpho-syntactic features, lemma, POS tags, person, number, and tense-mood-aspect; 39.24 % of the words in the treebank are tagged as nouns, 12.62 % as verbs, 11.64 % as prepositions, and 7.39 % as adjectives; the most frequent dependency relations are post-dependent (15.08 %) and the Ezafeh construction (10.17 %)

Another type of reusable language resource is a lexicon (an inventory of lexemes). This chapter mentions PerLex by Sagot and Walther (2010) (a morphological lexicon for Persian containing 35,914 lemma-level entries that generate 524,700 form-level entries corresponding to 494,488 distinct forms) as well as PersianClues (Shams, Shakery, and Faili 2012) and SentiFars (Dehkharghani 2019), polarity lexicons automatically translated from English and intended for sentiment analysis. WordNet-type lexicons are databases grouped into sets of cognitive synonyms (Miller 1995; Fellbaum 1998). The idea was originally developed for English at Princeton, and researchers around the world then started employing the same principles to create similar sets for other languages. There have been numerous approaches


to creating a Persian WordNet, for example the semi-automated FarsNet (Shamsfard et al. 2010), with 13,155 words, 9,266 synsets, and 9,059 relations, or the fully automated Persian WordNet developed by Mousavi and Faili (2021), with 27,000 words, 28,000 synsets, and 67,000 word-sense pairs, which substantially outperforms the previous Persian wordnet with its roughly 16,000 words, 22,000 synsets, and 38,000 word-sense pairs.

Aside from data sets like corpora, treebanks, or lexicons, language models are increasingly important in modern NLP. One of the recent developments in this area is ParsBERT (Farahani et al. 2021), a Persian language model based on Google's transformer-based language model BERT (Devlin et al. 2019). Farahani et al. (2021) took an interesting approach in that, instead of using BERT as a multilingual model, they proposed a monolingual Persian version of it. The model outperforms the multilingual versions and other prior solutions for sentiment analysis, text classification, and named entity recognition. Alongside the model, the researchers also produced a data set and released a pre-trained version of the model.

Yet another interesting resource is Taghipour's text mining platform for Persian news agencies (Taghipour et al. 2018), a great example of both a resource and an application of Persian NLP. The platform performs the following steps:
1. Content crawling
2. HTML/JSON parsing
3. Text preprocessing
4. Similarity detection
5. Automated unified class labelling
6. News topic detection based on a multi-level clustering engine
7. Impact analysis – calculation of the influence and penetration characteristics of the content
8. Result visualization

This solution was successfully implemented for one of the Iranian news agencies.
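The similarity-detection step of such a pipeline can be approximated with a simple token-overlap measure; this is a hedged stand-in, as the platform's actual method is not specified here, and the example documents are invented.

```python
# Toy near-duplicate detection via Jaccard overlap of token sets, a
# simple stand-in for the similarity-detection stage of a news pipeline.
def jaccard(a, b):
    """Jaccard similarity of the token sets of two documents."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

d1 = "central bank announces new inflation figures"
d2 = "central bank announces inflation figures today"
d3 = "national team wins football match"
print(jaccard(d1, d2) > 0.5, jaccard(d1, d3) == 0.0)
```

Pairs scoring above a chosen threshold would be treated as near-duplicates before the topic-detection and clustering stages.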

1.5 Conclusion

Although Persian is a low-resource language compared to English and poses certain processing challenges, the quality of Persian NLP research does not differ from that of more popular languages. This chapter aimed to provide an overview of the history of and recent trends in Persian NLP. The works cited here were presented at international scientific conferences and warmly received by their audiences.

As the field of NLP as a whole is developing rapidly, so should its part focusing on the Persian language.

Bibliography

Aleahmad, Abolfazl, Hadi Amiri, Ehsan Darrudi & Farhad Oroumchian. 2009. Hamshahri: A standard Persian Text Collection. Knowledge-Based Systems 22. 382–387.
Alizadeh, Hamid, Mohammad Reza Davarpanah & Rahmatollah Fattahi. 2010. Applying Natural Language Processing Techniques for Effective Persian-English Cross-Language Information Retrieval. International Journal of Information Science and Management 8. 89–98.
Aminian, Maryam, Mohammad Sadegh Rasooli & Hossein Sameti. 2013. Unsupervised Induction of Persian Semantic Verb Classes Based on Syntactic Information. In Mieczysław A. Kłopotek, Jacek Koronacki, Małgorzata Marciniak, Agnieszka Mykowiecka & Sławomir T. Wierzchoń (eds.), Language Processing and Intelligent Information Systems – 20th International Conference, IIS 2013, Warsaw, Poland, 2013, 112–124. Berlin, Heidelberg: Springer Verlag.
Amtrup, Jan, Hamid Mansouri, Karine Megerdoomian & Remi Zajac. 2000. Persian-English Machine Translation: An Overview of the Shiraz Project. New Mexico, USA: Computing Research Laboratory, New Mexico State University.
Assi, Mostafa. 1997. Farsi linguistic database (FLDB). International Journal of Lexicography 10. 5–7.
Assi, Mostafa. 2005. PLDB: Persian linguistic database. Tehran, Iran: Institute for Humanities & Cultural Studies technical report.
Baeza-Yates, Ricardo & Berthier Ribeiro-Neto. 1999. Modern Information Retrieval. New York, USA: Association for Computing Machinery Press.
Bagheri, Ayoub & Mohamad Saraee. 2014. Persian Sentiment Analyzer: A Framework based on a Novel Feature Selection Method. Computing Research Repository (CoRR). http://arxiv.org/abs/1412.8079.
Ballesteros, Miguel & Joakim Nivre. 2012. MaltOptimizer: A System for MaltParser Optimization. In Frédérique Segond (ed.), Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, 2012, 58–62. Stroudsburg, PA, USA: Association for Computational Linguistics.
Basiri, Ehsan, Ahmad Naghsh-Nilchi & Nasser Ghassem-Aghaee. 2014. A Framework for Sentiment Analysis in Persian. Open Transactions on Information Processing 1–14.
Bastan, Mohaddeseh, Shahram Khadivi & Mohammad Mehdi Homayounpour. 2017. Neural Machine Translation on scarce-resource condition: A case-study on Persian-English. In 2017 Iranian Conference on Electrical Engineering (ICEE), Tehran, Iran, 2017, 1485–1490. Tehran, Iran: Institute of Electrical & Electronics Engineers.
Bengio, Yoshua, Réjean Ducharme, Pascal Vincent & Christian Jauvin. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research 3. 1137–1155.
BijanKhan, Mahmood. 2004. The Role of the Corpus in Writing a Grammar: An Introduction to a Software. Iranian Journal of Linguistics 19(2). 48–67.
Bijankhan, Mahmood, Javad Sheykhzadegan, Mohammad Bahrani & Masood Ghayoomi. 2011. Lessons from building a Persian written corpus: Peykare. Language Resources and Evaluation 45. 143–164.
Brants, Thorsten. 2000. TnT – A Statistical Part-of-Speech Tagger. In Sixth Applied Natural Language Processing Conference, 224–231. Seattle, Washington, USA: Association for Computational Linguistics.


Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra & Robert L. Mercer. 1993. The Mathematics of Machine Translation: Parameter Estimation. Computational Linguistics 19. 263–311.
Cleverdon, Cyril. 1984. Optimizing convenient online access to bibliographic databases. Information Services and Use 4. 37–47.
Damerau, Fred J. 1964. A Technique for Computer Detection and Correction of Spelling Errors. Communications of the Association for Computing Machinery 7(3). 171–176.
Darrudi, Ehsan, Mehrdad Hejazi & Farhad Oroumchian. 2004. Assessment of a Modern Farsi Corpus. In 2nd Workshop on Information Technology & its Disciplines (WITID), Kish Island, Iran, 2004, 73–77. Kish Island, Iran: Iran Telecommunication Research Center.
Dehkharghani, Rahim. 2019. SentiFars: A Persian Polarity Lexicon for Sentiment Analysis. Association for Computing Machinery Transactions on Asian and Low-Resource Language Information Processing 19. 1–12.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee & Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Jill Burstein, Christy Doran & Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, 2019, 4171–4186. Stroudsburg, PA, USA: Association for Computational Linguistics.
Elman, Jeffrey L. 1990. Finding Structure in Time. Cognitive Science 14. 179–211.
Eriguchi, Akiko, Kazuma Hashimoto & Yoshimasa Tsuruoka. 2016. Tree-to-Sequence Attentional Neural Machine Translation. In Katrin Erk & Noah A. Smith (eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 2016, 823–833. Stroudsburg, PA, USA: Association for Computational Linguistics.
Erikson, Klas. 1997. Approximate Swedish Name Matching – Survey and Test of Different Algorithms. Stockholm, Sweden: Stockholm School of Engineering Physics master's thesis.
Esfahbod, Behdad. 2004. Persian Computing with Unicode. The Farsi Web Project. http://www.farsiweb.info.
Faili, Heshaam. 2009. From Partial Toward Full Parsing. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2009), Borovets, Bulgaria, 2009, 71–75. Stroudsburg, PA, USA: Association for Computational Linguistics.
Faili, Heshaam, Nava Ehsan, Mortaza Montazery & Mohammad Taher Pilevar. 2014. Vafa spell-checker for detecting spelling, grammatical, and real-word errors of Persian language. Digital Scholarship in the Humanities 31. 95–117.
Faili, Heshaam & Gholamreza Ghassem-Sani. 2004. An Application of Lexicalized Grammars in English-Persian Translation. In Proceedings of the 16th European Conference on Artificial Intelligence (ECAI), Valencia, Spain, 2004, 596–600. Valencia, Spain: IOS Press.
Faili, Heshaam & Gholamreza Ghassem-Sani. 2005. Using a Decision Tree Approach for Ambiguity Resolution in Machine Translation. In Proceedings of 10th Annual Computer Society of Iran's Conference (CSICC 2005), Tehran, Iran, 2005, 252–256. Tehran, Iran: Computer Society of Iran.
Farahani, Mehrdad, Mohammad Gharachorloo, Marzieh Farahani & Mohammad Manthouri. 2021. ParsBERT: Transformer-based Model for Persian Language Understanding. Neural Processing Letters 53. 1–17.
Fellbaum, Christiane. 1998. WordNet: An Electronic Lexical Database. Cambridge, MA, USA: MIT Press.
Ghasemi, Rouzbeh, Seyed Arad Ashrafi Asl & Saeedeh Momtazi. 2020. Deep Persian sentiment analysis: Cross-lingual training for low-resource languages. Journal of Information Science 48. 449–462.
Ghayoomi, Masood. 2019a. A Tentative Method of Tokenizing Persian Corpus based on Language Modelling. Cognitive Science 14. 21–50.

20 � M. Borowczyk

Ghayoomi, Masood. 2019b. Identifying Persian Words' Senses Automatically by Utilizing the Word Embedding Method. Iranian Journal of Information Processing Management 35. 25–50.
Ghayoomi, Masood & Jonas Kuhn. 2014. Converting an HPSG-based Treebank into its Parallel Dependency-based Treebank. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), 802–809. Reykjavik, Iceland: European Language Resources Association (ELRA).
Goddard, Cliff & Andrea C. Shalley. 2010. Semantic Analysis. In Nitin Indurkhya & Fred J. Damerau (eds.), Handbook of Natural Language Processing, 92–120. Boca Raton, FL, USA: CRC Press.
Hajičová, Eva, Anne Abeillé, Jan Hajič, Jiří Mírovský & Zdeňka Urešová. 2010. Treebank Annotation. In Nitin Indurkhya & Fred J. Damerau (eds.), Handbook of Natural Language Processing, 167–188. Boca Raton, FL, USA: CRC Press.
Halácsy, Péter, Andras Kornai & Csaba Oravecz. 2007. Hunpos – an open source Trigram Tagger. In Annie Zaenen & Antal van den Bosch (eds.), Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic, 2007, 209–212. Stroudsburg, PA, USA: Association for Computational Linguistics.
Hippisley, Andrew. 2010. Lexical Analysis. In Nitin Indurkhya & Fred J. Damerau (eds.), Handbook of Natural Language Processing, 31–58. Boca Raton, FL: CRC Press.
Hobbs, Jerry R. & Ellen Riloff. 2010. Information Extraction. In Nitin Indurkhya & Fred J. Damerau (eds.), Handbook of Natural Language Processing, 511–532. Boca Raton, FL: CRC Press.
Hojjat, Hadi Amiri Hossein & Farhad Oroumchian. 2007. Investigation on a feasible corpus for Persian POS tagging. In International Computer Conference, Computer Society of Iran, Tehran, Iran, 2007, 4–9. Tehran, Iran: Iranian Computer Society.
Hosseini, Pedram, Ali Ahmadian Ramaki, Hassan Maleki, Mansoureh Anvari & Seyed Abolghasem Mirroshandel. 2018. SentiPers: A Sentiment Analysis Corpus for Persian. Computing Research Repository (CoRR). http://arxiv.org/abs/1801.07737.
Jurafsky, Daniel & James Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Englewood Cliffs, NJ: Prentice Hall, Pearson Education International.
Kalchbrenner, Nal & Phil Blunsom. 2013. Recurrent Continuous Translation Models. In David Yarowsky, Timothy Baldwin, Anna Korhonen, Karen Livescu & Steven Bethard (eds.), Conference on Empirical Methods in Natural Language Processing (EMNLP), Seattle, USA, 2013, 1700–1709.
Karimpour, Reza, Amineh Ghorbani, Azadeh Pishdad, Mitra Mohtarami, Abolfazl Aleahmad, Hadi Amiri & Farhad Oroumchian. 2008. Improving Persian Information Retrieval Systems Using Stemming and Part of Speech Tagging. In Evaluating Systems for Multilingual and Multimodal Information Access, 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Aarhus, Denmark, 2008. Berlin, Heidelberg: Springer Verlag.
Kukich, Karen. 1992. Techniques for Automatically Correcting Words in Text. Association for Computing Machinery Survey 24(4). 377–439.
Ljunglöf, Peter & Mats Wirén. 2010. Syntactic Parsing. In Nitin Indurkhya & Fred J. Damerau (eds.), Handbook of Natural Language Processing, 59–92. Boca Raton, FL, USA: CRC Press.
McDonald, Ryan, Fernando Pereira, Kiril Ribarov & Jan Hajič. 2005. Non-projective Dependency Parsing using Spanning Tree Algorithms. In Human Language Technology and Empirical Methods in Natural Language Processing Conference, Vancouver, Canada, 2005, 523–530. Vancouver, Canada: Association for Computational Linguistics.
Megerdoomian, Karine. 2018. Computational Linguistics. In Anousha Sedighi & Pouneh Shabani-Jadidi (eds.), The Oxford Handbook of Persian Linguistics, 461–479. Oxford: Oxford University Press.

1 Research in Persian Natural Language Processing – History and State of the Art


Miangah, Tayebeh Mosavi. 2013. FarsiSpell: A spell-checking system for Persian using a large monolingual corpus. Literary and Linguistic Computing 29. 56–73.
Miller, George A. 1995. WordNet: A Lexical Database for English. Communications of the Association for Computing Machinery 38(11). 39–41.
Mohseni, Mahdi & Behrouz Minaei. 2010. A Persian Part-of-Speech Tagger Based on Morphological Analysis. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner & Daniel Tapias (eds.), Proceedings of the International Conference on Language Resources and Evaluation, LREC, Valletta, Malta, 2010, 1253–1257. Valletta, Malta: European Language Resources Association (ELRA).
Mousavi, Zahra & Heshaam Faili. 2021. Developing the Persian Wordnet of Verbs Using Supervised Learning. Transactions on Asian and Low-Resource Language Information Processing 20. 1–18.
Naseem, Tahira. 2004. A Hybrid Approach for Urdu Spell Checking. Islamabad, Pakistan: Master Thesis.
Nayyeri, Amir & Farhad Oroumchian. 2006. FuFaIR: a Fuzzy Farsi Information Retrieval System. In 2006 IEEE/ACS International Conference on Computer Systems and Applications (AICCSA 2006), Dubai/Sharjah, UAE, 2006, 1126–1130. Dubai, United Arab Emirates: Institute of Electrical & Electronics Engineers.
Nivre, Joakim, Johan Hall & Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Nicoletta Calzolari, Khalid Choukri, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk & Daniel Tapias (eds.), The 5th International Conference on Language Resources and Evaluation, Genoa, Italy, 2006, 2216–2219. Genoa, Italy: European Language Resources Association (ELRA).
Palmer, David D. 2010. Text Preprocessing. In Nitin Indurkhya & Fred J. Damerau (eds.), Handbook of Natural Language Processing, 9–30. Boca Raton, FL: CRC Press.
Pilevar, Mohammad Taher & Heshaam Faili. 2010. PersianSMT: A first attempt to English-Persian statistical machine translation. In The International Conference of Statistical Analysis of Textual Data (JADT), Rome, Italy, 2010, 1001–1112. Rome, Italy: LED Edizioni Universitarie.
Pilevar, Mohammad Taher, Heshaam Faili & Abdol Pilevar. 2011. TEP: Tehran English-Persian parallel corpus. In Computational Linguistics and Intelligent Text Processing – 12th International Conference, Tokyo, Japan, 2011, 68–79. Berlin, Heidelberg: Springer-Verlag.
Pozveh, Zahra Hosseini, Amirhassan Monadjemi & Ali Ahmadi. 2016. Persian Texts Part of Speech Tagging Using Artificial Neural Networks. Journal of Computing and Security 3(4). 233–241.
Ralston, Anthony, Edwin D. Reilly & David Hemmendinger. 2003. Encyclopedia of Computer Science. Chichester: John Wiley & Sons Ltd.
Rasooli, Mohammad Sadegh, Omid Kashefi & Behrouz Minaei-Bidgoli. 2011. Effect of Adaptive Spell Checking in Persian. In 7th Conference on Natural Language Processing and Knowledge Engineering (NLPKE 2011), Tokushima, Japan, 2011, 161–164. Tokushima, Japan: Institute of Electrical & Electronics Engineers.
Rasooli, Mohammad Sadegh, Manouchehr Kouhestani & Amirsaeid Moloodi. 2013. Development of a Persian Syntactic Dependency Treebank. In Lucy Vanderwende, Hal Daumé III & Katrin Kirchhoff (eds.), The 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), Atlanta, GA, USA, 2013. Stroudsburg, PA, USA: Association for Computational Linguistics.
Rasooli, Mohammad Sadegh, Amirsaeid Moloodi, Manouchehr Kouhestani & Behrouz Minaei-Bidgoli. 2011. A Syntactic Valency Lexicon for Persian Verbs: The First Steps towards Persian Dependency Treebank. In 5th Language and Technology Conference, Poznan, Poland, 2011, 227–231. Berlin, Heidelberg: Springer International Publishing.
Rezai, Mohammad Javad & Tayebeh Mosavi Miangah. 2016. FarsiTag: A part-of-speech tagging system for Persian. Digital Scholarship in the Humanities 32(3). 632–642.


Sabeti, Behnam, Hossein Abedi, Ali Janalizadeh Choobbasti, S. H. E. Mortazavi Najafabadi & Amir Vaheb. 2018. MirasText: An Automatically Generated Text Corpus for Persian. In Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis & Takenobu Tokunaga (eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018, 1174–1177. Miyazaki, Japan: European Language Resources Association (ELRA).
Saedi, Chakaveh, Mehrnoush Shamsfard & Yasaman Motazedi. 2009. Automatic Translation between English and Persian Texts. In Machine Translation Summit, Ottawa, Canada, 2009, 26–32. Ottawa, Canada: Association for Machine Translation.
Saeedi, Parisa, Heshaam Faili & Azadeh Shakery. 2014. Semantic role induction in Persian: An unsupervised approach by using probabilistic models. Digital Scholarship in the Humanities 31(1). 181–203.
Sagot, Benoît. 2010. The Lefff, a freely available and large-coverage morphological and syntactic lexicon for French. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner & Daniel Tapias (eds.), Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), Valletta, Malta, 2010, 2744–2751. Valletta, Malta: European Language Resources Association (ELRA).
Sagot, Benoît & Geraldine Walther. 2010. A Morphological Lexicon for the Persian Language. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner & Daniel Tapias (eds.), Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), Valletta, Malta, 2010, 300–303. Valletta, Malta: European Language Resources Association (ELRA).
Seraji, Mojgan. 2011. A Statistical Part-of-Speech Tagger for Persian. In Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA, Riga, Latvia, 2011, 340–343. Riga, Latvia: Northern European Association for Language Technology (NEALT).
Seraji, Mojgan. 2013. PrePer: A Pre-processor for Persian. In Fifth International Conference on Iranian Linguistics (ICIL5), Bamberg, Germany, 2013, 7–10. Bamberg, Germany: Cahiers de Studia Iranica.
Seraji, Mojgan, Carina Jahani, Beáta Megyesi & Joakim Nivre. 2014. A Persian Treebank with Stanford Typed Dependencies. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds.), Conference on Language Resources and Evaluation, Reykjavik, Iceland, 2014, 796–801. Reykjavik, Iceland: European Language Resources Association (ELRA).
Seraji, Mojgan, Beáta Megyesi & Joakim Nivre. 2012a. Dependency Parsers for Persian. In Proceedings of 10th Workshop on Asian Language Resources, Mumbai, India, 2012, 35–44. Mumbai, India: The COLING 2012 Organizing Committee.
Seraji, Mojgan, Beáta Megyesi & Joakim Nivre. 2012b. Bootstrapping a Persian Dependency Treebank. Linguistic Issues in Language Technology (LiLT) 7. 1–10.
Seraji, Mojgan, Beáta Megyesi & Joakim Nivre. 2012c. A basic language resource kit for Persian. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012, 2245–2252. Istanbul, Turkey: European Language Resources Association (ELRA).
Shams, Mohammadreza, Azadeh Shakery & Heshaam Faili. 2012. A non-parametric LDA-based induction method for sentiment analysis. In The 16th Computer Society of Iran's International Symposium on Artificial Intelligence and Signal Processing (AISP 2012), Shiraz, Iran, 2012, 216–221. Shiraz, Iran: Institute of Electrical & Electronics Engineers.


Shamsfard, Mehrnoush, Akbar Hesabi, Hakimeh Fadaei, Niloofar Mansoory, Ali Famian & Somayeh Bagherbeigi. 2010. Semi-Automatic Development of FarsNet: The Persian WordNet. In Proceedings of the 5th Global WordNet Association Conference, Mumbai, India, 2010. Mumbai, India: Global WordNet Association. https://www.academia.edu/331794/Semi_Automatic_Development_of_FarsNet_The_Persian_WordNet.
Shannon, Claude E. 1948. A Mathematical Theory of Communication. The Bell System Technical Journal 27. 379–423.
Sheridan, Paraic & Alan F. Smeaton. 1992. The Application of Morpho-Syntactic Language Processing to Effective Phrase Matching. Information Processing & Management 28. 349–369.
Sheykholeslam, Mohammad Hoseyn, Behrouz Minaei-Bidgoli & Hossein Juzi. 2012. A Framework for Spelling Correction in Persian Language Using Noisy Channel Model. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012, 706–710. Istanbul, Turkey: European Language Resources Association (ELRA).
Taghipour, Mohammad, Foad Aboutorabi, Vahid Zarrabi & Habibollah Asghari. 2018. An Integrated Text Mining Platform for Monitoring of Persian News Agencies. In Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis & Takenobu Tokunaga (eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018. Miyazaki, Japan: European Language Resources Association (ELRA). http://lrec-conf.org/workshops/lrec2018/W13/pdf/7_W13.pdf.
Thelwall, Mike, Kevan Buckley, Georgios Paltoglou, Di Cai & Arvid Kappas. 2010. Sentiment Strength Detection in Short Informal Text. Journal of the American Society for Information Science and Technology 61. 2544–2558.
Voutilainen, Atro. 2003. Part-of-Speech Tagging. In Ruslan Mitkov (ed.), The Oxford Handbook of Computational Linguistics, 219–232. Oxford: Oxford University Press.
Wagner, Robert A. & Michael J. Fischer. 1974. The String-to-String Correction Problem. Journal of the Association for Computing Machinery 21(1). 168–173.
Zaremoodi, Poorya & Gholamreza Haffari. 2018. Incorporating Syntactic Uncertainty in Neural Machine Translation with a Forest-to-Sequence Model. In Emily M. Bender, Leon Derczynski & Pierre Isabelle (eds.), Proceedings of the 27th International Conference on Computational Linguistics, New Mexico, USA, 1421–1429. Stroudsburg, PA, USA: Association for Computational Linguistics.

Katarzyna Marszałek-Kowalewska

2 Challenges of Persian NLP: The Importance of Text Normalization

Abstract: Data is said to be the new oil. This applies to textual data as well. The amount of plain textual data on the Internet is abundant, and it is still growing. However, like oil, raw data is not valuable in itself: its value and potential are realized only through accurate processing and analysis. And this is where Natural Language Processing (NLP) and its procedures come into the picture. Since NLP tasks require standardized, high-quality inputs for high efficiency and performance, raw textual data requires at least a basic level of cleaning and standardization. Therefore, text normalization – the process of transforming noisy data into an improved, i. e., standard, representation – is often a prerequisite for a variety of NLP tasks, including information extraction, machine translation, or sentiment analysis. The Persian language, being the 5th content language of the Web, can be a great source of diverse information. However, several language-specific characteristics hinder the potential application of Persian data. The focus of this chapter lies in describing the challenges that the Persian language poses for NLP and in evaluating the impact of normalizing a large Persian corpus on one of the downstream NLP tasks – the discovery of multiword expressions (MWEs).

2.1 Introduction

In recent times, human civilization is drowning in data. In 2008, Google reported that the World Wide Web had one trillion pages.1 The total amount of data created and captured in 2018 was 33 zettabytes (ZB). In 2020, it reached about 59 ZB. International Data Corporation projects that by 2025, the available data may expand to 175 ZB.2 It is estimated that about 2.5 quintillion bytes of data are produced daily.3 This includes everything from tweets about elections and holiday photos on Instagram to the data collected by the International Space Station. Although these estimates include videos, image data, and databases, most of this data is plain old text. Unstructured data (also called free-form text) comprises 70–80 % of the data available on

1 https://www.itpro.co.uk/604911/google-says-the-web-hits-a-trillion-pages
2 https://www.networkworld.com/article/3325397/idc-expect-175-zettabytes-of-data-worldwide-by2025.html
3 https://techjury.net/blog/how-much-data-is-created-every-day

https://doi.org/10.1515/9783110619225-002

computer networks. The content of this resource is unavailable to authorities, businesses, and individuals unless humans read these texts or devise some other means to derive valuable information from them. Therefore, human civilization is drowning in data but starving for insights. And this is where Natural Language Processing (NLP) comes into the game. NLP procedures can be applied to characterize, interpret, and understand the information content of free-form text, in other words, to unlock the potential of unstructured data.
The Persian language, with its 110 million speakers, is, according to W3Tech, the 5th content language of the Web and is used by 3.5 % of all the websites whose content language we know. Even more interestingly, it is the 2nd fastest growing content language (after Russian). In 2015, Persian was not even in the top 10 content languages, yet it reached 8th position in 2019, and since then it has continuously moved up. This positive change shows the significance and expansion of the Persian language on the Internet. For the NLP community, it means that there is a robust amount of Persian data that can be used for various tasks, such as text summarization, sentiment analysis, or information extraction. However, this data is unstructured, and since the quality of the input data influences the quality of the output, in most cases this unstructured Persian data (any unstructured data, in fact) needs to undergo certain cleaning and normalization tasks before the NLP pipeline uses it, e. g., removal of extra whitespace, substitution of acronyms, transformation of numerical information, accent removal, substitution of special characters and emoji, or normalization of date formats.
This chapter focuses on the importance of normalizing Persian language data for NLP applications. It starts by discussing the challenges Persian poses for NLP (section 2.2).
Section 2.3 briefly discusses the general aspects of text normalization, and its impact on a downstream NLP task – the discovery of multiword expressions in Persian – is presented in section 2.4. The chapter concludes with a summary and recommendations for future work.

2.2 Challenges of Persian NLP

The Persian language belongs to the Iranian branch of the Indo-Iranian language family. It is spoken by more than 110 million4 people across the world, mainly in

4 As of November 2021, https://www.worldatlas.com/articles/where-is-farsi-spoken.html


Iran (Farsi), Afghanistan (Dari), and Tajikistan (Tajiki).5 It is also the 5th content language of the Web according to W3Tech.6 Despite being one of the most popular languages of the Web, Persian, and its computational analysis, has until recently received far less attention than other popular languages, such as English, Russian, and Spanish. Research in Persian NLP faces two significant challenges. The first arises from the limited number of available resources. Although there has been a significant improvement in recent years in the quantity of NLP resources, e. g., the Hamshahri Corpus (Darrudi, Hejazi, and Oroumchian 2004), the Bijankhan Corpus (Bijankhan et al. 2011), FarsName (Hajitabar et al. 2017), ShEMO (Nezami, Lou, and Karami 2019), the Persian Dependency Treebank (Rasooli, Kouhestani, and Moloodi 2013), SentiPers (Hosseini et al. 2018), FarsNet (Shamsfard et al. 2012), and ParsBERT (Farahani et al. 2021), and tools, e. g., STeP-1 (Shamsfard, Jafari, and Ilbeygi 2010), PrePer (Seraji 2013), Parsivar (Mohtaj et al. 2018), or ParsiNorm (Oji et al. 2021), Persian is still an under-resourced language. The second problem is related to the challenging characteristics of Persian itself, especially inconsistencies in its writing format. The following section discusses the Persian language's main challenges for NLP applications.

2.2.1 Encoding

The written characters of human languages are assigned numbers through a process called character encoding, which allows these characters to be stored, transmitted, and transformed using computers. For example, the Latin character for the capital letter A has the code U+0041; for an ampersand & it is U+0026. Encoding problems are among the first encountered in processing Persian texts. While creating digital texts, both Persian Unicode characters and Arabic ones can sometimes be used. As a result, for example, the letter ی [ye] can be expressed by 3 different encodings: either the Persian one, \u06cc, or two Arabic ones, \u064a or \u0649 (Sarabi, Mahyar, and Farhoodi 2013; Ghayoomi and Momtazi 2009; Megerdoomian 2018). Table 2.1 shows how one word can have different encodings for the same characters. While processing Persian texts, all the encodings of the same character should be unified; otherwise identical words containing different encodings will be treated as different words.

5 The term Farsi was used to refer to the Persian language by all its native speakers until the 20th century. Currently, the terms Dari and Tajiki are used to refer to Persian spoken in Afghanistan and Tajikistan respectively, while the Persian of Iran retained the Farsi name.
6 As of November 2021.

Table 2.1: Encoding variations (after Sarabi, Mahyar, and Farhoodi 2013: 3, example mine, KMK). Letters and their codes are given in reading order, right to left.

Translation | Word | Letter codes
'village' | آبادی | آ 0622, ب 0628, ا 0627, د 062F, ی 06CC
'village' | آبادي | آ 0622, ب 0628, ا 0627, د 062F, ي 064A
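To make the unification step concrete, it can be sketched as a simple character mapping (a minimal sketch: the mapping table below covers only the characters discussed here and would need to be extended for real data):

```python
# Minimal sketch of encoding unification for Persian text. The mapping covers
# only a few illustrative characters; a production normalizer extends it.
ARABIC_TO_PERSIAN = {
    "\u064a": "\u06cc",  # Arabic YEH (ي)          -> Persian YEH (ی)
    "\u0649": "\u06cc",  # Arabic ALEF MAKSURA (ى) -> Persian YEH (ی)
    "\u0643": "\u06a9",  # Arabic KAF (ك)          -> Persian KEHEH (ک)
}

def unify_encoding(text: str) -> str:
    return text.translate(str.maketrans(ARABIC_TO_PERSIAN))

# Both rows of Table 2.1 now map to the same string:
word_persian = "\u0622\u0628\u0627\u062f\u06cc"  # آبادی with Persian YEH
word_arabic = "\u0622\u0628\u0627\u062f\u064a"   # آبادي with Arabic YEH
print(unify_encoding(word_arabic) == word_persian)  # -> True
```

With such a table in place, tokenization and frequency counting no longer split one word into several codepoint variants.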
2.2.2 Writing System

The Persian language is written from right to left, using the Arabic script. Four additional letters that do not exist in Arabic were added to the writing system to represent Persian consonants: پ [p], چ [č], ژ [ĵ] and گ [g]. The writing system of Farsi is presented in Table 2.9 in the Appendix section.
The Persian writing system poses several difficulties for NLP. First of all, Persian letters can have joiner and non-joiner forms based on their position in a word. This feature is quite common among languages, yet in Persian certain letters written at the end of a word may not be joined to it, e. g., سپاسگزار [sepasgozar] 'grateful', اتاق [otaq] 'room'. Some users treat non-joiner forms as separate characters and do not use whitespace after the word. As a result, tokenization is not always reliable.
Moreover, foreign (borrowed) elements used in Persian tend to be written arbitrarily; the fact that there are, for example, four possible forms of the letter 'z' (ز ظ ذ ض) poses certain difficulties for users. Although the Academy of Language and Literature7 attempted to systemize it, there is still a great degree of arbitrariness when it comes to actual usage. As an example, consider the following possible variants of the borrowed French word fantaisie 'fantasy' in Persian:
– فانتزی
– فانتظی
– فانتذی
– فانتضی
Furthermore, there are no capital letters in Persian, which causes further ambiguity, especially for the named entity recognition task. For example, رویا could mean both 'dream' and the female name 'Roya', and مالت could be interpreted as 'Malta' or 'malt' (a germinated cereal grain).

7 Academy of Language and Literature (in Persian ‫ )ﻓﺮﻫﻨﮕﺴﺘﺎن زﺑﺎن و ادب ﻓﺎرﺳﯽ‬is the official Iranian regulatory body of the Persian language.


The lack of capital letters can similarly cause problems while identifying acronyms, e. g., the Persian acronym for Random-access memory – 'RAM' – رم can also be wrongly translated as 'rum' or 'stampede'. Another challenge of the writing system is its text directionality. Although letters are written from right to left, numbers are written in the opposite direction, e. g., ایران ۱.۲ میلیون بشکه نفت خام صادر کرد 'Iran exported 1.2 million barrels of crude oil'. What is more, it is not uncommon for users to use Arabic numerals instead of Persian ones, e. g., کنفرانس در سال 1997 اتفاق افتاد 'The conference took place in 1997'. The problem of bidirectionality can make text processing difficult (Ghayoomi and Momtazi 2009: 2), thus a certain level of unification should be applied.
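A digit unification step of the kind just described can be sketched as follows (an illustrative sketch: it normalizes toward Persian digits, though a pipeline could equally normalize toward Western digits, depending on the downstream task):

```python
# Sketch: map any decimal digit (Western 0-9 or Arabic-Indic ٠-٩) to the
# Persian (Extended Arabic-Indic) digits ۰-۹ for a unified representation.
PERSIAN_DIGITS = "۰۱۲۳۴۵۶۷۸۹"  # U+06F0 .. U+06F9

def unify_digits(text: str) -> str:
    # int() accepts any Unicode decimal digit, so one rule covers all scripts
    return "".join(
        PERSIAN_DIGITS[int(ch)] if ch.isdecimal() else ch
        for ch in text
    )

print(unify_digits("1997"))  # -> ۱۹۹۷
```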

2.2.3 Word and Phrasal Boundaries

As in many other languages, whitespace designates the word boundary in Persian. However, apart from the standard whitespace, a zero-width non-joiner (known as pseudospace) is also used with non-joiner letterforms. The whitespace and pseudospace are used inconsistently, making tokenization and segmentation extremely challenging. As mentioned in section 2.2.2, Persian letters have different forms depending on their position in a word. Thus, users often treat non-joiner forms incorrectly, i. e., not adding whitespace after them, e. g., توازماکتابراگرفت 'YouTookTheBookFromUs'. As a consequence, this phrase would be processed as one lexeme instead of six separate ones, i. e., تو از ما کتاب را گرفت 'You took the book from us'. On the other hand, whitespace is often used instead of pseudospace, which causes words such as زبان‌شناسی 'linguistics' to be processed as two separate words, زبان 'language' and شناسی 'knowledge', when written with whitespace, i. e., زبان شناسی. As a result, word and phrase boundaries are often unclear, and tokenization, phrase segmentation, and clause splitting can be very challenging steps in the Persian NLP pipeline. Moreover, the inconsistent use of whitespace in the case of detached morphemes poses another challenge. The possibilities resulting from such a case are presented in Table 2.2:

Table 2.2: Word boundary ambiguity (after Ghayoomi and Momtazi 2009: 4, examples mine, KMK).

Translation | Attached | Pseudospace | Whitespace | Morpheme
'(he/she) buys' | میخرید | می‌خرید | می خرید | می
'international' | بینلمللی | بین‌المللی | بین لمللی | بین
'books' | کتابها | کتاب‌ها | کتاب ها | ها

The problem of boundary ambiguity makes identifying words, phrases, and finally sentences a challenging task (Sarabi, Mahyar, and Farhoodi 2013; Ghayoomi and Momtazi 2009; Megerdoomian 2018). Inconsistent use of white- and pseudospace is directly related to complex lexemes, consisting of a lexeme and attached affixes that represent a separate lexical category or part of speech from the one they are attached to. A few examples of this situation are presented in Table 2.3.

Table 2.3: Complex tokens (after Ghayoomi and Momtazi 2009: 4, examples mine, KMK).

Affix | Type | Whitespace | Pseudospace | Attached
به | Preposition | به شیوه | به‌شیوه | بشیوه
هم | Prefix | هم کلاس | هم‌کلاس | همکلاس
این | Determiner | این مرد | این‌مرد | اینمرد
آن | Determiner | آن قدر | آن‌قدر | آنقدر
را | Postposition | شرایط را | شرایط‌را | شرایطرا
که | Relativizer | چنان که | چنان‌که | چنانکه

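A rule-based repair of such whitespace variants can be sketched with a small, closed list of detached morphemes. This is illustrative only: the two rules below are assumptions, and a real normalizer needs a much larger morpheme inventory plus disambiguation, since forms like می also occur as free words.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, the "pseudospace"

def fix_pseudospace(text: str) -> str:
    # Illustrative rules: the verbal prefix می and the plural suffix ها.
    text = re.sub(r"\bمی (?=\S)", "می" + ZWNJ, text)   # می خرید -> می‌خرید
    text = re.sub(r"(?<=\S) ها\b", ZWNJ + "ها", text)  # کتاب ها -> کتاب‌ها
    return text

print(fix_pseudospace("می خرید") == "می" + ZWNJ + "خرید")  # -> True
```

Python's `re` module treats Persian letters as word characters by default, so `\b` anchors work across the Arabic script here.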
2.2.4 Ambiguity

Human languages are ambiguous, which means that language in either written or spoken form is open to multiple interpretations, making it very challenging for NLP. Lexical ambiguity, i. e., the fact that words and phrases can have more than one meaning, e. g., bank as 'a financial institution' and as 'a river bank', is a very frequent ambiguity type, and dealing with it is one of the main NLP challenges. This task is particularly difficult in the case of Persian, as the number of heterophonic homographs (words with identical written forms but different pronunciations, each associated with a different meaning) is high. The main reason for this situation is that Persian short vowels are usually not written in the script. Therefore, the word ملک could be interpreted in the four following ways:
– ملک [malak] 'angel',
– ملک [malek] 'prince',
– ملک [melk] 'domain',
– ملک [molk] 'country, territory'.

2.2.5 Ezafe Construction

Ezafe is a syntactic construction that expresses determination. In most cases, it is pronounced but not written (since it is expressed by a short vowel), which further contributes to ambiguity, especially in chunking, semantics, and finally the syntactic processing of a sentence. Hence, the following sentence can be interpreted in two different ways depending on the presence of ezafe:

پدر حسن را دید
1. [pedar hasan ra did] 'Father saw Hassan.'
2. [pedar-e hasan ra did] 'He/She saw Hassan's father.'

It was estimated (Bijankhan et al. 2011: 10) that on average 20 % of words in a corpus of contemporary Persian are accompanied by the ezafe marker. Thus, especially for syntactic analysis, it seems crucial to address its proper processing.

2.2.6 Spoken Language and Dialectal Variants

A comprehensive computational linguistic system should be able to process and analyze all major variants of a language. In the case of Persian, it should be able to deal with both the spoken language and the dialectal variants, i. e., Tajiki and Afghan Persian. The diglossia in Persian is very strong. Some of the main differences between the spoken and written forms include:
– differences in word-formation,
– phonological alternations,
– a large number of neologisms and loanwords.
The differences encountered in the spoken version of Persian are also present in the written form, especially in online social media texts. What is more, the dialectal variants differ in their written scripts. Tajiki Persian, for example, does not use the Perso-Arabic script but the Cyrillic one, and is thus much less ambiguous. On the other hand, the writing system of Afghan Persian, being more flexible than the Iranian one, poses even more challenges for NLP systems. Please refer to Megerdoomian (2018) for more details on this topic.


2.3 Normalization

Text normalization is a process that attempts to reduce text randomness by, for example, removing unnecessary whitespace, converting special characters like emoji, unifying date formats, or providing spelling correction. In other words, text normalization focuses on transforming noisy (non-standard and informal) textual data into a more standard representation. Linguistic resources, especially online ones containing slang-based expressions, acronyms, abbreviations, hashtags, or spelling errors, can deviate from the standard language. Text normalization procedures are applied to help NLP applications deal with such noisy inputs. Since there are several possible normalizing steps, Baldwin and Li (2015) proposed a taxonomy of normalization edits with three levels of granularity, which is presented in Figure 2.1. Here, the original figure has been limited to the first two levels, and examples of Persian-specific edits of level three are given in the following part.

Figure 2.1: Normalization taxonomy (after Baldwin and Li 2015: 422).

At the first level, Baldwin and Li (2015) distinguish between three edit types: insertion, replacement, and removal. The second level distinguishes between word and punctuation normalization edits, e. g., adding a missing subject or removing doubled exclamation marks. What follows are some examples of level-three edits that apply to the Persian language:
– Insertion
  – Punctuation
    * adding a missing whitespace, e. g., توازماکتابراگرفت → تو از ما کتاب را گرفت
  – Word
    * adding a missing definite direct object marker را
– Replacement
  – Punctuation

2 Challenges of Persian NLP: The Importance of Text Normalization

    * replacing an incorrect whitespace with pseudo-space, e. g., زبان شناسی → زبان‌شناسی
  – Word
    * conversion of Arabic numerals, e. g., 2021 → ۲۰۲۱
    * conversion of words written with other alphabets, e. g., Bluetooth → بلوتوث
    * unification of dates, e. g., 22 بهمن 1357 and ۲۲ بهمن ۱۳۵۷ → ۲۲ بهمن ۱۳۵۷
    * spell correction, e. g., او دوصت من است → او دوست من است
    * unification of quotation marks, e. g., «…» → “…”
– Removal
  – Punctuation
    * removal of repeated punctuation, e. g., !!!!! → !
  – Word
    * removal of stop words, e. g., در, این, و
    * data-specific removals, e. g., hashtags
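Several of the edits above are simple string transformations. The following minimal Python sketch (illustrative only; the function name and the particular rules are assumptions, not the chapter's code) implements three of them: conversion of ASCII digits to Persian digits, collapsing repeated punctuation, and whitespace cleanup:

```python
import re

# Map ASCII digits to Persian (Extended Arabic-Indic) digits, e.g., 2021 -> ۲۰۲۱.
PERSIAN_DIGITS = str.maketrans("0123456789", "۰۱۲۳۴۵۶۷۸۹")

def normalize(text: str) -> str:
    text = text.translate(PERSIAN_DIGITS)        # digit conversion
    text = re.sub(r"([!?؟.،])\1+", r"\1", text)  # !!!!! -> !
    text = re.sub(r"\s+", " ", text).strip()     # collapse stray whitespace
    return text

print(normalize("سال 2021  بود!!!!!"))  # -> سال ۲۰۲۱ بود!
```

A production normalizer would of course need many more rules (pseudo-space handling, encoding unification, date formats), but each can be expressed as a pure string-to-string function in the same way.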

The first study indicating the importance of text normalization was performed by Sproat et al. (2001), who tried to develop a general normalization process applicable to diverse domains. Since then, the impact of normalizing noisy text and its influence on downstream NLP tasks has been analyzed in several studies. Han, Cook, and Baldwin (2013) showed the impact of normalizing social media texts on part-of-speech tagging. In particular, they focused on tweets and compared the original and the normalized input texts on two different taggers: the general Stanford POS tagger and a domain-specific Twitter POS tagger. The influence of normalization on parsing was studied by Zhang et al. (2013), who introduced a normalization framework designed with the possibility of domain adaptation. Hassan and Menezes (2013) proposed a domain- and language-independent system based on unsupervised learning for machine translation. Since text normalization is, in many cases, a necessary preprocessing step for numerous NLP tasks, there are several possible normalization steps. However, as noted by Baldwin and Li (2015), it is essential to remember that different normalization tasks fit different data and downstream NLP applications. Moreover, “one size fits all” normalization systems seem to be less precise than tailored ones.

Research on normalizing Persian text has focused mainly on addressing the specific challenges (described in section 2.2) this language poses for NLP tasks. This has resulted in a number of processing tools. In 2010, Shamsfard, Jafari, and Ilbeygi (2010) proposed STeP-1, which offers tokenization, morphological analysis, part-of-speech tagging, and spell checking. The ParsiPardaz toolkit, which, apart from providing the same processing steps as STeP-1, also includes an additional normalization step, was proposed by Sarabi, Mahyar, and Farhoodi (2013). The first open-source preprocessing tool, Hazm, was introduced in 2014 (Hazm 2014). In 2018, Parsivar, another open-source tool, was presented by Mohtaj et al. (2018). Finally, Oji et al. (2021) proposed ParsiNorm, a tool for speech (but also text) normalization. Apart from work on preprocessing tools, research on Persian normalization has also covered classification trees and support vector machines (Moattar, Homayounpour, and Zabihzadeh 2006), an N-gram language model combined with a rule-based method (Panahandeh and Ghanbari 2019), and sequence labeling models (Doostmohammadi, Nassajian, and Rahimi 2020).

2.4 Impact of Normalization on MWEs Discovery

The challenges presented in section 2.2 (inconsistency in using white- and pseudo-space, different encodings, missing short vowels, and bidirectionality) can hamper adequate processing of Persian in several NLP tasks. Therefore, a certain level of text normalization seems to be a necessary preprocessing step. The following section describes the impact normalization procedures have on one downstream NLP task: the discovery of multiword expressions (MWEs).

2.4.1 Multiword Expressions Discovery

A popular definition of multiword expressions states that they are linguistic expressions which consist of at least two words (even when represented by a single token) and are syntactically and/or semantically idiosyncratic (Baldwin and Kim 2010: 3). MWEs are very frequent in language and range over several different linguistic constructions, from idioms, e. g., to kick the bucket, to fixed expressions, e. g., fish and chips, light verb constructions, e. g., give a demo, and noun compounds, e. g., traffic light. Biber et al. (1999) claim that MWEs make up 30–45 % of spoken English and 21 % of academic prose. Jackendoff (1997) suggests that the number of MWEs in a speaker's lexicon is of the same order as the number of simple words, yet if we consider domain-specific lexicons, this number seems to be an underestimation (Sag et al. 2002). Indeed, research conducted by Ramisch (2009) suggests that the MWE ratio can be between 50 % and 80 % in a corpus of scientific biomedical abstracts. Research by Krieger and Finatto (2004) estimates that MWEs can constitute more than 70 % of a specialized lexicon.

MWE processing consists of two tasks: identification and discovery. MWE identification focuses on tagging a corpus with actual MWEs. The research on MWEs in Persian has so far focused mainly on the identification of verbal multiword units, and light verb constructions (LVCs) in particular, e. g., Taslimipoor, Fazly, and Hamze (2012), Salehi, Askarian, and Fazly (2012), Salehi, Cook, and Baldwin (2016). MWE discovery – the task presented in this chapter – is a process that focuses on finding new MWEs (types) in corpora and storing them, e. g., in the form of a lexicon, for further usage. It takes text as input and generates a list of MWE candidates from it. These candidates can be further filtered and evaluated by trained experts. True MWEs are stored in a repository or added to the MWE lexicon.

2.4.2 Corpus

The corpus used in the study was MirasText (Sabeti et al. 2018) – an automatically generated text corpus for Persian. It is one of the largest available Persian corpora, containing 2.8 million documents and over 1.4 billion tokens; its total size is 15 GB. Each data point is provided with the following information:
– content: webpage main content;
– summary: content summary;
– keywords: content keywords;
– title: content title;
– website: base website;
– URL: exact URL of the webpage.
The content of MirasText was generated from 250 websites selected from a wide range of fields to ensure the diversity of the data, e. g., news, economy, technology, sport, entertainment, or science. Figure 2.2 presents the initial website seeds used for corpus compilation. Since the corpus content was gathered through crawling, duplicated texts may be included. To remove duplicate content from the corpus, Sabeti et al. (2018) used a filtering process based on a Bloom filter (Almeida et al. 2007).
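Sabeti et al. (2018) do not reproduce their deduplication code in the chapter; the sketch below is only a generic illustration of the Bloom-filter idea (all names and parameters are invented): a compact bit array queried through k hash functions answers "possibly seen before" or "definitely not seen" for each crawled item, with no false negatives.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for duplicate detection (illustrative sketch)."""

    def __init__(self, m: int = 8192, k: int = 4):
        self.m, self.k = m, k            # m bits, k hash functions
        self.bits = bytearray(m // 8)

    def _positions(self, item: str):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("https://example.ir/article-1")   # hypothetical URL
print("https://example.ir/article-1" in bf)  # True
print("https://example.ir/article-2" in bf)  # False (with very high probability)
```

The appeal for web-scale crawling is memory: membership of millions of pages can be tracked in a few megabytes, at the cost of a tunable false-positive rate.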

2.4.3 Normalization

2.4.3.1 Processing Tools Evaluation

To ensure that the best normalization tool would be used for the MWE discovery task, a study comparing three open-source processing tools for Persian was first carried out. These tools are Hazm, Parsivar, and ParsiNorm. Table 2.4 provides a comparison of selected normalization features offered by these tools.


Figure 2.2: Distribution of corpus data according to the content (after Sabeti et al. 2018: 1175).

Table 2.4: Comparison of open-source normalization tools for Persian.

Feature                          Hazm   Parsivar   ParsiNorm
numbers normalization              ?       ?          ?
punctuation normalization          ?       ?          ?
date normalization                 ?       ?          ?
space correction                   ?       ?          ?
pinglish conversion                ?       ?          ?
encodings unification              ?       ?          ?
symbols normalization              ?       ?          ?
repeated punctuation removal       ?       ?          ?
informal to formal conversion      ?       ?          ?

In order to evaluate Hazm, Parsivar, and ParsiNorm, a small corpus of 5000 sentences was annotated by three Persian linguistic experts with respect to tokenization. The inter-annotator agreement (IAA) was calculated with Fleiss' Kappa – a metric used to evaluate the agreement among three or more raters (Fleiss 1971). The IAA achieved by the annotators was 96 %, which indicates almost perfect agreement.8

8 For interpretation see Landis and Koch (1977).

Table 2.5: Tokenization results.

                            Precision   Recall   F1
Hazm not-normalized           71 %       73 %    72 %
Hazm normalized               97.5 %     97 %    97 %
Parsivar not-normalized       79 %       75 %    77 %
Parsivar normalized           99 %       98 %    98 %
ParsiNorm not-normalized      72 %       73 %    72.5 %
ParsiNorm normalized          98 %       97 %    97.5 %

Table 2.5 presents the tokenization results on normalized and raw data. The best-performing tokenizer turned out to be Parsivar (Mohtaj et al. 2018), achieving a 98 % F-score. The superior performance of Parsivar, at least over Hazm, was also confirmed in the Persian plagiarism detection study by Mohtaj et al. (2018). The main difference between these tools seems to lie in Parsivar's better space correction. Nevertheless, what seems to be of more significance here is the fact that the results obtained on the raw and the normalized corpus differ substantially: regardless of the normalizing tool, tokenizer performance was in all cases more than 20 % higher on the normalized data.
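Fleiss' Kappa, used for the inter-annotator agreement above, is straightforward to compute from an item-by-category count matrix. The following sketch (an illustration following Fleiss 1971, not the chapter's code) shows the calculation:

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters assigning item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])                      # raters per item (assumed constant)
    # per-item observed agreement P_i
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items                     # mean observed agreement
    # chance agreement from overall category proportions
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    p_j = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Perfect agreement among 3 raters on 2 items -> kappa = 1.0
print(fleiss_kappa([[3, 0], [0, 3]]))  # -> 1.0
```

A value of 1 indicates perfect agreement, 0 agreement at chance level; the 0.96 reported above falls in the "almost perfect" band of Landis and Koch (1977).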

2.4.3.2 Corpus and Its Normalization

Since the MirasText corpus data was obtained via crawling, it was necessary to perform certain cleaning and normalization tasks. The initial corpus analysis showed that a certain number of articles contained incomplete (clipped) content. Such articles were excluded from the final corpus used in this study. After filtering out the clipped articles, the total number of corpus documents was 2,072,521. As the next step, a 50-million-token corpus for the discovery of MWEs was sampled.

For most NLP tasks, the first necessary step is to tokenize the input text. However, as already mentioned, this can be challenging in Persian text processing since there are two kinds of spaces – white- and pseudo-space – which are not used consistently. Inconsistent spacing may result in high ambiguity on both the lexical and syntactic levels. Therefore, for a corpus of millions of documents written by thousands of various authors, it is necessary to unify the data, and one of the first and most essential unification steps in Persian NLP is to correct spaces. As a result of the experiment described in section 2.4.3.1, Parsivar (Mohtaj et al. 2018) turned out to be the best normalizing tool, and the corpus used for the MWE discovery task was normalized with it. In its normalization step, apart from unifying encodings and numbers, Parsivar performs two different types of space correction:

– rule-based space correction: a set of rules using regular expressions is employed to correctly detect spaces within words, e. g., می روم 'I am going' or تحلیل گر 'analyzer'. Words that consist of two or more tokens but cannot be extracted with one of these rules are handled by means of a dictionary, which helps with words such as گفت و گو 'conversation'.
– learning-based space correction: a trained model recognizes multi-token words as one token. Parsivar uses 90 % of the Bijankhan corpus (which contains multiword tokens annotated in the IOB tagging format) as training data. A Naïve Bayes model was used to find word boundaries. The model was evaluated on the remaining 10 % of the Bijankhan corpus and achieved a 96.5 % F-score for space correction on that validation set.
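As an illustration of the rule-based variant (Parsivar's actual rule set is not reproduced here; this single regex is an assumption), the whitespace after the imperfective prefix می can be replaced with a pseudo-space, i. e., the zero-width non-joiner (ZWNJ, U+200C):

```python
import re

ZWNJ = "\u200c"  # pseudo-space: zero-width non-joiner

def fix_mi_prefix(text: str) -> str:
    """Attach the prefix 'می' to the following verb with a ZWNJ instead of a space."""
    return re.sub(r"\bمی (?=\S)", "می" + ZWNJ, text)

print(fix_mi_prefix("من می روم"))  # -> من می‌روم
```

A real system needs dozens of such rules (plus a dictionary fallback, as described above), since the same surface pattern can be a genuine two-word sequence in other contexts.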

Table 2.6 presents sentence segmentation and tokenization metrics for the raw and normalized versions of the corpus used in the present study.

Table 2.6: Corpus metrics comparison.

                       Not-normalized   Normalized
number of sentences     1,464,996        1,537,725
number of tokens        52,536,988       51,525,867

It can be observed that both the sentence and the token counts differ significantly (the difference in the number of tokens is almost one million!) when comparing the corpus before and after normalization. The difference in sentence segmentation stems from the incorrect treatment of dots in the not-normalized corpus, especially in the case of numerals, dates, webpages, and combinations with other punctuation marks. These results show that proper cleaning and normalization (especially the unification of spaces) are crucial when preprocessing Persian texts. Since normalization helps reduce randomness, it can have a great impact on the machine learning algorithms routinely used in NLP tasks. First of all, normalization improves efficiency by reducing the amount of distinct information – fewer input variables help to improve overall performance. Moreover, it reduces the dimensionality of the input, which is important when creating embeddings. Finally, normalization helps to extract reliable statistics from natural language inputs. Figure 2.3 shows the number of most common tokens in the corpus used for the MWE discovery study. By applying normalization, the number of most common tokens was reduced by 31 %.
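The coverage statistic behind Figure 2.3 (how many of the most frequent token types account for 80 % of all running tokens) can be reproduced for any corpus with a frequency counter. A small sketch (the function name and toy data are mine, not the chapter's):

```python
from collections import Counter

def tokens_covering(counts: Counter, share: float = 0.8) -> int:
    """Number of most frequent token types covering `share` of all tokens."""
    total = sum(counts.values())
    covered = 0
    for i, (_, c) in enumerate(counts.most_common(), start=1):
        covered += c
        if covered >= share * total:
            return i
    return len(counts)

toy = Counter("the cat sat on the mat the cat".split())
print(tokens_covering(toy, 0.5))  # -> 2 ("the" and "cat" cover half the tokens)
```

Comparing this number before and after normalization quantifies how much spelling variation the normalizer has collapsed.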


Figure 2.3: Number of most common tokens (corresponding to 80 % of all tokens).

An even bigger difference can be observed in the total number of distinct tokens (Figure 2.4). Here, by applying the normalization steps, the number of distinct tokens was reduced by 63 %.

Figure 2.4: Total number of distinct tokens.


2.4.4 Methodology

The assumption that MWEs stand out, i. e., that they exhibit some sort of salience, allows us to extract (or discover) them automatically from texts. This salience is also why statistical procedures, such as association measures (AMs), have been so popular. In the present experiment, the task of MWE discovery was likewise addressed by employing AMs.

In order to extract Persian multiword expressions, a list of 20 lemmas serving as initial seeds was prepared. For every lemma, its bi-grams and tri-grams were extracted separately from the raw and the normalized corpus using the following association measures:
– PMI
– log-likelihood
– t-score
– χ² test
These specific AMs were chosen since they are the most popular ones used for the discovery of MWEs (Evert 2008; Seretan 2008; Wahl and Gries 2018; Villavicencio and Idiart 2019). For each association measure, its top 100 bi- and tri-grams per lemma were extracted; this resulted in 1487 unique MWE candidates from the normalized corpus and 1817 from the raw one. All the annotators worked on both sets of MWE candidates, extracted from the raw and the normalized corpus. The IAA results were 87 % and 81 % for the normalized and the raw corpus, respectively.
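Two of the measures above can be computed directly from corpus counts. The following minimal sketch uses the standard formulas (cf. Evert 2008); the counts are invented toy numbers, not values from the study:

```python
import math

def pmi(c_xy: int, c_x: int, c_y: int, n: int) -> float:
    """Pointwise mutual information of a bigram from raw corpus counts."""
    return math.log2((c_xy * n) / (c_x * c_y))

def t_score(c_xy: int, c_x: int, c_y: int, n: int) -> float:
    """t-score: (observed - expected) / sqrt(observed)."""
    expected = c_x * c_y / n
    return (c_xy - expected) / math.sqrt(c_xy)

# Toy counts: bigram seen 50 times, its words 200 and 100 times, corpus of 10,000 tokens.
print(round(pmi(50, 200, 100, 10_000), 2))      # -> 4.64  (log2 of 25)
print(round(t_score(50, 200, 100, 10_000), 2))  # -> 6.79
```

In the experiment, each AM scores every bi-/tri-gram containing a seed lemma, and the top 100 per lemma are kept as candidates.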

2.4.5 Results

The main objective of the study was to evaluate the impact of text normalization on the MWE discovery task in Persian. After evaluating the candidates, the number of true MWEs was 389 in the normalized corpus and 154 in the raw one. Figure 2.5 shows the performance of the four selected association measures with respect to the discovery of true MWEs. As can be seen, each AM performs better when used with the normalized data. The highest number of MWEs was extracted with the t-score (248 MWEs), followed closely by log-likelihood (220 MWEs), both on the normalized corpus. The number of true MWEs is, however, not enough to evaluate the performance. It is therefore interesting to perform an error analysis and check which cases were and which were not discovered in the raw corpus (as compared to the normalized one). Correctly detected MWEs in the raw corpus can be divided into three categories:


Figure 2.5: Number of true MWEs extracted with the analyzed association measures.

– MWEs with Arabic numerals, e. g., پانوراما 360 '360 panorama',
– MWEs with words written in Latin script, e. g., فناوری HDR 'HDR technology',
– MWEs whose components do not end with non-joiner letterforms, e. g., ابر کامپیوتر 'supercomputer'.

The MWEs that were not discovered in the raw corpus generally seem to have one thing in common: they contain words with non-joiner letters, so the use of whitespace is not always consistent, e. g., فروش برخط 'online sales', بازی رایانه‌ای 'computer game', باشگاه ورزشی 'sports club', or اکولوژی دریا 'marine ecology'. Furthermore, all MWEs found in the raw corpus were also discovered in the normalized one.

To further evaluate the true MWEs discovered in the raw and normalized corpora, the combined outcome of all AMs was used. For the MWE candidates from the raw and the normalized corpus, precision, recall, and F-score were computed.9 The overall impact of text normalization on the discovery of multiword expressions in Persian is presented in terms of F-score in Table 2.7. The F-score turned out to be 26 percentage points higher in the case of the normalized data. Applying text normalization procedures thus proved to have a significant impact on the MWE discovery task in Persian.

9 Similarly to Evert and Krenn (2001), who used these metrics to plot a precision–recall curve for direct comparison of different AMs.

Table 2.7: Comparison of F-score for the MWEs discovery task.

                  F-score
not-normalized     15 %
normalized         41 %
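The F-score used throughout is the harmonic mean of precision and recall. As a quick reference (the precision/recall inputs below are placeholders, not the study's values):

```python
def f_score(precision: float, recall: float) -> float:
    """F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f_score(0.5, 0.5), 2))  # -> 0.5
```

Because the harmonic mean penalizes imbalance, a candidate list that is precise but misses many true MWEs (or vice versa) scores low, which is why it is the headline metric here.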

Since different normalization steps fit different data and downstream NLP tasks, it is interesting to analyze the impact of the particular normalization steps on the MWEs discovery task. Table 2.8 presents metrics for various combinations.

Table 2.8: Evaluation of normalization steps on MWEs discovery in Persian.

Normalization step(s)                                                               F-score
not-normalized                                                                       15 %
encodings unification                                                                19 %
encodings unification + date unification                                             24 %
encodings unification + date unification + space correction                          41 %
encodings unification + space correction                                             39 %
encodings unification + date unification + space correction + pinglish conversion    40 %

It turned out that the most efficient combination of normalization steps is the unification of encodings and dates combined with space correction. In fact, correcting and unifying spaces proved to be the most crucial normalization step for the current task.
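Such step combinations are easy to express as composable functions. The sketch below (the step implementations are simplified assumptions; e. g., encoding unification is reduced to the two best-known Arabic-to-Persian codepoint fixes) mirrors the ablation setup of Table 2.8:

```python
from functools import reduce

def unify_encodings(t: str) -> str:
    # Arabic yeh/kaf -> Persian yeh/kaf (U+064A -> U+06CC, U+0643 -> U+06A9)
    return t.replace("ي", "ی").replace("ك", "ک")

def correct_spaces(t: str) -> str:
    return " ".join(t.split())  # collapse runs of whitespace

def pipeline(*steps):
    """Compose normalization steps, applied left to right."""
    return lambda text: reduce(lambda t, s: s(t), steps, text)

normalize = pipeline(unify_encodings, correct_spaces)
print(normalize("كتاب  خوبي"))  # -> کتاب خوبی
```

Ablating a step then amounts to building the pipeline without it and re-running the downstream evaluation, exactly as in the table above.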

2.5 Conclusion

Human civilization is drowning in noisy, unstructured data and starving to extract knowledge from it. Text normalization helps NLP applications derive insights from this data by transforming it into a more formal, standardized version. It is therefore essential for a variety of applications, e. g., machine translation, information extraction, parsing, sentiment analysis, or speech recognition. This chapter discussed the challenges that the Persian language poses for an NLP pipeline, and thus the importance of normalizing Persian input. Since the impact of text normalization depends on the data and the NLP task, it was demonstrated here on the task of multiword expressions discovery. The experiment results show that the performance of a system without a Persian-tailored normalization step is 26 percentage points worse in terms of F-score, which is a significant deterioration.


Since “one size fits all” normalization systems seem to be less precise than tailored ones, future work should include analyzing how normalized data influences other NLP tasks in the Persian language, particularly information retrieval, syntactic parsing, and sentiment analysis. What is more, with the advent of large pre-trained language models even for lower-resource languages such as Persian (ParsBERT by Farahani et al. 2021), approaches taking advantage of such models should also be studied for text normalization.

Appendix A

Table 2.9: Persian Writing System.

No.  Letter   Name   Sound
1    آ - ا     alef   ā, a, e, o
2    ء        hamze  ʔ
3    ب        be     b
4    پ        pe     p
5    ت        te     t
6    ث        se     s
7    ج        jim    j
8    چ        če     č
9    ح        he     h
10   خ        xe     x
11   د        dal    d
12   ذ        zal    z
13   ر        re     r
14   ز        ze     z
15   ژ        že     ž
16   س        sin    s
17   ش        šin    š
18   ص        sad    s
19   ض        zad    z
20   ط        tā     t
21   ظ        zā     z
22   ع        ʔein   ʔ
23   غ        qein   q
24   ف        fe     f
25   ق        qāf    q
26   ک        kāf    k
27   گ        gāf    g
28   ل        lām    l
29   م        mim    m
30   ن        nun    n
31   و        vāw    v, o, u, ow
32   ه        he     h, e, a
33   ی        ye     y, i, ey


Bibliography

Almeida, Paulo Sérgio, Carlos Baquero, Nuno Preguiça & David Hutchison. 2007. Scalable Bloom filters. Information Processing Letters 101. 255–261.
Baldwin, Timothy & Su Nam Kim. 2010. Multiword expressions. In Nitin Indurkhya & Fred J. Damerau (eds.), Handbook of Natural Language Processing, 2nd edn., 267–292. New York: Chapman & Hall/CRC.
Baldwin, Tyler & Yunyao Li. 2015. An in-depth analysis of the effect of text normalization in social media. In Rada Mihalcea, Joyce Chai & Anoop Sarkar (eds.), Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, May–June, 2015, 420–429. Stroudsburg, PA, USA: Association for Computational Linguistics.
Biber, Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad & Edward Finegan. 1999. Longman Grammar of Spoken and Written English. Essex, England: Pearson Education Ltd.
Bijankhan, Mahmood, Javad Sheykhzadegan, Mohammad Bahrani & Masood Ghayoomi. 2011. Lessons from building a Persian written corpus: Peykare. Language Resources and Evaluation 45. 143–164.
Darrudi, Ehsan, Mahmoud R. Hejazi & Farhad Oroumchian. 2004. Assessment of a modern Farsi corpus. In 2nd Workshop on Information Technology & its Disciplines (WITID), Kish Island, Iran, 2004, 73–77. Kish Island, Iran: Iran Telecommunication Research Center.
Doostmohammadi, Ehsan, Minoo Nassajian & Adel Rahimi. 2020. Joint Persian word segmentation correction and zero-width non-joiner recognition using BERT. In Donia Scott, Nuria Bel & Chengqing Zong (eds.), Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, December, 2020, 4612–4618. Barcelona, Spain (Online): International Committee on Computational Linguistics.
Evert, Stefan. 2008. Corpora and collocations. In Anke Ludeling & Merja Kyto (eds.), Corpus Linguistics. An International Handbook, 1212–1248. Berlin: Mouton de Gruyter.
Evert, Stefan & Brigitte Krenn. 2001.
Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France, July, 2001, 188–195. Stroudsburg, PA, USA: Association for Computational Linguistics.
Farahani, Mehrdad, Mohammad Gharachorloo, Marzieh Farahani & Mohammad Manthouri. 2021. ParsBERT: transformer-based model for Persian language understanding. Neural Processing Letters 53. 3831–3847.
Fleiss, Joseph L. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76. 378–382.
Ghayoomi, Masood & Saeedeh Momtazi. 2009. Challenges in developing Persian corpora from online resources. In Proceedings of the 2009 International Conference on Asian Language Processing, Singapore, 2009. Singapore: Institute of Electrical & Electronics Engineers. https://www.academia.edu/12944708/Challenges_in_Developing_Persian_Corpora_from_Online_Resources.
Hajitabar, Alireza, Hossein Sameti, Hossein Hadian & Arash Safari. 2017. Persian large vocabulary name recognition system (FarsName). In 2017 Iranian Conference on Electrical Engineering (ICEE), 1580–1583. Tehran: Institute of Electrical & Electronics Engineers (IEEE).
Han, Bo, Paul Cook & Timothy Baldwin. 2013. Lexical normalization for social media text. ACM Transactions on Intelligent Systems and Technology 4(1). 5:1–5:27.
Hassan, Hany & Arul Menezes. 2013. Social text normalization using contextual graph random walks. In Hinrich Schuetze, Pascale Fung & Massimo Poesio (eds.), Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, 2013, 1577–1586. Stroudsburg, PA, USA: Association for Computational Linguistics.


Hazm. 2014. Python library for digesting Persian text. https://github.com/sobhe/hazm.
Hosseini, Pedram, Ali Ahmadian Ramaki, Hassan Maleki, Mansoureh Anvari & Seyed Abolghasem Mirroshandel. 2018. SentiPers: a sentiment analysis corpus for Persian. In Computing Research Repository (CoRR). http://arxiv.org/abs/1801.07737.
Jackendoff, Ray. 1997. Twistin' the night away. Language 73. 534–559.
Krieger, Maria & Maria José Bocorny Finatto. 2004. Introdução à terminologia: teoria & prática. São Paulo: Contexto.
Landis, Richard & Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics 33. 159–174.
Megerdoomian, Karine. 2018. Computational Linguistics. In Anousha Sedighi & Pouneh Shabani-Jadidi (eds.), The Oxford Handbook of Persian Linguistics, 461–480. Oxford: Oxford University Press.
Moattar, Mohammad Hossein, Mohammad Mehdi Homayounpour & Davood Zabihzadeh. 2006. Persian text normalization using classification tree and support vector machine. In 2006 2nd International Conference on Information and Communication Technologies, 1308–1311. Damascus, Syria: Institute of Electrical & Electronics Engineers.
Mohtaj, Salar, Behnam Roshanfekr, Atefeh Zafarian & Habibollah Asghari. 2018. Parsivar: a language processing toolkit for Persian. In Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis & Takenobu Tokunaga (eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018, 1112–1118. Miyazaki, Japan: European Language Resources Association (ELRA).
Nezami, Omid Mohamad, Paria Jamshid Lou & Mansoureh Karami. 2019. ShEMO – a large-scale validated database for Persian speech emotion detection. Language Resources and Evaluation 53. 1–16.
Oji, Romina, Seyedeh Fatemeh Razavi, Sajjad Abdi Dehsorkh, Alireza Hariri, Hadi Asheri & Reshad Hosseini. 2021. ParsiNorm: A Persian Toolkit for Speech Processing Normalization. https://arxiv.org/pdf/2111.03470.pdf.
Panahandeh, Mahnaz & Shirin Ghanbari. 2019. Correction of spaces in Persian sentences for tokenization. In 2019 5th Conference on Knowledge Based Engineering and Innovation (KBEI), 670–674. Tehran, Iran: Institute of Electrical & Electronics Engineers.
Ramisch, Carlos. 2009. Multi-word terminology extraction for domain-specific documents. Grenoble, France: École Nationale Supérieure d'Informatique et de Mathématiques Appliquées Master Thesis.
Rasooli, Mohammad Sadegh, Manouchehr Kouhestani & Amirsaeid Moloodi. 2013. Development of a Persian syntactic dependency treebank. In Lucy Vanderwende, Hal Daumé III & Katrin Kirchhoff (eds.), Proceedings of North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), New York, USA, 2013. Stroudsburg, PA, USA: Association for Computational Linguistics.
Sabeti, Behnam, Hossein Abedi Firouzjaee, Ali Janalizadeh Choobbasti, S. H. E. Mortazavi Najafabadi & Amir Vaheb. 2018. MirasText: an automatically generated text corpus for Persian. In Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis & Takenobu Tokunaga (eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 1174–1177. Miyazaki, Japan: European Language Resources Association (ELRA).
Sag, Ivan A., Timothy Baldwin, Francis Bond, Ann A. Copestake & Dan Flickinger. 2002. Multiword expressions: a pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), 1–15. Berlin, Heidelberg: Springer-Verlag.


Salehi, Bahar, Narjes Askarian & Afsaneh Fazly. 2012. Automatic identification of Persian light verb constructions. In Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer.
Salehi, Bahar, Paul Cook & Timothy Baldwin. 2016. Determining the multiword expression inventory of a surprise language. In Yuji Matsumoto & Rashmi Prasad (eds.), Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 471–481. Osaka, Japan: The COLING 2016 Organizing Committee.
Sarabi, Zahra, Hooman Mahyar & Mojgan Farhoodi. 2013. ParsiPardaz: Persian language processing toolkit. In Proceedings of the 3rd International Conference on Computer and Knowledge Engineering (ICCKE), Iran, 2013, 73–79. Iran: Institute of Electrical & Electronics Engineers.
Seraji, Mojgan. 2013. PrePer: a pre-processor for Persian. In Fifth International Conference on Iranian Linguistics (ICIL5), Bamberg, Germany, 2013, 7–10. Bamberg, Germany: Cahiers de Studia Iranica.
Seretan, Violeta. 2008. Collocation extraction based on syntactic parsing. Geneva, Switzerland: University of Geneva Ph.D. thesis.
Shamsfard, Mehrnoush, Hoda Sadat Jafari & Mahdi Ilbeygi. 2010. STeP-1: a set of fundamental tools for Persian text processing. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner & Daniel Tapias (eds.), Proceedings of the International Conference on Language Resources and Evaluation, Valletta, Malta, 2010, 859–865. Valletta, Malta: European Language Resources Association (ELRA).
Shamsfard, Mehrnoush, Akbar Hesabi, Hakimeh Fadaei, Niloofar Mansoory, Ali Famian, Somayeh Bagherbeigi, Elham Fekri, Maliheh Monshizadeh & S. Mostafa Assi. 2012. Semi automatic development of FarsNet; the Persian WordNet. In Proceedings of the 5th Global WordNet Association Conference, Mumbai, India, 2012. Mumbai, India: Global WordNet Association. https://www.academia.edu/331794/Semi_Automatic_Development_of_FarsNet_The_Persian_WordNet.
Sproat, Richard, Alan W. Black, Stanley Chen, Shankar Kumar, Mari Ostendorf & Christopher Richards. 2001. Normalization of non-standard words. Computer Speech and Language 15(3). 287–333.
Taslimipoor, Shiva, Afsaneh Fazly & Ali Hamze. 2012. Using noun similarity to adapt an acceptability measure for Persian light verb constructions. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), 670–673. Istanbul, Turkey: European Language Resources Association (ELRA).
Villavicencio, Aline & Marco Idiart. 2019. Discovering multiword expressions. Natural Language Engineering 25(6). 715–733.
Wahl, Alexander & Stefan Th. Gries. 2018. Multi-word expressions: a novel computational approach to their bottom-up statistical extraction. In Pascual Cantos-Gomez & Moises Almela-Sanchez (eds.), Lexical Collocation Analysis: Advances and Applications, 85–109. Cham: Springer International Publishing.
Zhang, Congle, Tyler Baldwin, Howard Ho, Benny Kimelfeld & Yunyao Li. 2013. Adaptive parser-centric text normalization. In Hinrich Schuetze, Pascale Fung & Massimo Poesio (eds.), Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, August, 2013, 1159–1168. Stroudsburg, PA, USA: Association for Computational Linguistics.

Jalal Maleki

3 Dabire: A Phonemic Orthography for Persian

Abstract: This chapter describes Dabire, a phonemic writing system for Persian. The writing system uses an extended Latin alphabet and introduces a number of writing conventions with the aim of providing a simple and easy-to-learn writing system, thereby making the language accessible to a broader audience. The explicit representation of vowels and of the Ezâfe construction, together with the fact that Dabire is faithful to Persian, improves linguistic processing and analysis.

3.1 Introduction

This chapter presents Dabire, a phonemic and romanized writing system for Persian. The motivation for proposing Dabire is threefold: (1) to facilitate linguistic analysis (computational or otherwise), (2) to provide an alternative writing system for Persian speakers unfamiliar with the Arabic-based script, and (3) to provide a practical, consistent, and easy-to-learn writing system for teaching Persian. Currently, Persian is predominantly written in variations of the Arabic writing system. The official writing system of Iran is the Perso-Arabic Script (PA-Script) (see Farhangestan 2003). In recent years, however, an increasing body of romanized Persian text has appeared on the Internet and in mobile communication. For the majority of Persian speakers, who are well acquainted with the PA-Script, occasional use of the Latin alphabet is mainly due to the technological ease of use associated with it. For second- and third-generation Persian-speaking immigrants, who are less likely to have had the opportunity to learn the PA-Script and more likely to have been educated in a language written in the Latin script, a romanized script is usually the natural choice for writing Persian.

3.1.1 Background

3.1.1.1 Brief Overview of Earlier Work

Romanized transcription of Persian has a relatively long history. Initiatives in designing Latin-based writing systems for Persian date back to the late 19th century. Hashabeiky (2005) provides a thorough review of the history of orthographic reform in Iran. According to Behrouz (1984), Mirzā Malkom Khān (1833–1908) is the first Iranian to openly speak of alphabet reform. Later, another pioneer, Ākhunzādeh1 (see Encyclopedia Iranica 1982–2022), publishes a new alphabet, Alefbâ-e Jadid, in 1857. Both Malkom Khān and Ākhunzādeh believed that the traditional Arabic-based writing system hindered progress in many aspects of society and that a change of orthography would remedy those problems. The next important event in the romanization of Persian takes place in the Soviet Republic of Tajikistan. Unsatisfied with earlier reforms of the Perso-Arabic orthography, the Soviet Republic eventually chooses to switch over to the Latin alphabet in 1927. This changeover, however, does not last long, and in the late 1930s the Tajiks change to the Cyrillic script. During the second half of the 20th century, many authors use romanization in various publications. Lambton (1953) uses romanized transcription in her introductory Persian grammar book. A few years later, Neysari (1956) writes a Persian book for primary school in a script named Xatt-e Jahãni, which essentially shows that he had a romanization proposal in mind without presenting it as such.2 In 1957, Lazard (1957) writes a Persian grammar book in French and uses a romanized transcription to show the pronunciation of Persian words. More recently, Mace (2003) uses his own romanization scheme in his Persian grammar book. The most recent proposals, developed almost in parallel, include Xatt e Now (2007), Unipers (2007), and Maleki (2008, 2003). These proposals are very similar to the schemes introduced in the 1950s. Descriptions of Eurofarsi and Unipers now reside only in Internet archives, since the websites describing these proposals have ceased to exist. Encyclopedia Iranica (see Encyclopedia Iranica 1982–2022)3 introduces and uses a rather rich romanization scheme for the transcription of words in multiple languages, including Persian.
The only official attempt at standardizing Persian romanization is limited to a transliteration4 table that the Iranian government has submitted to the United Nations. The main use of this scheme is for standardizing the transliteration of geographic names in Iran.5

1 Apparently inspired by the writings of his friend Mirzā Malkom Khān.
2 When I visited Dr. Neysari (RIP) in 2007, sadly my only meeting with him, he gave me two small books authored by him: Robãiyat e Omar Xayyãm (1962) and Mollã Nasreddin (1961). Both books are written in a Latin-based orthography he names Xatt-e Jahãni.
3 Encyclopedia Iranica, which resides at http://www.iranicaonline.org, is a valuable source of knowledge initiated by Professor Ehsan Yarshater in 1982.
4 It is important to point out that we distinguish between transliteration, which essentially means letter-by-letter conversion between writing systems, and transcription, which refers to the systematic conversion of speech to an orthographic form.
5 See https://unstats.un.org


Building on the worldwide popularity of Latin-based writing systems, we hope that Dabire contributes to making Persian more accessible to all those who have difficulties learning Persian through the traditional PA-Script. In particular, we hope to support the younger generation of Iranian immigrants who speak Persian at home, as well as others who wish to learn Persian. In many cases, Internet-savvy Iranians have already taken matters into their own hands, and every one of them seems to have a personal romanization scheme. Many of the pioneers of Persian romanization have aimed at changing the writing system completely. Many, quite rightly, believe that Persian is an Indo-European language that would benefit from a romanized writing system. However, introducing reforms in writing systems is usually an undertaking for governments and organizations such as the Persian Academy (Farhangestân). Our main aim in proposing Dabire is to provide a tool for preparing text for computer-based processing and to make the language more accessible to a broader audience. Furthermore, while teaching Persian to others, we have learnt that Dabire is a helpful scheme for teaching and learning Persian, in much the same way that Pinyin facilitates teaching and learning Chinese.

3.1.2 Structure and Format of This Chapter

The rest of this chapter is organized as follows. Section 3.2 introduces some phonological characteristics of orthographic relevance and gives a detailed account of Dabire. Although the focus of this chapter is the writing system, some applications related to it are briefly discussed in Section 3.3. The final section summarizes and concludes the chapter. Throughout this chapter, Dabire text is printed in italics. The orthographic conventions adopted in Dabire are highlighted with two triangles (▶ ◀) and numbered to facilitate cross-referencing. For example, here is the first DOC-convention (Dabire Orthographic Convention):

▶ DOC-1 Numbers in Dabire
As customary in Latin-based writing systems, Dabire uses the Arabic decimal digits 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 for the representation of numbers. ◀

3.2 Dabire

This section forms the main part of this chapter and introduces Dabire, a romanized transcription scheme for Persian. Dabire consists of an alphabet and a number of transcription conventions. Although we have tried to cover a lot of details, we are sure some issues are left out; it is therefore important to see this proposal as an initial step towards a complete and robust writing system for Persian. Dabire writing conventions are mainly based on pronunciation. The goal is to keep the spelling as close to the pronunciation as possible. Here are the main characteristics of the proposed writing system:
– Dabire uses an extended Latin alphabet and is written from left to right.
– The selected alphabet is adequate for most dialects of Persian.6 However, some dialects of Persian may require minor extensions.
– Dabire is primarily based on pronunciation and, as much as possible, we try to follow the principle: whatever is pronounced is written, and vice versa.
– Dabire is designed for the Persian language and is subject to its own rules and specifications. There are similarities to other Latin-based orthographies, but these are very limited in context and extent and should not be generalized.
– Keeping things as simple as possible is a general design principle that applies to the design of writing systems as well. So, although the main aim is to represent every phoneme with a grapheme, it is also essential to minimize the number of graphemes and diacritics in the writing system. For example, even though the pronunciations of the Perso-Arabic letters Qeyn and Qâf are slightly different, we have, in accordance with many earlier proposals, decided to designate a single Dabire-grapheme q for both.
– Style issues such as capitalization, punctuation, and abbreviation conventions are mostly similar to other Latin-based scripts.
– In an attempt to keep Dabire simple, we have tried to introduce conventions that are less likely to lead to exceptions.

3.2.1 Design Principles in Dabire

Designing a new orthography or revising an existing one is a complicated undertaking subject to various constraints and requirements. Dabire has been designed with the following principles in mind.
– Linguistic Soundness
Bird (2000) argues that linguistics provides an 'expedient' off-the-shelf technology for the orthographer. She or he can take a few hundred words from the language under study and identify a set of minimal pairs which then constitutes the basis for a new alphabet. This working principle forms the foundation of a linguistically sound writing system.

6 The examples are written in the mainstream accent of Iran. More specifically, the kind of Persian Iranian schools are expected to teach.


– Phonemicity
A writing system is phonemic if it systematically encodes the sounds of the language for which it is designed. This systematicity is often interpreted as a well-defined mapping between the phonemes of the language and the graphemes of the writing system. A clear advantage of this principle is that it facilitates learning and teaching because there is a one-to-one correspondence between sounds and their written representations.
– Ease of Learning and Use
Nowadays, everyone is supposed to learn to read and write before the age of 7. Both the teachers and the students have limited time and resources available. Consequently, it is important to remove bottlenecks such as spelling problems related to many-to-one, one-to-many or one-to-none mappings between phonemes and graphemes. Designing a new writing system or introducing reforms into an existing one provides an opportunity to consider usability issues.
– Simplicity
To create a simple writing system, one needs to compromise, for example, as mentioned earlier, choosing to represent both Qâf and Qeyn with the same grapheme q. Exception minimization is another issue that keeps things simple.
– Phonemic vs Morphophonemic Considerations
Adjacent morphemes sometimes affect the realization of the neighboring sounds. For example, pronunciation of to o man (you and I) may generate a [v] between the adjacent [o]s. Should we then choose the convention of writing the [v], to vo man? In a non-compromising, phonemic writing system, one should choose to write it, but an alternative is to compromise the phonemic principle and not write the [v].7 In situations like this, the designer of the writing system needs to compromise either the phonemicity property or the morphological identities.
– Faithfulness
If an orthography does not respect the language for which it is designed, then improper decisions will be made. For example, the dominance of the English language may tempt some to select the digraph 'sh' for representing the phoneme /ŝ/. For Persian, at least, such a choice would be inappropriate, since the use of the plural suffix -hâ with words ending in s would introduce a large number of ambiguities; for example, pâshâ could stand for pâs-hâ (plural of pass) as well as pâŝâ (a name, from pâdŝâh (king)). Let us consider a few examples in order to exemplify the concept of faithfulness in writing. In particular, here are the transcriptions of pâŝâ in various languages:
– pasha, faithful to English
– pacha, faithful to French
– pascha, faithful to German
– pasja, faithful to Swedish
– pasza, faithful to Polish
– pâŝâ, faithful to Persian
Although many Dabire-graphemes are similar to those of other writing systems, the choice of mapping between phonemes and graphemes is mostly Persian-specific.
– Transparency
Choices in Dabire are motivated and, when necessary, compared to alternatives.
– Completeness
Dabire is designed to enable the representation of all phonologically significant aspects of Persian. However, although intonations are important in Persian, they are not represented in Dabire.
– Orthographical Depth
Dabire is a so-called shallow orthography (see Besner and Smith 1992). In some writing systems, the correspondence between the phonemes and the graphemes of the orthography is consistent and almost forms a one-to-one relationship. Such orthographies are called shallow orthographies. Examples of shallow orthographies are Finnish, Italian and Spanish. For example, in Finnish "ää" is pronounced the same way in määrä, ennakkoäänestys, and päättyi. Although these words look rather complicated, after the first year of school, students are not only able to spell and read them correctly, they are capable of properly reading and spelling non-words such as häämänuunni. English, on the other hand, has a writing system where the correspondence between phonemes and graphemes is opaque. In these so-called deep orthographies, the relationship between phonemes and graphemes is complicated and inconsistent. For example, learning the pronunciation of the letter "o" in the English word "one" does not provide any clues as to how its occurrences in some, on, open or shop should be pronounced. Furthermore, multiple graphemes may stand for the same sound, for example, newt and cute; and the same spelling may stand for different sounds, for example, have and take.

7 In the same way that we decide to write "vodka and lime" rather than "vodka[r] and lime" when some English speakers insert an [r] between the two [a]s for ease of pronunciation.


3.2.2 Dabire Alphabet

This section introduces the Dabire alphabet. Other proposals have adopted similar alphabets. However, proposing a new alphabet is only a minor part of an orthography. Persian has 29 phonemes (23 consonants and 6 vowels). Furthermore, in Arabic loanwords, the glottal stop Hamze (IPA: [ʔ]) and the pharyngeal fricative Eyn (IPA: [ʕ]) are phonologically significant. However, the pronunciations of Hamze and Eyn in Persian are similar (IPA: [ʔ]). Since there is a one-to-one correspondence between Persian phonemes and graphemes, we will denote both using the graphemes listed in DOC-2.

▶ DOC-2 Dabire Letters
Dabire has 30 letters, including 24 consonants b, c, d, f, g, h, j, ĵ, k, l, m, n, p, q, r, s, ŝ, t, v, w, x, y, z, ’, and 6 vowels a, â, e, i, o, u (where â, i and u represent the long vowels and a, e and o represent the short vowels). The Dabire letters are as follows (the name of each letter and the pronunciation of the name appear inside parentheses):
a (A, [æ]), â (Â, [ɑ]), b (Be, [be]), c (Ce, [tʃe]), d (De, [de]), e (E, [ʔe]), f (Fe, [fe]), g (Ge, [ɡe]), h (He, [he]), i (I, [ʔi]), j (Je, [dʒe]), ĵ (Ĵe, [ʒe]), k (Ke, [ke]), l (Le, [le]), m (Me, [me]), n (Ne, [ne]), o (O, [ʔo]), p (Pe, [pe]), q (Qe, [qe]), r (Re, [ɾe]), s (Se, [se]), ŝ (Ŝe, [ʃe]), t (Te, [te]), u (U, [ʔu]), v (Ve, [ve]), w (We, [we]), x (Xe, [xe]), y (Ye, [je]), z (Ze, [ze]), ’ (Ist, [ʔist]) ◀

The letter names do not necessarily correspond to the letter names in the PA-Script. The IPA representations of the letters are shown in DOC-3.

▶ DOC-3 Phonetic Representation
a [æ], â [ɑ], b [b], c [tʃ], d [d], e [e], f [f], g [ɡ] or [ɟ], h [h], i [i], j [dʒ], ĵ [ʒ], k [k] or [c], l [l], m [m], n [n], o [o], p [p], q [q] or [ɢ], r [ɾ], s [s], ŝ [ʃ], t [t], u [u], v [v], w [w], x [x], y [j], z [z], ’ [ʔ] ◀

▶ DOC-4 Dabire Diphthongs
Diphthongs such as ow [ow] and ey [ej] have no special representations in Dabire. Both are treated as a vowel followed by a consonant. ◀
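The phoneme–grapheme correspondence in DOC-2 and DOC-3 can be captured as a simple lookup table. The sketch below (Python; an illustration, not part of the proposal itself) encodes one broad IPA value per letter, so allophones such as [ɟ] for g, [c] for k and [ɢ] for q are deliberately omitted; it also shows that, because the mapping is one-to-one, letter-by-letter transcription needs no context:

```python
# Minimal sketch of the Dabire grapheme-to-IPA table (cf. DOC-2/DOC-3).
# One broad phoneme per letter; allophonic variants are omitted.
DABIRE_TO_IPA = {
    "a": "æ", "â": "ɑ", "e": "e", "i": "i", "o": "o", "u": "u",   # vowels
    "b": "b", "c": "tʃ", "d": "d", "f": "f", "g": "ɡ", "h": "h",
    "j": "dʒ", "ĵ": "ʒ", "k": "k", "l": "l", "m": "m", "n": "n",
    "p": "p", "q": "q", "r": "ɾ", "s": "s", "ŝ": "ʃ", "t": "t",
    "v": "v", "w": "w", "x": "x", "y": "j", "z": "z", "’": "ʔ",   # consonants
}

def transcribe(word: str) -> str:
    """Letter-by-letter IPA transcription of a Dabire word."""
    return "".join(DABIRE_TO_IPA[ch] for ch in word.lower())
```

Because every grapheme maps to exactly one broad phoneme, `transcribe("ŝab")` yields "ʃæb" without any contextual rules; a comparable table for the PA-Script would need disambiguation for unwritten vowels.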

3.2.3 Phonological Preliminaries

In this section, some phonological issues that are of relevance to the orthography are presented. For a more detailed phonology of Persian, see Samare (1997).

3.2.3.1 Syllable and Phonotactic Constraints

Every language has its constraints on how sequences of phonemes can be arranged to form admissible syllables and words; these constitute the phonotactic constraints of the language. The following rules show the syllable structure of Persian. In these rules, square brackets indicate optional elements, and V and C stand for a vowel and a consonant phoneme, respectively.

Syllable → [Onset] Rime
Onset → C
Rime → V[C[C]]

In other words, [C]V[C[C]] or, more specifically, V, VC, VCC, CV, CVC, CVCC are the six possible syllable patterns in Persian. As implied by these rules, the onset is optional. However, some linguists consider the onset to be compulsory (Samare 1997). Others, such as Neysari (1996) or Windfuhr (1989), see it as optional. Samare (1997) considers the glottal stop at onset as phonemic8, and the fact that this leads to fewer syllabic patterns is considered an advantage: CV, CVC, CVCC.
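Since the [C]V[C[C]] template is regular, the admissibility of a single syllable can be checked mechanically. A sketch in Python, with the caveat that it treats every Dabire letter as exactly one phoneme (including ’ as a consonant) and ignores the onset debate discussed above:

```python
import re

VOWELS = "aâeiou"
CONSONANTS = "bcdfghjĵklmnpqrsŝtvwxyz’"

# [C]V[C[C]]: optional one-consonant onset, a vowel nucleus, and a coda
# of at most two consonants -- the patterns V, VC, VCC, CV, CVC, CVCC.
SYLLABLE = re.compile(f"[{CONSONANTS}]?[{VOWELS}][{CONSONANTS}]{{0,2}}")

def is_admissible(syllable: str) -> bool:
    """True if a Dabire-spelled syllable fits the Persian template."""
    return SYLLABLE.fullmatch(syllable) is not None
```

For example, dust (friend, CVCC) and âb (water, VC) are admissible, whereas an onset cluster like the English "spring" (CCCVCC) is not, which is exactly why such loanwords undergo the repairs discussed in Section 3.2.3.3.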

3.2.3.2 Diphthongs

Certain combinations with the approximants [w] and [j] can, from a phonetic point of view, be viewed as diphthongs. These are [ây], [ay], [ey], [oy], [uy] and [ow] (see Samare 1997); for example, [ey] in peyk (courier) and [ow] in jow (barley). However, there is no convincing evidence that all these formations in Persian can be considered diphthongs from a phonological point of view. For example, in the case of [ow], the lack of syllabic constructions of the form /ow/CC or C/ow/CC weakens the status of /ow/ as a diphthong (see Windfuhr 1989).

3.2.3.3 Morphophonemics

This section discusses some morphophonemic issues that are relevant to the orthography of Persian words.

8 Prof. Samare believed that it was impossible to generate an initial vowel sound without first generating a glottal stop – personal communication during a visit to Farhangestân.


Hiatus and Euphony
In phonology, hiatus refers to situations where a vowel ending a syllable is immediately succeeded by another vowel that initiates the next syllable. In some languages, these adjacent vowels keep their phonological identities and are pronounced separately, for example, Hawaiian loaa (have), Japanese ao (blue), Persian pâiz (autumn), Swahili kuona (to see), Italian io (I). Sometimes, pronouncing the vowels one after the other is not considered optimal or "pleasant". In order to ease the pronunciation, various changes to words are made, in particular euphonic epenthesis.

Epenthesis
Epenthesis is the insertion of a sound, a letter, or a syllable into a word to facilitate its pronunciation. In Persian, epenthesis occurs in many contexts. Persian morphology is inflectional and derivational, and includes some suffixes and enclitics that begin with vowels. When these suffixes are concatenated with words that end in a vowel, an interaction between the vowels results. This interaction is usually graceful and, in speech, presents itself as a switch from the sound of one vowel to the other. For example, in pâiz, both /â/ and /i/ preserve their identity and quality. In other situations, a direct switch from one vowel to another is not smooth, and certain (usually euphonic) consonants are used as "mediators" between the adjacent vowels. These consonants are /g/, /j/, /n/, /v/ and /y/. Occasionally, /h/ may also take a mediating role, for example, in beheŝ (to her/him/it): beheŝ goftam (I told her/him/it). Epenthesis also occurs in loanwords. Syllables of foreign words that have clusters of consonants at onset or more than two consonants in rime (for example, CCV (ska), CCC (krk), CCCVCC (spring), CVCCC (Minsk)) violate the syllabic structures of Persian. When such words enter Persian as new words, or are pronounced according to the constraints of Persian rather than of the source language, their syllabic structure is modified to fulfill the limitations imposed by [C]V[C[C]]. In particular, clusters of consonants may be broken by inserting vowels between the constituents of the cluster in order to create syllables that are tolerated. Epenthesis usually involves /e/, but occasionally anticipatory coarticulation affects the choice of the epenthetic vowel. Here are a few examples: kerak (crack), Estokholm (Stockholm). Epenthesis also occurs when compound forms are created, for example, when the suffix -stân is added to words ending with a consonant: Tâjikestân, Kordestân.9

9 Sometimes, epenthesis is applied unnecessarily. Here are some examples: nardebân (nard+bân), amuzegâr (amuz+gâr). nardbân and amuzgâr are well-formed words.
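The cluster-breaking tendency described above can be sketched as a toy rule: a prothetic e before an initial s+consonant cluster (Stockholm → Estokholm) and an epenthetic e inside other initial clusters (the kerak example). This is only an illustration of the tendency, not a full repair algorithm; as noted above, anticipatory coarticulation can select other epenthetic vowels, and real loanwords may need repairs in the coda as well:

```python
VOWELS = set("aâeiou")

def repair_onset(word: str) -> str:
    """Toy repair of an initial consonant cluster to satisfy [C]V[C[C]].

    Two illustrative strategies from the text:
    - prothesis:  sC... -> esC...  (stokholm -> estokholm)
    - epenthesis: CC... -> CeC...  (krak -> kerak)
    """
    if len(word) > 1 and word[0] not in VOWELS and word[1] not in VOWELS:
        if word[0] == "s":
            return "e" + word            # prothetic e before sC clusters
        return word[0] + "e" + word[1:]  # epenthetic e inside other clusters
    return word
```

Words that already begin with an admissible onset, such as pâiz, pass through unchanged.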

Coarticulation
Coarticulation refers to the effects of the articulation of one sound on the articulation of a neighboring sound, for example, when [b] and [p] in ŝabpare become a single [p]. Coarticulation is a common phenomenon which usually has no consequences for the orthography of the words involved. However, some coarticulation may have orthographic consequences:
– Consonantal coarticulation is the immediate reconfiguration of the speech organs when articulating the consonant [n] and then transitioning to [b], which leads to the unintended generation of [m]. Some examples from Persian are panbe (cotton) and ŝanbe (Saturday).
– Anticipatory assimilation is a sort of coarticulation which refers to situations where the anticipation of a particular vowel affects the pronunciation of an earlier one. For example, bekon (do) → bokon, jelo (front) → jolo.

Elision
Sometimes certain sounds are deleted to improve pronunciation in certain contexts, for example, in poems or in colloquial communication: beneŝin → benŝin by dropping /e/, ŝâdbâŝ → ŝâbâŝ, con → co. Here are some more examples:
– Dropping e preceding a suffix
When words that end with /e/ are concatenated with the inflections -am, -at, and -aŝ, the e of the word may be dropped, forming an abbreviated form of the resulting word, for example, xâneaŝ → xânaŝ and kardeam → kardam. In most cases, a word with three syllables is reduced to a word with two syllables, and the stress is put on the second syllable.
– Dropping a of the suffixes -am, -at, and -aŝ
When ke and ce are concatenated with the inflections -am (my), -at (your), and -aŝ (her/his), the a of the inflection may be dropped to form a shorter form of the construction.
keam → kem (which of my (people))
keat → ket (which of your (people))
keaŝ → keŝ (which of his/her (people))
pâaŝ → pâŝ (his/her/its foot/leg)
jeloaŝ → jeloŝ (in front of him/her/it)
– Dropping a of ast
kojâ ast (where is it?) → kojâst
doruqgu ast (he is a liar) → doruqgust
ketâbi ast (it is a book) → ketâbist
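The "dropping a of ast" pattern lends itself to a one-line rule: when the preceding word ends in a vowel, ast loses its a and cliticizes onto that word. A hedged sketch (illustrative only; it does not cover the ke/ce cases, where the final vowel of the host word changes as well):

```python
VOWELS = set("aâeiou")

def contract_ast(word: str) -> str:
    """Contract '<word> ast' when the word ends in a vowel: kojâ ast -> kojâst."""
    if word and word[-1] in VOWELS:
        return word + "st"      # a of ast dropped, st cliticized
    return word + " ast"        # consonant-final hosts keep ast separate
```

This reproduces the three examples above: kojâ → kojâst, doruqgu → doruqgust, ketâbi → ketâbist.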


Vowel Transition in Colloquial Tehrâni Persian
– Transition of /a/ to /â/: cahâr → câhâr
– Transition of /â/ to /o/ or /u/: Some instances of /â/ that precede /m/ or /n/ are transformed to /u/, and very seldom to /o/. Examples: ân → un, râ → ro, nân → nun, xâne → xune, Tehrân → Tehrun, âmad → umad, tamâm → tamum
– Transition of /e/ to /i/: beŝin → biŝin, devist → divist
– Transition of /o/ to /u/: kodâm → kudum, orupâ → urupâ, ŝoluq → ŝuluq

Compound Transitions
Sometimes transitions are combinations of simple transitions; for example, in the following cases, first the /a/ in ast (is) is dropped, and then the /e/ of the previous word is changed to i: ke ast (who is she/he?) → kist, ce ast (what is it?) → cist, (neast or naast) → nist.10 In the Persian that is spoken in Tehran, ân râ (that, as the direct object of a verb) can be transformed to uno. This involves three operations: the â of ân is transformed to u, râ is changed to ro, and finally the r of the resulting ro is dropped.

Inflectional Morphology
Persian is an inflectional language. Word stems (lexemes) are concatenated with inflectional markers to form new words. The spelling of the inflected lexeme is usually preserved in the resulting new word, but may occasionally be modified for morphophonemic reasons.
raft (past stem of to go) → raftam (I went)
raft → rafti (you went)
ro (present stem of to go) → naro (don't go)
Using inflectional morphemes sometimes involves specific phonological transitions as well.
ce -am → ceam or ciam (what am I?)
ce -i → cei or cii (what are you?)
ce -e → cee or cie (what is it?)
ce -im → ceim or ciim (what are we?)
ce -id → ceid or ciid (what are you?)
ce -and → ceand or ciand (what are they?)
Because of limited space, other examples of similar transformations are excluded from this chapter.

10 ne- and na- are prefixes for negating verbs. In fact, ne- is a transformation of na-.
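The /â/ → /u/ shift before nasals described above is regular enough to state as a rewrite rule. A sketch in Python, which deliberately over-applies: the text says only some instances of /â/ shift, so a real tool would need a lexicon of exceptions rather than this bare pattern:

```python
import re

def tehrani_a_shift(word: str) -> str:
    """Colloquial Tehrâni shift of â to u before the nasals m and n.

    A lookahead keeps the nasal itself unchanged: nân -> nun.
    Illustrative only; it applies to every â+nasal sequence.
    """
    return re.sub(r"â(?=[mn])", "u", word)
```

This covers the listed examples (nân → nun, Tehrân → Tehrun, tamâm → tamum, xâne → xune) while leaving words like pâiz untouched.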


3.2.4 Dabire Transcription Conventions

This section specifies some more conventions that contribute towards creating a well-defined orthography for Persian. We expect these conventions to be revised over time.

▶ DOC-5 Phonemic Transcription
The general principle of transcription into Dabire is that it is based on the pronunciation of the word. ◀

3.2.4.1 Coarticulation

Since Dabire is a phonemic writing system, we need to decide to what extent coarticulation should affect the orthography of the word(s) involved.

▶ DOC-6 Consonantal Coarticulation
Dabire leaves the transcription of the side effects of consonantal coarticulation open; for example, the author of the text decides whether to write panbe (cotton) as pambe or ŝanbe (Saturday) as ŝambe. ◀

▶ DOC-7 Anticipatory Assimilation
In some cases, coarticulation has historically affected the orthography of words. In such cases, the Dabire convention is to abide by the tradition and write the words to reflect the coarticulation, for example, boluz (blouse) rather than beluz, or biâ (come!) rather than beâ. In many other cases, the author of the text is free to choose between alternatives. For example, bekon (do!) and bokon are both admissible. ◀

3.2.4.2 Writing Persian Names

Over the years, due to the absence of a standard for romanization, the Latin-based romanization of Persian names has been subject to a chaotic process. Depending on the individual's own choice11 or the preferences of a certain passport-issuing office or officer, the same name is written in many different ways. For example, if the

11 Here we also include the choices of the parents at the time of registering a name. The name Niuŝa, for example, could, depending on preference and country of registration, be registered as Niusha, Niyusha, Niyousha, Niosha, Nioucha, Niyoucha, Niuscha, Nioscha, or some other alternative.


name Said has, for some reason, been written as Saeid, Saied, or Saeed, and the owner has no interest in, or possibility of, changing it to Said, then the old transcription should be accepted as correct.

▶ DOC-8 Transcription of Persian Names
The transcription of Persian names should follow the conventions of Dabire unless external factors dictate an alternative writing. ◀

3.2.4.3 Trademarks

Trademarks, or rather tradenames, can be divided into two major classes: romanized trademarks and others. Trademarks that are Latin-based transcriptions, such as Smirnoff, Linux, Coca Cola, VOLVO, and cK, should keep their original spelling irrespective of how they are pronounced.

▶ DOC-9 Transcription of Trademarks
Trademarks written in Latin-based writing systems keep their spelling in Dabire. The Dabire-transcription of non-romanized trademarks should be based on the pronunciation of the trademark in the original language. ◀

3.2.4.4 Non-Persian Proper Names

The correct transcription of foreign proper names depends on the writing system in which the name is written. If the source is Latin-based, then the name keeps its orthography in Dabire; otherwise, its pronunciation determines the Dabire-transcription. For example, although a reasonable way of writing 'Geoffrey' in Dabire would be Jefri, it would be inappropriate to do so. In fact, the correct way of writing people's names is to write them according to their wishes, which mostly coincides with how their names are written in official documents. In Dabire, we would write the Canadian name k. d. lang exactly thus rather than as K. D. Lang, and the Swedish name Göran Älvskog exactly so rather than as pronounced in Persian, Yorân Elveskug. On the other hand, if the original name is not Latin-based and we have no information as to how the romanized version should be written, we will base the Dabire-transcription on the pronunciation of the name in the original language and a mapping from the phonemes of the source language to the phonemes of Persian.

▶ DOC-10 Transcription of Foreign Proper Names
Proper names that originate from languages with romanized scripts are written exactly as in the original script. If the original script is not based on Latin, and there is no established romanized transcription of the name, the Dabire-transcription is based on the pronunciation of the name in the original language. ◀

3.2.4.5 Geographic Names

As far as geographic names are concerned, we distinguish between those that have a well-established Persian pronunciation and those that do not. For example, Landan is the established Persian name for London, and the natural Dabire-transcription would be Landan; similarly Pâris and Estokholm. For other names, since there is international agreement on the romanization of geographic names, Google Maps and the like provide a suitable reference.

▶ DOC-11 Writing Geographic Names
For geographic names, the Dabire-transcription is normally based on the well-established pronunciation of the name in Persian. If such a pronunciation does not exist, the international resources for romanized geographic names should be used. For example, Vâŝangton (Washington), Samarqand (Samarkand), Mosko (Moscow), Munix (Munich) and Doŝanbe (Dushanbe) have well-established pronunciations in Persian, whereas Mjölby (a city in Sweden) and Llandudno (a town in Wales) do not. ◀

3.2.4.6 General Transcription of Non-Persian Words

For words from other languages, we use the following convention.

▶ DOC-12 Dabire-Transcription of Foreign Words
Given a foreign word w that belongs to a language L with a script S, we propose the following rules for the Dabire-transcription of w:
– if there exists a well-established pronunciation for w in Persian (either traditional or a new one created by the Persian Academy), then that pronunciation determines the Dabire-transcription, for example, kuântom (quantum) and foton (photon); otherwise,


– if S is a Latin script, then the spelling of w in S is adopted,12 for example, the English word Sir and the name of the Swedish city Gränna are written in the same way in Dabire; otherwise,
– the Dabire-transcription is based on the pronunciation of the word in the original language, for example, naam (yes in Arabic). ◀
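The three-way decision procedure of DOC-12 can be expressed as a small dispatch function. The sketch below is purely illustrative: `PERSIAN_LEXICON` and `LATIN_SCRIPT` are hypothetical placeholder resources standing in for the established-pronunciation lexicon and script metadata that a real implementation would need:

```python
# Hypothetical resources for illustration only; a real system would use
# curated pronunciation lexicons and per-language script metadata.
PERSIAN_LEXICON = {"quantum": "kuântom", "photon": "foton"}
LATIN_SCRIPT = {"en", "sv", "de", "fr"}

def dabire_form(word: str, lang: str, pronunciation=None) -> str:
    """DOC-12 decision procedure for transcribing a foreign word w from language L."""
    if word.lower() in PERSIAN_LEXICON:       # rule 1: established Persian pronunciation
        return PERSIAN_LEXICON[word.lower()]
    if lang in LATIN_SCRIPT:                  # rule 2: Latin-script source keeps its spelling
        return word
    return pronunciation or word              # rule 3: fall back on source pronunciation
```

Thus "Quantum" maps to its established form kuântom, "Gränna" keeps its Swedish spelling, and an Arabic word is rendered from a supplied pronunciation such as naam.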

3.2.5 Some Arabic-Specific Issues

For many centuries, Arabic has been an extensive ingredient of Persian. Persian texts often contain Arabic words and terms, and, most important of all, many great Iranian writers, poets, mathematicians and scientists have Arabic names. It is, therefore, important to introduce some conventions specifically for Arabic.

The Arabic Definite Article Al
Arabic consonants are divided into two classes, solar and lunar consonants. When the definite article Al precedes a word that starts with a solar consonant, the l in Al is assimilated to the solar consonant by means of gemination. A minor transcription issue in this respect is whether the transcription should keep the article Al intact or transcribe the assimilation. For a word such as Aŝŝams (the sun), for example, there are three alternative transcriptions in Dabire:
1. Al ŝams or Al-ŝams
2. Alŝams
3. Aŝŝams
The first two alternatives preserve the morphemic form of Al and leave the correct pronunciation to the reader, whereas the third alternative reflects the actual pronunciation (respecting the DOC-5 convention). Although Dabire is a phonemic transcription scheme, we choose to leave the writing format to the author.13

▶ DOC-13 Writing the Arabic Definite Article Al
The orthography of the Arabic definite article Al is left open; that is, the author of the text decides whether or not to distinguish between lunar and solar consonants, and whether to write the article and the subsequent word in a closed or an open form. ◀

12 The disadvantage of doing so would be similar to the complications that imported foreign words have created for English.
13 The reader may or may not be aware of solar and lunar consonants in Arabic; hence there is some risk of mispronunciation.

62 | J. Maleki

The Glottal Stop Ist

In Persian, the realizations of the Arabic pharyngeal fricative Eyn (IPA: ʕ) and the glottal stop Hamze (IPA: ʔ) are similar. We have chosen the Persian name Ist (stop) for these sounds and denote it phonetically as ʔ and orthographically as ’. In contrast to the Arabic pronunciations of these phonemes, the realization of ʔ is soft, and sometimes it is even realized by lengthening the vowel of the syllable in which it occurs. The phonemic status of Ist is an unsettled issue: some consider it phonemic (Samare 1997; Windfuhr 1989; Jahani 2005), and others do not consider it phonemic in certain positions (Adib-Soltani 2000; Neysari 1996). We do not assign phonemic significance to the word-initial glottal stop.14

The orthographic treatment of Ist is based on the position of Ist and is summarized in the following convention. In imported Arabic words, Ist is only written in the following contexts:
– postconsonantal, as in joz’ (part), ŝam’ (candle)
– prevocalic, as in joz’i (minor)
– postvocalic and preconsonantal, as in bo’d (dimension) and ba’d (later), which would otherwise be confused with bad (bad)

DOC-14 Writing Convention for Ist
Dropping Ist in the following situations is lossless:
1. when the Ist occurs at word-initial position, for example,
’Ali ('vŠcv) → Ali (vŠcv)
’a’emme ('vŠcvcŠcv) → a’emme (vŠcvcŠcv)
2. when the Ist initiates a syllable that succeeds a vowel, for example,
masâ’el (cvŠcvŠ'vc) → masâel (cvŠcvŠvc)
a’emme (vŠ'vcŠcv) → aemme (vŠvcŠcv).
In other situations, we cannot drop a medial or a final Ist, since doing so leads to incorrect syllabification. For example, writing maŝ’al (cvcŠcvc) as maŝal (cvŠcvc) would lead to an incorrect syllabification: as a consequence of removing Ist, the rime of the first syllable changes role and becomes the onset of the second syllable, creating a syllabification that would be incorrect in Arabic.

14 Glottal stops exist in many languages, including many Indo-European languages. The Longman Pronunciation Dictionary (LPD) calls a glottal stop at the onset a hard attack. LPD continues by saying that, in English, hard attack is not customary and is only used for emphasis. For example, “to eat” is usually pronounced [tu ˈiːt] but sometimes, for emphasis, [tu ˈʔiːt]. Even though the glottal stop occurs in English, there is no orthographic representation of it, since there is no need for one (Wells 2000). The same kind of reasoning holds in Persian phonology.
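The two lossless-drop cases of DOC-14 are mechanical enough to sketch in code. The following Python fragment is an illustration only, not part of the Dabire specification; it assumes Ist is already written as a straight apostrophe in a Dabire string and takes a, â, e, i, o, u as the vowel letters (the function name is ours):

```python
VOWELS = set("aâeiou")

def drop_lossless_ist(word: str) -> str:
    """Drop Ist (') where DOC-14 says it is lossless: at word-initial
    position, and where it starts a syllable right after a vowel."""
    out = []
    for i, ch in enumerate(word):
        if ch == "'":
            if i == 0:
                continue  # rule 1: 'Ali -> Ali
            if word[i - 1] in VOWELS and i + 1 < len(word) and word[i + 1] in VOWELS:
                continue  # rule 2: masâ'el -> masâel
        out.append(ch)
    return "".join(out)

assert drop_lossless_ist("'Ali") == "Ali"
assert drop_lossless_ist("masâ'el") == "masâel"
assert drop_lossless_ist("ŝam'") == "ŝam'"  # final Ist is kept
assert drop_lossless_ist("bo'd") == "bo'd"  # preconsonantal Ist is kept
```

Medial cases such as maŝ’al and bo’d fail both tests, so their Ist survives, in line with the convention.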

3 Dabire: A Phonemic Orthography for Persian

Here are some examples of medial and final occurrences of Ist that are phonemically significant:
– cv': ŝey’ (thing)
– cvc': ŝam’ (candle), qat’ (cut)
– cv'c: bo’d (dimension), ba’d (later, after), ra’d (lightning)

Arabic Taŝdid

In Arabic, Ŝadda or Taŝdid is a diacritic that marks a geminated consonant; for example, morabbâ (jam) would be written with a single Arabic “b” decorated with Taŝdid (a small w-shaped diacritic).

DOC-15 Writing Convention for Geminated Consonants
As in the orthography of some other Indo-European languages, such as Italian and Swedish, a geminated consonant is repeated in Dabire. The exception is the geminated Arabic “y”, which, depending on the pronunciation and syllabification, is written as yy or iy. For example, sayyed (cvcŠcvc, master), kamiyat (cvŠcvŠcvc, quantity), tahiye (cvŠcvŠcv, preparation).15

3.2.6 Ezâfe

Ezâfe is an inposition16 that relates the head of a phrase (mozâf) to its complements (mozâfon elayh) and creates an Ezâfe-structure. The mozâfon elayh is usually an adjective or a noun (phrase). In the PA-Script, Ezâfe is usually not written, which often makes life harder for the reader of the text.

DOC-16 Writing Ezâfe in Dabire
Ezâfe is transcribed as e or ye. If the mozâf ends with a, â, e, or u, then a euphonic [y] is added and the Ezâfe is written as ye; otherwise it is written as e.

Ezâfe appears in many different contexts: possession, ketâb e man (my book), pâ ye u (his/her foot); specialization of a concept, saqf e xâne (ceiling of a house); type specification, samâvar e noqre (silver samovar), sini e mesi (brass tray); comparison, mahd e Zamin (mother Earth); metaphor, dast e ruzgâr (hand of time); connecting a noun and its adjective, âb e zolâl (clear water), to e ahmaq (stupid you).

15 Some dictionaries suggest iyy as an alternative to iy, for example, tahiyye (preparation), sahmiyye (portion), but iy is sufficient.
16 Also referred to as an enclitic. We chose the term inposition since – just like other adpositions – it relates various parts of a sentence. Ezâfe relates the entities that come before and after it.

In spoken Persian, an Ezâfe is inserted between a person’s first name and surname. For example, the name of a person with the first name Bâbak and the surname Xorramdin is pronounced Bâbak e Xorramdin. We think it is inappropriate to include the Ezâfe when writing names; after all, person names are registered identities of some sort.

DOC-17 Ezâfe in Names
Although usually pronounced, an Ezâfe between the first name and the surname is not written in Dabire. For example, we write Dâryuŝ Purisa rather than Dâryuŝ e Purisa.

There are three ways of writing Ezâfe. We show these using the example “my book”:
1. ketâb e man (Maleki 2008; Xatt e Now 2007)
2. ketâb-e man (Lazard 1957)
3. ketâbe man (Lambton 1953; Unipers 2007; Mace 2003)
Because Ezâfe declares a correspondence between two entities, writing it separately (alternatives 1 and 2) is better than concatenating it to the end of the head word (mozâf). A further advantage of doing so is that it will not be confused with a suffixed /e/, which could either be the abbreviation of ast (is) or function as a definite article, for example, âbie (is blue) and âbie (the blue one).17 However, from a pronunciation point of view, it is advantageous to count Ezâfe as the vowel of the final syllable of the head word. For example, the syllabic structure of hamsar e man (my partner) is cvcŠcvŠcv cvc rather than cvcŠcvc v cvc.

DOC-18 Writing Ezâfe Separately
Ezâfe is written separately unless it forms a constituent of a compound word.
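DOC-16’s choice between e and ye depends only on the final letter of the mozâf, so it can be stated as a one-line rule. The sketch below is illustrative only (the function name is ours, not part of the convention):

```python
def ezafe_form(mozaf: str) -> str:
    """DOC-16: write the Ezâfe as ye after a final a, â, e, or u,
    and as e otherwise (as a separate word, per DOC-18)."""
    return "ye" if mozaf.endswith(("a", "â", "e", "u")) else "e"

assert ezafe_form("ketâb") == "e"   # ketâb e man (my book)
assert ezafe_form("pâ") == "ye"     # pâ ye u (his/her foot)
assert ezafe_form("xâne") == "ye"   # xâne ye mehmân
```

Note that a final i does not trigger the glide, matching sini e mesi (brass tray) from the examples above.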

3.2.7 Indefinite Object Marker -i

In Persian, indefiniteness is sometimes marked using the suffix -i, for example, mizi (a table), zani (a woman), Cinii (a Chinese). Just like all suffixes, -i is written in the closed format (concatenated to the word).

17 In âbie (is blue) the stress is on the first syllable and in âbie (the blue one) on the last.


3.2.8 Definite Direct Object Marker Râ

The postposition râ follows a definite noun, marking it as the direct object of the verb. For example, Ĵilâ nâme râ neveŝt (Ĵilâ wrote the letter). In colloquial Persian, râ is sometimes pronounced ro and sometimes shortened to o. There are three ways of writing râ. We show these using the example “Adam ate the apple”:
1. Âdam sib râ xord (Maleki 2008; Xatt e Now 2007)
2. Âdam sib-râ xord (Lazard 1957)
3. Âdam sibrâ xord (Lambton 1953; Unipers 2007; Mace 2003)
In Dabire we have chosen the first alternative. However, when râ participates in the formation of compound words, it is written in the closed form: cerâ (why),18 marâ (short for man râ (me)), zirâ (because), torâ (you – as the object of a sentence).

DOC-19 Writing Râ or Ro in Dabire
The postposition râ (colloquially ro) is written separately unless it forms a constituent of a compound word.

The reason for choosing to write râ separately is that in complicated structures it can be placed in multiple positions. Consider the following:
– Zani ke Jâyeze ye Nobel bord râ didam. (I saw the woman who won the Nobel Prize.)
– Zani râ ke Jâyeze ye Nobel bord didam. (I saw the woman who won the Nobel Prize.)

DOC-20 Writing o – the Abbreviation of Râ
The postposition o, which is an abbreviation of râ or ro, is always attached to the definite direct object. For example, U uno mixâd (She wants that one), Matte pico ŝekast ([The] drill broke the screw).

DOC-21 Writing Râ With an Abbreviated Pronoun
When râ marks an abbreviated personal pronoun, it is attached to the pronoun. For example, Marâ bebus! (Kiss me!). The pronoun man is shortened to ma- and attached to râ.

18 cerâ should not be confused with ce râ. Note the difference between cerâ xând? (why did he read?) and ce râ xând? (what did he read?).


3.2.9 The Definite Marker -e

In colloquial Persian, the clitic -e is sometimes used to indicate definiteness.

DOC-22 Writing the Definite Noun Marker e
The postposition e that can be used to mark a definite noun is attached to the end of the noun. For example, Jarue kâr nemikone (The hoover is not working). When the marked word is followed by the definite direct object marker ro, the marker is assimilated to an a. For example, jarue → jarua; Jarua ro pasbede! (Return the hoover!).

Note that e also represents Ezâfe as well as serving as an abbreviation of ast (is), for example, márde (is a man), whereas mardé (the man).17 As Ezâfe, e is written separately (see DOC-16).
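The colloquial definite marker of DOC-22 composes with ro in a fixed way, which the following snippet captures. It is a sketch only; the function name and keyword argument are ours:

```python
def definite(noun: str, with_ro: bool = False) -> str:
    """DOC-22: attach the colloquial definite marker -e; before the
    direct-object marker ro, the marker assimilates to -a
    (jarue -> jarua ro)."""
    return noun + ("a ro" if with_ro else "e")

assert definite("jaru") == "jarue"              # the hoover
assert definite("jaru", with_ro=True) == "jarua ro"
```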

3.2.10 Compound Words

A compound word is a word created by joining together several words (‘subwords’). In general, the meaning of a compound word gradually becomes independent from the meanings of its constituents, and the word therefore becomes a candidate for a dictionary entry. An example is deltang (homesick), which consists of del (heart) and tang (tight, narrow). There are numerous ways of building compound words.

3.2.10.1 Compound Word Formations

In this section, some of the possible ways of creating compound words are listed.
1. Two nouns, for example, kârxâne [kâr-xâne] (factory), golâb [gol-âb] (rose water), Xalij e Fârs (Persian Gulf), âb o havâ (climate), caŝm be râh [caŝm (eye), be (to, on), râh (road)] (state of waiting), mâdarŝowhar [mâdar (mother), ŝowhar (husband)] (husband’s mother), sangdel [sang (stone), del (heart)] (cruel)
2. Two verbs or verb roots, for example, keŝâkeŝ [keŝ-â-keŝ] (struggle), hast o nist (belongings, ‘that which exists and that which does not’), bud o nabud (existence), âmad nayâmad (for example, âmadnayâmad dâre would mean: it may or may not ‘stick’)
3. A noun and an adjective, for example, Sefidrud [Sefid-rud] (Sefidrud – ‘white river’), Siâhkuh [siâh-kuh] (Siâhkuh – ‘black mountain’), Nowruz (Persian New Year – ‘new day’), xoŝlebâs [xoŝ (nice), lebâs (clothes)] (well-dressed, nicely dressed), zibâru [zibâ (beautiful, pretty), ru (face)] (pretty face), kamru [kam (little), ru (face)] (shy), kamzur [kam (little), zur (power)] (weak), porzur (strong)
4. A noun and some form of a verb, for example, pâdow [pâ (foot), dow (present stem of davidan (to run))] (pageboy), sarafrâz [sar (head), afrâz from afrâŝtan] (proud, honoured), mâdarxânde (stepmother), darmânde [dar (inside, door), mânde from mândan (to remain, to stay)] (hopeless)
5. A numeral and a noun, for example, cârpâ [câr19 (four), pâ (foot, leg)] (four-legged)
6. An adverb and a verb, for example, piŝrow [piŝ (forward), row from raftan (to go)] (pioneer)
7. An adjective and a verb, for example, zendebâd [zende-bâd] (long live), ŝâdbâŝ [ŝâd-bâŝ] (congratulations), nowâmuz [now (new), âmuz from âmuxtan (to learn)] (new learner), tondrow [tond (quick, fast), row from raftan (to go)] (extremist)
8. An infinitive and a verbal noun connected by o, the short form of the connective va (and), for example, xordoxâb [xord-o-xâb] (eating and sleeping)
9. Two words where the second is constructed from the first by exchanging the first letter of the first word with one of the letters b, f, l, m, p, s, t, v. The second word is in principle meaningless but could be taken to mean ‘and such’. The words are sometimes attached and sometimes separated by the conjunction o. For example, ŝerr o verr (nonsense), kâr o bâr (business), kotmot (jacket and such), cart o part (gibberish, nonsense).
10. Two words where the second is constructed by changing the first vowel of the first word to u and joining them with the conjunction o. For example, hârt o hurt (empty threat), câle cule (holes), tak o tuk (seldom)
11. Repetition of a noun, adjective, adverb, verb, or sound, which may in some cases be used to amplify the semantics of a word or simply to create a new word, for example, namnam (in fine drops), namnam e bârân (drizzling rain), zârzâr (loud and bitter – when crying), dastedaste (in bundles, in groups), qahqah or qâhqâh (loud laugh), bezanbezan (fight), bahbah! (expressing appreciation, such as, well done!, way to go!, delicious!). Sometimes the infix -â- is inserted between the instances of the repetition, for example, keŝâkeŝ (struggle), barâbar (equal), peyâpey (consecutive), sarâsar (all over)
12. A word and a preposition, for example, bikas [bi-kas] (someone with no friends or relatives), barqarâr [bar-qarâr] (established), nâvâred [nâ-vâred] (novice – in a negative sense)

19 câr is an abbreviation of cahâr (four).

13. A number and a noun, for example, câhârrâh (road crossing), seguŝ (three corners – triangle, triangular)
14. A prefix and a word, for example, begu [be-gu] (say!), barnâme [bar-nâme] (program), piŝraft [piŝ-raft] (progress), hamkâr [ham (like, same), kâr (work)] (co-worker), nâdân [nâ-dân] (ignorant)
15. A word and a suffix, for example, lâlezâr [lâle-zâr] (tulip garden), kârgar [kâr-gar] (worker), behtar [beh-tar] (better), behtarin [beh-tarin] (best), dâneŝmand [dâneŝ-mand] (scientist)
16. Pronouns of various sorts, such as demonstrative pronouns, reflexive pronouns, relative and interrogative pronouns, and indefinite pronouns, can join other pronouns and words and create compound words. For example, incenin (such).
17. Some combinations of the above, for example, bâdbâdak (kite), dodasti (with both hands). In the latter, do is the number two, dast is a noun, and i is a suffix.

Section 3.2.10.2 discusses the orthographic issues related to compound words in Dabire.

3.2.10.2 Writing Compound Words

In some European languages, such as German and Swedish, the constituents of a compound are usually concatenated to form a single word; for example, the Swedish words aktiebolag [aktie (share, equity) + bolag (company)] (limited company), regeringspartiet [regering (government) + partiet (the party)] (the ruling political party). In English, compound words can be written in three different formats: an open format (spaced as separate words) such as ‘Home Office’, a hyphenated format (words separated by a hyphen) such as ‘LED-lighting’, or a closed format where the words are concatenated to form a single word, such as ‘screwdriver’ (see Ritter 2002). In Dabire, the closed format is preferred, but for less established compounds, hyphenated and open formats are also allowed.

DOC-23 Orthography for Compound Words
Compound words are written in closed format, hyphenated format, or open format. The preferred format is the closed format, but for less established compound words, the other formats are also admissible.

In the following subsections, some specific classes of compound words are discussed in more detail. For cases not covered by these rules, the following general rule applies: when a word combination appears for the first time, it may be written according to the open format; as it is used more often over a longer period of time, it moves to a hyphenated format and finally to the closed format. For example, the compound term taxte siâh (black board) may, after some persistent usage, be written as taxte-siâh, and as it is used more and more, one may write it as taxtesiâh (blackboard).

Writing words in the closed format may lead to some ambiguities or mispronunciations. For example, when one constituent of a compound word ends with the same consonant that initiates the next, the repetition may be interpreted as gemination. Another potential source of ambiguity is when a constituent ends with an -e and is followed by a y-, which may lead to the resulting ey being interpreted as a diphthong, for example, beyâb (find – IMP+2ND+SG), which contains the prefix be and the verb yâb. The following sections list some common compound word cases and suggest the appropriate orthographic choices.

3.2.10.3 Compound Verbs

Compound verbs are verbs that are formed from multiple words. In Persian, all compound verb constructions start with one or more non-verb constituents and end with a verb. The non-verbal element can be a preposition, an adjective, a noun, or the past stem of another verb. Here are some common ways of generating compound verbs:
1. Preposition+Verb: Some prepositions together with some verbs (infinitives) create new verbs (infinitives). For example,
bar-+verb: bardâŝtan (to take, to pick up), barandâxtan (to bring down)
dar-+verb: daryâftan (to understand), darâvardan (to take off)
ru-+verb: rukardan (to expose)
zir-+verb: zirkardan (to run over)
2. Adjective+Verb:
sorxkardan (to fry, sorx (red) and kardan (to do))
xubkardan (to heal, xub (good) and kardan (to do))20
xubŝodan (to improve, to get well, xub (good) and ŝodan (to become))
3. Noun+Verb:
rangkardan (to paint, rang (color, paint) and kardan (to do))
zendegikardan (to live)
dustdâŝtan (to like, to love)
ŝostoŝukardan (to wash, ŝostoŝu is itself a compound noun formed from the two verb stems ŝost (washed) and ŝu (wash))
gapzadan (to chat or converse, gap (chat, conversation) and zadan (to hit))
ârâmŝodan (to calm down, ârâm (calm, slow) and ŝodan (to become))
4. Past Stem+Verb:
âmaddâŝtan or alternatively âmadnayâmaddâŝtan (to bring good luck, âmad (come PAST+3RD+SG), nayâmad (come NEG+PAST+3RD+SG), and dâŝtan (to have))

20 Note: in xub kardan (to do something well), xub is an adverb.

DOC-24 Writing Compound Verbs
Compound verbs are written in closed format. Compare, for example, u bâzgaŝt (she came back) and u bâz gaŝt (she searched again).

3.2.10.4 Compounds Using Conjunctive O

As mentioned earlier, a common way of constructing compound words is to join them with the connective o (and). We saw examples involving two verbs, and an infinitive and a verbal noun. Here are some examples (including some of the earlier ones):
jost o ju (find and search) → jostoju
âs o pâs (hopeless) → âsopâs
xord o xâb (eating and sleeping) → xordoxâb
pas o piŝ (back and forth) → pasopiŝ
raft o âmad (commuting) → raftoâmad
cart o part (irrelevant talk, gibberish) → cartopart
serke o angabin (vinegar and honey) → serkeangabin21
Over time, some of these compounds have been transformed, and the connective o has shifted to e. For example, jostoju → josteju and goftogu → goftegu.

DOC-25 Writing Conjunctions Using O
Compound words constructed with the conjunction o are initially written in the open format, for example, goft o gu (conversation, dialogue). However, extensive and persistent use of the compound that results in an independent meaning for the compound would motivate a change of format to the closed format, for example, goft-o-gu or goftogu.

21 serkeangabin has actually been transformed to sekanjabin.
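Moving an established o-compound from open to closed format (DOC-25) is a plain string rewrite. The sketch below is illustrative only; it simply collapses the spaces around the conjunction:

```python
def close_o_compound(open_form: str) -> str:
    """DOC-25: rewrite an established 'X o Y' compound in closed
    format, e.g. jost o ju -> jostoju."""
    return open_form.replace(" o ", "o")

assert close_o_compound("jost o ju") == "jostoju"
assert close_o_compound("raft o âmad") == "raftoâmad"
assert close_o_compound("cart o part") == "cartopart"
```

The later o → e shift (jostoju → josteju) is lexical drift rather than a rule, so it is not modelled here.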


3.2.10.5 Compounds Using Ezâfe

Ezâfe is a common tool for constructing compound words in Persian. Just like o in the previous section, Ezâfe works well as a gluing mechanism in constructing compounds.

DOC-26 Compounds with Ezâfe
Compound words constructed with the inposition Ezâfe are written in open format, but extensive common usage of such a compound motivates a hyphenated or closed format. For example, gol e yax (chimonanthus praecox, wintersweet) can be written as goleyax, since its meaning is no longer the sum of the meanings of its constituents.

DOC-27 Words That Only Occur Together With Ezâfe
Some words only appear as the head word of an Ezâfe-construction and are therefore written together with the Ezâfe, forming a compound word, for example, bedune (without), where the final e is the Ezâfe.

Dropping the Ezâfe is a common phenomenon in Persian. Once Ezâfe has been removed from an Ezâfe-construction, it is important to reconsider the orthographic convention for the remaining mozâf and mozâfon elayh. A number of different cases arise.
1. Fakk e Ezâfe. Dropping Ezâfe while preserving the order of the mozâf and mozâfon elayh is called Fakk e Ezâfe. For example,
pedar e zan → pedarzan (wife’s father)
âb e jow → âbjow (beer, juice of barley)
Words constructed this way are suitable candidates for dictionary entries.
2. Ezâfe ye Maqlub. Dropping Ezâfe and at the same time switching the order of mozâf and mozâfon elayh is called Ezâfe ye Maqlub. For example, xâne ye mehmân (house of guests), which after this transformation becomes mehmânxâne (motel). Here are some more examples:
falak e gardande → gardandefalak22 (rotating heavenly wheel/universe)
âb e seyl → seylâb (flood water)
barg e gol → golbarg (petal, ‘leaf of flower’)
afzâr e narm → narmafzâr (software)

22 Piŝ az man o to, leyl o nahâri budast / Gardandefalak niz, be kâri budast / Har jâ ke qadam nahi to bar ru ye zamin / ân mardomak e ceŝm e negâri budast – Xayyâm
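Ezâfe ye Maqlub is a drop-and-swap operation, which a two-line sketch makes explicit. This is an illustration only (the function name is ours); the arguments are the mozâf and mozâfon elayh of the original Ezâfe-construction:

```python
def ezafe_ye_maqlub(mozaf: str, mozafon_elayh: str) -> str:
    """Drop the Ezâfe, swap head and complement, and write the result
    in closed format: xâne ye mehmân -> mehmânxâne."""
    return mozafon_elayh + mozaf

assert ezafe_ye_maqlub("xâne", "mehmân") == "mehmânxâne"
assert ezafe_ye_maqlub("barg", "gol") == "golbarg"
assert ezafe_ye_maqlub("afzâr", "narm") == "narmafzâr"
```

Fakk e Ezâfe is the same operation without the swap: plain concatenation in the original order (pedar + zan → pedarzan).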

DOC-28 Compounds with Dropped Ezâfe
Compound words constructed by implicit or dropped Ezâfe (including both Fakk e Ezâfe and Ezâfe ye Maqlub) are written in closed format.

3.2.10.6 Compounds Using Affixes

In Dabire, affixes and the affixed word are written in the closed format. There are a large number of prefixes and suffixes in Persian. Some common prefixes are mi-, ma-, be-, na-, bar-, bi-, and nâ-. Some compound words containing these prefixes are: miguyad (she/he is saying), begu! (say!), nagu! (don’t say!), bargaŝt (return, returned), bigonâh (innocent), nâdân (stupid). Persian has a large number of suffixes as well: -hâ or -ân for constructing plurals, -i for marking a noun as indefinite, -mand for assigning ownership. For example, setârehâ (plural of setâre (star)), abruân23 (plural of abru (eyebrow)), pâhâ (plural of pâ (foot, leg)), mardi (a man), dâneŝmand (scientist), xeradmand (wise).

Infixes are not so common, but they exist. The appropriate format for compound words with an infix is the closed format. For example, the infix â in keŝâkeŝ (struggle), peyâpey (one after the other, in series), bonâguŝ (cavity behind the ear),24 takâpu (search, running about), zanâŝui (married life).

DOC-29 Writing Compound Words Created by Affixing
An affix and the affixed word are written in the closed form.

23 Some may argue for a mediating [v] and a shift in vowel → abrovân.
24 From bon (root) and guŝ (ear). Also pronounced banâguŝ.

3.2.10.7 Compounds With Pronouns

There are many pronouns in Persian, for example, reflexive and reciprocal pronouns, demonstrative pronouns, relative and interrogative pronouns, and indefinite pronouns. Below, we present some examples for selected pronouns.
– Demonstrative Pronouns In, Ân, etc. The demonstratives (in Persian, vâĵehâ ye eŝâre) in (this) and ân (that) are usually written separately, except in the following cases, where they are glued to a neighbouring word (which could be another pronoun) and written in a closed format. In and ân may appear before, after, or both before and after words, for example, injâ (here), ânjâ (there), ânce (that which), inhâ (these), ânhâ (those), inke (this who), ânke (that who), insân (in this way), inqadr (so much), ânqadr (so much), hamin (this), hamân (that very), conin25 (such, such a one), conân (such), ânconân (such, such a one), in which ân appears twice, and ânconin, in which both in and ân appear, inconin (such a one), as in ânconân oftâd ke pâyaŝ ŝekast; hamconin (just like this, also), hamconân (like before, as it has been, still), candin (several, so many), candân (so many, so much),26 intowr (thus, in this way), ântowr (in that way), ingune (in this way), ângune (in that way), ângâh (then).
– Interrogative Pronoun Ce. Ce (what) is usually written separately, for example, Ce gofti? (What did you say?), In ce bud? (What was this?). When ce creates compound words together with other pronouns and prepositions, the compound word is written in the closed format: ânce (that which), cerâ (why), cehâ (plural of ce (what)), cetowr (how), cesân (how), cegune (how, in what way), ceqadr (how much). When ce is prefixed to râ, they are written as one word: cerâ (why, what for), cerâ (yes of course). Clearly, it is important to distinguish between the cases where ce is an independent word and those where it participates in a compound word. Consider the examples: Ce râ bordi? (What did you take?), Cerâ bordi? (Why did you take it?).
– Interrogative Pronoun Ke. Just like ce, ke (who, that) is written separately, for example, Ke goft? (Who said that?), In ke bud? (Who was this?). When ke creates compound words together with other pronouns and prepositions, the compound word is written in the closed format: inke (‘this who’, ‘this which’), ânke (that who), kehâ (plural of ke), Kio entexâb kardan?27 (Who did they choose?), conânke or cenânke (if indeed), haminke (as soon as), hamânke (as soon as), balke (perhaps, on the contrary).
Similar to the other pronouns above, it is important to distinguish between the cases where ke is an independent word and those where it is a constituent of a compound word.
– Reflexive Pronoun Xod (self). Similar to other pronouns, xod is written separately unless it participates in forming a compound word. Examples: Xod e u be man goft (He/she told me him/herself), Xodam goftam (I said it myself), xodrow (vehicle), xodkâr (automatic, also ballpoint pen).
– Ham (also, too, co-, together). Ham appears both as a prefix and as an independent word. As a prefix, it is used quite often to form new words, which should be written in the closed format. In all other cases, ham is written separately. For example, hammihan (fellow countryman), hamkelâs (classmate), hamkâr (co-worker), hamdigar (each other), hamrâh (fellow traveller, escort), hamcenân (as usual), hamin (this one), hamân (that one), hamânâ (indeed), bâ ham âmadim (we came together), to ham biâ (you come too).
– Negative Pronoun Hic (null, nothing). When participating in the formation of compound words, hic is attached to the word it is quantifying, for example,
hicyek (no one, none) – yek (one)
hickodâm (none) – kodâm (which)
hicgâh (never) – gâh (time)
Yâram Hamedâni o xodam hicmadâni – Yârabb ce konad hicmadân bâ hamedâni
Otherwise, it is written separately. For example,
Bâzi yek hic ŝod (The result of the game was one-nil)
Hoquq hic ensâni nabâyad pâymâl beŝe (The rights of no human being should be violated)

Similar principles apply to other pronouns.

DOC-30 Compound Words Involving Pronouns
Pronouns are written separately. When a pronoun participates in a compound word, it is written in the closed form. For example, marâ (shortened man+râ).

25 In compounds formed using con (since, like, because), the o may shift to an e, for example, conin→cenin, conân→cenân, and so on.
26 Candân ham gerân nabud (It wasn’t that expensive).
27 Here, kio is the result of some transformations: ke râ → keo → kio (râ is abbreviated to o, and the vowel e then shifts to i).
3.2.10.8 Compounds With Apositions Just like pronouns, apositions are written separately unless they participate in forming a compound word usually with a meaning that is independent from the aposition and the other participating words. In what follows, we list some apositions and some examples. – Preposition Be (to): Mâ be sinemâ raftim (We went to the movies) xâne be xâne (house to house) Be Tehrân raft (He/She went to Tehran)

3 Dabire: A Phonemic Orthography for Persian









75

bejoz (except) bedin (old version of be in (to this))28 Preposition Bâ (with): bâz bâ bâz (hawk with hawk – idiomatic: with own kind) bâ cakoŝ (with hammer) qazâŝ bâmaze ast. (The food is delicious) dalqake bâmaze ast. (The clown is funny) bâhuŝ (clever) bâsor’at → besor’at (quickly, sor’at (speed)) Preposition Bâz (again): bâz be mâ sar bezan (do come and visit us again) bâzgaŝtan (return) bâzpors (interrogator) Preposition Bi (without): man bi to be kojâ beravam (Where shall I go without you; “I’ll be lost without you”) bihude (pointless) bixod (out of one’s senses, without purpose) bidâd (injustice) bijâ (improper) bitarbiat (without manners, rude)

DOC-31 Writing Compound Words Involving Adpositions
Compound words that combine prepositions or postpositions with other words are written in the closed form.

3.2.11 Verb Prefixes and Inflections

Verb prefixes such as na-, ma-, be-, and mi- always join the verb. For example, magu (don’t say), naro (don’t go), begu! (say!), miravam (I am going). If the verb following na- starts with a consonant, then na- may shift to ne-. If the verb following na or ma starts with one of the vowels a, o, â, or u, then the glide [y] is inserted before the vowel, as in nayandâz [na+andâz] (don’t throw), mayafkan [ma+afkan] (don’t throw), nayoft [na+oft] (don’t fall), nayâ [na+â] (don’t come), nayumad [na+umad] (umad is the same as âmad (came) in some dialects).

28 Basi ranj bordam dar in sâl e si / Ajam zende kardam bedin pârsi / Namiram az in pas ke man zendeam / Ke toxm e soxan râ parâkandeam – Ferdowsi

Verb inflection suffixes are also written in closed format, for example, goftam, gofti, goftim.

DOC-32 Writing Verb Prefixes and Inflections
Compound words resulting from verb prefixes and verbal inflections are written in the closed form.
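The glide-insertion rule for na- and ma- can be sketched as follows. This is an illustration only; it covers the vowel-initial case and the plain case, but not the optional na- → ne- shift before consonants:

```python
def prefix_verb(prefix: str, stem: str) -> str:
    """Insert the glide y between na-/ma- and a stem starting with
    a, o, â, or u: na + andâz -> nayandâz."""
    glide = "y" if stem.startswith(("a", "o", "â", "u")) else ""
    return prefix + glide + stem

assert prefix_verb("na", "andâz") == "nayandâz"
assert prefix_verb("ma", "afkan") == "mayafkan"
assert prefix_verb("na", "ro") == "naro"
```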

3.2.12 Miscellaneous Orthographic Style Issues

This section presents some minimal conventions concerning capitalization, abbreviation, and punctuation. A more extensive treatment of these issues is beyond the scope of this chapter. The conventions are similar to those of other Latin-based scripts.

3.2.12.1 Capitalization

Since capitalization of certain letters improves the readability of text, we propose capitalization rules that are in line with many other Latin-based scripts.
1. The first word of a sentence is capitalized. For example, Otobus yek sâat e digar miresad (The bus arrives in an hour).
2. The first word of a syntactically complete quoted sentence is capitalized. For example, Mahnâz goft, ‘Barâbari e zan va mard barâye man mohemm ast’ (Mahnâz said, ‘Equality between men and women is important to me’).
3. Proper names and geographic names are always capitalized. If these names are compound words written in an open format, then every ‘major’ component should be capitalized. Here are some examples: Tehrân (Tehran – the capital of Iran), Iâlât e Mottahed e Âmrika (United States of America), Sâzmân e Melal e Mottahed (United Nations), USA (USA), Rais e Koll e Qovâ (Head of the military forces). Even certain expressions that refer to well-defined geographic parts of the world or phenomena should be capitalized, for example, Xâvar e Dur (Far East), Bâd e Ŝomâl (Northern Wind). However, compass directions in general are not capitalized, for example, jonub e qarbi e Irân (south-west of Iran).
4. Even abbreviations that have obtained the status of a word through time may be capitalized as ordinary words. For example, we can write not only NATO but also Nato.
5. Contrary to the practice in English, the names of weekdays, months, and years are not capitalized. For example, doŝanbe (Monday), farvardin (the first month of the Iranian calendar), ut (August), sâl e meymun (year of the monkey).
6. Abbreviated titles are always capitalized. For example, Âq. Rezâ Âŝuri (Mr. Rezâ Âŝuri), Dr. Ahmadi (Dr. Ahmadi), Xâ. Simâ Ŝirâzi (Mrs. Simâ Ŝirâzi), Du. Lâle Kermâni (Du. abbreviates duŝize; Miss Lâle Kermâni).
7. In articles and books, all the main words appearing in a title or chapter name are capitalized, for instance, Fasl e Yek: Zabân e Fârsi (Chapter One: Persian Language).
8. Nationality should be capitalized, for example, dâneŝju ye Irâni (Iranian student), hame ye Irâniân (all Iranians). This should also apply to other geographic units, for instance, bâzârhâ ye Âsiâi (Asian markets).
9. Conventions for writing names of people and trademarks (see Section 3.2.4.3) override capitalization conventions. Trademarks and other registered names must be written exactly as the specification of the trademark dictates. For example, cK (trademark of Calvin Klein).

3.2.12.2 Abbreviation

We consider the following types of abbreviations. These rules apply to cases where one needs to create new abbreviations.
1. Abbreviation of single words should, in general, follow a default rule: an abbreviation ends with a period (.). For example, Teh. or thrn. as possible abbreviations for Tehran, or q. as a possible abbreviation of qeyd (adverb). A simple rule for creating such abbreviations is to start with the first letter of the word and continue including letters until we reach a vowel or the end of the first syllable. In order to create unique abbreviations, this rule needs to be extended. Here are some examples: Iran (Ir.), Tehran (Teh.), dâneŝgâh (dân.). Naturally, some words can be exempted from these rules, for example, when we create abbreviations for days of the week or months of the year. In such cases other requirements, such as a constant length for abbreviations, may determine the format of the abbreviation.
2. When abbreviating a compound name, the first letter of each major word in the compound name should be included in the abbreviation, for example, ŜNI as the abbreviation of Ŝerkat e Naft e Irân (Iranian Petroleum Company), or RI as the abbreviation of Râdio Irân. When the resulting abbreviations are used as words in Persian, then we may write them as ordinary words. If we imagine there were a ministry called

78 � J. Maleki

Vezârat e âb Va Enerĵi e Keŝvar, then it could be abbreviated as VÂVEK, which may gradually turn into an everyday word and be written as Vâvek or even vâvek.
3. When abbreviating an expression or a construction consisting of two or more words, the first letter of each significant word is included in the abbreviation and is succeeded by a period. For example, b. m. as an abbreviation for barâye mesâl (for example), v. e. a. as an abbreviation for va elâ âxar (and so on, etc.), b. b. i. as an abbreviation of banâ bar in (therefore, hence). No spaces should be included in the abbreviation.
4. Some abbreviations, such as units of measurement, should be exempted from this kind of punctuation. For example, it is better to write 5cm (5 centimeters) rather than 5c.m., or 80GB (80 gigabytes) rather than 80G.B.
5. Names of days can be abbreviated as: ŝanbe (ŝa.), yekŝanbe (yeŝ.), doŝanbe (doŝ.), seŝanbe (seŝ.), cahârŝanbe (caŝ.), panjŝanbe (paŝ.), âdine (âdi.).
6. Names of months can be abbreviated as: farvardin (far.), ordibeheŝt (ord.), xordâd (xor.), tir (tir), mordâd (mor.), ŝahrivar (ŝah.), mehr (meh.), âbân (âbâ.), âzar (âza.), dey (dey), bahman (bah.), esfand (esf.).
7. Titles should normally be abbreviated. For example, Âyatollâh Taleghani (Ayatollah Taleghani) could be written as Âyat. Tâleqâni, Duŝize Leylâ Golcin (Miss Leylâ Golcin) could be abbreviated as Du. Leylâ Golcin, and Porfesor Pari Mehrbân (Professor Pari Mehrbân) could be abbreviated as Porf. Pari Mehrbân.

In Persian, ordinals (ordinal numbers) are constructed by adding the suffix -om or -vom to a number, for example, yekom (first), dovom (second), sevom (third), cahârom (fourth), and so on. The suffix -vom is used for numbers whose names end with a vowel, and -om is used when the name of the number ends with a consonant. When the number is unknown, cand is used instead and the suffix -om is added to it: candom.
We propose that these numbers be abbreviated by writing the numeral followed by the suffix -om: 1om (first), 2om (second), 3om (third), . . . , with the suffix optionally superscripted in mathematical texts. Another sequence of numbers used for ranking in Persian is: yekomin, dovomin, sevomin, . . .. The difference between sevom and sevomin, for example, is that sevom appears after the word whose order is being given and sevomin occurs before it (see also the discussion on Ezâfe ye Maqlub in Section 3.2.10). The following expressions are semantically equivalent:
Konferâns e Sevom e Anjoman e Ânformâtik e Irân
Sevomin Konferâns e Anjoman e Ânformâtik e Irân
3in Konferâns e Anjoman e Ânformâtik e Irân
(3rd Conference of the Informatics Society of Iran)


As noted, in the second case the order of Konferâns and Sevom is switched, the Ezâfe is dropped, and Sevom is replaced with Sevomin. For ordinals created using the suffix -in, we suggest the following abbreviations: 1in, 2in, 3in, . . .. Candomin (which, in order) can be used when querying the order, and naxostin (initial, first) and vâpasin are used for referring to the first and last, respectively. Some example sentences follow:
Candomin raisjomhor e Âmrika be Vietnâm hamle kard? (Which, in ordinal position, American president attacked Vietnam?)
Naxostin parvâz e man be Ŝirâz bud. (My first flight was to Shiraz.)
Vâpasin porseŝ e u ce bud? (What was his final question?)
8. The dash, –, (in Persian, xatt e fâsele) can be used as an abbreviation of the word tâ (to), which is used to specify intervals, for example, s. 11–23 as an abbreviation of az safhe ye 11 tâ 23 (from page 11 to 23), or doŝanbe–panjŝanbe (Monday–Thursday). The dash is also used as an abbreviation for the connective va or o (and), for instance, didâr e Putin–Bush (Putin–Bush meeting), ravâbet e Irân–Âmrikâ (Iran–USA relations).
9. Dates may be written in any of the following ways. We exemplify these formats using ŝanbe (Saturday), the 22nd day of the 2nd month, ordibeheŝt, of the year 1358:
ŝanbe, 22 ordibeheŝt 1358
22 ordibeheŝt 1358
22 ord. 1358
22-02-58
10. Hours may be written in any of the following formats:
10:00pi (10:00 am)
10:00pa (10:00 pm)
1:15pi (1:15 am)
5pa (5 pm)
19:57 (19:57)
The reference point for time based on a 12-hour system is noon. We propose using pa and pi as abbreviations of pas az zohr (after noon) and piŝ az zohr (before noon).
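As a small illustration, the proposed pi/pa clock convention can be sketched as a helper function. The function name and the choice to always print minutes are our own; only the pi/pa suffixes come from the proposal above.

```python
def to_dabire_time(hour, minute):
    """Format a 24-hour time with the proposed suffixes:
    pi = piŝ az zohr (before noon), pa = pas az zohr (after noon).
    The reference point of the 12-hour system is noon."""
    suffix = "pi" if hour < 12 else "pa"
    h12 = hour % 12 or 12  # map 0 and 12 to 12
    return f"{h12}:{minute:02d}{suffix}"
```

For example, `to_dabire_time(19, 57)` gives `7:57pa`; in the proposal, minutes equal to zero may also be dropped, as in 5pa.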

3.2.12.3 Punctuation

Punctuation rules in Dabire follow the same conventions as English (see Ritter 2002).


3.3 Dabire and Persian Language Processing

The lack of – or at best, inconsistent – representation of short vowels and Ezâfe in Persian text is a major source of ambiguity. Processing Persian text based on a phonemic orthography adds a very useful dimension to the computational processing of the language. In particular, using vowels systematically and making short vowels and Ezâfe explicit helps AI and machine learning systems for transliteration, machine translation, speech synthesis, etc., reach better conclusions. This probably extends to other Arabic script-based languages too; for example, Saadane and Benterki (2012) show how romanized transliteration of Arabic words improves precision and recall in word alignment between French and Arabic.

In our applications of Dabire, the aim has been to facilitate back-and-forth conversion between Dabire and the PA-Script. Such a system can be used in various ways, for example, enabling Persian speakers to communicate using various scripts. Another useful application is that the script conversion system can be used to automatically generate romanized corpora from the relatively larger PA-Script corpora that are available.

We have also used Dabire in some rule-based systems. For example, in 2007, we implemented a small program in XFST (Beesley and Karttunen 2003) that successfully syllabifies any given Persian word written in Dabire. The core of the program is the following rules, implemented as transducers in XFST (see Beesley and Karttunen 2003):

define Syllable V|VC|VCC|CV|CVC|CVCC;
define Syllabify C* V C* @-> ... "." || _ Syllable;

Here, V and C represent the vowels and consonants of Dabire. The program essentially takes a word and reduces it to a syllabified version where the syllables are separated by a “.”. The syllabification produces the correct syllables by processing the word from left to right while maximizing the number of consonants at the onset of the candidate syllables.
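As an illustration, the left-to-right, onset-maximizing strategy can be sketched in plain Python. This is an illustrative reimplementation, not the XFST program itself, and the vowel inventory below is an assumption:

```python
# Sketch of onset-maximizing syllabification for Dabire strings,
# assuming Persian syllables follow the pattern (C)V(C)(C),
# i.e. V|VC|VCC|CV|CVC|CVCC (no onset clusters).

VOWELS = set("aeiouâôê")  # assumed Dabire vowel graphemes

def is_vowel(ch):
    return ch.lower() in VOWELS

def syllabify(word):
    """Separate syllables with '.'; each new vowel takes at most one
    preceding consonant as its onset."""
    syllables, current = [], ""
    for ch in word:
        if is_vowel(ch) and any(is_vowel(c) for c in current):
            # `current` already holds a full syllable: close it, stealing
            # at most one trailing consonant as the new syllable's onset.
            if not is_vowel(current[-1]):
                syllables.append(current[:-1])
                current = current[-1]
            else:
                syllables.append(current)
                current = ""
        current += ch
    if current:
        syllables.append(current)
    return ".".join(syllables)
```

Because onsets are limited to a single consonant, "maximizing the onset" here means that exactly one consonant before each vowel is attached to the following syllable, and any remaining consonants stay in the coda of the preceding one.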
For example, Nikârâgue is syllabified as Ni.kâ.râ.gu.e rather than Nik.âr.âg.u.e, Nik.â.râ.gu.e, etc. Based on this syllabification program and realizing the importance of syllables in generating correct PA-Script text, we implemented another system for converting words written in Dabire to PA-Script (Maleki and Ahrenberg 2008). This syllabification was also a part of a prototype system (Maleki 2010) for phonemic transliteration of English words to P-Script, where Dabire served as an intermediate representation. These systems clearly illustrate the importance of having an orthographic representation such as Dabire. We also used the XFST technology for converting back and forth between the PA-Script and Dabire using two parallel morphological analyzers, one for analysis of words written in P-Script (a subset of the PA-Script that excludes Arabic specific


Figure 3.1: The composition of transducers for back and forth transliteration between Dabire and PA-Script.

letters) and another for the analysis of Dabire text (Maleki, Yaesoubi, and Ahrenberg 2009). The system works by morphological analysis of a word written in the source script (for example, Dabire), generating its root morpheme and other grammatical tags. Then, using a dictionary of root morphemes, it retrieves the orthography of the root morpheme in the target script (for example, PA-Script). Finally, the root in the target script and the grammatical tags are used to run the finite-state morphology networks backwards to generate the orthography of the word in the target script. The schema is presented in Figure 3.1.

Rule-based systems such as the ones referenced above have the disadvantage that they rely on well-structured data, such as root words and large sets of morphological rules describing correct syntactic structures. Furthermore, as the words and rules grow in number, the system slows down. A more promising approach is to use neural network-based techniques that only require large amounts of language data, which are usually readily available in the form of text collections, online text, online books, etc.

In order to test the suitability of Dabire as a means for teaching Persian to novices, we have carried out a number of small-scale pedagogical projects with promising outcomes. In one study, we asked Persian-speaking participants to read poems of Hâfez to an audience. Almost all of the Iranian participants made mistakes when reading these poems in the traditional script, and the cause of the problem was usually the absence of short vowels and the Ezâfe. After a two-hour introduction to Dabire, the same participants were asked to read the same poems written in Dabire, and almost everyone could read the poems easily and made almost no mistakes. Another experiment involved a number of Swedish individuals interested in learning Persian.

After a four-hour workshop, it was surprising how well the participants could read and write Dabire. Because of the limited scale of these studies, we have chosen not to publish the results and hope that larger-scale experiments will be carried out in the future.
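The analyze–map–generate architecture of the script-conversion system described in this section can be illustrated with a toy sketch. Every name, suffix rule, and dictionary entry below is invented for illustration; the real system uses finite-state morphological analyzers, not string operations:

```python
# Toy illustration of back-and-forth script conversion:
# 1) analyze a source-script word into (root, tags),
# 2) map the root through a bilingual root dictionary,
# 3) generate the word in the target script from root + tags.

ROOT_MAP = {"ketâb": "كتاب"}  # Dabire root -> PA-Script root (example entry)

def analyze(word):
    # Stand-in for the finite-state analyzer: strip a plural suffix.
    if word.endswith("hâ"):
        return word[:-2], ["+Noun", "+Pl"]
    return word, ["+Noun", "+Sg"]

def generate(root, tags):
    # Stand-in for running the target-script morphology backwards.
    return root + ("ها" if "+Pl" in tags else "")

def convert(word):
    root, tags = analyze(word)
    return generate(ROOT_MAP[root], tags)
```

For instance, `convert("ketâbhâ")` analyzes the word into the root ketâb plus a plural tag, looks the root up, and regenerates the plural form in the target script.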

3.4 Discussion and Further Work

There are no standard orthographic principles for writing Persian using the Latin script. There are many proposals, but none of them has given a rigorous account of the transcription of Persian into the Latin alphabet. We hope that the orthographic principles we have sketched here will eventually contribute towards a standard, but actual standardization is a complicated process beyond the scope of this work.

Dabire is an orthographic scheme based on Persian phonology. As well as introducing an alphabet, we have declared some conventions and, where necessary, provided the reasons for the choices that have been made. A lot of work remains to be done. There is a need for computer-readable dictionaries and resources. Furthermore, we need software for automatic conversion between Persian text written in the PA-Script and Dabire. We think that Dabire is a natural complement to the existing writing systems in the Persian-speaking countries, in particular Iran.

3.4.1 Conversion Between Scripts

Historically, languages and writing systems have created barriers between different peoples. With artificial intelligence and machine learning, these barriers are gradually disappearing. In the future, we will be able to communicate in the language or dialect of our choice and read and write in the script we prefer. Our digital companions (computers, telephones, watches, glasses) will seamlessly convert back and forth between languages and scripts. However, in the context of educating children or other new language learners, the arguments in favor of an easy-to-use, shallow orthography such as Dabire will remain valid.

3.4.2 Further Issues

There are some minor issues that are subject to further study and discussion. Here are some of them:


– We have decided to write the Ezâfe separately, for example, nur e xorŝid (sunlight), pâ ye to (your foot). This is also suggested by Xatt e Now (2007), but Lambton (1953) and Unipers (2007) propose joining the Ezâfe to the end of the head word, mozâf, for example, nure xorŝid, pâye to. The choice made in Dabire seems justifiable, but we hope future discussions and actual usage of these conventions will shed more light on the advantages and disadvantages of each alternative.
– For the sake of simplicity, it might be worth considering always writing the Ezâfe as e and letting the euphonic [y] be a speech-time phenomenon not represented in writing. This would certainly make the orthography simpler but would compromise the phonemic property of Dabire. The same is true of the Arabic article al, as we argued earlier.
– The PA-letters Qeyn and Qâf, which represent [ɣ] and [q] respectively, largely have the same pronunciation in Persian. In Arabic, however, there is a clear difference in the pronunciation of the two. In Dabire we have chosen to represent both with the grapheme q, but some may argue that they should be assigned different letters. Some may also argue that Qeyn represents a genuine Persian phoneme and therefore deserves its own orthographic representation. We had at some point considered ĝ as a grapheme for Qeyn but chose to drop it for the sake of simplicity. Most earlier romanizations seem to agree in assigning a single grapheme (q) to both Qeyn and Qâf; see, for example, Xatt e Now (2007) and Lambton (1953).
– The status of ow and ey as diphthongs is still an unsettled issue. From a pedagogical point of view, avoiding diphthongs in the writing system is important. Replacing these with ô and ê would probably be a good choice. This would mean we would write nô rather than now (new) and êvân rather than eyvân (porch).

Bibliography

Adib-Soltani, Mir-Shamseddin. 2000. An Introduction to Problems of Persian Orthography. Tehran, Iran: Amir Kabir Publishing House.
Beesley, Kenneth R. & Lauri Karttunen. 2003. Finite State Morphology. Stanford: Center for the Study of Language & Information.
Behrouz, Zabih. 1984. Khatt o Farhang [Orthography and Language]. Tehran, Iran: Forouhar.
Bird, Steven. 2000. Orthography and Identity in Cameroon. Notes on Literacy 26(1). 3–34.
Besner, Derek & Marilyn Chapnik Smith. 1992. Basic Processes in Reading: Is the Orthographic Depth Hypothesis Sinking? In Ram Frost & Leonard Katz (eds.), Orthography, Phonology, Morphology, and Meaning, 45–66. Amsterdam: Elsevier.
Encyclopedia Iranica. 1982–2022. Accessed: 2022-09-30. www.iranicaonline.org.


Farhangestan. 2003. Dastur-e Khatt-e Farsi (Persian Orthography). Supplement No. 7. Tehran, Iran: Persian Academy.
Hashabeiky, Forough. 2005. Persian Orthography – Modification or Changeover? Uppsala, Sweden: Acta Universitatis Upsaliensis.
Jahani, Carina. 2005. The Glottal Plosive: A Phoneme in Spoken Modern Persian or Not? In Éva Ágnes Csató, Bo Isaksson & Carina Jahani (eds.), Linguistic Convergence and Areal Diffusion: Case Studies from Iranian, Semitic and Turkic, 79–96. London: Routledge Curzon.
Lambton, Ann K. S. 1953. Persian Grammar. Cambridge: Cambridge University Press.
Lazard, Gilbert. 1957. Grammaire du Persan Contemporain. Paris: Klincksieck.
Mace, John. 2003. Persian Grammar. London: Routledge Curzon.
Maleki, Jalal. 2003. eFarsi, a Latin-Based Writing System for Persian. Unpublished paper.
Maleki, Jalal. 2008. A Romanized Transcription for Persian. In The 6th International Conference on Informatics and Systems, Cairo, Egypt, 2008, 166–175. Cairo, Egypt: University of Cairo.
Maleki, Jalal. 2010. Syllable Based Transcription of English Words into Perso-Arabic Writing System. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner & Daniel Tapias (eds.), Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17–23 May 2010, Valletta, Malta, 62–65. Valletta, Malta: European Language Resources Association.
Maleki, Jalal & Lars Ahrenberg. 2008. Converting Romanized Persian to Arabic Writing System Using Syllabification. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis & Daniel Tapias (eds.), Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May–1 June 2008, Marrakech, Morocco. Marrakech, Morocco: European Language Resources Association.
Maleki, Jalal, Maziar Yaesoubi & Lars Ahrenberg. 2009. Applying Finite State Morphology to Conversion Between Roman and Perso-Arabic Writing Systems. In Post-proceedings of the 7th International Workshop FSMNLP 2008, Amsterdam, Netherlands, July 2008 (Frontiers in Artificial Intelligence and Applications), 215–223. Amsterdam, Netherlands: IOS Press.
Neysari, Salim. 1956. Ketab-e Avval e Farsi Be Xatt-e Jahani [The First Persian Book in the Universal Script]. Unpublished manuscript.
Neysari, Salim. 1996. A Study on Persian Orthography. Tehran, Iran: Sazman-e Chap va Entesharat.
Ritter, Robert M. 2002. The Oxford Guide to Style. Oxford, UK: Oxford University Press.
Saadane, Houda & Ouafa Benterki. 2012. Using Arabic Transliteration to Improve Word Alignment from French–Arabic Parallel Corpora. In Ali Farghaly & Farhad Oroumchian (eds.), Fourth Workshop on Computational Approaches to Arabic-Script-based Languages, San Diego, California, USA, November 2012, 38–46. San Diego, California, USA: Association for Machine Translation in the Americas.
Samare, Yadollah. 1997. Persian Phonology. Tehran, Iran: Markaz e Daneshgahi.
Unipers. 2007. Accessed: 2022-10-30. http://www.unipers.com.
Wells, John. 2000. Longman Pronunciation Dictionary. London: Longman.
Windfuhr, Gernot. 1989. Persian. In Bernard Comrie (ed.), The World’s Major Languages, 523–546. London: Routledge.
Xatt e Now. 2007. Eurofarsi. Accessed: 2022-09-30. http://www.eurofarsi.com.

Mahsa Vafaie and Jon Dehdari

4 Speech Recognition for Persian

Abstract: Automatic Speech Recognition (ASR) is a cross-disciplinary field that enables computers to process human speech into written form and, consequently, to derive meaning from it with the help of other NLP technologies such as natural language understanding and natural language interpretation. Automatic speech recognition has benefited a great deal from advances in machine learning throughout the past decade. In this chapter, a brief summary of ASR technologies and advances in Persian ASR is provided. Moreover, datasets and tools for Persian ASR are introduced and, following a discussion of the peculiarities of Persian speech, some future directions and research gaps in the implementation of speech recognition technologies for Persian are addressed.

4.1 Introduction

Speech is the primary mode of human communication, figuring in diverse communicative genres ranging from trivial everyday conversations to world-scale international negotiations. Unlike writing, which can be digitised and processed fairly easily, automated processing of human speech has proven to be a long-standing challenge. The potential benefits of automated recognition of human speech, however, are many: computers able to accurately recognise speech would potentially be able to transcribe, to follow instructions, and to converse.

The field of Automatic Speech Recognition (ASR) has played a critical role in the ever-advancing field of human–computer interaction since the 1950s (Davis, Biddulph, and Balashek 1952; Forgie and Forgie 1959; Olson and Belar 1956), with major breakthroughs in almost every decade. The use of Artificial Neural Networks (ANNs), inspired by biological neurons, is one of the latest trends in speech recognition, achieving results that outperform earlier approaches (Lippmann 1988). Deep learning (multilayered neural networks) has increased the accuracy of speech recognisers such that they can now be used outside of carefully controlled environments like laboratories, and within consumer electronics such as personal computers, tablets and phones.

It is only in the past quarter century that speech recognition for Persian has been addressed by the research community. In this chapter, following a brief history of ASR, we provide an overview and analysis of speech recognition datasets and systems for Persian, from the first attempts in 1994 to larger datasets and state-of-the-art systems used in applications today.

https://doi.org/10.1515/9783110619225-004

86 � M. Vafaie and J. Dehdari

4.2 History of Automatic Speech Recognition

Research into automatic speech recognition dates back to the early 1950s, when the first ASR system was made for single-speaker digit recognition by Bell Laboratories. This system worked by locating formant frequencies, which are manifested as major regions of energy concentration in the power spectrum of each utterance, and matching them with patterns (Davis, Biddulph, and Balashek 1952). Other early recognition systems from this period include the single-speaker syllable recogniser of Olson and Belar (1956) and Forgie and Forgie's speaker-independent ten-vowel recogniser (Forgie and Forgie 1959).

In the 1960s, two researchers at Kyoto University employed a speech segmenter for the first time, making it possible to recognise and analyse individual words within an input utterance (Sakai and Doshita 1962). This was the first project to focus on the processing of natural language, rather than speech sounds or words in isolation (Juang and Rabiner 2005). Later in the decade, Soviet researchers devised the Dynamic Time Warping (DTW) algorithm and used it to create a recogniser capable of operating on a 200-word vocabulary (Velichko and Zagoruyko 1970). The DTW algorithm, still in use today, divides the speech signal into short frames, e. g. 10 millisecond segments, and processes each frame as a single unit. Despite the breakthroughs associated with segmentation of speech signals, implementations in this period were constrained in scope, handling only specific phonemes or words; analysis of unconstrained continuous speech remained a distant goal.

During the early 1970s, the Defense Advanced Research Projects Agency (DARPA) of the U. S. Department of Defense started funding the Speech Understanding Research (SUR) program, working in cooperation with a number of research institutes and private companies in the United States.
IBM was focused on creating a “voice-activated typewriter”, which converted a spoken utterance into a sequence of letters that could be typed on paper or displayed on a screen (Jelinek, Bahl, and Mercer 1975), a task generally referred to as transcription. They built a speaker-dependent speech recognition system, equipped with a language model and a large vocabulary.

At AT&T Bell Laboratories, on the other hand, the focus was on speaker-independent speech recognition. Their goal was to provide automated telecommunication services to the public. To achieve this, systems were needed that could successfully process input from speakers with different regional accents, without the need for individual speaker training. Their approach involved the development of speech clustering algorithms, which built word and sound reference patterns containing information about dialectal variants of a given phoneme. Research at Bell Laboratories emphasised keyword spotting as a basic form of speech understanding (Wilpon et al. 1990). Keyword spotting is a technique aimed at detecting salient terms (i. e. ideationally rich content words) embedded in a longer utterance, while paying less


attention to closed-class/function words (determiners, conjunctions, prepositions, etc.).

The introduction of Linear Predictive Coding (LPC) (Atal and Hanauer 1971; Itakura 1970) in the 1970s changed the approach toward input signals. LPC is a tool for representing the spectral envelope of a digital speech signal in compressed form, using the information of a linear predictive model (Deng and O’Shaughnessy 2003). In the decades to come, extracting features from the speech signal before inputting them into the system became the state-of-the-art technique, with feature extraction becoming a major research interest. Later, in the 1980s, Perceptual Linear Prediction (PLP) coefficients (Hermansky 1990) and Mel-Frequency Cepstral Coefficients (MFCCs) (Davis and Mermelstein 1980) were introduced. These approaches are still in use today.

The dominant approach to speech recognition in the 1970s was a template-based pattern recognition paradigm, in combination with acoustic–phonetic methods. This methodology shifted toward a statistical modelling framework in the 1980s. Hidden Markov Models (HMMs) became the preferred method for speech recognition and remained so long after the theory was initially published (Ferguson 1980; Levinson, Rabiner, and Sondhi 1983). The use of HMMs allowed researchers to combine different sources of knowledge, such as acoustic models and language models, in a unified probabilistic model.

With the widespread use of HMM techniques, researchers realised that the performance of the system was being constrained by limitations on the form of the density functions. This was particularly harmful for speaker-independent tasks. These limitations were partially overcome by extending the theory of HMMs to mixture densities (Juang 1985; Lee et al. 1990) to ensure satisfactory recognition accuracy, particularly for speaker-independent, large-vocabulary speech recognition tasks. This extension of the HMMs is called Gaussian Mixture Models (GMMs).
In the late 1980s, Artificial Neural Networks (ANNs) emerged as a new approach to acoustic modelling in ASR. Neural networks were first introduced by McCulloch and Pitts (1943) as an attempt to mimic the biological neural processing mechanism. While the approach attracted little attention initially, it was revived in the 1980s with the advent of Parallel Distributed Processing (PDP), known as Connectionism today. PDP is an artificial neural network approach that stresses the parallel nature of neural processing, and the distributed nature of neural representations. Lippmann (1988) reported success using neural networks for preliminary speech recognition in constrained tasks such as vowel classification and digit recognition. Later, Lippmann (1989) specifically described the potential for neural networks to offer new algorithmic approaches to problems in speech recognition. In the 1990s, Bourlard and Morgan (1993) built a hybrid speech recogniser, replacing the GMM with a single-layered neural network in an HMM-based system.

This method successfully predicted the correct HMM states, but the performance was still lower than that of the GMM-HMM architecture. The main reason for this was that large amounts of high-quality training data, and computational resources capable of performing the large numbers of calculations necessary for neural networks, were lacking at that time.

The introduction of Deep Neural Networks (DNNs) in the past decade, coupled with the growing affordability of high-performance computing, has resolved many historical challenges in ASR. A DNN is a feed-forward, artificial neural network that has multiple layers of hidden units between its inputs and its outputs. DNNs can approximate any function with desired accuracy given enough nodes. Hinton et al. (2012) report the DNN approach showing significant improvements over Gaussian Mixture Model–Hidden Markov Model (GMM–HMM) systems in a variety of state-of-the-art ASR systems.

4.3 Automatic Speech Recognition Techniques

Traditionally, ASR systems have been composed of two major components: the front end and the decoder. The front end builds a spectral representation of the incoming speech wave, extracting the most relevant features in order to reduce data complexity. Feature extraction in the context of ASR involves identifying the components of the audio signal that are characteristic of the linguistic content, while discarding all the other sounds that carry non-linguistic information, like background noise, line noise, etc.

The most widely used features in ASR are Mel-Frequency Cepstral Coefficients (MFCCs), which are also used in neural speech recognition systems specifically. MFCCs were introduced by Davis and Mermelstein (1980) and have been state-of-the-art ever since. To extract features that contain all the information about the linguistic content, MFCC computation mimics some parts of human speech production and speech perception. Sounds generated by a human are filtered by the shape of the vocal tract, and this shape determines what sound comes out. Consequently, if we can determine the shape, it should give us a representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and MFCCs can accurately represent this envelope. They also eliminate speaker-dependent characteristics by excluding the fundamental frequencies, which makes them suitable for speaker-independent speech recognition tasks.

The flowchart for the implementation of the Mel-Frequency Cepstral Coefficients algorithm is shown in Figure 4.1 below. The steps to compute MFCCs are as follows:
1. Frame the signal into short (10–25 ms) frames.
2. For each frame, calculate the periodogram estimate of the power spectrum.

4 Speech Recognition for Persian

� 89

Figure 4.1: Block diagram of the MFCC algorithm, inspired by Jimenez and Trujillo (2018).

3. Apply the mel filterbank to the power spectra, summing the energy in each filter.
4. Take the logarithm of all filterbank energies.
5. Take the Discrete Cosine Transform (DCT) of the log filterbank energies.
6. Choose the number of cepstral coefficients (typically in a range of 13 to 20) for further processing; keep these and discard the rest.
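The steps above can be sketched with NumPy. This is an illustrative implementation, not code from any particular ASR toolkit; the parameter values (frame length, filter count, FFT size) are common choices, not prescribed by the chapter, and the input is assumed to be at least one frame long:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sample_rate=16000, frame_len=400, frame_step=160,
         n_filters=26, n_ceps=13, n_fft=512):
    signal = np.asarray(signal, dtype=float)
    # 1. Frame the signal into short overlapping frames (here 25 ms / 10 ms).
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_step)
    frames = np.stack([signal[i * frame_step : i * frame_step + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 2. Periodogram estimate of the power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Mel filterbank: triangular filters evenly spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    energies = power @ fbank.T
    # 4. Log of the filterbank energies (floored to avoid log(0)).
    log_energies = np.log(np.maximum(energies, 1e-10))
    # 5. DCT-II of the log energies over the filter axis (unnormalized).
    n = np.arange(n_filters)
    dct_basis = np.cos(np.pi * (2 * n + 1) * n[:, None] / (2 * n_filters))
    ceps = log_energies @ dct_basis.T
    # 6. Keep only the first n_ceps cepstral coefficients.
    return ceps[:, :n_ceps]
```

For a one-second signal at 16 kHz with these settings, the result is a matrix with one 13-dimensional feature vector per 10 ms frame.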

After feature extraction, the decoder finds the best-matching word sequence for the input acoustic features based on the acoustic model, the lexicon, and the language model. The lexicon is a database that stores the pronunciations of words and their equivalent lexical representations/spellings, and the acoustic model predicts the sound units (i. e. phonemes) based on the extracted speech features. In the final step, the language model picks the best candidate words based on the context of neighbouring words. Figure 4.2 shows the block diagram of a traditional ASR system.

Figure 4.2: A traditional automatic speech recognition system.
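The decoder's final selection step can be made concrete with a toy sketch that combines acoustic-model and language-model scores. Every word, probability, and the tiny bigram table below is invented for illustration; a real decoder searches over whole word lattices rather than isolated candidates:

```python
import math

# Toy decoder step: pick the candidate word w that maximizes
# log P(observation | w) + log P(w | previous word).
acoustic = {"ŝekar": 0.4, "ŝokr": 0.6}          # invented acoustic scores
bigram_lm = {("xaridam", "ŝekar"): 0.7,          # invented LM scores
             ("xaridam", "ŝokr"): 0.01}

def pick_best(prev_word, candidates):
    return max(candidates,
               key=lambda w: math.log(acoustic[w])
                             + math.log(bigram_lm[(prev_word, w)]))
```

Here the acoustically preferred candidate loses once the language model's contextual score is added in, which is exactly the role the language model plays in the pipeline of Figure 4.2.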

Deep learning has made it possible to build end-to-end speech recognisers that take sound files and directly turn them into transcriptions. These systems do not require all the components of a traditional ASR system, but instead require very large amounts of training data. The training data must contain a mapping of speech files to the expected outputs, which are usually either phonetically transcribed or standard-language written versions of the utterances. The general architecture of a deep learning-based speech recognition system is shown in Figure 4.3. It is a neural network fed with input, in this case audio files, trained to produce output in the form of text.

Figure 4.3: A neural speech recognition system.

Audio files contain recorded speech in the form of sound waves. A sound wave is a one-dimensional entity with a single value at every moment, based on the height of the wave at that moment. Recording the height of the wave at equally-spaced points is called sampling. In a neural speech recognition system, the sound waves are transformed into a purely numerical representation and then fed into the neural network. In general, a sampling rate of 16 kHz (16,000 samples per second) is enough to cover the frequency range of human speech, and thus suffices for speech recognition tasks. To cope with the non-stationary nature of the speech signal, windows of 20 to 40 milliseconds are taken to extract the parameters. These parameters are a representation of the frequency components of the signal. Different representations have been reported in the literature, but MFCCs continue to be the most popular choice.

After the sound files are fed into the neural network in chunks of 20–40 milliseconds, the neural network predicts which character(s) (phonemes or graphemes, depending on the task) correspond to each chunk. Using the parameters of the trained model, there will be a probability distribution over the letters of the alphabet for each chunk. Given this distribution, the most likely sequence of characters is picked as the output sequence. This step, called output decoding, can be done using different algorithms. The resulting transcription then has to be cleaned before it is presented as output. For instance, take a case where the output sequence of characters for a piece of Persian audio is SSSA_LL_AAAAM. First, any repeated string of characters is replaced with a single character, transforming SSSA_LL_AAAAM into SA_L_AM. Then the blanks are removed, and the result is presented as the final transcription, SALAM (‘hello’).
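The cleanup procedure for SSSA_LL_AAAAM can be sketched as a small function. This is a simplified illustration of CTC-style collapsing, with the underscore as the blank symbol as in the example; in a full CTC decoder, the blank is also what keeps genuinely doubled letters from being merged.

```python
def collapse(raw, blank="_"):
    """Turn a raw per-chunk character sequence into a transcription:
    first merge runs of repeated characters, then remove blanks."""
    merged = []
    prev = None
    for ch in raw:
        if ch != prev:  # keep only the first character of each run
            merged.append(ch)
        prev = ch
    # Drop the blank symbols that separated the runs.
    return "".join(ch for ch in merged if ch != blank)

print(collapse("SSSA_LL_AAAAM"))  # SALAM
```

Note that a doubled letter survives only when a blank separates its two runs: collapse("AA_A") yields "AA", while collapse("AAA") yields just "A".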


4.4 Persian Speech

The Persian orthographic system in Iran and Afghanistan consists of 32 graphemes for consonants and 6 graphemes for vowels. Table 4.3 in Appendix A, inspired by Bijankhan et al. (2003), lists phonemes in Farsi (Iranian Persian), their corresponding representation in the International Phonetic Alphabet (IPA), their orthographic representation in the Persian alphabet, and a brief phonetic gloss. Short vowels in Farsi (represented in IPA as /e/, /æ/ and /o/) are optional in written form and are usually omitted. This results in a loose grapheme-to-phoneme mapping in the writing system, and thus poses a potential challenge to grapheme-based ASR.

Grapheme-based ASR systems rely on pronunciation dictionaries as one of their main components. Creating a pronunciation dictionary for Farsi, given the absence of most short vowels in writing, requires some level of supervision by a Farsi speaker. For instance, the polygrapheme ﺷﻜﺮ ⟨škr⟩ – missing the short vowels – can be pronounced as both [SekæR] (‘sugar’) and [SokR̥] (‘appreciation’).

Within Persian dialects, there is a great deal of phonological, lexical and grammatical variation. Ketābi Fārsi, generally seen as “standard” or “formal” Iranian Persian, is the main language of books and newspapers in Iran (ketābi, in fact, means ‘bookish’). Colloquial Farsi (encompassing a number of regional dialects), on the other hand, is the common language of everyday interactions. The difference between these two modes is significant enough to cause serious challenges for NLP systems trained on data from the standard written form. For instance, the informal Farsi of Tehran has shortened verbal stems and inflectional endings (Megerdoomian 2006). Table 4.1 lists conjugations of the verb goftan (‘to say’) in the present tense, in both formal and conversational Farsi (Tehrani dialect). The first morpheme, mi-, is a verbal prefix marking continuity (present/imperfect tense).
The second part, gu, is the stem, which is shortened and reduced to a single consonant, g, in conversational speech. The last part is the inflectional suffix marking person and number, which is also shortened in conversational Farsi.

Table 4.1: Conjugation of goftan (‘to say’) in formal and informal Tehrani Farsi.

Person   Number     Formal Farsi   Informal Tehrani Farsi
1st      Singular   mi-gu-yam      mi-g-am
2nd      Singular   mi-gu-yi       mi-g-i
3rd      Singular   mi-gu-yad      mi-g-e
1st      Plural     mi-gu-yim      mi-g-im
2nd      Plural     mi-gu-yid      mi-g-id
3rd      Plural     mi-gu-yand     mi-g-an
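The one-to-many mapping from vowel-less written forms to pronunciations, illustrated with ⟨škr⟩ above, can be represented directly in a pronunciation lexicon. This is a minimal sketch: the transcription strings follow the bracketed forms in the text, and a real Farsi lexicon would need human supervision to enumerate and verify such variants.

```python
# Pronunciation lexicon allowing several pronunciations per written form.
# Entries follow the <škr> example in the text; glosses are for the reader.
PRON_LEXICON = {
    "škr": [
        ("SekæR", "sugar"),
        ("SokR", "appreciation"),
    ],
}

def pronunciations(written_form):
    # Return all candidate pronunciations; disambiguating between them
    # is left to the language model or surrounding context.
    return [pron for pron, _gloss in PRON_LEXICON.get(written_form, [])]

print(pronunciations("škr"))  # ['SekæR', 'SokR']
```

A decoder consulting this lexicon must then rely on context to decide which candidate a given utterance contains, exactly the disambiguation burden the missing short vowels create.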

State-of-the-art ASR systems for Farsi can recognise continuous speech in formal, Ketābi Farsi (Sameti et al. 2011), but colloquial speech still poses a problem for these systems. Phonological and dialectal variations, for example, can result in out-of-vocabulary (OOV) words, which are problematic for traditional speech recognition systems that rely on a lexicon of Ketābi representations of words. Furthermore, ASR resources are lacking for varieties of Persian outside of Iran, such as Dari and Tajik.
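Given variation like that in Table 4.1, one conceivable mitigation (a hypothetical sketch, not a component of the systems cited above) is a variant lexicon that maps informal Tehrani forms onto their Ketābi equivalents before lexicon lookup, so that colloquial forms do not surface as OOV words.

```python
# Hypothetical variant lexicon mapping informal Tehrani verb forms to
# their formal (Ketābi) equivalents, taken from the goftan paradigm in
# Table 4.1. Hyphens mark morpheme boundaries as in the table.
INFORMAL_TO_FORMAL = {
    "mi-g-am": "mi-gu-yam",
    "mi-g-i":  "mi-gu-yi",
    "mi-g-e":  "mi-gu-yad",
    "mi-g-im": "mi-gu-yim",
    "mi-g-id": "mi-gu-yid",
    "mi-g-an": "mi-gu-yand",
}

def normalise(token):
    # Fall back to the token itself when no informal variant is listed,
    # so formal (already Ketābi) tokens pass through unchanged.
    return INFORMAL_TO_FORMAL.get(token, token)

print(normalise("mi-g-e"))  # mi-gu-yad
```

A table like this would have to be built per paradigm (or generated by morphological rules), which is precisely why colloquial coverage remains costly for lexicon-based systems.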

4.5 Datasets

Early modern statistical speech recognition algorithms required datasets that could be used for both acoustic and language modelling. Today, with deep learning techniques taking prominence in computational sciences, the significance of large-scale data is paramount. In this section we introduce the most widely known, currently available documented speech corpora for Persian. The datasets are introduced in chronological order.

4.5.1 FARSDAT

Farsdat, the pioneering Persian speech database, includes Farsi read-speech data uttered by 304 native speakers who differ from each other with regard to age, gender, dialect, and educational level.1 The ratio of male to female speakers is about two to one. Each speaker uttered twenty sentences in two different sessions, resulting in two speech files of ten sentences each per speaker. The sentences were made using the 1,000 most frequent words in Farsi, extracted from newspaper data. The speech was collected in the acoustic booth of the Linguistics Laboratory of the University of Tehran, sampled at 22.05 kHz with 16-bit resolution, using a Sony cardioid dynamic microphone. A total of 6,080 utterances were segmented and labelled manually at the phoneme, word and sentence levels with start and end points. The annotation was done using Latin-script alphabetic and punctuation symbols that represent IPA characters. Farsdat was the first step toward producing Farsi speech databases to support original and advanced research in speech sciences and technology (Bijankhan, Sheikhzadegan, and Roohani 1994).

1 For information regarding access to this database contact the authors.


4.5.2 TFarsDat

TFarsDat, the Telephone Farsi Speech Database, is an audio collection of telephone conversations between pairs drawn from 202 different speakers.2 All participants are native speakers of Farsi, differing in age, gender, education level, and dialect. The participants consist of 77 female and 125 male speakers (62 % male, 38 % female). Audio files in TFarsDat have been recorded with a 16-bit audio card, sampled at 11 kHz. A linguist transcribed and labelled the data at the word and phoneme levels using purpose-built software. Each audio file has been transcribed using a combination of Latin alphabetical and punctuation characters. For example, the character (point) denotes the fricative /S/ (“sh”) and the character