
Haitao Liu and Junying Liang (Eds.) Motifs in Language and Text

Quantitative Linguistics

Editor Reinhard Köhler Advisory Editor Hermann Moisl

Volume 71

Motifs in Language and Text Edited by Haitao Liu Junying Liang

ISBN 978-3-11-047496-1
e-ISBN (PDF) 978-3-11-047663-7
e-ISBN (EPUB) 978-3-11-047506-7
ISSN 0179-3616

Library of Congress Cataloging-in-Publication Data
A CIP catalog record for this book has been applied for at the Library of Congress.

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

© 2017 Walter de Gruyter GmbH, Berlin/Boston
Printing and binding: CPI books GmbH, Leck
♾ Printed on acid-free paper
Printed in Germany
www.degruyter.com

Editors' Foreword

Linearity is one of the main features of human language. For various reasons, however, quantitative linguistic research has largely been paradigmatic rather than syntagmatic. The current volume focuses on linguistic motifs and emphasizes the linear organisation of linguistic units. As suggested by Köhler, a motif is defined as the longest continuous sequence of equal or increasing values representing a quantitative property of a linguistic unit. This volume documents recent results in this area; it is the first book that systematically collects and presents original research on linguistic motifs. It contains thirteen papers by altogether eighteen authors, covering quite a broad spectrum of topics from theoretical discussions to practical applications.

The first group consists of theoretically oriented papers. André Pascal Beyer examines the persistency of higher-order motifs by comparing Italian presidents' speeches, the Russian Uppsala corpus and a set of DNA sequences. George K. Mikros and Ján Mačutek examine modern Greek blogs and point out that word length distribution and text length are two important factors influencing the properties of word length motifs. Radek Čech, Veronika Vincze and Gabriel Altmann suggest that verb valency motifs are regular language entities. Hongxin Zhang and Haitao Liu take a further step, validating valency motifs as basic language entities and as a result of diversification processes.

The second group includes nine papers focused on practical applications. Cong Zhang investigates the words and F-motifs in six modern Chinese versions of the Gospel of Mark from 1855 to 2010, and Heng Chen and Junying Liang compare word length motifs in modern spoken and written Chinese, both suggesting motifs as an index of language evolution. Yingqi Jing and Haitao Liu investigate the linear arrangement of dependency distance in Indo-European languages, Ruina Chen focuses on part-of-speech motifs, Yaqin Wang uses L-motifs and F-motifs, and Yu Fang compares the L-motif TTR in two translated works, claiming motifs as an index of text classification and language typology. Jiang Yang examines the quantitative properties of polysemy motifs in Chinese and English, Wei Huang investigates the rank-frequency distribution and the length distribution of word length motifs in Chinese texts, and Jingqi Yan presents an explorative study of part-of-speech motifs and dependency motifs using treebanks of deaf students' writing in three learning stages, pointing out the function of motifs in language description and acquisition.

We hope that this volume will give insight into linguistic motifs across (1) different languages, (2) text types, and (3) dimensions of language, and also, tentatively, into the cognitive mechanisms underlying linguistic motifs. Moreover, we hope this volume will become a reference work for related future research as well as for undergraduate and postgraduate courses in Linguistics, Natural Language Processing and Text Mining.

We would like to thank all authors for their contributions and pleasant collaboration during the editing phases, and the referees for their invaluable efforts, as well as Jieqiang Zhu and Wei Huang for their assistance in the editing work. Most importantly, we would like to express our thanks to Reinhard Köhler for his suggestion that we edit this volume and for his continuous help and encouragement during the process of editing. We would also like to thank two other editors, Gabriel Altmann and Peter Grzybek, for their support and timely help. Finally, we would like to acknowledge the National Social Sciences Funding of China – Quantitative Linguistic Research of Modern Chinese (No. 11&ZD188), the Fundamental Research Funds for the Central Universities (Program of Big Data PLUS Language Universals and Cognition, Zhejiang University), and the MOE Project of the Center for Linguistics and Applied Linguistics, Guangdong University of Foreign Studies, which supported us during the preparation of this volume.

Haitao Liu, Junying Liang
Department of Linguistics, Zhejiang University
Hangzhou, China

Table of Contents

Editors' Foreword ..... V
André Pascal Beyer
Persistency of Higher Order Motifs ..... 1
Radek Čech – Veronika Vincze – Gabriel Altmann
On Motifs and Verb Valency ..... 13
Heng Chen – Junying Liang
Chinese Word Length Motif and Its Evolution ..... 37
Ruina Chen
Quantitative Text Classification Based on POS-motifs ..... 65
Yu Fang
L-motif TTR for Authorship Identification in Hongloumeng and Its Translation ..... 87
Wei Huang
Length Motifs of Words in Traditional and Simplified Chinese Scripts ..... 109
Yingqi Jing – Haitao Liu
Dependency Distance Motifs in 21 Indo-European Languages ..... 133
George K. Mikros – Ján Mačutek
Word Length Distribution and Text Length: Two Important Factors Influencing Properties of Word Length Motifs ..... 151
Yaqin Wang
Quantitative Genre Analysis Using Linguistic Motifs ..... 165
Jingqi Yan
The Rank-frequency Distribution of Part-of-speech Motif and Dependency Motif in the Deaf Learners' Compositions ..... 181
Jiang Yang
Quantitative Properties of Polysemy Motifs in Chinese and English ..... 201
Cong Zhang
The Words and F-motifs in the Modern Chinese Versions of the Gospel of Mark ..... 217
Hongxin Zhang – Haitao Liu
Motifs of Generalized Valencies ..... 231
Index of Names ..... 261
Subject Index ..... 267

André Pascal Beyer

Persistency of Higher Order Motifs

Abstract: In former and recent publications, motifs of motifs have repeatedly been calculated. Intuitively, this calculation seems meaningful. However, it has not yet been investigated how persistent and significant motifs of motifs are, although interesting results have been obtained from them. The following investigation is a further approach towards elucidating the meaning of motif derivation. Two linguistic corpora and one DNA corpus were used to calculate higher-order L-motifs. The entropy and the Hurst-exponent were obtained for each level of L-motifs. The entropy dropped at each level, as predicted. Over the first few levels the values of the Hurst-exponent shrink and then start rising again. This behaviour was not expected and remains to be explained.

Keywords: higher-order motifs, entropy, Hurst-exponent

1 Introduction

Syntactic motifs seem to be gaining more and more attention as a unit of study in the field of quantitative linguistics. Recent volumes of this series feature studies of motifs (e.g. Köhler, 2015; Mačutek, 2015). Being a relatively new unit in the field of linguistics, motif research is accompanied by many unexamined assumptions. The meaningfulness of calculating motifs of motifs is one of them. For instance, investigations taking L-motifs of L-motifs (yielding LL-motifs) have already been carried out and seem to deliver interesting results (e.g. Milička, 2015; Köhler & Naumann, 2010). Despite these results, the mechanisms of the unit are still not known. No valid linguistic theory has been proposed that would verify the meaningfulness of calculating motifs of motifs. This article attempts to approach the question whether calculating motifs of motifs is a reasonable operation. This will be done by comparing the entropy and the Hurst-exponent of consecutive motif derivations.

|| André Pascal Beyer: Department for Computational Linguistics and Digital Humanities, Trier University, Germany, [email protected]


2 Higher Order Motifs

The scope of motifs as a unit is limited: each motif captures only a small proportion of syntactic information. The following sentence: Honestly, they could not have answered those questions.

can be transformed into the following L-motifs (ascending, with lengths measured in the number of characters): (8) (4-5) (3-4-8) (5-9)

The longest motif in this example consists of 3 elements (3-4-8). This is the longest scope obtained through simple L-motifs here. This syntactic scope is not even long enough to include the whole sentence, which itself is small in comparison to paragraphs and documents. Because of that limitation, it seems questionable whether phenomena farther apart (e.g. anaphors distributed over multiple sentences) can be investigated with motifs. A proposed method to overcome the constraint of a small syntactic scope is using motifs of motifs. This abstraction or derivation will be referred to as a higher order motif: an LL-motif is a higher order motif of the second order, an LLL-motif is a higher order motif of the third order, and so on. Higher order motifs are assumed to grasp the structure and information of their underlying motifs while increasing the syntactic scope. The example motifs above would become the following LL-motifs: (1-2-3) (2)

Each motif here refers back to its underlying motifs and abstracts over them. The longest LL-motif consists of 3 elements, representing 3 L-motifs or 6 words. If the assumption is valid, the syntactic scope is thus increased by a factor of 2. It is still not known by which mechanism the results of former motif investigations were obtained. To this point, no tested theory of language implies a meaningful application and usage of higher order motifs. In the same vein, it is merely assumed that the derivation of motifs will lead to other meaningful motifs. The following investigation tries to shed more light on the latter assumption.
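To make the derivation concrete, the following minimal Python sketch (our illustration, not code from the original study) segments a sequence of values into motifs and applies the same rule again to obtain LL-motifs; the word lengths are those of the example sentence above:

```python
def motifs(values):
    """Split a sequence into L-motifs: maximal subsequences
    of equal or increasing values."""
    out, cur = [], []
    for v in values:
        if cur and v < cur[-1]:      # a drop closes the current motif
            out.append(tuple(cur))
            cur = []
        cur.append(v)
    out.append(tuple(cur))
    return out

lengths = [8, 4, 5, 3, 4, 8, 5, 9]              # character counts of the example
l_motifs = motifs(lengths)                      # [(8,), (4, 5), (3, 4, 8), (5, 9)]
ll_motifs = motifs([len(m) for m in l_motifs])  # [(1, 2, 3), (2,)]
print(l_motifs, ll_motifs)
```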


3 Investigation

3.1 Data

For the following investigation three corpora were used: Italian presidents' speeches (Tuzzi, 2010), the Russian Uppsala corpus (Lönngren, 1993), and a set of DNA sequences. The DNA sequences were chosen to move beyond the purely linguistic focus; they exhibit syntagmatic and paradigmatic relations comparable to natural language texts and therefore seem to be an interesting addition. The Italian presidents' speeches are New Year's speeches of Italian presidents collected over time. The Uppsala corpus combines texts of various genres in the Russian language. The DNA data were obtained from the NCBI genome database. The following species were chosen: eel, Rutilus, Urticaceae, and salmon. Since our knowledge of DNA and biology is very limited, this selection should be regarded as arbitrary. The lengths of the DNA files show a large gap: the observed files are either short or long, with nothing in between. Two file-length groups were used; the remaining third one – the longest – was omitted because of its high calculation time.

3.2 Method

For each further inquiry, each corpus is transformed into L-motifs from the first to the tenth order. For the natural language texts, the length underlying the L-motifs is measured as the syllable count per word; for the DNA sequences, the count of immediate character repetitions is used, e.g.: ACCAGGGTAA

becomes (1-2) (1-3) (1-2)

With every derivation the original file shrinks further and further. Most files contain only the (1) motif by the tenth order - so no further derivation was made.
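A small sketch of this preprocessing step (hypothetical code, not from the chapter): run lengths of immediately repeated characters are computed first, and the derivation is then iterated up to the tenth order, stopping once only a single motif remains:

```python
from itertools import groupby

def run_lengths(seq):
    """Lengths of runs of repeated characters: 'ACCAGGGTAA' -> [1, 2, 1, 3, 1, 2]."""
    return [len(list(g)) for _, g in groupby(seq)]

def motifs(values):
    """Maximal subsequences of equal or increasing values (L-motifs)."""
    out = [[values[0]]]
    for prev, v in zip(values, values[1:]):
        if v >= prev:
            out[-1].append(v)
        else:
            out.append([v])
    return out

seq = run_lengths("ACCAGGGTAA")          # [1, 2, 1, 3, 1, 2]
print(motifs(seq))                       # [[1, 2], [1, 3], [1, 2]]

for order in range(2, 11):               # L-motifs of the 2nd to 10th order
    seq = [len(m) for m in motifs(seq)]  # replace each motif by its length
    if len(seq) <= 1:                    # most files reach a single motif early
        break
```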


3.3 Results

A loss of information through motif derivation is an obvious and easily testable assumption. Reducing multiple motifs into one neither preserves the original motifs, nor can the higher order motif be used to reconstruct the underlying ones. Information is measured in terms of Shannon's entropy. For every file of each corpus, the entropy was calculated with the entropy package (Hausser, 2014) of the R programming language, using the logarithm with base 2. Below a certain length, neither the entropy nor the Hurst-exponent could be calculated any further. As an effect, the count of total texts to be investigated decreases from order to order. The decrease of entropy from order to order can easily be seen in figures 1 to 3:

Fig. 1: Measured entropy for the Uppsala Corpus


Fig. 2: Measured entropy for Italian president speeches.

Fig. 3: Measured entropy for the DNA corpus. Note the gap inside the data because of different length groups.
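The entropy computation itself is straightforward; a plain Python equivalent of the plug-in (maximum-likelihood) estimate in bits is sketched below. Note that the R entropy package offers several estimators, so exact values may differ slightly depending on the estimator chosen:

```python
import math
from collections import Counter

def shannon_entropy(symbols, base=2):
    """Empirical Shannon entropy of a sequence of motif types
    (plug-in estimate; base 2 gives bits)."""
    n = len(symbols)
    return -sum(c / n * math.log(c / n, base)
                for c in Counter(symbols).values())

# entropy of a (toy) sequence of L-motif types
print(shannon_entropy([(1, 2), (1, 3), (1, 2)]))  # ~0.918 bits
```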

The gap in the right plot of Figure 2 illustrates a shift in speaking habits: more recent presidents tend to speak longer than former ones. A side effect of derivation is the shrinking of the total motif count. This effect may be a possible explanation for the information loss. The following plots show the decrease of motif counts per order:


Fig. 4: L-Motif count per order in the Uppsala corpus

Fig. 5: L-Motif count per order in the Italian corpus


Fig. 6: L-Motif count per order in the DNA corpus

The loss of information should lead to a loss of syntactic relations: every further derivation loses information from its original – not only the lengths in the former motifs but also their exact syntactic structure. A high volatility of syntactic influence seems to be a plausible result of this. The Hurst-exponent is used to investigate the persistency of motif derivation: it is normally used to measure the volatility/persistency of time series (Kleinow, 2002). Its value ranges from 0 to 1. Calculating the Hurst-exponent for syntactic properties seems meaningful since speech is generated and processed in a sequential order (i.e. over time); therefore, later parts are more or less dependent on former parts in most languages. With decreasing syntactic relations, the volatility should rise and the Hurst-exponent should approach the value 0.5 (calculating it for a series generated by white noise leads to this value). For the calculation, the empirical Hurst-exponent He of R's pracma package (Borchers, 2016) was used. Every value higher or lower than 0.5 can be interpreted as more persistent. Figures 7 to 9 show the results:


Fig. 7: Hurst-exponent per order in the Uppsala corpus

Fig. 8: Hurst-exponent per order in the Italian corpus


Fig. 9: Hurst-exponent per order in the DNA corpus
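For readers without R, the rescaled-range idea behind the Hurst-exponent can be sketched as follows. This is a generic R/S estimator, not the exact empirical estimator He of the pracma package, so its values will differ somewhat:

```python
import numpy as np

def hurst_rs(x, min_window=8):
    """Hurst-exponent via rescaled-range analysis: the slope of
    log(mean R/S) against log(window size)."""
    x = np.asarray(x, dtype=float)
    n, size = len(x), min_window
    sizes, rs = [], []
    while size <= n // 2:
        ratios = []
        for start in range(0, n - size + 1, size):
            block = x[start:start + size]
            dev = np.cumsum(block - block.mean())   # cumulative deviations
            s = block.std()
            if s > 0:
                ratios.append((dev.max() - dev.min()) / s)
        sizes.append(size)
        rs.append(np.mean(ratios))
        size *= 2
    slope, _ = np.polyfit(np.log(sizes), np.log(rs), 1)
    return slope

rng = np.random.default_rng(0)
print(hurst_rs(rng.normal(size=4096)))  # white noise: close to 0.5
```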

The following tables summarize the arithmetic mean values of the different calculations for each order:

Tab. 1: Mean values in the Uppsala corpus

Order | Entropy | Hurst-exponent | Length
1 | 9.7160624 | 0.5970850 | 1282.12013
2 | 8.4019044 | 0.5419067 | 480.62987
3 | 7.0429998 | 0.5176098 | 187.418831
4 | 5.6589410 | 0.5581296 | 71.811688
5 | 4.2902836 | 0.6103100 | 28.412338
6 | 2.8037808 | 0.6661566 | 11.525974
7 | 0.9347163 | 0.5753992 | 5.165584
8 | 0.1058951 | 0.5092617 | 2.688312
9 | 0 | 0.5 | 1.668831
10 | 0 | 0.5 | 1.220779

Tab. 2: Mean values in the Italian corpus

Order | Entropy | Hurst-exponent | Length
1 | 10.5072408 | 0.5607754 | 14341.290221
2 | 9.222838 | 0.5537708 | 5462.25366
3 | 7.6543511 | 0.5228936 | 1948.318612
4 | 6.3511765 | 0.52799 | 765.384858
5 | 4.9736314 | 0.5982845 | 292.318612
6 | 3.2365861 | 0.6562195 | 111.911672
7 | 1.5547342 | 0.5200989 | 43.454259
8 | 1.0693542 | 0.5041328 | 17.365931
9 | 0.8111868 | 0.5105243 | 7.375394
10 | 0.5723404 | 0.5164744 | 3.400631

Tab. 3: Mean values in the DNA corpus

Order | Entropy | Hurst-exponent | Length
1 | 9.4715076 | 0.3800402 | 3774.959184
2 | 4.57468807 | 0.5546034 | 716.867925
3 | 3.99987424 | 0.5126682 | 310.566038
4 | 3.39109424 | 0.4833373 | 116.566038
5 | 2.77007128 | 0.5127951 | 44.226415
6 | 2.19559288 | 0.5202250 | 17.886792
7 | 1.6581588 | 0.6040636 | 7.754717
8 | 1.12059208 | 0.6294851 | 3.849057
9 | 0.04723129 | 0.5030371 | 2.452830
10 | 0 | 0.5000000 | 1.716981

The Hurst-exponent seems to decrease over the first four orders in all observations (except for the first order of the DNA data). Interestingly, the values rise again from about the 5th order and decrease after that (for the DNA data it is the fourth). We thought at first that there is a certain threshold in length below which calculations become unreliable, but for every corpus this threshold would have to lie at a different point – which leaves this conclusion in doubt.


4 Conclusion

So far there is no explanation or plausible assumption for the seemingly systematic bend at about the fifth order. It is still possible that the chosen corpora accidentally feature this as a rare abnormality; we doubt that, however, because of the size of each corpus. We are left with the question whether this bend is something inherent in motifs themselves. There are still other methods of investigating the dependency of motif derivations. Our general knowledge of time series calculation is limited. Further measurements of the persistency of time series can be applied, e.g. Lyapunov stability as another calculation based on fractals (like the Hurst-exponent), or Markov chains. The usage from the first to the fifth order seems meaningful nevertheless. The basic assumption about information loss and increasing volatility holds and can be seen within the data. Every derivation loses more information and persistency, which is still to be borne in mind.

References

Borchers, H. W. (2016). Practical Numerical Math Functions (R package pracma). https://cran.r-project.org/web/packages/pracma/index.html
Hausser, J., & Strimmer, K. (2014). Estimation of Entropy, Mutual Information and Related Quantities (R package entropy). https://cran.r-project.org/web/packages/entropy/
Kleinow, T. (2002). Testing continuous time models in financial markets. Doctoral dissertation, Humboldt University Berlin.
Köhler, R. (2015). Linguistic Motifs. In G. Mikros & J. Mačutek (Eds.), Sequences in Language and Text (pp. 89–108). Berlin/Boston: De Gruyter.
Köhler, R., & Naumann, S. (2010). A syntagmatic approach to automatic text classification. Statistical properties of F- and L-motifs as text characteristics. In P. Grzybek, E. Kelih, & J. Mačutek (Eds.), Text and Language. Structures, functions, interrelations, quantitative perspectives (pp. 81–89). Wien: Praesens.
Lönngren, L. (Ed.) (1993). Častotnyj slovar' sovremennogo russkogo jazyka [Частотный словарь современного русского языка; A frequency dictionary of the modern Russian language]. Uppsala: Uppsala University. (= Studia Slavica Upsaliensia 32)
Mačutek, J. (2015). Type-token relation for word length motifs in Ukrainian texts. In A. Tuzzi, M. Benešová, & J. Mačutek (Eds.), Recent Contributions to Quantitative Linguistics (pp. 63–74). Berlin/Boston: De Gruyter.
Milička, J. (2015). Is the Distribution of L-Motifs Inherited from the Word Length Distribution? In G. Mikros & J. Mačutek (Eds.), Sequences in Language and Text (pp. 133–145). Berlin/Boston: De Gruyter.
Tuzzi, A., Popescu, I.-I., & Altmann, G. (2010). Quantitative analysis of Italian texts. Lüdenscheid: RAM-Verlag.

Radek Čech, Veronika Vincze, Gabriel Altmann

On Motifs and Verb Valency

Abstract: The present study scrutinizes the question whether verb valency motifs behave like other linguistic units. Tests are performed using the Czech text Šlépěj (Footprint) by Karel Čapek and the Hungarian translation of G. Orwell's 1984. The rank-frequency distribution, the spectrum of motifs and the relation between length and frequency are examined. The Zipf-Mandelbrot distribution is used as a model for the rank-frequency distribution and the spectrum; the relationship between length and frequency is modelled by the Lorentzian function. For the determination of verb valency, a full valency approach is used. The results show that valency motifs are regular language entities.

Keywords: motifs, verb valency, distribution

1 Introduction

It is well known that all names, concepts, definitions and criteria in science are conventions. They are set up in order to identify and determine a "real" entity which is to be studied or analysed. They may refer to things, actions, properties, time, place, relations, etc. One can coin them freely, but their classes must be useful for some scientific purpose, e.g. description, classification, comparison, modelling, the search for links to other properties, the search for laws, etc. Some linguistic entities have been known since ancient Greece and India, and new ones have been coined especially in the last two centuries. In recent decades, a new unit called motif has been used in linguistics. The aim of this study is to apply this unit in a syntactic analysis. Specifically, we focus on verb valency and scrutinize whether so-called verb valency motifs behave like other linguistic units. The inventor of motifs was the musicologist M.G. Boroda, who introduced a

|| Radek Čech: Department of Czech Language, University of Ostrava, Ostrava, Czech Republic, [email protected] Veronika Vincze: Research Group on Artificial Intelligence, Hungarian Academy of Sciences, Szeged, Hungary, [email protected] Gabriel Altmann: Lüdenscheid, [email protected]

formally and unequivocally identifiable musical motif (Boroda 1973, 1982, 1988), not identical with the motifs defined by classical musicologists. The idea was transferred into linguistics by Köhler (2005, 2006, 2008a,b), and today it enjoys ever-increasing interest (cf. Köhler & Naumann 2008, 2009, 2010; Mačutek 2009; Sanada 2010; Beliankou, Köhler & Naumann 2013; Köhler 2015; Milička 2015; Mačutek & Mikros 2015). In this article, we first introduce the linguistic background of the analysis and its methodology (Sections 2 and 3); then the frequency distribution, the frequency spectrum and the relationship between the frequency and the length of verb valency motifs are modelled (Sections 4 and 5). Finally, in Section 6, we present further possibilities for research.

2 Verb Valency

As a matter of fact, the valency of a verb is a subset of its polytexty: it is measured on the basis of the environment of the verb. Since the size of this subset can easily be stated, the sentences of a text can be written as a numerical (or symbolic) sequence of verb valencies. One must, however, differentiate between the possible subset of valencies, which can be found in dictionaries, and the topical subset observed in the given sentence, which may differ from text to text. Once the sequence of valencies is stated, valency motifs can be constructed, which in this case are subsequences of non-decreasing numbers. The better specified the text, i.e. the greater the valency of the verbs, the greater the mean of the sequence. Hence, even individual motifs can be characterized, e.g. by their average, standard deviation, range, etc. Here we shall concentrate on the usual quantitative motifs representing valency.

3 Method

For the determination of verb valency, a full valency approach is used (cf. Čech et al. 2010). Contrary to the original conception of valency (Allerton 2005), there is no differentiation between obligatory complements and non-obligatory adjuncts in this approach; all sentence elements which depend on a verb constitute its (full) valency. The approach was proposed as a reaction to the lack of clear criteria for distinguishing complements and adjuncts. Present results (Čech et al. 2010; Vincze 2014) seem to justify this concept and, moreover,


the approach seems to be applicable to the analysis of syntactic dependency relationships in general. Specifically, a full valency frame is determined for each verb in a text as follows. Let us start with sentence (1):

(1) My father gave four books to Mary yesterday evening in Berlin

In accordance with the syntactic dependency formalism (Meľčuk 1988), it is possible to express syntactic relationships in the form of a graph (see Fig. 1). There are five sentence elements (father, books, to, yesterday, in) directly dependent on the verb gave; consequently, all of them constitute the full valency frame of the verb. Another way of presentation is linear, simply joining gave by means of an edge with father, books, to, yesterday, and in. In the present study, only the number of elements constituting the full valency frame is used for the analysis; this number is calculated for each predicative verb in a text. Thus, a sequence of numbers is obtained expressing the sizes of the full valency frames, and it is then possible to determine motifs based on the size of the full valency frame; we call them FVS-motifs.

Fig. 1: Syntactic tree of the sentence (1) based on dependency syntax formalism
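As a minimal computational sketch (our illustration; the head indices below encode one plausible dependency parse of sentence (1) matching Fig. 1, and the helper name is hypothetical), the size of the full valency frame is simply the number of tokens whose head is the verb:

```python
def full_valency_sizes(heads, verb_indices):
    """For each verb, count the tokens directly depending on it.
    `heads` maps a 1-based token index to the index of its head (0 = root)."""
    counts = {v: 0 for v in verb_indices}
    for head in heads.values():
        if head in counts:
            counts[head] += 1
    return counts

# Sentence (1): My father gave four books to Mary yesterday evening in Berlin
heads = {1: 2, 2: 3, 3: 0, 4: 5, 5: 3, 6: 3, 7: 6,
         8: 3, 9: 8, 10: 3, 11: 10}   # 'gave' (index 3) is the root
print(full_valency_sizes(heads, verb_indices=[3]))  # {3: 5}
```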

The minimum size of the full valency frame equals zero; it occurs, for instance, in imperative sentences, cf. the English sentence (2): (2) Run! or in pro-drop (null-subject) realizations of sentences, cf. Russian (3): (3) Пришёл [prišjol; 'he came']

16 | Radek Čech – Veronika Vincze – Gabriel Altmann or in sentences where the subject is not realized at all, cf. Czech (4) (4) Prší. [it rains]. Theoretically, the maximum size of full valency frame is unlimited, however, our empirical findings reveal that it does not exceed 6 (for Czech) and 10 (for Hungarian), as shown in Tab. 1 and 2 (See Appendix). The short story Šlépěj (Footprint) written by the Czech writer Karel Čapek (1917) is used for the present analysis. The text contains 301 predicative verbs; to each verb, the size of full valency frame was annotated manually. A sequence of the sizes from the text is as follows: [4, 3, 3, 4, 2, 3, 2, 2, 2, 1, 4, 3, 3, 3, 2, 0, 2, 1, 3, 3, 2, 1, 1, 1, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 3, 3, 3, 1, 3, 1, 0, 1, 0, 2, 2, 1, 2, 1, 2, 2, 1, 3, 3, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 1, 4, 2, 3, 2, 2, 2, 1, 2, 2, 4, 0, 2, 2, 2, 2, 1, 2, 2, 2, 2, 5, 2, 0, 3, 4, 1, 3, 1, 3, 3, 3, 4, 1, 3, 2, 3, 2, 3, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 3, 3, 3, 2, 0, 2, 1, 2, 2, 2, 4, 2, 3, 5, 4, 4, 3, 2, 2, 2, 2, 1, 3, 2, 2, 2, 4, 3, 2, 1, 2, 2, 2, 1, 4, 4, 2, 1, 2, 2, 2, 2, 1, 2, 4, 3, 2, 2, 3, 3, 4, 0, 3, 2, 2, 0, 2, 2, 2, 3, 2, 1, 1, 2, 1, 2, 3, 2, 2, 1, 4, 1, 3, 2, 3, 2, 1, 2, 3, 3, 2, 3, 3, 1, 4, 1, 1, 3, 2, 4, 4, 2, 1, 1, 2, 2, 3, 1, 1, 2, 2, 3, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 3, 2, 2, 2, 4, 1, 3, 3, 2, 2, 4, 1, 3, 3, 1, 3, 1, 3, 2, 2, 1, 2, 1, 1, 2, 2, 5, 3, 1, 3, 3, 1, 3, 3, 1, 3, 1, 3, 2, 2, 2, 2, 1, 4, 2, 2, 6, 2, 3, 2, 2, 2, 1, 3, 1, 3, 3, 2, 3, 1, 2, 2, 2, 4, 1, 1, 1, 2, 2, 2, 2, 1, 3, 2]

Analogously to L-motifs, which are defined as sequences of equal or increasing length values (cf. Köhler 2015), we define FVS-motifs as sequences of equal or increasing full valency frame sizes. For illustration, see the first 10 FVS-motifs in the text: 4 3-3-4 2-3 2-2-2 1-4 3-3-3 2 0-2 1-3-3 2
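A short sketch of this segmentation (illustrative code; the input is the beginning of the size sequence given above) reproduces the first ten FVS-motifs:

```python
def fvs_motifs(sizes):
    """Maximal non-decreasing runs of full-valency-frame sizes."""
    out, cur = [], []
    for v in sizes:
        if cur and v < cur[-1]:
            out.append(cur)
            cur = []
        cur.append(v)
    out.append(cur)
    return out

sizes = [4, 3, 3, 4, 2, 3, 2, 2, 2, 1, 4, 3, 3, 3, 2, 0, 2, 1, 3, 3, 2]
print(["-".join(map(str, m)) for m in fvs_motifs(sizes)])
# ['4', '3-3-4', '2-3', '2-2-2', '1-4', '3-3-3', '2', '0-2', '1-3-3', '2']
```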

For the complete text we may present the frequency of individual motif types in the form of a rank-frequency distribution. The data are presented in Tab. 1. Since the FVS-motif is a new linguistic unit, one may conjecture that it abides by a regular distribution. If it does not abide by a model which has been used for well-established linguistic units (such as words, morphemes, syllables etc.), it should be either rejected as meaningless or operationalized in a different way. As for the model, we apply the well-proven Zipf-Mandelbrot distribution, which was used, among other things, with excellent results for modelling the L-motif distribution (Köhler 2015). Fitting the model to the data as presented in the last column of Tab. 1 yields the following results: the parameters are estimated as a = 0.9859, b = 1.9063; χ² = 2.7839 with 36 degrees of freedom; the probability is P = 1.0, as shown in the last row of Tab. 1. This represents an excellent fit (cf. Fig. 2); consequently, the FVS-motif can be placed, at least with regard to its distribution, in the list of other basic linguistic units. It must be remarked that the Zipf-Mandelbrot distribution is only one of several distributions capturing the given data; others are used frequently in linguistics as well (e.g. the negative hypergeometric, Pólya, zeta, and right truncated Zipf-Alekseev distributions).
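A least-squares sketch of such a fit (our illustration; the chapter minimises the chi-square statistic instead, so parameter values will differ slightly from a = 0.9859, b = 1.9063) can be written with SciPy:

```python
import numpy as np
from scipy.optimize import curve_fit

# Observed FVS-motif frequencies by rank (Tab. 1).
freq = np.array([11, 10, 7, 7, 5, 4, 4, 4, 3, 3, 3] + [2] * 7 + [1] * 30)
rank = np.arange(1, len(freq) + 1)

def zipf_mandelbrot(r, a, b):
    """Expected frequency proportional to 1/(b + r)^a, normalised
    so that the fitted values sum to the number of motif tokens."""
    w = (b + r) ** -a
    return freq.sum() * w / w.sum()

params, _ = curve_fit(zipf_mandelbrot, rank, freq,
                      p0=(1.0, 2.0), bounds=([0.1, 0.0], [5.0, 50.0]))
print(params)  # compare with the chapter's a = 0.9859, b = 1.9063
```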

Fig. 2: Graph of the Zipf-Mandelbrot distribution as fitted to the data from Tab. 1

4 The Spectrum

Further, a transformation of the data (from Tab. 1) into a frequency spectrum, which expresses the number of FVS-motifs occurring in the text exactly x times, allows another evaluation of the behaviour of the unit. Applying the transformation x ≤ fx < x+1 (cf. e.g. Wimmer et al. 2003: 119) to the Zipf-Mandelbrot distribution, the resulting spectrum should be

g_x = C (x^(-a) − (x+1)^(-a))        (1)

Since modelling by means of distributions or, alternatively, by functions does not distort the "reality", we apply the above function and consider C (the normalizing constant) simply as the sum of cases. In Tab. 1, one can see that there are 30 motifs occurring exactly once, 7 motifs occurring twice each, etc. In this way we obtain the observed and computed frequency spectrum of the FVS-motifs in the short story Šlépěj (Footprint), as presented in Tab. 3 (see Appendix). The model is very satisfactory and can be preliminarily accepted. Its advantage is the existence of only one parameter. However, fitting the Zipf-Mandelbrot distribution, we would also obtain acceptable results (a = 2.314, b = 0.5333, χ² = 2.33 with 3 degrees of freedom and probability P = 0.721); even the simple zeta distribution is sufficient (a = 2.047, χ² = 1.9303 with 6 DF and P = 0.926). Since the software also produces figures of the results, we show the fits of the Zipf-Mandelbrot and the zeta distributions in Fig. 3 and 4 below.

Fig. 3: Graph of the Zipf-Mandelbrot distribution as fitted to the data from Tab. 3


Fig. 4: Graph of the zeta distribution as fitted to the data in Tab. 3

The spectrum of the Hungarian data is presented in Tab. 4 (See Appendix). Evidently, formula (1) is appropriate, even if here we used it as a simple function. The results testify to the fact that motifs behave like all other linguistic units.
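The spectrum computation can be checked directly from the rank-frequency data: counting how many motif types occur exactly x times in Tab. 1 reproduces the observed column of Tab. 3, and evaluating formula (1) with the fitted a reproduces the theoretical column (a sketch, with C taken as the sum of cases):

```python
from collections import Counter

# Motif frequencies from Tab. 1 (Czech data).
freqs = [11, 10, 7, 7, 5, 4, 4, 4, 3, 3, 3] + [2] * 7 + [1] * 30

observed = dict(sorted(Counter(freqs).items()))
# {1: 30, 2: 7, 3: 3, 4: 3, 5: 1, 7: 2, 10: 1, 11: 1} -- cf. Tab. 3

C, a = len(freqs), 1.4071            # C = 48 = sum of cases
theoretical = {x: round(C * (x ** -a - (x + 1) ** -a), 2) for x in observed}
print(theoretical)                   # {1: 29.9, 2: 7.87, 3: 3.41, ...}
```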

5 The Link between Length and Frequency

Both the rank-frequency distribution and the frequency spectrum of the FVS-motifs corroborate the linguistic status of this unit. Therefore, it seems reasonable to expect that the unit also meets some of the requirements fulfilled by basic linguistic units. The hypothesis predicting a relationship between the frequency and the length of a unit ranks among the best corroborated ones. Thus, we assume that more frequent FVS-motifs must be shorter (on average) than less frequent ones. Although our data (see Tab. 3) do not allow us to test the hypothesis properly – there are not enough instances in each frequency class, cf. Tab. 1 – it is possible, at least preliminarily, to observe a tendency, if one exists at all. The length of an FVS-motif is counted in the number of words it consists of. In order to express the existence of a link between these two properties, we use the Lorentzian function

y = a + b / (1 + ((x − c)/d)²)        (2)

used several times in linguistics (Popescu et al. 2009, 2011, 2015). This function can easily be derived using the differential equation approach containing the classical components: the language constant, the effort of the speaker and the equilibrating force of the hearer (cf. Wimmer, Altmann 2005). The result of the computation is displayed in Tab. 5 (See Appendix) and Fig. 5.

Fig. 5: Graph of the relationship between the frequency and the average length of FVS-motifs from Tab. 5
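The fit can be reproduced approximately with a standard nonlinear least-squares routine (a sketch using the Tab. 5 data; starting values are chosen near the reported solution):

```python
import numpy as np
from scipy.optimize import curve_fit

def lorentzian(x, a, b, c, d):
    """Function (2): y = a + b / (1 + ((x - c) / d)**2)."""
    return a + b / (1 + ((x - c) / d) ** 2)

freq = np.array([1, 2, 3, 4, 5, 7, 10, 11])              # Tab. 5: frequency
avg_len = np.array([3.36, 3.57, 5.66, 2, 2, 2.5, 1, 2])  # average motif length

params, _ = curve_fit(lorentzian, freq, avg_len, p0=(2.0, 4.0, 2.7, 0.4))
print(params)  # compare a = 1.9279, b = 6.2626, c = 2.6563, d = 0.4132
```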


For the Hungarian data we obtain the results presented in Tab. 6 and Fig. 6. In these data we have a simply decreasing trend, but the oscillation is evident. Hence we try to capture it by applying a simple power function

y = a·x^b        (3)

and obtain the results presented in Tab. 6 (see Appendix).

Fig. 6: Graph of the relationship between the frequency and the average length of FVS-motifs from Tab. 6

In accordance with the hypothesis, there is a tendency to emphasize the central FVS-motif length and to shorten both less frequent and more frequent FVS-motifs; hence, the curve should rather be bell-shaped. But the Hungarian data, which are very extensive, show that the dependence need not be bell-shaped. Perhaps the parameters of the functions are associated with the kind of linguistic entity. However, this is, so to say, not the last word. Analyses of many texts will perhaps show that the link is very complex. Here we considered frequency as the independent variable, but the other way round is possible, too.

6 Conclusion

The study reveals that verb valency motifs can be considered a linguistic unit which shares the same characteristics as the majority of well-established traditional units. Specifically, both the rank-frequency distribution and the spectrum abide by the Zipf-Mandelbrot distribution; further, there is a relationship between the frequency and the length of the motif. Valency is a property of the verb stated on the basis of its associated environment. Verbs themselves express various activities which need not be restricted to humans. Capturing them, e.g. according to Ballmer (1982), cf. also Köhler & Altmann (2014: 66 f.), in the form of B-motifs, one can study the relation of the text to reality. This aspect has nothing to do with grammar but rather with the semantics of verbs. A further possibility is to give names to the members of the valency frame, e.g. N, V, Adv, Prep, Pron, …, or to the phrases in which they occur, and then study the form of the valencies, i.e. consider the frequencies and similarities of valency frames.

References

Allerton, D. J. (2005). Valency Grammar. In K. Brown (Ed.), The Encyclopedia of Language and Linguistics (pp. 4878–4886). Elsevier Science Ltd.
Ballmer, Th. T. (1982). Biological foundations of linguistic communication. Amsterdam/Philadelphia: Benjamins.
Beliankou, A., Köhler, R., & Naumann, S. (2013). Quantitative properties of argumentation motifs. In I. Obradović, E. Kelih, & R. Köhler (Eds.), Methods and Applications of Quantitative Linguistics. Selected papers of the VIIIth International Conference on Quantitative Linguistics (QUALICO) in Belgrade, Serbia (pp. 33–43).
Boroda, M. G. (1973). K voprosu o metroritmičeski elementarnoj edinice v muzyke [On the metro-rhythmically elementary unit in music]. Bulletin of the Academy of Sciences of the Georgian SSR, 71(3), 745–748.
Boroda, M. G. (1982). Die melodische Elementareinheit. In J. K. Orlov, M. G. Boroda, & I. Š. Nadarejšvili (Eds.), Sprache, Text, Kunst. Quantitative Analysen (pp. 205–222). Bochum: Brockmeyer.
Boroda, M. G. (1988). Towards a problem of basic structural units of musical texts. Musikometrika 1 (pp. 11–69). Bochum: Brockmeyer.
Čech, R., Pajas, P., & Mačutek, J. (2010). Full valency. Verb valency without distinguishing complements and adjuncts. Journal of Quantitative Linguistics, 17, 291–302.
Köhler, R. (2005). Synergetic linguistics. In R. Köhler, G. Altmann, & R. G. Piotrowski (Eds.), Quantitative Linguistics: An International Handbook (pp. 760–774). Berlin/New York: de Gruyter.
Köhler, R. (2006). The frequency distribution of the length of length sequences. In J. Genzor & M. Bucková (Eds.), Favete linguis. Studies in honour of Viktor Krupa (pp. 145–152). Bratislava: Slovak Academy Press.
Köhler, R. (2008a). Word length in text. A study in the syntagmatic dimension. In S. Mislovičová (Ed.), Jazyk a jazykoveda v pohybe (pp. 421–426). Bratislava: Veda.
Köhler, R. (2008b). Sequences of linguistic quantities. Report on a new unit of investigation. Glottotheory, 1(1), 115–119.
Köhler, R. (2015). Linguistic motifs. In G. K. Mikros & J. Mačutek (Eds.), Sequences in Language and Text (pp. 89–108). Berlin/Boston: de Gruyter.
Köhler, R., & Altmann, G. (2014). Problems in Quantitative Linguistics Vol. 4. Lüdenscheid: RAM-Verlag.
Köhler, R., & Naumann, S. (2008). Quantitative text analysis using L-, F- and T-segments. In C. Preisach, H. Burkhardt, L. Schmidt-Thieme, & R. Becker (Eds.), Data Analysis, Machine Learning and Applications (pp. 635–646). Berlin/Heidelberg: Springer.
Köhler, R., & Naumann, S. (2009). A contribution to quantitative studies on the sentence level. In R. Köhler (Ed.), Issues in Quantitative Linguistics (pp. 34–57). Lüdenscheid: RAM-Verlag.
Köhler, R., & Naumann, S. (2010). A syntagmatic approach to automatic text classification. Statistical properties of F- and L-motifs as text characteristics. In P. Grzybek, E. Kelih, & J. Mačutek (Eds.), Text and Language. Structures, functions, interrelations, quantitative perspectives (pp. 81–89). Wien: Praesens.
Mačutek, J. (2009). Motif richness. In R. Köhler (Ed.), Issues in Quantitative Linguistics (pp. 51–60). Lüdenscheid: RAM-Verlag.
Mačutek, J., & Mikros, G. (2015). Menzerath-Altmann law for word length motifs. In G. K. Mikros & J. Mačutek (Eds.), Sequences in Language and Text (pp. 125–131). Berlin/Boston: de Gruyter.
Meľčuk, I. (1988). Dependency Syntax: Theory and Practice. Albany: State University of New York Press.
Milička, J. (2015). Is the distribution of L-motifs inherited from the word length distribution? In G. K. Mikros & J. Mačutek (Eds.), Sequences in Language and Text (pp. 133–145). Berlin/Boston: de Gruyter.
Popescu, I.-I., Altmann, G., Grzybek, P., Jayaram, B. D., Köhler, R., Krupa, V., Mačutek, J., Pustet, R., Uhlířová, L., & Vidya, M. N. (2009). Word frequency studies. Berlin/New York: Mouton de Gruyter.
Popescu, I.-I., Čech, R., & Altmann, G. (2011). Vocabulary richness in Slovak poetry. Glottometrics, 22, 62–72.
Popescu, I.-I., Lupea, M., Tatar, D., & Altmann, G. (2015). Quantitative Analysis of Poetic Texts. Berlin: de Gruyter.
Sanada, H. (2010). Distribution of motifs in Japanese texts. In P. Grzybek, E. Kelih, & J. Mačutek (Eds.), Text and Language. Structures, functions, interrelations, quantitative perspectives (pp. 183–194). Wien: Praesens.
Vincze, V. (2014). Valency frames in a Hungarian corpus. Journal of Quantitative Linguistics, 21(2), 153–176.
Wimmer, G., Altmann, G., Hřebíček, L., Ondrejovič, S., & Wimmerová, S. (2003). Úvod do analýzy textov [Introduction to text analysis]. Bratislava: VEDA.
Wimmer, G., & Altmann, G. (2005). Unified theory of some linguistic laws. In R. Köhler, G. Altmann, & R. Piotrowski (Eds.), Quantitative Linguistics. An International Handbook (pp. 791–807). Berlin/New York: Walter de Gruyter.

Appendix

Tab. 1: Rank-frequency distribution of the FVS-motifs in the short story Šlépěj (Footprint)

Rank | Motif | Fr. | ZI-MA || Rank | Motif | Fr. | ZI-MA
1 | 1-3 | 11 | 11.68 || 25 | 2-3-3 | 1 | 1.30
2 | 2 | 10 | 8.73 || 26 | 3-3-3 | 1 | 1.26
3 | 1-3-3 | 7 | 6.97 || 27 | 1-2-4 | 1 | 1.21
4 | 2-3 | 7 | 5.81 || 28 | 2-2-4 | 1 | 1.17
5 | 1-4 | 5 | 4.98 || 29 | 3-3-4 | 1 | 1.14
6 | 3 | 4 | 4.35 || 30 | 1-4-4 | 1 | 1.10
7 | 2-2-2 | 4 | 3.87 || 31 | 2-4-4 | 1 | 1.07
8 | 2-2 | 4 | 3.49 || 32 | 2-3-5 | 1 | 1.04
9 | 1-1-1-2-2-2-2 | 3 | 3.17 || 33 | 2-2-6 | 1 | 1.01
10 | 1-1-2-2-3 | 3 | 2.91 || 34 | 4-4 | 1 | 0.98
11 | 1-2-2-2-2 | 3 | 2.69 || 35 | 0-1 | 1 | 0.95
12 | 1-2-2 | 2 | 2.50 || 36 | 0-2-2 | 1 | 0.93
13 | 1-2 | 2 | 2.33 || 37 | 0-2-2-2-3 | 1 | 0.91
14 | 0-2 | 2 | 2.19 || 38 | 0-3 | 1 | 0.88
15 | 0-2-2-2-2 | 2 | 2.06 || 39 | 0-3-3-3 | 1 | 0.86
16 | 1-2-2-2-4 | 2 | 1.95 || 40 | 0-3-4 | 1 | 0.84
17 | 2-2-2-2 | 2 | 1.84 || 41 | 1-1-2-2-5 | 1 | 0.82
18 | 2-2-2-4 | 2 | 1.75 || 42 | 1-2-2-2 | 1 | 0.80
19 | 1 | 1 | 1.67 || 43 | 1-2-2-2-3-3-3 | 1 | 0.79
20 | 4 | 1 | 1.59 || 44 | 1-2-2-4 | 1 | 0.77
21 | 5 | 1 | 1.53 || 45 | 1-2-3-3 | 1 | 0.75
22 | 1-1-2 | 1 | 1.46 || 46 | 1-3-3-3-4 | 1 | 0.74
23 | 1-1-3 | 1 | 1.40 || 47 | 2-2-2-2-2-2 | 1 | 0.72
24 | 1-2-3 | 1 | 1.35 || 48 | 2-2-3-3-4 | 1 | 0.71

a = 0.9859, b = 1.9063, n = 48, DF = 36, X² = 2.7839, P = 1.00

Tab. 2: Rank-frequency distribution of the FVS-motifs in the Hungarian translation of Orwell's 1984

Rank | Motif | Fr. | ZI-MA || Rank | Motif | Fr. | ZI-MA
1 | 2 | 422 | 514.92 || 244 | 1-1-2-2-2-4 | 2 | 2.22
2 | 3 | 384 | 391.78 || 245 | 0-0-3-3 | 2 | 2.21
3 | 2-3 | 278 | 312.09 || 246 | 1-1-1-1-2-3 | 2 | 2.20
4 | 1-3 | 273 | 258.87 || 247 | 1-1-1-2-2-2 | 2 | 2.19
5 | 1-4 | 226 | 216.65 || 248 | 0-1-2-5 | 2 | 2.18
6 | 2-4 | 203 | 186.24 || 249 | 2-2-2-2-5 | 2 | 2.16
7 | 1-2 | 153 | 162.54 || 250 | 2-2-3-6 | 2 | 2.15
8 | 2-2 | 129 | 143.63 || 251 | 0-3-3-3-4 | 1 | 2.14
9 | 0-3 | 107 | 128.24 || 252 | 2-4-4-6 | 1 | 2.13
10 | 2-5 | 103 | 115.53 || 253 | 0-0-3-4 | 1 | 2.12
11 | 4 | 103 | 104.81 || 254 | 2-3-3-3-3-3-3-4 | 1 | 2.11
12 | 3-3 | 90 | 95.73 || 255 | 1-4-7 | 1 | 2.10
13 | 3-4 | 88 | 87.93 || 256 | 4-4-6 | 1 | 2.08
14 | 0-4 | 82 | 81.18 || 257 | 2-2-3-3-3-4 | 1 | 2.07
15 | 2-3-3 | 75 | 75.27 || 258 | 1-2-2-5-6 | 1 | 2.06
16 | 1-5 | 74 | 70.08 || 259 | 0-1-1-2-6 | 1 | 2.05
17 | 1-3-3 | 73 | 65.47 || 260 | 2-3-8 | 1 | 2.04
18 | 1-2-3 | 71 | 61.37 || 261 | 4-5-6 | 1 | 2.03
19 | 2-2-3 | 69 | 57.70 || 262 | 0-1-2-3-3-4 | 1 | 2.02
20 | 1-2-4 | 68 | 54.39 || 263 | 3-4-7 | 1 | 2.01
21 | 2-2-4 | 66 | 51.39 || 264 | 0-3-5-9 | 1 | 2.00
22 | 1 | 64 | 48.67 || 265 | 3-3-3-3-4-4 | 1 | 1.99
23 | 0-2 | 59 | 46.20 || 266 | 2-2-3-3-3-3-3-3-3-3-4 | 1 | 1.98
24 | 1-3-4 | 57 | 43.93 || 267 | 0-0-1-2-3 | 1 | 1.98
25 | 2-3-4 | 54 | 41.85 || 268 | 2-4-4-4-4 | 1 | 1.96
26 | 1-1-3 | 48 | 39.93 || 269 | 0-1-4-4-4 | 1 | 1.95
27 | 1-2-2 | 47 | 38.17 || 270 | 1-1-1-1-1-4 | 1 | 1.94
28 | 2-2-2 | 40 | 36.53 || 271 | 2-2-2-4-5 | 1 | 1.93
29 | 3-5 | 39 | 35.01 || 272 | 1-1-1-1-3-3-4 | 1 | 1.92
30 | 0-2-3 | 36 | 33.60 || 273 | 0-1-3-3-6 | 1 | 1.91
31 | 2-2-5 | 35 | 32.28 || 274 | 2-3-3-3-5-5 | 1 | 1.90
32 | 0-5 | 35 | 31.05 || 275 | 1-1-1-3-7 | 1 | 1.89
33 | 1-1-4 | 35 | 29.90 || 276 | 5-5 | 1 | 1.88
34 | 1-1-2 | 34 | 28.83 || 277 | 0-1-2-2-2-3-3-4-4 | 1 | 1.88
35 | 1-6 | 33 | 27.81 || 278 | 1-1-2-3-4 | 1 | 1.87
36 | 0-3-3 | 31 | 26.86 || 279 | 0-3-4-4-4 | 1 | 1.86
37 | 2-2-2-3 | 30 | 25.97 || 280 | 1-3-3-3-5-5-6 | 1 | 1.85
38 | 2-4-4 | 28 | 25.12 || 281 | 2-2-2-2-4-4 | 1 | 1.84
39 | 2-2-3-3 | 28 | 24.32 || 282 | 1-4-6 | 1 | 1.83
40 | 2-3-5 | 27 | 23.56 || 283 | 0-0-2-4 | 1 | 1.82
41 | 1-2-2-3 | 26 | 22.85 || 284 | 1-2-7 | 1 | 1.81
42 | 1-4-4 | 25 | 22.17 || 285 | 0-1-3-3-3 | 1 | 1.81
43 | 0-1-3 | 25 | 21.52 || 286 | 3-3-4-4-4 | 1 | 1.80
44 | 1-2-5 | 25 | 20.91 || 287 | 0-3-3-6 | 1 | 1.79
45 | 4-4 | 24 | 20.33 || 288 | 0-4-6 | 1 | 1.78
46 | 3-3-4 | 23 | 19.77 || 289 | 2-3-4-5 | 1 | 1.77
47 | 1-1-5 | 23 | 19.24 || 290 | 0-2-2-3-3-4 | 1 | 1.76
48 | 1-2-3-3 | 23 | 18.74 || 291 | 1-3-6-6 | 1 | 1.76
49 | 0-2-4 | 22 | 18.25 || 292 | 2-2-2-2-2-3-4 | 1 | 1.75
50 | 2-2-2-4 | 22 | 17.79 || 293 | 1-1-2-5 | 1 | 1.74
51 | 1-2-2-4 | 21 | 17.35 || 294 | 2-2-2-3-3-5 | 1 | 1.73
52 | 2-6 | 21 | 16.93 || 295 | 2-2-2-3-3-4 | 1 | 1.72
53 | 3-3-3 | 20 | 16.52 || 296 | 0-1-2-4-6 | 1 | 1.72
54 | 1-3-3-3 | 19 | 16.13 || 297 | 1-2-2-2-2-3-4 | 1 | 1.71
55 | 0-3-4 | 19 | 15.76 || 298 | 1-2-2-2-2-2-3-3 | 1 | 1.70
56 | 2-3-3-3 | 18 | 15.40 || 299 | 0-0-2-3-6-6 | 1 | 1.69
57 | 0-2-2 | 18 | 15.10 || 300 | 1-2-5-5 | 1 | 1.69
58 | 1-3-3-4 | 17 | 14.72 || 301 | 4-4-4-6 | 1 | 1.68
59 | 2-2-3-4 | 17 | 14.41 || 302 | 1-3-5-5 | 1 | 1.67
60 | 1-1-3-3 | 15 | 14.10 || 303 | 1-1-3-3-3-5 | 1 | 1.66
61 | 3-4-4 | 15 | 13.80 || 304 | 2-3-3-3-5 | 1 | 1.66
62 | 1-1-2-3 | 15 | 13.52 || 305 | 1-1-2-2-6 | 1 | 1.65
63 | 0-3-3-3 | 15 | 13.24 || 306 | 0-2-3-3-3-3 | 1 | 1.64
64 | 1-1-2-2 | 15 | 12.97 || 307 | 1-1-1-2-4-5 | 1 | 1.63
65 | 2-2-2-2-3 | 15 | 12.72 || 308 | 1-1-3-3-3-3 | 1 | 1.63
66 | 1-1-2-4 | 15 | 12.47 || 309 | 2-2-2-2-3-3-3-5 | 1 | 1.62
67 | 2-3-3-4 | 14 | 12.23 || 310 | 1-1-3-3-3-4-4 | 1 | 1.61
68 | 1-4-5 | 13 | 12.00 || 311 | 2-3-3-3-3-4 | 1 | 1.61
69 | 1-2-2-2 | 13 | 11.78 || 312 | 4-8 | 1 | 1.60
70 | 0-2-5 | 12 | 11.56 || 313 | 0-8 | 1 | 1.60
71 | 0-6 | 12 | 11.35 || 314 | 4-4-4 | 1 | 1.59
72 | 3-3-5 | 12 | 11.15 || 315 | 0-2-2-2-5 | 1 | 1.58
73 | 1-3-5 | 12 | 10.95 || 316 | 1-2-2-2-2-3-3-3 | 1 | 1.57
74 | 0-2-2-3 | 12 | 10.76 || 317 | 0-0-1-4 | 1 | 1.57
75 | 3-6 | 11 | 10.57 || 318 | 1-3-3-6 | 1 | 1.56
76 | 2-3-3-5 | 11 | 10.39 || 319 | 0-1-3-5 | 1 | 1.55
77 | 1-1-3-4 | 11 | 10.22 || 320 | 2-2-2-2-2-2-2 | 1 | 1.55
78 | 2-2-2-2 | 11 | 10.05 || 321 | 1-1-1-2-3-3-3 | 1 | 1.54
79 | 2-2-4-5 | 11 | 9.88 || 322 | 2-2-2-2-2-5 | 1 | 1.53
80 | 1-7 | 11 | 9.72 || 323 | 0-4-4-6 | 1 | 1.53
81 | 0-1-4 | 11 | 9.56 || 324 | 1-1-1-2-2-3-3 | 1 | 1.52
82 | 5 | 10 | 9.41 || 325 | 2-2-3-3-4-5 | 1 | 1.51
83 | 1-2-3-4 | 10 | 9.27 || 326 | 0-2-2-4 | 1 | 1.51
84 | 2-3-4-4 | 10 | 9.12 || 327 | 1-2-2-2-3-3-4-4 | 1 | 1.50
85 | 1-1-1-4 | 10 | 8.98 || 328 | 1-1-3-3-3-3-3 | 1 | 1.50
86 | 1-2-3-5 | 9 | 8.85 || 329 | 1-3-4-8 | 1 | 1.49
87 | 2-2-2-3-3 | 9 | 8.71 || 330 | 2-2-5-6 | 1 | 1.48
88 | 2-3-6 | 9 | 8.58 || 331 | 1-8 | 1 | 1.48
89 | 0-3-3-5 | 9 | 8.46 || 332 | 0-2-2-6 | 1 | 1.47
90 | 1-2-3-3-4 | 9 | 8.34 || 333 | 0-5-6 | 1 | 1.47
91 | 1-2-2-2-3 | 9 | 8.22 || 334 | 0-1-1-2-2-5 | 1 | 1.46
92 | 0-1-2-2 | 9 | 8.10 || 335 | 0-1-1-1-3 | 1 | 1.45
93 | 0-1-2 | 8 | 7.99 || 336 | 0-1-1-1-2 | 1 | 1.45
94 | 3-3-3-4 | 8 | 7.88 || 337 | 0-4-5-6 | 1 | 1.44
95 | 2-2-6 | 8 | 7.77 || 338 | 1-1-1-2-2-5 | 1 | 1.44
96 | 0-2-2-2 | 8 | 7.66 || 339 | 0-0-2-3-3-3-6 | 1 | 1.43
97 | 2-2-3-5 | 8 | 7.56 || 340 | 0-1-1-1-3-3 | 1 | 1.42
98 | 1-1 | 8 | 7.46 || 341 | 1-2-2-3-3-3-4 | 1 | 1.42
99 | 2-2-2-2-4 | 8 | 7.36 || 342 | 0-3-6 | 1 | 1.41
100 | 2-7 | 8 | 7.26 || 343 | 0-1-1-3-3-3-3-4 | 1 | 1.41
101 | 2-4-5 | 7 | 7.17 || 344 | 2-2-2-8 | 1 | 1.40
102 | 1-2-2-3-3 | 7 | 7.08 || 345 | 0-2-2-2-2-4-4-4 | 1 | 1.40
103 | 2-2-2-3-4 | 7 | 6.99 || 346 | 2-2-3-5-5 | 1 | 1.39
104 | 3-3-6 | 7 | 6.90 || 347 | 2-2-8 | 1 | 1.39
105 | 1-2-6 | 7 | 6.81 || 348 | 0-1-2-2-2-4 | 1 | 1.38
106 | 1-1-1-3 | 7 | 6.73 || 349 | 0-2-3-6 | 1 | 1.38
107 | 2-2-4-4 | 7 | 6.64 || 350 | 1-3-3-3-3-3 | 1 | 1.37
108 | 1-2-2-2-4 | 7 | 6.56 || 351 | 0-1-1-4-5 | 1 | 1.37
109 | 1-3-3-3-3 | 7 | 6.48 || 352 | 1-2-2-3-3-5 | 1 | 1.36
110 | 1-3-3-5 | 7 | 6.41 || 353 | 1-2-2-2-5-5 | 1 | 1.35
111 | 1-2-4-4 | 7 | 6.33 || 354 | 1-3-4-6-6 | 1 | 1.35
112 | 2-2-3-3-4 | 7 | 6.26 || 355 | 1-1-2-2-2-2-3-3-3 | 1 | 1.34
113 | 3-4-5 | 6 | 6.18 || 356 | 1-2-2-2-2-2-2-4 | 1 | 1.34
114 | 2-3-3-3-4 | 6 | 6.11 || 357 | 2-3-3-4-4-5 | 1 | 1.33
115 | 0-4-4 | 6 | 6.04 || 358 | 2-2-2-2-3-4-4 | 1 | 1.33
116 | 0-3-5 | 6 | 5.97 || 359 | 0-1-2-4-5 | 1 | 1.32
117 | 2-4-6 | 6 | 5.91 || 360 | 1-1-2-2-2-2-2 | 1 | 1.32
118 | 1-3-4-4 | 6 | 5.84 || 361 | 1-1-1-1-2-4 | 1 | 1.31
119 | 1-3-6 | 6 | 5.78 || 362 | 2-2-3-3-3-4-4 | 1 | 1.31
120 | 0-2-6 | 6 | 5.71 || 363 | 0-2-4-8 | 1 | 1.30
121 | 2-2-3-3-3 | 6 | 5.65 || 364 | 1-2-2-3-3-4-4-4 | 1 | 1.30
122 | 2-3-3-3-3 | 6 | 5.59 || 365 | 0-1-1-2-5 | 1 | 1.30
123 | 1-2-2-2-2-3 | 6 | 5.53 || 366 | 0-0-6 | 1 | 1.29
124 | 1-2-2-5 | 6 | 5.47 || 367 | 0-1-1-3-4 | 1 | 1.29
125 | 4-5 | 5 | 5.41 || 368 | 1-2-2-3-3-3 | 1 | 1.28
126 | 3-3-3-3 | 5 | 5.36 || 369 | 0-0-3-3-3-3 | 1 | 1.28
127 | 2-8 | 5 | 5.30 || 370 | 1-1-3-4-4 | 1 | 1.27
128 | 0-3-3-4 | 5 | 5.25 || 371 | 1-2-3-3-4-5 | 1 | 1.27
129 | 0-1-6 | 5 | 5.19 || 372 | 1-1-1-2-2 | 1 | 1.26
130 | 2-5-5 | 5 | 5.14 || 373 | 0-2-2-2-4-4 | 1 | 1.26
131 | 0-4-5 | 5 | 5.09 || 374 | 0-3-5-6 | 1 | 1.25
132 | 1-3-3-3-4 | 5 | 5.04 || 375 | 1-2-2-2-2-2-3-3-3-3-3 | 1 | 1.25
133 | 1-1-2-2-4 | 5 | 4.99 || 376 | 2-2-2-3-6 | 1 | 1.24
134 | 1-5-5 | 5 | 4.94 || 377 | 2-2-2-3-4-4 | 1 | 1.24
135 | 1-2-2-3-4 | 5 | 4.89 || 378 | 1-2-2-2-3-4-4 | 1 | 1.24
136 | 1-1-2-2-3 | 5 | 4.84 || 379 | 2-2-2-2-6 | 1 | 1.23
137 | 2-2-2-2-2 | 5 | 4.80 || 380 | 1-2-2-2-4-5 | 1 | 1.23
138 | 1-1-1-1-3 | 5 | 4.75 || 381 | 0-1-1-1-3-5 | 1 | 1.22
139 | 1-2-3-3-3 | 5 | 4.70 || 382 | 0-0-3-5 | 1 | 1.22
140 | 3-3-4-4 | 5 | 4.66 || 383 | 1-1-3-3-4 | 1 | 1.21
141 | 1-1-1-2-3 | 5 | 4.62 || 384 | 1-1-1-4-4 | 1 | 1.21
142 | 2-3-3-4-4 | 5 | 4.57 || 385 | 1-2-2-2-2-2-2 | 1 | 1.21
143 | 2-2-7 | 4 | 4.53 || 386 | 1-1-1-1-2-2-2-2-2-3 | 1 | 1.20
144 | 2-4-4-4 | 4 | 4.49 || 387 | 1-1-2-2-2-3-4 | 1 | 1.20
145 | 0-1-2-4 | 4 | 4.45 || 388 | 2-2-2-2-2-2-2-3 | 1 | 1.19
146 | 2-3-3-6 | 4 | 4.41 || 389 | 1-1-1-1-2-2-3 | 1 | 1.19
147 | 1-1-4-5 | 4 | 4.37 || 390 | 0-1-3-4-5 | 1 | 1.18
148 | 2-2-2-3-5 | 4 | 4.33 || 391 | 3-5-6 | 1 | 1.18
149 | 1-4-4-4 | 4 | 4.29 || 392 | 2-2-3-4-4-4 | 1 | 1.18
150 | 0-1-3-3 | 4 | 4.25 || 393 | 1-2-3-3-3-4 | 1 | 1.17
151 | 0-1-5 | 4 | 4.22 || 394 | 4-4-5 | 1 | 1.17
152 | 0-2-3-3 | 4 | 4.18 || 395 | 3-3-3-3-4-6 | 1 | 1.16
153 | 1-1-3-5 | 4 | 4.14 || 396 | 1-3-3-4-5 | 1 | 1.16
154 | 1-1-1-3-4 | 4 | 4.11 || 397 | 0-2-2-2-7 | 1 | 1.16
155 | 2-2-2-2-3-3 | 4 | 4.07 || 398 | 1-1-3-6 | 1 | 1.15
156 | 2-2-2-4-4 | 4 | 4.04 || 399 | 1-3-3-3-3-3-4 | 1 | 1.15
157 | 0-0-4 | 4 | 4.00 || 400 | 3-3-3-5-6-6 | 1 | 1.15
158 | 2-4-4-5 | 4 | 3.97 || 401 | 0-3-3-4-4 | 1 | 1.14
159 | 0-1-1-2 | 4 | 3.94 || 402 | 4-4-4-5 | 1 | 1.14
160 | 1-1-2-2-2 | 4 | 3.90 || 403 | 1-3-3-7 | 1 | 1.13
161 | 0-1 | 4 | 3.87 || 404 | 0-1-2-2-5 | 1 | 1.13
162 | 1-3-7 | 4 | 3.84 || 405 | 2-2-2-5-6 | 1 | 1.13
163 | 1-5-6 | 4 | 3.81 || 406 | 2-5-6 | 1 | 1.12
164 | 1-1-2-3-3 | 4 | 3.78 || 407 | 1-1-1-2-4-4 | 1 | 1.12
165 | 2-2-2-5 | 4 | 3.75 || 408 | 1-4-5-6 | 1 | 1.11
166 | 1-1-1-2 | 4 | 3.72 || 409 | -3-10 | 1 | 1.11
167 | 0-7 | 3 | 3.69 || 410 | 1-3-4-6 | 1 | 1.11
168 | 3-3-3-3-3 | 3 | 3.66 || 411 | 1-5-5-6 | 1 | 1.10
169 | 1-1-4-4 | 3 | 3.63 || 412 | 0-2-3-4-4 | 1 | 1.10
170 | 1-1-1-2-5 | 3 | 3.60 || 413 | 1-1-3-3-4-5 | 1 | 1.10
171 | 1-2-2-4-4 | 3 | 3.57 || 414 | 2-2-2-3-3-6 | 1 | 1.09
172 | 1-1-6 | 3 | 3.55 || 415 | 1-2-3-4-4-4-4 | 1 | 1.09
173 | 0-1-2-3 | 3 | 3.52 || 416 | 2-4-4-4-5 | 1 | 1.09
174 | 0-0-3 | 3 | 3.49 || 417 | 3-3-3-4-4 | 1 | 1.08
175 | 1-1-1-1-3-3 | 3 | 3.46 || 418 | 1-3-3-5-5 | 1 | 1.08
176 | 0-1-2-2-4 | 3 | 3.44 || 419 | 3-4-6 | 1 | 1.08
177 | 2-2-2-2-2-2 | 3 | 3.41 || 420 | 1-2-2-3-6 | 1 | 1.07
178 | 3-7 | 3 | 3.39 || 421 | 1-2-4-4-4-4 | 1 | 1.07
179 | 2-2-2-3-3-3 | 3 | 3.36 || 422 | 1-1-2-3-3-4-4 | 1 | 1.07
180 | 1-1-2-2-2-2-3 | 3 | 3.34 || 423 | 1-1-1-1 | 1 | 1.06
181 | 0-1-1-3 | 3 | 3.31 || 424 | 0-4-4-5 | 1 | 1.06
182 | 0-2-2-2-3 | 3 | 3.29 || 425 | 2-2-2-2-3-5 | 1 | 1.06
183 | 1-2-2-2-2-4 | 3 | 3.26 || 426 | 1-2-2-2-2-2-2-3 | 1 | 1.05
184 | 1-2-2-6 | 3 | 3.24 || 427 | 2-2-3-3-3-3 | 1 | 1.05
185 | 1-3-4-5 | 3 | 3.22 || 428 | 0-1-1-1-4 | 1 | 1.05
186 | 1-3-3-4-4 | 3 | 3.19 || 429 | 0-2-3-3-3-3-3-3-3-3 | 1 | 1.04
187 | 1-1-2-3-5 | 3 | 3.17 || 430 | 0-1-1-1-2-3 | 1 | 1.04
188 | 1-1-1-5 | 3 | 3.15 || 431 | 0-1-2-3-3-3 | 1 | 1.04
189 | 1-1-2-2-2-2 | 3 | 3.13 || 432 | 0-2-2-2-3-3 | 1 | 1.03
190 | 2-6-6 | 2 | 3.11 || 433 | 0-0-5 | 1 | 1.03
191 | 2-4-5-5 | 2 | 3.08 || 434 | 2-2-2-2-3-4 | 1 | 1.03
192 | 0-3-5-5 | 2 | 3.06 || 435 | 1-2-3-3-3-3-3 | 1 | 1.02
193 | 1-2-3-4-4 | 2 | 3.04 || 436 | 2-2-3-4-7 | 1 | 1.02
194 | 3-3-3-5 | 2 | 3.02 || 437 | 2-2-2-2-3-3-4 | 1 | 1.02
195 | 0-1-1-3-3 | 2 | 3.00 || 438 | 1-1-2-2-3-5 | 1 | 1.01
196 | 2-2-2-2-2-3 | 2 | 2.98 || 439 | 1-1-1-1-4 | 1 | 1.01
197 | 4-4-4-4 | 2 | 2.96 || 440 | 0-2-3-3-3-3-3-4 | 1 | 1.01
198 | 4-6 | 2 | 2.94 || 441 | 1-1-3-5-5 | 1 | 1.00
199 | 1-2-2-2-3-4 | 2 | 2.92 || 442 | 1-1-1-2-2-2-3 | 1 | 1.00
200 | 1-2-4-6 | 2 | 2.90 || 443 | 1-2-4-4-4 | 1 | 1.00
201 | 1-4-4-5 | 2 | 2.88 || 444 | 1-1-3-3-3 | 1 | 0.99
202 | 2-3-4-4-4 | 2 | 2.86 || 445 | 2-2-2-2-2-2-3 | 1 | 0.99
203 | 0-2-3-5 | 2 | 2.84 || 446 | 0-1-1-2-3 | 1 | 0.99
204 | 0-2-5-5 | 2 | 2.82 || 447 | 1-1-1-1-1-2 | 1 | 0.99
205 | 0-0-2 | 2 | 2.81 || 448 | 0-1-7 | 1 | 0.98
206 | 2-2-2-2-2-3-3 | 2 | 2.79 || 449 | 0-1-1-4 | 1 | 0.98
207 | 2-3-4-6 | 2 | 2.77 || 450 | 2-2-2-6 | 1 | 0.98
208 | 2-2-4-6 | 2 | 2.75 || 451 | 0-1-2-3-3 | 1 | 0.97
209 | 0-1-1-5 | 2 | 2.73 || 452 | 0-1-3-6 | 1 | 0.97
210 | 1-1-1-1-2-2-2 | 2 | 2.72 || 453 | 1-1-1-2-2-3-3-3-4 | 1 | 0.97
211 | 0-1-3-4 | 2 | 2.70 || 454 | 1-1-1-1-1-1-5 | 1 | 0.97
212 | 2-2-3-3-5 | 2 | 2.68 || 455 | 1-1-1-2-4 | 1 | 0.96
213 | 0-2-3-3-5 | 2 | 2.67 || 456 | 1-1-1-3-3-4 | 1 | 0.96
214 | 0-0-2-2-3 | 2 | 2.65 || 457 | 1-2-2-3-4-4 | 1 | 0.96
215 | 0-2-2-2-6 | 2 | 2.63 || 458 | 1-1-2-2-2-3 | 1 | 0.95
216 | 1-2-2-2-5 | 2 | 2.62 || 459 | 1-1-2-2-2-2-2-2-2-2-2 | 1 | 0.95
217 | 3-4-4-5 | 2 | 2.60 || 460 | 0-2-2-2-3-3-4 | 1 | 0.95
218 | 0-2-3-4 | 2 | 2.58 || 461 | 0-1-2-2-2-2-3 | 1 | 0.95
219 | 1-2-3-3-5 | 2 | 2.57 || 462 | 2-2-2-7 | 1 | 0.94
220 | 0-2-2-3-4 | 2 | 2.55 || 463 | 2-2-2-2-3-3-3 | 1 | 0.94
221 | 1-2-2-2-2-3-3 | 2 | 2.54 || 464 | 0-1-2-2-2 | 1 | 0.94
222 | 1-2-2-2-2 | 2 | 2.52 || 465 | 1-2-5-6 | 1 | 0.93
223 | 1-2-3-6 | 2 | 2.51 || 466 | 0-2-2-2-2 | 1 | 0.93
224 | 0-5-5 | 2 | 2.49 || 467 | 0-2-3-3-4 | 1 | 0.93
225 | 0-4-5-5 | 2 | 2.48 || 468 | 0-1-2-4-4-5 | 1 | 0.93
226 | 1-1-3-7 | 2 | 2.46 || 469 | 0-0-1-1-1-1 | 1 | 0.92
227 | 1-1-1-3-3 | 2 | 2.45 || 470 | 1-2-2-2-3-3-4 | 1 | 0.92
228 | 0-2-2-3-3 | 2 | 2.43 || 471 | 1-1-2-2-3-3 | 1 | 0.92
229 | 1-2-2-3-3-4 | 2 | 2.42 || 472 | 0-0-2-3 | 1 | 0.92
230 | 1-2-2-2-3-3 | 2 | 2.41 || 473 | 2-3-3-3-3-3-3 | 1 | 0.91
231 | 0-1-4-4 | 2 | 2.39 || 474 | 0-0-1-3-3 | 1 | 0.91
232 | 1-1-2-3-3-3 | 2 | 2.38 || 475 | 1-2-2-2-2-2 | 1 | 0.91
233 | 0-2-2-5 | 2 | 2.36 || 476 | 0-2-2-2-4 | 1 | 0.91
234 | 0-1-2-4-4 | 2 | 2.35 || 477 | 0-3-4-4 | 1 | 0.90
235 | 0-4-4-4 | 2 | 2.34 || 478 | 1-1-2-7 | 1 | 0.90
236 | 1-2-2-4-5 | 2 | 2.32 || 479 | 1-4-4-4-4 | 1 | 0.90
237 | 1-1-2-2-3-4 | 2 | 2.31 || 480 | 0-3-3-3-3 | 1 | 0.90
238 | 1-1-2-2-2-5 | 2 | 2.30 || 481 | 1-1-4-4-5 | 1 | 0.89
239 | 0-1-1-2-2 | 2 | 2.29 || 482 | 0-2-3-3-7 | 1 | 0.89
240 | 1-1-1 | 2 | 2.27 || 483 | 1-2-3-4-4-4-5 | 1 | 0.89
241 | 1-2-4-5 | 2 | 2.26 || 484 | 3-3-3-7 | 1 | 0.89
242 | 1-1-1-1-2 | 2 | 2.25 || 485 | 0-2-2-2-3-6 | 1 | 0.88
243 | 1-2-3-4-5 | 2 | 2.24 || 486 | 1-2-3-3-4-4 | 1 | 0.88

Tab. 3: The frequency spectrum of the FVS-motifs in the short story Šlépěj (Footprint)

Frequency | No. of FVS-motifs | Theoretical (1)
1 | 30 | 29.90
2 | 7 | 7.87
3 | 3 | 3.41
4 | 3 | 1.84
5 | 1 | 1.13
7 | 2 | 0.53
10 | 1 | 0.24
11 | 1 | 0.19

a = 1.4071, R² = 0.9917

Tab. 4: The frequency spectrum of the FVS-motifs in the Hungarian translation of Orwell's 1984

Frequency | No. of FVS-motifs | Theoretical (1) || Frequency | No. of FVS-motifs | Theoretical (1)
1 | 237 | 235.33 || 34 | 1 | 0.17
2 | 60 | 67.26 || 35 | 3 | 0.16
3 | 23 | 30.61 || 36 | 1 | 0.15
4 | 24 | 17.15 || 39 | 1 | 0.12
5 | 18 | 10.83 || 40 | 1 | 0.12
6 | 12 | 7.39 || 47 | 1 | 0.08
7 | 12 | 5.34 || 48 | 1 | 0.08
8 | 8 | 4.01 || 54 | 1 | 0.06
9 | 7 | 3.12 || 57 | 1 | 0.05
10 | 4 | 2.48 || 59 | 1 | 0.05
11 | 7 | 2.02 || 64 | 1 | 0.04
12 | 5 | 1.67 || 66 | 1 | 0.04
13 | 2 | 1.40 || 68 | 1 | 0.04
14 | 1 | 1.19 || 69 | 1 | 0.03
15 | 7 | 1.03 || 71 | 1 | 0.03
17 | 2 | 0.78 || 73 | 1 | 0.03
18 | 2 | 0.69 || 74 | 1 | 0.03
19 | 2 | 0.61 || 75 | 1 | 0.03
20 | 1 | 0.54 || 82 | 1 | 0.02
21 | 2 | 0.49 || 88 | 1 | 0.02
22 | 2 | 0.44 || 90 | 1 | 0.02
23 | 3 | 0.40 || 103 | 2 | 0.01
24 | 1 | 0.36 || 107 | 1 | 0.01
25 | 3 | 0.33 || 129 | 1 | 0.01
26 | 1 | 0.30 || 153 | 1 | 0.01
27 | 1 | 0.28 || 203 | 1 | 0.003
28 | 2 | 0.26 || 226 | 1 | 0.002
30 | 1 | 0.22 || 273 | 1 | 0.002
31 | 1 | 0.21 || 278 | 1 | 0.001
33 | 1 | 0.18 || 384 | 1 | 0.0007
422 | 1 | 0.0006

a = 1.2657, c = 402.8904, R² = 0.9922

Tab. 5: The frequency and average length of the FVS-motifs in the short story Šlépěj (Footprint) by K. Čapek

Frequency | Average length | Lorentzian
1 | 3.36 | 2.29
2 | 3.57 | 3.71
3 | 5.66 | 5.63
4 | 2 | 2.47
5 | 2 | 2.12
7 | 2.5 | 1.98
10 | 1 | 1.95
11 | 2 | 1.94

a = 1.9279, b = 6.2626, c = 2.6563, d = 0.4132, R² = 0.82

Tab. 6: The frequency and average length of the FVS-motifs in the Hungarian translation of Orwell's 1984

Fr. | L | Power || Fr. | L | Power || Fr. | L | Power
1 | 5.43 | 5.70 || 22 | 3.50 | 3.11 || 64 | 1.00 | 2.52
2 | 4.70 | 4.97 || 23 | 3.33 | 3.08 || 66 | 3.00 | 2.51
3 | 4.61 | 4.59 || 24 | 2.00 | 3.06 || 68 | 3.00 | 2.49
4 | 4.00 | 4.34 || 25 | 3.00 | 3.03 || 69 | 3.00 | 2.49
5 | 4.06 | 4.16 || 26 | 4.00 | 3.01 || 71 | 3.00 | 2.47
6 | 3.92 | 4.01 || 27 | 3.00 | 2.99 || 73 | 3.00 | 2.46
7 | 4.12 | 3.89 || 28 | 3.50 | 2.97 || 74 | 2.00 | 2.45
8 | 3.38 | 3.79 || 30 | 4.00 | 2.93 || 75 | 3.00 | 2.45
9 | 4.29 | 3.71 || 31 | 3.00 | 2.91 || 82 | 2.00 | 2.40
10 | 3.25 | 3.63 || 33 | 2.00 | 2.87 || 88 | 2.00 | 2.37
11 | 3.29 | 3.56 || 34 | 3.00 | 2.86 || 90 | 2.00 | 2.36
12 | 3.00 | 3.50 || 35 | 2.67 | 2.84 || 103 | 1.50 | 2.30
13 | 3.50 | 3.45 || 36 | 3.00 | 2.82 || 107 | 2.00 | 2.28
14 | 4.00 | 3.40 || 39 | 2.00 | 2.78 || 129 | 2.00 | 2.20
15 | 4.00 | 3.35 || 40 | 3.00 | 2.77 || 153 | 2.00 | 2.13
17 | 4.00 | 3.27 || 47 | 3.00 | 2.68 || 203 | 2.00 | 2.01
18 | 3.50 | 3.24 || 48 | 3.00 | 2.67 || 226 | 2.00 | 1.97
19 | 3.50 | 3.20 || 54 | 3.00 | 2.61 || 273 | 2.00 | 1.90
20 | 3.00 | 3.17 || 57 | 3.00 | 2.58 || 278 | 2.00 | 1.89
21 | 3.00 | 3.14 || 59 | 2.00 | 2.56 || 384 | 1.00 | 1.78
422 | 1.00 | 1.74

a = 5.6962, b = -0.1957, R² = 0.7082

Heng Chen, Junying Liang

Chinese Word Length Motif and Its Evolution

Abstract: In this paper, we compare the word length motifs of modern spoken Chinese and written Chinese, and attempt to make clear how they evolve in written Chinese. The synchronic studies show that the rank-frequency distributions can be fitted with the power law function y = ax^b, and that the length distributions can be fitted with the Hyper-Pascal function. As for the diachronic studies, evolutionary regularities are found. On the one hand, we found an increasing trend of parameter a and a decreasing trend of parameter b in the rank-frequency distribution function y = ax^b. On the other hand, although the motif length distributions of ancient Chinese cannot be fitted with the Hyper-Pascal function, the entropy analyses show a tendency for the distributions to become concentrated on certain motif patterns.

Keywords: word length; motif; evolution; spoken Chinese; written Chinese

1 Introduction Virtually all prior studies of word length are devoted to the problem of the frequency distribution of word length in texts, regardless of the syntagmatic dimension (Köhler, 2006). In recent years, word length sequences have gradually come into view, e.g., word length correlations (Kalimeri et al., 2012), word length repetitions (Altmann et al., 2009), word length entropies (Kalimeri et al., 2015; Papadimitriou et al., 2010; Ebeling & Pöschel, 1994), and, most recently, word length motifs (Köhler, 2006, 2015). The word length motif is a new syntagmatic approach to automatic text classification (Köhler & Naumann, 2008, 2010). Basically, word length sequences are inherited from the word sequence. Word sequence (or word order) has been used as an indicator of language typology (Jiang & Liu, 2015; Levinson et al., 2011; Liu, 2010). Köhler (2006) stated that word attributes may reflect the properties of the basic language units, i.e., words.

|| Heng Chen: Center for Linguistics & Applied Linguistics, Guangdong University of Foreign Studies, Guangzhou, China, [email protected] Junying Liang: Department of Linguistics, Zhejiang University, Hangzhou, China, [email protected]

Indeed, word length is one attribute that transmits a certain amount of word information. For example, Piantadosi et al. (2011) point out that word length is closely related to the information that words transmit; Garcia et al. (2012) state that longer words are more likely to be used to express abstract things. These studies suggest that word length sequences may also reveal regularities in languages. Although the motif is rather new in word length sequence studies, an increasing number of studies show that it is a very promising attempt to discover the syntagmatic relations of word lengths in texts (Milička, 2015; Köhler, 2006, 2008, 2015). In this study, we intend to explore whether word length motifs co-evolve with word length from a diachronic and dynamic point of view. Our hypothesis is that word length motifs co-evolve with word length (in written Chinese). In a recent study (Chen et al., 2015), we investigated how word length evolves based on the analysis of written texts of ancient Chinese within a time span of over 2000 years. In the present study, we will investigate the correlations between the evolution of word length and word length motifs. Besides, since motif frequency can be modelled by the rank-frequency distribution model y = ax^b, we will also investigate whether it co-evolves with word length. The rest of this paper is organized as follows. Part 2 describes the materials and methods used in this study; Part 3 gives the results of the synchronic and diachronic investigations, as well as some discussion. Part 4 concludes. To anticipate, this study may give us a much more in-depth understanding of word length motifs.

2 Materials and Methods This study includes both synchronic and diachronic investigations. On the one hand, in order to explore whether the word length motifs are sensitive to different language styles, we compare the rank-frequency distributions and length distributions of word length motifs between modern spoken and written Chinese. On the other hand, we investigate the evolution of word length motifs of written Chinese in the last 2000 years, with the purpose of finding out whether Chinese word length motifs co-evolve with word length.


2.1 Materials For the comparison between spoken and written Chinese, we built a dialogue text collection (spoken language) and a prose text collection (written language), each with 20 texts; the number of words in each text ranges from 726 to 3,792. The spoken texts come from a three-person TV talk show named Qiang Qiang San Ren Xing (Behind the Headlines) on Phoenix TV, from June to September 2013, 5 texts per month and 20 texts in total, in the form of daily conversation. This TV program mainly discusses current social issues. The written texts come from a well-known Chinese prose journal, Selective Prose (http://swsk.qikan.com), from June to September 2013, 5 texts per month and 20 texts in total. Since there are no natural boundaries between words in Chinese, word segmentation is needed before word length can be measured. It is generally held that word segmentation involves the definition of the word, which is a particularly difficult problem in Chinese. As this is not the issue we discuss here, in the present investigation we segment words with a unified standard. First, we used ICTCLAS (http://ictclas.nlpir.org/), one of the best Chinese word segmentation tools, to segment words automatically. Then we checked and corrected the errors manually. Tab. 1 and Tab. 2 show the number of characters and words in each text.

Tab. 1: Number of characters and words in modern spoken Chinese texts

Text  Character tokens  Word tokens    Text  Character tokens  Word tokens
S1    2168              1589           S11   5441              3792
S2    1561              1068           S12   5419              3783
S3    2520              1763           S13   5216              3592
S4    2245              1526           S14   5021              3444
S5    1373              941            S15   4959              3498
S6    1002              726            S16   5251              3609
S7    2287              1567           S17   5093              3571
S8    1306              883            S18   5127              3437
S9    2047              1445           S19   4848              3329
S10   1822              1278           S20   4668              3197

Tab. 2: Number of characters and words in modern written Chinese texts

Text  Character tokens  Word tokens    Text  Character tokens  Word tokens
W1    1920              1366           W11   1928              1368
W2    1309              952            W12   2655              1861
W3    2055              1490           W13   1423              948
W4    2394              1657           W14   2318              1779
W5    2014              1502           W15   1471              962
W6    1550              1119           W16   4128              2876
W7    1786              1269           W17   5143              3654
W8    1466              993            W18   5012              3512
W9    1830              1366           W19   4423              3057
W10   2693              1928           W20   4403              2953

As for the diachronic investigation, we use a collection of reliable written Chinese texts ranging from around 300 BC to the 21st century AD, which is divided into 6 time periods: 3rd–2nd century BC, 4th–5th century AD, 12th–13th century AD, 16th–17th century AD, early 20th century, and 21st century. The original scale of the whole text collection in each time period ranges from about ten thousand to two million characters. The details of the texts are shown in Tab. 21 in the Appendix. To have a reliable measurement of word length, we randomly selected a sample of ten thousand characters from the text collection of each time period. Moreover, to guarantee the impartiality of the results, the segmentation work was done by an expert in old Chinese.

2.2 Methods As defined by Köhler (2015), a length-motif is a continuous series of equal or increasing length values (e.g. of morphs, words or sentences). An example of Chinese word length motif segmentation is listed as follows. The Chinese sentence Chinese: 汉语 词长 动链 是 如何 演化 的? Pinyin: Hànyǔ Cícháng Dònɡliàn Shì Rúhé Yǎnhuà De? English (word-for-word translation): (Chinese) (word length) (motif) (is) (how) (evolve) (particle ‘de’)? English: How does Chinese word length motif evolve?

Chinese Word Length Motif and Its Evolution | 41

is, according to the above definition, represented by a sequence of 3 word length motifs: (2-2-2) (1-2-2) (1). It should be noticed that Chinese word length is measured in syllables in this study.
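The segmentation procedure just illustrated is mechanical and easy to automate. The following minimal Python sketch (our own illustration, not the scripts used in this study) splits a sequence of word lengths into length motifs according to the definition above:

```python
def length_motifs(lengths):
    """Split a sequence of word lengths into length motifs, i.e. maximal
    continuous runs of equal or increasing values (Koehler 2015)."""
    motifs, current = [], []
    for value in lengths:
        # A new motif starts whenever a value drops below its predecessor.
        if current and value < current[-1]:
            motifs.append(current)
            current = []
        current.append(value)
    if current:
        motifs.append(current)
    return motifs

# Word lengths (in syllables) of the example sentence above:
print(length_motifs([2, 2, 2, 1, 2, 2, 1]))
# [[2, 2, 2], [1, 2, 2], [1]]  ->  (2-2-2) (1-2-2) (1)
```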

3 Results and Discussions In the following sections, the results and discussions of our synchronic and diachronic investigations are given respectively.

3.1 Word Length Motifs in Spoken and Written Chinese In this section we investigate both rank-frequency and length distributions of word length motifs in spoken and written Chinese.

3.1.1 The rank-frequency distributions of word length motifs It should be noticed that since the longest word in Chinese does not exceed 9 syllables, we omit the "-" separator within word length motifs (e.g., "112" stands for 1-1-2). The rank-frequency distributions of word length motifs in Text S1 are shown in Tab. 22 in the Appendix. As we can see in Tab. 22, motifs "12" and "112" are the only two whose percentage exceeds 10%, and this is corroborated by the other 19 texts. The motif data of all 20 spoken texts are shown in Tab. 3. What is more, we found that "12", "112", "122" and "1112" are the most frequent motifs in the 20 spoken texts.

Tab. 3: The motif data of the 20 spoken texts

Text  Word length motif types  Word length motif tokens    Text  Word length motif types  Word length motif tokens
S1    96                       893                         S11   79                       834
S2    70                       939                         S12   91                       787
S3    90                       924                         S13   61                       418
S4    67                       343                         S14   57                       406
S5    52                       275                         S15   47                       242
S6    85                       869                         S16   41                       170
S7    72                       889                         S17   51                       407
S8    84                       860                         S18   54                       219
S9    89                       879                         S19   67                       346
S10   87                       903                         S20   54                       293

Next we fit the power law function y = ax^b to the rank-frequency distribution data of word length motifs in the 20 spoken texts. Fig. 1 is the fitting result of Text S1.

Fig. 1: Fitting the power law function y = ax^b to the rank-frequency distribution data in Text S1

The goodness of fit R² in Fig. 1 is 0.9859, which means that the relation can be captured by the power law. All 20 fitting results for the spoken texts can be seen in Tab. 4.
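The chapter does not specify the fitting procedure; one straightforward way to estimate a and b and compute R² is nonlinear least squares, sketched below with the frequencies of the ten highest-ranked motifs of Text S1 (Tab. 22 in the Appendix) as input:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b):
    # Rank-frequency model y = a * x^b (b is expected to be negative).
    return a * np.power(x, b)

# Frequencies of the ten highest-ranked motifs of Text S1 (Tab. 22).
ranks = np.arange(1, 11, dtype=float)
freqs = np.array([188, 86, 72, 58, 49, 37, 30, 20, 20, 20], dtype=float)

(a, b), _ = curve_fit(power_law, ranks, freqs, p0=(100.0, -1.0))
residuals = freqs - power_law(ranks, a, b)
r2 = 1 - residuals.dot(residuals) / np.sum((freqs - freqs.mean()) ** 2)
print(f"a = {a:.2f}, b = {b:.2f}, R2 = {r2:.4f}")
```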


Tab. 4: The rank-frequency fitting results of word length motifs in 20 spoken texts

Text  a          b         R²
S1    189.7287   -0.99279  0.98585
S2    221.63959  -0.99334  0.97604
S3    211.93943  -1.00448  0.97473
S4    66.3312    -0.91236  0.97498
S5    54.64877   -0.89137  0.96225
S6    185.60126  -0.97553  0.98363
S7    215.36072  -1.00972  0.97722
S8    181.78242  -0.96029  0.9713
S9    201.7334   -1.02898  0.98719
S10   207.03127  -1.01568  0.9882
S11   199.41935  -1.03769  0.98813
S12   174.24107  -1.01198  0.98529
S13   85.49793   -0.9069   0.9503
S14   84.08575   -0.91851  0.96472
S15   47.13713   -0.85896  0.93009
S16   40.99748   -0.99462  0.9887
S17   86.94427   -0.91944  0.95011
S18   48.2563    -0.99044  0.98131
S19   76.23932   -0.96282  0.95829
S20   59.18672   -0.9018   0.96449

It can be seen from Tab. 4 that all the fittings are successful. In the following, we turn our attention to the 20 written texts. The rank-frequency distribution data of Text W1 can be seen in Tab. 23 in the Appendix. As we can see in Tab. 23, the most frequent motifs in Text W1 are "12", "112", "122" and "1112", as in the remaining 19 written texts. The motif data of the 20 written texts are shown in Tab. 5.

Tab. 5: The motif data of the 20 written texts

Text  Motif types  Motif tokens    Text  Motif types  Motif tokens
W1    110          894             W11   86           854
W2    79           901             W12   59           299
W3    55           321             W13   71           458
W4    52           213             W14   94           746
W5    61           326             W15   60           340
W6    54           441             W16   67           442
W7    58           329             W17   48           256
W8    53           260             W18   55           308
W9    53           296             W19   82           762
W10   53           253             W20   48           255

Next we fit the power law function y = ax^b to the rank-frequency distribution data of word length motifs in the 20 written texts. Fig. 2 shows the fitting in Text W1.

Fig. 2: Fitting the power law function y = ax^b to the rank-frequency distribution data in Text W1


The fitting in Fig. 2 succeeded, and in the same way we fit the power law function to the remaining 19 written texts. Tab. 6 shows the fitting results.

Tab. 6: The rank-frequency fitting results of word length motifs in 20 written texts

Text  a          b         R²
W1    188.49475  -0.99192  0.98043
W2    217.48211  -1.03134  0.98509
W3    67.88936   -0.92708  0.96132
W4    36.90815   -0.82067  0.90456
W5    58.89495   -0.8638   0.96935
W6    118.78263  -1.09064  0.99293
W7    69.4951    -0.91219  0.93201
W8    56.51712   -0.95244  0.98106
W9    59.48396   -0.89846  0.96267
W10   62.1218    -1.02801  0.96833
W11   186.93144  -0.98059  0.97655
W12   70.68574   -1.02447  0.99064
W13   122.53303  -1.11158  0.9902
W14   164.1028   -0.99796  0.98057
W15   85.436     -1.05757  0.9849
W16   88.3765    -0.91278  0.96074
W17   54.30844   -0.91184  0.95147
W18   63.85772   -0.91261  0.95573
W19   162.48936  -0.95915  0.97757
W20   61.96259   -1.00867  0.98549

As can be seen in Tab. 6, the goodness of fit values of all 20 written texts are larger than 0.9, which means that all the fittings are successful. This indicates that word length motifs form self-organizing systems in texts. To compare the rank-frequency distributions of word length motifs in spoken and written Chinese texts, we conducted a Student's t-test. The results show that there is no significant difference in the values of either parameter a or parameter b (p = 0.117 and p = 0.799, respectively). Since there is no significant difference in the rank-frequency distributions of word length motifs between spoken and written texts, we continue by examining the length distributions of word length motifs in the two language styles.

3.1.2 Length distributions of word length motifs Since the preceding section shows that the word length motif is a self-regulating system, we hypothesize that the length distributions of word length motifs in spoken and written Chinese texts should also accord with laws. Tab. 7 shows the word length motif distribution data of Text S1. It should be noticed that the calculation of motif frequency in Tab. 7 is based on tokens.

Tab. 7: Word length motif distribution data in Text S1

Length  Frequency    Length  Frequency
1       19           10      9
2       222          11      8
3       180          12      5
4       150          13      3
5       116          14      4
6       71           15      1
7       52           17      2
8       31           18      1
9       19           -       -

Köhler & Naumann's (2008) study shows that the Hyper-Pascal model is adequate to describe word length distributions. Through theoretical deduction, Köhler (2006) states that the Hyper-Pascal model can also be used to describe word length motif distributions. Here, we fit this model to Text S1 as well as to all the other texts. First, we take a close look at the Hyper-Pascal function:

P(x) = [C(k+x-1, x) / C(m+x-1, x)] q^x P(0),   x = 0, 1, 2, …        (1)

where C(·, ·) denotes the binomial coefficient.
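As an aside, the probabilities in (1) can be generated numerically from the recurrence P(x+1)/P(x) = q(k+x)/(m+x), which follows directly from the binomial-coefficient form. A minimal Python sketch (our illustration, not the Altmann-Fitter used below; the support is taken here as 0, 1, 2, … and truncated at a maximal length for normalization):

```python
def hyper_pascal(k, m, q, x_max=30):
    """Hyper-Pascal probabilities via the recurrence
    P(x+1) = (k + x) / (m + x) * q * P(x), normalized over 0..x_max."""
    probs = [1.0]
    for x in range(x_max):
        probs.append(probs[-1] * (k + x) / (m + x) * q)
    total = sum(probs)
    return [p / total for p in probs]

# Parameters k, m, q fitted for Text S1 (see Tab. 8 below):
print(hyper_pascal(0.2428, 0.0137, 0.6570)[:5])
```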


There are three parameters in the function, i.e., k, m, q. We used the Altmann-Fitter to fit the function to all the texts. The fitting results for the 20 spoken texts are shown in Tab. 8.

Tab. 8: The results of fitting the Hyper-Pascal function to the length distributions of word length motifs in spoken texts

Text  k       m       q       X²       P(X²)   df  c       R²
S1    0.2428  0.0137  0.6570  11.1712  0.5965  13  0.0125  0.9903
S2    0.7401  0.0206  0.5502  13.5090  0.2614  11  0.0144  0.9930
S3    0.6881  0.0489  0.5530  10.7625  0.4634  11  0.0116  0.9921
S4    0.3882  0.0474  0.6651  10.9825  0.5304  12  0.0320  0.9808
S5    0.9136  0.0754  0.5226  8.9775   0.3442  8   0.0326  0.9921
S6    0.7008  0.0625  0.5656  14.0225  0.2318  11  0.0161  0.9916
S7    0.7826  0.0309  0.5275  8.9850   0.5335  10  0.0101  0.9903
S8    0.8527  0.0556  0.5638  12.5464  0.3240  11  0.0146  0.9886
S9    1.5327  0.3824  0.9681  24.8329  0.0156  12  0.0283  0.9875
S10   0.7205  0.0495  0.5339  13.0836  0.2190  10  0.0145  0.9917
S11   0.4199  0.0197  0.6009  15.3403  0.1674  11  0.0184  0.9929
S12   0.2748  0.0188  0.6394  13.4042  0.4171  13  0.0170  0.9948
S14   1.9039  0.1674  0.4235  8.8882   0.3518  8   0.0219  0.9797
S15   2.6097  0.1486  0.3807  9.7396   0.2038  7   0.0402  0.9589
S16   0.5152  0.0493  0.6316  6.3306   0.7064  9   0.0372  0.9789
S17   0.6749  0.0254  0.5415  7.8998   0.4433  8   0.0194  0.9769
S18   0.2959  0.0222  0.6328  4.2475   0.8944  9   0.0194  0.9941
S19   0.9328  0.0533  0.5526  14.8051  0.0964  9   0.0428  0.9573

We can see from Tab. 8 that only the fittings of Text S13 and Text S20 failed. As fittings can be affected by factors such as text size, authorship and time of writing, such failures are a normal phenomenon; since we cannot expect one function to fit all texts, exceptions are acceptable. Based on this, we conclude that the Hyper-Pascal function also fits the length distributions of word length motifs in spoken Chinese texts, which is consistent with Köhler's (2006) hypothesis. Next we investigate the fittings in written Chinese texts. Tab. 9 shows the data of Text W1; the frequency is also based on tokens.

Tab. 9: Word length motif distribution data in Text W1

Length  Frequency    Length  Frequency
1       29           10      19
2       228          11      7
3       193          12      12
4       147          13      5
5       96           14      2
6       61           16      1
7       43           17      3
8       29           18      1
9       17           19      1

Again we fit the Hyper-Pascal function to the 20 written texts, and the results are shown in Tab. 10.

Tab. 10: The results of fitting the Hyper-Pascal function to the length distributions of word length motifs in written texts

Text  k       m       q       X²       P(X²)   df  c       R²
W1    0.2826  0.0241  0.6489  14.6117  0.3322  13  0.0163  0.9970
W2    0.3324  0.0194  0.6231  11.1503  0.5161  12  0.0124  0.9945
W3    1.1316  0.0370  0.5224  3.3577   0.9484  9   0.0105  0.9859
W5    0.4702  0.0360  0.6572  13.0606  0.3647  12  0.0401  0.9713
W6    0.6266  0.0281  0.5277  13.9460  0.0832  8   0.0316  0.9890
W8    0.2527  0.0182  0.6571  6.7868   0.8161  11  0.0262  0.9801
W9    0.7193  0.0261  0.5788  6.7639   0.6617  9   0.0229  0.9764
W10   0.2416  0.0128  0.6286  10.2833  0.3280  9   0.0406  0.9760
W11   0.6904  0.0345  0.5724  11.2706  0.4209  11  0.0132  0.9953
W12   0.0315  0.0014  0.7189  9.3846   0.6698  12  0.0315  0.9866
W13   0.2206  0.0045  0.6440  21.3565  0.0299  11  0.0466  0.9776
W14   0.4266  0.0277  0.6118  8.3785   0.7549  12  0.0112  0.9912
W15   0.7190  0.0426  0.5644  17.9030  0.0363  9   0.0527  0.9500
W16   0.8888  0.0928  0.5753  10.7904  0.3741  10  0.0244  0.9812
W17   0.8318  0.0570  0.5175  11.0858  0.1969  8   0.0433  0.9741
W18   0.4518  0.0251  0.6775  10.1155  0.6845  13  0.0328  0.9876
W19   0.6406  0.0481  0.5616  10.9952  0.3579  10  0.0144  0.9935
W20   0.3463  0.0082  0.5796  5.6708   0.6841  8   0.0222  0.9871

As can be seen in Tab. 10, only 2 fittings failed, i.e., Text W4 and Text W7. Similarly, we conclude that the Hyper-Pascal function also fits written Chinese texts. Up to now, the length distributions of word length motifs in both spoken and written Chinese texts have been investigated. To test whether the parameters of the Hyper-Pascal function differ between the two language styles, we used statistical tests. First, KS tests show that both groups of data follow normal distributions. Then we conducted independent-samples t-tests on the groups of values of the parameters k, m and q. The t-tests show that there is a significant difference in parameter k (p = 0.047), but not in parameters m (p = 0.060) and q (p = 0.534). Köhler (2006) analyzed the significance of the parameters in the Hyper-Pascal function, and hypothesized that they are correlated with long words, short words, as well as mean word length. Chen et al. (2015) found that the increase of word length is an essential regularity in Chinese word evolution, which causes a series of changes in the self-organizing lexical system, especially in the word length distribution. Since the word length motif is correlated with word length distributions, we investigate how the word length motif evolves in the following section.
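A sketch of this testing procedure in Python (using, for brevity, only the first five k values of Tab. 8 and Tab. 10; the study itself tested the full parameter groups):

```python
import numpy as np
from scipy import stats

k_spoken  = np.array([0.2428, 0.7401, 0.6881, 0.3882, 0.9136])  # Tab. 8
k_written = np.array([0.2826, 0.3324, 1.1316, 0.4702, 0.6266])  # Tab. 10

# KS normality checks, then an independent-samples t-test on parameter k.
print(stats.kstest(k_spoken, "norm", args=(k_spoken.mean(), k_spoken.std())))
print(stats.kstest(k_written, "norm", args=(k_written.mean(), k_written.std())))
t, p = stats.ttest_ind(k_spoken, k_written)
print(f"t = {t:.3f}, p = {p:.3f}")
```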


Fig. 3: Fitting the power law function y = ax^b to the rank-frequency distribution of word length motifs in time period 1

3.2 The Evolution of Chinese Word Length Motifs Since Chinese word length increases with time (Chen et al., 2015), we hypothesize that the word length motifs may also evolve with time.

3.2.1 Evolution of rank-frequency distributions Firstly, we obtained the rank-frequency distribution data for the six time periods. Then we fitted the power law function y = ax^b to all the data. Fig. 3 shows the fitting in time period 1. The fitting results in period 1 as well as in the other time periods can be seen in Tab. 11.


Tab. 11: The results of fitting y = ax^b to the rank-frequency distributions of word length motifs in 6 historical time periods

Time period  a         b        R²
1            203.7642  -0.8373  0.8661
2            253.9157  -0.9119  0.9610
3            278.6423  -0.8991  0.9084
4            298.9862  -0.9567  0.9792
5            377.545   -1.0059  0.9725
6            375.6220  -1.0035  0.9843

As can be seen in Tab. 11, the values of the goodness of fit R² are all larger than 0.8, which means that the fittings succeeded. As for the parameters, we can see an increasing trend in parameter a and a decreasing trend in parameter b. In order to see whether this regularity is correlated with the case of word length, we also fit the power law function y = ax^b to the rank-frequency distributions of word length in the 6 historical time periods; the results can be seen in Tab. 12.

Tab. 12: The results of fitting y = ax^b to the rank-frequency distributions of word length in 6 historical time periods

Time period  a       b        R²
1            0.7176  -2.6975  0.99186
2            0.5704  -1.9977  0.94791
3            0.5013  -1.7255  0.89313
4            0.4828  -1.6451  0.84742
5            0.5053  -1.7302  0.88671
6            0.4555  -1.5749  0.85701

As can be seen from Tab. 12, contrary to the case in Tab. 11, there is a decreasing trend in parameter a and an increasing trend in parameter b, which corroborates that word length motifs and word length are highly interrelated (the Pearson correlation coefficient is -0.828, p = 0.042). Besides, we also fit the power law function y = ax^b to the rank-frequency distributions of word frequencies in the 6 historical time periods, the results of which can be seen in Tab. 13.

Tab. 13: The results of fitting y = ax^b to the rank-frequency distributions of word frequencies in 6 historical time periods

Time period  a         b        R²
1            544.9705  -0.7547  0.9597
2            342.2418  -0.7334  0.9819
3            250.6101  -0.6427  0.8867
4            245.7847  -0.6578  0.9189
5            434.8907  -0.7965  0.9911
6            403.5935  -0.7946  0.9904

In Tab. 13, we cannot see significant increasing or decreasing trends in the changes of parameters a and b, which means that there are no significant correlations between word length motifs and word frequencies in this sense.

3.2.2 Evolution of length distributions of word length motifs Word length motif distributions, i.e., the length distributions of word length motifs, show the frequencies of motifs (here based on tokens) of different lengths, ranging from the shortest to the longest motifs. Tab. 14 shows the data in period 1.

Tab. 14: Length distributions of word length motifs in period 1

Length  Frequency    Length  Frequency
1       3            21      6
2       167          22      7
3       185          23      5
4       166          24      6
5       124          25      3
6       103          26      3
7       72           27      1
8       59           28      1
9       46           30      2
10      53           31      3
11      25           33      1
12      24           34      6
13      17           38      1
14      21           39      1
15      13           40      1
16      12           42      1
17      14           45      2
18      11           55      1
19      14           56      1
20      5            77      1

As can be seen in Tab. 14, the most frequent motif length is 3, and there are 40 different lengths of word length motifs. We fitted the Hyper-Pascal function to the data in Tab. 14, but the fitting failed. We then used the Altmann-Fitter and likewise failed to find an appropriate model.

Tab. 15: Length distributions of word length motifs in period 2

Length  Frequency    Length  Frequency
1       18           16      6
2       185          17      12
3       306          18      1
4       193          19      3
5       163          20      6
6       138          21      2
7       97           22      2
8       70           24      1
9       63           25      1
10      40           26      1
11      27           27      1
12      15           29      2
13      11           30      1
14      12           31      1
15      13           38      1

It can be seen from Tab. 15 that the most frequent length is 3, which is consistent with the case in period 1. However, there are fewer different motif lengths (30 in period 2). Once again, we tried to find a model with the help of the Altmann-Fitter but failed.

Tab. 16: Length distributions of word length motifs in period 3

Length  Frequency    Length  Frequency
1       25           11      24
2       253          12      17
3       323          13      20
4       235          14      8
5       208          15      6
6       142          16      5
7       107          19      1
8       65           20      1
9       41           21      3
10      33           -       -

In period 3, the most frequent motif length is also 3, the same as in the former two periods. What is more, the number of different motif lengths is also decreasing (19 different lengths). No model fits in this period either. Let us see the case of period 4, the data of which can be seen in Tab. 17.

Tab. 17: Length distributions of word length motifs in period 4

Length  Frequency    Length  Frequency
1       27           11      21
2       192          12      13
3       381          13      19
4       245          14      7
5       221          15      2
6       149          16      3
7       88           17      5
8       64           20      1
9       52           21      1
10      19           -       -

In Tab. 17, we can see that length 3 has the most motif tokens, as in the former 3 periods. Moreover, period 4 also shows fewer lengths (only 19). The fittings failed in this period as well. The above four periods all belong to ancient Chinese. In the following, we investigate the cases in modern Chinese. Tab. 18 shows the data in period 5, i.e., the early 20th century.

Tab. 18: Length distributions of word length motifs in period 5

Length  Frequency    Length  Frequency
1       20           12      13
2       381          13      13
3       367          14      4
4       292          15      8
5       172          16      2
6       129          17      1
7       99           19      1
8       62           20      2
9       35           21      1
10      33           24      1
11      17           28      1

As can be seen from Tab. 18, the most frequent motif length is 2, which is different from the former 4 time periods. As for the motif length types, although there is a slight increase, the types are still relatively few (only 22). Next we used the Altmann-Fitter to fit the data and found that the Hyper-Pascal function is the only fitting model, with k = 0.4558, m = 0.0159, q = 0.6422; X² = 25.9730, P(X²) = 0.0544, DF = 16, C = 0.0157, R² = 0.9935. The fitting can be seen in Fig. 4.


Fig. 4: Fitting the Hyper-Pascal function to the word length motif distribution data in time period 5

Finally it comes to time period 6, i.e., the 21st century. The word length motif distribution data are shown in Tab. 19.

Tab. 19: Length distributions of word length motifs in period 6

Length  Frequency    Length  Frequency
1       42           10      21
2       441          11      14
3       397          12      6
4       288          13      6
5       216          14      7
6       135          15      3
7       72           16      1
8       51           19      1
9       32           -       -

It can be seen from Tab. 19 that the most frequent motif length is 2, which is the same as the case in time period 5 and different from ancient Chinese. In time period 6, there are only 17 different motif lengths.


Then we fit the Hyper-Pascal function to the data in Tab. 19, which can be seen in Fig. 5.

Fig. 5: Fitting the Hyper-Pascal function to the word length motif distribution data in time period 6

We found that the Hyper-Pascal function is again the only fitting model, with k = 0.4558, m = 0.0159, q = 0.6422; X² = 25.9730, P(X²) = 0.0544, DF = 16, C = 0.0157, R² = 0.9935. Up to now, all the word length motif distribution data in the 6 historical time periods have been investigated. We find that the Hyper-Pascal model only fits modern Chinese, but not ancient Chinese. In view of this, we turn to some more general statistical methods to seek rules. Here we use entropy to describe the evolution of word length motif distributions. Generally speaking, large entropy means much information. As for the entropy of distributions of language properties, Popescu et al. (2009) find that the entropy (H) of language entities is closely related to the number of entity types (V), a relation which can be fitted by the power law function H = aV^b; larger entropies therefore indicate a richer lexicon. From a mathematical point of view, the values of entropy range from 0 to log_e V. When the entropy approaches 0, the distribution concentrates on a few entity types, and the information content is low; when the entropy approaches log_e V, the distribution is uniform, and the information content is high.
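For concreteness, the entropy of a motif length distribution can be computed directly from the frequency columns of the tables above; the following sketch, using natural logarithms, reproduces the period-6 value of Tab. 20 below up to rounding:

```python
import math

def entropy(freqs):
    """Shannon entropy (natural logarithm) of a frequency distribution;
    its value ranges from 0 to log_e V, V being the number of types."""
    n = sum(freqs)
    return -sum(f / n * math.log(f / n) for f in freqs if f > 0)

# Length distribution of word length motifs in period 6 (Tab. 19):
period6 = [42, 441, 397, 288, 216, 135, 72, 51, 32, 21, 14, 6, 6, 7, 3, 1, 1]
print(round(entropy(period6), 4))   # ~ 2.0158, as in Tab. 20
```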

Based on the distribution data of word length motifs in the 6 historical time periods, we calculate the entropies, as can be seen in Tab. 20.

Tab. 20: Entropies of word length motif distributions in 6 time periods

Time period  Entropy
1            2.7073
2            2.4089
3            2.2449
4            2.1873
5            2.1197
6            2.0158

It can be seen from Tab. 20 that there is a decreasing trend in the values of the entropies. Therefore, although the Hyper-Pascal model does not fit ancient Chinese, the decrease of entropy values can be seen as a rule in the evolution of word length motifs. As stated above, when the entropy decreases towards 0, as in this case, the distributions concentrate on certain entity types and the information content is low. The decrease of entropy in Tab. 20 indicates that, as time went on, Chinese word length series increasingly developed into certain patterns of word length motifs.

4 Conclusion In this paper, we investigate word length motifs in spoken and written Chinese, and try to make clear how they evolve in written Chinese. Firstly, we carry out synchronic investigations of word length motifs in contemporary Chinese. The rank-frequency distributions of word length motifs in both spoken and written Chinese can be modeled by the power law function y = ax^b, which indicates that Chinese word length motifs also form self-organizing systems. As for the fitting parameters, however, the Student's t-test results show that there is no significant difference in the rank-frequency distributions between spoken and written Chinese with respect to word length motifs. Secondly, both modern spoken and written Chinese motif length distributions can be modeled by the Hyper-Pascal function deduced by Köhler (2006).


Moreover, the t-test results show that there is a significant difference in parameter k, but not in parameters m and q. Thirdly, as for the word length motifs of written Chinese in the last 2000 years, the rank-frequency distribution data in all 6 time periods can be fitted with the power law function y = ax^b. As for the parameters, we can see an increasing trend of parameter a and a decreasing trend of parameter b. The results show that word length and word length motifs truly co-evolve. Fourthly, different from the rank-frequency distribution, the evolution of motif length distributions is rather complicated, since only the modern Chinese (time periods 5 and 6) word length motif distributions can be fitted with the Hyper-Pascal function. A deeper entropy analysis shows that the decrease of entropy is a tendency in motif length distributions: they become more and more concentrated on certain motif patterns. Last but not least, further explorations are still needed: why the length distributions of ancient Chinese word length motifs cannot be fitted with the Hyper-Pascal function, and whether this problem is correlated with the decrease of word length motif types.

Acknowledgement This work was supported by the National Social Science Foundation of China under Grant No. 11&ZD188.

References

Altmann, E. G., Pierrehumbert, J. B., & Motter, A. E. (2009). Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words. PLoS One, 4(11), E7678. doi:10.1371/journal.pone.0007678
Chen, H., Liang, J., & Liu, H. (2015). How does word length evolve in written Chinese? PLoS One, 10(9), E0138567. doi:10.1371/journal.pone.0138567
Ebeling, W., & Pöschel, T. (1994). Entropy and long-range correlations in literary English. EPL (Europhysics Letters), 26(4), 241–246.
Garcia, D., Garas, A., & Schweitzer, F. (2012). Positive words carry less information than negative words. EPJ Data Science, 1(1), 1–12.
Jiang, J., & Liu, H. (2015). The effects of sentence length on dependency distance, dependency direction and the implications: Based on a parallel English-Chinese dependency treebank. Language Sciences, 50, 93–104.
Kalimeri, M., Constantoudis, V., Papadimitriou, C., Karamanos, K., Diakonos, F. K., & Papageorgiou, H. (2012). Entropy analysis of word-length series of natural language texts: Effects of text language and genre. International Journal of Bifurcation and Chaos, 22(9). doi:10.1142/S0218127412502239
Kalimeri, M., Constantoudis, V., Papadimitriou, C., Karamanos, K., Diakonos, F. K., & Papageorgiou, H. (2015). Word-length entropies and correlations of natural language written texts. Journal of Quantitative Linguistics, 22(2), 101–118. doi:10.1080/09296174.2014.1001636
Köhler, R. (2006). The frequency distribution of the lengths of length sequences. In J. Genzor & M. Bucková (Eds.), Favete linguis. Studies in honour of Viktor Krupa (pp. 145–152). Bratislava: Slovak Academic Press.
Köhler, R. (2008). Word length in text. A study in the syntagmatic dimension. In S. Mislovičová (Ed.), Jazyk a jazykoveda v pohybe (pp. 416–421). Bratislava: Veda.
Köhler, R. (2015). Linguistic motifs. In G. K. Mikros & J. Mačutek (Eds.), Sequences in Language and Text (pp. 89–108). Berlin/Boston: De Gruyter Mouton.
Köhler, R., & Naumann, S. (2008). Quantitative text analysis using L-, F- and T-segments. In B. Preisach & D. Schmidt-Thieme (Eds.), Data Analysis, Machine Learning and Applications. Proceedings of the Jahrestagung der Deutschen Gesellschaft für Klassifikation 2007 in Freiburg (pp. 637–646). Berlin/Heidelberg: Springer.
Köhler, R., & Naumann, S. (2010). A syntagmatic approach to automatic text classification. Statistical properties of F- and L-motifs as text characteristics. In P. Grzybek, E. Kelih & J. Mačutek (Eds.), Text and Language. Structures – Functions – Interrelations – Quantitative Perspectives (pp. 81–89). Wien: Praesens.
Levinson, S. C., Greenhill, S. J., Dunn, M., & Gray, R. D. (2011). Evolved structure of language shows lineage-specific trends in word-order universals. Nature, 473(7345), 79–82. doi:10.1038/nature09923
Liu, H. (2010). Dependency direction as a means of word-order typology: A method based on dependency treebanks. Lingua, 120(6), 1567–1578.
Mačutek, J., & Mikros, G. K. (2015). Menzerath-Altmann law for word length motifs. In G. K. Mikros & J. Mačutek (Eds.), Sequences in Language and Text. Berlin/Boston: De Gruyter Mouton.
Milička, J. (2015). Is the distribution of L-motifs inherited from the word length distribution? In G. K. Mikros & J. Mačutek (Eds.), Sequences in Language and Text (pp. 133–146). Berlin/Boston: De Gruyter Mouton.
Papadimitriou, C., Karamanos, K., Diakonos, F. K., Constantoudis, V., & Papageorgiou, H. (2010). Entropy analysis of natural language written texts. Physica A: Statistical Mechanics and its Applications, 389(16), 3260–3266. doi:10.1016/j.physa.2010.03.038
Piantadosi, S., Tily, H., & Gibson, E. (2011). Word lengths are optimized for efficient communication. PNAS, 108(9), 3526–3529.
Popescu, I.-I., Altmann, G., Grzybek, P., Jayaram, B. D., Köhler, R., Krupa, V., Mačutek, J., Pustet, R., Uhlířová, L., & Vidya, M. N. (2009). Word frequency studies. Berlin/New York: Mouton de Gruyter.


Appendix

Tab. 21: Diachronic corpus details

Time period  Time span               Works                                                                 Scale (characters)
1            3rd–2nd century B.C.    Mèngzǐ (Mencius); Lǚshì Chūnqiū (Mister Lv's Spring and Autumn Annals)   141864
2            4th–5th century A.D.    Shìshuō Xīnyǔ (A New Account of the Tales of the World); Yánshì Jiāxùn Shū (Mister Yan's Family Motto)   94729
3            12th–13th century A.D.  Niǎn Yùguānyīn (Grinding the Jade Goddess of Mercy); Cuòzhǎn Cuīníng (Wrongfully Accused of Ying Ning); Jiǎntiē Héshang (A Letter from a Monk)   11220
4            16th–17th century A.D.  Shíèrlóu (Twelve Floors); Wúshēngxì (A Silence Play)                      233430
5            Pre-20th century A.D.   Nàhǎn (Yelling); Pánghuáng (Hesitating)                                   91705
6            21st century A.D.       Xīndàofózhī (The Buddha Knows Your Mind); Huíménlǐ (A Wedding Present)    12980

Tab. 22: The rank-frequency distributions of word length motifs in Text S1

Rank  Word length motif  Frequency    Rank  Word length motif   Frequency
1     12                 188          49    11111111111122      2
2     112                86           50    111111111122        2
3     122                72           51    111114              2
4     1112               58           52    111222222           2
5     11112              49           53    1224                1
6     1222               37           54    12222222            1
7     1122               30           55    11111222            1
8     11122              20           56    11111123            1
9     11222              20           57    111111111111122     1
10    111122             20           58    111111111111111112  1
11    13                 20           59    11222222            1
12    111112             20           60    24                  1
13    1111112            18           61    114                 1
14    2                  18           62    1111123             1
15    12222              14           63    1222223             1
16    22                 12           64    1111113             1
17    11111122           11           65    133                 1
18    11111112           11           66    111112223           1
19    1111122            11           67    11111111112222222   1
20    1113               10           68    11111111111111222   1
21    112222             7            69    1133                1
22    1111222            7            70    1111111111122       1
23    113                7            71    223                 1
24    111222             7            72    111112222           1
25    123                6            73    1234                1
26    122222             6            74    111122222222        1
27    1123               5            75    11111111122222      1
28    11113              5            76    1111111112222       1
29    222                5            77    222222              1
30    11123              5            78    11112223            1
31    111111112          5            79    2222                1
32    111111122          4            80    124                 1
33    1111111112         4            81    14                  1
34    1223               4            82    112222222           1
35    111113             4            83    112223              1
36    111111222          4            84    1122223             1
37    1122222            4            85    11111114            1
38    1111223            3            86    1111111111112       1
39    11112222           3            87    11114               1
40    1112222            3            88    111111111112        1
41    11111111122        3            89    3                   1
42    11111111112        3            90    111111124           1
43    1124               2            91    1111111122          1
44    1222222            2            92    22222222222         1
45    122223             2            93    111123              1
46    1111111113         2            94    11111111111112      1
47    1111112222         2            95    11111111222         1
48    11223              2            96    111111111222        1

Tab. 23: The rank-frequency distributions of word length motifs in Text W1

Rank  Word length motif  Frequency    Rank  Word length motif    Frequency
1     12                 181          56    1111111113           2
2     112                99           57    1111111122           2
3     122                75           58    11111113             2
4     1112               70           59    1112222              2
5     11112              34           60    124                  2
6     1122               30           61    1223                 2
7     1222               27           62    11111111112          2
8     2                  25           63    111111112222         2
9     111112             23           64    111111114            2
10    1111112            19           65    1133                 2
11    13                 18           66    111111122            2
12    14                 18           67    11111111111112       1
13    11122              18           68    11112223             1
14    111122             14           69    1234                 1
15    12222              12           70    12224                1
16    11222              11           71    11111111114          1
17    11111112           9            72    23                   1
18    1111122            8            73    111112223            1
19    22                 8            74    11111111111111222    1
20    111222             7            75    111122222222         1
21    112222             7            76    11124                1
22    1113               6            77    1222222222           1
23    113                5            78    1114                 1
24    1111111112         5            79    11111123             1
25    1111112222         5            80    1111113              1
26    11223              5            81    1111111111113        1
27    11111122           5            82    11111111222          1
28    11114              5            83    1222223              1
29    111111222          5            84    1124                 1
30    11113              5            85    122223               1
31    11112222           4            86    11111223             1
32    3                  4            87    111122222            1
33    114                4            88    1111111111222        1
34    111111111112       4            89    11111244             1
35    122222             4            90    112224               1
36    111111111122       3            91    111113               1
37    11111222           3            92    11122222             1
38    111111112          3            93    222                  1
39    1111223            3            94    1222222              1
40    111112222          3            95    111111111222         1
41    1123               3            96    11111111111123       1
42    1111111222         3            97    1122222222           1
43    11111111122        3            98    11144                1
44    1111222            3            99    1111111111111122     1
45    133                2            100   111111122222         1
46    12223              2            101   111111111111122222   1
47    1122222            2            102   1111114              1
48    24                 2            103   11222222             1
49    1111123            2            104   1134                 1
50    2222               2            105   1111111112222        1
51    1111111111122      2            106   1224                 1
52    223                2            107   1111111111111111112  1
53    111114             2            108   111123               1
54    123                2            109   11123                1
55    11111111112222222  2            110   134                  1

Ruina Chen

Quantitative Text Classification Based on POS-motifs

Abstract: In the present study, we show that POS (part-of-speech) sequences in the syntagmatic dimension can also be used to distinguish certain text types. Employing vocabulary-independent properties, POS-motifs (unrepeated sequences of POS tags within sentence boundaries), we classify five different text types in Chinese and English, with text samples from the Lancaster Corpus of Mandarin Chinese (LCMC) and the Freiburg-Brown corpus of American English (Frown). The datasets are evaluated by six quantitative indices indicating the variation or richness of POS sequences, namely TTR, Hapax percentage, R1, Entropy, RR and Gini's coefficient. The results of discriminant analysis, decision trees and random forests support the conclusion that the richness of POS-motifs may function as an acceptable indicator for classifying some text types in both Chinese and English, especially for distinguishing texts along the narrative vs. expository dichotomy.

Keywords: POS-motifs, text classification, decision tree, random forests

1 Introduction Most approaches to text classification are based on paradigmatic information, that is, they apply a "bag-of-words" model or treat "language in the mass" (Herdan 1966) to obtain text features for classification. In the field of corpus linguistics, Biber (1988, 1995) and his followers (Conrad and Biber 2001; Grieve et al. 2010; Xiao 2009) use the Multi-Feature/Multi-Dimension (MF/MD) method to obtain a classification of texts in various registers; in the shared field of computational linguistics and natural language processing (Jurafsky and Martin, 2009; Manning and Schütze, 1999), documents are represented and then classified or clustered by term weight vectors based on word frequency information. Researchers in quantitative linguistics have tried to contribute to text

|| Ruina Chen: College of Foreign languages, Guizhou University, Guiyang, China, [email protected]

classification on the basis of, e.g., word length and sentence length distributions, looking for statistical properties of these quantities that could be typical of text genres or text sorts in general (Köhler & Naumann 2008; Popescu et al. 2013). Similar investigations are also emerging in stylometrics (Hoover 2002, 2003a, 2003b, 2007) and, in recent years, in forensic linguistics (Finegan 2010; Kredens & Coulthard 2012), where, in the first place, word frequency distributions play a crucial role. POS can be used as a distinctive feature of texts, especially for determining Chinese quantitative stylistic features, which mainly adopts a "bag-of-POS" method, as in the study of Hou and Jiang (2014). In the current study, we adhere to the syntagmatic properties of texts for classification. A unit for sequential analyses, the motif, is introduced into linguistics. The motif is based on sequences of monotonously increasing values of selected linguistic properties and thus contrasts with the commonly applied bag-of-words model. A POS-motif is defined as a POS tag segment which, beginning with the first POS of a word in the given text and without extending beyond the sentence boundaries, consists of POS tags none of which repeats a tag already contained in the segment. As soon as a POS tag is identical to one that has already occurred in the current motif, the end of that POS-motif is reached. Thus, the POS-tag fragment (1) will be segmented as shown by the POS-tag sequence (2): (1) "或许_d , 严同_nr 已_d 在_p 这_rzv 忙碌_a 中_f ,开始_v 算清_v 了_ule 为_p 那_rzv 九十万_m 斤_q 粮_n 票_n 他_rr 应_v 付出_v 的_ude1 巨大_a 代价_n; " (2) {d+nr},{d+p+rzv+a+f+v},{v+ule+p+rzv+m+q+n},{n+rr+v},{v+ude1+a+n}. The first POS-motif consists of two elements because in the following motif there is a repetition of its first element; the second motif likewise ends where one of its elements would occur again. For the formation of POS-motifs in English, lemmatization of word forms is conducted first. This is because we tend to make a better comparison in both


Chinese and English of the force of POS-motifs in classifying equivalent text types.¹ Thus, the POS-motifs of the English segment (3) will be as given in (4): (3) "Despite_II intense_JJ White_NP1 House_NN1 lobbying_NN1, Congress_NN1 has_VHZ voted_VVN to_TO override_VVI the_AT veto_NN1 of_IO a_AT1 cable_NN1 television_NN1 regulation_NN1 bill_NN1, dealing_VVG President_NNB Bush_NP1 the_AT first_MD veto_NN1 defeat_NN1 of_IO his_APPGE presidency_NN1 just_RR four_MC weeks_NNT2 before_II the_AT election_NN1." (4) {II+JJ+NP1+NN},{NN},{NN+VH+VV+TO+VV+AT},{NN+IO+AT1},{NN},{NN},{NN},{NN+VV+NN+NP+AT+MD},{NN},{NN+IO+APPGE},{NN+RR+MC+NNT+II+AT},{NN}. These motifs can be measured by the types of POS tags embedded within them, that is, by POS-motif richness. Thus, a frequency distribution of x-POS-motifs in each sample text can be obtained, on the basis of which quantitative indices can be derived for further analysis. The computation of motifs has been applied to various linguistic entities, such as word lengths and word frequencies (Köhler 2008a, 2008b; Köhler & Naumann 2010; Sanada 2010), the lengths of the motifs per se (Köhler 2006), and the quantitative properties of motifs of RST (Rhetorical Structure Theory) relations (Beliankou et al. 2013). The goal of the current research is to examine, firstly, to what degree a classification of text types can be achieved on the basis of POS-motifs, or, to put it differently, to what degree POS-motifs may contribute to a classification of texts; and secondly, whether POS-motifs can contribute to distinguishing text types in both Chinese and English. In Section 2, the language materials, the basic statistical data of POS-motifs in the research corpus, the instruments adopted to process the language materials and the six quantitative indices employed are elaborated; in Section 3, our empirical results are presented, concerning the classification results from

|| 1 The typological differences between Chinese and English mean that the derived POS-motifs of different text types in the two languages are not directly comparable on an equal standard; English, unlike Chinese, marks tense and aspect on verbs and singular and plural forms on nouns. Thus, lemmatization of English words, and the abstraction of their POS tags as well, are deemed necessary. After that, all words and their POS tags were manually checked in order to ensure accuracy.

discriminant analysis, decision trees, and random forests in LCMC and Frown. Finally, the results are summarized.
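The formation of POS-motifs is stated below to be done by a Python program; the program itself is not reproduced in this chapter, but the rule illustrated by the Chinese example (1)-(2) above, where a motif closes as soon as one of its tags would recur, can be sketched as follows:

```python
def pos_motifs(tags):
    """Split one sentence's POS tag sequence into POS-motifs: a motif ends
    as soon as a tag already contained in it occurs again (motifs never
    cross sentence boundaries, so the function is applied per sentence)."""
    motifs, current = [], []
    for tag in tags:
        if tag in current:            # repetition closes the current motif
            motifs.append(current)
            current = []
        current.append(tag)
    if current:
        motifs.append(current)
    return motifs

# POS tags of the Chinese example (1):
tags = ["d", "nr", "d", "p", "rzv", "a", "f", "v", "v", "ule", "p", "rzv",
        "m", "q", "n", "n", "rr", "v", "v", "ude1", "a", "n"]
print(pos_motifs(tags))
# [['d','nr'], ['d','p','rzv','a','f','v'], ['v','ule','p','rzv','m','q','n'],
#  ['n','rr','v'], ['v','ude1','a','n']]   -- i.e. segmentation (2)
```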

2 The Corpora, Methods and Quantitative Indices 2.1 The Research Corpora The Lancaster Corpus of Mandarin Chinese (further LCMC)² and the Freiburg-Brown corpus of American English (further Frown) were purposely chosen to "conduct contrastive research between Chinese and English" (McEnery & Xiao 2004). Both are one-million-word balanced corpora, comparable in size, and represent the language of the early 1990s. Both contain 500 texts of around 2,000 words each; they comprise 15 text categories, falling into 4 macro-domains: press, non-fiction, academic and fiction. The compositions of the two corpora are similar except for two differences. One is the text type "adventure fiction", which is martial arts fiction in LCMC (since Chinese has no western fiction) but western and adventure fiction in Frown; the other is that word segmentation and part-of-speech (POS) annotation are performed in LCMC, but only POS annotation in Frown. The text types of both corpora are presented in Table 1.

Tab. 1: Text types in LCMC and Frown

Domain       Text Type                       Samples
news         press reportage (A)             44
             press editorial (B)             27
             press reviews (C)               17
non-fiction  religious writing (D)           17
             instructional writing (E)       38
             popular lore (F)                44
             biographies/essays (G)          77
             reports/official documents (H)  30
academic     academic prose (J)              80
fiction      general fiction (K)             29
             mystery/detective fiction (L)   24
             science fiction (M)             6
             adventure fiction (N)           29
             romantic fiction (P)            29
             humor (R)                       9
             Total                           500

|| 2 This corpus is available at http://www.lancaster.ac.uk/fass/projects/corpus/LCMC/

Our pilot study found that the classification of the sub-domains of news and fiction with POS-motifs was quite unsatisfactory, so we do not use all 15 text types of the two corpora in the present study. We finally settled on 5 text types for more in-depth comparison and discussion: press reportage (further news), biographies/essays (further essays), reports/official documents (further official), academic prose (further academic), and general fiction (further fiction). Admittedly, they do not cover all text types in either Chinese or English, but they are representative and considered adequate for the purpose of demonstrating the current research methodology. In the current study, the formation of POS-motifs in all sample texts is done by a Python program; the rank-frequency information of POS-motifs, as well as the calculation of the six indices, is obtained with the software QUITA (Kubát, Matlach and Čech 2014). Our discriminant analysis, decision tree and random forest analyses are conducted with R programs. The statistical data of POS-motifs in the five text types are presented in Table 2. Altogether there are 134,228 and 154,400 POS-motif tokens in Chinese and English respectively. On average, a sentence contains 5.86 POS-motifs in Chinese and 5.67 POS-motifs in English. Though the types and tokens of POS-motifs in Chinese are fewer than those in English, the TTR of POS-motifs in Chinese is still a bit higher than in English. POS-motifs per sentence in the different text types are higher in Chinese than in English, except for the text type of essays.

Tab. 2: The basic statistical data of POS-motifs in the research corpus

               Types             Tokens              TTR           POS-motifs per sentence
Text Type      CN      EN        CN       EN         CN    EN      CN     EN
news (A)       14,343  16,279    21,891   27,027     0.66  0.60    5.50   5.30
essays (G)     25,622  27,443    35,833   42,509     0.72  0.65    4.83   6.08
official (H)   8,062   9,614     19,168   19,939     0.42  0.48    10.18  6.70
academic (J)   24,882  26,481    44,044   48,368     0.57  0.55    6.97   6.39
fiction (K)    9,637   11,148    13,292   16,557     0.73  0.67    4.05   3.58
Total/Average  82,546  90,965    134,228  154,400    0.61  0.59    5.86   5.67

2.2 Quantitative Indices Measuring POS-motif Richness In quantitative linguistics, TTR (type-token ratio, V/N), Hapax percentage, R1, Entropy, RR and Gini's coefficient are all measures of lexical richness; the first four indices correlate positively with lexical richness, while the latter two correlate negatively. These indices are used in the current study for measuring the richness of POS-motifs. As TTR and Hapax percentage are the most common statistics, we will elaborate only on the latter four indices. R1 is a quite different approach to lexical richness which considers the h-point. Words with ranks smaller than h are mostly auxiliaries and synsemantics which occur quite frequently but do not contribute to richness. Richness is produced rather by autosemantics, which occur more seldom. Popescu et al. (2009:29) take into account the fixed point h and consider all words whose frequency is smaller than h as contributors to richness. To obtain a comparable

indicator, we first define the cumulative probability up to h as F([h]), the sum of the relative frequencies of the words whose ranks are smaller than or equal to h:

F([h]) = F(r ≤ h) = (1/N) Σ_{r=1}^{[h]} f_r .        (1)


Then a slight correction to F([h]) is conducted:

F'([h]) = F([h]) − h²/2N ,        (2)

i.e., the subtraction of the quantity h²/2N (half of the square of the h-point) from F([h]). Based on these conditions, R1 is defined as

R1 = 1 − [F([h]) − h²/2N] .        (3)

Higher R1 indicates greater POS-motif richness. Entropy in our analysis is adopted as proposed by C. Shannon and applied in linguistics to show the diversity (uncertainty of information) and the concentration of the distribution. It is defined as

H = − Σ_{r=1}^{V} p_r log₂ p_r ,        (4)

where p_r is the relative frequency of one word in a sample (that is, the proportion of the word frequency f_r), and V is the total number of word types. The entropy value varies in the interval

H ∈ ⟨0, log₂ V⟩ ;        (5)

if the entropy is zero, all frequencies are concentrated on one entity,

H = − Σ_{r=1}^{1} 1 · log₂ 1 = 0 ,        (6)

and the predictability is quite simple; if entropy attains its maximum, then all entities have the same relative frequency 1/V,

H = − Σ_{r=1}^{V} (1/V) log₂ (1/V) = log₂ V ,        (7)

there is a perfect uniformity, and nothing can be predicted. Thus, higher entropy of POS-motifs indicates more richness of POS sequence variation in a text, and vice versa. RR is the repeat rate, which is asymptotically the same as Entropy but is interpreted in the reverse sense. RR is defined as

RR = Σ_{r=1}^{V} p_r² = (1/N²) Σ_{r=1}^{V} f_r² .        (8)

If all frequencies fall upon one POS-motif, then the text is maximally concentrated. If all POS-motifs have the same frequency, then the smallest concentration is given. Thus, a higher value of RR indicates lower POS-motif richness, and vice versa. Gini's coefficient was introduced by Popescu and Altmann (2006); here it is used as an indicator of POS-motif richness. In quantitative linguistics, fortunately, it is not necessary to revert and cumulate the distribution and compute the sum of trapezoids to obtain the area above the Lorenz curve. Instead, one can directly compute

G = (1/V) [ V + 1 − (2/N) Σ_{r=1}^{V} r f_r ] .        (9)

As Gini’s coefficient represents the area between the diagonal and the Lorenz curve, the greater the area, the smaller the POS-motif richness.

3 Results and Discussion 3.1 The Discriminant Analysis The discriminant analysis is a supervised classification method, which is used more commonly for dimensionality reduction before later classification. The


result shows that in LCMC 93.83% of the between-group variance were on the first discriminant axis, which indicates that nearly all the differences between groups can be “explained” using the first discriminant, whereas a further 4.28% was explained on the second discriminant axis. In Frown, 75.13% of the between-group variance was on the first discriminant axis whereas a further 15.56% was explained on the second. The coefficients for each discriminant function are shown in Table 3 below. In LCMC, the most influential variables in the first discriminant are RR and Hapax Percentage, the second discriminant has added Gini, R1 and TTR; while in Frown, TTR and Gini are the most influential variables in the first discriminant, TTR and RR are among the second discriminant. Tab. 3: Coefficients of linear discriminants First discriminant

Second discriminant

CN

EN

CN

EN

TTR

-6.26

-71.08

44.70

361.67

Entropy

-0.64

8.00

4.56

-8.44

R1

5.43

-31.95

-25.98

-29.92

RR

23.04

39.98

5.32

-123.75

Gini

-6.77

-61.85

90.49

237.10

Hapax

17.19

18.14

35.58

-95.32
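As a rough illustration of this step, the sketch below runs a linear discriminant analysis with scikit-learn on a randomly generated placeholder matrix standing in for the per-text indicator values, which the chapter does not reproduce in full. Here explained_variance_ratio_ corresponds to the per-axis shares of between-group variance quoted above, and scalings_ to the coefficients in Tab. 3.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# placeholder data: one row per text, columns = TTR, Entropy, R1, RR, Gini, Hapax
rng = np.random.default_rng(0)
X = rng.random((260, 6))                 # 260 texts, as in LCMC
y = rng.choice(list("AGHJK"), size=260)  # hypothetical text-type labels

lda = LinearDiscriminantAnalysis().fit(X, y)
scores = lda.transform(X)                # per-text scores on the discriminant axes

print(lda.explained_variance_ratio_)     # between-group variance per axis
print(lda.scalings_[:, :2])              # coefficients of the first two discriminants
```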

Plotting the scores of each observation in the plane of the first and second discriminants, we obtain the results in Fig. 1 (LCMC) and Fig. 2 (Frown) respectively. Such graphs are very useful for identifying clusters.


Fig. 1: Linear discrimination between different text types in LCMC. “A” for news, “G” for essays, “H” for officials, “J” for academics, and “K” for fiction

In LCMC (Fig. 1), we note that most scores are concentrated on the first discriminant plane. Recall that 93.83% of the between-group variance was on the first discriminant axis. In addition, official documents (H) and academic prose (J) are closely clustered on the left-hand side of the plot, while essays (G), news (A) and fiction (K) are on the right-hand side. This seems to signal that a further discrimination along the "expository vs. narrative" dichotomy may be feasible in Chinese. In Frown (Fig. 2), across the two discriminant planes, we can still find that expository (H and J) and narrative (A, G and K) texts lie on opposite sides of the first discriminant axis, just as in LCMC. But there are more misclassifications in news (A) and fiction (K), which is also manifested by the classification results listed in Tab. 4, compared with those in LCMC.


Fig. 2: Linear discrimination between different text types in Frown. “A” for news, “G” for essays, “H” for officials, “J” for academics, and “K” for fiction

The results of classification according to the discriminants are shown in Tab. 4. In order to reveal where our discrimination succeeds and where it fails, we also form a misclassification table, presented in Tab. 5. Combining the two tables, it can be seen that in LCMC news (A) is the most poorly attributed text type, classified either as essays (G) or as academic prose (J). Fiction (K) is also badly classified, mostly mixed with essays (G). Essays (G) are mostly correctly classified, followed by official and academic texts. In Frown, by contrast, all text types are rather poorly attributed.

Tab. 4: The result of classification according to discriminants (percentage of texts of each type that are correctly classified)

Corpus    A        G        H        J        K
LCMC      6.82     90.91    80.00    75.00    34.48
Frown     34.09    70.67    53.33    67.50    51.72

Tab. 5: Linear discriminant misclassification table (rows: actual type; columns: assigned type)

LCMC    A     G     H     J     K     Total
A       3     21    0     19    1     44
G       1     70    0     5     1     77
H       0     0     24    6     0     30
J       3     11    6     60    0     80
K       1     16    0     2     10    29

Frown   A     G     H     J     K     Total
A       15    16    0     11    2     44
G       5     53    0     12    5     75
H       1     2     16    11    0     30
J       4     13    9     54    0     80
K       1     11    0     2     15    29

3.2 Decision Trees

Decision trees (DTs) are an increasingly popular method for classifying data. In a DT, each sub-region of the feature space is represented by a node in the tree. A node can be either terminal or non-terminal. Non-terminal nodes are impure and can be split further using a series of tests based on the feature variables, a process called splitting. The split which maximizes the reduction in impurity is chosen, the data set is split, and the process is repeated. Splitting continues until the terminal nodes are too small or too few to be split. To obtain a fully grown tree, this process is applied recursively to each non-terminal node until terminal nodes are reached. The terminal nodes correspond to homogeneous or near-homogeneous sub-regions in the feature space, and each terminal node is assigned the class label that minimizes the misclassification cost at the node.

In LCMC the tree has six end nodes, and only Hapax Percentage and RR are employed for the classification. This reinforces RR and Hapax Percentage as two important variables, as in the discriminant analysis. The first rule involves the Hapax Percentage, with texts exhibiting a low hapax percentage being official texts (H). The second type of texts classified is academic prose (J), with a hapax percentage above 0.362 but below 0.519. Texts in the range above 0.519 but below 0.581 can be either news (A) or academic prose (J), which may explain the large number of news texts misclassified as academic prose (J) in LCMC (Tab. 5). On the other hand, texts with a higher hapax percentage (> 0.581) but a low RR (≤ 0.008) are classified as fiction (K); those with higher values of both Hapax Percentage (> 0.581) and RR (> 0.008) are mostly essays (G).

In Frown, seven end nodes are derived, and three variables are employed for the classification: R1, Hapax Percentage and Entropy. Text types with expository attributes, like official (H) and academic (J) texts, are the first group to be distinguished from the narrative texts of news (A), via R1 ≤ 0.693 and hapax percentage ≤ 0.431. This indicates the lack of POS-sequence variation in expository texts. Text types with narrative attributes are singled out later: news (A) at end node 9 with entropy > 7.228 and R1 ≤ 0.747, essays (G) at end node 12 with R1 > 0.693 and entropy > 7.312, and fiction (K) at end node 13 with R1 > 0.693 and entropy > 7.726.

Comparing the trees of Frown and LCMC, it can be seen that the rules for distinguishing expository texts are relatively simple and straightforward, and their classification results are much better than those for the narrative texts; narrative types generally require more rules and variables to distinguish, and even then misclassifications into other subtypes remain. This holds in both corpora, which also echoes the results of the discriminant analysis.
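The trees in Figs. 3 and 4 carry p-values at their inner nodes, suggesting significance-based (conditional-inference) splitting. As a hedged stand-in, the following scikit-learn sketch grows a CART-style tree on the same six indicators, using the placeholder data introduced above rather than the chapter's actual values, and prints its split rules:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.random((260, 6))                 # placeholder indicator matrix
y = rng.choice(list("AGHJK"), size=260)  # hypothetical text-type labels
features = ["TTR", "Entropy", "R1", "RR", "Gini", "Hapax"]

# grow a tree; small leaves are disallowed so splitting stops early
dt = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X, y)

# textual rendering of the split rules, e.g. "|--- Hapax <= 0.36 ... class: H"
print(export_text(dt, feature_names=features))
```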

Fig. 3: Classification tree for distinguishing the five text types in LCMC

Fig. 4: Classification tree for distinguishing the five text types in Frown

Misclassified observations for each text type are listed in Tab. 6. The overall error rate for LCMC and Frown is 35.38% (92/260) and 44.96% (116/258) respectively. It can be seen that in both LCMC and Frown text types tend to be attributed to similar groups rather than otherwise: the narrative text types of news (A), essays (G) and fiction (K) are more likely to be misclassified into each other, as are the expository texts of official (H) and academic (J) documents, but both are less likely to cross into the opposite side of the dichotomy.

Tab. 6: Misclassification table of the decision trees (rows: actual type; columns: assigned type)

LCMC    A     G     H     J     K     Total
A       0     22    0     22    0     44
G       0     68    0     7     2     77
H       0     0     24    6     0     30
J       0     8     4     68    0     80
K       0     18    0     3     8     29

Frown   A     G     H     J     K     Total
A       20    12    0     10    2     44
G       9     46    2     6     12    75
H       1     3     25    1     0     30
J       6     13    25    32    4     80
K       0     9     0     1     19    29

3.3 Random Forest

Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error of the forest converges to a limit as the number of trees becomes large; it depends on the strength of the individual trees in the forest and the correlation between them (Breiman 2001). Random forests can be used for classification; we select them here to measure the contribution of each quantitative indicator to distinguishing the different text types. We build the classification models on training data (the ratio of the training set to the testing set is 7:3). The classification results for LCMC and Frown are shown in Tab. 7; the overall result in LCMC (OOB estimate of error rate 42.41%) is somewhat better than that in Frown (OOB estimate of error rate 52.33%).


Tab. 7: The confusion matrix by random forest in LCMC (top) and Frown (bottom); rows: actual type, columns: assigned type

LCMC    A     G     H     J     K     class.error
A       5     11    0     14    1     0.8387
G       6     46    0     5     3     0.2333
H       0     0     17    6     0     0.2609
J       6     6     3     41    1     0.2807
K       0     16    0     3     1     0.9500

Frown   A     G     H     J     K     class.error
A       7     8     0     11    1     0.7407
G       5     36    0     7     6     0.3333
H       0     3     6     11    0     0.7000
J       3     9     9     36    0     0.3684
K       0     9     1     2     11    0.5217

In LCMC the class error for essays (G), official texts (H) and academic texts (J) is much lower than that for news and fiction, indicating that POS-motifs are acceptable for classifying these text types. In Frown, except for essays (G) and academic texts (J), the text types are all poorly classified.
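Judging by the OOB and MeanDecreaseGini terminology, the chapter's results likely come from R's randomForest package; the scikit-learn sketch below is a hypothetical analogue on placeholder data. Here oob_score_ yields the OOB error estimate, and feature_importances_ is the Gini-based counterpart of the MeanDecreaseGini values discussed next.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((260, 6))                 # placeholder indicator matrix
y = rng.choice(list("AGHJK"), size=260)  # hypothetical text-type labels
features = ["TTR", "Entropy", "R1", "RR", "Gini", "Hapax"]

# 7:3 split of training vs. testing data, as in the chapter
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

print("OOB error rate:", 1 - rf.oob_score_)
print(dict(zip(features, rf.feature_importances_)))  # cf. MeanDecreaseGini in Fig. 5
```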

Fig. 5: Importance of the quantitative indicators according to random forest in LCMC (left) and Frown (right), expressed in MeanDecreaseGini

In Fig. 5, MeanDecreaseGini measures the impurity-based influence of the variables, allowing the importance of the six indices for the classification to be compared. A higher MeanDecreaseGini indicates greater importance of an indicator. The importance of the individual indicators differs somewhat between LCMC and Frown.

We also adopt random forests to classify all text samples into only the narrative (A, G and K) and expository (H and J) categories; the results are presented in Tab. 8.

Tab. 8: The confusion matrix of the narrative vs. expository distinction by random forest in LCMC (top) and Frown (bottom)

LCMC    Ex    Na    class.error
Ex      86    24    0.2182
Na      25    125   0.1667

Frown   Ex    Na    class.error
Ex      62    24    0.2791
Na      14    96    0.1273

Note: "Na" stands for "narrative" and "Ex" for "expository". It can be seen that the overall classification results for the dichotomous categories are much better than when attributing texts to individual types, with an overall error rate of 18.85% in LCMC and 23.64% in Frown. In addition, adopting POS-motifs for the classification of expository texts is less error-prone in Chinese (21.82%) than in English (27.91%), while narrative texts achieve slightly higher accuracy in English (87.27%) than in Chinese (83.33%).

4 Conclusion

In this study, we have employed vocabulary-independent properties, POS-motifs (unrepeated sequences of POS tags within sentence boundaries), to classify five different text types in Chinese and English. The datasets are characterized by six quantitative indices indicating the variation or richness of POS-motifs, namely TTR, Hapax Percentage, R1, Entropy, RR and Gini's coefficient. It is found that the richness of POS-motifs may function as an acceptable indicator for classifying some text types in both Chinese and English. The results of the discriminant analysis indicate that the narrative vs. expository distinction exists in both LCMC and Frown. The results of the decision trees show that, though different variables are employed for splitting the types in LCMC and Frown, the expository text types of official (H) and academic (J) documents are the earlier


ones to be split, via the major rule of the hapax percentage of POS-motifs, whereas the narrative text types involve somewhat more complicated rules, indicating their richness of POS-motifs. In addition, text types are more likely to be misclassified into homogeneous groups than into opposite ones. The results of the random forests indicate that using POS-motif richness as an indicator to classify texts is acceptable for certain types. In LCMC, the narrative types of news (A) and fiction (K) are the poorly attributed ones, while expository ones like official (H) and academic (J) texts are mostly correctly classified. In Frown, however, most text types are poorly classified except for essays (G). The overall classification result for attribution to individual text types is not satisfactory in either LCMC or Frown, but it is acceptable if one only distinguishes the "narrative vs. expository" dichotomy.

As POS already attains a certain degree of abstraction over words, POS sequences combining syntagmatic relations can be regarded as, to some extent, crossing and straddling the boundaries of morphology and lexicology to reflect genre-specific syntactic peculiarities. Such syntactic regularities are, however, slightly constrained by the POS-motif indicator used in the current study, as a repeated POS tag automatically breaks off a new motif. It is to be remarked that motifs are quite "legal" units of second order, just like other well-known units such as sentence, clause, compound, etc. They represent vectors of "more primary" units like syllables, words, etc.; but any units in language are conventional definitions and may be stated and segmented differently in various languages. Methodologically, they have the same status as other units or properties.
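To illustrate the unit itself, the sketch below segments a POS-tagged sentence into POS-motifs under one plausible reading of the definition used here (a repeated tag breaks off the current motif and opens a new one); it is not code from the study.

```python
def pos_motifs(tags):
    """Segment a sentence's POS tags into motifs: maximal runs with no
    repeated tag; a repetition closes the motif and starts a new one."""
    motifs, current = [], []
    for tag in tags:
        if tag in current:           # repetition breaks off a new motif
            motifs.append(tuple(current))
            current = []
        current.append(tag)
    if current:
        motifs.append(tuple(current))
    return motifs

# e.g. a tagged sentence D N V D A N -> [(D, N, V), (D, A, N)]
print(pos_motifs(["D", "N", "V", "D", "A", "N"]))
```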

Acknowledgement

This research is supported by the National Social Science Foundation of China under Grant #15BYY098.


References

Beliankou, A., Köhler, R., & Naumann, S. (2013). Quantitative properties of argumentation motifs. In: I. Obradović, E. Kelih, & R. Köhler (Eds.), Methods and Applications of Quantitative Linguistics. Selected Papers of the 8th International Conference on Quantitative Linguistics (QUALICO) (pp. 35–43). Belgrade: Academic Mind.
Biber, D. (1988). Variation across Speech and Writing. Cambridge: Cambridge University Press.
Biber, D. (1995). Dimensions of Register Variation: A Cross-linguistic Perspective. Cambridge: Cambridge University Press.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Conrad, S., & Biber, D. (Eds.). (2001). Variation in English: Multi-dimensional Studies. New York: Longman.
Finegan, E. (2010). Corpus linguistic approaches to "legal language": adverbial expression of attitude and emphasis in Supreme Court opinions. In: M. Coulthard & A. Johnson (Eds.), The Routledge Handbook of Forensic Linguistics (pp. 65–77). Abingdon: Routledge.
Grieve, J., Biber, D., Friginal, E., & Nekrasova, T. (2010). Variation among blogs: a multi-dimensional analysis. In: A. Mehler, S. Sharoff, & M. Santini (Eds.), Genres on the Web: Corpus Studies and Computational Models (pp. 45–71). New York: Springer-Verlag.
Herdan, G. (1966). The Advanced Theory of Language as Choice and Chance. Berlin: Springer.
Hoover, D. L. (2002). Frequent word sequences and statistical stylistics. Literary and Linguistic Computing, 17(2), 35–42.
Hoover, D. L. (2003a). Frequent collocations and authorial style. Literary and Linguistic Computing, 18(3), 45–56.
Hoover, D. L. (2003b). Multivariate analysis and the study of style variation. Literary and Linguistic Computing, 18(4), 65–79.
Hoover, D. L. (2007). Corpus stylistics, stylometry, and the styles of Henry James. Style, 41(2), 174–203.
Hou, R., & Jiang, M. (2014). Analysis on Chinese quantitative stylistic features based on text mining. Literary and Linguistic Computing: Digital Scholarship in the Humanities, (4), 1–11.
Jurafsky, D., & Martin, J. H. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd ed.). Upper Saddle River, NJ: Pearson Prentice Hall.
Köhler, R. (2006). The frequency distribution of the lengths of length sequences. In: J. Genzor & M. Bucková (Eds.), Favete linguis. Studies in Honour of Viktor Krupa (pp. 142–152). Bratislava: Academic Press.
Köhler, R. (2008a). Word length in text. A study in the syntagmatic dimension. In: S. Mislovičová (Ed.), Jazyk a jazykoveda v pohybe (pp. 416–421). Bratislava: VEDA.
Köhler, R. (2008b). Sequences of linguistic quantities. Report on a new unit of investigation. Glottotheory, 1(1), 115–119.
Köhler, R., & Naumann, S. (2008). Quantitative text analysis using L-, F- and T-segments. In: B. Preisach & D. Schmidt-Thieme (Eds.), Data Analysis, Machine Learning and Applications. Proceedings of the Jahrestagung der Deutschen Gesellschaft für Klassifikation 2007 in Freiburg (pp. 637–646). Berlin/Heidelberg: Springer.
Köhler, R., & Naumann, S. (2010). A syntagmatic approach to automatic text classification. Statistical properties of F- and L-motifs as text characteristics. In: P. Grzybek, E. Kelih, & J. Mačutek (Eds.), Text and Language. Structures – Functions – Interrelations – Quantitative Perspectives (pp. 81–89). Wien: Praesens.
Kredens, K., & Coulthard, M. (2012). Corpus linguistics in authorship identification. In: P. M. Tiersma & L. M. Solan (Eds.), The Oxford Handbook of Language and Law (pp. 504–516). Oxford: Oxford University Press.
Kubát, M., Matlach, M., & Čech, R. (2014). Quantitative Index Text Analyser (QUITA). http://oltk.upol.cz/software.
Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
McEnery, T., & Xiao, R. (2004). The Lancaster Corpus of Mandarin Chinese: a corpus for monolingual and contrastive language study. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004) (pp. 1175–1178). Lisbon.
Popescu, I.-I., & Altmann, G. (2006). Some aspects of word frequencies. Glottometrics, 13, 23–46.
Popescu, I.-I., Altmann, G., Grzybek, P., Jayaram, B. D., Köhler, R., Krupa, V., Mačutek, J., Pustet, R., Uhlířová, L., & Vidya, M. N. (2009). Word Frequency Studies. Berlin/New York: Mouton de Gruyter.
Popescu, I.-I., Zörnig, P., Grzybek, P., Naumann, S., & Altmann, G. (2013). Some statistics for sequential text properties. Glottometrics, 26, 50–94.
Sanada, H. (2010). Distribution of motifs in Japanese texts. In: P. Grzybek, E. Kelih, & J. Mačutek (Eds.), Text and Language. Structures – Functions – Interrelations – Quantitative Perspectives (pp. 183–193). Wien: Praesens.
Xiao, R. (2009). Multidimensional analysis and the study of world Englishes. World Englishes, 28(4), 421–450.

Yu Fang

L-motif TTR for Authorship Identification in Hongloumeng and Its Translation

Abstract: Previous studies have found that different authors have different writing styles, which can be seen in vocabulary richness. However, little research has asked whether such differences also show up in the corresponding translations. In the present study, Hongloumeng and its two translations, by David Hawkes and by Yang Xianyi, are selected as the object of study. L-motif TTR is used to re-evaluate the authorship of Hongloumeng on the basis of 15 chapters selected from the first 80 chapters and 15 chapters selected from the remaining 40 chapters. The results show significant differences in vocabulary richness between the two parts, suggesting that they were written by two authors. Furthermore, we also evaluate the quality of the different translations: (a) both translators choose nearly the same words to tell the story; (b) Yang uses more varied words to avoid repetition, whereas Hawkes prefers more simplified words; (c) in translating culture-loaded words, Hawkes favors equivalent words from Western culture, while Yang translates such words literally.

Keywords: L-motif TTR, authorship identification, Hongloumeng, translation

1 Introduction

Authorship identification has been the focus of many researchers owing to literary works of disputed or unknown authorship. The basic approach is to identify "a stylistic fingerprint characteristic of the author, and then determine whether this fingerprint is also present in a disputed work" (Juola and Baayen 2005: 59). Such an approach usually rests on one presupposition, namely that different authors have different writing styles, which show up in their use of function words, word collocations, sentence structure, etc. In measuring an author's writing style, vocabulary richness is one of the important indicators, for authors clearly differ in the sizes and structures of their vocabularies. Studies in this field usually follow two lines: authorship attribution (Labbé 2007; Khmelev 2000) and authorship verification (Koppel and Schler 2004; Iqbal et al. 2010).

|| Yu Fang: Department of Linguistics, Zhejiang University, Hangzhou, China, [email protected]

The earliest study of quantitative authorship attribution concerning vocabulary richness can be traced back to Yule (1944), who proposed the index K. Since then, a number of indices have been applied in research. Coyotl-Morales et al. (2006), considering both the stylistic and the topical features of texts, used a set of word sequences combining function words and content words to test 353 poems written by five modern writers. The results showed that this method is appropriate for capturing the writers' distinctive features, so it can handle the attribution of short documents. Similarly, Argamon and Levitan (2005) selected twenty novels written by five authors, and it turned out that using the most frequent function words in the corpus as features for stylistic text classification gives good discrimination for both author and nationality attribution tasks.

Though many studies on vocabulary richness have dealt with the style of original authors, there has been little interest in studying the style of translators. One reason might be that there are fewer translated works of disputed or unknown authorship than original texts. Another major reason is that, for a long time, many researchers considered translation a derivative rather than a creative activity, so that translators should simply reproduce as closely as possible the style of the original and not have a style of their own. In recent years, however, more and more researchers have become aware that translation is "a rewriting of an original text" (Lefevere 1992: 1), and studies concerning translators' style have emerged. Baker (2000) chose five texts translated by Peter Bush and three texts translated by Peter Clark for stylistic analysis. Two indicators, type-token ratio and average sentence length, suggested that the vocabulary richness of Peter Clark's works is lower than that of Peter Bush's. She also investigated the frequency and patterning of SAY (the most frequent reporting verb in English) and found that Peter Clark tended to use SAY in his works while Peter Bush did not. Fang and Liu (2015a) used STTR and lambda to measure the vocabulary richness of the two translations of Hongloumeng and found that the vocabulary richness in the native speaker's (Hawkes') version is no higher than that in the non-native speaker's (Yang's) version, because Yang liked to create new words to express culture-loaded words.

Those studies show the power of vocabulary richness in attributing authorship, both in original texts and in translations. However, there is little research dealing with a literary work by different authors. If a work is supposed to have more than one author, it can be divided into several parts according to its authors, and each part can be regarded as an individual text; will differences of vocabulary richness then exist among those parts? There is also little research dealing with stylistic differences of one translator in translating a text written by several authors. In other words, if the above-mentioned literary


work is translated by one person, will the differences of vocabulary richness in the original text be shown in the translator's renderings? And if the work has several translated versions, will the result be the same? That is to say, if the original text has been proven to be written by different authors, will this be reflected in its translations? Moreover, if one translation reflects such a difference while the other does not, can we say the former is a better version? In other words, can this result assist in evaluating the quality of different translations?

If we look deeper into the methodologies of previous studies, we find that they either focus on some distinctive words, like the most frequent verbs and function words, or consider texts only as a whole. They share one commonality: they do not take the sequential organization of a text with respect to any linguistic unit into consideration. As a remedy, linguistic motifs, originally called segments or sequences (Köhler 2006), were introduced as a new unit for sequential analyses. A linguistic motif, the longest continuous sequence of equal or increasing values of a linguistic unit, represents quantitative properties of linguistic units and is thus useful for the comparison of authors and texts. Linguistic motifs have four major types, L-motifs, F-motifs, P-motifs and T-motifs, among which an L-motif refers to a continuous series of equal or increasing length values of morphs, words and so on. Köhler and Naumann (2008) calculated the L-motif TTR of 66 poems and prose texts and fitted the values to the Menzerath-Altmann law, and all cases yielded excellent fits.

In this paper, we choose the original Hongloumeng and its two translations by David Hawkes and Yang Xianyi as the materials. L-motif TTR (Köhler and Naumann 2008) is used to measure their vocabulary richness. Considering the limitations and problems of previous research, we keep the following research questions in mind:

Question 1. Was Hongloumeng, as many researchers suggest, written by two authors? In other words, will the L-motif TTR of the first 80 chapters be significantly different from that of the last 40 chapters?

Question 2. If Hongloumeng was proven to be written by two authors, will this difference be shown in the work of a single translator? More specifically, will the difference of vocabulary richness exist in the translator's version?

Question 3. According to the results of L-motif TTR obtained from the first two questions, that is, from the point of view of vocabulary richness, can we evaluate the quality of the different translations?


2 Materials and Method

Hongloumeng is one of the masterpieces of Chinese literature and one of the Four Great Chinese Classical Novels. However, its authorship has been widely discussed and no agreement has been reached. Usually, people accept the idea that it was written by two authors: Cao Xueqin wrote the first 80 chapters and Gao E the remaining 40. Recently, some researchers have applied quantitative methods to this authorship question; the statistical analysis of a literary text adds a justified methodology to works which for a long time may have received only impressionistic and subjective treatment. The two most representative studies were carried out by Chen Bingzao at the University of Wisconsin and Li Xianping at Fudan University. Calculating the word correlativity between the first 80 chapters and the remaining 40, Chen (1980) concluded that all 120 chapters of Hongloumeng were written by the same author. Li (1987), in contrast, extracted 47 function words and calculated their frequencies in each chapter; a subsequent cluster analysis revealed that the book was written by more than two authors.

Owing to the popularity of the original text, many translators have tried to introduce it to other countries. Until now, there have been nine complete or selective English translations (Chen and Jiang 2003), two of which are widely accepted: The Story of the Stone, translated by the British sinologist David Hawkes and his son-in-law John Minford, and A Dream of Red Mansions, translated by the Chinese translator Yang Xianyi and his wife Gladys Yang.

To reduce the workload while still reaching the aim of the study, we randomly selected 15 chapters (Chapters 4, 12, 16, 24, 28, 36, 40, 44, 48, 52, 64, 68, 72, 76, 80) out of the first 80 chapters, labeled A, and 15 chapters (Chapters 82, 84, 88, 90, 92, 94, 96, 100, 104, 108, 110, 112, 114, 116, 118) out of the remaining 40 chapters, labeled B.

To measure the vocabulary richness of Hongloumeng and its translations, the type-token ratio (TTR) is selected from the many available indices. TTR, indicating the relationship "between the total number of running words in a corpus and the number of different words used" (Olohan 2004: 80), is an important indicator of vocabulary richness. It can reflect the writing style of an author, for "the writer has available a certain stock of words, some of which he/she may favor more than others" (Holmes 1994: 91). It has thus turned out to be a quite powerful indicator for authorship identification.

Analogous to the L-motifs of words or other linguistic units, L-motif TTR refers to the continuous series of equal or increasing numbers of different words. To obtain the values of L-motif TTR, we output the number of different words N times in an N-word text. For example, consider the text "Today is Sunday, and


we don't need to go to school on Sunday." The text has 13 word tokens, so its L-motif TTR sequence looks as follows:

token: 1  2  3  4  5  6  7  8  9  10  11  12  13
type:  1  2  3  4  5  6  7  8  9  9   10  11  11

Herdan (1964) observed a near-linear relationship between the size of the vocabulary V (types) and the total number of word tokens N in a text:

$$V = \alpha N^{\beta} . \qquad (1)$$

Altmann (1980) gave a theoretical derivation of model (1), which is "the most commonly used one in linguistics" (Köhler & Naumann, 2008: 642):

$$y = x^{a} . \qquad (2)$$

Just like in model (1), x represents the number of tokens, i.e. N, and y represents the number of types, i.e. V; a is an empirical parameter. Model (2) is expected to hold for L-motif TTR (Köhler and Naumann 2008: 642).

Before obtaining the values of L-motif TTR, word segmentation needs to be carried out, because Chinese, unlike English, has no spaces between words. Segtag1, developed by Professor Shi Dongxiao at Xiamen University, is applied to the original text of Hongloumeng. After the automatic segmentation, we checked and corrected the results for accuracy. In both the Chinese and the English texts, punctuation is deleted before measuring L-motif TTR.

|| 1 It can be accessed from http://vdisk.weibo.com/s/…

NLREG2, a powerful statistical analysis program that performs nonlinear regression analysis, is used in this study; it determines the values of the parameters in an equation whose form the user specifies. The values of the L-motif TTR of each chapter are first calculated by a self-built program, and we expect them to fit model (2); NLREG is then applied to determine the values of parameter a. After all the data are obtained, SPSS 20 is used to carry out the significance analysis: the values of parameter a in A and in B are compared to discover whether Hongloumeng was written by different authors. Then, in the same way, the values of parameter a in the two translations are tested. After that, we examine whether the results for the translations are consistent with those for the original text.

|| 2 It can be accessed from http://www.nlreg.com/.
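As a hedged sketch of this pipeline (the study used a self-built program plus NLREG, not the code below), the following Python fragment builds the L-motif TTR sequence of the toy sentence above and estimates parameter a of model (2) by least squares:

```python
import numpy as np
from scipy.optimize import curve_fit

def lmotif_ttr(words):
    """Cumulative type counts: the i-th value is the number of distinct
    words among the first i tokens (the L-motif TTR sequence)."""
    seen, types = set(), []
    for w in words:
        seen.add(w)
        types.append(len(seen))
    return np.arange(1, len(words) + 1), np.array(types, dtype=float)

text = "today is sunday and we don't need to go to school on sunday"
x, y = lmotif_ttr(text.split())

model = lambda x, a: x ** a                    # model (2): y = x^a
(a_hat,), _ = curve_fit(model, x, y)           # least-squares estimate of a
r2 = 1 - ((y - model(x, a_hat))**2).sum() / ((y - y.mean())**2).sum()
print(f"a = {a_hat:.4f}, R2 = {r2:.4f}")       # goodness of fit
```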

3 Results and Discussion

In this section, whether Hongloumeng has more than one author is tested first; then, given that the original text was written by two authors, we examine how this shows up in the two translations; finally, we proceed to evaluate the quality of the two translations.

3.1 One or Two Authors in the Original Text

Using NLREG, we can fit the values of L-motif TTR to model (2) and obtain the values of parameter a. For reasons of space, we take only Chapter 44 as an example here; Fig. 1 shows the excellent fit of the model to the data from this chapter.



Fig. 1: L-motif TTR of Chapter 44 in the original text

Goodness-of-fit was determined using the determination coefficient R², which is 0.9792 in this case. The other chapters all yielded excellent fits as well. Tab. 1 shows the values of parameter a and R² for all 30 selected chapters. As can be seen, R² is above 0.9 in nearly all chapters, and in some even above 0.98.

Tab. 1: The values of parameter a and R² in the original text*

Chapter   Parameter a   R²        Chapter   Parameter a   R²
4         0.8846        0.9832    82        0.8481        0.9820
12        0.8771        0.9926    84        0.8470        0.9568
16        0.8807        0.9606    88        0.8455        0.9744
24        0.8464        0.9578    90        0.8475        0.9823
28        0.8438        0.9763    92        0.8464        0.9944
36        0.8535        0.9699    94        0.8446        0.9468
40        0.8424        0.9747    96        0.8504        0.9153
44        0.8459        0.9792    100       0.8584        0.9533
48        0.8579        0.9663    104       0.8497        0.9780
52        0.8518        0.9675    108       0.8464        0.8941
64        0.8643        0.9506    110       0.8515        0.9276
68        0.8577        0.9112    112       0.8455        0.9859
72        0.8499        0.9459    114       0.8595        0.9911
76        0.8661        0.9812    116       0.8567        0.9692
80        0.8684        0.9922    118       0.8522        0.9856

*The figures in the table are rounded to four decimal places.

Based on Tab. 1, the curves of these values are plotted in Fig. 2.

Fig. 2: The values of parameter a in the original text

Fig. 2 shows that the dashed line (B) is smoother than the solid line (A), which means that the values of parameter a in the 15 chapters selected from the last 40 chapters do not differ much from each other. This is confirmed by the exact data: the values of parameter a in A fluctuate between 0.84 and 0.89, while those in B fluctuate between 0.84 and 0.86. We can also see that the solid line lies, on the whole, above the dashed line, though the divergence is not wide. Therefore, a significance test is needed for further discussion. On the basis of the results so far, we propose the following hypotheses:

H0: The values of parameter a in A are not significantly different from those in B.
H1: The values of parameter a in A are significantly different from those in B.

Since both A and B have only 15 values, normal distribution tests are needed. A one-sample Kolmogorov-Smirnov test shows that the values of parameter a in the two groups are normally distributed: pA = 0.726 > 0.05, pB = 0.904 > 0.05, which ensures


the rationality of the independent-sample t-test: t(28) = 2.509, p = 0.018 < 0.05. H0 is therefore rejected: the values of parameter a in A differ significantly from those in B, i.e. the vocabulary richness of the two parts of the original text differs.

The same procedure is applied to the two translations. For Hawkes' translation, the Kolmogorov-Smirnov tests again indicate normal distributions (pA > 0.05, pB = 0.652 > 0.05), and the independent-sample t-test yields t(28) = -2.286, p = 0.030 < 0.05, i.e. a significant difference between the two parts. For Yang's translation, the normality tests give pA > 0.05, pB = 0.904 > 0.05, and the independent-sample t-test yields t(28) = 1.038, p = 0.308 > 0.05. Different from Hawkes' translation, the values of parameter a in the two parts of Yang's translation show no significant difference, which also differs from the result for the original text.

In sum, having obtained the L-motif TTR of all 30 selected chapters, we fit its values to model (2) and carry out significance tests on parameter a. The tests show that the first

80 chapters and the remaining 40 chapters were written by different authors, which conforms to the result of most previous studies. The two translations behave differently: the two parts of Hawkes' translation show a significant difference in vocabulary richness, which does not appear in Yang's translation. When different translated versions of the same text are available, readers usually tend to choose the better one to read; thus an evaluation of the quality of the translations is needed.
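The testing step can be reproduced with scipy on the parameter-a values of Tab. 1. Note that the chapter used SPSS 20, whose one-sample Kolmogorov-Smirnov test may treat estimated distribution parameters differently, so the normality p-values need not match exactly; the t-test, however, reproduces the reported statistic.

```python
from scipy import stats

# parameter-a values from Tab. 1: part A (first 80 chapters)
a_A = [0.8846, 0.8771, 0.8807, 0.8464, 0.8438, 0.8535, 0.8424, 0.8459,
       0.8579, 0.8518, 0.8643, 0.8577, 0.8499, 0.8661, 0.8684]
# part B (last 40 chapters)
a_B = [0.8481, 0.8470, 0.8455, 0.8475, 0.8464, 0.8446, 0.8504, 0.8584,
       0.8497, 0.8464, 0.8515, 0.8455, 0.8595, 0.8567, 0.8522]

# approximate normality check: K-S test on standardized values
for label, a in (("A", a_A), ("B", a_B)):
    print(label, stats.kstest(stats.zscore(a), "norm").pvalue)

# independent two-sample t-test (equal variances, df = 28)
print(stats.ttest_ind(a_A, a_B))   # reproduces t(28) = 2.509, p = 0.018
```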

100 | Yu Fang

3.3 Evaluating the Quality of Translations From the above sections, we can know that fitting values of L-motif TTR into model (2) yields the parameter a, which has a direct link with vocabulary richness: the larger the parameter a, the richer the vocabulary richness is. Using parameter a to measure vocabulary richness, can we offer a suggestion for evaluating the quality of the two translated Hongloumeng? Since the vocabulary richness of two parts in Hawkes’ translation has significant difference, we will also compare the two translations in two parts. According to the data we get above, the curves of parameter a in the two parts are shown in Fig. 7 and 8 respectively.

parameter a 0.88 0.86 0.84 0.82 0.8

1

2

3

4

5

6

7

8

9

10

11

12

parameter a in A of Hawkes' translation parameter a in A of Yang's translation Fig. 7: Comparison of the values of parameter a in A

13

14

15

L-motif TTR for Authorship Identification in Hongloumeng and Its Translation | 101

Fig. 8: Comparison of the values of parameter a in B

In Fig. 7 we can see that, except for the first selected chapter, the values of parameter a in Hawkes' version are lower than those in Yang's version. In B, however, the difference does not seem so wide: from Fig. 8 we cannot judge which version's parameter a is larger. Thus another significance test is needed.

H0: The vocabulary richness of the two translations is the same.
H1: The vocabulary richness of the two translations is significantly different.

The independent-sample t-test is conducted, and a significant difference between Hawkes' translation and Yang's translation in the first 15 selected chapters is found: t(28) = -3.067, p = 0.005 < 0.05.

Jingqi Yan

The fits are good (R² ≥ 0.95) in the three distributions, and the determination coefficient (R²) gets better with the growth of grades. The individual fitting data of the three sub-treebanks are presented in Tab. 3.

Tab. 3: Fitting the Zipf-Mandelbrot distribution to the POS R-motif data from the three sub-treebanks

Stages    P(X²)   DF    R²     a      b      n
Primary   1       265   0.95   0.99   0.59   341
Junior    1       299   0.96   1.00   0.78   394
Senior    1       321   0.97   1.01   0.77   426
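As a minimal, hypothetical sketch of such a fit (dedicated fitting software is normally used for this in quantitative linguistics; the code below is not the chapter's procedure), the following fragment estimates the parameters a and b of the right-truncated Zipf-Mandelbrot distribution, P(r) proportional to (b + r)^(-a) for r = 1, ..., n, by maximum likelihood on toy data:

```python
import numpy as np
from scipy.optimize import minimize

def zm_probs(a, b, n):
    """Right-truncated Zipf-Mandelbrot probabilities P(r) ~ (b + r)^(-a)."""
    w = (b + np.arange(1, n + 1)) ** -a
    return w / w.sum()

def fit_zm(freqs):
    """Maximum-likelihood estimates of (a, b) for rank frequencies 1..n."""
    f = np.asarray(freqs, dtype=float)
    n = len(f)
    nll = lambda p: -(f * np.log(zm_probs(p[0], p[1], n))).sum()
    return minimize(nll, x0=[1.0, 1.0],
                    bounds=[(0.01, 10.0), (0.0, 100.0)]).x

# toy rank-frequency data drawn from a ZM law, just to exercise the fit
rng = np.random.default_rng(1)
ranks = rng.choice(np.arange(1, 342), p=zm_probs(1.0, 0.6, 341), size=5000)
freqs = np.bincount(ranks, minlength=342)[1:]
print(fit_zm(freqs))   # should recover roughly a = 1.0, b = 0.6
```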

3.1.2 Fitting of the R-motif on the basis of dependency relations

We then tested the R-motif data based on dependency relations against the Zipf-Mandelbrot distribution. The 32 types of dependency relations annotated here follow Liu's categorization (2007). Tab. 4 shows the distribution results. Concerning the statistical results, the Zipf-Mandelbrot distribution yielded not very good fits in the three sub-treebanks (R²