
Faculteit Letteren en Wijsbegeerte Departement Taalkunde

Scalability Issues in Authorship Attribution

Schaalbaarheid bij Auteursherkenning

Proefschrift voorgelegd tot het behalen van de graad van doctor in de Taalkunde aan de Universiteit Antwerpen te verdedigen door Kim LUYCKX

Promotor: Prof. dr. Walter Daelemans

Antwerpen, 2010

Cover design: Tom De Smedt
Print: Silhouet, Maldegem
© 2010 Kim Luyckx

© 2010 Uitgeverij UPA University Press Antwerp

UPA is an imprint of ASP nv (Academic and Scientific Publishers nv)
Ravensteingalerij 28, B-1000 Brussels
Tel. +32 (0)2 289 26 50 – Fax +32 (0)2 289 26 59
E-mail: [email protected]
www.upa-editions.be
ISBN 978 90 5487 823 0
NUR 616 / 984
Legal deposit D/2010/11.161/146
All rights reserved. No parts of this book may be reproduced or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the author.

Abstract

This dissertation is about authorship attribution, the task that aims to identify the author of a text, given a model of authorial style based on texts of known authorship. In computational authorship attribution, we do not rely on in-depth reading, but rather automate the process. We take a text categorization approach that combines computational analysis of writing style using Natural Language Processing with a Machine Learning algorithm to build a model of authorial style and attribute authorship to a previously unseen text. In traditional applications of authorship attribution – for instance the investigation of disputed authorship or the analysis of literary style – we often find large sets of textual data of the same genre and small sets of candidate authors. Most approaches are able to reliably attribute authorship in cases like these. However, the types of data that we find online require an approach that is able to deal with large sets of candidate authors, a large variety of topics, and often very short texts. Even though the last decades of research have brought substantial innovation, most studies only scratch the surface of the task because they are limited to small and strictly controlled problem sets. As a result, it is uncertain how any of the proposed approaches will perform on a large scale. In addition, the often vague descriptions of experimental design and the underuse of objective evaluation criteria and of benchmark data sets cause problems for the replicability and evaluation of some studies. Since most studies focus on quantitative evaluation of results but refrain from going into detail about the features of text used to attribute authorship, it is difficult to assess the quality of their approach.

In this dissertation, we investigate whether a commonly applied text categorization approach is viable for application on a large scale – for instance in the detection of fraud or in social media analysis. In this context, scalability refers to the ability of a system to achieve consistent performance under various uncontrolled settings. We stress-test our approach by confronting it with various scalability issues and study its behavior in detail. By combining performance analysis with an in-depth analysis of features, we aim at increased insight into the strengths and weaknesses of the approach.

The first scalability issue we discuss is the importance of experimental design when dealing with multi-topic data. Topic is one of the most important factors interfering with authorship, a characteristic making it hard to ‘separate’ from authorship. Including topic markers in the model of authorship undermines the scalability of the approach because it leads to overfitting. We implement various types of feature selection methods and frequency thresholds and find that the application of a topic frequency threshold allows us to restrict the feature set to the most efficient and scalable features. Another way to deal with multi-topic data is the application of variations of the standard cross-validation scheme. Whereas some schemes have the disadvantage of requiring topic information – normal in controlled, experimental settings but a luxury in real-life applications – they also allow for insight into the model and the challenge of multi-topic authorship attribution. Although our three data sets only contain a limited number of topics, the amount of inter-topic and intra-topic variation is substantial.

The effect of author set size – i.e. the number of candidate authors – is the second scalability issue we address. We see a significant decrease in performance with increasing author set size. When we build a model from a small set of candidate authors and use it to attribute authorship in a larger set of candidate authors, a dramatic drop in performance occurs. While the approach is very sensitive to variations in author set size and data size, and to author or topic imbalances, it can be relied on for cases with limited sets of candidate authors.

The third scalability issue is the data size, interpreted here as the amount of textual data used for training. The results show a significant decrease in accuracy as data size is reduced, implying that the approach is not able to reliably perform authorship attribution on limited data. The limited data used in our study are a challenge and require a reliable and robust representation of those texts as well as a discriminative approach that is able to deal with sparse data. Our results confirm the observation that Support Vector Machines are good at dealing with sparse data. When combined with lexical or syntactic features, Memory-Based Learning shows some degree of robustness to sparse data, but not to the extent that we can claim superiority.

From this dissertation, we can conclude that our text categorization approach, although commonly applied in the field, cannot be reliably applied on a large scale. This dissertation showcases the complexity of the authorship attribution task given a data set. In both aspects, the field has only seen the tip of the iceberg and would benefit from more transparency in terms of experimental design, performance, and features.

Samenvatting

Dit proefschrift gaat over auteursherkenning, een taak waarbij het doel is de auteur van een tekst te identificeren. Toekenning van auteurschap gebeurt op basis van een model van de schrijfstijl van de auteur dat gebaseerd is op teksten waarvoor auteursinformatie beschikbaar is. Computationele auteursherkenning automatiseert dit proces, terwijl traditionele benaderingen afgaan op een diepgaande analyse van de tekst door experts (cf. ‘in-depth reading’). Dit proefschrift past automatische tekstcategorisatie toe op de taak, een benadering die computationele analyse van schrijfstijl (met behulp van natuurlijke taalverwerking) combineert met een ‘Machine Learning’-algoritme dat een model van schrijfstijl samenstelt en vervolgens auteurschap toekent aan een ongeziene tekst. In traditionele toepassingen van auteursherkenning – bijvoorbeeld het onderzoeken van betwist auteurschap of de analyse van literaire stijl – vinden we vaak grote hoeveelheden teksten in hetzelfde genre en kleine groepen van kandidaat-auteurs. De meeste benaderingen zijn in staat om in dergelijke gevallen auteurschap toe te kennen aan een tekst. De aard van de teksten waar we online mee geconfronteerd worden, vereist echter een aanpak die betrouwbaar kan omspringen met grote groepen van auteurs, een grote verscheidenheid aan onderwerpen en vaak zeer korte teksten. Hoewel de voorbije decennia tot belangrijke innovaties hebben geleid in auteursherkenning, bieden de meeste studies slechts een oppervlakkige visie op de taak omdat ze zich beperken tot kleine en streng gecontroleerde problemen. Het is bijgevolg onduidelijk hoe de voorgestelde benaderingen zullen presteren op grote schaal. Bovendien staan de repliceerbaarheid en toepasbaarheid van een aantal studies ter discussie omwille van een te vage beschrijving van de experimentele opzet en het beperkte gebruik van objectieve evaluatiecriteria en van vrij beschikbare ‘benchmark’ corpora. Het is moeilijk om de kwaliteit van een benadering te evalueren omdat de nadruk meestal ligt op kwantitatieve evaluatie van de resultaten in plaats van op een gedetailleerde analyse van de taalkundige kenmerken die auteurschap bepalen.

In dit proefschrift onderzoeken we of een algemeen toegepaste aanpak – de automatische tekstcategorisatie – toepasbaar is op grote schaal, bijvoorbeeld voor het detecteren van fraude of voor de analyse van sociale netwerksites. In deze context verwijst schaalbaarheid naar het vermogen van een systeem om consequent te blijven in verschillende ongecontroleerde contexten. We zetten onze benadering onder druk door in detail te analyseren hoe de verschillende schaalbaarheidsproblemen haar gedrag beïnvloeden. Door analyse van accuraatheid te combineren met diepgaande analyse van linguïstische kenmerken willen we meer inzicht krijgen in de sterkte- en zwaktepunten van de benadering.

Het eerste thema dat we bespreken, is het belang van de experimentele opzet bij het gebruiken van teksten in verschillende onderwerpen (‘multi-topic data’). Het onderwerp van de tekst interfereert in hoge mate met kenmerken van auteurschap, waardoor het moeilijk is het ‘onderscheid’ te maken tussen kenmerken die naar het onderwerp dan wel naar de auteur verwijzen. Het opnemen van zogenaamde ‘topic markers’ in het model van auteursstijl ondermijnt de schaalbaarheid van het model omdat het leidt tot ‘overfitting’ (d.i. de situatie waarbij een model in dergelijke mate is gebaseerd op bepaalde teksten dat het niet kan generaliseren naar andere teksten). Onze implementatie van verschillende methodes om de meest relevante linguïstische kenmerken te weerhouden leidt tot de conclusie dat de toepassing van een ‘topic frequency threshold’ (d.i. enkel kenmerken die voorkomen in verschillende onderwerpen, worden weerhouden) de meest efficiënte en schaalbare kenmerken in het model houdt. Een andere manier om met teksten in verschillende onderwerpen om te gaan, is het toepassen van verschillende procedures voor crossvalidatie. Hoewel voor sommige procedures informatie over het onderwerp noodzakelijk is – een normale vereiste in een experimentele context, maar een luxe in toepassingen op grote schaal – geven ze ons uniek inzicht in de uitdagingen van ‘multi-topic’ auteursherkenning. Ondanks het beperkte aantal onderwerpen in onze drie corpora, zien we grote variatie tussen de verschillende onderwerpen (‘inter-topic variation’) en binnen de verschillende onderwerpen (‘intra-topic variation’).

De invloed van de ‘author set size’ – d.i. het aantal kandidaat-auteurs – is het tweede thema in dit proefschrift. We zien een opvallende daling in accuraatheid wanneer het aantal auteurs toeneemt. Als we een model bouwen op basis van een kleine groep auteurs en dit toepassen op een probleem met een grote groep kandidaat-auteurs, blijkt dat de accuraatheid van het systeem aanzienlijk daalt. Hoewel onze benadering zeer gevoelig is voor variaties in ‘author set size’ en ‘data size’ en voor de interne verdeling van het aantal onderwerpen en auteurs, stellen we vast dat ze betrouwbaar is voor casussen met een beperkt aantal auteurs.

Het derde thema is de ‘data size’, hier geïnterpreteerd als de hoeveelheid teksten waarop een model gebaseerd is. De resultaten tonen een significante daling in accuraatheid aan wanneer we de ‘data size’ beperken. De beperkte hoeveelheid teksten die we gebruiken, betekent een echte uitdaging en vereist een betrouwbare, robuuste representatie van deze teksten en een algoritme dat met dit type data kan omgaan. Onze resultaten bevestigen dat Support Vector Machines aan deze vereisten voldoen. In combinatie met lexicale of syntactische kenmerken toont Memory-Based Learning een zekere mate van robuustheid, maar niet voldoende om van een duidelijke overmacht te spreken.

Uit dit proefschrift kunnen we besluiten dat onze benadering, de automatische tekstcategorisatie, niet betrouwbaar genoeg is om ze op grote schaal toe te passen voor auteursherkenning, hoewel ze in het vakgebied gemeengoed is. Dit proefschrift demonstreert de complexiteit van auteursherkenning gegeven een casus en de bijhorende teksten. Voor wat betreft deze twee aspecten heeft het vakgebied enkel het tipje van de ijsberg gezien. Meer transparantie op vlak van experimentele opzet, accuraatheid en de linguïstische kenmerken in het model is essentieel voor de evaluatie van een benadering voor auteursherkenning.

Acknowledgements

Not everything that can be counted counts, and not everything that counts can be counted. (Cameron, 1963, p. 13)

As much as the above quote holds for the characteristics of a text that make up style (the topic of this dissertation), it is true for the numerous people who have supported me throughout the entire process of my PhD. I would like to express my gratitude to the people who have become invaluable to me.

This dissertation would be unthinkable without the commitment of Walter Daelemans, my supervisor. He pushed me to a higher level of research, by emphasizing the importance of methodology and innovation, but also by having confidence in me. By making clear when something is not good enough and being enthusiastic when it is, he helped me become a better researcher, and I cannot thank him enough for that. My gratitude also goes to Véronique Hoste and Steven Gillis for agreeing to be in my thesis committee and for their remarks. I would like to thank Guy De Pauw, who commented on the entire manuscript and was always ready to give advice. I am also grateful to Harald Baayen and Efstathios Stamatatos for their willingness to be in my jury and for their valuable suggestions.

Being a researcher has shaped me to a large extent. In fact, I practically grew up in the presence of my colleagues at CLiPS (formerly CNTS). The last six years at the University of Antwerp have been a real joy, thanks to the inspiring people who are my colleagues: Agnita, Anja, Anne, Annemie, Bart, Bram, Claudia, Dominiek, Elena, Emmanuel, Eric, Erik, Evie, Frederik, Gert, Guy, Hanne, Helena, Inge, Iris, Jo, Karen, Kathy, Kris, Lien, Lieve, Marie-Laure, Martine, Mike, Naomi, Øydis, Reinhild, Renate, Roser, Sarah, Steven, Tom, Véronique, Vincent, and Walter. I really appreciate their open-mindedness and the no-nonsense atmosphere in our research group. Special thanks to Agnita and Vincent, my former and current office roommates, for their patience and advice. Lieve, my train buddy, made me reconsider train delays as an opportunity instead of as a problem. My gratitude also goes to Roser, Guy, Mike, and Thomas, who agreed to proofread this dissertation. I enjoy the lunches with Mike and Thomas and have really grown attached to them. I hope we will continue to have fun together for years to come. Kathleen and Tine, I am really grateful for the support you have given me. It feels good to have two good friends that I can rely on for serious conversations, shopping trips, and even clothing advice.

My family is essential to my well-being. I want to thank my parents for having confidence in me and for stimulating me to develop my potential. I am very much the product of my father’s love for language and my mother’s practical skills. They have provided me with a loving home that I can always depend on for serious conversations and support, but also for taking care of Ole after school. A warm thanks goes to my parents-in-law, who have welcomed me into their family and who are always prepared to pitch in and take care of Ole. Without my family’s support, I could not have finished this dissertation. Every evening I go home, knowing that I can fall back on Sven, my husband, who has proven to be a great help in setting worries aside because of his no-nonsense attitude. I feel lucky to have him around me. He is a wonderful father for our son and a joy to observe when cooking in the kitchen. I treasure Ole for being the energetic boy that he is. Watching him grow up and develop his own character is a sensation every day. Knowing that I can count on these people means the world to me, so this dissertation is as much their accomplishment as it is mine.

Table of Contents

1 Introduction
   1.1 Context and Motivation
   1.2 Research Objectives
      1.2.1 Experimental Design in Multi-Topic Data
      1.2.2 Author set size
      1.2.3 Data size
   1.3 Our Perspective on Authorship Attribution
      1.3.1 Short Text Authorship Attribution
      1.3.2 Scalability
      1.3.3 Qualitative Feature Analysis
   1.4 Contributions and Limitations
   1.5 Chapter Guide

I Text Categorization Approach to Authorship Attribution

2 State of the Art in Authorship Attribution
   2.1 Introducing the Task
   2.2 Perspectives on Authorship Attribution
   2.3 Methods and Techniques for Authorship Attribution
      2.3.1 Discriminative Methods
      2.3.2 Feature Types
      2.3.3 Feature Selection
   2.4 Summary

3 Methodology and Data Sets
   3.1 Methodology
      3.1.1 Text Categorization Approach
      3.1.2 Pre-Processing and Linguistic Analysis
      3.1.3 Feature Engineering
      3.1.4 Machine Learning for Text Categorization
      3.1.5 Evaluation
   3.2 Evaluation Data Sets
      3.2.1 Requirements and Motivation
      3.2.2 Ad-Hoc Authorship Attribution Competition – Problem Set A (AAAC A)
      3.2.3 Dutch Authorship Benchmark corpus (ABC NL1)
      3.2.4 Personae corpus (PERSONAE)
   3.3 Summary

II Scalability in Authorship Attribution

4 The Effect of Experimental Design in Multi-Topic Data
   4.1 Introduction and Research Questions
      4.1.1 Working with Multi-Topic Data
      4.1.2 Experimental Design
      4.1.3 Research Questions
   4.2 Experimental Matrix and Baseline
   4.3 Feature Selection Methods
      4.3.1 Introducing the Methods
      4.3.2 Scalability towards Unseen Texts
      4.3.3 Increasing Scalability towards Other Topics
      4.3.4 Discussion
   4.4 Cross-Validation Schemes for Multi-Topic Data
      4.4.1 Performance as an Indicator of Scalability and Variation
      4.4.2 Feature Analysis
      4.4.3 Discussion
   4.5 Conclusions

5 The Effect of Author Set Size
   5.1 Introduction and Research Questions
   5.2 Experimental Set-Up
   5.3 The Effect of Author Set Size in the Original Data Sets (EXP 1)
   5.4 The Effect of Author Set Size in Data Size and Topic Balanced Data (EXP 2)
   5.5 The Effect of Exclusive Testing on Small Author Set Sizes
      5.5.1 Performance Decay with Increasing Author Set Size
      5.5.2 Reliability and Scalability of Features and Feature Types
   5.6 Conclusions

6 The Effect of Data Size
   6.1 Introduction and Research Questions
      6.1.1 Data Size
      6.1.2 Research Questions
   6.2 Experimental Set-Up
      6.2.1 Data Size as the Number of Variable-Length Training Samples
      6.2.2 Data Size as the Number of Fixed-Length Training Samples
      6.2.3 All Data Performance
   6.3 Data Size as the Number of Variable-Length Samples (EXP 1)
   6.4 Data Size as the Number of Fixed-Length Samples (EXP 2)
   6.5 Robustness to Limited Data
      6.5.1 The Limited Data Challenge
      6.5.2 Results and Discussion
   6.6 Conclusions

III Conclusions

7 Conclusions and Further Research
   7.1 Conclusions
      7.1.1 Experimental Design in Multi-Topic Data
      7.1.2 Author Set Size
      7.1.3 Data Size
      7.1.4 Scalability of a Text Categorization Approach to Authorship Attribution
   7.2 Further Research

IV Appendices

A Features Below the Topic Frequency Threshold
B Performance with Topic Frequency Threshold
C The Effect of Author Set Size: Machine Learner Comparison
D The Effect of Author Set Size in the Original Data Sets
E The Effect of Author Set Size with Data Size and Topic Balanced Data
F Data Size as the Number of Variable-Length Samples
G Data Size as the Number of Fixed-Length Samples
H Robustness to Limited Data: Comparing Data Representations and Machine Learners

Bibliography

Chapter 1

Introduction

This dissertation is about computational authorship attribution, a task in which the construction and evaluation of a model of authorial style is essential. During the training phase, the model is built from documents of known authorship. In a testing phase, we apply that model to an unseen document and evaluate whether it allows us to determine the authorship of that document. Computational authorship attribution differs from traditional approaches to authorship attribution (e.g. in-depth reading by literary experts) in that we use computational analysis of writing style to build the model and apply a learning algorithm to attribute authorship.

During the last decade, computational authorship attribution research has benefited from increased attention in both Digital Humanities and Computational Linguistics. However, the field is dominated by studies focusing on large sets of textual data and small sets of candidate authors. Moreover, studies often fail to provide a clear-cut description of their experimental design. Nevertheless, experimental design is an important factor that significantly affects the reliability of an approach. As a consequence, it is uncertain if and how the methods used in these studies will perform in situations with small sets of textual data and/or large sets of candidate authors. In addition, the underuse of objective evaluation criteria and lack of benchmark data sets undermine the reliability of methods, feature types, and Machine Learners typically proposed for the task.

In this dissertation, we operationalize a standard text categorization approach by combining Shallow Parsing and Memory-Based Learning, and place it under scrutiny. By studying its behavior in several conditions, we aim at contributing to a benchmark for authorship attribution. The focus is on three issues that affect the scalability of the approach: experimental design, author set size, and data size.

In this introductory chapter, we will first describe the context of this dissertation and motivate the focus on scalability issues (Section 1.1). Next, we introduce the research objectives (Section 1.2) and clarify our perspective on the authorship attribution task (Section 1.3). Finally, we describe the most important contributions and limitations of the dissertation (Section 1.4) and provide a chapter guide (Section 1.5).


1.1 Context and Motivation

Authorship attribution aims at identifying the author of an unseen document given a set of documents by a number of candidate authors. The field has originated from studies applying in-depth reading to texts of unknown or disputed authorship, such as the Federalist Papers (Mosteller & Wallace, 1964) or the œuvre attributed to Shakespeare (Merriam, 1993). Contemporary research in computational authorship attribution concentrates on two issues: the selection and extraction of features that relate to the author’s writing style and exhibit predictive power, and the selection of robust discriminative methods. State-of-the-art approaches are able to reliably solve an authorship attribution task given a small set of candidate authors and – as is often the case – a large set of training data. During the last decades, a lot of interesting new feature types (e.g. syntactic features) and discriminative approaches (e.g. Burrows’ Delta (2002)) have been proposed for the task. However, comparison of the various feature types and approaches proposed is very complex – not to say unfeasible – for three reasons.

First of all, the field is dominated by studies that evaluate an approach on small sets of candidate authors and large sets of training data exclusively. Without testing on larger author set sizes and smaller sets of training data, it is impossible to assess how an approach scales. We define scalability (of an approach) in this dissertation as: the ability to achieve consistent performance under various uncontrolled settings, such as variations in topic, genre, the number of candidate authors, and the amount of textual data available per author. Applying authorship attribution on a large scale – in online social networks, for instance – requires an approach that is robust to the large sets of candidate authors, the often small amounts of data per author, the large numbers of topics, etc. found in large data sets. When envisaging authorship attribution ‘in the wild’ (term by Koppel et al., forthcoming), it is essential to single out feature types and discriminative methods that are able to efficiently deal with large author set sizes, small data sizes, and a variety of topics and genres.

A second reason is the fact that benchmark data sets are hardly ever used. In computational linguistics, benchmark data sets and standard evaluation metrics are the keys to thorough evaluation and analysis of the behavior of an approach. Although benchmark data sets have been developed (and made publicly available) for research in authorship attribution, they are severely underused. Most studies introduce a new data set that is directed to their specific interest (e.g. analysis of literary writing style). While there is no harm in developing custom data sets, using them exclusively does not contribute to establishing benchmarks for authorship attribution. The most commonly used data set, the Federalist Papers (e.g. Mosteller & Wallace, 1964; Holmes & Forsyth, 1995; Tweedie et al., 1996; Jockers & Witten, 2010), contains texts of disputed and possibly also collaborative authorship, a characteristic that makes it unfit for systematic evaluation of approaches, but nevertheless an interesting case for any authorship attribution approach.

A third reason is the fact that a lot of studies fail to be specific about their experimental design, leaving too much room for interpretation, a situation that reduces the chances of replication. Especially in multi-topic authorship attribution, careful experimental design is essential to increase scalability. According to Stamatatos (2009), the field of authorship attribution will only be able to deal with these problems provided that the performance of all methods is measured under various conditions. Equally essential are objective evaluation criteria and the comparison of different methods on the same benchmark data sets.

In this dissertation, we stress-test a text categorization approach to authorship attribution and provide a systematic comparison of the behavior of the approach when confronted with scalability issues. We report on experiments using three publicly available evaluation data sets, each with different dimensions in terms of author set size, data size, and number of topics. In order to increase comparability with textual data of e-mail or blog post length, we focus on short text authorship attribution. We will elaborate on the specific challenge of working with short texts in Section 1.3. Our focus is on three issues that affect scalability: experimental design in multi-topic authorship attribution, author set size, and data size. By testing the predictive strength of various feature types against (some of) the challenges of large-scale authorship attribution, we aim to assess the viability of our text categorization approach when applied on a large scale, and to contribute to a benchmark for the task.

1.2 Research Objectives

Now that we have described the context of and motivation for this dissertation, we zoom in on the three scalability issues and state research objectives.

1.2.1 Experimental Design in Multi-Topic Data

When designing an experiment in multi-topic authorship attribution, topic is a factor that needs to be controlled. Topic is one of the most important factors interfering with authorship, a characteristic making it hard to ‘separate’ from authorship. In addition, including topic markers in the attribution model can either aid classification (when an author has a specific preference for certain topic markers), negatively affect performance (when the texts in the test set are on a different topic than those in training), or confuse the discriminative method (when the topic of the test document differs from that of the training document by the same author).

A commonly applied solution to avoiding topic is the selection of (a set of) function words. Although function words are robust to limited data and provide good indicators of authorship, the a priori exclusion of content words causes a lot of useful information to be disregarded. For that reason, we aim to integrate content words into the model, without, however, decreasing scalability. In Chapter 4, we investigate two aspects of experimental design that have an effect on the scalability of the resulting model. Depending on the feature selection method or cross-validation scheme chosen, topic will play a role to a larger or smaller extent. We will show that seemingly small decisions in experimental design can have large consequences in terms of scalability. The research questions we address are:

Q1 What is the effect on scalability of decisions in experimental design?
Q2 What is the best technique to increase the scalability of the approach towards other topics?
Q3 Is it possible to use content words in multi-topic authorship attribution without reducing scalability?

1.2.2 Author set size

Most studies in quantitative or Machine Learning based authorship attribution focus on two or a few authors. We claim that this constraint makes it difficult to predict performance with larger author set sizes. Moreover, testing an approach on small author set sizes exclusively also leads to an overestimation of the importance of the features extracted from the training data and found to be discriminating for these small sets of authors. Whereas it is possible that different types of features (e.g. character n-grams or function word distributions) are reliable for small as well as large sets of authors, the individual features may be very different in both conditions. Only recently, research has started to focus on larger sets of authors. In Chapter 5, we measure the effect of author set size by increasing the number of candidate authors stepwise and analyzing performance and robustness of features to the effect of author set size. We claim that exclusive testing on small author set sizes fails to give an impression of the scalability of an approach. The research questions we address are:

Q1 Do we find support for the hypothesis that studies that test an approach on a small set of candidate authors only, overestimate the approach when making claims concerning (a) its performance and scalability for cases with large sets of candidate authors, and (b) the importance and scalability of specific predictive features?


Q2 Is the effect of author set size in experiments balanced for data size and topic the same as in experiments that are not balanced for these factors? In other words, how do data size and topic interact with the effect of author set size?

1.2.3 Data size

Most studies in authorship attribution use large amounts of textual data per candidate author to determine authorship. Distinguishing between a small set of authors based on large collections of training data per author is a task that can be solved with high accuracy. However, when only limited training data is available for a specific author, the authorship attribution task becomes much more difficult. By testing the system on very limited data, we can estimate its viability when applied to small collections of e-mails, letters, or blog posts. Recently, a few studies have focused on larger author set sizes than typical in the field. However, they fail to assess the scalability of the approach taken. We present learning curve experiments in authorship attribution and expect to see an increase in performance when the system is trained on more textual data. We investigate which feature types show more robustness to the effect of data size, and compare two document representations and Machine Learning algorithms in terms of their ability to deal with limited data. These are the research questions we address:

Q1 How scalable is the text categorization approach towards smaller sets of textual data? Do we find robustness of specific feature types?
Q2 What is the effect of document representation on the ability of the approach to deal with (extremely) limited data? Is the profile-based approach more robust to limited data than the instance-based approach?
Q3 What is the effect of the Machine Learning algorithm on the ability of the approach to deal with (extremely) sparse data?

1.3 Our Perspective on Authorship Attribution

In this dissertation, we operationalize a standard text categorization approach that is dominant in computational authorship attribution (e.g. Gamon, 2004; Houvardas & Stamatatos, 2006; van Halteren, 2007; Luyckx & Daelemans, 2008a; Raghavan et al., 2010). However, our study differs from most contemporary studies in that it combines three aspects that are gradually becoming more important in the field. Two of those aspects – the focus on short text authorship attribution and on scalability issues – have only recently emerged as research themes in authorship attribution. The third aspect, providing a qualitative analysis of features, is an innovative type of analysis in the field.

1.3.1 Short Text Authorship Attribution

Short text authorship attribution poses a specific challenge to the text categorization approach we adopt – and by extension to any approach. Whereas stylistic choices are generally accepted to be present in every text written by an author, they occur less frequently in short texts. Working with short texts requires a reliable and robust representation of these texts as well as a Machine Learning algorithm that is able to deal with limited data. In most studies, texts of book length are used for training, whereas studies involving short texts are relatively scarce. In Stamatatos (2009), it is stated that the text samples should be long enough so that the text representation features can adequately represent their style. However, there is no consensus on the minimal requirements for a text sample. In the last few years, the field has seen a number of studies in short text authorship attribution. When only short texts are available, for instance in poems (Coyotl-Morales et al., 2006) or student essays (van Halteren et al., 2005), often a large number of these texts are used for training. Some studies have shown promising results with short texts of about 500 characters (Sanderson & Guenter, 2006) or 500 words (Koppel et al., 2007). In Hirst & Feiguina (2007), it was shown that reducing the length of the training samples has a direct effect on performance. In our experiments, we use short texts as an approximation of texts of e-mail or blog post length. This allows us to simulate the type of data available on the web and investigate the scalability of our approach to short texts. Large-scale authorship attribution on the web will often involve very short texts. In addition, when we investigate the effect of data size (cf. Chapter 6), we will dramatically reduce the number of short text samples used for training, allowing us to estimate performance with a very limited number of short text samples.

1.3.2 Scalability

The focus on scalability issues is relatively new in the field. Zhao & Zobel (2005) found large inconsistencies from one author pair to another in terms of performance and concluded that this causes considerable doubt over the results reported in many of the previous papers on this topic, most of which used only two authors (Zhao & Zobel, 2005, p. 183). Stamatatos (2009) provides a survey of the field that is critical towards the state of the art because of the lack of systematic evaluation and comparison of approaches. Recently, Koppel et al. (forthcoming) presented a systematic analysis of the effects of data size and author set size in a statistical approach to the task.

Investigating the scalability of an approach leads to better insight into the merits of an approach in various conditions. By stress-testing the approach, we can estimate how reliably it will perform on a large scale. That said, techniques that fail to scale can still be useful in specific cases. For instance, when tracking the evolution of literary writing style or comparing the writing styles of a limited set of authors, it is useful to tailor the model to the problem set at hand. Even in large-scale authorship attribution, using features that fit the topic rather than the author of a text can aid performance if some authors have a preference for a specific topic.

1.3.3 Qualitative Feature Analysis

In this dissertation, we not only zoom in on performance, but also on the features that make up the attribution model. By providing a qualitative analysis of features, we are able to evaluate whether the individual features are scalable or not. This type of analysis is typically lacking in authorship attribution studies, where the focus is primarily on performance, or restricted to an analysis of function words (e.g. Koppel et al., 2003a). Many studies focus on the performance of specific feature types or discriminative methods, but refrain from going into detail about the features selected. We consider it crucial to gain insight into the attribution model in order to increase our understanding of the effect of experimental design (cf. Chapter 4). We also track the individual features in the attribution model while author set size increases in order to verify whether they show robustness to large author set sizes (cf. Chapter 5).

1.4 Contributions and Limitations

The main contribution of this dissertation to the field of computational authorship attribution is that it investigates the behavior of a text categorization approach to the task when confronted with scalability issues. By addressing the issues of experimental design, data size, and author set size, the dissertation demonstrates whether the approach taken is valid in experiments with limited or sufficient data, and with small or large sets of authors. Systematic analysis of these issues allows us to evaluate whether the approach is fit for application on a large scale.

Although we are only discussing the tip of the iceberg – authorship attribution ‘in the wild’ may entail thousands of candidate authors with often small sets of data or only very short texts, in substantially more topics, genres, and registers (e.g. Koppel et al., forthcoming) – our study provides unique insight into the attribution model and into the factors that turn the task into a challenge. An important insight obtained is that there are a lot of interacting factors, most of which cannot be measured or blocked out. Any approach to authorship attribution will need to effectively deal with these factors. Stress-testing an approach and analyzing the features that are part of the attribution model allows a thorough evaluation of the scalability of an approach. As far as our text categorization approach is concerned, we will show that it is not viable for application on a large scale because the resulting performance is unpredictable and the model of authorial style overfits the training data.

1.5 Chapter Guide

In this chapter, we briefly sketched the context of and motivation behind the dissertation, and described our perspective on authorship attribution. We also introduced the main research questions and described the contributions and limitations of the dissertation.

Chapter 2 describes the authorship attribution task along with its main assumptions. We introduce the two main perspectives on the task – statistical and computational authorship attribution – and provide a description of the state of the art in terms of the most important discriminative methods, feature types, and feature selection methods.

In the first part of Chapter 3, we describe the baseline approach we take in this dissertation. We begin by explaining the text categorization model our approach is based on, and then go into detail on the different steps of pre-processing, linguistic analysis, feature extraction, classification, and evaluation. The second part of this chapter introduces the evaluation data sets we will use in the subsequent chapters.

In Chapter 4, we investigate the crucial role of experimental design when working with multi-topic data. Since lexical features often relate to topic as well as authorship, including these features in the model implies a risk to its scalability. In order to make the attribution model more robust, we explore two aspects of experimental design that have an effect on the type of information in the model: feature selection and cross-validation. We investigate variations of the standard approaches in order to assess whether they allow us to increase the scalability of the model.

Chapter 5 presents a systematic study of how author set size affects performance and the (types of) predictive features selected. Most studies in the field are limited to small sets of candidate authors, a situation that can lead to unrealistic expectations concerning the scalability of the approach or feature type suggested. Our aim is to identify robust and reliable approaches for large-scale authorship attribution.


In Chapter 6, we investigate the effect of data size on performance in authorship attribution by gradually decreasing the amount of data used for training. Results are presented in learning curves, allowing an analysis of the evolution of performance with decreasing training data. We also explore internal and external factors – such as the Machine Learning algorithm selected – that affect performance when the text categorization approach to authorship attribution is confronted with limited data. In Chapter 7, we draw conclusions, answer the research questions formulated in Chapter 1, and state further research perspectives.


Part I

Text Categorization Approach to Authorship Attribution

Chapter 2

State of the Art in Authorship Attribution

In this chapter, we introduce the authorship attribution task and identify trends in the field. We also describe the most commonly applied discriminative methods, feature types, and feature selection methods. This survey is biased towards studies relevant to our interpretation of the task, more specifically to those taking a text categorization approach involving Machine Learning for classification. Recent, more general surveys of the field can be found in Holmes (1998) (on early research in authorship attribution), and Juola (2008) and Stamatatos (2009) (on modern authorship attribution methods). This chapter sketches the context of this dissertation. It is structured as follows. First, we introduce the task and its assumptions (Section 2.1). Then, we sketch the most dominant perspectives on authorship attribution (Section 2.2). We describe the evolution of research in the last decades in an overview of the most important discriminative methods, feature types, and feature selection methods (Section 2.3).

2.1 Introducing the Task

Authorship attribution (also known as authorship identification or authorship recognition) is the task of identifying the author of an unseen text from a set of candidate authors. On the basis of an abstract representation of texts of known authorship (i.e. the training data), the author of the unseen text (i.e. the test data) is determined. Key issues in the field are the selection of texts representative of the author’s writing style, the selection of linguistic features that allow quantification of writing style, and the selection of a method that allows us to distinguish between the set of candidate authors as well as identify the author of the unattributed text.

The field of authorship attribution originates from a tradition of in-depth reading by human experts investigating disputed authorship in literary works like the œuvre attributed to Shakespeare. This type of research is commonly referred to as traditional authorship attribution. From the late 19th century on, there have been attempts to quantify writing style, the most important studies being those of Mendenhall (1887), Zipf (1932), and Yule (1938, 1944). The most influential early study, by Mosteller & Wallace (1964), adopted distributions of function words as a discriminating feature to settle the disputed authorship of the Federalist Papers between three candidate authors (Alexander Hamilton, James Madison, and John Jay). Since then, an increasing number of studies have focused on testing style markers in cases of disputed authorship or in works of known authorship (e.g. the Brontë sisters) using statistical analysis. Until the late 1990s, the field was dominated by studies applying multivariate statistical analysis or clustering to distinguish between authors. The emergence of the world wide web and the availability of more powerful computers have instigated a whole new line of research, applying insights from computer science and computational linguistics, more specifically from Information Retrieval, Machine Learning, and Natural Language Processing (NLP).

In general, studies in authorship attribution rest on a number of assumptions. First of all, writing style is believed to be influenced by the author’s characteristics (e.g. identity, gender, personality, education level). A second assumption is that the author’s characteristics can be perceived from his or her writing style. The most important assumption in non-traditional methods (i.e. methods involving statistical or computational techniques; cf. Section 2.3.1) is that writing style is quantifiable in terms of characteristic language use. Although it is generally accepted that writing style can be affected by a number of external factors, such as time, register, topic, etc., there is a consensus that some characteristic elements of the author’s writing style will always be present, irrespective of these factors. Nevertheless, it is common practice to keep constant as many potentially interacting factors as possible. We will elaborate on this in Chapter 3 (Section 3.2), when we motivate our selection of data sets.

2.2 Perspectives on Authorship Attribution

Most of the research in authorship attribution is done from the perspectives of Digital Humanities (DH) and Computational Linguistics (CL). Although they share the topic of authorship attribution, DH and CL often have different emphases and objectives. Whereas in DH, the focus is primarily on cases of actual disputed authorship or on the analysis of literary style, most research in the CL perspective is concerned with performance on data sets of known authorship and on identifying the most reliable techniques. The more systematic approach in CL allows for a strict control of factors that interact with authorship (e.g. topic and genre), a set-up that often cannot be realized in cases of disputed authorship. Studies that simulate the challenge of large-scale authorship attribution – for instance, by increasing the author set size, or decreasing the training data size – allow for a systematic evaluation of the state of the art under various circumstances. One of the main advantages of DH-oriented research is the focus on the interpretation of results as well as on the implications for an author’s (literary) style. This type of analysis is currently lacking in CL-oriented studies.

The closely related fields of computer forensics and forensic linguistics have also shown considerable interest in the authorship attribution task (e.g. Gray et al., 1997; de Vel et al., 2001; Chaski, 2005; Lambers & Veenman, 2009). They see authorship attribution as an interesting case for the evaluation of approaches for plagiarism detection, fraud detection, computer security, etc.

In the CL framework, we find studies on related classification tasks – other tasks that involve learning from authorship metadata – such as authorship verification, authorship profiling, and plagiarism detection. In authorship verification, the task is to decide whether a given text was written by author_x or not. In this case, there is an open candidate set, and for that reason an absence of negative examples – i.e. no texts are included that have not been written by author_x. Authorship verification, although in essence a one-class learning task, is often approached as a one-vs.-all classification task, where negative examples of writing style from a set of likely candidate authors are included. Authorship profiling focuses on the extraction of the author’s personal characteristics (e.g. gender, age, education level). Plagiarism detection (e.g. Uzuner et al., 2005; Potthast et al., forthcoming) attempts to verify whether a new text has been plagiarized or not. This dissertation focuses on authorship attribution.

2.3 Methods and Techniques for Authorship Attribution

In practice, the differences between the two frameworks are less dramatic than suggested above, since the field has seen a number of serious attempts at crossover. That is also the context of this dissertation, which attempts to combine the strengths of CL and DH. In this section, we give a brief overview of the most commonly used discriminative methods, feature types, and feature selection techniques in modern authorship attribution studies (including both DH and CL). We provide a brief history of the field on the basis of a number of seminal studies – often at the intersection of the DH and CL frameworks. The interdisciplinary character of authorship attribution – the field combines techniques and theories of literature, linguistics, computer science, artificial intelligence, mathematics, and psychology – entails that categorization of studies is a difficult task. The classification presented in Stamatatos (2009) is closest to the one we apply in this dissertation. Note that our focus is on studies that investigate authorship attribution for the sake of the task itself rather than on studies that use it as a test case for discriminative methods (e.g. Allison & Guthrie, 2006; Jair Escalante et al., 2009).


2.3.1 Discriminative Methods

Contemporary research in authorship attribution shows two lines of research in terms of the type of discriminative method selected: one that involves statistical analysis and comparison of texts, and another one taking a text categorization approach involving Machine Learning for classification. In this dissertation, we will refer to the first type as the statistical approach and to the second type as the computational or text categorization approach.

Statistical authorship attribution (also known as quantitative authorship attribution) relies on multivariate statistical analysis as a technique to allow comparison of texts of different authors along stylistic markers of those texts. Discriminant analysis (e.g. Stamatatos et al., 2000; Tambouratzis et al., 2004; Chaski, 2005) and principal component analysis (e.g. Burrows, 1992; Baayen et al., 1996) are the commonly applied statistical techniques in the field. Burrows’ Delta (Burrows, 2002) is no doubt the most influential method in current digital humanities research on authorship attribution. Delta is defined as the mean of the absolute differences between the z-scores for a set of word-variables in a given text-group and the z-scores for the same set of word-variables in a target text. Argamon (2008) presents a theoretical analysis of Delta in order to explain why it works so well. In fact, Delta appears to be a variant of the k-Nearest Neighbor algorithm that assigns the label of the nearest class instead of the label of the nearest instance. Delta has shown success in several studies (e.g. Hoover, 2004). However, Argamon (2008) suggested the application of Delta should be limited to documents of the same text type, because the method relies on the assumption that word frequencies for the different authors are similarly distributed, which is not the case in texts from different text types.

Computational authorship attribution takes a text categorization approach towards authorship attribution. Automatic text categorization (Sebastiani, 2002) labels documents according to a set of predefined categories. Most text categorization systems use a two-stage approach in which features are extracted that have high predictive value for the categories, after which a Machine Learning algorithm is trained to categorize new documents by using the features selected in the first stage, and tested on previously unseen data. Stamatatos et al. (2000) and Koppel et al. (2003b) were among the first to apply the standard text categorization approach to the authorship attribution task. The model starts from a set of documents of which the author is known (the so-called training data), automatically extracts features – representing layers of linguistic information, obtained by applying NLP – that are informative for the identity of the author, and trains a Machine Learning method that uses these features to do authorship attribution for previously unseen documents with unknown authorship (the test data). This approach has not only been applied with success to authorship attribution (Gamon, 2004; Houvardas & Stamatatos, 2006; van Halteren, 2007; Luyckx & Daelemans, 2008a; Raghavan et al., 2010), but also to the closely related fields of authorship verification (Argamon et al., 2003a; Koppel & Schler, 2004; Koppel et al., 2007; Luyckx & Daelemans, 2008a), gender prediction (Koppel et al., 2003b), and personality prediction (Mairesse et al., 2007; Nowson & Oberlander, 2007; Luyckx & Daelemans, 2008b).
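As a concrete illustration of this two-stage set-up, the sketch below represents each document by relative function-word frequencies (stage one) and attributes an unseen text to the author of the most similar training document (stage two). It is only a minimal sketch under simplifying assumptions, not the pipeline used in this dissertation, which relies on shallow parsing and Memory-Based Learning (cf. Chapter 3); the whitespace tokenizer, the short function-word list, and the toy training texts are invented for the example.

```python
from collections import Counter
import math

# A small, assumed list of English function words; real studies use much
# larger, language-specific lists.
FUNCTION_WORDS = ["the", "a", "an", "of", "to", "in", "and", "but",
                  "that", "it", "is", "was", "he", "she", "on", "at"]

def features(text):
    """Stage 1: represent a document by relative function-word frequencies."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def attribute(train, test_text):
    """Stage 2: assign the label of the most similar training document,
    i.e. a memory-based (1-nearest-neighbour) decision."""
    test_vec = features(test_text)
    best_author, best_sim = None, -1.0
    for author, text in train:
        sim = cosine(features(text), test_vec)
        if sim > best_sim:
            best_author, best_sim = author, sim
    return best_author

# Toy training documents of known authorship, invented for illustration.
train = [
    ("author_A", "it was the best of times and it was the worst of times"),
    ("author_B", "the committee shall review the report that was filed at the office"),
]
print(attribute(train, "it is said that it was the best of the available options"))
```

In a realistic setting, the feature extractor would be replaced by richer lexical, character, or syntactic features and the nearest-neighbour step by a trained learner such as an SVM or a Memory-Based Learner, but the division of labour between representation and classification remains the same.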

2.3 M ETHODS AND T ECHNIQUES FOR AUTHORSHIP ATTRIBUTION

ship verification (Argamon et al., 2003a; Koppel & Schler, 2004; Koppel et al., 2007; Luyckx & Daelemans, 2008a), gender prediction (Koppel et al., 2003b), and personality prediction (Mairesse et al., 2007; Nowson & Oberlander, 2007; Luyckx & Daelemans, 2008b). In contemporary computational research, Support Vector Machines (SVMs) are the learning method of choice (e.g de Vel et al., 2001; Diederich et al., 2003; Koppel et al., 2003b; Argamon et al., 2007), but other algorithms have been tested as well, among them are decision trees (e.g. Zhao & Zobel, 2005; Zheng et al., 2006), neural networks (e.g. Matthews & Merriam, 1994; Zheng et al., 2006; Tearle et al., 2008) and Memory-Based Learning (e.g. kNN) (e.g. Luyckx & Daelemans, 2008a). Some studies interpret authorship attribution as a binary classification task, where each author is contrasted with the other authors in the set (i.e. authorx vs. authoryz ). Others see it as a multi-class task, distinguishing all authorship classes at the same time (i.e. authorx vs. authory vs. authorz ). This dissertation takes the multi-class approach. One other interesting method to mention is the use of compression algorithms, which has been tested for text categorization in general (e.g. Teahan & Cleary, 1997; Frank et al., 2000), and was first applied to the authorship attribution task in Benedetto et al. (2002). The central idea is to take a compression algorithm and to compare every unseen text with all training texts of the selected set of candidate authors. Compressing texts in pairs (unseen text ↔ training text) allows for authorship attribution in an intuitively simple way. Since frequent sequences are encoded in less bytes than rare sequences – an idea adopted from Information Theory – high compression rate indicates similar writing style. Although a number of studies were optimistic about the use of compression for authorship attribution (e.g. Kukushkina et al., 2001), the compression-based approach was strongly criticized in Goodman (2002) for being slower and less accurate than standard Machine Learning algorithms like Naive Bayes. Marton et al. (2005) presented a systematic comparison of compression-based algorithms for authorship attribution, but, as far as we are aware, this approach has not been pursued further as a discriminative method for attributing authorship. However, it has been used in plagiarism detection (e.g. Lambers & Veenman, 2009).
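To make the compression-based idea concrete, the following sketch implements it with Python’s zlib module. The corpus, the choice of compressor, the scoring function, and all names are illustrative assumptions on our part; this is not the implementation of Benedetto et al. (2002) or Marton et al. (2005).

    import zlib

    def compressed_size(text: str) -> int:
        # Number of bytes after compression; more internal redundancy means fewer bytes.
        return len(zlib.compress(text.encode("utf-8")))

    def attribute_by_compression(unknown: str, training_texts: dict) -> str:
        # Attribute the unknown text to the candidate whose training text 'explains'
        # it best: the extra bytes needed to encode the unknown text after the
        # training text are lowest when the compressor can reuse many of that
        # text's frequent sequences, i.e. when the writing is similar.
        best_author, best_score = None, float("inf")
        for author, train_text in training_texts.items():
            score = compressed_size(train_text + unknown) - compressed_size(train_text)
            if score < best_score:
                best_author, best_score = author, score
        return best_author

    # Toy training data (hypothetical).
    training_texts = {
        "author1": "the cat jumped on the table and then the cat slept on the table",
        "author2": "quarterly results exceeded expectations and the board approved the budget",
    }
    print(attribute_by_compression("the cat jumped on the chair", training_texts))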

2.3.2 Feature Types

The field has seen various attempts to find the ultimate style marker for authorship. Different vocabulary richness measures – such as type-token ratio (V/N, or vocabulary size divided by number of tokens), hapax legomena, sentence length, Yule’s characteristic K (Yule, 1938), Honoré’s R (Honoré, 1979), etc. – have been claimed to be reliable markers of style. However, for every study claiming reliability of these naive markers, there are a few suggesting the opposite.

An excellent overview of early style markers such as these can be found in Holmes (1998). The consensus today is that vocabulary richness measures are unreliable when used in isolation, although they can be useful in combination with other feature types. In contemporary research, the most widespread approach is to extract features from the text representing a specific layer of linguistic information (e.g. character, lexical, syntactic, semantic). By extracting from the training documents all words, for instance, and applying a feature selection metric, the most predictive features will come to the surface. A set of predictive features emerging from the training data will then be applied to the unseen test data. Most studies take a bag-of-words approach, hence disregarding the feature’s context. Studies that do want to take the context into account select n-grams of features.

Four main types of features potentially useful for authorship attribution research can be distinguished on the basis of the amount of automatic linguistic analysis (i.e. NLP) required: character, lexical, syntactic, and semantic features.

Character features are the easiest to extract, since they only require a digital version of the text and no NLP. In spite of proven success in language identification in the early 1990s (e.g. Cavnar & Trenkle, 1994; Dunning, 1994), character n-grams have only been applied to the authorship attribution task since the 2000s (e.g. Clement & Sharp, 2003; Keselj et al., 2003; Peng et al., 2003; Stamatatos, 2006; Grieve, 2007; Hirst & Feiguina, 2007). The success of character n-grams can be explained by their ability to capture nuances on different linguistic levels (Houvardas & Stamatatos, 2006), but mainly by their ability to handle limited data.

Lexical features are by far the most commonly used feature type because they only require tokenization and are easier to interpret. Moreover, the idea that they contain interesting stylistic information is rather intuitive. Two main types of lexical features can be distinguished. Function words, in contrast to content words, do not bear any topic information. Content words relate to the author’s stylistic choices, but provide the attribution model with an advantage when topic information is included. Although they are very frequent and occur in every document, function words have been shown to be reliable and informative for authorial style. In fact, they are able to handle limited data reliably, just like character features. Most studies in statistical authorship attribution focus on the distribution of specific function words – usually a predetermined set of determiners, prepositions, pronouns, etc. – as a characteristic of writing style. In computational authorship attribution, we find studies suggesting n-grams of (content) words while admittedly allowing topic information in. While topic information can be desirable when trying to limit the set of candidate authors (as some topic markers may be author-specific), it also decreases the scalability of the resulting model.

Syntactic features, the third type, have been suggested as more reliable than content words since they are not under the conscious control of the author and allow for a level of abstraction from the individual words. From the late 1990s on, advances in NLP have launched the use of syntactic features for authorship attribution. Baayen et al. (1996) was one of the first studies to suggest syntactic features for authorship attribution, by showing that frequencies of rewrite rules were able to reliably distinguish between authors, registers and text types. Whereas rewrite rules require full parsing, most studies that followed Baayen et al. (1996) used output of part-of-speech tagging or shallow parsing (including chunking and identification of grammatical relations such as subject and object) (e.g. Stamatatos et al., 2000; Khmelev & Tweedie, 2001; Kukushkina et al., 2001; Diederich et al., 2003). More recently, rewrite rules have been reintroduced in authorship attribution (e.g. Gamon, 2004; van Halteren, 2007) as a result of improvements in full parsing. One of the more notable and recent studies that adopt syntactic features is van Halteren (2007). In this study, each word is represented by (i) an abstract form that represents the word’s length, frequency class, capitalization, and its last three characters (showing suffix information), (ii) the three most frequent part-of-speech tags for that word, and (iii) n-grams of constituents. The resulting extensive sets of several thousands of features are effectively dealt with by the Linguistic Profiling (LP) method, a technique that compares each of the test documents against a ‘profile’ of the training data per authorship class. In one-vs.-all authorship attribution, this leads to high classification performance. The application of syntactic features is mostly limited to studies in the computational linguistics domain, viz. to those taking a text categorization approach.

Semantic features have been suggested for authorship attribution, but the complexity and relatively low accuracy of automatic semantic analysis have a negative effect on their reliability. The most interesting application of semantic analysis so far can be found in Argamon et al. (2007), where a set of functional lexical features is used to represent the semantic function of each clause in a sentence and text (e.g. conjunction, elaboration, extension).
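As an illustration of the two feature types that are reported to cope best with limited data, the sketch below extracts character n-grams and function-word counts from raw text. The n-gram size, the (tiny) function-word list, and the function names are our own illustrative choices, not those of the studies cited above.

    from collections import Counter

    FUNCTION_WORDS = {"the", "of", "and", "a", "in", "to", "that"}  # illustrative subset

    def char_ngrams(text: str, n: int = 3) -> Counter:
        # Sliding window over the raw character stream, whitespace included,
        # so the n-grams capture lexical as well as word-boundary information.
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    def function_word_counts(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(token for token in tokens if token in FUNCTION_WORDS)

    sample = "The cat jumped on the table."
    print(char_ngrams(sample, n=3).most_common(5))
    print(function_word_counts(sample))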

2.3.3 Feature Selection

When all features have been extracted, feature selection is applied to limit the pool of potentially relevant features. Feature selection is an essential part of every authorship attribution study that starts from a large set of features (e.g. all words in a given data set) and aims to identify the most relevant ones for the task at hand. The frequency of an item – of any type, be it character, lexical, syntactic, or semantic – is the most powerful criterion for selecting features for authorship attribution. The simplest way of performing feature selection is to restrict the set to the n most frequent terms in the data set (e.g. Burrows, 1987, 1992; Hoover, 2003). The resulting reduced set will mainly consist of function words since these occur with high frequency in most text types. Although a simple technique, term frequency continues to dominate the field, as it has since Mosteller & Wallace (1964).


A number of standard methods from Information Theory, such as Information Gain and Entropy (e.g. Houvardas & Stamatatos, 2006), Odds Ratio (e.g. Koppel et al., 2006), Kolmogorov complexity (e.g. Juola, 2008), and chi-squared (e.g. Grieve, 2007; Luyckx & Daelemans, 2008a), have also been applied to the authorship attribution task. Although they are standard feature selection methods, Odds Ratio (Koppel et al., 2006) and Information Gain (Houvardas & Stamatatos, 2006) were found to be less efficient for authorship attribution than simple term frequency in some data sets. One of the more interesting feature selection methods – at least from an interpretative point of view – is the stability method. Meaning-preserving stability (Koppel et al., 2003a) was suggested as a technique to limit the pool of potential features and to measure the presence of stylistic choices in text. Whereas stable words in a sentence cannot be replaced without changing the content, unstable features are likely to reflect the author’s stylistic choice. This feature selection method is corpus-independent and can be applied to any type of features. However, when tested on an authorship attribution task, the stability measure only outperforms other feature selection methods when it is combined with the words’ average frequencies. Although stability seems a promising feature selection technique, its application has been limited to the Koppel et al. (2003a) study. Presumably the explanation for this is the complexity of the approach, which involves Machine Translation for the automatic generation of sentences that convey similar meanings.
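The simplest of these criteria, restricting the feature set to the n most frequent terms in the training data, can be sketched in a few lines. The tokenization, the toy documents, and the function name are illustrative assumptions.

    from collections import Counter

    def top_n_terms(documents, n=100):
        # Return the n most frequent terms over the whole training set; with a
        # sufficiently small n, the surviving terms are mostly function words.
        counts = Counter()
        for document in documents:
            counts.update(document.lower().split())
        return [term for term, _ in counts.most_common(n)]

    documents = ["The cat sat on the mat.", "The dog and the cat played."]  # toy data
    print(top_n_terms(documents, n=5))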

2.4 Summary

In this chapter, we have given an overview of the most commonly applied discriminative methods, feature types, and feature selection methods. As far as the discriminative methods are concerned, we described the three most typical methods. The text categorization method is dominant in CL research, while multivariate statistics and Delta are most prominent in DH research on authorship attribution. Compression methods have been proposed for the task, but have since moved into the background. The field has seen some early attempts to find the ultimate style marker, but the consensus is that such markers should not be used in isolation. Most contemporary studies extract different types of features, such as character n-grams, function words, syntactic features, and semantic features, and apply a feature selection method to limit the pool of potential predictive features. State-of-the-art authorship attribution is dominated by studies applying a text categorization or statistical approach, using Delta and SVMs for classification, function words and syntactic features for the analysis of writing style, and a term frequency threshold for feature selection.

Research in authorship attribution has a long history dating back to the end of the 19th century. In contemporary research, there are two basic perspectives on the task, each with their own research objectives. In Digital Humanities (DH) research, the main focus is on the analysis and comparison of (literary) writing styles, hence on interpretability. In the field of Computational Linguistics (CL), classification performance and robustness of feature types and Machine Learning algorithms are more important than a qualitative analysis of features. This dissertation joins both perspectives in a study of scalability issues with a focus on interpretation as well as performance.


Chapter 3

Methodology and Data Sets

In the first part of this chapter, we describe the baseline approach we take in this dissertation. We begin by explaining the text categorization model our approach is based on, and then go into detail on the different steps of pre-processing, linguistic analysis, feature extraction, classification, and evaluation. The second part of this chapter introduces the evaluation data sets we will use in the following chapters.

In the previous chapter, we sketched the general framework behind authorship attribution and discussed the most commonly applied discriminative methods and feature types. In this chapter, we describe the text categorization methodology adopted in this dissertation (Section 3.1) and introduce the three student essay data sets that will be used in the following chapters to evaluate the scalability of the approach (Section 3.2).

3.1 Methodology

In this dissertation, we approach authorship attribution as a classification task (cf. ‘computational authorship attribution’ in Chapter 2). In Section 3.1.1, we explain how this text categorization approach is applied to the authorship attribution task. The following sections elaborate on the different steps of automatic linguistic analysis (Section 3.1.2), feature engineering (Section 3.1.3), Machine Learning (Section 3.1.4), and evaluation of results (Section 3.1.5). This chapter describes our baseline approach and experimental design. In Chapters 4 to 6, we will sometimes adapt the experimental design, but we will go into detail there.

3.1.1 Text Categorization Approach

Automatic text categorization (Sebastiani, 2002) labels documents according to a set of predefined categories. Most text categorization systems apply a two-stage approach that first extracts features with high predictive value for the categories, and then trains a Machine Learning algorithm to categorize new documents by using the features selected in the first stage. The resulting model is then tested on previously unseen documents (the ‘test data’). Fig. 3.1 shows a visualization of the approach.


Figure 3.1: Visualization of the text categorization approach we apply to the authorship attribution task. Starting from a linguistically analyzed data set, the data is separated in train and test set(s). In a first stage, predictive features are extracted from the linguistically analyzed training data, after which training and test instances are created, based on these features. In the second stage, a Machine Learning model is generated from the training data, in order to be tested on unseen test data.

Stamatatos et al. (2000) and Koppel et al. (2003b) were among the first to apply the text categorization approach – commonly used in topic detection – to the authorship attribution task. This approach has not only been applied with success to authorship attribution (Gamon, 2004; Houvardas & Stamatatos, 2006; van Halteren, 2007; Luyckx & Daelemans, 2008a), but also to the closely-related fields of authorship verification (Argamon et al., 2003a; Koppel & Schler, 2004; Koppel et al., 2007; Luyckx & Daelemans, 2008a), gender prediction (Koppel et al., 2003b), and personality prediction (Mairesse et al., 2007; Nowson & Oberlander, 2007; Luyckx & Daelemans, 2008b).

In most studies, the focus is on supervised categorization of authorship (as opposed to unsupervised categorization), the situation where labelled training data is used to train a Machine Learner. Supervised classification allows for evaluation of classification, and therefore is the best technique to investigate the scalability of the text categorization approach. We perform multi-class authorship attribution. This is a deviation from most other computational studies in the field, since they tend to approach authorship attribution as a binary classification task where a model is trained for each authorship class that distinguishes it from all other classes. Binary classification is dominant because Support Vector Machines (SVMs) – the ML algorithm of choice in the field – consider each classification task as a binary problem. Multi-class classification allows us to train a single model that distinguishes between all authorship classes at the same time.
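The following sketch illustrates the two-stage set-up of Figure 3.1: candidate features are extracted and the n-best are selected on the training data only, after which a learner is trained and applied to held-out test instances. It uses scikit-learn components as a convenient stand-in and is not the TiMBL-based implementation used in this dissertation; the toy texts and parameter values are illustrative.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline

    train_texts = ["text written by the first author ...",
                   "another text written by the second author ..."]   # toy data
    train_labels = ["author1", "author2"]
    test_texts = ["an unseen text of unknown authorship ..."]

    pipeline = Pipeline([
        ("features", CountVectorizer()),                       # stage 1a: extract candidate features
        ("selection", SelectKBest(chi2, k=2)),                 # stage 1b: keep the n-best features
        ("classifier", KNeighborsClassifier(n_neighbors=1)),   # stage 2: train and classify
    ])
    pipeline.fit(train_texts, train_labels)                    # fitted on training data only
    print(pipeline.predict(test_texts))                        # label for the unseen test instance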

3.1.2 Pre-Processing and Linguistic Analysis

In a pre-processing stage, the data is cleaned and analyzed linguistically by means of a shallow parser. First, we remove a number of terms that provide unique identification of authorship, such as author names and dates (e.g. 9-30-03, 10/22/03). It is obvious that this type of information, although a good clue for authorship, would provide the approach with an unfair advantage in the controlled data set that will not scale towards other genres, topics, registers, etc. After this clean-up, the data is converted into UTF-8 format for easy processing, and then it is sent to a parser – a system that performs automatic linguistic analysis. The selection of syntactic features rather than just (n-grams of) words or characters requires robust and accurate text analysis tools such as lemmatizers, part-of-speech taggers, chunkers, etc. Tadpole (Van den Bosch et al., 2007) will tokenize, tag, lemmatize, and morphologically segment word tokens in incoming Dutch text files, and assign a dependency graph to each sentence. We use a predecessor of Tadpole, the Memory-Based Shallow Parser (MBSP) (Daelemans & van den Bosch, 2005), which is available for both English and Dutch. Figure 3.2 shows MBSP sample output for English and Dutch.

English:
The / DT / I-NP / NP-SBJ-1 / the
cat / NN / I-NP / NP-SBJ-1 / cat
jumped / VBD / I-VP / VP-1 / jump
on / IN / I-PP / B-PNP / on
the / DT / I-NP / I-PNP / the
table / NN / I-NP / I-PNP / table
. / . / O / O / .

Dutch:
De / De / LID(bep,stan,rest) / B-NP / I-SU
kat / kat / N(soort,ev,basis,zijd,stan) / I-NP / I-SU
sprong / springen / WW(pv,verl,ev) / B-VP / I-HD
op / op / VZ(init) / B-PP / I-LD
de / de / LID(bep,stan,rest) / B-NP / I-LD
tafel / tafel / N(soort,ev,basis,zijd,stan) / I-NP / I-LD
. / . / LET() / O / O

Figure 3.2: Samples of MBSP output for English and Dutch. The structure for English is as follows: word / part-of-speech / (position in) chunk / lemma. For Dutch, the order is: word / lemma / part-of-speech / (position in) chunk / grammatical relation. The position of a word in a chunk is indicated by B (‘begin of’) or I (‘inside’) a chunk (e.g. an NP).

The different Memory-Based NLP modules that constitute MBSP are:

(i) Tokenization: The text is segmented into sentences. Tokenization includes the identification of a punctuation mark as either a sentence break or part of an abbreviation.
(ii) Lemmatization: For each word in the tokenized text, the lemma is determined.
(iii) Part-of-Speech Tagging: Each word is assigned a part-of-speech tag.
(iv) Chunking: Words are grouped into chunks, such as verb phrase, noun phrase, adjectival phrase, and prepositional phrase chunks.
(v) Identification of Grammatical Relations: Grammatical relations such as head verb, object, and subject are identified.
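For illustration, the sketch below reads one slash-separated token line of the kind shown in Figure 3.2 into a small record. The English sample lines contain five fields; we read them here as word, part-of-speech, chunk, phrase/relation tag, and lemma, which is our own interpretation of the figure, and the function name is hypothetical.

    def parse_mbsp_token(line: str) -> dict:
        # Split one slash-separated token line (English sample of Figure 3.2).
        fields = [field.strip() for field in line.split("/")]
        keys = ["word", "pos", "chunk", "relation", "lemma"]
        return dict(zip(keys, fields))

    print(parse_mbsp_token("jumped / VBD / I-VP / VP-1 / jump"))
    # {'word': 'jumped', 'pos': 'VBD', 'chunk': 'I-VP', 'relation': 'VP-1', 'lemma': 'jump'}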

3.1.3 Feature Engineering

As we have stated in Chapter 2, there is no such thing as an ultimate style marker. In contemporary research, we can discern four types of features that carry potential cues for authorship: lexical, character, syntactic, and semantic features (cf. Chapter 2 for the motivation behind using these features). We report on experiments using the first three types of features, since these are the most commonly applied and more reliable than semantic features, considering the state of the art in semantic analysis. The features we use are listed in Table 3.1. The IDs will be used in Chapters 4 to 6 to refer to the various feature types. We implemented a number of basic lexical features indicating vocabulary richness, like type-token ratio – indicating the ratio between the number of unique words and the total number of words in a text – the Flesch-Kincaid metric indicating the readability of a text, and average word and sentence length. Most of these features are considered unreliable when used by themselves, but they can be useful in combination. We use them as a naive baseline in tok. Other lexical features are content words (cwd), function words (fwd), word n-grams (lex), and n-grams of lemmata (lem). In lex and lem, both content and function words are included. For the chr feature type, we generate n-grams of characters. Character n-grams have been proven useful for Language Identification (Cavnar & Trenkle, 1994), Topic Detection (Clement & Sharp, 2003), and Authorship Attribution (Keselj et al., 2003; Grieve, 2007; Hirst & Feiguina, 2007). They are able to reliably handle limited data, which is why we test them for short text authorship attribution.
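A minimal sketch of the naive tok features is given below. It uses a crude regular-expression tokenization rather than the MBSP tokenization used in the actual experiments, and it omits the Flesch-Kincaid readability score, which additionally requires syllable counts; the function name is ours.

    import re

    def tok_features(text: str) -> dict:
        # Crude sentence and word segmentation for illustration only.
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        types = {w.lower() for w in words}
        return {
            "type_token_ratio": len(types) / len(words),                  # V/N
            "avg_word_length": sum(len(w) for w in words) / len(words),
            "avg_sentence_length": len(words) / len(sentences),
        }

    print(tok_features("The cat jumped on the table. The cat slept."))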


As syntactic features, we select part-of-speech (or PoS) n-grams. These are implemented in two ways: as fine-grained (pos) and as coarse-grained PoS (cgp). Fine-grained PoS tags provide more detailed information about the subcategorization properties and morphological properties of words. Depending on the language of the data set, only coarse-grained PoS (for English) or both types (for Dutch) are available. N-grams of chunks and grammatical relations (e.g. subject, object, main verb) are also included in the experiments. The last feature type is lexpos, a simple concatenation of the lex and pos feature types (e.g. book N).

ID       Feature                          Type
tok      Type-token ratio V/N             Lexical
         Avg. word length
         Avg. sentence length
         Readability
cwd      Content words                    Lexical
fwd      Function words                   Lexical
lex      Word n-grams                     Lexical
lem      n-grams of lemmata               Lexical
chr      Character n-grams                Character
cgp      Coarse-grained PoS n-grams       Syntactic
pos      Fine-grained PoS n-grams         Syntactic
chu      Chunk n-grams                    Syntactic
rel      Grammatical relations            Syntactic
lexpos   Concatenation of lex and pos     Syntactic

Table 3.1: Features and feature types used in this study.

We use chi-squared as a baseline feature selection method. Equation 3.1 shows how it is calculated. Chi-squared (χ²) calculates, for all items (i) (e.g. words) in the entire data set, the expected (E) and observed frequencies (O) per authorship category (n). Observed frequency represents the item’s absolute frequency in a category, while the expected frequency takes into account the number of words in that category as compared to the number of words in the full data set. The resulting chi-squared score is an indication of how well the term frequency corresponds with the expected frequency. The sum of all chi-squared scores (i.e. as many scores as there are categories) for an item determines the item’s position in a ranking, hence allowing us to select the most representative items for the task at hand. This metric has been used in several studies in text categorization in general (Yang & Pedersen, 1997), and in authorship attribution specifically – a recent example is Grieve (2007).

χ² = Σ_{i=1}^{n} (O_i − E_i)² / E_i        (3.1)
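A direct transcription of Equation 3.1 is sketched below. Observed frequencies are the item’s absolute counts per authorship category; expected frequencies distribute the item’s overall count in proportion to each category’s share of the total word count, which is our reading of the definition above. Tokenization and the toy texts are illustrative.

    from collections import Counter

    def chi_squared_scores(category_texts: dict) -> dict:
        # category_texts maps an authorship category to its (concatenated) training text.
        observed = {cat: Counter(text.lower().split()) for cat, text in category_texts.items()}
        category_sizes = {cat: sum(counts.values()) for cat, counts in observed.items()}
        total_words = sum(category_sizes.values())
        vocabulary = set().union(*observed.values())

        scores = {}
        for item in vocabulary:
            item_total = sum(observed[cat][item] for cat in observed)
            score = 0.0
            for cat in observed:
                expected = item_total * category_sizes[cat] / total_words
                score += (observed[cat][item] - expected) ** 2 / expected
            scores[item] = score        # sum over all categories, used for ranking
        return scores

    texts = {"author1": "the old castle and the brave knight",
             "author2": "the broken modem and the slow network"}
    ranking = sorted(chi_squared_scores(texts).items(), key=lambda kv: -kv[1])
    print(ranking[:3])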

The training instances are numeric feature vectors that represent term frequencies of each of the selected features in the text sample, followed by the author label. All frequencies are normalized for text length.

As far as the representation of the original documents in training is concerned, there are two basic approaches: instance-based and profile-based. In text categorization research, the instance-based approach (not to be confused with instance-based learners such as kNN) is most commonly applied. Such an approach represents each text sample separately as an instance in the training set. A profile-based approach cumulates all text samples from a specific author into a ‘pseudo-document’ and creates a single instance based on that profile of the author’s writing style. While the profile-based approach allows for a representation of writing style that is less susceptible to noise, the instance-based approach has the advantage of being fine-grained. Individual differences are all represented equally in the instance-based approach, while the profile-based approach averages out the small differences. In this dissertation, we adopt the instance-based approach, since it is more commonly applied in computational authorship attribution than the profile-based approach.

In order to obtain a set of instances for each authorship class (even with a single text per author available), every text is fragmented into ten variable-length fragments (referred to as FLEX). That way, we will be able to learn from a data set of a single text per author. This results in an equal number of training instances for each candidate author. It is important to note that, since fragmentation is done randomly, we do not try to end each fragment with a sentence boundary, which may affect syntactic features to a small extent.
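The instance creation step can be sketched as follows. For simplicity the sketch cuts each token list into ten equal-sized contiguous fragments rather than the variable-length (FLEX) fragments described above, and the selected feature list is an illustrative assumption.

    def fragment(tokens, n_fragments=10):
        # Contiguous fragments; no attempt is made to respect sentence boundaries.
        size = len(tokens) // n_fragments
        return [tokens[i * size:(i + 1) * size] for i in range(n_fragments)]

    def to_instance(fragment_tokens, selected_features, author):
        # Relative frequency of each selected feature, followed by the class label.
        length = len(fragment_tokens)
        vector = [fragment_tokens.count(feature) / length for feature in selected_features]
        return vector + [author]

    tokens = ("the cat jumped on the table because the dog slept " * 20).split()
    selected_features = ["the", "on", "because"]          # illustrative selected features
    instances = [to_instance(frag, selected_features, "author1") for frag in fragment(tokens)]
    print(len(instances), instances[0])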

3.1.4 Machine Learning for Text Categorization

In discriminative Machine Learning, a distinction is made between eager learning methods and lazy learning methods. Eager learners abstract away from the training data to learn a model and apply the model to new data during testing. Lazy learners simply store training data at learning time, and use local similarity-based extrapolation during testing. It has been argued that lazy learning is at an advantage in language learning as it does not abstract from (potentially useful) low-frequency and low-typicality instances (Daelemans & van den Bosch, 2005). We test this claim in Chapter 6, where we investigate the effect of data size and potential robustness to that effect in Machine Learning algorithms.

We perform ten-fold cross-validation (Weiss & Kulikowski, 1991), a technique generally applied in text categorization and Machine Learning research. Ten equally sized partitions (aka. folds) are created randomly from the data. Per fold, the model is trained on nine partitions, and tested on the remaining partition. This way we ensure that there is no overlap between training and test data, and that all data is used for testing and training. We use a version called stratified cross-validation, a scheme that allows for balance in terms of authorship classes over the folds.

For classification, we experiment with lazy supervised learning. We use Memory-Based Learning (MBL) as implemented in TIMBL (Tilburg Memory-Based Learner) (Daelemans & van den Bosch, 2005), an open-source supervised inductive algorithm for learning classification tasks based on the k-Nearest Neighbor (kNN) algorithm with various extensions for dealing with nominal features and feature relevance weighting. Memory-Based Learning stores feature representations of training instances efficiently in memory without abstraction and classifies new instances by matching their feature representation to all instances in memory. From the closest instances (the ‘nearest neighbors’), the class of the test item is extrapolated. We use TIMBL version 6.1 (Daelemans et al., 2007).
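The evaluation loop around the learner can be sketched as below. Scikit-learn’s stratified ten-fold splitter and 1-nearest-neighbor classifier are used here as a convenient stand-in for the TiMBL set-up described above; the random instances stand in for the normalized feature vectors of Section 3.1.3.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.neighbors import KNeighborsClassifier

    # Toy instance matrix: 10 authors x 10 fragments, 20 normalized feature frequencies each.
    X = np.random.rand(100, 20)
    y = np.repeat(["author%d" % i for i in range(10)], 10)

    accuracies = []
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    for train_idx, test_idx in folds.split(X, y):
        knn = KNeighborsClassifier(n_neighbors=1)      # stand-in for TiMBL's k-NN
        knn.fit(X[train_idx], y[train_idx])
        accuracies.append(knn.score(X[test_idx], y[test_idx]))
    print("average accuracy over the ten folds:", sum(accuracies) / len(accuracies))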

3.1.5 Evaluation

Performance of the text categorization approach is evaluated by looking at standard evaluation metrics. Accuracy is used to indicate the number of correctly classified instances over the total number of test instances (i.e. (TP + TN) / (TP + FN + FP + TN)). We evaluate scores resulting from the application of k-fold cross-validation by computing the number of True Positives (TP) and True Negatives (TN) over all folds and experiments and calculating the average accuracy. Figure 3.3 shows a confusion matrix for authorship attribution of two candidate authors (aka. two-way authorship attribution). In multi-class classification, the matrix expands with every added authorship class.

                              PREDICTED CLASS
                              author x    author y
EXPERT CLASS    author x      TP          FN
                author y      FP          TN

Figure 3.3: Confusion matrix for two-way authorship attribution.

A second type of evaluation is qualitative rather than quantitative. In this dissertation, we provide an in-depth and critical analysis of the features selected for the attribution model. Such an evaluation is innovative in authorship attribution, since most studies tend to avoid discussing the predictive features that were used for training and classification of unseen text. We will elaborate on this in Chapter 4.


3.2 Evaluation Data Sets

Now that we have introduced the baseline methodology for the experiments in Chapters 4 to 6, we can describe the data sets we use for the evaluation of the viability of the methodology for large-scale authorship attribution. First, we describe the motivation behind our selection of data sets (Section 3.2.1), and then we go into detail on each of them in terms of background, characteristics, and related research (Sections 3.2.2 to 3.2.4).

3.2.1 Requirements and Motivation

In order to evaluate the approach’s scalability toward large-scale authorship attribution, we test our hypotheses on three evaluation data sets for short text authorship attribution. In most Natural Language Processing tasks (e.g. parsing, word sense disambiguation, coreference resolution) and text categorization tasks (e.g. sentiment mining), a lot of effort is invested in (manual and/or automatic) annotation. In authorship attribution, however, the creation of a data set is relatively straightforward since we only need authorship metadata. Each of the three data sets contains fixed-topic student essays written during the same time period (often a semester during the academic year) and by students with similar education level. The acquisition of these student essays was done in the context of a university course.

The structure of the data set – in terms of author set size, data size, and number of topics – is an important issue in authorship attribution. Stamatatos (2009) states that, in order to ensure that authorship would be the most important discriminatory factor between the texts, a good evaluation corpus should be controlled for genre and topic. The ideal corpus would also be controlled for factors like age, gender, education level, nationality, etc., as well as the time period in which the texts were written – in order to avoid stylistic changes over time. The three data sets used in this dissertation conform to a great extent to the ideal evaluation corpus as described here. However, we do not control for age or gender in the data sets – in fact, we only have that information for one of the data sets, but we decided against using it – because we consider these to be inseparable from the author’s identity. It is important to remark here that, although these evaluation data sets are ideal for discovering those features that are relevant for authorship attribution and for benchmarking purposes, their often strictly controlled structure will not be found in the wild. The artificial structure is a simplification of the task since each of the factors interacting with authorship will have a specific influence on performance.

In Table 3.2, an overview of the language, type of data, author set size, and data size of the different data sets is given. Two of them are in Dutch (one from the Netherlands, one from the Flanders region of Belgium), and the other one is in (American) English. One data set is single-topic, while the others contain multiple topics, a set-up that allows us to investigate the effect of having multi-topic data on performance and feature selection (cf. Chapter 4). Each author in a data set is represented by the same number of texts in the same topics. One exception to that rule can be found in AAAC A, where one text is missing for one of the authors.

Data set    Language   Authors   Topics/author   Docs   Words     Words/topic (average)
PERSONAE    Dutch      145       1               145    205,277   1,413
AAAC A      English    13        4               51     43,497    844
ABC NL 1    Dutch      8         9               72     72,721    1,017

Table 3.2: General information concerning the evaluation data sets used in this dissertation.

As far as the amount of data per author is concerned, the three data sets allow for an interesting comparison. On the one hand, in ABC NL 1, each author is represented by more than 9,000 words, close to the traditional description of a reliable minimum (Burrows, 2007). On the other hand, AAAC A and PERSONAE only have 3,000 or 1,400 words per author available, respectively. Both ABC NL 1 and AAAC A contain respectively nine and four texts per author, while there is only one text per author available in PERSONAE. This results in instances that represent very short – about 100 words in length – fragments of text. Short text authorship attribution is seen here as an approximation of the length of the texts found online (e.g. blogs, e-mails).

3.2.2 Ad-Hoc Authorship Attribution Competition – Problem Set A (AAAC A)

In 2004, the ‘Ad-Hoc Authorship Attribution Competition’ (AAAC) was organized, inviting researchers in the field to participate and submit classification results for (some of) a collection of thirteen authorship attribution problem sets. The AAAC corpus (Juola, 2004) included data sets in contemporary English, Middle English, French, Serbian-Slavonic, Latin, and Dutch. The goal was to provide benchmark data sets for authorship attribution, but some of the problem sets have been criticized for their limited size (cf. Jockers & Witten, 2010). For our experiments, we select problem set A (AAAC A), a data set acquired by the organizer. Thirteen students were asked to write four essays on the following topics: work (T1), the Frontier Thesis (T2), the American Dream (T3), and national security (T4). In AAAC A, there is one author who wrote three instead of four essays. The genre of the essays is argumentative non-fiction. This data set was considered, by the AAAC participants as well as the organizer, to be a difficult problem, and reported unsolvable by many participants (Juola, 2008, p.290). In the framework of the competition, top performance in thirteen-way authorship attribution was 85% or eleven out of thirteen correct attributions. The technique used was the Common N-Grams Method with weighted voting as suggested in Keselj et al. (2003) on profiles (cf. profile-based approach) in combination with kNN for classification. Since 2004, it has been used for experiments in Luyckx & Daelemans (forthcoming) only, as far as we know.

3.2.3 Dutch Authorship Benchmark corpus (ABC NL 1)

For Dutch, we selected the Dutch Authorship Benchmark corpus (ABC NL 1) (Baayen et al., 2002). Eight students (undergraduates in Dutch literature) were asked to write essays in three genres, on nine topics in total:

Argumentative non-fiction: Essays about the television programme ‘Big Brother’ (T1), health risks of smoking (T2), and the unification of Europe (T7).
Descriptive non-fiction: Essays about football (T3), a recent book the students read (T4), and the upcoming new millennium (T8).
Fiction: Retelling the fairy tale of Little Red Riding Hood (T5), a chivalry romance (T6), and a murder story taking place at the university (T9).

The ABC NL 1 data set was also incorporated in the AAAC competition, as problem set M. The best scoring approach in the competition, Linguistic Profiling (van Halteren, 2007), correctly classified 88% of the test documents. Outside the competition, ABC NL 1 has been used primarily by the researchers who designed it (Baayen et al., 2002; Juola & Baayen, 2005; van Halteren et al., 2005; van Halteren, 2007), and in Luyckx & Daelemans (forthcoming).

3.2.4 Personae corpus (PERSONAE)

The PERSONAE corpus (Luyckx & Daelemans, 2008b) consists of student essays by 145 BA-level students. Each author wrote about the same topic, a documentary on Artificial Life. These essays contain a factual description of the documentary and the student’s opinion about it (i.e. argumentative non-fiction). The students also took an online Myers-Briggs Type Indicator (MBTI) (Briggs Myers & Myers, 1980) personality test and submitted their profile, the text and some user information via a website. All students released the copyright of their text and explicitly allowed the use of their text and associated personality profile for research, which makes it possible to distribute the corpus. The corpus can not only be used for authorship attribution and verification experiments, but also for personality prediction. Apart from the Luyckx & Daelemans (2008b) and Luyckx & Daelemans (forthcoming) studies, no other authorship attribution studies have used this data set.


3.3 Summary

In this chapter, we introduced our approach to the authorship attribution task. We adopt a two-stage text categorization approach that uses shallow parsing output to allow the extraction of different levels of linguistic information from text, and combines it with a Machine Learning algorithm for classification. We apply stratified cross-validation, a technique that allows us to assess performance outside the controlled data set. The original texts are split into ten equally-sized text samples so that they can be represented as individual instances and contribute individually to the attribution model. The resulting short text samples are considered an approximation of the average text length of a blog post or e-mail. We also described the motivation behind and structure of the three evaluation data sets we will be using for the experiments in the next three chapters. In these chapters, we will deviate from the baseline approach from time to time. We will go into detail on the specific design of the experiments in those chapters.


Part II

Scalability in Authorship Attribution

Chapter 4

The Effect of Experimental Design in Multi-Topic Data

In this chapter, we investigate the crucial role of experimental design when working with multi-topic data. Since lexical features often relate to topic as well as authorship, including these features in the model implies a risk to its scalability. In order to make the attribution model more robust, we explore two aspects of experimental design that have an effect on the type of information in the model: feature selection and cross-validation. We investigate variations of the standard approaches in order to assess whether they allow us to increase the scalability of the model.

In Part I, we provided a description of the state of the art in authorship attribution (Chapter 2), an introduction of the text categorization approach we take, and a description of the data sets for authorship we work on (Chapter 3). In Part II, we present a systematic investigation of three crucial aspects of the authorship attribution task that greatly affect the scalability of the approach, but are overlooked in many other studies. Without addressing these issues, it is impossible to claim superiority of any approach to authorship attribution. Our aim is to contribute to establishing benchmarks in the field. The aspects dealt with are experimental design in multi-topic data (this chapter), author set size (Chapter 5), and data size (Chapter 6).

In this chapter, the focus is on the importance of methodological design in multi-topic authorship attribution. We investigate the behavior of our text categorization approach when confronted with multi-topic data. In the design of an experiment on multi-topic data, a number of decisions need to be made, each with a specific influence on scalability. We implement the most important methodological decisions in two stages of the text categorization approach that affect the selection of features. Our aim is to include content words in the model without affecting scalability. An in-depth feature analysis increases our insight into the behavior of our approach.

This chapter is structured as follows. First, we introduce the focus of this chapter and formulate research questions (Section 4.1). After that, the set-up of the experiments is described (Section 4.2). Then, we explore and introduce the different approaches towards experimental design for multi-topic data and test their effectiveness in dealing with the effect of topic so that the resulting model is scalable (Sections 4.3 and 4.4). Finally, we formulate conclusions and describe the experimental design for the next chapters (Section 4.5).

4.1 Introduction and Research Questions

In spite of the long history of computational authorship attribution and various attempts to establish benchmarks for predictive features (e.g. Grieve, 2007) and discriminative methods (e.g. Jockers & Witten, 2010), most studies only scratch the surface of the task. They not only ignore the notion of scalability towards larger sets of candidate authors (cf. Chapter 5) and smaller or larger sets of data (cf. Chapter 6), but also disregard replicability. Since most studies are imprecise about their experimental design and because benchmark data sets are still scarce (or scarcely used), each study makes its own decisions in terms of experimental design and creates its own evaluation data set. These factors make comparison and evaluation of approaches an almost impossible task.

In addition, topic, while one of the most crucial factors interfering with authorship characteristics, is overlooked in most authorship attribution studies. Nevertheless, including topic in the attribution model has a substantial effect on its scalability. Consider a data set in which author1 writes about topics A and B, and author2 about topics B and C. If the held-out test instance (in reality by author1) is about topic C, an attribution model should be able to deal with the effect of topic in order to find the correct authorship label. When topic plays a role in the attribution model, classification is facilitated since the model represents author as well as topic characteristics. In other cases, the resulting model will cause confusion and have a negative effect on performance. In addition, the model will be unreliable when tested on other topics or data sets, since most of the features will be topic-specific and are unlikely to occur in the unseen data.

Without a systematic study of the behavior of the approach when confronted with multi-topic data and a comparison of the effect of experimental design in this respect, we cannot reliably evaluate the merits and scalability of an approach. In order to establish benchmarks for authorship attribution, the topic influence factor needs to be investigated and controlled.

Recently, it has been claimed that stylistic features – i.e. features tested for use in authorship attribution, such as measures of vocabulary richness, function words, or syntactic features (cf. Chapter 2, Section 2.1) – aid topic detection performance, meaning they have subject-revealing power, according to Argiri (2006, p.30). When these features were added to a set of topic-specific features, performance increased. Argiri (2006) suggests a high correlation between features for authorship attribution and features for topic detection.


In this chapter, we focus on aspects of experimental design that allow us to increase the scalability of the approach. We present a systematic study of the different decisions when designing an experiment in multi-topic authorship attribution, and provide an in-depth analysis of the features resulting from their application. In this section, we will start by explaining to what extent the field struggles with multi-topic data (Section 4.1.1), introduce two aspects of experimental design that are crucial when dealing with multi-topic data (Section 4.1.2), and formulate research questions (Section 4.1.3).

4.1.1 Working with Multi-Topic Data

In most authorship attribution studies, topic does not emerge as an issue (e.g. Baayen et al., 2002; Argamon et al., 2003b; Koppel & Schler, 2003; Luyckx & Daelemans, 2008b). This is due to the focus being on corpora controlled for topic, a decision instigated by Rudman (1998). Although Rudman advocates genre and time period control for authorship attribution data sets, this soon extended to topic as well, indicating a general apprehension about the effect of topic. Controlling for various factors in supervised learning of authorship is a widely accepted approach, allowing us to investigate fundamental problems. However, in order to assess the applicability of an approach on a large scale, we need to evaluate its behavior when confronted with multi-topic data.

In contrast to the observation in Kessler et al. (1997) that topicality – as opposed to genre – is well-explored territory, we perceive that authorship attribution struggles with the notion of topic. Some studies regard data taken from the same newspaper section (e.g. Sanderson & Guenter, 2006) or online discussion group (e.g. Argamon et al., 2003c; Madigan et al., 2005) as single-topic, while others talk about topic classes consisting of several subtopics (e.g. Mikros & Argiri, 2007; Stamatatos, 2008). In this dissertation, we work with data sets consisting of student essays (viz. ABC NL 1, AAAC A, and PERSONAE), and use the notion of topic to refer to the topic assigned by the lecturer (cf. Chapter 3). Disregarding the actual outcome of the assignment is an experimental decision that allows us to analyze the effect of topic in a controlled but nevertheless artificial setting. Outside the experimental context (e.g. in real-life or online applications of authorship attribution), it is impractical to attempt to describe or identify the topic of each text and cluster them into topic classes. Instead, we consider the number of topics in real-life data to be in direct proportion to the number of texts in that data set.

When dealing with multi-topic data, the use of function words instead of content words is the ideal technique to avoid the influence of topic (e.g. Mosteller & Wallace, 1964; Burrows, 1987; Argamon & Levitan, 2005; Miranda-García & Calle-Martín, 2007). In the field, there is a consensus that function words are topic-neutral, highly frequent, and not under the author’s conscious control. Early, naive attempts to measure style, such as measures of vocabulary richness (cf. Chapter 2), are also considered topic-neutral, but have fallen into disuse because of their sensitivity to text length.

However, it has been shown that hardly any of these topic-neutral features are really topic-neutral, since topic is hard to ‘separate’ from style. Baayen et al. concluded after successful experiments with function words that ‘style and content are intertwined to a greater extent than we had previously thought’ (Baayen et al., 2002, p.74). This concern was confirmed by Clement & Sharp (2003), and more recently by Mikros & Argiri (2007), who found that most so-called topic-neutral features are correlated with topic as well as authorship. The authors advise that the exploitation of these topic-neutral features ‘should be done with caution’ (Mikros & Argiri, 2007, p.29). Diederich et al. (2003) demonstrates the advantage of using content words instead of function words and syntactic features, since the differences in performance are substantial. Zhao & Zobel (2005) observe that the results presented in Diederich et al. (2003) are not reliable because of the presence of content words, and for that reason confine their study to function words. It is clear that reliability and scalability are considered at risk when topic is included in the attribution model.

Apart from the interaction between style and content in function words, there is another important downside. The a priori exclusion of content words from the attribution model keeps out potentially interesting authorial information in those content words. Studies that do include content words often admit to accepting the presence of topic influence in the model. In some cases, this may be a desirable effect. In Koppel et al. (forthcoming), for instance, character n-grams are applied to authorship attribution ‘in the wild’, but not without noting that ‘character n-gram statistics capture both aspects of document content and writing style. Although this distinction is often an important one in authorship studies, we do not dwell on it in this paper. For our purposes, we do not particularly care if attributions are based on style or content or both.’ (Koppel et al., forthcoming, p.4). Madigan et al. (2005) hint at the importance of providing ‘topic-free’ features in the bag-of-words representation of documents, and suggest the use of n-grams of suffixes, prefixes, and part-of-speech information.

An interesting approach to the selection of predictive features over all levels of linguistic information was suggested in Koppel et al. (2003a): the meaning-preserving stability (or stability) measure. Whereas stable words in a sentence cannot be replaced without changing its content, unstable features are likely to reflect the author’s stylistic choice. This feature selection method is corpus-independent, as sentences are extracted from a reference corpus and Machine Translation is used (English → language-x → English) to produce semantically equivalent sentences. The method can be applied to function words, content words, syntactic constructs, etc. Examples of unstable features are function words such as ‘over’ and ‘out’, verbs ‘has’ and ‘been’, and syntactic constructs like ‘noun noun noun’. Testing the technique in authorship attribution and gender prediction shows that the stability measure only outperforms other methods when it is combined with the word’s average frequency in a reference or training corpus. This study shows an interesting approach to including content words in the attribution model, although no actual examples of content words have been provided.

Character n-grams (e.g. Clement & Sharp, 2003; Keselj et al., 2003; Peng et al., 2003; Stamatatos, 2006; Grieve, 2007; Hirst & Feiguina, 2007; Luyckx & Daelemans, forthcoming) could offer a good alternative to function and content words since they capture nuances on the different linguistic levels (viz. lexical as well as syntactic) (Houvardas & Stamatatos, 2006), and are able to deal with limited data, in contrast to content words. In that respect, character n-grams combine characteristics of both function words and content words.

4.1.2 Experimental Design

In this chapter, we investigate how content words can be included in the attribution model without involving a risk to its scalability towards other topics. We zoom in on two aspects in the set-up of a text categorization based experiment in authorship attribution that we consider crucial when working with multi-topic data since they affect the selection of features for the attribution model.

A first aspect is the choice of feature selection method for restricting the feature set to the features with highest predictive power for a given set of candidate authors. In text categorization and Machine Learning research, there have been various studies focusing on that aspect. In authorship attribution, however, a systematic comparison of feature selection methods is still lacking. In multi-topic data, we want to avoid selecting topic-related words that have low probability of occurring in a text on a new, previously unseen topic.

A second aspect of experimental design is cross-validation (Weiss & Kulikowski, 1991). K-fold cross-validation (CV), a technique commonly applied in Machine Learning research, is omnipresent in computational authorship attribution studies. Essentially, the technique selects k subsets of randomly selected documents for training and test. By ensuring that every document is tested once and occurs in training k-1 times, cross-validation provides a more reliable estimation of the accuracy of a system than when a system would be evaluated only once on a specific train-test set. The probability of selecting a fortuitous train-test set is balanced out by the k random subsets. One type of cross-validation scheme has been developed specifically for use in multi-topic data: held-out topic cross-validation. This scheme trains on all-but-one topics and tests on the remaining topic. This is cross-validated, so that each topic is held out once. Some authorship attribution studies apply this scheme, but the use of stratified CV or a single-topic set-up is widespread. Each of the proposed schemes is useful for a specific scenario in multi-topic authorship attribution, but as far as we know, comparison of the schemes as well as a detailed analysis of the features resulting from their application are still missing.
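The held-out topic scheme can be sketched as a simple grouping of instances by topic: train on all-but-one topics and test on the remaining one, rotating until every topic has been held out once. The instance representation and all names below are toy assumptions.

    def held_out_topic_folds(instances):
        # instances: list of (feature_vector, author, topic) triples.
        topics = sorted({topic for _, _, topic in instances})
        for held_out in topics:
            train = [(x, y) for x, y, t in instances if t != held_out]
            test = [(x, y) for x, y, t in instances if t == held_out]
            yield held_out, train, test

    data = [([0.1, 0.2], "author1", "T1"), ([0.3, 0.1], "author2", "T1"),
            ([0.2, 0.2], "author1", "T2"), ([0.4, 0.0], "author2", "T2")]
    for topic, train, test in held_out_topic_folds(data):
        print("held-out topic:", topic, "| train size:", len(train), "| test size:", len(test))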


4.1.3 Research Questions

In this chapter, we focus on two stages in experimental design that are typical for a text categorization experiment (cf. our approach to authorship attribution; Chapter 3), but have a large impact on the ability of the approach to deal with multi-topic data. So far, the field of authorship attribution has not seen a systematic comparison of decisions in experimental design in terms of performance or ability to increase scalability. We also present an in-depth qualitative analysis of the features selected in order to investigate their scalability towards other topics. This type of analysis is typically lacking in authorship attribution studies, or restricted to an analysis of function words (e.g. Koppel et al., 2003a). In fact, many studies focus on the performance of specific feature types or ML algorithms, but refrain from going into detail about the features selected. The research questions we address in this chapter are: Q1 What is the effect on scalability of decisions in experimental design? Q2 What is the best technique to increase the scalability of the approach towards other topics? Q3 Is it possible to use content words in multi-topic authorship attribution without reducing scalability? The focus of this chapter will be on these aspects of experimental design: a. Feature selection (Section 4.3) Various techniques have been suggested to perform feature selection. We test whether any of the commonly used methods allow for reliable selection of lexical features other than function words and provide an in-depth analysis of the behavior of these methods when confronted with multi-topic data. b. Cross-validation schemes (Section 4.4) A solution to dealing with multi-topic data is often sought in variations of the standard cross-validation scheme (e.g. held-out topic experiments). We test their effectiveness in dealing with the effect of topic. We consider the ideal technique for minimizing the effect of topic and increasing scalability to be knowledge-light. This is important when envisaging real-world applications of authorship attribution, for instance in situations with a lot of data and a large number of candidate authors (e.g. on the web).



4.2 Experimental Matrix and Baseline In the following sections, we will compare the performance of the different techniques with our standard approach, as described in Chapter 3. Table 4.1 shows the standard approach, serving as our baseline, and the sections where we will explore the different options for each aspect of experimental design – indicated with a question mark. When analyzing the effect of cross-validation schemes, for instance, we keep the other factors constant to allow for a straightforward comparison of the various schemes. In all experiments, an instance-based approach is taken (cf. Chapter 3).

              Feature selection   Cross-validation
Baseline      chi-squared         stratified
Section 4.3   ?                   stratified
Section 4.4   chi-squared         ?

Table 4.1: Experimental matrix and baseline for the experiments.

In each of the following sections, the focus is on a different aspect of experimental design. Investigating one aspect implies the other one remains stable. For classification, we use Memory-Based Learning (MBL) as implemented in TIMBL for numeric features with default settings. We report on experiments with two multi-topic data sets in two languages with the maximum number of authors (cf. Chapter 3, Section 3.2 for an overview of the data sets) and eighteen different feature types (cf. Chapter 3, Section 3.1 for an overview of feature types). For ABC NL 1 (Dutch), we report on experiments in 8-way authorship attribution, and for AAAC A (English) on 13-way authorship attribution. ABC NL 1 consists of texts written on nine topics (and three genres), and AAAC A contains data in four topics. Baselines are 12.50% (1/8 correct) for ABC NL 1 and 7.69% (1/13 correct) for AAAC A. Table 4.2 shows the various genres and topics in the two multi-topic data sets.

Data set   Genre                       Topics
ABC NL 1   argumentative non-fiction   T1 Big Brother, T2 health risks of smoking, T7 unification of Europe
           descriptive non-fiction     T3 football, T4 book review, T8 millennium
           fiction                     T5 fairy tale about Little Red Riding Hood, T6 chivalry romance, T9 murder story at the university
AAAC A     descriptive non-fiction     T1 work, T2 Frontier Thesis, T3 American Dream, T4 national security

Table 4.2: Genres and topics in ABC NL 1 and AAAC A.


4.3 Feature Selection Methods

Feature selection allows us to narrow down the initial feature set by taking into account the frequency distributions and tendencies in the data. Although feature selection is an important step in each text categorization task, the field of authorship attribution has not yet seen a systematic evaluation of how feature selection methods behave when confronted with multi-topic data, in terms of performance and scalability.

The relation between features and frequency has been an issue since the start of non-traditional (i.e. using methods other than in-depth reading) authorship attribution. Authorship attribution has been claimed successful with highly frequent words – more specifically function words (e.g. Argamon & Levitan, 2005; Zhao & Zobel, 2005; Miranda-García & Calle-Martín, 2007) – as well as with hapaxes in a few early studies (cf. Holmes, 1994). Hapaxes – words that occur once in a data set – often indicate noise and therefore fail to scale towards other topics or data sets. The use of function words is appealing because of their ability to deal with limited data, but they offer a very restricted representation of the data. There is no doubt that interesting stylistic information is contained in content words. The use of content words in authorship attribution has been limited to a few studies that admittedly include topic in their model (e.g. Koppel et al., forthcoming). Allowing content words in the attribution model necessitates a study of the feature selection methods used to construct the model and of its accuracy and scalability.

In this section, we start by introducing the various feature selection methods (cf. Section 4.3.1). Then we analyze their scalability towards the test set (cf. Section 4.3.2), and zoom in on their scalability towards other topics (Section 4.3.3). Finally, we determine the best technique that allows us to include topic-free content words without reducing scalability (Section 4.3.4).

4.3.1 Introducing the Methods

In authorship attribution, there are no studies that compare the different feature selection methods, as far as we know. For that reason, we turn to text categorization in general, and topic detection in particular. In Yang & Pedersen (1997), for instance, five feature selection methods for text categorization were compared in terms of performance and characteristics: document frequency threshold (DF), information gain (IG), chi-squared (χ2), mutual information (MI), and term strength (TS). The first three methods achieve excellent classification performance with kNN. In addition, these methods promote high-frequency content words – note that a stop word list has been applied to exclude highly frequent function words.


IG and χ2 are equally reliable according to this study. In fact, they have similar properties in that they use category information and take term absence into account. In addition, the study shows that DF is a simple but effective method for reducing the feature set. However, Forman (2003) observes that χ2 is not effective, and that IG outperforms χ2 in a collection of benchmark problem sets. More general overviews of feature selection methods can be found in Sebastiani (2002) and Guyon & Elisseeff (2003).

A first feature selection method we will test was designed specifically for the authorship attribution task. We divide the full set of features into three classes or frequency strata (viz. LOW F, MID F, HIGH F), a technique suggested in Burrows (2007). That study shows, by means of experiments with the Delta metric, that evidence of authorship is present in every frequency stratum, while studies often ignore the second frequency stratum (MID F). LOW F consists only of hapaxes – words that occur once or twice in the entire data set. In the experiments discussed below, we use all features in a given frequency stratum, without applying any additional feature selection method (such as IG or χ2) so that we can evaluate the absolute effect of these frequency strata. We use frequency strata as a naive baseline.

The Information Gain (IG) metric is a standard information theory procedure that indicates the informativeness of a feature given a classification task. Feature values with low Entropy – a measure for the degree of surprise in the probabilities of each feature value given the different classes (Shannon, 1948) – have high IG. There is no real consensus concerning a frequency threshold to make IG more reliable. In our experiments, we use a version without a threshold, and one with a term frequency threshold of 1 so that hapaxes are not included in the model (+TF).

A third method, the chi-squared (χ2) metric, indicates to what extent a relation or dependency exists between a term and a class by taking into account the expected and observed frequencies of a term given that class (cf. Chapter 3, Formula 3.1). The resulting chi-squared scores per class and term are summed and sorted. Terms with high ranking are more indicative of a dependency between a term and a class than those with low ranking. Again, we test a version without a threshold, and one with an expected frequency threshold (+EXPTF). It has been shown that chi-squared is unreliable when the expected frequency is lower than five (Butler, 1985, p. 177). For that reason, we use an EXPTF of five. We also test the baseline χ2 method without a frequency threshold.

Note that all experiments have been performed on the same training and test sets, as generated by applying stratified cross-validation, in order to allow for full comparability of the various feature selection methods and thresholds.
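The following sketch illustrates, under simplifying assumptions, how such a chi-squared ranking with an expected frequency threshold could be computed from raw term counts per authorship class; it is an illustration only, and may differ in detail from Formula 3.1 referred to above.

```python
from collections import defaultdict

def chi_squared_ranking(term_counts_per_class, expected_threshold=None):
    """Rank terms by their summed chi-squared score over all authorship classes.

    term_counts_per_class: dict mapping class label -> dict of term -> count.
    expected_threshold: if set (e.g. 5 for the +EXPTF variant), terms whose
    expected frequency in any class falls below it are discarded.
    """
    class_totals = {c: sum(cnt.values()) for c, cnt in term_counts_per_class.items()}
    grand_total = sum(class_totals.values())
    term_totals = defaultdict(int)
    for counts in term_counts_per_class.values():
        for term, n in counts.items():
            term_totals[term] += n

    scores = {}
    for term, term_total in term_totals.items():
        score, keep = 0.0, True
        for c, counts in term_counts_per_class.items():
            observed = counts.get(term, 0)
            expected = term_total * class_totals[c] / grand_total
            if expected_threshold is not None and expected < expected_threshold:
                keep = False   # unreliable cell: drop the term altogether
                break
            score += (observed - expected) ** 2 / expected
        if keep:
            scores[term] = score
    # terms with a high summed score indicate a strong term-class dependency
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```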



4.3.2 Scalability towards Unseen Texts

As a first step, we evaluate the different feature selection methods in terms of scalability towards the unseen text samples in the test set. If the approach fails to scale towards these unseen texts – in a controlled data set – this is a problem for its reliability when applied to large-scale, uncontrolled data sets. Table 4.3 shows MBL performance of authorship attribution in two data sets when using the different feature selection methods we introduced above. Best overall performance – although low (majority baselines are 12.50% for ABC NL 1 and 7.69% for AAAC A) – is obtained when using χ2 and lemma or character n-grams. Authorship attribution with eight candidate authors and nine topics can be done with an accuracy of 43.75% (35 out of 80 correct) using CHR 3. In AAAC A, the scores are not lower, in spite of the higher number of candidate authors. Here, the maximum score obtained is 44.62% (58 out of 130 correct), also with character trigrams.

When we compare performance scores in the three frequency strata, it becomes clear that high-frequency features score best with character bigrams. The list of high-frequency bigrams contains (parts of) function words (e.g. ‘the’, ‘of’, ‘and’), verbs (e.g. ‘is’), pronouns (e.g. ‘wh’, ‘it’), suffixes (e.g. ‘ed’, ‘ion’, ‘ing’, ‘ly’), and prefixes (e.g. ‘over’). Function words also score highest when the high-frequency stratum is targeted, since most function words are highly frequent in the two data sets. Mid-frequency features, however, score best with lexical features, of which the majority in that class are content words. The hapaxes in the low-frequency class are clearly not very informative in the two data sets we tested here. Most of them do not generalize towards the unseen test samples.

Results with IG as feature selection method are significantly lower than with χ2, while for some unigram feature types (e.g. CHR 1 and POS 1), scores are the same as with χ2. Where IG results are low, test instances show zero values for most features, indicating that most of the IG features (without frequency threshold) have no informative value for the held-out test set. This indicates that IG cannot be used without a frequency threshold without putting scalability at risk. After applying a simple term frequency threshold (in this case removing the hapaxes; cf. Section 4.3.1), IG +TF scores significantly better than IG. Removing hapaxes reduced the initial feature set to about 75% of its size (calculated for LEX 1). However, chi-squared (χ2) outperforms IG, showing the highest and most consistent performance overall when compared to the other feature selection methods and thresholds we tested. Applying the expected frequency threshold to χ2 causes a drop in performance over all feature types, which can be explained, at least in part, by the aggressive thresholding. χ2 +EXPTF reduces the initial set of features by about 97% (calculated for LEX 1), leaving 233 LEX 1 features for ABC NL 1 (of the initial 7753), and 136 for AAAC A (of the initial 4349).


(a) ABC NL 1: 8-way; 9 topics (baseline 12.50%)

Feature type   LOW F   MID F   HIGH F   IG      IG +TF   χ2      χ2 +EXPTF
cwd            18.75   27.50   20.00    12.50   16.25    37.50   13.75
fwd            11.25   18.75   22.50    20.00   20.00    20.00   21.25
chr1           n/a     27.50   25.00    36.25   36.25    36.25   32.50
chr2           16.25   18.75   37.50    11.25   32.50    33.75   38.75
chr3           13.75   27.50   18.75    6.25    37.50    43.75   41.25
lex1           12.50   28.75   17.50    12.50   28.75    35.00   23.75
lex2           12.50   15.00   15.00    1.25    25.00    30.00   12.50
lex3           13.75   22.50   16.25    12.50   18.75    30.00   21.25
lem1           12.50   26.25   21.25    12.50   28.75    46.25   30.00
lem2           12.50   12.50   16.25    1.25    17.50    25.00   17.50
lem3           10.00   20.00   21.25    12.50   20.00    28.75   20.00
pos1           13.75   17.50   30.00    21.25   21.25    21.25   23.75
pos2           12.50   15.00   17.50    11.25   30.00    36.25   32.50
pos3           12.50   18.75   12.50    12.50   20.00    16.25   16.25
lexpos1        17.50   32.50   17.50    12.50   26.25    37.50   22.50
lexpos2        15.00   21.25   15.00    12.50   21.25    37.50   16.25
lexpos3        12.50   20.00   22.50    11.25   20.00    28.75   21.25
Average        12.35   21.35   20.06    12.59   24.41    31.65   23.41

(b) AAAC A: 13-way; 4 topics (baseline 7.69%)

Feature type   LOW F   MID F   HIGH F   IG      IG +TF   χ2      χ2 +EXPTF
cwd            6.92    14.62   18.46    6.92    23.85    31.54   15.38
fwd            9.23    19.23   26.92    22.31   27.69    22.31   25.38
chr1           7.69    21.54   19.23    23.85   23.85    23.85   23.85
chr2           13.08   17.69   43.08    18.46   47.69    38.46   34.62
chr3           8.46    18.46   23.08    1.54    46.15    44.62   20.00
lex1           8.46    18.46   13.85    7.69    34.62    43.08   28.46
lex2           10.00   12.31   14.62    7.69    23.85    30.77   18.46
lex3           8.46    24.62   22.31    0.77    26.92    36.92   16.92
lem1           9.23    23.85   16.15    3.08    29.23    43.85   28.46
lem2           7.69    16.92   16.92    7.69    22.31    33.85   14.62
lem3           8.46    18.46   20.77    0.77    17.69    33.08   14.62
pos1           7.69    23.08   16.92    21.54   21.54    21.54   20.00
pos2           10.00   15.38   27.69    17.69   23.08    28.46   20.00
pos3           8.46    14.62   9.23     7.69    15.38    19.23   12.31
lexpos1        6.92    20.77   15.38    6.92    35.38    33.08   32.31
lexpos2        7.69    16.15   16.15    7.69    22.31    30.00   23.08
lexpos3        7.69    21.54   23.08    0.77    23.08    36.15   9.23
Average        8.12    18.18   19.76    8.94    26.82    31.94   20.65

Table 4.3: The influence of the feature selection method on performance in two multi-topic data sets using the maximum author set size. (Underlined scores fail to improve upon random baseline performance)


When applied to LEX 3, the EXPTF aggressive thresholding leads to a feature set with a single feature above the threshold in both data sets: ‘./het/is’ in ABC NL 1 and ‘the/American/dream’ in AAAC A. (Note that these are observations from a single fold in the 10-fold stratified cross-validation scheme.) However, LEX 3 still improves upon the random baseline. This can be explained by the fact that we used a kNN algorithm. The test instances with zero frequency for that single word trigram are assigned the class that showed zero frequency in training (10 of the 13 test instances in one fold get the same label assigned; 1/10 is correct). For non-zero frequency, the labels of the two nearest neighbors are used (the three remaining test instances are labelled with the two labels that showed non-zero frequency in training; 2/3 is correct). Although good performance gives an indication of the merits of an approach, it is obvious that features such as ‘the/American/dream’ will not scale towards other topics.

We estimate to what extent scalability is at risk by examining the unique identifiers of an authorship label over training and test instances. Unique identifiers are features that occur exclusively with a specific authorship class in training and uniquely identify a test instance by the same author. Although a coincidence – the frequency of a feature in an unseen test set or the topic of that test set cannot be predicted – topic-related unique identifiers provide the model with an unfair advantage that will not scale. Unique identifiers can be a sign of overfitting, a situation where the model relies on random or noisy characteristics of the data that have no predictive power when applied to unseen data (e.g. in different topics or genres). In authorship attribution, unique identifiers can be signs of topic influence, but also of unique authorship characteristics or typos. Authorship-related unique identifiers do not necessarily indicate overfitting, although most author-specific markers will not be scalable towards other topics. Since unique identifiers that relate to the topic or represent noise cause a potential threat to the scalability of the approach, we want to avoid them.

Table 4.4 shows the percentage of unique identifiers when applying the different feature selection methods to lexical features. As we indicated above, IG without frequency threshold failed to perform well, because the features selected from training hardly appeared in test. This is confirmed in Table 4.4, where we hardly see any unique identifiers between training and test. LOW F also shows no unique identifiers, whereas MID F and HIGH F have a higher percentage of unique identifiers. The highest number of unique identifiers can be found in χ2. When frequency thresholds are combined with IG and χ2, no unique identifiers between train and test occur.
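The sketch below shows one way this percentage could be computed, under the reading that a unique identifier is a selected feature that occurs with exactly one author in training and also appears in a test instance by that same author; the data layout and names are illustrative.

```python
def unique_identifier_rate(train_instances, test_instances, selected_features):
    """Percentage of selected features acting as unique identifiers of an author.

    train_instances / test_instances: lists of (author, set_of_features) pairs.
    selected_features: the feature set produced by a feature selection method.
    """
    authors_per_feature = {}
    for author, feats in train_instances:
        for f in feats & selected_features:
            authors_per_feature.setdefault(f, set()).add(author)

    unique = 0
    for f, authors in authors_per_feature.items():
        if len(authors) != 1:
            continue                  # feature is shared by several authors
        (author,) = authors
        # does the feature also turn up in a test text by that same author?
        if any(f in feats for a, feats in test_instances if a == author):
            unique += 1
    return 100.0 * unique / len(selected_features)
```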



Method       ABC NL 1 (8-way; 9 topics)      AAAC A (13-way; 4 topics)
             LEX 1    LEX 2    LEX 3         LEX 1    LEX 2    LEX 3
LOW F        0.00     0.00     0.00          0.02     0.06     0.04
MID F        0.50     0.20     0.32          0.68     0.44     0.96
HIGH F       0.24     0.22     0.26          0.12     0.30     1.04
IG           0.00     0.00     0.00          0.04     0.00     0.00
IG +TF       0.00     0.00     0.00          0.00     0.00     0.10
χ2           1.36     1.42     1.42          1.24     2.20     3.90
χ2 +EXPTF    0.00     0.00     0.00          0.00     0.00     0.00

Table 4.4: The percentage of unique identifiers when applying the different feature selection methods. A unique identifier is a feature that uniquely identifies an authorship label in training and in test. We consider these unique identifiers to be potential threats to scalability.

However, the absence of unique identifiers does not necessarily imply a scalable approach. In fact, the set of 233 χ2 +EXPTF features in ABC NL 1 includes topic markers (e.g. ‘Roodkapje’, ‘ridder’, ‘wolf’, ‘big’, ‘brother’, ‘programma’) and proper names (e.g. ‘Nederland’, ‘Europa’). The same can be said for the 136 χ2 +EXPTF AAAC A features (e.g. ‘Turner’, ‘Frontier’, ‘september’, ‘security’). While these features accommodate authorship attribution of the unseen test samples, they also pose a risk to the scalability of the approach towards other topics. It is clear that these features will not scale to large-scale and uncontrolled data sets.

4.3.3 Increasing Scalability towards Other Topics

In this section, the focus will be on the IG and χ2 feature selection methods and on the lexical features selected by IG and χ2. We use these feature selection methods without frequency threshold, present a qualitative analysis of the resulting features, and introduce a threshold that allows us to increase scalability towards other topics. Note that the analyses presented are based on a single fold in order to zoom in on the individual features in a classification task without averaging out the differences (which is what we would do when combining analyses over ten folds).

Above, we have shown that χ2 without frequency threshold allows for good performance, but also introduces a lot of unique identifiers. The individual LEX 1 features selected by χ2 and IG that uniquely identify an author (hence introducing scalability risks) are shown in Table 4.5.

In χ2, showing the highest percentage of unique identifiers, we find features that relate directly to the topic of the text (e.g. ‘Europa-een’, ‘Europarlementariers’, ‘grootmoeder’, ‘jonker’ (from ABC NL 1) and ‘Turners’, ‘thesis’, ‘troops’ (from AAAC A)), topic-related dramatis personae or locations (e.g. ‘Alex’, ‘Grenouille’, ‘Lisette’, ‘wolf’ and ‘Washington’, ‘Americans’), and typos or spelling variations (e.g. ‘millenium’, ‘sciecle’ and ‘Hofstader’). It is obvious that these features provide the model with unique evidence for classification that does not transfer to other topics.

χ2 in ABC NL 1   ‘?’, ‘Alex’, ‘alleen’, ‘avondje’, ‘begon’, ‘besturen’, ‘bestuur’, ‘club’, ‘cultuur’, ‘de’, ‘dus’, ‘Europa-een’, ‘Europarlementariers’, ‘fluisterde’, ‘gedacht’, ‘gek’, ‘geliefde’, ‘Grenouille’, ‘grootmoeder’, ‘heerlijke’, ‘ineens’, ‘jonker’, ‘jonkvrouw’, ‘Jurian’, ‘keek’, ‘kinder’, ‘Kooten’, ‘Lisette’, ‘Merel’, ‘millenium’, ‘moestuin’, ‘moet’, ‘Muskulan’, ‘niets’, ‘nog’, ‘om’, ‘oordeels’, ‘opa’, ‘Peter’, ‘riep’, ‘Roodcapje’, ‘samenstelling’, ‘schrijft’, ‘sciecle’, ‘scooter’, ‘sterren’, ‘stoffen’, ‘Tara’, ‘twenty’, ‘u’, ‘Veldkamp’, ‘vergelijk’, ‘verzorgingshuis’, ‘vliegveld’, ‘waarschuwingen’, ‘werkelijke’, ‘wetgeving’, ‘wolf’, ‘zei’, ‘zij’

χ2 in AAAC A     ‘/’, ‘;’, ‘ability’, ‘activities’, ‘adventure’, ‘allow’, ‘Americans’, ‘art’, ‘benefited’, ‘by’, ‘careful’, ‘characters’, ‘compromised’, ‘confidence’, ‘course’, ‘crossed’, ‘denying’, ‘disregards’, ‘domestic’, ‘everything’, ‘everywhere’, ‘executive’, ‘experience’, ‘female’, ‘forever’, ‘foundation’, ‘happen’, ‘Hofstader’, ‘ideas’, ‘importance’, ‘interesting’, ‘kind’, ‘liberty’, ‘many’, ‘metal’, ‘middle’, ‘myself’, ‘orders’, ‘points’, ‘politics’, ‘pride’, ‘produced’, ‘production’, ‘put’, ‘reasons’, ‘responsibility’, ‘rewards’, ‘sacrifice’, ‘slightly’, ‘superfluous’, ‘teenager’, ‘thesis’, ‘things’, ‘through’, ‘too’, ‘troops’, ‘Turners’, ‘Washington’, ‘whites’

IG in ABC NL 1   none

IG in AAAC A     ‘funds’, ‘Hodstafer’

Table 4.5: Unique identifiers between training and test set in two multi-topic data sets when using χ2 and IG without thresholds on LEX 1 features.

More surprisingly, the lists also contain function words (e.g. ‘zij’, ‘u’, ‘om’, ‘nog’ and ‘by’, ‘forever’) and punctuation marks. While some of these may be author-specific (some authors may have a preference for question marks or semicolons), others are probably included in the list because of the limited data they are based on – remember we are dealing with short text authorship attribution. Another surprise is the inclusion of frequent verbs such as ‘moet’ and ‘zei’ (Dutch) and ‘happen’ and ‘allow’ (English). These relate to the sentiment or opinion of the author, and also to the genre – fairy tales contain direct and indirect speech (hence the inclusion of ‘zei’). The analysis of unique identifiers clearly provides interesting insight into the model. Note that IG promotes low-frequency and rarely used features, leading to a low percentage of unique identifiers. The ones that do appear are ‘funds’ (used as a synonym for ‘money’ by one author in a consistent way) and ‘Hodstafer’ (a typo).

Now that we have identified the main sources of unique identifiers of an authorship class in χ2 and IG, we can introduce a technique to increase the scalability of these feature selection methods.


We start by presenting an innovative type of feature analysis according to the authorship and topic classes involved. The top part of Table 4.6 shows the distribution of the different sources of scalability issues in function of their topic frequency and author frequency. Topic frequency indicates the frequency of each term over the different topics, whereas author frequency indicates the frequency of each term over the different authorship classes. Both work in a similar way to document frequency (DF), except for the fact that DF does not require class information, whereas topic and author frequency do. Since that type of information is available in our controlled, artificial setting, we use it to explore the application of a topic frequency threshold.

                                             topicFreq=1                    topicFreq>1
                                             authorFreq=1   authorFreq>1    authorFreq=1   authorFreq>1

Sources of scalability issues:
  Typos & spelling variation                 (Du. ‘Roodcapje’, ‘werdstrijdje’) (En. ‘Dostoevsky’, ‘full-fill’)
  Characters & locations                     (Du. ‘Tjenkov’, ‘Sebastiaan’) (En. ‘Eliot’, ‘Philippines’)
  Topic markers                              (Du. ‘roker’, ‘Roodkapje’) (En. ‘frontiersman’, ‘work-a-holic’)

Scalable features (check points):
  Function words                             (Du. ‘de’, ‘om’, ‘over’, ‘maar’) (En. ‘the’, ‘to’, ‘but’)
  Non-author-/topic-specific words           (Du. ‘mochten’, ‘wisten’) (En. ‘make’, ‘ask’)
  Author-specific words

# LEX 1 features in ABC NL 1 (total: 7753)   4850 (62.56%)  454 (5.86%)     171 (2.21%)    2278 (29.38%)
# LEX 1 features in AAAC A (total: 4349)     2299 (52.86%)  404 (9.29%)     85 (1.95%)     1561 (35.89%)

Table 4.6: Sources of scalability issues and scalable features in the LEX 1 feature type (the check points), represented in function of their topic and author frequency. The top half contains features that fail to scale towards other topics, whereas the mid part represents scalable features. The bottom part shows the degree of dimensionality reduction. We provide examples for Dutch (from ABC NL 1) and for English (from AAAC A).


When topic information is not available (e.g. in social networks), we consider the document frequency threshold to be a viable alternative, since we assume each text is a different topic (cf. Section 4.1). According to Yang & Pedersen (1997), document frequency has the same positive characteristics as χ2 and IG, except that it is easier to apply in large data sets since it does not require task information, a feature making it computationally inexpensive.

We see that most of the sources of scalability issues appear with a topic frequency of 1. The list of LEX 1 features that occur in more than one topic (topicFreq>1) shows no typos or spelling variations, no characters or locations, and no topic markers. If we apply a topic frequency threshold of 1 – keeping only the features that occur in more than one topic – we can exclude these features from the feature set. The mid part of Table 4.6 shows function and content words that we want to include in the attribution model. Most function words are used in all topics and over all authors, so applying the topic frequency threshold will not exclude them. The topic frequency threshold has the downside that it excludes part of the author-specific words, which form useful evidence of authorship. It is however not possible to make the distinction between markers of topic and markers of authorship with the given amount and type of data. The same can be said for non-author/non-topic-specific words such as ‘mochten’ or ‘wisten’ (from ABC NL 1) and ‘make’ or ‘ask’ (from AAAC A), since these can be found in all topic-author distributions. Applying the topic frequency threshold leads to a feature set size reduction of around 70% in both ABC NL 1 and AAAC A. The sets of non-author, non-topic-specific words and author-specific words contain interesting evidence for authorship that relates to the author’s sentiment or opinion – e.g. ‘luttele’, ‘eigenwijs’ (from ABC NL 1) and ‘incredible’, ‘effective’, ‘meaningful’ (from AAAC A). Appendix A lists the features that were removed by applying the topic frequency threshold to χ2. Applying the topic frequency threshold no doubt entails some loss of (potentially useful) information in the attribution model. With more data (i.e. more texts) available, potential markers of authorship will occur in several texts by the author, survive the topic frequency threshold, and be included in the model.

Table 4.7 shows the effect of applying the topic frequency threshold on classification performance in lexical features. The full set of results is presented in Appendix B. We see a clear increase in performance when using the topic frequency threshold on IG in comparison with standard IG performance. Whereas standard IG test instances showed a lot of features with zero values, applying the topic frequency threshold allows for a set of features that scales towards unseen data. When comparing χ2 with χ2 +TOPIC F performance, we see an increase in performance for ABC NL 1 lexical features, but no improvement in performance for AAAC A lexical features. Most other feature types show a decrease in performance (cf. Appendix B).
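A minimal sketch of the topic frequency threshold, together with the document frequency fallback mentioned above for settings without topic labels, could look as follows; the per-feature topic and document sets are assumed to have been collected from the training data, and the function names are illustrative.

```python
def topic_frequency_filter(feature_topics, min_topics=2):
    """Keep only features that occur in at least `min_topics` distinct topics.

    feature_topics: dict mapping feature -> set of topic labels it occurs in.
    With min_topics=2, features restricted to a single topic (typos, dramatis
    personae, topic markers) are removed from the attribution model.
    """
    return {f for f, topics in feature_topics.items() if len(topics) >= min_topics}

def document_frequency_filter(feature_docs, min_docs=2):
    """Fallback when no topic labels are available: treat every document as a
    separate topic and threshold on document frequency instead."""
    return {f for f, docs in feature_docs.items() if len(docs) >= min_docs}
```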


When compared to χ2 +EXPTF, the topic frequency threshold shows more consistent performance. Authorship attribution with eight candidate authors (and nine topics) can be done with an accuracy of 43.75% (with CHR 2). Thirteen-way authorship attribution (with four topics) achieves 49.23% accuracy after minimizing the effect of topic (with CHR 3).

(a) ABC NL 1: 8-way; 9 topics (baseline: 12.50%)

Feature type   IG      IG +TF   IG +TOPIC F   χ2      χ2 +EXPTF   χ2 +TOPIC F
lex1           12.50   28.75    25.00         35.00   23.75       41.25
lex2           1.25    25.00    30.00         30.00   12.50       23.75
lex3           12.50   18.75    22.50         30.00   21.25       32.50
Average        8.75    24.17    25.83         31.00   19.17       32.50

(b) AAAC A: 13-way; 4 topics (baseline: 7.69%)

Feature type   IG      IG +TF   IG +TOPIC F   χ2      χ2 +EXPTF   χ2 +TOPIC F
lex1           7.69    34.62    13.08         43.08   28.46       43.08
lex2           7.69    23.85    17.69         30.77   18.46       22.31
lex3           0.77    26.92    16.92         36.92   16.92       26.92
Average        5.38    28.46    15.90         36.92   21.28       30.77

Table 4.7: Comparison of performance before and after applying a frequency (+TF or +EXPTF) or topic frequency threshold (+TOPIC F) to χ2 and IG in two multi-topic data sets. (Underlined scores fail to improve upon random baseline performance)

4.3.4 Discussion

When designing an experiment in multi-topic authorship attribution, it is essential to keep the topic factor out of the attribution model so that scalability towards other topics is not at risk. A common approach in authorship attribution to avoid the effect of topic in the attribution model is to focus on function words only. A priori exclusion of content words implies that a lot of potentially useful information is disregarded without consideration. We investigated whether it is possible to include content words without causing scalability problems.

In this section, we tested how commonly used feature selection methods deal with multi-topic data. We started by evaluating the frequency strata, information gain (IG), and chi-squared (χ2) methods in terms of their performance and scalability towards unseen texts. Chi-squared proved to be the overall best scoring method, but also the one that exhibited the largest amount of (potential) scalability issues. Applying information gain, conversely, failed to perform well, but the resulting feature set hardly contained any features that caused scalability problems towards the test set.

As far as scalability towards other topics is concerned, we have shown that the absence of unique identifiers between training and test sets does not imply that IG is a more scalable feature selection method than χ2. By analyzing the unique identifiers in the set of features selected by χ2, we identified the main sources of scalability issues: typos, spelling variation, characters, locations, and topic markers. An analysis of the distribution of these undesired features in function of their author and topic frequency suggested that applying a topic frequency threshold allows us to restrict the feature set to the most efficient and scalable features (i.e. function words, non-topic/non-author markers, author-specific words). Application of the topic frequency threshold leads to a dramatic reduction of the feature set size, and an increase in performance as compared to IG and χ2 with term frequency thresholding. Taking into account the potential of the topic frequency threshold and performance of the χ2 method, we believe there is sufficient reason to choose the χ2 +TOPICF feature selection method for our experiments in Chapters 5 and 6.

4.4 Cross-Validation Schemes for Multi-Topic Data

Another way to effectively deal with multi-topic data is the application of variations of the standard cross-validation scheme. We will investigate how the different schemes behave in terms of performance and scalability. Whereas some schemes have the disadvantage of requiring topic information – normal in controlled, experimental settings but a luxury in real-life applications – they also allow for insight into the model and the challenge of multi-topic authorship attribution.

A commonly used scheme in text categorization is random cross-validation. The selection of train and test instances is performed in an absolutely random way, implying that imbalanced train and test sets – both in terms of topic and authorID – are likely. However, since we want to report on authorship attribution experiments with an exact number of candidate authors (viz. eight for ABC NL 1 and thirteen for AAAC A), we apply stratified cross-validation as a baseline for our experiments (cf. Chapter 3). When dealing with multi-topic data, we can apply variations of the standard scheme that deal with topic explicitly. The different cross-validation schemes we test in this chapter are:

Stratified: Stratified cross-validation ensures that the result of random cross-validation is balanced in terms of the number of candidate authors in the training and test sets.


Held-out topic: A large number of studies (Baayen et al., 2002; van Halteren et al., 2005; van Halteren, 2007; Caver, 2009) apply the held-out topic cross-validation scheme. The idea is that a model for authorship is built from all-but-one topics, and testing is done on the remaining topic. Cross-validating in the held-out topic scheme entails that each of the topics in the data set is held out once. We will refer to the held-out topic as the focus topic. This scheme is applicable in controlled settings, since it requires topic information, and – ideally – a data set that is balanced in terms of topics and authors.

Single-topic: Another way is to split the data in as many single-topic classification tasks as there are topics. It is used in some approaches to the AAAC competition. Rather than addressing the complexity of the task, the single-topic scheme avoids the problem. Although it requires topic information, it increases our understanding of the amount of topic variation in a given data set.

First, we investigate how performance achieved by the different cross-validation schemes is indicative of inter- and intra-topic variation. This gives us an idea of the complexity of multi-topic authorship attribution. Next, we zoom in on the features in the attribution model and analyze how they deal with the effect of topic. Finally, we formulate conclusions on the effect of the cross-validation scheme when designing an experiment in multi-topic authorship attribution.
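As an illustration of the splitting logic behind the two topic-aware schemes (not the actual experimental code), the sketch below assumes each instance carries an author label, a topic label, and a feature vector.

```python
def held_out_topic_folds(instances):
    """Yield one (focus_topic, train, test) split per topic: the model is
    trained on all-but-one topics and tested on the held-out focus topic."""
    topics = sorted({topic for _, topic, _ in instances})
    for focus_topic in topics:
        train = [inst for inst in instances if inst[1] != focus_topic]
        test = [inst for inst in instances if inst[1] == focus_topic]
        yield focus_topic, train, test

def single_topic_tasks(instances):
    """Split the data into as many single-topic classification tasks as there
    are topics; each task is then cross-validated on its own."""
    topics = sorted({topic for _, topic, _ in instances})
    return {t: [inst for inst in instances if inst[1] == t] for t in topics}
```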

4.4.1 Performance as an Indicator of Scalability and Variation

Table 4.8 shows how the different cross-validation schemes perform in 8-way and 13-way authorship attribution. Our baseline approach, stratified cross-validation, is indicated with a grey background. In case of held-out topic and single-topic – with as many classification tasks as there are topics in the data set – the results indicate average performance. Individual results for HELD-OUT T and SINGLE-T are shown in Figures 4.1 and 4.2.

Classification results in Table 4.8 show that, in most feature types, held-out topic yields considerably lower performance than stratified cross-validation. Whereas stratified authorship attribution reaches a maximum accuracy of 46.25% with lemma unigrams in ABC NL 1, that same feature type drops to a 19.03% accuracy with the held-out topic scheme. We see a similar performance drop in AAAC A. Character trigrams are the most successful feature type in AAAC A in stratified and single-topic cross-validation. Tok is the best performing feature type in that scheme, but performance overall is very low in AAAC A. In single-topic authorship attribution on ABC NL 1, character trigrams (CHR 3) obtain a high average classification score of 74.31%. Single-topic cross-validation scores significantly higher than stratified cross-validation. Although reducing a multi-topic task to a set of single-topic tasks is feasible in controlled settings, the single-topic scheme cannot be applied to large-scale authorship attribution and fails to address the complexity of working with multi-topic data.


                ABC NL 1 (8-way; 9 topics)            AAAC A (13-way; 4 topics)
Feature type    STRATIFIED  HELD-OUT T  SINGLE-T      STRATIFIED  HELD-OUT T  SINGLE-T
cwd             37.50       13.61       44.58         31.54       9.78        41.51
fwd             20.00       22.78       33.75         22.31       12.12       28.20
chr1            36.25       26.81       37.36         23.85       14.60       30.08
chr2            33.75       20.28       47.64         38.46       15.27       49.98
chr3            43.75       21.67       74.31         44.62       16.39       66.84
lex1            35.00       18.89       57.78         43.08       13.54       54.86
lex2            30.00       17.50       37.92         30.77       10.19       39.23
lex3            30.00       17.64       37.64         36.92       8.43        35.73
lem1            46.25       19.03       55.00         43.85       14.63       58.69
lem2            25.00       18.33       39.31         33.85       10.00       38.37
lem3            28.75       16.67       37.92         33.08       10.19       38.31
pos1            21.25       18.61       36.53         21.54       15.32       34.97
pos2            36.25       18.33       42.64         28.46       12.34       40.56
pos3            16.25       14.03       39.72         19.23       10.00       40.38
lexpos1         37.50       19.31       57.08         33.08       13.38       47.80
lexpos2         37.50       19.31       37.78         30.00       10.54       34.60
lexpos3         28.75       17.36       38.33         36.15       9.38        37.53
Average         31.39       18.89       43.89         31.78       11.67       41.22
Baseline        12.50%                                7.69%

Table 4.8: The influence of three cross-validation schemes (stratified cross-validation, the held-out topic scheme, and the single topic scheme) on performance in two multi-topic data sets.

Depending on the focus topic – the held-out topic or the single topic – performance can fluctuate from one topic to another. An analysis of the individual results per focus topic allows us to investigate the extent of these fluctuations and answer the question whether some topics are easier to learn than others. Limited variation indicates that performance is relatively predictable, and that the approach is scalable towards other topics. Substantial variation indicates a limited degree of scalability towards other topics. The individual results per focus topic – the held-out topic for testing in the held-out topic scheme or the topic of the single-topic scheme – provide insight into the amount of variation in the results, as an indication of the inherent complexity of multi-topic authorship attribution. Figure 4.1 shows performance per focus topic in held-out topic cross-validation. Results for ABC NL 1 are situated between 10% and 50% accuracy, depending on the topic held out for testing.


[Figure 4.1: two panels, (a) ABC NL 1 and (b) AAAC A; y-axis: Accuracy (in %), x-axis: feature types (tok, cwd, fwd, chr1-3, lex1-3, lem1-3, pos1-3, lexpos1-3).]

Figure 4.1: Performance per topic that was held out in testing while the other topics were used for training (i.e. the held-out topic scheme). Each x represents a held-out topic.


Figure 4.2 shows performance per focus topic in single-topic cross-validation. Depending on the focus topic, authorship attribution with eight candidate authors and the CHR 3 feature type yields accuracies between 58% and 98%.

[Figure 4.2: two panels, (a) ABC NL 1 and (b) AAAC A; y-axis: Accuracy (in %), x-axis: feature types.]

Figure 4.2: Performance per topic in single-topic authorship attribution. Each x represents a topic on which single-topic authorship attribution is performed.

We perform a weighted comparison of results over feature types per held-out topic in order to obtain a ranking of the topics in a data set in terms of the inherent complexity of the task. This ranking is based on a weighted comparison of accuracy results over feature types, calculated in three steps.

Per feature type, we rescale the results to values between 0 (lowest performance) and 1 (highest performance). Per topic, we compute the classifiability by adding up the individual rescaled scores per feature type for that topic. Finally, we compare the complexities per topic and obtain a ranking. The resulting rankings are shown in Figure 4.3 (visualization adapted from Van Asch & Daelemans, 2010). We consider ranking in the held-out topic scheme to be an indicator of inter-topic variation. The higher the topic is situated in the ranking, the lower the results will be when that topic is held out for testing the attribution model. The ranking resulting from analysis of the single-topic scheme can be considered an indicator of intra-topic variation. The higher the topic is situated in the ranking, the more homogeneous the topic is, resulting in a higher classification accuracy in single-topic authorship attribution.
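A compact sketch of this three-step computation, assuming a table of accuracy scores per feature type and per focus topic, could look as follows; the function name and the ascending sort order are illustrative choices.

```python
def rank_topics_by_classifiability(accuracy):
    """Rank topics by summed, per-feature-type rescaled accuracy.

    accuracy: dict mapping feature_type -> dict of topic -> accuracy (in %).
    Step 1: per feature type, rescale scores to [0, 1].
    Step 2: per topic, sum the rescaled scores (the topic's classifiability).
    Step 3: sort topics on that sum, lowest classifiability first.
    """
    totals = {}
    for per_topic in accuracy.values():
        lo, hi = min(per_topic.values()), max(per_topic.values())
        span = (hi - lo) or 1.0   # guard against identical scores per feature type
        for topic, acc in per_topic.items():
            totals[topic] = totals.get(topic, 0.0) + (acc - lo) / span
    return sorted(totals, key=totals.get)
```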

[Figure 4.3: rankings of inter-topic variation (held-out topic scheme) and intra-topic variation (single-topic scheme) for (a) ABC NL 1 and (b) AAAC A; topics are labelled T1-T9 and T1-T4, with genre indicated for ABC NL 1.]

Figure 4.3: Inter- and intra-topic variation in two multi-topic data sets. The labels indicate the topicID (e.g. T1) and in case of ABC NL 1, the genre is also indicated (Fic for fiction, Desc for descriptive non-fiction, and Arg for argumentative non-fiction). The higher a topic is situated in the figure, the higher the degree of variation.

The rankings for ABC NL 1 with held-out topic are shown in Figure 4.3a on the left (cf. Inter-topic variation) and can be read top to bottom, from most to least challenging focus topic for the model for authorship trained on the other topics. The text type – or genre – clearly plays a role in the ranking. The most challenging held-out topic was a fairy tale about Little Red Riding Hood (T5), while the least challenging was a descriptive non-fiction text about the unification of Europe. In single-topic cross-validation, on the other hand, the ranking gives an indication of the most difficult topic to train and test an attribution model for (cf. Intra-topic variation). Again, the genre plays a role in the ranking.


Argumentative non-fiction texts exhibit high intra-topic variation, while fiction texts show a lower degree of variation; understandable considering the different opinions in the former genre. In AAAC A, the resulting rankings are shown in Figure 4.3b. Whereas T1, work, is the most challenging in held-out experiments, it is the most homogeneous internally, resulting in a higher score for single-topic experiments. The opposite can be said about T3, the American Dream.

This analysis helps us understand the difficulty of designing an approach that is scalable towards other topics. Although the testing ground is limited – a set of four topics and thirteen authors (in AAAC A) and one of nine topics and eight authors (in ABC NL 1) – the substantial amount of inter- and intra-topic variation implies that estimating performance in multi-topic authorship attribution is not possible. Applying authorship attribution ‘in the wild’, with a large number of topics involved and a potentially low degree of homogeneity in the authorial sets, is a difficult challenge for our text categorization approach. Of the best-scoring approaches to multi-topic authorship attribution, the stratified CV scheme is the one that does not require topic information or strictly controlled settings, hence the only scheme applicable in large-scale authorship attribution.

4.4.2 Feature Analysis

Apart from performance, we can also evaluate scalability by zooming in on the unique identifiers between train and test. An analysis of the features in the model allows us to assess the effect of topic in the stratified CV scheme in comparison with the held-out topic scheme, where we find the smallest number of unique identifiers. Remember that a unique identifier is a potential risk to the scalability of the approach towards other topics (e.g. typos, dramatis personae, topic-related words). The results in Table 4.9 show, as expected, that the percentage of unique identifiers after applying stratified CV is much higher than in the held-out topic scheme. A qualitative evaluation of these features in stratified cross-validation shows a predominance of characters (aka. dramatis personae). Comparison of the number of unique identifiers in the stratified scheme (using χ2 as a baseline) and when using the topic frequency threshold technique shows that the latter is the better option for dealing with the effect of topic, although it does not increase scalability as much as the held-out topic scheme does.



CV scheme (feature selection method)   ABC NL 1 (8-way; 9 topics)    AAAC A (13-way; 4 topics)
                                       LEX 1   LEX 2   LEX 3         LEX 1   LEX 2   LEX 3
STRATIFIED (using χ2)                  1.36    1.42    1.42          1.24    2.20    3.90
HELD-OUT T (using χ2)                  0.11    0.20    0.07          0.25    0.25    0.10
STRATIFIED (using χ2 +TOPIC F)         0.52    0.74    1.64          0.74    1.72    2.06

Table 4.9: The percentage of unique identifiers between training and test sets after applying the two most representative cross-validation schemes. For two multi-topic data sets, we compare figures between using chi-squared without and with the topic frequency threshold.

4.4.3 Discussion

In this section, we investigated how the choice of a cross-validation scheme has an impact on the performance of the text categorization approach in multi-topic authorship attribution and on the scalability of the approach towards other topics. The held-out topic and single-topic cross-validation schemes allow for increased insight into the difficulty of working with multi-topic data. Classification results show that some topics are considerably more difficult to learn than others, and that some topics show higher homogeneity than others. This implies that designing an experiment that scales towards other topics and allows us to estimate performance in other topics is not a straightforward task. In fact, our text categorization approach will fail to perform reliably when confronted with large-scale multi-topic data sets. Although the held-out topic and single-topic cross-validation schemes are interesting for evaluating topic scalability and variation, they require strictly controlled settings – the same topics for all authors – and topic information. This type of structure is available in an experimental set-up, but not in real-life applications of authorship attribution. The stratified CV scheme does not allow us to increase the degree of scalability as much as the held-out topic scheme does, but it is applicable in authorship attribution ‘in the wild’. For these reasons, we will continue our experiments in the next chapters with the stratified cross-validation scheme.

4.5 Conclusions

Topic, while one of the most critical factors interfering with authorship characteristics, is also one of the least studied and most often ignored. Allowing topic markers in the attribution model not only leads to higher classification performance because it combines authorship attribution with topic detection, but also results in a model that is unreliable when tested on other topics. Without addressing the effect of topic, it is impossible to claim superiority of any approach to authorship attribution.


Moreover, most studies are rather unspecific about the design of the experiments. We show that seemingly small decisions in experimental design can have a large effect on the feature set and on the scalability of the model. Studies that do refer to the effect of topic on the attribution model often address it by restricting the model to superficial features such as function words or measures of vocabulary richness, since these are commonly believed insensitive to the effect of topic. Recently, it has been shown that stylistic features work well for topic identification, and, conversely, that hardly any of the so-called topic-neutral features are in fact topic-neutral. Another downside of superficial features is that potentially interesting content information is excluded a priori.

In this chapter, we focused on two aspects of experimental design that behave differently when confronted with multi-topic data. Most authorship attribution studies are restricted to using function words for lack of a technique to incorporate topic-free content words. Our aim is to use all types of lexical features – function words and content words – without risking scalability and reliability. The research questions we answered in this chapter are:

Q1 What is the effect on scalability of decisions in experimental design?

Q2 What is the best technique to increase the scalability of the approach towards other topics?

Q3 Is it possible to use content words in multi-topic authorship attribution without reducing scalability?

First of all (cf. Section 4.3), we investigated how commonly used feature selection metrics deal with markers of topic and authorship. We started by assessing how the resulting attribution model scales towards the test set. Examining the unique identifiers of authorship over train and test enabled the identification of potential sources of scalability problems. We found that none of the feature selection methods offer good performance as well as a solution to the effect of topic. By analyzing these features in terms of author and topic frequency, we have shown the potential of a topic frequency threshold. This threshold allows us to exclude topic markers, typos, and topic-related proper nouns from the attribution model. In addition, applying the threshold leads to reliable and good classification results. Applying the TOPICF threshold causes a dramatic decrease in the size of the feature set, but returns a set of efficient and scalable features (i.e. function words, non-topic/non-author markers, and author-specific words).

A second way of addressing the effect of topic is to apply variations of the standard cross-validation scheme that allow for topic control (cf. Section 4.4). The most notable one is held-out-topic cross-validation, a scheme that builds a model on all-but-one topics and tests it on the held-out topic. Classification results are considerably lower than for standard cross-validation, but the held-out-topic scheme provides a solution for dealing with the effect of topic. However, there are two important downsides to the approach. First of all, the scheme relies on topic information, making it ideal for experimental settings, but not for authorship attribution on a large scale.

Secondly, the amount of inter-topic and intra-topic variation is substantial, which makes it difficult to assess the quality of an approach when tested on other topics. The fact that performance is unpredictable implies that the approach is not stable enough to apply to large-scale multi-topic data.

We can conclude from this chapter that the topic frequency threshold, combined with the chi-squared feature selection method (χ2 +TOPICF), provides the most reliable approach for multi-topic authorship attribution with lexical features (when using a text categorization approach). For the experiments in the following chapters, we will combine this feature selection method with stratified cross-validation, and focus on the effects of author set size (cf. Chapter 5) and data size (cf. Chapter 6) without having to be concerned with the interaction between topic and authorship.


Chapter 5

The Effect of Author Set Size

In this chapter, we present a systematic study of how author set size affects performance and the (types of) predictive features selected. Most studies in the field are limited to small sets of candidate authors, a situation that can lead to unrealistic expectations concerning the scalability of the approach or feature type suggested. Our aim is to identify robust and reliable approaches for large-scale authorship attribution.

In this chapter, we investigate author set size, a factor that – just like topic (cf. Chapter 4) – has received only limited attention in authorship attribution, but nevertheless has a significant impact on classification performance as well as on the features in the attribution model. Since most predictive feature types and classification techniques are only tested on small sets of candidate authors, it is impossible to assess how an approach scales when confronted with larger author set sizes. By increasing the number of candidate authors stepwise, we investigate the effect of author set size and demonstrate how the attribution model evolves with increasing author set size.

This chapter is organized as follows. First, we introduce the author set size issue and formulate research questions (Section 5.1). Then, we describe the experimental set-up (Section 5.2). We investigate the effect of author set size in two sets of experiments. In a first experiment, we zoom in on the natural setting by investigating the effect of author set size in the original data sets as introduced in Chapter 3 (Section 5.3). For a second set, we design a controlled-corpus experiment that allows us to minimize the effect of data size (Section 5.4). In the discussion section, we elaborate on the effect of exclusive testing on small author set sizes on the scalability of our approach (Section 5.5). Finally, we formulate conclusions (Section 5.6).


5.1 Introduction and Research Questions

Trying to classify an unseen text as being written by one of two or a few candidate authors is a relatively simple task that in most cases can be solved with high reliability and accuracies over 90%. Moreover, the often large volumes of data used for training have a positive effect on performance. The absence of a benchmark data set has resulted in a multitude of feature types, experimental designs, and machine learning methods being claimed reliable for authorship attribution. Although most approaches are only tested on small sets of candidate authors, we often find unverified claims concerning the scalability and reliability of these approaches when confronted with large sets of candidate authors. As a result, entire lines of research in authorship attribution are based on the assumption that an approach found useful in one data set will also be effective in another data set, irrespective of its dimensions in terms of candidate authors, data, and topics. We claim that testing on small sets of authors exclusively leads to an overestimation of performance and of the importance of the features extracted from the training data and found to be discriminating for these small sets of authors. Author set size is an important factor when envisaging real-world applications of authorship attribution, cases that often entail large author set sizes and challenging data sizes. Without investigating the effect of author set size on performance and on the attribution model, it is impossible to assess the validity of the techniques suggested.

Only recently has the authorship attribution field started using larger sets of authors. We limit our survey of related research to the papers where the effect of author set size is investigated explicitly or where the author set size is larger than typical in authorship attribution studies. We list the exceptions to the general rule of using small sets of candidate authors, starting with the studies closest to ours in focus.

Koppel et al. (forthcoming) zoom in on the relationship between authorship attribution performance and author set size. From a blog data set with 10,000 candidate authors, they select 2,000 words per author for training and 500 words for testing the model. Using a statistical rather than a Machine Learning approach (cf. Chapter 2), they vary five different parameters, such as the number of candidate authors, training fragment size, and test fragment size. They find a decrease in performance with increasing number of candidate authors. Nevertheless, their unmasking technique (cf. Koppel et al., 2007) is able to attribute the author of a 500-word text sample to one of 1,000 authors with 93.2% precision and 39.3% recall. The underlying idea of unmasking is that train and test samples by the same author will show similar characteristics, even when various (sizes of) feature sets are used (over several iterations). The authors indicate that the approach shows reliability for large author set sizes specifically, but not for small ones.


A limitation of their study is that they only tested character n-grams and focused on one type of data set, although it shows high dimensionality with respect to author set size and data size. Note that this study admits to including topic-related features. In Chapter 4, we have shown that this significantly and negatively affects the scalability of the approach.

Abbasi & Chen (2008) present a study on large-scale authorship attribution in cyberspace, and investigate the effect of author set size in order to improve scalability of authorship attribution across authors. They use a rich, holistic set of stylistic features on different linguistic levels (viz. lexical, syntactic, structural, content-specific, and idiosyncratic attributes) that consists of several tens of thousands of features. Performance of the Writeprint system is compared with several other Machine Learners in cases with 25, 50, and 100 candidate authors. In the Enron Email corpus, a closed candidate data set with about 28,000 words per author, Writeprint achieves a remarkable performance of 83% accuracy when identifying a text as being written by one of a group of hundred candidate authors. However, on texts taken from an online Java forum with more than 40,000 words per author, performance drops from 88% accuracy with 25 candidate authors to 53% with an author set size of 100. The approach shows promise for large author set sizes, but relies on large amounts of training data and topic-specific features. The difference with our study is that we work with data sets consisting of limited data and short texts, and remove topic-specific words from the attribution model.

The Zhao & Zobel (2005) study compares performance of different Machine Learners on authorship attribution with two, three, four, and five candidate authors. Results show a decrease in accuracy of about 20% when comparing five-way to two-way authorship attribution. When comparing several two-way authorship attribution tasks, large inconsistencies in terms of performance were revealed, causing considerable doubt over the results reported in many of the previous papers on this topic, most of which used only two authors (Zhao & Zobel, 2005, p.183).

Argamon et al. (2003c) report on results in authorship attribution in a data set of Usenet posts on a variety of topics. In this study, author set size is varied but not explicitly investigated. They selected the two, five, and twenty most active authors as well as the ten most active along with the ten least active authors from the five hundred most recent posts. Instead of performing multi-class classification, they adopt two approaches that both entail a binarization of the task: a one-vs.-all approach and an all-vs.-all approach, combined with a voting scheme. Features include function words, abbreviations commonly used in the newsgroups domain, and orthographic features. In the rec.arts.books newsgroup, for example, increasing author set size leads to a drop in performance from 67% accuracy with two candidate authors over 46% (five candidate authors) and 30% (ten candidate authors), to 26% accuracy with twenty candidate authors involved. Note that we apply multi-class classification. It is not clear how binary and multi-class classification compare in terms of performance.


In Luyckx & Daelemans (2008a), a systematic investigation of the effect of author set size on learning and feature selection was presented. Authorship attribution on PERSONAE, a single-topic corpus of 145 candidate authors, showed a decrease in performance with increasing author set size.

Grieve (2007) aims at finding the feature types that best predict the correct author of a previously unseen text, and presents a detailed analysis of up to forty textual measurements on a carefully selected data set. From this data set, consisting of 1.5 million words from documents written by forty authors and around 37,500 words per author, sets of two, three, four, five, ten, twenty, and forty candidate authors are selected. The results show a significant decrease in performance with increasing author set size, yet the best performing feature types still achieve over 60% accuracy for forty candidate authors. In contrast to our approach, which has a computational perspective, the Grieve (2007) study adopts a quantitative (i.e. statistical; cf. Chapter 2) approach in which each test sample is compared against all samples in training, resulting in a ranking of authors in terms of similarity.

Madigan et al. (2005) use a collection of data released by Reuters consisting of 114 authors, each represented by a minimum of 200 texts. Results of Bayesian multinomial logistic regression on this corpus show error rates between 20% and 97%, depending on the type of features applied. This is only partially comparable to our approach because of the large amount of data in the Madigan et al. (2005) study, whereas we focus on short texts and only limited data per author.

In this chapter, we address this issue and analyze performance decay as well as evolutions in the attribution model with increasing author set size. By investigating the effect in three data sets of different dimensions, we want to gain insight into the general tendencies in author set size. Our aim is to investigate the scalability of the text categorization approach as we defined it (cf. Chapter 3) towards larger sets of candidate authors. We apply both qualitative and quantitative analysis to identify feature types that are robust to the effect of author set size and scale towards larger sets of candidate authors. We present experiments in short text authorship attribution, using only limited data for training. We provide qualitative as well as quantitative evaluation of features with increasing author set size, a type of analysis that is lacking in most other authorship attribution studies. The following research questions are addressed in this chapter:

Q1 Do we find support for the hypothesis that studies that test an approach on a small set of candidate authors only, overestimate the approach when making claims concerning
   (a) its scalability for cases with large sets of candidate authors, and
   (b) the importance and scalability of specific predictive features?

Q2 Is the effect of author set size in experiments balanced for data size and topic the same as in experiments that are not balanced for these factors? In other words, how do data size and topic interact with the effect of author set size?


We analyze authorship attribution performance while increasing the number of candidate authors stepwise. In the discussion section (cf. Section 5.5), we investigate whether claims based on exclusive testing on small sets of candidate authors hold up when tested on larger author set sizes. Furthermore, we test the scalability of performance, individual features, and feature types.

5.2 Experimental Set-Up

We present two sets of experiments. In a first experiment, we investigate the effect of author set size in its natural setting by using the original data sets. This implies that data size and topic distribution in training may affect performance. In a second set of experiments, we use the same amount of training data for each of the authorship classes and data sets. This set-up allows us to analyze the effect of author set size more directly, although it is rather artificial and will not occur in the wild. We test a text categorization approach to authorship attribution (cf. Chapter 3) on three evaluation data sets. Figure 5.1 shows the dimensions of each data set in terms of author set size, the number of topics, the number of words per author, and the number of words per topic per author. Each of the data sets exhibits a unique combination of the various dimensions. PERSONAE has only limited data per author, and a large set of candidate authors, but contains data in a single topic. ABC NL 1 shows high dimensionality in all factors except author set size. The dimensions of AAAC A are situated in between those of the other data sets, but the set still presents any approach with a challenge. So far, it is unclear how the factors shown in Figure 5.1 interact with each other. It seems likely that, with increasing data size per author, authorship attribution performance increases. In addition, the more texts or topics an attribution model can be trained on, the more stable the model will be. However, the amount or, conversely, lack of variation in the different topics might mislead the model when it is tested on topics of a completely different nature. As we mentioned above, the two sets of experiments allow us to gain insight into the interaction between data size, topic, and the effect of author set size. In the first set (EXP 1), we extract variable-length (FLEX) samples (cf. Chapter 4) from the original data sets, where each sample represents a 10-percent slice of an original text. In a second set of experiments (EXP 2), we limit the data to 500 words per text (i.e. the size of the shortest text for an author in the three data sets). These 500 words per text are five randomly selected 100-word samples (i.e. FIX samples). The author set sizes and number of random selections for the different data sets are presented in Table 5.1. By balancing the training data in terms of data size, we ensure that each author in training shows the same amount of variation in terms of texts – and, therefore, topics. We also make sure the training data is balanced in terms of authorship, so that none of the author labels is at an advantage because of the effect of frequency (Stamatatos, 2008).


Figure 5.1: Dimensions of the three evaluation data sets (four panels: author set size, number of topics, data size per author, and data size per author and topic, for ABC_NL1, AAAC_A, and PERSONAE).

In both sets of experiments, we apply the topic frequency threshold (cf. Chapter 4), a technique that allows us to remove topic-specific features from the attribution model. Author set size is investigated by increasing the number of authors taken into account for classification stepwise. Since most studies in authorship attribution analyze up to five candidate authors, we mimicked these experiments by selecting two, three, four, or five authors randomly from our larger sets of candidate authors. This set-up allows for a good comparison. In order to get reliable estimates, we take several random selections of [2,3,4,5] authors and report on averaged scores. For the larger sets, we also repeated the experiments a number of times. All data sets are subject to exactly the same procedures and experiments, allowing us to draw conclusions that generalize over several data sets of different sizes.

Data set    Data size EXP 1   Data size EXP 2   Author set sizes x number of random selections
PERSONAE    1413              500               [2:100, 3:100, 4:100, 5:100, 10:10, 20:5, 50:2, 100, 145]
AAAC A      844               500               [2:20, 3:20, 4:10, 5:10, 10:10, 13]
ABC NL 1    1017              500               [2:20, 3:20, 4:10, 5:10, 8]

Table 5.1: Set-up for the author set size experiments.
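The sampling and selection scheme in Table 5.1 can be sketched roughly as follows. The helper names and the corpus structure (a mapping from author to tokenised texts) are hypothetical; the fragment only illustrates drawing five random 100-word FIX samples per text and repeating an experiment over random author subsets of a given size.

    import random

    def fix_samples(tokens, n_samples=5, sample_len=100, rng=random):
        """Draw n_samples 100-word samples with random (possibly overlapping)
        start positions from one tokenised text."""
        last_start = max(1, len(tokens) - sample_len)
        starts = [rng.randrange(last_start) for _ in range(n_samples)]
        return [tokens[s:s + sample_len] for s in starts]

    def random_author_subsets(authors, set_size, n_selections, seed=0):
        """E.g. set_size=5 and n_selections=10 for the '5:10' condition above."""
        rng = random.Random(seed)
        return [rng.sample(authors, set_size) for _ in range(n_selections)]

    # Hypothetical usage, with corpus = {author_id: [list_of_tokens, ...]}:
    # for subset in random_author_subsets(list(corpus), set_size=5, n_selections=10):
    #     samples = {a: [fix_samples(t) for t in corpus[a]] for a in subset}
    #     ...train, cross-validate, and average accuracy over the 10 selections...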


Experiments in authorship attribution are done by means of instance-based (cf. Chapter 3 on the distinction between instance-based and profile-based categorization) multi-class classification (cf. Chapter 3). In each fold, we train on all-but-one samples per author and topic, and test on the remaining sample. We make sure each sample is in test once, a technique referred to as k-fold cross-validation (Weiss & Kulikowski, 1991). In EXP 1, we use ten-fold cross-validation, while results for EXP 2 are based on five-fold cross-validation. For all experiments, we use Memory-Based Learning (MBL) as implemented in TIMBL (version 6.1) (Daelemans et al., 2007) with default settings for numeric features. The rationale behind using default settings is that we are not concerned with optimal accuracy (optimization of algorithm parameters would lead to higher absolute results), but with measuring a relative effect (viz. of author set size). In a separate experiment, we did compare MBL performance with four other algorithms in order to evaluate our choice of MBL as a classifier for authorship attribution. Table C.1 (in Appendix C) compares performance of MBL with that of four other algorithms on the three evaluation data sets, using two, five, and the maximum number of candidate authors. This quick comparison shows that Support Vector Machines (SVMs) achieve the highest scores over all author set sizes tested here. Understanding why SVMs perform so much better than MBL is a research topic in itself, and not the focus of this dissertation. However, the overall trend – i.e. a decrease in performance with increasing author set size – will stay the same, irrespective of the learning algorithm.
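TiMBL is a stand-alone memory-based learner, so the exact MBL experiments cannot be reproduced with a few lines of library code. Purely as an illustration of the general set-up – instance-based multi-class classification under k-fold cross-validation, with an SVM included for comparison – the sketch below uses a 1-nearest-neighbour classifier as a crude stand-in for MBL on a hypothetical feature matrix.

    # Instance-based multi-class classification with k-fold cross-validation;
    # k-NN stands in (very loosely) for TiMBL's memory-based learner.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((200, 300))          # hypothetical: 200 samples x 300 numeric features
    y = np.repeat(np.arange(10), 20)    # hypothetical: 10 candidate authors, 20 samples each

    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for name, clf in [("MBL stand-in (1-NN)", KNeighborsClassifier(n_neighbors=1)),
                      ("SVM", LinearSVC())]:
        acc = cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
        print(f"{name}: {acc:.3f}")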

5.3 The Effect of Author Set Size in the Original Data Sets (EXP 1)

In this first set of experiments, we present results of authorship attribution on three evaluation data sets, where we use the original data sizes and do not force topic balance in training. Results from EXP 1 will give an indication of the effect of author set size in a natural setting. In EXP 2 (Section 5.4), a more artificial set-up will ignore the dissimilarities between the data sets in terms of data size. Note that, in both sets of experiments, we make sure all authorship classes are represented by an equal number of instances, in order to avoid class imbalance. Figure 5.2 and Table 5.2 demonstrate the effect of author set size in authorship attribution using Memory-Based Learning (MBL) in the original evaluation data sets. For reasons of clarity, only part of the results is presented here. Per feature type (e.g. LEX 1, LEX 2, and LEX 3), only the variant with the highest score is shown. Results for all feature types are in Appendix D. Already at first sight, it is clear that increasing the number of candidate authors leads to a significant decrease in performance. This effect is visible in the three data sets, regardless of their size, the language they are written in, and the number of topics. Nevertheless, the single-topic data set PERSONAE shows a steeper decrease in performance with increasing author set size than the multi-topic data sets.


Figure 5.2: Visualization of the effect of author set size in the original data sets (accuracy in % against the number of candidate authors, per feature type, for (a) PERSONAE, (b) AAAC A, and (c) ABC NL 1).


(a) PERSONAE

Feature   2x100  3x100  4x100  5x100  10x10  20x5   50x2   100    145
tok       73.75  59.33  54.42  46.82  30.00  26.10  12.90   8.30   6.07
cwd       66.65  48.13  47.95  47.32  35.00  27.70  15.50   7.90   7.03
fwd       71.30  57.23  47.12  41.36  26.70  14.40   6.90   4.20   2.83
chr3      94.50  87.37  81.05  76.36  54.60  39.49  25.00  12.20  10.90
lex1      74.50  60.43  59.58  59.68  42.50  33.30  25.50  15.50  12.21
lex3      57.00  48.33  42.80  40.32  29.00  28.10  26.40  25.60  22.76
lem3      58.75  48.80  42.45  39.74  30.00  26.10  25.40  25.90  22.07
cgp3      71.40  60.97  55.90  49.42  31.90  22.00  10.70   6.00   4.76
pos2      74.45  55.63  54.95  51.00  36.47  26.90  13.80   7.80   5.31
lexpos3   56.50  48.67  42.85  40.20  29.40  28.70  26.90  25.40  22.28
chu3      69.05  51.17  41.55  39.82  21.70  18.20   6.70   6.00   3.17
rel       54.70  36.03  27.93  23.92  11.80   6.90   2.30   1.30   1.24
Baseline  50.00  33.33  25.00  20.00  10.00   5.00   2.00   1.00   0.69

(b) AAAC A

Feature   2x20   3x20   4x10   5x10   10x10  13
tok       76.25  55.17  48.50  47.80  30.90  29.23
cwd       61.50  46.33  41.75  37.20  31.80  20.77
fwd       73.50  58.17  47.00  35.40  26.20  22.31
chr2      87.00  70.83  66.50  64.40  45.45  49.23
chr3      94.25  80.67  76.50  72.80  58.60  46.15
lex1      77.00  58.67  48.00  50.20  45.00  43.08
lem1      75.25  60.50  49.50  53.60  44.70  46.15
pos2      79.00  59.00  54.25  46.20  31.40  29.23
lexpos1   71.50  52.00  47.75  44.80  38.80  31.54
Baseline  50.00  33.33  25.00  20.00  10.00   7.69

(c) ABC NL 1

Feature   2x20   3x20   4x10   5x10   8
tok       72.00  57.17  47.50  50.40  27.50
cwd       62.25  47.33  39.00  27.60  26.25
fwd       71.50  53.50  46.00  38.20  20.00
chr3      80.25  69.17  61.00  58.00  43.75
lex1      70.50  55.67  42.00  44.60  41.25
lem1      71.75  56.67  48.50  45.40  31.25
lem3      53.00  39.17  41.00  36.00  36.25
cgp2      66.75  50.00  40.00  35.40  28.75
pos2      73.25  55.50  43.25  37.80  38.75
lexpos1   69.00  58.00  42.75  45.20  36.25
chu1      66.75  48.83  35.50  32.30  23.75
rel       46.50  34.83  28.25  23.20  10.00
Baseline  50.00  33.33  25.00  20.00  12.50

Table 5.2: The effect of author set size in the original data sets.

Results for the PERSONAE data set are shown in Figure 5.2a and Table 5.2a. In authorship attribution with two candidate authors, we achieve an accuracy of 95% by using character trigrams. CHR 3 performance remains around 75% when the author set size is five, but we see a dramatic decrease in performance when the author set size equals ten. Only character trigrams show robustness to these author set sizes. Increasing the number of candidate authors even more – to 50-, 100-, and 145-way authorship attribution – however, shows that lexical features perform markedly better than character n-grams. Authorship attribution with 145 candidate authors can be done with an accuracy of around 23%, which is reasonable, taking into account the dimensionality of the task and the limited set of data (viz. 1,400 words on average) available per author. Random baseline performance in this task is 0.69% (1 correct prediction out of 145).

Figure 5.2b and Table 5.2b show the influence of author set size in the AAAC data set. We see a similar drop in performance as the one observed in PERSONAE. Authorship attribution with five candidate authors still achieves a score of about 73%, while two-way authorship attribution can be done with an accuracy of around 94%. Authorship attribution on the maximum number of candidate authors in this data set is possible with 49% accuracy. Character n-grams are the best choice at all author set sizes.

Results for ABC NL 1 are presented in Figure 5.2c and Table 5.2c, showing a considerably lower performance than in PERSONAE and AAAC A. Eight-way authorship attribution achieves an accuracy of 44%, which is significantly lower than the score for 13-way authorship attribution in AAAC A. This is unexpected, since increasing author set size typically has a negative effect on performance. This deviation could be explained by the dimensionality in terms of data size, topic (four in AAAC A, nine in ABC NL 1) or genre (three in ABC NL 1, one in AAAC A) in the different data sets. We will elaborate on this below.

Now that we have discussed the effect of author set size on performance, we bring the feature types into focus, in order to answer the question whether some feature types are more robust to author set size than others. When zooming in on the feature types in Table 5.2, we see that the best scores are achieved by character n-grams. Character trigrams seem to outperform the other feature types in the three data sets, although lexical information shows more robustness to author set sizes of fifty and higher. Whereas these results indicate robustness of the lexical feature type, this does not necessarily indicate robustness of the individual features in that feature set. In Section 5.5, we will elaborate on the reliability of character and lexical features when confronted with increasingly larger sets of candidate authors. Syntactic features like rewrite rules, n-grams of parts-of-speech, and function words have all been claimed reliable markers of style. In the results shown here, we find no evidence to support that claim. Grammatical relations (REL) and superficial lexical features (e.g. type-token ratio and average word length, as implemented in TOK) hardly improve over the majority baseline.


According to Houvardas & Stamatatos (2006), character n-grams capture and combine nuances on different linguistic levels – lexical, syntactic, and structural – which could be an explanation for their high score. Another explanation for the good score of character n-grams can be found in Stamatatos (2008), where it is stated that character n-grams reduce the sparse data problems that arise when using word n-grams. In spite of these indications, a more qualitative analysis of character n-grams in authorship attribution is still lacking – in fact, this can be said for most feature types. In the discussion section (cf. Section 5.5), we will show results of such a qualitative analysis. Comparing results over data sets, we find a consistently decreasing performance with increasing author set size, although there is clear interaction between author set size, data size, and topic. As we indicated above, it is unclear how these factors interact with each other. Evaluation of the second set of experiments (cf. Section 5.4) will allow us to investigate whether data size is in fact one of the main factors at play here. Balancing for data size and topic allows us to analyze more general tendencies in author set size.
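The sparseness argument can be made concrete with a small, hypothetical example: the two Dutch sentences below share almost no identical word forms, yet several character trigrams recur across the related forms. This is only an illustration with scikit-learn's CountVectorizer, not the feature extraction pipeline used in our experiments.

    from sklearn.feature_extraction.text import CountVectorizer

    texts = ["De zwerm gedraagt zich als een complex systeem.",
             "Complexe systemen gedragen zich soms als een zwerm."]

    def ngrams(text, analyzer, n=(1, 1)):
        vec = CountVectorizer(analyzer=analyzer, ngram_range=n)
        return set(vec.fit([text]).vocabulary_)

    shared_words = ngrams(texts[0], "word") & ngrams(texts[1], "word")
    shared_chr3 = ngrams(texts[0], "char_wb", (3, 3)) & ngrams(texts[1], "char_wb", (3, 3))

    # Character trigrams survive across inflected forms ('complex'/'complexe',
    # 'gedraagt'/'gedragen'), whereas exact word forms mostly do not.
    print(len(shared_words), "shared word unigrams:", sorted(shared_words))
    print(len(shared_chr3), "shared character trigrams")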

5.4 The Effect of Author Set Size in Data Size and Topic Balanced Data (EXP 2)

In this second set of experiments, we limit the data size used for training. In each data set, the attribution model is trained on four samples of 100 words per text and author, and tested on one sample of 100 words. This set-up allows for a more straightforward comparison of the different data sets, although the amount of training data is severely limited as compared to EXP 1. We not only establish a balanced class distribution in training – like we did in EXP 1 – but also guarantee an equal distribution of texts – hence topics (or genres, depending on the data set) – over the authors in training, ensuring a clear focus on the effect of author set size. Figure 5.3 and Table 5.3 show the effect of author set size in authorship attribution using Memory-Based Learning (MBL) in the three evaluation data sets. Again, the figures and tables below represent only part of the results. Results for all feature types are in Appendix E. When we compare these results to those in EXP 1 (cf. Section 5.3), we see that performance drops considerably. In cases with two candidate authors, using character trigrams leads to a performance drop of 16% in PERSONAE in comparison to the first experiments. This result can be explained by the very limited amount of data per author. When an author is represented by several topics instead of only one, as is the case in AAAC A and ABC NL 1, results indicate a considerably smaller accuracy drop (a 6% absolute difference).



Figure 5.3: Visualization of the effect of author set size in data size and topic balanced data (accuracy in % against the number of candidate authors, per feature type, for (a) PERSONAE, (b) AAAC A, and (c) ABC NL 1).


(a) PERSONAE

Feature   2x100  3x100  4x100  5x100  10x10  20x5   50x2   100    145
tok       61.80  43.80  34.40  32.68  15.00   9.00   2.80   2.20   0.83
cwd       62.90  44.13  37.00  28.56  24.40  19.60  13.00   7.20   4.97
fwd       68.50  48.73  39.80  34.80  17.60   9.80   6.40   3.20   2.07
chr3      78.50  71.47  70.55  66.16  47.40  27.20  11.60  10.00   5.10
lex1      71.80  50.53  43.10  35.44  24.00  19.40  12.40  10.40   6.48
lex2      62.80  45.27  36.10  30.36  22.80  15.60   7.60  12.20  10.76
lem1      73.80  50.67  42.75  36.00  23.60  19.60  10.40   8.00   5.10
cgp3      67.50  41.73  35.70  28.36  17.60  13.20   5.60   2.40   3.31
pos2      67.40  45.13  39.55  32.92  23.40  21.00   5.40   4.60   4.14
lexpos2   63.10  44.67  35.65  28.76  21.40  16.00   8.00  11.80  10.90
chu3      64.40  43.27  31.90  26.68  11.20   9.00   4.20   3.00   1.93
rel       52.90  36.60  29.15  22.28  10.80   5.20   3.20   1.20   1.24
Baseline  50.00  33.33  25.00  20.00  10.00   5.00   2.00   1.00   0.69

(b) AAAC A

Feature   2x20   3x20   4x10   5x10   10x10  13
tok       56.54  44.92  32.38  28.00  16.86  12.94
cwd       62.70  44.64  37.07  30.50  20.23  23.53
fwd       70.50  53.74  47.28  38.50  22.61  20.00
chr3      88.14  84.27  80.00  73.70  54.78  51.76
lex1      74.05  56.12  54.70  44.30  32.46  32.16
lem1      74.48  60.13  54.75  47.30  38.74  38.04
pos1      71.32  59.61  45.09  41.20  27.21  23.53
lexpos1   72.75  54.42  52.68  40.20  30.14  30.59
Baseline  50.00  33.33  25.00  20.00  10.00   7.69

(c) ABC NL 1

Feature   2x20   3x20   4x10   5x10   8
tok       61.50  46.48  36.67  30.76  23.33
cwd       56.72  39.74  33.22  28.27  17.50
fwd       66.89  48.74  35.78  34.36  25.83
chr3      74.44  66.70  58.61  51.56  41.11
lex1      66.67  48.26  41.61  33.29  28.06
lem1      67.72  50.44  41.06  38.98  32.22
cgp2      64.22  47.41  36.89  30.04  16.67
pos2      63.67  47.48  39.39  31.78  28.89
lexpos1   67.33  48.37  40.33  34.22  25.28
chu2      61.33  44.78  35.89  28.36  21.94
rel       51.72  35.56  26.56  23.02  13.33
Baseline  50.00  33.33  25.00  20.00  12.50

Table 5.3: The effect of author set size in data size and topic balanced data.

Analysis of classification results with five candidate authors again shows a substantial performance drop in PERSONAE as compared to results of the first set of experiments (from 76% to 66%), but limiting AAAC A data size does not result in a decrease in performance; it remains stable around 73%. ABC NL 1 shows a more moderate drop (from 58% to 52%) than PERSONAE. Comparing performance of authorship attribution with the maximum number of candidate authors shows a performance drop from 23% in EXP 1 to 11% in EXP 2 for PERSONAE. Since the issue of number or variety of topics does not play a role in this data set, we can account for this drop by referring to the very limited amount of data per author (viz. only 500 words, a data size reduction of 65% as compared to EXP 1). In AAAC A with thirteen candidate authors, nevertheless, we observe a small increase in performance – from 49% to 52% – when comparing results of the two sets of experiments. It is clear that reducing the data size (from 844 to 500 words, a reduction of 41%) is not the only factor at play here. Our current set-up ensures topic balancing in the training data, whereas the EXP 1 set-up only balanced the instances in terms of authorship, potentially leading to an imbalance in terms of texts, hence topics (in AAAC A and ABC NL 1) and/or genres (in ABC NL 1). The increase in performance from EXP 1 to EXP 2 suggests that EXP 1 results were actually influenced by that imbalance, and that balancing these factors aids performance, even with limited data size. However, most of these interacting factors will disappear with more training data available, making data size the most important factor overall. Nevertheless, ABC NL 1 shows no increase in performance in eight-way authorship attribution, but a small decrease from 44% (EXP 1) to 41% (EXP 2), which can be explained partially by the dramatic data size reduction (from 1017 to 500 words, a reduction of 51%), a reduction which was less drastic in AAAC A (cf. Table 5.1). The other factor is the fact that we balance for topics and genres. Moreover, ABC NL 1 contains more topics and genres than AAAC A, which also explains why AAAC A outperforms ABC NL 1 in both sets of experiments.

As we have shown in the two sets of experiments, identifying the factors that interact with the effect of author set size is not a straightforward task. First, reducing the data size has a negative influence on performance. A second factor is the effect of potential topic imbalance, which can have a positive influence on performance in results of EXP 1. Removing that imbalance, like we did in EXP 2, can cause a performance drop. A third, and potentially most decisive, factor is the inherent complexity of a task, including factors such as the degree of topic or genre variation, the presence of stylistic choices in text, etc. It is not straightforward to assess the 'strength' of an authorial set. In the previous chapter (cf. Section 4.4, Figure 4.3), we have shown that the degree of inter-topic and intra-topic variation is considerable in the two multi-topic data sets. All of these factors affect performance, but at this point, no techniques have been developed to assess the inherent complexity of a task given a data set.



5.5 The Effect of Exclusive Testing on Small Author Set Sizes

Now that we have presented and discussed the results of both sets of experiments, we can identify general tendencies in the effect of author set size. In this section, we elaborate on its effect on performance (cf. Section 5.5.1), and on features and feature types that are found predictive for a specific number of candidate authors (Section 5.5.2). We will show that testing an approach on small author set sizes exclusively can lead to overestimation of its performance and reliability when tested on larger author set sizes. We also find that hardly any of the individual features found predictive for small author set sizes are reliable with other or larger sets of candidate authors. This implies a serious limit to the scalability and robustness of approaches and features found successful for small author set sizes. Nevertheless, we do find consistency in the performance of feature types.

5.5.1 Performance Decay with Increasing Author Set Size

Analysis of the two sets of experiments has shown that there is a substantial drop in performance with increasing author set size. This effect was found in the three data sets, regardless of their data size and number of topics. By comparing results of EXP 1 – where the original data sizes were used – and EXP 2 – where the data size of each text was reduced to 500 words – we investigated factors interacting with the effect of author set size. Reducing the data size leads to a decrease in performance, as expected. In AAAC A, however, the small increase in performance from EXP 1 to EXP 2 indicates that balancing the training data for data size has a positive effect on performance. ABC NL 1 results showed only a small performance drop, indicating a role for the amount of variation in topics and genres as well as for data size.

Consistent performance decay

Figure 5.4 allows us to compare performance decay with increasing author set size over the different data sets and feature types. The scores plotted in the graph are an average of the results obtained by the different feature types per author set size. The original results are in Appendices D and E. The left-hand side shows results of EXP 1, and the right-hand side shows scores for EXP 2. These figures show that in each of the three data sets, performance follows a similar pattern with increasing author set size.



Figure 5.4: Performance decay with increasing author set size (rescaled performance against the number of authors for PERSONAE, AAAC A, and ABC NL 1; (a) EXP 1: original data sets, (b) EXP 2: data size and topic balanced experiment).

Forecasting performance on larger author set sizes

This consistency, found in the various data sets, allows us to investigate the possibility of forecasting performance for larger author set sizes based on results for small(er) author set sizes. By doing that, we can investigate whether success in small sets of candidate authors warrants scalability to larger author set sizes. In authorship attribution literature, we often find claims concerning the reliability of approaches that have been tested on small author set sizes only. We simulate the point of view of the authorship attribution studies that limit themselves to small author set sizes – 2, 3, 4, and 5 candidate authors – and try to forecast performance on larger author set sizes. The resulting trend lines demonstrate the decay in performance with increasing author set size without actually doing the experiment. This type of analysis can be used to quickly evaluate the strength of an approach or a feature type without performing the full experiment. In large-scale authorship attribution, this can be a valuable and time-saving test.

This forecasting analysis is implemented as follows: we start off by taking accuracy scores for two small author set sizes, for instance 2 and 3, and fit a power trend line to those accuracies. Figure 5.5 represents a visualization of the trend lines. Assuming that each additional candidate author will continue to decrease performance, we project the decay in performance from the small author set sizes onto the larger author set sizes. With every additional candidate author, we recalculate and apply the decay function and compare the resulting predictions to the actual performance (as reported in the previous section). Underlying this analysis is the basic assumption that the unseen data is of a similar nature as the known data. This type of data is available when working in controlled settings (e.g. authors with a similar background, age, education level, or similar data sizes for the various candidate authors, etc.), but is a luxury situation as compared to the uncontrolled data found in online networks.
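A minimal sketch of this forecasting step, assuming the decay follows a power trend line acc ≈ a·n^b: fit the trend in log-log space on the accuracies for the smallest author set sizes and extrapolate to the larger ones. The accuracies below are the CHR 3 scores for PERSONAE from Table 5.3(a); numpy's polyfit is used as a generic least-squares fitter and is not necessarily the exact fitting routine used for Figure 5.5.

    import numpy as np

    sizes = np.array([2, 3, 4, 5, 10, 20, 50, 100, 145])
    actual = np.array([78.50, 71.47, 70.55, 66.16, 47.40, 27.20, 11.60, 10.00, 5.10])

    def power_fit(ns, accs):
        """Least-squares fit of log(acc) = log(a) + b * log(n)."""
        b, log_a = np.polyfit(np.log(ns), np.log(accs), 1)
        return np.exp(log_a), b

    for k in (2, 4):                        # fit on the first k author set sizes
        a, b = power_fit(sizes[:k], actual[:k])
        predicted = a * sizes[k:] ** b      # extrapolate to the unseen, larger sizes
        print(f"[2:{sizes[k - 1]}]", [round(p, 2) for p in predicted])
    # The [2:3] fit reproduces the overestimation discussed below: it predicts
    # roughly 29% for 145 authors, against an actual accuracy of 5.10%.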


Figure 5.5: Actual vs. predicted performance (based on power trend lines) using CHR 3 for (a) PERSONAE, (b) AAAC A, and (c) ABC NL 1; each panel compares actual performance with predictions fitted on increasingly large sets of small author set sizes (e.g. [2:3], [2:5], [2:10], [2:100]).

We applied this power regression analysis to the three data sets using CHR 3, with an experimental set-up as in EXP 2 (cf. Section 5.4). In PERSONAE (cf. Figure 5.6a), we see that predictions based on results for 2-way and 3-way authorship attribution (indicated with [2:3]) overestimate performance in cases with ten or more candidate authors. Only with four and five candidate authors was the actual performance higher than predicted. The more candidate authors an approach is tested on, the better our estimation of performance with higher author set sizes. Predictions based on all-but-one cases (cf. [2:100]) confirm this, since the difference with the actual performance is relatively small (7.86% predicted vs. 5.10% actual performance). In AAAC A (cf. Figure 5.6b), we observe the same tendencies. Performance predicted from scores in 2-way and 3-way authorship attribution shows some overestimation of performance with four and five candidate authors. We do see that the difference between predicted and actual performance is considerably higher for ten or thirteen candidate authors. When predictions for ten and thirteen candidate authors are based on accuracies from two to five, the difference is smaller. Again, estimations of performance become more reliable with testing on several sets of candidate authors. ABC NL 1 (cf. Figure 5.6c) confirms these findings, since the difference between estimated and actual performance becomes smaller as author set size is increased.

(a) PERSONAE

Author set size   Actual   Prediction [2-3]   [2-5]   [2-20]   [2-100]
2                 78.50    –                  –       –        –
3                 71.47    –                  –       –        –
4                 70.55    66.87              –       –        –
5                 66.16    63.50              –       –        –
10                47.40    54.09              59.04   –        –
20                27.20    46.07              52.34   –        –
50                11.60    37.27              44.64   20.52    –
100               10.00    31.75              39.57   14.95    –
145                5.10    29.13              37.10   12.62    7.86

(b) AAAC A

Author set size   Actual   Prediction [2-3]   [2-5]   [2-10]
2                 88.14    –                  –       –
3                 84.27    –                  –       –
4                 80.00    81.63              –       –
5                 73.70    79.64              –       –
10                54.78    73.76              66.17   –
13                51.76    71.65              63.00   53.14

(c) ABC NL 1

Author set size   Actual   Prediction [2-3]   [2-5]
2                 74.44    –                  –
3                 66.70    –                  –
4                 58.61    61.70              –
5                 51.56    58.08              –
8                 41.11    51.14              43.87

Figure 5.6: Predicted performance based on power regression analysis using CHR 3.

Discussion

We can conclude that testing an approach on small author set sizes exclusively leads to overestimation of the performance of the approach on larger author set sizes. If we were to rely on results for two and three candidate authors, we would expect a significantly higher performance than the approach is actually able to achieve. The results suggest that the text categorization approach is not scalable towards larger author set sizes. It is clear that the author set size factor cannot be ignored and that systematic testing on larger author set sizes should be part of any authorship attribution study aiming to claim reliability or superiority of a specific approach or feature type.


5.5.2 Reliability and Scalability of Features and Feature Types

Character n-grams show the most consistent performance with increasing author set size in the two sets of experiments. Only in cases with fifty candidate authors or more do lexical features score better. We could conclude from these results that lexical features are more robust to the effect of author set size than other feature types. However, high performance is only one factor indicating scalability and robustness. In most authorship attribution studies, a qualitative analysis of the individual features that make up a feature set is still lacking. A multitude of style markers and approaches has been suggested for the authorship attribution task, in most cases without systematically testing them on other data sets or larger author set sizes. Nevertheless, the field has seen, amongst others, lexical features being portrayed as the ultimate style marker. In this section, we investigate the reliability of the individual features and feature types with increasing author set size. In essence, a feature set is scalable when it shows robustness to larger author set sizes and is portable to other data sets. We aim to answer the question of whether lexical features are indeed more reliable than character n-grams, for instance, when applied to large sets of candidate authors.

The individual features that make up the feature set

We begin by presenting a qualitative analysis of CHR 3 and LEX 1 features in order to assess the predictive strength and scalability of these feature types. Table 5.4 shows the top-ten CHR 3 features (as ranked by χ2 and TIMBL's Gain Ratio) for each data set in authorship attribution with five and the maximum number of candidate authors, respectively. Note that we conducted this analysis on the features extracted from the second set of experiments (EXP 2; cf. Section 5.4), where we balanced data size and topic.

PERSONAE   5    ‘les’, ‘ler’, ‘os ’, ‘oos’, ‘xe ’, ‘ zw’, ‘. w’, ‘zwa’, ‘exe’, ‘ica’
           145  ‘ows’, ‘law’, ‘? z’, ‘cde’, ‘abc’, ‘bcd’, ‘ - ’, ‘ ik’, ‘AI ’, ‘cod’
AAAC A     5    ‘, 2’, ‘03 ’, ‘, j’, ‘003’, ‘nyo’, ‘urr’, ‘oyc’, ‘2 .’, ‘n 1’, ‘dit’
           13   ‘ ” “ ’, ‘ppa’, ‘hoe’, ‘dne’, ‘adn’, ‘bur’, ‘ndu’, ‘apo’, ‘eyo’, ‘u m’
ABC NL 1   5    ‘n “ ’, ‘; z’, ‘orp’, ‘dor’, ‘ uw’, ‘ n ’, ‘ ur’, ‘eim’, ‘nkb’, ‘htg’
           8    ‘dak’, ‘ w.’, ‘pmu’, ‘ish’, ‘muz’, ‘!! ’, ‘kts’, ‘i i’, ‘sis’, ‘opm’

Table 5.4: Top-ten CHR 3 features (ordered by χ2 and TIMBL's Gain Ratio) per data set.
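The χ2 part of this ranking can be sketched generically as follows: vectorise the training samples as character trigram counts, score each trigram against the author labels, and keep the ten highest-scoring ones. scikit-learn's chi2 is used here as a stand-in; the Gain Ratio ranking produced by TIMBL is not reproduced, and the toy corpus is hypothetical.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import chi2

    # Hypothetical toy corpus of (text, author) pairs.
    corpus = [("de zwerm gedraagt zich als een complex systeem", "author_38"),
              ("een zwerm is eindeloos complex en blijft leren", "author_38"),
              ("het brein van de mens werkt heel anders", "author_44"),
              ("het menselijk brein blijft toch anders werken", "author_44")]
    texts, authors = zip(*corpus)

    vec = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
    X = vec.fit_transform(texts)
    scores, _ = chi2(X, list(authors))

    top10 = sorted(zip(vec.get_feature_names_out(), scores),
                   key=lambda pair: pair[1], reverse=True)[:10]
    for trigram, score in top10:
        print(repr(trigram), round(float(score), 2))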


In five-way authorship attribution on PERSONAE, most of the CHR 3 features are slices of frequently occurring words (in this data set), such as ‘alles’, ‘leren’, ‘simuleren’, ‘eindeloos’, ‘complexe’, ‘zwerm’, ‘exemplaar’, and ‘informatica’. Some of these are topic-related, which implies that their occurrence in other topics is not very likely. Note that PERSONAE is a single-topic data set, which implies that we cannot apply the topic frequency threshold, like we did in the multi-topic data sets AAAC A and ABC NL 1. Scalability in PERSONAE features is very limited for that reason. However, with more topics available, the individual features would become more reliable. In the AAAC A features based on five candidate authors, most of the CHR 3 features are slices of dates (‘2003’), names (‘Joyce’), and the indefinite pronoun ‘anyone’. In ABC NL 1, we find character n-grams referring to commonly used words, such as ‘nkb’ (e.g. ‘denkbaar’) and ‘htg’ (e.g. ‘echtgenoot’), but also the polite form of ‘your’ (‘uw’ in Dutch) and ‘eim’, a slice of ‘geheim’, a word that occurs in several topics in ABC NL 1 (e.g. murder story, chivalry romance).

Analysis of CHR 3 features based on the maximum number of candidate authors for PERSONAE shows that most of these features are only predictive for a single author (e.g. ‘ows’ as in ‘Windows’; ‘cde’, ‘abc’, and ‘bcd’ as in ‘abcde’). Another one refers to the first person personal pronoun ‘I’ (‘ ik’), a word occurring in the texts of 82% of all authors. The top-ten list also contains a punctuation mark and a topic indicator (‘AI’, since the essays in PERSONAE are about artificial intelligence). In AAAC A, we find slices of adverbs like ‘apparently’ and ‘beyond’, slices matching ‘you make/may/must/might’ (cf. ‘u m’), and quotation marks. ABC NL 1 features contain slices of topic-related words like ‘popmuziek’ or ‘huishoudgeld’ and punctuation marks. Character n-grams indeed combine nuances on different linguistic levels, similar to the observations made in Houvardas & Stamatatos (2006).

We did a similar analysis for LEX 1 features. Lexical features are considered to carry interesting topic-related and author-related information, but are either avoided in authorship attribution studies because they carry topic information or are used while their subject-revealing power (term by Mikros & Argiri, 2007, p. 30) is tolerated (cf. Chapter 4). The topic frequency threshold we introduced in Section 4.3 allows us to avoid topic-related words in the attribution model and focus on content words without the smoking gun of topic. Table 5.5 shows the top-ten LEX 1 features in the three data sets. In these lists, we find words of different syntactic classes – verbs (e.g. ‘begon’, ‘ging’, ‘ontstaat’; ‘come’, ‘includes’; ‘brengt’, ‘ontdekt’), adverbs (‘net’, ‘dagelijks’; ‘too’, ‘possible’, ‘therefore’; ‘simpelweg’, ‘volgens’), and adjectives (‘small’, ‘dominant’, ‘heerlijke’, ‘lieflijke’) – as well as punctuation marks.



PERSONAE   5    ‘ ” ’, ‘:’, ‘eiwitten’, ‘evolutie’, ‘begon’, ‘ging’, ‘net’, ‘ontstaat’, ‘ene’, ‘man’
           145  ‘moreel’, ‘vorser’, ‘prent’, ‘reconstrueren’, ‘dosis’, ‘land’, ‘dagelijks’, ‘fragmenten’, ‘onoverkomelijke’, ‘basisbeginselen’
AAAC A     5    ‘Moser’, ‘too’, ‘come’, ‘instead’, ‘slowly’, ‘cited’, ‘best’, ‘possible’, ‘deal’, ‘due’
           13   ‘small’, ‘dominant’, ‘experienced’, ‘therefore’, ‘concept’, ‘includes’, ‘points’, ‘computers’, ‘protest’, ‘escape’
ABC NL 1   5    ‘binnen’, ‘volledig’, ‘minder’, ‘brengt’, ‘simpelweg’, ‘volgens’, ‘Nederlands’, ‘heerlijke’, ‘verantwoordelijk’, ‘toekomst’
           8    ‘volledig’, ‘vrijwel’, ‘hopelijk’, ‘nogal’, ‘lieflijke’, ‘gedurende’, ‘tante’, ‘ontdekt’, ‘soms’, ‘ochtend’

Table 5.5: Top-10 LEX 1 features (ordered by χ2) for the three data sets.

Whereas most of these lexical and character features have predictive value for some of the authors, they cause the model to overfit the training data. The analysis shows that, although the feature types show robustness to author set size, the individual features fail to scale.

Scalability towards different author sets

When we use the set of features found predictive for a small set of candidate authors to represent a different set of candidate authors, this will lead to low classification performance. In addition, we claim that applying those features to a larger set of candidate authors will also show a significant performance drop. In order to verify these claims, we first analyze the amount of overlap in the individual features between n author pairs, as shown in Table 5.6. With this analysis, we mimic the situation where the predictive value of a set of features found predictive for two-way authorship attribution is tested on other author pairs. This situation would be similar to testing an approach on comparing Cather vs. Kipling, for instance, and using features extracted for that task on a different two-way problem from the same period (e.g. distinguishing between Doyle and London). Taking the 500 CHR 3 or LEX 1 features with the highest chi-squared value in n author pairs, we compute how many features occur in all of the author pairs. We start by selecting ten author pairs randomly, and then compute overlap for n author pairs. With each additional author pair, overlap is recomputed.
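A sketch of this overlap computation, assuming a hypothetical helper top_features(author_pair, k) that returns the k highest-ranked features for one author pair (for instance via the χ2 ranking sketched earlier): the running intersection over a growing number of pairs yields the percentages reported in Table 5.6.

    import random

    def overlap_curve(author_pairs, top_features, k=500, n_pairs=10, seed=0):
        """Percentage of the top-k features shared by the first n randomly chosen
        author pairs, recomputed each time a pair is added (cf. Table 5.6)."""
        rng = random.Random(seed)
        chosen = rng.sample(author_pairs, n_pairs)
        shared = None
        curve = []
        for n, pair in enumerate(chosen, start=1):
            feats = set(top_features(pair, k))      # hypothetical helper
            shared = feats if shared is None else shared & feats
            if n >= 2:                              # overlap is defined from 2 pairs on
                curve.append((n, 100.0 * len(shared) / k))
        return curve

    # Hypothetical usage; compare the output with the PERSONAE / CHR 3 row of
    # Table 5.6 (13.20, 2.73, 0.95, ..., 0.00):
    # print(overlap_curve(all_author_pairs, top_features))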


                     Number of author pairs
                     2      3     4     5     6     7     8     9     10
PERSONAE   CHR 3     13.20  2.73  0.95  0.12  0.03  0.03  0.00  0.00  0.00
           LEX 1     11.24  2.16  0.72  0.19  0.11  0.00  0.00  0.00  0.00
AAAC A     CHR 3     15.70  3.60  1.00  0.44  0.23  0.00  0.00  0.00  0.00
           LEX 1     13.12  4.83  2.39  0.68  0.14  0.00  0.00  0.00  0.00
ABC NL 1   CHR 3     15.20  2.07  0.50  0.24  0.10  0.06  0.00  0.00  0.00
           LEX 1     16.07  3.37  1.70  0.94  0.56  0.18  0.11  0.09  0.08

Table 5.6: Percentage of overlap in individual features over n author pairs.

This analysis clearly shows that the overlap in individual features rapidly declines with increasing points of comparison, a conclusion we can draw from each of the three data sets and from both character n-grams and lexical features. When we calculate overlap in six author pairs, a number of robust LEX 1 features come into view: ‘kan’ and ‘hij’ in PERSONAE, ‘some’ in AAAC A, and ‘haar’, ‘je’, ‘had’, ‘bij’, ‘willen’, the single and double quote, and ‘...’ in ABC NL 1. Of these features, only the single quote and the ellipsis marker survive with ten points of comparison. However, it is clear that these features will not be able to stand the test with a different data set. This confirms our claim that individual features based on a small set of candidate authors are not portable towards a different set of candidate authors. Without systematic and extensive testing outside the candidate set, individual features cannot be claimed reliable.

Scalability towards larger author sets

Our second claim concerns the question of whether features found predictive for a small set of candidate authors will be successful in distinguishing between a larger set of candidate authors. In Table 5.7, we show how the attribution model evolves with increasing author set size. We start by randomly selecting an author pair (e.g. authors 38 and 44 from PERSONAE) – i.e. an instance of two-way authorship attribution – and then extend the training and test sets with additional data from different authors (e.g. by adding author 107). This set-up allows us to verify whether the features extracted for the task with two candidate authors also have predictive value for the tasks that entail three, four, five, etc. candidate authors. This situation is similar to distinguishing between William Shakespeare and Edward de Vere, and then adding additional candidates, such as Sir Francis Bacon (3-way task), Christopher Marlowe (4-way task), etc.



(a) CHR 3

                       2     3     4     5     8    10    13    20    50   100   145
PERSONAE (38 & 44)   500   206    88    39     –    10     –     7     6     5     5
AAAC A (04 & 12)     500   222   109    65     –    43    35     –     –     –     –
ABC NL 1 (03 & 08)   500   187    86    40    29     –     –     –     –     –     –

(b) LEX 1

                       2     3     4     5     8    10    13    20    50   100   145
PERSONAE (38 & 44)   500   169    77    38     –    14     –    10     8     5     5
AAAC A (04 & 12)     500    88    78    72     –    63    54     –     –     –     –
ABC NL 1 (03 & 08)   500   176   125    87    64     –     –     –     –     –     –

Table 5.7: Evolution of overlap (in absolute numbers) in individual features – (a) character and (b) lexical features – with increasing author set size.

Table 5.7a displays the number of CHR 3 features extracted for distinguishing between two candidate authors that survive when the number of candidate authors is increased. Table 5.7b shows the same type of analysis for LEX 1 features. The two author IDs that we track with increasing author set size are shown between parentheses. In the three data sets, we see a dramatic decrease in overlap when more candidate authors are added. In AAAC A, the decline is very sharp, with only 18% of the LEX 1 features surviving with only one additional candidate author. When the author set size is set at 145 in PERSONAE, only five CHR 3 and LEX 1 features, or 1.00% of the 500 features in the two-way case, prevail. In AAAC A (13 candidate authors) and ABC NL 1 (8 candidate authors), we find more absolute overlap than in PERSONAE (145 candidate authors), which can be explained by the amount of variation introduced by the 143 additional authors.

Table 5.8 gives an overview of the features per data set that show robustness with increasing author set size. These lists represent different types of linguistic information: next to content words (‘brein’; ‘family’, ‘people’), verbs (‘consider’, ‘look’, ‘realize’; ‘zal’, ‘doorhebt’), adverbs (‘by’, ‘so’; ‘bovendien’, ‘vrijwel’, ‘totdat’), adjectives (‘hard’; ‘lekker’, ‘ziek’), and punctuation markers, we also find a lot of pronouns (‘my’, ‘I’, ‘some’, ‘we’, ‘his’; ‘ik’, ‘zijn’, ‘haar’, ‘u’). There is a lot of overlap between the LEX 1 and CHR 3 features, since we can find most character trigrams of ‘family’ in the CHR 3 list (viz. ‘fam’, ‘ami’, and ‘ily’), for instance. From this analysis, we can conclude that robust features do emerge from the three data sets, but only in cases with a limited set of candidate authors (8 or 13). In cases with 145 authors, hardly any of the initial features remain part of the attribution model.



CHR 3
  PERSONAE:  ‘ule’, ‘iee’, ‘ ro’, ‘e r’, ‘ ’ ’
  AAAC A:    ‘i w’, ‘par’, ‘iti’, ‘me ’, ‘ace’, ‘mon’, ‘. ”’, ‘iet’, ‘ is’, ‘sta’, ‘ney’, ‘ami’, ‘mys’, ‘nee’, ‘hav’, ‘ ro’, ‘al ’, ‘ us’, ‘my ’, ‘ i ’, ‘fam’, ‘is ’, ‘tio’, ‘be ’, ‘ ki’, ‘ily’, ‘ ” ’, ‘ini’, ‘ion’, ‘hin’, ‘n h’, ‘ my’, ‘ey ’, ‘ok ’, ‘ur ’
  ABC NL 1:  ‘, e’, ‘omi’, ‘ ik’, ‘ pe’, ‘’ e’, ‘ht ’, ‘ om’, ‘ . ’, ‘zij’, ‘ is’, ‘n !’, ‘zo’’, ‘n ’’, ‘had’, ‘’n ’, ‘khe’, ‘ ro’, ‘oog’, ‘ se’, ‘ ! ’, ‘om ’, ‘odi’, ‘dit’, ‘ , ’, ‘ds ’, ‘ - ’, ‘o’n’, ‘ ’ ’, ‘ s ’

LEX 1
  PERSONAE:  ‘dan’, ‘men’, ‘artificieel’, ‘brein’, ‘ ’ ’
  AAAC A:    ‘own’, ‘consider’, ‘family’, ‘people’, ‘almost’, ‘money’, ‘look’, ‘hard’, ‘some’, ‘idea’, ‘an’, ‘individual’, ‘college’, ‘something’, ‘in’, ‘need’, ‘our’, ‘thought’, ‘use’, ‘”’, ‘from’, ‘for’, ‘person’, ‘)’, ‘(’, ‘support’, ‘point’, ‘such’, ‘by’, ‘their’, ‘that’, ‘way’, ‘get’, ‘then’, ‘we’, ‘his’, ‘ability’, ‘parent’, ‘government’, ‘there’, ‘but’, ‘nation’, ‘job’, ‘part’, ‘them’, ‘than’, ‘during’, ‘come’, ‘realize’, ‘he’, ‘me’, ‘earn’, ‘like’, ‘i’, ‘work’, ‘life’, ‘will’, ‘thing’, ‘so’, ‘can’, ‘become’, ‘the’, ‘my’, ‘or’
  ABC NL 1:  ‘beter’, ‘zal’, ‘bovendien’, ‘ten’, ‘hoor’, ‘had’, ‘is’, ‘beetje’, ‘ik’, ‘minder’, ‘paar’, ‘zijn’, ‘hart’, ‘!’, ‘steeds’, ‘zo’n’, ‘’’, ‘iedereen’, ‘duidelijk’, ‘gewoon’, ‘-’, ‘,’, ‘.’, ‘haar’, ‘spel’, ‘zie’, ‘zag’, ‘zullen’, ‘;’, ‘te’, ‘zij’, ‘?’, ‘gelukkig’, ‘lekker’, ‘...’, ‘toch’, ‘dag’, ‘klein’, ‘dan’, ‘overal’, ‘ook’, ‘maar’, ‘echte’, ‘veel’, ‘moment’, ‘zoals’, ‘situatie’, ‘vrijwel’, ‘om’, ‘doorhebt’, ‘ogen’, ‘gauw’, ‘was’, ‘der’, ‘grote’, ‘licht’, ‘vreselijk’, ‘tegen’, ‘s’, ‘u’, ‘dit’, ‘niets’, ‘totdat’, ‘ziek’

Table 5.8: Features found in two-way authorship attribution that show robustness with increasing author set size in the data sets under investigation.

However, it is not reasonable to claim that the use of ‘dan’, ‘men’, ‘artificieel’ and ‘brein’ is indicative of authorship in any set of candidate authors. Since PERSONAE is a single-topic data set, the survival of ‘artificieel’ and ‘brein’ is normal. Although ‘dan’ and ‘men’ are function words, it is far from evident that they will surface as discriminative features in other sets of candidate authors. These features are tailored to the specific data set and author set size under investigation, and unfit to represent any other data set or larger author set size. The tendencies in the features, however, indicate that feature types such as LEX 1 are able to capture some of the important nuances on the different syntactic levels. The features shown above clearly indicate that manual error analysis is essential when evaluating scalability.

Discussion

Now that we have investigated the robustness of individual features and feature types with increasing author set size, we can return to the claims we formulated. Our claim that individual features do not scale to larger sets of candidate authors holds. In order to assess the reliability of an approach, systematic testing on other author sets is essential.


We also confirmed the claim that the individual features cannot be relied on when dealing with a larger set of candidate authors. While it may be correct to claim that lexical features are good clues for authorship, the distribution of a particular word, however useful to distinguish between one particular pair of authors, is irrelevant when comparing another pair of authors. Whereas character n-grams and lexical features as feature types clearly show robustness to the effect of author set size, the individual character n-gram features show as little scalability and portability as the individual lexical features do. It is clear from the results presented above that each of the individual features causes the model to overfit the training data, a shortcoming that makes them unfit to be applied in other and/or larger groups of candidate authors – let alone in data sets of a different nature (e.g. as far as topic or genre is concerned). This type of critical point of view on authorship attribution is relatively new in the field. Most other studies tend to stay away from an in-depth analysis of predictive lexical features. Zooming in on the individual features leads to increased insight into the behavior of the approach when confronted with larger author set sizes than typical in authorship attribution. In fact, our study has shown that each of the individual features in the attribution model overfits the training data. This implies that they are not robust to author set size and not scalable.

5.6 Conclusions

Over the last decade, the field of authorship attribution has seen many interesting studies introducing new feature types or discriminative approaches. However, due to a lack of testing on larger author set sizes than typical in the field, it is difficult to assess the scalability of these feature types and approaches. In this chapter, we investigated the effect of author set size on performance and on the selection of features. By studying the behavior of our text categorization approach when confronted with increasingly larger author set sizes, we wanted to evaluate its scalability. Another goal was to identify robust and scalable features and feature types. The research questions we answered in this chapter are:

Q1 Do we find support for the hypothesis that studies that test an approach on a small set of candidate authors only, overestimate the approach when making claims concerning
   (a) its scalability for cases with large sets of candidate authors, and
   (b) the importance and scalability of specific predictive features?

Q2 Is the effect of author set size in experiments balanced for data size and topic the same as in experiments that are not balanced for these factors? In other words, how do data size and topic interact with the effect of author set size?

In a first set of experiments, we have shown that performance decreases substantially with increasing author set size. Character n-grams outperform the other feature types when confronted with sets of up to twenty candidate authors. When the author set size is increased further, lexical features show more robustness to author set size. Although increasing the author set size has a negative effect on performance, an experiment with eight candidate authors in ABC NL 1 scored worse than one with thirteen authorship classes in AAAC A. Accounting for this atypical result is not straightforward because of the differences in data size, topic and/or genre variation between the evaluation data sets. In addition, it is also unclear how these factors interact with the effect of author set size. In order to get a more clear-cut view on the tendencies in author set size, we controlled for data size and topic in a second set of experiments. This set-up allowed us to verify whether data size is in fact one of the main factors interacting with author set size. Balancing the amount of data used for training in the three data sets caused a significant performance drop in the single-topic data set PERSONAE, but showed a different trend in the multi-topic data sets. In AAAC A, a small increase in performance suggested that balancing for data size as well as authorship has a positive effect on performance. ABC NL 1 results indicated a role for the amount of variation in topics and genres as well as for data size. However, the fact that thirteen-way AAAC A continues to score better than eight-way ABC NL 1 – even with data size and topic balance – indicates that other factors are in play. The inherent complexity of a classification task given a data set is a factor – unexplored so far – that plays an important role in any authorship attribution experiment, irrespective of the amount of control over other factors. Data size and topic (variety) are the main factors accounting for fluctuations in performance, but not the only ones.

Most studies in authorship attribution only test on a task with a small number of candidate authors (often fewer than ten). We have shown that this restriction leads to a limited understanding of the dynamics and reliability of the approach. In the discussion section, we focused on the consequences of testing on small author set sizes exclusively, in terms of performance as well as the reliability of feature types and individual features found predictive for a specific set of candidate authors. A comparative study of classification performance confirmed our claim that testing an approach on small author set sizes only leads to an overestimation of its performance with different or larger sets of candidate authors. Analysis of feature types with increasing author set size indicated that character n-grams and lexical features are more robust – as a type – than other feature types. In fact, character n-grams show more promise than lexical features because they are better at dealing with limited data (cf. Stamatatos, 2008). However, as far as individual features are concerned, we observed that features found predictive for a small set of authors disappear in the variation of a different or larger set of candidate authors.


None of the individual features can be relied on outside the data set they have been extracted from. Most of the individual features cause the model to overfit the data, consequently leading to a model that fails to scale towards larger author set sizes. Our conclusion that the individual features lead to a model that overfits the training data and fails to scale introduces an important limitation to the use of the text categorization approach as we apply it in this dissertation. When dealing with limited candidate sets, the approach is useful and reliable. However, in cases with a large set of candidate authors, the approach is not scalable. Moreover, the approach is very sensitive to variations in author set size and data size, and to author or topic imbalances. Establishing benchmark approaches, Machine Learners, and features for authorship attribution requires each approach to be evaluated on data sets of different dimensions in terms of author set size.


Chapter 6

The Effect of Data Size

In this chapter, we investigate the effect of data size on performance in authorship attribution by gradually decreasing the amount of textual data used for training. Results are presented in learning curves, allowing an analysis of the evolution of performance with decreasing training data. We also explore internal and external factors – such as the Machine Learning algorithm selected – that affect performance when the text categorization approach to authorship attribution is confronted with limited data.

In the previous chapter, we investigated how scalable our text categorization approach is with respect to the number of candidate authors to be learned. Results show a steep decrease in performance with increasing author set size, but nevertheless indicate that there is consistency in the performance of feature types – more specifically of character n-grams and lexical features – whereas we do not discern a similar scalability in the individual features. The text categorization approach to authorship attribution shows reliability for a small set of candidate authors, but applying it in large-scale authorship attribution is not feasible.

In this chapter, we switch to another scalability issue: training data size. Most studies in authorship attribution rely on large volumes of text, often amounting to several tens of thousands of words per author. Modern applications of authorship attribution necessitate an approach that is able to deal with the small sets of textual data per author found in e-mails, blog posts, and tweets. Gradually decreasing the amount of text samples used for training allows us to investigate the effect of data size on performance and feature selection.

This chapter is organized as follows. First, we introduce the data size factor and formulate research questions and expectations (Section 6.1). Then, we describe the experimental set-up (Section 6.2). The results are presented and discussed in two stages (Sections 6.3 to 6.4). After that, we compare two Machine Learning algorithms in terms of robustness to sparse data (Section 6.5). Finally, we formulate conclusions (Section 6.6).


6.1 Introduction and Research Questions

In the last decades of research in (quantitative and computational) authorship attribution, a multitude of discriminative features, experimental designs, and classification methods has been suggested as reliable. Whereas most studies suggest interesting approaches to the task, they are also limited in that they only test a given approach on large sets of training data. Distinguishing between a small set of authors, backed up by a large set of training data, and with (potential) topic influence, is a task that can be solved with high accuracy. However, in cases involving large sets of candidate authors and often small sets of data per author, these approaches may turn out to be less reliable than expected from previously reported results. We consider data size to be as crucial to the scalability of an approach as author set size and topic.

6.1.1 Data Size

In Machine Learning research, it is widely accepted that including more training data leads to better performance, since it allows for more representative sampling. This is commonly referred to as the There's no data like more data concept (Moore, 2001). According to Manning & Schütze (1999), it is more efficient to acquire more training data than to apply balancing strategies on small sets of data. Similarly, Banko & Brill (2001) show that more effort should be invested in investigating scalability issues than in exploring different Machine Learning algorithms on a limited data set. Data size is typically analyzed by gradually increasing the amount of training data and presenting results in learning curves. However, since most learning curves tend to reach a ceiling at some point, the main concern in NLP studies is efficiency, or, in other words, to estimate the minimum data size requirements that allow for best classification performance. In the Banko & Brill (2001) study, for instance, data size is gradually increased from one million words to one billion, in order to investigate how much training data is required for reliable word sense disambiguation.

The few studies that have focused on the issue of data size in authorship attribution explored the other extreme of data size, viz. the effect of dealing with limited data (e.g. Zhao & Zobel, 2005; Hirst & Feiguina, 2007; Luyckx & Daelemans, 2008a; Koppel et al., forthcoming). The main concern in these studies is to estimate the drop in performance when only limited textual data is available as compared to the case when tens of thousands of words are available per author. Investigating the effect of data size reduction in terms of text length allows for an estimation of performance on texts of e-mail, blog post, or tweet length. Investigating reduction in terms of the number of training samples allows us to assess the approach when confronted with only a single training sample per author. Ultimately, the study of data size may allow us to estimate at what point performance will meet the random baseline.
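To make the learning-curve set-up concrete: it amounts to retraining on progressively smaller subsamples of the training data while the test set is kept fixed. The sketch below illustrates this idea with off-the-shelf components – scikit-learn, character trigram counts, and a logistic regression classifier chosen purely as a stand-in – and is an assumption-laden illustration rather than the pipeline used in our experiments.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    def data_size_curve(train_texts, train_labels, test_texts, test_labels,
                        fractions=(0.9, 0.7, 0.5, 0.3, 0.1), seed=42):
        """Accuracy on a fixed test set while training on random subsamples of the data."""
        rng = np.random.default_rng(seed)
        curve = {}
        for frac in fractions:
            n = max(1, int(frac * len(train_texts)))
            idx = rng.choice(len(train_texts), size=n, replace=False)
            sub_texts = [train_texts[i] for i in idx]      # reduced training set
            sub_labels = [train_labels[i] for i in idx]    # (should still cover every author)
            vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
            X_train = vectorizer.fit_transform(sub_texts)  # features re-extracted per subsample
            X_test = vectorizer.transform(test_texts)
            classifier = LogisticRegression(max_iter=1000).fit(X_train, sub_labels)
            curve[frac] = accuracy_score(test_labels, classifier.predict(X_test))
        return curve                                       # fraction of training data -> accuracy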


According to Biber (1990, 1993), 1,000 words is an adequate size for reliable calculations of stylistic variation in text. However, in practice, 10,000 words per author is considered a reliable minimum for an authorial set in literary texts (Burrows, 2007). Some feature types, like vocabulary richness measures, are unreliable when applied to texts shorter than 1,000 words (Tweedie & Baayen, 1998). For blog data, it was shown that increasing training data size aids performance, but also that 2,000 words per author do not improve significantly on performance with 1,500 words in training (Koppel et al., forthcoming). When no long texts are available, for example in poems (Coyotl-Morales et al., 2006), online messages (Zheng et al., 2006), or student essays (van Halteren et al., 2005), often a large number of short texts per author is selected for training. Some studies have shown promising results with short texts (Sanderson & Guenter, 2006; Hirst & Feiguina, 2007), while Eder (2010) shows that the Delta method (cf. Chapter 2) – and by extension any approach based on word frequencies and similarity-based techniques – is not reliable below 2,500 words. Clearly, there is no straightforward answer concerning the minimum size requirements for an authorial set.

Over the last few years, the authorship attribution field has seen an increase in the number of studies focusing on the effect of data size. We describe the most important studies that focus on data size explicitly, where 'data size' can refer to both the length and the number of text samples used for training.

Zhao & Zobel (2005) vary the amount of positive and negative training data per author, and compare binary with multi-class classification. On the one hand, testing on eleven randomly selected pairs of authors (viz. binary classification) does not allow for consistent observations. Results in multi-class authorship attribution, on the other hand, show only a small increase in performance when data size is increased from 50 to 300 documents per author in training. The authors conclude that multi-class classification is a better effectiveness and scalability test than two-class classification.

Sanderson & Guenter (2006) investigated the influence of the amount of training and test material in a study on one-vs.-all authorship verification with fifty candidate authors. They observed that the amount of training material has more influence on performance than the amount of test material. In order to obtain reliable performance, they state that 5,000 words of training data per author can be considered a minimum requirement. The results also indicate that the unmasking technique (introduced in Koppel et al., 2007) is less useful when applied to short texts than when applied to long texts (i.e. books by 19th century authors in the Koppel et al. (2007) study). Unmasking is a meta-learning method where the central idea is to build a classifier per candidate author that compares an unseen text with the training data for that author. Iteratively removing predictive features from each classifier shows a larger drop with the correct author than with the other authors.

Hirst & Feiguina (2007), in a study on authorship attribution of short texts in works by Anne and Charlotte Brontë, present a systematic investigation of the impact of variations in block size – the number of words in a text – from 200 to 500 to 1,000 words and of the effect of increasing the number of blocks used for training. The results show that using multiple short texts overcomes part of the obstacle of having only short texts, even when 'short' means only 200 words per author.

In Abbasi & Chen (2008), we find a comparison of four data sets with varying sizes and characteristics. The CyberWatch Chat data set, for instance, contains on average 1,400 words per author, whereas the Java Forum data set consists of 44,000 words per author. They also include the Enron Email data set as well as a set of eBay comments. By applying a rich feature set of over 10,000 features, the Writeprints technique is able to correctly identify 83% of the 100 authors in the Enron set, 91% in the eBay comments data set, 53% in Java Forum data, and 32% in CyberWatch Chat data. The results indicate that data size is an important factor, but that genre, register, and the amount of noise in the respective data sets also play a role in performance.

In a recent article, Koppel et al. (forthcoming) concentrate on the relationship between authorship attribution performance and the factors of author set size and training and test data size. From a blog data set with 10,000 candidate authors, they select balanced sets of 2,000 words per author for training and 500 words for testing. Instead of constructing binary or multi-class classifiers for the task, they apply cosine similarity as a classification technique to find the most similar author for a given test text. Space-free character 4-grams are used as features. The authors state that, for large sets of candidate authors, statistical methods are more appropriate than Machine Learning methods (Koppel et al., forthcoming, p. 2). By reducing known-text length (i.e. the length of training samples) from 2,000 to 500 words, they find that performance is positively influenced by data size. However, the gain in performance from 1,500 to 2,000 words is only marginal, suggesting that 1,500 words per author in training is the optimal size in their blog set. A limitation of this study is the fact that they use character n-grams while allowing for topic influence (Koppel et al., forthcoming, p. 4). Topic influence may be desirable in some closed-set problems – for instance, in the Enron Email Corpus, made public during the legal investigation into the Enron corporation (Klimt & Yang, 2004) – or in cases with thousands of candidate authors (Koppel et al., forthcoming). In this dissertation, however, we try to increase scalability by restricting topic influence in the attribution model.
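As an illustration of the similarity-based set-up described above, the sketch below attributes an anonymous text to the candidate author with the most similar space-free character 4-gram profile, using cosine similarity. It assumes scikit-learn and simplifies away the details of the original procedure; it is not the implementation of Koppel et al.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def most_similar_author(known_texts_per_author, anonymous_text):
        """known_texts_per_author: dict author -> concatenated known text of that author."""
        authors = list(known_texts_per_author)
        docs = [known_texts_per_author[a] for a in authors] + [anonymous_text]
        vectorizer = CountVectorizer(analyzer="char", ngram_range=(4, 4),
                                     preprocessor=lambda t: "".join(t.split()))  # space-free 4-grams
        vectors = vectorizer.fit_transform(docs)
        similarities = cosine_similarity(vectors[-1], vectors[:-1])[0]  # anonymous text vs. each author
        return authors[similarities.argmax()]                           # most similar candidate author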

6.1.2 Research Questions

In this chapter, we investigate how scalable the text categorization approach is towards smaller sets of training samples. Whereas several techniques have been shown successful for authorship attribution while using large sets of training data, there is only limited research that applies these techniques to limited sets of data. A systematic study of the influence of data size on the scalability of an approach is missing so far. We claim that testing an approach on large sets of training data exclusively fails to give a realistic estimation of performance on smaller sets of training data. Without investigating the effect of data size, it is impossible to assess the validity of the approach suggested and to assess performance and reliability on data sets with different characteristics in terms of dimensionality. When envisaging application of authorship attribution 'in the wild', data size is an important factor.

Our goal is to identify feature types that show robustness to limited data. Indications that syntactic or character features are more robust to limited data than lexical features can be found in Stamatatos (2008), where it is said that character n-grams reduce the sparse data problems that arise when using word n-grams. We can expect the same to hold for syntactic features.

In the field, the term 'data size' can be used to refer to the length of the training samples as well as to the number of training samples. It is generally accepted in Machine Learning and natural language processing that the number of training instances is more relevant to the algorithm's performance than the length of the samples these training instances are based on. However, the authorship attribution task might be a special case. One of the basic assumptions underlying stylometry is the idea that stylistic choices are present in all end products of an author. Nevertheless, short texts include less of the author's specific style preferences, and there is no consensus on the minimum requirements for an authorial set.

Assessing the inherent complexity of a task given (the size of) the data set is not straightforward. Nevertheless, we are able to investigate some of the external factors that influence performance. We will elaborate on this issue in Section 6.5, where we investigate whether we see robustness to limited data in the different document representations and to sparse data in the (types of) Machine Learning algorithms tested. In addition, we investigate the interaction between document representation and ML algorithms when dealing with limited data.

The following research questions are addressed in this chapter:

Q1 How scalable is the text categorization approach towards smaller sets of textual data? Do we find robustness of specific feature types?

Q2 What is the effect of document representation on the ability of the approach to deal with (extremely) limited data? Is the profile-based approach more robust to limited data than the instance-based approach?

Q3 What is the effect of the Machine Learning algorithm on the ability of the approach to deal with (extremely) sparse data? How do MBL and SVMs compare in terms of robustness to sparse data?

In this chapter, we implement two interpretations of data size and investigate how reducing training data size affects performance. The first experiment allows us to assess the effect of varying the number of variable-length training samples, while maintaining an equal distribution of instances over authorship classes. The second experiment enables an analysis of the effect of varying the number of fixed-length training samples, while the distribution of instances is irregular over the different authorship classes.

As far as the Machine Learning algorithm used for classification is concerned, we will compare the performance of MBL and SVMs with sparse data. Our expectation is that eager learners (such as SVMs) will tend to overgeneralize for this task when dealing with sparse training data, while lazy learners (such as MBL), by delaying generalization over training data until the test phase, will be at an advantage when dealing with sparse data. Unlike eager learners, they will not ignore – i.e. not abstract away from – the infrequent or atypical patterns in the training data that will nevertheless be useful in generalization.

Note that we use the term limited data to refer to the actual documents in the data set, indicating short texts or only a small set of text samples being available per candidate author. Sparse data refers to a representation of documents into feature vectors that may contain a lot of zero-frequency features. A sparse representation can be caused by applying sets of lexical features to limited data. Function words are more robust to limited data in that they occur frequently, even in short texts; therefore, we say they avoid sparse data problems. Limited data are a challenge in terms of document representation. Sparse data pose a challenge for Machine Learning algorithms.
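The toy example below illustrates this distinction with an invented mini-corpus and an invented function-word list (using scikit-learn): a short text leaves most of a word-bigram vocabulary at zero frequency, whereas a small function-word feature set remains comparatively dense.

    from sklearn.feature_extraction.text import CountVectorizer

    # two tiny training texts stand in for a large multi-author training corpus
    train_texts = [
        "The author of this essay writes long sentences about a favourite topic.",
        "Some candidate author prefers short sentences and writes about work instead.",
    ]
    short_test = "This author writes about work."

    # word-bigram representation: most vocabulary entries never occur in the short text
    bigrams = CountVectorizer(analyzer="word", ngram_range=(2, 2)).fit(train_texts)
    vector = bigrams.transform([short_test])
    print(f"word bigrams: {vector.nnz} non-zero out of {vector.shape[1]} features")

    # a small (invented) function-word set stays comparatively dense even for short texts
    function_words = ["the", "a", "of", "this", "and", "about", "some"]
    fw = CountVectorizer(vocabulary=function_words)
    print(f"function words: {fw.transform([short_test]).nnz} non-zero out of {len(function_words)}")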

6.2 Experimental Set-Up

We examine the effect of data size in three data sets (AAAC A, ABC NL 1, and PERSONAE) by gradually decreasing the amount of textual data used for training. The resulting learning curves will be used to compare the different feature types in terms of robustness to the effect of data size. Again, we use TIMBL with default settings for numeric features for classification. We aim at more insight into the behavior of the text categorization approach when confronted with limited data. Although there are many different Machine Learning parameters that can be applied to increase performance, the focus in this chapter is on measuring a relative effect rather than on achieving an optimal score.

As we have shown in Figure 5.1 (cf. Chapter 5), each of the data sets exhibits a unique combination of dimensions in terms of author set size, the number of topics, the number of words per author, and the number of words per topic per author. Figure 6.1 shows the distribution of data size (indicated as the number of words) over the various authorship classes and texts (hence, topics).



Figure 6.1: The distribution of data size over the different texts per authorship class. The top data set is PERSONAE (145 authors, 1 topic), the mid data set is AAAC A (13 authors, 4 topics), and the bottom data set is ABC NL 1 (8 authors, 9 topics). Each dot represents a text.

Since the three data sets are controlled in terms of topics, each author writes texts in the same topics. In PERSONAE, most texts are around 1,300 words in length. There are a number of clear outliers, with two texts of less than 1,000 words, and a few texts of 2,000 words or more. In AAAC A, a multi-topic data set, the differences are larger in that some authors have only 2,000 words available, while others wrote texts for a total length of 5,000 words. In fact, one of the authors did not write a text on the topic 'work'. The third data set, ABC NL 1, shows that the authorial set for author 8 contains significantly less data (less than 8,000 words) than the sets for the other authors (with a minimum of 9,000 words in total).

We interpret authorship attribution as a multi-class classification task, using stratified 10-fold cross-validation, and apply the topic frequency threshold to the chi-squared (χ2) feature selection method (cf. Chapter 4) in order to minimize the effect of topic in the attribution model and increase scalability towards other topics. As far as the author set size is concerned, we will present results of authorship attribution with two, five, and the maximum number of candidate authors. As a Machine Learning algorithm, we use Memory-Based Learning (MBL) as implemented in TIMBL (cf. Chapter 3). In Section 6.5, we will compare TIMBL performance with that of the eager learner SMO, an SVM implementation using Sequential Minimal Optimization (Platt, 1998).
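For concreteness, the sketch below approximates this set-up with open-source stand-ins – scikit-learn's chi-squared selector in place of our χ2 ranking with topic frequency threshold, and a 1-nearest-neighbour classifier in place of TIMBL. It is a simplified illustration under these assumptions, not the pipeline used to produce the reported results.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.model_selection import StratifiedKFold
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    def cross_validated_accuracy(texts, authors, n_features=500):
        """Stratified 10-fold cross-validation with chi-squared feature selection.
        Each author needs at least ten samples; n_features must not exceed the
        number of extracted n-grams."""
        texts, authors = np.array(texts, dtype=object), np.array(authors)
        folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
        scores = []
        for train_idx, test_idx in folds.split(texts, authors):
            model = make_pipeline(
                CountVectorizer(analyzer="char", ngram_range=(3, 3)),  # e.g. character trigrams
                SelectKBest(chi2, k=n_features),       # chi-squared ranking (no topic threshold here)
                KNeighborsClassifier(n_neighbors=1),   # nearest-neighbour stand-in for TIMBL
            )
            model.fit(texts[train_idx], authors[train_idx])
            scores.append(model.score(texts[test_idx], authors[test_idx]))
        return float(np.mean(scores))                  # mean accuracy over the ten folds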

We present results of two experiments that each follow a different interpretation of 'data size'. The first interpretation (EXP 1) uses an equal number of samples per author for training, whereas the second interpretation (EXP 2) trains on fixed-length samples, resulting in a larger number of training samples than in EXP 1. Table 6.1 shows the experimental set-up of the two experiments. We will elaborate on their meaning below. Note that while these controlled set-ups are artificial and will not appear in real-life situations, they do enable us to zoom in on the effect of reducing data size.

                                      Data size (in #Instances) per author and topic
         Sample type                  Training [per step]                                    Test
EXP 1    FLEX: 10% slice              [9; 8; 7; 6; 5; 4; 3; 2; 1]                            1
EXP 2    FIX: 100-word slice          #TrainingInstances/9 * [9; 8; 7; 6; 5; 4; 3; 2; 1]     1

Table 6.1: Experimental set-up in the two different implementations of data size (adopted in EXP 1 and EXP 2). Data size is reduced from 90% to 10% of the available training instances, while the test set is kept constant at 10% of the total data size per author and topic.

6.2.1 Data Size as the Number of Variable-Length Training Samples

The first implementation (EXP 1 in Table 6.1) interprets data size as the number of variable-length samples used for training. From the original text, we extract slices that represent 10% of the total length of that text (aka. variable-length or FLEX samples). As a result, each text is represented by an equal number of training samples. Of these ten samples per text (i.e. per topic), one is held out for testing, and the other nine are used for training. When reducing data size stepwise from nine to a single training sample per topic and author, we keep an equal distribution of instances per topic over the different authorship classes, since all texts are represented in ten samples, independent of their original length.

This experiment allows us to investigate performance with an equal number of training samples of variable length. Depending on the original text's length, the resulting samples can be of e-mail, blog post, or tweet length. Nevertheless, the classifier is confronted with an equal number of instances per author, ensuring that no authorship class is better represented in training than the others – a situation that could otherwise attract misclassifications.
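A minimal sketch of this FLEX sampling scheme, assuming simple whitespace tokenization (the helper below is purely illustrative):

    def flex_samples(text, n_train, held_out=9):
        """Cut a text into ten 10%-length slices; hold one out and return n_train slices."""
        tokens = text.split()                      # simple whitespace tokenization
        size = max(1, len(tokens) // 10)
        slices = [" ".join(tokens[i * size:(i + 1) * size]) for i in range(10)]
        test_slice = slices.pop(held_out)          # one slice is held out for testing
        return slices[:n_train], test_slice

    train, test = flex_samples("word " * 1300, n_train=5)   # five FLEX samples of ~130 tokens each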

6.2.2 Data Size as the Number of Fixed-Length Training Samples

The second implementation (EXP 2 in Table 6.1) extracts 100-word slices from the original texts. These 100-word slices (aka. FIX samples) allow for a study of the performance of authorship attribution on texts of fixed length. Only the number of samples used for training is varied. From the extracted FIX samples, one per author and topic is held out for testing, and the others are used for training.

Our choice for samples of 100 words (instead of 200 or 500 words) was influenced by the limited data available. Since we want to investigate the effect of data size over different evaluation data sets, the length of a FIX sample had to take these limits into account. Some texts are only 600 words in length, while others contain up to 2,500 words. Providing 200-word samples would not allow us to generate as many instances, resulting in a poorer representation for some authors.

When reducing training data size, we take into account the original distribution of instances over authorship classes, using the function shown in Table 6.1. For authorship attribution in ABC NL 1 with two candidate authors, this function results in the training data sizes as shown in Table 6.2. The test data size per author equals the number of topics, resulting in nine test instances per author in experiments on ABC NL 1 (by analogy, one in PERSONAE, and four in AAAC A). In EXP 2, we present performance on a single held-out test set by selecting one random set of test samples.

                        Training data size per step (in # FIX instances)
author ID                9    8    7    6    5    4    3    2    1    #TrainingInstances
2 (± 10,400 words)      95   84   74   63   53   42   32   21   11                    95
8 (± 7,500 words)       66   59   51   44   37   29   22   15    7                    66

Table 6.2: Example of how the number of FIX training samples (hence, instances) is calculated for two authorship classes from the ABC NL 1 data set. The function used to calculate this number is #TrainingInstances/9 * [9; 8; 7; 6; 5; 4; 3; 2; 1], ensuring that the original distribution of instances over authorship classes is taken into account while data size is reduced. The test data size is equal to the number of topics per author (i.e. nine in ABC NL 1).
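The per-step counts in Table 6.2 can be reproduced by scaling each author's total number of FIX training instances by step/9 and rounding; the snippet below is a small illustrative sketch of that function (the exact rounding behaviour is an assumption, but it matches the values in the table):

    def fix_training_sizes(n_training_instances, steps=range(9, 0, -1)):
        """Number of FIX samples kept per reduction step: total * step/9, rounded."""
        return [round(n_training_instances * step / 9) for step in steps]

    print(fix_training_sizes(95))   # [95, 84, 74, 63, 53, 42, 32, 21, 11], as for author 2
    print(fix_training_sizes(66))   # [66, 59, 51, 44, 37, 29, 22, 15, 7], as for author 8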

6.2.3 All Data Performance

For reasons of comparison, we provide the accuracy scores of the top performing feature types in Table 6.3. Per data set, and for two, five, and the maximum number of candidate authors, we show classification performance when using 90% of the data for training, as shown in the previous chapter (cf. Section 5.3, where we varied data size within the original data sets). In the following sections, we will refer to these scores as the all data performance. Note that, in this chapter, classification performance with 90% of the data in training will be the same as shown in Chapter 5.


Data set     #Authors   Feature type   Accuracy (in %)
PERSONAE     2          chr3           94.50
             5          chr3           76.36
             145        lex3           22.76
AAAC A       2          chr3           94.25
             5          chr3           72.80
             13         chr2           49.23
ABC NL 1     2          chr3           80.25
             5          chr3           58.00
             8          chr3           43.75

Table 6.3: All data performance of authorship attribution with two, five, and the maximum number of candidate authors for each of the data sets involved in this study. These results have been discussed in Chapter 5.

6.3 Data Size as the Number of Variable-Length Samples (EXP 1)

In this first experiment, we reduce the training data size while keeping the distribution of instances over classes constant (cf. Section 6.2). We present results of authorship attribution trained on 9, 8, 7, ... down to a single FLEX text sample per author and topic in training. Each FLEX sample represents 10% of the text's length (aka. 'variable-length samples'). One such sample is held out for testing. This experiment gives an indication of performance with limited data in cases where the number of instances per authorship class is balanced for all data sizes. As a result, none of the authorship classes can benefit from being better represented in training than the others.

Figure 6.2 visualizes the drop in performance with decreasing data size. For intelligibility reasons, these graphs only present scores for authorship attribution with two and the maximum number of candidate authors. Results of the overall best scoring feature types – in fact the ones with the highest cumulative performance – are indicated with dotted lines. The regions show a decrease in classification accuracy in each data set and for all author set sizes. However, results for the top scoring feature types show that the decrease in performance is steeper for the ones that scored high when using all data (data size = 9; aka. the all data performance, cf. Table 6.3). In each of the data sets, one of the feature types clearly outperforms the others, more specifically when the model is trained on five or more FLEX samples per author and topic. When data size is reduced further, other feature types tend to come close to top performance or even take the lead. Overall, the results exhibit a consistent evolution and performance improves upon the random baseline (in most cases).

Figure 6.2: Visualization of the effect of data size, with data size interpreted as the number of variable-length (or FLEX) samples available for training. Panel (a): PERSONAE, 2 and 145 candidate authors; panel (b): AAAC A, 2 and 13 candidate authors; panel (c): ABC NL 1, 2 and 8 candidate authors. The x-axis shows the number of FLEX fragments per author per topic in training; the y-axis shows accuracy (in %).

Table 6.4: The effect of data size, with data size interpreted as the number of variable-length (or FLEX) samples available for training. The table lists accuracy scores per feature type for data sizes 9, 5, and 1, for 2-way, 5-way, and maximum-way authorship attribution in PERSONAE, AAAC A, and ABC NL 1. (Underlined scores fail to improve upon random baseline performance.)


The individual scores per feature type and data size – limited to data sizes 9, 5, and 1 for reasons of clarity – are shown in Table 6.4. Appendix F contains the full set of results.

In PERSONAE results, we see a clear dominance of character n-grams over the other feature types in two-way and five-way authorship attribution. However, lexical (e.g. LEX, LEM, LEXPOS) and syntactic features (e.g. CGP, POS) tend to outperform the other feature types when data size is reduced to a single sample per author and topic. In 145-way authorship attribution, the situation is different. With a data size of 9 or 5, we see lexical features clearly outperforming character n-grams. However, with only a single FLEX sample available per author and topic in the training set, token-level features score best, although we should emphasize that overall performance is very low.

AAAC A results also show a dominance of character n-grams over most other feature types. Contrary to the results of authorship attribution on the PERSONAE data set, where lexical features score second-best, syntactic features (i.e. POS) perform at the same level as character n-grams and even outperform character features when data size equals 1.

Character n-grams are also the best performing feature type in ABC NL 1 experiments with data sizes 9 and 5, an observation that can be made for all author set sizes tested. However, as data size is reduced to a single sample per author and topic, token-level features (in the case of 5-way authorship attribution) and lexical features (in the case of 8-way authorship attribution) score better than character n-grams.

In general, we can say that character n-grams show more robustness to limited data than the other feature types as long as multiple training samples per author and topic are available. With only a single sample available for training, lexical and syntactic features tend to outperform character n-grams. Providing multiple samples is in most cases a decision in terms of experimental design; in some cases, however, the text samples may be too short to contain stylistic choices of the author. We will return to this issue in Section 6.5, where we discuss internal and external factors that affect performance in an authorship attribution experiment with limited data.

In order to assess whether data size has a similar effect on all authorship classes, we perform a per-class analysis of the precision of the approach when confronted with less and less data. Precision takes into account the number of correct decisions (aka. the True Positives or TP) as well as the number of misclassifications in the direction of that class (aka. the False Positives or FP) (cf. confusion matrix in Chapter 3, Section 3.1.5). This analysis allows us to assess which authorship classes are more influenced by data size than others and which are easier to predict. Note that while we are able to examine the effect of data size, no techniques have been developed, as far as we are aware, to assess the inherent complexity of a class given the training data. Our analysis will in most cases combine both factors.
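For reference, per-class precision can be read directly off the columns of the confusion matrix; the self-contained sketch below (with invented toy labels) shows the computation underlying this kind of analysis:

    import numpy as np

    def per_class_precision(y_true, y_pred, classes):
        """Per-class precision TP / (TP + FP), read off the confusion matrix columns."""
        index = {c: i for i, c in enumerate(classes)}
        cm = np.zeros((len(classes), len(classes)), dtype=int)
        for t, p in zip(y_true, y_pred):
            cm[index[t], index[p]] += 1            # rows: true class, columns: predicted class
        attracted = cm.sum(axis=0)                 # TP + FP per class (column totals)
        with np.errstate(invalid="ignore", divide="ignore"):
            precision = np.where(attracted > 0, np.diag(cm) / attracted, 0.0)
        return dict(zip(classes, precision))

    y_true = ["A", "A", "B", "B", "C", "C"]
    y_pred = ["A", "B", "B", "B", "B", "C"]
    print(per_class_precision(y_true, y_pred, ["A", "B", "C"]))
    # class B attracts two false positives, so its precision drops to 0.5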

103

Figures 6.3 and 6.4 show the interaction between precision, data size, and the distribution of instances over the various authorship classes. The x-axis shows the different authors in the two multi-topic data sets (i.e. AAAC A and ABC NL 1), while the y-axis indicates the data sizes tested. The higher the colored block, the higher the precision score for that data size, whereas block width indicates how well an authorship class is represented in training. Bar width is a function of the original distribution of text length (in number of words) over the authorship classes (as shown in Figure 6.1). When precision equals zero, a thin black line in the graph indicates the class distribution.

In the AAAC A data set (cf. Figure 6.3), classes 2 and 4 are the least prominent in the training set, while class 13 is represented by the highest number of training instances. Consistently high scores for all data sizes can be found with authorship classes 13, 5, 1, 10, 7, and 3 (in that order). Keeping in mind the original class distributions, we see similarities between the classes with most data available – viz. 13, 3, and 7 – and the precision scores with varying data size. This is especially true for class 13, with significantly more data as well as considerably higher precision. Nevertheless, data size does not have a negative influence on precision scores for classes 1 and 5. Moreover, class 2, while significantly underrepresented when compared to class 13, scores relatively well.

Figure 6.3: Effect of class imbalance and data size on precision in EXP 1. This plot shows per-class precision of experiments on AAAC A with 13 candidate authors using CHR2. Bar width indicates the number of instances in training of each class and the length of the corresponding text samples, whereas bar height indicates precision per data size experiment.



In ABC NL 1 (cf. Figure 6.4), we observe similar tendencies. Classes 4 and 8, while least prominent in training, outperform the other authorship classes in terms of precision. Out of the best represented authorship classes (viz. 1, 2, and 3), class 2 exhibits the most consistent scores.

Figure 6.4: Effect of class imbalance and data size on precision in EXP 1. This plot shows per-class precision of experiments on ABC NL 1 with 8 candidate authors using CHR3. Bar width indicates the number of instances in training of each class and the length of the corresponding text samples, whereas bar height indicates precision per data size experiment.

This analysis of precision scores per authorship class demonstrates the effect of data size, which is clearly one of the main factors accounting for high or low precision scores. However, since we observed high scores for classes that are considerably less prominent in training than others, the inherent complexity of a classification task could be another explanation for these scores. The importance of this factor is difficult to estimate, but it clearly plays a role in authorship attribution. Yet another explanation could be that we conducted the precision analysis on character features only; a precision analysis of lexical or syntactic features might lead to different conclusions.



6.4 Data Size as the Number of Fixed-Length Samples (EXP 2)

In this section, we present results of a different interpretation of 'data size', more specifically as the number of fixed-length (or FIX) samples used for training. Our initial training data set contains 90% of all available FIX samples per author. We decrease the amount of training data step by step, by removing 10% of the samples (cf. Section 6.2) from that initial set. In fact, each data size experiment preserves the distribution of instances over authorship classes of the initial set, as we indicated in Figure 6.2. This experiment gives an indication of authorship attribution performance on a very limited set of training samples, with a distribution of instances over authorship classes as in the original data set. Again, we only present part of the results for the various feature types and data sizes in this section. The full list of results is presented in Appendix G.

In Figure 6.5, we indicate the range within which all classification results are situated. The grey regions represent results for authorship attribution with two and the maximum number of candidate authors. In addition, we show some of the best scoring feature types per data set and author set size (cf. the dotted lines in Figure 6.5). These are determined by cumulating the accuracy scores per data size and ranking the feature types along these sums. In PERSONAE, performance in both 2-way and 145-way authorship attribution is consistent in that the results with 10% of the data in training are not substantially different from the – already low – results when all available data (i.e. 90%) was used for training. In AAAC A, there is an upward trend with increasing data size that goes down again when data size is increased further. This trend is not confirmed in ABC NL 1 results. However, the uncontrolled fluctuations in performance over the different data sizes show that discerning general tendencies is complex.

In fact, Table 6.5 shows that, depending on the data set, author set size, or feature type used, performance can either increase or decrease with increasing data size. PERSONAE results with two candidate authors show that some feature types – more specifically FWD, CHR1, CGP2, POS1, CHU2, and REL – show a consistent decrease in performance when data size is reduced from 90% of the available FIX training samples to 10% of them. However, lexical and superficial feature types such as TOK, CWD, LEX1, and LEXPOS1 show a small increase in performance under the same conditions. In fact, these types achieve their highest score with 50% of the training data instead of with 90%, as we would have expected. In AAAC A, the picture is different in that performance consistently decreases when data size is reduced. Finally, tendencies in ABC NL 1 results fluctuate, as was the case in PERSONAE. Most lexical features exhibit large variations in performance when the training data size is reduced: sometimes we see a drop in accuracy, and at other times there is an improvement. Syntactic and character features score consistently worse with only limited training data.

106


Figure 6.5: Visualization of the effect of data size, with data size interpreted as the number of fixed-length (or FIX) samples available for training. Panel (a): PERSONAE, 2 and 145 candidate authors; panel (b): AAAC A, 2 and 13 candidate authors; panel (c): ABC NL 1, 2 and 8 candidate authors. The x-axis shows the percentage of 100-word fragments per author per topic in training; the y-axis shows accuracy (in %).

Table 6.5: The effect of data size, with data size interpreted as the number of fixed-length (or FIX) samples available for training. The table lists accuracy scores per feature type for data sizes 90%, 50%, and 10%, for 2-way, 5-way, and maximum-way authorship attribution in PERSONAE, AAAC A, and ABC NL 1. (Underlined scores fail to improve upon random baseline performance.)


In order to gain insight into the sources of these fluctuations, we present an analysis of precision per class with increasing data size and class imbalance, as we did in the previous experiment (cf. Section 6.3). Figures 6.6 and 6.7 show the effect of data size on precision per authorship class for AAAC A and ABC NL 1. The x-axis indicates the author ID, with bar width representing the distribution of training instances over the authorship classes. In order to calculate the share of each class in training, we take into account the total number of FIX instances in training. On the y-axis, the data size – ranging from 10% to 90% of the available training samples per author and topic – is presented, while bar height indicates precision. In other words, high bars indicate an authorship class that maximizes the number of correctly classified instances and at the same time minimizes the number of misclassifications attracted. Authorship classes that are better represented than others – indicated by wider bars – can be at an advantage during classification because they are more prominent in training, and hence attract misclassifications (i.e. false positives or FP).

AAAC A precision scores for each of the thirteen authorship classes – using character bigrams – are shown in Figure 6.6. When precision equals zero, a thin black line in the graph indicates the class distribution. Classes 2, 5, 9, 11, and 12 are supported by less training data than the other classes, but the differences are only small.

Figure 6.6: Effect of class imbalance and data size on precision in EXP 2. This plot shows per-class precision of experiments on AAAC A with 13 candidate authors using CHR2. Bar width indicates the number of instances in training of each class, whereas bar height indicates precision per data size experiment.


On the one hand, we see that some classes score top precision irrespective of the amount of textual data: 10% or 30% is enough training data for classes 1, 3, 4, and 5 to reach high performance, in spite of the limited data available for author 5 in comparison with the other authorship classes. On the other hand, other classes (6, 7, and 11) score zero precision, irrespective of the amount of training data.

ABC NL 1 scores for the eight authorship classes – using character trigrams – are shown in Figure 6.7. Here, we see tendencies similar to those in AAAC A, in that some classes (e.g. 2 and 6) are easier to predict than others (e.g. 5, 7, and 8), regardless of the data size. In fact, the ABC NL 1 data set consists of four texts per author, except for author 2, who only wrote three texts, which confirms that data size is not the only factor explaining these results.

Figure 6.7: Effect of class imbalance and data size on precision in EXP 2. This plot shows per-class precision of experiments on ABC NL 1 with 8 candidate authors using CHR3. Bar width indicates the number of instances in training of each class, whereas bar height indicates precision per data size experiment.

This second experiment has shown that reducing the training data size – i.e. reducing the number of fixed-length training samples while keeping the original distribution of instances over classes intact – causes large fluctuations in performance. The fact that results are unpredictable has an impact on our expectations for the performance of real-life applications of authorship attribution. Although the differences in the distributions over the authorship classes are only limited, real-life applications will typically involve authors with only short text samples as well as large authorial sets, making performance very difficult to predict. This implies that our text categorization approach is very sensitive to the various factors internal and external to the task and therefore unable to perform reliably when confronted with these variations, especially when only limited data is available.

However, as in the first experiment (cf. Section 6.3), the precision analysis shows that data size, while important, is not the only factor influencing performance. One possible explanation lies in the inherent complexity of the task and of the authorship classes involved. As we explained in the previous section, it is hard to estimate to what extent this inherent complexity plays a role. A second explanation could be that the character n-grams used in this analysis are unable to score high precision for all classes.

6.5 Robustness to Limited Data

In the two previous sections, experimental results have shown that reducing the amount of textual data used for training has a clearly negative effect on performance. We can identify a number of factors that interact with the effect of data size, grouped under the term inherent complexity of the task given the data set. These factors relate to the authorial sets and the short text samples involved. The most apparent ones are:

• consistency in writing style over different topics (cf. Chapter 4 on intra-topic variation)
• comparability between authorship classes over various topics (cf. Chapter 4 on inter-topic variation)
• presence of different genres in the data sets (three genres in ABC NL 1 and a single genre in AAAC A and PERSONAE)
• presence of stylistic choices in the short text samples

Each of these factors affects performance, possibly even to a similar extent as reducing data size does. The impact of some of these factors can be estimated, as we indicated in Chapter 4 when zooming in on inter- and intra-topic variation, but others are far more difficult to assess. The 'presence of stylistic choices' in short texts, for instance, requires a technique to calculate the number of stylistic choices in a text. The stability method suggested by Koppel et al. (2003a) may provide an interesting approach towards measuring stylistic choices. The central idea of stability is that words that can be replaced in a sentence without modifying its contents are potential stylistic choices. The approach, which includes Machine Translation for the generation of semantically equivalent sentences, is rather complex, but as far as we know it is the only meaningful technique – as opposed to naive techniques such as Yule's characteristic K (cf. Chapter 2) – to estimate the number of stylistic choices in a text. Techniques that allow us to assess the 'strength' of an authorial set would provide insight into the complexity of the task. As far as we know, there have not been any attempts to develop such techniques.

Apart from these inherent factors, performance in short text authorship attribution is also affected by a number of external factors. In this section, we zoom in on two of these factors (document representation and Machine Learning algorithm), investigate how they interact, and analyze their robustness to limited data.

6.5.1 The Limited Data Challenge

Short text authorship attribution poses a specific challenge to our text categorization approach, and by extension probably to any approach. Whereas stylistic choices are generally accepted to be present in every text, they occur less frequently in short texts. Short text authorship attribution with a limited set of training data requires a reliable and robust representation of these texts as well as a Machine Learning algorithm that is able to deal with potentially sparse data. When training data size is reduced, the approach is stress-tested even more.

Document Representation

In terms of document representation, the two main approaches in authorship attribution are instance-based and profile-based representation (terminology adopted from Stamatatos, 2009). Instance-based approaches represent each training text sample as an instance, whereas profile-based approaches group the training texts per candidate author in a 'pseudo-document' and represent them cumulatively, by means of a profile of the author's style. On the one hand, the advantage of an instance-based approach is that it represents the variation in the author's writing style over the different text samples, while part of that variation is averaged out in profile-based approaches. On the other hand, the profile-based approach generates a more general representation of the author's writing style, whereas that representation is (literally) fragmented in the instance-based approach. Adapted to multi-topic data, profiles represent a blend of all topics, whereas the instance-based approach treats all topics separately and on the same level.

As far as we know, there is no study in authorship attribution that presents an empirical comparison of the two main approaches. In Stamatatos (2009), we find a conceptual comparison of instance-based and profile-based approaches based on a survey of the field. The most important points of comparison are the applicability of text-level features (e.g. sentence length), the applicability and sensitivity of certain classification methods towards these representations, the training and running time costs, and the influence of class imbalance in these approaches. In the same study, profile-based approaches are said to be more reliable than instance-based approaches in short text authorship attribution. We test whether this observation is valid for the short text data sets we deal with and for the different data sizes.


We implement the profile-based approach by combining all texts by a specific candidate author into a single large document, and extract features from that set. The instance-based representation was implemented as described in Chapter 3 (Section 3.1.3). Note that there is no difference in the set of predictive features generated in the two approaches. Chi-squared (χ2) operates on the class level – comparing expected and observed frequencies over authorship classes – and not on the level of the individual text (as opposed to TF-IDF, for instance). The different document representations have an effect on the organization of text samples over instances, not on the selection of features. The difference lies in how the respective learning algorithms deal with these representations, hence in performance.
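The contrast between the two representations can be made concrete with a minimal sketch (author labels and texts are invented): the instance-based variant keeps one training item per text sample, while the profile-based variant first concatenates each author's training texts into a single pseudo-document.

    from collections import defaultdict

    def instance_based(samples):
        """samples: list of (author, text) pairs -> one training instance per text sample."""
        return list(samples)

    def profile_based(samples):
        """samples: list of (author, text) pairs -> one concatenated profile per author."""
        profiles = defaultdict(list)
        for author, text in samples:
            profiles[author].append(text)
        return [(author, " ".join(texts)) for author, texts in profiles.items()]

    samples = [("ann", "first text"), ("ann", "second text"), ("bob", "another text")]
    print(len(instance_based(samples)), "instances vs.", len(profile_based(samples)), "profiles")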

Lazy vs. Eager Learning

A second external factor is the Machine Learning algorithm. In supervised Machine Learning (ML), there are two types of algorithms: lazy learning and eager learning methods. Whereas eager learners (e.g. Support Vector Machines, Decision Trees, Neural Networks) build a model from the training instances and test it against the incoming test instances, lazy learners (e.g. Memory-Based Learning, Case-Based Reasoning) simply store the training instances and compare the incoming test instances against these training instances. When applied to sparse data, eager learners will tend to overgeneralize. Lazy learners, by delaying generalization over training data until the test phase, could be at an advantage when dealing with sparse data. Unlike eager learners, they will not ignore – and therefore not abstract away from – the infrequent or atypical patterns in the training data that may nevertheless be useful in generalization. Since it is essential in short text authorship attribution to select an algorithm that shows robustness to sparse data, we compare the performance of an eager learner, SVMs, with that of a lazy learner, MBL.

Both algorithms have been compared against other algorithms in authorship attribution test cases. In Zhao & Zobel (2005), four algorithms (i.e. Naive Bayes, Bayesian Networks, k-Nearest Neighbors, and Decision Trees) are compared in terms of performance in two-way and five-way authorship attribution. Overall, Bayesian Networks were found more reliable than the other learners. However, in one-vs.-all authorship attribution (an experimental approximation of one-class learning), with only limited positive data available for training, nearest neighbor methods perform best. Jockers & Witten (2010) compare five classification methods (i.e. Delta, k-Nearest Neighbors, Support Vector Machines, Nearest Shrunken Centroids, and Regularized Discriminant Analysis) on a case of disputed authorship, the Federalist Papers. The study suggests Nearest Shrunken Centroids as the most reliable learner. In spite of the thorough analysis of results, we consider this experiment to be problematic since disputed authorship does not allow for benchmarking because of the absence of ground truth. In addition, both studies were limited in terms of author set size (i.e. two or five candidate authors) and data size (i.e. large authorial sets).

In this chapter, we present a comparison of SVMs and MBL in terms of performance with decreasing data size. In addition, we test whether there is an interaction between document representation and the learner selected for classification. According to Stamatatos (2009), instance-based approaches take advantage of powerful Machine Learning algorithms able to handle high-dimensional, noisy, and sparse data (e.g., SVM). We investigate whether the data confirm this or not. For lazy learning, we use a Memory-Based Learner (MBL), TIMBL (Tilburg Memory-Based Learner) (Daelemans et al., 2007), a supervised inductive algorithm for learning classification tasks based on the kNN algorithm. This was also the algorithm used in the experiments discussed above. SMO is an implementation of the eager learner SVM using Sequential Minimal Optimization (SMO) (Platt, 1998), embedded in the WEKA (Waikato Environment for Knowledge Analysis) software package (Witten & Frank, 1999). We use TIMBL with default settings for numeric features (cf. Chapter 3) and SMO with default settings. Note that we do not optimize in terms of features or parameters for MBL or SVMs. For both algorithms, optimization can have a large effect on performance. Identifying the optimal parameters for the different data sets, data sizes, and feature types is beyond the scope of this dissertation.
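As a rough sketch of such a comparison with open-source stand-ins – a 1-nearest-neighbour classifier approximating MBL and scikit-learn's LinearSVC approximating an SVM trained with SMO, both left at default settings – the experiment boils down to training the two learners on the same feature representation and comparing their accuracy. Neither stand-in is the exact software used in this chapter.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def compare_learners(train_texts, train_authors, test_texts, test_authors):
        """Train a lazy and an eager learner on identical features and return accuracies."""
        scores = {}
        for name, learner in [("lazy (1-NN)", KNeighborsClassifier(n_neighbors=1)),
                              ("eager (linear SVM)", LinearSVC())]:
            model = make_pipeline(CountVectorizer(analyzer="char", ngram_range=(3, 3)),
                                  learner)                 # same representation for both learners
            model.fit(train_texts, train_authors)
            scores[name] = model.score(test_texts, test_authors)
        return scores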

6.5.2 Results and Discussion

The results presented here adopt the interpretation of data size as shown in EXP 1, using an equal distribution of instances over classes. When reducing data size, we reduce the number of variable-length (FLEX) text samples used for training (cf. Section 6.2). We present results for authorship attribution with the maximum number of candidate authors for each data set. Figures 6.8 to 6.10 show learning curves for the three data sets and four feature types: CHR, LEX, POS, and LEXPOS. We compare results for instance-based and profile-based representations combined with MBL or SVMs for classification. Note that combining a profile-based approach with SVMs for classification will influence the ability of SVMs to build a classification model. Nevertheless, it is interesting to see how the various combinations of approaches and learners deal with limited data. Our comparisons are based on the same feature types for all experiments. We rely on the best performing feature types in our baseline approach: instance-based representation combined with MBL (presented in Section 6.3). The exact results for each data set are in Appendix H.

In PERSONAE with 145 candidate authors (Figure 6.8), a comparison in terms of document representation shows that the profile-based approach (solid line) scores marginally better than the instance-based approach (dashed line) in CHR and POS.

Figure 6.8: The effect of document representation (instance-based vs. profile-based) and Machine Learning algorithm (MBL vs. SVMs) on performance in data size experiments in 145-way PERSONAE. Data size is interpreted as in EXP 1. (Four panels – CHR, LEX, POS, and LEXPOS – plot accuracy (in %) against data size for instance-based and profile-based TiMBL and SVMs.)

However, in the lexical feature types LEX and LEXPOS, the instance-based approach scores significantly better than the profile-based approach. While performance of the instance-based approach is more or less constant over the different data sizes, we see a gradually declining performance for the profile-based approaches. The results for PERSONAE therefore only partly confirm the observation in Stamatatos (2009) that a profile-based approach is a better choice than an instance-based approach when dealing with short text data. As far as the ML algorithm is concerned, we see that SVMs score significantly better than MBL. However, SVM performance declines rapidly as data size is reduced, down to the same level as MBL.

In the multi-topic data set AAAC A with thirteen candidate authors (Figure 6.9), our baseline approach is the worst performing one. In fact, adopting a profile-based approach, combined with MBL for classification, already means a substantial improvement – an error reduction of around 30% – for most data sizes, but this holds for character n-grams only. In the other feature types, we hardly see profile-based+MBL improve on instance-based+MBL.

Figure 6.9: The effect of document representation and Machine Learning algorithm on performance in data size experiments in 13-way AAAC A. Data size is interpreted as in EXP 1. (Four panels – CHR, LEX, POS, and LEXPOS – plot accuracy (in %) against data size for instance-based and profile-based MBL and SVMs.)

Although CHR scores best overall, performance drops dramatically when data size equals two. As far as the choice between MBL and SVMs is concerned, we see that in all cases, SVMs score significantly higher than MBL. The difference in performance between using instance-based or profile-based approaches with SVMs is small in most cases. In the syntactic feature type POS, applying SVMs for classification in combination with a profile-based or instance-based approach makes little difference.

The other multi-topic data set, ABC NL 1 with eight candidate authors (Figure 6.10), shows results very similar to those for AAAC A. In most feature types and data sizes, instance-based+MBL is the worst scoring combination. Adopting the profile-based document representation improves performance substantially in CHR and POS. In POS, profile-based+MBL is even the best scoring combination when data size is lower than seven (i.e. fewer than seven FLEX training samples per topic and author). We see a similar tendency in LEX and LEXPOS, in that MBL performance improves on SVMs with decreasing data size, although the differences are small. However, results with character n-grams show that SVMs outperform MBL.

Figure 6.10: The effect of document representation and Machine Learning algorithm on performance in data size experiments in 8-way ABC NL 1. Data size is interpreted as in EXP 1. (Four panels – CHR, LEX, POS, and LEXPOS – plot accuracy (in %) against data size for instance-based and profile-based MBL and SVMs.)

In this section, we discussed two crucial external factors that affect performance in authorship attribution: document representation and learning algorithm. More specifically, when applied to short text attribution, the two document representations (instance-based or profile-based) and Machine Learning algorithms (eager SVMs or lazy MBL) show different behavior. Our comparison of the different approaches has shown that instance-based+MBL – our baseline – is often the worst performing combination. Applying a profile-based approach (combined with MBL for classification) increases performance substantially, but not to the extent that it outperforms SVMs. SVMs are currently the method of choice in authorship attribution research, and our results indicate that they deal with short text data better than MBL does. MBL shows some robustness to sparse data in lexical and syntactic features, but the differences are too small and inconsistent to claim superiority over SVMs.
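To make the contrast between the two document representations concrete, the sketch below builds an instance-based model (every training sample is a separate instance) and a profile-based model (all training samples of an author are concatenated into a single profile document) and evaluates both while the number of training samples per author shrinks, loosely mirroring the learning curves above. It is a simplified stand-in, not the TIMBL/SMO pipeline used in the experiments: the nearest-profile rule, the tf-idf character trigrams, and all function names are assumptions.

    # Simplified sketch of instance-based vs. profile-based attribution (assumptions noted above).
    from collections import defaultdict
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    def instance_based(train_texts, train_authors, test_texts):
        """Every training sample is kept as a separate instance."""
        vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
        X = vec.fit_transform(train_texts)
        clf = KNeighborsClassifier(n_neighbors=1).fit(X, train_authors)
        return clf.predict(vec.transform(test_texts))

    def profile_based(train_texts, train_authors, test_texts):
        """All training samples of an author are merged into one profile document."""
        profiles = defaultdict(list)
        for text, author in zip(train_texts, train_authors):
            profiles[author].append(text)
        authors = sorted(profiles)
        vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
        X = vec.fit_transform([" ".join(profiles[a]) for a in authors])
        clf = KNeighborsClassifier(n_neighbors=1).fit(X, authors)  # nearest profile
        return clf.predict(vec.transform(test_texts))

    def learning_curve(train_texts, train_authors, test_texts, test_authors, sizes=range(9, 0, -1)):
        """Shrink the training set per author and report accuracy for both representations."""
        for n in sizes:
            kept_texts, kept_authors, seen = [], [], defaultdict(int)
            for text, author in zip(train_texts, train_authors):
                if seen[author] < n:
                    kept_texts.append(text)
                    kept_authors.append(author)
                    seen[author] += 1
            for name, attribute in (("instance-based", instance_based),
                                    ("profile-based", profile_based)):
                predictions = attribute(kept_texts, kept_authors, test_texts)
                print(n, name, round(accuracy_score(test_authors, predictions), 3))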


6.6 Conclusions

Most studies in authorship attribution are able to reliably identify the author of a text while relying on large amounts of training data per candidate author. However, when applying authorship attribution on a large scale, often only limited data per candidate author is available. This situation requires an approach that scales towards these smaller sets of textual data. In this chapter, we investigated the scalability of our text categorization approach to authorship attribution in terms of data size. We systematically decreased the number of text samples used for training and presented results in learning curves. We also investigated two document representation approaches and two learning algorithms for robustness to limited data. The following research questions were addressed in this chapter:

Q1 How scalable is the text categorization approach towards smaller sets of textual data? Do we find robustness of specific feature types?

Q2 What is the effect of document representation on the ability of the approach to deal with (extremely) limited data? Is the profile-based approach more robust to limited data than the instance-based approach?

Q3 What is the effect of the Machine Learning algorithm on the ability of the approach to deal with (extremely) sparse data? How do MBL and SVMs compare in terms of robustness to sparse data?

We implemented two interpretations of 'data size'. In a first set of experiments (EXP 1), data size was interpreted as the number of variable-length text samples per topic and author used for training. This set-up ensures an equal distribution of instances over authorship classes. In a second set of experiments (EXP 2), data size was interpreted as the number of fixed-length training samples per author. In this set-up, the original distribution of instances over classes is preserved over the data size experiments. In both sets of experiments, we are dealing with short text samples that approximate the length of an e-mail.

The learning curves presented in EXP 1 show a dramatic decrease in performance as data size is reduced to 10% of the data available per candidate author. Overall, character n-grams dominate the other feature types in terms of performance with decreasing data size. In a few cases, with only a single sample per topic used for training, lexical and syntactic features outperform character n-grams. In EXP 2, where the distribution of instances over classes shows substantial variation, the results exhibit very little consistency. Again, character n-grams show the highest degree of robustness to limited data.

We performed analyses of the precision results per authorship class in order to evaluate whether an authorship class represented by more data than other classes also leads to higher performance. The results show that, apart from the size of the authorial set, other internal factors play a role. The fact that some classes score better than others, although they are represented by fewer instances, indicates a role for the inherent complexity of a task, given a data set. The importance of some of these inherent factors – such as inter-topic and intra-topic variation (cf. Chapter 4) – can be estimated, but the field has not seen a lot of effort invested in estimating the 'strength' of an authorial set.

Performance is also affected by a number of external factors, such as the document representation and the Machine Learning algorithm used for classification. We have shown that the profile-based document representation leads to higher performance than the instance-based approach (both with MBL as a learner). In terms of learning algorithm, eager learner SVMs show more robustness to sparse data than lazy learner MBL. There is no real difference in performance when combining SVMs with instance-based or profile-based document representation. Our results confirm the observation that SVMs are good at dealing with sparse data. When combined with lexical or syntactic features, lazy learner MBL shows some degree of robustness to sparse data, but not to the extent that we can claim superiority.

We can conclude from this chapter that the text categorization approach shows little reliability when dealing with limited data. The fact that performance is difficult to predict and that the approach scores low has important consequences. When applied to authorship attribution 'in the wild' – a situation with a lot of text samples available for some authors and only limited data for others – performance will be unpredictable. Our text categorization approach to authorship attribution is not scalable when confronted with limited data.


Part III

Conclusions

Chapter 7

Conclusions and Further Research

Even though the last decades of research – from both Digital Humanities and Computational Linguistics perspectives – have brought substantial innovation to the field of authorship attribution, most studies only scratch the surface of the task. The field is dominated by studies performing authorship attribution on small sets of candidate authors supported by large sets of training data and a set of topic-neutral features (often function words). As a result, it is uncertain how the proposed approaches will perform when confronted with different types of data. In addition, the often vague descriptions of experimental design and the underuse of objective evaluation criteria and of benchmark data sets cause problems for the replicability and evaluation of reliability of some studies. As far as interpretability of features is concerned, most studies either restrict their analysis to function words or focus on quantitative evaluation of results.

This dissertation addressed these issues in a systematic study of the scalability of a text categorization approach to authorship attribution. We studied the behavior of the approach when confronted with multi-topic data, large author set sizes, and limited training data. Our goal was to identify robust and scalable features and feature types, and to determine whether the approach is viable for application on a large scale. The approach was tested on three evaluation data sets that exhibit distinct characteristics in terms of the number of topics, author set size, and data size.

The dissertation was set up as follows. In Chapter 2, we described the state of the art in authorship attribution and introduced the most widespread discriminative methods, feature types, and feature selection methods in authorship attribution research. Chapter 3 introduced our text categorization approach and described the background and structure of the three evaluation data sets. Chapters 4 to 6 studied the behavior of our approach when confronted with scalability issues, both in isolation and in interaction with other factors. In this concluding chapter, we discuss the most important insights obtained from our study and indicate how these insights contribute to our understanding of the approach and to establishing benchmark approaches for authorship attribution (Section 7.1). Finally, we formulate limitations of this dissertation and describe perspectives for further research (Section 7.2).


7.1 Conclusions

In this dissertation, we stress-tested a text categorization approach to authorship attribution in order to investigate its scalability. Application of authorship attribution on a large scale, for instance in social networks, requires an approach that performs consistently under various uncontrolled settings. These settings can be variations in topic, genre, the number of candidate authors (the author set size), and the amount of data available per author (the data size). A scalable approach performs reliably and above baseline, irrespective of these variations.

Although each of these variations has a substantial influence on performance and on the individual features selected to form the attribution model, most contemporary studies in authorship attribution ignore the issue of scalability. Scalability has only recently emerged as a research topic in authorship attribution. The studies that do focus on the effect of author set size and data size often use longer texts than we did in this dissertation. The short texts used in our study are a challenge and require a reliable and robust representation of those texts as well as a discriminative approach that is able to deal with sparse data.

We attempted to join the strengths of Digital Humanities and Computational Linguistics studies by combining a systematic approach and focus on performance with a thorough evaluation of features and their viability for the task. The main objectives in this dissertation were three-fold. First of all, we provided an account of the scalability of the discriminative approach and of a variety of feature types. While doing that, we focused on thorough experimental design and used freely available evaluation data sets in order to ensure replicability. A final objective was to provide an in-depth feature analysis to allow for interpretability of the results.

Our approach was stress-tested by confronting it with multi-topic data, substantially larger author set sizes than we usually find in the field, and considerably smaller sets of training data than typically used. Our analysis of the behavior of the approach when challenged with data similar to that found online allows us to evaluate its scalability. Essentially, this study is a showcase for the complexity of an authorship attribution task given a data set. Below, we go into detail on each of the scalability issues and the insights resulting from our study. We conclude by evaluating the scalability of the text categorization approach as we implemented it in this dissertation.

7.1.1 Experimental Design in Multi-Topic Data

Multi-topic authorship attribution requires careful experimental design in order to keep topic out of the attribution model. Including topic-specific words in the attribution model can either aid classification, affect performance in a negative way, or confuse the discriminative method. This results in a model that is unreliable when tested on other topics – in other words, a model that fails to scale. However, topic is a difficult factor to 'separate' from authorship since both are intertwined to a large extent. The field of authorship attribution clearly struggles with topic. A commonly applied solution to avoiding topic influence in the attribution model is the restriction to using only function words, as these are considered insensitive to topic shifts. Although function words are robust to limited data and provide good indicators of authorship, we see at least two reasons to consider the integration of content words. First of all, it has been shown that stylistic features work well for topic identification, and that hardly any of the so-called topic-neutral features are in fact topic-neutral. Secondly, a lot of useful predictive information is disregarded by excluding content words without consideration.

In Chapter 4, we investigated the effect of various feature selection methods and cross-validation schemes on the scalability of the resulting model. Our aim was to integrate content words into the model without including topic-specific words, as these affect scalability. First, we focused on feature selection and evaluated techniques to include topic-neutral content words in the model. We implemented a number of commonly applied methods and compared their ability to scale towards the test set. We did that by tracking unique identifiers of an authorship label between the train and test data. We showed that information gain (IG) scores low accuracy, but does not introduce any scalability issues. Chi-squared (χ2), in contrast, performs better, but potentially leads to a lot of scalability problems. However, the absence of scalability issues does not imply a model that is scalable towards other topics. We provided an in-depth analysis of the different types of features influencing scalability and found that combining χ2 with a topic frequency threshold – although an aggressive feature selection method – allows us to exclude topic-specific words without sacrificing scalability.

A second factor was the choice of cross-validation (CV) scheme. We compared the effect of applying stratified CV against that of applying held-out-topic cross-validation. A third scheme, the single-topic scheme, redefines a multi-topic task as a set of single-topic tasks, thereby disregarding the multi-topic nature of the task at hand. In terms of performance, the single-topic scheme was overall the best performing scheme, whereas held-out-topic proved to be the most challenging one. While these schemes are impractical to apply in large-scale authorship attribution, because they require topic information, they allow for unique insight into the challenge of working with multi-topic data. We observed a large amount of variation between the various topics in an authorial set (cf. intra-topic variation) as well as between the topics over the different authors (cf. inter-topic variation). Because of the substantial variation, it is not straightforward to evaluate the scalability of our approach. In fact, the approach behaves unpredictably when a held-out-topic scheme is adopted, implying it is not stable enough to apply on a large scale.
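The combination of χ2 feature selection with a topic frequency threshold can be made explicit: features are scored by their χ2 association with the author labels, but any feature occurring in only one topic is discarded beforehand. The Python sketch below is an illustration of that idea under assumed inputs (a list of texts with author and topic labels); it uses scikit-learn as a stand-in and is not the exact implementation evaluated in Chapter 4.

    # Illustrative sketch of chi-squared selection with a topic frequency threshold.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import chi2

    def chi2_with_topic_threshold(texts, authors, topics, k=500):
        """Rank lexical features by chi-squared w.r.t. authorship, keeping only
        features that occur in more than one topic."""
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(texts)              # documents x features
        topics = np.asarray(topics)

        # In how many distinct topics does each feature occur?
        topic_count = np.zeros(X.shape[1], dtype=int)
        for topic in np.unique(topics):
            rows = X[topics == topic]
            topic_count += (rows.sum(axis=0) > 0).A1.astype(int)

        scores, _ = chi2(X, authors)                     # association with the author labels
        scores = np.where(topic_count > 1, scores, -np.inf)   # topic frequency threshold
        best = np.argsort(scores)[::-1][:k]
        return np.asarray(vectorizer.get_feature_names_out())[best]

Features such as the proper nouns and topic markers listed in Appendix A would fail the threshold and never enter the model, whatever their χ2 score.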


From our study in Chapter 4, we can conclude that small decisions in experimental design indeed have a large effect on performance and on the scalability of the model. Combining the stratified CV scheme with chi-squared plus a topic frequency threshold was shown to be the best technique to allow for content words in the attribution model without sacrificing scalability.

7.1.2 Author Set Size

In Chapter 5, we investigated the effect of author set size on performance and on the scalability of the attribution model. Since most studies in authorship attribution focus on small sets of candidate authors, it is impossible to assess the viability of the approaches when applied to larger author set sizes. By systematically increasing the number of candidate authors, we can zoom in on performance and on the (types of) features performing well. Since we wanted to investigate the interaction between author set size and data size as well, we reported on two sets of experiments: one with the original data sets, and a second one where the data size was balanced over the three data sets. Both sets of experiments indicated a substantial decrease in performance with increasing author set size. Character n-grams and lexical features showed more robustness to the effect of author set size than other feature types. However, there are variations in the results of the first experiment that are not caused by author set size, but could be accounted for by differences in training data size or an unbalanced distribution of topics. In a second set of experiments, we limited the training data to the same amount for the three data sets and ensured topic balance. However, we observed the same variations after applying this balanced set-up. This suggests that data size and topic (variation) are important factors, but that there is another important factor affecting performance. The inherent complexity of a task given a data set joins a group of factors that relate to the strength of an authorial set. How all of these factors interact with each other is not clear.

Because performance decreases consistently with increasing author set size, we can simulate the point of view of studies that are limited to experiments using a small set of candidate authors and report on good classification results. We forecasted performance in authorship attribution with a large author set size by calculating a decay function from results of two-way to five-way authorship attribution and projecting it on larger author set sizes. This analysis has shown that actual performance is substantially lower than predicted performance, implying our text categorization approach is not scalable towards larger author set sizes. Nevertheless, it can be applied reliably to cases with limited sets of candidate authors.

As far as scalability of features and feature types is concerned, we see that character n-grams and lexical features scale towards larger author set sizes. The individual features, however, are tailored to the data set in such a way that they fail to scale to other author pairs or larger sets of candidate authors. We can conclude from this study that our approach is not reliable and stable enough to apply to cases entailing large author set sizes. In order to thoroughly evaluate the scalability of an approach, error analysis is essential.
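The forecasting exercise described above can be illustrated with a few lines of curve fitting: accuracies observed for small author set sizes are fitted with a decay function and extrapolated to larger sets, after which the extrapolation is compared with the accuracy actually measured there. The exact functional form used in Chapter 5 is not restated here, so the power-law decay and the placeholder accuracies below are assumptions for illustration only.

    # Illustration only: placeholder accuracies and an assumed power-law decay.
    import numpy as np
    from scipy.optimize import curve_fit

    def decay(n, a, b):
        """Assumed decay of accuracy (in %) with author set size n."""
        return a * n ** (-b)

    small_sizes = np.array([2.0, 3.0, 4.0, 5.0])
    small_accuracies = np.array([80.0, 70.0, 62.0, 56.0])   # placeholder values

    params, _ = curve_fit(decay, small_sizes, small_accuracies)
    for n in (10, 20, 50, 100, 145):
        print(f"forecast for {n} candidate authors: {decay(n, *params):.1f}%")

Comparing such a forecast against the accuracies actually obtained for the large author set sizes (cf. Appendices D and E) is what reveals the gap reported above.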

7.1.3 Data Size

A third scalability issue is training data size. Reducing the amount of data used for training allows us to evaluate the behavior of our approach when confronted with limited data. Another aim was to identify robust and scalable feature types, and to explore robustness to data size in terms of document representation and Machine Learning algorithm. We ran two types of experiments, following two interpretations of 'data size'. In a first set of experiments, we ensured an equal distribution of instances over classes so that none of the classes would be at an advantage in that respect. The results showed a significant decrease in accuracy as data size was reduced. Although data size is one of the main factors explaining these variations, it is clearly not the only one. For that reason, we did a second set of experiments, where we interpreted data size in such a way that the original distribution of instances over classes is kept intact over the different data size experiments. The short text samples used for training are an approximation of the length of an e-mail. Adopting that interpretation of data size leads to uncontrolled fluctuations in performance, implying that the approach is not able to reliably deal with these imbalances when only limited data is available.

Apart from data size and the distribution of instances over classes, the inherent complexity of a task given a data set is an important factor explaining fluctuations in the results. However, the complexity is difficult to measure since no techniques exist to quantify the strength of an authorial set. We can investigate robustness to the effect of data size of a number of external factors, such as the document representation approach and the Machine Learning algorithm. Since profile-based approaches are less sensitive to fluctuations in the data than instance-based approaches, they could be more robust to data size. Eager learners such as SVMs have been suggested as robust to sparse data, whereas the organization of a lazy learner such as MBL suggests that it does not abstract away from exceptions in the data and could deal with sparse data in a reliable way. Applying a profile-based approach, in combination with MBL, does not improve over SVM performance. SVMs are currently the algorithm of choice in the field. Our study has confirmed that SVMs outperform MBL. The results did not show superiority of profile-based over instance-based approaches.

In our study of the effect of data size, we have shown that the amount of data used for training has a large effect on performance. We see a downward trend, but character n-grams and lexical features show more robustness to that effect than the other feature types we tested. Our text categorization approach is not reliable when confronted with limited data. It is clear that data size is one of the most important challenges of large-scale authorship attribution.
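The two interpretations of data size described above correspond to two different subsampling routines, sketched below: one keeps the same number of training samples for every author (the equal-distribution set-up; the dissertation's EXP 1 additionally balances over topics), the other keeps a fixed fraction of every author's data so that the original, possibly imbalanced, distribution of instances over classes is preserved. The sample format (text, author) and the function names are assumptions; the routines are schematic rather than the exact scripts used in Chapter 6.

    # Schematic sketch of the two 'data size' interpretations (assumed (text, author) pairs).
    import random
    from collections import defaultdict

    def group_by_author(samples):
        grouped = defaultdict(list)
        for text, author in samples:
            grouped[author].append((text, author))
        return grouped

    def reduce_equal(samples, n_per_author):
        """EXP 1-style: the same number of training samples for every author."""
        kept = []
        for items in group_by_author(samples).values():
            kept.extend(random.sample(items, min(n_per_author, len(items))))
        return kept

    def reduce_proportional(samples, fraction):
        """EXP 2-style: a fraction of each author's data, preserving the original
        distribution of instances over authorship classes."""
        kept = []
        for items in group_by_author(samples).values():
            k = max(1, int(round(fraction * len(items))))
            kept.extend(random.sample(items, k))
        return kept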

7.1.4 Scalability of a Text Categorization Approach to Authorship Attribution

In this dissertation, we studied the behavior of our text categorization approach to authorship attribution when confronted with multi-topic data with larger author set sizes than usual or data sizes smaller than typical in the field. In cases where the data entails a limited set of candidate authors, a (to some extent) controlled distribution of topics over authors, and ample training data, the approach is reliable. However, our approach is not scalable to large-scale applications of authorship attribution, for a number of reasons. First of all, its performance cannot be predicted or forecasted since it is sensitive to variations in author set size and data size, and to author or topic imbalances. In fact, all studies that claim reliability of an approach may be overestimating it. Secondly, it does not score high enough in controlled experiments to allow for reliable performance on a large scale. A last reason is the fact that the features resulting from its application cause the model to overfit the training data in all cases, hence making it unfit to use on a large scale.

Since our approach is a combination of many small decisions in experimental design, it is important to disentangle the various components of the approach and their impact on the failure of the approach to scale. In short, we perform supervised multi-class classification and take an instance-based approach to document representation. We extracted various feature types and experimented with Memory-Based Learning for classification. In this dissertation, experiments have shown that taking a profile-based instead of an instance-based approach does affect performance in a positive way, but that the downward trend in performance remains the same. Using eager learner SVMs instead of a lazy learner results in better performance, but does not improve the scalability of the approach. Considering the complexity of the task and the many interacting factors, using a different ML algorithm may affect performance, but the trends will remain the same.

If we were to consider how humans identify the author of a text, we could say that they do not perform multi-class classification. In fact, they would more likely contrast each potential author with each author in their memory, a situation we would refer to as binary classification. This works for closed cases, for instance when the task is to decide who is the author of an essay within a group of students. In other cases, humans would construct a model of an author's writing style without contrasting it with a group of others, a situation we refer to as one-class learning for authorship verification. It is likely that most people have a notion of Shakespeare's writing style, without having contrasted it to a set of contemporary playwrights or to all potential authors. Authorship verification is actually the real task of which authorship attribution is a simplification. In the next section (cf. Section 7.2), we will elaborate on this.

In addition, humans do not necessarily need authorship labels to learn a model of writing style, implying they do not perform supervised classification. In most cases, they will know the author of some of the texts in their model, but be unaware of the authors of other texts. In the field, we would refer to this as semi-supervised classification. They are also likely to combine models of genre with (semi-supervised) models of authorship. So far, computers are unable to compete with their world knowledge.

As far as features are concerned, we used feature types of a specific class, without combining them with other feature types. Although this is common practice in the field – often, a single feature type is chosen for an entire study – humans use all the information they find useful in the text and combine it into a model of authorship. Providing a heterogeneous set of features would allow for more reliable computational authorship attribution, but the overall trends will remain the same.

7.2 Further Research

Our ideas for further research are inspired by the limitations of our current study and by the observations made above. First of all, authorship verification is a more realistic interpretation of the task since it better approximates what humans do than authorship attribution as we implemented it. One-class learning overcomes the issue of providing representative 'negative' instances of authorship and effectively deals with imbalanced data (Raskutti & Kowalczyk, 2004), but the absence of negative instances does entail a loss of performance (Manevitz & Yousef, 2001). Essentially, the task is to define a boundary around the target class that leaves out the outliers. One-class learning, in most cases with SVMs as a learning algorithm (Manevitz & Yousef, 2001; Schölkopf et al., 2001; Tax, 2001), has been applied to categorization tasks such as topic detection (e.g. Manevitz & Yousef, 2001) and fraud detection (e.g. Fawcett & Provost, 1999).

The Koppel & Schler (2004) study was among the first to test one-class learning for authorship verification and to provide an alternative that takes into account the negative information available. The unmasking approach is a meta-learning approach to authorship verification where the central idea is to build a classifier per candidate author that compares an unseen text with the training data for that author. Iteratively removing predictive features from each classifier shows a larger drop with the correct author than with the other authors. The underlying idea is that train and test samples by the same author will show similar characteristics, even when various (sizes of) feature sets are used (over several iterations). Unmasking scores an overall accuracy of 95.7% (Koppel et al., 2007). The technique has been picked up in Sanderson & Guenter (2006), where it was shown that unmasking is less useful when applied to short texts than when applied to long texts. Still, it is an interesting alternative to one-class learning: it scores significantly better and makes use of the negative examples of an author's writing style. It is clear that investing research effort into approaches to authorship verification – one-class learning as well as unmasking – will provide the field with a more intuitive and reliable solution than supervised authorship attribution does.
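As a pointer to what such a one-class set-up could look like in practice, the sketch below fits a one-class SVM on texts of a single target author and accepts or rejects unseen texts. This illustrates the general one-class learning idea, not the unmasking procedure of Koppel & Schler (2004); the character-trigram features, the ν value, and the function names are illustrative assumptions.

    # One-class verification sketch (illustrative parameters; not the unmasking method).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import OneClassSVM

    def train_verifier(target_author_texts):
        """Model one author's style from positive examples only."""
        vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
        X = vectorizer.fit_transform(target_author_texts)
        model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X)
        return vectorizer, model

    def verify(vectorizer, model, unseen_text):
        """True if the unseen text falls inside the learned boundary for the author."""
        return model.predict(vectorizer.transform([unseen_text]))[0] == 1

The boundary is defined around the target class only, which is exactly the property that makes the verification setting attractive when representative negative examples are hard to obtain.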

Providing a set of heterogeneous features is a second essential step in investigating the behavior of an approach when confronted with scalability issues. Several studies have shown that combining features of several types has a positive impact on performance (Gamon, 2004; Grieve, 2007; Luyckx & Daelemans, 2008a). Providing a more heterogeneous feature set – by using lexical as well as syntactic features, for instance – gives a less restricted representation of the authorial set. We expect performance to increase, but trends to remain intact.

This dissertation has only shown the tip of the iceberg in terms of scalability issues. Our analyses of the effects of topic, author set size, and data size show only a fraction of the issues to be dealt with in large-scale applications of authorship attribution. Expanding the stress-test to even larger sets of topics, genres, and author set sizes is an evident perspective for further research. An issue unexplored so far is the effect of class imbalance. Our study used short text fragments and balanced classes. In authorship attribution in the wild, there will be large as well as small authorial sets, a situation that requires a stable approach that deals with this class imbalance.

In this dissertation, we explored the various interacting factors, both internal and external, that affect performance and scalability. As a future research goal, we want to gain insight into the inherent complexity of the task. As we have indicated, no techniques exist to evaluate the strength of an authorial set. Yet it is likely that some authors exhibit a more distinct writing style than others, for reasons of maturity or interest. It is clear that authorship attribution is not a linear problem. More data does not necessarily lead to higher performance. Similarly, having multi-topic data does not always lead to better scores since too much or too little variation can confuse the learner.

This dissertation demonstrated the complexity of the authorship attribution task as well as the challenges of large-scale application. In both aspects, the field has only seen the tip of the iceberg and would benefit from more transparency in terms of experimental design, performance, and features.


Part IV

Appendices

Appendix A

Features Below the Topic Frequency Threshold

This Appendix contains the list of LEX 1 features processed with chi-squared (χ2 ) that were removed when applying the topic frequency threshold (χ2 + TOPIC F) because they did not occur in more than one topic. Most of them are proper nouns, topic markers, and typos. In ABC NL 1, the following features were removed by applying the topic frequency threshold: ‘Adriana’, ‘agenten’, ‘agressie’, ‘alchemisten’, ‘Alex’, ‘anatoompatholoog’, ‘Anita’, ‘Anneke’, ‘artiesten’, ‘avondje’, ‘baby’, ‘Baldini’, ‘basketbal’, ‘Beckman’, ‘behandschoende’, ‘beleid’, ‘beschaving’, ‘besturen’, ‘bewakers’, ‘bewoners’, ‘Bianca’, ‘bisschop’, ‘bloesje’, ‘bode’, ‘bospad’, ‘boze’, ‘Brigitte’, ‘bril’, ‘Brussel’, ‘buik’, ‘buurvrouw’, ‘Casper’, ‘club’, ‘comissaris’, ‘commerciele’, ‘conducteur’, ‘conflict’, ‘cyclus’, ‘Delft’, ‘des’, ‘Emmy’, ‘Ernst-Jan’, ‘EU’, ‘Europaeen’, ‘Europarlementariers’, ‘fabrikant’, ‘fabrikanten’, ‘fin’, ‘gebaat’, ‘gemene’, ‘gevierd’, ‘goals’, ‘graaf’, ‘Grasse’, ‘Grenouille’, ‘grietje’, ‘grootmoeder’, ‘helm’, ‘Henk’, ‘heroine’, ”ho-mo’s”, ‘hoofdpersoon’, ‘huisje’, ‘Hustinx’, ‘idool’, ‘Ierland’, ‘inspraak’, ‘Jacques’, ‘jager’, ‘jatte’, ‘Jefimytsj’, ‘jongeling’, ‘jonkheer’, ‘jonkvrouw’, ‘jonkvrouwe’, ‘Jurian’, ‘kampen’, ‘kandidaat’, ‘Kappie’, ‘kasteel’, ‘Kerstfeest’, ‘Kerstverhaal’, ‘kinder’, ‘kinderboek’, ‘kindje’, ‘kitsch’, ‘Klaas’, ‘klanten’, ‘kloostertuin’, ‘koekjes’, ‘koeltjes’, ‘konijnen’, ‘konijntje’, ‘Kooten’, ‘krokodil’, ‘Kuijper’, ‘Lettica’, ‘Lisa’, ‘Lisette’, ‘Maarten’, ‘madame’, ‘Mariken’, ‘Merel’, ‘millenium’, ‘millenniumwisseling’, ‘mogelijkheden’, ‘Mona’, ‘monnik’, ‘moorden’, ‘moreel’, ‘Mourik’, ‘Muskulan’, ‘neef’, ‘noem’, ‘ns’, ‘onoverwinnelijk’, ‘ontroerend’, ‘oordeels’, ‘paard’, ‘parfumeur’, ‘pestte’, ‘Peter’, ‘poetsvrouw’, ‘popmuziek’, ‘prijzen’, ‘professor’, ‘Pruysen’, ‘Pullaert’, ‘Quintiabella’, ‘rage’, ‘Richard’, ‘ridder’, ‘ridders’, ‘Rob’, ‘Robin’, ‘roker’, ‘Roodcapje’, ‘Roodkapje’, ‘schrijft’, ‘sciecle’, ‘scooter’, ‘Sebastiaan’, ‘sekten’, ‘Sesamstraat’, ‘Simon’, ‘Sjoerd’, ‘slag-veld’, ‘snol’, ‘snor’, ‘sportieve’, ‘staten’, ‘stenen’, ‘ster’, ‘stoffen’, ‘stopte’, ‘strijdperk’, ‘Suze’, ‘Sydney’, ‘Tara’, ‘tas’, ‘team’, ‘tekeningen’, ‘thema’, ‘Tienhuizen’, ‘Tjenkov’, ‘toernooi’, ‘touw’, ‘Tsjechov’, ‘twenty’, ‘ultra-sonique-clean’, ‘valuta’, ‘Veldkamp’, ‘Veldkamps’, ‘vergaan’, ‘verslavende’, ‘verzorgingshuis’, ‘viool’, ‘vliegveld’, ‘vogel’, ‘volleybal’, ‘voorbereiden’, ‘w.’, ‘wapen’, ‘wedstrijdkarakter’, ‘weduwe’, ‘weiland’, ‘wetgeving’, ‘Wijnand’, ‘Wijnands’, ‘wolf’, ‘Worst’, ‘wortelen’, ‘wraak’, ‘zuster’, ‘zwaardmeester’


Appendix A. Features Below the Topic Frequency Threshold In AAAC A, the following features were removed by applying the topic frequency threshold: ‘/’, ‘1’, ‘165’, ‘1880’, ‘1960’, ‘529’, ‘a9’, ‘advances’, ‘adventure’, ‘airport’, ‘allegiance’, ‘ambitious’, ‘anthem’, ‘assignment’, ‘benefited’, ‘book’, ‘branch’, ‘Bush’, ‘callings’, ‘careful’, ‘cloutman’, ‘club’, ‘collected’, ‘compromised’, ‘Cong’, ‘constitutional’, ‘continents’, ‘criticize’, ‘crossed’, ‘crow’, ‘customers’, ‘cyberspace’, ‘decades’, ‘degrees’, ‘denying’, ‘describes’, ‘detector’, ‘difficulties’, ‘discoveries’, ‘domestic’, ‘duties’, ‘economics’, ‘effort’, ‘elements’, ‘employed’, ‘enhances’, ‘enjoyment’, ‘essay’, ‘executive’, ‘expansion’, ‘explains’, ‘exploration’, ‘FAA’, ‘farther’, ‘Fitzgerald’, ‘frontier’, ‘frontiers’, ‘frontiersman’, ‘global’, ‘grandfathers’, ‘granite’, ‘guide’, ‘heads’, ‘hobbies’, ‘Hofstader’, ‘Hofstadter’, ‘horizons’, ‘inward’, ‘issued’, ‘Japanese-Americans’, ‘Jim’, ‘Jr.’, ‘Katie’, ‘Kevin’, ‘king’, ‘leader’, ‘liberties’, ‘lifeline’, ‘Lincoln’, ‘livelihood’, ‘lucky’, ‘market’, ‘measures’, ‘message’, ‘metal’, ‘non-citizens’, ‘orders’, ‘outer’, ‘percent’, ‘personally’, ‘plan’, ‘pledge’, ‘privacy’, ‘production’, ‘protect’, ‘puts’, ‘residents’, ‘retaliation’, ‘revision’, ‘rewards’, ‘Richard’, ‘ring’, ‘sea’, ‘seriously’, ‘sit’, ‘strike’, ‘stripped’, ‘subjects’, ‘tax’, ‘technological’, ‘terroristic’, ‘thesis’, ‘threat’, ‘trial’, ‘Turner’, ‘Turners’, ‘unity’, ‘unknown’, ‘unwillingly’, ‘viewed’, ‘violated’, ‘watermark’, ‘weapons’, ‘western’, ‘wild’, ‘willingly’, ‘win’, ‘workers’, ‘would-be’


Appendix B

Performance with Topic Frequency Threshold

This Appendix shows performance after applying the topic frequency threshold (+TOPIC F) to feature selection methods information gain (IG) and chi-squared (χ2 ), both with (IG+TF and χ2 +EXPTF) and without (IG and χ2 ) frequency threshold. In ABC NL 1: IG

IG + TF

IG + TOPIC F

χ2

χ2 +EXPTF

χ2 +TOPIC F

cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3

12.50 20.00 36.25 11.25 6.25 12.50 1.25 12.50 12.50 1.25 12.50 21.25 11.25 12.50 12.50 12.50 11.25

16.25 20.00 36.25 32.50 37.50 28.75 25.00 18.75 28.75 17.50 20.00 21.25 30.00 20.00 26.25 21.25 20.00

26.25 17.50 36.25 26.25 18.75 25.00 30.00 22.50 20.00 33.75 27.50 21.25 17.50 31.25 26.25 28.75 22.50

37.50 20.00 36.25 33.75 43.75 35.00 30.00 30.00 46.25 25.00 28.75 21.25 36.25 16.25 37.50 37.50 28.75

13.75 21.25 36.25 33.75 43.75 35.00 30.00 30.00 46.25 25.00 28.75 21.25 36.25 16.25 37.50 37.50 28.75

26.25 20.00 36.25 32.50 43.75 41.25 23.75 32.50 31.25 25.00 36.25 21.25 38.75 17.50 36.25 21.25 36.25

Average

12.59

24.41

25.00

31.65

23.41

30.24

Feature type

Table B.1: Comparison of performance before and after applying the frequency and topic frequency thresholds to χ2 and IG in ABC NL 1 with eight candidate authors and nine topics (baseline: 12.50%).


In AAAC A: IG

IG + TF

IG + TOPIC F

χ2

χ2 +EXPTF

χ2 +TOPIC F

cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3

6.92 22.31 23.85 18.46 1.54 7.69 7.69 0.77 3.08 7.69 0.77 21.54 17.69 7.69 6.92 7.69 0.77

23.85 27.69 23.85 47.69 46.15 34.62 23.85 26.92 29.23 22.31 17.69 21.54 23.08 15.38 35.38 22.31 23.08

14.62 24.62 23.85 46.92 13.08 13.08 17.69 16.92 14.62 18.46 16.15 21.54 20.00 13.08 16.15 13.85 16.15

31.54 22.31 23.85 38.46 44.62 43.08 30.77 36.92 43.85 33.85 33.08 21.54 28.46 19.23 33.08 30.00 36.15

15.38 25.38 23.85 34.62 20.00 28.46 18.46 16.92 28.46 14.62 14.62 20.00 20.00 12.31 32.31 23.08 9.23

20.77 22.31 23.85 49.23 46.15 43.08 22.31 26.92 46.15 26.15 23.85 21.54 29.23 16.92 31.54 22.31 27.69

Average

8.94

26.82

18.41

31.94

20.65

28.94

Feature type

Table B.2: Comparison of performance before and after applying the frequency and topic frequency thresholds to χ2 and IG in AAAC A with thirteen candidate authors and four topics (baseline: 7.69%).


Appendix C

The Effect of Author Set Size: Machine Learner Comparison

This Appendix compares performance of five types of Machine Learning algorithms: Memory-Based Learning (MBL) as implemented in TIMBL (Daelemans & van den Bosch, 2005), Rule Induction as implemented in Ripper (Cohen, 1995), Support Vector Machines (SVMs) as implemented in SMO (Platt, 1998), Naive Bayes (NB) (John & Langley, 1995), and Decision Trees (DT) as implemented in C4.5 (Quinlan, 1993) with increasing author set size. We use a set-up as in EXP 2, where we balance the data size and topics of the training data.

                         PERSONAE                 AAAC A                  ABC NL 1
                      2      5      145        2      5      13        2      5      8
CHR3    MBL        78.50  66.16   5.10      88.14  73.70  51.76     74.44  51.56  41.11
        RIPPER     66.30  31.04   2.90      68.28  43.50  27.45     76.22  47.11  44.44
        SVMs       93.90  82.32  11.31      97.81  90.20  71.76     94.89  79.20  68.89
        NB         93.10  79.04   7.72      96.00  82.60  56.47     93.17  72.89  61.67
        DT         66.30  31.04   2.90      68.26  43.50  27.45     76.22  47.11  44.44
LEX1    MBL        71.80  35.44   6.48      74.05  44.30  32.16     66.67  33.29  28.06
        RIPPER     62.00  36.64   2.21      70.71  42.30  25.10     72.39  41.20  27.50
        SVMs       83.10  65.92  15.72      90.45  72.20  64.71     84.00  69.02  63.89
        NB         77.70  58.36   8.83      84.00  62.30  56.47     79.56  61.29  54.17
        DT         62.00  36.64   2.21      70.71  42.30  25.10     72.39  41.20  27.50
BASELINE           50.00  20.00   0.69      50.00  20.00   7.69     50.00  20.00  12.50

Table C.1: Comparing Machine Learners’ performance with increasing author set size using CHR 3 and LEX 1 in a data size and topic balanced set-up.


Appendix D

The Effect of Author Set Size in the Original Data Sets

This Appendix shows how increasing the author set size affects performance in the original data sets as implemented for EXP 1.


In PERSONAE: Feature

2x100

3x100

4x100

5x100

10x10

20x5

50x2

100

145

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 cgp1 cgp2 cgp3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3 chu1 chu2 chu3 rel

73.75 66.65 71.30 69.70 72.35 94.50 74.50 65.45 57.00 75.55 65.65 58.75 67.90 70.40 71.40 74.15 74.45 66.50 74.60 66.50 56.50 64.90 67.30 69.05 54.70

59.33 48.13 57.23 55.03 57.00 87.37 60.43 57.47 48.33 53.50 57.30 48.80 49.43 55.17 60.97 59.30 55.63 55.91 60.80 57.30 48.67 47.73 50.23 51.17 36.03

54.42 47.95 47.12 44.27 48.03 81.05 59.58 47.15 42.80 58.10 46.28 42.45 41.37 46.25 55.90 50.18 54.95 51.82 60.98 47.30 42.85 39.62 42.60 41.55 27.93

46.82 47.32 41.36 39.54 46.14 76.36 59.68 43.36 40.32 56.92 43.70 39.74 35.24 40.76 49.42 42.52 51.00 45.40 59.76 43.64 40.20 32.56 36.92 39.82 23.92

30.00 35.00 26.70 22.00 29.60 54.60 42.50 27.20 29.00 42.80 28.90 30.00 21.60 25.30 31.90 27.80 36.47 30.20 42.10 26.40 29.40 20.60 21.10 21.70 11.80

26.10 27.70 14.40 17.80 24.20 39.49 33.30 25.80 28.10 31.30 25.70 26.10 13.50 13.60 22.00 17.30 26.90 21.20 35.90 25.10 28.70 13.60 14.00 18.20 6.90

12.90 15.50 6.90 6.90 14.30 25.00 25.50 18.20 26.40 24.50 15.80 25.40 7.40 8.60 10.70 9.10 13.80 10.60 24.90 17.40 26.90 6.70 6.90 6.70 2.30

8.30 7.90 4.20 4.10 7.30 12.20 15.50 15.60 25.60 16.20 13.00 25.90 3.70 4.90 6.00 4.20 7.80 6.80 15.20 16.40 25.40 3.30 4.90 6.00 1.30

6.07 7.03 2.83 2.69 6.28 10.90 12.21 15.45 22.76 12.28 13.72 22.07 2.76 4.48 4.76 4.07 5.31 4.62 12.00 16.83 22.28 2.76 3.24 3.17 1.24

Average Baseline

68.44 50.00

54.72 33.33

48.40 25.00

44.36 20.00

29.36 10.00

22.64 5.00

14.20 2.00

10.04 1.00

8.44 0.69

Table D.1: The effect of author set size in the original PERSONAE data set.


In AAAC A: Feature

2x20

3x20

4x10

5x10

10x10

13

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3

76.25 61.50 73.50 78.00 87.00 94.25 77.00 58.75 59.75 75.25 58.00 59.25 78.25 79.00 71.00 71.50 55.75 61.50

55.17 46.33 58.17 59.17 70.83 80.67 58.67 42.50 44.83 60.50 41.83 44.00 56.17 59.00 52.33 52.00 41.83 44.83

48.50 41.75 47.00 52.75 66.50 76.50 48.00 37.50 33.50 49.50 47.00 33.00 52.25 54.25 45.25 47.75 38.00 35.00

47.80 37.20 35.40 47.60 64.40 72.80 50.20 43.00 27.80 53.60 33.60 27.00 46.40 46.20 41.00 44.80 41.40 26.40

30.90 31.80 26.20 30.20 45.45 58.60 45.00 21.90 24.40 44.70 24.70 25.60 31.30 31.40 26.90 38.80 23.60 25.70

29.23 20.77 22.31 23.85 49.23 46.15 43.08 22.31 26.92 46.15 26.15 23.85 21.54 29.23 16.92 31.54 22.31 27.69

Average Baseline

70.56 50.00

53.39 33.33

47.11 25.00

43.28 20.00

32.06 10.00

28.94 7.69

Table D.2: The effect of author set size in the original AAAC A data set.


In ABC NL 1: Feature

2x20

3x20

4x10

5x10

8

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 cgp1 cgp2 cgp3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3 chu1 chu2 chu3 rel

72.00 62.25 71.50 75.25 77.50 80.25 70.50 57.25 53.00 71.75 60.00 53.00 66.00 66.75 66.00 65.00 73.25 67.00 69.00 60.25 54.25 66.75 66.25 60.50 46.50

57.17 47.33 53.50 61.50 61.33 69.17 55.67 46.67 40.17 56.67 46.00 39.17 49.50 50.00 42.67 50.50 55.50 44.33 58.00 44.50 40.33 48.83 47.67 42.83 34.83

47.50 39.00 46.00 57.25 52.75 61.00 42.00 40.50 39.75 48.50 42.00 41.00 37.75 40.00 39.00 44.00 43.25 38.00 42.75 38.75 38.25 35.50 39.00 34.75 28.25

50.40 27.60 38.20 46.40 46.60 58.00 44.60 33.20 35.80 45.40 37.00 36.00 33.00 35.40 32.80 38.40 37.80 29.00 45.20 33.00 32.20 32.30 29.60 31.20 23.20

27.50 26.25 20.00 36.25 32.50 43.75 41.25 23.75 32.50 31.25 25.00 36.25 23.75 28.75 25.00 21.25 38.75 17.50 36.25 21.25 36.25 23.75 21.25 23.75 10.00

Average Baseline

65.00 50.00

49.32 33.33

41.96 25.00

36.96 20.00

27.76 12.50

Table D.3: The effect of author set size in the original ABC NL 1 data set.


Appendix E

The Effect of Author Set Size with Data Size and Topic Balanced Data

This Appendix shows how increasing the author set size affects performance in data size and topic balanced data, as implemented for EXP 2.


In PERSONAE: Feature

2x100

3x100

4x100

5x100

10x10

20x5

50x2

100

145

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 cgp1 cgp2 cgp3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3 chu1 chu2 chu3 rel

61.80 62.90 68.50 61.70 69.50 78.50 71.80 62.80 55.60 73.80 63.10 56.40 63.90 69.50 67.50 72.10 67.40 64.50 73.80 63.10 55.60 59.60 64.00 64.40 52.90

43.80 44.13 48.73 43.47 46.60 71.47 50.53 45.27 37.07 50.67 44.87 37.07 44.87 45.00 41.73 47.67 45.13 43.73 51.27 44.67 37.60 39.93 40.73 43.27 36.60

34.40 37.00 39.80 38.00 41.55 70.55 43.10 36.10 28.95 42.75 35.65 28.80 36.20 37.75 35.70 41.45 39.55 34.80 43.00 35.65 29.75 33.00 31.65 31.90 29.15

32.68 28.56 34.80 30.96 35.56 66.16 35.44 30.36 22.88 36.00 30.28 23.72 28.68 28.80 28.36 34.40 32.92 27.04 36.56 28.76 24.64 24.32 26.28 26.68 22.28

15.00 24.40 17.60 19.40 19.60 47.40 24.00 22.80 26.00 23.60 22.80 25.40 16.00 19.20 17.60 20.00 23.40 30.00 24.00 21.40 24.00 13.20 12.20 11.20 10.80

9.00 19.60 9.80 12.40 15.60 27.20 19.40 15.60 21.20 19.60 14.80 20.60 8.60 7.80 13.20 9.60 21.00 14.20 20.20 16.00 21.00 6.20 10.20 9.00 5.20

2.80 13.00 6.40 4.20 9.40 11.60 12.40 7.60 14.00 10.40 9.40 15.80 3.60 3.20 5.60 4.40 5.40 8.60 13.60 8.00 14.20 7.00 5.00 4.20 3.20

2.20 7.20 3.20 2.40 6.40 10.00 10.40 12.20 13.40 8.00 8.40 11.80 3.40 2.40 2.40 3.40 4.60 7.20 10.20 11.80 14.00 2.60 2.60 3.00 1.20

0.83 4.97 2.07 3.17 3.86 5.10 6.48 10.76 7.17 5.10 9.93 7.17 1.10 1.52 3.31 3.03 4.14 4.28 7.03 10.90 6.90 1.79 1.52 1.93 1.24

Average Baseline

64.44 50.00

44.52 33.33

36.96 25.00

30.56 20.00

20.92 10.00

14.32 5.00

7.76 2.00

6.24 1.00

4.16 0.69

Table E.1: The effect of author set size in data size and topic balanced PERSONAE.


In AAAC A: Feature

2x20

3x20

4x10

5x10

10x10

13

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3

56.54 62.70 70.50 70.07 83.43 88.14 74.05 68.91 65.98 74.48 69.11 63.48 71.32 68.77 65.57 72.75 69.59 65.41

44.92 44.64 53.74 55.70 73.02 84.27 56.12 47.78 49.83 60.13 47.11 47.09 59.61 52.46 46.83 54.42 47.37 48.83

32.38 37.07 47.28 49.42 64.75 80.00 54.70 38.61 35.34 54.75 42.97 35.94 45.09 43.87 38.08 52.68 38.94 36.21

28.00 30.50 38.50 41.60 59.50 73.70 44.30 30.00 27.40 47.30 30.20 28.00 41.20 36.00 30.50 40.20 28.90 27.40

16.86 20.23 22.61 26.06 44.54 54.78 32.46 23.66 16.18 38.74 21.84 14.91 27.21 21.65 21.83 30.14 21.44 16.18

12.94 23.53 20.00 18.82 40.00 51.76 32.16 32.16 19.22 38.04 18.43 14.91 23.53 18.04 16.86 30.59 30.59 17.65

Average Baseline

69.56 50.00

53.61 33.33

45.50 25.00

37.61 20.00

25.67 10.00

25.06 7.69

Table E.2: The effect of author set size in data size and topic balanced AAAC A.


In ABC NL 1: Feature

2x20

3x20

4x10

5x10

8

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 cgp1 cgp2 cgp3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3 chu1 chu2 chu3 rel

61.50 56.72 66.89 71.06 71.33 74.44 66.67 62.78 60.78 67.72 62.11 61.94 63.89 64.22 58.67 64.72 63.67 62.50 67.33 63.28 60.83 61.89 61.33 59.67 51.72

46.48 39.74 48.74 55.00 53.78 66.70 48.26 43.33 40.55 50.44 44.56 39.81 47.22 47.41 41.78 48.44 47.48 43.85 48.37 42.30 40.40 43.81 44.78 41.33 35.56

36.67 33.22 35.78 43.11 47.00 58.61 41.61 36.72 29.89 41.06 34.17 29.39 36.17 36.89 34.11 41.00 39.39 36.06 40.33 36.06 30.11 34.33 35.89 29.06 26.56

30.76 28.27 34.36 39.02 41.16 51.56 33.29 31.56 25.16 38.98 29.87 23.87 28.62 30.04 26.71 35.91 31.78 27.64 34.22 29.29 25.51 28.89 28.36 24.44 23.02

23.33 17.50 25.83 26.94 34.17 41.11 28.06 27.22 14.72 32.22 21.94 14.72 16.67 16.67 15.28 24.17 28.89 21.39 25.28 21.11 15.00 18.06 21.94 17.22 13.33

Average Baseline

62.92 50.00

45.48 33.33

36.56 25.00

30.80 20.00

22.08 12.50

Table E.3: The effect of author set size in data size and topic balanced ABC NL 1.


Appendix F

Data Size as the Number of Variable-Length Samples

This Appendix shows the effect of data size, where data size is interpreted as the number of variable-length (or FLEX) text samples per author (EXP 1). We present results for the three data sets, with experiments using two, five, and the maximum number of candidate authors per data set.


In two-way PERSONAE: Feature

90%

80%

70%

60%

50%

40%

30%

20%

10%

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 cgp1 cgp2 cgp3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3 chu1 chu2 chu3 rel

73.75 66.70 71.30 69.70 72.35 94.50 74.35 65.40 57.05 75.55 65.70 58.75 67.90 70.40 71.40 74.15 74.50 66.40 74.50 66.45 56.50 64.90 67.30 69.05 54.70

73.65 65.95 70.90 70.10 72.10 92.25 73.90 63.85 56.35 75.05 63.85 58.00 67.50 71.20 70.00 74.50 72.75 64.60 74.10 64.70 56.00 64.00 66.85 68.85 55.00

73.45 64.45 70.55 69.75 72.85 90.30 72.85 64.10 55.50 74.35 64.00 56.90 67.70 71.15 68.50 74.35 68.70 63.25 72.00 63.85 55.45 63.55 67.00 68.25 55.65

73.90 63.50 70.65 68.15 72.20 87.40 71.65 61.50 56.00 71.40 62.60 56.60 66.05 69.80 68.30 72.95 66.60 62.35 71.00 61.05 55.65 62.85 67.90 66.70 55.50

73.75 62.80 71.35 67.95 72.10 83.45 68.75 60.70 55.85 70.30 60.70 55.60 65.40 69.45 68.70 72.85 66.05 60.50 69.45 60.65 55.85 63.60 66.90 65.75 56.10

73.65 62.90 70.30 67.05 71.95 78.80 69.05 59.00 54.10 70.40 59.55 54.55 65.75 69.00 68.25 71.10 64.80 59.00 68.65 59.10 54.40 62.95 65.10 64.55 55.90

71.80 63.00 68.55 65.20 71.20 73.60 68.85 57.65 53.25 70.30 58.00 53.65 65.00 68.15 67.30 70.85 65.25 59.75 68.75 58.90 52.55 63.00 65.60 63.30 56.20

70.00 61.30 67.30 64.45 69.25 68.75 67.60 57.55 53.50 70.50 58.30 53.15 61.55 68.50 67.30 67.00 64.85 58.55 67.70 57.30 53.65 62.85 65.05 64.20 56.10

67.60 59.75 64.70 60.55 66.50 68.60 69.95 60.05 53.62 72.30 60.40 54.05 59.50 70.40 70.40 67.70 69.95 61.45 70.30 59.60 53.56 58.30 66.60 69.05 58.95

Average Baseline

68.44

67.80

67.08

66.00

65.20 50%

64.36

63.56

62.64

63.24

Table F.1: The effect of data size, with data size interpreted as the number of variable-length (or FLEX) samples available for training in 2-way PERSONAE.


In five-way PERSONAE: Feature

90%

80%

70%

60%

50%

40%

30%

20%

10%

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 cgp1 cgp2 cgp3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3 chu1 chu2 chu3 rel

46.82 47.30 41.36 39.54 46.14 76.36 59.54 43.22 40.22 57.04 43.58 39.74 35.24 40.76 49.40 42.52 51.14 45.30 59.84 43.92 39.94 32.56 36.92 39.82 23.90

46.42 44.74 40.96 38.38 44.08 74.30 55.62 43.06 39.40 54.46 43.06 38.72 34.88 40.20 49.02 42.26 48.34 43.40 56.02 42.54 39.50 32.66 36.46 38.18 23.96

45.66 42.66 40.78 37.26 43.02 70.90 53.14 42.72 38.18 49.90 42.48 36.72 34.66 39.16 46.52 41.58 44.96 42.46 52.88 42.32 37.98 32.32 36.20 36.18 23.78

44.88 38.84 39.50 37.08 40.84 66.44 50.46 41.22 36.92 45.88 40.78 36.74 34.20 38.84 44.02 40.80 41.90 42.24 50.32 41.14 36.80 32.06 35.92 35.84 23.68

44.68 36.52 39.92 36.62 39.38 62.52 45.32 38.50 35.18 43.18 38.84 34.60 33.94 38.66 40.68 40.38 38.48 39.54 45.34 38.72 35.00 32.50 35.22 35.48 23.84

43.76 31.94 39.64 35.10 38.88 56.52 41.42 35.82 31.66 38.30 35.64 31.88 33.34 38.22 36.22 39.82 36.82 36.18 41.42 34.72 31.36 31.70 34.98 35.10 24.10

41.48 30.08 38.92 34.12 39.14 51.22 36.36 34.16 28.24 37.62 34.68 27.90 32.98 36.74 34.04 38.68 34.62 34.60 37.08 33.96 28.18 30.38 34.32 33.96 24.64

39.30 29.58 37.10 32.18 37.38 43.94 36.88 30.26 25.02 37.70 30.10 25.04 31.40 34.64 34.04 37.30 34.62 29.70 36.78 29.76 25.00 29.40 33.04 32.54 24.04

36.18 28.46 34.22 30.10 36.00 32.62 36.00 25.32 21.69 36.62 25.30 22.12 28.88 32.72 32.32 33.70 33.78 25.04 36.04 25.16 21.35 28.26 31.92 31.46 23.46

Average Baseline

44.36

43.20

41.64

40.12

38.40 20%

36.04

34.24

32.28

29.56

Table F.2: The effect of data size, with data size interpreted as the number of variable-length (or FLEX) samples available for training in 5-way PERSONAE.


In 145-way PERSONAE: Feature

90%

80%

70%

60%

50%

40%

30%

20%

10%

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 cgp1 cgp2 cgp3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3 chu1 chu2 chu3 rel

6.07 7.03 2.83 2.69 6.28 10.90 12.21 15.45 22.69 12.28 13.72 21.93 2.76 4.48 4.76 4.07 5.31 4.62 12.00 16.90 22.28 2.76 3.24 3.17 1.24

6.48 7.52 2.90 2.90 6.00 9.10 10.00 15.24 21.59 13.45 12.55 21.52 2.97 3.72 4.07 4.00 5.52 5.31 10.21 15.31 21.52 2.48 3.31 2.90 1.03

6.00 5.45 3.24 2.76 6.07 8.90 9.59 13.24 19.17 10.90 13.17 18.76 3.17 3.86 5.24 3.86 5.59 4.28 10.07 12.83 19.38 2.41 3.52 2.69 0.90

6.00 4.07 2.97 2.69 6.00 8.55 8.48 11.72 16.76 11.31 11.45 15.52 3.03 3.79 4.62 4.07 5.66 4.90 8.62 11.86 16.83 2.07 3.52 3.03 1.17

5.52 4.07 3.59 2.62 5.38 7.86 7.31 9.38 13.38 8.62 10.14 12.00 3.10 3.38 4.21 4.00 5.52 5.10 6.55 10.48 13.17 2.55 3.10 2.83 1.10

5.38 4.62 3.24 2.83 5.03 6.21 6.48 8.41 9.52 7.31 8.48 10.00 2.90 3.10 3.17 4.07 4.00 4.55 6.14 8.90 9.24 2.62 2.83 3.38 1.31

4.14 2.48 2.90 2.76 5.17 4.21 5.24 5.79 6.62 6.14 5.86 7.38 2.69 2.76 2.76 3.59 3.79 2.41 5.59 5.66 6.69 2.00 3.38 2.83 0.48

3.66 2.62 2.62 3.03 4.76 3.45 5.10 2.83 2.62 3.93 2.62 3.31 2.69 2.90 2.76 3.10 3.52 2.14 5.86 2.69 2.76 2.34 2.83 2.28 0.69

4.34 1.66 2.28 2.62 3.79 1.79 2.55 1.59 1.66 3.10 1.03 1.59 2.62 2.48 2.00 2.97 2.34 1.79 2.83 1.52 1.38 1.79 2.55 1.79 1.38

Average per training data size (90% down to 10%): 8.40, 8.04, 7.32, 6.68, 5.84, 4.96, 3.60, 2.52, 1.68. Baseline: 0.69%.

Table F.3: The effect of data size, with data size interpreted as the number of variable-length (or FLEX) samples available for training in 145-way PERSONAE.


In two-way AAAC A. Each row of numbers below corresponds to one training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the values within a row are the accuracies (in %) obtained with the following feature types, in this order:

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3

70.75 60.50 69.75 66.25 78.00 87.25 72.25 61.00 56.50 72.50 61.00 58.75 71.00 74.25 62.50 69.25 62.25 57.75

70.75 59.50 68.25 65.00 77.75 86.50 72.25 61.50 57.25 70.50 60.00 55.75 69.50 73.75 60.50 68.00 63.25 58.75

69.50 59.00 66.50 64.25 75.25 82.25 70.25 58.25 56.00 69.50 58.50 56.25 68.00 72.75 62.50 69.25 60.50 57.00

68.50 60.75 62.25 63.75 75.00 79.00 68.50 58.25 56.25 70.50 56.50 57.75 67.00 70.25 62.00 69.75 59.25 57.50

67.75 59.50 63.25 61.00 69.50 74.50 64.75 57.75 54.50 68.25 60.50 55.50 66.50 65.00 59.00 64.75 60.50 55.25

69.00 59.25 63.25 59.75 67.50 70.75 65.75 56.00 50.25 66.00 54.25 51.50 67.50 67.00 56.25 62.75 56.50 51.50

69.00 59.00 59.50 57.50 67.25 68.00 61.75 55.75 49.00 63.50 55.25 51.25 67.75 66.75 61.00 62.75 55.75 49.50

66.00 56.50 56.75 59.75 63.00 63.00 57.00 55.50 50.30 57.50 54.75 50.61 62.25 60.75 55.50 57.00 54.25 49.94

59.75 55.50 57.50 57.75 60.25 60.00 57.50 54.68 51.14 58.00 54.18 51.17 56.75 60.00 50.18 57.00 54.43 51.54

Average per training data size (90% down to 10%): 66.94, 66.17, 65.00, 64.22, 62.22, 60.44, 59.61, 56.83, 55.61. Baseline: 50%.

Table F.4: The effect of data size, with data size interpreted as the number of variable-length (or FLEX) samples available for training in 2-way AAAC A.


In five-way AAAC A. Each row of numbers below corresponds to one training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the values within a row are the accuracies (in %) obtained with the following feature types, in this order:

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3

35.60 28.20 36.00 37.60 53.20 74.20 37.80 29.00 28.80 41.20 31.00 29.60 38.00 41.00 34.40 34.40 29.00 29.80

35.60 28.20 36.60 37.00 52.40 70.00 39.60 29.60 28.60 40.40 28.20 29.60 38.40 41.40 32.40 36.20 30.40 29.20

33.80 29.20 37.80 36.20 51.80 68.20 38.20 29.20 29.00 39.00 28.20 30.80 39.20 41.40 31.20 36.60 30.20 28.60

32.20 25.00 35.00 36.00 48.40 61.20 35.60 28.00 31.00 37.00 28.00 30.80 39.20 39.00 29.40 35.80 29.80 31.00

31.00 24.60 32.20 33.40 48.40 57.80 35.40 28.40 26.60 36.20 27.80 27.60 36.40 37.40 29.60 33.00 28.40 26.00

29.60 25.80 30.20 31.20 46.20 53.40 34.20 28.40 24.20 32.80 29.00 25.40 38.00 37.40 27.40 35.40 28.20 24.60

28.40 25.40 31.00 28.60 43.40 44.60 30.60 26.80 23.20 31.00 25.80 25.00 35.60 34.20 26.20 31.40 26.20 22.60

28.60 25.20 31.60 29.80 37.40 35.20 29.20 25.40 20.93 28.60 24.40 20.80 33.00 29.80 23.60 29.20 25.00 20.73

27.60 23.80 27.40 26.40 31.20 29.20 27.40 23.40 21.66 24.80 25.20 21.06 31.80 25.00 21.60 28.00 23.40 22.06

Average per training data size (90% down to 10%): 36.83, 36.50, 36.22, 34.89, 32.94, 31.94, 29.61, 27.22, 25.22. Baseline: 20%.

Table F.5: The effect of data size, with data size interpreted as the number of variable-length (or FLEX) samples available for training in 5-way AAAC A.


In 13-way AAAC A. Each row of numbers below corresponds to one training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the values within a row are the accuracies (in %) obtained with the following feature types, in this order:

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3

14.62 14.62 21.54 17.69 28.46 44.62 30.00 13.85 21.54 20.77 19.23 16.92 16.92 25.38 16.15 21.54 13.08 21.54

14.62 12.31 23.85 19.23 27.69 40.00 24.62 13.85 20.00 16.15 16.15 19.23 18.46 23.85 16.92 20.00 17.69 19.23

13.08 14.62 23.85 18.46 26.92 36.92 19.23 12.31 16.92 17.69 17.69 16.15 20.00 20.00 15.38 17.69 12.31 15.38

13.08 13.08 23.85 14.62 26.92 34.62 15.38 10.00 10.00 18.46 12.31 11.54 18.46 19.23 9.23 11.54 13.08 9.23

13.08 10.77 23.08 14.62 26.92 25.38 13.85 10.77 9.23 18.46 10.77 9.23 16.92 18.46 8.46 11.54 14.62 9.23

13.85 13.08 24.62 12.31 21.54 30.00 18.46 13.08 9.23 16.15 9.23 10.00 16.92 20.77 8.46 14.62 14.62 11.54

13.85 11.54 24.62 13.08 21.54 29.23 20.00 15.38 13.85 21.54 11.54 13.08 16.92 18.46 5.38 17.69 16.92 13.08

12.31 8.46 22.31 9.23 16.92 20.00 14.62 10.00 10.77 16.92 13.08 10.00 13.08 18.46 6.92 16.15 14.62 10.77

10.00 8.46 12.31 8.46 16.15 17.69 13.85 7.69 7.14 14.62 7.69 7.14 11.54 17.69 8.46 13.08 8.46 7.14

Average per training data size (90% down to 10%): 20.50, 19.78, 18.11, 15.44, 14.22, 15.06, 16.06, 13.17, 10.56. Baseline: 7.69%.

Table F.6: The effect of data size, with data size interpreted as the number of variable-length (or FLEX) samples available for training in 13-way AAAC A.


In two-way ABC NL 1. Each row of numbers below corresponds to one training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the values within a row are the accuracies (in %) obtained with the following feature types, in this order:

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 cgp1 cgp2 cgp3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3 chu1 chu2 chu3 rel

67.50 55.75 66.00 59.75 67.75 85.50 64.50 57.75 57.25 67.25 59.50 55.00 65.75 62.25 57.25 63.25 63.00 54.50 67.25 60.00 56.00 62.25 62.75 59.25 55.50

67.50 57.50 64.75 60.00 67.50 82.25 61.50 59.00 55.75 63.25 55.75 53.50 68.00 62.00 56.50 63.25 59.50 55.00 63.00 59.75 54.75 65.75 64.25 60.75 54.50

67.25 58.50 62.00 58.75 66.00 75.25 62.00 56.25 54.50 62.00 54.50 55.00 66.25 60.75 55.25 62.75 61.25 53.25 63.50 57.50 53.50 61.75 63.50 61.00 54.50

65.75 58.25 62.50 60.50 65.75 73.00 61.00 57.25 52.75 62.75 53.00 53.50 64.25 60.00 55.25 62.50 61.75 56.00 61.50 57.75 52.25 61.75 63.25 57.75 55.00

66.25 56.25 63.00 59.00 68.25 67.25 63.75 52.75 48.50 63.00 52.75 51.00 63.00 59.25 54.25 58.75 61.00 53.25 64.50 53.25 48.75 62.75 59.00 57.25 55.00

68.75 56.75 61.75 58.00 64.50 64.00 61.75 54.75 52.75 60.00 53.75 53.25 63.00 57.00 52.50 62.75 57.75 53.75 59.25 57.25 53.25 60.50 57.75 56.25 55.25

66.25 57.75 61.50 55.75 62.75 64.50 60.25 51.00 51.50 61.25 49.75 53.00 59.75 58.00 49.00 60.50 57.00 53.25 61.25 52.75 51.50 58.25 56.00 57.25 54.00

64.25 54.75 60.25 53.75 59.50 61.50 57.00 53.00 50.50 63.75 50.75 50.50 57.25 55.25 51.50 58.75 58.75 51.50 56.50 52.50 51.25 58.00 53.00 58.00 56.50

56.50 52.75 54.25 50.50 54.50 59.00 54.50 55.73 51.89 54.00 56.08 51.64 52.75 52.50 54.00 53.25 54.25 53.93 53.75 54.03 51.89 56.25 53.00 56.75 54.75

Average per training data size (90% down to 10%): 61.72, 61.00, 59.92, 59.40, 58.16, 57.80, 56.60, 55.52, 53.64. Baseline: 50%.

Table F.7: The effect of data size, with data size interpreted as the number of variable-length (or FLEX) samples available for training in 2-way ABC NL 1.


In five-way ABC NL 1. Each row of numbers below corresponds to one training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the values within a row are the accuracies (in %) obtained with the following feature types, in this order:

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 cgp1 cgp2 cgp3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3 chu1 chu2 chu3 rel

38.80 19.80 33.20 33.20 37.20 67.20 29.00 26.40 22.40 30.80 24.40 19.80 31.00 34.60 32.80 32.60 30.60 23.20 29.20 23.80 21.00 27.80 27.40 24.80 22.20

37.60 21.00 34.60 32.80 36.00 59.80 30.00 23.80 24.20 31.20 26.00 22.40 29.60 32.00 33.20 32.20 29.80 22.60 29.60 24.20 22.60 26.40 27.20 25.20 22.40

37.40 22.20 33.40 31.20 35.00 56.40 31.40 24.60 21.80 30.40 24.20 21.80 29.60 33.20 31.00 28.80 29.20 20.80 30.40 23.20 21.20 28.00 27.40 26.60 20.00

36.20 22.80 32.20 29.60 34.00 53.60 32.00 24.00 20.40 31.00 23.80 20.20 29.40 31.60 30.80 29.00 28.80 22.20 30.60 23.20 21.20 26.20 26.60 25.60 18.60

35.80 20.00 31.80 30.20 34.40 47.80 28.60 23.80 20.60 29.20 23.20 20.20 28.40 29.80 30.60 30.20 27.60 20.80 26.80 24.00 19.20 25.80 26.60 26.00 19.60

35.20 20.20 27.80 28.40 34.00 41.00 27.20 22.00 20.20 27.80 22.40 21.00 26.20 28.20 27.80 27.80 27.80 23.20 26.80 20.40 20.20 24.00 26.60 25.20 21.40

34.40 21.20 27.80 26.00 30.60 33.40 24.40 21.20 20.40 26.80 21.40 21.20 27.20 29.40 27.80 27.80 26.20 22.40 24.20 21.80 19.60 26.00 25.20 23.60 21.00

34.40 20.60 28.00 24.60 27.20 24.80 27.40 18.80 20.22 27.80 18.40 21.42 23.40 25.60 25.40 25.00 25.00 22.20 27.60 20.20 20.02 25.80 25.60 21.80 20.60

30.00 19.00 25.20 21.20 23.60 24.40 25.20 18.00 21.41 26.40 20.60 21.20 25.80 23.00 25.00 21.00 22.60 21.20 25.80 17.60 21.75 23.60 21.20 23.00 21.00

Average per training data size (90% down to 10%): 29.28, 29.08, 28.40, 27.76, 26.76, 25.76, 24.84, 23.64, 22.44. Baseline: 20%.

Table F.8: The effect of data size, with data size interpreted as the number of variable-length (or FLEX) samples available for training in 5-way ABC NL 1.


In 8-way ABC NL 1. Each row of numbers below corresponds to one training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the values within a row are the accuracies (in %) obtained with the following feature types, in this order:

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 cgp1 cgp2 cgp3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3 chu1 chu2 chu3 rel

28.75 21.25 20.00 26.25 23.75 33.75 27.50 11.25 10.00 21.25 12.50 17.50 17.50 26.25 18.75 18.75 28.75 20.00 25.00 12.50 13.75 20.00 18.75 13.75 16.25

30.00 22.50 30.00 30.00 21.25 32.50 26.25 20.00 13.75 20.00 12.50 15.00 17.50 26.25 22.50 17.50 20.00 16.25 25.00 18.75 16.25 18.75 18.75 15.00 12.50

27.50 18.75 27.50 28.75 23.75 33.75 25.00 15.00 12.50 23.75 11.25 13.75 20.00 27.50 22.50 17.50 18.75 20.00 21.25 15.00 16.25 22.50 18.75 16.25 12.50

28.75 20.00 22.50 31.25 22.50 36.25 27.50 10.00 15.00 21.25 13.75 11.25 20.00 27.50 18.75 20.00 21.25 12.50 26.25 10.00 16.25 16.25 17.50 15.00 11.25

27.50 15.00 20.00 28.75 22.50 31.25 22.50 12.50 16.25 25.00 15.00 11.25 21.25 26.25 20.00 23.75 17.50 13.75 26.25 12.50 13.75 18.75 16.25 17.50 11.25

27.50 13.75 22.50 22.50 17.50 23.75 23.75 11.25 18.75 22.50 11.25 15.00 20.00 28.75 18.75 17.50 17.50 13.75 21.25 15.00 15.00 17.50 18.75 16.25 15.00

23.75 16.25 23.75 16.25 18.75 23.75 21.25 11.25 15.00 22.50 11.25 15.00 18.75 23.75 18.75 20.00 12.50 15.00 21.25 11.25 11.25 18.75 16.25 20.00 10.00

21.25 16.25 20.00 10.00 23.75 23.75 22.50 11.25 12.50 25.00 11.25 11.25 15.00 20.00 25.00 16.25 12.50 15.00 22.50 8.75 12.50 18.75 15.00 18.75 10.00

18.75 17.50 25.00 10.00 18.75 17.50 21.25 12.50 12.50 27.50 10.00 12.50 25.00 21.25 22.50 15.00 11.25 6.25 26.25 13.75 12.50 20.00 15.00 22.50 10.00

Average per training data size (90% down to 10%): 19.72, 20.44, 19.96, 19.40, 19.08, 18.16, 17.08, 16.44, 16.68. Baseline: 12.50%.

Table F.9: The effect of data size, with data size interpreted as the number of variable-length (or FLEX) samples available for training in 8-way ABC NL 1.


Appendix G

Data Size as the Number of Fixed-Length Samples

This Appendix shows the effect of data size, where data size is interpreted as the number of fixed-length (or FIX) text samples per author (EXP 2). We present results for the three data sets, with experiments using two, five, and the maximum number of candidate authors per data set.
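
The sketch below illustrates, in broad strokes, how a data-size curve of this kind can be computed. It is a minimal sketch under assumed conditions rather than the pipeline used in this dissertation: the actual experiments rely on TiMBL and SVM implementations over the feature types listed in the table headers and report cross-validated accuracy, whereas the sketch assumes scikit-learn is available, uses a single held-out test set, lets character trigrams stand in for the chr3 feature type, and treats `samples` and `authors` as hypothetical placeholder lists.

```python
# Minimal sketch, not the dissertation's pipeline: approximate the data-size
# experiment with scikit-learn. `samples` and `authors` are hypothetical
# parallel lists holding one text sample and its author label per position.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def data_size_curve(samples, authors,
                    fractions=(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)):
    """Accuracy on a fixed test set while the training pool shrinks."""
    # Hold out a fixed, author-stratified test set; the rest is the pool
    # from which smaller and smaller training sets are drawn.
    pool_x, test_x, pool_y, test_y = train_test_split(
        samples, authors, test_size=0.2, stratify=authors, random_state=1)
    curve = {}
    for frac in fractions:
        # Keep only `frac` of the training pool, stratified by author.
        train_x, _, train_y, _ = train_test_split(
            pool_x, pool_y, train_size=frac, stratify=pool_y, random_state=1)
        # Character trigrams stand in for the chr3 feature type of the tables.
        vec = CountVectorizer(analyzer="char", ngram_range=(3, 3), max_features=5000)
        clf = LinearSVC().fit(vec.fit_transform(train_x), train_y)
        curve[frac] = accuracy_score(test_y, clf.predict(vec.transform(test_x)))
    return curve
```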


In two-way PERSONAE. Each row of numbers below corresponds to one training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the values within a row are the accuracies (in %) obtained with the following feature types, in this order:

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 cgp1 cgp2 cgp3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3 chu1 chu2 chu3 rel

55.50 50.00 68.50 66.50 67.00 68.00 56.50 51.50 51.00 64.00 50.50 51.50 62.50 64.00 60.50 67.50 57.50 54.50 56.50 51.00 49.50 55.50 59.50 55.00 54.00

58.50 51.00 68.50 64.50 68.50 64.50 59.00 49.50 51.50 64.50 54.00 51.50 67.00 62.00 64.00 67.50 58.00 51.00 57.50 50.50 52.00 59.50 61.00 59.00 54.50

60.00 58.50 67.50 67.00 63.00 64.00 60.00 51.50 51.50 61.50 56.00 52.00 65.00 63.00 59.50 69.50 59.50 52.00 60.00 53.50 52.50 56.00 60.00 56.50 53.50

59.00 59.00 66.50 64.50 63.00 63.00 62.50 51.00 50.00 64.50 51.50 51.50 61.50 64.50 58.00 66.50 60.50 51.50 61.00 51.00 53.50 58.00 60.50 59.50 53.50

57.00 59.00 69.00 66.00 61.00 61.50 65.50 53.50 50.50 61.00 53.50 50.00 64.00 64.50 58.00 65.50 58.50 52.50 65.00 51.50 50.50 56.00 60.00 56.00 53.00

59.50 60.00 66.00 64.00 62.00 58.00 62.50 50.00 50.50 63.50 50.50 50.00 60.50 62.50 58.50 60.00 57.50 54.00 63.50 50.00 50.50 59.50 57.50 52.50 54.00

56.00 58.50 66.50 64.00 61.00 53.00 62.00 53.00 50.50 65.00 53.00 51.00 57.50 60.00 58.50 59.50 59.00 57.50 64.50 52.00 51.50 55.50 56.00 55.50 58.00

56.50 58.00 67.00 62.00 62.00 51.00 66.00 58.50 51.00 62.00 56.00 49.50 58.00 60.00 56.50 59.00 60.50 58.50 67.00 54.50 53.50 56.00 53.50 53.50 58.00

54.50 57.00 57.00 55.00 59.50 56.50 60.00 59.00 54.00 60.50 60.50 55.00 58.50 55.50 54.50 53.50 58.00 55.00 61.00 60.00 54.00 57.00 50.00 55.50 52.00

Average per training data size (90% down to 10%): 57.60, 58.48, 58.68, 58.32, 58.28, 57.20, 57.32, 57.72, 56.32. Baseline: 50%.

Table G.1: The effect of increasing data size, with data size interpreted as the number of fixed-length (or FIX) samples available for training in 2-way PERSONAE.


In five-way PERSONAE. Each row of numbers below corresponds to one training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the values within a row are the accuracies (in %) obtained with the following feature types, in this order:

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 cgp1 cgp2 cgp3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3 chu1 chu2 chu3 rel

29.60 31.60 36.40 36.80 38.20 42.20 37.40 26.60 25.40 36.00 26.80 24.80 31.60 30.00 28.00 37.80 27.60 24.60 34.40 27.20 25.80 23.40 28.20 23.20 21.00

31.20 30.40 35.60 38.00 36.20 41.40 32.20 25.20 24.20 32.00 26.60 25.00 32.00 28.20 28.60 36.80 26.40 26.00 32.40 26.40 24.20 26.40 27.80 24.00 21.40

32.60 27.40 36.40 36.40 35.80 41.00 30.20 27.00 23.20 29.20 23.80 23.80 32.20 26.60 27.00 35.20 24.40 25.40 28.80 25.20 23.20 24.80 31.60 25.60 20.20

31.00 25.20 35.80 35.20 35.40 37.60 30.60 29.40 25.60 27.40 25.80 25.40 30.80 28.00 24.80 34.60 25.00 28.00 29.00 27.20 25.40 26.20 30.00 25.20 21.60

31.40 23.60 36.40 35.40 34.00 41.00 26.20 26.20 23.00 26.40 24.60 22.60 32.40 28.40 24.00 35.00 22.80 22.00 25.60 25.00 23.20 25.00 29.60 26.00 21.80

33.40 20.40 35.00 35.40 34.00 36.40 23.80 24.00 21.40 23.00 24.40 21.80 29.60 29.20 24.60 33.80 24.40 22.40 23.00 24.20 20.60 25.20 29.20 25.80 21.40

29.60 22.80 33.80 32.20 35.20 35.20 25.00 24.20 21.40 25.20 21.80 21.20 27.00 28.40 26.40 29.00 26.20 23.00 26.20 22.80 21.20 24.20 30.00 24.00 21.40

29.20 28.40 31.20 29.20 34.60 34.40 31.40 22.00 20.00 33.20 21.00 20.40 27.60 29.00 24.60 32.80 29.20 21.20 32.20 20.80 20.40 25.60 28.20 24.60 19.60

26.40 24.20 30.00 28.20 29.40 23.20 26.60 23.00 23.00 26.20 23.00 23.00 28.80 28.80 24.40 29.20 25.60 22.40 26.80 23.20 22.80 25.60 24.60 23.00 22.80

Average per training data size (90% down to 10%): 29.76, 29.24, 28.28, 28.44, 27.36, 26.20, 26.00, 26.48, 25.00. Baseline: 20%.

Table G.2: The effect of increasing data size, with data size interpreted as the number of fixed-length (or FIX) samples available for training in 5-way PERSONAE.


In 145-way PERSONAE. Each row of numbers below corresponds to one training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the values within a row are the accuracies (in %) obtained with the following feature types, in this order:

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 cgp1 cgp2 cgp3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3 chu1 chu2 chu3 rel

0.00 6.21 2.76 1.38 3.45 5.52 6.21 4.83 4.14 8.28 6.21 5.52 0.00 0.69 0.69 2.07 2.76 0.69 5.52 6.21 4.83 1.38 4.14 0.69 0.69

0.00 2.76 2.07 1.38 4.83 4.83 3.45 4.14 2.76 5.52 6.21 4.14 0.00 0.69 0.69 2.07 1.38 2.07 4.14 4.14 2.07 1.38 4.14 2.07 0.00

0.00 3.45 1.38 1.38 4.83 5.52 6.21 2.76 2.76 6.21 2.76 3.45 0.00 1.38 1.38 1.38 3.45 2.07 6.90 4.83 2.07 1.38 2.07 2.76 0.00

0.00 2.76 2.07 1.38 6.21 4.83 4.83 6.21 2.07 3.45 5.52 2.76 0.00 0.00 0.69 2.76 2.76 3.45 4.14 6.21 2.76 0.00 2.07 0.69 0.00

0.00 0.00 2.07 1.38 5.52 4.14 5.52 1.38 3.45 6.21 2.07 3.45 0.00 2.07 1.38 2.07 2.76 2.07 2.76 2.07 2.76 0.69 2.07 0.69 0.00

0.69 2.76 2.07 0.69 3.45 1.38 1.38 2.76 3.45 4.83 4.14 3.45 0.00 2.07 0.69 4.14 4.14 0.00 3.45 1.38 4.83 1.38 0.69 0.69 0.00

0.69 3.45 2.07 0.69 6.21 3.45 2.07 2.07 3.45 1.38 2.07 2.76 0.00 2.07 1.38 4.83 2.07 2.76 3.45 2.76 2.76 0.00 1.38 1.38 0.69

1.38 0.00 3.45 2.07 3.45 3.45 2.76 2.07 1.38 4.14 2.07 1.38 0.00 1.38 1.38 4.14 3.45 1.38 3.45 2.07 2.07 0.69 2.76 1.38 0.69

2.07 1.38 2.76 2.76 2.76 1.38 3.45 0.69 1.38 4.14 0.69 1.38 0.00 2.07 1.38 3.45 0.00 2.76 3.45 1.38 1.38 2.76 1.38 0.00 2.07

Average per training data size (90% down to 10%): 2.96, 2.36, 2.40, 2.32, 1.96, 1.76, 1.84, 1.76, 1.48. Baseline: 0.69%.

Table G.3: The effect of increasing data size, with data size interpreted as the number of fixed-length (or FIX) samples available for training in 145-way PERSONAE.


In two-way AAAC A. Each row of numbers below corresponds to one training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the values within a row are the accuracies (in %) obtained with the following feature types, in this order:

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3

62.50 57.50 82.50 77.50 80.00 75.00 70.00 60.00 65.00 77.50 62.50 57.50 90.00 52.50 55.00 67.50 60.00 65.00

60.00 55.00 80.00 70.00 80.00 72.50 62.50 57.50 55.00 65.00 62.50 50.00 87.50 60.00 50.00 67.50 57.50 52.50

62.50 60.00 77.50 72.50 85.00 72.50 65.00 62.50 57.50 62.50 60.00 55.00 80.00 60.00 42.50 67.50 57.50 60.00

65.00 55.00 75.00 65.00 85.00 70.00 75.00 70.00 57.50 62.50 70.00 57.50 80.00 60.00 47.50 67.50 65.00 52.50

62.50 60.00 72.50 70.00 82.50 72.50 65.00 65.00 55.00 65.00 67.50 57.50 85.00 60.00 55.00 62.50 60.00 57.50

65.00 70.00 67.50 65.00 80.00 82.50 67.50 70.00 60.00 75.00 65.00 50.00 75.00 47.50 55.00 67.50 67.50 57.50

50.00 57.50 72.50 65.00 80.00 72.50 65.00 67.50 52.50 67.50 65.00 50.00 67.50 47.50 50.00 72.50 67.50 52.50

47.50 50.00 75.00 60.00 77.50 70.00 57.50 57.50 52.50 57.50 60.00 55.00 62.50 60.00 52.50 70.00 62.50 60.00

50.00 47.50 57.50 55.00 67.50 57.50 62.50 55.00 52.50 52.50 50.00 47.50 50.00 50.00 47.50 62.50 55.00 55.00

Average per training data size (90% down to 10%): 67.39, 63.39, 64.17, 65.39, 65.06, 65.78, 62.06, 60.17, 53.89. Baseline: 50%.

Table G.4: The effect of increasing data size, with data size interpreted as the number of fixed-length (or FIX) samples available for training in 2-way AAAC A.


In five-way AAAC A. Each row of numbers below corresponds to one training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the values within a row are the accuracies (in %) obtained with the following feature types, in this order:

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3

30.00 44.00 34.00 48.00 66.00 64.00 54.00 24.00 22.00 66.00 28.00 22.00 46.00 42.00 26.00 50.00 24.00 26.00

28.00 52.00 34.00 40.00 72.00 70.00 52.00 26.00 24.00 52.00 32.00 22.00 44.00 42.00 24.00 48.00 28.00 26.00

34.00 44.00 38.00 38.00 76.00 68.00 48.00 30.00 30.00 46.00 30.00 30.00 44.00 46.00 24.00 48.00 34.00 30.00

38.00 44.00 40.00 42.00 72.00 64.00 46.00 30.00 26.00 48.00 22.00 26.00 44.00 42.00 36.00 42.00 26.00 26.00

32.00 48.00 42.00 40.00 68.00 58.00 44.00 32.00 24.00 52.00 26.00 24.00 44.00 44.00 34.00 42.00 32.00 24.00

38.00 40.00 42.00 36.00 62.00 52.00 52.00 36.00 30.00 54.00 26.00 30.00 44.00 46.00 32.00 48.00 28.00 30.00

26.00 28.00 40.00 40.00 60.00 50.00 40.00 24.00 26.00 42.00 32.00 26.00 48.00 36.00 28.00 42.00 26.00 32.00

24.00 22.00 32.00 44.00 52.00 46.00 34.00 32.00 24.00 34.00 30.00 26.00 46.00 40.00 26.00 28.00 30.00 22.00

14.00 28.00 34.00 42.00 38.00 40.00 46.00 28.00 22.00 32.00 20.00 20.00 22.00 30.00 34.00 40.00 30.00 20.00

Average per training data size (90% down to 10%): 39.78, 39.78, 41.00, 39.67, 39.44, 40.33, 35.89, 32.89, 30.00. Baseline: 20%.

Table G.5: The effect of increasing data size, with data size interpreted as the number of fixed-length (or FIX) samples available for training in 5-way AAAC A.


In 13-way AAAC A. Each row of numbers below corresponds to one training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the values within a row are the accuracies (in %) obtained with the following feature types, in this order:

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3

7.69 7.69 46.15 15.38 53.85 61.54 23.08 30.77 7.69 46.15 7.69 7.69 23.08 7.69 15.38 30.77 30.77 15.38

7.69 15.38 38.46 7.69 53.85 46.15 23.08 7.69 7.69 38.46 23.08 23.08 23.08 7.69 7.69 38.46 23.08 15.38

7.69 7.69 38.46 15.38 46.15 38.46 23.08 23.08 15.38 38.46 15.38 15.38 23.08 7.69 15.38 30.77 23.08 15.38

7.69 0.00 30.77 15.38 46.15 30.77 30.77 23.08 15.38 46.15 23.08 15.38 7.69 0.00 23.08 38.46 15.38 7.69

7.69 23.08 23.08 15.38 38.46 30.77 46.15 23.08 15.38 38.46 30.77 15.38 7.69 15.38 23.08 38.46 23.08 15.38

7.69 23.08 23.08 7.69 38.46 38.46 38.46 15.38 15.38 53.85 23.08 15.38 7.69 0.00 15.38 30.77 23.08 15.38

0.00 15.38 23.08 7.69 30.77 38.46 15.38 23.08 15.38 30.77 23.08 15.38 0.00 0.00 15.38 15.38 15.38 15.38

0.00 7.69 30.77 7.69 38.46 23.08 15.38 15.38 15.38 15.38 15.38 7.69 7.69 0.00 7.69 7.69 7.69 7.69

7.69 15.38 15.38 7.69 23.08 30.77 7.69 7.69 7.69 15.38 7.69 7.69 7.69 23.08 15.38 15.38 7.69 0.00

Average per training data size (90% down to 10%): 23.83, 22.22, 21.83, 20.56, 23.56, 21.39, 16.33, 12.33, 11.89. Baseline: 7.69%.

Table G.6: The effect of increasing data size, with data size interpreted as the number of fixed-length (or FIX) samples available for training in 13-way AAAC A.


In two-way ABC NL 1. Each row of numbers below corresponds to one training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the values within a row are the accuracies (in %) obtained with the following feature types, in this order:

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 cgp1 cgp2 cgp3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3 chu1 chu2 chu3 rel

67.50 55.00 65.00 62.50 70.00 70.00 57.50 50.00 62.50 62.50 50.00 55.00 57.50 72.50 57.50 72.50 50.00 57.50 60.00 52.50 62.50 52.50 72.50 67.50 52.50

62.50 50.00 75.00 62.50 82.50 75.00 60.00 55.00 65.00 57.50 57.50 60.00 57.50 65.00 57.50 70.00 60.00 55.00 50.00 55.00 65.00 55.00 60.00 67.50 52.50

65.00 62.50 70.00 62.50 85.00 72.50 55.00 60.00 65.00 52.50 50.00 60.00 65.00 60.00 60.00 65.00 65.00 52.50 57.50 70.00 65.00 60.00 60.00 65.00 60.00

67.50 60.00 62.50 72.50 82.50 67.50 50.00 57.50 67.50 57.50 62.50 55.00 65.00 67.50 67.50 65.00 70.00 57.50 55.00 62.50 67.50 60.00 60.00 57.50 55.00

72.50 52.50 57.50 72.50 90.00 72.50 55.00 65.00 60.00 60.00 65.00 62.50 67.50 55.00 65.00 62.50 60.00 60.00 60.00 62.50 62.50 60.00 72.50 55.00 52.50

72.50 47.50 57.50 75.00 80.00 62.50 50.00 62.50 57.50 55.00 65.00 57.50 70.00 55.00 60.00 67.50 55.00 52.50 47.50 65.00 60.00 60.00 65.00 60.00 52.50

62.50 47.50 62.50 72.50 77.50 62.50 50.00 55.00 50.00 57.50 62.50 57.50 65.00 57.50 57.50 67.50 50.00 60.00 50.00 62.50 47.50 65.00 60.00 52.50 45.00

65.00 45.00 67.50 72.50 72.50 57.50 50.00 65.00 55.00 62.50 62.50 57.50 65.00 62.50 50.00 57.50 50.00 47.50 57.50 55.00 55.00 55.00 50.00 50.00 50.00

60.00 55.00 70.00 72.50 60.00 62.50 62.50 60.00 50.00 57.50 55.00 57.50 50.00 55.00 45.00 62.50 52.50 52.50 65.00 57.50 50.00 57.50 55.00 52.50 60.00

Average per training data size (90% down to 10%): 60.28, 61.12, 62.48, 62.60, 62.96, 60.28, 58.00, 57.28, 57.28. Baseline: 50%.

Table G.7: The effect of increasing data size, with data size interpreted as the number of fixed-length (or FIX) samples available for training in 2-way ABC NL 1.


In five-way ABC NL 1. Each row of numbers below corresponds to one training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the values within a row are the accuracies (in %) obtained with the following feature types, in this order:

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 cgp1 cgp2 cgp3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3 chu1 chu2 chu3 rel

36.00 26.00 38.00 38.00 42.00 50.00 34.00 24.00 20.00 30.00 16.00 22.00 40.00 26.00 26.00 46.00 38.00 26.00 38.00 22.00 20.00 30.00 26.00 30.00 26.00

40.00 24.00 32.00 38.00 34.00 46.00 32.00 28.00 24.00 32.00 26.00 32.00 48.00 20.00 26.00 48.00 34.00 24.00 40.00 28.00 22.00 28.00 32.00 26.00 24.00

40.00 30.00 36.00 34.00 36.00 46.00 34.00 22.00 22.00 34.00 18.00 22.00 42.00 20.00 22.00 46.00 36.00 24.00 34.00 22.00 24.00 24.00 32.00 20.00 26.00

36.00 28.00 40.00 36.00 38.00 38.00 24.00 26.00 22.00 32.00 18.00 26.00 36.00 26.00 30.00 42.00 34.00 26.00 30.00 20.00 24.00 24.00 28.00 24.00 32.00

36.00 22.00 26.00 32.00 36.00 38.00 32.00 22.00 20.00 32.00 20.00 20.00 40.00 28.00 22.00 52.00 36.00 22.00 34.00 26.00 22.00 24.00 30.00 22.00 32.00

44.00 16.00 30.00 34.00 40.00 42.00 30.00 24.00 18.00 24.00 22.00 20.00 38.00 30.00 26.00 42.00 30.00 26.00 32.00 22.00 18.00 16.00 28.00 30.00 28.00

32.00 32.00 34.00 26.00 44.00 36.00 28.00 26.00 24.00 28.00 26.00 28.00 36.00 34.00 24.00 42.00 38.00 34.00 22.00 24.00 24.00 32.00 24.00 20.00 22.00

28.00 22.00 30.00 34.00 40.00 34.00 36.00 18.00 24.00 34.00 22.00 26.00 44.00 44.00 22.00 36.00 22.00 24.00 26.00 20.00 18.00 32.00 24.00 22.00 26.00

40.00 20.00 30.00 32.00 38.00 32.00 30.00 26.00 22.00 36.00 24.00 26.00 30.00 28.00 32.00 34.00 28.00 24.00 26.00 26.00 22.00 38.00 22.00 24.00 10.00

Average per training data size (90% down to 10%): 30.80, 31.52, 29.83, 29.60, 29.04, 28.40, 29.60, 28.32, 28.00. Baseline: 20%.

Table G.8: The effect of increasing data size, with data size interpreted as the number of fixed-length (or FIX) samples available for training in 5-way ABC NL 1.


In 8-way ABC NL 1. Each row of numbers below corresponds to one training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the values within a row are the accuracies (in %) obtained with the following feature types, in this order:

tok cwd fwd chr1 chr2 chr3 lex1 lex2 lex3 lem1 lem2 lem3 cgp1 cgp2 cgp3 pos1 pos2 pos3 lexpos1 lexpos2 lexpos3 chu1 chu2 chu3 rel

12.50 25.00 12.50 25.00 12.50 50.00 12.50 12.50 0.00 25.00 12.50 12.50 25.00 25.00 25.00 37.50 12.50 12.50 12.50 25.00 12.50 0.00 37.50 37.50 37.50

12.50 25.00 12.50 25.00 25.00 37.50 12.50 12.50 0.00 25.00 12.50 0.00 25.00 0.00 0.00 37.50 12.50 25.00 25.00 37.50 0.00 0.00 37.50 25.00 25.00

0.00 25.00 12.50 37.50 25.00 37.50 0.00 25.00 0.00 0.00 12.50 0.00 12.50 12.50 37.50 25.00 12.50 25.00 25.00 25.00 0.00 0.00 37.50 25.00 25.00

12.50 0.00 12.50 12.50 37.50 37.50 37.50 0.00 12.50 25.00 0.00 12.50 12.50 0.00 50.00 12.50 12.50 25.00 37.50 0.00 0.00 12.50 37.50 25.00 12.50

12.50 12.50 25.00 12.50 12.50 50.00 25.00 0.00 0.00 25.00 0.00 37.50 0.00 0.00 12.50 12.50 12.50 25.00 37.50 0.00 0.00 25.00 12.50 0.00 50.00

12.50 12.50 12.50 12.50 37.50 25.00 37.50 12.50 12.50 12.50 12.50 25.00 0.00 12.50 0.00 25.00 25.00 12.50 25.00 12.50 0.00 25.00 25.00 12.50 25.00

12.50 37.50 12.50 25.00 25.00 25.00 25.00 0.00 12.50 25.00 25.00 12.50 25.00 25.00 12.50 25.00 12.50 0.00 37.50 12.50 12.50 25.00 12.50 12.50 12.50

12.50 37.50 25.00 25.00 25.00 25.00 25.00 0.00 12.50 37.50 0.00 12.50 50.00 25.00 0.00 25.00 25.00 0.00 12.50 0.00 12.50 0.00 12.50 12.50 12.50

12.50 25.00 12.50 12.50 37.50 37.50 37.50 12.50 0.00 50.00 12.50 0.00 50.00 0.00 12.50 0.00 12.50 12.50 37.50 0.00 0.00 12.50 12.50 12.50 12.50

Average per training data size (90% down to 10%): 20.20, 17.80, 17.32, 17.20, 15.80, 16.72, 18.24, 16.80, 16.68. Baseline: 12.50%.

Table G.9: The effect of increasing data size, with data size interpreted as the number of fixed-length (or FIX) samples available for training in 8-way ABC NL 1.


Appendix H

Robustness to Limited Data: Comparing Data Representations and Machine Learners

This Appendix shows how the choice of an instance-based or a profile-based document representation, and of MBL or SVMs as learning algorithm, affects performance as data size is reduced. Data size is interpreted as the number of variable-length (or FLEX) text samples per author (cf. EXP 1). We present results for the three data sets, with experiments on the maximum number of candidate authors for each data set.
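
As a rough illustration of the two document representations compared in this Appendix, the sketch below builds either one training instance per text sample (instance-based) or one concatenated profile per author (profile-based), and trains either an SVM or a nearest-neighbor classifier. This is an assumption-laden approximation, not the setup used in the experiments: the memory-based learner is stood in for by 1-nearest-neighbor classification rather than TiMBL, character trigrams replace the feature types listed in the tables, and `train_texts`, `train_authors`, and `test_texts` are hypothetical placeholders.

```python
# Minimal sketch, not the dissertation's setup: contrast instance-based and
# profile-based training sets, and an SVM vs. a nearest-neighbor stand-in
# for memory-based learning.
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

def build_training_set(train_texts, train_authors, profile_based=False):
    """Instance-based: one instance per sample.
    Profile-based: one concatenated document per author."""
    if not profile_based:
        return list(train_texts), list(train_authors)
    profiles = defaultdict(list)
    for text, author in zip(train_texts, train_authors):
        profiles[author].append(text)
    labels = sorted(profiles)
    return [" ".join(profiles[a]) for a in labels], labels

def attribute(train_texts, train_authors, test_texts,
              profile_based=False, use_svm=True):
    docs, labels = build_training_set(train_texts, train_authors, profile_based)
    # Character trigrams as a simple stand-in for the chr3 feature type.
    vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
    clf = LinearSVC() if use_svm else KNeighborsClassifier(n_neighbors=1)
    clf.fit(vec.fit_transform(docs), labels)
    return clf.predict(vec.transform(test_texts))
```

In the profile-based setting each author contributes exactly one (longer) training document, whereas the instance-based setting keeps every sample as a separate, sparser training instance.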


In 145-way PERSONAE (baseline: 0.69%). Each block below gives one line per training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the four values on each line are the accuracies (in %) for the feature types chr3, lex3, pos2, and lexpos3, in that order.

Instance-based, MBL:

10.90 22.69 5.31 22.28

9.10 21.59 5.52 21.52

8.90 19.17 5.59 19.38

8.55 16.76 5.66 16.83

7.86 13.38 5.52 13.17

6.21 9.52 4.00 9.24

4.21 6.62 3.79 6.69

3.45 2.62 3.52 2.76

1.79 1.66 2.34 1.38

Instance-based, SVMs:

28.28 19.38 16.48 18.55

24.62 18.55 14.90 18.76

22.69 18.07 14.28 17.86

17.93 14.55 12.8 13.66

15.31 11.86 11.24 11.45

11.79 9.10 7.66 7.93

8.21 5.38 6.97 5.03

4.83 2.48 4.07 2.76

2.00 1.86 1.45 1.79

Profile-based, MBL:

16.55 4.69 7.31 4.76

14.90 6.07 7.03 5.59

12.00 5.45 6.14 5.93

11.31 5.59 6.62 5.45

9.45 5.79 8.14 5.93

8.21 5.31 5.72 5.31

6.28 4.14 5.03 4.76

3.66 2.62 3.10 2.76

1.79 1.66 2.34 1.38

Profile-based, SVMs:

24.83 27.86 20.76 27.93

22.28 26.90 19.24 26.90

20.76 24.83 17.66 25.17

18.07 19.38 15.24 19.52

13.03 15.38 11.93 14.83

10.69 10.90 9.93 10.07

7.59 7.45 7.79 7.17

4.62 3.03 4.14 3.45

2.00 1.86 1.45 1.79


Table H.1: The effect of document representation (instance-based vs. profile-based) and Machine Learning algorithm (MBL vs. SVMs) on performance in data size experiments in 145-way PERSONAE. Data size is interpreted as the number of variable-length (or FLEX) samples available for training (aka. EXP 1).


In 13-way AAAC A (baseline: 7.69%). Each block below gives one line per training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the four values on each line are the accuracies (in %) for the feature types chr3, lex1, pos2, and lexpos1, in that order.

Instance-based, MBL:

44.62 30.00 25.38 21.54

40.00 24.62 23.85 20.00

36.92 19.23 20.00 17.69

34.62 15.38 19.23 11.54

25.38 13.85 18.46 11.54

30.00 18.46 20.77 14.62

29.23 20.00 18.46 17.69

20.00 14.62 18.46 16.15

17.69 13.85 17.69 13.08

Instance-based, SVMs:

65.38 47.69 27.69 45.38

63.85 37.69 24.62 39.23

63.08 31.54 25.38 33.08

60.00 31.54 23.08 30.00

50.00 28.46 16.15 26.92

36.15 23.85 19.23 23.85

33.08 25.38 18.46 24.62

23.08 15.38 16.92 16.92

19.23 10.77 13.08 13.85

Profile-based, MBL:

61.54 33.08 20.00 35.38

55.38 21.54 17.69 26.92

58.46 16.15 15.38 17.69

51.54 12.31 18.46 15.38

46.15 16.15 16.15 23.85

37.69 20.77 15.38 16.15

38.46 13.08 13.85 13.08

21.54 11.54 12.31 13.08

17.69 13.85 17.69 13.08

Profile-based, SVMs:

73.08 41.54 24.62 47.69

64.62 33.08 23.08 35.38

60.77 30.77 22.31 35.38

57.69 26.92 18.46 31.54

50.77 27.69 19.23 26.92

48.46 24.62 17.69 24.62

40.77 19.23 16.92 17.69

20.00 14.62 14.62 17.69

19.23 10.77 13.08 13.85


Table H.2: The effect of document representation (instance-based vs. profile-based) and Machine Learning algorithm (MBL vs. SVMs) on performance in data size experiments in 13-way AAAC A. Data size is interpreted as the number of variable-length (or FLEX) samples available for training (aka. EXP 1).


In 8-way ABC NL 1 (baseline: 12.50%). Each block below gives one line per training data size (90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, and 10%, from top to bottom); the four values on each line are the accuracies (in %) for the feature types chr3, lex1, pos1, and lexpos1, in that order.

Instance-based, MBL:

33.75 27.50 18.75 25.00

32.50 26.25 17.50 25.00

33.75 25.00 17.50 21.25

36.25 27.50 20.00 26.25

31.25 22.50 23.75 26.25

23.75 23.75 17.50 21.25

23.75 21.25 20.00 21.25

23.75 22.50 16.25 22.50

17.50 21.25 15.00 26.25

Instance-based, SVMs:

81.25 30.00 21.25 36.25

72.50 30.00 18.75 36.25

67.50 32.50 22.50 33.75

60.00 27.50 18.75 32.50

50.00 20.00 18.75 27.50

35.00 17.50 21.25 26.25

26.25 16.25 17.50 18.75

21.25 26.25 17.50 26.25

13.75 21.25 15.00 18.75

Profile-based, MBL:

50.00 16.25 20.00 16.25

65.00 21.25 20.00 18.75

60.00 18.75 22.50 18.75

57.50 15.00 23.75 18.75

46.25 17.50 28.75 17.50

31.25 11.25 25.00 16.25

27.50 12.50 26.25 17.50

21.25 25.00 20.00 23.75

17.50 21.25 15.00 26.25

Profile-based, SVMs:

83.75 27.50 15.00 30.00

77.50 28.75 22.50 32.50

63.75 23.75 25.00 25.00

56.25 25.00 17.50 26.25

51.25 13.75 18.75 17.50

33.75 15.00 22.50 15.00

28.75 12.50 22.50 15.00

22.50 23.75 16.25 25.00

13.75 21.25 15.00 18.75


Table H.3: The effect of document representation (instance-based vs. profile-based) and Machine Learning algorithm (MBL vs. SVMs) on performance in data size experiments in 8-way ABC NL 1. Data size is interpreted as the number of variable-length (or FLEX) samples available for training (aka. EXP 1). Performance is expressed in terms of accuracy.

