Text analysis pipelines: towards ad-hoc large scale text mining 9783319257402, 9783319257419, 3319257404, 3319257412

This monograph proposes a comprehensive and fully automatic approach to designing text analysis pipelines for arbitrary


Language: English. Pages: xx, 302 pages: illustrations [317]. Year: 2015.


Table of contents:
Foreword
Preface
Symbols
Contents
1.1 Information Search in Times of Big Data
1.1.1 Text Mining to the Rescue
1.2 A Need for Efficient and Robust Text Analysis Pipelines
1.2.1 Basic Text Analysis Scenario
1.2.2 Shortcomings of Traditional Text Analysis Pipelines
1.2.3 Problems Approached in This Book
1.3.1 Central Research Question and Method
1.3.2 An Artificial Intelligence Approach
1.4 Contributions and Outline of This Book
1.4.1 New Findings in Ad-Hoc Large-Scale Text Mining
1.4.2 Contributions to the Concerned Research Fields
1.4.3 Structure of the Remaining Chapters
1.4.4 Published Research Within This Book
2.1.1 Text Mining
2.1.2 Information Retrieval
2.1.3 Natural Language Processing
2.1.4 Data Mining
2.1.5 Development and Evaluation
2.2.1 Text Analysis Tasks
2.2.2 Text Analysis Processes
2.2.3 Text Analysis Pipelines
2.3.1 InfexBA -- Information Extraction for Business Applications
2.3.2 ArguAna -- Argumentation Analysis in Customer Opinions
2.3.3 Other Evaluated Text Analysis Tasks
2.4.1 Text Analysis Approaches
2.4.2 Design of Text Analysis Approaches
2.4.3 Efficiency of Text Analysis Approaches
2.4.4 Robustness of Text Analysis Approaches
3.1.1 The Optimality of Text Analysis Pipelines
3.1.2 Paradigms of Designing Optimal Text Analysis Pipelines
3.1.3 Case Study of Ideal Construction and Execution
3.2 A Process-Oriented View of Text Analysis
3.2.1 Text Analysis as an Annotation Task
3.2.2 Modeling the Information to Be Annotated
3.2.3 Modeling the Quality to Be Achieved by the Annotation
3.2.4 Modeling the Analysis to Be Performed for Annotation
3.2.5 Defining an Annotation Task Ontology
3.2.6 Discussion of the Process-Oriented View
3.3 Ad-Hoc Construction via Partial Order Planning
3.3.1 Modeling Algorithm Selection as a Planning Problem
3.3.2 Selecting the Algorithms of a Partially Ordered Pipeline
3.3.3 Linearizing the Partially Ordered Pipeline
3.3.4 Properties of the Proposed Approach
3.3.5 An Expert System for Ad-Hoc Construction
3.3.6 Evaluation of Ad-Hoc Construction
3.3.7 Discussion of Ad-Hoc Construction
3.4.1 Text Analysis as a Filtering Task
3.4.2 Defining the Relevance of Portions of Text
3.4.3 Specifying a Degree of Filtering for Each Relation Type
3.4.4 Modeling Dependencies of the Relevant Information Types
3.4.5 Discussion of the Information-Oriented View
3.5.1 Modeling Input Control as a Truth Maintenance Problem
3.5.2 Filtering the Relevant Portions of Text
3.5.3 Determining the Relevant Portions of Text
3.5.4 Properties of the Proposed Approach
3.5.5 A Software Framework for Optimal Execution
3.5.6 Evaluation of Optimal Execution
3.5.7 Discussion of Optimal Execution
3.6.1 Integration with Passage Retrieval
3.6.2 Integration with Text Filtering
3.6.3 Implications for Pipeline Efficiency
4 Pipeline Efficiency
4.1.1 The Efficiency Potential of Pipeline Scheduling
4.1.2 Computing Optimal Schedules with Dynamic Programming
4.1.3 Properties of the Proposed Solution
4.1.4 Case Study of Ideal Scheduling
4.2 The Impact of Relevant Information in Input Texts
4.2.1 Formal Specification of the Impact
4.2.2 Experimental Analysis of the Impact
4.2.3 Practical Relevance of the Impact
4.2.4 Implications of the Impact
4.3 Optimized Scheduling via Informed Search
4.3.1 Modeling Pipeline Scheduling as a Search Problem
4.3.2 Scheduling Text Analysis Algorithms with k-best A* Search
4.3.3 Properties of the Proposed Approach
4.3.4 Evaluation of Optimized Scheduling
4.3.5 Discussion of Optimized Scheduling
4.4.1 Experimental Analysis of the Impact
4.4.2 Quantification of the Impact
4.4.3 Practical Relevance of the Impact
4.5 Adaptive Scheduling via Self-supervised Online Learning
4.5.1 Modeling Pipeline Scheduling as a Classification Problem
4.5.2 Learning to Predict Run-Times Self-supervised and Online
4.5.3 Adapting a Pipeline's Schedule to the Input Text
4.5.4 Properties of the Proposed Approach
4.5.5 Evaluation of Adaptive Scheduling
4.5.6 Discussion of Adaptive Scheduling
4.6.1 Effects of Parallelizing Pipeline Execution
4.6.2 Parallelization of Text Analyses
4.6.3 Parallelization of Text Analysis Pipelines
4.6.4 Implications for Pipeline Robustness
5.1.1 The Domain Dependence Problem in Text Analysis
5.1.2 Requirements of Achieving Pipeline Domain Independence
5.1.3 Domain-Independent Features of Argumentative Texts
5.2.1 Text Analysis as a Structure Classification Task
5.2.2 Modeling the Argumentation and Content of a Text
5.2.3 Modeling the Argumentation Structure of a Text
5.2.4 Defining a Structure Classification Task Ontology
5.2.5 Discussion of the Structure-Oriented View
5.3.1 Experimental Analysis of Content and Style Features
5.3.2 Statistical Analysis of the Impact of Task-Specific Structure
5.3.3 Statistical Analysis of the Impact of General Structure
5.3.4 Implications of the Invariance and Impact
5.4.1 Approaching Classification as a Relatedness Problem
5.4.2 Learning Overall Structures with Supervised Clustering
5.4.3 Using the Overall Structures as Features for Classification
5.4.4 Properties of the Proposed Features
5.4.5 Evaluation of Features for Domain Independence
5.4.6 Discussion of Features for Domain Independence
5.5 Explaining Results in High-Quality Text Mining
5.5.2 Explanation of Arbitrary Text Analysis Processes
5.5.3 Explanation of the Class of an Argumentative Text
5.5.4 Implications for Ad-Hoc Large-Scale Text Mining
6 Conclusion
6.1.1 Enabling Ad-Hoc Text Analysis
6.1.3 Optimizing Analysis Efficiency
6.1.4 Robustly Classifying Text
6.2.1 Towards Ad-Hoc Large-Scale Text Mining
6.2.2 Outside the Box
A.1 Analyses and Algorithms
A.1.1 Classification of Text
A.1.3 Normalization and Resolution
A.1.4 Parsing
A.1.5 Relation Extraction and Event Detection
A.1.6 Segmentation
A.1.7 Tagging
A.2.1 Efficiency Results
A.2.2 Effectiveness Results
B.1 An Expert System for Ad-hoc Pipeline Construction
B.1.1 Getting Started
B.1.2 Using the Expert System
B.1.3 Exploring the Source Code of the System
B.2.2 Using the Framework
B.2.3 Exploring the Source Code of the Framework
B.3.1 Getting Started
B.3.2 Using the Application
B.3.3 Exploring the Source Code of the Application
B.4.1 Software
B.4.3 Experiments and Case Studies
C.1.1 Compilation
C.1.2 Annotation
C.2.1 Compilation
C.2.2 Annotation
C.3 The LFA-11 Corpus
C.3.2 Annotation
C.4.1 CoNLL-2003 Dataset (English and German)
C.4.3 Brown Corpus
C.4.4 Wikipedia Sample
References
Index
