Complex Pattern Mining: New Challenges, Methods and Applications (Studies in Computational Intelligence, 880) 9783030366162, 3030366162

This book discusses the challenges facing current research in knowledge discovery and data mining posed by the huge volumes of data with a complex structure which are nowadays gathered in various applications.


English Pages 260 [251]


Table of contents :
Preface
Contents
Efficient Infrequent Pattern Mining Using Negative Itemset Tree
1 Introduction
2 Related Works
3 Preliminaries
3.1 Neg-Rep and Negative Itemsets
4 Negative Itemset Tree Miner
4.1 Negative Itemset Tree and Support Counting
4.2 Infrequent Pattern Mining with Termination Nodes Pruning
4.3 Infrequent Pattern Mining with 1st-Layer Nodes Pruning
5 Experimental Evaluation
6 Conclusion and Future Works
References
Hierarchical Adversarial Training for Multi-domain Adaptive Sentiment Analysis
1 Introduction
2 Related Work
3 Hierarchical Adversarial Sentiment Analysis
3.1 Hierarchical RNNs
3.2 Semi-supervised Adversarial Training Framework
3.3 Sentiment Prediction on an Unknown Domain
4 Empirical Analysis
4.1 Experiment Setup
4.2 Pairwise Domain Adaptation
4.3 Multi-domain Adaptation
5 Conclusion
References
Optimizing C-Index via Gradient Boosting in Medical Survival Analysis
1 Introduction and Background
2 Our Approach
2.1 Derivation
2.2 Data Approach
3 Datasets
4 Methods
5 Results
6 Summary and Conclusions
References
Order-Preserving Biclustering Based on FCA and Pattern Structures
1 Introduction
2 Order-Preserving Biclusters
3 FCA and Pattern Structures
4 Finding Biclusters Using Partition Pattern Structure
4.1 Partition Pattern Structure
4.2 OP Biclustering Using Partition
5 Finding Biclusters Using Sequence Pattern Structure
5.1 Sequence Pattern Structure
5.2 OP Biclustering Using Sequence
6 Experiment
7 Conclusion
References
A Text-Based Regression Approach to Predict Bug-Fix Time
1 Introduction
2 Background
3 Related Work
4 Proposed Model
4.1 Data Collection
4.2 Pre-processing
4.3 Learning and Severity Prediction
5 Experiment
5.1 Projects Selected
5.2 Metrics Used
5.3 Results
6 Discussion and Conclusion
6.1 Threats to Validity
References
A Named Entity Recognition Approach for Albanian Using Deep Learning
1 Introduction
2 Related Works
3 Challenges of NER in Albanian Language
4 Albanian Corpus Building and Annotation
5 System Architecture and Algorithmic Design
5.1 Frameworks and Libraries
5.2 Preprocessing Module
5.3 Neural Network Layer
5.4 CRF Layer
6 Experimental Analysis and Accuracy Evaluation
6.1 Experimental Environment
6.2 Experiments
6.3 Experimental Results
7 Conclusions and Future Works
References
A Latitudinal Study on the Use of Sequential and Concurrency Patterns in Deviance Mining
1 Introduction
2 Deviance Mining
3 Sequential Versus Concurrency: Challenges and Benefits
4 Experiments
4.1 Datasets
4.2 Settings
4.3 Results
5 Discussion
6 Conclusion
References
Efficient Declarative-Based Process Mining Using an Enhanced Framework
1 Introduction
2 Background and Related Works
3 The WoMan Framework
3.1 The WoMan Formalism
3.2 WoMan Modules
4 Optimization Approaches
4.1 Prototyped Process Discovery
4.2 Distributed Process Discovery
5 Performance Evaluation
5.1 Prototype's Handling Evaluation
5.2 Distributed Approach Evaluation
6 Conclusions and Future Directions
References
Exploiting Pattern Set Dissimilarity for Detecting Changes in Communication Networks
1 Introduction
2 Related Works
3 Preliminaries
3.1 Data Representation
3.2 Basic Definitions
3.3 Pattern Set Dissimilarity
3.4 Problem Definition
4 The Algorithm
5 Experiments
5.1 Datasets Description
5.2 Influence of the Input Parameters
5.3 Comparative Evaluation on Real Networks
5.4 Comparative Evaluation on Synthetic Networks
5.5 Case Study
6 Conclusions
References
Classification and Clustering of Emotive Microblogs in Albanian: Two User-Oriented Tasks
1 Introduction
2 Related Work and Contribution
3 Construction of the Sentence-Level Datasets
3.1 Data Collection and Assembly
3.2 Data Preprocessing
4 Sentence-Based Classification
4.1 Experimental Setting
4.2 Experimental Evaluation
5 Keyword Extraction
5.1 Cluster-Based Analysis
5.2 Keyword Extraction Through Clustering
5.3 Methodology
5.4 Experimental Results and Discussion
6 Conclusions and Future Remarks
References
Dealing with Class Imbalance in Android Malware Detection by Cascading Clustering and Classification
1 Introduction
2 Machine Learning Features
3 Cascading Clustering and Classification
3.1 Classification Algorithm
3.2 Clustering Algorithm
3.3 Cascading Clustering and Classification
4 Experimental Analysis
4.1 Data
4.2 Experimental Methodology
4.3 Compared Algorithms
4.4 Results and Discussion
5 Conclusion
References
Applying Analytics to Artist Provided Text to Model Prices of Fine Art
1 Introduction
2 Methods
2.1 Dataset Acquisition and Design
2.2 Product Description
2.3 Social Media and Sales
2.4 Determining Text Similarity Using Vectors
2.5 Sentiment Analysis
3 Results
3.1 Base Features
3.2 Social Media Presence
3.3 Word Count Results
3.4 Document Vector Clusters
3.5 Results with Sentiment
3.6 Combined Features
4 Conclusions
References
Approximate Query Answering over Incomplete Data
1 Introduction
2 Background
3 System Overview
4 Experimental Evaluation of Approximation Algorithms
5 Conclusion
References
A Machine Learning Approach for Walker Identification Using Smartphone Sensors
1 Introduction
2 Background on Decision Tree Classification
2.1 Random Forests
3 Related Work on Learning Algorithms for Mobile Sensors Data
4 Methodology
4.1 Features Model
4.2 Classification Approach
4.3 The Adopted Classifiers
5 Evaluation
5.1 Description of the Experiments
5.2 Evaluation Setting
6 Results and Discussion
6.1 D1 Dataset Results
6.2 D2 Dataset Results
7 Conclusions
References
Author Index


Studies in Computational Intelligence 880

Annalisa Appice · Michelangelo Ceci · Corrado Loglisci · Giuseppe Manco · Elio Masciari · Zbigniew W. Ras   Editors

Complex Pattern Mining New Challenges, Methods and Applications

Studies in Computational Intelligence Volume 880

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. The books of this series are submitted to indexing to Web of Science, EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink.

More information about this series at http://www.springer.com/series/7092

Annalisa Appice · Michelangelo Ceci · Corrado Loglisci · Giuseppe Manco · Elio Masciari · Zbigniew W. Ras

Editors

Complex Pattern Mining New Challenges, Methods and Applications


Editors

Annalisa Appice, Dipartimento di Informatica, Università degli Studi di Bari Aldo Moro, Bari, Italy
Michelangelo Ceci, Dipartimento di Informatica, Università degli Studi di Bari Aldo Moro, Bari, Italy
Corrado Loglisci, Dipartimento di Informatica, Università degli Studi di Bari Aldo Moro, Bari, Italy
Giuseppe Manco, ICAR-CNR, Rende, Italy
Elio Masciari, Università degli Studi di Napoli Federico II, Naples, Italy
Zbigniew W. Ras, Department of Computer Science, University of North Carolina, Charlotte, NC, USA

ISSN 1860-949X ISSN 1860-9503 (electronic) Studies in Computational Intelligence ISBN 978-3-030-36616-2 ISBN 978-3-030-36617-9 (eBook) https://doi.org/10.1007/978-3-030-36617-9 © Springer Nature Switzerland AG 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Complex pattern mining provides concepts and techniques to process the huge volumes of data with a complex structure, which are nowadays gathered in various applications. These massive and complex data pose new challenges for current research in Knowledge Discovery and Data Mining. They require new theory and design methods for storing, managing, and analyzing them by taking into account various complexity aspects: complex structures (e.g., multi-relational, time series and sequences, networks, and trees) as input/output of the data mining process; massive amounts of high-dimensional data collections flooding as high-speed streams and requiring (near) real-time processing and model adaptation to concept drifts; new application scenarios involving security issues, interaction with other entities, and real-time response to events triggered by sensors. Recent literature has devoted plentiful effort to this research area, with significant breakthroughs. In terms of scientific research, complex pattern mining has been focusing on developing specialized techniques and algorithms, which preserve the informative richness of data and allow us to efficiently and efficaciously identify complex information units present in such data. On the other hand, as a fundamental field of data mining, complex pattern mining is emerging in a wide range of real-world applications ranging from process mining to cybersecurity, medicine, language processing, and remote sensing. The intent of this book is to cover the recent developments in the theory, applications, and design methods of complex pattern mining as embedded in the fields of data science and big data analytics. In particular, the works presented in this book should hold the attention of both researchers and practitioners of data mining who are interested in the advances and latest developments in the area of extracting patterns. In our open call for contributions, we solicited submissions discussing and introducing new algorithmic foundations and representation formalisms in mining patterns from complex data. We received 16 submissions, which shows the liveliness of this field, and selected 14 of them for inclusion in this book. These articles are briefly summarized below.


Chapter “Efficient Infrequent Pattern Mining Using Negative Itemset Tree” describes a novel algorithm to discover rare patterns by employing both a top-down and a depth-first traversing paradigm. Chapter “Hierarchical Adversarial Training for Multi-domain Adaptive Sentiment Analysis” illustrates a hierarchical adversarial neural network (HANN) for adaptive sentiment analysis, which shares information between multiple domains bidirectionally. Chapter “Optimizing C-Index via Gradient Boosting in Medical Survival Analysis” investigates whether directly optimizing the C-index via gradient boosting may gain accuracy in medical survival analysis. Chapter “Order-Preserving Biclustering Based on FCA and Pattern Structures” explores the relation between bi-clustering and pattern structures by studying the order-preserving bi-clusters whose rows induce the same linear order across all columns. Chapter “A Text-Based Regression Approach to Predict Bug-Fix Time” tackles the problem of predicting the bug-fixing time by resorting to a multiple regression analysis and accounting for textual information extracted from the bug reports. Chapter “A Named Entity Recognition Approach for Albanian Using Deep Learning” proposes a deep learning approach to face the task of Named Entity Recognition in the Albanian language. It employs LSTM cells as the hidden layers, a Conditional Random Field as the output layer, as well as word and character tagging. Chapter “A Latitudinal Study on the Use of Sequential and Concurrency Patterns in Deviance Mining” illustrates a latitudinal study on the use of sequential and concurrency patterns in deviance process mining. Chapter “Efficient Declarative-Based Process Mining Using an Enhanced Framework” presents a process mining approach aimed at improving both the efficiency in learning process models and the readability of the learned process models. Chapter “Exploiting Pattern Set Dissimilarity for Detecting Changes in Communication Networks” describes a data mining approach to analyze evolving communication data and detect changes in the communication modalities. Chapter “Classification and Clustering of Emotive Microblogs in Albanian: Two User-Oriented Tasks” proposes a data mining methodology that resorts to both classification and clustering, in order to analyze microblogging content and characterize users writing posts with emotional content. Chapter “Dealing with Class Imbalance in Android Malware Detection by Cascading Clustering and Classification” describes a supervised learning approach for classifying Android applications. It resorts to a combination of clustering and classification in order to deal with the imbalanced data problem. Chapter “Applying Analytics to Artist Provided Text to Model Prices of Fine Art” develops a set of text-based features that are processed in combination with clustering and sentiment analysis, in order to predict the price of a work of contemporary art sold online.


Chapter “Approximate Query Answering over Incomplete Data” compares several recently proposed approximation algorithms which are designed to deal with incomplete data in big data applications. Chapter “A Machine Learning Approach for Walker Identification Using Smartphone Sensors” illustrates a classification-based methodology that analyzes the data collected through MEMS smartphone sensors, in order to recognize the identity of the walker and the pose of the device during the walk. We would like to thank all the authors who submitted papers for publication in this book. We are also grateful to the members of the Program Committee and external referees for their excellent work in reviewing submitted and revised contributions with expertise and patience. Last but not least, we thank Janusz Kacprzyk and Ramamoorthy Rajangam of Springer for their continuous support.

Annalisa Appice (Bari, Italy)
Michelangelo Ceci (Bari, Italy)
Corrado Loglisci (Bari, Italy)
Giuseppe Manco (Rende, Italy)
Elio Masciari (Naples, Italy)
Zbigniew W. Ras (Charlotte, USA)

October 2019

Contents

Efficient Infrequent Pattern Mining Using Negative Itemset Tree (Yifeng Lu, Florian Richter and Thomas Seidl) 1
Hierarchical Adversarial Training for Multi-domain Adaptive Sentiment Analysis (Zhao Xu, Lorenzo von Ritter and Giuseppe Serra) 17
Optimizing C-Index via Gradient Boosting in Medical Survival Analysis (Alicja Wieczorkowska and Wojciech Jarmulski) 33
Order-Preserving Biclustering Based on FCA and Pattern Structures (Nyoman Juniarta, Miguel Couceiro and Amedeo Napoli) 47
A Text-Based Regression Approach to Predict Bug-Fix Time (Pasquale Ardimento, Nicola Boffoli and Costantino Mele) 63
A Named Entity Recognition Approach for Albanian Using Deep Learning (Evis Trandafili, Elinda Kajo Meçe and Enea Duka) 85
A Latitudinal Study on the Use of Sequential and Concurrency Patterns in Deviance Mining (Laura Genga, Domenico Potena, Andrea Chiorrini, Claudia Diamantini and Nicola Zannone) 103
Efficient Declarative-Based Process Mining Using an Enhanced Framework (Stefano Ferilli and Sergio Angelastro) 121
Exploiting Pattern Set Dissimilarity for Detecting Changes in Communication Networks (Angelo Impedovo, Corrado Loglisci, Michelangelo Ceci and Donato Malerba) 137
Classification and Clustering of Emotive Microblogs in Albanian: Two User-Oriented Tasks (Marjana Prifti Skenduli and Marenglen Biba) 153
Dealing with Class Imbalance in Android Malware Detection by Cascading Clustering and Classification (Giuseppina Andresini, Annalisa Appice and Donato Malerba) 173
Applying Analytics to Artist Provided Text to Model Prices of Fine Art (Laurel Powell, Anna Gelich and Zbigniew W. Ras) 189
Approximate Query Answering over Incomplete Data (Nicola Fiorentino, Cristian Molinaro and Irina Trubitsyna) 213
A Machine Learning Approach for Walker Identification Using Smartphone Sensors (Antonio Angrisano, Pasquale Ardimento, Mario Luca Bernardi, Marta Cimitile and Salvatore Gaglione) 229
Author Index 249

Efficient Infrequent Pattern Mining Using Negative Itemset Tree Yifeng Lu, Florian Richter and Thomas Seidl

Abstract In this work, we focus on a simple and fundamental question: how to find infrequent patterns, i.e. patterns with a small support value, in a transactional database. In various practical applications such as science, medical and accident data analysis, frequent patterns usually represent obvious and expected phenomena. Really interesting information might hide in obscure rarity. Existing rare pattern mining approaches are mainly adapted from frequent itemset mining algorithms, which either suffer from the expensive candidate generation step or need to traverse all frequent patterns first. In this paper, we propose an infrequent pattern mining algorithm using a top-down and depth-first traversing strategy to avoid the two obstacles above. A negative itemset tree is employed to accelerate the mining process with its dataset compression and fast counting abilities.

Y. Lu (B) · F. Richter · T. Seidl
Database Systems and Data Mining Group, LMU, Munich, Germany
e-mail: [email protected]
F. Richter
e-mail: [email protected]
T. Seidl
e-mail: [email protected]

© Springer Nature Switzerland AG 2020
A. Appice et al. (eds.), Complex Pattern Mining, Studies in Computational Intelligence 880, https://doi.org/10.1007/978-3-030-36617-9_1

1 Introduction

Frequent itemset (pattern) mining has been studied for decades and successfully applied to many areas. It focuses on patterns that occur more often than a given minimum support threshold. Many efficient algorithms, either breadth-first based or depth-first based, have been proposed. However, the opposite question, infrequent itemset mining, has rarely been studied. In many applications, infrequent patterns, or low support patterns, are interesting since they contain new knowledge, while frequent patterns usually represent known and expected phenomena. Those low support patterns usually lead to highly confident association rules while still having a significant absolute support in a large dataset. Thus, they are important in different areas such as recommendation systems and medical, scientific or accident data analysis. For example, in recommendation systems, infrequent patterns represent the behaviors of "long tail" customers. Knowing infrequent patterns is a big advantage in predicting customers' interests. In the medical area, infrequent patterns can be indicators of adverse reactions or drug interactions. They are also crucial in identifying rare diseases, where untypical responses to medications are more important to domain experts than frequent and expected ones. In the analysis of traffic accidents, less frequent and abnormal behaviors may be the real cause of accidents.

Other pattern mining tasks also need the ability to mine infrequent patterns. For example, in discriminative pattern mining [4], the support of many critical discriminative patterns in dense and high-dimensional data is small. Existing discriminative pattern mining algorithms have to trade off pattern completeness against runtime performance. Infrequent patterns are also useful in generating synthetic datasets from a given real dataset for further research. The common approach is sampling. The diversity of the generated dataset is limited since infrequent patterns in the real dataset are unlikely to be sampled. Knowing low support patterns in advance makes it possible for the user to control the low support area in the generated synthetic dataset.

Existing frequent itemset mining algorithms can be used to mine infrequent patterns by setting the minimum support threshold to 1. However, those algorithms have to access all frequent patterns, which is undesired and time consuming. On sparse datasets, such as a typical supermarket transaction dataset with short transactions, accessing all frequent patterns is not a big issue for modern efficient frequent itemset mining algorithms, since the number of frequent patterns in such datasets is not very large. However, the datasets mentioned in the motivations above for infrequent pattern mining are usually much denser, with very long transactions. Mining low support patterns on such datasets is much slower.

In general, frequent itemset mining algorithms identify frequent (short) patterns first, which can be considered bottom-up. In this paper, we focus on the infrequent itemset mining problem. Thus, to avoid accessing frequent patterns, we propose a top-down approach which identifies infrequent patterns by accessing long infrequent patterns first. The "starting point" of the top-down approach is low support patterns, which means that no time is wasted on accessing frequent patterns. Several infrequent pattern mining algorithms have been proposed in recent years, but most of them are adapted from bottom-up approaches and also suffer from the expensive frequent pattern accessing step. A few infrequent pattern mining algorithms extract patterns using a top-down strategy; however, the expensive candidate generation step is necessary there, as the breadth-first traversing paradigm (Apriori-like) is applied. Various condensed representations and further advanced constraints for infrequent itemset mining have also been investigated in recent years to solve the redundancy problem or to achieve a better performance [11, 12]. However, the most fundamental problem, identifying all patterns with low support, has not been addressed efficiently yet. In this work, we focus on this simple but challenging problem.
A novel infrequent itemset mining algorithm is proposed which employs both a top-down and a depth-first traversing paradigm. We utilize a new tree-based structure with efficient support computing. Extensive evaluations on different real-world datasets show that our approach is very efficient.

2 Related Works

The itemset mining task can be considered as a process of traversing the powerset lattice shown in Fig. 1. Related works can be classified based on their strategy of traversing. All frequent itemset mining algorithms start from the bottom, with either breadth-first or depth-first traversing applied. Breadth-first based approaches, such as the Apriori [2] algorithm, extend the size of candidate itemsets step-wise, i.e., all itemsets of size k have to be generated and checked before considering itemsets of size k + 1. Such a candidate generation step is known to be time and memory intensive [3]. Depth-first traversing based algorithms, like the FP-growth [7] algorithm, are able to avoid the explicit candidate generation step due to the employment of the divide-and-conquer paradigm.

Most existing infrequent itemset mining algorithms are adapted from the Apriori algorithm, which is also bottom-up and breadth-first. Reference [9] inverted the idea of Apriori by defining a maximum support threshold. ARIMA [14] follows this idea; it uses the pruned itemsets of Apriori in a first mining step to generate rare itemset candidates bottom-up in a second step. FRIMA [8] also follows a bottom-up, breadth-first traversing based approach. There are also several bottom-up, depth-first based infrequent itemset mining algorithms. Reference [16] uses the heuristic that a rare pattern must contain at least one infrequent item. Reference [6] also utilizes the FP-growth paradigm but only returns minimal infrequent itemsets, which is a lossy condensed representation. All methods above employ the bottom-up strategy, which is not efficient.

Fig. 1 The powerset lattice with five items (from the empty set at the bottom to ABCDE at the top, with bottom-up and top-down traversal directions indicated). Two borders that divide frequent, infrequent (rare) and nonexistent patterns with minSup = 3 are sketched. Most patterns do not exist in the dataset

To the best of our knowledge, only few studies mine patterns top-down. Reference [18] first mentioned a top-down based method which takes the transpose of the given dataset, i.e. each item is treated as a transaction id and each transaction id is treated as an item. Mining patterns bottom-up on this transposed dataset is equivalent to top-down mining on the original dataset. AfRIM [1] and Rarity [15] also traverse the search space in a top-down manner. However, their performance is limited by the breadth-first traversing strategy. Our early work in [13] combines both top-down and depth-first traversing strategies by utilizing a novel tree structure. However, it is only efficient on extremely dense datasets.

In summary, existing rare pattern mining approaches either employ the breadth-first traversing strategy or have to access all frequent patterns first, both of which lead to a performance deficiency. A detailed overview of infrequent itemset mining algorithms in recent years can be found in [10].

Fig. 2 Example transaction database (a) and its corresponding neg-rep dataset (b)

3 Preliminaries

Consider I = {i1, i2, ..., im} to be the set of all distinct items. Any non-empty subset X ⊆ I is an itemset. Any itemset X of size |X| = k is referred to as a k-itemset. A tuple T = (tid, X) is called a transaction, where tid is the transaction identifier. For simplicity, a transaction T also refers to its itemset X if not specified otherwise. Any non-empty itemset Y ⊆ X is contained by a transaction T = (tid, X), and we just write Y ⊆ T. A set of transactions establishes a transaction database D. Figure 2a illustrates an example transaction database. Given a transaction database D, the (absolute) support of an itemset X is defined as the number of transactions T ∈ D containing X: X.supp = |{T ∈ D | X ⊆ T}|. The minimum support threshold (minSup) categorizes all itemsets (patterns) into three types: nonexistent, infrequent and frequent. An itemset X is infrequent if and only if 0 < X.supp < minSup. Otherwise, it is frequent (X.supp ≥ minSup) or nonexistent (X.supp = 0). With |I| distinct items, a dataset contains 2^|I| candidate patterns, while most of them are nonexistent.
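As a concrete illustration of these definitions, the following minimal Python sketch computes the absolute support of an itemset and assigns it to one of the three categories. The toy transaction database and the minSup value are made up for illustration only; they are not the example of Fig. 2.

    from itertools import combinations

    # Illustrative toy database; each transaction is a set of items.
    D = [{"A", "B", "C"}, {"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}]
    minSup = 3

    def support(X, database):
        """Absolute support: number of transactions containing itemset X."""
        return sum(1 for T in database if X <= T)

    def categorize(X, database, min_sup):
        s = support(X, database)
        if s == 0:
            return "nonexistent"
        return "infrequent" if s < min_sup else "frequent"

    # Enumerate all 2^|I| - 1 non-empty candidate itemsets and report their category.
    I = sorted(set().union(*D))
    for k in range(1, len(I) + 1):
        for X in map(set, combinations(I, k)):
            print(sorted(X), support(X, D), categorize(X, D, minSup))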


We are aiming at extracting all infrequent itemsets with support smaller than the given minimum support threshold and larger than 0. However, it is worth noting that extracting low support patterns does not necessarily mean extracting all infrequent patterns. Users might be interested in patterns whose support falls in a range, for example between 10 and 20, which is still small in a dataset with hundreds of thousands of long transactions. A bottom-up based approach can extract patterns in this range with the minimum support threshold set to 10 rather than 0. However, the number of frequent patterns to be identified is still massive. In contrast, a top-down based approach will be more efficient since it only extracts “infrequent” patterns that occur less than 20 times. In our experiments later, different threshold values are assigned to simulate the low support pattern mining scenario.

3.1 Neg-Rep and Negative Itemsets

In the conventional notation of an itemset, each symbol expresses the existence of an item. For example, given an itemset X = {A, B, C}, its notation implies that items A, B and C exist in X. For simplicity, we call the symbols of items in this notation positive items and the notation positive itemsets. Similarly, we can also represent the itemset X by utilizing those items that do not exist in X. This negative representation is the basic concept for support counting in our mining process.

Definition 1 (Negative Item) Given the set of items I = {i1, i2, ..., im}, the corresponding negative item of i ∈ I is denoted as ¬i. The symbol ¬ is used to represent the idea of not existing, and can be dropped in some notations below for simplicity.

Definition 2 (Neg-Rep Itemset and Negative Itemset) Given a positive itemset X = {x1, x2, ..., xn} ⊆ I, its neg-rep (negatively represented) itemset is the set of items that X does not have, denoted as X̄ = {¬i | i ∈ I ∧ i ∉ X} = I \ X. The negative itemset of X is denoted as ¬X = {¬x1, ¬x2, ..., ¬xn}.

A positive itemset X and its neg-rep itemset X̄ are two different notations of the same pattern. This concept is important for the support definitions described later. Converting each transaction in D into its neg-rep itemset yields the corresponding neg-rep transaction database D̄ (Fig. 2b). Two support values, intersect support and joint support, are defined on D and D̄, respectively.

Definition 3 (Intersect Support and Joint Support) Given a non-empty itemset X = {x1, ..., xn}:
• The intersect support of X in a transaction database D is the number of transactions that contain all items of X: X.isupp = |{T ∈ D | x1 ∈ T ∧ x2 ∈ T ∧ ··· ∧ xn ∈ T}|.


• The joint support of the negative itemset ¬X is defined on the corresponding neg-rep dataset D̄. It is the number of neg-rep transactions that contain at least one item of ¬X: ¬X.jsupp = |{T̄ ∈ D̄ | ¬x1 ∈ T̄ ∨ ¬x2 ∈ T̄ ∨ ··· ∨ ¬xn ∈ T̄}|.

Obviously, the intersect support is equivalent to the original definition of (absolute) support, i.e., X.isupp = X.supp. The joint support, on the other hand, has the following property:

Theorem 1 Given an itemset X, a dataset D and the corresponding neg-rep dataset D̄, then X.isupp = |D| − ¬X.jsupp.

Proof If a transaction T does not contain X, then at least one item of X is missing from T, so its neg-rep itemset T̄ contains at least one item of ¬X; hence T ⊉ X ⇔ T̄ ∩ ¬X ≠ ∅. Based on the definition of joint support, ¬X.jsupp = |{T̄ ∈ D̄ | T̄ ∩ ¬X ≠ ∅}| = |{T ∈ D | T ⊉ X}|. Furthermore, X.isupp = |{T ∈ D | T ⊇ X}|. Thus, X.isupp + ¬X.jsupp = |D|, which gives X.isupp = |D| − ¬X.jsupp.

Thus, the support of patterns in D can be computed equivalently using the joint support of neg-rep patterns in D̄.
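The relationship stated in Theorem 1 can be checked numerically. The short Python sketch below, with a made-up toy database (not the one of Fig. 2), converts each transaction to its neg-rep itemset and asserts that X.isupp = |D| − ¬X.jsupp for every candidate itemset; negative items are simply represented by the corresponding item names inside the neg-rep transactions.

    from itertools import combinations

    I = {"A", "B", "C", "D", "E"}
    D = [{"A", "B"}, {"A", "C", "D"}, {"B", "C", "D", "E"}, {"A", "B", "C"}]
    # Neg-rep database: each transaction is replaced by the items it does not contain.
    D_neg = [I - T for T in D]

    def isupp(X):
        """Intersect support: transactions of D containing every item of X."""
        return sum(1 for T in D if X <= T)

    def jsupp(X):
        """Joint support of ¬X: neg-rep transactions containing at least one item of X."""
        return sum(1 for T_neg in D_neg if T_neg & X)

    for k in range(1, len(I) + 1):
        for X in map(set, combinations(sorted(I), k)):
            assert isupp(X) == len(D) - jsupp(X)   # Theorem 1
    print("Theorem 1 verified on the toy database.")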

4 Negative Itemset Tree Miner

In this section, we will describe our rare itemset mining algorithm, the Negative Infrequent Itemset tree miner (NIIMiner), which is adapted from our previous work in [13] with the idea of diff-sets [19] applied.

4.1 Negative Itemset Tree and Support Counting

The negative itemset tree (NI-tree) is a prefix tree generated from the neg-rep database D̄ which summarizes the itemset information. The mining process extracts infrequent itemsets from the tree by deleting nodes recursively. Each node n = [¬i, c, l] is a triple consisting of a negative item ¬i, a count value c and a child list l. The root node r = [is, c, l] stores an itemset {is}, which is initialized as I. The root node is on the 0th layer. Nodes that are direct successors of the root, i.e. in the list r.l, are called the 1st-layer nodes, and so on. All negative items on a path from the root to any node form an itemset of negative items. The count value c is the number of neg-rep transactions that end at the node, and l is the list of child nodes.

To build a negative itemset tree, the dataset D is converted into its neg-rep database D̄. Negative items in each transaction are sorted in descending order based on their occurrence in D̄. Transactions in D̄ are inserted into the NI-tree one by one in ascending order with respect to their length. The count c of the last node in each insertion is increased by 1. Furthermore, if the counts of all nodes on the path during an insertion are 0, then the last node will be marked as a termination node, since it is the end of a negative itemset. In other words, termination nodes are the first nodes with non-zero count on each path from root to leaf. The corresponding NI-tree built from the dataset in Fig. 2a is shown in Fig. 3.

Fig. 3 Negative itemset tree built based on the dataset in Fig. 2a. Termination nodes are marked in red

Removing items from the root node, as well as the corresponding nodes in the NI-tree, leads to a new NI-tree, called the deducted tree (de-tree). The detailed excluding process is shown in Algorithm 1. The support of a pattern can be computed efficiently during such a removing process. The itemset {is} in the new root node of the de-tree is a new pattern. Given the initial NI-tree constructed from the neg-rep dataset D̄, we first check each node on the 1st-layer of the NI-tree (Step 6–13, Algorithm 1). If a node is marked with an item in the itemset is, it will be attached to the new root node. Otherwise, we will skip this node and its child nodes will be recursively checked (Step 11, Algorithm 1).

Input: Root Node r, Items to be removed R
Result: New Root Node r'
1  r' ← new NI-treeNode({r.is \ R}, r.c, ∅)
2  TraverseSubtree(r, r', r'.is)
3  return r'
4
5  Procedure TraverseSubtree(Node n, Node p, Itemset is)
6    foreach Child n' ∈ n.l do
7      if n'.i ∈ p.is then
8        Add n' to p.l
9      else
10       p.c ← p.c + n'.c
11       TraverseSubtree(n', p, is)
12     end
13   end
14 end

Algorithm 1: NI-treeSubtraction
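To make the tree construction and the counting idea of Algorithm 1 more tangible, here is a small, self-contained Python sketch. It is a simplified reading of the NI-tree, not the authors' implementation: it builds the prefix tree over the neg-rep transactions and performs a single subtraction from the initial tree, returning the accumulated count, which by Theorem 2 is the support of the remaining pattern. The repeated, recursive application on de-trees used by NIIMiner is omitted, and the toy database is made up.

    class Node:
        def __init__(self, item):
            self.item = item       # negative item label (None for the root)
            self.count = 0         # number of neg-rep transactions ending here
            self.children = {}     # item -> Node

    def build_ni_tree(D, I):
        """Build the NI-tree: a prefix tree over the neg-rep transactions."""
        D_neg = [I - set(T) for T in D]
        # Sort negative items by descending occurrence in the neg-rep database.
        freq = {i: sum(1 for T in D_neg if i in T) for i in I}
        order = sorted(I, key=lambda i: -freq[i])
        root = Node(None)
        for T_neg in sorted(D_neg, key=len):       # insert shorter transactions first
            node = root
            for i in sorted(T_neg, key=order.index):
                node = node.children.setdefault(i, Node(i))
            node.count += 1
        return root

    def subtract(root, root_items, removed):
        """One subtraction step from the initial tree: drop nodes labeled with
        `removed` items, re-attach kept nodes, and accumulate the counts of the
        dropped nodes; the accumulated count is the support of the pattern
        given by the remaining items."""
        kept_items = root_items - removed
        kept_children, absorbed = [], 0

        def walk(node):
            nonlocal absorbed
            for child in node.children.values():
                if child.item in kept_items:
                    kept_children.append(child)    # attach; do not descend further
                else:
                    absorbed += child.count        # node removed, count moves to root
                    walk(child)
        walk(root)
        return kept_children, absorbed

    # Toy usage (illustrative data, not the database of Fig. 2):
    I = {"A", "B", "C", "D", "E"}
    D = [{"A", "B", "D", "E"}, {"A", "B", "E"}, {"A", "C", "E"}, {"B", "C", "D", "E"}]
    root = build_ni_tree(D, I)
    _, sup = subtract(root, I, {"C"})
    print(sup)   # support of the pattern {A, B, D, E} in the toy database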


Fig. 4 Example de-trees obtained by excluding ¬C, ¬C¬D and ¬C¬D¬B from the NI-tree in Fig. 3. The node of ¬B is not removed in tree (c) since the subtraction process is terminated when all 1st-layer nodes are covered by the itemset in the root

We add the count value of removed nodes to the new root node (Step 10, Algorithm 1). When the removing process is finished, the count in the new root node of the de-tree is the support of the corresponding pattern. It is worth noting that the subtraction process is terminated when all 1st-layer nodes in the original NI-tree are checked. Thus, nodes below the 1st-layer of the de-tree are not checked yet and may still contain items not covered by the itemset in the root node. Such a scheme avoids scanning the whole NI-tree. Those nodes can be removed later, and the count value is still correct, as proved below. Three examples are illustrated in Fig. 4, obtained by excluding ¬C, ¬C¬D and ¬C¬D¬B from the NI-tree in Fig. 3, respectively. In Fig. 4a, the node of ¬C is removed and all its child nodes are attached to the new root node. The support of the pattern {ABDE} is 0 after the subtraction since the count of the removed node is 0. The NI-tree in Fig. 4b is obtained by further subtracting the node of ¬D from the previous tree. The count of node ¬D is added to the new root. Thus, the support of the pattern {ABE} is 1. However, if we further exclude item ¬B from the above NI-tree, the node of ¬B is kept since the node of ¬A above it is on the 1st-layer. Thus, nothing is removed and the support of the pattern {AE} is still 1, as shown in Fig. 4c.

Theorem 2 The item removing process on the NI-tree described above generates the support of patterns correctly.

Proof According to the construction process of the initial NI-tree, only the count of the last node is increased by 1 for each inserted transaction. Thus, the total count of all nodes in the NI-tree is |D̄| = |D|. Assume the itemset in the new root node of the de-tree is X. Given a transaction T ∈ D, if its neg-rep itemset contains at least one item of ¬X (equivalently, T does not contain X), then the removal stops at the first kept node on the corresponding path, so the last node of that path, which carries the transaction's count, remains in the de-tree. Thus, the number of such transactions is exactly the total count remaining in the de-tree, which by definition equals ¬X.jsupp. As all other counts are added to the new root node, its count value is |D| − ¬X.jsupp = X.isupp.


Furthermore, the item removing and support counting process can be applied recursively. Given two sets of items to be removed, R1 and R2, first removing R1 and then R2 is obviously equivalent to first removing R2 and then R1. Both removal orders finally generate the same de-tree as removing the union R1 ∪ R2. Thus, we can enumerate all patterns in the dataset without restarting from the initial NI-tree each time. The removing process is simple and efficient. Consider one removing process which generates a new de-tree with k items in its new root node. The time to determine whether an item is in the new root is O(log k). Moreover, let M be the number of nodes in the original NI-tree labeled with items that are not in the new root node. Then, in the worst case, all M nodes are removed during the process. Thus, the overall complexity of the process is O(M log k). The value of k, which is the length of a pattern, is relatively small compared to the size of the dataset. Therefore, log k can be treated as a constant. The number of nodes M is linear in the size of the dataset |D|. Thus, the complexity of our support counting process is also linear in the size of the dataset, which is the same as other efficient bottom-up pattern mining algorithms [17].

4.2 Infrequent Pattern Mining with Termination Nodes Pruning

The initial NI-tree contains the full itemset I in the root node, which means that the NI-tree represents the support of the pattern I. One item is excluded in each recursive step and a new de-tree is generated with a shorter itemset in the root node. All combinations of items in I will be enumerated recursively. This is a typical divide-and-conquer paradigm, which is employed by many other pattern mining algorithms as well. The difference is that we remove items from the NI-tree rather than project on items in the tree. More specifically, items are excluded in ascending order with respect to their frequency in the original dataset. Letting the operator ≺ denote “less frequent”, with A ≺ B ≺ C, we exclude items using the following divide-and-conquer paradigm:
1. excluding A and all its combinations,
2. excluding all combinations of B but without A,
3. excluding all combinations of C but without A and B.
The support value will be computed for each excluded itemset as described above. Infrequent patterns found so far are stored in the infrequent pattern list IL. The recursive process is terminated if the current pattern is frequent. Such a divide-and-conquer paradigm is known as depth-first traversing for itemset mining [3]. The procedure RecursiveRemove in Algorithm 2 illustrates this process.

It is obvious that there is a huge number of nonexistent patterns in a real dataset. Intuitively, they should be skipped. However, the simple divide-and-conquer procedure described above starts from the full set and excludes items one by one. All nonexistent patterns have to be traversed before considering existent infrequent patterns, which might cost even more time than bottom-up traversing. For example, the NI-tree in Fig. 4a, which only excludes C, should be skipped since its corresponding pattern {A, B, D, E} does not exist in the dataset. To address this problem, the patterns stored in the termination nodes mentioned before are used as the starting point, rather than the full pattern I. Algorithm 2, steps 1–16, illustrates the overall NIIMiner procedure in detail. For each termination node, the items in its corresponding neg-rep itemset, which is formed by all items on the path up to the root node, are removed at once in the first recursive step, which guarantees that the generated de-tree represents an existing pattern in the dataset (Steps 4–6, Algorithm 2).

Input: Transaction Database D, Minimum Support minSup
Result: Infrequent Itemset List IL
1  r, Lt ← BuildNI-tree(D) ; // root node r, termination node list Lt
2  IL, FL ← ∅ ; // infrequent and frequent pattern lists
   /* Start with excluding items towards termination nodes */
3  foreach Termination Node Nt ∈ Lt do
4    lt ← {Items on path from r to Nt} ; is' ← r.is \ lt
5    if is' ∉ IL ∧ is' ∉ FL then
6      r' ← NI-treeSubtraction(r, lt)
7      if r'.c < minSup then
8        IL ← IL ∪ {is'}
9        RecursiveRemove(r', minSup, IL, FL, null) ; // null ≺ i for all i ∈ I
10     else
11       FL ← FL ∪ {is'}
12     end
13   end
14 end
15 return IL
16
   /* Typical divide-and-conquer step */
17 Procedure RecursiveRemove(Node r, minSup, IL, FL, Last item iX)
18   foreach i ∈ r.is, iX ≺ i do
19     is' ← r.is \ {i}
20     if is' ∉ IL ∧ is' ∉ FL then
21       r' ← NI-treeSubtraction(r, {i})
22       if r'.c < minSup then
23         IL ← IL ∪ {is'}
24         RecursiveRemove(r', minSup, IL, FL, i)
25       else
26         FL ← FL ∪ {is'}
27       end
28     end
29   end
30 end

Algorithm 2: NIIMiner
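For readers who want to experiment with the traversal order itself, the following self-contained Python sketch mimics the top-down, depth-first recursion of Algorithm 2 in a deliberately simplified form: supports are recomputed naively instead of via NI-tree subtraction, the termination-node and 1st-layer optimizations are omitted, and the item order and toy data are illustrative only.

    def support(X, D):
        return sum(1 for T in D if X <= T)

    def mine_infrequent(D, items, min_sup):
        """Top-down, depth-first enumeration: start from the full itemset and
        recursively exclude items in a fixed (rarest-first) order, cutting a
        branch as soon as the current pattern becomes frequent."""
        order = sorted(items, key=lambda i: support({i}, D))   # rarest item first
        IL = {}                                                # infrequent pattern -> support

        def recurse(current, start):
            for idx in range(start, len(order)):
                candidate = current - {order[idx]}
                if not candidate:
                    continue
                s = support(candidate, D)
                if 0 < s < min_sup:
                    IL[frozenset(candidate)] = s
                if s < min_sup:
                    # still infrequent or nonexistent: keep excluding items
                    # (nonexistent patterns are traversed here; the prunings of
                    # Sects. 4.2 and 4.3 are what skip them in NIIMiner)
                    recurse(candidate, idx + 1)
                # if the candidate is frequent, every further exclusion only
                # increases support, so the whole branch is pruned

        full = set(order)
        s_full = support(full, D)
        if 0 < s_full < min_sup:
            IL[frozenset(full)] = s_full
        recurse(full, 0)
        return IL

    # Toy usage:
    D = [{"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "C", "E"}, {"A", "C", "D", "E"}]
    print(mine_infrequent(D, {"A", "B", "C", "D", "E"}, min_sup=2))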


The divide-and-conquer paradigm is then applied similarly on each de-tree for the rest of the items after removing the termination nodes. However, introducing termination nodes leads to duplicates in the recursive traversing process. For example, given the termination nodes in the initial NI-tree in Fig. 3, excluding CD and then B is equivalent to excluding BD and then C. If a duplicate arises, the recursion should be terminated, since removing more items will also lead to duplicates. An extra pruning step is necessary to check whether the current pattern has been accessed before (Steps 5 and 20, Algorithm 2). Infrequent and frequent patterns found so far are stored in two hash sets. Before generating a new de-tree, its corresponding pattern is tested. If it already exists in one of the hash sets, it and all its subsets must have been accessed already. Thus, the further divide-and-conquer process on this pattern can be terminated.
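A minimal illustration of this duplicate check, with hypothetical helper names (the real NIIMiner keys the check on the pattern stored in the de-tree root), could look as follows in Python:

    # Two hash sets memoize the patterns classified so far.
    infrequent_seen = set()
    frequent_seen = set()

    def already_accessed(pattern):
        """True if this pattern (and hence all of its subsets) was handled before."""
        key = frozenset(pattern)
        return key in infrequent_seen or key in frequent_seen

    def remember(pattern, supp, min_sup):
        """Record the pattern in the matching hash set once its support is known."""
        key = frozenset(pattern)
        (infrequent_seen if supp < min_sup else frequent_seen).add(key)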

4.3 Infrequent Pattern Mining with 1st-Layer Nodes Pruning

In experiments on real-world datasets, we noticed that the number of duplicates is enormous. Thus, it is worth rethinking the nonexistent pattern skipping scheme. First of all, the structure of the NI-tree can address the nonexistent pattern problem implicitly, without suffering from the expensive duplicate checking scheme. For example, in the NI-tree of Fig. 3, we have C ≺ D ≺ E ≺ A ≺ B. Only ¬C and ¬D are on the 1st-layer. Removing items that are not on the 1st-layer does not lead to any existent pattern. For instance, according to the divide-and-conquer paradigm, if we first remove E, then in the following recursive steps we remove EA, EB and EAB. All of them correspond to nonexistent patterns. However, if we remove C, though the pattern {ABDE} is nonexistent, removing C will still lead to a valid pattern later when further items after C are removed under the divide-and-conquer paradigm. Thus, when the current pattern is nonexistent, we should only remove items that label a 1st-layer node. We can then guarantee that each removing action eventually leads to at least one valid infrequent pattern.

Our NIIMiner is modified accordingly. The first part (Steps 3–16, Algorithm 2) of removing termination nodes is skipped. We perform the divide-and-conquer paradigm directly on the initial NI-tree. When the pattern in the current root is nonexistent, we only remove items on its 1st-layer, i.e., Step 18 of Algorithm 2 is changed to “foreach i ∈ {n.i | n ∈ r.l}, iX ≺ i do”. The duplicate checking step (Step 20, Algorithm 2) is then not necessary, since the divide-and-conquer paradigm for pattern mining guarantees that no duplicates happen.

Moreover, only removing items on the 1st-layer is also more efficient, since the child list of the root is always short compared with I. The reason is simple. Assume that the order of items in the given dataset D is A ≺ B ≺ C ≺ D ≺ ..., i.e., A is the rarest item in D. There will be a node ¬A on the 1st-layer of the initial NI-tree. A node labeled with ¬B should also be there, since otherwise A would be an item that does not exist in any transaction, which contradicts the fact that A is in D. Thus, the root node of our initial NI-tree must contain the two children ¬A and ¬B. The story is different for item C. If ¬C is on the 1st-layer, then there must be at least one transaction T ∈ D whose neg-rep itemset contains neither ¬A nor ¬B, i.e., {A, B} ⊆ T. Let P(i) be the probability that the item i is in a randomly selected transaction of D. Then, the probability that {A, B} ⊆ T is P(A)P(B), assuming items are independently distributed. Since A and B are the two rarest items in D, we know that P(A)P(B) is minimal. Therefore, it is unlikely that ¬C is on the 1st-layer. In fact, we have P(i ∈ r.l) = ∏_{j ≺ i} P(j), which means that all frequent items are unlikely to appear on the 1st-layer. In consequence, the modified mining process is more efficient.

5 Experimental Evaluation

Four real datasets obtained from the frequent itemset mining dataset repository (http://fimi.ua.ac.be/data/) are investigated. Figure 5 lists the main features of these four datasets. It is worth noting that they have a larger average transaction length compared with a typical business/supermarket dataset. LCMfreq [17] is used as the baseline. LCMfreq is one of the most efficient frequent itemset mining algorithms and represents the performance of the bottom-up, depth-first lattice traversing approach. Our NIIMiner is implemented in Java, while LCMfreq is obtained from the SPMF library [5]. An early approach proposed in [13], which starts from the full itemset and tests all nonexistent patterns, is not included since it cannot finish on those real datasets. Existing top-down breadth-first algorithms are also not included in our experiments since they cannot finish the mining task either, due to the expensive candidate generation step. In fact, they are much slower than bottom-up depth-first based approaches, such as FPGrowth, as shown in [16]. We also tested the first top-down depth-first pattern mining algorithm mentioned in [18]. This algorithm treats items as transaction ids and transaction ids as items, and identifies patterns of limited length using a typical bottom-up traversing method. However, this algorithm is also extremely slow since it has to mine patterns from a dataset with very long transactions. In our early experiments, it could only handle datasets with hundreds of transactions. As a consequence, our experiments in this section only compare three algorithms: LCMfreq, which represents the bottom-up depth-first approach, and our NIIMiner with two different nonexistent pattern skipping methods (Tnode: termination nodes, 1node: 1st-layer nodes).

Fig. 5 Statistics on real datasets. |D|: transaction number, |X |: number of items, |I |: number of distinct items, |L|: average itemset size


We first investigate the runtime performance with respect to different minimum support values for each dataset. Note that the main goal of this work is to extract low support patterns. Given a small minimum support value minSup, the NIIMiner will access all patterns with support smaller than minSup. In contrast, the bottom-up based LCMfreq algorithm needs to traverse all patterns with support larger than 1 in D to generate the same set of low support patterns, which is too slow to allow a reasonable performance comparison. A small value c is therefore introduced so that the LCMfreq algorithm only traverses patterns with support larger than minSup − c, rather than 1. Thus, our experiments can also be interpreted as comparing the runtime performance of accessing patterns in the support range [minSup − c, minSup). The value c is fixed for each dataset. Furthermore, the maximum transaction length L is restricted since otherwise no algorithm can finish the low support mining task.

As shown in Fig. 6, our NIIMiner is more efficient than the LCMfreq approach, since the latter has to access a huge number of high support patterns. As the minimum support increases, the NIIMiner needs to traverse more patterns while the LCMfreq algorithm traverses fewer. Thus, a bottom-up based approach might be more efficient if the desired supports are not that small. However, since we focus on the low support scenario, our NIIMiner will be the better choice. NIIMiner with 1st-layer nodes pruning is even faster since it avoids nonexistent patterns as well as the expensive duplicate checking step.

Fig. 6 Runtime experiments on different absolute minimum support values (runtime in seconds of NIIMiner+1node, NIIMiner+Tnode and LCMFreq for varying absolute minimum support on the four datasets)

We also studied the runtime performance with respect to different dataset sizes (total number of items |X|). The dataset size is adjusted by limiting the maximum transaction length L. Our top-down based NIIMiner behaves similarly to the bottom-up based approach: obviously, runtime is positively correlated with the dataset size. However, the runtime of our NIIMiner increases much more slowly than that of its competitor (Fig. 7). This is because the NI-tree employed in our approach compresses the dataset as well as provides an efficient counting ability. Moreover, the bottom-up traversing strategy accesses more frequent patterns when the dataset size increases. It is worth noting that our negative itemset tree is actually the same as the FP-tree used by the FPGrowth algorithm [7]. Thus, the space complexity of our approach is the same as that of the FPGrowth algorithm, which is known to be efficient. More specifically, our consumption is a constant multiple of the space used by FPGrowth, where the constant is determined by how much larger the negative dataset is than the original one.

Fig. 7 Runtime experiments on different dataset sizes (runtime in seconds for varying dataset size, in thousands of items)


6 Conclusion and Future Works

Our novel rare itemset miner NIIMiner has proven to solve the problem of rare itemset mining in an efficient and successive manner. By utilizing the negative representations of rare itemsets, we addressed this task from its dual perspective. Two different nonexistent pattern pruning methods are proposed, and the 1st-layer nodes pruning method is the more efficient one. The major limitation of our NIIMiner appears on extremely sparse datasets, such as a typical supermarket dataset, since the corresponding neg-rep dataset can be thousands of times larger than the original one, with very long neg-rep transactions. An integration of both bottom-up and top-down traversing strategies should be investigated to overcome this problem. Furthermore, condensed representations for rare patterns, such as closed patterns or non-derivable patterns, and their corresponding algorithms could be used for real applications and should be studied as future work.

References
1. Adda, M., Wu, L., Feng, Y.: Rare itemset mining. In: ICMLA 2007, pp. 73–80. IEEE (2007)
2. Agrawal, R., Srikant, R., et al.: Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, vol. 1215, pp. 487–499 (1994)
3. Agarwal, R.C., Aggarwal, C.C., Prasad, V.: A tree projection algorithm for generation of frequent item sets. J. Parallel Distrib. Comput. 61(3), 350–371 (2001)
4. Fang, G., Pandey, G., Wang, W., Gupta, M., Steinbach, M., Kumar, V.: Mining low-support discriminative patterns from dense and high-dimensional data. IEEE Trans. Knowl. Data Eng. 24(2), 279–294 (2012)
5. Fournier-Viger, P., Lin, J.C.W., Gomariz, A., Gueniche, T., Soltani, A., Deng, Z., Lam, H.T.: The SPMF open-source data mining library version 2. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 36–40. Springer (2016)
6. Gupta, A., Mittal, A., Bhattacharya, A.: Minimally infrequent itemset mining using pattern-growth paradigm and residual trees. In: Proceedings of the 17th International Conference on Management of Data, p. 13 (2011)
7. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD'00, New York, NY, USA, pp. 1–12. ACM (2000). https://doi.org/10.1145/342009.335372
8. Hoque, N., Nath, B., Bhattacharyya, D.: An efficient approach on rare association rule mining. In: Proceedings of 7th International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA 2012), pp. 193–203. Springer (2013)
9. Koh, Y., Rountree, N.: Finding sporadic rules using apriori-inverse. Advances in Knowledge Discovery and Data Mining, pp. 153–168 (2005)
10. Koh, Y.S., Ravana, S.D.: Unsupervised rare pattern mining: a survey. ACM Trans. Knowl. Discov. Data (TKDD) 10(4), 45 (2016)
11. Liu, B., Hsu, W., Ma, Y.: Mining association rules with multiple minimum supports. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 337–341. ACM (1999)
12. Liu, B., Hsu, W., Ma, Y.: Pruning and summarizing the discovered associations. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 125–134. ACM (1999)
13. Lu, Y., Richter, F., Seidl, T.: Efficient infrequent itemset mining using depth-first and top-down lattice traversal. In: International Conference on Database Systems for Advanced Applications, pp. 908–915. Springer (2018)
14. Szathmary, L., Napoli, A., Valtchev, P.: Towards rare itemset mining. In: Tools with Artificial Intelligence 2007, ICTAI 2007. 19th IEEE International Conference on, vol. 1, pp. 305–312. IEEE (2007)
15. Troiano, L., Scibelli, G.: A time-efficient breadth-first level-wise lattice-traversal algorithm to discover rare itemsets. Data Min. Knowl. Discov. 28(3), 773–807 (2014)
16. Tsang, S., Koh, Y.S., Dobbie, G.: Rp-tree: rare pattern tree mining. In: Proceedings of the 13th International Conference on Data Warehousing and Knowledge Discovery, DaWaK'11, pp. 277–288. Springer, Berlin, Heidelberg (2011)
17. Uno, T., Kiyomi, M., Arimura, H.: LCM ver. 2: efficient mining algorithms for frequent/closed/maximal itemsets. In: Fimi, vol. 126 (2004)
18. Zaki, M.J.: Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng. 12(3), 372–390 (2000)
19. Zaki, M.J., Gouda, K.: Fast vertical mining using diffsets. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 326–335. ACM (2003)

Hierarchical Adversarial Training for Multi-domain Adaptive Sentiment Analysis Zhao Xu, Lorenzo von Ritter and Giuseppe Serra

Abstract Extracting useful insights with sentiment analysis is of increasing importance due to the growing availability of user-generated content. Sentiment analysis usually involves multiple different domains, and labeled data is often difficult to obtain. In this paper we propose a hierarchical adversarial neural network (HANN) for adaptive sentiment analysis. Unlike most existing deep learning based methods, the proposed method HANN is able to share information between multiple domains bidirectionally, rather than transferring information from a source domain to a target domain in one direction only. In particular, the HANN method is inspired by the ideas of hierarchical Bayesian modeling and generative adversarial networks. We introduce a distinct encoder for each domain to model the domain-specific distribution of the latent features. The learning procedures on different domains are coupled by a discriminator network to propagate the information, which can be viewed as adversarial networks in a supervised context by forcing the discriminator to identify domain labels. The proposed method HANN not only captures the distinct properties of each domain, but also shares common information across multiple domains. We demonstrate the superior performance of our method on real data including the Amazon review dataset and the Sanders Twitter sentiment dataset.

Z. Xu (B) NEC Laboratories Europe, Heidelberg, Germany e-mail: [email protected] L. von Ritter Technical University of Munich, Garching, Germany e-mail: [email protected] G. Serra University of Birmingham, Birmingham, UK e-mail: [email protected] © Springer Nature Switzerland AG 2020 A. Appice et al. (eds.), Complex Pattern Mining, Studies in Computational Intelligence 880, https://doi.org/10.1007/978-3-030-36617-9_2


1 Introduction

With the exponential growth of user-generated content on online social networks and e-commerce systems, sentiment analysis [3, 17, 23, 25] has attracted growing interest as a means to identify user opinions on products and services. In sentiment analysis, the datasets usually span many different domains, and the labeled samples are often expensive and difficult to collect. Although the domain-specific data contains distinct properties, there are also common features and patterns which can be shared across domains. Thus it would be beneficial to build learning models that enable global information propagation across domains properly. To this end, domain adaptive sentiment analysis methods have been investigated in the literature. Most deep learning based works studied the problem of pairwise domain adaptation, which aims to generalize a model learned on a source domain to a new target domain, and the information propagation is usually unidirectional [1, 2, 7, 16]. However, sentiment analysis often involves data from multiple domains, such as books, movies and hotels, and it is critical to share information across all available domains and improve the predictive performance jointly.

In this paper we propose a deep learning framework, the hierarchical adversarial neural network (HANN), for adaptive sentiment analysis on multiple domains. The HANN method is inspired by the ideas of hierarchical Bayesian modeling and adversarial neural networks, and aims to effectively capture domain-specific properties, and meanwhile to learn common patterns shared by multiple domains. Hierarchical Bayesian (HB) modeling is typically used to model data generated from similar but different settings. Unlike many domain adaptation methods that fit a single model for all available data, the HB methods learn a distinct model for each setting, and the parameters of the models share a common prior distribution. Motivated by the idea of HB, the proposed method HANN introduces a deep neural network (DNN) for each domain, which models the unique distribution of the domain-specific data in the feature space. Here we leverage the long short-term memory network (LSTM) [8, 12] to fit the latent features of the texts. The parameters for each domain are different but follow a common prior, by which the learned global patterns can be propagated and allow for better predictions.

However, the parameterized prior may cause some technical challenges. As a deep neural network usually involves a large number of parameters, learning is computationally prohibitive even if conjugate priors are selected for efficient learning. To meet this challenge, we do not explicitly define and learn the prior distribution as in HB modeling, but introduce the idea of adversarial neural networks into the hierarchical framework to propagate information across domains. Generative adversarial networks (GANs) [10] are generally used to generate data from latent features with a generator G and a discriminator D. The network G samples data conditioned on the latent variables, and the network D predicts whether a data example is real data or a sample from G. The two networks G and D are trained simultaneously with conflicting objectives. Here we employ the strategy of adversarial training in a supervised context by forcing the discriminator to identify domain labels. In particular, we do not impose assumptions on the prior of the DNN


parameters, nor learn a parameterized posterior. Instead, the domains are modeled in the latent feature space, and the DNNs are coupled in a nonparametric way with a domain discriminator. The discriminator and the domain-specific DNNs are trained together in an adversarial manner. This enables information propagation across multiple domains, and allows for effectively capturing the distinct distributions of the domain-specific latent features. The framework can easily be used to predict a set of texts from an unknown domain (without any known labels) by investigating the distribution of their embedded inputs. As the distributions represent the domain patterns, we can thus model the distribution of the embedded data to predict sentiments of unknown domains. We demonstrate the superiority of the HANN method on real multi-domain sentiment analysis data: the Amazon reviews and the Sanders Twitter sentiment datasets. The experimental analysis shows promising results.
The rest of the paper is organized as follows. We start off with a brief review of related work, and then introduce the proposed learning framework HANN for sentiment domain adaptation. Before concluding, we present the empirical analysis of the method on the real data.

2 Related Work

In sentiment analysis, modeling data from different domains is a challenging problem and of practical importance for many real-world applications. Most of the recent works mainly focus on unidirectional pairwise domain adaptation based on deep learning techniques. For example, Blitzer et al. extended the structural correspondence learning method to transfer knowledge between domains for sentiment classification [1]. They selected features that link the source and target domains based on common frequency and the mutual information with the source labels. Courty et al. [2] proposed a method to find a predictive function that minimizes the optimal transport loss between the joint source distribution and the target distribution. [7] introduced a domain adversarial neural network which enforces the encoder to minimize the differences between the target and source domains. Zhao et al. [28] proposed a new generalization bound to transfer information from multiple source domains to a single target domain by optimizing a minimax saddle point. [16] presented an end-to-end adversarial memory network to learn the pivots between target and source domains with an attention mechanism. [19] tried to learn transferable representations by combining deep learning and optimal two-sample matching.
In the literature, there are some works on multi-domain adaptation, but they are often based on non-deep-learning techniques. For example, [14] introduced an ensemble learning method, which trained a single classifier for each domain and combined them using a meta-learning method. The authors extended their work in [15] to investigate another ensemble learning strategy based on probabilistic models to combine the multiple domain-specific sentiment classifiers. Wu and Huang [27] developed a regularization method to explore relations between domains based on textual content and sentiment word distribution. They constructed a domain similarity


graph using the domain relations and encoded it as regularization over the domain-specific sentiment classifiers. In contrast to the existing work, we propose a novel hierarchical adversarial neural network for adaptive sentiment analysis on multiple domains. The method is inspired by the ideas of hierarchical Bayesian modeling and adversarial networks. It can better capture the distinct properties of different domains, and jointly enhance the predictive performance over all domains.

3 Hierarchical Adversarial Sentiment Analysis

In this section, we introduce the hierarchical adversarial neural network (HANN) for multi-domain adaptive sentiment analysis. The work aims to learn domain-specific neural networks that can capture both the commonality and the differences of multiple domains, instead of only modeling shared patterns between a source and a target domain. Furthermore, the method intends to facilitate information transfer across all domains, rather than unidirectional propagation from a single source domain to a single target domain. To achieve these goals, the HANN framework is motivated by hierarchical Bayesian modeling, and extends adversarial networks in a supervised learning context to fit the distinct domain-specific neural networks. The resulting models have compact structures yet good prediction performance. In addition, labeled sentiment data is often expensive to acquire in many real applications. As an extra advantage, the proposed method can naturally exploit the easily available unlabeled data to enhance the learning procedure.
The problem and notations are defined as follows. Assume that there is a set of labeled and unlabeled reviews D = {D_1, ..., D_M} collected from M domains (e.g., books, DVDs and electronics). Each domain has L labeled reviews {(X_{m,1}, s_{m,1}), ..., (X_{m,L}, s_{m,L})}, and U >> L unlabeled ones U_m = {X_{m,L+1}, ..., X_{m,L+U}}. A review X_{m,i} consists of a sequence of N words. s_{m,i} and y_{m,i} respectively denote the sentiment and domain labels of a review. The task is to predict the unknown sentiment labels of all domains.

3.1 Hierarchical RNNs

Recurrent neural networks (RNN) [3, 11, 26] are a class of hidden-state methods to model sequence data, with many successful applications in text analysis. Here we employ the long short-term memory network (LSTM) [8, 12] as an encoder to learn the distribution of review texts in the latent feature space. In contrast to most domain adaptation methods that learn a single neural network to capture the common patterns shared by different domains, we are motivated by the idea of hierarchical Bayesian modeling. We assume that each set of reviews D_m is generated from a different but related setting, and thus can be modeled with a distinct LSTM. These LSTM encoders share the same network structure, but possess different parameters θ_m to learn the unique domain-specific properties.


Fig. 1 The HANN method for multi-domain sentiment adaptation, illustrated with three domains. The framework jointly improves predictive performance on each individual domain

All the parameters θ_m are formulated as samples drawn from a common but unknown prior distribution with hyperparameters α. The shared prior will capture the commonality between domains, such as syntactic and semantic structures, as well as emotional expressions.
In particular, each word in a sentence is represented as a d-dimensional vector x_t. The vectors are pretrained with word embedding methods [13, 21, 24], which map words to continuous low-dimensional vectors and retain similarity between words. For each word, there is an associated LSTM hidden unit. The input of the unit consists of two components: the word vector x_t, and the K-dimensional hidden vector h_{t-1} of the last unit at t - 1. The output is a hidden vector h_t, which transfers the sequential information to the next unit t + 1. Recursively, h_t is computed as:

Input gate:              i_t = \sigma(U_i x_t + W_i h_{t-1} + b_i)
Forget gate:             f_t = \sigma(U_f x_t + W_f h_{t-1} + b_f)
Output gate:             o_t = \sigma(U_o x_t + W_o h_{t-1} + b_o)
Candidate hidden state:  g_t = \tanh(U_g x_t + W_g h_{t-1} + b_g)
Internal memory:         c_t = c_{t-1} \odot f_t + g_t \odot i_t
Hidden state:            h_t = \tanh(c_t) \odot o_t        (1)

σ(·) is the sigmoid function and ⊙ denotes elementwise multiplication of vectors. The LSTM encoder models the sequential patterns underlying the texts. With the encoder, the texts are mapped into the feature space to approximate their distribution. Following the LSTM, there is a neural unit (denoted as a network S in Fig. 1) for sentiment classification. The input is the hidden vector h_N of the last LSTM unit of the sentence, and the output is the probability of the sentiment label:

p(s = +1 \mid h_N) = \sigma(w_s^T h_N + b_s).        (2)
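To make this concrete, the following is a minimal sketch (our illustration, not the authors' code) of a single domain-specific encoder: an LSTM over pre-trained word vectors followed by the sentiment unit of Eq. (2). It is written against the Keras API mentioned in Sect. 4.1 and uses the dimensions from the experiment setup there (200-dimensional embeddings, 50-dimensional hidden vector); the vocabulary size, sequence length and the random embedding matrix are placeholder assumptions.

```python
# Sketch of one domain-specific encoder f(X, theta_m) plus the sentiment unit S of Eq. (2).
# Dimensions follow Sect. 4.1; vocab_size, seq_len and embedding_matrix are placeholders.
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.initializers import Constant

vocab_size, seq_len, embed_dim, hidden_dim = 20000, 200, 200, 50
embedding_matrix = np.random.normal(size=(vocab_size, embed_dim))  # stand-in for GloVe vectors

def build_domain_encoder():
    words = layers.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim,
                         embeddings_initializer=Constant(embedding_matrix),
                         trainable=False)(words)
    h_n = layers.LSTM(hidden_dim)(x)                        # h_N of the last unit, Eq. (1)
    sentiment = layers.Dense(1, activation="sigmoid")(h_n)  # p(s = +1 | h_N), Eq. (2)
    return models.Model(words, [h_n, sentiment])

# One distinct encoder per domain, as in Sect. 3.1 (domain names are illustrative).
encoders = {m: build_domain_encoder() for m in ["books", "dvd", "electronics", "kitchen"]}
```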


Since the model is domain-specific, i.e. one distinct model for each domain, it is flexible enough to capture the unique properties of the domains, and thus leads to good predictions even if the structure of each neural network is compact. From a Bayesian modeling point of view, this is a discriminative model p(s | X, θ_m). The LSTM encoder can be viewed as a multi-output nonlinear function with a word sequence as input, formally defined as f : R^{d×N} → R^K. Conditioned on the function value, logistic regression (the neural unit) is used to label the sentiment, i.e. p(s | f(X, θ_m)). The parameters θ_m of the discriminative model include: the weight matrices U_i, W_i, U_f, W_f, U_o, W_o, U_g, W_g and bias vectors b_i, b_f, b_o, b_g of the function f, as well as the weight vector w_s and bias b_s of the logistic regressor. In a Bayesian framework, we often assume conjugate priors for efficient computation: the matrix parameters are drawn from matrix normal distributions, and the vector parameters are drawn from multivariate normal distributions. Then the joint distribution p(s_m, θ_m | X_m, α) of a domain m is written as:

\prod_{z \in \{i,f,o,g\}} \mathcal{N}_K(b_z)\, \mathcal{MN}_{K \times d}(U_z)\, \mathcal{MN}_{K \times K}(W_z) \times \mathcal{N}(b_s)\, \mathcal{N}_K(w_s) \times \prod_{\ell} \big(1 + \exp(-s_{m,\ell}\, f(X_{m,\ell}, \theta_m))\big)^{-1}        (3)

N(·) and MN(·) denote the normal and matrix normal distributions. Their hyperparameters, i.e., means and variances, are denoted as α. It is obvious that full Bayesian inference in the hierarchical RNN models is prohibitive, even if we employ conjugate priors. More importantly, the parameterized prior imposes additional assumptions on the data, which may reduce the flexibility of the neural network models.

3.2 Semi-supervised Adversarial Training Framework

To efficiently learn the hierarchical RNN models, we employ adversarial neural networks in a supervised learning context. Generative adversarial networks (GANs) [10] have attracted significant interest, since they can estimate generative models with a nonparametric adversarial training strategy. GANs were originally used for unsupervised learning scenarios [10], and have been extended to supervised learning cases, e.g. [7, 22, 28]. Here we propose an adversarial network based approximation method to estimate the hierarchical RNNs for multi-domain adaptation, such that the domain-specific sentiment models can be learned jointly in a nonparametric way. The overall framework is illustrated in Fig. 1.
The latent features h are vector representations of the reviews. In an ideal situation, the samples in the data space and in the feature space are distributed similarly. As the domains have distinct properties, i.e., different distributions, their latent features should be distributed differently as well. A domain discriminator D (i.e. a multi-layer neural network) can distinguish them as shown in Fig. 1. The loss function is thus defined as:

-\frac{1}{ML} \sum_{m} \sum_{\ell} \big[ s_{m,\ell} \log(\hat{s}_{m,\ell}) + (1 - s_{m,\ell}) \log(1 - \hat{s}_{m,\ell}) \big]
-\frac{\rho}{ML} \sum_{m} \sum_{\ell} \big[ y_{m,\ell} \log(\hat{y}_{m,\ell}) + (1 - y_{m,\ell}) \log(1 - \hat{y}_{m,\ell}) \big]
-\frac{\lambda}{MU} \sum_{m} \sum_{u} \big[ y_{m,u} \log(\hat{y}_{m,u}) + (1 - y_{m,u}) \log(1 - \hat{y}_{m,u}) \big],        (4)

where ŝ and ŷ respectively denote the estimates of the sentiment labels and the domain labels, computed by the sentiment regressors and the domain discriminator, and ρ and λ are regularization coefficients. Thus, no matter what the real distributions of the data are, we do not need to impose any assumptions on the functional form of the distributions; instead, we tune the domain-specific encoders such that the latent features h can be easily identified by the domain discriminator. As the unlabeled data can help reveal the distributions of the data as well, we integrate them in the loss function.
To capture the commonality of the domains, we first learn common prior parameters of the neural networks from all domains, instead of an expensive prior distribution. This can be viewed as a point estimate α* = E[θ_m] of the hyperparameter α in empirical Bayes. Intuitively, the common prior parameters E[θ_m] balance the distinct properties of the domains, and result in a trade-off model that captures the commonality of all the involved domains. Secondly, the information propagates across domains via the coupled latent features. From Fig. 1, one can see that the feature vectors are linked with the domain discriminator. The learning process will pit the domain-specific encoders against each other, and will meanwhile propagate information.
Putting everything together, the overall learning procedure is as follows. Initialize all LSTM encoders and sentiment regressors with the prior parameters learned from all domains. Iteratively optimize the loss function (4) w.r.t. the parameters until convergence. After that, we can use the learned domain-specific encoders and sentiment regressors to predict the unknown sentiments of the corresponding domains.
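As a sketch of how the objective in Eq. (4) can be evaluated, the function below computes the three cross-entropy terms in plain NumPy, assuming the sentiment predictions ŝ and the discriminator outputs ŷ have already been produced by the networks; the function name, the clipping constant and the default values of ρ and λ are our own placeholders, not values from the chapter.

```python
import numpy as np

def hann_loss(s_true, s_hat, y_true_lab, y_hat_lab, y_true_unlab, y_hat_unlab,
              rho=1.0, lam=1.0, eps=1e-12):
    """Combined objective of Eq. (4), with arrays flattened over all M domains.

    s_true/s_hat: sentiment targets and predictions for the M*L labeled reviews.
    y_true_lab/y_hat_lab: domain targets and discriminator outputs for labeled reviews.
    y_true_unlab/y_hat_unlab: the same for the M*U unlabeled reviews.
    """
    def bce(t, p):  # mean binary cross-entropy; the mean implements the 1/(ML) or 1/(MU) factor
        p = np.clip(p, eps, 1.0 - eps)
        return -np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))

    return (bce(s_true, s_hat)                       # sentiment term
            + rho * bce(y_true_lab, y_hat_lab)       # domain term on labeled reviews
            + lam * bce(y_true_unlab, y_hat_unlab))  # domain term on unlabeled reviews
```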

3.3 Sentiment Prediction on an Unknown Domain

Under the proposed hierarchical RNN framework, it is straightforward to predict a set of reviews of an unknown domain with the learned domain-specific sentiment models. For the unknown domain, there is no sentiment label associated with the reviews. It is thus reasonable to make predictions based on the sentiment models learned from the domains most similar to the set of unknown reviews. To accomplish this, we explore the generative distribution of the data. In particular, we not only model the conditional sentiment probability p(s|X), but also consider the distribution of the reviews p(X). As the reviews are vectorized with h, we model the distribution p(h). The idea can be illustrated with Fig. 3, which visualizes the distributions of the Amazon sentiment dataset [1]. One can find that the reviews from different domains


are mapped into the embedding space with different distributions. The embedding vectors h intrinsically encode the domain properties of the reviews. Given a set of reviews of an unknown domain, if their distribution is very similar to that of a known domain m*, then it is convincing to use the sentiment model θ_{m*} of that domain to classify the set of unknown reviews. Formally, given a set of reviews D_u of an unknown domain, a set of M candidate domains with the learned RNN encoders θ_m, and the embedded review vectors {h}_m, m = 1, ..., M, the predictions are made as follows:
1. Learn the review vectors {h}_u^{(m)} of the unknown domain with the RNN encoders θ_m, m = 1, ..., M.
2. Compute the distance Dis(p({h}_m), p({h}_u^{(m)})) between the two empirical review distributions, and find the most similar domain m* = arg min_m Dis(p({h}_m), p({h}_u^{(m)})).

The distance is estimated based on the two-sample Kolmogorov–Smirnov test [6], i.e., the KS statistic Dis = sup_h |p({h}_m) - p({h}_u^{(m)})|, where sup denotes the supremum.
3. Predict the sentiment labels of D_u with the model θ_{m*}.
The unknown domain prediction is investigated in the experiment section with interesting results.
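A small sketch of this domain-selection step is given below. The chapter uses the multidimensional Kolmogorov–Smirnov test of [6]; as a simple stand-in, this sketch takes the largest per-dimension two-sample KS statistic computed with scipy.stats.ks_2samp, so it should be read as an approximation of the procedure rather than the exact statistic used here. Function and variable names are ours.

```python
import numpy as np
from scipy.stats import ks_2samp

def select_domain(h_known, h_unknown):
    """Pick the known domain whose embedded reviews are closest to the unknown set.

    h_known:   dict mapping domain name -> (n_m, K) array of review vectors {h}_m
    h_unknown: dict mapping domain name -> (n_u, K) array {h}_u^(m), i.e. the unknown
               reviews embedded with that domain's encoder.
    """
    def ks_distance(a, b):
        # largest per-dimension KS statistic, a crude proxy for the multidimensional test
        return max(ks_2samp(a[:, k], b[:, k]).statistic for k in range(a.shape[1]))

    distances = {m: ks_distance(h_known[m], h_unknown[m]) for m in h_known}
    best = min(distances, key=distances.get)   # m* = arg min_m Dis(...)
    return best, distances
```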

4 Empirical Analysis

To validate the performance of the proposed method HANN, we conduct an experimental analysis on two sets of real data: the Amazon multi-domain sentiment data and the Sanders Twitter sentiment data. The task is to predict domain-specific sentiment labels of the user-generated texts. We compare the HANN method with multiple state-of-the-art approaches. We first conduct experiments to demonstrate the performance of the HANN method for typical pairwise domain adaptation. Then we evaluate our method in a scenario of information propagation across multiple domains.

4.1 Experiment Setup

The Amazon multi-domain sentiment dataset is introduced in [1] (https://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html). It consists of Amazon product reviews in four product categories (domains): books, DVD, electronics and kitchen. Each category contains 1000 positive and 1000 negative reviews, summing up to 8000 labeled reviews for the entire dataset. In addition, there are 692,105 unlabeled reviews in total.


The Sanders Twitter sentiment corpus (http://www.sananalytics.com/lab/twittersentiment/) contains manually-created sentiment labels for tweets about four companies (Apple, Google, Microsoft and Twitter). The tweets are labeled as positive, negative, and others. Since some tweets have been removed from Twitter, we obtained a total of 5113 labeled tweets via the Twitter API.
The textual data is preprocessed by removing punctuation (full stops were replaced with end-of-sentence tags), replacing some special characters (e.g. HTML-escaped quotation marks with regular quotation marks), making the text lower-case and tokenizing the remaining words with the Natural Language Toolkit tokenizer. We standardize all reviews to a length of 200 words for the Amazon data and 50 words for the Twitter data by cropping longer reviews and padding shorter ones with zeros. We embed the words into a 200-dimensional vector space with word embedding methods [13, 21, 24]. Here we employ GloVe. For the Amazon data, we train the word embedding on all reviews, including the labeled and unlabeled ones [3, 4, 9], to better capture the syntactic and semantic structures of the corpus. For the Twitter dataset, we use the pre-trained 200-dimensional word vectors learned from 2 billion tweets, as the Sanders Twitter dataset is of limited size. The pre-trained vectors are provided by GloVe.
In the experiments, we use 60% of the reviews for training and the rest for testing. The implementation of the proposed method is based on Theano and Keras. The hidden vector h of the LSTM encoder is set to be 50-dimensional. The sentiment classifier is a one-layer neural network with h as input and the sentiment probability as output. The discriminator is a two-layer network whose middle layer is 8-dimensional. We optimize the model using RMSprop with a batch size of 32.
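A minimal sketch of the preprocessing described above (lower-casing, tokenization with the Natural Language Toolkit, and cropping or padding to a fixed length) is shown below; the vocabulary lookup, the padding index and the unknown-word index are our own simplifications, not details taken from the chapter.

```python
import nltk  # nltk.download('punkt') may be required for the tokenizer models

def preprocess(text, vocab, max_len=200, pad_index=0, unk_index=1):
    """Lower-case, tokenize, map words to indices, then crop/pad to max_len.

    `vocab` maps word -> index; padding with zeros and the unknown-word index
    are our own conventions for this sketch.
    """
    tokens = nltk.word_tokenize(text.lower())
    ids = [vocab.get(tok, unk_index) for tok in tokens][:max_len]  # crop long reviews
    ids += [pad_index] * (max_len - len(ids))                      # pad short reviews
    return ids

# Example: 200 words for Amazon reviews, 50 for tweets, as in the setup above.
# review_ids = preprocess("Great book, highly recommended!", vocab, max_len=200)
```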

4.2 Pairwise Domain Adaptation

Although the HANN method focuses on multi-domain analysis, it also works well in the scenario of pairwise domain adaptation. From the Amazon data, we generate six possible pairwise adaptation tasks. The model performance is measured with prediction accuracy. In the literature, the pairwise domain adaptation approaches mainly focus on unidirectional information transfer from the source domain to the target domain. One recent baseline is the domain adversarial neural network (DANN) [7], and the other is joint distribution optimal transport (JDOT) [2]. We use the latest results reported in [2]. These approaches are compared in Table 1.
In addition, we analyze the distributions of the Amazon reviews to understand the performance of the HANN approach in more detail. To visualize the learned review vectors, the t-SNE method [20] is employed for dimensionality reduction. Figure 2 illustrates an example of the pairwise domain adaptation Books ↔ DVD. The left panel shows the distributions of the latent features learned with the domain-specific models separately (i.e. no information sharing). The right panel reveals the results of the HANN approach. One can find that the latent features learned with the HANN are better structured.
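The visualization step can be reproduced along the following lines, assuming the review vectors h have already been extracted from the encoders; the sketch uses scikit-learn's TSNE (the chapter cites [20] for t-SNE) with default parameters, which need not match the settings used to produce Fig. 2.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(h, domain_labels):
    """Project the review vectors h (n x K) to 2-D and colour the points by domain."""
    h_2d = TSNE(n_components=2).fit_transform(h)
    for name in sorted(set(domain_labels)):
        mask = np.array([d == name for d in domain_labels])
        plt.scatter(h_2d[mask, 0], h_2d[mask, 1], s=5, label=name)
    plt.legend()
    plt.show()
```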


Table 1 Prediction accuracy on the Amazon data for pairwise domain adaptation. The results of the baselines are reported in [2]

Tasks              HANN    Tasks               DANN    JDOT
Books ↔ DVD
  Books            0.853   DVD → Books         0.747   0.763
  DVD              0.819   Books → DVD         0.806   0.795
Books ↔ Elec.
  Books            0.839   Elec. → Books       0.718   0.749
  Elec.            0.88    Books → Elec.       0.747   0.781
Books ↔ Kitchen
  Books            0.843   Kitchen → Books     0.718   0.728
  Kitchen          0.885   Books → Kitchen     0.767   0.794
DVD ↔ Elec.
  DVD              0.833   Elec. → DVD         0.726   0.737
  Elec.            0.87    DVD → Elec.         0.738   0.788
DVD ↔ Kitchen
  DVD              0.824   Kitchen → DVD       0.73    0.765
  Kitchen          0.878   DVD → Kitchen       0.765   0.821
Elec. ↔ Kitchen
  Elec.            0.885   Kitchen → Elec.     0.846   0.845
  Kitchen          0.896   Elec. → Kitchen     0.850   0.872

Fig. 2 Distributions of the Amazon reviews embedded with different models. The Books reviews are red, and the DVD reviews are blue. Left: the domain-specific models of Books and DVD are learned separately without information sharing. Right: the models are learned with the proposed domain adaptation approach HANN

4.3 Multi-domain Adaptation

We further validate the performance of the HANN approach in the multi-domain adaptation setting. The proposed method is compared with several state-of-the-art approaches:
• RMTL [5]: Instead of just learning on the target task, use multi-task learning on related tasks.
• MTFL21R [18]: An l2,1-norm regularization model that selects features from multiple tasks, while promoting common patterns among tasks.
• MTLGraph [29]: Multi-task learning with graph structures, a method included in the MALSAR package.
• CMSC [27]: Collaborative multi-domain sentiment classification that trains domain-specific classifiers. The classifiers are trained with squared loss (LS), hinge loss (SVM) and log loss (Log), respectively.

The experiment results are shown in Table 2 for the Amazon data and in Table 3 for the Sanders Twitter data. For a fair comparison, we directly use the results of the baselines reported in [27]. One can find that HANN outperforms the state-of-the-art methods in most settings, and achieves the best results. This reveals that the domain-specific neural networks with information sharing can best fit the local data.

Table 2 Prediction accuracy on the Amazon data for multi-domain adaptation. The results of the baselines are reported in [27]

Method      Books    DVD      Electronics  Kitchen
RMTL        0.8133   0.8218   0.8549       0.8702
MTFL21R     0.7952   0.8060   0.8474       0.8639
MTLGraph    0.7966   0.8184   0.8369       0.8706
CMSC-LS     0.8210   0.8240   0.8612       0.8756
CMSC-SVM    0.8226   0.8348   0.8676       0.8820
CMSC-Log    0.8181   0.8373   0.8667       0.8823
HANN        0.855    0.8338   0.8775       0.9012

Table 3 Prediction accuracy on the Twitter data for multi-domain adaptation. The results of the baselines are reported in [27]

Method      Apple    Google   Microsoft    Twitter
RMTL        0.8563   0.8587   0.8228       0.7737
MTFL21R     0.8300   0.8385   0.7653       0.7430
MTLGraph    0.8541   0.8468   0.7745       0.7315
CMSC-LS     0.8580   0.8808   0.8246       0.7764
CMSC-SVM    0.8610   0.8742   0.8286       0.7853
CMSC-Log    0.8703   0.8676   0.8311       0.8031
HANN        0.8542   0.875    0.8556       0.7547

Figure 3 visualizes the distributions p(h) of the embedded Amazon reviews with clear structures. The proposed method efficiently captures the commonalities of all domains, meanwhile distinguishes the properties of each single domain, and thus achieves better prediction results.
In addition, we evaluate the performance of the HANN method for sentiment prediction on unknown domains, i.e. using the learned domain-specific neural networks to predict a set of reviews whose domain information is not available. The experiments on the Amazon data are set up as follows. There are now four learned sentiment models, one for each category: books, DVD, electronics and kitchen.


Fig. 3 Distributions of the Amazon reviews embedded with the HANN approach. The colors represent the categories of the reviews: books (red), DVD (blue), electronics (magenta), and kitchen (cyan)

Given a set of test reviews, the goal is to select a model from the four candidates to predict the sentiments of the test data. We embed the unknown data with the learned models, and denote the latent features as h*. Figures 4 and 5 show the distributions p(h*) for different settings. We again use t-SNE to visualize the high-dimensional vectors h*. The left panels show the distributions of the training reviews in their own spaces spanned by the corresponding domain models. The right panels are the distributions of the test data in the four spaces spanned by the domain models of books, DVD, electronics, and kitchen. One can find that, only when the test and training domains match, the test data is distributed in a similar way as the training data. For example, the top panels of Fig. 4 reveal that, if the test data consists of books reviews, then its distribution in the books space (top, second column) looks similar to that of the known books data (top left). That means that, based on the similarity of the distributions, we can find the most relevant domain-specific model to predict sentiments of the unknown texts. The other panels show the distributions of the unknown reviews from the DVD, electronics, and kitchen domains, which reveal a similar tendency.
Formally, we compute the KS statistic to make the selections, as shown in Table 4. Intuitively, the KS statistic is the distance between two empirical distributions. The smaller the KS statistic, the more similar the two distributions. Table 4 shows that we find the correct domains of the test data with the KS statistics of the learned latent vectors. The predictions can then be done with the corresponding domain-specific models, and the results will be the same as in Table 2. In real applications, we can use a weighted sum with exponentially decayed weights of the KS statistics to make predictions.
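The exact form of this weighting is not specified in the chapter; one plausible reading, shown below as a hedged sketch, turns the KS distances of Table 4 into normalized weights exp(-KS/τ) with a hypothetical temperature τ and averages the domain-specific predictions accordingly.

```python
import numpy as np

def weighted_prediction(ks_distances, domain_predictions, tau=0.05):
    """Combine domain-specific predictions using exponentially decayed KS weights.

    ks_distances:       length-M array of KS statistics, one per candidate domain (cf. Table 4).
    domain_predictions: (M, n) array of sentiment probabilities from each domain model.
    The weighting exp(-KS/tau) with a hypothetical temperature tau is our own reading
    of "exponentially decayed weights"; it is not taken from the chapter.
    """
    w = np.exp(-np.asarray(ks_distances, dtype=float) / tau)
    w /= w.sum()
    return w @ np.asarray(domain_predictions)
```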

5 Conclusion

The paper presents a hierarchical adversarial neural network (HANN) for multi-domain sentiment analysis. The HANN method aims to jointly improve sentiment predictions on each individual domain,


Fig. 4 Distributions of the Amazon reviews books (red) and DVD (blue) in different spaces. The left column is the distributions of the training reviews in the spaces spanned by the corresponding domain-specific models. The right panels are the distributions of the test reviews in the four spaces: books, DVD, electronics, and kitchen

rather than unidirectionally transferring information from a source domain to a target domain. Unlike most domain adaptation methods that learn a single model to capture the common properties shared by two domains, our method learns a distinct model for each domain, and couples the domain-specific DNNs in a nonparametric way based on supervised adversarial networks. The flexibility of the HANN method enables information propagation across domains, and meanwhile elicits the unique distributions of the domain-specific latent features. The learned


Fig. 5 Distributions of the Amazon reviews electronics (magenta) and kitchen (cyan) in different spaces. The left column is the distributions of the training reviews in the spaces spanned by the corresponding domain-specific models. The right panels are the distributions of the test reviews in the four spaces: books, DVD, electronics and kitchen

embedding vectors are thus better structured. Exploring the distributions provides interesting and useful insights into the sentiment data. The experimental analysis on the real data demonstrates the performance of the proposed approach. Our work provides interesting avenues for future work, such as integrating domain hierarchies and knowledge graphs into the adaptive analysis.

Table 4 KS distance between two empirical review distributions

Test reviews   Spaces spanned by the domain-specific models
               Books    DVD      Electronics  Kitchen
Books          0.0771   0.1500   0.3546       0.3758
DVD            0.1279   0.0958   0.2367       0.3183
Electronics    0.1600   0.3379   0.1042       0.1117
Kitchen        0.1696   0.3029   0.1212       0.1183

Acknowledgements This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 766186.

References 1. Blitzer, J., Dredze, M., Pereira, F.: Biographies, bollywood, boomboxes and blenders: domain adaptation for sentiment classification. In: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 440–447 (2007) 2. Courty, N., Flamary, R., Habrard, A., Rakotomamonjy, A.: Joint distribution optimal transportation for domain adaptation. In: NIPS, vol. 30 (2017) 3. Dai, A.M., Le, Q.V.: Semi-supervised sequence learning. In: Advances in Neural Information Processing Systems, vol. 28 (2015) 4. Erhan, D., Bengio, Y., Courville, A., Manzagol, P.A., Vincent, P., Bengio, S.: Why does unsupervised pre-training help deep learning? JMLR 11, 625–660 (2010) 5. Evgeniou, T., Pontil, M.: Regularized multitask learning. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 109–117 (2004) 6. Fasano, G., Franceschini, A.: A multidimensional version of the kolmogorov-smirnov test. Mon. Notices R. Astron. Soc. 225, 155–170 (1987) 7. Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17(59), 1–35 (2016) 8. Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with lstm. Neural Comput. 12(10), 2451–2471 (2000) 9. Glorot, X., Bordes, A., Bengio, Y.: Domain adaptation for large-scale sentiment classification: a deep learning approach. In: ICMLs (2011) 10. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27 (2014) 11. Graves, A.: Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence. Springer, Berlin (2012) 12. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 13. Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. In: ICML, pp. 1188–1196 (2014) 14. Li, S., Zong, C.: Multi-domain sentiment classification. In: ACL:HLT, pp. 257–260 (2008) 15. Li, S.S., Huang, C.R., Zong, C.Q.: Multi-domain sentiment classification with classifier combination. J. Comput. Sci. Technol. 26(1), 25–33 (2011)


16. Li, Z., Zhang, Y., Wei, Y., Wu, Y., Yang, Q.: End-to-end adversarial memory network for crossdomain sentiment classification. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2237–2243 (2017) 17. Liu, B.: Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers (2012) 18. Liu J, Ji S, Ye J (2009) Multi-task feature learning via efficient l2, 1-norm minimization. In: UAI, pp. 339–348 19. Long, M., Wang, J., Cao, Y., Sun, J., Yu, P.: Deep learning of transferable representation for scalable domain adaptation. IEEE Trans. Knowl. Data Eng. 28(8), 2027–2040 (2016) 20. van der Maaten, L.J.P., Hinton, G.E.: Visualizing high-dimensional data using t-sne. J. Mach. Learn. Res. 9, 2579–2605 (2008) 21. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 26, 3111–3119 (2013) 22. Odena, A.: Semi-supervised learning with generative adversarial networks. In: Data Efficient Machine Learning Workshop at ICML 2016 (2016) 23. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2(1–2), 1–135 (2008) 24. Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014) 25. Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: EMNLP, pp. 1631–1642 (2013) 26. Sutskever, I.: Training recurrent neural networks. Ph.D. thesis, University of Toronto (2013) 27. Wu, F., Huang, Y.: Collaborative multi-domain sentiment classification. In: IEEE International Conference on Data Mining, pp. 459–468 (2015) 28. Zhao, H., Zhang, S., Wu, G., Costeira, J., Moura, J., Gordon, G.: Multiple source domain adaptation with adversarial training of neural networks (2017) 29. Zhou, J., Chen, J., Ye, J.: MALSAR: Multi-tAsk Learning via Structural Regularization. Arizona State University (2011)

Optimizing C-Index via Gradient Boosting in Medical Survival Analysis Alicja Wieczorkowska and Wojciech Jarmulski

Abstract In medical databases, data represent the results of various medical procedures and analyses, often performed in non-uniform time steps. Therefore, when performing survival analysis, we deal with a data set with missing values and changes over time. Such data are difficult to use as a basis for predicting patient survival, as they are complex and scarce. In survival analysis methods, partial log-likelihood is usually maximized, following the idea used by Cox in his regression. This approach is also most commonly adopted in non-linear survival analysis methods. On the other hand, the predictive performance of survival analysis is measured by the concordance index (C-index). In our work we investigated whether directly optimizing the C-index via gradient boosting yields better results, and compared it with other survival analysis methods on several medical datasets. The results indicate that in the majority of cases gradient boosting tends to give the best predictive results, and the choice of the C-index as the optimized loss function leads to further improved performance.

1 Introduction and Background

In survival analysis [21], the data analyzed represent the length of time from an origin up to the endpoint of interest, e.g. the survival time after the diagnosis of a disease, the time from birth to the onset of a disease, the loss of a transplant graft, etc. Such data are called right-censored survival data. The observations can come from any domain [21, 32]; however, in this paper we will focus on medical data. A non-parametric statistic called the Kaplan–Meier estimator can be applied for this purpose [19, 23, 32].

1 Introduction and Background In survival analysis [21], the data representing the length of time are analyzed from the origin up to the endpoint of interest, e.g. the survival time after the diagnosis of a disease, the time from birth to the onset of a disease, loss of transplant graft etc. Such data are called right-censored survival data. The observations can be in any domain [21, 32], however, in this paper we will focus on medical data. A non-parametric statistic called Kaplan–Meier estimator can be applied for this purpose [19, 23, 32]. Kaplan–Meier method allows calculating the incidence rate A. Wieczorkowska · W. Jarmulski (B) Polish-Japanese Academy of Information Technology, Koszykowa 86, 02-008 Warsaw, Poland e-mail: [email protected] A. Wieczorkowska e-mail: [email protected] © Springer Nature Switzerland AG 2020 A. Appice et al. (eds.), Complex Pattern Mining, Studies in Computational Intelligence 880, https://doi.org/10.1007/978-3-030-36617-9_3


The Kaplan–Meier method allows calculating the incidence rate of events, which in medical applications most often corresponds to the fraction of patients living for a specified amount of time after treatment (e.g. a transplant). The estimator Ŝ(t) is defined as a step function with steps at the death times [22]:

\hat{S}(t) = \prod_{s=0}^{t} \left( 1 - \frac{dN(s)}{Y(s)} \right),        (1)

where dN(s) is the change in the process (1 if a death occurred at s and 0 otherwise), and Y(s) is the number of individuals in the study at time s. Another method applied in survival analysis is the Cox proportional hazards model [6, 7, 19]. The Cox model h(t, X) is a semi-parametric regression model, described by the formula:

h(t, X) = h_0(t) \exp\left( \sum_{i=1}^{p} \beta_i X_i \right),        (2)

where β_i are the regression coefficients, X = (X_1, ..., X_p) are the explanatory (predictor) variables, and h_0(t) is the baseline hazard. The baseline hazard is unspecified, and it can be replaced with a specific function if the shape of this function can be assumed.
Machine learning models have also recently been applied in survival analysis [20]. Random Survival Forests (RSF) [18] is one such method, which builds on the idea of Random Forests (RF) [3]. RF is an ensemble learning method aggregating decision trees. The method builds a large collection of decorrelated decision trees and then averages their outputs in regression problems or takes the majority vote in classification problems. This reduces the variance of the estimated prediction compared to the underlying decision trees. RSF is an extension of RF in which the ensemble is built from survival trees, with the purpose of analyzing right-censored survival data [2].
Gradient Boosting Machine (GBM) is an ensemble learning method which constructs an additive regression model from weak learners, by sequentially fitting a simple parameterized function (base learner) [10, 11] through increasingly refined approximations. This method is based on regression trees. The approximation accuracy and execution speed of gradient boosting are improved by incorporating randomization into the procedure.
Survival analysis can be considered a ranking problem, and the concordance index (C-index) is a standard measure to assess the quality of rankings [13, 26, 27]. The C-index represents the probability that, for a pair of randomly chosen comparable samples, the sample with the higher risk prediction will experience an event before the other sample. The C-index can also be used in binary classification with uneven class cardinalities.
In this work we investigate how using a GBM that directly optimizes the C-index improves the predictive performance of survival models. First, we derive our proposed approach (Sect. 2). In the next part we present our non-public dataset of patients who underwent liver transplantation surgery. This dataset was the main driver for the derivation of our method; however, for comparison purposes we also apply the derived


models on other publicly available medical datasets (Sect. 3). The following Sect. 4 describes the machine learning survival methods which were compared against our proposed approach. Finally, we discuss the results of the methods' comparison and conclude with a discussion (Sects. 5 and 6).

2 Our Approach

The main contribution of this work is the investigation of how optimizing the C-index as a loss function, instead of partial likelihood, in non-linear survival analysis methods impacts the models' predictive power. In this section we mathematically describe our approach and characterize our non-public medical dataset, on which the model was trained.

2.1 Derivation

The Cox proportional hazards model is derived by maximizing partial likelihood [5]. It is a modification of the full likelihood to accommodate censored data. The partial likelihood L(θ) of the parameter vector θ for right-censored data (the most common case) is defined as:

L(\theta) = \prod_{T \in \mathrm{unc.}} \Pr(T = T_i \mid \theta) \prod_{T \in \mathrm{r.c.}} \Pr(T > T_i \mid \theta),        (3)

where T_i is the time of a data point which belongs to either the uncensored (unc.) or the right-censored (r.c.) group. Other machine learning methods, such as GBM, also directly or indirectly maximize partial likelihood. In our work we aimed to verify whether directly optimizing the C-index in survival analysis methods yields better results than maximizing partial likelihood. We chose to test this by modifying GBM and substituting the maximization of partial log-likelihood with the maximization of the C-index formula. This approach was first presented in [4]. Our derivation differs in the usage of the C-index formula: instead of the original Harrell's formula [13], we applied Uno's estimation procedure for the C-index as follows:

\hat{C}_{Uno} = \frac{\sum_{j,k} \Delta_j \, (\hat{G}_n^L(T_j))^{-2} \, I(T_j < T_k) \, I(\eta_j > \eta_k)}{\sum_{j,k} \Delta_j \, (\hat{G}_n^L(T_j))^{-2} \, I(T_j < T_k)},        (4)

where T_i is the time of a data point i, Δ_i is the information on censoring of an observation i (Δ_i = 1 if the event occurred and Δ_i = 0 when no such information is given, i.e. a lost-to-follow-up case), I represents the indicator function, and η_i is the predicted risk score. Ĝ_n^L is the Kaplan–Meier estimator of the unconditional survival function of T_cens, estimated


from the learning data. The advantage of Uno's C-index formula is that it allows reliable comparison of models built on various subsets of patients. The C-index formula has one disadvantage, namely it is not differentiable w.r.t. η_i and therefore cannot be used by methods that rely on direct optimization, such as gradient boosting. Therefore we used the approximated version proposed by [26], which is directly minimized:

-\hat{C}_{smooth}(T, \eta) = -\sum_{i,k} w_{ik} \, \frac{1}{1 + \exp\left(\frac{\eta_k - \eta_i}{\sigma}\right)},        (5)

where σ is the smoothness parameter of the sigmoid function and the weights are defined as:

w_{ik} := \frac{\Delta_i \, (\hat{G}_n^L(T_i))^{-2} \, I(T_i < T_k)}{\sum_{i,k} \Delta_i \, (\hat{G}_n^L(T_i))^{-2} \, I(T_i < T_k)}.        (6)
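The experiments in this chapter are carried out with R packages (see Sect. 4); purely to make Eqs. (4)–(6) concrete, the following is a NumPy transcription of the two estimators, assuming the Kaplan–Meier estimate Ĝ of the censoring survival function is supplied as an argument (and is strictly positive) rather than computed here. It is an illustrative sketch, not the implementation used for the results below.

```python
import numpy as np

def uno_c_index(time, delta, eta, G_hat):
    """Uno's C-index, Eq. (4).

    time:  event/censoring times T_i
    delta: censoring indicators (1 = event observed, 0 = censored)
    eta:   risk predictions (higher = higher risk)
    G_hat: Kaplan-Meier estimate of the censoring survival function at T_i,
           computed on the learning data and assumed strictly positive.
    """
    comparable = (time[:, None] < time[None, :]) & (delta[:, None] == 1)  # Delta_j * I(T_j < T_k)
    weights = comparable / (G_hat[:, None] ** 2)                          # inverse-censoring weights
    concordant = weights * (eta[:, None] > eta[None, :])                  # I(eta_j > eta_k)
    return concordant.sum() / weights.sum()

def neg_smoothed_c_index(time, delta, eta, G_hat, sigma=0.1):
    """Negative smoothed C-index, Eqs. (5)-(6), the quantity minimized by GBM-CIO."""
    comparable = (time[:, None] < time[None, :]) & (delta[:, None] == 1)
    w = comparable / (G_hat[:, None] ** 2)
    w = w / w.sum()                                                       # Eq. (6)
    sig = 1.0 / (1.0 + np.exp((eta[None, :] - eta[:, None]) / sigma))     # sigmoid((eta_i - eta_k)/sigma)
    return -(w * sig).sum()                                               # Eq. (5)
```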

2.2 Data Approach

One of the goals of our work is to test modeling methods applied to the non-public liver transplantation data [20], hereinafter referred to as liver. These data represent observations of patients after liver transplantation surgery, collected since 1994. The observations for 1095 patients were gathered until the end of 2015, altogether 48,772 observations. Initially, the observations were manually inserted into the database, so there is a lot of noise in these data. Noise originates mainly from human errors in copying the data. Another source of noise is measurement errors, e.g. measuring the bilirubin level from a gall container instead of the patient's body. The observations represent various biochemical measurements.
From the medical point of view, it is desirable to identify patients who can lose the graft, as this is a life-threatening situation. Such patients can undergo more aggressive treatment, to assure their survival. However, these data are sparse and unevenly distributed over time. Many observations are missing, so the survival analysis has to be limited to selected parameters (biochemical measurements) that are most often available, and to patients for whom several observations of these parameters are available. A one-year period will be analyzed, and this analysis can save the lives of many patients. This is why we would like to test methods for survival data analysis and apply the results of the experiments presented in this paper to the liver transplantation data.
For objectivity and comparison purposes, we also apply the different modeling methods to 5 publicly available datasets described in Sect. 3, as well as to the liver data set. We aim to verify whether our approach also generalizes to other cases of medical survival prediction.


3 Datasets

In our research we used 5 publicly available medical data sets and our non-public data set in order to provide a comprehensive comparison of the various models' performance. The datasets used in our experiments are as follows:
• GBSG2 [12]—a data set containing the observations from the German Breast Cancer Study Group 2 (GBSG2) study;
• lung [24]—survival in patients with advanced lung cancer from the North Central Cancer Treatment Group;
• pbc [9]—data from the Mayo Clinic trial in primary biliary cirrhosis (PBC) of the liver, conducted between 1974 and 1984;
• retinopathy [1, 15]—a trial of laser coagulation as a treatment to delay diabetic retinopathy;
• std [22]—Sexually Transmitted Disease (STD) morbidity data;
• liver—the non-public data set representing observations of patients after liver transplantation surgery, as described in Sect. 2.2.
All these datasets represent medical survival cases, and if our approach works well for these data, we can assume that such analysis methods can be generalized to many survival prediction scenarios.

4 Methods

For the purpose of the analysis of the survival data, we decided to apply the methods presented in Sect. 1. The experiments were done using the R environment for statistical computing [29]. We have tested the following methods in our analysis:
• Cox regression—the R package survival was applied for the purpose of this analysis [30],
• Random survival forests (RSF)—the randomForestSRC package was applied for this purpose [16, 17]. We used the default parameters with 1000 trees in the ensemble,
• Gradient boosting—we used the gbm (Generalized Boosted Regression Models, GBM) package for this analysis [28],
• GBM-CIO—gradient boosting with C-index optimization; this analysis was based on the mboost package [14]. Various parameter values were used in this work, namely
– the number of boosting iterations, which was determined using the early-stopping technique,
– nu—the step size, or shrinkage parameter; we used 0.1 and 0.01,


Fig. 1 The results of the GBM-CIO training for the retinopathy data, for nu = 0.01, sigma = 0.1. Gray lines represent a single model training, the black line—the average of all runs

– sigma—smoothness parameter for the sigmoid functions inside the C-index; we used 0.1 and 0.01.
The Kaplan–Meier estimator has not been used in our comparison, as in practice it is used mainly for explanatory analysis rather than predictive analysis. Parameter selection for GBM-CIO (nu and sigma) was performed using grid search. Sample results of these optimizations are visualized in Figs. 1 and 2. A more detailed discussion of the impact of these parameters follows in Sect. 5.
From the results perspective we were especially interested in the last method, i.e. GBM-CIO, which deviates from maximizing partial log-likelihood and optimizes the C-index directly, as described in Sect. 2.1. However, these runs were the most time-consuming. Therefore, we were also interested in the results obtained using the other methods mentioned above.
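GBM-CIO itself is run through the mboost package as described above. For illustration only, the sketch below shows the underlying idea in Python: at each boosting iteration a shallow regression tree is fitted to the gradient of the smoothed C-index (Eqs. (5)–(6)) and the risk scores are updated with step size nu. The function name, the tree depth and the fixed number of iterations (in place of the early stopping actually used) are our own assumptions, not the chapter's implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_smoothed_cindex(X, time, delta, G_hat, n_iter=100, nu=0.1, sigma=0.1):
    """Toy gradient-boosting loop that ascends the smoothed C-index of Eqs. (5)-(6)."""
    n = len(time)
    eta = np.zeros(n)                                   # risk scores, updated additively
    comparable = (time[:, None] < time[None, :]) & (delta[:, None] == 1)
    w = comparable / (G_hat[:, None] ** 2)              # Eq. (6), assuming G_hat > 0
    w = w / w.sum()

    trees = []
    for _ in range(n_iter):                             # in practice chosen by early stopping
        s = 1.0 / (1.0 + np.exp((eta[None, :] - eta[:, None]) / sigma))
        d = w * s * (1.0 - s) / sigma                   # per-pair derivative of the smoothed term
        grad = d.sum(axis=1) - d.sum(axis=0)            # gradient of C_smooth w.r.t. eta_j
        tree = DecisionTreeRegressor(max_depth=2).fit(X, grad)
        eta += nu * tree.predict(X)                     # step in the ascent direction
        trees.append(tree)
    return trees, eta
```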

5 Results

We applied the methods presented in Sect. 4 to the data listed in Sect. 3. Gradient boosting with C-index optimization was applied with various parameters. Model efficiency is measured with the C-index. The results are shown in Tables 1, 2, 3, 4, 5 and 6. In Fig. 1, we present exemplary results of GBM-CIO training for the retinopathy data. Model performance is measured with the C-index calculated using Uno's formula. Training and testing were repeated 100 times on bootstrapped data, with the exception of GBM-CIO, where bootstrapping was repeated 10 times only, due to the long execution time.


Fig. 2 The results of the GBM-CIO training for the std data, for nu = 0.1, sigma = 0.1. Gray lines represent a single model training, the black line—the average of all runs

Table 1 C-index results for the GBSG2 data. Best predictability was achieved by gradient boosting (in bold)

Method    Min     1st Quartile  Median  Mean    3rd Quartile  Max
Cox       0.5055  0.5168        0.5361  0.5336  0.5460        0.5647
RSF       0.4573  0.4971        0.5015  0.5045  0.5159        0.5571
GBM       0.4920  0.5169        0.5272  0.5272  0.5363        0.5712
GBM-CIO   0.5004  0.5221        0.5380  0.5341  0.5489        0.5576

Table 2 C-index results for the lung data. Best predictability was achieved by Cox regression (in bold)

Method    Min     1st Quartile  Median  Mean    3rd Quartile  Max
Cox       0.5521  0.5920        0.5967  0.5981  0.6130        0.6337
RSF       0.5026  0.5347        0.5466  0.5503  0.5702        0.5921
GBM       0.5166  0.5505        0.5606  0.5740  0.5976        0.6434
GBM-CIO   0.5048  0.5336        0.5769  0.5762  0.6020        0.6667

In the following discussion we concentrate on the mean C-index value in the results of these bootstrapping runs. The mean has been chosen arbitrarily; using the median instead of the mean value does not have any impact on the conclusions drawn from these experiments. Cox regression is our baseline method, as it is the most popular modeling technique in survival analysis using a linear combination of predictors as input.


Table 3 C-index results for the pbc data. Best predictability was achieved by gradient boosting (in bold)

Method    Min     1st Quartile  Median  Mean    3rd Quartile  Max
Cox       0.7275  0.7397        0.7602  0.7644  0.7811        0.8342
RSF       0.7259  0.7500        0.7584  0.7691  0.7979        0.8313
GBM       0.7312  0.7540        0.7669  0.7708  0.7732        0.8438
GBM-CIO   0.7127  0.7526        0.7638  0.7674  0.7824        0.8272

Table 4 C-index results for the retinopathy data. Best predictability was achieved by gradient boosting (in bold)

Method    Min     1st Quartile  Median  Mean    3rd Quartile  Max
Cox       0.4802  0.6602        0.6651  0.6524  0.6782        0.7163
RSF       0.5482  0.6017        0.6677  0.6432  0.6784        0.7226
GBM       0.5728  0.6549        0.6790  0.6732  0.6995        0.7416
GBM-CIO   0.5319  0.6475        0.6907  0.6684  0.7018        0.7214

Table 5 C-index results for the std data. Best predictability was achieved by gradient boosting (in bold)

Method    Min     1st Quartile  Median  Mean    3rd Quartile  Max
Cox       0.5058  0.5365        0.5595  0.6030  0.5851        0.9083
RSF       0.1149  0.5099        0.5285  0.4778  0.5437        0.5702
GBM       0.5255  0.5494        0.5774  0.6102  0.5854        0.9149
GBM-CIO   0.5193  0.5395        0.5719  0.6151  0.5867        0.9149

Interestingly, for one dataset (lung) the top results were achieved using Cox regression. The most likely explanation is that the dependency between the hazard ratio and the predictors was linear, and as a result it could be fully captured by Cox regression. Additionally, it should be noted that all the models performed poorly on this dataset, providing only slightly better prediction than random guessing. On one dataset (liver) RSF had the highest C-index, and on the four datasets (GBSG2, pbc, retinopathy, std) GBM methods provided the best predictive results. Exemplary GBM training for the pbc data is shown in Fig. 3. In half of the datasets in which GBM performed best, the GBM-CIO variant was slightly better than standard GBM. The details of the results of GBM-CIO on one of these datasets, namely the std data, are shown in Fig. 2 (although the training was not always so smooth). Among the remaining methods, non-linear RSF was not consistently better than Cox regression. As we can see, although there are common patterns, the results vary between the datasets. This might be caused by outlier cases, so we may consider testing approaches targeted at finding such exceptional cases [8].


Table 6 C-index results for the liver data. Best predictability was achieved by random survival forest (in bold)

Method    Min     1st Quartile  Median  Mean    3rd Quartile  Max
Cox       0.3916  0.4920        0.5696  0.5549  0.5916        0.7114
RSF       0.5312  0.6635        0.7192  0.6935  0.7467        0.7933
GBM       0.5499  0.6090        0.6839  0.6748  0.7230        0.7919
GBM-CIO   0.5100  0.6380        0.6772  0.6856  0.7476        0.8403

Fig. 3 Impact of parameters choice when training GBM-CIO on the pbc data: nu = 0.01, and sigma = 0.1. Gray lines represent a single model training, the black line—the average of all runs

Generally, RSF and GBM give better results when there are nonlinear dependencies between predictors. Otherwise, in the case of linear-only dependencies, Cox regression gives comparable results.
We have additionally tested the impact of the parameters nu and sigma on GBM-CIO models; these represent, respectively, the boosting step size (shrinkage parameter) and the smoothness parameter of the sigmoid functions inside the smoothed C-index, see Formula (5). Figures 3, 4, 5 and 6 present various training curves on the pbc dataset when nu and sigma take the values 0.1 and 0.01. Both parameters determine how fast the models converge towards the optimum C-index value. Smaller values of sigma and larger values of nu lead to fast convergence, and the opposite holds for slower convergence. GBM-CIO is not sensitive to the values of these parameters, as the optimum C-index is picked by the appropriate number of boosting iterations, which is determined using the early-stopping technique.


Fig. 4 Impact of parameters choice when training GBM-CIO on the pbc data: nu = 0.1, and sigma = 0.1

Fig. 5 Impact of parameters choice when training GBM-CIO on the pbc data: nu = 0.01, and sigma = 0.01

6 Summary and Conclusions

The obtained results indicate that the outcomes depend on the datasets, and in some cases it is difficult to achieve satisfactory results (although all results were better than random guessing). In one case, classic Cox regression yielded the best results.


Fig. 6 Impact of parameters choice when training GBM-CIO on the pbc data: nu = 0.1, and sigma = 0.01

However, in most cases (four out of six datasets), GBM-based models performed best, and the choice of the C-index as the optimized loss function usually led to further improved performance. Our experiments show that machine learning methods improve the predictive performance of survival models. The best results were achieved by gradient boosting models, and GBM results were further improved by direct optimization of the C-index instead of partial log-likelihood. Therefore, we recommend using this approach in medical survival analysis where the predictive power of models is of primary interest. We also believe that this approach could be applied in other survival methods which optimize a target function, in order to improve their predictive power. What is more, we believe that the presented methods could also be successfully applied in other medical applications and even outside of medicine, but this requires further research.
In further work, we are planning to apply the findings described in this paper and to continue analyzing the liver transplantation data [20], as our main goal is to indicate patients at risk of losing the graft after the first year after the transplantation surgery. These data are noisy and scarce, but we have already started the experiments, cleaned the data, and were able to build models predicting whether a patient will lose a graft after transplantation in a specified time horizon. Observing changes of biochemical measurements is an innovation with respect to the approach applied so far in liver transplantation observations, as physicians use a static indicator (Model For End-Stage Liver Disease, MELD [25]), which does not take any changes in time into account. We have already shown that the analysis of changes of biochemical observations in time provides better predictions than MELD [20]. We continue


We continue working with these data, and we consider testing various approaches, in order to provide better predictions for the patients.

Acknowledgements This work was partially supported by the Research Center of the Polish-Japanese Academy of Information Technology, supported by the Ministry of Science and Higher Education in Poland.

References
1. Blair, A.L., Hadden, D.R., Weaver, J.A., Archer, D.B., Johnston, P.B., Maguire, C.J.: The 5-year prognosis for vision in diabetes. Am. J. Ophthalmol. 81, 383–396 (1976)
2. Bou-Hamad, I., Larocque, D., Ben-Ameur, H.: A review of survival trees. Stat. Surv. 5, 44–71 (2011)
3. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
4. Chen, Y., Jia, Z., Mercola, D., Xie, X.: A gradient boosting algorithm for survival analysis via direct optimization of concordance index. Comput. Math. Methods Med. Article ID 873595 (2013)
5. Cox, D.R.: Partial likelihood. Biometrika 62(2), 269–276 (1975)
6. Dekker, F.W., de Mutsert, R., van Dijk, P.C., Zoccali, C., Jager, K.J.: Survival analysis: time-dependent effects and time-varying risk factors. Kidney Int. 74, 994–997 (2008)
7. van Dijk, P.C., Jager, K.J., Zwinderman, A.H., Zoccali, C., Dekker, F.W.: The analysis of survival data in nephrology: basic concepts and methods of Cox regression. Kidney Int. 74, 705–709 (2008)
8. Duivesteijn, W., Feelders, A.J., Knobbe, A.: Exceptional model mining: supervised descriptive local pattern mining with complex target concepts. Data Min. Knowl. Disc. 30, 47–98 (2016)
9. Flemming, T.R., Harrington, D.P.: Counting Processes and Survival Analysis. Wiley, New York (1991)
10. Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001)
11. Friedman, J.H.: Stochastic gradient boosting. Comput. Stat. Data Anal. 38(4), 367–378 (2002)
12. Garcia, A.L., Wagner, K., Hothorn, T., Koebnick, C., Zunft, H.-J.F., Trippo, U.: Improved prediction of body fat by measuring skinfold thickness, circumferences, and bone breadths. Obes. Res. 13(3), 626–634 (2005)
13. Harrell, F.E. Jr., Lee, K.L., Mark, D.B.: Tutorial in Biostatistics. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat. Med. 15(4), 361–387 (1996)
14. Hothorn, T., Buehlmann, P., Kneib, T., Schmid, M., Hofner, B., Sobotka, F., Scheipl, F., Mayr, A.: Model-based boosting (2018). https://cran.r-project.org/web/packages/mboost/mboost.pdf. Accessed 5 July 2018
15. Huster, W.J., Brookmeyer, R., Self, S.G.: Modelling paired survival data with covariates. Biometrics 45, 145–156 (1989)
16. Ishwaran, H., Kogalur, U.B.: Random survival forests for R. R News 7(2), 25–31 (2007)
17. Ishwaran, H., Kogalur, U.B.: Random forests for survival, regression, and classification (RSFSRC) (2018). https://cran.r-project.org/web/packages/randomForestSRC/randomForestSRC.pdf. Accessed 5 July 2018
18. Ishwaran, H., Kogalur, U.B., Blackstone, E.H., Lauer, M.S.: Random survival forests. Ann. Appl. Stat. 2(3), 841–860 (2008)
19. Jager, K.J., van Dijk, P.C., Zoccali, C., Dekker, F.W.: The analysis of survival data: the Kaplan-Meier method. Kidney Int. 74, 560–565 (2008)


20. Jarmulski, W., Wieczorkowska, A., Trzaska, M., Ciszek, M., Paczek, L.: Machine learning models for predicting patients survival after liver transplantation. Comput. Sci. 19(2). https://doi.org/10.7494/csci.2018.19.2.2746
21. Kartsonaki, C.: Survival analysis. Diagn. Histopathol. 22(7), 263–270 (2016)
22. Klein, J.P., Moeschberger, M.L.: Survival Analysis Techniques for Censored and Truncated Data. Springer, Berlin (1997)
23. Lacny, S., Wilson, T., Clement, F., Roberts, D.J., Faris, P., Ghali, W.A., Marshall, D.A.: Kaplan-Meier survival analysis overestimates cumulative incidence of health-related events in competing risk settings: a meta-analysis. J. Clin. Epidemiol. 93, 25–35 (2018)
24. Loprinzi, C.L., Laurie, J.A., Wieand, H.S., Krook, J.E., Novotny, P.J., Kugler, J.W., Bartel, J., Law, M., Bateman, M., Klatt, N.E.: Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. J. Clin. Oncol. 12(3), 601–607 (1994)
25. Malinchoc, M., Kamath, P.S., Gordon, F.D., Peine, C.J., Rank, J., ter Borg, P.C.J.: A model to predict poor survival in patients undergoing transjugular intrahepatic portosystemic shunts. Hepatology 31(4), 864–871 (2000)
26. Mayr, A., Schmid, M.: Boosting the concordance index for survival data - A unified framework to derive and evaluate biomarker combinations. PLoS ONE 9(1), e84483 (2014)
27. Raykar, V.C., Steck, H., Krishnapuram, B., Dehing-Oberije, C., Lambin, P.: On ranking in survival analysis: bounds on the concordance index. Adv. Neural Inf. Process. Syst. 20, 1209–1216 (2008)
28. Ridgeway, G.: Generalized boosted regression models (2018). https://cran.r-project.org/web/packages/gbm/gbm.pdf. Accessed 5 July 2018
29. The R project for statistical computing (2018). https://www.r-project.org/. Accessed 5 July 2018
30. Therneau, T.M.: Survival analysis (2018). https://cran.r-project.org/web/packages/survival/survival.pdf. Accessed 5 July 2018
31. Uno, H., Cai, T., Pencina, M.J., D'Agostino, R.B., Wei, L.J.: On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat. Med. 30(10), 1105–1117 (2011)
32. Zheng, L.-Y., Chang, Y.-T.: Risk assessment model of bottlenecks for urban expressways using survival analysis approach. Transp. Res. Procedia 25, 1544–1555 (2017)

Order-Preserving Biclustering Based on FCA and Pattern Structures Nyoman Juniarta, Miguel Couceiro and Amedeo Napoli

Abstract Biclustering is similar to formal concept analysis (FCA), whose objective is to retrieve all maximal rectangles in a binary matrix and arrange them in a concept lattice. FCA is generalized to more complex data using pattern structures. In this article, we explore the relation between biclustering and pattern structures. More precisely, we study order-preserving biclusters, whose rows induce the same linear order across all columns.

1 Introduction

CrossCult (http://www.crosscult.eu) is a European project whose idea is to support the emergence of a European cultural heritage by allowing visitors at different cultural sites (e.g. museum, historic city, archaeological site) to improve the quality of their visit by using adapted computer-based devices and to consider the visit at a European level. Such improvement can be accomplished by studying, among others, the possibility to build a dynamic recommendation system. This system should be able to produce a relevant suggestion on which part of a cultural site may be interesting for a specific user/visitor. Given U as the set of previous users and I as the set of items, one approach for producing such a suggestion is to retrieve a set of similar users in U. There are different techniques for calculating the similarity between any two users. From a dataset of item ratings, we can interpret u1 ∈ U as similar to u2 ∈ U if they have a similar order of preference. For example, suppose that rating(ux, iy) is the rating given by user ux ∈ U to item iy ∈ I.


We can consider that u1 is similar to u2 if rating(u1, i1) > rating(u1, i2) and rating(u2, i1) > rating(u2, i2) (i.e. both of them prefer i1 over i2). Furthermore, their similarity is stronger if they give a similar rating order over a larger set of items, e.g. if rating(u1, i1) > rating(u1, i2) > · · · > rating(u1, i10) and rating(u2, i1) > rating(u2, i2) > · · · > rating(u2, i10). The task of finding a set of similar users based on this similarity can be regarded as the task of finding order-preserving (OP) biclusters in a numerical matrix.

In this article we explore two approaches for finding OP biclusters. The first is based on partition pattern structures, while the second is an application of sequence pattern structures. Both approaches originate from formal concept analysis (FCA), since it has an objective related to biclustering: finding submatrices whose cells have "similar" values. By using these approaches, we can theoretically enumerate all possible OP biclusters in a numerical matrix, and arrange them in a hierarchical structure.

The OP bicluster was first defined by Ben-Dor et al. [2] to find highly statistically significant order-preserving submatrices (OPSMs) in a gene expression dataset. The dataset is represented as a numerical matrix with genes as rows, experiments as columns, where each cell represents the expression level of a given gene under a given experiment. In such a matrix, an OPSM corresponds to a subset of genes that induce the same linear ordering of a subset of experiments. The OP bicluster discovery problem is also a special case of sequential pattern mining [8]. A row in an m × n matrix can be regarded as a sequence of n items, and the whole matrix corresponds to a dataset of m sequences. This sequence dataset is extremely dense, since all possible items are present in each sequence (except for missing values). Furthermore, many existing sequential pattern mining methods try to find patterns that satisfy a minimum support threshold [10, 12–14]. This means that longer sequences are difficult to recover, since shorter sequences (hence smaller OPSMs) dominate the results. This problem has been studied in [3] by proposing "rare" sequential pattern mining. We will also show that it can be used to obtain OP biclusters from a numerical matrix.

This article is organized as follows. First, we explain the background and examples of biclustering in Sect. 2. Section 3 describes the basic definitions of FCA and pattern structures. Our first approach is explained in Sect. 4, while the second approach is in Sect. 5. Some results of our experiments are described in Sect. 6. In the end, we discuss our conclusions and some future works in Sect. 7.

2 Order-Preserving Biclusters

In this section, we recall the basic background and discuss illustrative examples of biclustering, especially order-preserving biclusters, as described in [9]. We consider a dataset composed of a set of objects G, each of which has values over a set of attributes M. Therefore, in a numerical dataset (G, M), the value of attribute m for object g is written as m(g).

Table 1 Examples of a constant-value, b constant-columns, and c order-preserving (OP) biclusters

(a)
4 4 4 4
4 4 4 4
4 4 4 4
4 4 4 4

(b)
4 2 5 3
4 2 5 3
4 2 5 3
4 2 5 3

(c)
1 2 4 3
3 5 7 6
2 3 8 4
4 5 9 8

One may be interested in finding which subset of objects possesses the same values w.r.t. a subset of attributes. Regarding the matrix representation, this is equivalent to the problem of finding a submatrix that has a constant value over all of its elements (example in Table 1a). This task is called biclustering with constant values, which is a simultaneous clustering of the rows and columns of a matrix. Beyond constant values, biclustering approaches also focus on finding other types of submatrices, as shown in Table 1. A bicluster with constant columns is a submatrix where each column has the same value, illustrated in Table 1b. In this article, we are dealing with OP biclusters. In this type, each row induces the same linear order across all columns, as stated in Definition 1.

Definition 1 Given a dataset (G, M), a pair (A, B) (where A ⊆ G, B ⊆ M) is an order-preserving (OP) bicluster iff ∀g ∈ A, there exists a sequence of columns (m1 · · · m|B|) in B such that mi(g) < mi+1(g) for all 1 ≤ i ≤ |B| − 1.

For example, in the bicluster in Table 1c, each row follows column1 < column2 < column4 < column3.

Biclustering shares many common elements with formal concept analysis (FCA). In FCA, from a binary matrix we try to find a maximal submatrix whose elements are 1. In other words, the objective of FCA is to identify maximal constant-value biclusters (but only for biclusters whose values are 1). Hence, a formal concept can be considered as a bicluster of objects and attributes. Furthermore, formal concepts are arranged in a concept lattice, which materializes the hierarchical relation among all biclusters. In the following section, we will recall some basic definitions of FCA and its generalization to pattern structures.
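Definition 1 can be checked mechanically. The following R sketch (ours, not part of the original method) tests whether rows A and columns B of a numeric matrix form an OP bicluster, and replays the check on the matrix of Table 1c.

# Hedged sketch: (A, B) is an OP bicluster of X iff every row of X[A, B] sorts its columns the same way.
is_op_bicluster <- function(X, A, B) {
  orders <- apply(X[A, B, drop = FALSE], 1, function(r) paste(order(r), collapse = "|"))
  length(unique(orders)) == 1
}
Tc <- matrix(c(1, 2, 4, 3,
               3, 5, 7, 6,
               2, 3, 8, 4,
               4, 5, 9, 8), nrow = 4, byrow = TRUE)   # Table 1c
is_op_bicluster(Tc, 1:4, 1:4)   # TRUE: every row follows column1 < column2 < column4 < column3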


3 FCA and Pattern Structures

Formal concept analysis is a mathematical framework based on lattice theory and used for classification, data analysis, and knowledge discovery [6]. From a formal context, FCA detects all formal concepts and arranges them in a concept lattice. A formal context is a triple (G, M, I), where G is a set of objects, M is a set of attributes, and I is a binary relation between G and M, i.e. I ⊆ G × M. If an object g has an attribute m, then (g, m) ∈ I. An example of a formal context is shown in Table 2, where × indicates the corresponding (g, m) ∈ I.

Between the sets G and M, there is a Galois connection that consists of two functions: 2^G → 2^M and 2^M → 2^G. For a subset of objects A ⊆ G, A′ is the set of attributes that are possessed by all objects in A, i.e.:

A′ = {m ∈ M | ∀g ∈ A, (g, m) ∈ I},   A ⊆ G.

Dually, for a subset of attributes B ⊆ M, B′ is the set of objects that have all attributes in B, i.e.:

B′ = {g ∈ G | ∀m ∈ B, (g, m) ∈ I},   B ⊆ M.

A formal concept is a pair (A, B), where A ⊆ G and B ⊆ M, such that A′ = B and B′ = A. For example, {g2, g3}′ = {m1, m3}, and {m1, m3}′ = {g2, g3} in Table 2. They form a formal concept ({g2, g3}, {m1, m3}). A formal concept (A, B) is a subconcept of (C, D), denoted by (A, B) ≤ (C, D), if A ⊆ C (or equivalently D ⊆ B). A concept lattice can be formed using the ≤ relation, which defines the partial order among concepts. If (A, B) < (C, D) and there is no (E, F) such that (A, B) < (E, F) < (C, D), then (A, B) is a lower neighbor of (C, D). For the context in Table 2, the formal concepts and their corresponding lattice are shown in Fig. 1.

FCA is restricted to specific datasets where each attribute is binary (e.g. has only a yes/no value). For more complex values (e.g. numbers, strings, trees, graphs, ...), FCA is generalized into pattern structures [5]. A pattern structure is a triple (G, (D, ⊓), δ), where G is a set of objects, (D, ⊓) is a complete meet-semilattice of descriptions, and δ : G → D maps an object to a description. The operator ⊓ is a similarity operation that returns the common elements between any two descriptions. It implies that c ⊓ d = c ⇔ c ⊑ d. A description can be a number, set, sequence, tree, graph, or other complex structure.

Table 2 A formal context

     m1  m2  m3  m4
g1        ×       ×
g2    ×       ×
g3    ×   ×   ×
g4        ×       ×


Fig. 1 Concept lattice for the formal context in Table 2

In standard FCA (with sets as descriptions), ⊓ corresponds to set intersection (∩), i.e. {a, b, c} ⊓ {a, b, d} = {a, b}, and ⊑ corresponds to subset inclusion (⊆). In the case of sequences as descriptions, ⊓ can be a set of common closed subsequences (SCCS) [3]. Similarly, ⊑ then corresponds to subsequence inclusion (⪯). The Galois connection for a pattern structure (G, (D, ⊓), δ) is defined as:

A□ = ⊓_{g∈A} δ(g),   A ⊆ G,
d□ = {g ∈ G | d ⊑ δ(g)},   d ∈ D.

A pattern concept, similar to a standard formal concept, is a pair (A, d), A ⊆ G and d ∈ D, where A□ = d and d□ = A. Table 2 can be regarded as a pattern structure with G = {g1 · · · g4}, where the description of each object is a set of attributes. For example, δ(g2) = {m1, m3} and δ(g3) = {m1, m2, m3}. Their similarity is a set intersection, i.e. given A = {g2, g3}, A□ = δ(g2) ⊓ δ(g3) = {m1, m3}. Furthermore, with d = {m1, m3}, we have d ⊑ δ(g2) and d ⊑ δ(g3). Therefore d□ = {g2, g3} and the pair ({g2, g3}, {m1, m3}) is a pattern concept. The set of all pattern concepts also forms a lattice. When applying pattern structures to the task of finding biclusters, the lattice can be used to obtain the hierarchical structure of a set of biclusters. In the following sections, we will describe two approaches for finding OP biclusters: based on partition pattern structures (Sect. 4) and based on sequence pattern structures (Sect. 5).

4 Finding Biclusters Using Partition Pattern Structure

In this section, we first recall the partition pattern structure (pps) detailed in [4]. In pps, the pattern structure is (M, (D, ⊓), δ), where attributes are considered as objects, as explained in Sect. 4.1. The description of each attribute m ∈ M is a partition of objects according to the values on the given attribute. We will then propose an extension of this approach to perform OP biclustering in Sect. 4.2.

Table 3 A dataset with 4 objects and 5 attributes

     m1  m2  m3  m4  m5
g1    1   2   3   1   7
g2    1   2   4   2   7
g3    2   5   4   5   3
g4    2   5   4   5   7

4.1 Partition Pattern Structure

As explained in [1, 4], a partition d of a set G is a collection of subsets of G (d = {pi}, pi ⊆ G), such that:

⋃_{pi∈d} pi = G and pi ∩ pj = ∅ whenever i ≠ j.   (1)

The set of all partitions is denoted as D, and it will become the description of attributes in the pattern structure (M, (D, ⊓), δ). The function δ maps an attribute to a partition of objects, i.e. δ : M → D. The whole set of objects can be partitioned (respecting Eq. 1) according to the values of an attribute. To do that, we need an equivalence relation (reflexive, symmetric, and transitive) among objects. The equivalence relation [gi]mj of an object gi w.r.t. an attribute mj is:

[gi]mj = {gk ∈ G | mj(gi) = mj(gk)}.   (2)

Given an attribute mj, the relation [gi]mj splits G into equivalence classes. It satisfies Eq. 1, such that the classes cover the whole set of objects and there is no intersection between any two different classes. These classes can be regarded as partition elements. Therefore a partition mapping is defined as δ : M → D, such that:

δ(mj) = {[gi]mj | gi ∈ G}.   (3)

For example, from Table 3:
[g1]m1 = [g2]m1 = {g1, g2}
[g3]m1 = [g4]m1 = {g3, g4}
δ(m1) = {[g1]m1, [g2]m1, [g3]m1, [g4]m1} = {{g1, g2}, {g3, g4}}.

Table 4 A dataset with 5 objects and 5 attributes

     m1  m2  m3  m4  m5
g1    1   2   3   4   5
g2    4   2   1   5   3
g3    2   3   4   1   5
g4    5   4   2   3   1
g5    2   1   5   4   3

The meet and join of two partitions d1 = {pi} and d2 = {pj} are defined as:

d1 ⊓ d2 = {pi ∩ pj | pi ∈ d1 and pj ∈ d2}   (4)
d1 ⊔ d2 = ({pi ∪ pj | pi ∈ d1 and pj ∈ d2 and pi ∩ pj ≠ ∅})+   (5)

where (.)+ is a closure that preserves only the maximal components in d. For example, given δ(m1) = {{g1, g2}, {g3, g4}} and δ(m4) = {{g1}, {g2}, {g3, g4}}, we have δ(m1) ⊓ δ(m4) = {{g1}, {g2}, {g3, g4}}, and δ(m1) ⊔ δ(m4) = {{g1, g2}, {g3, g4}}. The order between any two partitions is given by the subsumption relation:

d1 ⊑ d2 ⟺ d1 ⊓ d2 = d1.   (6)

Therefore, δ(m4) ⊑ δ(m1) since δ(m1) ⊓ δ(m4) = δ(m4). Given a set of attributes M, a set of partitions D, and a mapping δ, a partition pattern structure is determined by the triple (M, (D, ⊓), δ). Notice that in this triple, M is regarded as the set of objects, where the description of an object is a partition. A pair (A, d) is then called a partition pattern concept (pp-concept) iff A□ = d and d□ = A, where:

A□ = ⊓_{m∈A} δ(m),   A ⊆ M   (7)
d□ = {m ∈ M | d ⊑ δ(m)},   d ∈ D.   (8)

For example, given A = {m3, m5} in Table 3, we get A□ = δ(m3) ⊓ δ(m5) = {{g1}, {g3}, {g2, g4}}. Dually, given d = {{g1}, {g3}, {g2, g4}}, we get d□ = {m3, m5} since d ⊑ δ(m3) and d ⊑ δ(m5). Therefore, ({m3, m5}, {{g1}, {g3}, {g2, g4}}) is a pp-concept. All pp-concepts from Table 3 are hierarchically illustrated as a lattice in Fig. 2.
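The partition meet of Eq. 4 and the derivation of Eq. 8 can be replayed on Table 3 with the following R sketch (ours; partitions are stored as lists of character vectors of object names).

# Hedged sketch: delta, the partition meet, partition subsumption and the derivation on Table 3.
tab3 <- data.frame(m1 = c(1, 1, 2, 2), m2 = c(2, 2, 5, 5), m3 = c(3, 4, 4, 4),
                   m4 = c(1, 2, 5, 5), m5 = c(7, 7, 3, 7),
                   row.names = c("g1", "g2", "g3", "g4"))
delta <- function(m) unname(split(rownames(tab3), tab3[[m]]))      # partition induced by attribute m
meet <- function(d1, d2) {                                         # Eq. 4: pairwise intersections (empty blocks dropped)
  blocks <- unlist(lapply(d1, function(p) lapply(d2, function(q) intersect(p, q))), recursive = FALSE)
  Filter(function(b) length(b) > 0, blocks)
}
leq <- function(d1, d2)                                            # d1 subsumed by d2: each block of d1 fits in a block of d2
  all(sapply(d1, function(p) any(sapply(d2, function(q) all(p %in% q)))))
dbox <- function(d) names(tab3)[sapply(names(tab3), function(m) leq(d, delta(m)))]   # Eq. 8
d <- meet(delta("m3"), delta("m5"))   # {{g1}, {g3}, {g2, g4}}
dbox(d)                               # "m3" "m5"  -> ({m3, m5}, d) is a pp-concept, as in the example above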

4.2 OP Biclustering Using Partition

In this subsection, we will explain the possible application of partition pattern structure to discover OP biclusters as defined in Definition 1.


Fig. 2 Partition pattern lattice for Table 3

Consider the dataset given by Table 4, with the set of objects G = {g1, g2, g3, g4, g5}. For the task of finding OP biclusters, we introduce the notation ryx for a pair of attributes mx and my, x < y, and R is the set of all possible ryx. That is, from n attributes, there will be C(n, 2) = n(n − 1)/2 pairs. Using the same definition of partition as in Eq. 1, we define a new partition mapping δ : R → D, such that:

δ(ryx) = {[gi]ryx | gi ∈ G}.   (9)

Here [gi]ryx is the equivalence relation of an object w.r.t. a pair of attributes:

[gi]ryx = {gk ∈ G | argmax_{j∈{x,y}} mj(gi) = argmax_{j∈{x,y}} mj(gk)}.   (10)

For example, from Table 4:
[g1]r21 = [g3]r21 = {g1, g3}
[g2]r21 = [g4]r21 = [g5]r21 = {g2, g4, g5}.

In other words, δ maps a pair of attributes to a partition according to the pair's comparison. For example, δ(r21) = {{g1, g3}, {g2, g4, g5}} because m2 > m1 for g1 and g3, and m1 > m2 for g2, g4, and g5. Some pairs and their partitions are listed in Table 5. Given a set of attribute pairs R, a set of partitions D, and the mapping function δ, a partition pattern structure for finding OP biclusters is determined by the triple (R, (D, ⊓), δ). A concept is a pair (B, d) such that B□ = d and d□ = B, where:

Table 5 Some examples of partitions over Table 4

Pair   Partition
r21    {{g1, g3}, {g2, g4, g5}}
r31    {{g1, g3, g5}, {g2, g4}}
r41    {{g1, g2, g5}, {g3, g4}}
r32    {{g1, g3, g5}, {g2, g4}}
r52    {{g1, g2, g3, g5}, {g4}}

B□ = ⊓_{r∈B} δ(r),   B ⊆ R   (11)
d□ = {r ∈ R | d ⊑ δ(r)},   d ∈ D.   (12)

The meet and subsumption relations between two partitions follow Eqs. 4 and 6 respectively. Here, the extent of a pp-concept is a set of attribute pairs. We can obtain an OP bicluster in a concept if there is a "clique" among the attributes in the pairs, as described in Definition 2.

Definition 2 Consider a concept (B, d). There may exist a set of attributes C ⊆ M such that for all m, n ∈ C, there exists rnm ∈ B or rmn ∈ B. The set C is called a clique in B.

For example, consider the concept pc1 with extent {r21, r31, r51, r32, r54} and intent {{g1, g3}, {g5}, {g2, g4}}. Its extent forms a clique among m1, m2, and m3, since all pairs of any two of those attributes are included. If a concept (B, d) contains a set of attributes C ⊆ M that forms a clique, then each pair in {(p, C) | p ∈ d} corresponds to an OP bicluster. For example, from pc1, we can obtain three OP biclusters:
• ({g1, g3}, {m1, m2, m3}), which follows m1 < m2 < m3;
• ({g2, g4}, {m1, m2, m3}), which follows m3 < m2 < m1; and
• ({g5}, {m1, m2, m3}), which follows m2 < m1 < m3.
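The clique test of Definition 2 amounts to checking that every unordered pair from C occurs in the extent. A minimal R sketch (ours), run on the extent of pc1:

# Hedged sketch: attribute pairs are encoded as integer vectors, e.g. c(2, 1) for r21.
is_clique <- function(pairs, C) {
  needed <- combn(sort(C), 2, simplify = FALSE)   # every unordered pair from C
  all(sapply(needed, function(p) any(sapply(pairs, function(q) setequal(p, q)))))
}
pc1_extent <- list(c(2, 1), c(3, 1), c(5, 1), c(3, 2), c(5, 4))   # r21, r31, r51, r32, r54
is_clique(pc1_extent, c(1, 2, 3))   # TRUE: r21, r31 and r32 are all present
is_clique(pc1_extent, c(1, 4, 5))   # FALSE: r41 is missing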

5 Finding Biclusters Using Sequence Pattern Structure

In this section, we will recall the characterization of sequential pattern mining using sequence pattern structure as proposed in [3]. Then, we show how the problem of finding OP biclusters can be solved using sequence pattern structure.

Table 6 An example of sequence database

s    Sequence
s1   ab|acd
s2   a|d|c
s3   ac|d|b

5.1 Sequence Pattern Structure

A sequence is an ordered list s1 s2 . . . sm, where each si is an itemset {i1, . . . , in}. The list s1 = {a, b}{a, c, d} is a sequence of 2 itemsets. For simplicity, we follow the bar notation in [3], such that two consecutive itemsets are separated by a bar. Therefore, the sequence from the previous example can be written as ab|acd. A sequence s = s1|s2| . . . |sm is a subsequence of t = t1|t2| . . . |tn, denoted by s ⪯ t, if there exist indices 1 ≤ i1 < i2 < · · · < im ≤ n such that sj ⊆ t_{ij} for all j = 1 . . . m and m ≤ n. For example, the sequence a|d is a subsequence of ab|acd, while the sequence c|d is not.

Given a sequence database S (example in Table 6), the set of all subsequences of elements in S is denoted as Q, where q ∈ Q ⟺ ∃s ∈ S s.t. q ⪯ s (hence S ⊆ Q). The objective of frequent sequential pattern mining is to retrieve all q ∈ Q whose support is larger than a threshold. The support of a sequence q w.r.t. S, denoted as σ(q), is the number of sequences in S which have q as a subsequence, or σ(q) = |{s ∈ S | q ⪯ s}|, q ∈ Q. In Table 6, given q1 = a|c, the support of q1 is 2, since q1 ⪯ s1 and q1 ⪯ s2.

Sequential pattern mining is defined with respect to a particular sequence pattern structure, and it can retrieve "rare" sequences (small support but having a larger number of itemsets) [3]. This pattern structure relies on the notion of closed sequences. From a set of sequences d, its set of closed sequences is:

d+ = {qi ∈ d | ∄ qj ∈ d s.t. qi ≺ qj}.   (13)

In other words, d+ contains all sequences in d which are not subsequences of another sequence in d. Then, the sequence pattern structure is defined as a triple (S, (D, ⊓), δ), where δ : Q → D is a mapping from a sequence q to its description δ(q). Here, a description is a set of closed sequences. As an example, δ(s1) = {ab|acd} in Table 6. The similarity operator between any two descriptions d1 and d2 is the similarity between each sequence in d1 and each sequence in d2, or:

d1 ⊓ d2 = { ⋃_{qi∈d1, qj∈d2} qi ∧ qj }+.   (14)

The ∧ operator returns the set of closed sequences that are subsequences of both qi and qj, or:

qi ∧ qj = {q ∈ Q | q ⪯ qi and q ⪯ qj}+.   (15)

For example, suppose that we have d1 = {a|c|d, a|b|c} and d2 = {a|d|c, a|b|c|d}. Each of these descriptions has two sequences. Their similarity is:

d1 ⊓ d2 = {a|c|d ∧ a|d|c, a|c|d ∧ a|b|c|d, a|b|c ∧ a|d|c, a|b|c ∧ a|b|c|d}+
        = {a|c, a|d, a|c|d, a|c, a|b|c}+
        = {a|c|d, a|b|c}.

The space of descriptions is denoted by D, and is a partially ordered set of descriptions. A description d1 ∈ D can be subsumed by (⊑) d2 ∈ D, iff:

d1 ⊑ d2 ⟺ ∀qi ∈ d1 : ∃qj ∈ d2 s.t. qi ⪯ qj.   (16)

This means that d1 is subsumed by d2 iff every sequence in d1 is a subsequence of at least one sequence in d2. With T ⊆ S and d ∈ D, a pair (T, d) is a sequence pattern concept iff T□ = d and d□ = T, where:

T□ = ⊓_{s∈T} δ(s)   (17)
d□ = {s ∈ S | d ⊑ δ(s)}.   (18)

As an example, ({s1, s3}, {a|d, b, ac}) is a sequence pattern concept from Table 6. All concepts from this table are shown in Fig. 3.
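The subsequence relation and the support σ(q) are easy to sketch in R (our own encoding, with a sequence stored as a list of itemsets); the example reproduces the support of q1 = a|c over Table 6.

# Hedged sketch: greedy left-to-right embedding of s into t (each itemset of s must be contained in a later itemset of t).
is_subseq <- function(s, t) {
  i <- 1
  for (itemset in t) {
    if (i <= length(s) && all(s[[i]] %in% itemset)) i <- i + 1
  }
  i > length(s)
}
S <- list(s1 = list(c("a", "b"), c("a", "c", "d")),
          s2 = list("a", "d", "c"),
          s3 = list(c("a", "c"), "d", "b"))
support <- function(q) sum(sapply(S, function(t) is_subseq(q, t)))
support(list("a", "c"))   # 2: a|c is a subsequence of s1 and s2, but not of s3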

Fig. 3 Sequence pattern lattice for Table 6


5.2 OP Biclustering Using Sequence

The task of finding OP biclusters can be formulated as sequential pattern mining, and can be solved using sequence pattern structure. Consider again the dataset in Table 4. It can be regarded as a sequence database. Each g ∈ {g1, . . . , g5} is a sequence of five items m ∈ {m1, . . . , m5}. The numbers in the matrix determine the order of each item in a given sequence. For example, the object g2 has m3 < m2 < m5 < m1 < m4, so the sequence of g2 is m3|m2|m5|m1|m4, and consequently δ(g2) = {m3|m2|m5|m1|m4}.

A sequence pattern concept (T, d) has a set of objects (or sequences) as extent and a set of common subsequences as intent. Given a numerical table like Table 4, the intent is a set of sequences of attributes. A sequence m1| · · · |mn in the intent signifies that m1 < · · · < mn for all objects in the extent. As a result, from any sequence pattern concept (T, d), any pair of T and each sequence in d forms an OP bicluster. For example, consider the concept ({g2, g4, g5}, {m3, m2|m1, m5|m4}) from Table 4. Its intent has three sequences, so we have three OP biclusters:
• ({g2, g4, g5}, {m3}),
• ({g2, g4, g5}, {m1, m2}), and
• ({g2, g4, g5}, {m4, m5}).
Therefore, we can obtain OP biclusters in a numerical matrix using sequence pattern structure.
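The conversion of a numeric row into its attribute sequence is a one-liner in R; the sketch below (ours) reproduces the sequence of g2.

# Hedged sketch: attributes ordered by increasing value, joined with the bar notation.
row_to_sequence <- function(values, attrs = names(values)) {
  paste(attrs[order(values)], collapse = "|")
}
row_to_sequence(c(m1 = 4, m2 = 2, m3 = 1, m4 = 5, m5 = 3))   # "m3|m2|m5|m1|m4"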

6 Experiment

In this section, we first compare the runtime of the two methods for finding OP biclusters: partition pattern structure (pps) and sequence pattern structure (sps), both using the AddIntent algorithm to generate all concepts and the corresponding lattice. Randomly generated matrices are used, where the value of each cell is between 1 and 100 following a uniform distribution. We inspect the effect of the number of attributes on both methods, and we choose a small number of objects (10). The comparison of runtimes is shown in Fig. 4. It shows that the execution time of partition-based mining of OP biclusters grows faster than that of sequence-based mining. Using partition pattern structure, from an m × n numerical matrix, a new m × C(n, 2) matrix is generated to compare every pair of columns. The partition is then performed on the new matrix. In the second approach, an m × n matrix is converted to a set of m sequences, each of which has n items. The first approach is more complex than the second, since the first creates a larger matrix before applying partition pattern structure.

We tested the pps-based approach on the breast cancer dataset [7]. This dataset has 3226 rows (genes) and 21 columns (tissues). As shown in Table 7, these 21 tissues comprise 7 brca1 mutations, 8 brca2 mutations, and 6 sporadic breast cancers.



Fig. 4 Comparison of partition pattern structure (pps) and sequence pattern structure (sps) in the task of finding OP biclusters in matrices with 10 rows and varying number of columns (x-axis: number of attributes; y-axis: execution time in seconds)

Table 7 The columns in the breast cancer dataset in [7]

Column  Type      Column  Type      Column  Type
c1      brca1     c8      brca2     c15     sporadic
c2      brca1     c9      brca2     c16     sporadic
c3      brca1     c10     brca2     c17     brca1
c4      brca1     c11     sporadic  c18     brca2
c5      brca1     c12     sporadic  c19     brca2
c6      brca1     c13     sporadic  c20     brca2
c7      brca2     c14     sporadic  c21     brca2

Since it has 21 columns, the pps-based approach will convert the dataset into a 3226 × C(21, 2) = 3226 × 210 matrix. In order to reduce the computational complexity, we introduce a parameter θ. In the calculation of the intent of any partition pattern concept, any partition component having fewer than θ elements is discarded. For example, with θ = 3, the partition {{a, b, c}, {d, e}} becomes {{a, b, c}}. The number of concepts can be very large, so we provide the runtime until 10K concepts are obtained. The result of the experiment with varying θ is shown in Fig. 5. Here we see that, in general, a larger θ means more time to obtain 10K concepts. However, this does not necessarily mean that a larger θ needs more time to finish the computation of the whole lattice: a smaller θ implies more concepts, hence the first 10K concepts are obtained faster. It can also be noted that for θ > 120 (not shown in the figure), there are fewer than 10K concepts in the whole lattice.
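The θ filter itself is a simple operation on a partition; a minimal R sketch (ours), reproducing the example above:

# Hedged sketch: drop every partition block with fewer than theta objects.
filter_theta <- function(d, theta) Filter(function(p) length(p) >= theta, d)
filter_theta(list(c("a", "b", "c"), c("d", "e")), theta = 3)   # keeps only {a, b, c}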

Fig. 5 Runtime of the pps-based approach with AddIntent and varying θ in obtaining 10K concepts, applied to the breast cancer dataset (x-axis: θ, from 0 to 120; y-axis: execution time in seconds). Any partition component having fewer than θ elements is discarded

Fig. 6 Runtime of the sps-based approach with AddIntent and varying minimum sequence length, applied to the breast cancer dataset (x-axis: minimum length, from 15 to 21; y-axis: execution time in minutes). Any sequence shorter than this threshold is discarded

We also tested the sps-based approach on the same dataset. Contrary to the previous experiment, where we took only the first 10K concepts, here we calculate the runtime for the computation of the whole lattice. To reduce the computational time, we introduced a further parameter, the minimal length of any sequence. In calculating the intent of any sequence pattern concept, any sequence whose length is below this threshold is discarded. For example, with a threshold of 3, an intent {a|b|c, a|d} becomes {a|b|c}. The result of this experiment is shown in Fig. 6. With a larger threshold, the computational time is reduced, since we will have fewer concepts. This is useful compared to the majority of existing sequential pattern miners: to obtain sequential patterns of greater length, they usually need to lower the minimum support parameter, which results in more patterns and consequently a larger computational time.

Concerning the biclusters, we found that a bicluster of size 15 × 2 is the widest (having the most columns) bicluster with more than 1 row.


It follows the sequence of columns c19 < c20 < c1 < c17 < c12 < c21 < c11 < c9 < c16 < c14 < c10 < c7 < c13 < c5 < c3, which is present in genes #24638 and #291057. Regarding biclusters covering more than 2 rows, the widest biclusters are of size 10 × 3. They are statistically significant because, in a random 3226 × 21 matrix, we can expect fewer than one row (as 3226/10! < 1) to exhibit a given ordering of 10 columns.
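The significance argument can be checked directly in R:

3226 / factorial(10)   # about 0.00089, i.e. well below one expected row for a fixed ordering of 10 columns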

7 Conclusion

In this article, we propose two approaches for finding OP biclusters. The first approach is based on partition pattern structures, starting from the idea that, given two attributes mi and mj, the set of all objects G can be separated into two groups: those with mi < mj and those with mi > mj. In this article, we do not yet consider equality (the case with mi = mj). Then, given that the task of finding OP biclusters is similar to sequential pattern mining, the second approach is based on sequence pattern structures. This allows us to define a threshold, such that we can mine OP biclusters whose number of columns is larger than the threshold.

In general, both approaches generate a set of overlapping OP biclusters, a characteristic similar to FCA, where a set of overlapping submatrices is obtained. Furthermore, both the set of partition pattern concepts and the set of sequence pattern concepts are partially ordered and form a lattice. This ordering can be studied to obtain the hierarchical structure of the generated OP biclusters. The hierarchical and overlapping characteristics are notably similar to the HOCCLUS2 method in [11], although with a different bicluster type. One of the differences between our method and HOCCLUS2 is that we obtain hierarchical and overlapping biclusters directly from a matrix, while HOCCLUS2 finds a set of non-overlapping biclusters from a matrix, and then identifies overlapping and hierarchically organized biclusters.

The OP biclusters can be further studied to build a collaborative recommendation system. One example is when we have a user-item matrix, where each cell shows the rating of an item given by a user. An OP bicluster from this matrix can be regarded as a set of users having a similar order of preference over a set of items.

References
1. Baixeries, J., Kaytoue, M., Napoli, A.: Characterizing functional dependencies in formal concept analysis with pattern structures. Ann. Math. Artif. Intell. 72, 129–149 (2014)
2. Ben-Dor, A., Chor, B., Karp, R., Yakhini, Z.: Discovering local structure in gene expression data: the order-preserving submatrix problem. J. Comput. Biol. 10(3–4), 373–384 (2003)
3. Codocedo, V., Bosc, G., Kaytoue, M., Boulicaut, J.F., Napoli, A.: A proposition for sequence mining using pattern structures. In: International Conference on Formal Concept Analysis, pp. 106–121. Springer (2017)


4. Codocedo, V., Napoli, A.: Lattice-based biclustering using partition pattern structures. In: Proceedings of the Twenty-first European Conference on Artificial Intelligence, pp. 213–218. IOS Press (2014)
5. Ganter, B., Kuznetsov, S.O.: Pattern structures and their projections. In: International Conference on Conceptual Structures, pp. 129–142. Springer (2001)
6. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations (1999)
7. Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Raffeld, M., et al.: Gene-expression profiles in hereditary breast cancer. N. Engl. J. Med. 344(8), 539–548 (2001)
8. Henriques, R., Madeira, S.C.: BicSPAM: flexible biclustering using sequential patterns. BMC Bioinform. 15(1), 130 (2014)
9. Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. Comput. Biol. Bioinform. (TCBB) 1(1), 24–45 (2004)
10. Pei, J., Han, J., Mortazavi-Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U., Hsu, M.C.: Mining sequential patterns by pattern-growth: the prefixspan approach. IEEE Trans. Knowl. Data Eng. 16(11), 1424–1440 (2004)
11. Pio, G., Ceci, M., D'Elia, D., Loglisci, C., Malerba, D.: A novel biclustering algorithm for the discovery of meaningful biological correlations between micrornas and their target genes. BMC Bioinform. 14(7), S8 (2013)
12. Wang, J., Han, J.: BIDE: efficient mining of frequent closed sequences. In: Proceedings of 20th International Conference on Data Engineering, pp. 79–90. IEEE (2004)
13. Yan, X., Han, J., Afshar, R.: CloSpan: mining closed sequential patterns in large datasets. In: Proceedings of the 2003 SIAM International Conference on Data Mining, pp. 166–177. SIAM (2003)
14. Zaki, M.J.: Spade: an efficient algorithm for mining frequent sequences. Mach. Learn. 42(1–2), 31–60 (2001)

A Text-Based Regression Approach to Predict Bug-Fix Time Pasquale Ardimento, Nicola Boffoli and Costantino Mele

Abstract Predicting bug-fixing time can help project managers to select the adequate resources in the bug assignment activity. In this work, we tackle the problem of predicting the bug-fixing time by a multiple regression analysis using as predictor variables the textual information extracted from the bug reports. Our model selects all and only the features useful for prediction, also using statistical procedures such as Principal Component Analysis (PCA). To validate our model, we performed an empirical investigation using the bug reports of four well-known open source projects whose bugs are stored in Bugzilla installations, where Bugzilla is an online open-source Bug Tracking System (BTS). For each project, we built a regression model using the M5P model tree, Support Vector Machine (SVM) and Random Forests algorithms. Experimental results show that the model is effective; indeed, its results are slightly better than all the ones known in the literature. In the future, we will use and compare other regression approaches to select the best one for a specific data set.

1 Introduction

In software maintenance, "a critical activity, which consumes the majority of the effort spent within the lifetime of a software system" [1], a significant amount of time is spent investigating software bugs [2]. Generally, large-scale software projects use a Bug Tracking System (BTS) to report and manage software bugs. BTS management is relied on by team members, who can be developers and test engineers, and who have to fix bugs in the source code files. Each bug report must be triaged.


The triager, who usually is a senior developer, selects the appropriate developer to fix the newly submitted bug. However, due to the large number of bug reports submitted daily for large-scale software projects, accurate bug triage is hard to achieve manually. Furthermore, several studies demonstrate that bug assignment is error-prone and expensive, and that many times it is necessary to reassign a bug to another developer ("bug tossing"). In recent years, several researchers have analyzed bug-fixing time and its prediction. For example, Panjer [3] proposed to use classification techniques such as 0-R, 1-R, Decision Tree, Naive Bayes and Logistic Regression to predict the time to fix a bug for the Eclipse project, obtaining an accuracy of 34.9%. In [4] Kim et al. studied the life span of bugs in the ArgoUML and PostgreSQL projects, and found that bug-fixing time had a median of about 200 days. Giger et al. [5] used Decision Trees to classify fast and slowly fixed bugs, studying the Eclipse, Mozilla, and Gnome projects. The above-mentioned works, focused on bug-fixing time for open source projects, show a real need to improve the prediction accuracy. The contribution of this paper is a regression model, obtained by modifying the model already proposed in [6, 7], useful to predict the bug-fixing time by treating this issue as a numerical regression problem. For this purpose, we extracted the information contained in the Bugzilla bug reports of the Mozilla [8], FreeDesktop [9], NetBeans [10] and Eclipse [11] projects, to create a database on which the machine learning (ML) algorithms train. The database chosen to host the extracted data is MongoDB [12], a non-relational database that can easily handle collections of JSON documents. The environment used to create the data set and to perform the regression analysis is R [13], an open source software for statistical analysis and ML. We evaluated our model using the M5P model tree, Random Forests and SVM algorithms, comparing the obtained results with those known in the literature. Here below, Sect. 2 presents the background, whereas Sect. 3 gives an overview of the literature on the subject. Section 4 describes the proposed model, and the results of the empirical investigation are presented in Sect. 5. Finally, Sect. 6 discusses the results and provides conclusions.

2 Background

Each bug reported in a BTS follows a life cycle: it starts when the bug is discovered and ends when the bug is closed, after ensuring it has been fixed. The bug life cycle can differ slightly depending on the BTS used. To select bugs useful for prediction and, at the same time, to build a model independent of the chosen BTS, we studied both the general bug life cycle and the Bugzilla bug life cycle. We selected Bugzilla as BTS basically for two reasons: first, it has a wide public installation base; on the Bugzilla official page there is a list, last updated on May 3rd, 2017, of 137 companies, organizations, and projects that run "public" Bugzilla installations. Second, since version 5.0, Bugzilla installations offer a native, well-documented REST API [14] as the preferred way to interface with Bugzilla from external apps.


Fig. 1 Life cycle of a bug in Bugzilla

Figure 1 shows the life cycle of a bug in Bugzilla, as represented in the Bugzilla official documentation, release 5.0.4, Sect. 2.4.4 [15], while Fig. 2 shows the general bug life cycle. General BTSs, as well as the Bugzilla BTS, allow users to report, track, describe, comment on and classify bug reports. A bug report is characterized by several predefined fields, such as the relevant product, version, operating system and self-reported incident severity, as well as important free-form text fields such as the bug title, called summary in Bugzilla, and the description. Moreover, users and developers add comments and submit attachments, which often take the form of patches, screenshots, test cases or anything else binary or too large to fit into a comment. When initially declared, a bug starts out in the unconfirmed pending state until a triager makes a first evaluation to see whether the bug report corresponds to a valid bug and whether the bug is not already known, i.e., whether the submitted bug report is a duplicate of another bug report already stored in the defect reporting system. Bug reports can pass through several different stages before finally being resolved. Bug reports that are closed receive one of the following statuses: duplicate, invalid, fixed, wontfix, or worksforme.


Fig. 2 General life cycle of a bug

These indicate why the report was closed; for example, worksforme and invalid both indicate that quality assurance was unable to reproduce the issue described in the report. Sometimes a bug report needs to be reopened, and when this happens the normal defect life cycle restarts with the status reopened. The reopened status represents the most important difference between the two life cycles because it is absent in Bugzilla. However, differently from what is shown in Fig. 1, which faithfully reproduces the image in the Bugzilla documentation, it is also possible to add a reopened status in Bugzilla. This operation can be done simply by adding a new status option (technically, selecting "add option for Adding a new status") for the field value of status. As a consequence, we decided to select only Bugzilla installations where the reopened status was added.
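Since version 5.0, the REST API mentioned above returns a bug report as JSON with a single HTTP call. The R sketch below is only an illustration: the server, the bug id and the inspected fields are placeholders, not the ones used in this study.

# Hedged sketch: GET /rest/bug/<id> on a public Bugzilla installation (the id is a placeholder).
library(jsonlite)
resp <- fromJSON("https://bugzilla.mozilla.org/rest/bug/1000000")
bug <- resp$bugs                 # the API wraps the result in a "bugs" array
bug$summary; bug$status; bug$resolution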

3 Related Work

According to our research, we focus on studies that propose models for predicting the overall time required for fixing bugs via classification and regression techniques. In 2007, Lucas D. Panjer [3] focused his research on the bug reports of the Eclipse project. He used machine learning algorithms such as 0-R, 1-R, decision trees, Naive Bayes and logistic regression.


He reported that his model is able to correctly predict 34.9% of the bugs. Despite the results obtained by the logistic regression, the experimentation shows a weakness in the classification phase; however, the results obtained are in line with those obtained from other experiments in the literature. In the same year, Hooimeijer et al. [16] applied linear regression on 27,000 bug reports from the Firefox project in an attempt to identify an optimal threshold value by which a bug report may be classified as either "convenient" or "expensive". Experiments have shown that if there are many comments or many attachments, it is very likely that the bug is classified as "expensive". The model was constructed using a statistical approach, as text categorization is computationally more burdensome than a linear model, but using techniques based on text categorization could result in a significant increase in performance compared to the model presented.

In 2009, Anbalagan et al. [17] performed their study on 72,482 bug reports from Ubuntu. The experimentation showed that there is a strong linear relationship between the time to fix a bug and the number of developers involved in the correction; linear regression was used to estimate the coefficients of the predictive model. The results of this study are not satisfactory, since it emerged that the predictive model is able to predict the time to correct a bug with about the same precision as the models already existing in the literature and at the same cost. In 2011, Bhattacharya et al. [18] trained a multiple regression model considering the severity of the bug, the number of attachments, the dependencies between the various bugs and the number of developers involved in the resolution process as independent variables. The results denote a low predictive power of the model. The results shown by these experiments should not surprise us, as the low predictive power of the models existing in the literature had been highlighted previously.

In 2016, Puranik et al. [19] developed a predictive model by selecting the minimal set of best-performing metrics used in the literature related to the bug prediction problem. To carry out the experiments, a data set already proposed in [20] was used. The model realized is based on multiple linear regression, considering as variables the optimal metrics selected by the authors, such as the number of bugs found up to that moment, the version number adopted at that time, the number of lines of code and the entropy. The results of this experimentation were not provided; however, the authors state that the proposed model behaves much better than the other two models considered, especially when the metrics used in the evaluation are calculated on the test set.

Finally, some researchers have applied Markov-based models. In 2018, Habayeb et al. [2] employed a hidden Markov model for predicting bug-fixing time based on the temporal sequence of developer activities. This approach considers the temporal sequences of developer activities rather than the frequencies of developer activities used by previous approaches in [3, 5, 16]. They performed an experiment on Firefox projects and compared their model with popular classification algorithms to show that the model outperformed existing ones. In 2013, Zhang et al. [21] worked on predicting bug-fixing time. They used data from three commercial software projects from CA Technologies and applied a Markov-based model to predict the number of bugs that can be fixed monthly.
In 2018, Akbarinasaji et al. [22] replicated Zhang et al. [21] using open source data from Bugzilla Firefox. The results of this replication study are similar to the original experiment and confirm the originally proposed model.

Starting from the results obtained in the various studies, it is possible to state that models based on information retrieval, if used in a classification activity, are more predictive than the statistical models; the same cannot be said for regression analysis, because in the literature there is no numerical regression model that exploits the textual information contained in the bug reports. The experiments highlight that the selection of attributes contributes significantly to increasing the predictive power of the model, especially when used to define attributes characterized by a stronger correlation with the bug resolution time. In some cases, the information on sampling is omitted; the chosen sampling could therefore largely influence the results obtained, and there is no way to compare them appropriately. In the context of classification, we can state that at present logistic regression, when compared with other algorithms, seems to obtain the best performance, very often due to the simplicity of the training phase compared to other models.

This work, to the best of our knowledge, is the first one to tackle the problem of predicting the bug-fixing time by a multiple regression analysis using as predictor variables the textual information extracted from the bug reports. In this regard, we used the SVM, M5P model tree and Random Forests algorithms, all configured for regression analysis. Moreover, this work is also the first one to use a dimensionality-reduction method, a step until now never used even if, as stated by many authors, it is necessary given the intrinsic nature of the aforementioned problem. In our work, we used PCA as the dimensionality-reduction method.

4 Proposed Model

Our idea is to transform the prediction problem into a numerical regression problem, in which we extract significant textual information from bug reports in order to predict bug-fixing time. The proposed prediction model is shown schematically in Fig. 3. It mainly consists of three phases, already proposed in [6, 7]: Data Collection, Pre-processing, and Learning and severity prediction. The main differences of the model proposed in this work are the use of a dimensionality-reduction method and the treatment of the problem as a numerical regression problem rather than as a classification problem.

4.1 Data Collection

The Data Collection phase involves data gathering and data analysis for bug-fix time prediction from one or more Bug Tracking Systems. The model of this first phase is shown on the left side of Fig. 3. Our design is largely application-independent; however, for this work we decided to use the open source BTS Bugzilla.


Fig. 3 Conceptual design of bug-fix time prediction process

Bug report selection consists of data gathering and data selection of only those historical bug reports from the BTS datastore whose Status field has been assigned to VERIFIED and whose Resolution field has been assigned to FIXED. These are the only ones useful for our regression analysis. For this purpose, we have used a web application able to carry out a web scraping process of bug reports from the Bugzilla platform. This process was made possible by exploiting some APIs made available by Bugzilla, collecting the bug reports of each adopted project in a separate JSON file. Our approach involves the use of the textual content of the extracted bug reports as independent variables; hence, we selected those fields deemed significant for the prediction. Our choice includes the selection of the following fields:
• Product (a real-world product, identified by a name and a description, having one or more bugs).
• Component (a given subsection of a Product, having one or more bugs).
• Short_desc (a one-sentence summary of the problem).
• First_priority (priority set by the user who created the report. Default values of priority are from P1, highest, to P5, lowest).
• First_severity (severity set by the user who created the report. This field indicates how severe the problem is, from blocker, when the application is unusable, to trivial).
• Reporter (the account name of the user who created the report).
• Assigned_to (the account name(s) of the developer(s) to whom the bug has been assigned by the triager, responsible for fixing the bug).


• Priority (priority set by the triager or a project manager).
• Severity (severity set either by the triager or a project manager).
• First_comment (the first comment posted by the user who created the report, which usually consists of a long description of the bug and its characteristics).
• Comments (subsequent comments posted by the Reporter and/or developers endowed with appropriate permissions, who can edit and change all bug fields and comment these activities accordingly).
• Fixing time was not available, so we introduce an additional field called Days_resolution, calculated as the time distance between the final time when the bug field Status was set to RESOLVED and the date when the bug report was assigned for the first time.
It is important to note that the Days_resolution field is calculated in calendar days and not in working days, where usually a working day corresponds to 8 h, because there is no accurate information about the actual time spent by the developers responsible for fixing bugs. For this reason, the Days_resolution field may not be very accurate and may potentially affect the results.
We decided to discard some fields because they are insignificant or unusable. The "Number of activities" field, for example, has been discarded because it is a numeric field and, for this reason, would in any case have been removed in the pre-processing phase. Another field, the "CC list" field, containing the list of users interested in receiving an email notification each time the report is updated, was discarded because it is often not filled; the fields "Status" and "Resolution" were not considered because they were already used for the selection of bug reports, hence they are not statistically valid for the prediction.
After selecting the bug reports and extracting the relevant fields from them, we stored them in a non-relational database; our choice was MongoDB. We chose a non-relational database for the greater flexibility it offers for storing textual documents. Then we used an R script to access the MongoDB database and import the bug reports as JSON objects into the R environment. Due to hardware and software limitations, it was not possible to use the entire set of bug reports stored in the MongoDB database for the purpose of prediction. For this reason, we performed a random sampling for each data set, considering a sample composed of at most 2000 instances. We split the resultant data sets into training, test and validation sets, given a fixed split percentage.
Data Collection also involves information filtering of those fields that are not generally present at the time of the insertion of a new report. In this activity, moreover, the Days_resolution field belonging to the bug reports is temporarily eliminated and kept for the purpose of prediction, given that this field does not require a pre-processing phase, being a numeric field. Initially we thought to use information filtering, denoted as IF1 (Information Filtering n. 1), on the test set and validation set, as already performed in [22], because these instances simulate newly opened and previously unseen bug reports, and this makes it compulsory to delete some of the previously extracted fields that were not actually available before the bug was assigned. The deleted fields are: First_comment for instances belonging to the training set; Priority, Severity and Comments for instances belonging to the test and validation set.
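A minimal R sketch (ours) of the Days_resolution computation described above, as the calendar-day distance between the resolution date and the first assignment date (the dates below are placeholders):

# Hedged sketch: calendar days between the first assignment and the final RESOLVED timestamp.
days_resolution <- function(resolved_at, assigned_at) {
  as.numeric(difftime(as.Date(resolved_at), as.Date(assigned_at), units = "days"))
}
days_resolution("2019-03-15", "2019-02-01")   # 42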


We have also proposed a further information filtering methodology, denoted as IF2 (Information Filtering n. 2), which uniformly filters the information present in the instances belonging to the training, test and validation sets, as we believe that a prediction based on textual content should use the same information for training the model and for predicting the bug-fixing time. In this case, the deleted fields are: Priority, Severity and Comments.

4.2 Pre-processing

The pre-processing phase, shown in Fig. 3, converts the original textual bug report data into a data-mining-ready structure, where the most significant text features that serve to build the regression model are identified. The model used to predict bug resolution time is based on the bug report representation in terms of bag-of-words. In this representation, the order of occurrence and the grammatical form of the words are not relevant, while the presence or absence of a term and its number of occurrences are discriminant. To represent the bug reports in terms of bag-of-words, it becomes necessary to perform text pre-processing: this activity is common to many works on text categorization and natural language processing and is well documented in the literature [23]. The goal is to define a vocabulary of terms representative of the context to classify, eliminating information that brings no benefit. The text pre-processing tasks we used are the well-known ones: converting all words to lowercase; removing punctuation; removing URLs; removing stop words; and text stemming, using the Porter stemming algorithm, i.e., reducing each word to its stem. The following code shows some of the principal activities performed during text pre-processing of the data corpus, using the R package "SnowballC" [26].

Text pre-processing

# remove extra white-spaces corpus