Biologically Inspired Techniques in Many-Criteria Decision Making: International Conference on Biologically Inspired Techniques in Many-Criteria Decision Making (BITMDM-2019) (Learning and Analytics in Intelligent Systems, 10). ISBN 9783030390327, 9783030390334, 3030390322

This book addresses many-criteria decision-making (MCDM), a process used to find a solution in an environment with several criteria.


Table of contents:
Preface
Contents
Biologically Inspired Techniques and Their Applications
Classification of Arrhythmia Using Artificial Neural Network with Grey Wolf Optimization
1 Introduction
2 Related Literature
3 Methodology
3.1 Grey Wolf Optimization (GWO)
4 Experimental Setup and Result
5 Conclusion
References
Multi-objective Biogeography-Based Optimization for Influence Maximization-Cost Minimization in Social Networks
1 Introduction
2 Backgrounds
2.1 Social Influence, Influence Maximization, and Influence Maximization-Cost Minimization Problem
2.2 Overview of Multi-objective Optimization
3 Related Works
4 Our Proposed Approach
4.1 Solution Encoding
4.2 Selection Strategy for Determining Serviceable Nodes, Population Initialization, and for Mutation
4.3 Skip Migration When a Mutation Occurs
4.4 Evolving Mutation Rate
4.5 Consolidated Algorithm
5 Experimental Setup and Result Analysis
6 Conclusion
References
Classification of Credit Dataset Using Improved Particle Swarm Optimization Tuned Radial Basis Function Neural Networks
1 Introduction
2 Literature Survey
3 Machine Learning Techniques Used for Credit Risk Analysis
3.1 Radial Basis Function Neural Networks
3.2 Canonical Particle Swarm Optimization Algorithm
3.3 Improved PSO Tuned RBFNN Algorithm
4 Experimental Works
4.1 Environment
4.2 Parameters Details
5 Result Analysis
6 Conclusions
References
Multi-Verse Optimization of Multilayer Perceptrons (MV-MLPs) for Efficient Modeling and Forecasting of Crude Oil Prices Data
1 Introduction
2 Related Work
3 MV-MLP
4 Experimental Results and Discussion
5 Concluding Remarks
References
Application of Machine Learning to Predict Diseases Based on Symptoms in Rural India
1 Introduction
2 Materials and Methods
2.1 Information Gain
2.2 Entropy
2.3 Algorithm
3 Results and Discussion
4 Summary/Conclusion
References
Classification of Real Time Noisy Fingerprint Images Using FLANN
1 Introduction
2 Finger Print Classification
3 BBO-FLANN Hybrid Network
4 Result Analysis
5 Conclusion
References
Software Reliability Prediction with Ensemble Method and Virtual Data Point Incorporation
1 Introduction
2 Related Study
3 Methods and Methodologies
3.1 Ridge Regression and Bayesian Ridge Regression
3.2 Support Vector Regression (SVR)
3.3 K-Nearest Neighbors Regression (KNN)
3.4 Artificial Neural Network (ANN)
3.5 Radial Basis Function Network (RBFN)
3.6 Virtual Data Generation Methods
3.7 Linear Interpolation
4 Proposed Model
5 Experimental Study and Result Analysis
6 Conclusions
References
Hyperspectral Image Classification Using Stochastic Gradient Descent Based Support Vector Machine
1 Introduction
2 Related Work
3 Proposed Model and Techniques Used
4 Results and Discussion
4.1 Dataset Description
4.2 Experimental Setup
4.3 Result Analysis
5 Conclusion
References
A Survey on Ant Colony Optimization for Solving Some of the Selected NP-Hard Problem
1 Introduction
2 Computational Complexity of Algorithms and NP-Hard Problems
3 Basics of Ant Colony Optimization
3.1 Background of Ant Colony Optimization
3.2 Graph Representation of ACO Algorithm
3.3 ACO Algorithm
4 Ant Colony Optimization for NP-Hard Problems
4.1 Ant Colony Optimization for Traveling Salesman Problem
4.2 Ant Colony Optimization for Subset Selection Problem
4.3 Ant Colony Optimization for 0/1 Knapsack Problem
4.4 Ant Colony Optimization for Minimum Vertex Cover Problem
5 Discussions and Conclusions
References
Machine Learning Models for Stock Prediction Using Real-Time Streaming Data
1 Introduction
2 Machine Learning Techniques Adopted
2.1 Decision Tree
2.2 Polynomial Linear Regression
2.3 Support Vector Regression (SVR)
2.4 Random Forest Generation
3 Proposed Work and Analysis
4 Performance Measures
5 Result and Analysis
6 Conclusion and Future Work
References
Epidemiology of Breast Cancer (BC) and Its Early Identification via Evolving Machine Learning Classification Tools (MLCT)–A Study
1 Introduction
2 Cancer Incidence and Cancer Cases Vary by State
3 Risk Factor
4 Availability of Open Source Data for Research
5 ML Algorithm for Cancer Prediction
6 ANN for Cancer Research
7 Related Work
8 Conclusion and Recommendation
References
Ensemble Classification Approach for Cancer Prognosis and Prediction
1 Introduction
2 Materials and Methods
2.1 Dataset Pruning
2.2 Normalization (Min – Max Algorithm)
2.3 Feature Extraction (Double RBF Kernel Based Ranking of Feature)
2.4 KNN Classifier
2.5 MLP (Multi-Layer Perceptron)
2.6 Decision Tree
3 Proposed Algorithm
4 Parameter Discussion
5 Experimental Evaluation
6 Dataset
6.1 Wisconsin Prognostic Breast Cancer (WPBC)
6.2 Five Bench Mark Cancer Dataset
7 Ranking of Classifier
8 Result Analysis and Comparison
8.1 Breast Cancer Prognosis
8.2 Cancer Prediction
9 Conclusion
References
Extractive Odia Text Summarization System: An OCR Based Approach
1 Introduction
2 Related Work
3 Experimental Setup
3.1 Data Preprocessing
3.2 Methodology
4 Result
5 Conclusion and Future Work
References
Predicting Sensitivity of Local News Articles from Odia Dailies
1 Introduction
2 Related Works
3 Methodology
3.1 Performance Evaluation Measure
4 Results Analysis
5 Conclusion
References
A Systematic Frame Work Using Machine Learning Approaches in Supply Chain Forecasting
1 Introduction
2 Related Works
3 Framework
4 Results
5 Conclusion
References
An Intelligent System on Computer-Aided Diagnosis for Parkinson’s Disease with MRI Using Machine Learning
1 Introduction
1.1 GBA
1.2 LRRK2
1.3 SNCA
2 Related Works
3 Architecture
4 Materials and Methods
4.1 Dataset
4.2 Data Pre-processing
5 Feature Classification
5.1 AdaBoost
5.2 Naïve Bayes
5.3 Random Forest
5.4 Logistic Regression
5.5 Decision Trees
5.6 Support Vector Machines
6 Implementation
7 Result
References
Multi-Criteria Decision Making Approaches
Operations on Picture Fuzzy Numbers and Their Application in Multi-criteria Group Decision Making Problems
1 Introduction
2 Preliminaries
2.1 Fuzzy Set
2.2 Intuitionistic Fuzzy Set
2.3 Picture Fuzzy Set
2.4 α-cut of Picture Fuzzy Set
2.5 Triangular PFSs
3 Arithmetic Operations of Triangular PFSs and Their Examples
4 Numerical Examples
5 Ranking of Triangular PFSs Based on Value and Ambiguity
6 Multi-criteria Group Decision-Making Based on Arithmetic Operation Between Triangular PFSs
7 Hypothetical Case Study
7.1 Computational Procedure Is Discussed in Detail
8 Conclusion
References
Some Generalized Results on Multi-criteria Decision Making Model Using Fuzzy TOPSIS Technique
1 Introduction
2 Preliminaries
2.1 TOPSIS Method
2.2 Fuzzy TOPSIS Method
3 Proposed Methodology
4 Computational Illustration
5 Conclusion
References
Data Mining, Bioinformatics, and Cellular Communications
A Survey on FP-Tree Based Incremental Frequent Pattern Mining
1 Introduction
2 FP-tree Based Incremental FPM
2.1 Two-Pass FPM
2.2 Single-Pass FPM
3 Research Issues and Challenges
4 Conclusion
References
Improving Co-expressed Gene Pattern Finding Using Gene Ontology
1 Introduction
2 Combined Similarity (ComSim)
3 Proposed Technique PatGeneClus
4 Experimental Results
5 Conclusions
References
Survey of Methods Used for Differential Expression Analysis on RNA Seq Data
1 Introduction
2 Issues and Challenges
3 Normalisation of RNA-Seq Data
4 Differential Gene Expression
4.1 Mathematical Definition
5 Differential Gene Expression Analysis
5.1 Tools Used for Differential Gene Expression Analysis
5.2 Results of the Study on Differential Expression Packages
6 Clustering the Results
7 Conclusions
References
Adaptive Antenna Tilt for Cellular Coverage Optimization in Suburban Scenario
1 Introduction
2 Related Work
3 System Model
3.1 Radio Channel Modelling
3.2 Base Station Antenna Radiation Pattern Modelling
3.3 Application of Reinforcement Learning Algorithm to the Current Scenario
4 Results and Analysis
5 Conclusion
References
A Survey of the Different Itemset Representation for Candidate Generation
1 Introduction
2 Layout for Itemsets Representation
3 Data Structures for Itemset Representation
4 Experimental Analysis
5 Conclusion
References
Author Index


Learning and Analytics in Intelligent Systems 10

Satchidananda Dehuri · Bhabani Shankar Prasad Mishra · Pradeep Kumar Mallick · Sung-Bae Cho · Margarita N. Favorskaya
Editors

Biologically Inspired Techniques in Many-Criteria Decision Making
International Conference on Biologically Inspired Techniques in Many-Criteria Decision Making (BITMDM-2019)

Learning and Analytics in Intelligent Systems Volume 10

Series Editors George A. Tsihrintzis, University of Piraeus, Piraeus, Greece Maria Virvou, University of Piraeus, Piraeus, Greece Lakhmi C. Jain, Faculty of Engineering and Information Technology, Centre for Artificial Intelligence, University of Technology, Sydney, NSW, Australia; University of Canberra, Canberra, ACT, Australia; KES International, Shoreham-by-Sea, UK; Liverpool Hope University, Liverpool, UK

The main aim of the series is to make available a publication of books in hard copy form and soft copy form on all aspects of learning, analytics and advanced intelligent systems and related technologies. The mentioned disciplines are strongly related and complement one another significantly. Thus, the series encourages cross-fertilization highlighting research and knowledge of common interest. The series allows a unified/integrated approach to themes and topics in these scientific disciplines which will result in significant cross-fertilization and research dissemination. To maximize dissemination of research results and knowledge in these disciplines, the series publishes edited books, monographs, handbooks, textbooks and conference proceedings.

More information about this series at http://www.springer.com/series/16172

Satchidananda Dehuri · Bhabani Shankar Prasad Mishra · Pradeep Kumar Mallick · Sung-Bae Cho · Margarita N. Favorskaya
Editors

Biologically Inspired Techniques in Many-Criteria Decision Making
International Conference on Biologically Inspired Techniques in Many-Criteria Decision Making (BITMDM-2019)

Editors

Satchidananda Dehuri
Department of Information and Communication Technology, Fakir Mohan University, Balasore, Odisha, India

Bhabani Shankar Prasad Mishra
School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, Odisha, India

Pradeep Kumar Mallick
School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, Odisha, India

Sung-Bae Cho
Department of Computer Science, Yonsei University, Seoul, Korea (Republic of)

Margarita N. Favorskaya
Institute of Informatics and Telecommunications, Reshetnev Siberian State University of Science and Technology, Krasnoyarsk, Russia

ISSN 2662-3447 ISSN 2662-3455 (electronic)
Learning and Analytics in Intelligent Systems
ISBN 978-3-030-39032-7 ISBN 978-3-030-39033-4 (eBook)
https://doi.org/10.1007/978-3-030-39033-4

© Springer Nature Switzerland AG 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Multi-criteria decision making (MCDM) is a process of finding a solution in an environment with several criteria. Many real-life problems consist of several objectives which should be taken into account. Solving problems of this category is a challenging task and requires deep investigation. In real applications, simple and easy-to-understand methods are often used, and they lead to situations in which the solutions accepted by decision-makers are not optimal. On the other hand, algorithms which do not lead to such situations are very time consuming. The biggest challenge facing researchers is how to create effective algorithms which will lead to optimal solutions with low time complexity. That is why a lot of effort has been given to implementing biologically inspired algorithms (BIAs), which have succeeded in solving uni-objective problems. Hence, this project was started to introduce readers to the state-of-the-art development of biologically inspired techniques and their applications, with a great emphasis on the MCDM process. This book is not restricted to the blend of contributions on BIAs and the MCDM process but also includes contributions on nature-inspired algorithms, multi-criteria optimization, machine learning, and soft computing.

Mohapatra, et al. have contributed an article on classification of arrhythmia using artificial neural network (ANN) with grey wolf optimization (GWO). Research on biomedical signal analysis is growing day by day, and accurate classification is an essential and challenging task. The authors of this work have tried to obtain better accuracy in cardiac signal classification using an ANN classifier. The weights of the neural network are optimized using the GWO algorithm, and the proposed optimized model is utilized for arrhythmia classification.

De and Dehuri have presented an article on multi-objective biogeography-based optimization for influence maximization and cost minimization in social networks. Influence maximization, which selects a set of k users (called the seed set) from a social network to maximize the expected number of influenced users (called the influence spread), is a key algorithmic problem in social influence analysis. In the past, a lot of studies were carried out to identify influential seeds from a given social graph and propagation model. Many propagation models, greedy algorithms, and


approximation algorithms emerged as a result. However, much less effort was made towards the influence maximization and cost minimization problem. Therefore, in this work, the authors have suggested a multi-objective biogeography-based optimization strategy to maximize influence while minimizing the cost. The strategy combines the best attributes of biogeography-based optimization and the non-dominated sorting genetic algorithm II.

Classification of credit dataset using improved particle swarm optimization tuned radial basis function neural networks is contributed by Pandey et al. Credit risk assessment acts as a survival weapon in almost every financial institution. It involves in-depth and sensitive analysis of various economic, social, demographic and other pertinent data provided by the customers and about the customers for building a more accurate and robust electronic finance system. In this paper, radial basis function neural network with particle swarm optimization (RBFNN + PSO) and improved particle swarm optimization tuned radial basis function neural network (RBFNN + IMPSO) learning algorithms have been studied and their effectiveness for credit risk assessment compared. The experimental findings draw a clear line between the proposed model and traditional learning algorithms.

Nayak, et al. have presented a novel multi-verse optimization of multi-layer perceptrons (MV-MLPs) for efficient modelling and forecasting of crude oil prices data. Arbitrary changes in crude oil prices make forecasting quite difficult. Multi-layer neural networks are found to be effective in predicting such crude oil prices, but crafting an optimal neural network architecture requires numerous trial-and-error runs. This article presents a hybrid model based on multi-verse optimization (MVO) of the multi-layer perceptron (MLP), termed MV-MLP, where a universe/individual of MVO represents a potential MLP in the universe of discourse. The proposed MV-MLP is evaluated on forecasting crude oil prices, and its predictive performance is established through a comparative study with other trained models.

Biswal, et al. have contributed an article on the application of machine learning to predict diseases based on symptoms in rural India. Living in a place away from hospitals and medical aid, it is often difficult for some people in rural India to diagnose a disease at an early stage. Though hospitals in urban areas are using advanced technology for diagnosis and prognosis, disease prediction remains a highly critical task, and providing sophisticated and accurate algorithms and techniques to overcome this issue would be revolutionary. In this paper, the authors present a machine learning technique using the decision tree (DT) algorithm to interconnect the symptoms, rearrange them, and retrieve the most probable diagnosis. This technique allows the system to self-learn without explicit programming.

Mishra, et al. have contributed an article on classification of real-time noisy fingerprint images using FLANN. In this work, the authors have examined a novel biogeography-based optimized FLANN for classifying noisy fingerprints as a biometric classifier. Here, the database is collected in real time from ten different persons, and the features are extracted from five different classes of fingerprints using a Gabor filter bank. The results prove that this method is robust enough to classify the fingerprints with good accuracy.


Behera and Panda have contributed an article entitled software reliability prediction with ensemble method and virtual data point incorporation. This article suggests a method of enriching the training dataset through exploration and incorporation of virtual data derived from existing data. To boost the overall accuracy and reduce the risk of model selection, an ensemble framework of five ML methods is suggested. The extended software reliability dataset is then exposed to the constituent models as well as the ensemble approach separately to estimate the future data. Extensive simulation results on a couple of software reliability datasets reveal that the proposed model significantly improves the prediction accuracy.

Sampurnima, et al. have contributed hyperspectral image classification using stochastic gradient descent (SGD)-based support vector machine (SVM). In recent days, hyperspectral images have become popular in remote sensing. Hyperspectral imaging has many applications, including resource management, mineral exploration, agriculture, environmental monitoring and other tasks for earth observation. Earlier, these images were very rarely available, but with the recent appearance of airborne hyperspectral imaging systems, hyperspectral images have entered the mainstream of remote sensing. In this work, the authors have considered a few officially and publicly available hyperspectral image datasets. As the images contain spectral, spatial and temporal resolutions, the authors have used SVM optimized with SGD, a powerful machine learning technique, to classify several regions in the images.

Mandal and Dehuri have contributed an article entitled a survey on ant colony optimization for solving some of the selected NP-hard problems. This paper analyses various ant colony optimization (ACO)-based techniques for solving some selected intractable problems like travelling salesperson (TS), subset selection (SS), minimum vertex cover (MVC) and 0/1 knapsack in a tolerable amount of time. The authors have reviewed the literature on the usage of the aforesaid meta-heuristic algorithms for solving the said intractable problems; it is shown that the ACO algorithm demonstrates significant effectiveness and robustness in solving them.

Jena et al. have contributed an article on machine learning models for stock prediction using real-time streaming data. In this work, an attempt is made to develop machine learning models to predict the potential prices of a company's stock, which helps in making financial decisions. Spark streaming has been considered for the processing of humongous data, and data ingestion tools like NodeJS have been further used for analysis. Earlier research exists on the same concept, but the present goal of the study is to develop a model that is scalable, fault-tolerant and has lower latency. The model rests on a distributed computing architecture called the Lambda architecture, which helps in attaining the goals as intended. Upon analysis, it is found that the prediction of stock values is more accurate when support vector regression is applied. The historical stock values are considered as supervised datasets for training the models.

Maurya et al. have contributed an article entitled epidemiology of breast cancer (BC) and its early identification via evolving machine learning classification tools (MLCT)–a study. In this study, the authors have summarized numerous ML


techniques which could be used as an important tool by surgeons for the timely detection and prediction of cancerous cells.

Maurya et al. contributed an article on ensemble classification approach for cancer prognosis and prediction. In this paper, the authors use a double RBF kernel function for feature selection and a novel fusion procedure to enhance the performance of three base classifiers, i.e. K-nearest neighbour (KNN), multi-layer perceptron (MLP) and decision tree (DT). Training of the classifiers is implemented based on k-fold cross-validation. The predictive accuracy of the proposed model has been compared with recent fusion methods such as majority voting, distribution summation and Dempster–Shafer on six benchmark cancer datasets. Experimental evaluation and result analysis show promising performance, better than the other fusion strategies, with respect to the stated goal functions. The Wisconsin breast prognosis dataset is used with the proposed model for gene selection and prognosis prediction.

Pattnaik, et al. have contributed an article on an extractive Odia text summarization system: an OCR-based approach. In this work, the authors propose a novel technique to extract Odia text from image files using optical character recognition (OCR) and summarize the obtained text using extractive summarization techniques. The authors also performed a manual evaluation to measure the quality of the summaries and validate their techniques. The proposed approach is found suitable for generating summarized Odia text, and the same technique can also be extended to other low-resource languages for extractive summarization.

Jena and Mohanty have contributed an article predicting the sensitivity of local news articles from Odia dailies. News articles manifest different issues of the public domain. This paper proposes to categorize positive, negative and neutral local news and, later on, to predict the sensitivity of negative local news articles. The purpose of the sensitivity analysis of negative local news articles is to set the priority of action to be taken by the local administration; sensitive news discusses issues or events of an urgent nature which need immediate intervention. The experiment is carried out based on Odia syntacto-semantic knowledge for the categorization of 1000 local Odia news articles. For sensitivity analysis, tf and tf-idf scores are calculated using unigram and bigram representations of the data at the document level; the tf and tf-idf vectors are passed to an SVM, and the results are analysed by calculating the accuracy and F1 score.

Prahathish, et al. have contributed an article on a systematic framework using machine learning approaches in supply chain forecasting. Forecasting is an important study in the field of supply chain and logistics for operations management. Based on a study, a systematic framework has been developed and proposed for the same. Artificial neural networks have been applied in this field as an efficient way to forecast and reduce errors marginally. The purpose of such a systematic approach using the proposed architecture is to reduce inventory holdings, which shall largely account for important decision-making policies in the future.


Naren, et al. have presented an article on an intelligent system for computer-aided diagnosis of Parkinson's disease with MRI using machine learning. Parkinson's disease (PD), an intensifying neurological disorder, is predominantly caused by failing dopaminergic neurons of the midbrain. Dopamine is involved in sending messages to the parts of the brain that control coordination and movement. With the help of machine learning approaches, this work sets a base for an intelligent system that helps in the computer-aided diagnosis of PD patients. Machine learning is used for early diagnosis and prediction so that the disease can be treated quicker. In medical science, it is evident that outputs from imaging devices can be incorporated to better predict a disease. The paper gives a brief synopsis of machine learning techniques which, along with MRI data, can yield faster prediction of PD.

Dutta, et al. have presented an article on operations on picture fuzzy numbers and their application in multi-criteria group decision-making problems. Uncertainty is an unavoidable component of our life, and fuzzy set theory (FST) is generally explored to deal with it. However, in some complex situations, FST is not capable of playing a crucial role. In such situations, the picture fuzzy set (PFS) comes into the picture, which is a direct extension of FST and the intuitionistic fuzzy set (IFS). Although different studies on FST and IFS have been done, including their algebraic structure, these studies are found to be inappropriate for dealing with picture fuzzy situations. In this regard, this paper presents the basic arithmetic operations on PFSs along with numerical examples. Finally, the application of PFSs in multi-criteria group decision making is demonstrated through a case study.

Parida has contributed an article on some generalized results on a multi-criteria decision-making model using the fuzzy TOPSIS technique. This article proposes a new model involving decision making using the fuzzy technique for order performance by similarity to ideal solution (FTOPSIS) with collaborative decision-makers. FTOPSIS is very frequently used in multi-criteria decision making (MCDM). Furthermore, the concept is employed in measuring the distance of a particular fuzzy number from both the fuzzy positive ideal solution (FPIS) and the fuzzy negative ideal solution (FNIS). A numerical example is investigated to analyse the outcome of the alternative solutions against the ideal solution.

Ahmed and Nath have contributed an article on a survey on FP tree-based incremental frequent pattern mining. Several methods for efficient mining of frequent patterns (FP) can be found in the literature, but most of the approaches assume that the whole dataset can be stored in the computer's main memory and that the dataset is static in nature. Practically, none of the transactional datasets are static: they get updated due to the inclusion of new transactions or the exclusion of obsolete transactions as time advances, or the user may be required to generate the frequent patterns for a new threshold value on the updated database. This may generate new frequent patterns or refinements of existing patterns, and it becomes practically infeasible if the process starts from scratch. Many methods in the literature have tried to deal with the issues of incremental frequent pattern mining (FPM), but most of the algorithms are main-memory dependent. Therefore, in this paper, the authors discuss some


of the algorithms with their pros and cons, to see whether the main memory limitation of the existing techniques can be mitigated so that they can be efficiently used in the incremental scenario.

Baishya and Sarmah have contributed an article on improving co-expressed gene pattern finding using gene ontology. A semi-supervised gene co-expressed pattern finding method, referred to as PatGeneClus, which attempts to find all possible biologically relevant gene coherent patterns from any microarray dataset by exploiting both gene expression similarity and GO similarity, is presented in this paper. PatGeneClus uses a graph-based clustering algorithm called DClique to generate a set of clusters of high biological relevance. The effectiveness of PatGeneClus is established over several benchmark datasets using well-known validity measures.

Joshi and Sarmah have contributed an article on a survey of methods used for differential expression analysis on RNA-seq data. Gene expression indicates the amount of mRNA produced by a gene under a particular biological condition. Genes responsible for changes in the biological conditions of an organism will have different gene expression values across different conditions. Gene expression analysis is useful in the domain of transcriptomic studies to analyse the functions of, and interactions among, different molecules inside a cell. A significant analysis is that of differentially expressed genes, i.e., genes that exhibit a strong change in behaviour between two or more conditions; behavioural cell changes can thus be attributed to the differentially expressed genes. Statistical distributional properties of the read counts that constitute RNA-seq data are used for detecting the differentially expressed genes. In this paper, the authors provide a comparison of different tools which aid in RNA-seq-based differential expression analysis. It is important to note how the results of these tools differ and which tool provides more statistically significant results.

Samal, et al. have contributed an article on adaptive antenna tilt for cellular coverage optimization in a suburban scenario. Radio coverage optimization is a critical issue for mobile network operators (MNO) in the deployment of future-generation cellular networks, especially for users at the cell edge. The key factor that influences coverage in mobile networks is mostly related to the configuration of the antennas, especially the angle of antenna tilt. The received signal power in a cell can be increased with proper antenna tilt, causing a significant improvement in the signal-to-interference-plus-noise ratio (SINR) at the cell edge; this also reduces interference towards other cells. In this paper, a method for coverage optimization using base station antenna electrical tilt in mobile networks for a suburban scenario is proposed. The main focus is on the downlink power setting by using electrical antenna tilt in the mobile network. The proposed solution uses a reinforcement learning technique, and the simulation results show that the proposed algorithm can improve overall network performance in terms of SINR and received signal power at the cell edge with 80–100% user satisfaction.

Kharkongor and Nath have contributed an article on a survey of the different itemset representations for candidate generation. In this paper, a study of the data structures used for itemset representation is discussed. The different data structures


are tested on different datasets for the generation of candidate itemsets, and their performance in the candidate generation process is analysed.

December 2019

Satchidananda Dehuri
Bhabani Shankar Prasad Mishra
Pradeep Kumar Mallick
Sung-Bae Cho
Margarita N. Favorskaya

Contents

Biologically Inspired Techniques and Their Applications

Classification of Arrhythmia Using Artificial Neural Network with Grey Wolf Optimization
Saumendra Kumar Mohapatra, Sipra Sahoo, and Mihir Narayan Mohanty . . . 3

Multi-objective Biogeography-Based Optimization for Influence Maximization-Cost Minimization in Social Networks
Sagar S. De and Satchidananda Dehuri . . . 11

Classification of Credit Dataset Using Improved Particle Swarm Optimization Tuned Radial Basis Function Neural Networks
Trilok Nath Pandey, Parimal Kumar Giri, and Alok Kumar Jagadev . . . 35

Multi-Verse Optimization of Multilayer Perceptrons (MV-MLPs) for Efficient Modeling and Forecasting of Crude Oil Prices Data
Sarat Chandra Nayak, Ch. Sanjeev Kumar Dash, Bijan Bihari Mishra, and Satchidananda Dehuri . . . 46

Application of Machine Learning to Predict Diseases Based on Symptoms in Rural India
Suvasree S. Biswal, T. Amarnath, Prasanta K. Panigrahi, and Nrusingh C. Biswal . . . 55

Classification of Real Time Noisy Fingerprint Images Using FLANN
Annapurna Mishra, Satchidananda Dehuri, and Pradeep Kumar Mallick . . . 62

Software Reliability Prediction with Ensemble Method and Virtual Data Point Incorporation
Ajit Kumar Behera and Mrutyunjaya Panda . . . 69

Hyperspectral Image Classification Using Stochastic Gradient Descent Based Support Vector Machine
Pattem Sampurnima, Sandeep Kumar Satapathy, Shruti Mishra, and Pradeep Kumar Mallick . . . 78

A Survey on Ant Colony Optimization for Solving Some of the Selected NP-Hard Problem
Akshaya Kumar Mandal and Satchidananda Dehuri . . . 85

Machine Learning Models for Stock Prediction Using Real-Time Streaming Data
Monalisa Jena, Ranjan Kumar Behera, and Santanu Kumar Rath . . . 101

Epidemiology of Breast Cancer (BC) and Its Early Identification via Evolving Machine Learning Classification Tools (MLCT)–A Study
Rajesh Kumar Maurya, Sanjay Kumar Yadav, and Pragya Tewari . . . 109

Ensemble Classification Approach for Cancer Prognosis and Prediction
Rajesh Kumar Maurya, Sanjay Kumar Yadav, and Rishabh . . . 120

Extractive Odia Text Summarization System: An OCR Based Approach
Priyanka Pattnaik, Debasish Kumar Mallick, Shantipriya Parida, and Satya Ranjan Dash . . . 136

Predicting Sensitivity of Local News Articles from Odia Dailies
Manoj Kumar Jena and Sanghamitra Mohanty . . . 144

A Systematic Frame Work Using Machine Learning Approaches in Supply Chain Forecasting
K. Prahathish, J. Naren, G. Vithya, S. Akhil, K. Dinesh Kumar, and S. Sai Krishna Mohan Gupta . . . 152

An Intelligent System on Computer-Aided Diagnosis for Parkinson's Disease with MRI Using Machine Learning
J. Naren, Praveena Ramalingam, U. Raja Rajeswari, P. Vijayalakshmi, and G. Vithya . . . 159

Multi-Criteria Decision Making Approaches

Operations on Picture Fuzzy Numbers and Their Application in Multi-criteria Group Decision Making Problems
Palash Dutta, Rajdeep Bora, and Satya Ranjan Dash . . . 169

Some Generalized Results on Multi-criteria Decision Making Model Using Fuzzy TOPSIS Technique
P. K. Parida . . . 189

Data Mining, Bioinformatics, and Cellular Communications

A Survey on FP-Tree Based Incremental Frequent Pattern Mining
Shafiul Alom Ahmed and Bhabesh Nath . . . 203

Improving Co-expressed Gene Pattern Finding Using Gene Ontology
R. C. Baishya, Rosy Sarmah, and D. K. Bhattacharyya . . . 211

Survey of Methods Used for Differential Expression Analysis on RNA Seq Data
Reema Joshi and Rosy Sarmah . . . 226

Adaptive Antenna Tilt for Cellular Coverage Optimization in Suburban Scenario
Soumya Ranjan Samal, Nikolay Dandanov, Shuvabrata Bandopadhaya, and Vladimir Poulkov . . . 240

A Survey of the Different Itemset Representation for Candidate Generation
Carynthia Kharkongor and Bhabesh Nath . . . 250

Author Index . . . 257

Biologically Inspired Techniques and Their Applications

Classification of Arrhythmia Using Artificial Neural Network with Grey Wolf Optimization

Saumendra Kumar Mohapatra1, Sipra Sahoo1, and Mihir Narayan Mohanty2

1 Department of Computer Science and Engineering, ITER, Siksha 'O' Anusandhan (Deemed to be University), Bhubaneswar, Odisha, India
[email protected], [email protected]
2 Department of Electronics and Communication Engineering, ITER, Siksha 'O' Anusandhan (Deemed to be University), Bhubaneswar, Odisha, India
[email protected]

Abstract. Research on biomedical signal analysis is growing day by day, and accurate classification is an essential and challenging task. The authors of this work have tried to obtain better accuracy in cardiac signal classification using an artificial neural network (ANN) classifier. The weights of the neural network are optimized using the Grey Wolf Optimization (GWO) algorithm, and the proposed optimized model is utilized for arrhythmia classification. Data were collected from the UCI repository for this purpose. The performance is compared for both ANN and ANN with GWO: a classification accuracy of 93.38% is obtained using the ANN-GWO classifier. Compared to earlier work, the proposed approach is found to be much better in terms of accuracy.

Keywords: ECG · Arrhythmia · ANN · Optimization · GWO

1 Introduction

The electrocardiogram (ECG) is one of the most important sources for the analysis of any kind of cardiac disease. The electrical impulses that occur during each heartbeat can be visualized in the ECG, and accurate analysis of the ECG is most important in the diagnosis of heart disease. Changes in a normal ECG can occur during various cardiac abnormalities like atrial fibrillation, ventricular tachycardia, myocardial infarction, hypokalemia, etc. Arrhythmia is a type of cardiac illness, and it can be detected by closely analyzing the waves of the ECG [1]. It is caused by abnormal electrical activity in the heart, in which the heartbeat becomes fast or slow. The factors that cause cardiac arrhythmia are smoking and alcohol consumption, mental stress, diet, diabetes, etc. With a manual data analysis system, it is difficult to extract useful information from clinical data. A computer-based automated disease diagnosis system will therefore be most useful in medical sectors: such a system automatically takes a decision by analyzing the data collected through different medical tests. Data including ECG, ultrasound images, and MRI are useful for analysis and diagnosis. Machine learning-based medical data analysis is one of the advanced technologies that can reduce human interaction by enhancing


machine efficiency and will be cost-effective. For early detection of cardiac arrhythmia, a real-time automatic ECG analysis system will be the best support to clinicians [2]. The characteristics of the ECG vary for different patients under different physical and temporal conditions, and these variations make the analysis task very difficult. This motivates the development of an automatic ECG classification system [3]. For the development of an accurate ECG classifier, it is important to extract useful features from the data. In the proposed work, an ANN-based classifier is designed with the Grey Wolf Optimization (GWO) technique for the classification of five different types of cardiac arrhythmia. The rest of the paper is organized as follows. Related literature is presented in Sect. 2. A description of the proposed classifier is presented in Sect. 3. Results are discussed in Sect. 4. Section 5 concludes the work.

2 Related Literature

Automated ECG classification can help the cardiologist in the diagnosis of any type of cardiac abnormality. In the last few decades, several algorithms have been introduced by researchers for the automatic classification of the cardiac signal. Preprocessing, feature extraction, and classification are the three basic steps in ECG signal classification, and multiple methods have been applied for each of these steps. Classification of normal and coronary artery disease was performed by applying the higher-order statistics and spectra (HOS) method [4]. From each heartbeat, HOS bispectrum and cumulant features were extracted; for dimensionality reduction, the authors used PCA, and KNN and decision tree classifiers were used for classification, as their performance with fewer features is suitable. In [5], a support vector machine (SVM) classifier was used by the authors to detect coronary artery disease. They used PCA for feature selection, and a 79% classification accuracy was achieved. Genetic algorithm (GA) and binary particle swarm optimization (BPSO) methods were used for feature selection in [6]; SVM with k-fold cross-validation was considered for classification, and 81% classification accuracy was obtained from the proposed system. The S-transform and wavelet transform were used for feature extraction in [7], where a multi-layer perceptron was considered for the classification of normal and abnormal ECG beats. A long-term ECG classification framework was presented in another study [8]: an exhaustive k-means clustering technique was used for obtaining an optimal number of key beats and master beats from the ECG waveform, a backpropagation-based classifier was introduced for the classification, and the average accuracy of the proposed system was 99.04%. In certain cases, multiple classifiers for ECG classification were used. Automatic detection of coronary artery disease from the cardiac signal using four different classifiers was performed in [9]. In the first stage, the authors applied DWT for decomposition; the dimensions of the wavelet coefficients were then reduced by applying PCA, LDA, and ICA. After the selection of suitable features from the cardiac signal, classification was performed using KNN, SVM, PNN, and GMM. From their results, it was observed that ICA with the GMM classifier gave better results compared to the other three classifiers. Soft computing techniques have also been utilized for ECG signal classification in different studies. Radial basis function neural networks [10] and block-based neural


networks (BBNN) [11] were applied in some studies for automatic ECG classification. Similarly, a neural network was used to predict arrhythmia in [12], where the authors claimed a classification accuracy of 67%. Data mining methods such as attribute selection and EM-based data clustering were introduced by the authors in [13], where the dimension of the feature vector is reduced by a correlation-based selection technique and a rule-based classifier was designed for classifying the ECG. Higher-order spectra (HOS) were introduced by the authors in [14]: for dimensionality reduction, ICA was considered, and the classification was carried out using multiple classifiers; from their results, the KNN classifier gave better classification accuracy compared to the others. An optimum path forest (OPF) based classifier was chosen by the authors in [15]. In their work, they compared results for six feature extraction techniques and three classification techniques, all applied to the same MIT-BIH database; however, they did not focus properly on the arrhythmia dataset. This work proposes an artificial neural network based classifier optimized using GWO. The data are collected from the UCI repository.

3 Methodology

Five types of cardiac arrhythmia are classified using an artificial neural network with GWO optimization. Cardiac data can be collected directly from patients in hospitals or from open-source databases available on the internet. The UCI Cleveland heart disease data are used in this work. The dataset contains samples from 303 patients, and each sample is characterized by 13 attributes. The details of the Cleveland database are given in Table 1. The weights of the artificial neural network model are optimized using the GWO technique explained in the following section.

3.1 Grey Wolf Optimization (GWO)

The working principle of grey wolf optimization (GWO) is the leadership and hunting behaviour of grey wolves. GWO is a stochastic and powerful algorithm with good convergence quality. Its performance is better compared to other optimization algorithms like the genetic algorithm (GA), particle swarm optimization (PSO), ant colony optimization (ACO), and the gravitational search algorithm (GSA) [16, 17]. The main advantage of this optimization is that it does not require gradient information like other algorithms, and it can be easily implemented for many optimization problems [18, 19]. The GWO algorithm is designed around the hunting and leadership principles of wolves, and it has three phases:

• Social hierarchy of wolves for approaching the prey
• Encircling the prey
• Hunting the prey

Table 1. Cleveland Heart Disease Dataset

Attribute | Explanation
Age       | Patient age
Sex       | 1 represents male and 0 represents female patients
Cp        | Type of chest pain (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)
Trestbps  | Blood pressure of the patient at the time of admission to the hospital
Chol      | Cholesterol of the patient in mg/dl
FBS       | Blood sugar at fasting time
Restecg   | ECG report (0 = normal, 1 = ST-T wave abnormality, 2 = left ventricular hypertrophy)
Thalach   | Pulse rate
Exang     | Exercise-induced angina (1 = yes, 0 = no)
Oldpeak   | ST depression induced by exercise relative to the ST segment
Slope     | Peak exercise slope (1 = up sloping, 2 = flat, 3 = down sloping)
Ca        | Number of major vessels (0-3) colored by fluoroscopy
Thal      | 3 = normal, 6 = fixed defect, 7 = reversible defect
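As an illustration of how the attributes in Table 1 can be prepared for a classifier, the sketch below loads the UCI Cleveland file into a feature matrix. The file name, the use of NumPy, and the handling of missing values are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch: loading the UCI Cleveland heart-disease data of
# Table 1. The UCI archive ships the data as comma-separated values with
# "?" marking missing entries (an assumption to verify against your copy).
import numpy as np

COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

def load_cleveland(path="processed.cleveland.data"):
    rows = []
    with open(path) as f:
        for line in f:
            fields = line.strip().split(",")
            if len(fields) != len(COLUMNS):
                continue  # skip malformed lines
            # replace missing markers with NaN so they can be imputed later
            rows.append([np.nan if v == "?" else float(v) for v in fields])
    return np.array(rows)

data = load_cleveland()
X, y = data[:, :-1], data[:, -1]  # 13 attributes, 1 diagnosis label
print(X.shape)                    # (303, 13) for the full Cleveland set
```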

For implementing the first phase, the social hierarchy of grey wolves, γ is considered the best solution and acts as the decision maker among the wolves; σ and ω are considered the second and third best solutions, which assist the leader in decision making. The other wolves in the population are denoted θ and follow γ, σ, and ω. The second phase, encircling the prey, can be modelled as

$$\vec{g}(i+1) = \vec{g}_p(i) - \vec{B} \cdot \left|\vec{D} \cdot \vec{g}_p(i) - \vec{g}(i)\right| \qquad (1)$$

where $\vec{g}$ is the position vector of a wolf, $i$ is the iteration, $\vec{g}_p$ indicates the position vector of the prey, and $\vec{B}$ and $\vec{D}$ are coefficient vectors calculated as

$$\vec{B} = 2\vec{b} \cdot \vec{r}_1 - \vec{b} \qquad (2)$$

$$\vec{D} = 2 \cdot \vec{r}_2 \qquad (3)$$

where $\vec{r}_1$ and $\vec{r}_2$ are random vectors and $\vec{b}$ decreases linearly over the iterations, defined as

$$\vec{b}(i) = 2 - \frac{2i}{I_{\max}} \qquad (4)$$

where $I_{\max}$ is the maximum number of iterations.

The positions of the other wolves are updated according to the positions of γ, σ, and ω. The last phase, hunting the prey, can be simulated as follows:

$$\vec{g}_1 = \vec{g}_\gamma - \vec{B}_1 \cdot \left|\vec{D}_1 \cdot \vec{g}_\gamma - \vec{g}\right| \qquad (5)$$

$$\vec{g}_2 = \vec{g}_\sigma - \vec{B}_2 \cdot \left|\vec{D}_2 \cdot \vec{g}_\sigma - \vec{g}\right| \qquad (6)$$

$$\vec{g}_3 = \vec{g}_\omega - \vec{B}_3 \cdot \left|\vec{D}_3 \cdot \vec{g}_\omega - \vec{g}\right| \qquad (7)$$

$$\vec{g}(i+1) = \frac{\vec{g}_1(i) + \vec{g}_2(i) + \vec{g}_3(i)}{3} \qquad (8)$$
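A minimal NumPy sketch of the update rules in Eqs. (1)–(8) follows, assuming minimization of a user-supplied fitness function. The population size and iteration count match Table 2; the initialization range and variable names are illustrative choices, not details from the paper.

```python
# Sketch of the GWO loop of Eqs. (1)-(8). In the paper, each position
# vector g would encode the weights of the ANN classifier.
import numpy as np

def gwo(fitness, dim, n_wolves=50, max_iter=50):
    g = np.random.uniform(-1.0, 1.0, (n_wolves, dim))       # wolf positions
    for i in range(max_iter):
        order = np.argsort([fitness(w) for w in g])
        g_gamma, g_sigma, g_omega = g[order[0]], g[order[1]], g[order[2]]
        b = 2.0 - 2.0 * i / max_iter                         # Eq. (4)
        for k in range(n_wolves):
            moves = []
            for leader in (g_gamma, g_sigma, g_omega):       # Eqs. (5)-(7)
                r1, r2 = np.random.rand(dim), np.random.rand(dim)
                B = 2.0 * b * r1 - b                         # Eq. (2)
                D = 2.0 * r2                                 # Eq. (3)
                moves.append(leader - B * np.abs(D * leader - g[k]))
            g[k] = np.mean(moves, axis=0)                    # Eq. (8)
    return min(g, key=fitness)
```

Each wolf moves towards the three leaders γ, σ and ω via Eqs. (5)–(7), and Eq. (8) averages the three candidate moves into its new position.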

4 Experimental Setup and Result

The simulation is carried out using MATLAB on a PC with an i5 processor, 4 GB RAM, and the Windows operating system. Five different types of arrhythmia are classified in the proposed work. The ANN classifier is designed with GWO optimization; a detailed description of the weight optimizer and the ANN structure is presented in Table 2. The resultant weights from the training dataset have been tested using sample test data. The experimental results of GWO are compared with simple ANN. From the figures below, it is observed that ANN with GWO gives better results compared to simple ANN, and it also converges faster. The performance is evaluated using MAE, MedAE, MSE, and RMSE.

Table 2. Different parameters of ANN and ANN-GWO

ANN                               | ANN-GWO
Size of the input node = 4        | Population size = 50
Total number of hidden layers = 1 | Total number of iterations = 50
Nodes in the hidden layer = 4     | b = 2 to 0
Learning rate = 0.125             | B = [−1, 1]
Total number of iterations = 50   | D = [0, 2]
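To connect the optimizer to the classifier, each GWO position vector can encode the flattened weights of the network in Table 2 (4 input nodes, one hidden layer of 4 nodes). The output layer size, the activation function, the use of MSE as the fitness, and the metric helpers for the reported MAE, MedAE, MSE and RMSE figures are all assumptions of this sketch, not details given in the paper.

```python
import numpy as np

N_IN, N_HID, N_OUT = 4, 4, 5     # 4-4 per Table 2; 5 outputs assumed (5 classes)
DIM = N_IN * N_HID + N_HID + N_HID * N_OUT + N_OUT

def forward(w, X):
    """Run the 4-4-5 network whose weights are packed into the vector w."""
    i = N_IN * N_HID
    W1 = w[:i].reshape(N_IN, N_HID)
    b1 = w[i:i + N_HID]
    j = i + N_HID
    W2 = w[j:j + N_HID * N_OUT].reshape(N_HID, N_OUT)
    b2 = w[j + N_HID * N_OUT:]
    return np.tanh(X @ W1 + b1) @ W2 + b2

def fitness(w, X, Y):
    # MSE between network output and one-hot class targets
    return np.mean((forward(w, X) - Y) ** 2)

def error_metrics(err):
    """The four measures reported in the paper, computed from raw errors."""
    return {"MAE": np.mean(np.abs(err)),
            "MedAE": np.median(np.abs(err)),
            "MSE": np.mean(err ** 2),
            "RMSE": np.sqrt(np.mean(err ** 2))}

# best = gwo(lambda w: fitness(w, X_train, Y_train), DIM)  # GWO sketch above
```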

In Table 3 the confusion matrix obtained by ANN with GWO is presented. From this confusion matrix, a classification accuracy of around 93.38% is achieved, which is quite good compared to the previous work listed in Table 4. In Fig. 1 the RMSE for ANN and ANN-GWO is presented, and it can be observed that ANN with GWO provides better results compared to simple ANN.

Table 3. Confusion matrix obtained using ANN-GWO

Predicted class                 | Normal | Coronary Artery | Anterior Myocardial Infarction | Inferior Myocardial Infarction | Sinus tachycardy | Total
Normal                          | 63     | 1               | 2                              | 0                              | 0                | 66
Coronary Artery                 | 0      | 21              | 1                              | 0                              | 0                | 22
Anterior Myocardial Infarction  | 0      | 0               | 14                             | 1                              | 0                | 15
Inferior Myocardial Infarction  | 0      | 0               | 1                              | 12                             | 1                | 14
Sinus tachycardy                | 0      | 0               | 1                              | 0                              | 3                | 4
Total                           | 63     | 22              | 19                             | 13                             | 4                | 121

True % = 113/121 × 100 = 93.38

[Figure: RMSE (y-axis, approximately 0.18 to 0.32) plotted against the number of iterations (x-axis, 0 to 50) for ANN and ANN-GWO; the ANN-GWO curve lies below the ANN curve.]

Fig. 1. RMSE for ANN and ANN-GWO.

Table 4. Comparison of obtained result with earlier work

Reference     | Method              | Result
[15]          | Optimum path forest | 68%
[11]          | LDA                 | 76%
[15]          | KNN                 | 76.6%
Proposed work | ANN-GWO             | 93.3%
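The overall accuracy in Table 3 can be checked directly from the confusion matrix: it is the trace (correct predictions) divided by the total number of test samples.

```python
import numpy as np

cm = np.array([[63, 1, 2, 0, 0],
               [0, 21, 1, 0, 0],
               [0, 0, 14, 1, 0],
               [0, 0, 1, 12, 1],
               [0, 0, 1, 0, 3]])

accuracy = np.trace(cm) / cm.sum()   # 113 correct out of 121
print(round(100 * accuracy, 2))      # 93.39, reported as 93.38 in the paper
```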


5 Conclusion

Cardiac disease is widespread, and it is vital to detect it accurately. Though numerous researchers have worked on it, different techniques are still being developed and tested for the same task. In this work, an ANN classifier with GWO is used for the classification, and around 93% classification accuracy is achieved, which is quite good compared to other works. Accuracy can be improved further by the use of different classification algorithms and other optimization techniques.

References

1. Mohapatra, S.K., Mohanty, M.N.: Analysis of resampling method for arrhythmia classification using random forest classifier with selected features. In: 2018 2nd International Conference on Data Science and Business Analytics (ICDSBA), pp. 495–499. IEEE, September 2018
2. Polat, K., Güneş, S.: Detection of ECG Arrhythmia using a differential expert system approach based on principal component analysis and least square support vector machine. Appl. Math. Comput. 186(1), 898–906 (2007)
3. Hoekema, R., Uijen, G.J., Van Oosterom, A.: Geometrical aspects of the interindividual variability of multilead ECG recordings. IEEE Trans. Biomed. Eng. 48(5), 551–559 (2001)
4. Acharya, U.R., Sudarshan, V.K., Koh, J.E., Martis, R.J., Tan, J.H., Oh, S.L., Muhammad, A., Hagiwara, Y., Mookiah, M.R., Chua, K.P., Chua, C.K.: Application of higher-order spectra for the characterization of coronary artery disease using electrocardiogram signals. Biomed. Sig. Process. Control 31, 31–43 (2017)
5. Babaoğlu, I., Fındık, O., Bayrak, M.: Effects of principle component analysis on assessment of coronary artery diseases using support vector machine. Expert Syst. Appl. 37(3), 2182–2185 (2010)
6. Babaoğlu, İ., Findik, O., Ülker, E.: A comparison of feature selection models utilizing binary particle swarm optimization and genetic algorithm in determining coronary artery disease using support vector machine. Expert Syst. Appl. 37(4), 3177–3183 (2010)
7. Das, M.K., Ari, S.: ECG beats classification using mixture of features. International Scholarly Research Notices (2014)
8. Kiranyaz, S., Ince, T., Pulkkinen, J., Gabbouj, M.: Personalized long-term ECG classification: a systematic approach. Expert Syst. Appl. 38(4), 3220–3226 (2011)
9. Giri, D., Acharya, U.R., Martis, R.J., Sree, S.V., Lim, T.C., Vi, T.A., Suri, J.S.: Automated diagnosis of coronary artery disease affected patients using LDA, PCA, ICA and discrete wavelet transform. Knowl.-Based Syst. 37, 274–282 (2013)
10. Lewenstein, K.: Radial basis function neural network approach for the diagnosis of coronary artery disease based on the standard electrocardiogram exercise test. Med. Biol. Eng. Comput. 39(3), 362–367 (2001)
11. De Chazal, P., O'Dwyer, M., Reilly, R.B.: Automatic classification of heartbeats using ECG morphology and heartbeat interval features. IEEE Trans. Biomed. Eng. 51(7), 1196–1206 (2004)
12. Abdou, A.D., Ngom, N.F., Niang, O.: Classification and prediction of Arrhythmias from electrocardiograms patterns based on empirical mode decomposition and neural network. In: International Conference on e-Infrastructure and e-Services for Developing Countries, pp. 174–184. Springer, Cham, November 2018
13. Sufi, F., Khalil, I.: Diagnosis of cardiovascular abnormalities from compressed ECG: a data mining-based approach. IEEE Trans. Inf. Technol. Biomed. 15(1), 33–39 (2010)


14. Martis, R.J., Acharya, U.R., Min, L.C.: ECG beat classification using PCA, LDA, ICA and discrete wavelet transform. Biomed. Signal Process. Control 8(5), 437–448 (2013)
15. Mustaqeem, A., Anwar, S.M., Majid, M., Khan, A.R.: Wrapper method for feature selection to classify cardiac arrhythmia. In: 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 3656–3659. IEEE, July 2017
16. Mishra, A., Dehuri, S.: An experimental study of filter bank approach and biogeography-based optimized ANN in fingerprint. In: Nanoelectronics, Circuits and Communication Systems: Proceeding of NCCS 2017, vol. 511, pp. 229–237 (2018)
17. Ghosh, A., Dehuri, S.: Evolutionary algorithms for multi-criteria optimization: a survey (2005)
18. Mirjalili, S., Mirjalili, S.M., Lewis, A.: Grey wolf optimizer. Adv. Eng. Softw. 69, 46–61 (2014)
19. Long, W., Jiao, J., Liang, X., Tang, M.: Inspired grey wolf optimizer for solving large-scale function optimization problems. Appl. Math. Model. 60, 112–126 (2018)

Multi-objective Biogeography-Based Optimization for Influence Maximization-Cost Minimization in Social Networks

Sagar S. De and Satchidananda Dehuri

Department of Information and Communication Technology, Fakir Mohan University, Vyasa Vihar, Balasore 756 020, Odisha, India
[email protected], [email protected]

Abstract. Influence maximization, which selects a set of k users (called the seed set) from a social network to maximize the expected number of influenced users (called the influence spread), is a key algorithmic problem in social influence analysis. In the past, many studies were carried out to identify influential seeds from a given social graph and propagation model, and many propagation models, greedy algorithms, and approximation algorithms emerged. However, much less effort was made towards the influence maximization-cost minimization problem. Therefore, in this work, we suggest a multi-objective biogeography-based optimization strategy to maximize influence while minimizing cost. The strategy combines the best attributes of biogeography-based optimization and the non-dominated sorting genetic algorithm II, and a multi-objective ranking and selection strategy improves the convergence rate. Our empirical analysis on many real-life networks confirms the effectiveness of the algorithm in terms of both influence spread and time efficiency.

Keywords: Information diffusion · Influence maximization · Cost minimization · Multi-objective optimization · Biogeography-based optimization

1 Introduction

The last decades have witnessed the booming of online social networks (OSNs), where hundreds of millions of people interact with each other and produce an unprecedented amount of content. The number of users in OSNs is also proliferating. Online social sites such as Twitter, Facebook, and YouTube have become the de facto platforms selected by companies for promoting their products and spreading information. With this, OSNs act as a principal medium for the rapid spread of messages, opinions, influence, and ideas among people. The diffusion process that effectively followed the word-of-mouth analogy [1] has proved very powerful in several real-life applications [2], such as the adoption of new products and


political views. The success of information diffusion in online social networks has attracted extensive research efforts from multiple fields, including computer science, physics, and epidemiology. In the study of complex network analysis, social influence analysis, which performs a vital task in understanding spreading behavior, is one of the most exciting problems. The problem was derived from the improvement of viral marketing1. A legitimate query in viral marketing is: "given limited resources, which initial set of customers should be rewarded such that the resulting influenced population is maximized?". This elegant problem is also essential for understanding the information diffusion process, cascade dynamics, and behavior contagion, and for making concrete decisions in applied fields such as marketing, contagion management, immunization, public health, control of rumors [3,4], network surveillance [5], recommendation [6], feed ranking [7], and targeted advertisement [8]. The initial success of influence analysis, mainly in the form of maximizing influence, has attracted more research attention and extended studies. "The influence maximization (IM) problem, which selects a set of k users (called seed set) from a social network to maximize the expected number of influenced users (called influence spread), is a key algorithmic problem in social influence analysis" [9]. Recently, the influence maximization problem has been extensively studied because of its implicit commercial benefit; in information diffusion research, it is considered one of the critical algorithmic problems, and several studies have been conducted for identifying influential users from a given social graph and a propagation model [9].

The influence maximization problem inherently embraces enormous research challenges. First, the information propagation strategy in social networks may significantly influence the spreading behavior, and solving an influence maximization problem is theoretically hard: previous research has shown that identifying an optimal solution for an influence maximization problem is NP-hard for almost all diffusion models [10–12]. Second, due to the stochastic nature of information diffusion, even the evaluation of the influence spread of any individual seed set is computationally complex. These theoretical results show that it is very challenging to retrieve a (near) optimal seed set and, at the same time, to scale to massive social graphs. Third, very recently, online social networks have been equipped with novel features, e.g., location-based services, topical analysis, and streaming content. This has opened up an opportunity of combining IM with various contexts, such as location, time, and topic information, to improve the effectiveness of IM; many technical challenges naturally arise in solving such context-aware influence maximization problems. There is also a need for IM techniques that can handle non-progressive diffusion models such as the non-progressive linear threshold (NLT), voter, and heat conduction (HC) models. Due to the aforementioned challenges, a proliferation of research has been carried out on efficient techniques toward influence maximization.

1 "Viral marketing is any marketing technique that induces websites or users to pass on a marketing message to other sites or users, creating a potentially exponential growth in the message's visibility and effect."


However, from a commercial perspective, a company must reward all individuals in the initial seed set. Therefore, a strategy for selecting propagation seeds that maximizes the influence spread at minimum cost is a practical problem. In complex network analysis, this is known as the influence maximization-cost minimization (IM-CM) problem. The IM-CM problem is a bi-objective optimization problem; in general, multi-objective problems have many non-dominated solutions when all objectives are considered simultaneously [13], and a user trade-off is essential to determine the best solution from the non-dominated solutions. In this work, we have proposed a multi-objective biogeography-based optimization algorithm for solving the IM-CM problem. The goodness of this algorithm is that fitter solutions are used in greater proportion to generate new offspring, so that the optimum result can be reached quickly. The algorithm uses real coded encoding to handle large network datasets, and we have also derived a probabilistic approach for the initial seed selection. Extensive empirical studies on many benchmarked social networks show that the proposed algorithm is efficient in solving the IM-CM problem. The rest of the paper is organized as follows. Sections 2 and 3 present the background and related works, respectively. The proposed algorithm is discussed in Sect. 4. Section 5 presents empirical studies and Sect. 6 concludes the study.

2 Background

In this section, we first briefly review concepts related to social influence, influence maximization, and the influence maximization-cost minimization problem, followed by multi-objective optimization.

2.1 Social Influence, Influence Maximization, and Influence Maximization-Cost Minimization Problem

As defined by Rashotte, "Social influence is the change in an individual's thoughts, feelings, attitudes, or behaviors that results from interaction with other people or groups" [14]. Social influence happens when one's thoughts, feelings, or behavior are deliberately or incidentally influenced by others, and it can be observed throughout OSNs in various forms. From a data mining and big-data analysis perspective, social influence is analyzed in numerous applications, e.g., viral marketing, political campaigning, diffusion modeling, and recommender systems. In 2003, Kempe et al. [11] first modeled influence maximization as an algorithmic problem. This problem studies a social network represented as a graph G = (V, E, W), where V is the set of nodes in G (i.e., users), E is the set of (directed/undirected) edges in G (i.e., social links between users), and W is the edge weight matrix. Each element wi,j in W denotes the weight of the edge between i and j, which represents the probability that node i activates node j. The goal of the IM problem is to find an initial set of users of size k that maximizes influence in graph G under a given diffusion model M.


The influence of any seed set is defined based on the information diffusion process among the users. An example of information diffusion is viral marketing, where a company may wish to spread the adoption of a new product from some initial adopters through the social links between users. To quantify information diffusion, we formally define the diffusion model and the influence spread under the model.

Definition 1 (Diffusion Model & Influence Spread). Given a social graph G = (V, E, W), a user set S ⊆ V, and a diffusion model M that captures the stochastic process of S spreading information on G, the influence spread (aka the influence function) of S, denoted as σG,M(S), is the expected number of users influenced by S (e.g., users who adopt the new product in viral marketing), where σG,M(·) is a non-negative set function defined on any subset of users, i.e., $\sigma_{G,M}: 2^V \rightarrow \mathbb{R}_{\ge 0}$.

Recent years have witnessed a large amount of literature that develops diffusion models to formulate the diffusion process and compute the influence spread [15–17]. In this work, we focus on progressive diffusion models, i.e., activated nodes cannot be de-activated in later steps. Currently, there is a variety of diffusion models arising from the economics and sociology communities. The most popular are the Independent Cascade (IC) and Linear Threshold (LT) models, which are widely used in studying social influence problems. The main characteristic of the independent cascade model is the independence of the influence spread on each edge of the whole graph; in other words, a single person is sufficient to activate another individual. This model is appropriate for modeling the diffusion of ideas, information, viruses, etc. Every edge (u, v) ∈ E has an associated influence probability $p_{uv} \in [0, 1]$, that is, the probability that node u succeeds in its attempt to activate its neighbor node v. The linear threshold model is suitable for describing the adoption process of a new, unproven product: when the sum of all their friends' recommendations reaches a certain personal threshold, people are likely to adopt the new product themselves. This is mostly the case when adopting a new standard in the industry. While in the IC model every edge (u, v) is associated with an influence probability $p_{uv}$, in the linear threshold model the same edges are instead assigned an influence weight $w_{uv} \in [0, 1]$, which represents the degree to which user u can influence user v. These weights are normalized so that the sum over all incoming edges of each node is lower than or equal to 1, i.e., $\sum_{u \in N^{in}(v)} w_{uv} \le 1$. Besides these two well-known models, other propagation models, such as time-aware, context-aware, and propagation with negative feedback, are also present in the literature [9,18]. Based on the formalization of the influence spread, the influence maximization problem is defined as follows:

Definition 2 (Influence Maximization (IM)). Given a social graph G, a diffusion model M, and a positive integer k, IM selects a set S of k users from V as the seed set to maximize the influence spread σG,M(S) under the diffusion model M.
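To make the diffusion process concrete, the following Python sketch estimates the influence spread σG,M(S) under the independent cascade model by Monte-Carlo simulation; the dictionary-based graph representation, function name, and toy probabilities are our illustrative assumptions, not artifacts of the paper.

```python
import random

def influence_spread_ic(graph, seeds, runs=1000):
    """Monte-Carlo estimate of sigma(S) under the independent cascade model.

    graph: dict mapping node -> list of (neighbor, activation_probability) pairs.
    seeds: iterable of seed nodes S.
    runs:  number of Monte-Carlo simulations to average over.
    """
    total = 0
    for _ in range(runs):
        active = set(seeds)          # already-activated nodes
        frontier = list(seeds)       # nodes activated in the previous step
        while frontier:
            next_frontier = []
            for u in frontier:
                for v, p_uv in graph.get(u, []):
                    # each newly active node gets one chance to activate v
                    if v not in active and random.random() < p_uv:
                        active.add(v)
                        next_frontier.append(v)
            frontier = next_frontier
        total += len(active)
    return total / runs              # expected number of influenced users

# toy example: a small directed graph with edge activation probabilities
toy = {'A': [('B', 0.5), ('C', 0.3)], 'B': [('D', 0.4)], 'C': [('D', 0.2)]}
print(influence_spread_ic(toy, {'A'}))
```

Averaging over many runs is exactly why evaluating a single seed set is expensive, which motivates the later discussion of computational cost.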


Intuitively, the influence function σG,M(·) heavily depends on the diffusion process.

Definition 3 (Influence Maximization-Cost Minimization (IM-CM)). Given a social graph G, a diffusion model M, a cost function C for each vertex, and a positive integer k, an influence maximization-cost minimization problem identifies a set S of k users from V as the seed set that maximizes the influence spread σG,M(S) while also minimizing the total cost $C(S) = \sum_{v \in S} C(v)$.

The IM-CM problem takes both individual influence and individual cost into consideration, which captures the characteristics of real-world networks more vividly than IM. At the same time, the objective function contains both a maximization and a minimization objective; therefore, for the sake of the optimization algorithm, $\arg\max \sigma_{G,M}(S)$ is converted to $\arg\min -\sigma_{G,M}(S)$, so that both objectives become minimization functions. Eq. 1 presents the mathematical definition of the IM-CM problem:

$$\min_{S}\; f(x) = \left[-\sigma_{G,M}(S),\; C(S)\right]^{T} \quad \text{subject to } |S| = k,\; S \subseteq V \tag{1}$$

In general, the IM-CM problem does not have a single optimum solution. Instead, it carries more than one conflicting solution, and a multi-objective solver is therefore essential to identify all the conflicting solutions. In the section below, we discuss the concepts of multi-objective optimization.

2.2 Overview of Multi-objective Optimization

A multi-objective or multi-criterion optimization (MOO) considers multiple objectives to be optimized simultaneously; thus the objective function f(x) is a k-dimensional vector of optimization functions. The problem can be mathematically expressed as:

$$\underset{x}{\text{Optimize}}\; f(x) = \left[f_1(x), f_2(x), \ldots, f_k(x)\right]^{T} \quad \text{subject to } g_p(x) \le 0 \;\forall p = 1, \ldots, n, \quad h_q(x) = 0 \;\forall q = 1, \ldots, m \tag{2}$$

where the $g_p(x)$ are the n inequality constraints and the $h_q(x)$ are the m equality constraints. For a nontrivial multi-objective optimization problem (MOP), no single solution exists that simultaneously optimizes each objective. Thus, finding the non-dominated or Pareto optimal solutions among the conflicting solutions is essential. In MOO, a solution x1 is said to dominate another solution x2 if both of the following conditions are true:


1. The solution x1 is no worse than x2 in all objectives, i.e., $f_i(x_1) \preceq f_i(x_2)$ for all i = 1, 2, ..., k.
2. The solution x1 is strictly better than x2 in at least one objective, i.e., $f_i(x_1) \prec f_i(x_2)$ for at least one i = 1, 2, ..., k.

Here $\preceq$ stands for $\le$ on a minimization objective and $\ge$ on a maximization objective, while $\prec$ stands for $<$ on a minimization objective and $>$ on a maximization objective. Without additional subjective preference, all Pareto optimal solutions are considered equally good. Having more than one objective function, the notion of an "optimum" solution in MOPs can be viewed as a "trade-off" among non-dominated solutions.
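As a concrete check of the two dominance conditions (taking all objectives as minimization, consistent with Eq. 1), a small helper might look like the sketch below; the function name and example vectors are illustrative.

```python
def dominates(f1, f2):
    """Return True if objective vector f1 dominates f2 (all objectives minimized).

    Condition 1: f1 is no worse than f2 in every objective.
    Condition 2: f1 is strictly better than f2 in at least one objective.
    """
    no_worse_everywhere = all(a <= b for a, b in zip(f1, f2))
    strictly_better_somewhere = any(a < b for a, b in zip(f1, f2))
    return no_worse_everywhere and strictly_better_somewhere

# two solutions that are each better on one objective are non-dominated
print(dominates((1.0, 4.0), (3.0, 2.0)))  # False
print(dominates((3.0, 2.0), (1.0, 4.0)))  # False
print(dominates((1.0, 1.0), (2.0, 3.0)))  # True
```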

Fig. 1. Non-dominated solutions for a bi-objective minimization problem

In Fig. 1 the concepts of non-dominated solutions and the Pareto front are presented. Here, the search space is $\Omega \subset \mathbb{R}^2$, with $x = [x_1, x_2]^T$ such that $x_1, x_2 \in [0, 6]$. The two objective functions are $f_1(x) = \min(x_1)$ and $f_2(x) = \min(x_2)$. A square represents a feasible solution. The green squares are the non-dominated or Pareto optimum solutions, and the green line connecting them is the Pareto front. Dotted lines indicate the cost for the respective objective function.


A bold dotted line indicates superiority for the respective objective. Solution A dominates solution B in f1 and solution B dominates solution A in f2; therefore, they are non-dominated with respect to each other. Solution D dominates solution C in both objectives. Solution E dominates solution F, as both solutions are equal in f2 but E is better in f1. Over the past decade, many multi-objective optimizers, such as the multi-objective genetic algorithm (MOGA) [19], strength Pareto evolutionary algorithm 2 (SPEA2) [20], non-dominated sorting genetic algorithm II (NSGA-II) [21], and multi-objective evolutionary algorithm based on decomposition (MOEA/D) [22], have been proposed. Detailed studies on multi-objective optimization can be found in Coello Coello [23] and Zitzler et al. [24]. Among them, NSGA-II is one of the most widely used multi-objective optimizers.

3 Related Works

Domingos and Richardson [25,26] were the first to study influence maximization as an algorithmic problem; their methods, however, are probabilistic. Kempe et al. [11] were the first to formulate the problem as a discrete optimization problem, using a natural hill-climbing greedy approximation algorithm which guarantees that the influence spread is within (1 − 1/e) of the optimal influence spread. In their approach, they run a significantly large number of Monte-Carlo simulations to obtain an accurate estimate of the influence spread; as a result, the algorithm hits a severe performance bottleneck. Kimura and Saito [16] propose shortest-path based influence cascade models and provide efficient algorithms for computing influence spread. Leskovec et al. [5] presented 'Cost-Effective Lazy Forward' (CELF), a greedy optimization strategy that uses the submodularity property of influence maximization to reduce the number of evaluations. Gong et al. [27] extended the work of Leskovec et al. and improved the CELF approach. Chen et al. [28] proposed the NewGreedy and MixedGreedy algorithms for the IC model with uniform probabilities; however, their performance is non-steady, sometimes even worse than CELF. Zhou et al. [29,30] further enhanced CELF with an upper-bound based approach known as upper bound based lazy forward (UBLF), in which the Monte-Carlo calls in the first round are drastically reduced compared with CELF. Another line of work computes maximum influence paths (MIPs) between every pair of nodes in the network via a Dijkstra shortest-path algorithm and ignores MIPs with probability smaller than an influence threshold θ, effectively restricting influence to a local region. Wang et al. [31] proposed a new community-based greedy algorithm for mining top-k influential nodes. Barbieri et al. [32] studied social influence from a topic modeling perspective. Guo et al. [33] investigated the influence maximization problem from item-based data. Rodriguez et al. [34] studied the influence maximization problem in continuous-time diffusion networks. Goyal et al. [35] proposed an alternative approach to influence maximization which, instead of assuming influence probabilities are given as input, directly uses the past available data. In [36], the authors discussed the integral influence maximization problem


when repeated activations are involved. The complementary problem of learning influence probabilities from available data is studied in [37] and [38]. In [39], Yang and Liu worked on the IM-CM problem; they used a multi-objective discrete particle swarm optimization algorithm to identify influential seeds with minimum cost.

4 Our Proposed Approach

The influence maximization-cost minimization problem is a bi-objective optimization problem; therefore, any multi-objective optimization algorithm is suitable for finding the optimum non-dominated solutions. However, in this work we have developed "non-dominated sorting biogeography-based optimization for multi-objective optimization" (NSBBO), a multi-objective optimization algorithm based on biogeography-based optimization. NSBBO combines the best features of the non-dominated sorting genetic algorithm II (NSGA-II) [21] and biogeography-based optimization (BBO) [40]. BBO is based on the concept of the natural migration strategy. By assigning an emigration and an immigration probability to each solution relative to the fitness of the respective individual, the guided search attempts to include superior solutions in greater quantity than inferior solutions. Optionally, BBO has the capability of carrying elite solutions to the succeeding generation. A survey of the BBO algorithm can be found in [41]. Algorithm 1 presents the elementary BBO algorithm as introduced by Simon [40].

Algorithm 1. Biogeography-Based Optimization.

1  {xs} ← Randomly initialize a population of N solutions
2  while termination criteria is not satisfied do
3      Compute fitness for solutions {xs}; s = 1, 2, ..., N
4      foreach xs do
5          μs ← [0, 1] ∝ fitness of xs        /* emigration probability */
6          λs ← (1 − μs)                      /* immigration probability */
7      end
8      {x′s} ← {xs}
9      foreach x′s do
10         foreach independent variable index d ∈ [1, D] do
               /* use λs to probabilistically decide whether to immigrate to x′s */
11             rand ← generate a random number ∈ (0, 1)
12             if λs < rand then
13                 xj ← Use {μ∗} to probabilistically select the emigrating individual
14                 x′s(d) ← xj(d)
15             end
16         end
17         Probabilistically mutate x′s
18     end
19     {xs} ← {x′s}
20 end


In MOO, a non-dominated sorting algorithm such as fast-non-dominated-sort(Rt) [21] draws many Pareto fronts and assigns solutions to the respective front according to their dominance relation. Therefore, in NSBBO, the essential emigration and immigration probabilities for each solution can be drawn from the Pareto fronts. In this concept, solutions in front 0 are the fittest solutions, solutions in front 1 are the next fittest, and so on. Thus, the fitness of an individual solution is inversely proportional to its front number, and the emigration probability of a solution s can be calculated using Eq. 3:

$$\mu_s = 1 - \left(\frac{\text{front number where } s \text{ belongs}}{\text{number of fronts} - 1}\right) \tag{3}$$

The remaining operations of NSBBO are the conventional BBO operators. Algorithms 2 and 3 present NSBBO and its implementation for the IM-CM problem, respectively. As the BBO algorithm itself has good exploration/exploitation capability, we have eliminated the crowding-distance-assignment(Fi) computation and thereby improved the computational time. Our experimental studies confirm that NSBBO is particularly suitable for the network optimization problem. To utilize NSBBO for the IM-CM problem on a given social network, a population of random solutions, each containing k random non-duplicate nodes, is initialized. A seed set is a solution in the population. The influence of the seed set is calculated employing a defined propagation model M. Simultaneously, the cost of the seed set is computed using the cost function C.
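A minimal sketch of Eq. 3, assuming the non-dominated sort has already assigned each solution a 0-indexed front number (front 0 being the fittest); the function name is ours.

```python
def emigration_probabilities(front_numbers):
    """Map each solution's Pareto front number to an emigration probability (Eq. 3).

    Solutions on front 0 (the fittest) get mu = 1; solutions on the last
    front get mu = 0. The immigration probability is lambda = 1 - mu.
    """
    n_fronts = max(front_numbers) + 1
    if n_fronts == 1:
        return [1.0] * len(front_numbers)  # degenerate case: a single front
    return [1.0 - f / (n_fronts - 1) for f in front_numbers]

mus = emigration_probabilities([0, 0, 1, 2, 2, 3])
lambdas = [1.0 - mu for mu in mus]
print(mus)  # [1.0, 1.0, 0.666..., 0.333..., 0.333..., 0.0]
```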

Algorithm 2. Non-dominated Sorting Biogeography-Based Optimization.

1  {xs} ← Randomly initialize a population of N solutions
2  while termination criteria is not satisfied do
       /* compute Pareto fronts and front numbers for the solutions {xs} */
3      {(F, {xs})} ← Apply fast-non-dominated-sort
       /* compute fitness for each solution */
4      Utilize {(F, {xs})} and Eq. 3 to compute fitness for solutions {xs}; s = 1, 2, ..., N
5      foreach xs do
6          μs ← [0, 1] ∝ fitness of xs        /* emigration probability */
7          λs ← (1 − μs)                      /* immigration probability */
8      end
9      {x′s} ← {xs}
10     foreach x′s do
11         foreach independent variable index d ∈ [1, D] do
               /* use λs to probabilistically decide whether to immigrate to x′s */
12             if λs < a random number ∈ (0, 1) then
13                 xj ← Use {μ∗} to probabilistically select the emigrating individual
14                 x′s(d) ← xj(d)
15             end
16         end
17         Probabilistically mutate x′s
18     end
19     {xs} ← {x′s}
20 end


The algorithm returns a set of "Pareto optimal solutions". Subjective information is required to suitably determine the best node set from the Pareto optimal solutions.

4.1 Solution Encoding

In NSBBO, the encoding of the solutions remains one of the elementary requirements. Binary encoding and real coded encoding are the two primary types of encoding adopted in EAs. The binary encoding of a solution of the network uses an n-bit string, where n is the network size. Index i of the string indicates the ith node of the network; a 1 bit indicates that the respective node is a key node, and a 0 a non-key node. Figures 3(a) and (b) show the binary encoding for the seed sets {D, G} and {C, I} of the toy network in Fig. 2, respectively. In this representation, the length of the binary string is identical to the network size; therefore, the approach is not suitable for handling large social networks, as the representation itself may exceed memory. In contrast, the real coded encoding keeps only the set of key nodes. Consequently, it is capable of modeling solutions irrespective of network size and demands far less memory. Figures 4(a) and (b) show the real coded encoding for the seed sets {D, G} and {C, I}, respectively.

Fig. 2. A toy network

Fig. 3. Binary encoding for the seed sets {D, G} and {C, I} of the toy network, respectively.

Fig. 4. Real coded encoding for the seed sets {D, G} and {C, I} of the toy network, respectively.

Although real coded encoding is better suited for the large network optimization problem, it additionally introduces a few more challenges due to evolutionary operations.


1. The crossover, migration, and mutation operations may produce duplicate solutions. Figures 5, 6 and 7 depict these situations, respectively.

Fig. 5. Duplicate solutions generated after crossover. In the illustration, although all four parents are unique, after crossover, they produce duplicate solutions offspring-1 and offspring-4.

Fig. 6. Duplicate solutions generated after migration.

Fig. 7. Duplicate solutions generated after mutation.

2. The set size may shrink (i.e., |S| < k) due to multiple occurrences of a node within a solution, which also violates the optimization constraint. An ordered set reduces the probability of generating duplicates; however, ordering alone does not guarantee the complete elimination of duplicate nodes. Figure 8(a) illustrates how a crossover can reduce the size of a solution, and Fig. 8(b) shows how an ordered set overcomes the duplicate generation. Figures 9 and 10 illustrate the shrinking of set size due to migration and mutation, respectively; a repair sketch follows the figure captions below.


Fig. 8. (a) Shrink due to crossover in an unordered set; (b) the problem is avoided using an ordered set.

Fig. 9. Shrink due to migration.

Fig. 10. Shrink due to mutation.
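The repair sketch promised above: one possible way an ordered-set representation with a refill step keeps |S| = k after crossover. The specific repair rule (refilling from the parents' unused nodes) is our assumption, not prescribed by the paper.

```python
import random

def one_point_crossover(parent1, parent2, k):
    """Cross two ordered k-node seed sets and repair duplicates so |S| stays k."""
    cut = random.randint(1, k - 1)
    child = list(parent1[:cut])
    # take the tail from the other parent, skipping nodes already present
    for v in parent2[cut:]:
        if v not in child:
            child.append(v)
    # repair: if duplicates were skipped, refill from the parents' unused nodes
    for v in parent2[:cut] + parent1[cut:]:
        if len(child) == k:
            break
        if v not in child:
            child.append(v)
    return sorted(child)  # keep the ordered-set invariant

print(one_point_crossover(['A', 'C', 'G'], ['C', 'D', 'H'], k=3))
```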

4.2 Selection Strategy for Determining Serviceable Nodes, Population Initialization, and Mutation

The fundamental principle of population initialization for an evolutionary algorithm asserts that the solutions should be adequately distributed [42, 43]. Thus, with an equal probability of selection, solutions are initialized by randomly picking non-duplicate nodes of size k. However, for the IM-CM problem, it is observed that the best combination also arises from the leading nodes ranked by degree centrality. Therefore, engaging leading nodes with greater probability is essential, and ranking at the node level is beneficial. To define each node's selection probability, we rank each node utilizing fast-non-dominated-sort(Rt) considering two objectives, namely the node's degree and cost. Finally, we define a node's selection probability proportional to the respective front number, and a roulette wheel based selection is adopted to pick nodes. The selection probability μ̂ for a specific node v ∈ V may be calculated using Eq. 4. For simplicity of understanding, we have taken a linear selection probability; kernels [44, 45] can be utilized to obtain a nonlinear selection probability.

$$\hat{\mu}_v = 1 - \left(\frac{\text{front number where } v \text{ belongs}}{\text{number of fronts} - 1}\right) \tag{4}$$
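A hedged sketch of this node-level selection: nodes are ranked by repeatedly peeling off non-dominated sets on the two objectives (maximize degree, minimize cost), a simplified stand-in for fast-non-dominated-sort, then Eq. 4 turns front numbers into weights and a roulette wheel picks a node. The small epsilon keeping last-front nodes selectable is our choice.

```python
import random

def node_fronts(degrees, costs):
    """Assign each node a Pareto front number for (maximize degree, minimize cost)."""
    remaining = set(range(len(degrees)))
    fronts, front_no = {}, 0
    while remaining:
        # a node is non-dominated if no other node is at least as good in
        # both objectives and strictly better in one
        nd = {i for i in remaining
              if not any(degrees[j] >= degrees[i] and costs[j] <= costs[i]
                         and (degrees[j] > degrees[i] or costs[j] < costs[i])
                         for j in remaining)}
        for i in nd:
            fronts[i] = front_no
        remaining -= nd
        front_no += 1
    return fronts, front_no

def pick_node(degrees, costs):
    """Roulette-wheel selection with Eq. 4 weights (leading fronts favored)."""
    fronts, n_fronts = node_fronts(degrees, costs)
    weights = [1.0 - fronts[i] / max(n_fronts - 1, 1) + 1e-9
               for i in range(len(degrees))]
    return random.choices(range(len(degrees)), weights=weights, k=1)[0]

print(pick_node(degrees=[9, 7, 7, 2, 1], costs=[0.5, 0.2, 0.9, 0.1, 0.8]))
```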

As the ranking of nodes carries a good amount of subspace knowledge, and as the best seed set appears among the leading nodes, invoking inferior nodes is irrelevant. At the same time, limiting the search space is important to increase search efficiency, for instance by selecting the leading 40% of nodes as serviceable nodes. Unfortunately, there is no deterministic way to determine the exact fraction that guarantees the presence of the best combination in the search space; suitably determining the number of serviceable nodes for the optimum result is an open problem. However, examining several real-life networks, we have observed that the leading 30% of nodes is a good selection.

4.3 Skip Migration When a Mutation Occurs

It is observed that when a mutation happens for a node, the preceding migration operation becomes useless. Therefore, relative to the original BBO, the migration and mutation steps have been swapped: we invoke the mutation operator before migration. If a node is mutated, we proceed directly to the next node/solution; otherwise, we continue with the migration. With this, we save computation time equal to mutation probability × time for each migration step × number of generations evolved.

4.4 Evolving Mutation Rate

An evolving mutation rate enhances the convergence rate. A higher mutation probability is beneficial in the initial stages to enhance solution diversity, whereas a low mutation probability is adequate as solutions approach convergence. Thus, in our model, we have established a sigmoid-based mutation probability tied to the generation number. The mutation probability of a solution xs in a generation gen is defined in Eq. 5, where the variable maxgen is the maximum number of generations to evolve and the variable control controls the shape of the curve. In our experiment, to use a moderate and gradually decreasing mutation probability, we have used a control value of 7. Figure 11 presents the mutation probability curves for the control values {1, 2, 3, 4, 5, 6, 7, 8, 10, 15, 20, 25} over 500 generations.

$$\text{Mutation probability}(x_s, gen) = 1.05 - \frac{1}{1 + e^{-\left(\frac{gen \times control}{maxgen}\right)}} \tag{5}$$
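Under the reconstruction of Eq. 5 above (the sign convention in the printed formula is ambiguous, so the decreasing form below is an assumption matching the stated behavior), the schedule can be computed as:

```python
import math

def mutation_probability(gen, maxgen=500, control=7):
    """Sigmoid-shaped mutation schedule (Eq. 5): high early, low near convergence."""
    return 1.05 - 1.0 / (1.0 + math.exp(-gen * control / maxgen))

# roughly 0.55 at generation 0, decaying toward ~0.05 at maxgen (control = 7)
for gen in (0, 100, 250, 500):
    print(gen, round(mutation_probability(gen), 3))
```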

4.5 Consolidated Algorithm

Algorithm 3 presents the consolidated approach. Utilizing fast-non-dominated-sort and Eq. 4, line 1 computes the selection probability μ̂v for each node v and returns a map {(v, μ̂v)} representing μ̂v for the respective node v.


Fig. 11. Mutation probability over generations.

Using the map {(v, μ̂v)}, line 2 probabilistically initializes a set of N solutions {xs}; each initialized solution xs is a k-sized ordered set. The loop at line 3 iterates over generations to obtain the Pareto optimal solutions until the termination conditions are satisfied. For every solution in {xs}, line 4 computes the propagated influence under the propagation model M. Similarly, for every solution xs, using the cost function C, line 5 computes the total cost of the seed set by adding the costs of the individuals in xs. Using the computed propagation and cost matrix, line 6 invokes fast-non-dominated-sort to obtain the fronts, returning a map {(F, {xs})} of front numbers and the sets of solutions on the respective fronts; the map essentially helps to determine the fitness of a solution. Lines 7–22 invoke the BBO operators to generate offspring for the next generation. Here, lines 7–9 compute the emigration probability μs for each solution xs inversely proportional to the front number where xs belongs. Line 10 copies the population for the next generation. In the loops at lines 11 and 12, each node in each solution probabilistically evolves to generate new offspring for the next generation. Utilizing {(v, μ̂v)}, line 13 probabilistically mutates node xs(d) with another serviceable node, such that the new node is not repeated in the respective solution. If xs(d) has not mutated, the code block at lines 15–17 probabilistically migrates nodes from another solution xj; the migration also happens if the current node is duplicated in the offspring. The immigrant solution xj is chosen probabilistically using the emigration probabilities {μ∗} at line 16. At line 17, during migration, if the chosen node is already present in the immigrating solution xs, then another node from xj, other than the node at index d, is considered. Line 24 repeats the steps at lines 4–6 to compute the fitness of the solutions in the final population and to determine the Pareto front. Finally, at line 25, the Pareto front is returned as the set of optimum solutions.


Algorithm 3. Multi-objective Biogeography-Based Optimization for Influence Maximization-Cost Minimization in Social Networks

Input: Network G = (V, E, W); k: number of seed nodes; C: cost function for each node
Output: {S} s.t. set of nodes S ⊆ V

1  {(v, μ̂v)} ← Use fast-non-dominated-sort and Eq. 4 to compute the selection probability of v, ∀v ∈ V
2  Initialize population: using {(v, μ̂v)}, probabilistically create a random set of N solutions {xs}, where each solution xs is an ordered set of size k
3  while termination criteria is not satisfied do
4      Using the propagation model M, compute the influence of the solutions {xs}
5      Using the cost function C, compute the total cost of each xs by adding up the individual costs of its nodes
6      {(F, {xs})} ← Apply fast-non-dominated-sort using the influence and cost matrix
       /* Biogeography-Based Optimization */
7      foreach xs do
8          μs ← [0, 1] ∝ inverse of the front number of xs   /* use Eq. 3 to compute the fitness of xs */
9      end
10     {x′s} ← {xs}
11     foreach x′s do
12         foreach node index d ∈ {1, 2, ..., k} do
               /* mutation involving superior nodes as much as possible */
13             Utilizing {(v, μ̂v)} and Eq. 5, probabilistically mutate x′s(d) s.t. x′s(d) ∉ x′s(1, 2, ..., d−1)
14             if x′s(d) has not mutated then
                   /* use μs to probabilistically decide whether to immigrate to x′s */
15                 if μs < a random number ∈ (0, 1) or x′s(d) ∈ x′s(1, 2, ..., d−1) then
16                     xj ← Use {μ∗} to probabilistically select the emigrating individual
17                     x′s(d) ← xj(d̂) s.t. xj(d̂) ∉ x′s(1, 2, ..., d−1); d̂ ∈ {d, 1, 2, ..., k}   /* if all xj(d̂) are processed and x′s(d) is not found, then update j */
18                 end
19             end
20         end
21     end
22     {xs} ← {x′s}
23 end
24 Repeat the steps at lines 4–6 to compute the fronts
25 return {xs | xs ∈ front(1)}

5 Experimental Setup and Result Analysis

A set of 13 benchmarked network datasets having distinct clustering coefficients is used to validate the efficiency of the proposed model. Table 1 presents the datasets and their network statistics, ordered by their respective clustering coefficients2; these datasets cover a wide range of clustering coefficients. The datasets used in this study do not contain the essential parameter 'cost of nodes'; thus, for the sake of the experimental study, we have considered the normalized PageRank as the cost of each node. In this experimental study, we first present the network propagation model. The proposed IM-CM algorithm is capable of handling any network propagation model; however, for simplicity, we present the experimental results based on the linear cascade model only. Figures 12 and 13 present cascade propagation started from three different seed sets for the prisoners and dolphin networks, respectively. Both figures contain three

2 A clustering coefficient indicates the tendency of nodes to be clustered together.

Table 1. Datasets used for the study and respective network statistics

| Network | D? | W? | Nodes | Edges | Average distance | Average degree | Density | Diameter | ↓ Clustering coefficient | Triangles |
|---|---|---|---|---|---|---|---|---|---|---|
| soc-anybeat∓ | F | F | 12,645 | 67,053 | 3.1715 | 10.6055 | 0.0008 | 10 | 0.0217 | 4,13,346 |
| Hamsterster friendships† | F | F | 1,858 | 12,534 | 3.4525 | 13.4919 | 0.0073 | 14 | 0.0904 | 50,250 |
| US power grid†,§ | F | F | 4,941 | 6,594 | 18.9892 | 2.6691 | 0.0005 | 46 | 0.1032 | 1,953 |
| socfb-Indiana∓ | F | F | 29,732 | 13,05,757 | 2.9613 | 87.8351 | 0.0030 | 8 | 0.1350 | 2,81,73,249 |
| socfb-Auburn71∓ | F | F | 18,448 | 9,73,918 | 2.6755 | 105.5852 | 0.0057 | 7 | 0.1367 | 3,03,29,328 |
| Celegansneural§ | T | T | 297 | 2,359 | 3.9919 | 15.8855 | 0.0268 | 14 | 0.1807 | 9,723 |
| Polblogs§ | F | F | 1,490 | 19,090 | 3.3902 | 25.6242 | 0.0086 | 9 | 0.2260 | 3,03,129 |
| Prisoners† | F | F | 67 | 142 | 3.3546 | 4.2388 | 0.0642 | 7 | 0.2881 | 174 |
| Dolphin†,§ | F | F | 62 | 159 | 3.3570 | 5.1290 | 0.0841 | 8 | 0.3088 | 285 |
| Hep-Th§ | F | T | 8,361 | 15,751 | 7.0254 | 3.7677 | 0.0005 | 19 | 0.3296 | 39,906 |
| crack∓ | F | T | 10,240 | 30,380 | 40.9971 | 5.9336 | 0.0006 | 107 | 0.3595 | 60,423 |
| Ego-Facebook‡ | F | F | 4,039 | 88,234 | 3.6925 | 43.6910 | 0.0108 | 8 | 0.5192 | 48,36,030 |
| sc-nasasrb∓ | F | F | 54,870 | 13,11,227 | 63.8718 | 47.7939 | 0.0009 | 175 | 0.5549 | 3,51,55,833 |

D?: Is it a directed network? W?: Is it a weighted network? Data sources: † Koblenz network collection; ∓ Network data repository; § Mark Newman's personal data; ‡ Stanford large network dataset collection.

sub-figures indicating three different network propagations initiated from three different seed sets. In these figures, the seed nodes are colored with a dark brown background; the remaining nodes are colored using heatmap colors indicating traversal distance, and nodes with a white background are those left uncovered by the respective propagation. In Figs. 12(a), (b) and (c) the propagation started from the vertex sets {4, 34, 38}, {5, 47, 50}, and {3, 12, 13}, covering 96%, 90%, and 80% of the vertices, respectively. Similarly, in Figs. 13(a), (b) and (c) the propagation started from the vertex sets {22, 30, 41}, {10, 14, 43}, and {6, 25, 27}, covering 95%, 82%, and 34% of the vertices, respectively. As an influence maximization-cost minimization problem, our main aim is to identify an initial set of influential individuals that maximizes the influence at minimum cost. Thus, a multi-objective analysis on two objectives, namely 'coverage' and 'total cost of seeds', is used. Fig. 14 presents the solutions and fronts from the final generation; the front drawn in thick dark brown is the Pareto front containing the non-dominated solutions. Figure 15 presents the generation-wise improvement of solutions: each line represents a Pareto front, the generation-wise Pareto fronts are colored using heatmap colors, and the black-colored Pareto front is the final Pareto front obtained after 300 generations. By the nature of multi-objective optimization, the IM-CM problem carries more than one solution on the Pareto front. Therefore, determining the most competent solution from the set of Pareto optimal solutions requires a user preference between coverage and cost. However, for a better understanding of cost vs. coverage, in Fig. 16 we have plotted the normalized objective values; from this plot, the gain ratio on cost is quite visible for determining the best solution.
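As one way to operationalize the cost-vs-coverage trade-off plotted in Fig. 16, the sketch below normalizes both objectives over the Pareto front and returns the solution with the largest normalized coverage-minus-cost margin; this particular selection rule is our illustrative assumption, since the paper leaves the final choice to user preference.

```python
def pick_tradeoff(front):
    """front: list of (coverage, total_cost) pairs on the Pareto front.

    Normalize each objective to [0, 1] across the front, then return the
    index maximizing normalized coverage minus normalized cost.
    """
    covs = [c for c, _ in front]
    costs = [t for _, t in front]
    span = lambda xs: (max(xs) - min(xs)) or 1.0
    scores = [((c - min(covs)) / span(covs)) - ((t - min(costs)) / span(costs))
              for c, t in front]
    return max(range(len(front)), key=scores.__getitem__)

front = [(0.34, 0.02), (0.82, 0.10), (0.95, 0.45)]
print(pick_tradeoff(front))  # index of the best coverage-per-cost compromise
```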


(a) Seeds: {4, 34, 38}; Coverage: 0.96

(b) Seeds: {5, 47, 50}; Coverage: 0.90

(c) Seeds: {3, 12, 13}; Coverage: 0.80

Fig. 12. Node coverage for the prisoners network using the independent cascade propagation model, started from a set of 3 vertices.


(a) Seeds: {22, 30, 41}; Coverage: 0.95

(b) Seeds: {10, 14, 43}; Coverage: 0.82

(c) Seeds: {6, 25, 27}; Coverage: 0.34

Fig. 13. Node coverage for the dolphin network using the independent cascade propagation model, started from a set of 3 vertices.


(a) Dolphin

(b) sc-nasasrb

(c) socfb-Auburn71

(d) ego-facebook

(e) Prisoners

(f) socfb-indiana

(g) us power grid

Fig. 14. Solutions and fronts obtained after the evolution of the last generation considering two objectives ‘coverage’ and ‘total cost of seeds’.


(a) Dolphin

(b) sc-nasasrb

(c) socfb-Auburn71

(d) Crack

(e) socfb-indiana

(f) Us power grid

Fig. 15. Generation-wise Pareto fronts for the identification of a set of 3 seeds towards influence maximization-cost minimization.


(a) Dolphin

(b) sc-nasasrb

(c) socfb-Auburn71

(d) Crack

(e) socfb-indiana

(f) Us power grid


Fig. 16. Normalized 'coverage' over 'total cost of seeds' for the selection of the best solution.

6 Conclusion

The identification of influential seed nodes from a given social network is challenging. In the last decades, due to its implicit commercial value, influence maximization research has grown exponentially. Simultaneously, resource/cost minimization has also been considered, and many approaches have been proposed for the influence maximization-cost minimization problem. In this work, we have suggested a multi-objective optimization approach for solving IM-CM problems.


The proposed approach combines the best attributes of BBO and NSGA-II. In this strategy, we applied a multi-objective ranking strategy to the nodes; the node ranking allows invoking highly probable nodes in greater quantity than inferior nodes, and thus enables faster convergence. We also probabilistically initialized the population for faster convergence. In the proposed method we used real encoding instead of binary encoding: with far less information, real encoding reduces memory utilization significantly. On the other hand, we have shown that real coding inherently brings some challenges, such as the existence of duplicate nodes, and in the preceding sections we discussed the mechanisms to overcome these challenges. The approach further improves performance by invoking mutation before migration, and we recommend an evolving mutation strategy for faster convergence. With a set of experimental studies on many real-world benchmarked network datasets, we have validated the developed model. The multi-objective plots demonstrate the improvement of the Pareto front over generations. As the final Pareto front consists of many solutions, a plot of the normalized cost and coverage of the Pareto optimal solutions helps to determine the gain of coverage over cost.

References

1. Li, F., Du, T.C.: The effectiveness of word of mouth in offline and online social networks. Expert Syst. Appl. 88, 338–351 (2017)
2. Guille, A., Hacid, H., Favre, C., Zighed, D.A.: Information diffusion in online social networks: a survey. ACM SIGMOD Rec. 42(2), 17–28 (2013)
3. Budak, C., Agrawal, D., El Abbadi, A.: Limiting the spread of misinformation in social networks. In: Proceedings of the 20th International Conference on World Wide Web, pp. 665–674. ACM (2011)
4. He, X., Song, G., Chen, W., Jiang, Q.: Influence blocking maximization in social networks under the competitive linear threshold model. In: Proceedings of the 2012 SIAM International Conference on Data Mining, pp. 463–474. SIAM (2012)
5. Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J., Glance, N.: Cost-effective outbreak detection in networks. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 420–429. ACM (2007)
6. Ye, M., Liu, X., Lee, W.-C.: Exploring social influence for recommendation: a generative model approach. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 671–680. ACM (2012)
7. Ienco, D., Bonchi, F., Castillo, C.: The meme ranking problem: maximizing microblogging virality. In: 2010 IEEE International Conference on Data Mining Workshops, pp. 328–335. IEEE (2010)
8. Li, Y., Zhang, D., Tan, K.-L.: Real-time targeted influence maximization for online advertisements. Proc. VLDB Endow. 8(10), 1070–1081 (2015)
9. Li, Y., Fan, J., Wang, Y., Tan, K.-L.: Influence maximization on social graphs: a survey. IEEE Trans. Knowl. Data Eng. 30(10), 1852–1872 (2018)
10. Chen, S., Fan, J., Li, G., Feng, J., Tan, K.-L., Tang, J.: Online topic-aware influence maximization. Proc. VLDB Endow. 8(6), 666–677 (2015)


11. Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through a social network. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 137–146. ACM (2003)
12. Liu, B., Cong, G., Xu, D., Zeng, Y.: Time constrained influence maximization in social networks. In: 2012 IEEE 12th International Conference on Data Mining, pp. 439–448. IEEE (2012)
13. Censor, Y.: Pareto optimality in multiobjective problems. Appl. Math. Optim. 4(1), 41–59 (1977)
14. Rashotte, L.: Social influence. In: The Blackwell Encyclopedia of Sociology (2007)
15. Valente, T.W.: Network models of the diffusion of innovations. Comput. Math. Organ. Theory 2(2), 163–164 (1996)
16. Kimura, M., Saito, K.: Tractable models for information diffusion in social networks. In: European Conference on Principles of Data Mining and Knowledge Discovery, pp. 259–271. Springer (2006)
17. Kempe, D., Kleinberg, J., Tardos, É.: Influential nodes in a diffusion model for social networks. In: International Colloquium on Automata, Languages, and Programming, pp. 1127–1138. Springer (2005)
18. Zhou, C., Guo, L.: A note on influence maximization in social networks from local to global and beyond. Proc. Comput. Sci. 30, 81–87 (2014)
19. Fonseca, C.M., Fleming, P.J.: Genetic algorithms for multiobjective optimization: formulation discussion and generalization. In: Proceedings of the 5th International Conference on Genetic Algorithms, San Francisco, CA, USA, pp. 416–423. Morgan Kaufmann Publishers Inc. (1993)
20. Zitzler, E., Laumanns, M., Thiele, L.: SPEA2: improving the strength Pareto evolutionary algorithm. TIK-report, vol. 103 (2001)
21. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6, 182–197 (2002)
22. Zhang, Q., Li, H.: MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Trans. Evol. Comput. 11(6), 712–731 (2007)
23. Coello Coello, C.A.: A short tutorial on evolutionary multiobjective optimization. In: International Conference on Evolutionary Multi-Criterion Optimization, pp. 21–40. Springer (2001)
24. Zitzler, E., Laumanns, M., Bleuler, S.: A tutorial on evolutionary multiobjective optimization. In: Metaheuristics for Multiobjective Optimisation, pp. 3–37. Springer (2004)
25. Domingos, P., Richardson, M.: Mining the network value of customers. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 57–66. ACM (2001)
26. Richardson, M., Domingos, P.: Mining knowledge-sharing sites for viral marketing. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 61–70. ACM (2002)
27. Gong, M., Yan, J., Shen, B., Ma, L., Cai, Q.: Influence maximization in social networks based on discrete particle swarm optimization. Inf. Sci. 367, 600–614 (2016)
28. Chen, W., Wang, Y., Yang, S.: Efficient influence maximization in social networks. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 199–208. ACM (2009)
29. Zhou, C., Zhang, P., Guo, J., Zhu, X., Guo, L.: UBLF: an upper bound based approach to discover influential nodes in social networks. In: 2013 IEEE 13th International Conference on Data Mining, pp. 907–916. IEEE (2013)


30. Zhou, C., Zhang, P., Guo, J., Guo, L.: An upper bound based greedy algorithm for mining top-k influential nodes in social networks. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 421–422. ACM (2014)
31. Wang, Y., Cong, G., Song, G., Xie, K.: Community-based greedy algorithm for mining top-k influential nodes in mobile social networks. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1039–1048. ACM (2010)
32. Barbieri, N., Bonchi, F., Manco, G.: Topic-aware social influence propagation models. Knowl. Inf. Syst. 37(3), 555–584 (2013)
33. Guo, J., Zhang, P., Zhou, C., Cao, Y., Guo, L.: Item-based top-k influential user discovery in social networks. In: 2013 IEEE 13th International Conference on Data Mining Workshops, pp. 780–787. IEEE (2013)
34. Rodriguez, M.G., Schölkopf, B.: Influence maximization in continuous time diffusion networks. arXiv preprint arXiv:1205.1682 (2012)
35. Goyal, A., Bonchi, F., Lakshmanan, L.: A data-based approach to social influence maximization. Proc. VLDB Endow. 5(1), 73–84 (2011)
36. Zhou, C., Zhang, P., Zang, W., Guo, L.: Maximizing the cumulative influence through a social network when repeat activation exists. Proc. Comput. Sci. 29, 422–431 (2014)
37. Goyal, A., Bonchi, F., Lakshmanan, L.: Learning influence probabilities in social networks. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 241–250. ACM (2010)
38. Saito, K., Nakano, R., Kimura, M.: Prediction of information diffusion probabilities for independent cascade model. In: International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, pp. 67–75. Springer (2008)
39. Yang, J., Liu, J.: Influence maximization-cost minimization in social networks based on a multiobjective discrete particle swarm optimization algorithm. IEEE Access 6, 2320–2329 (2017)
40. Simon, D.: Biogeography-based optimization. IEEE Trans. Evol. Comput. 12(6), 702–713 (2008)
41. Guo, W., Chen, M., Wang, L., Mao, Y., Wu, Q.: A survey of biogeography-based optimization. Neural Comput. Appl. 28(8), 1909–1926 (2017)
42. Whitley, D.: A genetic algorithm tutorial. Stat. Comput. 4(2), 65–85 (1994)
43. Chou, C., Chen, J.: Genetic algorithms: initialization schemes and genes extraction. In: 2000 Ninth IEEE International Conference on Fuzzy Systems, FUZZ-IEEE 2000, vol. 2, pp. 965–968. IEEE (2000)
44. Nasrabadi, N.M.: Pattern recognition and machine learning. J. Electron. Imaging 16(4), 049901 (2007)
45. Schölkopf, B., Smola, A.J., et al.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge (2002)

Classification of Credit Dataset Using Improved Particle Swarm Optimization Tuned Radial Basis Function Neural Networks

Trilok Nath Pandey1, Parimal Kumar Giri2(B), and Alok Kumar Jagadev3

1 Department of Computer Science and Engineering, S 'O' A Deemed to be University, Bhubaneswar 751030, India
[email protected]
2 Department of Information and Communication Technology, Fakir Mohan University, Balasore 756019, India
[email protected]
3 School of Computer Science, KIIT Deemed to be University, Bhubaneswar 751020, India
[email protected]

Abstract. Credit risk assessment acts as a survival weapon in almost every financial institution. It involves an in-depth and sensitive analysis of various economic, social, demographic, and other pertinent data provided by the customers and about the customers for building a more accurate and robust electronic finance system. The classification problem is one of the primary concerns in the process of analyzing the gamut of data; however, its complexity has motivated us to use machine learning-based approaches. In this paper, a radial basis function neural network with particle swarm optimization (RBFNN + PSO) and an improved particle swarm optimization tuned radial basis function neural network (RBFNN + IMPSO) are studied, and their effectiveness for credit risk assessment is compared. The experimental findings draw a clear line between the proposed model and traditional learning algorithms; moreover, the proposed method is very promising vis-à-vis individual classifiers.

Keywords: RBFN · Credit risk · PSO · IMPSO · Neural network · Machine learning

1 Introduction

Credit risk assessment acts as a survival weapon in almost every financial institution [1]. In this paper, we have collected the German and Australian credit risk data from the UCI repository [2]. Credit risk assessment involves an in-depth and sensitive analysis of various financial, social, demographic, and other pertinent data provided by the customers and about the customers for building a more accurate and robust electronic finance system [3]. The classification problem is one of the primary concerns in the process of analyzing the gamut of data [4]; however, its complexity has motivated us to use machine learning-based approaches. In this paper, some machine learning algorithms have been studied and their effectiveness for credit risk assessment compared.


Further, as an extension of our study, we have applied the proposed IMPSO tuned RBFNN model to the above financial data for classification, and we have made an experimental comparison of various models with the proposed model. From the experimental analysis, we conclude that the proposed model performs better in comparison to the other models, as discussed in the result analysis section.

2 Literature Survey

Financial institutions are organizations that provide lending services to individuals as well as to large conglomerations of individuals; therefore, they are directly involved in the financial strength of a country. Credit risk analysis is one of the essential components of the financial sector, and several researchers have carried out many experiments to classify credit data [5]. From a rigorous study, we have found that credit data can be classified more accurately using neural networks, as these have a better approximation capacity [6]. Again, from the literature survey, we have observed that the radial basis function neural network (RBFNN) classifies credit data better than other statistical and neural network algorithms [7]. A comparison of the different models used in this survey for credit risk data classification on the German and Australian datasets [8] is shown in Table 1.

Table 1. Classification of Australian and German credit data

| Classification models | Australian (Accuracy) (%) | German (Accuracy) (%) |
|---|---|---|
| Bayesian [24] | 86.96 | 77.10 |
| Decision tree [8] | 90.72 | 85.50 |
| KNN [9] | 89.10 | 72.20 |
| SVM [28] | 85.94 | 78.40 |
| K-means [30] | 80.40 | 79.20 |
| Bagging [25] | 89.13 | 87.90 |
| MLPNN [26] | 86.95 | 73.00 |
| RBFNN [9] | 91.36 | 89.72 |
| FLANN [29] | 88.89 | 86.66 |
| Committee machine [27] | 90.87 | 89.20 |

From the above literature survey, we found that RBFNN gives better results for the prediction and classification of financial data. RBFNNs are data-driven and self-adaptive methods with few prior assumptions [9]. They are also good predictors, with the ability to make generalized observations from what is learned from the original data. RBFNNs are universal approximators, as such a network can efficiently approximate a continuous function to the desired level of accuracy, and they are very efficient in solving nonlinear problems, including those in the real world. Therefore, in this research work, we


have considered RBFNN to predict and classify the credit scoring data [2]. From a careful study, we have found that the performance of RBFNN can be improved if we fine-tune parameters such as the spread, center, and width of the RBFNN [10]. From the survey, it has also been observed that if, instead of choosing the initial solution randomly, we select it using the bio-inspired particle swarm optimization (PSO) algorithm, the performance of the model can be improved [11, 12]. Therefore, in our research work, we have used the improved particle swarm optimization tuned radial basis function algorithm (IMPSO+RBFNN) to overcome the above-mentioned problems. In the next section, we discuss the RBFNN, PSO, and IMPSO algorithms.

3 Machine Learning Techniques Used for Credit Risk Analysis

In this section, we discuss several machine learning techniques used for the classification and optimization of credit risk problems.

3.1 Radial Basis Function Neural Networks

The radial basis function neural network is a member of the family of artificial neural networks (ANNs) and was proposed by Montazer et al. [13]. It differs from neural networks with sigmoid activation functions in that it utilizes basis functions in the hidden layer that are locally responsive to the input stimulus. The underlying architecture of the RBF network is shown in Fig. 1. A radial basis function (RBF) is embedded in a two-layer neural network, where each hidden unit implements a radially activated function and the output units implement a weighted sum of the hidden unit outputs. While the input into an RBF network is non-linear, the output is often linear. Their excellent approximation capabilities have been studied by Zhong et al. [14]. Owing to their non-linear approximation properties, RBF networks can model complex mappings that sigmoidal neural networks can only model by employing multiple intermediary layers [15]. As shown in Fig. 1, the ideal output, the actual output, and the weight values of the output layer can be obtained from the RBF neural network. Choosing the Gaussian function as the radial basis function, the weight value wik is adjusted to satisfy the following formula, from which the final results of the RBF neural network are obtained. Thus, the prediction error E can be calculated by Eq. (1):

$$E = \sum_{k=1}^{n} \left(y_k - \bar{y}_k\right)^2 = \sum_{k=1}^{n} \left(y_k - \sum_{i=1}^{m} w_{ik}\,\phi_i(x)\right)^2 \tag{1}$$

where $\phi_i(x) = \exp\left(-\|x - c_i\|^2 / (2\sigma^2)\right)$, m is the number of neurons in the hidden layer with i ∈ {1, 2, ..., m}, n is the number of neurons in the output layer with k ∈ {1, 2, ..., n}, wik is the weight between the ith hidden neuron and the kth output, φi is the radial basis function, ci is the center vector of the ith neuron, σ is the width, and ȳk is the network output of the kth neuron. From the analysis of the RBFNN algorithm, it has been observed that substantial modeler involvement is required to construct a quality RBFNN [17]. Another issue that requires modeler involvement is the selection of a quality set of model inputs [18]. Both of these steps represent a combinatorial problem, and their solution can be largely or partly automated through the application of optimization algorithms such as Particle Swarm Optimization (PSO), Artificial Bee Colony (ABC), Ant Colony Optimization (ACO), Biogeography Based Optimization (BBO), etc. [16, 19].
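A minimal NumPy sketch of the forward computation behind Eq. 1, with Gaussian basis functions; the centers, width, and weights would normally be learned, and the constants below are purely illustrative.

```python
import numpy as np

def rbf_outputs(x, centers, sigma, weights):
    """RBFNN forward pass: phi_i(x) = exp(-||x - c_i||^2 / (2 sigma^2)),
    then y_k = sum_i w_ik * phi_i(x)."""
    phi = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * sigma ** 2))
    return phi @ weights  # weights has shape (m hidden units, n outputs)

def prediction_error(X, Y, centers, sigma, weights):
    """Sum-of-squared-errors E of Eq. 1 over a dataset (rows of X, targets Y)."""
    preds = np.array([rbf_outputs(x, centers, sigma, weights) for x in X])
    return float(np.sum((Y - preds) ** 2))

centers = np.array([[0.0, 0.0], [1.0, 1.0]])  # m = 2 hidden units (illustrative)
weights = np.array([[0.7], [0.3]])            # n = 1 output (illustrative)
X = np.array([[0.2, 0.1], [0.9, 0.8]])
Y = np.array([[1.0], [0.0]])
print(prediction_error(X, Y, centers, sigma=0.5, weights=weights))
```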


Fig. 1. Basic architecture of RBF neural network

In the next section, we discuss one of the most commonly used optimization techniques, the canonical PSO.

3.2 Canonical Particle Swarm Optimization Algorithm

Particle swarm optimization is a nature-inspired algorithm proposed by Kennedy and Eberhart [20]. PSO received its inspiration from bird flocking, fish schooling, and herds of animals. In PSO, a set of randomly generated solutions (the initial swarm) propagates in the design space towards the optimal solution over several iterations (moves), based on a large amount of information about the design space that is assimilated and shared by all members of the swarm [21]. Kennedy and Eberhart describe a complete chronicle of the development of the PSO algorithm from a mere motion simulator to a heuristic optimizer. The standard PSO algorithm broadly consists of three computational steps: (a) generation of particle positions and velocities; (b) updating the velocity of each particle; and (c) updating the position of each particle. Particles model the swarm in a multidimensional space, each having a position x(t) and a velocity v(t). These particles fly through hyperspace (i.e., R^n) and have two essential reasoning capabilities: the memory of their own best position (pBest) and knowledge of the global or neighborhood best (gBest). In their simplest form, the velocity and position update equations that govern PSO are given in Eqs. (2) and (3):

v(t + 1) = λ ∗ v(t) + c1 ∗ rand ∗ (pBest − x(t)) + c2 ∗ rand ∗ (gBest − x(t))    (2)


x(t + 1) = x(t) + v(t + 1)    (3)

where λ is the inertia weight and c1 and c2 are constant acceleration coefficients. In the next section, we discuss how the performance of PSO can be enhanced by replacing the random function with a chaotic map function.

3.3 Improved PSO Tuned RBFNN Algorithm

We have also observed from intensive studies that the PSO algorithm is slow at searching around the global optimum, so the search performance can be improved by using an improved PSO algorithm to optimize the RBFNN parameters. The search speed of the PSO algorithm is improved by decreasing the inertia weight (λ) gradually as the number of iterations increases; this shrinks the search space around the global optimum as the number of generations grows. After each generation, the best particle of the previous generation replaces the worst particle of the current generation. Researchers have suggested several selection strategies. In this research work, we apply two types of selection strategy sequentially for the inertia weight: one linear and the other non-linear. In the linear selection, (λ) reduces rapidly, while around the optimum (λ) reduces slowly. Mathematically, it can be described as in Eq. (4):

λ(i) = λ0 − (λ1 ∗ i)/g1,    i = 1, 2, 3, …, g1,    (4)

where λ0 is the initial value of the inertia weight, λ1 is the endpoint of the linear selection, g1 is the number of generations for linear selection, and g2 is the number of generations for non-linear selection. According to the proposed algorithm, for generations 1 to g1 the inertia weight for PSO is calculated as given in Eq. (4). In Eq. (2), λ is normally taken as constant over the total number of iterations; here we instead gradually decrease the value of λ, either linearly or non-linearly, as the number of iterations increases, thereby reducing the search space for the global optimum. For generations g1 to g2, the inertia weight for IMPSO is calculated according to Eq. (5):

λ(i) = (λ0 − λ1) ∗ exp((g1 + 1 − i)/i),    i = g1, g1+1, …, g2.    (5)

Generally, the values of g1 and g2 are selected empirically. We have considered the total number of generations as 200; the linear and the non-linear selection of the inertia weight each take place for half of the total number of generations. The algorithmic description of the proposed model is given in Algorithm 1.
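A minimal Python sketch of this two-stage inertia-weight schedule (ours; it assumes λ0 = 0.8, λ1 = 0.5, and g1 = 100 with 200 total generations, as in Table 3):

import numpy as np

# Two-stage inertia weight of Eqs. (4) and (5): linear decrease for the first
# g1 generations, then a non-linear decay up to g2.
def inertia(i, lam0=0.8, lam1=0.5, g1=100, g2=200):
    if i <= g1:                               # Eq. (4): linear selection
        return lam0 - (lam1 * i) / g1
    return (lam0 - lam1) * np.exp((g1 + 1 - i) / i)   # Eq. (5): non-linear selection

print([round(inertia(i), 3) for i in (1, 50, 100, 150, 200)])

Note that the linear stage ends at λ0 − λ1 = 0.3, which is exactly where the non-linear stage of Eq. (5) picks up, so the schedule is continuous across the two stages.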


Algorithm 1: Improved PSO tuned RBFNN algorithm (IMPSO)
Input: A set of centers from the given dataset
Output: A set of optimized center values
Initialization:
  For each particle do
    Initialize the particle position and velocity using Eqs. (2) and (3);
  End for
While the stopping criteria are not satisfied do
  Calculate the inertia weight using Eq. (4) or (5), depending on the generation number;
  For each particle do
    Calculate the fitness value using the MSE of the RBFNN;
    If the fitness value is better than the best fitness value then
      Set the current position as pBest;
      Choose the global best (gBest) as the particle with the best fitness value among all particles;
    End if
  End for
  For each particle do
    Calculate the particle velocity using Eq. (2);
    Update the particle position (center, spread, and weight) using Eq. (3);
  End for
End while

4 Experimental Works

Most credit scoring applications employ accuracy as the criterion for performance evaluation. It represents the proportion of correctly predicted cases (good and bad) on a particular data set. However, empirical and theoretical evidence shows that this measure is strongly biased with respect to data imbalance and the proportions of correct and incorrect predictions. In this section, we describe the detailed analysis of the experimental work carried out with our proposed model to classify the German and Australian credit data. The computational complexity of the proposed algorithm may change for different datasets, depending on their size and the parameter values.

4.1 Environment

In this work, the simulation of the biologically inspired algorithm is carried out in the MATLAB (version 7.10.0.499 (R2010a)) environment. The operating systems used are Linux Mint 17.2 and Windows 7, with a hardware configuration of 2 GB of RAM and an Intel processor.


4.2 Parameters Details

The important parameters used for RBFNN are the center, spread, and weight. The symbols used for RBFNN, PSO, and the chaotic improved PSO for the classification of credit data are described in Tables 2 and 3.

Table 2. Description of parameters used in RBFNN for credit dataset classification.

Symbol  Description                Considered value/size
n       Number of input vectors    1000 / 690
D       Desired output vector      1000 × 1 / 690 × 1
m       Number of hidden neurons   14
w       Weight vector              14 × 1
N       Number of input neurons    21 / 16
C       Centre matrix              14 × 21 / 14 × 16

Table 3. Description of parameters used in RBFNN + IMPSO for credit data classification.

Symbol  Description                            Considered value/size
λ0      Initial inertia weight                 0.8
λ1      Inertia weight for linear selection    0.5
c1      Local search coefficient               0.9
c2      Global search coefficient              0.9
g1      Generations for linear selection       100
g2      Generations for non-linear selection   100

5 Result Analysis

In general, the standard performance evaluation criteria in the field of credit scoring include accuracy, mean squared error, and Type-I and Type-II errors. For a two-class problem, most of these metrics can be derived from a 2 × 2 confusion matrix, as discussed in [22], where each entry (i, j) contains the number of good or bad applicants. Most credit scoring applications employ accuracy as the criterion for performance evaluation [23]; it represents the proportion of correctly predicted cases (good and bad) on a particular dataset. In this work, we present an experimental evaluation of our proposed model and a performance comparison with other models. We have selected the number of epochs as well as the number of nodes in the hidden layer of the network as the experimental parameters.
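For reference, one common way to derive the Table 4 metrics from a 2 × 2 confusion matrix is sketched below (ours; conventions for Type-I and Type-II errors vary across the credit scoring literature, so the labels in the comments are one possible assignment):

# Metrics from a 2x2 confusion matrix [[TP, FN], [FP, TN]] for a two-class problem.
def credit_metrics(tp, fn, fp, tn):
    return {
        "Type-I error":  fp / (fp + tn),   # one common convention: bad applicant predicted good
        "Type-II error": fn / (fn + tp),   # good applicant predicted bad
        "sensitivity":   tp / (tp + fn),
        "specificity":   tn / (tn + fp),
        "accuracy":      (tp + tn) / (tp + fn + fp + tn),
    }

print(credit_metrics(tp=480, fn=20, fp=35, tn=465))   # illustrative counts only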

Table 4. Comparison of IMPSO model with RBFNN and RBFNN+PSO model

Model        Dataset     Type-I error  Type-II error  Sensitivity  Specificity  Accuracy
RBF          German      0.1084        0.1142         0.8857       0.8831       88.52%
RBF          Australian  0.0977        0.0935         0.9064       0.9022       90.43%
RBF + PSO    German      0.1465        0.0302         0.9697       0.8534       92.89%
RBF + PSO    Australian  0.0540        0.0312         0.9459       0.9447       95.97%
RBF + IMPSO  German      0.0778        0.0538         0.9461       0.9221       93.78%
RBF + IMPSO  Australian  0.0533        0.0190         0.9809       0.9466       96.34%

Fig. 2. Accuracy of German dataset

Different experiments have been performed to analyze the efficiency of the RBFNN trained with the improved particle swarm optimization algorithm for classifying the German and Australian credit risk data. From the simulation results, we observe that the specificity, sensitivity, Type-I error, Type-II error, and accuracy of the proposed model are better than those of RBFNN and the canonical PSO tuned RBFNN. If the credit-granting policy of a financial institution is too generous, it will be exposed to high credit risk. Hence, the improved PSO tuned RBFNN is found to be a better classifier than the RBFNN and RBFNN+PSO algorithms. The comparison of the IMPSO model with the RBFNN and RBFNN+PSO models on the German and Australian datasets is shown in Table 4 and Figs. 2 and 3.


Fig. 3. Accuracy of Australian dataset

From the experimental results, we observe that our RBF+IMPSO model performs better than the RBF and RBF+PSO models (Fig. 3).

6 Conclusions

In this work, canonical PSO based RBFNN and IMPSO based RBFNN models are effectively used for the classification of the German and Australian credit datasets. The canonical PSO tuned RBFNN and the IMPSO tuned RBFNN models make credit data classification simpler and involve less computation than other models reported earlier. The experimental study demonstrates that the performance of the IMPSO tuned RBFNN model is better than that of the canonical PSO tuned RBFNN model for a higher number of iterations. In future work, we may classify the credit dataset using other evolutionary optimization techniques, such as ABC, BBO, and Krill Herd algorithms, for further optimization of the RBFNN parameters to achieve the desired model. A rigorous study of the convergence and stability of the proposed model is also part of our future work.

References
1. Bask, A., Merisalo, H., Tinnila, M., Lauraeus, T.: Towards E-banking: the evolution of business models in financial services. Int. J. Electron. Finance 5(4), 333–356 (2011)
2. Frank, A., Asuncion, A.: UCI Machine Learning Repository. Technical Report, University of California, Irvine, School of Information and Computer Sciences (2010). http://archive.ics.uci.edu/ml
3. Curran, K., Orr, J.: Integrating geolocation into electronic finance applications for additional security. Int. J. Electron. Finance 5(3), 272–285 (2011)


4. Zheng, H., Zhang, Y., Liu, J., Wei, H., Zhao, J., Liao, R.: A novel model based on wavelet LS-SVM integrated improved PSO algorithm for forecasting of dissolved gas contents in power transformers. Electric. Power Syst. Res. 155, 196–205 (2018)
5. Chorowski, J., Wang, J., Zurada, M.J.: Review and comparison of SVM and ELM based classifiers. Neurocomputing 128, 506–516 (2014)
6. Danenas, P., Garsva, G., Gudas, S.: Credit risk evaluation using SVM classifier. In: International Conferences on Computational Science, pp. 1699–1709 (2011)
7. Pandey, T.N., Jagadev, A.K., Choudhury, D., Dehuri, S.: Comparison of classification techniques used for credit risk assessment in financial modeling. Int. J. Manage. IT Eng. 3(5), 180–201 (2013)
8. Dietterich, T.G.: Experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach. Learn. 40, 139–157 (2000)
9. Pandey, T.N., Jagadev, A.K., Choudhury, D., Dehuri, S.: Machine learning based classifiers ensemble for credit risk assessment. Int. J. Electron. Finance 7(3/4), 227–249 (2013)
10. Danenas, P., Garsva, G.: Selection of support vector machine based classifier for credit risk. Expert Syst. Appl. 42, 3194–3204 (2015)
11. Hsu, F.J., Chen, M.Y., Chen, Y.C.: The human-like intelligence with bio-inspired computing approach for credit ratings prediction. Neurocomputing 279, 11–18 (2018)
12. Maldonado, S., Bravo, C., López, J., Perez, J.: Integrated framework for profit-based feature selection and SVM classification in credit scoring. Decis. Support Syst. 104, 113–121 (2017)
13. Montazer, G., Giveki, D., Karami, M., Rastegar, H.: Radial basis function neural networks: a review. Comput. Rev. J. 1(1), 52–74 (2018)
14. Zhong, H., Miao, C., Shen, Z., Feng, Y.: Comparing the learning effectiveness of BP, ELM, I-ELM and SVM for corporate credit rating. Neurocomputing 128, 285–295 (2014)
15. Montazer, G.A., Giveki, D.: An improved radial basis function neural network for object image retrieval. Neurocomputing 168, 221–233 (2015)
16. Giri, P., De, S., Dehuri, S.: Adaptive neighbourhood for locally and globally tuned biogeography based optimization algorithm. J. King Saud Univ. Comput. Inf. Sci. (2018). https://doi.org/10.1016/j.jksuci.2018.03.013
17. Huang, G.B., Chen, L., Siew, C.K.: Universal approximation using incremental networks with random hidden computation nodes. IEEE Trans. Neural Netw. 17(4), 1243–1289 (2006)
18. Dash, C.S.K., Behera, A.K., Dehuri, S., Cho, S.B.: Radial basis function neural networks: a topical state-of-the-art survey. Open Comput. Sci. 6, 33–63 (2016)
19. Marques, A.I., Garcia, V., Sanchez, J.S.: A literature review on the application of evolutionary computing to credit scoring. J. Oper. Res. Soc. 64(9), 1384–1399 (2013)
20. Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: International Conference on Neural Networks, pp. 1942–1948. Springer (1995)
21. Yuxiang, S., Qing, C.H.: RBF neural network based on particle swarm optimization. In: International Symposium on Neural Networks, pp. 169–176. Springer (2010)
22. Kruppa, J., Schwarz, A., Arminger, G., Ziegler, A.: Consumer credit risk: individual probability estimates using machine learning. Expert Syst. Appl. 40, 5125–5131 (2013)
23. Pandey, T.N., Jagadev, A.K., Mohapatra, S.K., Dehuri, S.: Credit risk analysis using machine learning classifiers. In: International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS), pp. 1850–1854 (2017)
24. Shih, K.H., Hung, H.F., Lin, B.: Construction of classification models for credit policies in banks. Int. J. Electron. Finance 4(1), 1–18 (2010)
25. Tsai, C.F., Chen, M.L.: Credit rating by hybrid machine learning technique. Appl. Soft Comput. 10, 374–380 (2010)
26. Wang, G., Ma, J.: A hybrid ensemble approach for enterprise credit risk assessment based on SVM. Expert Syst. Appl. 39, 5325–5331 (2012)


27. Wang, G., Hao, J., Ma, J.: A comparative assessment of ensemble learning for credit scoring. Expert Syst. Appl. 38, 223–230 (2011)
28. Yu, L., Yue, W., Wang, S., Lai, K.K.: SVM based multi agent ensemble learning for credit risk evaluation. Expert Syst. Appl. 37, 1351–1360 (2010)
29. Zhang, Z., Gao, G., Shi, Y.: Credit risk analysis using multi-criteria optimization classifier with kernel, fuzzyfication and penalty factor. Eur. J. Oper. Res. 237, 335–348 (2014)
30. Zhou, H., Lan, Y., Soh, Y.C., Huang, G.B.: Credit risk evaluation using extreme learning machine. In: International Conferences on Systems Man and Cybernetics (IEEE), pp. 1064–1069 (2012)

Multi-Verse Optimization of Multilayer Perceptrons (MV-MLPs) for Efficient Modeling and Forecasting of Crude Oil Prices Data

Sarat Chandra Nayak1(B), Ch. Sanjeev Kumar Dash2, Bijan Bihari Mishra2, and Satchidananda Dehuri3

1 Department of Computer Science and Engineering, CMR College of Engineering & Technology, Hyderabad 501401, India
[email protected]
2 Department of Computer Science, Silicon Institute of Technology, Bhubaneswar, India
[email protected], [email protected]
3 Department of Information and Communication Technology, Fakir Mohan University, Vyasa Vihar, Balasore 756019, Odisha, India
[email protected]

Abstract. Arbitrary changes in crude oil prices make forecasting quite difficult. Multilayer neural networks have been found effective in predicting such crude oil prices, but crafting an optimal neural network architecture requires numerous trial and error methods. This article presents a hybrid model based on multi-verse optimization (MVO) of a multilayer perceptron (MLP), termed MV-MLP, where a universe/individual of MVO represents a potential MLP in the universe of discourse. A set of such universes forms a population, and the best universe, i.e. the optimal MLP, is selected through a search process. The search process starts with a random population and gradually moves toward the global optimum, so the optimal MLP is obtained on the fly rather than being fixed beforehand. The proposed MV-MLP is evaluated on forecasting crude oil prices, and its predictive performance is established through a comparative study with other similarly trained models. Experimental results and the comparative study suggest the superiority of MV-MLP based forecasting.

Keywords: Crude oil price forecasting · Multi-verse optimizer · Multilayer perceptron · Artificial neural network · Evolutionary optimization algorithm

1 Introduction

Crude oil is a precious commodity for the economic and industrial development of a country as well as for the global economy. Even a nominal change in the crude oil price impacts petroleum prices, crude oil products, and the global economy. The price depends upon multiple socio-economic as well as political factors and is associated with highly nonlinear structure, hence its future is difficult to predict. Computational intelligence methods such as artificial neural networks (ANNs) have shown promising results in forecasting the behavior of the crude oil price [1–5].


The adjustment of neuron weights and biases is the key factor in ANN training and requires frequent human intervention; the performance of an ANN depends solely on the adjustment of the weight and bias vectors. To circumvent the limitations of gradient descent based ANN training, a large number of nature- and bio-inspired optimization techniques have been proposed and applied to financial time series forecasting [6–8]. Evolutionary and swarm intelligence based learning techniques such as GA [9], PSO [10], and DE [11] are popular for training ANN based forecasts. The performance of these techniques mainly depends on the proper adjustment of algorithm-specific control parameters, and no single technique performs well on all problems. The multi-verse optimizer (MVO) is a recently proposed metaheuristic [12] inspired by the hypothesis of the existence of multiple universes and interactions among them through white holes, black holes, and wormholes. It is a population based optimization technique capable of finding the global optimum. MVO achieves exploration of the search space through black and white holes, while exploitation of the search space is achieved through wormholes. The optimization process applies a few rules to the universes (solutions/individuals) to perform the search. Universes communicate with each other by sending objects (variables of a solution). The algorithm assigns each universe (solution) an inflation rate proportional to its fitness value. A higher inflation rate of a solution indicates a greater possibility of having white holes, hence a greater tendency to send objects; a universe with a lower inflation rate is more likely to have black holes, hence a greater tendency to receive objects. Irrespective of inflation rate, objects in the universes make random movements toward the best universe obtained so far through wormholes. The application of MVO to economic forecasting is a new direction, and it has perhaps not yet been applied to crude oil price forecasting. The objective of this article is to explore the capability of MVO for training MLPs. The synaptic weight vectors and biases of the MLP are adjusted with MVO to obtain an optimal MLP; hence the model is termed MV-MLP. The proposed MV-MLP is used to forecast the future prices of four crude oil price time series and is evaluated through error statistics such as minimum, maximum, mean, and standard deviation. A comparative performance analysis is carried out considering other forecasting systems: gradient descent based MLP (GD-MLP), PSO based MLP (PSO-MLP), GA based MLP (GA-MLP), and DE based MLP (DE-MLP). The article is structured into five major sections. An introduction is presented in Sect. 1. Related works are discussed in Sect. 2. The proposed MV-MLP is presented in Sect. 3. Section 4 summarizes the experimental results and discussion. The concluding remarks are given in Sect. 5.

2 Related Work

Several related studies in the crude oil price forecasting domain are found in the literature, including statistical as well as artificial intelligence based methods; we present a short description of these methods here. The conventional category of forecasting models includes statistical models such as the autoregressive moving average (ARMA), autoregressive integrated moving average (ARIMA), generalized autoregressive conditional heteroscedasticity (GARCH), and their


hybridization. Many authors have proposed this approach for financial and oil market forecasting. Gavriilidis et al. studied the effect of including oil price shocks of different origins in a set of GARCH-X models [13]. Herrera et al. evaluated the relative performance of various econometric models using high-frequency intra-day volatility data [14]. A hybrid metaheuristic approach based on ANN, ARIMA, and support vector machines (SVM) was proposed by Naderi et al. [15]. A nonlinear metabolic grey model corrected by ARIMA (NMGM-ARIMA) for enhanced accuracy in forecasting China's foreign oil dependence has been demonstrated by Wang et al. [16]. The advanced category of methods includes artificial intelligence, machine learning, soft computing, and their hybridizations. Huang and Wu [17] developed a deep multiple kernel learning approach for energy commodity price forecasting. Hybrid linear and non-linear techniques for energy demand forecasting in China and India were proposed by the authors in [18]. To study the nonlinear, complex nature of crude oil price movements, Chen et al. [19] proposed a deep learning based model and achieved improved forecasting accuracy. A GA and fast ensemble empirical mode decomposition (GA-FEEMD) approach for forecasting crude oil price time series has been proposed by the authors in [20]; they compared the proposed model with ARIMA and ANN and found improvements. A PSO optimized gray Markov model was suggested by Hu et al. [21], with smaller errors and better accuracy than others. A hybrid PSO and radial basis function neural network is proposed in [22] that outperformed other models. A forecasting model based on the Grey Wolf Optimizer is proposed for short-term forecasting of energy commodities [23]; compared with models trained with artificial bee colony optimization and differential evolution, it was found more competitive. A combination of kernel principal component analysis and DE optimized support vector machine is proposed for crude oil price prediction [24]. A survey of computational methods for crude oil price forecasting has been done by the authors in [25]. These metaheuristics are population based and able to land at the global optimum with a reasonable computational cost. MVO is a recently proposed metaheuristic [12] inspired by the hypothesis of the existence of multiple universes; it is a population based optimization technique capable of finding the global optimum, achieving exploration of the search space through black and white holes and exploitation through wormholes. We found a lack of MVO applications in ANN training. There is one application of MVO to feature selection and support vector machine parameter optimization for classification problems, and another application of MVO based voltage stability analysis through continuation power flow is found in [26].

3 MV-MLP

The proposed MV-MLP uses an MLP with one hidden layer as the base architecture. The error correction is supervised learning. The MLP has a single output unit to estimate the one-day-ahead price. The input layer neurons use a linear transfer function; the hidden and output neurons use the sigmoid activation function. A universe of MVO can be viewed as a potential weight and bias vector for the MLP. A set of such universes forms a population, and the best universe, i.e. the optimal MLP, is selected through the search process. The process starts with a random population and gradually moves toward the global optimum, so the optimal MLP is obtained on the fly rather than being fixed beforehand. The MV-MLP process is shown in Fig. 1. The process starts with a set of random solutions (weight vectors + biases) and updates them over a finite number of iterations. As per the basic MVO, the j-th parameter of the i-th solution, w_i^j, is updated as:

Multi-Verse Optimization of Multilayer Perceptrons (MV-MLPs)

49

and the optimal MLP is obtained at fly rather fixing earlier. The MV-MLP process is shown in Fig. 1. The process starts with a set of random solutions (weight vectors + bias) and updates them through number of iterations. As per the basic MVO, the   a finite j th th j parameter of i solution wi is updated as: ⎧⎧ ⎨ w j + T D R + ((ub j − lb j ) ∗ rand3 + lb j ), i f rand2 < 0.5 ⎪ ⎪ ⎨ i f rand < W E P j 1 wi = ⎩ − T D R + ((ub j − lb j ) ∗ rand3 + lb j ), i f rand2 ≥ 0.5 w ⎪ ⎪ ⎩ j j W Roulettewheel i f rand1 ≥ W E P

⎫ ⎪ ⎪ ⎬ ⎪ ⎪ ⎭

(1)

where rand1, rand2, and rand3 are three random numbers in [0, 1]; W^j is the j-th parameter of the best weight set; ub_j and lb_j are the upper and lower bounds of the j-th element; and w_Roulettewheel^j is the j-th element of a solution selected by the roulette wheel method. WEP is the Wormhole Existence Probability coefficient, calculated as:

WEP = min + current_iter ∗ (max − min)/Max_iteration    (2)

TDR is the Travelling Distance Rate coefficient, calculated as:

TDR = 1 − (current_iter/Max_iteration)^(1/p)    (3)

The parameters WEP and TDR decide how frequently and to what extent solutions change during the optimization process. The parameter p is the exploitation accuracy; increased exploitation is achieved by increasing the WEP value. Exploration is achieved by replacing the j-th element of the i-th solution with that of a solution selected by the roulette wheel mechanism, which also helps in avoiding local optima. In this step, the current solution and the best solution obtained so far contain the black hole and white hole, respectively. The trade-off between exploration and exploitation is achieved by changing WEP and TDR adaptively during the optimization process. The search process improves the inferior solutions using the best solution obtained so far, choosing white holes by roulette wheel selection relative to their fitness cost and creating black holes inversely proportional to the fitness value.
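A condensed Python sketch of this update cycle is given below (ours; the WEP bounds min = 0.2, max = 1.0 and p = 6 follow the defaults suggested in [12] and are assumptions here, as are the scalar variable bounds):

import numpy as np

# One MVO generation following Eqs. (1)-(3); fitness is an error to minimize,
# so the inflation rate is taken inversely proportional to it.
def mvo_step(pop, fitness, best, lb, ub, it, max_it, wep=(0.2, 1.0), p=6.0):
    wep_t = wep[0] + it * (wep[1] - wep[0]) / max_it          # Eq. (2)
    tdr_t = 1.0 - (it / max_it) ** (1.0 / p)                  # Eq. (3)
    inflation = 1.0 / (1e-12 + fitness)                       # lower error -> higher rate
    probs = inflation / inflation.sum()
    new_pop = pop.copy()
    for i in range(pop.shape[0]):
        for j in range(pop.shape[1]):
            r1, r2, r3 = np.random.rand(3)
            if r1 < wep_t:                                    # wormhole: move around best
                step = tdr_t * ((ub - lb) * r3 + lb)
                new_pop[i, j] = best[j] + step if r2 < 0.5 else best[j] - step
            else:                                             # white/black hole exchange
                donor = np.random.choice(pop.shape[0], p=probs)
                new_pop[i, j] = pop[donor, j]
    return np.clip(new_pop, lb, ub)

# Illustrative usage on a 5-universe population of 10-dimensional weight vectors
pop = np.random.uniform(-1, 1, (5, 10))
fit = np.random.rand(5)
pop = mvo_step(pop, fit, pop[fit.argmin()], -1.0, 1.0, it=1, max_it=100)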


Fig. 1. MV-MLP process
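Before turning to the experiments, the sketch below shows, under our own assumed layout, how a flat universe vector can be decoded into the one-hidden-layer MLP of this section and used for a one-day-ahead estimate; the sizes (5 inputs, 8 hidden neurons) are illustrative only:

import numpy as np

# Decode one flat "universe" (all weights and biases) into MLP parameters.
def decode(universe, n_in, n_hidden):
    w1_end = n_in * n_hidden
    W1 = universe[:w1_end].reshape(n_in, n_hidden)          # input -> hidden weights
    b1 = universe[w1_end:w1_end + n_hidden]                 # hidden biases
    W2 = universe[w1_end + n_hidden:w1_end + 2 * n_hidden]  # hidden -> output weights
    b2 = universe[-1]                                       # output bias
    return W1, b1, W2, b2

def mlp_forecast(universe, x, n_in, n_hidden):
    W1, b1, W2, b2 = decode(universe, n_in, n_hidden)
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))                # sigmoid hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))             # sigmoid output unit

dim = 5 * 8 + 8 + 8 + 1                                     # n_in=5, n_hidden=8
print(mlp_forecast(np.random.default_rng(1).uniform(-1, 1, dim),
                   np.random.default_rng(2).uniform(size=5), 5, 8))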

4 Experimental Results and Discussion

The experimental data are extracted from the US Department of Energy, Energy Information Administration web site (http://www.eia.doe.gov/) for the period April 1983 to July 2019. The daily crude oil price series has 9105 data points, the weekly series has 1891 data points, and the monthly and annual series contain 434 and 36 data points, respectively. Inputs for the model are selected by the sliding window method and normalized using the sigmoidal method [8]. The MVO parameters are set as suggested in [12]. The comparative models are trained in the same manner as MV-MLP. Average error statistics over twenty simulations are considered for the comparison study. Forecasting errors from all models are summarized in Table 1. It is observed that MV-MLP generated the lowest average error for three of the four data series; in the case of the weekly time series, both PSO-MLP and MV-MLP have the same average error of 0.0095. The actual oil prices versus the prices estimated by MV-MLP are plotted in Figs. 2, 3, 4 and 5. These plots show that the estimated prices are very close to the actual prices and, more often than not, follow the trend of the time series.
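The sliding-window construction mentioned above can be sketched as follows (ours; the window length of 5 and the squashing used for normalization are assumptions, following [8] only loosely):

import numpy as np

# Build (window, target) pairs for one-step-ahead forecasting.
def sliding_windows(series, window=5):
    X = np.array([series[t:t + window] for t in range(len(series) - window)])
    y = series[window:]                       # one-step-ahead targets
    return X, y

prices = np.cumsum(np.random.default_rng(3).normal(size=200)) + 50    # toy series
norm = 1.0 / (1.0 + np.exp(-(prices - prices.mean()) / prices.std())) # squash to (0, 1)
X, y = sliding_windows(norm)
print(X.shape, y.shape)    # (195, 5) (195,)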


Table 1. Forecasting errors from five models

Dataset   Error statistic     GD-MLP      GA-MLP       DE-MLP       PSO-MLP     MV-MLP
Daily     Minimum             6.0334e−03  6.0063e−03   5.20036e−04  3.0073e−03  3.5500e−04
          Maximum             0.0482      0.0394       0.0375       0.0381      0.0355
          Average             0.0198      0.0193       0.0189       0.0157      0.0119
          Standard deviation  0.0085      0.0080       0.0075       0.0074      0.0048
Weekly    Minimum             6.0508e−04  5.2524e−05   3.0527e−05   3.1522e−05  1.2451e−05
          Maximum             0.0587      0.0468       0.0472       0.0448      0.0444
          Average             0.0128      0.0120       0.0098       0.0095      0.0095
          Standard deviation  0.0168      0.0153       0.0055       0.0055      0.0043
Monthly   Minimum             3.1004e−03  3.5227e−05   4.056e−05    3.7764e−05  2.3493e−05
          Maximum             0.0966      0.0573       0.0558       0.0554      0.0548
          Average             0.0189      0.0177       0.0168       0.0157      0.0159
          Standard deviation  0.0135      0.0142       0.0089       0.0077      0.0110
Annual    Minimum             3.2237e−03  5.14870e−04  5.02875e−04  4.5275e−04  4.5042e−04
          Maximum             0.0482      0.0474       0.0399       0.0409      0.0355
          Average             0.0438      0.0390       0.0369       0.0367      0.0308
          Standard deviation  0.0065      0.0058       0.0062       0.0058      –


Fig. 2. MV-MLP estimated v/s actual prices from daily prices series


Fig. 3. MV-MLP estimated v/s actual prices from weekly prices series


Fig. 4. MV-MLP estimated v/s actual prices from monthly prices series

Fig. 5. MV-MLP estimated v/s actual prices from annual prices series

5 Concluding Remarks

For efficient modeling and forecasting of chaotic crude oil price time series, this article presents a hybrid model called MV-MLP. The optimal weight sets and bias values for the MLP are adjusted by MVO. The model uses the effective learning ability of MVO and the good approximation capacity of the MLP to forecast the uncertain behavior of crude oil data. The MV-MLP forecast is validated on the prediction of one-step-ahead oil prices and compared with four other models trained in the same way. Observations from the experimental


results and comparative studies suggest the superiority of the proposed model, which may be adopted as a promising alternative tool for crude oil price forecasting. The method can be applied to other financial time series, and the base architecture may be replaced by higher order neural networks.

References
1. Hamdi, M., Aloui, C.: Forecasting crude oil price using artificial neural networks: a literature survey. Econ. Bull. 3(2), 1339–1359 (2015)
2. Chiroma, H., Abdulkareem, S., Herawan, T.: Evolutionary neural network model for West Texas Intermediate crude oil price prediction. Appl. Energy 142, 266–273 (2015)
3. Mahdiani, M.R., Khamehchi, E.: A modified neural network model for predicting the crude oil price. Intellect. Econ. 10(2), 71–77 (2016)
4. Chiroma, H., Abdul-kareem, S., Shukri Mohd Noor, A., Abubakar, A.I., Sohrabi Safa, N., Shuib, L., Herawan, T.: A review on artificial intelligence methodologies for the forecasting of crude oil price. Intell. Autom. Soft Comput. 22(3), 449–462 (2016)
5. Wang, J., Wang, J.: Forecasting energy market indices with recurrent neural networks: case study of crude oil price fluctuations. Energy 102, 365–374 (2016)
6. Shadbolt, N.: Nature-inspired computing. IEEE Intell. Syst. 19(1), 2–3 (2004)
7. Nayak, S.C., Misra, B.B.: Estimating stock closing indices using a GA-weighted condensed polynomial neural network. Financ. Innov. 4(1), 21 (2018)
8. Nayak, S.C., Misra, B.B., Behera, H.S.: Efficient financial time series prediction with evolutionary virtual data position exploration. Neural Comput. Appl. 31(2), 1053–1074 (2019)
9. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading (1989)
10. Kennedy, J., Eberhart, R.C.: Swarm Intelligence. Morgan Kaufmann, San Francisco (2001)
11. Price, K., Storn, R., Lampinen, J.: Differential Evolution: A Practical Approach to Global Optimization. Springer, Berlin (2005)
12. Mirjalili, S., Mirjalili, S.M., Hatamlou, A.: Multi-verse optimizer: a nature-inspired algorithm for global optimization. Neural Comput. Appl. 27(2), 495–513 (2016)
13. Gavriilidis, K., Kambouroudis, D.S., Tsakou, K., Tsouknidis, D.A.: Volatility forecasting across tanker freight rates: the role of oil price shocks. Transp. Res. Part E: Logist. Transp. Rev. 118, 376–391 (2018)
14. Herrera, A.M., Hu, L., Pastor, D.: Forecasting crude oil price volatility. Int. J. Forecast. 34(4), 622–635 (2018)
15. Naderi, M., Khamehchi, E., Karimi, B.: Novel statistical forecasting models for crude oil price, gas price, and interest rate based on meta-heuristic bat algorithm. J. Petrol. Sci. Eng. 172, 13–22 (2019)
16. Wang, Q., Li, S., Li, R.: China's dependency on foreign oil will exceed 80% by 2030: developing a novel NMGM-ARIMA to forecast China's foreign oil dependence from two dimensions. Energy 163, 151–167 (2018)
17. Huang, S.C., Wu, C.F.: Energy commodity price forecasting with deep multiple kernel learning. Energies 11(11), 3029 (2018)
18. Wang, Q., Li, S., Li, R.: Forecasting energy demand in China and India: using single-linear, hybrid-linear, and non-linear time series forecast techniques. Energy 161, 821–831 (2018)
19. Chen, Y., He, K., Tso, G.K.: Forecasting crude oil prices: a deep learning based model. Procedia Comput. Sci. 122, 300–307 (2017)
20. Akrom, N., Ismail, Z.: A hybrid GA-FEEMD for forecasting crude oil prices. Indian J. Sci. Technol. 10, 31 (2017)


21. Hu, H., Zhai, X., Guan, X.: Crude oil output forecasting based on PSO of unbiased Gray Markov model. In: 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), pp. 644–647. IEEE (2017)
22. Chandar, S.K., Sumathi, M., Sivanandam, S.: Crude oil prediction using a hybrid radial basis function network. J. Theor. Appl. Inf. Technol. 72(2) (2015)
23. Yusof, Y., Mustaffa, Z.: Time series forecasting of energy commodity using grey wolf optimizer. In: Proceedings of the International Multi Conference of Engineers and Computer Scientists (IMECS 2015), vol. 1, no. 1 (2015)
24. Hu, H., Fan, L., Guan, X.: The research on modeling and simulation of crude oil output prediction based on KPCA-DE-SVM. In: 2017 2nd IEEE International Conference on Computational Intelligence and Applications (ICCIA), pp. 93–97. IEEE (2017)
25. Gabralla, L.A., Abraham, A.: Computational modeling of crude oil price forecasting: a review of two decades of research. Int. J. Comput. Inf. Syst. Ind. Manag. Appl. 5, 729–740 (2013)

Application of Machine Learning to Predict Diseases Based on Symptoms in Rural India

Suvasree S. Biswal1, T. Amarnath1, Prasanta K. Panigrahi2, and Nrusingh C. Biswal3(B)

1 Department of Information Technology, Indian Institute of Information Technology, Bhubaneswar, Odisha, India
[email protected], [email protected]
2 Department of Physical Sciences, Indian Institute of Science Education and Research, Kolkata, West Bengal, India
[email protected]
3 Department of Radiation Oncology, University of Maryland School of Medicine, Baltimore, MD 21201, USA
[email protected]

Abstract. Living in places far from hospitals and medical aid, it is often difficult for people in rural India to have a disease diagnosed at an early stage. Though hospitals in urban areas use advanced technology for diagnosis and prognosis, disease prediction remains a hypercritical task. Providing sophisticated and accurate algorithms and techniques to overcome this issue would be revolutionary. In this paper, we present a machine learning technique using the Decision Tree Algorithm (DTL) to interconnect the symptoms, rearrange them, and retrieve the most probable diagnosis. This technique allows the system to learn without explicit programming. This paper presents a system for the diagnosis of common diseases by entering the symptoms into the system.

Keywords: Machine learning · Decision Tree Algorithm · Information gain · Entropy · Root node · Disease diagnosis

1 Introduction

Machine learning has been an active research area in the medical community for some time now, and it may take over decision making in the diagnosis of many health problems [1–4]. Significant attempts have been made to enhance computer-aided diagnosis applications, because errors in medical diagnostic systems can result in seriously misleading medical treatments. Machine learning is used to generate more imaginative and predictive systems and is important in computer-aided diagnosis. The machine learning approach helps to integrate computer-based systems into the healthcare field in order to obtain the best and most accurate results [5]. Clinicians often use a trial and error approach for predicting diseases based on the clinical investigations available. Predicting diseases has been one of the major challenges


both in past years and today. There is a great need for something that predicts diseases early based on available symptoms, which could save millions of lives and a great deal of money in health care. In this paper, we have used a very simple algorithm which creates a decision tree from the training samples provided to it. Decision trees are used to approximate a discrete-valued function, where the final learned function is the decision tree itself. Decision trees are more useful than regular algorithms, as they can have improved readability [6]. The tree does this by reducing all values to individual nodes and then generating a simple if-else case for each node pair. Decision trees have been very useful in various fields, from medical diagnosis to areas such as stock market analysis. The algorithm takes all the symptoms into consideration and creates the shortest tree from them. The attributes (symptoms) are split across the tree in such a manner that it takes minimal time to reach a class (leaf) node from the root node of the tree [7]. Further pruning of the tree makes it more efficient to use for predictions.

2 Materials and Methods

A decision tree uses a tree-like model of possible outcomes and contains only conditional control statements. Decision trees are commonly used in research and in decision-making strategies. The decision tree model is built from the training set (pairs of input and output variables) and can be used to solve regression as well as classification problems. Our aim here is to classify symptoms into probable human diseases. A decision tree has a tree-like structure in which each internal node represents a test on an attribute, each branch from that node represents an outcome of the test, and each leaf node represents an outcome class. A path from the root node to a leaf node represents a classification rule. A decision tree works on the SOP (Sum of Products) form, also known as DNF (Disjunctive Normal Form). In building a decision tree, the major challenge is identifying the attribute for the root node at each level; this process is known as attribute selection. Two popular attribute selection measures, information gain and entropy, are explained below. A representation of a sample decision tree generated from the algorithm is shown in Fig. 2.

2.1 Information Gain

The information gain is calculated as the difference in entropy before and after splitting a dataset with respect to an attribute. The attribute with the highest information gain is selected for splitting the dataset further (Fig. 2).

Gain(T, X) = Entropy(T) − Entropy(T, X)    (1)

where T is the target and X is the symptom on which the tree is split.


2.2 Entropy

Entropy is a measure of the impurity contained in an attribute, computed as the sum of the probabilities multiplied by their logarithms. The lower the entropy, the more efficient the decision tree. It is expressed as:

E = − Σ_i p_i log2(p_i)    (2)

2.3 Algorithm

Create Decision Tree (Training examples, Target disease, Symptoms)
• Create a root node for the decision tree.
• If all training examples are of the same disease (T), return a leaf node with disease 'T'.
• Repeat the following steps until all symptoms have been used for splitting:
  o Calculate the total entropy of the current state/node, H(S).
  o For each symptom, calculate the entropy with respect to the symptom 'X', denoted by H(S, X).
  o Calculate the information gain IG(S, X) for all the symptoms.
  o Select the symptom that has the maximum IG.
  o Remove the symptom that offers the highest IG from the set of symptoms.
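A compact Python sketch of this procedure (our ID3-style illustration, not the authors' implementation) is given below; rows are assumed to be dictionaries mapping symptom names to values plus a "disease" label, and the entropy uses log base 2 as in Eq. (2):

import math
from collections import Counter

def entropy(rows):
    counts = Counter(r["disease"] for r in rows)
    return -sum((c / len(rows)) * math.log2(c / len(rows)) for c in counts.values())

def info_gain(rows, symptom):
    # Gain(T, X) = Entropy(T) - Entropy(T, X), Eq. (1)
    split = Counter(r[symptom] for r in rows)
    rem = sum((n / len(rows)) * entropy([r for r in rows if r[symptom] == v])
              for v, n in split.items())
    return entropy(rows) - rem

def build_tree(rows, symptoms):
    diseases = {r["disease"] for r in rows}
    if len(diseases) == 1 or not symptoms:         # leaf: one disease or no splits left
        return Counter(r["disease"] for r in rows).most_common(1)[0][0]
    best = max(symptoms, key=lambda s: info_gain(rows, s))
    rest = [s for s in symptoms if s != best]
    return {best: {v: build_tree([r for r in rows if r[best] == v], rest)
                   for v in {r[best] for r in rows}}}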

Fig. 1. Architecture diagram: the training examples are used as an input to the given algorithm to get the decision tree which is further used to predict the disease from the symptoms


Fig. 2. A representation of a sample decision tree generated from the algorithm

3 Results and Discussion

According to the algorithm stated above, we first calculate the total entropy of the root node by summing the products of the different probabilities with their logarithms, which was found to be E(S) = 0.909. Next, we calculate the entropies for all the attributes under consideration, first for Blood Pressure, then for Sugar, and so on. After calculating the entropies, we calculate the information gain for each attribute by subtracting its entropy from the total entropy at the root node. The entropy of the root node for the listed diseases is given in Table 1. A sample calculation follows.

Calculating the information gain for Blood Pressure:

E(B.P. = Low) = −(1/2)·log(1/2) − (1/2)·log(1/2) = 0.3010
E(B.P. = Normal) = −(1/5)·log(1/5) − (1/5)·log(1/5) − (1/5)·log(1/5) − (2/5)·log(2/5) = 0.5785
E(B.P. = High) = −(2/6)·log(2/6) − (1/6)·log(1/6) − (2/6)·log(2/6) − (1/6)·log(1/6) = 0.5774
I(Blood Pressure) = (2/13)·0.3010 + (5/13)·0.5785 + (6/13)·0.5774 = 0.5353
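These figures can be checked with a few lines of Python (ours); note that the worked example uses base-10 logarithms, which is why E(B.P. = Low) = 0.3010:

import math

def ent10(counts):
    n = sum(counts)
    return -sum((c / n) * math.log10(c / n) for c in counts)

e_low, e_norm, e_high = ent10([1, 1]), ent10([1, 1, 1, 2]), ent10([2, 1, 2, 1])
i_bp = (2 * e_low + 5 * e_norm + 6 * e_high) / 13
print(round(e_low, 4), round(e_norm, 4), round(e_high, 4), round(i_bp, 4))
# 0.301 0.5786 0.5775 0.5354 -> gain = 0.909 - 0.5353 = 0.3737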


Table 1. Calculating entropy for the root node

Disease                     Occurrences  Probability (pi)  pi * log2(pi)
Heart Failure               2            0.166666667       −0.129691875
Kidney Damage               1            0.083333333       −0.089931771
Atherosclerosis             1            0.083333333       −0.089931771
Decrease in Oxygen Levels   1            0.083333333       −0.089931771
Diabetes                    3            0.25              −0.150514998
Hypoglycemia                1            0.083333333       −0.089931771
Heart arrhythmia            1            0.083333333       −0.089931771
Hyponatremia                1            0.083333333       −0.089931771
Hypernatremia               1            0.083333333       −0.089931771
E(S)                        12                              0.909729266

Information Gain (Blood Pressure) = E(S) − E(B.P.) = 0.909 − 0.5353 = 0.3737

Calculating the information gain for Sugar:

E(Sugar = Lower than Normal) = −(1/3)·log(1/3) − (1/3)·log(1/3) − (1/3)·log(1/3) = 0.4771
E(Sugar = Normal) = −(1/4)·log(1/4) − (1/4)·log(1/4) − (1/4)·log(1/4) − (1/4)·log(1/4) = 0.6020
E(Sugar = Higher than Normal) = −(1/6)·log(1/6) − (1/6)·log(1/6) − (1/6)·log(1/6) − (2/6)·log(2/6) − (1/6)·log(1/6) = 0.6778
I(Sugar) = (3/13)·0.4771 + (4/13)·0.6020 + (6/13)·0.6778 = 0.6081

Information Gain (Sugar) = E(S) − E(Sugar) = 0.909 − 0.6081 = 0.3009

Similarly, we can calculate the information gain for all the symptoms. The symptom with the maximum information gain is considered for splitting at the root node;


the root node will then have multiple branches based on the number of possible values of the symptom. Successively, the size of the training set decreases as we go down the tree. At each level, we perform the above calculation to find the symptom on which the training examples are split, until there are no more symptoms left to split the decision tree on (Table 2).

Table 2. Training examples

BP = Blood Pressure (120/80); Sugar (70–100/140); HB = Heartbeat rate (60–100 bpm); Na = Blood Sodium level (135–145 mEq/L); K = Potassium level (3.5–5 mEq/L); Preg = Is Pregnant; Asth = Is Asthmatic; Smk = Smokes Regularly.

BP     | Sugar             | HB                | Na                | K    | Preg | Asth | Smk | Probable Disease
High   | Higher than Normal | Lower than Normal | Higher than Normal | Low  | No   | No   | Yes | Heart Failure
High   | Higher than Normal | Lower than Normal | Lower than Normal  | High | No   | No   | No  | Kidney Damage
High   | Normal             | Lower than Normal | Normal             | Low  | No   | Yes  | Yes | Heart Failure
Normal | Normal             | Normal            | Higher than Normal | High | No   | No   | No  | Atherosclerosis
Low    | Higher than Normal | Lower than Normal | Lower than Normal  | High | No   | No   | No  | Decrease in Oxygen Levels
Normal | Normal             | Lower than Normal | Lower than Normal  | Low  | No   | No   | No  | Diabetes
Normal | Higher than Normal | Normal            | Lower than Normal  | High | Yes  | Yes  | No  | Diabetes
Normal | Lower than Normal  | Higher than Normal | Higher than Normal | Low  | No   | No   | No  | Hypoglycemia
High   | Lower than Normal  | Higher than Normal | Normal             | Low  | No   | No   | Yes | Heart arrhythmia
High   | Normal             | Normal            | Higher than Normal | High | No   | No   | Yes | Heart arrhythmia
Low    | Higher than Normal | Higher than Normal | Higher than Normal | Low  | Yes  | Yes  | No  | Diabetes
Normal | Lower than Normal  | Normal            | Lower than Normal  | High | No   | No   | No  | Hyponatremia
High   | Higher than Normal | Lower than Normal | Higher than Normal | High | Yes  | Yes  | Yes | Hypernatremia

4 Summary/Conclusion

We used supervised machine learning algorithms to develop a system that allows a doctor to predict probable diseases based on observed patient symptoms, which results in better diagnosis and further treatment. In this paper, the generated decision tree predicts the most probable disease from the provided symptoms. The more symptoms captured, the more efficient the decision tree and the more accurate the prediction. The users of this system will mostly be doctors, as well as people who encounter some symptom of a disease of which they are unaware. The user needs to input the list of symptoms


which they are experiencing; the corresponding output will be the probable disease with the maximum likelihood. The proposed work will be further enhanced to diagnose and take precautionary measures to prevent the occurrence of a specific disease, such as diabetes.

References
1. Keerrthega, M.C., Thenmozhi, D.: Identifying disease–treatment relations using machine learning approach. Procedia Comput. Sci. 87, 306–315 (2016)
2. Chen, M., Hao, Y., Hwang, K., Wang, L., Wang, A.L.: Disease prediction by machine learning over big data from healthcare communities. Digit. Object Identifier 5, 8869–8879 (2017)
3. Fatima, M., Pasha, M.: Survey of machine learning algorithms for disease diagnostic. J. Intell. Learn. Syst. Appl. 9, 1–16 (2017)
4. Ramalingam, V.V., Dandapath, A., Raja, M.K.: Heart disease prediction using machine learning techniques: a survey. Int. J. Eng. Technol. 7(2.8), 684–687 (2018)
5. Singh, P., Singh, S., Pandi-Jain, G.S.: Effective heart disease prediction system using data mining techniques. Int. J. Nanomed. 13, 121–124 (2014)
6. Kim, Y.J., Park, H.: Improving prediction of high-cost health care users with medical check-up data. Big Data 7(3), 163–175 (2019)
7. Kononenko, I.: Machine learning for medical diagnosis: history, state of the art and perspective. Artif. Intell. Med. 23, 89–109 (2001)

Classification of Real Time Noisy Fingerprint Images Using Flann

Annapurna Mishra1(B), Satchidananda Dehuri2, and Pradeep Kumar Mallick3

1 Department of Electronics and Communication Engineering, Silicon Institute of Technology, Silicon Hills, Patia, Bhubaneswar 751024, Odisha, India
[email protected]
2 Department of Information and Communication Technology, Fakir Mohan University, Vyasa Vihar, Balasore 756019, Odisha, India
[email protected]
3 School of Computer Engineering, KIIT Deemed to Be University, Bhubaneswar 751024, Odisha, India
[email protected]

Abstract. In this work, we have examined a novel biogeography based optimized FLANN for classifying noisy fingerprints as a biometric classifier. Here, the database is collected in real time from ten different persons, and features of five different classes of fingerprints are extracted using a Gabor filter bank. The results show that this method is robust enough to classify the fingerprints with good accuracy. We have used biogeography based optimized functional link artificial neural networks (BBO-FLANN) for the task of classifying noisy, distorted fingerprint images.

Keywords: Noisy · Gabor · FLANN

1 Introduction

Fingerprints are the vivid flow-like structures on human fingers. Owing to their individuality, security, longevity, and ease of acquisition, fingerprints are the most extensively used biometric characteristic. The pattern of ridges and furrows on the human fingertips forms the fingerprint image. Scrutinizing this pattern at various levels discloses different types of characteristics: global features and local features. In this paper we extract the unique global features of noisy, distorted fingerprints collected in a real-time database using the filter bank approach [10, 12]. The features are collected as feature vectors for each individual fingerprint and stored in a feature sheet, and classification is then tested using a novel biogeography based optimized FLANN. The functional link ANN [2, 21] is an excellent classifier; here we use BBO as an optimizer to tune the weight parameters of the network for faster convergence with the least mean squared error [1]. Artificial neural networks (ANNs) are a robust tool for many complicated applications, including function approximation, non-linear system identification and control, unsupervised classification, and optimization [4, 5]. ANNs are capable of generating complicated mappings and are thus able to form arbitrarily complex nonlinear decision boundaries [18]. Hence, in this study we use the functional link artificial neural network to solve the classification problem [7, 19]. The FLANN is primarily a flat network: the need for a hidden layer is removed, and thus the learning algorithm implemented in this network becomes very simple [6, 8]. The functional expansion efficiently increases the dimensionality of the input vector, and hence the hyperplanes generated by the FLANN together with BBO give greater classification accuracy in the input pattern space.

Real Time Database Collection and Feature Extraction

Fingerprint classification is a two-step process consisting of feature extraction and classification. Features can be extracted from a collected group of sample fingerprint images called the database images. Frequently used standard databases are NIST 9, DB, etc., but here we have collected the images in real time to form a database of 50 samples. The collected database is a group of 50 fingerprint sample images from students of the Silicon Institute of Technology, Bhubaneswar. Thus a total of 50 fingerprints covering all 5 classes of fingerprints (here 4 classes, combining arch and tented arch into one class) form the database for further processing, such as pre-processing and feature extraction. Not all of the sample images were noisy, so to make them noisy we added Gaussian as well as salt-and-pepper noise in different percentages. The resulting noisy database is used for feature extraction using the Gabor filter bank. The extracted features are stored in an Excel sheet and passed on to classification (Fig. 1).
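The functional expansion mentioned above can be sketched as follows (our illustration; the paper does not spell out its expansion, so the trigonometric terms, the expansion order, and the tanh output are assumptions):

import numpy as np

# FLANN-style trigonometric expansion: each input x_i is mapped to
# [x_i, sin(k*pi*x_i), cos(k*pi*x_i), ...], and a single weight layer acts on
# the expanded vector, so no hidden layer is needed.
def expand(x, order=2):
    feats = [x]
    for k in range(1, order + 1):
        feats += [np.sin(k * np.pi * x), np.cos(k * np.pi * x)]
    return np.concatenate(feats)

def flann_output(x, weights, bias, order=2):
    z = expand(x, order) @ weights + bias
    return np.tanh(z)                      # squashing output, an assumption

x = np.array([0.2, -0.4, 0.7])
w = np.random.default_rng(4).normal(size=3 * (2 * 2 + 1))   # 3 inputs * 5 terms each
print(flann_output(x, w, bias=0.1))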

Fig. 1. Feature extraction process

Pre-processing: Fingerprint enhancement is the operation of adjusting the digital fingerprint image so that the result is more suitable for display or for further image analysis. It is primarily done to improve image quality and to simplify subsequent operations, as shown in Fig. 2. Fingerprint images from different sources often fall short of adequate contrast and clarity, so image enhancement is a necessity [10]. It increases the contrast between ridges and furrows and joins some of the erroneously broken ridge points caused by an inadequate quantity of ink or the degraded quality of the sensor input.


Feature Extraction: The global ridge and furrow structures help in deducing the category of a fingerprint, so a justifiable feature set for fingerprint classification should capture this global information efficiently [22]. The established filter bank based fingerprint representation renders both the smallest attributes and the global ridge and furrow structures of a fingerprint. Our representation is adjusted so that it is very efficient at depicting the global ridge and furrow structures and is invariant to individual small attributes, for the purpose of classification. There are certain limitations in the developed representation scheme, which are adjusted to our fingerprint classification algorithm [3, 12]. To generate the four component images, a fingerprint is convolved with four Gabor filters (θ = 0°, 45°, 90°, and 135°). Thus, our feature vector is 152-dimensional (38 × 4), as shown in Fig. 2. Our experimental outputs show that most of the ridge directionality information present in a fingerprint image is captured by the four component images, and thus an authentic representation is formed. We demonstrate this by reconstructing a fingerprint image by adding together all four filtered images; the reconstructed image is equivalent to the actual image without a remarkable loss of information.
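A sketch of such a four-orientation Gabor bank is given below (ours; the kernel size, spatial spread, and ridge frequency are assumptions, not the paper's tuned values):

import numpy as np

# Four-orientation Gabor bank (0, 45, 90, 135 degrees).
def gabor_kernel(theta, ksize=15, sigma=4.0, freq=0.1):
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)      # rotate coordinates
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * xr)

bank = [gabor_kernel(np.deg2rad(a)) for a in (0, 45, 90, 135)]
# Convolving a fingerprint with each kernel yields the four component images;
# the 152-dim vector is then the 38 sector features computed from each of them.
print([k.shape for k in bank])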

Fig. 2. Pre-processing steps

2 Finger Print Classification

BBO Algorithm. The biogeography based optimization algorithm operates on a population of candidate solutions, called habitats. The habitats are characterized by a habitat suitability index (HSI) factor [1]. The HSI can be high or low, indicating the number of species and the adaptability of the habitat. Like the Genetic Algorithm, BBO performs two operations: migration and mutation. Through migration it is capable of producing high-HSI habitats. The selection of habitats is decided by a probabilistic operator, as given in Eq. (1). The migration process is quantified by two operations, immigration and emigration, measured by the emigration rate (μ) and the immigration rate (λ); these also decide the rate of migration for the next generation, as given in Eq. (2). Here μ and λ are directly related to the number of species in the habitat. Let us take k species in the habitat:

H_i(SIV) ← H_j(SIV)    (1)

μ_k = E·k/S_max  and  λ_k = I·(1 − μ_k)    (2)


where E is the maximum emigration rate, I is the maximum immigration rate, and S_max is the largest achievable number of species that the habitat can support. The second operation in BBO is mutation, which modifies a randomly selected SIV of a population as per the mutation rate:

m_i = M_max (1 − P_i/P_max)    (3)

where m_i is the mutation rate, M_max is the maximum mutation rate, and P_max is the maximum probability of species.
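A minimal Python sketch of one BBO generation following Eqs. (1)–(3) (ours; E = I = 1, the Gaussian mutation step, and the cost-based ranking are illustrative assumptions):

import numpy as np

# Habitats are real-valued vectors of SIVs; lower cost means higher HSI.
def bbo_step(habitats, cost, mmax=0.05):
    n, dim = habitats.shape
    order = np.argsort(cost)                      # best (lowest cost) first
    habitats, cost = habitats[order], cost[order]
    k = np.arange(n, 0, -1)                       # species count: best habitat has most
    mu = k / n                                    # emigration rate, Eq. (2) with E = 1
    lam = 1.0 - mu                                # immigration rate, Eq. (2) with I = 1
    new = habitats.copy()
    for i in range(n):
        for d in range(dim):
            if np.random.rand() < lam[i]:         # immigrate this SIV?
                j = np.random.choice(n, p=mu / mu.sum())   # emigrating donor habitat
                new[i, d] = habitats[j, d]        # Hi(SIV) <- Hj(SIV), Eq. (1)
            if np.random.rand() < mmax:           # simplified mutation, Eq. (3)
                new[i, d] += np.random.normal(scale=0.1)
    return new

pop = np.random.uniform(-1, 1, (6, 4))            # toy usage
pop = bbo_step(pop, cost=np.random.rand(6))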

3 BBO-FLANN Hybrid Network

In our work, we have integrated the BBO algorithm as an optimizer for the FLANN network, as shown in Fig. 3. The process is as follows. First, the network is created with a fixed number of habitats sized to fit the network requirements. For each of the habitats, the weights and biases are initialized randomly. Based on these weights and biases, the fitness value is generated and the MSE is calculated as the cost function; the aim of the network is to minimize the MSE. The network then applies the probabilistic operations of migration and mutation to improve the solutions. These operations are carried out repeatedly until the network achieves the desired MSE goal.

Fig. 3. Structure of the proposed BBO-FLANN hybrid method


4 Result Analysis

The features extracted from the noisy fingerprint images are collected and stored in an Excel sheet as a feature datasheet. The feature set is used to train the network with five specified classes: left loop (LL), right loop (RL), whorl, arch, and tented arch. Here class 4 (arch) and class 5 (tented arch) are jointly treated as a single class 4 [15, 20]. In the Excel sheet, the classes are encoded either as five values between −1 and 1, i.e. (−1, −0.5, 0, 0.5, 1), or as four values, i.e. (−1, −0.33, 0.33, 1), when the two arch classes are merged. The classifier was tested at different iteration levels such as 200, 500, and 1000, but it is robust enough to reach its reported accuracy within 200 iterations. Figure 4 shows the confusion matrix and accuracy obtained for the noisy fingerprints. The results show that the hybrid classifier produces a classification accuracy of 60% for fingerprints affected by Gaussian noise, and 78% for normal fingerprints without noise. The performance comparison is given in Fig. 4. However, if the noise percentage is increased further, the classifier may fail to maintain this accuracy.

Fig. 4. Confusion matrices and best cost graphs: (a) normal fingerprints, (b) noisy fingerprints, (c) best cost graph of normal fingerprints, (d) best cost graph of noisy fingerprints


5 Conclusion

Here, the classification of real-time noisy fingerprints using the multichannel filter bank approach with BBO-FLANN has been successfully carried out using 152-dimensional feature vectors. Our algorithm is robust enough to produce a best accuracy of 60% with a small execution time. For noisy fingerprints the classifier still produces acceptable classification accuracy when the Gaussian noise is added at a lower percentage, but if the noise percentage is increased it may fail to maintain this accuracy. Our future work includes developing more efficient and robust classifiers to classify noisy and degraded fingerprints with better accuracy by combining the best attributes of particle swarm optimization (PSO), ant colony optimization (ACO), and support vector machines (SVM) for fingerprint recognition.

Acknowledgment. The first author would like to thank the Department of Information and Communication Technology, Fakir Mohan University, Vyasa Vihar, Balasore for its technical support.

References
1. Giri, P.K., De, S.S., Dehuri, S.: A novel locally and globally tuned biogeography-based optimization algorithm. In: Soft Computing: Theories and Applications, pp. 635–646 (2018)
2. Guo, Y.H., Huang, C.L.: Functional link artificial neural networks filter for Gaussian noise. In: Applied Mechanics and Materials, vol. 347, pp. 2580–2585. Trans Tech Publications (2013)
3. Jain, A.K., Prabhakar, S., Hong, L.: A multichannel approach to fingerprint classification. IEEE Trans. Pattern Anal. Mach. Intell. 21(4), 348–359 (1999)
4. Zhang, L., Jack, L., Nandi, A.K.: Fault detection using genetic programming. Mech. Syst. Sign. Process. 19(2), 271–289 (2005)
5. Yao, Y., Marcialis, G.L., Pontil, M., Frasconi, P., Roli, F.: Combining flat and structured representations for fingerprint classification with recursive neural networks and support vector machines. Pattern Recogn. 36(2), 397–406 (2003)
6. Nagaty, K.A.: On learning to estimate the block directional image of a fingerprint using a hierarchical neural network. Neural Networks 16(1), 133–144 (2003)
7. Samanta, B., Al-Balushi, K.R., Al-Araimi, S.A.: Artificial neural networks and support vector machine with genetic algorithm for bearing fault detection. Eng. Appl. Artif. Intell. 16(7-8), 657–665 (2003)
8. Senior, A.: A combination fingerprint classifier. IEEE Trans. Pattern Anal. Mach. Intell. 23(10), 1165–1174 (2001)
9. Cappelli, R., Maio, D., Maltoni, D.: Combining fingerprint classifiers, pp. 351–361 (2000)
10. Jain, A.K., Prabhakar, S., Hong, L., Pankanti, S.: Filterbank based fingerprint matching. IEEE Trans. Image Process. 9(5), 846–859 (2000)
11. Hong, L., Jain, A.K.: Classification of fingerprint images. Technical report MSUCPS: TR98-18, Michigan State University (1998)
12. Jain, A.K., Hong, L., Pankanti, S., Bolle, R.: An identity authentication system using fingerprints. Proc. IEEE 85(9), 1365–1388 (1997)
13. Lawrence, S., Giles, C.L., Tsoi, A.C., Back, A.D.: Face recognition: a convolutional neural-network approach. IEEE Trans. Neural Networks 8(1), 98–113 (1997)
14. Maio, D., Maltoni, D.: A structural approach to fingerprint classification. In: Proceedings of 13th International Conference on Pattern Recognition, vol. 3, pp. 578–585 (1996)
15. Karu, K., Jain, A.K.: Fingerprint classification. Pattern Recogn. 29(3), 389–404 (1996)
16. Patra, J.C., Pal, R.N., Chatterji, B.N., Panda, G.: Identification of nonlinear dynamic systems using functional link artificial neural networks. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 29(2), 254–262 (1999)
17. Pandey, C., Singh, V., Singh, O.P., Kumar, S.: Functional link artificial neural network for denoising of image. IOSR J. Electron. Commun. Eng. 4(6), 109–115 (2013)
18. Pao, Y.-H., Phillips, S.M., Sobajic, D.J.: Neural-net computing and the intelligent control of systems. Int. J. Control 56(2), 263–289 (1992)
19. Kumar, P.S., Valarmathy, S.: Development of a novel algorithm for SVMBDT fingerprint classifier based on clustering approach. In: 2012 International Conference on Advances in Engineering, Science and Management (ICAESM), pp. 256–261 (2012)
20. Tarjoman, M., Zarei, S.: Automatic fingerprint classification using graph theory. In: Proceedings of World Academy of Science, Engineering and Technology, vol. 30, pp. 831–835 (2008)
21. Kamijo, M.: Classifying fingerprint images using neural network: deriving the classification state. In: IEEE International Conference on Neural Networks 1993, pp. 1932–1937 (1993)
22. Liu, W., Chen, Y., Wan, F.: Fingerprint classification by ridgeline and singular point analysis. In: Congress on Image and Signal Processing 2008, CISP 2008, vol. 4, pp. 594–598 (2008)

Software Reliability Prediction with Ensemble Method and Virtual Data Point Incorporation
Ajit Kumar Behera1,2(B) and Mrutyunjaya Panda1
1 Department of Computer Science and Application, Utkal University, Vani Vihar, Bhubaneswar 751004, Odisha, India
[email protected], [email protected]
2 Silicon Institute of Technology, Silicon Hills, Patia, Bhubaneswar 751024, India
[email protected]

Abstract. Software reliability is one of the key aspects of software quality estimation and prediction during the software testing period. Hence, accurate prediction of software reliability is an important but critical job. Machine Learning (ML) techniques have proven superior to traditional techniques for software reliability prediction. Generally, ML models require sufficient training data to achieve good generalization; inadequate training data may lead to a suboptimal solution. This article suggests a method of enriching the training dataset through the exploration and incorporation of virtual data derived from the existing data. To boost the overall accuracy and reduce the risk of model selection, an ensemble framework of five ML methods is suggested. The extended software reliability dataset is then exposed to the constituent models as well as the ensemble approach separately to estimate the future data. Extensive simulation results on a couple of software reliability datasets reveal that our proposed model significantly improves the prediction accuracy.
Keywords: Software reliability · Machine learning · Virtual data · Interpolation · Ensemble framework · Artificial neural network · NRMSE

1 Introduction

In modern society, computer systems controlled by software play a foremost role. As unreliable software may cause significant damage to humans or to customers' goodwill, it is indispensable for software practitioners to achieve high-quality software. Reliability is essentially one of the imperative aspects of software quality. Software reliability is defined as "The ability of the software to perform its required function under stated conditions for a stated period of time" [1, 2]. Because of the fast growth in the size and complexity of software, reliability is hard to accomplish. In recent years, many methods have been developed to improve the quality and generalization capability of reliability models [3–6]. The software market faces various provocations in developing reliable software. It is apparent that a software product having a large number of defects is unreliable. Software reliability data can be visualized as time series, which are associated with high nonlinearities and uncertainties. Recently,


many non-linear models have been established, but their prediction accuracy has not improved much. One of the reasons is that the datasets have very few data points. So, if we explore a few additional close-enough data positions beyond the existing ones, the prediction accuracy can be improved. In particular, these virtual data points help establish the association between the recent and previous data at the oscillation points of the time series. To boost the overall accuracy and reduce the risk of model selection, an ensemble framework of five ML methods is suggested. The rest of the paper is organized as follows: Sect. 2 provides the literature survey. Some of the basic regression techniques and virtual data points are described in Sect. 3. Section 4 describes the proposed work. Detailed experimental studies and result analysis are provided in Sect. 5. Finally, the paper is concluded in Sect. 6, followed by a list of references.

2 Related Study

In recent years, many ML-based software reliability models have been proposed in the literature, using techniques such as artificial neural networks, fuzzy models, decision trees, genetic algorithms, and many more [7–10]. Ho et al. proposed a connectionist model for the prediction of software reliability and found their model better than traditional models [11]. Su and Huang proposed new techniques based on neural networks for software reliability prediction [12]. Though various models have been developed for software reliability prediction, no single model provides accurate predictions under all circumstances. Another problem is that sufficient training data are often unavailable, which may lead to underfitting [13]. In view of a rationalized analysis of VDP-based neural networks, our work is generic, and the contributions of this work include: (i) using an interpolation technique for creating virtual data points [14], and (ii) applying machine learning techniques and an ensemble approach [8, 9] to software reliability prediction, comparing their performance with existing algorithms on various statistical measures over two datasets.

3 Methods and Methodologies

Here we describe the base models, i.e. different regression techniques like Ridge Regression, Bayesian Ridge Regression, Random Forest, Decision Tree, SVR, K-Nearest Neighbors, ElasticNet Regression, Lasso Regression, ANN, and RBFN, along with the VDP method.

3.1 Ridge Regression and Bayesian Ridge Regression
Ridge Regression can be used to analyze multiple-regression data in which multicollinearity exists among the variables. Bayesian Ridge Regression is similar to Ridge Regression, with the statistical analysis carried out through Bayesian inference. The random forest (RF) operates by constructing decision trees for training as well as for prediction.


3.2 Support Vector Regression (SVR)
Support Vector Regression is a regression method that maintains all the main features of the Support Vector Machine (SVM) that characterize the algorithm.

3.3 K-Nearest Neighbors Regression (KNN)
K-nearest neighbors predicts the numerical target based on a similarity measure. KNN regression uses the same distance function as KNN classification.

3.4 Artificial Neural Network (ANN)
ANN systems are vaguely inspired by the biological neural networks found in the human brain. A neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs.

3.5 Radial Basis Function Network (RBFN)
A Radial Basis Function Network is an ANN that forms hyperplanes to partition the input space into various regions feeding the output nodes. RBFNs have many uses, including function approximation, time series prediction, and system control.

3.6 Virtual Data Generation Methods
A VDP is a projected value between two successive data points in the original software time series dataset. The objective of this study is to improve the prediction performance of the model by adding VDPs to the actual data positions comprising the training pattern.

3.7 Linear Interpolation
Linear interpolation is a method of creating virtual points in the dataset. If the dataset is small, it is very difficult to obtain good accuracy after a train-test split, so this method is used to increase the size of the training data; with more training data the model can be trained and tested more effectively, and its performance increases. As the benchmark datasets used in this work are single-featured failure-time data, linear interpolation is easy to apply. In Python-like form, the linear interpolation algorithm works as follows:

def linear_interpolate(train):
    # Insert the midpoint between every pair of consecutive training points.
    data_array = [train[0]]
    for index in range(1, len(train)):
        data_array.append((train[index - 1] + train[index]) / 2)  # virtual point
        data_array.append(train[index])                           # original point
    return data_array


4 Proposed Model

In the proposed model, first the base learners are trained using the available data; then a combiner algorithm is trained to make the final prediction, using all the predictions of the base learners as additional inputs. If an arbitrary combiner algorithm is used, the model can theoretically represent any ensemble technique. The proposed model is shown in Fig. 1.

Fig. 1. Proposed ensemble model

Figure 1 explains how the different base learner algorithms are combined to give the final output and better accuracy. The proposed model essentially exhibits better results than any single one of the trained models. The algorithm of the proposed model is given below (note that each row of base-learner predictions is appended to the combiner's training set):

Training:
1. Input: Dataset
2. Base_Learners = [SVR, ANN, BayesianRidge, KNN, ...]
3. Combiner = Ridge
4. X_train, X_test, Y_train, Y_test = train_test_split(Dataset)
5. Model = []
6. For each function in Base_Learners:
7.     Model.append(function.train(X_train, Y_train))
8. Train_set = []
9. For i in range(length(X_train)):
10.     Train = []
11.     For m in Model:
12.         Train.append(m.predict(X_train[i]))
13.     Train_set.append(Train)
14. Combiner.train(Train_set, Y_train)

Prediction:
1. Train_set_x = []
2. For m in Model:
3.     Train_set_x.append(m.predict(x))
4. Output = Combiner.predict(Train_set_x)

This is the ensembling algorithm that we have used in this work to improve the accuracy by combining several different algorithms.
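For readers who want to reproduce this scheme, the sketch below expresses the same stacked ensemble with scikit-learn's StackingRegressor. The base-learner list and the Ridge combiner follow the paper, while the lag-window construction and the file name are illustrative assumptions of ours.

import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge, BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

def make_lagged(series, lag):
    # Turn a failure-time series into sliding-window samples.
    X = np.array([series[i:i + lag] for i in range(len(series) - lag)])
    y = np.array(series[lag:])
    return X, y

series = np.loadtxt("musa.txt")            # hypothetical file name
X, y = make_lagged(series, lag=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False)

ensemble = StackingRegressor(
    estimators=[("svr", SVR()), ("ann", MLPRegressor(max_iter=2000)),
                ("br", BayesianRidge()), ("knn", KNeighborsRegressor())],
    final_estimator=Ridge())
ensemble.fit(X_tr, y_tr)
pred = ensemble.predict(X_te)

Note that StackingRegressor fits the combiner on cross-validated base-learner predictions, a slightly more robust variant of the naive scheme listed above.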

5 Experimental Study and Result Analysis

This section describes the data collection, performance metrics, input selection, and result analysis.
a. Experimental dataset: For the experiments, two benchmark datasets, the Musa dataset and the Iyer Lee dataset, are used. Many standardized published works exist on these datasets, and statistical error measures are available for them.
b. Comparison criteria: As the comparison criterion, the Normalized Root Mean Squared Error (NRMSE) is used, which represents the normalized error between the observed actual failure data and the estimated failure data. The NRMSE is defined in Eq. 1 as follows:

NRMSE = sqrt( Σ_{i=1}^{n} (y_i − ŷ_i)² / Σ_{i=1}^{n} y_i² )    (1)

where n is the number of data points, y_i is the original time between failures, and ŷ_i is the estimated time between the ith and (i − 1)th failures.
c. Experiment: In this work of predicting the time between successive failures, the two benchmark datasets are passed to the various algorithms [10], and the statistical error measures of each model's performance are plotted. To assess the performance of the various regressors, the length of the sliding window is varied from 2 to 25 and the observations are recorded. We also observed that increasing the length of the sliding window decreases the size of the train set; hence, the creation of virtual data points seemed a natural way of increasing the dataset size. As this is time series data with positive and negative peaks, linear interpolation was used to populate the train set.


As observed, certain models improved their performance drastically after interpolating the points, as correlation between data points was established and the variance between them decreased. But for certain models like the Ridge and Bayesian Ridge regressors, interpolation had an inverse impact: since the bias of the non-CART (Classification and Regression Trees) models increases as we decrease the variance in the bias-variance trade-off, the performance of these models decreases significantly.
Musa Dataset Results
The plots of NRMSE values against lag length for the different regressors are given in Fig. 2; the plots include performance with and without interpolation. They show that regressors like Ridge, Bayesian Ridge, SVR, and Random Forest performed better on this dataset, while ElasticNet and Lasso regression show the worst error values. Linear interpolation of the train sets improves the NRMSE performance for certain lag lengths, but in most cases, for regressors like DT, RF, RBFN, and SVR, the performance deteriorates, as seen from the plots.

Fig. 2. Performance on the Musa dataset with different lag lengths


Iyer Lee Dataset Results
The plots of NRMSE values against lag length for the different regressors are given in Fig. 3; the plots include performance with and without interpolation. They show that regressors like RBFN, ANN, and SVR performed better on this dataset, while ElasticNet and Lasso regression show the worst error values. The error values of these two regressors do not vary much with the change of lag window size.

Fig. 3. Performance on the Iyer Lee dataset with different lag lengths


KNN, ANN, Random Forest, and Decision Tree regressors performed better with the inclusion of the interpolated values. As acknowledged from the graphs, the KNN, ANN, and DT regressors show erratic plot performance, and the interpolation of data in the train set also has an erratic effect on the regressors: it increases the NRMSE values for certain lag lengths while decreasing them in other cases. For the RF regressor, the interpolated data does not perform well against the actual data. For the Ridge and Bayesian Ridge regressors on this dataset, the actual data yields roughly linear NRMSE values, whereas the interpolated train set behaves like a quadratic plot as the lag length increases. For the Musa dataset, the Ridge, Bayesian Ridge, and RF regressors are the best performers when measured on the actual dataset, while KNN outperforms the others on the interpolated dataset; DT, RF, Ridge, and Bayesian Ridge with interpolated data are the worst performers for this data. RBFN and SVR are the better performers on the Iyer Lee dataset when measured on the actual dataset; the other regressors have similar performance on this dataset with minimal variation, and DT is the worst performer among all. The performance comparison with existing statistical models and the reference model [8] is given in Tables 1 and 2. For the Iyer Lee dataset, the Ridge regressor has outperformed the existing algorithms, whereas ANN-PSO still remains the best performer for the Musa dataset.

Table 1. NRMSE performance comparison

Lag length | ANN (Musa / Iyer Lee) | ANN+PSO (Musa / Iyer Lee) | Ridge (Musa / Iyer Lee)
1 | 0.1593 / 0.5129 | 0.1686 / 0.3908 | 0.176869 / 0.091658
2 | 0.1415 / 0.4939 | 0.1628 / 0.3765 | 0.171176 / 0.109134
3 | 0.053 / 0.541 | 0.1451 / 0.3245 | 0.159977 / 0.12362
4 | 0.1528 / 0.5321 | 0.1283 / 0.3098 | 0.155523 / 0.140684
5 | 0.1634 / 0.5642 | 0.1529 / 0.3472 | 0.151145 / 0.146877
6 | 0.1682 / 0.5871 | 0.1782 / 0.3981 | 0.142872 / 0.155379

Table 2. Performance comparison with existing algorithms

Statistical model | Musa | Iyer Lee
Jelinski Moranda | 0.1444 | –
Geometric | 0.1419 | 3.5375
Musa Basic | 0.1413 | –
Musa Okumoto | 0.1409 | 3.3255
ANN | 0.1415 | 0.4939
ANN-PSO | 0.1283 | 0.3098
KNN | 0.1357 | 0.09525
Ridge | 0.1366 | 0.09165


6 Conclusions

Different machine learning prediction algorithms are used in this work on software reliability prediction. As seen from the results above, the Ridge regressor, along with the Bayesian Ridge regressor, performs better on both datasets when working on the actual data, and compares favourably with existing algorithms in terms of NRMSE values. Certain models, like KNN and ANN, perform better with interpolated points. As a whole, logarithmic scaling was an important factor in improving model performance because of the unevenly distributed data values and the peaks at certain points in the time series datasets. As observed, the Ridge, Bayesian Ridge, and RBFN regressors perform their best when the lag length is around 7 to 11. The RF and ANN regressors give their best at larger lag lengths, but only on the actual dataset. The KNN and SVR regressors performed better on the linearly interpolated data, which was the best performance among all cases at certain lag lengths.

References
1. Aljahdali, S.H., Buragga, K.A.: Employing four ANNs paradigms for software reliability prediction: an analytical study. ICGST Int. J. Artif. Intell. Mach. Learn. 8(2), 1–8 (2008)
2. Quyoum, A., Dar, M.D., Quadri, S.M.K.: Improving software reliability using software engineering approach - a review. Int. J. Comput. Appl. 10(5), 41–47 (2010)
3. Karunanithi, N., Whitley, D., Malaiya, Y.K.: Prediction of software reliability using connectionist models. IEEE Trans. Softw. Eng. 18(7), 563–574 (1992)
4. Malhotra, R., Kaur, A., Singh, Y.: Empirical validation of object-oriented metrics for predicting fault proneness at different severity levels using support vector machines. Int. J. Syst. Assur. Eng. Manag. 1(3), 269–281 (2010)
5. Lo, J.H.: Predicting software reliability with support vector machines. In: 2010 Second International Conference on Computer Research and Development, pp. 765–769 (2010)
6. Singh, Y., Kumar, P.: Prediction of software reliability using feed forward neural networks. In: 2010 International Conference on Computational Intelligence and Software Engineering, pp. 1–5 (2010)
7. Mohanty, R., Ravi, V., Patra, M.R.: Hybrid intelligent systems for predicting software reliability. Appl. Soft Comput. 13(1), 189–200 (2013)
8. Bisi, M., Goyal, N.K.: Artificial Neural Network Applications for Software Reliability Prediction. Wiley, Hoboken (2017)
9. Jaiswal, A., Malhotra, R.: Software reliability prediction using machine learning techniques. Int. J. Syst. Assur. Eng. Manag. 9(1), 230–244 (2018)
10. Behera, A.K., Nayak, S.C., Dash, C.S.K., Dehuri, S., Panda, M.: Improving software reliability prediction accuracy using CRO-based FLANN. In: Innovations in Computer Science and Engineering, pp. 213–220. Springer, Singapore (2019)
11. Ho, S.L., Xie, M., Goh, T.N.: A study of the connectionist models for software reliability prediction. Comput. Math. Appl. 46(7), 1037–1045 (2003)
12. Su, Y.S., Huang, C.Y.: Neural-network-based approaches for software reliability estimation using dynamic weighted combinational models. J. Syst. Softw. 80(4), 606–615 (2007)
13. Ardabili, S., Mosavi, A., Varkonyi-Koczy, A.R.: Advances in machine learning modeling reviewing hybrid and ensemble methods (2019)
14. Nayak, S.C., Misra, B.B., Behera, H.S.: Efficient financial time series prediction with evolutionary virtual data position exploration. Neural Comput. Appl. 31(2), 1053–1074 (2019)

Hyperspectral Image Classification Using Stochastic Gradient Descent Based Support Vector Machine
Pattem Sampurnima1, Sandeep Kumar Satapathy1, Shruti Mishra2, and Pradeep Kumar Mallick3(B)
1 Department of CSE, KL University, Vijayawada, Andhra Pradesh, India
[email protected], [email protected]
2 School of CSE, VIT-AP University, Amaravati 522237, Andhra Pradesh, India
[email protected]
3 School of Computer Engineering, Kalinga Institute of Industrial Technology (KIIT) Deemed to be University, Bhubaneswar, India
[email protected]

Abstract. In recent years, hyperspectral images have become very popular in remote sensing. Hyperspectral imaging has many applications, including resource management, mineral exploration, agriculture, environmental monitoring, and other earth-observation tasks. Earlier, such images were rarely available, but with the recent appearance of airborne hyperspectral imaging systems, hyperspectral images have entered the mainstream of remote sensing. In this work we have considered a few officially and publicly available hyperspectral image datasets. As these images contain spectral, spatial, and temporal resolutions, to classify the several regions in the images we use a powerful machine learning technique, the Support Vector Machine (SVM), optimized with Stochastic Gradient Descent (SGD) for the image classification task.

1 Introduction

With the advent of remote sensing, hyperspectral image classification has come to the forefront. The methodology involved in hyperspectral imaging usually considers the electromagnetic spectrum across different wavelength ranges, and the sensors provide many spectral bands for a particular area on the surface of the earth. In images of this type, each pixel is a high-dimensional vector over specific wavelengths, and this spectral richness makes hyperspectral images highly applicable in many fields. Classification of hyperspectral images is one of the most challenging tasks, as the curse of dimensionality always prevails: the hyperspectral data samples have a high level of dependency between the different spectral bands. In the due course of time, many classification techniques like the naïve Bayesian approach, K-nearest neighbours, etc. [1] have taken a forward leap in handling pixel-wise methods. Also, with advances in computing and the availability of large-scale high-dimensional datasets,


many other advanced techniques, like deep learning and other variants of machine learning [2], have gained great success. In this study, various existing images have been taken into consideration, combining spectral [3], spatial, and temporal resolutions [4], for the purpose of classification [5, 6]. The rest of the paper is organized as follows: Sect. 2 discusses some of the findings on hyperspectral images, Sect. 3 presents the model and the techniques used in this study, and the last section states our findings and the future scope.

2 Related Work

As stated earlier, with the power of computing and the availability of large-scale data, the classification of hyperspectral images has advanced considerably. Chen et al. [7] proposed a deep learning-based classification method in which high-level features were extracted from the images using an autoencoder. Similarly, an improved version of the autoencoder, the stacked sparse autoencoder, was proposed by Tao et al. [8], who introduced an additional term into the energy function. Zhang et al. [9] proposed a non-local weighted joint sparse classification method for hyperspectral images to improve pixel consistency and enhance discriminative ability. Fang et al. [10] introduced a spectral-spatial classification of hyperspectral images based on a sparse model, employing a discriminative K-SVD algorithm to simultaneously learn a dictionary and classifier in an iterative procedure. Similarly, an effective deep learning-based classification model was developed by Liu et al. [11] for extracting features from hyperspectral images, employing an active learning algorithm that selects high-quality samples for training. This was followed by Zhong et al. [12], who used deep belief networks for image classification, tuning the network parameters to improve performance in terms of accuracy.

3 Proposed Model and Techniques Used

The main aim of this work is to analyse, as efficiently as possible, a hyperspectral image that contains not just pixel information but also spectral and spatial information. In this paper we try to classify the different objects inside the image more accurately with the help of a machine learning algorithm, the SVM. However, kernel functions such as linear, polynomial, or RBF kernels may behave differently for different sets of samples (Fig. 1).

Fig. 1. Proposed model for classification of hyperspectral images: a support vector machine optimized with stochastic gradient descent


Hence, we need an efficient and fast optimization algorithm to search for and tune the best-fitted kernel. Stochastic Gradient Descent (SGD) is a very simple but very efficient approach to the discriminative learning of classifiers under convex loss functions.
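As a concrete illustration, the sketch below trains a linear SVM with SGD (hinge loss) on per-pixel spectra using scikit-learn. The .mat file and key names follow the common GIC distribution of Indian Pines but should be treated as assumptions, as should the parameter values.

import numpy as np
from scipy.io import loadmat
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, cohen_kappa_score

cube = loadmat("Indian_pines_corrected.mat")["indian_pines_corrected"]
gt = loadmat("Indian_pines_gt.mat")["indian_pines_gt"]
X = cube.reshape(-1, cube.shape[-1]).astype(float)  # one spectrum per pixel
y = gt.ravel()
mask = y > 0                                        # drop unlabeled pixels
X, y = X[mask], y[mask]
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y)

clf = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000)  # hinge = linear SVM
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred))
print("kappa   :", cohen_kappa_score(y_te, pred))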

4 Results and Discussion

4.1 Dataset Description
All the datasets are collected from the publicly available hyperspectral images provided by the Computational Intelligence Group (GIC) of the University of the Basque Country. The following datasets are used for all experimental evaluations; Table 1 gives detailed information about each dataset.

Table 1. Description of hyperspectral image datasets

Name | Collected by | Number of classes | Spectral bands | Region
Indian Pines | AVIRIS sensor | 16 | 224 | North-western Indiana
Pavia Centre | ROSIS sensor | 9 | 102 | Pavia, Northern Italy
Pavia University | ROSIS sensor | 9 | 103 | Pavia, Northern Italy
Kennedy Space Centre | AVIRIS NASA | 13 | 224 | KSC, Florida
Botswana | NASA EO-1 | 14 | 242 | Okavango Delta, Botswana

4.2 Experimental Setup
In this paper, all the experimental work has been performed using a Python environment on the Windows platform, with the hardware support of an Intel Core i5 8th-generation processor and 8 GB of RAM. A few basic Python libraries are used, such as numpy, scipy, scikit-learn, scikit-image, matplotlib, and pytorch. To visualise the outputs in one place, a visdom server has been used; with it, all outputs can be viewed together in a browser window (a small usage sketch is given below).

4.3 Result Analysis
Five hyperspectral image datasets have been used for all the experimental work: Indian Pines, Pavia Centre, Pavia University, Kennedy Space Centre, and Botswana. The proposed SVM-SGD algorithm has been applied to all the datasets.
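Returning to the visualisation setup above, the hedged snippet below shows one way a confusion matrix might be pushed to a running visdom server (started beforehand with python -m visdom.server); the variables y_te and pred are assumed to come from an SVM-SGD run such as the earlier sketch.

import visdom
from sklearn.metrics import confusion_matrix

vis = visdom.Visdom()              # connects to http://localhost:8097 by default
cm = confusion_matrix(y_te, pred)  # y_te, pred: true and predicted test labels
vis.heatmap(X=cm, opts=dict(title="Confusion matrix (Indian Pines)"))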

Table 2. Result analysis of the SVM-SGD algorithm applied to different hyperspectral images

Dataset | Accuracy | F1-score (best class) | Kappa
Indian Pines | 85.394% | 0.977 (Stone-Steel-Towers) | 0.805
Pavia Centre | 98.881% | 1.000 (Water) | 0.984
Pavia University | 94.233% | 0.999 (Painted metal sheets) | 0.923
Kennedy Space Centre | 63.103% | 0.922 (Water) | 0.578
Botswana | 92.348% | 0.997 (Water) | 0.917

Fig. 2. Confusion matrix for Indian Pines dataset

Fig. 3. Confusion matrix for Pavia Centre dataset


Fig. 4. Confusion matrix for Pavia University dataset

Fig. 5. Confusion matrix for Kennedy Space Centre dataset

From Table 2 it can be understood that SVM-SGD gives a very good result in the classification of the Pavia Centre hyperspectral image compared to all the others. Figures 2, 3, 4, 5 and 6 show the confusion matrices for the classification of all classes in the datasets. Three measures are mainly used to compare the efficiency of the algorithm across datasets: Accuracy, F1-score, and the Kappa coefficient. Accuracy is the overall accuracy of the algorithm on the test samples, expressed as a percentage. The F1-score is calculated for each class individually, but in the table only the class with the highest F1-score is mentioned. Another accuracy measurement indicator is the Kappa coefficient, a measure of how the classification results compare to values assigned by chance. Its value ranges from 0 to 1, and a higher Kappa coefficient indicates a more accurate classification.
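For reference (this standard formula is not stated in the paper), Cohen's Kappa is computed from the confusion matrix as

κ = (p_o − p_e) / (1 − p_e)

where p_o is the observed agreement (the overall accuracy) and p_e is the agreement expected by chance, obtained from the row and column marginals of the confusion matrix. For example, with p_o = 0.85 and p_e = 0.25, κ = (0.85 − 0.25) / (1 − 0.25) = 0.80.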


Fig. 6. Confusion matrix for Botswana dataset

5 Conclusion

Hyperspectral images have become very useful resources in recent years for solving many problems in different applications, and many mathematical tools and algorithms are being researched for them, such as hyperspectral classification, hyperspectral unmixing, and data fusion. In this work we proposed an SVM-SGD based machine learning algorithm for the classification of different hyperspectral images. From this research work we conclude that the proposed algorithm performs most efficiently on the Pavia Centre hyperspectral image dataset.

References
1. Satapathy, S.K., Jagadev, A.K., Dehuri, S.: An empirical analysis of training algorithms of neural networks: a case study of EEG signal classification using java framework. In: Intelligent Computing, Communication and Devices, pp. 151–160. Springer (2015)
2. Satapathy, S.K., Jagadev, A.K., Dehuri, S.: An empirical analysis of different machine learning techniques for classification of EEG signal to detect epileptic seizure. Informatica 41(1) (2017)
3. Landgrebe, D.: Information extraction principles and methods for multispectral and hyperspectral image data. Inf. Process. Remote Sens. 82, 3–37 (1999)
4. Govender, M., Chetty, K., Bulcock, H.: A review of hyperspectral remote sensing and its application in vegetation and water resource studies. Water SA 33(2), 145–151 (2007)
5. Carrasco, O., Gomez, R.B., Chainani, A., Roper, W.E.: Hyperspectral imaging applied to medical diagnoses and food safety. Proc. SPIE 5097, 215–221 (2003)
6. Gowen, A.A., O'Donnell, C.P., Cullen, P.J., Downey, G., Frias, J.M.: Hyperspectral imaging - an emerging process analytical tool for food quality and safety control. Trends Food Sci. Technol. 18(12), 590–598 (2007)
7. Chen, Y., Lin, Z., Zhao, X., Wang, G., Gu, Y.: Deep learning-based classification of hyperspectral data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 7(6), 2094–2107 (2014)
8. Tao, C., Pan, H., Li, Y., Zou, Z.: Unsupervised spectral-spatial feature learning with stacked sparse autoencoder for hyperspectral imagery classification. IEEE Geosci. Remote Sens. Lett. 12(12), 2438–2442 (2015)
9. Zhang, H., Li, J., Huang, Y., Zhang, L.: A nonlocal weighted joint sparse representation classification method for hyperspectral imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 7(6), 2057–2066 (2014)
10. Fang, L., Li, S., Kang, X., Benediktsson, J.A.: Spectral-spatial classification of hyperspectral images with a super pixel-based discriminative sparse model. IEEE Trans. Geosci. Remote Sens. 53(8), 4186–4201 (2015)
11. Liu, P., Zhang, H., Eom, K.B.: Active deep learning for classification of hyperspectral images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 10(2), 712–724 (2017)
12. Zhong, P., Gong, Z., Li, S., Schönlieb, C.B.: Learning to diversify deep belief networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 55(6), 3516–3530 (2017)

A Survey on Ant Colony Optimization for Solving Some of the Selected NP-Hard Problem
Akshaya Kumar Mandal(B) and Satchidananda Dehuri
Department of Information and Communication Technology, Fakir Mohan University, Vyasa Vihar, Balasore 756019, Odisha, India
[email protected], [email protected]

Abstract. This paper analyses various ant colony optimization (ACO) based techniques for solving some selected intractable problems. ACO is one of the popular meta-heuristic techniques that has given acceptable solutions to intractable problems like the Travelling Salesperson (TS), Subset Selection (SS), Minimum Vertex Cover (MVC), and 0/1 Knapsack problems in a tolerable amount of time. We have reviewed the literature on the usage of the aforesaid meta-heuristic algorithms for solving the intractable problems TS, SS, MVC, and 0/1 Knapsack. A review of several ACO variants on NP-hard problems with different instances shows that the ACO algorithm demonstrates significant effectiveness and robustness in solving intractable problems.
Keywords: Ant system · Ant colony optimization · Travelling salesman problem · Subset selection problem · Minimum Vertex Cover

1 Introduction

Nature is, of course, a massive and immense source of inspiration for solving NP-hard complex problems in the area of computer science. It always finds an optimal solution to overcome its problems and maintains a proper balance among its components. ACO is a nature-inspired meta-heuristic algorithm [1] that mimics nature to solve optimization problems, opening a new era in computation. Over the past decades, numerous research efforts have been made in this particular area. It is still an emerging field of research, and its very impressive results broaden the scope and viability of ACO algorithms, exploring new application areas and more opportunities in computing. Among the many combinatorial problems, in this paper we review some NP-hard problems: the TSP, the SSP, the 0/1 Knapsack problem, and MVC. The TSP is the problem of finding the shortest tour among a set of cities, starting from one city and visiting every other city once and only once before returning to the home (starting) city; that is, finding the shortest Hamiltonian cycle in a fully connected graph. The goal is to find the best path in terms of the minimum distance (or minimum cost) travelled by the salesman. The TSP is thus a permutation problem with the objective of finding the tour of shortest length (or minimum cost). For example, if there are n cities, then (n − 1)!


possible tours need to be examined; for n = 4, this means 3! = 6 tours. In the SSP, we must find an optimal feasible subset of an initial set of objects with respect to an objective function or some constraints. Subset selection is a heuristic search process whose search space contains states, each of which generates a candidate subset for evaluation. The 0/1 Knapsack problem consists of loading objects into a knapsack of limited load capacity: each object can either be loaded or not loaded into the knapsack, giving a 0-1 decision about object loading, and the problem does not allow putting multiple copies of the same item in the knapsack. In the MVC problem, given an undirected, unweighted graph G = (V, E), the goal is to find a subset of the vertices V such that each edge in E has at least one of its two end vertices in the subset (so it covers all edges) and its cardinality is minimum. For a non-empty vertex set S ⊆ V, if each edge in E has at least one endpoint in S, then S is called a vertex cover of graph G. This study focuses on a particular class of approximation algorithms, named meta-heuristics, for solving combinatorial optimization problems.
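To make the 0-1 decision structure concrete, here is a small classical dynamic-programming solver for the 0/1 knapsack problem; this exact-DP baseline, with made-up values, is ours for illustration and is not the ACO approach surveyed here.

def knapsack01(values, weights, capacity):
    # best[i][c]: best total value using the first i items with capacity c.
    n = len(values)
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]              # item i-1 not loaded
            if weights[i - 1] <= c:                  # item i-1 loaded (0/1 choice)
                best[i][c] = max(best[i][c],
                                 best[i - 1][c - weights[i - 1]] + values[i - 1])
    return best[n][capacity]

print(knapsack01([60, 100, 120], [10, 20, 30], 50))  # prints 220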

2 Computational Complexity of Algorithms and NP-Hard Problems

When designing an algorithm, one is generally interested in improving its efficiency as much as possible, where the efficiency of an algorithm is typically measured in terms of its computation time. Efficiency is such a critical factor in the design of algorithms that a problem is regarded as well-solved only if an algorithm is available that efficiently solves all of its instances, running in time polynomial in the size of the instance. The actual computation time of an algorithm also depends on the speed and on the hardware and software architecture of the computer on which it runs; these notions are made machine-independent using the Turing machine model [2, 3]. The theory of NP-completeness [12] testifies to the surmise that some problems in NP cannot be solved by efficient algorithms. As we have seen, this holds as well for related optimization problems such as the Traveling Salesman problem and many others: they are NP-hard in the sense that an efficient optimization algorithm for one of these problems would imply an efficient decision algorithm for the corresponding (NP-complete) decision versions. However, having a problem classified as NP-complete or NP-hard does not remove the necessity of trying to solve it at least somehow. If we assume P ≠ NP [4], we can no longer hope for an efficient exact algorithm, so there is no need to waste time trying to find one. Instead, we can focus on designing approximation algorithms, whose purpose is to compute good solutions efficiently and with high probability. The price we have to pay for increasing the efficiency is the missing guarantee of success in all cases. This approach is used for optimization problems, to which we now return. The main idea is to weaken the condition of computing the exact optimal value of a problem: instead of requiring the output of an algorithm to be the optimum, one attempts to approximate the latter as well as possible. The hope is to obtain at least efficient approximation algorithms for NP-hard problems; that is, we try to gain efficiency by weakening the solution property from being optimal to being approximately optimal. In this paper, we present the basic concepts and some of the basic results related to approximation algorithms.


3 Basics of Ant Colony Optimization

Dorigo and his colleagues introduced the first ACO algorithm in the early 1990s [5]. The development of this algorithm was inspired by real ant colonies. Ant colony optimization (ACO) is a family of meta-heuristics [1] inspired by real ant colonies. The main idea behind ACO is that the self-organizing principles which coordinate the behaviors of real ants can be exploited to coordinate populations of artificial agents that collaborate to solve intractable problems. The optimization mechanism of ACO is based on two important features [6]: the state transition rule and the pheromone updating rule. The first one, a probabilistic operation, is applied when an ant chooses the next vertex to visit. The second one dynamically changes the preference degree of the edges that have been traversed. ACO solves complex combinatorial optimization problems such as the TSP, the graph coloring problem, and so on. Generally, the problem under study is transformed into a weighted graph, and a set of artificial ants is distributed onto the graph to construct paths leading to optimal solutions.

3.1 Background of Ant Colony Optimization
The behavior that provided the inspiration for ACO is the ants' foraging behavior [5, 6], and in particular, how ants can find shortest paths between food sources and their nest. While searching for food, moving ants deposit chemical substances called pheromone on the ground. Other ants can sense the pheromone, and paths with more pheromone become more likely to be visited next time. As soon as an ant finds a food source, it evaluates the quantity and quality of the food and carries some of it back to the nest. During the return trip, the quantity of pheromone that an ant leaves on the ground may depend on the quantity and quality of the food [5, 8]. The pheromone trails then guide other ants to the food source. It has been shown that this indirect communication between the ants via pheromone trails, known as stigmergy [8], enables them to find shortest paths between their nest and the food sources.

3.2 Graph Representation of the ACO Algorithm
Let G(V, E) be an undirected weighted complete graph, where V is the set of vertices representing cities and E is the set of edges representing the paths fully connecting all cities. Each edge (i, j) ∈ E is associated with a cost d_ij, the distance between cities i and j. A pheromone trail τ_ij is associated with each edge of the graph G, initialized as

τ_ij = 1 if (i, j) ∈ E, and τ_ij = 0 otherwise.

The heuristic function is chosen as η_ij = 1/d_ij, which measures the quality of the items that can be added to the current partial solution [7]. The ant selects the vertex whose arc has the maximum probability p_ij^k. The state transition probability of ant k, used to select the next edge of graph G, is then defined as follows: an ant will move from node i to node j with probability [5]

p_ij^k = ( τ_ij^α · η_ij^β ) / ( Σ_{l ∈ N_i^k} τ_il^α · η_il^β )    (1)


where N_i^k is the feasible neighborhood of ant k when it is at city i, and α and β are two parameters which determine the relative influence of the pheromone trail and the heuristic information.
• The roles of the parameters α and β are as follows. If α = 0, the closest cities are more likely to be selected. If β = 0, only pheromone amplification is at work, that is, only the pheromone is used, without any heuristic bias; this generally leads to rather poor results and, in particular, for values of α > 1 it leads to the rapid emergence of a stagnation situation, which is in general strongly sub-optimal. Each ant k maintains a memory m_k which contains the cities already visited, in the order they were visited [9]. This memory is used to define the feasible neighborhood N_i^k in the construction rule given by Eq. (1).

3.2.1 Update of the Pheromone Trail
Since pheromone is a chemical substance, it evaporates over time, and this evaporation is more pronounced along longer paths [5]. Shorter paths are therefore refreshed more rapidly, which increases the probability that other ants select them. The general structure of any ACO algorithm [6], and thus also of AS, is relatively simple: starting with an initialization of the algorithm, iteration after iteration all ants first construct their tours and then update the pheromone trails accordingly. In an extended scenario (i.e., an optional case), a local search method can be used to improve the ants' tours before updating the pheromone. After all ants have completed their tours, pheromone evaporation is implemented as follows:

τ_ij ← (1 − ρ) τ_ij, ∀(i, j) ∈ E

where ρ ∈ (0, 1) is a parameter which simulates the evaporation rate of the pheromone intensity. Then the pheromone trails are updated as follows:

τ_ij ← τ_ij + Σ_{k=1}^{m} Δτ_ij^k    (2)

where Δτ_ij^k is the amount of pheromone ant k deposits on the arcs it has visited, defined as follows:

Δτ_ij^k = Q / L_k if arc (i, j) ∈ T^k, and 0 otherwise    (3)

where Q is a constant and L_k is the length of the tour T^k built by the k-th ant, computed as the sum of the lengths of the arcs belonging to T^k. By means of Eq. (3), the better an ant's tour is, the more pheromone the arcs belonging to this tour receive. This idea is summarized in the following pseudo code [5].


initialization();
while (not terminationCriterionTrue()) {
    constructAntTours();
    if (useOptionalLocalSearch)
        applyLocalSearch();
    updatePheromones();
}

3.3 ACO Algorithm
Step 1: Each path followed by an ant is associated with a candidate solution for a given problem.
Step 2: When an ant follows a path, the amount of pheromone deposited on that path is proportional to the quality of the corresponding candidate solution for the target problem.
Step 3: If the ant must choose between two or more paths, the path with the higher amount of pheromone has a greater probability of being chosen by the ant.
Step 3.1: Apply the pheromone updating rule, and continue until all ants have completed their tours and the end condition is reached.
Step 4: Output the global best solution.
A compact Python sketch of this loop for the TSP is given after the steps.
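The following is a minimal Ant System sketch for the TSP implementing the transition rule of Eq. (1) and the pheromone update of Eqs. (2) and (3); all parameter values (number of ants, α, β, ρ, Q) are illustrative assumptions.

import numpy as np

def ant_system(dist, n_ants=20, n_iters=100, alpha=1.0, beta=2.0, rho=0.5, Q=1.0):
    # dist: n x n numpy distance matrix.
    n = len(dist)
    tau = np.ones((n, n))                    # initial pheromone trails
    eta = 1.0 / (dist + np.eye(n))           # heuristic 1/d_ij (diagonal padded)
    best_tour, best_len = None, np.inf
    for _ in range(n_iters):
        tours = []
        for _k in range(n_ants):
            tour = [np.random.randint(n)]
            while len(tour) < n:
                i = tour[-1]
                allowed = [j for j in range(n) if j not in tour]
                w = np.array([tau[i, j] ** alpha * eta[i, j] ** beta
                              for j in allowed])
                tour.append(np.random.choice(allowed, p=w / w.sum()))  # Eq. (1)
            length = sum(dist[tour[i], tour[(i + 1) % n]] for i in range(n))
            tours.append((tour, length))
            if length < best_len:
                best_tour, best_len = tour, length
        tau *= (1 - rho)                     # evaporation
        for tour, length in tours:           # deposit, Eqs. (2) and (3)
            for i in range(n):
                a, b = tour[i], tour[(i + 1) % n]
                tau[a, b] += Q / length
                tau[b, a] += Q / length
    return best_tour, best_len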

4 Ant Colony Optimization for NP-Hard Problems

This section reviews the literature on solving some selected NP-hard combinatorial optimization problems, namely the TSP, the SSP, the 0/1 Knapsack problem, and MVC, using ACO. A large variety of methods have been designed to solve these problems; the aim of this review is to study and analyze the available algorithms for finding prominent or optimal solutions to the above problems.

4.1 Ant Colony Optimization for the Traveling Salesman Problem
Literature studies show that a lot of work has been done on the TSP using ACO. The TSP was introduced by Whitney [10]; its origin lies with Hamilton's Icosian Game, which was a recreational puzzle based on finding a Hamiltonian cycle [11]. In 1972, Karp showed that the Hamiltonian cycle problem is NP-complete [12], which implies the NP-hardness of the TSP. The TSP is a central problem in combinatorial optimization. In this section, we review some of the works developed along the line of solving the TSP using ACO. Raghavendra presents a framework for the basic ant colony optimization technique for solving the traveling salesman problem [13], restricted to the TSP and its solution through ACO via an illustrative example, applying the technique to the symmetric TSP. His analysis shows that an ant selects the next city based on the maximum probability, and in the illustrative example the best


path is found at iteration number 5, where all the ants converge to the best path, which gives the minimum distance. Chen et al. presented an ant colony optimization with a tabu table to solve the TSP [14]. The proposed method combines ant colony optimization with the tabu search algorithm to optimize the TSP and shows the effectiveness of ant colony optimization with a tabu table; their experimental results are more effective than the classic ant algorithm. Mueller et al. gave a hybrid approach for the TSP based on neural networks and ant colony optimization [15]; this approach is based on an intelligent combination of artificial ants and neurons, and the respective parameters and their influence on the algorithm's behavior were investigated and explained in the article. Valdez et al. presented a framework for ant colony optimization for solving the symmetric TSP with parallel processing [16]. They consider symmetric TSP instances with 22 and 1060 cities, using Euclidean distances between cities, and their proposed algorithm obtains better results than the classic ant algorithms. Salem et al. presented a framework for the analysis of ACO solutions for the TSP [17]. In this article, several optimization algorithms are applied to the TSP, based on tuning multiple parameters. The proposed algorithm maximizes the performance of reaching the best path, and they show that the cost increases with both the number of visited cities and the evaporation rate, while it decreases with both the number of ants and the number of iterations. It can be summarized that the quality of solutions depends on the number of ants; the experimental results are based on local heuristics that can be used for constructing solutions, and the authors also examine their influence on the performance of the Ant System. Ni [37] presents a framework for the optimization of railway passenger transfer schemes based on ACO. The proposed method derives passenger transfer schemes from high-speed railway departure and arrival times; an analysis of the scheme from Lanzhou Xi to Beijing Xi, based on the high-speed railway network of China in 2017, shows that the method can obtain a better passenger transfer plan. Bouzbita et al. [38] presented a parameter-adaptation ACO algorithm using a Hidden Markov Model (HMM) for TSP problems, applying an HMM controller to the ACS algorithm to obtain the shortest TSP path and improve the quality of solutions. Shetty et al. [39] present a framework for an improved ACO algorithm, Minion Ant (MAnt), and its application to the TSP; it produces good results when applied to the TSP and improves performance through parallelization of the deployment of minion ants. Yang et al. [44] gave a new version of the ACO algorithm for the generalized TSP. The proposed algorithm is focused on a local search technique and obtains good results when the problem scale is less than 200 cities. Bhardwaj et al. [45] adopted a parallel implementation of the TSP using ACO; they compare different parameters of ACO, give a hybrid implementation of ACO on GPU that utilizes the resources properly with data fragmentation, and optimize the proposed algorithm to gain more speedup. Finally, Menezes et al. [46] present a framework for parallelization strategies of GPU-based ACO for solving the TSP. The proposed parallel algorithm reduces the execution time of the TSP. Table 1 presents some of the variants of ACO algorithms developed for solving the TSP.


Table 1. The usage of ACO in TSP

Sl. No. | Authors | Variant of ACO algorithm for TSP | Improvements | Results and discussion
1 | Raghavendra [13] | Ant Colony System (ACS) | Symmetric ACO is introduced | Improves global search ability and increases performance on the TSP
2 | Chen et al. [14] | Elitist AS (EAS) | Combined ACO and genetic algorithms | More emphasis on the global best tour
3 | Mueller et al. [15] | Ant Colony System (ACS) | A hybrid combination of ACO and neurons | The hybrid of ant colony and neurons finds the best possible path with high statistical significance
4 | Valdez et al. [16] | Ant Colony System (ACS) | Parallel ACO algorithm introduced | Reduces processing time in the experiments
5 | Salem et al. [17] | Ant Colony System (ACS) | Tuning of multiple parameters introduced | Improved global search ability to select the optimal path with better computational performance
6 | Ni [37] | MAX-MIN Ant System | Introduced a random factor for pheromone concentration at time t | Obtains a transfer plan quickly for railway passenger train schedules
7 | Bouzbita et al. [38] | Ant Colony System (ACS) | Global pheromone update introduced | The fuzzy- and ACO-based method has lower time complexity with regard to the fuzzy system
8 | Shetty et al. [39] | Ant Colony System (ACS) | Dynamic parameter alteration introduced | Achieves the best tour of 96830 with a relative error of 0.059%
9 | Yang et al. [44] | Ant Colony System (ACS) | A mutation process and a local search technique introduced | The generalized TSP gets good results when the problem scale is less than 200 cities
10 | Bhardwaj et al. [45] | Ant Colony System (ACS) | Hybrid implementation of ACO on GPU introduced | The parallelization of ACO gains more speedup
11 | Menezes et al. [46] | Ant Colony System (ACS) | Parallel strategies for GPU-based ACO introduced | The proposed parallel algorithm reduces the execution time of the TSP

4.2 Ant Colony Optimization for Subset Selection Problem
The ACO literature typically treats the development of a construction graph for the problem to be solved as necessary to the application of ACO, since the problem that real ants solve when traveling from the nest to a food source is a graph-based shortest-path problem. Cao et al. [18] gave an ensemble classifier based on feature selection using ACO. This article proposed a classifier that provides an effective methodology, and it also verified the performance of the proposed ensemble classifier. Rajoo et al. [19] adopted a graph-based ant system to solve feature selection, where reducing the degree of the graph representing the search space lowers the computational complexity from O(n²) to O(nm). Qi et al. [40] presented a large-scale transactional service selection approach based


on Skyline and the ACO algorithm; the method improves the efficiency of selection over large-scale transactional services. Shemi et al. [41] gave a novel relay selection algorithm using ACO with artificial noise for secrecy enhancement in cooperative networks; the algorithm is evaluated in three wireless scenarios with bidirectional relay communication involving trusted or un-trusted relays. Crawford et al. presented an ant-based solver for subset problems [20]. The algorithm solves benchmark instances of the set partitioning problem with an ant-based algorithm using a transition rule with a look-ahead mechanism; the effectiveness of the proposed rule is tested on benchmark problems and the results are compared with existing ACO algorithms. Sharma et al. gave a novel ACO-based training subset selection algorithm for hyperspectral image classification [21]. Inspired by the ant colony system, the method selects a subset from already developed training samples; this small subset is efficient for classification and achieves better accuracy. Fidanova et al. presented start strategies of ACO applied to subset problems [22]. The method addresses an ACO algorithm with controlled start combining five start strategies; the solutions achieved with the strategies are better than a random start, and the influence of the parameters on algorithm performance remains to be investigated. Alsabour gave a binary ant colony optimization for subset problems [23]. The work develops a new ACO framework for optimization problems that require selection, taking feature selection for regression as a representative subset problem. It is composed of three steps: explaining the main guidelines for developing an ant algorithm, demonstrating different solution representations for subset problems using ACO, and proposing a binary ant algorithm for feature selection in regression problems. Aghdam et al. [24] follow a method similar to Jensen et al. [25], but the pheromone is associated with the vertices of the graph instead of the edges; the vertices represent features, and the edges between them indicate the choice of the next feature. Each ant starts with a random feature and, from this initial position, visits edges probabilistically until a traversal stopping criterion is satisfied. Fallahzadeh et al. [47] present Raman spectral feature selection using ACO for breast cancer diagnosis. The method gives a new version of the ant system that improves the diagnostic accuracy of Raman-based diagnostic models, reaching 87.7% accuracy in detecting cancerous samples of breast tissue. Nasser et al. [48] presented a hybrid approach for feature subset selection using ACO and a multi-classifier ensemble, which selects the feature subsets with the highest classification accuracy. Finally, Peng et al. [49] present an improved feature selection algorithm based on ACO. The method designs a fitness function and a pheromone updating rule that improve the classification accuracy of the selected feature subsets. Table 2 presents the usage of ACO in SSP.
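To make the vertex-pheromone idea attributed to Aghdam et al. [24] concrete, the following is a minimal Python sketch in which pheromone sits on the features themselves and each ant grows a fixed-size subset probabilistically; the fitness function, subset size, and all parameters are illustrative assumptions rather than details from the cited papers.

```python
# Minimal sketch of vertex-pheromone ACO for feature subset selection.
import random

def aco_feature_selection(n_features, fitness, k, n_ants=8, n_iters=30, rho=0.2):
    tau = [1.0] * n_features                       # pheromone per feature (vertex)
    best_subset, best_fit = None, float("-inf")
    for _ in range(n_iters):
        subsets = []
        for _ in range(n_ants):
            chosen = set()
            while len(chosen) < k:
                # pick the next feature with probability proportional to its pheromone
                cand = [f for f in range(n_features) if f not in chosen]
                r, acc = random.uniform(0, sum(tau[f] for f in cand)), 0.0
                for f in cand:
                    acc += tau[f]
                    if acc >= r:
                        chosen.add(f); break
            subsets.append((chosen, fitness(chosen)))
        tau = [t * (1.0 - rho) for t in tau]       # evaporation
        for subset, fit in subsets:
            if fit > best_fit:
                best_subset, best_fit = subset, fit
            for f in subset:                       # deposit on the features used
                tau[f] += max(fit, 0.0)
    return best_subset, best_fit

if __name__ == "__main__":
    # Toy fitness: reward subsets containing the 'informative' features 0 and 3.
    toy = lambda s: sum(1.0 for f in s if f in (0, 3))
    print(aco_feature_selection(n_features=6, fitness=toy, k=2))
```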
4.3 Ant Colony Optimization for 0/1 Knapsack Problem

The aim of this literature survey is to study and analyze the available algorithms for finding optimal solutions of the 0/1 knapsack problem. It is a problem in combinatorial optimization: given a set of objects, each with a weight and a value, determine which items to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible. In this section, we review some of the works developed in the line of solving the 0/1 knapsack problem using ACO.

Table 2. The usage of ACO in SSP

| Sl. No. | Authors | Variants of ACO algorithms for SSP | Improvements | Results and discussions |
|---|---|---|---|---|
| 1 | Cao et al. [18] | ACO-Parallel classifier | Introduced support vector machines (SVM) as base classifiers | Improves the classifying ability and availability of single-source information of the feature selection method |
| 2 | Rajoo et al. [19] | Ant-Q | Introduced graphs with prescribed degree sequences | Reduced the computation complexity from O(n²) to O(n) |
| 3 | Qi et al. [40] | MAX-MIN Ant System | Imposes lower and upper bounds on the pheromone trail | Uses skyline over 2507 real-world web services to reduce candidate service selection |
| 4 | Shemi et al. [41] | Ant Colony System (ACS) | Adds wireless channel parameters | Relay selection with bidirectional relay communication employing cooperative trusted or un-trusted relays |
| 5 | Crawford et al. [20] | Ant Colony System (ACS) | Addition of a look-ahead mechanism in the construction phase of ACO | The number of generated candidate solutions is limited to 1500 |
| 6 | Sharma et al. [21] | Ant Colony System (ACS) | Introduced a joint AS and ACS approach to select available samples from each class | Better accuracy by selecting small samples |
| 7 | Fidanova et al. [22] | Ant Colony System (ACS) | Adds several start strategies | The solutions achieved with the strategies are better than a random start |
| 8 | Alsabour [23] | Ant Colony System (ACS) | Introduced binary ACO | The SVM system is improved compared to the ACO algorithm |
| 9 | Fallahzadeh et al. [47] | Ant Colony System (ACS) | Introduced Raman spectral feature selection | Detects breast cancer tissue with an accuracy of 87.7% |
| 10 | Nasser et al. [48] | Ant Colony System (ACS) | Introduced a hybrid approach to select the heuristic function | Selects a subset with the highest accuracy |
| 11 | Peng et al. [49] | Ant Colony System (ACS) | A new fitness function and pheromone updating rule were introduced | Improved the classification accuracy of the selected feature subsets |
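Before turning to the surveyed ACO variants, the problem just defined also admits a compact exact solution by dynamic programming, which those variants approximate on larger instances. The following is a minimal Python sketch of the textbook routine, not any of the cited algorithms; the item data is an illustrative assumption.

```python
# Minimal dynamic-programming sketch of the exact 0/1 knapsack problem:
# maximize total value subject to the capacity limit.
def knapsack_01(weights, values, capacity):
    # dp[c] = best value achievable with capacity c using the items seen so far
    dp = [0] * (capacity + 1)
    for i in range(len(weights)):
        # iterate capacities downward so each item is used at most once
        for c in range(capacity, weights[i] - 1, -1):
            dp[c] = max(dp[c], dp[c - weights[i]] + values[i])
    return dp[capacity]

if __name__ == "__main__":
    # Best choice here is items of weight 3 and 5 (value 30 + 60 = 90).
    print(knapsack_01(weights=[3, 4, 5], values=[30, 50, 60], capacity=8))
```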

Changdar et al. presented a framework for solving the 0-1 knapsack problem by a continuous ACO algorithm [26]. In the proposed ACO algorithm, 'n' candidate groups are created for an 'n'-variable function, and no initial information is needed for pheromone initialization. The algorithm may be used for both discrete and continuous problems. They also compare results on standard test functions and


the 0/1 knapsack problem against the existing literature. Chaharsooghi et al. gave a modified ant colony optimization algorithm, an intelligent multi-colony multi-objective ACO, for the 0/1 knapsack problem [27]. They also proposed a new pheromone updating rule for the multi-objective case that increases the effectiveness of the ACO algorithm. Alzaqebah et al. [28] presented a framework of a novel ACO with dynamic pheromone updating for the 0/1 knapsack problem; the new heuristic allows items to be added to an empty knapsack, and the proposed randomization process converges faster. Mansour et al. [42] gave a framework for indicator-based ACO (IBACO) for the multi-objective knapsack problem. Binary quality indicators guide the artificial ants, so IBACO works with greater efficiency than the other tested algorithms. Samanta et al. presented a framework in which an ant weight lifting algorithm is used to solve the 0/1 knapsack problem [29]. Drawing on the observation that ants carry food heavier than their own weight, this ACO-based algorithm solves the 0/1 knapsack problem with a single objective. Kumar et al. gave an ant colony optimization algorithm for the multi-dimensional 0/1 knapsack problem [30]. The goal is to maximize the profit while the total weight of the items put into the knapsack does not exceed the given capacity. The proposed parallel ant colony algorithm solves the multi-dimensional 0/1 knapsack problem on large data sets using the message passing interface; it handles the problem effectively and takes minimum time. Fidanova et al. gave a generalized net model in which the ant colony optimization process is constructed for multiple knapsack problems with controlled starts [31]. The article develops several start strategies, tests them on multiple knapsack problems, and determines the best-performing strategy. Tao et al. [43] presented a MapReduce-based ACO for the multi-dimensional knapsack problem; they also propose a heuristic function that changes the timing and probability calculations for greater efficiency. Iqbal et al. presented a framework of a novel ACO technique for fast and near-optimal solutions to the multi-dimensional multi-choice knapsack problem [50]. The algorithm solves multidimensional multi-choice knapsack problems (MMKPs) with an effective random local search strategy that increases performance; the solutions are also compared with other existing algorithms. Finally, Zouari et al. [51] gave a novel approach hybridizing an ant colony algorithm with a local search for the strongly correlated knapsack problem. The scheme combines two ant algorithms and gives quick and efficient results. In Table 3 we present some of the variants of ACO for the 0/1 knapsack problem.

4.4 Ant Colony Optimization for Minimum Vertex Cover Problem

In the literature on MVC, various methods have been designed to solve the problem. The aim of this survey is to study and analyze the existing algorithms for finding an optimal solution to the MVC problem. Here, we review some of the works developed in the line of solving MVC using ACO. Bouamama et al. [32] presented an algorithm based on ACO for the minimum connected dominating set problem. The algorithm uses a reduced variable neighborhood search technique for improving the generated solutions. Mehrabi et al. [33]


Table 3. The usage of ACO in the 0/1 knapsack problem

| Sl. No. | Authors | Variants of ACO algorithms for the 0/1 knapsack problem | Improvements | Results and discussions |
|---|---|---|---|---|
| 1 | Changdar et al. [26] | Ant Colony System (ACS) | Introduced continuous pheromone initialization | Can be used to solve both discrete and continuous problems |
| 2 | Chaharsooghi et al. [27] | Ant Colony System (ACS) | A new pheromone updating rule was introduced | Increases the performance of the algorithm and gives better results |
| 3 | Alzaqebah et al. [28] | Ant Colony System (ACS) | Introduced a new heuristic function | Faster on most of the problem instances |
| 4 | Mansour et al. [42] | Indicator-based ACO | Uses binary quality indicators to guide artificial ants | IBACO shows its effectiveness on multi-objective problems |
| 5 | Fidanova [31] | Ant System (AS) | Introduced a probabilistic model | Well suited to generating high-quality solutions for the MKP |
| 6 | Samanta et al. [29] | Ant Weight Lifting algorithm (AWL) | Introduced weight into ACS | Solves the 0/1 knapsack with a single objective |
| 7 | Kumar et al. [30] | ACO-MDKP | Introduced parallel ACS | Solves the problem effectively and reduces the execution time |
| 8 | Tao et al. [43] | MapReduce-based improved ACO for MDKP | Adds MIAM into ACS | The proposed algorithms achieve good results |
| 9 | Iqbal et al. [50] | Ant Colony System (ACS) | A multidimensional multi-choice random local search strategy was introduced | Increases the performance of the algorithm |
| 10 | Zouari et al. [51] | MAX-MIN Ant System and Ant Colony System (ACS) | Introduced hybridization of the algorithms | The proposed algorithm is quick and efficient |

introduced a pruning-based ant colony algorithm for the minimum vertex cover problem (MVC). The meta-heuristic, based on the ACO approach, finds approximate solutions to the MVC problem and improves both the running time and the convergence rate of the algorithm. Jovanovic et al. [34] introduced an ant colony optimization algorithm with an improved pheromone correction strategy for the minimum weight vertex cover (MWVC) problem. The article presents a new type of hybridization for ACO applied to the MWVC problem: a pheromone trail correction method is added to the existing ACO algorithm, improving its efficiency, with a heuristic function defined for measuring the suitability of a partial solution. They compared the hybridization against the classical ACO and showed its advantage while requiring minimal change to the original source code. Shyu et al. presented a meta-heuristic based on the ACO approach to find approximate solutions to the MWVC problem [36]. The algorithm adds several new features for selecting vertices out of the vertex set so that the total weight can


be minimized; computational experiments show the performance of the proposed algorithm. Finally, Chen et al. presented a framework for the MVC problem based on an ant colony algorithm [35]. They modify the state transition probability to obtain an approximate MVC algorithm and show that its time complexity is O(n²). In this section, we have reviewed meta-heuristic schemes for composing approximate solutions to MVC, which is known to be computationally intractable, and illustrated how an ACO approach is applied to it: a state transition probability is defined for ants to choose vertices, and different vertices are used as initial points for ants to search for vertex cover sets. Table 4 presents the various variants of ACO for MVC.

Table 4. The usage of ACO in MVC

| Sl. No. | Authors | Variants of ACO algorithms for MVC | Improvements | Results and discussion |
|---|---|---|---|---|
| 1 | Bouamama et al. [32] | MAX-MIN Ant System | Modified pheromone values are introduced | Applies a reduced variable neighborhood search technique to improve solutions |
| 2 | Mehrabi et al. [33] | Ant Colony System (ACS) | Introduced a pruning paradigm for the ants | Improves both the running time and the convergence rate of the MVC algorithm |
| 3 | Shyu [36] | Ant System (AS) | A new pheromone updating rule was introduced | Selects a subset of vertices that covers all edges of the graph without losing performance |
| 4 | Jovanovic [34] | Ant Colony System (ACS) | Introduced a pheromone correction heuristic strategy | The hybridization has the potential to improve ACO applications for the MVC problem |
| 5 | Chen et al. [35] | Ant-Cycle Model | Modifies the state transition probability | Obtains the minimum vertex cover with lower time complexity |
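The pheromone-guided vertex selection these MVC papers share can be sketched as below: starting from an empty cover, an ant repeatedly picks an endpoint of an uncovered edge with probability proportional to pheromone times a degree heuristic, until every edge is covered. The degree heuristic, the parameter beta, and the toy graph are illustrative assumptions, not details of any single cited algorithm.

```python
# Minimal sketch of one ant's pheromone-guided vertex cover construction.
import random

def ant_vertex_cover(edges, n_vertices, tau=None, beta=1.0):
    tau = tau or [1.0] * n_vertices   # uniform pheromone if none supplied
    uncovered = set(edges)
    cover = set()
    while uncovered:
        # candidates: vertices touching at least one still-uncovered edge
        degree = [0] * n_vertices
        for u, v in uncovered:
            degree[u] += 1; degree[v] += 1
        cand = [v for v in range(n_vertices) if degree[v] > 0]
        # state transition probability: tau[v] * degree[v]^beta
        weights = [tau[v] * (degree[v] ** beta) for v in cand]
        r, acc = random.uniform(0, sum(weights)), 0.0
        for v, w in zip(cand, weights):
            acc += w
            if acc >= r:
                cover.add(v)
                uncovered = {e for e in uncovered if v not in e}
                break
    return cover

if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]   # toy graph
    print(ant_vertex_cover(edges, n_vertices=4))
```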

5 Discussions and Conclusions

After reviewing the literature on solving the TSP with ACO algorithms, we find that ACO proves to be a much better algorithm than conventional approaches such as dynamic programming and greedy algorithms. Heuristic algorithms are more effective for small and mid-size problems than other algorithms. Different updating mechanisms and route selection mechanisms add memory to the artificial ants and have been implemented to enhance the TSP-ACO algorithm, but much work remains to improve existing ACO algorithms. We conclude that the algorithm of Salem et al. [17] is the most efficient, based on tuning multiple parameters; the method affects the cost of the TSP algorithm and maximizes the performance in reaching the best path. They show that the cost increases with both the number of visited cities and the


evaporation rate of pheromones, while it decreases with both the number of ants and the number of iterations. Similarly, we have made a thorough review of research papers on solving SSPs, in which the task is to find a feasible and optimal subset of an initial set of objects S. We conclude that the FACO algorithm of Peng et al. [49] is the most suitable and cost-effective, as it avoids the defects of existing feature selection algorithms. They designed a fitness function for feature selection that improves the path transfer probability of the ant colony, and a two-stage pheromone updating rule adds pheromone to more paths to prevent the algorithm from falling into a local optimum prematurely. The experimental results show that the FACO algorithm can improve the classification efficiency and accuracy of the classifiers, which gives it practical significance. We also reviewed the literature on the 0/1 knapsack problem and collected the different heuristic functions of ACO algorithms that can guide further designs for the 0/1 knapsack problem. We conclude that the algorithm of Alzaqebah et al. [28] is more efficient than other existing methods; it introduces new heuristics integrated with the amount of pheromone increase after each cycle, where the selection of each object depends on a different pheromone increment based on the solution profit and the state of the knapsack before adding the object. The experimental results show that the approach is clearly faster than the original one on most problem instances, owing to the randomization process. From the available literature on ACO for the minimum vertex cover problem, we studied and analyzed the available algorithms for finding prominent or optimal solutions. Different heuristic functions have been proposed to select the minimum number of vertices that cover all the edges of a graph, but much work remains to improve these ACO-MVC algorithms. We conclude that the approach of Chen et al. [35] is the most efficient: they improve the original Ant-Cycle Model, and the approximate algorithm finds the minimum vertex subset with a time complexity of O(n²), where n is the number of vertices in the graph; an example illustrates the process of the algorithm and its feasibility. The bag of future research includes the practical realization of ACO on very large-scale intractable problems, the exploration of ACO on more intractable problems with societal relevance, and the development of novel ant-based algorithms in the Indian context.

References

1. Osman, I.H., Laporte, G.: Meta-heuristics: a bibliography. Ann. Oper. Res. 63, 513–623 (1996)
2. Papadimitriou, C.H.: Computational Complexity. Addison-Wesley Inc., Boston (1994)
3. Ausiello, G., Crescenzi, P., Gambosi, G., Kann, V., Marchetti-Spaccamela, A., Protasi, M.: Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability Properties. Springer, Heidelberg (1999)
4. Asif, M., Baig, R.: Solving NP-complete problem using ACO algorithm. In: Proceedings of International Conference on Emerging Technologies, pp. 13–16. IEEE (2009)
5. Dorigo, M., Stützle, T.: Ant Colony Optimization. MIT Press, Cambridge (2004)


6. Dorigo, M., Blum, C.: Ant colony optimization theory: a survey. Theor. Comput. Sci. 344(2–3), 243–278 (2005)
7. Dorigo, M., Caro, G.D., Gambardella, L.: Ant algorithms for discrete optimization. Artif. Life 5, 137–172 (1999)
8. Dorigo, M., Bonabeau, E., Theraulaz, G.: Ant algorithm and stigmergy. Future Gener. Comput. Syst. 16(8), 851–871 (2000)
9. Li, B., Wang, L., Song, W.: Ant colony optimization for the traveling salesman problem based on ants with memory. In: Proceedings of Fourth International Conference on Natural Computation. IEEE (2009)
10. www.math.uwaterloo.ca/tsp/history/index.html
11. https://kids.kiddle.co/Travelling_salesman_problem
12. Asif, M., Baig, R.: Solving NP-complete problem using ACO algorithm. In: Proceedings of International Conference on Emerging Technologies, pp. 13–16 (2009)
13. Raghavendra, B.V.: Solving traveling salesmen problem using ant colony optimization algorithm. J. Appl. Comput. Math. 4(6), 260 (2015)
14. Chen, H., Tan, G., Qian, G., Chen, R.: Ant colony optimization with tabu table to solve TSP problem. In: Proceedings of the 37th Chinese Control Conference, 25–27 July, pp. 2523–2527. IEEE (2018)
15. Mueller, C., Kiehne, N.: Hybrid approach for TSP based on neural networks and ant colony optimization. In: Symposium Series on Computational Intelligence, pp. 1431–1435. IEEE (2015)
16. Valdez, F., Chaparro, I.: Ant colony optimization for solving the TSP symmetric with parallel processing. In: Proceedings of Joint IFSA World Congress and NAFIPS Annual Meeting, pp. 1192–1196. IEEE (2013)
17. Salem, A., Sleit, A.: Analysis of ant colony optimization algorithm solutions for travelling salesman problem. Int. J. Sci. Eng. Res. 9(2) (2018)
18. Cao, J., Guojun, L., Shang, Y., Weng, N., Chang, C., Liu, Y.: An ensemble classifier based on feature selection using ant colony optimization. In: Proceedings of High Performance Extreme Computing Conference (HPEC). IEEE (2018)
19. Rajoo, R.R., Salam, R.A.: Ant colony optimization based subset feature selection in speech processing: constructing graphs with degree sequences. Int. J. Adv. Sci. Eng. Inf. Technol. 8(4–2), 1728 (2018)
20. Crawford, B., Carlos, C., Monfroy, E.: An ant-based solver for subset problems. In: Proceedings of International Conference on Advances in Computing, Control, and Telecommunication Technologies, pp. 268–270. IEEE (2009)
21. Sharma, S., Buddhiraju, K.M.: A novel ant colony optimization based training subset selection algorithm for hyper spectral image classification. In: Proceedings of International Geosciences and Remote Sensing Symposium, pp. 5748–5751. IEEE (2018)
22. Fidanova, S., Atanassov, K., Marinov, P.: Start strategies of ACO applied on subset problems. In: Dimov, I., Dimova, S., Kolkovska, N. (eds.) Proceedings of International Conference on Numerical Methods and Applications, pp. 248–255. Springer, Heidelberg (2011)
23. Abd-Alsabour, N.: Binary ant colony optimization for subset problems. In: Dehuri, S., Jagadev, A., Panda, M. (eds.) Multi-objective Swarm Intelligence. Studies in Computational Intelligence, vol. 592, pp. 105–121. Springer, Heidelberg (2015)
24. Nemati, S., Basiri, M.E., Ghasem-Aghaee, N., Aghdam, M.H.: A novel ACO–GA hybrid algorithm for feature selection in protein function prediction. Expert Syst. Appl. 36, 12086–12094 (2009)
25. Jensen, R., Shen, Q.: Webpage classification with ACO-enhanced fuzzy-rough feature selection. In: Greco, S., et al. (eds.) RSCTC 2006. LNAI, vol. 4259, pp. 147–156. Springer, Heidelberg (2006)


26. Changdar, C., Mahapatra, G.S., Pal, R.K.: Solving 0-1 knapsack problem by continuous ACO algorithm. Int. J. Comput. Intell. Stud. 2(3/4), 333 (2013)
27. Chaharsooghi, S.K., Amir, H., Kermani, M.: An intelligent multi-colony multi-objective ant colony optimization (ACO) for the 0-1 knapsack problem. In: Proceedings of IEEE Congress on Evolutionary Computation (CEC), pp. 1195–1202 (2008)
28. Alzaqebah, A., Abu-Shareha, A.A.: Ant colony system algorithm with dynamic pheromone updating for 0/1 knapsack problem. Int. J. Intell. Syst. Appl. 11(2), 9–17 (2019)
29. Samanta, S., Chakraborty, S., Acharjee, S., Mukherjee, A., Dey, N.: Solving 0/1 knapsack problem using ant weight lifting algorithm. In: IEEE International Conference on Computational Intelligence and Computing Research (2013)
30. Kumar, A., Rasool, A., Hajela, G.: Parallel ant colony algorithm for multi-dimensional 0-1 knapsack problem based on message passing interface (MPI). Int. J. Comput. Sci. Inf. Secur. 1(8), 613–620 (2016)
31. Fidanova, S., Atanassov, K., Marinov, P., Parvathi, R.: Ant colony optimization for multiple knapsacks problem with controlled starts. BIO-Automation 13(4), 271–280 (2009)
32. Bouamama, S., Blum, C., Fages, J.G.: An algorithm based on ant colony optimization for the minimum connected dominating set problem. Appl. Soft Comput. 80, 672–686 (2019)
33. Mehrabi, A.D., Mehrabi, S., Mehrabi, A.: A pruning based ant colony algorithm for minimum vertex cover problem. In: International Joint Conference on Computational Intelligence (IJCCI), pp. 281–286 (2009)
34. Jovanovic, R., Tuba, M.: An ant colony optimization algorithm with improved pheromone correction strategy for the minimum weight vertex cover problem. Appl. Soft Comput. 11, 5360–5366 (2011)
35. Chen, J., Kanj, I.A., Xia, G.: Improved parameterized upper bounds for vertex cover. In: Královič, R., Urzyczyn, P. (eds.) Mathematical Foundations of Computer Science 2006. LNCS, vol. 4162, pp. 238–249. Springer, Heidelberg (2006)
36. Shyu, S.J.: An ant colony optimization algorithm for the minimum weight vertex cover problem. Ann. Oper. Res. 131, 283–304 (2004)
37. Ni, X.: Optimization research of railway passenger transfer scheme based on ant colony algorithm. In: 6th International Conference on Computer-Aided Design, Manufacturing, Modeling and Simulation, AIP Conference Proceedings (2018)
38. Bouzbita, S., El Afia, A., Faizi, R.: Parameter adaptation for ant colony system algorithm using Hidden Markov Model for TSP problems. In: Proceedings of LOPAL Conference, pp. 2–5. ACM (2018)
39. Shetty, A., Shetty, A., Puthusseri, K.S., Shankaramani, R.: An improved ant colony optimization algorithm: Minion Ant (MAnt) and its application on TSP. In: Symposium Series on Computational Intelligence (SSCI). IEEE (2018)
40. Qi, L., Yao, W., Chang, J.: A large scale transactional service selection approach based on skyline and ant colony optimization algorithm. Int. J. Technol. Eng. Stud. 4(3), 95–101 (2018)
41. Shemi, P.M., Jibukumar, M.G., Sabu, M.K.: A novel relay selection algorithm using ant colony optimization with artificial noise for secrecy enhancement in cooperative networks. Int. J. Commun. Syst. 31(14), 3739 (2018)
42. Mansour, I.B., Alaya, I.: Indicator based ant colony optimization for multi-objective knapsack problem. In: International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, Procedia Computer Science, vol. 60, pp. 448–457 (2015)
43. Tao, L.R., Jian, L.X.: MapReduce-based ant colony optimization algorithm for multidimensional knapsack problem. Appl. Mech. Mater. 380–384, 1877–1880 (2013)
44. Yang, J., Shia, X., Marchese, M., Liang, Y.: An ant colony optimization method for generalized TSP problem. Prog. Nat. Sci. 18, 1417–1422 (2008)


45. Bhardwaj, G., Pandey, M.: Parallel implementation of travelling salesman problem using ant colony optimization. Int. J. Comput. Appl. Technol. Res. 3(6), 385–389 (2014)
46. Menezes, B.A.M., Kuchen, H., Amorim Neto, H.A., de Lima Neto, F.B.: Parallelization strategies for GPU-based ant colony optimization solving the traveling salesman problem. In: Proceedings of IEEE Congress on Evolutionary Computation (2019)
47. Fallahzadeh, O., Dehghani-Bidgoli, Z., Assarian, M.: Raman spectral feature selection using ant colony optimization for breast cancer diagnosis. Lasers Med. Sci. 33(8), 1799–1806 (2018)
48. Naseer, A., Shahzad, W., Ellahi, A.: A hybrid approach for feature subset selection using ant colony optimization and multi-classifier ensemble. Int. J. Adv. Comput. Sci. Appl. 9(1), 306–313 (2018)
49. Peng, H., Ying, C., Tan, S., Hu, B., Sun, Z.: An improved feature selection algorithm based on ant colony optimization. IEEE Access 6, 69203–69209 (2018)
50. Iqbal, S., Bari, F.Md., Rahman, M.S.: A novel ACO technique for fast and near optimal solutions for the multi-dimensional multi-choice knapsack problem. In: 13th International Conference on Computer and Information Technology. IEEE (2010)
51. Zouari, W., Alaya, I., Tagina, M.: A hybrid ant colony algorithm with a local search for the strongly correlated knapsack problem. In: 14th International Conference on Computer Systems and Applications, pp. 527–533. IEEE (2017)
52. Dorigo, M.: Optimization, learning and natural algorithms. Ph.D. thesis, Dipartimento di Elettronica, Politecnico di Milano, Italy (1992). (in Italian)
53. Dorigo, M., Blum, C.: Ant colony optimization theory: a survey. Theoret. Comput. Sci. 344(2–3), 243–278 (2005)
54. Dorigo, M., Caro, G.D.: The ant colony optimization meta-heuristic. In: Corne, D., Dorigo, M., Glover, F. (eds.) New Ideas in Optimization. McGraw-Hill, New York (1999)
55. Dorigo, M., Caro, G.D., Gambardella, L.: Ant algorithms for discrete optimization. Artif. Life 5, 137–172 (1999)
56. Dorigo, M., Gambardella, L.: Ant colony system: a cooperative learning approach to the travelling salesman problem. IEEE Trans. Evol. Comput. 1, 53–66 (1997)
57. Dorigo, M., Maniezzo, V., Colorni, A.: Ant system: optimization by a colony of cooperating agents. IEEE Trans. Syst. Man Cybern. 26(1), 28–41 (1996)
58. Dorigo, M., Maniezzo, V., Colorni, A.: The ant system: optimization by a colony of cooperating agents. IEEE Trans. Syst. Man Cybern. 26, 29–41 (1996)
59. Dorigo, M., Gambardella, L.M.: Ant colony system: a cooperative learning approach to the traveling salesman problem. IEEE Trans. Evol. Comput. 1(1), 53–66 (1997)

Machine Learning Models for Stock Prediction Using Real-Time Streaming Data

Monalisa Jena1(B), Ranjan Kumar Behera2, and Santanu Kumar Rath3

1 Department of Information and Communication Technology, Fakir Mohan University, Balasore, Odisha, India
[email protected]
2 Department of Computer Science and Engineering, Veer Surendra Sai University of Technology, Burla, India
[email protected]
3 Department of Computer Science and Engineering, National Institute of Technology, Rourkela, Rourkela, India
[email protected]

Abstract. In recent years, stock prediction has attracted considerable attention from researchers in the financial sector. Apart from static log data, streaming data collected in real time has proven to be a perennial source for analysis; it deals with the continuous flow of data carrying information from sources such as websites, mobile applications, server logs, social media, trading floors, etc. A model built from historical data can be relentlessly honed to give ever more accurate results, since its outcome is always compared against the next tick of the clock. In this study, an attempt is made to develop machine learning models that predict the potential prices of a company's stock, which helps in making financial decisions. Spark Streaming has been considered for processing the humongous data, and data ingestion tools such as NodeJS have been used for analysis. Earlier research exists on the same concept, but the present goal is to develop a model that is scalable and fault-tolerant and has lower latency. The model rests on a distributed computing architecture called the Lambda Architecture, which helps in attaining these goals. Upon analysis, it is found that the prediction of stock values is most accurate when support vector regression is applied. The historical stock values are used as supervised datasets for training the models.

Keywords: Stream processing · Support vector regression · NodeJS · Decision Tree · Stock prediction

1 Introduction

As technology advances, remarkable growth is observed in the size of the data produced on a daily basis. Hence, the need to manage such humongous data on a real-time basis has cropped up. This paved the way for considering a data processing architecture for massive quantities of real-time data. In this study, a case on the prediction of stock prices


based on real-time data has been taken into consideration. The major requirements for a good real-time data processing architecture are that it should be fault-tolerant, extensible, and scalable. The Lambda Architecture is a popular distributed architecture widely adopted for streaming analysis by a number of companies such as Yahoo and Netflix [1]. The need for analysis based on huge amounts of data has laid new foundations for research in areas such as Big Data analysis, resulting in robust models that scale up while maintaining optimum fault tolerance [2]. In today's fast-moving world, information flows easily; this includes the dissemination of market rates through financial websites and the views of users on social media, both of which can influence market rates by leaps and bounds [3]. Hence, predicting rates based on this information is all the more meaningful, since the continuous flow of data reshapes the model prepared from historical data, making it more accurate with each iteration.

2 Machine Learning Techniques Adopted

Out of the numerous machine learning algorithms used for data analysis, four are used in this work. The four regression models under supervised learning are chosen for their technical feasibility and their availability through Spark's MLlib [4].

2.1 Decision Tree

Decision tree algorithms are supervised learning algorithms used for classification as well as regression [5]. They are among the simplest and most effective rule-based machine learning algorithms. The main idea behind their working is the splitting node that decides upon the class; the splitting criterion is based on the computation of measures such as the Gini index, information gain, chi-square, etc. After computing the value of each candidate split, the root node is created from the feature showing the best splitting value.

2.2 Polynomial Linear Regression

In statistics, polynomial regression is a linear approach for finding out whether a relationship exists between one or more explanatory variables, otherwise called independent variables, and a scalar dependent variable [6]. If the model depends on a single explanatory variable, it is called simple linear regression; if it involves many explanatory variables, it is called multiple linear regression.

2.3 Support Vector Regression (SVR)

The Support Vector Machine (SVM) is one of the most popular machine learning techniques. It is usually used for classification problems, where an object or entity is assigned to a specific category based on its feature vector, but it is also used for regression analysis, i.e., the prediction of a continuous ordered value based on past data [7]. The concept it rests on is that of support vectors,


which can be considered as the borderlines of the points that truly mark the separation of the classes. After obtaining the support vectors of each class, the mean of those vectors is found, which becomes the final margin of separation.

2.4 Random Forest Generation

The random forest algorithm is based on a concept similar to decision trees [8]. It generates multiple decision trees in order to get more accurate results; hence it is also called an ensemble decision tree method. Each split made while building a tree depends on factors such as the information gain, Gini index, etc. For a regression problem, a random forest considers a number of decision trees and predicts the outcome based on the output of all the trees.
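The study trains these four models through Spark's MLlib; purely as an illustration, the following scikit-learn sketch fits analogous models on synthetic data. The data, hyper-parameters, and library choice are assumptions, not the paper's actual settings.

```python
# Illustrative comparison of the four regression models on synthetic data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.random((200, 3))            # stand-in features (e.g. opening price, volume, ...)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.standard_normal(200)

models = {
    "Decision tree": DecisionTreeRegressor(max_depth=5),
    "Polynomial linear regression": make_pipeline(PolynomialFeatures(2), LinearRegression()),
    "Support vector regression": SVR(kernel="rbf", C=10.0),
    "Random forest regression": RandomForestRegressor(n_estimators=100),
}
X_train, X_test, y_train, y_test = X[:170], X[170:], y[:170], y[170:]  # 85%/15% split
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, mean_absolute_error(y_test, model.predict(X_test)))
```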

3 Proposed Work and Analysis

Predicting stock market prices with high accuracy is a challenging task; hence, it has gathered the attention of a good number of researchers. In this paper, an attempt has been made to analyze and predict stock market prices for the subsequent days. Partial availability, or outright unavailability, of data poses a major drawback in ingesting data into our domain for analysis. To implement the proposed work, streaming data collected from financial websites such as Google Finance and Yahoo Finance has been considered for predicting future stock prices. Features such as the opening price, closing price, and volume of stock traded are collected for a particular day for the ticker symbol of a particular company.

Data Ingestion from NodeJS
NodeJS is free open-source software that performs asynchronous, event-based handling using JavaScript on the server [9]. For collecting the required information from the Google Finance website, a few built-in utilities such as express, https, cors and socket.io have been considered. In this work, data from online financial websites has been collected: the historical opening prices for each ticker on each day from January 2000 till August 2017. In the batch processing section, 85% of the historical data is used for training the model and the rest for testing it. Moreover, in the streaming layer, data is procured every second from the Google Finance website through NodeJS. Parameters such as the current price, date, change percent, and dividend are fetched and then rendered to the model formed over the batch layer. Data streamed in this layer is used to predict the price for the next instance.

Phase-1: Batch Processing
There are a considerable number of financial indicators, which are complicated, and the prices are highly volatile. But with the advancement of technology, there arises a chance of gaining a fortune through stock markets. This encourages researchers to look for indicators that can reliably predict the price of the subsequent days, bearing in mind that such prediction comes with a lot of risk. Hence, historical


data needs to be properly scrutinized and modeled so as to predict the values accurately while keeping the risk low. The following steps enumerate the process followed for data validation and analysis (a small sketch of the preprocessing and split appears after this list):

• Collection of Data: Historical stock data containing features such as the opening price, date, closing price, volume, etc. is collected from Google Finance. This historical data is then utilized for predicting the subsequent prices of the ticker involved.
• Preprocessing of Data: The preprocessing stage includes the following steps:
  • Discretization of data: This generally relates to data reduction and is particularly important for numerical data.
  • Transformation of data: Normalization is carried out so that the range of a particular feature is restricted to specific limits.
  • Cleansing of data: To deal with missing data, null or missing values are filled in.
  • Integration of data: Integrating multiple data files into one may be needed for certain algorithms; this step ensures the same.
Once all the cleansing and formatting of data is completed, the entire dataset is divided into training and testing portions for evaluation. More recent values are generally used for training, and a small percentage, around 10–15% of the entire dataset, is used for testing.
• Feature Extraction: The feature set in online financial data is often large and difficult to process through models. Proper feature extraction reduces the dimension of the feature set; this is the major preprocessing phase, where important features are extracted from multi-dimensional data.
• Training Models: The procured data is provided as input to the respective algorithms and the models are trained for prediction, starting from randomly assigned weights and biases.
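The following is a minimal pandas-based sketch of the cleansing, normalization, and chronological train/test split described above; the column names and the toy data frame are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal preprocessing sketch: fill gaps, min-max normalize, chronological split.
import pandas as pd

def prepare(df: pd.DataFrame, feature_cols, target_col, train_frac=0.85):
    df = df.sort_values("date").reset_index(drop=True)
    df[feature_cols] = df[feature_cols].ffill()                # cleansing: fill missing values
    lo, hi = df[feature_cols].min(), df[feature_cols].max()
    df[feature_cols] = (df[feature_cols] - lo) / (hi - lo)     # normalization to [0, 1]
    cut = int(len(df) * train_frac)                            # older rows train, newest test
    train, test = df.iloc[:cut], df.iloc[cut:]
    return (train[feature_cols], train[target_col],
            test[feature_cols], test[target_col])

if __name__ == "__main__":
    df = pd.DataFrame({"date": pd.date_range("2000-01-03", periods=10),
                       "open": [10, 11, None, 12, 13, 12, 14, 15, 15, 16],
                       "volume": [100 + i for i in range(10)],
                       "next_open": [11, 12, 12, 13, 12, 14, 15, 15, 16, 17]})
    X_tr, y_tr, X_te, y_te = prepare(df, ["open", "volume"], "next_open")
    print(len(X_tr), len(X_te))   # 8 training rows, 2 testing rows
```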

Phase-2: Stream Processing
Stream processing is carried out in the second phase of implementation, wherein NodeJS is used for ingesting the data from online financial websites [9, 10]. The ingested data is then stored in HDFS for further analysis. The Spark DStream (Discretized Stream) is the basic abstraction of Spark Streaming [11]. A DStream is a continuous stream of data; it receives input from sources such as Kinesis, TCP sockets, Kafka, or Flume, and may also be created by transforming another stream. At its core, a DStream is a continuous series of Resilient Distributed Datasets (RDDs, the Spark abstraction), each containing data from a particular interval. Thus, an operation performed on a DStream applies to all the underlying RDDs that are part of that DStream.
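A minimal PySpark sketch of this DStream flow follows, assuming ticks arrive as comma-separated lines over a TCP socket; the host, port, and field layout are assumptions, and the application of the pre-trained model is only indicated in a comment.

```python
# Minimal Spark Streaming sketch: one RDD per 1-second micro-batch of ticks.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StockTicks")
ssc = StreamingContext(sc, batchDuration=1)        # 1-second batch interval

lines = ssc.socketTextStream("localhost", 9999)    # e.g. fed by the NodeJS collector

def parse(line):
    sym, price, change = line.split(",")
    return sym, float(price), float(change)

ticks = lines.map(parse)

def score(time, rdd):
    # here a previously trained model would be applied to each micro-batch
    for sym, price, change in rdd.collect():
        print(time, sym, "observed:", price)

ticks.foreachRDD(score)
ssc.start()
ssc.awaitTermination()
```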


The implementation was carried out using the Scala IDE, with the aid of NodeJS for data collection. The collected data is given as input to a Scala program that applies algorithms from the Spark component called MLlib; packages developed as part of this component are imported in order to predict the opening price of the stock for the subsequent day. MLlib is one of Spark's many components and is dedicated to machine learning (ML) [11, 12]. The spark.mllib package supports various methods for binary classification, multiclass classification, and regression analysis; the goal of this component is to make practical machine learning easy and scalable.

4 Performance Measures

Stock prices of different companies have been predicted through real-time analysis of financial data and Twitter data. Machine learning models have been designed to predict the stock prices, and the efficiency and accuracy of the system have been analyzed through performance measures such as the Mean Square Error (MSE), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). MSE is the mean of the squares of all the errors, where the error is the difference between the target and the obtained output value. RMSE can be calculated as

RMSE = \sqrt{\frac{1}{N}\sum \left(Y_p - Y_a\right)^2}

where Y_p and Y_a are the predicted and actual outputs of the model at a point, respectively, and N is the total number of observed points. MAE is the mean of the absolute differences between the actual and predicted values at the observed points:

MAE = \frac{1}{n}\sum_{i=1}^{n} \left|y_i - x_i\right|

where x_i and y_i represent the actual and predicted values at the n observed points, respectively.
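These three measures translate directly into code; a small numpy sketch:

```python
# Direct numpy translations of the three measures defined above.
import numpy as np

def mse(actual, predicted):
    return float(np.mean((np.asarray(predicted) - np.asarray(actual)) ** 2))

def rmse(actual, predicted):
    return float(np.sqrt(mse(actual, predicted)))

def mae(actual, predicted):
    return float(np.mean(np.abs(np.asarray(predicted) - np.asarray(actual))))

if __name__ == "__main__":
    y_true, y_pred = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
    print(mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))
```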

5 Result and Analysis

Actual stock prices of Google, Microsoft and Apple have been collected from the Google Finance website and treated as ground truth for the performance measures. The predicted stock price for the testing data is compared with the actual price. Table 1 reports the performance measures of the algorithms considered in the case study, i.e., analysis of finance data for two datasets, namely Apple data and Google data. It can be observed that support vector regression is the most useful regression technique for precisely predicting the opening price of the subsequent days. The graphs shown in Figs. 1 and 2 indicate the prediction accuracy of the proposed models. A pattern of increase and decrease of the stock value can be observed over the course of a year; for example, in the present scenario of US stocks, the values tend to soar during Christmas and Thanksgiving and otherwise stay around the average price.

Table 1. Performance results of various machine learning models

Apple (stock symbol: AAPL)

| Classifier | MAE | MSE | RMSE |
|---|---|---|---|
| Decision tree | 0.104944 | 0.016948 | 0.13085 |
| Polynomial linear regression | 0.112830 | 0.018638 | 0.136521 |
| Support vector regression | 0.100732 | 0.015288 | 0.123644 |
| Random forest regression | 0.117092 | 0.020056 | 0.141622 |

Google (stock symbol: GOOG)

| Classifier | MAE | MSE | RMSE |
|---|---|---|---|
| Decision tree | 0.096190 | 0.014100 | 0.118742 |
| Polynomial linear regression | 0.098133 | 0.01476 | 0.21638 |
| Support vector regression | 0.08977 | 0.012372 | 0.1112 |
| Random forest regression | 0.096152 | 0.014026 | 0.118851 |

Fig. 1. Prediction of Apple’s opening price from finance streaming data


Fig. 2. Prediction of Google’s opening price through finance data

6 Conclusion and Future Work

Streaming analysis of finance data has been carried out to predict stock prices, and it proves to give more accurate results than predictions obtained through Twitter data analysis; greater prediction accuracy can be attained by training the models with finance data. It should also be noted, however, that sentiment plays an important role in predicting the price movements of the subsequent days. The resulting model is fault-tolerant, since the RDDs are replicated and checkpoints are introduced through Apache Spark, while scalability and lower latency are ensured by designing the model over a distributed architecture. Moreover, a feasibility study through batch processing has been performed in this experiment with various machine learning algorithms such as random forest regression, support vector regression, decision trees, and linear regression. This study further motivates us to dig into the streaming of online stock data available on various financial websites, which may provide a better model to analyze and predict the future stock prices of a particular company.

References

1. Kiran, M., Murphy, P., Monga, I., Dugan, J., Baveja, S.S.: Lambda architecture for cost-effective batch and speed big data processing. In: 2015 IEEE International Conference on Big Data (Big Data), pp. 2785–2792. IEEE (2015)


2. Bollen, J., Mao, H., Zeng, X.: Twitter mood predicts the stock market. J. Comput. Sci. 2(1), 1–8 (2011)
3. Morck, R., Yeung, B., Yu, W.: The information content of stock markets: why do emerging markets have synchronous stock price movements? J. Financ. Econ. 58(1–2), 215–260 (2000)
4. Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D.B., Amde, M., Owen, S., et al.: MLlib: machine learning in apache spark. J. Mach. Learn. Res. 17(1), 1235–1241 (2016)
5. Rokach, L., Maimon, O.: Top-down induction of decision trees classifiers - a survey. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 35(4), 476–487 (2005)
6. Altay, E., Satman, M.H.: Stock market forecasting: artificial neural network and linear regression comparison in an emerging market. J. Financ. Manag. Anal. 18(2), 18 (2005)
7. Smola, A.J., Scholkopf, B.: A tutorial on support vector regression. Stat. Comput. 14(3), 199–222 (2004)
8. Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002)
9. Cantelon, M., Harter, M., Holowaychuk, T.J., Rajlich, N.: Node.js in Action. Manning Publications, Shelter Island (2017)
10. Donaldson, R.G., Kim, H.Y.: Price barriers in the Dow Jones industrial average. J. Financ. Quant. Anal. 28(3), 313–330 (1993)
11. Bifet, A., Maniu, S., Qian, J., Tian, G., He, C., Fan, W.: StreamDM: advanced data mining in spark streaming. In: 2015 IEEE International Conference on Data Mining Workshop (ICDMW), pp. 1608–1611. IEEE (2015)
12. Cutler, D.M., Poterba, J.M., Summers, L.H.: What moves stock prices? (1988)

Epidemiology of Breast Cancer (BC) and Its Early Identification via Evolving Machine Learning Classification Tools (MLCT) - A Study

Rajesh Kumar Maurya1(B), Sanjay Kumar Yadav1, and Pragya Tewari2

1 Department of Computer Science and Information Technology, Sam Higginbottom University of Agriculture, Technology and Sciences (SHUATS), Allahabad, UP, India
[email protected], [email protected]
2 Department of Computer Applications, ABES Engineering College, Ghaziabad, UP, India
[email protected]

Abstract. Breast cancer (BC) is nowadays a very common and devastating disease in women: the most frequently detected cancer and the second leading cause of death among women worldwide. A large number of people lose their lives, or face poor survival rates, because of this disease every year. Since females are at high risk of BC, it is essential for doctors to choose an exact and suitable treatment for the prevention and cure of cancer patients; the basic motive is to identify the cancer cells correctly. Forecasting and categorization of BC using an effective and accurate machine learning (ML) model is essential for creating new BC prognostic and diagnostic policies that genuinely help the sufferer. Diverse techniques, including Bayesian classifiers, artificial neural networks, and decision trees, have been widely applied to cancerous tumors. Machine learning methods can undoubtedly increase our understanding of breast cancer prediction and progression, and it is important to consider these approaches in daily clinical practice. Neural networks are nowadays a key and popular field in computational biology, chiefly in the areas of radiology, oncology, cardiology, and urology. In this study, we summarize numerous ML techniques that could be used as important tools by surgeons for the timely detection and prediction of cancerous cells.

Keywords: Cancer statistics in India · Algorithms for cancer detection · Types of ML techniques · Bioinformatics · Open source data for research · Risk factor · Performance

1 Introduction

In most cases, a cancerous (malignant) lump grows in older age; more than 80% of all cancer incidence in developed nations is detected in persons 50–55 years of age. Certain possible causes and behaviors may increase the possibility of breast cancer, such as excess consumption of tobacco, excess body weight, UV radiation in sunlight, ionizing radiation, medicines that cause immune deficiency,


drinking alcohol, and an inherited faulty gene [1, 2]. Cervical cancer mostly occurs in the 55–59 year age range, and a considerable proportion of women present in the late stages of the disease. Other epidemiological risk factors are multiple sexual partners, early age at marriage, high parity, multiple pregnancies, continual use of hormonal contraceptives, tobacco use, poor genital hygiene, low socioeconomic status, malnutrition, immune suppression, use of oral contraceptives, and lack of awareness [3]. Annually, the American Cancer Society (ACS) estimates the number of fresh cancer cases and deaths in the United States and brings together the latest data on cancer incidence, mortality, and survival [4]. Machine learning and data analytics approaches are used very powerfully in medical science, as they provide significant help in the decision-making processes of medical specialists. Overall, more than 96 thousand fresh cases occur in India yearly, with the median incidence between 14.9 and 9.2 per one lakh population. BC has become one of the most frequent medical disorders among women that lead to death. BC can be identified by tumor classification; there are two forms of cancer cells, benign and malignant. Medical practitioners require a reliable diagnostic procedure to distinguish between these cancer cells, but in general it is very difficult to distinguish tumors, even for cancer specialists. Therefore, the automation of a good diagnostic system is required for diagnosing cancer [5–7]. Patients treated for cancer under the Cancer Control Programme 2014–15 in India numbered 102580 for cervical screening, 29730 for PAP smear (a screening procedure for cervical cancer), 3871 for breast cancer, and 12009 for oral cavity (Fig. 1).

Fig. 1. Patients treated for cancer under cancer control programme 2014–15 in India.


2 Cancer Incidence and Cancer Cases Vary by State

State-wise estimated incidence of cancer cases in India from 2014–2016 (source of data: data.gov.in) is given in Table 1.

Table 1. State wise cancer incidence in India

| S.N | State/UT | 2014 | 2015 | 2016 | Total | Rank |
|---|---|---|---|---|---|---|
| 1 | Uttar Pradesh | 222615 | 233659 | 245231 | 701505 | 1 |
| 2 | Maharashtra | 122256 | 127390 | 132726 | 382372 | 2 |
| 3 | Bihar | 117603 | 123949 | 130628 | 372180 | 3 |
| 4 | West Bengal | 99339 | 103532 | 107906 | 310777 | 4 |
| 5 | Madhya Pradesh | 81034 | 85078 | 89315 | 255427 | 5 |
| 6 | Rajasthan | 75642 | 79160 | 82836 | 237638 | 6 |
| 7 | Tamil Nadu | 76091 | 78512 | 80999 | 235602 | 7 |
| 8 | Karnataka | 67237 | 70302 | 73511 | 211050 | 8 |
| 9 | Gujarat | 66952 | 70171 | 73551 | 210674 | 9 |
| 10 | Andhra Pradesh | 53570 | 55776 | 58072 | 167418 | 10 |
| 11 | Orissa | 45736 | 47666 | 49674 | 143076 | 11 |
| 12 | Telangana | 38494 | 40177 | 41939 | 120610 | 12 |
| 13 | Kerala | 37550 | 39672 | 42004 | 119226 | 13 |
| 14 | Jharkhand | 37031 | 38947 | 40959 | 116937 | 14 |
| 15 | Assam | 31124 | 31474 | 31825 | 94423 | 15 |
| 16 | Punjab | 30002 | 31214 | 32474 | 93690 | 16 |
| 17 | Chhattisgarh | 28738 | 30239 | 31817 | 90794 | 17 |
| 18 | Haryana | 27933 | 29240 | 30611 | 87784 | 18 |
| 19 | Delhi | 18356 | 19168 | 20015 | 57539 | 19 |
| 20 | Jammu & Kashmir | 14115 | 14864 | 15652 | 44631 | 20 |
| 21 | Uttaranchal | 11240 | 11796 | 12381 | 35417 | 21 |
| 22 | Himachal Pradesh | 7425 | 7722 | 8029 | 23176 | 22 |
| 23 | Meghalaya | 3184 | 3246 | 3311 | 9741 | 23 |
| 24 | Manipur | 2836 | 2916 | 2998 | 8750 | 24 |
| 25 | Tripura | 2139 | 2169 | 2199 | 6507 | 25 |
| 26 | Goa | 1587 | 1655 | 1726 | 4968 | 26 |
| 27 | Mizoram | 1585 | 1618 | 1652 | 4855 | 27 |
| 28 | Pondicherry | 1428 | 1510 | 1596 | 4534 | 28 |
| 29 | Nagaland | 1288 | 1294 | 1300 | 3882 | 29 |
| 30 | Arunachal Pradesh | 1231 | 1252 | 1272 | 3755 | 30 |
| 31 | Chandigarh | 1162 | 1217 | 1274 | 3653 | 31 |
| 32 | Sikkim | 467 | 473 | 479 | 1419 | 32 |
| 33 | Dadra & Nagar Haveli | 421 | 457 | 497 | 1375 | 33 |
| 34 | Andaman & Nicobar Islands | 402 | 415 | 429 | 1246 | 34 |
| 35 | Daman & Diu | 339 | 385 | 440 | 1164 | 35 |
| 36 | Lakshadweep | 77 | 82 | 89 | 248 | 36 |
| 37 | Total cancer incidence and cancer cases | 1330243 | 1390412 | 1453433 | 4168043 | |

The average annual incidence of cancer cases for certain types of cancer by state is shown in Fig. 2. Lung cancer rates vary the most by state, reflecting historical differences in smoking prevalence that continue today.

Fig. 2. Cancer incidence and cancer cases

3 Risk Factor

A healthy diet and lifestyle, such as not smoking, exercising regularly, losing weight, eating fruits, cutting out red and processed meats, and reducing stress, can help prevent or fight cancer. A risk factor could be a behavior or a condition; however, BC sometimes begins and grows in women who have none of the risk factors described in Table 2.


Table 2. Breast cancer risk factors and their causes

| S.N | Factor | Description | Reference |
|---|---|---|---|
| 1 | Age | Breast cancer risk rises with increasing age: about 80% of breast cancers are found in women over 50 years, and most men with breast cancer are over 60 years | [8, 9] |
| 2 | Personal history of BC | Women who have had breast cancer in one breast are at risk of another breast cancer | [10] |
| 3 | Family history of BC | Women are more likely to have breast cancer if their mother, sister, or daughter has had breast cancer, especially at an early age (before forty); other risk factors can further increase the risk | [9, 11] |
| 4 | Genetic factors | Women treated with breast radiotherapy (RT) for Hodgkin's lymphoma (HL) have a higher risk of developing breast cancer; the contribution of genetic factors to this risk is unclear | [12, 13] |
| 5 | Childbearing and menstrual history | A woman has a higher risk of breast cancer if she is older when she has her first child, or if she has never had children; menstrual history also affects risk and quality-of-life outcomes in women with breast cancer | [14, 15] |

4 Availability of Open Source Data for Research

| S.N | Year | Name of repository |
|---|---|---|
| 1 | 1998 | https://archive.ics.uci.edu/ml/index.php |
| 2 | 1999 | https://www.broadinstitute.org |
| 3 | 1998 | http://archive.ics.uci.edu/ml/datasets/Hepatitis |
| 4 | 1991 | https://archive.ics.uci.edu/ml/index.php |
| 5 | 2000 | http://tunedit.org |

5 ML Algorithm for Cancer Prediction

This section presents the supervised and unsupervised machine learning (ML) algorithms used in the study; ML techniques are used for prediction and prognosis. Supervised learning algorithms are the ones that involve direct supervision of the operation. ML is the branch of AI concerned with building expert machines.


Fig. 3. Cancer genomic data processing system

ML techniques automate knowledge acquisition through the design and implementation of rules wherever empirical data are available to the algorithms. Machine learning methods are trained from data, relying on probabilistic reasoning [16, 17]. The ability to approximate nonlinear functions and to capture complex relations in data are instrumental capabilities that can assist the health domain [18, 19]. A sample genomic data processing system model is shown in Fig. 3, and the categories of ML techniques are shown in Table 3.

Table 3. Category of ML techniques

S.N  Supervised learning                          Unsupervised learning                                Ensemble learning techniques
1    Linear Regression                            Apriori (classical method in data mining)            Bagging
2    Logistic Regression                          K-means                                              Boosting
3    Classification and Regression Trees (CART)   Principal Component Analysis (PCA)                   Stacking
4    Naïve Bayes (NB)                             t-Distributed Stochastic Neighbor Embedding (t-SNE)
5    Artificial Neural Network (ANN)              Association rule
6    Support Vector Machine (SVM)
7    Decision Trees (DT)
8    Random Forest (RF)
9    K-Nearest Neighbor (KNN)

6 ANN for Cancer Research

A neural network, a simplified model of the biological neuron, is a massively parallel distributed computing system made up of highly interconnected neural computing units that can learn, store knowledge, and make it available for use.

Fig. 4. A schematic example of how an ANN is trained to predict outcomes using layers (5 input, 7 intermediate, and 2 output nodes).


A neuron consists of a nucleus and a cell body called the soma. The soma is attached to long, irregularly shaped filaments called dendrites, which act as inputs. All inputs to the neuron arrive through the dendrites, which resemble the branches of trees in winter. Another kind of link attached to the soma is the axon. The axon is electrically active and functions as an output. The human brain is a highly complex structure that can be viewed as a massively connected network of simple processing elements known as neurons. The behavior of a neuron can be visualized with a simple model that constitutes the basis of the ANN [20, 29]. A schematic example of how an ANN is trained to predict is shown in Fig. 4 (Table 4).
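As an illustration of this neuron model, the following minimal sketch (ours, not drawn from the surveyed papers) computes a forward pass through the 5-7-2 layout of Fig. 4; all weights are random placeholders.

```python
import numpy as np

def layer(x, W, b):
    """One layer of sigmoid neurons: each row of W plays the role of
    one neuron's dendritic weights; the sigmoid is the axon output."""
    v = W @ x + b                       # weighted sums (soma)
    return 1.0 / (1.0 + np.exp(-v))    # nonlinear activations

# Forward pass through a 5-input, 7-hidden, 2-output network.
rng = np.random.default_rng(0)
x = rng.random(5)                                    # 5 input features
W1, b1 = rng.standard_normal((7, 5)), np.zeros(7)    # 7 hidden units
W2, b2 = rng.standard_normal((2, 7)), np.zeros(2)    # 2 output units
print(layer(layer(x, W1, b1), W2, b2))               # two class scores
```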

7 Related Work

Table 4. Related work on cancer using ANN methods and their accuracy (S.N. Title – Authors – Method – Year – Accuracy – Reference)

0. A comparison of neural network and fuzzy c-means methods in bladder cancer cell classification – Hu, Y., Ashenayi, K., Veltri, R., O'Dowd, G., Miller, G., Hurst, R., & Bonner, R. – Single-hidden-layer feedforward NN – 1994 – 96.9% – [21]
1. Paired neural network with negatively correlated features for cancer classification in DNA gene expression profiles – Won, H. H., & Cho, S. B. – Paired neural network – 2003 – 97.1% – [22]
2. Classification of breast cancer histology images using convolutional neural networks – Spanhol, F. A., Oliveira, L. S., Petitjean, C., & Heutte, L. – CNN – 2017 – 77.8% – [23]
3. Breast cancer classification using deep belief networks – Abdel-Zaher, A. M., & Eldeib, A. M. – Deep belief network (DBN-NN) – 2016 – 99.68% – [24]
4. Breast cancer diagnosis using Genetically Optimized Neural Network model – Bhardwaj, A., & Tiwari, A. – GONN – 2015 – 98.24% – [25]
5. WBCD breast cancer database classification applying artificial metaplasticity neural network – Marcano-Cedeño, A., Quintanilla-Domínguez, J., & Andina, D. – Artificial Metaplasticity Multilayer Perceptron (AMMLP) – 2011 – 99.26% – [26]
6. Breast cancer histopathological image classification using convolutional neural networks – Spanhol, F. A., Oliveira, L. S., Petitjean, C., & Heutte, L. – Convolutional Neural Networks (CNNs) – 2016 – 90% – [20]
7. Normalized Neural Networks for Breast Cancer Classification – Alickovic, E., & Subasi, A. – Normalized Multi-Layer Perceptron Neural Network – 2019 – 99.27% – [19]
8. A Hybrid Model for Breast Cancer Diagnosis Based on Expectation-Maximization and Artificial Neural Network: EM + ANN – Kaya, Yılmaz – EM-NN – 2015 – 98.54% – [27]
9. A new intelligent classifier for breast cancer diagnosis based on a rough set and extreme learning machine: RS + ELM – Kaya, Yılmaz – RS + ELM – 2013 – 100% – [28]
10. Using three machine learning techniques for predicting breast cancer recurrence – Ahmad, L. G., Eshlaghy, A. T., Poorebrahimi, A., Ebrahimi, M., & Razavi, A. R. – ANN – 2013 – 94.7% – [29]

8 Conclusion and Recommendation

This study concerns the use of machine learning techniques to analyze different categories of breast cancer and their complications. Higher classification and prediction accuracies, together with timely diagnosis of the different stages of breast cancer, can reduce mortality worldwide. Research increasingly focuses on the need to develop effective, intelligent, and appropriate risk classification models based on machine learning techniques. Although many impressive classification algorithms have achieved high accuracy on WBCD, improved breast cancer algorithms are still needed. The study of various ML classification methods, such as ANN methods for cancer characterization, together with the multidimensional integration of biological data, is a promising approach to improving the diagnosis and prognosis of breast cancer.

References

1. Banerjee, S., Ghosh, A., VonHoff, D.D., Banerjee, S.K.: Cyr61/CCN1 targets for chemosensitization in pancreatic cancer. Oncotarget 10(38), 3579 (2019)
2. Siegel, R.L., Miller, K.D., Jemal, A.: Cancer statistics. CA Cancer J. Clin. 69(1), 7–34 (2019)
3. Chaurasia, V., Pal, S.: A novel approach for breast cancer detection using data techniques. Int. J. Innovative Res. Comput. Commun. Eng. 2(1), 2456–2465 (2014)


4. Sreedevi, A., Javed, R., Dinesh, A.: Epidemiology of cervical cancer with special focus on India. Int. J. Women's Health (2016)
5. Chopra, S., Shukla, R., Budukh, A., Shrivastava, S.K.: External radiation and brachytherapy resource deficit for cervical cancer in India: call to action for treatment of all. J. Glob. Oncol. (2019)
6. Ferlay, J., Soerjomataram, I., Ervik, M., et al.: Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries, September 2018
7. Campos, N.G., Tsu, V., Jeronimo, J., Regan, C., Resch, S., Clark, A., Kim, J.J.: Health impact of delayed implementation of cervical cancer screening programs in India: a modeling analysis. Int. J. Cancer 144(4), 687–696 (2019)
8. Kresovich, J.K., Xu, Z., O'Brien, K.M., Weinberg, C.R., Sandler, D.P., Taylor, J.A.: Methylation-based biological age and breast cancer risk. JNCI: J. Nat. Cancer Inst. (2019)
9. Antoniou, A., Pharoah, P.D., Narod, S., Risch, H.A., Eyfjord, J.E., Hopper, J.L., Pasini, B.: Average risks of breast and ovarian cancer associated with BRCA1 or BRCA2 mutations detected in case series unselected for family history: a combined analysis of 22 studies. Am. J. Human Genet. 72(5), 1117–1130 (2003)
10. Hartmann, L.C., Schaid, D.J., Woods, J.E., Crotty, T.P., Myers, J.L., Arnold, P.G., Frost, M.H.: Efficacy of bilateral prophylactic mastectomy in women with a family history of breast cancer. N. Engl. J. Med. 340(2), 77–84 (1999)
11. Kash, K.M., Holland, J.C., Halper, M.S., Miller, D.G.: Psychological distress and surveillance behaviors of women with a family history of breast cancer. JNCI: J. Nat. Cancer Inst. 84(1), 24–30 (1992)
12. Lee, A., Mavaddat, N., Wilcox, A.N., Cunningham, A., Carver, T., Hartley, S., Walter, F.: BOADICEA: a comprehensive breast cancer risk prediction model incorporating genetic and nongenetic risk factors (2019)
13. Opstal-van Winden, A.W., de Haan, H.G., Hauptmann, M., Schmidt, M.K., Broeks, A., Russell, N.S., van Eggermond, A.M.: Genetic susceptibility to radiation-induced breast cancer after Hodgkin lymphoma. Blood 133(10), 1130–1139 (2019)
14. Scott, D.M., Shah, N.M., Jeruss, J.S.: Oncofertility in the premenopausal breast cancer patient. In: Textbook of Oncofertility Research and Practice, pp. 431–437. Springer, Cham (2019)
15. Kelsey, J.L., Gammon, M.D., John, E.M.: Reproductive factors and breast cancer. Epidemiologic Rev. 15(1), 36 (1993)
16. Shadman, T.M., Akash, F.S., Ahmed, M.: Machine learning as an indicator for breast cancer prediction. Doctoral dissertation, BRAC University (2018)
17. Akshya, Y., Jamir, I., Jain, R.R., Sohani, M.: Comparative study of machine learning algorithms for breast cancer prediction-a review (2019)
18. Abbass, H.A.: An evolutionary artificial neural networks approach for breast cancer diagnosis. Artif. Intell. Med. 25(3), 265–281 (2002)
19. Alickovic, E., Subasi, A.: Normalized neural networks for breast cancer classification. In: International Conference on Medical and Biological Engineering, May 2019, pp. 519–524. Springer, Cham (2019)
20. Lisboa, P.J., Taktak, A.F.: The use of artificial neural networks in decision support in cancer: a systematic review. Neural Netw. 19(4), 408–415 (2006)
21. Hu, Y., Ashenayi, K., Veltri, R., O'Dowd, G., Miller, G., Hurst, R., Bonner, R.: A comparison of neural network and fuzzy c-means methods in bladder cancer cell classification. In: Proceedings of 1994 IEEE International Conference on Neural Networks, June 1994, ICNN 1994, vol. 6, pp. 3461–3466. IEEE (1994)
22. Won, H.H., Cho, S.B.: Paired neural network with negatively correlated features for cancer classification in DNA gene expression profiles. In: Proceedings of the International Joint Conference on Neural Networks, July 2003, vol. 3, pp. 1708–1713. IEEE (2003)


23. Araújo, T., Aresta, G., Castro, E., Rouco, J., Aguiar, P., Eloy, C., Campilho, A.: Classification of breast cancer histology images using convolutional neural networks. PLoS One 12(6), 0177544 (2017)
24. Abdel-Zaher, A.M., Eldeib, A.M.: Breast cancer classification using deep belief networks. Expert Syst. Appl. 46, 139–144 (2016)
25. Bhardwaj, A., Tiwari, A.: Breast cancer diagnosis using genetically optimized neural network model. Expert Syst. Appl. 42(10), 4611–4620 (2015)
26. Marcano-Cedeño, A., Quintanilla-Domínguez, J., Andina, D.: WBCD breast cancer database classification applying artificial metaplasticity neural network. Expert Syst. Appl. 38(8), 9573–9579 (2011)
27. Kaya, Y.: A hybrid model for breast cancer diagnosis based on expectation-maximization and artificial neural network: EM + ANN. Karaelmas Sci. Eng. J. 5(1) (2015)
28. Kaya, Y.: A new intelligent classifier for breast cancer diagnosis based on a rough set and extreme learning machine: RS + ELM. Turk. J. Electr. Eng. Comput. Sci. 21(Sup. 1), 2079–2091 (2013)
29. Ahmad, L.G., Eshlaghy, A.T., Poorebrahimi, A., Ebrahimi, M., Razavi, A.R.: Using three machine learning techniques for predicting breast cancer recurrence. J. Health Med. Inform. 4(124), 3 (2013)

Ensemble Classification Approach for Cancer Prognosis and Prediction

Rajesh Kumar Maurya1(B), Sanjay Kumar Yadav1, and Rishabh2

1 Department of Computer Science and Information Technology, Sam Higginbottom University of Agriculture, Technology and Sciences (SHUATS), Allahabad, UP, India
[email protected], [email protected]
2 Department of Computer Science and Engineering, Maharshi Dayanand University, Rohtak, Haryana, India
[email protected]

Abstract. Gene expression data, which constitute most available cancer data, pose major challenges in analysis, pattern matching, and classification. The task becomes more complex when a large number of genes but only a small number of samples are available, along with noise and redundant information. Extracting meaningful, correlated information from the dataset is the first and most important step toward better diagnosis through artificial intelligence (AI). Accordingly, recent work on AI-based classification and prognosis follows a two-step process: (a) feature extraction and (b) ensemble classification. Feature extraction helps eliminate redundant and irrelevant genes, whereas the ensemble classifier helps optimize accuracy. In this paper, we use a double RBF kernel function for feature selection and a novel fusion procedure to enhance the performance of three base classifiers, i.e., K-Nearest Neighbor (KNN), Multi-Layer Perceptron (MLP), and Decision Tree (DT). Classifier training is implemented using k-fold cross-validation. The predictive accuracy of the proposed model has been compared with recent fusion methods such as Majority Voting, Distribution Summation, and Dempster–Shafer on six benchmark cancer datasets. Experimental evaluation and result analysis show promising and better performance than other fusion strategies with respect to our goal functions. The Wisconsin Breast prognosis dataset is used with the proposed model for gene selection and prognosis prediction.

Keywords: Gene expression · Classifier fusion · Kernel · Feature extraction · Prognosis

1 Introduction

The curse of high dimensionality and redundant data given as input to an algorithm adversely affects training time complexity. Hence, the literature suggests removing redundant information and reducing dimensionality, a process known as feature reduction [1–3]. Selecting a small set of meaningful attributes is preferred in order to obtain applicable information from the dataset, which in turn helps the classifier train quickly with better accuracy.


The classification process is trained so that it can foresee unseen data using a mathematical function that maps inputs to a group or class label [4, 5]. Currently, classifiers such as decision trees (DT), k-nearest neighbors (KNN), rule-based classifiers (RBC), artificial neural networks (ANN), Support Vector Machines (SVM), and Bayesian classifiers (BC) are available [6–9]. Considering the advantages and disadvantages of each, it is now advisable not to rely on any single classifier; rather, decisions are taken from multiple classifiers and their results are fused by means of an objective function before any conclusion is drawn. This has led to a line of research, known as classifier fusion, that analyzes the accuracy of each classifier. Accordingly, several alternative fusion approaches have been developed to improve the classification performance of a system [10]. Almost all classifier fusion strategies rely on the outputs of the classifiers [11–16]. The fusion algorithms discussed in [16] depend on the classification results of each classifier. For a given input feature set, different classifiers predict different class labels, which makes it hard to truly recognize the class of the feature vector. A factor that directly or indirectly affects classification accuracy is the choice of input vector space for training the classifier, which makes it important to extract meaningful and correlated features from the dataset. Normalization also plays an important role in scaling the data into a range where the classifier can be trained without error. Thus, many articles in recent decades have contributed to the area of feature extraction and selection.

Prognosis (prediction) is used to estimate recovery from, or survival of, a disease. Physicians provide prognosis-based statistics that state the likely course of the disease in the general population. These statistics describe the patient's likely performance in the near future, such as worse, average, or good in terms of recovery. Prognosis depends on cancer stage, type, and subtype, and sometimes on molecular profile and gender. Based on prognosis, we test the system on the Wisconsin dataset in this study.

In our proposed system, the dataset is built by removing redundant data and replacing missing data using the method of [17]. The dataset is then normalized using the min-max algorithm, and the double RBF kernel algorithm [18] is used for feature extraction. The dataset is further divided into training and testing sets. We have implemented three base classifiers: KNN, MLP, and DT. Classifier training is done using the training samples, and a rank-based fusion algorithm is used to fuse the classification results. The accuracy achieved on the testing data is then compared with existing algorithms and the individual classifiers.

The paper is organized as follows: materials and methods are explained in Sect. 2; the proposed multi-classifier fusion technique is described in Sect. 3. The experimental evaluation and result analysis are presented in Sects. 5–8, followed by the conclusion in Sect. 9.

2 Materials and Methods

The selective ensemble classification method is implemented on various cancer datasets. As discussed, a dataset may contain redundant data and missing values; hence, our primary aim is to remove redundancy and missing values from the dataset. Care is then taken to normalize the data and to extract meaningful information from the input samples. The different classifiers used are discussed along with their training parameters. We consider that integrating a selected subset of classifiers is better than integrating all of them, for which we have implemented a ranking algorithm for classifiers [3].


2.1 Dataset Pruning

Missing and redundant data are common problems, which can be minimized with quality assurance and careful administration. It has been observed that data collected from self-reports or unmonitored staff usually contain higher rates of missing data. Pruned data will boost the accuracy of classification, which can be achieved in a two-step process:

a. Missing values: We use the distance-based mean method [17] to deal with missing values. Let X be a dataset with N rows x_1, x_2, ..., x_N, each having M columns x_{i1}, x_{i2}, ..., x_{iM}. For a missing value x_{ij}, the following equation is used:

X_{ij} = \frac{1}{L} \sum_{p=1}^{k} \mu_{pk}   (1)

where \mu_{pk} is the mean of the p-th feature over the k nearest neighbors.

b. Revoking redundancy: A dataset containing redundant information increases the computational cost of classification. Hence, to remove duplicate data, we use a distance-based deletion scheme: if the Euclidean distance between two feature vectors is zero, they are equal, and we delete one of them from the dataset.
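A minimal Python sketch of this pruning step is given below (ours, not from [17]); it assumes the k nearest complete rows are used for Eq. (1), and it removes duplicate rows rather than duplicate columns, since the dataset sizes in Sect. 8.2 shrink by instances:

```python
import numpy as np

def impute_missing(X, k=3):
    """Replace each missing entry x_ij by the mean, over the k nearest
    complete rows, of feature j (in the spirit of Eq. 1); k is an
    assumed parameter."""
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i, j in zip(*np.where(np.isnan(X))):
        obs = ~np.isnan(X[i])                           # observed columns of row i
        d = np.linalg.norm(complete[:, obs] - X[i, obs], axis=1)
        nearest = complete[np.argsort(d)[:k]]           # k closest complete rows
        X[i, j] = nearest[:, j].mean()
    return X

def drop_duplicates(X):
    """Distance-based deletion: rows at zero Euclidean distance are
    equal, so keep one copy of each (row order is not preserved)."""
    return np.unique(X, axis=0)
```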

2.2 Normalization (Min-Max Algorithm)

Min-max normalization is a normalization strategy that linearly transforms values into the range [0, 1]:

X_{out} = \frac{X_{in} - X_{min}}{X_{max} - X_{min}}   (2)

where X_{min} and X_{max} are the minimum and maximum values in X, respectively. The resulting X_{out} lies in the range [0, 1].

2.3 Feature Extraction (Double RBF Kernel Based Ranking of Features)

A feature-vector ranking, also known as a filter approach, is implemented [18, 19]. The double RBF kernel based ranking method ranks input feature attributes without a learning algorithm. The overall design of the algorithm is to assign a random weight to each feature attribute and, over successive iterations, update the weights


based on an objective function; genes are then selected based on their weights. For an M-dimensional feature vector, the weight vector w is randomly assigned as

w = \{ w_k \in [0, 1],\ k = 1, 2, \ldots, M,\ \sum_{k=1}^{M} w_k = 1 \}   (3)

The objective of the ranking function is to reduce the dissimilarity distance between the feature vectors and the cluster centers. The objective function J is given by

J = \sum_{i=1}^{C} \sum_{x_j \in c_i} \sum_{k=1}^{m} w_k\, \phi^2(x_{jk}, V_{ik}) + \delta \sum_{k=1}^{m} w_k^2   (4)

where \delta is a scale factor, updated at iteration t from the weights of the previous iteration as

\delta_t = \alpha\, \frac{\sum_{i=1}^{C} \sum_{x_j \in c_i} \sum_{k=1}^{M} w_k^{(t-1)} \lVert \phi(x_{jk}) - \phi(V_{ik}) \rVert^2}{M}   (5)

and V_i is the cluster center vector, given by

V_{ik} = \frac{1}{|C_i|} \sum_{x_j \in c_i} x_{jk}   (6)

where C is the number of class labels. J(w_k, \lambda) depends on its partial derivatives and reaches its minimum when the partial derivatives are zero. Finally, w can be updated using (7):

w_k = \frac{1}{m} + \frac{1}{2\delta} \left( \frac{1}{m} \sum_{l=1}^{m} D_l - D_k \right), \quad D_k = \sum_{i=1}^{C} \sum_{x_j \in c_i} \lVert \phi(x_{jk}) - \phi(V_{ik}) \rVert^2   (7)

where the kernel-space distance can be represented using

\lVert \phi(x_{jk}) - \phi(V_{ik}) \rVert^2 = 2 \left( 1 - K(x_{jk}, V_{ik}) \right)   (8)

The literature suggests the use of an RBF kernel with weighted analysis [20, 21], as in the KBCGS clustering methodology; recently, [18] modified KBCGS based on double RBF kernels. Hence, we have implemented the double kernel function, given by (9), to evaluate K in the equations above:

K_{\gamma_1 \gamma_2}(x, x_j) = c\, e^{-\gamma_1 \lVert x - x_j \rVert^2} + (1 - c)\, e^{-\gamma_2 \lVert x - x_j \rVert^2}   (9)

The ranking procedure is executed for up to 100 iterations, or until |J^t - J^{t-1}| \le \theta, where \theta is set to 10^{-6}.
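The following Python sketch illustrates this ranking loop under simplifying assumptions of ours: δ is kept constant rather than updated by Eq. (5), and the weights are re-projected onto the simplex after each update, which the paper does not state explicitly:

```python
import numpy as np

def double_rbf(x, y, g1=0.4, g2=0.3, c=0.2):
    """Double RBF kernel of Eq. (9), with the Sect. 4 parameter values."""
    d2 = (x - y) ** 2
    return c * np.exp(-g1 * d2) + (1 - c) * np.exp(-g2 * d2)

def rank_features(X, y, delta=1.0, iters=100, tol=1e-6):
    """Kernel-weighted feature ranking in the spirit of Eqs. (3)-(8)."""
    n, m = X.shape
    w = np.full(m, 1.0 / m)                        # Eq. (3): start on the simplex
    prev_J = np.inf
    for _ in range(iters):
        D = np.zeros(m)                            # per-feature dissimilarity D_k
        for cls in np.unique(y):
            Xi = X[y == cls]
            V = Xi.mean(axis=0)                    # Eq. (6): cluster center
            D += (2 * (1 - double_rbf(Xi, V))).sum(axis=0)   # Eq. (8)
        w = 1.0 / m + (D.sum() / m - D) / (2 * delta)        # Eq. (7)
        w = np.clip(w, 0, None); w /= w.sum()      # re-project onto the simplex
        J = (w * D).sum() + delta * (w ** 2).sum() # Eq. (4)
        if abs(J - prev_J) <= tol:                 # stopping rule, theta = 1e-6
            break
        prev_J = J
    return np.argsort(w)[::-1]                     # features ranked by weight
```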


2.4 KNN Classifier

The k-nearest neighbors algorithm is a non-parametric pattern recognition algorithm used for regression and classification [22]. The algorithm makes predictions based on the samples closest to the test feature vector. To make predictions with KNN, a distance metric is measured between the query point x and a case point p. The distance can be calculated using (10):

D(x, p) = \begin{cases} \sqrt{\sum (x - p)^2} & \text{Euclidean} \\ \sum (x - p)^2 & \text{Euclidean squared} \\ \sum |x - p| & \text{City block} \\ \max(|x - p|) & \text{Chebyshev} \end{cases}   (10)

The value of k is selected first, and the KNN prediction is the average of the outcomes of the k nearest neighbors:

y = \frac{1}{c} \sum_{i=1}^{c} y_i   (11)

where y is the prediction (outcome) for the query point. For classification, in contrast to regression, the predicted class is taken as the majority vote among the k nearest neighbors.

Where, y is the prediction (outcome) of the query point. In contrast to regression. 2.5 MLP (Multi Linear Perceptron) A feedforward artificial neural network that mimics the learning and decisions making behavior of natural brain are computed in the form of multilayer perceptron (MLP) algorithm. MLP generally consist of three layers of nodes: an input layer, a hidden layer and an output layer. Each neurons except input layer accept summation of weighted product of input feature and produces output based on nonlinear activation function. Backpropagation technique is used to update the weight of the MLP [23, 24]. MLP can be used to distinguished non-linear separable dataset [25]. The two historically common activation functions are both sigmoid, and are described by −1  y(vi ) = 1 + e−vi

(12)

Equation (12) is the activation fiction known as logistic function, which is similar in shape but ranges from 0 to 1. Here yi is the i th neuron output, that accept weighted sum of input connections vi . Finally, the degree of error in an output node j in the n th data point given by e j (n) = d j (n) − y j (n)

(13)

Ensemble Classification Approach for Cancer Prognosis and Prediction

125

Where d and y are the target and produced value by the perceptron. Based on the error, node weights are updated, which in turn minimize the error in the entire output, given by (n) =

1 2 e (n) j j 2

(14)

And the weight is updated using (15). wi j (n) = −η

∂(n) yi (n) ∂v j (n)

(15)

Where yi is the output and η is the learning rate which can derive and simplified as −

  ∂(n) = e j (n)∅ v j (n) ∂v j (n)

(16)

Where ∅ is the derivative activation function, and the change in weight in hidden node can be successively computed as Eq. (17). −

   ∂(n) ∂(n)  = ∅ v j (n) wk j (n) − k ∂v j (n) ∂vk (n)

(17)
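For concreteness, the following Python sketch performs one backpropagation update of Eqs. (13)–(17) for a single-hidden-layer sigmoid MLP; bias terms are omitted for brevity, and this is our illustration rather than the paper's implementation:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))               # Eq. (12)

def backprop_step(x, d, W1, W2, eta=0.4):
    """One stochastic update; eta is the learning rate
    (0.2-0.6 in Sect. 4)."""
    # forward pass
    h = sigmoid(W1 @ x)                           # hidden activations
    out = sigmoid(W2 @ h)                         # network outputs
    # output layer: error (Eq. 13) and local gradient (Eq. 16)
    e = d - out
    delta_out = e * out * (1 - out)               # phi'(v) = y(1 - y) for sigmoid
    # hidden layer: gradient propagated backward (Eq. 17)
    delta_hid = h * (1 - h) * (W2.T @ delta_out)
    # weight updates (Eq. 15)
    W2 += eta * np.outer(delta_out, h)
    W1 += eta * np.outer(delta_hid, x)
    return W1, W2, 0.5 * np.sum(e ** 2)           # Eq. (14): squared error
```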

2.6 Decision Tree

A decision tree is a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility, and is based on conditional control statements. Decision trees are a popular machine learning technique commonly used in operations research to help identify the strategy most likely to reach a goal. A decision tree can consist of three types of nodes [27]: decision nodes, chance nodes, and end nodes, typically represented by squares, circles, and triangles, respectively.

3 Proposed Algorithm

The proposed system, shown in Fig. 1, consists of three phases. Phase I deals with building the dataset. The cancer datasets are downloaded from the UCI Repository. Preprocessing and feature extraction are done using the concepts discussed in Sects. 2.1 and 2.2. Preprocessing here refers to the pruning step, where missing values are imputed and the dataset is rebuilt without duplicates. Feature extraction is one of the most important steps in the proposed architecture: it helps overcome the curse of high dimensionality by eliminating less important features and providing the most meaningful features for


further classification. Phase II deals with the training and testing of the classifiers; k-fold cross-validation is used to train the classifiers on the training samples. Phase III deals with the fusion function. Based on recent work, we use a novel ranking-based technique for selecting the appropriate classifier results for predicting class labels.

Fig. 1. Proposed ensemble classification architecture (Database → Pre-processing → Feature Extraction → {KNN, MLP, Decision Tree} → Fusion function → Decision)

4 Parameter Discussion

Parameter tuning is the first, and a difficult, task in training the classifiers. We trained our classifiers with parameter values obtained by trial and error; the values suggested here are those at which the individual classifiers reached maximum accuracy. In KNN, a small value of k increases sensitivity to noise, whereas a large value increases computational cost. Based on trial and error, we chose an odd k = 7 for binary classification, or set k = sqrt(n). The parameters used in the MLP are w, v, α, and η, denoting the weight matrix between the input and hidden layers, the weight matrix between the hidden and output layers, the acceleration (momentum) constant, and the learning constant, respectively. The values for w and v are chosen randomly in [−1, 1], whereas α ranges from 0.2 to 0.9 and η from 0.2 to 0.6. The complexity parameter of the decision tree, cp, is chosen as 0.01; it helps control the size of the decision tree and select the optimal tree size. Apart from the classifier parameters, the kernel parameters used during feature extraction are γ1 = 0.4, γ2 = 0.3, and c = 0.2.
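The experiments in this paper were run in MATLAB (Sect. 5); purely as an illustration, an equivalent scikit-learn configuration of the three base classifiers with the values above might look as follows. The hidden-layer size is our assumption, and ccp_alpha is only an analogue of the rpart-style complexity parameter cp, not the same quantity:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# k = 7 for binary problems (or sqrt(n) in general)
knn = KNeighborsClassifier(n_neighbors=7)

# sigmoid (logistic) activation; learning rate in the 0.2-0.6 range
mlp = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    learning_rate_init=0.4, max_iter=1000)

# ccp_alpha stands in for the complexity parameter cp = 0.01
dt = DecisionTreeClassifier(ccp_alpha=0.01)

classifiers = {"KNN": knn, "MLP": mlp, "DT": dt}
```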

5 Experimental Evaluation

The simulated model was designed and tested in MATLAB R2016a under the Windows 10 OS on a 64-bit processor. The datasets were downloaded from the UCI repository, and their details are discussed in the succeeding sections.


Table 1. Wisconsin Prognostic Breast Cancer (WPBC) dataset

Attribute and significance                                       Attribute number
Identification (ID) – unique patient identification number       1
Outcome (O) – R for recurrent, N for nonrecurrent                2
Time – time to recurrence, or disease-free survival              3
Radius (3) – mean distance from the center                       4, 14, 24
Texture (3) – standard deviation of gray-scale values            5, 15, 25
Perimeter (3) – perimeter of the cancer cell nucleus             6, 16, 26
Area (3) – area of the cancer cell nucleus                       7, 17, 27
Smoothness (3) – variation in radius                             8, 18, 28
Compactness (3) – perimeter^2 / area − 1.0                       9, 19, 29
Concavity (3) – severity of concave portions of the contour      10, 20, 30
Concave points (3) – number of concave portions of the contour   11, 21, 31
Symmetry (3) – cancer cell symmetry                              12, 22, 32
Fractal dimension (3) – coastline approximation                  13, 23, 33
Tumor size                                                       34
Lymph node status                                                35

6 Datasets

6.1 Wisconsin Prognostic Breast Cancer (WPBC)

The WPBC dataset is described in Table 1. It was first listed in 1984 by Dr. Wolberg [28] and was derived from digitized images of fine needle aspirates (FNA) of breast masses; it records recurring cancer in a 198 × 35 matrix. The 198 instances are a combination of recurring and non-recurring patients: 47 recurring and 151 non-recurring. WPBC has two special attributes: tumor size and lymph node status.

6.2 Five Benchmark Cancer Datasets

Five benchmark datasets were collected from different sources for the experimental evaluation. Table 2 shows the datasets with source and feature information.

a. The breast cancer dataset (breast cancer is one of the most common cancers causing death in women in the 35–55 year age group) consists of 569 instances with 32 attributes, grouped into two class labels, benign and malignant (Wolberg et al. 1988).
b. Leukemia, primarily a bone marrow disorder of malignant neoplastic cells, consists of 7129 attributes for 72 samples, covering acute lymphoblastic leukemia (ALL) and acute myelogenous leukemia (AML) (Golub et al. 1999).


c. The hepatitis dataset consists of 20 continuous-valued attributes and 155 instances, with two class labels denoting alive or dead (Gong (Doner) 1988).
d. The lung cancer dataset consists of 32 instances and 56 attributes, describing 4 types of pathological lung cancers (Hong and Yang 1991).
e. Diffuse large B-cell lymphoma (DLBCL) is a commonly used cancer dataset (Alizadeh et al. 2000). It describes different stages of B-cells in terms of gene expression. The dataset has 45 samples and 4026 attributes in two classes: germinal center B-like and activated B-like DLBCL.

Table 2. Datasets used for the experimental analysis

Source of dataset                                                     Data set       Size (instances × attributes)
UCI Machine Learning Repository (Wolberg et al. 1988)                 Breast cancer  699 × 9
Broad Institute of MIT and Harvard (Golub et al. 1999)                Leukemia       72 × 7128
http://archive.ics.uci.edu/ml/datasets/Hepatitis (Gong (Doner) 1988)  Hepatitis      155 × 20
UCI Machine Learning Repository (Hong and Yang 1991)                  Lung cancer    32 × 56
http://tunedit.org/repo/BioInformatics_Seville/Lymphoma/reduced_2classes.arff (Alizadeh et al. 2000)  Lymphoma  45 × 4026

7 Ranking of Classifiers

As is well known, combined classifiers perform better than a single classifier. Hence, we propose a selective ensemble classification technique that ranks the classification results of KNN, MLP, and Decision Tree in order to diagnose cancer datasets effectively. The key factors in ensemble learning are the accuracy and the diversity of the base classifiers. We use an indicator R to rank each single classifier based on its accuracy and diversity:

R_i = \mu_1 \cdot ACC_i - \mu_2 \cdot DF_i   (18)

where ACC_i is the accuracy of the single classifier and the double fault (DF) reflects the diversity of the classifier. DF is computed using

DF_i = \frac{FN}{TP + TN + FP + FN}   (19)

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives.
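A direct Python transcription of Eqs. (18)–(19); the weights μ1 and μ2 and the example scores below are illustrative assumptions, since the paper does not state their values:

```python
def double_fault(tp, tn, fp, fn):
    """DF of Eq. (19): the fraction of false negatives."""
    return fn / (tp + tn + fp + fn)

def rank_score(acc, df, mu1=1.0, mu2=1.0):
    """R of Eq. (18): accuracy traded against the double-fault term."""
    return mu1 * acc - mu2 * df

# Example: keep the classifiers with the highest R (numbers invented).
scores = {"KNN": rank_score(0.933, 0.02),
          "MLP": rank_score(0.950, 0.01),
          "DT":  rank_score(0.950, 0.02)}
selected = sorted(scores, key=scores.get, reverse=True)
```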


8 Result Analysis and Comparison

8.1 Breast Cancer Prognosis

We considered the WPBC dataset for recurrent versus non-recurrent classification, to investigate this life-threatening disease clinically. The classification results after feature selection are discussed below; they can help oncologists distinguish a good prognosis from a bad one. Table 3 shows the performance-indicator values on the Wisconsin Prognosis breast cancer dataset when the preprocessed, feature-extracted training dataset is applied directly to train the classifiers, and Table 4 shows the same statistics for the WPBC testing dataset. Figure 2 shows the error convergence curve for the training of the MLP classifier, and Fig. 3 shows the ROC curve of sensitivity against 1−specificity. The ROC curve validates the accuracy achieved by our system, as it leans toward sensitivity and rises toward 100% accuracy. Six indicators are used to evaluate the performance of the system: accuracy, sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), and the area under the receiver operating characteristic (ROC) curve (AUC), which appraises overall performance:

ACC = \frac{TP + TN}{TP + TN + FP + FN} \times 100   (20)

Sensitivity = \frac{TP}{TP + FN} \times 100   (21)

Specificity = \frac{TN}{TN + FP} \times 100   (22)

PPV = \frac{TP}{TP + FP} \times 100   (23)

NPV = \frac{TN}{TN + FN} \times 100   (24)
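These five indicators follow directly from the confusion matrix; a small helper (our sketch) is shown below. AUC is computed separately from the ROC curve:

```python
def indicators(tp, tn, fp, fn):
    """Performance indicators of Eqs. (20)-(24), in percent."""
    total = tp + tn + fp + fn
    return {
        "accuracy":    100.0 * (tp + tn) / total,   # Eq. (20)
        "sensitivity": 100.0 * tp / (tp + fn),      # Eq. (21)
        "specificity": 100.0 * tn / (tn + fp),      # Eq. (22)
        "ppv":         100.0 * tp / (tp + fp),      # Eq. (23)
        "npv":         100.0 * tn / (tn + fn),      # Eq. (24)
    }
```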

Table 3. Training performance of the single classifiers compared with the integrated classifier, with preprocessing and feature extraction, on the Wisconsin Prognosis breast cancer dataset

Methods          Accuracy  Sensitivity  Specificity  NPV       PPV       AUC
KNN              93.33333  97.701149    81.818182    93.40659  93.10345  0.92
MLP              95        98.850575    84.848485    94.50549  96.55172  0.93
DT               95        98.850575    84.848485    94.50549  96.55172  0.94
Proposed system  95.83333  98.863636    87.5         95.6044   96.55172  0.95


Table 4. Testing performance of the single classifiers compared with the integrated classifier, with preprocessing and feature extraction, on the Wisconsin Prognosis breast cancer dataset

Methods          Accuracy  Sensitivity  Specificity  NPV       PPV       AUC
KNN              89.74359  96.428571    72.727273    90        88.88889  0.88
MLP              91.02564  96.491228    76.190476    91.66667  88.88889  0.88
DT               92.30769  96.551724    80           93.33333  88.88889  0.9
Proposed system  94.87179  98.275862    85           95        94.44444  0.91

Fig. 2. Training error convergence curve of MLP for Wisconsin Prognosis breast cancer dataset

Fig. 3. ROC curve obtained from the training and testing accuracy for Wisconsin Prognosis breast cancer dataset


Table 5. Datasets used for the experimental analysis, after pruning and feature extraction

Sl. No.  Data set          After pruning  Class labels  After feature extraction  Training sample (60%)  Testing sample (40%)
1        Breast cancer     461 × 9        2             461 × 6                   277 × 6                184 × 6
2        Leukemia          72 × 7128      2             72 × 3720                 44 × 3720              28 × 3720
3        Hepatitis         155 × 20       2             155 × 9                   94 × 9                 61 × 9
4        Lung cancer       32 × 56        3             32 × 28                   20 × 28                12 × 28
5        Lymphoma          45 × 4026      2             45 × 2116                 28 × 2116              17 × 2116
6        WPBC (Prognosis)  198 × 35       2             198 × 20                  120 × 20               78 × 20

8.2 Cancer Prediction

Our objective is to design a robust system that is able not only to give a prognosis but also to predict. We tested our system with the five benchmark datasets discussed in Table 2. The breast cancer dataset was found to have 0.27% missing values and 21% duplicate data; removing the duplicates reduces it from 699 instances to 461. Leukemia, hepatitis, lung cancer, and lymphoma have 0.0%, 5.6%, 0.27%, and 3.30% missing data, respectively, and no duplicate data were found in these four datasets. Table 5 details the impact of the preprocessing and feature extraction phases on the dataset sizes; our approach reduced the feature space by more than 50% of the original space. Table 6 shows the performance-indicator values (accuracy, sensitivity, specificity, PPV, and NPV) when the preprocessed, feature-extracted breast cancer training data are applied directly to train the classifiers. The AUC value lies between 0.5 and 1, where 0.5 denotes a bad classifier and 1 an excellent one. Table 7 shows the same statistics for the testing dataset. Figure 4 shows the error convergence curve for the training of the MLP classifier on the breast cancer dataset, and Fig. 5 shows the ROC curve of sensitivity against 1−specificity. The ROC curve validates the accuracy achieved by our system, as it leans toward sensitivity and rises toward 100% accuracy.


Table 6. Training performance of the single classifiers compared with the integrated classifier, with preprocessing and feature extraction, on the breast cancer dataset

Methods          Accuracy  Sensitivity  Specificity  NPV       PPV       AUC
KNN              97.11191  97.014925    97.202797    97.01493  97.2028   0.91
MLP              94.22383  94.074074    94.366197    94.07407  94.3662   0.9
DT               96.38989  96.323529    96.453901    96.32353  96.4539   0.9
Proposed system  98.55596  98.540146    98.571429    98.54015  98.57143  0.91

Table 7. Testing performance of the single classifiers compared with the integrated classifier, with preprocessing and feature extraction, on the breast cancer dataset

Methods          Accuracy  Sensitivity  Specificity  NPV       PPV       AUC
KNN              95.65217  95.505618    95.789474    95.50562  95.78947  0.9
MLP              92.3913   92.134831    92.631579    92.13483  92.63158  0.9
DT               95.65217  95.505618    95.789474    95.50562  95.78947  0.94
Proposed system  96.73913  96.629213    96.842105    96.62921  96.84211  0.92

Fig. 4. Training error convergence curve of MLP for breast cancer dataset


Fig. 5. ROC curve obtained from the training and testing accuracy for breast cancer dataset

A similar set of operations was performed on the other four datasets (including WPBC), and the results were compared with recent methodologies. The results obtained by our system prove more efficient in terms of accuracy. Table 8 shows the accuracy comparison of our method with the individual classifiers on all the benchmark datasets. Table 9 shows the comparison of different fusion models (Majority Voting, Distribution Summation, Dempster-Shafer) with the proposed model. It can be seen that the proposed model outperforms the others for all datasets except lung cancer.

Table 8. Average accuracy rate in percentage using 10-fold cross validation

               KNN                MLP                DT                 Proposed model
Data set       Training  Testing  Training  Testing  Training  Testing  Training  Testing
Breast cancer  97.1119   95.6521  94.2238   92.3913  96.3898   95.6521  98.5559   96.7391
WPBC           93.3333   89.7435  95        91.0256  95        92.3076  95.8333   94.8717
Leukemia       97.7272   96.4285  93.1818   89.2857  93.1818   92.8571  100       96.4285
Hepatitis      94.6808   91.8032  100       90.1639  97.8723   90.1639  100       95.0819
Lymphoma       100       88.2352  92.8571   88.2352  92.8571   88.2352  100       94.1176
Lung cancer    85        75       80        75       80        75       95        83.3333


Table 9. Error rate comparison of different fusion strategies with the proposed adaptive fusion model on the six benchmark datasets

               Majority voting    Distribution Summation  Dempster Shafer    Proposed model
Data set       Training  Testing  Training  Testing       Training  Testing  Training  Testing
Breast cancer  4.21      5.71     6.34      8.57          7.22      5.71     1.44      3.26
WPBC           3.34      7.15     7.22      13.04         5.26      11.53    4.16      5.12
Leukemia       14.45     18.04    3.35      7.27          3.24      6.45     0         3.57
Hepatitis      3.22      4.91     2.44      4.29          4.14      4.91     0         4.91
Lymphoma       10.21     14.28    10.72     11.77         9.35      11.76    0         5.88
Lung cancer    10.21     28.22    16.9      28.55         9.44      19.48    5         16.66

9 Conclusion

This paper presents a new adaptive approach to combining classifiers for the classification of gene expression datasets, based on double radial basis kernel feature extraction and a rank-based fusion strategy. The experiments show that the proposed framework achieves higher classification accuracy. A major challenge was building the datasets for training and testing the classifiers: since the datasets are not in a common format and were downloaded from various sites, proper warehousing was done to prune the data and provide it in .csv format for easy access and experimental evaluation. Comparison with various strategies shows that the accuracy achieved is quite promising. Future work will aim to reduce the time complexity and to evaluate the approach on other datasets; executing a cascade of classifiers in parallel can also be considered as future work.

References

1. Deng, L., Yu, D., et al.: Deep learning: methods and applications. Found. Trends Sig. Process. 7(3–4), 197–387 (2014)
2. Dara, S., Tumma, P.: Feature extraction by using deep learning: a survey. In: Proceedings of the 2nd International Conference on Electronics, Communication and Aerospace Technology (ICECA 2018), pp. 1795–1801 (2018)
3. Cong, J., Wei, B., He, Y., Yin, Y., Zheng, Y.: A selective ensemble classification method combining mammography images with ultrasound images for breast cancer diagnosis. Comput. Math. Methods Med. 7, 1:7 (2017)
4. Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems (Gray, J. (Series Editor)), 2nd edn. Morgan Kaufmann Publishers, Burlington (2006)
5. Lee, Z.-J.: An integrated algorithm for gene selection and classification applied to microarray data of ovarian cancer. Artif. Intell. Mach. 42(1), 81–93 (2008)
6. Colin, C., Nello, C.: Simple learning algorithms for training support vector machines. Technical report, University of Bristol, pp. 1–29 (1998)


7. Cooper, G.F., Herskovita, E.: A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. 9(4), 309–347 (1992)
8. Cortes, C., Vapnik, V.: Support vector networks. Mach. Learn. 20, 273–297 (1995)
9. Friedman, N., Linial, M., Nachman, I., Pe'er, D.: Using Bayesian networks to analyze expression data. J. Comput. Biol. 7(3–4), 601–620 (2000)
10. Xu, L., Krzyzak, A., Suen, C.: Methods of combining multiple classifiers and their applications to hand written numerals. IEEE Trans. Syst. Man Cybern. 22(3), 418–435 (1992)
11. Hansen, L., Salamon, P.: Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell. 12, 993–1001 (1990)
12. Hazem, J.M., Bakry, E.: An efficient algorithm for pattern detection using combined classifiers and data fusion. Inf. Fusion 11, 133–148 (2010)
13. Kilic, E., Alpaydin, E.: Learning the areas of expertise of classifiers in an ensemble. Procedia Comput. Sci. 3, 74–82 (2011)
14. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell. 20(3), 226–239 (1998)
15. Mangai, U.G., Samanta, S., Das, S., Chowdhury, P.R.: A survey of fusion and feature fusion strategies for pattern classification. IETE Tech. Rev. 27(4), 293–307 (2010)
16. Rokach, L.: Ensemble methods for classifiers. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 957–980. Springer, Boston (2005)
17. Senapati, R., Shaw, K., Mishra, S., Mishra, D.: A novel approach for missing value imputation and classification of microarray dataset. Procedia Eng. 38, 1067–1071 (2012)
18. Liu, S., Xu, C., Zhang, Y., Liu, J., Yu, B., Liu, X., Dehmer, M.: Feature selection of gene expression data for cancer classification using double RBF-kernels. BMC Bioinformatics 19, 396 (2018)
19. Phienthrakul, T., Kijsirikul, B.: Evolutionary strategies for multi-scale radial basis function kernels in support vector machines. In: Conference on Genetic and Evolutionary Computation, vol. 14, no. 7, pp. 905–911 (2005)
20. Bernhard, S., Alexander, J.S.: Learning with Kernels. MIT Press, Cambridge (2002)
21. Chen, H., Zhang, Y., Gutman, I.: A kernel-based clustering method for gene selection with gene expression data. J. Biomed. Inform. 62, 12–20 (2016). https://doi.org/10.1016/j.jbi.2016.05.007
22. Garcia, S., Derrac, J., Cano, J., Herrera, F.: Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Trans. Pattern Anal. Mach. Intell. 34(3), 417–435 (2012)
23. Rosenblatt, F.: Principles of neurodynamics. Perceptrons and the theory of brain mechanisms (No. VG-1196-G-8). Cornell Aeronautical Lab Inc., Buffalo, NY (1961)
24. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Rumelhart, D.E., McClelland, J.L., PDP Research Group (eds.) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundation. MIT Press, Cambridge (1986)
25. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems 2(4), 303–314 (1989)
26. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd edn. Prentice Hall (1998). ISBN 0-13-273350-1
27. Kamiński, B., Jakubczyk, M., Szufel, P.: A framework for sensitivity analysis of decision trees. CEJOR 26(1), 135–159 (2017). https://doi.org/10.1007/s10100-017-0479-6
28. Wolberg, W.H., Mangasarian, O.: UCI Machine Learning Repository. http://archive.ics.uci.edu/ml. Irvine

Extractive Odia Text Summarization System: An OCR Based Approach Priyanka Pattnaik1 , Debasish Kumar Mallick1 , Shantipriya Parida2 , and Satya Ranjan Dash1(B) 1

School of Computer Engineering, KIIT University, Bhubaneswar, Odisha, India [email protected], [email protected], [email protected] 2 Idiap Research Institute, Martigny, Switzerland [email protected]

Abstract. Automatic text summarization is considered a challenging task in the natural language processing field. In multilingual scenarios, particularly for low-resource, morphologically complex languages, summarization datasets are rare and difficult to construct. In this work, we propose a novel technique to extract Odia text from image files using optical character recognition (OCR) and to summarize the obtained text using extractive summarization techniques. We also performed a manual evaluation to measure the quality of the summaries and validate our techniques. The proposed approach is found suitable for generating summarized Odia text, and the same technique can also be extended to other low-resource languages for extractive summarization.

Keywords: Optical character recognition · Text summarization · Natural language processing

1 Introduction

Automatic text summarization helps people condense long passages into a short form that is full of information and knowledge [2]. Odia is a language that has always been known for its literature, and it is lexically and morphologically rich. In this paper, we summarize Odia texts into a few lines so that people can gain information or knowledge in very few lines [9]. Summarizing a text means shortening a long text so that the depth of the original can still be understood. NLP offers many concepts for different approaches, and one of them can be applied to text summarization. Text summarization can be done in two main ways: abstractive and extractive. In extractive text summarization, keywords and sentences are extracted from the source document to form the summary [8]. In this approach, the extracted content is taken without making changes to the source document. In abstractive text summarization,


new phrases and sentences are generated, yielding a meaningful summary similar to one produced by a human. Abstractive methods overcome the grammatical inconsistencies that can arise in extractive methods. In this paper, however, we use the extractive approach to text summarization.

2 Related Work

When we chose this language for our work, we first looked for existing Odia corpora and found that the amount available is not sufficient for further research; very few correctly translated Odia corpora exist. This is one of the biggest research gaps for researchers who have tried to work in this area. Furthermore, Odia character recognition is a difficult task for machines because of the intricate design of the characters, and researchers are still working on it. To make machines understand humans, we must make them understand human language, and for the Odia language this led us to text summarization. We found that text summarization must first be set up as supervised learning for the machine, after which unsupervised learning can be applied to extract the solution; it is thus preprocessed, supervised learning for the machine. Odia literature is known for its complexity: it is rich in meaning expressed in few terms, which is a big challenge for all researchers trying to simplify it. We also found that no proper stemmer or lemmatizer for Odia has yet been built; unlike English, Odia has many extra characters to strip from a word to obtain its root. Hence, much work remains to make Odia a machine-learning-ready language. We found two relevant papers on Odia text summarization. In "Odia Text Summarization Using Stemmer", R. C. Balabantaray et al. performed summarization using an Odia stemmer [3]. In their implementation, they first tokenized the whole text; second, they removed stop words; third, they stemmed the text with an Odia stemmer; and fourth, they assigned weights to words, with the higher-ranked words together forming the summary sentences. The summary length is percentage-based, shown according to the proportion of lines the user wants to read. S. Biswas et al. built an automatic text summarizer for the Odia language [5] using the word frequency, positional criteria, cue phrase, and title overlap methods, obtaining a recall value of 95%. In their work, we found that they used fewer stop words and that their training had already been learned by the machine, so they obtained precision and recall values of 100% for three of the methods they used [14].

3 Experimental Setup

This section explains the experimental setup used in our paper. The proposed model is shown in Fig. 1.

Fig. 1. Proposed model

3.1 Data Preprocessing

The Odia language is rich in its script; hence, we take our input as images, which may be scanned or captured from any source. An image of written Odia is read as a 3D array. In the second step, we want the characters in the image to be recognized. We used the Tesseract OCR engine to read Odia characters [13], but found that it converts RGB images to grayscale with the traditional average method. We instead convert the RGB image to grayscale by applying the weighted (luminosity) method [7]:

Grayscale = 0.299 R + 0.587 G + 0.114 B   (1)

where R, G, and B are the red, green, and blue components of a pixel.
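A one-line NumPy version of this weighted conversion (our sketch, replacing Tesseract's own averaging as described above):

```python
import numpy as np

def to_grayscale(rgb):
    """Weighted (luminosity) conversion of Eq. (1);
    rgb is an H x W x 3 uint8 array read from the input image."""
    weights = np.array([0.299, 0.587, 0.114])
    return (rgb[..., :3] @ weights).astype(np.uint8)
```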

Tesseract also uses the traditional Otsu binarization algorithm for thresholding, so we used the Niblack and Sauvola threshold algorithms instead and found that they give better results than the Otsu technique. The advantage of the Niblack algorithm is that it uses a rectangular window that slides over the image [13]. The threshold T for the center pixel is derived from the mean m and standard deviation s of the values inside the window:

T = m + k · s   (2)


where k is a constant, set here to 0.8. However, this creates noise in some areas, so to remove it we incorporated the Sauvola algorithm, whose modified formula is

T = m · (1 + k · (s/R − 1))   (3)

where R is the dynamic range of the standard deviation, set to 128. This formula does not handle all document images well, so the normalized formula we implemented is

T = m − k · (1 − s/R) · (m − M)   (4)

where R is now the maximum standard deviation over all the windows and M is the minimum gray level of the image. However, some pixels are lost during these processes, which can cause character recognition errors, so we used dilation, which helps join regions where pixel values are missing [6].
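A compact sketch of this binarization and cleanup stage using scikit-image's built-in Niblack/Sauvola filters; the window size is an assumed value, and the paper's normalized variant of Eq. (4) is not part of the library call:

```python
from skimage.filters import threshold_niblack, threshold_sauvola
from skimage.morphology import binary_dilation, binary_erosion

def binarize(gray, window=25, k=0.8, r=128):
    """Local thresholding in the spirit of Eqs. (2)-(3), followed by
    the dilation-then-erosion cleanup of Figs. 2-3."""
    t_nib = threshold_niblack(gray, window_size=window, k=k)       # Eq. (2)
    t_sau = threshold_sauvola(gray, window_size=window, k=k, r=r)  # Eq. (3)
    binary = gray > t_sau
    # dilation joins broken strokes; erosion trims the extra pixels
    return binary_erosion(binary_dilation(binary))
```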

Fig. 2. Dilation

In the left diagram of Fig. 2, some pixels are missing. When the dilation algorithm is applied, as shown on the right side of Fig. 2, another problem arises: extra pixels are included. To reduce this kind of error, the erosion algorithm is applied, as shown in Fig. 3.

Fig. 3. Erosion

For our preprocessed data, we take four Odia images, i.e., images related to "Naveen Pattnaik", "Debasish Mohanty", "Narendra Modi", and "Subash Chandra Bose". A sample image containing Odia text for the topic "Debasish Mohanty" is shown in Fig. 4.


Fig. 4. An example of original image containing Odia text

Table 1. Manual evaluation parameters for the generated summaries

Parameter    Description
Parameter 1  Is the summarization related to the given topic?
Parameter 2  Can the name of the main character be verified from the summarization?
Parameter 3  Does the bag of words present give a relatable meaning?
Parameter 4  Is the total number of lines in the summarization understandable and meaningful?
Parameter 5  Overall quality of the output

Methodology

After given the perfect shape, the “Tesseract” tool kit performs Odia character extraction. For text summarization, we have used “Term Frequency-Inverse Document Frequency”. The sentences which are extracted from the image are tokenized which split them into sentences. After sentences are tokenized, the sentences are split into words. To remove unnecessary words which are present in the sentences, the stop-word filtration process is performed [4,10]. As in Odia language, less number of a stop-word dataset is present. We have made our dataset. After removing of stop-words, the rest of the words “Term-Frequency (TF)” are calculated by the given formula below TF = Total Appearance of Word in the Document/Total Words in the Document After calculating the Term-Frequency, the Inverse Document Frequency (IDF) will calculate. The Formula is given below IDF = log (All Document Number/Document Frequency) TF-IDF = TF * IDF

Extractive Odia Text Summarization System: An OCR Based Approach

141

Fig. 5. An example of extracted Odia text and generated summaries

After been calculated, words of the documents are sorted in descending order by its TF-IDF. By summation of all TF-IDF values of words present in the sentences, which decide the rank of sentences values [1,12]. As TF-IDF is an extractive method the sentences appear in after summarization, those are same as the sentences present in the original document [11].

4

Result

When the proposed technique applied to the selected data, we got the summarized text as per our desire. The extracted Odia text and the generated Odia summaries are shown in Fig. 5.

142

P. Pattnaik et al. Table 2. Human evaluation rating table.

Evaluator

Topic name (in English)

Is the summarization is related to the given topic?

Name of the main character is verified by looking at the summarization

Presence of the bag of words is giving a relatable meaning

Is the total no of lines in the summarization understandable and meaningful?

Overall quality of the output

Evaluator 1

Naveen Pattnaik

100%

80%

55%

55%

Good (75%)

Debasish Mohanty

100%

88%

55%

65%

Good (80%)

Narendra Modi

100%

80%

60%

60%

Good (76%)

Subash Chandra Bose

100%

85%

75%

65%

Good (84%)

Naveen Pattnaik

100%

80%

65%

75%

Good (63%)

Debasish Mohanty

100%

90%

60%

60%

Good (75%)

Narendra Modi

100%

90%

55%

60%

Good (72%)

Subash Chandra Bose

100%

85%

63%

64%

Good (85%)

Naveen Pattnaik

100%

70%

68%

65%

Good (67%)

Debasish Mohanty

100%

80%

66%

67%

Good (74%)

Narendra Modi

100%

75%

57%

59%

Good (69%)

Subash Chandra Bose

100%

85%

85%

84%

Good (80%)

Naveen Pattnaik

100%

80%

69%

67%

Good (66%)

Debasish Mohanty

100%

90%

66%

69%

Good (72%)

Narendra Modi

100%

60%

60%

60%

Good (62%)

Subash Chandra Bose

100%

95%

85%

85%

Good (80%)

Evaluator 2

Evaluator 3

Evaluator 4

To judge the summarization, we evaluated our results with human evaluators. We chose four human evaluators who can read, write, and understand Odia properly, and we set five parameters for the manual evaluation, as listed in Table 1. We decided on human evaluation because, in our case, automatic evaluation is difficult. We provided the extracted Odia text and the generated summaries to the four experts. According to their evaluation, all our results are closely related to the extracted Odia text, so each summarization is related to its topic. The evaluators scored the results against our evaluation criteria in percentage format; the manual evaluation results are shown in Table 2.

5 Conclusion and Future Work

In this paper, we have proposed a method for extracting Odia text from images and generating summarized text. Odia is a language rich in text and known for its literature, but it lacks the computational resources needed for machines to perform NLP tasks such as machine translation and summarization. Our motive is to make the Odia language more accessible to machines by creating more language resources. In future work, we will consider abstractive techniques for summarizing Odia text by building a summarization dataset (Odia texts and their corresponding summaries). Our method can easily be extended to generate summaries for other low-resource languages.

References
1. Aizawa, A.: An information-theoretic perspective of TF-IDF measures. Inf. Process. Manage. 39(1), 45–65 (2003)
2. Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E.D., Gutierrez, J.B., Kochut, K.: Text summarization techniques: a brief survey. arXiv preprint arXiv:1707.02268 (2017)
3. Balabantaray, R., Sahoo, B., Sahoo, D., Swain, M.: Odia text summarization using stemmer. Int. J. Appl. Inf. Syst. 1(3), 21–24 (2012). ISSN 2249-0868
4. Bharti, S.K., Babu, K.S.: Automatic keyword extraction for text summarization: a survey. arXiv preprint arXiv:1704.03242 (2017)
5. Biswas, S., Acharya, S., Dash, S.: Automatic text summarization for Oriya language. Int. J. Comput. Appl. 975, 8887 (2015)
6. Gaikwad, D.K., Mahender, C.N.: A review paper on text summarization. Int. J. Adv. Res. Comput. Commun. Eng. 5(3), 154–160 (2016)
7. Joshi, N.: Text image extraction and summarization. Asian J. Converg. Technol. (AJCT) 5(1), 1–7 (2019)
8. Kryściński, W., Paulus, R., Xiong, C., Socher, R.: Improving abstraction in text summarization. arXiv preprint arXiv:1808.07913 (2018)
9. Lloret, E.: Text summarization: an overview. Paper supported by the Spanish Government under the project TEXT-MESS (TIN2006-15265-C06-01) (2008)
10. Munot, N., Govilkar, S.S.: Comparative study of text summarization methods. Int. J. Comput. Appl. 102(12), 33–37 (2014)
11. Nallapati, R., Zhou, B., Gulcehre, C., Xiang, B., et al.: Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023 (2016)
12. Ramos, J., et al.: Using TF-IDF to determine word relevance in document queries. In: Proceedings of the First Instructional Conference on Machine Learning, Piscataway, NJ, vol. 242, pp. 133–142 (2003)
13. Smith, R.: An overview of the Tesseract OCR engine. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, pp. 629–633. IEEE (2007)
14. Yousefi-Azar, M., Hamey, L.: Text summarization using unsupervised deep learning. Expert Syst. Appl. 68, 93–105 (2017)

Predicting Sensitivity of Local News Articles from Odia Dailies

Manoj Kumar Jena(B) and Sanghamitra Mohanty

Department of Computer Science & Application, Utkal University, Bhubaneswar, Odisha, India
[email protected], [email protected]

Abstract. News articles manifest different issues of the public domain. News articles are organized in different sections such as local, national, international, sports, politics, editorial, and readers' view. This paper proposes to categorize positive, negative, and neutral local news and then to predict the sensitivity of the negative local news articles. The purpose of the sensitivity analysis of negative local news articles is to set the priority of action to be taken by the local administration. Sensitive news discusses issues or events of an urgent nature, which need immediate intervention. The experiment is carried out based on Odia syntactosemantic knowledge for the categorization of 1000 local Odia news articles; for sensitivity analysis, tf and tf-idf scores are calculated using unigram and bigram representations of the data at the document level, the tf and tf-idf vectors are passed to an SVM, and the results are analyzed by calculating accuracy and F1 score. Keywords: SVM · TF · TF-IDF · Syntactosemantic · Sensitivity analysis · News article

1 Introduction
Most of the Odia dailies publish their news articles through e-papers and several Odia news portals. Generally, news is organized in different sections such as local, national, international, sports, politics, editorial, and readers' view. Readers usually focus their reading on the specific sections of their interest. It is observed that local news articles are the area of interest for the people of a particular locality and for people at government establishments. Administrators at government establishments read the news articles relating to local issues and accordingly make enquiries to take remedial measures to ameliorate the issues affecting the general populace. Generally, negative news regarding road conditions, sanitation, water supply, and other developmental issues is taken into account for remedial measures by the local administration. At present, local administrators identify such news by reading the whole newspaper, identifying the sensitive local articles, and setting priorities according to their sensitivity; action is then taken by priority to resolve the issues. Here an attempt has been made to categorize positive, negative, and neutral local news articles using an SVM classifier and lexicosyntactic knowledge of the Odia language. Further, tf and tf-idf matrices are constructed at the document level and passed to a linear SVM for classification. We have taken a dataset of eight hundred local news articles from Odia dailies, collected manually, for training, and similarly two hundred local news articles for testing. The dataset can be further increased in due course of time.

2 Related Works
News headlines have been analyzed for categorizing positive and negative news and for predicting the publication patterns of different media [1]. Human insights and deep learning methods have been applied for analyzing social media at the time of disaster [2]. Chase et al. [3] proposed topic-level news article classification using tf-idf scores with the k-means algorithm, classifying news articles belonging to different domains such as science, sports, and business. Generally, opinions are intended to be positive, negative, or neutral on a particular entity. Bing Liu [4] defines an opinion as a quintuple (Oi, fij, soijkl, hi, tl), where Oi is the target object, fij is a feature of the target object Oi, hi is the opinion holder, tl is the time when the opinion is expressed, and soijkl is the sentiment value of the opinion expressed by the opinion holder hi about the object Oi at time tl. Given a set of evaluative text documents D that contain opinions (or sentiments) about an object, opinion mining aims to extract attributes and components of the object that have been commented on in each document d ∈ D and to determine whether the comments are positive, negative, or neutral. Opinion mining has two main research directions: document-level opinion mining and feature-level opinion mining [5]. Document-level mining involves determining a document's polarity by calculating the average semantic orientation of extracted phrases. Opinion mining can be viewed as a kind of natural language processing for tracking the attitudes, feelings, or appraisals of the public about a particular topic, product, or service. All information available on the web is of two types: facts and opinions [6]. Chaudhary et al. [7] experimented with opinion mining from newspaper headlines using a linear SVM, tf-idf with linear SVM, and an SGD classifier; a comparison of the results of the three classifiers showed that tf-idf with linear SVM outperforms the other two in terms of accuracy. However, the method proposed by Thomas Scholz et al. [8], opinion mining in newspaper articles by entropy-based word connections, does not yield better results in comparison to opinion mining with SVM classifiers. Jena et al. [9] proposed a method for opinion mining using a support vector machine, but the result is not so impressive. Patil et al. [10] proposed an algorithm for classifying emotions such as anger, disgust, joy, sadness, fear, and surprise from content in the SemEval affective dataset. The method proposed by Olsher et al. [11], using an integration of domain, syntactic, and lexical knowledge, is not applicable to resource-scarce languages like Odia. Balahur et al. [12] proposed a decent annotation guideline and labeling process, but the inter-annotator agreement was very low. Jena et al. [13] proposed a method to find the polarity of opined sentences using an N-gram based support vector machine. The method proposed by Kim et al. [14] is impressive; it predicts the stock market by analyzing sentimental words from newspaper articles. Mittal et al. [15] described the role of negation and discourse relations for sentiment analysis using the Hindi SentiWordNet. Smaeureanu et al. [16] proposed applying supervised opinion mining techniques on online user reviews with a Naïve Bayes classifier using N = 3 and eliminating stop words, which gives better results than N = 1 and N = 2. In this paper, the experimentation is done through Python with NLTK for calculating the tf-idf scores of the documents, and SVMlight is used for learning from the dataset and evaluation.

3 Methodology
This paper proposes a method for classifying local news articles based on sensitivity and for setting the priority of action for the local administrator. The whole process of classification and priority setting based on the confidence score is depicted in Fig. 1.

Fig. 1. Diagrammatic view of proposed approach

The whole process of opinion mining passes through seven phases. Manually collected news articles are grouped into a corpus and preprocessed in Python to remove numbers, emails, URLs, punctuation, and stop words, as these carry little information content. Examples of local news articles in the Odia language are given below.


The cleaned text then goes through syntactosemantic tagging. Syntactosemantic tagging is done by checking each word against our dictionary with syntactic marks such as Kriya (verb), Bisesa (noun), and Bisesana (adjective), and marking it as positive, negative, or neutral based on its category. Not every word is semantically tagged; principally, the adjectives are semantically tagged. The tagged text is then vectorized by vectorizer code, which calculates term frequencies (tf) and inverse document frequencies (idf), since a frequently occurring word has higher importance than others; tf-idf is calculated at the document level from the unigram and bigram representations of the cleaned text. The unigram and bigram representations are defined as follows.

Unigram and bigram probabilities are calculated as follows.

Unigram probability: P(wi) = count(wi) / count(total number of words), i.e., the probability of wordi is the frequency of wordi in our corpus divided by the total number of words in our corpus.

Bigram probability: P(wi | wi−1) = count(wi−1, wi) / count(wi−1), i.e., the probability that wordi−1 is followed by wordi is the number of times we saw wordi−1 followed by wordi, divided by the number of times we saw wordi−1.

The tf and tf-idf vectors are therefore computed for the two representations. Out of the tf and tf-idf matrices of each representation, 800 rows are used for training the SVM classifier and 200 rows for testing it, and the result is calculated; SVMlight is used for this purpose. Based on the result, the opinion of readers on a particular news article is predicted by calculating the confidence score. The confidence score is the determinant in setting the priority for sensitivity analysis: if the priority is three, the news article is sensitive and urgent action is required.

3.1 Performance Evaluation Measure

These parameters help in evaluating the performance of a supervised machine learning algorithm. Here we have used three parameters: accuracy, F1 score, and the confidence of the classification. The performance measures are calculated as detailed below and depend on the four prediction categories: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

Accuracy: Accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total observations. One may think that if we have high accuracy then our model is best. Accuracy is a great measure, but only when the dataset is symmetric, i.e., the numbers of false positives and false negatives are almost the same.

Accuracy = (TP + TN) / (TP + FP + FN + TN)

F1 score: The F1 score is the weighted average of precision and recall, so it takes both false positives and false negatives into account. It is not as intuitive as accuracy, but F1 is usually more useful, especially with an uneven class distribution. Accuracy works best when false positives and false negatives have similar costs; if their costs are very different, it is better to look at both precision and recall.

F1 Score = (2 × precision × recall) / (precision + recall)

Confidence Measure: We decided to use the word probability derived from an n-gram model as a confidence measure:

C(tj) = P(tj | tj−1, …, tj−n+1)
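A minimal sketch of this pipeline using scikit-learn as a stand-in for the paper's SVMlight setup (the 800/200 split is taken from the text; function names and the weighted F1 choice are our assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score

def evaluate(train_texts, train_labels, test_texts, test_labels,
             ngram=(1, 1), use_idf=True):
    """Build tf or tf-idf vectors from a unigram or bigram representation
    and score a linear SVM, mirroring the four rows of Table 2."""
    if use_idf:
        vec = TfidfVectorizer(ngram_range=ngram)
    else:
        vec = CountVectorizer(ngram_range=ngram)  # raw term frequencies
    X_train = vec.fit_transform(train_texts)      # 800 training articles
    X_test = vec.transform(test_texts)            # 200 test articles
    clf = LinearSVC().fit(X_train, train_labels)
    pred = clf.predict(X_test)
    return (accuracy_score(test_labels, pred),
            f1_score(test_labels, pred, average="weighted"))

# e.g. bigram + tf-idf, the best setting in Table 2:
# acc, f1 = evaluate(train_texts, y_train, test_texts, y_test, ngram=(2, 2))
```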

4 Results Analysis
The snapshot of results shown in Table 1 presents the classification of local news articles with a confidence value indicating that, when a reader reads the article, his or her opinion about the topic will be positive, negative, or neutral. The confidence is the probability of the reader's view about the classification of the text; when the confidence score is low, the prediction about the reader's opinion may not be accurate, and such cases need further analysis. For a news article classified as negative, the confidence score determines the priority. There are three priority levels, i.e., I, II, III, depending on the confidence score. If a negative news article has a confidence score greater than or equal to 0.7, the priority level is III and the article is sensitive, requiring urgent intervention; if the confidence score is above 0.5 and less than 0.7, it is less sensitive; and if it is less than 0.5, the article is not sensitive. A minimal sketch of this thresholding is given after Fig. 2. Figure 2 shows the sensitivity analysis of the negatively classified data sets over the hyperplane, where the data clustered at −1 (III) are sensitive, at 1 (II) are less sensitive, and at 0 (I) are not sensitive.

Fig. 2. Sensitivity analysis of negative news article.
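A minimal sketch of the priority rule described above (the function name and return labels are ours, not the paper's):

```python
def priority_level(confidence: float) -> str:
    """Map an SVM confidence score for a negative article to a priority level."""
    if confidence >= 0.7:
        return "III (sensitive, urgent intervention required)"
    elif confidence > 0.5:
        return "II (less sensitive)"
    else:
        return "I (not sensitive)"
```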


Table 1. Classified news article with confidence score

The graph is plotted to demonstrate the spread of sensitive data over the negative classification. The 1000 Odia local news articles collected from different news portals are given as input and classified as positive, negative, or neutral, and the negatively classified data undergo sensitivity analysis; the figure above demonstrates the sensitivity analysis of a small number of negatively classified local news articles.

Table 2. Experimental results

| N-gram + tf/tf-idf score from tagged data | Accuracy | F1 score |
|---|---|---|
| Unigram + tf | 0.53 | 0.55 |
| Bigram + tf | 0.71 | 0.67 |
| Unigram + tf-idf | 0.73 | 0.66 |
| Bigram + tf-idf | 0.83 | 0.72 |

Table 2 depicts how accuracy and F1 score vary with the choice of data representation (unigram vs. bigram) and weighting (tf vs. tf-idf). In the experiment, moving from tf to tf-idf on the unigram representation improved accuracy from 0.53 to 0.73 and the F1 score from 0.55 to 0.66, and the bigram representation improved results further, with Bigram + tf-idf performing best at 0.83 accuracy and 0.72 F1 score. The accuracy and F1 score for Unigram + tf, Bigram + tf, Unigram + tf-idf, and Bigram + tf-idf are shown in Fig. 3.

Fig. 3. Depicts the f1 score and accuracy using different approaches.

5 Conclusion
This paper makes an attempt to predict the sensitivity of local news articles in Odia collected from different news portals. The negative local news articles are categorized as sensitive, less sensitive, and not sensitive. The confidence score is calculated to show the confidence of the classification. The sensitive news needs urgent attention of the local administration, and intervention is needed on the event or issue concerned. The accuracy and F1 score are calculated to demonstrate the quality of the classification. The results are generated over syntactosemantic tagging of the cleaned data and depend on tf and tf-idf over unigram and bigram representations, calculated with syntactic and semantic analysis of the news articles. Further experimentation should be done with a larger dataset and with machine learning algorithms other than SVM alone.

References
1. Reis, J., Benevenuto, F., Olmo, P.: Breaking the news: first impressions matter on online news. In: Proceedings of the Ninth International AAAI Conference on Web and Social Media, pp. 357–368 (2013)
2. Robertson, B.W., Jonson, M., Murthy, D., Smith, W.R., Stephens, K.K.: Using a combination of human insights and 'deep learning' for real-time disaster communication. Prog. Data Sci., 1–11 (2019)
3. Chase, J., Genajn, N., Karniol-Tambour, O.: Learning multilevel topic classification of news articles. Compliance Eng. J., 1–6 (2011)
4. Baldonado, M., Chang, C.-C.K., Gravano, L., Paepcke, A.: The Stanford digital library metadata architecture. Int. J. Digit. Libr. 1, 108–121 (1997)
5. Bruce, K.B., Cardelli, L., Pierce, B.C.: Comparing object encodings. In: Abadi, M., Ito, T. (eds.) Theoretical Aspects of Computer Software. Lecture Notes in Computer Science, vol. 1281, pp. 415–438. Springer-Verlag, Heidelberg (1997)
6. van Leeuwen, J.: Computer science today. In: Recent Trends and Developments. Lecture Notes in Computer Science, vol. 1000. Springer-Verlag, Heidelberg (1995)
7. Rameshbhai, J.C., Paulose, J.: Opinion mining on newspaper headlines using SVM and NLP. Int. J. Electr. Comput. Eng. 9(3), 2152–2163 (2019)
8. Scholz, T., Conrad, S.: Opinion mining in news articles by entropy-based word connections. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1828–1839 (2013)
9. Jena, M., Mohanty, S.: Contextual opinion mining in online Odia text using support vector machine. Compliance Eng. J. 10(7), 166–169 (2019)
10. Patil, S., Chaudhari, A.: Classification of emotions from text using SVM based opinion mining. Int. J. Comput. Eng. Technol. 3(1), 330–338 (2012)
11. Olsher, D.: Full spectrum opinion mining: integrating domain, syntactic and lexical knowledge. Cognitive Science Program, National University of Singapore (2012)
12. Balahur, A., Steinberger, R.: Rethinking sentiment analysis in the news: from theory to practice and back. In: Troyano, C.D. (ed.) WOMSA 2009, pp. 1–12 (2009)
13. Jena, M., Mohanty, S.: Document level opinion mining for Odia language using N-gram based support vector machine. Compliance Eng. J. 10(8), 430–436 (2019)
14. Kim, Y., Jeong, S., Ghani, I.: Text opinion mining to analyze news for stock market prediction. Int. J. Adv. Soft Comput. Appl. 6(1), 1–8 (2014)
15. Mittal, N., et al.: Sentiment analysis of Hindi reviews based on negation and discourse relation. In: International Joint Conference on Natural Language Processing, pp. 45–50 (2013)
16. Smaeureanu, I., Bucur, C.: Applying supervised opinion mining techniques on online user reviews. Informatica Economica 16(2), 81–91 (2012)

A Systematic Frame Work Using Machine Learning Approaches in Supply Chain Forecasting

K. Prahathish1, J. Naren1(B), G. Vithya2, S. Akhil1, K. Dinesh Kumar1, and S. Sai Krishna Mohan Gupta1

1 School of Computing, SASTRA Deemed University, Thanjavur, India
[email protected], [email protected], {121015123,121015119,saikrishna}@sastra.ac.in
2 School of Computing, KL University, Vijayawada, AP, India
[email protected]

Abstract. Forecasting is an important study in the field of supply chain and logistics for operations management. Based on this study, a systematic framework has been developed and is proposed here. Artificial Neural Networks have been applied in this field and utilized as an efficient way to forecast and to reduce errors marginally. The purpose of such a systematic approach using the proposed architecture is to reduce inventory holdings, which shall largely inform important decision-making policies in the future. Keywords: Supply chain · Forecast · Artificial Neural Network · Logistics · Support Vector Machine · Bullwhip Effect · Inventory

1 Introduction
Recent advancements in information technology have increased competition between businesses. Sustainable development and a lasting presence in the market are essential for organizations to survive. Forecasting involves the prediction of future demand for products, goods, and services based on their past consumption history in the field of operations management. A supply chain (SC) is typically a network with a set of operations carried out from a supplier to the end customer in delivering finished goods and services. The end customer essentially procures or pulls the end product from the network, while the other members such as suppliers, manufacturers, distributors, and retailers push the product (Smyth [24]). Such demand has high variance and fluctuation, so earlier prediction of demand is helpful. Machine learning (ML) techniques have been explored extensively in research as a forecasting methodology and have proven to outperform the conventional statistical forecast models. Artificial Neural Networks (ANN) mimic the functioning of human nerves for processing information. The main reason behind achieving such higher accuracy lies mainly in learning the non-linearity in the data; processing such complex relationships is pretty much what the ANN does. The purpose of this work is to provide a systematic framework that starts from obtaining raw data based on past history and ends at forecasting the future over a given time horizon. A detailed fish-bone diagram is given in Fig. 1.

Fig. 1. Fish bone diagram for supply chain management

2 Related Works
Various ML techniques have been employed to forecast time series data over given time horizons. There is extensive research work that critically discusses approaches to forecasting in general by Syntetos et al. [1]. Dong and Wen [21] adopted an ML model with a multi-layer feed-forward neural network and proposed a recurrent neural network (RNN) that forecasted paper sales for a paper mill. The findings concluded that the proposed ML models have lower error rates and shall contribute to reduced inventory. Carbonneau et al. [17], in their study, focused mainly on the upstream end of the SC; a clear classification, description, and comparison of the proposed ML techniques and the existing "traditional" forecasting models were analyzed on simulated data, and the comparison concluded that the error percentages of the ML techniques were marginally lower than those of the traditional approaches. A similar comparison study was done by Gutierrez et al. [19], but in this case for intermittent demand (ID). ID refers to demand alternating between zero and non-zero values with highly varying inter-demand intervals; well-known estimators of ID are the one proposed by Croston and its variants [5, 12]. The NN models achieved superiority over the existing estimators with regard to error measures. Forecasting ID was also done by Kourentzes [16] using proposed NN models that were compared against the existing well-known ID estimators; the architectures adopted (NN-Rate and NN-Dual) achieved lower forecasting accuracy, but higher service levels were observed in this case. The Bullwhip Effect (BWE) is an interesting phenomenon whereby the demand signal amplifies as one moves to the upstream end of the SC. Jaipuria and Mahapatra [26] made an attempt to reduce the BWE by adopting a novel architecture called Discrete Wavelet Transform integrated with ANN (DWT-ANN); the proposed architecture outperformed the Box–Jenkins ARIMA model [9] with reduced error rates and a reduced net stock amplification factor, which marginally reduced the BWE. The concept of extreme learning machines was adopted by Lolli et al. [8], who showed that the learning rate can be made faster by adopting it.

3 Framework
The novel architecture described in Fig. 2 involves the following set of steps. Obtaining raw data across any echelon of the SC is the basic step of forecasting. This data shall typically consist of numerical values that signify particular fields, and it may be obtained from any industry of any sector where forecasting future demand is worthwhile. The obtained data has to be processed in the next step, which involves classification of the data and the application of data mining techniques: the essential data has to be classified and ordered as vectored inputs to the learning techniques. Pattern identification involves exploring complexity in the data, separation of the data into various fields, and identification of patterns and correlations in the data. The patterned data is then made ready to be fed as input.

Fig. 2. A systematic architecture describing the framework


The very next step involves grouping all such patterns from the past history, which shall significantly involve the accumulation of large amounts of data ready to be fed as input. Big data is one such field where classifiers of data based on its dimensions (L'Heureux et al. [2]) can be adopted as a learning measure in order to encompass scenarios involving huge amounts of data. Adoption of a learning technique involves choosing an ML technique such as an ANN, an NN trained using back-propagation, an RNN, an extreme learning machine, or a support vector machine (SVM); integrated approaches with ANNs, as well as deep learning, can also be adopted depending on the forecasting target. The wide range of targets in forecasting includes inventory metrics, error metrics, forecast metrics, and the metrics contributing to inducing the BWE. Demand in the SC varies across each player; information sharing is one of the key factors in the SC that significantly contributes to increased customer service levels, prompt delivery of goods, and good relationships with partners. Forecasting demand, as given in the framework, involves the computations carried out by the learning machine, from which the predicted output is obtained. The output then has to be validated against an existing real-time dataset, and a comparative study can be made from this prediction. A minimal sketch of such a forecasting step is given below.
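A minimal sketch of the learning step under the stated framework, using a sliding-window multi-layer perceptron from scikit-learn; the window length, network size, and the `history` series are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def forecast_next(history, window=7):
    """Train an MLP on sliding windows of past demand and
    predict the next period's demand."""
    X = np.array([history[i:i + window]
                  for i in range(len(history) - window)])
    y = np.array(history[window:])
    model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                         random_state=0).fit(X, y)
    return model.predict([history[-window:]])[0]

# e.g. daily demand for a product over a month:
# next_day = forecast_next(daily_demand, window=7)
```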

4 Results
Inference: From the graphs in Fig. 3 we can observe the day-wise prediction of a product's demand for a month, taking the day as the x-axis and the demand as the y-axis.

Fig. 3. Results depicting the supply chain framework


5 Conclusion
The novel architecture involves a systematic approach to predict and forecast future data. The need to forecast future consumption of products and services arises across virtually all echelons of the SC. Forecasting techniques using ML algorithms have proven to have a higher degree of accuracy and robustness when compared with the conventional statistical models. Further research work can concentrate on integrating learning techniques with big data analytics (L'Heureux et al. [2]).


References
1. Syntetos, A.A., Babai, Z., Boylan, J.E., Kolassa, S., Nikolopoulos, K.: Supply chain forecasting: theory, practice, their gap and the future. Eur. J. Oper. Res. 252(1), 1–26 (2016). https://doi.org/10.1016/j.ejor.2015.11.010
2. L'Heureux, A., Grolinger, K., Elyamany, H.F., Capretz, M.A.M.: Machine learning with big data: challenges and approaches. IEEE Access 5, 7776–7797 (2017)
3. Rao, A.: A comment on: forecasting and stock control for intermittent demands. Oper. Res. Q. (1970–1977) 24(4), 639 (1973)
4. Syntetos, A., Boylan, J.: The accuracy of intermittent demand estimates. Int. J. Forecast. 21(2), 303–314 (2005)
5. Weiland, A., Leighton, R.: Geometric analysis of neural network capabilities. Technical report, Arpanet III, pp. 385–392 (1988)
6. Davis, E.W., Spekman, R.: Extended Enterprise. Prentice Hall, Upper Saddle River (2004)
7. Lolli, F., Gamberini, R., Regattieri, A., Balugani, E., Gatos, T., Gucci, S.: Single-hidden layer neural networks for forecasting intermittent demand. Int. J. Prod. Econ. 183, 116–128 (2017)
8. Wilson, G.: Time Series Analysis: Forecasting and Control (by George E. P. Box, Gwilym M. Jenkins, Gregory C. Reinsel and Greta M. Ljung), 5th edn. Wiley, Hoboken, p. 712, ISBN 978-1-118-67502-1. J. Time Ser. Anal. 37(5), 709–711 (2016)
9. Forrester, J.: Industrial Dynamics. MIT Press, Cambridge (1961)
10. Småros, J., Lehtonen, J., Appelqvist, P., Holmström, J.: The impact of increasing demand visibility on production and inventory control efficiency. Int. J. Phys. Distrib. Logistics Manage. 33(4), 336–354 (2003)
11. Croston, J.D.: Forecasting and stock control for intermittent demands. Oper. Res. Q. 23, 289–304 (1972)
12. Tan, K.C.: A framework of supply chain management literature. Eur. J. Purchasing Supply Manage. 7(1), 39–48 (2001)
13. Sanders, N., Premus, R.: IT applications in supply chain organizations: a link between competitive priorities and organizational benefits. J. Bus. Logistics 23, 65–83 (2002)
14. Kourentzes, N.: Intermittent demand forecasts with neural networks. Int. J. Prod. Econ. 143(1), 198–206 (2013)
15. Carbonneau, R., Laframboise, K., Vahidov, R.: Application of machine learning techniques for supply chain demand forecasting. Eur. J. Oper. Res. 184(3), 1140–1154 (2008). https://doi.org/10.1016/j.ejor.2006.12.004
16. Ireland, R., Bruce, R.: CPFR: only the beginning of collaboration. Supply Chain Manage. Rev., 80–88 (2000)
17. Gutierrez, R.S., Solis, A.O., Mukhopadhyay, S.: Lumpy demand forecasting using neural networks. Int. J. Prod. Econ. 111(2), 409–420 (2008)
18. Raghunathan, S.: Interorganizational collaborative forecasting and replenishment systems and supply chain implications. Decis. Sci. 30(4), 1053–1071 (1999)
19. De Kok, T., Janssen, F., van Doremalen, J., van Wachem, E., Clerkx, M., Peeters, W.: Philips Electronics synchronizes its supply chain to end the bullwhip effect. Interfaces 35(1), 37–48 (2005)
20. Hill, T., Marquez, L., O'Connor, M., Remus, W.: Artificial neural network models for forecasting and decision making. Int. J. Forecast. 10(1), 5–15 (1994)
21. Dong, X., Wen, G.: An improved neural networks prediction model and its application in supply chain. Nature Sci. 4(3), 23–27 (2006)
22. Wang, X., Disney, S.M.: The bullwhip effect: progress, trends and directions. Eur. J. Oper. Res. 250(3), 691–701 (2016)
23. Zhao, X., Xie, J., Wei, J.C.: The impact of forecast errors on early order commitment in a supply chain. Decis. Sci. 33(2), 251–280 (2002)
24. Smyth, H.: Procurement push and marketing pull in supply chain management: the conceptual contribution of relationship marketing as a driver in project financial performance. J. Finan. Manage. Property Constr. 10(1), 33–44 (2005)
25. Jaipuria, S., Mahapatra, S.S.: An improved demand forecasting method to reduce bullwhip effect in supply chains. Expert Syst. Appl. 41(5), 2395–2408 (2014)

An Intelligent System on Computer-Aided Diagnosis for Parkinson's Disease with MRI Using Machine Learning

J. Naren1(B), Praveena Ramalingam1(B), U. Raja Rajeswari1(B), P. Vijayalakshmi1(B), and G. Vithya2(B)

1 SASTRA Deemed University, Thirumalaisamudram, Thanjavur, Tamil Nadu, India
[email protected], {120003229,120003241,120003363}@sastra.ac.in
2 KL University, Vijayawada, AP, India
[email protected]

Abstract. Parkinson's disease (PD), a progressive neurological disorder, is predominantly caused by failing dopaminergic neurons of the midbrain. Dopamine is involved in sending messages to the parts of the brain that control coordination and movement. With the help of machine learning approaches, this work sets a base for an intelligent system that helps in the computer-aided diagnosis of PD patients. Machine learning is used for early diagnosis and prediction so that the disease can be treated more quickly. In medical science, it is evident that outputs from imaging devices can be incorporated to predict a disease better. This paper gives a brief synopsis of machine learning techniques which, along with MRI data, can yield faster prediction of PD. Keywords: Parkinson's disease · Machine learning · MRI

1 Introduction
Parkinson's disease is a progressive nervous disorder which leads to gradual degeneration or death of neurons. Prominent indications may develop gradually and start off with minute tremors in one's hand [1]. Those with Parkinson's disease may encounter a wide selection of motor and non-motor ailments, including tremor, bradykinesia, postural instability, affective disorders, and cognitive deficits [2]. PD is mainly caused by a lack of dopamine within the striatum secondary to the continuous degeneration of dopaminergic cells in the substantia nigra pars compacta, followed by the formation of Lewy bodies [2]. The stage and severity of PD are critical to estimate for taking effective decisions associated with treatment. PD may be managed effectively by beginning the treatment with dopamine agonists such as ropinirole alone, rather than starting with levodopa (which is the most potent and powerful medication in PD) [3, 4]. The proposed methodology for finding the severity of PD is classified based on different regions of the brain and the gene mutations which are eventual causes of the disease.


1.1 GBA
Mutations in the glucocerebrosidase (GBA) gene, which encodes a lysosomal biomolecule, are among the most important and common risk factors for Parkinson's disease. The clinical similarity with idiopathic PD, and the chance to discover PD at a pre-clinical stage, present a unique possibility to research therapeutic alternatives for early PD, before important irreversible neurodegeneration takes place [5].

1.2 LRRK2
Mutations in the LRRK2 gene are the principal factor contributing to the genetic development of Parkinson's disease, and over one hundred mutations in this gene have been shown to increase the risk of PD development [6].

1.3 SNCA
Alpha-synuclein (SNCA) is a protein that, in human beings, is encoded by the SNCA gene. It is abundant in the brain, while smaller amounts are located in the heart, muscle tissue, and other tissues. Alpha-synuclein is found mainly at the tips of nerve cells, in specialized structures known as presynaptic terminals. Within these structures, alpha-synuclein interacts with phospholipids and proteins [19].

2 Related Works
Early disease identification and diagnosis from symptoms is of foremost significance in the current research area. Several vaccines that treat cancers are in trial stages with the help of machine learning (Medicines in Development for Cancer, 2015). Predictions using supervised learning techniques allow GPs to select a restricted set of treatments or personalized treatments, and machine learning is being used for finding and developing drugs with prospective uses (Machine Learning in the Pharmaceutical Industry, 2016). Vocal fundamental frequency statistics yield high accuracy; accuracy improves when all classes of dysphonia measurements are employed, and refinement of the dysphonia measurements allows achieving the very best accuracy in Parkinson's disease detection based on dysphonia measurements (Salim Lahmiri, 2016). Deep learning with EEG spectrograms in rapid eye movement behavior disorder has been projected, with deep learning methods for diagnosis/prognosis derived from a couple of minutes of eyes-closed resting electroencephalography (EEG) data collected from idiopathic RBD patients (n = 121) and healthy controls (HC, n = 91) (Ruffini, 2018). Supervised machine learning has been proposed as an innovative method for identifying sensitive medical image biomarkers (or combinations of them), allowing for automatic analysis of individual subjects (machine learning on brain MRI data for differential diagnosis of Parkinson's disease; C. Salvatore, 2014). Systematic information enclosed in microarray data encodes relevant clues to overcome the poorly understood combination of genetic and environmental factors in Parkinson's disease (PD), which represents the most important impediment to recognizing its pathogenesis and to developing disease-modifying therapeutics; a methodical and genetically apt consensus approach for Parkinson's disease gene ranking has been recommended (Efficient and biologically applicable consensus strategy for Parkinson's disease gene prioritization, 2016).

3 Architecture
See Fig. 1.

Fig. 1. Block diagram of computer-aided diagnosis for PD (MRI dataset → data processing → classification algorithm → classified output)

4 Materials and Methods

4.1 Dataset
The data used in the preparation of this article were acquired from the Parkinson's Progression Markers Initiative (PPMI) database and processed with the aid of WEKA. For up-to-date information on the study, please visit https://www.ppmi-info.org/. PPMI is a landmark study that contains a wide variety of PD progression biomarkers. The dataset used in this paper includes both healthy control (HC) and Parkinson's disease (PD) subjects: 678 observations from HC subjects and 1066 observations from PD subjects, making a total of 1744 observations. Table 1 shows the number of subjects in the healthy control (HC) and Parkinson's disease (PD) groups, and Table 2 explains the causes of Parkinson's disease in the patients.

Table 1. Number of subjects at enrollment for HC and PD

| HC | PD |
|---|---|
| 678 | 1066 |

Table 2. Causes of PD in the subjects at enrollment

| Gene related | Number of subjects | General | Number of subjects |
|---|---|---|---|
| LRRK2+ | 686 | HYP | 119 |
| GBA+ | 428 | RBD | 96 |
| SNCA+ | 41 | | |

4.2 Data Pre-processing
The dataset obtained from the PPMI repository is already preprocessed: the MRI images have been processed, and the resulting numerical dataset is used for the implementation.

5 Feature Classification
Feature classification is the process of manually selecting the features which have more influence on the target variable; in other words, the related features are identified from the dataset by removing irrelevant features which do not contribute much to the outcome. The classifiers used in this study are briefly explained as follows.

5.1 AdaBoost
AdaBoost trains a model by adding decision trees sequentially, where each new tree added to the ensemble is focused on the samples misclassified earlier. The model predicts by calculating a weighted average over the predictions of all trees in the ensemble [8]. The algorithm works to eradicate the existing weaknesses of the weak classifiers: after training each classifier, the algorithm assigns it a weight based on its accuracy, and a classifier with higher weight has more impact on the result [7].

5.2 Naïve Bayes
Naïve Bayes is a probabilistic method used for constructing classifiers; the NB algorithm makes an assumption of conditional independence over the training dataset [9]. It produces good results in complex real-world situations. The algorithm needs only a small amount of training data to estimate the parameters for classification, and such a model can be trained incrementally. By Bayes' theorem, the conditional probability is given by

P(Ci | X) = P(X | Ci) P(Ci) / P(X)

where P(Ci | X) is the posterior probability, P(X) the predictor prior probability, P(Ci) the class prior probability, and P(X | Ci) the likelihood [10].


5.3 Random Forest
Random forest is an ensemble learning approach used for classification, regression, and other tasks. The model constructs multiple decision trees and combines them to give a better prediction [11, 12]. As the trees grow, the algorithm adds more randomness to the model: the best features are chosen from a random subset of features, and the trees can be made even more random by using random thresholds for each feature. It runs effectively on larger inputs and helps in handling missing values [13].

5.4 Logistic Regression
Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine the outcome. The mathematical function used here is the logit function, the logarithm of the odds ratio. The probability of occurrence of an event can be predicted by fitting the data to a logit function. The relationship between one dependent (binary) variable and one or more nominal or ordinal variables can be explained clearly using this algorithm [13, 14].

5.5 Decision Trees
The decision tree is another classification algorithm [15, 16]. Its uniqueness is that it represents rules which can easily be stated and understood by humans. Such trees form a tree structure wherein each node is either a leaf or a decision node having two or more branches, each leading to a subtree. The model learns decision rules which are used to predict the class of the target variable; the key point is that a tree is split on the attribute with the maximum information gain [17].

5.6 Support Vector Machines
SVM is used when both class labels and features are available in the dataset; it builds a model to predict the classes of new cases. There are two types of classifiers, linear SVM and non-linear SVM. In the linear model, the points are separated by a hyperplane; while drawing the hyperplane, it is necessary to maximize the separation from the hyperplane to the closest data points of either class, known as support vectors. To separate data that is not linearly separable into its classes, a non-linear SVM classifier is applied by using kernels with the hyperplanes [18].
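The paper runs these classifiers in WEKA; a minimal scikit-learn sketch of the same comparison is given below, where the feature matrix `X` and labels `y` stand for the preprocessed PPMI data (not distributed with this text) and the 10-fold setup is our assumption:

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

CLASSIFIERS = {
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "AdaBoost": AdaBoostClassifier(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(),
    "Random forest": RandomForestClassifier(),
}

def compare(X, y):
    """Estimate the accuracy of each classifier, as in Table 3."""
    for name, clf in CLASSIFIERS.items():
        scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
        print(f"{name}: {scores.mean():.4f}")
```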

6 Implementation
The above-mentioned classifiers are applied to the dataset using WEKA, which is an assortment of visualization tools and machine learning algorithms for data analysis. It is used to perform data mining tasks like classification, regression, clustering, association rule mining, and visualization. The results are tabulated in Table 3.

Table 3. The results of classification

| Classifiers | Precision | Recall | Accuracy |
|---|---|---|---|
| Naïve Bayes | 98.2% | 98.2% | 98.18% |
| SVM | 98.9% | 98.9% | 98.90% |
| AdaBoost | 67.8% | 50.5% | 50.45% |
| Logistic regression | 98.1% | 98.1% | 98.13% |
| Decision tree | 98.3% | 98.3% | 98.27% |
| Random forest | 98.4% | 98.4% | 98.40% |

Fig. 2. Comparison between different classifiers and accuracy

7 Result
In this paper, datasets from PPMI (MRI) for predicting the severity of Parkinson's disease were run through the different classifiers. From Fig. 2, the observation shows that SVM produced the best result among all (98.90%) and gives a basic understanding of PD diagnosis. This study plays a vital role in deriving an analysis that compares the various machine learning algorithms.

References
1. Miljkovic, D., Aleksovski, D.: Machine Learning and Data Mining Methods for Managing Parkinson's Disease. Springer, Cham (2016)
2. Gao, C., Sun, H.: Model-based and model-free machine learning techniques for diagnostic prediction and classification of clinical outcomes in Parkinson's disease. Sci. Rep. 8(1), 7129 (2018)
3. Castrioto, A., Volkmann, J., Krack, P.: Postoperative management of deep brain stimulation in Parkinson's disease. In: Handbook of Clinical Neurology, vol. 116, pp. 129–146 (2013)
4. Prashanth, R., Roy, S.D.: Novel and improved stage estimation in Parkinson's disease using clinical scales and machine learning. Neurocomputing 305, 78–103 (2018)
5. Brockmann, K., Berg, D.: The significance of GBA for Parkinson's disease. J. Inherit. Metab. Dis. 37(4), 643–648 (2014)
6. Li, J.-Q., Tan, L., Yu, J.-T.: The role of the LRRK2 gene in Parkinsonism. Mol. Neurodegener. 9(1), 47 (2014)
7. Mandal, I., Sairam, N.: New machine-learning algorithms for prediction of Parkinson's disease. Int. J. Syst. Sci. 45(3), 647–666 (2012)
8. Camps, J., Sama, A., Martin, M.: Deep learning for freezing of gait detection in Parkinson's disease patients in their homes using a waist-worn inertial measurement unit. Knowl.-Based Syst. 139, 119–131 (2017)
9. Jain, D., Singh, V.: Feature selection and classification systems for chronic disease prediction: a review. Egypt. Inf. J. 19(3), 179–189 (2018)
10. Cigdem, O., Demirela, H.: Performance analysis of different classification algorithms using different feature selection methods on Parkinson's disease detection. J. Neurosci. Methods 309, 81–90 (2018)
11. Ho, T.K.: Random decision forests. In: Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995, pp. 278–282 (1995)
12. Ho, T.K.: The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20(8), 832–844 (1998). https://doi.org/10.1109/34.709601
13. Challa, K.N.R., Pagolu, V.S., Panda, G.: An improved approach for prediction of Parkinson's disease using machine learning techniques (2016)
14. Peng, J., Lee, K.L., Ingersoll, G.M.: An introduction to logistic regression analysis and reporting. Indiana University-Bloomington (2002)
15. Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth International Group, CA (1984)
16. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco (1993)
17. Halawani, S.M., Ahmad, A.: Ensemble methods for prediction of Parkinson disease. Springer, Heidelberg (2012)
18. Muthumanickam, S., Gayathri, J., Eunice Daphne, J.: Parkinson's disease detection and classification using machine learning and deep learning algorithms: a survey (2018)
19. Siddiqui, I.J., Pervaiz, N., Abbasi, A.A.: The Parkinson disease gene SNCA: evolutionary and structural insights with pathological implication. Sci. Rep. 6, 24475 (2016)

Multi-Criteria Decision Making Approaches

Operations on Picture Fuzzy Numbers and Their Application in Multi-criteria Group Decision Making Problems

Palash Dutta1(B), Rajdeep Bora1,2, and Satya Ranjan Dash3

1 Department of Mathematics, Dibrugarh University, Dibrugarh 786004, India
[email protected], [email protected]
2 Department of Mathematics, Kakoan College, Jorhat 785107, India
3 School of Computer Applications, KIIT Deemed to be University, Bhubaneswar, India
[email protected]

Abstract. Uncertainty is an unavoidable component of our life, and fuzzy set theory (FST) is generally explored to deal with it. However, in some complex situations FST is not capable of playing the crucial role. In such situations the picture fuzzy set (PFS) comes into the picture, which is the direct extension of FST and the intuitionistic fuzzy set (IFS). Although different studies on FST and IFS have been carried out, including their algebraic structure, these studies are found to be inappropriate for dealing with picture fuzzy situations. In this regard, the present paper presents the basic arithmetic operations on PFSs along with numerical examples. Finally, an application of PFSs in multi-criteria group decision making is performed through a case study.

Keywords: Intuitionistic fuzzy set · Picture fuzzy set · Triangular picture fuzzy set

Mathematics Subject Classification: 03E72

1 Introduction
Fuzzy set theory (FST) was developed by Zadeh [1] in the year 1965 to deal with uncertain environments in solving various real-world problems. FST provides an appropriate framework for representing vague concepts by allowing partial membership. After that, the intuitionistic fuzzy set (IFS) was proposed by Atanassov [2] in 1986. An IFS has the characteristic of comprising two functions expressing the degree of membership (belongingness) and the degree of non-membership (non-belongingness) of the elements of the universe to the IFS, where the sum of these two degrees must not exceed one. It is observed that one novel idea, viz. the degree of neutrality, is lacking in IFS. The degree of neutrality arises in environments where human intuition is involved and answers of the types certainly, refrain, denial, and rebuttal are needed. For instance, in a democratic ballot vote there are, say, four options, viz. "vote for", "abstain", "vote against", and "refusal of voting", for the voters to choose from. The existing FST and IFS theories cannot deal with these types of situations, and to cope with them Cuong and co-authors [3] introduced the PFS in 2013, which is the direct extension of FST and IFS. Phong and co-authors [5] studied some compositions of picture fuzzy relations. Cuong and Hai [6] investigated the main fuzzy logic operators (negations, conjunctions, disjunctions, and implications) on picture fuzzy sets and also constructed the main operations for fuzzy inference processes in picture fuzzy systems. Cuong and co-workers [7] presented properties of an involutive picture negator and some corresponding De Morgan fuzzy triples on picture fuzzy sets. Viet and co-authors [8] presented a picture fuzzy inference system based on membership graphs, and Singh [9] studied correlation coefficients of PFSs. Cuong and colleagues [10] investigated the classification of representable picture t-norm and picture t-conorm operators for picture fuzzy sets. Son [11] proposed a new distance measure between PFSs and applied it in fuzzy clustering, and Son [12] extended basic distance measures on PFSs and examined some of their properties. Son, Viet and Hai [13] proposed a fuzzy inference system on PFSs. Peng and Dai [14] proposed an algorithm for PFSs and applied it in decision making based on a new distance measure, and Garg [15] studied some picture fuzzy aggregation operations and their applications to multi-criteria decision making. Dutta et al. [4] defined the (α, δ, β)-cut and strong (α, δ, β)-cut of a PFS, the extension principle of PFSs, and basic arithmetic, stated properties of these defined terms, and proved decomposition theorems for PFSs. Similarity measures are generally used for determining the degree of similarity between two objects. Kaufman and Rousseeuw [16] presented some examples to illustrate traditional similarity measure applications in hierarchical cluster analysis. Different similarity measures have been proposed in the literature [17–19]. A nice account of similarity measures of IFSs can be found in [20–26], and Wei [27] presented some processes to measure similarity between PFSs. The Jaccard index [28] is a statistic used for comparing the similarity and diversity of sample sets; based on this index, Hwang et al. [29] derived a new formula for the similarity measure of IFSs. In this paper, the basic arithmetic operations on PFSs are studied along with numerical examples. Finally, an application of PFSs in multi-criteria group decision making is studied through a hypothetical example.

2 Preliminaries

2.1 Fuzzy Set
Let X be a universe of discourse and A ⊂ X. Let μA(x) : X → [0, 1] be a function which assigns a real number in the interval [0, 1] to each element x in A, where the value μA(x) shows the grade of membership of x in A. Then the fuzzy set A is the set A = {(x, μA(x)) : x ∈ X}.

2.2 Intuitionistic Fuzzy Set
An intuitionistic fuzzy set A on a universe of discourse X is of the form A = {(x, μA(x), νA(x)) : x ∈ X}, where μA(x) ∈ [0, 1] is called the degree of membership of x in A and νA(x) ∈ [0, 1] is called the degree of non-membership of x in A, and where μA(x) and νA(x) satisfy the condition 0 ≤ μA(x) + νA(x) ≤ 1. The amount πA(x) = 1 − (μA(x) + νA(x)) is called the hesitancy of x, which is a reflection of the lack of commitment or uncertainty associated with the membership or non-membership or both in A.

2.3 Picture Fuzzy Set
A picture fuzzy set (PFS) A on a universe X is an object of the form A = {(x, μA(x), ηA(x), νA(x)) : x ∈ X}, where μA(x) ∈ [0, 1] is called the degree of positive membership (PM) of x in A, ηA(x) ∈ [0, 1] is called the degree of neutral membership (NeuM) of x in A, and νA(x) ∈ [0, 1] is called the degree of negative membership (NM) of x in A. Here μA(x), ηA(x), νA(x) must satisfy the condition μA(x) + ηA(x) + νA(x) ≤ 1 for all x ∈ X. Then, for all x ∈ X, 1 − (μA(x) + ηA(x) + νA(x)) is called the degree of refusal membership of x in A.

2.4 α-cut of Picture Fuzzy Set
Let A be a picture fuzzy set on a universe X. Then the α-cut of A is the crisp subset Cα(A) of A given by Cα(A) = {x : x ∈ X and μA(x) ≥ α, ηA(x) ≤ α, νA(x) ≤ α}, where α ∈ [0, 1] with 3α ≤ 1. That is, αA+ = {x ∈ X : μA(x) ≥ α}, αA± = {x ∈ X : ηA(x) ≤ α} and αA− = {x ∈ X : νA(x) ≤ α} are the α-cuts of the PM, NeuM and NM respectively of a PFS A, where A+, A± and A− indicate the positive membership function (PM), the neutral membership function (NeuM) and the negative membership function (NM) respectively.

2.5 Triangular PFSs
A triangular PFS is denoted by A = [p1, q1, r1; α1], [p1′, q1, r1′; δ1], [p1″, q1, r1″; β1] and defined by

$$\mu_A(x) = \begin{cases} \alpha_1\frac{x-p_1}{q_1-p_1}, & p_1 \le x \le q_1 \\ \alpha_1\frac{r_1-x}{r_1-q_1}, & q_1 \le x \le r_1 \\ 0, & \text{otherwise} \end{cases}$$

$$\eta_A(x) = \begin{cases} \frac{(q_1-x)+(x-p_1')\delta_1}{q_1-p_1'}, & p_1' \le x \le q_1 \\ \frac{(x-q_1)+(r_1'-x)\delta_1}{r_1'-q_1}, & q_1 \le x \le r_1' \\ 1, & \text{otherwise} \end{cases}$$

$$\nu_A(x) = \begin{cases} \frac{(q_1-x)+(x-p_1'')\beta_1}{q_1-p_1''}, & p_1'' \le x \le q_1 \\ \frac{(x-q_1)+(r_1''-x)\beta_1}{r_1''-q_1}, & q_1 \le x \le r_1'' \\ 1, & \text{otherwise} \end{cases}$$


A graphical representation of a PFN [4, 10, 16; 0.5], [3, 10, 17; 0.1], [5, 10, 15; 0.3] is presented in Fig. 1.

Fig. 1. PFN {[4,10,16;0.5],[3,10,17;0.1],[5,10,15;0.3]}
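As an illustration, a minimal Python sketch that evaluates the three membership functions of the PFN shown in Fig. 1; the function and variable names are ours, not part of the paper:

```python
def triangle_up(x, p, q, r, h):
    """PM function: rises from 0 at p to height h at q, falls back to 0 at r."""
    if p <= x <= q:
        return h * (x - p) / (q - p)
    if q < x <= r:
        return h * (r - x) / (r - q)
    return 0.0

def triangle_down(x, p, q, r, d):
    """NeuM/NM function: falls from 1 at p to depth d at q, rises back to 1 at r."""
    if p <= x <= q:
        return ((q - x) + (x - p) * d) / (q - p)
    if q < x <= r:
        return ((x - q) + (r - x) * d) / (r - q)
    return 1.0

# The PFN of Fig. 1: {[4,10,16;0.5], [3,10,17;0.1], [5,10,15;0.3]}
x = 10
mu  = triangle_up(x, 4, 10, 16, 0.5)    # -> 0.5
eta = triangle_down(x, 3, 10, 17, 0.1)  # -> 0.1
nu  = triangle_down(x, 5, 10, 15, 0.3)  # -> 0.3
```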

3 Arithmetic Operations of Triangular PFSs and Their Examples

In this segment, an attempt has been made to perform all the basic arithmetic operations between picture fuzzy numbers. The following four arithmetic operations on closed intervals are adopted for the basic arithmetic operations:

[a, b] + [c, d] = [a + c, b + d]
[a, b] − [c, d] = [a − d, b − c]
[a, b] × [c, d] = [ac, bd]
[a, b] ÷ [c, d] = [a/d, b/c]

Let us consider two triangular PFSs A = [p1, q1, r1; β], [p1′, q1, r1′; δ], [p1″, q1, r1″; γ] and B = [p2, q2, r2; β], [p2′, q2, r2′; δ], [p2″, q2, r2″; γ] given by

$$\mu_A(x) = \begin{cases} \beta\frac{x-p_1}{q_1-p_1}, & p_1 \le x \le q_1 \\ \beta\frac{r_1-x}{r_1-q_1}, & q_1 \le x \le r_1 \\ 0, & \text{otherwise} \end{cases} \qquad \mu_B(x) = \begin{cases} \beta\frac{x-p_2}{q_2-p_2}, & p_2 \le x \le q_2 \\ \beta\frac{r_2-x}{r_2-q_2}, & q_2 \le x \le r_2 \\ 0, & \text{otherwise} \end{cases}$$

$$\eta_A(x) = \begin{cases} \frac{(q_1-x)+(x-p_1')\delta}{q_1-p_1'}, & p_1' \le x \le q_1 \\ \frac{(x-q_1)+(r_1'-x)\delta}{r_1'-q_1}, & q_1 \le x \le r_1' \\ 1, & \text{otherwise} \end{cases} \qquad \eta_B(x) = \begin{cases} \frac{(q_2-x)+(x-p_2')\delta}{q_2-p_2'}, & p_2' \le x \le q_2 \\ \frac{(x-q_2)+(r_2'-x)\delta}{r_2'-q_2}, & q_2 \le x \le r_2' \\ 1, & \text{otherwise} \end{cases}$$

$$\nu_A(x) = \begin{cases} \frac{(q_1-x)+(x-p_1'')\gamma}{q_1-p_1''}, & p_1'' \le x \le q_1 \\ \frac{(x-q_1)+(r_1''-x)\gamma}{r_1''-q_1}, & q_1 \le x \le r_1'' \\ 1, & \text{otherwise} \end{cases} \qquad \nu_B(x) = \begin{cases} \frac{(q_2-x)+(x-p_2'')\gamma}{q_2-p_2''}, & p_2'' \le x \le q_2 \\ \frac{(x-q_2)+(r_2''-x)\gamma}{r_2''-q_2}, & q_2 \le x \le r_2'' \\ 1, & \text{otherwise} \end{cases}$$
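Before the formal result below, note that for triangular PFNs of equal heights the α-cut arithmetic that follows reduces, for addition, to component-wise addition of the defining parameters (this is exactly what Theorem 1 establishes); a minimal sketch with our own naming:

```python
def add_tpfn(A, B):
    """Add two triangular PFNs, each given as three (p, q, r) triples
    (PM, NeuM, NM components); heights are assumed equal and unchanged."""
    return tuple(
        (pa + pb, qa + qb, ra + rb)
        for (pa, qa, ra), (pb, qb, rb) in zip(A, B)
    )

# e.g. add_tpfn(((4, 10, 16), (3, 10, 17), (5, 10, 15)),
#               ((2, 5, 8), (1, 5, 9), (3, 5, 7)))
```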

The α-cuts of A are

\[
{}^{\alpha}A^{+}=\Big[p_1+\frac{\alpha(q_1-p_1)}{\beta},\; r_1-\frac{\alpha(r_1-q_1)}{\beta}\Big],\ \alpha\in[0,\beta],
\]
\[
{}^{\alpha}A^{\pm}=\Big[\frac{(q_1'-p_1'\delta)-\alpha(q_1'-p_1')}{1-\delta},\;\frac{\alpha(r_1'-q_1')-(r_1'\delta-q_1')}{1-\delta}\Big],\ \alpha\in[\delta,1],
\]
\[
{}^{\alpha}A^{-}=\Big[\frac{(q_1''-p_1''\gamma)-\alpha(q_1''-p_1'')}{1-\gamma},\;\frac{\alpha(r_1''-q_1'')-(r_1''\gamma-q_1'')}{1-\gamma}\Big],\ \alpha\in[\gamma,1],
\]

and the α-cuts of B are

\[
{}^{\alpha}B^{+}=\Big[p_2+\frac{\alpha(q_2-p_2)}{\beta},\; r_2-\frac{\alpha(r_2-q_2)}{\beta}\Big],\ \alpha\in[0,\beta],
\]
\[
{}^{\alpha}B^{\pm}=\Big[\frac{(q_2'-p_2'\delta)-\alpha(q_2'-p_2')}{1-\delta},\;\frac{\alpha(r_2'-q_2')-(r_2'\delta-q_2')}{1-\delta}\Big],\ \alpha\in[\delta,1],
\]
\[
{}^{\alpha}B^{-}=\Big[\frac{(q_2''-p_2''\gamma)-\alpha(q_2''-p_2'')}{1-\gamma},\;\frac{\alpha(r_2''-q_2'')-(r_2''\gamma-q_2'')}{1-\gamma}\Big],\ \alpha\in[\gamma,1].
\]

Theorem 1. Addition of two triangular PFSs with the same height produces a triangular PFS.

Proof: For the PM function,
\[
{}^{\alpha}A^{+}+{}^{\alpha}B^{+}=\Big[p_1+p_2+\frac{\alpha}{\beta}\{q_1-p_1+q_2-p_2\},\; r_1+r_2-\frac{\alpha}{\beta}\{r_1-q_1+r_2-q_2\}\Big],\ \alpha\in[0,\beta]. \tag{2.1}
\]
To find the PM function μ_{A+B}(x) we equate x with the first and the second components of (2.1), which gives
\[
x=p_1+p_2+\frac{\alpha}{\beta}\{q_1-p_1+q_2-p_2\}\quad\text{and}\quad x=r_1+r_2-\frac{\alpha}{\beta}\{r_1-q_1+r_2-q_2\}.
\]
Now we have
\[
\alpha=\beta\,\frac{x-(p_1+p_2)}{q_1-p_1+q_2-p_2} \tag{2.2}
\]
\[
\alpha=\beta\,\frac{r_1+r_2-x}{r_1-q_1+r_2-q_2} \tag{2.3}
\]
Setting 0 ≤ α ≤ β in (2.2) and in (2.3), we get the domains x ∈ [p₁+p₂, q₁+q₂] and x ∈ [q₁+q₂, r₁+r₂]. Hence
\[
\mu_{A+B}(x)=\begin{cases}\beta\,\dfrac{x-(p_1+p_2)}{q_1-p_1+q_2-p_2}, & x\in[p_1+p_2,\,q_1+q_2]\\ \beta\,\dfrac{r_1+r_2-x}{r_1-q_1+r_2-q_2}, & x\in[q_1+q_2,\,r_1+r_2]\\ 0, & \text{otherwise}\end{cases}
\]
For the NeuM function,
\[
{}^{\alpha}A^{\pm}+{}^{\alpha}B^{\pm}=\Big[\frac{\{(q_1'-p_1'\delta)+(q_2'-p_2'\delta)\}-\alpha\{q_1'-p_1'+q_2'-p_2'\}}{1-\delta},\;\frac{\alpha\{r_1'-q_1'+r_2'-q_2'\}-\{(r_1'\delta-q_1')+(r_2'\delta-q_2')\}}{1-\delta}\Big],\ \alpha\in[\delta,1]. \tag{2.4}
\]
Equating each component of the RHS of (2.4) with x and expressing α in terms of x,
\[
\alpha=\frac{(q_1'-p_1'\delta)+(q_2'-p_2'\delta)-x(1-\delta)}{q_1'-p_1'+q_2'-p_2'} \tag{2.5}
\]
\[
\alpha=\frac{x(1-\delta)+\{r_1'\delta-q_1'+r_2'\delta-q_2'\}}{r_1'-q_1'+r_2'-q_2'} \tag{2.6}
\]
Putting α ≤ 1 in (2.5) gives q₁′+q₂′−p₁′δ−p₂′δ−x(1−δ) ≤ q₁′−p₁′+q₂′−p₂′, i.e. x(1−δ) ≥ (1−δ)(p₁′+p₂′), so x ≥ p₁′+p₂′; putting α ≥ δ in (2.5) gives (q₁′−p₁′δ)+(q₂′−p₂′δ)−x(1−δ) ≥ δ(q₁′−p₁′+q₂′−p₂′), i.e. (1−δ)x ≤ (1−δ)(q₁′+q₂′), so x ≤ q₁′+q₂′. That is, x ∈ [p₁′+p₂′, q₁′+q₂′] for (2.5). Next, putting α ≥ δ in (2.6) gives x(1−δ) ≥ (1−δ)(q₁′+q₂′), so x ≥ q₁′+q₂′, and α ≤ 1 gives x(1−δ) ≤ (1−δ)(r₁′+r₂′), so x ≤ r₁′+r₂′. That is, x ∈ [q₁′+q₂′, r₁′+r₂′] for (2.6). Hence the NeuM function is
\[
\eta_{A+B}(x)=\begin{cases}\dfrac{(q_1'-p_1'\delta)+(q_2'-p_2'\delta)-x(1-\delta)}{q_1'-p_1'+q_2'-p_2'}, & x\in[p_1'+p_2',\,q_1'+q_2']\\ \dfrac{x(1-\delta)+\{r_1'\delta-q_1'+r_2'\delta-q_2'\}}{r_1'-q_1'+r_2'-q_2'}, & x\in[q_1'+q_2',\,r_1'+r_2']\\ 1, & \text{otherwise}\end{cases}
\]
For the NM function,
\[
{}^{\alpha}A^{-}+{}^{\alpha}B^{-}=\Big[\frac{\{(q_1''-p_1''\gamma)+(q_2''-p_2''\gamma)\}-\alpha\{q_1''-p_1''+q_2''-p_2''\}}{1-\gamma},\;\frac{\alpha\{r_1''-q_1''+r_2''-q_2''\}-\{(r_1''\gamma-q_1'')+(r_2''\gamma-q_2'')\}}{1-\gamma}\Big],\ \alpha\in[\gamma,1]. \tag{2.7}
\]
Equating each component of the RHS of (2.7) with x and expressing α in terms of x,
\[
\alpha=\frac{(q_1''-p_1''\gamma)+(q_2''-p_2''\gamma)-x(1-\gamma)}{q_1''-p_1''+q_2''-p_2''} \tag{2.8}
\]
\[
\alpha=\frac{x(1-\gamma)+\{r_1''\gamma-q_1''+r_2''\gamma-q_2''\}}{r_1''-q_1''+r_2''-q_2''} \tag{2.9}
\]
Proceeding exactly as for (2.5) and (2.6), the constraints γ ≤ α ≤ 1 give x ∈ [p₁″+p₂″, q₁″+q₂″] for (2.8) and x ∈ [q₁″+q₂″, r₁″+r₂″] for (2.9). Hence the NM function is
\[
\nu_{A+B}(x)=\begin{cases}\dfrac{(q_1''-p_1''\gamma)+(q_2''-p_2''\gamma)-x(1-\gamma)}{q_1''-p_1''+q_2''-p_2''}, & x\in[p_1''+p_2'',\,q_1''+q_2'']\\ \dfrac{x(1-\gamma)+\{r_1''\gamma-q_1''+r_2''\gamma-q_2''\}}{r_1''-q_1''+r_2''-q_2''}, & x\in[q_1''+q_2'',\,r_1''+r_2'']\\ 1, & \text{otherwise}\end{cases}
\]

Thus we find that A + B = ⟨[l₁, l₂, l₃; β], [m₁, m₂, m₃; δ], [n₁, n₂, n₃; γ]⟩, where l₁ = p₁+p₂, l₂ = q₁+q₂, l₃ = r₁+r₂; m₁ = p₁′+p₂′, m₂ = q₁′+q₂′, m₃ = r₁′+r₂′; n₁ = p₁″+p₂″, n₂ = q₁″+q₂″, n₃ = r₁″+r₂″, which is clearly a triangular PFS with heights of the PM, NeuM and NM functions β, δ and γ respectively.

Theorem 2. Subtraction of two triangular PFSs with the same height produces a triangular PFS.

Proof: In a similar fashion the PM, NeuM and NM functions can be evaluated. The required PM function is
\[
\mu_{A-B}(x)=\begin{cases}\beta\,\dfrac{x-(p_1-r_2)}{q_1-p_1+r_2-q_2}, & x\in[p_1-r_2,\,q_1-q_2]\\ \beta\,\dfrac{(r_1-p_2)-x}{r_1-q_1+q_2-p_2}, & x\in[q_1-q_2,\,r_1-p_2]\\ 0, & \text{otherwise}\end{cases}
\]
The NeuM function is
\[
\eta_{A-B}(x)=\begin{cases}\dfrac{\{q_1'-p_1'\delta+r_2'\delta-q_2'\}-(1-\delta)x}{q_1'-p_1'+r_2'-q_2'}, & x\in[p_1'-r_2',\,q_1'-q_2']\\ \dfrac{(1-\delta)x+\{q_2'-p_2'\delta+r_1'\delta-q_1'\}}{q_2'-p_2'+r_1'-q_1'}, & x\in[q_1'-q_2',\,r_1'-p_2']\\ 1, & \text{otherwise}\end{cases}
\]
The NM function is
\[
\nu_{A-B}(x)=\begin{cases}\dfrac{\{q_1''-p_1''\gamma+r_2''\gamma-q_2''\}-(1-\gamma)x}{q_1''-p_1''+r_2''-q_2''}, & x\in[p_1''-r_2'',\,q_1''-q_2'']\\ \dfrac{(1-\gamma)x+\{q_2''-p_2''\gamma+r_1''\gamma-q_1''\}}{q_2''-p_2''+r_1''-q_1''}, & x\in[q_1''-q_2'',\,r_1''-p_2'']\\ 1, & \text{otherwise}\end{cases}
\]
Thus we find that A − B = ⟨[l₁, l₂, l₃; β], [m₁, m₂, m₃; δ], [n₁, n₂, n₃; γ]⟩, where l₁ = p₁−r₂, l₂ = q₁−q₂, l₃ = r₁−p₂; m₁ = p₁′−r₂′, m₂ = q₁′−q₂′, m₃ = r₁′−p₂′; n₁ = p₁″−r₂″, n₂ = q₁″−q₂″, n₃ = r₁″−p₂″, which is also clearly a triangular PFS with heights of the PM, NeuM and NM functions β, δ and γ respectively.

Theorem 3. Multiplication of two triangular PFSs with the same height produces a triangular PFS.


Proof: The PM, NeuM and NM functions of AB can be evaluated. The PM function is
\[
\mu_{AB}(x)=\begin{cases}\dfrac{-\beta\{p_1(q_2-p_2)+p_2(q_1-p_1)\}+\sqrt{\{p_1(q_2-p_2)+p_2(q_1-p_1)\}^2\beta^2-4(q_1-p_1)(q_2-p_2)(p_1p_2-x)\beta^2}}{2(q_1-p_1)(q_2-p_2)}, & x\in[p_1p_2,\,q_1q_2]\\[2mm] \dfrac{\beta\{r_1(r_2-q_2)+r_2(r_1-q_1)\}-\sqrt{\{r_1(r_2-q_2)+r_2(r_1-q_1)\}^2\beta^2-4(r_1-q_1)(r_2-q_2)(r_1r_2-x)\beta^2}}{2(r_1-q_1)(r_2-q_2)}, & x\in[q_1q_2,\,r_1r_2]\\[2mm] 0, & \text{otherwise}\end{cases}
\]
The NeuM function is
\[
\eta_{AB}(x)=\begin{cases}\dfrac{\{(q_1'-p_1')(q_2'-p_2'\delta)+(q_1'-p_1'\delta)(q_2'-p_2')\}-\sqrt{\{(q_1'-p_1')(q_2'-p_2'\delta)+(q_1'-p_1'\delta)(q_2'-p_2')\}^2-4(q_1'-p_1')(q_2'-p_2')\{(q_1'-p_1'\delta)(q_2'-p_2'\delta)-(1-\delta)^2x\}}}{2(q_1'-p_1')(q_2'-p_2')}, & x\in[p_1'p_2',\,q_1'q_2']\\[2mm] \dfrac{\{(r_1'-q_1')(r_2'\delta-q_2')+(r_1'\delta-q_1')(r_2'-q_2')\}+\sqrt{\{(r_1'-q_1')(r_2'\delta-q_2')+(r_1'\delta-q_1')(r_2'-q_2')\}^2-4(r_1'-q_1')(r_2'-q_2')\{(r_1'\delta-q_1')(r_2'\delta-q_2')-(1-\delta)^2x\}}}{2(r_1'-q_1')(r_2'-q_2')}, & x\in[q_1'q_2',\,r_1'r_2']\\[2mm] 1, & \text{otherwise}\end{cases}
\]
The NM function is
\[
\nu_{AB}(x)=\begin{cases}\dfrac{\{(q_1''-p_1'')(q_2''-p_2''\gamma)+(q_1''-p_1''\gamma)(q_2''-p_2'')\}-\sqrt{\{(q_1''-p_1'')(q_2''-p_2''\gamma)+(q_1''-p_1''\gamma)(q_2''-p_2'')\}^2-4(q_1''-p_1'')(q_2''-p_2'')\{(q_1''-p_1''\gamma)(q_2''-p_2''\gamma)-(1-\gamma)^2x\}}}{2(q_1''-p_1'')(q_2''-p_2'')}, & x\in[p_1''p_2'',\,q_1''q_2'']\\[2mm] \dfrac{\{(r_1''-q_1'')(r_2''\gamma-q_2'')+(r_1''\gamma-q_1'')(r_2''-q_2'')\}+\sqrt{\{(r_1''-q_1'')(r_2''\gamma-q_2'')+(r_1''\gamma-q_1'')(r_2''-q_2'')\}^2-4(r_1''-q_1'')(r_2''-q_2'')\{(r_1''\gamma-q_1'')(r_2''\gamma-q_2'')-(1-\gamma)^2x\}}}{2(r_1''-q_1'')(r_2''-q_2'')}, & x\in[q_1''q_2'',\,r_1''r_2'']\\[2mm] 1, & \text{otherwise}\end{cases}
\]

Thus AB = ⟨[l₁, l₂, l₃; β], [m₁, m₂, m₃; δ], [n₁, n₂, n₃; γ]⟩, where l₁ = p₁p₂, l₂ = q₁q₂, l₃ = r₁r₂; m₁ = p₁′p₂′, m₂ = q₁′q₂′, m₃ = r₁′r₂′; n₁ = p₁″p₂″, n₂ = q₁″q₂″, n₃ = r₁″r₂″, which is clearly a triangular PFS with heights of the PM, NeuM and NM functions β, δ and γ respectively.

Theorem 4. Division of two triangular PFSs with the same height produces a triangular PFS.

Proof: The PM, NeuM and NM functions of A/B can be evaluated. The PM function is
\[
\mu_{A/B}(x)=\begin{cases}\beta\,\dfrac{xr_2-p_1}{(q_1-p_1)+(r_2-q_2)x}, & x\in[p_1/r_2,\,q_1/q_2]\\ \beta\,\dfrac{r_1-xp_2}{x(q_2-p_2)+r_1-q_1}, & x\in[q_1/q_2,\,r_1/p_2]\\ 0, & \text{otherwise}\end{cases}
\]

The NeuM function is
\[
\eta_{A/B}(x)=\begin{cases}\dfrac{(q_1'-p_1'\delta)+(r_2'\delta-q_2')x}{(r_2'-q_2')x+(q_1'-p_1')}, & x\in[p_1'/r_2',\,q_1'/q_2']\\ \dfrac{(q_2'-p_2'\delta)x+(r_1'\delta-q_1')}{r_1'-q_1'+(q_2'-p_2')x}, & x\in[q_1'/q_2',\,r_1'/p_2']\\ 1, & \text{otherwise}\end{cases}
\]
The NM function is
\[
\nu_{A/B}(x)=\begin{cases}\dfrac{(q_1''-p_1''\gamma)+(r_2''\gamma-q_2'')x}{(r_2''-q_2'')x+(q_1''-p_1'')}, & x\in[p_1''/r_2'',\,q_1''/q_2'']\\ \dfrac{(q_2''-p_2''\gamma)x+(r_1''\gamma-q_1'')}{r_1''-q_1''+(q_2''-p_2'')x}, & x\in[q_1''/q_2'',\,r_1''/p_2'']\\ 1, & \text{otherwise}\end{cases}
\]

Thus A/B = ⟨[l₁, l₂, l₃; β], [m₁, m₂, m₃; δ], [n₁, n₂, n₃; γ]⟩, where l₁ = p₁/r₂, l₂ = q₁/q₂, l₃ = r₁/p₂; m₁ = p₁′/r₂′, m₂ = q₁′/q₂′, m₃ = r₁′/p₂′; n₁ = p₁″/r₂″, n₂ = q₁″/q₂″, n₃ = r₁″/p₂″, which is clearly a triangular PFS with heights of the PM, NeuM and NM functions β, δ and γ respectively.

4 Numerical Examples

Let A = ⟨[4, 10, 16; 0.5], [3, 10, 17; 0.2], [5, 10, 15; 0.1]⟩ and B = ⟨[1, 8, 15; 0.5], [3, 8, 13; 0.2], [5, 8, 11; 0.1]⟩ be two triangular PFSs. Then A + B = ⟨[5, 18, 31; 0.5], [6, 18, 30; 0.2], [10, 18, 26; 0.1]⟩, where
\[
\mu_{A+B}(x)=\begin{cases}0.5\,\frac{x-5}{13}, & x\in[5,18]\\ 0.5\,\frac{31-x}{13}, & x\in[18,31]\\ 0, & \text{otherwise}\end{cases}\qquad
\eta_{A+B}(x)=\begin{cases}\frac{16.8-0.8x}{12}, & x\in[6,18]\\ \frac{0.8x-12}{12}, & x\in[18,30]\\ 1, & \text{otherwise}\end{cases}\qquad
\nu_{A+B}(x)=\begin{cases}\frac{17-0.9x}{8}, & x\in[10,18]\\ \frac{0.9x-15.4}{8}, & x\in[18,26]\\ 1, & \text{otherwise}\end{cases}
\]
A − B = ⟨[−11, 2, 15; 0.5], [−10, 2, 14; 0.2], [−6, 2, 10; 0.1]⟩, where
\[
\mu_{A-B}(x)=\begin{cases}0.5\,\frac{x+11}{13}, & x\in[-11,2]\\ 0.5\,\frac{15-x}{13}, & x\in[2,15]\\ 0, & \text{otherwise}\end{cases}\qquad
\eta_{A-B}(x)=\begin{cases}\frac{4-0.8x}{12}, & x\in[-10,2]\\ \frac{0.8x+0.8}{12}, & x\in[2,14]\\ 1, & \text{otherwise}\end{cases}\qquad
\nu_{A-B}(x)=\begin{cases}\frac{2.6-0.9x}{8}, & x\in[-6,2]\\ \frac{0.9x-1}{8}, & x\in[2,10]\\ 1, & \text{otherwise}\end{cases}
\]
AB = ⟨[4, 80, 240; 0.5], [9, 80, 221; 0.2], [25, 80, 165; 0.1]⟩, where
\[
\mu_{AB}(x)=\begin{cases}\frac{-17+\sqrt{289-42(4-x)}}{84}, & x\in[4,80]\\ \frac{101-\sqrt{10201-42(240-x)}}{84}, & x\in[80,240]\\ 0, & \text{otherwise}\end{cases}\qquad
\eta_{AB}(x)=\begin{cases}\frac{98.8-\sqrt{9761.44-140(69.56-0.64x)}}{70}, & x\in[9,80]\\ \frac{-70.8+\sqrt{5012.64-140(35.64-0.64x)}}{70}, & x\in[80,221]\\ 1, & \text{otherwise}\end{cases}
\]
\[
\nu_{AB}(x)=\begin{cases}\frac{66-\sqrt{4356-60(71.25-0.81x)}}{30}, & x\in[25,80]\\ \frac{-60+\sqrt{3600-60(58.65-0.81x)}}{30}, & x\in[80,165]\\ 1, & \text{otherwise}\end{cases}
\]
A/B = ⟨[4/15, 10/8, 16; 0.5], [3/13, 10/8, 17/3; 0.2], [5/11, 10/8, 15/5; 0.1]⟩, where
\[
\mu_{A/B}(x)=\begin{cases}0.5\,\frac{15x-4}{7x+6}, & x\in[\frac{4}{15},\frac{10}{8}]\\ 0.5\,\frac{16-x}{7x+6}, & x\in[\frac{10}{8},16]\\ 0, & \text{otherwise}\end{cases}\qquad
\eta_{A/B}(x)=\begin{cases}\frac{9.4-5.4x}{5x+7}, & x\in[\frac{3}{13},\frac{10}{8}]\\ \frac{7.4x-6.6}{5x+7}, & x\in[\frac{10}{8},\frac{17}{3}]\\ 1, & \text{otherwise}\end{cases}\qquad
\nu_{A/B}(x)=\begin{cases}\frac{9.5-6.9x}{3x+5}, & x\in[\frac{5}{11},\frac{10}{8}]\\ \frac{7.5x-8.5}{3x+5}, & x\in[\frac{10}{8},\frac{15}{5}]\\ 1, & \text{otherwise}\end{cases}
\]
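Since all four operations act component-wise on the endpoint triples (Theorems 1–4), the supports of these examples can be reproduced mechanically. The sketch below is our own illustration (function names are ours, not the authors') and operates on one (p, q, r) triple per membership function:

```python
def pfs_add(a, b):
    # Component-wise sums of the (p, q, r) triples of each membership function.
    return [[x + y for x, y in zip(ta, tb)] for ta, tb in zip(a, b)]

def pfs_sub(a, b):
    # Interval rule [a,b] - [c,d] = [a-d, b-c] applied to each triple.
    return [[ta[0] - tb[2], ta[1] - tb[1], ta[2] - tb[0]] for ta, tb in zip(a, b)]

def pfs_mul(a, b):
    # Interval rule [a,b] x [c,d] = [ac, bd] applied to each triple.
    return [[ta[0] * tb[0], ta[1] * tb[1], ta[2] * tb[2]] for ta, tb in zip(a, b)]

def pfs_div(a, b):
    # Interval rule [a,b] / [c,d] = [a/d, b/c]; assumes b has positive support.
    return [[ta[0] / tb[2], ta[1] / tb[1], ta[2] / tb[0]] for ta, tb in zip(a, b)]

# A and B of this section; each PFS is [PM triple, NeuM triple, NM triple].
A = [[4, 10, 16], [3, 10, 17], [5, 10, 15]]
B = [[1, 8, 15], [3, 8, 13], [5, 8, 11]]
print(pfs_add(A, B))  # [[5, 18, 31], [6, 18, 30], [10, 18, 26]]
print(pfs_sub(A, B))  # [[-11, 2, 15], [-10, 2, 14], [-6, 2, 10]]
print(pfs_mul(A, B))  # [[4, 80, 240], [9, 80, 221], [25, 80, 165]]
```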

5 Ranking of Triangular PFSs Based on Value and Ambiguity

Li [30] introduced the concept of value and ambiguity of a GTIFN, and the same concept was put forward for GTrIFNs by De and Das [31]. In this section we extend the concept of value and ambiguity to PFSs.


Definition: Let A = ⟨[p₁, q₁, r₁; α₁], [p₁′, q₁′, r₁′; δ₁], [p₁″, q₁″, r₁″; β₁]⟩ be a triangular PFS and [L⁺, R⁺], [L±, R±], [L⁻, R⁻] be the α-cuts of the PM, NeuM and NM functions of A respectively. Then the values of the PM, NeuM and NM functions of the PFS A are defined as
\[
V_\mu(A)=\int_0^{\alpha_1}\frac{L^++R^+}{2}\,f(\alpha)\,d\alpha,\qquad V_\eta(A)=\int_{\delta_1}^{1}\frac{L^\pm+R^\pm}{2}\,g(\alpha)\,d\alpha,\qquad V_\nu(A)=\int_{\beta_1}^{1}\frac{L^-+R^-}{2}\,h(\alpha)\,d\alpha,
\]
respectively. The ambiguities of the PM, NeuM and NM functions of the PFS A are defined as
\[
A_\mu(A)=\int_0^{\alpha_1}(R^+-L^+)\,f(\alpha)\,d\alpha,\qquad A_\eta(A)=\int_{\delta_1}^{1}(R^\pm-L^\pm)\,g(\alpha)\,d\alpha,\qquad A_\nu(A)=\int_{\beta_1}^{1}(R^--L^-)\,h(\alpha)\,d\alpha,
\]
respectively, where f(α) is a non-negative and non-decreasing function on [0, α₁] with f(0) = 0 and ∫₀^{α₁} f(α)dα = α₁; g(α) is a non-negative and non-increasing function on [δ₁, 1] with g(1) = 0 and ∫_{δ₁}^{1} g(α)dα = 1 − δ₁; and h(α) is a non-negative and non-increasing function on [β₁, 1] with h(1) = 0 and ∫_{β₁}^{1} h(α)dα = 1 − β₁. Like [7] and [26], we also choose f(α) = 2α/α₁, α ∈ [0, α₁], g(α) = 2(1 − α)/(1 − δ₁), α ∈ [δ₁, 1], and h(α) = 2(1 − α)/(1 − β₁), α ∈ [β₁, 1]. Thus the values of the PM, NeuM and NM functions of the PFS A are evaluated as
\[
V_\mu(A)=\frac{p_1+r_1+4q_1}{6}\,\alpha_1,\qquad V_\eta(A)=\frac{p_1'+r_1'+4q_1'}{6}\,(1-\delta_1),\qquad V_\nu(A)=\frac{p_1''+r_1''+4q_1''}{6}\,(1-\beta_1),
\]
and similarly the ambiguities as
\[
A_\mu(A)=\frac{r_1-p_1}{3}\,\alpha_1,\qquad A_\eta(A)=\frac{r_1'-p_1'}{3}\,(1-\delta_1),\qquad A_\nu(A)=\frac{r_1''-p_1''}{3}\,(1-\beta_1).
\]
Zeng et al. [32] devised a value-index and an ambiguity-index to rank trapezoidal IFSs, and we adopt them for triangular PFSs. That is, for a triangular PFS A = ⟨[p₁, q₁, r₁; α₁], [p₁′, q₁′, r₁′; δ₁], [p₁″, q₁″, r₁″; β₁]⟩, the value-index and ambiguity-index of A are defined as
\[
V(A)=\lambda_1V_\mu(A)+\lambda_2V_\eta(A)+(1-(\lambda_1+\lambda_2))V_\nu(A),\qquad
A(A)=\lambda_1A_\mu(A)+\lambda_2A_\eta(A)+(1-(\lambda_1+\lambda_2))A_\nu(A),
\]
respectively, where λ₁, λ₂ ∈ [0, 1] are weights which represent the decision maker's preference information. In this paper, we shall apply the value-index in our case study to rank the alternatives.
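These closed forms make the value-index computable directly from the nine endpoints and three heights. A minimal sketch (our own helper, assuming the heights α₁, δ₁, β₁ of Sect. 2.5 and the triple layout of the earlier sketch):

```python
def value_index(pfs, heights, lam1, lam2):
    # pfs = [PM triple, NeuM triple, NM triple]; heights = (alpha1, delta1, beta1).
    (p, q, r), (p1, q1, r1), (p2, q2, r2) = pfs
    a1, d1, b1 = heights
    v_mu = (p + r + 4 * q) / 6 * a1            # V_mu(A)
    v_eta = (p1 + r1 + 4 * q1) / 6 * (1 - d1)  # V_eta(A)
    v_nu = (p2 + r2 + 4 * q2) / 6 * (1 - b1)   # V_nu(A)
    return lam1 * v_mu + lam2 * v_eta + (1 - lam1 - lam2) * v_nu

# d~1 of the case study (Table 5): value-index 1.331 at lam1 = lam2 = 0.
d1 = [(1.55, 2.8, 3.78), (1.79, 2.99, 4.31), (1.26, 2.22, 3.17)]
print(round(value_index(d1, (0.3, 0.2, 0.4), 0, 0), 3))  # 1.331
```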

6 Multi-criteria Group Decision-Making Based on Arithmetic Operation Between Triangular PFSs

In general, multi-criteria group decision-making problems involve uncertain and imprecise data and information. So it is better to select the best alternative containing less imprecise or uncertain information while treating a real-world problem. We can do this using the proposed method, which is justified through the following case study. Suppose a committee of K expert decision makers D₁, D₂, D₃, ..., D_K is to choose the best alternative among m alternatives A₁, A₂, A₃, ..., A_m based on n criteria C₁, C₂, C₃, ..., C_n for each alternative.


The procedure for the decision process is given below:

Step-I: The decision makers choose linguistic weighting variables for the importance weights of the criteria and linguistic rating variables to evaluate the ratings of the alternatives with respect to each criterion; both are expressed in terms of positive PFSs.

Step-II: The decision makers evaluate the importance weight of each criterion using the linguistic weighting variables.

Step-III: The weights of the criteria are aggregated using
\[
\tilde w_j=\frac{1}{K}\big[\tilde w_j^{\,1}+\tilde w_j^{\,2}+\cdots+\tilde w_j^{\,K}\big] \tag{6.1}
\]
to get the aggregated fuzzy weight w̃j of the criterion Cj. The new weight vector can be written as W̃ = [w̃₁, w̃₂, ..., w̃ₙ].

Step-IV: The decision makers give their opinions to get the aggregated fuzzy ratings x̃ij of alternative Ai under criterion Cj. That is,
\[
R=\begin{pmatrix}\tilde x_{11} & \tilde x_{12} & \cdots & \tilde x_{1n}\\ \tilde x_{21} & \tilde x_{22} & \cdots & \tilde x_{2n}\\ \vdots & \vdots & & \vdots\\ \tilde x_{m1} & \tilde x_{m2} & \cdots & \tilde x_{mn}\end{pmatrix}
\]
with rows A₁, ..., A_m and columns C₁, ..., C_n.

Step-V: Construct the weighted normalised fuzzy decision matrix D̃ = [d̃ij]_{m×n}, i = 1, 2, ..., m, j = 1, 2, ..., n, where
\[
\tilde d_{ij}=\tilde x_{ij}\,\tilde w_j \tag{6.2}
\]
using our proposed arithmetic operations; the entries are normalised positive triangular PFSs.

Step-VI: The decision makers evaluate d̃i = Σ_{j=1}^{n} d̃ij using our proposed arithmetic operations.

Step-VII: Based on the maximum value-index of d̃i, the decision makers choose the suitable alternative Ai.
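Under the component-wise arithmetic of Sect. 3, Steps III–VII reduce to a few vectorised operations. The following sketch of the whole pipeline is our own illustration (it reuses the hypothetical `pfs_add`, `pfs_mul` and `value_index` helpers sketched earlier):

```python
def aggregate(pfs_list):
    # Eq. (6.1): average K decision makers' PFSs component-wise.
    K = len(pfs_list)
    return [[sum(t[i] for t in triples) / K for i in range(3)]
            for triples in zip(*pfs_list)]

def rank_alternatives(ratings, weights, heights, lam1=0.0, lam2=0.0):
    # ratings[i][j]: aggregated PFS rating of alternative i under criterion j;
    # weights[j]: aggregated PFS weight of criterion j.
    scores = []
    for row in ratings:
        d_i = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
        for x_ij, w_j in zip(row, weights):
            d_i = pfs_add(d_i, pfs_mul(x_ij, w_j))        # Steps V-VI
        scores.append(value_index(d_i, heights, lam1, lam2))  # Step-VII
    return scores  # rank alternatives by descending value-index
```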

7 Hypothetical Case Study

Suppose a committee of three expert decision makers D₁, D₂ and D₃ has been formed to conduct an experiment to select the most suitable place for constructing a bridge on the bank of the river Brahmaputra among three eligible places, namely A₁, A₂ and A₃. Five benefit criteria are considered:

C₁: Distance between the banks.
C₂: Heights of the banks from the water level.
C₃: Hardness of the soil of the banks.
C₄: Danger of floods in the area of the bank.
C₅: Curvaceousness of the banks.

7.1 Computational Procedure in Detail

Step-I: The decision makers choose the linguistic weighting variables (Table 1) for the importance weights of the criteria and the linguistic rating variables (Table 2) to evaluate the ratings of the alternatives with respect to each criterion.

Table 1. Linguistic variables for the importance weight of each criterion

Very Low (VL) | ⟨[0, 0, 0.1; 0.3], [0, 0, 0.2; 0.2], [0, 0, 0.3; 0.4]⟩
Low (L) | ⟨[0, 0.1, 0.25; 0.3], [0, 0.1, 0.3; 0.2], [0, 0.1, 0.35; 0.4]⟩
Medium Low (ML) | ⟨[0.15, 0.3, 0.45; 0.3], [0.1, 0.4, 0.6; 0.2], [0.15, 0.24, 0.4; 0.4]⟩
Medium (M) | ⟨[0.3, 0.5, 0.6; 0.3], [0.4, 0.6, 0.8; 0.2], [0.35, 0.45, 0.55; 0.4]⟩
Medium High (MH) | ⟨[0.55, 0.7, 0.85; 0.3], [0.5, 0.7, 0.9; 0.2], [0.6, 0.65, 0.75; 0.4]⟩
High (H) | ⟨[0.8, 0.9, 1; 0.3], [0.75, 0.9, 1; 0.2], [0.6, 0.7, 0.9; 0.4]⟩
Very High (VH) | ⟨[0.9, 0.95, 1; 0.3], [0.8, 0.95, 1; 0.2], [0.5, 0.85, 0.95; 0.4]⟩

Table 2. Linguistic variables for the ratings

Very Poor (VP) | ⟨[0, 0, 0.1; 0.3], [0, 0, 0.12; 0.2], [0, 0, 0.13; 0.4]⟩
Poor (P) | ⟨[0, 0.1, 0.2; 0.3], [0, 0.1, 0.25; 0.2], [0, 0.1, 0.35; 0.4]⟩
Medium Poor (MP) | ⟨[0.15, 0.2, 0.25; 0.3], [0.1, 0.3, 0.6; 0.2], [0.15, 0.2, 0.4; 0.4]⟩
Fair (F) | ⟨[0.35, 0.5, 0.65; 0.3], [0.45, 0.6, 0.85; 0.2], [0.25, 0.55, 0.65; 0.4]⟩
Medium Good (MG) | ⟨[0.5, 0.7, 0.75; 0.3], [0.45, 0.7, 0.9; 0.2], [0.3, 0.65, 0.7; 0.4]⟩
Good (G) | ⟨[0.6, 0.9, 1; 0.3], [0.65, 0.9, 1; 0.2], [0.6, 0.75, 0.95; 0.4]⟩
Very Good (VG) | ⟨[0.8, 0.9, 1; 0.3], [0.85, 0.9, 1; 0.2], [0.65, 0.7, 0.8; 0.4]⟩

Step-II: To assess the importance of the criteria (Table 3), the linguistic weighting variables from Table 1 are used.

Table 3. The importance weight of each criterion given by the decision makers

Criterion | D1 | D2 | D3
C1 | VH | VH | MH
C2 | MH | H | VH
C3 | ML | MH | MH
C4 | MH | MH | H
C5 | MH | VH | MH

Step-III: The weights of the criteria are aggregated using Eq. (6.1) to get the aggregated fuzzy weight w̃j of the criterion Cj, and the decision makers give their opinions (Table 4) to get the aggregated fuzzy ratings x̃ij of the alternative Ai under criterion Cj.


Table 4. The final aggregate result obtained from the ratings given by the decision makers

Criterion | A1 | A2 | A3
C1 | F | MG | G
C2 | VG | MG | F
C3 | G | MG | F
C4 | F | MG | G
C5 | G | G | VG

Using Eq. (6.1), the aggregated weight of criterion C₁ is

w̃₁ = (1/3)[w̃₁¹ + w̃₁² + w̃₁³] = (1/3)[VH + VH + MH],

and similarly for w̃₂, w̃₃, w̃₄ and w̃₅.

Step-IV: The fuzzy decision matrix R = [x̃ij]₃×₅ is constructed. Since all the weights and ratings lie in the interval [0, 1], R is already the normalised fuzzy decision matrix. Its entries are the PFNs of Table 2 corresponding to the linguistic ratings of Table 4; for instance, x̃₁₁ = ⟨[0.35, 0.5, 0.65; 0.3], [0.45, 0.6, 0.85; 0.2], [0.25, 0.55, 0.65; 0.4]⟩ for the rating F.

Step-V: The weighted normalised fuzzy decision matrix D̃ = [d̃ij]₃×₅ is now constructed by using Eq. (6.2). Its rows (alternatives A1–A3, columns C1–C5) are:

A1: ⟨[0.08, 0.43, 0.62; 0.3], [0.32, 0.52, 0.82; 0.2], [0.13, 0.42, 0.57; 0.4]⟩, ⟨[0.6, 0.77, 0.95; 0.3], [0.58, 0.77, 0.97; 0.2], [0.37, 0.51, 0.7; 0.4]⟩, ⟨[0.25, 0.51, 0.72; 0.3], [0.24, 0.54, 0.8; 0.2], [0.27, 0.38, 0.6; 0.4]⟩, ⟨[0.22, 0.39, 0.59; 0.3], [0.26, 0.46, 0.79; 0.2], [0.15, 0.37, 0.52; 0.4]⟩, ⟨[0.4, 0.7, 0.9; 0.3], [0.39, 0.7, 0.93; 0.2], [0.34, 0.54, 0.78; 0.4]⟩

A2: ⟨[0.12, 0.6, 0.71; 0.3], [0.31, 0.6, 0.87; 0.2], [0.16, 0.5, 0.62; 0.4]⟩, ⟨[0.38, 0.6, 0.71; 0.3], [0.31, 0.6, 0.87; 0.2], [0.17, 0.47, 0.61; 0.4]⟩, ⟨[0.21, 0.4, 0.54; 0.3], [0.17, 0.42, 0.72; 0.2], [0.14, 0.33, 0.44; 0.4]⟩, ⟨[0.32, 0.54, 0.68; 0.3], [0.26, 0.54, 0.84; 0.2], [0.18, 0.44, 0.56; 0.4]⟩, ⟨[0.4, 0.7, 0.9; 0.3], [0.4, 0.7, 0.93; 0.2], [0.34, 0.54, 0.78; 0.4]⟩

A3: ⟨[0.14, 0.77, 0.95; 0.3], [0.46, 0.77, 0.97; 0.2], [0.32, 0.58, 0.84; 0.4]⟩, ⟨[0.26, 0.43, 0.62; 0.3], [0.31, 0.51, 0.82; 0.2], [0.14, 0.4, 0.57; 0.4]⟩, ⟨[0.15, 0.29, 0.47; 0.3], [0.17, 0.36, 0.68; 0.2], [0.11, 0.28, 0.41; 0.4]⟩, ⟨[0.38, 0.69, 0.9; 0.3], [0.38, 0.69, 0.93; 0.2], [0.36, 0.5, 0.76; 0.4]⟩, ⟨[0.54, 0.7, 0.9; 0.3], [0.51, 0.7, 0.93; 0.2], [0.37, 0.5, 0.66; 0.4]⟩

Step-VI: Evaluating d̃i = Σ_{j=1}^{5} d̃ij with our proposed arithmetic operations gives Table 5.

Table 5. Value-index and rank

d̃₁ = ⟨[1.55, 2.8, 3.78; 0.3], [1.79, 2.99, 4.31; 0.2], [1.26, 2.22, 3.17; 0.4]⟩
d̃₂ = ⟨[1.43, 2.84, 3.54; 0.3], [1.45, 2.86, 4.23; 0.2], [0.99, 2.28, 3.01; 0.4]⟩
d̃₃ = ⟨[1.47, 2.88, 3.84; 0.3], [1.83, 3.03, 4.33; 0.2], [1.3, 2.26, 3.24; 0.4]⟩

Value-index V(A) = λ₁Vμ(A) + λ₂Vη(A) + (1 − (λ₁ + λ₂))Vν(A) for λ₁, λ₂ ∈ [0, 1]:

 | λ₁ = λ₂ = 0 | λ₁ = λ₂ = 0.5 | λ₁ = 0.3, λ₂ = 0.6 | λ₁ = 0.2, λ₂ = 0.5 | Rank
d̃₁ | 1.331 | 1.617 | 1.825 | 1.768 | 2nd
d̃₂ | 1.312 | 1.549 | 1.745 | 1.697 | 3rd
d̃₃ | 1.358 | 1.639 | 1.850 | 1.794 | 1st


Step-VII: It is clear from the above calculated value-index that d˜3 > d˜1 > d˜2 holds for any value of λ1 , λ2 ∈ [0, 1]. Hence the ranking order of the three alternatives is A3 > A1 > A2 and the best alternative is A3 .

8 Conclusion

PFS is a generalisation of FST and IFS and is more capable of dealing with uncertainty. In some complex situations FST and IFS are not appropriate for treating uncertainty in a proper manner, and therefore PFS is taken into consideration to deal with such situations. In this article, arithmetic operations on triangular PFSs are presented first, along with numerical examples. Furthermore, these operations on PFSs are applied in a hypothetical fuzzy MCDM problem, for which a ranking approach has also been devised. It is observed that the proposed approach is simple, efficient, logical and can be implemented generally. However, the consideration of the maximum height of the membership function and the minimum heights of the neutral and non-membership functions leads to counterintuitive output in some situations. As an extension of this work, further research can be carried out to overcome this limitation.

Funding. No funding.

Compliance with Ethical Standards

Conflict of interest. The authors declare that they have no conflict of interest.

References 1. Zadeh, L.A.: Fuzzy set theory. Inf. Control 8, 338–356 (1965) 2. Atanassov, K.T.: Intuitionistic fuzzy sets. Fuzzy Sets Syst. 20, 87–96 (1986) 3. Coung, B.C., Kreinovich, V.: Picture fuzzy set-a new concept for computational intelligence problems. In: Proceedings of The Third World Congress on Information and Communication Technologies, WIICT, pp. 1–6 (2013) 4. Dutta, P., Ganju, S.: Some aspects of picture fuzzy set. Trans. A. Razmadze Math. Inst. 172, 164–175 (2018) 5. Phong, P.H., Hieu, D.T., Ngan, R.T.H., Them, P.T.: Some compositions of picture fuzzy relations. In: Proceedings of the 7th National Conference on Fundamental and Applied Information Technology Research, FAIR 2007, Thai Nguyen, pp. 19– 20 (2014) 6. Cuong, B.C., Hai, P.V.: Some fuzzy logic operators for picture fuzzy sets. In: Seventh International Conference on Knowledge and Systems Engineering, pp. 132–137 (2015) 7. Cuong, B.C., Ngan, R.T., Hai, B.D.: An involutive picture fuzzy negator on picture fuzzy sets and some De Morgan triples. In: Seventh International Conference on Knowledge and Systems Engineering, pp. 126–131 (2015)


8. Viet, P.V., Chau, H.T.M., Hai, P.V.: Some extensions of membership graphs for picture inference systems. In: 2015 Seventh International Conference on Knowledge and Systems Engineering (KSE), pp. 192–197. IEEE (2015) 9. Singh, P.: Correlation coefficients for picture fuzzy sets. J. Intell. Fuzzy Syst. 28, 591–604 (2015) 10. Cuong, B.C., Kreinovich, V., Ngan, R.T.: A classification of representable t-norm operators for picture fuzzy sets. In: 2016 Eighth International Conference on Knowledge and Systems Engineering (KSE), pp. 19–24. IEEE (2016) 11. Son, L.H.: Generalized picture distance measure and applications to picture fuzzy clustering. Appl. Soft Comput. J. (2016). http://dx.doi.org/10.1016/j.asoc.2016. 05.009 12. Son, L.H.: Measuring analogousness in picture fuzzy sets: from picture distance measures to picture association measures. Fuzzy Optim. Decis. Mak., pp. 1–20 (2017) 13. Son, L.H., Viet, P., Hai, P.: Picture inference system: a new fuzzy inference system on picture fuzzy set. Appl. Intell. 46, 652–669 (2017) 14. Peng, X., Dai, J.: Algorithm for picture fuzzy multiple attribute decision making based on new distance measure. Int. J. Uncertain. Quantif. 7, 177–187 (2017) 15. Garg, H.: Some picture fuzzy aggregation operators and their applications to multicriteria decision-making. Arab. J. Sci. Eng. 1–16 (2017). http://dx.doi.org/10. 1007/s13369-017-2625-9 16. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data. Wiley, New York (1990) 17. Yang, M.S., Hung, W.L., Chang-Chien, S.J.: On a similarity measure between LRtype fuzzy numbers and its application to database acquisition. Int. J. Intell. Syst. 20, 1001–1016 (2005) 18. Candan, K.S., Li, W.S., Priya, M.L.: Similarity-based ranking and query processing in multimedia databases. Data Knowl. Eng. 35, 259–298 (2000) 19. Phong, P.H., Son, L.H.: Linguistic vector similarity measures and applications to linguistic information classification. Int. J. Intell. Syst. 32, 67–81 (2017) 20. Chachi, J., Taheri, S.M.: A unified approach to similarity measures between intuitionistic fuzzy sets. Int. J. Intell. Syst. 28, 669–685 (2013) 21. Li, D.F., Chuntian, C.: New similarity measures of intuitionistic fuzzy sets and application to pattern recognition. Pattern Recogn. Lett. 23, 221–225 (2002) 22. Liang, Z., Shi, P.: Similarity measures on intuitionistic fuzzy sets. Pattern Recogn. Lett. 24, 278–285 (2003) 23. Mitchell, H.B.: On the Dengfeng-Chuntian similarity measure and its application to pattern recognition. Pattern Recogn. Lett. 24, 3101–3104 (2003) 24. Hung, W.L., Yang, M.S.: Similarity measures of intuitionistic fuzzy sets based on Hausdorff distance. Pattern Recogn. Lett. 25, 1603–1611 (2004) 25. Li, Y., Olson, D.L., Qin, Z.: Similarity measures between intuitionistic fuzzy (vague) sets: a comparative analysis. Pattern Recogn. Lett. 28, 2687–2693 (2007) 26. Hwang, C.M., Yang, M.S.: New construction for similarity measures between intuitionistic fuzzy sets based on lower, upper and middle fuzzy sets. Int. J. Fuzzy Syst. 15, 359–366 (2013) 27. Wei, G.W.: Some similarity measures for picture fuzzy sets and their applications. Iran. J. Fuzzy Syst. (2017). http://ijfs.usb.ac.ir/article-3273.html 28. Jaccard, P.: Nouvelles recherches sur la distribution florale. Bull. Soc. Vaud. Sci. Nat. 44, 223–270 (1908) 29. Hwang, C.M., Yang, M.S., Hung, W.L.: New similarity measures of intuitionistic fuzzy sets based on the Jaccard index with its application to clustering. Int. J. Intell. Syst. 33, 1672–1688 (2018)


30. Li, D.F.: A ratio ranking method of triangular intuitionistic fuzzy numbers and its applications to MADM problems. Comput. Math. Appl. 60, 1557–1570 (2010) 31. De, P.K., Das, D.: A study on ranking of trapezoidal intuitionistic fuzzy numbers. Int. J. Comput. Inf. Syst. Ind. Manag. Appl. 6, 437–444 (2014) 32. Zeng, X., Li, D.F., Yu, G.: A value and ambiguity based ranking method of trapezoidal intuitionistic fuzzy numbers and application to decision making. Sci. World J. 2014, 1–8 (2014)

Some Generalized Results on Multi-criteria Decision Making Model Using Fuzzy TOPSIS Technique

P. K. Parida(B)

Department of Mathematics, C.V. Raman (Autonomous) College of Engineering, Bhubaneswar, India [email protected]

Abstract. This article proposes a new model for decision making using the fuzzy technique for order performance by similarity to ideal solution (FTOPSIS) with collaborative decision makers. FTOPSIS is very frequently used in multi-criteria decision making (MCDM). Here we give a brief introduction to FTOPSIS applied to decision making. Furthermore, the concept is employed in measuring the distance of a particular fuzzy number from both the fuzzy positive ideal solution (FPIS) and the fuzzy negative ideal solution (FNIS). A numerical example is investigated to analyze the outcome of the alternative solutions against the ideal solutions.

Abstract. This article proposes new model involving decision making using fuzzy technique for order performance by similarity to ideal solution (FTOPSIS) as collaborative decision-makers. FTOPSIS is very frequently used in multi-criteria decision making (MCDM). Here we give a brief introduction of FTOPSIS applied to DM. Furthermore, the concept is employed in measuring the being far off of particularized fuzzy number from one as well as other fuzzy positive ideal solution (FPIS) and fuzzy negative ideal solution (FNIS). The numerical example is investigated to analyze the outcome of the alternative solution against the ideal solution. Keywords: Multi-criteria decision matrix · FTOPSIS · FPIS · FNIS · Relative closeness matrix

1 Introduction

MCDM has been extensively applied in selecting among a finite number of alternatives characterized by multiple conflicting criteria. At present, TOPSIS is among the most widely used techniques for MCDM [1]. In the TOPSIS methodology, the best alternative is the one which is the closest approximation of the positive ideal solution and lies at the greatest distance from the negative ideal solution. In the positive ideal solution the profit criteria are maximized and the cost criteria are minimized; in the negative ideal solution the cost criteria are maximized and the profit criteria are minimized. One can refer to [1, 2] for practical applications of TOPSIS. In circumstances where the available information is vague, imprecise or uncertain, it is rather complicated to meticulously evaluate the alternatives against the criteria. The rating of each alternative with respect to each criterion can then be expressed by fuzzy numbers [3–5]. A fuzzy number may be viewed as an extension of an interval with varying membership values: each value in the interval is attached to a real number that expresses its compatibility with the vague statement the fuzzy number encodes. Fuzzy numbers are governed by dedicated rules of operation. In the past


decades we have seen the emergence of numerous MCDM methods applying fuzzy logic to analyze ambiguous data [5]. In the meanwhile, TOPSIS has been extended to handle MCDM with an ambiguous decision matrix, giving rise to FTOPSIS. FTOPSIS has been used to solve various MCDM problems with reasonable success [6–17]. The remaining part of this paper is organized as follows. Section 2 contains the fundamental knowledge on the research methodology; here the fuzzy set, membership function, triangular fuzzy numbers, and the TOPSIS and FTOPSIS methods are introduced. The proposed FTOPSIS methodology is presented in Sect. 3. Section 4 includes an illustration involving calculations and a comparison of the methods discussed. Section 5 concludes the paper.

2 Preliminaries

Definition-1. A fuzzy set P̃ in R is characterized by a membership function [17] μ_P̃(r) which relates each point r to a real number in the interval [0, 1], representing the grade of membership of r in P̃.

Definition-2. A fuzzy set P̃ of a domain of discourse R is said to be convex iff ∀ r₁ ∈ R, r₂ ∈ R:
\[
\mu_{\tilde P}(\lambda r_1+(1-\lambda)r_2)\ \ge\ \min\{\mu_{\tilde P}(r_1),\,\mu_{\tilde P}(r_2)\},
\]
where μ_P̃ is the membership function of the fuzzy set P̃ and λ ∈ [0, 1].

Definition-3. A fuzzy number p̃ is demarcated by a triple p̃ = (p₁, p₂, p₃). The membership function is demarcated by
\[
\mu_{\tilde p}(r)=\begin{cases}(r-p_1)/(p_2-p_1), & p_1\le r\le p_2\\ (p_3-r)/(p_3-p_2), & p_2\le r\le p_3\\ 0, & \text{otherwise,}\end{cases}
\]
where p₂ is the value for which μ_p̃(p₂) = 1, and p₁ and p₃ are the extreme values on the left and on the right of the fuzzy number p̃, with membership μ_p̃(p₁) = μ_p̃(p₃) = 0.

Definition-4. The α-cut of a fuzzy number P̃ is demarcated as P̃_α = {r_β : μ_P̃(r_β) ≥ α, r_β ∈ R}, where α ∈ [0, 1].

Definition-5. If T̃ is a triangular fuzzy number with (T̃_α)^L > 0 and (T̃_α)^U ≤ 1 for α ∈ (0, 1], then T̃ is called a normalized positive triangular fuzzy number.


Definition-6. Let t̃ = (t′, t″, t‴) and s̃ = (s′, s″, s‴) be two positive triangular fuzzy numbers. Then the operations with these fuzzy numbers are demarcated as follows:
\[
\tilde t(\pm)\tilde s=(t'\pm s',\,t''\pm s'',\,t'''\pm s'''),\qquad
\tilde t\times\tilde s=(t'\times s',\,t''\times s'',\,t'''\times s'''),
\]
\[
\tilde t/\tilde s=(t'/s',\,t''/s'',\,t'''/s'''),\qquad
k\tilde t=k(t',t'',t''')=(kt',kt'',kt''').
\]

Definition-7. Let t̃ = (t′, t″, t‴) and s̃ = (s′, s″, s‴) be two triangular fuzzy numbers; then the distance between them is computed by
\[
d(\tilde t,\tilde s)=\sqrt{\tfrac{1}{3}\big[(t'-s')^2+(t''-s'')^2+(t'''-s''')^2\big]}.
\]

2.1 TOPSIS Method

For MCDM we choose a decision matrix DM consisting of u alternatives and v criteria, expressed in matrix form as DM = [x_βδ], where A_β, β = 1, ..., u, are the alternatives, C_δ, δ = 1, ..., v, are the criteria, and x_βδ are prototype scores that express the appraisal of alternative A_β with respect to criterion C_δ. The weight vector w = (w₁, w₂, ..., w_v) is composed of the individual weights w_δ (δ = 1, 2, ..., v) for each criterion C_δ.

Step 1. Assemble the normalized decision matrix N_βδ, where N_βδ = x_βδ / √(Σ_β x_βδ²) for β = 1, ..., u; δ = 1, ..., v, and x_βδ and N_βδ are the prototype and normalized scores of the decision matrix.

Step 2. Assemble the weighted normalized decision matrix V_βδ = w_δ N_βδ, where w_δ is the weight of the δ-th criterion and Σ w_δ = 1.

Step 3. Determine the positive ideal solution P⁺ and the negative ideal solution P⁻:
\[
P^{+}=(v_1^{+},v_2^{+},\cdots,v_v^{+})=\{\max_\beta V_{\beta\delta}\,|\,\delta\in J_1;\ \min_\beta V_{\beta\delta}\,|\,\delta\in J_2\},
\]
\[
P^{-}=(v_1^{-},v_2^{-},\cdots,v_v^{-})=\{\min_\beta V_{\beta\delta}\,|\,\delta\in J_1;\ \max_\beta V_{\beta\delta}\,|\,\delta\in J_2\},
\]
where v_δ⁺ = (1, 1, 1), v_δ⁻ = (0, 0, 0), and J₁ and J₂ denote the benefit criteria and the cost criteria correspondingly.

Step 4. Compute the Euclidean distances from the positive ideal P⁺ and negative ideal P⁻ solutions for each alternative A_β:
\[
d_\beta^{+}=\sqrt{\textstyle\sum_\delta(\Delta_{\beta\delta}^{+})^2}\quad\text{and}\quad d_\beta^{-}=\sqrt{\textstyle\sum_\delta(\Delta_{\beta\delta}^{-})^2},
\]
where Δ_βδ⁺ = v_δ⁺ − V_βδ and Δ_βδ⁻ = v_δ⁻ − V_βδ with β = 1, ..., u.

Step 5. Compute the relative closeness Φ_β of each alternative A_β with respect to the positive ideal solution P⁺, given by Φ_β = d_β⁻ / (d_β⁻ + d_β⁺), where β = 1, ..., u and δ = 1, ..., v.
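As a concreteness check, the five steps map directly onto a few array operations. A minimal sketch (our own illustration; the data below are made up):

```python
import numpy as np

def topsis(x, w, benefit):
    # x: (u alternatives) x (v criteria) score matrix; w: weights summing to 1;
    # benefit: boolean mask, True for benefit criteria, False for cost criteria.
    n = x / np.sqrt((x ** 2).sum(axis=0))                    # Step 1
    v = w * n                                                # Step 2
    p_pos = np.where(benefit, v.max(axis=0), v.min(axis=0))  # Step 3: PIS
    p_neg = np.where(benefit, v.min(axis=0), v.max(axis=0))  #         NIS
    d_pos = np.sqrt(((v - p_pos) ** 2).sum(axis=1))          # Step 4
    d_neg = np.sqrt(((v - p_neg) ** 2).sum(axis=1))
    return d_neg / (d_neg + d_pos)                           # Step 5

x = np.array([[7.0, 9.0, 9.0], [8.0, 7.0, 8.0], [9.0, 6.0, 8.0]])
w = np.array([0.4, 0.3, 0.3])
print(topsis(x, w, np.array([True, True, True])))
```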


2.2 Fuzzy TOPSIS Method

The decision matrix, consisting of alternatives and criteria, is characterized by DM = [x̃_βδ], where A₁, A₂, ..., A_u are the alternatives, C₁, C₂, ..., C_v are the criteria, and the x̃_βδ are fuzzy numbers expressing the appraisal of alternative A_β with respect to criterion C_δ. The weight vector W = (W₁, W₂, ..., W_v) is composed of the individual weights W_δ (δ = 1, 2, ..., v) for each criterion C_δ, satisfying Σ_{δ=1}^{v} W_δ = 1. The weighted normalized fuzzy decision matrix M̃ = [m̃_βδ]_{u×v}, with β = 1, 2, ..., u and δ = 1, 2, ..., v, is assembled by multiplying the normalized fuzzy decision matrix by the corresponding weights. The weighted fuzzy normalized value Ṽ_βδ is computed as
\[
\tilde V_{\beta\delta}=W_\delta\times\tilde m_{\beta\delta},\quad \beta=1,2,\cdots,u;\ \delta=1,2,\cdots,v.
\]
The fuzzy TOPSIS proceeds as follows:

Step 1. Categorize the positive ideal solution P⁺ and the negative ideal solution P⁻:
\[
P^{+}=(\tilde V_1^{+},\tilde V_2^{+},\cdots,\tilde V_v^{+})\quad\text{and}\quad P^{-}=(\tilde V_1^{-},\tilde V_2^{-},\cdots,\tilde V_v^{-}),
\]
where Ṽ_δ⁺ = {max_β Ṽ_βδ, δ ∈ J₁; min_β Ṽ_βδ, δ ∈ J₂} and Ṽ_δ⁻ = {min_β Ṽ_βδ, δ ∈ J₁; max_β Ṽ_βδ, δ ∈ J₂}, and J₁ and J₂ stand for the benefit and cost criteria correspondingly.

Step 2. Compute the distances of each alternative A_β from the positive ideal solution P̃⁺ and the negative ideal solution P̃⁻:
\[
\tilde D_\beta^{+}=\sum_{\delta=1}^{v}D(\tilde V_{\beta\delta},\tilde V_\delta^{+}),\qquad
\tilde D_\beta^{-}=\sum_{\delta=1}^{v}D(\tilde V_{\beta\delta},\tilde V_\delta^{-}),\qquad \beta=1,2,\cdots,u,
\]
where D(Ṽ_βδ, Ṽ_δ⁺) is the distance between two fuzzy numbers (Definition-7).

Step 3. The relative closeness Φ̃_β of each alternative A_β with respect to the positive ideal solution is
\[
\tilde\Phi_\beta=\frac{\tilde D_\beta^{-}}{\tilde D_\beta^{+}+\tilde D_\beta^{-}}.
\]
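The only fuzzy ingredients beyond crisp TOPSIS are the triangular vertex distance of Definition-7 and the summed per-criterion distances. A minimal sketch (our own helper names, assuming weighted normalized triangular numbers are already given):

```python
import math

def fuzzy_dist(t, s):
    # Definition-7: vertex distance between two triangular fuzzy numbers.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t, s)) / 3)

def closeness(V, pis, nis):
    # V[b][d]: weighted normalized triangular number of alternative b, criterion d;
    # pis/nis: lists of ideal triangular numbers, one per criterion.
    out = []
    for row in V:
        dp = sum(fuzzy_dist(v, p) for v, p in zip(row, pis))  # Step 2
        dn = sum(fuzzy_dist(v, n) for v, n in zip(row, nis))
        out.append(dn / (dp + dn))                            # Step 3
    return out
```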

3 Proposed Methodology

The steps for creating the group ranking of the alternatives are given below:

Step 1. Determine the positive ideal solution ʳP⁺ (benefits) and the negative ideal solution ʳP⁻ (costs) for each group member r = 1, 2, ..., R:
\[
{}^{r}P^{+}=({}^{r}\tilde V_1^{+},{}^{r}\tilde V_2^{+},\cdots,{}^{r}\tilde V_v^{+})\quad\text{and}\quad {}^{r}P^{-}=({}^{r}\tilde V_1^{-},{}^{r}\tilde V_2^{-},\cdots,{}^{r}\tilde V_v^{-}),
\]
where ʳṼ_δ⁺ = {max_β ʳṼ_βδ, δ ∈ J₁; min_β ʳṼ_βδ, δ ∈ J₂} and ʳṼ_δ⁻ = {min_β ʳṼ_βδ, δ ∈ J₁; max_β ʳṼ_βδ, δ ∈ J₂}, and J₁ and J₂ denote the benefit and cost criteria correspondingly.

Step 2. Compute the distance of each alternative for the various members. The distances of alternative A_β from the PIS and the NIS of group member S_r, ʳD̃_β⁺ and ʳD̃_β⁻, are given by
\[
{}^{r}\tilde D_\beta^{+}=\sum_{\delta=1}^{v}D({}^{r}\tilde V_{\beta\delta},{}^{r}\tilde V_\delta^{+})\quad\text{and}\quad {}^{r}\tilde D_\beta^{-}=\sum_{\delta=1}^{v}D({}^{r}\tilde V_{\beta\delta},{}^{r}\tilde V_\delta^{-}),
\]
with β = 1, 2, ..., u; r = 1, 2, ..., R, where the distances D(ʳṼ_βδ, ʳṼ_δ⁺) and D(ʳṼ_βδ, ʳṼ_δ⁻) between two fuzzy numbers are computed as before.

Step 3. The relative closeness of each alternative A_β for each member r, with respect to the positive ideal solution, is
\[
{}^{r}\tilde\Phi(A_\beta)=\frac{{}^{r}\tilde D_\beta^{-}}{{}^{r}\tilde D_\beta^{+}+{}^{r}\tilde D_\beta^{-}},\quad \beta=1,2,\cdots,u;\ r=1,2,\cdots,R.
\]
After calculating ʳΦ̃(A_β) for each member r, we can form the relative-closeness matrix
\[
Q=\begin{pmatrix}{}^{1}\tilde\Phi(A_1) & {}^{2}\tilde\Phi(A_1) & \cdots & {}^{R}\tilde\Phi(A_1)\\ {}^{1}\tilde\Phi(A_2) & {}^{2}\tilde\Phi(A_2) & \cdots & {}^{R}\tilde\Phi(A_2)\\ \vdots & \vdots & \ddots & \vdots\\ {}^{1}\tilde\Phi(A_u) & {}^{2}\tilde\Phi(A_u) & \cdots & {}^{R}\tilde\Phi(A_u)\end{pmatrix}.
\]
Now we can obtain the weighted relative-closeness matrix by introducing the importance weights α_r of the group members:
\[
Q_\alpha=\begin{pmatrix}\alpha_1{}^{1}\tilde\Phi(A_1) & \alpha_2{}^{2}\tilde\Phi(A_1) & \cdots & \alpha_R{}^{R}\tilde\Phi(A_1)\\ \vdots & \vdots & \ddots & \vdots\\ \alpha_1{}^{1}\tilde\Phi(A_u) & \alpha_2{}^{2}\tilde\Phi(A_u) & \cdots & \alpha_R{}^{R}\tilde\Phi(A_u)\end{pmatrix}.
\]

Step 4. Identify the group PIS and NIS:
\[
P_G^{+}=(V_{G1}^{+},V_{G2}^{+},\cdots,V_{GR}^{+})=\Big(\max_\beta\alpha_1{}^{1}\tilde\Phi(A_\beta),\ \max_\beta\alpha_2{}^{2}\tilde\Phi(A_\beta),\ \cdots,\ \max_\beta\alpha_R{}^{R}\tilde\Phi(A_\beta)\Big),
\]
\[
P_G^{-}=(V_{G1}^{-},V_{G2}^{-},\cdots,V_{GR}^{-})=\Big(\min_\beta\alpha_1{}^{1}\tilde\Phi(A_\beta),\ \min_\beta\alpha_2{}^{2}\tilde\Phi(A_\beta),\ \cdots,\ \min_\beta\alpha_R{}^{R}\tilde\Phi(A_\beta)\Big).
\]

Step 5. Compute for each alternative A_β the distances from the group PIS and NIS, P_G⁺ and P_G⁻, correspondingly:
\[
d_{G\beta}^{+}=\sqrt{\sum_{r=1}^{R}\big(\alpha_r{}^{r}\tilde\Phi(A_\beta)-V_{Gr}^{+}\big)^2},\qquad
d_{G\beta}^{-}=\sqrt{\sum_{r=1}^{R}\big(\alpha_r{}^{r}\tilde\Phi(A_\beta)-V_{Gr}^{-}\big)^2},\qquad \beta=1,2,\cdots,u.
\]

Step 6. Compute the group relative closeness Φ_G(A_β) of each alternative A_β with respect to the group ideal solution:
\[
\Phi_G(A_\beta)=\frac{d_{G\beta}^{-}}{d_{G\beta}^{+}+d_{G\beta}^{-}}.
\]
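Steps 3–6 operate on the u × R closeness matrix alone, so the group stage is easy to reproduce. A sketch (our own code, not the author's) fed with the relative-closeness matrix of Table 12 reproduces the distances and ranks of Table 13:

```python
import numpy as np

def group_closeness(Q, alpha):
    # Q[b][r]: relative closeness of alternative b for member r; alpha: member weights.
    Qa = Q * alpha                                  # weighted matrix Q_alpha
    pis = Qa.max(axis=0)                            # Step 4: group PIS per member
    nis = Qa.min(axis=0)                            #         group NIS per member
    d_pos = np.sqrt(((Qa - pis) ** 2).sum(axis=1))  # Step 5
    d_neg = np.sqrt(((Qa - nis) ** 2).sum(axis=1))
    return d_neg / (d_pos + d_neg)                  # Step 6

Q = np.array([[0.276, 0.298, 0.275],
              [0.266, 0.302, 0.274],
              [0.309, 0.312, 0.280]])
alpha = np.array([0.333, 0.333, 0.333])
print(group_closeness(Q, alpha))  # ~ [0.217, 0.082, 1.0]: AL3 > AL1 > AL2
```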

4 Computational Illustration

In this section, we cite one computational illustration to elucidate the TOPSIS method for decision making problems with fuzzy data. Assume that we have three alternatives AL₁, AL₂, AL₃ among which the decision makers have to choose, evaluated by three experts EX₁, EX₂, EX₃ under a fuzzy environment for operational performance against three benefit criteria CR₁, CR₂, CR₃. The linguistic weights for expressing the importance of the criteria are: none (N), very low (VL), low (L), medium low (ML), medium (M), medium high (MH), high (H), very high (VH) and excellent (E) [7, 9], with the fuzzy numbers demarcated in Table 1.

Table 1. Linguistic variables

Linguistic variables | Fuzzy numbers
None (N) | (0.00, 0.00, 0.04)
Very Low (VL) | (0.04, 0.09, 0.14)
Low (L) | (0.14, 0.24, 0.34)
Medium low (ML) | (0.34, 0.42, 0.50)
Medium (M) | (0.50, 0.58, 0.62)
Medium high (MH) | (0.62, 0.65, 0.68)
High (H) | (0.68, 0.74, 0.82)
Very high (VH) | (0.82, 0.88, 0.92)
Excellent (E) | (0.92, 0.96, 1.00)


In the above, we apply FTOPSIS to three decision matrices DM¹, DM² and DM³ with the same weights w¹ = w² = w³ = (0.333, 0.333, 0.333). Based on these values, we compute the normalized decision matrix, the weighted normalized decision matrix, the fuzzy positive ideal solution (FPIS), the fuzzy negative ideal solution (FNIS) and the relative closeness coefficient for each decision matrix with respect to the same weights. In Tables 2, 3 and 4 we compute the appraisals using DM¹, its normalized decision matrix and the weighting by w¹; in Tables 5, 6 and 7 the same using DM² and w²; and in Tables 8, 9 and 10 the same using DM³ and w³. In Table 11 we compute the fuzzy positive and negative ideal solutions of all the decision matrices DM¹, DM² and DM³. In Table 12 we determine the relative closeness matrix and the weighted relative closeness matrix. At the end, we compute the group ranking order from the relative closeness matrices, obtaining AL₃ > AL₁ > AL₂ (Table 13).

Table 2. FTOPSIS for DM¹

Alts.\Cri. | CR1 | CR2 | CR3
AL1 | (0.753, 0.803, 0.840) | (0.660, 0.720, 0.780) | (0.620, 0.687, 0.753)
AL2 | (0.640, 0.707, 0.747) | (0.547, 0.603, 0.667) | (0.787, 0.830, 0.867)
AL3 | (0.667, 0.733, 0.787) | (0.853, 0.907, 0.947) | (0.820, 0.857, 0.893)
Weights | (0.333, 0.333, 0.333) | (0.333, 0.333, 0.333) | (0.333, 0.333, 0.333)

Table 3. FTOPSIS for normalized DM¹

Alts.\Cri. | CR1 | CR2 | CR3
AL1 | (0.897, 0.956, 1.000) | (0.697, 0.761, 0.824) | (0.694, 0.769, 0.843)
AL2 | (0.762, 0.841, 0.889) | (0.577, 0.637, 0.704) | (0.881, 0.929, 0.970)
AL3 | (0.794, 0.873, 0.937) | (0.901, 0.958, 1.000) | (0.918, 0.959, 1.000)


Table 4. FTOPSIS for DM¹ with weight w¹ = (0.333, 0.333, 0.333)

Alts.\Cri. | w¹CR1 | w¹CR2 | w¹CR3
AL1 | (0.299, 0.318, 0.333) | (0.232, 0.253, 0.274) | (0.231, 0.256, 0.281)
AL2 | (0.254, 0.280, 0.296) | (0.192, 0.212, 0.235) | (0.293, 0.309, 0.323)
AL3 | (0.264, 0.291, 0.312) | (0.300, 0.319, 0.333) | (0.306, 0.319, 0.333)

Table 5. FTOPSIS for DM²

Alts.\Cri. | CR1 | CR2 | CR3
AL1 | (0.870, 0.860, 0.913) | (0.680, 0.740, 0.820) | (0.747, 0.870, 0.847)
AL2 | (0.787, 0.830, 0.867) | (0.720, 0.753, 0.787) | (0.807, 0.860, 0.913)
AL3 | (0.773, 0.833, 0.887) | (0.787, 0.830, 0.867) | (0.807, 0.860, 0.913)
Weights | (0.333, 0.333, 0.333) | (0.333, 0.333, 0.333) | (0.333, 0.333, 0.333)

Table 6. FTOPSIS for normalized DM²

Alts.\Cri. | CR1 | CR2 | CR3
AL1 | (0.883, 0.942, 1.000) | (0.785, 0.854, 0.946) | (0.818, 0.883, 0.927)
AL2 | (0.861, 0.909, 0.949) | (0.831, 0.869, 0.908) | (0.883, 0.942, 1.000)
AL3 | (0.847, 0.912, 0.971) | (0.908, 0.958, 1.000) | (0.883, 0.942, 1.000)

Table 7. FTOPSIS for DM² with weight w² = (0.333, 0.333, 0.333)

Alts.\Cri. | w²CR1 | w²CR2 | w²CR3
AL1 | (0.294, 0.314, 0.333) | (0.261, 0.284, 0.315) | (0.272, 0.294, 0.309)
AL2 | (0.257, 0.303, 0.316) | (0.277, 0.289, 0.302) | (0.294, 0.314, 0.333)
AL3 | (0.282, 0.304, 0.323) | (0.302, 0.319, 0.333) | (0.294, 0.314, 0.333)

Table 8. FTOPSIS for DM³

Alts.\Cri. | CR1 | CR2 | CR3
AL1 | (0.667, 0.733, 0.787) | (0.740, 0.783, 0.833) | (0.560, 0.633, 0.687)
AL2 | (0.640, 0.707, 0.747) | (0.713, 0.780, 0.820) | (0.627, 0.677, 0.727)
AL3 | (0.760, 0.813, 0.880) | (0.487, 0.550, 0.600) | (0.773, 0.833, 0.887)
Weights | (0.333, 0.333, 0.333) | (0.333, 0.333, 0.333) | (0.333, 0.333, 0.333)


Table 9. FTOPSIS for normalized DM³

Alts.\Cri. | CR1 | CR2 | CR3
AL1 | (0.785, 0.833, 0.894) | (0.888, 0.940, 1.000) | (0.632, 0.714, 0.774)
AL2 | (0.727, 0.773, 0.826) | (0.856, 0.936, 0.984) | (0.707, 0.763, 0.820)
AL3 | (0.864, 0.924, 1.000) | (0.584, 0.660, 0.720) | (0.872, 0.940, 1.000)

Table 10. FTOPSIS for DM³ with weight w³ = (0.333, 0.333, 0.333)

Alts.\Cri. | w³CR1 | w³CR2 | w³CR3
AL1 | (0.252, 0.278, 0.298) | (0.296, 0.313, 0.333) | (0.210, 0.238, 0.258)
AL2 | (0.242, 0.257, 0.275) | (0.285, 0.312, 0.328) | (0.235, 0.254, 0.273)
AL3 | (0.288, 0.308, 0.333) | (0.285, 0.312, 0.328) | (0.290, 0.313, 0.333)

Table 11. FPIS and FNIS for DM¹, DM², DM³ correspondingly, with relative-closeness coefficients

Alts. | ¹d⁺_β | ¹d⁻_β | ¹Φ_β | ²d⁺_β | ²d⁻_β | ²Φ_β | ³d⁺_β | ³d⁻_β | ³Φ_β
AL1 | 2.175 | 0.828 | 0.276 | 2.496 | 0.529 | 0.175 | 2.176 | 0.827 | 0.275
AL2 | 2.202 | 0.8 | 0.266 | 2.266 | 0.751 | 0.249 | 2.18 | 0.822 | 0.274
AL3 | 2.075 | 0.927 | 0.309 | 2.173 | 0.837 | 0.278 | 2.161 | 0.841 | 0.28

Table 12. The group relative-closeness matrix (RCM) and FTOPSIS with weight α = (0.333, 0.333, 0.333) applied to the relative-closeness matrix (WRCM)

Alternatives | ¹Φ_β | ²Φ_β | ³Φ_β | α₁·¹Φ_β | α₂·²Φ_β | α₃·³Φ_β
AL1 | 0.276 | 0.298 | 0.275 | 0.091908 | 0.099234 | 0.091575
AL2 | 0.266 | 0.302 | 0.274 | 0.088578 | 0.100566 | 0.091242
AL3 | 0.309 | 0.312 | 0.28 | 0.102897 | 0.103896 | 0.09324

Table 13. Group FPIS, FNIS and ranking order

Alternatives | d⁺_Gβ | d⁻_Gβ | Φ_G(A_β) | Rank
AL1 | 0.012052576 | 0.003347 | 0.217324 | 2
AL2 | 0.014836262 | 0.001332 | 0.082384 | 3
AL3 | 0 | 0.015191 | 1 | 1


5 Conclusion

In the real world, MCDM finds many applications in fuzzy decision making problems. In this paper, the results demonstrate the proposed method on three different decision matrices with their weights. We believe that the proposed methods render value, but, as a limitation, it is tough and complicated to estimate subjective fuzzy information in a realistic way, since the results of the research depend on the experts' opinions and linguistic variables. We considered an MCDM method with several decision makers, comprising the comparison of a fuzzy number as greater than or equal to another fuzzy number and a new distance measure of each fuzzy number from the fuzzy positive ideal solution (FPIS) as well as the fuzzy negative ideal solution (FNIS). Moreover, we developed a fuzzy TOPSIS for group decision making for tackling MCDM. This method yields the best alternatives.

Acknowledgment. The content of this article has been prepared as a part of research work carried out at C. V. Raman College of Engineering, Bhubaneswar.

References 1. Parida, P.K., Sahoo, S.K.: Multiple attributes decision-making approach by TOPSIS technique. Int. J. Eng. Res. Technol. 2(11), 907–912 (2013) 2. Barros, C.P., Wanke, P.: An analysis of African airlines efficiency with two stage TOPSIS and neural networks. J. Air Transp. Manag. 44–45, 90–102 (2015) 3. Chen, C.T.: Extensions of the TOPSIS for group decision-making under fuzzy environment. Fuzzy Sets Syst. 114, 1–9 (2000) 4. Parida, P.K., Sahoo, S.K.: Fuzzy multiple attributes decision-making models using TOPSIS technique. Int. J. Appl. Eng. Res. 10(2), 2433–2442 (2015) 5. Parida, P.K.: A multi-attributes decision-making models based on fuzzy TOPSIS for positive and negative ideal solution with ranking order. Int. J. Civ. Eng. Technol. 9(6), 190–198 (2018) 6. Chen, T.Y., Tsao, C.Y.: The intermission valued fuzzy TOPSIS methods and experimental analysis. Fuzzy Sets Syst. 159(11), 1410–1428 (2008) 7. Chu, T.C.: Selecting plant location via a fuzzy TOPSIS approach. Int. J. Adv. Manuf. Technol. 20, 859–864 (2002) 8. Chu, T.C., Lin, Y.C.: Improved extensions of the TOPSIS for group decision making under fuzzy environment. J. Inf. Optim. Sci. 23, 273–286 (2002) 9. Chu, T.C., Lin, Y.C.: A fuzzy TOPSIS method for robot selection. Int. J. Manuf. Technol. 21, 284–290 (2003) 10. Krohling, R.A., Campanharo, V.C.: Fuzzy TOPSIS for group decision making: A case study for accidents with oil spill in the sea. Expert Syst. Appl. 38(4), 4190–4197 (2011) 11. Wang, Y.J.: Applying FMCDM to evaluate financial performance of domestic airlines in Taiwan. Expert Syst. Appl. 34, 1837–1845 (2008) 12. Wang, Y.J., Lee, H.S., Lin, K.: Fuzzy TOPSIS for multi-criteria decision making. Int. Math. J. 3, 367–379 (2003) 13. Wang, J., Liu, S.Y., Zhang, J.: An extension of TOPSIS for fuzzy MCDM based on vague set theory. J. Syst. Sci. Syst. Eng. 14, 73–84 (2005) 14. Wang, T.C., Lee, H.D.: Developing a fuzzy TOPSIS approach based on subjective weights and objective weights. Expert Syst. Appl. 36, 8980–8985 (2009)


15. Wang, Y.M., Elhag, T.M.S.: Fuzzy TOPSIS method based on alpha level sets with an application to bridge risk assessment. Expert Syst. Appl. 31(2), 309–319 (2006) 16. Zadeh, L.A.: Fuzzy sets. Inf. Control 8(3), 338–353 (1965) 17. Yong, D.: Plant location selection based on fuzzy TOPSIS. Int. J. Adv. Manuf. Technol. 28(7–8), 839–844 (2006)

Data Mining, Bioinformatics, and Cellular Communications

A Survey on FP-Tree Based Incremental Frequent Pattern Mining Shafiul Alom Ahmed(B) and Bhabesh Nath Tezpur University, Napaam, Tepur 784028, Assam, India [email protected], [email protected]

Abstract. Several methods for efficient mining of frequent patterns (FP) can be found in the literature. But most of the approaches assume that the whole dataset to be considered can be stored in the main memory of the computer at hand and that the dataset is static in nature. Practically, none of the transactional datasets are static: datasets get updated due to the inclusion of new transactions or the exclusion of obsolete transactions as time advances, or the user may require the frequent patterns for a new threshold value on the updated database. This may generate new frequent patterns or refine existing patterns, and it becomes practically infeasible if the process starts from scratch. Many methods found in the literature have tried to deal with the issues of incremental frequent pattern mining (FPM), but most of these algorithms are main-memory dependent. Therefore, in this paper we discuss some of these algorithms with their pros and cons, to see whether the main-memory limitation of the existing techniques can be mitigated so that they can be used efficiently in the incremental scenario. Keywords: Association Rule (AR) · Frequent pattern (FP) · Incremental mining · Frequent itemset (FI) · FP-tree · Rule mining (RM) · Data mining (DM)

1 Introduction

The knowledge discovery achieved by analysing databases with existing and emerging data mining techniques has motivated the research community to develop new techniques to adapt to this fast-changing world. FPM is considered an essential problem of DM. It has been extensively exercised in ARM, sequential patterns [1], classification [2–4], max and closed FPs [5–7] and clustering [8], and has applications in market-basket analysis [9], bio-informatics [10], web mining [11], etc. The problem of FPM was first introduced by Agrawal et al. in 1993 (the Apriori algorithm) [12]. It uses a recursive, level-wise, bottom-up database search to generate the complete set of FPs using the downward closure property of itemsets. The algorithm recursively scans the whole dataset and generates a huge number of candidate itemsets, resulting in CPU overhead, a huge memory requirement and a significant amount of time spent pruning the infrequent candidate itemsets. A significant number of methods, viz. FUP [13], FUP-2

204

S. A. Ahmed and B. Nath

[14], Border algorithm [15], Modified borders [16], UWEP [17] (Update With Early Pruning), DEMON [18], Incremental Constrained Apriori (ICAP) [19], MAAP [20] (Maintaining Association Rules with Apriori Property), Maximal Frequent Trend Pattern (MFTP) [21] and PRE-HU [22], have been published in the last two decades for handling the incremental scenario. But most of these algorithms suffer from multiple database scans and the generation of a large number of candidate itemsets. Therefore, recomputing the fresh set of FPs from an incremental dataset using Apriori-based algorithms is practically unacceptable. In order to mitigate these limitations, Han et al. proposed the FP-tree [23]: an efficient and compact data structure for mining FPs from large datasets with a divide-and-conquer strategy. The more frequent items have a higher possibility of sharing prefix paths than the less frequent items. During the first database scan, the frequency of each individual item of the database is calculated; the items are then ordered in decreasing frequency order and maintained in a frequency list. In the second pass, the transactions are read one by one, their items rearranged according to the frequency list, and the whole dataset is turned into a compressed FP-tree. After construction, the FPs are mined by recursively constructing conditional FP-trees for each FI; only the FPs are generated from the conditional FP-trees, and the infrequent itemsets are dropped. Though the FP-tree is found to be efficient for mining FPs in transactional databases, it cannot easily be made compatible with dynamic databases, because reconstructing the whole tree from scratch as the database grows is not a feasible approach. So, to deal with this problem, some of the algorithms that improve on the FP-tree to generate FPs from incremental databases are discussed below; a sketch of the two-pass construction itself follows for orientation.
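The following is our own minimal sketch of the two-pass construction just described (not the code of [23]; the header-link table and the mining phase are omitted):

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Pass 1: global item frequencies; keep only the frequent items.
    freq = Counter(item for t in transactions for item in t)
    flist = {i: c for i, c in freq.items() if c >= min_sup}
    root = Node(None, None)
    # Pass 2: insert each transaction with items in decreasing-frequency order,
    # so frequent items share prefix paths and the tree stays compressed.
    for t in transactions:
        items = sorted((i for i in t if i in flist),
                       key=lambda i: (-flist[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root, flist

tree, flist = build_fp_tree([{'a', 'b'}, {'b', 'c', 'd'}, {'a', 'b', 'd'}], 2)
```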

2 FP-tree Based Incremental FPM

2.1 Two-Pass FPM

• DB-Tree: The DB-tree [24] algorithm is capable of storing all the transactions in an FP-tree structure, and it does not rescan the original database to handle incremental mining. This algorithm takes two scans over the database to construct the complete DB-tree, and it considers all the items in the database. Therefore, the DB-tree contains more prefix paths as well as more tree nodes than the FP-tree, and thus requires more memory. But it provides the facility of mining FPs at any minimum support, just by projecting the required FP-tree from the DB-tree based on that minimum support.
• AFPIM: "Adjusting FP-tree for Incremental Mining" (AFPIM) [25] is an improvement over the FP-tree structure; it uses two support thresholds to mine incremental FPs in the cases of addition of new transactions, exclusion of old transactions and modification of transactions. AFPIM uses the FP-tree to maintain the information about the "frequent" and "pre-frequent" itemsets of the original database. Updated information is reflected on the tree just by adjusting the prefix paths of the tree.


Definition 1 (Pre-Minimum Support [26]). A support threshold which is less than the minimum support threshold is called the pre-minimum support.

Definition 2 (Pre-Frequent Itemset [26]). If the support count of an itemset, say X, is less than the minimum support but greater than the pre-minimum support, then X is called a pre-frequent itemset.

Whenever the item order changes, AFPIM adjusts the order by recursively swapping adjacent items or nodes using bubble sort. Whenever an infrequent item becomes frequent, AFPIM cannot maintain the FP-tree structure just by swapping items: it rescans the updated database and creates a new FP-tree using recursive bubble sort, which consumes a significant amount of time. Another major problem is its requirement of an extra pre-minimum support threshold; defining two appropriate supports is a challenging task.
• IFP-tree: The Incremental Frequent Pattern Tree (IFP-tree) [27] can handle all kinds of database updates, such as the addition and deletion of records, and also supports changes or modifications of records in the database. In this approach, reconstruction of the FP-tree from scratch is not required to handle the incremental scenario. To achieve this, a complete tree is constructed; as the database is updated, the algorithm just incrementally updates the tree without scanning the old database. But incrementally updating the tree may violate the FP-tree structure, so the algorithm performs "shuffling" and "merging" operations to maintain it. When new transactions arrive, the updated item order is maintained in a list called the "shuffle list"; this list reflects which nodes of the tree must be adjusted to reflect the incremental update.
• FUFP: The fast updated FP-tree (FUFP) [28] is an improvement over the FP-tree based on the FUP concept. The structure of the FUFP-tree is like the FP-tree except that parent and child nodes maintain bi-directional links. Along with the frequency list, a header table is used to store the frequent items. The bi-directional links and the header table help in faster tree traversal and tree node deletion. After creating the FUFP-tree, it uses an incremental FUFP-tree maintenance algorithm for handling the incremental database. Whenever a very large number of transactions arrive, the entire tree needs to be reconstructed in a batch way.
• Pre-FUFP: Pre-FUFP [29] is a modification of FUFP based on the "pre-large" itemset concept. This algorithm also uses two support threshold values, lower and upper, to define the pre-large itemsets. By using these two support thresholds, Pre-FUFP can efficiently handle the problem of small itemsets in the existing database becoming large in the new set of transactions. These two support thresholds and the updated database size determine the number of original-database rescans. The Pre-FUFP algorithm processes the newly added transactions by dividing them into three groups, namely large, pre-large or small in the existing database, and handles the groups accordingly to preserve the FUFP-tree property; a small sketch of this three-way classification follows.
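The pre-large idea reduces to a three-way classification of an itemset's support against the two thresholds. A minimal sketch (our own illustration of the concept, not the published algorithm; the numbers are made up):

```python
def classify(support, n_tx, upper, lower):
    # upper/lower: the two support thresholds (as fractions) of the pre-large scheme.
    ratio = support / n_tx
    if ratio >= upper:
        return 'large'       # frequent itemset
    if ratio >= lower:
        return 'pre-large'   # kept so newly added transactions cannot surprise us
    return 'small'

print(classify(support=45, n_tx=1000, upper=0.05, lower=0.03))  # 'pre-large'
```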


• BIT: Most of the above-mentioned pattern mining algorithms perform incremental mining by inserting the new transactions in succession into the existing FP-tree. The idea of the Batch Incremental Tree (BIT) [30] is to merge two small successive FP-trees to obtain the FP-tree for the entire database. If a merge-able path is found in both trees, say T1 and T2, then the frequency counts of the matched items are updated and the unmatched items are added as a sub-branch of the last matching prefix item; otherwise, the prefix path is considered a new prefix path of T2. The main advantage of the BIT algorithm is that it processes multiple occurrences of a prefix path only once for merging.
• BIT FPGrowth: BIT FPGrowth [31] reads the itemsets (as transactions) from a precomputed FP-tree. It uses pre-computation and data reduction while constructing the periodical FP-trees to improve scalability. Initially, it takes the FP-trees of the periodic datasets and all the items with their local frequency counts as inputs to obtain the global frequency order of items. The algorithm then reads each itemset from the first FP-tree and rearranges its items as per the global frequency order. Similarly, the itemsets of the second FP-tree are rearranged as per the global frequency order and merged with the first restructured FP-tree.
• VSIFP-growth: Variable support incremental FP-growth (VSIFP-growth) [32] is a tree-based approach that uses MapReduce to reduce the complexity of incremental tree reconstruction. The algorithm can handle an incremental dataset and a support change at the same time to generate new association trees from incremental datasets. Though the new tree and the original FP-tree are structurally different, they both give the same mining results. However, VSIFP-growth incurs a major problem in parallel mining: an appropriate dataset scale value and data node count are important, because they affect the mining performance.

2.2 Single-Pass FPM

• P-Tree: The Pattern tree or P-tree [33] is another data structure which needs only one database scan. The algorithm handles incremental mining by updating the P-tree with only one scan over the new transactions. Updating the tree when new transactions arrive is a requirement common to most incremental algorithms. If a new item appears which does not exist in the existing database, the item is appended to the tree as a leaf node. Since the P-tree is a complete tree, an FP-tree can be extracted from the P-tree for any support value and the complete set of FPs can be generated from the P-tree.
• CATS-tree: The CATS-tree [34] improves on the memory requirement of the FP-tree so that it can be used to handle incremental databases. It can handle both insertion of transactions into and deletion of transactions from the database, and uses multiple support thresholds to mine FPs. A single pass over the dataset is needed to construct the CATS-tree. When the database is updated, the frequency of a descendant node may become


higher than that of its ancestor node; to preserve the structural integrity, the algorithm then swaps the nodes. It also has to deal with the child and parent node links, which is a very time-consuming process.
• CanTree: The CANonical-order Tree (CanTree) [35] is an FP-tree-based structure in which the nodes are ordered according to some canonical order. This approach was developed by Leung et al. to handle the problems of AFPIM. CanTree can handle transaction insertion, deletion and modification. It scans the original database once to construct the CanTree and, owing to the canonical ordering of nodes, does not even require restructuring to maintain the integrity of the tree structure. Once the incremented CanTree is constructed, the algorithm uses FP-growth to mine frequent patterns. Though this algorithm saves a significant amount of time and computation cost, it requires huge main-memory storage. A minimal sketch of canonical-order insertion follows this list.
• CP-tree: To overcome the above-mentioned issues of CanTree, the Compact Pattern Tree (CP-tree) [36] was developed by Tanbeer et al. in 2008. The CP-tree is capable of mining incremental FPs with only a single scan. The tree maintains a frequency-dependent item order. While constructing the CP-tree, transactions are inserted into the tree one by one according to a predefined order of the items. After inserting some transactions, if the current item order has changed (to a predefined degree), the CP-tree uses dynamic tree-restructuring approaches, specifically a path-adjusting method, to update the tree to the current item order. The CP-tree is also a complete tree.
• Improved CP-tree: Periodically restructuring the tree is a costly task. To deal with this problem, a new prefix-tree structure, the improved CP-tree [37], was developed by Hamedanian et al. to reduce the CP-tree restructuring complexity. This algorithm requires just one scan to construct the tree. Improved CP-tree construction is a two-step process consisting of an insertion phase and a restructuring phase. In the first step, items are sorted in descending order of their counts and the transaction items are inserted into the tree accordingly, resulting in fewer unordered paths. During the restructuring phase, all the unordered paths of the tree are removed and sorted in temporary arrays; the arrays are sorted according to the frequencies and the sorted paths are then inserted back into the tree. By using this efficient array-sorting approach, the algorithm enhances the performance of the branch-sorting method of the CP-tree.
• GM-Tree: The Generate and Merge tree (GM-tree) [38] combines canonical ordering and batch-incremental techniques. The tree nodes are maintained in lexicographical order; therefore, there is no need to rescan the tree nodes or to swap or bubble nodes with the updated item frequencies while merging two trees. The algorithm requires more memory for performing merge operations but very little time for merging, as it performs no swaps or rescanning of the tree nodes.
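To make the canonical-ordering idea concrete, the following is a minimal, illustrative Python sketch (not the CanTree authors' implementation) of inserting transactions into a CanTree-like prefix tree in lexicographic order. Because the insertion order of items never depends on frequencies, incremental updates need no restructuring:

```python
class Node:
    def __init__(self, item):
        self.item = item          # item label at this node
        self.count = 0            # number of transactions through this node
        self.children = {}        # child nodes keyed by item

def insert_transaction(root, transaction):
    """Insert one transaction along a canonical (lexicographic) path."""
    node = root
    for item in sorted(set(transaction)):   # canonical order: lexicographic
        node = node.children.setdefault(item, Node(item))
        node.count += 1

# Incremental updates are just more insertions; no restructuring is required.
root = Node(None)
for t in [["b", "a", "c"], ["a", "c"], ["b", "d"]]:
    insert_transaction(root, t)
```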


3 Research Issues and Challenges

From the above discussion and analysis of the existing rule mining algorithms, it is clear that each algorithm mines FPs with certain pros and cons. From the study, the following research issues and challenges related to FPM have been identified.
• Proper estimation of a minimum support threshold for each database is a difficult task.
• Depending on the size and dimensionality of the database, the performance of FP-tree-based algorithms sometimes degrades to that of Apriori.
• Most of the algorithms are main-memory dependent and take multiple scans over the database.
• Incremental algorithms take a significant amount of time to mine the whole set of FIs from dynamic/incremental databases, because they need to scan the updated database many times.
• It may not always be possible to maintain the complete updated database in main memory, so maintaining a suitable part of the database for mining rules from incremental databases is not an easy task.
• Developing an efficient, scalable data structure that can mine FIs from huge databases without generating candidate itemsets and with a minimum number of database scans is a challenging task.
• Developing an efficient method to define a suitable minimum support threshold for a database is a challenging task.
• Developing a scalable, memory-efficient method for mining association rules from incremental databases is a challenging task.

4 Conclusion

Mining FIs from incremental databases remains a challenge in data mining. In this study, we have discussed some of the incremental FPM algorithms, addressing their implementation issues, pros and cons. It is evident that developing a memory-efficient, scalable algorithm without candidate generation for mining FIs as well as rare patterns from incremental databases with a minimum number of database scans is still a major open problem in rule mining.

References
1. Srikant, R., Agrawal, R.: Mining sequential patterns: generalizations and performance improvements. In: International Conference on Extending Database Technology, pp. 1–17. Springer, Heidelberg (1996)
2. Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pp. 24–25 (1998)
3. Hu, K., Lu, Y., Zhou, L., Shi, C.: Integrating classification and association rule mining: a concept lattice framework. In: International Workshop on Rough Sets, Fuzzy Sets, Data Mining, and Granular-Soft Computing, pp. 443–447. Springer, Heidelberg (1999)


4. Wang, K., Zhou, S., Liew, S.C.: Building hierarchical classifiers using class proximity. In: VLDB, vol. 99, pp. 363–374 (1999)
5. Bayardo Jr., R.J.: Efficiently mining long patterns from databases. In: ACM SIGMOD Record, vol. 27, pp. 85–93. ACM (1998)
6. Pei, J., Han, J., Mao, R., et al.: CLOSET: an efficient algorithm for mining frequent closed itemsets. In: ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, vol. 4, pp. 21–30 (2000)
7. Zaki, M.J.: Generating non-redundant association rules. In: KDD, vol. 2000, pp. 34–43 (2000)
8. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications, vol. 27. ACM (1998)
9. Han, J., Pei, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier (2011)
10. Wu, X., Wang, J.T.L., Jain, L., Zaki, M.J., Shasha, D., Toivonen, H., et al.: Data Mining in Bioinformatics. Springer, Heidelberg (2005)
11. Punin, J.R., Krishnamoorthy, M.S., Zaki, M.J.: LOGML - XML language for web usage mining. In: WWW Posters (2001)
12. Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in large databases. In: ACM SIGMOD International Conference on Management of Data, vol. 22, pp. 207–216 (1993)
13. Cheung, D.W., Han, J., Ng, V.T., Wong, C.Y.: Maintenance of discovered association rules in large databases: an incremental updating technique. In: Proceedings of the Twelfth International Conference on Data Engineering, pp. 106–114. IEEE (1996)
14. Cheung, D.W., Lee, S.D., Kao, B.: A general incremental technique for maintaining discovered association rules. In: Database Systems for Advanced Applications, vol. 97, pp. 185–194. World Scientific (1997)
15. Aumann, Y., Feldman, R., Lipshtat, O., Manilla, H.: Borders: an efficient algorithm for association generation in dynamic databases. J. Intell. Inf. Syst. 12(1), 61–73 (1999)
16. Das, A., Bhattacharyya, D.K.: Rule mining for dynamic databases. In: International Workshop on Distributed Computing, pp. 46–51. Springer, Heidelberg (2004)
17. Ayan, N.F., Tansel, A.U., Arkun, E.: An efficient algorithm to update large itemsets with early pruning. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 287–291. ACM (1999)
18. Ganti, V., Gehrke, J., Ramakrishnan, R.: DEMON: mining and monitoring evolving data. IEEE Trans. Knowl. Data Eng. 13(1), 50–63 (2001)
19. Ayad, A.M.: A new algorithm for incremental mining of constrained association rules. Master's thesis, Department of Computer Sciences and Automatic Control (2000)
20. Zhou, Z., Ezeife, C.I.: A low-scan incremental association rule maintenance method based on the Apriori property. In: Conference of the Canadian Society for Computational Studies of Intelligence, pp. 26–35. Springer, Heidelberg (2001)
21. Guirguis, S., Ahmed, K.M., El Makky, N.M., Hafez, A.M.: Mining the future: predicting itemsets' support of association rules mining. In: Sixth IEEE International Conference on Data Mining - Workshops (ICDMW 2006), pp. 474–478. IEEE (2006)
22. Lin, C.-W., Hong, T.-P., Lan, G.-C., Wong, J.-W., Lin, W.-Y.: Incrementally mining high utility patterns based on pre-large concept. Appl. Intell. 40(2), 343–357 (2014)


23. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation: a frequent-pattern tree approach. In: Proceedings of ACM SIGMOD, Dallas, TX, pp. 1–12 (2000)
24. Ezeife, C.I., Su, Y.: Mining incremental association rules with generalized FP-tree. In: Conference of the Canadian Society for Computational Studies of Intelligence, pp. 147–160. Springer, Heidelberg (2002)
25. Koh, J.-L., Shieh, S.-F.: An efficient approach for maintaining association rules based on adjusting FP-tree structures. In: International Conference on Database Systems for Advanced Applications, pp. 417–424. Springer, Heidelberg (2004)
26. Aggarwal, C.C., Philip, S.Y.: Mining large itemsets for association rules. IEEE Data Eng. Bull. 21(1), 23–31 (1998)
27. Adnan, M., Alhajj, R., Barker, K.: Alternative method for incrementally constructing the FP-tree. In: Proceedings of the IEEE International Conference on Intelligent Systems, UK, September 2006
28. Hong, T.-P., Lin, C.-W., Wu, Y.-L.: Incrementally fast updated frequent pattern trees. Expert Syst. Appl. 34(4), 2424–2435 (2008)
29. Lin, C.-W., Hong, T.-P., Lu, W.-H.: The Pre-FUFP algorithm for incremental mining. Expert Syst. Appl. 36(5), 9498–9505 (2009)
30. Totad, S.G., Geeta, G.B., Prasad Reddy, P.V.G.D.: Batch processing for incremental FP-tree construction. Int. J. Comput. Appl. 5(5), 28–32 (2010)
31. Totad, S.G., Geeta, R.B., Prasad Reddy, P.V.G.D.: Batch incremental processing for FP-tree construction using FP-growth algorithm. Knowl. Inf. Syst. 33(2), 475–490 (2012)
32. Guo, Y.-D., Li, S.-L., Li, Y.-Z., Wang, Z.-X., Zeng, L.: Large-scale dataset incremental association rules mining model and optimization algorithm. Int. J. Database Theor. Appl. 9(4), 195–208 (2016)
33. Huang, H., Wu, X., Relue, R.: Association analysis with one scan of databases. In: Proceedings of the 2002 IEEE International Conference on Data Mining, pp. 629–632. IEEE (2002)
34. Cheung, W., Zaiane, O.R.: Incremental mining of frequent patterns without candidate generation or support constraint. In: Proceedings of the Seventh International Database Engineering and Applications Symposium, pp. 111–116. IEEE (2003)
35. Leung, C.K.-S., Hoque, T., Khan, Q.I.: CanTree: a tree structure for efficient incremental mining of frequent patterns. In: Proceedings of IEEE ICDM, pp. 274–281 (2005)
36. Tanbeer, S.K., Ahmed, C.F., Jeong, B.-S., Lee, Y.-K.: Efficient single-pass frequent pattern mining using a prefix-tree. Inf. Sci. 179(5), 559–583 (2009)
37. Hamedanian, M., Nadimi, M., Naderi, M.: An efficient prefix tree for incremental frequent pattern mining. Int. J. Inf. 3(2) (2013)
38. Roul, R.K., Bansal, I.: GM-tree: an efficient frequent pattern mining technique for dynamic database. In: 2014 9th International Conference on Industrial and Information Systems (ICIIS), pp. 1–6. IEEE (2014)

Improving Co-expressed Gene Pattern Finding Using Gene Ontology
R. C. Baishya, Rosy Sarmah(B), and D. K. Bhattacharyya
Tezpur University, Tezpur, Assam, India
[email protected], {rosy8,dkb}@tezu.ernet.in

Abstract. A semi-supervised gene co-expressed pattern finding method, PatGeneClus, is presented in this paper. PatGeneClus attempts to find all possible biologically relevant gene coherent patterns from any microarray dataset by exploiting both gene expression similarity and GO similarity. PatGeneClus uses a graph-based clustering algorithm called DClique to generate a set of clusters of high biological relevance. We establish the effectiveness of PatGeneClus over several benchmark datasets using well-known validity measures. The clusters obtained by PatGeneClus have been found to be biologically significant on the basis of their p-values, Q-values and ClustalW scores.
Keywords: Gene expression data · Gene ontology · Graph-based clustering · Clique · P value

1 Introduction
Bioinformatics research has slowly shifted towards integrating external knowledge bases to make large-scale gene expression datasets meaningful. Data-driven clustering methods based on expression similarity alone are inadequate for automatically identifying the biological relationships represented in the clusters [1]. Therefore, the present trend has shifted from a fully automated process to semi-automated practices that detect clusters by adopting their functional relationships using external knowledge sources such as the Gene Ontology [2–4]. The functional information of genes is stored as an ontology, called the Gene Ontology (GO) [2]. To make use of the information given by GO, one needs to quantify the similarity between two terms or between two sets of terms. An exhaustive survey of semantic similarity measures in biomedical ontologies can be found in [5]. As mentioned in [6], a better way of finding the similarity between two sets of GO terms is to consider the amount of common information content they share. The similarity measure simUI [7] calculates similarity as the number of GO terms shared by two proteins divided by the total number of GO terms each of them is annotated with. simGIC [6], an extension of simUI, is based on information content (IC) instead of term counts. Our semantic similarity measure is based on simGIC (Graph Information Content). (The supplementary materials are available at http://agnigarh.tezu.ernet.in/~rosy8/shared.html.) Though unsupervised clustering of genes helps in forming hypotheses regarding possible functions,


its main demerit is that it only isolates co-expressed gene patterns, which may not necessarily be biologically coherent units. Hence, in [4], the idea of incorporating biological information into gene expression data is exploited to find more biologically relevant clusters. This class of methods forms the semi-supervised clustering algorithms. The additional information introduced may be any of the following: protein structure similarity, sequence similarity, shared functions and pathways. Using such information along with the expression data, genes can be shown to be related. In [8], the authors combine clustering with a semantic similarity measure for Gene Ontology terms. In [3], k-means clustering is modified to incorporate functional information from the Gene Ontology during clustering, giving more biologically useful clusters. An unsupervised gene clustering method integrating biological knowledge with expression data is presented in [4]. A density-based GO clustering is reported in [9]. Brionne et al. [10] developed ViSEAGO to promote Gene Ontology (GO) functional analysis, allow the study of huge datasets, visualize GO profiles and reflect the biological knowledge. A semi-supervised GO-based clustering algorithm, GO fuzzy relational clustering (GO-FRC), is proposed in [11], where a gene may belong to multiple clusters. In this work we present the following: (i) a graph-based clustering algorithm, DClique, which is able to obtain biologically significant co-expressed patterns; (ii) an effective combined proximity measure, ComSim, to support semi-supervised clustering, combining the expression-based proximity measure PWCTM proposed by us in [12] and the GO-based proximity measure simGIC proposed in [6]; and (iii) a clustering technique, PatGeneClus, that integrates the expression-based proximity measure with the GO-based measure to form the combined similarity measure, ComSim, to identify more biologically meaningful clusters. PatGeneClus is implemented as a tool and can be downloaded from http://agnigarh.tezu.ernet.in/~rosy8/shared.html.

2 Combined Similarity (ComSim)
We use a weighted sum to combine the expression similarity measure with a GO-based similarity measure. The similarity measure simGIC given in [6] is used to discover the GO similarity between a pair of genes:

simGIC(g1, g2) = [Σ IC(t), ∀t ∈ (S1 ∩ S2)] / [Σ IC(t), ∀t ∈ (S1 ∪ S2)]    (1)

where S1 is the set of terms g1 is annotated with, S2 is the set of terms g2 is annotated with, and IC(t) gives the information content of a GO term t. To compute the GO-based similarity, we use two sources of information obtained from www.geneontology.org: (i) the GO file, which gives us the GO terms and their relationships, and (ii) the annotation file, which gives us species-wise gene names along with their direct GO annotations. To use simGIC as given in Eq. 1, each gene must be associated with the list of GO terms with which it is directly or indirectly annotated. To achieve this, we process and store these files as given in Fig. 1. From the GO file, we first construct a binary search tree with the GO-ID as key. This binary search tree is used to hold the DAG formed by the GO terms and the relationships among them. To find the similarity between two genes, we need all the direct as well as indirect annotations of each gene. By making use of the


above-mentioned annotation file and the DAG constructed from the GO file, we construct another binary search tree, where a node holds the gene-ID of a particular gene along with the list of all direct and indirect annotations of that gene. An edge exists if there is a relationship between two nodes. We call this tree a gene-term tree. To speed up the process of finding the similarity between two genes, the list of annotations stored in the nodes of the gene-term tree is kept in sorted order, and the DAG is created only once. The gene-term tree, on the other hand, has to be constructed once for each species. When a dataset for a species is given for processing, the list of all direct/indirect annotations for each gene present in the dataset is loaded from the corresponding gene-term tree. We construct the combined similarity measure by taking into consideration both the expression similarity, expSim, and the GO-based similarity, simGIC, as a weighted sum. The n × n combined similarity matrix ComSim[i][j] is given in Eq. (2), where n is the number of genes present in the dataset and i, j index the ith gene gi and the jth gene gj respectively. When combining, we assign the weights as follows:

ComSim[i][j] = w1 × expSim(gi, gj) + w2 × GOSim(gi, gj)    (2)

To compute the weights w1 and w2, we take x% from the expression similarity matrix and the remaining (100 − x)% from the GO similarity matrix, i.e., w1 = x/100 and w2 = (100 − x)/100 respectively. If genes gi and gj are un-annotated, the combined similarity measure reflects only the expression-based similarity. In our proposed technique, the user can choose any of three proximity measures (viz. Pearson's correlation coefficient, Euclidean distance and PWCTM) for computing expSim. The ComSim matrix is constructed and used in the clustering process as in Fig. 2; a minimal sketch of this computation is given below.
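As an illustration, the following minimal Python sketch (our reading of Eqs. 1 and 2, not the authors' C#.NET code) computes simGIC from two annotation sets with precomputed IC values and blends it with an expression similarity using the x/100 weighting described above; the IC values and annotation sets are hypothetical:

```python
def sim_gic(terms1, terms2, ic):
    """simGIC (Eq. 1): IC mass of shared terms over IC mass of all terms."""
    shared = sum(ic[t] for t in terms1 & terms2)
    total = sum(ic[t] for t in terms1 | terms2)
    return shared / total if total > 0 else 0.0

def com_sim(exp_sim, go_sim, x):
    """ComSim (Eq. 2): weighted sum with w1 = x/100 and w2 = (100 - x)/100."""
    w1, w2 = x / 100.0, (100 - x) / 100.0
    return w1 * exp_sim + w2 * go_sim

# Hypothetical IC values and annotation sets for two genes.
ic = {"GO:A": 2.1, "GO:B": 0.7, "GO:C": 1.5}
s1, s2 = {"GO:A", "GO:B"}, {"GO:B", "GO:C"}
print(com_sim(0.8, sim_gic(s1, s2, ic), x=50))  # equal weighting of the two
```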

Fig. 1. Construction of GO similarity matrix


Fig. 2. Integrated approach of clustering.

3 Proposed Technique PatGeneClus
The entire process of our technique, PatGeneClus, is depicted in Fig. 2. PatGeneClus starts by creating the expression similarity matrix and the GO similarity matrix from the dataset as explained before. Both measures are integrated to form a combined similarity matrix, ComSim. This combined similarity matrix is fed to the clustering algorithm DClique to generate the clusters, as illustrated in Fig. 2.
DClique: A Graph-based Clustering: For a set of m genes, say g1, g2, g3, …, gm, where each gene gi is represented by a set of n conditions c1, c2, c3, …, cn, our goal is to find groups of genes where each gene in a group changes its expression value in a systematic way w.r.t. the other genes in the same group. By systematic, we mean two genes are related by negative correlation, positive correlation or time-shifted correlation. As explained earlier for Eq. 2, ComSim(gi, gj) captures how systematically the expression values of two genes change together: the greater the value of ComSim(gi, gj), the more systematic is their change with each other. Thus we can transform the given problem into a graph theory problem and find all possible maximal cliques. The following definitions are the basis of PatGeneClus.


Definition 1 (Gene-Connectivity). Two genes gi and gj are said to be connected by an edge if ComSim(gi, gj) ≥ simTH, where simTH is a similarity threshold value.
Definition 2 (Gene-Clique). A gene-clique is a subgraph over a set of genes g1, g2, g3, …, gx such that any two genes in this set are at least simTH% similar to each other. In other words, all genes in a gene-clique are gene-connected.
Definition 3 (Maximal-Gene-Clique). A maximal-gene-clique is a gene-clique which is maximal, i.e., no more genes can be added to it.
To obtain all possible maximal cliques, we consider the m genes as the vertices of a graph and connect two genes gi, gj by an edge if they are gene-connected according to Definition 1. A clique of such a graph gives us a set of genes g1, g2, g3, …, gx such that any two genes in this set are at least simTH% similar to each other.
Definition 4 (Gene-Clique-Connectivity). A gene gk is said to be connected to a clique Cx (where Cx = {g1, g2, g3, …, gp}) if gk is similar to at least relaxTH percent of the genes in Cx, where relaxTH is a threshold. This can be mathematically formalized as

|{gi ∈ Cx : ComSim(gk, gi) ≥ simTH}| ≥ relaxTH% × |Cx|

DClique typically uses two steps, viz., an insertion step and a merge step, to generate the cliques present in the dataset. We discuss each of these steps next.
(a) Insertion Step: A gene gk is inserted into a clique Cx iff gk is gene-clique-connected to Cx as given in Definition 4. Since we want to generate the cliques in such a way that for any two generated cliques, say ci and cj, neither ci is a proper subset of cj nor cj a proper subset of ci, we compute only the maximal cliques. To implement this concept, we set simTH to a value which allows a small number of edges among the genes and at the same time allows the algorithm to include each gene in at least one of the clusters. This simTH value is an input parameter of the algorithm, so that it can be tuned properly. Again, instead of finding the set of genes connected to all other genes in the clique, we find the set of genes such that each gene is connected to at least relaxTH% of the other genes in the clique as in Definition 4, where relaxTH% is an input parameter of the algorithm. Thus, by relaxing the clique condition we form clusters as given in Definition 5.
Definition 5 (Relax-Gene-Clique (Cluster)). A relax-gene-clique or cluster is a subgraph over a set of genes such that each gene in this set is connected to at least relaxTH% of the genes in the same cluster. In other words, all genes in a relax-gene-clique are gene-clique-connected to it.
Definition 6 (Cluster-Merge-Connectivity). A cluster Cy is merged with another cluster Cx iff clusterSim(Cx, Cy) ≥ mergeTH and Cx ∈ TC, where TC is the set of all clusters generated so far.


Next, we present the measure used for merging two clusters. We use the following measure, based on the Jaccard index, to find the similarity between two clusters:

clusterSim(ci, cj) = |ci ∩ cj| / (|ci| + |cj|)    (3)

where |ci| gives the number of genes in cluster ci, |cj| gives the number of genes in cluster cj, and |ci ∩ cj| gives the number of genes common to both ci and cj. For each newly generated cluster (Cy), we check whether there are already-generated clusters (Cx ∈ TC) whose similarity to the new cluster Cy is higher than a threshold value, mergeTH. If such clusters are found, the new cluster is merged with each such cluster. After merging, for each such cluster, we again check whether it can be merged with any of the previously generated clusters; if so, we merge it with the other clusters separately and remove it from the already-generated clusters list. The second merging is necessary because after the first merging the genes within some clusters may change, and hence the pattern that describes such a cluster may also change. If no clusters are found for the initial merging, the new cluster is added as a fresh cluster to the generated cluster list. The algorithm is listed below.

procedure DClique(m genes, simTH, relaxTH, mergeTH)
    for each gene gi do
        if gi is not marked then
            for each gene gj such that ComSim(gi, gj) ≥ simTH and neither gi nor gj occurs in any already generated cluster do
                create a cluster, say Cx, and add gi and gj to Cx
                for each unmarked gene gk do
                    if gk is similar to more than relaxTH% of the genes in Cx then (Insertion step)
                        add gk to Cx
                    end if
                end for
            end for
            if no gene gj is found in the above loop then
                set the marked value of gene gi to true
            else
                for each already generated cluster Cy ∈ TC do
                    if clusterSim(Cx, Cy) ≥ mergeTH then
                        merge Cx into Cy (Merge step)
                    end if
                end for


                if no clusters are found in the above loop then
                    add Cx to the list of generated clusters, TC
                else
                    for each updated cluster Cp ∈ TC do
                        for each cluster Cq ∈ TC do
                            if clusterSim(Cp, Cq) ≥ mergeTH then
                                merge Cp into Cq (Merge step)
                            end if
                        end for
                        if Cp was merged into at least one of the clusters then
                            remove Cp from the already generated clusters list, TC
                        end if
                    end for
                end if
            end if
        end if
    end for
end procedure
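Following the listing, here is a minimal Python sketch (our illustration with hypothetical gene sets, not the authors' C#.NET implementation) of the merge test of Eq. 3 and a single merge pass; the full DClique bookkeeping, including the second merge pass, is omitted:

```python
def cluster_sim(c1, c2):
    """clusterSim (Eq. 3): shared genes over the sum of the cluster sizes."""
    return len(c1 & c2) / (len(c1) + len(c2))

def merge_pass(new_cluster, clusters, merge_th=0.20):
    """One merge pass: fold new_cluster into every sufficiently similar cluster."""
    merged = False
    for c in clusters:
        if cluster_sim(new_cluster, c) >= merge_th:
            c |= new_cluster   # merge by set union
            merged = True
    if not merged:
        clusters.append(set(new_cluster))

clusters = [{"g1", "g2", "g3"}]
merge_pass({"g3", "g4"}, clusters)   # similarity 1/5 = 0.2 -> merged
print(clusters)                      # [{'g1', 'g2', 'g3', 'g4'}]
```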

From our experiments, we found that DClique performs best in terms of z-score when the values of the three parameters lie within the following ranges: simTH = [70, 80], relaxTH = [70, 80] and mergeTH = [20, 25]. To prevent DClique from generating the same cluster more than once, when constructing a cluster we check that its first two genes are not present in any other cluster. This guarantees that the new cluster has at least one unique edge. In each iteration, the algorithm takes one gene and performs one of the following: Case A: generate a new cluster; Case B: set the marked attribute of the gene to 'true'. A working example of DClique is given in the supplementary materials. We present next the experimental results obtained by using the proposed PatGeneClus on several datasets from several species.

4 Experimental Results
We implemented the PatGeneClus technique in C#.NET and tested it on the microarray datasets given in Table 1.

Table 1. Datasets used
Dataset | Genes# | Conditions# | Source
1 Yeast sporulation | 474 | 7 | http://cmgm.stanford.edu/pbrown/sporulation/index.html
2 Subset of yeast cell cycle [13] | 384 | 17 | http://faculty.washington.edu/kayee/cluster
3 Yeast Diauxic shift [14] | 6089 | 7 | http://www.ncbi.nlm.nih.gov/geo/query
4 Asbestos treatment of human lung cancer cells [15] | 37866 | 5 | http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6013


To validate the effectiveness of our cluster results we use three well-known measures: (i) p-value, using the FuncAssociate tool [16], (ii) Q-value [17], and (iii) multiple sequence alignment score, using ClustalW [18]. The co-expressed patterns of PatGeneClus for Datasets 1, 2, 3 and 4 with appropriate thresholds can be found in the supplementary materials. The functional enrichment for one of the clusters from Dataset 1, using Euclidean distance, PWCTM and PCC (Pearson's correlation) as similarity measures, is presented in Table 2. The enrichment of functional categories for some of the clusters from Datasets 2 and 3 obtained by PatGeneClus in comparison to the CC [19] and CLICK [20] algorithms is presented in Tables 3 and 4. From the tables, we see that the clusters obtained by PatGeneClus have more biological relevance (indicated by their low p-values) than the clusters given by CLICK and CC. Table 5 presents results of PatGeneClus using PWCTM and PCC as the similarity measures, wherein we see that PatGeneClus performs better with PCC for cluster 1 whereas it performs better with PWCTM for cluster 2. Detailed results for all datasets can be found in the supplementary file.

The GO categories and Q-values from an FDR-corrected hypergeometric test for enrichment are reported by GeneMANIA [17]. We have used GeneMANIA [17], a flexible, user-friendly web interface for generating hypotheses about gene function, analyzing gene lists and prioritizing genes for functional assays. Given a query list, GeneMANIA extends the list with functionally similar genes that it identifies using available genomics and proteomics data. GeneMANIA displays results as an interactive network, illustrating the functional relatedness among the query and retrieved genes. GeneMANIA currently supports different networks including co-expression, physical interaction, genetic interaction and co-localization. The genes of the clusters obtained by PatGeneClus are submitted as a list of query genes to GeneMANIA [17]; their Q-values, along with the different GO categories, are shown for one of the clusters of Dataset 2 in Table 6, and the corresponding network is shown in Fig. 3. The default network-weighting option has been used here. We conclude from the results of the tables and figures reported here that PatGeneClus generates highly enriched GO-based clusters.

Sequence alignment arranges two or more sequences of characters to identify regions of similarity. This is important because similarities may be a consequence of functional or evolutionary relationships between the sequences. In other words, sequence similarity gives the extent of similarity between the nucleotide/protein sequences of two genes. It can be computed using an online tool, ClustalW [18], a web-based multiple sequence alignment program for DNA or proteins that produces biologically meaningful multiple sequence alignments of divergent sequences. We use the multiple sequence alignment score [18] available from ClustalW to decide whether the genes in a particular cluster are functionally similar or not: the higher the alignment score, the more similar the genes in the cluster under consideration. The result of ClustalW using PatGeneClus on Dataset 4 is shown in Table 7, from which we conclude that our method gives sufficiently good sequence alignment scores, with alignment scores ≥ 90. From this, we conclude that most of the clusters produced by PatGeneClus are biologically significant in terms of sequence similarity as well.
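For intuition about the enrichment p-values reported in the tables below, here is a minimal Python sketch (using SciPy's hypergeometric distribution; the counts are hypothetical, and this is not the FuncAssociate or GeneMANIA implementation) of the one-sided over-representation test that underlies such scores:

```python
from scipy.stats import hypergeom

# Hypothetical numbers: a genome of M genes, n of which carry a GO term,
# and a cluster of N genes, k of which carry that term.
M, n, N, k = 6000, 120, 50, 9

# P(X >= k): probability of seeing at least k annotated genes by chance.
p_value = hypergeom.sf(k - 1, M, n, N)
print(f"enrichment p-value = {p_value:.3e}")
```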
Based on our experiments, we find that the values for simTH and relaxTH lie within the range [70, 80], and for mergeTH within [20, 25], when PWCTM is used. These ranges were obtained based on the best z-scores across experiments.

Table 2. P-value comparison of PatGeneClus clusters on Dataset 1 for cluster 2 (columns: P-value, GO number and GO category under each of Euclidean distance, PWCTM and PCC). The top-ranked categories under all three measures include cytosolic ribosome (GO:0022626), cytosolic part (GO:0044445), structural constituent of ribosome (GO:0003735), ribosomal subunit (GO:0033279), ribosome (GO:0005840), (intracellular) non-membrane-bounded organelle (GO:0043228/GO:0043232), ribonucleoprotein complex (GO:0030529), cytosolic large ribosomal subunit (GO:0022625), structural molecule activity (GO:0005198) and gene expression (GO:0010467), with p-values ranging from 2.2E−65 (Euclidean distance) down to 8.9E−23.

Table 3. P-value comparison of CLICK, CC and PatGeneClus clusters on Dataset 2 with GO weight, w2 = 50, for cluster 1 (each entry: P-value, GO number, GO category)
PatGeneClus: 3.14e−29 GO:0044427 Chromosomal part | 2.369e−25 GO:0022402 Cell cycle process | 5.92e−24 GO:0006259 DNA metabolic process | 1.40e−22 GO:0044454 Nuclear chromosome part | 1.755e−19 GO:0006281 DNA repair
CLICK [20]: 8.63e−26 GO:0044427 Chromosomal part | 6.43e−21 GO:0022402 Cell cycle process | 8.51e−21 GO:0006259 DNA metabolic process | 1.90e−20 GO:0006281 DNA repair | 3.71e−20 GO:0006260 DNA replication
CC [19]: 4.67e−24 GO:0044427 Chromosomal part | 1.76e−20 GO:0022402 Cell cycle process | 5.12e−20 GO:0006259 DNA metabolic process | 1.00e−18 GO:0044454 Nuclear chromosome part | 1.25e−18 GO:0007049 Cell cycle

Table 4. P-value comparison of CLICK, CC and PatGeneClus clusters using Dataset 3 with GO weight, w2 = 50, for cluster 1 (each entry: P-value, GO number, GO category)
PatGeneClus: 1.92e−28 GO:0055114 Oxidation reduction process | 2.19e−18 GO:0005739 Mitochondrion | 1.82e−18 GO:0044429 Mitochondrial part | 2.68e−18 GO:0006091 Generation of precursor energy | 4.84e−18 GO:0022904 Respiratory electron transport chain | 1.70e−16 Mitochondrial membrane part
CLICK [20]: 3.96e−19 GO:0055114 Oxidation reduction process | 1.2e−18 GO:0005739 Mitochondrion | 6.7e−17 GO:0006091 Generation of precursor energy | 1.4e−16 GO:0022904 Respiratory electron transport chain
CC [19]: 2.93e−16 GO:0055114 Oxidation reduction process | 5.99e−16 GO:0022904 Respiratory electron transport chain | 3.3e−15 GO:0005739 Mitochondrion | 7.66e−15 GO:0044429 Mitochondrial part | 5.51e−14 GO:0006091 Generation of precursor energy

Table 5. P-values: some more results of PatGeneClus on Dataset 4 (each entry: P-value, GO number, GO category)

Clusters using PWCTM | Clusters using PCC
C1:
3.6e−25 GO:0005515 Protein binding | 6.8e−34 GO:0005515 Protein binding
9.4e−17 GO:0005737 Cytoplasm | 2.5e−23 GO:0005737 Cytoplasm
1.9e−16 GO:0044444 Cytoplasmic part | 6.9e−18 GO:0044444 Cytoplasmic part
1.3e−15 GO:0008219 Cell death | 5e−13 GO:0008219 Cell death
1.3e−15 GO:0016265 Death | 5e−13 GO:0016265 Death
1.5e−15 GO:0044424 Intracellular part | 3.2e−21 GO:0044424 Intracellular part
3.7e−15 GO:0006915 Apoptosis | 7.7e−13 GO:0006915 Apoptosis
5.7e−15 GO:0012501 Programmed cell death | 1.2e−12 GO:0012501 Programmed cell death
7.4e−15 GO:0042981 Regulation of apoptosis | 3.5e−12 GO:0042981 Regulation of apoptosis
1.2e−14 GO:0043067 Regulation of programmed cell death | 5.6e−12 GO:0043067 Regulation of programmed cell death
3.4e−14 GO:0043231 Intracellular membrane-bounded organelle | 1.4e−13 GO:0043231 Intracellular membrane-bounded organelle
C2:
3.5e−09 GO:0030983 Mismatched DNA binding | 1.3e−07 GO:0030983 Mismatched DNA binding
9.1e−09 GO:0006298 Mismatch repair | 3.2e−07 GO:0006298 Mismatch repair
9.1e−09 GO:0045005 Maintenance of fidelity during DNA-dependent DNA replication | 3.2e−07 GO:0045005 Maintenance of fidelity during DNA-dependent DNA replication
9.7e−08 GO:0003690 Double-stranded DNA binding | 8.7e−07 GO:0000166 Nucleotide binding
3.3e−07 GO:0006950 Response to stress | 3.3e−06 GO:0003690 Double-stranded DNA binding
7.4e−07 GO:0006261 DNA-dependent DNA replication | 8.1e−06 GO:0017076 Purine nucleotide binding
1.3e−06 GO:0043566 Structure-specific DNA binding | 8.4e−06 GO:0032553 Ribonucleotide binding
2e−06 GO:0007242 Intracellular signaling cascade | 8.4e−06 GO:0032555 Purine ribonucleotide binding
2.7e−06 GO:0009719 Response to endogenous stimulus | 8.6e−06 GO:0005524 ATP binding
3.2e−06 GO:0006974 Response to DNA damage stimulus | 1e−05 GO:0030554 Adenyl nucleotide binding
4e−06 GO:0019911 Structural constituent of myelin sheath | 1.1e−05 GO:0032559 Adenyl ribonucleotide binding


Fig. 3. (a) The network obtained for cluster 1 of Dataset 1 (using default parameters). (b) The weights obtained for each of the networks.

Table 6. Q-values of Dataset 2 (cluster: C1 of PatGeneClus clusters using PCC)
GO annotation | Q value
DNA repair | 2.92e−22
Response to DNA damage stimulus | 2.92e−22
DNA replication | 4.28e−17
Replication fork | 4.28e−17
DNA-dependent DNA replication | 8.57e−16
Nuclear replication fork | 1.44e−15
Mitotic cell cycle | 2.28e−15
Mitotic sister chromatid cohesion | 2.4e−13
Double-strand break repair | 2.41e−13
M phase | 2.02e−12
Regulation of cell cycle | 2.13e−12
Sister chromatid cohesion | 4.76e−12

Table 7. ClustalW result for one of the clusters obtained from Dataset 4
SEQA | SEQB | SCORE
YBR181C | YPL090C | 99.1561
YBR031W | YDR012W | 98.8981
YBL072C | YER102W | 98.1758
YHR203C | YJR145C | 98.0916
YGL076C | YPL198W | 96.7347
YNL301C | YOL120C | 95.3654
YOR312C | YMR242C | 94.605
YER056C-A | YIL052C | 92.623
YGL031C | YGR148C | 92.5214
YLR441C | YML063W | 91.9271
YDR418W | YEL054C | 91.5663

5 Conclusions
This paper reports a semi-supervised technique called PatGeneClus for clustering gene expression data using both expression-based and GO-based similarity measures. The PatGeneClus clusters are biologically significant, as indicated by their p-values and Q-values. The ClustalW scores obtained are also high, further strengthening our claim. In this paper we use only GO information, but it is possible to incorporate other sources of information to further improve the clustering output.


References
1. Lagreid, A., Hvidsten, T.R., Midelfart, H., Komorowski, J., Sandvik, A.K.: Predicting gene ontology biological process from temporal gene expression patterns. Genome Res. 13(5), 965–979 (2003)
2. Harris, M.A., et al.: The gene ontology (GO) database and informatics resource. Nucleic Acids Res. 32(Database issue), D258–D261 (2004)
3. Macintyre, G., Bailey, J., Gustafsson, D., Haviv, I., Kowalczyk, A.: Using gene ontology annotations in exploratory microarray clustering to understand cancer etiology. Pattern Recogn. Lett. 31(14), 2138–2146 (2010)
4. Verbanck, M., Le, S., Pages, J.: A new unsupervised gene clustering algorithm based on the integration of biological knowledge into expression data. BMC Bioinform. 14, 42 (2013)
5. Pesquita, C., Faria, D., Falcao, A.O., Lord, P., Couto, F.M.: Semantic similarity in biomedical ontologies. PLoS Comput. Biol. 5(7), e1000443 (2009)
6. Pesquita, C., Faria, D., Bastos, H., Falcao, A.O., Couto, F.: Evaluating GO-based semantic similarity measures. In: ISMB/ECCB 2007 SIG Meeting Program Materials, International Society for Computational Biology (2007)
7. Gentleman, R.: Visualizing and distances using GO (2005)
8. Ovaska, K., Laakso, M., Hautaniemi, S.: Fast gene ontology based clustering for microarray experiments. BioData Min. 1(1) (2008)
9. Mandal, K., Sarmah, R.: A density-based clustering for gene expression data using gene ontology. In: Lecture Notes in Networks and Systems (2018)
10. Brionne, A., Juanchich, A., Hennequet-Antier, C.: ViSEAGO: a Bioconductor package for clustering biological functions using gene ontology and semantic similarity. BioData Min. 12(1), 16 (2019)
11. Paul, A.K., Shill, P.C.: Incorporating gene ontology into fuzzy relational clustering of microarray gene expression data. Biosystems 163, 1–10 (2018)
12. Baishya, R.C., Sarmah, R., Bhattacharyya, D.K., Dutta, M.: A similarity measure for clustering gene expression data. In: Proceedings of the International Conference on Applied Algorithms, Kolkata, India, pp. 245–256 (2014)
13. Cho, R.J., Campbell, M.J., Winzeler, E.A., Steinmetz, L.: A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 2(1), 65–73 (1998)
14. DeRisi, J.L., Iyer, V.R., Brown, P.O.: Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680–686 (1997)
15. Nymark, P., Lindholm, P.M., Korpela, M.V., Lahti, L., Ruosaari, S., Kaski, S., Hollmen, J., Anttila, S., Kinnula, V.L., Knuutila, S.: Gene expression profiles in asbestos-exposed epithelial and mesothelial lung cell lines. BMC Genom. 8, 62 (2007)
16. Berriz, F.G., et al.: Characterizing gene sets with FuncAssociate. Bioinformatics 19, 2502–2504 (2003)
17. Warde-Farley, D., Donaldson, S.L., Comes, O., Zuberi, K., Badrawi, R., Chao, P., Franz, M., Grouios, C., Kazi, F., Lopes, C.T., Maitland, A., Mostafavi, S., Montojo, J., Shao, Q., Wright, G., Bader, G.D., Morris, Q.: The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res. 38, W214–W220 (2010)
18. Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, H., McWilliam, H., Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J., Higgins, D.G.: Clustal W and Clustal X version 2. Bioinformatics 23(21), 2947–2948 (2007)

Improving Co-expressed Gene Pattern

225

19. Cheng, Y., Church, G.M.: Biclustering of expression data. In: Proceedings of ISMB 2000, pp. 93–103. AAAI Press (2000)
20. Sharan, R., Shamir, R.: CLICK: a clustering algorithm with applications to gene expression analysis. In: Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology. AAAI Press (2000)

Survey of Methods Used for Differential Expression Analysis on RNA Seq Data
Reema Joshi(B) and Rosy Sarmah
Department of CSE, Tezpur University, Tezpur, India
[email protected], [email protected]

Abstract. Gene expression indicates the amount of mRNA produced by a gene under a particular biological condition. Genes responsible for changes in the biological conditions of an organism will have different gene expression values across different conditions. Gene expression analysis is useful in transcriptomic studies for analysing the functions of, and interactions among, different molecules inside a cell. A significant analysis is that of a differential gene, i.e., a gene that exhibits a strong change in behaviour between two or more conditions; behavioural cell changes can thus be attributed to the differentially expressed genes. Statistical distributional properties of the read counts that constitute RNA-seq data are used for detecting the differentially expressed genes. In this paper we provide a comparative study of different tools which aid RNA-seq-based differential expression analysis. It is important to note how the results of these tools differ and which tool provides the more statistically significant results.
Keywords: RNA-seq · Differential expression · Differential gene · Empirical study · Differential expression tools

1 Introduction
A gene is a small section of genetic material called DNA which contains genetic information. Every cell in an organism contains the whole genome (i.e. the DNA), which carries heritable information. The DNA is a helical, double-stranded molecule capable of undergoing certain biological processes to produce useful products such as proteins. Protein making goes through two stages: (1) transcription, in which the DNA is converted into mRNA (messenger RNA) molecules with the help of an enzyme called RNA polymerase, and (2) translation, which occurs after the messenger RNA (mRNA) has carried the transcribed 'message' from the DNA to the ribosomes [2], where proteins are made. Though thousands of transcripts are produced every second in each cell, the amounts and types of mRNA molecules in a cell reflect the function of that cell. Gene expression can also be understood as a measure of the activity level of a gene as it goes through the two stages mentioned above. Gene expression levels differ for different genes across cells, and this is indicative of the function of a particular cell. As mentioned earlier, only a fraction of the genes contained in the DNA are used in a cell at a given time. Measuring the expression level of each gene helps biologists understand cell


behaviour and differentiate among cells within the same or different organisms. Effectively measuring the activity/expression level of a particular gene can give us valuable knowledge that associates the gene with the function of a particular cell. Basically, there are two ways in which this can be performed: (1) using a microarray-based technique, and (2) using NGS technology.

A DNA microarray is a microscopic slide impressed with thousands of tiny spots; each spot contains a known DNA sequence or gene, also called a probe. The process begins with collecting mRNA molecules from two samples: experimental and reference. Both samples are then converted into complementary DNA (cDNA) and labelled with differently colored fluorescent probes. The next step is to mix and bind the two samples to the microarray slide; this step is called hybridisation. Post hybridisation, a certain color will be imprinted on each microarray spot. The color on a spot can be (1) red, indicating that the gene's expression level was higher in the experimental sample than in the reference, (2) green, indicating that the gene's expression level was comparatively higher in the reference sample, or (3) yellow, indicating equal expression levels across the two samples. This information is then used to create gene expression profiles, which can show parallel changes in the expression of multiple genes in response to certain biological treatments or conditions. Microarrays offer several advantages, namely (1) robustness, (2) reliability, (3) streamlined handling (they can be easily automated), (4) short turn-around time and (5) lower cost. Despite the many advantages, microarrays have several disadvantages: they require prior knowledge of the sequence, are insensitive to structural variations or isoforms, and have issues with hybridisation and sample-labelling biases.

Another sequencing approach that evolved as a standard technique over microarrays for sequencing and profiling RNA transcripts across the transcriptome is Next Generation Sequencing (NGS). NGS is an umbrella term that covers a number of methods that perform massively parallel sequencing and thus offer very high throughput. The NGS method offers the following advantages: (1) primers/probes are not required, (2) a high degree of non-biased generated data, and (3) the capability of transcript discoveries that cannot be made as effectively by microarray-based techniques [1]. RNA-seq is a powerful NGS method which provides higher resolution because of its digital nature, in contrast to microarrays, which deal with analog data, thereby making even slight biological changes in gene expression noticeable to the analyst. This feature is very useful, especially for differential expression studies. RNA-seq data is also contingent on quality control, since biases get introduced during alignment, sequencing, de novo assembly and expression quantification [2]. The advantages gained over microarrays are that RNA-seq (i) provides an overall transcriptomic view, (ii) does not depend on any prior knowledge of the nucleotide sequence, (iii) has a high dynamic range, (iv) is sensitive to structural variations (alternative splicing events/gene fusions) and (v) produces data that is not continuous but digital in nature (absolute abundance vs. relative abundance). However, the downside is that the NGS technology is still new to some researchers.
Data storage in NGS is a big challenge since the throughput is very high; the analysis is also complex due to a lack of standard protocols for sequencing. In addition, special computing infrastructure must be available, along with expert


personnel, to perform sequencing. Lastly, NGS is more expensive than the microarray-based technique.

Differential expression highlights differential genes, i.e., genes whose read counts differ significantly between two conditions. If the difference is greater than that caused by random variation, the gene is said to be differentially expressed. The difference can be calculated in terms of fold change or simply raw read counts. Statistically significant differences can be infrequent in a dataset, and likelihood analysis techniques such as power analysis exist for assessing them [3]. A variety of tools successfully identify differentially expressed genes using statistical models such as the negative binomial and Poisson distributions; these tools include DESeq, edgeR and limma, to name a few. However, it is found that the results of these tools do not usually agree with each other, i.e., each tool highlights a different set of genes as being differentially expressed. The reasons for these differences will be presented extensively in this paper. We have presented and compared the results of these different tools on given datasets. The following two datasets have been used for the study:
I. The Mouse Mammary Gland dataset (accession number GSE60450), containing 12 samples across the conditions virgin, pregnant and lactating.
II. A dataset of the Homo sapiens species, containing 19 samples across two races, namely Caucasian and African-American.

2 Issues and Challenges RNA sequencing produces large and complex datasets and is not free from bias, which may adversely affect differential expression analysis. This is overcome by using normalisation methods such as CPM (Counts Per Million), TMMM (Trimmed Mean of M values), FPKM (Fragments Per Kilobase Million), etc. Another issue is that transformation of count data into continuous variables without considering distribution of count data creates inconsistencies [4]. Also scRNA seq has inherent problems like zero inflation, batch effects, etc. Even technical issues specific to the NGS technology cannot be overlooked. Sequencing errors, for example, in the output reads can come into view during stages of fragmentation, amplification or reverse-transcription [5–7]. This necessitates careful data quality control and normalisation.

3 Normalisation of RNA-Seq Data The goal of normalisation is to maintain the quanta level between the raw and normalised read counts. When the difference between the normalised reads accounts well for differences in the original expression count, we can conclude that normalisation was done correctly [8]. It is essential for downstream biological analysis and takes into account certain factors like sequencing depth, transcript size, sequencing error rate and GC-content. Normalisation is important because it decides how accurate the gene expression measure will be, and this will aid further analysis. Some of the popular normalisation methods are Relative log expression method implemented in DESeq [9], RPKM (Reads Per Kilobase Million) [10], FPKM (Fragments Per Kilobase Million) [11] and TPM (Transcripts Per

Survey of Methods Used for Differential Expression Analysis on RNA Seq Data

229

There are also measurement error models which compare the estimated performance of these normalisation methods [12]. Quantile normalisation is an effective approach to improve the quality of RNA-seq data containing low read-count measures [13]. Other points concerning normalisation include: 1. Random variables generally depict the number of reads aligned to a given gene under a certain condition. 2. Differences in a gene's read count may be due to differential coverage rather than differential expression [8]. 3. Despite these various issues, Bullard et al. [10] found that the normalisation procedure makes the highest impact on the results of differential expression analysis, with the choice of hypothesis test statistic being comparatively less contributory. 4. A few methods for normalisation are:

I. Normalisation by library size – Removes differences in sequencing depth by dividing each gene's read count by the total number of reads across genes in each sample. II. Normalisation by distribution (DESeq) – Compares distributions of genes across samples, replacing each quantile with the average (or median) of that quantile across all samples (DESeq); a related approach is TMM (Trimmed Mean of M-values, where the M-values are gene-wise log expression ratios), used by edgeR. TMM chooses one sample as the reference sample and then, relative to that sample, calculates parameters such as fold changes and absolute expression levels within the other samples. Another notable method is Median Ratio Normalisation (MRN), in which the read counts are divided by the total count of their sample and then averaged across all samples in a particular condition for a specific gene. III. Normalisation by testing (PoissonSeq) – Uses hypothesis testing to identify the non-differentially expressed genes. An iterative process alternates between two stages: (1) estimating a set of non-DE genes, and (2) estimating the scaling factor for each sample using that set. With the help of the scaling-factor estimates, expected values of read counts are evaluated and the non-differentially expressed genes are filtered using a χ² goodness-of-fit test. IV. Normalisation by controls – Aids in accurate fold-change readings of the expression data. It should be noted that the results of these methods depend on neither the number of genes nor the number of samples. While each of these methods has its own assumptions, Evans et al. [8] assessed the downstream consequences of violating the assumptions of the different normalisation methods. Simulations were run in which the average Mean Squared Error (MSE) on non-DE LFCs (Log Fold Changes) and the average Enhanced False Discovery Rate (eFDR) were computed for varied proportions of differential expression (the proportion of genes which are truly DE), relative amounts of mRNA per cell, and amounts of asymmetry.
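As a concrete illustration of library-size normalisation, the following minimal Python sketch (the count matrix and its values are hypothetical) computes CPM by dividing each sample's read counts by that sample's total read count:

import numpy as np

# Hypothetical read-count matrix: rows = genes, columns = samples.
counts = np.array([
    [120,  90, 300],
    [ 15,  10,  40],
    [500, 420, 980],
])

def cpm(count_matrix):
    """Library-size normalisation: scale each sample's counts
    by its total read count, expressed per million reads."""
    library_sizes = count_matrix.sum(axis=0)   # total reads per sample
    return count_matrix / library_sizes * 1e6

print(cpm(counts))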


4 Differential Gene Expression Differential gene expression is the means by which the cells of an organism differentiate even though the DNA in all cells is identical. Only a small percentage of the genome is expressed in each cell, and the cell type decides which RNA will be expressed in that cell. A differentially expressed gene is one which has a statistically significant difference or change in read counts between two biological conditions of the organism. 4.1 Mathematical Definition Mathematically, a gene Gi with expression levels E1 and E2 under two conditions C1 and C2 can be called differentially expressed if either of the following two conditions holds: I. The log fold change (LFC) across C1 and C2 exceeds a threshold T, i.e. LFC = log2(E1/E2) > T. If the LFC is positive, gene expression in condition 1 is higher than in condition 2 and the gene is said to be up-regulated. If the LFC is negative, gene expression in condition 2 is higher than in condition 1 and the gene is said to be down-regulated. If the LFC is zero, the gene expression levels in the two conditions are equal. If a gene expresses twice as much in condition 1 as in condition 2, then the fold change (FC) = E1/E2 = 2 and LFC = log2(FC) = log2 2 = 1. II. If μ1 and μ2 are the respective sample means for conditions C1 and C2, there is a noticeable difference in the variances σ1² and σ2² of a given gene G across these conditions. Basically, the null hypothesis σ1² = σ2² is tested against the alternative hypothesis σ1² ≠ σ2². If a statistical test rejects the null hypothesis, the alternative hypothesis is accepted and the gene can be termed a differentially expressed gene; the change in condition can then be attributed to the differentially expressed gene. It is important to analyse RNA-seq data across transcriptomes of different developmental stages of an organism, as this will unravel the genes that make diseased cells behave differently from normal cells. This type of analysis can be performed using Differential Gene Expression analysis [2].
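A worked sketch of definition I, with hypothetical expression values and a hypothetical threshold:

import math

def classify_by_lfc(e1, e2, threshold=0.5):
    """Classify a gene by log2 fold change between conditions C1 and C2.
    LFC = log2(E1/E2); positive LFC means higher expression in condition 1."""
    lfc = math.log2(e1 / e2)
    if lfc > threshold:
        return lfc, "up-regulated"
    if lfc < -threshold:
        return lfc, "down-regulated"
    return lfc, "not differentially expressed"

# The example from the text: twice the expression in condition 1 gives LFC = 1.
print(classify_by_lfc(200.0, 100.0))   # (1.0, 'up-regulated')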

5 Differential Gene Expression Analysis Differential Expression (DE) Analysis has been widely and effectively used for disease diagnosis. It takes the results of transcriptome profiling performed by NGS that gives RNA-seq reads and uses statistical means and validation techniques to identify genes which are differentially expressed. A differentially expressed gene will transcribe differently, i.e. have a significant difference (beyond a specified threshold and score cut-off) in the amount/abundance of mRNA across different developmental stages of an organism, e.g. normal vs. diseased states.


There are issues with RNA sequencing that affect the reliability of the counts/reads mapped to a genome, and DE analysis should take them into account. The issues are (i) the inherent tendency of the NGS technology to introduce biases and errors, (ii) bias due to varying gene or transcript lengths, nucleotide composition and sequencing depth, (iii) differences in counts due to alternative gene isoforms, and (iv) variations in biological and/or technical replicates that create difficulty in accurately discriminating true differentially expressed genes from false ones [2]. An effective way around these issues is normalisation; several normalisation methods have been outlined in Sect. 3 of this paper. A basic workflow of RNA sequence analysis is outlined in Fig. 1.

Fig. 1. RNA sequencing workflow


The workflow of differential gene expression analysis is depicted in Fig. 2. For a given reference transcriptome (RNA-seq) or genome, a basic workflow is followed for the detection of DE genes: (i) mapping reads, (ii) computation of counts, (iii) normalisation, and (iv) identification of differentially expressed genes. Step (iv) involves detecting differences/fold changes in the normalised read counts. Any differential expression analysis method has to validate the statistical significance of the difference through tests such as the Wald test, Fisher's exact test, F-test, t-test, likelihood ratio test, p-values and q-values, etc. The threshold/cut-off used to distinguish DE genes from non-DE genes is decided by studying the distribution of the read counts. The null hypothesis for differential expression between two conditions states that the mean value μ is the same in both conditions [2]. The Negative Binomial distribution, Poisson distribution, Gamma distribution, etc. can model RNA-seq data, out of which the Negative Binomial distribution is considered the most suitable fit for sequence data.

[Fig. 2 summarises the stages of the analysis: preprocessing (low-count filtering, bias removal, normalisation), differential expression analysis (parametric vs. non-parametric; packages include TMM, DESeq, PoissonSeq, NOISeq, DESeq2, edgeR, baySeq and limma+voom), and splicing-event analysis (alternative splicing, isoform expression).]

Fig. 2. Workflow of differential gene analysis

The number of reads that map to a given gene is relatively small compared to the total reads (across all genes) per sample, and the read counts are non-negative integers. A Poisson distribution, which models the random read count for a gene drawn from the pool of read counts across the sample, therefore seems like a good choice. However, the Poisson distribution fails to capture the high variability of RNA-seq data, since it assumes that the variance and the mean are equal. The Recount project included the "Bottomly" experiment, which modelled the gene-wise means versus variances (Fig. 3). The code for this experiment can be found at https://github.com/bioramble/sequencing/blob/master/nb.R.
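A rough Python analogue of that experiment (simulated counts with a hypothetical mean and dispersion, not the Bottomly data) illustrates the overdispersion discussed next:

import numpy as np

rng = np.random.default_rng(0)
mu, alpha, n = 100.0, 0.3, 100_000   # hypothetical mean and dispersion

# Parameterise NumPy's negative binomial from (mu, alpha):
# variance = mu + alpha*mu^2, with r = 1/alpha successes and p = r/(r + mu).
r = 1.0 / alpha
p = r / (r + mu)
nb_counts = rng.negative_binomial(r, p, size=n)
pois_counts = rng.poisson(mu, size=n)

print("NB      mean/var:", nb_counts.mean(), nb_counts.var())     # var ~ 3100
print("Poisson mean/var:", pois_counts.mean(), pois_counts.var()) # var ~ 100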


Fig. 3. Plot of gene expression versus variance

Figure 3 clearly indicates that for genes with high expression levels, the variance exceeds the mean. This is known as overdispersion. The Poisson distribution cannot account for it, so any statistical test based on it cannot control the type-I error [15]. This excess variance is fitted well by the Negative Binomial distribution, as it takes into account an additional parameter known as the dispersion. Overdispersion occurs because biological replicates may have different transcription levels even under the same conditions. For technical replicates, overdispersion does not occur, in which case the Poisson distribution is a good fit. The variance σ² can be expressed mathematically as σ² = μ + αμ², where α represents the dispersion parameter and μ the mean of the distribution. Note that when α = 0 the distribution is Poisson; otherwise the variance is always greater than the mean. 5.1 Tools Used for Differential Gene Expression Analysis Many algorithms have been developed so far, and there exist tools based on these algorithms which aim to efficiently detect differentially expressed genes. A major concern here is minimising the False Discovery Rate (FDR). These techniques take a dataset which is eventually represented as a matrix of read counts. Normalisation is performed on the read counts to account for bias due to varying gene lengths or technical/biological replicates. DE analysis methods are categorised into parametric and non-parametric types. Parametric methods have the limitation that they can be influenced by outliers (extreme points which fall outside the fences of the dataset) [2]. Benjamini et al. [16] proposed an effective method to control the FDR. This paper discusses three major parametric tools used for differential expression analysis and also presents the results of the empirical study performed on the two datasets outlined towards the end of Sect. 1.
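A minimal sketch of the Benjamini–Hochberg adjustment [16] mentioned above (the input p-values are hypothetical):

import numpy as np

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values).
    Each sorted p-value p_(i) is scaled by m/i, then a cumulative
    minimum taken from the largest rank enforces monotonicity."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]  # monotone q-values
    q = np.empty(m)
    q[order] = np.clip(ranked, 0, 1)
    return q

pvals = [1e-6, 0.003, 0.04, 0.20, 0.65]   # hypothetical test results
print(benjamini_hochberg(pvals))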


I. edgeR: It takes a classical hypothesis-testing-based approach [2]. edgeR [17] first uses a weighted trimmed mean of the log expression ratios for normalising the counts, and then uses the Negative Binomial model to detect the differentially expressed genes. The expression data are fit to the NB model as

v = μ + αμ²

where μ and v denote the mean and variance respectively, and α is the dispersion factor. The dispersion factor is estimated by using a likelihood function to combine two components: (1) a common dispersion across all genes, and (2) a gene-specific dispersion, estimated by the empirical Bayes method [15]. Then an exact test with FDR control is used to determine the differentially expressed genes. It is suggested in [17] that edgeR is stable under different sequencing depths. Two notable functionalities of edgeR, Classic and GLM (Generalised Linear Model), are based on the empirical Bayes method, which allows estimation of gene-specific biological variation even for experiments with the slightest levels of biological replication [18]. edgeR is a well-documented and widely used package [2] and can be run through the Bioconductor framework in R. II. DESeq2: DESeq2 is a newer version of the DESeq package. Based on the negative binomial distribution model, it allows gene-specific shrinkage estimation of the dispersions instead of using a fixed normalisation factor as in DESeq [15]. The dispersion is estimated using genes with similar average expression values; for genes with small average expression values, the fold change is inspected. It produces more false positives, though it can detect more differentially expressed genes. Chowdhury et al. [2] mention that the following cases set the results of DESeq to NA: (1) a row contains an outlier, (2) all samples in a row have zero counts, (3) a row with a low mean normalised count is filtered automatically. III. limma + voom: limma is a parametric differential expression analysis method based on linear modelling, and it can cope with small sample sizes. Differential expression, differential splicing and expression profile analysis can be performed in terms of co-regulated sets of genes from RNA-seq data. limma exploits the fact that the power to detect differentially expressed genes increases when quantitative weights are incorporated into all levels of the statistical analysis (from normalisation to linear modelling and gene set testing) [18]. It also uses a variance-stabilising strategy [2]: the read counts are converted to the log scale and the mean-variance relationship is estimated empirically. The log-transformed read counts are then analysed by incorporating the precision weights obtained from the mean-variance trend via the voom function. The effect of outliers is minimal; however, limma needs at least 3 samples to detect differentially expressed genes with a high level of accuracy [19].
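To make the dispersion-estimation step concrete, the following simplified sketch estimates a common dispersion α by the method of moments from v = μ + αμ²; this is an illustrative stand-in rather than edgeR's likelihood-based estimator, and the replicate counts are hypothetical:

import numpy as np

# Hypothetical replicate counts for a handful of genes (rows = genes).
counts = np.array([
    [ 98, 110, 130,  87],
    [ 20,  35,  18,  41],
    [510, 620, 480, 700],
], dtype=float)

mu = counts.mean(axis=1)
var = counts.var(axis=1, ddof=1)

# Method-of-moments estimate of a common dispersion alpha:
# v = mu + alpha*mu^2  =>  alpha = (v - mu)/mu^2, averaged over genes.
# (edgeR instead combines common and gene-wise dispersions by likelihood.)
alpha = np.mean((var - mu) / mu**2)
print("common dispersion estimate:", alpha)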


5.2 Results of the Study on Differential Expression Packages The code for differential expression was implemented in R on the two datasets with the help of the edgeR, limma and DESeq2 packages. The results of the study are presented next. Results for Dataset 1: the Mouse Mammary Gland dataset containing 12 samples across three conditions, namely virgin, pregnant and lactating mice. Table 1 lists the top 5 differentially expressed genes as identified by edgeR, which uses the Negative Binomial distribution. The biological coefficient of variation, i.e. the square root of the dispersion, was found to be 0.7547. The top 5 differentially expressed genes identified by DESeq2 and limma are listed in Tables 2 and 3 respectively.

Table 1. Differentially expressed genes identified by edgeR

Gene_ID   logFC      logCPM    Pvalue      FDR
71874     7.88386    2.20493   6.84E−13    1.86E−08
100340    2.96438    3.57286   3.73E−10    5.06E−06
56520     2.975072   2.30862   9.37E−08    6.98E−04
16906     1.86689    5.64137   1.25E−07    6.98E−04
12704     2.62806    1.62348   1.28E−07    6.98E−04

Table 2. Differentially expressed genes identified by DESeq2

Gene_ID   logFC         Pvalue      adj. Pvalue
67111     −2.4816813    1.24E−55    2.01E−51
69237     0.99613866    2.55E−44    2.07E−40
66234     −2.0993087    5.17E−29    2.80E−25
22224     0.93521045    3.92E−25    1.60E−21
67619     1.07464486    8.56E−22    2.79E−18

Results for Dataset 2: this dataset contains 19 samples of Homo sapiens across two races (conditions), namely Caucasian and African-American. The top 5 differentially expressed genes identified by edgeR are listed in Table 4. The biological coefficient of variation, i.e. the square root of the common dispersion, was found to be 1.365. The top 5 differentially expressed genes identified by DESeq2 and limma are listed in Tables 5 and 6 respectively. We present a brief analysis of some of the results obtained from the empirical study. The statistical tests were performed with a cut-off value of p < 0.05, and the results are presented in Table 7 for reference.

Table 3. Top differentially expressed genes identified by limma

Gene_ID   logFC       Pvalue      adj. Pvalue
21953     −5.823245   3.47E−06    0.02449594
67111     −2.537943   3.99E−06    0.02449594
72515     1.935211    5.95E−06    0.02449594
232016    −2.595871   7.74E−06    0.02449594
329739    −1.52075    7.75E−06    0.02449594

Table 4. List of DE genes identified by the edgeR package

Gene_ID           logFC          logCPM         PValue      FDR
ENSG00000205609   −7.8021879     −1.597197702   3.41E−08    0.00179253
ENSG00000171551   −6.236179545   3.820516615    1.92E−06    0.0504448
ENSG00000173714   −5.457787615   2.574744131    3.73E−06    0.065335125
ENSG00000225614   −4.201549703   4.49217671     5.49E−06    0.072161455
ENSG00000182749   −3.498896989   6.982436698    1.79E−05    0.187983965

Table 5. List of DE genes identified by the DESeq2 package

Gene_ID           log2FoldChange   pvalue      padj
ENSG00000129824   0.73601228       7.67E−10    8.13E−06
ENSG00000151704   0.690237918      4.67E−08    0.00020738
ENSG00000135374   0.722019304      5.87E−08    0.00020738
ENSG00000186458   0.545526721      3.50E−07    0.000898833
ENSG00000113889   0.486485768      4.24E−07    0.000898833

For Dataset 1, DESeq2 and limma showed the highest number of overlapping genes (15804). For Dataset 2, edgeR and DESeq2 showed the highest number of overlapping genes (12565). The obtained results indicate that all tools give statistically significant results (p < 0.05) with a fairly low degree of false positives (…)

… for f_c > 300 MHz. The median path loss for a suburban area can be expressed as

L_{50,\mathrm{suburban}}(\mathrm{dB}) = L_{50,\mathrm{urban}}(\mathrm{dB}) - 2\left[\log_{10}(f_c/28)\right]^{2} - 5.4 \qquad (7)

Hence, Eq. (5) may be expressed as

R = \sum_{n=1}^{N_{sc}(k)} \frac{\Delta f}{N_{sc}} \log_{2}\!\left(1 + \frac{P_T\, G_{T(i,j)}\, h_{(i,j)} / L_{50,\mathrm{suburban}(i,j)}}{k\, T_K\, B + \sum_{m=1,\, m \neq j}^{J} P_{R(i,m)} / L_{50,\mathrm{suburban}(m,j)}}\right) \qquad (8)
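A minimal sketch of the path-loss computation, assuming the standard Okumura-Hata urban formula [13] with the small/medium-city mobile-antenna correction; the distance value is illustrative, while the other inputs come from Table 1:

import math

def hata_urban_db(fc_mhz, h_bs, h_ue, d_km):
    """Okumura-Hata median urban path loss (dB), small/medium city.
    fc in MHz, antenna heights in metres, distance in km."""
    a_hre = ((1.1 * math.log10(fc_mhz) - 0.7) * h_ue
             - (1.56 * math.log10(fc_mhz) - 0.8))   # mobile antenna correction
    return (69.55 + 26.16 * math.log10(fc_mhz)
            - 13.82 * math.log10(h_bs) - a_hre
            + (44.9 - 6.55 * math.log10(h_bs)) * math.log10(d_km))

def hata_suburban_db(fc_mhz, h_bs, h_ue, d_km):
    """Suburban correction of Eq. (7) applied to the urban median loss."""
    return (hata_urban_db(fc_mhz, h_bs, h_ue, d_km)
            - 2.0 * math.log10(fc_mhz / 28.0) ** 2 - 5.4)

# Table 1 values: fc = 1 GHz, BS height 32 m, UE height 1.5 m; d = 1 km here.
print(round(hata_suburban_db(1000.0, 32.0, 1.5, 1.0), 2), "dB")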

3.2 Base Station Antenna Radiation Pattern Modelling In general, an array of dipole elements is used for the BS antenna in a mobile network [14]. The angle between the main beam lobe and the horizontal plane is referred to as the antenna tilt [6]. The main lobe, side lobes and back lobes can be adjusted by using electrical downtilt. If the antenna is up-tilted from its current tilt, the change in angle is negative; if it is down-tilted, the angle is positive. The total antenna tilt angle, illustrated in Fig. 2, can be expressed as

θ_tilt = θ_electrical + θ_mechanical

For three-sector cell sites, the horizontal and vertical radiation patterns can be combined to form a 3D radiation pattern as in [15]:

A(φ, θ) = −min{−[A_H(φ) + A_V(θ)], A_m} (dBi)    (12)

where A_m is the maximum horizontal attenuation (25 dB).

Fig. 2. A base station antenna radiation pattern using antenna tilt

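A short sketch of the sectored pattern in Eq. (12), assuming the parametric horizontal and vertical patterns of 3GPP TR 36.814 [15] with the beamwidths and attenuation limits of Table 1; the tilt and angles in the example are illustrative:

import math

AM, SLA_V = 25.0, 20.0           # max attenuation / side-lobe level (Table 1)
PHI_3DB, THETA_3DB = 70.0, 10.0  # half-power beamwidths (Table 1), degrees

def a_h(phi_deg):
    """Horizontal pattern: -min[12*(phi/phi_3dB)^2, Am] (dB)."""
    return -min(12.0 * (phi_deg / PHI_3DB) ** 2, AM)

def a_v(theta_deg, tilt_deg):
    """Vertical pattern: -min[12*((theta - tilt)/theta_3dB)^2, SLAv] (dB)."""
    return -min(12.0 * ((theta_deg - tilt_deg) / THETA_3DB) ** 2, SLA_V)

def a_3d(phi_deg, theta_deg, tilt_deg):
    """Combined 3D pattern of Eq. (12)."""
    return -min(-(a_h(phi_deg) + a_v(theta_deg, tilt_deg)), AM)

# Illustrative: gain loss 20 degrees off boresight with a 4-degree downtilt.
print(a_3d(20.0, 6.0, 4.0), "dB")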


3.3 Application of the Reinforcement Learning Algorithm to the Current Scenario In order to optimize the cell coverage by using antenna tilt, the learner is an instance of an algorithm, either distributed across the BSs or placed in a centralized unit in the mobile network. Considering the network model designed in this paper, the reinforcement learning (RL) is based on the data available at each BS, i.e. the relative distance (d) to each served UE, the received power (P_R(i,j)) of the UE, the interference power (P_I(i)) from the neighbouring cells, and the weighting factor w_i for each UE, which determines the user priority (0 < w_i ≤ 1; w_i = 1 means the user has high priority for achieving a higher data rate). The basic operating principle of the proposed method for coverage optimization using BS electrical antenna tilt is shown in the flowchart of Fig. 3 [8].

Fig. 3. Flowchart of the algorithm for coverage optimization using electrical Antenna tilt
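The learning loop of the flowchart can be sketched as a simple ε-greedy Q-learning agent over the discrete tilt grid of Table 1; this is an illustrative stand-in rather than the exact algorithm of [8], and measure_reward is a hypothetical hook for the weighted-rate feedback:

import random

TILTS = [0, 2, 4, 6, 8, 10]    # electrical tilt grid from Table 1 (degrees)
ACTIONS = [-2, 0, 2]           # down-tilt step, hold, up-tilt step

def measure_reward(tilt):
    """Hypothetical hook: weighted sum-rate observed at this tilt."""
    return -abs(tilt - 6) + random.gauss(0, 0.1)  # toy objective, peak at 6

q = {(t, a): 0.0 for t in TILTS for a in ACTIONS}
tilt, alpha, gamma, eps = 0, 0.2, 0.9, 0.1

for step in range(2000):
    feasible = [a for a in ACTIONS if tilt + a in TILTS]
    if random.random() < eps:                     # epsilon-greedy exploration
        action = random.choice(feasible)
    else:
        action = max(feasible, key=lambda a: q[(tilt, a)])
    new_tilt = tilt + action
    reward = measure_reward(new_tilt)
    best_next = max(q[(new_tilt, a)] for a in ACTIONS if new_tilt + a in TILTS)
    q[(tilt, action)] += alpha * (reward + gamma * best_next - q[(tilt, action)])
    tilt = new_tilt

print("learned tilt:", tilt)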

4 Results and Analysis In this section, the results of simulation tests of the proposed solution are provided. The specifications of the parameters used in the simulation are listed in Table 1.

Table 1. Parameter specifications

Parameters                                 Specifications
Field dimensions                           5000 m × 5000 m
Frequency bandwidth [15]                   10 MHz
Number of users                            5
Carrier frequency                          fc = 1 GHz
Frequency reuse factor                     1
Number of subcarriers                      Nsc = 5000
Call duration                              60 s
Minimum SINR threshold                     5 dB

Antenna parameters                         Specifications
Maximum horizontal attenuation             Am = 25 dB
Horizontal half-power beam width [15]      φ3dB = 70°
Vertical half-power beam width [15]        θ3dB = 10°
Vertical side-lobe attenuation             SLAv = 20 dBi
Minimum electrical tilt angle              θb,min = 0°
Maximum electrical tilt angle              θb,max = 10°
Change in electrical tilt angle per step   Δθb = 2°
Maximum gain of the transmitting antenna   GT,max = 14 dBi
Transmitting power [15]                    PT = 40 W (46 dBm)
BS antenna height                          hj = 32 m
UE antenna height [15]                     hi = 1.5 m

The simulation tests are carried out using a three-sector antenna with five users distributed randomly within the network, where the propagation path is based on the Okumura-Hata model for suburban areas. The red delta symbol (Δ) represents the user locations in the network. Initially the angles of the three antennas are set at 0°, 120° and 240° respectively, with the main lobe of the antenna pattern pointed along the x-axis at an angle of 0°; the tilting angle is incremented in the counter-clockwise direction. The received power map obtained after the completion of the exploration phase of the algorithm is shown in Fig. 4. Figures 4 and 5 show the received power map after optimization and the received power level of each user served by the BS in the corresponding network, respectively. In Fig. 4 it is clearly seen that with the proposed solution the coverage area can be optimized based upon the user distribution: two users in cell 3 are placed at the cell edge, so to provide the desired power level the antenna in cell 3 is up-tilted in an optimized manner, reducing power leakage towards the other cells.


Fig. 4. Optimized received power-map

Fig. 6 shows the SINR level of each user based upon their distribution. It indicates that out of the five users, four obtain a sufficient SINR level to be served by the BS, i.e. about 80% of the total number of users.

Fig. 5. Received signal power by each user


Fig. 6. SINR level of each UE based on their respective distance from the BS.

5 Conclusion In this paper, we propose a novel BS antenna tilt approach for coverage optimization in mobile networks. An existing RL-based algorithm is implemented to illustrate the proposed solution and used to analyze the antenna tilt mechanism in suburban scenarios. The simulation is carried out for a suburban scenario using the parameters listed in Table 1. The simulation results show that with the proposed solution the coverage area can be optimized in such a way that cell-edge users obtain a significant improvement in their received power and SINR levels. By optimizing the downlink power through proper antenna tilt, a significant improvement in the SINR level together with a reduction in interference to the other cells can be observed. Due to the model-independent nature of the proposed algorithm, it can also be used in future-generation mobile networks for interference reduction and energy-saving purposes.

References

1. Hamalainen, S., Sanneck, H., Sartori, C.: LTE Self-Organising Networks (SON): Network Management Automation for Operational Efficiency. Wiley, Hoboken (2011)
2. Yilmaz, O.N.C., Hamalainen, S., Hamalainen, J.: System level analysis of vertical sectorisation for 3GPP LTE. In: Proceedings of the 6th IEEE International Symposium on Wireless Communication Systems, Tuscany, pp. 453–457 (2009)
3. Siomina, I., Varbrand, P., Yuan, D.: Automated optimization of service coverage and base station antenna configuration in UMTS networks. IEEE Wirel. Commun. 13(6), 16–25 (2006)
4. Athley, F., Johansson, M.N.: Impact of electrical and mechanical antenna tilt on LTE downlink system performance. In: 71st IEEE Vehicular Technology Conference (VTC 2010-Spring) (2010)
5. Parikh, J., Basu, A.: Impact of base station antenna height and antenna tilt on performance of LTE systems. IOSR J. Electr. Electron. Eng. (IOSR-JEEE) 9(4), 6–11 (2014)
6. Yilmaz, O.N.C., Hamalainen, S., Hamalainen, J.: Comparison of remote electrical and mechanical antenna downtilt performance for 3GPP LTE. In: 70th IEEE Vehicular Technology Conference Fall (VTC 2009-Fall), pp. 1–5 (2009)
7. Li, J., Zeng, J., Su, X., Luo, W., Wang, J.: Self-optimization of coverage and capacity in LTE networks based on central control and decentralized fuzzy Q-learning. Int. J. Distrib. Sensor Netw. 8(8), 878595 (2012)
8. Dandanov, N., Al-Shatri, H., Klein, A., Poulkov, V.: Dynamic self-optimization of the antenna tilt for best trade-off between coverage and capacity in mobile networks. Wirel. Pers. Commun. 92(1), 251–278 (2017)
9. Yilmaz, O.N.C., Hamalainen, J., Hamalainen, S.: Self-optimization of remote electrical tilt. In: 21st IEEE International Symposium on Personal Indoor and Mobile Radio Communications (PIMRC), pp. 1128–1132 (2010)
10. Razavi, R., Klein, S., Claussen, H.: Self-optimization of capacity and coverage in LTE networks using a fuzzy reinforcement learning approach. In: 21st IEEE International Symposium on Personal Indoor and Mobile Radio Communications (PIMRC), pp. 1865–1870 (2010)
11. Thampi, A., Kaleshi, D., Randall, P., Featherstone, W., Armour, S.: A sparse sampling algorithm for self-optimisation of coverage in LTE networks. In: International Symposium on Wireless Communication Systems (ISWCS), pp. 909–913 (2012)
12. Berger, S., Fehske, A., Zanier, P., Viering, I., Fettweis, G.: Online antenna tilt-based capacity and coverage optimization. IEEE Wirel. Commun. Lett. 3(4), 437–440 (2014)
13. Goldsmith, A.: Wireless Communications. Cambridge University Press, Cambridge (2005)
14. Balanis, C.A.: Antenna Theory: Analysis and Design, 3rd edn. Wiley, Hoboken (2005)
15. 3GPP TR 36.814 V9.0.0: Evolved Universal Terrestrial Radio Access (E-UTRA); Further advancements for E-UTRA physical layer aspects (Release 9) (2010)

A Survey of the Different Itemset Representation for Candidate Generation

Carynthia Kharkongor and Bhabesh Nath
Tezpur University, Tezpur 784028, Assam, India
{caryn,bnath}@tezu.ernet.in

Abstract. Itemset representation is a pivotal part of association rule mining. The itemset representation determines how the itemsets of a dataset are stored in memory, and different data structures are used for this purpose; some common ones are the linked list and the array. The efficiency of an association rule mining algorithm largely depends on the way the itemsets are stored, since the execution time and memory consumption play a vital role in determining the performance of the mining algorithm. In this paper, a study of the data structures used for itemset representation is presented. The different data structures are tested on different datasets for the generation of candidate itemsets, and their performance in the candidate generation process is analysed. Keywords: Itemset · Candidate · Data structure · Frequent itemset · Array

1 Introduction Frequent itemset mining is one of the most well-researched areas in association rule mining [10]. The representation of itemsets in memory plays a crucial role because it is one of the deciding factors of how much memory is needed. When we are dealing with large datasets, the memory requirement becomes a vital concern; sometimes the dataset is so large that it does not fit in the main memory, and additional memory is required, which incurs both cost and time. From the frequent itemsets, the association rules are generated; based on this, Agrawal and Srikant formulated the problem of association rule mining. Given a dataset D with transactions T = {t1, t2, t3, …, tn} over itemsets I = {i1, i2, i3, …, in}, each item in an itemset represents an element. Before exploring the problem of association rule mining, two important metrics need to be discussed: – Support: the number of transactions in which the itemset appears across the entire dataset, elucidated as Support = frequency of the itemset/total number of transactions in the dataset. The itemsets whose support is greater than the minimum threshold value are called frequent itemsets [6, 21].


– Confidence: defines the validity of a rule over the itemsets. The confidence of a rule A → B measures how often item B occurs in transactions that contain item A, and is computed as Confidence(A → B) = Support(A ∪ B)/Support(A) [3, 21]. The problem of association rule mining is divided into two sub-problems: – The first is the generation of the frequent itemsets from the dataset. – The second is the generation of the rules from the frequent itemsets.
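A minimal sketch of both metrics over a toy transaction database (the transactions are hypothetical):

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, db):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """Confidence of the rule A -> B: support(A u B) / support(A)."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # 2/3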

2 Layout for Itemset Representation The dataset can be organised in the memory in different ways, mostly in three layouts: vertical, horizontal and projected database. – Vertical layout: the itemsets are represented using TIDs. The items are arranged row-wise, and a transaction's TID is marked against an item if the item is present in that transaction [13]. The advantage of the vertical layout is that the support count of an itemset is computed easily by intersecting the tidlists (a tidlist-intersection sketch is given after this list). Algorithms such as Eclat use the vertical layout; Eclat is an equivalence-class-based algorithm which is more efficient for huge datasets than for small ones [14, 23]. Another algorithm, the fast vertical mining algorithm introduced by Zaki et al., keeps track of the differences between tids; representing the itemsets using such diffsets reduces the memory space [22]. – Horizontal layout: the rows represent the transactions, and an item is marked in the row of a particular TID if it occurs in that transaction [13]. The first algorithm to use the horizontal layout was the Apriori algorithm [2]. Its disadvantage is that it generates both frequent and infrequent itemsets, which wastes both time and space while running the algorithm. The DIC algorithm also uses the horizontal approach; it can repeatedly scan the database, and a new itemset can be added at any point [5]. Park et al. proposed the Dynamic Hashing and Pruning (DHP) algorithm, derived from the Apriori algorithm with an additional control; its main advantage is that it is space-efficient, discarding attributes which appear to be useless [15]. Lin et al. introduced the Pincer-Search algorithm, which combines both directions, i.e. bottom-up and top-down: it prunes the candidate itemsets bottom-up, and one feature of this algorithm is that it reduces the number of database scans [12]. CBAR, proposed by Tsay and Chiang, requires only a single scan of the dataset and compares against a partial cluster table; this reduces the time required to scan the database and also prunes the candidate itemsets [20]. An example of the vertical and horizontal layouts for itemset representation is given in Fig. 1.


Fig. 1. An example of layout representation of itemsets

– Projected database: the FP-tree algorithm, which uses the projected database layout, was introduced by Han and Pei. It generates the frequent itemsets with only two scans of the dataset, using an extended tree structure that stores only the crucial information about the frequent patterns [8]. Agarwal et al. proposed the Tree Projection algorithm, which constructs a tree such that the large dataset is projected onto a reduced dataset on the basis of the pruned frequent itemsets [1]. Shrivastava et al. introduced an approach based on the FP-tree and the co-occurrence of frequent items (COFI) for finding frequent itemsets using a recursive mining process [18]. Sahaphong and Boonjing proposed the Inverted Index Structure (IIS) for frequent itemset mining, which scans the entire dataset only once; re-scanning is not required even if the minimum support threshold changes. Its performance is better on dense datasets, with a lower run time, less memory and good scalability [17].
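As noted in the vertical-layout item above, support counting in that layout reduces to tidlist intersection; a minimal Eclat-style sketch with hypothetical tidlists:

# Hypothetical vertical layout: item -> set of transaction IDs (tidlist).
tidlists = {
    "A": {1, 2, 4, 5},
    "B": {2, 3, 5},
    "C": {1, 2, 5, 6},
}

def itemset_support_count(items, tidlists):
    """Support count of an itemset = size of the intersection of its tidlists."""
    tids = set.intersection(*(tidlists[i] for i in items))
    return len(tids)

print(itemset_support_count(["A", "B"], tidlists))       # |{2, 5}| = 2
print(itemset_support_count(["A", "B", "C"], tidlists))  # |{2, 5}| = 2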

3 Data Structures for Itemset Representation A data structure determines the way the data are stored efficiently in memory. Many types of data structures are used for storing the frequent itemsets; the most common ones are discussed below: 1. Array: Using this data structure, each element consumes 4 bytes. So, with 10 items in the dataset, the number of possible candidate itemsets is 2^10 = 1024 and the memory consumption is 4 × 1024 = 4096 bytes. The H-mine algorithm uses both tree and array representations of itemsets [16]. Grahne and Zhu implemented the FP-tree algorithm using an array technique that greatly improved the performance of the algorithm [7]. The itemset representation using an array is shown with the help of an example: the itemset I = {1, 6, 11, 23, 30, 31} represented as an array is shown in Fig. 2. 2. Linked List: This data structure has two parts per node: one for storing the element and the other for storing a pointer; the linked list uses 4 bytes for an element and another 4 bytes for the pointer. With 10 items, the number of candidate itemsets is 2^10 = 1024 and the memory used is 8 (= 4 + 4) × 1024 = 8192 bytes. Algorithms that use linked lists include the FP-growth algorithm [9], the Transaction Mapping algorithm [19], Trie-based Apriori [4] and so on.


Fig. 2. Array representation for the itemsets I = {1, 6, 11, 23, 30, 31}

The itemset I above, represented as a linked list, is shown in Fig. 3.

Fig. 3. Linked list representation for the itemsets I = {1, 6, 11, 23, 30, 31}

3. Set Representation: In this representation, each item is marked by '1' if it is present in the itemset and '0' if it is absent, so each item is represented by one bit. With 10 items and 2^10 = 1024 itemsets, the total memory consumption is 1 × 1024 = 1024 bits, or 128 bytes [11]. The itemset above, in the set representation, is given in Fig. 4.

Fig. 4. Set representation for the itemsets I = {1, 6, 11, 23, 30, 31}
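The set representation maps naturally onto machine words; the following sketch (the encoding choice is ours, with the running example itemset from Fig. 4) stores itemsets as Python integers used as bit vectors, so union, intersection, subset and membership become single bitwise expressions:

def to_bitset(itemset):
    """Encode an itemset as an integer bit vector: bit i is set iff item i is present."""
    bits = 0
    for item in itemset:
        bits |= 1 << item
    return bits

I = to_bitset({1, 6, 11, 23, 30, 31})   # the example itemset from Fig. 4
J = to_bitset({1, 6, 23})

union        = I | J                    # bitwise OR  = set union
intersection = I & J                    # bitwise AND = set intersection
is_subset    = (J & I) == J             # is J a subset of I?
membership   = bool(I & (1 << 11))      # is item 11 in I?

print(bin(union), bin(intersection), is_subset, membership)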

In association rule mining, many set operations need to be performed: union, intersection, subset, superset and membership. The data structures greatly affect the performance of these operations; with different data structures, the time for performing the set operations differs, and the memory consumption of the itemsets also varies. The complexities of the different data structures for the different set operations are given in Table 1.

4 Experimental Analysis The different data structures were tested on three datasets for candidate itemset generation. With support values of 1%, 2.5% and 5%, the frequent itemsets were generated for each of the three datasets. The three datasets used are as follows: 1. Dataset 1: consists of 50 attributes with a dataset size of 1000. The result of candidate generation for the different data structures on this dataset is shown in Table 2.


Table 1. The complexity of the different set operations for the array, linked list and set representations

Set operation            Linked list   Array       Set
Union operation          O(mn)         O(m + n)    O(1)
Intersection operation   O(mn)         O(m + n)    O(1)
Subset operation         O(mn)         O(mn)       O(1)
Superset operation       O(mn)         O(mn)       O(1)
Membership operation     O(n)          O(log n)    O(1)

Table 2. Memory and time consumption for candidate generation on Dataset 1 using the three data structures

Data structure   Support = 1%            Support = 2.5%          Support = 5%
                 Memory (KB)  Time (ms)  Memory (KB)  Time (ms)  Memory (KB)  Time (ms)
Linked list      137617       851435     90106        605538     68187.9      433710
Array            22350        281234     21790        261099     21340        250980
Set              13510        39345      13360        35213      13210.9      16091

2. Dataset 2: has a dataset size of 2000 with 50 attributes. This dataset was tested for candidate generation at varying support values for the different data structures; the results are given in Table 3.

Table 3. Memory and time consumption for candidate generation on Dataset 2 using the three data structures

Data structure   Support = 1%            Support = 2.5%          Support = 5%
                 Memory (KB)  Time (ms)  Memory (KB)  Time (ms)  Memory (KB)  Time (ms)
Linked list      211989       1033760    211201       922479     210020       723347
Array            152333       116745     125589       108901     98341        87889
Set              13589        102900     13421        922334     13312        52345

3. Dataset 3: similarly, this dataset has 50 attributes with a size of 3000. Candidate generation at varying support values using the different data structures was tested on this dataset, as shown in Table 4.


Table 4. Memory and time consumption for candidate generation on Dataset 3 using the three data structures

Data structure   Support = 1%            Support = 2.5%          Support = 5%
                 Memory (KB)  Time (ms)  Memory (KB)  Time (ms)  Memory (KB)  Time (ms)
Linked list      172345       1233206    149932       993569     109431       823964
Array            22190        234789     21587        234657     179870       212679
Set              13789        202456     13598        193567     13423        102023

5 Conclusion In this paper, we have seen that data structures such as the linked list, the array and the set are used for itemset representation. From the tables and results, the performance of the set representation is better than that of both the linked list and the array: its time and memory consumption for generating the itemsets are lower than those of the other two representations. Therefore, the performance of a mining algorithm will improve considerably if the set representation is used for itemset representation.

References

1. Agarwal, R.C., Aggarwal, C.C., Prasad, V.: A tree projection algorithm for generation of frequent item sets. J. Parallel Distrib. Comput. 61(3), 350–371 (2001)
2. Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of items in large databases. In: ACM SIGMOD Record, vol. 22, pp. 207–216. ACM (1993)
3. Bittmann, R.M., Nemery, P., Shi, X., Kemelmakher, M., Wang, M.: Frequent item-set mining without ubiquitous items. arXiv preprint arXiv:1803.11105 (2018)
4. Bodon, F.: A trie-based APRIORI implementation for mining frequent item sequences. In: Proceedings of the 1st International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations, pp. 56–65. ACM (2005)
5. Brin, S., Motwani, R., Ullman, J.D., Tsur, S.: Dynamic itemset counting and implication rules for market basket data. In: ACM SIGMOD Record, vol. 26, no. 2, pp. 255–264 (1997)
6. Fournier-Viger, P., Lin, J.C.W., Truong-Chi, T., Nkambou, R.: A survey of high utility itemset mining. In: High-Utility Pattern Mining, pp. 1–45. Springer (2019)
7. Grahne, G., Zhu, J.: Efficiently using prefix-trees in mining frequent itemsets. In: FIMI, vol. 90 (2003)
8. Han, J., Pei, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier, Amsterdam (2011)
9. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: ACM SIGMOD Record, vol. 29, pp. 1–12. ACM (2000)
10. Hornik, K., Grün, B., Hahsler, M.: arules - a computational environment for mining association rules and frequent item sets. J. Stat. Softw. 14(15), 1–25 (2005)
11. Kharkongor, C., Nath, B.: Set representation for itemsets in association rule mining. In: 2018 IEEE International Conference on Intelligent Computing and Control Systems (ICICCS). IEEE (2018)
12. Lin, D.I., Kedem, Z.M.: Pincer-search: a new algorithm for discovering the maximum frequent set. In: International Conference on Extending Database Technology, pp. 103–119. Springer (1998)
13. Meenakshi, A., Alagarsamy, K.: A novelty approach for finding frequent itemsets in horizontal and vertical layout - HVCFPMINETREE. Int. J. Comput. Appl. 10(5), 20–27 (2010)
14. Zaki, M.J., Parthasarathy, S., Ogihara, M., Li, W.: New algorithms for fast discovery of association rules. In: 3rd International Conference on Knowledge Discovery and Data Mining. Citeseer (1997)
15. Park, J.S., Chen, M.S., Yu, P.S.: An Effective Hash-Based Algorithm for Mining Association Rules, vol. 24. ACM, New York (1995)
16. Pei, J., Han, J., Lu, H., Nishio, S., Tang, S., Yang, D.: H-mine: hyper-structure mining of frequent patterns in large databases. In: Proceedings 2001 IEEE International Conference on Data Mining, pp. 441–448. IEEE (2001)
17. Sahaphong, S., Boonjing, V.: IIS-Mine: a new efficient method for mining frequent itemsets. Maejo Int. J. Sci. Technol. 6(1), 130–151 (2012)
18. Shrivastava, V.K., Kumar, P., Pardasani, K.: FP-tree and COFI based approach for mining of multiple level association rules in large databases. arXiv preprint arXiv:1003.1821 (2010)
19. Song, M., Rajasekaran, S.: A transaction mapping algorithm for frequent itemsets mining. IEEE Trans. Knowl. Data Eng. 18(4), 472–481 (2006)
20. Tsay, Y.J., Chiang, J.Y.: CBAR: an efficient method for mining association rules. Knowl.-Based Syst. 18(2–3), 99–105 (2005)
21. Wu, C.W., Fournier-Viger, P., Gu, J.Y., Tseng, V.S.: Mining compact high utility itemsets without candidate generation. In: High-Utility Pattern Mining, pp. 279–302. Springer (2019)
22. Zaki, M.J., Gouda, K.: Fast vertical mining using diffsets. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 326–335. ACM (2003)
23. Zhang, C., Tian, P., Zhang, X., Liao, Q., Jiang, Z.L., Wang, X.: HashEclat: an efficient frequent itemset algorithm. Int. J. Mach. Learn. Cybern. 10, 3003–3016 (2019)

Author Index

A
Ahmed, Shafiul Alom, 203
Akhil, S., 152
Amarnath, T., 55

B
Baishya, R. C., 211
Bandopadhaya, Shuvabrata, 240
Behera, Ajit Kumar, 69
Behera, Ranjan Kumar, 101
Bhattacharyya, D. K., 211
Biswal, Nrusingh C., 55
Biswal, Suvasree S., 55
Bora, Rajdeep, 169

D
Dandanov, Nikolay, 240
Dash, Satya Ranjan, 136, 169
De, Sagar S., 11
Dehuri, Satchidananda, 11, 46, 62, 85
Dinesh Kumar, K., 152
Dutta, Palash, 169

G
Giri, Parimal Kumar, 35

J
Jagadev, Alok Kumar, 35
Jena, Manoj Kumar, 144
Jena, Monalisa, 101
Joshi, Reema, 226

K
Kharkongor, Carynthia, 250

M
Mallick, Debasish Kumar, 136
Mallick, Pradeep Kumar, 62, 78
Mandal, Akshaya Kumar, 85
Maurya, Rajesh Kumar, 109, 120
Mishra, Annapurna, 62
Mishra, Bijan Bihari, 46
Mishra, Shruti, 78
Mohanty, Mihir Narayan, 3
Mohanty, Sanghamitra, 144
Mohapatra, Saumendra Kumar, 3

N
Naren, J., 152, 159
Nath, Bhabesh, 203, 250
Nayak, Sarat Chandra, 46

P
Panda, Mrutyunjaya, 69
Pandey, Trilok Nath, 35
Panigrahi, Prasanta K., 55
Parida, P. K., 189
Parida, Shantipriya, 136
Pattnaik, Priyanka, 136
Poulkov, Vladimir, 240
Prahathish, K., 152

R
Raja Rajeswari, U., 159
Ramalingam, Praveena, 159
Rath, Santanu Kumar, 101
Rishabh, 120

S
Sahoo, Sipra, 3
Sai Krishna Mohan Gupta, S., 152
Samal, Soumya Ranjan, 240
Sampurnima, Pattem, 78
Sanjeev Kumar Dash, Ch., 46
Sarmah, Rosy, 211, 226
Satapathy, Sandeep Kumar, 78

T
Tewari, Pragya, 109

V
Vijayalakshmi, P., 159
Vithya, G., 152

Y
Yadav, Sanjay Kumar, 109, 120