International Conference on Innovative Computing and Communications: Proceedings of ICICC 2021, Volume 3 (Advances in Intelligent Systems and Computing, 1394) [1st ed. 2022] 9811630704, 9789811630705

This book includes high-quality research papers presented at the Fourth International Conference on Innovative Computing and Communications (ICICC 2021).


English · Pages: 925 [889] · Year: 2021


Table of contents :
ICICC-2021 Steering Committee Members
Preface
Contents
Editors and Contributors
Explanation-Based Serendipitous Recommender System (EBSRS)
1 Introduction
2 Literature Survey
3 Proposed Approach
3.1 Phase I: User Assessment Phase
4 Experiments, Evaluation and Comparative Analysis
5 Conclusion
References
Introduction of Feature Selection and Leading-Edge Technologies Viz. TENSORFLOW, PYTORCH, and KERAS: An Empirical Study to Improve Prediction Accuracy of Cardiovascular Disease
1 Introduction
2 Methods and Materials
3 Empirical Results and Discussion
3.1 Utilization of Leading-Edge Technologies Viz. TENSORFLOW, PYTORCH, and KERAS
4 Conclusion
References
Campus Placement Prediction System Using Deep Neural Networks
1 Introduction
2 Literature Review
3 Proposed Technique
4 Results and Discussion
5 Conclusion
References
Intensity of Traffic Due to Road Accidents in US: A Predictive Model
1 Introduction
2 Literature Review
3 Proposed Work
4 Experimentation and Results
4.1 Data Sources
4.2 Feature Selection
4.3 Exploratory Data Analysis
5 Machine Learning Modelling
6 Conclusions and Future Directions
References
Credit Card Fraud Detection Using Blockchain and Simulated Annealing k-Means Algorithm
1 Introduction
2 Related Works
3 Methodology
3.1 Blockchain
3.2 k-Means Clustering Algorithm
3.3 Simulated Annealing
4 Experimental Work
4.1 Dataset
4.2 Proposed Work
4.3 Results
5 Conclusion
References
Improving Accuracy of Deep Learning-Based Compression Techniques by Introducing Perceptual Loss in Industrial IoT
1 Introduction
2 Related Work
3 Proposed Method
3.1 Overall Architecture
3.2 Autoencoder Architecture
3.3 Loss Functions
3.4 Lossless Compression Algorithm
4 Experimental Results
5 Conclusion
References
Characterization and Prediction of Various Issue Types: A Case Study on the Apache Lucene System
1 Introduction
2 Related Work and Research Contribution
2.1 Related Work
2.2 Research Contribution
3 Experimental Details
4 Research Questions
4.1 RQ1: What is the Distribution of Different Categories (e.g., Bug, Documentation, Improvement, etc.) Among All the Issue Reports?
4.2 RQ2: Are There Any Distinguishing Terms that Differentiate Various Issue Categories?
4.3 RQ3: Is There Any Significant Difference Between Mean Time to Repair (MTTR) of Different Issue Categories?
4.4 RQ4: What is the Performance of Classic and Ensemble Classifiers for Issue-Type Classification?
4.5 RQ5: How Much Time Do Classic and Ensemble Machine Learning Algorithms Take in Training and Prediction?
5 Conclusion and Future Work
References
Heart Disease Prediction Using Machine Learning Techniques: A Quantitative Review
1 Introduction
2 Machine Learning Algorithms Used in Heart Disease Prediction, Diagnosis, and Treatment
2.1 Decision Tree
2.2 Naïve Bayes
2.3 Support Vector Machine (SVM)
2.4 K-Nearest Neighbor
2.5 Random Forest (RF)
3 Literature Review
4 Discussion
4.1 Comparative Representation of Various Machine Learning Methodologies Based on Accuracy
5 Research Gaps/Problems Identified
6 Conclusion
References
Enhancing CNN with Pre-processing Stage in Illumination-Invariant Automatic Expression Recognition
1 Introduction
2 Image Pre-processing Techniques
2.1 Histogram Equalization
2.2 Discrete Cosine Transform Normalization
2.3 Rescaled DCT Coefficients
3 Convolutional Neural Network
4 Implementation and Result Discussion
5 Conclusion
References
An Expert Eye for Identifying Shoplifters in Mega Stores
1 Introduction
2 Related Work
3 Proposed Framework
3.1 Inception V3
3.2 Long Short Term Memory (LSTM)
4 Experimentation and Result Analysis
5 Conclusion
References
Sanskrit Stemmer Design: A Literature Perspective
1 Introduction
2 Background Study
2.1 NLP: Natural Language Processing
2.2 Stemming
2.3 Stemmer
2.4 Stem
2.5 Affix
2.6 Over-Stemming, Under-Stemming, Mis-Stemming
2.7 Sanskrit Stemmer
3 Literature Review
3.1 A Comparative Study of Stemming Algorithms
3.2 A Fast Corpus-Based Stemmer
3.3 A Hybrid Inflectional and a Rule-Based Derivational Gujarati Stemmers
3.4 A Stemmer-Based Lemmatizer for Gujarati Text
3.5 Text Stemming: Approaches, Applications, and Challenges
3.6 Stemmers for Indic Languages: A Comprehensive Analysis
3.7 Rule-Based Derivational Stemmer for Sindhi Devanagari Using Suffix Stripping Approach
4 Proposed Sanskrit Stemmer Design
5 Conclusion and Future Scope
References
Predicting Prior Academic Failure of Students’ Using Machine Learning Approach
1 Introduction
2 Related Work
3 Research Methodology
3.1 Data and Sources of Data
3.2 Proposed Methodology
3.3 Pre-processing Techniques
3.4 Classification Techniques
4 Results and Discussion
4.1 Experimental Results
4.2 Comparison of Classification Techniques
5 Conclusion
References
Deep Classifier for News Text Classification Using Topic Modeling Approach
1 Introduction
2 Related Work
3 Research Methodology
3.1 Dataset
3.2 Proposed Methodology
3.3 Data Pre-processing
3.4 Feature Extraction
3.5 Classification Techniques
4 Results and Discussion
5 Conclusion
References
Forecasting COVID-19 Cases in India Using Multivariate Hybrid CNN-LSTM Model
1 Introduction
2 Windowing
3 The Proposed Model
4 Dataset Description
5 Experimental Results and Discussion
5.1 COVID-19 Forecasting
6 Conclusion
References
Multi-resolution Video Steganography Technique Based on Stationary Wavelet Transform (SWT) and Singular Value Decomposition (SVD)
1 Introduction
2 Literature Review
3 Proposed Method
3.1 Stationary Wavelet Transform
3.2 Singular Value Decomposition
3.3 The Proposed Method
3.4 Embedding Process Steps
3.5 Extraction Process Steps
4 Experimental Results
4.1 Performance Criteria
4.2 Results and Discussion
5 Conclusion
References
A Novel Dual-Threshold Weighted Feature Detection for Spectrum Sensing in 5G Systems
1 Introduction
2 Proposed Dual-Threshold Weighted Feature Detection (DTWFD) System Model
3 SNR-Based Weighted Factor Algorithm for Double Threshold Weighted Feature Detection (DTWFD)
4 Performance Evaluation
5 Conclusion
References
A Systematic Review on Various Attack Detection Methods for Wireless Sensor Networks
1 Introduction
2 Background Study
2.1 Review on Attack Detection Techniques for WSNs
2.2 Review on Various Attack Detection Methods Based on Clustering Techniques for WSN
2.3 Review on Various Attack Detection Methods Based on Authentication Protocols for WSN
3 Issues from Existing Methods
4 Solution
5 Results and Discussion
6 Conclusion
References
Electronic Beam Steering in Timed Antenna Array by Controlling the Harmonic Patterns with Optimally Derived Pulse-Shifted Switching Sequence
1 Introduction
2 Theory and Mathematical Background
2.1 Switching Sequences
2.2 Cost Function Formulation
3 Numerical Results and Discussion
3.1 Case 1: Steered Patterns at ± 10°
3.2 Case 2: Steered Patterns at ± 20°
3.3 Case 3: Steered Patterns at ± 30°
4 Conclusion
References
Classification of Attacks on MQTT-Based IoT System Using Machine Learning Techniques
1 Introduction
2 Literature Review
3 Resources and Methods
3.1 Data Collection
3.2 Theoretical Considerations
3.3 Evaluation Criteria
4 Outcome of the Applied Model
4.1 Attack Classification Results and Analysis
4.2 Criteria for Detection
4.3 Detection Results
5 Conclusions and Future Scope
References
Encrypted Traffic Classification Using eXtreme Gradient Boosting Algorithm
1 Introduction
2 Literature Review
3 The Proposed System
4 Experiments and Results
5 Conclusion
References
Analyzing Natural Language Essay Generator Models Using Long Short-Term Memory Neural Networks
1 Introduction
2 Related Work
3 Methodology
3.1 Dataset Used and Data Preprocessing
3.2 Embeddings
3.3 Approach
4 Experiment
4.1 Experimental Settings
4.2 Evaluation Metrics
5 Experimental Result
6 Conclusion
References
Performance Evaluation of GINI Index and Information Gain Criteria on Geographical Data: An Empirical Study Based on JAVA and Python
1 Introduction
2 Decision Tree
3 Splitting Benchmarks
3.1 Information Gain
3.2 GINI Coefficient
4 Related Work
5 Dataset
5.1 Evaluation—Information Gain Versus GINI Index
5.2 Information Gain
5.3 GINI Coefficient
6 Decision Tree Implementation: An Empirical Examination of Python and Java
6.1 Implementation Using Information Gain
6.2 Implementation Using GINI Index
7 Minimum Descriptive Length (MDL) Pruning
8 Experimental Results and Performance Comparison
8.1 Performance: Python Versus Java
9 Conclusion and Future Work
References
Critical Analysis of Big Data Privacy Preservation Techniques and Challenges
1 Introduction
2 Privacy Concern in Big Data
3 Literature Review
4 Findings
5 Technological-Based Solutions
6 Conclusion and Future Work
References
Performance Improvement of Vector Control Permanent Magnet Synchronous Motor Drive Using Genetic Algorithm-Based PI Controller Design
1 Introduction
2 Inverter Model
2.1 Two-Level Voltage Source Inverter
3 PMSM Model
3.1 Vector Control Drive System
3.2 Tuning Methods of PI Controllers Gain Using GAs
4 Result and Analysis
4.1 Simulation Response of PMSM Drive at Change in Speed with Constant Load Without GAs
4.2 Simulation Response of Vector Control PMSM Drive with GAs
5 Conclusions
References
Monitoring and Protection of Induction Motors Against Abnormal Industrial Conditions Using PLC
1 Introduction
2 PLC as a System Controller
2.1 Proposed Hardware
2.2 Overvoltage Circuit
2.3 Under-Voltage Circuit
2.4 Over-Temperature Circuit
2.5 RPM Measurement Circuit
2.6 PLC Ladder Algorithm
3 Proposed Methodology
4 Results
4.1 Under-Voltage or Overvoltage Situations
4.2 Over-Current Conditions
4.3 Over-Temperature Condition
5 Conclusion
References
A Vision-Based Gait Dataset for Knee Osteoarthritis and Parkinson’s Disease Analysis with Severity Levels
1 Introduction
2 Related Work
3 Available Datasets on KOA and PD
3.1 Knee Osteoarthritis (KOA) Datasets
3.2 Parkinson’s Disease (PD) Datasets
4 Dataset Description: Scenario and Method
4.1 KOA Dataset Overview
4.2 PD Dataset Overview
4.3 Normal/Healthy Dataset
5 Conclusion
References
A Survey of Recommender Systems Based on Semi-supervised Learning
1 Introduction
2 Collaborative Recommender System
3 Content-Based Recommender System
4 Hybrid Filtering-Based Recommender System
5 Semi-supervised Learning-Based Recommender System
5.1 Semi-supervised Learning
5.2 Existing Work on Semi-supervised Learning
6 Conclusion
6.1 Challenges and Solutions
References
“Emerging Trends in Computational Intelligence to Solve Real-World Problems” Android Malware Detection Using Machine Learning
1 Introduction
2 Related Work
3 Proposed Approach
4 Implementation
5 Algorithms Discussion
5.1 Naive Bayes
5.2 Decision Tree
5.3 Random Forest
5.4 Support Vector Machine
6 Results
7 Discussion
8 Conclusion
References
A Novel Intrusion Detection System Using Deep Learning
1 Introduction
2 Literature Review
3 Intrusion Detection System
4 Experiment
4.1 Dataset
4.2 Architecture
5 Results
6 Conclusion
7 Future Work
References
Solution to OCT Diagnosis Using Simple Baseline CNN Models and Hyperparameter Tuning
1 Introduction
2 Literature
3 Methods and Materials
3.1 3 Layer Model
3.2 5 Layer Model
3.3 7 Layer Model
4 Results
5 Discussion
6 Future Work
References
Land Rights Documentation and Verification System Using Blockchain Technology
1 Introduction
2 Existing System
2.1 Related Works
2.2 The New Land Registry Business Process
2.3 Challenges to the Existing System of Land Registry
2.4 Challenges to the Existing Structure of the Land Registry
3 Methodology
3.1 Design Using Hyperledger Fabric
3.2 Verity
4 Implementation
4.1 Implementation Using Hyperledger Fabric
4.2 Technical Details
4.3 Chain Codes
4.4 Deployment
5 Evaluation and Results
5.1 Transaction Latency
5.2 Transaction Throughput
5.3 Execution Time
5.4 Result
6 Conclusion
References
Implication of Privacy Laws and Importance of ICTs to Government Vision of the Future
1 Introduction
2 Material and Methods
3 Results
3.1 Descriptive Statistics
3.2 Correlation Analysis
3.3 Regression Analysis
4 Conclusions
References
AI Approaches for Breast Cancer Diagnosis: A Comprehensive Study
1 Introduction
2 Breast Imaging
2.1 Mammograms
2.2 MRI
2.3 Computerized Tomography
2.4 Scintimammography
2.5 Histopathologist Images
2.6 Positron Emission Tomography (PET)
3 Related Work
4 Pre-processing Techniques of Breast Images
4.1 Use of Filters for Normalization
4.2 Channeling of Images
4.3 Conversion to 3 Channel Images
4.4 Morphological Operations
5 AI Approaches for Breast Cancer Diagnosis
5.1 Machine Learning
5.2 Deep Learning Approach
6 Techniques to Improve Performance of AI Approaches
6.1 Data Augmentation
6.2 Transfer Learning
7 Research Opportunities and Challenges
8 Conclusion
References
Energy-Efficient Lifetime and Network Performance Improvement for Mobility of Nodes in IoT
1 Introduction
2 Related Work
3 Problem Description and Objectives
3.1 Problem Statement
3.2 Objectives
4 Network Architecture
4.1 Network Model
4.2 Energy Model
5 Proposed Method and Algorithm
5.1 Methodology
5.2 Algorithm
6 Simulation Analysis and Result
6.1 Simulation Model
6.2 Simulation Settings
6.3 Result Analysis
7 Conclusion and Future Work
References
Design and Implementation of Electronic Voting Using KECCAK256 Algorithm on Ethereum Network
1 Introduction
1.1 Blockchain, Naïve, and Traditional Voting [4]
1.2 Ethereum and Smart Contract
1.3 History of Elections
2 Review of Related Works
2.1 Toward Secure E-Voting Using Ethereum Blockchain [7]
2.2 Blockchain-Based E-Voting System [8]
2.3 A Privacy-Preserving Voting Protocol on Blockchain [9]
2.4 An E-Voting with Blockchain: An E-Voting Protocol with Decentralisation and Voter Privacy [1]
2.5 A Secure End-to-End Verifiable E-Voting System Using Zero-Knowledge-Based Blockchain [10]
2.6 A Conceptual Secure Blockchain-Based Electronic Voting System [11]
2.7 A Solution Based on the Diffie–Hellman Process System [12]
2.8 A Blockchain-Based E-Voting System [13]
2.9 An E-voting System Based on Blockchain and Ring Signature [14]
2.10 A Survey on Feasibility and Suitability of Blockchain Techniques for the E-Voting Systems [4]
2.11 An Electronic Voting Machine Based on Blockchain Technology and Aadhar Verification [15]
2.12 A Secure Voting System Using Ethereum’s Blockchain Which They Called BroncoVote [16]
2.13 A Crypto-Voting, a Blockchain-Based e-Voting System [17]
3 Methodology
3.1 Hardware Requirements
3.2 Implementation Tools
4 System Implementation
5 Conclusion
References
VizAudi: A Predictive Audio Visualizer
1 Introduction
2 Related Work
2.1 Works on the Categorization of Sound
2.2 Works on UrbanSound8K Dataset
2.3 Work on Deaf Assistance System
3 Dataset and Features
3.1 Dataset
3.2 Features
4 Methodology
4.1 Background and Foreground Sound Segregation
4.2 Background Sound Classification
4.3 Visual Output
5 Conclusions and Future Work
References
Universal Quantitative Steganalysis Using Deep Residual Networks
1 Introduction
2 Related Work
3 Basic Concepts
3.1 Steganalysis
3.2 Deep Residual Network
4 Proposed Work
5 Experimental Work
6 Conclusion
References
Image-Based Forest Fire Detection Using Bagging of Color Models
1 Introduction
2 Proposed Method
2.1 Feature Extraction
2.2 Threshold and Bagging
3 Experimental Results
4 Discussion
5 Conclusion and Future Work
References
Machine Learning Techniques for Diagnosis of Type 2 Diabetes Using Lifestyle Data
1 Introduction
2 Literature Survey
3 Proposed System
3.1 Dataset Description
3.2 Data Preprocessing
3.3 Machine Learning Techniques
4 Experimental Results
5 Conclusion
References
Deep Learning-Based Recognition of Personality and Leadership Qualities (DeePeR-LQ): Review
1 Introduction
1.1 Computing Personality
1.2 Leadership Qualities
1.3 Organisation
2 Related Work
2.1 Single Modal
2.2 Bi-Modal
2.3 Tri-Modal/Multimodal
3 Personality Traits to Leadership Qualities
4 Proposed Research Agenda
4.1 Challenges Ahead and Future Directions
5 Conclusions
References
Sentence-Level Document Novelty Detection Using Latent Dirichlet Allocation with Auto-Encoders
1 Introduction
2 Related Work
2.1 State-of-the-Art in Novelty Detection
2.2 Background: Topic Modeling
3 Proposed Method
4 Experimental Setup
5 Results and Discussions
6 Conclusions and Future Work
References
Prediction of Environmental Diseases Using Machine Learning
1 Introduction
1.1 Senior Citizens
1.2 Additional Information Required by the Volume Editor
2 Related Works
3 Waterborne Disease
3.1 Urbanization
4 Big Data Analysis
4.1 Volume
4.2 Velocity
4.3 Veracity
4.4 Variety
4.5 Validity
4.6 Volatility
4.7 Value
5 Health Care Ecosystem and Performance Measures
5.1 Experimental Method
6 Results and Discussion
7 Comparison with Other Techniques
8 Conclusion
References
Frequent Itemset Mining Using Genetic Approach
1 Introduction
2 Related Work
2.1 Frequent Itemset Mining
2.2 Genetic Algorithms in Frequent Itemset Mining
3 Problem Definition
3.1 Preliminaries
3.2 Problem Statement
4 FIMGA—Frequent Itemset Mining Using Genetic Approach
4.1 Genset1
4.2 Crossover
5 A Running Example
5.1 Genset1
5.2 Crossover
6 Experimental Study
6.1 Limitation of the Study
7 Conclusion
References
Gesture-Based Media Controlling Using Haar Cascade
1 Introduction
2 Literature Review
3 Architecture of Multimedia Devices
4 Methodology
5 Results and Discussion
6 Conclusion
References
Comparative Analysis of Models for Abstractive Text Summarization
1 Introduction
2 Related Work
3 Techniques for Abstractive Text Summarization
3.1 Preprocessing
3.2 Abstractive Text Summarization
4 Experiment and Results
4.1 Dataset
4.2 Experimental Setting and Results
4.3 Evaluation Metric
5 Conclusion and Future Scope
References
Polarity Detection Across the Globe Using Sentiment Analysis on COVID-19-Related Tweets
1 Introduction
1.1 Research Study and Objectives
1.2 Our Work
2 Research Procedure
2.1 Research Design
2.2 Study Dimensions
2.3 Tools and Instrument
3 Literature Review
4 Dataset
4.1 Data Gathering Procedure
4.2 Data Preparation
5 Model for Sentiment Analysis
5.1 Naive Bayes
5.2 Linear Regression
5.3 Support Vector Machines (SVM)
6 Results on Trending Hashtag #Twitter Data
7 Conclusion
References
FOG-EE Computing: Fog, Edge and Elastic Computing, New Age Cloud Computing Paradigms
1 Introduction
1.1 Fog Computing
1.2 Edge Computing
1.3 Elastic Computing
2 Literature Survey
3 Methodology
3.1 Flowchart
4 Results
5 Conclusion
References
Hybrid Filter for Dorsal Hand Vein Images
1 Introduction
2 Related Work
3 Proposed Filter
4 Simulation Results
5 Conclusion
References
Satellite Image Enhancement and Restoration Using RLS Adaptive Filter
1 Introduction
2 Literature Review
3 Methodology
3.1 About Adaptive Filters
3.2 RLS—Recursive Least Square Adaptive Filter
3.3 Image Denoising and Channel Estimation
4 Experimental Results
5 Discussion
6 Conclusion
References
Efficient Recommendation System Using Latent Semantic Analysis
1 Introduction
2 Related Work
3 Methodology
3.1 Traditional Recommender System Techniques
3.2 Dimension Reduction Techniques
3.3 K-Nearest Neighbor (KNN)
4 Experimental Analysis
4.1 Training Data and Testing Data
4.2 Accuracy Metrics
5 Results and Discussion
6 Conclusion and Future Scope
References
A Study of Machine Learning Techniques for Fake News Detection and Suggestion of an Ensemble Model
1 Introduction
2 Literature Survey
3 Experimental Design
3.1 Dataset Description
3.2 Feature Extraction
3.3 Models
4 Discussion of Results
5 Conclusion
References
Metric Learning with Deep Features for Highly Imbalanced Face Dataset
1 Introduction
2 Proposed Metric Learning with Deep Face Features
2.1 Deep Face Networks
2.2 Distance Metric Learning
2.3 Proposed Model
3 Results and Discussions
3.1 Dataset
3.2 Result Analysis
4 Conclusion
References
An Adaptable Ensemble Architecture for Malware Detection
1 Introduction
2 Related Work
3 Proposed Ensemble Architecture
3.1 Convolution Neural Network
3.2 K Nearest Neighbors
3.3 Ensembling
4 Experimental Setup
5 Dataset Description and Result Analysis
6 Conclusion
References
An Application of Deep Learning in Identification of Depression Among Twitter Users
1 Introduction
2 Literature Review
3 Dataset
4 Methodology
4.1 Data Preprocessing
4.2 Text Classification
5 Experimental Results
5.1 Baseline Model Training
5.2 BiLSTM + CNN Training
6 Conclusion
References
Performance Evaluation of LSB Sequential and Pixel Indicator Algorithms in Image Steganography
1 Introduction
2 Method Analysis
3 Algorithms
3.1 LSB Sequential Substitution
3.2 Pixel Indicator Method
3.3 Coded LSB Substitution
3.4 Tools and Simulation Environment
4 Encoding and Decoding
5 Comparative Analysis
5.1 Image Perceptibility
5.2 Image Capacity
5.3 Security
6 Conclusion
References
MATHS: Machine Learning Techniques in Healthcare System
1 Introduction
2 Problem Statement
3 Literature Survey
4 Methodology Adopted
4.1 Algorithm
4.2 Dataset Description
4.3 Feature Selection
4.4 Applying Machine Learning Algorithms
5 Experimental Setup
6 Results: Analysis and Discussion
7 Conclusion and Future Work
References
EnSOTA: Ensembled State of the Art Model for Enhanced Object Detection
1 Introduction
2 Existing Methods
2.1 You Only Look Once (YOLO)
2.2 Single Shot Detector (SSD)
2.3 Faster Region-Based Convolutional Neural Networks
2.4 Ensembling
3 Proposed Method
3.1 Weighted Boxes Fusion
4 Research Approach
4.1 Dataset
4.2 Training
4.3 Testing
4.4 Prediction Ensembling
5 Results and Discussion
5.1 Evaluation Metrics
5.2 Intersection Over Union (IOU)
5.3 Precision × Recall Curve
5.4 Average Precision
5.5 Recorded Metrics
6 Conclusion
7 Future Scope
References
A Coronavirus Herd Immunity Optimization (CHIO) for Travelling Salesman Problem
1 Introduction
2 TSP Definition
3 Coronavirus Herd Immunity Optimizer for Travelling Salesman Problem
3.1 CHIO Procedure
4 Experiments and Results
4.1 TSP Data Description
4.2 TSP Results and Comparisons
5 Conclusion and Future Work
References
System for Situational Awareness Using Geospatial Twitter Data
1 Introduction
2 Related Works
3 Proposed Work
3.1 Algorithm
4 Results and Analysis
5 Conclusions and Future Works
References
Classification of Malware Using Visualization Techniques
1 Introduction
2 Related Works
3 Methodology
3.1 Base Dataset Selection
3.2 Dataset Creation (Feature Extraction)
4 Results
5 Conclusion and Future Work
References
Classification and Activation Map Visualization of Banana Diseases Using Deep Learning Models
1 Introduction
2 Background
3 Methodology
4 Crop Disease Detection by Notable DL Models
4.1 AlexNet (2012)
4.2 VGG16 (2014)
4.3 GoogLeNet (2014)
5 Experiments
5.1 Pre-trained Models
5.2 Workstation Specifications and Deep Learning Framework
5.3 Dataset
5.4 Performance Metrics
5.5 DL Architecture with Pre-training Versus DL Architecture Without Pre-training
6 Symptom Visualization
6.1 Symptoms and Disease Lesion Detection Using DL
6.2 Visualization of Every Channel in Each Intermediate Activation Layer
7 Conclusion
References
Exploring Total Quality Management Implementation Levels in I.T. Industry Using Machine Learning Models
1 Introduction
1.1 IT Industry
1.2 TQM and TQM Elements
1.3 ISO Certification and Quality Awards
1.4 Machine Learning Algorithms
2 Statement of Problem
3 Objective of Study
4 Literature Review
5 Conceptual Framework
6 Methodology
7 Results and Discussion
8 Conclusion and Recommendations
9 Limitations and Scope for Future Study
References
Predicting an Indian Firm’s Sickness Using Artificial Neural Networks and Traditional Methods: A Comparative Study
1 Introduction
2 Literature Review
3 Data and Methodology
3.1 Data
3.2 Methodology
4 Empirical Results
4.1 Comparison of the Prediction Accuracy
4.2 Coefficients of Variables of the Two-Year Predictive Model
5 Discussion
6 Conclusion
References
Analysis of Change of Market Value of Bitcoin Using Econometric Approach
1 Introduction
2 Literature Review
3 Research Methodology
3.1 Data Sources
4 Data Analysis
4.1 Micro–Macro Model Decomposition—Empirical Analysis
4.2 Analysis of Macro-variables
4.3 Analysis of Micro-variables
5 Conclusion
6 Managerial Implication
7 Limitations and Future Research
References
Detection of COVID-19 Using Intelligent Computing Method
1 Introduction
1.1 Coronavirus Origin
1.2 Virus Evolution
2 Literature Survey
3 Signs and Symptoms of Virus Existence
3.1 Risk Factors and Diagnosis
3.2 Transmission and Its Significance for Stopping
4 Results and Discussion
5 Conclusion and Future Scope
References
Two-Line Defense Ontology-Based Trust Management Model
1 Introduction
2 Related Work
3 The Proposed Method
3.1 First Line of Defense
3.2 Second Line of Defense
4 Implementation Results
4.1 Computational Complexity of the Proposed Approach
5 Conclusion and Future Work
References
A Machine Learning-Based Data Fusion Model for Online Traffic Violations Analysis
1 Introduction
2 Research Methodology
2.1 Traffic Violation Dataset
2.2 Preprocessing
2.3 Classification
2.4 Evaluation Metrics
3 Modeling and Implementation
3.1 Modeling
3.2 Implementation
4 Results and Discussion
5 Conclusion
References
Review of IoT for COVID-19 Detection and Classification
1 Introduction
2 Internet of Things
3 COVID-19
4 Literature Review
5 Conclusions and Future Works
References
On the Implementation and Placement of Hybrid Beamforming for Single and Multiple Users in the Massive-MIMO MmWave Systems
1 Introduction
2 System Model
2.1 Deep Learning Based Hybrid Beamforming Optimization with Limited Feedback
2.2 Delay Calculations for Different Placements of Beamforming
3 Simulation Results
4 Conclusions and Future Works
References
Neural Network Based Windowing Scheme to Maximize the PSD for 5G and Beyond
1 Introduction
2 System Model
2.1 Deep Neural Network Based Window Selection
2.2 Adaptive Window Selection to Maximize the PSD
3 Simulation Results
4 Conclusions and Future Works
References
Author Index


Advances in Intelligent Systems and Computing 1394

Ashish Khanna · Deepak Gupta · Siddhartha Bhattacharyya · Aboul Ella Hassanien · Sameer Anand · Ajay Jaiswal   Editors

International Conference on Innovative Computing and Communications Proceedings of ICICC 2021, Volume 3

Advances in Intelligent Systems and Computing Volume 1394

Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong

The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, perception and vision, DNA and immune based systems, self-organizing and adaptive systems, e-learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia.

The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results.

Indexed by DBLP, INSPEC, WTI Frankfurt eG, zbMATH, Japanese Science and Technology Agency (JST). All books published in the series are submitted for consideration in Web of Science.

More information about this series at http://www.springer.com/series/11156


Editors
Ashish Khanna, Maharaja Agrasen Institute of Technology, Delhi, India
Deepak Gupta, Department of Computer Science Engineering, Maharaja Agrasen Institute of Technology, Rohini, Delhi, India
Siddhartha Bhattacharyya, Rajnagar Mahavidyalaya, Birbhum, India
Aboul Ella Hassanien, Computer and Artificial Intelligence, Cairo University, Giza, Egypt
Sameer Anand, Department of Computer Science, Shaheed Sukhdev College of Business Studies, Rohini, Delhi, India
Ajay Jaiswal, Department of Computer Science, Shaheed Sukhdev College of Business Studies, Rohini, Delhi, India

ISSN 2194-5357 · ISSN 2194-5365 (electronic)
Advances in Intelligent Systems and Computing
ISBN 978-981-16-3070-5 · ISBN 978-981-16-3071-2 (eBook)
https://doi.org/10.1007/978-981-16-3071-2

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Dr. Ashish Khanna would like to dedicate this book to his mentors Dr. A. K. Singh and Dr. Abhishek Swaroop for their constant encouragement and guidance and his family members including his mother, wife and kids. He would also like to dedicate this work to his (Late) father Sh. R. C. Khanna with folded hands for his constant blessings. Dr. Deepak Gupta would like to dedicate this book to his father Sh. R. K. Gupta and his mother Smt. Geeta Gupta for their constant encouragement, to his family members including his wife, brothers, sisters and kids, and to his students, who are close to his heart. Prof. (Dr.) Siddhartha Bhattacharyya would like to dedicate this book to Late Kalipada Mukherjee and Late Kamol Prova Mukherjee. Prof. (Dr.) Aboul Ella Hassanien would like to dedicate this book to his wife Nazaha Hassan. Dr. Sameer Anand would like to dedicate this book to his Dada Prof. D. C. Choudhary, his beloved wife Shivanee and his son Shashwat.

Dr. Ajay Jaiswal would like to dedicate this book to his father Late Prof. U. C. Jaiswal, his mother Brajesh Jaiswal, his beloved wife Anjali, his daughter Prachii and his son Sakshaum.

ICICC-2021 Steering Committee Members

Patrons
Dr. Poonam Verma, Principal, SSCBS, University of Delhi
Prof. Dr. Pradip Kumar Jain, Director, National Institute of Technology Patna, India

General Chairs
Prof. Dr. Siddhartha Bhattacharyya, Christ University, Bangalore
Prof. Valentina Emilia Balas, Aurel Vlaicu University of Arad, Romania
Dr. Prabhat Kumar, National Institute of Technology Patna, India

Honorary Chairs
Prof. Dr. Janusz Kacprzyk, FIEEE, Polish Academy of Sciences, Poland
Prof. Dr. Vaclav Snasel, Rector, VSB-Technical University of Ostrava, Czech Republic

Conference Chairs
Prof. Dr. Aboul Ella Hassanien, Cairo University, Egypt
Prof. Dr. Joel J. P. C. Rodrigues, National Institute of Telecommunications (Inatel), Brazil
Prof. Dr. R. K. Agrawal, Jawaharlal Nehru University, Delhi


Technical Program Chairs
Prof. Dr. Victor Hugo C. de Albuquerque, Universidade de Fortaleza, Brazil
Prof. Dr. A. K. Singh, National Institute of Technology, Kurukshetra
Prof. Dr. Anil K. Ahlawat, KIET Group of Institutes, Ghaziabad

Editorial Chairs
Prof. Dr. Abhishek Swaroop, Bhagwan Parshuram Institute of Technology, Delhi
Dr. Arun Sharma, Indira Gandhi Delhi Technical University for Women, Delhi
Prerna Sharma, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi

Conveners
Dr. Ajay Jaiswal, SSCBS, University of Delhi
Dr. Sameer Anand, SSCBS, University of Delhi
Dr. Ashish Khanna, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi
Dr. Deepak Gupta, Maharaja Agrasen Institute of Technology (GGSIPU), New Delhi
Dr. Gulshan Shrivastava, National Institute of Technology Patna, India

Publication Chairs
Prof. Dr. Neeraj Kumar, Thapar Institute of Engineering and Technology
Dr. Hari Mohan Pandey, Edge Hill University, UK
Dr. Sahil Garg, École de technologie supérieure, Université du Québec, Montreal, Canada
Dr. Vicente García Díaz, University of Oviedo, Spain

Publicity Chairs
Dr. M. Tanveer, Indian Institute of Technology Indore, India
Dr. Jafar A. Alzubi, Al-Balqa Applied University, Salt, Jordan
Dr. Hamid Reza Boveiri, Sama College, IAU, Shoushtar Branch, Shoushtar, Iran
Prof. Med Salim Bouhlel, Sfax University, Tunisia


Co-convener
Mr. Moolchand Sharma, Maharaja Agrasen Institute of Technology, India

Organizing Chairs
Dr. Kumar Bijoy, SSCBS, University of Delhi
Dr. Rishi Ranjan Sahay, SSCBS, University of Delhi
Dr. Amrina Kausar, SSCBS, University of Delhi
Dr. Abhishek Tandon, SSCBS, University of Delhi

Organizing Team
Dr. Gurjeet Kaur, SSCBS, University of Delhi
Dr. Aditya Khamparia, Lovely Professional University, Punjab, India
Dr. Abhimanyu Verma, SSCBS, University of Delhi
Dr. Onkar Singh, SSCBS, University of Delhi
Dr. Kalpna Sagar, KIET Group of Institutes, Ghaziabad
Dr. Purnima Lala Mehta, Assistant Professor, IILM
Dr. Suresh Chavhan, Vellore Institute of Technology, Vellore, India
Dr. Mona Verma, SSCBS, University of Delhi


Preface

We are delighted to announce that Shaheed Sukhdev College of Business Studies, New Delhi, in association with the National Institute of Technology Patna and the University of Valladolid, Spain, hosted the eagerly awaited and much-coveted International Conference on Innovative Computing and Communication (ICICC-2021) in hybrid mode. The fourth edition of the conference attracted a diverse range of engineering practitioners, academicians, scholars and industry delegates, receiving abstracts from more than 3,600 authors from different parts of the world. The committee of professionals dedicated to the conference strove to achieve a high-quality technical program with tracks on innovative computing, innovative communication network and security, and Internet of things. All the tracks chosen for the conference are interrelated and are highly active areas within the present-day research community, and a great deal of research is being carried out in these tracks and their related sub-areas. As the name of the conference begins with the word “innovation,” it has targeted out-of-the-box ideas, methodologies, applications, expositions, surveys and presentations that help to advance the current state of research. More than 900 full-length papers were received, with contributions focused on theoretical work, computer simulation-based research and laboratory-scale experiments. Among these manuscripts, 210 papers have been included in the Springer proceedings after a thorough two-stage review and editing process. All the manuscripts submitted to ICICC-2021 were peer-reviewed by at least two independent reviewers, who were provided with a detailed review pro forma. The comments from the reviewers were communicated to the authors, who incorporated the suggestions in their revised manuscripts. The recommendations from the two reviewers were taken into consideration while selecting a manuscript for inclusion in the proceedings.
The exhaustiveness of the review process is evident, given the large number of articles received addressing a wide range of research areas. The stringent review process ensured that each published manuscript met the rigorous academic and scientific standards. It is an exalting experience to finally see these elite contributions materialize into three book volumes as ICICC-2021 proceedings by Springer entitled “International Conference on Innovative Computing and Communications.” The articles are organized into three volumes in some broad categories covering subject matters on machine learning, data mining,


big data, networks, soft computing and cloud computing, although given the diverse areas of research reported it might not always have been possible. ICICC-2021 invited seven keynote speakers, who are eminent researchers in the field of computer science and engineering, from different parts of the world. In addition to the plenary sessions on each day of the conference, ten concurrent technical sessions were held every day to accommodate the oral presentation of around 210 accepted papers. Keynote speakers and session chair(s) for each of the concurrent sessions were leading researchers from the thematic area of the session. A technical exhibition was held during both days of the conference, putting on display the latest technologies, expositions, ideas and presentations. The research part of the conference was organized in a total of 28 special sessions and 3 international workshops. These special sessions and international workshops provided the opportunity for researchers conducting research in specific areas to present their results in a more focused environment. An international conference of such magnitude and the release of the ICICC-2021 proceedings by Springer have been the remarkable outcome of the untiring efforts of the entire organizing team. The success of an event undoubtedly involves the painstaking efforts of several contributors at different stages, dictated by their devotion and sincerity. Fortunately, since the beginning of its journey, ICICC-2021 has received support and contributions from every corner. We thank all who wished the best for ICICC-2021 and contributed by any means toward its success. The edited proceedings volumes by Springer would not have been possible without the perseverance of all the steering, advisory and technical program committee members. The organizers of ICICC-2021 owe thanks to all the contributing authors for their interest and exceptional articles.
We would also like to thank the authors of the papers for adhering to the time schedule and for incorporating the review comments. We wish to extend our heartfelt acknowledgment to the authors, peer-reviewers, committee members and production staff whose diligent work put shape to the ICICC-2021 proceedings. We especially want to thank our dedicated team of peer-reviewers who volunteered for the arduous and tedious step of quality checking and critique of the submitted manuscripts. We wish to thank our faculty colleagues Mr. Moolchand Sharma and Ms. Prerna Sharma for extending their enormous assistance during the conference. The time spent by them and the midnight oil burnt is greatly appreciated, for which we will ever remain indebted. The management, faculty and administrative and support staff of the college have always extended their services whenever needed, for which we remain thankful to them. Lastly, we would like to thank Springer for accepting our proposal for publishing the ICICC-2021 proceedings. Help received from Mr. Aninda Bose, the senior acquisitions editor, in the process has been very useful.

Delhi, India
Rohini, India

Ashish Khanna Deepak Gupta Organizers, ICICC-2021

Contents

Explanation-Based Serendipitous Recommender System (EBSRS) . . . 1
Richa, Chhavi Sharma, and Punam Bedi

Introduction of Feature Selection and Leading-Edge Technologies Viz. TENSORFLOW, PYTORCH, and KERAS: An Empirical Study to Improve Prediction Accuracy of Cardiovascular Disease . . . 19
Mudsir Ashraf, Yass Khudheir Salal, S. M. Abdullaev, Majid Zaman, and Muheet Ahmed Bhut

Campus Placement Prediction System Using Deep Neural Networks . . . 33
Bharat Udawat, Advait Kale, Divit Sinha, Hardik Sharma, and Deepa Krishnan

Intensity of Traffic Due to Road Accidents in US: A Predictive Model . . . 43
Pooja Mudgil and Ishan Joshi

Credit Card Fraud Detection Using Blockchain and Simulated Annealing k-Means Algorithm . . . 51
Poonam Rani, Jyoti Shokeen, Amit Agarwal, Ashish Bhatghare, Arjun Majithia, and Jigyasu Malhotra

Improving Accuracy of Deep Learning-Based Compression Techniques by Introducing Perceptual Loss in Industrial IoT . . . 61
Poonam Rani, Vibha Jain, Mohammad Saif, Saahil Hussain Mugloo, Mitul Hirna, and Somil Jain

Characterization and Prediction of Various Issue Types: A Case Study on the Apache Lucene System . . . 71
Apurva Aggarwal, Ajay Kumar Kushwaha, Somil Rastogi, Sangeeta Lal, and Sarishty Gupta

Heart Disease Prediction Using Machine Learning Techniques: A Quantitative Review . . . 81
Lubna Riyaz, Muheet Ahmed Butt, Majid Zaman, and Omeera Ayob


Enhancing CNN with Pre-processing Stage in Illumination-Invariant Automatic Expression Recognition . . . 95
Hiral A. Patel, Nidhi Khatri, Keyur Suthar, and Hiral R. Patel

An Expert Eye for Identifying Shoplifters in Mega Stores . . . . . . . . . . . . . 107 Mohd. Aquib Ansari and Dushyant Kumar Singh Sanskrit Stemmer Design: A Literature Perspective . . . . . . . . . . . . . . . . . . 117 Jayashree Nair, Sooraj S. Nair, and U. Abhishek Predicting Prior Academic Failure of Students’ Using Machine Learning Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 Anamika and Maitreyee Dutta Deep Classifier for News Text Classification Using Topic Modeling Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 Megha Singla and Maitreyee Dutta Forecasting Covid-19 Cases in India using Multivariate Hybrid CNN-LSTM Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149 Abhishek Parashar and Yukti Mohan Multi-resolution Video Steganography Technique Based on Stationary Wavelet Transform (SWT) and Singular Value Decomposition (SVD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 Reham A. El-Shahed, M. N. Al-Berry, Hala M. Ebeid, and Howida A. Shedeed A Novel Dual-Threshold Weighted Feature Detection for Spectrum Sensing in 5G Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 Parnika Kansal, M. Gangadharappa, and Ashwni Kumar A Systematic Review on Various Attack Detection Methods for Wireless Sensor Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183 K. Jane Nithya and K. Shyamala Electronic Beam Steering in Timed Antenna Array by Controlling the Harmonic Patterns with Optimally Derived Pulse-Shifted Switching Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
205 Avishek Chakraborty, Gopi Ram, and Durbadal Mandal Classification of Attacks on MQTT-Based IoT System Using Machine Learning Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 Jigar Makhija, Akhil Appu Shetty, and Ananya Bangera Encrypted Traffic Classification Using eXtreme Gradient Boosting Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225 Neha Gupta, Vinita Jindal, and Punam Bedi


Analyzing Natural Language Essay Generator Models Using Long Short-Term Memory Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 Mayank Gaur, Mridul Arora, Varun Prakash, Yash Kumar, Kirti Gupta, and Preeti Nagrath Performance Evaluation of GINI Index and Information Gain Criteria on Geographical Data: An Empirical Study Based on JAVA and Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249 Sheikh Amir Fayaz, Majid Zaman, and Muheet Ahmed Butt Critical Analysis of Big Data Privacy Preservation Techniques and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267 Suman Madan, Kirti Bhardwaj, and Shubhangi Gupta Performance Improvement of Vector Control Permanent Magnet Synchronous Motor Drive Using Genetic Algorithm-Based PI Controller Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279 Rajesh Kumar Mahto, Ambarisha Mishra, and Bharti Kumari Monitoring and Protection of Induction Motors Against Abnormal Industrial Conditions Using PLC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289 Aaryan Sharma, Poras Khetarpal, Neelu Nagpal, and Ruchi Sharma A Vision-Based Gait Dataset for Knee Osteoarthritis and Parkinson’s Disease Analysis with Severity Levels . . . . . . . . . . . . . . . . 303 Navleen Kour, Sunanda, and Sakshi Arora A Survey of Recommender Systems Based on Semi-supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319 Aslam Hasan Khan, Jamshed Siddqui, and Shahab Saquib Sohail “Emerging Trends in Computational Intelligence to Solve Real-World Problems” Android Malware Detection Using Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
329 Dhananjay Singh, Shubham Karpa, and Indu Chawla A Novel Intrusion Detection System Using Deep Learning . . . . . . . . . . . . . 343 Tanay Singhania, Vatsal Agarwal, Sunakshi, Prashant Shambharkar Giridhar, and Trasha Gupta Solution to OCT Diagnosis Using Simple Baseline CNN Models and Hyperparameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353 Ajay Kumar Kushwaha and Somil Rastogi Land Rights Documentation and Verification System Using Blockchain Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367 Sifat Nur Billah, Farjana Hossain, and M. F. Mridha


Implication of Privacy Laws and Importance of ICTs to Government Vision of the Future . . . 383 Ayush Gupta, Prabhat Mittal, Pankaj Kumar Gupta, and Sakshi Bansal AI Approaches for Breast Cancer Diagnosis: A Comprehensive Study . . . 393 Harsh Jigneshkumar Patel, Parita Oza, and Smita Agrawal Energy-Efficient Lifetime and Network Performance Improvement for Mobility of Nodes in IoT . . . 421 Y. Chitrashekharaiah, N. N. Srinidhi, Dharamendra Chouhan, J. Shreyas, and S. M. Dilip Kumar Design and Implementation of Electronic Voting Using KECCAK256 Algorithm on Ethereum Network . . . 431 Oluwatosin James Fayemi, Aderonke Favour-Bethy Thompson, and Olaniyi Abiodun Ayeni VizAudi: A Predictive Audio Visualizer . . . 455 Sapna Malik, Aashish Upadhyay, Kartik Kumar, and Nachiketa Raina Universal Quantitative Steganalysis Using Deep Residual Networks . . . 465 Anuradha Singhal and Punam Bedi Image-Based Forest Fire Detection Using Bagging of Color Models . . . 477 Reyansh Mishra, Lakshay Gupta, Nitesh Gurbani, and Shiv Naresh Shivhare Machine Learning Techniques for Diagnosis of Type 2 Diabetes Using Lifestyle Data . . . 487 Shahid Mohammad Ganie, Majid Bashir Malik, and Tasleem Arif Deep Learning-Based Recognition of Personality and Leadership Qualities (DeePeR-LQ): Review . . . 499 Devraj Patel and Sunita V. Dhavale Sentence-Level Document Novelty Detection Using Latent Dirichlet Allocation with Auto-Encoders . . . 511 S. Adarsh, S. Asharaf, and V. S. Anoop Prediction of Environmental Diseases Using Machine Learning . . . 521 Amrita Sisodia and Rajni Jindal Frequent Itemset Mining Using Genetic Approach . . . 533 Renji George Amballoor and Shankar B. Naik Gesture-Based Media Controlling Using Haar Cascade . . . 541 Pragati Chandankhede and Sana Haji Comparative Analysis of Models for Abstractive Text Summarization . . . 553 Minakshi Tomer and Manoj Kumar


Polarity Detection Across the Globe Using Sentiment Analysis on COVID-19-Related Tweets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565 M. Uvaneshwari, Ekata Gupta, Mukta Goyal, N. Suman, and M. Geetha FOG-EE Computing: Fog, Edge and Elastic Computing, New Age Cloud Computing Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579 Shristi Achari and Rahul Johari Hybrid Filter for Dorsal Hand Vein Images . . . . . . . . . . . . . . . . . . . . . . . . . . 591 Nisha Charaya, Anil Kumar, and Priti Singh Satellite Image Enhancement and Restoration Using RLS Adaptive Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601 Rekh Ram Janghel, Saroj Kumar Pandey, Aayush Jain, Aditi Gupta, and Avishi Bansal Efficient Recommendation System Using Latent Semantic Analysis . . . . 615 Rahul Budhraj, Pooja Kherwa, Shreyans Sharma, and Sakshi Gill A Study of Machine Learning Techniques for Fake News Detection and Suggestion of an Ensemble Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627 Rajni Jindal, Diksha Dahiya, Devyani Sinha, and Ayush Garg Metric Learning with Deep Features for Highly Imbalanced Face Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639 Ashu Kaushik and Seba Susan An Adaptable Ensemble Architecture for Malware Detection . . . . . . . . . . 647 D. T. Mane, P. B. Kumbharkar, Santosh B. Javheri, and Rahul Moorthy An Application of Deep Learning in Identification of Depression Among Twitter Users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661 Ashutosh Shankdhar, Rishik Mishra, and Nitya Shukla Performance Evaluation of LSB Sequential and Pixel Indicator Algorithms in Image Steganography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
671 Jabed Al Faysal and Khalid Mahbub Jahan MATHS: Machine Learning Techniques in Healthcare System . . . . . . . . . 693 Medha Chugh, Rahul Johari, and Anmol Goel EnSOTA: Ensembled State of the Art Model for Enhanced Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703 Jayesh Gupta, Arushi Sondhi, Jahnavi Seth, Moolchand Sharma, Farzil Kidwai, and Aruna Jain A Coronavirus Herd Immunity Optimization (CHIO) for Travelling Salesman Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717 Lamees Mohammad Dalbah, Mohammed Azmi Al-Betar, Mohammed A. Awadallah, and Raed Abu Zitar


System for Situational Awareness Using Geospatial Twitter Data . . . . . . . 731 Hamid Omar, Akash Sinha, and Prabhat Kumar Classification of Malware Using Visualization Techniques . . . . . . . . . . . . . 739 Divyansh Chauhan, Harjot Singh, Himanshu Hooda, and Rahul Gupta Classification and Activation Map Visualization of Banana Diseases Using Deep Learning Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 751 Priyanka Sahu, Anuradha Chug, Amit Prakash Singh, Dinesh Singh, and Ravinder Pal Singh Exploring Total Quality Management Implementation Levels in I.T. Industry Using Machine Learning Models . . . . . . . . . . . . . . . . . . . . . 769 Kapil Jaiswal, Sameer Anand, and Rupali Arora Predicting an Indian Firm’s Sickness Using Artificial Neural Networks and Traditional Methods: A Comparative Study . . . . . . . . . . . . 785 Narander Kumar Nigam, Harshit Agarwal, and Khushi Goyal Analysis of Change of Market Value of Bitcoin Using Econometric Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 797 Harivansh Gahlot, Irsheen Baveja, Gurjeet Kaur, and Sandra Suresh Detection of COVID-19 Using Intelligent Computing Method . . . . . . . . . . 819 Asmita Dixit, Aatif Jamshed, and Ritin Behl Two-Line Defense Ontology-Based Trust Management Model . . . . . . . . . 833 Wurood AL-Shadood, Haleh Amintoosi, and Mouiad AL-Wahah A Machine Learning-Based Data Fusion Model for Online Traffic Violations Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 847 Salama A. Mostafa, Aida Mustapha, Azizul Azhar Ramli, Mohd Farhan M. D. Fudzee, David Lim, and Shafiza Ariffin Kashinath Review of IoT for COVID-19 Detection and Classification . . . . . . . . . . . . . 
859 Maha Mahmood, Wijdan Jaber AL-Kubaisy, and Belal AL-Khateeb On the Implementation and Placement of Hybrid Beamforming for Single and Multiple Users in the Massive-MIMO MmWave Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 873 Mustafa S. Aljumaily, Husheng Li, Ahmed Hammoodi, Lukman Audah, and Mazin Abed Mohammed Neural Network Based Windowing Scheme to Maximize the PSD for 5G and Beyond . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 881 Ahmed Hammoodi, Lukman Audah, Mustafa S. Aljumaily, Mazin Abed Mohammed, and Jamal Rasool Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891

Editors and Contributors

About the Editors

Dr. Ashish Khanna has 16 years of expertise in teaching, entrepreneurship, and research and development. He received his Ph.D. degree from the National Institute of Technology, Kurukshetra. He completed his M.Tech. and B.Tech. from GGSIPU, Delhi, and his postdoctoral research at the Internet of Things Laboratory at Inatel, Brazil, and the University of Valladolid, Spain. He has published around 55 SCI-indexed papers in IEEE Transactions, Springer, Elsevier, Wiley, and many more reputed journals, with a cumulative impact factor above 100. He has around 120 research articles in top SCI/Scopus journals, conferences and book chapters. He is Co-Author of around 30 edited books and textbooks. His research interests include distributed systems, MANET, FANET, VANET, IoT, machine learning, and many more. He is Originator of Bhavya Publications and Universal Innovator Laboratory. Universal Innovator is actively involved in research, innovation, conferences, startup funding events, and workshops. He has served the research field as a Keynote Speaker, Faculty Resource Person, Session Chair, Reviewer and TPC Member, and through postdoctoral supervision. He is Convener and Organizer of the ICICC conference series. He is currently working at the Department of Computer Science and Engineering, Maharaja Agrasen Institute of Technology, under GGSIPU, Delhi, India. He is also serving as Series Editor for the Elsevier and De Gruyter publishing houses. Dr. Deepak Gupta received a B.Tech. degree in 2006 from the Guru Gobind Singh Indraprastha University, India. He received an M.E. degree in 2010 from Delhi Technological University, India, and a Ph.D. degree in 2017 from Dr. APJ Abdul Kalam Technical University, India. He completed his postdoctoral research at Inatel, Brazil. With 13 years of rich expertise in teaching and two years in industry, he focuses on rational and practical learning.
He has contributed extensively to the literature in the fields of Intelligent Data Analysis, Biomedical Engineering, Artificial Intelligence, and Soft Computing. He has served as Editor-in-Chief, Guest Editor, and Associate Editor for SCI and various other reputed journals (IEEE, Elsevier, Springer, and Wiley). He has also been actively involved in organizing various reputed international conferences.


He has authored/edited 50 books with national/international-level publishers (IEEE, Elsevier, Springer, Wiley, and Katson). He has published 180 scientific research publications in reputed international journals and conferences, including 94 papers in SCI-indexed journals of IEEE, Elsevier, Springer, Wiley, and many more. Prof. Siddhartha Bhattacharyya, FIET (UK), is currently the Principal of Rajnagar Mahavidyalaya, Birbhum, India. Prior to this, he was Professor at Christ University, Bangalore, India. He served as Senior Research Scientist at the Faculty of Electrical Engineering and Computer Science of VSB-Technical University of Ostrava, Czech Republic, from October 2018 to April 2019. He also served as Principal of RCC Institute of Information Technology, Kolkata, India. He is Co-Author of 6 books and Co-Editor of 75 books and has more than 300 research publications in international journals and conference proceedings to his credit. His research interests include soft computing, pattern recognition, multimedia data processing, hybrid intelligence, and quantum computing. Prof. Aboul Ella Hassanien is Founder and Head of the Egyptian Scientific Research Group (SRGE) and Professor of Information Technology at the Faculty of Computer and Artificial Intelligence, Cairo University. Professor Hassanien is Ex-Dean of the Faculty of Computers and Information, Beni Suef University. Professor Hassanien has more than 800 scientific research papers published in prestigious international journals and over 40 books covering such diverse topics as data mining, medical images, intelligent systems, social networks, and smart environment. Prof. Hassanien has won several awards, including the Best Researcher of the Youth Award of Astronomy and Geophysics of the National Research Institute, Academy of Scientific Research (Egypt, 1990).
He was also granted a Scientific Excellence Award in Humanities from the University of Kuwait in 2004 and received the Scientific University Award (Cairo University, 2013). He was also honored in Egypt as the best researcher at Cairo University in 2013. He received the Islamic Educational, Scientific and Cultural Organization (ISESCO) prize on Technology (2014) and the State Award for Excellence in Engineering Sciences in 2015. He was awarded the Medal of Sciences and Arts of the First Class by the President of the Arab Republic of Egypt in 2017. Dr. Sameer Anand is currently working as Assistant Professor in the Department of Computer Science at Shaheed Sukhdev College of Business Studies, University of Delhi, Delhi. He received his M.Sc., M.Phil., and Ph.D. (Software Reliability) from the Department of Operational Research, University of Delhi. He is a recipient of the ‘Best Teacher Award’ (2012) instituted by the Directorate of Higher Education, Govt. of NCT, Delhi. The research interests of Dr. Anand include operational research, software reliability, and machine learning. He has completed an Innovation project from the University of Delhi. He has worked in various capacities at international conferences. Dr. Anand has published several papers in reputed journals like IEEE Transactions on Reliability, International Journal of Production Research (Taylor & Francis), International Journal of Performability Engineering, etc. He is Member


of Society for Reliability Engineering, Quality, and Operations Management. Dr. Sameer Anand has more than 16 years of teaching experience. Dr. Ajay Jaiswal is currently serving as Assistant Professor in the Department of Computer Science of Shaheed Sukhdev College of Business Studies, University of Delhi, Delhi. He is Co-Editor of two books/journals and Co-Author of dozens of research publications in international journals and conference proceedings. His research interests include pattern recognition, image processing, and machine learning. He has completed an interdisciplinary project titled “Financial Inclusion-Issues and Challenges: An Empirical Study” as Co-PI. This project was awarded by the University of Delhi. He obtained his master's degree from the University of Roorkee (now IIT Roorkee) and Ph.D. from Jawaharlal Nehru University, Delhi. He is a recipient of the Best Teacher Award from the Government of NCT of Delhi. He has more than nineteen years of teaching experience.

Contributors

S. M. Abdullaev South Ural State University, Chelyabinsk, Russian Federation
U. Abhishek Department of Computer Science and Applications, Amrita Vishwa Vidyapeetham, Amritapuri, India
Shristi Achari SWINGER: Security, Wireless, IoT Network Group of Engineering and Research Lab, University School of Information, Communication and Technology (USICT), Guru Gobind Singh Indraprastha University, Dwarka, Delhi, India
S. Adarsh Indian Institute of Information Technology and Management-Kerala (IIITM-K), Thiruvananthapuram, India
Amit Agarwal Department of Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, Delhi, India
Harshit Agarwal Shaheed Sukhdev College of Business Studies, University of Delhi, New Delhi, India
Vatsal Agarwal Department of Applied Mathematics, Delhi Technological University, New Delhi, India
Apurva Aggarwal Department of CSE & IT, JIIT, Noida, India
Smita Agrawal Nirma University, Ahmedabad, India
Jabed Al Faysal Khulna University, Khulna, Bangladesh
M. N. Al-Berry Scientific Computing Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt


Editors and Contributors

Mohammed Azmi Al-Betar Artificial Intelligence Research Center (AIRC), College of Engineering and Information Technology, Ajman University, Ajman, UAE Belal AL-Khateeb College of Computer Science and Information Technology, University of Anbar, Ramadi, Iraq Wijdan Jaber AL-Kubaisy College of Computer Science and Information Technology, University of Anbar, Ramadi, Iraq Wurood AL-Shadood Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran Mouiad AL-Wahah Department of Computer Science, Thi-Qar University, DhiQar, Iraq Mustafa S. Aljumaily Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA Renji George Amballoor Directorate of Higher Education Government of Goa, Goa, India Haleh Amintoosi Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran Anamika NITTTR, Chandigarh, India Sameer Anand Shaheed Sukhdev College of Business Studies, New Delhi, India V. S. Anoop Rajagiri College of Social Sciences, Kochi, India Mohd. Aquib Ansari CSED, MNNIT Allahabad, Prayagraj, India Tasleem Arif Department of Information Technology, BGSB University, Rajouri, J&K, India Mridul Arora Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, New Delhi, India Rupali Arora Chandigarh University, SAS Nagar, India Sakshi Arora School of Computer Science and Engineering, Shri Mata Vaishno Devi University, Katra, India S. Asharaf Indian Institute of Information Technology and Management-Kerala (IIITM-K), Thiruvananthapuram, India Mudsir Ashraf Jain University, Bangalore, India Lukman Audah Wireless and Radio Science Centre (WARAS), Faculty of Electrical and Electronic Engineering, Universiti Tun Hussein Onn Malaysia, Parit Raja, Batu Pahat, Johor, Malaysia Mohammed A. Awadallah Artificial Intelligence Research Center (AIRC), College of Engineering and Information Technology, Ajman University, Ajman,


UAE; Department of Computer Science, Al-Aqsa University, Gaza, Palestine Olaniyi Abiodun Ayeni Cyber Security Department, Federal University of Technology, Akure, Nigeria Omeera Ayob Department of Food Technology, New Delhi, India Ananya Bangera Sahyadri College of Engineering and Management, Mangalore, India Avishi Bansal National Institute of Technology, Raipur, India Sakshi Bansal Janki Devi Memorial College, University of Delhi, New Delhi, India Irsheen Baveja Shaheed Sukhdev College of Business Studies, Delhi, India Punam Bedi Department of Computer Science, University of Delhi, New Delhi, Delhi, India Ritin Behl Department of Information Technology, ABES Engineering College, Ghaziabad, Uttar Pradesh, India Kirti Bhardwaj Jagan Institute of Management Studies, Rohini, Delhi, India Ashish Bhatghare Department of Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, Delhi, India Muheet Ahmed Bhut University of Kashmir, Sirnagar, India Sifat Nur Billah Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka, Bangladesh Rahul Budhraj Computer Science and Engineering Department, Maharaja Surajmal Institute of Technology, GGSIPU, New Delhi, India Muheet Ahmed Butt Department of Computer Science, University of Kashmir, Srinagar, India Avishek Chakraborty Department of ECE, NIT Durgapur, Durgapur, West Bengal, India Pragati Chandankhede Sir Padampat Singhania University, Udaipur, India Nisha Charaya Amity University, Gurgaon, Haryana, India Divyansh Chauhan Delhi Technological University, New Delhi, India Indu Chawla HCL Technologies Ltd., Noida, India Indu Chawla Infosys Ltd., Pune, India Indu Chawla Jaypee Institute of Information Technology, Noida, India Y. Chitrashekharaiah Department of Computer Science and Engineering, University Visvesvaraya College of Engineering, Bangalore, India


Dharamendra Chouhan Department of Computer Science and Engineering, University Visvesvaraya College of Engineering, Bangalore, India Anuradha Chug University School of Information, Communication, and Technology, Guru Gobind Singh Indraprastha University, New Delhi, India Medha Chugh SWINGER,USICT, GGSIPU, New Delhi, Delhi, India Diksha Dahiya Department of Software Engineering, Delhi Technological University, Rohini, Delhi, India Lamees Mohammad Dalbah Artificial Intelligence Research Center (AIRC), College of Engineering and Information Technology, Ajman University, Ajman, UAE Sunita V. Dhavale Department of Computer Science and Engineering, Defence Institute of Advanced Technology, Pune, India S. M. Dilip Kumar Department of Computer Science and Engineering, University Visvesvaraya College of Engineering, Bangalore, India Asmita Dixit ABES Engineering College, Ghaziabad, Uttar Pradesh, India Maitreyee Dutta NITTTR, Chandigarh, India Hala M. Ebeid Scientific Computing Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt Reham A. El-Shahed Scientific Computing Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt Sheikh Amir Fayaz Department of Computer Science, University of Kashmir, Srinagar, India Oluwatosin James Fayemi Computer Science Department, Federal University of Technology, Akure, Nigeria Mohd Farhan M. D. Fudzee Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia Harivansh Gahlot Shaheed Sukhdev College of Business Studies, Delhi, India M. 
Gangadharappa Department of Electronics and Communication, Ambedkar Institute of Advanced Communication Technologies and Research, Delhi, India Shahid Mohammad Ganie Department of Computer Sciences, BGSB University, Rajouri, J&K, India Ayush Garg Department of Software Engineering, Delhi Technological University, Rohini, Delhi, India Mayank Gaur Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, New Delhi, India M. Geetha SR University, Warangal, Telangana, India


Sakshi Gill Computer Science and Engineering Department, Maharaja Surajmal Institute of Technology, GGSIPU, New Delhi, India Prashant Shambharkar Giridhar Department of Computer Science and Engineering, Delhi Technological University, New Delhi, India Anmol Goel SWINGER,USICT, GGSIPU, New Delhi, Delhi, India Khushi Goyal Shaheed Sukhdev College of Business Studies, University of Delhi, New Delhi, India Mukta Goyal Guru Nanak Dev Institute of Technology, New Delhi, Delhi, India Aditi Gupta National Institute of Technology, Raipur, India Ayush Gupta University of Turku, Turku, Finland Ekata Gupta Guru Nanak Institute of Management, New Delhi, Delhi, India Jayesh Gupta Maharaja Agrasen Institute of Technology, Delhi, India Kirti Gupta Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, New Delhi, India Lakshay Gupta School of Computer Science, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India Neha Gupta Department of Computer Science, University of Delhi, Delhi, India Pankaj Kumar Gupta Jamia Milia Islamia University, New Delhi, India Rahul Gupta Delhi Technological University, New Delhi, India Sarishty Gupta Department of CSE & IT, JIIT, Noida, India Shubhangi Gupta Jagan Institute of Management Studies, Rohini, Delhi, India Trasha Gupta Department of Applied Mathematics, Delhi Technological University, New Delhi, India Nitesh Gurbani School of Computer Science, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India Sana Haji K.C.College of Engineering, Mumbai, India Ahmed Hammoodi Wireless and Radio Science Centre (WARAS), Faculty of Electrical and Electronic Engineering, Universiti Tun Hussein Onn Malaysia, Parit Raja, Batu Pahat, Johor, Malaysia Mitul Hirna Department of Computer Science and Engineering, NSUT, New Delhi, India Himanshu Hooda Delhi Technological University, New Delhi, India Farjana Hossain Department of Computer Science and Engineering, Bangladesh University of Business and 
Technology, Dhaka, Bangladesh


Khalid Mahbub Jahan University of Dhaka, Dhaka, Bangladesh Aayush Jain National Institute of Technology, Raipur, India Aruna Jain Bharati College, University of Delhi, Delhi, India Somil Jain Department of Computer Science and Engineering, NSUT, New Delhi, India Vibha Jain Department of Computer Science and Engineering, NSUT, New Delhi, India Kapil Jaiswal Operations and Development—OATI, Chandigarh, India Aatif Jamshed ABES Engineering College, Ghaziabad, Uttar Pradesh, India K. Jane Nithya Department of Computer Science, Ethiraj College for Women, Tamil Nadu, Chennai, India Rekh Ram Janghel National Institute of Technology, Raipur, India Santosh B. Javheri JSPM’s Rajarshi Shahu College of Engineering, Pune, Maharashtra, India Rajni Jindal Department of Computer Science and Engineering, Delhi Technological University, New Delhi, Delhi, India; Department of Software Engineering, Delhi Technological University, Rohini, Delhi, India Vinita Jindal Department of Computer Science, Keshav Mahavidyalaya, University of Delhi, Delhi, India Rahul Johari SWINGER: Security, Wireless, IoT Network Group of Engineering and Research Lab, University School of Information, Communication and Technology (USICT), Guru Gobind Singh Indraprastha University, Dwarka, Delhi, India Ishan Joshi Department of Information Technology, Bhagwan Parshuram Institute of Technology, New Delhi, India Advait Kale Computer Engineering, NMIMS Deemed-To-Be University, Mumbai, Maharashtra, India Parnika Kansal Department of Electronics and Communication, Indira Gandhi Delhi Technical University, Delhi, India Shubham Karpa HCL Technologies Ltd., Noida, India Shubham Karpa Infosys Ltd., Pune, India Shubham Karpa Jaypee Institute of Information Technology, Noida, India Shafiza Ariffin Kashinath Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia


Shafiza Ariffin Kashinath Engineering R&D Department, Sena Traffic Systems Sdn. Bhd., Kuala Lumpur, Malaysia Gurjeet Kaur Assistant Professor, Shaheed Sukhdev College of Business Studies, Delhi, India Ashu Kaushik Department of Information Technology, Delhi Technological University, Delhi, India Aslam Hasan Khan Aligarh Muslim University, Aligarh, India Nidhi Khatri Department of Computer Engineering, SVNIT-VASAD, Vasad, India Pooja Kherwa Computer Science and Engineering Department, Maharaja Surajmal Institute of Technology, GGSIPU, New Delhi, India Poras Khetarpal Bharati Vidyapeeth’s College of Engineering, New Delhi, India Farzil Kidwai Maharaja Agrasen Institute of Technology, Delhi, India Navleen Kour School of Computer Science and Engineering, Shri Mata Vaishno Devi University, Katra, India Deepa Krishnan Computer Engineering, NMIMS Deemed-To-Be University, Mumbai, Maharashtra, India Anil Kumar Amity University, Gurgaon, Haryana, India Ashwni Kumar Department of Electronics and Communication, Indira Gandhi Delhi Technical University, Delhi, India Kartik Kumar Department of Computer Science, Maharaja Surajmal Institute of Technology, New Delhi, India Manoj Kumar Professor, Netaji Subhas University of Technology, East Campus (Formerly Ambedkar Institute of Advanced Communication Technologies and Research), New Delhi, India Prabhat Kumar Computer Science and Engineering, National Institute of Technology, Patna, India Yash Kumar Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, New Delhi, India Bharti Kumari Nalanda College of Engineering, Chandi, Nalanda, India P. B. Kumbharkar JSPM’s Rajarshi Shahu College of Engineering, Pune, Maharashtra, India Ajay Kumar Kushwaha Department of Computer Science and IT, Jaypee Institute of Information Technology, Noida, India Sangeeta Lal School of Computing and Mathematics, Keele University, NewcastleUnder-Lyme, UK


Husheng Li Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA David Lim Engineering R&D Department, Sena Traffic Systems Sdn. Bhd., Kuala Lumpur, Malaysia Suman Madan Jagan Institute of Management Studies, Rohini, Delhi, India Maha Mahmood College of Computer Science and Information Technology, University of Anbar, Ramadi, Iraq Rajesh Kumar Mahto National Institute of Technology Patna, Patna, India Arjun Majithia Department of Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, Delhi, India Jigar Makhija Amrita Vishwa Vidyapeetham, Coimbatore, India Jigyasu Malhotra Department of Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, Delhi, India Majid Bashir Malik Department of Computer Sciences, BGSB University, Rajouri, J&K, India Sapna Malik Department of Computer Science, Maharaja Surajmal Institute of Technology, New Delhi, India Durbadal Mandal Department of ECE, NIT Durgapur, Durgapur, West Bengal, India D. T. Mane JSPM’s Rajarshi Shahu College of Engineering, Pune, Maharashtra, India Ambarisha Mishra National Institute of Technology Patna, Patna, India Reyansh Mishra School of Computer Science, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India Rishik Mishra Department of Computer Engineering and Applications, GLA University, Mathura, Uttar Pradesh, India Prabhat Mittal Satyawati College (Evening), University of Delhi, New Delhi, India Mazin Abed Mohammed College of Computer Science and Information Technology, University of Anbar, Anbar, Iraq Yukti Mohan Maharaja Surajmal Institute of Technology, New Delhi, India Rahul Moorthy Pune Institute of Computer Technology, Pune, Maharastra, India Salama A. Mostafa Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia M. F. 
Mridha Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka, Bangladesh


Pooja Mudgil Department of Information Technology, Bhagwan Parshuram Institute of Technology, New Delhi, India Saahil Hussain Mugloo Department of Computer Science and Engineering, NSUT, New Delhi, India Aida Mustapha Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia Neelu Nagpal Maharaja Agarsen Institute of Technology, New Delhi, India Preeti Nagrath Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, New Delhi, India Shankar B. Naik Directorate of Higher Education Government of Goa, Goa, India Jayashree Nair Department of Computer Science and Applications, Amrita Vishwa Vidyapeetham, Amritapuri, India Sooraj S. Nair Department of Computer Science and Applications, Amrita Vishwa Vidyapeetham, Amritapuri, India Narander Kumar Nigam Shaheed Sukhdev College of Business Studies, University of Delhi, New Delhi, India Hamid Omar Computer Science and Engineering, Vellore Institute of Technology, Vellore, India Parita Oza Nirma University, Ahmedabad, India Saroj Kumar Pandey National Institute of Technology, Raipur, India Abhishek Parashar Maharaja Surajmal Institute of Technology, New Delhi, India Devraj Patel Defence Institute of Advanced Technology, Pune, India Harsh Jigneshkumar Patel Nirma University, Ahmedabad, India Hiral A. Patel Faculty of Computer Application, Ganpat University, Mehsana, India Hiral R. Patel Faculty of Computer Application, Ganpat University, Mehsana, India Varun Prakash Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, New Delhi, India Nachiketa Raina Department of Computer Science, Maharaja Surajmal Institute of Technology, New Delhi, India Gopi Ram Department of ECE, NIT Warangal, Warangal, Telangana, India Azizul Azhar Ramli Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia


Poonam Rani Department of Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, Delhi, India Jamal Rasool Department of Communication engineering, University of Technology, Baghdad, Iraq Somil Rastogi Department of Computer Science and IT, Jaypee Institute of Information Technology, Noida, India Richa School of Computer Science & Engineering, Vellore Institute of Technology, Chennai, India Lubna Riyaz Department of Computer Sciences, University of Kashmir, Srinagar, India Priyanka Sahu University School of Information, Communication, and Technology, Guru Gobind Singh Indraprastha University, New Delhi, India Mohammad Saif Department of Computer Science and Engineering, NSUT, New Delhi, India Yass Khudheir Salal South Ural State University, Chelyabinsk, Russian Federation Jahnavi Seth Maharaja Agrasen Institute of Technology, Delhi, India Ashutosh Shankdhar Department of Computer Engineering and Applications, GLA University, Mathura, Uttar Pradesh, India Aaryan Sharma Bharati Vidyapeeth’s College of Engineering, New Delhi, India Chhavi Sharma University of Delhi, New Delhi, India Hardik Sharma Computer Engineering, NMIMS Deemed-To-Be University, Mumbai, Maharashtra, India Moolchand Sharma Maharaja Agrasen Institute of Technology, Delhi, India Ruchi Sharma Bharati Vidyapeeth’s College of Engineering, New Delhi, India Shreyans Sharma Computer Science and Engineering Department, Maharaja Surajmal Institute of Technology, GGSIPU, New Delhi, India Howida A. Shedeed Scientific Computing Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt Akhil Appu Shetty Sahyadri College of Engineering and Management, Mangalore, India Shiv Naresh Shivhare School of Computer Science, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India Jyoti Shokeen Department of Computer Science and Engineering, UIET, Maharshi Dayanand University Rohtak, Rohtak, Haryana, India


J. Shreyas Department of Computer Science and Engineering, University Visvesvaraya College of Engineering, Bangalore, India Nitya Shukla Department of Computer Engineering and Applications, GLA University, Mathura, Uttar Pradesh, India K. Shyamala Dr. Ambedkar Government Arts College, Tamil Nadu, Chennai, India Jamshed Siddqui Aligarh Muslim University, Aligarh, India Anuradha Singhal Department of Computer Science, University of Delhi, New Delhi, Delhi, India Tanay Singhania Department of Applied Mathematics, Delhi Technological University, New Delhi, India Amit Prakash Singh University School of Information, Communication, and Technology, Guru Gobind Singh Indraprastha University, New Delhi, India Dhananjay Singh HCL Technologies Ltd., Noida, India Dhananjay Singh Infosys Ltd., Pune, India Dhananjay Singh Jaypee Institute of Information Technology, Noida, India Dinesh Singh Division of Plant Pathology, Indian Agricultural Research Institute, New Delhi, India Dushyant Kumar Singh CSED, MNNIT Allahabad, Prayagraj, India Harjot Singh Delhi Technological University, New Delhi, India Priti Singh Amity University, Gurgaon, Haryana, India Ravinder Pal Singh Division of Plant Pathology, Indian Agricultural Research Institute, New Delhi, India Megha Singla NITTTR, Chandigarh, India Akash Sinha Computer Science and Engineering, National Institute of Technology, Patna, India Devyani Sinha Department of Software Engineering, Delhi Technological University, Rohini, Delhi, India Divit Sinha Computer Engineering, NMIMS Deemed-To-Be University, Mumbai, Maharashtra, India Amrita Sisodia Department of Computer Science and Engineering, Delhi Technological University, New Delhi, Delhi, India Shahab Saquib Sohail Jamia Hamdard University, New Delhi, India Arushi Sondhi Maharaja Agrasen Institute of Technology, Delhi, India


N. N. Srinidhi Department of Computer Science and Engineering, Sri Krishna Institute of Technology, Bangalore, India N. Suman SR University, Warangal, Telangana, India Sunakshi Department of Applied Mathematics, Delhi Technological University, New Delhi, India Sunanda School of Computer Science and Engineering, Shri Mata Vaishno Devi University, Katra, India Sandra Suresh Assistant Professor, Shaheed Sukhdev College of Business Studies, Delhi, India Seba Susan Department of Information Technology, Delhi Technological University, Delhi, India Keyur Suthar Department of Computer Engineering, SVNIT-VASAD, Vasad, India Aderonke Favour-Bethy Thompson Cyber Security Department, Federal University of Technology, Akure, Nigeria Minakshi Tomer Research Scholar, USICT, GGSIPU, New Delhi, India Minakshi Tomer IT Department, MSIT, New Delhi, India Bharat Udawat Computer Engineering, NMIMS Deemed-To-Be University, Mumbai, Maharashtra, India Aashish Upadhyay Department of Computer Science, Maharaja Surajmal Institute of Technology, New Delhi, India M. Uvaneshwari SRM Institute of Science and Technology, Chennai, India Majid Zaman Directorate of IT & SS, University of Kashmir, Srinagar, India Majid Zaman University of Kashmir, Sirnagar, India Raed Abu Zitar Sorbonne University Center of Artificial Intelligence, Sorbonne University-Abu Dhabi, Abu Dhabi, UAE

Explanation-Based Serendipitous Recommender System (EBSRS) Richa, Chhavi Sharma, and Punam Bedi

Abstract Recommender systems (RSs) have gained immense popularity and achieved great success as intelligent information systems that help deal with the information overload problem. RSs have long been evaluated primarily for accuracy. Nowadays, along with the accuracy of the presented recommendations, factors like novelty, diversity and serendipity have become important aspects of recommendation systems. In this paper, we propose the Explanation-based Serendipitous Recommender System (EBSRS), which generates explanations for the serendipitous recommendations presented to the user. The approach integrates the concept of serendipity into recommendations, ensuring their relevance while generating serendipitous suggestions, and produces an explanation for each serendipitous recommendation to justify the recommended list. The proposed approach is evaluated using accuracy and relevancy measures: precision, recall and f-measure serve as the accuracy measures, whereas explanation coverage and unexpectedness are used for the relevancy measure.

Keywords Recommender system · Serendipity · Explanation · Relevancy · Unexpectedness

1 Introduction

Recent years have experienced exponential growth in the information available on the Internet, which leads to the information overload problem. To assist users in the decision-making process, recommender systems (RSs) came into existence.

Richa (B), School of Computer Science & Engineering, Vellore Institute of Technology, Chennai, India; C. Sharma, University of Delhi, New Delhi, India; P. Bedi, Department of Computer Science, University of Delhi, New Delhi, India, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022. A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_1

The


primary goal of a recommender system is to suggest to users products that suit their interests. The broad categorization of RSs suggested by various researchers is: collaborative filtering (CF), content-based (CB) and hybrid approaches [1]. The basic idea behind collaborative filtering is that if users have shared similar interests in the past, they might share them in the future as well. The content-based approach relies on the content of the items and the user's profile. The hybrid approach is a combination of two or more recommendation approaches. Although the major objective of recommender systems is to improve the accuracy of the generated recommendation list, recent trends argue in favor of factors that go beyond accuracy, e.g., serendipity and diversity [2]. There are various problems that recommender systems deal with, e.g., cold start, overspecialization and sparsity [3–6]. Serendipity helps deal with the overspecialization problem and improves the quality of recommendations [7]. Serendipity is found to be an important aspect for expanding the user's taste and also provides a 'surprise me' list for the user. Various researchers have provided reasons for and formalizations of the definition of serendipity. A few factors are considered important in order to generate serendipitous recommendations for users. This paper considers two such factors, relevancy and unexpectedness. In recent years, explanation has gained its share of importance in recommender systems. The literature reports an increase in user acceptance of presented recommendations when explanations are provided, which helps users settle more conveniently and quickly on a decision. While presenting serendipitous recommendations, it therefore becomes important to provide an explanation for them in order to achieve transparency and user acceptance [8].
An ontology is an explicit specification of concepts and the relationships that can exist among them. Representing the concepts of a domain and the relations between those concepts has been explored well using ontologies. In ontology-based RSs, both the user profile and the recommendable items are represented by an ontology. The set of objects and the describable relationships among them are reflected in the representational vocabulary with which a knowledge-based program represents knowledge [9, 10]. Spreading activation is a widely used technique for uncovering hidden network information; it is used to find probable concepts to include in the extension of the ontology [11]. An explanation ontology that models both explanations and the underlying knowledge is proposed by [12], where the authors discuss how an ontology can support user requirements for explanation. In this paper, we propose an approach to generate serendipitous recommendations along with explanations using an ontology and the spreading activation technique. To generate the serendipitous list, two major factors, relevancy and unexpectedness, are taken into consideration. For the explanation of the generated recommendations, a few factors along with the ontology are exploited. The proposed approach deals with the overspecialization problem and enhances user acceptance of the generated recommendations. The rest of this paper is organized as follows: Sect. 2 gives the related work of other authors, Sect. 3 describes the proposed approach EBSRS, Sect. 4 discusses the experimental evaluation and results, and Sect. 5 concludes the paper.
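As a rough, self-contained illustration of the spreading activation idea mentioned above, the following sketch propagates interest from a seed concept through a toy concept graph. The graph, edge weights, decay factor and threshold are all hypothetical placeholders, not the ontology or parameters used in this work.

```python
def spread_activation(graph, seeds, decay=0.5, threshold=0.1):
    """Propagate activation from seed concepts to connected concepts.

    graph: dict concept -> list of (neighbor, edge_weight) pairs
    seeds: list of (concept, initial_activation) pairs
    """
    activation = dict(seeds)
    frontier = list(seeds)
    while frontier:
        concept, energy = frontier.pop()
        for neighbor, weight in graph.get(concept, []):
            passed = energy * weight * decay   # energy decays at each hop
            if passed > threshold:             # stop spreading weak signals
                activation[neighbor] = activation.get(neighbor, 0.0) + passed
                frontier.append((neighbor, passed))
    return activation

# Hypothetical tourism-domain fragment: related concepts with edge weights.
ontology = {
    "Beach":       [("WaterSports", 0.9), ("Resort", 0.7)],
    "WaterSports": [("Surfing", 0.9)],
    "Resort":      [("Spa", 0.8)],
}

# A user interested in "Beach" also activates connected, unvisited concepts.
scores = spread_activation(ontology, [("Beach", 1.0)])
```

Concepts reached only through the graph (here Surfing or Spa) are candidates the user never stated interest in, yet they remain connected to the stated interest, which is the property an ontology-based RS can exploit for serendipity.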


2 Literature Survey

Recommender systems have been discussed extensively by various researchers in various fields over the last decade [1]. Providing quality recommendations has always been the priority of proposed recommendation algorithms, and the literature has also produced improved approaches that explore aspects beyond the accuracy of the presented list. Serendipity has been considered a major breakthrough for engaging expanded user interests. In the paper titled 'A survey of serendipity in Recommender Systems,' the authors summarize various proposed approaches to serendipity in recommender systems and compare various definitions and validations of the concept. The challenges of serendipity in RSs have been discussed by [13]. De Gemmis et al. [14] have proposed a technique using a graph-based recommendation algorithm with background knowledge that allows the system to deeply understand the items it deals with. The work by [15] presents the design and implementation of a hybrid RS that connects a content-based approach with a serendipity factor in order to alleviate the overspecialization problem through serendipitous suggestions. In the paper by [16], the authors proposed a method to improve user satisfaction by generating unexpected recommendations based on the utility theory of economics. They defined and formalized the concept of unexpectedness and discussed how it differs from the similar concepts of novelty, serendipity and diversity. In a Ph.D. thesis, the author [17] used music data to incorporate serendipity into playlist recommendation while ensuring the accuracy of recommended items. In a work based on the social network of a user, the authors [18] proposed a friend recommender system that functions in the social bookmarking application domain and is founded on behavioral data mining, i.e., on the deployment of the users' activity in a social bookmarking system. Their results show how this type of mining is able to produce accurate, novel and serendipitous friend recommendations. Zheng et al. [19] have modeled unexpectedness by combining the notions of item rareness and dissimilarity. A review of explanation in RSs is presented by [8], where the authors found that a set of characteristics is associated with explanation, e.g., transparency, validity, scrutability, trust, relevance, persuasiveness, comprehensibility, effectiveness, efficiency, satisfaction and education. These characteristics help increase the trustworthiness of the system. To regain customers' trust and user acceptance, [20] also provide their views on adding justification to a recommendation. A hybrid probabilistic model for real-time recommendation generation has been proposed by [21] in order to obtain personalized recommendations. The trust of a user depends on various factors, which are explored by [22] in order to build a comprehensive ontology for trust. Ontologies have been used extensively in RSs for knowledge representation in knowledge-based systems [23]. Gao et al. [24] have proposed an approach that combines a user ontology and the spreading activation technique to uncover enhanced user interests. A survey presented by [25] gives a detailed discussion on the use and advantages of ontologies, arguing that for generating personalized recommendations, ontologies offer benefits like reusability and reasoning ability.
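The rareness-plus-dissimilarity idea attributed to Zheng et al. above can be sketched in a few lines. This is only a toy interpretation: the inverse-log popularity term, the genre profiles and the multiplicative combination are assumptions, not the authors' exact formulation.

```python
import math

def unexpectedness(item, user_items, popularity, similarity):
    """Toy unexpectedness score: rareness (inverse popularity) times
    dissimilarity from the items in the user's history."""
    rareness = -math.log(popularity[item])            # rarer item -> larger
    avg_sim = sum(similarity(item, j) for j in user_items) / len(user_items)
    return rareness * (1.0 - avg_sim)                 # dissimilar -> larger

# Hypothetical genre profiles and a Jaccard similarity over them.
profiles = {"A": {"action"}, "B": {"action", "comedy"}, "C": {"documentary"}}

def jaccard(a, b):
    return len(profiles[a] & profiles[b]) / len(profiles[a] | profiles[b])

pop = {"B": 0.8, "C": 0.05}                # fraction of users who rated each
u_c = unexpectedness("C", ["A"], pop, jaccard)   # rare and dissimilar
u_b = unexpectedness("B", ["A"], pop, jaccard)   # popular and similar
```

Under these made-up numbers, item "C" scores far higher than item "B": it is both rarely rated and shares no genres with the user's history.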


A content-based recommendation approach is proposed by [26] that exploits an ontology and a spreading activation model to obtain meaningful content items. In this paper, we propose an approach to generate explanations for serendipitous recommendations in order to enhance user acceptance of the presented recommendations. The serendipitous items are generated using two factors, relevancy and unexpectedness. The items that can be mapped onto the domain ontology using the spreading activation technique are used to generate explanations for the target user. Along with the ontology, a few more factors are used to compute the explanation score for generating the explanations of the presented recommendations.

3 Proposed Approach

In this paper, an Explanation-based Serendipitous Recommender System (EBSRS) is proposed. The system operates in four phases: the user assessment phase, the item assessment phase, the serendipity assessment phase and the explanation assessment phase. The steps are as follows (Fig. 1).

Fig. 1 Architecture of EBSRS (components: Target User; Existing User Community; Preferences and Ratings Database; Similarity Computation; Prediction Computation; Serendipity Score Computation; Explanation Generation; Top-N Item Selection and Explanation Generation)


Phase I: User Assessment Phase
  Step 1: Collection of user interest and contextual information
  Step 2: Formation of input data (pre-processing phase)
  Step 3: Similarity computation
Phase II: Item Assessment Phase
  Step 1: Concept mapping
  Step 2: Prediction score computation
  Step 3: Combining the prediction score and the items from concept mapping
Phase III: Serendipity Assessment Phase
  Step 1: Relevancy computation of items
  Step 2: Unexpectedness computation of items
  Step 3: Intersection of relevant and unexpected items in order to generate the serendipity score of items
Phase IV: Explanation Assessment of Serendipitous Items
  Step 1: Explanation generation for the recommendations
  Step 2: Top-N item generation
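A minimal, self-contained sketch of Phases III and IV follows: the serendipity intersection and the explanation attachment. The relevance and unexpectedness scores are assumed to come from Phases I and II; the thresholds, the product-based ranking and all item names are illustrative assumptions, not the paper's exact computations.

```python
def ebsrs_top_n(candidates, relevance, unexpectedness, reasons,
                n=5, rel_min=0.5, unexp_min=0.5):
    """Phases III-IV sketch: intersect relevant and unexpected items,
    attach an explanation to each, and return the top-N list.

    relevance / unexpectedness: dict item -> score in [0, 1]
    reasons: dict item -> human-readable explanation string
    """
    # Phase III: serendipitous items are relevant AND unexpected.
    serendipitous = [i for i in candidates
                     if relevance[i] >= rel_min
                     and unexpectedness[i] >= unexp_min]
    # Rank by a combined serendipity score (simple product, an assumption).
    serendipitous.sort(key=lambda i: relevance[i] * unexpectedness[i],
                       reverse=True)
    # Phase IV: pair each surviving item with its explanation.
    return [(i, reasons.get(i, "related to your interests"))
            for i in serendipitous[:n]]

# Hypothetical candidate items with made-up scores and one explanation.
rel = {"i1": 0.9, "i2": 0.8, "i3": 0.2}
unexp = {"i1": 0.7, "i2": 0.3, "i3": 0.9}
why = {"i1": "sibling concept of a place you rated highly"}
top = ebsrs_top_n(["i1", "i2", "i3"], rel, unexp, why)
```

Here "i2" is filtered out for being expected and "i3" for being irrelevant, so only "i1", which satisfies both criteria, survives with its explanation attached.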

3.1 Phase I: User Assessment Phase

Step 1

Collection of user interest and contextual information

User interest and other related information are captured in order to compute recommendations for the target user. The related information includes contextual information that influences the user's decisions. The items rated by the target user and the user's contextual information are stored in the repository for further processing.

Step 2

Formation of input data

This step prepares the input data needed to initiate the computation. The input data consists of the ratings provided by the users and is stored in the form of a user-item rating matrix, in which the rows represent the users and the columns represent the items. An entry in the matrix represents the rating provided by a user 'u' to an item 'i.'

Step 3

Similarity computation

The similarity computation involves computing the neighborhood of the target user. Various methods have been suggested in the literature to find the similarity between users, e.g., Pearson's correlation coefficient (PCC), the Jaccard coefficient, cosine similarity, etc. [2]. In this paper, PCC is used to compute the similarity between users, and it is formulated as:


Richa et al.

Fig. 2 Graphical representation of tourism domain ontology

$$\text{Sim}(x, y) = \frac{\sum_{i=1}^{n} (r_{xi} - \bar{r}_x)(r_{yi} - \bar{r}_y)}{\sqrt{\sum_{i=1}^{n} (r_{xi} - \bar{r}_x)^2}\,\sqrt{\sum_{i=1}^{n} (r_{yi} - \bar{r}_y)^2}} \quad (1)$$

where
r_xi and r_yi denote the ratings of users x and y for the ith item, respectively;
r̄_x and r̄_y denote the average ratings of users x and y, respectively.

Phase II: Item Assessment Phase

Step 1

Mapping of the concept using ontology

The user interest is captured and mapped onto the domain ontology. The captured user interest activates the domain knowledge in order to generate a list of connected nodes: super-concepts, sub-concepts or siblings [27]. This expansion helps to obtain more unexpected but still relevant suggestions for the target user. Spreading the user interest not only provides items that might be interesting, but also explains the reason for generating a particular recommendation.

Step 2

Prediction computation

Prediction computation identifies items that the target user has not yet seen but that similar users have rated. It produces a prediction score for each item, and this score decides whether the item can be recommended to the user or not. The formula for prediction computation is given below [2].

$$\text{Pred}(x, i) = \bar{r}_x + \frac{\sum_{y \in U} \text{sim}(x, y)\,(r_{y,i} - \bar{r}_y)}{\sum_{y \in U} \text{sim}(x, y)} \quad (2)$$

where
sim(x, y) denotes the similarity between users x and y;
r̄_x and r̄_y denote the average ratings of users x and y;
r_{y,i} denotes the rating of user y for an item i.

Step 3

Combining prediction score and items from spreading activation

The previous two steps provide items both by spreading the user interest over the domain ontology and through the prediction score. The final list of items is stored in the repository and is ready for the next phase, serendipity assessment.

Phase III: Serendipity Assessment Phase

Step 1

Relevancy computation

The relevancy score is computed for the items selected in the previous phase. Relevancy and unexpectedness are the two major components of the serendipity assessment of items. The relevance score for the target user is computed using the following formula [28–30]:

$$\hat{r}(u, i) = \bar{r}(u) + C \sum_{v \in N_k} \text{sim}(u, v)\,(r(v, i) - \bar{r}(v)) \quad (3)$$

where
r̂(u, i) denotes the relevance of the unseen item 'i' for the user 'u';
r̄(u) is the average of the ratings provided by user u;
C is a normalizing constant;
sim(u, v) is the similarity score between the users 'u' and 'v';
r(v, i) is the rating given by user 'v' to item 'i.'
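A minimal sketch of Eqs. (2)–(3): a mean-centred, similarity-weighted sum over neighbours who rated the item. Eq. (3) has the same shape, with the normalizing constant C taking the place of the similarity-sum denominator. The neighbourhood below is an invented placeholder.

```python
def predict(target_mean, neighbours, item):
    """Eq. (2): neighbours is a list of (sim, mean_rating, ratings_dict);
    similarities are assumed non-negative, matching Eq. (2)'s plain sums."""
    num = den = 0.0
    for sim, mean_v, ratings_v in neighbours:
        if item in ratings_v:
            num += sim * (ratings_v[item] - mean_v)
            den += sim
    if den == 0:
        return target_mean  # no neighbour rated the item
    return target_mean + num / den

neighbours = [
    (0.9, 3.0, {"i9": 4}),  # close neighbour who liked i9
    (0.4, 3.5, {"i9": 3}),  # weaker neighbour, slightly below own mean
]
print(round(predict(3.2, neighbours, "i9"), 3))  # 3.738
```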

The relevancy score is computed for each item in the previous list. Next, the unexpectedness computation provides the score that is combined with the relevancy score to obtain the serendipity score of the items.

Step 2

Unexpectedness computation

A co-occurrence-based method is used to compute the unexpectedness. Intuitively, if two items are rarely seen together, they are likely to be different. The unexpectedness is formulated as:

$$\text{PMI}(i, j) = -\log_2 \frac{p(i, j)}{p(i)\,p(j)} \bigg/ \log_2 p(i, j) \quad (4)$$

$$\text{Unexpectedness}_{\text{co-occ1}}(i) = \max_{j \in P} \text{PMI}(i, j) \quad (5)$$

where
p(i) and p(j) represent the probabilities of items i and j being rated by any user, computed as the number of users who have rated the item divided by the total number of users;
p(i, j) is the probability that the same user has rated both items.

Step 3

Combination of both relevancy score and unexpectedness score

Based on the relevancy and unexpectedness scores, the serendipity score of the items is computed as:

$$\text{Ser\_Sc}(I_u) = \sum_{i \in I_u} \text{Rel\_Sc}_i * \text{Unexp\_Sc}_i \quad (6)$$

where
Rel_Sc_i is the relevancy score of the item 'i';
Unexp_Sc_i is the unexpectedness score of the item 'i';
Ser_Sc(I_u) is the serendipity score of the items in the list I_u.
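Eqs. (4)–(6) can be sketched as follows; the item probabilities and joint probabilities below are assumed toy values, not estimates from real data.

```python
import math

def pmi(p_i, p_j, p_ij):
    """Eq. (4): -log2(p(i,j) / (p(i) p(j))) / log2 p(i,j), bounded in [-1, 1]."""
    return -math.log2(p_ij / (p_i * p_j)) / math.log2(p_ij)

def unexpectedness(cand, profile, prob, joint):
    """Eq. (5): maximum PMI between the candidate and the profile items P."""
    return max(pmi(prob[cand], prob[j], joint[cand, j]) for j in profile)

def serendipity(rel_sc, unexp_sc):
    """Eq. (6): combine per-item relevancy and unexpectedness scores."""
    return {i: rel_sc[i] * unexp_sc[i] for i in rel_sc}

# Fractions of users who rated each item, and who rated each pair together.
prob = {"a": 0.4, "b": 0.1, "c": 0.2}
joint = {("c", "a"): 0.02, ("c", "b"): 0.05}
u = unexpectedness("c", ["a", "b"], prob, joint)
print(round(u, 3))  # 0.306
print(serendipity({"c": 0.8}, {"c": u}))
```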

Phase IV: Explanation Assessment for the Serendipitous Items

Step 1

Explanation generation for the serendipitous items

Explanations for the serendipitous items are generated while providing the final top-N list as recommendations. An explanation supports the items recommended to the user, e.g., 'the recommended place is historical,' 'the recommended hotel is affordable' or 'the restaurant is inexpensive.' Many factors can drive the explanation for a presented recommendation. The explanation is added to the recommendations provided to the user as an additional feature of the serendipitous recommendation list. The user interest activates concepts in the domain ontology, which are further spread to their super-concepts, sub-concepts and siblings, resulting in multiple routes for an explanation path along with the four explanation factors [30, 31]. We use the tourism domain, which is explained in the next section, and the explanation factors for this domain are: (a)

Place preference: Place preference (u, j) of a user 'u' is given by the proportion of the user's favorite place categories that match the categories of the place 'j.' For example, if the favorite place categories of the user are {religious, historical, amusement} and the place categories are {religious, historical, cultural}, then the place preference (u, j) will be 2/3.

$$\text{Place\_Preference}(u, j) = \frac{|\text{user place category} \cap \text{item category}|}{|\text{item category}|} \quad (7)$$

(b)

Affordability: Affordability of a user 'u' for a restaurant 'i' is determined by the user's budget category and computed as the number of meals the user can purchase:

$$\text{Affordability, } A(u, i) = \text{Upper-Limit-BC}(u) / \text{Per-person-Cost}(i) \quad (8)$$

where
Upper-Limit-BC(u) represents the upper limit of user u's budget category;
Per-person-Cost(i) represents the per-person food cost at restaurant i.

(c)

Cuisine preference: The cuisine preference of a user 'u' is given by the proportion of the user's favorite cuisines that are served at restaurant 'i.' For example, if the cuisine preferences of the user are {North Indian, Mughlai, continental} and the cuisines served are {North Indian, South Indian, Chinese}, then the cuisine preference (u, i) will be 1/3.

$$\text{Cuisine\_Preference}(u, j) = \frac{|\text{user\_cuisine\_choice} \cap \text{cuisine\_served}|}{|\text{cuisine\_served}|} \quad (9)$$

(d)

Reputation: The reputation of each item is given by the reputation of an item j (ROI_j) [32], computed as follows:

$$\text{ROI}_j = \frac{3 \times \text{avg}_j \times (n_j / N) \times (1 / \text{SD}_j)}{\text{avg}_j \times (n_j / N) + (n_j / N) \times (1 / \text{SD}_j) + \text{avg}_j \times (1 / \text{SD}_j)} \quad (10)$$

where
avg_j represents the average rating of an item j;
n_j represents the number of users who rated an item j in the user-item rating matrix;
N denotes the total number of users in the user-item rating matrix;
SD_j denotes the standard deviation of the ratings given by individual users for an item j.

A small SD_j indicates that the ratings are concentrated around the mean, i.e., the users' ratings are close to each other. The advantage of using the harmonic mean for computing ROI_j is that it is robust to large differences between inputs, so a high weight is obtained only if avg_j, (n_j/N) and (1/SD_j) are all high.


The explanation score is determined by the explanation factors Place_Preference (P), Affordability (A), Cuisine_Preference (C) and Reputation (R). It is computed by determining the maximum contributing explanation factor as follows:

$$\text{Explanation score} = \max(W_p \times P,\; W_a \times A,\; W_c \times C,\; W_r \times R) \quad (11)$$

where W_p, W_a, W_c and W_r are statistically determined constants.

Step 2

Top-N item generation

After the generation of the explanation scores using the domain ontology (as shown in Fig. 2), the top-N recommendation list is presented to the target user, ordered by explanation score. The list is also stored in the repository for future reference.
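The four explanation factors in Eqs. (7)–(11) can be sketched as below. The weights W_p, W_a, W_c, W_r are statistically determined in the paper; the 1.0 defaults and all profile values here are placeholder assumptions.

```python
import statistics

def place_preference(user_places, item_places):
    """Eq. (7): overlap of favourite place categories with the item's categories."""
    return len(set(user_places) & set(item_places)) / len(item_places)

def affordability(upper_budget, per_person_cost):
    """Eq. (8): meals the user's budget upper limit buys at the restaurant."""
    return upper_budget / per_person_cost

def cuisine_preference(user_cuisines, served):
    """Eq. (9): overlap of favourite cuisines with the cuisines served."""
    return len(set(user_cuisines) & set(served)) / len(served)

def reputation(ratings, n_users):
    """Eq. (10): harmonic-mean-style blend of average rating, rater fraction
    and rating consistency (1/SD); high only when all three are high."""
    avg = statistics.mean(ratings)
    frac = len(ratings) / n_users
    inv_sd = 1 / statistics.stdev(ratings)  # needs >= 2 ratings
    return 3 * (avg * frac * inv_sd) / (avg * frac + frac * inv_sd + avg * inv_sd)

def explanation_score(p, a, c, r, wp=1.0, wa=1.0, wc=1.0, wr=1.0):
    """Eq. (11): the maximum weighted factor drives the explanation."""
    return max(wp * p, wa * a, wc * c, wr * r)

p = place_preference({"religious", "historical", "amusement"},
                     {"religious", "historical", "cultural"})
print(p)  # 2/3, the paper's worked example
```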


4 Experiments, Evaluation and Comparative Analysis

For the evaluation of the proposed system, we use several quality-based metrics. To evaluate the accuracy of the proposed system, the standard metrics precision, recall and F-measure are used [33]. To justify the explanations of the presented recommendations, an explanation metric is computed, and to assess serendipity, unexpectedness is evaluated. A detailed discussion of the evaluation metrics is given below.
To measure the accuracy of the recommendations, we use precision, recall and F-measure. Precision measures the proportion of recommended items that the user liked and therefore consumed. Precision@K reflects the fraction of relevant items retrieved by a recommender system in the first K results:

$$\text{Precision@}K = \frac{1}{\|U\|} \sum_{u \in U} \frac{\|RS_u(K) \cap REL_u\|}{K} \quad (13)$$


where U is the set of users and RS_u(K) is the set of top-K suggestions for user u. The relevant items from the test set for user u are represented by REL_u.
Recall is the number of consumed items in the recommendation list out of the total number of items the user consumed:

$$\text{Recall@}K = \frac{1}{\|U\|} \sum_{u \in U} \frac{\|RS_u(K) \cap REL_u\|}{\text{Total\_Items}_u} \quad (14)$$

F-measure provides the balance between the two, i.e., precision and recall:

$$F\text{-measure} = \frac{2 * \text{Precision} * \text{Recall}}{\text{Precision} + \text{Recall}} \quad (15)$$
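The three accuracy metrics can be sketched directly from Eqs. (13)–(15); the recommendation lists and relevant sets below are invented, and the recall denominator uses the size of each user's relevant set as a stand-in for Total_Items_u.

```python
def precision_at_k(recs, relevant, k):
    """Eq. (13): fraction of the top-K that is relevant, averaged over users."""
    return sum(len(set(r[:k]) & relevant[u]) / k
               for u, r in recs.items()) / len(recs)

def recall_at_k(recs, relevant, k):
    """Eq. (14): fraction of each user's relevant items found in the top-K."""
    return sum(len(set(r[:k]) & relevant[u]) / len(relevant[u])
               for u, r in recs.items()) / len(recs)

def f_measure(p, r):
    """Eq. (15): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

recs = {"u1": ["a", "b", "c", "d"], "u2": ["b", "c", "a", "e"]}
relevant = {"u1": {"a", "c"}, "u2": {"e"}}
p, r = precision_at_k(recs, relevant, 2), recall_at_k(recs, relevant, 2)
print(p, r, f_measure(p, r))  # 0.25 0.25 0.25
```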

The proposed approach is compared with the traditional recommendation approach in order to assess the accuracy of the presented recommendation list. The comparison shows the changes in precision, recall and F-measure when the system varies the top-N items as 5, 11, 15 and 21 (Figs. 3, 4 and 5). The outcome of the comparison shows that the proposed work outperforms the traditional approach.

Fig. 3 Precision for EBSRS and CF approach
Fig. 4 Recall for EBSRS and CF approach
Fig. 5 F-measure for EBSRS and CF

A user-oriented measure called explain coverage is used. For a user 'u' who receives a recommendation list L, the explain coverage for the justification list J is defined as follows:

$$\text{Explain coverage}(u, J) = \frac{\sum_{(f_i, cf_i) \in J} \min\{cf_i, P(u, f_i)\}}{\sum_{f_i \in F} P(u, f_i)} \quad (16)$$

where each pair (f_i, cf_i) denotes that feature f_i has overall frequency cf_i inside L and P(u, f_i) is the frequency of f_i in the feature profile of u. Explain coverage takes values in the range [0, 1], where values closer to 1 correspond to better coverage, as shown in Table 1 and Fig. 6.
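Eq. (16) reduces to a capped-overlap ratio of feature frequencies; the feature profiles below are assumed values for illustration.

```python
def explain_coverage(justification, profile):
    """Eq. (16): justification maps feature -> frequency cf_i inside the
    recommendation list L; profile maps feature -> P(u, f_i)."""
    covered = sum(min(cf, profile.get(f, 0)) for f, cf in justification.items())
    total = sum(profile.values())
    return covered / total if total else 0.0

profile = {"historical": 4, "affordable": 2, "inexpensive": 2}
justification = {"historical": 3, "affordable": 5}
print(explain_coverage(justification, profile))  # 0.625
```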

Table 1 Explanation coverage for varying top-N items

        Coverage@3  Coverage@5  Coverage@7  Coverage@9  Coverage@11
EBSRS   0.469388    0.511905    0.512987    0.564286    0.571429


Fig. 6 Explanation coverage evaluation for EBSRS

Unexpectedness is the measure that captures the 'surprise me' component of the presented recommendations. It measures how different the recommendations are compared to the items rated by the user, giving the novelty score of the items (Fig. 7 and Table 2).

Fig. 7 Unexpectedness evaluation for EBSRS

Table 2 Unexpectedness for varying top-N items

        Unexpected@3  Unexpected@5  Unexpected@7  Unexpected@9  Unexpected@11
EBSRS   0.227692      0.309231      0.386154      0.463077      0.54


$$\text{USU}_u(\tilde{I}) = \frac{1}{|\tilde{I}|\,|R_u|} \sum_{i_j \in \tilde{I}} \sum_{i_k \in R_u} \text{sim}(i_j, i_k) \quad (17)$$

where Ĩ is the recommended list and R_u is the set of items rated by user u.

5 Conclusion

In this paper, we introduced a new method, the Explanation-based Serendipitous Recommender System (EBSRS). The major emphasis of this paper is to generate serendipitous item recommendations and to generate explanations for the presented list. The serendipitous items are generated using two factors: relevancy and unexpectedness. Explanations for the presented serendipitous recommendations are generated by spreading the user interest with the help of the domain ontology along with the factors that affect the explanation score. The presented RS was evaluated using accuracy measures, explanation coverage and unexpectedness measures. In terms of accuracy, EBSRS was compared with the CF approach and outperformed it.

References 1. F. Ricci, L. Rokach, B. Shapira, Introduction to recommender systems handbook, in Recommender Systems Handbook (Springer, Boston, MA, 2011), pp. 1–35 2. M. Ge, C. Delgado-Battenfeld, D. Janach, Beyond accuracy: evaluating recommender systems by coverage and serendipity, in Proceedings of the Fourth ACM Conference on Recommender Systems, pp. 257–260 (2010) 3. Z. Abbassi, A.-Y. Sihem, L.V. Laks, S. Vassilvitskii, Y. Cong, Getting recommender systems to think outside the box, in Proceedings of the Third ACM Conference on Recommender Systems, pp. 285–288 (2009) 4. D. Anand, K.K. Bharadwaj, Utilizing various sparsity measures for enhancing accuracy of collaborative recommender systems based on local and global similarities. Expert Syst. Appl. 38(5), 5101–5109 (2011) 5. X. Lam, T. Vu, T. Duc Le, A. Duc Duong, Addressing cold-start problem in recommendation systems, in Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication, pp. 208–211 (2008) 6. Richa, P. Bedi, Parallel context-aware multi-agent tourism recommender system. Int. J. Comput. Sci. Eng. 20(4), 536–549 (2019) 7. D. Kotkov, J. Veijalainen, S. Wang, How does serendipity affect diversity in recommender systems? A serendipity-oriented greedy algorithm. Computing 102(2), 393–411 (2020) 8. M.Z. Al-Taie, Explanations in recommender systems: overview and research approaches, in Proceedings of the 14th International Arab Conference on Information Technology (ACIT, Khartoum, Sudan) (2013) 9. T.R. Gruber, A translation approach to portable ontology specifications. Knowl. Acquis. 5(2), 199–220 (1993a) 10. T.R. Gruber, Towards principles for the design of ontologies used for knowledge sharing, in Formal Ontology in Conceptual Analysis and Knowledge Representation (Deventer, The Netherlands, 1993b)


11. W. Maalej, A.K. Thurimella, Towards a research agenda for recommendation systems in requirements engineering, in Second International Workshop on Managing Requirements Knowledge (IEEE, 2009), pp. 32–39 12. S. Chari, O. Seneviratne, D.M. Gruen, M.A. Foreman, A.K. Das, D.L. McGuinness, Explanation ontology: a model of explanations for user-centered AI, in International Semantic Web Conference (Springer, Cham, 2020), pp. 228–243 13. D. Kotkov, S. Wang, J. Veijalainen, A survey of serendipity in recommender systems. Knowl.-Based Syst. 111, 180–192 (2016) 14. M. De Gemmis, P. Lops, G. Semeraro, C. Musto, An investigation on the serendipity problem in recommender systems. Inf. Process. Manage. 51(5), 695–717 (2015) 15. L. Iaquinta, M. De Gemmis, P. Lops, G. Semeraro, M. Filannino, P. Molino, Introducing serendipity in a content-based recommender system, in International Conference on Hybrid Intelligent Systems (IEEE, 2008) 16. P. Adamopoulos, A. Tuzhilin, On unexpectedness in recommender systems: or how to better expect the unexpected. ACM Trans. Intell. Syst. Technol. (TIST) 5(4), 1–32 (2014) 17. E. Tacchini, Serendipitous Mentorship in Music Recommender Systems. PhD Thesis, Università degli Studi di Milano (2012) 18. M. Manca, L. Boratto, S. Carta, Behavioral data mining to produce novel and serendipitous friend recommendations in a social bookmarking system. Inf. Syst. Front. 20(4), 825–839 (2018) 19. Q. Zheng, C.K. Chan, H.H. Ip, An unexpectedness-augmented utility model for making serendipitous recommendation, in Industrial Conference on Data Mining (Springer, 2015) 20. P. Symeonidis, A. Nanopoulos, Y. Manolopoulos, MoviExplain: a recommender system with explanations, in Proceedings of the Third ACM Conference on Recommender Systems, pp. 317–320 (2009) 21. P. Kouki, J. Schaffer, J. Pujara, J. O'Donovan, L. Getoor, Personalized explanations for hybrid recommender systems, in Proceedings of the 24th International Conference on Intelligent User Interfaces, pp. 379–390 (2019) 22. L. Viljanen, Towards an ontology of trust, in International Conference on Trust, Privacy and Security in Digital Business (Springer, Berlin, Heidelberg, 2005), pp. 175–184 23. J.K. Tarus, Z. Niu, G. Mustafa, Knowledge-based recommendation: a review of ontology-based recommender systems for e-learning. Artif. Intell. Rev. 50(1), 21–48 (2018) 24. Q. Gao, J. Yan, M. Liu, A semantic approach to recommendation system based on user ontology and spreading activation model, in IFIP International Conference on Network and Parallel Computing (IEEE, 2008), pp. 488–492 25. G. George, A.M. Lal, Review of ontology-based recommender systems in e-learning. Comput. Educ. 142, 1036–1042 (2019) 26. S. Papneja, K. Sharma, N. Khilwani, Context aware personalized content recommendation using ontology based spreading activation. Int. J. Inf. Technol. 10(2), 133–138 (2018) 27. P. Bedi, Richa, User interest expansion using spreading activation for generating recommendations, in International Conference on Advances in Computing, Communications and Informatics (ICACCI) (IEEE, 2015), pp. 766–771 28. P. Bedi, S.K. Agarwal, V. Jindal, Richa, MARST: Multi-Agent recommender system for e-tourism using reputation based collaborative filtering, in International Workshop on Databases in Networked Information Systems (Springer International Publishing, Aizu-Wakamatsu City, Japan, 2014a), pp. 189–201 29. P. Bedi, S.K. Agarwal, S. Sharma, H. Joshi, SAPRS: situation-aware proactive recommender system with explanations, in International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 277–283 (2014b) 30. P. Bedi, A. Gautam, Richa, C. Sharma, Using novelty score of unseen items to handle popularity bias in recommender systems, in International Conference on Contemporary Computing and Informatics (IC3I) (IEEE, 2014c), pp. 934–939 31. Richa, P. Bedi, Combining trust and reputation as user influence in cross domain group recommender system (CDGRS). J. Intell. Fuzzy Syst., 1–12 (2020). Preprint


32. P. Bedi, S.K. Agarwal, Aspect-oriented trust based mobile recommender system. Int. J. Comput. Inf. Syst. Ind. Manage. Appl. 354–364 (2013) 33. P. Bedi, P. Vashishth, Empowering recommender systems using trust and argumentation. Inf. Sci. 569–586 (2014) 34. H. Ju Jeong, M. Lee, Effects of recommendation systems on consumer inferences of website motives and attitudes towards a website. Int. J. Advert. 32(4), 539–558 (2013) 35. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, Recommendation systems: a probabilistic analysis. J. Comput. Syst. Sci. 63(1), 42–61 (2001)

Introduction of Feature Selection and Leading-Edge Technologies Viz. TENSORFLOW, PYTORCH, and KERAS: An Empirical Study to Improve Prediction Accuracy of Cardiovascular Disease

Mudsir Ashraf, Yass Khudheir Salal, S. M. Abdullaev, Majid Zaman, and Muheet Ahmed Bhut

Abstract Cardiovascular disorders are regarded as among the most life-threatening diseases across the globe and are presumed to be fatal for human life. To prevent such irreparable human loss, it is imperative to predict cardiovascular disease accurately; timely diagnosis of the disease is necessary for the treatment and prevention of heart failure. Predicting cardiovascular disease through traditional techniques such as medical history is considered unreliable and time-consuming. In this direction, non-invasive techniques such as machine learning approaches are effective and reliable. Therefore, in the proposed research study, the investigators have developed a machine learning based diagnostic system to classify cardiovascular disease in which state-of-the-art techniques viz. TENSORFLOW, PYTORCH, and KERAS have been exploited. In addition to the deep learning techniques, the researchers have employed four classifiers, a cross-validation mechanism, and various other performance metrics associated with classification algorithms, such as false positive rate, true positive rate, recall, precision, f-measure, and so on, for prediction purposes. The efficiency of the propounded system has been corroborated with a feature selection algorithm on both sets, viz. the full feature set as well as a reduced set of features. The feature selection algorithm was exploited to verify whether there is any performance change in terms of prediction accuracy.

Keywords TENSORFLOW · J48 · Knn · Naive Bayes · Random forest · KERAS · PYTORCH

M. Ashraf (B) Jain University, Bangalore, India Y. K. Salal · S. M. Abdullaev South Ural State University, Chelyabinsk, Russian Federation e-mail: [email protected] M. Zaman · M. A. Bhut University of Kashmir, Sirnagar, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_2


1 Introduction

Cardiovascular disease is contemplated as one of the most complicated, life-threatening, and fatal human diseases across the globe. The disease typically occurs due to insufficient circulation or flow of blood to various parts of the body required for the normal functioning of the entire body, and as a result heart failure occurs [1]. Various countries have been vulnerable to cardiovascular disease, and among these countries the USA has been extremely susceptible to this disease [2]. The signs of cardiovascular syndrome include faintness of the physical body, shortness of breath, inflamed feet, and tiredness with associated indications, for instance, elevated jugular blood pressure and marginal edema triggered by functional cardiac or non-cardiac irregularities [3]. In the early stages, the methods of examination used to diagnose cardiovascular disease were complex, and their intricacy has been one of the key factors affecting standards of life [4]. Due to the inadequacy of diagnostic peripherals, equipment, and other resources in developing and Asian countries, the task of investigation is extremely intricate, and thereby accurate prediction of cardiovascular disease becomes challenging [5]. It is imperative to mitigate the risk of heart failure by the application of proper and accurate diagnostic methods [6]. In a report produced by the European Society of Cardiology, it was mentioned that 26 million adults were identified with cardiovascular disease across the globe and 3.6 million cardiovascular patients are identified each year. It has been found that patients with cardiovascular disease are susceptible to dying within a period of approximately 1–2 years, and the related costs of cardiovascular diseases account for nearly 3% of the healthcare financial budget [7].
The intrusive approaches to diagnosing cardiovascular disorders are grounded in the examination of the patient's medical history, tangible analytical reports, and examination of related syndromes by medical professionals. These methods are prone to erroneous results and generally impede the diagnosis process due to personnel blunders. Furthermore, these techniques are comparatively expensive, computationally complex and time-consuming [8]. To settle these intricacies of the intrusive way of diagnosing cardiovascular disease, non-invasive healthcare decision support systems based on predictive paradigms of machine learning, viz. support vector machine, k-nearest neighbor, artificial neural network, logistic regression, decision tree, AdaBoost, fuzzy logic, naïve Bayes, and rough sets [9, 10], have been built by a number of investigators and have extensively been deployed for the diagnosis of cardiovascular disease. Due to these medical decision support systems, a considerable slowdown of the death rate has been experienced [11]. Heart disease diagnosis has been addressed in numerous research studies supported by machine learning applications. Furthermore, machine learning algorithms have generated reasonably better outcomes, not only in predicting cardiovascular diseases but also in areas such as pedagogy, medicine, business, and so on [12–14]. Detrano et al. [15] projected a model based on a logistic regression classifier for the diagnosis of cardiovascular syndrome and attained a significant prediction


accuracy of 77% [15]. Edmonds [16], in his research, applied the Cleveland dataset with universal life-changing methods and acquired supreme accuracy while verifying the performance of the proposed model [16]. The study contemplated feature selection approaches that were exploited to select significant features from the dataset, and hence its classification performance was driven by the chosen features. Gudadhe et al. [17] employed support vector machine and multilayer perceptron classifiers for cardiovascular classification [17]. The recommended model achieved a noteworthy prediction performance of 80.41%. Moreover, researchers Kahramanli and Allahverdi produced a classification model for heart disease in which a blended approach was employed to integrate an artificial neural network and a fuzzy neural network [18]. The propounded model accomplished an exceptional prediction accuracy of 87.4%. Palaniappan and Awang [19] implemented a healthcare diagnosis system for cardiovascular disease and applied various machine learning techniques viz. naïve Bayes, artificial neural network, and decision tree [19]. Among the employed classifiers, naïve Bayes obtained a considerable performance accuracy of 88.12%. Artificial neural network was the second best classifier with a prediction accuracy of 86.12%, and decision trees also exhibited noteworthy results of 80.4% in determining the correct instances. The investigators [20] proposed a significant cardiovascular prediction model wherein leading-edge frameworks of machine learning were implemented to produce precise prediction results [20]. In a study conducted by Olaniyi and Oyedotun [21], a system based on three phases was propounded to predict cardiovascular syndrome using artificial neural networks, and the model was successful in achieving a paramount classification accuracy of 88.89% [21]. Das et al. [22] suggested an ensemble-based prediction system to diagnose cardiovascular disorder. In this study, they exploited statistical miner 5.2 with an inclusive classification model and acquired an exceptional classification performance of 89.01% [22]. Jabbar et al. [23] developed a diagnostic mechanism for cardiovascular disease using various machine learning techniques viz. multilayer perceptron ANN, feature selection methods, and the back propagation learning algorithm. In terms of performance, the proposed model produced significant results. Moreover, machine learning algorithms have revolutionized prognostic systems in almost every domain [23–26]. Majid et al. [27] have explored different parameters of decision trees to generate accurate predictions [27, 28].

2 Methods and Materials

In the present investigation, multiple machine learning classification algorithms have been employed, including j48, knn, random forest, and naïve Bayes, for the diagnosis of cardiovascular disease. The feature selection algorithm, viz. InfoGainAttributeEval, has been utilized to indicate significant and highly correlated attributes


that can have a substantial impact on the desired predicted value. Cross-validation techniques such as k-fold have been applied to achieve better performance. Moreover, several performance evaluation metrics, including incorrectly classified (IC), correctly classified (CC), false positive rate (FP), true positive rate (TP), recall, precision, Matthews correlation coefficient (MCC), and receiver operating characteristic (ROC) curves, have been explored to evaluate the performance of the classifiers. Furthermore, data pre-processing techniques have been used to obtain improved results. The propounded model has been trained and tested on the Stanford online healthcare repository. The main contribution of the projected research study is that the performance of all classifiers has been corroborated on the entire feature set in terms of the prediction accuracy of each classifier. The study recommends the procedure to conduct prediction of cardiovascular disease based on the designed model, demonstrated by Fig. 10.
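The workflow above can be sketched with scikit-learn, whose DecisionTreeClassifier stands in for j48 and whose mutual-information scorer plays the role of InfoGainAttributeEval. Synthetic data replaces the cardiovascular dataset here, so the printed scores are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the 462-instance, 9-feature cardiovascular dataset.
X, y = make_classification(n_samples=462, n_features=9, n_informative=5,
                           random_state=0)
classifiers = {
    "j48": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
    "naive bayes": GaussianNB(),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, clf in classifiers.items():
    # 10-fold cross-validated accuracy on the full feature set...
    full = cross_val_score(clf, X, y, cv=10).mean()
    # ...and on a reduced set chosen by mutual information (info gain).
    reduced = cross_val_score(
        make_pipeline(SelectKBest(mutual_info_classif, k=5), clf),
        X, y, cv=10).mean()
    print(f"{name}: full={full:.3f} reduced={reduced:.3f}")
```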

3 Empirical Results and Discussion

The dataset employed in this study was acquired from the Stanford online healthcare repository to design a machine learning system for diagnosing cardiovascular disease. The dataset has a sample size of 462 instances of patients' data with ten features; some missing values were removed during pre-processing. The target class has two labels indicating the presence and absence of disease, viz. true (presence of disease) and false (absence of disease), as can be seen from Fig. 1. The features incorporated in the dataset, as illustrated in Fig. 1, include tobacco, SBP (systolic blood pressure), adiposity, low-density lipoprotein cholesterol (LDL), typea (type-A behavior), famhist (family history of heart disease), age, alcohol (consumption rate), and coronary heart disease (CHD). Moreover, the machine learning classifiers viz. j48, knn, naïve Bayes, and random forest were used to distinguish between cardiovascular patients and healthy patients. Table 1 exhibits the performance of the various machine learning classifiers, including j48, knn, naïve Bayes, and random forest, with the performance metrics viz. correct classification rate (CC), incorrect classification rate (IC), false positive rate (FP), true positive rate (TP), recall, precision, f-measure, Matthews correlation coefficient (MCC), and receiver operating characteristic (ROC) curves. From Table 1, it is evident that naïve Bayes has demonstrated a higher classification performance of 69.93% than the other classification algorithms. The other parameters associated with the classifiers, such as IC, FP rate, TP rate, precision, and so on, have also shown noteworthy results. In Fig. 2, the correct classification rate (CC) and incorrect classification rate (IC) of each algorithm with its respective classification performance are demonstrated. These results were generated prior to the application of feature selection techniques.
In this subsection, the feature selection technique was employed on the same set of classification algorithms to corroborate whether the feature selection approach has any impact on the prediction accuracy of the different algorithms. From Table 2, it is relatively apparent that each classifier has undergone a significant change in

Introduction of Feature Selection and Leading-Edge …

23

Fig. 1 Snapshot of the dataset used in the current study

Table 1 Performance of miscellaneous classifiers

Classifier       CC     IC     TP rate  FP rate  Precision  Recall  F-measure  MCC   ROC area
j48              64.07  35.91  0.64     0.46     0.63       0.64    0.63       0.18  0.61
knn              64.06  35.93  0.64     0.47     0.63       0.63    0.63       0.18  0.59
Naïve Bayes      69.93  30.06  0.69     0.31     0.69       0.69    0.69       0.39  0.75
Random forest    68.18  31.81  0.68     0.42     0.67       0.68    0.67       0.27  0.74

its prediction performance. The comparison can be drawn from Tables 1 and 2, wherein it can be seen that the classification performance of j48 amplified from 64.07% to 70.25%, knn from 64.06% to 74.27%, naïve Bayes from 69.93% to 70.77%, and random forest from 68.18% to 75.88%, respectively. From Figs. 2 and 3, it is evident that every classifier demonstrated a substantial improvement in prediction accuracy; among them, knn experienced the largest gain, improving by approximately 10 percentage points, which is substantial from a classification point of view.
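This excerpt does not name the specific feature selection algorithm used, so the following is only a generic stdlib Python sketch of why discarding an uninformative attribute can raise accuracy: a 1-nearest-neighbour classifier is scored on invented toy data in which one feature is pure, large-scale noise that swamps the Euclidean distance.

```python
import math

def knn_predict(train, query, k=1):
    """Majority label among the k nearest training points (Euclidean distance)."""
    dists = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = [label for _, label in dists[:k]]
    return max(set(votes), key=votes.count)

def accuracy(train, test, keep):
    """Keep only the selected feature indices, then score 1-NN on the test set."""
    strip = lambda x: [x[i] for i in keep]
    tr = [(strip(x), y) for x, y in train]
    return sum(knn_predict(tr, strip(x)) == y for x, y in test) / len(test)

# Toy data: feature 0 separates the classes, feature 1 is noise (values invented)
train = [([1.0, 100.0], 0), ([1.0, 200.0], 0), ([3.0, 150.0], 1), ([3.0, 250.0], 1)]
test  = [([1.0, 150.0], 0), ([3.0, 200.0], 1)]
print("all features:", accuracy(train, test, keep=[0, 1]))
print("informative feature only:", accuracy(train, test, keep=[0]))
```

On this constructed data the classifier goes from 0% to 100% accuracy once the noisy attribute is dropped, mirroring (in exaggerated form) the kind of gain the paper reports after feature selection.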


M. Ashraf et al.

Fig. 2 Performance of each classifier graphically

Table 2 Performance of algorithms after application of feature selection technique

Classifier       CC     IC     TP rate  FP rate  Precision  Recall  F-measure  MCC   ROC area
j48              70.25  29.74  0.71     0.29     0.71       0.72    0.71       0.41  0.75
knn              74.27  25.72  0.74     0.26     0.73       0.74    0.73       0.48  0.75
Naïve Bayes      70.77  29.22  0.71     0.34     0.71       0.72    0.71       0.36  0.75
Random forest    75.88  24.11  0.75     0.24     0.75       0.74    0.75       0.51  0.86

Table 3 Results produced by pioneering methods

Classifier's name    Prediction performance (%)
TENSORFLOW           70.9
PYTORCH              78.9
KERAS                80

Fig. 3 Performance of algorithms after being subjected to the feature selection approach


3.1 Utilization of Leading-Edge Technologies Viz. TENSORFLOW, PYTORCH, and KERAS In this subsection, various state-of-the-art techniques were applied to the cardiovascular dataset with the foremost objective of improving the prediction results. Following the implementation of TENSORFLOW, PYTORCH, and KERAS, a significant improvement was observed in the performance of the algorithms in predicting the class labels. Each technique was employed with a learning rate of 0.01 and 500 epochs to attain optimal results. Figures 4, 5, 6, 7, 8 and 9 illustrate the prediction accuracy of the individual methods. Figure 4 exhibits a screenshot of the outcomes, comprising the mean squared error (MSE), actual accuracy, and training accuracy attained by TENSORFLOW. The overall prediction accuracy realized by TENSORFLOW in forecasting the target class is 70.96%, as can be seen from Fig. 4.

Fig. 4 Screenshot of accuracy accomplished based on TENSORFLOW

Fig. 5 How accuracy varies as the number of epochs is increased

Figure 5 illustrates the relationship between accuracy and epochs for TENSORFLOW during testing. The prediction accuracy improved steadily until it reached roughly 66%, after which the performance varied little and at certain points remained constant. Nevertheless, a notable final accuracy of 70.96% was achieved.

Fig. 6 Results produced through PYTORCH

Correspondingly, PYTORCH was examined with parameters analogous to those used for TENSORFLOW, viz. learning rate (0.01), epochs (500), training data (70%), and test data (30%), as can be seen from Fig. 6. During the application of PYTORCH, the investigators tracked accuracy and loss during the training phase, and val_loss and val_accuracy during the test stage. PYTORCH accomplished a striking prediction performance of 78.91% in forecasting cardiovascular disease, improving not only on TENSORFLOW but also on the conventional algorithms whose outcomes are demonstrated in Table 3. Figure 7 presents a comparison of training and testing accuracy for PYTORCH at every epoch. From Fig. 7, it is evident that accuracy reached a maximum of nearly 76% on the training data, whereas during testing the highest accuracy the model achieved was 78.91% in forecasting the correct class labels. In the initial stage the prediction accuracy gradually improved; after roughly 25 epochs it levelled off and remained steady.

Fig. 7 Relationship between epochs and accuracy

Fig. 8 Screenshot of results attained using KERAS

After applying KERAS to the same set of attributes and the same sample size, as described in Fig. 8, the model generated an exceptional performance of 80% in forecasting the unknown class labels. The prediction accuracy achieved through this deep learning framework is significant in contrast to the classifiers knn, j48, random forest and naïve Bayes, as well as the frameworks TENSORFLOW and PYTORCH examined in the preceding subsections.

Fig. 9 Variation of accuracy with epochs in the case of KERAS
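The training configuration shared by the three frameworks (learning rate 0.01, 500 epochs, a 70% training and 30% test split, and a sigmoid output) can be sketched in plain Python as gradient descent on a single sigmoid neuron. This is an illustrative stand-in for the training loop the frameworks automate, not the authors' implementation, and the data below is invented:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.01, epochs=500):
    """Per-sample gradient descent on binary cross-entropy for one sigmoid neuron."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y                      # gradient of the cross-entropy loss
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def accuracy(w, b, xs, ys):
    preds = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 for x in xs]
    return sum(p == bool(y) for p, y in zip(preds, ys)) / len(ys)

# Invented two-feature data standing in for the 462-instance dataset
random.seed(0)
data = [([random.gauss(c * 3, 1.0), random.gauss(c * 3, 1.0)], c)
        for c in (0, 1) for _ in range(50)]
random.shuffle(data)
split = int(0.7 * len(data))                 # 70% train / 30% test, as in the paper
train_set, test_set = data[:split], data[split:]
w, b = train([x for x, _ in train_set], [y for _, y in train_set])
acc = accuracy(w, b, [x for x, _ in test_set], [y for _, y in test_set])
print("test accuracy:", acc)
```

The deep learning frameworks differ from this sketch mainly in stacking many such neurons into layers and in using adaptive optimizers, but the epoch loop and weight update follow the same pattern.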


Fig. 10 Prediction model for cardiovascular disease

From Fig. 9, it is evident that KERAS produced outstanding accuracy throughout both training and testing. Its prediction accuracy rose progressively and reached a maximum of 80% in forecasting the correct instances during testing. The proposed model depicted in Fig. 10 provides an all-inclusive description of the training and testing of the several cutting-edge technologies over the course of cardiovascular prediction. In the present study, the researchers employed three such techniques, TENSORFLOW, PYTORCH and KERAS, to generate predictions. Based on the empirical outcomes of these three predominant techniques, we propose a prediction paradigm intended to encourage investigators to explore further deep learning frameworks and thereby enhance prediction accuracy in future.


4 Conclusion In the current research study, cutting-edge machine learning techniques (TENSORFLOW, PYTORCH, KERAS) have been explored, and subsequently a prediction model was propounded to predict cardiovascular disease. The proposed model was verified on the online Stanford cardiovascular dataset. Primarily, we evaluated the classifiers j48, naïve Bayes, knn and random forest without feature selection; among them, naïve Bayes produced the highest classification accuracy of 69.93%. Thereafter, the same set of learning algorithms was employed to corroborate the performance of each classifier when exposed to a feature selection algorithm and tenfold cross-validation. It was observed that every learning algorithm experienced a significant change in classification performance. Among the four classifiers, knn demonstrated an exceptional rise in prediction accuracy, from 64.06% to 74.27%, when subjected to feature selection. Random forest and j48 were the next best, improving from 68.18% to 75.88% and from 64.07% to 70.25%, respectively. Finally, we applied the same dataset across the deep learning techniques TENSORFLOW, PYTORCH and KERAS to compare prediction accuracy between conventional classifiers and deep learning techniques. The deep learning methods outperformed the conventional classifiers, achieving 70.9% (TENSORFLOW), 78.9% (PYTORCH) and 80% (KERAS). Furthermore, in future studies we shall conduct further investigations to amplify the prediction accuracy of classifiers for cardiovascular disease using various feature selection algorithms, optimization techniques and deep learning techniques.

References

1. A.L. Bui, T.B. Horwich, G.C. Fonarow, Epidemiology and risk profile of heart failure. Nat. Rev. Cardiol. 8(1), 30 (2011)
2. P.A. Heidenreich, J.G. Trogdon, O.A. Khavjou, J. Butler, K. Dracup, M.D. Ezekowitz, E.A. Finkelstein, Y. Hong, S.C. Johnston, A. Khera, D.M. Lloyd-Jones, Forecasting the future of cardiovascular disease in the United States: a policy statement from the American Heart Association. Circulation 123(8), 933–944 (2011)
3. M. Durairaj, N. Ramasamy, A comparison of the perceptive approaches for preprocessing the data set for predicting fertility success rate. Int. J. Control Theory Appl. 9(27), 255–260 (2016)
4. J. Mourao-Miranda, A.L. Bokde, C. Born, H. Hampel, M. Stetter, Classifying brain states and determining the discriminating activation patterns: support vector machine on functional MRI data. Neuroimage 28(4), 980–995 (2005)
5. S. Ghwanmeh, A. Mohammad, A. Al-Ibrahim, Innovative artificial neural networks-based decision support system for heart diseases diagnosis (2013)
6. Q.K. Al-Shayea, Artificial neural networks in medical diagnosis. Int. J. Comput. Sci. Issues 8(2), 150–154 (2011)
7. J. Lopez-Sendon, The heart failure epidemic. Medicographia 33(4), 363–369 (2011)
8. K. Vanisree, J. Singaraju, Decision support system for congenital heart disease diagnosis based on signs and symptoms using neural networks. Int. J. Comput. Appl. 19(6), 6–12 (2011)
9. S. Nazir, S. Shahzad, S. Mahfooz, M. Nazir, Fuzzy logic based decision support system for component security evaluation. Int. Arab J. Inf. Technol. 15(2), 224–231 (2018)
10. S. Nazir, S. Shahzad, L.S. Riza, Birthmark-based software classification using rough sets. Arab. J. Sci. Eng. 42(2), 859–871 (2017)
11. A. Methaila, P. Kansal, H. Arya, P. Kumar, Early heart disease prediction using data mining techniques, in Proceedings of Computer Science & Information Technology (CCSIT-2014), vol. 24 (Sydney, NSW, Australia, 2014), pp. 53–59
12. M. Ashraf et al., Knowledge discovery in academia: a survey on related literature. Int. J. Adv. Res. Comput. Sci. 8(1) (2017)
13. M. Ashraf, M. Zaman, M. Ahmed, To ameliorate classification accuracy using ensemble vote approach and base classifiers, in Emerging Technologies in Data Mining and Information Security (Springer, Singapore, 2019), pp. 321–334
14. M. Ashraf, M. Zaman, M. Ahmed, Performance analysis and different subject combinations: an empirical and analytical discourse of educational data mining, in 2018 8th International Conference on Cloud Computing, Data Science & Engineering (Confluence) (IEEE, 2018)
15. R. Detrano, A. Janosi, W. Steinbrunn, M. Pfisterer, J.J. Schmid, S. Sandhu, K.H. Guppy, S. Lee, V. Froelicher, International application of a new probability algorithm for the diagnosis of coronary artery disease. Am. J. Cardiol. 64(5), 304–310 (1989)
16. B. Edmonds, Using Localised 'Gossip' to Structure Distributed Learning (AISB, 2005)
17. M. Gudadhe, K. Wankhade, S. Dongre, Decision support system for heart disease based on support vector machine and artificial neural network, in 2010 International Conference on Computer and Communication Technology (ICCCT) (IEEE, 2010, Sept), pp. 741–745
18. H. Kahramanli, N. Allahverdi, Design of a hybrid system for the diabetes and heart diseases. Expert Syst. Appl. 35(1–2), 82–89 (2008)
19. S. Palaniappan, R. Awang, Intelligent heart disease prediction system using data mining techniques, in 2008 IEEE/ACS International Conference on Computer Systems and Applications (IEEE, 2008, March), pp. 108–115
20. M. Ashraf, S.M. Ahmad, N.A. Ganai, R.A. Shah, M. Zaman, S.A. Khan, A.A. Shah, Prediction of cardiovascular disease through cutting-edge deep learning technologies: an empirical study based on TENSORFLOW, PYTORCH and KERAS, in International Conference on Innovative Computing and Communications (Springer, Singapore, 2020), pp. 239–255
21. E.O. Olaniyi, O.K. Oyedotun, K. Adnan, Heart diseases diagnosis using neural networks arbitration. Int. J. Intell. Syst. Appl. 7(12), 72 (2015)
22. R. Das, I. Turkoglu, A. Sengur, Effective diagnosis of heart disease through neural networks ensembles. Expert Syst. Appl. 36(4), 7675–7680 (2009)
23. M. Ashraf, M. Zaman, M. Ahmed, Using Ensemble StackingC method and base classifiers to ameliorate prediction accuracy of pedagogical data. Proc. Comput. Sci. 132, 1021–1040 (2018)
24. M. Ashraf, M. Zaman, M. Ahmed, An intelligent prediction system for educational data mining based on ensemble and filtering approaches. Proc. Comput. Sci. 167, 1471–1483 (2020)
25. M. Ashraf, M. Zaman, Tools and techniques in knowledge discovery in academia: a theoretical discourse. Int. J. Data Mining Emerging Technol. 7(1), 1–9 (2017)
26. M. Ashraf, M. Zaman, M. Ahmed, Using predictive modeling system and ensemble method to ameliorate classification accuracy in EDM
27. M. Zaman, S. Kaul, M. Ahmed, Analytical comparison between the information gain and Gini index using historical geographical data. Int. J. Adv. Comput. Sci. Appl. 11(5), 429–440 (2020)
28. M. Zaman, M.A. Butt, Information translation: a practitioners approach, in World Congress on Engineering and Computer Science (WCECS) (San Francisco, USA, 2012, Oct)

Campus Placement Prediction System Using Deep Neural Networks Bharat Udawat, Advait Kale, Divit Sinha, Hardik Sharma, and Deepa Krishnan

Abstract Finding a job is a challenging task, but slight changes to the existing process can make it far more convenient and yield several positive outcomes. Job recommendation sites offer many options; however, not all of them are helpful to every applicant. Hence, a job recommendation engine that recommends the jobs best matching an applicant's profile would be a very beneficial application. We have implemented our proposed application using a deep neural network with the logistic activation function and several features such as percentage of marks, specialization and work experience. We have compared our results with logistic regression and Gaussian naïve Bayes and analysed them using various performance measures. The experimental results show that the deep learning algorithm gave an accuracy of 97.60% and an area under the receiver operating characteristic curve (ROC-AUC) score of 99.83%, which is better than the other compared algorithms. Keywords Placement · Prediction · Machine learning · Deep neural networks · Logistic activation function

1 Introduction Placement of students is one of the most important activities in every educational institution. Every student dreams of acing campus recruitment, and most expect to get placed in the job they apply for. However, not every student ends up in their dream job, and many waste a lot of time and effort preparing for job profiles that do not match them. This project intends to predict the placement chances of students and thus find the best-fit job for each applicant. It will not only save time but also help boost the confidence of the students. B. Udawat (B) · A. Kale · D. Sinha · H. Sharma · D. Krishnan Computer Engineering, NMIMS Deemed-To-Be University, Mumbai, Maharashtra, India D. Krishnan e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_3


We use LinkedIn or Naukri.com to search and apply for jobs, but these platforms return a large number of options, many of which are not useful or for which we are not eligible. To overcome this problem, we propose a system for predicting the chances of getting selected, so that applicants can save time and focus on their preferred choices. This will help students find the most suitable company, while recruiters get the best-suited candidates for the job. This paper is organized as follows. In Sect. 2, we describe the literature survey of various similar research works; in Sect. 3, we explain our proposed technique using deep neural networks; Sect. 4 gives the results and discussion, followed by the conclusion in Sect. 5.

2 Literature Review In [1], the authors developed linear regression and artificial neural network models for predicting the final cumulative grade point average (CGPA) using the grade points scored in fundamental subjects in previous years. The artificial neural network (ANN) gave better accuracy than linear regression. The dataset comprised 391 matriculation students from three intakes in July 2005–07, and the diploma students totalled 505 from three intakes in July 2006–08. In [2], the authors aimed to find the right course of study for fashion and apparel design students based on their entrance exam rankings. They reviewed several classification and clustering algorithms such as K-means, decision tree and naive Bayes, of which K-means was found the most accurate. The dataset consists of columns such as name, age, category, sector, rank, address, phone number, gender and specialization from the National Institute of Fashion and Technology. In [3], various approaches for placement chance prediction, namely logistic regression, the fuzzy approach, the decision tree algorithm, the random forest algorithm and classification and clustering techniques, as well as unique methods like the sum-of-difference method and job competency modelling, were reviewed by the authors. The dataset comprises academic details of students, including grade points, performance details and placement details, which are a potential source for predicting future placement chances. In [4], the authors built classification models on the basis of the Iterative Dichotomiser 3 (ID3) decision tree and the random forest algorithm with age, credits, backlogs, B. Tech. % and placed/not placed as variables. The accuracy obtained after analysis was 84% for the decision tree and 86% for the random forest. The dataset was compiled from over 1000 instances of students from Sreenidhi Institute of Science and Technology.
In [5], the authors proposed a system for placement prediction of the cluster of students with the best combination of knowledge, skills and aptitude, using a self-made algorithm that estimated the probability of getting placed by checking how many students with the same score were placed previously. The dataset consists of columns such as year, registration no., branch, percentage, skills, effective score and placed/not placed, from the MBA, MCA, BCA, B.Com and BBA streams. In [6], the authors implemented the C4.5 decision tree classification algorithm for placement prediction using


percentage and skillset, and compared it to logistic regression and support vector machine. An interface for the same was also implemented. They analysed previous years' historical student data and predicted the placement eligibility of students. In [7], an expert system was constructed for placement prediction. The data of all the students is stored in CSV format; the system checks whether a student satisfies the criteria and then decides whether the student will get placed. The data was from a well-known engineering college situated in Pune, and the dataset contains 2330 tuples and 81 attributes holding multiple stream-wise data of the students. In [8], it was discussed how parameters such as CGPA, additional certifications, 10th and 12th percentage, graduation branch and university affect the placement chance of a student, and which of them are essential for getting placed. The dataset used consisted of 318 students from different pass-out batches (2014, 2015 and 2016) from various recognized universities across India. In [9], the authors propose a system based on a hybrid algorithm combining K-nearest neighbours (KNN) with a content-based algorithm that matches features of candidate recommendations and job postings from historical interactions. The dataset consists of 150,000 subjects from a social network named Xing; data sources such as the user profile, a list of interactions that the user performed on job postings, and details about the job postings were used. In [10], the authors implemented the K-nearest neighbours algorithm using gender, 10th and 12th percentage, B.E aggregate, backlogs and skills as attributes for classification, which yielded an accuracy of 78.57% on testing. This accuracy was compared with logistic regression and support vector machine (SVM) and found to be better. The college placement archive of the PES Institute of Technology, consisting of 336 rows of training data and 84 rows of testing data, was used.
In [11], the paper reviews linear regression, K-nearest neighbour, the decision tree algorithm, the XGBoost model, the gradient boost regression model, the light gradient boosting machine (GBM) regression model and the random forest algorithm for the problem of student placement prediction. The aim was to accurately predict the salary of students; K-nearest neighbours and linear regression performed best on datasets I and II, respectively. The dataset used was from the previous years' placement results of the institution's placement cell. In [12], binary logistic regression was used with gender, post-graduation (PG) specialization, PG CGPA, under-graduation (UG) specialization, UG CGPA and soft skills as variables in the model. The confusion matrix showed a model accuracy of 72%. A random sample of 250 MBA students' placement records from five leading institutions was used as the dataset. In [13], the authors used simple binomial logistic regression, but with the inclusion of qualitative variables such as aptitude, technical and communication skills, and an additional R-square test to determine variables for higher accuracy. The aim was to improve on existing data mining techniques; the in-house student database of Atharva College of Engineering was used. In [14], logistic regression was implemented using 10th and 12th percentage, subject marks, gender and residency (urban/rural) as attributes to predict placement. The training and testing accuracy of the algorithm was 98.93% and 83.333%, respectively. The data used in the model is only in-house placement data from the 2009 to 2013 batches of the Information Technology (IT) branch of GNDEC. In [15], the authors aimed to classify students


into dream job, core company, mass recruiter, not eligible or not interested placement statuses. Multinomial logistic regression with the decision tree classifier of the scikit-learn module (Python) was implemented with variables such as department, 12th board, 12th percentage, 12th location, backlogs and CGPA, achieving an accuracy of 71.66% on the validation dataset. The placement cell database of Amrita School of Engineering, Coimbatore, was used to prepare 2205 training, 289 testing and 60 validation samples. Our literature survey indicates that some of the research works implemented in-house datasets, which resulted in accuracies between 80 and 85%. None of the papers used skills or interests in the selection procedure; they used a generalized approach and are not domain-specific. All the papers give a binary output, i.e. whether a candidate will get placed or not, and most of the predictions are based only on academic scores. The literature review indicates that the best-suited algorithms are logistic regression and neural networks. We have also observed that the dataset plays an important role in the accuracy of the model.

3 Proposed Technique In Fig. 1, we show the architecture of our proposed work. The algorithm we have used is a deep neural network that uses the logistic sigmoid function as the activation function in each neuron of the network's layers, which results in better accuracy than conventional machine learning algorithms. An activation function is a function added to a neural network to help the network learn complex patterns in the data; examples besides the sigmoid include the rectified linear unit (ReLU) and the hyperbolic tangent (tanh) function. The algorithm gives a prediction output between 0 and 1, where 0 indicates that the applicant is not placed and 1 indicates placed; any value in between gives the probability of getting placed. Steps Involved:

Fig. 1 Architecture of the proposed technique


Fig. 2 Output of oversampling

1. Data Input and Data Overview: The first step we performed is ingesting the dataset collected for training the model. For the implementation, we used a dataset of 1000 samples, including working professionals, ongoing students and interns. The parameters used were 10th class percentage (ssc_p), 12th class percentage (hsc_p), current CGPA or CGPA on completing the degree (degree_p), work experience (workex), placement status (status), salary (salary) (if any), the boards of education (ssc_b and hsc_b) and specialization (specialization). While training, the target parameter was the status of placement. After ingestion, we found the unique values in each column and levelled the data before preprocessing.

2. Preprocessing of Data: First, we eliminated the NaN (Not a Number) values from the dataset. We then changed the "Status" and "Workex" columns from string to binary integer (0 or 1), i.e. if a candidate has no work experience, "no" becomes 0 and "yes" becomes 1. Every column value was then brought to a common denominator of 100. For data smoothing, we normalized the data so that the range of every column is similar and, if possible, the same, i.e. 0–100. After normalization, we calculated the mean and standard deviation of the dataset. The data distribution is shown in Fig. 4.

3. Oversampling for data imbalance: As shown in Fig. 2, we used the oversampling technique to address the imbalance of the dataset. Oversampling the minority class balances the minority and majority classes, reducing the imbalance that might otherwise skew the model. To achieve this, we used RandomOverSampler() from the imblearn package. After oversampling, the dataset grew from the existing 1000 rows to a total of 1668 rows.

4. Machine Learning Classifiers:
   4.1. Design of the Neural Network Structure: The neural network is deep and consists of 7 layers, with 1 input, 1 output and 5 hidden layers. The first three layers consist of 30 perceptrons each, while the next three consist of 8 each. The output layer is made up of 2 perceptrons indicating the output as 0 or 1.
   4.2. Logistic Regression: Logistic regression is a supervised machine learning algorithm that uses the logit function to determine the target variable value, ranging between 0 and 1.
   4.3. Naive Bayes: This is a supervised classification algorithm based on Bayes' theorem with an assumption of independence among predictors.
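The random oversampling of step 3 can be approximated in stdlib Python; this is a hedged stand-in for imblearn's RandomOverSampler, and the rows below are invented rather than taken from the paper's dataset:

```python
import random

def random_oversample(rows, labels, seed=42):
    """Duplicate random minority-class rows until every class matches the
    majority class in size, mimicking what RandomOverSampler does."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([y] * target)
    return out_rows, out_labels

# Toy imbalanced data: 6 "placed" (1) vs 3 "not placed" (0); values invented
rows = [[70, 1], [65, 0], [80, 1], [75, 1], [60, 0], [85, 1], [68, 1], [72, 1], [55, 0]]
labels = [1, 0, 1, 1, 0, 1, 1, 1, 0]
bal_rows, bal_labels = random_oversample(rows, labels)
print(bal_labels.count(0), bal_labels.count(1))  # → 6 6
```

The same mechanism scales the paper's 1000-row dataset to 1668 rows: the minority class is padded with duplicated samples until both classes are equally represented.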


Table 1 Comparison of various algorithms

Algorithm             Accuracy (%)  Precision (%)  Recall (%)  F1 score (%)  ROC-AUC score (%)
Logistic regression   80.00         90.00          68.00       77.00         56.93
Gaussian naive Bayes  89.00         82.00          100.00      90.00         88.43
Deep neural network   97.60         97.00          97.00       97.00         99.83

5. Analysis of Performance Measures: We have plotted the loss and accuracy of the model in Figs. 5 and 6 to illustrate model performance. In Table 1, we have also compared the various algorithms using performance measures such as accuracy, precision, recall and the ROC-AUC score.
6. Deploy the machine learning model.

The activation functions, sigmoid and ReLU, are used alternately in the layers. The sigmoid function has the formula 1/(1 + e^(−x)), while the ReLU function is max(0, x). To reduce overfitting, the techniques of L2 regularization and dropout, with a dropout rate of 0.1, are used just before the output layer. The model is then compiled with the "binary_crossentropy" loss function and the "adam" optimizer. The "adam" optimizer, shorthand for "Adaptive Moment Estimation", is an extension of stochastic gradient descent that is widely used in neural networks.
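A minimal sketch of these two activation functions and a forward pass through one hidden and one output layer; the weights below are made up for illustration and are not those of the trained model:

```python
import math

def sigmoid(x):
    """Logistic activation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . x + b) for each neuron."""
    return [activation(sum(w * i for w, i in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Forward pass: ReLU hidden layer, then a single sigmoid output neuron
hidden = dense([0.7, 0.9],
               weights=[[0.5, -0.2], [0.3, 0.8]],
               biases=[0.1, -0.1],
               activation=relu)
prob = dense(hidden, weights=[[1.2, -0.7]], biases=[0.05], activation=sigmoid)[0]
print(round(prob, 3))  # probability of being placed, strictly between 0 and 1
```

Because the final activation is a sigmoid, the network's raw output can be read directly as a placement probability, which is what makes the 0-to-1 interpretation in Sect. 3 possible.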

4 Results and Discussion After we implement the neural network, we obtain several outputs, i.e. graphs of the data distribution and of the loss and accuracy of the model. Below is the output obtained after performing oversampling on the dataset of 1000 students. We also obtain a correlation matrix of the parameters we used, i.e. 10th and 12th marks, CGPA and salary. It shows the correlation coefficients between pairs of variables, from which we can identify the pairs with the highest correlation. The correlation matrix is shown in Fig. 3.

Fig. 3 Correlation matrix
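A correlation matrix like the one in Fig. 3 can be computed from pairwise Pearson coefficients; the columns below are invented stand-ins for the real dataset, so the numbers are purely illustrative:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Pairwise correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: round(pearson(columns[a], columns[b]), 2) for b in names}
            for a in names}

# Invented marks for five students, standing in for the real parameters
cols = {
    "ssc_p":    [67, 79, 65, 56, 85],
    "hsc_p":    [91, 78, 68, 52, 73],
    "degree_p": [58, 77, 64, 52, 73],
}
m = correlation_matrix(cols)
print(m["ssc_p"]["ssc_p"])  # diagonal entries are exactly 1.0
```

Each coefficient lies in [−1, 1], the matrix is symmetric, and the diagonal is 1, which is why only the off-diagonal cells of Fig. 3 carry information about which feature pairs move together.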


Fig. 4 Data distribution plot

Fig. 5 Exponentially reducing model loss

Fig. 6 Improving model accuracy

The graph shown depicts the data distribution plot of the various parameters. The curves coincide to a great extent since the data has been normalized (Fig. 4). We have also implemented other machine learning algorithms to better understand the difference between our selected algorithm and those proposed in other research. The plots shown help us understand the model loss and accuracy of our neural network: the loss is high at the start and low at the end as the model becomes familiar with the dataset, and vice versa for accuracy. Nearly every machine learning model shows the same trend; as more iterations are carried out, the loss decreases and the accuracy increases (Figs. 5 and 6).

5 Conclusion The proposed technique gave an accuracy of 97.6%, with precision, recall and F1 score each at 97%. We also obtained a ROC-AUC score of 99.83%, which is significantly higher than that of all the other algorithms. The use of a deep neural network has helped improve accuracy because of its ability to learn using several hidden layers and logistic sigmoid activation functions. In future work, we intend to add more features, such as the number of certification courses, prizes won and hours of study, which might increase accuracy. We also propose to tune various hyperparameters, such as the learning rate, number of epochs and batch size, in future.

References 1. Prediction of engineering students’ academic performance using artificial neural network and linear regression: a comparison, in IEEE 5th Conference on Engineering Education ICEED (2013) 2. Placement Chance Prediction: Clustering and Classification Approach; Artificial Intelligence and Evolutionary Computations in Engineering Systems (2016) 3. K. Gandhi, A. Dalvi, A review on student placement chance prediction, in International Conference on Advanced Computing and Communication Systems (ICACCS) (2019) 4. P. Manvitha, N. Swaroop, Campus placement prediction using supervised machine learning techniques. Int. J. Appl. Eng. Res. (2019) 5. Data mining approach for predicting student and institution’s placement percentage, in International Conference on Computational Systems and Information Systems for Sustainable Solutions (2016) 6. S. Ahmed, A. Zade, S. Gore, P. Gaikwad, M. Kolhal, Performance-based placement prediction system. IJARIIE (2018) 7. K. Gandhi, A. Dalvi, Expert system for student placement prediction. Int. J. Comput. Appl. Technol. Res. (2019) 8. C. Patel, Identification of Essential Parameters for Postgraduate Students’ Job Placement in Computer Applications in India (Bharati Vidyapeeth’s Institute of Computer Applications and Management, 2018) 9. T. De Pessemier, K. Vanhecke, L. Martens, A Scalable, High-performance Algorithm for Hybrid Job Recommendations (ACM Digital Library, 2016) 10. A. Giri, M. Vignesh, V. Bhagavath, B. Pruthvi, N. Dubey, A Placement Prediction System Using K-Nearest Neighbours Classifier (IEEE, 2017) 11. T. Aravind, B.S. Reddy, S. Avinash, G. Jeyakumar, A Comparative Study on Machine Learning Algorithms for Predicting the Placement Information of Under Graduate Students (IEEE, 2019)


12. D. Satish Kumar, Z. Bin Siri, D.S. Rao, S. Anusha, Predicting student's campus placement probability using binary logistic regression. Int. J. Innov. Technol. Exploring Eng. (IJITEE) (2019)
13. M.K. Shukla, P. Rambade, J. Torasakar, R. Prabhu, D. Maste, Students placement prediction model using logistic regression. Int. J. Eng. Res. Technol. (IJERT) (2017)
14. A. Shiv Sharma, S. Prince, S. Kapoor, K. Kumar, PPS—Placement Prediction System Using Logistic Regression (IEEE, 2014)
15. S.K. Thangavel, P. Divya Bharathi, A. Sankar, Student Placement Analyzer: A Recommendation System Using Machine Learning (IEEE, 2017)

Intensity of Traffic Due to Road Accidents in US: A Predictive Model

Pooja Mudgil and Ishan Joshi

Abstract Understanding the intensity of traffic congestion and its ensuing effects has many uses, including alleviating traffic intensity and the economic and social damage it causes. The present research analyses data that may be useful in determining the effect of an accident on traffic. It is hypothesised that an accurate forecast of traffic congestion intensity will empower travellers to choose different routes, departure times and means of transport according to the given conditions. The paper attempts to extend machine learning theory by developing a model to predict traffic congestion intensity caused by road accidents in the US using the K-nearest neighbours classification algorithm. A stepwise approach is adopted: taking appropriate data, formatting it, visualising it, creating meaning out of raw data, uncovering new insights and then modelling it. The research concludes that a model may be developed with an accuracy of ~68%.

Keywords K-nearest neighbour · Predictive modelling · Machine learning · Traffic accidents · Traffic congestion intensity · Data analytics

1 Introduction

The use of vehicles in everyday life in the modern world is increasing exponentially, as seen from the fact that between 2010 and 2013 there was a 16% increase in the number of registered vehicles in the world [1]. On the one hand, this aids our daily lives exceedingly; on the other, it comes with many complications. With such an increase, we can see that there is a positive correlation between the number of new vehicles and the number of road traffic accident deaths [2]. Nor is this confined to first-world countries: it has become a global issue, as, on average, road crashes cost countries 3% of their gross domestic product [3]. In the US, preretirement deaths resulting from traffic accidents outweigh the toll taken by the two most deadly diseases, cancer and heart disease, while nearly half of the deaths of 19-year-olds can be attributed to traffic accidents [4]. Traffic accidents have social costs as well, like delay in travel time and emotional distress in the wake of an accident. Thus, the context of this research is to predict the impact of road accidents on traffic congestion in the US by using the K-nearest neighbour (KNN) machine learning algorithm.

In the US in 2018, some 12 million vehicles were involved in crashes [5], leading to numerous adversities like loss of life and infrastructure and an immense effect on traffic. In a utopian setting, both traffic congestion and the vehicle accidents causing it are reduced, but this is almost impossible to achieve. The two conditions are interdependent: an increase in accidents results in increasing levels of congestion/traffic volume [6], and an increase in congestion/traffic volume naturally increases accidents. The importance of mitigating congestion lies in reducing delays and the resulting loss of economic productivity [7]. In view of the prevalent issues caused by traffic accidents, this prediction can benefit various industries in numerous ways, such as its implementation in the navigation industry. Hence, the present research attempts to analyse data that might be useful in determining the effect on traffic, including information about the accidents that have occurred, like the visibility, day/night time, weather conditions and the temperature during the accident, all of which contribute to the intensity of traffic. The paper is divided into the following sections: literature review, proposed work, experimentation and results, machine learning modelling, and conclusions and future directions.

P. Mudgil · I. Joshi (B) Department of Information Technology, Bhagwan Parshuram Institute of Technology, New Delhi, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_4

2 Literature Review

Clark [8] presents an intuitive method of predicting the future state of traffic on a motorway, using a multivariate extension of nonparametric regression that exploits the three-dimensional nature of the traffic state, with actual data from the London orbital motorway. The technique produces forecasts for two of the three traffic state variables with reasonable accuracy and is capable of application on site. Another technique for predicting traffic flow information is based on full consideration of the nonlinear characteristics of the traffic system; a support vector machine model and a BP neural network model optimised by a genetic algorithm are used [9]. In the research titled 'Prediction of traffic intensity for dynamic street lighting' [10], the problem of short-term prediction of traffic flow in a city traffic network is considered. The forecasting is done by a multi-layer artificial neural network, and the proposed approach is tested using data from the centre of Kraków.

A paper by Saliba et al. [11] used an ensemble of data mining and machine learning techniques to extract and predict vehicular traffic patterns from mobile usage data. A predictive model based on an artificial neural network was trained to predict traffic level. The findings reveal that the built models are more effective at measuring and predicting traffic flow demand for specific locations than the actual traffic flow rate. Another study presents an algorithm that integrates particle swarm optimization with artificial neural networks to develop short-term traffic flow predictors, intended to provide traffic flow forecasting information for traffic management in order to reduce traffic congestion and improve mobility [12]. A novel traffic forecast model based on a long short-term memory (LSTM) network is proposed that considers temporal–spatial correlation in the traffic system via a two-dimensional network composed of many memory units [13]. A study by Ma et al. [14] attempts to extend deep learning theory to large-scale transportation network analysis: a deep restricted Boltzmann machine and a recurrent neural network architecture are used to model and predict traffic congestion evolution based on global positioning system (GPS) data from taxis, and a numerical study in Ningbo, China, validates the effectiveness and efficiency of the approach.

The existing techniques give good insight into traffic analysis using various methods and various types of data, ranging from motorway data to mobile usage data. However, there is a lack of analysis of the traffic congestion that results from an accident and of its prospective management. Hence, this research attempts to tackle this issue by predicting the intensity of the traffic congestion caused by road vehicular accidents.

3 Proposed Work

The predictive model is built using a clear stepwise approach: taking appropriate data, formatting it, visualising it, creating meaning out of raw data, uncovering new insights and then modelling the data to create a predictive model that gives the desired predictions. Figure 1 shows the workflow of the model, which involves five major steps.

Data Acquisition: Using appropriate data is crucial for the model to be developed; suitability is determined on the basis of the usefulness of the features in a dataset and the volume of data present. Data having irrelevant features or a small magnitude are discarded.

Fig. 1 Workflow of predictive model


Data Preparation: After the data is collected, it is prepared by cleaning, formatting and removing outliers.

Exploratory Analysis: Exploratory analysis is done by visualising the data, which gives extensive insight and helps uncover new information about the data.

Modelling: After the data is prepared, it is modelled using a machine learning algorithm chosen on the basis of the problem and the data present.

Evaluation: After the model is prepared, various parameters, which differ by the type of model created, determine its efficacy.

4 Experimentation and Results

4.1 Data Sources

The data used is a countrywide car accident dataset covering 49 states of the US. The accident data were collected from February 2016 to June 2020 using two APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras and traffic sensors within the road networks. Currently, there are about 3.5 million accident records in this dataset; of these, 100,000 records are used [15–17].

4.2 Feature Selection

The target variable is the severity of the traffic congestion, defined as an integer value (1, 2, 3 or 4), where 1 stands for the mildest congestion and 4 for the heaviest. The variables selected for measuring the impact on the target variable are as follows:

Temperature (F): Temperature during the occurrence of the accident, in Fahrenheit
Humidity (%): Humidity during the occurrence of the accident, in per cent
Visibility: Visibility during the occurrence of the accident, in miles
Weather Conditions: Clear or unclear weather during the occurrence of the accident
Time: The hour of the day during the occurrence of the accident


Fig. 2 Temperature versus visibility

4.3 Exploratory Data Analysis

Good data science practice includes the visual analysis of data discussed below, which gives essential information for building a machine learning model, helps determine outliers and anomalies, and aids in gaining important insights.

4.3.1 Visualising the Relationship Between Temperature and Visibility

A scatter plot of these two features gives insight into the relationship between them (Fig. 2). The plot shows that most visibility values lie below 10 miles and reveals multiple outliers, which are removed to facilitate building a better model. After the outliers are removed, the relationship is plotted again, showing homogeneous values of the visibility feature (Fig. 3).

4.3.2 Visualising the Temperature Feature

A boxplot is created to visualise the distribution of the temperature feature, providing an essential analysis. It shows considerable outliers below the 'minimum' and above the 'maximum' whiskers; these are removed, and the boxplot is created again (Fig. 4).
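The outlier trimming described here follows the standard boxplot convention, where points beyond 1.5 × IQR of the quartiles fall outside the whiskers. A minimal sketch (illustrative only, since the paper does not publish its cleaning code, and the `temps` values below are invented):

```python
def iqr_bounds(values, k=1.5):
    """Return (lower, upper) whisker bounds using the k*IQR boxplot rule."""
    s = sorted(values)
    n = len(s)

    def quantile(q):
        # Linear interpolation between the closest ranks.
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        frac = pos - lo
        return s[lo] * (1 - frac) + s[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values):
    lo, hi = iqr_bounds(values)
    return [v for v in values if lo <= v <= hi]

# Invented stand-in for the temperature column (Fahrenheit), with two outliers.
temps = [55, 60, 62, 58, 61, 59, 63, 57, -20, 140]
cleaned = remove_outliers(temps)
```

After trimming, only the values inside the whisker bounds remain, which is what the "after" boxplot in Fig. 4 reflects.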


Fig. 3 Temperature versus visibility after outlier removal

Fig. 4 Boxplot of temperature before (left) and after (right) outlier removal

4.3.3 Visualising the Humidity Feature

A boxplot is created to visualise the distribution of the humidity feature, providing further insight (Fig. 5). The boxplot clearly shows that the feature is not homogeneous; hence, some values are removed after analysis to make it homogeneous, which further aids in building the model.


Fig. 5 Boxplot of humidity feature before (left) and after (right) data cleaning

5 Machine Learning Modelling

The target variable, severity, takes four integer values (1, 2, 3, 4); therefore, a model is prepared that predicts the severity from the other given features. For this, the KNN classification algorithm is used, which provides fast and efficient results. The cleaned dataset is divided into training and testing sets in a 4:1 ratio. Through testing and automation, the value of 'k' is found to be most efficient when taken as 24; this is discovered by testing each of the first 25 positive integers for the highest accuracy.
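The modelling procedure described (4:1 train/test split, then scanning k = 1…25 and keeping the most accurate value) can be sketched from scratch. This is not the authors' code; the synthetic two-class points stand in for the accident features and severity labels:

```python
import random
from collections import Counter

def knn_predict(train, k, point):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    nearest = sorted(
        train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], point))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

random.seed(0)
# Synthetic stand-in for the dataset: two overlapping clusters, labels 1 and 2.
data = [((random.gauss(0, 1), random.gauss(0, 1)), 1) for _ in range(200)] + \
       [((random.gauss(2, 1), random.gauss(2, 1)), 2) for _ in range(200)]
random.shuffle(data)

split = int(0.8 * len(data))          # 4:1 train/test ratio, as in the paper
train, test = data[:split], data[split:]

def accuracy(k):
    hits = sum(knn_predict(train, k, x) == y for x, y in test)
    return hits / len(test)

# Scan the first 25 positive integers for the most accurate k.
best_k = max(range(1, 26), key=accuracy)
```

On the paper's real dataset this scan lands on k = 24; on this toy data the winning k will of course differ.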

6 Conclusions and Future Directions

After fitting, the model achieves an in-sample accuracy of 68%, while the out-of-sample accuracy is nearly 67.5%, after analysing 100,000 data rows. It may help optimise navigation techniques and elucidate the importance of the features observed at the time of an accident. The paper uses the KNN classification algorithm, while several other techniques exist, such as logistic regression, support vector machines (SVM) and decision trees, to name a few. A model with higher accuracy may be built using other predictive classification techniques and/or a larger dataset.


References
1. https://www.who.int/gho/road_safety/registered_vehicles/number_text/en/. Retrieved on 28th Oct 2020
2. L.-L. Sun, D. Liu, T. Chen, M.-T. He, Road traffic safety: an analysis of the cross-effects of economic, road and population factors. Chin. J. Traumatol. 22, 290–295 (2019)
3. https://www.who.int/news-room/fact-sheets/detail/road-traffic-injuries. Retrieved on 28th Oct 2020
4. L. Evans, Traffic Safety and the Driver (Van Nostrand Reinhold, New York, NY, USA, 1991)
5. https://www.statista.com/topics/3708/road-accidents-in-the-us/. Retrieved on 24th Sept 2020
6. A.E. Retallack, B. Ostendorf, Current understanding of the effects of congestion on traffic accidents. Int. J. Environ. Res. Public Health 16(18), 3400 (2019). https://doi.org/10.3390/ijerph16183400
7. A.M. Rao, K.R. Rao, Measuring urban traffic congestion: a review. Int. J. Traffic Transp. Eng. 2, 286–305 (2012)
8. S. Clark, Traffic prediction using multivariate nonparametric regression. J. Transp. Eng. 129(2) (2003)
9. Y.-N. Zhang, J.-S. Wang, H. Lu, H.-Y. Guo, X.-L. Zuo, Y. Zhou, C. Lu, Short-term traffic flow prediction based on Bayesian fusion, in International Conference on Transportation and Development 2020. https://ascelibrary.org/doi/10.1061/9780784483138.014
10. M. Bielecka, A. Bielecki, S. Ernst, I. Wojnicki, Prediction of traffic intensity for dynamic street lighting, in 2017 Federated Conference on Computer Science and Information Systems (FedCSIS) (Prague, 2017), pp. 1149–1155. https://doi.org/10.15439/2017F389
11. M. Saliba, C. Abela, C. Layfield, in Proceedings of the 26th AIAI Irish Conference on Artificial Intelligence and Cognitive Science, Trinity College Dublin, Dublin, Ireland, December 6–7, 2018
12. K.Y. Chan, T. Dillon, E. Chang, J. Singh, Prediction of short-term traffic variables using intelligent swarm-based neural networks. IEEE Trans. Control Syst. Technol. 21(1), 263–274 (2013). https://doi.org/10.1109/TCST.2011.2180386
13. Z. Zhao, W. Chen, X. Wu, P.C.Y. Chen, J. Liu, LSTM network: a deep learning approach for short-term traffic forecast. IET Intel. Transport Syst. 11(2), 68–75 (2017). https://doi.org/10.1049/iet-its.2016.0208
14. X. Ma, H. Yu, Y. Wang, Y. Wang, Large-scale transportation network congestion evolution prediction using deep learning theory. PLoS ONE 10(3), e0119044 (2015). https://doi.org/10.1371/journal.pone.0119044
15. https://www.kaggle.com/sobhanmoosavi/us-accidents
16. S. Moosavi, M.H. Samavatian, S. Parthasarathy, R. Ramnath, A Countrywide Traffic Accident Dataset (2019). https://www.kaggle.com/sobhanmoosavi/us-accidents
17. S. Moosavi, M.H. Samavatian, S. Parthasarathy, R. Teodorescu, R. Ramnath, Accident risk prediction based on heterogeneous sparse data: new dataset and insights, in Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM, 2019). https://www.kaggle.com/sobhanmoosavi/us-accidents

Credit Card Fraud Detection Using Blockchain and Simulated Annealing k-Means Algorithm

Poonam Rani, Jyoti Shokeen, Amit Agarwal, Ashish Bhatghare, Arjun Majithia, and Jigyasu Malhotra

Abstract A credit card fraud is a criminal activity in which a person's credit card is used to make transactions he/she does not authorize. Credit card fraud is a major cause of huge revenue losses to the banking sector, which in turn hurts the economy. This paper combines blockchain technology with simulated annealing for credit card fraud detection. Simulated annealing is leveraged within k-means to detect anomalous and fraudulent banking transactions in a private-permissioned blockchain network. Experimental results demonstrate that the proposed approach exhibits higher accuracy than the simple k-means algorithm.

Keywords Credit card · Fraud detection · Blockchain · Simulated annealing · k-means · Clustering

1 Introduction

In this digital era, the Internet plays an essential role in doing our tasks online, whether shopping, online learning, investing or banking transactions. Credit card fraud is an activity where a person gains access to another person's credit card with the intent to commit fraud and performs unauthorized transactions. It is a cause of concern for the banking sector, as it causes huge financial loss to the economy. With the progress in software and technology, it has become possible for users to conceal their location and identity while performing unauthorized transactions over the Internet.

P. Rani (B) · A. Agarwal · A. Bhatghare · A. Majithia · J. Malhotra Department of Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, Delhi, India J. Shokeen Department of Computer Science and Engineering, UIET, Maharshi Dayanand University Rohtak, Rohtak, Haryana, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_5


Blockchain is one of the emerging technologies for reducing fraud and improving security [5, 6]. This technology accurately identifies and prevents anomalous and fraudulent banking transactions. Blockchain is a secured, shared ledger that tracks the blocks in a transaction. Being a decentralized network makes it resistant to attacks of any type, and since only verified contributors have access, security on the network is further strengthened. Permissioned blockchain networks act as an extra-secure system, providing an additional layer of security by allowing only certain participants to perform certain tasks. Machine learning algorithms have proved useful in credit card fraud detection [2] and financial crisis prediction [7] problems. This paper combines the simulated annealing heuristic with the k-means clustering algorithm in order to detect fraudulent credit card transactions. k-means with simulated annealing is a powerful and efficient clustering technique: it integrates the searching capability of k-means with simulated annealing's ability to reach a minimum-energy configuration.

In this paper, Sect. 2 presents the related work in this area. Section 3 gives a brief introduction to the methodologies used in the proposed work. Section 4 discusses the proposed work, the dataset used to perform the experiments and the results. Lastly, Sect. 5 concludes the paper.

2 Related Works

Ostapowicz and Żbikowski [3] analysed how the security and reliability of blockchain can be used to detect fraudulent accounts in a banking system. They compared the effectiveness of different supervised learning algorithms on the Ethereum blockchain and demonstrated their capabilities for use in anti-fraud rules for digital currency. Fashoto et al. [2] used the k-means clustering technique in combination with machine learning techniques for credit card fraud detection: they combined k-means with a multilayer perceptron (MLP) and a hidden Markov model (HMM) and compared the resulting approaches. Thang et al. [8] proposed a model to detect tax fraud in small firms. This model first uses fuzzy inference to determine the business class to which the inspected firm belongs and then uses a neural network to determine the fraud status of the firm. The neural network is trained using periodic financial reports, market information of the business class and the fraud history of the firm. Podgorelec et al. [4] proposed a machine learning-based model that automatically signs blockchain transactions and identifies fraudulent ones. This model makes blockchain transactions easier, as no user needs to sign the transaction manually. The model was tested on data from the Ethereum public main network. Brown and Huntley [1] formulated the unsupervised clustering approach as an optimization problem to improve the efficiency and performance of the k-means clustering algorithm, using simulated annealing to partition data into clusters.


3 Methodology

3.1 Blockchain

A private-permissioned blockchain restricts who is allowed to participate in the network and who has access to particular transactions. A participant is added only through invitation, and the entry is validated according to a set of rules built while creating the network. Any member who is given permission in a network is visible to other members. In this way, decentralization is maintained, as verification is done by the majority of members, and any unauthorized member can be reported and expelled. The blockchain stores only the hash of the transaction data; the actual data is generally stored on a distributed system along with the hash of that data. Tampering can then easily be detected by matching the original hash on the blockchain with the hash of the stored transaction data.
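The tamper check described here, matching the on-chain hash against a fresh hash of the off-chain record, can be sketched with standard-library hashing. The transaction fields below are invented for illustration:

```python
import hashlib
import json

def tx_hash(tx: dict) -> str:
    """SHA-256 of the canonical JSON encoding of a transaction record."""
    canonical = json.dumps(tx, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def is_tampered(stored_tx: dict, chain_hash: str) -> bool:
    """True if the off-chain record no longer matches the hash kept on-chain."""
    return tx_hash(stored_tx) != chain_hash

# The chain stores only the hash; the full record lives off-chain.
tx = {"card": "XXXX-1111", "amount": 42.50, "merchant": "acme"}
on_chain_hash = tx_hash(tx)

# An attacker edits the stored record; the on-chain hash exposes the change.
tampered = dict(tx, amount=999.99)
```

Canonical JSON (sorted keys, fixed separators) matters here: without a canonical encoding, two logically identical records could hash differently.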

3.2 k-Means Clustering Algorithm

Clustering is an unsupervised machine learning technique used to subgroup a given dataset such that data points belonging to the same cluster have similar properties and characteristics. The k-means clustering algorithm divides the given data into k different clusters. Algorithm 1 lists the steps of this algorithm.

Algorithm 1: k-means Clustering Algorithm
1. Decide upon the number of cluster centres needed (k). Start with k initial seed values, which define the cluster centres for the given solution space at the beginning.
2. (E-step / Expectation step) Associate each data point with the cluster whose centre is nearest. Measures such as Manhattan distance and Euclidean distance can be used to find the distance between data points.
3. (M-step / Maximization step) Recompute the cluster centres by taking the average of all data points that belong to a single cluster.
4. If any cluster centre configuration changed, go back to step 2.
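The four steps of Algorithm 1 can be written compactly; this is an illustrative sketch, not the paper's implementation, using Euclidean distance for the E-step:

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: E-step assigns points to the nearest centre,
    M-step recomputes each centre as its cluster mean."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)          # step 1: k initial seeds
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # E-step: associate each point with its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[i].append(p)
        # M-step: recompute each centre as the mean of its cluster.
        new_centres = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
        if new_centres == centres:           # step 4: stop once stable
            break
        centres = new_centres
    return centres, clusters

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centres, clusters = kmeans(points, 2)
```

With two well-separated blobs the centres converge to the blob means, (0.5, 0.5) and (10.5, 10.5).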

3.3 Simulated Annealing

Simulated annealing takes its name from a metallurgical process called annealing, which aims at achieving the lowest possible energy state of a system of molecules in a metal. It is a two-step process: heating and cooling. The heating step involves heating the metal to a very high temperature in order to increase the energy of its molecules. Cooling involves gradually cooling off the highly heated metal until the equilibrium state is achieved. Finally, the molecules arrange themselves in an ordered lattice structure, which is the lowest possible energy configuration of the system. The steps of the algorithm are given in Algorithm 2.

Algorithm 2: Simulated Annealing Heuristic Algorithm

Input: T0 (initial temperature), Tf (final temperature), S0 (initial seed values for the solution space), E0 (k-means energy level for the initial solution space), max_itr (maximum number of iterations)
Candidate at each step: S1, a new solution space obtained by randomly changing the dimensionality of one random cluster, with E1 its k-means energy level

1.  while T0 > Tf do
2.    count = 0
3.    while count < max_itr do
4.      if (E1 < E0) or exp(−(E1 − E0)/T0) > random(0, 1) then
5.        S0 = S1
6.        E0 = E1
7.      else
8.        count++
9.      end
10.   end
11.   T0 = (depreciating factor) × T0
12.   Output S0
13. end
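The acceptance rule in Algorithm 2, accept a lower-energy solution unconditionally and a higher-energy one with probability exp(−(E1 − E0)/T0), is the Metropolis criterion. A compact sketch with geometric cooling follows; the toy quadratic objective and step size are assumptions for illustration, and this variant returns the best state seen rather than the final one:

```python
import math
import random

def accept(e_new, e_old, temperature, rng=random):
    """Metropolis acceptance: downhill moves always pass;
    uphill moves pass with probability exp(-dE/T)."""
    if e_new < e_old:
        return True
    return math.exp(-(e_new - e_old) / temperature) > rng.random()

def anneal(energy, x0, t0, tf, factor, step=1.0, seed=0):
    """Minimal simulated-annealing loop with geometric cooling (t <- factor * t)."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    t = t0
    while t > tf:
        cand = x + rng.uniform(-step, step)   # random neighbour
        ec = energy(cand)
        if accept(ec, e, t, rng):
            x, e = cand, ec
            if e < best_e:
                best_x, best_e = x, e
        t *= factor                           # cooling schedule
    return best_x, best_e

# Toy run: minimise f(x) = x^2 starting far from the optimum.
best_x, best_e = anneal(lambda x: x * x, x0=8.0, t0=10.0, tf=1e-3, factor=0.95)
```

At high temperature the rule tolerates uphill moves, which lets the search escape local minima; as the temperature decays, it degenerates into pure hill descent.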

4 Experimental Work

4.1 Dataset

The dataset (https://www.kaggle.com/mlg-ulb/creditcardfraud) used in this paper is taken from the Kaggle site and consists of details of European cardholders who made credit card transactions over a period of 2 days in September 2013. The dataset contains a total of 284,807 transactions, of which 492 are fraudulent. Principal component analysis (PCA) has been applied to the dataset to represent the values in numeric form.

4.2 Proposed Work

In this paper, we combine blockchain and the simulated annealing k-means clustering technique to detect anomalous and fraudulent activities in banking transactions. To detect fraudulent banking activities, a 28-dimensional space was created from the dataset. The blockchain adds an additional layer of abstraction, ensuring enhanced levels of security in the banking transaction domain. The results of the proposed approach are then compared with those of the k-means algorithm for credit card fraud detection. Figure 1 depicts the architecture of the proposed system. Algorithm 3 lists the steps involved in the simulated annealing k-means algorithm.

Algorithm 3: Simulated Annealing k-means Algorithm
1. Use simulated annealing for better initialisation of the two k-means initial seeds: one for classifying legitimate banking transactions and the other for classifying fraudulent banking transactions.
2. E-step / Expectation step: Use Euclidean distance to calculate the distance of each data point in the dataset from each of the two cluster centres. Each data point is then associated with the cluster whose centre is nearer and classified as legitimate or fraudulent according to the cluster it belongs to.
3. M-step / Maximisation step: For each cluster, take the average of all the data points that belong to that cluster and label this as the new cluster centre.
4. Go back to step 2 until iterations = max_itr.

T0 = aci / log( ci / ((ci × acceptance) − (1 − acceptance) × (max_itr − ci)) )    (1)

Tf = −(β × aci) / log(e)    (2)

Fig. 1 Proposed system architecture


depreciating factor = (Tf / T0)^(1 / Tnum)    (3)

4.3 Results

Table 1 defines the values of the various hyperparameters used to calculate the initial and final temperatures. These hyperparameters are used in simulated annealing to find the initial and final temperatures, which in turn yield optimal initial seed values. Figure 2 shows the confusion matrices for k-means and for k-means with simulated annealing. Figures 3 and 4 show the area under the curve–receiver operating characteristic (AUC-ROC) plots of k-means and of our proposed technique on the given dataset. With simulated annealing, the accuracy is 82.43%, as opposed to 79.63% with the usual k-means. Table 2 compares the performance of k-means versus k-means with simulated annealing on different performance measures. Using the proposed technique, there is a significant rise in the performance parameters precision, recall, F1-score and Gmean.
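The scores in Table 2 follow directly from the confusion-matrix counts. Recomputing them for the plain k-means column (TN = 22,649, FP = 5776, FN = 25, TP = 31) reproduces the reported figures, assuming Gmean here is the geometric mean of recall and specificity:

```python
import math

def scores(tn, fp, fn, tp):
    """Classification metrics from raw confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                  # sensitivity / true-positive rate
    specificity = tn / (tn + fp)             # true-negative rate
    f1 = 2 * precision * recall / (precision + recall)
    gmean = math.sqrt(recall * specificity)  # geometric mean of the two rates
    return accuracy, precision, recall, f1, gmean

acc, prec, rec, f1, gmean = scores(tn=22_649, fp=5776, fn=25, tp=31)
# Reproduces Table 2's k-means column: accuracy 79.63%, precision 0.0053,
# recall 0.5536, F1-score 0.0105, Gmean 0.6641.
```

The near-zero precision alongside a respectable Gmean reflects the extreme class imbalance (492 frauds in 284,807 transactions), which is why the paper reports Gmean in addition to accuracy.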

Fig. 2 Confusion matrix for fraud detection

Table 1 Hyperparameters used in the proposed technique

Hyperparameter              | Value
Cost increase (ci)          | 100
Average cost increase (aci) | 200
Acceptance                  | 0.75
e                           | 0.00000000001
β                           | 0.125
max_itr                     | 500
Tnum                        | 200

Fig. 3 AUC-ROC plot based on k-means for fraud detection

Fig. 4 AUC-ROC plot based on k-means simulated annealing for fraud detection


Table 2 Performance comparison of k-means with proposed approach

Score          | k-means | Proposed approach
True Negative  | 22,649  | 23,444
False Positive | 5776    | 4981
False Negative | 25      | 23
True Positive  | 31      | 33
Accuracy       | 79.6320 | 82.4303
Precision      | 0.0053  | 0.0066
Recall         | 0.5536  | 0.5893
F1-score       | 0.0105  | 0.0131
Gmean          | 0.6641  | 0.6972

5 Conclusion

We combined blockchain with the simulated annealing k-means technique to ensure a better security level by detecting, and hence preventing, anomalous and fraudulent credit card banking transactions. We used simulated annealing to optimize the k-means algorithm, ensuring the best results for fraud detection, and combined this with blockchain technology to add another layer of security. Experimental results show that simulated annealing combined with the k-means algorithm performed better than simple k-means: the proposed approach achieves an accuracy of 82.43%, as opposed to 79.63% with the usual k-means.

References
1. D.E. Brown, C.L. Huntley, A practical application of simulated annealing to clustering. Pattern Recog. 25(4), 401–412 (1992)
2. S.G. Fashoto, O. Owolabi, O. Adeleye, J. Wandera, Hybrid methods for credit card fraud detection using k-means clustering with hidden Markov model and multilayer perceptron algorithm. Curr. J. Appl. Sci. Technol. 1–11 (2016)
3. M. Ostapowicz, K. Żbikowski, Detecting fraudulent accounts on blockchain: a supervised approach, in International Conference on Web Information Systems Engineering (Springer, Berlin, 2020), pp. 18–31
4. B. Podgorelec, M. Turkanović, S. Karakatič, A machine learning-based method for automated blockchain transaction signing including personalized anomaly detection. Sensors 20(1), 147 (2020)
5. P. Rani, A. Balyan, V. Jain, D. Sangwan, P.P. Singh, J. Shokeen, A probabilistic routing-based secure approach for opportunistic IoT network using blockchain, in 17th India Council International Conference (INDICON 2020) (IEEE, 2020)
6. P. Rani, P.P. Singh, A. Balyan, J. Shokeen, V. Jain, D. Sangwan, A secure epidemic routing using blockchain in opportunistic internet of things, in International Conference on Data Analytics and Management (Springer, Berlin, 2020)


7. S. Sankhwar, D. Gupta, K. Ramya, S.S. Rani, K. Shankar, S. Lakshmanaprabu, Improved grey wolf optimization-based feature subset selection with fuzzy neural classifier for financial crisis prediction. Soft Comput. 24(1), 101–110 (2020)
8. C. Thang, P.Q. Toan, E.W. Cooper, K. Kamei, Application of soft computing to tax fraud detection in small businesses, in 2006 First International Conference on Communications and Electronics (IEEE, 2006), pp. 402–407

Improving Accuracy of Deep Learning-Based Compression Techniques by Introducing Perceptual Loss in Industrial IoT

Poonam Rani, Vibha Jain, Mohammad Saif, Saahil Hussain Mugloo, Mitul Hirna, and Somil Jain

Abstract Over the years there has been a great increase in the industrial Internet of things; consequently, over 20 billion devices are now connected through the Internet. The Internet of things is the connection of a wide variety of devices or entities over the Internet for the storage and gathering of data. This raises a need for compression techniques for remote sensing images, an enabling technology for the Internet of things. While various conventional technologies exist for the compression of remote sensing images, we require a low-complexity technique that preserves the quality of the images without compromising the various spatial parameters that these images possess. Deep learning methods using convolutional neural networks have proved a reasonable solution to the compression problem. The proposed method is a deep learning-based approach that uses autoencoders for encoding and decoding and incorporates perceptual loss to improve the accuracy of the reconstructed images. The loss functions we utilize are mean squared error loss, content loss, and style loss. We also make use of lossless compression via the LZMA and deflate compression algorithms, which do not incur any information loss. Experimental results show that our proposed compression method outperforms standard benchmark techniques.

P. Rani (B) · V. Jain · M. Saif · S. H. Mugloo · M. Hirna · S. Jain Department of Computer Science and Engineering, NSUT, New Delhi, India e-mail: [email protected] V. Jain e-mail: [email protected] M. Saif e-mail: [email protected] S. H. Mugloo e-mail: [email protected] M. Hirna e-mail: [email protected] S. Jain e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_6


Keywords Remote sensing image · Compression technique · Perceptual loss · Deep learning · Industrial IoT

1 Introduction Internet of things (IoT) refers to a network of interrelated entities ("things") that are connected over the Internet through some wireless connection, responsible for the collection and exchange of information without human intervention [1]. This permits the connected objects to learn from each other's experiences just as we humans do. The industrial Internet of things (IIoT) is defined as "machines, computers, and people enabling intelligent industrial operations using advanced data analytics for transformational business outcomes" [2, 3]. Industrial IoT has found use in a wide range of applications, such as quality management, increased machine utilization, supply chain management, and higher operator productivity [4]. With the advent of remote sensor technologies, there has been a great increase in the use of remote sensing images from spacecraft, one of the basic enabling applications for the industrial IoT. Remote sensing is defined as the process of detecting and monitoring the physical characteristics of an area by measuring the reflected and emitted radiation at a distance. Over the years, the IIoT through remote sensing images has influenced the way we live, impacting domains ranging from farming to military practice and substantially improving our living standards. This surge in the usage of high-quality remote sensing images poses the great challenge of storage and transmission. Different compression techniques have been developed to produce reduced-size remote sensing images while preserving their quality. Broadly, there are two types of compression: lossy and lossless. Over the years, the joint efforts of the CCITT, ISO, MPEG, and JPEG groups have established that high compression ratios lead to lossy compression, where the decompressed image differs from the original image, while lossless techniques can be used in critical applications.
Several traditional means of image compression such as JPEG and PNG suffer from either quality reduction or data loss [5]. These methods cannot be used for remote sensing images because of complexities that arise from features such as spatial arrangement. Since remote sensing images display large spatial correlation, data compression is valuable. The spectrum is the most vital attribute that we require to preserve in these images, and thus, merely extending a 2D coding system does not serve any good. Thus, new methods exploiting both the spatial and the spectral redundancy are proposed in this paper. Despite the different coding techniques proposed for lossless compression, such as DWT, DCT, and KLT (which are all transform-based) and CCAP and JPEG-LS (prediction-based), deep learning methods have provided better outcomes [6].


The rest of the paper is structured as follows: Sect. 2 covers the literature work related to remote sensing image compression techniques. Section 3 describes the methodology of the proposed optimal image compression technique. Next, Sect. 4 illustrates the dataset and generated results. Finally, our work is concluded in Sect. 5.

2 Related Work In this section, we review the work presented in previous related literature. The shortcomings of previous studies are discussed in order to devise an improved method. Patel et al. [7] proposed a new method using a deep perceptual metric, which has proven to fit better with human perceptual similarity. This compression method uses a model consisting of an encoder and decoder that jointly optimize the new metric and MS-SSIM. Through additional human evaluations, it is observed that this new method gives better results than previous learning-based image compression methods as well as JPEG-2000 and is comparable to BPG. Zebang et al. [8] propose a lossy image compression architecture that utilizes the advantages of deep learning methods to achieve very high coding efficiency. This design is a densely connected autoencoder for lossy image compression, used to extract richer feature information from the image during compression. This method has been shown to perform better than JPEG and JPEG-2000, with sharper edges and better visual results in low-bit-rate image compression. Tschannen et al. [9] have shed light on improvements in representation learning, with special emphasis on models using autoencoders. They have shown that the use of meta-priors believed useful for downstream tasks, for example, hierarchical organization of features and disentanglement, is more effective. These methods, despite showing good results, have been riddled with strong inductive biases as well as modeling assumptions. Odena et al. [10] suggest a method that gives good results when used with nearest-neighbor interpolation but had issues when bilinear resize was tried. The proposed method worked well with the hyper-parameters that were optimized for deconvolution. This also shows the issues that arise from naively using the bilinear interpolation method, which strongly resists high-frequency image features. Sujitha et al. [11] suggest that the use of deep learning methods in the domain of image compression can produce better results for remote sensing. These methods have proven to be better than formats such as JPEG and JPEG-2000, which is also relevant in view of the increasing demand for low-level applications involving image compression. Hence, we found that using a new metric, i.e., perceptual loss, we can achieve better perceptual similarity for humans. The use of deep learning methods was shown to achieve high coding efficiency. Autoencoders provide richer feature extraction, which is helpful for compression, and an improved binarizer can be used to quantize the result of the encoder.


3 Proposed Method Using the findings from the literature review, we devise a deep learning-based method that uses autoencoders for encoding and decoding. Besides that, we also incorporate the use of perceptual loss to improve the accuracy of the reconstructed images. Our proposed compression technique has three operations, i.e., encoder, transmission, and decoder. Firstly, the autoencoder is constructed in such a manner that leads to minimum loss through the use of mean squared error (MSE) loss. To achieve maximum compression without significant loss of information, the middle-most layer is made as small as possible. Augmentation of MSE loss with two other perceptual losses, i.e., style loss and content loss, is performed. Afterward, the autoencoder is trained to minimize the combined loss. Furthermore, we utilize the lossless compression algorithms over the existing compressed image to observe the final compression ratio from the original image (shown in Fig. 1).

3.1 Overall Architecture The proposed architecture consists of mainly three parts: the encoder, the transmission, and the decoder. The encoder and decoder are trained in the form of an autoencoder and split into their respective roles after their completion. The transmission component consists of the input from the encoder, the lossless compression algorithm, the decompression algorithm, and the input to the decoder.

Fig. 1 System architecture (autoencoder: encoder and decoder; transmission: compression, transmit, decompression)

3.2 Autoencoder Architecture Autoencoders are meant to replicate the latent space of the image dataset they are trained upon. Thus, we design different architectures and test them to minimize the information loss of images. Inside the encoder, three blocks of functional layers are

presented, each consisting of a set of convolutional filters, a batch normalization layer, a leaky ReLU activation layer, and a max-pooling layer. For the decoder part of the autoencoder, two blocks of functional layers are present. The first block consists of a convolutional transpose layer and a leaky ReLU activation layer. The second block has a similar composition, but a sigmoid activation layer is used instead of the leaky ReLU activation layer.
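The block structure described above can be sketched in PyTorch (the framework the authors report using); the channel counts, kernel sizes, and strides below are illustrative assumptions, not the paper's actual hyper-parameters:

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Sketch of the described architecture with assumed channel sizes."""
    def __init__(self):
        super().__init__()
        # Encoder: three blocks of Conv -> BatchNorm -> LeakyReLU -> MaxPool.
        def enc_block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.LeakyReLU(0.2),
                nn.MaxPool2d(2),
            )
        self.encoder = nn.Sequential(
            enc_block(3, 16), enc_block(16, 8), enc_block(8, 4)
        )
        # Decoder: two blocks of transposed convolution; the second block
        # ends in a sigmoid instead of a leaky ReLU, per the text.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(4, 8, kernel_size=2, stride=2),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(8, 3, kernel_size=4, stride=4),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```

With these assumed strides, the encoder shrinks each spatial dimension by 8× and the decoder restores it, so input and output shapes match for any side length divisible by 8.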

3.3 Loss Functions The loss function is a combination of three separate loss functions, i.e., the MSE loss for pixel loss, style loss, and content loss. We discuss the three functions as follows. MSE loss: MSE stands for mean squared error loss, and this loss has a single purpose, i.e., to minimize the difference between the images on a pixel scale. Using pixel-by-pixel comparison, the differences between the actual image's pixels and the output image's pixels are squared, summed, and finally averaged. The function of MSE loss is given by Eq. (1):

L(y, \hat{y}) = \frac{1}{N} \sum_{i=0}^{N} (y_i - \hat{y}_i)^2  (1)

Style loss: This loss attempts to compare the style of one image with another image. This is done by passing both the output image and the actual image into a loss network, and the activations of certain defined layers are used and compared. The comparison is done by utilizing Gram matrices, which are the matrix multiplication of an activation layer's output with its transpose. Thus, we calculate the style loss as given in Eq. (2):

l_{\mathrm{style}}^{\phi, j}(y, \hat{y}) = \left\| G_j^{\phi}(y) - G_j^{\phi}(\hat{y}) \right\|_F^2  (2)

Content loss: Comparison of the content of the images is the purpose of this loss, i.e., whether the same parts of the activation layers were triggered. If so, similar features are present in both images. This is done by taking the Euclidean norm of the difference of the activations of the output image and the actual image and dividing it by the product of the dimensions of the image, so as to normalize it. Evaluation of the content loss is given in Eq. (3):

l_{\mathrm{feature}}^{\phi, j}(y, \hat{y}) = \frac{1}{C_j H_j W_j} \left\| \phi_j(\hat{y}) - \phi_j(y) \right\|_2^2  (3)

Thus, the overall combined loss is calculated as given in Eq. (4):

J(\theta) = L(y, \hat{y}) + \sum_j l_{\mathrm{style}}^{\phi, j}(y, \hat{y}) + \sum_j l_{\mathrm{feature}}^{\phi, j}(y, \hat{y})  (4)


The VGG16 convolutional neural network (CNN) was chosen as the loss network for utilizing the perceptual losses via the extraction of activations of certain layers of the network.
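Assuming the loss-network activations φ_j have already been extracted as arrays, Eqs. (1)–(3) can be sketched in NumPy as below; the extra normalization of the Gram matrix by C·H·W follows the common convention of Johnson et al. [6] and is an assumption here, since Eq. (2) does not spell it out:

```python
import numpy as np

def mse_loss(y, y_hat):
    # Eq. (1): mean of squared pixel differences.
    return np.mean((y - y_hat) ** 2)

def gram_matrix(act):
    # act: one activation map of shape (C, H, W); flatten spatial dims.
    c, h, w = act.shape
    f = act.reshape(c, h * w)
    return (f @ f.T) / (c * h * w)  # normalization is a convention, assumed

def style_loss(act_y, act_y_hat):
    # Eq. (2): squared Frobenius norm of the Gram-matrix difference.
    g = gram_matrix(act_y) - gram_matrix(act_y_hat)
    return np.sum(g ** 2)

def content_loss(act_y, act_y_hat):
    # Eq. (3): squared Euclidean norm, normalized by C*H*W.
    c, h, w = act_y.shape
    return np.sum((act_y_hat - act_y) ** 2) / (c * h * w)
```

The combined objective of Eq. (4) is then the MSE term plus the style and content terms summed over the chosen layers j.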

3.4 Lossless Compression Algorithm Lossless compression algorithms compress information in such a manner that the result has no information loss. Since autoencoders already compress information in a very lossy manner, usage of lossy compression algorithms would result in further degradation of information. To demonstrate their usage, LZMA and DEFLATE compression algorithms were utilized to be able to give a comparative result. Lempel–Ziv Markov chain algorithm (LZMA) features a high compression ratio and a variable dictionary size. It utilizes the Lempel–Ziv algorithm but adds a sliding dictionary, delta filter, and range encoder. DEFLATE utilizes a combination of Lempel–Ziv– Storer–Szymanski (LZSS) and Huffman encoding.
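As a quick illustration of the two algorithms (not the authors' exact pipeline), Python's standard library exposes both: `lzma` implements LZMA, and `zlib` wraps a DEFLATE stream:

```python
import lzma
import zlib

def lossless_sizes(payload: bytes) -> dict:
    """Compare LZMA and DEFLATE (via zlib) on the same byte stream."""
    return {
        "raw": len(payload),
        "lzma": len(lzma.compress(payload)),
        # zlib adds a small header/checksum around the DEFLATE stream.
        "deflate": len(zlib.compress(payload, 9)),
    }

# A highly redundant payload, standing in for the encoder's latent bytes.
sizes = lossless_sizes(b"remote sensing " * 1000)
```

Both calls are invertible (`lzma.decompress`, `zlib.decompress`), so no information is lost, matching the role these algorithms play after the lossy autoencoder stage.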

4 Experimental Results This section describes the details of our experimental verification. To evaluate the performance of our proposed compression technique, the mean square error (MSE), root mean square error (RMSE), peak signal-to-noise ratio (PSNR), augmented perceptual loss, compression ratio (CR), and bit rate are compared with benchmark compression technologies, i.e., D-CNN [11] and JPEG2000. The dataset utilized was DOTA-v1.5 [12], which has 469 images of varying dimensions and resolutions. As PyTorch was used as the machine learning framework, the images were processed accordingly: first, the images were resized to dimensions 512 × 512 × 3, then normalized by dividing by 255.0, and finally the dimensions were rearranged to have the color channels as the first dimension, followed by the width and height. Five different images are used for the testing phase, as shown in Fig. 2. An analysis of reconstructed image quality metrics using different compression algorithms is shown in Table 1. Here, MSE is used to quantify the difference between the actual images and the reconstructed compressed images; a lower MSE yields higher-quality reconstructed images. From the experiments, we can conclude that MSE is comparatively lower with the D-CNN technique; however, our end goal is to reduce the perceptual loss, which is achieved successfully. RMSE and PSNR are related to each other as shown in Eq. (5):

\mathrm{PSNR} = 20 \times \log_{10} \frac{255}{\mathrm{RMSE}}  (5)
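Eq. (5) translates directly into code; for example:

```python
import math

def psnr_from_rmse(rmse: float, peak: float = 255.0) -> float:
    """Eq. (5): PSNR (in dB) from RMSE, for 8-bit images (peak = 255)."""
    return 20.0 * math.log10(peak / rmse)
```

As a sanity check, an RMSE equal to the peak value gives 0 dB, and smaller RMSE values give higher PSNR.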

Fig. 2 Sample test images: P0000.png, P0032.png, P0200.png, P0458.png, P0625.png

Table 1 Analysis of different reconstructed image quality metrics

Measure                    Method                        P0000.png   P0032.png   P0200.png   P0458.png   P0625.png
MSE                        D-CNN with perceptual loss    3.5678      2.479       6.590       5.0473      3.0802
                           D-CNN                         2.58407     1.85494     2.79328     1.93430     1.63086
                           JPEG2000                      98.650      126.430     59.160      109.332     127.642
RMSE                       D-CNN with perceptual loss    1.888       1.5744      2.5670      2.2466      1.7550
                           D-CNN                         1.6075      1.3620      1.6713      1.3908      1.2770
                           JPEG2000                      9.932       11.244      7.692       10.456      11.298
PSNR                       D-CNN with perceptual loss    42.6067     44.1880     39.9419     41.1002     43.245
                           D-CNN                         44.0077     45.4475     43.669      45.265      46.006
                           JPEG2000                      28.190      27.112      30.411      27.743      27.071
Augmented perceptual loss  D-CNN with perceptual loss    32.3642     35.2083     41.3862     31.4765     29.8534
                           D-CNN                         47.475      50.3647     49.3388     54.263      45.8636
                           JPEG2000                      80.484      89.861      83.7591     90.4865     86.8946


Table 2 Analysis of different compression efficiency

Measure    Method                        P0000.png   P0032.png   P0200.png   P0458.png   P0625.png
CR         D-CNN with perceptual loss    0.44669     0.38317     0.37322     0.41004     0.42508
           D-CNN                         0.45316     0.3856      0.383       0.42178     0.42649
Bit rate   D-CNN with perceptual loss    2.42823     2.20473     2.29766     2.35965     2.56884
           D-CNN                         2.21873     2.35786     2.42716     2.57733     2.46342

For PSNR and augmented perceptual loss, our proposed compression technique outperforms all the benchmark techniques. From the results, we can conclude that there is a trade-off between MSE and augmented perceptual loss: as MSE increases somewhat, we obtain the best perceptual performance, i.e., the perceptual losses reduce. Table 2 shows the comparative analysis of CR and bit rate. After reconstructing each test image, it is evident that the proposed compression method has superior performance over all the compared methods.

5 Conclusion In this paper, we propose a CNN-based image compression technique that makes use of perceptual loss functions to improve the accuracy of the reconstructed images, an improvement over existing CNN-based approaches for image compression. Experimental results confirmed the usefulness of the approach, providing an average augmented perceptual loss of 34.05, which outperforms the other benchmark compression technologies.

References
1. P. Rani, P.P. Singh, A. Balyan, J. Shokeen, V. Jain, S. Sangwan, A secure epidemic routing using blockchain in opportunistic Internet of things, in International Conference on Data Analytics and Management (2020)
2. P. Rani, A. Balyan, V. Jain, D. Sangwan, P.P. Singh, J. Shokeen, A probabilistic routing-based secure approach for opportunistic IoT network using blockchain, in 2020 IEEE 17th India Council International Conference (INDICON) (2020)
3. Q. Wang, X. Zhu, Y. Ni, L. Gu, H. Zhu, Blockchain for the IoT and industrial IoT: a review. Internet of Things 10 (2020)
4. P. Rani, V. Jain, M. Joshi, M. Khandelwal, A secured supply chain network for route optimization and product traceability using blockchain in Internet of Things, in International Conference on Data Analytics and Management (2020)


5. Z. Wang, E.P. Simoncelli, A.C. Bovik, Multiscale structural similarity for image quality assessment, in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2 (IEEE, 2003), pp. 1398–1402
6. J. Johnson, A. Alahi, L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution, in European Conference on Computer Vision (Springer, 2016), pp. 694–711
7. Y. Patel, S. Appalaraju, R. Manmatha, Deep perceptual compression. arXiv preprint arXiv:1907.08310 (2019)
8. S. Zebang, K. Sei-ichiro, Densely connected autoencoders for image compression, in Proceedings of the 2nd International Conference on Image and Graphics Processing (2019), pp. 78–83
9. M. Tschannen, O. Bachem, M. Lucic, Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069 (2018)
10. A. Odena, V. Dumoulin, C. Olah, Deconvolution and checkerboard artifacts. Distill 1(10) (2016)
11. B. Sujitha, V.S. Parvathy, E.L. Lydia, P. Rani, Z. Polkowski, K. Shankar, Optimal deep learning based image compression technique for data transmission on industrial Internet of things applications. Trans. Emerg. Telecommun. Technol. (2020), e3976
12. Detecting Objects in Aerial Images (DOAI), https://captain-whu.github.io/DOAI2019/dataset.html

Characterization and Prediction of Various Issue Types: A Case Study on the Apache Lucene System Apurva Aggarwal, Ajay Kumar Kushwaha, Somil Rastogi, Sangeeta Lal, and Sarishty Gupta

Abstract Issue reports related to a software system provide an important source of information. However, an issue report has to go through various phases before it gets fixed. One such phase is issue-type assignment, which is currently done manually. Manual issue-type assignment is not only time consuming but also error prone. An automated system can help in reducing both the time and errors in issue-type assignment. In this paper, we work on the characterization and prediction of different issue types. We perform a characterization study with respect to three parameters: distribution, mean time to repair, and the top terms present in various issue types. We compare several classic and ensemble machine learning classifiers with respect to accuracy and prediction time. The experimental results on the Apache Lucene project show that the machine learning-based issue-type classification approach is effective, giving a maximum accuracy of 67%. The classifiers multinomial Naïve Bayes and ensemble (hard voting) give the best results. Keywords Issue report · Bug report · Machine learning · Prediction · Software engineering

A. Aggarwal · A. K. Kushwaha · S. Rastogi · S. Gupta Department of CSE & IT, JIIT, Noida, India e-mail: [email protected] S. Lal (B) School of Computing and Mathematics, Keele University, Newcastle-Under-Lyme, UK © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_7

1 Introduction Issue tracking systems (ITSs) such as Jira and Bugzilla are widely used these days to collect issue reports. An issue has to pass through several phases in its life cycle before it gets fixed. Each phase in the issue life cycle takes a certain amount of time, and if we can reduce this time, it can help in reducing the overall fixing time of the bug report. One phase in the issue life cycle is 'type' assignment, in which issues are assigned the type they belong to; for example, an issue can be a bug, improvement,


new feature, wish, task, sub-task, test, etc. It is noticed that a common platform is used to collect all these types of issues. Once issues are collected, developers assign them various labels like priority, type, and component, and triage them to appropriate developers for resolution. However, this manual resolution has two main challenges. First, a large number of issues are reported in the system daily. Second, manual assignment often leads to errors in labeling. This wrong assignment of labels often leads to an increase in fixing time. Hence, an automated system that can assign labels to bug reports will be of great use. In this paper, we work on the characterization and prediction of different issue types. We work on seven types of issues as mentioned earlier: bug, new feature, improvement, task, sub-task, wish, and test. We perform a characterization study with respect to three parameters: distribution, mean time to repair, and top terms present in various issue types. Our characterization study reveals several interesting insights; for example, we notice that the distribution of different issue types is imbalanced, and different issue types take different times to repair on average. Our study also reveals that different issue types have different terms present in their reports. Using this insight, we propose the use of machine learning models for building an automated issue classification/prediction system using the textual data present in the issue reports. We used both classic and ensemble learning algorithms for issue-type prediction. Our experimental results show that the multinomial Naïve Bayes and ensemble (hard voting) algorithms give the highest accuracies of 65% and 67%, respectively. Both of these classifiers also took the least amount of time in prediction. The organization of the rest of the paper is as follows. In Sect. 2, we discuss the related work and our novel research contribution. In Sect. 3, we present the experimental details. In Sect. 4, we discuss the research questions and present the results obtained. Finally, we conclude the paper in Sect. 5 and provide several future directions.

2 Related Work and Research Contribution 2.1 Related Work In this section, we discuss the work closely related to the research work presented in this paper.

2.1.1 Binary and Multiclass Classification of Issue Reports

Antoniol et al. [1] collected and classified 1800 issues from three open-source bug trackers. They built classifiers using techniques such as logistic regression, alternating decision trees, and Naïve Bayes to automatically classify the issues. Pingclasai


et al. [2] proposed techniques to automatically classify the bug reports. They used Naïve Bayes classifier, AD tree, and logistic regression. Kochhar et al. [3] classified bug reports into one of the 13 categories. Thung et al. [4] performed manual classification of 500 defects into three categories: control and data flow, structural, and non-functional. They applied SVM to automatically classify the defects, and their proposed solution gave 77.8% accuracy. All these studies work on classifying bug reports to different categories, but none of them worked on classifying bug reports to bug, improvement, feature request, task, sub-task, test, and wish. In contrast to these studies in this paper, we work on classifying bug reports to one of the seven categories.

2.1.2 Ensemble Learning Applications

Ensemble machine learning approaches combine the power of many machine learning algorithms together and have been used in several areas of software engineering such as logging prediction [5, 6], defect prediction [7], blocking bug prediction [8], and configuration bug prediction [9]. Alzubi et al. [10–12] also proposed several interesting ensemble techniques. However, ensemble methods have not been explored for issue-type prediction. In this paper, we use both classic and ensemble-based methods for issue-type classification.

2.2 Research Contribution In the context of related work, our work makes the following novel and unique research contributions.
(1) We perform a characterization study of issue reports with respect to three dimensions: distribution, mean time to repair, and terms present in the issue reports.
(2) We use classic and ensemble classifiers for issue classification into seven categories.
(3) We conduct all the experiments on a real-world project and present its results.

3 Experimental Details We conduct all our experiments on a large, real-world open-source project Apache Lucene. Apache Lucene is a Java library. It provides many features like indexing, search, and spell-checking. We downloaded 8000 issues that were marked as ‘fixed.’ The data present in an issue report is in natural language text. Conducting experiments on this data directly can lead to non-optimal results, and hence, we applied several preprocessing techniques to clean this data set. We removed all the special


characters, links, punctuation marks, URLs, emails, and phone numbers, and converted all words to lowercase. After preprocessing, we had 7191 bug reports left. The experiments are run on macOS and Windows. We use the Python library 'sklearn' for the machine learning implementation.
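The cleaning steps just described can be sketched with stdlib regular expressions; the exact patterns the authors used are not given, so the ones below are assumptions:

```python
import re

def preprocess(text: str) -> str:
    """Rough sketch of the described cleaning: lowercase, then strip
    URLs/links, emails, phone numbers, and special characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs and links
    text = re.sub(r"\S+@\S+", " ", text)                 # email addresses
    text = re.sub(r"\+?\d[\d\-\s]{7,}\d", " ", text)     # phone numbers
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # punctuation/special chars
    return re.sub(r"\s+", " ", text).strip()             # collapse whitespace
```

Applied to an issue title like "Visit https://example.com NOW!", this yields the cleaned token stream "visit now".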

4 Research Questions 4.1 RQ1: What is the Distribution of Different Categories (e.g., Bug, Documentation, Improvement, etc.) Among All the Issue Reports? Motivation: In this RQ, we identify the percentage of each bug report type. The overall distribution of different bug report types provides important insights into how frequently each type of bug report occurs in the system. Approach: To answer this RQ, we extract all the issue reports from the issue tracker with respect to each project. We then count the total number of reports filed in each category. We represent this graphically using a histogram. Results: Figure 1 shows the distribution of each issue type in the Lucene project. The histogram in this figure shows that the Lucene project has the highest number of issues belonging to the 'bug' category. It is interesting to note that the number of issues categorized as 'wish' is much smaller than that of 'new feature.' This result is important as it helps in allocating the human resources required to handle issues in each category.

Fig. 1 Distribution of various issue types in the Apache Lucene project

4.2 RQ2: Are There Any Distinguishing Terms that Differentiate Various Issue Categories? Motivation: We analyze the textual content of issue reports belonging to different categories. Our aim is to identify whether different issue categories have differentiating terms or not. If they do, then the textual content of issue reports can be used to build models for automated issue category prediction. Approach: We extract the title and summary of each issue report. First, we applied basic preprocessing techniques to clean the data. Second, we formed two types of corpora: the first consisting of all the issue reports, and the second consisting of category-wise issue reports; hence, the second corpus comprises seven mini-corpora, one per issue type. Next, we applied the tf–idf [13] function on each corpus. Results: Figure 2 shows the results of applying tf–idf on the first corpus. The top 5 terms are 'add,' 'remove,' 'use,' 'failure,' and 'make.' Table 1 shows the top 5 terms obtained by applying tf–idf on the second type of corpus. The table shows that there are distinguishing terms present under each issue category, and they are different from the top terms obtained from the first corpus. These results indicate that a text classification-based approach can be used for issue-type classification/prediction.
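A toy version of the tf–idf scoring used in this RQ can be written in pure Python (the authors used sklearn's implementation; this stand-in borrows sklearn's smoothed-idf formula, and summing scores across documents is one simple way to rank corpus-level top terms):

```python
import math
from collections import Counter

def tfidf_top_terms(docs, k=5):
    """Rank terms of a whitespace-tokenized corpus by summed tf-idf."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = Counter()
    for toks in tokenized:
        tf = Counter(toks)
        for term, cnt in tf.items():
            # Smoothed idf, as in sklearn's TfidfVectorizer.
            idf = math.log((1 + n) / (1 + df[term])) + 1
            scores[term] += (cnt / len(toks)) * idf
    return [t for t, _ in scores.most_common(k)]
```

On a tiny corpus such as `["add tests add", "remove module", "add support"]`, the frequent term "add" ranks first.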

Fig. 2 Top terms obtained by applying tf–idf on combined issue report corpus

Table 1 Top 5 terms in each issue category (imp: improvement, NF: new feature, ST: sub-task)

Bug       Imp       NF         ST             Task         Test      Wish
Failure   Add       Add        Remove         Remove       Tests     Add
Fails     Remove    Support    Query parser   Add          Add       Javadocs
Does      Use       New        Module         Lucene       Test      Incomplete
Test      Make      Lucene     Add            Upgrade      Fails     Javadocs
Index     Improve   Analyzer   Cleanup        Deprecated   Testing   Public

4.3 RQ3: Is There Any Significant Difference Between Mean Time to Repair (MTTR) of Different Issue Categories? Motivation: Mean time to repair (MTTR) indicates how much time it takes to fix a particular issue, i.e., the difference between the date when the issue is fixed and the date when the issue is filed. MTTR can throw light on which issue category requires the maximum time to fix. This can be beneficial in prioritizing the issues. Approach: To find the MTTR of issue reports, we obtained the date and time when the issue was reported and subtracted it from the date and time when the issue was fixed. We then calculated the number of hours it took to fix the issue. Results: Figure 3 shows the results obtained for this RQ. This figure shows that the 'wish' and 'new feature' issues take the highest amount of time to get resolved. There can be many reasons for this behavior; for example, the Lucene project may be giving priority to fixing existing bugs before implementing any new feature. This result provides another motivation for automatic issue classification.


Fig. 3 MTTR of various issue types
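The MTTR computation described in RQ3 reduces to averaging the resolution intervals; a minimal sketch with illustrative timestamps (not real Lucene data):

```python
from datetime import datetime

def mttr_hours(issues):
    """Mean time to repair in hours, given (created, resolved) pairs."""
    total = sum((res - cre).total_seconds() for cre, res in issues)
    return total / len(issues) / 3600.0

# Two hypothetical issues: resolved after 24 h and 12 h, respectively.
sample = [
    (datetime(2021, 1, 1, 9, 0), datetime(2021, 1, 2, 9, 0)),
    (datetime(2021, 1, 1, 9, 0), datetime(2021, 1, 1, 21, 0)),
]
```

For the sample pairs above, the MTTR is (24 + 12) / 2 = 18 hours; in the study this average is taken per issue category.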

4.4 RQ4: What is the Performance of Classic and Ensemble Classifiers for Issue-Type Classification? Motivation: In this RQ, we test the performance of several classic and ensemble classifiers and analyze their performance for issue report classification. We selected different types of classifiers, for example, probabilistic classifiers and tree-based classifiers. Approach: To answer this RQ, we use Bernoulli Naïve Bayes [13, 14], multinomial Naïve Bayes [13, 14], random forests [13, 14], and linear SVM [13, 14] as baseline classifiers. We selected these classifiers because they belong to different families; for example, Naïve Bayes is a probabilistic classifier, and random forest is a tree-based classifier. In addition, we tested several ensemble classifiers: majority voting (hard voting), average voting (soft voting), bagging of decision trees, AdaBoost, and stochastic gradient descent boosting techniques [13, 14]. Results: Figure 4a, b shows the results obtained for this RQ. Figure 4a shows the prediction accuracy of various classic classifiers. It shows that multinomial Naïve Bayes gives the best accuracy of 65%. The accuracy obtained by the random forest classifier and linear SVM is the same, whereas Bernoulli Naïve Bayes performed the worst. Figure 4b shows the accuracy scores obtained by the ensemble classifiers. This figure shows that the ensemble (hard voting) gives the highest accuracy of 67%. Ensemble (soft voting), bagging, and stochastic gradient boosting gave similar accuracy, while AdaBoost gives the worst. The results of this RQ show that the performance of the classic and ensemble classifiers is similar, with ensemble (hard voting) giving the best accuracy of 67%.


Fig. 4 a Accuracy values obtained using various classic classifiers and b accuracy values obtained using various ensemble classifiers
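Hard (majority) voting, which RQ4 finds best-performing, simply takes the most common label across the base classifiers for each sample; a minimal sketch with hypothetical per-classifier predictions:

```python
from collections import Counter

def hard_vote(predictions):
    """Majority voting over per-classifier label lists; ties are broken
    by whichever label the Counter encounters first."""
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(clf_preds[i] for clf_preds in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted

# Three hypothetical classifiers labeling four issues.
preds = [
    ["bug", "task", "wish", "bug"],
    ["bug", "task", "test", "bug"],
    ["improvement", "task", "test", "wish"],
]
```

In sklearn this corresponds to `VotingClassifier(voting='hard')`; the pure-Python version above just makes the majority rule explicit.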

4.5 RQ5: How Much Time Do Classic and Ensemble Machine Learning Algorithms Take in Training and Prediction? Motivation: In this RQ, we compare the prediction time of classic and ensemble machine learning algorithms. Prediction time is an important parameter as it can affect the overall issue fixing time. A classifier with a low prediction time can be of great importance to industry for issue-type classification. Approach: In this RQ, we compute the overall prediction time taken by both classic and ensemble classifiers on the test data set.

Fig. 5 a Prediction time of the classic classifiers and b prediction time of the ensemble classifiers

Characterization and Prediction of Various …


Results: Figure 5a, b shows the prediction time taken by the classic and ensemble classifiers, respectively. A simple comparison of these two figures shows that the bagging and AdaBoost techniques increase the prediction time to a great extent. We observe that the prediction times of the multinomial Naïve Bayes classifier and the ensemble (hard voting) technique are quite low compared to the other classifiers. The results of RQ4 show that both of these classifiers also give the best accuracy values. Hence, we conclude that on our experimental data set, multinomial Naïve Bayes and the ensemble (hard voting) performed the best.
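The measurement itself is straightforward; a minimal sketch (not the authors' code) of timing prediction for a fast single classifier versus a bagged tree ensemble, on synthetic data:

```python
# Compare prediction time of a lightweight classifier against bagged trees.
# Synthetic data stands in for the vectorized issue reports.
import time
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(500, 30))
y = rng.integers(0, 3, size=500)

def prediction_time(clf):
    """Fit the classifier, then time a single predict() pass."""
    clf.fit(X, y)
    start = time.perf_counter()
    clf.predict(X)
    return time.perf_counter() - start

t_nb = prediction_time(MultinomialNB())
t_bag = prediction_time(BaggingClassifier(DecisionTreeClassifier(),
                                          n_estimators=100))
print(f"NB: {t_nb:.6f}s  bagging: {t_bag:.6f}s")
```

Bagging must query every one of its base trees at prediction time, which is why the paper observes that bagging (and AdaBoost) inflate prediction time relative to a single Naïve Bayes model.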

5 Conclusion and Future Work

In this paper, we performed characterization and prediction of the various issue types present in the Apache Lucene system. The characterization study shows that the distribution of the various issue types is imbalanced, and that the MTTR of the ‘new feature’ and ‘wish’ issue types is the highest. We used both classic and ensemble machine learning algorithms for issue-type prediction. The results show that the multinomial Naïve Bayes and ensemble (hard voting) algorithms give the highest accuracy and take the least time in prediction. In the future, we plan to extend this work in several directions. First, we will use deep learning methods for issue-type classification. Second, we will conduct experiments on more projects to test the generalizability of the results obtained. Third, we will use advanced feature extraction approaches [15] to build the final feature set used to train the model.

References

1. G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, Y.G. Guéhéneuc, Is it a bug or an enhancement? A text-based approach to classify change requests, in Proceedings of the 2008 Conference of the Center for Advanced Studies on Collaborative Research (2008), pp. 304–318
2. N. Pingclasai, H. Hata, K.I. Matsumoto, Classifying bug reports to bugs and other requests using topic modeling, in 20th Asia-Pacific Software Engineering Conference (APSEC), vol. 2 (IEEE, 2013), pp. 13–18
3. P.S. Kochhar, F. Thung, D. Lo, Automatic fine-grained issue report reclassification, in 2014 19th International Conference on Engineering of Complex Computer Systems (IEEE, 2014), pp. 126–135
4. F. Thung, D. Lo, L. Jiang, Automatic defect categorization, in 19th Working Conference on Reverse Engineering (IEEE, 2012), pp. 205–214
5. S. Lal, N. Sardana, A. Sureka, ECLogger: cross-project catch-block logging prediction using ensemble of classifiers. e-Inform. Softw. Eng. J. 11(1) (2017)
6. S. Lal, N. Sardana, A. Sureka, Three-level learning for improving cross-project logging prediction for if-blocks. J. King Saud Univ.-Comput. Inf. Sci. 31(4), 481–496 (2019)
7. A.T. Mısırlı, A.B. Bener, B. Turhan, An industrial case study of classifier ensembles for locating software defects. Softw. Qual. J. 19(3), 515–536 (2011)


8. X. Xia, D. Lo, E. Shihab, X. Wang, X. Yang, ELBlocker: predicting blocking bugs with ensemble imbalance learning. Inf. Softw. Technol. 61, 93–106 (2015)
9. B. Xu, D. Lo, X. Xia, A. Sureka, S. Li, EFSPredictor: predicting configuration bugs with ensemble feature selection, in 2015 Asia-Pacific Software Engineering Conference (APSEC) (IEEE, 2015), pp. 206–213
10. O.A. Alzubi, J.A. Alzubi, M. Alweshah, I. Qiqieh, S. Al-Shami, M. Ramachandran, An optimal pruning algorithm of classifier ensembles: dynamic programming approach. Neural Comput. Appl. 1–17 (2020)
11. O.A. Alzubi, J.A. Alzubi, S. Tedmori, H. Rashaideh, O. Almomani, Consensus-based combining method for classifier ensembles. Int. Arab J. Inf. Technol. 15(1), 76–86 (2018)
12. J.A. Alzubi, Optimal classifier ensemble design based on cooperative game theory. Res. J. Appl. Sci. Eng. Technol. 11(12), 1336–1343 (2015)
13. J. Han, J. Pei, M. Kamber, Data Mining: Concepts and Techniques (Elsevier, 2011)
14. P.N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining (Pearson Education, India, 2016)
15. D. Gupta, J.J. Rodrigues, S. Sundaram, A. Khanna, V. Korotaev, V.H.C. de Albuquerque, Usability feature extraction using modified crow search algorithm: a novel approach. Neural Comput. Appl. 1–11 (2018)

Heart Disease Prediction Using Machine Learning Techniques: A Quantitative Review Lubna Riyaz, Muheet Ahmed Butt, Majid Zaman, and Omeera Ayob

Abstract Heart diseases, or cardiovascular diseases, are among the leading causes of death in the world today. Heart disease affects the functioning of the blood vessels and can cause coronary artery infections that in turn weaken the patient’s body. Therefore, there is a need for a reliable, accurate, and feasible system that can diagnose heart disease in time, so that a cardiac patient is given efficient treatment before the disease leads to a severe complication, finally resulting in a heart attack. Over the last many years, machine learning (ML) algorithms and techniques have been applied to various available heart disease datasets for automatic prediction, diagnosis, and treatment of heart disease. This paper presents a thorough survey of various machine learning techniques used for efficient prediction, diagnosis, and treatment of heart diseases and analyzes their performance. The machine learning techniques surveyed in this paper include support vector machine (SVM), decision tree (DT), Naïve Bayes (NB), k-nearest neighbor (KNN), artificial neural network (ANN), and others. The average prediction accuracy was then calculated for each technique to find the overall best and worst performing techniques. According to the results, the highest average prediction accuracy was achieved by ANN (86.91%), whereas the C4.5 decision tree technique came up with the lowest average prediction accuracy of 74.0%. Keywords Heart disease · Machine learning · Support vector machine · Decision tree · Random forest · Naïve Bayes · Artificial neural network · Heart disease prediction

L. Riyaz · M. A. Butt (B) Department of Computer Sciences, University of Kashmir, Srinagar, India M. Zaman Directorate of IT & SS, University of Kashmir, Srinagar, India O. Ayob Department of Food Technology, Jamia Hamdard, New Delhi, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_8


1 Introduction

The heart is an important organ in the human body; it pumps blood to every part of the body. If the heart stops functioning, then within a few seconds the other organs will also stop functioning and the person will die. The increase in various heart-related diseases is mainly due to changing lifestyles, work-related stress, and bad food habits. According to the World Health Organization (WHO), nearly 18 million people die every year because of cardiovascular diseases [1]. The USA spends almost $1 billion per day on treating patients with heart diseases [2]. There are various symptoms of heart disease, such as chest pain, high blood pressure, cardiac arrest, hypertension, shortness of breath, pain in the neck, and discomfort. There are different causes of heart disease, such as defects by birth, high blood pressure, smoking, alcohol, and diabetes. The different types of heart diseases include heart failure, congenital heart disease, aneurysm, coronary artery disease, myocardial infarction (heart attack), stroke, cardiac arrest, and heart infection; heart infection can be caused by viruses, bacteria, and parasites. The most common types of heart diseases are listed in Table 1. There are also some risk factors on the basis of which heart disease can be predicted. These include age, gender, blood pressure, cholesterol level, diabetes, smoking, angina (chest pain), family history, obesity, and stress (Table 2). Due to the rapid growth of digital technologies, medical organizations all around the world store huge amounts of health-related data in databases. This data is very complex and challenging to analyze; therefore, the need is to exploit this data using various techniques for automatic diagnosis of disease. In recent years, various studies have been carried out regarding heart disease prediction using automatic techniques like data mining, machine learning, and deep learning [3].
Recently, these techniques have been used for various other purposes as well, such as rainfall prediction [4, 5], educational data mining [6–9], geographical data mining [10], intrusion detection systems [11], and information translation [12].

Table 1 Types of heart diseases
Coronary artery disease: Caused by the buildup of plaque in the heart’s arteries
Cardiomyopathy: Affects how the heart muscle squeezes
Arrhythmias: Heart rhythm abnormality
Congenital heart disease: Heart irregularities present at birth
Myocardial infarction: Sudden reduced blood flow to the heart
Rheumatic heart disease: Caused by rheumatic fever
Heart infections: Caused by bacteria, viruses, or parasites


Table 2 Risk factors of heart diseases
Age: Aged people are more prone to heart diseases
Gender: Males have a greater risk of heart diseases than females
Blood pressure: Variation in blood pressure levels can lead to narrowing and hardening of arteries
Cholesterol levels: A higher cholesterol level increases the chances of formation of plaques
Diabetes: Caused by abnormal sugar levels in the blood
Smoking: The chance of heart disease in smokers is higher as compared to those who do not smoke
Family history: If a person has a family history of heart disease, then the chances of getting cardiovascular diseases are high
Obesity: Overweight people are more prone to heart diseases
Stress: Excess stress can damage arteries, thus leading to different kinds of heart diseases

The rest of this paper is organized as follows. Section 2 presents various machine learning algorithms used in heart disease prediction. Section 3 presents the literature review carried out to efficiently predict and diagnose heart disease using machine learning algorithms and techniques. Section 4 is the discussion section; it shows the comparative representation of various machine learning methodologies based on their accuracy in a tabular form. Section 5 describes the research gaps identified, and Sect. 6 is the conclusion.

2 Machine Learning Algorithms Used in Heart Disease Prediction, Diagnosis, and Treatment

2.1 Decision Tree

Decision tree (DT) is a supervised learning algorithm, mostly used for classification problems. It can deal with both categorical and numerical data. It is a tree-like structure consisting of internal nodes, leaf nodes, and branches; the most commonly used types of decision trees include CART, C4.5, and ID3. The algorithm first calculates the entropy of each feature/attribute and then divides the dataset into two or more similar sets using the variables or predictors [13, 14]:

E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i

IG(Y, X) = E(Y) - E(Y|X)
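The two formulas above can be computed directly from class counts; a small self-contained sketch (the labels and feature values are illustrative, not from the paper):

```python
# Entropy and information gain, as used when choosing decision-tree splits.
import math
from collections import Counter

def entropy(labels):
    """E(S) = sum_i -p_i * log2(p_i) over the class proportions in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """IG(Y, X) = E(Y) - E(Y | X): the entropy drop from splitting on X."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature_values, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

y = ["sick", "sick", "healthy", "healthy"]
x = ["high", "high", "low", "low"]            # perfectly predictive feature
print(entropy(y), information_gain(y, x))     # 1.0 1.0
```

A perfectly predictive feature removes all uncertainty, so its information gain equals the full entropy of the labels; the tree-building algorithm picks the feature with the highest gain at each node.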


2.2 Naïve Bayes

Naïve Bayes (NB) is an efficient classification technique based on Bayes’ theorem. Naïve Bayes assumes independence among the predictors: the occurrence of a particular attribute in a class is independent of the other attributes/features. Even if a dependency exists, the features still contribute independently to the probability. It is used to compute the posterior probability for each class from the conditional probabilities [13, 14]:

P(c|x) = P(x|c) P(c) / P(x)

P(c|X) = P(x_1|c) × P(x_2|c) × ... × P(x_n|c) × P(c)
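A tiny worked example of the second formula, using made-up prior and class-conditional probabilities (illustrative numbers, not from the paper), followed by normalization over the classes:

```python
# Naïve Bayes posterior by hand: P(c | x1..xn) ∝ P(c) * Π P(xi | c).
# The priors and likelihoods below are invented for illustration only.
import math

priors = {"disease": 0.3, "healthy": 0.7}
likelihoods = {
    "disease": {"chest_pain": 0.8, "high_bp": 0.7},
    "healthy": {"chest_pain": 0.2, "high_bp": 0.3},
}

def posterior(features):
    # Unnormalized score per class, then normalize so the posteriors sum to 1.
    scores = {c: priors[c] * math.prod(likelihoods[c][f] for f in features)
              for c in priors}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

p = posterior(["chest_pain", "high_bp"])
print(p)  # ≈ 0.8 disease, 0.2 healthy
```

Even though the "disease" prior is smaller, the two symptoms together shift the posterior strongly toward it, which is exactly the independence-based multiplication the formula describes.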

2.3 Support Vector Machine (SVM)

SVM is a supervised machine learning technique used for classification problems. It deals with both linear and nonlinear datasets. In order to do the classification, the support vector machine technique finds a hyperplane and maximizes the margin that differentiates between the classes. A support vector machine represents training instances as points in feature space such that the points belonging to different classes are separated by a margin that is as wide as possible. SVM can play a great role as far as prediction of heart disease is concerned [13–15] (Fig. 1).

Fig. 1 Support vector machine


2.4 K-Nearest Neighbor

K-nearest neighbor (KNN) is a very simple and easy-to-implement machine learning algorithm. It is used for solving both regression and classification problems where there is little or no knowledge about the data distribution. It finds the ‘k’ nearest data points in the dataset to the data point for which the target value is to be found, and then assigns the average (for regression) or majority class (for classification) of those ‘k’ data points to this particular data point [13].

2.5 Random Forest (RF)

Random forests are supervised ensemble machine learning methods used for classification, regression, and other machine learning problems. During training, they construct multiple decision trees before giving an output. In the case of classification, a voting system is used to decide the class, whereas in regression the mean or average of the outputs of the individual trees is taken. Random forests are based on the belief that the more decision trees there are, the greater the chance of an accurate output [13].
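The classifiers described above can be compared in the way this review later tabulates; a minimal sketch using scikit-learn's cross-validation on a synthetic stand-in dataset (13 features, echoing the commonly used Cleveland attribute count; this is not a real heart disease dataset):

```python
# Sketch: cross-validated accuracy of the surveyed classifier families
# on one dataset, mirroring the kind of comparison tabulated in Table 3.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary-classification data standing in for patient records.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=0),
}
# Tenfold cross-validation, as several of the surveyed studies use.
scores = {name: cross_val_score(m, X, y, cv=10).mean()
          for name, m in models.items()}
print(scores)
```

Averaging the fold accuracies per model is the same aggregation step the review applies across studies to rank the best and worst performing techniques.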

3 Literature Review

In recent years, various studies have been carried out to efficiently predict and diagnose heart disease using machine learning algorithms and techniques. Jabbar et al. [16] proposed a novel approach for association rule mining, based on sequence numbers and clustering of the transactional database, for predicting heart attack. Experimental results showed the effectiveness of the proposed algorithm. Chitra et al. [17] used a novel classification method for predicting heart attack at an initial stage using stored records. The data is initially preprocessed using machine learning techniques, and the features are classified using fuzzy c-means. According to the experimental results, the support vector machine (SVM) algorithm along with the reliefF feature selection algorithm gave the best results; this pair gave the highest accuracy value of 84.81%. Chandra Babu et al. [18] proposed an optimization function on SVM for a genetic algorithm (GA) that selects the features most significant for predicting heart disease. The algorithm selects a total of seven attributes, which in turn are given to the support vector machine. As per the results, the SVM achieves an accuracy of 83.70% with the full feature set, whereas with the selected attributes it provides an accuracy of 88.34%. Amin et al. [19] identified the significant features and ML techniques for improving the accuracy of heart disease prediction. Features were used in different combinations, and seven classification methods were used: KNN, DT, NB,


logistic regression (LR), SVM, neural network (NN), and vote (Naïve Bayes + LR). According to the results, the best performing technique was vote, with an accuracy of 87.4%. Prakash et al. [20] proposed a new technique to extract features, after which a DT is created using the selected features; unnecessary features are then removed, and finally an optimality criterion function is applied. Experimental results showed that the optimality criterion feature selection (OCFS) method took less time to execute than other methods. Bashir et al. [21] applied various machine learning techniques, i.e., DT, LR, LR (SVM), NB, and RF, individually in RapidMiner on a heart disease dataset and compared the results with those of past researchers. According to the results, the decision tree achieved an accuracy of 82.22%, LR 82.56%, RF 84.17%, NB 84.24%, and LR (SVM) 84.85%; thus, LR (SVM) and Naïve Bayes gave the highest accuracies. According to Alex et al. [22], among the SVM, random forest, KNN, and ANN classification algorithms, ANN gave the best result in diagnosing heart disease, with an accuracy of 92.21%. Tripoliti et al. [23] proposed the HEARTEN knowledge management system (KMS); according to the results, the KMS modules achieved accuracies ranging between 78 and 95%. Mohan et al. [24] proposed a novel method that aims at finding the significant attributes using ML techniques. The prediction model was experimented with different attribute sets; according to the experimental results, an enhanced performance level of 88.7% was achieved by the hybrid random forest with a linear model (HRFLM). Chandra et al. [25] gave a hybrid technique whose output provides optimized features, which in turn are fed into a decision tree for finding the heart disease type; a genetic algorithm evaluates the fitness. According to the results, the genetic algorithm provides optimized features, and the classification accuracy is 85.37%. Tarawneh et al.
[26] analyzed techniques for the prediction of heart disease and proposed a hybrid approach that combines all the techniques to give an accurate diagnosis. As per the experimental results, they achieved an accuracy of 89.2%, the features were reduced from fourteen to twelve, and there was no loss of accuracy. Ahmed et al. [1] presented a real-time system for heart disease diagnosis. Two types of feature selection algorithms were used, i.e., univariate and relief; four ML algorithms were compared: decision tree (DT), support vector machine (SVM), random forest (RF), and logistic regression (LR); and cross-validation (CV) was applied. According to the experimental results, random forest came out as the best classifier with an accuracy of 94.9%. Ali et al. [27] focused on refinement of the features for heart disease prediction. To eliminate irrelevant features, a χ2 statistical model was used, while the deep neural network (DNN) configuration was found using an exhaustive search method. As per the results, the proposed model achieved an accuracy of 93.33%. Latha et al. [28] investigated ensemble classification on a heart disease dataset. The experimental results showed that the bagging and boosting ensemble methods improve the prediction accuracy of weak classifiers by around 7%.


Kannan et al. [29] examined and compared the accuracy of four ML algorithms, with receiver operating characteristic (ROC) curves, for heart disease prediction; tenfold cross-validation was used during training. According to the results, logistic regression outperformed the others by achieving an accuracy of 87%. Alotaibi [30] proposed and implemented a machine learning model combining five algorithms in order to improve heart failure prediction accuracy; the RapidMiner tool was used in this research. As per the results, this study achieved a significant improvement in accuracy over previous studies in predicting heart disease. Raza [31] proposed a model using ensemble learning and a majority voting rule for improving heart disease prediction accuracy. Three classification techniques were applied, and according to the results, the proposed ensemble method provided a prediction accuracy of 88.88%. Hasan et al. [32] proposed a new classification method, sparse discriminant analysis (SDA), for the prediction of heart disease; the algorithm reduces the time complexity of linear discriminant analysis (LDA). As per the results, sparse discriminant analysis outperforms other machine learning algorithms with an accuracy of 96%. Singh et al. [33] used normalization, a preprocessing technique, to reduce the complexity of large heart disease datasets by enhancing the Euclidean distance computation; this reduces the number of iterations and hence the computational time compared to K-means, thereby improving the performance of the clusters. According to the experimental results, clustering with normalization achieved a better accuracy of 84.84% as compared to clustering without normalization (70.58%). Desai et al. [34] presented a comparative analysis of parametric and nonparametric approaches for heart disease classification.
Back-propagation neural network (BPNN) and logistic regression were used in this study, along with tenfold cross-validation. The empirical results showed that with the parametric model (LR), any case of heart disease can be tested with eleven optimal parameters, achieving an accuracy of 92.58%. Thaiparnit et al. [35] analyzed heart disease data using a vertical Hoeffding decision tree (VHDT), comparing various algorithms such as decision stump (DS), random tree (RT), REP tree, LMT, random forest (RF), J48, and the Hoeffding tree. The results showed that the best algorithm for this data analysis is the Hoeffding tree (HT), having an average accuracy of 85.43%, which is higher than the other methods, and a processing error of 14.07%, which is lower than the other techniques. Enriko [36] used the top ten data mining classification algorithms for diagnosing heart disease from Harapan Kita Hospital, Jakarta; their performance was examined in terms of accuracy and speed. After the preprocessing step, the dataset had 15 input parameters and one multiclass output parameter. According to the results, the algorithms that came up with the highest prediction accuracy were random forest, KNN, and multilayer perceptron (MLP), while those with the fastest processing speed were AdaBoost, KNN, and Naïve Bayes. Since KNN


proved best in both measures, the study suggests KNN as the best classification algorithm for diagnosing heart disease. Akgül et al. [37] first used an artificial neural network (with default parameters) for diagnosing heart disease, and then combined the artificial neural network with a genetic algorithm to form a hybrid model to increase the classification accuracy; the Cleveland dataset of the UCI machine learning repository was used. According to the results, the proposed model outperformed the others with an accuracy of 95.82%. Mehanović et al. [38] applied artificial neural network (ANN), k-nearest neighbor (KNN), and support vector machine (SVM) algorithms to build a model for heart disease prediction. Multiple experiments were performed using each of these algorithms, and majority voting was used as ensemble learning. According to the results, the majority voting technique achieved the highest accuracies in both cases: 61.16% for multiclass and 87.37% for binary classification. The results in binary classification were higher than in multiclass classification because the larger number of classes in the multiclass setting makes the problem harder to learn.

4 Discussion

4.1 Comparative Representation of Various Machine Learning Methodologies Based on Accuracy

The comparative representation of the various machine learning techniques that have been implemented on heart disease datasets is shown in Table 3.

Table 3 Comparison of algorithms used in various related works (accuracies in %)
Chandra Babu et al. [18]: MLP 81.32, KNN 84.3, J48 83.64, SVM 88.34
Amin et al. [19]: KNN 83.21, SVM 85.19, NB 84.81
Bashir et al. [21]: NB 84.24, LR 82.56, DT 82.22, RF 84.17
Alex et al. [22]: KNN 79.25, SVM 85.88, RF 85.88, ANN 92.21
Chandra et al. [25]: DT 66.67
Tarawneh et al. [26]: KNN 76, J48 84, SVM 86, NB 86, DT 77.5, ANN 85
Ahmed et al. [1]: SVM 91.95, LR 92.12, DT 89.88, RF 94.9
Ali et al. [27]: SVM 90.00, RF 81.11
Latha et al. [28]: KNN bagging 81.52/boosting 79.54, NB 84.16, RF bagging 80.53/boosting 78.88, C4.5 bagging 79.87/boosting 75.9
Kannan et al. [29]: SVM 79.77, LR 86.51, RF 80.89
Alotaibi et al. [30]: SVM 92.30, NB 87.27, LR 87.36, DT 93.19, RF 89.14
Hasan et al. [32]: RF 80
Desai et al. [34]: LR 92.58
Enriko [36]: MLP 63.8, KNN 71.6, SVM 45.1, NB 50.4, LR 62.4, RF 78.0, C4.5 62.9
Akgül et al. [37]: KNN 79.09, NB 82.58, ANN 85.02, C4.5 77.35
Mehanović et al. [38]: KNN 81.55, SVM 86.40, ANN 85.43

In Table 3, we presented various machine learning techniques that have been used recently for heart disease prediction. Each of these techniques has performed differently in different cases; e.g., the decision tree (DT) has performed very well in some cases and poorly in others. Multilayer perceptron (MLP), in the case of Chandra Babu et al. [18], achieved an accuracy of 81.32%, while in the case of Enriko [36] its accuracy was only 63.8%; this might be because the former used a feature selection algorithm (a genetic algorithm with SVM) for selecting the features, while the latter did not use any such feature selection algorithm. KNN gave accuracies of 84.3% and 76% in the cases of Chandra Babu et al. [18] and Tarawneh et al. [26], respectively. The reason may be that in [18] only seven significant features were used, while Tarawneh et al. [26] considered 12 features for heart disease prediction. In the case of Alotaibi et al. [30], support vector machine (SVM), Naïve Bayes, and decision tree achieved accuracies of 92.30%, 87.27%, and 93.19%, respectively, the highest among all. The reason might be the use of tenfold cross-validation, since more iterations during the learning phase help to generate more accurate results; another reason might be the size of the dataset, which was amplified in that study [30]. For logistic regression, both Ahmed et al. [1] and Desai et al. [34] achieved the highest accuracies of 92.12% and 92.58%, respectively. It might be because in both cases the tenfold cross-validation method was applied, and because of this Ahmed et al. [1] achieved the highest accuracy for random forest as well. For artificial neural networks, the highest accuracy was achieved by Alex et al. [22] (92.21%), and the others achieved lower accuracies. This might be because of variation in the datasets, since Alex et al. [22] collected their dataset from the Jubilee Mission College and Research Institute, Thrissur, while the others collected theirs from other sources. Latha et al. [28] achieved the highest accuracy of 79.87% for the C4.5 technique. The reason could be that they applied the ensemble classification tool on the heart disease dataset, and with the bagging ensemble technique they were able to increase the accuracy of weak classifiers.

We have also averaged the prediction accuracies for each of these techniques in order to find out the overall best and worst performing techniques (Table 4). According to the results, the techniques that came up with the highest average accuracies were ANN (86.91%), J48 (83.82%), and random forest (83.35%), while those that came up with the lowest average accuracies were C4.5 (74.0%), KNN (77.03%), and MLP (77.08%).

Table 4 Average accuracies (%)
MLP 77.08, KNN 77.03, J48 83.82, SVM 81.32, NB 80.6, LR 83.08, DT 78.82, RF 83.35, ANN 86.91, C4.5 74.0

5 Research Gaps/Problems Identified

There are still various areas where further work needs to be done in order to enhance the current work. For example, the existing work can be extended to large-scale, real-world, and real-time heart disease datasets. Further research can be done with diverse mixtures of machine learning techniques for better prediction, by testing different combinations of ML techniques on the same dataset; these combinations can also be tested in the form of ensembles to further increase the accuracy. In addition, new feature selection methods can be developed to get a broader perspective on the significant features and to increase the performance of heart disease prediction. In the future, TensorFlow can be used for heart disease prediction considering more datasets, since TensorFlow can speed up the prediction process.

6 Conclusion

During the last few decades, the overall percentage of heart disease patients has increased to a very large extent. The reason is believed to be that people living around


the world have changed their lifestyle with the advent of science and technology. Therefore, there is a need for a reliable, accurate, and feasible system that can diagnose heart disease in time, so that a cardiac patient is given efficient treatment before the disease leads to a severe complication, finally resulting in a heart attack. This review analyzes various machine learning techniques, wherein ten techniques have been highlighted: multilayer perceptron, k-nearest neighbor, J48, support vector machine (SVM), Naïve Bayes, logistic regression, decision tree, random forest, artificial neural network, and C4.5. The corresponding papers have been reviewed and, based on their datasets, their experimental outcomes have been analyzed and represented in tabular form (Table 3). Then, the average prediction accuracy was calculated for each technique to find out the overall best and worst performing techniques. According to the results, the techniques that came up with the highest average prediction accuracies were artificial neural network (86.91%), J48 (83.82%), and random forest (83.35%), while the techniques that came up with the lowest average prediction accuracies were C4.5 decision tree (74.0%), k-nearest neighbor (77.03%), and multilayer perceptron (77.08%).

References

1. H. Ahmed, E.M.G. Younis, A. Hendawi, A.A. Ali, Heart disease identification from patients’ social posts, machine learning solution on Spark. Futur. Gener. Comput. Syst. 111, 714–722 (2020). https://doi.org/10.1016/j.future.2019.09.056
2. Y. Hao, M. Usama, J. Yang, M.S. Hossain, A. Ghoneim, Recurrent convolutional neural network based multimodal disease risk prediction. Futur. Gener. Comput. Syst. 92, 76–83 (2019). https://doi.org/10.1016/j.future.2018.09.031
3. M. Ashraf et al., Prediction of cardiovascular disease through cutting-edge deep learning technologies: an empirical study based on TENSORFLOW, PYTORCH and KERAS. Adv. Intell. Syst. Comput. 1165, 239–255 (2021). https://doi.org/10.1007/978-981-15-5113-0_18
4. R. Mohd, M.A. Butt, M.Z. Baba, GWLM–NARX: Grey Wolf Levenberg–Marquardt-based neural network for rainfall prediction. Data Technol. Appl. 54(1), 85–102 (2020). https://doi.org/10.1108/DTA-08-2019-0130
5. R. Mohd, M.A. Butt, M. Zaman Baba, SALM-NARX: self adaptive LM-based NARX model for the prediction of rainfall, in Proceedings of International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud), I-SMAC 2018 (2019), pp. 580–585. https://doi.org/10.1109/I-SMAC.2018.8653747
6. M. Ashraf, M. Zaman, M. Ahmed, An intelligent prediction system for educational data mining based on ensemble and filtering approaches. Proc. Comput. Sci. 167, 1471–1483 (2020). https://doi.org/10.1016/j.procs.2020.03.358
7. M. Ashraf, M. Zaman, M. Ahmed, To Ameliorate Classification Accuracy Using Ensemble Vote Approach and Base Classifiers, vol. 813 (Springer Singapore, 2019)
8. M. Ashraf, M. Zaman, M. Ahmed, Performance analysis and different subject combinations: an empirical and analytical discourse of educational data mining, in Proceedings of the 8th International Conference Confluence 2018 on Cloud Computing, Data Science and Engineering (2018), pp. 287–292. https://doi.org/10.1109/CONFLUENCE.2018.8442633
9. M. Ashraf, M. Zaman, M. Ahmed, Using ensemble StackingC method and base classifiers to ameliorate prediction accuracy of pedagogical data. Proc. Comput. Sci. 132, 1021–1040 (2018). https://doi.org/10.1016/j.procs.2018.05.018

Heart Disease Prediction Using Machine …

93

10. M. Zaman, S. Kaul, M. Ahmed, Analytical comparison between the information gain and Gini index using historical geographical data. Int. J. Adv. Comput. Sci. Appl. 11(5), 429–440 (2020). https://doi.org/10.14569/IJACSA.2020.0110557 11. N.M. Mir, S. Khan, M.A. Butt, M. Zaman, An experimental evaluation of Bayesian classifiers applied to intrusion detection. Indian J. Sci. Technol. 9(12) (2016). https://doi.org/10.17485/ ijst/2016/v9i12/86291 12. M. Zaman, S.M.K. Quadri, M.A. Butt, Information translation: a practitioners approach. Lect. Notes Eng. Comput. Sci. 1, 45–47 (2012) 13. V.V. Ramalingam, A. Dandapath, M. Karthik Raja, Heart disease prediction using machine learning techniques: a survey. Int. J. Eng. Technol. 7(2.8), 684–687 (2018). https://doi.org/10. 14419/ijet.v7i2.8.10557 14. M. Chala Beyene, Survey on prediction and analysis the occurrence of heart disease using data mining techniques 118(8), 165–174 (2018) [Online]. Available http://www.ijpam.eu 15. https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learningalgorithms-934a444fca47 16. M. Jabbar, P. Chandra, B. Deekshatulu, Cluster based association rule mining for heart attack prediction. J. Theor. Appl. Inf. Technol. 32(2), 196–201 (2011) 17. R. Chitra, Heart attack prediction system using fuzzy C means classifier. IOSR J. Comput. Eng. 14(2), 23–31 (2013). https://doi.org/10.9790/0661-1422331 18. C.B. Gokulnath, S.P. Shantharajah, An optimized feature selection based on genetic approach and support vector machine for heart disease. Cluster Comput. 22(s6), 14777–14787 (2019). https://doi.org/10.1007/s10586-018-2416-4 19. M.S. Amin, Y.K. Chiam, K.D. Varathan, Identification of significant features and data mining techniques in predicting heart disease. Telemat. Inform. 36, 82–93 (2019). https://doi.org/10. 1016/j.tele.2018.11.007 20. S. Prakash, K. Sangeetha, N. Ramkumar, An optimal criterion feature selection method for prediction and effective analysis of heart disease. 
Cluster Comput. 22(s5), 11957–11963 (2019). https://doi.org/10.1007/s10586-017-1530-z 21. S. Bashir, Z.S. Khan, F. Hassan Khan, A. Anjum, K. Bashir, Improving heart disease prediction using feature selection approaches, in Proceedings of 2019 16th International Bhurban Conference on Application Science Technology IBCAST 2019, pp. 619–623, 2019. https://doi. org/10.1109/IBCAST.2019.8667106 22. P. Mamatha Alex, S.P. Shaji, Prediction and diagnosis of heart disease patients using data mining technique, in Proceedings of 2019 IEEE International Conference on Communication Signal Processing ICCSP 2019, pp. 848–852, 2019. https://doi.org/10.1109/ICCSP.2019.869 7977 23. E.E. Tripoliti et al., HEARTEN KMS—a knowledge management system targeting the management of patients with heart failure. J. Biomed. Inform. 94, 103203 (2019). https://doi.org/10. 1016/j.jbi.2019.103203 24. S. Mohan, C. Thirumalai, G. Srivastava, Effective heart disease prediction using hybrid machine learning techniques. IEEE Access 7, 81542–81554 (2019). https://doi.org/10.1109/ACCESS. 2019.2923707 25. K. Chandra Shekar, P. Chandra, K. Venugopala Rao, An Ensemble Classifier Characterized by Genetic Algorithm with Decision Tree for the Prophecy of Heart Disease, vol. 74 (Springer Singapore, 2019) 26. M. Tarawneh, O. Embarak, Hybrid Approach for Heart Disease Prediction Using Data Mining Techniques, vol. 29 (Springer International Publishing, 2019) 27. L. Ali, A. Rahman, A. Khan, M. Zhou, A. Javeed, J.A. Khan, An automated diagnostic system for heart disease prediction based on χ2 statistical model and optimally configured deep neural network. IEEE Access 7, 34938–34945 (2019). https://doi.org/10.1109/ACCESS.2019.290 4800 28. C.B.C. Latha, S.C. Jeeva, Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques. Inform. Med. Unlocked 16, 100203 (2019). https://doi.org/ 10.1016/j.imu.2019.100203

94

L. Riyaz et al.

29. N.B. Muppalaneni, M. Ma, S. Gurumoorthy, Soft Computing and Medical Bioinformatics (Springer Singapore, 2019) 30. F.S. Alotaibi, Implementation of machine learning model to predict heart failure disease. Int. J. Adv. Comput. Sci. Appl. 10(6), 261–268 (2019). https://doi.org/10.14569/ijacsa.2019.010 0637 31. K. Raza, Improving the Prediction Accuracy of Heart Disease With Ensemble Learning and Majority Voting Rule (Elsevier Inc., 2019) 32. K.M.Z. Hasan, S. Datta, M.Z. Hasan, N. Zahan, Automated prediction of heart disease patients using sparse discriminant analysis, in 2nd International Conference on Electrical Computing Communication Engineering ECCE 2019, pp. 7–9, 2019.https://doi.org/10.1109/ ECACE.2019.8679279 33. R. Singh, E. Rajesh, Prediction of heart disease by clustering and classification techniques. Int. J. Comput. Sci. Eng. 7(5), 861–866 (2019). https://doi.org/10.26438/ijcse/v7i5.861866 34. S.D. Desai, S. Giraddi, P. Narayankar, N.R. Pudakalakatti, S. Sulegaon, Back-Propagation Neural Network Versus Logistic Regression in Heart Disease Classification, vol. 702 (Springer Singapore, 2019) 35. S. Thaiparnit, S. Kritsanasung, N. Chumuang, A classification for patients with heart disease based on hoeffding tree, in JCSSE 2019—16th International Joint Conference on Computing Science Software Engineering Knowledge Evolution Toward Singular Man-Machine Intelligent, pp. 352–357, 2019. https://doi.org/10.1109/JCSSE.2019.8864158 36. I.K.A. Enriko,“Comparative study of heart disease diagnosis using top ten data mining classification algorithms. ACM Int. Conf. Proceeding Ser. 159–164 (2019). https://doi.org/10.1145/ 3338188.3338220 37. M. Akgül, Ö.E. Sönmez, T. Özcan, Diagnosis of heart disease using an intelligent method: a hybrid ANN—GA approach. Adv. Intell. Syst. Comput. 1029, 1250–1257 (2020). https://doi. org/10.1007/978-3-030-23756-1_147 38. D. Mehanovi´c, Z. Mašeti´c, D. Keˇco, Prediction of heart diseases using majority voting ensemble method. 
IFMBE Proc. 73, 491–498 (2020). https://doi.org/10.1007/978-3-030-17971-7_73

Enhancing CNN with Pre-processing Stage in Illumination-Invariant Automatic Expression Recognition

Hiral A. Patel, Nidhi Khatri, Keyur Suthar, and Hiral R. Patel

Abstract Emotions play a crucial role in understanding human behavior, and they can be recognized through facial expression analysis. Many researchers have used machine learning to recognize the basic emotions: sad, happy, angry, fear, neutral, disgust and surprise. Nowadays, deep learning is attracting researchers for solving real-world problems. In this research, a convolutional neural network (CNN) was implemented with four different pre-processing strategies to improve CNN performance and to remove illumination effects. The four strategies compared are (a) no pre-processing (the original dataset only), (b) histogram equalization, (c) discrete cosine transform (DCT) normalization and (d) rescaled DCT coefficients. All of these techniques were implemented using Python and TensorFlow. Two datasets were used: the 213 facial expressions of 10 Japanese females in the Japanese Female Facial Expression (JAFFE) database, and 354 facial expressions of 10 Indian people, both male and female, collected by the researchers in a real-time environment. Eighty percent of each dataset was used for training the models and 20% for testing. The four models were compared on their performance in recognizing the different emotions as well as across different image resolutions. The CNN model built with rescaling of DCT coefficients achieved the best classification accuracy: 85.19% on the JAFFE dataset and 97.6% on the Indian dataset. The models were also checked with image resolutions of 20 × 20, 32 × 32, 64 × 64 and 128 × 128; images of 128 × 128 resolution gave the best recognition rate.

Keywords Convolutional neural network (CNN) · Data pre-processing · Discrete cosine transform (DCT)

H. A. Patel (B) · H. R. Patel: Faculty of Computer Application, Ganpat University, Mehsana, India
N. Khatri · K. Suthar: Department of Computer Engineering, SVNIT-VASAD, Vasad, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022. A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_9

1 Introduction

Humans interact with one another through communication and body gestures, and they use emotions to emphasize certain parts of their communication [1]. Humans demonstrate emotion through facial expressions, which depict a person's emotional state very effectively [2]; whether a person is in a good mood or not can be read from his or her expressions. Automatic facial expression recognition has wide potential applications, including behavioral analysis, videoconferencing, lip reading, robotics, image retrieval, face animation and human emotion analysis. Because humans communicate effectively through expression, automatic recognition can play an important role in human–computer interfaces (HCI) [3]. Facial expression recognition is also used in business, where gauging the satisfaction or dissatisfaction of customers is essential. Many researchers have described the application of machine learning classifiers to a variety of complex problems [4–8], and many have implemented machine learning models for automatic facial expression recognition [8]. Shan et al. [9] applied a support vector machine (SVM) classifier to recognize emotions from images and showed that the results are quite competent and effective for facial expression recognition. Liliana et al. [10] proposed a machine learning model that recognizes the emotions sad, anger, happy, neutral, fear and disgust; trained on the CK+ dataset, their model achieved 80.71% accuracy in classifying emotions from images.

An automatic facial expression recognition process consists of three major steps [11]: pre-processing the face image, expression feature extraction and classification of expression categories. The block diagram of the methodology is shown in Fig. 1. The pre-processing step yields images with normalized intensity, shape and uniform size. The expression feature extraction step extracts the features of the facial expression from the pre-processed (normalized) image. Two types

Fig. 1 Methodology for facial expression recognition


of approaches [12] are used: geometrical feature-based methods and appearance-based methods. Geometrical methods locate facial features and measure the shapes of facial components such as the mouth, eyebrows, eyes and nose. Appearance-based feature extraction uses the complete face image, or a portion of it, through optical flow or various types of filters. Feature extraction is thus accomplished by investigating changes in the location and appearance of facial features. The classification process takes the extracted features and assigns them to expression-interpretative categories. There are seven universally defined facial expressions: surprise, angry, happy, neutral, fear, sad and disgust. The work demonstrated in this paper focuses on identifying these seven universal expressions from images captured under dissimilar illumination conditions and classifying them using convolutional neural networks. The paper is arranged as follows: Section 2 lays out the pre-processing techniques used in facial expression recognition to eliminate variations in illumination while keeping the main facial features unaffected. Section 3 discusses convolutional neural networks. The implementation of the recommended facial expression recognition system and its results are discussed in Sect. 4, and Sect. 5 concludes the paper.

2 Image Pre-processing Techniques

The first stage in expression recognition is to pre-process the input image to remove any unnecessary information from it. This step is used to obtain images with normalized intensity, shape and uniform size, and may also involve removing noise from the image, if any. The pre-processing algorithm is chosen depending on the application. Under varying illumination, facial feature points may be located incorrectly, which is why a pre-processing step is required. Histogram equalization, DCT normalization and rescaling of DCT coefficients are explored as pre-processing methods in this research.

2.1 Histogram Equalization

Histogram equalization produces an output image with a uniform histogram, i.e., with equally distributed brightness levels. It maps each gray level r of the input image to a level s in the output image through a transformation

s = T(r)                                                             (1)
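Equation (1) can be made concrete with a short NumPy sketch. This is the standard cumulative-histogram construction of T (a minimal version of what a library routine such as OpenCV's equalizeHist performs); the 4 × 4 test image is invented for illustration:

```python
import numpy as np

def equalize_histogram(img):
    """Map each gray level r to s = T(r) via the cumulative histogram (Eq. 1)."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    # Classic equalization transform, scaled back to the 0..255 range
    T = np.round((cdf - cdf_min) / (img.size - cdf_min) * 255).astype(np.uint8)
    return T[img]

# A dark, low-contrast face patch: gray levels crowded into 50..65
img = np.array([[50, 52, 55, 60],
                [50, 53, 56, 61],
                [51, 54, 58, 63],
                [51, 55, 59, 65]], dtype=np.uint8)
eq = equalize_histogram(img)
print(eq.min(), eq.max())  # → 0 255 (brightness now spans the full range)
```

The input range 50–65 is stretched over the full 0–255 scale, which is exactly the "equally distributed brightness levels" property the text describes.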


2.2 Discrete Cosine Transform Normalization

DCT normalization [13] consists of three steps: converting the image into the logarithm domain, computing the DCT coefficients and compensating for illumination.

Step 1: Converting the Image into the Logarithm Domain

A logarithm transform is applied to the input image to improve image quality. Assume that the image gray level is the product of the reflectance and the illumination, i.e.,

f(x, y) = r(x, y) · e(x, y)                                          (2)

Applying the logarithm,

log f(x, y) = log r(x, y) + log e(x, y)                              (3)

Let log e be the desired uniform illumination (the same for all pixels) and let ε(x, y) = log e(x, y) − log e denote the compensation term. The normalized face image f'(x, y) then satisfies

log f'(x, y) = log r(x, y) + log e
             = log r(x, y) + log e(x, y) − ε(x, y)
             = log f(x, y) − ε(x, y)                                 (4)

Step 2: Discrete Cosine Transform

To obtain all frequency components of the normalized M × N image, the 2D DCT is applied:

C(u, v) = α(u) α(v) Σ_{x=0}^{M−1} Σ_{y=0}^{N−1} f(x, y)
          cos[π(2x + 1)u / (2M)] cos[π(2y + 1)v / (2N)]              (5)

where

α(u) = 1/√M       if u = 0,      α(u) = √(2/M)   if u = 1, 2, …, M − 1
α(v) = 1/√N       if v = 0,      α(v) = √(2/N)   if v = 1, 2, …, N − 1

Writing E(u, v) for the contribution of coefficient C(u, v) to the inverse DCT, reconstructing the image with the n selected low-frequency terms E(u_i, v_i) discarded gives

f'(x, y) = Σ_{u=0}^{M−1} Σ_{v=0}^{N−1} E(u, v) − Σ_{i=1}^{n} E(u_i, v_i)
         = f(x, y) − Σ_{i=1}^{n} E(u_i, v_i)                         (6)

Step 3: Illumination Compensation

As shown above, illumination compensation can be obtained by adding or removing the compensation term ε(x, y) of (4) in the logarithm domain. Discarding low-frequency components reduces illumination differences: the low-frequency DCT coefficients, which carry the illumination variation and whose larger magnitudes mainly lie in the upper-left corner of the DCT matrix, are set to zero in the zigzag order shown in Fig. 2. Uniform overall illumination is then obtained by setting the DC coefficient to

C(0, 0) = log(m) · √(M · N)

where m is the mean gray level and M × N is the size of the image.

Fig. 2 Block feature of DCT coefficients (DC coefficient in the upper-left corner, AC coefficients elsewhere) and their selection in zigzag pattern
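The three steps can be sketched end to end in NumPy. This is a minimal illustration, not the authors' implementation: Eq. (5) is written as an orthonormal matrix product for clarity, and discarding three zigzag coefficients on a toy gradient image is an arbitrary choice:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis: row u holds alpha(u) * cos(pi*(2x+1)*u / (2n))."""
    u = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    A = np.cos(np.pi * (2 * x + 1) * u / (2 * n))
    return A * np.where(u == 0, 1 / np.sqrt(n), np.sqrt(2 / n))

def dct_normalize(img, n_discard=3):
    """DCT normalization sketch: log domain (Eqs. 2-3), 2D DCT (Eq. 5), zero
    n_discard low-frequency AC coefficients in zigzag order, fix the DC term
    to log(mean) * sqrt(M*N) (Step 3), then invert (Eq. 6)."""
    g = np.log(img.astype(float) + 1.0)            # +1 avoids log(0)
    M, N = g.shape
    A, B = dct_matrix(M), dct_matrix(N)
    C = A @ g @ B.T                                 # Eq. (5) as a matrix product
    zigzag = [(0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
    for u, v in zigzag[:n_discard]:
        C[u, v] = 0.0                               # discard low-frequency terms
    C[0, 0] = np.log(img.mean()) * np.sqrt(M * N)   # uniform overall illumination
    return A.T @ C @ B                              # inverse DCT, Eq. (6)

# Toy face-sized image with a strong left-to-right illumination gradient
img = np.tile(np.linspace(40, 200, 8), (8, 1))
out = dct_normalize(img)
print(round(out.mean(), 4), round(np.log(img.mean()), 4))  # the two means agree
```

Because the DC coefficient of an orthonormal DCT equals mean · √(MN), setting C(0, 0) = log(m) · √(MN) pins the mean of the normalized log-domain image to log(m), which is exactly the uniform-illumination condition of Step 3.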

2.3 Rescaled DCT Coefficients

In this technique, histogram equalization is applied first; it increases local contrast but cannot remove illumination variation, only shift it upward in the gray scale. To eliminate the illumination variation, rescaling of the DCT coefficients is then used [14].


As noted above, the low-frequency DCT components are associated with illumination variation. These components are rescaled, and the value by which they are divided is called the rescaling-down factor (RDF); reducing the low-frequency coefficients in this way also yields uniform illumination. The amount of low-frequency DCT coefficients to be rescaled is governed by a parameter Cresc and is calculated as

Number of coefficients to be rescaled = Cresc · (Cresc − 1) − 1
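A minimal sketch of the rescaling step, under the assumption that the first zigzag-ordered AC coefficients are simply divided by the RDF; the toy gradient image, Cresc = 3 and an RDF of 50 are illustrative choices, not values from the paper:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (same construction as in Sect. 2.2)."""
    u = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    A = np.cos(np.pi * (2 * x + 1) * u / (2 * n))
    return A * np.where(u == 0, 1 / np.sqrt(n), np.sqrt(2 / n))

def rescale_dct(img, c_resc=3, rdf=50.0):
    """Divide the first low-frequency AC coefficients (zigzag order) by the RDF."""
    f = img.astype(float)
    M, N = f.shape
    A, B = dct_matrix(M), dct_matrix(N)
    C = A @ f @ B.T
    k = c_resc * (c_resc - 1) - 1                 # number of coefficients (Sect. 2.3)
    zigzag = [(0, 1), (1, 0), (2, 0), (1, 1), (0, 2),
              (0, 3), (1, 2), (2, 1), (3, 0)]
    for u, v in zigzag[:k]:
        C[u, v] /= rdf                            # attenuate illumination variation
    return A.T @ C @ B                            # reconstruct the rescaled image

img = np.tile(np.linspace(40, 200, 8), (8, 1))    # toy illumination gradient
out = rescale_dct(img)
```

Unlike the zeroing in Sect. 2.2, dividing by the RDF attenuates rather than removes the low-frequency content, so some of the original structure survives while the illumination gradient is suppressed; the DC coefficient (overall brightness) is left untouched.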

3 Convolutional Neural Network

A convolutional neural network [15, 16] can learn from local features in the input data. The CNN classifier used here consists of three kinds of layers: convolution layers, followed by subsampling layers for feature extraction, and finally a fully connected neural network for classification. The block diagram of the CNN classifier used in this paper is shown in Fig. 3. The first layer is a convolution layer, which takes the input image and convolves it with kernels to produce feature maps (activation maps). In the proposed system, the pre-processed image is the input to a convolution layer with kernel size 5 × 5 and stride 1, which aims to extract features such as edges, oriented edges and corners. The output of the first layer is fed to a pooling layer, which reduces the dimensionality of the features and thus the complexity of processing the data; max pooling keeps the strongest responses and so captures the underlying features of the image. The final output is then flattened and fed to a regular (fully connected) neural network for classification, in which each neuron is connected to all neurons in the previous layer. The loss function depends on the task; for categorical classification, the softmax loss function is used.
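The forward pass just described (5 × 5 convolution with stride 1, 2 × 2 max pooling, flatten, softmax) can be sketched in NumPy with random weights. This illustrates the layer arithmetic only; the 12 × 12 input, single kernel and random dense weights are placeholders, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(img, kernel):
    """Valid convolution with stride 1 followed by ReLU (first CNN stage)."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)

def maxpool2(x):
    """2 x 2 max pooling: keep the strongest response, halving each dimension."""
    H, W = x.shape
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

img = rng.random((12, 12))                 # stand-in pre-processed face image
kernel = rng.standard_normal((5, 5))       # one 5 x 5 convolution kernel
feat = maxpool2(conv2d(img, kernel))       # 12x12 -> 8x8 (conv) -> 4x4 (pool)
dense = rng.standard_normal((16, 7))       # fully connected layer, 7 classes
probs = softmax(feat.ravel() @ dense)      # one probability per expression
print(probs.shape)                         # → (7,)
```

The softmax output assigns one probability to each of the seven universal expressions, summing to 1, which is why the softmax loss pairs naturally with this categorical classification task.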

4 Implementation and Result Discussion

In the proposed approach, the pre-processing techniques were implemented using Python and TensorFlow. Facial expression images from the publicly available Japanese Female Facial Expression (JAFFE) dataset [17] were used; JAFFE contains 213 expressions captured from 10 Japanese females. A second dataset was created by the researchers from images of Indian faces captured in a real-time environment; it contains 354 expressions posed by 10 Indian people, both males and females. Four pre-processing configurations were used to improve CNN performance and remove the illumination effect: (a) the original input data without a pre-processing step, (b) histogram equalization, (c) discrete cosine transform (DCT) normalization and (d) DCT with rescaling of coefficients. Figures 4 and 5 show the output of each pre-processing technique on the two datasets (Figs. 6 and 7).
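The 80/20 train/test split of the two datasets can be reproduced with a simple shuffled index split (the dataset sizes are from the text; the random seed is an arbitrary choice):

```python
import numpy as np

def train_test_split(n_samples, train_frac=0.8, seed=42):
    """Shuffle sample indices and split them 80/20 for training/testing."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(n_samples * train_frac)
    return idx[:n_train], idx[n_train:]

train_j, test_j = train_test_split(213)   # JAFFE: 213 expressions
train_i, test_i = train_test_split(354)   # Indian dataset: 354 expressions
print(len(train_j), len(test_j))          # → 170 43
print(len(train_i), len(test_i))          # → 283 71
```

Shuffling before splitting matters here because both datasets are ordered by subject and expression; a contiguous split would leave whole expression classes out of the training set.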


Fig. 3 Block diagram of CNN

Tables 1 and 2 show the effect of each pre-processing technique on every class of the JAFFE and Indian datasets at 128 × 128 pixels, reporting the accuracy for each expression class and the mean over classes for each pre-processing step. The best normalization technique is rescaling of the DCT coefficients, achieving accuracies of about 85.19% and 97.6% on the JAFFE and Indian datasets, respectively. On the JAFFE dataset, the CNN recognizes surprise, anger and happy most accurately, reaching 100% accuracy for those expressions. The JAFFE subjects are not very expressive, so the CNN confuses anger, disgust, fear and sad. The


Fig. 4 Pre-processing on JAFFE dataset

most distinguishable expressions, happy and surprise, are recognized with good accuracy. The Indian subjects are more expressive, so on the Indian dataset the CNN recognizes angry, happy, fear and surprise with 100% accuracy. The model can learn directly from the image because all the surroundings are removed during the pre-processing phase. Finally, the impact of image resolution on CNN performance was checked. Tables 3 and 4 show the accuracy of the CNN at different resolutions. The proposed CNN model was checked at 20 × 20, 32 × 32, 64 × 64 and 128 × 128 resolution; 128 × 128 gave the best recognition rate on both the JAFFE and Indian datasets.

5 Conclusion

The planned techniques were implemented to compare pre-processing methods for facial expression recognition. Based on the experimental results obtained, rescaling of the DCT coefficients achieved the best


Fig. 5 Pre-processing on Indian dataset

Fig. 6 Confusion matrix using DCT scaling as pre-processing on JAFFE dataset


Fig. 7 Confusion matrix using DCT scaling as pre-processing on Indian dataset

Table 1 Effect of pre-processing technique for each class of JAFFE dataset using 128 × 128 pixels

Pre-processing stage | Angry (%) | Disgust (%) | Fear (%) | Happy (%) | Sad (%) | Surprise (%) | Neutral (%) | Average (%)
(a)                  | 100       | 57.14       | 75.00    | 100       | 50.00   | 66.66        | 88.88       | 76.81
(b)                  | 100       | 100         | 55.55    | 75.00     | 71.42   | 100          | 83.33       | 83.60
(c)                  | 100       | 80.00       | 60.00    | 100       | 100     | 66.66        | 83.33       | 84.28
(d)                  | 100       | 83.33       | 66.66    | 100       | 71.42   | 100          | 75.00       | 85.19

Table 2 Effect of pre-processing technique for each class of Indian dataset using 128 × 128 pixels

Pre-processing stage | Angry (%) | Disgust (%) | Fear (%) | Happy (%) | Sad (%) | Surprise (%) | Neutral (%) | Average (%)
(a)                  | 70.00     | 64.28       | 100      | 85.71     | 100     | 100          | 100         | 88.57
(b)                  | 91.66     | 100         | 100      | 100       | 70.00   | 100          | 100         | 94.52
(c)                  | 76.92     | 100         | 100      | 86.33     | 66.66   | 93.33        | 75.00       | 85.46
(d)                  | 100       | 100         | 100      | 90.90     | 100     | 92.30        | 100         | 97.6

Table 3 JAFFE dataset recognition rate on testing set

Pre-processing | 20 × 20 (%) | 32 × 32 (%) | 64 × 64 (%) | 128 × 128 (%) | Average (%)
(a)            | 41.86       | 53.49       | 72.09       | 76.74         | 61.05
(b)            | 32.56       | 67.44       | 74.42       | 81.40         | 63.95
(c)            | 44.19       | 62.79       | 65.12       | 81.40         | 63.37
(d)            | 69.77       | 72.09       | 76.74       | 86.05         | 76.16
Average        | 47.95       | 63.95       | 72.09       | 81.39         |


Table 4 Indian dataset recognition rate on testing set

Pre-processing | 20 × 20 (%) | 32 × 32 (%) | 64 × 64 (%) | 128 × 128 (%) | Average (%)
(a)            | 64.79       | 80.28       | 87.32       | 95.58         | 81.99
(b)            | 80.28       | 81.69       | 91.55       | 94.37         | 86.97
(c)            | 71.83       | 81.69       | 90.14       | 81.69         | 81.33
(d)            | 87.32       | 90.14       | 95.58       | 97.18         | 92.55
Average        | 76.05       | 83.45       | 91.14       | 92.21         |

improvement in CNN performance, and the proposed CNN model works best at 128 × 128 resolution. The performance of the CNN could be improved further by changing the number of scaled coefficients and the scaling value. In facial expression recognition, the number of images available for the different expressions is a limiting factor; for future enhancement, more images covering more expressions are required to improve the performance of the system. There is also scope for extending this work to images with obstacles such as spectacles.

References

1. A.N. Sreevatsan, K.G. Sathish Kumar, S. Rakeshsharma, M. MansoorRoomi, Emotion Recognition from Facial Expressions: A Target Oriented Approach Using Neural Network (Thiagarajar College of Engineering)
2. S.-S. Liu, Y.-T. Tian, D. Li, New research advances of facial expression recognition, in IEEE Proceedings of the Eighth International Conference on Machine Learning and Cybernetics, Baoding, 12–15 July 2009
3. V. Bettadapura, Face Expression Recognition and Analysis: The State of the Art (Georgia Institute of Technology, 2010)
4. J.A. Alzubi, Optimal classifier ensemble design based on cooperative game theory. Res. J. Appl. Sci. Eng. Technol. 11(12), 1336–1343 (2015)
5. J.A. Alzubi, Diversity based improved bagging algorithm, in Proceedings of the International Conference on Engineering & MIS 2015 (Sept 2015), pp. 1–5
6. J.A. Alzubi, R. Jain, A. Kathuria, A. Khandelwal, A. Saxena, A. Singh, Paraphrase identification using collaborative adversarial networks. J. Intell. Fuzzy Syst. (Preprint), 1–12
7. D. Gupta, J.J. Rodrigues, S. Sundaram, A. Khanna, V. Korotaev, V.H.C. de Albuquerque, Usability feature extraction using modified crow search algorithm: a novel approach. Neural Comput. Appl. 1–11 (2018)
8. D.A. Pitaloka, A. Wulandari, T. Basaruddin, D.Y. Liliana, Enhancing CNN with preprocessing stage in automatic emotion recognition. Proc. Comput. Sci. 116, 523–529 (2017)
9. C. Shan, S. Gong, P.W. McOwan, Facial expression recognition based on local binary patterns: a comprehensive study. Image Vis. Comput. 27(6), 803–816 (2009)
10. D.Y. Liliana, M.R. Widyanto, T. Basaruddin, Human emotion recognition based on active appearance model and semi-supervised fuzzy C-means, in 2016 International Conference on Advanced Computer Science and Information Systems (ICACSIS) (IEEE, Oct 2016), pp. 439–445
11. D.A. Pitaloka, A. Wulandari, T. Basaruddin, D.Y. Liliana, Enhancing CNN with preprocessing stage in automatic emotion recognition, in 2nd International Conference on Computer Science and Computational Intelligence 2017 (ICCSCI 2017), 13–14 Oct 2017, Bali, Indonesia


12. M. Pantic, L.J.M. Rothkrantz, Automatic analysis of facial expressions: the state of the art. IEEE Trans. Pattern Anal. Mach. Intell. 22(12) (2000)
13. W. Chen, M.J. Er, S. Wu, Illumination compensation and normalization for robust face recognition using discrete cosine transform in logarithm domain. IEEE Trans. Syst. Man Cybern. Part B Cybern. 36(2) (2006)
14. T. Goel, V. Nehra, V.P. Vishwakarma, Comparative analysis of various illumination normalization techniques for face recognition. Int. J. Comput. Appl. 28(9) (2011)
15. A. Ozturk, B. Akdemir, Effects of histopathological image pre-processing on convolutional neural networks, in International Conference on Computational Intelligence and Data Science (ICCIDS 2018)
16. N. PattabhiRamaiah, E.P. Ijjina, C. Krishna Mohan, Illumination invariant face recognition using convolutional neural networks, Conference Paper, Feb 2015. https://doi.org/10.1109/SPICES.2015.7091490
17. http://www.kasrl.org/jaffe.html

An Expert Eye for Identifying Shoplifters in Mega Stores

Mohd. Aquib Ansari and Dushyant Kumar Singh

Abstract Nowadays, the theft of retail products by customers from mega stores has been increasing enormously. Some customers steal by concealing goods in their pockets, in bags or under their clothes when no one is watching and then leaving the store without paying. A person involved in this kind of illegal act is called a shoplifter, and the activity is called shoplifting. Shoplifting is a major threat to retailers that causes losses in business; minimizing the effects of theft is therefore essential in retail safety, store design and customer service. Detecting the persons involved in shoplifting in shops and stores is complicated. Consequently, we intend to build a real-time camera-based surveillance system that detects illegal events and informs security personnel when shoplifting occurs. This article proposes a deep neural network-based shoplifting detection system that uses the InceptionV3 model to extract features and LSTM networks to learn the temporal sequences. The proposed system can detect persons involved in shoplifting-like activity with an accuracy of up to 74.53%.

Keywords Video surveillance · Video footage · Activity recognition · Inception V3 · Long short-term memory (LSTM)

1 Introduction

Shoplifting [1, 2] is the act of taking goods from an enterprise with the intention of not paying for them. It usually involves a person hiding store items under clothes or in a pocket or bag and leaving the store without paying. Shoplifting has many direct and indirect impacts on business and is a predominant menace to retailers, causing losses. People who

Mohd. A. Ansari (B) · D. K. Singh: CSED, MNNIT Allahabad, Prayagraj, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022. A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_10


are involved in such illegal activity are called shoplifters. According to a National Retail Federation (NRF-2018) survey [3], American retailers lose almost $18 billion annually to shoplifting. Shoplifting is a narrower term than theft, which is taking someone's property without permission in order to unlawfully deprive the owner or possessor of it; shoplifting is the corollary limited to taking things from stores and shops with the intention of not paying for them. In India, shoplifting is dealt with under Chapter XVII of the Indian Penal Code [4], which enlists the offenses against property; under Section 379, a person convicted of theft can be sentenced to three years of imprisonment, a fine, or both. To reduce such incidents, CCTV cameras are installed in stores and control room personnel monitor the video footage received from them. Owing to humans' limited capability, they cannot watch an array of video feeds for very long. Hence, there is a need for an expert surveillance system that can help in finding shoplifting events. Expert video surveillance systems are an incredible asset applied in various situations to automate the detection of several risk circumstances and help human security personnel make proper decisions to improve the protection of assets.

Shoplifting is unusual behavior that is rarely performed. Common customer behavior in a mall, shop or store corresponds to activities such as walking, talking, putting items in the basket or trolley, checking items and billing. In rare cases, unusual activities are also recorded, such as putting items into bags, clothes or pockets in such a way that it goes unnoticed. The goal of shoplifting detection is incomplete without a system that is proficient at automatic detection at early stages; consequently, all of these activities need to be classified. Therefore, we intend to develop an advanced human activity recognition (HAR) [5–7] system that can examine each frame of video footage and identify unusual activity like shoplifting. A comprehensive and cost-effective solution for detecting these kinds of activities is still missing. For that reason, this article proposes an advanced HAR system to identify human stealing actions in indoor surveillance. The proposed system uses temporal video sequences for experimentation, where each sequence captures a complete action; these video sequences are passed to the feature extraction and activity classification modules to distinguish normal activities from shoplifting.

This article is outlined in five sections. After the brief introduction in Section 1, Section 2 reviews related work on existing HAR systems. The proposed framework, along with its technicalities, is described in Section 3. Section 4 includes the experimental outcomes with a brief analysis. Finally, the conclusion and future scope are given in Section 5.


2 Related Work Identifying the shoplifters in indoor surveillance is a challenging task and can be solved using the human activity recognition (HAR) framework. Therefore, this section includes different HAR-based researches as follows: Sultani et al. [1] proposed a fast and robust algorithm based on multiple instance learning (MIL) for anomalies detection in videos. This framework uses C3D technology to extract features from positive and negative video segments in bags. Then it uses multiple instance learning (MIL) to classify video segments. They used the UCF crime video dataset for experiments, which have 13 different anomalies. Arroyo et al. [8] proposed an expert monitoring system to identify customers’ suspicious behavior in shopping malls. This article focuses on various scenarios to identify suspicious human behavior. These scenarios include store entry or exit events, detecting events that may result in theft, and identifying unattended cash desk. This paper has also suggested a method for human detection and tracking that can deal very well with occlusion among objects. This method includes global color histogram, local binary pattern, and histogram of oriented gradients for feature extraction. Further, these extracted features are used to build a classifier for decisionmaking purposes. Wang and Wu [9] developed an innovative deep learning-based framework for composite action recognition. This framework fuses the multiple cues of action motions, objects, and scenes in a single module. For object detection, it uses deep CNN to extract the features from the object or human body. On the other hand, GoogleNet [10] and AlexNet [11] extract the features for identifying scene and object cues. Finally, these extracted features are fed into a recurrent neural network to generate an action’s final representation. Nguyen and Ly [12] suggested a fast abnormal event detection framework based on a sparse combination learning framework. 
First, a frame-differencing algorithm extracts the regions of interest (ROIs), and each ROI is enclosed in a rectangle of fixed size. Once the ROIs are extracted, fast dense features such as HOG/HOF and MBH are computed for each ROI. Support vector data description (SVDD) is used to model unpredictable abnormal activities. The best accuracies reported on the UCSD and UMN datasets are 90.44% and 97.6%, respectively. Tripathi et al. [13] surveyed various state-of-the-art techniques for suspicious activity recognition from video sequences in camera-based surveillance, focusing on six unusual activities in detail: violence detection, object detection, fire detection, illegal parking detection, fall detection, and theft detection. Li et al. [14] suggested a deep neural network-based approach to identify various actions from 2D poses. The article uses a human posture encoding scheme that captures the joints of the human body in each video sequence and characterizes the temporal evolution of the posture configuration. The authors used global RGB and optical flow streams to compute the features, which are then used to build a CNN for classification.


M. A. Ansari and D. K. Singh

3 Proposed Framework

This article aims to propose an economical, real-time indoor activity recognition system that can identify a person's stealing actions, such as putting a stolen item into a shirt, pocket, bag, or dress. To fulfill this objective, a deep neural network-based framework is proposed to identify shoplifting events, as shown in Fig. 1. The model uses a sequence length of 125, meaning that an entire activity spans 125 image sequences. The proposed system uses a camera module that captures real-time images of the input scene. Next, feature extraction derives meaningful information from each image sequence, and the extracted feature sequences are fed to an LSTM network for classification.
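The processing pipeline just described can be sketched as follows. Note that `extract_features` and `classify_sequence` are placeholder stubs written only to show the data flow; in the actual framework these roles are filled by Inception V3 and an LSTM network, described in the next subsections.

```python
# Minimal sketch of the pipeline: camera frames -> per-frame features ->
# a 125-step feature sequence -> a sequence label. Both functions below
# are illustrative stubs, not the paper's trained models.

SEQUENCE_LENGTH = 125   # the model processes activities as 125-frame sequences
FEATURE_DIM = 2048      # Inception V3 emits a 2048-dimensional feature vector

def extract_features(frame):
    """Stub standing in for the Inception V3 feature extractor."""
    return [0.0] * FEATURE_DIM

def classify_sequence(feature_sequence):
    """Stub standing in for the LSTM classifier."""
    assert len(feature_sequence) == SEQUENCE_LENGTH
    return "normal"   # one of {"normal", "shoplifting"}

frames = list(range(SEQUENCE_LENGTH))             # placeholder camera frames
features = [extract_features(f) for f in frames]  # one vector per frame
label = classify_sequence(features)
print(label)                                      # normal
```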

3.1 Inception V3

Inception V3 [15] is a convolutional neural network used for object detection and image analysis. It is computationally efficient in terms of generated network parameters and computational cost. It is a 48-layer deep architecture that works well for image understanding and produces a 2048-dimensional feature vector as output. It is an upgraded version of Inception V2 that adds factorized convolutions, batch normalization in the auxiliary classifiers, the RMSprop optimizer, and label smoothing. In Inception V3, factorized convolution breaks a large convolution operation into smaller convolutions, which reduces the number of training parameters and makes the network more computationally efficient while also achieving low error rates. Batch normalization in the auxiliary classifiers normalizes the fully connected and convolutional layers, while label smoothing is a regularizing term added to the loss formula that prevents the network from becoming overly confident in a single class. It is a lightweight architecture that occupies very little memory.

Fig. 1 Proposed framework for identifying shoplifting

3.2 Long Short-Term Memory (LSTM)

LSTM networks [16, 17] are a type of recurrent neural network (RNN) designed for sequence prediction problems. An LSTM can process single data points as well as entire data sequences, and its feedback connections allow it to retain information in memory over time. A standard LSTM unit comprises an input gate, an output gate, a forget gate, and a cell. The three gates control the flow of information into and out of the cell, while the cell remembers values over arbitrary time intervals, making LSTMs quite efficient at handling time-series data. For a given time t, the parameters h_t, C_t, and x_t represent the hidden state, the cell state (memory), and the input data point, respectively.

The hidden state h_t of an LSTM cell is computed as follows. The first sigmoid layer, the forget gate, takes two inputs: h_{t−1}, the hidden state of the previous cell, and x_t. Its output regulates how much information flows from the previous cell, producing a number in [0, 1] that is then multiplied with C_{t−1}:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)    (1)

The next step selects which information to store in the cell state: a sigmoid layer decides the values to be updated, and a tanh layer generates a vector of fresh candidate values:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)    (2)

C̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)    (3)

The old cell state C_{t−1} is then updated into the new cell state C_t:

C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t    (4)

Finally, a sigmoid layer decides which part of the cell state flows to the output; the cell state C_t is passed through tanh and multiplied by the sigmoid gate's output to produce the outcome:

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)    (5)

h_t = tanh(C_t) ∗ o_t    (6)
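Equations (1)-(6) can be exercised directly. The sketch below implements a single LSTM step with scalar states and toy weights; the weight and bias values are illustrative, not trained parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(h_prev, c_prev, x, W, b):
    """One LSTM step over scalar state, following Eqs. (1)-(6).
    W and b hold the weights/biases for the f, i, c, o transforms;
    each weight is a pair (w_h, w_x) applied to [h_{t-1}, x_t]."""
    def affine(gate):
        w_h, w_x = W[gate]
        return w_h * h_prev + w_x * x + b[gate]

    f = sigmoid(affine("f"))            # Eq. (1): forget gate
    i = sigmoid(affine("i"))            # Eq. (2): input gate
    c_tilde = math.tanh(affine("c"))    # Eq. (3): candidate cell value
    c = f * c_prev + i * c_tilde        # Eq. (4): new cell state
    o = sigmoid(affine("o"))            # Eq. (5): output gate
    h = math.tanh(c) * o                # Eq. (6): new hidden state
    return h, c

# Toy weights, just to exercise the equations.
W = {g: (0.5, 0.5) for g in "fico"}
b = {g: 0.0 for g in "fico"}
h, c = lstm_step(h_prev=0.0, c_prev=0.0, x=1.0, W=W, b=b)
print(h, c)
```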


The parameters i, f, and o represent the input, forget, and output gates, respectively. W denotes the weight matrices connecting the inputs to the current hidden layer, and b denotes the bias vectors.

4 Experimentation and Result Analysis

This work uses the UCF Crime dataset [1] for experimentation. It contains 1900 real-world surveillance videos, totaling 128 h and covering 13 realistic anomalies. Only the shoplifting anomaly videos were considered for the experiments. The shoplifting video set comprises CCTV footage of arbitrary length, so the footage was trimmed into short 5-second clips, each containing a specific human action: either a stealing action or a normal action. The distribution of these trimmed clips is shown in Table 1. The proposed human activity recognition model was trained with two different batch sizes (4 and 8) and a sequence length of 125. Training with batch size 8 takes 42.43 min, while batch size 4 takes almost 79.87 min. Table 2 presents the experimental outcomes of the proposed model for the two batch sizes over 120 epochs. The model achieves 97.61% training accuracy and 73% validation accuracy with batch size 4, and 96.80% training accuracy and 74.53% validation accuracy with batch size 8.

Table 1 Shoplifting dataset distribution

  Class         Distribution   Number of clips   Total clips
  Normal        Test           27                82
                Train          55
  Shoplifting   Test           12                44
                Train          32

Table 2 Experimental outcomes

              Batch size = 4                      Batch size = 8
              Train            Validation         Train            Validation
  Iteration   Accuracy  Loss   Accuracy  Loss     Accuracy  Loss   Accuracy  Loss
  20          95.23     0.203  61.25     0.836    86.25     0.348  63.46     0.536
  40          95.23     0.127  76.24     0.085    93.75     0.145  64.99     0.674
  60          98.89     0.040  78.75     0.486    96.25     0.079  67.06     0.736
  80          99.00     0.097  67.49     0.801    98.75     0.025  70.83     0.788
  100         98.88     0.042  69.99     0.141    92.50     0.122  75.32     0.885
  120         97.61     0.033  73.00     0.612    96.80     0.013  74.53     0.222


Fig. 2 Confusion matrix for proposed model (batch size = 8)

Fig. 3 Resultant images obtained from proposed model

Figure 2 shows the confusion matrix of the proposed model (batch size = 8). It shows that 76% of the normal-class instances are predicted correctly, while 24% are misclassified. In the case of the shoplifting class, 92% of instances are correctly labeled, while 8% are misclassified. Overall, the model achieves 74% accuracy. In testing, the proposed model takes a video as input and predicts the behavior as output. Figure 3 shows examples of resultant images obtained by applying the proposed model.
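For concreteness, the per-class rates quoted above can be recovered from a confusion matrix as follows. The counts below are illustrative numbers chosen to match the quoted percentages, not the paper's raw test counts.

```python
# Per-class accuracy from a 2x2 confusion matrix, keyed as
# (true label, predicted label) -> count. Values are illustrative.
confusion = {
    ("normal", "normal"): 76, ("normal", "shoplifting"): 24,
    ("shoplifting", "shoplifting"): 92, ("shoplifting", "normal"): 8,
}

def per_class_accuracy(conf, cls):
    """Fraction of instances of class `cls` that were labeled correctly."""
    total = sum(v for (true, _), v in conf.items() if true == cls)
    return conf[(cls, cls)] / total

print(per_class_accuracy(confusion, "normal"))       # 0.76
print(per_class_accuracy(confusion, "shoplifting"))  # 0.92
```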


5 Conclusion

This article has proposed an advanced monitoring system to identify unusual activity, such as shoplifting, in shops and stores. The proposed system alerts security personnel when anything related to shoplifting happens, so that they can take appropriate action. This research primarily focuses on reducing business losses due to shoplifting by incorporating the proposed HAR-based shoplifting detection system. Although the proposed model achieves up to 74% accuracy, further refinement is still needed, and the model sometimes produces false-positive predictions during testing. In the future, more improvements may be incorporated to enhance the system's performance and minimize false positives. Regarding the shoplifting dataset, the existing dataset does not contain clearly distinguishable actions: most of the videos include partial occlusion and varying illumination and visual appearance, which degrades the proposed model's performance. Because no other suitable dataset is available to address these problems, our research is ongoing, with a focus on creating a shoplifting dataset with clearly performed actions.

References

1. W. Sultani, C. Chen, M. Shah, Real-world anomaly detection in surveillance videos, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 6479–6488
2. UK Crime, Shoplifting offences in England and Wales 2002–2019. https://www.statista.com/statistics/303563/shoplifting-in-england-and-wales-uk-y-on-y/ (2019)
3. National Retail Federation, National retail security survey (2018)
4. G.C. Rankin, The Indian penal code. LQ Rev. 60, 37 (1944)
5. D.K. Singh, Human action recognition in video, in International Conference on Advanced Informatics for Computing Research (Springer, Berlin, 2018), pp. 54–66
6. C. Jobanputra, J. Bavishi, N. Doshi, Human activity recognition: a survey. Procedia Comput. Sci. 155, 698–703 (2019)
7. D.K. Singh, S. Paroothi, M.K. Rusia, M.A. Ansari, Human crowd detection for city wide surveillance. Procedia Comput. Sci. 171, 350–359 (2020)
8. R. Arroyo, J.J. Yebes, L.M. Bergasa, I.G. Daza, J. Almazán, Expert video-surveillance system for real-time detection of suspicious behaviors in shopping malls. Expert Syst. Appl. 42(21), 7991–8005 (2015)
9. R. Wang, W. Xinxiao, Combining multiple deep cues for action recognition. Multimedia Tools Appl. 78(8), 9933–9950 (2019)
10. X. Chen, A.L. Yuille, Articulated pose estimation by a graphical model with image dependent pairwise relations, in Advances in Neural Information Processing Systems (2014), pp. 1736–1744
11. A.F. Khalifa, E. Badr, H.N. Elmahdy, A survey on human detection surveillance systems for raspberry pi. Image Vis. Comput. 85, 1–13 (2019)
12. T.N. Nguyen, N.Q. Ly, Abnormal activity detection based on dense spatial-temporal features and improved one-class learning, in Proceedings of the Eighth International Symposium on Information and Communication Technology (2017), pp. 370–377
13. R.K. Tripathi, A.S. Jalal, S.C. Agrawal, Suspicious human activity recognition: a review. Artif. Intell. Rev. 50(2), 283–339 (2018)


14. C. Li, R. Tong, M. Tang, Modelling human body pose for action recognition using deep neural networks. Arab. J. Sci. Eng. 43(12), 7777–7788 (2018)
15. L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, M. Pietikäinen, Deep learning for generic object detection: a survey. Int. J. Comput. Vis. 128(2), 261–318 (2020)
16. S.W. Pienaar, R. Malekian, Human activity recognition using LSTM-RNN deep neural network architecture, in 2019 IEEE 2nd Wireless Africa Conference (WAC) (IEEE, 2019), pp. 1–5
17. K. Xia, J. Huang, H. Wang, LSTM-CNN architecture for human activity recognition. IEEE Access 8, 56855–56866 (2020)

Sanskrit Stemmer Design: A Literature Perspective

Jayashree Nair, Sooraj S. Nair, and U. Abhishek

Abstract Sanskrit, an ancient Indian language, lays the grammatical foundation for most of the languages of India. These Indian languages, unlike English, are morphologically rich: their words have many inflections or variations. Consequently, computationally processing these languages is more challenging, and this is where linguistic tools like stemmers come into play. A stemmer strips an inflected word down to its base form: it removes the affixes present in a word and reduces the word to its root or stem. Stemming is a language-dependent task, as each language has its own linguistic and morphological rules of word formation; however, a few existing works have attempted to build language-independent stemmers too. This paper explores most of the important stemmers built so far, examining their working and methodologies along with their features and uniqueness. It also explicates the jargon related to the stemming process. Finally, a proposed design for a Sanskrit verb stemmer, based on strategies from the existing literature, is presented. This design sketches the important modules that can be included in rule-based stemmers for morphologically rich languages like Sanskrit.

Keywords Stemmer · Stemming · Stem · Affix · Sanskrit stemmer · Natural language processing · Morphology · Indian languages · Stemmer design · Rule-based stemmer · Suffix stripping · Stemming algorithms

J. Nair · S. S. Nair (B) · U. Abhishek Department of Computer Science and Applications, Amrita Vishwa Vidyapeetham, Amritapuri, India e-mail: [email protected] J. Nair e-mail: [email protected] U. Abhishek e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_11


1 Introduction

Stemming is the process by which a word is reduced to its stem or base form. In Natural Language Processing (NLP) applications, stemming is one of the important pre-processing techniques. Given any word in a natural language, a stemming tool shortens it to its base form, also termed the root or stem [1]. Most words in a language, be they nouns, adjectives, or verbs, are formed by adding affixes to the base form. For example:

Studying = Study + ing
Books = Book + s

Here "Study" is the stem and "ing" is the affix (suffix) of the word "Studying". Similarly, for "Books", "Book" is the root and "s" is the suffix. A stemmer is a linguistic tool that performs stemming for a particular natural language: it identifies the base form of an inflected or derived word by removing its affixes. Stemmers play a role in many text processing applications such as search engines, information retrieval, parts-of-speech tagging, parsing, and machine translation [2]. Apart from serving as a pre-processing step in NLP applications, stemming also aids in determining vocabulary for domain analysis, text clustering, digital dictionaries, and more. A large variety of stemming algorithms exists for English and other European languages. In contrast, Indian languages have few stemming algorithms, and among these Sanskrit has the fewest. This is because Indian languages, and especially Sanskrit, are highly inflected, so building a stemmer requires extensive linguistic resources, such as an exhaustive grammar rule set and a dictionary of roots. A Sanskrit stemmer would aid people who use the Sanskrit language, such as Ayurveda students, practitioners, and doctors, helping them analyze Sanskrit words better, since Ayurveda scriptures are written in Sanskrit [3]. It can also help search engines provide better results for Sanskrit text input.
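The "Studying"/"Books" examples above amount to simple suffix stripping, which can be sketched as follows. This is a toy illustration, not a real stemmer; the suffix list and length guard are assumptions made for the example.

```python
# A naive suffix stripper for the two examples in the text.
SUFFIXES = ["ing", "s"]   # checked longest-first

def naive_stem(word):
    """Strip the first matching suffix, keeping a minimum stem length."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(naive_stem("Studying"))  # Study
print(naive_stem("Books"))     # Book
```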
A stemmer for Sanskrit will help Ayurveda students, doctors, and other Sanskrit enthusiasts who struggle to analyze and understand Sanskrit words in Ayurveda textbooks and other Sanskrit writings. It also serves as a pre-processing tool for Sanskrit text analysis. This paper presents the stemming jargon, such as NLP, stemming, stemmer, stem, and affixes, and explains the stemming process for Sanskrit words. A comparative study of the different existing stemmers in the literature is also presented. Lastly, a simple rule-based Sanskrit stemmer design is proposed.


2 Background Study

This section elucidates the jargon pertaining to NLP, the stemming process, and a Sanskrit stemmer.

2.1 NLP: Natural Language Processing

Natural Language Processing, shortened as NLP, is the integrated research discipline that enables computing devices to read, understand, generate, and process human natural language [4]. It is an interdisciplinary field of computer science that integrates linguistics and information engineering for the processing of natural languages [5]. NLP technologies encompass sentiment analysis, speech recognition, text classification, machine translation, and question answering. These tools can be found in word processors like Microsoft Word, writing assistance tools like Grammarly, interactive voice response applications, and personal voice assistants such as Siri and Google Assistant.

2.2 Stemming

Stemming is the process of automatically and precisely reducing words to their stems by removing any attached suffixes or prefixes (generally termed affixes). It acts as an important pre-processing step for most of the NLP techniques mentioned above, and it is also used in indexing systems, text clustering, categorization, and summarization tools before the specific core algorithms are applied.

2.3 Stemmer

A tool that carries out the stemming process is generally called a stemmer. Stemmers can be rule-based, statistical, corpus-based, and so on, and most follow language-specific rules. Stemmers like the Porter stemmer, Lancaster stemmer, and Snowball stemmer have already been invented and proved efficient enough for most European languages, including English [6]; the Porter stemmer for English remains the most popular to date. Developing stemmers for Indian languages is challenging due to their inflectional nature. However, researchers have attempted to build rule-based as well as corpus-based stemmers for them [7].


2.4 Stem

The stem is the form of a word before any inflectional affixes are added to it [8]. It is also known as the base or root.

2.5 Affix

An affix is a language-dependent morpheme, the smallest grammatical element [5], which when combined with a base word (stem) produces an inflected form of it. Prefixes, infixes, suffixes, co-suffixes, and circumfixes are different types of affixes [2]. A prefix occurs at the beginning of a stem, a suffix at the end, and an infix within a word. A circumfix is a combination of a prefix and a suffix that together create an inflected word [9].

2.6 Over-Stemming, Under-Stemming, Mis-Stemming

Over-stemming: the stemmer error by which two words with different stems are reduced to the same root (a false positive) [10]. Under-stemming: the stemmer error by which two words that have the same root are not reduced to it (a false negative) [10]. Mis-stemming: the stemmer error in which inflected words are ignored and stemming is not applied [2].

2.7 Sanskrit Stemmer

Sanskrit, an ancient Indian language, is one of the official Indian languages, specifically of the state of Uttarakhand [11, 12]. It has also paved the way for the grammatical foundation of most Indian languages [13]. According to Sage Pāṇini, a pioneer in defining Sanskrit grammar, there are about two thousand verbal roots, called dhātupāṭha, and nominal bases, termed prātipadikam [14]. The Aṣṭādhyāyī, authored by Sage Pāṇini, is the treatise on Sanskrit grammar [15]. Sanskrit is a morphologically rich language [12], i.e., most of its words are highly inflected; such languages require extensive pre-processing methods [16]. Sanskrit has over 10 million potential verb forms and approximately 2000 verb roots, which are classified into ten morphological and semantic classes termed gaṇas [17]. Sanskrit verbs may be inflected for first, second, or third person (as in English); for three numbers (singular, dual, and plural); or for tense, mood, voice, or aspect. Nouns in Sanskrit are formed by affixes that denote cases: nominative, accusative,


dative, and so forth [12]. The following examples illustrate the stemming process of the words , a nominal word, and , a verb [12].

The stemming process described here for Sanskrit only splits words into a stem and a suffix. However, more advanced linguistic tools, such as a lemmatizer or a morphological analyzer, can exactly decipher the roots and the affixes. To exemplify this, the verb  has the stem , but the root is .

3 Literature Review

This section presents an exhaustive report on some of the important papers and existing work in the literature on stemmers. A consolidated report is depicted in Fig. 1.

3.1 A Comparative Study of Stemming Algorithms [6]

This paper compares several stemming algorithms and describes their merits and demerits. Many stemming approaches are described, the main ones being truncation, statistical, and mixed; the Porter stemmer, the HMM stemmer, and the corpus-based stemmer are the most important stemmers in their respective approaches [6].

The Porter stemmer, the most popular one, uses a rule-based truncating algorithm. The algorithm assumes that most suffixes (mainly in English) are created by compounding smaller suffixes. The Porter stemmer consists of five steps in total; within each step, several rules are applied to the word and the suffix is removed accordingly. It produces few errors and is fast [6].

The HMM (Hidden Markov Model) stemmer is based on transitions between different states, governed by probability functions. Since it is based on unsupervised learning, it does not require much knowledge of the language. Two disjoint sets are created from the states: the first contains only stems, and the second contains stems or suffixes. For the input word, a split point is produced by the most probable path from the first state to the last; the characters before this point are considered the stem and the rest the suffix [6].

The corpus-based stemmer is based on recognizing co-occurrences of variants of a word, and it can overcome some of the drawbacks of the Porter stemmer. With the help of the text corpus used, it automatically separates words that were reduced to the same stem but have different meanings. The algorithm first applies the Porter stemmer to identify the stems and then uses the corpus to modify the result if needed [6].
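The staged rule application behind the Porter stemmer can be illustrated with a tiny sketch. The rules below are a small made-up subset chosen for the example, not Porter's actual rule set.

```python
# Toy illustration of staged, rule-based suffix truncation: each step
# holds suffix-rewrite rules tried in order; only the first match fires.
STEPS = [
    [("sses", "ss"), ("ies", "i"), ("s", "")],   # plural-like endings
    [("ational", "ate"), ("izer", "ize")],       # derivational endings
]

def staged_stem(word):
    for rules in STEPS:
        for suffix, replacement in rules:
            if word.endswith(suffix):
                word = word[: -len(suffix)] + replacement
                break   # only one rule per step applies
    return word

print(staged_stem("ponies"))      # poni
print(staged_stem("relational"))  # relate
```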


Fig. 1 A Consolidated report charting all important stemmer papers



3.2 A Fast Corpus-Based Stemmer [7]

The main aim is to design an unsupervised stemming algorithm using a corpus. The algorithm finds the equivalence classes of inflected words, i.e., it groups inflected words by their stems. The corpus is processed to identify potential suffixes of the language by grouping corpus words that share common suffixes of varied lengths [7]. First, all words in the corpus lexicon that share the same suffix are grouped together, and from this a set of all suffixes is derived. Here, the frequency of a suffix is the number of words carrying it, and a threshold value determines whether a suffix counts as a potential suffix. The primary assumption is that if a set of words {w1, w2, ..., wn} is generated by a root word w, then the suffixes that produce {w1, w2, ..., wn} from w will belong to the potential suffix set of that language. Equivalence classes are then generated by combining the common-prefix and potential-suffix information. This is a language-independent algorithm with low computational overhead compared to the YASS stemmer, but since it is an unsupervised learning algorithm, its processing time depends on the corpus used [7].
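The suffix-collection step described above can be sketched as follows; the corpus, maximum suffix length, and threshold are illustrative values, not the paper's.

```python
# Collect candidate suffixes from a corpus by frequency; keep those that
# occur in at least THRESHOLD words as "potential suffixes".
from collections import Counter

corpus = ["play", "plays", "playing", "walk", "walks", "walking", "talks"]
MAX_SUFFIX_LEN = 3
THRESHOLD = 2   # a suffix must occur in at least this many words

suffix_counts = Counter()
for word in corpus:
    for k in range(1, MAX_SUFFIX_LEN + 1):
        if len(word) > k:                 # a suffix must leave a non-empty stem
            suffix_counts[word[-k:]] += 1

potential_suffixes = {s for s, n in suffix_counts.items() if n >= THRESHOLD}
print(sorted(potential_suffixes))
```

On this toy corpus the surviving candidates include "s" and "ing", while one-off endings such as "ay" fall below the threshold.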

3.3 A Hybrid Inflectional and a Rule-Based Derivational Gujarati Stemmer [18]

No derivational stemming algorithms existed for Gujarati, so creating a derivational stemmer and improving the existing inflectional stemmer would be very helpful for processing the language. The proposed method consists of two approaches, one for the inflectional and one for the derivational stemmer [18]. The algorithm aims to ensure that a common stem is generated for related words, regardless of whether that stem is a word in the vocabulary or not. To improve the stemmer's quality, a parts-of-speech (POS)-based stemming module and a module with a set of substitution rules are also implemented. The inflectional stemmer was found to be 90.7% accurate, while the derivational stemmer had 70.7% accuracy [18]. For further improvement, the paper suggests using a named entity recognizer [18].

3.4 A Stemmer-Based Lemmatizer for Gujarati Text [19]

Here a "stemmatizer" (a lemmatizer based on a stemmer) is developed using a combination of rule-based and dictionary-based approaches for Gujarati. As part of the removal or replacement rules, 179 inflections are checked. An input word is first checked to determine whether it is itself a lemma. All of the inflections are then checked recursively until no variation matches the word [19]; if an inflection from the list is found in the word, it is stripped off. This recursive searching ensures that all inflections are removed from the word. The rule-based algorithm is applicable only to words in the vocabulary. Though the tool is 98.33% accurate, it can still be optimized to reduce execution time [19].

3.5 Text Stemming: Approaches, Applications, and Challenges [20]

The aim is to review some existing stemmers and examine the issues in stemming techniques for non-English languages (Fig. 2). Rule-based stemmers reduce words to their stems using language-specific rules; they comprise brute-force algorithms, affix removal algorithms, and morphological stemmers. Statistical stemmers use unsupervised or semi-supervised training algorithms to learn stemming rules from a given corpus; they comprise lexicon analysis-based, corpus analysis-based, and character n-gram-based stemmers. Hybrid stemmers combine several approaches to perform stemming [20].

3.6 Stemmers for Indic Languages: A Comprehensive Analysis [21]

Aimed at analyzing the stemmers that exist for Indic languages, this paper covers classifications of stemming techniques and their evaluation criteria. The affix removal technique uses a list of affixes: certain criteria are applied, and the affix is then removed from the word [21].

Fig. 2 The text stemming techniques as in [20]


Dictionary lookup is a technique in which a table of terms and their stems is maintained; stemming is done by searching for a term in the table and retrieving the corresponding stem. Statistical techniques, comprising n-gram, HMM, clustering-based, and corpus-based methods, are also described. The hybrid technique combines two or more approaches for stemmer development; most hybrid stemmers combine a lookup table with affix removal. In terms of accuracy, rule-based stemmers achieve 80.02–89%, statistical stemmers 63.5–89.9%, and hybrid stemmers 67.86–95.6%.

3.7 Rule-Based Derivational Stemmer for Sindhi Devanagari Using Suffix Stripping Approach [22]

A rule-based stemmer was developed for the Sindhi language, written in the Devanagari script, using a derivational suffix-stripping algorithm [22]. To build the stemmer, 16 derivational suffixes were first identified manually, and certain derivational rules were then built to act on them. To overcome the errors caused by some exceptional words, a dictionary of such words was created. The algorithm used is:

1. Input an inflected word.
2. Initialize the input as the root word.
3. Check for matches for the input word in the dictionary of exceptional words.
4. If matched, save it as the root word and go to step 7. Else go to step 5.
5. Compare the input word with the suffixes; if a match is identified, perform step 6, otherwise execute step 7.
6. Cut off the suffix from the input word and assign the result as the root word.
7. Display the root word.
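The seven steps above can be sketched as follows. The suffix and exception lists are hypothetical stand-ins, since the paper's actual 16 Sindhi suffixes and exception dictionary are not reproduced here.

```python
# Sketch of the exception-dictionary + suffix-stripping algorithm above.
SUFFIXES = ["kar", "pan"]            # stand-in derivational suffixes
EXCEPTIONS = {"darpan": "darpan"}    # stand-in words that must not be stripped

def sindhi_style_stem(word):
    root = word                              # steps 1-2: initialize the root
    if word in EXCEPTIONS:                   # steps 3-4: dictionary lookup
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:                  # step 5: suffix comparison
        if word.endswith(suffix) and len(word) > len(suffix):
            root = word[: -len(suffix)]      # step 6: strip the suffix
            break
    return root                              # step 7: output the root

print(sindhi_style_stem("darpan"))   # darpan (exception: left unchanged)
```

The exception dictionary is consulted before any suffix rule, which is exactly how the paper avoids over-stemming words that merely end in a suffix-like string.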

This stemmer made machine translation of the Sindhi Devanagari script easier, but stemming errors such as over-stemming and under-stemming are still generated for some words [22].

4 Proposed Sanskrit Stemmer Design

Based on the comparative study, the authors present a simple rule-based stemmer for Sanskrit verbs, as described in Fig. 3. The proposed stemmer consists of three major modules: the Transliterator, the Stemmer Module, and the Back-Transliterator.

Fig. 3 Proposed Sanskrit stemmer design

• Transliteration is the process by which text in one script is converted to another [23]; the reverse process is termed back-transliteration. The Transliterator converts the Sanskrit script from Devanagari to Roman, which eases computation compared with processing Devanagari directly. First, an inflected Sanskrit word is given as input to the stemmer. Then, the Sanskrit-to-English transliterator [24] converts the input word from the alphabet of Sanskrit to that of English. This makes it easier to apply rules to the input using regular expressions.
• Then, with the support of a list of affixes and a dictionary of roots, the main Stemmer Module applies several linguistic rules to the input word and removes the suffix. These rules can be implemented using regular expressions.
• Finally, the back-transliterator [24] converts the stemmed word from the alphabet of English back to that of Sanskrit. This stemmed Sanskrit word is the output.
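The three-module design can be sketched end-to-end. The transliteration tables and suffix list below are toy stand-ins for the devatrans transliterator [24] and the full linguistic rule set; gacchati ("he goes"), with stem gaccha, is a standard textbook example used here for illustration.

```python
# Sketch of the pipeline: transliterate -> strip suffix -> back-transliterate.
ROMAN = {"गच्छति": "gacchati"}     # toy Devanagari -> Roman table
DEVANAGARI = {"gaccha": "गच्छ"}    # toy Roman -> Devanagari table
VERB_SUFFIXES = ["ti"]            # illustrative suffix list

def stem_sanskrit_verb(word):
    roman = ROMAN.get(word, word)             # module 1: transliterate
    for suffix in VERB_SUFFIXES:              # module 2: rule-based stripping
        if roman.endswith(suffix):
            roman = roman[: -len(suffix)]
            break
    return DEVANAGARI.get(roman, roman)       # module 3: back-transliterate

print(stem_sanskrit_verb("गच्छति"))   # गच्छ
```

In a real implementation the two lookup tables would be replaced by a proper transliteration scheme (e.g. Harvard-Kyoto, as in [23]) and the suffix loop by regular-expression rules over a full affix list.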

5 Conclusion and Future Scope

A stemmer is a linguistic tool used to reduce an inflected word to its base or root form. Stemming for an Indian language like Sanskrit is a comparatively rarely explored area of research. This paper presents a survey and a comparative study of a few important works on various stemmers in the literature. As a result of the comparative study, a rule-based stemmer for Sanskrit verbs is proposed; the same design can be further extended to nouns. Being a morphologically rich language, Sanskrit demands that word inflections and grammar rules be analyzed thoroughly, and all these inflection-governing rules must then be efficiently implemented in the proposed stemming algorithm. The authors intend to build a rule-based Sanskrit stemmer in Python using libraries such as re (regular expressions). Due to limited lexical resources, methods need to be identified to process the language, and since affix removal may generate word ambiguity in many cases, a Sanskrit dictionary must be incorporated.

References

1. J.B. Lovins, Development of a stemming algorithm. Mech. Transl. Comput. Linguistics 11(1–2), 22–31 (1968)
2. A. Jabbar, S. Iqbal, M.U.G. Khan, S. Hussain, A survey on Urdu and Urdu like language stemmers and stemming techniques. Artif. Intell. Rev. 49(3), 339–373 (2018)
3. B. Premjith, C. Chandran, S. Bhat, S. Kp, P. Prabaharan, A machine learning approach for identifying compound words from a Sanskrit text, in Proceedings of the 6th International Sanskrit Computational Linguistics Symposium (IIT Kharagpur, India, Association for Computational Linguistics, 2019), pp. 45–51
4. M.J. Garbade, A simple introduction to natural language processing, in Becoming Human: Artificial Intelligence Magazine. https://becominghuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32. Accessed on 10/05/2020
5. J. Nair, R. Nithya, M.V. Jincy, Design of a morphological generator for an English to Indian languages in a declension rule-based machine translation system, in Advances in Electrical and Computer Technologies (Springer, 2020), pp. 247–258
6. A.G. Jivani et al., A comparative study of stemming algorithms. Int. J. Comp. Tech. Appl. 2(6), 1930–1938 (2011)
7. J.H. Paik, S.K. Parui, A fast corpus-based stemmer. ACM Trans. Asian Lang. Inf. Process. (TALIP) 10(2), 1–16 (2011)
8. The use of word stems in English. https://www.thoughtco.com/stem-word-forms-1692141. Accessed on 10/04/2020
9. Affix | definition & examples | Britannica. https://www.britannica.com/topic/affix. Accessed on 10/04/2020
10. T. Srivastava, NLP: a quick guide to stemming. https://medium.com/@tusharsri/nlp-a-quick-guide-to-stemming-60f1ca5db49e. Accessed on 11/09/2020
11. M. Kulkarni, C. Dangarikar, I. Kulkarni, A. Nanda, P. Bhattacharyya, Introducing Sanskrit wordnet, in Proceedings of the 5th Global Wordnet Conference (GWC 2010) (Narosa, Mumbai, 2010), pp. 287–294
12. J.K. Raulji, J.R. Saini, Sanskrit lemmatizer for improvisation of morphological analyzer. J. Stat. Manage. Syst. 22(4), 613–625 (2019)
13. J. Nair, Generating noun declension-case markers for English to Indian languages in declension rule based MT systems, in 2018 IEEE 8th International Advance Computing Conference (IACC) (IEEE, 2018), pp. 296–302
14. P.M. Scharf, A computational implementation of Pāṇini's derivational morphology of Sanskrit, in On Resources and Tools for Derivational Morphology (DeriMo) (2017), p. 93
15. A. Krishna, P. Satuluri, H. Ponnada, M. Ahmed, G. Arora, K. Hiware, P. Goyal, A graph based semi-supervised approach for analysis of derivational nouns in Sanskrit, in Proceedings of TextGraphs-11: The Workshop on Graph-Based Methods for Natural Language Processing (2017), pp. 66–75
16. M.A. Kumar, V. Dhanalakshmi, K. Soman, S. Rajendran, Factored statistical machine translation system for English to Tamil language. Pertanika J. Soc. Sci. Human. 22(4) (2014)


17. S.K. Mishra, G.N. Jha, Identifying verb inflections in Sanskrit morphology, in Proc. SIMPLE (2004)
18. K. Suba, D. Jiandani, P. Bhattacharyya, Hybrid inflectional stemmer and rule-based derivational stemmer for Gujarati, in Proceedings of the 2nd Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), pp. 1–8 (2011)
19. H. Patel, B. Patel, Stemmatizer—stemmer-based lemmatizer for Gujarati text, in Emerging Trends in Expert Applications and Security (Springer, 2019), pp. 667–674
20. J. Singh, V. Gupta, Text stemming: approaches, applications, and challenges. ACM Comput. Surveys (CSUR) 49(3), 1–46 (2016)
21. H.B. Patil, B. Pawar, A.S. Patil, A comprehensive analysis of stemmers available for Indic languages. Int. J. Nat. Lang. Comput. 5(1), 45–55 (2016)
22. B. Nathani, N. Joshi, G. Purohit, Rule-based derivational stemmer for Sindhi Devanagari using suffix stripping approach, in Smart Systems and IoT: Innovations in Computing (Springer, 2020), pp. 227–235
23. J. Nair, A. Sadasivan, A Roman to Devanagari back-transliteration algorithm based on the Harvard-Kyoto convention, in 2019 IEEE 5th International Conference for Convergence in Technology (I2CT) (IEEE, 2019), pp. 1–6
24. J.N.T. Raviteja, Devatrans—simple tool to transliterate Sanskrit. https://pypi.org/project/devatrans/. Accessed on 10/05/2020

Predicting Prior Academic Failure of Students' Using Machine Learning Approach

Anamika and Maitreyee Dutta

Abstract In educational institutions, research on educational data is in demand due to its predictive power and its role in the decision-making process using machine learning approaches. The present research work can be broadly categorized into two modules. Firstly, data pre-processing is conducted on real-time data of Government Polytechnic College Ambala, Haryana, India. Secondly, the desired data, along with exploratory data analysis and human-interpretable features, is tested on six different classifiers to predict student third-year performance as binary classification based on first- and second-year performance, and a maximum accuracy of 98.7% was achieved. This creates a chance to identify low-performing students so that early interventions can be deployed to prevent them from failing or dropping out. This study also suggests a viable direction for using educational data to obtain insights through the machine learning approach.

Keywords Educational data mining · Predictive analysis · Machine learning · Computer science

Anamika · M. Dutta (B)
NITTTR, Chandigarh, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_12

1 Introduction

The data produced every day by various organizations, such as scientific data, pictures, business transactions, sensor data, educational systems, and learning management systems, is usually disorderly information, so limitless that fetching meaningful knowledge out of such capacious data is almost impossible. Extracting significant information from this voluminous data, which is stored in hard disks, databases, files, warehouses, and various other storage devices, is called data mining. There is a need to develop techniques to harness knowledge intelligently from this rapidly growing data. Data mining is also stated as knowledge discovery in databases (KDD), which means the nontrivial mining of implicit, potentially valuable, and formerly unknown information from stored data [1]. Machine learning, when applied in
data mining, can find possibly beneficial patterns; these patterns are then evaluated to gain insights into the data, using visualization tools to generate tables, reports, discriminant rules, and classification rules. Educational data mining is a field for solving education-related problems and gaining insight into how students learn in educational settings, by seeking improved methods for exploring data that has an eloquent hierarchy at manifold levels. Educational data mining bridges two fields, computing science and education, with machine learning and data mining, as subfields of computing science, both engaged [2]. The educational system can be considered an iterative sequence of creation, testing, and enhancement, and algorithms need to be developed to better understand the educational data collected during teaching and learning processes. To support the learning process, analyze students' behavior, and recommend educational remedies before students plan to drop the course or fail the exam, student performance prediction is needed so that early interventions can be made for low-performing students [3].

The goals of our study are as follows: firstly, to pre-process real-time data and verify the effectiveness of the applied pre-processing techniques with reasonable accuracy, and secondly, to identify the most appropriate machine learning algorithms for predicting students' third-year academic performance based upon the last two academic years.

This paper is organized as follows. Section 2 presents the related work done in this field. In Sect. 3, the work methodology is discussed, and experimental results are presented in Sect. 4. The conclusion of the research is explained in the last section.

2 Related Work

The past decades witnessed vital and active growth in educational databases connected to student information, which have turned out to be a gold mine in educational research. Research objectives have included enhancing learning objectives and improving the understanding of learning methods. The search process was confined to the last decade, in which the acceptance and implementation of student performance prediction have grown. Researchers examined various factors, such as academic performance and social factors, to build predictive models [4]. The main observation of the research is the evaluation of student performance using data mining classification techniques. The loss of academic status can be predicted by the developed model with the help of low academic performance; the models were further evaluated using cross-validation on the data [5]. Various studies have proposed linear regression models to discover the predictive power of undergraduate indicators in combination with variable selection algorithms, and to see whether aggregation of grades enhances prediction performance [6, 7]. Multiple models were integrated to increase predictive power [8], but results show that there was some bias in the data, leading to unreliable models [9, 10]. Studies suggested that a model constructed for one type of institution will not be scalable


easily to other institutions based on predictive modeling [11, 12]. Attributes were selected carefully using the gain ratio feature selection technique [13], the wrapper method [14], and correlations among attributes along with cross-validation. The relationship between preadmission academic profile and final academic result was investigated, and data pre-processing was performed to remove incomplete and redundant data [15]. The integration of alternative solutions and the state of the art based on intelligent computing is also considered. For better performance analysis, ranking or classification of objects within clusters is conducted [16]. Accuracy was improved by incorporating a clustering approach with classification techniques [17]. In [18], it was presented that service value performance is a major factor leaving a huge impact on the behavioral intentions of students. Experiments showed that past performance can act as a key factor in present performance [19, 20]. Multiview early warning systems with genetic programming classification highlight underperforming and underrepresented student populations, with deep analysis made to evaluate prediction accuracy [21, 22].

Previous year performance of students is the base for prediction analysis. In predicting student performance, each subject mark of ongoing studies matters equally among other attributes. In our study, we develop a model for efficient performance prediction by means of data pre-processing, attribute selection, and classification techniques on a real-time dataset, suggesting whether students are progressing or not as compared to previous year marks. The relationship between attributes and the selection of relevant attributes will also be studied.

3 Research Methodology

This section outlines the plan and process of the research, which was conducted in two phases: training and testing. It includes the data sources, the proposed framework, and the theoretical framework. The goal of this research is to compare the effectiveness of existing educational data mining techniques with the proposed approach.

3.1 Data and Sources of Data

The academic data of diploma students was collected from the Directorate of Technical Education, Haryana. Data was extracted from a database containing academic data of nine different courses: Architecture, Auto Engineering, Civil Engineering, Computer Engineering, Electrical Engineering, Electronics and Communication Engineering, Mechanical Engineering, Plastic Technology, and Architecture Assistantship for 2017–2019, with ninety-one attributes initially and 45,000 records on average in one sheet. The data was dirty, inconsistent, and voluminous, so data pretreatment was needed before starting the actual processing. The entire process contains two


stages after extracting data from the database. Firstly, the data is auto-cleaned and corrected using SQL techniques, with proper rules defined. Finally, we extract features from the aggregated data.
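The SQL-side auto-cleaning stage can be sketched with an in-memory SQLite table. The schema, roll numbers, and cleaning rules below are hypothetical stand-ins for the actual institute database:

```python
import sqlite3

# In-memory stand-in for the institute database (schema and values are hypothetical)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE marks (roll TEXT, subject TEXT, score INTEGER)")
con.executemany("INSERT INTO marks VALUES (?, ?, ?)", [
    ("R1", "maths", 72), ("R1", "maths", 72),  # duplicate row
    ("R2", "maths", None),                     # missing score
    ("R3", "maths", 64),
])

# SQL-side auto-cleaning: drop NULL scores and collapse duplicates,
# then aggregate per student -- features are extracted from this aggregate
rows = con.execute("""
    SELECT roll, AVG(score)
    FROM (SELECT DISTINCT roll, subject, score FROM marks
          WHERE score IS NOT NULL)
    GROUP BY roll ORDER BY roll
""").fetchall()
print(rows)  # -> [('R1', 72.0), ('R3', 64.0)]
```

The DISTINCT subquery plays the role of the "proper rules" mentioned above: duplicates and incomplete records never reach the aggregation step.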

3.2 Proposed Methodology

Student performance prediction provides a vision of the future which can be used to control actions in the present. Extracted data is pre-processed in two folds: in one, it goes through various SQL techniques to get rid of messy, inconsistent, and undesired data; the other is the actual pre-processing of the data, where the chances of achieving better model performance and a more accurate prediction for third-year students are higher by expressing the problem as binary classification. Our work is compiled in two modules: the first is data pre-processing, and in the second, the output derived from the first module, along with other human-interpretable features, is used for predictive analysis. Figure 1 is the operational flow diagram of the proposed methodology.

Fig. 1 Flowchart of proposed methodology used in this research

Among all the classification models used in this research, the best-performing model is selected after comparative analysis. Relevant attribute selection is done through exploratory data analysis and then by a correlation heatmap to identify how attributes are correlated. The performance of the proposed method is assessed with the help of different performance parameters.

3.3 Pre-processing Techniques

Pre-processing techniques are applied to the uploaded data in both the training and testing phases to ensure the data is compatible with the proposed work. The entire process contains two stages after extracting data from the database: one uses SQL techniques, and the other uses pre-processing techniques such as normalization and principal component analysis. Pre-processing of data is an initial step in data mining. It converts the raw data into formats which are easier to use and removes noise and irrelevant features. Firstly, the data is processed using SQL techniques; then, the data is cleaned and corrected again using proper rules; and lastly, the data is normalized using min–max normalization.
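Min–max normalization rescales each attribute into the [0, 1] range via x' = (x - min) / (max - min). A minimal sketch on hypothetical marks:

```python
def min_max_normalize(values):
    """Rescale a column of numeric values into [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # a constant column carries no information; map it to 0.0 throughout
    return [(v - lo) / span if span else 0.0 for v in values]

marks = [35, 60, 85, 100]          # hypothetical subject marks
print(min_max_normalize(marks))    # lowest mark maps to 0.0, highest to 1.0
```

Normalizing all mark columns to a common scale prevents attributes with larger raw ranges from dominating distance- or margin-based classifiers such as SVM.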

3.3.1 Attribute Selection

Attribute selection deals with the high dimensionality and unbalanced data issues, as the data contains a large number of attributes, as shown in Fig. 2. Some attributes store null values, some are meaningless, and some lead to important features of the data. So, attributes need to be selected using exploratory data analysis and a correlation heatmap. Exploratory data analysis summarizes the main characteristics of the data and can reveal patterns beyond the formal hypothesis or modeling task.

Fig. 2 List of attributes in the data initially

3.4 Classification Techniques

Classification is a technique in which data is classified into pre-defined classes; in our study, we have the binary classes pass and fail. The goal of the techniques applied to the pre-processed data is to recognize the class under which the testing data will fall. We have done a comparative study of six data mining techniques. Logistic Regression uses the logistic function for classifying the possible outcome of a single trial; a sigmoid function is used to study the influence of the independent variables on a single target variable. Support Vector Machine uses a hyperplane for categorizing data and is robust to high dimensionality. Naïve Bayes is based upon Bayes' theorem and assumes that all variables are independent of each other. Decision Tree constructs the model by splitting data on the basis of the different conditions required for predicting the target and is known for its fine predictive capabilities. Random Forest is a bagging technique in which trees are built in parallel and, once the model is generated, each tree is independent of the others. Gradient Boosting constructs models by ensembling weak prediction classifiers, generally decision trees.
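A sketch of this six-classifier comparison, assuming scikit-learn is available and substituting a synthetic dataset for the (non-public) student records:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Synthetic stand-in for the student dataset: binary pass/fail target
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "Gradient boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)  # held-out accuracy
    print(f"{name}: {acc:.3f}")
```

Precision, recall, and F-measure, as reported in Tables 1 and 2, could be added via sklearn.metrics.classification_report in the same loop.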

4 Results and Discussion

4.1 Experimental Results

In this section, the results of the various predictive models using the classifiers are discussed and presented. Figure 3 depicts the correlation among the finally selected attributes, obtained using exploratory data analysis and human-interpretable features. This data analysis of the attributes listed in Fig. 2 was conducted to gain insight into the relationships of the attributes with each other.

4.2 Comparison of Classification Techniques

The prediction capability of the six selected classifiers without pre-processing is depicted in Table 1, where a comparative analysis is also shown. In terms of accuracy, SVM is highest at 96.7%, with precision and recall of 97.4% and 96.9%, respectively. Logistic regression and Naïve Bayes were the lowest in terms of accuracy. Table 2 depicts the results of the predictive classifiers after pre-processing; it is clear that the performance of the models was enhanced, and SVM outperforms with an accuracy of 98.7%. The accuracy of Naïve Bayes improved the most among all the classifiers used in our study, by 4.1%, when pre-processing techniques were applied. It can be inferred that students are correctly classified into the two classes, pass and fail, when SVM is used, and that it performs better than any of the other methods discussed in our study. The reason for this outperformance is that the data is first pre-processed in the SQL database using SQL commands, where all inconsistency and redundancy are removed; the data is then extracted into a CSV file, which undergoes further pre-processing before being given to the classifier for prediction.

Fig. 3 Correlation among final selected attributes

Table 1 Performance comparison of the classifiers without applying pre-processing techniques used in this study

Classifier            Accuracy  Precision  Recall  F-measure
Logistic regression   89.1      89.7       90.6    90.14
SVM                   96.7      97.4       96.9    97.14
Decision tree         93.5      92.24      92.8    92.5
Naïve Bayes           90.8      91.5       92.3    91.8
Random forest         95.1      92.14      94.8    93.45
Gradient boosting     93.7      94.5       95.7    95.09

Table 2 Performance comparison of the classifiers with pre-processing techniques used in this study

Classifier            Accuracy  Precision  Recall  F-measure
Logistic regression   90.15     90         91.6    90.7
SVM                   98.7      98.1       98.2    98.14
Decision tree         91.5      93.2       92.4    92.8
Naïve Bayes           94.7      95         96.2    95.5
Random forest         96.1      97.4       96.8    97.09
Gradient boosting     95.7      96         96.7    96.3

5 Conclusion

From our study, it can be concluded that the graduating results of diploma students in the final year can be effectively predicted using the academic performance of the previous two years. Among all the classifiers used, SVM outperforms. The results indicate that the proposed work offers better performance than the existing approaches, and an efficient model for student performance prediction was developed to improve prediction performance. In future, research will place more emphasis on real factors such as the social and economic situation, attendance, and in-class behavior of students that affect overall student academic performance, and a more reliable model will be constructed using ensemble techniques.

Acknowledgements I would like to thank the Directorate of Technical Education Haryana for providing the academic data of diploma students. I would also like to thank my guide for her continuous support and encouragement in carrying out this research.

References

1. S. Liao, P. Chu, Hsiao, Data mining techniques and applications—a decade review. Exp. Syst. Appl. 39, 11303–11311 (2012)
2. S.D. Gheware, A.S. Kejkar, Tondare, Data mining: task, tools, techniques and applications. Int. J. Adv. Res. Comput. Commun. Eng. 3, 8095–8098 (2014)
3. R. Baker, K. Yacef, The state of educational data mining: a review and future visions. J. Educ. Data Mining 1, 3–16 (2009)
4. J. Zimmerman, K.H. Brodersen, H.R. Heinimann, A model-based approach to predicting graduate-level performance using indicators of undergraduate-level performance. J. Educ. Data Mining 7, 151–176 (2015)
5. M.G. Asogbon, O.W. Samuel, M.O. Omisore, A multi-class support vector machine approach for students' academic performance prediction. Int. J. Multidiscip. Current Res. 4, 210–215 (2016)
6. S.M. Merchan, J.A. Duarte, Analysis of data mining techniques for constructing a predictive model for academic performance. IEEE Lat. Am. Trans. 14, 2783–2788 (2016)
7. A. Zollanvari, R.C. Kizilirmak, Y.H. Kho, Predicting students' GPA and developing intervention strategies based on self-regulatory learning behaviors. IEEE Access 5, 23792–23802 (2017)
8. E.B. Costa, B. Fonseca, M.A. Santana, F.F. Araujo, J. Rego, Evaluating the effectiveness of educational data mining techniques for early prediction of students' academic failure. Comput. Hum. Behav. 73, 247–256 (2017)
9. R. Asif, A. Merceron, S.A. Ali, N.G. Haider, Analysing undergraduate students' performance using educational data mining. Comput. Educ. 113, 177–194 (2017)
10. S. Qu, K. Li, S. Zhang, Y. Wang, Predicting achievement of students in smart campus. IEEE Access 6, 60264–60273 (2018)
11. F. Yanga, F.W.B. Li, Study on student performance estimation, student progress analysis and student potential based on data mining. Comput. Educ. 123, 90–108 (2018)
12. E. Fernandes, M. Holanda, V. Borges, Educational data mining: predictive analysis of academic performance of public-school students in the capital of Brazil. J. Bus. Res. 95, 335–343 (2019)
13. G. Kostopoulos, S. Karlos, Multiview learning for early prognosis of academic performance: a case study. IEEE Trans. Learn. Technol. 12, 212–224 (2019)
14. A. Cano, J.D. Leonard, Interpretable multiview early warning system adapted to student populations. IEEE Trans. Learn. Technol. 12, 198–211 (2019)
15. A. Polyzou, G. Karypis, Feature extraction for next-term prediction of poor student performance. IEEE Trans. Learn. Technol. 12, 237–248 (2019)
16. D. Baneres, M. Seera, Rodríguez-Gonzalez, An early feedback prediction system for learners at-risk within a first-year higher education course. IEEE Trans. Learn. Technol. 12, 249–263 (2019)
17. A.I. Adekitan, O. Salau, The impact of engineering students' performance in the first three years on their graduation result using educational data mining. Heliyon 5, 12–32 (2019)
18. L. Eglington, P. Pavlik, Predictiveness of prior failures is improved by incorporating trial duration. J. Educ. Data Mining 11, 1–19 (2019)
19. M. Hussain, W. Zhu, W. Zhang, Using machine learning to predict student difficulties from learning session data. Artif. Intell. Rev. 52, 381–407 (2019)
20. T. Toivonen, I. Jormanainen, Augmented intelligence in educational data mining. Smart Learn. Environ. 6, 1–25 (2019)
21. A. Yusuf, A. John, Classifiers ensemble and synthetic minority oversampling techniques for academic performance prediction. Int. J. Inform. Commun. Technol. 8(122), 127 (2019)
22. S. Tsai, C. Chen, Y. Shiao, Precision education with statistical learning and deep learning: a case study in Taiwan. Int. J. Educ. Technol. High. Educ. 17, 20–33 (2020)

Deep Classifier for News Text Classification Using Topic Modeling Approach

Megha Singla and Maitreyee Dutta

Abstract The classification of text documents present over the Internet has become a vital and crucial part of discovering knowledge about text. A variety of approaches has been proposed by researchers in this field of natural language processing (NLP). In most existing approaches, the semantics of the text are not taken into consideration, which is not always realistic. In this paper, a proposal for keeping the semantic relationships intact by not removing the stop words in the feature extraction step, and then applying deep learning, is given. The most relevant features are extracted using bi-grams and topic modeling. Next, a comparative study of different machine learning and deep learning classifiers is carried out. This paper shows that the proposed method of feature extraction with a deep learning classifier, long short-term memory, is better than other classification techniques, achieving an accuracy of 96.1%.

Keywords Text classification · Topic modeling · Latent Dirichlet allocation (LDA) · Long short-term memory (LSTM) · Deep learning

M. Singla · M. Dutta (B)
NITTTR, Chandigarh, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_13

1 Introduction

We exist in a space where information is produced over the Internet in abundance. Hence, analyzing and mining actual knowledge from this ample information is a tedious task. Every second, something is happening somewhere in the world, making it news. The number of articles generated every day is remarkably high, and the quantity of data is so vast that manual processing and primitive application techniques are not acceptable. Therefore, the necessity of advancements in techniques for gathering knowledge and then classifying the text data is emphatically greater. News articles have the tendency to be separated based on different geographic locations, politics, individuals, groups, and many more meaningful notations [1]. A major and most important category can be classification based on the content of the article, for example, sports, crops, history,
monuments, etc. Extracting the various topics related to what an article is about can help in this classification process, and topic modeling is one of the techniques for this task [2]. Textual data can be managed, organized, and understood using topic modeling. Text data is basically unstructured data which is difficult to construe. Every word in a sentence has some meaning, but it is difficult for the machine to process each and every word, as the news articles present over the Internet are surplus. There is a need to find the relevant corpus which actually partakes in classification and provides effective results [3]. A relevant term can be a single word or a combination of words. In text data, it is said that a combination of words contains more information than a single word, as a single word can have different meanings in different sentences. For example, "neural" and "network" give different meanings separately, while in combination they clearly suggest computer science and machine learning [4]. Keeping this semantic relationship in mind, bi-gram techniques are used in this paper. Bi-grams will hold the semantic relationship, but using bi-grams can enlarge the corpus; to reduce this, stop words are taken into consideration. Instead of removing them directly, stop words are used to split the sentences in the article, and bi-grams are then applied over the resulting corpus [5]. This technique reduces the corpus and, in addition, extracts only relevant bi-grams. Next, topic modeling with latent Dirichlet allocation is applied to the extracted corpus. A comparative study of different machine and deep learning classification models is conducted for classifying news text data.

The rest of the paper is organized as follows. Section 2 presents the existing approaches taken by researchers for text classification and their key points. Section 3 demonstrates the proposed methodology used to carry out the research and briefly describes how the technologies are used. Section 4 illustrates the results of the proposed methodology for feature extraction and a comparative study of different classification algorithms. In Sect. 5, the conclusion of the research is presented.
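The stop-word-guided bi-gram extraction described above can be sketched in plain Python; the small stop-word list is an illustrative placeholder for a full list such as NLTK's:

```python
import re

# Small illustrative stop-word list (a real system would use a fuller one)
STOP_WORDS = {"the", "is", "a", "of", "and", "in", "for", "about"}

def stop_word_bigrams(text):
    """Split the text at stop words, then form bi-grams only inside each
    fragment -- so bi-grams never straddle a stop word, shrinking the corpus."""
    tokens = re.findall(r"[a-z]+", text.lower())
    fragments, current = [], []
    for tok in tokens:
        if tok in STOP_WORDS:
            if current:
                fragments.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        fragments.append(current)
    return [f"{w1}_{w2}" for frag in fragments for w1, w2 in zip(frag, frag[1:])]

print(stop_word_bigrams("The neural network is trained for winter wheat export"))
# -> ['neural_network', 'winter_wheat', 'wheat_export']
```

Pairs such as "network_is" or "for_winter" are never generated, which is exactly how the stop words prune the bi-gram corpus without being stripped before splitting.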

2 Related Work

Classification of text has been an important way in natural language processing of extracting knowledge from the enormous text data available over the Internet. Categorizing text into the most appropriate class is a major area within this field for improved retrieval. The effectiveness of the classification is based on how the features are extracted from the large corpus and what classification method is used. Working with the semantics of text documents, figuring out what exactly the text is about, and then categorizing documents into the most relevant class has been a trend pursued by many researchers [6–8]. Earlier researchers worked with models which do not emphasize semantics but treat every word as an independent term, treat words with multiple meanings as a single term, and also take synonyms as separate words, such as bag of words [9–11] and TF–IDF techniques [12–14]. Numerous variants of these techniques emerged, but there was still less focus on the semantics;


in reality, these semantics play a significant role in understanding the text. For machines to understand and classify text, semantics can play a vigorous role [15, 16]. The bi-gram approach guarantees that semantic linkages, the relations between at least two words, are included. Kilimci et al. used the bi-gram technique for feature extraction but did not address redundant and irrelevant bi-grams, which lead to an increase in the size of the corpus [17]. The same was the situation with Fabio et al. [18]. Later, topic modeling was introduced to find the hidden connotations and topics of articles and was verified to be decent for classification. Various researchers have used topic modeling through its different techniques. Neogi et al. used latent semantic analysis (LSA) in their research and showed significant results, but LSA has difficulty in analyzing the dimensions of a large corpus [19]. References [20, 21] have also used topic modeling but have not considered semantic relationships. Miha et al. proposed short text classification using latent Dirichlet allocation; the corpus they used is small, yet topic modeling performs well in identifying the topics from the corpus [22]. Numerous studies have classified documents by applying machine and deep learning classifiers: Naïve Bayes [23], support vector machine [24], K-nearest neighbors [25], decision trees, fuzzy systems, and convolutional neural networks.

3 Research Methodology

In this section, the different components and techniques used in the research work are discussed in detail.

3.1 Dataset

In this research work, to carry out the classification and analyze the performance of the various techniques applied in series, the Reuters-21578 dataset is used. Proposals involving an abundance of different techniques have been made on this dataset. The number 21578 indicates that the dataset contains a total of 21,578 documents, holding articles on different categories. There are around 90 categories and sub-categories in the data. The data is skewed, so to remove that problem in the research, there is a requirement to pick categories which can be straightforwardly worked upon. The necessity of cleaning the data before use arises, as does selecting which categories to work upon. Thus, a total of five categories are selected, and the documents related to those categories are picked.


3.2 Proposed Methodology Text classification is a way in which textual data available over Internet is automatically dispensed to the classes or categories after learning the given features. The goal of this paper is around the news text data classification, Reuters 21,578 dataset, to get the documents classified to a particular class which is most closely belongs. This task is accomplished by traversing the following steps in proposed methodology as shown in Fig. 1. After collecting the dataset and cleaning it, pre-processing steps applied. Before applying the feature extraction, the pre-processed data is fragmented using the stop words, and then stop words are removed. This is the key step to decrease the irrelevant corpus. Next step is to apply the feature extraction step, i.e., bi-gram and

Fig. 1 Flowchart of proposed methodology

Deep Classifier for News Text Classification …

143

Table 1 Snippet of article from Reuters dataset and conversion after applying lowercasing Text data

Lowercase text

“Reserves I, II, and III have matured. Level IV reflects grain entered after October 6, 1981, for feed grain and after July 23, 1981, for wheat. Level V wheat/barley after 5/14/82, corn/sorghum after 7/1/82. Level VI covers wheat entered after January 19, 1984…”

“reserves i, ii, and iii have matured. Level iv reflects grain entered after October 6, 1981, for feed grain and after July 23, 1981, for wheat. Level v wheat/barley after 5/14/82, corn/sorghum after 7/1/82. level vi covers wheat entered after January 19, 1984,”

Table 2 Snippet of article from Reuters dataset and conversion after applying lemmatization

Text data: "Reserves I, II, and III have matured. Level IV reflects grain entered after October 6, 1981, for feed grain and after July 23, 1981, for wheat. Level V wheat/barley after 5/14/82, corn/sorghum after 7/1/82. Level VI covers wheat entered after January 19, 1984…"

Lemmatized text: "reserves i, ii, and iii have matured. level iv reflects grain entered after october 6, 1981, for feed grain and after july 23, 1981, for wheat. level v wheat/barley after 5/14/82, corn/sorghum after 7/1/82. level vi covers wheat entered after january 19, 1984,"

topic modeling with latent Dirichlet allocation. Finally, classification is performed on the training and testing datasets using long short-term memory.

3.3 Data Pre-processing Pre-processing is one of the most vital steps in text classification. It is used to clean the text and prepare the data for the subsequent techniques. The techniques used are:

1. Lowercasing: Lowercasing is the most effective step in text pre-processing; it provides consistency throughout the dataset and in the expected output (Table 1).
2. Lemmatization: This step converts all words into their root words and removes all the inflections in a word. It results in converting every word into its base form, making the text cleaner and easier to map (Table 2).
3. Noise Removal: Noise removal cleans the data by removing irrelevant numbers and special characters, except the stop words (Table 3).

3.4 Feature Extraction In this section, feature extraction techniques are explained.


Table 3 Snippet of article from Reuters dataset and conversion after removing noise

Text data: "Reserves I, II, and III have matured. Level IV reflects grain entered after October 6, 1981, for feed grain and after July 23, 1981, for wheat. Level V wheat/barley after 5/14/82, corn/sorghum after 7/1/82. Level VI covers wheat entered after January 19, 1984…"

Normalized text: "reserves i, ii and iii have matured. level iv reflects grain entered after october 6, 1981, for feed grain and after july 23, 1981, for wheat. level v wheat/barley after 5/14/82, corn/sorghum after 7/1/82. level vi covers wheat entered after january 19, 1984,"

Table 4 Latent topics extracted using LDA

Bi-grams: "Barley_export, Europe_wheat, wheat_free, weekly_crop, winter_wheat, topsoil_condtion, insect_active, spray_field, windy_weather,"

Topics from LDA: Crops, fields, weather, agriculture

Stopwords Processing and Bi-gram The training data is now given to the model; whenever a stop word is encountered, the sentence is split and the stop word is removed. Bi-grams are then applied to these split sentences [18]. A bi-gram retains the relationship between two words and makes it possible for the machine to understand that relationship. This process captures the co-occurrence of words and makes it easier to grasp what the sentence actually conveys. Latent Dirichlet Allocation LDA is a topic modeling technique in which the latent topics inside the corpus are found. These latent topics are formed of the words that weigh most toward a particular topic; those words are drawn from the collected articles and placed in the matrix. In this paper, LDA is applied on the corpus extracted by the stop-word processing and bi-gram processing. LDA finds a fixed number of topics over 15 epochs (Table 4).
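The stop-word splitting and bi-gram step described above can be sketched in plain Python (illustrative only: the stop-word list and tokenization below are assumptions, and the paper additionally feeds the resulting corpus to LDA):

```python
# Minimal sketch of splitting sentences at stop words, then forming bi-grams.
STOP_WORDS = {"the", "is", "a", "an", "of", "for", "and", "after", "have"}

def split_on_stop_words(tokens):
    """Cut the token stream at every stop word and drop the stop word."""
    chunks, current = [], []
    for tok in tokens:
        if tok in STOP_WORDS:
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        chunks.append(current)
    return chunks

def bigrams(chunks):
    """Join adjacent words inside each chunk, preserving co-occurrence."""
    return [f"{c[i]}_{c[i + 1]}" for c in chunks for i in range(len(c) - 1)]

tokens = "level vi covers wheat entered after january 19 1984".split()
print(bigrams(split_on_stop_words(tokens)))
```

Because splitting happens at stop words, no bi-gram ever spans a stop word, which keeps only meaningful word pairs in the corpus.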

3.5 Classification Techniques A comparative analysis of different machine learning and deep learning classification techniques is carried out in this model. The machine learning classifiers compared for classifying the news text into predefined categories are Naïve Bayes, support vector machine, and decision tree. The deep learning classifiers considered are the convolutional neural network and long short-term memory.


Table 5 Performance of classifiers and proposed classifier (LSTM) to classify Reuters dataset

Technique                 Accuracy   Precision   Recall   F-Measure
Naïve Bayes               58.2       58.4        57.9     58.1
Decision tree             78.2       78.6        78.5     78.4
Support vector machine    85.0       84.6        84.9     84.8
CNN                       93.5       93.2        92.1     92.1
LSTM                      96.1       96.0        96.3     96.2

Fig. 2 Micro- and macro-F1 of SVM and LSTM

4 Results and Discussion In this section, the experimental results are presented, and a comparative analysis of machine learning and deep learning classifiers is performed. Performance is evaluated in terms of accuracy, precision, recall, and micro- and macro-F1. Table 5 shows the performance of the different classifiers; the proposed LSTM classifier outperforms all of them. Figure 2 compares the micro- and macro-F1 of the best performing machine learning algorithm and the best performing deep learning algorithm against the number of features taken into consideration. The results clearly show that the proposed LSTM classifier outperforms all other classification methods considered. LSTM complemented the feature extraction techniques: splitting using stop words, bi-grams, and topic modeling. LSTM gave the highest accuracy, 96.1%, among all machine learning and deep learning classifiers because of its memory cells and gates. LSTM is a variant of recurrent neural networks that mitigates the vanishing gradient problem and works well with variable-length data.
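The micro- and macro-F1 scores plotted in Fig. 2 can be computed from per-class counts as in the following generic sketch (not the authors' code; the counts shown are made up):

```python
def f1(tp, fp, fn):
    """F1 score from true-positive, false-positive and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(per_class):
    """per_class: one (tp, fp, fn) tuple per category.
    Macro-F1 averages the per-class F1 scores; micro-F1 pools the counts,
    so it weighs frequent classes more heavily on a skewed corpus."""
    macro = sum(f1(*c) for c in per_class) / len(per_class)
    pooled = [sum(c[i] for c in per_class) for i in range(3)]
    return f1(*pooled), macro

# Five categories, mirroring the five selected Reuters classes (counts made up)
counts = [(90, 10, 5), (40, 5, 20), (70, 15, 10), (60, 10, 10), (80, 5, 15)]
micro, macro = micro_macro_f1(counts)
print(round(micro, 3), round(macro, 3))
```

On a skewed corpus such as Reuters-21578, the gap between micro- and macro-F1 indicates how much a classifier's performance depends on the frequent classes.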

5 Conclusion In this study, a feature extraction technique using bi-grams and topic modeling is proposed which works on the stop words of the news dataset. This technique results


in the extraction of improved and relevant features, which in turn truly contribute to classification and improve accuracy. Based on the empirical results, we can conclude that LSTM, a variant of the recurrent neural network (RNN), outperforms all the other classification techniques (Naïve Bayes, support vector machine, decision tree, convolutional neural network). In this research, emphasis was placed on extracting the most pertinent features, which are essential for classification. In future research, more datasets can be compared, such as medical data and the 20 Newsgroups data, and larger datasets can be considered. Different feature extraction techniques, such as tri-grams and variants of LDA, can be applied to achieve better results.

References
1. K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, D. Brown, Text classification algorithms: a survey. Information 10, 150 (2019)
2. W. Bin, H. Yong, W.X. Yang, L. Xing, Short text classification based on strong feature thesaurus. J. Zhejiang Univ. Sci. C 13(9), 649–659 (2012)
3. Naresh, B.P. Kumar, Vijaya, V.S. Pruthvi, K. Anusha, V. Akshatha, Survey on classification and summarization of documents. SSRN, 7–15 (2020)
4. K. Kadhim, Survey on supervised machine learning techniques for automatic text classification. Artif. Intell. 273–292 (2019)
5. W. Chuan, Y. Wang, Liu, J. Ji, G. Feng, Composite feature extraction and selection for text classification. IEEE Access 7, 35208–35219 (2019)
6. L. Siwei, X. Liheng, L. Kang, J. Jun, Recurrent convolutional neural networks for text classification. Natl. Conf. Artif. Intell. 3, 2267–2273 (2015)
7. L. Qing, W. Jing, Z. Dehai, Y. Yun, Text features extraction based on TF-IDF associating semantic, in IEEE 4th International Conference on Computer and Communications (ICCC), Chengdu, China, pp. 2338–2343 (2018)
8. S. Bharath, F. Dave, D. Engin, F. Hakan, D. Murat, Short text classification in twitter to improve information filtering, in 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '10), Association for Computing Machinery, New York, USA, pp. 841–842 (2010)
9. W. Jin, L. Ping, F.H. Mary, N. Saeid, A. Kouzani, Bag-of-words representation for biomedical time series classification. Biomed. Sig. Process. Control 8(6), 634–644 (2013)
10. A. Berna, C.G. Murat, Semantic text classification: a survey of past and recent advances. Inf. Process. Manage. 54(6), 1129–1153 (2018)
11. Z. Zuo, J. Li, P. Anderson, L. Yang, N. Naik, Grooming detection using fuzzy-rough feature selection and text classification, in 2018 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–8 (2018)
12. Y. Gao, Y. Xu, Y. Li, B. Liu, A two-stage approach for generating topic models, in Advances in Knowledge Discovery and Data Mining. PAKDD 2013, eds. by J. Pei, V.S. Tseng, L. Cao, H. Motoda, G. Xu. Lecture Notes in Computer Science, vol. 7819 (Springer, Berlin, 2013)
13. A. Zubiaga, Enhancing Navigation on Wikipedia with Social Tags (2012). arXiv preprint arXiv:1202.5469
14. S.W. Kim, J.M. Gil, Research paper classification systems based on TF-IDF and LDA schemes. Hum.-centric Comput. Inf. Sci. 9, 30 (2019)
15. S. Albitar, S. Fournier, B. Espinasse, An effective TF/IDF-based text-to-text semantic similarity measure for text classification, in Web Information Systems Engineering—WISE 2014, eds. by B. Benatallah, A. Bestavros, Y. Manolopoulos, A. Vakali, Y. Zhang. Lecture Notes in Computer Science, vol. 8786 (Springer, Cham, 2014)
16. H. Kilimci, S. Akyokuş, N-gram pattern recognition using multivariate-Bernoulli model with smoothing methods for text classification, in 24th Signal Processing and Communication Application Conference (SIU), Zonguldak, pp. 597–600 (2016)
17. F. Fábio, R. Leonardo, C. Thierson, S. Thiago, A. Marcos, M. Wagner Jr., Word co-occurrence features for text classification. Inf. Syst. 36(5), 843–858 (2011)
18. P.P.G. Neogi, A.K. Das, S. Goswami, J. Mustafi, Topic modeling for text classification, in Emerging Technology in Modelling and Graphics, eds. by J. Mandal, D. Bhattacharya. Advances in Intelligent Systems and Computing, vol. 937 (Springer, Singapore, 2020)
19. L. Baoji, X. Wenhua, T. Yuhui, C. Juan, A phrase topic model for large-scale corpus, in IEEE 4th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), Chengdu, China, pp. 634–639 (2018)
20. Y. Zhu, L. Li, L. Luo, Learning to classify short text with topic model and external knowledge, in Knowledge Science, Engineering and Management. KSEM 2013, ed. by M. Wang. Lecture Notes in Computer Science, vol. 8041 (Springer, Berlin, 2013)
21. P. Miha, P. Vili, Text classification method based on self-training and LDA topic models. Expert Syst. Appl. 80, 83–93 (2018)
22. S. Xu, Bayesian Naïve Bayes classifiers to text classification. J. Inf. Sci. 44(1), 48–59 (2018)
23. Y. Tan, An improved KNN text classification algorithm based on k-medoids and rough set, in 10th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), Hangzhou, pp. 109–113 (2018)
24. S. Günal, Hybrid feature selection for text classification. Turk. J. Electr. Eng. Comput. Sci. 20, 1296–1311 (2012)
25. Y. Wang, S. Sohn, A. Liu, A clinical text classification paradigm using weak supervision and deep representation. BMC Med. Inform. Decis. Mak. 19, 1 (2018)

Forecasting Covid-19 Cases in India using Multivariate Hybrid CNN-LSTM Model Abhishek Parashar and Yukti Mohan

Abstract During this worldwide crisis, it is well known that the whole world has been hit by a plenitude of untimely deaths caused by the pandemic. The lockdowns in various countries have affected the lives of human beings in many ways. Because of this, it becomes necessary to study the complex interplay of various factors, ranging from macro-scale components such as population density, mortality rate, and recovery rate to individual components such as diabetic patients, smokers, gender, and age. A major concern of higher authorities is the accurate forecasting of COVID-19 cases and the role of various factors in COVID-19 spread, to assist policymakers in understanding the economic situation of the country as well as the factors which affect the current mortality rate. The presented work aims to resolve these concerns by proposing a multivariate hybrid model that takes all the aforementioned factors into account to forecast COVID-19 cases. The proposed model consists of a convolutional neural network (CNN) layer for feature extraction and long short-term memory (LSTM) layers to forecast COVID-19 cases, thus exhibiting the inherent advantages of both. The model is trained and tested on the online available dataset acquired from various resources. Experimental results show that the proposed model can forecast the number of cases in the coming month with a mean absolute error equal to 1.78, a training accuracy of 90.63%, and a validation accuracy of 95.48%. Keywords Multivariate analysis · Time series forecasting · CNN · LSTM · COVID-19 cases · Hybrid model

1 Introduction On December 31, 2019, the World Health Organization China Country Office registered the first case of pneumonia of unknown etiology, detected in the Chinese city of Wuhan. This unknown agent was identified as a novel coronavirus on A. Parashar · Y. Mohan (B) Maharaja Surajmal Institute of Technology, C-4, Janakpuri, New Delhi 110058, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_14


January 7, 2020. Subsequently, COVID-19 was declared a pandemic on March 11 [15], and the number of cases has been constantly increasing since then. As of October 2, nearly seven months after the declaration of the pandemic, there were a total of 34,495,176 confirmed cases, out of which the total death count was 1,025,729 worldwide. The number of recovered cases is far greater than the total number of death cases, signifying the mildness of coronavirus. India had 6,467,066 total COVID-19 cases and 100,774 deaths as reported by the World Health Organization [14], making India the country with the second highest number of cases. Coronavirus disease 2019, or COVID-19, is a severe acute respiratory syndrome caused by the SARS-CoV-2 pathogen [5]. Mainstream media articles and case reports from various countries [1, 4, 6, 10] indicate that the symptoms of a coronavirus-infected person are like those of any other common flu. It is also known that it takes around 5–14 days for the symptoms to appear; thus, it becomes extremely difficult to identify an infected person. Previous works [8, 9, 11, 17] of various authors have provided the inspiration to perform a multivariate analysis for COVID-19 and thus forecast future values based on the findings. The presented work displays the insights that the authors have obtained from the analysis of the available data. We have studied how various factors, such as population density, mortality rate, recovery rate, diabetes, smoking tendency, age, and gender, affect a person's susceptibility to coronavirus infection. The results obtained from the study have been verified using previous works and reports as well. We have also predicted the number of cases for 15 days using a combined model of CNN and LSTM with an MAE score equal to 1.78. Figure 1 represents the procedure followed in the presented approach to forecast the number of COVID-19 cases using a hybrid model consisting of CNN and LSTM layers.
The rest of the paper is organized as follows: Sect. 2 explains the preprocessing of data using the windowing technique. The proposed model consisting of CNN and LSTM layers is explained in Sect. 3. Section 4 gives a brief description of the dataset used. Section 5 gives a detailed discussion of experimental results of the proposed model. Finally, the conclusion has been drawn in Sect. 6.

Fig. 1 Block diagram representation of CNN- and LSTM-based COVID-19 forecasting


2 Windowing Time series data is generally preprocessed before being fed to any deep learning model to increase the efficiency of the model. In signal processing and statistics, windowing has been well explored for detecting transient events and for time averaging [12]. In the case of time series data, windowing has been used to determine the seasonality and trend of the data and thus comment on its stationarity. If the trend and seasonality of the data are changing, it is said to be non-stationary time series data [3]. In general, a windowing function (also known as an apodization function) is zero valued outside some chosen interval and normally symmetric around the midpoint of that interval, often with a bell shape such as a Gaussian. The product of any other function with the window function is zero for values outside the window interval. Thus, a large dataset can easily be broken into smaller chunks of data to ease the analysis. This property of windowing is used in the presented work. Specifically, the sliding window method [16] is used, wherein the data is divided into smaller parts known as windows, and windows are further divided into buffers. This helps in reducing computational complexity and in extracting important spatio-temporal features from small chunks using the 1D CNN layer.
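The sliding-window step can be sketched in plain Python (a minimal illustration; the window size of 30 matches the training setup reported later, while the stride of 1 is an assumption):

```python
def sliding_windows(series, window_size=30, stride=1):
    """Split a series into overlapping (window, target) pairs:
    each window of `window_size` past values predicts the next value."""
    pairs = []
    for start in range(0, len(series) - window_size, stride):
        window = series[start:start + window_size]
        target = series[start + window_size]
        pairs.append((window, target))
    return pairs

daily_cases = list(range(100))            # placeholder for daily case counts
pairs = sliding_windows(daily_cases)
print(len(pairs), pairs[0][1])            # 70 pairs; the first target is day 30
```

In practice, the resulting pairs would then be shuffled and batched (the text below uses a batch size of 32 and a shuffle buffer of 1000) before being fed to the CNN-LSTM model.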

3 The Proposed Model The proposed model has a one-dimensional convolutional layer as the input layer, which strides over the input data and extracts important features. These features are then fed to the LSTM network, which recursively learns from the data, thus becoming aware of the information it contains. This mixed model of CNN and LSTM layers is lightweight, accurate, and has low computational complexity. The parameters of interest, namely the total number of trainable parameters, the loss (mean absolute error) incurred by the model, and the time elapsed (in seconds) in training the model, have been calculated for the proposed model. The architecture of the model is shown in Fig. 2. It comprises one CNN layer to gather important features and patterns present in the data. The next two LSTM layers play the most important role in learning the patterns and sequences needed to make the prediction. The two dense layers after that fulfill the same purpose; they are used since the complexity of a dense layer is lower than that of an LSTM layer. The last dense layer is the output layer, which gives out the final prediction (Table 1).


Fig. 2 Proposed model architecture

Table 1 Model structure of the proposed model

Layer (type)         Output shape        Param #
conv1d (Conv1D)      (None, None, 32)    192
lstm (LSTM)          (None, None, 64)    24,832
lstm_1 (LSTM)        (None, None, 64)    33,024
dense (Dense)        (None, None, 30)    1,950
dense_1 (Dense)      (None, None, 10)    310
dense_2 (Dense)      (None, None, 1)     11
lambda (Lambda)      (None, None, 1)     0
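The parameter counts in Table 1 can be checked against the standard layer formulas; a short sketch in plain Python (the Conv1D kernel size of 5 over a single input feature is inferred from the 192-parameter count and is not stated in the text):

```python
# Verify the trainable-parameter counts reported in Table 1.
# Assumption: the Conv1D layer has kernel size 5 and one input feature.

def conv1d_params(in_ch, filters, kernel):
    # One weight per (kernel tap, input channel, filter) plus one bias per filter
    return kernel * in_ch * filters + filters

def lstm_params(in_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * ((in_dim + units) * units + units)

def dense_params(in_dim, units):
    return in_dim * units + units

layers = [
    conv1d_params(1, 32, 5),   # conv1d  -> 192
    lstm_params(32, 64),       # lstm    -> 24,832
    lstm_params(64, 64),       # lstm_1  -> 33,024
    dense_params(64, 30),      # dense   -> 1,950
    dense_params(30, 10),      # dense_1 -> 310
    dense_params(10, 1),       # dense_2 -> 11
]
print(sum(layers))             # 60319 trainable parameters in total
```

The total of 60,319 trainable parameters matches the figure quoted in the results discussion.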

4 Dataset Description The proposed network has been trained using the datasets available online.

1. Ministry of Health and Family Welfare, Government of India: The details regarding day-wise total cases, confirmed cases, and active as well as recovered cases have been taken from the site of the Ministry of Health and Family Welfare, Government of India. This site also provides the data regarding daily testing done by the Indian Council of Medical Research.
2. Dataset regarding state-wise data: The dataset to analyze the trend of the COVID-19 data in various states is taken from the Web site https://www.covid19india.org/.
3. Country-wise data: The dataset to analyze myths regarding the COVID-19 pandemic has been taken from https://www.worldometers.info/.
4. Novel Coronavirus 2019 Dataset: This dataset is available on Kaggle for the research community. In the presented work, the details regarding age group and state-wise health facilities have been taken from this dataset.
5. Our World in Data: This data has many variables of potential interest, out of which data regarding diabetic patients, smoking and non-smoking persons, etc., has been used in the presented approach.

5 Experimental Results and Discussion Several predictions have been made by researchers in the past to forecast COVID-19 cases by analyzing the confirmed cases to date. The present work not only focuses on the confirmed cases but also considers various other susceptible features which might affect the spread of coronavirus, and predicts COVID-19 cases using a hybrid CNN-LSTM model. The research aims to assist the government in decision-making and policy-making to ease the life of citizens at the time of this pandemic spread.

5.1 COVID-19 Forecasting The proposed model is trained on various datasets of 275 days (starting from January 30 to October 30) available online. The data is split among training, testing, and validation set. Out of 275 days, 165 days have been used for training (January 30 to July 12), 82 days (July 13 to October 2) for testing, and 28 days (October 3 to October 30) for validation, which accounts for 60% training data, 30% testing data, and 10% validation data, respectively. The data is first preprocessed using the windowing technique with a window size of 30, batch size of 32, and shuffle buffer size of 1000. The filtered data is then fed to the hybrid CNN-LSTM model for training.


Fig. 3 Loss characteristics of the proposed model

Figure 3 shows the loss characteristics of the proposed model. The left side of the figure shows the loss in terms of mean absolute error (MAE) for the given learning rate [2, 13]. To give a clearer picture of the proposed model, a graph is plotted of three different features of the model: the number of parameters, the loss (MAE), and the time taken. It may be observed that the proposed model takes 700 s to train its 60,319 parameters, incurring an MAE loss of 1.78. It has a reasonable number of parameters to train and executes quickly with a decent loss. The specifications of the device are as follows: i7 processor, 8 GB RAM, 4 GB Nvidia 1050ti graphics card. In order to predict future cases, the authors have performed a multivariate analysis of the following factors: recovery rate, mortality rate, diabetic patients, gender, smoking tendency, and population density. The forecasting performed by the presented model is shown in Fig. 4. The training and validation accuracy for this model are shown in Fig. 5. Fig. 4 Comparison of confirmed versus predicted cases


Fig. 5 Training and validation accuracy for the presented model

6 Conclusion We have presented a multivariate analysis based on a hybrid CNN-LSTM model, with a training accuracy and validation accuracy equal to 90.63% and 95.48%, respectively. The accuracy plot is shown in Fig. 5. The mean absolute error (MAE) is equal to 1.78, and the time taken to compute the predictions using the given model is 700 s, which proves the efficiency of the model in terms of time.

References
1. M. Cascella, M. Rajnik, A. Cuomo, S.C. Dulebohn, R. Di Napoli, Features, evaluation and treatment coronavirus (COVID-19), in StatPearls (StatPearls Publishing, 2020)
2. T. Chai, R.R. Draxler, Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 7(3), 1247–1250 (2014)
3. V.K.R. Chimmula, L. Zhang, Time series forecasting of COVID-19 transmission in Canada using LSTM networks. Chaos, Solitons & Fractals 135, 109864 (2020)
4. T.G. Ksiazek, D. Erdman, C.S. Goldsmith, S.R. Zaki, T. Peret, S. Emery, S. Tong, C. Urbani, J.A. Comer, W. Lim, et al., A novel coronavirus associated with severe acute respiratory syndrome. New England J. Med. 348(20), 1953–1966 (2003)
5. C.-C. Lai, T.-P. Shih, W.-C. Ko, H.-J. Tang, P.-R. Hsueh, Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease-2019 (COVID-19): the epidemic and the challenges. Int. J. Antimicrob. Agents 55, 105924 (2020)
6. Y. Liu, L.-M. Yan, L. Wan, T.-X. Xiang, A. Le, J.-M. Liu, M. Peiris, L.L.M. Poon, W. Zhang, Viral dynamics in mild and severe cases of COVID-19. Lancet Infect. Dis. 20, 656–657 (2020)
7. G. Pandey, P. Chaudhary, R. Gupta, S. Pal, SEIR and Regression Model Based COVID-19 Outbreak Predictions in India (2020). arXiv preprint arXiv:2004.00958
8. D. Parbat, et al., A Python based support vector regression model for prediction of Covid19 cases in India. Chaos, Solitons & Fractals 138, 109942 (2020)
9. H. Panwar, P.K. Gupta, M.K. Siddiqui, R. Morales-Menendez, V. Singh, Application of deep learning for fast detection of COVID-19 in X-rays using nCOVnet. Chaos, Solitons & Fractals 138, 109944 (2020)
10. J.S.M. Peiris, S.T. Lai, L.L.M. Poon, Y. Guan, L.Y.C. Yam, W. Lim, J. Nicholls, W.K.S. Yee, W.W. Yan, M.T. Cheung, et al., Coronavirus as a possible cause of severe acute respiratory syndrome. The Lancet 361(9366), 1319–1325 (2003)
11. B.K. Sahoo, B.K. Sapra, A data driven epidemic model to analyse the lockdown effect and predict the course of COVID-19 progress in India. Chaos, Solitons & Fractals 139, 110034 (2020)
12. L. Stanković, T. Alieva, M.J. Bastiaans, Time–frequency signal analysis based on the windowed fractional Fourier transform. Sig. Process. 83(11), 2459–2468 (2003)
13. C.J. Willmott, K. Matsuura, Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 30(1), 79–82 (2005)
14. World Health Organization, Coronavirus disease (COVID-2019) situation reports. https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports, cited 2020
15. World Health Organization, WHO Director-General's opening remarks at the media briefing on COVID-19—11 March 2020. https://www.who.int/director-general/speeches/detail/who-director-general-s-opening-remarks-at-the-media-briefing-on-covid-19, cited 2020
16. A.D. Wyner, J. Ziv, The sliding-window Lempel-Ziv algorithm is asymptotically optimal. Proc. IEEE 82(6), 872–877 (1994)
17. M. Yadav, M. Perumal, M. Srinivas, Analysis on novel coronavirus (COVID-19) using machine learning methods. Chaos, Solitons & Fractals 139, 110050 (2020)

Multi-resolution Video Steganography Technique Based on Stationary Wavelet Transform (SWT) and Singular Value Decomposition (SVD) Reham A. El-Shahed, M. N. Al-Berry, Hala M. Ebeid, and Howida A. Shedeed Abstract The amount of digital data is always increasing, and some of this data is secret data that should be protected. Steganography algorithms have therefore been proposed to hide important data, whether text, image or video. This paper proposes a steganography algorithm to hide a secret image in a video using the multi-resolution stationary wavelet transform (SWT) and the singular value decomposition (SVD). A three-resolution 3D SWT is applied to the cover video frames. Then, SVD is applied to one of the transform sub-bands. SVD is also applied to the secret image, and the two singular matrices are combined in the embedding. At last, the inverse transform is performed to obtain the stego-video. The results of hiding the image at each resolution are compared, and hiding in the third resolution achieved the best peak signal-to-noise ratio (PSNR) value. The algorithm was tested using different color and grayscale images and showed promising qualitative and quantitative results. Keywords Image hiding · Stationary wavelets transform · Video steganography · Singular value decomposition

R. A. El-Shahed (B) · M. N. Al-Berry · H. M. Ebeid · H. A. Shedeed Scientific Computing Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt e-mail: [email protected] M. N. Al-Berry e-mail: [email protected] H. M. Ebeid e-mail: [email protected] H. A. Shedeed e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_15


1 Introduction With the advancement of the Internet, digital content is continuously increasing. This content might be exposed to illegal copying or hacking. To protect the copyright of digital content, watermarking techniques were proposed; steganography techniques were proposed to protect digital content from being hacked. In steganography, the cover is updated in a way that enables only the sender and the intended receiver to find the hidden message embedded within it. A few properties must be considered while creating any digital hiding technique, namely imperceptibility, hiding capacity and robustness. Imperceptibility means constructing a stego-cover that is similar to the original cover, so that the human visual system is not able to distinguish between them. Hiding capacity involves a trade-off between hiding a great amount of secret data and preserving the visual quality of the cover object. Finally, robustness is the ability of the hidden data to remain safe even if the stego-cover undergoes any attacks while being transferred or processed [1]. Liu et al. [2] suggested dividing video steganography algorithms into three categories according to the secret message embedding position. Intra-embedding methods are classified according to the video compression levels, such as intra-prediction or motion vector. In the pre-embedding category, steganography algorithms operate on the raw version of the video. Post-embedding methods focus on the bit streams, which means that the process of embedding and extraction is applied entirely to the compressed bit stream. A basic steganography model is built as shown in Fig. 1. The cover and secret data can be of any type of digital data, such as a text file, an image, a sound file or a video [3]. Steganography methods can be categorized as substitution methods, spread spectrum methods, statistical methods, transform domain methods, distortion and cover generation methods [1].
Substitution methods depend on replacing the redundant parts of the cover object with the secret data. In spread spectrum methods, spread spectrum communication is used to conceal the secret data. Statistical methods depend on the statistical properties of the cover object. Distortion techniques hide data through signal distortion and deviation from the original cover. Cover generation methods create a cover for information hiding [1]. In transform domain methods, secret data is concealed in another domain, such as the discrete cosine transform (DCT) [4], discrete wavelet transform (DWT) [5] or discrete Fourier transform (DFT) [6]. There are three basic steps in any transform domain technique. First, the cover object is converted to the frequency domain. Then, the transformed coefficients are used to embed the secret message. At last, those updated coefficients are converted back to the original form to produce the stego-object [7]. Fig. 1 Basic steganography model
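As a concrete illustration of the substitution category, the classic least-significant-bit (LSB) scheme replaces the lowest bit of each cover sample with one bit of the secret message (a minimal sketch; practical schemes add keys, spreading and error correction):

```python
def lsb_embed(cover, bits):
    """Replace the least significant bit of each cover sample with a secret bit."""
    return [(c & ~1) | b for c, b in zip(cover, bits)]

def lsb_extract(stego, n):
    """Read back the first n hidden bits."""
    return [s & 1 for s in stego[:n]]

cover = [120, 121, 122, 123, 200, 201, 202, 203]    # e.g. 8-bit pixel values
secret = [1, 0, 1, 1, 0, 0, 1, 0]
stego = lsb_embed(cover, secret)
assert lsb_extract(stego, len(secret)) == secret            # message recovered
assert all(abs(c - s) <= 1 for c, s in zip(cover, stego))   # change imperceptible
```

Each sample changes by at most one intensity level, which illustrates the imperceptibility property, but such spatial-domain substitution is fragile to compression, which motivates the transform domain methods discussed next.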


A new technique is introduced in this paper. This technique depends on a multi-resolution 3D SWT and SVD to conceal an image within a video file. SWT provides more hiding capacity, and SVD enhances the visual quality of the stego-video. Thus, by using both techniques, 3D SWT and SVD, we have enhanced the performance of video steganography with reference to PSNR and SSIM. The rest of the paper is organized as follows: Sect. 2 introduces the previous work pursued in hiding a secret message within a video using transform techniques. Then, an explanation of SWT and SVD and the proposed technique are presented in detail in Sect. 3. In Sect. 4, the experimental results are presented. Finally, Sect. 5 includes the main conclusions of the paper.
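The property of SWT that provides the extra hiding capacity is that its sub-bands are not downsampled. A one-level 1D Haar SWT with perfect reconstruction can be sketched as follows (the paper itself uses a three-resolution 3D SWT on video frames, typically computed with a wavelet library; the circular boundary handling here is an assumption):

```python
def haar_swt(signal):
    """One-level undecimated (stationary) Haar transform, circular boundary.
    Both sub-bands keep the full signal length."""
    n = len(signal)
    approx = [(signal[i] + signal[(i + 1) % n]) / 2 for i in range(n)]
    detail = [(signal[i] - signal[(i + 1) % n]) / 2 for i in range(n)]
    return approx, detail

def haar_iswt(approx, detail):
    """Invert: each sample is recovered exactly from one (a, d) pair."""
    return [a + d for a, d in zip(approx, detail)]

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
a, d = haar_swt(x)
assert haar_iswt(a, d) == x      # perfect reconstruction
assert len(a) == len(x)          # sub-bands are not downsampled
```

An embedding scheme would then modify one sub-band (in the proposed method, via the singular values obtained from SVD) and apply the inverse transform to produce the stego-video.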

2 Literature Review Steganography can be carried out in different domains, and many techniques that use these domains have been proposed. Our suggested technique is a transform domain technique; therefore, in this section, the existing data hiding techniques that use the wavelet transform are reviewed. Sharma et al. [8] proposed combining cryptography and steganography techniques along with neural network techniques to hide an image within another image. This combination provided more security. The image was successfully embedded and extracted. The algorithm's performance was measured by the mean square error (MSE) for the secret and cover images, and the results showed a minimal error. Sarreshtedari and Ghaemmaghami [9] proposed an image steganography technique based on the wavelet transform. The proposed approach used the wavelet transformation of the cover image to embed secret data. Firstly, the capacity of all DWT blocks was estimated; the insertion operation was then performed over each block. This algorithm achieved great capacity compared to other algorithms. Siddharth et al. [10] proposed an image hiding technique based on SVD [11] and the integer wavelet transform (IWT) [12]. Experimental results confirmed that the proposed technique was robust against different types of attacks such as JPEG compression, rotation, low-pass filtering, scaling, noise addition and histogram equalization. In [13], Ramadhan et al. used two types of transform domain approaches to hide a text message in a video using multiple object tracking (MOT) and error correcting code (ECC) techniques. The DCT and DWT coefficients of all motion regions were used to hide the secret message. This resulted in a robust and safe steganography method in terms of PSNR and the similarity index measure (SIM). Hemalatha et al. [14] depended on different image color spaces to conceal the secret message. In their proposed method, the red–green–blue (RGB) and YCbCr color spaces were compared.
The secret message was a grayscale image. IWT coefficients were used to embed the secret image. The results showed that the resolution of the stegoimages is good in the RGB space with reference to the PSNR values.

160

R. A. El-Shahed et al.

An audio–video steganography technique was proposed by Kakde et al. [15]. It used DWT and SVD to hide secret images within video frames and used the least significant bit (LSB) to hide a text message in the audio track of the video. The technique achieved satisfactory results in both audio and video steganography. Hamsathvani [16] suggested using DWT and SVD to hide an image within a video sequence. First, the mean square error (MSE) of the video frames was calculated, and the frame with the lowest MSE was selected to conceal the secret image. In that algorithm, the secret image was not embedded directly in the cover's wavelet coefficients; it was embedded in the singular matrix of the DWT coefficients of the cover video frame. The PSNR values showed that this technique provided the best resolution of the reconstructed image. A different technique was proposed by Danti et al. [17], in which secret data were embedded in two ways, sequentially and randomly. The red channel of the cover video was transformed using DWT, and the secret video frames were embedded in the cover's wavelet coefficients. The red channel was then reconstructed and concatenated with the green and blue channels to produce the stego-video. The proposed technique showed promising results in quality and accuracy, with the sequential embedding method showing better PSNR and MSE results. Swaraja et al. [18] proposed a blind watermarking technique based mainly on DCT and SVD: 2D DCT was applied to all 8 × 8 blocks of the video frames, the singular values of the image were inserted in the DCT coefficients, and the average PSNR was 36.66. Swaraja et al. proposed another watermarking technique to protect medical images [19] that used the redundant discrete wavelet transform (RDWT) and SVD, with an average PSNR of 32.34. These algorithms suffered from the false-positive problem introduced by SVD, so Swaraja et al. proposed a further watermarking technique that used RDWT and the Hadamard transform [20]. The average PSNR for this algorithm was 53.4.

3 Proposed Method 3.1 Stationary Wavelet Transform The stationary wavelet transform (SWT) [21], also known as the un-decimated wavelet transform [22], is a translation-invariant version of the discrete wavelet transform: the coefficients are not decimated at any transformation level. It is more computationally complex, but it is useful in change detection, denoising, and pattern recognition applications. Figure 2 shows the digital implementation of SWT, and Fig. 3 shows the filters applied in SWT. In the proposed method, three-resolution 3D SWT is applied to the cover video frames, resulting in eight sub-bands. Figure 4 shows a one-level 3D SWT analysis of the cover video frames. The 3D SWT coefficients are calculated as follows:

Multi-resolution Video Steganography Technique Based …

161

Fig. 2 Digital implementation of SWT

Fig. 3 SWT filters

Fig. 4 One-level stationary wavelet transform analysis

c_{j+1}[x, y, t] = (h^{(j)} h^{(j)} h^{(j)} ∗ c_j)[x, y, t]    (1)

ω^1_{j+1}[x, y, t] = (g^{(j)} h^{(j)} h^{(j)} ∗ c_j)[x, y, t]    (2)

ω^2_{j+1}[x, y, t] = (h^{(j)} g^{(j)} h^{(j)} ∗ c_j)[x, y, t]    (3)

ω^3_{j+1}[x, y, t] = (g^{(j)} g^{(j)} h^{(j)} ∗ c_j)[x, y, t]    (4)

ω^4_{j+1}[x, y, t] = (h^{(j)} h^{(j)} g^{(j)} ∗ c_j)[x, y, t]    (5)

ω^5_{j+1}[x, y, t] = (g^{(j)} h^{(j)} g^{(j)} ∗ c_j)[x, y, t]    (6)

ω^6_{j+1}[x, y, t] = (h^{(j)} g^{(j)} g^{(j)} ∗ c_j)[x, y, t]    (7)

ω^7_{j+1}[x, y, t] = (g^{(j)} g^{(j)} g^{(j)} ∗ c_j)[x, y, t]    (8)


where h and g are the analysis filters associated with the scaling and wavelet functions, respectively; h(n) and g(n) are their impulse responses; and h[n] = h[−n], g[n] = g[−n], n ∈ Z, are their time-reversed versions.
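The absence of decimation can be seen in a minimal one-level, one-dimensional sketch of the transform (a simplification of the 3D case above; the Haar-like filter taps and the circular indexing are illustrative assumptions, not the paper's implementation):

```python
# One-level 1-D stationary wavelet transform: the outputs keep the
# full input length because no decimation is performed, which is why
# SWT sub-bands offer more hiding capacity than decimated DWT ones.

def swt_level(signal, h=(0.5, 0.5), g=(0.5, -0.5)):
    """Return (approximation, detail) lists, each of len(signal)."""
    n = len(signal)
    approx = [h[0] * signal[t] + h[1] * signal[(t + 1) % n] for t in range(n)]
    detail = [g[0] * signal[t] + g[1] * signal[(t + 1) % n] for t in range(n)]
    return approx, detail

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
a, d = swt_level(x)
print(len(a), len(d))  # 8 8 -- same length as the input
```

In the 3D case, applying h or g along each of the x, y, and t axes in turn yields the eight sub-band combinations of Eqs. (1)–(8).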

3.2 Singular Value Decomposition Video frames can be treated as images, and any image X is an M × N matrix of non-negative scalar values. The singular value decomposition factorizes X into two orthogonal matrices U and V and a diagonal matrix S [11]:

X = U × S × V^T    (9)
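As a quick numerical sanity check of Eq. (9), the singular values of a 2 × 2 matrix can be obtained from the eigenvalues of X^T X. The hand-rolled route below is used only to keep the sketch dependency-free; in practice a library routine such as MATLAB's svd() computes the full decomposition. The sample matrix is illustrative:

```python
import math

def singular_values_2x2(X):
    """Singular values of a 2x2 matrix from the eigenvalues of X^T X."""
    (a, b), (c, d) = X
    t = a*a + b*b + c*c + d*d        # trace of X^T X (squared Frobenius norm)
    det = (a*d - b*c) ** 2           # det(X^T X) = det(X)^2
    disc = math.sqrt(max(t*t - 4*det, 0.0))
    s1 = math.sqrt((t + disc) / 2)
    s2 = math.sqrt((t - disc) / 2)
    return s1, s2

s1, s2 = singular_values_2x2([[3.0, 0.0], [4.0, 5.0]])
print(round(s1 * s2, 6))        # 15.0  == |det X|
print(round(s1**2 + s2**2, 6))  # 50.0  == squared Frobenius norm
```

The two printed identities (product of singular values equals |det X|, sum of their squares equals the squared Frobenius norm) hold for any SVD and make the decomposition easy to verify.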

3.3 The Proposed Method

Fig. 5 Proposed video steganography algorithm

The suggested method is based on converting the cover video to the wavelet domain. As shown in Fig. 5, three-resolution 3D SWT is first applied to the cover video frames. Then, SVD is applied to one of the SWT sub-bands and to the secret image. The cover's singular values are modified to conceal the image's singular values


using the value of alpha. Finally, inverse 3D SWT is applied to produce the stego-video.

3.4 Embedding Process Steps

(1) Read the cover video.
(2) Read the secret image.
(3) Apply SWT to the video frames:
    (a) If the secret image is a grayscale image, SWT is applied to the blue channel only of the video frames.
    (b) If the secret image is a color image, SWT is applied to all RGB color components of the video frames.
(4) Apply SVD to the LLH sub-band.
(5) Apply SVD to the secret image.
(6) Modify the singular values of the LLH sub-band as follows:

S′_c = S_c + α · S_i    (10)

where S_c is the singular matrix of the cover sub-band LLH, S_i is the singular matrix of the secret image, and α is a scaling factor ranging from 0 to 1.
(7) Apply inverse SWT to generate the stego-video.

3.5 Extraction Process Steps

(1) Read the stego-video.
(2) Apply SWT to the video frames:
    (a) If the secret image is a grayscale image, SWT is applied to the blue channel only of the video frames.
    (b) If the secret image is a color image, SWT is applied to all RGB color channels of the video frames.
(3) Apply SVD to the LLH sub-band.
(4) Calculate the singular values of the secret image as follows:

S_i = (S′_c − S_c) / α    (11)
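The arithmetic of Eqs. (10) and (11) is easy to verify on the singular values alone, assuming the SVDs of the cover LLH sub-band and the secret image have already been computed. The sample values are illustrative, and α = 0.25 is chosen here only so the recovery is exact in floating point; the paper's experiments use α = 0.2:

```python
def embed(S_cover, S_secret, alpha):
    # Eq. (10): S'_c = S_c + alpha * S_i
    return [sc + alpha * si for sc, si in zip(S_cover, S_secret)]

def extract(S_stego, S_cover, alpha):
    # Eq. (11): S_i = (S'_c - S_c) / alpha
    return [(ss - sc) / alpha for ss, sc in zip(S_stego, S_cover)]

S_c = [120.0, 45.5, 9.25]   # singular values of the cover LLH sub-band
S_i = [80.0, 30.0, 4.0]     # singular values of the secret image
S_stego = embed(S_c, S_i, alpha=0.25)
print(extract(S_stego, S_c, alpha=0.25))  # [80.0, 30.0, 4.0]
```

Note that extraction requires the original cover's singular values S_c, which makes the scheme non-blind.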


4 Experimental Results 4.1 Performance Criteria The visual quality of the stego-video and the secret image is measured in terms of PSNR [23] and the structural similarity index measure (SSIM) [23], given in (13) and (14). PSNR depends on the mean square error (MSE), which for an x × y image M and its noisy approximation N is computed as:

MSE = (1/(x·y)) Σ_{i=0}^{x−1} Σ_{j=0}^{y−1} [M(i, j) − N(i, j)]²    (12)

The PSNR is computed as:

PSNR = 10 · log10(MAX²_M / MSE)    (13)

where MAX_M is the maximum image pixel value. The structural similarity index measure (SSIM) measures the similarity between two images. It is calculated as:

SSIM(x, y) = ((2 μ_x μ_y + c1)(2 σ_xy + c2)) / ((μ_x² + μ_y² + c1)(σ_x² + σ_y² + c2))    (14)

where x and y are two windows of common size, μ_x is the average of x, μ_y is the average of y, σ_x² is the variance of x, σ_y² is the variance of y, and σ_xy is the covariance of x and y.
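Equations (12)–(14) can be sketched directly for small grayscale windows given as plain lists. MAX = 255 for 8-bit images; the c1 and c2 stabilizing constants below follow the common (0.01·255)² and (0.03·255)² convention, which the paper does not specify:

```python
import math

def mse(M, N):
    # Eq. (12): mean squared error over an x-by-y image
    x, y = len(M), len(M[0])
    return sum((M[i][j] - N[i][j]) ** 2 for i in range(x) for j in range(y)) / (x * y)

def psnr(M, N, max_val=255.0):
    # Eq. (13); identical images would give infinite PSNR
    e = mse(M, N)
    return float("inf") if e == 0 else 10.0 * math.log10(max_val ** 2 / e)

def ssim(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    # Eq. (14) on two flat windows of common size
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((v - mx) ** 2 for v in x) / n
    vy = sum((v - my) ** 2 for v in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx * mx + my * my + c1) * (vx + vy + c2))

M = [[100, 110], [120, 130]]
N = [[101, 110], [120, 128]]   # a slightly distorted copy
print(round(psnr(M, N), 2))    # about 47.2 dB
print(ssim([100, 110, 120, 130], [100, 110, 120, 130]))  # 1.0 for identical windows
```

An SSIM of exactly 1 is reached only for identical windows, which matches the near-1 SSIM values the paper reports for its extracted images.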

4.2 Results and Discussion The proposed algorithm is implemented in MATLAB. To measure its performance, several simulations have been carried out with two types of secret images, grayscale and color, and the embedding algorithm was tested on the three resolutions of the SWT analysis. The cover video is 15 s long with a frame size of 288 × 384, and the secret image is 288 × 384. The mother wavelet used is db4, and the alpha value is 0.2. Figure 6 shows a video frame before and after embedding the image. Figure 7 shows the image “Lena” after being extracted from the stego-video at different resolution levels. In Fig. 8, images before embedding in the cover video are compared with the images after extraction. Table 1 shows the performance of the algorithm in different


Fig. 6 a Original cover frame and b the same frame after embedding the secret image

Fig. 7 a Extracted image from the first resolution, b the extracted image from the second resolution and c the extracted image from the third resolution

resolution levels, and Table 2 shows the results for different images extracted from the third resolution. As the figures show, the qualitative results are very good: the human visual system (HVS) cannot distinguish between the original and extracted images at any resolution level. Quantitatively, the average PSNR is 74 dB and the SSIM is near 1. The comparison with the algorithms proposed in [18–20] in terms of average PSNR, shown in Table 3, proves that the imperceptibility of the


Fig. 8 a, c, e, g Original secret images. b, d, f, h The extracted images


Table 1 Performance of the algorithm with different resolution levels

Level of resolution | Image PSNR | Image SSIM | Cover PSNR
Level one           | 39         | 0.98       | 43
Level two           | 42         | 0.98       | 60
Level three         | 43         | 0.98       | 73

Table 2 Performance of the algorithm with different images in the third resolution

Image  | Image PSNR | Image SSIM | Cover PSNR
Plane  | 43         | 0.98       | 73
Lena   | 40.9       | 0.99       | 76
Moon   | 45         | 0.99       | 74
Pepper | 32         | 0.98       | 73

Table 3 Performance of the algorithm compared to other techniques

                    | Used techniques  | Average PSNR
Swaraja [18]        | DCT + SVD        | 36.66
Swaraja [19]        | RDWT + SVD       | 32.34
Swaraja [20]        | RDWT + Hadamard  | 53.4
Proposed algorithm  | SWT + SVD        | 74

proposed algorithm is very high. Hiding a color image does not impact the quality of the cover video.

5 Conclusion This paper gives a brief review of different steganography algorithms that rely on the wavelet transform domain. The proposed algorithm depends on SWT and SVD to hide a secret image: the cover video is transformed into the stationary wavelet domain, SVD is applied to the secret image and to the wavelet coefficients, the singular values of the secret image are embedded into those of the cover using the scaling factor alpha, and the inverse transform is then performed to return to the original form. The algorithm was tested at different resolution levels, and the third level shows the best quality in terms of PSNR. Different grayscale and color images were used; the average PSNR value is 74, and the average SSIM value is 0.99. The comparative analysis proves that the imperceptibility of the algorithm is very high. In future work, the algorithm should be enhanced to be robust against attacks and to increase the hiding capacity so that a video can be hidden.


References
1. C.P. Sumathi, T. Santanam, G. Umamaheswari, A study of various steganographic techniques used for information hiding. Int. J. Comput. Sci. Eng. Surv. (IJCSES) 4(6), 9–25 (2013)
2. Y. Liu, S. Liu, Y. Wang, H. Zhao, S. Liu, Video steganography: a review. Neurocomputing 335, 238–250 (2019)
3. R.J. Mstafa, K.M. Elleithy, E. Abdelfattah, Video steganography techniques: taxonomy, challenges, and future directions, in IEEE Long Island Systems, Applications and Technology Conference (LISAT), Farmingdale, NY, pp. 1–6 (2017)
4. N. Ahmed, T. Natarajan, K.R. Rao, Discrete cosine transform. IEEE Trans. Comput. C-23, 90–93 (1974)
5. A. Graps, An introduction to wavelets. IEEE Comput. Sci. Eng. 2(2), 50–56 (1995)
6. S.W. Smith, The Scientist and Engineer's Guide to Digital Signal Processing, pp. 141–167 (1999)
7. M.M. Sadek, A.S. Khalifa, M.G.M. Mostafa, Video steganography: a comprehensive review. Multimed. Tools Appl. 74, 7063–7094 (2015)
8. S. Kartik, A. Ashutosh, S. Tanay, G. Deepak, K. Ashish, Hiding data in images using cryptography and deep neural network. J. Artif. Intell. Syst. 1, 143–162 (2019)
9. S. Sarreshtedari, S. Ghaemmaghami, High capacity image steganography in wavelet domain, in 2010 7th IEEE Consumer Communications and Networking Conference, Las Vegas, NV, pp. 1–5 (2010)
10. S. Singh, R. Singh, T.J. Siddiqui, Singular value decomposition based image steganography using integer wavelet transform, in Advances in Signal Processing and Intelligent Recognition Systems, Advances in Intelligent Systems and Computing, vol. 425 (Springer, Cham, 2016)
11. M.S. Wang, W.C. Chen, A hybrid DWT-SVD copyright scheme based on K-means clustering and visual cryptography. Comput. Standard Interf. 31(4), 750–762 (2009)
12. A.R. Calderbank, I. Daubechies, W. Sweldens, B.L. Yeo, Wavelet transforms that map integers to integers. Appl. Comput. Harmon. Anal. 5(3), 332–369 (1998)
13. R.J. Mstafa, K.M. Elleithy, E. Abdelfattah, A robust and secure video steganography method in DWT-DCT domains based on multiple object tracking and ECC. IEEE Access 5, 5354–5365 (2017)
14. S. Hemalatha, U. Dinesh Acharya, A. Renuka, Comparison of secure and high capacity color image steganography techniques in RGB and YCbCr domains. Int. J. Adv. Inf. Technol. (IJAIT) 3(3) (2013)
15. Y. Kakde, P. Gonnade, P. Dahiwale, Audio-video steganography, in International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), Coimbatore, pp. 1–6 (2015)
16. A. Hamsathvani, Image hiding in video sequence based on MSE. Int. J. Electron. Comput. Sci. Eng. (2012)
17. A. Danti, G.R. Manjula, B.M. Pallavi, Wavelet based color video steganography using sequence and random technique. Int. J. Eng. Appl. Sci. Technol. 1(8), 294–298 (2016). ISSN 2455-2143
18. K. Swaraja, Y. Madhaveelatha, V.S.K. Reddy, Robust video watermarking by amalgamation of image transforms and optimized firefly algorithm. Int. J. Appl. Eng. Res. 11(1), 216–225 (2016)
19. K. Swaraja, Protection of medical image watermarking. J. Adv. Res. Dyn. Control Syst. (JARDCS), Special Issue 11 (2017)
20. T. Yasasvy, K. Venkat Sushil, K. Meenakshi, K. Swaraja, P. Kora, A hybrid blind watermarking with redundant discrete wavelet and Hadamard transform. Int. J. Innov. Technol. Explor. Eng. (IJITEE) 8(11) (2019). ISSN 2278-3075
21. O.G. Sundararajan, The Discrete Stationary Wavelet Transform: A Signal Processing Approach, Chapter 13, pp. 234–246 (2015)
22. J.-L. Starck, J. Fadili, F. Murtagh, The undecimated wavelet decomposition and its reconstruction. IEEE Trans. Image Process. 16(2), 297–309 (2007)
23. A.G. George, A.K. Prabavathy, A survey on different approaches used in image quality assessment. Int. J. Emerg. Technol. Adv. Eng. 3(2), 197–203 (2013)

A Novel Dual-Threshold Weighted Feature Detection for Spectrum Sensing in 5G Systems

Parnika Kansal, M. Gangadharappa, and Ashwni Kumar

Abstract The objective of this research work is to overcome the issues of secondary users experiencing different fading environments, changing signal-to-noise ratios, and spectral congestion in 5G environments. A novel dual-threshold weighted feature detection (DTWFD) technique is proposed that introduces relay centers and uses SNR-based weighted factors to overcome these issues. The weights are assigned to the secondary users interacting cooperatively, and the users are given preference according to their weights. To achieve this, orthogonal frequency division multiplexing (OFDM) modulated signals are used, and the fundamental autocorrelation property of the cyclic prefix is exploited as the statistical feature for sensing. The final decisions are taken by the relay center according to the weights assigned to the secondary users, with higher-weight users contributing more to the final decision-making process. DTWFD outperforms the existing algorithms in terms of the detection and false-alarm probability metrics, proving its worth for a 5G environment. Keywords Dual threshold · 5G · SNR weight factor · Weighted feature detection · Weighted energy detection · Spectrum sensing

P. Kansal (B) · A. Kumar, Department of Electronics and Communication, Indira Gandhi Delhi Technical University, Delhi 110006, India
M. Gangadharappa, Department of Electronics and Communication, Ambedkar Institute of Advanced Communication Technologies and Research, Delhi 110031, India; e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022. A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_16

1 Introduction
The upcoming wireless technologies and 5G systems face the issue of increasing numbers of wireless users, leading to overcrowding of the spectrum. This is due to static allotment of the spectrum, in which a predetermined number of users are allotted the spectrum. Cognitive radio networks [1] operate in a manner such that no interference


is created for the primary users. This is achieved using spectrum sensing to detect free bands and determine the existence of a primary user. Many conventional and novel spectrum sensing techniques are available in the literature; the well-known ones are energy detection [2–4], feature detection [5–7], and matched filter detection [8]. These conventional techniques, however, are ill-suited to the 5G environment. Energy detection has many limitations, such as the hidden terminal issue and channel fading, and detecting the spectrum at low SNR is a challenge for it. Cyclostationary detection can give a trustworthy performance under low SNR, and feature detection sensing of orthogonal frequency modulated signals [9, 10] has been proposed in the literature. But these techniques have been implemented in a single-threshold environment to satisfy the spectrum needs of earlier wireless generations. In the upcoming 5G systems there is diversity, and different secondary users suffer from different fading conditions. The existing techniques [11–13] ignore this issue and treat all secondary users alike, disregarding the fact that each cognitive user has a different SNR. So, it is necessary to exploit this SNR difference among the SUs. Wan et al. [14] discuss a spectrum sensing technique in a double-threshold environment in which an objective function is used to minimize the error probability and a dynamic threshold adjustment is used to reduce the noise power uncertainty. The issue of noise power uncertainty has also been addressed using the double-threshold energy detection technique in [15]. Gupta et al. [16] present a modified crow search algorithm to extract the usability features and find the optimal solution. Another existing double-threshold technique is presented in [17], but it does not address the issues faced by 5G environments. Rani et al. [18] present a Chinese remainder theorem technique that is used to share healthcare data among an optimal number of clients. Chavhan et al. [19] discuss the concept of network slicing in a 5G cell for achieving various network targets. Sekaran et al. [20] discuss an AI-based system for performing combined spectrum access and selection for 5G networks. Therefore, in this research work, a novel dual-threshold weighted feature detection (DTWFD) technique is proposed that introduces relay centers for making intelligent decisions about spectrum detection to satisfy the needs of 5G networks. To achieve this, an SNR-based weighted factor is assigned to the secondary users participating in cooperative sensing, so that a secondary user with greater SNR is involved more in the decision making, while one with a low SNR weight factor is not considered reliable. An OFDM modulated system model is proposed in which the rudimentary autocorrelation property of the cyclic prefix of the OFDM signal is exploited as a statistical feature for sensing. The estimated test statistic of the autocorrelation coefficient is compared with dual thresholds, and the secondary users then transmit their decisions to the relay center, where the final decision takes place. This final decision on whether the spectrum is free is based on the SNR-based weighted factors allotted to the secondary users. The proposed technique has been compared with existing techniques from notable publications, and it outperforms them for all performance metrics, the most important being the detection


and false alarm probabilities. The paper is organized as follows: Sect. 2 discusses the proposed DTWFD system model, Sect. 3 presents the SNR-based weighted factor algorithm for the proposed DTWFD, Sect. 4 discusses the performance evaluation and simulation results, and Sect. 5 concludes the paper.

2 Proposed Dual-Threshold Weighted Feature Detection (DTWFD) System Model In this section, the proposed system model is presented. The primary users access the licensed band, and the secondary users participate in sensing cooperatively. The proposed system model introduces a relay center that makes decisions based on the local sensing decisions of the secondary users by using a novel dual-threshold weighted feature detection that utilizes an SNR-based weighted factor. This determines which secondary users contribute more to the decision-making process of the relay center, based on the weighted factors allotted to them. The spectrum sensing problem can be formulated as a binary hypothesis model with two hypotheses H_0 and H_1:

H_0 : y(t) = n(t)    (1)

H_1 : y(t) = h(t) x(t) + n(t)    (2)

where H_0 represents that the primary user (PU) does not exist, H_1 represents that the primary user exists, y(t) is the OFDM signal received by the secondary user, x(t) is the transmitted signal of the primary user, n(t) is additive white Gaussian noise, and h(t) is the coefficient of the channel between the primary and secondary users. The statistical attributes of the signal can be exploited by feature detection spectrum sensing, which can be established both in time and space. In this paper, an orthogonal frequency division multiplexing (OFDM)-based system model is considered. The OFDM signals are employed for sensing through the insertion of a cyclic prefix (CP) and the use of its autocorrelation coefficient. The feature used as the test statistic for spectrum sensing is the autocorrelation coefficient of the OFDM symbol. The following parameters are considered for an OFDM symbol: T_d denotes the number of data symbols (the size of the inverse FFT), and T_c denotes the number of symbols in the cyclic prefix. A total of S OFDM symbols is considered for observation. The autocorrelation coefficient can be represented as [21]:

α = (T_c / (T_c + T_d)) · (SNR / (1 + SNR))    (3)


Now, for the same autocorrelation coefficient α, the maximum likelihood estimate can be calculated as [21]:

α̂ = (1 / (2L ρ_z²)) Σ_{t=0}^{2L−1} R{x(t) x*(t + T_d)}    (4)

where R{·} denotes the real part of a complex number, L = S · (T_d + T_c) is the number of samples used in the maximum likelihood estimate, and ρ_z² is the variance of the observed signal. For evaluating the spectrum sensing performance, the false alarm and detection probabilities can be defined as:

P_f = P(α̂ > λ | H_0) = (1/2) erfc(√L · λ)    (5)

Therefore, the decision threshold λ can be estimated from the above equation as:

λ = (1/√L) · erfc⁻¹(2 P_f)    (6)

P_d = P(α̂ > λ | H_1) = (1/2) erfc(√L · (λ − α) / (1 − α²))    (7)
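Equations (3)–(7) can be exercised end to end on a toy cyclic-prefixed signal. The sketch below builds S noiseless OFDM-like symbols (so the SNR → ∞ limit of Eq. (3), α ≈ T_c/(T_c + T_d), applies), estimates the autocorrelation coefficient in the spirit of Eq. (4) with a per-term normalization, and inverts erfc by bisection since the standard library lacks erfc⁻¹. All parameter values are illustrative:

```python
import math
import random

def make_cp_signal(S, Td, Tc, rng):
    # Each "OFDM symbol": Td random samples preceded by a copy of its last Tc
    sig = []
    for _ in range(S):
        sym = [rng.gauss(0.0, 1.0) for _ in range(Td)]
        sig.extend(sym[-Tc:] + sym)
    return sig

def autocorr_estimate(x, Td):
    # Spirit of Eq. (4): normalized lag-Td autocorrelation (real-valued signal)
    var = sum(v * v for v in x) / len(x)
    pairs = len(x) - Td
    acc = sum(x[t] * x[t + Td] for t in range(pairs))
    return acc / (pairs * var)

def inv_erfc(y):
    lo, hi = 0.0, 10.0          # erfc is decreasing on this interval
    for _ in range(100):
        mid = (lo + hi) / 2
        if math.erfc(mid) > y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

S, Td, Tc = 200, 32, 8
L = S * (Td + Tc)
alpha = Tc / (Tc + Td)                          # Eq. (3) as SNR -> infinity
lam = inv_erfc(2 * 0.05) / math.sqrt(L)         # Eq. (6) for Pf = 0.05
Pd = 0.5 * math.erfc(math.sqrt(L) * (lam - alpha) / (1 - alpha ** 2))  # Eq. (7)

x = make_cp_signal(S, Td, Tc, random.Random(1))
print(round(autocorr_estimate(x, Td), 2))             # close to alpha = 0.2
print(round(0.5 * math.erfc(math.sqrt(L) * lam), 2))  # recovers Pf = 0.05
```

With L this large and a high effective α, the resulting P_d is essentially 1, which illustrates why long observation windows make the cyclic-prefix feature reliable.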

Therefore, for spectrum sensing through cooperative detection, the missed detection and detection probabilities can be calculated as:

C_m = Π_{k=1}^{S} (1 − P_{d,k})    (8)

C_d = 1 − C_m    (9)

where C_m and C_d denote the probabilities of missed detection and detection for the secondary users participating cooperatively, and P_{d,k} denotes the detection probability of the kth cognitive user. In the proposed system model, S secondary users are considered to be interacting cooperatively, and the test statistic calculated by the kth secondary user is denoted α̂_k. Each secondary user makes its local decision d based on the test statistic α̂_k as follows:

d = { 1, if α̂_k ≥ λ_k ; 0, if α̂_k < λ_k }    (10)

Here, λ_k is the decision threshold of the kth secondary user, k = 1, 2, …, S. In the proposed scheme, the secondary users participating cooperatively in the spectrum sensing procedure are given preference based on weighted factors defined in terms of


signal-to-noise ratio (SNR). A secondary user with high SNR contributes more to the final decision of the relay center, and one with low SNR contributes less. Distinct SNR-based weighted factors are allotted to the different secondary users existing in different SNR conditions; an SU with a large SNR is regarded as reliable, and its contribution counts more toward the final decision of the relay center. In the proposed technique, the SNR-based weighted factor is defined as:

f_k = δ_k / ((1/S) Σ_{n=1}^{S} δ_n)    (11)

where f_k is the SNR-based weighted factor of the kth secondary user and δ_n is the SNR of the nth secondary user. Each secondary user makes its local decision d; for the kth secondary user this decision is d_k, k = 1, 2, …, S. All these decisions are transmitted to the relay center through dedicated control channels. The proposed mechanism operates in a dual-threshold environment with two thresholds λ_1 and λ_2, which can be calculated from Eqs. (6) and (7), respectively. Each cognitive user arrives at its local decision by comparing its test statistic α̂ with the two thresholds λ_1 and λ_2. Figure 1a represents the single-threshold detection followed by all conventional techniques, and Fig. 1b represents the dual-threshold detection with two thresholds. Hence, taking into account the SNR-based weighted factors in the dual-threshold environment, the final decision combined over all secondary users can be represented as:

d = { 1, if Σ_{k=1}^{S} f_k d_k ≥ λ_2 ; 0, if Σ_{k=1}^{S} f_k d_k < λ_1 }    (12)

Fig. 1 a Single threshold detection. b Dual-threshold detection


Fig. 2 Block diagram of DTWFD

Each secondary user sends its information to the relay center. The final decision ξ taken by the relay center after combining the decisions of all the secondary users can be depicted as follows:

ξ = { Σ_{k=1}^{S} α̂_k, if λ_1 < Σ_{k=1}^{S} α̂_k < λ_2 ; d, otherwise }    (13)
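The weighting and fusion of Eqs. (11)–(13) can be sketched in a few lines. The SNR values, thresholds, and local decisions below are illustrative, and the None return models the ambiguous region between the two thresholds:

```python
def snr_weights(snrs):
    # Eq. (11): f_k = delta_k / ((1/S) * sum of all delta_n)
    mean = sum(snrs) / len(snrs)
    return [s / mean for s in snrs]

def relay_decision(test_stats, weights, local_decisions, lam1, lam2):
    # Eq. (13): forward the soft sum when it falls between the two
    # thresholds; otherwise apply the weighted hard vote of Eq. (12)
    soft = sum(test_stats)
    if lam1 < soft < lam2:
        return soft
    vote = sum(f * d for f, d in zip(weights, local_decisions))
    if vote >= lam2:
        return 1
    if vote < lam1:
        return 0
    return None   # ambiguous region of Eq. (12)

snrs = [12.0, 18.0, 6.0, 24.0]              # linear SNRs of four SUs
w = snr_weights(snrs)
print([round(f, 2) for f in w])             # [0.8, 1.2, 0.4, 1.6]
print(relay_decision([0.3, 0.4, 0.1, 0.5], w, [1, 1, 0, 1], lam1=0.05, lam2=0.5))  # 1
```

The weights average to 1 by construction, so a user's weight directly expresses how far its SNR sits above or below the group mean.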

Figure 2 depicts the schematic block diagram of the proposed technique, in which each SU performs spectrum sensing using DTWFD. It is assumed that the relay center receives r decisions and S − r estimates. The amalgamation of the test statistics at the relay center follows a Gaussian distribution: if each test attribute is Gaussian, then for large L their sum at the relay center is also Gaussian. Hence, for OFDM-based dual-threshold weighted feature detection, the hypotheses can be defined as:

H_0 : Σ_{k=1}^{S−r} α̂_k ∼ N(0, 1/(2L))    (14)

H_1 : Σ_{k=1}^{S−r} α̂_k ∼ N(α, (1 − α²)²/(2L))    (15)

Hence, for the kth secondary user, the detection probability P_{d,k} and the missed detection probability P_{m,k} can be represented as:

P_{d,k} = P(α̂_k > λ_2 | H_1) = (1/2) erfc(√L · (λ_2 − α) / (1 − α²))    (16)

P_{m,k} = 1 − P_{d,k}    (17)


3 SNR-Based Weighted Factor Algorithm for Dual-Threshold Weighted Feature Detection (DTWFD) The detection probability can be investigated in two respects. First, all the secondary users with an SNR-based weighted factor greater than 1 should decide that the channel is free and the primary user is absent; the corresponding probability can be defined as Π_{k∈Y} (1 − P_{d,k}), where Y represents the group of secondary users having SNR-based weighted factors greater than 1. Alternatively, for the secondary users having weighted factors less than 1, all possible groupings whose sum of weighted factors is greater than 1 should be omitted. In the DTWFD scheme, if a cognitive user has a very low SNR-based weighted factor, that user is not considered reliable enough to be taken into the cooperative sensing procedure. The steps of the DTWFD algorithm are as follows:

Step 1 Let μ be the set of secondary users selected to participate in cooperative sensing. Initially, let β = {1, 2, 3, …, S} and μ = φ, the empty set.
Step 2 Estimate the secondary user with the highest SNR; a threshold SNR is defined with respect to the highest SNR.
Step 3 Choose the secondary users whose SNR is greater than the threshold SNR.
Step 4 Estimate the highest SNR, δ_max = max_{k=1,2,…,S} (δ_k).
Step 5 Set a target SNR δ_th = τ δ_max, where τ is the SNR threshold.
Step 6 For every k ∈ β, if δ_k ≥ δ_th, then μ = μ ∪ {k}.
Step 7 For each i ∈ μ, determine the SNR-based weighted factor as f_i = δ_i / ((1/|μ|) Σ_{k∈μ} δ_k), ∀i ∈ μ.
Step 8 The relay center takes the final decision according to Eq. (13).
Step 9 The presence of the primary user is estimated based on the final decision taken by the relay center in Step 8.
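The selection steps above (Steps 1–7) reduce to a few lines; the SNR values and τ below are illustrative assumptions:

```python
def select_and_weight(snrs, tau=0.5):
    """snrs: {user id: linear SNR}. Returns {selected id: weighted factor}."""
    delta_max = max(snrs.values())                         # Step 4
    delta_th = tau * delta_max                             # Step 5
    mu = {k: s for k, s in snrs.items() if s >= delta_th}  # Steps 3 and 6
    mean = sum(mu.values()) / len(mu)
    return {k: s / mean for k, s in mu.items()}            # Step 7

snrs = {1: 20.0, 2: 4.0, 3: 12.0, 4: 16.0}                 # SNR per secondary user
print(select_and_weight(snrs))   # {1: 1.25, 3: 0.75, 4: 1.0}
```

User 2, whose SNR falls below τ · δ_max, is excluded entirely, and the weights of the selected users are normalized by the group mean, so unreliable low-SNR users never dilute the relay center's decision.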

4 Performance Evaluation The performance evaluation of the proposed DTWFD technique has been carried out in MATLAB R2019 running on an Intel i7 CPU at 2.2 GHz with 8 GB RAM. The parameters used for the experimentation and the evaluation of the spectrum sensing performance are presented in Table 1.

Table 1 Simulation parameters used for the experimentation

S. No. | Specification parameter            | Description
01     | Number of primary users            | 150
02     | Number of secondary users          | 100
03     | Noise background                   | AWGN noise
04     | Signal-to-noise ratio (SNR)        | 0:25 dB
05     | Modulation technique               | OFDM
06     | Size of IFFT or data symbols (T_d) | 32
07     | Size of cyclic prefix (T_c)        | T_d/4 = 8

In this performance evaluation, OFDM modulated signals are received at the secondary users, and each secondary user is assigned an SNR-based weighted factor that determines its contribution to the final decision of the relay center for performing spectrum sensing in the 5G environment. The proposed DTWFD technique has been compared with existing algorithms from notable publications, namely double-threshold weighted energy detection (DTWED), double-threshold energy detection (DTED), and conventional single-threshold feature detection (FD). The simulated results are averaged over 1000 realizations. Figure 3 depicts the receiver operating characteristic (ROC) curve of the proposed DTWFD compared with the existing techniques. As evident from the ROC curve, the proposed DTWFD technique records the highest detection probability, 0.87, among the compared techniques, while conventional single-threshold feature detection (FD) shows the lowest detection probability, 0.03. The detection probability increases as the false alarm probability increases. Figure 4 presents the ROC curve of the proposed DTWFD technique for different SNR scenarios ranging from 15 to 25 dB. It can be observed clearly from the graph

Fig. 3 ROC curve comparison of proposed DTWFD with the existing techniques


Fig. 4 ROC curve of proposed DTWFD technique at different SNR scenario

that as the signal-to-noise ratio increases, the detection probability of the proposed DTWFD technique increases, which proves that it performs well even under varying SNR conditions. Figure 5 depicts the detection probability versus the number of secondary users. The graphs clearly show that as the number of secondary users increases from 0 to 100, the detection probability curve of the proposed DTWFD technique stays above the curves of the existing algorithms for every count of secondary users. Therefore, even as the number of secondary users participating in cooperative spectrum sensing in a dual-threshold environment grows, the proposed technique maintains the highest detection probability compared to the existing techniques. Hence, the proposed technique outperforms the existing techniques in terms of all performance metrics and is suitable for efficient spectrum sensing in a 5G environment, owing to the inclusion of the SNR-based weighted factor in the dual-threshold environment. The secondary users contributing more to the ultimate decision are given weighted factors accordingly. The comparative analysis of the proposed technique with the existing techniques is summarized in Table 2.

5 Conclusion

This research work proposes a novel dual-threshold weighted feature detection (DTWFD) scheme for the detection of OFDM modulated signals. This technique


P. Kansal et al.

Fig. 5 Detection probability versus varying number of secondary users of DTWFD with existing techniques

Table 2 Comparative analysis of the proposed technique with the existing techniques

| S. No. | Spectrum sensing technique | Test statistic | Detection probability (Pd) |
|---|---|---|---|
| 01 | Conventional single threshold feature detection (FD) | Autocorrelation of OFDM signal with cyclic prefix | 0.03 |
| 02 | Double threshold energy detection (DTED) | Energy statistic | 0.32 |
| 03 | Double threshold weighted energy detection (DTWED) | Energy statistic multiplied by weights | 0.58 |
| 04 | Proposed dual-threshold weighted feature detection (DTWFD) | Autocorrelation property multiplied by SNR-based weight factors according to dual-threshold criteria | 0.87 |

exploits the autocorrelation property of the cyclic prefix of the OFDM signal. The work addresses the problem of secondary users suffering from multipath fading and changing SNR conditions, which the proposed technique takes into account for the upcoming 5G systems. This is achieved through an SNR-based weighting factor allocated to the secondary users, which determines their contribution to the decision-making process at the relay center. The proposed technique outperforms all the existing techniques from notable publications, as depicted in the


simulated results. It records the highest detection probability, 0.87, compared to the existing techniques, with conventional single threshold feature detection recording the lowest detection probability of 0.03. Hence, DTWFD proves to be a worthy candidate for spectrum sensing in 5G environments, allowing primary and secondary users to coexist and use the spectrum bands efficiently. In future work, dual-threshold techniques will be investigated further to suit the needs of the upcoming 5G systems.

Acknowledgements The authors would like to thank the laboratory staff and technicians for their support in resolving hardware and software issues.

References

1. G. Kakkavas, K. Tsitseklis, V. Karyotis, S. Papavassiliou, A software defined radio cross-layer resource allocation approach for cognitive radio networks: from theory to practice. IEEE Trans. Cogn. Commun. Netw. 6, 740–755 (2020)
2. G. Mahendru, A. Shukla, P. Banerjee, A novel mathematical model for energy detection based spectrum sensing in cognitive radio networks. Wireless Pers. Commun. 110, 1237–1249 (2020)
3. P.V. Yadav, A. Alimohammad, F. Harris, Efficient design and implementation of energy detection-based spectrum sensing. Circuits Syst. Signal Process. 38, 5187–5211 (2019)
4. M.Z. Alom, T.K. Godder, M.N. Morshed, A. Maali, Enhanced spectrum sensing based on energy detection in cognitive radio network using adaptive threshold, in 2017 International Conference on Networking, Systems and Security (NSysS), Dhaka, pp. 138–143 (2017)
5. M. Barakat, W. Saad, M. Shokair, FPGA implementation of cyclostationary feature detector for cognitive radio OFDM signals, in 2018 13th International Conference on Computer Engineering and Systems (ICCES), Cairo, Egypt, pp. 215–218 (2018)
6. M. Sherbin, V. Sindhu, Cyclostationary feature detection for spectrum sensing in cognitive radio network, in 2019 International Conference on Intelligent Computing and Control Systems (ICCS), Madurai, India, pp. 1250–1254 (2019)
7. K. Aishwarya, T. Jagannadha Swamy, Design of power efficient and high-performance architecture to spectrum sensing applications using cyclostationary feature detection, in Cognitive Informatics and Soft Computing. Advances in Intelligent Systems and Computing, ed. by P. Mallick, V. Balas, A. Bhoi, G.S. Chae, vol. 1040 (Springer, Singapore, 2020)
8. S. Dhananjaya, B.N. Yuvaraju, A novel method in matched filter spectrum sensing to minimize interference from compromised secondary users of cognitive radio networks, in 2018 International Conference on Electrical, Electronics, Communication, Computer, and Optimization Techniques (ICEECCOT), Mysuru, India, pp. 228–231 (2018)
9. A. Kumar, P.N. Kumar, OFDM with cyclostationary feature detection spectrum sensing. ICT Express 5, 21–25 (2019)
10. P. Guangliang, J. Li, F. Lin, A cognitive radio spectrum sensing method for an OFDM signal based on deep learning and cycle spectrum. Int. J. Digit. Multimed. Broadcast. 2020, 1–10 (2020)
11. J. Wu, Y. Yu, T. Song, J. Hu, Sequential 0/1 for cooperative spectrum sensing in the presence of strategic Byzantine attack. IEEE Wirel. Commun. Lett. 8, 500–503 (2019)
12. D.A. Guimarães, Pietra-Ricci index detector for centralized data fusion cooperative spectrum sensing. IEEE Trans. Veh. Technol. 69, 12354–12358 (2020)
13. R. Sarikhani, F. Keynia, Cooperative spectrum sensing meets machine learning: deep reinforcement learning approach. IEEE Commun. Lett. 24, 1459–1462 (2020)
14. R. Wan, L. Ding, N. Xiong, Dynamic dual threshold cooperative spectrum sensing for cognitive radio under noise power uncertainty. Hum. Cent. Comput. Inf. Sci. 9, 1–21 (2019)


15. S.M. Hassan, A. Eltholth, A.H. Ammar, Double threshold weighted energy detection for asynchronous PU activities in the presence of noise uncertainty. IEEE Access 8, 177682–177692 (2020)
16. D. Gupta, J.J.P.C. Rodrigues, S. Sundaram et al., Usability feature extraction using modified crow search algorithm: a novel approach. Neural Comput. Appl. 32, 10915–10925 (2020)
17. A. Bhowmick, A. Chandra, S. Dhar Roy, S. Kundu, Double threshold-based cooperative spectrum sensing for a cognitive radio network with improved energy detectors. IET Commun. 9, 2216–2226 (2015)
18. S.S. Rani, J.A. Alzubi, S.K. Lakshmanaprabu et al., Optimal users based secure data transmission on the internet of healthcare things (IoHT) with lightweight block ciphers. Multimed. Tools Appl. 79, 35405–35424 (2020)
19. S. Chavhan, P. Ramesh, R.R.S. Chhabra, D. Gupta, A. Khanna, J.J.P.C. Rodrigues, Visualization and performance analysis on 5G network slicing for drones, in DroneCom '20: Proceedings of the 2nd ACM MobiCom Workshop on Drone Assisted Wireless Communications for 5G and Beyond, pp. 13–19 (2020)
20. R. Sekaran, S.N. Goddumarri, S. Kallam, M. Ramachandran, R. Patan, D. Gupta, 5G integrated spectrum selection and spectrum access using AI-based framework for IoT based sensor networks. Comput. Netw. 107649 (2020)
21. S. Chaudhari, V. Koivunen, H.V. Poor, Autocorrelation-based decentralized sequential detection of OFDM signals in cognitive radios. IEEE Trans. Signal Process. 57, 2690–2700 (2009)

A Systematic Review on Various Attack Detection Methods for Wireless Sensor Networks

K. Jane Nithya and K. Shyamala

Abstract Wireless sensor networks (WSNs) are implemented using low-priced sensor nodes. These nodes are constrained in terms of memory, battery life, and computation. WSNs can monitor environmental conditions and are typically deployed in places that cannot be attended by humans, such as forests. Sensor nodes are vulnerable to many kinds of attack, of which the clone attack is one of the most dangerous. In a clone attack, sensor nodes are captured and cloned so that many sensors share the same ID. Secret information such as credentials and keys is extracted and deployed in other networks. Since identifying a clone node is difficult, an attacker can exploit this situation and use it as the base for further dangerous attacks such as black hole attacks. Thus, detecting such weak points in a WSN is a challenging activity. This paper reviews different types of attacks on WSNs and existing detection mechanisms. This review can help in understanding the minimum elements required while designing a generic defense wall against node replication attacks.

Keywords Wireless sensor network (WSN) · Artificial neural network (ANN) · Low-energy adaptive clustering hierarchy (LEACH) · Support vector machine (SVM) · Clone attacks

1 Introduction

WSNs are self-organized, distributed networks made from a collection of independent sensor nodes. These nodes can sense changes in the environment such as heat, rain, earthquakes, or other abnormal activity. They are deployed even in harsh environments and have been used on battlefields or in environmentally inaccessible places, from where they transmit information that is then processed. The signals are received

K. Jane Nithya (B) Department of Computer Science, Ethiraj College for Women, Tamil Nadu, Chennai 600008, India K. Shyamala Dr. Ambedkar Government Arts College, Tamil Nadu, Chennai 600039, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_17


K. Jane Nithya and K. Shyamala


Fig. 1 Overview of wireless sensor network

using radio frequencies at a common point called the base station. The transmitted data is compressed and sent directly to the base station [1]. A WSN can be defined as networked devices which gather and transmit information using wireless links. The transmitted information is forwarded through a gateway to other networks. WSNs have been used to monitor physical changes in the environment, such as temperature, pressure, and sound, where the nodes pass changes collectively and cooperatively to a specified location [2]. WSNs are deployed by scattering nodes in an unplanned manner, which are then left unattended to execute their functions [3]. A WSN is made of a base station and thousands of sensor nodes, as shown in Fig. 1. The nodes, with restricted computing capabilities, forward the information over short ranges. The base station, in contrast, is equipped with abundant data processing and storage facilities. It forwards the information to centers with human interfaces using security protocols [4]. Further, the rise of embedded systems has resulted in the use of WSNs in the areas of environment monitoring, security, health care, agriculture, water management, and automation [5]. WSNs can be built with various network topologies but are prone to attacks that interrupt communications; sinkhole attacks, selective forwarding attacks, wormhole attacks, and clone attacks are a few examples [6]. This review discusses the details of a WSN relevant to designing countermeasures against such attacks, though WSN security can also be enhanced using QoS [7]. Clone attacks are of a very different nature and are the most problematic, as a WSN ID is replicated in the network; the replicated node appears to be an authentic member of the network and leaks information to the attacker. The attacker can also reprogram the cloned node and use it to access all other networks. Thus, the attacker extracts the credentials of legitimate members using the clone [8]. Figure 2 depicts the classification of WSN attacks [9].


Fig. 2 Classification of WSN attacks: layer-dependent attacks (routing attacks, data aggregation attacks, localization attacks, and time synchronization attacks) and layer-independent attacks (node replication attacks and Sybil attacks)

The remainder of this paper is organized as follows. Section 2 summarizes the background of the study under three sub-headings, Sect. 3 highlights issues in existing methods, while Sect. 4 concludes this work.

2 Background Study

Attacks on WSNs can be visualized at two levels: one against its security mechanisms, and the other against its routing. WSN attacks are classified as passive attacks or active attacks and are detailed below.
• Passive Attacks: These kinds of attacks are used by adversaries to analyze and monitor traffic. They decrypt communications looking for sensitive information


which becomes the starting point for other attacks. Passive attacks breach confidentiality, resulting in disclosure of sensitive information without the consent of the user. Passive attack types are detailed below.
– Traffic Analysis: This attack can cause major harm to WSNs. Messages are monitored to analyze communication patterns, and sensor activities can reveal information to adversaries.
– Camouflaging: A node is concealed in the WSN and used to copy information by attracting packets and routing them to the adversary for detailed analysis.
– Eavesdropping: One of the most common attacks on WSNs, in which the attacker snoops on the data to discover communication information.
• Active Attacks: In these kinds of attacks, adversaries control the network. The communicated messages are modified to generate false data in communications. The types of these attacks are detailed below.
– Selective Forwarding: A malicious node drops packets and filters forwarded packets selectively. In coordination with attacks that gather traffic via the node, this is among the most dangerous attacks.
– Black Hole/Sinkhole Attack: The malicious node draws all packets in the WSN toward it like a sinkhole. It can even affect nodes that are far from the base station.
– Routing Attacks: Attacks that occur on the network layer are called routing attacks, and they can be classified further. In node subversion, a captured node reveals information, including cryptographic keys, thus compromising the entire WSN, which is then exploited by adversaries. A false node attack adds a new node injected with malicious data to the WSN, which then mimics actual information transmission. Passive information gathering by adversaries happens on unencrypted WSNs.
– Wormhole Attacks: The intruder records packet information and then routes it to a malicious location by tunneling the information.

2.1 Review on Attack Detection Techniques for WSNs

A cooperative game-based fuzzy Q-learning (G-FQL) model was proposed in [10] for preventing intrusions in WSNs. The technique adopted a Markovian game-theoretic approach along with a fuzzy Q-learning algorithm. The players were the base station and the sink nodes, together with an attacker. The game started whenever a node was flooded with packets beyond prescribed limits, as in a DDoS attack. A cooperative counter-attack scenario was then triggered, with the base station and sink node acting as rational decision makers. Its


performance was evaluated against other algorithms using a simulated low-energy adaptive clustering hierarchy (LEACH) on the NS-2 simulator. The model showed greater accuracy in its defense than other machine learning (ML) methods. The stacked denoising autoencoder-based localization attack identification approach (SDALAIA) proposed in [11] enhanced classification performance using a feature representation consisting of topological indexes and location features. The autoencoder's ability to learn features from input data was tuned by the backpropagation algorithm using a stochastic gradient descent method. SDALAIA distinguished Sybil, interference, replay, and collusion attacks with a classification accuracy above 94%. Jianjian et al. [12] developed an improved AdaBoost-radial basis function-based support vector machine (IABRBFSVM) for denial of service (DoS) attack detection. The parameter σ of the RBF-SVM and the model training error e_m influenced the AdaBoost weights. On analyzing DoS attacks, an eigenspace was defined for the intrusion detection to work on. The proposed IDS significantly improved network performance in simulations and was able to detect and remove malicious nodes based on packet delivery rate, energy consumption analysis, and transmission delay. Its simple structure resulted in reduced computational times with a high detection rate. Fotohi and Bari [13] proposed a WSN firefly algorithm based on LEACH and a Hopfield neural network (WSN-FAHN) against denial of sleep attacks (DoSA). A mobile sink improved the network lifetime. The firefly algorithm clustered the nodes and authenticated them at two levels to protect WSNs from DoSA. A Hopfield neural network identified the route for sending cluster head (CH) data to the sink. Simulations executed in the NS-2 environment showed superior outcomes compared to other approaches when evaluated on the metrics of average throughput, packet delivery ratio (PDR), detection ratio, and network lifetime. Ahmad et al.
[14] proposed an artificial neural network (ANN) for DoS attack detection in WSNs. LEACH was used in the proposal, and vector-based forwarding was used for underwater WSNs and scalability. NS-2 was used to create a dataset from the collected network traffic. The ANN was trained on this dataset and classified the DoS attacks. The scheme was highly efficient and accurate. Lu et al. [15] developed an intrusion detection system for WSNs combining improved particle swarm optimization (IPSO) and a backpropagation neural network (IPSO-BPNN). The system followed a hierarchical structure for detecting nodes that were attacked. The IPSO algorithm optimizes the initial parameters of the BPNN to avoid falling into local optima, and IPSO-BPNN is then applied for intrusion detection. The proposed system was tested on the NSL-KDD and UNSW-NB15 datasets to verify its performance. The results showed a higher detection accuracy rate, speedier convergence, and lower false positive rates in comparison to BPNN. Elsaid and Albatati [16] proposed an improved artificial bee colony optimization (IABCO) for an optimized collaborative intrusion detection system (OCIDS) in WSNs. The IABCO algorithm optimizes the hierarchical IDS, which is applied to WSNs for accurate intrusion detection with lower resource consumption. Detection accuracy was improved by optimizing a weighted SVM algorithm, which also reduced false alarm rates. The disparity between nodes, cluster heads, and BS in


a hierarchical WSN was also considered in the proposed system to provide precision. The system was evaluated by simulating various attacks on WSNs. When tested on the NSL-KDD dataset, the system provided a higher detection rate, achieving around 97% detection. Maleh and Ezzati [17] introduced an SVM for intrusion detection in WSNs. The architecture divided sensors into groups with a cluster head (CH) to reduce energy consumption and thus increase the network's lifetime. The CH forwarded aggregated data to the base station instead of individual nodes. Decision making was based on SVM, and anomalies were detected based on rules; thus, intrusions were detected and classified. The system exploited the advantages of SVM and a signature model to detect malicious behaviors and detected most routing attacks. Sybil nodes in stationary WSNs were identified by Jamshidi et al. [18] using a learning automaton (LA) model. Each node would broadcast puzzles and identify Sybil nodes in its neighborhood based on their response times. Each node was equipped with an LA to minimize the computational overhead of puzzle processing. The model was simulated using the J-SIM simulator, which resulted in a false detection rate of about 5%. When compared to similar algorithms, the system performed better. Al-issa et al. [19] proposed SVM and decision trees (DT) for detecting four types of DoS attacks, namely flooding, grayhole, scheduling, and black hole. The dataset size was reduced, and only grayhole and flooding attacks were considered. The results showed that DTs achieved a higher true positive rate of 99.86%, better than SVM's. Guo et al. [20] developed a multi-protocol-oriented middleware-level intrusion detection system (MP-MID) based on an algebra for wireless mesh networks (AWN). The proposed system could generate all known attack types for WSN protocols and automatically generated detection rules. MP-MID formalized AWN languages with algebra and obtained co-sentences representing the features of attacks in the protocol. A study of the ad hoc on-demand distance vector (AODV) protocol showed that MP-MID generated multiple attack types and outperformed other techniques in terms of accuracy rates. The study claimed that MP-MID could be used as a flexible tool to detect attacks in WSNs. Spectral clustering (SC) was applied to a dataset followed by a deep neural network (DNN) by Ma et al. [21] to detect intrusions in WSNs. The proposed system divided the dataset into k subsets based on cluster centers. The distance between data points in the testing and training sets was measured and fed to the DNN. The NSL-KDD and KDDCup99 datasets, along with a sensor network dataset, were used to test the system's performance. The proposed SCDNN classifier detected intrusions better than random forest (RF), SVM, backpropagation neural network (BPNN), and Bayes tree models, and was found to be an effective tool for analyzing intrusion detection in larger networks. Gunasekaran and Periakaruppan [22] developed a table-based intrusion detection and swarm-based defense mechanism (TIDSD) against DoS attacks. It was based on the idea that broadcasting an attack's impact to CHs helps identify DoS attacks. The isolation and routing tables were combined to detect attacks on specific clusters, which were then broadcast to other CHs. This intercommunication effectively prevented DoS attacks.


Swarm intelligence was used in the study to change faulty channels into normally operating channels using frequency hops. The results showed promise in energy consumption, transmission overhead/efficiency, and DoS prediction in WSNs (Table 1).
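Several defenses in this section (e.g. G-FQL [10]) build on a tabular Q-learning core. The following minimal sketch, whose states, actions, rewards, and parameters are hypothetical illustrations rather than any reviewed paper's actual model, shows how a sink node could learn to drop traffic only while it is being flooded:

```python
ALPHA, GAMMA = 0.5, 0.9          # learning rate, discount factor
STATES = ["normal", "flooded"]   # coarse traffic states observed at the sink
ACTIONS = ["forward", "drop"]    # defender actions

# Q-table: expected long-term value of each (state, action) pair.
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def reward(state, action):
    # Dropping traffic during a flood is rewarded; dropping normal
    # traffic (or forwarding a flood) is penalised.
    if state == "flooded":
        return 1.0 if action == "drop" else -1.0
    return 1.0 if action == "forward" else -1.0

def update(state, action, next_state):
    # Standard Q-learning update toward reward + discounted best next value.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (
        reward(state, action) + GAMMA * best_next - Q[(state, action)])

# Train on a fixed, repeated episode trace (deterministic for illustration).
trace = [("flooded", "drop", "normal"), ("normal", "forward", "normal"),
         ("flooded", "forward", "flooded"), ("flooded", "drop", "normal")] * 20
for s, a, s2 in trace:
    update(s, a, s2)

# Greedy policy after training: drop only while flooded.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
```

The reviewed schemes layer game theory or fuzzy inference on top of this update rule; the sketch only isolates the learning step they share.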

2.2 Review on Various Attack Detection Methods Based on Clustering Techniques for WSN

Ahmad et al. [23] introduced K-means clustering for misdirection and black hole attacks. This hybrid method employed customized K-medoid clustering. Network parameters defined a synthetic dataset, and anomalies were detected using threshold values. The technique's experiments on RStudio and NS-2 showed that it could detect hybrid anomalies accurately and thus could be used to detect black hole attacks in WSNs. Routing attacks were detected using a particle swarm optimization-based K-means clustering (PSO-KM) algorithm by Li et al. [24]. The PSO algorithm, an evolutionary algorithm based on swarm intelligence, has global search abilities; K-means clustering used with it could overcome the hurdle of local minima while achieving better overall convergence. KDD CUP 99 was used to validate the effectiveness of the proposed method, and the results showed low false detections. Danger theory was used by Shaukat et al. [25] to guard mobile wireless sensor networks (MWSNs) against clone attacks. Their hybrid approach depended on multiple levels of detection. In the first stage, abnormal behavior in mobile nodes is checked for dangers; battery checks and random number generation form the second stage; other networks are warned in the third stage. The security parameters considered were detection times, energy, communication costs, and delays in detection. The approach demonstrated its capability to detect malicious activity by a replica, making it useful for secured communications. Simulation results showed that the method could overcome weaknesses of centralized and distributed MWSNs with lower memory overheads. Node replication attacks were detected by Znaidi et al. [26] in their study, with the aim of linking security with existing routing methods. Their hierarchical distributed algorithm used a Bloom filter mechanism with CH selection.
Simulation showed that their method was efficient, with a high detection probability. Further, the method could be implemented on a k-hop hierarchical WSN protocol. Shahryari and Naji [27] proposed a clustering-based intrusion detection method for detecting wormhole attacks in the M-LEACH protocol. The system had four phases, namely setup, member verification, CH routing, and steady state. Each CH had a specified number of non-CH member nodes. Information flowed to the CH, which aggregated the received information; the CH was elected dynamically to maintain routing information. The proposed system proved useful and superior in detecting malicious nodes without extra calculations or hardware. NS-2 evaluation of the protocol in terms of throughput, packet drop ratio, time delay, and


Table 1 Inference on machine learning-based attack detection techniques for WSN

| S. No. | Author name | Method name | Advantages | Disadvantages |
|---|---|---|---|---|
| 1 | Shamshirband et al. [10] | Game-based fuzzy Q-learning (G-FQL) | Improved detection of attacks and better defense | It is not always accurate |
| 2 | Wang et al. [11] | Stacked denoising auto-encoder-based localization attack identification approach (SDALAIA) | Achieved the highest classification accuracy for attack recognition | Not suitable for real-time environments |
| 3 | Jianjian et al. [12] | Improved AdaBoost-radial basis function-based support vector machine (IABRBFSVM) | Short computation time and high detection rate | Poorly set number of weak classifiers |
| 4 | Fotohi and Bari [13] | WSN firefly algorithm based on LEACH and Hopfield NN (WSN-FAHN) | Lower energy consumption, greater network lifetime, and DoSA attack detection | High computational time complexity, slow convergence speed |
| 5 | Ahmad et al. [14] | Artificial neural network (ANN) | Elevated classification accuracy and rates for attacks | Dependent on the realization of the equipment |
| 6 | Lu et al. [15] | Improved particle swarm optimization (IPSO) with BPNN (IPSO-BPNN) | Faster convergence speed, higher detection accuracy rate | Can be sensitive to noisy data |
| 7 | Elsaid and Albatati [16] | Improved artificial bee colony optimization (IABCO) | Very high detection rates with minimized false alarm rates | Very slow when utilized to solve hard problems |
| 8 | Maleh and Ezzati [17] | Support vector machine (SVM) | Detects and prevents most routing attacks | Not suitable for large datasets |
| 9 | Jamshidi et al. [18] | Learning automaton (LA) | Sybil attack detection | Not applicable to other attacks |
| 10 | Al-issa et al. [19] | DTs and SVMs | Denial of service (DoS) attack detection | Does not perform very well when the dataset has more noise |
| 11 | Guo et al. [20] | Algebra for wireless mesh networks (AWN) | High detection accuracy rate | Not suitable for larger datasets |
| 12 | Ma et al. [21] | DNN with spectral clustering | Detection accuracy in finding abnormal attacks | Complexity and inability to recover |
| 13 | Gunasekaran and Periakaruppan [22] | Table-based intrusion detection and swarm-based defense (TIDSD) | DoS prediction/prevention | Slows down prevention speed, especially on large-scale databases |

energy consumption showed its practicality, effectiveness, and robustness against wormhole attacks. Saini and Angurala [28] introduced K-means clustering to detect wormhole attacks. A sensor detects a wormhole attack through its neighbor discovery process, and K-means clustering is used to fetch neighborhood information. The proposed method was found to work without any special requirements. OPNET was used to test and evaluate the proposed method, and its results were satisfactory. LEACH and an ANN were used in cooperation by Almomani et al. [29] to detect four types of DoS attacks in WSNs, namely grayhole, black hole, scheduling, and flooding. LEACH collected 23 features from NS-2 to create a dataset called WSN-DS. An ANN was trained on this dataset to detect and classify DoS attacks, and a high level of classification accuracy was achieved. A WEKA toolbox with holdout and ten-fold cross-validation methods gave the best results, with an average detection accuracy above 90%. Fotohi et al. [30] developed an abnormal sensor detection accuracy Rivest–Shamir–Adleman (ASDA-RSA) scheme for enhancing security against DoS attacks. Choosing a proper CH based on distance and energy is the first phase; the second phase uses RSA cryptography and an interlock protocol for authentication. It thus protects WSNs from DoS attacks. When evaluated on NS-2, the proposed method reduced energy consumption in WSNs, as measured by packet delivery ratio, throughput, detection ratio, network lifetime, and average residual energy. An improved LEACH was proposed by Cheng et al. [31] to safeguard WSNs against replication attacks. The proposed NI-LEACH protocol reduced clustering scales by considering optimal clusters and node residual energy. Moreover, the design introduced monitor nodes for stopping information tampering and for attack detection. Though simple, it performed well in simulation, with optimal throughput and higher resilience against clone attacks. Charumathim and Velumani [32] introduced a dynamic secure intrusion detection protocol model (DSIDP) for detecting node replication attacks. The intrusion detection process is based on monitoring nodes. This model improves the lifetime of the network with enhanced energy efficiency and improved communication. The authors also proposed an energy-saving data gathering method for WSNs with lower energy consumption and faster data transmission. A clustered network was taken


into consideration, in which clustering was done by the underwater density-based clustering sensor network (UWDBCSN) algorithm. The detection approach integrated a sleep/wake scheduling algorithm for enhanced network performance. An advanced hybrid intrusion detection system (AHIDS) was proposed by Singh et al. [33], in which LEACH detected attacks on WSNs. The proposed AHIDS made use of a cluster-based architecture and enhanced its performance with the LEACH protocol to reduce node energy consumption. Detection was based on fuzzy rule sets and a multilayer perceptron neural network; both feedforward and backpropagation were used to integrate detection of attackers such as Sybil nodes. An advanced Sybil attack detection algorithm detected the attacks using a wormhole-resistant hybrid technique, while hello flood attacks were detected using distance and signal strength. The method's detection rate was 99.40% for Sybil attacks, 98.20% for hello flood attacks, and 99.20% for wormhole attacks. Sybil attacks were also detected using a centralized clustering-based hierarchical network by Jan et al. [34]. Clusters were formed after detecting Sybil nodes, disallowing forged identities in CH selection; the CH was elected by legitimate nodes. The proposed scheme used two high-energy sensor nodes collaboratively to analyze neighbor nodes and their signal strengths. This scheme significantly improved network lifetime when compared with other hierarchical cluster-based routing protocols. Abasikeleş-Turgut et al. [35] proposed LEACH for the detection of sinkhole and black hole attacks. The attacks were modeled on LEACH, three models were designed for them, and the results evaluated under different performance metrics showed that black hole attacks caused 38% of the packets to be dropped. An energy trust system (ETS) was introduced by Alsaedi et al. [36] in clustered WSNs for Sybil attacks. In the proposed system, multiple levels of detection occur based on identity and position. A trust algorithm judges the node energies, and data aggregation reduces communication overheads, thus saving energy. The system's evaluation showed its effectiveness in detecting Sybil attacks: it increased the efficiency of multilevel detection by 30% and reduced communication costs, energy consumption, and memory overheads by eliminating feedback exchanges and recommendations among nodes. Abdulqader et al. [37] developed a clustering-based approach for false node exclusion DoS (FnEDoS) attack detection. The authentication-based defensive approach targets a DoS attack combined with a jamming attack that prevents data transfer between attacked nodes in a cluster and the cluster head node. The method develops an algorithm able to bypass the attacked path via an alternative safe one, under the control of the cluster head, to mitigate the FnEDoS attack caused by jamming. The proposed method was experimentally tested against similar methods from the literature on arbitrary study cases. The algorithm shows promising results in mitigating the FnEDoS attack: a full recovery of the attacked node is achieved in the case of isolated nodes, and an improvement of between 36 and 52% is obtained when the attack affects a group of nodes in proximity (Table 2).
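Several of the detectors reviewed above reduce to running k-means on per-node traffic features and treating a small or distant cluster as suspect. The sketch below is a minimal illustration of that idea: the features, the data, the fixed initial centroids (used for determinism), and the "second cluster = suspect" rule are hypothetical assumptions, not any specific paper's setup.

```python
def kmeans(points, centroids, iters=10):
    """Lloyd's k-means with fixed initial centroids (deterministic)."""
    labels = []
    for _ in range(iters):
        # Assignment step: each point joins the nearest centroid
        # (squared Euclidean distance).
        labels = [min(range(len(centroids)),
                      key=lambda c: sum((p - q) ** 2
                                        for p, q in zip(pt, centroids[c])))
                  for pt in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [pt for pt, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(d) / len(members)
                                for d in zip(*members)]
    return labels, centroids

# Hypothetical (packet_rate, forwarding_ratio) per node: most nodes
# forward nearly everything; two grayhole-like nodes drop selectively.
nodes = [(10, 0.98), (11, 0.97), (9, 0.99), (10, 0.96),
         (12, 0.95), (40, 0.20), (38, 0.25)]
labels, _ = kmeans(nodes, centroids=[[10.0, 0.9], [35.0, 0.3]])
suspects = [i for i, l in enumerate(labels) if l == 1]
```

The reviewed schemes differ mainly in how the clusters are seeded (e.g. PSO in [24]) and in which statistic flags a cluster as malicious; the core assignment/update loop is the same.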

A Systematic Review on Various Attack Detection …


Table 2 Inference on clustering-based attack detection methods for WSN

| S. No. | Author name | Method name | Advantages | Disadvantages |
|---|---|---|---|---|
| 1 | Ahmad et al. [23] | K-means clustering | High accuracy for black hole attacks | Difficult to predict the k-value |
| 2 | Li et al. [24] | Particle swarm optimization-based K-means clustering (PSO-KM) | Lowered false detections with high detection rates | Did not work well with a global cluster |
| 3 | Shaukat et al. [25] | Danger theory (DT) | Clone attack detection | |
| 4 | Znaidi et al. [26] | CH selection with bloom filters | Detected node replication attacks | Can only report yes or no |
| 5 | Shahryari and Naji [27] | Clustering method | Improved resilience against wormhole attacks | Multiple initial partitions may result in varied final clusters |
| 6 | Saini and Angurala [28] | K-means clustering | Detects the wormhole attack | Not suitable for real-world environments |
| 7 | Almomani et al. [29] | LEACH and ANN | Highest classification accuracies for DoS attack detection | Unexplained functioning of the network |
| 8 | Fotohi et al. [30] | Abnormal sensor detection accuracy Rivest–Shamir–Adleman (ASDA-RSA) | DoS attack detection | Can be very slow when large amounts of data need to be encrypted |
| 9 | Cheng et al. [31] | Improved low-energy adaptive clustering hierarchy (NI-LEACH) | Greatly improved replication attack detection | Not suitable for larger datasets |
| 10 | Charumathi and Velumani [32] | Dynamic secure intrusion detection protocol model (DSIDP) | Node replication attack detection | Slow convergence and time consuming |
| 11 | Singh et al. [33] | LEACH | High classification accuracy for attack detection | Difficult to detect attacks for larger datasets; complexity and inability to recover |

(continued)


K. Jane Nithya and K. Shyamala

Table 2 (continued)

| S. No. | Author name | Method name | Advantages | Disadvantages |
|---|---|---|---|---|
| 12 | Jan et al. [34] | Centralized clustering-based hierarchical network | Improved network lifetime and detection rate | No support for real-world environments |
| 13 | Abasikeleş-Turgut et al. [35] | Low-energy adaptive clustering hierarchy (LEACH) | Detection of sinkhole and black hole attacks | Energy among the nodes is not considered when selecting cluster heads |
| 14 | Alsaedi et al. [36] | Energy trust system (ETS) | Detects Sybil attacks | Not focused on other types of attack detection |
| 15 | Abdulqader AL-Shaihk and Hassanpour [37] | Clustering-based method | False node exclusion DoS (FEDoS) attack detection | Not focused on larger datasets |

2.3 Review on Various Attack Detection Methods Based on Authentication Protocols for WSN

Elhoseny et al. [38] developed an elliptic curve cryptography (ECC) scheme which encrypted and secured transmitted data in WSNs. It generated a binary string for each sensor and combined it with the sensor's ID, CH distance, and transmission round index to produce a unique 176-bit encryption key using exclusive OR, substitutions, and permutations. The method achieved efficient encryption and decryption, improved network lifetime, and minimized energy consumption at all nodes. It could withstand selective forwarding, compromised CH, and HELLO flood attacks. A continuous time Markov chain (CTMC)-based state transition model was introduced by Shi et al. [39] to model sensor behavior in WSNs under internal attack. The attack was detected using an epidemiological model in which the detection rates, compromised transition states, and response states were explored. The Bellman equation was used to write dynamic programs from the state transitions of a sensor; the basic idea was to find the optimal detection rate by maximizing the utility of the compromised state of the nodes. Survivability, current state, availability, and energy consumption of the WSN were encapsulated to measure effectiveness. Siddiqui et al. [40] extended the AODV routing protocol with improvements for wormhole attack detection. The approach checks the validity of two-hop neighbors for legality of packet forwarding. The authenticity of the nodes is checked against previously stored information, and when an illegal node is found, it is removed from the route. Throughput, packet delivery ratio, and delays were checked in simulations, and the output was promising compared with other approaches.
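The key-construction idea (a per-sensor binary string combined with the node ID, CH distance, and round index, then mixed by XOR, substitution, and permutation) can be sketched roughly as below. This is an illustration only, not Elhoseny et al.'s actual 176-bit construction: SHA-256 here stands in for the substitution/permutation stages, truncated to 176 bits (22 bytes), and all names and formats are invented.

```python
import hashlib
import secrets

def derive_round_key(node_id: str, ch_distance: float, round_idx: int) -> bytes:
    """Illustrative 176-bit (22-byte) per-round key; NOT the cited scheme."""
    node_string = secrets.token_bytes(22)        # fresh per-sensor binary string
    material = f"{node_id}|{ch_distance:.2f}|{round_idx}".encode()
    material = material.ljust(22, b"\0")[:22]    # pad/trim to key width
    mixed = bytes(a ^ b for a, b in zip(node_string, material))  # XOR stage
    return hashlib.sha256(mixed).digest()[:22]   # stand-in for sub/perm stages

key = derive_round_key("sensor-07", ch_distance=12.5, round_idx=3)
print(len(key) * 8)   # -> 176
```

Because the round index enters the key material, each transmission round yields a different key, which is what lets such schemes resist replay of captured ciphertext across rounds.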


Shi et al. [41] developed a Sybil attack detection technique using a lightweight detection mechanism based on the low-energy adaptive clustering hierarchy received signal strength indicator-ID (LEACH-RSSI-ID). The LEACH protocol ensures that no node is a CH permanently; for lower energy consumption, the remaining energy and relative density of the nodes are considered. Sybil attacks were detected by analyzing RSSI-ID tables. Even if a Sybil attack occurred in the initialization phase of a WSN, malicious nodes were detected by the sink node, and if a malicious node changed its identification, it was also detected within a short time. Simulations proved that the LRD mechanism is very helpful in detecting Sybil attacks in terms of accuracy and detection rates. The study by Makhdoom et al. [42] also proposed a WSN Sybil attack detection system called the one-way code attestation protocol (OWCAP). The system provided maximum security while minimizing computational, storage, and transmission overheads; the authors claimed the technique was very economical without compromising security. Message authentication was minimal and secured with TinySec. The system was restricted to TinyOS 1.x, but it was an economical and secure code attestation scheme for guarding against internal and Sybil attacks. The localized encryption and authentication protocol (LEAP) was used with fuzzy logic by Lee and Cho [43] to detect node compromises. Fuzzy logic was applied to node neighbor information to detect compromised nodes which could expose keys to adversaries; such nodes were detected and evicted from the network. Experimental results proved the system could guard WSNs against sinkhole attacks while reducing communication costs. Naderi et al. [44] introduced an entropy-based trust model for lessening sinkhole attacks in WSNs. The attacked area was estimated using energy consumption, and the entropy-based trust model was applied to trust-based routing of packets for higher network security.
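The RSSI-ID table idea behind LEACH-RSSI-ID can be sketched as follows; the node IDs, monitor readings, and tolerance below are invented for illustration, not Shi et al.'s actual mechanism. Two distinct claimed identities whose RSSI vectors, as observed by several monitor nodes, are nearly identical most likely originate from one physical device, i.e., a Sybil pair.

```python
from itertools import combinations

# Hypothetical RSSI (dBm) of each claimed node ID as measured by three
# monitor nodes; IDs "n2" and "n7" actually come from one physical device.
rssi_table = {
    "n1": (-52.0, -61.0, -70.0),
    "n2": (-55.0, -48.0, -66.0),
    "n3": (-70.0, -59.0, -50.0),
    "n7": (-55.2, -48.1, -66.3),
}

def sybil_pairs(table, tol=1.0):
    """Flag ID pairs whose RSSI vectors agree within tol dB at every monitor."""
    flagged = []
    for a, b in combinations(sorted(table), 2):
        if all(abs(x - y) <= tol for x, y in zip(table[a], table[b])):
            flagged.append((a, b))
    return flagged

print(sybil_pairs(rssi_table))   # -> [('n2', 'n7')]
```

Multiple monitors matter here: a single observer can be fooled by power control, but matching the whole RSSI vector across several vantage points pins the claims to one physical position.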
Packets were forwarded only through secure routes and related classes: a trust value was allocated to possible areas of attack in the WSN, and packets were forwarded through paths with very low risk. Saghar et al. [45] proposed a robust formally analyzed protocol for wireless sensor network deployment (RAEED) for DoS attack detection. The protocol followed three main steps, namely key setup, route setup, and data forwarding. The key setup was split into bidirectional verification and key exchange verification stages. In route setup, the verified node IDs of 1-hop and 2-hop neighbors detected in the key setup stage were exchanged. Data forwarding was done with local monitoring and neighbor ranking by the nodes, where the forwarding performance of the neighbors was used to rank them. The aim of the study was to subdue several kinds of attacks on WSNs. A secure data aggregation using access control and authentication (SDAACA) protocol was proposed by Razaque and Rizvi [46] to detect attacks in WSNs. The SDAACA protocol comprises a secure data fragmentation (SDF) algorithm and a node joining authorization (NJA) algorithm, where the former hides data from attackers through fragmentation and the latter authorizes any new node joining the network. The algorithms improve the quality of service (QoS) in WSNs. The proposed protocol was mapped to an oil-refinery plant for detecting Sybil attacks


and sinkhole attacks. The proposed protocol's access control scheme was found to be strong and energy efficient. Dong et al. [47] proposed a distributed low-storage clone detection protocol (LSCD) applicable to wireless networks. Witness nodes were deployed in a circular path, and a detection route was designed in a direction perpendicular to the witness path. This ensured a detection route would cross the witness path, as the distance between any two detection routes is smaller than the witness path. LSCD processed clones in a non-hotspot region with abundant energy to improve energy efficiency and effectively extend network lifetime; the results showed substantial improvements over other schemes. A global deterministic linear propagation verification protocol (GDL) was proposed by Zhou et al. [48], where randomized parallel multiple cells linear propagation (RMC) detected and counteracted node replication attacks. Node information was propagated both horizontally and vertically, and at an intersection, nodes have the same location or ID. Though the GDL scheme is not resilient to smart node replications, RMC verification (a mix of localized and linear multicasts) was added to increase its robustness. Witness nodes were randomly selected from many geographically limited regions called cells; the same node location was distributed to multiple randomly selected cells, and node information was verified from the localized cells. Simulation results proved the protocol improved detection efficiency while prolonging overall network lifetime. Aliady and Al-Ahmadi [49] used the AODV protocol for wormhole attack detection based on network connectivity. The proposed scheme's measures were applied in Network Simulator 3, and the results showed 100% accuracy on wormholes longer than four hops. The scheme did not incur any additional costs, as it used existing plugged-in hardware, making it a viable option for WSNs.
Detection of wormhole attacks in a geographic routing protocol (DWGRP) was proposed by Sookhak et al. [50] to identify malicious nodes and reliable neighbors. The nodes were identified using a pre-distribution technique with beacon packets and pairwise keys; moreover, it did not assume any hardware prerequisites. The technique was validated on NS-2 with neighborhood information against related techniques such as the authentication of nodes scheme (ANS), received signal strength (RSS), wormhole detection, and wormhole detection using hound packet (WHOP), and it resulted in lower false detection rates. A signed response (SRES) authentication mechanism was proposed in the study by Saud Khan and Khan [51]. It was a simple Sybil attack detection scheme based on global system for mobile (GSM) communication. The system authenticated users with encrypted voice data and could be used for hierarchical and centralized WSNs. The scheme was tested against many Sybil attacks, and its probabilistic model analyzed Sybil attacks in authentication. Evaluations showed it minimized computational and power factors compared with other schemes and was better at identifying Sybil attacks. Tiwari et al. [52] proposed a modified hop count analysis algorithm (MHCAA) for preventing wormhole attacks. The study introduced specification-based intrusion detection and hop count analysis for identifying wormhole attacks. The attacks were


simulated, and the technique was found to be a good security solution for routing protocols, as it reports malicious nodes in compromised or hostile environments. It demonstrated improvements in packet delivery ratio, but the end-to-end delay also increased in parallel. On the whole, the modified AODV was found to be suitable for preventing and identifying wormhole attacks. A geographic routing to multiple sinks in connected WSNs based on multiple sinks (GRPW-MuS) scheme for identifying wormhole attacks was proposed by Sabri and El Kamoun [53]. This geographical routing protocol's architecture was partitioned by logical levels, and a multipoint relaying flooding technique reduced topology broadcasts. In OMNET++ simulations with the MiXiM framework, the scheme showed few false positives while detecting wormhole attacks with minimized energy consumption. Moreover, it needed only a small amount of memory at the nodes, an ideal condition for WSNs (Table 3).

3 Issues from Existing Methods

WSN security has been the focus of much research, and the studies have gained momentum because WSN resources are constrained. Moreover, confidentiality, authenticity, and integrity are at stake in sensitive applications of WSNs. Many kinds of attacks, such as DoS, Sybil, and wormhole attacks, occur on WSNs. Clone attacks are the most dangerous, as they allow adversaries to enter the network and steal sensitive information. They are a major security issue because conventional cryptographic tools have failed to curtail them; attackers on WSNs have often used this attack to gain information from networks, and even key distributions have been hacked with it. Another problem is found in data aggregation, where adversaries mislead WSNs by injecting false data or through node revocations resulting in disconnected networks. Once adversaries entered the network to gain routing information and routes, they placed malicious nodes in chosen locations for maximum information gain. Node replication attackers have used cryptographic keys in clone nodes after physically capturing WSN nodes. Thus, security remains a major question in open networks like WSNs.

4 Solution

WSNs are networks without a static infrastructure, and their communications occur over wireless links. Since the network is open, researchers have probed all means of providing security mechanisms for it. Duplication of nodes with the same identities, called a clone attack, has been a major security issue for WSNs. Certain techniques proposed in the literature can help stop these clone attacks. Hybrid clone node detection mechanisms can identify clone nodes in a WSN, and SDN-based mechanisms can be added to such techniques, as they work at the network level's


Table 3 Inference on authentication protocols-based attack detection methods for WSN

| S. No. | Author name | Method name | Advantages | Disadvantages |
|---|---|---|---|---|
| 1 | Elhoseny et al. [38] | Elliptic curve cryptography (ECC) | Minimized energy consumption with increased network lifetime | Lack of ability to be spatially invariant to the input data |
| 2 | Shi et al. [39] | Continuous time Markov chain (CTMC) | Internal attack detection | No support for larger databases |
| 3 | Siddiqui et al. [40] | Ad hoc on-demand distance vector (AODV) | Wormhole attack detection and prevention | Not suitable for other attack detection |
| 4 | Shi et al. [41] | Low-energy adaptive clustering hierarchy received signal strength indicator-ID (LEACH-RSSI-ID) | High detection accuracy for Sybil attacks | Only applicable to Sybil attacks |
| 5 | Makhdoom et al. [42] | One-way code attestation protocol (OWCAP) | Sybil attack detection | Slow convergence and time consuming |
| 6 | Lee and Cho [43] | Fuzzy logic system-based method | Reduced communication cost to prevent sinkhole attacks | Not always accurate |
| 7 | Naderi et al. [44] | Entropy-based trust model | Mitigation of sinkhole attacks | High computational cost and large time consumption |
| 8 | Saghar et al. [45] | Robust formally analyzed protocol for wireless sensor networks deployment (RAEED) | Identification of DoS attacks | Not suitable for larger datasets |
| 9 | Razaque and Rizvi [46] | Secure data aggregation using access control and authentication (SDAACA) | Prevents and detects both sinkhole and Sybil attacks | Not suitable for real-time applications |
| 10 | Dong et al. [47] | Low-storage clone detection protocol (LSCD) | Better clone attack detection | Only focused on clone attack detection |
| 11 | Zhou et al. [48] | Global deterministic linear (GDL) and randomized parallel multiple cells (RMC) | Improved detection efficiency | Not always accurate, so it may not be widely accepted |
| 12 | Aliady and Al-Ahmadi [49] | AODV | Achieved the highest accuracy rate | High bandwidth requirement |

(continued)


Table 3 (continued)

| S. No. | Author name | Method name | Advantages | Disadvantages |
|---|---|---|---|---|
| 13 | Sookhak et al. [50] | Detect wormhole attacks in a geographic routing protocol (DWGRP) | Wormhole attack identification | A delayed protocol because of its route discovery process |
| 14 | Saud Khan and Khan [51] | Signed response (SRES) | Detects Sybil attacks with higher probability | Not focused on other types of attacks |
| 15 | Tiwari et al. [52] | Modified hop count analysis algorithm (MHCAA) | Detection and prevention of wormhole attacks | In large networks, other remote routers may not be reachable |
| 16 | Sabri and El Kamoun [53] | Geographic routing to multiple sinks in connected wireless sensor networks based on multiple sinks (GRPW-MuS) | Wormhole attack detection | Does not perform very well |

routes. Any scheme that analyzes WSNs based on relay and receive time frames can identify the redundant nodes of a cloning attack. Using bloom filters in WSNs with proper CH selection can avoid clones. Secure-efficient centralized (SEC) approaches have also been found to perform well in detecting WSN replication attacks. Finally, schemes that use LEACH and improve it further, like the proposed NI-LEACH, can reduce cluster scales, improve network lifetimes, and define an optimal number of clusters. Thus, there are many applicable schemes for stopping clone attacks in WSNs.
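The bloom filter route to clone avoidance mentioned above can be sketched like this (the sizes, claim format, and reporting flow are illustrative; a real deployment would size the filter and hash count for the expected node population and accept a small false positive rate): a cluster head remembers which identities and (ID, location) claims it has seen, and a repeated identity arriving from a new location suggests a clone.

```python
import hashlib

class BloomFilter:
    """Tiny bloom filter over strings: m bits, k hashes (illustrative sizes)."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

seen_ids = BloomFilter()       # identities the CH has heard from
seen_claims = BloomFilter()    # exact (ID, location) claims already accepted

def is_clone_claim(node_id, location):
    """True when a known ID shows up at a location it never claimed before.
    Bloom filters can give false positives but never false negatives."""
    claim = f"{node_id}@{location}"
    clone = node_id in seen_ids and claim not in seen_claims
    seen_ids.add(node_id)
    seen_claims.add(claim)
    return clone

first = is_clone_claim("node42", (3, 7))   # False: first sighting of node42
second = is_clone_claim("node42", (9, 1))  # True: same ID, new location
```

The filters make the per-claim check constant-time with a few kilobits of state, which is the property that makes the approach attractive on memory-constrained CH nodes; the price is the occasional false alarm that a second-stage check must clear.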

5 Results and Discussion

This section discusses the performance comparison results for the existing entropy-based trust model (EBTM), the one-way code attestation protocol (OWCAP), and the proposed hybrid clone node detection (HCND) mechanism. The proposed model can be implemented using NS2.

Evaluation metrics:

Precision: measures the proportion of actual clones that are correctly identified.

P = (Number of clones correctly found) / (Total number of clones)    (1)


Fig. 3 Precision performance comparison of different attack detection schemes (bar chart: Precision (%) for EBTM, OWCAP, and HCND)

Recall: measures the proportion of clones that are accurately predicted.

R = (Number of clones correctly found) / (Total number of clones in source code)    (2)
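Equations (1) and (2) translate directly into code; the detection counts below are hypothetical and are not the figures reported in Table 4.

```python
def precision(clones_correctly_found, total_clones):
    """Eq. (1): correctly found clones over all clones."""
    return clones_correctly_found / total_clones

def recall(clones_correctly_found, total_clones_in_source):
    """Eq. (2): correctly found clones over the clone population considered."""
    return clones_correctly_found / total_clones_in_source

# Hypothetical detection run (not the Table 4 figures):
p = precision(97, 100)   # -> 0.97
r = recall(49, 50)       # -> 0.98
```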

Attack detection performance in terms of precision is shown in Fig. 3. The proposed HCND technique achieves a precision of 97%, higher than the earlier EBTM and OWCAP methods, which reach only 75% and 85%, respectively (Table 4). In terms of recall (Fig. 4), HCND reaches 98%, against only 72% for EBTM and 82% for OWCAP. For clone attack detection (Fig. 5), HCND detects 810 attacks, compared with only 225 for EBTM and 450 for OWCAP. Finally, in terms of correctly detected clones (Fig. 6), HCND identifies 809 clones, against only 280 for EBTM and 460 for OWCAP.

Table 4 Performance comparison results

| Metrics | EBTM | OWCAP | HCND |
|---|---|---|---|
| Precision (%) | 75 | 85 | 97 |
| Recall (%) | 72 | 82 | 98 |
| Clone attack detection | 225 | 450 | 810 |
| Correctly detected clone | 280 | 460 | 809 |


Fig. 4 Recall performance comparison of different attack detection schemes (bar chart: Recall (%) for EBTM, OWCAP, and HCND)

Fig. 5 Clone attack detection performance comparison of different attack detection schemes (bar chart: clone attack detections for EBTM, OWCAP, and HCND)

Fig. 6 Correctly detected clone performance comparison of different attack detection schemes (bar chart: correctly detected clones for EBTM, OWCAP, and HCND)


6 Conclusion

This paper has methodically reviewed WSN security, its issues and challenges, and the dangers of attacks from multiple sources. It has detailed the types of attacks possible on WSNs and their related studies. Clone attacks, considered the most dangerous threat to WSNs because adversaries can easily capture a node, extract its information, and deploy it to hack other networks, have been discussed in detail. WSNs are also prone to other types of attacks in which data is lost to third parties or the energy of WSN nodes is consumed; this review has also discussed such attacks and the ways studies detect them. The review has covered WSN security studies in many areas, including machine learning, where researchers have proposed schemes using supervised, unsupervised, and deep learning algorithms for WSN safety. Further, this paper has detailed the advantages and disadvantages of each of the schemes taken up for study. Though clone attacks are the most dangerous, techniques to identify and disarm them before they harm the network have been discussed. Finally, the simulation results indicate that hybrid methods are a promising way to extend this work: they improve attack detection performance and give better precision than the other existing methods.

References

1. W. Xiang, N. Wang, Y. Zhou, An energy-efficient routing algorithm for software-defined wireless sensor networks. IEEE Sens. J. 16(20), 7393–7400 (2016)
2. T. Olofsson, A. Ahlén, M. Gidlund, Modeling of the fading statistics of wireless sensor network channels in industrial environments. IEEE Trans. Signal Process. 64(12), 3021–3034 (2016)
3. Y. Guan, X. Ge, Distributed attack detection and secure estimation of networked cyber-physical systems against false data injection attacks and jamming attacks. IEEE Trans. Sig. Inform. Process. Over Netw. 4(1), 48–59 (2017)
4. Q. Wang, J. Jiang, Comparative examination on architecture and protocol of industrial wireless sensor network standards. IEEE Commun. Surv. Tutor. 18(3), 2197–2219 (2016)
5. D. Qin, S. Yang, S. Jia, Y. Zhang, J. Ma, Q. Ding, Research on trust sensing based secure routing mechanism for wireless sensor network. IEEE Access 5, 9599–9609 (2017)
6. C.M. Yu, Y.T. Tsou, C.S. Lu, S.Y. Kuo, Localized algorithms for detection of node replication attacks in mobile sensor networks. IEEE Trans. Inf. Forensics Secur. 8(5), 754–768 (2013)
7. B. Cao, J. Zhao, Y. Gu, S. Fan, P. Yang, Security-aware industrial wireless sensor network deployment optimization. IEEE Trans. Industr. Inf. 16(8), 5309–5316 (2020)
8. K. Cho, M. Jo, T. Kwon, H.H. Chen, D.H. Lee, Classification and experimental analysis for clone detection approaches in wireless sensor networks. IEEE Syst. J. 7(1), 26–35 (2013)
9. L. Sujihelen, C. Jayakumar, C. Senthilsingh, SEC approach for detecting node replication attacks in static wireless sensor networks. J. Electric. Eng. Technol. 13(6), 2447–2455 (2018)
10. S. Shamshirband, A. Patel, N.B. Anuar, M.L.M. Kiah, A. Abraham, Cooperative game theoretic approach using fuzzy Q-learning for detecting and preventing intrusions in wireless sensor networks. Eng. Appl. Artif. Intell. 32, 228–241 (2014)
11. H. Wang, Y. Wen, D. Zhao, Identifying localization attacks in wireless sensor networks using deep learning. J. Intell. Fuzzy Syst. 35(2), 1339–1351 (2018)
12. D. Jianjian, T. Yang, Y. Feiyue, A novel intrusion detection system based on IABRBFSVM for wireless sensor networks. Procedia Comput. Sci. 131, 1113–1121 (2018)


13. R. Fotohi, S.F. Bari, A novel countermeasure technique to protect WSN against denial-of-sleep attacks using firefly and Hopfield neural network (HNN) algorithms. J. Supercomput. 1–27 (2020)
14. B. Ahmad, W. Jian, R.N. Enam, A. Abbas, Classification of DoS attacks in smart underwater wireless sensor network. Wirel. Personal Commun. 1–15 (2019)
15. X. Lu, D. Han, L. Duan, Q. Tian, Intrusion detection of wireless sensor networks based on IPSO algorithm and BP neural network. Int. J. Comput. Sci. Eng. 22(2–3), 221–232 (2020)
16. S.A. Elsaid, N.S. Albatati, An optimized collaborative intrusion detection system for wireless sensor networks. Soft Comput. 24, 1–15 (2020)
17. Y. Maleh, A. Ezzati, Lightweight intrusion detection scheme for wireless sensor networks. IAENG Int. J. Comput. Sci. 42(4), 1–8 (2015)
18. M. Jamshidi, M. Esnaashari, A.M. Darwesh, M.R. Meybodi, Detecting Sybil nodes in stationary wireless sensor networks using learning automaton and client puzzles. IET Commun. 13(13), 1988–1997 (2019)
19. A.I. Al-issa, M. Al-Akhras, M.S. ALsahli, M. Alawairdhi, Using machine learning to detect DoS attacks in wireless sensor networks, in IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), pp. 107–112 (2019)
20. Q. Guo, X. Li, G. Xu, Z. Feng, MP-MID: multi-protocol oriented middleware-level intrusion detection method for wireless sensor networks. Fut. Gener. Comput. Syst. 70, 42–47 (2017)
21. T. Ma, F. Wang, J. Cheng, Y. Yu, X. Chen, A hybrid spectral clustering and deep neural network ensemble algorithm for intrusion detection in sensor networks. Sensors 16(10), 1701–1723 (2016)
22. M. Gunasekaran, S. Periakaruppan, A hybrid protection approaches for denial of service (DoS) attacks in wireless sensor networks. Int. J. Electron. 104(6), 993–1007 (2017)
23. B. Ahmad, W. Jian, Z.A. Ali, S. Tanvir, M.S.A. Khan, Hybrid anomaly detection by using clustering for wireless sensor network. Wirel. Pers. Commun. 106(4), 1841–1853 (2019)
24. Y. Li, M. Du, Y. Li, Routing attacks detection method of wireless sensor network, in DEStech Transactions on Computer Science and Engineering (wicom), pp. 255–265 (2018)
25. H.R. Shaukat, F. Hashim, M.A. Shaukat, K. Ali Alezabi, Hybrid multi-level detection and mitigation of clone attacks in mobile wireless sensor network (MWSN). Sensors 20(8), 2283–2305 (2020)
26. W. Znaidi, M. Minier, S. Ubéda, Hierarchical node replication attacks detection in wireless sensor networks. Int. J. Distrib. Sens. Netw. 9(4), 1–12 (2013)
27. M. Shahryari, H.R. Naji, A cluster based approach for wormhole attack detection in wireless sensor networks. J. Adv. Comput. Sci. Technol. 4(1), 95–102 (2015)
28. R. Saini, M. Angurala, Reactive routing based optimize network performance in wormhole attack. Glob. J. Comput. Technol. 4(2), 221–224 (2016)
29. I. Almomani, B. Al-Kasasbeh, M. Al-Akhras, WSN-DS: a dataset for intrusion detection systems in wireless sensor networks. J. Sens. (2016)
30. R. Fotohi, S. Firoozi Bari, M. Yusefi, Securing wireless sensor networks against denial-of-sleep attacks using RSA cryptography algorithm and interlock protocol. Int. J. Commun. Syst. 33(4), 1–25 (2020)
31. G. Cheng, S. Guo, Y. Yang, F. Wang, Replication attack detection with monitor nodes in clustered wireless sensor networks, in IEEE 34th International Performance Computing and Communications Conference (IPCCC), pp. 1–8 (2015)
32. A.C. Charumathi, M. Velumani, Replication attack detection in wireless sensor networks by efficient node deployment. Int. J. Res. Sci. Eng. Technol. 5(3), 11–17 (2018)
33. R. Singh, J. Singh, R. Singh, Fuzzy based advanced hybrid intrusion detection system to detect malicious nodes in wireless sensor networks. Wirel. Commun. Mob. Comput. 1–15 (2017)
34. M.A. Jan, P. Nanda, X. He, R.P. Liu, A Sybil attack detection scheme for a centralized clustering-based hierarchical network. IEEE Trustcom/BigDataSE/ISPA 1, 318–325 (2015)
35. I. Abasikeleş-Turgut, M.N. Aydin, K. Tohma, A realistic modelling of the sinkhole and the black hole attacks in cluster-based WSNs. Int. J. Electron. Electr. Eng. 4(1), 74–78 (2016)


36. N. Alsaedi, F. Hashim, A. Sali, F.Z. Rokhani, Detecting Sybil attacks in clustered wireless sensor networks based on energy trust system (ETS). Comput. Commun. 110, 75–82 (2017)
37. N.F. Abdulqader AL-Shaihk, R. Hassanpour, Active defense strategy against jamming attack in wireless sensor networks. Int. J. Comput. Netw. Inform. Secur. 11(11), 1–13 (2019)
38. M. Elhoseny, X. Yuan, H.K. El-Minir, A.M. Riad, An energy efficient encryption method for secure dynamic WSN. Secur. Commun. Netw. 9(13), 2024–2031 (2016)
39. Q. Shi, L. Qin, L. Song, R. Zhang, Y. Jia, A dynamic programming model for internal attack detection in wireless sensor networks. Discr. Dyn. Nat. Soc. 1–9 (2017)
40. A. Siddiqui, A. Karami, M.O. Johnson, A wormhole attack detection and prevention technique in wireless sensor networks. Int. J. Comput. Appl. 174(4), 1–5 (2017)
41. W. Shi, S. Liu, Z. Zhang, A lightweight detection mechanism against Sybil attack in wireless sensor network. KSII Trans. Internet Inf. Syst. 9(9), 3738–3750 (2015)
42. I. Makhdoom, M. Afzal, I. Rashid, A novel code attestation scheme against Sybil attack in wireless sensor networks. Natl. Softw. Eng. Conf. 1–6 (2014)
43. J.J. Lee, T.H. Cho, Sinkhole attack detection scheme using neighbors' information for LEAP based wireless sensor networks. Int. J. Comput. Appl. 141, 1–7 (2016)
44. O. Naderi, M. Shahedi, S.M. Mazinani, A trust based routing protocol for mitigation of sinkhole attacks in wireless sensor networks. Int. J. Inf. Educ. Technol. 5(7), 520–526 (2015)
45. K. Saghar, M. Tariq, D. Kendall, A. Bouridane, RAEED: a formally verified solution to resolve sinkhole attack in wireless sensor network, in 13th International Bhurban Conference on Applied Sciences and Technology (IBCAST), pp. 334–345 (2016)
46. A. Razaque, S.S. Rizvi, Secure data aggregation using access control and authentication for wireless sensor networks. Comput. Secur. 70, 532–545 (2017)
47. M. Dong, K. Ota, L.T. Yang, A. Liu, M. Guo, LSCD: a low-storage clone detection protocol for cyber-physical systems. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 35(5), 712–723 (2016)
48. Y. Zhou, Z. Huang, J. Wang, R. Huang, D. Yu, An energy-efficient random verification protocol for the detection of node clone attacks in wireless sensor networks. EURASIP J. Wirel. Commun. Netw. 163–174 (2014)
49. W.A. Aliady, S.A. Al-Ahmadi, Energy preserving secure measure against wormhole attack in wireless sensor networks. IEEE Access 7, 84132–84141 (2019)
50. M. Sookhak, A. Akhundzada, A. Sookhak, M. Eslaminejad, A. Gani, M.K. Khan, X. Li, X. Wang, Geographic wormhole detection in wireless sensor networks. PLoS ONE 10(1), 1–21 (2015)
51. M. Saud Khan, N.M. Khan, Low complexity signed response based Sybil attack detection mechanism in wireless sensor networks. J. Sens. 1–9 (2016)
52. M. Tiwari, A. Tiwari, D. Sukheja, Modified hop count analysis algorithm (MHCAA) for preventing wormhole attack in WSN. Communications 3(3), 6–10 (2015)
53. Y. Sabri, N. El Kamoun, GRPW-MuS-s: a secure enhanced trust aware routing against wormhole attacks in wireless sensor networks. Network 6(5), 1–7 (2016)

Electronic Beam Steering in Timed Antenna Array by Controlling the Harmonic Patterns with Optimally Derived Pulse-Shifted Switching Sequence

Avishek Chakraborty, Gopi Ram, and Durbadal Mandal

Abstract An efficient electronic beam steering method in timed antenna array is proposed in this paper by controlling only the high-speed switches connected with the radiating elements of the array. The timed antenna arrays have an extra dimension—'time' to control the array radiation pattern compared to traditional antenna arrays. The simple on–off sequence of these arrays inherently generates undesired harmonic patterns along with the desired pattern. These unwanted and spatially individual patterns generated at lower-order harmonic frequencies are exploited in this paper with appropriate on–off sequence to generate multiple steered patterns toward preselected directions. The sidelobes of all the desired patterns are reduced, and the unwanted harmonics are suppressed. To achieve the desired objectives, unique pulse-shifted on–off sequences are derived with a modified differential evolution algorithm by implementing an efficient wavelet-based mutation operation. A linear arrangement of timed array with ten isotropic elements is considered to exhibit the potentialities of the proposed approach.

Keywords Timed antenna array · Beam steering · Harmonic pattern · Differential evolution · Wavelet mutation

A. Chakraborty (B) · D. Mandal
Department of ECE, NIT Durgapur, Durgapur, West Bengal 713209, India
G. Ram
Department of ECE, NIT Warangal, Warangal, Telangana 506004, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_18

1 Introduction

The timed array antenna is an alternative to the standard traditional array antenna, with an extra dimension, 'time', to control the array radiation pattern [1]. Time-domain antenna arrays have emerged as an efficient means of producing low or ultra-low sidelobe levels (SLLs) [2]. The concept of periodic on–off switching equivalent to tapered




excitation distribution has also been applied for electronic scanning [3]. The implementation of time-modulation to achieve desired radiation patterns with nearly ultra-low SLL has been investigated with slot radiators [4]. The emergence of evolutionary computation techniques has accelerated research in this domain [5–7]. The simple periodic on–off sequence used by time-modulated arrays (TMAs) to obtain the desired radiation patterns inherently produces unwanted radiation at harmonic frequencies [8]. These unwanted patterns are usually considered as power loss that needs to be suppressed [9]. Several strategies have been proposed to suppress the unwanted harmonics or sideband levels (SBLs) for enhanced performance [10, 11]. The simultaneous suppression of SLL and SBLs has also been achieved with optimal time schemes [12, 13]. Another approach, working with these unwanted sidebands, has evolved through the exploitation of harmonic patterns for multipattern synthesis of TMAs [14, 15]. Electronic steering in a time-modulated linear array (TMLA) has been proposed with a suitable switching configuration [15]. The exploitation of selective harmonic patterns of TMLAs has also been carried out with optimization-based pulse sequences [16–18]. The research on exploiting harmonic radiation with suitably designed on–off schemes has been extended to applications such as multibeam forming [19], direction-of-arrival estimation [20], harmonic steering [21–23], etc. The potential of using TMLAs for radar [24] and secure wireless communications [25] has also been explored. This paper investigates the capabilities of TMLAs for multipoint communication purposes by exploiting the first-order harmonic patterns toward the desired directions. The fundamental pattern is kept unaltered at the broadside direction, and the higher-order frequencies or sideband patterns are minimized.
The SLLs of all the desired patterns are minimized for an efficient beam steering performance. To achieve the desired goals, unique pulse-shifted time sequences have been derived by optimizing the starting instants and the total on-times of all the radiating elements. A TMLA structure with ten isotropic elements is considered, and the preselected beam steering angles for the array are chosen as ±10°, ±20°, and ±30° from the broadside of the array. The desired on–off sequences are generated with an improved differential evolution algorithm based on a wavelet mutation strategy.

2 Theory and Mathematical Background

Let us consider an N-element TMLA in which each element is connected to a high-speed switch and aligned along the positive z-axis, as shown in Fig. 1. The array factor is given as

$$AF(\theta, t) = e^{j(2\pi f_0)t} \sum_{n=1}^{N} I_n U_n(t)\, e^{jk(n-1)d\cos\theta} \tag{1}$$



Fig. 1 N-element time-modulated linear array configuration with programmable logic circuit and high-speed switches

where $I_n$ is the excitation of the nth element, $k$ represents the propagation constant, $d$ is the uniform element spacing, $\theta$ denotes the angle measured from the main axis of the array, the operating frequency is $f_0$ with $T_0$ being the fundamental period, and $U_n(t)$ represents the switching function. Due to the periodic nature of the switching elements, $U_n(t)$ can be represented as

$$U_n(t) = \sum_{m=-\infty}^{\infty} a_{mn}\, e^{jm(2\pi f_p)t} \tag{2}$$

where $a_{mn}$ is the complex excitation coefficient of the nth element for the mth-order frequency generated due to time-modulation and can be expressed as

$$a_{mn} = \frac{1}{T_p} \int_{0}^{T_p} U_n(t)\, e^{-jm(2\pi f_p)t}\, dt \tag{3}$$

The corresponding array factor expression for a uniformly excited TMLA ($I_n = 1$) can be derived from Eq. (1) as

$$AF(\theta, t) = \sum_{m=-\infty}^{\infty} \sum_{n=1}^{N} \left[a_{mn}\, e^{jk(n-1)d\cos\theta}\right] e^{j2\pi(f_0 + m f_p)t} \tag{4}$$

The array factor for the mth order frequency can be simplified from Eq. (4) as



$$AF_m(\theta, t) = e^{j2\pi(f_0 + m f_p)t} \sum_{n=1}^{N} a_{mn}\, e^{jk(n-1)d\cos\theta} \tag{5}$$

The fundamental pattern can be obtained by putting $m = 0$ in the above equation, and the corresponding harmonic patterns are inherently generated at multiples of the modulation frequency, with $m = \pm 1, \pm 2, \pm 3, \ldots, \pm\infty$.
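As an illustrative sketch (not the authors' MATLAB code), the coefficients of Eq. (3), evaluated in closed form for a single ON pulse, and the harmonic array factors of Eq. (5) can be computed numerically. Times are normalized to $T_p$; the half-wavelength spacing d = 0.5λ and the single-pulse switching function are assumptions of this sketch:

```python
import numpy as np

def pulse_coeff(m, t_on, t_off):
    """Closed-form Fourier coefficient a_mn (Eq. (3)) of a single ON pulse
    on [t_on, t_off], with times normalized to the modulation period Tp."""
    tau = t_off - t_on
    # np.sinc(x) = sin(pi*x)/(pi*x), matching sinc(m*pi*fp*tau) with normalized times
    return tau * np.sinc(m * tau) * np.exp(-1j * np.pi * m * (t_on + t_off))

def harmonic_af(m, t_on, t_off, theta, d=0.5):
    """|AF_m(theta)| of Eq. (5), time factor dropped; theta is an array of
    angles in radians, d the uniform element spacing in wavelengths."""
    t_on, t_off = np.asarray(t_on), np.asarray(t_off)
    n = np.arange(len(t_on))
    steer = np.exp(1j * 2 * np.pi * d * np.outer(np.cos(theta), n))
    return np.abs(steer @ pulse_coeff(m, t_on, t_off))
```

For a fully ON ten-element array, the m = 0 pattern reduces to the conventional uniform array factor (peak value N at broadside), while all harmonic coefficients vanish.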

2.1 Switching Sequences

Different switching schemes can be designed for different applications, and the desired steering performance addressed in this paper is obtained with a properly optimized pulse-shifted sequence. The simple on–off sequence for the nth element is shown in Fig. 2a, where the element is on for a period $\tau_n$ ($0 \le \tau_n \le T_p$). The pulse starts at $\tau_n^1 = 0$ and ends at $\tau_n^2$, with a normalized on-time period of $(\tau_n^2 - \tau_n^1)/T_p$. The switching function $U_n(t)$ and the corresponding complex excitation are given as

$$U_n(t) = \begin{cases} 1, & \tau_n^1 \le t \le \tau_n^2 \le T_p \text{ where } \tau_n^1 = 0 \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

$$a_{mn} = \frac{\tau_n}{T_p}\, \operatorname{sinc}\!\left(m\pi f_p \tau_n\right) e^{-jm\pi f_p \tau_n} \tag{7}$$

The generalized pulse-shifted sequence for the nth element is shown in Fig. 2b, where the on-time pulse of Fig. 2a is shifted to start at $\tau_n^1$ ($\tau_n^1 \neq 0$) and the element stays on up to $\tau_n^2$. The switching function $U_n(t)$ and the corresponding complex excitation for the shifted pulse are expressed as

$$U_n(t) = \begin{cases} 1, & 0 < \tau_n^1 \le t \le \tau_n^2 \le T_p \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

$$a_{mn} = \frac{\tau_n^2 - \tau_n^1}{T_p}\, \operatorname{sinc}\!\left(m\pi f_p \left(\tau_n^2 - \tau_n^1\right)\right) e^{-jm\pi f_p \left(\tau_n^1 + \tau_n^2\right)} \tag{9}$$

Fig. 2 a Simple on–off switching scheme, and b shifted pulse switching scheme for the nth element of the proposed antenna array

To steer the desired first-harmonic radiation patterns ($m = \pm 1$) toward the preselected directions ($\theta_0$), the starting instant of each radiating element is modified as

$$\tau_n^1 = \operatorname{mod}\!\left(\frac{(n-1)kd\cos\theta_0}{2\pi} - \frac{\tau_n}{2},\; 1\right) \tag{10}$$
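A numerical check of the steering rule, as a hedged sketch: the start times of Eq. (10) are applied to uniform normalized on-times (τ_n = 0.5 and d = 0.5λ are illustrative assumptions), and the resulting m = +1 pattern of Eqs. (5) and (9) peaks at the preselected direction. Pulses whose shifted end time exceeds Tp simply wrap around the period, which leaves the closed-form coefficient of Eq. (9) unchanged.

```python
import numpy as np

def steer_start_times(tau, theta0_deg, d=0.5):
    """Normalized starting instants from Eq. (10), so that the m = +1
    harmonic points toward theta0; tau are on-times normalized to Tp."""
    n = np.arange(len(tau))
    kd = 2 * np.pi * d  # k*d with spacing d in wavelengths
    return np.mod(n * kd * np.cos(np.radians(theta0_deg)) / (2 * np.pi) - tau / 2, 1.0)

# Ten elements, uniform on-time, first harmonic steered 20 deg off broadside.
tau = np.full(10, 0.5)
t_on = steer_start_times(tau, 70.0)
t_off = t_on + tau

# First-harmonic coefficients from Eq. (9) and the resulting pattern
a1 = tau * np.sinc(tau) * np.exp(-1j * np.pi * (t_on + t_off))
theta = np.radians(np.linspace(0.0, 180.0, 721))
af1 = np.abs(np.exp(1j * 2 * np.pi * 0.5 * np.outer(np.cos(theta), np.arange(10))) @ a1)
peak_deg = np.degrees(theta[np.argmax(af1)])  # peaks close to 70 deg
```

The shift of Eq. (10) makes the phase of each $a_{1n}$ equal to $-(n-1)kd\cos\theta_0$, so the first-harmonic pattern becomes a uniform-amplitude array steered to $\theta_0$.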

2.2 Cost Function Formulation

To control the radiation features of the fundamental and the desired steered harmonic patterns, the SLLs of all the patterns are minimized. The unwanted higher-order harmonics are also suppressed. To achieve the desired objectives, the appropriately designed cost function (CF) for the minimization process is defined as

$$CF = w_1 \cdot \mathrm{SLL}_0^{(i)}\big|_{f_0} + w_2 \cdot \mathrm{SLL}_1^{(i)}\big|_{f_0 \pm f_p} + w_3 \cdot \mathrm{SBL}_m^{(i)}\big|_{f_0 + m f_p} \tag{11}$$

where $i$ is the number of iteration cycles, $\mathrm{SLL}_0$ and $\mathrm{SLL}_1$ are the maximum sidelobes of the fundamental ($f_0$) and first sideband patterns ($f_0 \pm f_p$), $\mathrm{SBL}_m$ represents the maximum level of the higher-order harmonic patterns ($f_0 + m f_p$, $m = \pm 2, \pm 3, \ldots, \pm\infty$), and $w_1$, $w_2$, $w_3$ are the contributing factors ($w_1 = w_2 = w_3 = 1$). To minimize the CF, an improved differential evolution based on wavelet mutation (DEWM) is employed. The wavelet-based mutation operation enhances the fine-tuning ability of DE toward the optimal solution. The details of DEWM can be found in [26].
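The evaluation of Eq. (11) can be sketched as follows; this is an illustrative reading, not the authors' implementation. All levels are taken in dB relative to the fundamental peak, the infinite sideband range is truncated at an assumed |m| = m_max, and the main-lobe regions are supplied as boolean masks over the θ grid:

```python
import numpy as np

def harmonic_pattern(m, t_on, t_off, theta, d=0.5):
    """|AF_m(theta)| from Eqs. (5) and (9); times normalized to Tp."""
    tau = t_off - t_on
    a = tau * np.sinc(m * tau) * np.exp(-1j * np.pi * m * (t_on + t_off))
    steer = np.exp(1j * 2 * np.pi * d * np.outer(np.cos(theta), np.arange(len(tau))))
    return np.abs(steer @ a)

def cost_function(t_on, t_off, theta, main0, main1p, main1n,
                  w1=1.0, w2=1.0, w3=1.0, m_max=5):
    """CF of Eq. (11): w1*SLL0 + w2*SLL1 + w3*SBLm, all in dB relative to
    the fundamental peak. main0/main1p/main1n are boolean masks marking
    the main-lobe regions of the m = 0, +1, -1 patterns on the theta grid."""
    p0 = harmonic_pattern(0, t_on, t_off, theta)
    ref = p0.max()

    def db(p):  # clamp to avoid log of zero
        return 20.0 * np.log10(np.maximum(p / ref, 1e-12))

    sll0 = db(p0[~main0]).max()
    sll1 = max(db(harmonic_pattern(1, t_on, t_off, theta)[~main1p]).max(),
               db(harmonic_pattern(-1, t_on, t_off, theta)[~main1n]).max())
    # The +m and -m patterns are mirror images in theta, so their maxima
    # coincide and one sign suffices for the sideband level.
    sbl = max(db(harmonic_pattern(m, t_on, t_off, theta)).max()
              for m in range(2, m_max + 1))
    return w1 * sll0 + w2 * sll1 + w3 * sbl
```

A minimizer such as differential evolution would repeatedly call this cost over candidate (t_on, t_off) vectors; the DEWM variant of [26] additionally perturbs candidates with a wavelet-scaled mutation.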

3 Numerical Results and Discussion

Different cases of beam steering for a ten-element TMLA are presented, with predefined steering angles toward ±10°, ±20°, and ±30° from the array main axis. The fundamental pattern is kept unaltered. The SLLs of all the patterns are reduced to an acceptable level, and the unwanted harmonics are suppressed to enhance the array efficiency. All the computations are performed with MATLAB. For the ten-element TMLA, the starting instants and the on-times are optimized with DEWM to produce the switching schemes required for generating the desired patterns. The first positive ($f_0 + f_p$) and the first negative ($f_0 - f_p$) sidebands are steered to the desired directions while the fundamental pattern is unaltered.

Fig. 3 Optimized switching sequence of ten-element TMLA for ±10° steered patterns

3.1 Case 1: Steered Patterns at ±10°

To guide the first positive and first negative sidebands toward 100° and 80° (a ±10° shift from broadside), an optimal switching scheme is reported in Fig. 3. The radiation patterns obtained for the fundamental and first sidebands are presented in Fig. 4. The SLLs of the fundamental pattern and first sideband patterns are reduced to −26.38 and −20.34 dB, respectively. The SBLs, or the offsets of the steered patterns due to steering, are −1.51 dB. The power radiated in the desired patterns is reported as 32.046, 21.6081, and 21.6081%, and the total radiated power in the desired patterns is calculated as 75.2622%. The higher sidebands are suppressed, and the power loss due to unwanted harmonics is minimized to 24.7378%.

Fig. 4 Radiation patterns of the fundamental and ±10° steered first sideband patterns

Fig. 5 Optimized switching sequence of ten-element TMLA for ±20° steered patterns

3.2 Case 2: Steered Patterns at ±20°

To generate the desired ±20° steered patterns at the first sidebands, an optimal switching scheme is developed and shown in Fig. 5. The radiation patterns for the fundamental, first positive, and first negative sidebands, pointing toward 90°, 110° (+20° shift), and 70° (−20° shift), are shown in Fig. 6. The SLLs of the fundamental and first sideband patterns are obtained as −26.75 and −20.54 dB, respectively. The SBLs are reported as −1.51 dB for both steered patterns. The power radiated by the desired patterns is 32.0047, 21.5737, and 21.5737%. The total power in the desired patterns is evaluated as 75.1521%, and the unwanted power in the higher harmonics is reduced to 24.8479%.

Fig. 6 Radiation patterns of the fundamental and ±20° steered first sideband patterns

3.3 Case 3: Steered Patterns at ±30°

Another case, with ±30° steered first sideband patterns, is also considered, where the fundamental pattern is directed toward 90° and the first sideband patterns are steered to 120° (+30° shift) and 60° (−30° shift). The optimized switching scheme is presented in Fig. 7, and the corresponding radiation patterns are shown in Fig. 8. The SLLs of the fundamental and steered patterns are reduced to −26.92 and −20.62 dB, respectively. The SBLs of the shifted patterns are reported as −1.51 dB. The power radiated by the desired patterns is 31.9869, 21.5585, and 21.5585%. The total power radiated by the desired patterns is calculated as 75.1039%, and the power in the unwanted higher sidebands is 24.8961%. The power radiated by the fundamental, first positive sideband, and other higher-order sideband patterns for all three cases of beam steering is presented in Fig. 9. It is evident from the figure that the undesired higher-order sideband power is reduced. The convergence profiles of the DEWM-based optimization process for all three cases are shown in Fig. 10. The numerical results obtained with the proposed method of beam steering are reported in Table 1.

Fig. 7 Optimized switching sequence of ten-element TMLA for ±30° steered patterns

Fig. 8 Radiation patterns of the fundamental and ±30° steered first sideband patterns

Fig. 9 Power radiated by fundamental, first sideband, and higher-order sideband patterns

Fig. 10 Convergence profiles for all the cases of beam steering with DEWM




Table 1 Numerical results obtained with the proposed beam steering method based on DEWM

Steering angle   SLL0 (dB)   SLL1 (dB)   SBL1 (dB)   P0 (%)    P1 (%)
±10°             −26.38      −20.34      −1.51       32.0460   21.6081
±20°             −26.75      −20.54      −1.51       32.0047   21.5737
±30°             −26.92      −20.62      −1.51       31.9869   21.5585

4 Conclusion

The steered pattern synthesis for multipoint communication is addressed in this paper by exploiting the first-order harmonic patterns of a time-modulated linear array (TMLA). Toward this purpose, appropriate switching schemes are developed by optimizing the starting times and the on-time periods of all radiating elements of the array. A ten-element TMLA is considered, and three different prespecified steering directions are chosen as ±10°, ±20°, and ±30° from the broadside direction. The sidelobe levels (SLLs) of all the fundamental patterns are minimized below an acceptable level of −26 dB. The SLLs of the steered first positive and negative harmonic patterns are also reduced below −20 dB. The power radiated by all unwanted higher-frequency sideband patterns is minimized below 25%, and the power efficiency of the TMLA for all three cases of steered pattern synthesis is maintained above 75%. Thus, the potential of the proposed DEWM-based method is demonstrated by achieving an efficient electronic beam steering performance for a ten-element TMLA.

Acknowledgements This research is carried out under a project funded by the Science and Engineering Research Board, Dept. of Science and Technology, Govt. of India (Grant No. EEQ/2017/000519, dated 23.03.2018).

References

1. P. Rocca, F. Yang, L. Poli, S. Yang, Time-modulated array antennas–theory, techniques, and applications. J. Electromagn. Waves Appl. 33(12), 1503–1531 (2019). https://doi.org/10.1080/09205071.2019.1627251
2. H.E. Shanks, R.W. Bickmore, Four-dimensional electromagnetic radiators. Can. J. Phys. 37(3), 263–275 (1959). https://doi.org/10.1139/p59-031
3. H.E. Shanks, A new technique for electronic scanning. IRE Trans. Antennas Propag. 9(2), 162–166 (1961). https://doi.org/10.1109/TAP.1961.1144965
4. W.H. Kummer, A.T. Villeneuve, T.S. Fong, F.G. Terrio, Ultra-low sidelobes from time-modulated arrays. IEEE Trans. Antennas Propag. 11(6), 633–639 (1963). https://doi.org/10.1109/TAP.1963.1138102
5. S. Yang, Y.B. Gan, A. Qing, Sideband suppression in time-modulated linear arrays by the differential evolution algorithm. IEEE Antennas Wirel. Propag. Lett. 1, 173–175 (2002). https://doi.org/10.1109/LAWP.2002.807789



6. S. Yang, Y.B. Gan, A. Qing, P.K. Tan, Design of a uniform amplitude time modulated linear array with optimized time sequences. IEEE Trans. Antennas Propag. 53(7), 2337–2339 (2005). https://doi.org/10.1109/TAP.2005.850765
7. S. Yang, Y.B. Gan, P.K. Tan, A new technique for power-pattern synthesis in time-modulated linear arrays. IEEE Antennas Wirel. Propag. Lett. (2003). https://doi.org/10.1109/LAWP.2003.821556
8. S. Yang, Y.B. Gan, P.K. Tan, Evaluation of directivity and gain for time-modulated linear antenna arrays. Microw. Opt. Technol. Lett. (2004). https://doi.org/10.1002/mop.20241
9. J.C. Brégains, J. Fondevila-Gómez, G. Franceschetti, F. Ares, Signal radiation and power losses of time-modulated arrays. IEEE Trans. Antennas Propag. 56(6), 1799–1804 (2008). https://doi.org/10.1109/TAP.2008.923345
10. L. Poli, P. Rocca, L. Manica, A. Massa, Handling sideband radiations in time-modulated arrays through particle swarm optimization. IEEE Trans. Antennas Propag. 58(4), 1408–1411 (2010). https://doi.org/10.1109/TAP.2010.2041165
11. Q. Zhu, S. Yang, L. Zheng, Z. Nie, Design of a low sidelobe time modulated linear array with uniform amplitude and sub-sectional optimized time steps. IEEE Trans. Antennas Propag. 60(9), 4436–4439 (2012). https://doi.org/10.1109/TAP.2012.2207082
12. A. Chakraborty, G. Ram, D. Mandal, Optimal pulse shifting in timed antenna array for simultaneous reduction of sidelobe and sideband level. IEEE Access (2020). https://doi.org/10.1109/ACCESS.2020.3010047
13. O. Gassab, A. Azrar, A. Dahimene, S. Bouguerra, Efficient mathematical method to suppress sidelobes and sidebands in time-modulated linear arrays. IEEE Antennas Wirel. Propag. Lett. (2019). https://doi.org/10.1109/LAWP.2019.2903200
14. A. Tennant, B. Chambers, A two-element time-modulated array with direction-finding properties. IEEE Antennas Wirel. Propag. Lett. (2007). https://doi.org/10.1109/LAWP.2007.891953
15. G. Li, S. Yang, Y. Chen, Z. Nie, A novel electronic beam steering technique in time modulated antenna arrays. Prog. Electromagn. Res. (2009). https://doi.org/10.2528/PIER09072602
16. L. Poli, P. Rocca, G. Oliveri, A. Massa, Harmonic beamforming in time-modulated linear arrays. IEEE Trans. Antennas Propag. 59(7), 2538–2545 (2011). https://doi.org/10.1109/TAP.2011.2152323
17. R. Maneiro-Catoira, J.A. Garcia-Naya, J.C. Bregains, L. Castedo, Multibeam single-sideband time-modulated arrays. IEEE Access (2020). https://doi.org/10.1109/ACCESS.2020.3017621
18. A. Chakraborty, G. Ram, D. Mandal, Pattern synthesis of timed antenna array with the exploitation and suppression of harmonic radiation. Int. J. Commun. Syst. e4727 (2021). https://doi.org/10.1002/dac.4727
19. U. Yeşilyurt, I. Kanbaz, E. Aksoy, A multibeam subarrayed time-modulated linear array. Turkish J. Electr. Eng. Comput. Sci. (2020). https://doi.org/10.3906/elk-1904-201
20. G. Li, S. Yang, Z. Nie, Direction of arrival estimation in time modulated linear arrays with unidirectional phase center motion. IEEE Trans. Antennas Propag. 58(4), 1105–1111 (2010). https://doi.org/10.1109/TAP.2010.2041313
21. A. Chakraborty, D. Mandal, G. Ram, Beam steering in a time switched antenna array with reduced side lobe level using evolutionary optimization technique, in 2019 IEEE Indian Conference on Antennas and Propagation, InCAP 2019. https://doi.org/10.1109/InCAP47789.2019.9134497
22. G. Bogdan, K. Godziszewski, Y. Yashchyshyn, Time-modulated antenna array with beam steering for low-power wide-area network receivers. IEEE Antennas Wirel. Propag. Lett. (2020). https://doi.org/10.1109/LAWP.2020.3007925
23. A. Chakraborty, G. Ram, D. Mandal, Multibeam steered pattern synthesis in time-modulated antenna array with controlled harmonic radiation. Int. J. RF Microw. Comput. Aided Eng. e22597 (2021). https://doi.org/10.1002/mmce.22597
24. J. Euziere, R. Guinvarc'h, B. Uguen, R. Gillard, Optimization of sparse time-modulated array by genetic algorithm for radar applications. IEEE Antennas Wirel. Propag. Lett. (2014). https://doi.org/10.1109/LAWP.2014.2299285



25. R. Maneiro-Catoira, J. Brégains, J.A. García-Naya, L. Castedo, Time modulated arrays: from their origin to their utilization in wireless communication systems. Sensors (Switzerland) 17(3) (2017). https://doi.org/10.3390/s17030590
26. G. Ram, D. Mandal, R. Kar, S.P. Ghoshal, Directivity maximization and optimal far-field pattern of time modulated linear antenna arrays using evolutionary algorithms. AEU-Int. J. Electron. Commun. (2015). https://doi.org/10.1016/j.aeue.2015.09.009

Classification of Attacks on MQTT-Based IoT System Using Machine Learning Techniques

Jigar Makhija, Akhil Appu Shetty, and Ananya Bangera

Abstract Threats and attacks on the MQTT protocol are expected to grow with the exponential use of IoT networks in major domains such as industrial automation, transportation, and smart cities. Attacks that frequently target IoT systems include unauthorized access, denial of service, packet sniffing, and malware injection, any of which may lead to system failure. In this research, multiple machine learning models, namely random forest, the KNN classifier, and SVM, are used to classify attacks in a dataset collected from an MQTT-based IoT system. Precision, accuracy, and the F1 score were used as the evaluation metrics for comparing the performances of the models. The results obtained showed that random forest performed best, with 96% accuracy.

Keywords Internet of things (IoT) · MQTT · Anomaly detection · Machine learning

J. Makhija (B)
Amrita Vishwa Vidyapeetham, Coimbatore, India
A. A. Shetty · A. Bangera
Sahyadri College of Engineering and Management, Mangalore, India
e-mail: [email protected]
A. Bangera
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_19

1 Introduction

The Internet has been around for a long time, mostly as a product for people; but since its official launch in the 1980s, the Internet has also come to connect things. The ordinary devices of everyday life became futuristic as they were armed with sensors that hold the ability to visualize, hear, and feel the surrounding environment and convert real-world information into digital data. Forecasts from Statista suggest that the number of connected IoT devices is set to hit 75 billion by 2025. IoT devices have become more efficient and consume less energy owing to their low computation capacities and the use of lightweight protocols [1]. These advantages, however, also reduce their encryption capacity, making them all the more vulnerable to attacks. This heterogeneous combination of sensor nodes creates new challenges in cybersecurity. One of the crucial means by which network security can be improved is the use of intrusion detection systems (IDS). An IDS flags suspicious activity within the network and brings it to notice, and is hence considered one of the most important factors in corporate cybersecurity. There are two primary methods of detection: anomaly-based and signature-based. In this paper, we focus on the latter: signature-based (or definition-based) ID systems detect intrusions by studying actions and identifying patterns that match the patterns of known threats. An attack signature specifies the essential events needed to execute the attack and the sequence in which they must be performed. In this methodology, the attacks are recorded beforehand in a database. The benefit of this data analysis approach is that it is faster and more efficient than other methodologies, since it can also address the issue of unknown threats. Here we propose a machine learning-based solution that can detect attacks, warn the system, and protect it further when it is in an unstable state. Quite a few machine learning classifiers have been utilized for this task. Another main feature of this paper is the comparison of basic models such as random forest with SVM and KNN [2]. A machine learning classifier essentially requires a dataset to be built. Several popular models [3] have been applied to implement machine learning-based network intrusion detection. We have utilized a public dataset from Kaggle, which is based on attacks on the MQTT protocol for IoT systems. In this study, based on a custom DoS attack model, we present a machine learning model to detect MQTT DoS attacks [4] in IoT systems.
The paper is organized into the following sections: a summary of other IoT attacks and research conducted in anomaly detection is given in Sect. 2. In Sect. 3, the overview of the dataset, the machine learning models, and the system framework are discussed. The results and their analysis are described in Sect. 4. The limitations of the experiment, the conclusions, and the future scope are described in Sect. 5.

2 Literature Review

There is a plethora of studies that use machine learning to detect anomalies in conventional networks. MQTT transaction-based features are recommended in recent work that aims to detect IoT-based attacks. The authors used features intended for the analysis of the TCP/IP protocol, which do not provide adequate details on the parameters of the MQTT protocol. In comparison, our suggested MQTT features are based on meta-data from the MQTT payload, which can identify and separate such attacks effectively. Furthermore, the key downside of Moustafa et al. [5] is that no results of their attack detection system were presented for MQTT-based attacks.



The major reason was that no real-time attack dataset was available for MQTT on which to test intrusion detection methods. In work on IoT-MQTT-based DoS attack modeling and detection, Syed Naeem Firdous et al. [6] show that the MQTT protocol is widely adopted and vulnerable to open attacks. Detecting attacks on the MQTT protocol and applying multiclass classification is shown in the work presented by Alaiz-Moreton et al. [7]. Hindy et al. [2] proposed an IoT IDS based on machine learning, which evaluates six different ML techniques as attack classifiers. Hasan et al. [8] conducted attack and vulnerability detection for IoT sensors in IoT sites using classical machine learning methods over a dataset and provided a comparative study, but this does not guarantee that RF will work similarly on big data and against other vulnerabilities [9].

3 Resources and Methods

The overall structure involves the fusion of multiple separate processes. The first step in this process is the compilation and observation of datasets. The dataset was extensively gathered and analyzed to find out the various categories of data. We then applied preprocessing to the collected dataset. Data preprocessing involves different steps such as feature engineering, data representation, and data vectorization. The result of these steps converts the data into a set of feature vectors. For training and testing of the models, the dataset was further divided in a 70:30 ratio for analysis.
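The preprocessing pipeline above can be sketched as follows. This is a hedged illustration: the column names stand in for the eight discrete features of Table 2 and are hypothetical, since the Kaggle dataset's actual headers may differ, and ordinal encoding is just one plausible vectorization of the discrete fields.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column names standing in for the eight discrete features
# of Table 2; the real dataset's headers may differ.
FEATURES = ["src_ip", "dst_ip", "src_port", "dst_port",
            "frame_len", "mqtt_flags", "mqtt_msg", "mqtt_topic"]

def make_splits(df: pd.DataFrame, label_col: str = "label"):
    """Vectorize the discrete features as integer codes and split 70:30,
    mirroring the preprocessing steps described above."""
    enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
    X = enc.fit_transform(df[FEATURES].astype(str))
    y = df[label_col].to_numpy()
    return train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
```

Stratifying on the label preserves the normal/anomalous class proportions of the full dataset in both splits.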

3.1 Data Collection

The experiment utilized an open-source database on Kaggle provided by Messias Alves, where the architecture is a collection of micro-services that interact with one another by employing the MQTT protocol. The dataset includes 588,261 samples, of which 429,717 are normal data and 158,544 are anomalous data. Table 1 gives the complete representation of the distribution of the various attacks and anomalies in the test dataset. In Table 2, descriptions of the 8 features are provided.

Table 1 Tabular representation of the observed attacks

Attacks            Frequency count   % of anomalous data
Denial of service  132,216           22.47
Brute force        26,328            4.47

Table 2 Feature description

SL   Features                      Data type
1    Source address IPV6           Discrete
2    Destination address IPV6      Discrete
3    TCP source port number        Discrete
4    TCP destination port number   Discrete
5    Frame length                  Discrete
6    MQTT header flag              Discrete
7    MQTT message                  Discrete
8    MQTT topic                    Discrete

1. Denial of Service (DoS): In this type of attack, a user is denied access to a particular service by overwhelming its physical or network connections; the attacker either clogs the link, reducing the bandwidth, or ties up system space [10]. By flooding requests at a single source or receiver, the service is potentially disrupted. The sample used in this experiment includes 132,216 samples of DoS attack.
2. Brute Force Attack: This attack implements a form of trial-and-error methodology to decode confidential/sensitive data. Despite the absence of an intellectual strategy, permutations and combinations run over a period of time on passwords, API keys, and other encryption keys can result in privacy breaches, providing complete access to the attacker. Since this attack is time-intensive, the amount of time it takes to brute-force into a system defines the level of system security.

3.2 Theoretical Considerations

Data analysis was undertaken with the help of multiple algorithms, descriptions of which are provided below.

Random Forest (RF): Random forest is a supervised classification algorithm [11]. It comprises a number of individual decision trees that function as an ensemble. Each individual tree in the random forest produces a class prediction, and the class with the most votes becomes the prediction of the model.

Support Vector Machine (SVM): A support vector machine is a supervised machine learning model used for both classification and regression challenges. The classification algorithm is quick and reliable and can perform decently with a limited amount of data. SVMs provide good speed and high performance. The weight vector can be estimated by the following equation if an input x, Lagrange multipliers α, and a class c are presented:

$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i c_i x_i$$

The target of the SVM is to optimize the following dual objective:

$$\text{Maximize} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j c_i c_j K(x_i, x_j)$$

where $K(x_i, x_j)$ is obtained from various kernels such as the polynomial, radial basis, and sigmoid functions.

K-Nearest Neighbors (KNN): K-nearest neighbors (KNN) is used for regression and classification problems in machine learning. KNN algorithms categorize new data points based on measures of similarity, commonly a distance function. Classification is based on a majority vote among a point's neighbors. The Euclidean, Manhattan, and Minkowski distance functions are valid for continuous variables, while for categorical variables the Hamming distance function must be employed. The easiest way to select the optimum value of K is to examine the data first. In general, a larger value of K is more effective, as it reduces the overall noise, though with no guarantee. The ideal K has traditionally been between 3 and 7 for most datasets, which yields substantially better results than 1-NN.
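The three-classifier comparison described above can be sketched with scikit-learn as follows. The hyperparameters (100 trees, K = 5, RBF kernel) are illustrative assumptions, not values reported by the paper:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

def evaluate_models(X_train, y_train, X_test, y_test):
    """Fit RF, KNN, and SVM and report accuracy and weighted F1 in percent."""
    models = {
        "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "SVM": SVC(kernel="rbf"),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        results[name] = (100.0 * accuracy_score(y_test, pred),
                         100.0 * f1_score(y_test, pred, average="weighted"))
    return results
```

Each model is trained on the 70% split and scored on the held-out 30%, matching the evaluation criteria of the next subsection.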

3.3 Evaluation Criteria

The F1 score was calculated to assess the efficiency of the developed system. The F1 score conveys the balance between precision and recall and can be used to measure the model's performance. It is calculated using the equation below:

$$F_1 = \frac{2 \cdot \mathrm{TP}}{2 \cdot \mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$$

4 Outcome of the Applied Model

4.1 Attack Classification Results and Analysis

The impact of the listed MQTT characteristics on attack detection was determined by analyzing the detection method with count-based datasets and the complete dataset.



Fig. 1 Random forest plot

4.2 Criteria for Detection

Multiple experiments on the datasets described in the above section have tested the efficacy of the detection. To assess the effectiveness of detecting anomalous MQTT traffic, the performance of each individual classifier was calculated using the following metrics: accuracy and F1 score. These metrics can be determined from the number of correctly and incorrectly classified instances by calculating the True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) [12]. TP is the number of anomalous occurrences in the dataset that are correctly detected. TN is the number of valid instances correctly identified. FP is the number of regular records classified as anomalous, whereas FN is the number of genuinely anomalous cases classified as legitimate (Figs. 1, 2, and 3). The percentage of occurrences correctly identified as either abnormal or valid is the accuracy (ACC), determined by

$$\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}$$
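The accuracy and F1 formulas above can be combined into a small helper, a straightforward transcription of the two equations:

```python
def metrics_from_counts(tp: int, tn: int, fp: int, fn: int):
    """Accuracy and F1 score computed from the confusion-matrix counts,
    matching the formulas in Sects. 3.3 and 4.2."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, f1

# Example: 90 TP, 80 TN, 10 FP, 20 FN
acc, f1 = metrics_from_counts(90, 80, 10, 20)  # acc = 0.85, f1 ≈ 0.857
```

Note that the F1 score ignores true negatives, so on the imbalanced MQTT dataset it complements plain accuracy.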

4.3 Detection Results

The results of the classification models used in the MQTT attack classification are reported in terms of accuracy (%) and F1 score (%). The classifiers were tested with the full feature set, and the results are populated in Table 3.



Fig. 2 KNN plot

Fig. 3 SVM plot

Table 3 Results obtained

Algorithm                    Accuracy (%)   F1 score (%)
SVM                          92.5455        91.9069
KNN classifier               95.495         95.42526
Random forest classifier     96.41611       96.3366

5 Conclusions and Future Scope

This paper performed an experiment that utilized an ML-based attack detection framework to detect MQTT attacks such as brute force and DoS [13]. The model was able to identify the attacks and predict their class. The dataset was evaluated with three different ML models. While several other algorithms were similarly reliable, the random forest algorithm obtained the best accuracy, at roughly 96%. Furthermore, this problem requires more empirical analysis with real-time data, and more research is needed to interpret these issues fully. There is no guarantee, however, that these models will work efficiently for the other attacks to which IoT systems are prone [14].

References

1. M.H. Bhuyan, D.K. Bhattacharyya, J.K. Kalita, Towards generating real-life datasets for network intrusion detection. Int. J. Netw. Secur. 17(6), 683–701 (2015)
2. H. Hindy, E. Bayne, M. Bures, R. Atkinson, C. Tachtatzis, X. Bellekens, Machine learning based IoT intrusion detection system: an MQTT case study (MQTT-IoT-IDS2020 dataset), in 12th International Network Conference 2020 (Springer)
3. K.A. da Costa, J.P. Papa, C.O. Lisboa, R. Munoz, V.H.C. de Albuquerque, Internet of Things: a survey on machine learning-based intrusion detection approaches. Comput. Netw. 151, 147–157 (2019)
4. I. Vaccari, G. Chiola, M. Aiello, M. Mongelli, E. Cambiaso, MQTTset, a new dataset for machine learning techniques on MQTT. Sensors 20(22), 6578 (2020)
5. N. Moustafa, B. Turnbull, K.K.R. Choo, An ensemble intrusion detection technique based on proposed statistical flow features for protecting network traffic of internet of things. IEEE Internet Things J. 6(3), 4815–4830 (2018)
6. F. Chen, Y. Huo, J. Zhu, D. Fan, A review on the study on MQTT security challenge, in 2020 IEEE International Conference on Smart Cloud (SmartCloud), Nov 2020 (IEEE), pp. 128–133
7. H. Alaiz-Moreton, J. Aveleira-Mata, J. Ondicol-Garcia, A.L. Muñoz-Castañeda, I. García, C. Benavides, Multiclass classification procedure for detecting attacks on MQTT-IoT protocol. Complexity (2019)
8. M. Hasan, M.M. Islam, M.I.I. Zarif, M.M.A. Hashem, Attack and anomaly detection in IoT sensors in IoT sites using machine learning approaches. Internet Things 7, 100059 (2019)
9. M.O. Pahl, F.X. Aubet, All eyes on you: distributed multi-dimensional IoT microservice anomaly detection, in 2018 14th International Conference on Network and Service Management (CNSM), Nov 2018 (IEEE), pp. 72–80
10. N. Chaabouni, M. Mosbah, A. Zemmari, C. Sauvignac, P. Faruki, Network intrusion detection for IoT security based on learning techniques. IEEE Commun. Surv. Tutor. 21(3), 2671–2701 (2019)
11. S. Zeadally, M. Tsikerdekis, Securing Internet of Things (IoT) with machine learning. Int. J. Commun. Syst. 33(1), e4169 (2020)
12. N.F. Syed, Z. Baig, A. Ibrahim, C. Valli, Denial of service attack detection through machine learning for the IoT. J. Inform. Telecommun. 1–22 (2020)
13. F. Hussain, R. Hussain, S.A. Hassan, E. Hossain, Machine learning in IoT security: current solutions and future challenges. IEEE Commun. Surv. Tutor. (2020)
14. A.P. Haripriya, K. Kulothungan, Secure-MQTT: an efficient fuzzy logic-based approach to detect DoS attack in MQTT protocol for internet of things. EURASIP J. Wirel. Commun. Netw. 2019(1), 90 (2019)

Encrypted Traffic Classification Using eXtreme Gradient Boosting Algorithm Neha Gupta, Vinita Jindal, and Punam Bedi

Abstract In an organization, encrypted traffic is generated not only from HTTPS, VOIP, and SMTPS protocols but also from private VPNs and TOR browsers. Employees use these tools to access websites restricted by the organization. VPN and TOR traffic consumes important network resources and also violates organizational policies. Hence, encrypted traffic classification is crucial for an organization’s network management process. This paper proposes a novel system to perform encrypted traffic classification. The proposed system uses the eXtreme gradient boosting algorithm to classify the traffic into three classes: normal (non-VPN and non-TOR) traffic, TOR traffic, and VPN traffic. The CIC-Darknet 2020 dataset has been used to train the proposed system as it contains an imbalanced amount of TOR and VPN traffic, which closely resembles the real-world traffic. The proposed system outperforms other machine learning algorithms in terms of recall, precision, F1-score, and accuracy. Keywords Network security · Encrypted traffic classification · eXtreme gradient boosting (XGBoost) · Virtual private network (VPN) · The onion router (TOR) · CIC-Darknet 2020 dataset

N. Gupta (B) · P. Bedi
Department of Computer Science, University of Delhi, Delhi, India
P. Bedi
e-mail: [email protected]
V. Jindal
Department of Computer Science, Keshav Mahavidyalaya, University of Delhi, Delhi, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_20

1 Introduction

With the rising demand for cybersecurity, organizations are now utilizing encryption techniques to safeguard their online communications. The most common approach adopted by organizations is the use of HTTPS, VOIP, and SMTPS protocols to encrypt



N. Gupta et al.

the network traffic generated from browsing, audio streaming, and email services. However, in recent times, it has been observed that other types of encrypted traffic are also present in organizational networks. This includes encrypted traffic from the virtual private networks (VPNs) and the onion router (TOR) browser. Both VPN and TOR browsers enhance user privacy on the Internet. They not only encrypt the user's traffic but also anonymize the IP addresses of both the sender and the receiver.

The amount of VPN traffic within an organization's network has increased over the years. This can be seen as a consequence of increased restrictions implemented by organizations on Internet usage. More and more organizations now restrict employees from accessing some/all websites that fall outside the scope of organizational purpose. This causes employees to use VPN services to access restricted content from the company's network. In most cases, this kind of VPN usage is for a benign purpose, such as accessing social networking websites, gaming, and movie streaming. However, VPN services can also be misused to create insider threats against the organization. In both the scenarios mentioned above, VPN traffic in an organization's network consumes crucial network resources, degrades the quality of service, and violates company policy.

Another source of encrypted traffic in an organization's network is the TOR browser. Though the motive behind designing TOR was to anonymize users' digital identity, it has become a gateway to the Dark Web [1]. TOR traffic in an organization's network can prove to be more dangerous than VPN traffic. This is because access to the Dark Web through TOR can lead to downloading malware on company devices, capturing sensitive information about the organization, and exposing the organization to cybercrimes. Due to the high risk associated with TOR traffic in a business network, organizations are now implementing solutions for efficiently identifying this type of traffic. Since TOR traffic occupies a very small volume of the total encrypted traffic, classifying this type of imbalanced encrypted traffic becomes even more difficult [2].

Classification of encrypted traffic has become crucial for fine-grained network management and control. After the classification of encrypted traffic into normal (non-TOR and non-VPN) traffic, VPN traffic, and TOR traffic, implementation of different control policies for each type of encrypted traffic is possible. Hence, there is a need to develop security solutions that can distinguish normal encrypted traffic from VPN traffic and TOR traffic [3, 4].

This paper proposes a novel system that uses the eXtreme gradient boosting (XGBoost) algorithm to perform encrypted traffic classification. The proposed system also addresses the class imbalance problem and performs an accurate classification of encrypted network traffic. The proposed system utilizes the CIC-Darknet 2020 dataset [5] to classify the traffic into three classes: normal (non-VPN and non-TOR) traffic, VPN traffic, and TOR traffic. To the best of our knowledge, the existing literature only focuses on either classifying VPN and non-VPN traffic or classifying TOR and non-TOR traffic but not both simultaneously. The proposed system performs multiclass classification by considering both VPN traffic and TOR traffic together as present in real-world networks.

Encrypted Traffic Classification Using eXtreme Gradient …


The rest of the paper is organized as follows: Sect. 2 presents the literature review, Sect. 3 describes the proposed system, and Sect. 4 gives details about the experiments and results, followed by Sect. 5, which concludes the paper.

2 Literature Review

This section presents an overview of recent research works that focus on encrypted traffic classification. Cuzzocrea et al. [6] evaluated the performance of six machine learning (ML) algorithms, namely J48, J48Consolidated, BayesNet, jRip, OneR, and RepTree, for detecting TOR traffic. The authors performed their experiment on real-time data generated through activities like browsing, audio-video streaming, chat, and email on TOR as well as non-TOR browsers. Kim and Anpalagan [7] performed classification of TOR traffic using a convolutional neural network (CNN). The authors utilized the packet capture files from the UNB-CIC TOR dataset for experimentation purposes. Guo et al. [8] proposed two models, a CNN and a convolutional autoencoder (CAE), for classifying VPN and non-VPN traffic using the ISCX VPN 2016 dataset. The authors also performed dimensionality reduction using a multi-layer perceptron. On comparing their results, the authors concluded that the CAE-based model was more efficient at identifying VPN traffic, while the CNN-based model was more accurate at application identification. Zou et al. [9] used a combination of CNN and long short-term memory (LSTM) to identify the type of application generating the VPN traffic. The authors compared the performance of a one-dimensional CNN with the hybrid CNN-LSTM model. In both cases, packets were first converted to images, which were then used as inputs to the models. Shapira and Shavitt [10] also utilized a CNN to achieve application-level classification of VPN traffic and TOR traffic separately. The authors converted the flow samples from the ISCX VPN and ISCX TOR datasets into images and then performed image classification. In their work, Hodo et al. [11] compared the efficacy and reliability of two algorithms, namely artificial neural network (ANN) and support vector machine (SVM), for classifying encrypted traffic. The authors also performed dimensionality reduction using correlation-based feature selection. On comparing the performance of these algorithms, the authors concluded that ANN outperformed SVM in distinguishing TOR traffic in the UNB-CIC TOR dataset. Saber et al. [12] utilized data balancing techniques to perform encrypted traffic classification. The authors combined oversampling and undersampling with principal component analysis (PCA) to address the class imbalance problem of encrypted network traffic. Further, SVM was used to classify the network traffic into fourteen different traffic classes based on the application used for traffic generation.

The existing literature contains research works that classify either VPN and non-VPN traffic or TOR and non-TOR traffic. To the best of our knowledge, no research work exists that separates these two types of encrypted traffic (VPN and TOR) from each other. The proposed system fills this research gap and classifies the network


traffic into three classes: normal (non-VPN and non-TOR) traffic, TOR traffic, and VPN traffic. The details of the proposed system have been discussed in the next section.

3 The Proposed System

This paper proposes a novel system for classifying encrypted traffic present in a network. The proposed system utilizes the eXtreme gradient boosting (XGBoost) algorithm to categorize encrypted traffic into three classes: normal traffic, VPN traffic, and TOR traffic. This classification will help organizations to closely monitor and control the type of encrypted traffic traversing their network. The architecture of the proposed system is shown in Fig. 1 and described in detail below.

The proposed system consists of two layers: the pre-processing layer and the classification layer. The pre-processing layer pre-processes the input dataset in three steps: data cleaning, data normalization, and dimensionality reduction. The first step of pre-processing cleans the input data by replacing both the NaN and infinity values with the value −1. This step is essential because ML algorithms cannot process non-numeric values. During normalization, the values of the cleaned dataset are converted into a uniform range of [0, 1]. Normalization helps ML algorithms to remain unaffected by the diverse range of values present in different features. Next, during dimensionality reduction, the number of attributes present in the dataset is reduced by retaining only the most important ones. The importance of an attribute is decided by two factors: variance and correlation. Features having the same value in all samples have zero variance; these are eliminated without any loss of information. Further, if two features are highly correlated, then both offer almost the same information, so one of them is removed without affecting the detection ability of the classifier.

Fig. 1 Architecture of the proposed system
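The three pre-processing steps (cleaning, [0, 1] normalization, and variance/correlation-based dimensionality reduction) can be sketched in a few lines of Python. This helper is an illustration under the paper's stated rules, not the authors' code; the default correlation threshold matches the 0.95 value used in their experiments:

```python
import numpy as np
import pandas as pd

def preprocess(df, corr_threshold=0.95):
    # 1. Cleaning: replace NaN and infinity values with -1
    df = df.replace([np.inf, -np.inf], np.nan).fillna(-1)
    # 2. Normalization: min-max scale every feature into [0, 1]
    rng = df.max() - df.min()
    df = (df - df.min()) / rng.replace(0, 1)  # guard against division by zero
    # 3a. Drop zero-variance features (same value in every sample)
    df = df.loc[:, df.var() > 0]
    # 3b. For each highly correlated pair, drop one of the two features
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return df.drop(columns=drop)

# Toy example: 'x2' is perfectly correlated with 'x', 'const' has zero variance
demo = pd.DataFrame({"x": [1.0, 2.0, 3.0],
                     "x2": [2.0, 4.0, 6.0],
                     "const": [7.0, 7.0, 7.0]})
print(list(preprocess(demo).columns))  # ['x']
```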


Dimensionality reduction leads to a shorter processing time and speeds up the classification process.

After pre-processing, the data is classified by the XGBoost algorithm in the classification layer. XGBoost is an ensemble algorithm that combines multiple weak learners (decision trees) to create a strong learner. Each tree reduces the misclassification error of the previous tree, improving the classification in each subsequent iteration. Another advantage of XGBoost over other algorithms like random forest, AdaBoost, and gradient boosting is that XGBoost is faster than all of them [13]. In the proposed system, XGBoost classifies the pre-processed input samples into one of three classes: normal traffic, VPN traffic, and TOR traffic.

For experimentation, the CIC-Darknet 2020 dataset has been used in this paper. Here, the normal class includes both non-VPN and non-TOR traffic samples. These sample types were grouped and assigned class label 0. Next, TOR samples were assigned label 1, and VPN samples were given label 2. The next section describes the experimental study and discusses the obtained results.
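A minimal, runnable sketch of the label grouping and a boosted-tree classifier is shown below. The Label 1 string spellings are assumptions, the data is a toy stand-in, and scikit-learn's GradientBoostingClassifier is used so the snippet runs without the xgboost package; the equivalent XGBoost call is noted in a comment:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Assumed label spellings; the grouping matches the paper:
# non-TOR and non-VPN -> 0 (normal), TOR -> 1, VPN -> 2.
LABEL_MAP = {"Non-Tor": 0, "NonVPN": 0, "Tor": 1, "VPN": 2}

def encode_labels(raw):
    return np.array([LABEL_MAP[label] for label in raw])

# Toy features purely for illustration
rng = np.random.default_rng(0)
X = rng.random((60, 5))
y = encode_labels(["Non-Tor", "Tor", "VPN"] * 20)

# Stand-in boosted-tree ensemble; with the xgboost package the analogous
# call would be xgboost.XGBClassifier(objective="multi:softmax", num_class=3).
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
print(sorted(set(int(p) for p in clf.predict(X))))
```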

4 Experiments and Results

The proposed system was designed and developed on an Intel® Core™ i7-8750H processor with the Windows 10 operating system. The Python programming language was used for the implementation of the proposed system. In the pre-processing layer, normalization and variance calculation were performed on the cleaned dataset using Python's scikit-learn library, and Pearson's correlation coefficient was computed using Python's pandas library. Next, for the second layer, hyperparameter optimization was performed for the XGBoost algorithm. The optimal configuration consisted of trees with a maximum depth of 20, 5000 boosting rounds, and a learning rate of 0.3.

The CIC-Darknet dataset used in this paper consisted of 83 features and two labels for each sample. Four features, namely Flow ID, Src IP, Dst IP, and Timestamp, were removed from the dataset before pre-processing as they do not contribute to the classification process. After removing zero-variance features, only 64 features remained. Out of these, 28 features were further eliminated after computing the correlation coefficient with a threshold value of 0.95. Therefore, only the remaining 36 features were used by the proposed system to separate TOR traffic and VPN traffic from normal network traffic. The selected features are feature numbers 2, 4, 5, 7, 8, 9, 10, 11, 13, 15, 19, 20, 21, 23, 26, 28, 31, 33, 36, 40, 43, 44, 48, 49, 50, 51, 52, 65, 66, 67, 71, 72, 73, 79, 80, and 82.

Furthermore, the dataset consists of two labels: Label 1 and Label 2. Label 1 describes the primary class, and Label 2 is a further classification of Label 1. In this paper, only Label 1 has been utilized. It consists of four categories: VPN, TOR, non-VPN, and non-TOR. For experimentation purposes, the non-VPN and non-TOR samples were grouped together under the normal label. After this label conversion,


each sample belongs to either the normal class, the TOR class, or the VPN class. The modified dataset was then divided into training and testing sets, with 80% used for training and the remaining 20% for testing. The training dataset consisted of 93,719 normal samples, 1106 TOR samples, and 18,399 VPN samples. The testing dataset consisted of 23,500 normal samples, 286 TOR samples, and 4520 VPN samples. These datasets were used for training and testing the proposed system along with seven ML algorithms, namely K-nearest neighbor (KNN), Naïve Bayes (NB), random forest (RF), AdaBoost (AB), balanced bagging classifier (BBC), artificial neural network (ANN), and long short-term memory (LSTM).

Recall, precision, F1-score, and accuracy are the metrics that were used to evaluate the performance of all these algorithms. Recall is the fraction of samples of a class identified by the classifier out of all the samples present in that class. Precision is the fraction of correctly classified samples out of all the samples classified as belonging to a class. F1-score is the harmonic mean of recall and precision; it gives equal weightage to both measures and summarizes the goodness of the classifier. Lastly, accuracy is the fraction of correctly classified samples out of all the samples. The results of the experiments are shown in Fig. 2.

It has been shown in Fig. 2a that the proposed system achieves the highest recall values for the normal class as well as the VPN class as compared to the other algorithms in consideration. In the case of the TOR class, the proposed system achieves a very

Fig. 2 Results obtained from experimentation


high recall value as compared to all its counterparts except the balanced bagging classifier. This clearly shows that though the proposed system is trained on a few TOR traffic samples, it can accurately identify them without affecting its performance on the other two classes. Moreover, the proposed system attains the highest precision values for all three classes, as shown in Fig. 2b. This signifies that the majority of predictions made by the proposed system are correct and reliable. Since recall and precision are equally important, we also compute the F1-score for the proposed system and its counterparts. Further, it has been shown in Fig. 2c that the proposed system obtains the highest F1-scores for all three classes: normal traffic, TOR traffic, and VPN traffic. It also attains the best accuracy value compared to all the algorithms in consideration, and the same has been depicted in Fig. 2d. These results highlight that the proposed system is highly efficient in handling the class imbalance problem of network traffic and performing accurate multiclass classification of encrypted network traffic.
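The four per-class metrics discussed above can be computed with scikit-learn; the predictions below are made up solely to illustrate the calculation and are not the paper's results:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up predictions for the three classes (0 = normal, 1 = TOR, 2 = VPN).
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 2, 1, 0, 2, 2, 2]

prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2], zero_division=0)
for i, name in enumerate(["normal", "TOR", "VPN"]):
    print(f"{name}: precision={prec[i]:.2f} recall={rec[i]:.2f} F1={f1[i]:.2f}")
print("accuracy =", accuracy_score(y_true, y_pred))  # 0.75
```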

5 Conclusion

In the present digital era, organizations are witnessing different types of encrypted traffic in their networks. Apart from encrypted traffic generated from HTTPS, VOIP, and SMTPS protocols, VPN and TOR encrypted traffic also traverses the organization's network. VPN and TOR traffic is generated when employees access websites that are restricted by the organization. This type of encrypted traffic not only violates the organization's security policy but also burdens its network and may prove to be a cyberthreat. Thus, an organization must closely monitor the different types of encrypted traffic traversing its network. This will allow network administrators to adopt a suitable response strategy for each type of encrypted traffic.

This paper proposed a novel system for encrypted traffic classification. The proposed system utilized the eXtreme gradient boosting (XGBoost) algorithm to classify encrypted network traffic into three classes: normal (non-VPN and non-TOR) traffic, VPN traffic, and TOR traffic. The proposed system was trained and tested on the CIC-Darknet 2020 dataset, which contains imbalanced TOR traffic and VPN traffic and thus closely resembles the real-world traffic distribution. The performance of the proposed system was compared with seven machine learning algorithms based on the recall, precision, F1-score, and accuracy metrics. The proposed system outperformed all its counterparts by effectively handling the class imbalance problem and performing accurate multiclass classification of encrypted network traffic.

Acknowledgements The first author would like to acknowledge the University Grants Commission for partially funding this work via the Junior Research Fellowship (Reference No. 3505/NET-November 2017).


References

1. P. Bedi, N. Gupta, V. Jindal, Dark web: a boon or a bane, in Encyclopedia of Criminal Activities and the Deep Web, ed. by D.B.A. Mehdi Khosrow-Pour (IGI Global, 2020), pp. 152–164
2. P. Bedi, N. Gupta, V. Jindal, Siam-IDS: handling class imbalance problem in Intrusion Detection Systems using Siamese Neural Network, in Third International Conference on Computing and Network Communications, ed. by S. Thampi, S. Madria, X. Fernando, R. Doss, S. Mehta, D. Ciuonzo, Trivandrum, Kerala, vol. 171 (2019), pp. 780–789
3. Cybersecurity and Infrastructure Security Agency: Defending Against Malicious Cyber Activity Originating from Tor (Advisory, CISA, United States, 2020)
4. Hitachi Systems: Risks Associated to Using Tor Inside a Business Network (Hitachi Systems Security Inc.). Available at: https://hitachi-systems-security.com/risks-associated-to-using-torinside-a-business-network/. Accessed 7 Apr 2020
5. A. Lashkari, G. Kaur, A. Rahali, DIDarknet: a contemporary approach to detect and characterize the darknet traffic using deep image learning, in 10th International Conference on Communication and Network Security, Tokyo, Japan (2020)
6. A. Cuzzocrea, F. Martinelli, F. Mercaldo, G. Vercelli, Tor traffic analysis and detection via machine learning techniques, in 2017 IEEE International Conference on Big Data, Boston, USA, vol. 1 (2017), pp. 4474–4480
7. M. Kim, A. Anpalagan, Tor traffic classification from raw packet header using convolutional neural network, in 2018 1st IEEE International Conference on Knowledge Innovation and Invention (ICKII), Jeju, South Korea (2018), pp. 187–190
8. L. Guo, Q. Wu, S. Liu, M. Duan, H. Li, J. Sun, Deep learning-based real-time VPN encrypted traffic identification. J. Real-Time Image Proc. 17, 103–114 (2020)
9. Z. Zou, J. Ge, H. Zheng, Y. Wu, C. Han, Z. Yao, Encrypted traffic classification with a convolutional long short-term memory, in IEEE International Conference on High Performance Computing and Communications (HPCC), Exeter, United Kingdom (2018), pp. 329–334
10. T. Shapira, Y. Shavitt, FlowPic: encrypted Internet traffic classification is as easy as image recognition, in IEEE INFOCOM 2019—IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Paris, France (2019), pp. 680–687
11. E. Hodo, X. Bellekens, E. Iorkyase, A. Hamilton, C. Tachtatzis, R. Atkinson, Machine learning approach for detection of nonTor traffic, in ARES '17: Proceedings of the 12th International Conference on Availability, Reliability and Security, Reggio Calabria, Italy (2017), pp. 1–6
12. A. Saber, B. Fergani, M. Abbas, Encrypted traffic classification: combining over- and under-sampling through a PCA-SVM, in 2018 3rd International Conference on Pattern Analysis and Intelligent Systems (PAIS), Tebessa, Algeria (2018), pp. 1–5
13. P. Bedi, N. Gupta, V. Jindal, I-SiamIDS: an improved Siam-IDS for handling class imbalance in network-based intrusion detection systems. Appl. Intell. 1–19 (2020)

Analyzing Natural Language Essay Generator Models Using Long Short-Term Memory Neural Networks Mayank Gaur, Mridul Arora, Varun Prakash, Yash Kumar, Kirti Gupta, and Preeti Nagrath

Abstract Essay generation falls under the rare and challenging cases of deep learning in which the input data is far smaller than the output data. It focuses on the generation of written texts in natural human language from some known semantic representation of topic information. The aim is to generate informative, diverse, and topic-consistent essays based on different topics. We implemented three different artificial intelligence models, i.e., topic-average long short-term memory (TAV-LSTM), topic-attention LSTM (TAT-LSTM), and multi-topic-aware LSTM (MTA-LSTM), to find the most suitable technology for natural language generation. While TAV- and TAT-LSTM showed some valuable results, MTA-LSTM gave the most suitable outcomes. Experimental results verify that the MTA-LSTM model is able to generate topic-consistent, diverse text and makes essential improvements over strong baselines. After the implementation of the models, it was found that MTA-LSTM outperformed TAT-LSTM and TAV-LSTM in almost every metric. Overall, MTA-LSTM outperformed TAT-LSTM and TAV-LSTM by 14.79 and 20.82% in human evaluation, respectively. It also performed better in BLEU score evaluation, by 27.47% over TAV-LSTM and by 11.53% over TAT-LSTM. Keywords NLG · Essay generation · Deep learning · Text generation · LSTM · Embeddings · Attention · BLEU score

M. Gaur · M. Arora · V. Prakash (B) · Y. Kumar · K. Gupta · P. Nagrath
Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, New Delhi, India
K. Gupta
e-mail: [email protected]
P. Nagrath
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_21


M. Gaur et al.

1 Introduction

Essay generation falls under the domain of natural language generation (NLG) [1], which can be defined as the “process of producing meaningful phrases and sentences in the form of natural language.” It generates outputs that describe, summarize, or explain input data in a human-like manner, at the speed of hundreds of pages per second. In general terms, NLG and natural language understanding (NLU) [2] are divisions of the more general natural language processing (NLP) sphere, which encompasses all software that deciphers or builds human language. With further success in this domain, NLG could be used for article generation for magazines, by lawyers, for code completion, and who knows how many billion-dollar ideas may come from it. It can also be helpful for news anchors, who would only need to write the headline for the whole article to be generated.

The task of essay generation, as of now, suffers from a variety of problems. To start with, the sentences of the essay generated by the algorithm, taken as a whole, are not relevant to the topic words. As an example, if the topic words are ‘A day in a park’, one sentence of the essay might describe a day (the sun is out, etc.) while another might describe what a park is. This problem is known as topic relevance. Also, we have found that if one sentence is like ‘We had a great day’, the next sentence is often completely irrelevant to the previous one, for example, ‘Our Ford car choked’. Output essays must be fluent for human-level performance. Lack of coherence is also noticed in the essays.
Coherence in a piece of writing means that the reader can easily understand what the writing wants to say. Coherence is about making everything flow smoothly: the reader can see that everything is semantically arranged and connected, and relevance to the topic words as a whole is maintained throughout. The integrity of the essay also has a lot of room for improvement. By integrity, we mean the level of vocabulary and the words used to express the sentiment of the essay.

Our objectives in this work are:
• Study different architectures of long short-term memory (LSTM) models for essay generation.
• Implement the studied architectures.
• Compare and analyze the implemented architectures.

Keeping its importance in mind, a lot of work has been done in this domain, as described in Sect. 2, but performance in text generation is still not at par, not only with humans but even with other areas of NLP like neural machine translation (NMT) and speech recognition. Essay generation is a task that faces many problems: whether the sentences formed are relevant, whether the same sentences or words are repeated again and again, whether the sentences when read as a paragraph make sense, and does

Analyzing Natural Language Essay Generator Models Using Long …


the essay generated by the model have any resemblance to an essay written by a human.

We have implemented the three architectures, namely MTA-LSTM, TAV-LSTM, and TAT-LSTM, from [3] on the Paul Graham dataset [4]. TAV-LSTM produces output essays by averaging the embedding vectors of the input topic words and passing this average to an RNN network. This architecture, as discussed in Sect. 5, lacks diversity, coherence, and fluency and is only capable of generating short essays. TAT-LSTM improves on TAV by incorporating an attention mechanism and, as a result, is capable of generating longer essays with a significant improvement in all parameters. We used Bahdanau attention as described in the paper. This architecture, however, suffers from the repetition problem, or lack of diversity, as the attention assigns high weights to some topics too many times, making the output focused around those topics only and thus not covering the overall semantics of the topic words. To deal with this problem, a context vector is used in MTA-LSTM which ensures that each topic word is utilized and the overall semantics are covered.

In Sect. 2, we discuss the previous work done in the field of essay generation, which was subsequently improved upon by upgraded versions of models in the same field. In Sect. 3, we describe the dataset used and the data preprocessing in detail, including the concept of embeddings; in the following subsection, we discuss the approaches taken to generate essays that take care of all the factors that make a generated essay correct. In Sect. 4, we discuss the experimental settings of our work and the evaluation metrics used to evaluate the generated essays. In Sect. 5, the experimental results are discussed, and Sect. 6 concludes our work.
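To make the difference between the two baseline conditioning schemes concrete, here is a minimal NumPy sketch. The vocabulary, embedding dimension, and random vectors are assumptions for illustration; the real models learn the embeddings jointly with the LSTM decoder:

```python
import numpy as np

# Toy topic-word embeddings (dimension 8); real models learn these.
rng = np.random.default_rng(42)
embedding = {w: rng.standard_normal(8) for w in ["day", "park", "sun"]}

def topic_average(words):
    """TAV idea: one fixed vector, the mean of the topic-word embeddings,
    conditions the decoder for the whole essay."""
    return np.stack([embedding[w] for w in words]).mean(axis=0)

def topic_attention(decoder_state, words):
    """TAT idea: at each decoding step, re-weight the topic embeddings by
    their softmax-normalized similarity to the current decoder state."""
    topics = np.stack([embedding[w] for w in words])
    scores = topics @ decoder_state
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ topics  # attention context vector

avg = topic_average(["day", "park", "sun"])
ctx = topic_attention(avg, ["day", "park", "sun"])
print(avg.shape, ctx.shape)  # (8,) (8,)
```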

2 Related Work

In 2010, Moawad et al. [5] proposed a rich semantic representation based approach for text generation, which develops English text from a semantic representation called the rich semantic graph (RSG). RSG is a new ontology-based representation used to generate a unified semantic representation for different NLP applications such as machine translation [6, 7], text summarization, and information retrieval [8]. The proposed NLG model accepts a semantic representation in the form of an RSG and generates multiple texts; an ontology is used to obtain multiple texts according to word synonyms. Finally, the model ranks the generated texts by reinforcing two criteria: most often used words and acknowledged sentence relations. In 2011, the authors presented an NLG model [9] which takes a set of nouns and verb objects as input and generates simple possible paragraphs. The model consists of five tasks: text planning, sentence planning, surface realization, writing styles, and evaluation. The first task generates the selected noun and verb synonyms


M. Gaur et al.

which were used in the other steps. The sentence planning task generates simple paragraphs which are not grammatically correct; in the surface realization task, the generated paragraph is then grammatically corrected. The writing-styles task takes the selected writing style as input and outputs the paragraph in that style; there were two writing styles, descriptive and cause-and-effect. Finally, the coherence of the generated paragraph is evaluated. Then, in 2012, the authors of [10] developed, via text modification, a case-based reasoning approach to natural language generation, and in particular to surface realization [11]. The approach, named CeBeTa, blends a sentence similarity metric with a reuse approach based on a text-transformation cycle to solve the text modification problem. By using a self-trained statistical evaluator, it became possible to differentiate correct from incorrect solutions. Text adaptation is supported by text transformation routines (TTRs), so that the modifications needed to adapt the retrieved case to the matter at hand can be inferred automatically. De Novais et al. [12] improved text generation using n-gram statistics. Their paper presented a natural surface realization model aided by n-gram language models, aiming at simpler development and handling of NLG resources. Several tasks were studied: the lexical choice [13–16] of noun phrase (NP) and verb phrase (VP) head constituents, the ordering of noun modifiers, and verb-complement understanding. The results suggested that the use of n-gram statistics can indeed advance surface realization for languages such as Portuguese in several ways; in particular, the NP head lexical choice and verb-complement tasks were completed successfully. Kiddon et al. 
[17] tried to solve coherence problems in 2016 by using an agenda item list on the basis of which text is generated; their model is accompanied by two attention mechanisms, one monitoring used agenda items and the other monitoring unused agenda items. In 2019, Welleck et al. [18] combined likelihood training with unlikelihood training, in which likelihood outputs the probability of the next token and unlikelihood outputs the probability that a token is not the next word, based on how many times that token has already appeared. Feng et al. [19] used word embeddings and attention for generation in 2018, which also improved topic integrity and relevance. More specifically, the paper discusses three architectures. The first, TAV-LSTM, in simple terms finds the embeddings of the topic words, averages them, and passes the average through an RNN [20–22] network. It is the least accurate of the three, since the final average vector can be the same for some other combination of words, and it has very low complexity. The second, TAT-LSTM, finds the embedding of each topic word and applies attention before passing the result to the RNN network. It performs better than TAV-LSTM because attention weights the topic words on the basis of position in the essay, thereby increasing essay integrity. The third and best, MTA-LSTM, adds to TAT-LSTM a topic coverage vector, which monitors


how many times each topic word is used. These approaches do solve the integrity, relevance, and, to some extent, coherence problems, but did not improve much on diversity. In 2019, Yang et al. [23] addressed this problem by adding the concept of commonsense keywords, in which each topic word is supplemented by a list of commonsense words based on that topic word, and used adversarial training [24–26] along with attention mechanisms to further improve the content. This method improves diversity because the pool from which text is generated grows many times over (topic words + commonsense words). The architecture consists of two parts. First, a memory-augmented generator includes an encoder and a decoder. The encoder is a bidirectional LSTM whose concatenated vector is passed to the decoder. The decoder passes the previous hidden state (s_{t-1}) and the concatenation of the previous output (y_{t-1}), the context vector (c_t), and the memory vector (m_t) to an RNN network; c_t is computed via attention over the input topic sequence, and m_t is taken from a memory matrix M which contains the commonsense of the topic words. M is dynamically updated at each timestep to maintain coherence, via gates as in gated recurrent units (GRU) [27, 28]. Second, a discriminator is introduced to calculate topic consistency between the input topics and the generated essay, which further improves the text quality. It outputs a vector containing the probabilities of the topic words and a sentence probability indicating whether a sentence belongs to the text or not; adversarial training is performed here and the discriminator is optimized. This model was found to be better than MTA-LSTM. Lin et al. [29] in 2020 addressed these problems by using BERT [30, 31] and GloVe [32] representations and a target-side contextual history mechanism in self-attention networks to guide generation, so that the repetition and low-retention problems of attention can be addressed. 
GloVe word vectors are used as word embeddings and BERT embeddings are used as contextual embeddings [33]; the two are combined by a dynamic weighted sum to alleviate the semantic information gap. The architecture consists of two parts. First, the encoder incorporates the pre-trained language model BERT as contextual embeddings, and the combined GloVe and BERT representation is passed into the encoder. By fusing the hidden states of BERT across encoder layers, the semantics of the topic words are enriched and the relationship between multiple topics is considered; moreover, the dynamic weighted sum leads to better performance in alleviating the semantic gap. Second, a target-side contextual history mechanism in self-attention networks is used to guide generation. With the help of this context-aware generator, the quality of the generated text is improved, dealing with the repetition and low-retention problems of attention. After our exhaustive analysis of the literature, we deduce that current essay generation technologies suffer from a few problems. Very few architectures explicitly focus on all the challenges of essay generation; the integrity and coherence of the generated essays are not at par with human performance; and a huge amount of data, which small organizations cannot afford to collect, is needed for training the models.


3 Methodology

See Fig. 1.

3.1 Dataset Used and Data Preprocessing

The dataset we used is a concatenated corpus of highly insightful essays by Paul Graham on various topics. In total it consists of 452,944 words with a vocabulary size of 30,064 words. Since the corpus is available as continuous text, we preprocessed it into sliding windows: each window of 5 consecutive words is used to predict the next word, the window then slides forward by one word, and so on. The text is cleaned by removing all punctuation except full stops and commas, converting all letters to lowercase, and removing all noise from the dataset (URLs, font size references, etc.).
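The sliding-window preprocessing described above can be sketched as follows; the helper name and the cleaning regex are assumptions for illustration, while the window size of 5 follows the text:

```python
import re

def make_windows(text, window=5):
    """Split cleaned text into (input_words, target_word) training pairs.

    Each window of `window` consecutive words predicts the next word;
    the window then slides forward by one word.
    """
    # Keep full stops and commas, lowercase everything, drop other noise
    # (a simplifying assumption about the cleaning step).
    cleaned = re.sub(r"[^a-z.,\s]", " ", text.lower())
    words = cleaned.split()
    pairs = []
    for i in range(len(words) - window):
        pairs.append((words[i:i + window], words[i + window]))
    return pairs

pairs = make_windows("One of the most common types of advice we give at Y Combinator")
print(pairs[0])  # the first 5 words predict the 6th
```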

3.2 Embeddings

Word2vec [34–36] is a two-layer neural net that processes text by "vectorizing" words. Its input is a text corpus and its output is a set of feature vectors that represent the words in that corpus. Word2vec is not itself a deep neural network; rather, it turns text into a numerical form that deep neural networks can work with. The output of the Word2vec network is a vocabulary in which each item has a vector attached to it, which can be fed into a deep learning net or simply queried to detect relationships between words. Under cosine similarity [37–39], no similarity corresponds to a 90° angle, while total similarity (a cosine of 1) corresponds to a 0° angle, i.e., complete overlap: Sweden equals Sweden, whereas Norway has a cosine score of 0.760124 with Sweden, the highest of any other country. Table 1 lists words related to "Sweden" produced by Word2vec, in order of proximity. An embedding is a relatively low-dimensional space into which high-dimensional vectors can be translated. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space, and an embedding can be learned and reused across models (Fig. 2). The words love, admire, and worship might cluster in one corner, while hate, despise, and detest fall together in another: similar words and ideas are "close", i.e., have a smaller distance in the space.
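The cosine measure used to compare word vectors can be sketched in a few lines of Python (the toy vectors are invented for illustration; real Word2vec vectors are 300-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 for identical
    directions (0 degrees), 0.0 for orthogonal vectors (90 degrees)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical vectors are maximally similar; orthogonal ones are not.
print(cosine_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # close to 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0
```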


Fig. 1 Flowchart depicting our workflow


Table 1 Words with similar meaning have less distance in embeddings representations

Word          Cosine distance
Denmark       0.715460
Norway        0.760124
Belgium       0.585835
Iceland       0.562368
Switzerland   0.588132
Finland       0.620022
Netherlands   0.574631
Slovenia      0.531408
Estonia       0.547621

Fig. 2 Diagram representing the CBOW and skip-gram algorithms. In CBOW, the surrounding words are used to predict the middle word, whereas in skip-gram, the middle word is used to predict the n surrounding words, where n is the chosen window size


3.3 Approach

(1) Topic-Averaged LSTM (TAV-LSTM)

In this architecture, topic representations of the topic words are calculated first, the average of which is fed into an LSTM network. The topic representation is calculated by the following equation:

T = (1/k) Σ_{i=1}^{k} topic_i    (1)

where k is the number of input topic words, T is the topic representation, and topic_i is the word embedding of topic word i. After calculating the topic representation, the probability distribution over the vocabulary set y is obtained by an LSTM-based decoder. At each timestep t, prediction of the next word depends on the current hidden state h_t and can be stated as follows:

P(y_t | y_{t−1}, T) = softmax(g(h_t))    (2)

Before each prediction, h_t is updated by:

h_t = f(h_{t−1}, y_{t−1})    (3)

where g(·) and f(·) are a linear function and an activation function determined by the LSTM structure (Fig. 3).

Fig. 3 The topic-averaged LSTM approach, where y_i and h_i are the output and hidden states at step i

(2) Topic-Attention LSTM (TAT-LSTM)

TAV-LSTM uses the average of all the topic words, so the essays have good relevance to the topic words taken as a whole, but this approach lacks diversity. Moreover, two different sets of vectors can have the same average vector, in which case TAV-LSTM will output the same word. TAT-LSTM incorporates an attention mechanism [40, 41] which helps in selecting different topic words to guide generation at each timestep (Fig. 4). For each generation step t, T_t can be defined as follows:


Fig. 4 The basic long short-term memory approach with averaged topic words embedding and its topic-attention extension for essay generation

T_t = Σ_{j=1}^{k} α_{tj} topic_j    (4)

where the α_{tj} are the scalar weights derived by the attention mechanism, defined as:

α_{tj} = exp(g_{tj}) / Σ_{i=1}^{k} exp(g_{ti})    (5)

where topic_j is the Word2Vec embedding of topic word j, and

g_{tj} = v_a^T tanh(W_a h_{t−1} + U_a topic_j)    (6)

where v_a, W_a, and U_a are three matrices optimized during training, and g_{tj} is the attention score on topic_j at time step t. Therefore, the probability of the next word y_t can be defined as:

P(y_t | y_{t−1}, T_t) = softmax(g(h_t))    (7)

and h_t is updated by:

h_t = f(h_{t−1}, y_{t−1}, T_t)    (8)
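The attention computation above can be illustrated with a small NumPy sketch; the dimensions and random values below are assumptions for illustration, whereas in the model the matrices W_a, U_a, and v_a are learned during training:

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_w, k = 8, 6, 3          # hidden size, embedding size, number of topics

# Parameters of the attention scorer; randomly initialized for the sketch.
W_a = rng.normal(size=(d_h, d_h))
U_a = rng.normal(size=(d_h, d_w))
v_a = rng.normal(size=(d_h,))

h_prev = rng.normal(size=(d_h,))      # decoder hidden state h_{t-1}
topics = rng.normal(size=(k, d_w))    # embeddings of the k topic words

# Attention score g_{tj} for each topic word.
g = np.array([v_a @ np.tanh(W_a @ h_prev + U_a @ topic) for topic in topics])

# Softmax over the scores gives the attention weights alpha_{tj}.
alpha = np.exp(g) / np.exp(g).sum()

# The attended topic representation T_t: a weighted sum of topic embeddings.
T_t = alpha @ topics

print(T_t.shape)  # (6,)
```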

(3) Multi-Topic-Aware LSTM (MTA-LSTM)

Although TAT-LSTM makes good use of the topic words and performs better than TAV-LSTM, it does not produce results that are consistent with all the topic words. Moreover, it ignores past attentional information, so some topic words appear more often than others and the essay focuses on only a few words. MTA-LSTM solves this problem to a large extent by keeping track of how often each topic word has been used. It maintains a topic coverage vector C, which represents the degree to which each topic word must still be expressed in future generation, and uses it to regulate the attention policy so that the model gives more consideration to unexpressed topic words. This is backed up by a parameter ϕ_j, which can be seen as a discourse-level attention weight for topic j. An example of the model is shown in Fig. 5. C_t can be visualized as a vector initialized with ones. When generating a new word at time step t, C_{t,j} is calculated as follows:

C_{t,j} = C_{t−1,j} − (1/ϕ_j) α_{t,j}    (9)

where α_{t,j} is the attention weight of topic word j at time step t and ϕ_j = N·σ(U_f [T_1, T_2, …, T_k]), with U_f ∈ R^{k·d_w}. The attention score g_{tj} is then updated as follows:

g_{tj} = C_{t−1,j} v_a^T tanh(W_a h_{t−1} + U_a topic_j)    (10)

Fig. 5 MTA-LSTM approach for essay generation, where c are topic coverage vectors

Hence, the probability of the next word y_t can be defined as:

P(y_t | y_{t−1}, T_t, C_t) = softmax(g(h_t))    (11)
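The coverage mechanism can be illustrated with a small NumPy sketch; the fixed ϕ values and toy raw scores are assumptions for illustration (in the model, ϕ_j is computed from the topic embeddings and the scores come from the attention network):

```python
import numpy as np

k = 3
phi = np.array([5.0, 5.0, 5.0])   # discourse-level weights phi_j (assumed fixed here)
C = np.ones(k)                    # topic coverage vector, initialized to ones

def step(C, scores):
    """One decoding step: coverage-scaled attention followed by the
    coverage update."""
    g = C * scores                         # scale raw scores by coverage
    alpha = np.exp(g) / np.exp(g).sum()    # attention weights
    C = C - alpha / phi                    # discount already-expressed topics
    return C, alpha

scores = np.array([2.0, 1.0, 0.5])         # toy raw attention scores
for _ in range(4):
    C, alpha = step(C, scores)

# The most-attended topic's coverage drops fastest, pushing future
# attention toward the less-expressed topic words.
print(C)
```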

4 Experiment

We describe the various experimental settings of our work and the evaluation metrics used to test the performance of the models in this section.

4.1 Experimental Settings

In this work, we used the Paul Graham essays dataset, which consists of various essays by Paul Graham. For implementation we chose the 45,000 most frequently used words for both training and testing. Word2Vec embeddings with a dimensionality of 300 are used, and the dataset was preprocessed so that each window of 5 words is used to predict the next word. We used the TensorFlow library for our implementation. Each architecture uses 3 LSTM layers, with each layer having 600 hidden units. The parameters of our models were randomly initialized using Xavier initialization, and the models were trained with the Adam optimizer with a minibatch size of 64.
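Xavier initialization, used for the model parameters, can be sketched in NumPy as the uniform variant (our implementation relied on TensorFlow's built-in initializer; the helper below is only an illustration):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Draw weights uniformly from [-limit, limit] with
    limit = sqrt(6 / (fan_in + fan_out)), which keeps activation
    variance roughly constant across layers."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Example: an input-to-hidden weight matrix of a 600-unit LSTM layer
# fed by 300-dimensional Word2Vec embeddings.
W = xavier_uniform(300, 600)
print(W.shape)  # (300, 600)
```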

4.2 Evaluation Metrics

Human Evaluation. 25 different essays from each model were distributed to 20 people with expertise in the English language, who were asked to evaluate the essays on "Topic-Integrity", "Topical-Relevance", "Coherence", and "Fluency". Each essay receives a score from 1 to 5 on each aspect, with 5 being the best. Finally, the scores are collected and averaged to obtain a final score. Although the individual scores were not very consistent, we found that they follow a consistent trend.

BLEU Score. We used bilingual evaluation understudy (BLEU) [42, 43] as a metric for automatic evaluation; it is widely used for the automatic evaluation of machine translation systems. We compared the original essays with the generated ones and computed a BLEU-2 score.
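A minimal BLEU-2 computation (clipped unigram and bigram precision combined with a brevity penalty) can be sketched as follows; practical evaluations typically rely on library implementations, and the toy sentences here are invented:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu2(reference, candidate):
    """BLEU-2: geometric mean of clipped 1- and 2-gram precision,
    multiplied by a brevity penalty for short candidates."""
    precisions = []
    for n in (1, 2):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 2)

ref = "the essay covers every topic word".split()
print(bleu2(ref, ref))  # identical texts score 1.0
print(bleu2(ref, "the essay covers nothing".split()))
```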


5 Experimental Result

See Tables 2 and 3. The essays generated by each model contain errors. In the essays generated by TAV-LSTM, we can clearly notice that the same sentences appear multiple times within the same essay. For example, for the topic words 'will tend to be disliked', the generated essay is: 'and the way is not an adult. And the burden is not the facebooktwitter market. And the lot of the startup. And the burden is not the facebooktwitter market. And the lot of the startup. And the burden is not the facebooktwitter market. And the lot of the startup. And the burden is not the facebooktwitter market. And the lot of the startup. And the burden is not the facebooktwitter market. And the lot of the startup.' Hence, over the three hundred essay words generated, the integrity is very low, as the generated words make very little sense in terms of the English lexicon. However, the generated words are related to the topic words provided, so the relevance score of this model is better than its integrity score. Likewise for the other two factors: for fluency, the sentences are well formed and connected to each other, but the generated essay cannot easily be understood, hence the low coherence score. For TAT-LSTM, we find that the generated essays have better scores than the previous model (TAV-LSTM). For example, for the topic words 'will tend to be disliked', the generated essay is: 'so if you reach some revenues. So if you start to get a lot of air in the straw. When someone speaks for the firm in the sense that they violate current startup fashions. In yc we say short-term greed is that the underlying cause is usually because it was a lot of people who will be proud of it. They always have to be suitable for everyone. 
And the cheap housing that draws other people here is better than truth as it makes sense to ask early on.'

Table 2 Evaluation scores given by various experts

Model       Integrity   Topic-relevance   Fluency   Coherence   Average
TAV-LSTM    2.40        2.72              3.60      2.84        2.89
TAT-LSTM    2.52        3.20              3.82      2.92        3.11
MTA-LSTM    3.20        3.80              4.08      3.52        3.65

Hence, over the three hundred essay words generated, we can notice that the integrity is slightly better than with the previous model, but for the factor relevance there is a drastic

Table 3 Automatic BLEU score evaluation

Model       BLEU score
TAV-LSTM    2.234
TAT-LSTM    2.725
MTA-LSTM    3.080


change in the relevance of the generated essay words; a possible reason could be the smaller number of sentences generated by the previous model compared with TAT-LSTM. In the generated essays, we can easily see that the sentences are related to the previous sentences, which yields better fluency. For coherence, we can understand the essay slightly better. For MTA-LSTM, we get the best results on all four factors compared with the other models. For the same pair of topic words, the generated essay is: 'the "flow" that characteristically all tastes. It seems reasonable to start to be a picky startup to make a conscious effort to do that. The best investors are so thoroughly picked off that the best way to do it is to eat rich. But if you want to start a startup, and that means you have to be disciplined about assigning probabilities. You have no example to be a mathematician.' For the first three hundred words generated, we can clearly notice the difference in the quality of the generated essays on all factors. The integrity of the generated essay is the best; for relevance, the generated words are connected to the topic words; the sentences are related to the previous sentences, which makes this model best for fluency; and for coherence, the generated essay is easily understandable by a person.

6 Conclusion

In this paper, we implemented three different LSTM models to find the most accurate among them. Our research focused on four main elements, i.e., "Topic-Integrity", "Topical-Relevance", "Coherence", and "Fluency". From the results, we conclude that MTA-LSTM is by far the best of these LSTM techniques for natural language generation, while TAV-LSTM and TAT-LSTM also showed some promising results. MTA-LSTM generates the best results compared to the other models (TAV-LSTM, TAT-LSTM) because it maintains a multi-topic coverage vector which learns the weight of every topic and is continuously updated during the decoding process; this vector is then fed to an attention model to guide the generator.

References

1. T.B. Hashimoto, H. Zhang, P. Liang, Unifying human and statistical evaluation for natural language generation. arXiv preprint arXiv:1904.02792 (2019)
2. J. Allen, Natural Language Understanding (Pearson, 1995)


3. T.N.T. Abd Rahim, Z. Abd Aziz, R.H. Ab Rauf, N. Shamsudin, Automated exam question generator using genetic algorithm, in 2017 IEEE Conference on e-Learning, e-Management and e-Services (IC3e), Nov 2017 (IEEE), pp. 12–17 4. https://www.kaggle.com/krsoninikhil/pual-graham-essays?select=paul_graham_essay.txt 5. I.F. Moawad, D.S. Fadl, M.M. Aref, Rich semantic representation based approach for text generation 6. C. Callison-Burch, P. Koehn, C. Monz, K. Peterson, M. Przybocki, O. Zaidan, Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation, in Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, July 2010, pp. 17–53 7. E. Lloret, H. Saggion, M. Palomar, Experiments on summary-based opinion classification, in Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, June 2010, pp. 107–115 8. G. Paltoglou, M. Thelwall, A study of information retrieval weighting schemes for sentiment analysis, in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, July 2010, pp. 1386–1395 9. D. Sayed, M. Aref, I. Fathy, Text generation model from rich semantic representations, in Egypt Society of Language Engineering ESOLEC, 68-67 (2011) 10. J. Valls, S. Ontañón, Natural language generation through case-based text modification, in International Conference on Case-Based Reasoning, Sept 2012 (Springer, Berlin, Heidelberg, 2012), pp. 443–457 11. R. Rajkumar, D. Espinosa, M. White, The OSU system for surface realization at generation challenges 2011, in Proceedings of the 13th European Workshop on Natural Language Generation, Sept 2011, pp. 236–238 12. E.M. De Novais, T.D. Tadeu, I. Paraboni, Improved text generation using n-gram statistics, in Ibero-American Conference on Artificial Intelligence, Nov 2010 (Springer, Berlin, Heidelberg, 2010), pp. 316–325 13. A. Wiratmo, C. 
Fatichah, Indonesian short essay scoring using transfer learning dependency tree LSTM 14. C. Retoré, Variable types for meaning assembly: a logical syntax for generic noun phrases introduced by most. Rech. linguistiques Vincennes 41, 83–102 (2012) 15. B. Qin, D. Tang, X. Geng, D. Ning, J. Liu, T. Liu, A planning based framework for essay generation. arXiv preprint arXiv:1512.05919 (2015) 16. E. Clark, A. Celikyilmaz, N.A. Smith, Sentence mover’s similarity: automatic evaluation for multi-sentence texts, in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, July 2019, pp. 2748–2760 17. C. Kiddon, L. Zettlemoyer, Y. Choi, Globally coherent text generation with neural checklist models, in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Nov 2016, pp. 329–339 18. S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, J. Weston, Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319 (2019) 19. X. Feng, M. Liu, J. Liu, B. Qin, Y. Sun, T. Liu, Topic-to-essay generation with neural networks, in IJCAI, July 2018, pp. 4078–4084 20. E. Wulczyn, C. Jacoby, Softmax RNN for short text classification 21. L.R. Medsker, L.C. Jain, Recurrent neural networks. Des. Appl. 5 (2001) 22. P. Rodriguez, J. Wiles, J.L. Elman, A recurrent neural network that learns to count. Connect. Sci. 11(1), 5–40 (1999) 23. P. Yang, L. Li, F. Luo, T. Liu, X. Sun, Enhancing topic-to-essay generation with external commonsense knowledge, in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, July 2019, pp. 2002–2012 24. A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, A.A. Bharath, Generative adversarial networks: An overview. IEEE Signal Process. Mag. 35(1), 53–65 (2018) 25. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks. Commun. ACM 63(11), 139–144 (2020)


26. M. Arjovsky, L. Bottou, Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862 (2017) 27. R. Dey, F.M. Salemt, Gate-variants of gated recurrent unit (GRU) neural networks, in 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), Aug 2017 (IEEE, 2017), pp. 1597–1600 28. J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014) 29. F. Lin, X. Ma, Y. Chen, J. Zhou, B. Liu, PC-SAN: pretraining-based contextual self-attention model for topic essay generation. KSII Trans. Internet Inf. Syst. (TIIS) 14(8), 3168–3186 (2020) 30. T. Zhang, V. Kishore, F. Wu, K.Q. Weinberger, Y. Artzi, Bertscore: evaluating text generation with bert. arXiv preprint arXiv:1904.09675 (2019) 31. I. Tenney, D. Das, E. Pavlick, BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950 (2019) 32. J. Pennington, R. Socher, C.D. Manning, Glove: global vectors for word representation, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Oct 2014, pp. 1532–1543 33. W. Zhao, M. Peyrard, F. Liu, Y. Gao, C.M. Meyer, S. Eger, Moverscore: text generation evaluating with contextualized embeddings and earth mover distance. arXiv preprint arXiv:1909. 02622 (2019) 34. K.W. Church, Word2Vec. Nat. Lang. Eng. 23(1), 155–162 (2017) 35. X. Rong, word2vec parameter learning explained. arXiv preprint arXiv:1411.2738 (2014) 36. D. Herremans, C.H. Chuan, Modeling musical context with word2vec. arXiv preprint arXiv: 1706.09088 (2017) 37. H.V. Nguyen, L. Bai, Cosine similarity metric learning for face verification, in Asian Conference on Computer Vision, 8 Nov 2010 (Springer, Berlin, Heidelberg, 2010), pp. 709–720 38. N. Dehak, R. Dehak, J.R. Glass, D.A. Reynolds, P. Kenny, Cosine similarity scoring without score normalization techniques, in Odyssey, June 2010, p. 15 39. F. 
Rahutomo, T. Kitasuka, M. Aritsugi, Semantic cosine similarity, in The 7th International Student Conference on Advanced Science and Technology ICAST, Oct 2012, vol. 4, no. 1 40. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need. Adv. Neural Inf. Process. Syst. 5998–6008 (2017) 41. S. Vashishth, S. Upadhyay, G.S. Tomar, M. Faruqui, Attention interpretability across NLP tasks. arXiv preprint arXiv:1909.11218 (2019) 42. K. Papineni, S. Roukos, T. Ward, W.J. Zhu, BLEU: a method for automatic evaluation of machine translation, in Proceedings of the 40th annual meeting of the Association for Computational Linguistics, July 2002, pp. 311–318 43. X. He, L. Deng, Maximum expected bleu training of phrase and lexicon translation models, in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), July 2012, pp. 292–301

Performance Evaluation of GINI Index and Information Gain Criteria on Geographical Data: An Empirical Study Based on JAVA and Python

Sheikh Amir Fayaz, Majid Zaman, and Muheet Ahmed Butt

Abstract In this paper, a performance comparison between information gain and the GINI index is established on a raw geographical dataset. Concrete results are drawn, and performance is measured by deriving decision trees on the two datasets generated from the same source. Afterward, MDL pruning is applied to both generated decision trees, reducing the number of rules without affecting the overall performance of the trees. In this study, we primarily exploit the ID3 learning algorithm, implemented in both Java and Python, for prediction purposes. After analyzing the results, the implementation in Java gives better performance, attaining a noteworthy accuracy of 81.52%. Furthermore, we then employ a new approach on the same set of data which compares the overall performance using information gain and the GINI index. The empirical results demonstrate that information gain achieves outstanding results, with an accuracy of 81.20% using only 12 PMML rule sets.

Keywords Geographical data · Machine learning · Decision tree · Information gain · GINI coefficient

1 Introduction

There are many ways to describe machine learning in today's world. Machine learning is a part of artificial intelligence through which we can learn from, predict, decide on, categorize, analyze, and recognize data in new ways. Machine learning has stolen the spotlight: it became the number one search trend during the year 2019 and

S. A. Fayaz · M. A. Butt Department of Computer Science, University of Kashmir, Srinagar, India e-mail: [email protected] M. Zaman (B) Directorate of IT & SS, University of Kashmir, Srinagar, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_22



S. A. Fayaz et al.

continues to grow unimaginably. Machine learning is the study of scientific algorithms and statistical models which are used to perform specific tasks by relying on patterns and generating inferences and interpretations [1–4].

2 Decision Tree

A decision tree is a significant algorithm for predictive modeling and can be used to visually and explicitly represent decisions. It is a graphical representation that uses a branching technique to capture all potential outcomes under given conditions. Decision trees [5, 6] provide a convincing method for decision making because they allow us to analyze fully the potential outcomes of a decision and provide a framework for enumerating the values of outcomes and their probabilities. In decision tree learning, a machine learning algorithm called Iterative Dichotomiser 3 (ID3), invented by J. Ross Quinlan in the late 1970s, is used to generate decisions from a set of data. The construction of the tree uses a recursive divide-and-conquer method in which the training dataset is recursively partitioned into smaller subsets until no further partitions are required [7–10].

3 Splitting Benchmarks An attribute selection measure and a splitting approach are applied recursively to make decisions on a particular set of data. The attribute selection measure assigns a value or score that best separates the class labels of the dataset, and the attribute with the highest score is selected as the splitting node for the given partition. In ID3 [11] and CART, two main approaches are used to split the nodes, based on the degree of randomness and impurity of the data:

1. Information gain.
2. GINI coefficient.

3.1 Information Gain Information gain measures the reduction in entropy produced by a specific split. The Iterative Dichotomiser 3 (ID3) algorithm uses information gain (IG) as its attribute selection measure: it quantifies how much information an attribute gives us about the class. The measure is based on the concept of entropy and aims to reduce the level of entropy from the root node down to the leaf nodes. Information gain can therefore be represented as:

Performance Evaluation of GINI Index and Information Gain …

$$\text{Information Gain(Feature)} = \text{Entropy(Dataset)} - \text{Entropy(Feature)} \quad (1)$$

where entropy is a measure of the uncertainty of a random variable; it helps us measure the purity of the split. The entropy of the dataset S can be calculated using the formula:

$$H(S) = -P(+)\,\log_2 P(+) - P(-)\,\log_2 P(-) \quad (2)$$

where P(−) is the proportion of the negative class, P(+) is the proportion of the positive class, and S is the set of training examples. Thus, information gain is given by:

$$\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v) \quad (3)$$

where S_v is the subset of S for which attribute A takes value v. The best split is chosen as the one with the highest information gain, so that the amount of information still required to classify records is minimal.
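Equations (2) and (3) translate directly into code. The following is a small illustrative sketch (not the authors' implementation); the toy attribute values and labels are invented:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i) over the class proportions (Eq. 2)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v) (Eq. 3)."""
    n = len(labels)
    remainder = sum(
        len(sub) / n * entropy(sub)
        for v in set(values)
        for sub in [[lab for val, lab in zip(values, labels) if val == v]])
    return entropy(labels) - remainder

labels = ["Y", "Y", "N", "N"]
# A perfectly separating attribute recovers the full entropy of S ...
print(information_gain(["a", "a", "b", "b"], labels))  # 1.0
# ... while an uninformative one gains nothing.
print(information_gain(["a", "b", "a", "b"], labels))  # 0.0
```

The two extremes bracket the measure: a split into pure subsets earns the entire entropy of the label set as gain, and a split that leaves the class mix unchanged earns none.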

3.2 GINI Coefficient The GINI index is an attribute selection measure used to calculate the impurity of a partition. It is used by CART, which chooses for splitting the attribute with the lower impurity measure, i.e., the attribute with the lower GINI coefficient is preferred. The GINI coefficient uses a binary split of each attribute:

$$\mathrm{GINI}(D) = 1 - \sum_{i=1}^{m} p_i^2 \quad (4)$$

where a binary split on A partitions D into D_1 and D_2:

$$\mathrm{GINI}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{GINI}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{GINI}(D_2) \quad (5)$$

This process continues recursively, and the attribute with the least impurity (minimum GINI coefficient) is chosen as the splitting attribute. Thus, from Eqs. (4) and (5), the reduction in impurity is:

$$\Delta\mathrm{GINI}(A) = \mathrm{GINI}(D) - \mathrm{GINI}_A(D) \quad (6)$$
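Equations (4) and (5) can be sketched the same way (again an illustration with invented labels, not the authors' code):

```python
from collections import Counter

def gini(labels):
    """GINI(D) = 1 - sum_i p_i^2 (Eq. 4)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """GINI_A(D): impurity of a binary split, weighted by subset size (Eq. 5)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

labels = ["Y", "Y", "N", "N"]
print(gini(labels))                        # 0.5
print(gini_split(["Y", "Y"], ["N", "N"]))  # 0.0, so the reduction (Eq. 6) is 0.5
```

A balanced two-class set has the maximum binary impurity of 0.5; a split into pure halves drives the weighted impurity to 0, so that split is preferred.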


4 Related Work Enormous amounts of data are present in every field, such as academic data, agricultural data, weather data [12, 13], cloud data [14], etc. This section briefly reviews recent studies on information gain and the GINI index.

Raileanu et al. proposed a theoretical comparison between the GINI index and information gain. In [15], a formal methodology is introduced that allows multiple split criteria to be compared for the decision process. A formal description of how to select between split criteria on a given set of data is also presented, based on information gain and the GINI coefficient, which allows the authors [15] to analyze the frequency of agreement and disagreement between the two functions; they calculate that the criteria disagree in only 2% of cases. The results therefore do not indicate which one to prefer.

Jain et al. investigated a joint splitting criterion for decision trees based on information gain and GINI index. In [16], they propose splitting the data where the information gain is maximal and the GINI index is minimal. Experiments on UCI machine learning datasets show that the joint splitting criterion performs satisfactorily when compared against each criterion individually.

Muharram et al. (2004) proposed feature construction based on information gain and GINI index. In this study [17], the performance of a series of classification algorithms is evaluated with attributes constructed by genetic programming added to the original ones. Two fitness functions, based on information gain and GINI index, were used in the genetic programming. A performance comparison was made between four classifiers (C5, CHAID, CART, and ANN) with and without the new attributes; the authors concluded that no algorithm held an advantage over the rest.
The performance of C5 and CART increased dramatically, to 100%, and the remaining classifiers also improved, though not as much as C5 and CART.

Zaman et al. (2020) proposed an analytical comparison between information gain and GINI index on geographical data. In their research [18], data of the Kashmir region was collected from IMD Pune, India. The attributes in their dataset include humidity @3, humidity @12, maximum temperature, minimum temperature, and season. A comparison was made between information gain and GINI index, in which both techniques were used to convert the continuous data into discrete-valued data.

5 Dataset Data is a fundamental ingredient of every experiment to be carried out. In this study, we work on a continuous dataset collected from the Indian Meteorological Department (IMD), Pune, India [18].


This geographical data covers the Kashmir region of India from 2012 to 2017. The weather parameters are taken from three different regions of the Kashmir division: the North zone (containing the Gulmarg area), the South zone (containing the Qazigund area), and the Central part of Kashmir (containing the Srinagar area). These regions are distinguished by the station ids present in the dataset [18]. The snapshots (Figs. 1 and 2) show the two CSV files containing all the attributes, which include station id, year, month, date, hour, humidity, maximum temperature, minimum temperature, and rainfall [18]; these attributes are processed and then used for the implementation of the decision tree.

Fig. 1 Instances of relative humidity

Fig. 2 Instances of maximum temperature, minimum temperature, and rainfall


Fig. 3 Processed and integrated dataset

These two files are then preprocessed: relevant attributes are retained, missing values are removed, and the files are integrated into one single file as shown in Fig. 3. The resultant continuous dataset contains all the necessary attributes for all three zones [18]. These attributes include:

a. Maximum temperature
b. Minimum temperature
c. Humidity @ 12 a.m.
d. Humidity @ 3 p.m.
e. Rainfall
f. Month
g. Date
h. Year
i. Station id.

The attribute month in the integrated dataset has been grouped into four seasons as shown in Fig. 4. The continuous attribute values in Fig. 3 are then used for the implementation of the decision tree; before implementation, however, the continuous values must first be converted into discrete values. This conversion can be done using information gain or the GINI index. In this paper, we have discretized the resultant continuous dataset [18] using both information gain and the GINI index.

Fig. 4 Splitting months


Table 1 Best possible splits using ID3 and CART

Attribute   Information gain   GINI index   Class one          Class two
TMAX        25.05              8.05         <= 8.05 is H1      > 8.05 is H2
TMIN        -0.35              -0.35        <= -0.35 is L1     > -0.35 is L2
HUMID12     69.5               69.5         <= 69.5 is T1      > 69.5 is T2
HUMID3      82.5               89.5         <= 89.5 is U1      > 89.5 is U2

5.1 Evaluation—Information Gain Versus GINI Index In this processed dataset, four attributes are continuous-valued rather than discrete-valued. We employed the information gain used by ID3 and the GINI index used by CART to determine the best possible split points; the results are shown in Table 1 [18].

5.2 Information Gain Information gain measures how much information an attribute gives us about the class. The best split is chosen on the basis of the highest information gain, so that the amount of information required to classify records is minimal. To convert the continuous data into labeled/discrete values, the information gain of each attribute was calculated individually (Table 1) [18], and the best split point determined accordingly. A snapshot of the normalized discrete-valued dataset based on information gain is shown in Fig. 5. All fields are binary classified except season, which has been split into four values as shown in Fig. 4, and the target attribute rainfall has been labeled Y and N, where:

• Max_Temp has been split into H1 and H2, with H1 ≤ 25.05
• Min_Temp has been split into L1 and L2, with L1 ≤ −0.35
• Humid_12 has been split into T1 and T2, with T1 ≤ 69.5
• Humid_3 has been split into U1 and U2, with U1 ≤ 82.5.
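The threshold search behind these splits, scanning the midpoints between consecutive sorted values of a continuous attribute and keeping the one with the highest information gain, can be sketched as follows. The temperature readings and labels below are hypothetical; only the H1/H2 labelling convention mirrors the paper's:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Try the midpoint between each pair of consecutive sorted values
    and keep the threshold with the highest information gain."""
    pairs = sorted(zip(values, labels))
    n, base = len(pairs), entropy(labels)
    best_gain, best_t = -1.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = base - (len(left) / n * entropy(left)
                       + len(right) / n * entropy(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t

# Hypothetical maximum-temperature readings with rainfall labels;
# records at or below the threshold would be labelled H1, the rest H2.
tmax = [14.1, 18.0, 24.0, 26.1, 29.5, 31.0]
rain = ["Y", "Y", "Y", "N", "N", "N"]
print(round(best_split_point(tmax, rain), 2))  # 25.05
```

The GINI-based discretization of Sect. 5.3 works the same way, except that the weighted impurity of Eq. (5) is minimized instead of the gain being maximized.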

5.3 GINI Coefficient After converting the continuous data (Fig. 3) into labeled data using information gain, we converted the same set of data into labeled data using the GINI coefficient, which chooses for splitting the attribute with the lower impurity measure, i.e., the attribute with the lower GINI coefficient is preferred (Table 1) [18]. Thus,


Fig. 5 Resultant labeled data using information gain

the snapshot of the normalized discrete-valued dataset based on the GINI coefficient is shown in Fig. 6. All fields are binary classified except season, which has been split into four values as shown in Fig. 4, and the target attribute rainfall has been labeled Y and N, where:

• Max_Temp has been split into H1 and H2, with H1 ≤ 8.05
• Min_Temp has been split into L1 and L2, with L1 ≤ −0.35
• Humid_12 has been split into T1 and T2, with T1 ≤ 69.5
• Humid_3 has been split into U1 and U2, with U1 ≤ 89.5.

6 Decision Tree Implementation: An Empirical Examination of Python and Java Implementation is a major step in evaluating the effectiveness of the data. In this paper, we have implemented decision trees on both sets of data (Figs. 5 and 6) using Python and Java, as shown below:


Fig. 6 Resultant labeled data using GINI coefficient

6.1 Implementation Using Information Gain For the decision tree implementation using information gain, the attribute with the highest information gain is selected as the root (parent) node of the tree. This step is repeated until all attributes are processed, leaving the leaf nodes as the target attribute values. In this iterative process, we calculated the entropy and information gain for each attribute, selected the attribute with the highest information gain as the decision node, and repeated the same process for every other required node. After these iterative steps, we obtained the resultant decision tree for the given attributes, shown in Fig. 7.
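The iterative procedure described here can be sketched as a small recursive builder. This is an illustrative dict-based ID3, not the authors' Python or Java code, and the records are hypothetical, in the style of the Fig. 5 labels:

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    n = len(rows)
    return -sum((c / n) * log2(c / n)
                for c in Counter(r[target] for r in rows).values())

def id3(rows, attributes, target):
    """Recursive divide and conquer: choose the attribute with the highest
    information gain, partition on it, and recurse until partitions are pure."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                 # pure partition -> leaf
        return labels[0]
    if not attributes:                        # nothing left -> majority leaf
        return Counter(labels).most_common(1)[0][0]

    def gain(a):
        rem = sum(len(sub) / len(rows) * entropy(sub, target)
                  for v in set(r[a] for r in rows)
                  for sub in [[r for r in rows if r[a] == v]])
        return entropy(rows, target) - rem

    best = max(attributes, key=gain)
    rest = [a for a in attributes if a != best]
    return {best: {v: id3([r for r in rows if r[best] == v], rest, target)
                   for v in set(r[best] for r in rows)}}

# Hypothetical discretised records in the style of the Fig. 5 labels
rows = [{"Max_Temp": "H1", "Humid_12": "T2", "Rainfall": "Y"},
        {"Max_Temp": "H1", "Humid_12": "T1", "Rainfall": "Y"},
        {"Max_Temp": "H2", "Humid_12": "T2", "Rainfall": "Y"},
        {"Max_Temp": "H2", "Humid_12": "T1", "Rainfall": "N"}]
print(id3(rows, ["Max_Temp", "Humid_12"], "Rainfall"))
```

Each nested dict is a decision node keyed by the attribute it tests, so the printed structure corresponds directly to a tree like the one in Fig. 7.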

Fig. 7 Decision tree using information gain


6.2 Implementation Using GINI Index For the decision tree implementation using the GINI index, the attribute with the minimum impurity, i.e., the least impure attribute, is selected as the root (parent) node of the tree. This step is repeated until all attributes are processed, leaving the leaf nodes as the target attribute values. In this iterative process, we calculated the GINI index for each attribute, selected the attribute with the minimum impurity as the decision node, and repeated the same process for every other required node. After these iterative steps, we obtained the resultant decision tree for the given attributes, shown in Fig. 8.

7 Minimum Description Length (MDL) Pruning The decision trees generated by machine learning algorithms such as ID3, CART, and C4.5 provide precise, effective, and accurate results, but they often suffer the disadvantage of producing very large tree structures that make them impractical for professionals [7, 19]. The concept of decision tree pruning evolved to solve this problem. In MDL tree pruning, a large tree is converted into a smaller tree that is easier to understand without affecting the overall performance of the original tree. A portion of the original tree (a subtree) that is judged unreliable is pruned, yielding a new (pruned) tree with almost the same accuracy and performance as the original. Here, we applied MDL pruning to both decision trees generated (Figs. 7 and 8) with information gain and the GINI coefficient; the resultant pruned trees are shown in Figs. 9 and 10.
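MDL pruning proper compares encoding costs measured in bits; the following is only a minimal sketch of the same idea, replacing a subtree by its majority-class leaf whenever a simple cost (training errors plus a size penalty, a crude stand-in for description length) does not increase. The tree format and the `alpha` penalty are assumptions made for illustration, not the pruning used in the paper:

```python
from collections import Counter

# Assumed tree format: a leaf is a class label,
# an internal node is {attribute: {value: subtree}}.

def size(tree):
    """Number of nodes in the tree."""
    if not isinstance(tree, dict):
        return 1
    (_, branches), = tree.items()
    return 1 + sum(size(sub) for sub in branches.values())

def classify(tree, row):
    while isinstance(tree, dict):
        (attr, branches), = tree.items()
        tree = branches[row[attr]]
    return tree

def errors(tree, rows, target):
    return sum(classify(tree, r) != r[target] for r in rows)

def prune(tree, rows, target, alpha=1.0):
    """Bottom-up pruning: replace a subtree by its majority-class leaf
    whenever (training errors + alpha * size) does not increase."""
    if not isinstance(tree, dict) or not rows:
        return tree
    (attr, branches), = tree.items()
    pruned = {attr: {v: prune(sub, [r for r in rows if r[attr] == v],
                              target, alpha)
                     for v, sub in branches.items()}}
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    leaf_cost = sum(lab != majority for lab in labels) + alpha
    tree_cost = errors(pruned, rows, target) + alpha * size(pruned)
    return majority if leaf_cost <= tree_cost else pruned

# A redundant subtree (every branch predicts Y) collapses to a leaf.
tree = {"Max_Temp": {"H1": "Y", "H2": {"Humid_12": {"T1": "Y", "T2": "Y"}}}}
rows = [{"Max_Temp": "H2", "Humid_12": "T1", "Rainfall": "Y"},
        {"Max_Temp": "H2", "Humid_12": "T2", "Rainfall": "Y"},
        {"Max_Temp": "H1", "Humid_12": "T1", "Rainfall": "Y"}]
print(prune(tree, rows, "Rainfall"))  # Y
```

This captures why pruning shrinks the rule count in Table 2: whenever a subtree's extra structure buys no accuracy, the cost comparison folds it into a single leaf.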

Fig. 8 Decision tree using GINI index


Fig. 9 Pruned decision tree using information gain

Fig. 10 Pruned decision tree using GINI index

8 Experimental Results and Performance Comparison We implemented four decision trees based on information gain and the GINI coefficient, including two pruned decision trees. The performance and accuracy measures were calculated for each decision tree, and the results were quite satisfactory.

8.1 Performance: Python Versus Java To calculate the performance, we implemented the code in both Python and Java. The performance was lower in Python,


Fig. 11 Prediction performance using Python

where it was around 76.09% for the GINI index. The snapshot in Fig. 11 depicts the performance along with the confusion matrix and various other calculations. The same data was then implemented in Java, where the performance increased significantly, up to 81.53%, with 53 rules generated using the GINI index. The snapshot of the rules generated in Java is shown in Fig. 12. Similarly, when the same approach was implemented using information gain, the performance remained almost the same, but the number of rules dropped sharply from 53 to 40.

For the pruned decision trees, the performance with GINI and information gain does not deviate much from the respective original trees, but the number of rules falls again: with information gain there are fewer rules (12) than with the GINI index (16), without affecting the overall performance of the tree. The snapshot below shows the number of rules generated by the pruned decision tree in Java on the GINI index (Fig. 13).

Information retrieval metrics such as recall and precision are used in this study to evaluate the proposed approach, and the two are reported side by side. Precision is the percentage of returned outcomes that are relevant, while recall is the percentage of all relevant outcomes correctly classified by the algorithm. Precision relates the true positives (TP), the correctly classified entities, to the false positives (FP), the incorrectly classified entities; recall uses FN in place of FP [20, 21]. They are calculated as follows:

$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \quad (7)$$

$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \quad (8)$$
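Equations (7) and (8) reduce to two one-line functions; the confusion-matrix counts below are hypothetical, not the values behind Fig. 11:

```python
def recall(tp, fn):
    """Eq. (7): share of actual positives that were recovered."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Eq. (8): share of positive predictions that were correct."""
    return tp / (tp + fp)

# Hypothetical counts for the rainfall (positive) class
tp, fp, fn = 90, 25, 118
print(round(recall(tp, fn), 3))     # 0.433
print(round(precision(tp, fp), 3))  # 0.783
```

Note how a rare positive class can show high precision alongside low recall, the pattern visible in the rainfall rows of Table 2.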

We also calculated the Cohen kappa coefficient, which is the agreement between two sets of ratings, i.e., when two binary variables are attempts


Fig. 12 Generation of rules in decision tree on GINI index


Fig. 13 Generation of rules in pruned decision tree on GINI index

by two individuals to measure the same entity. The value of the Cohen kappa is defined as:

$$K = \frac{P_o - P_e}{1 - P_e} \quad (9)$$

where P_o is the observed agreement and P_e the expected (chance) agreement. The comparative analysis for each decision tree is shown in Table 2. Figure 14 shows the comparative chart for the actual values calculated in this study. The accuracy remains almost the same across all four approaches, but the number of rules in the pruned information-gain tree is reduced considerably without affecting the overall performance; i.e., with fewer rules we can produce the overall forecast with less complexity.
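Equation (9) can likewise be computed directly from a binary confusion matrix; the counts below are hypothetical (chosen only to match the scale of a 1786-record test set), not the study's actual matrix:

```python
def cohen_kappa(tp, fp, fn, tn):
    """K = (Po - Pe) / (1 - Pe), Eq. (9): Po is the observed agreement,
    Pe the agreement expected by chance from the marginal totals."""
    n = tp + fp + fn + tn
    po = (tp + tn) / n
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (po - pe) / (1 - pe)

# Hypothetical binary confusion counts on a 1786-record test set
print(round(cohen_kappa(tp=90, fp=25, fn=118, tn=1553), 3))
```

Because P_e grows with class imbalance, kappa discounts the easy agreement on the dominant "no rainfall" class, which is why the kappa values in Table 2 sit well below the raw accuracies.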

9 Conclusion and Future Work This paper concludes with two observations. First, on this historical geographical dataset [18], the performance is lower when Python is used than with Java. Second, a performance comparison has been


Table 2 Accuracy statistics

                                  Decision tree                Decision tree
                                  (information gain)           (GINI index)
S. no.  Specifications            Without pruning   Pruned     Without pruning   Pruned
1.      Test set                  1786              1786       1786              1786
2.      Correctly classified      1450              1450       1456              1448
3.      Wrong classified          336               336        330               338
4.      Accuracy                  0.812             0.812      0.815             0.811
5.      Error                     0.181             0.181      0.184             0.189
6.      Rules                     40                12         53                16
7.      Recall (rainfall)         0.432             0.432      0.502             0.484
8.      Recall (no rainfall)      0.914             0.914      0.945             0.944
9.      Precision (rainfall)      0.724             0.724      0.783             0.779
10.     Precision (no rainfall)   0.837             0.837      0.820             0.817
11.     Cohen kappa               0.512             0.512      0.497             0.482

Fig. 14 Results and accuracy findings

made between the decision trees generated from the same set of data using information gain and the GINI index. The accuracy remains almost the same in both cases (approx. 81%), but the number of rules generated with information gain is drastically lower (40 rules). Moreover, after applying MDL pruning to both decision trees, the accuracy again remains almost the same (approx. 81%), while the number of rules drops drastically once more with information gain (only 12 rules) without affecting the overall performance. Thus, in


this paper we conclude that information gain performs better than the GINI index on this particular set of data. This study was primarily aimed at comparing the performance of information gain and the GINI index on geographical data of the Kashmir province; the same decision model can also be generated and its performance compared on other benchmark datasets.

References

1. J. Han, M. Kamber, Data Mining: Concepts and Techniques (China Machine Press, Beijing, 2007)
2. R. Mohd, M.A. Butt, M.Z. Baba, GWLM–NARX. Data Technol. Appl. (2020)
3. M. Ashraf et al., Knowledge discovery in academia: a survey on related literature. Int. J. Adv. Res. Comput. Sci. 8(1) (2017)
4. M. Ashraf, M. Zaman, M. Ahmed, To ameliorate classification accuracy using ensemble vote approach and base classifiers, in Emerging Technologies in Data Mining and Information Security (Springer, Singapore, 2019), pp. 321–334
5. S.R. Safavin, D. Langrebe, A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 21(3), 660–674 (1991)
6. Z. Kapas, L. Lefkovits, L. Szilágyi, Automatic detection and segmentation of brain tumor using random forest approach, in Modeling Decisions for Artificial Intelligence (Springer, 2016), pp. 301–312
7. J. Quinlan, Simplifying decision trees. Int. J. Hum. Comput. Stud. 51(2), 497–510 (1999)
8. L. Rokach, O. Maimon, Data Mining with Decision Trees: Theory and Applications (World Scientific, 2008)
9. M. Ashraf, M. Zaman, M. Ahmed, An intelligent prediction system for educational data mining based on ensemble and filtering approaches. Procedia Comput. Sci. 167, 1471–1483 (2020)
10. M. Zaman, S.M.K. Quadri, M.A. Butt, Information translation: a practitioners approach, in Proceedings of the World Congress on Engineering and Computer Science, vol. 1 (2012)
11. Q. Zhang, K. You, G. Ma, Application of ID3 algorithm in exercise prescription, in The International Conference on Electric and Electronics, Nanchang, China, 22 June 2011, vol. 99, no. 3, pp. 669–675
12. S.A. Fayaz, M. Zaman, M.A. Butt, To ameliorate classification accuracy using ensemble distributed decision tree (DDT) vote approach: an empirical discourse of geographical data mining. Procedia Comput. Sci. 184, 935–940 (2021)
13. How machine learning is redefining geographical science: a review of literature. Int. J. Emerg. Technol. Innov. Res. 6(1), 1731–1746 (2019). ISSN 2349-5162. Available: http://www.jetir.org/papers/JETIRDW06285.pdf
14. S.A. Fayaz, I. Altaf, A.N. Khan, Z.H. Wani, A possible solution to grid security issue using authentication: an overview. J. Web Eng. Technol. 5(3), 10–14 (2019)
15. L.E. Raileanu, K. Stoffel, Theoretical comparison between the Gini index and information gain criteria. Ann. Math. Artif. Intell. 41(1), 77–93 (2004)
16. V. Jain, A. Phophalia, J.S. Bhatt, Investigation of joint splitting criteria for decision tree classifier: use of information gain and Gini index, in TENCON 2018 – 2018 IEEE Region 10 Conference (IEEE, 2018), pp. 2187–2192
17. M.A. Muharram, G.D. Smith, Evolutionary feature construction using information gain and Gini index, in European Conference on Genetic Programming (Springer, Berlin, Heidelberg, 2004), pp. 379–388
18. M. Zaman, S. Kaul, M. Ahmed, Analytical comparison between the information gain and Gini index using historical geographical data. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 11(5), 429–440 (2020)


19. D.D. Patil, V.M. Wadhai, J.A. Gokhale, Evaluation of decision tree pruning algorithms for complexity and classification accuracy. Int. J. Comput. Appl. 11(2), 23–30 (2010)
20. S. Zainudin, D.S. Jasim, A.A. Bakar, Comparative analysis of data mining techniques for Malaysian rainfall prediction. Int. J. Adv. Sci. Eng. Inf. Technol. 6(6), 1148–1153 (2016)
21. M. Ashraf, M. Zaman, M. Ahmed, Performance analysis and different subject combinations: an empirical and analytical discourse of educational data mining, in 2018 8th International Conference on Cloud Computing, Data Science & Engineering (Confluence) (IEEE, 2018), pp. 287–292

Critical Analysis of Big Data Privacy Preservation Techniques and Challenges Suman Madan , Kirti Bhardwaj , and Shubhangi Gupta

Abstract In the present age, diversified data—both structured and unstructured—is being produced in huge amounts from sources such as database transactions, audio, images, videos, and online social platforms, within fractions of a second. While this data is consumed and stored in large volumes, it is quite complex in nature and keeps growing under the constraints of self-regulating sources. As a result, conventional procedures of data analysis and management fail to manage these large data sets, collectively called big data. Useful information is obtained by applying a variety of data mining techniques to these large data sets, but the process is as meticulous as it is challenging. One of the biggest challenges is maintaining data privacy and restricting unauthorized access to sensitive data generated, exchanged, or recorded in banking transactions, health-sector procedures, and user interactions on social media, as the online presence and dependency of users has increased exponentially along with the proliferation of sensitive information. This paper presents research insights into the challenges of big data mining and the privacy concerns in big data, besides presenting gaps in the research that can be used to plan future work. Keywords Big data · Big data challenges · Big data techniques · Data mining · Data privacy

S. Madan · K. Bhardwaj · S. Gupta (B), Jagan Institute of Management Studies, Rohini, Delhi, India. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022. A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_23

1 Introduction Until quite recently, information kept in corporate databases was maintained for archiving purposes only. But when this huge store of information, termed "big data," was scrutinized using unconventional applications, it proved to be of utmost relevance, as it revealed trends and predictive results [1, 2]. Big data is now used extensively in fields like banking, marketing, and finance in corporations at a global level. Individual workstations are responsible for data


inflow from different sources, where data is analyzed and processed to predict trends, consumer behavior, product lifecycles, etc. The conclusions drawn hold real worth for these corporations, since they carry huge monetary value. Large technology companies like Google, Facebook, and Twitter require big data management on a regular, transaction-driven basis; they store the data they collect from their users and analyze it to draw predictions and conclusions that they use to improve their products and services continuously [3]. Among the various features of big data, some key ones are commonly used to characterize it, known as the 5Vs and 3Ps. The 5Vs comprise volume, velocity, value, variety, and veracity; the 3Ps comprise prediction, prevention, and personalization [4, 5]. These features are used to distinguish data and assess its quality, since big data is collected from many types of sources. Another important task is to maintain the privacy and secure access of sensitive or uncertain data [6]. The data may be structured or unstructured depending on the sources it is collected from, such as social media, shared audio and video, the banking sector, and other transactions. Because this data can be both complex and diverse, traditional systems like RDBMS and SQL prove inadequate. Moreover, the variety of data collected is highly differentiated, with more data being added to the already increasing volume at a very fast pace. For this reason, conventional practices of data storage, analysis, and processing fail to keep up with these data sets [7], and advanced tools and applications are used to deal with them [6]. Applications that overcome this limitation include Hadoop, MapReduce, NoSQL, Apache Spark, HBase, Pig, Hive, Sqoop, and Oozie; of these, Hadoop is the most adequate and popular [6, 7].
There can be many approaches to drawing valid conclusions and predictive trends that aid in forecasting future business markets and lead to monetary gains. As data mining is a crucial part of the entire procedure of storing and analyzing big data, and plays an important role in the process of knowledge discovery, it is usually referred to as knowledge discovery in databases (KDD) [8, 9]. Data mining needs to be carried out with utmost precision: scrambled data often leads to faulty results and reduces the efficiency of the data model being used. For big data to be beneficial in monetary terms, supporting the marketable growth of organizations and beyond, data needs to be analyzed correctly before being used in further processes [10]. One key element to be taken care of at the time of data mining is preserving privacy, which means restricting access so that only those who are authorized can view individual information. Thus, privacy is an important aspect that organizations need to maintain while mining results, in order not to disclose individual identities [11].


2 Privacy Concern in Big Data Corporate organizations gather information about users that may be generic or specific, such as health records, email addresses, dates of birth, credit and debit card numbers, phone numbers, etc. [12]. Often this data is sold to third-party companies for further processing, or as a data set if the company specializes in selling anonymized user data. For this reason, designing privacy-enhancing techniques that can be applied to individual data while still arriving at legitimate conclusions holds top priority. The methodical, abstract, and legal significance of data privacy is considered an important constraint in data mining, so as to prevent unauthorized access [13, 14]. Corporate giants like Facebook, Twitter, and Google deal extensively with data exchange, not only in the user domain but in the corporate domain as well; this data may vary from audio and video to transactions, and can invite cybercrime by which technology companies are most affected [15, 16]. During the processing and modification of data, the security manager ensures that security is maintained at each subsequent level of the architecture, protecting data at these different layers from unwanted inference, malicious attacks, and unauthorized access, while preserving confidentiality, authenticity, accessibility, and availability [17–19]. Modern privacy law requires all organizations that collect data to treat user privacy as of utmost importance. At the same time, it is very important for organizations to maintain privacy within the enterprise, since failing to follow user privacy norms may lead to grave legal consequences, among others [20]. With online tools becoming more advanced by the minute and an exponentially rising online presence, the rate of threats against privacy is increasing [21].
Hence, the basic requirement during data exchange is to transform the information in such a way that it becomes difficult to link distinguishing information records to particular users. This can be achieved by various strategies, including encryption procedures and data structures that make predicting data-access patterns highly complex if not impossible. Data mining techniques are important for data integrity and for preserving data privacy in the big data domain. Figure 1 summarizes the various privacy and security concerns in terms of major categories: infrastructure security, data management, data privacy, integrity, and reactive security.

3 Literature Review Data has been growing substantially, which has posed various challenges for organizations from the analysis and storage perspectives. Heterogeneity, scalability, timeliness, infrastructure faults, and skill requirements are some of the common challenges organizations face [10].


Fig. 1 Privacy and security concerns

In [11], maintaining security at multiple levels is presented in the form of an algorithm that uses masking to distinguish the complex stakes of big data. It has a module for collecting data, often in cleaned form, from the source. This cleaned data is then sent to the data miner, but only after it is unscrambled and several significant query-related and other constraints are applied to it. Before any conclusions are drawn or relevant information is extracted, the incoming data is matched against the sensitive data kept in the database, using encryption and decryption techniques. If the data proves to be a match, it is forwarded to a decision maker; otherwise it is sent back to the data miner for correction and further processing. One of the biggest disadvantages of this algorithm is that the data matching is time consuming and complex.

In [22], the author examines various sources of data along with their role in different fields, such as risk assessment, media, and log storage, and discusses findings about challenges, tools to handle big data, and key features like heterogeneity and volume. In [20], challenges and factual issues concerning big data storage, management, analysis, and processing are highlighted, the reason being its substantial rate of growth. Along with the 5Vs (velocity, value, volume, veracity, and variety), many other key parameters are important in deciding how to manage big data, which is the main issue the author focuses on.

In [23], a three-layered cryptographic model is presented, with the layers being secret, authorized, and public. Applying this model and other encryption schemes, an algorithm is composed that converts plaintext into ciphertext with different layers of security on the data.
At the bottom layer, i.e., the secret layer, the data is imprinted with a digital mark along with encryption making it inaccessible to


users who are not authorized. At the middle layer, receivers who are authorized and possess a private key can decrypt the ciphertext and apply data mining techniques to the data. At the top layer, i.e., the public layer, an authorized person can search and view the information or conclusions after the entire encryption–decryption and data mining procedure. In [24], integration of MapReduce is suggested, along with privacy preservation techniques and organization authentication, to promote efficient and secure data processing, find possible solutions to privacy-related problems and improve security. In [25], the author proposes a three-tier architecture framework in which the top tier supports easy access to data and mathematical computations, the intermediate tier focuses on user privacy issues, and the bottom tier addresses issues that occur when mining complex, diversified and dynamic data. This constitutes a data-driven big data processing model that encompasses not only data mining but also demand-driven aggregation of information sources, key security areas, deep analysis and user-centric modeling. In [26], the author presents K-anonymity, wherein every record is indistinguishable from at least k − 1 other records on the established variables. K-anonymity uses generalization and suppression, replacing original values with special characters. Its main disadvantage is that it is not concerned with links between sensitive attributes, so sensitive data can still be visible or can leak. In [27], a hybrid optimization technique is designed for privacy preservation in big data using K-anonymization. This technique provides data protection and easy data retrieval while minimizing delay.
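The generalization-and-suppression idea behind K-anonymity can be sketched in a few lines of Python. The data set, the bucketing rules and the helper names below are illustrative assumptions, not taken from [26].

```python
from collections import Counter

def generalize(record):
    """Coarsen the quasi-identifiers: bucket age into decades and
    suppress the trailing zip digits with '*'."""
    age, zipcode, disease = record
    decade = (age // 10) * 10
    return (f"{decade}-{decade + 9}", zipcode[:3] + "**", disease)

def is_k_anonymous(records, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter((age, zipcode) for age, zipcode, _ in records)
    return all(count >= k for count in groups.values())

raw = [(34, "10001", "flu"), (37, "10002", "cold"), (33, "10003", "flu"),
       (52, "20001", "asthma"), (58, "20002", "cold"), (55, "20003", "flu")]
released = [generalize(r) for r in raw]

print(is_k_anonymous(raw, 2), is_k_anonymous(released, 3))  # False True
```

Note that the released table can still leak a sensitive value whenever a quasi-identifier group happens to be homogeneous, which is exactly the link-between-sensitive-attributes weakness noted above.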
However, this method has the limitation of the K-anonymity model, which fails in real scenarios when attackers use other methods. In [28], the authors present a unique way to promote scalability. They use a generalization algorithm whose structure is formed by building a tree over the user's original sensitive data set and applying various operations. This generalization method, called the bottom-up approach, is an efficient way to anonymize data, since the previously stored data keeps getting compressed as more data arrives. But there are many types of generalization, so the best one should be identified for the given data in order to examine the upward-moving hierarchies and relations at each step of the process. The differential privacy technique in [29] does not allow clients direct access to the database. This technique is completely different from anonymization: the data does not need to be modified; instead, the interface computes the results and adds distortion to them before they are displayed. It is mainly used to reduce the possibility of identifying individuals while querying the data. One problem with this technique is that the analyst must know the query in advance. In [30], a useful approach to hiding the input data and the internal state of data during processing is suggested through a homomorphic technique. This technique provides methods for performing certain computations directly on ciphertext, producing a result that is itself encrypted. Further, it involves matching the


decrypted results to the data obtained by applying the same operations on the plaintext. Hence, both the initial state of the data and the encrypted backend state remain preserved. In [31], a two-phase approach is suggested wherein huge data sets are processed in two phases. In the first phase, the data is anonymized to obtain intermediate results, which are then integrated in the second phase; the data obtained at the end of the second phase is the final desired outcome. This is another form of top-down approach and increases security and data preservation. A disadvantage of this approach appears when the data set is too large to be anonymized, and there is the added risk of privacy loss while partitioning. In [32], the data mashup technique secures high-dimensional private data shared between two parties. The data on the user's end is mashed up before being sent to the third party: only ordinary data is exposed to the third party, and sensitive data is encrypted before being revealed to the other party. The issue with this technique is that mashing up large data sets takes a lot of time. In [33], proxy re-encryption is a technique in which only ciphertext is shared, securely and multiple times; the message and the sender's and receiver's identities are never disclosed. In this scheme, the ciphertext under one key is converted into an encryption of the same message under a separate key. In [34], techniques such as bucketization, one attribute per column, generalization, multiset-based generalization, slicing and slicing with suppression are discussed in detail; these techniques achieve different levels of privacy. Generalization cannot easily be applied to high-dimensional data, and bucketization cannot prevent membership disclosure, so slicing was proposed to overcome these problems.
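The homomorphic matching step — decrypting the result of ciphertext-domain operations and comparing it with the plaintext computation — can be illustrated with textbook RSA, which is multiplicatively homomorphic. This is a toy with tiny primes for illustration only, not the scheme used in [30].

```python
# Toy textbook RSA with tiny primes (insecure; illustration only).
p, q = 61, 53
n = p * q                   # modulus, 3233
phi = (p - 1) * (q - 1)     # 3120
e = 17                      # public exponent
d = pow(e, -1, phi)         # private exponent (modular inverse of e)

enc = lambda m: pow(m, e, n)
dec = lambda c: pow(c, d, n)

a, b = 7, 12
# Multiply in the ciphertext domain ...
c_prod = (enc(a) * enc(b)) % n
# ... then match the decrypted result against the plaintext operation.
print(dec(c_prod) == (a * b) % n)   # True
```

The equality holds because (a^e · b^e)^d ≡ (a·b)^(ed) ≡ a·b (mod n), so the server can multiply without ever seeing a or b.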
In [35], data can be partitioned vertically and horizontally using an anonymization technique called slicing. Vertical partitioning refers to clustering highly correlated attributes into columns; horizontal partitioning refers to randomly sorting column values so that values across columns cannot be linked. Slicing thus breaks the associations across columns while preserving the associations within each column, and it is well suited to high-dimensional data. Another technique, in [36], is a hybrid technique that combines generalization and randomization, in that order, for better accuracy when reconstructing data. In [37], the author develops an anonymity model for publishing privacy-preserving data securely using a fitness function; it could be improved by amending the optimization algorithms, as it neglects certain database requirements. In [38, 39], various privacy preservation models and schemes are highlighted through a hybrid method for privacy protection using the dragonfly algorithm; the schemes offer a high degree of data security when publishing big data in the cloud. In [40], an output-perturbation approach based on differential privacy was proposed to improve the accuracy of query processing while reducing the possibility of privacy leakage.
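The output-perturbation idea in [29, 40] can be sketched with the standard Laplace mechanism: the interface answers a counting query and adds calibrated noise instead of modifying the data. The data set and the epsilon value below are illustrative assumptions.

```python
import math
import random

def laplace_count(data, predicate, epsilon):
    """Answer a counting query with Laplace noise.
    A count has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(1 for x in data if predicate(x))
    # Inverse-CDF sample of a Laplace(0, 1/epsilon) variate.
    u = random.random() - 0.5
    noise = -(1 / epsilon) * math.copysign(1, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

ages = [34, 37, 33, 52, 58, 55]
# Noisy answer to "how many records have age > 50?" (true answer: 3).
print(laplace_count(ages, lambda a: a > 50, epsilon=0.5))
```

Smaller epsilon means stronger privacy but noisier answers, which matches the accuracy-versus-leakage trade-off that [40] targets.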


4 Findings

Our study highlighted that to realize the inescapable potential of big data, we need to deal with various concerns regarding the privacy, and in turn the security, of big data. The key concerns can be summarized as:
• Efficiency and scalability of algorithms, since we need to deal with databases of huge volume and velocity.
• Dealing with heterogeneous data, since most divide-and-conquer methods are not apt at the vast scale of big data, and with subjective privacy measurements, which differ from person to person and have social and psychological dimensions beyond the technical.
• Lack of well-defined privacy frameworks, since big data is a relatively new field; the variety of clustering methods limits the practical adoption of differential privacy, which can be resolved only by building improved theoretical foundations for studying privacy conditions from all perspectives [41].

Table 1 describes the various privacy techniques and the challenges they face. A comparative analysis of some of the privacy preserving techniques is shown in Table 2, based on parameters such as type of data, linkage property, information loss and privacy preserved. The analysis shows that no single technique is dependable in all domains; each technique performs differently depending on the size of the data and the type of application.

5 Technological-Based Solutions

To satisfy the unprecedented requirements of big data, we need to make necessary improvements in the existing theoretical frameworks, algorithms and mechanisms, along with making significant investments in the following:

1. Integrating privacy preservation techniques with quantum computing: Quantum computers actively use the quantum phenomena of superposition and entanglement to perform complex tasks that ordinary computers cannot. It is this ability that can be used to ensure better functionality in security and privacy preservation, with an emphasis on the time complexity of encryption methods. However, it is not an unconditional method for privacy safety. The measurement-based quantum computation model in [42] suggests a promising mechanism for achieving blind computing, meaning that a client can remotely carry out any computation on a quantum server, and the server can execute the command without gaining any sensitive information about the client, the input given or the output received. Barz et al. [43] present a conceptual framework for this model that demonstrates its feasibility.
2. Infusing social sciences into privacy techniques and mechanisms: Since privacy is subjective, privacy preservation techniques need to follow the requirements and findings of social and behavioral sciences, as argued by the authors of "Privacy-preserving data publishing: A survey of recent developments" [44] and by the emerging discipline of computational psychophysiology [45].
3. Innovating theoretical privacy frameworks with a practical approach: Presently, the clustering methods used to handle heterogeneous data from various sources suffer from vulnerability, feasibility and flexibility issues when applied in practice. To curb this limitation, we need a set of theories, for example fuzzy logic and game theory, which can aid in dealing with ambiguous data and arguments, rather than contending with just one theory [46].

Table 1 Privacy techniques and challenges

Anonymization through generalization:
• Causes loss of information
• Not able to protect attribute correlations
• Each attribute is generalized separately
• To climb up the hierarchy, each iteration needs to recognize the best generalization

Bucketization:
• Cannot prevent membership disclosure
• Quasi-identifier values must be issued in their original form
• Needs a clear split-up between quasi-identifiers and sensitive attributes

Cryptographic technique:
• Difficult to apply to large databases
• Difficult to scale when more parties are involved
• Non-sensitive data that could be useful for analytics is also encrypted

Data mashup technique:
• Mashing a large scale of data requires a lot of time
• Mashing of data may cause a loss of accuracy

Differential privacy:
• No preservation of data truthfulness at the record level
• High computational complexity

Homomorphic encryption:
• Increased computational overhead
• Not applicable to large data sets

K-anonymity:
• Gives no consideration to the links between sensitive attributes
• Not able to protect against attacks based on background knowledge
• Not applicable to high-dimensional data

Slicing:
• Since there is no clarity regarding attribute disclosure, the random grouping of attributes is not efficient
• Data utility is reduced because of spurious tuples

Top-down specialization approach:
• Loss of privacy
• Inadequate for handling large-scale data sets

Table 2 Comparison of different privacy techniques

Technique | Type of data | Linkage property | Information loss | Privacy preserved
Anonymization through generalization | Micro data | — | — | —
Bucketization | Micro data | — | — | —
Cryptographic technique | Micro data | — | — | —
Data mashup technique | High dimensional | — | — | —
Differential privacy | Micro data | — | — | —
Homomorphic encryption | Micro data | — | — | —
Hybrid approach | High dimensional | — | — | —
K-anonymization | Micro data | — | — | —
Proxy re-encryption | Micro data | — | — | —
Slicing technique | High dimensional | — | — | —
Top-down specialization technique | Micro data | — | — | —


6 Conclusion and Future Work

Big data alludes to complex and huge data sets, and big data mining is the procedure of discovering unknown patterns from big data. With the rising and rapidly growing volume of data, things are changing in the business environment. Big data is attaining an ultimate edge in data research and in various business applications, and organizations are currently using big data analysis to measure future trends so that huge value can be derived from it. Big data mining is an emerging research area, and limited work has been done on it so far. According to the authors, a lot of work remains to overcome challenges such as heterogeneity, infrastructure faults, scalability, timeliness and privacy. The privacy challenges of big data mining are especially pointed out: since organizations could not protect large volumes of data from different attacks, the extreme volume, variety and velocity of data created problems for most of them, and privacy often had to be compromised in order to carry out big data mining operations. Sensitive information held by business organizations about their clients is considered a big asset. Various techniques for protecting data against unauthorized access were discussed in the literature, but they have limitations. The authors therefore believe that there is a need to develop more techniques and mechanisms that help preserve privacy during the data analysis process, because if the privacy of an individual is violated, it may have disastrous consequences on someone's life. In the future, these findings will be utilized to create more effective privacy preservation algorithms that achieve more utility and less information loss.

References

1. U. Ahsan, A.A. Bais, Review on big data analysis and Internet of things, in 2016 IEEE 13th International Conference on Mobile Ad Hoc and Sensor Systems (IEEE, 2016), pp. 325–330. https://doi.org/10.1109/mass.2016.38
2. N. Chakraborty, S. Gonnade, Big data and big data mining: study of approaches, issues and future scope. Int. J. Eng. Trends Technol. 18, 221–223 (2014)
3. I. Kalbandi, J.A. Anuradha, Brief introduction on big data 5Vs characteristics and Hadoop technology, in Procedia Computer Science 48: International Conference on Computer, Communication and Convergence (ICCC 2015), vol. 48 (2015), pp. 319–324
4. P.G. Sawant, B.L. Desai, Big data mining: challenges and opportunities to forecast future scenario. Int. J. Innov. Technol. Explor. Eng. 3, 5228–5232 (2015)
5. S. Madan, K. Bhardwaj, S. Gupta, A literature analysis on big data paving its way to E-business, in Emerging Trends in Information Technology (2019). ISBN 978-93-89165-99-9
6. I.A.T. Hashem et al., The rise of 'big data' on cloud computing: review and open research issues. Inf. Syst. 47, 98–115 (2015)
7. S. Sagiroglu, D. Sinanc, Big data: a review, in 2013 International Conference on Collaboration Technologies and Systems (CTS) (IEEE, 2013), pp. 42–47
8. H. Dev, T. Sen, M. Basak, M.E. Ali, An approach to protect the privacy of cloud data from data mining based attacks, in 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (IEEE, 2013), pp. 1106–1115. https://doi.org/10.1109/sc.companion.2012.133
9. D.K. Singh, V. Swaroop, Data security and privacy in data mining: research issues & preparation. Int. J. Comput. Trends Technol. 4, 194–200 (2013)
10. A. Sameer, Big data and data mining a study of (characteristics, factory work, security threats and solution for big data, data mining architecture, challenges & solutions with big data), in Advancing Web Paging Techniques (2016), pp. I–XXVI. https://doi.org/10.13140/rg.2.1.3238.9525
11. M.R. Choopa, Data mining and security in big data. Int. J. Adv. Res. Comput. Eng. Technol. 4, 1065–1069 (2015)
12. S.M. Nargundi, R. Phalnikar, Data DE-identification tool for privacy preserving data mining. Int. J. Comput. Sci. Eng. Inf. Technol. Res. 3, 267–276 (2013)
13. L. Hbibi, H. Barka, Big data: framework and issues, in 2016 International Conference on Electrical and Information Technologies (ICEIT) (IEEE, 2016), p. 6
14. S. Srijayanthi, R. Sethukkarasi, A comprehensive survey on privacy preserving big data mining. Int. J. Comput. Appl. Technol. Res. 6, 79–86 (2017)
15. M. Kaushik, A. Jain, Challenges to big data security and privacy. Int. J. Comput. Sci. Inf. Technol. 5, 3042–3043 (2014)
16. M. Smith, C. Szongott, B. Henne, G. Von Voigt, Big data privacy issues in public social media, in 2012 6th IEEE International Conference on Digital Ecosystems and Technologies (DEST) (IEEE, 2013), p. 6
17. G. Geethakumari, A. Srivatsava, Big data analysis for implementation of enterprise data security. Int. J. Comput. Sci. Inf. Technol. Secur. 2, 742–746 (2012)
18. K.U. Jaseena, J.M. David, Issues, challenges, and solutions: big data mining, in Sixth International Conference on Networks & Communications (2014), pp. 131–140
19. S. Kim, I. Lee, Data block management scheme based on secret sharing for HDFS, in 10th International Conference on Broadband and Wireless Computing, Communication and Applications Data (IEEE, 2015), pp. 51–56. https://doi.org/10.1109/bwcca.2015.70
20. S. Kaisler, F. Armour, J.A. Espinosa, W. Money, Big data: issues and challenges moving forward, in 2013 46th Hawaii International Conference on System Sciences (IEEE, 2013), pp. 995–1004. https://doi.org/10.1109/hicss.2013.645
21. L. Xu, C. Jiang, J. Wang, J. Yuan, Y. Ren, Information security in big data: privacy and data mining. IEEE Access 2, 1149–1176 (2014)
22. A. Katal, M. Wazid, R.H. Goudar, Big data: issues, challenges, tools and good practices, in 2013 Sixth International Conference on Contemporary Computing (IC3) (IEEE, 2013), pp. 404–409
23. N.I. Hussain, B. Choudhury, S. Rakshit, A novel method for preserving privacy in big-data mining. Int. J. Comput. Appl. 103, 21–25 (2014)
24. S. Vennila, J. Priyadarshini, Scalable privacy preservation in big data a survey, in 2nd International Symposium on Big Data and Cloud Computing (ISBCC'15), ed. by V. Vijayakumar, V. Neelanarayanan. Procedia Comput. Sci. 50, 369–373 (Elsevier B.V) (2015)
25. X. Wu, X. Zhu, et al., Data mining with big data. IEEE Trans. Knowl. Data Eng. 26, 97–107 (2014)
26. S. Salini, S.V. Kumar, R. Neevan, Survey on data privacy in big data with K-anonymity. Int. J. Innov. Res. Comput. Commun. 3, 3765–3771 (2015)
27. S. Madan, P. Goswami, A novel technique for privacy preservation using K-anonymization and nature inspired optimization algorithms, in International Conference on Sustainable Computing in Science, Technology & Management (2019)
28. M. Balusamy, Data anonymization through generalization using map reduce on cloud, in 2014 IEEE International Conference on Computer Communication and Systems (ICCCS '14) (IEEE, 2014), pp. 39–42
29. A. Gosain, N. Chugh, Privacy preservation in big data. Int. J. Comput. Appl. 100, 44–47 (2014)
30. M. Sangeetha, P. Anishprabu, S. Shanmathi, Homomorphic encryption schema for privacy preserving mining of association rules. Int. J. Innov. Res. Sci. Eng.
31. B.C.M. Fung, K. Wang, P.S. Yu, Top-down specialization for information and privacy preservation, in ICDE '05: Proceedings of the 21st International Conference on Data Engineering (IEEE, 2005), pp. 205–216
32. I. Sridhar, P. Jacob, Secure two party high dimensional private data using data mash up. Int. J. Comput. Sci. Inf. Technol. 5, 644–645 (2014)
33. K. Liang, W. Susilo, J.K. Liu, Privacy-preserving ciphertext multi-sharing control for big data storage. IEEE Trans. Inf. Forensics Secur. 10, 1–11 (2015)
34. P.C. Kaur, T. Ghorpade, V. Mane, Analysis of data security by using anonymization techniques, in 2016 6th International Conference—Cloud System and Big Data Engineering (Confluence) (IEEE, 2016), pp. 287–293
35. K. Rodiya, P. Gill, A review on anonymization techniques for privacy preserving data publishing. Int. J. Eng. Res. Technol. 4, 228–231 (2015)
36. S. Lohiya, L. Ragha, Privacy preserving in data mining using hybrid approach, in 2012 Fourth International Conference on Computational Intelligence and Communication Networks (IEEE, 2012), pp. 743–746. https://doi.org/10.1109/cicn.2012.166
37. S. Madan, P. Goswami, K-DDD measure and MapReduce based anonymity model for secured privacy preserving big data publishing. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 27(2), 177–199 (2019)
38. S. Madan, P. Goswami, Nature inspired computational intelligence implementation for privacy preservation in MapReduce framework. IJIIDS 13(2/3/4), 191–207 (2020). https://doi.org/10.1504/IJIIDS.2020.109455
39. S. Madan, P. Goswami, A privacy preservation model for big data in map-reduced framework based on k-anonymization and swarm-based algorithms. IJIEI 8(1), 38–53 (2020). https://doi.org/10.1504/IJIEI.2020.105433
40. M. Du, K. Wang, Y. Chen, X. Wang, Y. Sun, Big data privacy preserving in multi-access edge computing for heterogeneous Internet of things. IEEE Commun. Mag. 56, 62–67 (2018)
41. A. Al-Shomrani, F. Eassa, K. Jambi, Big data security and privacy challenges. Int. J. Eng. Dev. Res. (IJEDR) 6(1), 894–900 (2018). ISSN: 2321-9939. http://www.ijedr.org/papers/IJEDR1801155.pdf
42. D. Gottesman, I.L. Chuang, Demonstrating the viability of universal quantum computation using teleportation and single-qubit operations. Nature 402(6760), 390–393 (1999)
43. S. Barz, E. Kashefi, A. Broadbent, J.F. Fitzsimons, A. Zeilinger, P. Walther, Demonstration of blind quantum computing. Science 335(6066), 303–308 (2012)
44. B.C.M. Fung, K. Wang, R. Chen, P.S. Yu, Privacy-preserving data publishing: a survey of recent developments. ACM Comput. Surv. 42(4) (2010). Art. No. 14
45. Advances in Computational Psychophysiology. Accessed 17 May 2016. [Online]. http://www.sciencemag.org/custompublishing/collections/advances-computational-psychophysiology
46. S. Yu, Big privacy: challenges and opportunities of privacy study in the age of big data. IEEE Access 4, 2751–2763 (2016). https://doi.org/10.1109/ACCESS.2016.2577036

Performance Improvement of Vector Control Permanent Magnet Synchronous Motor Drive Using Genetic Algorithm-Based PI Controller Design

Rajesh Kumar Mahto, Ambarisha Mishra, and Bharti Kumari

Abstract The proposed work presents improved performance of a vector control PMSM drive using PI controllers whose parameters are tuned with a genetic algorithm-based optimization technique. How well a closed-loop drive system performs under variable speed and load depends on the controllers' performance, which in turn depends on their gain parameters; the genetic algorithm-based optimization technique therefore generates a unique value of each controller gain parameter for sudden changes in speed and load. These properties of the controllers improve the working performance of the drive system. The proposed work was developed in the MATLAB/Simulink environment for variable speed and load, and the output responses show that it greatly improves the settling time, the THD of the stator current, the supply voltage profile and the performance of the overall drive system.

Keywords Controller design · Optimization technique · Vector control

1 Introduction

The genetic algorithm is a search-heuristic-based optimization technique inspired by Charles Darwin's theory of natural evolution. There are several optimization techniques, used in various fields of engineering to improve the performance of a system, for example in attribute reduction, classifier design and feature extraction [1–4]. These optimization techniques are also used in electric drives to improve controller performance and tuning approaches.

R. K. Mahto (B) · A. Mishra
National Institute of Technology Patna, Patna, India
A. Mishra e-mail: [email protected]
B. Kumari
Nalanda College of Engineering, Chandi, Nalanda, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_24

In this work, a PMSM is used in a vector


control drive system, and a genetic algorithm-based optimization technique is used to tune the controller parameters [5, 6]. The rotor of a permanent magnet synchronous motor carries permanent magnets that produce the main flux; hence, there is no need for a DC field source as in a conventional synchronous motor. The permanent magnet synchronous machine operates at high efficiency because the permanent magnets produce a high flux density, which also reduces the size of the motor. These features make the machine very attractive; hence, it is used in aircraft, electric vehicles and industry [7, 8]. These applications demand variable load torque and speed, which can be fulfilled by closed-loop vector control. In vector control, the flux-producing and torque-producing components of the current are controlled independently. In a vector control drive system, three PI controllers are used, and each controller is generally tuned with Ziegler–Nichols or trial-and-error methods [5, 9]. As the error ranges are not fixed in a variable speed drive system, these are not suitable methods for such a system. To avoid this complexity, this paper presents a self-tuned PI controller-based vector control PMSM drive using a genetic algorithm; the proposed algorithm generates a unique value of the controller parameters (KP and KI) for changes in speed and load, so transient and steady-state errors are compensated. The proposed tuning method for the controller gains improves the drive performance in terms of THD of the stator current, torque ripple in the generated air-gap electromagnetic torque and the output voltage profile of the inverter. The rest of this paper is organized into four parts: Part 2 presents the inverter model used in the drive system; Part 3 presents the mathematical model of the PMSM, the block diagram of vector control and the flowchart of the GA optimization tool; Part 4 presents the results obtained using the proposed technique; Part 5 presents the conclusions.

2 Inverter Model

2.1 Two-Level Voltage Source Inverter

AC machines are used under variable speed and load, and to fulfill this demand a voltage source inverter (VSI) is used, since the VSI can produce variable output voltage and frequency. Figure 1 shows the circuit diagram of a two-level three-phase voltage source inverter with a star-connected inductive load. The circuit consists of six semiconductor devices (IGBTs). The gate pulses that turn on the IGBTs are generated using the SPWM technique, with a base frequency of 50 Hz and a carrier frequency of 1.2 kHz. The inverter supply voltage is taken as 100 V.
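The SPWM comparison described above — a 50 Hz sinusoidal reference against a 1.2 kHz triangular carrier — can be sketched numerically. This is a Python stand-in for the Simulink PWM blocks; the modulation index and sampling rate are illustrative assumptions.

```python
import math

def triangle(t, f_c):
    """Unit-amplitude triangular carrier at frequency f_c."""
    x = (t * f_c) % 1.0
    return 4 * x - 1 if x < 0.5 else 3 - 4 * x

def spwm_gate(t, f_ref=50.0, f_c=1200.0, m=0.8):
    """Gate signal for one inverter leg: high while the sinusoidal
    reference exceeds the triangular carrier."""
    return 1 if m * math.sin(2 * math.pi * f_ref * t) > triangle(t, f_c) else 0

# Sample one fundamental cycle (0.02 s) at 120 kHz; the average gate
# value approximates 0.5 because the reference is symmetric about zero.
samples = [spwm_gate(n / 120000) for n in range(2400)]
print(sum(samples) / len(samples))
```

Raising the modulation index m widens the pulses near the sine peaks, which is how the inverter's fundamental output voltage is controlled.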

Fig. 1 Three-phase two-level VSI connected with a star-connected load

3 PMSM Model

The mathematical model of the PMSM is developed in the rotor reference frame, assuming that flux saturation, eddy-current and hysteresis losses and field-current dynamics are negligible and that the induced EMF is sinusoidal [10]. The voltage equations are given by

Vq = Rs iq + ωr λd + ρ λq  (1)
Vd = Rs id − ωr λq + ρ λd  (2)

with the flux linkages

λq = Lq iq  (3)
λd = Ld id + λf  (4)

Substituting (3) and (4) into (1) and (2) gives

Vq = Rs iq + ωr (Ld id + λf) + ρ Lq iq  (5)
Vd = Rs id − ωr Lq iq + ρ (Ld id + λf)  (6)

The developed electromagnetic torque is given by

Te = (3/2)(p/2)(λd iq − λq id)  (7)

and the mechanical torque equation is given by

Tm = TL + B ωm + J (dωm/dt)  (8)

where ωr and ωm represent the rotor electrical and mechanical speeds, respectively.

Fig. 2 PMSM vector control drive
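Equations (7) and (8) can be checked numerically. The sketch below evaluates the torque expression and integrates the mechanical equation with forward Euler for a fixed q-axis current; all parameter values are illustrative assumptions, not the paper's motor data.

```python
# Illustrative PMSM parameters (not the motor data used in the paper).
p = 4                 # number of poles
lam_f = 0.175         # permanent-magnet flux linkage (Wb)
Ld = Lq = 8.5e-3      # d/q inductances (H); surface-mounted PMSM, Ld = Lq
J, B, TL = 0.003, 1e-4, 2.0   # inertia (kg m^2), friction, load torque (Nm)

def torque(i_q, i_d=0.0):
    """Eq. (7): Te = (3/2)(p/2)(lam_d*i_q - lam_q*i_d)."""
    lam_d = Ld * i_d + lam_f
    lam_q = Lq * i_q
    return 1.5 * (p / 2) * (lam_d * i_q - lam_q * i_d)

# Forward-Euler integration of eq. (8): J*dwm/dt = Te - TL - B*wm.
dt, wm, i_q = 1e-4, 0.0, 10.0
for _ in range(int(0.1 / dt)):        # simulate 0.1 s at fixed i_q
    wm += (torque(i_q) - TL - B * wm) / J * dt

print(torque(i_q), wm)   # Te = 5.25 Nm; speed ramps at roughly (Te - TL)/J
```

Without a speed loop the speed simply ramps, which is why the closed-loop controllers of Sect. 3.1 are needed to track a speed reference.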

3.1 Vector Control Drive System

The closed-loop vector control drive consists of a power module, the motor, sensors and a control unit, as shown in Fig. 2. The rotor position sensor senses the dynamics of the rotor and sends the signal to the control unit. The control unit compares this signal with the reference signal and sends the error to the controllers. Three PI controllers are used, as speed, flux and torque controllers: the proportional term acts in proportion to the error, and the integral term compensates the steady-state error. A voltage source inverter is used in the power module, since the VSI can generate variable voltage and frequency on demand.
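The speed, flux and torque controllers are standard PI blocks. A minimal discrete-time PI with output clamping can be sketched as follows; the anti-windup detail and all numeric values are assumptions for illustration, not taken from the paper.

```python
class PIController:
    """Discrete PI controller: u = Kp*e + Ki * integral(e dt),
    with the output clamped to actuator limits."""
    def __init__(self, kp, ki, dt, u_min, u_max):
        self.kp, self.ki, self.dt = kp, ki, dt
        self.u_min, self.u_max = u_min, u_max
        self.integral = 0.0

    def update(self, reference, measured):
        error = reference - measured
        self.integral += error * self.dt
        u = self.kp * error + self.ki * self.integral
        # Clamp and stop integrating while saturated (simple anti-windup).
        if u > self.u_max:
            self.integral -= error * self.dt
            u = self.u_max
        elif u < self.u_min:
            self.integral -= error * self.dt
            u = self.u_min
        return u

# Example: speed loop asked to track 157 rad/s while measuring 150 rad/s.
speed_ctrl = PIController(kp=0.5, ki=20.0, dt=1e-4, u_min=-10.0, u_max=10.0)
print(speed_ctrl.update(reference=157.0, measured=150.0))
```

In the drive, the speed controller's output becomes the q-axis current reference, and the current controllers' outputs become the voltage references fed to the SPWM stage.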

3.2 Tuning Method of the PI Controller Gains Using GAs

Three PI controllers are used in the vector control PMSM drive, and each PI controller has two tuning parameters (KP, KI); hence, six variables are assigned. The algorithm is written for these six variables, and the maximum and minimum limits are set according to the error range. A unique value of the controller parameters is generated for sudden changes in load and speed. The flowchart of the genetic algorithm used in this paper is given below.


Begin
1. Define the speed, torque and flux errors as the objective function to be minimized.
2. Initialize the population.
3. Repeat until the termination criterion is reached:
   • Calculate the fitness function
   • Crossover
   • Mutation
   • Survivor selection
4. Terminate at the optimum point or return the best solution.
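The flowchart above can be sketched as a compact GA tuning a single PI loop. The first-order plant, the ITAE cost and every numeric setting here are illustrative assumptions, not the drive model or GA settings of the paper.

```python
import random

def itae(gains, t_end=2.0, dt=1e-3):
    """Integral of time-weighted absolute error for a unit step, with a
    PI controller driving an assumed first-order plant dy/dt = (-y+u)/tau."""
    kp, ki = gains
    tau, y, integ, cost = 0.1, 0.0, 0.0, 0.0
    for n in range(int(t_end / dt)):
        e = 1.0 - y
        integ += e * dt
        u = kp * e + ki * integ
        y += (-y + u) / tau * dt
        cost += (n * dt) * abs(e) * dt
    return cost

def genetic_tune(pop_size=20, generations=20, lo=0.0, hi=20.0):
    random.seed(1)                      # deterministic run for the example
    pop = [(random.uniform(lo, hi), random.uniform(lo, hi))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=itae)              # fitness evaluation
        parents = pop[:pop_size // 2]   # survivor selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            w = random.random()         # arithmetic crossover
            child = [w * a[i] + (1 - w) * b[i] for i in range(2)]
            if random.random() < 0.2:   # mutation, clamped to the bounds
                j = random.randrange(2)
                child[j] = min(hi, max(lo, child[j] + random.gauss(0, 1.0)))
            children.append(tuple(child))
        pop = parents + children
    return min(pop, key=itae)

kp, ki = genetic_tune()
print(kp, ki)   # one (Kp, Ki) pair minimizing the ITAE cost
```

The drive in the paper runs three such loops at once, so its chromosome has six genes instead of the two shown here.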

4 Result and Analysis

To demonstrate the working performance of the proposed genetic algorithm-based PI controller tuning, it was investigated on the MATLAB/Simulink platform for a variable speed drive system.

4.1 Simulation Response of the PMSM Drive for a Change in Speed with Constant Load, Without GAs

In vector control in the rotor reference frame, d-axis and q-axis currents are present. The d-axis current is the field-producing component, and the q-axis current is the torque-producing component. The maximum torque per ampere (MTPA) control strategy is used; hence, the q-axis current varies according to the load, and since the load is constant in this case, the q-axis current is also constant (Fig. 3). The simulation response of the stator phase currents at constant load with a change in speed is shown in Fig. 4; it can be observed from the waveform that the current is constant, as the load is constant. Figure 5 shows the torque and speed response of the PMSM drive. The motor starts at t = 0 with a load torque of 2 Nm and a set speed of 157 rad/s; at t = 2 s the speed is changed to 170 rad/s, and at t = 3 s it is changed back from 170 to 157 rad/s, as shown in Fig. 5.

Fig. 3 Simulation response of d-axis current (Id) and q-axis current (Iq) versus time

Fig. 4 Stator currents (Ia, Ib, Ic) of PMSM drive without GAs

Fig. 5 Response of speed (set and actual rotor speed) and torque (Te) without GAs

4.2 Simulation Response of Vector Control PMSM Drive with GAs The vector control of the PMSM drive was developed in MATLAB/Simulink. The PI controller gains are tuned using the genetic algorithm to obtain optimized values of (K P ) and (K I ) that minimize the speed, torque and flux errors. To find the optimized gains of all three controllers, six variables are assigned in the editor window. The simulation response is taken at a constant speed of 150 rad/s and variable torque, starting with a ramp load from t = 0.2 s to t = 1.7 s, after which the load becomes constant. This can be observed in Fig. 6.


Fig. 6 Speed (set and actual rotor speed) and torque (load torque TL, electromagnetic torque Te) responses of PMSM drive with GAs

Fig. 7 d-axis (Id) and q-axis (Iq) current response with GAs

The current responses of the d-axis and q-axis for the GA-based closed-loop PMSM drive are shown in Fig. 7. It is observed that there is less current ripple in both currents compared with the PMSM drive without GAs. In a closed-loop vector-controlled drive system, the output performance of the motor depends on the input voltage and current, which are supplied by the inverter. The output voltage of the inverter depends on the gate signals, and to generate the gate signals, reference and carrier signals are modulated. In closed-loop vector control, the reference signals are generated by the controllers; hence, the generated reference signals depend mainly on the controllers. The magnitude of the stator currents is proportional to the load; hence, for the ramp-type load increasing from t = 0.2 s to 1.7 s, the supplied stator current also increases, as can be clearly observed in Fig. 8. A zoomed view of the phase-A current at t = 1.5 s is also shown (Table 1). The current THD and the settling times of the current and speed responses were calculated in MATLAB using the FFT tool and the stepinfo command, respectively.
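The THD figure reported here is the ratio of the RMS of the harmonic components to that of the fundamental. Outside MATLAB, the same quantity can be approximated from an FFT as in this Python sketch (the test waveform and its 5% fifth harmonic are illustrative):

```python
import numpy as np

def thd_percent(signal, fs, fundamental_hz):
    """Total harmonic distortion of a sampled waveform, in percent.

    Computed from the FFT magnitude spectrum as the RMS of the harmonic
    components divided by the RMS of the fundamental (the same quantity
    reported by the MATLAB FFT analysis tool).
    """
    n = len(signal)
    spectrum = np.abs(np.fft.rfft(signal)) / n
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    bin_width = freqs[1] - freqs[0]
    k1 = int(round(fundamental_hz / bin_width))      # fundamental bin
    fund = spectrum[k1]
    harmonics = [spectrum[k1 * h] for h in range(2, 10) if k1 * h < len(spectrum)]
    return 100.0 * np.sqrt(sum(a ** 2 for a in harmonics)) / fund

# Example: 50 Hz fundamental with a 5% fifth harmonic, sampled for 1 s
fs = 10_000
t = np.arange(0, 1.0, 1.0 / fs)
wave = np.sin(2 * np.pi * 50 * t) + 0.05 * np.sin(2 * np.pi * 250 * t)
print(round(thd_percent(wave, fs, 50), 2))  # 5.0
```

The one-second window holds an integer number of fundamental periods, so the fundamental and harmonics fall exactly on FFT bins and no windowing is needed in this toy case.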


Fig. 8 Three-phase stator current response (Ia, Ib, Ic), with zoomed view of phase A around t = 1.5 s

Table 1 Comparative analysis of output response

Parameter                            Without GAs   With GAs
THD content in stator current (%)    5.85          1.33
Torque ripple content (%)            18.21         12.145
Settling time (Ts) (s)               1.2           0.6

5 Conclusions This paper presents a vector control strategy for the permanent magnet synchronous motor drive. The controllers used in the drive system are designed with a genetic algorithm-based optimization technique. The GA gives optimized gain values (K P , K I ) for all three controllers. Hence, if there are changes in speed and load, the GA produces new controller gain values so that the errors are compensated quickly. This property of the controllers improves the PMSM drive performance in terms of the generated voltage and current profiles, speed tracking, and torque pulsation. The simulation response of the genetic algorithm-based drive shows the effectiveness of the proposed technique over conventional methods without increasing the complexity or cost of the drive system.

References
1. M. Alweshah, O.A. Alzubi, J.A. Alzubi, S.A. Mohammed, Solving attribute reduction problem using wrapper genetic programming. Int. J. Comput. Sci. Netw. Secur. (2016)
2. O.A. Alzubi, J.A. Alzubi, S. Tedmori, H. Rashaideh, O. Almomani, Consensus-based combining method for classifier ensembles. Int. Arab J. Inf. Technol. (2018)
3. O.A. Alzubi, J.A. Alzubi, M. Alweshah, I. Qiqieh, S. Al-Shami, M. Ramachandran, An optimal pruning algorithm of classifier ensembles: dynamic programming approach. Neural Comput. Appl. (2020). https://doi.org/10.1007/s00521-020-04761-6
4. R.K. Mahto, A. Mishra, Vector control of permanent magnet synchronous machine with reduced switch five-level voltage source inverter, in IEEE 21st Electronics Packaging Technology Conference (EPTC), Singapore (2019), pp. 751–756
5. A. Ahmed, Y. Sozer, M. Hamdan, Maximum torque per ampere control for buried magnet PMSM based on DC-link power measurement. IEEE Trans. Power Electron. 32(2), 1299–1311 (2017)
6. A. Mishra, V. Mahajan, P. Agarwal, S.P. Srivastava, Fuzzy logic based speed and current control of vector controlled PMSM drive, in 2012 2nd International Conference on Power, Control and Embedded Systems, Dec 2012, pp. 1–6
7. R.K. Mahto, A. Mishra, Self-tuning vector controlled PMSM drive using particle swarm optimization, in 2020 IEEE First International Conference on Smart Technologies for Power, Energy and Control (2020)
8. R.K. Mahto, A. Mishra, R.C. Bansal, A reduced switch five-level VSI for high-performance vector controlled PMSM drive. Electr. Power Compon. Syst. 1–12 (2020)
9. A.H. Abosh, Z.Q. Zhu, Y. Ren, Reduction of torque and flux ripples in space vector modulation-based direct torque control of asymmetric permanent magnet synchronous machine. IEEE Trans. Power Electron. 32(4), 2976–2986 (2017)
10. A. Mishra, J.A. Makwana, P. Agarwal, S.P. Srivastava, Modeling and implementation of vector control for PM synchronous motor drive, in IEEE-International Conference on Advances in Engineering, Science and Management (ICAESM-2012) (2012), pp. 582–585

Monitoring and Protection of Induction Motors Against Abnormal Industrial Conditions Using PLC Aaryan Sharma, Poras Khetarpal, Neelu Nagpal, and Ruchi Sharma

Abstract Common abnormal operating conditions for induction motors include overvoltage, over-temperature, disturbed RPM, and over-current. This paper proposes an efficient, low-cost, and low-maintenance protection method for induction motors (single-phase and three-phase) against these problems. The hardware system utilizes modern control devices and sensors common to industrial processes. A programmable logic controller (PLC), which is highly efficient in handling multiple processes simultaneously, is used in the system. The PLC monitors these important parameters continuously and displays any threat or abnormality on the LCD screen. It detects abnormalities and shuts down the motor if no specific action is taken to troubleshoot the system within a specified period of time. Further, the motor can be restarted once the fault is mitigated. This system has been tested on different induction motors with different specifications, commonly used in the process industry, and has proven to be effective. Keywords Fault detection · Protection system · Induction motor monitoring · Programmable logic controller (PLC)

1 Introduction Induction motors account for about 90% of the motors used in industry today. Almost every industrial process is in some way related to induction motors, and all this has been made possible because of their simple design, robustness, and resistance to rugged usage. This efficient motor has a wide spectrum of industrial uses and applications [1]. Yet these efficient and widely used motors still lack a standard fault detection and protection system. Induction motors used in industry often encounter abnormal conditions such as overvoltage, over-current,

A. Sharma · P. Khetarpal (B) · R. Sharma
Bharati Vidyapeeth's College of Engineering, New Delhi, India
N. Nagpal
Maharaja Agarsen Institute of Technology, New Delhi, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_25


disturbed RPM, and excess heating caused by long runs. These conditions must be controlled within a defined period, or else these abnormalities could burn down the motor or cause substantial harm to the motor unit [2]. Early protection methods for induction motors used microcontrollers, mechanical relays, etc., with shortcomings such as time delays and processing problems which may lead to damage of costly induction motors. Thus, for the protection of costly and heavy industrial motors, modern control units and methods play a vital role [3, 4]. Several techniques have been suggested to analyze the variety of mechanical and electrical faults in induction motors. Continuous analysis of the supply current during operation can help detect any current-related problem in induction motors and is applicable to both single-phase and three-phase motors. In this technique, the current is monitored using a current sensor, and the measured values are then compared with given standard values, taking into account the range of operation for different loads. The 'WCS1700' is a commercially available sensor that provides actual current readings in a standard and easy way. This method has been adopted to detect stator or rotor faults of the motor [5]. Balancing the three-phase supply of a three-phase motor is very necessary, as a disturbance in any single phase can lead to unbalanced voltage and current spikes, ultimately damaging the motor. In the present work, a PLC is used for fault detection. The PLC handles all the analysis and protection tasks efficiently [6, 7]. It is an advanced controller with far better reliability and controllability than conventional control methods. Major parameters of the induction motor such as current, voltage, RPM, and temperature are monitored in this protection system. In the proposed work, these parameters are continuously monitored and analyzed in real time.
The proposed algorithm is developed such that if the motor is operating within a defined safe range, the motor is allowed to operate; upon detecting any abnormality, the controller immediately shuts down the motor by tripping the contactor and isolating the power supply [8]. The unique approach of this work is that it uses advanced and efficient components like the PLC and solid-state relays for proper operation, eliminating the limitations of time delay and performance restrictions. The rest of the paper is organized as follows: Sect. 2 gives an introduction to the PLC as a system controller and the proposed hardware. Section 3 presents the proposed methodology. Section 4 presents the detailed results obtained from the experiment. In Sect. 5, the conclusion is presented.

2 PLC as a System Controller PLCs are flexible, solid-state controllers. They are computing devices used in industries for automation. A PLC accepts input in the form of digital or analog data (depending upon the model), processes the information, and acts on the basis of the algorithm programmed in it. Output from the PLC is fed to slave systems for further execution [9, 10]. This controller comprises an input module, a processor (central processing unit), and an output module. The input and output modules can be digital or analogue


Fig. 1 Block diagram of PLC as control unit

depending upon the processor capacity and brand model. High-end PLCs can handle both digital and analogue modules simultaneously. Inputs to the PLC are given via different sensors, which give output in electrical form. When input is given to the controller, it converts the input into logic signals readable by the processor so that the processor can further utilize it [11]. Figure 1 shows the PLC as the control unit of the motor drive system. Once the controller decides the output action, it can be a low state or a high state depending upon the algorithm. In this paper, analog values of the parameters are considered. All four important parameters, i.e., current, voltage, RPM, and temperature, give analog signals that are recognized by analog PLCs. Each parameter has its dedicated sensor giving real-time output, which is regularly monitored by the PLC for any kind of abnormality. The CPU of the proposed controller continuously checks the conditions and works according to the set algorithm. Thus, the CPU is the most important component of the PLC [12, 13].

2.1 Proposed Hardware Abnormal conditions in induction motors occur when the supply coming to the industry is unregulated. Due to this unregulated supply, problems such as under-voltage or overvoltage occur. Institutes and organizations such as IEEE and IEC have found in studies that if the voltage supplied to industry deviates by 5% from the standard set voltage, it causes high temperatures in induction motors [14]. This unwanted and unplanned temperature surge can damage the physical structure of the motor, including the windings. Thus, an induction motor needs protection not only from overvoltage but also from under-voltage. According to some industries, unbalanced voltage is defined as:

Unbalance Voltage = (Max deviation from average V or I / average V or I) × 100   (1)


where V denotes the voltage and I denotes the current: divide the maximum deviation by the average voltage or current and multiply by 100, where the average is taken over the three-phase supply and not just from a single phase [6].
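Eq. (1) above can be sketched as a small Python helper (the function name and example phase values are illustrative):

```python
def unbalance_percent(phase_values):
    """Percentage unbalance per Eq. (1): the maximum deviation from the
    three-phase average, divided by the average, times 100.
    Works for voltages or currents alike."""
    avg = sum(phase_values) / len(phase_values)
    max_dev = max(abs(v - avg) for v in phase_values)
    return 100.0 * max_dev / avg

# Example: phases measured at 230 V, 232 V and 225 V
print(round(unbalance_percent([230, 232, 225]), 2))  # 1.75
```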

2.2 Overvoltage Circuit The system hardware comprises an induction motor, analogue PLC, potential transformer (PT), current sensor, proximity sensor, heat sensor, contactor, IC LM35, solid-state relays, etc. Conditions such as overvoltage or over-current are determined by comparing specified values with real-time values. PLCs have different operating voltages such as 12 VDC, 24 VDC, and 230 VAC, with some specially customized according to needs [15]. The PLC used in this paper operates at 24 VDC. The supply voltage used by the motor is connected to a step-down potential transformer which steps it down to 24 V, and the output of these step-down PTs is connected to a bridge rectifier. In the case of a three-phase supply, each phase is connected to a PT, and the output of each PT is connected to a bridge rectifier. These rectified outputs are fed to the input module of the PLC for voltage detection and analysis [16]. The main purpose of the system is to detect a voltage surge in the motor and instantly trip it, for which three bridge rectifiers are used. The inputs from the PTs are constantly monitored in real time by the PLC. Solid-state relays are connected to the PLC output module, and the relays in turn control a magnetic contactor [17]. The magnetic contactor used for single-phase and three-phase is the same, so there is no change in the configuration. The solid-state relays are fully capable of controlling the magnetic contactor and the input supply. This control and feedback cycle forms a complete loop. When all the parameters are stable and constitute a good supply for the motor, the PLC constantly gives a low output which does not trigger any action on the power supply and keeps the motor running. As soon as the algorithm detects any abnormality in any of the four parameters, the PLC gives a high output activating the relay, and the relay switches the contactor to the open state, disconnecting the power supply from the motor.
This quick action saves the induction motor from harm and prevents heavy financial losses [18] (Fig. 2).

2.3 Under-Voltage Circuit This subsection focuses on the system and architecture developed for protection of the induction motor against under-voltage conditions. The input module of a PLC accepting analogue inputs is used to detect abnormal voltage. The overvoltage and under-voltage structures are very similar, except that in the case of under-voltage the range set is slightly higher than the motor specifications; this is also an analogue value which


Fig. 2 Testing setup

is always monitored in real time to find any red flags. This continuous monitoring also has an advantage over digital inputs in that the analogue inputs give detailed values of the current voltage and are not just bounded between two states (high and low). There are instances where the supply voltage drops below the recommended range but is quickly back in range. This scenario happens quite often, and tripping the motor every time would be disastrous. Thus, when the supply voltage is safe, the contactor keeps its contacts made and allows the current to pass. When a danger occurs and is not fixed within a given time span, the system trips the contactor, saving the motor from any damage [19].

2.4 Over-Temperature Circuit Induction motors are also used in industrial work for long hours, lifting heavy loads. They also face unbalanced voltages, which add to the rising temperature of the motor windings. Over-temperature or overheating is the rise in temperature of the motor windings and ultimately the overheating of the motor body or the frame in which the motor is fitted. The temperature rise in an induction motor is calculated by the formula [15]:

%temperature rise = 2 × (% unbalanced voltage)²   (2)


Excess heating for long hours or temperature spikes can damage the insulation of the motor. This damage to the insulation can damage the windings and ultimately burn out the motor. Some standards, such as IEEE 841 (petroleum and chemical industry), state that the temperature rise at rated load must not exceed 45 °C. The winding temperature increases by about 25% when unbalanced currents arise from a voltage disturbance of 3.5% per phase. Early temperature sensing systems used complex circuitry and big expensive sensors which are not efficient enough for their cost and are also very complex to implement, making the protection systems expensive and troublesome. In this work, a simple and easy sensor, the LM35, is used instead; it is a miniature sensor and easily replaceable as it costs very little [20]. The LM35 is a temperature sensor that outputs an analogue signal proportional to the instantaneous temperature. The output voltage can easily be interpreted to obtain a temperature reading in Celsius. Many low-end products take advantage of its low cost and good accuracy and use the LM35. The input voltage to the LM35 can be from +4 to 30 V, and it consumes about 60 µA of current [21]. This sensor is used in our design due to its advantages and cost-effectiveness. It is fixed on the surface of the motor body so as to be in direct contact with the motor. It gives analogue outputs continuously and almost instantly. This analogue output is sent to the input module of the PLC, where it is easily readable [22]. Overheating also sets off the alarm and trips the motor. This is achieved by the high linearity of the LM35 sensor, as its output increases by 10 mV for every 1 °C rise in temperature and vice versa. The PLC continuously monitors these temperature readings and trips the motor if necessary.
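Given the LM35's 10 mV/°C scale factor described above, converting the analogue reading to a temperature and checking the 45 °C trip limit can be sketched as follows (the function and constant names are illustrative, not the PLC's actual logic):

```python
LM35_MV_PER_DEG_C = 10.0   # LM35 output rises 10 mV per 1 degree C
TRIP_TEMP_C = 45.0         # over-temperature trip threshold used in this work

def lm35_to_celsius(millivolts):
    """Convert an LM35 analogue reading (mV) to degrees Celsius."""
    return millivolts / LM35_MV_PER_DEG_C

def over_temperature(millivolts):
    """True when the measured temperature exceeds the trip threshold."""
    return lm35_to_celsius(millivolts) > TRIP_TEMP_C

print(lm35_to_celsius(450))   # 45.0 (a 450 mV reading means 45 degrees C)
print(over_temperature(480))  # True (48 degrees C exceeds the 45 degree limit)
```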

2.5 RPM Measurement Circuit The system uses a PNP-type proximity sensor mounted on the case of the motor to measure its speed continuously. At no-load, the motor is rated at 3000 RPM, and under load the RPM is reduced accordingly. This fall in motor RPM is not linear and varies from load to load and with hours of running, which also has a defined range of operation. The proximity sensor needs an additional tooth or metal object attached to the exterior end of the motor shaft. This gives one pulse per rotation and helps the proximity sensor read pulses per minute. Such sensors sometimes need calibration in use, but the proximity sensor used in this project does not, as it gives a direct output readable by the PLC [23]. This system does not yet have the capacity to trip the motor on RPM, as that is a very complex algorithm to implement, but it will display the readings, store them in memory, or even upload them to a real-time Excel sheet via the Internet.
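With one pulse per shaft rotation, the pulse count over a sampling window converts to RPM as sketched below (a hypothetical helper, not the PLC's actual code):

```python
def rpm_from_pulses(pulse_count, window_seconds):
    """One proximity-sensor pulse per shaft rotation, so RPM is simply the
    pulse count scaled to a one-minute window."""
    return pulse_count * 60.0 / window_seconds

# Example: 250 pulses counted over a 5 s window corresponds to the
# rated no-load speed of 3000 RPM
print(rpm_from_pulses(250, 5))  # 3000.0
```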


2.6 PLC Ladder Algorithm The ladder logic used in this system is quite simple and easy to implement. It has one rung (programming line), which contains NC (normally closed) objects for each parameter. NC means that under normal conditions of voltage, current, and temperature the ladder logic allows the motor to run, but if any of these parameters is disturbed, the corresponding contact opens and, as the objects are placed in series, the circuit breaks, resulting in tripping of the motor. The output of the first rung is a timer set for 2 s; if a condition like a voltage spike occurs and is resolved within these 2 s, the timer resets and the motor keeps running without interruption. If the issue is not resolved within 2 s, the timer goes off and trips the contactor, ultimately saving the motor.

3 Proposed Methodology The system is applicable to all induction motors with just a few changes in application for different variants. The motor used to implement the base design has the specifications shown in Table 1. The complete hardware after integrating all components is shown in Fig. 3. A basic ladder logic controls the system, with the operating ranges already set on the sensor inputs to trip the system. First the PLC is switched on, and then the motor supply is switched on. The PLC has to be turned on first; otherwise, the initial parameters of the motor will not be detected. Figure 3 shows the circuit wiring and system arrangement.

Table 1 Specifications of main testing motor

Parameter             Ratings
Model No.             SW J-1
Type                  Self-priming
Size                  25 × 25 mm
Motor rating          0.35 kW (0.5 HP)
Rated speed           3000 RPM
Rated voltage         220 V
Rated frequency       50 Hz
Max current           4 A
Capacitor rating      20 µF, 440 V
Winding connection    Cap. start and cap. run
Max power input       1.5 kW
Class of insulation   B
No. of phases         Single


Fig. 3 Description of hardware

When the PLC is switched on, it immediately scans the parameters; if they are within permissible limits, the PLC allows the motor to run, otherwise the motor is not allowed to run. Starting the PLC before the motor makes sure that all supplies are in good condition and blocks the supplies in case there is an abnormality [24]. The continuous monitoring and analysis of all parameters by the PLC works as follows:
• If the input voltage is less than 190 V, it triggers an alarm in the PLC and the supplies are blocked.
• In the normal operating range from 190 to 235 V, the PLC allows the motor to run, indicating that the environment is safe for the motor.
• Any voltage surge or continuous spiking trips the motor in order to protect it from damage.
• Over-current also trips the motor. An over-current condition occurs when there is a mechanical fault and the motor is not working in its rated range, i.e., the current rises above 4 A.
• Excess heat is equally damaging and dangerous for the motor; the PLC trips the motor if the temperature rises above 45 °C [25].
Table 2 shows the ranges set for the parameters at which the motor is set to trip.
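The monitoring rules above can be expressed as a simple decision function (a sketch of the logic only; the function name and fault labels are illustrative):

```python
def check_motor(voltage, current, temperature_c):
    """Return the list of faults for the tested single-phase motor, using the
    trip limits stated above (190-235 V window, 4 A max, 45 degrees C max)."""
    faults = []
    if voltage < 190:
        faults.append("under-voltage")
    elif voltage > 235:
        faults.append("over-voltage")
    if current > 4.0:
        faults.append("over-current")
    if temperature_c > 45.0:
        faults.append("over-temperature")
    return faults

print(check_motor(220, 3.2, 40))  # [] -> safe, motor allowed to run
print(check_motor(185, 4.5, 50))  # ['under-voltage', 'over-current', 'over-temperature']
```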

Table 2 Favorable and unfavorable conditions for the motor used

Parameter     Favorable working range   Unfavorable conditions
Voltage       190–235 V                 <190 V or >235 V
Current       ≤4 A                      >4 A
Temperature   ≤45 °C                    >45 °C

Unfortunately, we can never know the actual time of broadcast unless the transaction was issued by a node that we control. The reason is that the transaction has to propagate from the originating node through the peer-to-peer network, and the clocks are not perfectly synchronized. We can, however, approximate latency using the earliest known time of broadcast. Approximating our definition from above, for a transaction t, t_received is the earliest time stamp at which one of our nodes receives transaction t; hence: latency_t ≈ t_included − t_received.

Land Rights Documentation and Verification …


Fig. 7 Execution time, latency, and throughput of Ethereum and Hyperledger Fabric, respectively

5.2 Transaction Throughput Throughput can be defined as the rate at which transactions are committed on a blockchain network. We need to bear in mind that transactions can be marked as invalid in a blockchain network in the event of a verification error, so only legitimate transactions are considered when calculating throughput: throughput = valid transactions committed / total time taken.
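The latency and throughput definitions above reduce to simple arithmetic on per-transaction timestamps; the helper and field names below are assumptions for illustration, not the benchmark harness actually used:

```python
def latency_seconds(t_received, t_included):
    """Approximate latency: block-inclusion time minus the earliest time any
    of our nodes received the transaction."""
    return t_included - t_received

def throughput_tps(valid_committed, total_time_s):
    """Transactions per second, counting only valid committed transactions."""
    return valid_committed / total_time_s

# Example: 480 valid transactions committed over a 60 s run
print(throughput_tps(480, 60))       # 8.0 transactions per second
print(latency_seconds(12.5, 14.75))  # 2.25 s from first receipt to inclusion
```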

5.3 Execution Time In the case of the Ethereum blockchain, transactions are sent to the network in batches. So, within the evaluation process, we have taken into account the execution time required to commit all of the transactions in the batch to the ledger (Fig. 7).

5.4 Result The system simplifies and synthesizes intricate property details into an alphanumeric code, which can be used for fetching the property's title history. The property record cannot be tampered with, and duplicate owners are not possible. New registrations eliminate corruption. It will make property verification faster, leading to faster registrations and an increase in government tax collections.


S. N. Billah et al.

6 Conclusion The main motive of this paper is to discover whether integrating blockchain technology into the business process of land record management is a good solution. To answer this question, a literature review of the prevailing business process was conducted, and concerns relating to the prevailing procedure were identified as a result of this examination. We then discuss why we consider blockchain technology, on the Ethereum platform, to be the best solution. In this paper, we also consider Hyperledger Fabric as a platform for the land record process. The paper's main emphasis is on the blockchain network design and its performance evaluation.

References
1. H. Bhorshetti, S. Ghuge, A. Kulkarni, S. Bhingarkar, Land record maintenance using blockchain (IC-BCT 2019), pp. 205–214 (2020). https://doi.org/10.1007/978-981-15-4542-9_17. Accessed 20 Aug 2020
2. N. Gupta, M.L. Das, S. Nandi, LandLedger: blockchain-powered land property administration system, in 2019 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS), Goa, India, 2019, pp. 1–6. https://doi.org/10.1109/ANTS47819.2019.9118125. Accessed 24 Aug 2020
3. S. Pongnumkul, C. Khonnasee, S. Lertpattanasak, C. Polprasert, Proof-of-concept (PoC) of land mortgaging process in blockchain-based land registration system of Thailand, in Proceedings of the 2nd International Conference on Blockchain Technology, 2020. https://doi.org/10.1145/3390566.3391669. Accessed 26 Aug 2020
4. Z. Zheng, S. Xie, H. Dai, X. Chen, H. Wang, An overview of blockchain technology: architecture, consensus, and future trends, in 2017 IEEE International Congress on Big Data (BigData Congress), Honolulu, HI, 2017, pp. 557–564. https://doi.org/10.1109/BigDataCongress.2017.85
5. V. Thakur, M. Doja, Y. Dwivedi, T. Ahmad, G. Khadanga, Land records on blockchain for implementation of land titling in India. Int. J. Inf. Manage. 52, 101940 (2020). https://doi.org/10.1016/j.ijinfomgt.2019.04.013. Accessed 10 Sept 2020
6. H. Magrahi, N. Omrane, O. Senate, R. Jaziri, NFB: a protocol for notarizing files over the blockchain, in 2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS), 2018. https://doi.org/10.1109/ntms.2018.8328740. Accessed 3 Sept 2020
7. S. Krishnapriya, G. Sarath, Securing land registration using blockchain. Proc. Comput. Sci. 171, 1708–1715 (2020). https://doi.org/10.1016/j.procs.2020.04.183. Accessed 27 Aug 2020
8. M. Shuaib, S. Daud, S. Alam, W. Khan, Blockchain-based framework for secure and reliable land registry system. TELKOMNIKA (Telecommun. Comput. Electron. Control) 18(5), 2560 (2020). https://doi.org/10.12928/telkomnika.v18i5.15787. Accessed 28 Aug 2020
9. M. Nandi, R.K. Bhattacharjee, A. Jha, F.A. Barbhuiya, A secured land registration framework on blockchain, in 2020 Third ISEA Conference on Security and Privacy (ISEA-ISAP), Guwahati, India, 2020, pp. 130–138. https://doi.org/10.1109/ISEA-ISAP49340.2020.235011
10. B. Düdder, O. Ross, Timber tracking: reducing complexity of due diligence by using blockchain technology. SSRN Electron. J. (2017). https://doi.org/10.2139/ssrn.3015219. Accessed 15 Sept 2020


11. D. Vujicic, D. Jagodic, S. Randić, Blockchain technology, bitcoin, and Ethereum: a brief overview, pp. 1–6 (2018). https://doi.org/10.1109/INFOTEH.2018.8345547
12. H. Magrahi, N. Omrane, O. Senate, R. Jaziri, NFB: a protocol for notarizing files over the blockchain, in 2018 9th IFIP International Conference on New Technologies, Mobility and Security (NTMS), 2018. https://doi.org/10.1109/ntms.2018.8328740. Accessed 30 Sept 2020
13. M. Shuaib, S. Daud, S. Alam, W. Khan, Blockchain-based framework for secure and reliable land registry system. TELKOMNIKA (Telecommun. Comput. Electron. Control) 18(5), 2560 (2020). https://doi.org/10.12928/telkomnika.v18i5.15787. Accessed 7 Sept 2020

Implication of Privacy Laws and Importance of ICTs to Government Vision of the Future Ayush Gupta, Prabhat Mittal, Pankaj Kumar Gupta, and Sakshi Bansal

Abstract In the past three decades, every nation has adopted information and communication technologies (ICTs) as a necessary part of running government services efficiently and effectively. ICTs have been incorporated in the internal and external processes of personalized and responsive e-government services. Indeed, these services can enhance the citizen's end-to-end experience and service delivery in the public sector. Every government has thus initiated the procurement of advanced technology, technology skills, innovation capabilities, and digital public services to enhance the quality of service and public trust in government policies, with better citizen outcomes. The present study uses the index scores of the top 100 countries in the government artificial intelligence readiness index (AIRI) provided by Oxford Insights in 2019 to investigate the impact of digital capabilities and services on the government vision of the future. The findings suggest that improvement in government procurement of advanced technology and in the innovation capabilities of a nation can be significant for 'ICTs to government vision of the future.' Keywords ICT · Innovation capability · Digital public services · Technology skills

A. Gupta (B) University of Turku, Turku, Finland e-mail: [email protected] P. Mittal Satyawati College (Evening), University of Delhi, New Delhi, India e-mail: [email protected] P. K. Gupta Jamia Milia Islamia University, New Delhi, India S. Bansal (B) Janki Devi Memorial College, University of Delhi, New Delhi, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_32

1 Introduction

With the increasing impact of social media platforms and artificial intelligence, citizens today expect a user-driven approach with personalized and responsive government service delivery. Citizens expect services from the government similar to those of the private sector [12]. Optimization models of a networked society with greater public engagement can support the management of resources and thus boost trust and accountability [13, 16]. The availability of advanced analytical tools allows continuous monitoring that helps governments improve service design and deliver more responsive public services [9]. For instance, analytical mobile apps can help patients upload information online and consult doctors about the best possible treatment options. Patients can consult online with more options, gaining better control over their health with less dependency on specific healthcare professionals (HCPs). In the last three decades, many countries have implemented digital services for the public in the belief that they enhance effectiveness, accountability, and transparency among citizens and government staff [8]. Advanced technologies like big data offer capabilities to handle the vast quantities of data that governments collect through their everyday activities [10]. They offer opportunities to unveil hidden patterns and correlation insights [17], such as value creation [4], and to intensify visibility and viability in supply chains for better allocation of resources [1]. For this reason, the implementation and execution of big data have enabled public administrations to increase accuracy levels, especially in business and finance, where large amounts of data covering millions of trades and money transactions are generated every day [3, 14]. Predictive analytics, a part of big data, can provide smart management of public resources by anticipating problems and taking proactive measures to prevent them, for example, fraud detection in auditing.
Big data applications mainly include advanced analytics for efficiency, effectiveness, and transparency in public services, and for learning from the performance of such services [2, 10]. Big data applications also require integrated data security and privacy mechanisms, as data in the public sector contain personal information and are highly sensitive in nature [11]. Along with this, governments are accountable for protecting citizens' privacy and need to be fully capable of combating threats from cyber attackers, fraudsters, and hackers [5, 7, 20]. Governments need to build the technology skills required to adopt advanced digital technologies [2, 21]. Cybercriminals are using new tactics to hack user accounts and have become more impactful, causing direct financial effects (business activities and money trails) and indirect effects (stolen identity, lost privacy, damaged reputation, etc.) [18]. According to [19], nations should adopt a national cybersecurity strategy (NCSS), including privacy laws, to prepare for the cyber threats posed by new technologies. Deb et al. [5] used sentiment analysis of time series to predict cyber-attacks weeks ahead of the events; the study analyzed over 0.4 million posts written during 2016–18 on over 100 hacking forums.

Individual privacy is the responsibility of the governments that collect, retain, use, or share personal information during the delivery of public services. The government needs to handle this information with care, protect it from inadvertent disclosure or misuse, and be transparent about the use and protection of that data [15]. Along with this, the concept of privacy and data protection laws faces many challenges from advancing technologies such as big data, digital identity, biometrics, and social media sites [6]. The present study evaluates the differences among countries that have implemented privacy laws in their priority of the government vision of ICTs, digital public services, procurement of advanced technology, innovation capabilities, and technology skills. Further, the authors establish a relationship to identify the influence of various data capabilities on the government vision of ICTs for the future. The AI readiness index has been prepared for policy makers to compare their performance and to focus on the right technologies and data capabilities. As automation matures, governments must be ready to invest in and capitalize on the power of AI. Governments can take advantage of the opportunities offered by emerging technologies to improve citizens' experience of government functioning. The present study undertakes the AI readiness index scores of the top 100 countries in government digital public services, government procurement of advanced technology, data capability, technology skills, innovation capabilities, and artificial intelligence (AI) start-ups. The study aims to identify the influence of various inputs in AI, digital technologies, and strategies on the effectiveness of government public service delivery. Oxford Insights prepared the AI readiness index for 194 countries in 2019 to capture their performance and focus on advanced technologies and capabilities. According to the report, Singapore holds the first rank in AI readiness, and the rest of the top 20 comprises western European governments, Canada, Australia, New Zealand, and four Asian countries including India.
The top ranks of the AI readiness index are held by countries with strong economies, innovative capabilities, and good governance. No country from Latin America or Africa placed in the top 20. Table 1 presents the top-ranked country in each region along with its global rank among the 194 countries. The present study selected the top 100 countries according to the AI readiness score and collected their index scores for procurement of advanced technology, innovation capabilities, technology skills, data availability, and digital public services.

Table 1 Governments' AI readiness index score 2019 (region-wise top countries)

Region            Country      Top score   Global rank
Asia–Pacific      Singapore    9.186       1
Africa            Kenya        5.672       52
Latin America     Mexico       6.664       32
North America     USA          8.804       4
Eastern Europe    Estonia      6.968       23
Australia/NZ      Australia    8.126       11
Western Europe    UK           9.069       2

Data Source: Oxford Insights and the IDRC (2019)

Table 2 Description of variables, sources, and type (input/output)

Code   Variable                                   Source                                    Input/output
IGV    Government vision of ICTs future           WEF networked readiness index 2016        Output
GAT    Gov't procurement of advanced technology   WEF networked readiness index 2016        Input
GTS    Technology skills                          WEF global competitiveness report 2018    Input
GIC    Innovation capability                      WEF global competitiveness report 2018    Input
GDA    Data availability                          OKFN open data index 2016/2017            Input
GDS    Digital public services                    UN online service index                   Input

2 Material and Methods

The study uses the AI readiness index of the top one hundred countries in the list of 194 given in the Oxford Insights report 2019. The index scores are normalized to a range of 0–100 using the maximum score (Index/maximum × 100). Various factors related to the importance of ICTs to the government vision have been considered as input variables: government procurement of advanced technology, technology skills, innovation capability, data availability, and digital public services (see Table 2). The variables are described using descriptive statistics along with scatter plots that show the strength of the relationships between them. Regression analysis is carried out to establish a relationship between the endogenous and exogenous variables and to identify the significant factors influencing the government vision of the future in regard to ICTs. The regression model relating the input and output variables can be stated as follows:

IGV = β0 + β1 GAT + β2 GTS + β3 GIC + β4 GDA + β5 GDS
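As a concrete illustration of the normalization step, the short sketch below rescales three of the region-leading scores from Table 1 to the 0–100 range; the three input values come from Table 1, while the code itself is only an illustrative sketch and not part of the study's tooling.

```python
import numpy as np

# Region-leading AI readiness scores from Table 1: Singapore, USA, Kenya.
scores = np.array([9.186, 8.804, 5.672])

# Normalization used in the study: Index / maximum * 100.
normalized = scores / scores.max() * 100
print(normalized.round(2))  # Singapore becomes 100.0 by construction
```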

3 Results

3.1 Descriptive Statistics

Table 3 presents the descriptive statistics of the variables considered in the present study. The results include mean scores and standard deviations for two categories of countries according to the implementation of privacy laws (yes/no). Independent-sample t-statistics have been computed to find significant differences in the average index scores due to the implementation of privacy laws.

Table 3 Descriptive statistics of variables (normalized index scores 0–100)

                                    Privacy laws (yes)   Privacy laws (no)    Overall
Variable (Code)                     Mean     SD          Mean     SD          Mean     SD       t-statistics   Sig.-value
Government vision ICTs (IGV)        59.38    10.67       58.80    12.98       59.35    10.72    0.117          0.907
Gov't procurement Ad Tech (GAT)     49.48    8.96        56.29    11.54       49.828   9.15     1.634          0.106
Technology skills (GTS)             62.69    13.39       49.42    29.66       62.029   14.64    0.994          0.375
Innovation capability (GIC)         47.41    17.33       43.88    20.64       47.234   17.41    0.440          0.661
Data availability (GDA)             43.37    19.29       27.00                43.132   19.24    0.843          0.403
Digital public services (GDS)       76.55    15.77       81.52    9.65        76.8     15.52    0.696          0.488

Table 4 Correlation matrix

       IGV        GAT        GTS        GIC        GDA        GDS
IGV    1
GAT    0.384**    1
GTS    0.221*     0.330**    1
GIC    0.429**    0.394**    0.583**    1
GDA    0.279*     0.300*     0.431**    0.594**    1
GDS    0.404**    0.291**    0.367**    0.698**    0.590**    1

** Significant at 1% level; * significant at 5% level
The results indicate that the average index score for ICTs to the government vision of the future is higher for countries that have enacted privacy laws (mean score 59.38) than for countries with no privacy laws (mean score 58.80). However, the results do not provide sufficient evidence to confirm a significant difference between the two averages (t-statistic = 0.117, p > 0.05).
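The group comparison here can be sketched as a pooled-variance two-sample t-test. In the sketch below, the two groups of IGV scores are made-up stand-ins (the study's country-level data are not reproduced here), so only the method, not the numbers, reflects the paper.

```python
import numpy as np

# Hypothetical normalized IGV scores for two groups of countries.
yes = np.array([59.4, 61.2, 57.8, 60.1, 58.9, 59.7])  # privacy laws enacted
no = np.array([58.8, 57.9, 59.5])                      # no privacy laws

def pooled_t(a, b):
    """Independent-sample t-statistic with pooled variance."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / na + 1 / nb))

print(round(pooled_t(yes, no), 3))
```

The resulting statistic would be compared against a t distribution with na + nb − 2 degrees of freedom, exactly as in Table 3.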

3.2 Correlation Analysis

Table 4 presents the degree of relationship between the variables using correlation analysis. The results indicate that all variables (ICTs to government vision of the future, government procurement of advanced technology, technology skills, innovation capability, data availability, and digital public services) are positively and significantly correlated. The maximum correlation is observed between the innovation capability of a country and the government vision of the future about ICTs. The scatter plots presented in Fig. 1 display the association between the input variables (government procurement of advanced technology, technology skills, innovation capability, data availability, and digital public services) and the output variable (ICTs to government vision of the future). Box plots are also included in the figures to display the five-number summary statistics (minimum, quartile 1, quartile 2, quartile 3, maximum).
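A Pearson correlation matrix of the kind shown in Table 4 can be computed as in the sketch below. The scores here are random stand-ins for the real index data, so the printed correlations will not match the table; only the computation is illustrative.

```python
import numpy as np

# Hypothetical normalized index scores: rows = 100 countries,
# columns = the six indices IGV, GAT, GTS, GIC, GDA, GDS.
rng = np.random.default_rng(1)
scores = rng.uniform(20, 90, size=(100, 6))

# Pearson correlation matrix between the six variables, as in Table 4.
corr = np.corrcoef(scores, rowvar=False)

labels = ["IGV", "GAT", "GTS", "GIC", "GDA", "GDS"]
for label, row in zip(labels, corr):
    print(label, np.round(row, 3))
```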

3.3 Regression Analysis

Table 5 presents the results of the regression analysis examining the impact of the exogenous variables (government digital public services, government procurement of advanced technology, data capability, technology skills, innovation capability, and artificial intelligence (AI) start-ups) on the endogenous variable (government vision of the future about ICTs). The results indicate a positive and significant influence of government procurement of advanced technology (β = 0.353, p < 0.01) and innovation capability (β = 0.247, p < 0.05) on the government vision of the future and the importance of ICTs.
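The coefficient estimates and t-statistics reported in Table 5 can be reproduced in outline with ordinary least squares. The sketch below uses synthetic stand-in data (random inputs and an assumed coefficient vector), since the study's country-level dataset is not reproduced here; it only demonstrates how slope coefficients and their t-statistics are obtained.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 5  # 100 countries, five inputs (GAT, GTS, GIC, GDA, GDS)

# Synthetic stand-in data with an assumed coefficient vector.
X = rng.uniform(20, 90, size=(n, k))
y = 30 + X @ np.array([0.35, 0.02, 0.25, -0.10, 0.15]) + rng.normal(0, 3.0, n)

A = np.column_stack([np.ones(n), X])           # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # OLS coefficient estimates
resid = y - A @ beta
sigma2 = resid @ resid / (n - k - 1)           # unbiased residual variance
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
t_stats = beta / se                            # compared against t(n - k - 1)
print(np.round(t_stats, 2))
```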

Fig. 1 Scatter plots: a IGV vs. GAT, b IGV vs. GTS, c IGV vs. GIC, d IGV vs. GDS, e GAI vs. EOG, f GIC vs. EOG

Table 5 Regression analysis (dependent variable: government vision of ICTs future)

Independent variable                         Slope coefficient   t-statistics   Sig.-value
Gov't procurement of Ad technology (GAT)     0.353               2.856          0.006
Technology skills (GTS)                      0.018               0.126          0.900
Innovation capability (GIC)                  0.247               2.568          0.013
Data capabilities (GDA)                      -0.107              -1.741         0.087
Digital public services (GDS)                0.152               1.510          0.136

Adj R-square = 0.495, F-statistics = 11.951 (p < 0.01)

4 Conclusions

Over the past three decades, every nation has faced the challenges of continuous technological development alongside an increasing number of cyber attackers, fraudsters, and hackers. Every country needs to keep its usage of digital technologies up to date and requires a data-centric strategic approach pursued in a collaborative and innovative manner. It is, therefore, necessary to use the right strategy and architecture, integrated with digital technology, in prioritizing the government vision of the future about ICTs. Governments can take advantage of the opportunities offered by emerging technologies to improve citizens' experience of government functioning. There is a need for regular assessment of a government's progress in delivering public services and for improving the governance infrastructure for future planning. The study undertakes the top 100 countries based on their AI readiness index scores prepared by Oxford Insights and the IDRC (2019). The findings highlight disparities in the government vision of the future about ICTs between countries that have and have not enacted privacy laws; policy makers should act to ensure that no further global inequalities arise. As automation matures, governments must be ready to enforce the best practices of data privacy. Governments need to handle information with care, protect it from inadvertent disclosure or misuse, and be transparent about the use and protection of that data. Along with this, policy makers should implement privacy and data protection laws in a more balanced form so as to help citizens benefit from advancing technologies such as big data, digital identity, biometrics, and social media sites. Using regression analysis, the study highlights the significance of government procurement of advanced technology and innovation capability for the government vision of the future about ICTs. Governments need to retain and develop the skills and capabilities of their people to face the challenges of a dynamic and responsive environment and thus sustain a digital public sector culture.

References

1. J.A. Alzubi et al., Hashed Needham Schroeder industrial IoT based cost optimized deep secured data transmission in cloud. Measurement 150, 107077 (2020). https://doi.org/10.1016/j.measurement.2019.107077
2. A. Arora et al., Role of emotion in excessive use of Twitter during COVID-19 imposed lockdown in India. J. Technol. Behav. Sci. (2020). https://doi.org/10.1007/s41347-020-00174-3
3. A. Bhatia, P. Mittal, Big data driven healthcare supply chain: understanding potentials and capabilities. SSRN Electron. J. (2019). https://doi.org/10.2139/ssrn.3464217
4. B. Brown, M. Chui, J. Manyika, Are you ready for the era of "big data"? McKinsey Q. 4(1), 24–35 (2011)
5. A. Deb, K. Lerman, E. Ferrara, Predicting cyber-events by leveraging hacker sentiment. Inf. (Switzerland) (2018). https://doi.org/10.3390/info9110280
6. S.S. Dey, J. Thommana, S. Dock, Public agency performance management for improved service delivery in the digital age: case study. J. Manage. Eng. 31(5), 05014022 (2015). https://doi.org/10.1061/(asce)me.1943-5479.0000321
7. T.J. Holt, J.D. Freilich, S.M. Chermak, Exploring the subculture of ideologically motivated cyber-attackers. J. Contemp. Crim. Justice (2017). https://doi.org/10.1177/1043986217699100
8. H.A.T. Leão, E.D. Canedo, Best practices and methodologies to promote the digitization of public services citizen-driven: a systematic literature review. Inf. (Switzerland) (2018). https://doi.org/10.3390/info9080197
9. I. Mergel, N. Edelmann, N. Haug, Defining digital transformation: results from expert interviews. Gov. Inf. Q. 36(4), 101385 (2019). https://doi.org/10.1016/j.giq.2019.06.002
10. P. Mittal, Big data and analytics: a data management perspective in public administration. Int. J. Big Data Manage. 1(1), 1 (2020). https://doi.org/10.1504/ijbdm.2020.10032871
11. R. Munné, Big data in the public sector, in New Horizons for a Data-Driven Economy (Springer International Publishing, Cham, 2016), pp. 195–208. https://doi.org/10.1007/978-3-319-21569-3_11
12. OECD, Strengthening Digital Government (2019)
13. D. Petrakaki, Re-locating accountability through technology: from bureaucratic to electronic ways of governing public sector work. Int. J. Public Sect. Manage. (2018). https://doi.org/10.1108/IJPSM-02-2017-0043
14. S. Sagiroglu, D. Sinanc, Big data: a review, in Proceedings of the 2013 International Conference on Collaboration Technologies and Systems, CTS 2013, pp. 42–47. https://doi.org/10.1109/CTS.2013.6567202
15. N.A. Siddiquee, M.Z. Mohamed, E-Government and transformation of service delivery in Malaysia. Int. J. Publ. Admin. Dig. Age (2015). https://doi.org/10.4018/ijpada.2015070103
16. C. Del Sordo, R.L. Orelli, E. Padovani, Governing the public sector e-performance: the accounting practices in the digital age, in Decision Management: Concepts, Methodologies, Tools, and Applications (2017). https://doi.org/10.4018/978-1-5225-1837-2.ch082
17. R. Sowmya, K.R. Suneetha, Data mining with big data, in Proceedings of 2017 11th International Conference on Intelligent Systems and Control, ISCO 2017, pp. 246–250. https://doi.org/10.1109/ISCO.2017.7855990
18. M. Spremić, A. Šimunic, Cyber security challenges in digital economy, in Lecture Notes in Engineering and Computer Science (2018)
19. C.S. Teoh, A.K. Mahmood, National cyber security strategies for digital economy. J. Theor. Appl. Inf. Technol. (2017)
20. S. Ur Rehman, V. Gruhn, An approach to secure smart homes in cyber-physical systems/internet-of-things, in 2018 5th International Conference on Software Defined Systems, SDS 2018. https://doi.org/10.1109/SDS.2018.8370433
21. S. Yadav et al., Children aged 6–24 months like to watch YouTube videos but could not learn anything from them. Acta Paediatr. 107(8), 1461–1466 (2018). https://doi.org/10.1111/apa.14291

AI Approaches for Breast Cancer Diagnosis: A Comprehensive Study

Harsh Jigneshkumar Patel, Parita Oza, and Smita Agrawal

Abstract According to the report of the World Health Organization (WHO), breast cancer is the most common and dangerous disease among women. Breast cancer is a life-threatening disease and a major cause of death in women. Early detection of breast cancer can decrease the mortality rate and improve the survival rate of women. Various artificial intelligence (AI) approaches have been used by the research community to build computer-aided diagnosis (CAD) systems for early detection of breast cancer. This study presents the various breast imaging modalities used for cancer diagnosis, related work in this domain, various pre-processing techniques to improve the quality of breast images, and applications of machine learning (ML) for breast imaging. The study also presents various deep learning (DL) approaches to building systems for automated breast cancer diagnosis, including several pre-trained deep learning models. Because AI models may not perform well on imbalanced and inconsistent datasets, we also discuss various techniques to improve model performance. Prediction, segmentation, and classification deep neural network models, along with the various imaging modalities, are presented, which are beneficial for the breast cancer diagnosis process.

Keywords Medical imaging modalities · Breast cancer classification · Convolutional neural network (CNN) · Computer-aided diagnosis (CAD) · Deep neural network · Machine learning · Artificial intelligence

H. J. Patel (B) · P. Oza · S. Agrawal Nirma University, Ahmedabad, India e-mail: [email protected] P. Oza e-mail: [email protected] S. Agrawal e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_33

1 Introduction

Breast cancer is among the most lethal diseases affecting women worldwide. According to the World Health Organization (WHO), breast cancer is a leading cause of preventable deaths; among cancer types such as liver, lung, and brain cancer, it ranks third in incidence [1]. As per WHO, 1.7 million (11.3%) of the deaths reported in 2015 were associated with breast cancer, and the number of breast cancer patients is anticipated to increase by 70% within 20 years [2]. Early and correct detection of breast cancer therefore plays a vital role in increasing the survival rate of the patient. Breast tumors are basically of two types: malignant and benign. The benign type is non-cancerous, i.e., non-invasive, while the malignant type is dangerous; the two therefore require different diagnosis plans and precise prediction methodologies. Some research attributes low diagnostic accuracy in the interpretation of images to weak technical expertise. The rise of different illnesses and limited human resources has motivated researchers and clinical staff to use computer-aided technology to support breast thermography-based diagnosis and thereby limit errors [3]. Therefore, various computer-based solutions have come into existence for the classification of medical images. Different types of medical imaging modalities, such as mammography, ultrasound (US), magnetic resonance imaging (MRI), and computed tomography (CT), are adopted for breast cancer diagnosis. These medical images are normally interpreted by human experts such as doctors and radiologists. In the past few years, many researchers have focused on deep learning (DL); since DL models are publicly available, they can be easily applied using pre-trained networks [3]. The rapid growth of deep learning technology has accelerated development across scientific fields.
These are applied in fields such as education, finance, medicine, security, retail, and e-commerce [4]. They provide automated tools and models that reduce human effort and error rates and improve the accuracy of automatic detection. Nowadays, machine learning algorithms, predictive analysis, and pattern recognition are used for the diagnosis of breast cancer [4]. Thus, machine learning algorithms not only predict accurate results and increase the diagnosis rate but can also take over a large portion of pathologists' work. The use of machine learning algorithms as classifiers has been evident in various domains, including the health industry. Machine learning comprises algorithms for regression, classification, and clustering; there are mainly two types used for predictions in the health industry, supervised and unsupervised learning [5]. The involvement of artificial intelligence in the health sector has crucially helped doctors, from imaging modalities to predictive analysis of patients suffering from fatal diseases such as diabetes and cancer.

Artificial intelligence approaches span various technologies and imaging modalities that continue to advance toward better prediction and classification of disease. These include deep neural networks and machine learning algorithms comprising supervised and unsupervised techniques, such as K-means clustering and self-organizing maps, which help in the efficient processing of images for identifying mass globules present in the breasts. The working procedure begins with feature selection and extraction, followed by feature scaling: for better information retrieval, the feature vector must be scaled to fit the model specifications, along with removal of noise present in the images. Data augmentation is used to enlarge the dataset based on the data at hand. The feature vectors extracted from the images are fed into various classification algorithms, from neural network models to machine learning classifiers trained on breast image data. The AI methodologies developed or under research have been shown to improve the prediction of breast cancer by considering various dependencies present in the image frames. The feature extraction methodologies still have scope for improvement through mining only the necessary features and better feature selection techniques. Deep neural networks are successfully used for mining information that is implicit in the images.

The structure of the paper is as follows: Section 2 discusses the various imaging modalities used in the breast cancer diagnosis process. Section 3 discusses related research progress in the field of breast cancer diagnosis using artificial intelligence approaches. Section 4 covers the various pre-processing techniques used to obtain better results in classification, segmentation, and detection. Section 5 covers the AI approaches used for diagnosis. Section 6 discusses the various techniques used to improve the performance of these approaches. Section 7 outlines the various research opportunities and challenges present in the diagnosis process.
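As one illustration of the K-means idea mentioned above, the sketch below clusters the pixel intensities of a synthetic grayscale image into background and a candidate "mass" region. The image, initial centroids, and sizes are all invented for the example; a real pipeline would run on actual scans, typically with a library such as scikit-learn.

```python
import numpy as np

# Synthetic grayscale "scan": dark background with one bright region
# standing in for a mass (all values invented for the sketch).
rng = np.random.default_rng(3)
image = rng.uniform(0.0, 0.3, size=(64, 64))           # dark background
image[20:30, 25:35] = rng.uniform(0.7, 1.0, (10, 10))  # bright 10x10 "mass"

# K-means (Lloyd's algorithm) with k = 2 on pixel intensities.
pixels = image.reshape(-1, 1)
centers = np.array([[0.1], [0.9]])                     # initial centroids
for _ in range(10):
    labels = np.argmin(np.abs(pixels - centers.T), axis=1)   # assign step
    centers = np.array([[pixels[labels == k].mean()] for k in (0, 1)])

mask = labels.reshape(image.shape)   # 1 marks the bright cluster
print(mask.sum())                    # 100 pixels in the bright cluster
```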

2 Breast Imaging

Detection of breast cancer at a very early stage is fundamental, as it is associated with a greater number of available treatment alternatives and increased survival [6]. Early detection is crucial for patients because it lets them try alternative measures and treatment procedures to recover and survive. Clinical imaging modalities are helpful in the early recognition of breast cancer. Various types of medical imaging techniques are discussed below.

2.1 Mammograms

This tool mainly uses X-ray imaging for the early detection of breast anomalies [1]. There are two types of mammograms: (1) screening mammograms and (2) diagnostic mammograms. Screening mammograms look for the slightest indications of cancer: X-rays are taken of the breasts of women who do not feel any symptoms or effects of breast cancer, such as lump formation. This technique is crucial as it can detect cancer even when it is too slight to be noticed or felt by the patient's primary care physician. Diagnostic mammography is not always considered advisable as a method of analysis: it is found to be really effective only when the patient is 40 years old or above, because younger breasts are too dense for mammograms to sense cancerous cells, and at the same time the radiation of mammograms is injurious to patients under that age. Thus, mammography is generally used after surgical treatment of primary breast cancer in women (Fig. 1).

Fig. 1 Mammographic images of the breast tissues

2.2 MRI

Breast magnetic resonance imaging is usually performed after a biopsy that is positive for malignant growth, when the primary care physician needs more information about the extent of the disease. For certain individuals, a breast MRI may be used together with mammograms as a screening tool for detecting breast malignancy. That group includes women with a high risk of breast cancer, who have a very strong family history of breast cancer or carry a hereditary breast cancer gene mutation. The working mechanism of magnetic resonance imaging is based on the concept of neo-angiogenesis. The uptake and washout of the gadolinium agent occur mainly due to the increased vascular permeability caused by the tumor blood vessels present in the breast area. The resulting enhancement and washout make it possible for imaging analysis to predict and differentiate between benign and malignant samples. DCIS present in the milk ducts is identified using microcalcifications, whereas a tumor present in the terminal ducts increases the blood supply reaching them, making them undergo necrosis and calcify. These types of calcifications are not generally identified by the MRI process. MRI is significantly useful for the detection of high-grade lesions present in the milk ducts which are calcified. MRI is used mainly for the detection of multifocal and multicentric lesions in dense breasts or permeable milk ducts, which is revealed by the washout kinetics. There are problems related to breast MRI: it claims to reduce the problems of surgery by defining the size and extent of the tumor, whereas, in contrast, it decreases the local recurrence rates along with the survival rates of the patient. It also increases costs and biopsies, and may increase mastectomy rates in the cases of its usage (Fig. 2).

Fig. 2 Magnetic resonance image of the breast

2.3 Computerized Tomography

CT body scans are used to assess metastatic spread in the staging of primary breast cancer, to evaluate possible recurrence, and to monitor therapeutic response [1]. CT takes multiple images through different angles. It is used to examine the chest and check whether the cancer has spread to other organs such as the liver or lungs.

With the help of X-rays and computers, an entire cross section of the body is visualized. A series of X-ray images taken with the help of a rotating arc generates three-dimensional images of the breast tissue, helping in the analysis of cancerous tissue or globules. A crucial benefit of CT scans is their compatibility with the information provided by FDG uptake, accompanied by the ability of CT to provide higher-resolution images of the tumor morphology. The use of CT scans for breast cancer prediction is novel and has been under research for the past few years owing to inconsistencies in projected results; it also has the substantial advantage of higher-resolution images with less noise, thus helping in better feature extraction and modeling. An alternative form of CT processing injects a contrast medium that flows through the bloodstream when introduced into a vein. The contrast medium is easily visualized, helping detect tumors present in various parts of the body. The body is exposed to small doses of radiation, and frequent exposure increases the chances of cancer.

2.4 Scintimammography

This technique uses a radioactive tracer called technetium sestamibi that is injected into a vein [1]. Through a special camera, breast cancer cells are detected by the tracer. It has been shown to provide good sensitivity but is not able to reliably detect tiny breast tumors [3]. It is classified as nuclear medicine breast imaging because of the small amounts of radioactive materials and substances used. It is used for the examination and diagnosis of cancer in various parts of the body and for cardiological, gastrointestinal, breast, and neurological disorders. The radioactive material is injected into the body or inhaled as a gas and accumulates at the body part to be examined, which helps doctors analyze that part for a tumor while it radiates energy in the form of gamma rays. The procedure is additionally used for the analysis of metabolism rates and the chemical flows taking place in various organs of the body. The extent of accumulation of the material indicates the level of chemical activity, which is interpreted as marking potential hotspots for tumors in the particular cross section of the body. It is used extensively owing to the unmatched advantage of a less invasive procedure, and it works equally well when breast tissue is extensively dense or contains implants. The limitation of scintimammography is the lower resolution of the images produced.

AI Approaches for Breast Cancer Diagnosis: A Comprehensive Study


Fig. 3 Original and annotated histopathological image of the cell structure

2.5 Histopathological Images

We utilized a basic H&E stain normalization strategy [7] to standardize each image. We then resized the images to four different scales (1×, 0.5×, 0.33×, and 0.25×). Each resized image was divided into four contiguous, non-overlapping patches. After generating patches without augmentation, we concatenated them by resizing them to 224 × 224 pixels, producing an augmented patch-wise dataset from the concatenated subset of four adjoining non-overlapping patches. In practice, pathologists study histological images from different orientations, and there are many variations in staining and acquisition conditions. To imitate the pathologist's examination process and this realistic variation, we applied several kinds of data augmentation, for example flipping (horizontal and vertical), rotation, shifting (width and height), brightness, zoom, and slight blurring. This augmentation increases the size of the dataset without degrading its quality. The literature likewise suggests that augmentation and patching techniques can be used for histological classification [8]. Every generated patch was assigned the same class label as the original image. These resized patches were used to train the model with ImageNet weights (Fig. 3).
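The patching and flipping steps described above can be sketched in a few lines of NumPy (a minimal illustration with a toy 4 × 4 image and function names of our own; a real pipeline would also resize patches to 224 × 224 and add rotation, shift, brightness, zoom, and blur augmentation):

```python
import numpy as np

def four_patches(img):
    """Split an image into four contiguous, non-overlapping patches
    (top-left, top-right, bottom-left, bottom-right)."""
    h, w = img.shape[:2]
    hh, hw = h // 2, w // 2
    return [img[:hh, :hw], img[:hh, hw:], img[hh:, :hw], img[hh:, hw:]]

def augment(patch):
    """Horizontal and vertical flips, one simple form of augmentation."""
    return [patch, np.fliplr(patch), np.flipud(patch)]

img = np.arange(16).reshape(4, 4)   # toy stand-in for a histology image
patches = four_patches(img)
aug = [a for p in patches for a in augment(p)]
# four patches, each yielding itself plus two flipped copies (12 in total)
```

Each generated patch would then inherit the class label of its source image, as described above.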

2.6 Positron Emission Tomography (PET)

This type of imaging measures the sugar consumed by cells. The idea behind the test is that cancer cells, with their increased metabolic activity, consume more sugar and can thereby be identified as tumors [9]. Radioactive materials, which are excellent substances for detection and visualization, are used for the process. Radioactive atoms are attached to substances that the organs use naturally, and the body is scanned for the detection of tumors. These emission images are used to analyze and detect tumors and increased organ activity. A radionuclide is created with the help of the radioactive substance


H. J. Patel et al.

and the naturally consumed material; it is broken down during the subsequent process, resulting in the emission of positrons. The dispersion of the positrons is accompanied by the emission of gamma rays, which are picked up by the scanner to reveal any abnormalities. The scan is followed by the creation of image maps whose brightness is directly proportional to the amount of radionuclide accumulated in the tissues. PET proves to be the most accurate imaging modality for tracking the spread of cancer from the site of the original tumor.

3 Related Work

Several research works in the field of CAD have been executed with the help of deep learning technologies and have achieved medical-grade results. The surveyed literature highlights how improvements in computational cost and in evaluation metrics have brought medical imaging technology to this level of accuracy. Alexander [1] researched the advances made in MRI imaging using deep learning. MRI imaging is performed with the help of CNNs, which also enable a substantial reduction of the gadolinium dose in contrast-enhanced brain MRI, by orders of magnitude of the contrast intensity. Radiotherapy is another crucial field of medical imaging that relies extensively on deep learning. For PET-MRI improvement as well as theranostics, combining confocal endomicroscopy with deep learning models is useful for the detection of intraoperative CLE images. The important factors in MRI imaging with deep learning are data acquisition and reconstruction; quantitative parameters such as QSM and MR fingerprinting; image restoration and synthesis; image segmentation; and feature vector extraction. Lian [2] researched the technology and tools of convolutional neural network-based mammographic breast cancer diagnosis. There are various approaches to medical diagnosis with deep learning models: feeding pre-trained models with the target data using transfer learning, building and training models on the target data from scratch, and reducing the number of parameters by creating shallow networks. CAD is executed using various technologies, including both machine learning and deep learning.
Machine learning algorithms such as support vector machines, artificial neural networks, K-nearest neighbors, Naïve Bayes, and random forests are used for breast cancer diagnosis, whereas deep learning models use convolutional neural networks to extract features from the images, which are then fed to a classifier. The use of transfer learning is crucial to increase prediction accuracy at lower computational cost and time. There are also graph-based semi-supervised machine learning techniques for cancer diagnosis, which involve steps of data weighting, feature selection, and data labeling before the data is fed to the classifiers. Jeremy [3] worked on the diagnosis of breast cancer using deep neural networks and on the effect of CAD on radiology and imaging sciences. Breast MRI was experimented with using a 3D-CNN consisting of ten layers


for feature extraction and classification. Along with increased prediction accuracy, it helped capture better, dynamic contrast-enhanced MRI that is easier to analyze. The results achieved with the 3D-CNN showed 2D-CNNs to be comparatively ineffective. Patch-based CNN detection approaches function in the same way, extracting features from the images and classifying each voxel as normal or abnormal; the last layer of the network works as a structural support vector machine. U-Net is another network based on the CNN architecture, which uses patch-based systems to delineate tumors in the images. Yunchao [4] researched the AI techniques used for the detection of breast cancer, providing an overview of the various procedures in breast cancer prediction such as segmentation, classification, and grading of various other cancer types. The research was implemented using deep neural network approaches for the use cases mentioned, along with various pre-processing techniques, neural network models, training datasets, and evaluation metrics, which together provide the most useful insights into the classification procedure and the various avenues for improvement. Kaushal [6] worked on the diagnosis of breast cancer using intuitionistic fuzzy histogram hyperbolization and c-means clustering algorithms with texture feature-based classification of mammography images. Fuzzy clustering is used to allocate images to their respective clusters, followed by texture feature extraction and the modeling of a convolutional neural network based on the texture features extracted from the pre-processed images. Classification of the dataset is done using the support vector machine and K-nearest neighbor algorithms.
The proposed approach consisted of image pre-processing techniques such as cropping, filtering, and intuitionistic fuzzy histogram hyperbolization, followed by image segmentation. The segmented images are passed through feature extractors whose outputs are fed into the classifiers used. Ismail [10] worked on deep learning approaches for the prediction and analysis of breast cancer tumors. The process uses deep neural networks for classification and recursive feature elimination for efficient feature extraction, along with various image normalization and pre-processing techniques. A noise removal algorithm is used to extract useful information from the images and avoid discrepancies and inconsistencies in the results. The proposed methodology is applied to the WBCD dataset with precision, recall, and accuracy as the evaluation metrics. Din [9] proposed a system for more efficient breast cancer prediction using various machine learning tools and methods. The techniques used consist of artificial neural networks (ANNs), Bayesian networks, support vector machines (SVMs), and decision trees, which are among the most widely accepted classification algorithms. The approach consists of dimensionality reduction followed by selection of the features relevant to the task, along with noise removal from the image. The feature extraction process extracts features from the images and creates a feature vector, which is fed into the machine learning algorithms used for classification.


Table 1 summarizes the deep learning approaches used for breast cancer diagnosis, comprising various neural network models for use cases such as classification, detection, and segmentation.

4 Pre-processing Techniques for Breast Images

This pre-processing step is required to transform the raw image data into a structured form so that the important domain-related features can be identified and further analyzed. Images can be affected by many factors, such as noise arising from the dispatching process or image acquisition. Because image quality affects feature extraction and image segmentation, choosing appropriate pre-processing techniques is significant. Pre-processing is an essential part of the process: it reduces the noise that tends to suppress the information of interest, removes irrelevant background and features, and smooths the images to avoid inconsistencies and achieve better results.

4.1 Use of Filters for Normalization

Image normalization is executed using adaptive filters, which compute the median of the 3 × 3 block of pixel values around each pixel of the input image [4]. This is mainly done to reduce impulsive noise in the picture by normalizing the pixel intensity values [10], and it is used for mammogram segmentation, mammogram enhancement, and mammogram label orientation. Adaptive median filters are used to normalize the image intensity values: the filter is a 3 × 3 window that computes the median of all the cells it covers and assigns that value to the central pixel. This normalizes the image intensity without losing the information present in the images, and it also reduces the noise in the mammographic images used for diagnosis. The filter size used here is 3 × 3 but can be altered depending on the desired extent of normalization. Denoising also has adverse effects, as it leads to overall blurring of the image and loss of edges. Low-pass filters perform normalization by blocking detail information; the use of image statistics such as the mean, variance, and spatial correlation of the image frames better preserves the edges in the images, making them less blurred.
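The 3 × 3 median filtering described above can be sketched as follows (a plain, non-adaptive variant for illustration; the function name and toy data are our own, and true adaptive median filters additionally vary the window size):

```python
import numpy as np

def median_filter_3x3(img):
    """Replace each interior pixel with the median of its 3x3
    neighbourhood; effective against impulsive (salt-and-pepper) noise."""
    out = img.copy().astype(float)
    for i in range(1, img.shape[0] - 1):
        for j in range(1, img.shape[1] - 1):
            out[i, j] = np.median(img[i - 1:i + 2, j - 1:j + 2])
    return out

noisy = np.full((5, 5), 10.0)
noisy[2, 2] = 255.0              # a single impulse-noise pixel
clean = median_filter_3x3(noisy)
# the outlier is replaced by the neighbourhood median (10.0)
```

Unlike an averaging filter, the median leaves the surrounding intensities untouched, which is why edges survive better than with simple blurring.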

Table 1 Summary of papers on CAD-based breast cancer diagnosis

| Related work | Type of work/application | Type of image | Dataset | AI approach | Type of training | Model performance (%) | Limitation |
|---|---|---|---|---|---|---|---|
| 1 | Segmentation | MRI | TCIA | GAN | Transfer learning | 90.04 | Underfitting of model due to transfer learning |
| 2 | Classification | Mammograms | MIAS | GAN | Transfer learning | 97.81 | Computationally exhaustive |
| 3 | Segmentation | MRI | TCIA | DBN | From scratch | 97.23 | Weak feature extractor |
| 4 | Classification | Mammograms | MIAS | CNN | End to end | 91.67 | Less accuracy |
| 5 | Classification | Mammograms | MIAS | DBN | From scratch | 91.03 | Overfitting making it less accurate on another dataset |
| 6 | Segmentation | Histopathology images | BreCaHAD | CNN | Transfer learning | 93.52 | Inconsistencies in dataset; less accuracy |
| 7 | Detection | Histopathology images | BreCaHAD | GAN | Transfer learning | 98.16 | Inconsistencies in dataset |
| 8 | Classification | Mammograms | MIAS | GAN | End to end | 93.70 | Computationally exhaustive |
| 9 | Segmentation | MRI | TCIA | CNN | Transfer learning | 92 | Less accuracy due to smaller breast sizes |
| 10 | Segmentation | Mammograms | MIAS | GAN | Transfer learning | 91 | Smaller dataset leading to overfitting |
| 11 | Classification | MRI | TCIA | CNN | Transfer learning | 96.29 | Absence of feature normalization |
| 12 | Segmentation | Mammograms | MIAS | DCNN | From scratch | 93.71 | Overfitting |
| 13 | Classification | Mammograms | MIAS | CNN | End to end | 90.31 | Feature extraction weakly executed |
| 14 | Classification | Histopathology images | BreCaHAD | GAN | Transfer learning | 91.08 | Cost and data intensive |
| 15 | Segmentation | MRI | TCIA | CNN | From scratch | 93.46 | Less accuracy due to weak classifiers |
| 16 | Classification | Histopathology images | BreCaHAD | DCNN | From scratch | 91.34 | Dataset inconsistencies; pre-processing techniques not utilized |


4.2 Channeling of Images

Input networks require 3-channel input images, whereas some of the images have only 2 channels and are similar to CT scan images; public datasets provide more annotated images [7]. Researchers have been using two types of datasets, private and public. Publicly available datasets are considered more reliable because they are accessible to a greater number of users and researchers, who can highlight problems and improve the quality of the dataset [11].

4.3 Conversion to 3-Channel Images

The input network requires 3-channel input images to function. Some mammogram images are 2-channel, grey-scale images; therefore, the images are converted into 3-channel RGB images for the network [12]. The conversion to a 3-channel colour representation accompanies the grey-scale conversion, and the 3-channel representation helps in better detailing and highlighting of information in the images.
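In the simplest case, the grey-scale-to-3-channel conversion amounts to replicating the single channel three times, e.g. (a minimal sketch; the function name is illustrative):

```python
import numpy as np

def to_three_channel(gray):
    """Replicate a single-channel (grey-scale) image across three
    channels so it matches the RGB input shape a network expects."""
    return np.stack([gray, gray, gray], axis=-1)

gray = np.zeros((224, 224), dtype=np.uint8)  # toy grey-scale mammogram
rgb = to_three_channel(gray)
# rgb now has shape (224, 224, 3)
```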

4.4 Morphological Operations

Morphology comprises operations used to segment out parts of the images that are not relevant or are considered noise. Morphological operations are applied to the mammography images to remove the noise present in the data, enabling efficient mining of the implicit information and dependencies present. They consist of four basic operations: dilation, erosion, opening, and closing. Each operation compares the input image pixels to the corresponding pixels of the structuring element (filter) around them.
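These operations can be sketched for binary masks as follows (a toy NumPy implementation with a fixed 3 × 3 structuring element; production code would typically use a library such as OpenCV or scipy.ndimage):

```python
import numpy as np

def dilate(mask):
    """Binary dilation with a 3x3 structuring element: a pixel becomes
    True if any pixel in its 3x3 neighbourhood is True."""
    padded = np.pad(mask, 1)
    out = np.zeros_like(mask)
    for di in range(3):
        for dj in range(3):
            out |= padded[di:di + mask.shape[0], dj:dj + mask.shape[1]]
    return out

def erode(mask):
    """Binary erosion with a 3x3 structuring element: a pixel survives
    only if its whole 3x3 neighbourhood is True."""
    padded = np.pad(mask, 1)
    out = np.ones_like(mask)
    for di in range(3):
        for dj in range(3):
            out &= padded[di:di + mask.shape[0], dj:dj + mask.shape[1]]
    return out

def opening(mask):
    """Erosion followed by dilation; removes small speckle noise."""
    return dilate(erode(mask))

mask = np.zeros((7, 7), dtype=bool)
mask[2:5, 2:5] = True   # a solid 3x3 object
mask[0, 0] = True       # a one-pixel speckle
cleaned = opening(mask)
# the speckle vanishes while the 3x3 object survives
```

Closing is the reverse composition (dilation followed by erosion) and fills small holes instead of removing speckles.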

5 AI Approaches for Breast Cancer Diagnosis

Deep learning is modeled on the functioning of the neurons of the human brain. Inside the human brain, neurons are interconnected, and through these connections data transfer takes place from every part of the body to the brain [3, 8, 13]. In a neural network system, each neuron receives input data together with weights and a bias; a dot product is computed and the result is fed into an activation function, which is chosen based on the use case and the type of output expected from the neural network [6, 14, 15]. The weights define the strength of the association between two connected nodes, whereas the bias is used to adjust the effect of the activation on the final output. Future observations can be predicted by training the neural network model thoroughly on the data. In the field of medical computing, the accuracy of the models must be extremely high, as even a minute mistake in the prediction of cancer could cost a person's life [7, 9, 16].
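The dot-product-plus-activation computation of a single neuron described above can be sketched as follows (a toy example with made-up inputs and weights):

```python
import numpy as np

def neuron(x, w, b):
    """One neuron: dot product of inputs and weights, plus a bias,
    passed through a sigmoid activation function."""
    z = np.dot(x, w) + b
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # input features
w = np.array([0.4, 0.1, 0.2])    # weights (strength of each connection)
b = 0.0                          # bias shifts the activation
y = neuron(x, w, b)
# z = 0.2 - 0.1 + 0.4 = 0.5, so y = sigmoid(0.5) ≈ 0.622
```

Training adjusts w and b after each iteration so that the outputs match the desired labels.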

5.1 Machine Learning

The prediction of breast cancer using machine learning can be divided into two categories: supervised and unsupervised approaches. Supervised machine learning algorithms utilize a labeled dataset for training and iterate to improve prediction quality based on the features extracted from the images; they include deep learning models, K-nearest neighbors, random forests, support vector machines, decision trees, and regression algorithms. Unsupervised machine learning algorithms learn structure from unlabeled data and include techniques such as K-means clustering. The data is pre-processed to remove inconsistencies and noise, which is followed by data visualization, used to analyze the various kinds of features present in the dataset that have to be fed as input to the algorithms. The algorithms are then fed the processed data and act as classifiers with binary outputs [17]. Table 2 summarizes the various machine learning algorithms used for the classification of breast images based on features extracted manually or by neural networks.

5.1.1 Classification

Classification assigns a new instance to one of a set of available categories based on an accessible training dataset. For the detection of anomalies in the images, various types of classifiers are used to classify tissue categories depending on the type of breast cancer or its grade. Regions in an image, for example a tissue or a cell, are assigned to one of the classes, for example malignant, benign, or cancerous areas of various grades [6, 13]. Several classification techniques, such as neural networks, K-nearest neighbors (K-NN), logistic regression, and fuzzy systems, can be applied to medical images of the breast to classify them as malignant or benign [10, 18]. The classification process uses feature vectors extracted from the images by deep neural network models: after the entire image has been converted into significant feature vectors, these vectors are fed into the classifiers for the classification step.
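A minimal feature-vector classifier of the kind described, here a K-NN voting on toy two-dimensional feature vectors (the data and names are illustrative only; real feature vectors come from the extraction stage):

```python
import numpy as np

def knn_predict(train_X, train_y, x, k=3):
    """Classify feature vector x by majority vote among its k nearest
    training vectors (Euclidean distance)."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = train_y[np.argsort(dists)[:k]]
    return np.bincount(nearest).argmax()

# toy feature vectors: class 0 = benign, class 1 = malignant
train_X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
train_y = np.array([0, 0, 1, 1])
label = knn_predict(train_X, train_y, np.array([0.85, 0.85]), k=3)
# the nearest neighbours are mostly class 1, so the vote is "malignant"
```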

Table 2 Summary of papers related to ML techniques used for CAD-based diagnosis

| Related work | Type of work/application | Type of image | Dataset | AI approaches | Type of training | Model performance (%) | Limitation |
|---|---|---|---|---|---|---|---|
| 18 | Classification | Histopathology images | BCWD | SVM, random forest, Bayesian networks | End to end | 95.4 (SVM), 93.76 (RF) | SVM does not function well on large datasets |
| 19 | Classification | Histopathology images | BCWD | k-NN, binary SVM, neighborhood component analysis | From scratch | 87.59 | Size of dataset leading to underfitting |
| 20 | Segmentation | Mammograms | DDSM | Discrete wavelet transform, random forest | Transfer learning | 84.29 | Texture-based feature extraction considered inefficient |
| 21 | Classification | Mammograms | MIAS, DDSM | GLCM, SVM | End to end | 85.54 | GLCM is considered computationally exhaustive |
| 22 | Classification | Mammograms | DDSM | Local binary patterns | From scratch | 84.80 | Weak feature extraction process |
| 23 | Classification | Mammograms | MIAS | Ensemble of decision trees, SVM | From scratch | 97.55 | Temporal subtractions remove the microcalcifications |
| 24 | Classification | MRI | TCIA | KFD, RVM, AdaBoost | Transfer learning | 89.53 | Weak feature extraction |
| 25 | Classification | Mammograms | MIAS | CNN and RBF-based SVM | End to end | 94.21 | Computationally exhaustive |
| 26 | Classification | Mammograms | MIAS | SVM | End to end | 93.47 | Does not function well on large datasets |
| 27 | Classification | Mammograms | MIAS | Multilayer-perceptron neural networks | From scratch | 90.51 | MLNN disregards spatial information |
| 28 | Classification | Mammograms | DUMD | Adaptive boosting | End to end | 85.35 | Scaling up not achievable |
| 29 | Classification | Mammograms | MIAS | Particle swarm optimization, Gaussian mixture model | Transfer learning | 89.5 | Does not execute well on lower-dimensional datasets |
| 30 | Classification | Mammograms | MIAS | ANN | End to end | 90.14 | Underfitting due to less training data |
| 31 | Segmentation and classification | MRI | TCIA | SVM, K-NN, Naïve Bayes | From scratch | 93.59 (SVM), 95.72 (K-NN), 94.16 (NB) | K-NN does not learn from the training process |
| 32 | Classification | Mammograms | WBCD | SVM, random forest, k-NN, ANN | End to end | 94.21 (SVM), 91.04 (k-NN) | SVM does not work with a large number of features |
| 33 | Classification and segmentation | Mammograms | DDSM | Random forest, RF-ELM | From scratch | 96.7 | Computationally exhaustive |
| 34 | Detection | Mammograms | MIAS | Random forest | From scratch | 94.09 | Overfitting of data |

5.1.2 Clustering

Clustering of mammographic images using deep neural network algorithms is applied in areas such as pattern recognition and image processing. It is also executed with machine learning algorithms such as K-means clustering and with other methodologies such as self-organizing maps. K-means clustering works with quantities such as the centroids, a distance measure, and the number of epochs [11, 19, 20]. Centroids are computed for the images, and the other points present in the image are clustered around the (initially random) centroids, which helps identify dense masses in the image. Self-organizing maps used for clustering rely on the Euclidean distance to cluster the various points present in the images; hyper-spherical clusters are identified using the Euclidean distance as an index of similarity [7, 19, 21]. Hyper-ellipsoidal clustering arises when the Mahalanobis distance is used instead, which is computed with the inverse of the covariance matrix.
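A plain K-means loop over the quantities mentioned above (centroids, a distance measure, and epochs) might look like this toy sketch (the data and names are illustrative, standing in for pixel or feature coordinates):

```python
import numpy as np

def kmeans(X, k=2, epochs=10, seed=0):
    """Plain k-means: random initial centroids, then alternate
    assignment (nearest centroid by Euclidean distance) and centroid
    update for a fixed number of epochs."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(epochs):
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

# two well-separated blobs standing in for dense and sparse regions
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centroids = kmeans(X, k=2)
# the two blobs end up in different clusters
```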

5.1.3 Regression

Regression algorithms in machine learning are used for the prediction of breast cancer through statistical analysis of the input data. Various regression algorithms are used for this analysis, such as logistic regression and fuzzy logistic regression, which essentially evaluate the extent of dependency between the independent variables and a fuzzy binary outcome with two groups, positive and negative, for the classification of breast cancer [3, 22]. The predicted probabilities are used for the modeling of the fuzzy logistic regression. However, the accuracy achieved by regression algorithms for breast cancer classification is comparatively low with respect to deep neural network models, as they fail to identify dependencies and relations between the features extracted and fed into the classifier [5, 9].
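A minimal (non-fuzzy) logistic regression fitted by gradient descent illustrates the idea of modeling a binary outcome from features (toy one-dimensional data; the names and hyperparameters are our own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=500):
    """Logistic regression by gradient descent; returns weights and bias.
    The model predicts P(positive class | features)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)
        w -= lr * X.T @ (p - y) / len(y)   # gradient of the log-loss
        b -= lr * np.mean(p - y)
    return w, b

# toy 1-D feature: larger value -> positive (e.g. malignant) class
X = np.array([[0.0], [0.2], [0.8], [1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = fit_logistic(X, y)
prob = sigmoid(np.array([0.9]) @ w + b)
# the predicted probability of the positive class at 0.9 is well above 0.5
```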

5.2 Deep Learning Approach

Machine learning algorithms have been used for over twenty years for the detection, diagnosis, and prediction of breast cancer, but their accuracy had not been up to the mark needed by the medical industry. The machine learning process followed a sequential flow, starting with the extraction of features from the images followed by classification with algorithms such as support vector machines, random forests, decision trees, and Naïve Bayes. The survey shows that CNN models are effectively used for the detection, segmentation, and classification of cancer cells from breast images. Different levels of features are necessary for training the network, and GoogleNet, ResNet, and VGGNet are used to analyze and mine them for the particular dataset of images being fed into them. Features used to be extracted manually for the machine learning algorithms to classify or detect tumors, which made the entire process tedious and compute intensive, whereas deep neural networks now perform the feature selection and extraction. The framework consists of various layers and procedures for the analysis and training on the dataset. Pre-trained models are used to mine information about the various datasets and to obtain the network parameters. This is followed by data pre-processing, where image quality is analyzed and methodologies are applied to reduce noise in the images and to remove backgrounds and other parts of the images that are not relevant to breast cancer. Then feature selection takes place, which helps select the features that are really important to the neural network model without obstructing information that is potentially useful to the network [3, 5]. This is followed by the segmentation, classification, or detection process, and the model is trained to predict the most accurate results.

5.2.1 Convolutional Neural Networks

Convolutional neural networks are deep neural network algorithms that take images as input, along with weights and biases, and align the parameters and features to yield the best possible performance. CNNs comprise various layers of nodes, including the convolution layers, where the output of each activation function is calculated by feeding it the input images along with weights and biases to map the significant highlights that make each image different from another [2, 13, 23]. After the inputs of the activation function have been computed, the final output is treated as a classification result, its value indicating which class the input belongs to. The network comprises different layers, namely the convolutional layers, pooling layers, fully connected layers, and finally the output layer; the pooling layers regularize the dimensionality and complexity of the neural network [24]. The main layers have their own unique functionalities:

1. The convolution layer consists of various segments that compute an intermediate output, which is then fed to the layers ahead. It contains convolutional filters that detect particular features on the feature map; this is done by convolving the filters over the image, creating a 2D feature map [4, 18].
2. Pooling is used to downsize the dimensions of the output features so that they become robust to variations in the features, thanks to the reduced resolution of the output features [1]. Activation functions are used for the excitation of the neurons in the network, and the normalization layer reduces the non-uniformity of the feature maps that would otherwise lead to inconsistencies during feature mapping.


Fig. 4 CNN workflow for breast image diagnosis [25]

3. Regularization is done in order to reduce overfitting of the model; the regularization function used here is dropout.
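The convolution and pooling layers described above can be sketched in a few lines of NumPy (a toy single-channel example with a fixed filter; real CNN layers add multiple learned filters, padding, and strides):

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D convolution (cross-correlation, as CNNs compute it):
    slide the filter over the image and take dot products."""
    kh, kw = kernel.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: downsizes the feature map and makes
    it robust to small spatial variations."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

img = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
fmap = conv2d(img, np.ones((3, 3)))              # 2x2 feature map
pooled = max_pool(np.maximum(fmap, 0.0))         # ReLU, then 2x2 max pool
# the pooled output keeps only the strongest activation per window
```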

The training of a neural network is done by mapping the input features of the images to the corresponding outputs and tuning the parameters after each iteration to make the model learn. Here, CNNs are used to classify thermograph images into two classes, malignant growth or benign. The steps of this process are dataset preparation and image pre-processing, feature learning, and classification. Various classification schemes can be used, such as binary (cancer or not cancer) or multi-class (healthy, malignant, and benign) (Fig. 4). The famous frameworks GoogleNet, ResNet, and VGGNet are trained on very large datasets, which makes the feature extraction process easier for the proposed framework in the architecture. Being trained on large datasets makes these models efficient at dealing with data inconsistencies and noise in the datasets.

(1) GoogleNet: This CNN model consists of convolution layers, pooling layers, rectified linear operation layers, and two fully connected layers. Using this architecture, we design a model that combines convolution filter layers of variable size into a single filter layer [9], which reduces the complexity overhead and minimizes the number of computational parameters. It contains an inception module, which helps reduce the computational expense as well as avoid overfitting of the data, and it significantly reduced the problem of vanishing gradients, which previously led to inconsistencies in the results.

(2) VGGNet: This CNN model consists of 13 convolutional layers, pooling and rectification layers, and three fully connected layers at the end, after the convolutional and pooling layers. The working mechanism of VGGNet is closely related to AlexNet, but the greater depth of the model makes it outperform the basic AlexNet network [4, 26]. The filter size and the pooling window are 3 × 3 and 2 × 2, respectively, which makes the convolutions deep, analyzing each pixel in the context of many surrounding grid cells. The most important aspect of VGGNet is a deep network with small filters. The architecture includes the input layer followed by one or more fully connected layers that match the dimensions of the network output. The convolutional layers use activation functions, mainly ReLU or leaky ReLU; the SoftMax activation function is also used for normalizing the image heat maps, which makes training on the dataset easier and computationally efficient [7, 27, 28]. The convolutional layers are accompanied by max pooling layers in each stage, where the map and filter sizes are common to all VGGNets irrespective of the use case. Batch normalization is also applied to speed up convergence, along with the regularization effect it has. The VGGNets differ in their number of convolutions, which is why the different VGGNets are named accordingly. There are five versions of the VGGNet, but the most efficient and accurate are the VGG16 and VGG19 models, as they are considered networks of networks. VGGNet has a uniform architecture with the same kernel size throughout its layers [29, 30].

(3) ResNet: These neural networks have intense computing power, achieve high performance even with a large number of layers, and thus achieve accurate results in ImageNet classification. ResNet is the most efficient solution to the vanishing-gradient problem in neural network models; several ResNet models have been trained and used for classification purposes [6, 9].

The vanishing of the gradient in other models was due to the fact that, upon adding multiple layers to the network, the repeated multiplication of gradients makes them vanish after a few added layers; this was solved by the introduction of ResNet. ResNets comprise eight or more model architectures, reaching up to 1202 layers, with ResNet-152 considered the most efficient in computational expenditure. The applied framework uses pre-trained models such as ResNet, GoogleNet, and VGGNet, which are trained on the ImageNet dataset, leading to improved selection and tuning of parameters and hyperparameters. This use of transfer learning helps gather information about all aspects of the images when training on multiple neural network models [2, 6, 11]. After feature extraction with the above-mentioned models, the features are combined with fully connected layers for the subsequent classification of breast cancer based on images captured and processed by different imaging modalities such as ultrasound, MRI, and X-ray. The classification is a binary problem with the classes benign and malignant.
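The identity-shortcut idea behind ResNet can be sketched as follows (a toy two-unit block with made-up weight matrices; real residual blocks use convolutions and batch normalization):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """A residual block: the output is relu(F(x) + x), so the gradient
    always has a direct path through the identity shortcut, which is
    what counters the vanishing-gradient problem in deep networks."""
    f = relu(W2 @ relu(W1 @ x))   # the learned residual function F(x)
    return relu(f + x)            # skip connection adds the input back

x = np.array([1.0, 2.0])
W_zero = np.zeros((2, 2))
y = residual_block(x, W_zero, W_zero)
# with zero weights F(x) = 0, so the block reduces to the identity: y == x
```

Because an untrained (or unhelpful) block can default to the identity, stacking many such blocks does not degrade the signal the way stacking plain layers does.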

AI Approaches for Breast Cancer Diagnosis: A Comprehensive Study

5.2.2 Generative Adversarial Neural Networks

Generative adversarial neural networks (GANs) [5] have been used mainly in the field of image processing and analysis. GANs are known to achieve high values of evaluation metrics in image processing even when little data is provided [6, 16, 27, 31], but they are not always efficient because of the high computational power needed for use cases such as object detection, object identification, and image segmentation. Segmentation methods are used for the image feature extraction process; as the number of layers in the feature extraction stage increases, the feature retention capacity grows, which helps the classification algorithms achieve better accuracy [15, 27]. After the images in the dataset are segmented, prediction must also account for the shape of the tumor. For this purpose, a deep neural network named PointNet was proposed that uses point clouds for object classification and for local and global segmentation; 3D shape classifiers are also used for shape classification [9, 26, 32] (Fig. 5). The cancer prediction process starts by forming a region of interest in the images, based on coordinate marking and distance calculation, to delimit the tumor region in three dimensions. The prepared data is then fed to a conditional GAN (cGAN) to create a binary mask of the tumor, and further processing with morphological operations removes the speckles. The generator network of the cGAN learns the intrinsic features of both tissue classes, such as gradients, edges, and shapes, and then creates a binary mask for the images based on these features [1, 7, 11]. The discriminator network assesses the binary mask generated by the generator.
Various activation functions such as ReLU, Tanh, and Leaky ReLU are applied to the input data, and a dropout regularization layer prevents vanishing gradients and overfitting of the neural network model [5, 7].
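The "removal of speckles" step mentioned above is a morphological opening (erosion followed by dilation) of the binary mask. A minimal pure-Python illustration on a 2-D list follows; real pipelines would use a library such as OpenCV, and the 3×3 structuring element is an assumption of this sketch:

```python
def _neighborhood(mask, i, j):
    """Collect the 3x3 neighborhood of cell (i, j), clipped at the borders."""
    h, w = len(mask), len(mask[0])
    return [mask[a][b]
            for a in range(max(0, i - 1), min(h, i + 2))
            for b in range(max(0, j - 1), min(w, j + 2))]

def erode(mask):
    """A cell survives only if its whole neighborhood is foreground."""
    return [[1 if all(_neighborhood(mask, i, j)) else 0
             for j in range(len(mask[0]))] for i in range(len(mask))]

def dilate(mask):
    """A cell becomes foreground if any neighbor is foreground."""
    return [[1 if any(_neighborhood(mask, i, j)) else 0
             for j in range(len(mask[0]))] for i in range(len(mask))]

def remove_speckles(mask):
    """Morphological opening: isolated pixels vanish, solid regions survive."""
    return dilate(erode(mask))
```

Erosion deletes any foreground pixel that lacks a fully foreground neighborhood, which wipes out one-pixel speckles; dilation then restores the interior regions that survived.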

Fig. 5 GAN workflow [33]

H. J. Patel et al.

5.2.3 Deep-Belief Neural Networks

Deep-belief neural networks (DBNs) [27] are used for prediction and classification via unsupervised learning and probabilities, and they consist of directed and undirected layers in the network architecture. The initial layers of a DBN learn and retrieve information about the input data, rather than performing only the feature extraction for edges and minute features carried out by various convolutional neural network models [9, 21]. DBNs are built from restricted Boltzmann machines (RBMs), in which the layers of the architecture are connected with each other throughout the model. The connections vary: the top layers are undirected and form an associative memory, whereas the lower layers are directed, processing information as feature vectors extracted and learnt in the earlier layers [16, 34]. DBNs are trained with greedy learning algorithms, which start the learning and tuning procedure from the bottom and continue tuning the parameters in each subsequent layer. These greedy algorithms optimize the weights of each layer in a fast, efficient, ordered manner [10, 12].
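Each greedy stage of DBN training fits one RBM, usually with contrastive divergence. A single CD-1 update for a tiny RBM can be sketched as follows; the biases are omitted and the sizes are arbitrary, both assumptions of this illustration rather than details given in the survey:

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(p):
    """Bernoulli sample of a binary unit with activation probability p."""
    return 1 if random.random() < p else 0

def cd1_step(v0, W, lr=0.1):
    """One contrastive-divergence (CD-1) weight update for one RBM layer.
    W[i][j] connects visible unit i to hidden unit j."""
    n_v, n_h = len(W), len(W[0])
    # Up pass: hidden probabilities given the data vector v0.
    h0 = [sigmoid(sum(v0[i] * W[i][j] for i in range(n_v))) for j in range(n_h)]
    h0_s = [sample(p) for p in h0]
    # Down pass: reconstruct the visible layer from the hidden sample.
    v1 = [sigmoid(sum(h0_s[j] * W[i][j] for j in range(n_h))) for i in range(n_v)]
    # Up pass again on the reconstruction.
    h1 = [sigmoid(sum(v1[i] * W[i][j] for i in range(n_v))) for j in range(n_h)]
    # Data correlations minus reconstruction correlations.
    for i in range(n_v):
        for j in range(n_h):
            W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
    return W
```

Stacking RBMs greedily means: train one such layer, freeze it, use its hidden activations as the "visible" data for the next RBM, and repeat, exactly the bottom-up tuning order described above.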

6 Techniques to Improve Performance of AI Approaches

There are certain aspects where improving the techniques would improve classification results, such as the feature extraction stage, where minute details are missed because of noise in the images [15, 22, 35]. Imaging modalities play a crucial role in developing and improving the process by capturing the smallest details, which helps in better understanding and relating the feature vectors. Multi-parametric CAD structures built on various imaging protocols, such as the magnetic resonance imaging protocol, have improved the imaging procedure and the extent of feature extraction from breast images.

6.1 Data Augmentation

A CNN model contains a large number of parameters that must be tuned for efficient functioning, which is possible only with large datasets. In the real world, such datasets are difficult to find, as very few public datasets are available for use in computational science [15, 19]. This calls for data augmentation techniques, which help create datasets for training CNN models for the best possible performance. This is done using techniques such as flipping, cropping, and


changing various other features of the images, which leads to the creation of potentially new images. Data augmentation is needed when the ground-truth information on a topic is insufficient to predict the class of a particular data point. Deep learning models tend to overfit on the data, or memorize the features of the dataset, which makes them vulnerable to foreign datasets or situations [5, 8]. The augmented images help the model cope with training on particularly noisy images owing to its robustness. Translation of images is also an effective augmentation, as it shifts the image by a few pixels by applying padding in a particular direction; this makes the model learn spatially invariant features as well. Scaling the pictures in the training set helps deep neural network models learn the deep features that are crucial to the feature vector extracted from the images [12]. As images can be scaled in the X and Y directions of the image plane, the tumoral features can be scaled and extracted at a more granular level. Augmentation is used to avoid overfitting by providing larger datasets to train on, which eventually lowers overfitting on the training data and underfitting on the test data [22]. The process increases the dataset size by resizing, cropping, and manipulating image resolution. Generative adversarial networks have also been used for data augmentation by creating artificial data on which deep neural networks can be trained, as the generated images tend to be indistinguishable from the original images by the discriminator; this is mainly achieved with a coarse-to-fine generator [7, 15]. Using GANs for augmentation makes the data and the training process robust and free of inconsistencies.
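Two of the simplest augmentations mentioned above, horizontal flipping and translation with padding, can be sketched on plain 2-D pixel lists. Real pipelines operate on image tensors, and the zero fill value for the padded border is an assumption of this sketch:

```python
def hflip(img):
    """Horizontal flip: mirror every row of pixels."""
    return [row[::-1] for row in img]

def translate_right(img, dx, fill=0):
    """Shift the image right by dx pixels, padding the left edge.
    The image keeps its original width, so the dx rightmost pixels drop off."""
    return [[fill] * dx + row[:len(row) - dx] for row in img]

# Each call yields a 'potentially new' training image from the same original.
augmented = [hflip([[1, 2, 3]]), translate_right([[1, 2, 3]], 1)]
```

Because the label (benign or malignant) is unchanged by these transforms, each augmented copy is an extra labelled sample at zero annotation cost.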

6.2 Transfer Learning

Transfer learning increases model accuracy by not relying on a single dataset; the model is trained on various larger datasets so that its parameters are tuned efficiently. A configuration of a neural network model can be reused across other domains and use cases, and by inheriting data it increases the knowledge volume of the model, letting it predict efficiently even when faced with unusual or inconsistent data. It also improves the evaluation metrics, as the parameters are tuned in the best possible way before being trained and tested on a particular dataset [2, 20, 36]. The overall architecture starts with data augmentation, in which the dataset is enlarged for better training capacity and reduced underfitting. Image datasets can be augmented with horizontal and vertical flips, contrast adjustments, image rotations, and brightness adjustments. Pre-processing techniques then remove the noise and inconsistencies in the dataset's images, as these reduce the accuracy of the model.


Feature extraction is performed on the pre-processed images to obtain the useful features that are fed into the classification model; features that are not useful increase computation cost and reduce the performance of the model. Feature extraction is done using convolutional neural network models such as a DCNN, and classification on the extracted features, fused into a single feature vector, is done with a multilayer perceptron classifier [9, 13]. Transfer learning is not limited to image processing but applies to every domain of machine learning: natural language processing also uses pre-trained models such as long short-term memory networks and BERT for text processing and analysis. Pre-trained image feature extractors increase accuracy and accelerate learning because the deep neural network is trained on a source dataset before being applied to the target dataset. Low-level features learnt from the source dataset, such as contours, edges, and curves, can be extracted directly from the target dataset images without intensive processing [1].
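The split described above, a frozen pre-trained extractor feeding a small trainable classification head, can be sketched with a toy perceptron head. The `frozen_features` function here is a made-up stand-in for a pre-trained network's penultimate layer, and the 0/1 labels stand for benign/malignant; both are assumptions of this illustration:

```python
def frozen_features(x):
    """Stand-in for a pre-trained extractor: its weights are never updated.
    The constant last entry acts as a bias feature."""
    return [x[0] + x[1], x[0] - x[1], 1.0]

def train_head(data, epochs=20, lr=0.1):
    """Perceptron training of ONLY the head weights on (input, label) pairs;
    labels are 0 (benign) or 1 (malignant)."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            f = frozen_features(x)
            pred = 1 if sum(wi * fi for wi, fi in zip(w, f)) > 0 else 0
            for k in range(len(w)):
                w[k] += lr * (y - pred) * f[k]
    return w
```

Only the three head weights are optimized; the extractor is reused as-is, which is exactly why transfer learning needs far less target data than training end to end.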

7 Research Opportunities and Challenges

The present imaging modalities and classification algorithms leave room for improvement: the imaging technologies lack the ability to process minute information and pick up a lot of noise, which creates inconsistencies in the features extracted from the images. Noise reduction algorithms can be applied to the images to reduce these inconsistencies. The deep neural networks used for classification and image processing can be improved by enhancing the feature extraction and dependency creation techniques, which may help in learning the feature vectors extracted from the images. Choosing classifiers that fit the datasets well is crucial, given the overfitting and underfitting consequences of training. Mining the dependencies among the images in the dataset is very important and has not yet been researched in depth. Scaling the information in the images is crucial so that these dependencies can be mined for better feature extraction and noise removal. Ensembles of decision trees have proved to be the most efficient classifiers, but the removal of microcalcifications caused by subtraction still needs to be addressed to achieve accurate classification of the data.

8 Conclusion

Breast cancer is a lethal disease that can lead to death in women if not diagnosed at the right time. It is mainly caused by tumors made up of cancerous cells. Diagnosis of breast cancer involves the analysis of breast images with


consideration of the various features present. With a large number of patients and images to be studied, a large amount of time must be devoted to each one, making the entire procedure tedious and at times inconsistent. Computer-aided diagnosis uses computational technology and various imaging modalities for the segmentation, detection, and classification of the samples. It helps generate better-quality images without unnecessary information, along with precise classification tools. Noise removal mechanisms are important in the process to prevent the loss of essential features or misinterpretation of the images in the dataset. Various machine learning and deep learning models and algorithms are used for this task, involving pre-processing, feature extraction, feature selection and classification, segmentation, and detection based on the feature vectors extracted from the breast images. Manual diagnosis can also produce errors and inconsistencies, which CAD-based breast cancer diagnosis resolves. The purpose of the CAD model is to assist practitioners and radiologists in their routine clinical work. This study discusses techniques for enhancing image quality and results with the help of augmentation techniques and feature extraction using numerous pre-trained neural network models such as ResNet, GoogleNet, and VGGNet. Various deep learning networks for classification, segmentation, and detection, such as CNNs, GANs, and DBNs, are explained briefly, and various applications of machine learning for breast cancer diagnosis are presented. The use of these AI approaches to build CAD models for breast cancer diagnosis has resulted in fewer inconsistencies and oversight errors in the diagnosis process.

References

1. S. Lowes, A. Leaver, A. Redman, Diagnostic and interventional imaging techniques in breast cancer. Surgery (Oxford) 37, 02 (2019)
2. G. Murtaza, L. Shuib, A. Wahid, G. Mujtaba, H. Nweke, M. Al-Garadi, F. Zulfiqar, G. Raza, N. Azmi, Deep learning-based breast cancer classification through medical imaging modalities: state of the art and research challenges. Artif. Intell. Rev. 05 (2019)
3. R. Roslidar, A. Rahman, R. Muharar, M.R. Syahputra, F. Arnia, M. Syukri, B. Pradhan, K. Munadi, A review on recent progress in thermal imaging and deep learning approaches for breast cancer detection. IEEE Access 8, 116176–116194 (2020)
4. G. Yunchao, Y. Jiayao, Application of computer vision and deep learning in breast cancer assisted diagnosis 01, 186–191 (2019)
5. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778
6. C. Kaushal, S. Bhat, D. Koundal, A. Singla, Recent trends in computer assisted diagnosis (CAD) system for breast cancer diagnosis using histopathological images. IRBM 40, 06 (2019)
7. S.H. Heywang-Köbrunner, A. Hacker, S. Sedlacek, Advantages and disadvantages of mammography screening. Breast Care 6(3), 199–207 (2011)
8. M.I. Daoud, S. Abdel-Rahman, R. Alazrai, Breast ultrasound image classification using a pretrained convolutional neural network, in 2019 15th International Conference on Signal-Image Technology Internet-Based Systems (SITIS), 2019, pp. 167–171
9. I.U. Din, J. Rodrigues, N. Islam, A novel deep learning based framework for the detection and classification of breast cancer using transfer learning. Pattern Recogn. Lett. 125, 04 (2019)


10. N.S. Ismail, C. Sovuthy, Breast cancer detection based on deep learning technique, in 2019 International UNIMAS STEM 12th Engineering Conference (EnCon), 2019, pp. 89–92
11. M. Karabatak, A new classifier for breast cancer detection based on Naïve Bayesian. Measurement 72, 32–36 (2015)
12. S.-H. Heywang-Köbrunner, A. Hacker, S. Sedlacek, Advantages and disadvantages of mammography screening. Breast Care 6(3), 199–207 (2011)
13. S. Guan, N. Kamona, M. Loew, Segmentation of thermal breast images using convolutional and deconvolutional neural networks, in 2018 IEEE Applied Imagery Pattern Recognition Workshop (AIPR) (IEEE, 2018, Oct), pp. 1–7
14. E.Y.K. Ng, Y. Chen, Segmentation of breast thermogram: improved boundary detection with modified snake algorithm. J. Mechan. Med. Biol. 6(02), 123–136 (2006)
15. S.H. Heywang-Köbrunner, A. Hacker, S. Sedlacek, Advantages and disadvantages of mammography screening. Breast Care 6(3), 199–207 (2011)
16. H. Xu, T. Chen, J. Lv, J. Guo, A combined parallel genetic algorithm and support vector machine model for breast cancer detection. J. Comput. Method Sci. Eng. 16(4), 773–785 (2016)
17. R. Pillai, P. Oza, P. Sharma, Review of machine learning techniques in health care, in Proceedings of ICRIC 2019 (Springer, Cham, 2020), pp. 103–111
18. K.M. Prabusankarlal, P. Thirumoorthy, R. Manavalan, Assessment of combined textural and morphological features for diagnosis of breast masses in ultrasound. Human Centr. Comput. Inform. Sci. 5(1), 12 (2015)
19. A.S. Abdel Rahman, S.B. Belhaouari, A. Bouzerdoum, H. Baali, T. Alam, A.M. Eldaraa, Breast mass tumor classification using deep learning, in 2020 IEEE International Conference on Informatics, IoT, and Enabling Technologies (ICIoT), 2020, pp. 271–276
20. The digital database for screening mammography, 2001. Available http://www.eng.usf.edu/cvprg/Mammography/Database.html
21. A. Karahaliou, S. Skiadopoulos, I. Boniatis et al., Texture analysis of tissue surrounding microcalcifications on mammograms for breast cancer diagnosis. Br. J. Radiol. 80(956), 648–656 (2007)
22. Cancer Imaging Archive. https://www.cancerimagingarchive.net/nbia-search
23. A. Aksac, D.J. Demetrick, T. Ozyer et al., BreCaHAD: a dataset for breast cancer histopathological annotation and diagnosis. BMC Res. Notes 12, 82 (2019). Available https://doi.org/10.1186/s13104-019-4121-7
24. P. Tiwari, J. Qian, Q. Li, B. Wang, D. Gupta, A. Khanna, J. Rodrigues, V. Albuquerque, Detection of subtype blood cells using deep learning. Cogn. Syst. Res. (Elsevier). https://doi.org/10.1016/j.cogsys.2018.08.022
25. N. Ponraj, A survey on the preprocessing techniques of mammogram for the detection of breast cancer. J. Emerg. Trends Comput. Inf. Sci. 2
26. A. Mizushima, R. Lu, An image segmentation method for apple sorting and grading using support vector machine and Otsu's method. Comput. Electron. Agric. 94, 29–37 (2013)
27. Q. Zhou, Z. Li, J.K. Aggarwal, Boundary extraction in thermal images by edge map, in Proceedings of the 2004 ACM Symposium on Applied Computing (ACM, 2004, March), pp. 254–258
28. T. Ayer, O. Alagoz, J. Chhatwal, J.W. Shavlik, C.E. Kahn, E.S. Burnside, Breast cancer risk estimation with artificial neural networks revisited. Cancer 116, 3310–3321 (2010)
29. H. Asri, Using machine learning algorithms for breast cancer risk prediction and diagnosis. Proc. Comput. Sci. 83, 1064–1069 (2016)
30. R.L. Siegel, K.D. Miller, A. Jemal, Cancer statistics. CA Cancer J. Clin. 69(1), 7–34 (2019)
31. L. Ein-Dor, O. Zuk, E. Domany, Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc. Natl. Acad. Sci. 103, 5923–5928 (2006)
32. Y. Qiu, S. Yan, R.R. Gundreddy et al., A new approach to develop computer-aided diagnosis scheme of breast mass classification using deep learning technology. J. Xray Sci. Technol. 25(5), 751–763 (2017)
33. S. Dabeer, Cancer diagnosis in histopathological image: CNN based approach. Inform. Med. Unlocked 16, 100231 (2019)


34. S. Saini, R. Vijay, Optimization of artificial neural network breast cancer detection system based on image registration techniques. Optimization 105(14), 26–29 (2014)
35. J.A. Alzubi, A. Kumar, O.A. Alzubi, R. Manikandan, Efficient approaches for prediction of brain tumor using machine learning techniques. Ind. J. Publ. Health Res. Dev. (2019). https://doi.org/10.5958/0976-5506.2019.00298.5
36. K. Park, A. Ali, D. Kim, Y. An, M. Kim, H. Shin, Robust predictive model for evaluating breast cancer survivability. Engl. Appl. Artif. Intell. 26, 2194–2205 (2013)

Energy-Efficient Lifetime and Network Performance Improvement for Mobility of Nodes in IoT Y. Chitrashekharaiah, N. N. Srinidhi, Dharamendra Chouhan, J. Shreyas, and S. M. Dilip Kumar

Abstract Internet of Things (IoT) plays a significant role in the day-to-day activities of the modern world. Most IoT devices are battery-operated, so energy management is the most important factor for the proper functioning of an application in the network. Mobility of nodes in an IoT network is a major challenge with respect to energy consumption, and it degrades IoT network lifetime. In the proposed work, mobility and energy management of IoT nodes are considered in order to achieve better network performance and to improve the IoT network lifetime. Here, a congestion control algorithm detects congestion and mitigates it by adjusting the contention window through clear channel assessment (CCA), which increases the throughput of the IoT network. The simulation findings suggest that the CCA congestion strategies significantly boost the performance of the network and provide optimal utilization of energy when compared to similar existing methods.

Keywords Congestion · CCA · Energy efficiency · IoT · Mobility · Network · QoS

1 Introduction

Y. Chitrashekharaiah (B) · D. Chouhan · J. Shreyas · S. M. Dilip Kumar
Department of Computer Science and Engineering, University Visvesvaraya College of Engineering, Bangalore, India
N. N. Srinidhi
Department of Computer Science and Engineering, Sri Krishna Institute of Technology, Bangalore, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_34

IoT is the concept that integrates all real-world objects so that they can communicate with each other to perform a desired application task. As the number of devices or objects increases day by day, communication among different devices becomes difficult [1], and the energy used for communication [2] increases. This is a real-world challenge for the proper optimization of energy in a network. An IoT network operates in a resource-constrained environment: devices have little storage and low processing power, and they are battery-operated. Energy for battery-operated


devices is an important resource because in IoT applications the nodes or devices are deployed in remote areas. With the gradual improvement of the Internet, demand for IoT applications operating in restricted environments increases. To boost network performance, each layer of the IoT architecture model is standardized [3], which includes node location, route discovery, medium access control (MAC), UDP, etc. This paper considers the MAC layer and a mobility-aware network; the MAC layer has the responsibility of sensing the communication medium and utilizing it optimally for maximum access so that energy efficiency is achievable. The MAC layer uses three types of protocol, namely contention-based, reservation-based, and channelization-based. The objective of these protocols is to reduce the collision and error rate in the communication medium. The proposed work uses a congestion control algorithm to detect congestion and mitigate it by adjusting the contention window through the CCA mechanism, to achieve better performance and improve the lifetime of the nodes in the network. The contributions of this work include:

• The proposed work provides optimal utilization of energy for mobility-aware nodes.
• Congestion control with a CCA mechanism is proposed to mitigate congestion.
• The proposed work improves network performance by increasing throughput, etc.

The paper is organized as follows. Section 2 reviews various works on energy conservation and network performance improvement. Section 3 presents the problem statement of the paper. The network architecture is defined in Sect. 4. The proposed methodology and algorithm are presented in Sect. 5. Simulation and results are discussed in Sect. 6, and conclusions along with future work are presented in Sect. 7.

2 Related Work

This section discusses the most recent research studies undertaken in the field of energy management systems and performance enhancement. Semasinghe et al. [4] discussed the theory of effective resource management in IoT. To avoid resource allocation conflicts, game theory concepts are used, but these are not suitable for large-scale applications, so they concentrate on non-conventional game-theoretic models that suit the inherent features of large-scale IoT organization. The downside of this approach is that one game model is insufficient to solve every problem that occurs when distributed resource management systems are configured in large IoT deployments. Henkel et al. [5] explored and outlined the IoT model with a specific emphasis on energy use and methodologies for its minimization. Dynamic voltage and frequency scaling (DVFS) technologies can provide per-core voltage and frequency settings. For a core to sustain a given frequency stably, the supply voltage must be


set above a minimum value. In terms of power and energy, IoT devices can benefit from voltage scaling, but at the same time they can suffer a severe reduction in reliability. Colistra et al. [6] described the concept of consensus protocols; new consensus protocols are primarily used to increase the lifespan of the network and control the required quality of service (QoS). The proposed protocol also allowed network lifetime to be enhanced, since the lifetime of each node needs to be equal across all nodes engaged in the same mission. The limitation is that the protocol is not suitable for multitasking optimization. Luo et al. [7] proposed energy saving by probabilistic dissemination (ENSPD) to select the optimum energy policy and enhance the life of the whole network. They implemented the idea of an equivalent node to pick the required relay node in order to minimize resource use and achieve maximum data transmission. The limitation is that the equivalent-node concept cannot be applied to a more universal, high-dimensional model. Agarwal et al. [8] designed a stochastic model to estimate the energy usage of sensor nodes per operating cycle and defined predicted lifetime models. This model facilitates a practical estimation of wireless sensor network lifetime, as there is a clear functional dependence between the lifetime of the network and that of the sensor nodes, and it measures the stochastic behavior of sensor nodes using semi-Markov theory. The model is not ideal for effectively building self-powered sensor nodes.

3 Problem Description and Objectives

The IoT consists of linking a wide variety of objects with the goal of making human life simpler. The IoT uses AI to make decisions that lead to the creation of a smart city. At the same time, most devices connected to an IoT system tend to suffer from short battery life. Since nodes are most often installed in very unusual locations, a node's relation to its power supply is a very important concern.

3.1 Problem Statement

The problem is to design a scheme/method that boosts network performance by improving MAC-layer QoS, such as throughput, and provides optimized energy utilization for each node in a mobility-aware IoT local network.


3.2 Objectives

• To provide optimal energy utilization.
• To mitigate congestion.
• To boost network performance.

4 Network Architecture

4.1 Network Model

Let us consider an IoT local network consisting of N nodes/devices ψ = {ψ_1, ψ_2, …, ψ_N} and a gateway, as shown in Fig. 1. The nodes in the network are initially positioned in a grid structure; then, using a mobility model, the nodes move randomly in all directions. As the nodes shift position, the communication between them varies, resulting in increased energy consumption. As the number of nodes increases, network congestion increases, causing packet transmission delays and degrading the network's performance/lifetime. Each of these nodes is linked to the gateway using dedicated IoT wireless technology.
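The setup described above, a grid of nodes that then move randomly, can be sketched as follows. The 20 m spacing echoes the node distance later listed in Table 2, but the step size and movement rule are assumptions, since the paper does not name a specific mobility model:

```python
import random

random.seed(1)

def grid_positions(n, spacing=20.0):
    """Place n nodes on a square grid, the network's initial layout."""
    side = int(n ** 0.5 + 0.999)  # smallest side so side*side >= n
    return [((i % side) * spacing, (i // side) * spacing) for i in range(n)]

def random_step(pos, max_move=5.0):
    """Move every node by a random offset in any direction, a simple
    stand-in for the unspecified mobility model."""
    return [(x + random.uniform(-max_move, max_move),
             y + random.uniform(-max_move, max_move)) for x, y in pos]

nodes = random_step(grid_positions(25))  # 25 nodes after one mobility step
```

Repeating `random_step` each simulation tick yields the varying inter-node distances, and hence the varying communication energy, that the paragraph above describes.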

Fig. 1 Mobility-aware IoT network with N devices/nodes and gateway


Fig. 2 Energy consumption model of each node

4.2 Energy Model

Figure 2 shows that each node has an energy module consisting of an energy source module and an energy dissipation module. The energy source module is provided with a battery, and the energy dissipation module has three parts: a sensor, a multiprocessor system-on-chip (MPSoC), and a signal transceiver. Sensors acquire application data from the surrounding region, the MPSoC performs application processing, and the transceiver establishes communication among the IoT nodes and gateways. The total energy consumption E_con of every IoT network device consists of the energy utilization of the sensors E_sen, the energy utilization of the MPSoC E_prs, and the energy utilization of the signal transceiver E_sgn. Therefore, E_con is formulated as

E_con = E_sen + E_prs + E_sgn    (1)
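Eq. (1), and the battery drain it implies over time, can be sketched directly. The per-cycle framing and the numeric values are illustrative assumptions, not figures from the paper:

```python
def total_energy(e_sen, e_prs, e_sgn):
    """Eq. (1): E_con = E_sen + E_prs + E_sgn."""
    return e_sen + e_prs + e_sgn

def remaining_energy(initial, e_con, cycles):
    """Battery left after a number of operating cycles, floored at zero."""
    return max(0.0, initial - cycles * e_con)

e_con = total_energy(0.01, 0.02, 0.03)       # joules per cycle (made-up values)
left = remaining_energy(10.0, e_con, 100)    # 10 J initial energy, as in Table 2
```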

5 Proposed Method and Algorithm

5.1 Methodology

For battery-powered IoT applications, the proposed work analyzes the issue of network lifetime and network efficiency. The congestion management algorithm senses congestion and mitigates it by changing the contention window through clear channel assessment (CCA), which improves packet delivery in the network. When the channel is idle, the nodes enter a random back-off state in order to prevent collisions between nodes. The back-off time is generated as shown in Eq. (2):

BackoffTime = random() × slotTime    (2)

Here, random() is a random number between (0, CW).
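Eq. (2) translates directly into code. The paper does not state whether the upper bound CW is inclusive, so the inclusive integer draw below is an assumption, as are the example slot values:

```python
import random

def backoff_time(cw, slot_time):
    """Eq. (2): BackoffTime = random() x slotTime, with random() drawn
    uniformly from the integers in [0, CW]."""
    return random.randint(0, cw) * slot_time

t = backoff_time(31, 20e-6)  # e.g. CW = 31 with 20 microsecond slots
```

Because every node draws its own slot count independently, two nodes that sensed the channel idle at the same moment rarely transmit in the same slot, which is how the back-off prevents collisions.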


Upon congestion of the next-hop node, the queue length of the next-hop node is taken as a metric to adaptively adjust the size of the current node's contention window and accomplish the congestion-reduction objectives. The current node's CW is adjusted as shown in Eq. (3):

CW = (CW_min + 1) × 2^n − 1    (3)

Here, CW is the adjusted initial value of the contention window of the current node, CW_min is the minimum value of the contention window, and n is the contention window adjustment parameter given in Eq. (4):

n = ((Q_new − t)/(Q_max − t)) × log2((CW_max + 1)/(CW_min + 1))    (4)

where Q_new is the queue length of the next node as seen by the current node.

5.2 Algorithm

Input: Queue length.
Output: Contention window (CW) size.
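Eqs. (3) and (4) can be checked numerically with a short sketch: at the congestion threshold t the window stays at CW_min, and a full queue pushes it to CW_max. The CW_min/CW_max defaults below are illustrative 802.11-style values, not figures given in the paper:

```python
import math

def adjustment_parameter(q_new, t, q_max, cw_min, cw_max):
    """Eq. (4): the exponent n grows with how far the next hop's queue
    length q_new lies above the congestion threshold t."""
    return ((q_new - t) / (q_max - t)) * math.log2((cw_max + 1) / (cw_min + 1))

def adjusted_cw(q_new, t, q_max, cw_min=15, cw_max=1023):
    """Eq. (3): CW = (CW_min + 1) * 2**n - 1."""
    n = adjustment_parameter(q_new, t, q_max, cw_min, cw_max)
    return (cw_min + 1) * 2 ** n - 1

# Window at the threshold versus at a full queue (t = 10, Q_max = 50 assumed):
low, high = adjusted_cw(10, 10, 50), adjusted_cw(50, 10, 50)
```

The log2 term in Eq. (4) is exactly what makes the two endpoints land on CW_min and CW_max: substituting Q_new = Q_max gives (CW_min + 1) × (CW_max + 1)/(CW_min + 1) − 1 = CW_max.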

Table 1 Terminologies

QL_curr: Length of the buffer queue for the current node
Qmax: Maximum buffer queue length of a node
t: Congestion threshold
Cs_next: Congestion status of the next-hop node
Q_length: Holds the buffer queue capacity
Cs_curr: Congestion status of the current node
CWmin: The minimum value of the contention window
CWnew: New CW value of the next node as recorded at the current node
QL_nxt: Length of the buffer queue for the next node

The congestion control algorithm, which reduces the receiving rate of the next-hop node, should be applied when Cs_next = 1, and it has been proposed in line with this theory. Its principle is as follows: the current node observes that the next-hop congestion flag indicates congestion (Cs_next = 1). The current node's priority for accessing the channel is then decreased by increasing its CW, thereby reducing its data transmission rate. In other words, the data reception rate of the next-hop node is decreased, thus mitigating the congestion of the next-hop node (Table 1).

6 Simulation Analysis and Result

6.1 Simulation Model

The simulation framework is built on the NS3 discrete-event simulator as the computing environment. The simulation study shows that the proposed method of improved clear channel assessment (CCA) gives better performance than the distributed coordination function (DCF) algorithm. Performance is validated through a set of metrics: average time delay, throughput, and remaining energy.

6.2 Simulation Settings

Simulations are carried out to measure the performance of the proposed method with respect to the MAC-layer QoS parameters by varying the number of nodes. The simulation parameters are shown in Table 2.

Y. Chitrashekharaiah et al.

Table 2 Simulation parameters

Environment: Ubuntu 16.04 + NS 3.25
Node distance: 20 m
Packet size: 600 bytes
Simulation duration: 100 s
Initial energy: 10 J
Number of nodes: 25, 50, 100, 150, 200
Routing protocol: AODV

Fig. 3 Analysis of remaining energy with simulation time

6.3 Result Analysis

As the simulation time proceeds, the remaining energy of the network decreases, as shown in Fig. 3, because more nodes participate in the communication and energy is consumed in transmitting and receiving data between nodes; in the proposed method, however, the remaining energy decreases more slowly than with the DCF algorithm. Figure 4 analyses throughput against node density. As the number of nodes increases from 50 to 200, throughput initially stays at its saturation point and then decreases gradually, because a larger number of nodes increases congestion; this congestion is mitigated by the proposed method, which therefore provides better throughput than the DCF algorithm. As shown in Fig. 5, the average delay increases with the number of nodes. In the baseline DCF algorithm, the average delay becomes very high as node density reaches 200, but the proposed method reduces the delay through its congestion control technique, thereby improving the performance of the network.


Fig. 4 Analysis of throughput with node density

Fig. 5 Analysis of average delay with node density

7 Conclusion and Future Work

Because a large number of devices/objects are being added to the IoT network, the complexity of the IoT network increases. The large number of IoT devices connected to the network increases communication complexity among devices owing to congestion in the network, and this congestion reduces network performance. To avoid this, the proposed method combines a congestion control algorithm with a CCA mechanism: it detects congestion and mitigates it by adjusting the contention window of congested nodes. The simulation findings suggest that the CCA congestion strategies significantly boost average throughput and substantially reduce average delay. The method provides efficient node energy consumption and mitigates network congestion in various network settings. In future, secure data transfer among nodes can be proposed to defend against malicious attacks, since IoT nodes are mobile in nature and therefore more exposed to attackers.


References

1. N.N. Srinidhi, C.S. Sagar, S. Deepak Chethan, J. Shreyas, S.M. Dilip Kumar, An improved PRoPHET-Random Forest based optimized multi-copy routing for opportunistic IoT networks. Internet Things 11, 100203 (2020)
2. N.N. Srinidhi, G.P. Sunitha, S. Raghavendra, S.M. Dilip Kumar, V. Chang, Hybrid energy efficient and QoS-aware algorithm for intelligent transportation system in IoT. Int. J. Grid Utility Comput. 11(6), 815–826 (2020)
3. J. Shreyas, S.M. Dilip Kumar, A survey on computational intelligence techniques for internet of things, in Communication and Intelligent Systems, Lecture Notes in Networks and Systems, vol. 120 (Springer Nature Singapore Pte Ltd., 2020)
4. P. Semasinghe, S. Maghsudi, E. Hossain, Game-theoretic mechanisms for resource management in massive wireless IoT systems. IEEE Commun. Mag. 55, 121–127 (2017)
5. J. Henkel, S. Pagani, H. Amrouch, L. Bauer, F. Samie, Ultralow power and dependability for IoT devices (invited paper for IoT technologies), in Proceedings of the IEEE Design, Automation and Test in Europe Conference and Exhibition, 2017, pp. 954–959
6. G. Colistra, V. Pilloni, L. Atzori, Objects that agree on task frequency in the IoT: a lifetime-oriented consensus-based approach, in IEEE World Forum on Internet of Things, 2014, pp. 383–387
7. J. Luo, D. Wu, C. Pan, J. Zha, Optimal energy strategy for node selection and data relay in WSN-based IoT. Mobile Netw. Appl. 20(2), 169–180 (2015)
8. V. Agarwal, R.A. DeCarlo, L.H. Tsoukalas, Modeling energy consumption and lifetime of a wireless sensor node operating on a contention-based MAC protocol. IEEE Sens. J. (2017)

Design and Implementation of Electronic Voting Using KECCAK256 Algorithm on Ethereum Network

Oluwatosin James Fayemi, Aderonke Favour-Bethy Thompson, and Olaniyi Abiodun Ayeni

Abstract Electronic voting (e-voting) is an electronic method of voting. This voting method makes it easier to vote and encourages more people to vote, while also eradicating the hooliganism and violence that impede elections in many countries. It is a transparent form of voting provided it has not been hacked; hence the need for the Ethereum blockchain network in the e-voting system. Transactions on the blockchain are evenly distributed, ensuring that it is decentralized: there is no single database where data is stored, which ensures tight security. This paper therefore seeks to develop an e-voting system based on Ethereum using the KECCAK256 algorithm, creating a voting scheme that is transparent, reliable, and visible and that provides easy access to the election process from the registration phase through to the results phase. In developing this system, we made use of React.js for building the user interface, Node.js for executing JavaScript code outside the web browser, and Solidity for programming the smart contract. The system was successfully tested on Ganache, the local blockchain used in implementing the system; Google Chrome provided the user interface through which the user could access the system to vote; and MetaMask was used for accessing the Ethereum platform, where authentication and authorization take place. Comparing the system's performance with other existing systems showed that it was above average.

Keywords Ethereum · Electronic voting · KECCAK256 algorithm · Blockchain technology · Smart contract · Security

O. J. Fayemi Computer Science Department, Federal University of Technology, Akure, Nigeria A. F.-B. Thompson (B) · O. A. Ayeni Cyber Security Department, Federal University of Technology, Akure, Nigeria e-mail: [email protected] O. A. Ayeni e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_35

O. J. Fayemi et al.

1 Introduction

An election can be described as the act of selecting or making a decision among available options. It is the formal way of choosing who governs in the democratic era. Voting, whether traditional ballot-based or electronic (e-voting), is the foundation of most current democracies. Recently, voter apathy has been on the rise, particularly among the young generation, who prefer to be on their phones or gadgets most of the time. E-voting is therefore seen as a possible way to attract these young people to vote [1]. According to Zheng and Chuah [2], electronic voting (e-voting) is a voting process in which voters use electronic devices to vote during an election. These electronic devices include the Internet, electronic kiosks, and GSM or mobile phones used to cast a ballot. They enable voters to vote from anywhere, meaning they can take part in the election process from any location.

The major concerns and threats to the implementation of an e-voting system are identity fraud and DDoS attacks. A DDoS attack on e-voting may tamper with the accessibility and availability of a voting system. It exploits the transmission control protocol (TCP) connection between the systems by flooding packets toward the server. When there is a flurry of requests with too many packets, the buffer queue of the connection becomes so overfull that it rejects connections, including legitimate ones. Identity fraud, meanwhile, means an attacker could manipulate voter registrations, change a voter's polling station, or alter any information regarding the ballot. If an attacker has the details of a valid user, the attacker can submit and request changes to ballot information for that valid voter. Identity fraud thus undermines the integrity of a voter in the election process. To address this issue, Hardwick et al. [3] proposed that for an e-voting scheme to be robust, a good number of functional and security requirements must be considered, including transparency, accuracy, auditability, system and data integrity, secrecy/privacy, availability, and distribution of authority.

This is where blockchain technology comes to the fore. The technology is supported by a distributed network comprising a huge number of interconnected nodes. Each node has its own copy of the distributed ledger, which holds the complete record of all transactions processed by the network. Thus, no single person or authority controls the network. As long as a majority of the nodes approve a transaction, the transaction goes ahead, and these transactions allow the users to remain anonymous. A quick examination of blockchain technology (including smart contracts) points to the fact that it is an appropriate basis for e-voting. It also has the potential to make e-voting very adequate, secure, and available.

Cryptography protects the privacy of a message through encryption and decryption. To protect the privacy of the message, the encryption method is used: the message is encrypted with a secret key, producing a ciphertext. To disclose the secret message, decryption is used; decryption undoes the encryption. The ciphertext is decrypted with a secret key, returning the secret message. Both the encryption and decryption methods may be publicly known. The only secret in the process is


the secret key, which is used to lock and unlock the message. The privacy of the cryptosystem therefore relies heavily on the secret key; this is what Kerckhoffs' principle is all about. Blockchain uses cryptography to protect the identity of a sender while making sure that previous records are not tampered with. Implementing cryptography in e-voting strengthens the confidentiality of an e-voting system.
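As a toy illustration of Kerckhoffs' principle, the following sketch uses a deliberately insecure XOR cipher: the algorithm is fully public, only the key is secret, and applying the same operation with the same key undoes the encryption. It is an illustrative assumption chosen for brevity, not the cipher used by the proposed system.

```python
# Toy symmetric cipher: encryption and decryption are the same XOR operation.
def xor_cipher(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the (repeating) key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

msg = b"cast ballot for candidate 3"
key = b"secret"                      # the only secret in the process
ct = xor_cipher(msg, key)            # encrypt: message -> ciphertext
assert xor_cipher(ct, key) == msg    # decrypt: same key undoes the encryption
```

The security of the scheme rests entirely on the key, exactly as the principle states; knowing `xor_cipher` itself reveals nothing without `key`.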

1.1 Blockchain, Native, and Traditional Voting [4]

Ethereum is an open software platform based on blockchain technology that enables developers to build and deploy decentralized applications. Ethereum is a distributed public blockchain.

1.2 Ethereum and Smart Contract

A smart contract is essentially a tiny computer program that is stored inside a blockchain and is replicated across the blockchain network.
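Conceptually, a voting smart contract can be pictured as a small program that holds a tally and enforces one vote per address. The following Python class is only an analogue of such a contract; the names `Ballot`, `vote`, and `tally` are illustrative assumptions, and a real contract would be written in Solidity and executed by every node of the Ethereum network.

```python
# Conceptual analogue of a voting smart contract (not Solidity).
class Ballot:
    def __init__(self, candidates):
        self.tally = {c: 0 for c in candidates}  # contract state: vote counts
        self.voted = set()                       # addresses that have already voted

    def vote(self, voter_address: str, candidate: str) -> None:
        """Record one vote, rejecting double votes and unknown candidates."""
        if voter_address in self.voted:
            raise ValueError("address has already voted")  # one vote per address
        if candidate not in self.tally:
            raise ValueError("unknown candidate")
        self.voted.add(voter_address)
        self.tally[candidate] += 1

b = Ballot(["alice", "bob"])
b.vote("0xA1", "alice")
b.vote("0xB2", "bob")
```

On a real blockchain, every node would run this same logic on the same transactions, so all copies of `tally` agree without any central authority.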

1.3 History of Elections

Although elections were used in ancient Athens, in Rome, and in the selection of popes and Holy Roman emperors, the origins of elections in the contemporary world lie in the gradual emergence of representative government in Europe and North America beginning in the seventeenth century. At that time, the holistic notion of representation characteristic of the Middle Ages was transformed into a more individualistic conception, one that made the individual the basic unit to be counted. For example, the British Parliament was no longer seen as representing estates, corporations, and vested interests but was instead regarded as standing for actual people. The movement abolishing the so-called "rotten boroughs" (electoral districts of small population controlled by a single person or family), which culminated in the Reform Act of 1832 (one of three major Reform Bills in nineteenth-century Britain that expanded the size of the electorate), was a direct consequence of this individualistic conception of representation [5]. Once governments were believed to derive their powers from the consent of the governed and were expected to seek that consent regularly, it remained to decide precisely who was to be included among the governed whose consent was necessary. Advocates of full democracy favoured the establishment of universal adult suffrage. Across Western Europe and North America, adult male suffrage was ensured almost everywhere by 1920, though women's suffrage was not established until somewhat later (e.g., 1928 in Britain, 1944 in France, 1949 in Belgium, and 1971 in Switzerland) [6].


Although it is common to equate representative government and elections with democracy, and although competitive elections under universal suffrage are one of democracy's defining characteristics, universal suffrage is not a necessary condition of competitive electoral politics. An electorate may be limited by formal legal requirements, as was the case before universal adult suffrage, or it may be limited by the failure of citizens to exercise their right to vote. In many countries with free elections, large numbers of citizens do not cast ballots. For example, in Switzerland and the USA, fewer than half of the electorate vote in most elections. Although legal or voluntary exclusion can dramatically affect public policy and even undermine the legitimacy of a government, it does not preclude decision making by election, provided that voters are given genuine alternatives among which to choose.

During the eighteenth century, access to the political arena depended largely on membership in an aristocracy, and participation in elections was regulated mainly by local customs and arrangements. Although both the American and the French revolutions declared every citizen formally equal to every other, the vote remained an instrument of political power held by very few. Even with the implementation of universal suffrage, the ideal of "one person, one vote" was not achieved in all countries. Systems of plural voting were maintained in some countries, giving certain social groups an electoral advantage. For example, in the UK, university graduates and owners of businesses in constituencies other than those in which they lived could cast more than one ballot until 1948. Before World War I, both Austria and Prussia had three classes of weighted votes that effectively kept electoral power in the hands of the upper social strata. Until the passage of the Voting Rights Act in 1965 in the USA, legal barriers and intimidation effectively barred most African Americans, especially those in the South, from being able to cast ballots in elections (Khan et al. 2018).

During the nineteenth and twentieth centuries, the increased use of competitive mass elections in Western Europe had the purpose and effect of institutionalizing the diversity that had existed in the countries of that region. However, mass elections had quite different purposes and consequences under the one-party communist regimes of Eastern Europe and the Soviet Union during the period from the end of World War II to 1989-90. Although these governments held elections, the contests were not competitive, as voters usually had only the choice of voting for or against the official candidate. Indeed, elections in these countries resembled the nineteenth-century Napoleonic plebiscites, which were intended to demonstrate the unity rather than the diversity of the people. Dissent in Eastern Europe could be registered by crossing out the name of the candidate on the ballot, as several million citizens in the Soviet Union did in every election before 1989; however, because secret voting did not exist in these countries, this practice invited reprisals. Nonvoting was another form of protest, especially as local communist activists were under pressure to achieve almost a 100% turnout. Not all elections in Eastern Europe followed the Soviet model. For example, in Poland, a greater


number of names appeared on the ballot than there were offices to fill, and some degree of electoral choice was thereby provided.

In sub-Saharan Africa, competitive elections based on universal suffrage were introduced in three distinct periods. During the 1950s and '60s, a number of countries held elections following decolonization. Although many of them reverted to authoritarian forms of rule, there were exceptions (e.g., Botswana and Gambia). In the late 1970s, elections were introduced in a smaller number of countries when some military dictatorships were dissolved (e.g., in Ghana and Nigeria) and other countries in Southern Africa underwent decolonization (e.g., Angola, Mozambique, and Zimbabwe). Beginning in the early 1990s, the end of the Cold War and the decline of military and economic aid from developed countries brought about democratization and competitive elections in more than a dozen African countries, including Benin, Mali, South Africa, and Zambia.

Competitive elections in Latin America were also introduced in phases. In the century after 1828, for example, elections were held in Argentina, Chile, Colombia, and Uruguay, but all except Chile reverted to authoritarianism. Additional countries held elections in the period dating roughly 1943-62, though again many did not retain democratic governments. Beginning in the mid-1970s, competitive elections were introduced gradually throughout most of Latin America.

In Asia, competitive elections were held after the end of World War II, in many cases as a result of decolonization (e.g., in India, Indonesia, Malaysia, and the Philippines), but once again the restoration of authoritarianism was commonplace. Beginning in the 1970s, competitive elections were reintroduced in a number of countries, including the Philippines and South Korea. Except in Turkey and Israel, competitive elections in the Middle East are rare.

Authoritarian regimes have often used elections as a way to achieve a degree of popular legitimacy. Dictatorships may hold elections in cases where no substantive opposition is remotely feasible (e.g., because opposition forces have been suppressed) or when economic factors favour the regime. Even when opposition parties are allowed to participate, they may face intimidation by the government and its allies, which thereby precludes the effective mobilization of potential supporters. In other cases, a regime may postpone an election if there is a significant chance that it will lose. In addition, it has been a common practice of authoritarian regimes to intervene once balloting has begun by intimidating voters (e.g., through physical attacks) and by manipulating the count of votes that have been freely cast.

2 Review of Related Works

2.1 Toward Secure E-Voting Using Ethereum Blockchain [7]

The principal motivation of this work is to provide a secure voting environment and to show that a dependable e-voting scheme is possible using blockchain. When e-voting is accessible


to everyone who has a PC or a mobile phone, every administrative decision can be made by the people, or at least the people's opinion will be more open and more accessible to politicians and administrators. This would eventually lead humanity toward genuine direct democracy. This matters because elections can easily be corrupted or manipulated, especially in small towns and even in larger cities located in corrupt countries. Moreover, large-scale traditional elections are very expensive in the long run, especially if there are many geographically distributed voting centres and millions of voters. In addition, voters (mainly members of organizations) may be on vacation, on a business trip, or away for any other reason, which makes it impossible for those voters to attend the election and may lower the overall turnout. E-voting can solve these problems if implemented carefully. The concept of e-voting is considerably older than blockchain, so all known examples so far have used centralized computation and storage models. Estonia is a very good example, since the government of Estonia was one of the first to implement a fully online and comprehensive e-voting solution. The concept of e-voting began to be debated in the country in 2001 and was formally adopted by the national authorities in the summer of 2003. Their system is still in use, with many improvements and adjustments to the original design. As reported, it is currently highly effective and reliable. It uses smart digital ID cards and personal card readers (distributed by the government) for personal smart authentication. For citizens to take part in elections by viewing the candidates and making a choice, there is a dedicated web portal as well as a comparable desktop application, so that anyone with a PC, an Internet connection, and his/her ID card can easily vote remotely.

2.2 Blockchain-Based E-Voting System [8]

This paper introduced a blockchain-based electronic voting system that utilizes smart contracts to enable secure and cost-efficient elections while guaranteeing voters' privacy. The authors outlined the system's architecture, its design, and a security analysis of the system. By comparison with previous work, they showed that blockchain technology offers democratic countries a new possibility of advancing from the pen-and-paper election scheme to a more cost- and time-efficient one, while increasing the security measures of today's scheme and offering new possibilities for transparency. Using an Ethereum private blockchain, it is possible to send hundreds of transactions per second onto the blockchain, utilizing every aspect of the smart contract to ease the load on the blockchain. For countries of greater size, some measures must be taken to sustain a greater throughput of transactions per second, for example the parent-and-child architecture, which reduces the number of transactions stored on the blockchain at a 1:100 ratio without compromising


the network's security. The election scheme allows individual voters to vote at a voting district of their choosing while guaranteeing that each individual voter's vote is counted from the correct district, which could potentially increase voter turnout. The work also intends to provide higher security in polling machines and to defeat fraudulent votes. The electronic voting machine is accessed by matching the biometric data with the Aadhar card information. To reduce manpower and time consumption, blockchain is introduced, which creates a block for each vote, with every block linked to the others. Consequently, data cannot be altered once it has been added to the ledger in the blockchain. The vote-casting stage is a real-time hardware setup organized in a polling booth with which the voters are allowed to cast their vote. Access to the electronic voting machine is granted when the voter's biometric matches his or her Aadhar card details. The Unique Identification Authority of India (UIDAI) is a data collection centre where the details of the Aadhar holder are maintained. Here, the voter first presents his Aadhar card to the QR-reading operator. The voter is admitted to the polling booth when the QR reading/UID verification succeeds and the digital display on the EVM shows "VOTE." The voter then has to authenticate his/her thumb with the biometric system; if the thumb data are verified against the pre-loaded server data, the voter is permitted to cast the vote. The next stage is securing the vote, which is achieved by blockchain technology. The blockchain is integrated into the EVM machine. As the name indicates, a blockchain is a chain of blocks that contains information.

This technique was first described in 1991 by a group of researchers and was originally intended to timestamp digital documents so that they could not be tampered with, rather like a notary. The blockchain is a distributed ledger that is completely open to anyone. It has an interesting property: once data has been recorded inside a blockchain, it becomes very hard to change it. So how does it work? Consider a block: each block contains data, the hash of the block, and the hash of the previous block. Data: it contains the voting information. Hash: the hash can be compared to a fingerprint; it is always unique. When a block is created, its hash is calculated; consequently, changing anything inside the block will cause its hash to change. As such, the hash is very useful when you want to detect changes to a block. If the fingerprint of a block changes, it is no longer the same block.
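The block structure just described (data, the block's own hash, and the previous block's hash) can be sketched as a hash chain. Python's standard-library SHA3-256 is used here as a stand-in: Ethereum's KECCAK256 differs from FIPS SHA3-256 in its padding and is not available in the standard library, so this is an illustrative assumption rather than the exact hash of the proposed system.

```python
import hashlib

def block_hash(data: str, prev_hash: str) -> str:
    """Fingerprint of a block: hash of its data chained to the previous hash."""
    return hashlib.sha3_256((prev_hash + data).encode()).hexdigest()

def build_chain(votes):
    chain, prev = [], "0" * 64            # genesis predecessor hash
    for v in votes:
        h = block_hash(v, prev)
        chain.append({"data": v, "prev_hash": prev, "hash": h})
        prev = h
    return chain

def is_valid(chain):
    prev = "0" * 64
    for blk in chain:
        # Any change to earlier data breaks every fingerprint after it.
        if blk["prev_hash"] != prev or blk["hash"] != block_hash(blk["data"], prev):
            return False
        prev = blk["hash"]
    return True

chain = build_chain(["vote:alice", "vote:bob"])
assert is_valid(chain)
chain[0]["data"] = "vote:mallory"         # tamper with an earlier block
assert not is_valid(chain)                # the chain no longer verifies
```

This is exactly why a recorded vote cannot be silently altered: the tampered block's fingerprint changes, and every later block still points at the old one.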

2.3 A Privacy-Preserving Voting Protocol on Blockchain [9]

In this paper, the authors proposed a native voting mechanism on blockchain to facilitate decision making for peers of a blockchain network. The proposed protocol protects voters' privacy and enables detection and correction of cheating without any trusted party. Their implementation on Hyperledger Fabric shows that the protocol is feasible and practical for small- to medium-scale voting problems. More work


needs to be done in the future. First, formal security analysis must be performed; potential security attacks, such as a cartel attack, should be tested against the design. Second, a theoretical analysis of how to choose system parameters, including the number of ballots, PKI key length, etc., should be conducted. Third, the current implementation of the voting protocol is mainly for functional verification, so more work on testing and optimization is needed. The protocol could also be implemented on different public and consortium blockchain platforms, e.g., Ethereum, Corda, and Quorum, to further inspect the protocol design and examine the performance differences between them.

2.4 An E-Voting with Blockchain: An E-Voting Protocol with Decentralisation and Voter Privacy [1]

Why use blockchain in an e-voting protocol? The answer lies in the fact that it is very advantageous when several groups of people wish to use and maintain a single public database. This database must be owned, maintained, and updated by each of the users, yet no single user should be able to control the database or tamper with it. Hence the need for an e-voting system that uses blockchain as a reliable and transparent ballot box. Such a system is designed to follow the rudiments of e-voting principles and to ensure a level of decentralization that enables voters to update or change their voting choices. This research work thereby puts forward a blockchain-based e-voting scheme that meets these rudiments of e-voting, has some level of decentralization, and ensures voters have the power to make changes as they like within any given time. The limitation of this work is that it was cancelled for fear that the system might face challenges that could lead to votes going public in the case of a cyber-attack.

2.5 A Secure End-to-End Verifiable E-Voting System Using Zero-Knowledge-Based Blockchain [10]

The proposed system is developed to improve the direct-recording electronic (DRE) system in order to detect whether the tallying phase of the voting has been tampered with. The proposed solution avoids double inclusion of a ballot that has already been included in the blockchain. The network stores transactions by hashing them into an ongoing blockchain under a zero-knowledge-based consensus algorithm, forming a record that cannot be changed without solving the discrete logarithm problem (DLP) multiple times for each transaction. The proposed system is able to encrypt ballots in such a way that the election tally can be publicly verified without decrypting cast ballots, maintaining end-to-end verifiability without requiring the secure bulletin


board. The proposed system does not show the live result of the electoral act, and it can only be used for small-scale projects.

2.6 A Conceptual Secure Blockchain-Based Electronic Voting System [11]

The proposal is that a blockchain-based system will be secure, reliable, and anonymous and will help increase the number of voters as well as people's trust in their governments. The system is decentralized and does not rely on trust. Any registered voter is eligible to vote using any device connected to the Internet. This ensures that the blockchain can be publicly verified and is distributed, so that it won't get corrupted by any Tom, Dick, or Harry. The limitation of this work is the inability to change a vote in the case of a user mistake: the user can cast his or her vote only once.

2.7 A Solution Based on the Diffie–Hellman Process System [12]

This solution does not rely on blockchain technology; rather, it suggests using randomly generated public/private key sets so that a "two-round" voting system can be held with ballot privacy. The approach aims at securing the voter's ballot and identity; however, it is not suitable for a large electoral system.
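For reference, the Diffie–Hellman process that underlies [12] lets two parties derive a shared secret from values exchanged in public. The toy-sized parameters below are illustrative assumptions only; real deployments use very large primes.

```python
# Toy Diffie–Hellman key exchange over a small prime field.
p, g = 23, 5        # public parameters: prime modulus and generator (toy-sized)
a, b = 6, 15        # private keys of the two parties (kept secret)

A = pow(g, a, p)    # party 1 publishes g^a mod p
B = pow(g, b, p)    # party 2 publishes g^b mod p

s1 = pow(B, a, p)   # party 1 computes (g^b)^a mod p
s2 = pow(A, b, p)   # party 2 computes (g^a)^b mod p
assert s1 == s2     # both sides derive the same shared secret
```

An eavesdropper sees only p, g, A, and B; recovering the shared secret from these is the discrete logarithm problem, which is what protects the exchanged key material.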

2.8 A Blockchain-Based E-Voting System [13]

The aim was to eradicate ballot theft and election rigging; however, voters have to purchase a token to be eligible to vote, and unregistered voters without a token are disallowed from voting. The limitation of the approach is that if the number of voters is less than the number of tokens the government has purchased, the surplus is a waste of resources.


2.9 An E-voting System Based on Blockchain and Ring Signature [14]

The work aims to provide a reliable, safe, and transparent electronic voting system. The design implemented a web-based voting system using the PHP and JavaScript programming languages, with the goal of providing a secure e-voting system based on a blockchain network and ring signatures. However, the system is expensive to build, and every person who votes on the system must pay in order to cast his or her vote.

2.10 A Survey on Feasibility and Suitability of Blockchain Techniques for the E-Voting Systems [4]

This project sought to examine the feasibility and suitability of blockchain technology in an e-voting environment, with respect to both technical and non-technical aspects. The approach was said to be built using the C++ and Python programming languages. Ethereum-based solutions usually make use of the Solidity language for implementing smart contracts; Sovereign, which is written in Python, uses the JavaScript-based Meteor environment as its web-based user interface. The proposed system is built to provide transparency with privacy, to be cheaper in the long term, and to provide instant results. The limitation of this project is that the work was not implemented; it was rather a feasibility study. The system is not user friendly and does not preserve user confidentiality.

2.11 An Electronic Voting Machine Based on Blockchain Technology and Aadhar Verification [15] This work seeks to provide much better security in polling machines while also preventing bogus votes. The e-voting system works by checking the voter's biometric data against the Aadhar card information. To reduce human labor and wasted time, blockchain was introduced: a block is created for each vote cast, and these blocks are connected to one another, ensuring that once data has been added to the ledger it cannot be tampered with. The system was designed to ensure data security and integrity in a democratic election. Its limitation is that whenever one of the blocks is changed due to a voter's error, the remaining blocks in the system become invalid, so mistakes are not tolerated.

Design and Implementation of Electronic Voting Using KECCAK256 …

441

2.12 A Secure Voting System Using Ethereum's Blockchain, Called BroncoVote [16] BroncoVote is a blockchain-based voting system that preserves voter privacy and increases accessibility while keeping the voting system transparent, secure, and cost-effective. It uses existing functionality and features provided by Ethereum to create and vote on ballots. The implementation consists of three smart contracts coded in Ethereum's Solidity language, two scripts written in JavaScript, and one HTML page. BroncoVote is an open-source project, and the entirety of the code is available for public use. It is a university-scale voting system that utilizes smart contracts in Ethereum and Paillier homomorphic encryption to achieve its goals, providing a secure, private, and easily accessible e-voting system. However, the system does not hide the identity of the voter.

2.13 Crypto-Voting, a Blockchain-Based E-Voting System [17] The main aim of this research work is to present and define crypto-voting, another innovative e-voting method based on Shamir's secret sharing approach implemented on blockchain technology. The crypto-voting system uses a permissioned blockchain (i.e., subject to authorization) to guarantee access control without compromising the requirements of anonymity and confidentiality. The objective is to improve the traceability and assessment of the operations carried out during voting processes without the use of a middleman. The proposed system was not actually implemented.

3 Methodology The voting system is composed of the following major components:

1. User interface
2. Database
3. Ethereum network

The architecture is divided into two phases:

• The registration phase
• The voting phase (Fig. 1)

Fig. 1 System architecture for the registration phase

The voter interacts with the application, which sends information to the server. This information contains the basic details of the user entered during registration. The server sends the voter's details from the registration form to the database. The database then generates a unique user_id for the user and communicates the user_id and a success acknowledgement back to the user through the server (Fig. 2). In the voting phase of the system, the user communicates with the server through the application. The server sends the login details entered by the voter to the database for authentication; when the database confirms that the details are correct, it sends an authentication-successful notification to the server, granting the voter access to cast a vote on the Ethereum network. The Ethereum network consists of interconnected nodes; these nodes can also be referred to as blocks, which combine into a chain. The chain of blocks secures all projects deployed on it by connecting each newly created node to the already existing nodes in the environment. Ethereum uses a smart contract to decide what should be carried out on the deployed system; the smart contract holds the rules and regulations binding the system. The system architecture is divided into two phases: the registration phase and the voting phase. At the registration phase, the following factors will

Fig. 2 System architecture for the voting process

be considered in order to assign a user ID that will be quite difficult for a cyber-terrorist to obtain: W is a variable representing the voter's Ward (a pre-assigned value), L represents the Local Government Area of the prospective voter, S represents the State, D represents the full date (month and day, concatenated) of registration, Y represents the year of registration, and T represents the exact time of registration (hour:minute:second). The User ID, U_id, is calculated according to Eq. 1, where G(D) comprises the year, day, month, hour, minute, and second:

U_id = f(W, L, S) + G(D)    (1)
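The paper does not specify concrete encodings for f(W, L, S) and G(D), so the sketch below uses hypothetical ones (packing location codes into an integer, and concatenating the registration timestamp) purely to illustrate how Eq. 1 could be evaluated:

```python
from datetime import datetime

def f(ward: int, lga: int, state: int) -> int:
    # Hypothetical location encoding: pack the ward, local government
    # area, and state codes into a single integer.
    return state * 10**6 + lga * 10**3 + ward

def G(registered_at: datetime) -> int:
    # Hypothetical time encoding: concatenate year, month, day,
    # hour, minute, and second of registration into one number.
    return int(registered_at.strftime("%Y%m%d%H%M%S"))

def user_id(ward: int, lga: int, state: int, registered_at: datetime) -> int:
    # U_id = f(W, L, S) + G(D)  -- Eq. (1)
    return f(ward, lga, state) + G(registered_at)

uid = user_id(ward=12, lga=34, state=5,
              registered_at=datetime(2021, 3, 14, 9, 26, 53))
```

Any encoding with the same shape would do; the point is that U_id combines a static location component with a per-registration timestamp component, making it hard to guess.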

where x represents each contestant on the voting system and ranges up to y, the last contestant on the system; e represents each eligible voter and ranges up to f, the last eligible voter; z represents each ward and ranges up to g, the last ward; i represents each local government and ranges up to j, the last local government; and u represents each state and ranges up to v, the last state. The collation algorithm used to collate and determine the winners of the election is represented with the variables in Eqs. 2–8.

Each Authenticated Voter eligible to vote, AV_e, in the voting process is represented finitely by

AV_e, e = 1, 2, 3, 4, . . . , f    (2)

Each Ward, W_z, eligible for accommodating authenticated voters is represented finitely by

W_z, z = 1, 2, 3, 4, . . . , g    (3)

Each Local Government, LG_i, is represented finitely by

LG_i, i = 1, 2, 3, 4, . . . , j    (4)

Each State, S_u, is represented finitely by

S_u, u = 1, 2, 3, 4, . . . , v    (5)

The Election type, E_t, is represented finitely by

E_t, t = 1, 2, 3, 4, . . . , n    (6)

The Contestant, C_x, is represented by

C_x, x = 1, 2, 3, 4, . . . , y    (7)

The final collation process, R(E_t), is determined and calculated using Eq. 8:

R(E_t) = \sum_{x=1}^{y} \left( \sum_{e=1}^{f} AV_e + \sum_{z=1}^{g} C_x W_z + \sum_{i=1}^{j} C_x LG_i + \sum_{u=1}^{v} C_x S_u \right)    (8)
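The tally behind Eq. 8 — turnout over authenticated voters plus per-contestant counts at ward, local government, and state level — can be sketched as follows. The ballot data structure (a dict per voter) is an assumption for illustration, not the paper's actual storage format:

```python
def collate(ballots):
    # ballots: one dict per authenticated voter's cast ballot, e.g.
    # {"contestant": "A", "ward": "W1", "lga": "LG1", "state": "S1"}.
    # Returns the turnout (the AV_e sum) and per-contestant tallies at
    # ward, LGA, and state level (the C_x W_z, C_x LG_i, C_x S_u terms).
    turnout = len(ballots)
    per_ward, per_lga, per_state = {}, {}, {}
    for b in ballots:
        c = b["contestant"]
        per_ward[(c, b["ward"])] = per_ward.get((c, b["ward"]), 0) + 1
        per_lga[(c, b["lga"])] = per_lga.get((c, b["lga"]), 0) + 1
        per_state[(c, b["state"])] = per_state.get((c, b["state"]), 0) + 1
    return turnout, per_ward, per_lga, per_state
```

The winner of election type E_t is then the contestant with the highest state-level (or national) count; the intermediate dictionaries preserve the ward- and LGA-level breakdowns that Eq. 8 aggregates.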

The implementation was carried out with the following tools. NodeJS, a JavaScript runtime, was used to build the backend code for the decentralized voting web application, and pure JavaScript was used to write tests for the smart contract. ReactJS, a rich JavaScript framework, was used to build the frontend interface of the decentralized web application, while Solidity, the smart-contract-oriented language, was used to write the voting contract; the contract initializes the voters and sets the rules for voting. Ganache, a local Ethereum blockchain, is the chain on which the smart contract is installed to enable interaction with the deployed contract instances. MetaMask, a browser-extension wallet, is used for signatures and transaction approval and was connected to the local blockchain through the localhost URL. PostgreSQL is the database used, and users' corresponding IDs are stored in it (Figs. 3, 4 and 5).

Fig. 3 Calculation of the results for each voter

Fig. 4 Compilation of smart contracts

3.1 Hardware Requirements The minimum hardware components required to provide effective and reliable support for the electronic voting system on Ethereum are listed as follows:

• A computer system
• 1.75 GHz dual-core processor
• 1 GB of RAM
• 300 GB of hard disk space

Fig. 5 Ganache database

The following are the software tools required for the implementation of the proposed system:

• NodeJS
• ExpressJS (framework)
• ReactJS
• Solidity
• Ethereum blockchain
• PostgreSQL (database)

3.2 Implementation Tools The major purpose of the implementation phase is to convert the system design into source code; each component of the design is implemented as a program module. The tools were chosen for their robustness, scalability, and productivity. The implementation was carried out with the following tools.

3.2.1 NodeJS

NodeJS, the JavaScript runtime, will be used to design the back-end code for the decentralized voting web application. Pure JavaScript will also be used to write tests for the smart contract.

3.2.2 ReactJS

ReactJS is the rich JavaScript framework that will be used to design the frontend interface for the decentralized web application. Solidity is the smart-contract-oriented language that will be used to write the smart contract for the voting application. The contract will initialize the voters and set rules for voting.

3.2.3 Ganache

Ganache, a local Ethereum blockchain, will be used as the chain on which the smart contract is installed so that the deployed contract instances can be interacted with.

3.2.4 MetaMask

MetaMask is a browser-extension wallet that will be used to sign and approve transactions. MetaMask will be connected to the local blockchain through the localhost URL. PostgreSQL is the database that will be used; users' corresponding IDs will be stored in it.

4 System Implementation The system is designed as various modules that are navigated through the interface. The contract compiling page is the page on which the programmer compiles the contract (i.e., compiles the rules and regulations contained in the project); this page also links all created pages together so that the project works properly and is accessible. The layout is shown in Fig. 6. The registration page handles the registration of users by prompting every user to enter their personal details, namely full name, date of birth, gender, ward, local government, state, and password. A user who is not registered in the system cannot access it. The registration layout is shown in Fig. 7. After a successful registration, the user goes to the top corner of the browser and clicks on the MetaMask icon in order to log in with the registered login details, which grants the user access to the system (Fig. 8). After successful authentication with correct login credentials, a click on the "Login" button displays the user dashboard, which shows the profile and the amount in the user's account that will be used to carry out transactions on the system. Furthermore, it shows what the voter can do on each page. The layout is shown in Fig. 9.

Fig. 6 Contract compilation page

Fig. 7 Registration page

Fig. 8 Login page

Fig. 9 Voter dashboard
The private key importation page requests the private key given to the voter by the system during registration; this makes two-level authentication possible. After the user inputs the private key and clicks on the "Import" button, the system automatically verifies whether the account has voted before. If not, the system allows the voter to cast a vote; otherwise, it displays the message "account already voted." The layout is shown in Fig. 10. The image in Fig. 11 shows a user account address and the private key that will be imported for a voter to be able to carry out the electoral act.

Fig. 10 Private key input page

Fig. 11 Private key

The candidate selection and confirmation page is where a voter picks the candidate to vote for, after which the system asks the user whether the picked candidate is the intended choice. This page is simply a check to prevent the voter from making a selection mistake. The layout is shown in Fig. 12. The election results page shows the outcome of the voter's cast vote and the result of every candidate contesting in the election. The layout is shown in Fig. 13.

Fig. 12 Candidate selection and confirmation of candidate

Fig. 13 Election result

5 Conclusion In conclusion, an Ethereum-based electronic voting system was developed. The system used Ethereum blockchain technology to improve its security features. It was also able to protect the uniqueness of every voter and ensure that the recorded vote results are not tampered with. The Ethereum blockchain provides advantageous properties for an electronic voting system, such as integrity, verifiability, authenticity, and accessibility for all its users. The system does not depend on any human third party but on computational cryptographic trust, and it is secured such that no attacker can compromise or corrupt it. The Ethereum-based electronic voting system developed in this research has proven to be a more suitable, secure, transparent, and efficient way for people to carry out elections and vote for their respective candidates of choice. The system also aims at eliminating bogus or incorrect voting results and preventing voter impersonation.

References

1. A.G. Freya Sheer Hardwick, E-Voting with Blockchain: An E-Voting Protocol with Decentralisation and Voter Privacy, United Kingdom (2016)
2. C.C. Clement Chan Zheng Wei, Blockchain-based electronic voting protocol. Int. J. Inform. Visual. (2018)
3. F.S.A.G. Hardwick, E-voting with blockchain: an E-voting protocol with decentralisation and voter privacy. Royal Holloway, University of London, Egham, United Kingdom, 2 (2018)
4. E.A. Umut Can Çabuk, A survey on feasibility and suitability of blockchain techniques for the E-voting systems. Int. J. Adv. Res. Comput. Commun. Eng. 133 (2018)
5. Multichain (2017). https://www.multichain.com/blog/2017/05/blockchain-immutability-myth/
6. Olson, What are we? A study in personal ontology (Oxford University Press, New York, 2007)
7. A.K. Koç, U.C. Cabuk, G. Dalkilic, Towards secure E-voting using ethereum blockchain, in 2018 IEEE 6th International Symposium on Digital Forensic and Security (ISDFS), Antalya, Turkey. https://doi.org/10.1109/ISDFS.2018.8355340
8. F.Þ. Hjálmarsson, G.K. Hreiðarsson, M. Hamdaqa, G. Hjálmtýsson, Blockchain-based E-voting system (2019, May)
9. W. Zhang, Y. Yuan, Y. Hu, S. Huang, S. Huang, S. Cao, A. Chopra, A privacy-preserving voting protocol on blockchain, in IEEE Cloud, San Francisco, CA, USA, July 2018. https://doi.org/10.1109/CLOUD.2018.00057
10. S. Panja, B.K. Roy, A secure end-to-end verifiable E-voting system using zero knowledge based blockchain, India (2018). https://eprint.iacr.org/2018/466.pdf
11. A.B. Ayed, A conceptual secure blockchain-based electronic voting system. Int. J. Network Secur. Appl. (IJNSA) (2017)
12. F. Hao, P.Y.A. Ryan, Real-World Electronic Voting: Design, Analysis and Deployment (CRC Press, 2017), pp. 143–170
13. F.P. Hjálmarsson, G. Hreiðarsson, Blockchain-based E-voting system. School of Computer Science, Reykjavik University, Iceland, 2, 3 (2018)
14. Y. Wu, An E-voting system based on blockchain and ring signature. School of Computer Science, University of Birmingham (2017)

15. A.R.R. Navya, Electronic voting machine based on blockchain technology and aadhar verification. Int. J. Adv. Res. Ideas Innov. Technol. 2 (2018)
16. G. Gaby, P.B. Dagher, BroncoVote: Secure Voting System using Ethereum's Blockchain (Winthrop University, Rock Hill, South Carolina, U.S.A., 2018)
17. F. Fusco, M. Lunesu, F. Pani, A. Pinna, Crypto-voting, a blockchain based e-voting system, in Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - KMIS, pp. 223–227 (2018). https://doi.org/10.5220/0006962102230227. ISBN 978-989-758-330-8, ISSN 2184-3228

VizAudi: A Predictive Audio Visualizer Sapna Malik, Aashish Upadhyay, Kartik Kumar, and Nachiketa Raina

Abstract Speech-to-text is a great way of translating audio into a contextual form. Unfortunately, simple speech-to-text applications are unable to express background noises; they are limited in their ability to provide complete knowledge of an audio clip extracted from a real-world environment. Therefore, this research aims to convert speech to text and augment the text with suitable images based on the ambient sound. It detects speech and ambient noise in audio: speech is converted to text, and the ambient audio (cars, dogs, sirens, etc.) is displayed as a sequence of images or GIFs. This is achieved with the help of different spectrograms, by extracting relevant audio clips and feeding them to CNNs while processing the speech at the same time. The proposed work can help people with hearing impairments by providing them a visual output of both background and foreground audio. Keywords Neural networks · Convolution neural networks · UrbanSound8K · Sound recognition

1 Introduction The ability to hear is crucial, as it plays a vital role in aiding humans to interact with their surroundings. The hearing sense is responsible for acquiring knowledge; it allows us not only to interact with fellow humans but also to fully perceive the environment. Throughout evolution, communication has always played an important role, as it enabled humans not only to catalog information but also to mass-produce it, leading to various industrial revolutions. The ability to communicate is what led humans to evolve into much smarter beings. Conversely, its absence causes people to suffer in many aspects of their lives; they
S. Malik · A. Upadhyay · K. Kumar · N. Raina (B) Department of Computer Science, Maharaja Surajmal Institute of Technology, New Delhi, India
S. Malik e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_36

cannot communicate or socialize effectively with hearing people. The safety of deaf people is also at risk in many scenarios where they cannot make the right decision because they cannot analyze sounds, primarily alarming sounds. The treatment of hearing impairment depends on the level of loss, and clinical medicine is unable to treat many cases of deafness; therefore, people lose hope of a normal life. There are plenty of libraries and application programming interfaces capable of generating text from speech up to a certain accuracy, but their bottleneck is an inability to interpret ambient noise alongside speech. This is where this research work plays its part. The work aims to provide an augmentation system offering a virtual visual aid for hearing-impaired people. It can also be used to enhance any audio-only form of media by augmenting it with relevant images, GIFs, and text. This application finds use in the contemporary world for people with hearing impairments, for developing smart applications requiring audio analysis, etc.

2 Related Work 2.1 Works on the Categorization of Sound Ordinarily, sound classification comprises two major steps: extracting features and then classifying. The extracted features are sophisticated, handcrafted features; features like MFCC [1] have been widely used up until now. Classification is then done by feeding the extracted features to classifiers such as SVM [2], HMM [3], and GMM [4]. Nowadays, more and more feature extraction is done with the help of deep neural networks (DNNs), with researchers letting these architectures unravel their full potential through intrinsic feature learning: completely raw data is given as input, and the models are expected to extract the features. Parascandolo et al. [5] propose using overlapping windows to cut audio signals into segments, converting them to log-Mel features, and using them as input to a BLSTM RNN; the suggested model performs better than a standard fully connected neural network. Hershey et al. [6] tested multiple CNNs on subsets of the YouTube-100M dataset, where isolated audio was segmented and converted into log-Mel features used as input to different architectures such as VGG, AlexNet, Inception, and ResNet, with a DNN as the baseline. Of these, ResNet and Inception performed considerably better at extracting audio features across different areas of audio representation. Pons et al. [7] compared Mel-frequency cepstral coefficient (MFCC) features with features extracted by a CNN; the latter were found

preferable when used in the same type of classifiers, although classic features like MFCC are still very effective. Piczak et al. [8] used a CNN model for the categorization of sounds: the input audio is broken into pieces and the CNN is applied to the Mel-spectrograms extracted from the audio clips, improving performance over earlier baseline models.

2.2 Works on UrbanSound8K Dataset The dataset used by the proposed work has been studied in various prior works. These studies include both traditional feature-engineering methods and modern methods where the model extracts the features itself. Salamon et al. [2] propose a baseline for the UrbanSound8K dataset, extracting MFCC features on which five different classification algorithms are evaluated: SVM, random forest, k-NN, ZeroR, and decision tree. The best results are given by the decision tree and SVM, with an accuracy of almost 70%. Piczak et al. [8] attempted to understand the results of a CNN on the UrbanSound8K dataset. The architecture consisted of two convolutional layers and one fully connected (dense) layer, used for better generalizability. Spectrograms of the partitioned audio clips on a nonlinear Mel-scale are used to achieve an accuracy of 75%, while promising better results than on longer segments.

2.3 Work on Deaf Assistance System Saifan et al. [9] proposed a fully functional system that converts background audio into alarms and speech into text for people with hearing impairments. The system tries to aid deaf people to live normally by informing them about sounds such as an engine running or children crying, keeping them aware of their surroundings. Moreover, it promotes the safety of such people by alerting them about their environment. The system comprises a speech recognition model and a sound recognition model.

Table 1 Sound class IDs

Numeric identity   Sound class
0                  air_conditioner
1                  car_horn
2                  children_playing
3                  dog_bark
4                  drilling
5                  engine_idling
6                  gun_shot
7                  jackhammer
8                  siren
9                  street_music
3 Dataset and Features 3.1 Dataset The proposed work uses UrbanSound8K as its dataset [2, 10]. It contains a total of 8732 labeled clips of sounds common to urban environments, distributed across ten classes stored in different folders. The audio clips are named in the format [fs_ID]-[class_ID]-[occurrence_ID]-[slice_ID].wav, where class_ID is a unique identifier of the class to which the clip belongs. Nearly 90% of the clips in the dataset are used for training the model proposed in this paper, while the other 10% is used for testing and validation (Table 1).
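The filename scheme above can be parsed with a few lines of Python to recover the class label without consulting the metadata CSV; the sample filename is illustrative of the scheme:

```python
def parse_urbansound_name(filename: str) -> dict:
    # Split "[fs_ID]-[class_ID]-[occurrence_ID]-[slice_ID].wav"
    # into its four integer fields.
    stem = filename.rsplit(".", 1)[0]
    fs_id, class_id, occurrence_id, slice_id = stem.split("-")
    return {"fs_id": int(fs_id), "class_id": int(class_id),
            "occurrence_id": int(occurrence_id), "slice_id": int(slice_id)}

info = parse_urbansound_name("100032-3-0-0.wav")
# class_id 3 corresponds to dog_bark in Table 1
```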

3.2 Features VizAudi uses the LibROSA library to extract four different features (spectrograms) from the audio clips in the dataset [11, 12].

1. Mel-frequency spectrum: VizAudi uses the Mel-scale for representing frequencies, as it is the nearest approximation of how humans perceive sound. In the Mel-frequency spectrum, the linear cosine transformation of the short-term power spectrum of sound is represented on a nonlinear Mel-scale. The following equation transforms a given frequency to its corresponding Mel-scale value [13, 14]:

M(f) = 1125 ln(1 + f/700)    (1)

2. Mel-frequency cepstral coefficients (MFCC): MFCCs are the coefficients of the MFC; they are widely used features in

sound recognition because of their great ability to mimic the way the human hearing system works [15]. Extracting the MFC and MFCCs involves various steps [16] (Fig. 1).

3. Chroma: VizAudi uses Chroma-based features to extract the harmonic and melodic characteristics of the audio input [17]. A short-time Fourier transform (STFT) is applied to obtain a spectrum; by applying a Constant-Q transform (CQT), one obtains the Chroma Energy Normalized Statistics (CENS) [15] (Fig. 2).

4. Contrast: VizAudi takes into consideration the relative spectral distribution instead of the average spectral envelope. Spectral contrast features therefore become handy, where the spectral peak, the spectral valley, and the difference within each sub-band are taken into consideration [18] (Fig. 3).
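The Hz-to-Mel mapping of Eq. (1), and its inverse (useful when constructing Mel filter banks), can be sketched directly:

```python
import math

def hz_to_mel(f: float) -> float:
    # M(f) = 1125 ln(1 + f/700)  -- Eq. (1)
    return 1125.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m: float) -> float:
    # Inverse of Eq. (1): f = 700 (e^{m/1125} - 1)
    return 700.0 * (math.exp(m / 1125.0) - 1.0)
```

Note the compression at high frequencies: equal steps on the Mel axis correspond to ever wider steps in Hz, matching how humans perceive pitch.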

Fig. 1 MFCC feature of barking dog sound

Fig. 2 Chroma feature of barking dog sound

460

S. Malik et al.

Fig. 3 Contrast feature of barking dog sound

4 Methodology See Fig. 4.

4.1 Background and Foreground Sound Segregation The first step in representing background sounds is to detect them. In this phase, the authors work solely on identifying the background and foreground parts of the sound. For this purpose, a custom dataset is created consisting of 1640 instances of 4 s each, distributed equally between foreground and background audio (i.e., 820 instances of each). A convolutional neural network is built that segregates parts of the input audio clips as foreground or background audio [9]. The model is trained with the features extracted through the spectrograms explained above.
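The fixed-length segmentation behind the 4-s instances can be sketched as below; the sampling rate and the drop-the-remainder policy are assumptions, as the paper does not specify them:

```python
import numpy as np

def split_into_windows(signal: np.ndarray, sr: int, window_s: float = 4.0):
    # Cut a mono signal into fixed-length windows (4 s by default,
    # matching the custom fore/background dataset); a trailing
    # remainder that does not fill a full window is dropped.
    win = int(sr * window_s)
    n = len(signal) // win
    return signal[: n * win].reshape(n, win)

# e.g. 10 s of audio at 8 kHz yields two full 4-s windows
windows = split_into_windows(np.zeros(80_000), sr=8_000)
```

Each window is then converted to the spectrogram features of Sect. 3.2 before being fed to the segregation CNN.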

Fig. 4 VizAudi block diagram

Fig. 5 Confusion matrix of CNN

4.2 Background Sound Classification In this phase, the background and main parts of the audio have been segregated and are in a processed format ready to be fed to the model. Now, the background audio is classified into ten classes based on the UrbanSound8K dataset. This is approached by constructing a standard convolutional neural network equipped with dropout layers for better generalization, using the Adam optimizer and categorical cross-entropy as the loss function. Ninety percent of the dataset is used to train the model, while the remaining 10% is used to test it. An aggregate accuracy of 80% was achieved while testing the model on the test dataset, which is acceptable for the CNN model used (Fig. 5).
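The evaluation behind Fig. 5 and the 80% figure reduces to a confusion matrix and its trace; a minimal sketch (the class IDs follow Table 1, the sample labels are made up):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=10):
    # Rows: true class, columns: predicted class (as in Fig. 5).
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def accuracy(cm):
    # Aggregate accuracy = correctly classified / total = trace / sum.
    return cm.trace() / cm.sum()

# e.g. one dog_bark (3) misclassified as siren (8)
cm = confusion_matrix([3, 3, 8, 1], [3, 8, 8, 1], n_classes=10)
```

Off-diagonal cells reveal which acoustically similar classes (e.g., drilling vs. jackhammer) the CNN confuses.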

4.3 Visual Output This is the final phase of the research work, in which everything done so far is put together: from the preprocessing of data to the output of text (the main speech) accompanied by appropriate GIFs (for background sounds), assembled in a single pipeline. After receiving the classified audio from the model and the text from the Google speech-to-text API, the system combines them into a sequence of video clips with audio as the final result, which is achieved through moviepy and FFmpeg [19, 20] (Figs. 6 and 7).
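The paper does not detail how class predictions are mapped to visual assets, so the sketch below assumes a simple class-ID-to-GIF lookup and builds the timeline that would then be rendered with moviepy/FFmpeg:

```python
# Hypothetical mapping from UrbanSound8K class IDs (Table 1) to GIF
# assets; the real asset paths are not given in the paper.
CLASS_TO_GIF = {1: "car_horn.gif", 3: "dog_bark.gif", 8: "siren.gif"}

def build_timeline(segments):
    # segments: (start_s, end_s, class_id, transcript) tuples produced
    # by the background classifier and the speech-to-text API.
    timeline = []
    for start, end, class_id, text in segments:
        timeline.append({"start": start, "end": end,
                         "gif": CLASS_TO_GIF.get(class_id, "unknown.gif"),
                         "caption": text})
    return timeline

timeline = build_timeline([(0.0, 4.0, 3, "good morning"), (4.0, 8.0, 8, "")])
```

Each timeline entry would then become one clip (GIF plus caption overlay), and the clips are concatenated and muxed with the original audio by moviepy and FFmpeg.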

Fig. 6 Speech-to-text output

Fig. 7 Background sound visual output

5 Conclusions and Future Work The authors have proposed a possible architecture for a predictive audio visualizer, for which a CNN model has been used. The speech recognition engine is responsible for converting speech to text using the Google speech-to-text API (alternatives, e.g., the Soundex algorithm, can also be considered), while the sound classification algorithm is responsible for identifying the ten categories of sound present in the UrbanSound8K dataset. This paper investigates only a single approach to implementing the sound classification engine, i.e., the convolutional neural network (CNN). The experimental results showed that a three-layer CNN achieves the highest classification accuracy on the test dataset, about 80%. Given the input audio, the model was able to segregate and recognize ambient sound and speech, which were then used to generate a visual output. As future work, the segregation algorithm and sound recognition engine of VizAudi will be improved so that they can extract more features from the input audio and identify more categories of sounds. The authors also plan to expand the visualization capability of VizAudi by including the intensity of detected sound and its rate of change, so that users can be made aware of their surroundings more precisely.

References

1. S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980)
2. J. Salamon, C. Jacoby, J.P. Bello, A dataset and taxonomy for urban sound research, in 22nd ACM International Conference on Multimedia (ACM-MM'14), Orlando, FL, USA, Nov 2014, pp. 1041–1044
3. A. Mesaros, T. Heittola, A. Eronen, T. Virtanen, Acoustic event detection in real life recordings, in 2010 18th European Signal Processing Conference, Aug 2010, pp. 1267–1271
4. A. Mesaros, T. Heittola, A. Eronen, T. Virtanen, Acoustic event detection in real-life recordings, in 18th European Signal Processing Conference, 07 2014
5. G. Parascandolo, H. Huttunen, T. Virtanen, Recurrent neural networks for polyphonic sound event detection in real life recordings. CoRR, vol. abs/1604.00861, 2016. Available: http://arxiv.org/abs/1604.00861
6. S. Hershey, S. Chaudhuri, D.P.W. Ellis, J.F. Gemmeke, A. Jansen, R.C. Moore, M. Plakal, D. Platt, R.A. Saurous, B. Seybold, M. Slaney, R.J. Weiss, K.W. Wilson, CNN architectures for large-scale audio classification. CoRR, vol. abs/1609.09430, 2016. Available: http://arxiv.org/abs/1609.09430
7. J. Pons, X. Serra, Randomly weighted CNNs for (music) audio classification. CoRR, vol. abs/1805.00237, 2018. Available: http://arxiv.org/abs/1805.00237; H. Eghbal-Zadeh, B. Lehner, M. Dorfer, G. Widmer, A hybrid approach with multi-channel i-vectors and convolutional neural networks for acoustic scene classification. CoRR, vol. abs/1706.06525, 2017. Available: http://arxiv.org/abs/1706.06525
8. K.J. Piczak, Environmental sound classification with convolutional neural networks, in 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), Sept 2015, pp. 1–6
9. R.R. Saifan, W. Dweik, M. Abdel-Majeed, A machine learning-based deaf assistance digital system. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/cae.21952
10. UrbanSound8K dataset. https://urbansounddataset.weebly.com/urbansound8k.html
11. B. McFee, C. Raffel, D. Liang, D.P.W. Ellis, M. McVicar, E. Battenberg, O. Nieto, librosa: audio and music signal analysis in Python, in Proceedings of the 14th Python in Science Conference, ed. K. Huff, J. Bergstra, 2015, pp. 18–24
12. D.-N. Jiang, L. Lu, H.-J. Zhang, J.-H. Tao, L.-H. Cai, Music type classification by spectral contrast feature, in Proceedings IEEE International Conference on Multimedia and Expo, vol. 1, Aug 2002, pp. 113–116
13. En.wikipedia.org, Mel-frequency cepstrum, 2019. Available: https://en.wikipedia.org/wiki/Mel-frequencycepstrum
14. S.S. Stevens, J. Volkmann, E.B. Newman, A scale for the measurement of the psychological magnitude pitch. J. Acoust. Soc. Am. 8(3), 185–190 (1937). https://doi.org/10.1121/1.1915893
15. M. Kattel, A. Nepal, A. Shah, D. Shrestha, Chroma feature extraction, 01 2019
16. T. Khdour, P. Muaidi, A. Al-Ahmad, S. Alqrainy, M. Alkoffash, Arabic audio news retrieval system using dependent speaker mode, Mel frequency cepstral coefficient and dynamic time warping techniques. Res. J. Appl. Sci. Eng. Technol. 7, 5082–5097 (2014)
17. En.wikipedia.org, Chroma feature, 2019
18. Spectral features, https://musicinformationretrieval.com//spectral features.html, accessed: 2019-05-25
19. Working with the moviepy library, https://zulko.github.io/moviepy/ref/ref.html
20. Working with FFmpeg, https://ffmpeg.org/documentation.html

Universal Quantitative Steganalysis Using Deep Residual Networks Anuradha Singhal and Punam Bedi

Abstract Quantitative steganalysis predicts the length of the payload message hidden in stego images. Payload estimation is a vital tool for forensic analysts. With recent developments in deep learning, convolutional neural networks (CNNs) are extensively used for image steganalysis: a CNN captures the statistical characteristics of images and learns features automatically. But as the depth of a convolutional neural network increases, accuracy suffers. A deep residual network uses residual blocks to address this problem; a residual network (ResNet) can contain hundreds or even thousands of layers that learn the characteristic features of an object implicitly. We propose blind payload length estimation using ResNet-50, a deep residual network, employing a pre-trained ResNet-50 model to estimate the payload length. Experiments are performed with the Keras Python library on grayscale images from the standard BOSSBase dataset, with mean absolute error as the performance metric. The use of the residual network ResNet-50 for universal payload estimation demonstrates good results in both the spatial and the JPEG domain.

Keywords Deep neural networks · ResNet-50 · Quantitative steganalysis · CNN

1 Introduction

Camouflaging information in any form of media is known as steganography; detecting such covert communication is steganalysis. In today's digital world, mere detection of undercover communication is not enough: forensic analysts wish to extract more information. Steganalysis is classified into active and passive steganalysis. Active steganalysis is also known as forensic steganalysis. The passive method deals only with detecting the stego object; active steganalysis focuses on extracting details such as the length of the message, the region where the message is hidden, or the embedding key used

A. Singhal (B) · P. Bedi
Department of Computer Science, University of Delhi, New Delhi, Delhi, India
P. Bedi e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_37


while inserting the payload. Many image steganography techniques exist in the literature, with embedding algorithms such as LSB [1, 2], nsF5 [3, 4], F5 [5], and Model-Based Steganography [6]. Estimating the length of the payload message is known as quantitative steganalysis. Universal or blind steganalysis predicts the embedded message length without any prior information about the steganography algorithm used. The most popular techniques in quantitative steganalysis involve extracting features and then applying a regressor for prediction [7]. These are machine learning techniques in which features are manually extracted and then fed to a regressor that forms an association between the extracted features and the images; such procedures are computationally intensive. A convolutional neural network (CNN), a deep learning architecture, provides an end-to-end system: it automatically extracts features and uses them for classification or prediction. But as the depth of a CNN increases, it suffers from the vanishing gradient problem, which affects accuracy. To overcome this, skip (residual) connections are introduced in deep residual networks, called ResNets [8]. ResNet has been used extensively for image recognition, and the robustness of ResNets has been proven in many image and visual recognition tasks [8–10]. This paper proposes the use of ResNet-50 for universal payload length estimation in stego images. Section 2 introduces associated work. Section 3 briefly describes the concepts used in the paper. Sections 4 and 5 illustrate the proposed work and experiments. Finally, the conclusion is stated in Sect. 6.

2 Related Work

Kodovský et al. [11] presented change-rate approximation using the maximum likelihood principle and a penalty function obtained by analyzing steganographic techniques and the characteristics of cover images. Kodovský et al. [12] proposed a maximum likelihood estimator for payload length estimation; this was a computationally expensive method. In their second method, an objective function formed from the zero-message hypothesis was minimized, which had a low implementation cost; an SVR estimator was used. Guan et al. [13] proposed a 1630-dimensional feature set comprising Markov, PEV, and differently calibrated Markov features for universal quantitative steganalysis, with gradient boosting as the regressor and a weak learner in each step. Kodovský et al. [7] used high-dimensional rich-model feature sets and gradient boosting for regression: in the spatial domain, a 12,753-dimensional feature set for HUGO and a 50,856-dimensional feature set for LSB replacement; in the JPEG domain, 22,510 calibrated rich-model features. In the gradient boosting, regression trees were used as base learners. Chen et al. [14] used CNN detectors for extracting features; their design, known as a bucket estimator, first trains a CNN family with different payload sizes, after which the extracted features are concatenated and trained using mean square error as the loss function. Chutani et al. [15] used extreme learning machines as the base regressor. Random subspaces of image


features and base regressor reduced the prediction error; the authors presented an extreme learning machine-based ensemble with the random subspace method for blind quantitative steganalysis. Veena et al. [16] used a low-dimensional feature set: a greedy randomized adaptive search procedure was used at the first level, the discretized-all condensed nearest neighbour method was followed for choosing instances, and an AdaBoost regressor was used for estimation. Sun et al. [17] used the Ye-Net architecture [18] for quantitative steganalysis; it worked well at low embedding rates. Singhal et al. [19] used a 348-dimensional SVD feature vector with a support vector regressor for building a blind quantitative steganalyser. Singhal et al. [20] used a CNN and long short-term memory architecture for a universal quantitative steganalyser. The techniques stated above are summarized in Table 1. Most of them use machine learning for prediction of the payload length; the deep residual network ResNet-50 has not been used in the literature for payload length estimation.

Table 1 Summary of techniques used for quantitative steganalysis

Author                | Technique used                                                                  | Domain
Kodovský et al. [11]  | Change-rate approximation                                                       | JPEG Domain
Kodovský et al. [12]  | Maximum likelihood estimator                                                    | JPEG Domain
Veena et al. [16]     | Discretized-all condensed nearest neighbor; AdaBoost regressor for estimation   | Spatial Domain
Sun et al. [17]       | Ye-Net architecture [18]                                                        | Spatial Domain
Guan et al. [13]      | 1630-dimensional feature set of Markov, PEV, and calibrated Markov features     | Universal
Kodovský et al. [7]   | High-dimensional rich-model feature set                                         | Universal
Singhal et al. [19]   | 348-dimensional SVD feature vector                                              | Universal
Chutani et al. [15]   | Extreme learning machines                                                       | Universal
Chen et al. [14]      | CNN detectors                                                                   | Universal
Singhal et al. [20]   | CNN with LSTM                                                                   | Universal

3 Basic Concepts

3.1 Steganalysis

The art of differentiating between a clean and an adulterated object is known as steganalysis. Steganalysis can be divided into two groups. Passive steganalysis deals only with classifying an image as stego or clean. Extracting details about the hidden message, such as the length of the embedded message, the region of the payload, or an estimate of the stego key, is termed active or forensic steganalysis.


Estimating the length of the embedded message is known as quantitative steganalysis. Predicting the length without any prior knowledge of the embedding algorithm is known as blind or universal quantitative steganalysis.

3.2 Deep Residual Network

CNN, a deep learning architecture, has been used extensively for object detection, image classification, image segmentation, and face recognition [10, 21–24]. Activation functions such as ReLU and dropout are used to accelerate convergence during training and to prevent over-fitting, respectively [25, 26]. But as the depth of the network increases, vanishing gradients start occurring and affect accuracy. To resolve the vanishing gradient problem, the deep residual network ResNet was introduced [8]. ResNet-50 has been used extensively for image recognition tasks. With residual networks, we can train hundreds or even thousands of layers while still achieving high accuracy. A residual network addresses the vanishing gradient problem by using skip connections. A skip (or shortcut) connection bypasses the layers of a block, feeding the block's input directly to its output; this structure is known as a residual block. In a residual block, in addition to a layer feeding into its immediate successor, its input also feeds directly into a layer two or three layers ahead of it. The procedure is demonstrated in Fig. 1. The final output y(x) is given by Eq. 1:

y(x) = f(x) + x    (1)

where f(x) is the nonlinear mapping, i.e., a stack of layers such as a convolutional block, a batch normalization layer, and an activation layer, and x is the input to the residual block.
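The residual computation of Eq. 1 can be sketched in plain NumPy (a minimal illustration, not the actual ResNet-50 layers; the two weight matrices stand in for the convolutional block of f(x)):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    """Minimal residual block: y(x) = f(x) + x (Eq. 1).

    f(x) is a small nonlinear mapping (two linear layers with a ReLU),
    standing in for the conv/batch-norm/activation stack of ResNet-50.
    """
    fx = relu(x @ w1) @ w2   # nonlinear mapping f(x)
    return relu(fx + x)      # identity skip connection added before activation
```

Even when f(x) contributes nothing (all-zero weights), the block passes its input through unchanged, which is what lets very deep networks train without vanishing gradients.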

Fig. 1 Residual block

Table 2 Structure of ResNet-50 (each bracketed triple is a bottleneck residual block, repeated the indicated number of times)

Stage   | Output    | ResNet-50
conv1   | 112 × 112 | 7 × 7, 64, stride 2
conv2   | 56 × 56   | 3 × 3 max pool, stride 2; [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3
conv3   | 28 × 28   | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4
conv4   | 14 × 14   | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 6
conv5   | 7 × 7     | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3
        | 1 × 1     | Global average pool, 1000-d fc, softmax
No. of parameters: 25.5 × 10^6
Low training error occurs in residual blocks because the gradient maps easily to previous layers through the skip connections. Table 2 illustrates the structure of ResNet-50: it contains five convolutional blocks and 50 layers. Many variants of ResNet-50 have been proposed [27].

4 Proposed Work

We predict the payload length using the pre-trained ResNet-50 architecture. Images with labels are fed to the network in the training phase, and validation is performed to obtain the best-fit model. After that, the test dataset is fed in to predict the message length. The system design is depicted in Fig. 2. The cross-entropy loss function used in the pre-trained network is given by Eq. 2:

L_CE = −(1/N) Σ_{i=1..N} log( e^{W_{y_i}^T x_i + b_{y_i}} / Σ_{j=1..N} e^{W_j^T x_i + b_j} )    (2)
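A numerically stable version of the loss in Eq. 2 can be sketched as follows (an illustrative implementation, not the Keras internals; W and b stand for the weights and biases of the final layer):

```python
import numpy as np

def cross_entropy_loss(X, y, W, b):
    """Softmax cross-entropy (Eq. 2) over a batch.

    X: (N, d) inputs, y: (N,) integer labels,
    W: (d, C) weights, b: (C,) biases.
    """
    logits = X @ W + b
    # subtract the row max before exponentiating for numerical stability
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    # average negative log-probability of the true class
    return -np.mean(np.log(probs[np.arange(len(y)), y]))
```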


Fig. 2 System design for predicting payload length

where W is the weight, b is the bias, x_i is the ith training example, y_i is its label, and N is the number of training instances. The Adam optimizer [28] is used for updating the weights and biases in the training phase; Adam improves the training speed of the network. In the training phase, a batch of images is fed into the network and a feedforward pass is carried out; backpropagation then uses the Adam optimizer to minimize the loss function. After a certain number of iterations (epochs = 100), the network is stored for predicting values. The algorithm used for training the proposed system is given as Algorithm 1.

Algorithm 1: Training of the proposed universal quantitative steganalyser
Input: Database comprising stego images with payload capacity
Output: Trained model for predicting payload length
1. Repeat while the network has not converged:
2.   Calculate the loss function as given by Eq. (2).
3.   Use the Adam optimizer in gradient descent to update the weights and biases of the network during backpropagation.
4.   Validate the model.
5. End.
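The Adam update used during backpropagation can be sketched for a single parameter vector (a standard textbook formulation with the default hyperparameters from [28], not the exact Keras internals):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: returns (new_theta, new_m, new_v).

    m, v are the running first/second moment estimates; t is the
    1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)              # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

On the first step the bias-corrected update is approximately lr · sign(grad), so the step size is bounded by the learning rate regardless of the gradient scale.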


5 Experimental Work

Dataset: BOSSbase [29] is used for the experimental work. It consists of 10,000 natural grayscale portable gray map (PGM) images of dimension 512 × 512. Due to GPU limitations, we resized the images to 256 × 256.

Platform: All the experiments are carried out in Python 3.6 using Spyder and an Nvidia Quadro K4200 GPU card.

Hyperparameters: In the training phase, some values are pre-initialized to define the structure of the network. Since we use the pre-trained network ResNet-50, we set only the learning rate, number of epochs, and batch size: the learning rate is initialized to 1e-4, the batch size to 64 images, and the number of epochs to 100.

Evaluation Criteria: MAE, the mean absolute error, is used for evaluation. It is the average of the deviation between the true and predicted values, given by Eq. 3:

MAE = (1/n) Σ_{i=1..n} |x_i − y_i|    (3)

where x_i is the predicted value, y_i is the true value, and n is the number of observations.

Spatial Domain: The WOW [30], S-UNIWARD [31], HILL [32], and MiPOD [33] embedding algorithms are used to create stego images with payload capacities of 0.2 and 0.4 bpp. We used a total of 50,000 images for the experiments, of which 40,000 were stego and 10,000 clean images. The images were partitioned into disjoint sets for training and testing (80% for training and the rest for testing). Figure 3 presents the loss percentage during training and validation. The mean absolute error achieved during training and validation is depicted in Fig. 4.

JPEG Domain: The J-UNIWARD [31] and UED-JC [34] embedding algorithms were used to create stego images at 0.2 and 0.4 bpnzac (bits per nonzero AC DCT coefficient). The 50,000 images (40,000 stego and 10,000 clean) used for the experiments were partitioned into disjoint training and testing sets. Figure 5 presents the loss percentage during training and validation. The mean absolute error achieved during training and validation is depicted in Fig. 6. A comparison of training loss and MAE with linear regression and VGG-16 [35] is given in Table 3.
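The evaluation metric of Eq. 3 is straightforward to compute; a small sketch (illustrative, equivalent to `sklearn.metrics.mean_absolute_error`):

```python
def mean_absolute_error(predicted, true):
    """MAE (Eq. 3): average absolute deviation between predictions and truth."""
    if len(predicted) != len(true):
        raise ValueError("predicted and true must have the same length")
    n = len(predicted)
    return sum(abs(x - y) for x, y in zip(predicted, true)) / n
```

For payload estimation the values are embedding rates, e.g., predictions [0.22, 0.38] against true rates [0.2, 0.4] give an MAE of 0.02.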


Fig. 3 Loss % in training and validation

Fig. 4 MAE during training and validation

Fig. 5 Loss % for training and validation in JPEG domain

6 Conclusion

Universal quantitative steganalysis is an important tool for forensic analysts: it helps them to explore covert communication. The use of ResNet-50, a deep residual network, for payload length estimation is presented in this paper. The large number of layers in ResNet-50 helps the network to learn complex statistical characteristics of images easily, making it suitable for prediction. The experimental study shows good results.


Fig. 6 MAE for training and validation in JPEG domain

Table 3 Comparison of results

Method            | Training loss | MAE of predicted values
Linear regression | 23.78         | 15.24
VGG-16 [35]       | 9.67          | 9.45
ResNet-50         | 6.78          | 4.46

We wish to extend this work with variants of the deep residual network. We would also like to propose our own architecture to produce better results.

References
1. J. Fridrich, M. Goljan, R. Du, Detecting LSB steganography in color and gray-scale images. IEEE Multim. Special Issue Secur. 8(4), 22–28 (2001)
2. J. Fridrich, J. Kodovský, Steganalysis of LSB replacement using parity-aware features, in Information Hiding, Lecture Notes in Computer Science, vol. 7692, 2013, pp. 31–45
3. A. Nissar, A. Mir, Classification of steganalysis techniques: a study. Digit. Signal Process. 20(6), 1758–1770 (2010)
4. I. Avcibas, N. Memon, B. Sankur, Steganalysis using image quality metrics, in Proceedings of SPIE Electronic Imaging, Security and Watermarking of Multimedia Contents, vol. 4314, no. II, 2001, pp. 523–531
5. S. Lyu, H. Farid, Steganalysis using higher-order image statistics. IEEE Trans. Inf. For. Secur. 1(1), 111–119 (2006)
6. D. Soukal, J. Fridrich, M. Goljan, Maximum likelihood estimation of secret message length embedded using steganography in spatial domain, in Proceedings of SPIE, Electronic Imaging, Security, vol. 5681, Jan 2005, pp. 595–606
7. J. Kodovský, J. Fridrich, Quantitative steganalysis using rich models, in Media Watermarking, Security, and Forensics, International Society for Optics and Photonics, vol. 8665, 22 Mar 2013, p. 86650
8. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778
9. J. Donahue et al., DeCAF: a deep convolutional activation feature for generic visual recognition. Int. Conf. Mach. Learn. 27, 647–655 (2014)
10. R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587
11. J. Kodovský, J. Fridrich, Quantitative steganalysis of LSB embedding in JPEG domain, in Proceedings of the 12th ACM Workshop on Multimedia and Security, 9 Sept 2010, pp. 187–198
12. J. Kodovský, J. Fridrich, Quantitative structural steganalysis of Jsteg. IEEE Trans. Inf. For. Secur. 8, 681–693 (2010)
13. Q. Guan, J. Dong, T. Tan, Blind quantitative steganalysis based on feature fusion and gradient boosting, in International Workshop on Digital Watermarking (Springer, Berlin, 2010), pp. 266–279
14. M. Chen, M. Boroumand, J. Fridrich, Deep learning regressors for quantitative steganalysis. Electron. Imaging 2018(7), 160–161 (2018)
15. S. Chutani, A. Goyal, Improved universal quantitative steganalysis in spatial domain using ELM ensemble. Multim. Tools Appl. 77(6), 7447–7468 (2018)
16. S.T. Veena, S. Arivazhagan, Quantitative steganalysis of spatial LSB based stego images using reduced instances and features. Pattern Recogn. Lett. 105, 39–49 (2018)
17. Y. Sun, T. Li, A method for quantitative steganalysis based on deep learning, in 2019 2nd International Conference on Information Systems and Computer Aided Education (ICISCAE), IEEE, 28 Sept 2019, pp. 302–309
18. J. Ye, J. Ni, Y. Yi, Deep learning hierarchical representations for image steganalysis. IEEE Trans. Inf. For. Secur. 12(11), 2545–2557 (2017)
19. A. Singhal, P. Bedi, Blind quantitative steganalysis using SVD features, in 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 19 Sept 2018, pp. 369–374
20. A. Singhal, P. Bedi, Blind quantitative steganalysis using CNN–long short-term memory architecture, in Strategic System Assurance and Business Analytics 2020 (Springer, Singapore, 2020), pp. 175–186
21. D. Yi, Z. Lei, S.Z. Li, Age estimation by multi-scale convolutional network, in Asian Conference on Computer Vision (Springer, Cham, 2014), pp. 144–158
22. X. Wang, R. Guo, C. Kambhamettu, Deeply-learned feature for age estimation, in 2015 IEEE Winter Conference on Applications of Computer Vision, 5 Jan 2015, pp. 534–541
23. A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)
24. J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440
25. V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in ICML, 2010
26. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
27. S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep neural networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500
28. D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, arXiv preprint arXiv:1412.6980, 22 Dec 2014
29. P. Bas, T. Filler, T. Pevný, "Break Our Steganographic System": The Ins and Outs of Organizing BOSS (Springer, Berlin, 2011), pp. 59–70
30. V. Holub, J. Fridrich, Designing steganographic distortion using directional filters, in IEEE International Workshop on Information Forensics and Security (WIFS), Dec 2012, pp. 234–239
31. V. Holub, J. Fridrich, T. Denemark, Universal distortion function for steganography in an arbitrary domain. EURASIP J. Inf. Secur. 1 (2014)
32. B. Li, M. Wang, J. Huang, X. Li, A new cost function for spatial image steganography, in 2014 IEEE International Conference on Image Processing (ICIP), 2014, pp. 4206–4210


33. V. Sedighi, R. Cogranne, J. Fridrich, Content-adaptive steganography by minimizing statistical detectability. IEEE Trans. Inf. For. Secur. 221–234 (2015)
34. L. Guo, J. Ni, Y.Q. Shi, Uniform embedding for efficient JPEG steganography. IEEE Trans. Inf. For. Secur. 9(5), 814–825 (2014)
35. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 4 Sept 2014

Image-Based Forest Fire Detection Using Bagging of Color Models Reyansh Mishra, Lakshay Gupta, Nitesh Gurbani, and Shiv Naresh Shivhare

Abstract Forest fires have turned out to be a calamity that can disrupt commercial as well as non-commercial activities and can severely threaten sustainable development. The research conducted so far focuses on the deployment of sensors. Taking into account the drawbacks of that approach, we propose a color-feature-based method to detect forest fire from images. Color models such as the RGB, YCbCr, and CIE L*a*b* color models are used to extract the color features of forest images in the proposed forest fire detection system. The experimental results demonstrate that the proposed method performs favorably, producing accurate and efficient predictions.

Keywords Forest fire detection · Color models · Bagging

1 Introduction

In the last few decades, the increasing number of forest fire incidents has created a situation of alarm in most nations. These fires severely affect activities of the utmost importance, whether industrial or household. The wildfire statistics of the National Interagency Fire Center (NIFC) reported that more than 10.3 million acres of land were affected by forest fires in 2020, an increase of around 6 million acres over 2019 [1]. This is because of human activities and natural phenomena such as lava and lightning. According to reports in various newspapers, high carbon monoxide emissions are also a result of these forest fires. To enhance the accuracy of sensor-based fire detection systems, a large number of sensors are required to be set up outdoors [2]. The main drawback of this approach lies in the requirement for a continuous supply of electricity, and providing such resources is a difficult task. It is a common observation that for the fire to be detected, sensors

R. Mishra (B) · L. Gupta · N. Gurbani · S. N. Shivhare
School of Computer Science, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_38


need to be placed in the close surroundings of the fire [3], which might also lead to damage to the sensors. Image processing and computer vision-based systems are replacing conventional fire detection systems, which has become possible through advances in camera technology and processing techniques [4]. This approach enables a computer system to detect fire in large open spaces without using any sensors, thereby removing the limitations of a continuous power supply and maintenance [5]. As a consequence of either natural processes or a little interference by mankind, even a small-scale forest fire can lead to unimaginable destruction. According to research organizations, there are hundreds of thousands of forest fire incidents worldwide every year [6]. Over the past 10 years, there have been an average of 64,100 wildfires each year, with an average of 6.8 million acres burned in this period [1]. This decline reflects the awareness and active contribution of mankind. Mahmoud et al. [7] proposed a method to detect forest fire in several stages: first, they applied background subtraction; then a color segmentation model was used for marking the candidate regions; third, a special wavelet analysis is administered to distinguish real fire from fire-like objects. Santhosh and Vinoth [8] suggested another approach where only the L*a*b* color model is exploited, assuming that the image acquisition device produces its output in RGB format. There are three main stages in the algorithm: optical flow estimation, the Chan-Vese algorithm, and the CIE L*a*b* color space model. Celik and Ma [9] proposed a generic rule-based color model for fire pixel detection. Their algorithm uses the YCbCr color space to address the issue of illumination variations. Different rules were defined on the Y, Cb, and Cr color components alongside a devised chrominance model on the Cb–Cr color plane.
These rules are used to identify the fire pixels in different color images. A set of images containing fire, non-fire, and fire-like regions was taken, and the rules for the three different color models were made using this set of images. Premal and Vinsley [10] also proposed a method using some statistical features from the YCbCr model. Here, four different rules were applied to segregate the fire region: two rules were used for differentiating the region under fire, and the other two were used for segmenting the extremely high fire center region. Tuba et al. [11] suggested the threshold value concept and elaborated the use of the YCbCr color model. Different characteristics of the color components of the YCbCr color model were used to detect fire based on predefined threshold values and combinations of the component values. Dubey et al. [5] proposed the use of artificial intelligence along with the Internet of Things (IoT). This involves a cluster-based approach and the use of cameras and satellites for the detection of older fires. In addition, they used a fully connected neural network for analyzing the data and notifying the concerned user. Having gone through the research conducted in this field so far, we have formulated an image-based forest fire detection approach with the following major contributions: 1. Extensive use of the RGB, YCbCr, and CIE L*a*b* color spaces to extract relevant color features altogether.


2. Applying certain conditions that come with these color models to check for the presence of fire, then combining the results using the concept of bagging with the majority voting rule to enhance performance. The rest of the paper is structured in the following way: Section 2 describes the proposed solution for forest fire detection with the application of the specified color models and their respective conditions. Section 3 illustrates the experimental results of the proposed method. Comparison with the existing methods is discussed in Sect. 4. Lastly, the paper concludes with the limitations of this approach and the future scope in Sect. 5.

2 Proposed Method

The color models with their respective features, whether color-based or statistical, are calculated in this section. Their implicit conditions for the detection of fire are applied to a data set of images with high, moderate, and low amounts of fire. Lastly, the concept of bagging with the majority voting rule is used to enhance the performance of the proposed forest fire detection method. The flowchart of the proposed method is shown in Fig. 1.

2.1 Feature Extraction

Significant color features of forest images are extracted based on three color models: the RGB color model, the YCbCr color model, and the CIE L*a*b* color model. The total number of pixels containing fire is interpreted with the help of the implicit conditions. The color models used in this approach are discussed as follows:

RGB color model: In terms of the RGB values (where R is red, G is green, and B is blue), the inter-relation used here is R > G and G > B [12], which can be written in the combined form R > G > B. This clearly shows that R should be the most dominant channel for fire detection. For a precise calculation, the mean of each of R, G, and B is also calculated and then compared with the corresponding channel value of every pixel: for regions depicting fire, the pixel values should exceed the respective mean values of R, G, and B [12].

YCbCr color model: Y is the luma component (also referred to as luminance), and Cr and Cb are the red-difference and blue-difference chroma components. It is not an absolute color space. The RGB color model needs to be converted into YCbCr because the flame color is spread more uniformly in the YCbCr color model than in the RGB color model [10]. The chrominance can be used in


Fig. 1 Flowchart of the proposed method

modeling the color of the fire rather than modeling its intensity [13]. Based on this concept, the YCbCr color space model is utilized for the classification of fire pixels. The conversion between the two color models is considered to be linear [12] and is shown in Eq. 1:

    [Y ]   [ 0.2568   0.5041   0.0979] [R]
    [Cb] = [-0.1482  -0.2910   0.4393] [G]    (1)
    [Cr]   [ 0.4393  -0.3678  -0.0714] [B]
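The RGB rule and the linear conversion of Eq. 1 can be sketched as follows (an illustrative implementation; the matrix is the standard ITU-R BT.601 studio conversion assumed here, and the pixel-level rule follows the R > G > B criterion with the pixel-exceeds-mean check described above):

```python
import numpy as np

# RGB -> YCbCr conversion matrix (Eq. 1); rows give Y, Cb, Cr
RGB2YCBCR = np.array([
    [ 0.2568,  0.5041,  0.0979],
    [-0.1482, -0.2910,  0.4393],
    [ 0.4393, -0.3678, -0.0714],
])

def rgb_fire_rule(r, g, b, r_mean, g_mean, b_mean):
    """RGB rule: fire pixels satisfy R > G > B and exceed the channel means."""
    return (r > g > b) and r > r_mean and g > g_mean and b > b_mean

def rgb_to_ycbcr(pixel):
    """Apply the linear conversion of Eq. 1 to one (R, G, B) pixel."""
    return RGB2YCBCR @ np.asarray(pixel, dtype=float)
```

A bright flame pixel such as (230, 160, 40) satisfies the ordering rule, while a blue-dominant pixel does not; after conversion, fire pixels show a high Cr and a low Cb.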

In order to obtain statistical features such as the mean and standard deviation [10], the following expressions, given in Eqs. 2, 3, and 4, are used:

Ymean = (1/(I × J)) Σ_{x=1..I} Σ_{y=1..J} Y(x, y)    (2)

Cbmean = (1/(I × J)) Σ_{x=1..I} Σ_{y=1..J} Cb(x, y)    (3)

Crmean = (1/(I × J)) Σ_{x=1..I} Σ_{y=1..J} Cr(x, y)    (4)

Here, Ymean, Cbmean, and Crmean are the mean values of the luminance/brightness, the chrominance-of-blue, and the chrominance-of-red components, respectively [12], and I × J is the total number of pixels in the input image. Using the mean of the image, the standard deviation of the image can also be calculated for better accuracy [10]. Some images do not represent any fire; after analysis on the set of images, the value of τ is empirically set to 7.4, as suggested in [10]. The standard deviation is calculated by referring to Eq. 5:

Crstd = sqrt( (1/(I × J)) Σ_{x=1..I} Σ_{y=1..J} (Cr(x, y) − Crmean)² )    (5)

Once the values of the Y, Cb, and Cr components are obtained, the conditions for the presence of fire can be checked using Eqs. 6, 7, 8, and 9 [10]:

R1(i, j) = {1, if Y(i, j) > Cb(i, j); 0, otherwise}    (6)

R2(i, j) = {R1(i, j), if Y(i, j) > Ymean and Cr(i, j) > Crmean; 0, otherwise}    (7)

R3(i, j) = {1, if Cb(i, j) ≥ Y(i, j) > Cr(i, j); 0, otherwise}    (8)

R4(i, j) = {R3(i, j), if Cr(i, j) < τ · Crstd; 0, otherwise}    (9)
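As an illustration, the conversion of Eq. 1 and the rule chain of Eqs. 6–9 can be sketched in a few lines of NumPy. This is our own hedged sketch, not the authors' implementation: the rule of Eq. 8 is omitted (its printed condition is ambiguous), so the threshold test of Eq. 9 is applied directly to the output of Eq. 7, and the conversion offsets are dropped since only relative comparisons between channels and their statistics are used.

```python
import numpy as np

# Sketch of Eqs. 1 and 6-9 (our illustration, not the authors' code).
# BT.601 studio-swing RGB -> YCbCr matrix of Eq. 1; offsets omitted.
M = np.array([[ 0.2568,  0.5041,  0.0979],
              [-0.1482, -0.2910,  0.4392],
              [ 0.4392, -0.3678, -0.0714]])

def ycbcr_fire_mask(rgb, tau=7.4):
    """rgb: H x W x 3 float array; returns a boolean fire-pixel mask."""
    ycc = rgb @ M.T                                  # per-pixel [Y, Cb, Cr]
    Y, Cb, Cr = ycc[..., 0], ycc[..., 1], ycc[..., 2]
    r1 = Y > Cb                                      # Eq. 6
    r2 = r1 & (Y > Y.mean()) & (Cr > Cr.mean())      # Eq. 7
    r4 = Cr < tau * Cr.std()                         # Eq. 9 (Eq. 8 omitted here)
    return r2 & r4

# Toy image: a fire-like patch (R > G > B) on a black background.
img = np.zeros((4, 4, 3))
img[:2, :2] = [220.0, 120.0, 40.0]
mask = ycbcr_fire_mask(img)
```

On this toy image, only the bright reddish patch satisfies the rule chain.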

CIE L*a*b* color model: This color model was recognized by the International Commission on Illumination (CIE) in 1976 [14]. It describes all the colors visible to the human eye, and its main advantage is that it is device independent; it is known as a uniform color space. The lightness (luminance/brightness) component L* ranges from the darkest black at L* = 0 to the brightest white at L* = 100. The a* and b* channels correspond to neutral gray when a* = 0 and b* = 0 [15]. On the a* axis, red gives positive values and green negative values [8, 14]; similarly, on the b* axis, yellow gives positive values and blue negative values [15]. The conversion from RGB to the CIE L*a*b* color model and the nonlinear relations for L*, a*, and b* are shown in Eqs. 10, 11, 12, 13, and 14 [8, 16]:

[X; Y; Z] = [0.412453 0.357580 0.180423; 0.212671 0.715160 0.072169; 0.019334 0.119193 0.950227] [R; G; B]    (10)

L* = {116 · (j/jn)^(1/3) − 16, if (j/jn) > 0.008856; 903.3 · (j/jn), otherwise}    (11)

a* = 500 · (f(i/in) − f(j/jn))    (12)

b* = 200 · (f(j/jn) − f(k/kn))    (13)

f(r) = {r^(1/3), if r > 0.008856; 7.787 · r + 16/116, otherwise}    (14)

where i = X, j = Y, k = Z, and in, jn, and kn describe a specified white achromatic reference illuminant. For the standard illuminant 'D65', the values are in = 95.0489, jn = 100, and kn = 108.8840. Once these components are obtained, the conditions for the presence of fire are applied according to Eqs. 15, 16, 17, 18, and 19, where 'm' denotes the mean of a particular component [8]:

R1(i, j) = {1, if L*(i, j) ≥ L*m; 0, otherwise}    (15)

R2(i, j) = {1, if a*(i, j) ≥ a*m; 0, otherwise}    (16)

R3(i, j) = {1, if b*(i, j) ≥ b*m; 0, otherwise}    (17)

R4(i, j) = {1, if b*(i, j) ≥ a*(i, j); 0, otherwise}    (18)

F(i, j) = {1, if R1(i, j) + R2(i, j) + R3(i, j) + R4(i, j) = 4; 0, otherwise}    (19)
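The conversion of Eqs. 10–14 and the mean-based rules of Eqs. 15–19 can likewise be sketched with NumPy. Again, this is our own hedged sketch (the function names and the [0, 100]-scaled RGB input are assumptions, not the authors' code):

```python
import numpy as np

# Sketch of Eqs. 10-19 (our illustration, not the authors' code).
RGB2XYZ = np.array([[0.412453, 0.357580, 0.180423],
                    [0.212671, 0.715160, 0.072169],
                    [0.019334, 0.119193, 0.950227]])
WHITE_D65 = np.array([95.0489, 100.0, 108.8840])   # (i_n, j_n, k_n)

def f(r):                                           # Eq. 14
    return np.where(r > 0.008856, np.cbrt(r), 7.787 * r + 16.0 / 116.0)

def rgb_to_lab(rgb):
    """rgb: H x W x 3 array scaled to [0, 100]; returns stacked L*, a*, b*."""
    xyz = (rgb @ RGB2XYZ.T) / WHITE_D65             # i/i_n, j/j_n, k/k_n
    fi, fj, fk = f(xyz[..., 0]), f(xyz[..., 1]), f(xyz[..., 2])
    jr = xyz[..., 1]
    L = np.where(jr > 0.008856, 116.0 * np.cbrt(jr) - 16.0, 903.3 * jr)  # Eq. 11
    a = 500.0 * (fi - fj)                           # Eq. 12
    b = 200.0 * (fj - fk)                           # Eq. 13
    return np.stack([L, a, b], axis=-1)

def lab_fire_mask(rgb):
    lab = rgb_to_lab(rgb)
    L, a, b = lab[..., 0], lab[..., 1], lab[..., 2]
    r1 = L >= L.mean()                              # Eq. 15
    r2 = a >= a.mean()                              # Eq. 16
    r3 = b >= b.mean()                              # Eq. 17
    r4 = b >= a                                     # Eq. 18
    return (r1.astype(int) + r2 + r3 + r4) == 4     # Eq. 19: all rules hold

# Toy image: a bright reddish-yellow patch on a black background.
img = np.zeros((4, 4, 3))
img[:2, :2] = [90.0, 50.0, 10.0]
mask = lab_fire_mask(img)
```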


2.2 Threshold and Bagging

The threshold value for each model was calculated by running multiple cases of fire images as well as non-fire images. All the non-fire images were checked by executing the three color models, and the threshold was set slightly above the values the models predicted for them; for images containing fire, the threshold was adjusted to be slightly below the predicted values depicting fire pixels. The threshold therefore falls in a range suitable for detecting the presence of fire in an image. Furthermore, if the result is positive for at least two color models (a majority vote), we affirm that fire exists in the image, as shown in Fig. 2. The concept of bagging is thus used to combine the three color models into a precision-oriented fire detection system.
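The majority-vote combination described above can be expressed directly. This tiny sketch is our own illustration of the affirmation rule (fire when at least two of the three per-model decisions are positive):

```python
# Majority-vote combination of the three color-model decisions (our sketch).
def bagging_vote(rgb_fire, ycbcr_fire, lab_fire):
    """Affirm fire if at least two of the three models say fire."""
    return (rgb_fire + ycbcr_fire + lab_fire) >= 2

# e.g. image 13 of Table 1: only the RGB model fires, so the vote is negative,
# while image 3 (RGB and CIE L*a*b* agree) is affirmed as fire.
image13 = bagging_vote(True, False, False)
image3 = bagging_vote(True, False, True)
```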

3 Experimental Results

After fetching the RGB values from the images, they are converted to the Y Cb Cr and CIE L*a*b* color models as shown in Fig. 2. Figure 2a shows the raw input image in RGB format, Fig. 2b the image after the Y Cb Cr conversion, and Fig. 2c the image after the CIE L*a*b* conversion. Table 1 shows that if only a single model is taken into account, the results are less accurate; the combination of the three models yields a more accurate inference, with an accuracy of 93.33%. However, the proposed method still produces false positives on certain forest images whose color intensities fall in a range similar to that of fire [7].

4 Discussion

The majority of fire detection methods involve multiple sensors, which increases the cost of developing the complete system. Since our method is based solely on image processing, it does not require any sensors. Sensors are also susceptible to damage during the course of a fire, which reduces the efficiency of the complete system, and research indicates that sensor-based systems are not time efficient. Therefore, in order to obtain responses as quickly as possible, time-efficient methods such as the one described in this paper must be implemented.


Fig. 2 Real forest fire images with extracted significant color features. a Color features of RGB model, b color features of Y Cb Cr model, c color features of CIE L*a*b* model

Table 1 Results of images depicting fire corresponding to the RGB, Y Cb Cr, and CIE L*a*b* color models and bagging, in the form of binary classification

| S. No. | Result of RGB model | Result of Y Cb Cr model | Result of CIE L*a*b* model | Result of bagging |
|--------|--------------------|-------------------------|----------------------------|-------------------|
| 1      | Yes | Yes | Yes | Yes |
| 2      | Yes | Yes | Yes | Yes |
| 3      | Yes | No  | Yes | Yes |
| 4      | Yes | Yes | Yes | Yes |
| 5      | Yes | No  | Yes | Yes |
| 6      | Yes | Yes | No  | Yes |
| 7      | Yes | Yes | Yes | Yes |
| 8      | Yes | No  | Yes | Yes |
| 9      | Yes | Yes | No  | Yes |
| 10     | Yes | Yes | No  | Yes |
| 11     | No  | Yes | Yes | Yes |
| 12     | Yes | Yes | Yes | Yes |
| 13     | Yes | No  | No  | No  |
| 14     | Yes | No  | Yes | Yes |
| 15     | Yes | No  | Yes | Yes |


5 Conclusion and Future Work

Once the experimental results are compared with the threshold values, the presence of fire is confirmed; the per-model results are summarized in Table 1. We are not aiming at the suppression of fire here; rather, our focus lies on devising a method for efficient detection of forest fires. Forest fires can be disastrous, so in order to prevent them it is extremely important to have a system or algorithm that determines the spread of fire at the earliest possible stage. Once executed successfully, this approach also holds the potential to be applied beyond the forest fire scenario, helping to detect and subdue fire to the benefit of the global flora and fauna. In the near future, the proposed approach will be extended with additional features and advanced algorithms for better precision.

References

1. Congressional Research Service, Wildfire Statistics (2020). Accessed 26 Dec 2020
2. C. Yuan, Z. Liu, Y. Zhang, UAV-based forest fire detection and tracking using image processing techniques, in 2015 International Conference on Unmanned Aircraft Systems (ICUAS) (IEEE, 2015), pp. 639–643
3. A.A.A. Alkhatib, A review on forest fire detection techniques. Int. J. Distrib. Sens. Netw. 10(3) (2014)
4. A.E. Çetin, K. Dimitropoulos, B. Gouverneur, N. Grammalidis, O. Günay, Y. Hakan Habiboğlu, B. Uğur Töreyin, S. Verstockt, Video fire detection—review. Digit. Signal Process. 23(6), 1827–1843 (2013)
5. V. Dubey, P. Kumar, N. Chauhan, Forest fire detection system using IoT and artificial neural network, in International Conference on Innovative Computing and Communications (Springer, Berlin, 2019), pp. 323–337
6. S. Joseph, K. Anitha, M.S.R. Murthy, Forest fire in India: a review of the knowledge base. J. Forest Res. 14(3), 127–134 (2009)
7. M.A.I. Mahmoud, H. Ren, Forest fire detection and identification using image processing and SVM. J. Inf. Process. Syst. 15(1), 159–168 (2019)
8. S. Arjun Santhosh, E. Vinoth, Lab color space model with optical flow estimation for fire detection in videos. IOSR J. Comput. Eng. 22–38 (2014)
9. T. Celik, K.-K. Ma, Computer vision based fire detection in color images, in 2008 IEEE Conference on Soft Computing in Industrial Applications (IEEE, 2008), pp. 258–263
10. C. Emmy Premal, S.S. Vinsley, Image processing based forest fire detection using YCbCr colour model, in 2014 International Conference on Circuits, Power and Computing Technologies (ICCPCT-2014) (IEEE, 2014), pp. 1229–1237
11. V. Tuba, R. Capor-Hrosik, E. Tuba, Forest fires detection in digital images based on color features. Int. J. Educ. Learn. Syst. 2 (2017)
12. N.I.B. Zaidi, N.A.A.B. Lokman, M.R.B. Daud, H. Achmad, K.A. Chia, Fire recognition using RGB and YCbCr color space. ARPN J. Eng. Appl. Sci. 10(21), 9786–9790 (2015)
13. V.K. Vijaylaxmi, S.C. Sajjan, Fire detection using YCbCr color model (2016)
14. CIELAB color space (2020). Accessed 7 Nov 2020


15. M. Senthil Vadivu, M.N. Vijayalakshmi, Implications of color models in image processing for fire detection. Int. J. Comput. Appl. 975, 8887
16. Color math and programming code examples (2020). Accessed 11 Nov 2020

Machine Learning Techniques for Diagnosis of Type 2 Diabetes Using Lifestyle Data

Shahid Mohammad Ganie, Majid Bashir Malik, and Tasleem Arif

Abstract Diabetes mellitus is a common chronic disease that affects millions of people in the world today. Among all types of diabetes, Type 2 diabetes mellitus (T2DM) is the most common and accounts for a vast majority (around 90%) of diabetes cases worldwide. There is a dire need to exploit machine learning techniques that can help healthcare providers better diagnose the disease. In this research paper, prominent machine learning classifiers, viz. LR, NB, SVM, DT, and RF, have been developed on a dataset collected from a wide geographical region for the diagnosis of T2DM. The results show that random forest (RF) outperformed the other classification algorithms, achieving an optimal accuracy rate of 96.29%. The developed random forest model also achieved better results in terms of other statistical measures like precision, recall, specificity, F1-score, ROC, and MCC. In order to reduce bias, tenfold cross-validation has been used.

Keywords Diabetes diagnosis · T2DM · Machine learning algorithms · Python · Jupyter notebook · LR · NB · SVM · DT · RF

1 Introduction

According to statistical reports from organizations like the World Health Organization, the International Diabetes Federation, and the American Diabetes Association, more than five hundred million people in the world are living with diabetes [1, 2]. Diabetes mellitus is growing rapidly and poses a significant threat worldwide [3, 4]. The

S. M. Ganie · M. B. Malik (B), Department of Computer Sciences, BGSB University, Rajouri, J&K, India, e-mail: [email protected]
S. M. Ganie, e-mail: [email protected]
T. Arif, Department of Information Technology, BGSB University, Rajouri, J&K, India, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022. A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_39



Fig. 1 Top 7 countries with undiagnosed diabetes

annual healthcare outlay on diabetes mellitus is estimated to be 760 billion USD globally, and spending is projected to reach 825 billion USD by 2030 and 845 billion USD by 2045 [5]. It has further been estimated that in 2019, 50%, or 231 million, of the people living with diabetes (particularly type 2 diabetes) were undiagnosed and unaware of their health issues [6]. "Diabetes mellitus is deadliest and is caused by a set of metabolic disorders that occurs when the body cannot produce any or enough insulin or cannot effectively use the insulin it produces" [7]. The human body is made of millions of cells. When we eat, the food yields a large amount of glucose, which is oxidized to provide energy to these cells. In a healthy body this glucose is regulated naturally, but in diabetic patients the regulation fails and leads to many health complications [8]. Figure 1 depicts the top 7 countries or territories with people (20–79 years) living with undiagnosed diabetes in 2019 [6]. The common symptoms found in diabetic patients are "frequent urination, fatigue, sudden weight loss, lack of energy, excessive thirst, blurred vision, etc." [7]. The categories of diabetes are type 1, type 2, gestational diabetes, and pre-diabetes [9]. Type 1 diabetes is insulin-dependent and is mainly found in people below 30 years of age. Type 2 diabetes mellitus, also called insulin-independent diabetes, can be prevented and managed through lifestyle changes [9]. Gestational diabetes is seen in females during pregnancy and may increase the risk of complications for both mother and baby [10]. Pre-diabetes is characterized by elevated sugar levels and implies a risk of developing T2DM in the future [11]. The motive of this research study is to develop machine learning models using lifestyle data for better diagnosis of T2DM.
The novel contributions of this work are as follows:

• The dataset has been collected through online as well as offline mode from a geographical region, based on the suggestions of diabetes specialists (endocrinologists).
• To remove noise, corrupted records, duplication, and outliers from the dataset, data preprocessing has been performed to improve data quality. Missing values have been filled with statistical measures (mean, mode, and median) for features like age, height, weight, and family history.
• Feature engineering has been performed, and contributing (lifestyle) parameters have been selected for building the machine learning models.

The rest of the paper is organized as follows: Sect. 1 discusses diabetes mellitus and the complications with which it affects human lives. Sect. 2 reviews previous important work on T2DM prediction using machine learning techniques. Sect. 3 describes the proposed system, i.e., data collection, data description, data preprocessing, data modeling, and finally the generation of results. Sect. 4 discusses the experimental results achieved by implementing the machine learning algorithms. Lastly, Sect. 5 presents the conclusion of the research work, identifies the shortcomings of the current study, and provides future directions for better prognosis and diagnosis of T2DM.

2 Literature Survey

Machine learning techniques have been used by various researchers for better prediction of type 2 diabetes mellitus [12–14], and wherever implemented they have provided better results in terms of accuracy, recall, precision, etc. [15, 16]. Patil and Tamane [17] presented an experimental setup of various algorithms used to classify diabetes; various statistical measures were calculated to validate the results, and among all the classifiers, logistic regression achieved the highest test accuracy of 79.54%. Barhate and Kulkarni [18] developed machine learning models based on patients' health history using the PIMA diabetes dataset for better treatment of type 2 diabetes mellitus; missing values in the dataset were replaced with the help of MICE imputations, and random forest performed best among all the algorithms with an accuracy of 79.7%. Mujumdar and Vaidehi [19] developed a prediction model for diabetes and its better classification, including external factors as well as internal factors like glucose, age, and insulin; classification accuracy was boosted with a new dataset in comparison with the existing dataset, and the application of a pipeline gave the logistic regression classifier as the best model with the highest accuracy of 96%. Birjais et al. [20] proposed a model predicting future diabetes risk and its level using algorithms like NB, LR, and GB on the PIMA diabetes dataset; the Boruta algorithm was used to find the correlation between attributes, and GB marked the highest accuracy of 86% among all classifiers. Muhammad et al. [21] proposed a framework using different machine learning algorithms for type 2 diabetes mellitus; the dataset, from Murtala Mohammed Specialist Hospital, Kano, consists of nine attributes and 383 samples. The machine learning classifiers used in that experimental study were LR, SVM, K-NN, RF, NB, and GB; random forest was found to be the best model, with an accuracy rate of 88.76%.


Random forest and gradient boosting also outperformed the other models in terms of the receiver operating curve. Tigga and Garg [22] developed a framework that identifies the risk pertaining to type 2 diabetes mellitus at an early stage. Six machine learning algorithms, viz. logistic regression, K-NN, SVM, NB, DT, and RF, were implemented, and the analytical results were compared with various statistical measures; RF achieved the highest accuracy of 94.10% among all classifiers.

3 Proposed System

The proposed framework for the study is shown in Fig. 2. It describes the working principle from data collection up to the desired results for the diagnosis of T2DM using machine learning techniques.

Fig. 2 Proposed framework for the diagnosis of T2DM


Table 1 Feature description used for the study

| S. No. | Attribute name | Description | Range |
|--------|----------------|-------------|-------|
| 1  | Age            | Age of the entity                           | 10–90 years |
| 2  | Sex            | Gender of the entity (M/F)                  | 0 or 1 |
| 3  | Family history | Any entity in a family is diabetic or not   | 0 or 1 |
| 4  | Smoking        | Whether the entity is a smoker or not       | 0 or 1 |
| 5  | Drinking       | Whether the entity is liquor or non-liquor  | 0 or 1 |
| 6  | Thirst         | How many times entity drinks water in a day | Times |
| 7  | Urination      | How many times an entity urinates in a day  | Times |
| 8  | Height         | Height of the entity                        | 60–190 cm |
| 9  | Weight         | Weight of the entity                        | 20–96 |
| 10 | Fatigue        | If the entity feels fatigued or not         | 0 or 1 |
| 11 | Diabetic       | If an entity is diabetic or not             | 0 or 1 |

3.1 Dataset Description

The dataset collected for the experimental study comprises 11 features and 1552 instances, collected across the geographical region based on the suggestions provided by endocrinologists (diabetes specialists). The selected features are described in Table 1.

3.2 Data Preprocessing

The dataset for the experiment was loaded into a Jupyter notebook [23]. Various preprocessing, data visualization, and machine learning libraries like NumPy, Matplotlib, Pandas, and Seaborn were then imported using the Python (3.9.1) programming language [24, 25]. To obtain better results, preprocessing was performed with these libraries: missing values were filled by the mean, mode, and median for attributes like age, thirst, fatigue, and urination, and duplication, inconsistency, and outliers were neutralized from the dataset. Data transformation was also performed before developing the models [25].
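The imputation step described above might be sketched with pandas as follows. Note this is our own hedged illustration: the column names and toy values are assumptions, not the authors' exact schema.

```python
import pandas as pd

# Hedged sketch of the preprocessing step (column names are assumptions):
# fill missing values with mean/median/mode and drop duplicate rows.
df = pd.DataFrame({
    "age":       [25, 40, None, 60],
    "thirst":    [6, None, 8, 10],
    "fatigue":   [0, 1, None, 1],
    "urination": [5, 7, 9, None],
})

df["age"] = df["age"].fillna(df["age"].mean())                 # mean imputation
df["thirst"] = df["thirst"].fillna(df["thirst"].median())      # median imputation
df["fatigue"] = df["fatigue"].fillna(df["fatigue"].mode()[0])  # mode imputation
df["urination"] = df["urination"].fillna(df["urination"].median())
df = df.drop_duplicates().reset_index(drop=True)               # remove duplicates
```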

3.3 Machine Learning Techniques

The main intent of this study is to apply the prominent machine learning algorithms widely used in healthcare analytics for the prediction of various diseases [26, 27]. We have developed machine learning models for a binary classification problem: identifying whether a patient is diabetic or non-diabetic based on the features of the lifestyle dataset. Five ML classifiers have been used for model building to diagnose T2DM: logistic regression (LR), naive Bayes (NB), support vector machine (SVM), decision tree (DT), and random forest (RF).
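A minimal sketch of this model-building step with scikit-learn is shown below. This is our own illustration, not the authors' code: the synthetic dataset stands in for the lifestyle data, and the default hyperparameters are assumptions.

```python
# Hedged sketch: the five named classifiers evaluated with tenfold
# cross-validation on a synthetic stand-in for the lifestyle dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "LR":  LogisticRegression(max_iter=1000),
    "NB":  GaussianNB(),
    "SVM": SVC(),
    "DT":  DecisionTreeClassifier(random_state=0),
    "RF":  RandomForestClassifier(random_state=0),
}

# Mean accuracy over 10 folds for each classifier.
scores = {name: cross_val_score(m, X, y, cv=10, scoring="accuracy").mean()
          for name, m in models.items()}
```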

4 Experimental Results

Machine learning techniques can achieve state-of-the-art results, given their potential for providing consistent, reliable, and validated outcomes; at times they may even exceed human expertise [28]. The scatterplot in Fig. 3 shows the pairwise relationships of the features in the dataset; the position of each dot along the X-axis and Y-axis quantifies a distinct data point. Figure 4 shows histograms that count the data in different bins and are used to plot the distribution of each feature in the dataset. For example, for the attribute Sex, 0.00 indicates male and 1.00 indicates female; the remaining bins likewise show the feature distributions. Table 2 reports the accuracy of the machine learning techniques used for the diagnosis of T2DM. Among all the classifiers, RF shows the highest testing accuracy of 96.29%.

Fig. 3 Scatterplot for each feature in the dataset

Fig. 4 Histogram for each feature in the dataset

Table 2 Accuracy of the various machine learning models

| Model | Training accuracy (%) | Testing accuracy (%) | Misclassification rate |
|-------|----------------------|----------------------|------------------------|
| Logistic regression    | 89.46 | 89.37 | 0.096 |
| Naïve Bayes            | 89.68 | 90.98 | 0.090 |
| Support vector machine | 91.61 | 91.94 | 0.080 |
| Decision tree          | 96.12 | 93.55 | 0.064 |
| Random forest          | 98.60 | 96.29 | 0.037 |

In the machine learning paradigm, the confusion matrix is used to evaluate the performance of classification algorithms. It is a tabular structure in which the columns represent the predicted values and the rows represent the actual values. The confusion matrices of the five classifiers are shown in Figs. 5, 6, 7, 8, and 9. Additionally, other statistical measures are calculated, as shown in Fig. 10, and used to validate the classifiers: precision, sensitivity, specificity, negative predictive value, F1-score, etc.
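These measures follow directly from the four cells of a binary confusion matrix. The sketch below is our own illustration (the counts are made up, not the paper's results):

```python
# Hedged sketch: deriving the reported statistical measures from a binary
# confusion matrix (TP, FP, FN, TN); the counts below are illustrative only.
def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)          # positive predictive value
    sensitivity = tp / (tp + fn)        # recall
    specificity = tn / (tn + fp)
    npv = tn / (tn + fn)                # negative predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "sensitivity": sensitivity,
            "specificity": specificity, "npv": npv,
            "f1": f1, "accuracy": accuracy}

m = metrics(tp=90, fp=10, fn=5, tn=95)
```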

Fig. 5 LR algorithm

Fig. 6 NB algorithm

Fig. 7 SVM algorithm

Fig. 8 DT algorithm

Fig. 9 RF algorithm



Fig. 10 Performance evaluation of ML techniques

Table 3 Comparison of accuracy of ML models

| Classifier | Accuracy with PIMA dataset (%) | Accuracy with the dataset used in this study (%) |
|------------|-------------------------------|--------------------------------------------------|
| Logistic regression    | 79 [17]    | 89.37 |
| Naive Bayes            | 87 [29]    | 90.98 |
| Support vector machine | 83 [30]    | 91.94 |
| Decision tree          | 73.82 [31] | 93.55 |
| Random forest          | 92.02 [32] | 96.29 |

Table 3 compares the accuracies reported on the PIMA diabetes dataset with those obtained on the dataset used in this study for the diagnosis of type 2 diabetes mellitus using machine learning techniques.

5 Conclusion

The main contribution of this study was to develop models based on machine learning techniques for better diagnosis of type 2 diabetes mellitus. The dataset used for the experiment was collected through offline as well as online mode from geographical regions, and the features selected for the diagnosis of T2DM were recommended by experts (diabetes specialists) of the medical domain. Preprocessing techniques were used to improve the quality of the dataset. Among all the models, random forest outperformed the rest, with the highest accuracy rate of 98.60% on the training set and 96.29% on the testing set; RF also produced good results for other statistical measures like precision, sensitivity, and specificity. Tenfold cross-validation was used to validate the machine learning models, viz. LR, NB, SVM, DT, and RF. To extend the current study, these machine learning models shall be applied to other real-time and large datasets, and the efficiency, reliability, and validity of the current research shall be established by using hybridization or ensemble approaches for diagnosis of T2DM at early stages to save human lives.

References

1. D.M. Chan, Director-General, WHO, Global report on diabetes. World Health Organization, 2018, p. 88
2. International Diabetes Federation, IDF Diabetes Atlas, 9th edn., 2019
3. World Health Organization, Global status report on noncommunicable diseases. World Health Organ. 53(9), 1689–1699 (2010). https://doi.org/10.1017/CBO9781107415324.004
4. International Diabetes Federation, N.H. Cho (chair) et al., IDF Diabetes Atlas, 8th edn., 2017
5. P. Zhang et al., Global healthcare expenditure on diabetes for 2010 and 2030. Diabetes Res. Clin. Pract. 87(3), 293–301 (2010). https://doi.org/10.1016/j.diabres.2010.01.026
6. IDF, IDF Diabetes Atlas, 9th edn., 2019
7. N. Sneha, T. Gangil, Analysis of diabetes mellitus for early prediction using optimal features selection. J. Big Data 6(1) (2019). https://doi.org/10.1186/s40537-019-0175-6
8. S. Gentile et al., Five-year predictors of insulin initiation in people with type 2 diabetes under real-life conditions. J. Diabetes Res. 2018, 1–11 (2018). https://doi.org/10.1155/2018/7153087
9. P. Kaur, M. Sharma, Analysis of data mining and soft computing techniques in prospecting diabetes disorder in human beings: a review. Int. J. Pharm. Sci. Res. 9(7), 2700–2719 (2018). https://doi.org/10.13040/IJPSR.0975-8232.9(7).2700-19
10. A. Anand, D. Shakti, Prediction of diabetes based on personal lifestyle indicators, in Proceedings of 2015 1st International Conference on Next Generation Computer Technology, NGCT 2015, 2016, pp. 673–676. https://doi.org/10.1109/NGCT.2015.7375206
11. S. Vyas, R. Ranjan, N. Singh, A. Mathur, Review of predictive analysis techniques for analysis of diabetes risk, in Proceedings of Amity International Conference on Artificial Intelligence, AICAI 2019, 2019, pp. 627–631. https://doi.org/10.1109/AICAI.2019.8701236
12. A. Hussain, S. Naaz, Prediction of diabetes mellitus: comparative study of various machine learning models, vol. 1166 (Springer Singapore, 2021)
13. S.F. Shetu, M. Saifuzzaman, N.N. Moon, S. Sultana, R. Yousuf, Student's performance prediction using data mining technique depending on overall academic status and environmental attributes, vol. 1166 (2021)
14. L. Mrsic, T. Mesic, M. Balkovic, Cognitive services applied as student support service chatbot for educational institution, vol. 1087 (2020)
15. T. Santhanam, M.S. Padmavathi, Application of K-means and genetic algorithms for dimension reduction by integrating SVM for diabetes diagnosis. Procedia Comput. Sci. 47(C), 76–83 (2015). https://doi.org/10.1016/j.procs.2015.03.185
16. M.B. Malik, A model for privacy preserving in data mining, 2016
17. R. Patil, S. Tamane, A comparative analysis on the evaluation of classification algorithms in the prediction of diabetes. Int. J. Electr. Comput. Eng. 8(5), 3966–3975 (2018). https://doi.org/10.11591/ijece.v8i5.pp3966-3975
18. R. Barhate, P. Kulkarni, Analysis of classifiers for prediction of Type II diabetes mellitus, in Proceedings of 2018 4th International Conference on Computing, Communication, Control and Automation, ICCUBEA 2018, 2018, pp. 1–6. https://doi.org/10.1109/ICCUBEA.2018.8697856


19. A. Mujumdar, V. Vaidehi, Diabetes prediction using machine learning. Procedia Comput. Sci. 165, 292–299 (2019)
20. R. Birjais, A.K. Mourya, R. Chauhan, H. Kaur, Prediction and diagnosis of future diabetes risk: a machine learning approach. SN Appl. Sci. 1(9), 1–8 (2019). https://doi.org/10.1007/s42452-019-1117-9
21. L.J. Muhammad, E.A. Algehyne, S.S. Usman, Predictive supervised machine learning models for diabetes mellitus. SN Comput. Sci. 1(5), 1–10 (2020). https://doi.org/10.1007/s42979-020-00250-8
22. N.P. Tigga, S. Garg, Prediction of type 2 diabetes using machine learning classification methods. Procedia Comput. Sci. 167, 706–716 (2020). https://doi.org/10.1016/j.procs.2020.03.336
23. Anaconda Inc., Anaconda Distribution, Anaconda, 2019
24. S. Raschka, J. Patterson, C. Nolet, Machine learning in Python: main developments and technology trends in data science, machine learning, and artificial intelligence. Information 11(4) (2020). https://doi.org/10.3390/info11040193
25. G. Nguyen et al., Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artif. Intell. Rev. 52(1), 77–124 (2019). https://doi.org/10.1007/s10462-018-09679-z
26. P. Doupe, J. Faghmous, S. Basu, Machine learning for health services researchers. Value Health 22(7), 808–815 (2019). https://doi.org/10.1016/j.jval.2019.02.012
27. N. Nissa, S. Jamwal, S. Mohammad, Early detection of cardiovascular disease using machine learning techniques: an experimental study. Int. J. Recent Technol. Eng. 9(3), 635–641 (2020). https://doi.org/10.35940/ijrte.c46570.99320
28. D.K. Choubey, S. Paul, Classification techniques for diagnosis of diabetes: a review. Int. J. Biomed. Eng. Technol. 21(1), 15–39 (2016). https://doi.org/10.1504/IJBET.2016.076730
29. M.M. Mottalib, M.M. Rahman, M.T.Md. Tarek, F. Ahmed, Detection of the onset of diabetes mellitus by Bayesian classifier based medical expert system. Trans. Mach. Learn. Artif. Intell. 4(4), 1–8 (2016). https://doi.org/10.14738/tmlai.44.1962
30. M.F. Hashim, S.Z.M. Hashim, Comparison of clinical and textural approach for diabetic retinopathy grading, in Proceedings of 2012 IEEE International Conference on Computer Science and Engineering, ICCSCE 2012, 2012, pp. 290–295. https://doi.org/10.1109/ICCSCE.2012.6487158
31. D. Sisodia, D.S. Sisodia, Prediction of diabetes using classification algorithms. Procedia Comput. Sci. 132, 1578–1585 (2018). https://doi.org/10.1016/j.procs.2018.05.122
32. K.K. Chari, M. Chinna Babu, S. Kodati, Classification of diabetes using random forest with feature selection algorithm. Int. J. Innov. Technol. Explor. Eng. 9(1), 1295–1300 (2019). https://doi.org/10.35940/ijitee.L3595.119119

Deep Learning-Based Recognition of Personality and Leadership Qualities (DeePeR-LQ): Review

Devraj Patel and Sunita V. Dhavale

Abstract The personality of an individual is his or her unique way of thinking and behaving in an environment. Automatic assessment of personality has been a challenging problem in multimedia research due to the subjectivity involved in the assessment. Similarly, leadership qualities are the set of traits that can influence a group in performing a desired task to achieve a set goal; they can be identified by observing the manifestations of an individual in a test environment. Assessment of leadership qualities using scientific selection techniques has been successfully implemented for the selection of military leaders. Our study is intended to explore the feasibility of automatically recognising leadership qualities and to identify their relationship with personality. In this paper, we explore the applicability of existing deep learning techniques for recognition of personality traits and propose research areas for automated analysis and recognition of leadership qualities.

Keywords Apparent personality traits · Big-five personality · Leadership qualities · Multimodal learning · Deep learning techniques

1 Introduction

Recognition of human personality has been considered a highly complex task, as it involves deep knowledge of human psychology along with an understanding of the individual's mind and body system that determines his unique way of acting in society [1]. Besides the psychological aspects of defining personality, its automatic assessment has been an interesting and one of the most challenging topics for researchers from the fields of psychology, computer vision and artificial intelligence.

D. Patel (B), Defence Institute of Advanced Technology, Pune, India
S. V. Dhavale, Department of Computer Science and Engineering, Defence Institute of Advanced Technology, Pune, India, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022. A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_40


D. Patel and S. V. Dhavale

Leadership qualities, on the other hand, are a prime requirement in selecting suitable candidates for any leader-specific job. Automatic recognition of leadership qualities would help assessors overcome the errors caused by subjectivity in manual assessment. A system that can automatically recognise leadership qualities can be highly useful for many applications such as automated selection systems, forensics, job screening, psychological studies, automated counselling and guidance systems, and recommendation systems. To summarise, the main contributions of this paper are as follows: (a) Define a novel approach for identification of leadership qualities using multimodal inputs. (b) Summarise the existing trends in automatic recognition of personality traits and leadership qualities. (c) Propose a research agenda for recognition of leadership qualities. (d) Identify the challenges and future directions for the proposed agenda. Before exploring the use of technology for recognition of personality traits, we present a brief review of personality computing and military leadership qualities.

1.1 Computing Personality Vinciarelli and Mohammadi [2] presented a survey of personality computing and addressed its three main problems: automatic personality perception (APP), automatic personality recognition (APR) and automatic personality synthesis (APS). Among the various bases of personality, such as the psychoanalytic, biological, behaviourist, humanistic and cognitive, trait-based personality computing, also termed "personality traits", is the most widely accepted among the computing community and researchers. The basic principle behind this line of research is that an individual's personality is externalised through the way he uses technology; hence, personality traits should be predictive of an individual's behaviour in any particular situation. Following the concept of the Big Five personality traits, as summarised in [3], the five traits (acronym OCEAN) are: (a) Openness: artistic, imaginative, unconventional, autonomous, insightful, original, wide interests, curious. (b) Conscientiousness: planful, responsible, thorough, efficient, organised, dependable, reliable. (c) Extraversion: outgoing, talkative, sociable, energetic, independent, active, assertive, zealous. (d) Agreeableness: generous, forgiving, sympathetic, trustworthy, appreciative, kind, caring, compliant. (e) Neuroticism: tense, touchy, hostile, unstable, worrying, anxious, insecure, self-pitying.

Deep Learning-Based Recognition of Personality …


Table 1 Qualities assessed for selection of military leaders
Planning and organising: Effective intelligence, Reasoning ability, Organising ability, Power of expression
Social adjustment: Social adaptability, Cooperation, Sense of responsibility
Social effectiveness: Initiative, Self-confidence, Speed of decision, Ability to influence the group, Liveliness
Dynamic: Determination, Courage, Stamina

1.2 Leadership Qualities Leadership can be separated into two broad categories: leadership effectiveness and leadership emergence [4]. Leader emergence is the perception by others of an individual as their leader, whereas leader effectiveness refers to an individual's performance in motivating and influencing his group towards achievement of its goals. The actual test of leader effectiveness happens under stressful situations. A qualitative and quantitative review of the traits of effective or emergent leaders is given by Judge et al. [5]. Every organisation involved in leadership training and development has developed its own list of 'leadership qualities' based on job requirements. In the Indian Armed Forces, for example, fifteen qualities (categorised into four factors) [6] are assessed for the selection of military leaders. Each leadership quality is assessed through a series of scientifically designed events. The responses and reactions of each individual are observed and recorded for interpretation of the leadership qualities listed in Table 1. The assessors then rate these leadership qualities on a scale of 1-10; the ten-point rating scale has been adopted at selection centres since it facilitates wider dispersion and finer discrimination near the end zones of the scale. Another well-accepted type of leadership found in the literature is ethical leadership. According to Brown et al. [7], ethical leadership is the demonstration of an appropriate code of conduct by an individual in his actions and interpersonal relationships, together with the promotion of such conduct to followers through two-way communication, reinforcement and decision-making. It is an important topic for both researchers and practitioners, since empirical evidence confirms its effects on the behaviour and performance of employees.

1.3 Organisation We have thoroughly studied and reviewed various applications of deep learning techniques for detection and recognition of personality traits. Section 2 provides an idea


of the related work in the recognition of personality traits and leadership qualities. Thereafter, in Sect. 3, we describe the relationship of personality and leadership qualities. The proposed research agenda for recognition of effective leadership qualities is explained in Sect. 4. Further, Sect. 5 contains the conclusion and future directions.

2 Related Work Mehta et al. [8] have presented the recent trends in deep learning-based personality detection; however, little literature is available on automatic identification of leadership qualities. For clarity, the literature survey is presented separately for recognition of personality traits and of leadership qualities, and the existing deep learning techniques for recognition of personality traits are grouped by input modality.

2.1 Single Modal Polzehl et al. [9] were among the first to automatically assess personality from speech using classification methods. Their method has two stages: feature extraction using audio descriptors such as intensity, pitch, loudness, formants and MFCCs, followed by classification. The Praat acoustic analysis program was used to extract the audio features, which were then fed to an SVM classifier. The acted dataset was created by a trained speaker mimicking particular personality types. In 2012, Valente et al. [10] tested the recognition of personality traits in spoken conversations from the AMI Corpus dataset, using a boosting algorithm to combine many small, weak classifiers into an accurate one. Later, Mohammadi et al. [11] showed that it is possible to automatically rank people by degree of personality trait using a corpus of 640 speech clips (each 10 seconds long) randomly extracted from nearly 100 news channels. Since 2015, deep learning approaches have gained popularity and yielded state-of-the-art accuracy for automatic speech recognition using CNNs [12]. Similarly, Majumder et al. [13] developed a CNN-based document modelling technique for detection of personality traits from text, trained on an essay dataset [14]. In 2016, Liu et al. [15] designed a model to predict the personality traits of individuals from their Twitter profile pictures. Along the same lines, in 2017, Yu et al. [16] used Facebook status updates and investigated CNNs and RNNs for the recognition of personality traits. Later, in 2019, Xue et al. [17] proposed personality analysis using facial geometric features: the method extracts geometric landmarks from static facial images and then defines facial attributes to estimate the corresponding personality traits.
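The two-stage pipeline described above (acoustic feature extraction followed by classification) can be sketched in a few lines. The sketch below is illustrative only: hand-rolled frame energy and zero-crossing-rate descriptors stand in for the Praat features, a nearest-centroid rule stands in for the SVM, and the clips and labels are synthetic assumptions.

```python
import numpy as np

def frame_features(signal, frame_len=256):
    """Cut a 1-D signal into frames and summarise the clip with two
    simple descriptors: mean frame energy and mean zero-crossing rate."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = np.mean(frames ** 2, axis=1).mean()
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0)
    return np.array([energy, zcr])

def fit_centroids(clips, labels):
    """Nearest-centroid 'classifier': one mean feature vector per label."""
    feats = np.array([frame_features(c) for c in clips])
    return {y: feats[np.array(labels) == y].mean(axis=0) for y in set(labels)}

def predict(clip, centroids):
    f = frame_features(clip)
    return min(centroids, key=lambda y: np.linalg.norm(f - centroids[y]))

# Synthetic 'acted' clips: quiet noise vs. a loud sustained tone.
rng = np.random.default_rng(0)
calm = [0.1 * rng.standard_normal(4096) for _ in range(5)]
loud = [np.sin(np.linspace(0, 400 * np.pi, 4096)) for _ in range(5)]
centroids = fit_centroids(calm + loud,
                          ["introvert"] * 5 + ["extravert"] * 5)
print(predict(np.sin(np.linspace(0, 400 * np.pi, 4096)), centroids))
```

Any real system would replace these toy descriptors with the richer prosodic and spectral features the paper names, but the train-then-match structure is the same.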
Predicting personality from gait features offered researchers a novel perspective. In 2018, Sun et al. [18] have


Fig. 1 Audio and visual feature extraction pipeline proposed by Subramaniam et al. [20]

explored, through regression analysis, the gait features that affect an individual's personality. The authors used a Kinect 2.0 depth camera to collect participants' gait data.

2.2 Bi-Modal With the release of the first impressions challenge dataset [19] in 2016, bi-modal architectures fusing audio and video features gained the interest of researchers. The first impressions dataset consists of nearly 10,000 video clips of 15 s each, extracted from 3,000 YouTube videos of people of various ages speaking English in front of a camera. The videos were annotated with Big Five personality traits by Amazon Mechanical Turk (AMT) workers. Subramaniam et al. [20] proposed recognition of the Big Five personality traits from short videos, using an LSTM-based model to capture the temporal patterns of the image frames (from video) and audio input (Fig. 1). Ma et al. [21] used LSTMs to learn activity progression for activity detection; their training dataset, "ActivityNet", comprising 28,000 videos of 203 activities, was collected from YouTube. Later, Ventura et al. [22] experimented with CNNs by combining face detection and annotated action unit (AU) recognition systems. In the same year, Yeng et al. [23] modelled the temporal dependencies in video clips using a deep bi-modal regression LSTM. Further, Wei et al. [24]


Fig. 2 Architecture of descriptor aggregation network (DAN) model proposed in [24]

Fig. 3 Flow chart of the multimodal model proposed by Gurpinar et al. [25]

have proposed the deep bi-modal regression (DBR) framework. A traditional CNN was modified into a descriptor aggregation network (DAN), shown in Fig. 2, to capture visual information from images extracted from the videos: the fully connected layer is discarded and replaced by global average-pooling and max-pooling. Log filter-bank features were chosen as the audio representation, extracted from the audio tracks of the videos. After training both modalities, an ensemble method was used to obtain the final scores.
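The pooling substitution in DAN can be illustrated directly: each convolutional feature map is reduced by global average- and max-pooling, and the two vectors are concatenated in place of a fully connected layer. A minimal numpy sketch (the channel count and map size are arbitrary assumptions, not the paper's configuration):

```python
import numpy as np

def dan_head(feature_maps):
    """Reduce conv feature maps of shape (channels, H, W) to a single
    descriptor by global average-pooling and global max-pooling, then
    concatenating, in the spirit of the descriptor aggregation network
    (no fully connected layer)."""
    avg = feature_maps.mean(axis=(1, 2))   # shape (channels,)
    mx = feature_maps.max(axis=(1, 2))     # shape (channels,)
    return np.concatenate([avg, mx])       # shape (2 * channels,)

# Hypothetical output of the last conv block: 512 maps of 7x7.
maps = np.random.default_rng(1).standard_normal((512, 7, 7))
descriptor = dan_head(maps)
print(descriptor.shape)  # (1024,)
```

The appeal of this head is that it has no trainable parameters and is independent of the spatial size of the input maps.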

2.3 Tri-Modal/Multimodal There is comparatively little work on trimodal or multimodal recognition of personality traits. Most authors propose fusion techniques that combine single and bi-modal models to obtain multimodal results. Gurpinar et al. [25] proposed fusing extreme learning machine (ELM) models trained on audio, face and scene features; the flow chart of the model is presented in Fig. 3. Weighted score-level fusion was implemented to predict the Big Five personality traits on the first impressions dataset. While most studies focused on fusing audio and video modalities, Gucluturk et al. [26] demonstrated the performance of various models fusing several data modalities; the experiment used models for regression of sensory data, language models, personality trait annotations and job interview annotations. Similarly, Kampman et al. [27] proposed a model using


Table 2 Best mean accuracy achieved for recognition of personality traits using deep learning techniques
Text: Majumder et al. [13], 57.99
Audio: Valente et al. [10], 64.84
Video: Gurpinar et al. [25], 90.94
Bi-modal: Wei et al. [24], 91.30
Multimodal: Güçlütürk et al. [26], 91.62

audio, video and text modalities, training a separate CNN for each modality. Table 2 summarises the best test accuracies achieved by various authors in recognising personality traits with deep learning techniques.
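The weighted score-level fusion used in several of the surveyed systems (e.g. [25]) reduces to a weighted average of per-modality trait scores. A small illustrative sketch, where the modality scores and weights are invented for the example rather than taken from any paper:

```python
import numpy as np

def weighted_score_fusion(scores, weights):
    """Late (score-level) fusion: combine per-modality Big Five score
    vectors with modality weights normalised to sum to one."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    # Sum over modalities m, keeping the trait axis t.
    return np.einsum("m,mt->t", w, np.asarray(scores, dtype=float))

# Hypothetical per-modality predictions for the traits (O, C, E, A, N).
audio = [0.60, 0.55, 0.70, 0.50, 0.40]
face  = [0.62, 0.58, 0.66, 0.52, 0.45]
scene = [0.58, 0.50, 0.60, 0.48, 0.42]
fused = weighted_score_fusion([audio, face, scene], weights=[0.3, 0.5, 0.2])
print(fused)
```

In practice the weights are tuned on a validation split; equal weights recover simple score averaging.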

3 Personality Traits to Leadership Qualities Judge et al. [5] state that extraversion has the strongest correlation with leadership. Conscientiousness, followed by neuroticism and openness to experience, displayed the next strongest correlations, while agreeableness displayed a relatively weak correlation with leadership. Their results indicate that the Big Five traits predicted student leadership better than other types, such as government or military leadership. There has always been ambiguity in the literature over the validity of trait theory; after reviewing various publications, we observe that nobody has systematically sampled the leadership-related facets within the prominent personality domains. Ozbag [28] in 2016 analysed the relationship between the five-factor traits and ethical leadership, quoting Stogdill's (1948) statement that an individual's leadership exists in a social situation, and those who are leaders in one situation may or may not be leaders in another. Based on certain hypotheses, an experiment was conducted with 144 respondents; it concluded that three of the Big Five traits (openness, conscientiousness and agreeableness) were positively related to ethical leadership and subordinate perceptions, neuroticism was negatively related, and extraversion had minimal relation to ethical leadership.

4 Proposed Research Agenda According to Allport [1], "Personality is the dynamic organisation within the individual of those psychophysical systems that determine his unique adjustment to his environment and


his characteristics, behaviour and thought". This implies that personality comprises behaviour together with the surrounding situation, i.e.

Personality = Behaviour × Environment  (1)

It also implies that a correct assessment of personality can be obtained by interpreting an individual's responses along with the situation under analysis: the individual is first put in a testing environment, his manifestations are recorded, and they are then interpreted fairly against the laid-down shades of each quality. Singh et al. [29] have explained the qualities observed in the selection of military leaders. Three aspects of a candidate are considered: the conscious, subconscious and unconscious levels. Accordingly, there are three ways to collect manifestations of candidates during the selection process: (a) through written mode (psychological analysis); (b) through personal interview (audio and video); (c) through group effectiveness (audio and video). The written mode may also include apperception tests and self-descriptions; language models can be used to extract features and interpret the qualities associated with each input. Similarly, extracting the relevant qualities from the manifestations observed during group activities forms a research agenda towards recognition of leadership qualities. The proposed methodology takes input from three modes, i.e. text, audio and video. Text may be collected in the form of short essays, story writing based on situational pictures, or comments posted on social media and blogs. Audio includes one-to-one interaction during interviews or public speeches delivered in a time-bound manner. Video data may contain group activities and the body language of individuals; audio and video may also be collected from a common source for training the model. Once collected, the data needs to be annotated with the help of subject experts (psychologists or qualified assessors) for all the leadership qualities listed in Table 1, based on a standard chart for the analysis of each quality.
The proposed model requires efficient feature extraction techniques: language models for text data, spectral features for audio and convolutional neural networks for video data. Fast and robust classification techniques such as LSTMs may be applied for sequential classification of the inputs from all modalities, and appropriate ensemble methods such as bagging or boosting may be applied to obtain the final ratings of leadership qualities. The ratings (on a scale of 10) are compared with a standard chart, also called a shade card, prepared for each leadership quality, so as to predict the overall personality of an individual. The flow chart of the proposed method is presented in Fig. 4. The novel approach of identifying leadership qualities through combined analysis of handwritten text, one-to-one interviews and group effectiveness would enhance the efficiency and accuracy of the selection system for military leaders.
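The ensemble-and-shade-card step above can be sketched as follows. This is a hypothetical illustration: the ratings, the simple averaging ensemble and the shade-card bands are assumptions for the example, not the actual chart used at selection centres.

```python
def fuse_ratings(per_mode_ratings):
    """Average each leadership quality's rating across the text, audio
    and video assessments (a simple ensemble; bagging or boosting could
    replace the plain average)."""
    qualities = per_mode_ratings[0].keys()
    return {q: sum(r[q] for r in per_mode_ratings) / len(per_mode_ratings)
            for q in qualities}

def shade(rating, card=((8, "outstanding"), (6, "above average"),
                        (4, "average"), (0, "below average"))):
    """Map a 1-10 rating to a qualitative band on a hypothetical
    shade card (thresholds invented for illustration)."""
    for lo, label in card:
        if rating >= lo:
            return label

# Hypothetical per-mode ratings for three of the fifteen qualities.
text_r  = {"initiative": 7, "courage": 6, "cooperation": 8}
audio_r = {"initiative": 6, "courage": 7, "cooperation": 7}
video_r = {"initiative": 8, "courage": 5, "cooperation": 9}
fused = fuse_ratings([text_r, audio_r, video_r])
print({q: shade(r) for q, r in fused.items()})
```

The final profile of an individual would then be read off from the per-quality bands rather than from any single modality's rating.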


Fig. 4 Proposed methodology for recognition of personality using leadership qualities

4.1 Challenges Ahead and Future Directions Recognition of leadership qualities is a challenging task, as it requires a deep understanding of psychological aspects along with knowledge of the technology. The challenges for successful and effective recognition of personality with leadership qualities fall into two areas: annotated datasets and efficient computing techniques. Annotated Dataset Unsurprisingly, data plays a vital role in training an appropriate model for recognition of personality, and annotation of the captured data requires the intervention and guidance of experts. To the best of our knowledge, very few datasets are available in the open domain; notable annotated datasets are the first impressions dataset [19], the speaker personality corpus [30] and MyPersonality [31] (Facebook stopped sharing the MyPersonality dataset in 2018). Not a single dataset is available with annotated leadership-related qualities. The first challenge, therefore, is to collect a large amount of data (text, audio and video) from the relevant fields and then annotate it with the help of subject experts. Computing Techniques There has been remarkable progress in applying deep learning to the recognition of the Big Five personality traits using audio, text and video modalities [8, 25, 27, 32], but comparatively little in computing leadership qualities [2], and even less in computing the personality-to-leadership relationship. There is thus great scope for applying automatic personality detection methods to the development of models that recognise leadership qualities. An effective deep learning technique for data processing and recognition


of accurate qualities is required, so as to enhance human–computer interaction and support technology-assisted psychological studies.

5 Conclusions Considering the deeper aspects of leadership qualities and their impact on human–computer interaction, further research is essential. As discussed in this paper, deep learning provides various ways to compute psychological constructs. Due to the lack of relevant datasets, most existing multimodal techniques have been tested on acted or annotated datasets and may not reach the same accuracy on real data; there is therefore a clear need for more diverse datasets for better training of the models. Since these qualities reflect the salient aspects of an individual's behaviour, their automatic recognition would help to bridge the gap between human beings and technology. To the best of our knowledge, this paper is the first review of deep learning techniques for computing leadership qualities. The computing community, job selection centres and the psychological domain would benefit greatly from deeper study of deep learning techniques for identification of leadership qualities and recognition of personality traits. Acknowledgements We would like to acknowledge the valuable feedback and suggestions of instructors at the Defence Institute of Psychological Research, New Delhi. Lt Cdr Patel recognises the support of fellow research scholars and staff of the Defence Institute of Advanced Technology, DRDO Lab, Ministry of Defence, India.

References
1. G. Allport, Pattern and growth in personality. Am. Acad. Child Psychiatr. 2(4), 769–771 (1963). https://doi.org/10.1007/978-0-387-79061-9_96
2. A. Vinciarelli, G. Mohammadi, A survey of personality computing. IEEE Trans. Affect. Comput. 5(3), 273–291 (2014). https://doi.org/10.1109/TAFFC.2014.2330816
3. J. Digman, Personality structure: emergence of the five-factor model. Ann. Rev. Psychol. 41, 417–440 (1990). https://doi.org/10.1146/annurev.ps.41.020190.002221
4. R.G. Lord, C.L. De Vader, G.M. Alliger, A meta-analysis of the relation between personality traits and leadership perceptions: an application of validity generalization procedures. J. Appl. Psychol. 71, 402–410 (1986). https://doi.org/10.1037/0021-9010.71.3.402
5. T. Judge, J. Bono, R. Ilies, M.W. Gerhardt, Personality and leadership: a qualitative and quantitative review. J. Appl. Psychol. 87(4), 765–780 (2002)
6. C.J.P. Singh, D. Tiwary, B.D.S.N. Mishra, A conceptual approach to officer selection and officer like qualities. J. Sci. Res. Publ. 5(10), 342–356 (2015)
7. M. Brown, L. Treviño, D. Harrison, Ethical leadership: a social learning perspective for construct development and testing. Org. Behav. Human Dec. Process. 97, 117–134 (2005). https://doi.org/10.1016/j.obhdp.2005.03.002


8. Y. Mehta, N. Majumder, A. Gelbukh, E. Cambria, Recent trends in deep learning based personality detection. Artif. Intell. Rev. 53 (2020). https://doi.org/10.1007/s10462-019-09770-z
9. T. Polzehl, S. Möller, F. Metze, Automatically assessing personality from speech, in IEEE Fourth International Conference on Semantic Computing, pp. 134–140 (2010). https://doi.org/10.1109/ICSC.2010.41
10. F. Valente, S. Kim, P. Motlicek, Annotation and recognition of personality traits in spoken conversations from the AMI meetings corpus, vol. 2 (2012)
11. G. Mohammadi, A. Origlia, M. Filippone, A. Vinciarelli, From speech to personality: mapping voice quality and intonation into personality differences, pp. 789–792 (2012). https://doi.org/10.1145/2393347.2396313
12. D. Palaz, M. Magimai-Doss, R. Collobert, Convolutional neural networks-based continuous speech recognition using raw speech signal, in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4295–4299 (2015). https://doi.org/10.1109/ICASSP.2015.7178781
13. N. Majumder, S. Poria, A. Gelbukh, E. Cambria, Deep learning-based document modeling for personality detection from text. IEEE Intell. Syst. 32(2), 74–79 (2017). https://doi.org/10.1109/MIS.2017.23
14. J. Pennebaker, L. King, Linguistic styles: language use as an individual difference. J. Pers. Soc. Psychol. 77(6), 1296–1312 (1999). https://doi.org/10.1037/0022-3514.77.6.1296
15. L. Liu, D. Preotiuc-Pietro, Z.R. Samani, M.E. Moghaddam, L.H. Ungar, Analyzing personality through social media profile picture choice, in ICWSM, pp. 235–242 (2016)
16. J. Yu, K. Markov, Deep learning based personality recognition from facebook status updates, pp. 383–387 (2017). https://doi.org/10.1109/ICAwST.2017.8256484
17. M. Xue, X. Duan, Y. Wang, Y. Liu, A computational personality traits analysis based on facial geometric features, in 2019 IEEE 14th International Conference on Intelligent Systems and Knowledge Engineering (ISKE), pp. 1107–1111 (2019). https://doi.org/10.1109/ISKE47853.2019.9170334
18. J. Sun, P. Wu, Y. Shen, Z. Yang, H. Li, Y. Liu, T. Zhu, L. Li, K. Zhang, M. Chen, Relationship between personality and gait: predicting personality with gait features, in 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1227–1231 (2018). https://doi.org/10.1109/BIBM.2018.8621300
19. C. Ponce-López, ChaLearn LAP 2016: first round challenge on first impressions—dataset and results (Springer International Publishing, 2016). https://doi.org/10.1007/978-3-319-49409-8_32
20. A. Subramaniam, V. Patel, A. Mishra, P. Balasubramanian, A. Mittal, Bi-modal first impressions recognition using temporally ordered deep audio and stochastic visual features, pp. 337–348 (2016)
21. S. Ma, L. Sigal, S. Sclaroff, Learning activity progression in LSTMs for activity detection and early detection, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1942–1950 (2016). https://doi.org/10.1109/CVPR.2016.214
22. C. Ventura, D. Masip, A. Lapedriza, Interpreting CNN models for apparent personality trait regression, in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1705–1713 (2017)
23. K.Y. Stanford, S. Mall, N.G. Stanford, Prediction of personality first impressions with deep bimodal LSTM (2017)
24. X.-S. Wei et al., Deep bimodal regression of apparent personality traits from short video sequences. IEEE Trans. Affect. Comput. 9(3), 303–315 (2017). https://doi.org/10.1109/TAFFC.2017.2762299
25. F. Gürpinar, H. Kaya, A.A. Salah, Multimodal fusion of audio, scene, and face features for first impression estimation, in 23rd International Conference on Pattern Recognition (ICPR), pp. 43–48 (2016). https://doi.org/10.1109/ICPR.2016.7899605
26. Y. Güçlütürk, U. Güçlü, X. Baró, H.J. Escalante, I. Guyon, S. Escalera, M.A.J. van Gerven, R. van Lier, Multimodal first impression analysis with deep residual networks. IEEE Trans. Affect. Comput. 9(3), 316–329 (2018). https://doi.org/10.1109/TAFFC.2017.2751469


27. O. Kampman, E. Jebalbarezi, D. Bertero, P. Fung, Investigating audio, video, and text fusion methods for end-to-end automatic personality prediction, pp. 606–611 (2018). https://doi.org/10.18653/v1/P18-2096
28. G.K. Özbağ, The role of personality in leadership: five factor personality traits and ethical leadership. Proc. Soc. Behav. Sci. 235–242 (2016). https://doi.org/10.1016/j.sbspro.2016.11.019
29. J. Singh, D.D. Tiwary, D.A. Jha, Adjusts to change accepts current practice administers reacts responds to circumstance follows through implements decisions (2016)
30. L. Gilpin, D. Olson, T. Alrashed, Perception of speaker personality traits using speech signals, pp. 1–6 (2018). https://doi.org/10.1145/3170427.3188557
31. D. Stillwell, M. Kosinski, MyPersonality project: example of successful utilization of online social networks for large-scale social research (2012)
32. A. Kachur, E. Osin, D. Davydov, K. Shutilov, A. Novokshonov, Assessing the big five personality traits using real-life static facial images (2020). https://doi.org/10.31234/osf.io/3y98a

Sentence-Level Document Novelty Detection Using Latent Dirichlet Allocation with Auto-Encoders S. Adarsh, S. Asharaf, and V. S. Anoop

Abstract Novelty detection is a classical one-class classification problem in machine learning that attempts to separate 'known' data inputs from 'unknown' inputs, mainly using unsupervised learning approaches. Many approaches exist for detecting novelty in text documents, primarily at the sentence level, with varying degrees of success depending on the methodology adopted. Advancements in deep learning have contributed significantly to the development of more effective novelty detection algorithms. This work focuses on sentence-level document novelty detection using latent Dirichlet allocation (LDA), one of the most widely accepted and extensively used topic modeling algorithms, combined with auto-encoders. Experiments on benchmark datasets indicate that the proposed approach outperforms some state-of-the-art approaches to novelty detection. Keywords Novelty detection · Topic modeling · LDA · Auto-encoders

1 Introduction The vast majority of classical pattern recognition approaches in machine learning focus on classifying data inputs into two or more classes. Novelty detection, by contrast, is a pattern recognition approach that uses the principles of one-class classification, where one class (usually the known or normal one) is used to distinguish from all other possible data inputs [1]. This approach generally assumes S. Adarsh (B) · S. Asharaf Indian Institute of Information Technology and Management-Kerala (IIITM-K), Thiruvananthapuram 695581, India e-mail: [email protected] S. Asharaf e-mail: [email protected] V. S. Anoop Rajagiri College of Social Sciences, Kochi 683104, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_41

S. Adarsh et al.

that the normal or known class is highly sampled in the given context, while the unknown classes are under-sampled. Novelty detection is a generalised way of separating the 'known' and 'unknown' classes, a primary requirement in most real-world scenarios. The word 'novelty' is defined by the Merriam-Webster dictionary as something 'new' or different from anything familiar. The process classifies whether test data belongs to the class of the training data by defining a description of the normal data and building a model for it [2]. The description of the known/normal class is learnt by building a model, generally through unsupervised learning of positive instances. Previously unseen inputs then undergo a pattern matching process against the learned models of normal/known data, producing a novelty score as output. This score may be probabilistic or non-probabilistic, and it is compared to a threshold set for the novelty decision: any input with a novelty score greater than the threshold is classified as the unknown/abnormal class [3]. Novel data inputs may occur relatively infrequently in any context, but they can have very significant consequences for the entire working environment. Novelty and redundancy are generally treated as converses in a data stream and are relevant in many real-world scenarios; hence, this one-class classification finds important application in large datasets for easily categorising whether new data belongs to the defined model. Novelty detection is often mistaken for, and commonly substituted with, anomaly and outlier detection [4]. Outlier detection refers to finding data points that are inconsistent with, and possibly far from, the remaining data points, whereas anomaly detection is the process of finding unexpected data points or irregularities compared to the normal data.
An outlier is typically considered an event that is deemed inconsistent with the provided training dataset. This work proposes an approach for detecting the novelty of text documents, powered by a topic modeling algorithm. The major contributions of this paper are as follows: • Introduces novelty detection and its relevance in real-world applications. • Critically examines some of the related and prominent works in novelty detection and also discusses topic modeling algorithms. • Proposes a framework for novelty detection powered by a topic modeling process. • Experimentally verifies the usefulness of the proposed algorithm on the novelty detection task. The remainder of this paper is organized as follows: Sect. 2 discusses some of the recent and prominent approaches to novelty detection; Sect. 3 introduces the proposed framework; Sect. 4 describes the experimental setup; Sect. 5 presents the results with a detailed analysis; conclusions and future work are discussed in Sect. 6.
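The threshold-based novelty decision described in this section can be sketched in a few lines. The distance-to-nearest-training-point score below is only one possible (non-probabilistic) choice of novelty score, and the data and threshold are synthetic assumptions for the example.

```python
import numpy as np

def novelty_scores(train, test):
    """Score each test point by its distance to the nearest training
    (known-class) point; larger means more novel."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    d = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
    return d.min(axis=1)

def classify(train, test, threshold):
    """One-class decision: a score above the threshold -> 'unknown'."""
    return ["unknown" if s > threshold else "known"
            for s in novelty_scores(train, test)]

rng = np.random.default_rng(0)
known = rng.standard_normal((100, 2))   # well-sampled normal class
queries = [[0.1, 0.0], [8.0, 8.0]]      # in-distribution vs. far away
print(classify(known, queries, threshold=1.0))
```

The threshold is the operating point of the detector: raising it trades missed novelties for fewer false alarms on the known class.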

Sentence-Level Document Novelty Detection Using Latent …

513

2 Related Work There are many approaches reported in the recent past on detecting novelty from text documents, with varying degrees of success. This section details some of those prominent works on novelty detection and also outlines the background details on topic modeling algorithms which are used in the proposed work.

2.1 State-of-the-Art in Novelty Detection A multiple class novelty detection framework was proposed by Oza et al. [5]. The authors considered the problem of multiple class novelty detection under dataset distribution shift to improve the novelty detection performance [5]. An approach for novelty detection in social media by fusing text and image into a single structure has been proposed by Amorim et al. [6] S. Kumar and K. K. Bhatia had proposed an innovative novelty detection mechanism [7] which may be appended with the existing web crawlers to avoid the issue of detecting novelty or redundant information while crawling. OCGAN—an approach for one-class novelty detection which focusses on learning latent representations of the known examples with the help of a denoising auto-encoder network was proposed by Perera et al. [8]. A novelty detection patent mining approach for analyzing technological opportunities was introduced by Wang et al. [9]. Their research leveraged the novelty detection statistical technique to develop a patent mining approach to identify novel/unusual patents which can provide innovative ideas for new opportunities. The authors have utilized the natural language processing (NLP) and latent semantic analysis (LSA) approaches to bring out the unseen relations between words in the patent documents for reducing the vocabulary mismatch issues and eases the task of keyword selection by experts. In this work, we propose a method for sentence-level novelty detection using latent Dirichlet allocation (LDA) [10] which is one of the most widely used topic modeling algorithms, and the same is combined with auto-encoders for better novelty detection from text documents.

2.2 Background: Topic Modeling Topic modeling represents a static modeling in machine learning which is extensively used for text mining and information retrieval from large collection of documents. Text mining is the process of deriving valuable information from the text corpus which utilizes natural language processing techniques for transforming the unstructured text inputs into meaningful and actionable information [11]. With the ever-growing document collection across Internet, it becomes difficult to query the relevant information needed for any specific context. Topic modeling is a suite of

514

S. Adarsh et al.

text understanding algorithms in the area of text mining which can automatically recognize the important topics inside a document or a textual object and can show the inherent text patterns exhibited by the document corpus. A ‘topic’ is generally considered as a probability distribution over a fixed dictionary on the context of topic modeling approaches. A text can represent any content like books, articles, mails, blogs and any form of unstructured text content. The topic models are not capable of understanding the meanings and concepts of words inside the text documents. The topic models assume that any part of the text is a combination of selected words from a likely bucket of words in which each bucket corresponds to a unique topic [12]. This process will be iterated over the most probable distribution of words into bucket which forms the ‘topic’ for building the model. One of the earliest approaches of topic modeling was latent semantic indexing (LSI) [13] was introduced by Deerwester et al. in 1990 which was a non-probabilistic model and hence not widely used in topic modeling. Modified versions of LSI like probabilistic LSI (PLSI) [14] were introduced by Hofmann in 1999 which is considered to be a more realistic approach. Later in 2003, Blei et al. published a more realistic and complete probabilistic generative model as an enhancement of PLSI called latent Dirichlet allocation (LDA) [10]. The LDA approach is usually defined as a two-level hierarchical Bayesian model with an assumption that each document is a distribution over topics and every topic is a distribution over words [15] Every word inside a document is generated by first sampling a topic from the topic distribution associated with the corresponding document and then sampling a word from the word distribution associated with the corresponding topic [16]. 
Thus, given a corpus, LDA tries to find the right assignment of topic to every word [17] such that the parameters of the generative model are maximized.

3 Proposed Method The proposed method aims to implement sentence-level novelty detection using topic modeling approach. Novelty in a given sentence can be computed with respect to the count of the new words appearing in them [18, 19]. This method enhances the LDA approach by introducing an auto-encoder to find the novelties in the topics and to classify whether a new sentence is novel or not. For converting the input dataset of headline text into vector space, we used the term frequency-inverse document frequency (TF-IDF) approach. There are many vector space models for text analysis such as TF-IDF, CBOW, word2vec and doc2vec. TF-IDF can be defined as a statistical measure which evaluates on the relevance of a word to a corresponding document in a given corpus of documents. This measure is mathematically defined by the multiplication of two metrics: the count of the word occurrence in a given document and the inverse document frequency of the given word across the given corpus of documents. We have chosen this method as it is very useful in case of small and domain-specific texts like the news headline. It is very good in understanding texts which contain domain heavy terms and are not very well written or grammatically

Sentence-Level Document Novelty Detection Using Latent …

515

Fig. 1 Workflow of the proposed approach

correct. With well-written text contents, we can use other methods like doc2vec which will be more efficient. The equation for computing the TF-IDF score for a given word t in the given document d from the corpus of document D is given as follows: T f − id f (t, d, D) = T f (t, d).id f (t, D)

(1)

T f (t, d) = log(1 + f eq(t, d))

(2)

I d f (t, D) = log(N /Count(dε D : tεd)

(3)

Since this process of novelty detection is completely unsupervised and we do not have labeled data to classify as known or unknown class, we will use five-layer deep ‘auto-encoder’ neural network to train our model. Auto-encoder is a special type of neural network which copies input data to output data through a process called reconstruction. ‘Hidden layers’ of the network do the feature extraction and the decoding work. At the end of the entire process definitely, some loss gets generated and the data point which is dissimilar from others incurs more loss (Fig. 1).

4 Experimental Setup This section describes the experimental setup we have used for implementing the proposed framework for the novelty detection. All the experiments discussed here were implemented in Python 3.7 and run on a server configured with AMD Opteron 6376 @ 2.3 GHz/32 core processor and 64 GB of main memory. The dataset used is Kaggle’s {A million news headlines} containing 1,186,018 headlines published over a period of 15 years. This dataset was preprocessed through the following steps before loading into the LDA model. We have performed the following steps. • The dataset was tokenized by splitting the headline text content into sentences and the sentences into words.

516

S. Adarsh et al.

• All words were converted to lowercase and removed the punctuations in the headlines. • All words with less than three characters are removed as they will not be an eligible topic. • All stop words in the dataset were removed. • All words are lemmatized by changing the words in third person to first person and changing the verbs in past and future tenses into present tense. • Stemming of words was performed to reduce the words into their root form. After preprocessing the news headline input, we create a dictionary called bag of words which represents the number of times a word appears in the input training data set. The bag of words is created for every document with some basic filtering parameters like removing tokens that appear in less than 15 documents, more than half of documents and taking only the first 100,000 most frequent tokens. We used the Gensim simulator to create the bag of words dictionary and created the TF-IDF model on the bag of words corpus. The auto-encoder used here is a five-layer deep neural network with network structure as input layer (Layer 1) with 300 features, encoding layer (Layer 2) with 600 features, Layer 3 with 150 features, decoding layer (Layer 4) with 600 features and the output layer (Layer 5) with 300 features. The Layers 2, 3 and 4 form the hidden layers of the auto-encoder. Auto-encoders are trained with same data for the input and output layers where the network learns a representation (encoding) for the dataset, mainly for dimensionality reduction, through training to eliminate unwanted signals. The known/normal data input can directly pass between layers, with minimal loss, but the data loss for the unknown data will be more as it deviates from the hidden data pattern. A cosine similarity of output and actual data is computed to measure the ‘loss’ which determines the novelty score. [NP]i = 1/d

d 

(x j − n j )2

(4)

j=1

[NP]i = novelty factor for the ith data instance d = total number of features x j = jth feature value of the input (1 < = j < = d) nj = jth feature value of output.

5 Results and Discussions This method details the results of our proposed approach for novelty detection. The results we have obtained using different features are shown in Table 1. When only TF-IDF is used as a feature, we got 52% and 44% as precision and recall, respectively,

Sentence-Level Document Novelty Detection Using Latent … Table 1 Results obtained from the proposed approach using different features with auto-encoder

Feature

517 Precision

Recall

Mistake

TF-IDF

0.52

0.44

43.5

BOW+TF-IDF

0.67

0.34

27.2

TF-IDF+Auto-encoder

0.82

0.21

20.5

TF-IDF+BOW+Auto-encoder

0.91

0.18

11.3

and the values were 67% and 34%, respectively, when we have combined the bag of words model with TF-IDF. When the auto-encoder was used along with TF-IDF that has been computed from the dataset, there was a significant increase in the precision, that is, 82% and the recall was 21%. The best precision was obtained when the auto-encoder was used along with TF-IDF and bag of words features. The column ‘mistake’ in the table is computed as 1-accuracy which is the percentage of incorrect predictions. As seen from the table, the value for the mistake is significantly reduced when more features are used along with the auto-encoder. For the topic modeling, we have used the pyLDAvis package’s interactive chart to analyze the produced topics and the related keywords. This will create and inter-topic distance map to know the performance of LDA model in generating topics and also shows the top 30 most relevant terms for the topic. Here every circle on the lefthand side of the plot represents a topic. The bigger the circle, the more relevant and dominant is that topic in the given document input. A reasonably good topic model will have fairly big, non-overlapping circles scattered across the chart rather than forming clusters within limited quadrant. Our process has successfully built a good topic model with better sufficient number of topics. Any model with large number of topics will have many overlaps and smaller circles clustered in one quadrant of the representation. The effect and results of introducing LDA with the proposed method are shown in Table 2. When used with the LDA, the proposed approach performed significantly better compared with the non-LDA approach. When only LDA weight (topic weight) was used as a feature, the method showed 62% and 59% precision and recall values, respectively. When LDA was combined with bag of words, the precision and recall values obtained were 73% and 69%. 
The largest precision was obtained when the features TF-IDF, bag of words, topic weights (LDA) and the auto-encoder were Table 2 Results obtained from the proposed approach using different features with auto-encoder and topic modeling Feature

Precision

Recall

Mistake

LDA weight

0.62

0.59

49.8

BOW + LDA

0.73

0.69

43.6

TF-IDF + LDA + Auto-encoder

0.89

0.84

29.7

TF-IDF + BOW + LDA + Auto-encoder

0.96

0.89

18.3

518

S. Adarsh et al.

Fig. 2 a Precision, recall and F-measure comparison of the proposed approach with different features and auto-encoders. b Precision, recall and F-measure comparison of the proposed approach with different features and auto-encoder with topic modeling (LDA)

combined, and this approach gave the precision of 96% and recall of 89%, respectively. The mistake values obtained are 49.8%, 43.6%, 29.7% and 18.3%, respectively, for LDA weight, bag of words + LDA, TF-IDF + LDA + Auto-encoder, and TF-IDF + BOW + LDA + Auto-encoder features, respectively (Fig. 2).

6 Conclusions and Future Work This work proposed an approach for detecting sentence-level novelty by combining topic modeling with auto-encoders. Topic modeling is a suite of text understanding algorithms that unearths the latent themes from a set of documents, and the autoencoders are special types of neural network which copies input data to output data through a process called reconstruction. The experiments conducted on some realworld dataset show that efficient novelty detection is possible using the proposed approach that significantly outperforms some of the state-of-the-art approaches for novelty detection. As the end results are promising, the authors would like to add more features into the proposed approach for better novelty detection and also would like to conduct more experiments on different datasets. Acknowledgements The authors would like to thank all the staff members of Data Engineering Laboratory at the Indian Institute of Information Technology and Management-Kerala (IIITM-K), for their valuable suggestions and criticisms that significantly improved the quality of this paper. This work was supported by the Junior Research Fellowship (JRF) in Engineering Sciences by Kerala State Council for Science, Technology and Environment (KSCSTE).

References 1. M. Markou, S. Singh, Novelty detection: a review—part 1: statistical approaches. Signal Process. 83(12), 2481–2497 (2003)

Sentence-Level Document Novelty Detection Using Latent …

519

2. M. Markou, S. Singh, Novelty detection: a review—part 2: neural network based approaches. Signal Process. 83(12), 2499–2521 (2003) 3. M.A.F. Pimentel, et al., A review of novelty detection. Signal Process. 99, 215–249 (2014) 4. D. Miljkovi´c, Review of novelty detection methods” in The 33rd International Convention MIPRO (IEEE, 2010) 5. P. Oza, H.V. Nguyen, V.M. Patel, Multiple class novelty detection under data distribution shift, in European Conference on Computer Vision, vol. 2 (Springer, Berlin, 2020) 6. M. Amorim, F.D. Bortoloti, P.M. Ciarelli, E.O. Salles, D.C. Cavalieri, Novelty detection in social media by fusing text and image into a single structure. IEEE Access 7, 132786–132802 (2019) 7. S. Kumar, K.K. Bhatia, Semantic similarity and text summarization based novelty detection. SN Appl. Sci. 2(3), 332 (2020) 8. P. Perera, R. Nallapati, B. Xiang, Ocgan: One-class novelty detection using gans with constrained latent representations, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), pp. 2898–2906 9. J. Wang, Y.J. Chen, A novelty detection patent mining approach for analyzing technological opportunities. Adv. Eng. Inform. 42, 100941 (2019) 10. D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003) 11. Z. Tong, H. Zhang, A text mining research based on LDA topic modelling, in International Conference on Computer Science, Engineering and Information Technology, 2016 12. S.I. Nikolenko, S. Koltcov, O. Koltsova, Topic modelling for qualitative studies. J. Inf. Sci. 43(1), 88–102 (2017) 13. S. Deerwester, et al., Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990) 14. T. Hofmann, Probabilistic latent semantic indexing, in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999. 15. R. 
Arun, et al., On finding the natural number of topics with latent Dirichlet allocation: some observations, in Pacific-Asia Conference on Knowledge Discovery and Data Mining (Springer, Berlin, 2010) 16. H. Jelodar, et al., Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey. Multim. Tools Appl. 78(11), 15169–15211 (2019) 17. A. Onan, S. Korukoglu, H. Bulut, LDA-based topic modelling in text sentiment classification: an empirical analysis. Int. J. Comput. Linguistics Appl. 7(1), 101–119 (2016) 18. E.R. Faria, et al., Novelty detection in data streams. Artif. Intell. Rev. 45(2), 235–269 (2016) 19. G. Gaughan, A.F. Smeaton, Finding new news: novelty detection in broadcast news, in Asia Information Retrieval Symposium (Springer, Berlin, 2005)

Prediction of Environmental Diseases Using Machine Learning Amrita Sisodia and Rajni Jindal

Abstract The modern lifestyle leads to a chain of urban diseases. Worldwide around 2.4 million deaths could be prevented if a practice of proper hygiene can be maintained. The development of computer science and regional information generates a vast amount of data and creates an environment for disease prediction. The present work describes the implementation of BDA and machine learning (ML) techniques in healthcare. In this paper, we are using five ML algorithms-like decision tree (DT), support vector machine (SVM), random forest (RF), Naïve Bayes (NB), and logistic regression (LR) for the analysis of Diarrhoea, Malaria, Viral Hepatitis, Japanese Encephalitis, and Acute Respiratory Infection. The experiment has been conducted by using Python on the dataset available at the open government India platform for calculating recall and precision. For an unbiased estimate of the prediction model, the tenfold cross-validation method was used. Moreover, to see the effectiveness of the proposed model a comparative study has been made based on obtained results. Keywords Big data · Hadoop · Healthcare · Electronic health · Health care data · Remote health monitoring

1 Introduction The rapid development of urbanization improved our daily life but also leads to a series of urban diseases. These prevalent urban diseases can cause cancer and many other environmental diseases in most of the Indian cities. Inadequate sanitation facility is the root cause of various environmental diseases. In the year 2016, hygiene, sanitation, and water were responsible for 829,000 annual deaths due to diarrhoea, and this risk factor makes an important environmental contributor for ill health. Unsafe water, sanitation, or hygiene is the biggest factors of deaths around the globe due to diarrhoeal disease. In addition to diarrhoea following diseases such A. Sisodia · R. Jindal (B) Department of Computer Science and Engineering, Delhi Technological University, Bawana Road, New Delhi, Delhi 110042, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_42

521

522

A. Sisodia and R. Jindal

as Malaria, Viral Hepatitis, Japanese Encephalitis, and Acute Respiratory Infection will be prevented if adequate water quantity is maintained with quality, hygiene, and sanitation facilities. Globally by improving sanitation, hygiene and water have the potential to avoid at least 9.1% disease burden [1]. The healthcare sector is a big contributor to the world’s digital structured and unstructured data. Every second data is created from various sources. The accumulation of this huge amount of data from the sources such as clinical data [2], medical images, X-ray data, and sensor data has overburdened the capacity for effective and essential aggregation and analysis aiming to provide better clinical quality, personalized patient care, and patient safety. To figure out a decisive correlation or complex relations between the symptoms of a patient, diseases, and their respective treatment [3] one should do an intelligent and efficient data analysis representation. This efficient data analysis representation should also be able to provide better economical methods for cost reduction in personalized care. According to the World Health Organization (WHO), global health spending totaled $7.5 trillion in 2013, which is increasing by an average of 6 percent every year since 1995. This trend keeps on accelerating in the future and it reached $9.3 trillion in 2018 [4]. Although increased healthcare expenditure can be matched with the improvement in healthcare or the new and advanced services provided by it. These healthcare services can be differentiated between the following groups of our society.

1.1 Senior Citizens Environmental health is the main component of the overall health of a Nation’s citizens. Unfortunately, the division of environmental benefits and risks are not equal to all fragments of our society. Older people may have different vulnerabilities to environment containments [5, 6] due to various reasons-like ageing parameters change the degree of sustainability which is associated with physiological changes. Various other factors affect the vulnerability to environmental hazards factors such as socioeconomic and nutritional status. Senior citizens require more medical facilities and care than the younger generation. So they need greater health care medical facilities.

1.2 Additional Information Required by the Volume Editor When sending your final files, please include a readme informing the contact volume editor which of your names is/are your first name(s) and which is/are your family name(s). This is particularly important for Spanish and Chinese names. Authors are listed alphabetically according to their surnames in the author index.

Prediction of Environmental Diseases Using Machine Learning

523

2 Related Works Machine learning (ML) and big data is an established field of computer science and playing an important role in the healthcare sector for doing predictive analysis and classification of different diseases [7, 8]. N. Yuvaraj et al. uses ML for the prediction of diabetes on the Hadoop cluster [9]. In their experiments, random forest produces the best result on a four-node Hadoop cluster, followed by decision tree and Naïve Bayes. In another study by Mehrbakhsh Nilashi et al. uses ML for the prediction of the Unified Parkinson’s Disease Rating Scale (UPDRS) [10]. The author uses a support vector machine and finds out that the method has the potential to predict the right thing. Some of the authors also use ML for improving healthcare services in the cloud environment. Ahmed Abdelaziz et al. uses ML for the prognosis of chronic kidney disease (CKD) by utilizing it in a hybrid environment [11]. The author uses two consecutive techniques for this where linear regression is utilized for finding out the critical factors that affect CKD and neural network is used for the prediction of CKD. The ML is utilized in healthcare for different contexts such as Akbar K. Waljee et al. used it in predicting the use of steroids in hospitalized and outpatients suffering from inflammatory bowel disease (IBD) [12]. Based on the literature, it is clear that ML is being used for disease detection and prediction. This helps the doctors to monitor various parameters of a patient’s health in advance. For a good healthcare facility ML is always used to take preventive measures and at the same time, it is also utilized to reduce the burden of healthcare workers. ML is the perfect technique used for better healthcare facilities and it works well in the combination of different techniques like big data analytics (BDA).

3 Waterborne Disease In developing countries, water sources are contaminated as they contain many biological and physical agents, harmful chemicals. The mortality and morbidity rate relates to waterborne disease happens due to the transmission of infection through waterbased route, water-washed route, and waterborne route. The spread of waterborne diseases is caused by consuming unhealthy water by animals and human beings. Water disease-like diarrhoea is instigated due to the ingestion of pathogens in the human body and after consuming contaminated food, unhealthy water through unclean hands. The main causes of transmission of these pathogens are insufficient hygiene and inadequate sanitation. Better water management and irrigation services eliminate viral diseases-like Japanese encephalitis which is transmitted by mosquitoes. This disease causes inflammation of the membranes around the brain. The disease condition can be improved by doing improvement in water resources and removing the access of mosquito vectors to the pig.

524

A. Sisodia and R. Jindal

3.1 Urbanization In the new era of the global and networked world, all things are connected by urban nodes. The organization of the city gathers the attention of a human. However, in the process of urban development, various urban diseases make the human body their hostage. Urbanization helps the countries in their economic growth but in this run, various adverse factors affect the citizens’ health. To tackle a particular disease firstly it is required to target the essential cause before this disease turns up as an epidemic and then treat it with targeted therapies. Therefore, treating the disease in its bud is a better choice before it becomes an epidemic. Based on these problems we extend the conception of some environmental diseases by predicting them using classification techniques. For addressing these particular problems and unhealthy practices of human being a proper analysis of health care data is required.

4 Big Data Analysis Big data refers to datasets that are too large and beyond the capacity of classic database software tools to perform basic operations like capture, store, manage, and analyze [13]. It can also be generalized as a vast amount of data, which is nearly in terabytes (1012 bytes), petabytes (1015 bytes), zettabytes (1021 bytes) [14]. Big data is a collection of various dataset those are large and complex and it is very difficult to process them. Medical data is present in different formats which are produced from various sources. This inconsistent behaviour of data is very difficult to handle and process for generating important information out of it. Big data analytical tools are used for this purpose but they are still in the incubator. BDA applications have the capability for improving health care outcomes and decrease wastage of health care resources to improve overall healthcare facilities. As an example, Google’s model was able to predict the spread of influenza more accurately than the US centre for disease control model, which depends on the cases submitted from the hospitals and health clinics [15]. Generally, Big data is explained by the seven ‘Vs’ of it like volume, velocity, veracity, variety, validity, volatility, value [16].

4.1 Volume It is the amount of data in the dataset which is being processed by parallel and distributed computing. To run complex operations on a large data set parallel and distributed models were used. Big data deals with high velocity, high veracity, and the high volume of data which is being processed by taking computation processing closer to data rather than taking data to computation as happened before the big data era.

Prediction of Environmental Diseases Using Machine Learning

525

4.2 Velocity It refers to the speedy generation of data from various sources which is making it big and complex to be handeled by typical data handling software. In the medical field, this high-velocity data is generated from various sources such as wearable devices, smartphone applications data, sensors data, etc. To handle this tsunami of data powerful data engines are required. Probably, we can say Hadoop MapReduce helps for processing, storing, and retrieving this large amount of data.

4.3 Veracity It means the meaningfulness or the truthfulness of data. Before this huge amount of data is added to the big data universe, the work of data scientists and the research community has been started to check the cleanliness and preciseness of data. Because they are dealing with unstructured big data such as tweets, Facebook posts, LinkedIn posts, etc. It is required by them to segregate the things that they see and what they require for taking critical business and sales decisions. In this V the processing of big data is done to figure out the related analysis and result outcomes. That is why data cleansing with the help of some great tools and algorithms were performed.

4.4 Variety Healthcare big data is a mixture of varied kinds of data which includes text and images from X-ray, MRI, sensor, Radiology data, voice, video recordings, etc. The more variety of data can cause the chances of more errors. That is why a relational database can no longer work on this mixed form of data.

4.5 Validity The validity of data and veracity of data are two different concepts but very similar to each other. Generally, validity is used to check the accuracy and correctness of data on the behalf of its future usage. With the help of an example given by Huang et al. [12] explained for a physician that they can not take data from a clinical trial that is related to a patient’s disease symptoms without validating them. In other words, we can say that a set of data may be valid for one application but may be invalid for another usage.

526

A. Sisodia and R. Jindal

4.6 Volatility Big data volatility means the validity of data or how long the data is valid and how long it can be stored. In this real-world time plays an important role, data we need for a particular time may become futile for future analysis. In the medical field healthcare data changes according to time. This thing raised a question on the relevance of data that how long the data should be stored and when this data should be archived or deleted. As the size of data is increasing on daily biases it is an important decision to be taken by the experts.

4.7 Value This is a special V among all Vs for a special reason. The desired outcome of the big data processing is retrieved by this V. Our primary goal is to extract out the maximum value from any big data set.

5 Health Care Ecosystem and Performance Measures The recent advancement towards digitization in the healthcare industry has openedup new research opportunities. Though the data is not present in any particular format then also we can extract out some meaning full and intelligent information from it. This information is further used for research purposes and can be utilized to create new trends. All of this comes under the BDA process. BDA can potentially be used to create a good health ecosystem by accessing large and diverse data sources to provide a timely estimation of quality.

5.1 Experimental Method In Fig. 1 with the help of a flow chart we are explaining the overall working of our experimental model. Step 1: In this step, we are retrieving the disease dataset available on the Indian Government site [17]. The data at website “data.gov.in” contains data related to different disease in vast amount, it also contains sanitation and water facility in India along with the health resources. All five diseases are caused due to adverse environmental conditions and the practice of poor hygiene. Globally, diarrhoea is the second most important cause of death in children under five years. The main causes of these diseases are a variety of bacterial, viral, and parasitic organisms which lead to infection in the intestinal tract. The infection spreads through polluted food or

Prediction of Environmental Diseases Using Machine Learning


Fig. 1 Representation of the working of an experimental model

drinking water, or as a result of poor hygiene. These diseases can be prevented by using safe drinking water or by adopting improved sanitation practices that reduce disease risk. Step 2: In step two, preprocessing of the dataset is performed by normalization. This step is tedious because the amount of data is huge and it consists of different disease data along with various causes of disease. Although the implementation of big data in healthcare is still at an incubation stage, it can be used to predict some significant information. Step 3: In this step, the dataset is passed to each classifier individually by applying machine learning techniques. The results generated by these classifiers help to predict the disease. Step 4: Here, the recall and precision performance measures are calculated to measure the performance of all models using the following relations:

Recall = TP / (TP + FN)    (1)

Precision = TP / (TP + FP)    (2)

where TP, FP, and FN are the true positive, false positive, and false negative counts, respectively. Step 5: In the last step, evaluation and comparison of the various classifiers are performed on the different datasets.
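Equations (1) and (2) can be computed directly from raw prediction counts. The following sketch illustrates both measures; the label lists are hypothetical examples, not taken from the paper's datasets.

```python
# Recall and precision from Eqs. (1) and (2), computed from TP/FP/FN counts.

def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)

def counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN for a single positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn

# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0]
tp, fp, fn = counts(y_true, y_pred)
print(recall(tp, fn), precision(tp, fp))  # → 0.75 0.75
```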


A. Sisodia and R. Jindal

6 Results and Discussion The recall values of the different models for the various diseases, computed as performance measures using Eq. 1, are shown in Table 1 for the different datasets. The results are validated by the tenfold cross-validation method. Precision and recall values are considered best when they are close to one. It is observed that for recall the Naive Bayes classifier produces the best results for Viral Hepatitis, Malaria, and Japanese Encephalitis, which indicates that these datasets have more independent attributes. For Diarrhoea the Random Forest classifier performs well, and the Logistic Regression classifier works best for Acute Respiratory Infection. Figure 2 represents a bar graph for the various models, plotted using the data from Table 1. Hence, the observed results indicate that the best model is created for Viral Hepatitis with Naive Bayes, which gives the maximum recall value of 97% amongst all the diseases.

Table 1 Calculated recall for various models using classification techniques

| Diseases data set | Support vector machine | Naïve Bayes | Random forest | Decision tree | Logistic regression |
|---|---|---|---|---|---|
| Malaria | 0.36 | 0.64 | 0.41 | 0.52 | 0.26 |
| Diarrhoea | 0.75 | 0.50 | 0.76 | 0.70 | 0.66 |
| Viral Hepatitis | 0.31 | 0.97 | 0.30 | 0.42 | 0.14 |
| Japanese Encephalitis | 0.21 | 0.45 | 0.35 | 0.42 | 0.17 |
| Acute Respiratory Infection | 0.77 | 0.54 | 0.85 | 0.81 | 0.87 |

The precision values of the different models for the different diseases are calculated using Eq. 2. Table 2 presents the obtained precision values for the various ML models calculated on the different environmental disease datasets. The best precision is measured by the RF classifier in the case of Malaria, Viral Hepatitis, and Japanese Encephalitis.

Fig. 2 Graph of calculated Recall as performance measures
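The tenfold cross-validation protocol used to validate these results can be sketched as plain index splitting: the dataset is divided into ten folds, each held out for testing exactly once while the remaining nine train the model. The fold logic below is a generic illustration, not code from the study; any of the five classifiers could be plugged in.

```python
# k-fold index splitting for cross-validation (k = 10 here).

def kfold_indices(n: int, k: int = 10):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(kfold_indices(100, 10))
# Across the 10 folds, every sample is tested exactly once.
tested = sorted(i for _, test in folds for i in test)
print(len(folds), tested == list(range(100)))  # → 10 True
```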


Table 2 Calculated precision for various ML models using classification techniques

| Diseases data set | Support vector machine | Naïve Bayes | Random forest | Decision tree | Logistic regression |
|---|---|---|---|---|---|
| Malaria | 0.49 | 0.46 | 0.56 | 0.44 | 0.45 |
| Diarrhoea | 0.72 | 0.84 | 0.77 | 0.71 | 0.74 |
| Viral Hepatitis | 0.25 | 0.30 | 0.56 | 0.40 | 0.41 |
| Japanese Encephalitis | 0.41 | 0.41 | 0.55 | 0.42 | 0.44 |
| Acute Respiratory Infection | 0.83 | 0.97 | 0.97 | 0.94 | 0.96 |

For Acute Respiratory Infection, the RF and NB classifiers produce the same result. In the diarrhoea dataset, Naive Bayes gives the best precision, which means that the diarrhoea dataset has more independent attributes than the other datasets. Precision and recall are very important parameters for measuring the performance of a classifier. They are significant factors in the medical field, where the promise of a better healthcare system with a more proactive-to-predictive approach is realized by applying these measures. Figure 3 represents a bar graph for the various ML models, plotted using the data from Table 2. Thus, the observed results show that the best model is created for Acute Respiratory Infection with the Random Forest and Naïve Bayes classifiers, which give a precision of 97%, the highest amongst all the diseases.

Fig. 3 Representation of precision for various models


Fig. 4 Comparison of precision and recall values with state-of-the-art algorithms

7 Comparison with Other Techniques The investigation carried out by Yuvaraj et al. [9] uses a diabetes dataset on a Hadoop cluster and performs predictive analysis using three ML algorithms; the comparison with the present study is shown in Fig. 4. In their analysis, Naïve Bayes produces the best precision of 91% and RF produces the best recall of 88%. In the proposed method, results are produced using five ML algorithms on five different disease datasets. The best precision value, 97% (i.e., 0.97), is produced by NB and RF for Acute Respiratory Infection, and the best recall value, also 97% (i.e., 0.97), is produced by NB for Viral Hepatitis. In both cases, the results of the proposed method are better than those of the method used in the comparative study [9].

8 Conclusion Early disease prediction is beneficial for providing good and stable health. The study predicts particular environmental diseases based on desirable parameters. All of this has been carried out using the tenfold cross-validation technique for unbiased results. Amongst the datasets of the different diseases, the best values observed are 97% for recall on Viral Hepatitis (with the NB classifier) and 97% for precision on Acute Respiratory Infection (with the RF and NB classifiers). The proposed model is also compared with the state-of-the-art algorithms, and it gave the best precision and recall values. We have considered default parameters for all classifiers; in the future, we will perform parameter tuning of all classifiers with a larger volume of datasets from a broader variety of sources.


References
1. World Health Organization, Global status report on water safety plans: a review of proactive risk assessment and risk management practices to ensure the safety of drinking-water (No. WHO/FWC/WSH/17.03). World Health Organization (2017)
2. M. Ghassemi, L.A. Celi, D.J. Stone, State-of-the-art review: the data revolution in critical care. Crit. Care 19(1), 1–9 (2015). https://doi.org/10.1186/s13054-015-0801-4
3. N. Agarwal, A. Brem, Strategic business transformation through technology convergence: implications from General Electric's industrial internet initiative. Int. J. Technol. Manage. 67(2–4), 196–214 (2015). https://doi.org/10.1504/IJTM.2015.068224
4. M. Kumar, J. Mostafa, R. Ramaswamy, Federated health information architecture: enabling healthcare providers and policymakers to use data for decision-making. Health Inf. Manage. J. 47(2), 85–93 (2018). https://doi.org/10.1177/1833358317709704
5. M. King, A. Smith, M. Gracey, Indigenous health part 2: the underlying causes of the health gap. The Lancet 374(9683), 76–85 (2009). https://doi.org/10.1016/S0140-6736(09)60827-8
6. World Health Organization, Diet, nutrition, and the prevention of chronic diseases: report of a joint WHO/FAO expert consultation, vol. 916. World Health Organization (2003)
7. T.B. Murdoch, A.S. Detsky, The inevitable application of big data to health care. JAMA 309(13), 1351–1352 (2013). https://doi.org/10.1001/jama.2013.393
8. W. Huang, H. Wang, Y. Zhang, S. Zhang, A novel cluster computing technique based on signal clustering and analytic hierarchy model using Hadoop. Clust. Comput. 22(6), 13077–13084 (2019). https://doi.org/10.1007/s10586-017-1205-9
9. N. Yuvaraj, K.R. SriPreethaa, Diabetes prediction in healthcare systems using machine learning algorithms on Hadoop cluster. Clust. Comput. 22(1), 1–9 (2019). https://doi.org/10.1007/s10586-017-1532-x
10. M. Nilashi, O. Ibrahim, H. Ahmadi, L. Shahmoradi, M. Farahmand, A hybrid intelligent system for the prediction of Parkinson's disease progression using machine learning techniques. Biocybern. Biomed. Eng. 38(1), 1–15 (2018). https://doi.org/10.1016/j.bbe.2017.09.002
11. A. Abdelaziz, M. Elhoseny, A.S. Salama, A.M. Riad, A machine learning model for improving healthcare services on cloud computing environment. Measurement 119, 117–128 (2018). https://doi.org/10.1016/j.measurement.2018.01.022
12. A.K. Waljee, R. Lipson, W.L. Wiitala, Y. Zhang, B. Liu, J. Zhu, B. Wallace, S.M. Govani, R.W. Stidham, R. Hayward, P.D. Higgins, Predicting hospitalization and outpatient corticosteroid use in inflammatory bowel disease patients using machine learning. Inflamm. Bowel Dis. 24(1), 45–53 (2018). https://doi.org/10.1093/ibd/izx007
13. A. Sisodia, R. Jindal, Exploring the application of big data analysis in healthcare sector, in 2017 International Conference on Computational Science and Computational Intelligence (CSCI), 14–16 Dec 2017, pp. 1455–1458. https://doi.org/10.1109/CSCI.2017.254
14. J.S. Rumsfeld, K.E. Joynt, T.M. Maddox, Big data analytics to improve cardiovascular care: promise and challenges. Nat. Rev. Cardiol. 13(6), 350 (2016). https://doi.org/10.1038/nrcardio.2016.42
15. M.F. Uddin, N. Gupta, Seven V's of Big Data: understanding Big Data to extract value, in Proceedings of the 2014 Zone 1 Conference of the American Society for Engineering Education. IEEE (2014), pp. 1–5. https://doi.org/10.1109/ASEEZone1.2014.6820689
16. W. Raghupathi, V. Raghupathi, Big data analytics in healthcare: promise and potential. Health Inf. Sci. Syst. 2(1), 3 (2014). https://doi.org/10.1186/2047-2501-2-3
17. Open Government Data Platform India, https://data.gov.in

Frequent Itemset Mining Using Genetic Approach Renji George Amballoor and Shankar B. Naik

Abstract There is a realization in industry that business analytics can be used for extracting knowledge from data for better decision-making. This involves the use of data mining techniques, like frequent itemset mining, for discovering patterns of interest. The process of mining frequent itemsets is challenging due to the large number of itemsets generated in the process and presented to the user. The use of genetic algorithms in frequent itemset mining can help address this issue. A genetic algorithm to mine frequent itemsets, called FIMGA, is proposed in this paper. FIMGA employs the features of genetic approaches in mining frequent itemsets. FIMGA does not consider for crossover those itemsets which are certain not to generate any frequent itemset. This reduces the overall search time over itemsets, making FIMGA time efficient. Experiments have shown that FIMGA is efficient in generating frequent itemsets. Keywords Frequent itemset mining · Genetic algorithms · Data mining · Business analytics · Knowledge discovery

1 Introduction In the present economy and society, large volumes of real-time big data are being generated from the interactions of people, machines and things. The penetration and spread of disruptive digital technologies, along with the Internet of things and digital convergence, are creating new data even from digital breadcrumbs or digital exhaust. These online traces capture huge amounts of data about human behaviour, preferences, perceptions, attitudes, interactions, thoughts, beliefs, etc., on a real-time basis for decision-making. Decisions made from the hidden patterns and trends in big data are impacting the economy.

R. G. Amballoor (B) · S. B. Naik Directorate of Higher Education Government of Goa, Goa, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_43



There is a realization in industry that business analytics can be used for extracting value from data [1] for better decision-making [2]. Analysing and generating patterns through data mining is becoming important for businesses in taking strategic decisions to survive in the competitive ecosystem. Data mining also equips the industry to understand and predict consumer behaviour, and supports sales forecasting, inventory management, diversification, marketing, etc. The discovery of knowledge from big data came to be known as knowledge discovery in databases (KDD). The KDD process of frequent itemset mining (FIM) [3], or market basket analysis (MBA), is used by business managers for predictive analytics. The hidden patterns and factors can bring out the association or co-occurrence among the items purchased [4]. Frequent itemset mining, which aims to generate itemsets of potential interest to users, has long been an important domain for researchers. However, it suffers from the generation of a large number of itemsets, making it a costly process both in terms of time and memory [5]. The algorithms implementing the concept generate a huge set of itemsets during the intermediate phases of execution [6, 7]. While users are interested in frequent itemsets, a huge portion of the itemsets generated in the intermediate phases are not frequent, yet they must be stored and analysed to generate the final results. Storing and scanning these itemsets is challenging. The study in this paper uses genetic algorithms to generate frequent itemsets from huge consumer buying data collected internally and externally. The use of genetic algorithms to understand patterns in consumer behaviour is based on the concepts of natural selection and genetics [8].
Genetic algorithms can provide near-optimal results within a short time period, which makes them attractive for business. They are highly promising in handling the large number of itemsets generated during intermediate stages. The Apriori approach generates itemsets one level at a time based on the size of the itemsets: itemsets of size k are generated by analysing itemsets of size k − 1. This is similar to the approach followed in genetic algorithms, where the elements of the current population are processed to generate the elements of the next generation by considering the healthy elements, i.e. the frequent itemsets. Although many genetic algorithms to find frequent itemsets have been proposed, they are costly in time due to the large number of operations, such as crossover and mutation, performed on the existing population of itemsets in order to generate the new population. The authors of this paper propose an algorithm to mine frequent itemsets employing the genetic approach, performing crossovers efficiently by not considering those parent itemsets which are certain not to generate any healthy child, i.e. a frequent itemset. The proposed algorithm will serve as a base to explore various possibilities of combining frequent itemset mining algorithms with genetic approaches, thereby creating scope for the introduction of new operators in genetic algorithms apt for frequent itemset mining.


2 Related Work 2.1 Frequent Itemset Mining Frequent itemset mining aims to discover interesting patterns in the form of itemsets from transactional datasets. The first algorithm to mine frequent itemsets is the Apriori algorithm proposed in [3]. This algorithm suffered from the problem of generating large numbers of itemsets and requiring multiple scans of the database. To address this issue, several algorithms such as FP-growth were proposed [9, 10]. However, these algorithms still presented users with a large number of frequent itemsets. Users may not be interested in all the frequent itemsets; a smaller number, namely the top frequent itemsets, may be potentially interesting to them. In such situations, employing a genetic approach in algorithms generating frequent itemsets can be helpful.

2.2 Genetic Algorithms in Frequent Itemset Mining GAR [11] and GENAR [12] are the first algorithms developed to mine frequent itemsets employing approaches of genetic algorithms. Both these algorithms have limitations due to the way in which the items and itemsets are represented. BAFTIM [13] employs the bat meta-heuristic to mine frequent itemsets. Wei and Chaomin [14] have proposed a framework for mining high utility itemsets using bio-inspired algorithms. More recently, the GA-Apriori algorithm was proposed in [15], which mines frequent itemsets by executing the crossover and mutation steps recursively. These algorithms suffer due to the large number of crossovers they have to perform while generating the new population of itemsets. Besides, not all the itemsets generated are frequent; the frequent itemsets are discovered by performing another mutation step.

3 Problem Definition 3.1 Preliminaries Let I = {i1, i2, ..., im} be the set of m literals representing the items. An itemset X is a set of items such that X ⊆ I. Let D be the database of n elements, where each element of D is an itemset. The support of an itemset X, denoted supp(X), is the count of the elements in D containing X. Itemset X is frequent if supp(X) ≥ s0, where s0 is the minimum support threshold. The value of s0 is usually specified by the user.


We represent the itemset X as a vector of m bits. The value of bit j is set to 1 only if item ij ∈ X and 0 otherwise. For example, for I = {a, b, c, d, e}, the itemset X = {a, c, e} is represented as ⟨1, 0, 1, 0, 1⟩. The algorithm requires data structures to store the results generated during the execution of its intermediate phases. We define two sets, F and G, for this purpose. Set F is used to store all the frequent itemsets generated during the intermediate phases. Set Gj contains the frequent itemsets of size j. The itemsets of size j + 1 are generated from the itemsets present in Gj.
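The bit-vector representation above can be mirrored with integer bitmasks, which also makes the support count supp(X) a per-element mask test. This is an illustrative encoding, not code from the paper, and the sample transactions are hypothetical.

```python
# Itemsets over I = {a, b, c, d, e} encoded as 5-bit masks: bit j is set
# iff item i_j belongs to the itemset, as in the vector representation.

ITEMS = ["a", "b", "c", "d", "e"]

def encode(itemset):
    """Encode a set of item names as a bitmask."""
    return sum(1 << ITEMS.index(i) for i in itemset)

def decode(mask):
    """Decode a bitmask back to the set of item names."""
    return {i for j, i in enumerate(ITEMS) if mask >> j & 1}

def supp(mask, db):
    """supp(X): count of database elements (bitmasks) containing X."""
    return sum(mask & t == mask for t in db)

# Hypothetical sample transactions.
db = [encode(t) for t in ({"a", "b", "c"}, {"a", "b"}, {"b", "c", "e"})]
print(decode(encode({"a", "c", "e"})))  # → {'a', 'c', 'e'}
print(supp(encode({"b"}), db))          # → 3
```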

3.2 Problem Statement Given a database D and a minimum support threshold s0 , the problem is to generate frequent itemsets from D using genetic approach.

4 FIMGA—Frequent Itemset Mining Using Genetic Approach Algorithm FIMGA works in two steps, i.e. Genset1 and crossover, as described in Sects. 4.1 and 4.2, respectively.

4.1 Genset1 The database D is scanned m times to generate the frequent itemsets of size 1. Each scan corresponds to an item ij in I, during which the database elements containing the item ij are counted to calculate supp({ij}). If supp({ij}) ≥ s0, then {ij} is a frequent itemset and is stored in the sets F and G1. At the end of this step, F contains all frequent single itemsets.

4.2 Crossover Crossover step is executed over two itemsets X a , X b ∈ G j such that n(X a ∩ X b ) = j − 1, i.e. both X a and X b should have 1 less to j items in common. This ensures that the newly generated itemsets are of size j + 1 and prevents from selecting those itemsets for crossover out of which the newly generated itemsets are never frequent.


This prevents crossovers which generate unhealthy children; by unhealthy, we mean those itemsets which are infrequent. Let xa be the item in Xa such that xa ∉ Xb, and let xb be the item in Xb such that xb ∉ Xa. Itemsets newXa = Xa ∪ {xb} and newXb = Xb ∪ {xa} are generated; under the above condition, newXa and newXb are the same itemset. If the newly generated (j + 1)-sized itemset is frequent, it is added to Gj+1 and to F. The crossover step is then repeated for the new set Gj+1. In case Gj+1 is empty, the algorithm stops.
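The pair-selection rule above can be sketched in a few lines: two j-sized itemsets sharing exactly j − 1 items yield a single (j + 1)-sized child (their union), and any other pair is skipped. This is an illustrative sketch of the described condition, not the authors' implementation.

```python
# Crossover of two j-itemsets X_a, X_b: proceed only when
# n(X_a ∩ X_b) = j - 1, in which case the child is X_a ∪ X_b.

def crossover(xa: frozenset, xb: frozenset):
    """Return the (j+1)-sized child itemset, or None if the pair is skipped."""
    j = len(xa)
    if len(xb) != j or len(xa & xb) != j - 1:
        return None  # this pair can never yield a frequent (j+1)-itemset
    return xa | xb   # newX_a == newX_b == X_a ∪ {x_b} == X_b ∪ {x_a}

print(crossover(frozenset("ab"), frozenset("ac")))  # → frozenset({'a', 'b', 'c'})
print(crossover(frozenset("ab"), frozenset("cd")))  # → None (0 items in common)
```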

5 A Running Example Consider the database D in Table 1. Let I = {a, b, c, d, e} be the set of literals and s0 = 2 be the minimum support threshold.

5.1 Genset1 Since I has five items, the database D is scanned five times to count the support of the single itemsets {a}, {b}, {c}, {d} and {e}, represented as ⟨1, 0, 0, 0, 0⟩, ⟨0, 1, 0, 0, 0⟩, ⟨0, 0, 1, 0, 0⟩, ⟨0, 0, 0, 1, 0⟩ and ⟨0, 0, 0, 0, 1⟩, respectively. Upon scanning the elements of D five times, the resultant itemsets along with their support values are as shown in Table 2. Itemset {d} in Table 2 is not frequent and is not considered for any crossover to generate itemsets of size 2. Hence, G1 = {{a}, {b}, {c}, {e}} and F = G1.

Table 1 Database D

| Transaction No. | Itemset |
|---|---|
| 1 | a, b, c |
| 2 | a, b |
| 3 | b, c, e |
| 4 | a, e |
| 5 | a, b, c, d |

Table 2 1-sized itemsets

| Itemset | Vector | Support |
|---|---|---|
| {a} | ⟨1, 0, 0, 0, 0⟩ | 4 |
| {b} | ⟨0, 1, 0, 0, 0⟩ | 4 |
| {c} | ⟨0, 0, 1, 0, 0⟩ | 3 |
| {d} | ⟨0, 0, 0, 1, 0⟩ | 1 |
| {e} | ⟨0, 0, 0, 0, 1⟩ | 2 |


5.2 Crossover Since G1 ≠ φ, the crossover step is executed over it. To begin with, itemsets {a} and {b} are considered because n({a} ∩ {b}) = 0 = j − 1, as j = 1 here. The crossover is performed on the vectors ⟨1, 0, 0, 0, 0⟩ and ⟨0, 1, 0, 0, 0⟩ representing itemsets {a} and {b}. To do so, an OR operation is performed between ⟨1, 0, 0, 0, 0⟩ and ⟨0, 1, 0, 0, 0⟩ to generate the new vector ⟨1, 1, 0, 0, 0⟩ representing a new itemset {a, b}. The database D is scanned to calculate supp({a, b}), which is 3. Since supp({a, b}) ≥ s0, the itemset {a, b} is frequent and is inserted into G2. After executing crossover on all the possible pairs of itemsets in G1, the itemsets in G2 = {{a, b}, {a, c}, {b, c}} and F = F ∪ G2 = {{a}, {b}, {c}, {e}, {a, b}, {a, c}, {b, c}}. Similarly, after executing crossover over G2, the itemsets in G3 = {{a, b, c}} and F = {{a}, {b}, {c}, {e}, {a, b}, {a, c}, {b, c}, {a, b, c}}. Since G3 has only one itemset, no further crossover is possible, and the algorithm stops.
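The whole run above (Genset1 followed by repeated crossover on the database of Table 1, with s0 = 2) can be reproduced with a short sketch. The set names F and Gj mirror the paper; the code itself is an illustrative reconstruction, not the authors' C++ implementation.

```python
# FIMGA run on the Table 1 database: Genset1, then level-wise crossover.

from itertools import combinations

D = [{"a", "b", "c"}, {"a", "b"}, {"b", "c", "e"}, {"a", "e"}, {"a", "b", "c", "d"}]
s0 = 2
supp = lambda x: sum(x <= t for t in D)  # count of transactions containing x

# Genset1: frequent 1-itemsets.
G = [frozenset({i}) for i in "abcde" if supp({i}) >= s0]
F = set(G)

# Crossover: only pairs sharing j-1 items can yield a frequent (j+1)-itemset.
while len(G) > 1:
    j = len(G[0])
    nxt = {a | b for a, b in combinations(G, 2) if len(a & b) == j - 1}
    G = [x for x in nxt if supp(x) >= s0]
    F.update(G)

print(sorted("".join(sorted(x)) for x in F))
# → ['a', 'ab', 'abc', 'ac', 'b', 'bc', 'c', 'e']
```

The output matches the worked example: G2 = {{a, b}, {a, c}, {b, c}}, G3 = {{a, b, c}}, and {d} never enters F.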

6 Experimental Study Experiments were conducted to compare the execution time of the proposed FIMGA algorithm with the GA-Apriori algorithm by varying the number of items in I. The dataset was generated using the IBM synthetic generator [16]. The number of elements in the database D is 1000 K and 2000 K. Both algorithms were implemented in C++. The observations are shown in Fig. 1. The value of s0 was kept at 50% of the size of D. It is observed that the proposed algorithm requires less time than the GA-Apriori algorithm in both cases; however, GA-Apriori outperforms the proposed FIMGA for lower item counts. The reason FIMGA is more time efficient is the strategy it employs in selecting the itemsets for crossover: itemsets which can never generate frequent itemsets are not considered for crossover. These experiments were performed on synthetic datasets. Although the performance of FIMGA is encouraging on the dataset considered, it may not be as encouraging on other datasets, as the efficiency and accuracy of such algorithms also depend on the kind of dataset.

6.1 Limitations of the Study The following are the limitations of the study. The experiments were done only on synthetic datasets, and the proposed algorithm was compared with only one algorithm. Although FIMGA is observed to perform better in this study, the same observation may not hold for other datasets, as the accuracy and efficiency of the algorithms


Fig. 1 Execution time versus item count for D size = 1000 and 2000 K

depend on the kind of dataset used. The experiments were conducted to evaluate efficiency only in terms of time; efficiency in terms of memory will be evaluated in future work.

7 Conclusion The realization in industry that business analytics can be used for extracting knowledge from data for better decision-making has been motivating researchers to develop and enhance data mining algorithms. One such domain in data mining is frequent itemset mining from datasets. The process of mining frequent itemsets is challenging due to multiple scans of the database, the large number of itemsets generated in the intermediate stages, and the large number of frequent itemsets presented to the users. In this paper, a genetic algorithm to mine frequent itemsets, FIMGA, has been proposed. Experiments were conducted to compare FIMGA with the GA-Apriori algorithm. It was observed that FIMGA is time efficient as compared to GA-Apriori for the datasets considered in this paper. One of the limitations of this study is that the experiments were conducted only on synthetic datasets. The algorithm will be executed on real datasets and compared with other similar algorithms in the future. Besides this, the possibility of applying the algorithm to data streams will also be studied.


References
1. F. Acito, V. Khatri, Business analytics: why now and what next? (2014)
2. D. Delen, H.M. Zolbanin, The analytics paradigm in business research. J. Bus. Res. 90, 186–195 (2018)
3. R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in Proceedings of 20th International Conference on Very Large Data Bases, VLDB, vol. 1215, pp. 487–499, Sept 1994
4. L.C.M. Annie, A.D. Kumar, Market basket analysis for a supermarket based on frequent itemset mining. Int. J. Comput. Sci. Issues (IJCSI) 9(5), 257 (2012)
5. Y. Djenouri, A. Bendjoudi, M. Mehdi, N. Nouali-Taboudjemat, Z. Habbas, GPU-based bees swarm optimization for association rules mining. J. Supercomput. 71(4), 1318–1344 (2015)
6. S.S. Waghere, P. RajaRajeswari, V. Ganesan, Retrieval of frequent itemset using improved mining algorithm in Hadoop, in International Conference on Innovative Computing and Communications (Springer, Singapore, 2021), pp. 787–798
7. S.B. Naik, Mining association rules between attribute value clusters, in Advances in Artificial Intelligence and Data Engineering (Springer, Singapore, 2021), pp. 909–917
8. D.V. Paul, S.B. Naik, P. Rane, J.D. Pawar, Use of an evolutionary approach for question paper template generation, in 2012 IEEE Fourth International Conference on Technology for Education (IEEE, July 2012), pp. 144–148
9. C. Borgelt, An implementation of the FP-growth algorithm, in Proceedings of the 1st International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations, pp. 1–5, Aug 2005
10. G. Grahne, J. Zhu, Fast algorithms for frequent itemset mining using FP-trees. IEEE Trans. Knowl. Data Eng. 17(10), 1347–1362 (2005)
11. J. Mata, J.L. Alvarez, J.C. Riquelme, Discovering numeric association rules via evolutionary algorithm, in Pacific-Asia Conference on Knowledge Discovery and Data Mining (Springer, Berlin, Heidelberg, 2002), pp. 40–51
12. J. Mata, J.L. Alvarez, J.C. Riquelme, An evolutionary algorithm to discover numeric association rules, in Proceedings of the 2002 ACM Symposium on Applied Computing, pp. 590–594, Mar 2002
13. K.E. Heraguemi, N. Kamel, H. Drias, Multi-swarm bat algorithm for association rule mining using multiple cooperative strategies. Appl. Intell. 45(4), 1021–1033 (2016)
14. W. Song, C. Huang, Mining high utility itemsets using bio-inspired algorithms: a diverse optimal value framework. IEEE Access 6, 19568–19582 (2018)
15. Y. Djenouri, M. Comuzzi, Combining Apriori heuristic and bio-inspired algorithms for solving the frequent itemsets mining problem. Inf. Sci. 420, 1–15 (2017)
16. R. Agrawal, T. Imieliński, A. Swami, Mining association rules between sets of items in large databases, in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pp. 207–216, June 1993

Gesture-Based Media Controlling Using Haar Cascade Pragati Chandankhede and Sana Haji

Abstract One of the greatest strengths of e-learning resources lies in audio and video files. Video files make users more attentive and enthusiastic due to balanced picture and sound quality. The pandemic situation that arose in the twenty-first century has made parents more receptive to video- and audio-based learning. This paper proposes a face-controlled application that handles media. The combined approach of face detection, iris detection and media control is an innovative way to solve the problem. The proposed face-controlling system pauses and resumes the media by identifying whether the user's iris is looking at the screen or not. Face detection is achieved with a Haar cascade classifier, and iris detection is done through the Hough transform. The media pauses if the iris of the person using the system is not toward the screen for about 8 ms. An important advantage of the system lies in power consumption, since the video stops for the period in which the user is distracted while watching it. Despite the powerful visual aesthetics of media, there is a need to handle the distractions which occur while watching. We tend to miss some important parts of a video and then need to rewind it to the last viewed content. People are inclined to make notes while listening to video lectures, which requires starting and stopping the video countless times. This wastes time, and learning is also less effective. Keywords Face recognition · Media · Gesture · Feature extraction · Image processing

P. Chandankhede (B) Sir Padampat Singhania University, Udaipur, India e-mail: [email protected] S. Haji K.C.College of Engineering, Mumbai, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_44



1 Introduction • Computer vision: With advancements in technology, users expect an interface for each application they use, so that they can communicate easily and in a more user-friendly way. Computer vision has been widely adopted for gesture recognition. The technique needs to process input images and enable the computer to understand them the way a human easily can. This approach trains the computer with sample images. • Gesture recognition: The capacity of a computer to understand signs of human activity and respond to these activities is termed gesture recognition. Typical gestures include hand movement, neck nodding, face recognition and iris detection [1]. • Media player: It plays a key role as a source of entertainment, where we can play media like music, audio, movies, videos, etc. In today's world of advanced technologies, the media player is an important application used to view video content, and functionalities such as hand gesture and face detection will be used to control it. When the user is disturbed while watching a video, they miss some part of it, which is irritating, creates a performance issue, and also wastes the user's time. The face-controlling system for media pauses and resumes the media by detecting whether the user's face is toward the screen or not. While watching a video, when someone interrupts, you have to look somewhere else or go away from the system for some time, so you miss part of the video. Subsequently, you need to rewind the video to the last watched content. This gives rise to the need for a system which automatically pauses when the user is not looking at it and resumes again as soon as the user looks at it. This can be done using the web camera which captures the user's face.
The video on the media player will play continuously as long as the camera detects the user's face or iris completely, and it is paused as soon as the user's face or iris is not completely detected. This method of face- or iris-detection-based media control can help to minimize human effort. The face-controlling media player is a system which detects whether the user's iris is looking at the screen in order to play or pause the media. The system continuously observes whether the user's iris is completely toward the screen using a web camera; if it is detected, the video plays without any interruption, else the video is paused. Users can take full advantage of this media player as they do not miss any part of the video. The iris detection functionality of this media player makes it more efficient and user-friendly. In current systems, face detection and hand gesture recognition have been done in controlled situations, and the accuracy rate is poor.
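The play/pause behaviour described above can be sketched as a small state machine that pauses once the face or iris has been absent for a threshold number of consecutive frames and resumes immediately on re-detection. The class name, threshold and frame sequence below are hypothetical; the actual detectors in the paper are the Haar cascade and Hough transform.

```python
class MediaGate:
    """Pause when the viewer's face/iris is absent for `miss_limit`
    consecutive frames; resume as soon as it is detected again."""

    def __init__(self, miss_limit: int = 8):
        self.miss_limit = miss_limit
        self.misses = 0
        self.playing = True

    def on_frame(self, face_detected: bool) -> bool:
        if face_detected:
            self.misses = 0
            self.playing = True       # resume immediately on detection
        else:
            self.misses += 1
            if self.misses >= self.miss_limit:
                self.playing = False  # viewer has looked away long enough
        return self.playing

# Hypothetical per-frame detector results.
gate = MediaGate(miss_limit=3)
states = [gate.on_frame(d) for d in [True, False, False, False, True]]
print(states)  # → [True, True, True, False, True]
```

In a real deployment, `face_detected` would come from running the Haar cascade (and Hough iris check) on each web-camera frame.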

2 Literature Review

To make computers more user-friendly and responsive, the technology for interaction between user and computer needs to become more effective. Various studies

Gesture-Based Media Controlling Using Haar Cascade


were conducted in these areas, and a few are listed here. These research findings are not limited to human–computer interaction but also cover computer-vision techniques that help in real-life scenarios. Human–computer interaction with computer vision has been used extensively for gesture recognition. In 2017, Rachana Tripathi et al. proposed a method to control volume using hand gestures [2]. The randomized circle detection algorithm identifies circular shapes in an image by randomly selecting pixels and applying a distance criterion to the circles present in a captured image [3]. An individual can be identified by his iris; hence, by using the wavelet transform, the feature vector can be made compact [4]. In 2010, Sanchit Mahajan et al. proposed a system consisting of several sub-systems that aim at iris detection in an innovative way and follow a hierarchy of steps. The first stage is image acquisition, in which the eye image is captured through a camera located approximately 9 cm from the user's eye; the estimated distance between the light source and the user is 12 cm [5]. The iris positioned in the eye is detected first, and a region is queried using appropriate coordinates; segmentation thus depends on the input images provided. The outer radius of the iris pattern is detected from the center of the pupil. A Canny edge detector [3] recognizes the inner and outer boundaries of the iris by finding the edge image. Noise is removed at this step by filtering and blurring the image and discarding pixels that form spurious spots. Points whose pixel patterns are similar are grouped together, and the gradient technique marks an edge where the magnitude is large. The last step deals with suppression. This phase segments the iris region, which is then normalized to generate the iris code.
The discrepancies present in the eye (the position of the pupil, the optical size of the iris and the orientation of the iris) vary among individuals; thus, there emerges a need to normalize the iris image representation, presenting it in a common format with similar dimensions. The normalized images were given as input to the backpropagation algorithm. The images from the database were then used for iris recognition, comparing the actual image with the stored iris. If the match was perfect, recognition succeeded; otherwise, the backpropagation algorithm was used to train the network further. The behavior of the network is analyzed for each input by calculating the error rate, and the network has been trained and tested on a number of eye images. The system can differentiate between the pupil and the iris part of the human eye; this is achieved through mathematical functions that recognize the different eye boundaries. It circles the outer boundary of the pupil, which is also the inner boundary of the iris, using a modified Canny edge detector, enabling detection and calculation of the outer boundary of the iris [4, 5]. Siddharth Swarup Rautaray proposed a hand-segmentation approach in which the camera input is converted to an image, the image is converted to gray scale, and the moving hand is tracked [6]. The k-means algorithm is then used to cluster the moving points; the resulting clusters are cropped and stored as separate images. Fifteen positive images of each gesture are used, and features are extracted from them through PCA.

544

P. Chandankhede and S. Haji

These features are saved in an XML data format for recognition. The input images are compared with the loaded XML data to decide which gesture they match, using the KNN algorithm. The application allows users to define their own gestures for controlling VLC functions, so the system can easily control the VLC media player by hand gesture. The drawback is that the recognition is not very robust; a more robust algorithm is required to reduce noise and motion blur. Currently, the application is managed through a global keyboard shortcut in VLC, triggering the keyboard event of that shortcut with the keybd_event() function. This is not a smart way of controlling an application, so an inter-process communication technique can be applied instead; with inter-process communication, VLC can be replaced with any other application very easily [6]. Iris recognition technology draws on computer vision, pattern recognition, statistical inference and optics. Geometric objects such as lines and circles present in an image can be identified by the Hough transform [7], and the center coordinates and radius of the iris and pupil regions can be computed by the circular Hough transform. The iris is a muscle within the circular part of the eye which controls the size of the pupil and thereby the amount of light entering the eye. The colored portion of the eye is the iris, the most sensitive organ of the human face; its color depends on the amount of melanin pigment within the muscle [5]. The average diameter of the iris is 12 mm, and the size of the pupil varies from 10 to 80% of the iris diameter [8]. The hidden Markov model is applied in speech recognition to convert raw speech signals into text [9].
Iris detection, notably through Daugman's algorithm [10], is used to convert a circular region into multiple rectangular blocks. Dodiya Bhagirathi et al. [11] propose using a 12 MP camera to capture the image of the face. Since the webcam works in the YCbCr format, the image is converted to RGB format, and a bilateral filter is then used to remove noise. The Hough transform is comparatively robust against noise; the circle of the human eye can be detected using the circular Hough transform, which is very easy to implement, while blob analysis is the fastest way of recognizing faces. The circular Hough transform is therefore considered a simple method for iris detection [11]. Krishna Chaitanya proposes a system in which image processing techniques, feature extraction and classification tools detect gestures in real time and send the appropriate command to Windows Media Player [1]. The hand region is detected using a skin-detection model, and the background is removed using approximate median techniques. The decision tree method is used as a classification tool to create the various categories, and Windows Media Player is given a command according to the identified gesture [10]. This system allows the user to handle the application from a distance without keyboard or mouse: a human–computer interface lets the user control the media player with hand gestures. But the system fails to identify large numbers of gestures across different individuals, as well as moving hand gestures. Also,


the system needs to support different media players, not only Windows Media Player [10]. Yaser Daanial Khan et al. propose the following steps: the first step uses Canny or Sobel edge detection to segment only the required features within the image of the eye [1]. Next, the disk-shaped iris is converted into rectangular form. The third step is a bottleneck neural network used to compress the large data of the high-resolution transformed image; the compressed data is then fed to a neural network for accurate recognition of the input image. The neural network is trained on the image dataset after segmentation and feature extraction have transformed the images into rectangular shape. The compact data of all the images is presented to the neural network repeatedly for training until the resulting error reaches zero. The network usually gives better results for iris inputs on which it has been trained consistently [1]. A confusion matrix is obtained by observing the behavior of the neural network. Jalled [17] proposes OpenCV-Python code using the Haar cascade algorithm for object and face detection on an unmanned aerial vehicle (UAV). The research gap identified concerns control and navigation through obstacles; the similarity between a template and an image region can be measured by their correlation, as in the Single Shot MultiBox Detector (SSD). Over the last several years, it has been shown that image-based object detectors are sensitive to the training data. In 2015, moving in a similar direction, Simonyan and Zisserman [13] proposed the VGG16 model. VGG16 (also called OxfordNet) is a convolutional neural network architecture named after the Visual Geometry Group at Oxford, who developed it. It achieved top results in the ILSVRC (ImageNet) 2014 competition and is composed of 16 weight layers: convolutional layers interleaved with multiple max-pool layers, followed by three final fully connected layers.
One of its specificities is chaining multiple convolutional layers with ReLU activation functions, creating nonlinear transformations. The model achieves 92.7% top-5 accuracy on ImageNet, a dataset of over 14 million images belonging to 1000 classes. In 2017, Navya Sushma Tummala et al. proposed face recognition through sets of character features using principal component analysis [14]. Linear discriminant analysis is used for recognizing the human subject; this permits objective estimation of the significance of visual information in different parts of the face. Evolutionary pursuit implements strategic characteristics of genetic algorithms to search the space of possible solutions for the optimal basis, but the limitation of the system lies in data normalization. Paul Viola and Michael Jones proposed a methodology for computing the feature map and converting it into an XML file which can then be used for detection of the face and eyes [15]. When applied to speech recognition, the HMM states are interpreted as acoustic models, indicating what sounds are likely to be heard during their corresponding segments of speech, while the transitions provide temporal constraints, indicating how the states may follow each other in sequence. The first Haar feature targets the property that the region of the eyes is darker than the region of the nose and cheeks. The second feature targets the property that the eyes are darker than the bridge


of the nose. Constantly applying such windows on the cheeks or other parts of the face would seem doubtful, but it has been achieved by AdaBoost [16]. Finally, after reviewing a number of methods, it is quite clear that the Haar cascade is widely used for feature extraction because of its advantage in calculation speed and its effectiveness in face recognition.
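The Haar features discussed above are sums and differences of rectangular pixel regions, which can be evaluated in constant time from an integral image (summed-area table); that is the source of the calculation-speed advantage. A minimal NumPy sketch of the idea (function names are illustrative; this is not the OpenCV implementation):

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column prepended,
    so rect_sum needs no boundary special cases."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, r, c, h, w):
    """Sum of pixels in the h x w rectangle with top-left corner (r, c),
    using four lookups regardless of rectangle size."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def two_rect_feature(ii, r, c, h, w):
    """Two-rectangle Haar-like feature: top half minus bottom half
    (e.g., dark eye region vs. brighter cheek region). h must be even."""
    top = rect_sum(ii, r, c, h // 2, w)
    bottom = rect_sum(ii, r + h // 2, c, h // 2, w)
    return top - bottom
```

On a 4×4 test patch whose top half is darker (value 1) than its bottom half (value 5), the feature evaluates to 8 − 40 = −32; a cascade thresholds many such responses at increasing cost to reject non-face windows early.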

3 Architecture of Multimedia Devices

The generalized architecture of a multimedia application has two parts, as illustrated in Fig. 1. The first part takes digital media into the system as input and distinguishes it as audio or video [11]. The second is the user interface, which depicts the user state and controls the running of the system. The Application Programming Interfaces for the UI and the player can be chosen freely, but the mechanics of communication between the two pieces are fundamentally the same for all media player apps. The Android framework defines two classes, a media session and a media controller, that impose a well-defined structure for building a media player app, as shown in Fig. 2. The media session and media controller communicate with each other using predefined callbacks that correspond to standard player actions (play, pause, stop, etc.), as well as extensible custom calls that you use to describe behaviors unique to your app. A media session has full control over communication with the player: it hides the details of the player's Application Programming Interface from the rest of your app, and the player is only called from the media session that controls it. The session maintains a representation of the player's state (playing/paused) and information about what is playing. This class is important if we intend to control the media with real-time image processing. The user interface communicates with the media controller, which tracks the state of the media session, so the associated user interface is automatically updated. A media controller can connect to only one media session at a time. Real-time processing requires a media controller that can invoke stop and play interfaces at runtime [17]. When playing a video, your eyes and ears are both engaged; when audio is playing, you are listening to it, but it is also feasible

Fig. 1 Multimedia application architecture

[Fig. 2 Communication within media player: User Interface → Media Controller → Media Session → Player]


to work on a different application at the same time; there is a different design for each use case. Pattern illustration, data learning, and the selection of training and test data are the most important parts of a recognition pipeline used for edge detection. Recognition systems identify the unique features of an individual and thus help in detecting a person; such systems may use fingerprint recognition, iris recognition, eye recognition, etc. Iris recognition is the most widely accepted technique: the process identifies a person by examining the random pattern in the iris and is considered the most efficient and accurate.
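The session/controller separation described above can be sketched as follows. The class and method names mirror the Android roles but are illustrative Python, not the actual Android API; the point is that only the session touches the player, while the UI-facing controller connects to exactly one session:

```python
class Player:
    """Low-level playback engine; only the session talks to it."""
    def __init__(self):
        self.state = "paused"

    def play(self):
        self.state = "playing"

    def pause(self):
        self.state = "paused"

class MediaSession:
    """Owns the player and hides its API from the rest of the app,
    while exposing the current playback state."""
    def __init__(self, player):
        self._player = player

    def handle(self, action):
        if action == "play":
            self._player.play()
        elif action == "pause":
            self._player.pause()

    @property
    def state(self):
        return self._player.state

class MediaController:
    """UI-facing side; forwards actions to its single session."""
    def __init__(self, session):
        self._session = session

    def send(self, action):
        self._session.handle(action)

    def state(self):
        return self._session.state
```

With this layering, the real-time image-processing component only needs to call `send("play")` or `send("pause")` on the controller; it never manipulates the player directly, so the player backend can be swapped without touching the detection code.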

4 Methodology

Figure 3 depicts the block diagram of the proposed system; the following steps shed light on each stage.

Fig. 3 Block diagram for proposed system

[Fig. 3 components: Camera → face recognition while the media player is active → bounding box prediction (Haar cascade implementation) → iris detection by Hough transform → media controller driven by the Haar cascade → play/pause, resuming once the iris is detected again]


Fig. 4 Constant radius

Step 1: The user's face is captured using the front webcam. It is divided into distinct regions such as the forehead, below the forehead, lips and chin, which constitute the cascade, i.e., the regions used for classifying the object ahead. Only the face is recognized by the system: the cascade function is trained with multiple face and eye samples, hence only faces can be detected. These features act as input for the next step.
Step 2: Each feature is specified by its shape, position and scale. Images are sampled using horizontal and vertical stripes.
Step 3: The system's pupil detector has been trained on a number of sample images on the classifier. Among the extracted features, the region below the forehead (the iris region) is divided into four concentric circles.
Step 4: The circular Hough transform is a feature-extraction technique for detecting circles. Two outer circles detect the two eyes, and two inner circles detect the irises [18]. Constant-radius detection is maintained by the Hough transform through an accumulator matrix. Figure 4 is divided into two parts: the left side maintains the radius, and when a new image (shown on the right) is provided as input, the Hough transform maintains a constant radius along the x- and y-axes. The aim of maintaining the matrix is to find the intersection point in parameter space. Circles identified in the image increment elements of the accumulator matrix, and the count maintained by the matrix is called voting. Once voting is completed, the local maxima are calculated, and their positions indicate whether the user is looking ahead or at another angle. If both are aligned, the system concludes that the eyes are facing the camera.
Step 5: Each detection is based on a model that captures basic pixel values using latent variables of the image.
Step 6: If both irises are undetected for more than 30 ms, i.e., the user is not around or is distracted, this is treated as a state in the media controller, and the session is saved.
Step 7: The media controller sends a stop/pause signal to the media session layer, and a sliding-window mechanism holds the window until it receives new input.
Step 8: The video is paused by the bus controller.

The proposed system uses a Haar cascade classifier to detect the face. We used the IIT Delhi Iris Database, training the face classifier on 150 images. To detect the iris, the circular shape of the pupil is found by the circular Hough transform. Figure 3 depicts the proposed operational system. A camera with 720p resolution at 30 fps, good video quality and a 60-degree field of view is attached to the system to detect the user's face.
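The accumulator voting of Step 4 can be sketched directly in NumPy for a fixed radius: every edge pixel votes for all centers that would place it on a circle of that radius, and the accumulator peak gives the circle center. This is a simplified illustration of the circular Hough transform, not the paper's implementation (which would additionally search over radii):

```python
import numpy as np

def circular_hough(edges, radius):
    """Fixed-radius circular Hough transform.
    `edges` is a binary edge map; returns the best center and accumulator."""
    h, w = edges.shape
    acc = np.zeros((h, w), dtype=np.int32)
    thetas = np.linspace(0, 2 * np.pi, 100, endpoint=False)
    ys, xs = np.nonzero(edges)
    for y, x in zip(ys, xs):
        # Candidate centers lie on a circle of `radius` around the edge pixel
        cy = np.round(y - radius * np.sin(thetas)).astype(int)
        cx = np.round(x - radius * np.cos(thetas)).astype(int)
        ok = (cy >= 0) & (cy < h) & (cx >= 0) & (cx < w)
        # np.add.at accumulates repeated (cy, cx) votes correctly
        np.add.at(acc, (cy[ok], cx[ok]), 1)
    cy, cx = np.unravel_index(np.argmax(acc), acc.shape)
    return cy, cx, acc
```

Running this on a synthetic edge map containing a circle of radius 8 centered at (20, 20) recovers the center to within a pixel; the "voting" count in each accumulator cell is exactly the quantity the local-maxima step of Step 4 inspects.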


Fig. 5 Bounding box face detection

5 Results and Discussion

A camera records the scene where the user is sitting in front of the system; a scene can be thought of as action in a single location, and the important thing to capture in every scene is the face. There are two main detection methods: (a) the cascade classifier (used for bounding-box face detection) and (b) the circular Hough transform (used for iris detection). Figure 5 shows that once the user switches on the system, he can select the video he wants to watch. Face and iris are distinguished using the Haar cascade classifier, with a bounding box for face detection. Each grid cell determines a bounding box comprising the x- and y-coordinates, the width, the height and the confidence. As long as the user watches the video without any disturbance, the video keeps running. As shown in Fig. 6, as soon as the user gets distracted or loses eye contact with the system, the playing video is paused. Table 1 compares different algorithms on the basis of accuracy and the training required; the most well-known are listed.

6 Conclusion

Many communities have contributed to iris and face recognition and have already delivered effortless applications and systems.


Fig. 6 Output after iris not looking toward screen

Table 1 Comparison between gesture detection algorithms

Algorithm                  Accuracy   Training required
HMM                        90%        Extensive
Haar cascade classifier    95%        Offline training
Daugman's algorithm        90%        Extensive

However, there is no combined application of iris detection (by circular Hough transform) and face recognition (by Haar cascade) for handling a media player. The problem addressed is to obtain a reliable result so that the user can take complete benefit of the media without any disturbance. The accuracy of the system is up to 95% with training on a dataset of 150 images; accuracy can be improved with more extensive training. An advantage of the system is reduced power consumption, since the video is paused when the user is not facing the media. The resulting system promises to be more efficient and accurate for controlling media. The impetus and purpose behind this research were to discover a process that improves user familiarity while controlling media devices. The system still has limitations.

References

1. Y.D. Khan, A. Farooq, M.W. Anwar, A neuro-cognitive approach for iris recognition using back propagation. World Appl. Sci. J. 16(5), 678–685 (2012). ISSN 1818-4952, IDOSI Publications, 2012


2. N. Sanghavi, R.T.R. Thakur, Look based media player. Int. Res. J. Eng. Technol. (IRJET) 4 (2017)
3. T. Chuan Chen, K. Liang Chung, An efficient randomized algorithm for detecting circles. Comput. Vis. Image Understand. 83, 172–191 (2001)
4. K. Lee, S. Lim, K. Lee, O. Byeon, T. Kim, Efficient iris recognition through improvement of feature vector and classifier. ETRI J. 23, 61–70 (2001)
5. N. Kak, R. Gupta, S. Mahajan, Iris detection system. Int. J. Adv. Comput. Sci. Appl. 1, 1 (2010)
6. S.S. Rautaray, A. Agrawal, A vision based hand gesture interface for controlling VLC media player. Int. J. Comput. Appl. (0975-8887) 10(7) (2010)
7. E.M. Bracamontes, M.E. Martinez-Rosas, M. Miranda, H.L. Martinez-Reyes, Implementation of Hough transform for fruit image segmentation. Procedia Eng. 35, 230–239 (2012)
8. J.R. Sekar, S. Arivazhagan, R.A. Murugan, Methodology for iris segmentation and recognition using multi-resolution transform, in Third International Conference on Advanced Computing, Chennai, 2011, pp. 82–87. https://doi.org/10.1109/ICoAC.2011.6165153
9. T. Zia, D. Bruckner, A. Zaidi, A hidden Markov model based procedure for identifying household electric loads, in 37th Annual Conference of the IEEE Industrial Electronics Society, 2011
10. P. Verma, M. Dubey, P. Verma, S. Basu, Daughman's algorithm method for iris recognition—a biometric approach. Int. J. Emerg. Technol. Adv. Eng. (2012)
11. N. Krishna Chaitanya, R. Janardhan Rao, Controlling of windows media player using hand recognition system. Int. J. Eng. Sci. (IJES) 3(12) (2014). ISSN (e): 2319-1813, ISSN (p): 2319-1805
12. H. Jadhav, S. Pathan, N. Rokade, U. Annamalai, Controlling multimedia applications using hand gesture recognition. Int. Res. J. Eng. Technol. (IRJET) 02, 1200–1203 (2014)
13. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in International Conference on Learning Representations, 2015
14. N.S. Tummala, P.C. Sekhar, Face recognition using PCA and geometric approach, in International Conference on Computing Methodologies and Communication, 2017
15. R. Ogla, A. Ogla, A. Abdul Hussien, M. Mahmood, Face detection by using OpenCV's Viola-Jones algorithm based on coding eyes. Iraqi J. Sci. 58(2A), 735–745 (2017)
16. T. Chengsheng, L. Huacheng, X. Bing, AdaBoost typical algorithm and its application research, in MATEC Web of Conferences, ICMITE 2017, vol. 139, 00222 (2017). https://doi.org/10.1051/matecconf/13900222
17. F. Jalled, Object detection using image processing, in Computer Vision and Pattern Recognition, Moscow Institute of Physics & Technology, 2016
18. D. Nadar, S. Acharya, S. Parab, A. Karkera, Look based media player. Int. J. Res. Appl. Sci. Eng. Technol. (IJRASET) 7 (2019)

Comparative Analysis of Models for Abstractive Text Summarization

Minakshi Tomer and Manoj Kumar

Abstract In the present scenario, there is a huge amount of data on the web, and it is becoming challenging to access any information from it. Text summarization is an important solution for relevant information extraction. In this paper, four basic deep learning models are applied to the CNN/Daily Mail data set. This will help researchers understand the basic models and their implementation for text summarization, including how the attention mechanism affects the models. The results are evaluated using ROUGE scores.

Keywords Abstractive text summarization · Extractive text summarization · Preprocessing · Recurrent neural network (RNN) · Long short-term memory (LSTM) · Gated recurrent unit (GRU)

1 Introduction

There is an abundance of documented material available which can be summarized into short and meaningful pieces for the user. Blogs, news, scientific articles or speeches can be condensed from a long, multi-page document into a paragraph of a few lines. There is a need to reduce the length of a document, as reading the original consumes a lot of time. As technology advanced and the pace of life increased, it became preferable to present the user with the most important parts of a document at a glance rather than the whole detailed piece.

M. Tomer (B) Research Scholar, USICT, GGSIPU, New Delhi, India IT Department, MSIT, New Delhi, India M. Kumar Professor, Netaji Subhas University of Technology, East Campus (Formerly Ambedkar Institute of Advanced Communication Technologies and Research), New Delhi, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_45


M. Tomer and M. Kumar

The aim of abstractive text summarization is to generate a condensed and informative summary from a large source document [1, 2]. It differs from extractive text summarization, which directly extracts the most relevant words, phrases and sentences from the source document; an abstractive summary instead contains new words and phrases that were not present in the source. In recent years, deep learning algorithms have been successfully applied in information retrieval (IR), natural language processing (NLP) and machine translation (MT) [3]. Owing to the promising results of deep learning in NLP, it has taken the state of the art in text summarization to a higher level. For automatic text summarization (ATS), the sequence-to-sequence framework, consisting of an encoder and a decoder, has become the standard approach: the encoder reads the preprocessed input and converts it into a vector representation, and the decoder uses that representation to generate the desired output. Different summarization models employ different architectures such as convolutional neural networks (CNN), recurrent neural networks (RNN) and long short-term memory (LSTM) as encoder and decoder. Various extensions such as attention, pointing and coverage mechanisms have also been applied to these models. Apart from these, great efforts have been made on hybrid models that combine the advantages of extractive and abstractive models to improve the resulting summary. In this paper, the traditional approaches for abstractive text summarization are compared on the CNN/Daily Mail data set, and their results are reported in terms of ROUGE score, which measures the quality of a system-generated summary against the ground truth. The rest of the paper is structured as follows. Section 2 covers related work on abstractive text summarization. Section 3 explains the techniques for abstractive text summarization.
Section 4 includes the experimental settings and results. The paper is concluded with future scope in Sect. 5.

2 Related Work

Different machine learning algorithms such as the Bayesian classifier, decision tree, hidden Markov model and integer linear programming have been successfully applied to text summarization [4–7]. For extractive text summarization, [8, 9] utilize neural networks to transform sentences into vectors and select sentences for summary generation. Similar work utilized a query-attention-based convolutional neural network for query-based multi-document summarization. A convolutional neural network framework is used for sentence feature representation and ranking in single- and multiple-document summarization [10]. A CNN model is also used by Denil [11]. In another work, by Ziqiang Cao [12], a recurrent neural network is used for text summarization. An attentional recurrent neural network framework is utilized by Cheng and Lapata [13], consisting of an encoder that reads the input sequence and generates a continuous vector representation from which the final summary is produced. An RNN-based architecture consisting of a classifier and a selector is


proposed by Ramesh Nallapati [14] to read the input document and select sentences for summary generation. SummaRuNNer, an extractive model proposed by Nallapati [15], is a two-layer bidirectional GRU-based RNN: the first layer computes the hidden representation of the input at the word level, while the second bidirectional layer provides sentence-level encoding. With the emergence of deep generative models, abstractive text summarization became popular. For abstractive text summarization, the first encoder–decoder model was given by Rush and Weston [16] for sentence-level summarization; its extension was given by Chopra et al. [17]. Ramesh Nallapati [18] introduced a sequence-to-sequence model in which the encoder is a bidirectional and the decoder a unidirectional GRU-RNN with attention. Another model overcoming the limitations of the above-mentioned models was proposed by See and Manning [19]. Romain Paulus and Caiming Xiong [20] applied a reinforcement-based model to abstractive text summarization. Generative adversarial networks have improved the readability scores of summaries [21]. Another reinforcement-based model with a recurrent neural network was proposed by Nguyen [22]. Hybrid models are the new research area in this field, successfully applied by Hsu et al. [23–25].

3 Techniques for Abstractive Text Summarization

The process of summarization includes the selection of relevant words, sets of words, phrases and sentences that convey the crux of the entire input document. Different models can be applied to this task; they are explained in this section. The essence is to understand the basic models that can be applied for automatic text summarization. This section has two parts: the first explains the preprocessing stage, which is common to all techniques, and the second explains the models.

3.1 Preprocessing

It is the initial step that needs to be performed before applying summarization models. There are different techniques to preprocess the text. The preprocessing steps include:

1. Segmentation: A process in which all the sentences present in the input document are treated as separate entities by identifying their beginnings and ends.
2. Cleaning: A process which removes all types of special characters present in the text and simplifies it.
3. Case conversion: All the text is converted to lower case for uniformity.
4. Tokenization: All the words in the sentences are broken down into individual tokens for further processing.
5. Elimination of stop words: These are words without any linguistic meaning which contribute nothing to the meaning of the text and occur very frequently in the document. Removing them reduces the number of terms to be considered.
6. Stemming: In the input document, some words appear in different forms. Stemming reduces these variant words to their root, the canonical form of the original word.
7. Lemmatization: The method of reducing words to their base and meaningful form by removing extra, unnecessary characters.
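The preprocessing steps listed above can be strung together into a small pipeline. The sketch below uses only the standard library; the tiny stop-word list and the crude suffix-stripping "stemmer" are illustrative stand-ins for real resources (e.g., NLTK's stop-word corpus and Porter stemmer):

```python
import re

# Tiny illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "in"}

def preprocess(document):
    """Segment, clean, lower-case, tokenize, drop stop words, crudely stem."""
    # 1. Segmentation: split into sentences at ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", document) if s.strip()]
    processed = []
    for sent in sentences:
        # 2. Cleaning: remove special characters
        sent = re.sub(r"[^A-Za-z0-9\s]", " ", sent)
        # 3. Case conversion
        sent = sent.lower()
        # 4. Tokenization
        tokens = sent.split()
        # 5. Stop-word elimination
        tokens = [t for t in tokens if t not in STOP_WORDS]
        # 6./7. Stemming/lemmatization: naive suffix stripping as a stand-in
        tokens = [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t
                  for t in tokens]
        processed.append(tokens)
    return processed
```

For example, `preprocess("The cats are playing. Dogs barked!")` yields `[["cats", "are", "play"], ["dogs", "bark"]]`: two sentence segments, lower-cased, with "the" removed and the longer words suffix-stripped.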

3.2 Abstractive Text Summarization

The following are the approaches that are implemented and compared.

3.2.1 Encoder–Decoder with RNN

The encoder–decoder is an architecture in which a variable-length input sequence is encoded into a fixed-length vector representation by an encoder, and this encoded vector is then decoded back into a variable-length output sequence by a decoder. Although this works, the fixed-length encoding of the input limits the length of the output sequence that can be generated. Let the input sequence be represented by x. Each symbol of the input sequence is read sequentially by the encoder RNN, and the hidden state of the RNN is updated with each entry:

h(t) = f(h(t−1), x_t)    (1)

When the whole input sequence has been read, the hidden state of the RNN, denoted s, is a summary of the entire input sequence. A second RNN, the decoder, generates the output sequence by predicting the next symbol y(t) given the hidden state h(t). The hidden state of the decoder is computed by the following equation:

h(t) = f(h(t−1), y(t−1), s)    (2)

Both y(t−1) and the input summary s condition the hidden state h(t) and the predicted output y(t) of the decoded sequence. The encoder and decoder are trained jointly to maximize the probability of generating the target sequence and hence improve the end results.
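Equation (1) can be illustrated with a minimal numeric sketch, assuming a scalar hidden state and f = tanh of an affine combination; the weights W, U and b below are illustrative values, not trained parameters:

```python
import math

# Scalar stand-in for the encoder update h(t) = f(h(t-1), x_t), Eq. (1),
# with f(h, x) = tanh(W*x + U*h + b). Weights are illustrative, not trained.
W, U, b = 0.5, 0.9, 0.1

def encoder_step(h_prev, x_t):
    return math.tanh(W * x_t + U * h_prev + b)

def encode(sequence):
    h = 0.0                  # initial hidden state
    for x_t in sequence:     # read the input symbols one by one
        h = encoder_step(h, x_t)
    return h                 # final state s summarizes the whole sequence

s = encode([1.0, -0.5, 0.25])
print(s)
```

The final value s plays the role of the fixed-length summary on which the decoder in Eq. (2) is conditioned.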

Comparative Analysis of Models for Abstractive Text Summarization

3.2.2 Encoder–Decoder with GRU

This model follows the same basic encoder–decoder architecture: the input sequence is encoded and then decoded to obtain the output. The difference lies in how the encoding and decoding are performed: gated recurrent units (GRUs) are used in both. In principle, following the RNN architecture, each output is obtained by mapping the entire history of previous inputs. In practice, however, training an RNN efficiently is difficult because of the exploding and vanishing gradient problems. A unidirectional GRU also has limited ability to recognize future context. Therefore, the input vectors are passed through two GRU layers running in opposite directions, and the hidden state vectors of the two GRUs are concatenated; the resulting bidirectional GRU can use both past and future information. Multiple GRU layers are stacked to build a richer representation for the decoder to attend to. The hidden state of the encoder is computed as follows:

h(t) = (1 − z(t)) ⊙ h(t−1) + z(t) ⊙ h̃(t)    (3)

where ⊙ denotes element-wise multiplication, h̃(t) is the candidate activation, and z(t) is the update gate. A unidirectional GRU is used for the decoding part of the model, producing the output symbols one by one. A context vector is also used during decoding to handle the problem of learning variable-length annotation sequences and then associating them with output sequences that are themselves of variable length. The hidden state of the decoder is computed with the help of the following equation:

s(t) = (1 − z′(t)) ⊙ s(t−1) + z′(t) ⊙ s̃(t)    (4)

where z′(t) is the update gate and s̃(t) is the candidate activation.

3.2.3 Encoder–Decoder with GRU and Attention

To generate the best possible results in less time, the model should be less complex and focused on the important factors of its functionality. To achieve this, we replace the RNN with a GRU to reduce the complexity of the model and introduce an attention mechanism to increase the focus on the generated sequence. The encoder is built with a bidirectional GRU so that both future and past context are taken into consideration during processing. The hidden state of the encoder is computed with the help of the following equations:

z_t = σ(W_xz x_t + U_hz h_{t−1})    (5)

r_t = σ(W_xr x_t + U_hr h_{t−1})    (6)

h̃_t = tanh(W_xh x_t + U_rh (r_t ⊙ h_{t−1}))    (7)

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t    (8)

W and U represent the corresponding weight matrices, σ is the sigmoid function, and ⊙ denotes element-wise multiplication. The update gate, reset gate and candidate activation are represented by z_t, r_t and h̃_t, respectively. Attention in the decoder is induced with the help of the following equations:

c_i = Σ_{j=1}^{t} a_{ij} h_j    (9)

where c_i is the context vector and a_{ij} is the weight of each hidden state h_j, computed as follows:

a_{ij} = exp(e_{ij}) / Σ_{k=1}^{t} exp(e_{ik})    (10)

e_{ij} = a(s_{i−1}, h_j)    (11)

The correlation between the output at position i and the input at position j is denoted by e_{ij}, and a(·) is the scoring function. The decoder is built with a unidirectional GRU along with the attention mechanism, which constrains its generation and hence improves the overall output. The hidden state of the decoder is computed with the help of the following equations:

z_t = σ(W_yz E y_{t−1} + U_sz s_{t−1} + C_cz c_t)    (12)

r_t = σ(W_yr E y_{t−1} + U_sr s_{t−1} + C_cr c_t)    (13)

s̃_t = tanh(W_ys E y_{t−1} + U_rs (r_t ⊙ s_{t−1}) + C_cs c_t)    (14)

s_t = (1 − z_t) ⊙ s_{t−1} + z_t ⊙ s̃_t    (15)

W, U and C are the corresponding weight matrices; the update gate, reset gate and candidate activation are represented by z_t, r_t and s̃_t, respectively; E is the embedding matrix and y is the target symbol. Attention in the decoder is induced with the help of the following equations:

c_i = Σ_{j=1}^{t} a_{ij} h_j    (16)

where c_i is the context vector and a_{ij} is the weight of each hidden state h_j, computed as follows:

a_{ij} = exp(e_{ij}) / Σ_{k=1}^{t} exp(e_{ik})    (17)

e_{ij} = a(s_{i−1}, h_j)    (18)

The correlation between the output at position i and the input at position j is denoted by e_{ij}, and a(·) is the scoring function.
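Equations (5)–(11) can be sketched with scalar toy values; the weights and the scoring function e_ij = s · h_j below are illustrative assumptions, not the trained parameters of the model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Scalar sketch of the GRU encoder update, Eqs. (5)-(8).
Wxz, Uhz = 0.4, 0.3   # update-gate weights (illustrative)
Wxr, Uhr = 0.2, 0.5   # reset-gate weights (illustrative)
Wxh, Urh = 0.7, 0.6   # candidate-activation weights (illustrative)

def gru_step(h_prev, x_t):
    z = sigmoid(Wxz * x_t + Uhz * h_prev)               # update gate, Eq. (5)
    r = sigmoid(Wxr * x_t + Uhr * h_prev)               # reset gate, Eq. (6)
    h_cand = math.tanh(Wxh * x_t + Urh * (r * h_prev))  # candidate, Eq. (7)
    return (1 - z) * h_prev + z * h_cand                # interpolation, Eq. (8)

# Run the encoder over a toy input sequence, collecting the hidden states.
h, states = 0.0, []
for x in [1.0, -1.0, 0.5]:
    h = gru_step(h, x)
    states.append(h)

# Attention over the encoder states, Eqs. (9)-(11), with the toy scoring
# function e_ij = s_prev * h_j.
def attention(s_prev, hs):
    scores = [s_prev * hj for hj in hs]                 # Eq. (11)
    exps = [math.exp(e) for e in scores]
    weights = [e / sum(exps) for e in exps]             # softmax, Eq. (10)
    return sum(a * hj for a, hj in zip(weights, hs))    # context vector, Eq. (9)

c = attention(h, states)
```

With a zero decoder state, the softmax weights are uniform and the context vector reduces to the mean of the encoder states, which makes the interpolation easy to verify by hand.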

3.2.4 Seq2Seq with Attention

In most encoder–decoder models, when the connections between the context and the current time step are modelled, the previously generated sequences all receive the same weight rather than a different weight for each position. The attention mechanism was introduced to solve this problem. The basic idea of attention resembles a human brain paying attention to the important parts of a story, a movie or a book while sidelining the rest. In the same manner, the attention mechanism assigns a weight to each part of the sequence and separates the essential parts from the non-essential ones. In practice, the attention mechanism replaces the fixed context vector used as input to the decoder. The following equations compute the context vector as the weighted sum of the hidden states of the encoder network:

c_i = Σ_{j=1}^{t} a_{ij} h_j    (19)

where a_{ij} is the weight of each hidden state h_j, computed as follows:

a_{ij} = exp(e_{ij}) / Σ_{k=1}^{t} exp(e_{ik})    (20)

e_{ij} = a(s_{i−1}, h_j)    (21)


The correlation between the output at position i and the input at position j is denoted by e_{ij}, and a(·) is the scoring function. Only the attention mechanism is added to the general process of encoding and decoding; this improves the quality of the generated output.

4 Experiment and Results

The workstation used was equipped with a Ryzen 7 3700X CPU (8 cores, 16 threads) clocked at 4.4 GHz and 16 GB of DDR4 memory clocked at 3600 MHz. The GPU is an RTX 2070 Super with 8 GB of GDDR6 memory, 2560 CUDA cores and 320 tensor cores. GPUs are well suited to training artificial intelligence and deep learning models because they can process many computations simultaneously: their large number of cores allows better computation of multiple parallel processes, and since deep learning computations handle huge amounts of data, a GPU's memory bandwidth is particularly valuable.

4.1 Data Set

Several data sets are commonly used for text summarization. The Document Understanding Conference (DUC) data sets, such as DUC-2001, DUC-2002, DUC-2003, DUC-2004, DUC-2005, DUC-2006 and DUC-2007, are the benchmark data sets available in the summarization literature. Since deep learning systems require a large amount of training data, the standard CNN/Daily Mail data set is used for training and testing in the above-mentioned models. This data set was first assembled for summarization by Nallapati et al. [18]. It consists of stories from the CNN and Daily Mail websites, comprising 312,084 documents for single-document summarization, with each article paired with a human-written summary.

4.2 Experimental Setting and Results

In the experiment, the batch size is set to 32. The dimension of the hidden state for both the encoder and the decoder is 256. The vocabulary size is 50 K, and the word embedding dimension is 128. The learning rate is set to 0.01. To speed up training, the input sequence length is truncated to 120 tokens, and the same length is used at test time.


4.3 Evaluation Metric

Evaluation techniques for summarization can be broadly categorized into intrinsic and extrinsic approaches [26]. In this paper, the quality of summaries is evaluated using Recall-Oriented Understudy for Gisting Evaluation (ROUGE), which is based on the similarity of n-grams between a system-generated summary and a standard reference summary. ROUGE-1 measures unigram overlap and ROUGE-2 measures bigram overlap between the system summary and the reference summary, while ROUGE-L measures the longest common subsequence of words between the two.

The experimental results of all four models (encoder–decoder with RNN, encoder–decoder with GRU, encoder–decoder with GRU and attention, and sequence-to-sequence with attention) are shown in Table 1 and Fig. 1.

Table 1 Comparison of ROUGE scores for various models on the CNN/Daily Mail data set

Models                                      ROUGE-1   ROUGE-2   ROUGE-L
Encoder–decoder with RNN                    0.305     0.178     0.278
Encoder–decoder with GRU                    0.490     0.322     0.381
Encoder–decoder with GRU and attention      0.479     0.364     0.406
Seq2Seq with attention                      0.345     0.237     0.323

The values in bold indicate which model among the four gives the best result for ROUGE-1, ROUGE-2 and ROUGE-L.

Fig. 1 Performance comparison of various models (bar chart of the ROUGE-1, ROUGE-2 and ROUGE-L scores of the four models)

The ROUGE-1 score of the encoder–decoder with GRU is higher than that of the other models, while the ROUGE-2 and ROUGE-L scores of the encoder–decoder with GRU and attention are the highest.
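A minimal sketch of how a ROUGE-1 F1 score is computed as clipped unigram overlap; actual evaluation uses a ROUGE toolkit, and the two sentences below are invented examples:

```python
from collections import Counter

# ROUGE-1 F1 as clipped unigram overlap between a system summary and a
# reference summary. Illustration only; real evaluations use a ROUGE toolkit.
def rouge_1(system, reference):
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((sys_counts & ref_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(sys_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))  # → 0.833
```

ROUGE-2 follows the same scheme over bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.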

5 Conclusion and Future Scope

In this paper, four basic models for text summarization are implemented and compared using ROUGE scores. All the models are implemented on the CNN/Daily Mail data set. The experimental results show that the ROUGE-2 and ROUGE-L scores of the encoder–decoder with GRU and attention are better than those of the other models. These comparisons can be extended by implementing further models on the same data set, and the models can also be implemented and compared on different data sets.

References

1. M. Gambhir, V. Gupta, Recent automatic text summarization techniques: a survey. Artif. Intell. Rev. 47(1), 1–66 (2017)
2. E. Lloret, M. Palomar, Text summarisation in progress: a literature review. Artif. Intell. Rev. 37, 1–41 (2012)
3. T. Young et al., Recent trends in deep learning based natural language processing. arXiv:1708.02709 (2018)
4. J.M. Conroy, D.P. O'Leary, Text summarization via hidden Markov models, in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2001)
5. S. Corston-Oliver, E. Ringger, M. Gamon, R. Campbell, Task-focused summarization of email, in Text Summarization Branches Out (2004)
6. L. Li, K. Zhou, G.R. Xue, H. Zha, Y. Yu, Enhancing diversity, coverage and balance for summarization through structure learning, in Proceedings of the 18th International Conference on World Wide Web (2009)
7. Y. Ouyang, W. Li, S. Li, Q. Lu, Applying regression models to query-focused multi-document summarization. Inf. Process. Manage. 47(2), 227–237 (2011)
8. M. Kågebäck, O. Mogren, N. Tahmasebi, D. Dubhashi, Extractive summarization using continuous vector space models, in Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC) (2014)
9. W. Yin, Y. Pei, Optimizing sentence modeling and selection for document summarization, in Twenty-Fourth International Joint Conference on Artificial Intelligence (2015)
10. Y. Zhang, M.J. Er, R. Zhao, M. Pratama, Multiview convolutional neural networks for multidocument extractive summarization. IEEE Trans. Cybern. 47(10), 3230–3242 (2016)
11. M. Denil, A. Demiraj, N. de Freitas, Extraction of salient sentences from labelled documents. arXiv:1412.6815 (2014)
12. Z. Cao, F. Wei, L. Dong, S. Li, M. Zhou, Ranking with recursive neural networks and its application to multi-document summarization, in Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)
13. J. Cheng, M. Lapata, Neural summarization by extracting sentences and words. arXiv:1603.07252 (2016)
14. R. Nallapati, B. Zhou, M. Ma, Classify or select: neural architectures for extractive document summarization. arXiv:1611.04244 (2016)
15. R. Nallapati, F. Zhai, B. Zhou, SummaRuNNer: a recurrent neural network based sequence model for extractive summarization of documents. arXiv:1611.04230 (2016)
16. A.M. Rush, S. Chopra, J. Weston, A neural attention model for abstractive sentence summarization. arXiv:1509.00685 (2015)
17. S. Chopra, M. Auli, A.M. Rush, Abstractive sentence summarization with attentive recurrent neural networks, in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2016), pp. 93–98
18. R. Nallapati, B. Xiang, B. Zhou, Sequence-to-sequence RNNs for text summarization, in ICLR Workshop, abs/1602.06023 (2016)
19. A. See, P.J. Liu, C.D. Manning, Get to the point: summarization with pointer-generator networks. arXiv:1704.04368 (2017)
20. R. Paulus, C. Xiong, R. Socher, A deep reinforced model for abstractive summarization. arXiv:1705.04304 (2017)
21. L. Liu, Y. Lu, M. Yang, Q. Qu, J. Zhu, H. Li, Generative adversarial network for abstractive text summarization. arXiv:1711.09357 (2017)
22. L. Chen, M. Le Nguyen, Sentence selective neural extractive summarization with reinforcement learning, in 2019 11th International Conference on Knowledge and Systems Engineering (KSE) (2019)
23. W.-T. Hsu, C.-K. Lin, M.-Y. Lee, K. Min, J. Tang, M. Sun, A unified model for extractive and abstractive summarization using inconsistency loss. arXiv:1805.06266 (2018)
24. S. Song, H. Huang, T. Ruan, Abstractive text summarization using LSTM-CNN based deep learning. Multimedia Tools Appl., 1–19 (2018)
25. M. Tomer, M. Kumar, Improving text summarization using ensembled approach based on fuzzy with LSTM. Arabian J. Sci. Eng. 45(12), 10743–10754 (2020)
26. J. Steinberger, K. Ježek, Evaluation measures for text summarization. Comput. Inform. 28, 1001–1026 (2007)

Polarity Detection Across the Globe Using Sentiment Analysis on COVID-19-Related Tweets

M. Uvaneshwari, Ekata Gupta, Mukta Goyal, N. Suman, and M. Geetha

Abstract: In the face of a crisis, how different cultures respond and react depends largely on external factors. Government decisions have a high impact on the behavior of society. It has been widely observed that people across nations use social media to express their feelings. Even during the spread of coronavirus (COVID-19), people in countries at various longitudes expressed different emotions. The purpose of this study is to analyze the emotions of people from different countries. Text classification was used to estimate sentiment polarity from extracted tweets. Various machine learning models were trained to achieve accuracy on the sentiment dataset. It was observed that, although people of different races and cultures are spread across the globe over a wide longitudinal area, their sentiments and emotions are largely the same.

Keywords: Pandemic · COVID-19 · Polarity assessment · Crisis · Sentiment analysis

1 Introduction

In the midst of the current coronavirus (COVID-19) pandemic, everyone is witnessing a sea change in the way people behave in their daily activities, be it e-learning, the way we interact with others and connect, or shopping. These common issues have a strong impact on our lives; however, given a situation, not all religions respond and

M. Uvaneshwari: SRM Institute of Science and Technology, Chennai, India
E. Gupta (B): Guru Nanak Institute of Management, New Delhi, Delhi, India
M. Goyal: Guru Nanak Dev Institute of Technology, New Delhi, Delhi, India
N. Suman, M. Geetha: SR University, Warangal, Telangana, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022. A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_46


M. Uvaneshwari et al.

Fig. 1 Stages of sentiment analysis

react in the same way. Research indicates that individuals across different cultures reason differently, even under normal circumstances [1]. Many factors affect decision making as a whole. This paper focuses on whether such differences occurred across the globe during the COVID-19 crisis. During an intense crisis, social media networks play an important role, as people use these outlets to exchange thoughts, views, and responses with others in order to deal with and respond to the crisis. We therefore concentrate in this report on addressing collective responses to events shared on social media, with special emphasis on a global health-related issue, the COVID-19 outbreak, on Twitter, chosen for its access and availability (Fig. 1).

1.1 Research Study and Objectives

The basic objective of this research is to understand how different people across the globe react to a given global crisis. NLP-based sentiment analysis using various supervised machine learning tools has been used to address the following questions.

RQ1: Can NLP accurately help in understanding and predicting sentiments?
RQ2: How do the sentiments of Indians compare with those of people across the globe?


Fig. 2 Basic steps followed in sentiment analysis for text classification

1.2 Our Work

Based on the research study and objectives, the major work is as follows:

• Supervised machine learning tools that take Twitter feeds as input and output sentiments
• Calculation of the accuracy achieved on the sentiment polarity assessment dataset
• Interesting insights across the globe from the collected dataset on the coronavirus outbreak on social media (Fig. 2).

The research paper is organized into sections covering the research design, dataset, and related work. Data cleaning and processing are also discussed. Various machine learning models are trained on the dataset to measure accuracy, and lastly the analysis is presented.

2 Research Procedure

2.1 Research Design

Tweets posted by users on the global corona crisis were collected, and a quantitative (experimental) research methodology was followed. Data from March 2020 until April 2020 were collected; this was the time when cases were being witnessed across the globe. This


period was the initial stage of the pandemic, and we wanted to study the behavior of people in those early days.

2.2 Study Dimensions

The following fields were taken into account:

1. Time (t): early March 2020 until the end of April 2020
2. Location (l): country of the person who tweeted
3. Polarity (p): sentiment classified as extremely positive, positive, neutral, negative, or extremely negative.

2.3 Tools and Instrument

A machine learning approach to text classification was implemented using scikit-learn, Python, and NLTK.

3 Literature Review

Text classification is a machine learning task that can be addressed by training a classifier on a labeled text set. A hierarchical generative probabilistic model integrating both n-grams and latent topic variables has been used, with the polarity score calculated from a hashtag graph [2, 3]. According to the authors of [4, 5], text classification can be used to solve problems in data mining and machine learning. Sentiment analysis can also surface the most commonly used terms in positive, negative, or neutral tweets from a bubble chart: "united", "airlines", "passenger", and "overbook" are the most common words in negative tweets, while positive tweets, which tend to be far fewer than negative ones, contain words such as "mate", "opportunity", "brilliant", and "welcome", some of them actually humorous about the incident [6].

During an intense crisis, social media networks play an important role, as people use these outlets to exchange thoughts, views, and responses with others to deal with and respond to crises. Therefore, in this report, the emphasis was on examining collective responses to social media events, exploiting the ready availability of Twitter data in the form of tags. Thousands of Twitter users' data were analyzed to understand how various cultures reacted and responded to coronavirus within the initial weeks of the corona crisis [7, 8].


The application of textual analytics, natural language processing (NLP), and other artificial intelligence techniques in research has seen an exponential increase. In spite of rapid advances in NLP, limitations in these techniques remain, so understanding the limitations of the various algorithms has become necessary for further research [9, 10].

Despite the synonymous usage of sentiment analysis and emotional intelligence, they are quite different. Sentiment analysis relies primarily on data to categorize expressions as positive, negative, or neutral, while the subtlety of the emotions expressed in comments is explored further by EQ. EQ is far harder and more multifaceted than evaluating sentiment. For example, sentiment analysis will check whether a specific comment is positive, negative, or neutral, but emotional intelligence will further check whether a negative comment conveys disappointment, dissatisfaction, or sarcasm [11, 12].

Balahur's (2013) paper "Sentiment Analysis in Social Media Texts" performed sentiment analysis of Twitter datasets using unigram and bigram (n-gram) features and supervised learning with simple support vector machines. On the basis of the observations, it can be concluded, first, that unigrams and bigrams together are the best features for sentiment analysis and, second, that the classification of emotions is significantly improved by generalizations using special tags, emotive terms, and modifiers (joy, happiness, sorrow, terror, etc.) [13, 14].

Twitter has been shown to be helpful for various activities, such as emergency communication networks, tracking public feelings, identifying anomalies, and providing early warning. Twitter has been used as a data source to monitor public response and health during disasters (e.g., hurricanes, floods, earthquakes, terrorist bombings, dissemination of misinformation related to public health, and disease outbreaks). Researchers continue to come up with new ideas for using Twitter data in different ways; Catherine et al., for example, use Twitter data for analysis and explanation [15, 16].

Social networking outlets such as Facebook, Twitter, and YouTube are among the most viable sources of information, known as social data; social networks discuss events that occur in everyday life, and all individuals are able to discuss and share their opinions about these events. The coronavirus known as COVID-19, which started in Wuhan, China, at the end of last year, has become one of the most debated and most common pathogens in the world. More than 20,000,000 people have contracted the disease, and more than 157,000 people have died worldwide, according to the World Health Organization (WHO) [17–19].

More than 121,407,000 tweets, most of them tagged to a location, and more than 22 million web pages were recently examined by De Domenico and his colleagues to identify the messages sent during the outbreak by individuals and robots. The tasks involved finding emotional content, filtering posts, and analyzing tweets for neuroticism and other psychological attributes; the tweets were also validated for misinformation. What happened during the global outbreak of the COVID-19 disease is distinct from the "normal dissemination of false news" [20–22].


Media messaging, sadly, does not always agree with science, and disinformation, unfounded statements, and rumors may spread rapidly. Twitter is the most common social media platform used for health care information. Previous studies show that Twitter can provide valuable public health information, including monitoring of infectious disease outbreaks, natural disasters, drug use, and more [4]. A Twitter social network research study was conducted to analyze COVID-19-related social media conversations and to investigate social sentiments toward COVID-19-related themes. There were two study objectives: to provide clarity about online COVID-19-related discussion topics and to analyze COVID-19-related sentiments. Findings from this research shed light on overlooked feelings and patterns associated with the COVID-19 pandemic [23–25]. Tweets from the USA were evaluated at the state and county levels. Differences in temporal tweeting habits were established, and as the pandemic progressed, individuals were found to tweet more about COVID-19 during working hours. A sentiment analysis was performed over time [26]. "Social media offers a large number of user-created messages that represent the moods and viewpoints of those users. The complexity of analysis is the problem with this data source: social media posts are incredibly noisy and idiosyncratic, and the volume of incoming data is much too high to be analyzed manually. Automated methods are required to extract meaningful insights and measure the development of these feelings over time, segment the results by region, and compare them with events that took place in each region [3]."

4 Dataset

4.1 Data Gathering Procedure

Tweepy, a well-known Python client for the standard Twitter API, was used to obtain users' tweets via multiple queries for the stated period. We took tweets from the first quarter of March 2020 to the beginning of April 2020 from a publicly available labeled dataset. The dataset comprises tweets labeled as positive, neutral, extremely positive, extremely negative, or negative, and contains 40 k tweets (Fig. 3).

4.2 Data Preparation

The final dataset consisted of raw tweets, which needed preprocessing to get good results. The tweets were cleaned to eliminate stop words and frequently used words such as prepositions, conjunctions, and articles, since these words have no importance for sentiment analysis. The preprocessing steps are as follows.


Fig. 3 Analyzing a comment

• Filtering/noise removal: tweets are cleaned by removing special symbols, emoticons, etc.
• Tokenization: the aim of this step is to identify the various tokens in the collected text.
• Stop-word removal: stop words and other common words with no analytic value are removed from the tweets.
• Text/word normalization: stemming or lemmatization is used to remove common word endings such as "ing", "es", "s", etc.
• White-space removal: extra white space in each text is removed.
• Conversion to lower case: after all unnecessary terms are removed, the text is converted to lower case.
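The cleaning steps above can be sketched as follows, assuming a tiny illustrative stop-word list rather than NLTK's full list:

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use NLTK's corpus
# and a proper stemmer or lemmatizer.
STOP_WORDS = {"the", "is", "a", "an", "and", "to", "of"}

def clean_tweet(text):
    text = re.sub(r"http\S+|@\w+|#", " ", text)       # strip URLs, mentions, '#'
    text = re.sub(r"[^a-zA-Z\s]", " ", text)          # remove special symbols
    text = re.sub(r"\s+", " ", text).strip().lower()  # collapse spaces, lower-case
    tokens = text.split()                             # tokenize
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_tweet("The #COVID19 lockdown is EXTENDED!! https://t.co/x @WHO"))
# → ['covid', 'lockdown', 'extended']
```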

5 Model for Sentiment Analysis

Sentiment analysis on tweets refers to the classification of an input tweet's text into sentiment polarities: extremely positive, positive, neutral, negative, and extremely negative. Sentiment polarity conveys significant information about the subject of the text. There are various strategies and complex algorithms used to classify text and train machines to perform sentiment analysis, each with its pros and cons; used together, however, they can give excellent results. The following are among the most widely used algorithms.

5.1 Naive Bayes

Naive Bayes is a fairly simple family of probabilistic algorithms that, for sentiment classification, assigns a probability that a given word or


phrase should be classified as extremely positive, positive, negative, extremely negative, or neutral. Essentially, Bayes' theorem states that the probability of A, given that B is true, is equal to the probability of B given A, times the probability of A, divided by the probability of B:

P(A|B) = P(B|A) · P(A) / P(B)

Naive Bayes treats the words as independent of one another, so the machine learning model is trained on the polarity of individual words.
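As a sketch of how Naive Bayes assigns these probabilities, the following toy multinomial classifier with add-one (Laplace) smoothing is trained on an invented four-tweet, two-class corpus; the paper's experiments instead use scikit-learn on the full labeled dataset with five classes:

```python
import math
from collections import Counter

# Invented training corpus, for illustration only.
train = [
    ("great recovery hopeful vaccine", "positive"),
    ("hopeful brilliant news", "positive"),
    ("terrible outbreak deaths rising", "negative"),
    ("lockdown terrible panic", "negative"),
]

classes = {"positive", "negative"}
word_counts = {c: Counter() for c in classes}
doc_counts = Counter()
for text, label in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in classes for w in word_counts[c]}

def predict(text):
    scores = {}
    for c in classes:
        # log P(c) + sum of log P(word | c) with add-one smoothing
        score = math.log(doc_counts[c] / len(train))
        total = sum(word_counts[c].values())
        for w in text.split():
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("hopeful news"))    # → positive
print(predict("terrible panic"))  # → negative
```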

5.2 Linear Regression

Linear regression is a statistical algorithm used to predict a Y value given X features. Using machine learning, the data sets are examined to reveal a relationship. The relationships are then plotted along the X/Y axes, with a straight line fitted through them to predict further relationships. Linear regression determines how the X input (words and phrases) relates to the Y output (polarity).

5.3 Support Vector Machines (SVM)

A support vector machine is another supervised machine learning model, similar to linear regression but more advanced. SVM uses algorithms to train on and classify text within our sentiment polarity model, taking it a step beyond X/Y prediction.

6 Results on Trending Hashtag #Twitter Data

40 k tweets were collected from a public domain source, and a word cloud was first generated from them (Fig. 4). The word cloud clearly shows that all the tweets are related to COVID-19. The next step was to clean the data to obtain clean tweets, after which the machine learning models were trained on the dataset, split in a 70:30 ratio for training and testing. The first algorithm used was Naive Bayes; its confusion matrix and accuracy score are shown in Figs. 5 and 6. Its accuracy score was only 43% (Figs. 7 and 8). The next algorithm used was the support vector machine, with which the accuracy went up to 58%.
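The accuracy scores quoted above are derived from the confusion matrix as the fraction of predictions lying on its diagonal; a sketch with an invented three-class matrix (rows are true classes, columns are predicted classes):

```python
# Illustrative confusion matrix, not the actual matrix from the experiment.
confusion = [
    [50, 10,  5],   # true negative-class tweets
    [12, 40,  8],   # true neutral-class tweets
    [ 6,  9, 60],   # true positive-class tweets
]

correct = sum(confusion[i][i] for i in range(len(confusion)))  # diagonal
total = sum(sum(row) for row in confusion)                     # all predictions
accuracy = correct / total
print(round(accuracy, 3))  # → 0.75
```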


Fig. 4 Word cloud from the tweets

Fig. 5 Confusion matrix of NB

7 Conclusion

The polarity of sentiments was analyzed using two machine learning algorithms: Naive Bayes and support vector machine. It was found that the support vector machine achieved


Fig. 6 F1 score and accuracy

Fig. 7 Confusion matrix of SVM

much better accuracy. Then, the sentiments of people within India and in other regions of the world were determined using this machine learning tool, splitting the data into training and test sets in a 70:30 ratio. The dataset consists of labeled classes ranging from extremely negative to extremely positive. Based on our objective, it was determined that the support vector machine has the best accuracy (Fig. 9).


Fig. 8 F1 score and accuracy

Fig. 9 Comparison of sentiments in India and other region

Secondly, the polarity was compared within India and other regions of the world, and it is found that all people across the globe have almost the same sentiments toward COVID-19.





FOG-EE Computing: Fog, Edge and Elastic Computing, New Age Cloud Computing Paradigms

Shristi Achari and Rahul Johari

Abstract Cloud computing is a challenging domain in which technological changes are happening at a rapid rate. The steady emergence of fog, mist and edge computing (FOME computing), a paradigm of mist, IoT, cloud, edge and fog (MiCEF) computing, over the last five years is an epitome of the same. In the current research paper, an effort has been made to extensively explore fog and edge computing, and the same has been simulated using the iFogSim simulator by taking a case study of Indian cities using a couple of sensors and actuators. To the best of our knowledge, this is the first work of its kind in this direction.

Keywords Fog computing · Edge computing · Elastic computing · iFogSim simulator

1 Introduction 1.1 Fog Computing It is a decentralized computing infrastructure where data, storage, compute and applications are allocated somewhere between the data source and the cloud. As stated by the National Institute of Standards and Technology (NIST), "Fog computing is a layered model for enabling ubiquitous access to a shared continuum of scalable computing resources. The model facilitates the deployment of distributed, latency-aware applications and services and consists of fog nodes (physical or virtual), residing between smart end devices and centralized (cloud) services" [1, 2]. An essential component of the fog computing architecture is the fog node. Fog nodes are physical components such as routers, gateways, servers and

S. Achari (B) · R. Johari
SWINGER: Security, Wireless, IoT Network Group of Engineering and Research Lab, University School of Information, Communication and Technology (USICT), Guru Gobind Singh Indraprastha University, Sector-16C, Dwarka, Delhi, India
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_47

579

580

S. Achari and R. Johari

switches, as well as virtual components such as cloudlets and virtual switches. There are three types of service model in fog computing. In software as a service (SaaS), fog applications are provided directly to the consumer by the fog service provider, who also manages these applications running on the fog node clusters. Using end devices, the consumer can access these applications but cannot change settings such as network, storage, server, operating system and general applications [3]. In platform as a service (PaaS), a platform is given by the fog service provider to the consumer, on which they can deploy and run their applications. This model gives more flexibility for the customization and implementation of applications [3]. Infrastructure as a service (IaaS) gives a high degree of flexibility to the fog service consumer. The operating system, network and storage settings can be managed by the consumer, and some infrastructure elements like the firewall might also be managed by the consumer [3]. Fog computing has four deployment models. Private fog is the most common model of fog computing deployment, in which a fog infrastructure is configured by the fog service provider within the devices of a specific organization. In the community fog model, the client-facing infrastructure is shared by consumers from organizations with shared concerns, interests or requirements. In public fog, the fog service provider gives services through a generic platform to consumers who pay for the available services. Hybrid fog is the combination of two or more different fog models (public, private or community).

1.2 Edge Computing "Edge computing refers to the enabling technologies allowing computation to be performed at the edge of the network, on downstream data on behalf of cloud services and upstream data on behalf of IoT services." To provide a direct transmission service, it runs specific applications in a fixed location. It includes various technologies such as mobile data capture, mobile signature analysis, sensor networks, and peer-to-peer and ad hoc networking; with the help of information and communication technology, it also makes a major contribution to the production of networking equipment in Industry 4.0 [2, 4]. Edge computing was developed in response to the exponential growth of IoT devices. It is a more efficient form of cloud computing that is used for real-time monitoring and analysis. The primary advantage of edge computing is that it reduces latency, thereby providing more responsive processing of real-time data.

1.3 Elastic Computing It is a concept of cloud computing in which computing resources can be easily scaled up and down by the cloud service provider [5].

FOG-EE Computing: Fog, Edge and Elastic …

581

Fig. 1 Relation between cloud, fog and edge computing

In elastic computing, whenever computing power is needed, the cloud service provider supplies it. This elasticity of resources can be in terms of bandwidth, power, storage, etc. The relation between the three computing paradigms — cloud, fog and edge computing — is presented in Fig. 1: thousands of cloud data centers are extended to the edge of the network by fog computing, in which thousands of edge or fog nodes such as routers, hubs and gateways are present, connected in turn to millions of smart IoT devices such as smart watches, wireless networks, commercial security systems and smart cities.
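The provider-side elasticity rule described above can be sketched as a toy autoscaler; the thresholds and the demand trace here are invented purely for illustration.

```python
def scale(instances: int, load_per_instance: float,
          high: float = 0.6, low: float = 0.3) -> int:
    """Return the new instance count for the observed per-instance load."""
    if load_per_instance > high:
        return instances + 1          # scale up under pressure
    if load_per_instance < low and instances > 1:
        return instances - 1          # scale down when idle
    return instances

# Simulate a demand curve rising and then falling over time:
# capacity grows and shrinks with demand.
instances = 1
for demand in [0.5, 1.4, 2.1, 2.0, 0.9, 0.4]:  # total load per step
    instances = scale(instances, demand / instances)
```

A real cloud autoscaler would add hysteresis and cooldown periods, but the core feedback loop is the same.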


2 Literature Survey Tange et al. [6] focus on Industrial Internet of Things (IIoT) security. The paper makes two contributions. First, it provides a systematic review of IIoT security for the years 2011–2019, structured around four major research questions: (RQ1) what security requirements exist for IIoT, (RQ2) how have scientific publications about IIoT security been spread over the years, (RQ3) how is IIoT security research activity distributed geographically and (RQ4) which publication venues are most popular for IIoT security. RQ1 is answered by an exploration of the investigated works, from which the security requirements of IIoT are extracted; RQ2–RQ4 are addressed through a quantitative analysis of the investigated security research. The second contribution discusses the fog computing paradigm and how it can meet the requirements posed by industrial environments; the requirements extracted in the investigation are revisited from the perspective of fog computing. The limitations and challenges that must still be addressed to achieve massive fog computing deployment are pointed out. Once these limitations and challenges are overcome, fog computing could become a powerful appliance for securing the different types of interconnected industrial environments [6]. Rathee et al. [7] are concerned with IoT device-layer attacks in the fog computing environment and on fog nodes during the handoff (mobility) of IoT devices, in which the transmission processing is demoralized. A framework has been proposed in which a tidal trust value is calculated for each node and malicious nodes are detected with the help of pre-defined values over certain networking parameters, such as (a) network resources versus number of malicious fog nodes, (b) trusted nodes versus percentage of malicious fog nodes, (c) HEU (handoff IoT device) versus malicious fog nodes and (d) mobile HEU (MHEU) versus fog nodes. The proposed mechanism has been verified against these networking parameters; the results show an 85% improvement in comparison with the existing technique. Future work will include the calculation of trust services that run with the node trust value on fog and IoT nodes in different IoT environments that are dynamic and random in nature [7]. Jazayeri et al. [8] investigated offloading problems in different cloud, edge and fog architectures. A case study is discussed in which an application executes its modules on mobile devices or transfers them to the cloud. To solve the problem, a monitor–analysis–plan–execution (MAPE) model is presented. In the proposed greedy auto-scaling deep reinforcement learning-based offloading (GASDEO) method, a greedy technique is first used to prioritize the local fog devices to meet the auto-scaling requirement; in the second step, the best destination is selected for the modules by GASDEO. Based on the convergence


rate, the proposed algorithm has been analyzed and performed better than plain reinforcement learning. On average, considering delay, execution cost, network resources and power consumption, the results obtained using the proposed method were better than the local execution method by 18%, the first-fit method by 6% and standard deep reinforcement learning (DRL) by 2%. For future work, renewable energy will be modelled for the offloading problem, machine learning (ML) algorithms for load prediction in IoT applications may also be useful, and a prioritized experience replay method will be used to improve the effectiveness and training speed of DRL [8]. Karagiannis et al. [9] presented and compared flat and hierarchical architectures for fog computing, creating a unified system model for a common representation of both architectures and designing an algorithm that follows these architectures to create fog computing systems. Various experiments were performed to evaluate the bandwidth utilization and communication latency of these architectures. The results show that for applications with no dependency on the cloud (i.e., no resource-demanding tasks are involved), the hierarchical architecture reduces communication latency by 13% compared to the flat architecture, whereas for applications that involve resource-demanding tasks, the flat architecture reduces communication latency by 16% compared to the hierarchical one. Based on these results, the differences between the architectures are quantified for fog computing, and a discussion is provided that can be used as a guideline for selecting the appropriate architecture [9]. Yousefpour et al. [10] presented a fog computing tutorial and discussed similarities and differences with other models such as MEC, cloudlets and edge computing.

In today's world of information technology, data is the most important thing; as the volume and velocity of data increase, transmitting big data from IoT devices to the cloud might not be efficient. Fog computing has been proposed to resolve these issues: it bridges the gap between the cloud and IoT devices by enabling networking, computing, storage and data management on network nodes in close proximity to IoT devices. To address the same issues, the research community has proposed other computing paradigms such as cloudlets, cloud of things, edge computing and mist computing. A comparison of fog computing with these paradigms is presented in the survey, which argues that fog computing is the more general paradigm because of its thorough definition, scope and flexibility. The detailed survey discusses how fog computing can fulfill the increasing requirements on privacy, latency strictness and bandwidth. A comparison of related survey papers in the area of fog computing is included; for a better understanding of fog computing, the survey first looks at cloud computing, then discusses solutions to the issues of the cloud and how fog computing extends cloud computing. A classification of research topics in fog computing is provided and, based on the survey, remaining challenges and future research directions in fog computing are identified [10]. Mahmud et al. [11] presented a survey of recent developments in fog computing. It is a distributed paradigm which works as an intermediate layer between cloud data centers and IoT devices/sensors. The basic differences between fog computing and other


paradigms are discussed in detail, and the different characteristics of fog computing are analyzed. The dissimilarity of fog computing to related computing approaches is discussed in the context of resolving cloud computing issues such as network congestion, huge round-trip delay and degradation of service quality. The edge computing concept has been proposed to enable data processing at the edge of the network, providing quick replies to computational service requests and usually preventing bulk raw data from being forwarded toward the core network. Mobile edge computing (MEC) and mobile cloud computing (MCC) are both considered probable extensions of edge and cloud computing: MEC enhances network efficiency and targets compatible and faster cellular services for consumers, whereas MCC combines mobile computing, cloud computing and wireless communication to increase the quality of experience (QoE). Like MEC and MCC, edge computation can also be enabled by fog computing. In comparison with other related computing paradigms, fog computing is considered more capable and better organized for IoT: the service demand of a huge number of IoT devices/sensors is reduced and multi-tier application deployment can be easily handled, and in contrast to edge computing, it can also extend cloud services such as IaaS, PaaS and SaaS. A taxonomy of fog computing according to its key features is presented, challenges are identified, and through the analysis some promising research directions for future work are derived [11].

3 Methodology A virtual scenario has been designed involving four Indian cities, namely Delhi, Mumbai, Jaipur and Agra (Fig. 5). Sensors and actuators were used in the designed scenario. The topology designed and deployed in the iFogSim simulator is presented in Fig. 2, and information about the configuration parameters is showcased in Figs. 3 and 4, respectively. The flowchart of how the communication occurs between cities is schematically presented using the following legend:

LEGEND
CSP  Cloud Service Provider
Fn   Fog Nodes
En   Edge Nodes
An   Actuator
Sn   Sensor
St   Sensor Type
DB   Database
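iFogSim itself is a Java toolkit, so purely as an illustrative sketch, the simulated hierarchy (CSP at the top, one Fn and En per city, each En wired to an Sn and an An, following the legend above) can be modelled as:

```python
# Illustrative Python model of the simulated topology; iFogSim itself
# is a Java toolkit, so this only sketches the hierarchy, not the simulator.
cities = ["Delhi", "Mumbai", "Jaipur", "Agra"]

topology = {"CSP": [f"Fn-{c}" for c in cities]}      # cloud service provider
for c in cities:
    topology[f"Fn-{c}"] = [f"En-{c}"]                # fog node per city
    topology[f"En-{c}"] = [f"Sn-{c}", f"An-{c}"]     # edge node
    topology[f"Sn-{c}"] = []                         # sensor (leaf)
    topology[f"An-{c}"] = []                         # actuator (leaf)

def devices(node):
    """Collect the sensors/actuators reachable from a node."""
    children = topology[node]
    return [node] if not children else [d for ch in children for d in devices(ch)]
```

Walking down from the CSP thus reaches one sensor and one actuator per city, mirroring the deployed topology in Fig. 2.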


3.1 Flowchart


4 Results

Fig. 2 FOG-EE Topology (addition of four fog nodes)



Fig. 3 Edge connections between multiple Fn linking Delhi and Mumbai

Fig. 4 Configuration of An and Sn connected with Fn installed at Mumbai


Fig. 5 Deployment of An and Sn at Fn installed at Mumbai

5 Conclusion In the current research paper, an extensive literature study related to fog, edge and elastic computing has been carried out. A simulation involving a case study connecting Indian cities using sensors and actuators has been performed using the iFogSim simulator.

References

1. M. Iorga, L. Feldman, R. Barton, M.J. Martin, N.S. Goren, C. Mahmoudi, Fog computing conceptual model (No. Special Publication (NIST SP)-500-325) (2018)
2. A. Sunyaev, Internet Computing (Springer International Publishing, New York, NY, USA, 2020), pp. 237–264
3. Z. Mahmood, Fog Computing: Concepts, Frameworks and Technologies, 1st edn. (Springer, Cham, 2018)
4. W. Shi, J. Cao, Q. Zhang, Y. Li, L. Xu, Edge computing: vision and challenges. IEEE Internet Things J. 3(5), 637–646 (2016)
5. https://www.techopedia.com/definition/26598/elastic-computing-ec
6. K. Tange, M. De Donno, X. Fafoutis, N. Dragoni, A systematic survey of industrial Internet of Things security: requirements and fog computing opportunities. IEEE Commun. Surv. Tutorials
7. G. Rathee, R. Sandhu, H. Saini, M. Sivaram, V. Dhasarathan, A trust computed framework for IoT devices and fog computing environment. Wirel. Netw. 26(4), 2339–2351 (2020)
8. F. Jazayeri, A. Shahidinejad, M. Ghobaei-Arani, Autonomous computation offloading and auto-scaling in mobile fog computing: a deep reinforcement learning-based approach. J. Ambient Intell. Human. Comput. (2020)


9. V. Karagiannis, S. Schulte, Comparison of alternative architectures in fog computing, in 2020 IEEE 4th International Conference on Fog and Edge Computing (ICFEC) (IEEE, May 2020), pp. 19–28
10. A. Yousefpour, C. Fung, T. Nguyen, K. Kadiyala, F. Jalali, A. Niakanlahiji, J. Kong, J.P. Jue, All one needs to know about fog computing and related edge computing paradigms: a complete survey. J. Syst. Arch. 98, 289–330 (2019)
11. R. Mahmud, R. Kotagiri, R. Buyya, Fog computing: a taxonomy, survey and future directions, in Internet of Everything (Springer, Singapore, 2018), pp. 103–130

Hybrid Filter for Dorsal Hand Vein Images

Nisha Charaya, Anil Kumar, and Priti Singh

Abstract With recent trends in technology, biometrics has become an integral part of our routine life. Over recent years, dorsal hand veins have emerged as a promising biometric attribute. High-quality hand vein images are required to achieve a reliable biometric system with good performance. However, capturing good quality hand vein images is a challenging task due to the presence of a lot of noise. The quality of images is highly affected by illumination, the presence of hair on skin, shadow of bones and many other factors. This paper presents a layered-filter approach for noisy hand vein images. The technique greatly enhances the image quality, which yields accurate results when tested as a biometric.

Keywords Biometrics · Security · Forgery · Hand vein · Noise removal · Pre-processing · Enhancement

1 Introduction With the advances in technology and the implementation of automation everywhere, security has been put at stake and needs to be ensured before bringing a new technology into practice. From minor to major, all the activities in our lives have become automated with the help of biometrics. Attendance marking systems in offices, online transactions in banks and door opening at home: everything has become fast and automatic. But this automation requires a perfect, accurate and precise method to identify a person. The technique of identifying or verifying a person with the help of physiological or behavioral attributes is known as biometrics. Iris, veins, fingerprints and face are examples of physiological biometrics, whereas voice,

N. Charaya (B) · A. Kumar · P. Singh Amity University, Gurgaon, Haryana, India e-mail: [email protected] A. Kumar e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_48

591

592

N. Charaya et al.

gait and DNA are examples of behavioral biometrics. A biometric system is user-convenient compared to traditional methods, as it does not need to be kept safe or memorized at all times [1, 2]. The need for biometrics has been driven by access management systems, which require that a person be identified accurately before being granted access to any information, device or other application. The most crucial step in this is to verify the identity claimed by the user, so that important information and resources can be protected from impostors. Traditionally, a password or identity card was matched for this purpose, but due to its limitations of being misplaced, forgotten or copied, its reliability was in doubt. Simply not remembering the pass code or not possessing the identity card may be a trouble for the genuine user, as the system may refuse access. As a solution to these problems, biometrics has come to the rescue, providing an accurate, automatic and reliable authentication system that cannot be stolen, shared, forgotten or forged [3]. In contrast to traditional systems, a biometric-based authentication system is user-convenient, secure and universally accepted. Traditional and biometrics-based authentication systems are shown in Fig. 1. There are numerous physiological and behavioral attributes that are used for biometric systems, with fingerprints and face being the most popular and widely used among them. But hand veins have become popular among researchers due to their characteristics of user convenience and security against forgery [2]. Hand veins are the vast network of blood vessels present under the skin. The vein is analyzed with features of muscle tissues, bones and hand veins [4]. Under the human skin, there is a vein pattern consisting of various networks of blood vessels. Barring surgical intervention, the structure of vascular patterns in human body parts is diverse and stable over a long time. Further, blood vessels are present beneath the skin

Fig. 1 Human identification systems a traditional system using identity card and pass code, b biometric system [3]


and cannot be seen directly with the naked eye. Compared to other biometric features, a vein pattern is very difficult to forge. Its characteristics such as uniqueness, contactless acquisition, universality and anti-forgery are strong enough to outweigh the challenges in its use. One of the main challenges in a vein pattern biometric model is to acquire images of different vein patterns quickly without involving harmful surgical instruments. However, only good quality images can be used as a biometric and yield accurate results. If the images are of poor quality, they will not be detected at the input of the system, which increases the failure-to-enroll and failure-to-capture rates; the consequences are poor accuracy as well as user inconvenience. For these reasons, noise removal has become an important step in hand vein-based biometric systems. The quality of images is greatly affected by illumination, optical blurring, presence of hair on skin, shadow of bones, uneven surfaces and many other factors [3]. Most of these factors are unavoidable but can be nullified by applying suitable filtering techniques, which makes the images usable as input to a biometric system. All these noises are of different nature and require different filters to be removed. However, the application of multiple filters makes the system complex and time consuming. The objective of this research is to design a layered filter which combats the major noises present in the images and enhances the image quality significantly. Our system improves the overall performance by minimizing the failure-to-enroll and failure-to-capture rates, which mainly occur due to low-quality images. The remainder of the paper is organized as follows. Section 2 discusses related work on this topic. We propose a hybrid filter consisting of a median filter and a 2-D Wiener filter in Sect. 3. Section 4 discusses experimental results. Finally, Sect. 5 presents the conclusion of the paper.

2 Related Work As noise removal is an essential step in pre-processing, many techniques have been applied by researchers, and different filters and their extensions have been implemented in the existing work. Zhong applied mean and median filters to reduce the noise in hand vein images, followed by histogram equalization for contrast enhancement [4]. Trabelsi filtered the impurity and noise present in hand vein images by using an anisotropic diffusion filter, which can distinguish between edges present in texture and real borders of the image [3]. Chuang used a Gaussian low-pass filter to enhance the image quality [5], whereas a Gaussian matched filter was adopted by Huang for noise removal and contrast enhancement of NIR dorsal hand images [6]. The images acquired by Zhao contained heavy noise, so they developed a three-step algorithm to remove it, consisting of a matched filter, a Wiener filter and an average filter, respectively [7]. In a similar pattern, Ramalho applied a three-layer filter comprising different filters for different types of noise. The filter consisted of an average filter to remove


the impulsive noise, morphological opening and closing filters to remove the linear noise and finally a 2-D Wiener filter to remove small artifacts [8]. A much more complex filtering system was proposed by Li. Here, a 2-D Gaussian low-pass filter was used to eliminate the speckle noise; the images were passed through this filter six times. Thereafter, they were passed through a single-column and a single-row median filter to eliminate the horizontal strip scanning noise. Finally, the images were smoothened by morphological opening and closing operations and passed through a median filter which smoothed out the boundary [9]. A simple morphological opening operation, followed by normalization, was employed by Wang and Zhang, which removed the hand shank area and enhanced the image, but no filter was applied for noise removal [10]. The work existing in the literature motivated the design of a filter which is simple yet effective for noise removal in DHV images.

3 Proposed Filter A general biometric system comprises several components, as shown in Fig. 2. The sensor element collects the unprocessed biometric data from an individual, which may be in the form of an image, video, audio or some other signal. This unprocessed data is passed through the pre-processing block, where it is refined, improved and made ready to be processed and utilized. The feature extraction block gathers a unique set of features to represent the signal. The extracted feature set is labeled with the user identity and stored in the database during the user enrollment phase. The matching block is responsible for comparing the data presented for identification with the stored data; as an outcome of the comparison, a matching score is generated. The decision block categorizes the presented data as identified or unidentified based on its matching score value [3]. The often ignored pre-processing module is highly important, as all the acquired images of different quality need to be converted to system-acceptable quality first. It consists of numerous steps including noise removal. In the case of dorsal hand vein images, noise is unavoidable as it arises due to the presence of hair on skin, shadow of bones, uneven surfaces and other natural factors. As concluded from the literature study, the near infra-red imaging technique is preferred for imaging all veins in the hand as it is temperature independent [11]. In this work, the images are taken from the Bosphorus hand vein database. The images are pre-processed before extracting the features. In pre-processing, the ROI is taken as a square region around the centroid, noise removal is done by the proposed filter, histogram equalization is employed for image enhancement and Otsu's thresholding is used for image segmentation. However, in this paper, noise removal results are shown over the whole image for better clarity. The proposed filter is a two-layer filter, with the first layer being a median filter and the second layer a 2-D Wiener filter. This filter is simple yet effective for noise removal in DHV images.

Hybrid Filter for Dorsal Hand Vein Images

595

Fig. 2 Flow chart of hand vein recognition system

Median filter is a statistical filter which performs well for removal of Gaussian noise, random noise and salt and pepper noise. This filter is used so that it can target the majority of noise arising from different sources and reasons in a DHV image. It works by moving through the image pixel by pixel and deciding whether a particular pixel corresponds to image or noise based on its neighboring pixel values. It replaces the pixel value with the median of neighborhood pixel values. The median is obtained by arranging all the neighboring pixel values in ascending or descending order and then replacing the considered pixel with the value of middle pixel. An example of median filtering in a 3 × 3 window is shown in Fig. 3 [12]. In an image, median filtering is performed by moving a window over the image. The filtered output image is obtained by replacing the center value of the window by the median of the values in the input window. A non-linear median filter is employed instead of most commonly used linear filters as it preserves the sharp edges which gets blurred after passing through a linear filter [13]. The median filtered image is passed through wiener filter for an inverse filtering operation so that any image blurring which might happen during filtering can be reverted and image can be restored back. It also smoothens out any leftover noise still present in the image by averaging it out. It acts as a tradeoff between noise smoothing and inverse filtering. It performs removal of the additive noise and inversion of

596

N. Charaya et al.

Fig. 3 Median filtering example [14]

blurring simultaneously, while optimizing the mean square error involved in both of them [15]. For Wiener filtering, firstly the power spectra of the original image and the additive noise are estimated. The power spectrum of white additive noise is given by the variance of noise and that of original image is estimated as: per = S yy

 1  Y (k, l)Y (k, l)∗ 2 N

(1)

where Y (k, l) is the discrete Fourier transform of the observed signal. Another estimate which is utilized here and leads to a cascade implementation of the inverse filtering and the noise smoothing is: Sx x =

S yy − Sηη |H |2

(2)

The power spectrum S yy can be estimated using Eq. (1). This estimate results in a cascade implementation of inverse filtering and noise smoothing: per

W =

1 S yy − Sηη per H S yy

(3)

The restored image obtained as an output of wiener filter is a linear estimation of the original image and improved in terms of the visual performance [16].

Hybrid Filter for Dorsal Hand Vein Images

597

4 Simulation Results The proposed filter has been designed and simulated in MATLAB and tested over Bosphorus Hand database. The database comprises 642 subjects having 6 images for a person, which contains three right-hand images and three left-hand images. The images are of different illumination condition and different postures. Moreover, 276 subjects are with three left-hand images only. Among total 918 subjects, 160 poses hand images with time lapse of several months [17]. The simulation results obtained for a sample image are shown in Fig. 4. From the results, it is evident that the median filtered image is visually improved as compared to the original image. Though the median filtered image seems to be clearer and brighter, but is still contains some noise which needs to be averaged out. The same has been accomplished in layered-filtered image. This image can now be further passed through next steps of the authentication system. After noise removal, the image is enhanced by the process of histogram equalization. Simulation result of the same is shown in Fig. 5a.

Fig. 4 Simulation results a original image, b median filtered image, c layered filtered image

Fig. 5 a Enhanced image, b binarized image

598

N. Charaya et al.

The final binarized image obtained as a result of Otsu’s thresholding applied over region of interest is shown in Fig. 5b. This image can now be processing ready for feature extraction and further matching process.

5 Conclusion The proposed-layered filter is simple, effective and time efficient for noise removal in dorsal hand vein images. The images obtained by this technique are used and tested for authentication after following further steps of vein recognition. The system developed in this way is accurate and does not suffer with failure to capture error which may degrade the overall performance of the system. An overall accuracy of 99.18% has been obtained for these filtered images by using suitable feature extraction technique and matching algorithm.

References 1. C.H. Kalyani, Various biometric authentication techniques: a review. J. Biom. Biostat. 08(05) (2017) 2. K. Dharavath, F.A. Talukdar, R.H. Laskar, Study on biometric authentication systems, challenges and future trends: a review, in 2013 IEEE International Conference on Computational Intelligence and Computing Research (2013) 3. A.K. Jain, P. Flynn, A.A. Ross, Handbook of Biometrics (2007) 4. D. Zhong, H. Shao, S. Liu, Towards application of dorsal hand vein recognition under uncontrolled environment based on biometric graph matching. IET Biom. 8(2), 159–167 (2019) 5. S.J. Chuang, Vein recognition based on minutiae features in the dorsal venous network of the hand. Signal Image Video Process. 12(3), 573–581 (2018) 6. D. Huang, X. Zhu, Y. Wang, D. Zhang, Dorsal hand vein recognition via hierarchical combination of texture and shape clues. Neurocomputing 214, 815–828 (2016) 7. S. Zhao, Y. Wang, Y. Wang, Extracting hand vein patterns from low-quality images: a new biometric technique using low-cost devices, in Proceedings 4th International Conference on Image and Graphics. ICIG 2007, pp. 667–671 (2007) 8. Sanchit, M. Ramalho, P.L. Correia, L.D. Soares, Biometric identification through palm and dorsal hand vein patterns, in EUROCON 2011—International Conference on Computer as a Tool—Joint with Conftele 2011 (2011) 9. X. Li, X. Liu, Z. Liu, A dorsal hand vein pattern recognition algorithm, in Proceedings of 2010 3rd International Congress on Image and Signal Processing. CISP 2010, vol. 4, no. 2, pp. 1723–1726 (2010) 10. Y. Wang, X. Zheng, Cross-device hand vein recognition based on improved SIFT. Int. J. Wavelets Multiresolut. Inf. Process. 16(2), 1–17 (2018) 11. M.M.S. Ibrahim, F.S.M. Al Naimy, L. Rajaji, S.S. Amma, Biometric recognition for safe transaction using vein authentication system, in IET Conference Publications, vol. 2012, no. 624 CP, pp. 77–82 (2012) 12. H.M. Wechsler, Digital Image Processing, 2nd edn., in Proceedings of the IEEE, vol. 69, no. 
9. pp. 1174–1175 (2008) 13. ˙I. Dinç, M.L. Pusey, DT-Binarize Multidimensional Systems: Signal Processing and Modeling Techniques (2015)

Hybrid Filter for Dorsal Hand Vein Images

599

14. Spatial Filters—Median Filter. [Online]. Available: https://homepages.inf.ed.ac.uk/rbf/HIPR2/ median.htm. Accessed 16 Dec 2020 15. X. Xαμζας, ´ Eιδικα´ Kεϕαλαια ´ Eπεξεργασ´ιας ηματων ´ Aυτoπρoσαρμoζ´oμενα ´ιλτρα Wiener filter, no. 145, pp. 1–64 (2014) 16. Wiener Filtering. [Online]. Available: https://www.owlnet.rice.edu/~elec539/Projects99/ BACH/proj2/wiener.html. Accessed 16 Dec 2020 17. Bosphorus Hand Geometry Database and Hand-Vein Database | Bifrost Data Search. [Online]. Available: https://datasets.bifrost.ai/info/1567. Accessed 16 Dec 2020

Satellite Image Enhancement and Restoration Using RLS Adaptive Filter Rekh Ram Janghel, Saroj Kumar Pandey, Aayush Jain, Aditi Gupta, and Avishi Bansal

Abstract The main objective of this paper is to find and analyze the performance of various image restoration techniques to enhance the retrieved satellite image. Satellite pictures in course of being caught and communicated are corrupted because of station influence. These impacts contain noise in the form of additive White Gaussian noise, salt and pepper noise and mixed noise. Subsequently, acquired images are exceptionally noisy tainted with the fact that the picture substance is weakened or strengthened. Here, recursive least square adaptive algorithm is used for performing image restoration from noise-corrupted satellite images. The exhibition is estimated by methods for human visual system, quantitative measures as far as MSE, RMSE, SNR and PSNR. The RLS filter gave MSE value of 385.86, RMSE value of 19.64, SNR value of 2.42 and PSNR value of 22.25. When we compare RLS filter performance with NLMS filter (MSE value of 444.29, RMSE value of 21.08 and PSNR value of 21.65) and SSLMS filter (MSE value of 5052.99, RMSE value of 71.08 and PSNR value of 11.09), we observe that the MSE value of RLS filter is the least, and PSNR value of RLS filter is the highest. Hence, the RLS adaptive algorithm is preferred. Keywords Salt and pepper noise (SPN) · Normalized least mean squares filter (NLMS) · Recursive least square (RLS) · Additive White Gaussian noise (AWGN) · Human visual system (HVS) · State space least mean squares filter (SSLMS)

1 Introduction Image restoration and filtering is a basic domain of cutting edge technology of processing digital images that is used to restore the undermined or deformed image substance [1, 2]. Image enhancement is the system of improving the quality and the information substance of interesting data before getting ready. R. R. Janghel (B) · S. K. Pandey · A. Jain · A. Gupta · A. Bansal National Institute of Technology, Raipur, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_49

601

602

R. R. Janghel et al.

Image denoising is done to dispose of noise by securing significant characteristics and features of the image and to get extraordinary quality images [3]. It is important to choose the ideal restoration filter and show a key employment for image denoising. Satellite images when taken and sent in non-wired stations are ordinarily tainted due to boisterous station impacts [4]. The images are degraded by noise in the channel and unpredictable barometrical aggravation [5, 6]. In this way, the contents of the channel are either debilitated or upgraded during transmission. In wireless stations, different noise plans, for instance additive White Gaussian noise, impulse noise and mixed noise, exist and deform the satellite images [1, 7]. Thus, the recuperated images are uncommonly boisterous considering the way that the substance of the yield image is either diminished or upgraded. Preceding the usage of these images in different applications, the noise must be removed. While restoring image, there is certainly a trade-off between noise removal and keeping up clear image features [2]. If images and noise substance both are open, it ends up being definitely not hard to design a channel to ameliorate SNR. Exactly when the substance is not completely recovered, and a short time later required sensible models for evaluating the authentic information. Generally, the models are wrong or time fluctuating [6]. Henceforth, use of flexible channels ought to fulfill the particular necessities. The adaptive filtering is one of most sensible techniques particularly for noise disavowal in image and signal managing tasks [8]. The presentation of adaptable channels appears to act ordinarily changed, in this way extensively utilized. A definitive objective is to control the adaptable channel limits for accomplishing least mean square slip-up between the channel yield and needed signal [9, 10]. 
In this paper, we have worked on a RLS adaptable figuring for ideal restoration of satellite images. The RLS estimation iteratively minimizes the gradient square of needed assessed screw up signal. The proposed technique is executed by arranging the RLS estimation with system identification approach for noise assessment and sometime later signal enhancement for noise clearing. The estimations for image restoration, further denoising and improvement are re-enacted in the python stage. At last, the presentation is examined by strategies for faithfulness standard, for example, human visual system (HVS) and quantitative and graphical measures. We will presumably accomplish improved evaluation of signal-to-noise ratio (SNR) and peak signal-to-noise ratio (PSNR). Further, Sect. 2 contains the literature review, followed by methodology and proposed methods in Sect. 3. After that, Sect. 4 contains the result and discussion, and finally, conclusion is presented in Sect. 6.

2 Literature Review Earlier linear filters were preferred for signal and image processing as they were simpler to design and implement. However, these filters performed poorly when tested with additive noise and also for systems with non-Gaussian statistics or nonlinearities. These filters caused blurred edges in images and did not remove impulsive

Satellite Image Enhancement and Restoration Using RLS Adaptive …

603

noise properly [1]. Consequently, nonlinear channels were actualized. One of the prominent nonlinear filters that we have is the median filter (MED). It is capable as far as computing is concerned, yet yields results which are distorted and pixelated. The method proposed by Huang et al. has used a 2-D median filter which depends upon organizing and reviving gray-level histogram for the image segments in the frame [11]. Along these lines, a quick real-time computational algorithm has been introduced for filter signals and images using median filters [12]. This approach used noise filtering based on their local variance and mean and has been suggested for both additive and multiplicative cases. Irrespective of that, this approach was not suitable for cleaning out impulse noise. Kundu et al. also suggested a separate approach for the propagation of impulse noise from images using a mean channel [13]. This filtering plan focuses on replacing the central pixel value with the general mean of all pixels within the sliding window. This sort of channel is also not ideal for dispensing with impulsive and non-impulsive noises simultaneously. All through the accompanying very few years, researchers have made and actualized a variety of all around arranged methods for image restoration and sound removal. Adaptable filtering is maybe the most appropriate sound clearing out features for the application of images and signal. The functionality of adaptable channels is shown to act naturally changing, so it is generally used. An authoritative goal is to control flexible channel parameters to achieve the base slip-up between the channel yield and the ideal signal [9, 10]. Another channel called signal adaptable median channel was proposed which performed and gave better outcomes than other nonlinear adaptive channel filters for noises of various kinds [14]. 
The adaptable average filter which was proposed here displays low efficiency in the case of impulsive noise and does not eliminate noise near to the edges [15]. The filtering plan suggested could not cover the noise which is impulsive in nature, satisfactorily, however could save the edge noise efficiently as compared to the mean filter [16]. Adaptive median filters have also been suggested for disposing of the impulsive noise and securing the sharpness of the images [17]. The method proposed by Lal is a novel remote sensing image restoration technique. The technique mainly concentrates on the parameter estimation which is used to estimate the point spread function (PSF) [18]. From the estimated PSF, the image is restored using Wiener filter [18]. VamsiKrishna proposed a method which uses bilateral and guided filters for removal of noise. These filters smoothen the images while preserving edges. In bilateral filtering, each pixel is replaced by the weighted average of its neighbor pixels, but in guided filters, the behavior at edges will be more efficient than the bilateral filter [19]. For this paper, we are proposing the use of the RLS adaptive algorithm for the retrieval and enhancement of satellite imagery optimally. This algorithm delivers the immediate square decrease of the signal line of the deliberate blunder of the ideal mistake. Our proposed method is utilized by setting up a RLS algorithm with a sound detection system and afterward signal stabilization for noise removal. Finally, execution was evaluated by unwavering quality methods, specifically the human visual system (HVS), estimation and illustrative measures and parameters.

604

R. R. Janghel et al.

Our objective is to accomplish enhanced scores for signal-to-noise ratio (SNR) and peak signal-to-noise ratio (PSNR).

3 Methodology Channel noise is a kind of extra focus between all of them. Any shortfall in the AWGN channel leads to a direct increase in the constant strong white spectral sound and the distribution of Gaussian [1]. The order of the noise level at the post office indicates that the SNR should be better than ≥20 dB and therefore the interruption level is low. In addition, the image signal is lowered with drive chaos and clutter combined behind the edge of the receiver. Combined interruption is a combination of SPN as well as AWGN [1], which accurately reflects image retrieval, its sound removal and enhancement structure. Figure 1 shows the flow of proposed methodology.

3.1 About Adaptive Filters This is used when the given determinations are not seen or not completed by the particular time filters. These adaptive filters can follow varieties in the image signal

Fig. 1 Proposed methodology using RLS algorithm

Satellite Image Enhancement and Restoration Using RLS Adaptive …

605

Fig. 2 General configuration of adaptive filter

or parameters of the time-changing system to be in norm with the particular execution factors. It endeavors to exhibit the yield signal for relationship with its data signal iteratively in real time [9, 10]. Its configuration is shown in Fig. 2. The display is improved with each iteration automatically as the filter coefficients and impulse response of discrete iteration input are changed by methods for algorithm. The adequacy of these filters depends upon plan systems and change algorithms.

3.2 RLS—Recursive Least Square Adaptive Filter In case the range of eigenvalues is high for the correlation matrix of an input signal, the algorithm is seen to be following a fast convergence rate. The performance results shown by this algorithm are exceptional when seen in the time variable conditions. The purpose of the RLS algorithm is to select the filter coefficients such that the output signal will fit the signal you want at least a square measure. These benefits are derived from the rising cost of computers and other instability [8, 10]. It is executed by handling initial conditions and thereafter reviving the earlier check subject to knowledge preserved in the new data tests [10]. x(k) = [x(k)x(k − 1) . . . x(k − N )]T

(1)

Function for LS algorithm: ξ d (k) =

k 

λk−i ε2 (i)

(2)

 2 λk−i d(i) − x T (i)w(k)

(3)

i=0

ξ d (k) =

k  i=0

606

R. R. Janghel et al.

where w(k) = [w0 (k)w1 (k) . . . w N (k)]T is the filter coefficient ε(i) = a-posteriori output error and λ = forgetting factor, ranges from 0 to λ ≤ 1. Filter coefficient w(k): w(k) = R −1 D (k)PD (k)

(4)

R −1 D (k) = inverse correlation matrix of input signal and PD (k) = deterministic cross correlation matrix among input and desired signal. Computational Initialization of RLS Algorithm S D (−1) = I where δ should be the inverse estimate of input signal.  T PD (−1) = x(−1) = 0 0 . . . 0 D0 for k ≥ 0   1 S D (k − 1)x(k)x T (k)S D (k − 1) S D (k) = S D (k − 1) − λ λ + x T (k)S D (k − 1)x(k) PD (k) = PD (k − 1) + d(k)x(k) w(k) = S D (k)PD (k) If required, then evaluate y(k) = w T (k)x(k) ε(k) = d(k) − y(k)

3.3 Image Denoising and Channel Estimation Channel assessment is a method of depicting the effects of a physical channel on a data system. Right when a channel needs to work consistently or straightforwardly, the channel estimates or balances the system response. An incredible channel rating is one that satisfies the standards for screw up lessening [20]. In this paper, our goal is to achieve least mean square bumble (MMSE) by discarding noise impacts using the above algorithm. For proposed methodology, refer Fig. 3.

Satellite Image Enhancement and Restoration Using RLS Adaptive …

607

Fig. 3 Proposed methodology for image restoration

3.3.1

System Identification

We have to measure the channel plans which arrange with the models that were used as redirect noise in the ideal image signal d(k). The evaluation of these models is cultivated by masterminding system identification with the above algorithm iteratively as shown in Fig. 3. This method is used for exhibiting a dark system. A comparative data image x(k) is invigorated from both the sides—dark system and adaptive algorithm. Final image of a dark system is shown by d(k), and n1(k) speaks to the saw channel noise. The ideal signal at the yield of the dark system gets degraded because of this. d(k) = x(k) + n 1 (k)

(5)

e(k) = d(k) − y(k)

(6)

The adaptive filter output is represented by y(k), and error signal is represented as the difference between d(k) and y(k) and is represented as in Eq. 6. w(k) represents the filter coefficients [5].

3.3.2

Signal Enhancement

Image denoising is finished by methods for signal enhancement. The noise designs that were available are productively deleted from wanted image signals by coordinating with assessed noise patterns [21, 22]. The ideal image signal is sullied with

608

R. R. Janghel et al.

Gaussian noise spoke to. The assessed noise e(k) is taken as the contribution. As appeared in Fig. 7, e(k) = x(k) + n 1 (k) − y(k)

(7)

Giving a system yield analysis to the adaptive filter and a short time later changing the filter by using a RLS algorithm to get the most un-square measure is the standard limit of this configuration [10]. The equation to calculate MSE is given in below Eq. 8   ξmin = E x 2 (k)

(8)

4 Experimental Results Dataset: We have utilized an image of Solway Firth (Fig. 4) caught in October 2019 by the Operational Land Imager on the Landsat 8 satellite for the examination. On that day, the waters along the shoreline of Dumfries and Galloway, Scotland and Cumbria, England, were rich with silt and disintegrated natural issue (plant flotsam and jetsam, soils, plankton) that was likely worked up by the tides. The water changes tone unexpectedly seaward where the shallower sound meets further waters of the Irish Sea. In this paper, the proposed method for restoring the image, then removing noise and finally enhancing it is tested and implemented in Python 3.8 using the original colored satellite image from NASA (Fig. 5I) and then converting it to grayscale using OpenCV methods. Fig. 4 Original colored satellite image captured by NASA Landsat 8 [27]

Satellite Image Enhancement and Restoration Using RLS Adaptive …

609

Fig. 5 Images after applying different types of filter. I Original image. II Noise corrupted image. III Restored image using RLS adaptive filter. IV Image after applying median filter. V Image after applying average filter. VI Image after applying Laplacian filter

A.

Restoring Image from Additive White Gaussian Noise (AWGN)

Figure 5I speaks to the information-based satellite image used as input. This image is distorted by adding additive White Gaussian noise with multiple times more noise power as appeared in Fig. 5II. B.

Image Denoising and Enhancement

In the wake of getting extraordinary restoration results (Fig. 5III shows the RLSfiltered image), these satellite images are used for next handling and improving using image getting ready (IP) strategies. In this cycle, impulse noise and mixed noise are slaughtered by using spatial filtering algorithms.

610

R. R. Janghel et al.

• Median filter—The pixels obtained in output are determined by median of adjacent pixels. The output is then used as input to the median filter, which effectively disposes of SNP noise while securing sharp edges of image [6]. Figure 5IV shows the median-filtered image. • Laplacian filter—It works on the edges and other borderlines. Enhances image by providing minute features and sharpness. Figure 5VI shows the Laplaciansharpened image. • Averaging filter—Its purpose is noise abatement and smoothening of clear borderlines [21, 23]. It is a low-pass filter, it changes the value of every pixel either to the average, weighted mean, intensity of nearest neighbor. Due to this, the noise thickness is decreased [6]. Figure 5V shows the normal filtered image [24, 25]. The resultant satellite images ensuing to denoising the cycle show fitting visual quality. • Histogram equalization—It is used to refine abysmal intensity dispersals and is used for sharpening (to achieve uniform power assignment). C.

Evaluation of Performance

Quantitative analysis for satellite image by using mean squared error (MSE), root mean square error (RMSE), signal-to-noise ratio (SNR) and peak signal-to-noise ratio (PSNR) parameters. • MSE—It is the sum of squares of error. M N 2 1   f (x, y) − fˆ(x, y) M N x=1 y=1

MSE =

(9)

where f (x, y) is an original image, fˆ(x, y) be the restored image, x and y describe the discrete coordinates of the image, and M × N is the size of image. • RMSE—It is the square root of MSE and is given by ⎡ erms = ⎣

M−1 N −1 

2

⎤ 21

fˆ(x, y) − f (x, y) ⎦

(10)

x=0 y=0

• SNR—It is given by

M−1 N −1 SNRrms =

x=0

M−1 N −1 x=0

y=0



y=0

fˆ(x, y)2

fˆ(x, y) − f (x, y)

2

(11)

• PSNR—It is given by  PSNR = 10 log10

2552 MSE

 (12)

Satellite Image Enhancement and Restoration Using RLS Adaptive …

611

The graphical model visualizes the data-driven gist to determine the most important characteristics of the images in various stages. Figure 6 shows the graphical analysis with respect to mean square error. It compares the input image and output image for each period of the proposed plot. MSE also portrays the slip-up size against each pixel, which is restricted profitably during denoising and restoration cycle fairly [26] (Table 1).

Fig. 6 Analyzing images graphically—restoration, denoising and enhancement. I MSE of corrupted image wrt original image. II MSE of median filtered image wrt original image. III MSE of average image wrt original image. IV MSE of Laplacian image wrt original image. V MSE of corrupted histogram equalized image wrt original image

612

R. R. Janghel et al.

Table 1 Quantitative analysis for satellite image enhancement and restoration Enhancement and restoration techniques

Quantitative parameters MSE

RMSE

SNR

PSNR

RLS-restored image

385.86

19.64

2.42

22.25

Median-filtered image

397.69

20.23

2.49

21.06

Average filtered image

524.04

22.86

2.56

20.93

9399.12

95.57

0.00064

Laplacian-filtered image

Table 2 Comparing proposed method with previous method for satellite image

Filters used

MSE

RMSE

8.39

PSNR

RLS filter

385.86

19.64

22.25

Median filter

397.69

20.23

21.06

Average filter

524.04

22.86

20.93

Laplacian filter

9399.12

95.57

8.39

NLMS filter

444.29

21.08

21.65

SSLMS filter

5052.99

71.08

11.09

5 Discussion Performance Comparison with Other Filtering Techniques The introduction extents of average filter and median filter for removal of noises from images captured from satellite are showcased [19]. These images are deteriorated with various types of noises, for instance, Gaussian, salt and pepper and speckle noise and achieving filtration results. For execution, data-based satellite images are used. In this paper, PSNR is the solitary measure allocated for execution measurement of mentioned denoising filter. Finally, the experimental results and execution assessment show that the RLS algorithm is proven to give better filtration results on satellite images and greater values of PSNR. This relationship can be found in Table 2 similarly as in Fig. 7.

6 Conclusion In the above methodology, we have used the RLS adaptive algorithm for removing White Gaussian noise from satellite images. The adaptive cycles can detect abnormalities and noises acquired in the distant channel and to dispense with them successfully. The satellite images are used for execution of the RLS algorithm, and then, MSE, RMSE, SNR and PSNR for the RLS adaptive restoration and for other image handling methods are prepared. The lower the value of MSE and RMSE, the better the result.

Satellite Image Enhancement and Restoration Using RLS Adaptive …

613

Fig. 7 Graph plots comparing MSE, RMSE, PSNR values for different filters. I MSE values for different filters. II RMSE values for different filters. III PSNR values for different filters

The higher the value of SNR and PSNR, the better the result. The Python 3.8 reenactment results validated that the proposed methodology is profitable in disposing of noise from impacted satellite images and the obtained image is better without affecting the introduction measures altogether. Accordingly, it is presumed that the RLS variety for image restoration is more capable and gives better results than using conventional filters for the finish of AWGN. As needed, the obtained images in the wake of denoising and improvement show better visual effects and retain the sharp edges. The RLS adaptive algorithms have seemed to achieve the improved value of SNR and reduced value of MSE at higher speed of blend anyway with the trade-off with the colossal computational multifaceted nature and memory need.

References 1. S. Meher, Development of some novel nonlinear and adaptive digital image filters for efficient noise suppression. Ph.D., Department of Electronics & Instrumentation Engineering, NIT, Rourkela, Orissa, India, 2004 2. M.J.M. Parmar, Performance evaluation and comparison of modified denoising method and local adaptive wavelet image denoising method, in International Conference on Intelligent Systems and Signal Processing (ISSP) (2013), pp. 101–105 3. N.S. Sandeep Kaur, Image denoising techniques: a review, Int. J. Innov. Res. Comput. Commun. Eng. 2, 4578–4583 (2014) 4. J.P. Hema Jagadish, Approach for denoising remotely sensed images using DWT based homomorphic filtering techniques. Int. J. Emerg. Trends Technol. Comput. Sci. (IJETTCS) 3, 90–96 (2014)

614

R. R. Janghel et al.


Efficient Recommendation System Using Latent Semantic Analysis Rahul Budhraj, Pooja Kherwa, Shreyans Sharma, and Sakshi Gill

Abstract During the last decade, recommender systems have become a significant part of our lives, and we constantly depend on them to provide relevant and interesting results, be it "videos recommended for you" on YouTube or "movies recommended for you" on IMDb. There are many approaches based on the latest machine learning algorithms for finding recommendations, but the efficiency of these algorithms is still a big research question. In this paper, the authors propose an efficient recommendation system using latent semantic analysis approaches. The authors improve the accuracy and efficiency of the basic approaches, i.e., content-based and collaborative filtering-based systems, by using singular value decomposition (SVD) and its variant SVD++, and also perform a detailed experiment using the MovieLens dataset, showing significant improvement over typical K-Nearest Neighbor (KNN)-based recommender systems in both accuracy metrics, i.e., mean absolute error and root mean square error.

Keywords Singular value decomposition · Singular value decomposition++ · Content-based system · Collaborative filtering · K-nearest neighbors

1 Introduction

From entertainment to e-commerce, recommendation systems have made an indispensable place for themselves. They can be summarized as a model that gathers related information from a pool of information and presents it in a useful manner. They try to instill in the software/computer system a human-like behavior to advise its user. In short, a recommender system tries to influence its user's sense of judgment via suggestions. These suggestions are supposed to be as close as possible to the user's preferences. Hence, recommender systems can be called a kind of information filtering system, or an information agent [1], which attempts to judge the

R. Budhraj (B) · P. Kherwa · S. Sharma · S. Gill Computer Science and Engineering Department, Maharaja Surajmal Institute of Technology, GGSIPU, New Delhi, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_50


preferences of a person and suggest items, such as movies to watch, text to read, products to buy, etc., based on these judgments. Given their functionality, recommender systems are an important application of artificial intelligence, and their range of applications is expanding rapidly. They are being used in almost every online platform that people use these days. The content these websites offer ranges from movies on Netflix, videos on YouTube, and creative content on Pinterest, to opportunities to make friends and new connections on social media platforms, such as LinkedIn, Instagram, etc. In fact, a key reason for the success of these platforms is the right implementation of recommender systems. Now, the real question is how these systems carry out their roles on these platforms. Often, recommender systems collect user information, which can be their past choices or subjects they might have shown interest in at any point of time in the past. This information is then used to improve the suggestions made by the system in the future. For example, Instagram, a social media application, monitors users' interests and their interactions with various photos or videos on their feed. This helps gather information about a user and then suggest more content based on this record. Another preferred approach is to provide suggestions based on the activities of a mass of people. For example, on an e-commerce site, if many people purchase a wireless mouse along with a laptop, the site will suggest to users who are searching for a laptop that they buy a mouse as well. There are three basic approaches used for the generation of recommender systems: collaborative filtering, content-based, and knowledge-based. Collaborative filtering techniques consider the past interaction between the users and items as the most important factor in producing new, relatable suggestions for the users.
It monitors a group of like-minded people with similar rating histories and generates results based on this information [1, 2]. The content-based approach, as the name suggests, finds and suggests content to a user that is similar to his past preferences, or somehow linked to them [3, 4]. Another method is the knowledge-based method, where knowledge about items and users is used for recommendations. It does not take into account personal tastes or the ratings of items; hence, it avoids many drawbacks faced by the other two, more traditional techniques [5]. In practical implementation, each approach suffers from some obstruction or other. Hence, adopting a hybrid approach toward achieving a recommender system has become more common these days, which involves combining two or more traditional methods to overcome the drawbacks they face individually [3]. For example, in [6], Gipp et al. developed a hybrid approach to create a recommender system for research papers, called Scienstein. Their proposed method used four different approaches, which came together to cover each other's disadvantages and formed a stronger system. The motivation behind this research work was the need to eliminate the drawbacks of the basic techniques. Traditional approaches like the collaborative filtering and content-based methods create problems, such as scalability and sparsity issues. Content-based methods make recommendations based on the existing interests of the user and require a lot of domain knowledge, which makes the model only as good as the hand-engineered features. In simpler words, the objective of this paper is to improve the efficacy of the elementary techniques of a recommender system, i.e.,


content-based and collaborative filtering approaches, by using latent semantic analysis (LSA) applications, i.e., singular value decomposition (SVD) and its variants, which are essentially dimensionality reduction techniques. Dimensionality reduction is a pre-processing step, which aims at achieving the intrinsic dimensionality of a dataset by reducing redundant and useless information. This paper involves performing an experimental analysis on a large movie dataset. The next section, i.e., Sect. 2, highlights the related work done in this field. Section 3 describes the methodology adopted for this research paper. Section 4 contains information about the experimental analysis done for the paper, which is followed by Sect. 5, which summarizes the results of the analysis. Section 6 highlights the conclusion and the future scope of this wide topic, and this section is followed by the references used for the paper.

2 Related Work

Recommender systems are a widely explored area, mainly due to the benefits they offer. The results generated by a system largely depend on the machine learning algorithms used. For example, Li et al. [7] implemented various machine learning techniques for a movie recommendation system, which are used for clustering, such as random clustering, hierarchical clustering, density-based clustering (DBSCAN), and affinity propagation; DBSCAN outperformed the other techniques by generating the best results. In [8], Portugal et al. concluded that Bayesian and decision tree algorithms are the most used, due to their less complex implementation. With the increase in online entertainment platforms, research on video/movie recommendation is being done on a larger scale. Lu et al. [9] discussed the event when the popular online platform Netflix called on people to defeat its own recommender system, Cinematch, and how it ended up with a better recommender system along with gained popularity. Jahrer et al. [10] also worked out a way to produce better results on the same Netflix dataset by linearly combining a series of collaborative filtering algorithms; this combination outperformed any single collaborative filtering algorithm. In [11], Basu et al. explored the idea of movie recommendations and suggested a better approach that used both content information and ratings. Their approach reframed the artifact rating problem as a classification problem, which produced better results. Many researchers have been putting in efforts to improve the traditional approaches by developing new hybrid approaches or working on new ideas; for example, Adomavicius and Tuzhilin introduced three new algorithmic paradigms for context-aware recommender systems, a rather contemporary form of these systems [12]. These paradigms are contextual pre-filtering, post-filtering, and modeling, and they focus on integrating contextual information into the recommendation process.
Asanov gave CinemaScreen, a recommendation agent, as an example of a hybrid filtering-based recommender system, which suggests movies being shown in cinemas [13]. He also discussed various context-aware, semantic-based, and cross-domain approaches.


Data mining techniques such as regression, neural networks, clustering, and K-Nearest Neighbors are also being exploited by many researchers to improve the efficiency of recommender systems. Park et al. [14] explained these techniques briefly and analyzed the distribution of articles using them, finding that KNN, which is a type of collaborative filtering method, was the most used in the recommendation field. Context clustering, a relatively new approach, is still in its initial phases and is being explored by many people. Kannout [15] proposed a recommender system based on context clustering. In fact, he worked out two versions of this system, which used rating-based clustering and user-based clustering, respectively. His experimental analysis also proved that his approaches outperformed the conventional collaborative filtering technique. Researchers are also working out ways to develop metrics for evaluating the performance of these recommender systems. For example, in [16], Gaudioso et al. discussed various metrics, such as accuracy metrics, rank metrics, etc., and gave a new approach that considers the followability along with the usefulness of the recommendations. All these procedures have some drawbacks, which is understandable given the young age of this concept. Hence, we try to contribute our efforts to improve the accuracy of the conventional approaches by making use of latent semantic analysis.

3 Methodology

In order to achieve an efficient recommender system, we first analyze the system using the basic approaches, i.e., collaborative filtering and content-based filtering. Then, in order to reduce the error metrics, we propose an architecture that makes use of latent semantic analysis, a dimension reduction technique, and KNN, a machine learning algorithm, which is used to generate the predictions. To achieve our objective, i.e., to improve the prediction accuracy of the system, we pre-process the data by using variants of LSA, i.e., SVD and SVD++, or in other words, matrix factorization. The proposed architecture is shown in Fig. 1.

Fig. 1 Proposed network for an efficient recommender system: MovieLens 100k dataset → preprocessing with SVD and SVD++ → generating predictions with KNN


3.1 Traditional Recommender System Techniques

3.1.1 Collaborative Filtering

Collaborative filtering is the algorithm for filtering items from user opinions, like their likes or dislikes, and then recommending those items to users with similar opinions [17]. For example, if user A and user B rate movies M1 and M2 similarly, and user A then rates another movie M3 highly, it will be recommended to user B. Another simple example is that of three friends discussing their opinions on books, movies, restaurants, etc. After some time, Parul understands that Shubham recommends the things she mostly likes and Sahil recommends almost everything she ends up hating. So, with time, Parul is able to learn whose opinions she can rely on to determine the quality of an item [17]. This is what the collaborative filtering algorithm has the potential to do. The main benefit of this algorithm is that instead of referring to just a small group of opinions, we can use hundreds and thousands of people's opinions to recommend better and more efficiently to new users. Collaborative filtering has two types:

User-Based Nearest Neighbor Algorithm

In this type, the predictions and filtering are done based on the similarity of users near each other. If a user u1 is similar to user u2 in terms of opinions, we say that u1 is a neighbor of u2. Here, the algorithm produces a prediction for an item i1 by analyzing the ratings for i1 from users in u1's neighborhood. The following formulation is used for the predictions:

$$\mathrm{pred}(u, i) = \frac{\sum_{n \in \mathrm{neighbors}(u)} r_{ni}}{\text{number of neighbors}} \tag{1}$$

But this simple average treats every neighbor alike and ignores how strongly each neighbor agrees with the target user, so the final formulation weights neighbors using the Pearson correlation similarity [18], which is:

$$\mathrm{sim}_{ik} = \mathrm{corr}_{ik} = \frac{\sum_{j}\left(r_{ij}-\bar{r}_{i}\right)\left(r_{kj}-\bar{r}_{k}\right)}{\sqrt{\sum_{j}\left(r_{ij}-\bar{r}_{i}\right)^{2}}\,\sqrt{\sum_{j}\left(r_{kj}-\bar{r}_{k}\right)^{2}}} \tag{2}$$
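As a small sketch (not the paper's implementation), the user-based scheme of Eqs. (1)–(2) can be illustrated on a toy rating matrix: the simple average of Eq. (1) is weighted by the Pearson similarity of Eq. (2). The matrix, the neighbor count k, and the helper names are illustrative assumptions:

```python
import numpy as np

def pearson_sim(ri, rk):
    """Pearson correlation of two users' rating vectors (Eq. 2),
    computed over the items both users have rated (nonzero entries)."""
    mask = (ri > 0) & (rk > 0)
    if mask.sum() < 2:
        return 0.0
    a = ri[mask] - ri[mask].mean()
    b = rk[mask] - rk[mask].mean()
    denom = np.sqrt((a ** 2).sum()) * np.sqrt((b ** 2).sum())
    return float((a * b).sum() / denom) if denom else 0.0

def predict(ratings, u, i, k=2):
    """Predict user u's rating of item i from the k most similar
    positively correlated neighbors who rated i (weighted Eq. 1)."""
    sims = [(pearson_sim(ratings[u], ratings[n]), n)
            for n in range(len(ratings)) if n != u and ratings[n, i] > 0]
    top = [(s, n) for s, n in sorted(sims, reverse=True)[:k] if s > 0]
    den = sum(s for s, _ in top)
    return sum(s * ratings[n, i] for s, n in top) / den if den else 0.0

# Toy matrix: rows = users, columns = items, 0 = unrated.
R = np.array([[5, 4, 0, 1],
              [4, 5, 5, 1],
              [1, 2, 1, 5]], dtype=float)
print(round(predict(R, 0, 2), 2))  # user 1 is the only positive neighbor -> 5.0
```

Filtering out negatively correlated neighbors is one common design choice; a full system would instead subtract each neighbor's mean rating before weighting.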

Item-Based Nearest Neighbor Algorithm

In this algorithm type, the similarity between different items is computed for a common user. For example, i1 and i2 are items whose similarity we want to calculate, both rated by a common user u1. Expressions like the Pearson correlation or cosine/vector similarity can be used here for the similarity computation [18]. The cosine/vector similarity of items ih and ik is expressed by:

$$\mathrm{sim}_{hk} = \cos_{hk} = \frac{\sum_{i} r_{ih}\, r_{ik}}{\sqrt{\sum_{i} r_{ih}^{2}}\,\sqrt{\sum_{i} r_{ik}^{2}}} \tag{3}$$

Now, if the items are found to be nearly equal in the similarity of features/ratings, item i2 would be recommended to the user.
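The cosine computation of Eq. (3) is a few lines of NumPy; the following sketch (the toy matrix and function name are assumptions, not from the paper) applies it to two item columns of a small user-item matrix:

```python
import numpy as np

def item_cosine(ratings, h, k):
    """Cosine/vector similarity of item columns h and k (Eq. 3)."""
    rh, rk = ratings[:, h], ratings[:, k]
    denom = np.linalg.norm(rh) * np.linalg.norm(rk)
    return float(rh @ rk / denom) if denom else 0.0

# Rows = users, columns = items.
R = np.array([[5, 4, 1],
              [4, 5, 1],
              [1, 1, 5]], dtype=float)
print(round(item_cosine(R, 0, 1), 3))  # items 0 and 1 are rated alike -> 0.976
```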

3.1.2 Content-Based Filtering

Content-based filtering works on the basis of the user's preferences for items, recommending appropriate new items using that information. Before collaborative filtering emerged in the 1990s, earlier research was based on this algorithm. Content-based recommendation is directly related to supervised machine learning: the algorithm runs on the given database of user opinions (ratings) and recommends new items (results) based on it. There are many uses of content-based filtering; for instance, it can be used to filter videos on streaming websites according to previous likes/dislikes given by the user. Also, products with similar ratings on e-commerce websites are filtered using this algorithm and shown to the user for better recommendations, where possible. It can also remove redundant data for the user according to his specific low ratings or dislikes. In addition, content-based recommender systems can recommend unrated or new items, which is a drawback in collaborative filtering [19] (Fig. 2). But one of the main issues with content-based filtering is feature quality [20]. The items to be recommended should be well described, to allow meaningful learning of user preferences. Content-based filtering also has several shortcomings, such as limited content analysis, over-specialization (the serendipity problem), and recommendations for new users [19]. The similarity between items is required even in this algorithm. Here, the feature similarity is derived by computing the cosine similarity of the user profiles and the objects/items. Cosine similarity is widely used in recommendation filtering algorithms, as it is an efficient formulation. It can be stated as:

$$\mathrm{sim}(d_i, d_h) = \frac{\sum_{k} w_{ki}\, w_{kh}}{\sqrt{\sum_{k} w_{ki}^{2}}\,\sqrt{\sum_{k} w_{kh}^{2}}} \tag{4}$$


Fig. 2 High-level architecture of a content-based recommender

3.2 Dimension Reduction Techniques

3.2.1 SVD (Singular Value Decomposition)

SVD is a technique that takes a linear approach to the dimensionality reduction of information. It makes use of matrix factorization; hence, it is sometimes also referred to as the matrix factorization method. It calculates the eigenvalues and eigenvectors of the covariance matrix [21], which helps in finding the projection axes on which the sum of squares of the projection errors is minimal. In this method, two matrices, for user factors and item factors, respectively, are generated from the initial sparse matrix. This is carried out by minimizing a loss function, which consists of a general bias term and bias terms for the item as well as the user. Hence, the equation is formulated as:

$$\min_{p,q,b} \sum_{u,i} \left(r_{ui} - \mu - b_u - b_i - p_u^{T} q_i\right)^{2} + \lambda\left(\|p_u\|^{2} + \|q_i\|^{2} + b_u^{2} + b_i^{2}\right) \tag{5}$$

It is one of the most used techniques in recommender systems, as it deals with the scalability and sparsity issues faced by collaborative filtering-based systems. Sarwar et al. [10] also showed that SVD-based predictions can achieve better accuracy than a collaborative filtering recommender system and that SVD has the potential to show better online performance when compared to a correlation-based system.
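Loss (5) is typically minimized by stochastic gradient descent over the observed ratings. The sketch below implements that procedure in plain NumPy rather than the Surprise library used later in the paper; the toy matrix, learning rate, factor count, and epoch count are arbitrary assumptions:

```python
import numpy as np

def svd_mf(ratings, n_factors=2, lr=0.01, reg=0.02, epochs=500, seed=0):
    """Biased matrix factorization trained by SGD on the loss of Eq. (5).
    `ratings` is a dense user x item matrix where 0 marks 'unrated'."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    P = rng.normal(0, 0.1, (n_users, n_factors))   # user factors p_u
    Q = rng.normal(0, 0.1, (n_items, n_factors))   # item factors q_i
    bu, bi = np.zeros(n_users), np.zeros(n_items)  # user/item biases
    mu = ratings[ratings > 0].mean()               # global mean rating
    observed = list(zip(*np.nonzero(ratings)))
    for _ in range(epochs):
        for u, i in observed:
            err = ratings[u, i] - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            # Simultaneous update: right-hand sides use the old P[u], Q[i].
            P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                          Q[i] + lr * (err * P[u] - reg * Q[i]))
    def predict(u, i):
        return float(mu + bu[u] + bi[i] + P[u] @ Q[i])
    return predict

R = np.array([[5, 4, 0, 1],
              [4, 5, 5, 1],
              [1, 1, 2, 5]], dtype=float)
predict = svd_mf(R)
print(round(predict(0, 0), 1))  # close to the observed rating of 5
```

After training, `predict` can also score the unobserved cells (e.g., user 0, item 2), which is exactly how the factorization fills in the sparse matrix.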

3.2.2 SVD++

SVD++ is based upon the idea of extracting latent features about how a user rates an item, after applying matrix factorization to the data. The fundamental idea of SVD++ is to model the user's preference for each factor from both the observed ratings and the implicit feedback from the users, in order to predict the missing scores [22]. Where SVD minimizes Eq. (5), SVD++ proposes the following formulation:

$$\min_{p,q,b,y} \sum_{u,i} \left(r_{ui} - \mu - b_u - b_i - q_i^{T}\Big(p_u + |N(u)|^{-1/2}\sum_{j \in N(u)} y_j\Big)\right)^{2} + \lambda\Big(\|p_u\|^{2} + \|q_i\|^{2} + b_u^{2} + b_i^{2} + \sum_{j \in N(u)} \|y_j\|^{2}\Big) \tag{6}$$

Looking closely, $|N(u)|^{-1/2} \sum_{j \in N(u)} y_j$ is the only term added relative to Eq. (5): SVD++ includes the effect of implicit feedback. The best way to interpret this is that the mere act of rating an item is itself a signal of preference; the chance that a user "prefers" an item he or she has already rated is higher than for a non-rated item. With the addition of this implicit information, SVD++, the derivative model of SVD, achieves better precision [23].
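To make the implicit-feedback term concrete, here is a sketch of the prediction rule embedded in Eq. (6), i.e., r̂_ui = μ + b_u + b_i + q_i^T (p_u + |N(u)|^{-1/2} Σ_{j∈N(u)} y_j). All parameter values below are made-up stand-ins for learned model parameters:

```python
import numpy as np

def svdpp_predict(mu, b_u, b_i, p_u, q_i, Y, rated_items):
    """SVD++ rating prediction (the model inside Eq. 6): the user factor
    p_u is augmented with implicit feedback from every item in N(u)."""
    implicit = Y[rated_items].sum(axis=0) / np.sqrt(len(rated_items))
    return float(mu + b_u + b_i + q_i @ (p_u + implicit))

# Made-up "learned" parameters for one user/item pair (2 latent factors).
Y = np.array([[0.1, -0.2],   # implicit factor y_j for item 0
              [0.0,  0.3],   # ... item 1
              [0.2,  0.1]])  # ... item 2
r_hat = svdpp_predict(mu=3.5, b_u=0.2, b_i=-0.1,
                      p_u=np.array([0.5, -0.4]),
                      q_i=np.array([0.3, 0.6]),
                      Y=Y, rated_items=[0, 2])
print(round(r_hat, 3))  # 3.531
```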

3.3 K-Nearest Neighbor (KNN)

K-Nearest Neighbor (KNN) is a machine learning algorithm that relies on the basic idea that similar things tend to exist near each other. KNN does not draw relations or conclusions from the training data; rather, it picks the top K similar results based on some similarity metric. Generally, Euclidean distances are computed from the target point to all other data points to measure similarity and to select the nearest k neighbors, or points. In the case of making recommendations to a particular user, the similarity over a particular feature, like users, items, etc., is calculated first, using similarity metrics like cosine and Pearson similarity. Once the similarity between users has been calculated, KNN selects the K most similar users from the neighborhood and generates recommendations from them [24].
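As a minimal sketch of this neighborhood step (using the cosine variant; the toy matrix and helper names are assumptions), the K users most similar to a target user can be selected as follows:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two rating vectors."""
    d = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / d) if d else 0.0

def top_k_neighbors(ratings, u, k):
    """Indices of the k users most similar to user u,
    ranked by cosine similarity of their rating vectors."""
    sims = [(cosine(ratings[u], ratings[n]), n)
            for n in range(len(ratings)) if n != u]
    return [n for _, n in sorted(sims, reverse=True)[:k]]

R = np.array([[5, 4, 1, 0],
              [4, 5, 1, 1],
              [1, 1, 5, 4],
              [5, 5, 2, 1]], dtype=float)
print(top_k_neighbors(R, 0, k=2))  # users 3 and 1 rate most like user 0 -> [3, 1]
```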


4 Experimental Analysis

In our practical analysis, we evaluate our recommender system using different approaches based on accuracy metrics. We use the MovieLens 100k dataset to test our recommender system and evaluate its performance on testing data formed from the dataset. Also, we use the Surprise library in Python to ease some of the work and provide a smooth workflow for developing the recommender system, with functionalities such as a built-in MovieLens dataset loader and algorithms such as SVD and SVD++. Additionally, it provides several tools to measure and evaluate the performance of the system.

4.1 Training Data and Testing Data

The dataset we use for this experiment consists of 100,000 ratings (on a scale of 1–5) for 1682 movies, rated by 943 users. Each user has rated at least 20 movies. To test the recommender system we have built, in an offline setting we have partitioned the dataset into a training set consisting of 70% of the data and a testing set with the remaining 30%. The testing set is used to find the accuracy of the system: as the test set is not known to the system, we predict ratings on it to evaluate the accuracy of the recommender system.

4.2 Accuracy Metrics

The fundamental limitation of testing accuracy with a train/test split is that we predict the ratings of movies that the user has already seen. We do not judge whether the new recommendations made by the system will interest the user, because that is nearly impossible to test in an offline setting. So, the best we can do is predict ratings on the test set, which users have already rated but which is not known to the system. The most basic accuracy metric used to test the recommender system is mean absolute error (MAE). The mathematical representation of MAE is shown below:

$$\text{Mean Absolute Error} = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n} \tag{7}$$

where $y_i$ is the rating our system predicts, $x_i$ is the actual rating of the movie, and $n$ is the total number of ratings. So, basically, it is the average of the absolute differences between the predicted and actual ratings. A more demanding accuracy metric is root mean square error (RMSE): it penalizes a prediction more the further it is from the actual rating, and less when it is close. Below is the mathematical equation for RMSE:

$$\text{Root Mean Square Error} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - x_i)^{2}}{n}} \tag{8}$$

Here, instead of directly summing up the prediction errors, we sum up the squares of the prediction errors, take the average, and then take the root of the whole thing to reach a meaningful number. These two metrics were used in the Netflix Prize contest in 2006 to test the performance of recommender systems, and we will use the same to test our system [9]. The lower the RMSE and MAE, the better the accuracy.
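Both metrics are a few lines of code; this sketch mirrors Eqs. (7) and (8), with made-up predicted/actual ratings for illustration:

```python
from math import sqrt

def mae(predicted, actual):
    """Mean absolute error (Eq. 7)."""
    return sum(abs(y - x) for y, x in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root mean square error (Eq. 8): larger errors are penalized more."""
    return sqrt(sum((y - x) ** 2 for y, x in zip(predicted, actual)) / len(actual))

predicted = [3.5, 4.0, 2.0, 5.0]  # ratings the system predicts (made up)
actual    = [3.0, 4.5, 1.0, 5.0]  # ratings the users actually gave (made up)
print(mae(predicted, actual), round(rmse(predicted, actual), 4))  # 0.5 0.6124
```

Note how the single error of 1.0 inflates RMSE above MAE, which is exactly the "penalizes large errors more" behavior described above.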

5 Results and Discussion

The accuracy results for the different approaches, namely a content-based KNN system, item- and user-based collaborative KNN systems, and finally the matrix factorization techniques, i.e., SVD and SVD++, are given in Table 1. The most notable result from the experiments is that we have achieved a more efficient system, in terms of accuracy, by using dimension reduction techniques together with machine learning algorithms. SVD++ shows the best results, with the lowest RMSE and MAE, although SVD is nearly as good and is more efficient than all the other basic approaches and techniques. The best accuracy in terms of RMSE and MAE scores does not guarantee that the recommendations are also better, or that the system is more suitable for the user, as discussed earlier. We have not tested our recommender system in the real world with real people, so we cannot comment on the recommendations made by the system using the different approaches; what we can speak to is the accuracy and efficiency of the system.

Table 1 Results

Algorithm        MAE     RMSE
Content KNN      0.7204  0.9324
Item-based KNN   0.7739  0.9954
User-based KNN   0.7746  1.0049
SVD              0.6972  0.9056
SVD++            0.6859  0.8936


6 Conclusion and Future Scope

It is very clear that the choice of algorithms makes a huge impact on prediction accuracy; hence, a proper analysis of techniques is very important before their implementation. The above experiments show that we have enhanced the accuracy of the system and made it more efficient by using LSA applications, i.e., SVD and SVD++. However, better accuracy does not mean better recommendations, and we cannot know for sure which recommendations users will find interesting until we test the system in a real-world environment with real people. There is yet a long way ahead. We plan on creating a hybrid model, which implements a rather contemporary approach called context clustering. This model would aim at overcoming the disadvantages of the basic approaches. Kannout [15] worked out a recommender system using the same approach and proved by a practical implementation that his proposed model outperformed traditional collaborative filtering-based systems. We further plan to use BERT, a language representation model recently published by Google, which is being used for embeddings, natural language processing tasks, language inference, etc., and is already showing promising results in various fields.

References 1. R. Burke, Hybrid web recommender systems, in The Adaptive Web (Springer, Berlin, Heidelberg, 2007), pp. 377–408 2. J.L. Herlocker, J.A. Konstan, L.G. Terveen, J.T. Riedl, Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst. (TOIS) 22(1), 5–53 (2004) 3. S.P. Sahu, A. Nautiyal, M. Prasad, Machine learning algorithms for recommender system—a comparative analysis. Int. J. Comput. Appl. Technol. Res. 6(2), 97–100 (2017) 4. P. Melville, V. Sindhwani, Recommender systems, in Encyclopedia of Machine Learning, vol. 1, pp. 829–838 (2010) 5. R. Burke, Knowledge-based recommender systems, in Encyclopedia of Library and Information Systems, vol. 69, no. Supplement 32, pp. 175–186 (2000) 6. B. Gipp, J. Beel, C. Hentschel, Scienstein: a research paper recommender system, in Proceedings of the International Conference on Emerging Trends in Computing (ICETiC’09) (2009), pp. 309–315 7. B. Li, Y. Liao, Z. Qin, Precomputed clustering for movie recommendation system in real-time. J. Appl. Math. (2014) 8. I. Portugal, P. Alencar, D. Cowan, The use of machine learning algorithms in recommender systems: a systematic review. Expert Syst. Appl. 97, 205–227 (2018) 9. L.Y. Lü, M. Medo, C.H. Yeung, Recommender systems. Phys. Rep. 519(2) (2012) 10. M. Jahrer, A. Töscher, R. Legenstein, Combining predictions for accurate recommender systems, in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2010), pp. 693–702 11. C. Basu, H. Hirsh, W. Cohen, Recommendation as classification: using social and content-based information in recommendation, in AAAI/IAAI (1998), pp. 714–720 12. G. Adomavicius, A. Tuzhilin, Context-aware recommender systems, in Recommender Systems Handbook (Springer, Boston, MA, 2011), pp. 217–253


13. D. Asanov, Algorithms and Methods in Recommender Systems (Berlin Institute of Technology, Berlin, Germany, 2011) 14. D.H. Park, H.K. Kim, I.Y. Choi, J.K. Kim, A literature review and classification of recommender systems research. Expert Syst. Appl. 39(11), 10059–10072 (2012) 15. E. Kannout, Context clustering-based recommender systems, in 2020 15th Conference on Computer Science and Information Systems (FedCSIS) (IEEE, 2020), pp. 85–91 16. F.H.D. Olmo, E. Gaudioso, Evaluation of recommender systems: a new approach. Expert Syst. Appl. 35(3), 790–804 (2008) 17. J. Ben Schafer, D. Frankowski, J. Herlocker, S. Sen, Collaborative filtering recommender systems, in The Adaptive Web (Springer, Berlin, Heidelberg, 2007), pp. 291–324 18. E. Vozalis, K.G. Margaritis, Analysis of recommender systems algorithms, in The 6th Hellenic European Conference on Computer Mathematics & its Applications (2003), pp. 732–745 19. P. Lops, M. De Gemmis, G. Semeraro, Content-based recommender systems: state of the art and trends, in Recommender Systems Handbook (Springer, Boston, MA, 2011), pp. 73–105 20. R. Burke, A. Felfernig, M.H. Göker, Recommender systems: an overview. AI Mag. 32(3), 13–18 (2011) 21. O. Saini, S. Sharma, A review of dimension reduction techniques in data mining. Comput. Eng. Intell. Syst. 9, 7–14 (2018) 22. Y. Koren, Factorization meets the neighborhood: a multifaceted collaborative filtering model, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008), pp. 426–434 23. Z. Xian, Q. Li, G. Li, L. Li, New collaborative filtering algorithms based on SVD++ and differential privacy. Math. Probl. Eng. (2017) 24. B.-B. Cui, Design and implementation of movie recommendation system based on KNN collaborative filtering algorithm. ITM Web Conf. 12, 04008. EDP Sciences (2017)

A Study of Machine Learning Techniques for Fake News Detection and Suggestion of an Ensemble Model Rajni Jindal, Diksha Dahiya, Devyani Sinha, and Ayush Garg

Abstract Advancement of technology and rapidly growing telecom giants like Jio Platforms and Bharti Airtel have made information accessible to just about everyone. The Internet has proven itself to be a boon, with people relying on it for education, business, news, etc., but the trustworthiness of information is not guaranteed. A vast amount of information is churned out via print and online media, but it is not easy to tell whether the information generated is true or false. At present, false information about terrorism, natural disasters, pandemics, science, the COVID-19 virus, or finance has become a menace. People are consuming the daily news they need, and fake news has been found to spread significantly faster than genuine news. It has an extraordinary negative effect on individuals as well as on society in general. For fake news classification, we apply methods and models from the ML literature in this paper. The goal of this project is to build deep learning models that can recognize whether a given news snippet is fake based only on its text, labeling it as false or real. Our experiments on the dataset using an ensemble technique show a promising accuracy of 91.87%.

Keywords Binary classification · Ensemble model · Fake news detection · Machine learning · Deep learning

1 Introduction

"Now is an era of deception—facts and news are made up with deceit. They are empirically false." —Neil Portnow

R. Jindal · D. Dahiya · D. Sinha · A. Garg (B) Department of Software Engineering, Delhi Technological University, Bawana Road, Rohini, Delhi 110042, India R. Jindal e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_51


Knowledge about misinformation and fake news has increased, yet we find it hard to curb its impact. Fake news is typically generated for commercial interests: to attract viewers or to collect advertising revenue. The term "fake news" refers to incorrect or propaganda-driven bulletins that create unrest. It is usually spread through traditional media like print, or through non-traditional media. Social media platforms such as Twitter, WhatsApp, and Facebook are fast, effective channels for the dissemination of false, unverified information and fabricated attention-grabbing stories. Easily available Internet access and ease of communication let such news spread like wildfire.

False information about COVID-19 with "potentially serious implications" for health has cost lives. As vaccines emerge, misinformation is being spread to persuade people not to vaccinate themselves. Such false news by anti-vaccine campaigners can undermine the achievement of COVID-19 vaccines. It is the responsibility of international agencies, tech companies, and governments to fight back against this problem.

Considering the huge volume of data and the ease with which information is disseminated, AI tools can prove useful in differentiating fake news from real news. However, there have been debates on whether AI alone is the solution to this problem. Experts argue that the future will involve some combination of human labor and AI to combat the malicious actors who initiate fake news. AI-enhanced fact-checking is the only route forward, as was studied in [1]. The lack of computational tools to curb this menace has motivated research on this topic in recent years. The power of AI can be tapped to spot fake news, which would be particularly helpful in dealing with millions of news pieces. ML techniques can catch even the most subtle underlying deviations and lack of credibility in data, which are imperceptible to humans without domain expertise.
In this paper, we put forward an ensemble model approach to detect fake news. We combined different classifiers to filter the noise and to avoid the problems caused by relying on a single classifier. We studied the dataset and made use of different computational approaches including SVM, neural networks, convolutional neural networks (CNNs), and long short-term memory networks (LSTMs). We compared our results with existing models and found that our model gives an accuracy of 91.87% on the "fake news" dataset from Kaggle.

2 Literature Survey

The challenge of fake news is, more often than not, associated with hoaxes, spammer detection, rumors, and deception (related work in [2, 3], respectively), among many others. Shu et al. [4] discuss open problems, fundamental theories, and research strategies for the fake news detection problem. The two major categories of approaches used for the detection of fake news are textual cue approaches and network-analysis approaches. The second category relies on non-textual features of news to classify fake and real news. Some mainstream techniques used for checking the truthfulness of news content are:

A Study of Machine Learning Techniques for Fake News Detection …


Fig. 1 Sample of some article snippets in the dataset

(1) Propagation graph based: Tacchini et al. [2] proposed an approach based on the hypothesis that users who frequently like or share fake or low-quality content can help to identify the quality of unlabeled content.
(2) Fact-checking using knowledge graphs [5].
(3) Expert-oriented fact-checking relying on external sources such as DBpedia.

We have followed the textual-based approach, which employs text properties as features for our ML models. Recent studies have revealed that fake news articles use "biased subjective" language [6]. There are underlying style characteristics in the language and semantics of fake news content that are distinguishable from those of true news. The textual approach effectively differentiated fake news from satire [7]. A detailed study in [8] surveyed various NLP techniques along with different neural network and non-neural network models on various publicly available datasets. Mihalcea and Strapparava [9] use NLP techniques to identify and differentiate fake news. Hu and Liu [10] use similarities in lexical content and style, as well as sentiment analysis, to analyze fake and deceptive reviews on Amazon (Fig. 1).

3 Experimental Design

3.1 Dataset Description

The dataset utilized in this study was obtained from Kaggle and contains news pieces covering multiple topics. The collection of 20,761 news samples was shuffled and split into test and training datasets (20% and 80%, respectively). After the data cleaning and preprocessing steps, the training dataset had 8341 real and 8267 fake samples remaining. We then used three machine learning models (neural network, Naïve Bayes, and SVM) and two deep learning models (LSTM and CNN) for our classification task. Naïve Bayes was used as the baseline model for the problem. All code was written in Python 3.6, using TensorFlow and


Table 1 Dataset description

Attribute   Type
Id          Numeric
Title       Text
Author      Text
Text        Text
Label       Numeric (either REAL (0) or FAKE (1))

Keras, together with supporting libraries such as NumPy, gensim, pandas, matplotlib, scikit-plot, scikit-learn, and NLTK. All experiments were performed on an Intel Core i5-8250U processor clocked at 1.60 GHz with 16 GB RAM (Table 1).

3.2 Feature Extraction

Text comes in various formats: news articles may contain many blank lines, and the raw text can be very messy. The first step is therefore text cleaning, which removes unneeded punctuation marks and blank lines. Word embedding is an important method for feature extraction in NLP tasks. We have used the Doc2Vec technique to transform the text of each article into numbers. The Doc2Vec model learns to represent each document as a vector and provides a rich vector representation for large units of text; it can be viewed as an extension of word2vec to sequences of words. Le and Mikolov [11] demonstrated the capability of the paragraph vector (Doc2Vec) model on sentiment analysis and text classification tasks.
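The cleaning step described above can be sketched in a few lines; the exact regular expressions below are our own illustrative choices, not the ones used in the paper:

```python
import re

def clean_text(raw: str) -> str:
    """Basic cleaning before Doc2Vec embedding: drop blank lines,
    strip punctuation, collapse whitespace, and lowercase."""
    lines = [ln for ln in raw.splitlines() if ln.strip()]  # remove blank lines
    text = " ".join(lines)
    text = re.sub(r"[^\w\s]", " ", text)                    # remove punctuation
    text = re.sub(r"\s+", " ", text).strip().lower()        # collapse whitespace
    return text

print(clean_text("Breaking!!  News:\n\n\nMarkets fall..."))
# → breaking news markets fall
```

The cleaned string would then be tokenized and fed to a Doc2Vec model (e.g., gensim's implementation) to obtain the fixed-length document vector.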

3.3 Models

This paper explores various ML and deep learning models and suggests an ensemble model for the fake news detection problem (Fig. 2).

3.3.1 Naïve Bayes

Naïve Bayes is a probabilistic model that applies Bayes' theorem to compute the conditional probability that an input sample belongs to a particular class, given the features of the sample. We implemented a Naïve Bayes classifier to establish a baseline accuracy on the dataset.
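The conditional-probability computation can be illustrated with a toy multinomial Naïve Bayes over word counts. This sketch is ours, not the paper's code (the experiments would realistically use a library implementation over the Doc2Vec features); the documents and labels are made up:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit class priors and word counts for Laplace-smoothed likelihoods."""
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc.split())
    vocab = {w for cnt in counts.values() for w in cnt}
    return priors, counts, vocab

def predict_nb(doc, priors, counts, vocab):
    """argmax_c [ log P(c) + sum_w log P(w | c) ] with add-one smoothing."""
    best, best_lp = None, -math.inf
    for c, prior in priors.items():
        total = sum(counts[c].values())
        lp = math.log(prior)
        for w in doc.split():
            lp += math.log((counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = ["shocking miracle cure", "official report released", "miracle cure exposed"]
labels = ["fake", "real", "fake"]
model = train_nb(docs, labels)
print(predict_nb("miracle cure", *model))  # → fake
```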


Fig. 2 Flowchart representing workflow to classify news articles

3.3.2 Support Vector Machines

SVMs have previously been used in applications such as medical analysis, text classification, pattern recognition, and language classification. An SVM assigns new data points to one of the categories while maximizing the margin between the two classes. We use a nonlinear kernel, the radial basis function (RBF), to learn the nonlinear dependence between the Doc2Vec embedding of a news piece and its class label.
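The RBF kernel underlying this choice can be sketched directly; the gamma value below is illustrative, not the tuned setting from the experiments:

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """k(x, z) = exp(-gamma * ||x - z||^2): equals 1.0 for identical
    vectors and decays toward 0 as the embeddings move apart."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

a = [1.0, 0.0, 2.0]
b = [1.0, 1.0, 2.0]
print(rbf_kernel(a, a))  # identical Doc2Vec vectors → 1.0
print(rbf_kernel(a, b))  # exp(-0.5 * 1.0) ≈ 0.6065
```

The SVM then operates in the implicit feature space induced by this kernel, which is what allows a nonlinear decision boundary in the original embedding space.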

3.3.3 Neural Network

A neural network is an arrangement of a large number of units (neurons), organized in layers and linked together in a pattern of connections; these interconnections and the neuron cells form the network. The configuration of the model used is [256, 256, 80]. The hidden layer neurons use ReLU activation, and the output layer has softmax activation.
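The [256, 256, 80] configuration can be sketched as a forward pass. The 300-dimensional input is our assumption for the Doc2Vec embedding size (the paper does not state it), and the weights here are random placeholders rather than trained values:

```python
import math
import random

random.seed(0)

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    m = max(v)                              # shift for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def layer(v, n_out):
    """One dense layer with small random weights (illustrative only)."""
    return [sum(random.gauss(0, 0.05) * x for x in v) for _ in range(n_out)]

def forward(x, hidden=(256, 256, 80), n_classes=2):
    """Hidden layers [256, 256, 80] with ReLU, softmax output, as in the paper."""
    h = x
    for n in hidden:
        h = relu(layer(h, n))
    return softmax(layer(h, n_classes))

probs = forward([random.gauss(0, 1) for _ in range(300)])
print(len(probs), round(sum(probs), 6))  # 2 1.0
```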


Fig. 3 Flowchart for LSTM

3.3.4 LSTM

Unlike fully connected neural networks, the LSTM has cycles in its neuron connections. The output of an LSTM cell is a function of the current input from the previous layer together with the state of the cell, so the preceding state of the LSTM cell plays a role in its next state (Fig. 3). The embedding layer learns an embedding for all of the words in the training corpus when the model is trained. It is initialized with arbitrary weights and requires each word to be mapped to a distinct integer. For this, each word's frequency in the text corpus was counted, and the top 500 words (with the highest frequencies) were each assigned a unique ID. The words are then replaced by their assigned IDs, so each input news piece becomes a sequence of integers. Next, to reduce the dimensionality of the problem, we transform the lists of integers to a fixed length: lists longer than 500 numbers are trimmed, and 0's are placed at the beginning of lists shorter than 500 numbers. The initial text is thus transformed into a fixed-length vector of length 500, which is fed to the embedding layer of the LSTM. The output of the embedding is 2D, with a 32-dimensional vector for each of the 500 words of the input sequence; we chose an embedding space of 32 dimensions. The LSTM has one word-embedding layer, one hidden LSTM layer, and finally an output dense layer.
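The vocabulary building, integer encoding, and fixed-length pre-padding described above can be sketched as follows (toy corpus; how out-of-vocabulary words are handled is not stated in the paper, so this sketch simply drops them):

```python
from collections import Counter

def build_vocab(corpus, top_k):
    """Assign IDs 1..top_k to the most frequent words; 0 is reserved
    for padding (unseen words are simply dropped in this sketch)."""
    freq = Counter(w for doc in corpus for w in doc.split())
    return {w: i + 1 for i, (w, _) in enumerate(freq.most_common(top_k))}

def encode(doc, vocab, maxlen):
    """Map words to IDs, then truncate or left-pad with 0s to maxlen."""
    ids = [vocab[w] for w in doc.split() if w in vocab]
    ids = ids[:maxlen]                        # trim sequences longer than maxlen
    return [0] * (maxlen - len(ids)) + ids    # pre-pad shorter sequences

corpus = ["the markets fell", "the president spoke", "markets the"]
vocab = build_vocab(corpus, top_k=500)        # paper uses the top 500 words
seq = encode("the markets crashed", vocab, maxlen=500)
print(len(seq), seq[-2:])  # 500 [1, 2]
```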

3.3.5 CNN

Convolutional neural networks (CNNs) are known to perform very well on computer vision tasks, as they capture the local structures of an image. CNNs have also performed well on many NLP tasks such as sentence modeling and semantic parsing ([12, 13], respectively) and other conventional NLP tasks [14]. To train the CNN model, we first created a word embedding with 32-dimensional vectors. The layers, parameter settings, and CNN configuration are given in Table 2. In our model, every convolutional layer is accompanied by a pooling layer; the training time of such models is much shorter, while the performance remains comparable. We tuned hyperparameters in experiments with multiple combinations. The dropout layer helps prevent the model from overfitting. A GlobalMaxPooling layer is commonly used for NLP-related classification tasks. The output of the GlobalMaxPooling layer is input to the dense

Table 2 Configuration of the CNN model

Layer (type)                               Output shape      Parameters
input_1 (InputLayer)                       (None, 500)       0
embedding_1 (Embedding)                    (None, 500, 32)   160,064
conv1d (Conv1D)                            (None, 494, 150)  33,750
max_pooling1d (MaxPooling1D)               (None, 98, 150)   0
conv1d_1 (Conv1D)                          (None, 92, 150)   157,650
max_pooling1d_1 (MaxPooling1D)             (None, 18, 150)   0
conv1d_2 (Conv1D)                          (None, 12, 150)   157,650
global_max_pooling1d (GlobalMaxPooling1D)  (None, 150)       0
dense_5 (Dense)                            (None, 150)       22,650
dropout_2 (Dropout)                        (None, 150)       0
dense_6 (Dense)                            (None, 1)         151

Total parameters: 531,915

output layer. The output layer uses a sigmoid activation function to predict the class labels.
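The parameter counts in Table 2 can be reproduced arithmetically. Two details are inferred from the table rather than stated in the text: the kernel width of 7 (output length 500 shrinking to 494 under valid convolution) and an embedding matrix of 5,002 rows (160,064 / 32):

```python
def conv1d_params(kernel, ch_in, ch_out):
    """Weights plus one bias per output channel."""
    return kernel * ch_in * ch_out + ch_out

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

params = [
    5_002 * 32,                   # embedding_1 -> 160,064 (row count inferred)
    conv1d_params(7, 32, 150),    # conv1d      -> 33,750
    conv1d_params(7, 150, 150),   # conv1d_1    -> 157,650
    conv1d_params(7, 150, 150),   # conv1d_2    -> 157,650
    dense_params(150, 150),       # dense_5     -> 22,650
    dense_params(150, 1),         # dense_6     -> 151
]
print(sum(params))  # 531915, matching the total in Table 2
```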

3.3.6 Ensemble of Classifiers

Previously, Thorne et al. [15] used a stacked ensemble model to perform stance classification and obtained remarkable results. Minaee et al. [16] performed sentiment analysis with an ensemble of Bi-LSTM and CNN models. Hence, we build an ensemble classifier from the three best models of the bunch, i.e., LSTM, CNN, and the neural network, to obtain the final classification for new news articles. It unifies the results of the three models (Fig. 4).
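The paper does not spell out how the three models' outputs are unified; a simple majority vote over the hard labels is one plausible reading, sketched here purely as an assumption:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine hard labels from the NN, LSTM, and CNN: the class
    predicted by at least two of the three models wins."""
    return Counter(predictions).most_common(1)[0][0]

# e.g., the NN says REAL (0) while the LSTM and CNN say FAKE (1)
print(majority_vote([0, 1, 1]))  # → 1
```

An alternative would be to average the three models' predicted probabilities and threshold at 0.5, which behaves similarly for confident models.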

4 Discussion of Results

• The respective F1 score, recall, and precision of the models were calculated. These metrics help in evaluating our models and understanding the overall performance of the algorithms for the task of binary classification of news articles. A comparison of the different models is shown in Table 3.
• Naïve Bayes, which served as the baseline, reported the lowest performance scores of the bunch. This shows that something more than just a probabilistic model is needed for the problem.


Fig. 4 Block diagram representing suggested ensemble model

Table 3 Model performance on the test dataset

Model                       Accuracy (%)  Precision  Recall  F1 score
Naïve Bayes                 72.94         0.72       0.86    0.76
SVM                         88.42         0.85       0.93    0.89
Neural network with Keras   85.62         0.92       0.93    0.92
LSTM                        92.37         0.94       0.94    0.94
CNN                         94.53         0.93       0.93    0.95
Ensemble (NN + LSTM + CNN)  91.87         0.93       0.93    0.93

• In the case of the SVM classifier, the nonlinear kernel gave better results than Naïve Bayes.
• The neural network worked well as a classifier, achieving an accuracy of 85.62% with the Keras implementation. We tested tanh and sigmoid functions for the activation of the hidden layers; however, the results were not good enough for them to replace ReLU.
• The deep learning models, i.e., LSTM and CNN, gave the highest precision values, which means they were able to extract a subset of relevant features and hence learn to classify news more accurately. The text in a news article is essentially sequential in nature, and deep learning architectures are able to extract such sequential information. Identifying a news item as fake or authentic is a difficult task, even for human experts. The intuition behind the good performance of the CNN is that it could recognize hidden features in the word embeddings of the article text that effectively differentiate real from fake news. Regarding the LSTM, our intuition is that by capturing patterns in the sequence of words, the model was able to solve the problem effectively. Also, the LSTM had a faster convergence time than the CNN for comparable accuracy.


Table 4 Classification reports of the models for the classes (0 and 1)

Model        Class  Precision  Recall  F1 score  Support
Naïve Bayes  0      0.79       0.58    0.67      2046
             1      0.68       0.85    0.75      2107
SVM          0      0.92       0.82    0.87      2046
             1      0.84       0.93    0.89      2107
LSTM         0      0.93       0.92    0.92      2046
             1      0.93       0.93    0.93      2107
CNN          0      0.96       0.93    0.95      2088
             1      0.93       0.97    0.95      2065
Ensemble     0      0.95       0.92    0.93      2088
             1      0.92       0.95    0.93      2065

• Further, we examine the performance (precision, recall, and F1-score) of each model on the individual classes (0 and 1). As shown in the classification reports in Table 4, the performance of the Naïve Bayes, SVM, and neural network models varies between the classes REAL (0) and FAKE (1); i.e., these models were unable to classify both classes (fake and real news) with equal performance, whereas the deep models, LSTM and CNN, predict news articles of both classes with almost equal performance. This indicates that the CNN and LSTM models tend to be more robust and give more promising results than the classical ML models.
• The accuracy of our ensemble model is 91.87%, which is lower than the individual accuracies of the LSTM (92%) and the CNN (94%). Hence, our proposed ensemble approach does not give very good results on our chosen dataset.

5 Conclusion

Reviewing the current scenario, fake news is a very important topic to research and work on, to ensure that the news we consume is credible. Research contributions:
RC1: In this research, we compared existing state-of-the-art ML models on our chosen dataset. An attempt was made to demonstrate improvements in the task of fake news detection, as compared to the existing state-of-the-art results (to our knowledge), using an ensemble model.
RC2: This work can broaden our understanding of the applicability of AI-based strategies for combating the dissemination of fake news and affirm the potential of AI to solve the problem. Using deep learning methods, we check whether a circulated news item is fake or not. We have also proposed a bagging ensemble model (a unified model of the models we have worked on) to identify fake news with


promising results. Our proposed methodology performs satisfactorily, with an accuracy of 91.87%. The results of the study were validated using multiple performance evaluation measures, e.g., false positives, true negatives, recall, precision, accuracy, and F1. The individual deep models gave better results than the proposed ensemble model; hence, for our chosen dataset, the ensemble approach does not give very good results. News is not just composed of text but is also accompanied by images. Exploring the relation of the images to the text in the articles is also essential to effectively combat fake news. Studying the features of images can be done using well-known convolutional neural networks (CNNs), and that is something we wish to work on in the future. Further, we can formulate a fake news dataset which includes other features, such as the source of the news, any associated URLs, the topic of the news piece (science, politics, sport), the publishing medium (blog, print, social media), the geographic region of origin, and other textual features which have not yet been exploited in this project.

References
1. University of Michigan, Fake news detector algorithm works better than a human. ScienceDaily, 21 Aug 2018
2. E. Tacchini, G. Ballarin, M.L. Della Vedova, S. Moret, L. de Alfaro, Some like it hoax: automated fake news detection in social networks, in Proceedings of the Second Workshop on Data Science for Social Good (SoGood), Skopje, Macedonia. CEUR Workshop Proceedings, vol. 1960 (2017)
3. G.C. Santia, M.I. Mujib, J.R. Williams, Detecting social bots on Facebook in an information veracity context. Proc. Int. AAAI Conf. Web Soc. Media 13(01), 463–472 (2019)
4. K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake news detection on social media: a data mining perspective (2017)
5. V. Fionda, G. Pirro, Fact checking via evidence patterns, in Proceedings of the IJCAI-18 (2018)
6. X. Zhou, R. Zafarani, A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities (ACM, 2019)
7. O. Levi, P. Hosseini, M. Diab, D.A. Broniatowski, Identifying nuances in fake news vs. satire: using semantic and linguistic cues, in Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda (2019)
8. N.J. Conroy, V.L. Rubin, Y. Chen, Automatic deception detection: methods for finding fake news, in Proceedings of the Association for Information Science and Technology (2015), pp. 1–4
9. R. Mihalcea, C. Strapparava, The lie detector: explorations in the automatic recognition of deceptive language, in Proceedings of the ACL-IJCNLP 2009 Conference Short Papers (Association for Computational Linguistics, 2009)
10. M. Hu, B. Liu, Mining and summarizing customer reviews, in Proceedings of the Tenth ACM Conference on Knowledge Discovery and Data Mining (ACM, 2004)
11. Q. Le, T. Mikolov, Distributed representations of sentences and documents (2014)
12. Y. Shen, J. Gao, L. Deng, G. Mesnil, Learning semantic representations using convolutional neural networks for web search (2014)
13. K.M. Hermann, E. Grefenstette, A deep architecture for semantic parsing (2014)
14. W. Wang, J. Gang, Application of convolutional neural network in natural language processing, in ICISCAE (2018)


15. J. Thorne, M. Chen, G. Myrianthous, J. Pu, X. Wang, Fake news detection using stacked ensemble of classifiers, in Proceedings of the 2017 EMNLP Workshop on Natural Language Processing Meets Journalism (2017)
16. S. Minaee, E. Azimi, A. Abdolrashidi, Deep-sentiment: sentiment analysis using ensemble of CNN and Bi-LSTM models (2019)

Metric Learning with Deep Features for Highly Imbalanced Face Dataset

Ashu Kaushik and Seba Susan

Abstract Learning faces from large databases with a highly uneven class distribution is subject to bias toward the majority classes that have the maximum number of instances. The problem becomes more complex when the number of classes is very large, with samples from the minority classes often getting misclassified. This phenomenon, popularly known as the class imbalance problem, is yet to be explored in its entirety for the deep learning techniques that form the state of the art in computer vision today. Metric learning is sometimes used to mitigate the effect of class imbalance due to its data space transformation properties, which bring samples of a class closer together. In this paper, we investigate three popular metric learning schemes, namely LMNN, NCA, and MLKR, for transforming an input space comprising deep feature vectors extracted from the pre-trained VGG-Face network. To reduce the computational overhead associated with metric learning, we learn the distance metric from the majority classes only and use the learnt metric to transform the entire input space. Experiments on the benchmark LFW facial dataset demonstrate the efficiency of our approach through the higher accuracies obtained as compared to existing techniques.

Keywords Metric learning · Imbalanced learning · Face recognition · VGG-Face · Deep features

1 Introduction

In today's modern and socially active world, a huge amount of online data is generated every day. Those who are highly active on social media generate a big chunk of raw data, whereas those who are less active yield a smaller chunk. The bigger chunks of raw data generated by the active users constitute the majority classes, and the smaller chunks constitute the minority classes. This leads to the problem of class imbalance, which induces a decision bias when machine learning

A. Kaushik · S. Susan (B) Department of Information Technology, Delhi Technological University, Bawana Road, Delhi 110042, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_52


algorithms are applied to the dataset [1]. Most of the work done so far focuses mainly on binary classes [2], and research on multi-class, highly imbalanced data is relatively scant [3]. Deep learning methods can be much more accurate when classifying large datasets with uneven class distributions [4], but they can be computationally expensive due to the large number of layers and parameters, and the problem intensifies for large imbalanced datasets. Various methods have been devised to address this, such as pruning insignificant features before passing them to the deep networks [5]. An incremental batch-wise minority class rectification is done in [6] to reduce the dominance of the majority classes by analyzing the boundaries of scattered minority classes in a batch-wise learning process. Data augmentation of the minority class was performed in [7] to balance the class distribution prior to deep transfer learning. Resampling may not be able to solve the imbalance issue when the number of classes in the dataset is very high, since it is difficult to define the boundary between the majority and minority classes. An alternative strategy for countering class imbalance in deep networks is cost-sensitive learning [8, 9], where the contribution of the minority class toward the computation of the loss function is enhanced. Metric learning is another solution to the class imbalance problem, explored in detail for toy datasets [10]. It involves a data space transformation that reduces intra-class differences and increases inter-class differences, thereby preventing misclassification to a good extent. Metric learning has been used in conjunction with resampling with successful results [11]. Deep metric learning in [12] involved formulating a loss function for the deep network that penalized negative pairs of samples with large distances; there is related research [13, 14] in which metric learning was embedded inside loss functions and optimization objectives.
The direct application of distance metric learning schemes, well explored in data mining, to deep features extracted from large image databases is limited, and it is this problem that we seek to address in this paper. The remaining sections of the paper are organized as follows. Section 2 starts with the basic concepts involved and then introduces the proposed approach. Section 3 analyzes the experimental results, and Sect. 4 gives the conclusions and outlines the future work.

2 Proposed Metric Learning with Deep Face Features

We use a pre-trained deep neural network model only to generate the feature embedding of each facial image. After extraction of the deep features, we choose a small subset containing a fixed number of samples from each of the majority classes. Distance metrics are learnt from this small subset of majority classes, in accordance with the findings of our previous work [15]. After learning the transformation metric, the whole input space (majority as well as minority samples) is transformed into a new feature space, which is divided into two halves, i.e., training and testing, by alternate sampling. The transformed deep features are classified by a third-party classifier, namely the cosine

Metric Learning with Deep Features for Highly Imbalanced Face …


similarity classifier. We start with basic concepts, and then the proposed model is presented.
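The cosine similarity classifier mentioned above can be sketched as nearest-match lookup over the transformed embeddings; this minimal version is ours, with made-up two-sample gallery data rather than the 2622-dimensional VGG-Face vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def classify(x, gallery):
    """Return the label of the training sample most cosine-similar to x."""
    return max(gallery, key=lambda item: cosine(x, item[1]))[0]

gallery = [("bush", [1.0, 0.1, 0.0]), ("blair", [0.0, 1.0, 0.2])]
print(classify([0.9, 0.2, 0.0], gallery))  # → bush
```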

2.1 Deep Face Networks

There are various state-of-the-art pre-trained deep convolutional networks for face recognition, such as DeepFace [16], FaceNet [17], and VGG-Face [18]. VGG-Face was developed by Parkhi et al. [18] in 2015. It is based on the VGG-16 [19] architecture and is very deep, as it comprises 22 layers and 37 deep units. In the convolution layers, it makes use of a 3 × 3 filter with stride equal to 1. A 2622-dimensional feature embedding is extracted from the deep network for each image, which is given as input to the cosine similarity classifier. The architecture of VGG-Face is summarized below.

(1) 2 convolutional layers (64 channels); 3 × 3 kernel; stride = 1, padding = 1; followed by a max-pooling layer
(2) 2 convolutional layers (128 channels); 3 × 3 kernel; followed by a max-pooling layer
(3) 3 convolutional layers (256 channels); 3 × 3 kernel; followed by a max-pooling layer
(4) 3 convolutional layers (512 channels); 3 × 3 kernel; followed by a max-pooling layer
(5) 3 convolutional layers (512 channels); 3 × 3 kernel; followed by a max-pooling layer
(6) 3 fully connected layers (generating 2622 × 1 embeddings); followed by a softmax layer

2.2 Distance Metric Learning Distance metric learning involves computation of distance metrics that allow data space transformation of the input space for better definition of class boundaries. The data space transformation induced by metric learning brings similar samples within a class closer and pushes dissimilar samples belonging to different classes farther apart. Metric learning has proved to mitigate the class imbalance problem to a certain extent [10]. Some examples of metric learning algorithms are large margin nearest neighbor (LMNN) [20], neighborhood component analysis (NCA) [21], and metric learning for kernel regression (MLKR) [22]. The three techniques are briefly described next. LMNN was proposed by Weinberger and Saul in 2009, and it is one of the most popular data space transformation technique. The loss function is given by C ost = αT push (L) + β T pull (L)

(1)


Equation (1) specifies the cost function that needs to be minimized. It is a weighted sum of push and pull terms, where α = (1 − β) and the value of β lies between 0 and 1. The transformation matrix learned by minimizing this function keeps the k nearest same-class samples close while maximizing the margin between samples of different classes [23]. Direct application is not advisable when the number of classes is large [24]. Neighborhood component analysis (NCA) was proposed by Goldberger et al. in 2005 [21]; the function to be maximized is a stochastic variant of the KNN score. Metric learning for kernel regression (MLKR) was proposed by Weinberger and Tesauro in 2007 [22]; it is learned by minimizing the leave-one-out regression error. Any type of kernel can be used, but the main focus is on the Gaussian kernel and the Mahalanobis metric. In the metric learning experiments in [10], NCA and MLKR, followed by LMNN, gave consistently good performance on most of the toy datasets; hence, these are the metric learning schemes we test on deep features in our application.
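Whatever the learning scheme, the learnt Mahalanobis metric amounts to a linear map L applied to the features, so that distances in the transformed space x' = Lx are ordinary Euclidean distances. A minimal sketch with a 2-D illustrative L of our own choosing (a real run would use the matrix returned by LMNN, NCA, or MLKR):

```python
import math

def transform(L, x):
    """x' = L x : apply the learnt linear map to one feature vector."""
    return [sum(L[i][j] * x[j] for j in range(len(x))) for i in range(len(L))]

def dist(u, v):
    """Euclidean distance, which becomes the learnt metric after transform."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Illustrative metric that stretches the first (more discriminative) axis
L = [[2.0, 0.0],
     [0.0, 0.5]]
a, b = [1.0, 4.0], [2.0, 4.0]
print(dist(a, b), dist(transform(L, a), transform(L, b)))  # 1.0 2.0
```

In the paper's pipeline this transformation is applied once to the entire input space (majority and minority samples alike) before the cosine similarity classification.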

2.3 Proposed Model

The block diagram of the proposed model is shown in Fig. 1. The input images belong to the highly imbalanced LFW dataset and consist of cropped faces only. The images given as input to the feature extraction module are grayscale images, which result in a feature embedding of dimension 2622 × 1. After feature extraction, the classes are segregated into majority and minority classes based on the sum-based partitioning of class populations, proposed by the authors in their previous work [15] for highly imbalanced multi-class datasets. Of the 1680 classes of the LFW dataset, 186 are defined collectively as the majority classes; the rest form the minority classes. The first three samples of each class in the majority subset are used for learning the data space transformation metric. Only these samples are passed to the metric learning module, where the type of metric learning is chosen (LMNN, NCA, or MLKR) to generate a transformation metric. The learned transformation metric is then applied to the entire input space, including both majority and minority classes, resulting in a new transformed input space. The data is then split into training and testing based

Fig. 1 Proposed model


on an odd–even rule; i.e., odd samples of a class are used for training and even samples for testing, and a suitable classifier such as cosine similarity is used for the classification. Both validation (V) and cross-validation (CV) results are reported; CV results are obtained by swapping the test and training sets.
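The odd–even split can be sketched as simple alternate slicing within each class; swapping the two halves gives the cross-validation (CV) run:

```python
def odd_even_split(samples):
    """Alternate sampling within a class: the 1st, 3rd, 5th, ... samples
    go to training and the 2nd, 4th, 6th, ... to testing
    (swap the halves to obtain the CV split)."""
    train = samples[0::2]
    test = samples[1::2]
    return train, test

train, test = odd_even_split(["img1", "img2", "img3", "img4", "img5"])
print(train, test)  # ['img1', 'img3', 'img5'] ['img2', 'img4']
```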

3 Results and Discussions

3.1 Dataset

Labeled Faces in the Wild (LFW) is a publicly available dataset developed in 2007 by Huang et al. [25] and is a benchmark in the field of face recognition. It consists of more than 13 k images of over 5.5 k celebrities collected over the Internet. The number of celebrities with two or more samples is 1680; the rest have only one sample. For our experiments, we have therefore selected only the celebrities with two or more samples and excluded the rest. Figure 2 shows some of the majority and minority classes, while Fig. 3 shows the class distribution of the LFW dataset.

Fig. 2 Some majority and minority celeb classes of the LFW dataset. Top-5 majority classes (cropped): George W. Bush (530 samples), Colin Powell (236), Tony Blair (144), Donald Rumsfeld (121), Gerhard Schroeder (109). Bottom-5 minority classes (cropped): Michel Duclos (2), Christopher Patten (2), Mikhail Gorbachev (2), Jerry Falwell (2), Erika Christensen (2)


Fig. 3 Population profile of the 1680 classes of the LFW dataset

3.2 Result Analysis

We used an Intel i5 dual-core processor clocked at 2.7 GHz and Python 3.5 to perform the experiments. We used the metric learning techniques LMNN, NCA, and MLKR to learn the transformation metrics. The system took less than an hour to learn the distance-based transformation metric for each technique, and a few minutes were required to complete the classification with the various baseline methods. We extracted the deep features from the VGG-Face model, which yields a 2622 × 1 feature embedding for each grayscale cropped face image of size 64 × 64 given as input. For all the algorithms, the segregation of majority and minority classes was done using the sum-based partitioning method discussed in [15]: majority classes = 1:186; minority classes = 187:1680. We took only the first three samples of each majority class to learn the distance metrics, to reduce the number of computations. Classification was done both with and without metric learning. We compare the results against popular baselines, namely a support vector machine (SVM) [26] and the cosine similarity metric [27], i.e., direct classification of the VGG-Face deep features without any metric learning. Metric learning is then applied to the input data space, and a new transformed input space (test and training) is derived using each of the three above-mentioned metric learning algorithms. The transformed test samples are classified using the cosine similarity classifier. The accuracy, F1-score, and AUC scores are compiled in Table 1. As observed from Table 1, the proposed combination of VGG-Face + LMNN yields the best accuracies of all the methods, rendering metric learning with deep features a successful learning methodology despite the pre-existing class imbalance and the large number of classes involved.

Metric Learning with Deep Features for Highly Imbalanced Face …


Table 1 Results of face recognition using the deep features and metric learning schemes

Method                             AUC (V)  AUC (CV)  F1-score (V)  F1-score (CV)  Accuracy V (%)  Accuracy CV (%)
VGG-Face + SVM [26]                0.654    0.635     0.287         0.263          55.4            51.1
VGG-Face + cosine similarity [27]  0.689    0.675     0.342         0.329          55.4            51.9
VGG-Face + LMNN                    0.697    0.681     0.355         0.339          57.0            53.6
VGG-Face + NCA                     0.634    0.625     0.233         0.225          44.4            42.4
VGG-Face + MLKR                    0.600    0.591     0.174         0.167          36.5            33.9

4 Conclusion

A novel application of distance metric learning to deep features is proposed for face classification from a large, highly imbalanced multi-class dataset. The deep feature embedding is extracted from the pre-trained VGG-Face network. The LMNN, NCA, and MLKR metric learning algorithms are applied to transform the input space. The distance metrics are learned only from the majority classes to save computation. The entire input space is transformed using the learnt metrics. Experiments on the imbalanced facial dataset LFW yield the highest accuracy for the combination of VGG-Face deep features and LMNN metric learning with the cosine similarity classifier. Finding more combinations of metric learning and deep learning forms the future scope of our work.

References

1. H. He, E.A. Garcia, Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
2. S. Susan, A. Kumar, SSOMaj-SMOTE-SSOMin: three-step intelligent pruning of majority and minority samples for learning from imbalanced datasets. Appl. Soft Comput. 78, 141–149 (2019)
3. S. Susan, A. Kumar, The balancing trick: optimized sampling of imbalanced datasets—a brief survey of the recent state of the art. Eng. Rep. e12298 (2020)
4. M. Saini, S. Susan, Comparison of deep learning, data augmentation and bag-of-visual-words for classification of imbalanced image datasets, in International Conference on Recent Trends in Image Processing and Pattern Recognition (Springer, Singapore, 2018), pp. 561–571
5. Hendry, R. Chen, C. Liao, Deep learning to predict user rating in imbalance classification data incorporating ensemble methods, in 2018 IEEE International Conference on Applied System Invention (ICASI), Chiba (2018), pp. 200–203
6. Q. Dong, S. Gong, X. Zhu, Imbalanced deep learning by minority class incremental rectification. IEEE Trans. Pattern Anal. Mach. Intell. 41(6), 1367–1381 (2019). https://doi.org/10.1109/TPAMI.2018.2832629
7. M. Saini, S. Susan, Deep transfer with minority data augmentation for imbalanced breast cancer dataset. Appl. Soft Comput. 97, 106759 (2020)


8. S. Susan, D. Sethi, K. Arora, CW-CAE: pulmonary nodule detection from imbalanced dataset using class-weighted convolutional autoencoder, in International Conference on Innovative Computing and Communications (Springer, Singapore, 2020), pp. 825–833
9. M.L. Wong, K. Seng, P.K. Wong, Cost-sensitive ensemble of stacked denoising autoencoders for class imbalance problems in business domain. Expert Syst. Appl. 141, 112918 (2020)
10. S. Susan, A. Kumar, DST-ML-EkNN: data space transformation with metric learning and elite k-nearest neighbor cluster formation for classification of imbalanced datasets, in 2019 AIDE. Springer AISC Series (2019)
11. S. Susan, A. Kumar, Learning data space transformation matrix from pruned imbalanced datasets for nearest neighbor classification, in 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS) (IEEE, 2019), pp. 2831–2838
12. C.-X. Ren, X.-L. Xu, Z. Lei, A deep and structured metric learning method for robust person re-identification. Pattern Recogn. 96, 106995 (2019)
13. X. Wang, Y. Hua, E. Kodirov, G. Hu, R. Garnier, N.M. Robertson, Ranked list loss for deep metric learning, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), pp. 5207–5216
14. Z. Bai, X.-L. Zhang, J. Chen, Speaker verification by partial AUC optimization with Mahalanobis distance metric learning, in IEEE/ACM Transactions on Audio, Speech, and Language Processing (2020)
15. S. Susan, A. Kaushik, Weakly supervised metric learning with majority classes for large imbalanced image dataset, in Proceedings of the 2020 4th International Conference on Big Data and Internet of Things (2020), pp. 16–19
16. Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: closing the gap to human-level performance in face verification, in 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH (2014), pp. 1701–1708. https://doi.org/10.1109/CVPR.2014.220
17. F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: a unified embedding for face recognition and clustering, in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA (2015), pp. 815–823. https://doi.org/10.1109/CVPR.2015.7298682
18. O. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition, vol. 1, pp. 41.1–41.12 (2015). https://doi.org/10.5244/C.29.41
19. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
20. K.Q. Weinberger, L.K. Saul, Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res. 10, 207–244 (2009)
21. J. Goldberger, G.E. Hinton, S.T. Roweis, R.R. Salakhutdinov, Neighbourhood components analysis, in Advances in Neural Information Processing Systems (2005), pp. 513–520
22. K.Q. Weinberger, G. Tesauro, Metric learning for kernel regression, in Artificial Intelligence and Statistics (2007), pp. 612–619
23. K.Q. Weinberger, J. Blitzer, L.K. Saul, Distance metric learning for large margin nearest neighbor classification, in Advances in Neural Information Processing Systems (2006), pp. 1473–1480
24. S. Parameswaran, K.Q. Weinberger, Large margin multi-task metric learning, in Advances in Neural Information Processing Systems (2010), pp. 1867–1875
25. G.B. Huang, M. Mattar, T. Berg, E. Learned-Miller, Labeled faces in the wild: a database for studying face recognition in unconstrained environments (2008)
26. A.A. Moustafa, A. Elnakib, N.F.F. Areed, Age-invariant face recognition based on deep features analysis. Signal Image Video Process. 1–8 (2020)
27. O. Laiadi, A. Ouamane, A. Benakcha, A. Taleb-Ahmed, A. Hadid, Kinship verification based deep and tensor features through extreme learning machine, in 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019) (IEEE, 2019), pp. 1–4

An Adaptable Ensemble Architecture for Malware Detection

D. T. Mane, P. B. Kumbharkar, Santosh B. Javheri, and Rahul Moorthy

Abstract Over recent years, the world has become driven by data, which has also been accompanied by an increase in malware attacks. These are harmful programs that can perform functions like stealing or deleting the user's sensitive data, monitoring the user's activity, and seizing control over the user's computer. Early detection of such programs, using the binary data present in each computer file, is essential in today's world. The ability to convert a binary file to an image representation has opened doors for deep learning-based approaches. Traditional approaches use large convolution-layer-based neural network architectures like Resnet and VGG-16 to solve this problem. Though these techniques are effective, they take a relatively long time to detect malware from these images, which cannot be afforded in such time-sensitive tasks. In this paper, we propose an ensemble-based approach using a relatively shallow convolution-layer-based neural network architecture boosted using the lazy learning technique of k-nearest neighbors. We tested this model on the publicly available Malimg dataset with 9339 binary-file image representation samples belonging to 25 malware families. Though this combination has less complexity than traditional approaches, it achieved a better accuracy of 99.63% on this seemingly complex task. It also displayed notable advantages of faster training, faster prediction, and improved performance on classes with less data, which shows bright scope for building an adaptable stochastic malware detection framework, a much-needed system in the cybersecurity domain.

Keywords Malware detection · Deep learning · Convolutional neural network · Ensembling

D. T. Mane (B) · P. B. Kumbharkar · S. B. Javheri JSPM’s Rajarshi Shahu College of Engineering, Pune, Maharashtra 411033, India R. Moorthy Pune Institute of Computer Technology, Pune, Maharashtra 411057, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_53


1 Introduction

Malware is mainly responsible for breaches in data security due to the ease with which it can enter a system. Malware accounted for more than 1 billion attacks [1] in 2019 alone. In this digital world, the generation of data averages about 2.5 quintillion bytes [2] per day; breaches at such a volume can cause major privacy problems for users and heavy losses for corporate companies. Therefore, malware detection systems are essential, as they can serve as an early warning system and prevent information from being compromised. There are many cybersecurity-centered techniques available for detecting malware. Though effective, these techniques are not adaptable, which is a need of today's world. Every year many new variants of malware are discovered; in 2019 alone, around 11 million [1] new variants were found. This is a problem that needs to be properly addressed in the future. Technological improvement has opened great scope for multidisciplinary approaches that can help build such adaptive systems.

Given the number of attacks, the data generated from these attacks has also increased exponentially. Malware attacks leave binary traces in the file or medium in which they were present. The ability to represent these traces as binary images has opened an opportunity for analytics using advanced deep neural networks and machine learning techniques, which shows great scope for building an adaptive malware detection system. Conventionally, due to the image-based data, large convolution neural network-based architectures like Resnet [3] and VGG-16 [4] and heavy image processing techniques are used for building such systems. Though these methods are very accurate and efficient, they have an inherent drawback of high architectural complexity.

This complexity leads to greater prediction time, which cannot be tolerated in such a time-sensitive task, as every second can cause the loss of many bytes of data. It also presents an opportunity for overfitting when the model fits too closely to the limited number of provided training data points; the model then performs poorly on unseen data of present classes, let alone new malware variants. This creates a need for a less complex architecture for such an adaptable system. In this paper, the authors propose an ensemble-based approach to deal with this problem. Specifically, a shallow convolution-based neural network architecture is combined with the lazy learning technique of k-nearest neighbors to generate the predictions. This, in turn, reduces the complexity by a great margin and gives predictions in less time without compromising the results. Rather, it attains superior results in contrast to state-of-the-art models. Along with this, an additional advantage is observed: good performance on classes with few samples, which shows promising results for building an adaptable malware detection system. Our particular contributions through this paper are as follows:

• The use of ensemble methods for malware detection consisting of less complex models in comparison to the conventional architecture.


• Using lazy learners as a part of the architecture for reducing the training and prediction time.

The paper is structured as follows: Sect. 2 focuses on the background and related works, Sect. 3 details the proposed architecture, and Sect. 4 describes the experimental setup. Further, the dataset description and the results are provided in Sect. 5. Finally, the conclusion and future scope are described in Sect. 6.

2 Related Work

Malware detection has been considered an important task in cybersecurity. The algorithms which deal with this task are categorized into two types. First, signature-based techniques [5] scan the sequence of bytes for specific signatures inherent to the malware. Santos et al. [6] used the frequency of appearance of opcode sequences to determine the malware variant. Tang et al. [7] used dynamic micro-architectural execution patterns to detect malware programs. Kemalis and Tzouramanis [8] developed a new SQL-IDS technique based upon this idea to tackle SQL injection problems. Second, anomaly-based techniques [5] detect the presence of malicious code when the behavior deviates from the knowledge of what is considered normal. Furthermore, specification-based techniques [5] are a special category of anomaly-based techniques which detect based on a rule-set: if there is a deviation from the rule-set, that program is considered malware. Chaugule et al. [9] developed a malware detection system for mobile phones upon this idea and determined the presence of malware based on keypad and touchscreen interrupts. Though these methods are very effective, they are centered on cybersecurity-based features that do not allow flexibility for new variants of malware.

Recently, there has been active research in machine learning and deep learning for building malware detection systems. Makandar and Patrot [10] used different wavelet-transform features, which were fed to an SVM-based classifier to determine the malware class of the binary image. Pinto et al. [11] used a pretrained unsupervised approach trained using the operands and opcodes of malware families. Jung et al. [12] used images generated from the byte sequences of malware attacks, which were passed through a convolution neural network to detect the code's malicious nature. Singh et al. [13] used the Resnet-50 architecture to determine the class on the Malimg dataset. Lo et al. [14] used an Xception convolution neural network model with transfer learning on the Microsoft malware dataset. Rezende [15] used VGG-16-based bottleneck features to classify malware software. Nataraj and Manjunath [16] used signal processing to analyze different malware forms. Rajesh Kumar et al. [17] used image similarity to detect unknown or new types of malware. Zhou [18] used a combination of static and dynamic features of malware, which was fed to different machine learning classifiers to determine the malicious nature of the input. Sethi et al. [19] developed a framework to classify different files into macro- and micro-malware. This framework uses Cuckoo Sandbox to analyze these


files, which were later fed to a machine learning model developed using the Weka framework. It can be seen that the previous architectures show a heavy reliance on deep convolution layer-based approaches. The major drawbacks of such an approach are the seemingly complex nature of the framework and long training and prediction times. This shows extensive scope for improvement in their present structures.

3 Proposed Ensemble Architecture

In this section, the authors present an ensemble-based architecture for malware detection. Specifically, a shallow convolution-based neural network architecture is combined with the lazy learning technique of k-nearest neighbors (KNN) to generate the predictions. This architecture presents the advantages of faster convergence and faster prediction and is less prone to overfitting due to its inherently reduced complexity. The detailed architectural diagram of the adaptable ensemble is shown in Fig. 1. The further details have been divided into sections for easier understanding.

3.1 Convolution Neural Network

We have used a small architecture of the traditional convolution neural network as a feature extraction pipeline for the binary images. The preprocessing steps applied to the images to reduce the complexity of the calculations are as follows:

• I. Resizing: The images are resized to 32 × 32 × 3 from their variable sizes.
• II. Flattening: The RGB image is also converted into a grayscale image.
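A rough sketch of the two preprocessing steps, assuming nearest-neighbour resampling and the standard luminosity weights for grayscale conversion (the paper does not specify either choice; a production pipeline would typically use PIL or OpenCV instead):

```python
import numpy as np

def resize_nearest(img, out_h=32, out_w=32):
    """Step I: resize a variable-sized image by nearest-neighbour sampling."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return img[rows][:, cols]

def to_grayscale(img_rgb):
    """Step II: collapse the RGB channels to a single grayscale channel."""
    return img_rgb @ np.array([0.299, 0.587, 0.114])

binary_image = np.random.rand(100, 64, 3)          # variable-sized input image
prepared = to_grayscale(resize_nearest(binary_image))
print(prepared.shape)  # (32, 32)
```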

Fig. 1 Adaptable ensemble architecture


Fig. 2 a Input image b CNN layer 1 visualization c CNN layer 2 visualization d CNN layer 3 visualization

The notable change from the traditional approach of building a CNN model is the use of a leaky ReLU [20] activation layer after each CNN layer. This gives an advantage in tackling the dying ReLU problem and boosts training: the model converges in less time than is observed with common activation functions. The formula of leaky ReLU [20] is stated in the equation below:

leaky_output = 0.01 · I, if I ≤ 0; I, if I > 0    (1)

where I is the input and leaky_output is the output of the leaky ReLU [20] activation function.

CNN Setup: Our CNN architecture has three layers. The first layer has 32 filters, followed by a leaky ReLU [20] layer, which feeds a max-pooling layer. The second layer has the same combination with 16 filters. Lastly, the final layer has 8 filters. All three layers use filters of size three with a stride of one. The final layer is flattened and connected to a dense layer of 256 nodes, which is connected to a softmax output layer of 25 nodes. The CNN feature visualization of the three layers is shown in Fig. 2. The CNN learning algorithm is stated below, and its notation table is given in Table 1.
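Eq. (1) and the layer geometry above can be checked with a small sketch; the 2 × 2 max-pool window is an assumption, since the paper states only the filter size and stride:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Eq. (1): identity for positive inputs, a small slope of 0.01 otherwise."""
    return np.where(x > 0, x, alpha * x)

def conv_out(size, fi=3, sl=1):
    """'Valid' convolution output size, as in line 1 of Algorithm 1."""
    return (size - fi) // sl + 1

# Trace the 32x32 input through the three conv (32/16/8 filters) + pool stages.
size = 32
for _ in range(3):
    size = conv_out(size) // 2   # 3x3 conv, then assumed 2x2 max-pool
print(size)                      # spatial size fed to the flatten layer -> 2
print(leaky_relu(np.array([-2.0, 3.0])))  # negatives scaled by 0.01
```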

3.2 K Nearest Neighbors

The k-nearest neighbors (KNN) algorithm [21] is a lazy, instance-based learning algorithm. It predicts the class of an unknown data point from its similarity to known points, using the K closest neighbors to determine the class. The proposed model uses Euclidean distance to determine closeness; the formula is stated below:

d_pq = √((p1 − q1)² + (p2 − q2)²)    (2)


Table 1 CNN notation table

Notation     Meaning
no_c         Number of channels
i_ht         Image height
i_wt         Image width
fi           Filter size
ht_u         Height units
wt_u         Width units
α            Alpha (learning rate)
wt           Weights
Sl           Stride
b            Bias
ve_s         Vertical start of image
ht_s         Horizontal start of image
Wt_fl        Weight of fully connected layer
bi_fl        Bias of fully connected layer
fl_output    Fully connected layer output
true_label   Actual label
loss         Calculated loss
dtwfl        Derivative of weight update
.T           Transpose
*            Element-wise product
dtbfl        Derivative of bias update
fl           Flattened output of CNN
tra_data     Training data
ve_e         Vertical end of image
ht_e         Horizontal end of image

Here, p is the coordinate vector of the unknown data point, and q is the coordinate vector of a data point whose class is known. The smaller the distance between the points, the greater the similarity between them. The advantages of using this algorithm are as follows:

• The number of trainable parameters is very small, which reduces the complexity of the model.
• Being a lazy learner, it computes distances at run time and is stochastic in nature. This gives the advantage of less training time and a shallow, less complex model. The complexity of computation is also reduced, which directly reduces the prediction time.
• The similarity-based determination technique of the algorithm also helps detect new variants of existing viruses, which gives a sense of adaptability. This cannot be observed in traditional architectures.

3.3 Ensembling

The ensembling concept used by the proposed approach is as follows. First, the image passes through the convolution neural network architecture, which extracts the image's features. The dense layer's intermediate features are then passed


Algorithm 1 CNN learning algorithm
Input: cnn_filter, last_layer_parameters, wt, α
Output: FCNN layer features
1: ht_u = int((i_ht − fi)/Sl) + 1
2: wt_u = int((i_wt − fi)/Sl) + 1
3: for ht=1, wt=1, ch=1 to ht_u, wt_u, no_c do {Convolution layer}
4:   slice = last_layer[ve_s:ve_e, ht_s:ht_e, :]
5:   output = sum(slice * wt) + b
6: end for
7: for ht=1, wt=1, ch=1 to ht_u, wt_u, no_c do {Max-pooling layer}
8:   slice = last_layer[ve_s:ve_e, ht_s:ht_e, :]
9:   output = max(slice)
10: end for
11: output = 0.01 * output {Leaky ReLU layer, for output ≤ 0}
12: for ht=1, wt=1, ch=1 to ht_u, wt_u, no_c do {Backward propagation}
13:   slice = last_layer[ve_s:ve_e, ht_s:ht_e, :]
14:   dtw += slice * output[ht, wt, ch] {Weight updates}
15:   dtb += output[ht, wt, ch] {Bias updates}
16: end for
17: wt −= α * dtw
18: b −= α * dtb
19: Flatten the output
20: fl_output = Wt_fl * fl + bi_fl {Fully connected forward propagation}
21: loss = fl_output − true_label {Fully connected backward propagation}
22: dtwfl = loss.fl.T
23: dtfl = Wt_fl.T.loss
24: dtbfl = loss
25: Wt_fl −= α * dtwfl
26: bi_fl −= α * dtbfl
27: Output fl as features

Table 2 Ensemble notation table

Notation      Meaning
min_dist      Minimum distance
neigh         Closest neighbors
fl_tra_data   Training data with fully connected features
unknown       Unknown point being considered

through a KNN model to determine the output class of the binary image. The ensemble-based approach, in general, gives the advantage of reducing complexity by combining two weak learners to generate one strong learner. This, in turn, gives a generalized model rather than one biased toward the training data. The ensemble algorithm is stated below, while the symbol notation table is stated in Table 2.


Algorithm 2 Ensemble learning algorithm
Input: fl_tra_data
Output: Output class
1: for i=0 to size of fl_tra_data do {KNN training}
2:   plot fl_tra_data[i]
3: end for
4: min_dist = 0
5: neigh = 0
6: for i in fl_tra_data do {KNN prediction}
7:   if (unknown[i] − fl_tra_data[i])² ≤ min_dist then
8:     min_dist = (unknown[i] − fl_tra_data[i])²
9:     neigh = neigh + 1 {Repeat until the closest 5 neighbours have been found}
10:   end if
11: end for
12: return class of maximum neigh {Return the class with the most neighbours}
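Algorithm 2 can also be written compactly in, e.g., NumPy. In this sketch, the variable names follow Table 2, k = 5 matches the experimental setup, and the toy feature vectors and family labels are purely illustrative; the k nearest training features are found by the Euclidean distance of Eq. (2) and the majority class is returned:

```python
import numpy as np
from collections import Counter

def knn_predict(unknown, fl_tra_data, tra_labels, k=5):
    """Majority vote over the k nearest training features (Euclidean, Eq. 2)."""
    dists = np.linalg.norm(fl_tra_data - unknown, axis=1)  # distance to each point
    nearest = np.argsort(dists)[:k]                        # indices of k closest
    votes = Counter(tra_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy fully connected features: two well-separated malware "families".
fl_tra_data = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
                        [5.0, 5.1], [5.1, 5.0], [5.2, 5.1]])
tra_labels = ["family_A"] * 3 + ["family_B"] * 3
print(knn_predict(np.array([0.1, 0.1]), fl_tra_data, tra_labels, k=3))  # family_A
```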

4 Experimental Setup

The input image after resizing is 32 × 32. The convolution neural network model is trained using the Adam optimizer. As the problem is a multi-class classification problem, we have used the categorical cross-entropy loss function. The learning rate was set to 0.01, while the batch size was 64. Ten epochs were used to train the CNN model. The dense layer weights were used as the feature extraction step for further processing. The k value for the KNN model used for the final classification was set to 5. The whole architecture was trained on an Nvidia GeForce 1060Ti 6 GB GPU.
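The multi-class cross-entropy loss used here can be sketched as follows; this is a minimal NumPy version for illustration (in practice the framework's built-in loss would be used during training):

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Mean multi-class cross-entropy: -sum_c y_c * log(p_c), averaged over the batch."""
    p = np.clip(y_pred, eps, 1.0)   # guard against log(0)
    return float(-np.mean(np.sum(y_true * np.log(p), axis=1)))

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot labels
y_pred = np.array([[0.9, 0.1], [0.2, 0.8]])   # softmax outputs
print(round(categorical_crossentropy(y_true, y_pred), 4))  # 0.1643
```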

Fig. 3 a Adialer.C visualization b Allaple.A visualization c Fakerean visualization

Table 3 Dataset description

Virus family     No. of images  Training samples  Precision  Recall  F1 score
Adialer.C        122            94                1.00       1.00    1.00
Lolyda.AA1       231            186               1.00       1.00    1.00
Agent.FYI        116            93                1.00       1.00    1.00
Lolyda.AA2       284            229               0.70       0.88    0.78
Allaple.A        2949           2361              1.00       1.00    1.00
Lolyda.AA3       123            98                1.00       1.00    1.00
Allaple.L        1591           1274              1.00       1.00    1.00
Lolyda.AT        159            127               1.00       1.00    1.00
Alueron.gen!J    198            160               1.00       1.00    1.00
Malex.gen!J      136            109               1.00       1.00    1.00
Autorun.K        106            86                1.00       1.00    1.00
Obfuscator.AD    142            113               1.00       1.00    1.00
C2LOP.gen!g      200            160               1.00       1.00    1.00
Rbot!gen         158            126               1.00       0.95    0.97
C2LOP.P          146            118               0.92       1.00    0.96
Skintrim.N       80             44                0.95       1.00    0.97
Dialplatform.B   177            143               1.00       1.00    1.00
Swizzor.gen!E    128            102               1.00       1.00    1.00
Dontovo.A        162            131               1.00       1.00    1.00
Swizzor.gen!I    132            105               1.00       1.00    1.00
Fakerean         381            306               1.00       1.00    1.00
VB.AT            408            326               1.00       1.00    1.00
Instantaccess    431            346               0.90       0.75    0.82
Wintrim.BX       97             65                1.00       1.00    1.00
Yuner.A          682            545               1.00       1.00    1.00

5 Dataset Description and Result Analysis

The proposed model is tested on the Malimg dataset. It comprises 9339 samples belonging to 25 classes. Figure 3 shows a few examples present in the dataset, while Table 3 states the per-class distribution and training samples of the data. The data is further divided into an 80% training set and a 20% testing set. The training data consisted of 7470 images, while the test data consisted of 1899 images. Table 3 describes the training data distribution among the various malware families. The model was trained on the training data using the experimental setup mentioned above and evaluated on the test data. The loss and accuracy graphs attained during CNN training are shown in Fig. 4, which demonstrates that the CNN model converged in a small number of epochs due to the use of the leaky ReLU activation function layer after each convolution layer.
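The 80/20 split can be sketched as a shuffled index partition; the seed and the exact shuffling scheme are assumptions, since the paper states only the ratio:

```python
import numpy as np

def split_80_20(n_samples, seed=0):
    """Shuffle sample indices and cut them into an 80% train / 20% test split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    cut = int(0.8 * n_samples)
    return idx[:cut], idx[cut:]

train_idx, test_idx = split_80_20(9339)   # the Malimg dataset size
print(len(train_idx), len(test_idx))
```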


Fig. 4 a Loss versus epoch b accuracy versus epoch

Table 4 Accuracy comparison

Architecture name   VGG-16 [15]   Resnet-50 [13]   Proposed CNN+KNN
Accuracy            98.99%        98.62%           99.63%

We now compare the proposed architecture with the state-of-the-art models Resnet-50 [13] and VGG-16 [15]. Table 4 shows the accuracy comparison between the models; accuracy is the ratio of correct predictions to the total number of predictions. It can be observed that, though our proposed ensemble model has reduced complexity, it has outperformed the heavily complex architectures by a good margin and achieved a quality result of over 99% accuracy. A visible observation about the dataset is that the samples per class in the training data are imbalanced; thus, accuracy by itself cannot prove the authenticity of the model. Therefore, the model was further evaluated using metrics more closely related to the per-class distribution, and the results are stated in Table 3. Here, precision is the fraction of samples predicted for a class that truly belong to that class, while recall is the fraction of samples of a class that are predicted correctly by the model. Lastly, the F1-score, considered the traditional metric for any classification task, is the harmonic mean of precision and recall. Though a large imbalance is present between classes, it can be observed from the precision, recall, and F1-scores that the model performed equally well for all classes with no class-balancing preprocessing technique. This conveys the ability to detect malware using a small amount of training data. It also suggests the flexible and adaptable nature of the architecture with respect to new variants of existing malware families. Trainable parameters are an important measure of a model's complexity; we compare our model with the earlier-mentioned standard models in Table 6. Along with this, we also present our analysis of the time taken by each of the architectures to predict an unknown sample; Table 5 states these results.
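The per-class metrics of Table 3 are tied together by the harmonic-mean definition of the F1-score, which can be verified directly (the specific precision/recall pairs below are taken from the flattened table values, with the row pairing recovered as stated above):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.70, 0.88), 2))  # 0.78, matching one Table 3 row
print(round(f1_score(0.90, 0.75), 2))  # 0.82, matching another Table 3 row
```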

Table 5 Prediction time comparison

Architecture name   Prediction time (s)
VGG-16              1.000
Resnet-50           0.534
CNN+KNN             0.122

Table 6 Trainable parameters comparison

Architecture name   Trainable parameters
VGG-16              138 M
Resnet-50           25.6 M
CNN+KNN             370 K

This clearly shows that our proposed ensemble architecture predicts an unknown sample much faster than traditional CNN architectures. This is a notable result, as these mere milliseconds can save a corporate data-rich system from malware attacks that would result in heavy data loss. Along with this, it can be observed that our architecture, though performing better than the traditional architectures, requires far fewer trainable parameters due to the small number of parameters needed by k-nearest neighbors, which is boosted by the shallow CNN architecture.

6 Conclusion

Using our proposed ensemble architecture, we have achieved a notable boost in metrics over traditional deep learning architectures in performing malware classification, even in classes with fewer training examples. This also suggests that an adaptable system for new variants of malware can be achieved by combining unsupervised or lazy learning algorithms with traditional supervised deep learning techniques, which is a critical need for such systems. Newer large CNN models or heavy ensembles may outperform our metrics; however, such architectures' major downsides are high complexity, bias toward detecting existing malware families present in the training data, and longer prediction time. Our proposed model outperforms such models in these aspects and achieves a reduction of about 900 ms compared to traditional architectures in predicting an unknown sample, a significant difference in a task as demanding as malware detection. Future work includes using a similar combination of supervised and unsupervised algorithms to develop an adaptable system that detects a large range of malware variants, not limited to the families considered in the dataset. Another aspect that can be explored is the use of a multimodal architecture, in which features based on cybersecurity concepts are used alongside the image, which can help the model detect differences between the families


of malware in a much more fundamental way. Hence, a faster, adaptable architecture like ours, built specifically by combining supervised deep learning and unsupervised learning algorithms, is certainly a step toward a large-scale single expert system for detecting malware.

References

1. AV-TEST GmbH, Malware Statistics (2020). https://www.av-test.org/en/statistics/malware.html. Accessed 13 Mar 2020
2. D. Lackey, Data Statistics (2019). https://blazon.online/data-marketing/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read.html. Accessed 13 Mar 2020
3. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 27–30 June 2016
4. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556 (2015)
5. V.P.V. Laxmi, M.S. Gaur, Survey on malware detection methods, in Proceedings of the 2nd Annual India Software Engineering Conference, 23–26 Feb 2009
6. I. Santos, F. Brezo, J. Nieves, Y. Penya, B. Sanz, C. Laorden, P. Bringas, Idea: opcode-sequence-based malware detection, in Proceedings of the 2nd International Symposium on Engineering Secure Software and Systems (2010), pp. 35–43
7. A. Tang, S. Sethumadhavan, S.J. Stolfo, Unsupervised anomaly-based malware detection using hardware features, in Research in Attacks, Intrusions and Defenses (2014), pp. 109–129
8. K. Kemalis, T. Tzouramanis, SQL-IDS: a specification-based approach for SQL-injection detection, in Proceedings of the ACM Symposium on Applied Computing (2008), pp. 2153–2158
9. A. Chaugule, Z. Xu, S. Zhu, A specification based intrusion detection framework for mobile phones, in ACNS’11: Proceedings of the 9th International Conference on Applied Cryptography and Network Security (2011), pp. 19–37
10. A. Makandar, A. Patrot, Malware class recognition using image processing techniques, in International Conference on Data Management, Analytics and Innovation, 24–26 Feb 2017
11. D.R. Pinto, J.C. Duarte, R. Sant’Ana, A deep learning approach to the malware classification problem using autoencoders, in SBSI’19: Proceedings of the XV Brazilian Symposium on Information Systems, 20(1), pp. 1–8 (2019)
12. B. Jung, T.G. Kim, E. Im, Malware classification using byte sequence information, in RACS ’18: Proceedings of the 2018 Conference on Research in Adaptive and Convergent Systems (2018), pp. 143–148
13. A. Singh, A. Handa, N. Kumar, S.K. Shukla, Malware classification using image representation, in Cyber Security Cryptography and Machine Learning (2019), pp. 75–92
14. W.W. Lo, X. Yang, Y. Wang, An Xception convolutional neural network for malware classification with transfer learning, in 10th IFIP International Conference on New Technologies, Mobility and Security (NTMS), 24–26 June 2019
15. E. Rezende, G. Ruppert, T. Carvalho, A. Theophilo, F. Ramos, P. de Geus, Malicious software classification using VGG16 deep neural network’s bottleneck features, in Information Technology—New Generations (2018), pp. 51–59
16. L. Nataraj, B.S. Manjunath, SPAM: signal processing to analyze malware. IEEE Signal Process. Mag. 33(2), 105–117 (2016)
17. G. Rajesh Kumar, N. Mangathayaru, G. Narasimha, Similarity function for intrusion detection, in ICEMIS’19: Proceedings of the 5th International Conference on Engineering and MIS, vol. 28 (2019), pp. 1–4
18. H. Zhou, Malware detection with neural network using combined features, in CNCERT 2018: Cyber Security (2018), pp. 96–106

An Adaptable Ensemble Architecture for Malware Detection

659

19. K. Sethi, S.K. Chaudhary, B.K. Tripathy, P. Bera, A novel malware analysis for malware detection and classification using machine learning algorithms, in SIN’17: Proceedings of the 10th International Conference on Security of Information and Networks (2017), pp. 107–113 20. A.L. Maas, A.Y. Hannun, A.Y. Ng, Rectifier nonlinearities improveneural network acoustic models, in ICML, vol. 30 (2013) 21. T. Cover, P. Hart, Nearest neighbor pattern classification. IEEE Trans. Inf.Theory 13(2), 21–27 (1972)

An Application of Deep Learning in Identification of Depression Among Twitter Users

Ashutosh Shankdhar, Rishik Mishra, and Nitya Shukla

Abstract Depression is one of the most common mental disorders, with millions of people suffering from it. It has been found to have an impact on the texts written by those affected. In this study, our main aim was to utilize tweets to predict the possibility of a user being at risk of depression through the use of natural language processing (NLP) tools and deep learning algorithms. An LSTM has been used as a baseline model, resulting in an accuracy of 95.12% and an F1 score of 0.9436. We implemented a hybrid BiLSTM + CNN model, trained on embeddings learned from the tweet dataset, which was able to improve upon previous works and produce precision and recall of 0.9943 and 0.9988, respectively, giving an F1 score of 0.9971.

Keywords Natural language processing · Twitter · Deep learning · Depression · BiLSTM + CNN · Social network

1 Introduction

Depression is widespread, with close to 264 million people affected all around the world [1]. Unlike usual mood fluctuations and short-term emotional responses to problems, depression may become a serious health condition. It has been shown to reduce productivity at the workplace, in schools and even in familial matters, owing to the constant suffering it brings. In extreme cases, it may even lead to suicide, which is the second leading cause of death in 15–29-year-olds. Approximately

A. Shankdhar · R. Mishra (B) · N. Shukla
Department of Computer Engineering and Applications, GLA University, Mathura, Uttar Pradesh 281406, India
e-mail: [email protected]
A. Shankdhar e-mail: [email protected]
N. Shukla e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_54


800,000 people lose their lives to suicide every year, which is one person every 40 seconds [2]. Depression results from complex interactions among social, mental, and biological variables. Individuals who have experienced negative life events such as joblessness, disappointment, or mental injury are more likely to develop depression [3]. Depression can, in turn, lead to more stress and anxiety and negatively influence the individual's life situation [1]. Health outcomes in today's world are often shaped by social interactions. A majority of social interactions now take place on the Internet through social media and social networking sites, as evidenced by the number of active social media users reaching 3.8 billion as of July 2020, coupled with a 10.5% yearly increase in users [4]. These interactions can have negative implications for the mental health and behavior of the subjects. This study builds upon the hypothesis that sentiments and mental state are reflected in a user's tweets. The abundance of textual data available on Twitter has led many researchers to use the micro-blogging site's data for classification tasks, including depression classification; most rely upon recurrent neural networks or manual feature extraction along with machine learning algorithms such as the support vector machine (SVM). In this paper, we analyze users' tweets to make predictions about the possibility of a person suffering from depression. The task of identifying depression through natural language processing (NLP) is more convoluted than simple sentiment analysis, as there is no distinction between depressed and non-depressed as clear as that between being sad and being happy, and this makes feature extraction harder. In this study, we focus on using a large sample of user data for training and on using deep learning for feature extraction instead of manual feature extraction.
A multilayer convolutional neural network (CNN) architecture is used in this study to extract features from the word vectors and to classify between two classes, namely depressive and non-depressive. With this, we can draw insights from the textual data available on Twitter and utilize those insights to make accurate predictions.

2 Literature Review

Social media data adds an additional level of complexity to classification tasks due to widely varying contexts and frequent misspellings, along with the literary freedom enjoyed by users. A lot of previous work has been done to identify depression using social media data, with most approaches utilizing traditional feature engineering techniques. The study by Choudhury et al. [5] successfully exhibited the potential of social media data in predicting as well as measuring major depression in people, using an SVM along with highly specialized features collected from 1583 crowdsourced users' questionnaires and social media data. Nadeem et al. [6] moved on from considering the problem of depression detection


from a behavioral one to a text classification one, using a corpus of 2.5 M tweets and employing a bag-of-words approach and statistical modeling techniques to achieve an accuracy of 86% with a naive Bayes classifier. Various studies worked to improve the classification by employing different statistical methods; in 2018, Islam et al. [7] used machine learning techniques to detect depression in Facebook users. They show that their proposed method can significantly improve the accuracy and classification error rate; in addition, their results show that, across experiments, the decision tree (DT) gives higher accuracy than other ML approaches. Many further studies employed n-gram models and deep learning for the classification task, such as long short-term memory (LSTM) or other recurrent neural networks (RNNs), mostly using a bag-of-words approach. Ahmad et al. [8] showed the improvements that bidirectional long short-term memory (BiLSTM) offers over LSTMs in the classification of depressive disorders, due to its ability to use both future and past context; this resulted in 93% accuracy. Our study builds upon the work done by the above-mentioned authors and aims to include a wider range of tweets to be classified, including emojis and tweets with wide semantic and syntactic variation. The present work aims to provide a deployable system with high accuracy and to utilize the latest advances to improve upon previous results.

3 Dataset

For our study, we have used the dataset constructed by Shen et al. [9], who collected the data of Twitter users through the widely used APIs. The data were collected for the years 2009 through 2016. The first criterion for classifying a tweet or user as positive is the mention of the phrase "(I'm/I was/I am/I've been) diagnosed depression"; with this criterion, they were able to collect the tweets of 1402 depressed users. These tweets were preprocessed, and a similar dataset for non-depressed users was also constructed.

4 Methodology

The method used in this work is divided into two stages: data preprocessing and text classification. The modules have been divided to reduce the time required to process the training data and also to reduce the redundancy that arises from processing the same data after every run.


4.1 Data Preprocessing

Before feeding in the data, we observed that the words in the raw data are contextual and abstract, which causes great challenges in word matching and semantic analysis. Therefore, we carried out the following data preprocessing procedures:

Emoji Processing. Emojis require an additional level of preprocessing, as their embedding is disparate from the corpus used to train the fasttext word embeddings. We therefore removed the emojis contained in the data by utilizing an emoji library collected from Twitter [10] and replaced each emoji with its textual meaning (Table 1).

General Preprocessing. To reduce the size of the corpus, we clip or stem the various tenses and forms of the words contained in the dataset. This leads to faster processing and, in turn, faster training; we can stem the words because the tenses and forms of a word in the corpus are semantically similar and hence can be losslessly converted into the same vector notation. We also pad each sequence with empty strings to a length of 65 to get the same input shape for the text data. Tokenization is also performed on the separate tweets to aid in vectorization of the texts (Table 2).

Word Embedding. For the word embedding, we have opted to use the pre-trained fasttext word embedding. Fasttext works well with rare words: even if a word wasn't seen during training, it can be broken down into n-grams to get its embedding. Word2vec and GloVe both lack any vector representation for words not in the model dictionary, so this is a huge advantage of this method.

Table 1 Example of emoji processing
Before processing: I'm so hungry. [emoji] With training again it's going to zero gains skinny body no shape and eating a whole lot! I will be done in September [emoji]
After processing: I'm so hungry. Face with rolling eyes With training again it's going to zero gains skinny body no shape and eating a whole lot! I will be done in September flexed biceps

Table 2 Text after tokenization and stemming
Before processing: I'm so hungry. Face with rolling eyes with training again its going to zero gains skinny body no shape and eating a whole lot! I will be done in September flexed biceps
After processing: [I, hungry, face, roll, eye, train, again, go, zero, gain, skinny, body, shape, eat, whole, lot, done, september, flex, bicep]
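The preprocessing steps above (emoji replacement, stemming, stopword removal, padding to length 65) can be sketched as follows. The emoji dictionary, stem map, and stopword list here are tiny illustrative stand-ins; the actual work uses the emoji library of [10] and a full stemmer:

```python
import re

# Toy stand-ins for the emoji library and stemmer (assumptions for illustration)
EMOJI_MEANINGS = {"\U0001F644": "face with rolling eyes",
                  "\U0001F4AA": "flexed biceps"}
STEMS = {"training": "train", "eating": "eat", "gains": "gain",
         "eyes": "eye", "rolling": "roll", "going": "go"}
STOPWORDS = {"m", "so", "with", "its", "to", "a", "and", "no", "will", "be", "in"}
MAX_LEN = 65  # sequences are padded to length 65


def preprocess(tweet):
    # Emoji processing: replace each emoji with its textual meaning
    for emoji, meaning in EMOJI_MEANINGS.items():
        tweet = tweet.replace(emoji, " " + meaning + " ")
    # Tokenization on lowercase alphabetic runs
    tokens = re.findall(r"[a-z]+", tweet.lower())
    # Stopword removal and stemming
    tokens = [STEMS.get(t, t) for t in tokens if t not in STOPWORDS]
    # Pad with empty strings to a fixed length of 65
    return tokens + [""] * (MAX_LEN - len(tokens))


padded = preprocess("I'm so hungry. \U0001F644 With training again")
```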


4.2 Text Classification

The following section details the implementation of a random forest classifier (RFC), which is used as a baseline model for the classification of the text, followed by the BiLSTM + CNN model which we use as the improved model.

Baseline Model. For the baseline model, we have used a random forest classifier (RFC) with 62 estimators, with the quality of a split measured by entropy and a maximum depth of 10. Since tweets are very short and would not be ideal for feature extraction, we have used the wiki data for the TF-IDF feature extraction and used the IDF scores for feature selection before feeding the features into the RFC.

BiLSTM + CNN Model. The LSTM layer is used in our model for encoding the natural language features, which it has proved effective at, while the CNN layers are used to encode the categorical features, which they have in turn proven effective at in classification tasks. For the neural network approach, we use a bidirectional LSTM [11] with 64 units that feeds into two further convolutional (CNN) layers, both with a kernel size of k = 3 and filter sizes f of 32 and 64. The outputs of the convolutional layers are locally max-pooled, and the resulting encoded features are fed into fully connected layers with units d ∈ {32, 16, 8}. This is followed by the final predicting layer. Except for the final layer, all layers use the rectified linear unit (ReLU) activation function [12]. A dropout layer with a dropout rate of 0.2 is applied to the CNN layers for regularization. The final layer, due to the binary nature of the classification, uses the sigmoid activation function. Training uses the "adam" optimizer and binary cross-entropy as the loss function. This method utilizes the pre-trained fasttext word embedding, which had been trained on the wiki corpus.
This has been useful as it is able to work with words not seen before, and these embeddings are used as the initial weights for the input embedding layer (Fig. 1).
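The described architecture can be sketched in Keras. The vocabulary size, pooling window, and padding mode are illustrative assumptions (the paper does not state them), and the embedding layer is randomly initialized here for brevity, whereas the paper initializes it with pre-trained fasttext vectors:

```python
from tensorflow.keras import Input, layers, models

MAX_LEN = 65        # tweets are padded to length 65 (Sect. 4.1)
EMBED_DIM = 300     # fasttext wiki vectors are 300-dimensional
VOCAB_SIZE = 5000   # hypothetical vocabulary size for illustration

model = models.Sequential([
    Input(shape=(MAX_LEN,)),
    # Initialized from pre-trained fasttext vectors in the paper
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # BiLSTM with 64 units, returning the full sequence for the CNN layers
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Conv1D(32, 3, activation="relu"),   # k = 3, f = 32
    layers.MaxPooling1D(2),                    # local max-pooling (window assumed)
    layers.Conv1D(64, 3, activation="relu"),   # k = 3, f = 64
    layers.MaxPooling1D(2),
    layers.Dropout(0.2),                       # regularization on the CNN output
    layers.Flatten(),
    layers.Dense(32, activation="relu"),       # fully connected d = 32, 16, 8
    layers.Dense(16, activation="relu"),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),     # binary: depressive vs. non-depressive
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```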

5 Experimental Results

The baseline model and the BiLSTM + CNN model were both trained on the same set of text data, but with different embedding techniques. The difference between the two models' performances with the two types of word embedding (namely TF-IDF and fasttext) was measured through their F1 scores and found to be negligible, but the BiLSTM + CNN model was still made to use fasttext due to the availability of pre-trained multilingual embeddings, which could help further research on this topic extend to multiple languages.


Fig. 1 BiLSTM + CNN model architecture with forward propagation with each CNN layer followed by a max-pooling layer and using a rectified linear unit (ReLU) activation function

The training was done on a workstation with an Intel i7 processor and 32 GB of RAM, along with an RTX 2060 GPU with 6 GB of VRAM.

5.1 Baseline Model Training

The baseline model has been implemented using the sklearn framework available in Python. The random forest classifier used in the training was fed the encoded vectors from the TF-IDF scores; this resulted in a precision of 65.58% and a recall of 57.25%, leading to an F1 score of 0.6355. This score is then used as a baseline for further models.
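A minimal sketch of this baseline with scikit-learn, using the hyperparameters stated above (62 estimators, entropy criterion, maximum depth 10). The toy corpus and labels are hypothetical; the paper derives TF-IDF features from wiki data rather than the tweets themselves:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: 1 = depressive, 0 = non-depressive
texts = ["i feel hopeless and empty inside",
         "great workout at the gym this morning",
         "nothing matters anymore i am so tired",
         "excited for the weekend with friends"]
labels = [1, 0, 1, 0]

# TF-IDF features for the classifier
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)

# RFC with the paper's stated hyperparameters
clf = RandomForestClassifier(n_estimators=62, criterion="entropy",
                             max_depth=10, random_state=0)
clf.fit(features, labels)
predictions = clf.predict(features)
```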


5.2 BiLSTM + CNN Training

The hybrid model (BiLSTM + CNN) was trained on word embeddings from the fasttext framework, with the initial weights being the pre-trained embeddings from the wiki corpus. The model was trained on a training set of 11,085 preprocessed tweets embedded using fasttext, 2% of which were used as validation data. It was trained for 42 epochs with fourfold cross validation, resulting in a training accuracy of 99.98% and a validation accuracy of 99.77%. The trained model was used to predict the labels for the test dataset containing 1189 tweets, resulting in an accuracy of 99.74% with an F1 score of 0.9921. Along with the two above-mentioned models, we trained two other models, namely CNN and LSTM, on the same training dataset of 11,085 tweets. A clear comparison of the models' AUC scores can be seen in Fig. 2, and the accuracy metrics are listed in Table 3. It is clear from Table 3 that the BiLSTM + CNN model performs better than using only LSTM or CNN; this can be attributed to the better feature extraction capability of BiLSTM and CNNs used together. BiLSTMs can store information for both the forward and backward directions, thus improving over LSTM and providing better sequential features, while the inclusion of CNN improves the extraction of dimensional features as well, providing overall better accuracy and precision than using either CNN or LSTM alone.

Fig. 2 ROC curves for various models trained on the same training data showing the clear advantage of using CNN over LSTM and also showing the slight difference between CNN and BiLSTM + CNN model

Table 3 Model comparison based on the accuracy metrics, showing the better performance of BiLSTM + CNN over the other models

Model          | F1 score | AUC score | Precision
BiLSTM + CNN   | 0.9921   | 0.991     | 0.9943
CNN            | 0.9823   | 0.988     | 0.9811
LSTM           | 0.9436   | 0.949     | 0.893
Random forest  | 0.623    | 0.696     | 0.452

The bold entries in the table depict the best performance score among the tested models; in this case, BiLSTM + CNN performs the best.

6 Conclusion

The BiLSTM + CNN model with fasttext word embeddings was shown to provide better results, with a larger area under the curve than the other classifiers, due to the BiLSTM layers extracting better sequence features from the text data and the CNN providing better dimensional features for classification. The proposed model achieved an F1 score of 0.9971 with a precision of 99.31% using the fasttext embedding, and an F1 score of 0.9886 for BiLSTM + CNN with the word2vec approach. The study provides evidence of the ability of an autonomous system to identify depressive users and a base for further research on improving the outcome through the use of a larger and better-annotated dataset. These results show a promising future for identifying and aiding depressive users without explicit human intervention, thus ensuring users' privacy.

References

1. L.S. Radloff, The CES-D scale: a self-report depression scale for research in the general population. Appl. Psychol. Meas. 1(3), 385–401 (2019). https://doi.org/10.1177/014662167700100306
2. R.L. Spitzer, K. Kroenke, J.B. William, Validation and utility of a self-report version of PRIME-MD: the PHQ primary care study. Primary care evaluation of mental disorders. Patient health questionnaire. JAMA 282(18), 1737–1744 (1999). https://doi.org/10.1001/jama.282.18.1737
3. R.L. Spitzer, J.B. Williams, K. Kroenke, R. Hornyak, Validity and utility of the PRIME-MD patient health questionnaire in assessment of 3000 obstetric-gynecologic patients: the PRIME-MD patient health questionnaire obstetrics-gynecology study. Am. J. Obstet. Gynecol. 183(3), 759–769 (2000). https://doi.org/10.1067/mob.2000.106580
4. A. Bonner, You Are What You Tweet, 10 Aug 2019. https://towardsdatascience.com/you-are-what-you-tweet-7e23fb84f4ed
5. M. Gamon, M. Choudhury, S. Counts, E. Horvitz, Predicting depression via social media. Association for the Advancement of Artificial Intelligence (2013)
6. M. Nadeem, M. Horn, G. Coppersmith, S. Sen, Identifying depression on Twitter (2016)
7. M.R. Islam, M.A. Kabir, A. Ahmed, Depression detection from social network data using machine learning techniques. Health Inf. Sci. Syst. 6(1), 8 (2018). https://doi.org/10.1007/s13755-018-0046-0


8. H. Ahmad, M. Asghar, F. Alotaibi, I. Hameed, Applying deep learning technique for depression classification in social media text. J. Med. Imaging Health Inform. 10(6), 2446–2451 (2020). https://doi.org/10.1166/jmihi.2020.3169
9. G. Shen, Depression detection via harvesting social media: a multimodal dictionary learning solution, in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) (2017), pp. 3838–3844. https://doi.org/10.24963/ijcai.2017/536
10. P.A. Cavazos-Rehg, M.J. Krauss, S. Sowles, S. Connolly, A content analysis of depression-related tweets. Comput. Hum. Behav. 54, 351–357 (2016). https://doi.org/10.1016/j.chb.2015.08.023
11. S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
12. A.L. Maas, A.Y. Hannun, A.Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in Proceedings of the 30th International Conference on Machine Learning, 28 (2013)

Performance Evaluation of LSB Sequential and Pixel Indicator Algorithms in Image Steganography

Jabed Al Faysal and Khalid Mahbub Jahan

Abstract Digital photo images are everywhere around us over the Internet. Due to the advancements in digital cameras, photo images have become a rich subject of manipulation. Images are used as cover media by image-based steganography to conceal hidden data. The objective of the approach is to hide a message in an image without raising any suspicion. The encoder can be used to hide a message in an image so that, when we send the image to another person over any network, it does not draw the attention of any unauthorized person to the existence of the message, while the person who has the decoder can easily extract the message from the image. The tool manipulates so small a portion of the image that the change is invisible to the naked eye, and only those who already know of the message's existence can decode it. In this way, a message can be passed to the intended receiver without revealing its existence, even over an insecure network. The common technique replaces the least significant bits (LSBs) of the image pixels with the intended hidden bits. The pixel indicator technique (PIT), on the other hand, benefits from the advantages of several older steganography methods. Here, a performance comparison among LSB sequential substitution, PIT, and coded LSB substitution is carried out over image quality, capacity, and security issues.

Keywords Image steganography · LSB substitution · Pixel indicator algorithm · Information security

1 Introduction

Steganography is the process of hiding a secret message within an ordinary one and extracting it at its destination. Steganography takes cryptography (which converts data into a format that is unreadable for an unauthorized user) a step further by hiding a

J. Al Faysal (B) Khulna University, Khulna, Bangladesh
K. M. Jahan University of Dhaka, Dhaka, Bangladesh
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_55


message so that no one suspects it exists. Ideally, anyone scanning the data will fail to notice that it contains any secret information. Digital steganography techniques can be applied to images, video files, audio files, or text files. Various algorithms are used in different scenarios of steganography. In this tool, three different algorithms from image steganography are used. It takes a BMP image for hiding any message. The length of the message is limited by the input image: it is impossible to hide a large message in a small image without changing the image visibly. The implemented algorithms are:

1. LSB sequential substitution
2. Pixel indicator method
3. Coded LSB substitution

There is an encoder which embeds the text in the image without changing the image visibly, so we can send the image to anyone over any medium without giving an unauthorized person any reason to suspect that the image contains a hidden message. The decoder part retrieves the hidden message from the encoded image. Today's world is very much dependent on information, which is passed from one party to another over various media around the globe. It is impossible to secure all of these media, as many wired and wireless media are used for global communication and they are physically spread across every corner of the world. So, to secure information, several methods have been introduced, but unfortunately there is always a risk to data security: locking the data against unauthorized access is not always enough, which is why data also have to be hidden. The main motivation for selecting steganography over cryptography is its nature of avoiding unwanted attention [1]. Cryptography works on protecting information, while steganography deals with composing hidden messages so that only the sender and the receiver know about them. In steganography, only the sender and the receiver know of the existence of the message, whereas in cryptography the existence of the encrypted message is visible to the world. Due to this, steganography removes the unwanted attention drawn to the hidden message. Cryptographic methods try to protect the content of a message, while steganography uses methods that hide both the message and its content.

2 Method Analysis

Today's digital steganography works by adding secret bits to (or replacing bits in) files such as photos or audio files. The fact that it is not widely used and is very hard to "crack" makes it even more appealing, and therefore a pretty good method of transmitting extremely sensitive personal or business information through e-mail, over the Web, or through social channels such as Twitter or Facebook [2]. In digital steganography, the following types are available:

• Text steganography—Text is used as cover for a hidden message.


• Image steganography—An image is used as cover for a hidden message.
• Audio/video steganography—An audio/video file is used as cover for a hidden message.
• Web content steganography—Web content is used as cover for a hidden message.

There are three types of stegosystem:

• Pure stegosystem—No external key is used.
• Public key stegosystem—A public key is used.
• Secret key stegosystem—A secret key is used.

In our approach, image steganography and a pure stegosystem are used, and the system is designed with the following components:

Steganography Encoder—This encodes the image with a hidden message.
Steganography Decoder—This decodes the hidden message from the encoded image (Fig. 1).

To develop a steganography tool, we first have to analyze the structure of the image. Here, the BMP format is used. This format stores pixel-wise color information: every pixel has a specific color, which is a combination of the three basic colors (red, green and blue), known as RGB (Fig. 2). Every pixel of the image has a value of red, green and blue; these values are between 0 and 255, where 0 indicates the absence of that color and 255 its maximum presence (Fig. 3). All other colors are combinations of these three colors; some examples are given in Fig. 4. To put a secret text into the image, the values of these colors are manipulated [4]. Here, only 1 LSB (least significant bit) of every color of each pixel is changed, and the message bits are put there. More bits can be changed, but the more bits are manipulated, the more the image quality changes. A digital BMP (bitmap) file is nothing but a collection of RGB colors, all three with values between 0 and 255; this can be compared with a 2-dimensional array. In digital technology, any value can be represented with 0s and 1s. As an example, the binary representation of 255 is 11111111.
All values between 0 and 255 can be represented with 8 binary digits (0 or 1). Similarly, any text is a collection of characters, each with an ASCII value between 0 and 127 (0–255 for extended

Fig. 1 Block diagram of stegosystem [3]


Fig. 2 RGB color palette

Fig. 3 RGB basic colors

Fig. 4 RGB combinations colors

ASCII). These values also have binary representations. If we replace the LSB of a single color of one pixel with one bit of the text, we can put all 8 bits of a character into 8 pixels. The change to each color value is at most 1 and at least 0; a change of one has so small an effect on the color that it cannot be detected by the naked eye. But we can reassemble the last bits to get the characters and reproduce the message. This is the main working principle of steganography. The image quality changes very little, but some other issues arise with this technique: one of them is security [5] and the other is the message capacity of the image. That is why three different algorithms are used in this tool to cope with these issues. An example of inserting a character into the image:

1. The letter 'W' is converted to its ASCII value (87).
2. 87 is represented in binary (01010111).
3. The LSBs of Red's 8 pixels R(0, 0)–R(0, 7) are replaced sequentially with the bits of the ASCII value.
4. The modified values of R(0, 0)–R(0, 7) are obtained.

Fig. 5 Insertion of letter 'W' in 8 pixels of color red
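The four insertion steps can be sketched in Python; the eight red values below are hypothetical example pixels, not taken from the paper:

```python
def embed_char_lsb(reds, ch):
    """Replace the LSB of 8 color values with the 8 bits of one character."""
    bits = [(ord(ch) >> (7 - i)) & 1 for i in range(8)]  # MSB first: 'W' -> 01010111
    return [(r & ~1) | b for r, b in zip(reds, bits)]


reds = [112, 101, 99, 104, 115, 97, 108, 100]  # hypothetical R(0,0)..R(0,7) values
stego = embed_char_lsb(reds, "W")
```

Each value changes by at most 1, which is why the modification is invisible to the naked eye.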

By this technique, all the characters' bits are substituted into the last bits of the image (Fig. 5). The characters are encoded in the encoder part: the encoder takes one input image and puts all the characters into it. From the modified color pixels, a stego image is created which looks the same as the original image but contains the secret message. We can then send the stego image via any communication medium. If any unauthorized person steals this image, he sees only the image; he can read the message only if he knows that a message is hidden in it and how to decode it. The recipient can easily decode the message with the decoder tool. Decoding is done by taking all the last bits, which contain the message bits, reassembling them to get the ASCII value, and finally converting the ASCII value to the original character:

1. The secret message bits are taken from the color values.
2. The 8 bits are converted to decimal to get the ASCII value (87).
3. The ASCII value (87) is converted to the original character ('W').

By this technique, the whole message is retrieved from the image. For entering the message into the image, three different algorithms are used. For every algorithm that adds the hidden message to the image in the encoder, there is also a corresponding method in the decoder part. A message encoded with one algorithm cannot be decoded using another algorithm's decoder (Fig. 6).


Fig. 6 Reconstruction of letter ‘W’ from 8 LSB of color red
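The three reconstruction steps can be sketched as follows; the input values are hypothetical stego bytes whose LSBs carry the character 'W':

```python
def extract_char_lsb(reds):
    """Reassemble one character from the LSBs of 8 color values."""
    value = 0
    for r in reds:
        value = (value << 1) | (r & 1)  # collect the LSBs, MSB first
    return chr(value)


recovered = extract_char_lsb([112, 101, 98, 105, 114, 97, 109, 101])
```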

3 Algorithms

Three algorithms are implemented in the tool. These are:

1. LSB sequential substitution
2. Pixel indicator method
3. Coded LSB substitution

Details of the algorithms are described below:

3.1 LSB Sequential Substitution

LSB sequential substitution is the most common algorithm. In it, all message bits are substituted sequentially into the image [6, 7]. The following steps implement this algorithm.

3.1.1 1. 2. 3.

4.

Encoder Part

Take the input BMP image and calculate the total number of pixels. Total number of pixels = Width(x) ∗ height(y) Calculate the maximum number of characters from this formula: Maximum number of character = (floor((x ∗ y − 8)/8) ∗ 3) Take the input message and calculate the number of characters. If exceed maximum number of character, it tells the user that this message is large so it can’t be hidden inside this image. If the message is not exceed the maximum number of character then it is divided into 3 parts. Three 2D array y ∗ x of red, green and bblue are created which contain pixel wise color information of the image.

Performance Evaluation of LSB Sequential and Pixel Indicator …


Fig. 7 Encoding process of LSB sequential substitution

5. A total of 16 pixels is reserved for containing the message size; no message bits are substituted in these pixels. The number of characters is substituted here.
6. The three parts of the message are sequentially substituted into the three 2D arrays. The substitution process sets the last bit of each color to 0 and then adds the message bit (0 or 1).
7. After the substitution, a stego BMP image is created from the three modified 2D arrays in the same folder as the input image, with '1' appended to the name (Fig. 7).

3.1.2 Decoder Part

1. Take the input BMP image and calculate the total number of characters (which was put there by the encoder).
2. Read the bits according to the total number of characters from the red, green and blue parts of the image.
3. Generate characters from the bits and display the whole message (Fig. 8).
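A minimal Python sketch of decoder steps 2 and 3, assuming for simplicity that the message bits were substituted sequentially into a single channel (the real tool spreads them over the red, green and blue parts):

```python
def decode_sequential(channel, n_chars):
    # Steps 2-3: read n_chars * 8 least significant bits and
    # rebuild the characters, MSB first.
    chars = []
    for c in range(n_chars):
        byte = 0
        for i in range(8):
            byte = (byte << 1) | (channel[c * 8 + i] & 1)
        chars.append(chr(byte))
    return "".join(chars)

# Hypothetical stego channel: LSBs spell 'H' (72) then 'i' (105).
bits = [0, 1, 0, 0, 1, 0, 0, 0,   # 'H'
        0, 1, 1, 0, 1, 0, 0, 1]   # 'i'
stego = [100 + b for b in bits]
print(decode_sequential(stego, 2))  # 'Hi'
```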


Fig. 8 Decoding process of LSB sequential substitution

3.2 Pixel Indicator Method

The pixel indicator method is used to increase the security of the message. In this method, one of the channels indicates whether data is present in the other two channels [8]. One pixel has three colors, and one of them is chosen as the indicator. If the last bit of the indicator color is 1, message bits are encoded in the other two channels (or any one of them); if it is 0, there are no message bits in the other two colors of the same pixel. No extra message-size indicator bits are required in this method: the message size can be found by counting the indicator channel's last bits that are 1.

3.2.1 Encoder Part

1. Take the input BMP image and calculate the total number of pixels: Total number of pixels = width(x) ∗ height(y)
2. Calculate the maximum number of characters from this formula: Maximum number of characters = (floor((x ∗ y)/16))
3. Take the input message and calculate its number of characters. If it exceeds the maximum number of characters, the tool tells the user that the message is too large to be hidden inside this image.


Fig. 9 Encoding process of pixel indicator method

4. Make all the last bits of the indicator channel 0.
5. Put the character bits into one or two of the channels (in this tool one channel is used for the message) and set the corresponding indicator-channel bit to 1. The message is not added at all locations of the channel; a predefined sequence substitutes the message bits at regular-interval locations. (A random sequence can be used here too.)
6. After the substitution, a stego BMP image is created from the three modified 2D arrays in the same folder as the input image, with '2' appended to the name (Fig. 9).

3.2.2 Decoder Part

1. Check all last bits of the indicator channel and take the corresponding bits from the message carrier channel.
2. Generate characters from the bits and display the whole message (Fig. 10).

3.3 Coded LSB Substitution

Coded LSB substitution is a modified, more secure version of the LSB sequential substitution algorithm. In this algorithm, the message is coded with one of the pixel values using an XOR operation, and the coded message is substituted sequentially into the image. The following steps implement this algorithm.


Fig. 10 Decoding process of pixel indicator method

3.3.1 Encoder Part

1. Take the input BMP image and calculate the total number of pixels: Total number of pixels = width(x) ∗ height(y)
2. Calculate the maximum number of characters from this formula: Maximum number of characters = (floor((x ∗ y − 8)/8) ∗ 3)
3. Take the input message and calculate its number of characters. If it exceeds the maximum number of characters, the tool tells the user that the message is too large to be hidden inside this image.
4. One of the pixels (the same one used by the encoder and the decoder) is used to code all the characters of the message with an XOR operation.
5. If the message does not exceed the maximum number of characters, the coded message is divided into 3 parts. Three y ∗ x 2D arrays of red, green and blue are created which contain the pixel-wise color information of the image.
6. A total of 16 pixels is reserved for containing the message size and 1 pixel for the message code; no message bits are added in these pixels. The number of characters is substituted here.
7. The three parts of the coded message are sequentially substituted into the three 2D arrays. The substitution process sets the last bit of each color to 0 and then adds the message bit (0 or 1).
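The XOR coding of step 4 can be sketched as follows. The key value 150 is a made-up color value standing in for the agreed pixel's color; XOR-ing twice with the same key is what lets the decoder recover the original characters:

```python
def xor_code(message, key_byte):
    # Step 4: XOR every character of the message with one color value
    # of the chosen pixel (the same pixel is used by encoder and decoder).
    return [ord(c) ^ key_byte for c in message]

def xor_decode(coded, key_byte):
    # XOR-ing again with the same key restores the original bytes.
    return "".join(chr(b ^ key_byte) for b in coded)

key = 150  # hypothetical color value of the agreed pixel
coded = xor_code("secret", key)
print(coded != [ord(c) for c in "secret"])  # True: the bytes are scrambled
print(xor_decode(coded, key))               # 'secret'
```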


Fig. 11 Encoding process of coded LSB substitution

8. After the substitution, a stego BMP image is created from the three modified 2D arrays in the same folder as the input image, with '3' appended to the name (Fig. 11).

3.3.2 Decoder Part

1. Take the input BMP image and calculate the total number of characters (which was put there by the encoder).
2. Read the bits according to the total number of characters from the red, green and blue parts of the image and XOR every coded character to extract the original character.
3. Generate characters from the bits and display the whole message (Fig. 12).

3.4 Tools and Simulation Environment

All three algorithms have been implemented in Java, so the Java Runtime Environment is necessary to run the program. Java Swing is used for the graphical user interface, since a console application is not user friendly. Two executable JAR (Java Archive) files (one for the encoder, another for the decoder) are created from the Java code and can easily be used by the user.


Fig. 12 Decoding process of coded LSB substitution

4 Encoding and Decoding

After implementing all three algorithms, we get two executable JAR files, Encoder and Decoder (Fig. 13). After being given the input BMP file, the encoder calculates and displays the maximum number of characters that can be hidden in the image for each of the three algorithms (Fig. 14).
Fig. 13 Encoder.JAR


Fig. 14 Display maximum number of characters and waiting for inputs

After selecting the message and algorithm, if the message is too large or null, the tool displays "Message too large or empty" (Fig. 15). After successful encoding it gives the location where the encoded BMP file is stored (Fig. 16). The decoder asks for the location of the encoded BMP file (Fig. 17). After being given the coded BMP file, it asks for the decoding algorithm; only the decoding algorithm with which the original file was encoded works (Fig. 18). After the decoding algorithm is selected, the decoder shows the hidden message (Fig. 19). If the message is large, for better visibility it writes the message to a text file and displays the text file's location instead of the message (Figs. 20 and 21).

Fig. 15 Display for null and large message


Fig. 16 Successful encoding Fig. 17 Decoder.JAR

Fig. 18 Decoding algorithms selection


Fig. 19 Successful decoding

Fig. 20 Displaying message location

Fig. 21 Extracted message written to a text file

5 Comparative Analysis

Here, a performance comparison among the previously described algorithms is carried out over image quality, capacity and security.


Fig. 22 Original Image 1

Fig. 23 Image 1 encoded with LSB sequential substitution

5.1 Image Perceptibility

The data hiding method should hide data in such a manner that the original cover image and the encoded image are perceptually indistinguishable [9]. One of the basic objectives of steganography is to preserve the quality of the encoded image; otherwise, the existence of the hidden message is revealed. Here, image quality is preserved for any size of image. Two sample images are shown as examples in Figs. 22, 23, 24, 25, 26, 27, 28 and 29. We can clearly see that there is no visible change in any of the images, so the image quality preservation objective is fulfilled.

5.2 Image Capacity

Capacity refers to the amount of information that can be hidden in the cover medium. One of the reasons to implement different algorithms is image capacity [10]. For LSB


Fig. 24 Image 1 encoded with pixel indicator method

Fig. 25 Image 1 encoded with coded LSB substitution

Fig. 26 Original Image 2

and coded LSB substitution, an image can hide more characters than with the pixel indicator (PIT) method (Table 1). The capacity of these algorithms can be calculated using these formulas: LSB sequential substitution capacity: Maximum number of characters = (floor((x ∗ y − 8)/8) ∗ 3)


Fig. 27 Image 2 encoded with LSB sequential substitution

Fig. 28 Image 2 encoded with pixel indicator method

Fig. 29 Image 2 encoded with coded LSB substitution

Pixel indicator method capacity: Maximum number of characters = (floor((x ∗ y)/16))
Coded LSB substitution capacity:


Table 1 Capacity (number of characters) of different algorithms for different image sizes

Image size  | LSB sequential substitution | Pixel indicator method | Coded LSB substitution
14 × 14     | 69                          | 12                     | 69
20 × 20     | 147                         | 25                     | 147
38 × 89     | 1263                        | 211                    | 1263
480 × 299   | 53,817                      | 8970                   | 53,817
1346 × 371  | 187,257                     | 31,210                 | 187,257
1280 × 800  | 383,997                     | 64,000                 | 383,997

Maximum number of character = (floor((x ∗ y − 8)/8) ∗ 3)
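As a sanity check, the two capacity formulas reproduce the entries of Table 1; a quick Python sketch (the function names are illustrative):

```python
import math

def lsb_capacity(x, y):
    # Capacity of LSB sequential and coded LSB substitution.
    return math.floor((x * y - 8) / 8) * 3

def pit_capacity(x, y):
    # Capacity of the pixel indicator method.
    return math.floor((x * y) / 16)

# Reproduce two rows of Table 1.
for x, y in [(14, 14), (480, 299)]:
    print(x, y, lsb_capacity(x, y), pit_capacity(x, y))
# 14 14 69 12
# 480 299 53817 8970
```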

5.3 Security

One of the main concerns in steganography is security. We assume that the existence of the hidden message is not revealed to any unauthorized person; but if anyone discovers that something is hidden in the image, the different algorithms offer different levels of protection.

5.3.1 LSB Sequential Substitution

This is the most common technique of steganography, so it has the least security among the three algorithms, as it is known to many people. Its nature also makes it less secure: bits are put sequentially into the image, so if anyone decodes the characters sequentially, the secret message is easily revealed.

5.3.2 Pixel Indicator Method

This method is less well known, which in itself provides some security. Moreover, an unauthorized person does not know how many bits are used for the carrier and the indicator, in which sequence the message bits are substituted, which channel is the indicator and which channels are the data carriers. Since this information is unknown, anyone who does not have the decoder cannot decode the message easily, so it is more secure than the LSB sequential method.


Table 2 Comparison among different algorithms

Parameter     | LSB sequential substitution | Pixel indicator method | Coded LSB substitution
Image quality | Not visibly reduced         | Not visibly reduced    | Not visibly reduced
Capacity      | More                        | Less                   | More
Security      | Less secured                | More secured           | Most secured

5.3.3 Coded LSB Substitution

This method is more secure than the other two. Unless an unauthorized person can identify which pixel and which color are used as the code, he cannot decode the message; and since one image usually has a large number of pixels, finding the correct pixel and its color among all the pixels is a difficult job. That makes this algorithm the most secure of the three. The relative comparison among all three algorithms is given in Table 2.

6 Conclusion

As information is a very crucial resource, securing it becomes all the more necessary. The communication media through which we send data do not provide data security, so other methods of securing data are required. For this, we turn to steganography algorithms, where different approaches are used in different scenarios, along with a graphical user interface to make them easy to use. In this paper, three efficient steganography techniques are analysed; we found indistinguishable perceptibility among the images after applying each method. We have also shown the amount of information that can be hidden for different image sizes. Again, we carried out a comparison among the algorithms in terms of security. In future, we will try to embed secret messages in images using discrete cosine or wavelet transformation methods [3]. There is also scope for using audio and video steganography methods as well.

References

1. N. Tiwari, M. Sandilya, M. Chawla, Spatial domain image steganography based on security and randomization. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 5(1) (2014)
2. A. Gutub, Pixel indicator technique for RGB image steganography. J. Emerg. Technol. Web Intell. 2. https://doi.org/10.4304/jetwi.2.1.56-64
3. S. Maheswari, D. Jude, Different methodology for image steganography-based data hiding: review paper. Int. J. Inf. Commun. Technol. 7, 521 (2015)


4. H. Arora, C. Bansal, S. Dagar, Comparative study of image steganography techniques, in International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), Greater Noida (UP), India (2018), pp. 982–985. https://doi.org/10.1109/ICACCCN.2018.8748451
5. S. Venkatraman, A. Abraham, M. Paprzycki, Significance of steganography on data security, in International Conference on Information Technology: Coding and Computing (ITCC'04), Las Vegas (2004), pp. 5–7
6. E. Gyamfi, I.K. Nti, J. Aning, Using LSB steganography technique and 256 bits key length AES algorithm for high secured information hiding. Int. J. Comput. Sci. Softw. Eng. (2017)
7. S. Chakraborty, A.S. Jalal, C. Bhatnagar, LSB based non blind predictive edge adaptive image steganography. Multimed. Tools Appl. (2016)
8. N. Tiwari, M. Shandilya, Secure RGB image steganography from pixel indicator to triple algorithm—an incremental growth. Int. J. Secur. Appl. 4, 53–62 (2010)
9. M. Hussain, A survey of image steganography techniques. Int. J. Adv. Sci. Technol. (IJAST) 54, 113–125 (2013)
10. M. Cem Kasapbaşı, W. Elmasry, New LSB-based colour image steganography method to enhance the efficiency in payload capacity, security and integrity check. Sādhanā 43, 68 (2018). https://doi.org/10.1007/s12046-018-0848. Accessed 16 Nov 2020

MATHS: Machine Learning Techniques in Healthcare System Medha Chugh , Rahul Johari , and Anmol Goel

Abstract The healthcare industry has been growing with artificial intelligence supporting it at the back. Enormous data is collected through IoT devices and then stored on cloud-assisted remote servers. The combination of IoT, big data and cloud is proving to be the key to the revolution of the sector. The large amount of data collected is useless unless it is used to derive important results through the process of knowledge discovery. Therefore, data science has infinite potential, and data scientists across enterprises are investing their time and effort to develop applications that use machine learning and artificial intelligence to simplify the provision of healthcare resources to the public and help doctors use machine learning models, in addition to their experience and medical knowledge, in diagnosis and treatment. The proposed system is a healthcare management system which can be used by patients to take appointments with doctors and check their prescriptions; for doctors, it provides machine learning support to diagnose diseases and write prescriptions. Datasets for diabetes, heart disease, chronic kidney disease and liver disease are taken, and ML models such as decision trees, random forest, logistic regression and Naïve Bayes classifiers are applied to them. The best-suited model is used for the prediction of that disease in the application.

Keywords Machine learning · Health management system · Healthcare applications · Classifiers

M. Chugh (B) · R. Johari · A. Goel
SWINGER, USICT, GGSIPU, New Delhi, Delhi, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_56

1 Introduction

Artificial intelligence has impacted every industry in the past decade, but the evolution of the healthcare industry in artificial intelligence is a very slow-paced process. The healthcare system across every nation of the world is developed and maintained according to the situations, environmental conditions, medical requirements of the


people and the quality of lifestyle in that particular region. The decentralisation of medical protocols, tools, techniques, treatments, equipment and patient record databases across all regions makes it very hard to ease the system through automation of healthcare management. Electronic health records collected using IoT equipment in intensive care units (ICUs), electrocardiography (ECG), magnetic resonance imaging (MRI), various other tests and implantable devices are stored as large databases, in either structured or unstructured form, using cloud services. But not all of the stored data is used for research purposes, which leads to under-utilisation of the collected data. Also, most hospitals still use traditional methods for the management of their patients: the patients have to go to the hospital whenever they are ill, stand in long queues in the outpatient department (OPD) to get an appointment with the doctor, and the doctor evaluates the patient's condition. On the basis of the conversation, his medical knowledge and past experience, the medication process is suggested. Enterprises need to develop applications that can be implemented in the healthcare industry for efficient management of patients' records and faster, more accurate prediction of results. The design, development and deployment of such applications are the need of the hour; hence, enterprises have been investing a lot of effort in applying artificial intelligence to create applications that automate the planning and maintenance of healthcare resources, termed enterprise resource planning (ERP) systems. The current work aims to develop an application system that combines a patient record management system with a clinical recommender system in order to centralise the overall healthcare management system.
Four disease datasets are included here: diabetes clinical data, liver infection data, heart disease data and chronic kidney disease data. These are all taken from Kaggle and included for the purpose of preparing the various models that can be applied to newer patients' data. The ML models developed were compared in order to select the best model for the purpose of classification and prediction of the results.

2 Problem Statement

The proposed healthcare management system is designed with the purpose of revolutionising the existing system of healthcare management, wherein patients have to go to the hospital, stand in long queues and register themselves. The doctors have to ask their patients for details orally on each visit, which may lead to some crucial points being missed, and the system requires tedious paperwork by all employees to maintain patients' records (Fig. 1). With all the above-stated features, this system can prove to be a major advancement over the current system in use.


Fig. 1 Types of machine learning models

3 Literature Survey

In [1], the author(s) envision a remote health monitoring system which provides personalised analysis of the electrocardiography (ECG) recordings of heart patients to predict the risk of heart disease beforehand. The physician can access a decision support system that uses visualisation and classifiers to alert the doctor when the parameters governing the occurrence of the disease exceed threshold values. After applying many machine learning models, it is concluded that the support vector machine (SVM) is the most optimal choice for the dataset when the training sample is large. In [2], the author(s) discuss the design of PredicT-ML, an automated software tool to predict results using clinical data. The motive behind the tool's development is to automate the process of feature engineering and of algorithm and hyper-parameter selection, and hence reduce the load on computer scientists who otherwise need to manually iterate over different hyper-parameter values and choose various machine learning models to improve the accuracy of the system. The detailed design document is under development at the current stage. In [3], the author(s) propose a convolutional neural network-based multimodal disease risk prediction (CNN-MDRP) algorithm to predict the risk of occurrence of cerebral infarction using real-time hospital data. The patients' demographic details, habits and symptoms were stored as structured data, and the disease descriptions as given by the patients were stored in unstructured textual form. Later, k-nearest neighbours (kNN), decision tree and support vector machine were the ML algorithms applied to the structured data for the prediction of the disease. The prediction accuracy obtained was 94.8%.


In [4], the author(s) design a clinical recommender system based on a multiple classifier system. The proposed system uses medical data of older patients for the training of machine learning models. The data is first pre-processed and outliers are removed using the Cronbach's alpha test; later, principal component analysis (PCA) is used for feature selection, and then the data is used to train an optimised ML algorithm. The observations were made on the following datasets: vertebral column clinical data, Parkinson's disease clinical data, coronary artery disease clinical data, diabetes clinical data and mammographic mass clinical data. In [5], the author(s) discuss the challenges associated with the evolution of cardiovascular medicine. Conventional methods for prediction use only a handful of the variables that impact cardiovascular phenotypes and, hence, are unable to provide insight into the conditions, limiting the use of machine learning in the field. Further, the authors provide an insight into the statistical approach that is to be used to build prediction models and develop enterprise resource planning (ERP) systems. In [6], the technique of optimising virtual machine selection for the implementation of machine learning algorithms to diagnose and predict chronic kidney disease (CKD) is implemented and discussed. The result of the disease diagnosis is predicted by applying machine learning models such as linear regression and artificial neural networks (ANN). For the optimisation of virtual machines (VMs), parallel particle swarm optimisation (PPSO) is applied. Each disease requires its own model to be designed based on the parameters affecting the disease, which adds to the future scope of the idea.
In [7], the author(s) provide an overview of the data mining requirements and the various machine learning algorithms that can be adopted to analyse large patient databases collected through the different digital medical equipment used in intensive care units, operation theatres, ventilators and clinics. This data is pre-processed first using filtering methods and can later be used to apply various machine learning models and evaluate the accuracy of the results. In [8], the author(s) take the Framingham dataset, which has 16 attributes, in order to predict whether a person may have heart disease or not. Machine learning algorithms such as support vector machine, logistic regression, random forest classifier, Gaussian Naïve Bayes and decision tree classifier are implemented and their accuracies compared; logistic regression performed the best with an accuracy of 88.29%. In [9], the author(s) propose a system to analyse emotions by recognising faces through CCTV surveillance and providing counselling to students suffering from depression and/or anxiety. AdaBoost is used to train the system. The accuracy of the system obtained is 83%.


Algorithm 1: Trigger: when a patient logs in to the website

if P ∈ HMS then
  book_id ← book_Appointment()
  visit_id ← visit_Doctor(book_id)
  if Dr == '1' then
    if 'test' == '1' then
      input ← array(test_type, input_fields)
      answer ← predict_results(input)
      if answer == True then
        print "You may have the disease"
      else
        print "You don't have the disease"
      end
    else
      write_prescription()
    end
  else
    wait()
  end
else
  return register into PRM
end

Function predict_results(input: array)
  test ← test_type
  model ← train_model(test)
  predict ← predict(model)
  return predict
end

4 Methodology Adopted

4.1 Algorithm

4.2 Dataset Description

For multiple disease prediction, four datasets are taken here. These datasets are taken from Kaggle: the heart disease dataset, liver dataset, chronic kidney disease dataset and diabetes clinical dataset [10]. The heart disease dataset from Kaggle was collected from patients in Cleveland, Hungary, Switzerland and Long Beach. The combined dataset contains 14 features and 916 samples. The liver dataset consists of 416 liver patient records and 167 non-liver patient records collected from hospitals in Andhra Pradesh, India. The target variable is nominal in nature, giving yes or no as the output.


The chronic kidney disease dataset was collected in India over a period of two months. It has 400 rows with 25 features. The target variable is 'classification', which is either 'ckd' or 'notckd' (ckd = chronic kidney disease). The diabetes dataset, taken originally from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), was created to check the diabetes status of a person. Several parameters were taken into consideration for the selection of the instances; the dataset used here contains all females with a minimum age of 21 years.

4.3 Feature Selection

The attributes of the datasets were studied, and feature selection was done in order to find the most important features. Feature selection methods are of two types, namely filter methods and wrapper methods. A wrapper feature selection method is used on the following datasets for extracting the most important features. The extra trees classifier is an ensemble method, similar to the random forest classifier, that can be used to select the relevant features from a dataset. It forms subsets of the dataset, creates several decision trees, and then calculates the information gain for each attribute. The attributes having the maximum information gain are chosen for the prediction model. It is observed that for the chronic kidney disease dataset, which originally has 24 features, only ten of the features are selected for prediction.
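The information-gain criterion mentioned above can be illustrated on toy data. This is a pure-Python sketch of the scoring idea, not the extra trees classifier itself; the labels mimic the chronic kidney dataset's 'ckd'/'notckd' target, and the attribute columns are invented:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute, labels):
    # Entropy of the parent minus the weighted entropy of the
    # subsets obtained by splitting on the attribute's values.
    n = len(labels)
    gain = entropy(labels)
    for value in set(attribute):
        subset = [l for a, l in zip(attribute, labels) if a == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

labels  = ["ckd", "ckd", "notckd", "notckd"]
perfect = [1, 1, 0, 0]   # perfectly predicts the label -> maximal gain
useless = [1, 0, 1, 0]   # independent of the label -> zero gain
print(information_gain(perfect, labels))  # 1.0
print(information_gain(useless, labels))  # 0.0
```

Ranking all attributes by such a score and keeping the top ones is, in spirit, how the ten features of the chronic kidney disease dataset would be retained.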

4.4 Applying Machine Learning Algorithms

Logistic regression is a classification algorithm that predicts the outcome as a categorical variable. It is similar to linear regression, but it fits an S-shaped sigmoid curve to the data, where 0 and 1 are the extreme values. A threshold value is fixed, and the probability is used to predict the results. The random forest classifier is an ensemble classifier that creates a set of decision trees using subsets of the training set and later aggregates the votes from the different trees to get the final outcome. The decision tree classifier is a machine learning algorithm that learns flowchart-like tree structures, with each internal node denoting a test on an attribute and each branch holding an outcome of the class label applied. The data is partitioned into many branches using the attribute values such that the data gets completely split at the end and can be used for predictive analysis. To select the attribute which results in the best splitting of the data, various measures can be calculated: information gain, Gini index and gain ratio.


Naïve Bayes classification is a classification tool based on Bayes' theorem, used when the features may or may not depend on each other but contribute independently to the prediction. The Naive Bayes algorithm learns the probability of an object with certain features belonging to a particular group/class. Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):

P(c|x) = P(c) ∗ P(x|c)/P(x)   (1)

where
P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes),
P(c) is the prior probability of class,
P(x|c) is the likelihood, the probability of predictor given class,
P(x) is the prior probability of predictor.
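Eq. (1) can be applied directly; the probabilities below are made-up numbers for illustration (a prior P(c) = 0.1, a likelihood P(x|c) = 0.8, and an evidence term P(x) = 0.1·0.8 + 0.9·0.2 = 0.26):

```python
def posterior(prior_c, likelihood_x_given_c, evidence_x):
    # Eq. (1): P(c|x) = P(c) * P(x|c) / P(x)
    return prior_c * likelihood_x_given_c / evidence_x

p = posterior(0.1, 0.8, 0.26)
print(round(p, 4))  # 0.3077
```

A Naive Bayes classifier evaluates this posterior for every class and predicts the class with the largest value; since P(x) is the same for all classes, only the numerator matters for the comparison.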

5 Experimental Setup

The system is developed and provides the desired functionalities. The front end is designed using HTML, JavaScript, CSS, jQuery and Bootstrap; for back-end development Node.js is used, which calls Python scripts to execute the machine learning model's prediction for disease diagnosis.

6 Results: Analysis and Discussion

The machine learning models were prepared and applied to the chosen datasets, and the accuracy of the different models was calculated and compared. Accuracy is taken to be the parameter determining the model chosen for the application. As shown in Fig. 2, the best model for the heart dataset is logistic regression, with a maximum accuracy of 98.5% among all the models; the minimum accuracy, 92%, is that of the decision tree classifier. For the liver dataset, the highest accuracy, 75.17%, is again achieved by the logistic regression model; the Gaussian Naïve Bayes classifier did not perform well and achieved the least accuracy. In the case of the chronic kidney disease dataset, logistic regression, the random forest classifier and the Gaussian Naïve Bayes classifier all performed well, with an accuracy of 1. When the diabetes dataset is used, the highest accuracy, 83.67%, is achieved by the random forest classifier, which hence proved to be the best model among all, and the least accuracy is that of logistic regression.


Fig. 2 Graph showing comparison among ML models for heart dataset (accuracy score in percentage)

7 Conclusion and Future Work

The research work successfully implements all the desired functionalities. The patient is able to register by adding all the relevant demographic details and then log in to book an appointment at the hospital. The admin is able to create or remove employee accounts and hence manage the user accounts. The doctor can predict diseases and write prescriptions (Figs. 3, 4 and 5). The future work proposed for the ongoing study is to develop a centralised hospital database so that the system can be generalised to all healthcare departments and the big data can be used efficiently and analysed extensively to retrieve useful results. The application system developed can be further modified, and one can add more functionality for searching, providing details of on-going systems and including more datasets to extend the diagnosis to other diseases too. In future, it

Fig. 3 Graph showing comparison among ML models for liver dataset (accuracy score in percentage)


Fig. 4 Graph showing comparison among ML models for chronic kidney disease dataset (accuracy score in percentage)

Fig. 5 Graph showing comparison among ML models for diabetes dataset (accuracy score in percentage)

can be further expanded to include a hospital directory so that various hospitals and clinics can be accessed through a single portal. Image datasets can also be included to allow image processing of reports, with deep learning implemented to detect the diseases.


EnSOTA: Ensembled State of the Art Model for Enhanced Object Detection Jayesh Gupta, Arushi Sondhi, Jahnavi Seth, Moolchand Sharma, Farzil Kidwai, and Aruna Jain

Abstract With advancements in computation capabilities, the focus has shifted toward creating object detection models that favour accuracy over speed. Through the research presented in this paper, we wish to establish a more robust and more accurate method for object detection using an ensemble of different state-of-the-art object detection models. Earlier research established the use of single architectures and models to tackle object detection problems, each of which carries its own bias and variance. Current research explores the idea of "ensemble learning" after the success of simple ensembled models such as XGBoost. Through this research, we propose extending the idea of ensemble learning to object detection by applying state-of-the-art object detection models together as an ensemble to reduce individual bias and variance, while achieving better metrics and accuracy. We tested EnSOTA on the PASCAL VOC challenge, and in comparison to the individual models, it delivered 5–8% higher accuracy at solving the same challenge. Keywords Object detection · Ensemble learning · Computer vision · Deep learning

1 Introduction

Object detection has become an essential field of machine learning and artificial intelligence. It is an integral part of fields such as robotics, verification, and automation. One of the most challenging tasks in a successful computer vision project is identifying and using an accurate object detection model. However, no single model will always perform better than the rest. J. Gupta · A. Sondhi · J. Seth · M. Sharma (B) · F. Kidwai Maharaja Agrasen Institute of Technology, Delhi 110086, India e-mail: [email protected] A. Jain Bharati College, University of Delhi, Delhi, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_57


J. Gupta et al.

Each model has a specific architecture and parameters that prove better at solving one problem than another [1]. The new industry-standard practice that has achieved higher accuracy in machine learning problems is ensembling [2, 3]. Models such as XGBoost [4] (gradient-boosted ensembled decision trees) have achieved better accuracy and results compared to individual models [5]. Ensembling can propose a prediction based on the understanding of a group of weak prediction models instead of one generalized predictor [6]. It acts as a combination of multiple systems that pool their predictions and knowledge to arrive at a better conclusion than any of them could alone [7]. Model selection is an integral part of solving a machine learning problem and one of the critical factors in optimizing the bias and variance of a model. In an ideal situation, a model delivers low bias and low variance error, resulting in an efficient algorithm for solving the problem. In a realistic scenario, however, these two errors complement each other, and reducing one increases the other, which is known as the bias–variance trade-off [8]. By applying standard object detection models through ensembling to solve the same problem, we create a strong learner from a combination of standard object detection models on level ground, and we can identify and study how each model operates on the same problem. Each of these models has a different variance and bias threshold when solving a similar problem. By combining the predictions of these models, we build a better foundation for reducing their individual bias and variance, and we build a model that is better at solving the same problem while also presenting lower bias and variance than the individual models [9].
By studying an ensembled collection of standard object detection models, we gain deeper insight into the standard metrics established for individual models, while conducting a new study of object detection models and expanding on the current object detection metrics. Through our ensembled model, we aim to establish a better model in terms of performance and accuracy. Our paper takes a fresh look at the study of object detection models, while also proposing a new approach to using standard object detection. In particular, this research presents the following contributions:
• We present and expand the use of an ensembled object detection model consisting of individual state-of-the-art object detection models to improve performance and accuracy.
• We conduct a comprehensive comparative study between the individual object detection models and our ensembled object detection model, showing the benefits and results of this method.
• We compile comprehensive metrics that help establish a method to ease model selection and help understand how ensembling improves performance for difficult-to-detect objects and individual classes.
Object recognition is a crucial computer vision problem that deals with recognizing and localizing objects of certain classes within an image [10]. It is possible to


view object localization differently, including forming a bounding box around the object or labeling each pixel of the image containing the object. Object detection combines two tasks: object localization and object classification [11]. Localization deals with identifying regions of an image that are likely to contain an object, or merely identifying regions of interest or high value in the image. Classification deals with identifying the nature of the object in the image by assigning it an object class and the probability of it belonging to that class. The latest state-of-the-art techniques are divided into two types: one-stage and two-stage techniques. One-stage methods prioritise inference speed; example models include YOLO and SSD. Two-stage methods prioritise detection precision; Faster R-CNN is an example model. The paper is structured as follows: Sect. 2 addresses related principles regarding previously implemented systems and the use of YOLO, SSD, and faster region-based convolutional neural networks. Section 3 outlines the proposed method and describes the steps of its implementation. In Sect. 4, the dataset is addressed. In Sect. 5, the system implementation and the generated results are presented and discussed. Finally, the last section concludes the paper with final remarks, followed by references.

2 Existing Methods

2.1 You Only Look Once (YOLO)

YOLO [12, 13] reframes object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities, as seen in its architecture in Fig. 1. Using this approach, the network looks at an image only once to predict which objects are present and where they are. A single fully convolutional network simultaneously predicts multiple bounding boxes and the class probabilities for those boxes. YOLO trains on full images and directly optimises detection performance.

2.2 Single Shot Detector (SSD)

SSD [14] does not have a dedicated region proposal network and functions as a single-pass network. It predicts the bounding boxes and classes in one single pass directly from feature maps. To predict object classes and offsets for default anchor boxes, it uses small convolutional filters, applying separate filters to the default anchors to handle the various aspect ratios. The SSD approach is based on a feed-forward convolutional network which produces a set of bounding boxes of


Fig. 1 Architecture of YOLO

Fig. 2 The architecture of single shot detection (SSD)

a fixed size, as represented in Fig. 2. It also produces scores for the presence of object class instances in those boxes, followed by a non-maximum suppression stage to generate the final detections.

2.3 Faster Region-Based Convolutional Neural Networks

Faster R-CNN [15] consists of two modules. The first module is a deep, fully convolutional network that proposes regions, and the second is a Fast R-CNN detector that uses the proposed regions, as seen in Fig. 3. For object detection, the whole network is a single, unified network. A region proposal network (RPN) takes the image as input and outputs a set of rectangular object proposals, each with an objectness score.


Fig. 3 The architecture of faster RCNN

2.4 Ensembling

Ensembling [16] is a method that aims to maximize detection performance by fusing individual detectors. While rarely mentioned in deep-learning papers, ensembling methods have been widely used to achieve high scores in recent data science competitions and problem statements. Ensemble techniques are used in a wide variety of machine learning problems. Combining several algorithms is known to reduce the bias of a single detector or its variance. These problems of prediction bias and variance are well known. They can be caused by using too big a model with too small a dataset, or by data that is too heterogeneous for the model to capture the full picture. In other words, no single model can adequately model the full set of data to process. For instance, stacking is known to reduce an algorithm's bias [17], whereas bagging or test-time augmentation decreases variance [18]. Most ensembling methods increase the computing time (training time and prediction time) but provide an efficient way to push forward the state-of-the-art. There are three critical types of meta-algorithms aimed at combining weak learners:
• Bagging, which mostly uses homogeneous weak learners, trains them in parallel independently of each other, and combines them according to some deterministic averaging method.
• Boosting, which mostly uses homogeneous weak learners, trains them sequentially in an adaptive way, and combines them following a deterministic strategy.
• Stacking, which often uses heterogeneous weak learners, trains them in parallel, and combines them by training a meta-model that outputs a prediction based on the weak learners' predictions.
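The bagging idea, the meta-algorithm EnSOTA builds on, can be illustrated with a tiny self-contained sketch: homogeneous weak learners (one-dimensional threshold "stumps") are trained independently on bootstrap resamples and combined by majority vote. This toy example is ours and is unrelated to the paper's detectors.

```python
# Hedged sketch of bagging: bootstrap-resample the data, fit one weak
# learner per resample, combine their predictions by averaging/voting.
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D dataset: class 1 when x > 0, with 10% label noise.
x = rng.normal(size=300)
y = (x > 0).astype(int)
y[rng.random(300) < 0.1] ^= 1  # flip 10% of labels

def fit_stump(xs, ys):
    # Weak learner: pick the threshold that best separates the classes.
    cands = np.linspace(xs.min(), xs.max(), 50)
    accs = [np.mean((xs > t).astype(int) == ys) for t in cands]
    return cands[int(np.argmax(accs))]

thresholds = []
for _ in range(25):
    idx = rng.integers(0, len(x), len(x))  # bootstrap sample (with replacement)
    thresholds.append(fit_stump(x[idx], y[idx]))

# Deterministic combination: average the 25 votes, predict the majority class.
votes = np.mean([(x > t).astype(int) for t in thresholds], axis=0)
bagged_acc = np.mean((votes >= 0.5).astype(int) == y)
```

Each stump is noisy on its own; averaging their votes reduces the variance of the combined predictor, which is exactly the effect EnSOTA seeks from bagging three detectors.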


3 Proposed Method

In our study, to define EnSOTA, we used bagging-based ensembling of three standard object detection models: YOLOv3, Faster RCNN, and single shot detection (SSD), the latter two with a ResNet [19] backbone. The PASCAL VOC 2007 + 2012 dataset is used as data, with each detector trained on the entire training and validation data. Training produced three learners that captured different aspects and features of the dataset, each with its individual bias and variance. The weak predictions were combined into one strong prediction per image using weighted boxes fusion [20] to fuse overlapping bounding boxes, followed by affirmative selection of the predictions. The proposed architecture is shown in Fig. 4.

3.1 Weighted Boxes Fusion

Weighted boxes fusion (WBF) [20] is a method for combining the predictions of object detection models. Unlike NMS and soft-NMS [21], which remove part of the predictions, WBF uses the confidence scores of all proposed bounding boxes to construct the averaged boxes. This significantly improves the quality of the combined predicted rectangles. Both non-maximum suppression and soft-non-maximum suppression exclude some boxes, while WBF uses all boxes. Thus, it can fix cases where all models predict boxes inaccurately, with low IoU with the ground truth: NMS/soft-NMS will leave only one inaccurate box, while weighted boxes fusion has a higher probability of increasing the IoU with the ground truth.
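A minimal, self-contained sketch of the fusion idea follows. It is not the reference WBF implementation of [20] (which, among other details, also rescales fused scores by the number of contributing models); it only illustrates the core step of confidence-weighted coordinate averaging over clusters of overlapping boxes.

```python
import numpy as np

def box_iou(a, b):
    # IoU of two [x1, y1, x2, y2] boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def weighted_boxes_fusion(boxes, scores, iou_thr=0.55):
    """Greedily cluster overlapping boxes, then replace each cluster by the
    confidence-weighted average of ALL member boxes (nothing discarded,
    unlike NMS). Simplified sketch of the WBF idea from [20]."""
    clusters = []  # each cluster: (list of boxes, list of scores)
    for i in np.argsort(scores)[::-1]:  # visit boxes by descending confidence
        b = np.asarray(boxes[i], dtype=float)
        for cb, cs in clusters:
            fused = np.average(cb, axis=0, weights=cs)
            if box_iou(fused, b) >= iou_thr:
                cb.append(b)
                cs.append(scores[i])
                break
        else:
            clusters.append(([b], [scores[i]]))
    fused_boxes = [np.average(cb, axis=0, weights=cs) for cb, cs in clusters]
    fused_scores = [float(np.mean(cs)) for _, cs in clusters]
    return fused_boxes, fused_scores
```

For example, two detectors' overlapping predictions of the same object, [0, 0, 10, 10] at confidence 0.9 and [1, 1, 11, 11] at 0.8, fuse into one box pulled toward the higher-confidence prediction, while a distant box stays separate.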

Fig. 4 Proposed ensembled object detection model architecture


4 Research Approach

4.1 Dataset

For the study, we used the PASCAL VOC [22] dataset. It provides standardized image datasets for object detection. According to industry standards, the guidelines and rules set with this dataset allow evaluation and comparison of different methods and models. It provides 20 object classes, with the training data containing 11,530 images and 27,450 ROI-annotated objects. The classes are aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, dining table, dog, horse, motorbike, person, potted plant, sheep, sofa, train and TV. We took the union of the train and validation datasets of PASCAL VOC 2007 and PASCAL VOC 2012, each containing the 20 classes, for training. For validation, we took the test dataset of PASCAL VOC 2007 with 4952 images.

4.2 Training

The three models were trained individually on the union of the PASCAL VOC 2007 and 2012 training and validation datasets. Training continued for roughly 200 epochs for each model, or until the categorical loss and localization loss were under a set industry-standard threshold. Training was monitored and stopped when the accuracy flatlined between epochs, as further training would have led to overfitting. The training script was written in Python using the GluonCV [23] library. The data were stored in XML files, with each image having its own annotation file.

4.3 Testing

After the models were trained, the PASCAL VOC 2007 test data was used to derive a final test accuracy. Each model predicted all 4952 images in the test dataset. After every prediction, all the boxes, confidence scores, and labels were stored in separate text files for each image and each model. Once the predictions were made, each prediction file was analyzed and subjected to pre-processing before the evaluation score was calculated. The following steps were applied to the predictions:
1. Box coordinates that went beyond the image size were limited to the image dimensions by applying a max and min filter on every predicted coordinate
2. Boxes with a confidence score lower than 10% were eliminated to reduce false positives
3. Boxes with a negligible area or with inconsistently formatted coordinates were eliminated
4. Predictions were sorted by confidence score to minimize the calculation time.
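The pre-processing steps above can be sketched as a single function. This is illustrative only; the function name and array layout are our own, not the authors' script.

```python
import numpy as np

def postprocess(boxes, scores, labels, img_w, img_h, score_thr=0.10):
    """Clip, filter, and sort detections before evaluation (sketch)."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # 1. Clip coordinates [x1, y1, x2, y2] to the image dimensions.
    boxes[:, [0, 2]] = boxes[:, [0, 2]].clip(0, img_w)
    boxes[:, [1, 3]] = boxes[:, [1, 3]].clip(0, img_h)
    # 2. Drop boxes below the confidence threshold (reduces false positives).
    # 3. Drop degenerate boxes (negligible area or inverted coordinates).
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    keep = (scores >= score_thr) & (w > 1) & (h > 1)
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]
    # 4. Sort by descending confidence to speed up evaluation.
    order = np.argsort(scores)[::-1]
    return boxes[order], scores[order], labels[order]
```

For instance, a box extending beyond a 50 × 50 image is clipped, a 5%-confidence box is dropped, and a near-zero-area box is discarded.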

4.4 Prediction Ensembling

After each model's predictions were recorded for each image, the model predictions for each image were subjected to ensembling. The three predictions from the three different models for each image were loaded together. Weighted boxes fusion (WBF) [20] was then applied for each image to all these predictions, and the resulting predictions were stored in a separate file for each image. These new predictions acted as the predictions made by the ensemble model. For weighted boxes fusion, the coordinates were normalized to prevent any mismatch between the image size augmentations that each model might perform. Before recording, the normalized boxes were scaled back to the dimensions of the input image. After recording the predictions, the following pre-processing was applied:
• Box coordinates that went beyond the image size were limited to the image dimensions by applying a max and min filter on every predicted coordinate
• Boxes with a confidence score lower than 10% were eliminated to reduce false positives
• Boxes with a negligible area or with inconsistently formatted coordinates were eliminated.
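The normalization round trip described above can be sketched as follows (illustrative helper names, assuming [x1, y1, x2, y2] pixel boxes):

```python
import numpy as np

def normalize_boxes(boxes, w, h):
    """Scale pixel boxes to [0, 1] so predictions from models using
    different input resolutions can be fused on a common scale."""
    return np.asarray(boxes, dtype=float) / np.array([w, h, w, h])

def denormalize_boxes(boxes, w, h):
    """Scale normalized boxes back to the input image's pixel dimensions."""
    return np.asarray(boxes, dtype=float) * np.array([w, h, w, h])
```

Fusion runs on the normalized boxes; the fused result is denormalized before being recorded.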

5 Results and Discussion

5.1 Evaluation Metrics

For each model, two different sets of metrics were recorded. These metric calculations were based on the standards set forth by the PASCAL VOC format. These two metrics are:
1. Precision-recall curve for each class of the dataset
2. Average precision for each class and model.

For calculation, the precision-recall curve was evaluated for each class by varying the confidence score threshold and measuring the precision and recall at each point. The IoU threshold was set at 0.5, i.e., a prediction was considered a true positive if its intersection over union with a ground-truth box was greater than or equal to 50%.


Fig. 5 Intersection over union illustration

5.2 Intersection Over Union (IOU)

Intersection over union (IOU), based on the Jaccard index, measures the overlap between two bounding boxes. A ground-truth bounding box Bgt and a predicted bounding box Bp are required. By applying an IOU threshold, we can say whether a detection is correct (true positive) or not (false positive). The IOU is the area of overlap between the predicted bounding box and the ground-truth bounding box divided by the area of their union, as given by Eq. (1):

IOU = area(Bp ∩ Bgt) / area(Bp ∪ Bgt)    (1)

Figure 5 illustrates the intersection over union between a ground-truth box (green) and a detected box (red).
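Equation (1) translates directly into code; a small sketch for corner-format boxes (x1, y1, x2, y2):

```python
def iou(bp, bgt):
    """Intersection over union of a predicted box bp and a ground-truth
    box bgt, both as (x1, y1, x2, y2), implementing Eq. (1)."""
    ix1, iy1 = max(bp[0], bgt[0]), max(bp[1], bgt[1])
    ix2, iy2 = min(bp[2], bgt[2]), min(bp[3], bgt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # overlap area
    area_p = (bp[2] - bp[0]) * (bp[3] - bp[1])
    area_gt = (bgt[2] - bgt[0]) * (bgt[3] - bgt[1])
    union = area_p + area_gt - inter
    return inter / union if union else 0.0
```

Identical boxes give 1.0, disjoint boxes give 0.0, and a detection counts as a true positive here when the result is at least 0.5.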

5.3 Precision × Recall Curve

The precision × recall curve is an excellent way of assessing the output of an object detector as the confidence threshold is varied, by plotting a curve for each object class. An object detector for a given class is considered good if its precision remains high as the recall increases, meaning that precision and recall stay high even as the confidence threshold varies. Another way to identify a good object detector is to look for one that identifies only relevant objects (high precision) while finding all ground-truth objects (high recall). A poor object detector needs to increase the number of detected objects (increasing false positives, i.e., lower precision) to retrieve all ground-truth objects (high recall). That is why the precision × recall curve usually starts with high precision values that decrease as recall increases. An example of the precision × recall curve appears in the next topic (average precision). This kind of curve is used by the PASCAL VOC 2012 challenge and is available in our implementation.
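Given detections flagged as true or false positives against the ground truth, the precision and recall values traced by lowering the confidence threshold can be computed as follows (illustrative sketch; the function and argument names are ours):

```python
import numpy as np

def precision_recall_curve(scores, is_tp, n_gt):
    """Precision/recall at each point of the ranked detection list.
    scores: detection confidences; is_tp: True where the detection matched
    a ground-truth box; n_gt: total number of ground-truth boxes."""
    order = np.argsort(scores)[::-1]          # rank by descending confidence
    tp = np.cumsum(np.asarray(is_tp)[order])  # cumulative true positives
    fp = np.cumsum(~np.asarray(is_tp)[order]) # cumulative false positives
    precision = tp / (tp + fp)
    recall = tp / n_gt
    return precision, recall
```

Each prefix of the ranked list corresponds to one confidence threshold, giving one (precision, recall) point on the curve.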


5.4 Average Precision

Calculating the area under the curve (AUC) of the precision × recall curve is another way to compare the performance of object detectors. Since precision × recall curves are mostly zigzag curves going up and down, it is generally not easy to compare different curves (different detectors) in the same graph because they often cross each other. That is why average precision (AP), a numerical metric, can also help us compare various detectors. In practice, AP is the precision averaged over all recall values between 0 and 1. Since 2010, the PASCAL VOC challenge has modified the way AP is computed: the interpolation currently uses all data points. Our research methodology matches their current method, as we want to replicate their default implementation (interpolating all data points).
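A hedged sketch of all-points-interpolated AP in the post-2010 PASCAL VOC style follows (our own implementation, not the official evaluation code):

```python
import numpy as np

def average_precision(precision, recall):
    """All-points-interpolated AP: make the precision envelope
    monotonically decreasing, then integrate it over recall."""
    # Append sentinel values at both ends of the curve.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Interpolation: precision at recall r is the max precision at recall >= r.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangle areas where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

Feeding in the (precision, recall) points of a class's curve yields the per-class AP values of Table 1; averaging over classes gives the mAP.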

5.5 Recorded Metrics

EnSOTA outperformed all the individual models, achieving a mean average precision about 5% higher than that of the best-performing individual model, as seen in Fig. 6. The model increased both the mean average precision and the per-class average precision relative to every individual model. For classes such as bottle and potted plant, which individual models missed at different instances because their small size makes them difficult to localize, ensembling let the models compensate for each other's mistakes, yielding a significantly higher detection rate for such difficult classes, as shown in Table 1. For classes that individual models could already predict comfortably and with higher precision, ensembling led to a small increase across all models.

Fig. 6 Comparative plot of object detection model performance


Table 1 Comparative average precision for each class for each model

            SSD     YOLO    Faster RCNN   Ensemble
mAP (%)     76.18   74.85   72.94         80.30
Aeroplane   0.83    0.85    0.81          0.88
Cycle       0.84    0.81    0.79          0.85
Bird        0.73    0.75    0.72          0.79
Boat        0.69    0.64    0.60          0.71
Bottle      0.51    0.57    0.48          0.62
Bus         0.80    0.77    0.80          0.85
Car         0.81    0.80    0.79          0.84
Cat         0.90    0.87    0.85          0.90
Chair       0.58    0.55    0.52          0.64
Cow         0.77    0.83    0.79          0.84
Table       0.68    0.63    0.61          0.72
Dog         0.86    0.84    0.82          0.87
Horse       0.85    0.85    0.82          0.87
Motorbike   0.84    0.81    0.80          0.86
Person      0.84    0.83    0.81          0.85
Plant       0.56    0.53    0.52          0.61
Sheep       0.72    0.75    0.73          0.78
Sofa        0.72    0.62    0.69          0.78
Train       0.86    0.83    0.84          0.90
TV          0.75    0.74    0.71          0.81

The ensembled model played a crucial role in improving localization and reducing false positives and false negatives, leading to higher precision and recall. As seen in Fig. 7, errors existed in the individual models compared to the ground truth. Faster RCNN predicted the same class multiple times, leading to more false positives even though its bounding box predictions were reasonable compared to the ground truth. The SSD and YOLOv3 models fell short on localization, with SSD producing poorer bounding boxes and YOLOv3 missing several objects, leading to false negatives. All these shortcomings reduced precision and recall for the individual models. In the EnSOTA detections, however, the false positives were eliminated, and object localization improved over the individual models. The models compensated for each other's bias and variance, leading to much better performance by the ensemble model.


Fig. 7 Prediction comparison of all models with ground truth

6 Conclusion

From the research conducted and the outcomes recorded, it is safe to conclude that the proposed ensembled model (EnSOTA) outperformed the individual state-of-the-art object detection models. The proposed model achieved better accuracy and mean average precision, localized objects better, and reduced false positives and false negatives. The ensembled model surpassed the individual object detection models across the board in terms of accuracy and performance.

7 Future Scope

More models, such as CenterNet [24], can be studied and explored as weak learners in the ensemble in further research. Different combinations and hyperparameters can be explored to further increase the performance of such an ensembled model. Beyond the combination researched in this paper, more combinations and weights can be pursued to better understand the weightage each model should have in an ensembled model to produce better results.

References
1. A. Groener, G. Chern, M. Pritt, A comparison of deep learning object detection models for satellite imagery, in 2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), Oct 2019, pp. 1–10. https://doi.org/10.1109/AIPR47015.2019.9174593
2. R. Ray, S.R. Dash, Comparative study of the ensemble learning methods for classification of animals in the zoo, in Smart Intelligent Computing and Applications, vol. 159, ed. by S.C. Satapathy, V. Bhateja, J.R. Mohanty, S.K. Udgata (Springer Singapore, Singapore, 2020), pp. 251–260. https://doi.org/10.1007/978-981-13-9282-5_23
3. X. Dong, Z. Yu, W. Cao, Y. Shi, Q. Ma, A survey on ensemble learning. Front. Comput. Sci. 14(2), 241–258 (2020). https://doi.org/10.1007/s11704-019-8208-z
4. T. Chen, C. Guestrin, XGBoost: a scalable tree boosting system, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug 2016, pp. 785–794. https://doi.org/10.1145/2939672.2939785
5. P. Singh, Comparative study of individual and ensemble methods of classification for credit scoring, in 2017 International Conference on Inventive Computing and Informatics (ICICI), Coimbatore, Nov 2017, pp. 968–972. https://doi.org/10.1109/ICICI.2017.8365282
6. L. Rokach, Ensemble-based classifiers. Artif. Intell. Rev. 33(1–2), 1–39 (2010). https://doi.org/10.1007/s10462-009-9124-7
7. Y. Ren, L. Zhang, P.N. Suganthan, Ensemble classification and regression—recent developments, applications and future directions [review article]. IEEE Comput. Intell. Mag. 11(1), 41–53 (2016). https://doi.org/10.1109/MCI.2015.2471235
8. B. Ghojogh, M. Crowley, The theory behind overfitting, cross validation, regularization, bagging, and boosting: tutorial. arXiv:1905.12787 [cs, stat], May 2019. http://arxiv.org/abs/1905.12787
9. J. Xu, W. Wang, H. Wang, J. Guo, Multi-model ensemble with rich spatial information for object detection. Pattern Recogn. 99, 107098 (2020). https://doi.org/10.1016/j.patcog.2019.107098
10. Z.-Q. Zhao, P. Zheng, S.-T. Xu, X. Wu, Object detection with deep learning: a review. IEEE Trans. Neural Netw. Learn. Syst. 30(11), 3212–3232 (2019). https://doi.org/10.1109/TNNLS.2018.2876865
11. Y. Wu et al., Rethinking classification and localization for object detection, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, June 2020, pp. 10183–10192. https://doi.org/10.1109/CVPR42600.2020.01020
12. J. Redmon, A. Farhadi, YOLOv3: an incremental improvement. arXiv:1804.02767 [cs], Apr 2018. http://arxiv.org/abs/1804.02767
13. J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 2016, pp. 779–788. https://doi.org/10.1109/CVPR.2016.91
14. W. Liu et al., SSD: single shot multibox detector. arXiv:1512.02325 [cs], vol. 9905 (2016), pp. 21–37. https://doi.org/10.1007/978-3-319-46448-0_2
15. S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017). https://doi.org/10.1109/TPAMI.2016.2577031
16. O. Sagi, L. Rokach, Ensemble learning: a survey. WIREs Data Min. Knowl. Discov. 8(4) (2018). https://doi.org/10.1002/widm.1249
17. K. Matlock, C. De Niz, R. Rahman, S. Ghosh, R. Pal, Investigation of model stacking for drug sensitivity prediction. BMC Bioinform. 19(S3), 71 (2018). https://doi.org/10.1186/s12859-018-2060-2
18. S. González, S. García, J. Del Ser, L. Rokach, F. Herrera, A practical tutorial on bagging and boosting based ensembles for machine learning: algorithms, software tools, performance study, practical perspectives and opportunities. Inf. Fusion 64, 205–237 (2020). https://doi.org/10.1016/j.inffus.2020.07.007
19. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, June 2016, pp. 770–778. https://doi.org/10.1109/CVPR.2016.90
20. R. Solovyev, W. Wang, T. Gabruseva, Weighted boxes fusion: ensembling boxes for object detection models. arXiv:1910.13302 [cs], Aug 2020. http://arxiv.org/abs/1910.13302
21. N. Bodla, B. Singh, R. Chellappa, L.S. Davis, Soft-NMS—improving object detection with one line of code, in 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Oct 2017, pp. 5562–5570. https://doi.org/10.1109/ICCV.2017.593
22. M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010). https://doi.org/10.1007/s11263-009-0275-4
23. J. Guo et al., GluonCV and GluonNLP: deep learning in computer vision and natural language processing. arXiv:1907.04433 [cs, stat], Feb 2020. http://arxiv.org/abs/1907.04433
24. K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, Q. Tian, CenterNet: keypoint triplets for object detection, in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), Oct 2019, pp. 6568–6577. https://doi.org/10.1109/ICCV.2019.00667

A Coronavirus Herd Immunity Optimization (CHIO) for Travelling Salesman Problem Lamees Mohammad Dalbah, Mohammed Azmi Al-Betar, Mohammed A. Awadallah, and Raed Abu Zitar

Abstract In this paper, the travelling salesman problem (TSP) is tackled by the coronavirus herd immunity optimizer (CHIO). TSP is the problem of finding the best tour for a salesman to visit all cities with minimum cost. In essence, this is a scheduling optimization problem that belongs to the NP-hard class in almost all of its variants. CHIO is a recent human-based optimization algorithm that imitates the herd immunity strategy as a way to tackle the COVID-19 pandemic. The proposed method is evaluated against TSP models of various sizes and complexities, including six models of 25, 50, 100, 150, 200, and 300 cities. The obtained results are compared against four other methods: the genetic algorithm (GA), imperialist competitive algorithm (ICA), Keshtel algorithm (KA), and red deer algorithm (RDA). The results show that CHIO is able to achieve the best results for all large-scale problems and produces very competitive results for small TSP problems. Keywords Optimization · Coronavirus herd immunity optimizer (CHIO) · Travelling salesman problem · COVID-19 · Human-based metaheuristics

L. M. Dalbah · M. A. Al-Betar (B) · M. A. Awadallah Artificial Intelligence Research Center (AIRC), College of Engineering and Information Technology, Ajman University, Ajman, UAE e-mail: [email protected] L. M. Dalbah e-mail: [email protected] M. A. Awadallah e-mail: [email protected] M. A. Awadallah Department of Computer Science, Al-Aqsa University, P.O. Box 4051, Gaza, Palestine R. A. Zitar Sorbonne University Center of Artificial Intelligence, Sorbonne University-Abu Dhabi, Abu Dhabi, UAE © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_58


L. M. Dalbah et al.

1 Introduction

The travelling salesman problem (TSP) is a real-world scheduling problem studied in the nineteenth century by Thomas Penyngton Kirkman and Sir William Rowan Hamilton [1]. It is concerned with finding the optimal route by which a salesman can visit N cities with minimum routing cost [2]. In terms of optimization, TSP is considered one of the NP-hard problems that are computationally intractable to solve exactly, so that not exact but "good-enough" solutions are sought [3]. It is normally used as a classical optimization problem on which the performance of new algorithms is studied. The travelling salesman problem has been used with several optimization algorithms to evaluate and compare their performance. In [4], TSP was used to analyse the efficiency of a proposed algorithm that combines the ant colony optimizer (ACO), a population-based algorithm, with tabu search as a local search algorithm. Another study used the simulated annealing algorithm with greedy search as a neighbouring mechanism [5]. Other recent methods that used TSP to evaluate their performance include the imperialist competitive algorithm (ICA) [6], Keshtel algorithm (KA) [7], and red deer algorithm (RDA) [8]. Optimization is defined in [9] as the process of finding the best way to use available resources without violating any constraints. Optimization methods are classified into heuristics and metaheuristics. Heuristic methods are normally used to solve low-dimensional problems; for TSP, they construct the solution city after city using greedy rules until the complete route is created. On the other hand, to solve complex problems efficiently, metaheuristic methods are conventionally used. Metaheuristics-based algorithms are known as bio-inspired or nature-inspired since they mimic biological objects or nature [2].
Metaheuristics-based algorithms are a general optimization template that refines candidate solutions using intelligent operators that explore and exploit the problem search space, controlled by tuned parameters, until a good-enough solution is reached [10]. Metaheuristics-based algorithms are normally classified, based on the number of solutions treated at each iteration, into (i) population-based (i.e. handling many solutions at the same time) and (ii) local search (i.e. handling one solution at a time). The coronavirus herd immunity optimization algorithm (CHIO) is a new population-based metaheuristic optimization algorithm inspired by the herd immunity strategy as a way to tackle the spreading COVID-19 pandemic [11]. CHIO was evaluated and proved its worth on seven uni-modal and sixteen multi-modal testing functions and several engineering optimization problems [11]. It is a very efficient algorithm since it has few initial parameters. It also does not require mathematical derivation at the initial stage. It is simple, adaptable, and easy to use. Therefore, it is a promising search algorithm that can be used for a wide variety of optimization problems such as TSP. Algorithmically, CHIO is initiated with many solutions. At each generation, CHIO refines the solutions in accordance with the basic reproduction rate using three operations: susceptible cases, infected cases, and immune cases. These three operations

A Coronavirus Herd Immunity Optimization (CHIO) for Travelling …


modify the current solutions based on the social distancing strategy. After a predefined number of iterations, based on the age of the infected case and its immunity rate, the infected case either recovers or dies. This process is iterated until the whole population is recovered. In this paper, CHIO is adapted for TSP. The adaptation includes adjusting some CHIO operations to be workable for the discrete nature of the TSP problem. In order to evaluate the CHIO-based TSP algorithm, TSP benchmarks of different sizes (25, 50, 100, 150, 200, and 300 cities) are used. These TSP models are analysed in terms of the best, worst, and average costs and the computational time required over 30 replicated runs. For comparative evaluation, the convergence behaviour of CHIO is compared with four other well-established methods: the genetic algorithm (GA) [12], imperialist competitive algorithm (ICA) [6], Keshtel algorithm (KA) [7], and red deer algorithm (RDA) [8]. Interestingly, CHIO was able to achieve very competitive results, especially for large-dimensional TSP problems. In conclusion, CHIO is a promising algorithm for TSP, a scheduling problem open to many future developments for other kinds of scheduling problems. The intended objectives of this paper are as follows:
• Adapt and implement CHIO in order to solve the travelling salesman problem
• Define the travelling salesman objective function to measure solution effectiveness
• Analyse CHIO routes in comparison with other metaheuristic algorithms to evaluate its performance.
The rest of this article is organized as follows: Sect. 2 defines TSP in terms of solution representation and the cost function, and Sect. 3 presents the fundamental background of CHIO and how it is adapted to solve the TSP. Section 4 reports and discusses the results obtained on the different TSP models. Finally, Sect. 5 concludes the findings and suggests some possible future work.

2 TSP Definition

In order to solve any problem in an optimization context, the problem-specific knowledge must be embedded in the form of an objective function. Here, TSP is mathematically modelled in an optimization context; more specifically, we consider the symmetric version of the travelling salesman problem (sTSP). The solution representation as well as the cost function is presented as an initial step to adapt CHIO for TSP. Assume n represents the number of cities in the set C = {c_1, c_2, ..., c_n} that a salesman has to visit before returning to the starting city. The salesman must cross each city exactly once. The optimizer's objective is to find the shortest route along which all cities are visited. Each city c_i has x and y coordinates stored in the lists X = [x_1, ..., x_n] and Y = [y_1, ..., y_n]. Therefore, the Euclidean distance d(c_i, c_j)


between two cities c_i and c_j, whose coordinates are (x_i, y_i) and (x_j, y_j), respectively, can be calculated using Eq. (1):

d(c_i, c_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}    (1)

Since the problem model we are using is symmetric, it follows that d_{ij} = d_{ji} [13]. Towards our aim of finding the shortest tour, the tour length for a given tour directions array (e.g. for n = 3, directions = [2, 3, 1, 2]) is calculated using Eq. (2); this is the cost function f(R) which CHIO will try to minimize in order to reach the best solution (route) R = (c_1, c_2, ..., c_n):

\min f(R) = \sum_{i=1}^{n} d(c_i, c_{i+1})    (2)
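For concreteness, Eqs. (1) and (2) can be sketched in Python (a sketch only; the function and variable names are illustrative and not from the paper, whose implementation is in MATLAB):

```python
import math

def distance(a, b):
    """Euclidean distance between two city coordinates, Eq. (1)."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def tour_cost(route, coords):
    """Tour length per Eq. (2): the route returns to its starting city."""
    closed = route + [route[0]]          # e.g. [2, 3, 1] -> [2, 3, 1, 2]
    return sum(distance(coords[closed[i]], coords[closed[i + 1]])
               for i in range(len(route)))

# Tiny illustrative instance: three cities forming a 3-4-5 right triangle
coords = {1: (0.0, 0.0), 2: (3.0, 0.0), 3: (3.0, 4.0)}
print(tour_cost([2, 3, 1], coords))      # 4 + 5 + 3 = 12.0
```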

3 Coronavirus Herd Immunity Optimizer for Travelling Salesman Problem

Viruses are infectious agents transmitted between individuals, causing various symptoms. Normally, vaccination against well-known viruses provides an important public health tool that can be used to interrupt transmission within outbreaks and prevent subsequent occurrences [14]. The main issue arises when the world faces a new virus, which is currently happening with the coronavirus disease (COVID-19) [15]. Until a vaccine is invented, the Ministry of Health in each country applies one of two worldwide mechanisms to control the virus spreading: either the social distancing technique or herd immunity. In such pandemics, the population is categorized into infected (confirmed) individuals, immune individuals (i.e. cases that have antibodies), and susceptible individuals [16].

3.1 CHIO Procedure

CHIO is an optimization algorithm that mimics the herd immunity principle [11]. This is the main part of this paper, in which the procedural steps of CHIO and how it is adapted for TSP are presented as follows. Note that the flowchart of CHIO is provided in Fig. 1.
Step 1: Initialize the TSP and CHIO parameters. Initially, the TSP parameters, which are normally extracted from the benchmark models, include the number of cities to be visited, the distance between each pair of cities, and others. The TSP solution is represented as a vector R = (c_1, c_2, ..., c_n) of n cities. The TSP solution is evaluated based on the cost function formulated in Eq. (2), which represents the


Fig. 1 Flowchart of CHIO

total Euclidean distance between subsequent cities in R. The parameters of CHIO can be categorized into two types: algorithmic and control parameters. Algorithmic parameters are those required by all population-based algorithms, such as the maximum number of iterations, the problem dimension, the lower and upper bounds of each decision variable, and the population size. The control parameters are those that affect the convergence behaviour of CHIO, such as the number of initial infected cases, the basic reproduction rate, and the age of the infected case. These parameters are notated in Table 1.
Step 2: Generate the initial random population of CHIO. The generated population contains a number of tours equal to PopSize, where each tour stores the positions, cost, status, and age as shown in Fig. 2; the variable descriptions are given in Table 2. Each tour is a random TSP solution generated as a permutation in which the city labels are shuffled without any label being replicated within a tour. Thus, the population is a matrix of tours of size n × PopSize. Each tour in the population is evaluated as follows. First, sort the position vector in ascending order and store the original index of the

Table 1 CHIO parameters

Algorithmic parameters:
  MaxItr: Maximum number of iterations
  PopSize: Population size
  n: Problem dimension
  lb and ub: Lower and upper bounds of city c
Control parameters:
  C0: Set to one by default; represents the number of initial confirmed cases
  BR: Basic reproduction rate
  Max_age: Threshold age at which infected cases die or become immune

Fig. 2 Population structure array

Table 2 Population variables

Positions (c_i): Vector of n cities, initialized randomly
Cost value (f(R)): The tour length calculated using the fitness function
Status: Initially filled with either 0, indicating a susceptible case, or 1, indicating an infected case
Age: Initially starts at zero

presorted positions in a directions vector. Then, append the directions vector with the same starting index, resulting in directions = [directions, directions(1)]. Finally, calculate the tour cost using the cost function defined in Eq. (2). For example, if n = 3 and position = [13.5, 18.4, 9.6], then the sorted position is [9.6, 13.5, 18.4] and the sorted indices are [3, 1, 2]; accordingly, directions will be [3, 1, 2, 3].
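This random-key style decoding of a continuous position vector into a closed tour can be sketched as follows (the helper name is hypothetical):

```python
def decode_directions(position):
    """Map a continuous position vector to a closed tour:
    sort the values, keep their original (1-based) city indices,
    and append the starting city to close the route."""
    order = sorted(range(len(position)), key=lambda i: position[i])
    directions = [i + 1 for i in order]   # 1-based city labels
    return directions + [directions[0]]   # close the tour

print(decode_directions([13.5, 18.4, 9.6]))  # [3, 1, 2, 3], as in the example above
```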


Step 3: Population evolution. This is the main evolution loop in CHIO: the current position c_i^j(t) of city (or direction) i in tour j at iteration t is adjusted according to the rules of Eq. (3):

c_i^j(t+1) \leftarrow
\begin{cases}
c_i^j(t) & r \ge BR \\
C(c_i^j(t)) & r < \frac{1}{3} BR \quad \text{(infected case)} \\
N(c_i^j(t)) & r < \frac{2}{3} BR \quad \text{(susceptible case)} \\
R(c_i^j(t)) & r < BR \quad \text{(immune case)}
\end{cases}    (3)

Note that c_i^j(t+1) is the adjusted value, and BR is the basic reproduction rate, which determines the speed of spreading the virus across the population. r is a random value generated uniformly between 0 and 1. C(c_i^j(t)), N(c_i^j(t)), and R(c_i^j(t)) are neighbouring functions referring to the infected-case, susceptible-case, and immune-case rules, respectively. These three rules are discussed below.
Infected case: If the second rule condition is satisfied, the current c_i^j(t) becomes infected as in Eq. (4), and its new value is calculated by Eq. (5):

c_i^j(t+1) = C(c_i^j(t))    (4)

where

C(c_i^j(t)) = c_i^j(t) + r \times (c_i^j(t) - c_i^k(t))    (5)

Note that k is the index of an infected case selected randomly from the population (status = 1).
Susceptible case: If the third rule condition (i.e. r < (2/3) BR) is satisfied, the current c_i^j(t) is adjusted based on Eq. (6), and the new value N(c_i^j(t)) is calculated by Eq. (7):

c_i^j(t+1) = N(c_i^j(t))    (6)

N(c_i^j(t)) = c_i^j(t) + r \times (c_i^j(t) - c_i^m(t))    (7)

where m is the index of a susceptible case chosen randomly from the population (status = 0).
Immune case: If the last rule condition (i.e. r < BR) is satisfied, the current c_i^j becomes immune as formulated in Eq. (9), and the new value R(c_i^j) is calculated by Eq. (10):

c^v = \arg\min_{j \in \{k \mid S_k = 2\}} f(c^j)    (8)

c_i^j(t+1) = R(c_i^j(t))    (9)

R(c_i^j(t)) = c_i^j(t) + r \times (c_i^j(t) - c_i^v(t))    (10)
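The per-dimension update of Eqs. (3)-(10) can be sketched in Python as follows. This is only a sketch under assumed variable names (the paper's implementation is in MATLAB), operating on the continuous positions that are later decoded into tours:

```python
import random

def chio_update(x, population, status, best_immune, BR):
    """Per-dimension CHIO update of one solution x (Eqs. 3-10).
    population: list of position vectors; status[j] is 0 (susceptible),
    1 (infected), or 2 (immune); best_immune: index of the fittest immune case."""
    infected = [j for j, s in enumerate(status) if s == 1]
    susceptible = [j for j, s in enumerate(status) if s == 0]
    new_x = list(x)
    is_corona = False                     # did any infected-case rule fire?
    for i in range(len(x)):
        r = random.random()
        if r >= BR:
            continue                      # keep the current value, Eq. (3)
        if r < BR / 3 and infected:       # infected-case rule, Eq. (5)
            k = random.choice(infected)
            new_x[i] = x[i] + r * (x[i] - population[k][i])
            is_corona = True
        elif r < 2 * BR / 3 and susceptible:  # susceptible-case rule, Eq. (7)
            m = random.choice(susceptible)
            new_x[i] = x[i] + r * (x[i] - population[m][i])
        else:                             # immune-case rule, Eq. (10)
            new_x[i] = x[i] + r * (x[i] - population[best_immune][i])
    return new_x, is_corona

random.seed(1)
pop = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
status = [0, 1, 2]
new_x, corona = chio_update(pop[0], pop, status, best_immune=2, BR=0.9)
```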


Note that v is chosen to satisfy Eq. (8), which means that f(c^v) has the minimum cost value among the cases with status = 2.
Step 4: Update the population. For each generated tour R^j(t+1), the objective function f(R^j(t+1)) is calculated. When its cost is less than the cost of the previous tour f(R^j(t)), the replacement takes place. Accordingly, the status value of R^j(t+1) is updated based on Eq. (11), whose parameters are is_corona(c^j(t+1)) and \Delta f(c). Note that is_corona(c^j(t+1)) is one if c^j(t+1) has become an infected case and zero otherwise. \Delta f(c) is the sum of the tours' objective values divided by the population size, written as \Delta f(c) = \sum_{i=1}^{PopSize} f(c^i) / PopSize, which represents how strong the population is. Finally, according to the status value, the age vector is incremented by one if S^j = 1:

S^j \leftarrow
\begin{cases}
1 & f(c^j(t+1)) < \Delta f(c) \ \wedge \ S^j = 0 \ \wedge \ is\_corona(c^j(t+1)) \\
2 & f(c^j(t+1)) > \Delta f(c) \ \wedge \ S^j = 1
\end{cases}    (11)

Step 5: Fatality cases [11]. If the age of an infected case becomes greater than or equal to the maximum age defined earlier, this case is considered dead, and a new tour is generated with random positions, with its age and status set to zero. This helps in escaping local optima and maintains the diversity of CHIO.
Step 6: Stop criterion. The loop from Step 3 to Step 5 repeats until the terminating condition is met, which in our case is the defined Max_Itr. At this point, the majority of cases are susceptible or immune, and infected cases no longer exist.
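Steps 4 and 5 can be sketched as follows; the helper is hypothetical, and `mean_cost` plays the role of \Delta f(c) in Eq. (11):

```python
def update_status(j, new_cost, mean_cost, status, age, is_corona, max_age):
    """Steps 4-5: adjust the status and age of tour j after a replacement.
    Returns True if the case died and must be regenerated randomly."""
    if status[j] == 0 and is_corona and new_cost < mean_cost:
        status[j] = 1                    # susceptible -> infected, Eq. (11)
        age[j] = 0
    elif status[j] == 1 and new_cost > mean_cost:
        status[j] = 2                    # infected -> immune, Eq. (11)
    if status[j] == 1:
        age[j] += 1                      # infected cases grow older
        if age[j] >= max_age:            # fatality: regenerate this case
            status[j], age[j] = 0, 0
            return True
    return False
```

As a usage check, a susceptible case with a below-average cost becomes infected with age 1, while an infected case reaching `max_age` is reset for regeneration.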

4 Experiments and Results

The efficiency of the proposed CHIO in solving the travelling salesman problem is evaluated in this section. CHIO is tested on six different TSP cases. To assess its performance, we compare the results of CHIO with those of other metaheuristic algorithms: the genetic algorithm (GA) [12], imperialist competitive algorithm (ICA) [6], Keshtel algorithm (KA) [7], and red deer algorithm (RDA) [8]. The parameter settings of the proposed CHIO were fixed for all TSP models as follows: the spreading (basic reproduction) rate is equal to 5%, Max_age is set to 100, C0 is equal to 1, the population size (PopSize) is set to 100, and the lower and upper bounds (lb and ub) are set to −30 and +30, respectively. The maximum number of iterations is set to 500. The proposed CHIO is run 30 times on each TSP model. It should be noted that these settings are chosen to be similar to those used by the other competitive algorithms. The proposed CHIO is implemented using MATLAB R2017a, and the experiments are run on a laptop with a Core i7 CPU, 16 GB of RAM, and the Windows 10 operating system.

Table 3 Model parameters

x_min and x_max: x coordinates upper and lower limits
y_min and y_max: y coordinates upper and lower limits
n: Number of cities

4.1 TSP Data Description

Six models were used in implementing TSP (25, 50, 100, 150, 200, and 300 cities), and the model initialization parameters are listed in Table 3. The positions of the n cities were defined as random integers between the x and y boundaries. Accordingly, the distances between every two cities were calculated using the previously discussed Eq. (1) and stored in the n-by-n distance matrix d shown in Eq. (12), in which the diagonal equals zero since it refers to the distance between a city and itself:

d = \begin{bmatrix}
d_{11} & d_{12} & \cdots & d_{1n} \\
d_{21} & d_{22} & \cdots & d_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
d_{n1} & d_{n2} & \cdots & d_{nn}
\end{bmatrix}    (12)
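Building the distance matrix of Eq. (12) can be sketched in plain Python. The coordinate bounds below are illustrative assumptions, since Table 3 does not give numeric limits:

```python
import math
import random

random.seed(42)
n = 25                                    # number of cities (Model 25)
# Assumed coordinate bounds standing in for x_min..x_max, y_min..y_max of Table 3
xs = [random.randint(0, 100) for _ in range(n)]
ys = [random.randint(0, 100) for _ in range(n)]

# n-by-n distance matrix of Eq. (12); the diagonal is zero by construction
d = [[math.hypot(xs[i] - xs[j], ys[i] - ys[j]) for j in range(n)]
     for i in range(n)]

assert all(d[i][i] == 0.0 for i in range(n))                        # zero diagonal
assert all(d[i][j] == d[j][i] for i in range(n) for j in range(n))  # symmetric TSP
```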

4.2 TSP Results and Comparisons

Table 4 shows the results of running CHIO, as well as the results of the other algorithms (i.e. GA, ICA, KA, and RDA), for the TSP models (Model 25 to Model 300). Notably, the results of the comparative methods are taken from [8]. In this table, the best cost, worst cost, mean cost, and computational time are reported for each TSP model. It should be noted that the best results are highlighted in bold; the minimum values indicate the best results. In terms of the best cost, it can be seen that the performance of the proposed CHIO is better than the other comparative methods on the two most complex models (Model 200 and Model 300). Furthermore, the CHIO algorithm achieved the third-best results on three models (Model 50, Model 100, and Model 150), and the fourth-best results on Model 25. In terms of the mean costs, the performance of CHIO outperforms the other comparative algorithms on three TSP models (i.e. Model 300, Model 200, and Model 150). Furthermore, CHIO obtained the second-best results on Model 100 and Model 50. Finally, CHIO achieved the fourth-best results on Model 25.


Table 4 Algorithm results for TSP with 500 iterations and 100 populations
(Columns per algorithm: best cost, worst cost, mean cost, elapsed time (s))

Model 25
CHIO: 1055.12, 1174.80, 1121.01, 11.86
GA: 952.85, 1200, 1072.62, 2.57
ICA: 1042.20, 1216.35, 1108.29, 2.9
KA: 1231.05, 1467.26, 1342.45, 6.58
RDA: 951.45, 1047.93, 971.39, 6.36

Model 50
CHIO: 1137.89, 1310.56, 1246.97, 12.24
GA: 830.07, 1546, 1211.47, 2.79
ICA: 1133.71, 1428.14, 1250.63, 2.97
KA: 1488.45, 1910.13, 1735.52, 7
RDA: 842.49, 1489.41, 1279.91, 9.58

Model 100
CHIO: 2635.61, 3062.03, 2854.73, 12.96
GA: 2362.73, 3473.85, 2923.82, 3.39
ICA: 2366.41, 3159, 2841.70, 3.59
KA: 3338.12, 4198.71, 3848.39, 8.26
RDA: 3092.09, 3766.15, 3578.39, 11.37

Model 150
CHIO: 4549.57, 5065.78, 4830.84, 13.75
GA: 4328.37, 5855.58, 4972.11, 3.87
ICA: 4431.58, 5256.62, 4878.50, 3.98
KA: 5734.13, 6593.10, 6208.52, 9.45
RDA: 5661.11, 6056.69, 5907.86, 12.61

Model 200
CHIO: 6560.90, 7376.42, 6997.22, 23.5
GA: 6595.78, 7844, 7180.94, 4.12
ICA: 6642.35, 7542.72, 7102.63, 4.5
KA: 8315.91, 9135.46, 8730.51, 11.34
RDA: 8158.97, 8527.05, 8353.95, 15.27

Model 300
CHIO: 10509.95, 11545.83, 10921.61, 16.09
GA: 10688.41, 12546.12, 11592.05, 5.44
ICA: 10656.93, 11753.67, 11105.87, 7.91
KA: 12454.39, 13664.01, 13105.72, 13.67
RDA: 12485.05, 12917.24, 12746.35, 18.28

Source: Bold font in the original refers to the best solution achieved (lowest is best)


Fig. 3 Convergence behaviour of CHIO, GA, ICA, KA, and RDA for TSP models (panels: Model 25, Model 50, Model 100, Model 150, Model 200, Model 300)

Based on the above, it can be reported that the performance of CHIO is better than that of the other algorithms on the complicated cases of the TSP problem. This is because CHIO is able to strike the right balance between local exploitation and global exploration capabilities in navigating the search space, and thus achieves good results. However, the performance of CHIO is only competitive for the small cases of the TSP problem, owing to its slow convergence on small problems.


From Table 4, it can be seen that GA is faster than the other comparative methods on all TSP models. On the other hand, the computational time of CHIO is larger than that of the other comparative methods in almost all TSP cases. Figure 3 shows the convergence behaviour of the proposed CHIO in comparison with the other comparative methods over 500 iterations for all TSP models. The x-axis represents the number of iterations, while the y-axis reflects the fitness values. Clearly, it can be seen from the plots that the convergence of CHIO improves gradually during the last stages of the search process for all TSP models, which is reflected in the results of the CHIO algorithm provided in Table 4.

5 Conclusion and Future Work

In this paper, the coronavirus herd immunity optimization (CHIO) algorithm is utilized to tackle the travelling salesman problem (TSP). CHIO was inspired by herd immunity, a strategy used by Ministries of Health in tackling the COVID-19 pandemic until the invention of a vaccine. CHIO starts with a random initial population and ends with a population in which most of the individuals are susceptible or immune. TSP is tackled by finding the best tour for a salesman who plans to visit all cities at minimum cost. It is an NP-hard problem which cannot be easily tackled by calculus-based methods. In order to evaluate the viability of CHIO, six TSP models of various city sizes and complexities are used: 25, 50, 100, 150, 200, and 300 cities. The results are recorded in terms of best, average, and worst costs and computational time. CHIO is able to solve all TSP models efficiently. Comparatively, the results of four carefully selected, well-established methods are compared with those obtained by CHIO. Interestingly, CHIO produces the best results for the three large-scale TSP models, namely TSP with 150, 200, and 300 cities, and achieves the second-best results for TSP with 50 and 100 cities. In conclusion, CHIO is a very powerful algorithm for TSP-like scheduling problems and can be used efficiently to tackle this kind of problem. As future work, other TSP models extracted from other libraries, such as TSPLIB real data, can be used for further evaluation. CHIO can also be adapted for other scheduling problems such as course timetabling, examination timetabling, nurse rostering, and many others [17–19]. A hybrid version of CHIO can be tested in the future as well.

Acknowledgements The first author would like to thank Ajman University (AU) for supporting his Master of AI study and Research Assistant (RA) position.


References
1. R. Matai, S.P. Singh, M.L. Mittal, Traveling salesman problem: an overview of applications, formulations, and solution approaches, in Traveling Salesman Problem, Theory and Applications, vol. 1 (2010)
2. S. Deb, S. Fong, Z. Tian, R.K. Wong, S. Mohammed, J. Fiaidhi, Finding approximate solutions of NP-hard optimization and TSP problems using elephant search algorithm. J. Supercomput. 72(10), 3960–3992 (2016)
3. S. Arora, Polynomial time approximation schemes for Euclidean traveling salesman and other geometric problems. J. ACM 45(5), 753–782 (1998)
4. R.W. Dewantoro, P. Sihombing, Sutarman, The combination of ant colony optimization (ACO) and tabu search (TS) algorithm to solve the traveling salesman problem (TSP), in 2019 3rd International Conference on Electrical, Telecommunication and Computer Engineering (ELTICOM) (2019), pp. 160–164
5. X. Geng, Z. Chen, W. Yang, D. Shi, K. Zhao, Solving the traveling salesman problem based on an adaptive simulated annealing algorithm with greedy search. Appl. Soft Comput. 11(4), 3680–3689 (2011)
6. M.-H. Chen, S.-H. Chen, P.-C. Chang, Imperial competitive algorithm with policy learning for the traveling salesman problem. Soft Comput. 21(7), 1863–1875 (2017)
7. M. Hajiaghaei-Keshteli, M. Aminnayeri, Solving the integrated scheduling of production and rail transportation problem by Keshtel algorithm. Appl. Soft Comput. 25, 184–203 (2014)
8. A.M. Fathollahi-Fard, M. Hajiaghaei-Keshteli, R. Tavakkoli-Moghaddam, Red deer algorithm (RDA): a new nature-inspired meta-heuristic. Soft Comput. 1–29 (2020)
9. S.E. De León-Aldaco, H. Calleja, J.A. Alquicira, Metaheuristic optimization methods applied to power converters: a review. IEEE Trans. Power Electron. 30(12), 6791–6803 (2015)
10. M.A. Al-Betar, A.T. Khader, I.A. Doush, Memetic techniques for examination timetabling. Ann. Oper. Res. 218(1), 23–50 (2014)
11. M.A. Al-Betar, Z.A.A. Alyasseri, M. Awadallah, I.A. Doush, Coronavirus herd immunity optimizer (CHIO). Neural Comput. Appl. 1–32 (2020)
12. H. Braun, On solving travelling salesman problems by genetic algorithms, in International Conference on Parallel Problem Solving from Nature (Springer, Berlin, 1990), pp. 129–133
13. E. Osaba, X.-S. Yang, J. Del Ser, Traveling salesman problem: a perspective review of recent research and new results with bio-inspired metaheuristics, in Nature-Inspired Computation and Swarm Intelligence (Elsevier, Amsterdam, 2020), pp. 135–164
14. J.A. Regules, J.H. Beigel, K.M. Paolino, J. Voell, A.R. Castellano, Z. Hu, P. Muñoz, J.E. Moon, R.C. Ruck, J.W. Bennett et al., A recombinant vesicular stomatitis virus Ebola vaccine. New Engl. J. Med. 376(4), 330–341 (2017)
15. C.-C. Lai, T.-P. Shih, W.-C. Ko, H.-J. Tang, P.-R. Hsueh, Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease-2019 (COVID-19): the epidemic and the challenges. Int. J. Antimicrob. Agents 105924 (2020)
16. C.M. Pease, An evolutionary epidemiological mechanism, with applications to type A influenza. Theor. Popul. Biol. 31(3), 422–452 (1987)
17. M.A. Awadallah, A.L. Bolaji, M.A. Al-Betar, A hybrid artificial bee colony for a nurse rostering problem. Appl. Soft Comput. 35, 726–739 (2015)
18. M.A. Al-Betar, A.T. Khader, M. Zaman, University course timetabling using a hybrid harmony search metaheuristic algorithm. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 42(5), 664–681 (2012)
19. M.A. Al-Betar, University course timetabling using a hybrid harmony search metaheuristic algorithm. J. Ambient Intell. Humanized Comput. https://doi.org/10.1007/s12652-020-02047-2

System for Situational Awareness Using Geospatial Twitter Data

Hamid Omar, Akash Sinha, and Prabhat Kumar

Abstract Recent advances in social media have unveiled its potential for providing real-time solutions for disaster management. The work proposed in this paper utilizes Twitter posts to improve the flow of information during crisis situations in order to provide support and save lives. The proposed system employs machine learning techniques to perform multiclass classification and filter important tweets with a high degree of accuracy. The proposed system accurately flags tweets about injured or dead people, which we hope can expedite the search and rescue efforts of the teams concerned. Analysis of the results obtained indicates that the efficiency of the system can be further enhanced by using appropriate deep learning techniques.

Keywords Natural language processing · Situational awareness · Machine learning

1 Introduction

Since the advent of smartphones, social media has witnessed a meteoric rise to relevance. The advancements in social media technology stretch from affecting the results of elections to influencing our day-to-day actions such as purchasing habits. In the post-pandemic era, social media has helped enable everything from education to the spreading of crucial and lifesaving information. From a platform built around its users, it has now come to be widely adopted by corporations, mainstream media, and even national and state governments as a means to disseminate information to the masses. Twitter has been used for spreading news and updates around the world and has been shown to have application in emergency situations of natural disasters [1].

H. Omar (B) Computer Science and Engineering, Vellore Institute of Technology, Vellore, India
A. Sinha · P. Kumar Computer Science and Engineering, National Institute of Technology, Patna, India
e-mail: [email protected]
P. Kumar e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_59

Every


second, on average, around 6000 tweets are posted on Twitter, which corresponds to over 350,000 tweets per minute, 500 million tweets per day, and around 200 billion tweets per year [2]. Given that such a huge flow of real-time information takes place on Twitter, it is apparent that it possesses potential as an important tool for gathering information during times of crisis. However, this deluge of data is of no use unless there is a process in place which can analyze recent tweets in a systematic manner, accurately classify potential concerns, and relay them to the appropriate authorities [3, 4]. In the past, Twitter has been used in disaster relief operations in India as well. Tweets from disaster-hit regions can be classified and channelized in order to extract important information which could then be utilized by authorities to save lives. If such a system could be employed pre-emptively, rather than after the fact, it could significantly improve the response times of the authorities concerned. To that end, we propose a process based on text analysis which is augmented by Twitter's geotagged-location feature, allowing us to pinpoint the location of the region of concern. Our work uses text-based analysis along with other heuristic approaches in order to capture real-time trends in Twitter data and inform the necessary authorities as and when the need arises. We have used deep learning methods such as BERT [5] along with pretrained models for classification tasks and juxtaposed the results with our own approach. We have also used sentiment analysis as an added parameter to ensure that tweets which are related to a disaster but are not of immediate concern are filtered out. The social media analytics and reporting toolkit (SMART) [6] is a system for visual analysis of geotagged, publicly available, real-time tweets to enhance situational awareness and expedite emergency response. It has, however, only been used in the USA.
Tweets in India include regional lingo which may confound models trained for a different demographic. Consequently, we have trained our model on an appropriate corpus in order to ensure that tweets involving all kinds of crises and eventualities can be recognized and filtered for further processing. We have also implemented a system to infer the current location of a tweet using a Markov model. This serves to compensate for the relatively few geotagged tweets available to us for inference. We were able to achieve significantly accurate results on our training data with regard to the classification of tweets relating to injured or dead people, something that we believe will be of significant importance to rescue workers. Implementation of our proposed system will increase the situational awareness of the relevant authorities, enabling them to take pertinent action based on the accurate and timely information that our system can provide. The rest of the paper is organized as follows: Sect. 2 reviews the related literature; Sect. 3 provides details of the proposed system; Sect. 4 discusses the implementation details, including the dataset used and the results obtained; and finally, Sect. 5 presents the concluding remarks and directions for future work.

System for Situational Awareness Using Geospatial …

733

2 Related Works

Wang et al. [7] have performed research on recognizing the occurrence of new events, their time, and their location. Their methodology involves a Gaussian mixture model for bursty-word candidate selection, a system based on HDP with an added time dimension, termed td-HDP, for the detection of new events, and lastly, locating the event using the CRF algorithm. Tagging the location was done by named entity recognition. This paper, however, addresses event detection in general and is not exclusively dedicated to crisis detection. Cheng et al. [8] have used STSS (space–time scan statistics) in the context of tweets surrounding a helicopter crash in London. This methodology searched for spatio-temporal clusters in order to perform event detection. However, their area of application was very specific, and the analysis took place after the fact. Also, the event chosen for study took place in a huge metropolitan city, making it very easy to collect large amounts of data, a large portion of which was generated by people in and around the target area. Thus, its relevance to crisis management in remote locations requires further study. Zhou et al. [9] proposed a solution for online social event monitoring, pertinent to real-world applications like crisis management. They proposed a methodology that takes into account the content, location, and time information for tweet representation. Similarity is calculated using link, location, context, and time information. Additionally, a hash scheme and query optimization were also proposed to aid in fast social monitoring. This methodology was tested and found to be successful in two Australian crises. Regarding the classification of tweets, or short messages in general, Sriram et al. [10] made a few observations on possible methodologies and the pros and cons of each.
By analyzing potential approaches, such as the extraction of meta-information or even external features from world knowledge, the authors came to the conclusion that using efficient methods along with a minimal number of features is the right way to approach this problem. Their model classifies tweets into five generic categories, namely news, opinions, deals, events, and private messages. Thus, while this work helps in the classification of tweets, it does not address any geospatial aspect of the texts. In their paper, Singh et al. [11] used a methodology wherein they tried to classify tweets by flood victims and, if need be, infer their location. In case a tweet is not geotagged by default, a Markov model is used to infer the location based on the locations given in the user's previous tweets. However, this system relies on the user having shared their location at least a few days prior to the time of inference; it will not be able to perform in cases where the user has never turned on their location at all. Secondly, this system has been trained on a very specific corpus, i.e., a flood corpus, thus calling into question its ability to generalize. The accuracy of the model could also be improved to give better, more accurate results. Tweets have also been the basis for detecting very precise crisis situations. Joshi et al. [12] demonstrated the use of Twitter in the case of a thunderstorm asthma outbreak. They created a platform based on social media posts, where they use such


H. Omar et al.

posts as input to the model. To prevent their model from being swamped by random tweets, they applied three layers of filtering prior to a final step for monitoring. Their results showed a lot of promise, as they were able to detect outbreaks before official/mainstream sources. There has even been work done on inferring more complex details from Twitter related to natural disasters. Sit et al. [13] constructed an analytical framework to (1) identify and categorize fine-grained details about a disaster, such as affected individuals, damaged infrastructure, and disrupted services, and (2) distinguish impact areas and time periods, and the relative prominence of each category of disaster-related information across space and time. Their methodology has potential for real-time identification of the effects of a disaster and thus can be very beneficial to emergency crews on the ground.

3 Proposed Work

This section provides the details of the proposed system. Figure 1 gives a high-level overview of the different modules of the proposed system. The functionality of the modules is discussed below:

Fig. 1 Module diagram

(i) Mining Tweets: This module is responsible for using a callback to receive tweets from the Twitter API. We mine tweets using the free version of the Twitter API available via a developer account, and are therefore subject to a maximum cap on the number of tweets we can mine.
(ii) Twitter API: This denotes the official Twitter API. It provides user access tokens and enables the mining of Twitter data. Using it, we make GET requests to obtain tweet data points in JSON format.
(iii) Filtering Geotagged Tweets: This module is responsible for filtering tweets based on location. We enter a bounding box and accordingly get only those geotagged tweets whose geotags fall within the highlighted region; in our case, the highlighted region is Bihar.
(iv) Preprocessing: Preprocessing extracts the important information required for performing inference and discards the rest. In this case, we need the text of the tweet in particular. Having received the text, we remove any superfluous data such as hashtags, external links, Twitter handles, and emoticons, which leaves us with plain text. We then perform lemmatization of the text in order to extract the root words by removing prefixes and suffixes, leaving only the root of each word. The final step is to convert the lemmatized words into word embeddings. Word embeddings represent the semantics of a word as a vector and enable us to perform mathematical operations on the text, which is a crucial prerequisite to any kind of processing of natural language data.
(v) Tweets Classification: This module is responsible for classifying tweets as related to a crisis or not. Text converted into word embeddings in the previous module is passed through a pretrained supervised learning classification model which, after performing inference on the input data, assigns it to one of two categories: (a) disaster related or (b) not related to a disaster.
(vi) Disaster Detection: If multiple tweets originating from the same location have been tagged positive by our classification model, the module concludes that this may be a disaster event that needs to be reported to the relevant authorities. It therefore passes this information, along with the geographic location, to the next module.
(vii) Dashboard: This relays the geographic information, along with the tweets classified by the classifier, to the concerned authorities. The people who view this information can act upon it in order to ensure the safety of the people involved in a particular incident.
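The cleaning step of the preprocessing module can be sketched with standard-library regular expressions (the function name and the exact cleaning rules are our assumptions; the paper gives no implementation, and lemmatization and embedding lookup would follow this step):

```python
import re

# Hypothetical sketch of tweet cleaning: strip URLs, @-handles, the '#'
# symbol (keeping the hashtag word), and all non-alphabetic characters
# such as emoticons, leaving lower-cased plain text.
def clean_tweet(text: str) -> str:
    text = re.sub(r"https?://\S+", " ", text)   # external links
    text = re.sub(r"@\w+", " ", text)           # Twitter handles
    text = text.replace("#", " ")               # hashtag symbol
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # emoticons, punctuation
    return " ".join(text.lower().split())
```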

The information is received in CSV format. It is then visualized in a manner that makes the tweet contents and location easily decipherable to the user, thus fulfilling the purpose of generating an actionable insight. The workflow of the proposed system is shown in Fig. 2. The algorithm for classifying the tweets is presented in Sect. 3.1.
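The geotag filter of module (iii) reduces to a point-in-box test; a minimal sketch (the helper name and the coordinates in the example are our assumptions; in the actual system the bounding box is handed to the Twitter API):

```python
# Hypothetical helper: is a geotag (lat, lon) inside a bounding box
# given as (min_lat, min_lon, max_lat, max_lon)?
def in_bounding_box(lat, lon, box):
    min_lat, min_lon, max_lat, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
```

For example, a rough box around Bihar, (24.2, 83.3, 27.6, 88.3), would accept a tweet geotagged near Patna at (25.6, 85.1).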

3.1 Algorithm

1. Pick k points from the dataset at random.
2. Loop through each attribute:
   2.1. Partition all instances on the basis of the attribute.
   2.2. Compute the information gain of each partition.
   2.3. Use the attribute with the highest information gain to split the current node.
   2.4. Make each partition a child of the current node.
   2.5. For each child:
        2.5.1. If the child node is "pure", make it a leaf.
        2.5.2. Else, set the child as the current node and return to step 2.
3. If x (a hyperparameter) decision trees have been constructed, return; else go to step 1.
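Step 2.2 of the algorithm relies on information gain; a minimal sketch (function names are ours; entropy is measured in bits):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partitions):
    # `partitions` are the label lists produced by splitting `labels` on
    # one attribute; the gain is the entropy reduction due to the split.
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - remainder
```

A split that perfectly separates the classes attains the maximum gain; a split that leaves each partition with the original class mix gains nothing.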


Fig. 2 Workflow of the proposed system

Since we have to classify data points into multiple categories, we need a classification algorithm. Multiple classifiers exist in the literature; in our work, we have chosen the random forest algorithm. To facilitate each split, we use information gain as the metric.
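The forest's final label is the majority vote over the individual trees' predictions, as described in Sect. 4; a minimal sketch of that aggregation:

```python
from collections import Counter

# Each tree's predicted label counts as one vote; the label with the
# most votes is the forest's final classification for the data point.
def majority_vote(tree_predictions):
    return Counter(tree_predictions).most_common(1)[0][0]
```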

4 Results and Analysis

This section discusses the implementation details of the proposed system. The dataset used for the experimental evaluation of the proposed work consists of tweets posted during and after the India-Pakistan floods of 2014 [14]. The dataset consists of the following labels:

• not related or irrelevant,
• other useful information,
• donation needs or offers or volunteering services,
• injured or dead people,
• displaced people and evacuations,
• infrastructure and utilities damage, and
• caution and advice.

Data is initially preprocessed, and the resultant vector is passed to all decision trees in the random forest. Considering each decision tree's classification to be a vote for a predicted label, the final classification for a given data point is the label with the most votes. The performance of the random forest classifier in the proposed model is also compared with an XGBoost classifier. Tables 1 and 2 present the

System for Situational Awareness Using Geospatial …


Table 1 Accuracy score for each model

Labels                                             Random forest (%)  XGBoost (%)
Caution and advice                                 50                 42.86
Displaced people and evacuations                   40                 30
Donation needs or offers or volunteering services  45.45              50
Infrastructure and utilities damage                50                 32
Injured or dead people                             94.21              92.86
Not related or irrelevant                          56.52              76.92
Other useful information                           61                 62.5

Table 2 F1 score for each model

Random forest  0.777046
XGBoost        0.761401

accuracy and F1 scores obtained using the two classifiers, respectively. The results clearly indicate that higher accuracy is achieved by the random forest classifier.

5 Conclusions and Future Works

The aim of the work proposed in this manuscript is to build a reliable system which can classify tweets that seem of concern and return those results to a monitoring system, which can be of use in case of an untoward incident. The proposed work employs tweets posted by users on the Twitter platform to infer the information required for providing relief services in case of disasters. Experimental evaluation of the proposed work shows that our model identifies injured people and casualties in given tweets with over 90% accuracy. Our XGBoost model has also helped us filter irrelevant tweets with a great degree of accuracy. Therefore, these algorithms can be used to gain critical information in crisis situations, which was our initial goal. However, further improvements can be made in the accuracy of the machine learning model. Deep learning models such as LSTMs may provide improved classification. The method of creating word embeddings can also be improved by using alternatives such as GloVe.

References 1. G.P. Cooper, V. Yeager, F.M. Burkle, I. Subbarao, Twitter as a potential disaster risk reduction tool. Part I: Introduction, terminology, research and operational applications. PLOS Currents Disasters, Edition 1 (2015)


2. Twitter usage statistics. Available at https://www.internetlivestats.com/twitter-statistics/#sources. Accessed on 16 Oct 2020
3. A. Sinha, P. Kumar, N.P. Rana, R. Islam, Y.K. Dwivedi, Impact of internet of things (IoT) in disaster management: a task-technology fit perspective. Ann. Oper. Res. 283(1), 759–794 (2019)
4. A. Amirkhanyan, C. Meinel, Analysis of the value of public geotagged data from twitter from the perspective of providing situational awareness, in Social Media: The Good, the Bad, and the Ugly. I3E 2016. Lecture Notes in Computer Science, ed. by Y. Dwivedi et al., vol. 9844 (Springer, Cham, 2016)
5. S. González-Carvajal, E.C. Garrido-Merchán, Comparing BERT against traditional machine learning text classification. arXiv preprint arXiv:2005.13012 (2020). Accessed on 16 Oct 2020
6. L.S. Snyder, M. Karimzadeh, C. Stober, D.S. Ebert, Situational awareness enhanced through social media analytics: a survey of first responders, in 2019 IEEE International Symposium on Technologies for Homeland Security (HST), Woburn, MA, USA (2019), pp. 1–8. https://doi.org/10.1109/HST47167.2019.9033003
7. X. Wang, F. Zhu, J. Jiang, S. Li, Real time event detection in twitter, in International Conference on Web-Age Information Management, June 2013 (Springer, Berlin, 2013), pp. 502–513
8. T. Cheng, T. Wicks, Event detection using twitter: a spatio-temporal approach. PLoS ONE 9, e97807 (2014)
9. X. Zhou, L. Chen, Event detection over twitter social media streams. VLDB J. 23(3), 381–400 (2014)
10. B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, M. Demirbas, Short text classification in twitter to improve information filtering, in Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2010, pp. 841–842
11. J.P. Singh, Y.K. Dwivedi, N.P. Rana, A. Kumar, K.K. Kapoor, Event classification and location prediction from tweets during disasters. Ann. Oper. Res. 283(1), 737–757 (2019)
12. A. Joshi, R. Sparks, J. McHugh, S. Karimi, C. Paris, C.R. MacIntyre, Harnessing tweets for early detection of an acute disease event. Epidemiology 31(1), 90 (2020)
13. M.A. Sit, C. Koylu, I. Demir, Identifying disaster-related tweets and their semantic, spatial and temporal context using deep learning, natural language processing and spatial analysis: a case study of Hurricane Irma. Int. J. Digital Earth 12(11), 1205–1229 (2019)
14. India Floods 2014. Crisis NLP. Available at https://crisisnlp.qcri.org/lrec2016/content/2014_india_floods_en.html. Accessed on 16 Oct 2020

Classification of Malware Using Visualization Techniques Divyansh Chauhan, Harjot Singh, Himanshu Hooda, and Rahul Gupta

Abstract In the present world, every device, whether small or big, is susceptible to malware attacks. A continuous evolution of malware attacks can be seen daily, with the capacity to harm any system ranging from a small device like a mobile phone to a large-scale system like a satellite station. In this paper, an improvised method for the classification of malware into their respective families is proposed. The method visualizes malware files as images in different colour modes: RGB, HSV, greyscale and BGR. This can make patterns visible more clearly, which can be exploited for the classification of malware images. A support vector machine (SVM) is used for the classification of these malware images of different colour modes, giving the highest accuracy of 96% in the case of HSV, greyscale and BGR. Evaluating the method in terms of classification accuracy, result consistency, recall, F1 score and precision, the results indicate that the method has the potential to enable a more efficient way of malware classification.

Keywords Malware · Malware attack · Visualization of malware · Colour modes · Malware classification · Support vector machine · Virus

1 Introduction

With the increased use of the Internet over the last 10 years, there is a high risk of systems getting infected with malicious software. According to reports, the number of new malware samples reported every day has risen to about 25,000. The structures of newly introduced malware may differ, but the manner in which they work remains almost the same as that of old malware. Malware samples having similar functionalities can be considered as belonging to the same family [1]. Detecting and classifying malware is important so that its signature can be supplied to anti-malware software. The malware file can be visualized using image processing

D. Chauhan (B) · H. Singh · H. Hooda · R. Gupta Delhi Technological University, New Delhi, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_60


techniques. The malware files can be converted into images which can be used for malware classification purposes. Malware is a program deliberately written to perform malicious activities such as stealing a user's information or spying on someone. With the drastic increase in the use of the Internet and modern digital devices, the frequency of attacks has risen rapidly, giving malware more opportunities to exploit vulnerabilities. Not only has the frequency of attacks increased, but the creation of new malware has also increased in the past five years. This growth makes the area a challenging one for research. Identification of a malicious program at the network or at the end device is termed malware detection. At present, there are two major classes of malware detection techniques:

• Signature-based detection

A signature is a precise sequence of bytes. To find a malicious file, a signature-based detection technique looks for the specific signature of the malware in the file [2]. These techniques are highly accurate. Their two major drawbacks are:

1. They depend on a limited signature database which has to be updated very frequently.
2. Finding a specific signature and updating it in the database takes time, which gives the malware enough of a time window to attack.

• Non-signature-based detection

Unlike signature-based techniques, non-signature-based techniques try to identify and match behaviour and patterns against a predefined profile [3]. They may raise some false positive or negative results, but they eliminate both drawbacks of the signature-based techniques:

1. They eliminate the time window for the malware to attack.
2. They can also detect unknown, new malware, which signature-based techniques fail to do.

Therefore, in this paper a non-signature-based detection technique is proposed which uses malware images generated in different colour models (i.e. greyscale, RGB, BGR, HSV) for classification of malware into their respective families. A colour model (or colour space) is an abstract mathematical model which describes the range of colours as tuples of numbers. The malware images generated in the different colour models are classified into their classes using a support vector machine (SVM). The major highlights of the paper are:

1. The prime objective is to classify the malware into their respective families.
2. The malware binaries are converted into images in different colour models (i.e. greyscale, RGB, BGR, HSV).
3. The images of the same colour model are integrated into a CSV file to form five different datasets.
4. Classification is performed using a support vector machine (SVM) on all datasets.


The rest of the paper is organized as follows: Sect. 2 reviews related works. Section 3 presents the methodology, along with the different techniques used for converting the malware binaries into images in the different colour models. Section 4 discusses the results. Section 5 finally concludes the paper.

2 Related Works

In [4], Nataraj and Jacob use image processing for the classification of malware, showing structural similarities between malware of the same family by converting the executable binaries into greyscale images. In [5], Nataraj and Yegneswaran give a comparison between dynamic analysis and malware image analysis; this study shows the efficiency of the image-based method over the dynamic one. Hasan in [6] gives knowledge about malware, its impacts, identification of various malwares, and methods to identify malware, such as signature-based and heuristic methods, with their respective limitations. Kancherla in [7] has presented a method for malware classification wherein the malware file is converted to a binary file, which in turn is converted to an 8-bit 1D vector. One vector represents one pixel and also gives the value of the intensity of this pixel. Later, low-level features are extracted and applied to the algorithm for malware classification. Liu in [8] gives a method to detect and classify malware files consisting of two stages: feature extraction and classification. For extracting features, the executable files are converted to greyscale images by breaking the file into blocks of 8 bits. Zhang in [9] gives a method for classifying malware using opcodes: the executable files are disassembled into opcodes which are further converted into images. In [10], D.C. Lo and J. Luo give a technique based on local binary patterns (LBP), which is considered efficient for pattern classification. In [11], the authors use obfuscation and dynamic techniques for classifying malware; the paper proposed a convolutional neural network for classification, which reduces the training and testing time of the method.

3 Methodology

The proposed system consists of three phases: base dataset selection, dataset creation and classification. All these phases are shown in Fig. 1.


Fig. 1 Methodology flowchart

3.1 Base Dataset Selection

The dataset used in this paper for training and testing is openly available on Kaggle under the name Microsoft Malware Classification Challenge (BIG 2015) [12]. The dataset consists of byte code and assembly files of more than 20,000 malware samples. The names of the malware families and the number of instances used in our model are shown in Table 1.

Table 1 Names of family and number of instances

S. No.  Name of family  No. of instances
1       Ramnit          1541
2       Obfuscator.ACY  1228
3       Lollipop        2142
4       Gatak           1013
5       Kelihos_ver3    2942


3.2 Dataset Creation (Feature Extraction)

3.2.1 Conversion of Malware Binaries into Images

Different techniques have been used for converting the malware binaries into images of four colour modes: RGB, HSV, greyscale and BGR. Each malware binary is considered as a string of 1s and 0s, which is divided into 8-bit units [8, 13–15]. The techniques for converting a malware binary into images of the different colour modes are discussed below.

Technique 1 (RGB): The RGB image is generated using a 2D colour map of order 16 × 16 that stores the RGB value equivalent to each byte. The decimal numbers represented by the first 4 bits and the second 4 bits of the 8-bit unit serve as the coordinates of the colour to be extracted from the colour map. As every 8-bit unit is taken as a reference to form the image, this iteration is done for all 8-bit units. Each RGB value obtained from the colour map is stored in a two-dimensional matrix. One of the parameters of the matrix, i.e. the width, is kept constant at 384 bytes, and the other parameter, i.e. the height, varies according to the malware file size. The flowchart for the above process is shown in Fig. 2.

Technique 2 (RGB): Every 8-bit unit is distributed into three sets of 3-3-2 bits, which are used to get the R (red), G (green) and B (blue) values of the pixel, respectively. The red and green values are obtained by multiplying the decimal number of the corresponding set (i.e. the first and second sets) by "16"; similarly, the blue value is obtained by multiplying the decimal number of the last set by "32". The RGB value of a pixel is determined by concatenating the values obtained from the three sets. This process is repeated for all 8-bit units, and each RGB value is stored in a 2D matrix. One of the parameters of the matrix, i.e. the width, is kept constant at 384

Fig. 2 Flowchart for technique 1 (RGB)


bytes, and the other parameter, i.e. the height, varies according to the malware file size. The flowchart for the above process is shown in Fig. 3.

Technique 3 (Greyscale): Each 8-bit unit represents a number between 0 and 255. The decimal number obtained from each unit is taken as the value of the pixel and stored in the two-dimensional matrix [4]. This process is repeated for all 8-bit units. One of the parameters of the matrix, i.e. the width, is kept constant at 384 bytes, and the other parameter, i.e. the height, varies according to the malware file size [8]. The flowchart for the above process is shown in Fig. 4.

Technique 4 (HSV): The HSV image is generated using a 2D colour map which stores the RGB value equivalent to each byte. The decimal numbers represented by the first 4 bits and the second 4 bits of the 8-bit unit serve as the coordinates of

Fig. 3 Flowchart for technique 2 (RGB)
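The 3-3-2 bit split of technique 2 can be sketched per byte as follows (we reproduce the scaling factors exactly as quoted in the text; whether they are intended to stretch each set to the full 0–255 range is not stated):

```python
# Sketch of technique 2: split one byte 3-3-2 into R, G, B components
# and scale them with the factors given in the text (16, 16 and 32).
def byte_to_rgb(byte: int):
    r = (byte >> 5) & 0b111   # first 3 bits
    g = (byte >> 2) & 0b111   # next 3 bits
    b = byte & 0b11           # last 2 bits
    return (r * 16, g * 16, b * 32)
```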

Fig. 4 Flowchart for technique 3 (greyscale)
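Technique 3 amounts to reading the binary as raw bytes with a fixed row width of 384; a sketch (the function name and the zero-padding of the last row are our assumptions):

```python
# Sketch of technique 3: each byte (0-255) becomes one greyscale pixel;
# the width is fixed and the height follows the file size.
def binary_to_greyscale_rows(data: bytes, width: int = 384):
    rows = [list(data[i:i + width]) for i in range(0, len(data), width)]
    if rows and len(rows[-1]) < width:
        rows[-1].extend([0] * (width - len(rows[-1])))  # pad last row
    return rows
```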


Fig. 5 Flowchart for Technique 4 (HSV)

the colour which is to be extracted from the colour map [13]. The RGB value of the pixel obtained from the colour map is used to compute the hue, saturation and value using Eqs. (1)–(4), which are stored in a two-dimensional matrix. This iteration is done for all 8-bit units. One of the parameters of the matrix, i.e. the width, is kept constant at 384 bytes, and the other parameter, i.e. the height, varies according to the malware file size. The flowchart for the above process is shown in Fig. 5.

S = (V − min(R, G, B))/V  (1)

V = max(R, G, B)  (2)

H = 0                                      if V = min(R, G, B)
H = 60 ∗ (R − G)/(V − min(R, G, B)) + 240  if V = B
H = 60 ∗ (B − R)/(V − min(R, G, B)) + 120  if V = G
H = 60 ∗ (G − B)/(V − min(R, G, B))        if V = R  (3)

Final pixel value = [H/2, S ∗ 255, V ∗ 255]  (4)
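For a single pixel, Eqs. (1)–(4) can be sketched as follows (channels are scaled to [0, 1] first, matching the 8-bit OpenCV HSV convention implied by Eq. (4); the guard for V = 0 and the wrap of negative hues are our additions, since Eq. (1) is undefined at V = 0):

```python
# Sketch of Eqs. (1)-(4): one RGB pixel (0-255 per channel) to the
# 8-bit HSV triple [H/2, S*255, V*255].
def rgb_to_hsv_pixel(r, g, b):
    r, g, b = r / 255.0, g / 255.0, b / 255.0
    v = max(r, g, b)                       # Eq. (2)
    mn = min(r, g, b)
    s = 0.0 if v == 0 else (v - mn) / v    # Eq. (1), guarded for V = 0
    if v == mn:                            # Eq. (3), case V = min(R, G, B)
        h = 0.0
    elif v == r:
        h = 60 * (g - b) / (v - mn)
    elif v == g:
        h = 60 * (b - r) / (v - mn) + 120
    else:                                  # V = B
        h = 60 * (r - g) / (v - mn) + 240
    if h < 0:                              # wrap negative hues into [0, 360)
        h += 360
    return (int(h / 2), int(s * 255), int(v * 255))   # Eq. (4)
```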

Technique 5 (BGR): The BGR image is generated using a 2D colour map which stores the RGB value equivalent to each byte. The decimal numbers represented by the first 4 bits and the second 4 bits of the 8-bit unit serve as the coordinates of the colour to be extracted from the colour map [14]. The RGB value of the pixel obtained from the colour map is inverted, so that the R value becomes B and the B value becomes R, and the final value is then stored in a two-dimensional matrix. This iteration is done for all 8-bit units. One of the parameters of the matrix, i.e. the width, is kept constant at 384 bytes, and the other parameter, i.e. the height, varies


Fig. 6 Flowchart for technique 5 (BGR)

Fig. 7 Different illuminations of a malware image: RGB (technique 1), RGB (technique 2), greyscale (technique 3), HSV (technique 4) and BGR (technique 5)

according to the malware file size. The flowchart for the above process is shown in Fig. 6. In Fig. 7, the different illuminations of a malware image generated using the different techniques are shown.

3.2.2 Feature Extraction

The images of malware binaries of width 384 pixels are resized to 64 × 64 and flattened [14]. Equation (5) is used to convert the three values of a pixel in RGB, HSV,


BGR into a single value, which is then entered into the CSV file. Five different CSV files are computed from the images obtained from the five techniques mentioned above. In each CSV file, there are 8866 entities belonging to five different malware families, each having 4096 features.

Feature Value = 0.2989 ∗ x + 0.5870 ∗ y + 0.1140 ∗ z  (5)

where x, y, z are the three values in the pixel.
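Eq. (5) is the standard luminance weighting of the three channel values; as a one-line helper:

```python
# Sketch of Eq. (5): collapse a 3-channel pixel (x, y, z) into the
# single feature value stored in the CSV file.
def feature_value(x, y, z):
    return 0.2989 * x + 0.5870 * y + 0.1140 * z
```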

3.2.3 Classification

For the implementation of this work, several Python libraries are used, such as NumPy, pandas, OpenCV and scikit-learn [16]. The procured dataset, which consists of 8866 instances, is split into two parts: 75% (6649 instances) for training and 25% (2217 instances) for testing and validation. A support vector machine (SVM), a supervised learning algorithm for classification and regression, is used to classify the malware images into their respective families [17].
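The 75/25 split can be sketched as follows (a standard-library stand-in for scikit-learn's train_test_split; with 8866 instances it reproduces the stated 6649/2217 counts):

```python
import random

# Shuffle the rows with a fixed seed and cut at the training fraction.
def split_dataset(rows, train_frac=0.75, seed=42):
    rows = rows[:]                     # do not mutate the caller's list
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]
```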

4 Results

The model was trained on images generated using the different techniques mentioned above and gives successful results on malware classification. The precision, recall and F1 score for our proposed method have been calculated for the different techniques. The computed results show an accuracy of at least 95%, with a mean accuracy of 95.6%. The accuracies for the different techniques are shown in Table 2.

Table 2 Accuracy table

S. No.  Technique no.            Accuracy
1       Technique 1 (RGB)        0.95
2       Technique 2 (RGB)        0.95
3       Technique 3 (Greyscale)  0.96
4       Technique 4 (HSV)        0.96
5       Technique 5 (BGR)        0.96
        Mean accuracy            0.956

Recall, precision and F1 score are important parameters for determining the efficiency of a model. Recall measures the completeness of a model, whereas precision indicates its usefulness. Thus, a model with high recall and precision results in a high F1 score, which ultimately indicates that the method is practically feasible and efficient. Precision, recall and F1 score are calculated in Tables 3,


Table 3 Technique 1 (RGB)

Sr. No.  Name of family    Precision  Recall  F1 score  No. of instances
1        Ramnit            0.91       0.95    0.93      394
2        Obfuscator.ACY    0.95       0.88    0.91      296
3        Lollipop          0.91       0.96    0.94      527
4        Gatak             0.97       0.88    0.92      240
5        Kelihos_ver3      1.00       1.00    1.00      760
         Weighted average  0.95       0.95    0.95      2217

4, 5, 6 and 7 using Eqs. (6)–(8).

Precision = True Positives/Predicted Positives  (6)

Recall = True Positives/Actual Positives  (7)

F1 Score = 2 ∗ (Precision ∗ Recall)/(Precision + Recall)  (8)
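From per-class counts, the metrics can be computed as follows, using the standard definitions of precision (over predicted positives) and recall (over actual positives); the function name is ours, and tp, fp and fn denote the true-positive, false-positive and false-negative counts:

```python
# Sketch of the precision, recall and F1 computation for one class.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)   # TP over predicted positives
    recall = tp / (tp + fn)      # TP over actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```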

Table 3 shows the obtained results in terms of recall, F1 score and precision for our proposed method on the testing dataset.

Table 4 Technique 2 (RGB)

Sr. No.  Name of family    Precision  Recall  F1 score  No. of instances
1        Ramnit            0.89       0.96    0.92      373
2        Obfuscator.ACY    0.97       0.90    0.93      308
3        Lollipop          0.92       0.96    0.94      560
4        Gatak             0.97       0.88    0.92      243
5        Kelihos_ver3      1.00       0.99    1.00      733
         Weighted average  0.95       0.95    0.95      2217

Table 5 Technique 3 (Greyscale)

Sr. No.  Name of family    Precision  Recall  F1 score  No. of instances
1        Ramnit            0.87       0.97    0.92      376
2        Obfuscator.ACY    0.97       0.89    0.93      307
3        Lollipop          0.95       0.96    0.96      564
4        Gatak             0.99       0.92    0.95      260
5        Kelihos_ver3      1.00       1.00    1.00      710
         Weighted average  0.96       0.96    0.96      2217


Table 6 Technique 4 (HSV)

Sr. No.  Name of family    Precision  Recall  F1 score  No. of instances
1        Ramnit            0.92       0.96    0.94      375
2        Obfuscator.ACY    0.99       0.88    0.93      322
3        Lollipop          0.92       0.98    0.95      525
4        Gatak             0.98       0.93    0.95      270
5        Kelihos_ver3      1.00       1.00    1.00      725
         Weighted average  0.96       0.96    0.96      2217

Table 7 Technique 5 (BGR)

Sr. No.  Name of family    Precision  Recall  F1 score  No. of instances
1        Ramnit            0.91       0.95    0.93      390
2        Obfuscator.ACY    0.95       0.86    0.91      319
3        Lollipop          0.92       0.98    0.95      523
4        Gatak             0.99       0.93    0.96      250
5        Kelihos_ver3      1.00       1.00    1.00      735
         Weighted average  0.96       0.96    0.96      2217

From the tables of the respective techniques, it is clear that our proposed method was able to classify the malware with high precision, recall and F1 score and thus can prove to be a useful tool in real-life applications.

5 Conclusion and Future Work

Based on the above study, improving existing techniques or developing new ones is necessary to counter the increasing number of malware attacks. The proposed visualization-based technique has proved to be an efficient method for malware classification, as it detects malware statically; thus, the need to run the malware on the system is eliminated. A fixed number of malware families is used in the current research, so this research can be extended to other malware families. The minimal error in the results shows the method to be very efficient at classifying malware. Space and time complexity is also an area that can be worked upon. Deep learning algorithms can also be applied, which are expected to be more efficient and powerful than support vector machines or other standard machine learning algorithms.


References

1. P. Parmuval, M. Hasan, S. Patel, Malware family detection approach using image processing techniques: visualization technique. Int. J. Comput. Appl. Technol. Res. 07, 129–132 (2018)
2. V.S. Sathyanarayan, P. Kohli, B. Bruhadeshwar, Signature generation and detection of malware families (2008). https://doi.org/10.1007/978-3-540-70500-0_25
3. M.F. Zolkipli, A. Jantan, An approach for malware behavior identification and classification, in 2011 3rd International Conference on Computer Research and Development, Mar 2011, vol. 1, pp. 191–194. https://doi.org/10.1109/ICCRD.2011.5764001
4. L. Nataraj, S. Karthikeyan, G. Jacob, B.S. Manjunath, Malware images: visualization and automatic classification (2011). https://doi.org/10.1145/2016904.2016908
5. L. Nataraj, V. Yegneswaran, P. Porras, J. Zhang, A comparative assessment of malware classification using binary texture analysis and dynamic analysis (2011). https://doi.org/10.1145/2046684.2046689
6. A.Z.M. Saleh, N.A. Rozali, A.G. Buja, K.A. Jalil, F.H.M. Ali, T.F.A. Rahman, A method for web application vulnerabilities detection by using Boyer-Moore string matching algorithm (2015). https://doi.org/10.1016/j.procs.2015.12.111
7. K. Kancherla, S. Mukkamala, Image visualization based malware detection, in 2013 IEEE Symposium on Computational Intelligence in Cyber Security (CICS), Apr 2013, pp. 40–44. https://doi.org/10.1109/CICYBS.2013.6597204
8. L. Liu, B. Wang, Malware classification using gray-scale images and ensemble learning (2017). https://doi.org/10.1109/ICSAI.2016.7811100
9. J. Zhang, Z. Qin, H. Yin, L. Ou, Y. Hu, IRMD: malware variant detection using opcode image recognition, in 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS), Dec 2016, pp. 1175–1180. https://doi.org/10.1109/ICPADS.2016.0155
10. J.S. Luo, D.C.T. Lo, Binary malware image classification using machine learning with local binary pattern (2017). https://doi.org/10.1109/BigData.2017.8258512
11. D. Xue, J. Li, T. Lv, W. Wu, J. Wang, Malware classification using probability scoring and machine learning. IEEE Access (2019). https://doi.org/10.1109/ACCESS.2019.2927552
12. R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, M. Ahmadi, Microsoft malware classification challenge. CoRR, vol. abs/1802.10135 (2018). Available: http://arxiv.org/abs/1802.10135
13. A. Kumar, M. Gupta, G. Kumar, A. Handa, N. Kumar, S.K. Shukla, A review: malware analysis work at IIT Kanpur, in Cyber Security in India: Education, Research and Training, ed. by S.K. Shukla, M. Agrawal (Springer, Singapore, 2020), pp. 39–48
14. A. Singh, A. Handa, N. Kumar, S.K. Shukla, Malware classification using image representation (2019). https://doi.org/10.1007/978-3-030-20951-3_6
15. L. Chen, Deep transfer learning for static malware classification. arXiv (2018)
16. NumPy Community, NumPy reference, Oct 2011
17. C.-W. Hsu, C.-C. Chang, C.-J. Lin, A practical guide to support vector classification (2008)

Classification and Activation Map Visualization of Banana Diseases Using Deep Learning Models

Priyanka Sahu, Anuradha Chug, Amit Prakash Singh, Dinesh Singh, and Ravinder Pal Singh

Abstract Machine learning, especially deep learning (DL), comprises a modern technique for processing images and data, with promising outcomes and enormous potential. DL is gaining prevalence because of its supreme computational power in terms of accuracy when a model must be trained with a massive consignment of data. As a subset of machine learning, DL has been effectively applied in different areas with improved accuracy, and in the recent past it has been expected to bring a kind of revolution in farming. In this study, three DL models, namely AlexNet, VGG16, and GoogleNet, were implemented to classify banana crop leaf diseases. The DL models rely on the convolutional neural network (CNN) as an algorithmic learning technique. To train the DL models, a CNN is used to extract features automatically from the raw input images. It was observed that fine-tuning of pre-trained networks achieved better classification results than training from scratch. Moreover, the fine-tuning of hyperparameters increases the accuracy of AlexNet from 0.882 (without fine-tuning) to 0.908 (fine-tuning), VGG16 from 0.896 to 0.9375, and GoogleNet from 0.901 to 0.9531. To visualize the results in the intermediate layers, activation maps (filters) were used to show the internal working of the convolutional layers. These maps also tend to visualize the symptoms and the diseased regions of the leaf image. GoogleNet outperformed the other two models with an accuracy of 95.31%.

Keywords Crop diseases · Banana plant · Deep learning · Classification · Convolutional neural network · Fine-tuning · Visualization

P. Sahu (B) · A. Chug · A. P. Singh
University School of Information, Communication, and Technology, Guru Gobind Singh Indraprastha University, New Delhi, India
A. Chug e-mail: [email protected]
A. P. Singh e-mail: [email protected]
D. Singh · R. P. Singh
Division of Plant Pathology, Indian Agricultural Research Institute, New Delhi, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_61



1 Introduction

The banana plant, Musa sp. (family Musaceae), is the second most important crop cultivated in India, after mango [1]. Diseases severely constrain crop production. Like other crops, banana is highly susceptible to diseases, which cause great damage to the farming economy; for example, banana black Sigatoka and banana bacterial wilt reduce crop production to a great extent. It is crucial to preserve the quality and quantity of banana crops; therefore, diseases must be identified and treated at the right time, before they spread. Farmers often do not recognize the symptoms of crop diseases properly and have to take crop samples to a laboratory for thorough examination, which is very time-consuming and expensive. Therefore, many efforts have been made to automate the disease classification process using leaf images. Approaches for detecting diseases at an early stage rely on computer vision and machine learning, performing classification on images of diseased leaves. For such approaches, experts are needed to carry out the feature extraction and labeling manually; hence, the procedures are not fully automated and are quite expensive. Consequently, most studies report experimentation on only small labeled datasets for training and accuracy evaluation.

Deep learning (DL) has emerged as a new development in machine learning that achieves advanced results in many research fields, for example, computer vision, drug design, plant health, and bioinformatics. The benefit of DL is that it exploits raw data directly, without any hand-crafted feature extractor. In the recent past, DL has given good results in both industrial and academic fields for two prime reasons. First, a bulk of data is produced day by day, and this information can be utilized to train deep models. Second, the computing power provided by high-performance computing (HPC) and graphics processing units (GPUs) makes the training of deep learning models possible. As DL frameworks advanced over time, they were deployed for image classification and recognition. These frameworks have also been introduced in various agricultural applications; e.g., plant leaf classification was carried out using an author-modified convolutional neural network (CNN) with a random forest (RF) classifier, covering thirty-two crop species, with a classification accuracy (CA) of 97.3% [2]. In studies [3–5], the authors performed implementations for fruit and leaf counting. For the classification of different crop types, Kussul et al. [6] implemented a user-modified CNN, Mortensen et al. [7] applied VGG16, Rußwurm et al. [8] proposed an LSTM, and Rebetez et al. [9] deployed a CNN with RGB histograms. Moreover, DL techniques have been used for the recognition of different plant types in studies [10–12]: researchers [10, 12] deployed user-modified CNNs, and AlexNet was used in study [11]. Results were computed using CA, and AlexNet [11] performed better than the other two. Barman et al. [13] deployed a self-structured CNN and a MobileNet CNN to detect the infections present in the leaves


of the citrus plant. A smartphone was used to acquire the leaf images and prepare a real-time dataset, and the proposed technique was then installed on a mobile phone for real-time testing. The rest of the paper is organized as follows. Section 2 gives some background insights into DL applicability. Section 3 discusses the methodology used for crop disease detection. Section 4 describes some notable DL models used for experimentation. Section 5 explains the experimental analysis using pre-trained models deployed over a small data sample of banana leaf diseases. Section 6 visualizes the learning process of the CNN. Finally, Sect. 7 concludes the study.

2 Background

DL is about how deeply a neural network can be extended, giving a hierarchical description of data using several convolutions. For training and testing, CNNs take each input image and pass it through a sequence of convolution layers with kernels (filters), pooling layers, and fully connected (FC) layers, and finally apply the softmax function to classify objects in the image with probabilistic values in the range (0–1). CNNs have been used in many applications for plant disease detection. In paper [14], leaf disease detection for the banana plant was performed with the help of CNN models (MobileNet-V1, Inception-V2, and ResNet-50) with SSD detectors and Faster R-CNN. The authors of [15] proposed a contemporary visualization technique using correlation coefficients and deep learning models such as the VGG-16 and AlexNet architectures. In [16], images were acquired using backscattering, immediately followed by a visual assessment using a browning scale. Further, [17] proposed a saliency map for symptom visualization of crop diseases, and [18] spotted 13 different kinds of crop diseases using the CaffeNet CNN, achieving a classification accuracy of 96.3%. To find diseases in plant leaves, the LeNet framework was applied to the training data, and the F1-score and CA were used to evaluate the system in grayscale and color modes [19]. Overfeat, AlexNet, GoogLeNet, AlexNetOWTBn, and VGG are the five CNNs used in the study [20], where the VGG framework performed best among all models. To classify diseases in maize crops, a CNN was implemented, and histograms were used to show the performance. In [21], AlexNet, ResNet, and GoogleNet were evaluated for the identification of tomato crop diseases. SqueezeNet v1.1 and AlexNet were used for the classification of tomato crop diseases, where AlexNet performed better [22]. Table 1 shows a literature review of some popular DL techniques.
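The convolution → pooling → FC → softmax pipeline described above can be sketched end-to-end in plain NumPy (a minimal, untrained forward pass for illustration only: the filter weights are random, the 8 × 8 input is a toy stand-in for a leaf image, and the three output classes merely mirror the banana dataset used later; this is not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(img, kernels):
    """Valid convolution of an HxW image with K kxk kernels -> K feature maps."""
    k = kernels.shape[1]
    h, w = img.shape[0] - k + 1, img.shape[1] - k + 1
    out = np.empty((kernels.shape[0], h, w))
    for n, ker in enumerate(kernels):
        for i in range(h):
            for j in range(w):
                out[n, i, j] = np.sum(img[i:i + k, j:j + k] * ker)
    return out

def relu(x):
    return np.maximum(0, x)          # y = max(0, x), applied after each conv/FC layer

def max_pool(maps, s=2):
    c, h, w = maps.shape
    return maps[:, :h - h % s, :w - w % s].reshape(c, h // s, s, w // s, s).max(axis=(2, 4))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()               # probabilistic outputs in the range (0, 1)

# Toy 8x8 grayscale "leaf" image and 4 random 3x3 filters
img = rng.random((8, 8))
kernels = rng.standard_normal((4, 3, 3))

feat = max_pool(relu(conv2d(img, kernels)))   # conv -> ReLU -> pool
flat = feat.ravel()
W = rng.standard_normal((3, flat.size))       # FC layer: 3 classes (BBS, BBW, healthy)
probs = softmax(W @ flat)

print(probs)                                  # three class probabilities summing to 1
```
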


Table 1 Literature review of a few DL approaches used along with the crop names, datasets, visualization techniques, and performance metrics

Deep learning architectures | Datasets | Crop | Visualization techniques | Performance metrics
AlexNet and GoogLeNet [23] | PlantVillage | Apple, tomato, blueberry, strawberry, bell pepper, soybean, cherry, squash, corn, potato, peach, raspberry, grape | Visualization of neuron activations in the initial convolutional layer | CA
AlexNet, VGG, GoogLeNet, AlexNetOWTBn, Overfeat [20] | In-field imaging and PlantVillage | Apple, cucumber, onion, blueberry, corn, orange, banana, eggplant, grape, cabbage, cherry, gourd, celery, cassava, cantaloupe | N/A | Success rate
CaffeNet [18] | Internet | Pear, grapevine, peach, cherry, apple | Visualization of activation maps with filters from the initial to the final layer | Precision
ZFNet, AlexNet, R-FCN, GoogleNet, SSD, VGG-16, Faster RCNN, ResNet-50, ResNetXt-101 [24] | Images captured from real fields | Tomato | Bounding boxes were used for the localization and classification of diseases | Precision
CNN [25] | Bisque Platform of CyVerse | Maize | Heat maps were deployed to identify the diseases in maize crops | Accuracy
DCNN [26] | Images were captured from real field | Rice | Feature map to detect the diseases in rice plants | Accuracy
GoogLeNet, AlexNet [27] | PlantVillage | Tomato | Symptoms visualization approach | Accuracy
VGG-FCN-S, VGG-FCN-VD16 [28] | Database of wheat disease in 2017 | Wheat | Spatial and feature maps | Accuracy
CNN, VGG-A [29] | Images captured from actual field | Radish | HSV with K-means clustering | Accuracy
AlexNet [30] | Images were captured from real field | Soybean | Feature map to identify the marks of diseases | CA
Random forest, AlexNet, DCNN, Support vector machine [31] | Agricultural fields and image dataset in China | Cucumber | Image segmentation approach | CA
Student/teacher architecture [32] | PlantVillage | Squash, apple, grapes, tomato, bell pepper, strawberry, blueberry, soybean, cherry, peach, corn, raspberry, orange, potato | Images with discriminant regions, heat map formation, and image segmentation using binary threshold logic | Validation accuracy and loss, training accuracy and loss
3D-CNN [33] | Real environment | Soybean | Saliency feature map visualization | F1-score, CA
AlexNet, VGG-16 [15] | CASC-IFW | Banana, apple | Saliency feature map, mesh graphical image, 2D and 3D contour | CA
VGG–Inception [34] | Real environment | Apple | Activation visualization | Accuracy
Modified LeNet [35] | PlantVillage | Olives | Segmentation and edge feature map | True positive rate
ResNet (18, 50, 152), VGG16, and Inception V3 [36] | Real environmental conditions | Banana | Feature extraction | Accuracy
Faster RCNN [37] | Image collection using UAV (real environment) | Banana | Linear contrast stretch, synthetic color transform, triangular greenness index, bounding boxes of detection | Accuracy, precision, recall

3 Methodology

To operate the DL architectures, various phases are needed, beginning from dataset gathering through performance analysis and visualization mappings, as shown in Fig. 1. The deployed methodology is further split into four parts, as shown in Fig. 2.

Fig. 1 Flow diagram of the DL models' implementation (dataset collection of banana leaf images from the PlantVillage dataset → splitting of the input data (80:20) into training and validation sets → DL architectures: AlexNet, GoogleNet, VGG16 → training/validation of the model → performance metrics: accuracy, loss): Initially, the input data is collected [23] and then split into two portions, generally in an 80–20 ratio of training and validation data. Afterwards, the DL architectures are deployed over the dataset either with or without pre-training, and training and validation curves are drawn to represent the performance of the architectures. Moreover, performance metrics are applied to classify the images (crop disease)

Fig. 2 Implemented DL approach: pre-training → training (hyperparameter fine-tuning) → disease classification → symptoms detection and visualization

1. Pre-training: In this step, deep models need to be trained with a bulk of data such as ImageNet. The target of this phase is to initialize the weights of the network.
2. Training (hyperparameter fine-tuning): The resulting network obtained from the previous phase was fine-tuned with the hyperparameters. Furthermore, the final layer of the pre-trained network was substituted with a different output layer with three classes (the three banana leaf classes). The tuning of hyperparameters is shown in Table 2.
3. Disease classification: In this phase, the user classifies the leaf image using the deep architecture to determine the disease in the banana plant.
4. Symptoms detection and visualization: After crop disease classification, the client visualizes the lesions of the disease present in the leaf image. The disease characteristics shown through visualization assist naïve users by providing them additional information on the mechanism. Furthermore, the visualization technique helps the user estimate the disease spread among other banana plants.

Table 2 Hyperparameters specification

Hyperparameters | AlexNet | VGG16 | GoogleNet
Learning rate | 0.001 | 0.00001 | 0.001
Decay | 0.005 | 1e−6 | 1e−6
Momentum | 0.9 | 0.9 | 0.9
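The fine-tuning step (phase 2) can be illustrated with a self-contained NumPy sketch: a frozen "pre-trained" feature extractor (here just a random stand-in matrix) is reused unchanged, and only a new three-class output layer is trained with SGD using the AlexNet hyperparameters from Table 2 (learning rate 0.001, decay 0.005, momentum 0.9). The data and the extractor are synthetic assumptions, not the paper's models:

```python
import numpy as np

rng = np.random.default_rng(1)

# "Pre-trained" feature extractor: its weights are frozen and reused as-is.
F = rng.standard_normal((16, 64))            # maps 64-dim raw input -> 16-dim features

# Synthetic stand-in data: 90 samples, 3 classes (BBS, BBW, healthy)
X = rng.standard_normal((90, 64))
y = np.repeat(np.arange(3), 30)
X += 2.0 * np.eye(3)[y] @ rng.standard_normal((3, 64))   # class-dependent shift

feats = np.maximum(0, X @ F.T)               # frozen features with ReLU

# New 3-class output layer, trained with the AlexNet hyperparameters of Table 2
lr, decay, momentum = 0.001, 0.005, 0.9
W = np.zeros((3, 16))
V = np.zeros_like(W)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

onehot = np.eye(3)[y]
for _ in range(300):
    p = softmax(feats @ W.T)
    grad = (p - onehot).T @ feats / len(y) + decay * W   # cross-entropy grad + weight decay
    V = momentum * V - lr * grad                         # SGD with momentum
    W += V

acc = (softmax(feats @ W.T).argmax(axis=1) == y).mean()
print(f"training accuracy of the fine-tuned head: {acc:.2f}")
```

Only `W` is updated here; the extractor `F` never changes, which is the essence of reusing inherited features with "only some minor changes in the last layers."
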

4 Crop Disease Detection by Notable DL Models

Many DL architectures/models were developed soon after the famous AlexNet [38] for image segmentation, identification, and classification. This section reviews some research carried out using well-known DL models to detect and classify crop diseases. In most studies, the PlantVillage dataset has been commonly used, as it comprises 54,306 images of 14 different crops with 26 crop diseases [23]. In this


study, three famous DL models, namely AlexNet [38], VGG16 [39], and GoogleNet [40], are deployed over banana leaf images. The performance of these architectures in the ImageNet computer vision challenge motivated our choice of them for implementation. Additionally, these architectures permit fine-tuning of hyperparameters and transfer learning from a large labeled dataset for crop disease classification.

4.1 AlexNet (2012)

AlexNet, a novel framework implemented by Krizhevsky et al. [38], won the ImageNet challenge, where it outperformed the competition with a top-5 error rate of 15.3%. It comprises five convolutional layers, three fully connected (FC) layers, max pooling, ReLU activations, data augmentation, dropout, and SGD with momentum [38]. ReLU activations were applied after every convolutional and FC layer, and the dropout function was used before the first and second FC layers. The benefit of the ReLU function over sigmoid or tanh is its training speed: the derivatives of sigmoid and tanh are very small, so weight updates almost vanish. ReLU is defined as y = max(0, x). Zhang et al. [21] implemented three CNN frameworks, AlexNet, ResNet, and GoogLeNet, to identify diseases in tomato leaves; training and validation accuracy were computed to denote the performance of the architectures, and ResNet gave the top results. Durmuş et al. [22] compared the performance of the SqueezeNet v1.1 and AlexNet architectures for tomato crop disease classification, where AlexNet performed better with higher accuracy. Moreover, the AlexNet and VGG16 frameworks were compared in [41], and CA was measured for six tomato crop diseases. Similarly, the performance of AlexNet and GoogLeNet was analyzed for detecting disease spots on leaves using the F1-score, precision, recall, and accuracy metrics; this experiment was conducted on the publicly accessible PlantVillage dataset. Four cucumber diseases were detected in [31], in which AlexNet, random forest, and support vector machines were compared using accuracy metrics.
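The vanishing-derivative argument can be checked numerically: the sigmoid derivative σ(x)(1 − σ(x)) peaks at 0.25 and shrinks rapidly for large |x|, while the ReLU derivative stays at 1 for any positive input (a small illustrative sketch, not taken from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # derivative of the sigmoid, at most 0.25

def d_relu(x):
    return (x > 0).astype(float)    # derivative of y = max(0, x)

xs = np.array([0.0, 2.0, 5.0, 10.0])
print(d_sigmoid(xs))                # 0.25 at 0, then rapidly shrinking toward 0
print(d_relu(xs))                   # stays at 1 for all positive inputs
```
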

4.2 VGG16 (2014)

Simonyan et al. [39] developed a 16-layer DL architecture that uses 3 × 3 convolution filters to increase the depth of the network. It showed significant improvement in the accuracy of large-scale image recognition, and the weight configuration of the model is publicly available. VGG16 comprises 138 million parameters, which makes it challenging to handle. To detect diseases in wheat


crops, Lu et al. [28] implemented two DL architectures, namely VGG-FCN and VGG-CNN, and performed feature visualization for each block in these DL models. In [29], the VGG-CNN framework was implemented for the identification of Fusarium wilt in radish, with the K-means clustering algorithm applied to detect the disease spots.

4.3 GoogLeNet (2014)

Szegedy et al. [42] implemented a 22-layer deep CNN model to detect and classify images. The core significance of this model is improved utilization of the computational resources deployed in the network: with a constant computational budget, the width and depth of the CNN were increased. The Hebbian principle and the concept of multi-scale processing were used to optimize the quality of the architecture. GoogLeNet gave a top-5 error rate of 6.67%, which is very close to human-level performance.

5 Experiments

5.1 Pre-trained Models

For the classification of crop diseases, DL models, especially CNNs, are trained directly on raw input images. Consequently, the DL models learn the extracted features from the input images without manual (human) help; in other words, automatic feature extraction occurs along with the training of the classifier. We have used three CNN models, namely AlexNet, VGG16, and GoogleNet. These frameworks were presented in computer vision challenges such as ImageNet and achieved winning positions, and the motive is to deploy them for the identification of crop diseases. These layered architectures give each crop disease's output probability at the last FC layer using the softmax function. The raw images fed to the CNN are resized to 256 × 256 pixels, and we have used three classes for the identification of crop diseases. Implementing DL requires dedicated software and hardware to speed up the training. Figures 4 and 5 demonstrate the result analysis of the DL models.


5.2 Workstation Specifications and Deep Learning Framework

All the implementations were performed using Google Colab (Python 3) on a personal computer with a GPU:

• Python 3.7,
• 1x Tesla K80,
• 2496 CUDA cores, and
• 12 GB GDDR5 VRAM.

Such a GPU specification is vital for reducing the learning time from days to a few hours. The GPU's memory also plays a significant role: if the available memory is insufficient, large batches of examples cannot be processed in each learning iteration.

5.3 Dataset

We have used a publicly accessible image dataset of leaves taken from PlantVillage. The data source is available at www.PlantVillage.org and consists of nearly 50,000 images of fruits and leaves. A subset of the dataset was taken that contains 510 images split into three classes, namely BBS, BBW, and healthy. Table 3 shows the characteristics of the classes (diseases) in banana leaves. The images of banana crop leaves are as follows:

• 35 images infected with BBS disease,
• 180 images infected with BBW disease, and
• 295 healthy leaf images.

Table 3 Disease names with characteristics

Disease name | Disease characteristics | Disease-causing elements | Reference
Banana Black Sigatoka (BBS) | Onset symptoms start from younger leaves; a leaf-spot disease; lesions with a rusty brown appearance that seem to be pale, paint-like fragments on the leaves | Ascomycete fungus Mycosphaerella fijiensis (Morelet) | [43]
Banana Bacterial Wilt (BBW) | Foliage turns yellow and extends upwards; rotten interior part; flowing bacteria can be seen easily | Bacteria Xanthomonas vasicola pv. musacearum (Xvm) | [43]


Fig. 3 Images of infected and healthy leaves of banana plant: a and b are infected with BBS; c and d are infected with BBW; and e is healthy leaf [PlantVillage]

Figure 3(a–d) shows a few leaves with their respective diseases, whereas Fig. 3e represents a healthy leaf image. The images used for classification have been resized to a 256 × 256-pixel size.
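The 80:20 split of the 510 images can be sketched in plain Python, using the class counts stated above (the helper name and seed are illustrative assumptions):

```python
import random

# Labels for the 510 banana-leaf images: 35 BBS, 180 BBW, 295 healthy
labels = ["BBS"] * 35 + ["BBW"] * 180 + ["healthy"] * 295

def split_dataset(items, train_frac=0.8, seed=42):
    """Shuffle the items and split them into training and validation subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

train, val = split_dataset(labels)
print(len(train), len(val))   # 408 training images, 102 validation images
```
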

5.4 Performance Metrics

To measure the performance of the DL models, classification accuracy and loss were used as the performance metrics. Table 4 presents the metric formulae utilized in the experimentation.

Table 4 Performance metrics deployed in the study

Performance metric | Definition | Symbol
Classification accuracy | The percentage of correct predictions out of the total: CA = (TP + TN) / (TP + TN + FP + FN), where TP = true positive, TN = true negative, FP = false positive, FN = false negative | CA
Loss (cross-entropy) | L = −Σ_{i=1}^{M} y_{o,i} log(p_{o,i}), where M is the total number of classes (BBS, BBW, healthy), log is the natural log, y_{o,i} is the classification indicator (0 or 1; 1 if class i is the true classification of instance o), and p_{o,i} is the predicted probability that observation o belongs to class i | L
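Both metrics in Table 4 can be computed directly; the sketch below uses made-up predictions over the three classes (BBS, BBW, healthy) and is not the paper's evaluation code:

```python
import numpy as np

def classification_accuracy(y_true, y_pred):
    """CA = (TP + TN) / (TP + TN + FP + FN), i.e., the fraction of correct predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return (y_true == y_pred).mean()

def cross_entropy(y_onehot, p, eps=1e-12):
    """L = -sum_i y_{o,i} * log(p_{o,i}), averaged over the observations o."""
    return -np.mean(np.sum(y_onehot * np.log(p + eps), axis=1))

# Toy predictions over the classes 0=BBS, 1=BBW, 2=healthy
y_true = np.array([0, 1, 2, 2])
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6],
              [0.5, 0.3, 0.2]])   # last sample is misclassified
y_onehot = np.eye(3)[y_true]

print(classification_accuracy(y_true, p.argmax(axis=1)))  # 0.75
print(round(cross_entropy(y_onehot, p), 4))
```
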


5.5 DL Architecture with Pre-training Versus DL Architecture Without Pre-training

It is observed that fine-tuning of pre-trained networks performs better than training from scratch. The fine-tuning of hyperparameters increases the accuracy of AlexNet from 0.882 to 0.908, VGG16 from 0.896 to 0.9375, and GoogleNet from 0.901 to 0.9531. The impact of transfer learning is explained by the capability of the network to reuse and transfer features from one problem domain to another; these inherited features are used with only minor changes in the last layers. Furthermore, the fine-tuning of hyperparameters is very helpful in situations where training datasets are small: the pre-trained models are trained over large datasets (ImageNet) with a large number of labels and are then reused on the smaller training examples. Fine-tuning is also beneficial for training on machines with a limited amount of GPU memory. A comparison of the performance of the pre-trained models is made with models trained from scratch, beginning with randomly assigned network weights, which shows the effect of transfer learning for crop disease classification. Table 5 shows the experimental results obtained with and without pre-training (Figs. 4 and 5).

Table 5 Experimentation results

Deep architectures | Performance measures | Without pre-training | With transfer learning
AlexNet | Accuracy | 0.882 | 0.908
AlexNet | Loss | 0.427 | 0.291
VGG16 | Accuracy | 0.896 | 0.9375
VGG16 | Loss | 0.319 | 0.2608
GoogleNet | Accuracy | 0.901 | 0.9531
GoogleNet | Loss | 0.329 | 0.2024

Fig. 4 Accuracy analysis of DL models


Fig. 5 Loss analysis of DL models

6 Symptom Visualization

Once the training phase is completed, visualization techniques extract biological information from the architectures trained on image data. This extracted information helps naïve users understand crop diseases along with their symptoms. DL models extract this biological information directly from raw data without any expert assistance.

6.1 Symptoms and Disease Lesion Detection Using DL

The workings of older neural networks could not be interpreted, whereas in DL many visualization techniques have been used to represent the learned features. This helps explain how a classifier produces its results and also shows how the features were built [27]. It was observed that visualization techniques for feature representation help naïve users recognize crop diseases and their symptoms. This study uses visualization filter maps to show the classification of the disease. When an input image is fed to the network, the architecture returns the activation values of each layer. For instance, the first convolution layer activations for an input image have the shape (1, 252, 252, 32): 252 × 252 is the feature map size, and 32 is the total number of channels present. Figure 6 shows the output of the first activation layer in the CNN.
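Slicing a single channel out of an activation tensor of shape (1, 252, 252, 32) and rescaling it for display, as in Fig. 6, can be sketched as follows (the activation values here are random placeholders standing in for a real first-layer output):

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.standard_normal((1, 252, 252, 32))   # stand-in first-layer output

def channel_image(act, ch):
    """Extract one channel of a (1, H, W, C) activation map and scale it to 0..255."""
    m = act[0, :, :, ch]
    m = (m - m.min()) / (m.max() - m.min() + 1e-12)    # normalize to [0, 1]
    return (m * 255).astype(np.uint8)                  # 8-bit image for display

for ch in (3, 30):                                     # channels shown in Fig. 6b, c
    img = channel_image(activations, ch)
    print(ch, img.shape, img.min(), img.max())
```
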

Fig. 6 Output of the first activation layer of the CNN: a shows the preprocessed leaf of banana, b shows the output of the first activation layer using channel number three, and c shows the output of the first activation layer using channel number thirty

6.2 Visualization of Every Channel in Each Intermediate Activation Layer

This section shows how visualization filters help the CNN capture the basic patterns in an image and how the layers communicate by passing information from one layer to another. The interpretation of the activation filters used in the convolution layers is as follows:

1. The initial layer potentially retains the complete shape of the leaf image, although many filters are left blank because they were not activated in the convolutional layer. At this stage, the activations retain almost complete information about the original image.
2. Moving deeper into the remaining layers of the architecture, the activations become less visible and more abstract. They begin to encode higher-level concepts such as corners, single borders, and angles/slopes. These higher-level abstractions convey progressively less knowledge about the visible content of the input image and more information about the labels or classes of the image.
3. The architecture becomes very complicated at this point; the last-layer activations are not shown, as there is nothing left to learn at that point.

A few instances of intermediate activation maps of the middle layers are shown in Fig. 7.

7 Conclusion

In this paper, CNN-based DL models are deployed to carry out banana leaf disease classification. The fine-tuning of hyperparameters increases the accuracy of AlexNet from 0.882 (training from scratch) to 0.908 (fine-tuning), VGG16 from 0.896 to 0.9375, and GoogleNet from 0.901 to 0.9531. The experimental results show that GoogleNet performs best among the three models for disease classification. Furthermore, the experimentation validates the advantage of pre-training (transfer learning) over training from scratch. This study also visualizes the results of the activation maps deployed in the intermediate convolutional layers.


Fig. 7 Instances of activation maps from the intermediate layers

These maps also tend to visualize the symptoms and the diseased regions of the leaf image, helping naïve users understand the internal working of the network.

Acknowledgements The authors are thankful to the Department of Science and Technology, Government of India, Delhi, for funding a project on 'Application of IoT in Agriculture Sector' through the ICPS division. This work is a part of that project.

References

1. K. Satyagopal, S.N. Sushil, P. Jeyakumar, G. Shankar, O.P. Sharma, S.K. Sain, et al., AESA Based IPM Package for Banana. National Institute of Plant Health Management, Rajendranagar, Hyderabad (2014)
2. D. Hall, C. McCool, F. Dayoub, N. Sunderhauf, B. Upcroft, Evaluation of features for leaf classification in challenging conditions, in 2015 IEEE Winter Conference on Applications of Computer Vision (2015), pp. 797–804
3. Y. Itzhaky, G. Farjon, F. Khoroshevsky, A. Shpigler, A. Bar-Hillel, Leaf counting: multiple scale regression and detection using deep CNNs, in BMVC (2018), p. 328
4. J. Ubbens, M. Cieslak, P. Prusinkiewicz, I. Stavness, The use of plant models in deep learning: an application to leaf counting in rosette plants. Plant Methods 14, 6 (2018)
5. M. Rahnemoonfar, C. Sheppard, Deep count: fruit counting based on deep simulated learning. Sensors 17, 905 (2017)
6. N. Kussul, M. Lavreniuk, S. Skakun, A. Shelestov, Deep learning classification of land cover and crop types using remote sensing data. IEEE Geosci. Remote Sens. Lett. 14, 778–782 (2017)
7. A.K. Mortensen, M. Dyrmann, H. Karstoft, R.N. Jørgensen, R. Gislum, et al., Semantic segmentation of mixed crops using deep convolutional neural network, in Proceedings of International Conference on Agricultural Engineering (2016)
8. M. Rußwurm, M. Körner, Multi-temporal land cover classification with long short-term memory neural networks. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 42, 551 (2017)
9. J. Rebetez, H.F. Satizábal, M. Mota, D. Noll, L. Büchi, M. Wendling, et al., Augmenting a convolutional neural network with local histograms—a case study in crop classification from high-resolution UAV imagery, in ESANN (2016)
10. G.L. Grinblat, L.C. Uzal, M.G. Larese, P.M. Granitto, Deep learning for plant identification using vein morphological patterns. Comput. Electron. Agric. 127, 418–424 (2016)
11. S.H. Lee, C.S. Chan, P. Wilkin, P. Remagnino, Deep-plant: plant identification with convolutional neural networks, in 2015 IEEE International Conference on Image Processing (2015), pp. 452–456
12. M.P. Pound, J.A. Atkinson, A.J. Townsend, M.H. Wilson, M. Griffiths, A.S. Jackson, et al., Deep machine learning provides state-of-the-art performance in image-based plant phenotyping. Gigascience 6, gix083 (2017)
13. U. Barman, R.D. Choudhury, D. Sahu, G.G. Barman, Comparison of convolution neural networks for smartphone image based real time classification of citrus leaf disease. Comput. Electron. Agric. 177, 105661 (2020)
14. M.G. Selvaraj, A. Vergara, H. Ruiz, N. Safari, S. Elayabalan, W. Ocimati, et al., AI-powered banana diseases and pest detection. Plant Methods 15 (2019)
15. M.A. Khan, T. Akram, M. Sharif, M. Awais, K. Javed, H. Ali, et al., CCDF: automatic system for segmentation and recognition of fruit crops diseases based on correlation coefficient and deep CNN features. Comput. Electron. Agric. 155, 220–236 (2018)
16. N. Hashim, R.B. Janius, R.A. Rahman, A. Osman, M. Shitan, M. Zude, Changes of backscattering parameters during chilling injury in bananas. J. Eng. Sci. Technol. 9, 314–325 (2014)
17. M. Brahimi, M. Arsenovic, S. Laraba, S. Sladojevic, K. Boukhalfa, A. Moussaoui, Deep learning for plant diseases: detection and saliency map visualisation. Hum. Mach. Learn. 93–117 (2018)
18. S. Sladojevic, M. Arsenovic, A. Anderla, D. Culibrk, D. Stefanovic, Deep neural networks based recognition of plant diseases by leaf image classification. Comput. Intell. Neurosci. 2016 (2016)
19. J. Amara, B. Bouaziz, A. Algergawy, et al., A deep learning-based approach for banana leaf diseases classification, in BTW (2017), pp. 79–88
20. K.P. Ferentinos, Deep learning models for plant disease detection and diagnosis. Comput. Electron. Agric. 145, 311–318 (2018)
21. K. Zhang, Q. Wu, A. Liu, X. Meng, Can deep learning identify tomato leaf disease? Adv. Multimed. 2018 (2018)
22. H. Durmuş, E.O. Güneş, M. Kırcı, Disease detection on the leaves of the tomato plants by using deep learning, in 2017 Sixth International Conference on Agro-Geoinformatics (2017), pp. 1–5
23. S.P. Mohanty, D.P. Hughes, M. Salathé, Using deep learning for image-based plant disease detection. Front. Plant Sci. 7, 1419 (2016)
24. A. Fuentes, S. Yoon, S.C. Kim, D.S. Park, A robust deep-learning-based detector for real-time tomato plant diseases and pests recognition. Sensors 17, 2022 (2017)
25. C. DeChant, T. Wiesner-Hanks, S. Chen, E.L. Stewart, J. Yosinski, M.A. Gore, et al., Automated identification of northern leaf blight-infected maize plants from field imagery using deep learning. Phytopathology 107, 1426–1432 (2017)
26. Y. Lu, S. Yi, N. Zeng, Y. Liu, Y. Zhang, Identification of rice diseases using deep convolutional neural networks. Neurocomputing 267, 378–384 (2017)
27. M. Brahimi, K. Boukhalfa, A. Moussaoui, Deep learning for tomato diseases: classification and symptoms visualization. Appl. Artif. Intell. 31, 299–315 (2017)
28. J. Lu, J. Hu, G. Zhao, F. Mei, C. Zhang, An in-field automatic wheat disease diagnosis system. Comput. Electron. Agric. 142, 369–379 (2017)
29. J.G. Ha, H. Moon, J.T. Kwak, S.I. Hassan, M. Dang, O.N. Lee, et al., Deep convolutional neural network for classifying Fusarium wilt of radish from unmanned aerial vehicles. J. Appl. Remote Sens. 11, 42621 (2017)

Classification and Activation Map Visualization …

767

30. S. Ghosal, D. Blystone, A.K. Singh, B. Ganapathysubramanian, A. Singh, S. Sarkar, An explainable deep machine vision framework for plant stress phenotyping. Proc. Natl. Acad. Sci. 115, 4613–4618 (2018) 31. J. Ma, K. Du, F. Zheng, L. Zhang, Z. Gong, Z. Sun, A recognition method for cucumber diseases using leaf symptom images based on deep convolutional neural network. Comput. Electron. Agric. 154, 18–24 (2018) 32. M. Brahimi, S. Mahmoudi, K. Boukhalfa, A. Moussaoui, Deep interpretable architecture for plant diseases classification, in Signal Processing Algorithms, Architectures, Arrangements, and Applications (2019), pp. 111–116 33. K. Nagasubramanian, S. Jones, A.K. Singh, A. Singh, B. Ganapathysubramanian, S. Sarkar, Explaining hyperspectral imaging based plant disease identification: 3D CNN and saliency maps (2018). arXiv Prepr arXiv180408831 34. P. Jiang, Y. Chen, B. Liu, D. He, C. Liang, Real-time detection of apple leaf diseases using deep learning approach based on improved convolutional neural networks. IEEE Access 7, 59069–59080 (2019) 35. A.C. Cruz, A. Luvisi, L. De Bellis, Y. Ampatzidis, Vision-based plant disease detection system using transfer and deep learning, in 2017 ASABE Annual International Meeting (2017), p. 1 36. S.L. Sanga, D. Machuve, K. Jomanga, Mobile-based deep learning models for banana disease detection. Eng. Technol. Appl. Sci. Res. 10, 5674–5677 (2020) 37. B. Neupane, T. Horanont, N.D. Hung, Deep learning based banana plant detection and counting using high-resolution red-green-blue (RGB) images collected from unmanned aerial vehicle (UAV). PLoS One 14, e0223906 (2019) 38. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process Syst. 1097–105 (2012) 39. K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition (2014). arXiv Prepr arXiv14091556 40. C. Szegedy, S. Ioffe, V. Vanhoucke, A.A. 
Alemi, Inception-v4, inception-ResNet and the impact of residual connections on learning, in Thirty-First AAAI Conference on Artificial Intelligence (2017) 41. A.K. Rangarajan, R. Purushothaman, A. Ramesh, Tomato crop disease classification using pre-trained deep learning algorithm. Procedia Comput. Sci. 133, 1040–1047 (2018) 42. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, et al., Going deeper with convolutions, in IEEE conference on computer vision and pattern Recognition (2015), pp. 1–9 43. G. Owomugisha, J.A. Quinn, E. Mwebaze, J. Lwasa, Automated vision-based diagnosis of banana bacterial wilt disease and black Sigatoka disease. International Conference on the use of Mobile Informations and Communication Technology (ICT) in Africa (2014)

Exploring Total Quality Management Implementation Levels in I.T. Industry Using Machine Learning Models Kapil Jaiswal, Sameer Anand, and Rupali Arora

Abstract This paper aimed at creating a model to predict TQM levels as high or low, based on employee strength (considered as size), age (years of existence of the organization), and the presence of any certification or quality award received by the organization. Using a convenience sampling technique, the survey collected information about age, size (employee strength), quality certifications or awards received, and TQM practices in the organization. A total of 146 responses were received between 2019 and January 2020. In this study, several machine learning algorithms were compared to find the model best suited to the data collected through a survey of IT companies in the NCR and TRICITY (Chandigarh, Mohali, and Panchkula) regions. The results of this study showed that classification and regression trees and extra trees models perform well, compared with other machine learning predictive algorithms, on the survey data for predicting the level of TQM in an organization.

Keywords Machine learning · Decision trees · TQM · Quality · Predictive analysis · IT industry

K. Jaiswal Operations and Development—OATI, Chandigarh, India S. Anand (B) Shaheed Sukhdev College of Business Studies, New Delhi, India e-mail: [email protected] R. Arora Chandigarh University, SAS Nagar, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_62


1 Introduction

1.1 IT Industry

India sits on top of the list of global sourcing destinations, capturing 55% of the US$185–190 billion international services sourcing market in 2017–18. India's IT and ITeS businesses operate more than 1000 delivery points spread across almost 80 countries. The Indian IT industry is already well established in the National Capital Region (NCR), and it is flourishing in the TRICITY (Chandigarh, Panchkula, and Mohali) region. There is cutthroat competition among IT organizations, each trying to serve customers with the highest level of quality at minimum cost. Several studies like [23, 37] mention that implementing TQM improved quality and profitability as well as the performance of services [11, 32, 36].

1.2 TQM and TQM Elements

TQM has been defined by Psychogios and Priporas [30] as "a comprehensive company-wide approach for meeting or exceeding the requirements and expectations of customers that entails the participation of everyone in the organization in using quantitative techniques to continually improve the products, services, and processes of the company." The basic idea behind TQM is to emphasize and implement quality metrics and parameters to improve organizational efficiency, operational efficiency, and productivity [10]. Numerous studies have demonstrated a direct association between improved firm performance and the implementation of TQM measures, and establish that businesses in which TQM has been implemented effectively perform better in revenue and profit generation, cost and capital expenditure, asset size, and employee satisfaction, in comparison with firms which lack TQM measures [17]. Many studies on TQM have considered different elements (or practices) of TQM for measuring and analyzing TQM implementation. Studies like Abusa [1], Talib and Rahman [34], and Rahman and Bullock [7] have considered top management commitment, customer focus or satisfaction, continuous improvement, and people management as the main TQM elements for analyzing TQM implementation. In fact, studies like [26, 29, 33] used the same four elements as the current research study. The questionnaire used in this study was based on these four constructs/elements of TQM.


1.3 ISO Certification and Quality Awards

TQM carries paramount importance, and its significance is underscored by prominent recognitions such as the US-based Malcolm Baldrige Quality Award, the Japan-based Deming Prize, the European Quality Prize, and the Singapore Quality Award. The purpose of these awards is to highlight companies that implement best practices in management and to encourage others to make use of TQM techniques to meet global standards. Consequently, many production and service sector firms have started implementing numerous TQM techniques. Bringing the structure, management methods, and processes of an organization in line with the requirements of TQM standards is extremely important under the models put forth by these quality awards. Ghobadian and Gallear [14] mentioned in their study that quality awards like the European Quality Award, the British Quality Award, and the Baldrige Award play a major role in accomplishing the implementation of TQM in SMEs. Issac [22] stated that ISO 9000 has global recognition as the foundation for assessing and defining quality management systems in those processes where design is an integral component of operations.

1.4 Machine Learning Algorithms

Machine learning is a new approach that has the potential to improve forecasting models and help in managerial decision-making. Machine learning refers to computer-based methods that detect and extract patterns from available data and perform optimization tasks using that gained knowledge with little or no human involvement. These approaches have been derived from artificial intelligence as well as dynamic programming. Machine learning techniques have found application in numerous spheres as high-performing data-mining tools that uncover unseen patterns concealed within a database and can improve the bottom line. These techniques include association rules, decision trees, neural networks, and genetic algorithms. Studies in business and management have used such techniques for classification problems, like estimating a customer's solvency, predicting defaults on consumer loans, and modeling consumer choice [20, 35]. These techniques prove helpful in obtaining updated information when researchers have observable data but the structure of the model is not known.

2 Statement of Problem

In many earlier studies like [1, 25], the TQM level has been assessed using primary data collected through a survey based on many TQM practices or elements. These practices and elements have differed considerably across studies [21, 28]. While primary data collection on TQM practices may be one way to identify TQM levels, another approach is to collect data that is easily available, like employee strength, age, and recognition/awards (mostly secondary data), and evaluate the TQM level based on the same. Primary data collection is a long and costly process, and this study tries to provide an alternative and quicker way of determining TQM levels. Essentially, this study tries to solve the problem of finding the TQM level of an organization from the available information on size, age, and quality certifications/awards. This may certainly help researchers, but it can also act as a tool for customers who just want to shortlist a few organizations based on the TQM level predicted through the model provided in this study.

3 Objective of Study

This paper aims at creating a model to predict TQM levels as high or low, based on employee strength (considered as size), age (years of existence of the organization), and the presence of any certification or quality award received by the organization. Most such models are based on traditional statistical techniques, which make many assumptions about the distribution of the underlying data. In this study, however, the model has been created using machine learning techniques, which make no assumptions about the distribution of the data.

4 Literature Review

Hoang et al. [18] found that, in comparison with small and medium-sized companies, larger companies had better levels of TQM implementation across all of their functions, with the exception of teamwork and open organization. The study also mentioned that certification statistics showed that, by August 2006, 1683 organizations in Vietnam had already received ISO 9001 certification and had founded an "ISO Club" to encourage TQM implementation and to share expertise among themselves. This shows that industries in Vietnam consider TQM a very useful technique for improving the quality of their products and services. Samson and Terziovski [32] discovered significant variances in the association between TQM and organizational performance when company size was considered, especially its impact on the development of new products. Their findings demonstrate that larger organizations benefited more from TQM than smaller ones. These outcomes are in line with the studies conducted by Garvin [13] and the GAO Study [12]. However, a study by Ahire and Golhar [3] shows a lack of working differences in TQM implementation as a result of firm size and demonstrates that
smaller and larger firms maintaining high manufacturing quality have implemented TQM equally efficiently. The study of Haar and Spell [15] considers the rates of adoption of TQM by businesses in New Zealand and the role of company size in establishing these rates. To estimate these rates, the variables used were company size, workplace autonomy, performance standards, use of teams, and group problem solving, with company size assigned the role of moderating variable. The results indicate that firms with greater levels of workplace autonomy, use of performance standards, teams, and group problem solving had a greater tendency to implement TQM, and this was more probable for large-scale firms than for smaller ones. These conclusions show that, although most smaller businesses face challenges like limited markets, insufficient resources, and a lack of managerial know-how, they nonetheless have an edge in flexibility and innovation that would enable them to implement TQM just as efficiently as larger businesses. Abusa [1] expected that the degree of implementation of TQM practices in large businesses would be greater and far more advanced than among small and medium businesses; however, the results show otherwise. His findings report no significant difference between small and medium enterprises and larger businesses in the means of TQM elements at the 0.05 level of significance. Furthermore, the mean score of TQM elements in small and medium-sized businesses is comparatively better than the mean scores of TQM elements among large companies. The study by Brah et al. [6] does not find any significant difference in the rigor of implementation between experienced TQM firms and less experienced firms. Therefore, it can be stated that firms having more experience in TQM implementation do not necessarily perform better than less experienced ones.
The results of Ahire and Golhar [2] fail to report any difference in the rigor of TQM implementation between firms with extensive TQM practice and less experienced TQM firms, with the exception of the employee involvement variable, and show that firms can reach higher levels of performance in a comparatively short TQM time frame compared with firms having longer experience in TQM implementation. The International Organization for Standardization (ISO) has established a set of standards for quality management systems termed the "ISO 9000" series; this practice is partly related to the practices of TQM implementation. ISO 9000.3 is a specifically interpreted section of ISO 9001 which deals with software development [19]. Lakhal [27] concluded in his study that implementing the ISO 9000 standard prior to TQM implementation results in improved organizational performance. Berger and Magliozzi [5] have shown that statistical methods like logistic regression and discriminant analysis have been widely used for modeling consumer responses to direct-marketing practices. Although statistical methods can prove extremely effective, they are based on a number of rigorous statistical assumptions about the types and distribution of the data to be used and can normally accommodate only a limited number of variables. To overcome this, researchers have come up with advanced models, including beta-logistic models [31], tree-generating techniques like CART and CHAID [16], and the hierarchical Bayes model [4].


5 Conceptual Framework

The image below provides the conceptual framework for this study. From the literature reviewed, it is known that variables like age, size, and certifications/awards may determine the TQM level of an organization. In this case, the level of TQM is a binary dependent variable. The independent variables were as follows:

Size—Number of employees in the organization (continuous variable).
Age—Years of existence of the organization (continuous variable).
Certification/Recognition or Award—Presence or absence of a quality certification (like ISO) or any recognition or quality award received (categorical variable, with values 0 or 1).

The organizations were identified as high or low TQM implementers based on the primary data collected. The rationale behind identifying TQM as high or low was similar to that used by Jaiswal and Garg [24], Abusa [1], and Chapman and Al-Khawaldeh [8], who made two categories of the participating companies: businesses with an advanced TQM implementation level and those with a low-level implementation status. The technique used the responses received from the survey questions about TQM implementation (Fig. 1).

In machine learning and statistics, classification is a supervised method that enables a computer program to learn from input data and then classify new observations based on that learning. The data can be binary (for instance, gender, i.e., male/female, or spam/not spam in the case of mail) or multi-class. Machine learning uses algorithms for classification such as:

1. Linear classifiers: logistic regression, linear discriminant analysis
2. Nearest neighbor
3. Naïve Bayes
4. Support vector machines
5. Decision trees
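Before comparing classifiers, the framework's three predictors and binary target can be laid out as plain feature vectors. The values below are hypothetical, shown only to fix the data layout — they are not records from the authors' survey:

```python
# Each record: (size = employee strength, age = years of existence,
# cert = 1 if any quality certification/award is present,
# label = 1 for high TQM, 0 for low). Hypothetical values for illustration.
records = [
    (1200, 15, 1, 1),   # large, old, certified firm  -> high TQM
    (40, 3, 0, 0),      # small, young, uncertified   -> low TQM
]
X = [r[:3] for r in records]   # feature vectors fed to the classifiers
y = [r[3] for r in records]    # binary class labels
print(X)  # [(1200, 15, 1), (40, 3, 0)]
print(y)  # [1, 0]
```

Any of the classifiers listed above can then be trained on such (X, y) pairs.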

Fig. 1 TQM and machine learning process. Source The Author, 2020


All the above-mentioned algorithms are non-ensemble techniques and were compared before arriving at the final model. A brief description of each follows.

i. Logistic Regression (predictive learning model) is used for the analysis of a data set containing one or more exogenous variables that cause the outcome. The outcome is measured with a dichotomous variable, which can take either of two possible values. Logistic regression pinpoints the best-fitting model explaining the cause-and-effect relation between the predictor variables and the dichotomous response variable. It carries an advantage over other binary classifiers like nearest neighbor in that it also provides a quantitative explanation of the factors driving the classification.

ii. Linear Discriminant Analysis is a statistical technique which reduces dimensionality in a way that yields the maximum feasible discrimination among different classes. It is used in machine learning to find the linear combination of features that best separates two or more classes of objects.

iii. Nearest Neighbor classification, also termed k-nearest neighbor, uses proximity as a proxy for "sameness." The algorithm learns to label new points from a collection of existing labeled reference points that lie in close proximity to the points to be labeled. The nearest neighbor is the existing labeled point closest to the new point, with closeness normally stated as a dissimilarity function. After examining k neighbors, the algorithm assigns the new point the label present on the majority of those neighbors.

iv. The Naive Bayes classifier (generative learning model) assumes that each feature in a class is independent of every other feature, and that all such properties contribute independently to the probability. The easy scalability and convenient construction of this type of classifier render it highly useful for processing extensive data sets.

v. The Support Vector Machine (SVM) algorithm separates N classes by finding a hyperplane in N-dimensional space that distinguishes the classes. Of the various candidate hyperplanes, the one with the maximum distance from the data points of the different classes should be chosen; this in turn provides confidence that the classification of the next data point will be correct.

vi. Decision Trees are a way to summarize and understand data. A tree has a root node and child nodes, and its structure resembles a hierarchy which allows the desired outcome to be reached by examining its nodes. Decision trees often mimic human-level thinking to derive decisions from given data. They use a tree-structured format to construct models for classification or regression: the data set is split into progressively smaller subsets while a decision tree is developed incrementally, generating a final tree with decision nodes and leaf nodes. A decision node contains at least two branches, while a leaf node represents either the decision or the classification. The topmost decision node, corresponding to the best predictor, is called the root node. Decision trees can handle categorical as well as numerical data sets. Classification and Regression Trees (CART), a kind of decision tree, uses the Gini index as its measurement metric. If all the data belong to a single class, the node is called pure. The Gini index always lies between 0 and 1: at 0, all data belong to a single class; toward 1, the data are spread across different classes.

The ensemble techniques used to create models in this study are described below. Ensemble methods are machine learning techniques that combine numerous base models to generate an optimal predictive model. When predicting a target variable by means of machine learning, noise, variance, and bias cause differences between the actual and estimated figures; except for noise, these errors can be reduced by using ensemble methods.

i. Boosting is a process by which a group of algorithms uses weighted averages to transform weak learners into strong learners. Unlike bagging, which runs every single model independently and then aggregates the end results without prioritizing any specific model, boosting relies on teamwork: every model that has been run provides features for the upcoming model to focus on. Adaptive boosting (AdaBoost) and gradient boosting (GBM) are types of boosting. Both build weak learners sequentially, and the final prediction is a weighted average of all weak learners. The difference between the two is that the shortcomings (weak learners) are identified by the gradient in GBM, while they are identified by high-weighted data points in AdaBoost.

ii. Bagging, also called bootstrap aggregation, is another simple yet powerful ensemble method. It finds application in decision trees, which are a high-variance machine learning algorithm. Applying bootstrap aggregation to decision trees reduces concerns about individual trees overfitting the training data. To be efficient, each of the decision trees is grown deep, which means limited training at individual leaf nodes and that the leaf nodes are not pruned. Random forests and extra trees are kinds of bagging methods.
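As a rough illustration of the Gini measure CART uses (a minimal sketch, not the authors' implementation), the impurity of a node's labels can be computed as one minus the sum of squared class proportions:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini index of a node: 0 when all labels belong to one class (pure),
    rising toward 1 as the labels spread across more classes."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_impurity(["high", "high", "high"]))        # 0.0 -> pure node
print(gini_impurity(["high", "low", "high", "low"]))  # 0.5 -> evenly mixed
```

CART evaluates candidate splits by this impurity, preferring splits whose child nodes are as pure as possible.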

Random forest is an ensemble learning technique for regression, classification, and related tasks that constructs multiple decision trees during the training stage and outputs the mode of the individual trees' classes (classification) or the mean of their predictions (regression). The Extremely Randomized Trees classifier (extra trees classifier) is a type of ensemble learning procedure which aggregates the results of many de-correlated decision trees collected in a "forest" to output its classification result. In concept, it is very similar to a random forest classifier and differs only in the manner in which the decision trees in the forest are constructed. Just as in a random forest, a random subset of candidate features is used, but instead of searching for the most discriminative thresholds, thresholds are drawn at random for each candidate feature, and the best of these randomly generated thresholds is picked as the splitting rule. This in turn reduces the variance of the model slightly more than a random forest does.
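The extra-trees idea — random thresholds on a random feature, aggregated by majority vote — can be sketched with depth-1 "stumps" standing in for full trees. This is an illustrative toy, not the algorithm as used in the study; the data, function names, and seed are all hypothetical:

```python
import random
from collections import Counter

def fit_random_stump(X, y):
    """One extremely randomized 'tree': a depth-1 split on a random feature
    at a threshold drawn uniformly at random (the extra-trees idea), with
    each side labeled by the majority class of the rows it holds."""
    f = random.randrange(len(X[0]))
    lo, hi = min(r[f] for r in X), max(r[f] for r in X)
    t = random.uniform(lo, hi)
    left = [y[i] for i, r in enumerate(X) if r[f] <= t] or y
    right = [y[i] for i, r in enumerate(X) if r[f] > t] or y
    left_lab = Counter(left).most_common(1)[0][0]
    right_lab = Counter(right).most_common(1)[0][0]
    return lambda row: left_lab if row[f] <= t else right_lab

def forest_predict(stumps, row):
    """Aggregate the de-correlated stumps by majority vote."""
    return Counter(s(row) for s in stumps).most_common(1)[0][0]

random.seed(7)
# Hypothetical toy rows: [size, age]; labels chosen only for illustration.
X = [[50, 3], [80, 5], [60, 4], [900, 12], [1200, 20], [1500, 15]]
y = ["low", "low", "low", "high", "high", "high"]
forest = [fit_random_stump(X, y) for _ in range(101)]
print(forest_predict(forest, [1400, 18]))  # 'high' on this toy data
```

Real extra trees grow full multi-level trees on feature subsets; the stump keeps only the defining trait, the randomly drawn split threshold.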

6 Methodology

This study is based on IT (software development and maintenance) organizations. The sampling frame was the list of IT organizations in the NCR and TRICITY regions on the NASSCOM website, from the year 2016. A well-structured questionnaire previously used by Jaiswal and Garg [24] was used to collect data for the study. It was sent to employees of these organizations through email, and one response was considered per organization. The survey was based on a Likert scale. Using a convenience sampling technique, the survey collected information about age, size (employee strength), quality certifications or awards received, and TQM practices in the organization. A total of 146 responses were received between 2019 and January 2020. The geographical area of this research covers IT organizations operating in the NCR and TRICITY regions. A machine learning model was created using all three independent variables, namely organization size, age, and presence of a quality certification/award, to predict the level of TQM implementation in an organization.
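The models were evaluated with tenfold cross-validation under a fixed random seed, as the next section describes. The resampling scheme can be sketched in plain Python — an illustration only, with the function name and seed value being assumptions rather than the authors' code:

```python
import random

def kfold_indices(n, k, seed=42):
    """Split indices 0..n-1 into k folds after a seeded shuffle, so that
    every algorithm is evaluated on exactly the same train/test splits."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]

splits = kfold_indices(n=146, k=10)          # tenfold CV on the 146 responses
print(len(splits))                            # 10 train/test pairs
print(sum(len(test) for _, test in splits))   # 146: each record tested once
```

Reusing the same seed for every algorithm is what makes the per-fold accuracies comparable across models.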

7 Results and Discussion

Differences in the scales of the raw data might negatively affect the capability of various algorithms. To check whether this is the case, the same algorithms should also be evaluated on standardized data, i.e., after a transformation that gives every attribute of the data set a mean of 0 and a standard deviation of 1. The study used tenfold cross-validation, deliberately configured with the same random seed, to ensure that the same splits were applied to the training data and that each algorithm was estimated using precisely the same method. Cross-validation is a re-sampling procedure used to evaluate machine learning models when the available data set is limited. It relies on one parameter, k, which represents the number of groups into which the sample is split; the procedure is therefore termed k-fold cross-validation, with the chosen value of k naming the procedure (for instance, k = 15 means 15-fold cross-validation). The technique is used in applied machine learning to approximate the capability of a model on unseen data: a limited data set (the training set) is used to assess the expected capability of the model for making predictions.

Table 1 Algorithm results comparison

Methods                               Mean (accuracy)   S.D. (accuracy)
Logistic regression (LR)              0.822321          0.137798
Linear discriminant analysis (LDA)    0.822321          0.137798
K-nearest neighbors (K-NN)            0.816071          0.145303
Classification and regression trees   0.959821          0.085988
Naïve Bayes                           0.822321          0.137798
Support vector machine                0.822321          0.137798

Source Author's field survey, 2020

The results in Table 1 show the mean and S.D. of accuracy across the cross-validation folds. A better way is to examine the distribution of accuracy values across the folds, which can be done using box-and-whisker plots. The results in Fig. 2 show that the distribution for CART is better than the others. The small round shapes are outliers; each box in Fig. 2 ranges from the lower to the upper quartile with the median denoted by a line, and the whiskers show the spread of the data.

Fig. 2 Algorithm comparison. Source Author's field survey, 2020

Based on these results, we proceeded with modeling using the CART technique. The model was tested on the test-set data, and metrics like accuracy and sensitivity were calculated and presented in the form of a confusion matrix (Table 2) for measuring the effectiveness of the model. A confusion matrix describes the performance of a machine learning model. The terms used in Table 2 are as follows:

• True Positive (TP): Records in the data which are positive and are predicted as such.
• False Negative (FN): Records in the data which are positive but are predicted otherwise.
• True Negative (TN): Records in the data which are negative and are predicted as such.
• False Positive (FP): Records in the data which are negative but are predicted otherwise.

Table 2 Confusion matrix—CART model

Confusion matrix   Predicted NO   Predicted YES
Actual NO          TN = 0         FP = 5
Actual YES         FN = 0         TP = 25

Source Author's field survey, 2020

These terms are used to calculate further performance metrics. Accuracy is the ratio of true predictions (whether positive or negative) to the total number of observations. Sensitivity is the ratio of true positives to the sum of true positives and false negatives; high sensitivity indicates that the class is correctly recognized with a low number of false negatives. Precision is the ratio of true positives to the sum of true positives and false positives; higher precision indicates that a record labeled positive has a higher chance of actually being positive.

The same process was followed for measuring performance through the application of ensemble techniques. The following algorithms were applied:

i. Boosting methods: AdaBoost (AB) and Gradient Boosting (GBM).
ii. Bagging methods: Random Forests (RF) and Extra Trees (ET) (Fig. 3).
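The metric definitions above can be checked directly against the CART confusion matrix of Table 2 (a small sketch; the function name is illustrative, not from the paper):

```python
def confusion_metrics(tp, tn, fp, fn):
    """Performance metrics defined above, from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, precision, f1

# CART test-set confusion matrix from Table 2: TN = 0, FP = 5, FN = 0, TP = 25
acc, sens, prec, f1 = confusion_metrics(tp=25, tn=0, fp=5, fn=0)
print(round(acc, 2), sens, round(prec, 2), round(f1, 2))  # 0.83 1.0 0.83 0.91
```

Feeding in the ET confusion matrix (TN = 3, FP = 2, FN = 0, TP = 25) reproduces the ET metrics reported later in the same way.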

Figure 3 and Table 4 clearly demonstrate that extra trees (ET) performed best among the ensemble algorithms: the spread of ET is the smallest, and its mean accuracy is the highest of all. Hence, it was chosen to create another model. Table 5 shows the confusion matrix for this model, similar to what was done for the CART model (Table 2).

Fig. 3 Ensemble algorithm comparison. Source Author's field survey, 2020


Table 3 CART model performance metrics

Parameters    Formula                                                    Value
Accuracy      (TP + TN)/(TP + TN + FP + FN)                              0.83
Sensitivity   TP/(TP + FN)                                               1
Precision     TP/(TP + FP)                                               0.83
F1-Score      2 * (Precision * Sensitivity)/(Precision + Sensitivity)    0.91

Source Author's field survey, 2020

Table 4 Ensemble algorithms results comparison

Methods                            Mean (accuracy)   S.D. (accuracy)
Ada boosting (AB)                  0.856667          0.132539
Gradient boosting machines (GBM)   0.908333          0.123322
Random forest (RF)                 0.933333          0.110554
Extra trees (ET)                   0.941667          0.108972

Source Author's field survey, 2020

Table 5 Confusion matrix—ET model

Confusion matrix   Predicted NO   Predicted YES
Actual NO          TN = 3         FP = 2
Actual YES         FN = 0         TP = 25

Source Author's field survey, 2020

The terms used in Table 5 were explained for the previously created CART model. Based on the confusion matrix values, the performance metrics for the ET model were calculated in the same way as for the CART model. Table 6 reports the accuracy, sensitivity, precision, and F1 values for the ET model; all of these values are better than the corresponding CART model values, indicating that the model created through the ensemble technique ET is better than the simple CART model at predicting TQM levels. Figure 4 compares the mean accuracy values for all models taken together and clearly shows that ET and CART have the highest mean accuracies among all algorithms, whether ensemble or non-ensemble. Figure 5 depicts the standard deviation of accuracy for all techniques across the cross-validation folds; here too, the spread of CART is the smallest, followed by ET.

Exploring Total Quality Management Implementation …

781

Table 6 ET model performance metrics

Parameters    Formula                                                   Value
Accuracy      (TP + TN)/(TP + TN + FP + FN)                             0.93
Sensitivity   TP/(TP + FN)                                              1
Precision     TP/(TP + FP)                                              0.93
F1            2 * (Precision * Sensitivity)/(Precision + Sensitivity)   0.96

Source Author's field survey, 2020

[Bar chart: mean cross-validated accuracy for each algorithm; values range from 0.816071 to 0.959821, with ET (0.941667) among the highest.]

Fig. 4 Mean accuracy of machine algorithms. Source Author's field survey, 2020

8 Conclusion and Recommendations

The results of this study showed that the CART and ET models perform well, compared to other machine learning predictive algorithms, on the data collected through the survey for predicting the level of TQM in an organization. The ensemble methods in general resulted in better accuracy than all the non-ensemble methods, except the CART model. The CART and ET algorithms have almost the same accuracy on the training set, but ET was much better than CART when validated on the test set, with an accuracy of 0.93 compared to the 0.83 provided by the CART model. In fact, all the other performance metrics of the ET model were also better than those of the CART model. Eventually, we arrived at a model which can differentiate the TQM levels of organizations, based on the data collected from 146 IT organizations. The main benefit of this research can be perceived by researchers who are trying to establish TQM levels by using


[Bar chart: standard deviation of cross-validated accuracy for each technique; values range from 0.085988 to 0.145303.]

Fig. 5 Standard deviation accuracy for techniques used. Source Author's field survey, 2020

traditional concepts of surveying the organization for the TQM practices followed. Most of the easily available information, viz., employee strength, years in operation, and any certificate/award, can be used to predict the TQM level of a business. The information related to the above-stated input parameters is mostly available from secondary sources. Hence, instead of collecting primary data for numerous TQM practices, it may be possible to predict the TQM levels of organizations with the model created in this study. Many studies have indicated that TQM is highly related to customer and employee satisfaction: it leads to better performance, and organizations with higher TQM levels have better productivity and quality. Based on this, businesses looking to shortlist vendors for service provision can be recommended to use this model. Once predicted, TQM levels (High or Low) can be extrapolated to estimate quality, customer satisfaction, productivity, etc., for a service provider. This will benefit customers who may be in a dilemma over choosing a service provider out of the many options available; the model may provide insight into the level of TQM implementation and in turn help shortlist a few out of many.
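Prediction from secondary-source inputs, as described above, could look like the following. This is a hypothetical sketch only: the feature encoding, values, and training data below are invented illustrations, not the study's actual dataset or encoding.

```python
# Sketch: predict a vendor's TQM level (High/Low) from easily available
# secondary-source features. All data here is synthetic and illustrative.
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
# Hypothetical columns: employee strength, years in operation,
# certification/award held (0/1).
X_train = np.column_stack([rng.integers(10, 5000, 146),
                           rng.integers(1, 40, 146),
                           rng.integers(0, 2, 146)])
y_train = rng.integers(0, 2, 146)  # 0 = low TQM level, 1 = high TQM level

model = ExtraTreesClassifier(random_state=0).fit(X_train, y_train)

vendor = [[850, 12, 1]]  # a hypothetical candidate service provider
print("Predicted TQM level:", "High" if model.predict(vendor)[0] else "Low")
```

A customer shortlisting vendors would score each candidate this way and keep those predicted as High.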

9 Limitations and Scope for Future Study

The sample size generally remains a limitation of a study, owing to the geographical spread of the sector of interest. Hence, a wider geographical area can be studied to arrive at a steadier model. Similarly, a larger number of TQM elements can be considered in the survey used for identifying TQM levels,


before coming up with a machine learning model. The model created contained three independent variables; further independent variables may be added, and a more complex model can be created for predicting TQM levels.


Predicting an Indian Firm's Sickness Using Artificial Neural Networks and Traditional Methods: A Comparative Study

Narander Kumar Nigam, Harshit Agarwal, and Khushi Goyal

Abstract The current research focuses on the development of early-warning-signal models for the prediction of industrial sickness among Indian firms. A sample of 196 manufacturing companies (52 sick and 144 non-sick) has been utilized to develop 2-year, 3-year, and 4-year sickness predictive models using financial information for the years 2015–2020. These models have been constructed using the Logit, Probit, and ANN techniques, and the results from the three techniques have then been compared for each predictive model to obtain the model with the highest prediction accuracy. To make this comparison more reliable, the same companies have been used in the testing set of all the models. Based on the empirical results, it can be inferred that the artificial neural network has the highest predictive ability in all cases, whereas, of the two traditional models, the Logit model has higher accuracy than the Probit model. The results can assist management in taking timely decisions to reduce the likelihood of a company becoming sick, and other stakeholders, such as investors and banking institutions, in making their investment and credit-extension decisions.

1 Introduction

‘Any future increases in corporate insolvencies are likely to affect others as insolvency practitioners estimate that around 27% of corporate insolvencies are triggered by another company’s insolvency—the “domino effect”.’ These words by Steven Law [1], the then president of R3 (the UK association of business recovery professionals),

N. K. Nigam (B) · H. Agarwal · K. Goyal
Shaheed Sukhdev College of Business Studies, University of Delhi, New Delhi, India
e-mail: [email protected]
H. Agarwal, e-mail: [email protected]
K. Goyal, e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_63

785

786

N. K. Nigam et al.

are a testament to the economic significance of corporate insolvency prediction. The adverse consequences of a firm becoming financially insolvent are experienced by a diverse group of stakeholders, such as shareholders, employees, management, and customers, and the impact extends to other firms and the overall economy. The prediction of firm insolvency has thus long been a crucial topic and has been studied in detail by various researchers [2–7]. It is widely contended that insolvency in nearly all organizations takes place in a similar way [8]; thus, there exists a similarity in the signals indicating its imminence [9]. Early research on the prediction of insolvency made use of techniques like multiple discriminant analysis (MDA) [3, 5, 10, 11], the probit model [12–14], and the logistic regression model [15–17]. However, the efficacy and applicability of some of these statistical techniques rely largely on assumptions such as normality, linearity, and independence among the independent variables. Recent studies have employed artificial neural networks (ANN) as an alternative technique for the prediction of bankruptcy [18–21] because of the model's ability to work with imprecise variables [22]. The comparison between the classical statistical techniques and neural networks continues to be an unresolved dispute, with the existing studies leading to mixed results. A plethora of studies has asserted the superiority of the ANN model over logistic regression in predicting corporate insolvency [23–25]. At the same time, Nag [26] argued that the performance of neural networks over the traditional methods is not necessarily superior: although the prediction error yielded by multiple regression models was higher than that of the ANN model, the ANN model's residual serial correlation was also higher.
Furthermore, Alici [27], based on a sample of UK firms, asserted the superiority of Logit, Probit, and MDA over neural networks for insolvency prediction. Aiming to answer this open question, the present study attempts to draw a clear picture of the comparative performance of the traditional models (Logit and Probit) and artificial neural networks in predicting the sickness of companies in India. Despite being a popular prediction model, MDA has been excluded from this research because questions are often raised concerning the restrictive requirements the model imposes. The study attempts to develop models capable of predicting industrial sickness as early as four years before the company becomes sick, which can help stakeholders like management, banks, and investors take prudent and strategic decisions in a timely manner. Besides, while extensive research has been conducted internationally on the prediction of firm insolvency, few studies have been carried out to predict industrial sickness in India. Moreover, Datta [28] asserted that the existing empirical work on industrial sickness fails to provide models that can make predictions accurately, which makes it more imperative to undertake this research. To obtain a better predictive model, a few new indicators, beyond those widely used in past research, have been incorporated. Additionally, the present research uses a more recent dataset of sick and non-sick companies for the development of the models than the prior studies. The rest of the paper proceeds as follows. Reviewing the existing corporate distress prediction literature, the next section documents the relative popularity


of different methods used in academic research for financial distress prediction. The third section explains the sample development and the methodology used to develop the sickness prediction model. The fourth part of this research paper presents empirical results. The fifth section of the paper provides a discussion, and the final section provides the concluding remarks.

2 Literature Review

Prediction of corporate failure using past financial data has been studied extensively. Beaver [2] predicted the insolvency of companies using the information contained in financial statements with the help of a univariate model, indicating that financial ratios can help estimate the solvency or insolvency of a company. Following Beaver's pioneering work in 1966, Altman [3] contributed immensely to the prediction of insolvent companies by being the first to deploy MDA for insolvency prediction. His study suggested that insolvency prediction is better done using aggregate financial ratios instead of individual financial ratios; his model achieved a hit ratio of 79% for the holdout sample one year before insolvency. From the early 1970s, many researchers focused on the application of MDA models, the Z-score, and other similar techniques to predict the insolvency of businesses [5, 29–31]. The MDA model used by Blum [30] correctly predicted 94% of bankruptcy cases one year prior to bankruptcy. Dambolena and Khoury [32] developed MDA models that could correctly predict 87%, 85%, and 78% of cases 1, 3, and 5 years before bankruptcy, respectively. However, the assumption violations associated with the MDA approach [33, 34] guided the efforts of researchers toward the development of logit and other conditional probability models, with the former being the most popular [13, 16, 35]. The first attempt to use a logit model to predict bankruptcy was made by Ohlson [15], who used nine independent variables to develop a model that correctly predicted 92% of the estimates, using information for firms two years prior to bankruptcy. Darayseh et al. [36] applied logistic regression to economic variables and firm-wise financial ratios.
They drew a sample of 110 bankrupt manufacturing firms (that became bankrupt between 1990 and 1997) as well as 110 non-bankrupt firms, and the model correctly predicted 87.82% and 89.50% of the estimation and holdout samples, respectively, one year prior to bankruptcy. The advancements in technology, along with the enhanced availability and power of computer processing in the 1990s, led researchers to develop more reliable models for corporate distress prediction. Subsequently, artificially intelligent techniques like neural networks [18–21] and genetic algorithms [37] were introduced. Odom and Sharda [18] made the first use of artificial neural networks for bankruptcy prediction. Using the five financial ratios of Altman [3] as input variables, they developed a three-layer perceptron network; using different mixture levels for the training set, the study concluded that neural networks are more robust than MDA in bankruptcy prediction. Callejón et al. [38] matched a group of 500 industrial European companies


that went bankrupt between 2007 and 2009 with 500 solvent companies, according to country and asset-size classification, and made use of NN models for the prediction of insolvency. The estimated NN model could make correct predictions for 92.5% and 92.1% of the training and testing datasets, respectively, two years prior to bankruptcy. For Indian firms, few studies have been conducted to predict industrial sickness. Kaveri [39] attempted to develop a prediction model on a sample of small-scale industries using MDA; this model achieved an accuracy rate of 76% for the prediction of sickness one year prior to sickness. Bhattacharaya [40] carried out research on a sample of 28 sick and 26 financially sound companies and was able to correctly classify around 80% of the companies using MDA. Gupta [41] attempted to determine the ratios instrumental in monitoring corporate sickness and concluded that profitability ratios had comparatively more predictive power. Some researchers [24, 26, 27, 42] compared the efficacy of various models in predicting the insolvency of firms, with mixed results. Alici [27] compared the artificial neural network with traditional techniques such as MDA and Logit regression in predicting the insolvency of firms in the UK and found that Logit, Probit, and MDA performed better than the artificial neural network. However, Tam and Kiang [42], comparing the neural network model with logistic regression, a discriminant analysis model, K-nearest neighbor, and the ID3 algorithm in predicting the insolvency of banks, found the neural network model to be more accurate, adaptable, and robust than the other methods.
Furthermore, Hosaka [43] conducted a study on the Japanese stock markets in which ANN displayed better performance than linear discriminant analysis, decision trees, and intelligent machines in predicting the bankruptcy of companies. Despite the extensive literature on financial distress, disputes still exist over the appropriateness of the different ratios as well as the methods (ANN or LR) for the prediction of corporate sickness. Besides, not many attempts have been made to predict the industrial sickness of Indian firms. Therefore, it is imperative to conduct a study investigating sickness prediction for companies in India, especially at a time when the Indian economy has witnessed so many setbacks.

3 Data and Methodology

3.1 Data

3.1.1 Sample Selection

As per the Sick Industrial Companies Act (SICA) enacted in 1985, a sick unit can be defined as an industrial unit that has been in existence for at least the past five years and whose accumulated losses exceed its net worth at the end of any financial year. The


data for around 17,506 companies in the manufacturing industry was obtained from the Prowess IQ database for the years 2013 to 2020. The manufacturing units which were sick in both the years 2019 and 2020 have been termed as sick companies in this analysis. The dataset of around 17,506 companies was reduced to 196 companies after eliminating companies whose net worth and accumulated losses/profits data was not available, or which became sick in any of the years between 2013 and 2018 or which were sick in only one of the two years (2019 or 2020). Out of these 196 firms, 52 firms were classified as sick units and the remaining 144 units were classified as non-sick. The financial information for the companies used to develop the prediction model was obtained for the years 2015 to 2017. Also, the total number of firms varies from 196 in 2017 to 185 in 2015 due to the non-availability of data.
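The SICA sickness criterion used for labeling above can be expressed as a small predicate. This is a sketch of the definition, not the study's code; the function name and parameters are mine.

```python
# SICA (1985) definition: an industrial unit is "sick" if it has existed for
# at least five years and its accumulated losses exceed its net worth at the
# end of a financial year.
def is_sick(years_in_existence: int, accumulated_losses: float,
            net_worth: float) -> bool:
    return years_in_existence >= 5 and accumulated_losses > net_worth

print(is_sick(8, 120.0, 100.0))   # True: losses exceed net worth
print(is_sick(3, 120.0, 100.0))   # False: unit younger than five years
print(is_sick(10, 50.0, 100.0))   # False: net worth still covers losses
```

In the study's labeling, a unit was classified as sick only if this condition held in both 2019 and 2020.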

3.1.2 Variables

It is widely contended that the firms which are not profitable are more likely to have a negative net worth and become a sick unit [44]. Scherrer and Mathison [45] also believe that high profitability plays an important role in the stabilization of operating cash flows and hence helps in the reduction of the risk of becoming a sick unit. The current study thus uses two measures of profitability, return on assets ratio and earnings before interest and taxes to total assets ratio, because these ratios measure an organization’s profitability taking into consideration its scale of operations. Financially distressed firms are likely to have low liquidity [46]. Low liquidity may result in the inability of a firm to meet its debt obligations, which in turn increases the probability of an industrial unit turning sick. Thus, the present study employs two variables to incorporate a firm’s liquidity: Creditor Turnover ratio and Cash Profit as a percentage of total income. Debt financing impacts a company’s financial health as there exists a significant relationship between debt and profitability of a company [47–50]. High debt increases the financial risk of a company which can negatively affect the profitability increasing the likelihood of the company becoming sick. Therefore, the debt-to-equity ratio and the short-term borrowings to sales ratio have been used in this paper as a measure to indicate the borrowings of a company. The dependent variable in classification models used in this paper is binary in nature, where the value of 1 represents the sick firms, whereas 0 represents the non-sick firms.
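The six predictors described above could be computed from raw statement items along the following lines. All field names and sample figures here are hypothetical illustrations; the study's exact Prowess IQ definitions may differ.

```python
# Compute the six predictor ratios used in the sickness models from a dict
# of (hypothetical) financial-statement items.
def predictors(f):
    return {
        "return_on_assets": f["net_profit"] / f["total_assets"],
        "ebit_to_total_assets": f["ebit"] / f["total_assets"],
        "creditors_turnover": f["purchases"] / f["avg_trade_creditors"],
        "cash_profit_pct_income": f["cash_profit"] / f["total_income"],
        "debt_to_equity": f["total_debt"] / f["shareholders_equity"],
        "st_borrowings_to_sales": f["short_term_borrowings"] / f["sales"],
    }

# Illustrative firm, figures in (say) crores.
firm = {"net_profit": 12.0, "total_assets": 200.0, "ebit": 20.0,
        "purchases": 150.0, "avg_trade_creditors": 30.0, "cash_profit": 18.0,
        "total_income": 180.0, "total_debt": 80.0, "shareholders_equity": 100.0,
        "short_term_borrowings": 25.0, "sales": 170.0}
ratios = predictors(firm)
print(ratios["debt_to_equity"])  # 0.8
```

The resulting vector of six ratios, plus the binary sick/non-sick label, is the input format the three models below share.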

3.2 Methodology

This subsection briefly discusses the models that have been employed in this paper and the methodology adopted to run them.

3.2.1 Logit Model

The Logit model is a statistical model that investigates the interrelation between binary response probability and the predictor variables. To predict sick firms, the binary probability will be the probability of a firm becoming sick, whereas numerous predictor variables can be used. The advantage of this model is that it bypasses some of the assumptions (multivariate normality and equal covariance matrices) taken by the other traditional models [51].

3.2.2 Probit Model

The Probit regression model is a technique for performing regression on dichotomous dependent variables. The Probit model calculates the probability of a value belonging to one of the two outcomes. It distinguishes itself from the Logit model by using the cumulative distribution function of the standard normal distribution instead of the cumulative distribution function of the logistic distribution. The Probit model is preferred over the Logit model mainly for random-effects models with large sample sizes; otherwise, the two models are considered equally good.
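The only structural difference between the two models is the link CDF, which can be seen by evaluating both at the same value of the linear index (the value `z = 0.5` below is an arbitrary illustration):

```python
# Probit uses the standard normal CDF; Logit uses the logistic CDF.
from scipy.stats import logistic, norm

z = 0.5  # value of the linear index x'beta
p_probit = norm.cdf(z)
p_logit = logistic.cdf(z)
print(round(p_probit, 4), round(p_logit, 4))
# 0.6915 0.6225
```

Both CDFs map the unbounded index to a probability in (0, 1); the logistic has heavier tails, which is why the two models usually give very similar classifications with somewhat different coefficient scales.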

3.2.3 Artificial Neural Network

Neural networks originated in the 1940s [52, 53] as a method to imitate the working of the human brain. A neural network comprises numerous computational units, also known as neurons, organized in layers. In this paper, a feed-forward neural network structure has been employed, the basic form of which consists of three layers: the input layer, the hidden layer, and the output layer. In this model, data flows in one direction, i.e., from the input nodes to the output nodes. The sample data was divided into two groups: a training sample and a testing sample. The companies in the training sample were used to construct the model, whereas the companies in the testing sample were used to check the efficacy of the constructed model. The testing sample was formed by picking 20 random companies (ten sick and ten non-sick) from the entire sample, and this was kept constant across all the models to make the results comparable. EViews was used to construct the Logit and Probit models, whereas IBM SPSS was used to construct the artificial neural network model (multilayer perceptron).
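The described setup can be sketched as follows, assuming synthetic data and scikit-learn in place of IBM SPSS; the hidden-layer size here is an assumption, not the paper's configuration.

```python
# Feed-forward multilayer perceptron with a fixed 20-company testing sample.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: 196 firms, six predictor ratios, binary sick label.
X, y = make_classification(n_samples=196, n_features=6, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=20, random_state=7,
                                          stratify=y)

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=7)
net.fit(X_tr, y_tr)                     # data flows input -> hidden -> output
accuracy = net.score(X_te, y_te)        # efficacy on the held-out 20 firms
print(f"testing-sample accuracy: {accuracy:.2f}")
```

Fixing `random_state` mirrors the paper's practice of holding the same 20 testing companies constant so the three models' accuracies are comparable.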


4 Empirical Results

4.1 Comparison of the Prediction Accuracy

The results for the different predictive models are reported in Tables 1, 2 and 3. For each model, the accuracy percentage for both the sick and non-sick firms is reported separately. As presented in Table 1, the Probit model, the Logit model, and the ANN model accurately classify around 60%, 60%, and 90% of the total testing sample set, respectively. Both the Logit model and the Probit model misclassify three sick firms as non-sick, whereas the ANN model misclassifies only one sick firm as non-sick. The results reported in Table 2 show that the Probit model, the Logit model, and the ANN model classify around 55%, 60%, and 70% of the total testing sample set accurately. The Probit model misclassified the largest number of sick firms as non-sick (six), whereas the ANN model misclassified the smallest number (three). Table 3 shows that the Probit model, the Logit model, and the ANN model accurately classify around 85%, 85%, and 95% of the testing sample, respectively. The Probit model and the Logit model each misclassify three sick firms as non-sick, whereas the ANN model accurately predicts all the sick firms. Thus, the ANN models yield the highest accuracy in the 2-year, 3-year, and 4-year predictive models, signifying that the ANN model is better than the traditional models at predicting industrial sickness.

Table 1 Comparison of four-year predictive models

Probit model
                   Non-sick   Sick   Accuracy percentage
Non-sick           5          5      50%
Sick               3          7      70%
Overall accuracy                     60%

Logit model
                   Non-sick   Sick   Accuracy percentage
Non-sick           5          5      50%
Sick               3          7      70%
Overall accuracy                     60%

ANN model
                   Non-sick   Sick   Accuracy percentage
Non-sick           9          1      90%
Sick               1          9      90%
Overall accuracy                     90%

Sample size: training = 165, testing = 20


Table 2 Comparison of three-year predictive models

Probit model
                   Non-sick   Sick   Accuracy percentage
Non-sick           7          3      70%
Sick               6          4      40%
Overall accuracy                     55%

Logit model
                   Non-sick   Sick   Accuracy percentage
Non-sick           7          3      70%
Sick               5          5      50%
Overall accuracy                     60%

ANN model
                   Non-sick   Sick   Accuracy percentage
Non-sick           7          3      70%
Sick               3          7      70%
Overall accuracy                     70%

Sample size: training = 171, testing = 20

Table 3 Comparison of two-year predictive models

Probit model
                   Non-sick   Sick   Accuracy percentage
Non-sick           10         0      100%
Sick               3          7      70%
Overall accuracy                     85%

Logit model
                   Non-sick   Sick   Accuracy percentage
Non-sick           10         0      100%
Sick               3          7      70%
Overall accuracy                     85%

ANN model
                   Non-sick   Sick   Accuracy percentage
Non-sick           9          1      90%
Sick               0          10     100%
Overall accuracy                     95%

Sample size: training = 176, testing = 20


Table 4 Predictive variable coefficients

Two-year prediction model
Variables                                       Logit model   Probit model
Debt/Equity                                     0.1228        0.1720
Short-term borrowings/Sales                     0.3254        0.9945
Creditors turnover                              −1.6689       −0.0006
Cash profit/Total income                        −0.0015       −0.0094
Earnings before interest and tax/Total assets   −0.0106       −12.6136
Return on assets                                −27.7171      −0.1462

4.2 Coefficients of Variables of the Two-Year Predictive Model

Table 4 reports the coefficients for the two-year predictive model estimated using the traditional techniques, i.e., Logit and Probit. The coefficient for the debt-to-equity ratio is positive, signifying a direct relation between the debt-to-equity ratio and the likelihood of a company becoming sick: high debt increases the financial risk of a company, which adversely impacts its profitability. The short-term borrowings to sales ratio is positively related to a company's probability of becoming sick because a higher proportion of short-term borrowings relative to sales indicates that the company is relying more on borrowed funds to finance its normal business operations, which signals a liquidity crunch. The creditors turnover ratio has a negative coefficient, as a high creditors turnover ratio indicates a company's ability to meet its short-term obligations: the higher the ratio, the higher the ability of a company to repay its debt and the lower its chances of turning sick. Cash profit as a percentage of total income is inversely related to the chances of a company becoming sick because high cash inflows indicate high liquidity, and the higher the liquidity, the lower the probability of becoming sick. The return on assets ratio and the earnings before interest and taxes to total assets ratio have negative coefficients, since the higher the profitability of an organization, the more stable its operating cash flows and hence the lower its chances of becoming a sick unit.
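The sign interpretation above can be made concrete by plugging two leverage values into the logistic response. Only the debt-to-equity coefficient (0.1228) is taken from Table 4; the intercept and the idea of holding all other predictors fixed at zero are hypothetical simplifications for illustration.

```python
# How a positive coefficient raises the predicted sickness probability.
import math

def p_sick(debt_equity, intercept=-1.0, beta=0.1228):
    # Logistic response for a one-predictor simplification of the model;
    # the intercept is hypothetical, beta is the Table 4 Logit coefficient.
    return 1 / (1 + math.exp(-(intercept + beta * debt_equity)))

low, high = p_sick(0.5), p_sick(4.0)
print(round(low, 3), round(high, 3))
assert high > low  # more leverage, higher predicted sickness probability
```

The same reasoning, with the sign flipped, applies to the negative coefficients: raising, say, return on assets lowers the linear index and hence the predicted probability of sickness.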

5 Discussion

The findings of this study are in line with the analyses of Cosset and Roy [23], Fletcher and Goss [24] and Zhang et al. [25], who asserted the superiority of neural networks over logistic regression models in the prediction of financial distress. Fletcher and Goss [24], based on a sample of 18 bankrupt and 18 non-bankrupt firms, suggested that artificial neural networks achieve higher accuracy as well as a better fit when compared to the logistic regression model. Zhang et al. [25], based on a study of 220 firms (110 bankrupt and 110 non-bankrupt), concluded that neural network models outperform logit models in the prediction of corporate insolvency. However, these findings are contrary to the result of Alici [27], who concluded that traditional methods such as Logit, Probit and MDA are superior to the ANN. A likely reason is that the set of around 28 ratios in that study was obtained through profile analysis and a set of nine ratios through principal component analysis, whereas the ratios for this study were selected on the basis of prior literature as well as accounting rationale.

6 Conclusion

In recent years, the incidence of industrial sickness in India has attained a serious dimension. Demonetisation and the implementation of the Goods and Services Tax, coupled with several industrial reforms, have transformed the landscape of the Indian manufacturing industry and rendered the existing industrial sickness prediction models obsolete, accentuating the need for reformulation. The present research therefore concentrates on the development of a new sickness prediction model as well as a comparison between the traditional models and the ANN model. The findings of this study assert the superiority of neural networks over the traditional techniques in the prediction of financial distress. It can also be observed that the model predicting industrial sickness two years before the industrial unit becomes sick yields the highest accuracy of 95%. This can be explained by the fact that a firm's financials display more signs of stress as it moves toward sickness.

The findings of this study have several implications. First, the prediction model will have great applicability in turnaround management. By keeping a regular check on the financial health of the firm through repeated testing for sickness, the top management can intervene immediately when the firm displays a fair probability of turning sick in the near future and take timely decisions for its turnaround. Additionally, the predictive model can prove to be a useful tool for financial institutions in delineating the risk profile of each firm while evaluating loan proposals, assessing risk and making credit decisions. Moreover, stockholders, depositors and other stakeholders in the securities market can assess the risk profile of each stock and make their investment decisions accordingly. One limitation of the current study is that it makes use of only financial accounting variables. A variety of non-financial factors, such as geographic diversification and market segmentation, impact a company's financials, which in turn may facilitate the prediction of sickness. However, this study did not employ these variables in the model because, with a sample of only 52 sick firms, they would entail a further loss of degrees of freedom. Also, based on the results obtained, it is recommended that more research on the development of sickness prediction models be undertaken, and that the models based solely on accounting variables now pave the way for models that include market-based variables as well.

References

1. S. Law, R3 comments on the insolvency statistics for Q1. R3 Press Releases (2010)
2. W. Beaver, Financial ratios as predictors of failures. J. Account. Res. 4(3 Suppl.), 71–102 (1966)
3. E.I. Altman, Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. J. Fin. 23(4), 589–609 (1968)
4. C. Johnson, Ratio analysis and the prediction of failure. J. Fin. 1166–1168 (1970)
5. R. Edmister, An empirical test of financial ratio analysis for small business failure prediction. J. Fin. Quant. Anal. 7(2) (1972)
6. E.I. Altman, Accounting implications of failure prediction. J. Account. Audit. Fin. 4–19 (1982)
7. F.L. Jones, Current techniques in bankruptcy prediction. J. Account. Liter. 131–164 (1987)
8. A. McTear et al., Corporate Insolvency (Cavendish Publishing Limited, London, 2004)
9. M. Marilena, T. Alina, The significance of financial and non-financial information in insolvency risk detection. Procedia Econ. Fin. 750–756 (2015)
10. A.D. Castagna, Z.P. Matolcsy, The prediction of corporate failure: testing the Australian experience. Australian J. Manag. 23–50 (1981)
11. E.I. Altman, Corporate Financial Distress (Wiley, New York, 1983)
12. G. Hanweck, Predicting bank failure. Research Papers in Banking and Financial Economics (1977)
13. M.E. Zmijewski, Methodological issues related to the estimation of financial distress prediction models. J. Account. Res. 59–82 (1984)
14. K. Skogsvik, Current cost accounting ratios as predictors of business failure: the Swedish case. J. Bus. Fin. Account. (1990)
15. J.A. Ohlson, Financial ratios and the probabilistic prediction of bankruptcy. J. Account. Res. 109–131 (1980)
16. C.V. Zavgren, Assessing the vulnerability to failure of American industrial firms: a logistic analysis. J. Bus. Fin. Account. (1985)
17. C.S. Lennox, Are large auditors more accurate than small auditors? Account. Bus. Res. 217–227 (1999)
18. M.D. Odom, R. Sharda, A neural network model for bankruptcy prediction, in International Joint Conference on Neural Networks (IEEE, San Diego, 1990), pp. 163–168
19. L. Medsker, E. Turban, R.R. Trippi, Neural network fundamentals for financial analysts. J. Invest. 59–68
20. R.L. Wilson, R. Sharda, Bankruptcy prediction using neural networks, in Decision Support Systems (1994), pp. 545–557
21. J.E. Boritz, D.B. Kennedy, Predicting corporate failure using a neural network approach, in Intelligent Systems in Accounting, Finance and Management (Wiley, 1995), pp. 95–111
22. K.C. Chung, S.S. Tan, D.K. Holdsworth, Insolvency prediction model using multivariate discriminant analysis and artificial neural network for the finance industry in New Zealand. Int. J. Bus. Manag. 19–28 (2008)
23. J.-C. Cosset, J. Roy, The determinants of country risk ratings. J. Int. Bus. Stud. 135–142 (1991)
24. D. Fletcher, E. Goss, Forecasting with neural networks: an application using bankruptcy data. Inf. Manag. 159–167 (1993)
25. G. Zhang et al., Artificial neural networks in bankruptcy prediction. Eur. J. Oper. Res. 16–32 (1999)
26. R. Nag, Neural Network Applications for Finance (1991)
27. Y. Alici, Neural networks in corporate failure prediction. Neural Netw. Fin. Eng. (1996)
28. D.K. Datta, Industrial sickness in India—an empirical analysis. IIMB Manag. Rev. 104–114 (2013)
29. E. Deakin, A discriminant analysis of predictors of business failure. J. Account. Res. 167–179 (1972)
30. M. Blum, Failing company discriminant analysis. J. Account. Res. 1–25 (1974)
31. R.J. Taffler, The assessment of company solvency and performance using a statistical model. Account. Bus. Res. 295–308 (1983)
32. I.G. Dambolena, S.J. Khoury, Ratio stability and corporate failure. J. Fin. 1017–1026 (1980)
33. O.M. Joy, J.O. Tollefson, On the financial applications of discriminant analysis. J. Fin. Quant. Anal. 723–739 (1975)
34. R. Moyer, Forecasting financial failure: a re-examination. Fin. Manag. 11–17 (1977)
35. M. Hamer, Failure prediction: sensitivity of classification accuracy to alternative. J. Account. Public Policy 287–307 (1983)
36. M. Darayseh, E. Waples, D. Tsoukalas, Corporate failure for manufacturing industries using firms specifics and economic environment with logit analysis. Manag. Fin. 23–37 (2003)
37. J. Kingdon, K. Feldman, Genetic algorithms and applications to finance. Appl. Math. Fin. 89–116 (1995)
38. A.M. Callejón et al., A system of insolvency prediction for industrial companies using a financial alternative model with neural networks. Int. J. Comput. Intell. Syst. 29–37 (2012)
39. V.S. Kaveri, Financial Ratios as Predictors of Borrower's Health: With Special Reference to Small Scale Industries in India (Sultan Chand, New Delhi, 1980)
40. C.D. Bhattacharaya, Discriminant analysis between sick and healthy units. Chartered Accountant, pp. 499–505 (1982)
41. L.C. Gupta, Financial ratios for signalling corporate failure. The Chartered Accountant, pp. 697–707 (1983)
42. K.Y. Tam, M.Y. Kiang, Managerial applications of neural networks: the case of bank failure predictions. Manag. Sci. 926–947 (1992)
43. T. Hosaka, Bankruptcy prediction using imaged financial ratios and convolutional neural networks, in Expert Systems with Applications, vol. 117 (Elsevier, 2019)
44. Z. Gu, L. Gao, A multivariate model for predicting business failures of hospitality firms. Tourism Hosp. Res. Survey 37–49 (2000)
45. P. Scherrer, T. Mathison, Investment strategies for REIT inventories. Real Estate Rev. (1966)
46. D. Ogachi et al., Corporate bankruptcy prediction model, a special focus on listed companies in Kenya. J. Risk Fin. Manag. (2020)
47. C. Baum, D. Schaefer, O. Talavera, The effects of short-term liabilities on profitability: the case of Germany, in Money Macro and Finance (MMF) Research Group Conference (2006)
48. T. Murugesu, Effect of debt on corporate profitability (listed hotel companies, Sri Lanka). European J. Bus. Manag. (2013)
49. H. Habib, F. Khan, M. Wazir, Impact of debt on profitability of firms: evidence from nonfinancial sector of Pakistan. City Univ. Res. J. (2016)
50. H.M.D.N. Somathilake, Factors influencing individual investment decisions in Colombo Stock Exchange. Int. J. Sci. Res. Publ. (IJSRP) 579–585 (2020)
51. S.J. Press, S. Wilson, Choosing between logistic regression and discriminant analysis. J. Am. Stat. Assoc. 699–705 (1978)
52. W.S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. (1943)
53. D.O. Hebb, The Organization of Behavior: A Neuropsychological Theory (Wiley, 1949)

Analysis of Change of Market Value of Bitcoin Using Econometric Approach

Harivansh Gahlot, Irsheen Baveja, Gurjeet Kaur, and Sandra Suresh

Abstract Blockchain not only represents the latest innovation in information and communications technology (ICT) but also lays the foundation stone for numerous economic models characterized by decentralization, privacy, security and transparency. One of the pre-eminent products of this transparent digital database of transactions, claimed to be inviolable, is the cryptocurrency. Cryptocurrency is often equated with Bitcoin, since it is the most convenient and easy to use of all and has consequently gained enormous traction in contemporary economic and business models, radically transforming payment mechanisms for the better. Numerous past studies have examined the underlying reasons behind Bitcoin's value over the years. This study uses econometric models to analyze how various variables at the macro- and micro-level affect the Bitcoin price trend in the long and short term. The objective of this research analysis is not only to aid investors in making prudent investment decisions in Bitcoin by closely monitoring the influential exogenous and endogenous variables but also to structurally understand the working relationship and economic consequences of these variables in the cryptocurrency space. After checking for stationarity of the variables under study, the ARDL model was used, followed by the Bounds F test and the UECM model (where required). It was found that the variables under study explained 99.32% of the variation in BTC market value in the long run; some had a positive relationship and some a negative one, while others were found not to significantly influence its market value.

H. Gahlot · I. Baveja
Shaheed Sukhdev College of Business Studies, Delhi 110085, India
e-mail: [email protected]
I. Baveja
e-mail: [email protected]
G. Kaur (B) · S. Suresh
Assistant Professor, Shaheed Sukhdev College of Business Studies, Delhi 110085, India
e-mail: [email protected]
S. Suresh
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_64


Keywords Blockchain · Bitcoin · Cryptocurrency · Distributed network · Hash rate

1 Introduction

Technology is evolving rapidly. ICT has made the transfer of information from one corner of the world to another a matter of a fraction of a second. An eminent product of this evolution in the context of contemporary business is the blockchain. Blockchain attempts to revolutionize buying and selling paradigms [1, 2] by introducing business models that thrive on data inputted by users across a network [3]. This growing digital system of records, called blocks, involving the processing of millions of bytes of information in seconds, has led to the emergence of digital currency burgeoning in a decentralized system [4] using cryptography, called cryptocurrency. It is believed that the future of currency lies with cryptocurrency because of its foolproof functioning enabled by the technological infrastructure of blockchain.

While cryptocurrency is experiencing gradual market acceptance globally owing to its transactional and storage convenience, Bitcoin, the first decentralized cryptocurrency [5], occupies the lion's share of the market to date. What Satoshi Nakamoto started in 2008 using blockchain has come to dominate and even define the cryptocurrency space, rendering the remaining thousands of cryptocurrencies 'altcoins', short for alternate coins. The reputation Bitcoin has earned alludes to the fallacies peculiar to traditional banking systems, such as lack of unrestricted user autonomy, transfer limits and identity theft [5]. Over the years, Bitcoin has attracted academicians, investors, policy makers and researchers alike, who aim at cutting the Gordian knots concerning the nature of its market value and its functioning in relation to market forces. Hayes' [6] studies point toward Bitcoin being a commodity governed by an underlying economy with cost and revenue structures singular to itself, the equilibrium analysis of which plays a quintessential role in determining its market value.
Studies also postulate that Bitcoin is a function of a network of users interconnected with each other for the purpose of supporting fiat currency costs in an attempt to make digital capital gains and ensure Bitcoin's operativity. However, supplementary studies and applications regard Bitcoin as a currency. It is noted that, with the passage of time, Bitcoin has engendered a new trusted, disintermediated and foolproof method of ensuring storage of value while being used as a medium of exchange. The study focuses on providing a pragmatic viewpoint of the impact of various variables, at both the micro- and macro-level, on the market value inherent in Bitcoin. The objective of this research analysis is not only to aid investors in making prudent investment decisions in Bitcoin by closely monitoring the influential exogenous variables but also to structurally understand the working relationship and economic consequences of these variables in the cryptocurrency space. Since Bitcoin is thriving in the crypto space both in magnitude and scope, it is a worthwhile endeavor to comprehend the factors driving its market value and the reasons behind them.

The paper is divided into the following sections. Section 2 highlights the literature review; Sect. 3 gives the research methodology adopted in the paper; Sect. 4 focuses on data analysis, wherein we take over four years of data on Bitcoin's price and the various exogenous and endogenous factors likely to impact it. Sections 5, 6 and 7 present the conclusion, managerial implications, and limitations with the future scope of the study. The study is targeted with the primary goal of contributing to the study of Bitcoin and its functioning by aiming to

• Explore and understand the various exogenous and endogenous factors, and their economic consequences, determining the trend of the market value of Bitcoin
• Aid investors in making informed investment decisions in Bitcoin
• Understand the scope and breadth of the economics of blockchain technology in building business models such as that of the market of Bitcoin.

2 Literature Review

Considering the traction this technology has gained over the past few years, it goes without saying that several academic researchers have shifted their attention to Bitcoin and, assuming it to be a commodity, its underlying economy. Most of these studies aimed at identifying predictors of future fluctuations in the Bitcoin price. Buchholz et al. [7] provided one of the most significant early studies on the topic, which focused on the number of Bitcoin in circulation, the number of daily transactions in Bitcoin, the total daily volume of transactions in Bitcoin and the number of searches with the query 'Bitcoin' estimated using Google Trends. This was followed by Kristoufek [9] in the following year, who discarded the hypothesis of a real economy underlying Bitcoin and proposed a model to explain the speculative component in Bitcoin price formation, hypothesizing that most of the variation in the price could be explained by investors' interest (both positive and negative). For this analysis he used two proxies: the number of searches made on Google with 'Bitcoin' as a query (as in the previous study) and the number of unique visits to the Wikipedia page on Bitcoin. He concluded that there is a significant bidirectional relationship between the price of a Bitcoin and the number of searches made on both Wikipedia and Google. He also suggested the presence of a feedback mechanism increasing Bitcoin's volatility, pushing the price higher than expected when it was above its recent average and lower than expected in the opposite case. He finally concluded that the value of a Bitcoin is purely based on speculation, assuming a behavior very close to that of a bubble market, thus confirming the perspectives of Buchholz et al. [7] and Cheah and Fry [10].
Ciaian's [11] study aimed at establishing a relationship between the Bitcoin price and some macroeconomic variables, such as the Dow Jones Index, the FTSE 100, the Euro/Dollar exchange rate, the Nikkei 225, the Dollar/Yen exchange rate, and the Brent, WTI and CMCI oil prices. He concluded that there exists a significant relationship between Bitcoin's price and the Dow Jones Index in both the short and the long term, and another, only in the long term, with the Euro/Dollar exchange rate and the WTI oil price. He conflated and analyzed the results of the studies by Buchholz et al. [7] and Kristoufek [9], and thus analyzed Bitcoin's price trend in relation to demand/supply functions (number of Bitcoin in circulation, number of daily transactions, number of unique addresses used per day), investors' interest (the speculative component and the bubble-economy hypothesis, using the number of searches on the Bitcoin Wikipedia page and the number of new subscribers and new posts on bitcointalk) and macroeconomic factors (the Dow Jones Index and the WTI oil price). The analysis confirmed the previous hypotheses regarding the importance of classical fundamentals and investors' interest. On the contrary, no statistically significant relationship was found with the macroeconomic factors used in Ciaian's [11] analysis, which were thus invalidated as predictors. However, unlike Buchholz and Kristoufek, Ciaian et al. concluded that the speculative component of cryptocurrencies was not necessarily leading to the formation of a market bubble but could rather be physiological to the exponential adoption process of a new, potentially disruptive technology such as the blockchain.

Cusumano [12] concluded that Bitcoins are more like a computer-generated commodity than a currency. Researchers examined Bitcoin's functioning as a currency and its efficiency as an investment asset, and suggested that Bitcoin is less useful as a currency but can play an important role in enhancing the efficiency of an investor's portfolio.
Researchers have also studied Bitcoin from a different perspective: they analyzed the characteristics of Bitcoin users and found that computer programming and illegal-activity search terms are positively correlated with interest in Bitcoin, while Libertarian and investment terms are not. Cheah and Fry [10] analyzed Bitcoin from the perspective of an unstable commodity with a speculative-bubble-like economy; their results showed that Bitcoin prices are prone to speculative bubbles and that the fundamental value of Bitcoin is zero. Dyhrberg [13] applied the asymmetric GARCH methodology used in studies of gold to explore Bitcoin's hedging capabilities. He finds that Bitcoin has some of the same hedging abilities as gold and, furthermore, that it can be classified as something between gold and the American dollar on a scale from pure medium-of-exchange advantages to pure store-of-value advantages. This research clearly brings out the confusion over whether Bitcoin is a commodity or a currency.

Finally, Hayes [6] tried to establish a relationship between the price trend of Bitcoin and the underlying economy (if any). As we already know, the operativity of the Bitcoin protocol is ensured by the constant work of a group of users in its ecosystem called miners, who make use of specific hardware devices in order to ensure that the protocol does not accept fraudulent transactions. Miners' work has a cost in terms of time and consumed electricity, and since their job is to ensure the safety of transactions, they are given an emission reward through the protocol itself, which is nothing but the issuance of new Bitcoins for each valid transaction block generated. Considering this mechanism, Hayes [6] developed a linear regression in order to test the relationship between the price of a Bitcoin and the total computational power submitted by miners to the network (the hash rate). The analysis concluded that variation in the hash rate accounted for up to 80% of the price change of a Bitcoin. In the second part of his study, Hayes developed a model to detect the marginal cost of production of a Bitcoin [6]. The model consisted of a marginal cost and a marginal profit function such that, assuming the presence of efficient markets, their ratio should equal the price of a Bitcoin in equilibrium [6, 11].

In 2015, Kristoufek conducted another study focusing on the relationships between Bitcoin's price and five different variables. In addition to those analyzed in his 2013 work, he considered the following variables: a trade–exchange ratio (designed to measure the use of Bitcoin as a currency for 'real' use and not as a commodity), the hash rate and difficulty (proxies for the value of the hardware underlying the infrastructure), the Financial Stress Index (FSI, measuring the degree of financial stress on the global economy) and the price of gold. The study [9] confirmed the relevance of investors' interest (the speculative component of Bitcoin price formation). However, he also came to two other interesting conclusions. The first was that Bitcoin, being unrelated to either the FSI or gold, was not to be considered a 'safe haven' good; the second was that Bitcoin's value seemed not to be determined by pure speculation but supported by the growth of its real economy (the number of goods and services directly available in Bitcoin and the value of the hardware underlying its infrastructure). A further study is linked to this last perspective.
The researchers took into account several variables such as the trade–exchange ratio, Bitcoin's monetary velocity, the total daily volume in Bitcoin, the hash rate, Google searches with 'Bitcoin' as the query, the gold price and the Shanghai Stock Exchange Index (SI). They came to the conclusion that although in the short term the Bitcoin price trend was impacted by several variables (the trade–exchange ratio, the total daily volume in Bitcoin and the SI), in the long run its price seemed to be influenced above all by the hash rate alone. Finally, Balcilar et al. [14] tried to establish a relationship between the price trend of Bitcoin and the volume transacted daily on the main exchanges using a non-parametric causality-in-quantiles test. However, the results showed that in no period of the observed conditional distribution did the volume analysis allow statistical predictions of Bitcoin's future prices. In 2017, Zhu et al. attempted to analyze the factors influencing the Bitcoin price using the factors that influence the price of gold, which is a commodity; they thus seem to corroborate Hayes' [6] view that Bitcoin is a commodity. They built a VEC model, selected seven variables and used the ARDL bounds testing method to analyze the long-run relationships among the variables, and then used a VEC Granger causality test to look for causal relationships between their variables. They concluded that Bitcoin, in the long run, is not stable and hence cannot be considered a safe haven.

Throughout the evolution of Bitcoin, there has always been a dilemma regarding whether it is a commodity or a currency. While we all call it 'cryptocurrency', some researchers vehemently opposed accepting it as a currency [15]. However, the reason why it was christened a 'currency' was its immense 'store of value' function. This is because the technology is nearly impossible to hack, owing to reasons like the following. Reverse engineering a cryptocurrency is not possible: one might think that, since the input commands are encrypted into a series of different combinations of 1s and 0s (binary), one could easily backtrack from the output combination to the input command and hence hack it. However, encryption in a cryptocurrency takes place with the robust technology of crypto hashing [16] (Fig. 1), which makes it impossible to reverse engineer, since the combination of 1s and 0s for the same input command changes every time the command is typed [17].

Fig. 1 Technology of crypto hashing

Digital signatures cannot be forged, owing to the public key (abbreviated pk) and private key (or secret key, abbreviated sk) pair. Unlike a physical signature, a digital signature changes for different messages. This is possible due to the function sign(message, sk), which makes the signature dependent on both the message and the private key, and hence unique to each message. The message is then verified using the function verify(message, signature, pk) = True/False. This is where the public key comes into play: all it does is output true or false depending on whether the signature was produced by the private key associated with the public key used for verification. Thus, digital signatures (Fig. 2) cannot be forged unless someone knows the 'secret key', which is unique to every user. Unique ID for each transaction: even though signatures cannot be forged, one could copy and paste the same signed line (representing a transaction) multiple times, leading to fraudulent transactions, since the message–signature combination remains valid. To get around this, every time a user signs a transaction to approve it, a unique ID is assigned to that transaction.
That way, even if the exact same transaction takes place multiple times, a different unique ID is assigned to it making it a unique transaction which cannot be copied.
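The sign/verify mechanics above can be sketched in a few lines of Python. SHA-256 via hashlib is the real hashing primitive; the key pair below is a deliberately toy textbook RSA pair (Bitcoin actually uses ECDSA over secp256k1), used only to show the sign(message, sk) / verify(message, signature, pk) shape, and the txid baked into the message illustrates the unique-ID idea:

```python
import hashlib

def h(message: bytes) -> int:
    """SHA-256 digest as an integer (one-way: no feasible reverse)."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big")

# Toy RSA key pair for illustration only (real keys are 2048+ bits).
p, q = 104_729, 1_299_709            # well-known primes, far too small for security
n = p * q
pk = 17                              # public exponent
sk = pow(pk, -1, (p - 1) * (q - 1))  # private exponent

def sign(message: bytes, sk: int) -> int:
    # The signature depends on both the message hash and the secret key.
    return pow(h(message) % n, sk, n)

def verify(message: bytes, signature: int, pk: int) -> bool:
    # Anyone holding the public key can check the signature.
    return pow(signature, pk, n) == h(message) % n

# The unique transaction ID is part of the signed message, so replaying
# the "same" transfer under a new ID would need a fresh signature.
tx = b"Alice pays Bob 1 BTC | txid=0001"
sig = sign(tx, sk)
```

Here verify(tx, sig, pk) succeeds, while checking the same signature against a message with a different txid fails, which is exactly the replay protection described above.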


Fig. 2 Uniqueness of digital signatures

However, the research conducted by Hayes resolved the dilemma, and cryptocurrency was deemed a 'commodity': his model demonstrated an underlying economy and a production mechanism similar to that of any commodity. Thus, even though cryptocurrency has one phenomenal feature of a currency, it actually is not one. Finally, a study by researchers of the SDA Bocconi School of Management took Hayes' research a step further in examining whether Bitcoin can be termed a commodity. For this, it assumed that Bitcoin relies on a network of users willing to bear actual fiat currency costs and receive digital currency profits in order to ensure its operativity, similar to the production structure of a classical industrial good. The study therefore aimed at establishing a relationship between the actual price trend of Bitcoin and the price trend estimated assuming the existence of such a structure. It concluded that there was a significant relationship between the two; hence there is an underlying infrastructure, and Bitcoin can safely be christened an economic commodity. Based on these studies, the stance of Hayes and the researchers of the Bocconi School has been validated in this study. However, the focus of this paper is on simplifying the investment decision for investors by providing them more pragmatic knowledge of investing in Bitcoin. This study adds to the existing literature by helping investors make more rational investment decisions by looking at the price trends of the variables that impact the Bitcoin price and then estimating the possible impact through the relationships established in this paper.
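Hayes' marginal-cost reasoning discussed above can be illustrated with a back-of-the-envelope calculation: a miner's breakeven price is the daily electricity cost divided by the expected daily BTC output, and under the efficient-market assumption the market price should gravitate toward this cost. Every figure below (rig hash rate, network hash rate, power draw, tariff, block reward) is assumed purely for illustration:

```python
# Back-of-the-envelope sketch of the marginal-cost logic: breakeven price
# = daily electricity cost / expected BTC produced per day.

BLOCK_REWARD_BTC = 6.25   # assumed protocol emission per block (era-dependent)
BLOCKS_PER_DAY = 144      # ~one block every 10 minutes

def breakeven_price(miner_hashrate, network_hashrate,
                    power_kw, electricity_usd_per_kwh):
    """Breakeven USD price of one BTC for a single miner."""
    daily_cost = power_kw * 24 * electricity_usd_per_kwh
    network_share = miner_hashrate / network_hashrate   # expected share of blocks
    daily_btc = network_share * BLOCK_REWARD_BTC * BLOCKS_PER_DAY
    return daily_cost / daily_btc

price_star = breakeven_price(
    miner_hashrate=100e12,        # a 100 TH/s rig (assumed)
    network_hashrate=150e18,      # a 150 EH/s network (assumed)
    power_kw=3.0,                 # assumed rig power draw
    electricity_usd_per_kwh=0.05, # assumed tariff
)
```

With these assumed inputs the breakeven works out to about $6,000 per BTC; if the market price sits well above the typical miner's breakeven, the efficient-market argument predicts more hash power entering until the gap closes.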

3 Research Methodology

The study makes use of an experimental methodology to uncover and determine the factors affecting the market value of Bitcoin exogenously and endogenously, with respect to the degree and direction of their influence in the long and short run. The methodology is described in the following steps and illustrated with the help of the diagram shown in Fig. 3.

Around the time of the inception of blockchain technology, the speculative component [10, 18] of its demand was given paramount importance, and investors' interest (be it positive or negative) was considered the best and only yardstick for determining the trend of the market value of Bitcoin. This led to the popular understanding of cryptocurrencies as a very unstable market, susceptible to the creation of market bubbles. Even so, contemporary studies invalidated this stance and focused on other variables responsible for its price trend determination. These studies concluded that there is, more or less, a solid economy with a robust infrastructure mechanism similar to that of any production good. They were more comprehensive in that they considered technical components, such as the costs incurred by players in the Bitcoin ecosystem to sustain the protocol and ensure its regular growth and development, and the revenues earned by these members, and then formulated an equilibrium analysis. Hayes' [6] work, the research by the Bocconi School of Management and Zhu et al. [19] served as the most relevant sources of inspiration for this paper. The study has also taken cues from other research in that it analyzes various independent variables affecting the Bitcoin price, which is treated as a commodity, in order to increase the efficiency of an investor in investing. Though Hayes' statistical results were spurious, since his analysis involved a linear regression between two nonstationary historical series, his intuition regarding the estimation of the market equilibrium monetary value of a cryptocurrency by comparing profits with the marginal costs of the underlying infrastructure carries verisimilitude.
Extrapolating along these lines, researchers from Bocconi developed an equilibrium analysis wherein they estimated Bitcoin prices using an equilibrium model and compared them with the actual prices. Their results showed a significant relationship between the two, since the equilibrium price trend closely resembled the actual price trend during the period observed. This established Bitcoin as a commodity with an underlying production mechanism. This study extends their analysis, holding that there exists sufficient evidence that the results found by the Bocconi School researchers were statistically valid. Hence, we assume Bitcoin to be a commodity [20]. However, our study aims to establish a more pragmatic approach toward investment in this revolutionary technology's product, by systematically reducing the knowledge gap pervasive in the domain of investment in Bitcoin and in related studies by researchers and academicians. The study seeks to enhance the know-how of the factors that play a major role in driving the market value of Bitcoin in the long and short run. Insights from these studies reveal that Bitcoin's inherent value stems from the intersection of demand and supply, that is, of marginal revenue (the addition made to total revenue when an additional unit of Bitcoin is sold) and marginal cost (the cost of mining an additional unit of Bitcoin), as explained by the theory of supply. It is imperative to note that it is assumed that perfect knowledge of Bitcoin and its economics

Analysis of Change of Market Value of Bitcoin Using …

805

Fig. 3 Process flow diagram of the methodology used to understand the degree and direction of impact of factors influencing the market value of BTC. The steps shown are:

1. Collection of data from Blockchain.com records for 4 years and 1 month (2017–2021) pertaining to Bitcoin closing market value, miners revenue, hash rate, Dow Jones Industrial Average (adjusted closing price), block fee, daily blocks added to the blockchain, daily output volume, price volatility, WTI oil price and daily number of transactions.
2. Categorizing the aforementioned variables into two segments, micro and macro, for the purpose of analyzing the impact of each of these on the trend of the market value of Bitcoin, in operational synergy with the other variables belonging to the same category.
3. Checking the level of stationarity of the time series data collected, to decide on the appropriate statistical model (augmented Dickey–Fuller test).
4. Running the autoregressive distributed lag model to comment on the degree and direction of influence of the various exogenous and endogenous variables on the trend of the market value of Bitcoin in the long run, as against the standard levels of significance.
5. Backed by empirical analysis, understanding the reasons behind the significance of the impact of each of the variables taken under consideration on the trend of the market value of Bitcoin in the short and long run, in light of surfacing the significant variables.
6. Running the Bounds F test to check whether an unrestricted error correction model needs to be performed, in case the variables under study exhibit a significant cointegration relationship amongst each other.


does not exist among the market participants. Thus, the equilibrium analysis of such marginal revenue and marginal cost helps determine the market value of Bitcoin in the long run. However, the exogenous and endogenous factors at both the micro- and macro-level which drive the market value of Bitcoin remain a relatively untapped realm of study, and an understanding of them can dramatically influence investment decisions concerning Bitcoin and future studies [18]. We have addressed the analysis of the magnitude of influence of exogenous and endogenous variables on the market value of Bitcoin using the following methodology:

1. Secondary data collection
2. Analysis of macro-variables using the ARDL and UECM models
3. Analysis of micro-variables using the ARDL model (the UECM was not used, since the Bounds F test indicated it was not required).

3.1 Data Sources

The Web sites we referred to for the purpose of conducting the study were coincodex, bitinfocharts, growwin blogs about cryptocurrency, the ychart indicator for the Bitcoin network hash rate, and blockchain.com, from which we obtained historical time series variables across a period of four years and one month. The macro time series variables are the Bitcoin price, whose movement we aim to monitor, and the influential factors Dow Jones Industrial Average, daily output volume, Bitcoin price volatility and the WTI Oil Index. The micro-variables include miners revenue, hash rate, number of Bitcoin transactions, daily blocks added to the blockchain and block fee.
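The micro/macro grouping described above can be written down explicitly. The dictionary below is our own illustrative sketch (the variable names follow the paper; the data structure itself is not from the authors):

```python
# Illustrative grouping of the study's variables into the two segments.
VARIABLE_SEGMENTS = {
    "macro": [
        "Bitcoin closing price", "DJIA", "WTI",
        "daily output volume", "Bitcoin price volatility", "miners revenue",
    ],
    "micro": [
        "hash rate", "daily number of transactions",
        "daily blocks added to blockchain", "block fee", "Bitcoin closing price",
    ],
}

def segment_of(variable):
    """Return which segment(s) a variable belongs to."""
    return [seg for seg, names in VARIABLE_SEGMENTS.items() if variable in names]
```

The Bitcoin closing price appears in both segments because it is the dependent variable of both models.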

4 Data Analysis

4.1 Micro–Macro Model Decomposition—Empirical Analysis

Firstly, the stationarity of the given set of variables at both the micro- and macro-level was checked using the augmented Dickey–Fuller (ADF) test. The null hypothesis of the test states that the data under consideration are non-stationary, i.e., that a unit root exists. On the basis of the stationarity results of the ADF test, an appropriate econometric model is chosen. The usual econometric approaches such as OLS regression and VAR were not used, as they apply to stationary data; likewise, the VECM and the Johansen test of cointegration were not used, as they apply to non-stationary data [21]. The OLS and VAR models, together with unit root tests, are used for the analysis of stationary time series, whereas the


Johansen test, followed by a cointegration test if necessitated, is used for the analysis of non-stationary time series [22]. The study uses the autoregressive distributed lag (ARDL) model, as the variables considered in each of the models are a mix of stationary and non-stationary variables. The model helped in disentangling long-run relationships from the short-run dynamics, so as to learn the significance of the impact of the independent variables on the Bitcoin price in the short term and long term. This is followed by the Bounds F test, to test the level of cointegration between the independent variables and the Bitcoin price on a short-run and long-run classification. If the results of the Bounds F test indicate the presence of cointegration, the unrestricted error correction model (UECM) is run to study the long- and short-run coefficients of cointegration between the variables, if any [23, 24]. The model specification of an unrestricted ECM conditional on an ARDL(p, q1, …, qk) is

Δy_t = c_0 + c_1·t − α(y_{t−1} − θ x_{t−1}) + Σ_{i=1}^{p−1} ψ_{yi} Δy_{t−i} + ω′ Δx_t + Σ_{i=1}^{q−1} ψ_{xi} Δx_{t−i} + u_t    (1)
Here, θ represents the long-run coefficients, which capture the equilibrium effects of the independent variables on the dependent variable. The short-run coefficients ψ_{yi}, ψ_{xi} and ω account for short-run fluctuations that are not due to deviations from the long-run equilibrium.
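To make the roles of the terms in Eq. (1) concrete, the sketch below evaluates Δy_t for given coefficient values. The function name and the coefficients used in the usage note are ours, chosen purely for illustration; they are not the fitted values reported later in Exhibits 2 and 3.

```python
def uecm_delta_y(t, y, x, c0, c1, alpha, theta, psi_y, omega, psi_x):
    """Evaluate the right-hand side of Eq. (1) at time t.

    y, x   : lists of levels of the dependent/independent series
    alpha  : speed-of-adjustment on the error-correction term
    theta  : long-run coefficient
    psi_y  : short-run coefficients on lagged dy (length p-1)
    psi_x  : short-run coefficients on lagged dx (length q-1)
    """
    ect = y[t - 1] - theta * x[t - 1]                  # error-correction term
    out = c0 + c1 * t - alpha * ect
    out += sum(psi_y[i - 1] * (y[t - i] - y[t - i - 1])
               for i in range(1, len(psi_y) + 1))      # lagged dy terms
    out += omega * (x[t] - x[t - 1])                   # contemporaneous dx
    out += sum(psi_x[i - 1] * (x[t - i] - x[t - i - 1])
               for i in range(1, len(psi_x) + 1))      # lagged dx terms
    return out
```

A negative α pulls y back toward the long-run relation y = θx whenever the error-correction term is nonzero, which is exactly the mechanism the Bounds F test screens for.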

4.2 Analysis of Macro-variables

The variables taken into consideration are the Bitcoin closing price, Dow Jones Industrial Average (DJIA), WTI, Bitcoin miners revenue, Bitcoin price volatility and daily output volume. The DJIA has been analyzed considering the putative usage of Bitcoin as a commodity for making capital gains: as an economic barometer of the country, the DJIA can be used to reflect the general investment and growth sentiment. A further variable under study is WTI, one of the prominent global oil benchmarks that drive investment decisions. Miners revenue is a relatively complex variable to measure over a period of time, given that it is composed of components that implicitly reflect the demand for Bitcoin, such as the transaction fee and the block reward. Miners revenue is also an essential aspect, since the equilibrium analysis conducted by the researchers at the Bocconi School of Management (Milan, Italy) to validate the assumption that Bitcoin is a commodity required it on the right-hand side, as it represents the marginal profit to be equated with the marginal cost of the underlying infrastructure. Therefore, since it affects the underlying economy of Bitcoin, it is chosen as a macro-variable that affects its price.


Miners revenue (calculated using Eq. 2) comprises the block reward (BR), block time (BT) and fees, where BR is the emission reward given to miners for new units of cryptocurrency per block, that is, the reward that miners get for listening for transactions, creating blocks and broadcasting them to the network [25, 26]. From the miner's perspective, each block is like a miniature lottery in which miners generate conjectured numbers as fast as they can until one fortunate miner finds the right combination that constitutes the proof of work, which, when coupled with all the listed transactions, makes a block. BT is the time required to generate one block, and the block fee is the remuneration received for each transaction placed in a block. Given the progressive attrition of units generated by blocks (Bitcoin, like many cryptocurrencies, has deflationary tendencies), with the passage of time the validation work is increasingly reimbursed through fees. Daily blocks, when multiplied by the block reward, give us a part of miners revenue. Fees are currently around 2–5% of the total revenue value, and the block time for Bitcoin is 10 min per block on average.

MINERS REVENUE_BTC = BR_BTC × (3600 s/h × 24 h/day) / BT_s + FEES_BTC    (2)
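Eq. (2) reads as: blocks mined per day (seconds in a day divided by the block time) times the block reward, plus fees. A small sketch under that reading; the function name is ours, and the 6.25 BTC reward, 600 s block time and 50 BTC fee figure in the comment are illustrative inputs, not values from the paper:

```python
def miners_revenue_btc(block_reward_btc, block_time_s, fees_btc):
    """Daily miners revenue per Eq. (2): BR * (seconds per day / BT) + fees."""
    blocks_per_day = (3600 * 24) / block_time_s  # e.g. 144 blocks at BT = 600 s
    return block_reward_btc * blocks_per_day + fees_btc

# miners_revenue_btc(6.25, 600, 50) -> 6.25 * 144 + 50 = 950.0 BTC/day
```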

Source: https://bitcoin.org/.
The Bitcoin price volatility (in USD) on a daily basis has also been considered, for the variance in the price of a commodity changes the influx of investment drawn toward it. Further, the daily output volume is the total value of all transaction outputs per day. Just as investor sentiment in the stock market is affected by trading volume, output volume in cryptocurrencies holds a similar relationship with their price [14]. It therefore became an important macro-variable to be incorporated in the study.

Augmented Dickey–Fuller Test (ADF Test)

On checking stationarity, daily output volume, miners revenue and Bitcoin price volatility were found to be stationary, while the Bitcoin price, DJIA and WTI were found to be non-stationary. Refer to Exhibit 1.

Auto Regressive Distributed Lag Model (ARDL)

At the 10% level of significance, miners revenue and the price volatility of Bitcoin affect the price of Bitcoin negatively and positively, respectively, whereas at the 0.1%

Exhibit 1 Augmented Dickey–Fuller test

Variable            | p-value | Result
DJIA                | 0.08874 | Non-stationary
WTI                 | 0.6477  | Non-stationary
Daily output volume | 0.01    | Stationary
Miners revenue      | 0.01    | Stationary
Price volatility    | 0.01    | Stationary
Bitcoin price       | 0.955   | Non-stationary

level of significance, the value of the DJIA affects the price of Bitcoin negatively. It is also noted that at the 5% level of significance, the previous value of Bitcoin at lag 3, the DJIA and miners revenue positively impact the market value of Bitcoin. The adjusted R-squared stands at 99.32%, which indicates that 99.32% of the variability in the Bitcoin price is explained by the concerned variables. Refer to Exhibit 2.

Bounds F Test

On running the test, the p-value of the F statistic came out to be 0.01754, which is less than the significance level of 0.05, prompting the null hypothesis to be rejected in favor of the alternate hypothesis that a cointegration relationship exists between the independent variables and the Bitcoin market value. As discussed, this test replaced the need for undertaking the Johansen and VECM tests, owing to the nature of stationarity of the time series taken under study.

Unrestricted Error Correction Model (UECM)

The results of the Bounds F test indicated four cointegrating relationships. Since there are possible cointegration relationships between the independent macro-variables and the Bitcoin price, an unrestricted error correction model (UECM) was run in order to find the significant long-run and short-run coefficients of cointegration (we ran the UECM model on RStudio, which used Eq. 1). Refer to Exhibit 3. On running the model and analyzing the results, it was found that the Bitcoin price, output volume, the DJIA and miners revenue had a long-run relationship with the Bitcoin price, while miners revenue and the DJIA had a short-run impact on the Bitcoin price. Thus, the DJIA and miners revenue impact the market value of Bitcoin in both the short and long run.
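The significance markers in Exhibits 2, 3 and 5 follow the standard R-style codes listed beneath each table. A minimal helper reproducing that mapping (our own sketch, not code from the authors):

```python
def signif_code(p_value):
    """Map a p-value to the R-style significance code used in the exhibits."""
    for threshold, code in ((0.001, "***"), (0.01, "**"), (0.05, "*"), (0.1, ".")):
        if p_value < threshold:
            return code
    return " "  # not significant even at the 10% level
```

For example, the p-values 0.000872, 0.039257 and 0.073389 reported for the DJIA and miners-revenue lags map to '***', '*' and '.' respectively.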

4.3 Analysis of Micro-variables

We commence the synthesis by giving the rationale behind choosing each variable for the study. The premise behind taking the specific variables at both the micro- and macro-level is explained further below. The study dwells on the foundation that Bitcoin


Exhibit 2 ARDL model

Regressor            | Coefficient | Standard error | T-ratio | Probability
L(BTC Price, 1)      | 9.956e−01   | 2.619e−02      | 38.016  | < 2e−16*
L(BTC Price, 2)      | −4.294e−02  | 3.701e−02      | −1.160  | 0.246210
L(BTC Price, 3)      | 5.471e−02   | 2.652e−02      | 2.063   | 0.039257*
Output volume        | 8.706e−06   | 1.458e−05      | 0.597   | 0.550436
L(Output Volume, 1)  | 1.070e−05   | 1.778e−05      | 0.602   | 0.547252
L(Output Volume, 2)  | 3.832e−06   | 1.807e−05      | 0.212   | 0.832091
L(Output Volume, 3)  | 1.159e−05   | 1.808e−05      | 0.641   | 0.521866
L(Output Volume, 4)  | 4.868e−06   | 1.783e−05      | 0.273   | 0.784836
L(Output Volume, 5)  | −9.961e−06  | 1.461e−05      | −0.682  | 0.495425
DJIA                 | 2.853e−02   | 4.372e−02      | 0.653   | 0.514168
L(DJIA, 1)           | 1.339e−01   | 5.478e−02      | 2.445   | 0.014618*
L(DJIA, 2)           | −1.465e−01  | 4.390e−02      | −3.336  | 0.000872*
Miners revenue       | −8.255e−06  | 7.436e−06      | −1.110  | 0.267100
L(Miners revenue, 1) | −7.759e−06  | 8.517e−06      | −0.911  | 0.362421
L(Miners revenue, 2) | −1.191e−05  | 8.586e−06      | −1.387  | 0.165655
L(Miners revenue, 3) | −1.542e−05  | 8.608e−06      | −1.792  | 0.073389
L(Miners revenue, 4) | 1.829e−05   | 8.534e−06      | 2.144   | 0.032233*
L(Miners revenue, 5) | 1.578e−05   | 7.434e−06      | 2.122   | 0.033981*
WTI                  | −2.319e−01  | 8.049e+00      | −0.029  | 0.977015
L(WTI, 1)            | −7.78e+00   | 1.038e+01      | −0.750  | 0.453471
L(WTI, 2)            | −4.33e+00   | 1.042e+01      | −0.416  | 0.677641
L(WTI, 3)            | 1.226e+01   | 8.092e+00      | 1.515   | 0.130036
Price volatility     | 9.814e−01   | 5.506e−01      | 1.783   | 0.074873

Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1, ' ' 1

is a commodity. Having claimed that, the existence of an underlying economy and a significant production system is corroborated. The presence of this 'infrastructure' has been confirmed by Hayes' and the Bocconi School's studies. Considering this, variables such as the hash rate, number of daily transactions, daily blocks added to the blockchain, next block fee in USD and the closing price (USD) of Bitcoin have been taken. The network hash rate (GH/s) is one of the most crucial aspects, since it is a stock with a net inflow of hash rate that represents the amount of computational strength added daily to, or subtracted from, the ecosystem. At time zero, the hash rate is set to 0.007 GH/s, according to an estimation of the original computational strength available to Satoshi Nakamoto in the initial days of Bitcoin. Using the hash rate, the number of daily hashes (GH/day) computed by the network is estimated. The mining cost is estimated from the hashing cost, which is expressed in $/GH and given


Exhibit 3 Unrestricted error correction model

Regressor               | Coefficient | Standard error | T-ratio | Prob.
L(BTC Price, 1)         | 7.374e−03   | 2.175e+02      | 1.640   | 0.101201
L(Output, 1)            | 2.973e−05   | 4.496e−03      | 1.927   | 0.054210
L(DJIA, 1)              | 1.598e−02   | 1.543e−05      | 1.917   | 0.055442
L(Miners revenue, 1)    | −9.277e−06  | 2.647e−06      | −3.505  | 0.000471***
L(WTI, 1)               | −9.130e−02  | 6.684e−01      | −0.137  | 0.891381
Price volatility        | 9.814e−01   | 5.506e−01      | 1.783   | 0.074873
d(L(Bitcoin price, 1))  | −1.177e−02  | 2.661e−02      | −0.442  | 0.658290
d(L(Bitcoin price, 2))  | −5.471e−02  | 2.652e−02      | −2.063  | 0.039257*
d(Output Volume)        | 8.706e−06   | 1.458e−05      | 0.597   | 0.550436
d(L(Output Volume, 1))  | −1.032e−05  | 1.741e−05      | −0.593  | 0.553371
d(L(Output Volume, 2))  | −6.491e−06  | 1.677e−05      | 0.387   | 0.698751
d(L(Output Volume, 3))  | 5.093e−06   | 1.513e−05      | 0.337   | 0.736438
d(L(Output Volume, 4))  | 9.961e−06   | 1.461e−05      | 0.682   | 0.495425
d(DJIA)                 | 2.853e−02   | 4.372e−02      | 0.653   | 0.514168
d(L(DJIA, 1))           | 2.853e−02   | 4.390e−02      | 3.336   | 0.000872***
d(Miners revenue)       | −8.255e−06  | 7.436e−06      | −1.110  | 0.267100
d(L(Miners revenue, 1)) | −6.737e−06  | 8.068e−06      | −0.835  | 0.403818
d(L(Miners revenue, 2)) | −1.865e−05  | 8.108e−06      | 2.300   | 0.021604*
d(L(Miners revenue, 3)) | −3.407e−05  | 8.049e−06      | −4.233  | 2.45e−05***
d(L(Miners revenue, 4)) | −1.578e−05  | 7.434e−06      | −2.122  | 0.033981*
d(WTI)                  | −2.319e−01  | 8.049e+00      | −0.029  | 0.977015
d(L(WTI, 1))            | −7.925e+00  | 8.191e+00      | −0.967  | 0.333479
d(L(WTI, 2))            | −1.226e+01  | 8.092e+00      | −1.515  | 0.130036

Signif. codes: '***' 0.001, '**' 0.01, '*' 0.05, '.' 0.1, ' ' 1

by the quotient between the price of energy ($/J) and the energy efficiency of state-of-the-art mining hardware (GH/J). Thus, the network hash rate helps in determining the cost of the underlying infrastructure. Mining profit is calculated by subtracting the mining costs from the revenues. This determines the hash rate shortfall, which indicates how much hashing power could be added before the marginal cost of adding more exceeds the benefit (a situation of zero profit):

Hash Rate Shortfall = (Mining Profit / Cost of Energy) × Energy Efficiency of Mining Hardware

A miner would keep mining only until marginal cost (MC) = marginal revenue (MR), since beyond this point he would exit the market, as his costs from mining are


exceeding the benefits, and he would thus have no incentive to mine further. This is represented by a hash rate shortfall, which is thereby also responsible for regulating the number of miners who compete for profits in the market. The number of transactions represents the level of trust and credibility people place in Bitcoin. Since the price of Bitcoin is heavily swayed by its speculative component (as highlighted by Kristoufek [9] and Ciaian et al. [11]), these parameters are of utmost importance, and this paved their way into the study.
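The hashing-cost and hash-rate-shortfall relations above can be sketched directly; the function names and the sample numbers in the test are ours, chosen only to make the units explicit ($/J divided by GH/J gives $/GH, and profit divided by energy price times efficiency gives hashing power):

```python
def hashing_cost_usd_per_gh(energy_price_usd_per_j, efficiency_gh_per_j):
    """Hashing cost ($/GH) = energy price ($/J) / hardware efficiency (GH/J)."""
    return energy_price_usd_per_j / efficiency_gh_per_j

def hash_rate_shortfall(mining_profit_usd, energy_price_usd_per_j, efficiency_gh_per_j):
    """Hashing power that could still be added before profit reaches zero."""
    return (mining_profit_usd / energy_price_usd_per_j) * efficiency_gh_per_j
```

A positive shortfall signals room for new miners to enter; at zero, MC = MR and entry stops.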

Source: https://bitcoin.org/. Only the header data is hashed during mining.


Source: https://bitcoin.org/.
Bitcoin's daily closing price data, in US dollars, has been taken over a period of a little over three years, starting January 21, 2017, since it is the dependent variable under consideration on which the impact of the various aforementioned independent variables is studied.

Checking Stationarity

The results revealed that the daily block fee and the number of blocks added to the blockchain daily are stationary time series, whereas the closing price of Bitcoin, the hash rate and the daily number of Bitcoin transactions are non-stationary time series (refer to Exhibit 4). This set of variables, consisting of both stationary and non-stationary time series, compelled us to carry out the ARDL test for the micro-model as well. The variables were further standardized by converting them to their z-scores, as they differed largely in scale.

Auto Regressive Distributed Lag Model (ARDL)

The model revealed that at the 10% level of significance there exists a significant positive impact of the daily number of transactions and the hash rate on the Bitcoin price in the long term. At the 0.1% level of significance, the previous day's value of Bitcoin positively impacts the long-run market value of Bitcoin.

Exhibit 4 Augmented Dickey–Fuller test

Variable                         | p-value | Result
Daily count of transactions      | 0.9626  | Non-stationary
Hash rate                        | 0.1961  | Non-stationary
Daily block fee                  | 0.01    | Stationary
Daily blocks added to blockchain | 0.01    | Stationary
Bitcoin price                    | 0.955   | Non-stationary
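The stationarity classifications in Exhibits 1 and 4 follow mechanically from comparing each ADF p-value to the chosen significance level (5% here): the ADF null hypothesis is a unit root, so rejecting it means the series is stationary. A minimal helper (our own sketch; the p-values below are those reported in the exhibits):

```python
def adf_classification(p_value, alpha=0.05):
    """ADF null = unit root (non-stationary); reject it when p < alpha."""
    return "Stationary" if p_value < alpha else "Non-stationary"

# Exhibit 4 values: transactions 0.9626, hash rate 0.1961,
# block fee 0.01, daily blocks 0.01, Bitcoin price 0.955.
```

Note that the DJIA (p = 0.08874 in Exhibit 1) is classified non-stationary at the 5% level even though it would be borderline at 10%.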


Exhibit 5 ARDL model

Regressor                        | Coefficient | Standard error | T-ratio | Probability
L(Bitcoin price, 1)              | 0.9999928   | 0.0030591      | 326.891 | < 2e−16***
Hash rate                        | −0.1636536  | 0.0878594      | −1.863  | 0.0627
L(hash rate, 1)                  | 0.1492663   | 0.0872721      | 1.710   | 0.0874
Blocks added to blockchain daily | 0.0026923   | 0.0026622      | 1.011   | 0.3120
Block fee                        | 0.0005642   | 0.0024814      | 0.227   | 0.8202
Daily number of transactions     | 0.0189130   | 0.0107508      | 1.759   | 0.0788

Signif. codes: '***' 0.001

The value of the adjusted R-squared is good at 99.31%, which indicates that 99.31% of the variability in the Bitcoin price is explained by the concerned variables. Refer to Exhibit 5.

Bounds F Test

The p-value of the F statistic was 0.8225, significantly greater than the 5% alpha, so the null hypothesis of no cointegration could not be rejected; that is, no cointegration relationship exists among the variables. Having thus statistically established the absence of a cointegration relationship, the UECM was not run for this set of variables.
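The decision rule applied in both the macro and micro models, run the UECM only when the Bounds F test rejects its null of no cointegration, can be written down directly (our own sketch; the p-values in the comment are those reported in the text):

```python
def uecm_required(bounds_p_value, alpha=0.05):
    """True when the Bounds F test rejects 'no cointegration' at level alpha."""
    return bounds_p_value < alpha

# Macro model: p = 0.01754 -> UECM was run.
# Micro model: p = 0.8225  -> UECM was skipped.
```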

5 Conclusion

Technology is changing rapidly. Blockchain technology has gradually been surfacing in different countries in different paradigms, all of which converge toward the creation of new and better economic cooperation [27]. This economic system is characterized by greater decentralization coupled with security, transparency and privacy, wherein a trustless system of information sharing is engendered without the involvement of third parties, so that the fear of accountability being jeopardized, begetting financial risk, is avoided. To date, cryptocurrencies are the most pre-eminent products of this technology, given their benefits over conventional forms of money. Following Hayes' and the Bocconi School's perspective, we corroborate the presence of an underlying infrastructure similar to that of any other production good, deeming cryptocurrency a commodity. To assert that Bitcoin is the new 'digital gold' is not out of place, theoretically or practically. While the focus of the majority of prior research has been on identifying predictors of future price fluctuations of Bitcoin [28], this study aims at equipping investors with more pragmatic knowledge of the inter-relationships between various variables and the Bitcoin price, so that they can make better investment decisions with greater ease. While the Bocconi School and other historical studies concerning the market value of Bitcoin have made use of the VAR, Johansen and VECM models, this study has relied


on the ARDL model of statistical analysis to study the impact of different variables, at the microscopic and macroscopic level, on the price trend of Bitcoin in the short and long term. In a bid to aid investors in making prudent investment decisions, and academicians and researchers alike in exploring the intricacies of the crypto space, on the basis of the statistical analysis involved in the study we recommend considering and gauging the changes taking place over time in the Dow Jones Industrial Average adjusted closing value. In the long term, the DJIA positively affects the market value of Bitcoin, corroborating the hypothesis of Bitcoin functioning as a commodity. Further, the positive trend in the Bitcoin price due to its historical values reflects the possibility of indexing the demand for Bitcoin to its historical prices to some extent. Miners revenue also positively impacts the market value of Bitcoin in the long run; it is composed of variables that implicitly reflect the demand for Bitcoin, such as the block reward and the transaction fee. The value of the transaction fee paid to the miner is often at the discretion of the one initializing the transaction, and psychologically tends to be directly proportional to the volume of Bitcoin traded through the given transaction. The analysis in the study also reveals that the daily number of transactions helps explain the market value of Bitcoin directly, implicitly capturing the demand for Bitcoin as measured by the daily number of transactions. It is thus fair to say that if one observes a large number of daily Bitcoin transactions, it is prudent to invest in the cryptocurrency to make capital gains. The hash rate, construed as the measuring unit of the processing strength of the Bitcoin network, also significantly and positively impacts the long-term market value of Bitcoin.
The more computational power the network has, the less time it takes for miners to find the 256-bit hash required to produce a new block in the existing blockchain [29]. In order to keep the time lag between the creation of subsequent blocks constant, the complexity of cracking the code increases as more and more miners work on generating blocks. The number of miners working on generating a block is thus, proportionately, a telling indicator of the general demand for Bitcoin. Moreover, the variables that are the subject matter of this research help explain 99.32% of the variability in the market value of Bitcoin. This enables anyone interested in the field of Bitcoin to peg its market value to these exogenous variables in the long run. The study also finds no significant impact of variables such as the number of blocks added to the blockchain daily, WTI and the daily output volume of Bitcoin on the Bitcoin market value. It is imperative to take into consideration that the analysis involved in this study is primarily applicable to Bitcoin and does not necessarily extend to other altcoins. However, a study on similar lines may indeed elucidate the correlation of exogenous and endogenous variables with the market value of those cryptocurrencies as well. Still, given that Bitcoin dominates the crypto space for exchange in terms of both magnitude and scope, it is a worthwhile quest to comprehend why this unique asset has value.
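The constant-block-time mechanism described above is Bitcoin's difficulty retarget: every 2016 blocks the difficulty is rescaled by the ratio of the target elapsed time to the actual elapsed time, clamped to a factor of 4 in either direction. A simplified sketch of that logic (illustrative only, not consensus-accurate code):

```python
TARGET_SECONDS = 10 * 60 * 2016  # 2016 blocks at ~10 minutes each

def retarget_difficulty(difficulty, actual_seconds):
    """Scale difficulty so blocks keep arriving roughly every 10 minutes."""
    ratio = TARGET_SECONDS / actual_seconds
    ratio = max(0.25, min(4.0, ratio))  # protocol clamps the adjustment
    return difficulty * ratio
```

If blocks arrived twice as fast as targeted (more hash power joined), the difficulty doubles; if the network suddenly sped up tenfold, the single-period adjustment is capped at 4x.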


6 Managerial Implication

Investors are often faced with two major hurdles while investing: information overload and unknown risks. Cryptocurrency is currently (partially) unregulated [30]. Investors and prospective investors are thus greatly dependent on publicly available third-party aggregated data in order to monitor the impact of various variables on cryptocurrency prices, for example, estimated market capitalization, prices and trading volumes on exchanges, etc. However, since there is a lack of prescribed standards in the crypto space in relation to the publication of such data and their likely impact on crypto prices, there are deficiencies in the information and knowledge available for making sound investment decisions. Through this research, this deficiency has been plugged to a great extent, enabling investors and academicians to have a more comprehensive understanding of the various innate variables of the Bitcoin infrastructure, the macroeconomic variables that affect any other economic commodity (like Bitcoin), and the impact of changes in their trends on the price of Bitcoin. This study may also be extended to study the functionality of other cryptocurrencies and their market value.

7 Limitations and Future Research

In a bid to close the information gap efficiently, there arises the need for a monitoring template that investors and portfolio managers can use as a ready reference to recognize and gauge the type and level of cryptocurrency-related activities and relationships. This template can be built by making the model more robust. Thus, this research lays a foundation for subsequent work in which more variables can be included in the model to make it more accurate and useful to investors. Moreover, the research available to date has presented predictive analyses of Bitcoin prices in order to provide near-accurate predictions of future price trends. This research, by contrast, has focused on the kind of relationship each variable holds with the Bitcoin price, in order to help investors make sound decisions regarding investment in Bitcoin. This leaves scope for future research, in that a robust predictive analysis can be developed using the variables in this paper, which would give more accurate results (than those given by the Bocconi School researchers), since this research has considered variables at both microscopic and macroscopic levels that impact Bitcoin, whether it is considered a commodity or a currency.

References

1. S. Davidson, P. De Filippi, J. Potts, Economics of blockchain. Available at SSRN 2744751 (2016). https://doi.org/10.2139/ssrn.2744751
2. J. Lindman, V.K. Tuunainen, M. Rossi, Opportunities and risks of Blockchain technologies—a research agenda (2017). https://doi.org/10.24251/HICSS.2017.185
3. P. Šurda, The origin, classification and utility of Bitcoin (2014). https://doi.org/10.2139/ssrn.2436823
4. W. Mougayar, The Business Blockchain: Promise, Practice, and Application of the Next Internet Technology (Wiley, 2016)
5. S. Nakamoto, Bitcoin: a peer-to-peer electronic cash system. Manubot (2019)
6. A. Hayes, A cost of production model for bitcoin. Telemat. Inf. 34(7), 1308–1321 (2017). https://doi.org/10.1016/j.tele.2016.05.005
7. M. Buchholz, J. Delaney, J. Warren, J. Parker, Bits and bets, information, price volatility, and demand for Bitcoin. Economics 312 (2012). http://www.bitcointrading.com/pdf/bitsandbets.pdf
9. L. Kristoufek, BitCoin meets Google Trends and Wikipedia: quantifying the relationship between phenomena of the Internet era. Sci. Rep. 3(1), 1–7 (2013). https://doi.org/10.1038/srep03415
10. E. Cheah, J. Fry, Speculative bubbles in Bitcoin markets? An empirical investigation into the fundamental value of Bitcoin. Econ. Lett. 130, 32–36 (2015). https://doi.org/10.1016/j.econlet.2015.02.029
11. P. Ciaian, M. Rajcaniova, D.A. Kancs, The economics of Bitcoin price formation. EERI Res. Pap. Ser. (2016). https://doi.org/10.1080/00036846.2015.1109038
12. M. Cusumano, The Bitcoin ecosystem. Commun. ACM 57(10), 22–24 (2014). https://doi.org/10.1145/2661047
13. A.H. Dyhrberg, S. Foley, J. Svec, How investible is Bitcoin? Analyzing the liquidity and transaction costs of Bitcoin markets. Econ. Lett. 171, 140–143 (2018). https://doi.org/10.1016/j.econlet.2018.07.032
14. M. Balcilar, E. Bouri, R. Gupta, D. Roubaud, Can volume predict Bitcoin returns and volatility? A quantiles-based approach. Econ. Model. 64, 74–81 (2017). https://doi.org/10.1016/j.econmod.2017.03.019
15. D. Yermack, Is Bitcoin a real currency? An economic appraisal. National Bureau of Economic Research (2013)
16. J. Bonneau, A. Miller, J. Clark, A. Narayanan, J.A. Kroll, E.W. Felten, SoK: research perspectives and challenges for Bitcoin and cryptocurrencies, in 2015 IEEE Symposium on Security and Privacy, pp. 104–121 (2015). https://doi.org/10.1109/SP.2015.14
17. S. Barber, X. Boyen, E. Shi, Bitter to better—how to make Bitcoin a better currency, in International Conference on Financial Cryptography and Data Security (2012). https://doi.org/10.1007/978-3-642-32946-3_29
18. L. Kristoufek, What are the main drivers of the Bitcoin price? Evidence from wavelet coherence analysis. PLoS ONE 10(4) (2015). https://doi.org/10.1371/journal.pone.0123923
19. Y. Zhu, D. Dickinson, J. Li, Analysis on the influence factors of Bitcoin's price based on VEC model. Financ. Innov. 3(1), 1–13 (2017). https://doi.org/10.1186/s40854-017-0057-x
20. J.A. Bergstra, P. Weijl, Bitcoin: a money-like informational commodity (2014). arXiv preprint arXiv:1402.4778
21. P.C. Phillips, S.N. Durlauf, Multiple time series regression with integrated processes. Rev. Econ. Stud. 53(4), 473–495 (1986). https://doi.org/10.2307/2297602
22. S. Johansen, Identifying restrictions of linear equations with applications to simultaneous equations and cointegration. J. Econ. 69(1), 111–132 (1995). https://doi.org/10.1016/0304-4076(94)01664-L
23. R. Engle, C. Granger, Co-integration and error correction: representation, estimation, and testing. Econometrica 55(2), 251–276 (1987). https://doi.org/10.2307/1913236
24. G. Bårdsen, Estimation of long run coefficients in error correction models. Oxford Bull. Econ. Stat. 51(3), 345–350 (1989). https://doi.org/10.1111/j.1468-0084.1989.mp51003008.x
25. G.W. Peters, E. Panayi, Understanding modern banking ledgers through blockchain technologies: future of transaction processing and smart contracts on the internet of money, in Banking Beyond Banks and Money, pp. 239–278 (Springer, Cham, 2016). https://doi.org/10.2139/ssrn.2692487
26. R. Beck, C. Müller-Bloch, Blockchain as radical innovation: a framework for engaging with distributed ledgers as incumbent organization, in Proceedings of the 50th Hawaii International Conference on System Sciences (2017). https://doi.org/10.24251/HICSS.2017.653
27. G. Salviotti, L.M. De Rossi, N. Abbatemarco, A structured framework to assess the business application landscape of blockchain technologies, in Proceedings of the 51st Hawaii International Conference on System Sciences (2018). https://doi.org/10.24251/HICSS.2018.440
28. N.C. Mark, Exchange rates and fundamentals: evidence on long-horizon predictability. Am. Econ. Rev. 85(1), 201–218 (1995). http://www.jstor.org/stable/2118004
29. C. Catalini, J.S. Gans, Some simple economics of the blockchain (No. w22952). National Bureau of Economic Research (2016). https://doi.org/10.3386/w22952
30. W. Reijers, F. O'Brolcháin, P. Haynes, Governance in blockchain technologies and social contract theories. Ledger 1, 134–151 (2016). https://doi.org/10.5195/ledger.2016.62

Detection of COVID-19 Using Intelligent Computing Method

Asmita Dixit, Aatif Jamshed, and Ritin Behl

Abstract The COVID-19 crisis has generated an enormous need for country-wide contact tracing, which requires thousands of people to quickly learn key skills. This article discusses the science of SARS-CoV-2, including the contagious period, the clinical presentation of COVID-19, and how SARS-CoV-2 spreads from person to person, and explains why contact tracing can be such an effective public health intervention. The study focuses on monitoring contacts: how to establish relationships with cases, identify their contacts, and help both cases and their contacts avoid onward community transmission. It also discusses a range of critical ethical issues around contact monitoring, isolation, and quarantine. In this report, we have built a support vector machine method that predicts the number of COVID-19 cases in the days to come if precautionary steps are not taken to prevent them. The predicted results for the first few days closely match the actual rise in the number of cases. Finally, the paper describes some of the most important obstacles to contact-tracing efforts, along with techniques to resolve them. Many attempts at eradicating the disease are still under way, but the way the virus behaves differently from one human body to another remains a major concern.

Keywords COVID-19 · SARS · MERS · Pretrained model

A. Dixit · A. Jamshed (B) ABES Engineering College, Ghaziabad, Uttar Pradesh, India e-mail: [email protected] A. Dixit e-mail: [email protected] R. Behl Department of Information Technology, ABES Engineering College, Ghaziabad, Uttar Pradesh, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_65


1 Introduction

Coronavirus is an emerging pandemic that has almost brought the world to a halt. The disease COVID-19 emerged from the virus called coronavirus. Coronaviruses are a large, diverse group of viruses that can only be viewed under a powerful electron microscope. The virus got its name from its crown-like appearance. Figure 1 depicts the layout of the virus. Coronaviruses fall into different categories, and their adverse effects have previously ranged over mammals as well as birds. The prevalent effect of the virus in humans is respiratory infection, and the illness caused by such viruses ranges from mild to fatal; COVID-19 and SARS [1] illustrate the deadly end of that range. The root cause of COVID-19 is SARS-CoV-2. The virus originates in bats and remains in them constantly. The ability of this virus to hop between species and transmit rapidly among humans makes the pandemic hard to control. The rapidly transmitting virus is the third in the coronavirus family, the first being severe acute respiratory syndrome (SARS), which appeared in 2002 in Guangdong, China [2]. Middle East respiratory syndrome (MERS) was another outbreak, in 2012, leading to human infections. By the end of 2019, a life-threatening respiratory disease emerged in the city of Wuhan in China, now spreading worldwide as a pandemic. Figure 2 shows the cell body of the virus in yellow; the virus replicates in the cells in which it resides, ultimately infecting different cells of the body with complete, gradual damage. In this article, we have incorporated the following:
• A supervised machine learning algorithm, specifically the support vector machine (SVM) algorithm, into the proposed method. The key purpose of SVM is to calculate the optimum hyperplane that linearly divides the samples into two classes by optimizing the margin.
• Prediction of COVID-affected lungs via an SVM classifier using lung CT scans.

Fig. 1 Coronavirus layout (Ref: https://www.washingtonpost.com/health/2020)
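The margin-maximizing idea in the first bullet can be illustrated with a tiny linear SVM trained by Pegasos-style subgradient descent. This is only a sketch, not the authors' implementation: the 2-D feature values are made up, standing in for, e.g., two normalized CT-scan descriptors.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=300):
    """Pegasos-style subgradient descent on the regularized hinge loss.
    X: feature vectors, y: labels in {-1, +1}. Returns weights and bias."""
    random.seed(0)
    w, b, t = [0.0] * len(X[0]), 0.0, 0
    for _ in range(epochs):
        for i in random.sample(range(len(X)), len(X)):
            t += 1
            eta = 1.0 / (lam * t)
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            w = [(1 - eta * lam) * wj for wj in w]      # shrink (regularizer)
            if y[i] * score < 1:                        # point inside the margin
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Made-up 2-D features: three "healthy" (-1) and three "affected" (+1) samples
X = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.25],
     [0.80, 0.90], [0.90, 0.80], [0.85, 0.75]]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
# classify one healthy-like and one affected-like sample
print(predict(w, b, [0.10, 0.15]), predict(w, b, [0.90, 0.85]))
```

The separating hyperplane found this way is what the classifier of Sect. 4 applies to its CT-derived features.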


Fig. 2 CoV2 virus emerging from cells (Ref: https://www.eurekalert.org/pub_releases/2020-03/nioa-ncs031720.php)

1.1 Coronavirus Origin

Coronaviruses are a wide, multi-virus family. Many of them cause the common cold. They infect many species, including birds, camels, and horses. At the end of 2019, the first cases of the virus detected in Wuhan, China, sparked a global pandemic. Researchers believe that SARS-CoV-2 developed in bats, as did the coronaviruses behind Middle East respiratory syndrome (MERS) and severe acute respiratory syndrome (SARS). SARS-CoV-2 is thought to have jumped to people in the open-air 'wet markets' of Wuhan, where customers buy fresh meat and seafood, some of it slaughtered on the spot. Many such wet markets sell vulnerable and prohibited species, including cobras, wild boars, and pigs. Crowded environments can allow viruses from different species to exchange genes, and a virus can change so much that it begins to infect and spread among people. However, the Wuhan market did not offer bats when the outbreak happened. Early suspicion also fell on pangolins, also known as scaly anteaters, which are secretly traded in some markets in China; several coronaviruses found in pangolins are close to SARS-CoV-2 [1]. Individuals without close contact with animals have been affected by the transmission of SARS-CoV-2 within and outside China, showing that the infection spreads from human to human. It circulates in the USA and around the world as people pick up and pass on the virus inadvertently. The increasing global dissemination is now a pandemic.

1.2 Virus Evolution

Scientists identified a human coronavirus for the first time in 1965; it caused a common cold. Later in the decade, researchers described a group of similar human and animal viruses and named them after their crown-like appearance [3, 4]. Seven coronaviruses are capable of infecting humans. The SARS epidemic arose in Southern China in 2002 and soon spread to 28 other countries. About 8000 individuals had become sick and 774 had died


by July 2003. In 2004, a slight epidemic ended in just four additional cases. The infection induces fatigue, headache, and respiratory issues, including vomiting and difficulty breathing. MERS began in Saudi Arabia in 2012. Around 2500 cases have been reported among people who live in or travel to the Middle East. MERS spreads less readily than its SARS counterpart, but 858 patients have died [5]. It causes the same respiratory signs, and sometimes renal dysfunction as well.

2 Literature Survey

Torky et al. [6] explored an important program that assists states, health agencies, and the public in taking informed decisions on infection diagnosis, control, and prevention. The architecture is being designed and introduced as a modern system composed of four components: an outbreak detection module, a blockchain network, a P2P smartphone application, and a mass-control system. These four elements work together to identify suspected infected cases and to estimate and measure the likelihood of coronavirus (COVID-19) infection. Mashamba et al. [7] explored low-cost self-testing and monitoring technologies for COVID-19 and other new infectious diseases, combined with artificial intelligence. Prompt installation and proper operation of the proposed network have the potential to curb COVID-19 transmission and the associated mortality, particularly where laboratory infrastructure is poorly accessible. Wu et al. [8] reported four antibodies from patients that neutralize the coronavirus disease 2019 (COVID-19) virus. Two of the antibodies, B38 and H4, block the receptor-binding domain (RBD) of the viral spike protein from engaging the cell receptor ACE2. The structure of the RBD bound to B38 suggests that the binding site of B38 overlaps with that of ACE2. While H4 also blocks RBD binding to ACE2, it binds at an alternate position, so the two antibodies can bind simultaneously; a combination of the antibodies could potentially be employed together in medical applications. Nguyen et al. [9] outlined the main approaches by which blockchain and AI can tackle the COVID-19 epidemic and reviewed new work on the use of blockchain and AI across a broad variety of COVID-19 applications. The recently developed programs and use cases that cope with the coronavirus pandemic using such techniques are also discussed.
Finally, they highlighted the obstacles and potential directions that should inspire further research efforts to tackle future coronavirus epidemics. Kumar et al. [10] examined a global deep learning framework using fresh and up-to-date evidence to improve the identification of COVID-19 patients based on computed tomography (CT) slices, and likewise outlined the main approaches by which blockchain and AI can tackle the COVID-19 epidemic. The fusion of blockchain and federated learning technology collects data from different clinics without leaking consumer privacy. First, they collect real-life COVID-19 data for researchers. Second, a number of deep-learning


models (VGG, DenseNet, AlexNet, MobileNet, ResNet, and Capsule Network) are used to classify patterns in COVID-19 lung screenings of patients. Third, data is shared securely across various hospitals through the integration of federated learning and blockchain. Finally, the findings indicate improved progress in the diagnosis of COVID-19 patients. Chamola et al. [11] explored the pandemic's impact on the global economy and investigated the use of the Internet of things (IoT), UAVs, blockchain, artificial intelligence (AI), and 5G, among others, to mitigate the effects of the COVID-19 epidemic.

3 Signs and Symptoms of Virus Existence

COVID-19 is at times difficult to identify from signs and symptoms alone. Some people have no symptoms at all and are termed asymptomatic. Symptoms [12] such as cold, mild fever, muscle pain, and cough can occur in both COVID-19 and any other respiratory disease, but losing the sense of taste and smell is more specific to the presence of coronavirus in the body. The severe signs and symptoms of COVID-19 that require immediate medical aid are a rise in body temperature and heavy breathing difficulty. Blue lips and face can also be a symptom, indicating that the patient's body is suffering oxygen deprivation; confusion or difficulty in waking up can be another. Coronavirus leads to fatality chiefly by damaging the lungs: heavy congestion destroys lung tissue and can lead to death. Figure 3 shows the image of both a healthy human lung and

Fig. 3 a Healthy lungs versus b coronavirus affected lungs (Ref: Radiopedia.org)


Fig. 4 Graph showing incubation period rise in cases (Ref: https://covid19.who.int/)

a coronavirus-affected, congested lung. The source of the image is Radiopedia.org [11]. The white matter visible in the right-hand image shows that the lung tissue has been severely damaged by the coronavirus. It is important to understand the development of the virus in the body and the damage caused. The period over which this happens is termed the incubation period [12] and spans 2–14 days. This is the time during which symptoms develop and the person becomes infectious. The graph in Fig. 4 shows the rate at which humans affected by the virus develop symptoms: 5% develop signs of the virus within 2 days, and by the 14th day 95% [13] are affected. During this infectious period, the speed of virus transmission is very high, and asymptomatic infected people contribute equally to the spread of the virus. Figure 5 depicts the complete outline of the infectious period: the timeline over which an infected person is most infectious, whether severely ill or asymptomatic. Figure 6 shows the total number of coronavirus cases over time since January 22, 2020. Figure 7 shows total deaths over time since January 22, 2020. Figure 8 shows total recoveries over time since January 22, 2020.
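The 5%-by-day-2 and 95%-by-day-14 figures are enough to pin down a simple two-parameter incubation model. As an illustration only (the log-normal family is our assumption, not the paper's), treating those figures as the 5th and 95th percentiles of a log-normal incubation time gives a median of roughly 5 days:

```python
import math
from statistics import NormalDist

# Back-of-envelope: if incubation time T is log-normal with
# P(T <= 2) = 0.05 and P(T <= 14) = 0.95, solve for the two parameters.
z = NormalDist().inv_cdf(0.95)                    # ≈ 1.645
mu = (math.log(2) + math.log(14)) / 2             # mean of log(T)
sigma = (math.log(14) - math.log(2)) / (2 * z)    # std of log(T)

median = math.exp(mu)
print(round(median, 1))   # ≈ 5.3 days, consistent with the 2–14 day span
```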

3.1 Risk Factors and Diagnosis

With the stroke of the pandemic [14], the risk of COVID-19 prevails everywhere. In particular, older adults over the age of 65 are much more likely to have severe COVID-19 disease than others. Other existing medical conditions that increase the risk of severe COVID-19 include diabetes, hypertension, and any kind of lung disease such as asthma, emphysema, or COPD (chronic obstructive pulmonary disorder). People who have heart disease, liver disease, or any kind of kidney disease are also at increased risk of severe COVID-19, as are people with weakened immune systems. Death from this virus can occur if the virus damages the lungs so much that the patient is unable to breathe on their own, or through lack of oxygen. 2–5% of people


Fig. 5 Outline of infectious period (Ref: https://covid19.who.int/)

Fig. 6 Total coronavirus cases (Ref: https://covid19.who.int/)


Fig. 7 Total deaths during COVID-19 period (Ref: https://covid19.who.int/)

Fig. 8 Total recoveries during COVID period (Ref: https://covid19.who.int/)



aged 65–75 years likely die from COVID-19 [15] in the USA. This risk increases to 4–10% among those aged 75–85 and is over 10% in people aged more than 85 years. There are two kinds of diagnostic test [16] for the virus. One identifies the virus itself in the body: the polymerase chain reaction (PCR) test, also known as a molecular test, which checks for virus reproduction in body cells. The other is the antibody test, which identifies antibodies to the virus, usually in the blood. Antibodies are made by the immune system to fight off viruses or bacteria, and antibodies specific to this virus can be identified in the blood. Some antibodies, called IgG antibodies, begin to develop while you are sick but are identified mostly after you recover. PCR tests [17] are done to identify active infection: they detect the RNA (genetic material) of the virus in a swab taken from the nose, throat, or mouth. It is important to keep in mind that a negative PCR test does not necessarily mean that someone definitely does not have the infection. To test for past infection, the IgG antibody test is most commonly done, usually on blood. Figure 9 gives a view of how swab collection is done. The suggested tests cannot yet be called perfect for diagnosis, but they are the best diagnostic solutions so far.

3.2 Transmission and Its Significance for Stopping

SARS-CoV-2, or COVID-19 [18], is declared a pandemic because of the impact it causes by spreading from one person to another. The source of infection lies in the infected person's respiratory tract, mouth, nose, and throat. Droplets can come out and travel to another person at any moment during talking, laughing, or sneezing. Another mode of infection is contact with a surface touched by an infected person; in such a case, keeping the environment clean and washing hands frequently before touching oneself is advised. Community spread is more likely in congregate housing setups such as prisons, dormitories, and shelters, and the only solution that has been suggested is stopping transmission. The graph [19, 20] depicting the reproductive number in Fig. 10 indicates how one person can infect many in a chain. Even if one person is prevented from getting infected, the chain of transmission is reduced by half: the prevention of just one person from getting infected leads to a total of four infected cases rather than 15. Figure 11 represents a better layout of this statement.
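The chain arithmetic behind Figs. 10 and 11 can be made concrete. The sketch below assumes an illustrative reproduction number R = 2 over three generations; how many cases a single blocked infection removes depends on where in the tree it sits, which is why the figure's exact counts may differ:

```python
def chain_total(r, generations):
    """Total infections in a transmission tree: the index case plus r new
    cases per infected person, for the given number of generations."""
    return sum(r ** g for g in range(generations + 1))

total = chain_total(2, 3)        # 1 + 2 + 4 + 8 = 15 infections
subtree = chain_total(2, 2)      # a generation-1 case and its descendants
print(total, total - subtree)    # blocking that one case leaves 15 - 7 = 8
```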

Fig. 9 Swab sample taken for diagnostic test (Ref: Radiopedia.org)

Fig. 10 Rate of reproduction (Ref: Radiopedia.org)

Fig. 11 Effect of preventing one infection (Ref: Radiopedia.org)


4 Results and Discussion

The purpose is to create a method that correctly forecasts the number of COVID-19 cases. If caution is not taken appropriately or a lockdown is not observed, we could face a serious situation. With this in mind, we have proposed a method to help forecast the number of COVID-19 cases. Initially, we conducted a statistical study in the proposed method to approximate the number of cases in India. As shown in Fig. 12, data preprocessing was first conducted to locate the noise in the dataset. The detailed datasets are seen in Table 1. As the existing dataset is very small, the data do not contain noise, but we have retained the data preprocessing step for futuristic models of the same problem. SVM is then implemented with regression: SVM is first applied to separate the data into a number of clusters, followed by linear regression to evaluate the expected values. The dataset is obtained from the Kaggle Web site, which has made it available freely for

Fig. 12 Flow diagram of proposed system: Data Preprocessing → Support Vector Machine for Analysis → Predicted Cases
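As a rough, self-contained stand-in for the forecasting step of the pipeline in Fig. 12 (substituting an ordinary least-squares fit on log-counts for the authors' SVM-plus-regression combination, and seeding it with the first five predicted values from Table 1):

```python
import math

# Fit log(cases) = a + b * day by ordinary least squares, then extrapolate
# one day ahead. The counts below are the first five predictions from
# Table 1, used only as a seed for this illustrative fit.
cases = [68_045_893, 68_658_391, 69_283_983, 69_912_120, 70_543_921]
days = list(range(len(cases)))
logs = [math.log(c) for c in cases]

n = len(days)
xbar, ybar = sum(days) / n, sum(logs) / n
slope = sum((x - xbar) * (y - ybar) for x, y in zip(days, logs)) \
        / sum((x - xbar) ** 2 for x in days)
intercept = ybar - slope * xbar

forecast = math.exp(intercept + slope * n)   # one-step-ahead forecast
print(f"{forecast:,.0f}")                    # ≈ 71.2 million
```

The extrapolated value lands close to the next entry in Table 1 (71,181,472), reflecting the roughly 0.9% daily growth implicit in the table.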

Table 1 Results showing future predictions

Date         Predicted cases with SVM model
12/08/2020   68,045,893
12/09/2020   68,658,391
12/10/2020   69,283,983
12/11/2020   69,912,120
12/12/2020   70,543,921
12/13/2020   71,181,472
12/14/2020   71,821,749
12/15/2020   72,465,967
12/16/2020   73,115,127
12/17/2020   77,376,739


Fig. 13 Predictions done by SVM model after training

competition and study. We have taken a dataset with features including the number of changed cases every day, the number of deaths, the number of recovered cases, and area-sensitive, countrywide data on contaminated cases until September 2020. Figure 13 shows the performance of the support vector machine for predicting coronavirus cases. Table 1 indicates the estimated outcomes, i.e., the number of cases that could occur in the coming days if due treatment and precautionary steps are not taken. Our projections nearly equal the actual rate of growth in COVID-19 cases per day, and we anticipate good outcomes from our study. Since the spread of the disease, no proper treatment has been found yet. Various antiviral drugs are in the making and are under human trial. Many therapies have been tried so far, including drugs used to treat malaria. A recent treatment is convalescent plasma therapy, in which antibodies found in the plasma of a recovered COVID-19 [19] patient are used. This treatment helps reduce the severity of the disease and shortens its duration, though it is still practiced on an experimental basis. So far, no antiviral treatment has been found; medications taken for other viral infections are being tested. Measures like isolation of the affected patient are taken to prevent spread to healthy people, and a person who has come into close contact with the virus is quarantined for a duration of 14 days [21]. As a preventive measure, most countries have experienced lockdowns since the emergence of the COVID-19 pandemic. Most places came to a complete shutdown to avoid dense-contact environments, including conferences, religious places, workplaces (shifted to online mode), bars, gyms, and schools. It has


Table 2 List of vaccines in progress

Name of vaccine       Name of university                                              Phase
CoronaVac             Sinovac Research and Development Co., Ltd.                      Phase 3
AZD1222               University of Oxford, AstraZeneca                               Phase 2
mRNA-1273             Moderna, US National Institute of Allergy and Infectious        Phase 3
                      Diseases, BARDA
BNT162                Pfizer, BioNTech                                                Phase 2
Inactivated vaccine   Wuhan Institute of Biological Products; China                   Phase 3

been suggested that people above the age of 65 and children below 10 must avoid going out. Some of the vaccines under clinical trials are listed in Table 2; the information on the vaccines under trial is taken from Wikipedia.

5 Conclusion and Future Scope

This article aims to generate awareness. It gives a brief account of how the disease originated and how it became a pandemic, helps identify the preventions one should adopt, and distinguishes between isolation and quarantine. The article tries to cover a maximal description of COVID-19; the motivation for writing it is to spread awareness about the disease and the precautions that can be taken. It gives brief knowledge about how the disease originated and the possible signs and symptoms that lead to testing, and it also focuses on the risk factors involved in catching the disease and on how rapid the medium of transmission is. Many vaccines are under trial in medical research and are expected to arrive by the year 2021. In future, detection of rapid changes in the virus cell in the body can be studied via neural networks. One aspect of future scope is detection of COVID-infected persons in combination with machine learning algorithms; another is a hybrid classifier for better precision. The rate of growth can then be predicted more easily using optimization techniques.

References

1. S. Reddy, J. Fox, M.P. Purohit, Artificial intelligence-enabled healthcare delivery. J. R. Soc. Med. 112(1), 22–28 (2019)
2. P. Zhou, X.-L. Yang, X.-G. Wang, B. Hu, L. Zhang, W. Zhang, H.-R. Si, Y. Zhu, B. Li, C.-L. Huang et al., A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature, 1–4 (2020)
3. H.H. Elmousalami, A.E. Hassanien, Day level forecasting for coronavirus disease (COVID-19) spread: analysis, modeling and recommendations (2020). https://arxiv.org/abs/2003.07778
4. B.R. Beck, B. Shin, Y. Choi, S. Park, K. Kang, Predicting commercially available antiviral drugs that may act on the novel coronavirus (2019-nCoV), Wuhan, China through a drug-target interaction deep learning model (2020). bioRxiv
5. A. Jamshed, B. Mallick, P. Kumar, Deep learning-based sequential pattern mining for progressive database mainly for bins. Soft Comput. (2020). https://doi.org/10.1007/s00500-020-05015-2
6. M. Torky, A. Darwish, A.E. Hassanien, Blockchain use cases for COVID-19: management, surveillance, tracking, and security (2021), pp. 261–274
7. N.B. DeFelice, E. Little, S.R. Campbell, J. Shaman, Ensemble forecast of human West Nile virus cases and mosquito infection rates. Nat. Commun. 8, 1–6 (2017)
8. P. Wu, F. Duan, C. Luo, Q. Liu, X. Qu, L. Liang, K. Wu, Characteristics of ocular findings of patients with coronavirus disease 2019 (COVID-19) in Hubei Province, China. JAMA Ophthalmol. 138(5), 575–578 (2020)
9. D.C. Nguyen, K.D. Nguyen, P.N. Pathirana, A mobile cloud based IoMT framework for automated health assessment and management, in 2019 41st Annual International Conference of the IEEE Engineering (2019)
10. A. Kumar, K.R. Nayar, COVID 19 and its mental health consequences. J. Ment. Health 180(6), 817–818 (2020)
11. V. Chamola, V. Hassija, V. Gupta, M. Guizani, A comprehensive review of the COVID-19 pandemic and the role of IoT, drones, AI, blockchain, and 5G in managing its impact. IEEE Access 8, 90225–90265 (2020)
12. J. Shaman, A. Karspeck, W. Yang, J. Tamerius, M. Lipsitch, Real-time influenza forecasts during the 2012–2013 season. Nat. Commun. 4, 1–10 (2013)
13. Y. Liu, K. Wang, Y. Lin, W. Xu, Lightchain: a lightweight block-chain system for industrial internet of things. IEEE Trans. Industr. Inf. 15(6), 3571–3581 (2019)
14. E. Massad, M.N. Burattini, L.F. Lopez, F.A. Coutinho, Forecasting versus projection models in epidemiology: the case of the SARS epidemics. Med. Hypotheses 65, 17–22 (2005)
15. J.B.S. Ong, M.I.-C. Chen, A.R. Cook, H.C. Lee, V.J. Lee, R.T.P. Lin, P.A. Tambyah, L.G. Goh, Real-time epidemic monitoring and forecasting of H1N1-2009 using influenza-like illness from general practice and family doctor clinics in Singapore. PLoS ONE 5 (2010). https://doi.org/10.1371/journal.pone.0010036
16. K. Nah, S. Otsuki, G. Chowell, H. Nishiura, Predicting the international spread of Middle East respiratory syndrome (MERS). BMC Infect. Dis. 16, 356 (2016)
17. A.M. Kuchling, Regular expression HOWTO. Regular Expression HOWTO—Python 2(10) (2014) [online]. https://fossies.org/linux/misc/python-3.8.2-docs-pdf-a4.tar.bz2/docs-pdf/howto-regex.pdf. Accessed 27 Mar 2020
18. Q. Liu, T. Liu, Z. Liu, Y. Wang, Y. Jin, W. Wen, Security analysis and enhancement of model compressed deep learning systems under adversarial attacks, in 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC) (2018), pp. 721–726
19. P. Chakraborty, P.C. Saxena, C.P. Katti, Fifty years of automata simulation: a review. ACM Inroads 2(4), 59–70 (2011)
20. W. Wang, D.T. Hoang, P. Hu, Z. Xiong, D. Niyato, P. Wang, Y. Wen, D.I. Kim, A survey on consensus mechanisms and mining strategy management in block-chain networks. IEEE Access 7, 22328–22370 (2019)
21. D. Bigo, S. Carrera, N. Hernanz, J. Jeandesboz, J. Parkin, F. Ragazzi, A. Scherrer, Mass surveillance of personal data by EU member states and its compatibility with EU law. Liberty and Security in Europe Papers 6(61) (2013)

Two-Line Defense Ontology-Based Trust Management Model

Wurood AL-Shadood, Haleh Amintoosi, and Mouiad AL-Wahah

Abstract The heterogeneity, dynamicity, and variability inherent in dynamic distributed networks bring their own security challenges. First, all devices on the network must be trusted to operate according to their designated security level; hence, any accidental or suspicious access may breach the security of the protected resources. Second, the architecture of dynamic networks, if not designed very carefully, can make sharing appropriate information among devices difficult or impossible. In this paper, we present an approach for trust management in dynamic distributed networks. Our approach is based on establishing two defense lines to protect the resources of dynamic distributed networks from suspicious access and intrusive penetration attempts. For the first line, we employ a machine learning approach that serves as an outer shield facing penetration attempts. If this shield is penetrated, an inner, strong semantic-based shield defends against the attacks and prevents them from reaching the protected resources. We show in this paper how machine learning models can be combined with semantic technologies to support flexible and adaptive trust management decisions. We develop a proof-of-concept implementation and give the complexity analysis of our approach.

Keywords Trust management · Dynamic distributed network · Web ontology language · k-nearest neighbor · Description logic

W. AL-Shadood · H. Amintoosi (B) Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran e-mail: [email protected] W. AL-Shadood e-mail: [email protected] M. AL-Wahah Department of Computer Science, Thi-Qar University, Dhi-Qar, Iraq e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_66


1 Introduction

As a method of authorization, trust management (TM) has received great attention in recent years. Adding trust to current security infrastructures increases their resistance against attacks and provides extra protection to these systems; trust plays a major role in building security systems' infrastructures. Trust is the degree of confidence or faith an entity may have about a specific action that is to be done by another entity [1]. TM takes different classifications according to the vision of each researcher. In general, TM models can be categorized as follows [2–5]:
• Attribute-based access control (ABAC) trust management: The trust evaluation of a node's outcome is obtained by calculating the trust value and comparing it with a trust threshold. After the result is received, the node's trust level is managed according to that trust value.
• Reputation (honesty)-based trust management: Measures the degree of trust based on the past behavior of a node in the system.
• PKI (public key certificate/infrastructure)-based trust management: An old-style TM approach that depends on public-key infrastructure cryptography (as in blockchain) and is almost obsolete for today's applications.
• Fuzzy-based trust management: Uses fuzzy logic techniques and considers a value between 0 and 1 to estimate the degree of trust.
• Role-based access control (RBAC) trust management: The trust evaluation of a node's outcome is obtained by calculating the trust value of the role played by the requester after comparing it with the trust threshold.
• Delegation-based: A novel type of TM based on granting an access right from the owner of that right to another entity.

The semantic web promotes data sharing and reuse [6]. It enables the distribution, exchange, and utilization of knowledge and information through its technologies over various distributed actors on the web [7].
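The threshold comparison used by the ABAC and RBAC categories above can be sketched as a weighted score over attributes. The attribute names, weights, and threshold below are illustrative, not from the cited models:

```python
# Weighted trust score over attributes, compared against a threshold.
# Attribute names and weights are hypothetical.
WEIGHTS = {"identity_verified": 0.4, "past_behavior": 0.4, "context_safety": 0.2}

def trust_value(attributes):
    """attributes: dict mapping attribute name -> score in [0, 1]."""
    return sum(WEIGHTS[k] * attributes.get(k, 0.0) for k in WEIGHTS)

def decide(attributes, threshold=0.6):
    """Grant access only if the aggregated trust value meets the threshold."""
    return "grant" if trust_value(attributes) >= threshold else "deny"

print(decide({"identity_verified": 1.0, "past_behavior": 0.8,
              "context_safety": 0.5}))                     # grant (score 0.82)
print(decide({"identity_verified": 0.2, "past_behavior": 0.3}))  # deny (score 0.20)
```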
The main semantic web (SW) concepts are:
• URI: Every entity entry has a uniform resource identifier (URI) used for the collection of machine-processable data [6].
• RDFS: The resource description framework schema (RDFS) provides a vocabulary for modeling RDF data. RDF Schema complements the standard RDF vocabulary by introducing mechanisms for describing related resource groups and the relationships between them [8].
• The resource description framework (RDF): Used to represent information on the Web in triple form: subject-predicate-object [9].
• Web ontology language (OWL): A language used for writing and editing ontologies with logical reasoning [9]. It is categorized into three kinds depending on the level of the knowledge base to be represented and the deployed semantics capability: OWL Lite, OWL DL, and OWL Full.

Two-Line Defense Ontology-Based Trust Management Model


• Description logic (DL) reasoner: In order to fully exploit the power of semantics, a reasoning engine must be coupled to it. Some reasoners are tableau-based (such as the Pellet reasoner), which support a large number of axioms [10].
While several trust models have been proposed to deal with trust issues in complex environments, only a few of them have considered the semantic relationships between pervasive elements and trust categories. For this reason, we intend to resolve problems resulting from the heterogeneity, dynamicity, and variability of dynamic distributed networks. In this paper, we present two defense lines in the field of trust management (TM): the k-nearest neighbors (kNN) algorithm [11] and a semantic-based trust management algorithm. We suggest using semantic web (SW) technologies and Java programming. The ontology [12] provides semantic information via its representation, and it works as a means that captures the heterogeneous nature of the represented entities. For logical inference, we use the Pellet reasoner. The Protégé environment [13] is used as a tool during the design phase, while Java is the programming environment for the final product. To evaluate the proposed model, we have devised a method for estimating the degree of trust based on the trust management ontology.
Research Gap: Trust is fragile, and it is very difficult to heal once it is broken. In previous research, several approaches address the need for robust TM models that suit the dynamic aspects of distributed networks, yet none of these approaches provides a comprehensive investigation of the relationship between adaptability and dynamicity. Furthermore, to the best of our knowledge, empirically based models that link the spatiotemporal aspects of the TM system to its underlying heterogeneous and variable network environment are scarce or non-existent.
Motivation: Ontological reasoning, combined with machine learning techniques, to semantically represent and resolve the spatiotemporal aspects of the TM system is the main motive for this paper.
Objectives: To close the gap existing in current research, the objectives of this paper can be summarized as follows:
• Introducing a more in-depth understanding of the problems that stem from DDNs' properties, namely dynamicity and heterogeneity, and how these problems can be semantically addressed.
• Showing how machine learning techniques can be amalgamated with semantic technologies to support flexible and adaptive trust management in DDNs.
• Providing an empirical evaluation of the implemented approach, in addition to presenting its theoretical complexity aspects.
The rest of this paper is organized as follows: related work is presented in Sect. 2. Section 3 is dedicated to clarifying the two defense lines' approach. Implementation


W. AL-Shadood et al.

results are presented in Sect. 4, and finally, conclusion and future work are given in Sect. 5.

2 Related Work Kammoun et al. [14] introduced a software-defined network architecture based on TM, and they used user/device attributes for granting or denying access rights to Internet of Things (IoT) resources. Their approach intends to enhance the protection of the resources based on clustering techniques in TM. However, too many attributes reduce the efficiency of the network and slow down devices with limited computation power. In our approach, the deduction rules are already embedded in the DL reasoner, which reaches the result. Also, we do not put the properties of the devices and the users in the same learning set in the first line of defense; rather, we put only the features of the devices in the learning set. Mousa et al. [15] proposed a dependency-network-based trust model for the management of context-aware web services. Their model achieves noticeable success in large-scale static environments only, and it does not use web services. Despite this success, their model is not successful in dynamic environments. In our research, we consider the dynamic distributed environment as well, and we benefit from semantic web technology to resolve the problems associated with it. Esposito et al. [5] propose an approach to address heterogeneity in complex distributed networks. The authors suggested using fuzzy logic, which carries some ambiguity and is often not accurate. Also, fuzzy systems are not as capable as machine learning techniques. In our framework, we merge machine learning with semantic information to provide excellent accuracy for TM. Meng and Zhang [16] propose a trust system entitled "TrueTrust," which utilizes all peer reviews to measure a peer's integrity. The model measures the credibility of a peer for a requester and relies heavily on the requester's past feedback (information) on the services of peers.
In their method, they suggested a protocol that guarantees the rules of trust, while in our approach, at the classification stage, the trust of the device is derived from contextual information without depending on the device's past information, while at the same time the test is achieved for the new device. AL-Wahah [8] introduced a security framework for dynamic distributed networks (DDNs) that is based on semantic technologies to formulate contextual information and authorization policies. However, their framework does not incorporate TM and, hence, is not as strong as the proposed approach. Table 1 summarizes a comparison between the proposed approach and the existing state-of-the-art approaches.


Table 1 Comparison between the proposed method and the existing approaches

| Approach | Engagement of trust | Technique used | Advantage | Disadvantage |
|---|---|---|---|---|
| Kammoun et al. [14] | Yes | Clustering-based trust model | TM is used for IoT authorization | Several attributes engaged in the process cause an inefficient and slow network |
| Mousa et al. [15] | Yes | Dependency graph (dependency network) | Remarkable performance in static distributed networks | Does not work in dynamic environments |
| AL-Wahah [8] | No | Semantic web technologies | Works very well to provide authorization for dynamic networks | Weak in comparison with trust-based models |
| Meng and Zhang [16] | Yes | Data mining-based approach | Excellent performance and (log(n) × log(n)) complexity in P2P networks | P2P networks are not necessarily static, and hence exchanged dynamic information will disrupt this approach |
| Esposito et al. [5] | Yes | Fuzzy-logic technique | Successfully addresses the issues stemming from the heterogeneity inherent in DDNs | Fuzzy logic is ambiguous and often not as accurate as machine learning |
| The proposed method | Yes | Semantic web technologies plus machine learning (ML) techniques | Very suitable for DDNs since the semantic information is updated according to the contexts of the communicators; moreover, the ML technique is basically used as a filtration step for the later semantic-based TM model | High theoretical complexity (Exp(n)) |


3 The Proposed Method
3.1 First Line of Defense
This line represents the first security shield. If this shield is penetrated, then the inner, stronger semantic-based shield defends against the penetrating attacks and prevents them from reaching the protected resources. We employ the kNN approach [17] as a first line of defense, a screening belt that encircles the resources we want to protect, because only the most trusted devices are allowed to get over this fence. As shown in Fig. 1, only the trusted devices (the small dots inside the figure) can access the requested resource. Other requests (the big dots in the same figure) which do not reach the required level of trustworthiness are kept out.
kNN Algorithm The kNN algorithm is one of the data mining algorithms and is widely used for the classification of data. It is also simple and efficient in estimation and prediction, since it identifies the closest k neighbors by determining the distance of similarity of the input features and then predicts the output values based on the k neighbors' records. The kNN algorithm has several advantages, among them that it is quick and easy to implement. The kNN algorithm classifies the inputs depending on the learning data collected from the nearest k neighbors. Figure 2 shows the classification of a new point. The main problem with kNN is its prediction accuracy; as a machine learning algorithm, it is prone to some degree of inaccuracy. In our work, these inaccuracies are handled in the second stage of the proposed approach via the TM ontology's semantics. The output of the classification is

Fig. 1 Relation between the two shields


Fig. 2 Classification of a new point

crucial to our model since it screens out the big bulk of incoming requests and reduces the processing time of the second stage of the proposed approach. The classification is done in two steps, training and testing. In the first step, we take fifty devices as a training set, each of which has five features (device ID, IP address, MAC address, serial number, model ID). In the second step, an unknown device's features are fed as a test vector to predict the device's category (whether it is Trusted or UnTrusted). Hence, the kNN output represents the classification of the unknown device/user into one of two classes, specifically Trusted or UnTrusted. kNN uses the Euclidean distance (Eq. 1 below) between the input value (the vector of the new point) and the first vector of the training set. This process is repeated over the whole training set by comparing the results with the next vector at each iteration and keeping the smallest distance among them. As a result, kNN finally gets the best output. To avoid ties, k is usually chosen to be an odd number, or the value of k with the lowest error rate is chosen [11]. In our work, if we set the value of k to be more than 1, each run of the kNN gives a different result for a sample. However, when the value of k is set to 1, the algorithm detects a unique class (either Trusted or UnTrusted). The Euclidean distance is given as in Eq. 1:

D(X, Y) = \sqrt{\sum_{j=1}^{n} (Y_j - X_j)^2}    (1)

Assuming that X_j is an input tuple with p features (X_{j1}, X_{j2}, …, X_{jp}) such that j = 1, 2, …, n, where n is the total number of input tuples, and i = 1, 2, …, p, where p is the total number of features. Algorithms 1 and 2 describe kNN behavior within the proposed model.
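The 1-nearest-neighbor screening described above can be sketched in a few lines of plain Python (an illustrative toy, not the authors' Java implementation; the numeric encoding of the five device features is assumed for the example):

```python
import math

def euclidean(x, y):
    # Eq. 1: D(X, Y) = sqrt(sum_j (Y_j - X_j)^2)
    return math.sqrt(sum((yj - xj) ** 2 for xj, yj in zip(x, y)))

def knn_classify(training, query, k=1):
    """training: list of (feature_vector, label); query: feature vector."""
    # Sort the training points by distance to the query (non-decreasing)
    # and keep the first k, as in steps 2-4 of Algorithm 1.
    neighbors = sorted(training, key=lambda t: euclidean(t[0], query))[:k]
    labels = [label for _, label in neighbors]
    # Majority vote among the k nearest labels (k = 1 gives a unique class).
    return max(set(labels), key=labels.count)

# Toy training set: each device is (device ID, IP, MAC, serial, model)
# encoded as numbers; labels are the two classes used in the paper.
training = [
    ((1, 10, 20, 30, 40), "Trusted"),
    ((2, 11, 21, 31, 41), "Trusted"),
    ((9, 99, 88, 77, 66), "UnTrusted"),
]
print(knn_classify(training, (2, 12, 22, 32, 42), k=1))  # → Trusted
```

With k = 1 the vote degenerates to the single closest neighbor, matching the paper's choice of k.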

3.2 Second Line of Defense From our point of view, the first defensive line is not enough, since machine learning techniques are sometimes inclined to be inaccurate. Moreover, restricting access to the protected resources according to the semantic information associated with the requests is very appropriate, since this information can be modeled into strict OWL


Fig. 3 Main concepts of the TMO ontology

rules that prevent any suspected (untrusted) access. Additionally, the heterogeneity property of DDNs brings some issues that must be resolved correctly, and semantic techniques can do that because they have the ability to distinguish the types of the devices and requests and their properties.
Trust Management Ontology In our framework's ontology, the TMO, there are eleven main classes: User, Can_View, Certificate, Device, Division, Done, Email, Features, Trusted, UnDone, and Untrusted. TMO also includes several object and data properties that are very important and are used to link the system environment's elements. Moreover, individuals that represent instances of the TMO's concepts (classes) are inserted during the design and implementation phases. A few of those individuals are inserted at design time just to test the ontology engineering process and to make sure that the model's ontology is progressing correctly. The bulk of the individuals are inserted during runtime to check the dynamic behavior of the proposed model. Figure 3 shows the main concepts of the TMO ontology. The workhorse in the TMO ontology is the class Trusted and its subclasses Trusted_i, 1 ≤ i ≤ n, for some positive integer n. Each class Trusted_i has its own OWL complex class definition, and this definition specifies the characteristics each trusted entity must have. For example, Trusted2 is defined to specify that an entity is trusted if it has the characteristics mentioned in Formula 1. We should note that all classes in every OWL ontology are direct/indirect subclasses of the class Thing.

Trusted2 = Device and (belongs_To some (User and (is_Employee some Employee) and (has_Password value "Password6") and (has_UserID value "RBGG567") and (responsible_For value "issuing_Certificates"))) and (has_MAC_Address value "GG-80-YY-17-D9-A1")    (1)

Formula 1. Trusted2 class definition.
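The restrictions a Trusted_i class imposes can be approximated procedurally; the following Python sketch hard-codes the checks of Formula 1 over a plain dictionary (property names and values are taken from the formula, the existential `is_Employee some Employee` is simplified to a boolean, and this is of course far weaker than real DL reasoning):

```python
def satisfies_trusted2(device):
    """Check the restrictions of the Trusted2 OWL class on a plain dict."""
    owner = device.get("belongs_To", {})  # the User the device belongs_To
    return (
        device.get("has_MAC_Address") == "GG-80-YY-17-D9-A1"
        and owner.get("is_Employee") is True  # simplified existential
        and owner.get("has_Password") == "Password6"
        and owner.get("has_UserID") == "RBGG567"
        and owner.get("responsible_For") == "issuing_Certificates"
    )

dev = {
    "has_MAC_Address": "GG-80-YY-17-D9-A1",
    "belongs_To": {
        "is_Employee": True,
        "has_Password": "Password6",
        "has_UserID": "RBGG567",
        "responsible_For": "issuing_Certificates",
    },
}
print(satisfies_trusted2(dev))  # → True
```

In the actual framework this membership test is not hand-coded; the Pellet reasoner derives it from the OWL class definition.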


Each class in the model’s ontology is defined to serve a dedicated purpose, for example Formula 2 specifies Can_Download class using OWL. This class states that a device can download the resource if it belongs (subsumed by) to class Trusted, and it is possessed by some user and that user has already paid the certification issuing’s fee. Can_Download = Device and T r usted and (belongs_T o some (U ser and ( pay Fee value tr ue)))

(2)

Formula 2. Can_Download class representation in OWL.
The required information, such as MAC address, user ID, user password, etc., is also encoded in the model's ontology TMO. The trust is evaluated according to the following two algorithms:

Algorithm 1 (kNN)
Input: device ID, IP address, MAC address, serial no., model ID.
Output: Nearest class label {Trusted, UnTrusted}.
1. Calculate D(X, Y) = \sqrt{\sum_{j=1}^{n} (Y_j - X_j)^2}, where D is the Euclidean distance among points.
2. Sort the calculated n Euclidean distances in non-decreasing order.
3. Let k be a positive integer; take the first k distances from this sorted list.
4. Find those k points corresponding to these k distances.
5. Let k_i be the number of points among the k points belonging to the ith class, i.e., k ≥ 1.
6. If k_i > k_j ∀ i ≠ j, then put x in class i.
7. Output class i.
8. end {kNN}.

Algorithm 1 is dedicated to achieving the kNN machine learning. Its output is fed to Algorithm 2.

Algorithm 2 (Ontology_based TM)
Input: Request(device ID, IP address, MAC address, serial no., model ID), TMO.
Output: Semantic class classification {Trusted, UnTrusted}.
1. T = kNN(device ID, IP address, MAC address, serial no., model ID)
2. If T = UnTrusted then
3.   exit()
4. Insert(Anonymous, TMO)
5. Add_Data_Properties(Anonymous, TMO)
6. Add_Object_Properties(Anonymous, TMO)
7. Call_DL_Reasoner(Pellet, TMO)
8. If Anonymous ⊑ Trusted then
9.   return (Trusted)
10.   exit()
11. else
12.   return (UnTrusted)
13.   exit()
14. end {Ontology_based TM}

Algorithm 2 is responsible for classifying the trusted entities. It calls the kNN algorithm first, and then it applies its own ontology-based classification, so that it prevents any malicious or suspected requests. After calling the kNN approach and getting its output, Algorithm 2 feeds this output to the TM ontology, TMO, as an undetermined class Anonymous. This class, namely Anonymous, is then updated with the object and data properties that are associated with the request (user ID, user password, etc.). Once this step is complete, DL reasoning is issued within the Java environment to verify whether the class Trusted subsumes the class Anonymous. If this subsumption is successful, which means that there is a class Trusted_i that is a subclass of Trusted and is equivalent to the class Anonymous, then Algorithm 2 returns a positive signal to the request and allows it to proceed.
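The two-stage flow of Algorithm 2 can be sketched as a plain-Python pipeline; both stages here are toy stand-ins (a 1-NN screen over numeric features, and a set-inclusion check in place of Pellet's DL subsumption), so this is only a shape of the control flow, not the authors' implementation:

```python
def knn_screen(features, training):
    # First line of defense: 1-NN over numeric device features (toy).
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(training, key=lambda t: dist(t[0], features))
    return label

def ontology_check(request, trusted_definitions):
    # Second line of defense: does any Trusted_i definition "subsume" the
    # request? Each definition is a set of (property, value) pairs that the
    # request must carry -- a crude stand-in for DL subsumption.
    return any(defn <= set(request.items()) for defn in trusted_definitions)

def two_line_defense(features, request, training, trusted_definitions):
    if knn_screen(features, training) == "UnTrusted":
        return "UnTrusted"          # screened out by the first shield
    if ontology_check(request, trusted_definitions):
        return "Trusted"            # "subsumed" by some Trusted_i
    return "UnTrusted"

training = [((0, 0), "Trusted"), ((9, 9), "UnTrusted")]
trusted_definitions = [{("has_UserID", "RBGG567"), ("payFee", True)}]
request = {"has_UserID": "RBGG567", "payFee": True}
print(two_line_defense((1, 1), request, training, trusted_definitions))  # → Trusted
```

The point of the structure is that the cheap kNN screen rejects the bulk of requests before the expensive (ExpTime) reasoning stage ever runs.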

4 Implementation Results In the design phase, an OWL ontology is engineered for the DDN. This ontology is fed with instances so that we can measure the efficiency of Pellet reasoning [8] for different ontology sizes. An inference is executed inside the Java environment to provide dynamic responses during the implementation phase. The kNN algorithm is encoded inside the same Java application, and it is called before any semantic processing is done. The output of the kNN makes up one of the inputs to the semantic processing. New axioms are then added to the model ontology to increase the size of the TM ontology so that we can check the scalability of the knowledge base of our system (initially 385 axioms). After that, we repeatedly add 1000 random axioms to the method's ontology. At each repetition (starting from 385 axioms and after every addition of 1000 axioms), we first run Pellet reasoning on the current knowledge base, and then we add 1000 axioms to the ontology without stopping the synchronization of the reasoner. The practical time of executing the reasoning process on the TM ontology is given in Table 2 and Fig. 4. The time is measured in milliseconds. We should note that the reasoning process is not used only for classification; it is also used to check ontology consistency, concept subsumption, and results traceback.

Table 2 Time in ms using Pellet reasoner

| Axioms | 385 | 1385 | 2385 | 3385 | 4385 |
| Time (ms) | 153.6 | 447 | 669.2 | 606.8 | 968.4 |


Fig. 4 Pellet reasoning time versus number of axioms

Results shown in Fig. 4 and Table 2 clarify that even with large numbers of axioms, the reasoning time is still acceptable for practical uses of the approach. Another use of the semantic web technologies that we employ in this work is querying the knowledge base of the represented domain via a SPARQL (SPARQL Protocol and RDF Query Language) query. The query is applied to the RDF graph that is equivalent to the OWL ontology. For example, if we need to know who is the user of the trusted device dev7, we just need to issue the SPARQL query shown in Fig. 5. The result of the query is given in Table 3.
Fig. 5 SPARQL query issued on the knowledge base

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX w1: <…>
PREFIX xml: <http://www.w3.org/XML/1998/namespace>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?d ?u ?m
WHERE {
  ?d rdf:type w1:Trusted3 .
  ?d w1:belongs_To ?u .
  ?d w1:has_MAC_Address ?m .
}

Table 3 Results of the SPARQL query

| Device | User | Mac_Add |
| dev7 | user7 | 11-11-11-10-61-54 |
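The query in Fig. 5 is, at bottom, triple-pattern matching over the RDF graph; a self-contained Python sketch over a toy triple store shows the same pattern join (the w1: triples below are invented for illustration, not taken from the actual TMO knowledge base):

```python
# Toy RDF graph: (subject, predicate, object) triples.
triples = [
    ("dev7", "rdf:type", "w1:Trusted3"),
    ("dev7", "w1:belongs_To", "user7"),
    ("dev7", "w1:has_MAC_Address", "11-11-11-10-61-54"),
    ("dev9", "rdf:type", "w1:Untrusted"),
]

def query_trusted_devices(graph):
    """SELECT ?d ?u ?m WHERE { ?d rdf:type w1:Trusted3 .
                               ?d w1:belongs_To ?u .
                               ?d w1:has_MAC_Address ?m . }"""
    results = []
    # Bind ?d to every subject typed as w1:Trusted3 ...
    devices = {s for s, p, o in graph if p == "rdf:type" and o == "w1:Trusted3"}
    for d in sorted(devices):
        # ... then join the remaining two patterns on that binding.
        users = [o for s, p, o in graph if s == d and p == "w1:belongs_To"]
        macs = [o for s, p, o in graph if s == d and p == "w1:has_MAC_Address"]
        results += [(d, u, m) for u in users for m in macs]
    return results

print(query_trusted_devices(triples))  # → [('dev7', 'user7', '11-11-11-10-61-54')]
```

A real SPARQL engine performs the same joins over the reasoner-materialized graph, which is why inferred memberships (such as dev7 being Trusted3) are visible to the query.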


Fig. 6 Explanation and traceback facility

One more benefit of using semantic web technologies with reasoning is that we can trace back any action that was done previously, such as access operations. For example, to understand why Device dev1, which belongs to User user1, is also a subconcept of the Trusted class, we just need to invoke the explanation facility of Protégé reasoning (or call it from inside our Java application), as shown in Fig. 6. This facility makes it possible to detect and trace back any penetration (if any) and why it happened.

4.1 Computational Complexity of the Proposed Approach As stated previously, we employ the kNN algorithm for the first defense line. This is because kNN is regarded as one of the simplest supervised machine learning algorithms, and it has been well known and investigated over the last 30 years [17]. The worst-case time complexity of the kNN algorithm depends on the dimensionality of the training dataset. Even when using a brute-force neighbor search, the time complexity is still highly acceptable since it approaches O(n × m), where n is the number of training examples and m is the number of dimensions in the training set. For simplicity, assuming n ≥ m, the complexity of the brute-force nearest neighbor search is O(n). However, the kNN computational complexity is not the dominant term in the proposed method, because the trust management uses description logic (DL) reasoning in the second stage, which in turn has the highest processing time complexity. We use DL with expressivity ALCF(D), which has a computational complexity of ExpTime (exponential time) in the worst case. Limitations of the proposed method are that it assumes an authentication capability is in place and builds on this assumed capability. Also,


the worst theoretical time complexity of the proposed method is high compared to the existing authorization methods.
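The O(n × m) cost of the brute-force neighbor search discussed above is easy to verify by counting distance-term evaluations; a small illustrative sketch (n = 50 devices and m = 5 features, as in the paper's training set):

```python
def brute_force_distance_ops(training, query):
    """Count squared-difference evaluations in a brute-force 1-NN scan."""
    ops = 0
    best = None
    for vec, _label in training:
        d = 0.0
        for x, q in zip(vec, query):   # m terms per training example
            d += (q - x) ** 2
            ops += 1
        best = d if best is None else min(best, d)
    return ops                          # exactly n * m

n, m = 50, 5                            # 50 devices, 5 features each
training = [(tuple(range(m)), "Trusted")] * n
print(brute_force_distance_ops(training, tuple(range(m))))  # → 250
```

This linear cost is dwarfed by the ExpTime DL reasoning of the second stage, which is why the kNN screen is placed first.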

5 Conclusion and Future Work In this paper, we showed that semantic-based techniques, mixed with machine learning methods, can provide extra protection within a DDN such that the resources in such systems become resistant to several kinds of untrusted attempts to access them. Also, our approach contributes to providing a solution for heterogeneity, dynamicity, and variability issues in DDNs. The paper offers a proof-of-concept implementation and demonstrates the approach's complexity analysis. On one hand, the practical aspect of the proposed approach is that it can be used, in conjunction with any state-of-the-art authentication paradigm, to provide solid and secure access to the protected resources. On the other hand, the theoretical aspect of the proposed approach highlights the power, capability, and appropriateness of semantic-based techniques in building security software systems. For future work, we plan to add another (third) shield for protection. This shield represents the final authorization process, and it is to be encircled or embedded by the proposed framework. This will increase the security of the protected resources in the DDN. Moreover, to reduce the impact of the proposed method's limitations, the description logic reasoning complexity can be reduced using incremental DL reasoning and by employing more lightweight OWL ontologies instead of highly expressive ontologies.

References
1. N. Karthik, V.S. Ananthanarayana, An ontology-based trust framework for sensor driven pervasive environment, in 2017 Asia Modelling Symposium (AMS) (IEEE, 2017)
2. K. Kravari, N. Bassiliades, Ordain: an ontology for trust management in the internet of things, in OTM Confederated International Conferences on the Move to Meaningful Internet Systems (Springer, Cham, 2017)
3. Y. Sun, Y. Zhao, Dynamic adaptive trust management system in wireless sensor networks, in 2019 IEEE 5th International Conference on Computer and Communications (ICCC) (IEEE, 2019)
4. R. Iqbal et al., Trust management in social internet of vehicles: factors, challenges, blockchain, and fog solutions. Int. J. Distrib. Sens. Netw. 15(1), 1550147719825820 (2019)
5. C. Esposito et al., Trust management for distributed heterogeneous systems by using linguistic term sets and hierarchies, aggregation operators and mechanism design. Future Gener. Comput. Syst. 74, 325–336 (2017)
6. M.L. Zeng, P. Mayr, Knowledge organization systems (KOS) in the semantic web: a multidimensional review. Int. J. Digit. Libr. 20(3), 209–380 (2019)
7. P. Bellavista, A. Montanari, Context awareness for adaptive access control management in IoT environments. Secur. Priv. Cyber-Phys. Syst. Found. Princ. Appl. 2(5), 157–178 (2017)
8. M.A.H. AL-Wahah, Semantic-Based Access Control Mechanisms in Dynamic Environments


9. B. Kiselev, V. Yakutenko, An overview of massive open online course platforms: personalization and semantic web technologies and standards. Procedia Comput. Sci. 169, 373–379 (2020)
10. N. Dalwadi, B. Nagar, A. Makwana, Performance evaluation of semantic reasoners, in Proceedings of the 19th International Conference on Management of Data (2013)
11. A.P. Salim, K.A. Laksitowening, I. Asror, Time series prediction on college graduation using kNN algorithm, in 2020 8th International Conference on Information and Communication Technology (ICoICT) (IEEE, 2020)
12. N.F. Noy, D.L. McGuinness, Ontology development 101: a guide to creating your first ontology (2001)
13. https://protege.stanford.edu/products.php. Last accessed 28 July 2020
14. N. Kammoun et al., A new SDN architecture based on trust management and access control for IoT, in Workshops of the International Conference on Advanced Information Networking and Applications (Springer, Cham, 2020)
15. A. Mousa, J. Bentahar, O. Alam, Dependency network-based trust management for context-aware web services. Procedia Comput. Sci. 151, 583–590 (2019)
16. X. Meng, G. Zhang, TrueTrust: a feedback-based trust management model without filtering feedbacks in P2P networks. Peer-to-Peer Netw. Appl. 13(1), 175–189 (2020)
17. A.G. Pertiwi et al., Comparison of performance of k-nearest neighbor algorithm using smote and k-nearest neighbor algorithm without smote in diagnosis of diabetes disease in balanced data. J. Phys. Conf. Ser. 1524(1) (2020)

A Machine Learning-Based Data Fusion Model for Online Traffic Violations Analysis Salama A. Mostafa, Aida Mustapha, Azizul Azhar Ramli, Mohd Farhan M. D. Fudzee, David Lim, and Shafiza Ariffin Kashinath

Abstract Traffic violations occur due to driving or behavioral issues that result in a traffic offense and violate the law. Traffic violations such as running red lights, speeding, and reckless driving translate to millions of traffic infractions every year. This paper proposes a machine learning-based data fusion (MLDF) model for an online traffic violations analysis (OTVA) system. The MLDF model is set to perform cumulative traffic analysis by using a software agent (SA) for decision making and Gradient Boosted Trees (GBT), Naive Bayes (NB), and Random Forest (RF) algorithms for classification. The MLDF model is incorporated in the OTVA system for categorizing traffic violation types online. The performance of the MLDF model, which includes the SA and the NB, GBT, and RF algorithms, is measured and compared in terms of accuracy, recall, precision, and f-measure. The results show that the MLDF model outperforms the single algorithms: GBT achieves 69.86% (±1.28%) accuracy, NB achieves 66.02% (±3.38%) accuracy, RF achieves 69.36% (±0.84%) accuracy, and MLDF achieves 71.88% (±1.23%) accuracy scores. It is hoped that the results of this paper can serve as a baseline for

S. A. Mostafa (B) · A. Mustapha · A. A. Ramli · M. F. M. D. Fudzee · S. A. Kashinath Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, 86400 Parit Raja, Johor, Malaysia e-mail: [email protected] A. Mustapha e-mail: [email protected] A. A. Ramli e-mail: [email protected] M. F. M. D. Fudzee e-mail: [email protected] S. A. Kashinath e-mail: [email protected] D. Lim · S. A. Kashinath Engineering R&D Department, Sena Traffic Systems Sdn. Bhd., 57000 Kuala Lumpur, Malaysia e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_67


investigations related to the use of advanced models to automate the detection of traffic violations.
Keywords Traffic violation · Data fusion · Machine learning · Prediction · Software agent · Decision making

1 Introduction A road traffic violation is an act that affects people, properties, and vehicles operating on highways and streets. A traffic offense is committed by a motorist, that is, a person who drives a motorcycle, car, lorry, or bus. At some point, a law enforcement official takes action against such drivers, who are referred for a moving infringement, whether for speeding, running a red light, or some other criminal traffic offense [1]. This results in penalties issued as citations, tickets, or warnings, and dealing with these tickets requires an investment of time and money [2]. For some unusual reason, it appears that a considerable number of individuals do not see street activity offenses as constituting wrongdoing; however, they may result in tragic accidents [3]. Subsequently, these violations have turned into a noteworthy public concern worldwide due to the rate of fatalities that they can cause [4]. Studying and understanding traffic violations is important because it leads to better design of effective enforcement and prevention programs to reduce these offenses, eventually increasing traffic flow and reducing road accidents [3]. The literature has shown that various works focus on the socioeconomic and demographic characteristics of criminal offenders. Moreover, several existing research studies in data mining and machine learning (ML) have focused on predicting characteristics of drivers who are ticketed for traffic offenses [4, 5]. However, very limited studies focus on characterizing traffic offenders and violations. Data mining and ML have the potential to improve the automated detection of traffic offenses [6]. Vijayarani et al. [7] combine different types of crime datasets and propose to use data mining techniques to extract new information about crimes. They propose using machine learning algorithms such as Decision Tree, Random Forest, Artificial Neural Network, and Naïve Bayes to analyze crime characteristics and criminal act patterns.
The crime classification includes the characteristics of a traffic violation, theft, murder, assault, rape, cybercrime, kidnapping, vandalism, and trespassing. Yi et al. [8] propose a driving state recognition model based on the ML approach. Five classifiers are implemented in the proposed model, including Random Forest and Naïve Bayes. The data of driving state recognition include road information and personalized characteristics of the driver. The results show that increasing the provided information leads to an increase in the overall prediction accuracy of the classifiers. Li et al. [9] propose a data fusion model for predicting traffic speed. The model implements different types of machine learning algorithms, including Artificial Neural Network, k-Nearest Neighbor, Support Vector Machine, and Regression Tree. The results show that the data fusion model outperforms the individual prediction algorithm. Nweke et al.

A Machine Learning-Based Data Fusion Model for Online …

849

[10] review different types of data fusion models that include multiple classifiers for predicting human activities. The review results show that fusion techniques help in improving the performance of the classifiers. This study aims to build and perform cumulative traffic analysis for improving the classification of traffic violation types. Subsequently, this paper proposes an online traffic violations analysis (OTVA) system based on a data mining approach. The OTVA system includes a machine learning-based data fusion (MLDF) model that integrates a software agent (SA) for decision making and ML algorithms for classification. The classification algorithms used for cumulative traffic analysis are the Gradient Boosted Trees (GBT), Naive Bayes (NB), and Random Forest (RF) algorithms. According to the traffic violation data, these algorithms classify traffic violations into three types: warning, citation, and electronic safety equipment repair order (ESERO). The traffic violation data is sourced from the data.world Web site and consists of the traffic violation records of Montgomery County from 2013 to 2016 [11]. The evaluation of the classification models in performing data fusion (DF) is made based on the criteria of accuracy, recall, precision, and f-measure. This paper consists of five main sections and starts with Sect. 1, which introduces the paper's scope, research problem, several related studies, and the objectives of this study. Section 2 presents the research methodology, which is divided into dataset description, preprocessing, classification, and evaluation steps. Section 3 illustrates the design of the MLDF model and the implementation of the OTVA system. Section 4 presents the results obtained from threefold cross-validation tests. Section 5 concludes the paper by illustrating the main outcomes of this study and proposing future research.
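This section does not spell out the SA's fusion rule, so as a hedged illustration only, one common way to fuse three classifier outputs is a per-record majority vote (the tie-break to GBT below is an assumption for the sketch, not the paper's design):

```python
from collections import Counter

def fuse_predictions(gbt_pred, nb_pred, rf_pred):
    """Majority vote over three base predictions; a three-way tie falls
    back to the GBT prediction (an arbitrary choice for this sketch)."""
    votes = Counter([gbt_pred, nb_pred, rf_pred])
    label, count = votes.most_common(1)[0]
    return label if count > 1 else gbt_pred

# Violation types from the dataset: warning, citation, ESERO.
print(fuse_predictions("citation", "citation", "warning"))  # → citation
print(fuse_predictions("warning", "citation", "ESERO"))     # → warning (tie → GBT)
```

Whatever the actual decision logic of the SA, the point of fusion is the same: a combined prediction that can beat each base classifier, as the reported 71.88% MLDF accuracy versus 66-70% for the single algorithms suggests.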

2 Research Methodology Data mining has a systematic methodology that provides the means to process large historical datasets and produce useful information [12]. Data mining software allows users to analyze data through several steps, generally including preprocessing, feature extraction, feature selection, classification, and evaluation [13]. This research adopts a methodology framework that aims to identify interesting patterns in traffic-system data for predicting traffic violations. The methodology incorporates preparing the data, applying the required data analysis processes, and obtaining a substantial amount of traffic violation information. The results are then fused and translated into final traffic violation types. The research framework of this work includes dataset description, data preprocessing, classification, data fusion, evaluation, and knowledge discovery and representation as experimental phases [12–14]. Figure 1 describes the proposed research framework of this study.


S. A. Mostafa et al.

Fig. 1 Experimental phases of the research framework

2.1 Traffic Violation Dataset The primary purpose of data gathering is to obtain an appropriate dataset that is credible enough to be used for testing and evaluating different types of research, including data mining studies, e.g., [7, 15, 16]. The traffic violation dataset is adopted from the data.world Web site. It contains the Montgomery County traffic violations recorded from 2013 to 2016. The dataset has 35 attributes that describe agency, place description, location, vehicle type, year, make, model, violation type, driver gender, accident date, belts, personal injury, property damage, etc. These attributes are used to classify the traffic violation types into warning, citation, and ESERO.

2.2 Preprocessing Data mining is, in essence, an attempt to extract something valuable from a large volume of raw material. It entails preprocessing procedures for data examination, cleaning, and preparation [17]. Preprocessing removes anomalies from the data and eases the accomplishment of the work [18]. The traffic violation dataset has some attributes with inadequate or missing data, and such insignificant data might degrade the performance of the classification algorithms and the overall model. The Java machine learning library (Java-ML) is used in the data preprocessing phase. Java-ML is designed to make data preparation less demanding, whether for classification or clustering. It provides various preprocessing algorithms that rearrange the data, remove unnecessary columns, and replace the missing values in the dataset. Data selection is the method of choosing the specific features from the initial dataset that are most relevant to the data mining task at hand. This, in turn, improves the execution time of the data mining task and increases precision by removing irrelevant or redundant features. Feature selection is critical to improving the efficiency of data mining algorithms by using only meaningful and useful features.
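As an illustration of the missing-value replacement step (independent of Java-ML itself; the class name and the NaN encoding of missing entries are assumptions for this sketch), mean imputation over numeric columns might look like:

```java
import java.util.Arrays;

// Minimal sketch: replace missing values, encoded as Double.NaN,
// with the mean of the non-missing entries in the same column.
public class MeanImputer {
    public static double[][] impute(double[][] data) {
        int rows = data.length, cols = data[0].length;
        double[][] out = new double[rows][cols];
        for (int c = 0; c < cols; c++) {
            double sum = 0;
            int n = 0;
            for (double[] row : data) {
                if (!Double.isNaN(row[c])) { sum += row[c]; n++; }
            }
            double mean = n > 0 ? sum / n : 0.0;
            for (int r = 0; r < rows; r++) {
                out[r][c] = Double.isNaN(data[r][c]) ? mean : data[r][c];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Column 0 mean of {1, 3} is 2.0; column 1 mean of {4, 8} is 6.0.
        double[][] data = { {1.0, Double.NaN}, {3.0, 4.0}, {Double.NaN, 8.0} };
        System.out.println(Arrays.deepToString(impute(data)));
    }
}
```

Java-ML ships its own replacement filters for this step; the sketch only shows the idea.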


The final set of selected features satisfies certain criteria that are important for dimensionality reduction [13].

2.3 Classification The algorithms used for traffic violation classification are the Gradient Boosted Trees (GBT), Naive Bayes (NB), and Random Forest (RF) algorithms. GBT is an adaptable nonlinear regression method that helps improve the accuracy of tree-based models. Boosting, on which GBT is built, is one of the most powerful learning ideas introduced in the last 20 years; it combines many weak learners into one strong learner. A GBT model aggregates the trees and refines their predictive outcomes progressively through successive estimations. Boosted trees gain precision over ordinary trees at the cost of speed and human interpretability [19]. NB, on the other hand, is a probabilistic classifier based on the Bayes theorem that assumes all variables or factors are independent of each other. The algorithm is simple to construct and works well with large datasets. According to the Bayes theorem, P(A|X) = P(X|A) × P(A) / P(X), where P(A) is the relative frequency of class A samples, so the posterior P(A|X) increases when P(X|A) P(A) increases [15]. Finally, RF is an ensemble of decision trees that is useful for a number of recognition tasks with large training samples, numerous representations, and high-speed streams. RF uses a base learning technique and supports cross-treatment information analysis, so it is able to make more appropriate choices [13].
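As a worked illustration of the Bayes rule used by NB (the class structure and all probability values below are hypothetical, not taken from the paper), the posterior for a class is the prior times the product of per-feature likelihoods, normalized over all classes:

```java
// Sketch of the Naive Bayes posterior computation P(A|X) = P(X|A)P(A)/P(X),
// assuming feature independence so that P(X|A) is a product of likelihoods.
public class NaiveBayesPosterior {
    // Unnormalized score P(X|A) * P(A) for one class.
    public static double score(double prior, double[] likelihoods) {
        double p = prior;
        for (double l : likelihoods) p *= l;
        return p;
    }

    // Normalized posterior: divide by P(X) = sum of scores over all classes.
    public static double posterior(double[] priors, double[][] likelihoods, int cls) {
        double total = 0;
        for (int c = 0; c < priors.length; c++) total += score(priors[c], likelihoods[c]);
        return score(priors[cls], likelihoods[cls]) / total;
    }

    public static void main(String[] args) {
        // Hypothetical two-class example (e.g. citation vs. warning):
        double[] priors = {0.4, 0.6};
        double[][] like = { {0.9, 0.5}, {0.2, 0.3} };
        // score(class 0) = 0.4*0.9*0.5 = 0.18; score(class 1) = 0.6*0.2*0.3 = 0.036
        System.out.println(posterior(priors, like, 0)); // 0.18 / 0.216
    }
}
```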

2.4 Evaluation Metrics This section presents the evaluation metrics applied in this research: accuracy, precision, recall, and f-measure [6, 13, 18].
• Accuracy is the proportion of correctly classified samples (true positives and true negatives) to the total number of samples, as shown in (1).

accuracy = (TP + TN) / (TP + FP + TN + FN)    (1)

• Precision is the proportion of true positives (TP) to all predicted positives, i.e., true positives plus false positives (FP) [20], as shown in (2).

precision = TP / (TP + FP)    (2)


• Recall is the true positive (TP) rate, i.e., the proportion of positive tuples that are correctly identified, as shown in (3).

recall = TP / (TP + FN)    (3)

• f-Measure is the weighted harmonic mean of precision and recall; the score takes both false positives and false negatives into account. Intuitively, it is less straightforward to interpret than precision, but the F1 score is, as a rule, more informative than accuracy alone.

f-measure = 2 × (precision × recall) / (precision + recall)    (4)
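The four metrics above can be computed directly from confusion-matrix counts. A minimal sketch in Java (the counts in the example are hypothetical, not the paper's results):

```java
// Evaluation metrics (1)-(4) computed from confusion-matrix counts.
public class Metrics {
    public static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }
    public static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);
    }
    public static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);
    }
    public static double fMeasure(int tp, int fp, int fn) {
        double p = precision(tp, fp), r = recall(tp, fn);
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Hypothetical counts: TP=80, TN=60, FP=20, FN=40.
        System.out.printf("acc=%.3f prec=%.3f rec=%.3f f1=%.3f%n",
                accuracy(80, 60, 20, 40), precision(80, 20),
                recall(80, 40), fMeasure(80, 20, 40));
    }
}
```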

3 Modeling and Implementation 3.1 Modeling We propose a machine learning-based data fusion (MLDF) model for an online traffic violations analysis (OTVA) system. The MLDF model consists of three processing phases: preprocessing, classification, and data fusion. The preprocessing phase prepares the data by performing cleaning and filling in missing values, as described above. It also splits the incoming data into training and testing sets for the classification algorithms. The splitting considers threefold cross-validation with 60-40, 50-50, and 40-60 train-test ratios. The classification phase includes the training step, which is offline, and the testing step, which is online. The classification algorithms selected in this study to implement the MLDF model are NB, GBT, and RF, as described above. The data fusion phase includes a Software Agent (SA) that has decision fusion and traffic violation checking mechanisms and a results buffering database. Figure 2 shows the three phases of the MLDF model. The SA is a computer program whose architecture comprises observation and decision mechanisms that determine its behavior in an environment or a system [20, 21]. Based on the information provided by the observation mechanism, the agent's decision-making mechanisms trigger actions that contribute to that environment or system [22, 23]. The agent's ability to decide on the best possible actions determines its performance quality [24, 25]. The agent's role in the MLDF model is to decide on the best output of the classifiers by dynamically observing the classifiers' performance and their corresponding outputs [10, 18]. Figure 3 describes the data fusion process of the agent based on the outputs of the three ML algorithms in the MLDF model. The classification results for all the accuracy, recall, precision, and f-measure criteria of the NB, GBT, and RF algorithms are temporarily saved in the buffering


Fig. 2 Machine learning-based data fusion (MLDF) model

Fig. 3 Data fusion of the MLDF model


database [26, 27]. The decision fusion mechanism applies voting functions to determine the final outcome from the three classifiers. The collected votes are weighted based on the training scores for the accuracy, recall, precision, and f-measure criteria [28, 29], and the decision fusion mechanism then processes the weighted votes into the final outcome. Based on this outcome, the traffic violation checking mechanism decides whether to send the traffic case to traffic reporting so that the required action can be taken, or to treat it as a normal traffic case, as shown in Fig. 2.
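The paper does not spell out the SA's exact voting function, so the following is only a sketch under the assumption of a weighted majority vote, where each classifier's weight is its training score:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of weighted-vote decision fusion: each classifier's predicted
// label receives that classifier's weight (e.g. its training accuracy);
// the label with the highest total weight is the fused decision.
public class DecisionFusion {
    public static String fuse(String[] predictions, double[] weights) {
        Map<String, Double> tally = new HashMap<>();
        for (int i = 0; i < predictions.length; i++) {
            tally.merge(predictions[i], weights[i], Double::sum);
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : tally.entrySet()) {
            if (e.getValue() > bestScore) { bestScore = e.getValue(); best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        // GBT and RF predict "citation", NB predicts "warning"; the weights
        // here are hypothetical training accuracies of the three models.
        String fused = fuse(new String[]{"citation", "warning", "citation"},
                            new double[]{0.70, 0.66, 0.69});
        System.out.println(fused); // "citation" (0.70 + 0.69 > 0.66)
    }
}
```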

3.2 Implementation This section presents the implementation of the online traffic violations analysis (OTVA) system. The Java programming language is used to implement the basic system, and the Java machine learning library (Java-ML) is selected for the ML programming. Java-ML is an open-source library that contains a vast collection of data mining and ML algorithms, including algorithms for preprocessing, feature selection, clustering, and classification, and it allows new algorithms to be implemented or included easily. The data fusion agent is implemented using the Java Agent DEvelopment Framework (JADE), an open-source library for implementing different types, mechanisms, and behaviors of software agents. In this work, Java-ML is utilized to construct the GBT, NB, RF, and MLDF models, which are first separated in terms of processing and evaluation. Threefold cross-validation is conducted to measure and evaluate the performance of the models. The cross-validation operation has two sub-processes, training and testing, and during both phases the models are evaluated using the accuracy, precision, recall, and f-measure criteria.
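The 60-40, 50-50, and 40-60 train-test splits used in the threefold validation can be produced by shuffling the record indices once per fold and cutting at the training ratio. A sketch, independent of Java-ML's own sampling utilities:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of ratio-based train/test splitting for the threefold validation.
public class SplitValidation {
    // Shuffle indices 0..n-1 with a fixed seed, then cut at the train ratio.
    // Returns {trainIndices, testIndices}.
    public static List<List<Integer>> split(int n, double trainRatio, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));
        int cut = (int) Math.round(n * trainRatio);
        List<List<Integer>> out = new ArrayList<>();
        out.add(new ArrayList<>(idx.subList(0, cut)));
        out.add(new ArrayList<>(idx.subList(cut, n)));
        return out;
    }

    public static void main(String[] args) {
        // The three folds of the paper: 60-40, 50-50, and 40-60.
        for (double ratio : new double[]{0.6, 0.5, 0.4}) {
            List<List<Integer>> parts = split(100, ratio, 42L);
            System.out.println(ratio + " -> train=" + parts.get(0).size()
                    + " test=" + parts.get(1).size());
        }
    }
}
```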

4 Results and Discussion The performance results of the GBT, NB, RF, and MLDF models are presented in Table 1.

Table 1 Performance results of the GBT, NB, RF, and MLDF models

Algorithm  Accuracy          Precision         Recall            f-Measure
GBT        69.86% (±1.28%)   71.17% (±1.52%)   92.05% (±1.11%)   80.29% (±1.28%)
NB         66.02% (±3.38%)   77.40% (±3.42%)   69.15% (±2.88%)   73.16% (±3.19%)
RF         69.36% (±0.84%)   72.67% (±0.77%)   87.05% (±1.73%)   79.42% (±0.90%)
MLDF       71.88% (±1.23%)   79.43% (±2.97%)   92.62% (±1.10%)   82.41% (±0.99%)

The table shows that the classification accuracy of the GBT is 69.86%, the

classification accuracy of the NB is 66.02%, the classification accuracy of the RF is 69.36%, and the classification accuracy of the MLDF is 71.88%, which is the highest accuracy score among the four. The precision of the GBT is 71.17%, the precision of the NB is 77.40%, the precision of the RF is 72.67%, and the precision of the MLDF is 79.43%, which is also the highest precision score. The recall of the GBT is 92.05%, the recall of the NB is 69.15%, the recall of the RF is 87.05%, and the recall of the MLDF is 92.62%, again the highest recall score. The f-measure of the GBT is 80.29%, the f-measure of the NB is 73.16%, the f-measure of the RF is 79.42%, and the f-measure of the MLDF is 82.41%, the highest f-measure score among the classifiers. Based on these results, the GBT and RF algorithms have both high accuracy and high recall but lower precision. Figure 4 shows a comparison between the GBT, NB, RF, and MLDF. In general, lower precision means the results yield more false positives. NB, on the other hand, had lower accuracy and recall but higher precision; it is a selective classifier that performs well only on data that favor high precision. Moreover, as the sample data size grows, improving the recall rate becomes challenging because it tends to decrease precision.

Fig. 4 Comparison between the GBT, NB, and RF


5 Conclusion In this paper, the online prediction of traffic violation types using the machine learning (ML) approach has been studied. A machine learning-based data fusion (MLDF) model is proposed using three classification algorithms, namely the Gradient Boosted Trees (GBT), Naïve Bayes (NB), and Random Forest (RF) algorithms, together with a Software Agent (SA) for data fusion decision-making. The Java machine learning library (Java-ML) is used to develop and run the MLDF model within an online traffic violations analysis (OTVA) system. A threefold cross-validation approach is used for testing and evaluating the performance of the models. The test results show that the MLDF model performs better than the individual GBT, NB, and RF models, achieving an accuracy of 71.88%, precision of 79.43%, recall of 92.62%, and f-measure of 82.41%. The GBT and RF models have higher accuracy and recall but lower precision than the NB. The findings of this work can serve as benchmark or baseline results for future traffic violation prediction models. In future work, a visualization strategy can be utilized in the OTVA system to illustrate the intensity of traffic violations over geographical areas as well as accident-prone areas. We also intend to apply a feature selection algorithm to improve the features of the dataset and the final results in terms of performance and time. Acknowledgements This project is funded by the Ministry of Higher Education Malaysia under the Malaysian Technical University Network (MTUN) grant scheme Vote K235 and SENA Traffic Systems Sdn. Bhd.

References
1. R. Factor, An empirical analysis of the characteristics of drivers who are ticketed for traffic offences. Transp. Res. F: Traffic Psychol. Behav. 53, 1–13 (2018)
2. J.R. Ingram, The effect of neighborhood characteristics on traffic citation practices of the police. Police Q. 10(4), 371–393 (2007)
3. S. Thapa, J. Lee, Data Mining Techniques on Traffic Violations (University of Bridgeport, CT, 2016)
4. N.A.S. Zaidi, A. Mustapha, S.A. Mostafa, M.N. Razali, A classification approach for crime prediction, in International Conference on Applied Computing to Support Industry: Innovation and Technology (Springer, Cham, 2019), pp. 68–78
5. A. Boukerche, J. Wang, A performance modeling and analysis of a novel vehicular traffic flow prediction system using a hybrid machine learning-based model. Ad Hoc Networks, 102224 (2020)
6. N.A. Mohd, S.A. Mostafa, A. Mustapha, A.A. Ramli, M.A. Mohammed, N.M. Kumar, Vehicles counting from video stream for automatic traffic flow analysis systems. Int. J. 8(1.1) (2020)
7. S. Vijayarani, E. Suganya, C. Navya, A comprehensive analysis of crime analysis using data mining techniques 9(1), 114–123 (2020)
8. D. Yi, J. Su, C. Liu, M. Quddus, W.H. Chen, A machine learning based personalized system for driving state recognition. Transp. Res. Part C: Emerg. Technol. 105, 241–261 (2019)


9. L. Li, X. Qu, J. Zhang, Y. Wang, B. Ran, Traffic speed prediction for intelligent transportation system based on a deep feature fusion model. J. Intell. Transp. Syst. 23(6), 605–616 (2019)
10. H.F. Nweke, Y.W. Teh, G. Mujtaba, M.A. Al-Garadi, Data fusion and multiple classifier systems for human activity detection and health monitoring: review and open research directions. Inf. Fusion 46, 147–170 (2019)
11. J. Major, Montgomery county traffic violations data 2013-201. Montgomery County Open Data. https://data.world/jrm/traffic-violations
12. K. Lan, D.T. Wang, S. Fong, L.S. Liu, K.K. Wong, N. Dey, A survey of data mining and deep learning in bioinformatics. J. Med. Syst. 42(8), 139 (2018)
13. S.A. Mostafa, A. Mustapha, M.A. Mohammed, R.I. Hamed, N. Arunkumar, M.K. Abd Ghani et al., Examining multiple feature evaluation and classification methods for improving the diagnosis of Parkinson's disease. Cogn. Syst. Res. 54, 90–99 (2019)
14. P.S. Patel, S.G. Desai, A comparative study on data mining tools. Int. J. Adv. Trends Comput. Sci. Eng. 4(2) (2015)
15. S.A. Mostafa, A. Mustapha, S.H. Khaleefah, M.S. Ahmad, M.A. Mohammed, Evaluating the performance of three classification methods in diagnosis of Parkinson's disease, in International Conference on Soft Computing and Data Mining (Springer, Cham, 2018), pp. 43–52
16. O.I. Obaid, M.A. Mohammed, M.K.A. Ghani, A. Mostafa, F. Taha, Evaluating the performance of machine learning techniques in the classification of Wisconsin Breast Cancer. Int. J. Eng. Technol. 7(4.36), 160–166 (2018)
17. A. Fatima, N. Nazir, M.G. Khan, Data cleaning in data warehouse: a survey of data preprocessing techniques and tools. Int. J. Inf. Technol. Comput. Sci. 3, 50–61 (2017)
18. J. Liu, T. Li, P. Xie, S. Du, F. Teng, X. Yang, Urban big data fusion based on deep learning: an overview. Inf. Fusion 53, 123–133 (2020)
19. T. Chen, Introduction to boosted trees. Univ. Wash. Comput. Sci. 22, 115 (2014)
20. L. Subramainan, M.Z.M. Yusoff, M.A. Mahmoud, A classification of emotions study in software agent and robotics applications research, in 2015 International Symposium on Agents, Multi-agent Systems and Robotics (ISAMSR) (IEEE, 2015), pp. 41–46
21. S.A. Mostafa, S.S. Gunasekaran, M.S. Ahmad, A. Ahmad, M. Annamalai, A. Mustapha, Defining tasks and actions complexity-levels via their deliberation intensity measures in the layered adjustable autonomy model, in 2014 International Conference on Intelligent Environments (IEEE, 2014), pp. 52–55
22. M.A. Mahmoud, M.S. Ahmad, A. Idrus, Value management-based alternatives ranking approach for automated negotiation. Procedia Comput. Sci. 161, 607–614 (2019)
23. S.A. Mostafa, R. Darman, S.H. Khaleefah, A. Mustapha, N. Abdullah, H. Hafit, A general framework for formulating adjustable autonomy of multi-agent systems by fuzzy logic, in KES International Symposium on Agent and Multi-agent Systems: Technologies and Applications (Springer, Cham, 2018), pp. 23–33
24. A. Idrus, M.A. Mahmoud, M.S. Ahmad, A. Yahya, H. Husen, A negotiation algorithm for decision-making in the construction domain, in International Symposium on Distributed Computing and Artificial Intelligence (Springer, Cham, 2017), pp. 115–123
25. S.A. Mostafa, M.S. Ahmad, M. Annamalai, A. Ahmad, S.S. Gunasekaran, Formulating dynamic agents' operational state via situation awareness assessment, in Advances in Intelligent Informatics (Springer, Cham, 2015), pp. 545–556
26. M.A. Mohammed, K.H. Abdulkareem, S.A. Mostafa, M.K.A. Ghani, M.S. Maashi, B. Garcia-Zapirain, I. Oleagordia, H. Alhakami, F.T. Al-Dhief, Voice pathology detection and classification using convolutional neural network model. Appl. Sci. 10(11), 3723 (2020)
27. A. Mustapha, S.A. Mostafa, M.H. Hassan, M.A. Jubair, S.H. Khaleefah, M.H. Hassan, Machine learning supervised analysis for enhancing incident management process. Int. J. 8(1.1) (2020)
28. S.A. Mostafa, A. Mustapha, S.S. Gunasekaran, M.S. Ahmad, M.A. Mohammed, P. Parwekar, S. Kadry, An agent architecture for autonomous UAV flight control in object classification and recognition missions. Soft Comput. (2021). https://doi.org/10.1007/s00500-021-05613-8
29. A. Boukerche, J. Wang, Machine learning-based traffic prediction models for intelligent transportation systems. Comput. Netw. 181, 107530 (2020)

Review of IoT for COVID-19 Detection and Classification

Maha Mahmood, Wijdan Jaber AL-Kubaisy, and Belal AL-Khateeb

Abstract This paper presents a thorough review of research proposed to address the COVID-19 disease by employing IoT techniques. COVID-19, caused by a member of the large coronavirus family, causes illness in humans and animals. The COVID-19 pandemic has had unforeseeable global implications for the economy, consumers, and social life. The lockdown policy restricted the movement of people as the main measure in the battle against the pandemic, reduced the reach of business operations, and changed the behavior patterns of consumers, who turned to panic stockpiling. The Internet of Things (IoT) has been used as a technology in a variety of applications, and current major research activities show that remote health tracking depends on IoT. IoT can therefore be used to help prevent the spread of COVID-19. IoT is a connection between physical devices and the Internet; the devices can monitor and respond in addition to sensing and recording. IoT continues to be an effective tool for keeping track of infected patients, which helps restrict the spread of COVID-19. In an Internet-based environment, and particularly in the present pandemic crisis, the feasible use of such technological tools can be achieved. The ability to respond to this global challenge continues to be strongly shaped by the innovative application of digital technologies, which can be used to improve performance. Keywords Internet of Things · IoT · Deep learning · COVID-19 · Symptoms monitoring mechanism · Multi-task Gaussian mechanism

M. Mahmood · W. J. AL-Kubaisy · B. AL-Khateeb (B) College of Computer Science and Information Technology, University of Anbar, Ramadi, Iraq e-mail: [email protected] M. Mahmood e-mail: [email protected] W. J. AL-Kubaisy e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_68


1 Introduction COVID-19 is considered a global pandemic that is spreading exponentially. Advances in data mining have improved the classification of suspicious or quarantined persons. Statistics from the WHO show that COVID-19 spreads very quickly among people in close contact when a person who is infected with the virus talks, sneezes, or coughs. Therefore, people with medical conditions or those over 60 years old should self-isolate and contact their medical provider in critical cases [1]. COVID-19 has affected almost every part of the world. By enforcing a full lockdown, India tried to control the situation, yet new cases continued to increase every day. Doctors and other medical practitioners are also among the most vulnerable to COVID-19 infection, since the disease is transmitted very easily from infected to non-infected persons [2]. Sadly, there is no reliable cure or vaccination yet. The production of an effective vaccine is likely to take over a year, largely because the nature of the virus has not yet been fully established [3]. Since this pandemic virus has no effective cure, the social distancing strategies adopted by different countries slow its distribution. The spread is also curtailed by placing embargoes on the movement of people and materials across state and country borders. Meanwhile, scientists and researchers are doing their best to find a proper vaccine but have not yet found an effective one. The pandemic viral spread is growing day by day; the need of the hour is to control the spread of the virus by detecting patients with symptoms and isolating patients with few or no symptoms. IoT helps to automate things and gives the world a new view through sensors and devices; it is playing a vital role in every field, medical or non-medical.
To gather data from the user area, the IoT network and its physical elements, such as mobile units, actuators, sensors, and radio-frequency identification tags, are programmed. The IoT network is employed in several areas, including healthcare systems, smart homes, utility management, agriculture, and defense. IoT is a next-generation communication architecture in which billions of objects will have the ability to communicate with each other intelligently and adaptively [1, 2]. The IoT network can be beneficial in tracing and identifying infected people: an IoT sensor can be implanted in the body of an infected person, and the person's movement can be tracked [4]. New tools have helped to identify multiple infected individuals, recognize the areas where COVID-19 is circulating, and discover information dynamically [5]. Since nearly all apps are at risk of an information security crisis, developers are now seeking to enforce active steps to protect users' privacy [6]. The rest of the paper is organized as follows: Sect. 2 explains the Internet of Things. COVID-19 is described in Sect. 3. Section 4 presents the literature review. Finally, Sect. 5 presents the conclusion and suggests some future research directions.


2 Internet of Things IoT is about small devices and the interplay between them and the user, with the goal of enabling higher-level applications. Research questions arise in optimizing communication with regard to message number and sizes, speeding up performance, minimizing latency for an optimal user experience, and reducing power consumption to increase battery lifetime [7]. IoT can be described as radio-frequency identification technology and Internet-based equipment combined with agreed communication protocols. IoT principles and tools have altered human services, hospital processes, and management [8].

3 COVID-19 Several epidemics have been seen in the past two decades: SARS started in 2002, followed by Swine flu in 2009, Ebola in 2013, MERS in 2014, and COVID-19 in 2019 [9]. Such outbreaks contribute to significant human and economic losses. Emerging innovations have both benefited COVID-19 research, including the production of new treatments and testing methods, and responded to the lack of medical supplies. In the current pandemic, the primary point of consideration when using IoT is the privacy and protection of the collected data, which are unique and essential in terms of patient care [10].

4 Literature Review An IoT framework is proposed by Otoom et al. [1] to capture real-time symptom data in order to detect potential coronavirus cases early, to track the clinical response of those who recovered from the virus, and to compile and examine the data to clarify the nature of the virus. There are five key components in the framework: symptom data collection and upload (using wearable sensors), an isolation/quarantine center, a data processing center (using machine learning algorithms), cloud computing, and health doctors. The research used eight machine learning algorithms to rapidly classify possible coronavirus cases from this real-time symptom information: neural network, k-nearest neighbor (K-NN), support vector machine (SVM), decision stump, decision table, naïve Bayes, ZeroR, and OneR. After choosing the appropriate symptoms, the eight algorithms were tested on a real COVID-19 symptom dataset, and the findings showed that five of the eight algorithms achieved an accuracy of more than 90%. Angurala et al. [2] proposed a Drone Based COVID-19 Medical Service (DBCMS) mechanism to protect medical employees against contamination with COVID-19. The suggested


mechanism could effectively facilitate the recovery phase of COVID-19 patients. Drones are widely used in the area of medical emergencies, and the proposed model uses drone systems to reduce the contamination risk for physicians and other medical professionals, thus avoiding transmission of the disease. It also assumes that isolating individuals at home instead of admitting them to hospitals is an essential step, often enforced through a lockdown or curfew. In this way, if the DBCMS strategy is applied at the cluster level, the spread can be greatly minimized worldwide. Several deep learning-based COVID-19 diagnostic approaches were validated against specific adversarial examples by Rahman et al. [3]. The test results revealed that deep learning models that do not adopt defense mechanisms against adversarial perturbations are vulnerable to such attacks. The authors discussed the method of creating an adversarial example, the execution of the attack model, and the resulting disturbances to current DL-based COVID-19 diagnostic applications. They suggested testing the algorithms against real-world threats, since those involve a greater number of disruptions that human subjects would notice, and proposed research on additional attack vectors, such as posture, aim field changing, facial expression, frame elevation and distance, and others within the poisoned dataset, to see how adversarial failure behaves against adversarial defensive mechanisms in a deep neural network. A decentralized IoT-based biometric face recognition prototype system was developed by Kolhar et al. [4] for cities under lockdown during the COVID-19 outbreak. The output reveals that on separate demanding datasets, namely WIDER FACE (a face detection benchmark dataset) and the Face Detection Data Set and Benchmark (FDDB), the used approach outperforms state-of-the-art approaches.
Face recognition is used to enforce constraints on public gatherings through a three-layer edge computing architecture, and a multi-task cascading deep learning algorithm (CNN) is then used to identify faces. On separate benchmarking datasets such as FDDB and WIDER FACE, the authors compared their approach with state-of-the-art face detection techniques. In addition, the authors measured latency and face detection loads on the cloud computing and three-layer architectures, and the results showed that the proposed method has the edge over the cloud computing architecture. For decentralized conditions, similarly clustered and dispersed conditions of the database server and the framework balance the user load across edge computing, cloud, and latency; the approach achieved better face recognition performance on decentralized loads with respect to edge computing. The "COVID-19 Intelligent Diagnosis and Treatment Assistant Program (nCapp)", based on IoT, is applied by Bai et al. [5] as a medical technology to diagnose and improve the treatment of COVID-19 at an early stage. Eight terminal functions are implemented in real-time online communication with the 'cloud' via the page selection key. A diagnosis of confirmed, suspected, or doubtful COVID-19 infection is automatically produced based on current databases, questionnaires, and examination outcomes, and patients are classified as having mild, moderate, severe, or critical pneumonia. Based on the most recent real-world case results, nCapp also builds an online, real-time updated COVID-19 database; the diagnosis model is likewise updated in real-time to improve


diagnostic precision. In addition, care may be directed by nCapp: front-line doctors, consultants, and managers are connected for consulting and prevention, and COVID-19 patients can be tracked over a long period using nCapp. The goal is for the nCapp system to be upgraded to the national and international levels, allowing consistent diagnosis and treatment of COVID-19 among different doctors from different hospitals. This would block the spread of the illness, avoid infection of physicians, and help prevent and manage epidemics as quickly as possible. A COVID-19 Symptoms Monitoring Mechanism (CSMM) architecture and simulation based on wireless sensor networks and IoT was proposed by AL-Shalabi [6] to monitor individuals suffering from chronic diseases and immune deficiencies during quarantine, as they are more likely to become seriously ill. The mechanism is based on remotely monitoring the patient's health data; the doctor or medical provider can perform the monitoring. For example, if there is a high fever or trouble in breathing, the device can quickly identify an urgent or irregular situation. Consequently, by sending an immediate SMS including the time and patient status, the device issues a warning to the doctor or care professional to act immediately and save the patient's life. The proposed mechanism enables doctors to cope with the vast number of people quarantined remotely, and the ThingSpeak IoT platform provides a graphical output of the sensed data, making the analysis as easy as possible. In addition, to recognize a (suspected) infected person using wearable smart gadgets, a smart edge monitoring system was proposed by Ashraf et al. [7] that is efficient in remotely tracking, alerting, and recognizing heart and cardiac signs, pulse rate, and some human radiological features.
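The CSMM-style threshold alerting described above can be sketched as follows; the thresholds, fields, and class name are hypothetical, as the paper does not specify exact values:

```java
// Sketch of a CSMM-style symptom check: flag abnormal readings and
// produce an alert message for the doctor, or null when readings are normal.
public class SymptomMonitor {
    // Hypothetical alert thresholds (not specified in the paper).
    static final double FEVER_THRESHOLD_C = 38.0;
    static final int LOW_SPO2_PERCENT = 92;

    public static String check(String patientId, double tempC, int spo2) {
        StringBuilder alert = new StringBuilder();
        if (tempC >= FEVER_THRESHOLD_C) alert.append("high fever ").append(tempC).append("C; ");
        if (spo2 < LOW_SPO2_PERCENT) alert.append("low SpO2 ").append(spo2).append("%; ");
        if (alert.length() == 0) return null;
        // In the real system, this message would be sent as an SMS
        // together with a timestamp and the patient's status.
        return "ALERT patient " + patientId + ": " + alert;
    }

    public static void main(String[] args) {
        System.out.println(check("P-17", 38.6, 95)); // fever alert
        System.out.println(check("P-18", 36.8, 97)); // null: normal readings
    }
}
```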
The proposed system includes a constantly updated pattern/map of the contact chain of COVID-19-infected individuals that can extend across the national population. The health and social goal of the proposed study is to assist public health officials, academics, and physicians with intelligent edge monitoring systems to control and handle this epidemic. The recommended model would help identify and trace the infected individual. In addition, it can preserve the patient's data log for research and decision-making using edge computing. A new intelligent edge monitoring technology was proposed in the report that is capable of (a) diagnosing coronavirus infection in the human body using a health monitoring unit and (b) identifying the presumed human-to-human (H2H) virus chain using IoT and deep edge computing, together with its implementation. The proposed model also includes a warning and alarm module to safeguard healthy people if an infected/suspected person reaches any public location. The layered architecture, together with the proposed structure, has been presented in response to these concerns. An IoT-based sensor is proposed by Rasool et al. [8] to check whether a patient's voice shows irregular fluctuations and to sense its pitch; in addition, the embedded thermal imaging sensor can sense a person's temperature and raise an alarm to family members and government officials. The government will be able to test for coronavirus infection in a vast population using this approach, which is based on deep learning techniques. Furthermore, the automated sensor detection and diagnostic device will aid in preventing the virus from spreading. A method was proposed to detect and diagnose a corona viral infection at its early stage, using an IoT-based sensor


M. Mahmood et al.

system called NodeMCU8266. This system focuses on two main symptoms: the first is body temperature, and the other is cough or throat infection. It is essential to detect this virus in the early stages to stop it from spreading. The proposed system offers speedy screening of the infected population and transmits real-time data to the authorities for medical care measures, including isolation and treatment, so that the wildfire spread of this infection can be curtailed. The system detects the voice's pitch and any abnormal fluctuations in the voice, and measures body temperature with the help of a device. This technique does not require skilled manpower such as medical doctors; it can be carried out by protected volunteers. It has been shown that automatic systems could potentially fulfill medical doctors' demands to assess the suspected population at the earliest opportunity. Also, the basic elements of the "must-have" and "good to have" innovations for small companies and innovation were established by Akpan et al. [9]. Then, in December 2019, the unfolding global health pandemic of COVID-19 struck, prompting the need for company practices and remote operations to be digitized and instantly turning what was deemed "good to have" into "essential to have" in order to succeed in an increasingly unpredictable business climate. The study identifies innovations, assesses the disruptive software frameworks and techniques required to build and manage innovation in small companies, and illustrates the complexity of the method and the context in which it takes place. Current reality has demonstrated that advances that allow social enterprise growth, customer relationship management systems, modern communication networks, remotely run augmented reality applications, and IoT are vital to reducing business costs. Predictive and visual analytics, together with big data, are key enablers in the new competitive market for supporting complex business decisions.
Research data and reports have strongly indicated the sluggish speed at which small companies adopt, or are able to adopt, modern technology beyond normal usage of widely used IT infrastructure. Akhund et al. [10] showed that in this new age, injured persons and virus-affected patients can be supported by IoT and robotic systems. The entire planet has lately been recovering from the COVID-19 pandemic. People infected and injured by the virus are helpless because caregivers, clinicians, and others are terrified of the infectious virus. The study used an IoT-based robotic agent that assists injured and virus-affected persons with low-cost systems. The robotic agent can understand the patient's motion in a 360-degree range and follow orders issued through it. Without image processing, the device performs motion recognition with the MPU 6050 accelerometer-gyroscope sensor. To make the machine wireless, radio-frequency communication was used. Meanwhile, Alam [11] suggested that infected people can be supported online through smart devices using IoT and blockchain technologies. Healthcare devices based on IoT capture useful information, offer valuable insight into habits and symptoms, allow remote tracking, and simply provide improved health care and self-determination for individuals. Blockchain, which governs the medical delivery network, facilitates the encrypted transmission of patient health records. Using blockchain and IoT to track people and prevent them from contracting COVID-19, a four-layer architecture is proposed. This study establishes a


structure for COVID-19 infectious disease patients and recognizes clinical conditions and online diagnoses. Mobile applications such as Tawakkalna, Aarogya Setu, and others can be enabled on smart devices such as smartphones; these apps can better monitor patients with COVID-19. Installing smartphone applications on smart devices lowers time and expense and improves the condition of the infected patient. Using IoT and blockchain technology, a four-layer architecture is proposed. Much clinical work focuses on researching, examining, and showcasing the persons affected by the COVID-19 infection. Antibodies developed against the virus can be used to recognize and verify individuals; this might seem far-fetched, but the smartphone can nevertheless be used for diagnosing patient conditions. Smartphones may be turned into instruments to easily diagnose a variety of disease-causing agents such as pathogens and toxins. This system offers a variety of versatile products and services involving patient records, medical informatics, critical well-being data, public health information processing, healthcare practitioners, and information technologies in digital diagnostics and tracking. The impact of health technology varies greatly depending on the market; the success of mobile healthcare systems depends on each service sector's level of growth and accessibility. A study is presented by Singh et al. [12] to highlight, explore, and discuss the general applications of the well-proven IoT philosophy, presenting a viewpoint guide for tackling the COVID-19 pandemic. Twelve essential IoT applications are defined and addressed. The study concluded that IoT allows a patient compromised by COVID-19 to recognize signs and easily get better care. It is helpful for patients, doctors, physicians, and hospital administration structures.
It smartly handles all cases to finally provide the patient and health care with improved support. IoT appears to be an ideal way to watch over the infected patient. In health care, with real-time knowledge, this technology helps sustain quality supervision. IoT is helpful in forecasting upcoming cases of this disease using a statistical approach. Researchers, clinicians, governments, and educators can create a healthier climate to combat COVID-19 through the proper application of technology. Also, Ketu and Mishra [13] proposed a multi-task Gaussian process (MTGP) regression model with improved forecasts of the new coronavirus (COVID-19) outbreak. The goal of the proposed MTGP regression model is to forecast the worldwide COVID-19 outbreak; it would encourage countries to prepare their prevention steps to reduce the overall effect of infectious diseases that spread quickly and broadly. To assess its suitability and correctness, the suggested model's results were compared with those of other prediction models. The significance of IoT-related devices in COVID-19 prevention and identification has been addressed in subsequent research. Five forecasting horizons were used to execute all the experiments: 1 day ahead, and 3, 5, 10, and 15 days ahead. The obtained results showed that the efficiency of the proposed model is higher than that of all other models; under the different horizon conditions, the suggested model achieves the lowest RMSE and MAPE in the experiments. In addition, the researchers discussed the importance of IoT in health care, the importance of IoT for COVID-19 identification, and potential IoT-based methods to mitigate the effect of COVID-19.
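Ketu and Mishra compare forecasting models by RMSE and MAPE at horizons of 1, 3, 5, 10, and 15 days. As a reference point, these two standard error measures can be computed as follows (the definitions are the conventional ones; any sample series used with them is invented for illustration):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length series."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)
```

A lower RMSE penalizes large absolute errors, while MAPE expresses error relative to the true case counts, which is why the paper reports both across the different horizons.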


Vijay Anand et al. [14] illustrate an innovative way of combining clinical gadgets and their uses, aligned with human resource and data innovation systems. Both poor and affluent regions can explore the possible benefits of defying the radical COVID-19 pandemic by introducing the IoT strategy while delivering care without partiality to all patient groups. The numerous cloud-based IoT services include knowledge exchange, report authentication, patient monitoring, data gathering, inquiry, hygiene, clinical concerns, and so on. They would shift the operational format of medical care while rewarding a huge number of patients with a high level of treatment and satisfaction, particularly during this COVID-19 lockdown pandemic. Health staff can concentrate on patient zero, locating those who have met the sick person and transferring them to quarantine/isolation. Since COVID-19 emerged in Wuhan, China, IoT tools such as geographic information systems can be used to help stop the pandemic from spreading by serving as an early warning system; scanners at airports around the world could check temperature and other signs. IoT includes several cloud-based facilities and services to better support many patients during the continuing COVID-19 pandemic. In such a critical lockdown period, the remote medical services system is of great importance. The efficient integrated arrangement of devices, smartphones, blogs, directories, and so on allows customers to benefit smartly from these services. IoT also advances its services by building up quality communities of perceptive medical facilities or portable centers. It is an "innovation of distinct benefit" that will uniformly modify procedures. Even in this severe era, its quality administration makes this technique increasingly efficient and profitable.
IoT helps monitor and track known persons and patients with their care requirements in remote areas. A major shift in outlook would likely be observed sooner rather than later by regular medical providers. The computerized revolution would place patients in the hands of cutting-edge technologies and associated goods and would provide both doctors and patients in rural areas with quality health care. Vafea et al. [15] showed the vital focus areas of monitoring COVID-19 that use artificial intelligence technologies, IoT, and big data: the importance of computational statistical modeling, population screening technology, nanotechnology for vaccine production and care, telemedicine, and the introduction of 3D printing and robotics to meet new demands. The authors emphasized the need for cooperation in the science community through an open exchange of information, instruments, and skills. Emerging technology can be used efficiently to help the medical community adapt quickly to COVID-19's increased burden and demands. Technologies are used in COVID-19 testing, treatment, and diagnosis. Recent advances have demonstrated that cooperation between medical scientists and engineers is key to the creation of expeditious and less costly methods to handle the pandemic. Open access to information, technologies, and tools is important for a prompt response in the context of accelerated, global disease transmission. To continue to have answers at this difficult time, researchers and engineers must continue to cooperate and exchange skills. While there is a need for accelerated growth and deployment of novel technology,


safety control should not be neglected. It is important to respect existing principles surrounding patient-generated data, including confidentiality. When emerging technologies are used, the safety requirements for the manufacture and delivery of supplies and services should be constantly checked. Normal clinical trials can also be performed with therapies found using new technology. To achieve the best results for patients, the community must work to uphold all these safety standards. Singh et al. [16] developed an IoT-based wearable quarantine band (IoT-Q-Band) to detect absconding. The cost, the disruption of the global supply chain, and the quarantine period of COVID-19 recommended by the WHO were kept in mind when developing it. This wearable prototype reports and records absconding quarantine subjects in real time through the bundled smartphone app. IoT-Q-Band is an economical option to discourage the spread of COVID-19 that could help low-income areas. The wearable band's components, mode of operation, and battery were chosen so that it stays operational throughout the quarantine period, which is validated via a current-consumption analysis. Reuse of the IoT-Q-Band should be avoided due to the contamination risk; thus, the prototype cost is kept low by reusing many smartphone features. As an economical solution, the IoT-Q-Band could benefit low-income areas of the world, where it could be used to keep track of quarantine subjects. Petrović and Kocić [17] introduced an affordable, small-sized IoT-based approach to improve indoor protection for COVID-19, covering several related aspects: (1) mask identification, (2) social distance control, and (3) contactless temperature sensing. An Arduino Uno with an infrared sensor or thermal camera is used in the contactless temperature sensing subsystem, while mask recognition and social distance control were carried out using computer vision techniques on a Raspberry Pi with a camera.
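The social distance control in the Raspberry Pi subsystem reduces, at its core, to a pairwise check of the distances between detected people against the 1.5-2 m rule. A minimal sketch, assuming person detections have already been projected from image pixels to ground-plane coordinates in metres (the projection step itself is outside this sketch):

```python
import math
from itertools import combinations

MIN_DISTANCE_M = 1.5  # lower bound of the 1.5-2 m rule mentioned in the text

def distancing_violations(positions):
    """Return index pairs of people standing closer than the minimum distance.

    `positions` is a list of (x, y) ground-plane coordinates in metres,
    e.g. projected centroids of person bounding boxes from the detector.
    """
    violations = []
    for (i, p), (j, q) in combinations(enumerate(positions), 2):
        if math.dist(p, q) < MIN_DISTANCE_M:
            violations.append((i, j))
    return violations
```

The pairwise check is quadratic in the number of detected people, which is acceptable for the handful of people visible in a typical indoor camera frame.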
The result is a cost-effective IoT-based platform that assists organizations in complying with COVID-19 protection laws and guidelines to reduce disease transmission: people with elevated body temperatures should remain at home, wearing a mask is mandatory, and a minimum of 1.5-2 m should be maintained between individuals. The Arduino Uno board with a contactless temperature sensor is used for the first scenario, while a single-board Raspberry Pi device fitted with a camera handles the other two scenarios using computer vision techniques. Chaudhari et al. [18] provided a smart and dedicated gateway to fight the COVID-19 pandemic. IoT does this through its well-integrated network; with this strong interconnectivity, healthcare tracking systems are able to connect to the Internet. The data is monitored remotely at the healthcare center, and in any emergency the system can automatically contact the doctors at the healthcare center/hospital. COVID-19 patients and self-quarantined persons can be appropriately monitored at remote locations with IoT devices. IoT proves to be an excellent way to monitor the infected patient, which will restrict the spread of COVID-19. In health care, IoT apps are useful for maintaining cloud information in real time. With this evidence, more statistical research can be carried out to forecast forthcoming COVID-19 circumstances, creating a better, more positive atmosphere through the proper use of IoT applications. Sensing information (data), distributing it over the Internet, and manipulating it for further decision-making are all part of the IoT. Implementing


IoT healthcare applications allows real-time tracking of patients, resulting in faster rehabilitation, and can be successfully used for COVID-19 monitoring and prevention. IoT applications have proved successful in combating the COVID-19 pandemic. This study examines the effects of COVID-19 and the role of IoT healthcare applications in effectively overcoming or combating the pandemic through technology. Also, Hassan Kumbhar et al. [19] proposed a connected paradigm using IoT-based health monitoring and CNN-based object detection methods. The proposed scheme aims to contain the spread as soon as possible while allowing people to continue their social activities. The scheme identifies social distancing violations using CNN-based object detection and tracks exposed or infected people using a smart wearable. First, the server receives a continuous video stream from the connected surveillance cameras to identify any social distance breaches, where the inter-object distance is observed. A breach triggers area-based tracking of frequent smartphone users and their existing disease state. If there is a person with elevated symptoms or a suspected patient, the device monitors compromised and exposed persons, and the necessary action is taken. Using a Python simulation of You Only Look Once (YOLO v2 and v3), the researchers tested the suggested scheme for social distance breach detection and infection distribution tracing. The CNN-based YOLO v3 darknet-53 model achieves 90% accuracy in object detection to identify inter-object distance and social distancing violations. The Python simulation successfully traces all exposed people, with the likelihood that they will get infected, with and without masks. The proposed connected paradigm helps contain the COVID-19 pandemic and can serve as a complete system for future disasters.
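The exposure tracing described above (following the chain of people exposed through an infected person, as also envisaged in the H2H contact-chain systems discussed earlier) can be modeled as reachability in a contact graph. A sketch using breadth-first search; the graph representation here is an assumption for illustration, not the papers' actual data model:

```python
from collections import deque

def trace_exposed(contacts, index_case):
    """Return every person reachable from the index case through recorded contacts.

    `contacts` maps each person to the set of people they have been near;
    the returned set is everyone potentially exposed through the index case.
    """
    exposed = set()
    queue = deque([index_case])
    while queue:
        person = queue.popleft()
        for contact in contacts.get(person, set()):
            if contact not in exposed and contact != index_case:
                exposed.add(contact)
                queue.append(contact)
    return exposed
```

In a deployed system the contact edges would come from the wearable/smartphone proximity logs, possibly weighted by exposure duration and mask usage to estimate infection likelihood.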
Recent innovations in sensor technology, wireless communication, and IoT systems, and their usefulness in mitigating the COVID-19 challenges, are presented by Udgata and Suryadevara [20]. The sensor network and IoT framework have been at the top of the research agenda for more than a decade. Many researchers have been working on building models, developing robust theories, and designing societal applications to benefit the general population. The COVID-19 emergency has not given us ample time to think of optimal solutions for various necessities; instead, researchers worldwide reacted to the challenge and started working on solutions. Never in the history of innovation has there been such a pressing and immediate requirement for researchers to come up with solutions and discover answers. Ease of use, reliability and robustness, accuracy, acceptability by users and officials, an enhanced device lifetime, and affordability in terms of cost are significant challenges that need to be addressed to ensure success. Meanwhile, a device that can automatically detect COVID-19 from thermal images with fewer human encounters was suggested by Praveena and Sruthi [21], using a smart helmet with a built-in thermal imaging device. In the smart helmet, the thermal camera technology is coupled and incorporated with IoT technology to monitor the screening process and collect real-time data. The proposed device is also fitted with facial recognition equipment; it can show the personal details of pedestrians and automatically take their temperatures. This proposed architecture meets high healthcare system specifications and could theoretically inhibit the spread of COVID-19. It can be concluded that the remote sensing procedures, which provide an assortment


of ways to identify, sense, and monitor coronavirus, hold incredible promise and potential to fulfill the demands of the healthcare system. An early COVID-19 diagnosis is proposed by Mobo [22], using medical technology to enhance treatment. The application is implemented using 'cloud' online communication in real time. The disease is gaining ground in other nations, leading to a public health epidemic, and multinational collaboration is needed in various quarters. However, while successful health data exchange protocols are stressed as directly linked to the principles of healthy communities, from an expert viewpoint they are also seen as serving a country's economy and its economic and political power. Further effort must be devoted to developing automatic and realistic IoT implementations of smartwatch learning management systems to provide prompt and early identification of outbreaks of certain diseases, to mitigate mortality from disease and deter global dissemination; implementing learning management systems is highly advised, as global collaboration is needed in various areas. Lucca et al. [23] compared works in the literature and created a taxonomy based on the criteria required in a hospital setting to control patient privacy data in the current COVID-19 pandemic. An application was modeled and implemented based on these studies; it yielded adequate results according to the experiments and comparisons of the variables. The authors developed a comprehensive taxonomy, branched into four items with five attributes each, with justification for all the items and their respective attributes. The researchers developed a prototype and an application for the information flow tests that, despite being simple, address the main questions of data privacy.
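The privacy handling that Lucca et al. describe (registry data kept secret via encryption/hash choices so a patient can be referred without disclosing personal data) can be sketched with a salted hash that replaces identifying fields by a pseudonym. The field names and salt handling below are illustrative assumptions, not the paper's actual scheme:

```python
import hashlib
import os
from typing import Optional

def pseudonymize(patient_name: str, national_id: str, salt: Optional[bytes] = None):
    """Replace identifying fields with a salted SHA-256 pseudonym.

    Clinical data (e.g. temperature readings) can then be stored and shared
    under the pseudonym without revealing who the patient is; the salt must
    be kept secret so the pseudonym cannot be reversed by brute force.
    """
    salt = salt or os.urandom(16)
    digest = hashlib.sha256(salt + patient_name.encode() + national_id.encode()).hexdigest()
    return digest, salt
```

With the same salt, the same patient always maps to the same pseudonym, so records can be linked across visits without ever storing the name or ID in the clear.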
The program was designed to be implemented according to environmental requirements, with registration data inputs and distinct encryption/hash choices. The device interfaces with a watch that measures the patient's temperature and delivers care in accordance with the patient's feverish condition, directing referral to the doctor's office or the possibility of discharge. The registry data is kept secret through encryption and privacy criteria by implementing the taxonomic concepts, supporting the agility of medical practitioners in treating patients diagnosed with COVID-19. Temperature surveillance should be carried out on an ongoing basis; if feverish conditions persist for a period specified by the agency, along with other disease-suggestive symptoms, the framework recommends referral of the patient without disclosure of personal data, proposes tracking without permission, and provides utilization definitions and requirements for data protection. Taiwo and Ezugwu [24] proposed a smart home healthcare system for elderly, sick, and handicapped people. The work primarily aimed to make life easier for people with chronic illnesses who must often attend hospital. The new scheme was developed to decrease the number of hospital appointments, reduce hospital stay times, and reduce the cost of caring for the sick. The system performs two functions, healthcare and home appliance control; with this, consumers will have a social life experience and, especially during the pandemic period, also


have their health controlled and tracked. By reducing the transmission rate of communicable diseases, the proposed approach would have a positive effect on quality of life. Patients who are under care and hospitalized for COVID-19 would have no need to travel about constantly, thereby preserving quality of life and reducing the risk of infection. Interestingly, with recent technical developments in the fields of healthcare networks, smart home automation, and IoT technology, the continuing practice of contact-based patient visits is now considered non-obligatory. A remote smart home healthcare support scheme (ShHeS) is developed to this end to track patients' health status and collect prescriptions from doctors while staying at home. Besides this, physicians may still use the data obtained remotely from the patient to identify illnesses. For successful two-way patient-doctor real-time connectivity, an Android mobile application integrated with a web application is used. Sensors are built into the framework for automated capture of patients' physiological health parameters. To correctly perceive physiological parameters and improve system efficiency, a hyperspace analog to context (HAC) has been added to the new control protocol for service exploration and context adjustment in the home environment. Patients would be remotely tracked from their homes within the new system, allowing them to have a comfortable life by using certain features of smart home control systems on their phones. A critical literature review [25] identified the problems faced by healthcare companies in coping with this epidemic outbreak and suggested possible alternatives such as AI and IoT. The study contributes by identifying the problems and categorizing them as operational, individual, organizational, resource-based, technical, and external healthcare concerns.
Proposals and possible research directions for disease prevention are also clarified for clinicians and scholars. The well-being of healthcare staff is crucial, as the number of contaminated healthcare workers is rising day by day. A comprehensive literature analysis of recent epidemic outbreaks and COVID-19 was conducted using appropriate keywords from SCOPUS. Several difficulties were identified during the literature review that create safety concerns for healthcare workers and restrict their operation. Rajesh Kumar et al. [26] recommend the diagnosis and tracking of asymptomatic patients using IoT-dependent sensors. According to government records, the number of laboratory-confirmed cases is increasing in the millions around the globe, forcing the world into full quarantine due to the impact of this contagious virus and a lack of people. Finally, Vijay Anand et al. [27] discussed IoT for the prevention of COVID-19. IoT systems are used to detect patients and offenders in several ways. Remote data collection can be accomplished with the aid of IoT and sensors; later, with computer technology engineers and analysts, the data can be analyzed to forecast and avoid COVID-19. It will transform the way medical services are delivered, rewarding many patients with a higher level of care and greater satisfaction, particularly during the COVID-19 lockdown pandemic. Health staff can concentrate on patient zero, locating those who have met the sick person and transferring them to quarantine/isolation. Since COVID-19 emerged in Wuhan, China, IoT tools such as geographic information systems can be used to help stop pandemics


from spreading by serving as an early warning system. Scanners at airports around the world could check temperature and other signs.

5 Conclusions and Future Works

In this review, a connected IoT-based COVID-19 response allows a more secure and informed environment for restarting the new normal. To help serve a more significant number of patients during the continuing COVID-19 pandemic, IoT offers a large number of cloud-based facilities and services. The utilization of the IoT concept is very helpful to patients, essentially helping to provide them with proper treatment to recover from this disease. IoT deployment in smart home automation has contributed to much easier living. The application of IoT concepts and tools has changed human care, healthcare operations, and management. As a future direction, IoT technology can help with the prevention of COVID-19; as a result, the effects of the disease, such as death rate, hospitalizations, and infections, will be greatly minimized.

References

1. M. Otoom, N. Otoum, M.A. Alzubaid, Y. Etoom, R. Banihani, An IoT-based framework for early identification and monitoring of COVID-19 cases. Biomed. Signal Process. Control 62, 102149 (2020)
2. M. Angurala, M. Bala, S.S. Bamber, R. Kaur, P. Singh, An internet of things assisted drone based approach to reduce rapid spread of COVID-19. J. Saf. Sci. Resilience 1, 31-35 (2020)
3. A. Rahman, M.S. Hossain, N.A. Alrajeh, F. Alsolami, Adversarial examples-security threats to COVID-19 deep learning systems in medical IoT devices. IEEE Internet Things J. (2020). https://doi.org/10.1109/JIOT.2020.3013710
4. M. Kolhar, F. AL-Turjman, A. Alameen, M.M. Abualhaj, A three layered decentralized IoT biometric architecture for city lockdown during COVID-19 outbreak. IEEE Access 8, 163608-163617 (2020)
5. L. Bai, D. Yang, X. Wang, L. Tong, X. Zhu, Y. Song, Chinese experts' consensus on the Internet of Things-aided diagnosis and treatment of coronavirus disease 2019 (COVID-19). Clin. eHealth 3, 7-15 (2020)
6. M. AL-Shalabi, COVID-19 symptoms monitoring mechanism using internet of things and wireless sensor networks. IJCSNS 8(20), 16 (2020)
7. M.U. Ashraf, A. Hannan, S.M. Cheema, Z. Ali, A. Alofi, Detection and tracking contagion using IoT-edge technologies: confronting COVID-19 pandemic, in 2020 International Conference on Electrical, Communication, and Computer Engineering (ICECCE) (IEEE, 2020), pp. 1-6
8. R.M. Rasool, T.F.N. Bukht, N. Bukht, R. Ahmad, P. Malik, Detection and diagnosis system of novel Covid-19 using Internet of Things and send alert. J. Inf. Comput. Sci. 10(7), 1548-7741 (2020)
9. I.J. Akpan, D. Soopramanien, D.H. Kwak, Cutting-edge technologies for small business and innovation in the era of COVID-19 global health pandemic. J. Small Bus. Entrepreneurship, 1-11 (2020)
10. T.M.N.U. Akhund, W.B. Jyot, M.A.B. Siddik, N.T. Newaz, S.A. Al Wahid, M.M. Sarker, IoT based low-cost robotic agent design for disabled and Covid-19 virus affected people, in 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4) (IEEE, 2020), pp. 23-26
11. T. Alam, Internet of Things and Blockchain-Based Framework for Coronavirus (Covid-19) Disease. Available at SSRN 3660503 (2020)
12. V. Singh, H. Chandna, A. Kumar, S. Kumar, N. Upadhyay, K.K. Utkarsh, IoT-Q-Band: a low cost internet of things based wearable band to detect and track absconding COVID-19 quarantine subjects. EAI Endorsed Trans. Internet Things 21(6) (2020)
13. S. Ketu, P.K. Mishra, Enhanced Gaussian process regression-based forecasting model for COVID-19 outbreak and significance of IoT for its detection. Appl. Intell. 1-21 (2020)
14. R. Vijay Anand, J. Prabhu, P.J. Kumar, S.S. Manivannan, S. Rajendran, K.R. Kumar, S. Susi, R. Jothikumar, IoT role in prevention of COVID-19 and health care workforces behavioural intention in India-an empirical examination. Int. J. Pervasive Comput. Commun. (2020)
15. M.T. Vafea, E. Atalla, J. Georgakas, F. Shehadeh, E.K.M. Mylona, M.E. Kalligeros, Emerging technologies for use in the study, diagnosis, and treatment of patients with COVID-19. Cell. Mol. Bioeng. 4(13), 249-257 (2020)
16. V.K. Singh, H. Chandna, A. Kumar, S. Kumar, N. Upadhyay, K. Utkarsh, IoT-Q-Band: a low cost internet of things based wearable band to detect and track absconding COVID-19 quarantine subjects. EAI Endorsed Trans. Internet Things 21(6) (2020)
17. N. Petrović, D. Kocić, IoT-based system for COVID-19 indoor safety monitoring, in Proceedings of the 7th International IcETRAN Conference (2020), pp. 1-6
18. S.N. Chaudhari, S.P. Mene, R.M. Bora, K.N. Somavanshi, Role of Internet of Things (IOT) in pandemic Covid-19 condition. Int. J. Eng. Res. Appl. 10(6) (Series-III), 57-61 (2020)
19. F. Hassan Kumbhar, S.A. Hassan, S.Y. Shin, New normal: cooperative paradigm for Covid-19 timely detection and containment using Internet of Things and deep learning. arXiv e-prints, arXiv:2008.12103
20. S.K. Udgata, N.K. Suryadevara, Advances in sensor technology and IoT framework to mitigate COVID-19 challenges, in Internet of Things and Sensor Network for COVID-19 (Springer, Singapore), pp. 55-82
21. G. Praveena, D. Sruthi, Novel Covid-19 detection and diagnosis system using IOT based smart helmet. JAC J. Compos. Theory XIII(IV), 0731-6755 (2020)
22. D. Mobo, Using learning management systems in an Internet of Things (IOT) smartphone device amidst COVID-19 crisis. Am. Res. J. Comput. Sci. Inf. Technol. 4(1), 1-3 (Forthcoming)
23. A.V. Lucca, R. Luchtenberg, L.G. de Paula Conceicao, L.A. Silva, R.G. Ovejero, M. Navarro-Cáceres, V.R.Q. Leithardt, System for control and management of data privacy of patients with COVID-19. Preprints.org (2020). https://doi.org/10.20944/preprints202007.0369.v1
24. O. Taiwo, A.E. Ezugwu, Smart healthcare support for remote patient monitoring during Covid-19 quarantine. Inf. Med. Unlocked, 100428 (2020)
25. S. Kumar, R.D. Raut, B.E. Narkhede, A proposed collaborative framework by using artificial intelligence-internet of things (AI-IoT) in COVID-19 pandemic situation for healthcare workers. Int. J. Healthc. Manag., 1-9 (2020)
26. N.V. Rajesh Kumar, M. Arun, E. Baraneetharan, A. Kanchana, S. Prabu, Detection and monitoring of the asymptotic COVID-19 patients using IoT devices and sensors. Int. J. Pervasive Comput. Commun. (2020). https://doi.org/10.1108/IJPCC-08-2020-0107
27. R. Vijay Anand, J. Prabhu, P.J. Kumar, S.S. Manivannan, S. Rajendran, K.R. Kumar, S. Susi, R. Jothikumar, IoT role in prevention of COVID-19 and health care workforces behavioural intention in India-an empirical examination. Int. J. Pervasive Comput. Commun. 16(4), 331-340 (2020). https://doi.org/10.1108/IJPCC-06-2020-0056

On the Implementation and Placement of Hybrid Beamforming for Single and Multiple Users in the Massive-MIMO MmWave Systems

Mustafa S. Aljumaily, Husheng Li, Ahmed Hammoodi, Lukman Audah, and Mazin Abed Mohammed

Abstract This paper discusses the implementation of the hybrid beamforming functionality using deep learning techniques for both single-user and multi-user massive MIMO mmWave (SU-mMIMO and MU-mMIMO) systems. First, the DeepMIMO dataset is used to collect the location data of a grid of user locations in a street environment. Then, the collected dataset is used to build and optimize the direct precoding (and combining) architectures for the transmitter (the base station (BS)) and the receiver (the user equipment (UE)), respectively. Different assumptions about these systems are discussed, and the placement and implementation of hybrid beamforming in 5G and beyond systems is discussed in detail with some realistic calculations and recommendations. The second part of the paper goes a step further by deriving the expected beamforming delay when it is performed on premise, in a dedicated core network, and in a cloud core network, and suggests the best place to implement the beamforming functionality for both static and mobile users using realistic parameters and calculations.

M. S. Aljumaily (B) · H. Li Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA e-mail: [email protected] H. Li e-mail: [email protected] A. Hammoodi · L. Audah Wireless and Radio Science Centre (WARAS), Faculty of Electrical and Electronic Engineering, Universiti Tun Hussein Onn Malaysia, 86400 Parit Raja, Batu Pahat, Johor, Malaysia e-mail: [email protected] M. A. Mohammed College of Computer Science and Information Technology, University of Anbar, Anbar 31001, Iraq e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_69


Keywords mmWave · Massive-MIMO · Deep learning · Beamforming · Cloud computing · 5G

1 Introduction

Millimeter waves (mmWaves), ranging between 30 and 300 GHz, are now considered an essential part of the fifth generation of wireless communications, or 5G [1]. Even though the band theoretically covers this entire range, the frequencies between 24 and 30 GHz are sometimes also called mmWave in the literature [2]. Suffering from huge path loss, strong atmospheric absorption, and weak reflections from most surfaces and objects in indoor and outdoor environments, mmWaves need to be sent in narrow beams to travel further, and here beamforming comes to the rescue. Beamforming has been used for many years as a way to exploit the geometric and physical features of the antenna structure and the electromagnetic waves to propagate them in a specific direction and suppress them in other directions [3]. Many types of beamforming have been developed over the years, mainly categorized as analog, digital, and hybrid beamforming [3]. Because of the limitations of analog beamforming with respect to the achievable spectral efficiency, especially for mmWave and 5G applications, and the problems with building fully digital beamforming for commercial applications (cost, hardware complexity, power consumption, and heat emissions), hybrid beamforming has been considered the default choice when building precoding and combining functionalities for massive-MIMO mmWave systems [3–5]. Hybrid beamforming has been the focus of many researchers in the past few years, with emphasis on improving the spectral efficiency [6, 7], reducing the hardware and computational complexities [8, 9], and limiting the required number of RF chains [10]. Recently, a new trend in building beamforming techniques has focused on using machine learning and deep neural networks, as in [5, 11–13].
Such beamforming functionality using different machine learning and deep learning techniques has been suggested to be implemented on premise [5, 13], in the edge cloud [14], in the fog cloud [15, 16], or in the central cloud [11, 17, 18]. The justification for such diverse deployment of beamforming (and other functions) in 5G and beyond networks can be one of many reasons, such as the trade-off between delay and performance, the hardware limitations of on-premise installations when it comes to massive-MIMO, or the type of application and its special requirements. In the second part of this paper, we discuss the feasibility of deploying the beamforming functionality in different parts of the 5G and beyond networks with respect to the delay and its relationship to the accuracy of directing the transmission beams (i.e., the beamforming), both for static and mobile users.


2 System Model

2.1 Deep Learning Based Hybrid Beamforming Optimization with Limited Feedback

In this section, we optimize the deep learning (DL) based hybrid beamforming suggested in [12]. First, we study the effect of using a different optimizer (Nadam) instead of the ADAM optimizer suggested in [12]; then we study the effect of using more than 16 channel training pilots on the beamforming accuracy; and finally, we give some insights about expanding such a system to serve multiple users simultaneously. The suggested beamforming architecture is shown in Fig. 1 [12]. The same parameters as in [12] are used in our work, except that we tried both the ADAM and NADAM optimizers with different numbers of training pilot signals and numbers of paths collected for each multi-path transmission between the BS and the UE.
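The two optimizers compared here differ only in how the momentum estimate enters the parameter update. A minimal NumPy sketch of one Adam step and one Nadam step (textbook update rules, not the exact configuration used in the experiments) illustrates the Nesterov look-ahead that distinguishes Nadam:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for parameter theta with gradient g at step t."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def nadam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Nadam update: Adam with a Nesterov look-ahead on the momentum."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Nesterov correction: blend the bias-corrected momentum with the
    # current gradient before taking the step.
    m_bar = b1 * m_hat + (1 - b1) * g / (1 - b1 ** t)
    return theta - lr * m_bar / (np.sqrt(v_hat) + eps), m, v
```

Both updates move the parameter against the gradient; Nadam's look-ahead tends to react faster to the most recent gradient, which is one plausible reason it helped once enough pilot signals were available.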

2.2 Delay Calculations for Different Placements of Beamforming

As mentioned above, beamforming is now considered an essential part of modern cellular communication systems. With massive-MIMO deployments getting more and more attention, the placement of the beamforming functionality is becoming more important to ensure smooth and accurate operation of these systems. In the case of a static user (or users), it does not matter where the beamforming is implemented, as the system can easily calculate the location of the UE once it sends the

Fig. 1 Neural network based auto-precoder consisting of channel encoder and precoder


Fig. 2 Realistic example of beamforming

pilot signal to get the channel matrix, and the same location and orientation information will remain valid for the rest of the call (the connection). The main problem with beamforming delay appears when the system (i.e., the BS) tries to direct its beams toward mobile users [16, 18]. The channel is assumed to be constant during the coherence time, and if the beamforming is implemented anywhere other than the BS, the total delay can exceed the coherence time. This means that by the time the beamforming information arrives at the BS, it is already stale: the channel status will have changed and the mobile users will be in a new location. Figure 2 shows a realistic example of several sectors within several small cells and their moving beams tracking users' changing locations. If the beamforming is implemented on premise (i.e., in the BS), then the only delay involved is the processing delay, which has been shown to have a negligible effect on the beamforming [19]. But if the beamforming is implemented anywhere other than the local BS, the total delay can be calculated as in Eq. (1):

Delay_Total = Processing_BS + RTT_BS-Core + Processing_Core   (1)

where Delay_Total is the total delay involved in calculating the right direction for each user's transmission (i.e., the beamforming); Processing_BS is the processing time related to the beamforming locally (in the BS); and RTT_BS-Core is the round-trip time of the beamforming information between the BS and the core. This value depends on whether the beamforming functionality in the core runs on a dedicated core or a cloud core, and it can involve additional delay in the case of ML/DL-based beamforming. Finally, Processing_Core is the time it takes for the beamforming algorithms


to calculate the weights of the Base Band (BB) and the RF chains to perform the intended hybrid beamforming.
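As a rough sketch of how Eq. (1) can be checked against the channel coherence time, the snippet below sums the three delay terms and compares them with a common rule-of-thumb coherence time T_c ≈ 0.423/f_D from Clarke's model. The millisecond values and the 28 GHz carrier are illustrative assumptions, not measurements from this paper:

```python
def total_delay_ms(proc_bs_ms, rtt_bs_core_ms, proc_core_ms):
    """Eq. (1): total delay when beamforming runs away from the BS."""
    return proc_bs_ms + rtt_bs_core_ms + proc_core_ms

def coherence_time_ms(speed_kmh, carrier_hz=28e9, c=3e8):
    """Rule-of-thumb coherence time T_c ~= 0.423 / f_D (Clarke's model)."""
    v = speed_kmh / 3.6                # UE speed in m/s
    f_d = v * carrier_hz / c           # maximum Doppler shift in Hz
    return 0.423 / f_d * 1e3

# Illustrative check: remote beamforming is only viable if the total
# delay fits inside the channel coherence time.
delay = total_delay_ms(0.1, 2.0, 0.5)          # assumed values, in ms
tc = coherence_time_ms(60, carrier_hz=28e9)    # 60 km/h mmWave user
feasible = delay < tc
```

For a mobile mmWave user, the coherence time is well under a millisecond, so even a modest core round-trip already exceeds it, which is the quantitative core of the placement argument made in this section.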

3 Simulation Results

The results of the first set of experiments are shown in Fig. 3. From Fig. 3, we can see that using the NADAM optimizer improves the performance when the number of pilot signals exceeds 4, for both the transmitter (the BS) and the receiver (the UE). The other conclusion is that increasing the number of pilot signals beyond 16 does not increase the beamforming accuracy; instead, it degrades the performance, which is a significant limitation of such a system, even though it reduces the required number of training signals to a large extent. Finally, different numbers of collected multi-paths were tried (3, 4, and 5), with no real effect on the overall accuracy, which suggests keeping the number of collected multipaths at 3. These 3 paths include the line of sight (LoS) and the strongest 2 non-line-of-sight (NLoS) paths, or the 3 strongest NLoS paths in case the LoS path is blocked. This system uses the dataset collected by the DeepMIMO tool [20] for a single user. In the scenario of serving multiple users at the same time, the same DL model (trained on all the possible UE locations in the grid of points in front of the BS) can be applied to the multiple users iteratively, which is preferable in the case of many users; alternatively, the DeepMIMO model can be retrained in the case of a few users moving as a group, or close to each other, who try to connect to the BS simultaneously. As mentioned in [11], this DeepMIMO based beamforming system can be implemented in a central

Fig. 3 Beamforming accuracy for different number of channel measurements (pilot signals)

Table 1 Experimental system parameters

β: 6
α: 0.6
IFFT/FFT length: 1024
No. of sub-carriers in each band: 20
No. of symbols in time: 14
Windowing length: 1/14 × 1/sub-carrier spacing
Tx, Rx sub-carrier spacing: 15 kHz
QAM: Adaptive (16, 64 or 256)
Cyclic prefix: 1/14 × W-OFDM subcarrier spacing
Resource blocks (RB): 50
MIMO: 8×8
Channel model: Vehicular-A
UE speed range: 10–120 km/h
Distance between UE and BS: 1000 m
Cloud core: Network slice selection function (NSSF)
Distance between the BS and the core: 100 km
Transmission line between BS and core: Fiber optics E11

Fig. 4 Delay (in ms) between BS and the dedicated core network

or cloud processing unit, and in the next section, we discuss the feasibility of that for the different options (on premise, edge, fog, or central cloud). For the second part of the work, to calculate the expected delay resulting from deploying the beamforming anywhere other than the local BS, we used realistic systems with real dedicated core networks and cloud core networks, with the parameters in Table 1. The first step in calculating the expected delay was to execute a ping command between the BS and both types of core (dedicated and cloud) and record the delay incurred each time, as shown in Figs. 4 and 5. As we see from these figures, the dedicated core took much less time than the cloud core to respond, as the software-defined functions in the cloud core need some


Fig. 5 Delay (in ms) between the BS and the cloud core network

time at first to perform some deep learning training, and only then start responding using the trained model in subsequent requests. This explains the huge delay at the beginning, which reached more than 75 ms, and the reduction in delay in the later ping responses. It indicates that the cloud core will be slower at first, but over time it can deliver performance comparable to a dedicated core network. The other conclusion from these results is that in both cases (dedicated core and cloud core), the delay far exceeds the channel coherence time (i.e., the time during which we assume the channel is stable), which means that implementing the beamforming in such a distant cloud (100 km away from the BS) is not feasible. The obvious alternative is to implement the beamforming functionality on premise, in the mobile edge computing (MEC), or in the fog servers.
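One plausible way to summarize such ping traces is to discard the first few warm-up samples (during which the cloud core is still loading and training its models) before computing RTT statistics. The traces below are illustrative stand-ins, not the measured values behind Figs. 4 and 5:

```python
def summarize_rtt(samples_ms, warmup=3):
    """Mean/min/max RTT after dropping the first `warmup` samples,
    which for a cloud core include model-loading/training latency."""
    steady = samples_ms[warmup:] if len(samples_ms) > warmup else samples_ms
    return {
        "mean": sum(steady) / len(steady),
        "min": min(steady),
        "max": max(steady),
    }

# Illustrative traces (not the paper's measurements):
dedicated = [2.1, 2.0, 2.2, 2.1, 2.0, 2.1]
cloud = [78.0, 55.0, 30.0, 6.0, 5.5, 5.8]   # large warm-up, then settles
```

Separating the warm-up from the steady state makes the comparison fair to the cloud core, whose later responses are close to the dedicated core's, while both steady-state means still dwarf a sub-millisecond coherence time.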

4 Conclusions and Future Works

In this paper, the issue of beamforming placement has been discussed. First, one of the recent direct hybrid beamforming architectures using DeepMIMO and deep learning was optimized. Then, experiments were run on real devices and different core networks to estimate the delay incurred when implementing a delay-sensitive function (like beamforming) in the cloud. The results recommend keeping the beamforming as close as possible to the local network to handle user mobility and the fast-changing wireless channel conditions. Further investigation is required to determine the best place for the beamforming and the relationship of that place to the expected number of users, the channel randomness, the type of application (i.e., fixed or mobile), and the suggested architecture of the beamforming itself.

References

1. A. Ghosh et al., 5G evolution: a view on 5G cellular technology beyond 3GPP release 15. IEEE Access 7, 127639–127651 (2019)
2. W.A.G. Khawaja et al., Effect of passive reflectors for enhancing coverage of 28 GHz mmWave systems in an outdoor setting, in 2019 IEEE Radio and Wireless Symposium (RWS) (IEEE, 2019)
3. M.S. Aljumaily, Hybrid beamforming in Massive-MIMO mmWave systems using LU decomposition, in 2019 IEEE 90th Vehicular Technology Conference (VTC2019-Fall) (IEEE, 2019)
4. A.M. Elbir, S. Coleri, Federated learning for hybrid beamforming in mm-Wave Massive MIMO. IEEE Commun. Lett. (2020)
5. M.S. Aljumaily, H. Li, Machine learning aided hybrid beamforming in Massive-MIMO millimeter wave systems, in 2019 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN) (IEEE, 2019)
6. F. Sohrabi, W. Yu, Hybrid analog and digital beamforming for mmWave OFDM large-scale antenna arrays. IEEE J. Sel. Areas Commun. 35(7), 1432–1443 (2017)
7. S. Sun, T.S. Rappaport, M. Shafi, Hybrid beamforming for 5G millimeter-wave multi-cell networks, in IEEE INFOCOM 2018—IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS) (IEEE, 2018)
8. M. Li et al., A hardware-efficient hybrid beamforming solution for mmWave MIMO systems. IEEE Wirel. Commun. 26(1), 137–143 (2019)
9. M.M. Molu et al., Low-complexity and robust hybrid beamforming design for multi-antenna communication systems. IEEE Trans. Wirel. Commun. 17(3), 1445–1459 (2017)
10. Y. Lu, Y. Zeng, R. Zhang, Wireless power transfer with hybrid beamforming: how many RF chains do we need? IEEE Trans. Wirel. Commun. 17(10), 6972–6984 (2018)
11. A. Alkhateeb et al., Deep learning coordinated beamforming for highly-mobile millimeter wave systems. IEEE Access 6, 37328–37348 (2018)
12. X. Li, A. Alkhateeb, Deep learning for direct hybrid precoding in millimeter wave massive MIMO systems, in 2019 53rd Asilomar Conference on Signals, Systems, and Computers (IEEE, 2019)
13. M.S. Aljumaily, H. Li, Mobility speed effect and neural network optimization for deep MIMO beamforming in mmWave networks. Int. J. Comput. Netw. Commun. 12(6), 1–14 (2020)
14. Y. Liu et al., Prospective positioning architecture and technologies in 5G networks. IEEE Netw. 31(6), 115–121 (2017)
15. M.S. Elbamby et al., Edge computing meets millimeter-wave enabled VR: paving the way to cutting the cord, in 2018 IEEE Wireless Communications and Networking Conference (WCNC) (IEEE, 2018)
16. N. Pontois et al., User pre-scheduling and beamforming with outdated CSI in 5G fog radio access networks, in 2018 IEEE Global Communications Conference (GLOBECOM) (IEEE, 2018)
17. A.K. Bashir et al., An optimal multitier resource allocation of cloud RAN in 5G using machine learning. Trans. Emerg. Telecommun. Technol. 30(8) (2019)
18. M. Kaneko et al., User pre-scheduling and beamforming with imperfect CSI for 5G cloud/fog-radio access networks. IEICE Trans. Commun. 2018ANI0001 (2019)
19. M.S. Aljumaily, H. Li, Throughput degradation due to mobility in millimeter wave communication systems using random beamforming, in 2018 IEEE 88th Vehicular Technology Conference (VTC-Fall) (IEEE, 2018)
20. A. Alkhateeb, DeepMIMO: a generic deep learning dataset for millimeter wave and massive MIMO applications. arXiv:1902.06435 (2019)

Neural Network Based Windowing Scheme to Maximize the PSD for 5G and Beyond

Ahmed Hammoodi, Lukman Audah, Mustafa S. Aljumaily, Mazin Abed Mohammed, and Jamal Rasool

Abstract Windowing and filtering of OFDM-based waveforms are considered the new trend in waveform design for beyond-5G systems. Different types of windows and filters have been suggested to overcome the inherited limitations of OFDM. The first part of this paper suggests and compares the performance of several neural networks (NNs) in selecting the best window, from a list of 19 windows, to achieve the required power spectral density (PSD) subject to an acceptable adjacent channel leakage ratio (ACLR); it shows that with only a few epochs, we can obtain high accuracy with an error rate on the order of 10^-7. The second part goes a step further by using the window selected in the first part in a UFMC system, calculating the resulting PSD and bit error rate (BER), and showing a large improvement over some well-known windows.

Keywords Neural network · UFMC · Windowing · PSD · 5G

A. Hammoodi (B) · L. Audah Wireless and Radio Science Centre (WARAS), Faculty of Electrical and Electronic Engineering, Universiti Tun Hussein Onn Malaysia, 86400 Parit Raja, Batu Pahat, Johor, Malaysia e-mail: [email protected] M. S. Aljumaily Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA e-mail: [email protected] M. A. Mohammed College of Computer Science and Information Technology, University of Anbar, Anbar 31001, Iraq e-mail: [email protected] J. Rasool Department of Communication engineering, University of Technology, Baghdad 1001, Iraq e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2_70


1 Introduction

The fifth generation of wireless communications standard, or 5G, based on 3GPP Release-15 and Release-16 [1], is already being deployed worldwide. Different forms of orthogonal frequency division multiplexing (OFDM) based waveforms have been used in this standard and many other networks in recent years because of the advantages that come with them [2]. Despite being fully compatible with multiple-input multiple-output (MIMO) systems, being simple to build using the fast Fourier transform (FFT), showing immunity to frequency-selective fading channels, and offering efficient modulation and demodulation, OFDM-based systems suffer from many limitations [2]. Some of these include the need for strict synchronization between the Tx and Rx, the large peak-to-average power ratio (PAPR), and the high power and bandwidth losses due to the use of guard bands [2]. To handle these problems, many non-OFDM waveforms have been suggested in the literature, like the universal filtered multi-carrier (UFMC), filter bank multi-carrier (FBMC), and others that rely on different window and filter designs [3]. The first step in the transition from 4G to 5G is the non-stand-alone (NSA) deployment option defined in Release-15 of the 3GPP standard [1], where both the 4G and 5G systems have to coexist in the same network, and much research has been conducted to ensure the green coexistence of the two systems [3–5]. Selecting the best window among several available options has also been investigated in [6, 7], where the windowed Sinc method was used to achieve the best trade-off between time and frequency localization in order to improve the spectral efficiency [6].
In [7], windowing-OFDM (W-OFDM) is discussed as one of the candidate waveforms for 5G and beyond networks to suppress the out-of-band emissions (OOBE) and achieve better spectral efficiency. Reducing the OOBE resulting from the sinc-shaped orthogonal signals of OFDM has been the focus of much recent research [8–10]. Most of these works focus on using some type of window or filter to reduce the OOBE among the sub-carriers or sub-bands of the OFDM-based waveforms for 5G systems and beyond. One of the most famous waveforms following this paradigm is UFMC, which is used in this paper as the basis for the system; different windows are used to evaluate and compare the performance, namely the Kaiser window, the Hanning window, and the suggested convolution of the Kaiser with the Hanning window.
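The windows just mentioned can be generated directly with NumPy. The proposed window is sketched here as the convolution of a Kaiser window (β = 6, the value in Table 3) with a Hanning window, renormalized to unit peak; the window length and the normalization are assumptions, since the paper does not specify them:

```python
import numpy as np

N = 64                           # window length (illustrative)
kaiser = np.kaiser(N, beta=6)    # Kaiser window, beta as in Table 3
hanning = np.hanning(N)          # Hanning (raised-cosine) window

# Proposed window: convolution of Kaiser with Hanning, renormalized
# so its peak is 1 (assumed scaling).
combined = np.convolve(kaiser, hanning)
combined /= combined.max()
```

Convolving the two windows multiplies their frequency responses, so the combined window inherits side-lobe suppression from both, which is consistent with the lower ACLR floor reported later for the proposed scheme.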

2 System Model

2.1 Deep Neural Network Based Window Selection

After collecting 2048 samples of realistic PSD using the 19 different windows shown in Table 1, together with their corresponding outputs, we built several neural networks


Table 1 Window types used by researchers

Root-Raised-Cosine (RRC) window [11]
Hanning (raised-cosine) window [11]
Hamming window [11]
Kaiser window [11]
Modified Bartlett-Hann window [12]
Bohman window [13]
Chebyshev window [13]
Flat top weighted window [13]
Gaussian window [12]
Rectangular window [12]
Taylor window [13]
Triangular window [13]
Tukey (tapered cosine) window [12]
Hankel Kaiser window [11]
4-term Blackman-Harris window [12]
Bartlett window [11]

Table 2 Neural networks performance comparison

NN training algorithm: No. of epochs / Time (s) / Error rate
Levenberg-Marquardt: 10 / 4 / 1.10e−7
Bayesian regularization: 9 / 5 / 7.16e−8
Scaled conjugate gradient: 4 / 1 / 2.94e−7

to predict the relationship between the window selection and the expected PSD. The neural networks were designed in MATLAB with different options; the windows' output dataset was used as the input to the NN and the corresponding PSD dataset as the output. First, we divided the data into training (70%), validation (10%), and testing (20%) sets; then we tried different NN structures and training algorithms. In each trial, we used a 3-layer NN (input, hidden, and output layers). The number of neurons in the hidden layer was set equal to the number of windows (i.e., 19). Then, we compared the results with respect to the error rate, the NN speed (number of epochs required to reach the minimum error), and the fitness accuracy of each model. The comparison results are shown in Table 2. It is clear from the table that training the neural network does not take long, and the best compromise between iterations, training time, and error rate is obtained with the scaled conjugate gradient. These results mean that


using the NN can help select the best window for a maximized PSD, as we will see in the next section.
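The structure described above can be sketched as a minimal stand-in: a 3-layer network with 19 hidden neurons and the 70/10/20 split, trained here with plain gradient descent on synthetic data (MATLAB's Levenberg-Marquardt, Bayesian regularization, and scaled conjugate gradient trainers are not reproduced):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 2048-sample window/PSD dataset.
X = rng.normal(size=(2048, 19))
y = (X @ rng.normal(size=(19, 1))) * 0.1

# 70/10/20 train/validation/test split, as in the paper.
n = len(X)
i_tr, i_va = int(0.7 * n), int(0.8 * n)
X_tr, y_tr = X[:i_tr], y[:i_tr]
X_va, y_va = X[i_tr:i_va], y[i_tr:i_va]
X_te, y_te = X[i_va:], y[i_va:]

# 3-layer network: input -> 19 hidden neurons (tanh) -> linear output.
W1 = rng.normal(scale=0.1, size=(19, 19)); b1 = np.zeros(19)
W2 = rng.normal(scale=0.1, size=(19, 1));  b2 = np.zeros(1)

lr, losses = 0.05, []
for epoch in range(200):
    h = np.tanh(X_tr @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y_tr
    losses.append(float(np.mean(err ** 2)))
    # Backpropagation for the mean-squared-error loss.
    dW2 = h.T @ err / len(X_tr); db2 = err.mean(0)
    dh = (err @ W2.T) * (1 - h ** 2)
    dW1 = X_tr.T @ dh / len(X_tr); db1 = dh.mean(0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

test_mse = float(np.mean((np.tanh(X_te @ W1 + b1) @ W2 + b2 - y_te) ** 2))
```

The validation split (held out of the loop above) would normally drive early stopping; the point of the sketch is the data-flow, not the error rates in Table 2.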

2.2 Adaptive Window Selection to Maximize the PSD

The UFMC is used to build the experimental system; its structure is the same as in [3], with the parameters listed in Table 3. UFMC is a technique that filters sub-carriers prior to transmission and reception so that the ISI and ICI can be eliminated. UFMC is a modulation technique that can be considered a generalized filter bank multi-carrier (FBMC) technique [12, 14, 15] and filtered OFDM. The details of the experiments used to derive the best window for maximum PSD, taking into consideration the adjacent channel power ratio (ACPR), the changing bit error rate (BER), and the time- and frequency-variant channel conditions, are explained in Fig. 1.
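The per-sub-band filtering that defines UFMC can be sketched as follows: map QAM symbols onto each 20-subcarrier sub-band, take a 1024-point IFFT, filter each sub-band with the prototype window, and sum. The number of sub-bands, the filter length, and the QPSK mapping are illustrative assumptions:

```python
import numpy as np

N_FFT = 1024          # IFFT length (Table 3)
SC_PER_BAND = 20      # sub-carriers per sub-band (Table 3)
N_BANDS = 10          # illustrative number of sub-bands
FILT_LEN = 73         # illustrative per-band FIR filter length

rng = np.random.default_rng(1)
window = np.kaiser(FILT_LEN, beta=6)   # per-band prototype filter

tx = np.zeros(N_FFT + FILT_LEN - 1, dtype=complex)
for b in range(N_BANDS):
    # QPSK symbols on this sub-band's carriers (stand-in for QAM).
    bits = rng.integers(0, 2, size=(SC_PER_BAND, 2))
    syms = (2 * bits[:, 0] - 1) + 1j * (2 * bits[:, 1] - 1)
    # Map the sub-band onto its carrier positions and take the IFFT.
    freq = np.zeros(N_FFT, dtype=complex)
    start = b * SC_PER_BAND
    freq[start:start + SC_PER_BAND] = syms
    t_sig = np.fft.ifft(freq)
    # Filter each sub-band separately, then sum: the UFMC principle.
    tx += np.convolve(t_sig, window)
```

Because each sub-band is filtered individually before summation, a sub-band's out-of-band leakage is shaped by the window rather than by the raw sinc spectrum, which is what the window comparison in the next section measures.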

Fig. 1 System model and experimental procedure

Table 3 System parameters

Window: Kaiser, Hanning, and convolution of Kaiser with Hanning
β: 6
α: 0.6
IFFT/FFT length: 1024
No. of sub-carriers in each band: 20
No. of symbols in time: 14
Windowing length: 1/14 × 1/sub-carrier spacing
Tx, Rx sub-carrier spacing: 15 kHz
QAM: 64 and 256
Cyclic prefix: 1/14 × W-OFDM subcarrier spacing
Resource blocks (RB): 50

where the adjacent channel power ratio (ACPR) is calculated as in Eq. (1):

ACPR(Δf_space) = [∫_{−B/2}^{B/2} p_adjacent(f) df] / [∫_{f−B/2}^{f+B/2} p_main(f) df]   (1)

The bit error rate (BER) is calculated as in Eq. (2):

BER = (1/2) (1 − √(SNR / (2 + SNR)))   (2)

The specified Doppler assumed here depends on the maximum Doppler frequency parameter (f_m) of the rounded spectrum, whose power spectral density (PSD) function is defined in Eq. (3):

S(f) = 1 / (π √(1 − f_0²))   for |f_0| ≤ 1
S(f) = 0                     for |f_0| > 1   (3)

where f_0 = f / f_m and f_m = v f / c, with v the velocity of the user equipment (UE) with respect to the base station (BS) and c = 3 × 10^8 m/s the speed of light [15]. The parameters used in the system are listed in Table 3.
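Eqs. (1)–(3) translate directly into code; the discrete ACPR approximates the integrals by sums, and the carrier frequency used in the Doppler example is an illustrative assumption:

```python
import numpy as np

def acpr(psd, main_idx, adj_idx, df=1.0):
    """Eq. (1): adjacent-channel power over main-channel power,
    a discrete approximation of the two integrals."""
    return np.sum(psd[adj_idx]) * df / (np.sum(psd[main_idx]) * df)

def ber_qam(snr_linear):
    """Eq. (2): BER = 0.5 * (1 - sqrt(SNR / (2 + SNR)))."""
    return 0.5 * (1.0 - np.sqrt(snr_linear / (2.0 + snr_linear)))

def jakes_psd(f, f_m):
    """Eq. (3): rounded Doppler spectrum with f0 = f / f_m."""
    f0 = f / f_m
    s = np.zeros_like(f0)
    inside = np.abs(f0) < 1          # strict bound avoids the 1/0 edge
    s[inside] = 1.0 / (np.pi * np.sqrt(1.0 - f0[inside] ** 2))
    return s

# Maximum Doppler for a 120 km/h UE at an assumed 2 GHz carrier:
v, fc, c = 120 / 3.6, 2e9, 3e8
f_m = v * fc / c
```

The BER expression falls toward zero as the SNR grows and equals 1/2 at zero SNR, and the Doppler spectrum peaks at the band edges |f_0| → 1, which is the "bathtub" shape assumed by the Vehicular-A channel model in Table 1 of the previous paper.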


3 Simulation Results

The MATLAB environment was used to implement the system in Fig. 1, with the parameters listed in Table 3. Both the PSD and the BER were used as the key performance indicators (KPIs) to compare the performance of the UFMC system under different windowing schemes. The PSD achieved when using UFMC and CP-OFDM with 64-QAM is shown in Fig. 2. Figure 2 shows the results for 64-QAM UFMC with the different windowing scenarios (Hanning, Kaiser, and the proposed windowing). According to the figure, the efficiency in terms of the PSD of the UFMC has increased with the proposed window-based filtering operation. Also, the traditional Kaiser window-based filter produces a UFMC PSD superior to that of the Hanning CP-OFDM. The ACLR levels of the proposed windowing-based UFMC clearly start at −150 dB, while the systems based on the traditional Kaiser window and the Hanning window with the same β show about −130 dB and −70 dB ACLR, respectively. The baseband mapping can be expanded from 64-QAM to 256-QAM to expand the bandwidth once more. Figure 3 shows the results of increasing the modulation order. In general, the ACLR performance of all windows improved with the higher modulation order. The PSD of the proposed windowing-based UFMC increased slightly compared with 64-QAM, and its ACLR level is −140 dB. In addition, the ACLR performance of the traditional Kaiser window-based UFMC and the Hanning window CP-OFDM degraded with the increase in modulation order; their ACLR levels are −125 dB and −50 dB, respectively. Then, we evaluated the BER of the UFMC and the CP-OFDM with 64-QAM for different windows and compared it with the suggested window in Fig. 4. Finally, the same evaluation and comparison of the achievable BER for 256-QAM was done, with the results shown in Fig. 5.

Fig. 2 PSD evaluation of UFMC and CP OFDM with 64-QAM


Fig. 3 PSD evaluation of UFMC and CP OFDM with 256-QAM

Fig. 4 BER evaluation of UFMC and CP OFDM 64-QAM

According to the BER performance in Figs. 4 and 5, the BER of the proposed Kaiser-based windowing UFMC is slightly worse than that of the Kaiser and Hanning windowing schemes. The best BER performance is achieved with 64-QAM by the traditional Kaiser window-based UFMC and the Hanning window CP-OFDM, but at the expense of degraded PSD performance.


Fig. 5 BER evaluation of UFMC and CP OFDM 256-QAM

4 Conclusions and Future Works

In this paper, several windowing schemes for UFMC were implemented and their performance was compared. Besides the well-known windows, a new window formed as a combination of two windows was proposed, and the NN was used to select the best window to maximize the PSD for each scenario. The performance of the system using the traditional Kaiser and Hanning windows was compared with that of the window selected by the NN (i.e., the proposed combined window), which showed an impressive improvement in PSD at the cost of some degradation in BER. Future work includes implementing more windowing schemes that can improve both the PSD and the BER of the waveform for 5G and beyond systems.

References

1. T.-K. Le, U. Salim, F. Kaltenberger, An overview of physical layer design for ultra-reliable low-latency communications in 3GPP release 15 and release 16. arXiv:2002.03713 (2020)
2. M. Bhardwaj, A. Gangwar, D. Soni, A review on OFDM: concept, scope and its applications. IOSR J. Mech. Civ. Eng. 1(1), 07–11 (2012)
3. A. Hammoodi, L. Audah, M.A. Taher, Green coexistence for 5G waveform candidates: a review. IEEE Access 7, 10103–10126 (2019)
4. A. Hammoodi et al., Green coexistence of CP-OFDM and UFMC waveforms for 5G and beyond systems, in 2020 4th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT) (IEEE, 2020)
5. A. Delmade et al., Performance analysis of analog IF over fiber fronthaul link with 4G and 5G coexistence. J. Opt. Commun. Netw. 10(3), 174–182 (2018)
6. D. Wu et al., A field trial of f-OFDM toward 5G, in 2016 IEEE Globecom Workshops (GC Wkshps) (IEEE, 2016)
7. P. Guan et al., 5G field trials: OFDM-based waveforms and mixed numerologies. IEEE J. Sel. Areas Commun. 35(6), 1234–1243 (2017)
8. K. Mizutani, T. Matsumura, H. Harada, A comprehensive study of universal time-domain windowed OFDM-based LTE downlink system, in 2017 20th International Symposium on Wireless Personal Multimedia Communications (WPMC) (IEEE, 2017)
9. X. Huang, J.A. Zhang, Y.J. Guo, Out-of-band emission reduction and a unified framework for precoded OFDM. IEEE Commun. Mag. 53(6), 151–159 (2015)
10. Y. Mizutani et al., A low-pass filtered time-domain window for DFTs-OFDM to reduce out-of-band emission with low complexity, in 2019 16th IEEE Annual Consumer Communications and Networking Conference (CCNC) (IEEE, 2019)
11. A. Hammoodi, L. Audah, M.S. Aljumaily, M.A. Taher, F.S. Shawqi, Green coexistence of CP-OFDM and UFMC waveforms for 5G and beyond systems, in 2020 4th International Symposium on Multidisciplinary Studies and Innovative Technologies (IEEE, 2020), pp. 1–6
12. H. Lin, P. Siohan, An advanced multi-carrier modulation for future radio systems, in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2014), pp. 8097–8101
13. X. Cheng, Y. He, B. Ge, C. He, A filtered OFDM using FIR filter based on window function method, in 2016 IEEE 83rd Vehicular Technology Conference (VTC Spring) (IEEE, 2016), pp. 1–5
14. B. Farhang-Boroujeny, OFDM versus filter bank multicarrier. IEEE Signal Process. Mag. 28(3), 92–112 (2011)
15. J. Yli-Kaakinen, P. Siohan, M. Renfors, Chapter 8—FBMC design and implementation, in Orthogonal Waveforms and Filter Banks for Future Communication Systems, ed. by M. Renfors, X. Mestre, E. Kofidis, F. Bader (Academic Press, 2017), pp. 157–195
16. A. Hammoodi, F. Shawqi, L. Audah, A. Qasim, A. Falih, Under test filtered-OFDM and UFMC 5G waveform using cellular network. J. Southwest Jiaotong Univ. 54(5) (2019)

Author Index

A
Abdullaev, S. M., 19
Abhishek, U., 117
Achari, Shristi, 579
Adarsh, S., 511
Agarwal, Amit, 51
Agarwal, Harshit, 785
Agarwal, Vatsal, 343
Aggarwal, Apurva, 71
Agrawal, Smita, 393
Al-Berry, M. N., 157
Al-Betar, Mohammed Azmi, 717
Aljumaily, Mustafa S., 873, 881
AL-Khateeb, Belal, 859
AL-Kubaisy, Wijdan Jaber, 859
AL-Shadood, Wurood, 833
AL-Wahah, Mouiad, 833
Amballoor, Renji George, 533
Amintoosi, Haleh, 833
Anamika, 129
Anand, Sameer, 769
Anoop, V. S., 511
Ansari, Mohd. Aquib, 107
Arif, Tasleem, 487
Arora, Mridul, 233
Arora, Rupali, 769
Arora, Sakshi, 303
Asharaf, S., 511
Ashraf, Mudsir, 19
Audah, Lukman, 873, 881
Awadallah, Mohammed A., 717
Ayeni, Olaniyi Abiodun, 431
Ayob, Omeera, 81

B
Bangera, Ananya, 217
Bansal, Avishi, 601
Bansal, Sakshi, 383
Baveja, Irsheen, 797
Bedi, Punam, 1, 225, 465
Behl, Ritin, 819
Bhardwaj, Kirti, 267
Bhatghare, Ashish, 51
Bhut, Muheet Ahmed, 19
Billah, Sifat Nur, 367
Budhraj, Rahul, 615
Butt, Muheet Ahmed, 81, 249

C
Chakraborty, Avishek, 205
Chandankhede, Pragati, 541
Charaya, Nisha, 591
Chauhan, Divyansh, 739
Chawla, Indu, 329
Chitrashekharaiah, Y., 421
Chouhan, Dharamendra, 421
Chug, Anuradha, 751
Chugh, Medha, 693

D
Dahiya, Diksha, 627
Dalbah, Lamees Mohammad, 717
Dhavale, Sunita V., 499
Dilip Kumar, S. M., 421
Dixit, Asmita, 819
Dutta, Maitreyee, 129, 139

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 A. Khanna et al. (eds.), International Conference on Innovative Computing and Communications, Advances in Intelligent Systems and Computing 1394, https://doi.org/10.1007/978-981-16-3071-2


E
Ebeid, Hala M., 157
El-Shahed, Reham A., 157

F
Fayaz, Sheikh Amir, 249
Fayemi, Oluwatosin James, 431
Faysal Al, Jabed, 671
Fudzee, Mohd Farhan M. D., 847

G
Gahlot, Harivansh, 797
Gangadharappa, M., 171
Ganie, Shahid Mohammad, 487
Garg, Ayush, 627
Gaur, Mayank, 233
Geetha, M., 565
Gill, Sakshi, 615
Giridhar, Prashant Shambharkar, 343
Goel, Anmol, 693
Goyal, Khushi, 785
Goyal, Mukta, 565
Gupta, Aditi, 601
Gupta, Ayush, 383
Gupta, Ekata, 565
Gupta, Jayesh, 703
Gupta, Kirti, 233
Gupta, Lakshay, 477
Gupta, Neha, 225
Gupta, Pankaj Kumar, 383
Gupta, Rahul, 739
Gupta, Sarishty, 71
Gupta, Shubhangi, 267
Gupta, Trasha, 343
Gurbani, Nitesh, 477

H
Haji, Sana, 541
Hammoodi, Ahmed, 873, 881
Hirna, Mitul, 61
Hooda, Himanshu, 739
Hossain, Farjana, 367

J
Jahan, Khalid Mahbub, 671
Jain, Aayush, 601
Jain, Aruna, 703
Jain, Somil, 61
Jain, Vibha, 61
Jaiswal, Kapil, 769
Jamshed, Aatif, 819
Jane Nithya, K., 183
Janghel, Rekh Ram, 601
Javheri, Santosh B., 647
Jindal, Rajni, 521, 627
Jindal, Vinita, 225
Johari, Rahul, 579, 693
Joshi, Ishan, 43

K
Kale, Advait, 33
Kansal, Parnika, 171
Karpa, Shubham, 329
Kashinath, Shafiza Ariffin, 847
Kaur, Gurjeet, 797
Kaushik, Ashu, 639
Khan, Aslam Hasan, 319
Khatri, Nidhi, 95
Kherwa, Pooja, 615
Khetarpal, Poras, 289
Kidwai, Farzil, 703
Kour, Navleen, 303
Krishnan, Deepa, 33
Kumar, Anil, 591
Kumar, Ashwni, 171
Kumari, Bharti, 279
Kumar, Kartik, 455
Kumar, Manoj, 553
Kumar, Prabhat, 731
Kumar, Yash, 233
Kumbharkar, P. B., 647
Kushwaha, Ajay Kumar, 71, 353

L
Lal, Sangeeta, 71
Li, Husheng, 873
Lim, David, 847

M
Madan, Suman, 267
Mahmood, Maha, 859
Mahto, Rajesh Kumar, 279
Majithia, Arjun, 51
Makhija, Jigar, 217
Malhotra, Jigyasu, 51
Malik, Majid Bashir, 487
Malik, Sapna, 455
Mandal, Durbadal, 205
Mane, D. T., 647
Mishra, Ambarisha, 279
Mishra, Reyansh, 477
Mishra, Rishik, 661
Mittal, Prabhat, 383
Mohammed, Mazin Abed, 873, 881
Mohan, Yukti, 149
Moorthy, Rahul, 647
Mostafa, Salama A., 847
Mridha, M. F., 367
Mudgil, Pooja, 43
Mugloo, Saahil Hussain, 61
Mustapha, Aida, 847

N
Nagpal, Neelu, 289
Nagrath, Preeti, 233
Naik, Shankar B., 533
Nair, Jayashree, 117
Nair, Sooraj S., 117
Nigam, Narander Kumar, 785

O
Omar, Hamid, 731
Oza, Parita, 393

P
Pandey, Saroj Kumar, 601
Parashar, Abhishek, 149
Patel, Devraj, 499
Patel, Harsh Jigneshkumar, 393
Patel, Hiral A., 95
Patel, Hiral R., 95
Prakash, Varun, 233

R
Raina, Nachiketa, 455
Ram, Gopi, 205
Ramli, Azizul Azhar, 847
Rani, Poonam, 51, 61
Rasool, Jamal, 881
Rastogi, Somil, 71, 353
Richa, 1
Riyaz, Lubna, 81

S
Sahu, Priyanka, 751
Saif, Mohammad, 61
Salal, Yass Khudheir, 19
Seth, Jahnavi, 703
Shankdhar, Ashutosh, 661
Sharma, Aaryan, 289
Sharma, Chhavi, 1
Sharma, Hardik, 33
Sharma, Moolchand, 703
Sharma, Ruchi, 289
Sharma, Shreyans, 615
Shedeed, Howida A., 157
Shetty, Akhil Appu, 217
Shivhare, Shiv Naresh, 477
Shokeen, Jyoti, 51
Shreyas, J., 421
Shukla, Nitya, 661
Shyamala, K., 183
Siddqui, Jamshed, 319
Singhal, Anuradha, 465
Singh, Amit Prakash, 751
Singhania, Tanay, 343
Singh, Dhananjay, 329
Singh, Dinesh, 751
Singh, Dushyant Kumar, 107
Singh, Harjot, 739
Singh, Priti, 591
Singh, Ravinder Pal, 751
Singla, Megha, 139
Sinha, Akash, 731
Sinha, Devyani, 627
Sinha, Divit, 33
Sisodia, Amrita, 521
Sohail, Shahab Saquib, 319
Sondhi, Arushi, 703
Srinidhi, N. N., 421
Suman, N., 565
Sunakshi, 343
Sunanda, 303
Suresh, Sandra, 797
Susan, Seba, 639
Suthar, Keyur, 95

T
Thompson, Aderonke Favour-Bethy, 431
Tomer, Minakshi, 553

U
Udawat, Bharat, 33
Upadhyay, Aashish, 455
Uvaneshwari, M., 565

Z
Zaman, Majid, 19, 81, 249
Zitar, Raed Abu, 717