Computational Intelligence and Data Analytics: Proceedings of ICCIDA 2022 (Lecture Notes on Data Engineering and Communications Technologies, 142) 9811933901, 9789811933905

The book presents high-quality research papers from the International Conference on Computational Intelligence and Data Analytics (ICCIDA 2022).


English | 647 pages [616] | 2022


Table of contents :
Preface
Contents
About the Editors
Container Orchestration in Edge and Fog Computing Environments for Real-Time IoT Applications
1 Introduction
1.1 Case Study: Natural Disaster Management (NDM)
1.2 Edge and Fog Computing
2 Background Technologies and Related Work
2.1 FogBus2 Framework
2.2 K3s: Lightweight Kubernetes
2.3 Related Work
3 Container Orchestration Approach
3.1 Overview of the Design
3.2 Configuration of Nodes
3.3 P2P VPN Establishment
3.4 K3s Deployment
3.5 FogBus2 Framework Integration
4 Performance Evaluation
4.1 Experiment 1: Orchestrated FogBus2 Versus Native FogBus2
4.2 Experiment 2: Hybrid Environment Versus Cloud Environment
5 Conclusions and Future Work
References
Is Tiny Deep Learning the New Deep Learning?
1 Introduction
2 Tiny Machine Learning: An Overview
3 From Tiny Machine Learning to Tiny Deep Learning
4 Approximate Computing for Tiny Deep Learning
4.1 Precision Scaling Mechanisms
4.2 Task Dropping Mechanisms
5 Tiny Deep Learning for the Internet of Things
5.1 Early-Exit Neural Networks
5.2 Distributed Inference for TinyDL
5.3 On-Device Learning for TinyDL
5.4 Federated Learning
6 Conclusions
References
Dynamic Multi-objective Optimization Using Computational Intelligence Algorithms
1 Introduction
2 Multi-objective Optimization Concepts
2.1 Multi-objective Optimization
2.2 Dynamic Multi-objective Optimization
3 Multi-objective Algorithms
3.1 Preference-Based Approaches
3.2 Population-Based Approaches
4 Solving Dynamic Multi-objective Optimization Problems
4.1 Detecting Changes in Environment
4.2 Responding to Changes in Environment
4.3 Evaluating Algorithm Performance
5 Challenges in Dynamic Multi-objective Optimization
5.1 Decision Making
5.2 Algorithm Behavior
6 Emerging Research Areas
6.1 Many-Objective Optimization
6.2 Fitness Landscape Analysis
7 Conclusion
References
AI for Social Good—A Faustian Bargain
1 An Introduction: History and Context
1.1 Knowledge Acquisition Problem for AI Systems
1.2 Machine Learning
1.3 Deep Learning and Insatiable Data Thirst
2 Smart Devices and Clouds
2.1 Networking, Processing, and Media Convergence in Smartphones
2.2 Eavesdropping and Spying in Our House
3 The Promise for Public Good
4 Standardization and Regulation of AI Technology
References
Computational Intelligence–Machine Learning
Text Summarization Approaches Under Transfer Learning and Domain Adaptation Settings—A Survey
1 Introduction
2 Transfer Learning-Based Text Summarization
2.1 Transfer Learning in Seq2Seq-Based Approaches for Text Summarization
2.2 Transfer Learning in Non-Seq2Seq-Based Approaches for Text Summarization
3 Text Summarization Using Domain Adaptation
4 Issues, Challenges and Opportunities
5 Conclusion
References
An Effective Eye-Blink-based Cyber Secure PIN Password Authentication System
1 Introduction
1.1 Our Contribution
2 Related Work
3 System Design
3.1 Algorithm
4 EBCS-PIN Implementation and Results
5 Conclusion
References
A Comparison of Algorithms for Bayesian Network Learning for Triple Word Form Theory
1 Introduction
2 Literature Review
3 Methodology
4 Structure Learning for Bayesian Network
4.1 Search and Score-Based Algorithm
4.2 Constraint-Based Approach
4.3 Hybrid Approach
5 Study Participants
6 Results
7 Discussion and Conclusion
References
Application of Machine Learning Algorithm in Identification of Anaemia Diseases
1 Introduction
1.1 Types and Detection of Anaemia
1.2 Machine Learning Algorithms
2 Methodology
2.1 Gathering Data
2.2 Data Pre-processing
2.3 Model Selection
2.4 Validation
3 Results and Discussion
3.1 Confusion Matrix
3.2 Accuracy
3.3 Sensitivity/Recall
3.4 Specificity
4 Conclusion and Future Work
References
Detection of Fruits Image Applying Decision Tree Classifier Techniques
1 Introduction
2 Related Works
3 Methodology
4 Experiment and Results
5 Conclusion
References
Disease Prediction Based on Symptoms Using Various Machine Learning Techniques
1 Introduction
2 Literature Review
3 Methodology
3.1 Input Data
3.2 Data Pre-processing
3.3 Models
3.4 Output (Diseases)
4 Implementation
4.1 Multinomial Naïve Bayes
4.2 Random Forest Classifier
4.3 K-Nearest Neighbors (KNN)
4.4 Logistic Regression
4.5 Support Vector Machines
4.6 Decision Tree
4.7 Multilayer Perceptron Classifier
5 Results and Discussion
5.1 Experimental Analysis
6 Conclusion
References
Anti-Drug Response and Drug Side Effect Prediction Methods: A Review
1 Introduction
2 Literature Survey
3 Databases and Parameters Used for Drug Adverse Reaction Prediction
3.1 Metrics for Drug Side Effect Prediction
4 Drug Side Effect Prediction Methods
4.1 Docking-Based Methods
4.2 Networks-Based Methods
4.3 Machine Learning-Based Methods
4.4 Miscellaneous Approaches
5 Conclusion and Future Work
References
Assessment of Segmentation Techniques for Irregular Border Lesion Images in Melanoma
1 Introduction
2 Related Work and Background
3 Methodology Used
3.1 Binary Otsu Segmentation
3.2 Marker-Based Watershed Segmentation
3.3 K-Means Clustering
3.4 Quality Assessment for Images
4 Results and Evaluation
5 Conclusion
References
Secure Communication and Pothole Detection for UAV Platforms
1 Introduction
2 Methodology
2.1 Secure Communication
2.2 Pothole Detection Using CNN
2.3 Pothole Detection Using Inception-V3
2.4 Pothole Detection Using YOLO
3 Results
4 Future Scope
References
An Empirical Study on Discovering Software Bugs Using Machine Learning Techniques
1 Introduction
2 Related Work
3 Methodology
4 Experimental Results
5 Conclusion and Future Work
References
Action Segmentation for RGB Video Frames Using Skeleton 3D Data of NTURGB+D
1 Introduction
2 Research Methodology
3 Dataset for Action Segmentation
4 Experimental Results
4.1 Create Datastore for Skeleton 3D Data and RGB Videos
4.2 Extraction of Color X and Color Y from Skeleton 3D Data
4.3 Dimension for Bounding Box for Action Segmentation
4.4 RGB Videos and Segmented Action Videos
5 Conclusion
References
Prediction of Rainfall Using Different Machine Learning Regression Models
1 Introduction
2 Related Work
3 Proposed Method
3.1 The Proposed Model’s Algorithm
3.2 Dataset Description
3.3 Exploratory Data Analysis
4 Results
4.1 Results of MLR
5 Conclusion
References
A Comprehensive Survey of Datasets Used for Spam and Genuineness Views Detection in Twitter
1 Introduction
2 Literature Review
3 Summary and Challenges
4 Difficulty in Extraction and Collection of Data
5 Renaming of Data
6 Lack of Standard Datasets
7 Genuineness of Data
8 Conclusion
References
Computational Intelligence–Deep Learning
Indian Classical Dance Forms Classification Using Transfer Learning
1 Introduction
2 Related Work
3 Proposed Work
3.1 ICD Architecture
3.2 VGG16 Network Architecture
3.3 ICD Classification Algorithm:
3.4 Theory
4 Experiment Results and Comparisons
4.1 Dataset
4.2 Performance Analysis
5 Conclusion and Future Enhancement
References
Skin Cancer Classification for Dermoscopy Images Using Model Based on Deep Learning and Transfer Learning
1 Introduction
2 Related Background
3 Experimental Methodology
3.1 Convolutional Neural Network
3.2 Transfer Learning
3.3 Evaluation Metrics
4 Proposed Architecture
4.1 Flow Model
4.2 Dataset Description
4.3 Experimental Setup
5 Experimental Outcomes and Observation
6 Conclusion and Future Work
References
Deep Neural Network Architecture for Face Mask Detection Against COVID-19 Pandemic Using Pre-trained Exception Network
1 Introduction
2 Background
3 Proposed Method
3.1 Load the Data Set
3.2 Model Creation
3.3 Data Set
4 Experimental Results and Discussion
5 Conclusion
References
MOOC-LSTM: The LSTM Architecture for Sentiment Analysis on MOOCs Forum Posts
1 Introduction
2 Motivation and Related Work
3 Background
3.1 Dataset
3.2 Word Embedding
3.3 Upsampling
3.4 Convolutional Neural Network
3.5 Long Short-term Memory
3.6 Adaptive Experimentation (Ax)
4 Proposed System
5 Implementation
5.1 Performance Measures
5.2 Results and Analysis
6 Conclusion and Future Work
References
License Plate Detection of Motorcyclists Without Helmets
1 Introduction
2 Related Work
3 Data
3.1 Dataset
3.2 Data Preprocessing
3.3 Architecture
4 Results
5 Conclusion
References
Object Detection and Tracking Using DeepSORT
1 Introduction
2 Related Work
3 Existing Work
3.1 Optical Flow
3.2 Mean-Shift Algorithm
4 Proposed Methodology
4.1 Dataset
4.2 Proposed Workflow
5 Results and Discussions
6 Conclusion
7 Future Work
References
Continuous Investing in Advanced Fuzzy Technologies for Smart City
1 Introduction
2 Literature Review
3 Research Objectives
4 Methods and Models
4.1 Problem Statement
4.2 Research Methodology
4.3 Game Solution and Optimal Strategies of the First Player
5 Computational Experiment
6 Computational Experiment
7 Conclusions
References
Lesion Segmentation in Skin Cancer Detection Using UNet Architecture
1 Introduction
1.1 Our Contribution
2 Preliminaries
2.1 UNet Architecture
3 Proposed Methodology
3.1 A Proposed Model of UNet
4 First Section
4.1 Dataset and Experimental Setup
4.2 Evaluation Metrics
4.3 Results and Discussions
5 Conclusion
References
Getting Around the Semantics Challenge in Hateful Memes
1 Introduction
2 Literature Review
3 Approach
3.1 A Semantic Understanding
3.2 Variational Auto-Encoders (VAE)
3.3 Helping Decode Hate in Memes
3.4 Bringing It All Together
4 Results and Discussion
5 Conclusion and Future Work
References
Classification of Brain Tumor of Magnetic Resonance Images Using Convolutional Neural Network Approach
1 Introduction
2 Related Works
3 Material and Methods
3.1 Dataset Collection and Preprocessing
3.2 Model Architecture and Learning
4 Performance Evaluation
5 Conclusion
References
Detection of COVID-19 Infection Using Convolutional Neural Network
1 Introduction
2 Literature Survey
3 Proposed Methodology
4 Experimental Results
5 Conclusions
References
Hybrid Classification Algorithm for Early Prediction of Alzheimer’s Disease
1 Introduction
2 Related Works
3 Proposed Work
3.1 Details of Dataset
3.2 Segmentation
3.3 Feature Extraction and Reduction
3.4 Classification
4 Results
4.1 Comparison of Accuracy
4.2 Comparison of Other Performance Metrics
5 Conclusion
References
Data Analytics
Evaluating Models for Better Life Expectancy Prediction
1 Introduction
1.1 Motivation and Objective
1.2 Research Questions and Contribution
2 Related Work
3 Proposed Strategy
3.1 Life Expectancy Prediction and Forecasting Model
4 Implementation Framework
4.1 Data Preparation
4.2 Feature Extraction
4.3 Life Expectancy Prediction Modeling
4.4 Life Expectancy Forecast Modeling
5 Analytics and Discussion
5.1 Forecast Modeling Results
6 Conclusion
References
Future Gold Price Prediction Using Ensemble Learning Techniques and Isolation Forest Algorithm
1 Introduction
2 Literature Study
3 Statistics and Methodology
4 Discussion and Outcomes
5 Conclusion
References
Second-Hand Car Price Prediction
1 Introduction
1.1 Motivation
1.2 Problem Addressed
1.3 Objectives
1.4 Solution/Novelty
2 Related Works
3 System Analysis
3.1 Data Set
3.2 Algorithm-Random Forest Regression
3.3 Assumptions for Random Forest Regression
4 Implementation
4.1 Pre-processing
4.2 Visualizing
4.3 Model Creation
4.4 Input
4.5 Output
5 Analysis
6 Conclusions and Future Scope
6.1 Conclusion
6.2 Future Scope
References
A Study on Air Pollution Over Hyderabad Using Factor Analysis—Kaggle Data
1 Introduction
2 Literature Survey
3 Data Details
4 Methodology
5 Results and Discussions
6 Conclusions
References
A Comparative Study of Hierarchical Risk Parity Portfolio and Eigen Portfolio on the NIFTY 50 Stocks
1 Introduction
2 Related Work
3 Data and Methodology
4 Performance Evaluation
4.1 The Auto Sector Portfolios
4.2 The Consumer Durable Sector Portfolios
4.3 The Financial Services Sector Portfolios
4.4 The Healthcare Sector Portfolios
4.5 The Information Technology Sector Portfolios
4.6 The Oil and Gas Sector Portfolios
4.7 The NIFTY 50 Portfolios
5 Conclusion
References
Collaborative Approach Toward Information Retrieval System to Get Relevant News Articles Over Web: IRS-Web
1 Introduction
2 Related Work
3 Experimental Setup
3.1 Data Acquisition and Extraction
3.2 Data Pre-processing
3.3 Cloud-Based Data Warehousing
3.4 IRS-Web
4 Results
5 Comparative Analysis
6 Conclusion
References
Patent Recommendation Engine Using Graph Database
1 Introduction
1.1 Background
1.2 Application
1.3 Limitations
1.4 Objective
2 Literature Survey
2.1 Existing Works
3 Method Proposed
3.1 Second-Degree Node Search Technique
3.2 Node Similarity Algorithm Using Jaccard Distance
3.3 Data Modeling—Nodes and Relationships
3.4 Data Preparation and Ingestion
3.5 Query Building and Execution
4 Results
5 Conclusion
References
IFF: An Intelligent Fashion Forecasting System
1 Introduction
2 Related Work
2.1 Fashion Trend Analysis
2.2 Forecasting Time Series
3 Methodology
3.1 Attribute Recognition
3.2 Dataset for Discovering the Fashion Time line
3.3 Temporal Fashion Trends
3.4 Forecasting Trends
4 Models
4.1 Grand Means Forecaster
4.2 ARIMA
4.3 Auto ARIMA
4.4 Theta Forecaster
4.5 Polynomial Trend Forecaster
4.6 Exponential Smoothing
4.7 Naive Forecaster
4.8 Lasso Least Angular Regressor
4.9 Light Gradient Boosting
5 Seq2Seq Model
6 Results
7 Application in Real-Time Forecasting of Fashion
8 Conclusion
References
SIR-M Epidemic Model: A SARS-CoV-2 Perspective
1 Introduction
2 Epidemic Model
2.1 COVID-19: A Clinical Perception
2.2 SIR-M Model
3 Experimentation and Analysis
3.1 Model Proof
3.2 Statistics Analysis
4 Conclusion
References
MultiCity: A Personalized Multi-itinerary City Recommendation Engine
1 Introduction
2 Related Work
2.1 TRS Using Orienteering Problem
2.2 Various TRS Related Works
3 Background and Problem Definition
3.1 Time-based User Interest
3.2 Traveling History
3.3 Itinerary Interest
3.4 Itinerary Popularity
3.5 Similarity Between LU and GUs
3.6 Problem Definition
3.7 Monte Carlo Tree Search Algorithm
3.8 Simulation and Back-propagation
4 Experimental Methodology
4.1 Dataset
4.2 Baseline Algorithms
4.3 Performance Metrics
4.4 Comparison of Precision, Recall and F1-Score
5 Conclusion
References
Block Chain and Cloud Computing
A Fully Distributed Secure Approach for Database Security in Cloud Computing
1 Introduction
2 Related Work
3 Proposed Security Model
4 Implementation
5 Conclusion and Future Scope
References
Blockchain Technology Adoption for General Elections During COVID-19 Pandemic and Beyond
1 Introduction
2 Background
3 Related Work
3.1 COVID-19 Pandemic
3.2 Blockchain Technology
3.3 Conventional Voting Method
4 Implementation Approach of the Blockchain Voting Method
5 Shortcomings of Conventional Voting Method During COVID-19 Pandemic
6 Future Research Direction and Recommendations
7 Conclusion
References
Blockchain Implementation Framework for Tracing the Dairy Supply Chain
1 Introduction
2 Blockchain
3 Existing Supply Chain Process
4 Benefits of Using Blockchain in Dairy Supply Chain
5 Design and Architecture
6 Conclusion
References
Addressing Most Common Vulnerabilities in Blockchain-Based Voting Systems
1 Introduction
2 Electronic Voting Systems
2.1 David Chaum Electronic Voting System
2.2 Caltech/MIT Voting Technology Project
2.3 The E-Poll Project
2.4 The Estonian I-Voting System
2.5 South Wales iVote System
2.6 Washington DC. Electronic Voting System
2.7 Blockchain-Based Voting Systems
3 Security Concerns with Blockchain-Based Voting Systems
3.1 Mining and Conflicts in the Blockchain
3.2 Resource Drainage
3.3 Border Gateway Protocol Attack
3.4 Terminals Used for Voting
3.5 Internet Identity and Anonymous Voting
3.6 Man-In-The-Middle Attack
3.7 A 51% Attack
3.8 A D-Denial of Service Attack
4 Our Novel Proposed Solution
5 Conclusion
References
Networks and Security
Privacy Preserving Intrusion Detection System for Low Power Internet of Things
1 Introduction
2 Background
2.1 6LoWPAN-Based IoT
2.2 RPL
2.3 Security in 6LoWPAN-Based IoT
3 Related Work
4 Privacy Preserving Intrusion Detection System
4.1 Privacy Preserving Module
4.2 Attack Detection Module
5 Simulation and Results
5.1 Selective Forwarding Attack
5.2 Vampire Attack
5.3 True Positive Rate
5.4 Energy
5.5 Privacy Metric
5.6 Utility
6 Conclusion
References
Identifying Top-N Influential Nodes in Large Complex Networks Using Network Structure
1 Introduction
1.1 Motivation and Main Contributions
2 Related Work
3 Proposed Method
4 Performance
5 Conclusion and Future Directions
References
Push and Pull Factors for Successful Implementation of ERP in SMEs Within Klang Valley: A Roadmap
1 Introduction
2 Related Work
2.1 ERP Success Parameters
2.2 ERP Pull Factors
2.3 ERP Push Factors
3 Formulation
3.1 Research Phases
4 Initial Results
5 Conclusion
References
A Hybrid Social-Based Routing to Improve Performance for Delay-Tolerant Networks
1 Introduction
2 Related Work
2.1 Some of the Mostly Known Routing Protocols of DTN
2.2 Social-Based Routing Protocols of DTN
3 Proposed Methods
3.1 Routing Considerations
3.2 Censimcom Routing
3.3 Technique
3.4 Algorithm
4 Simulation and Results
4.1 Delivery Ratio
4.2 End-to-End Delay
5 Conclusions
References
Author Index


Lecture Notes on Data Engineering and Communications Technologies 142

Rajkumar Buyya · Susanna Munoz Hernandez · Ram Mohan Rao Kovvur · T. Hitendra Sarma, Editors

Computational Intelligence and Data Analytics Proceedings of ICCIDA 2022

Lecture Notes on Data Engineering and Communications Technologies Volume 142

Series Editor Fatos Xhafa, Technical University of Catalonia, Barcelona, Spain

The aim of the book series is to present cutting edge engineering approaches to data technologies and communications. It will publish latest advances on the engineering task of building and deploying distributed, scalable and reliable data infrastructures and communication systems. The series will have a prominent applied focus on data technologies and communications with aim to promote the bridging from fundamental research on data science and networking to data engineering and communications that lead to industry products, business knowledge and standardisation. Indexed by SCOPUS, INSPEC, EI Compendex. All books published in the series are submitted for consideration in Web of Science.

Rajkumar Buyya · Susanna Munoz Hernandez · Ram Mohan Rao Kovvur · T. Hitendra Sarma Editors

Computational Intelligence and Data Analytics Proceedings of ICCIDA 2022

Editors

Rajkumar Buyya
Cloud Computing and Distributed Systems (CLOUDS) Laboratory, University of Melbourne, Melbourne, VIC, Australia

Susanna Munoz Hernandez
Computer Science School (FI), Technical University of Madrid, Madrid, Spain

Ram Mohan Rao Kovvur
Department of Information Technology, Vasavi College of Engineering, Hyderabad, India

T. Hitendra Sarma
Department of Information Technology, Vasavi College of Engineering, Hyderabad, India

ISSN 2367-4512  ISSN 2367-4520 (electronic)
Lecture Notes on Data Engineering and Communications Technologies
ISBN 978-981-19-3390-5  ISBN 978-981-19-3391-2 (eBook)
https://doi.org/10.1007/978-981-19-3391-2

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

The International Conference on Computational Intelligence and Data Analytics (ICCIDA 2022) has been recognized as a very successful conference hosted by the Department of Information Technology, Vasavi College of Engineering, and sponsored by Vasavi Academy of Education, during 8–9 January 2022 at Vasavi College of Engineering (Autonomous), Hyderabad, Telangana, India. ICCIDA 2022 aims to provide an exciting platform for young researchers to exchange their innovations and explore future prospects of research with the global research community in academia and industry, in the areas of computational intelligence, data analytics and their allied fields.

It is worth mentioning that ICCIDA 2022 distinguished itself by attracting 175 high-quality research articles from several parts of the globe, including USA, Canada, Europe, South Africa, Indonesia, Malaysia, Nepal, Bangladesh, Oman and UAE, and from 18 different states of India. All the articles were critically peer-reviewed, and finally, 43 papers were accepted and presented by the authors during the conference. Further, there were six keynote talks delivered by eminent researchers, including Dr. Rajkumar Buyya (The University of Melbourne, Australia), Dr. Bing Xue (Director (Computer Science) in the School of Engineering and Computer Science at Victoria University of Wellington, New Zealand), Dr. Manuel Roveri (Politecnico di Milano, Italy), Dr. Marde Helbig (Griffith University, Australia), Dr. Atul Negi (University of Hyderabad, India) and Dr. K. Raghavendra (Head, High Performance Computing and Drones, Advanced Data Processing Research Institute (ADRIN), ISRO). ICCIDA 2022 maintained all the necessary quality standards and was regarded as a high-quality international conference with an acceptance ratio of 24.57%.

The sincere efforts of the Technical Program Committee members and the Organizing Committee members are highly appreciated. I put on record my sincere thanks to the Principal of VCE, Prof. S. V. Ramana, the CEO, Sri P. Balaji, and the management of Vasavi College of Engineering for their constant encouragement and generous support in making the event successful, and also for encouraging active researchers by giving rewards for the best papers.


We sincerely thank Mr. Aninda Bose and his team at Springer Nature for their strong support towards publishing this volume in the series Lecture Notes on Data Engineering and Communications Technologies—Indexed by Scopus, Inspec and Ei Compendex.

Hyderabad, India

Dr. Ram Mohan Rao Kovvur Conference Chair

Contents

Container Orchestration in Edge and Fog Computing Environments for Real-Time IoT Applications ... 1
Zhiyu Wang, Mohammad Goudarzi, Jagannath Aryal, and Rajkumar Buyya

Is Tiny Deep Learning the New Deep Learning? ... 23
Manuel Roveri

Dynamic Multi-objective Optimization Using Computational Intelligence Algorithms ... 41
Mardé Helbig

AI for Social Good—A Faustian Bargain ... 63
Atul Negi

Computational Intelligence–Machine Learning

Text Summarization Approaches Under Transfer Learning and Domain Adaptation Settings—A Survey ... 73
Meenaxi Tank and Priyank Thakkar

An Effective Eye-Blink-based Cyber Secure PIN Password Authentication System ... 89
Susmitha Mukund Kirsur, M. Dakshayini, and Mangala Gowri

A Comparison of Algorithms for Bayesian Network Learning for Triple Word Form Theory ... 101
Soorya Surendran, Mithun Haridas, Greeshma Krishnan, Nirmala Vasudevan, Georg Gutjahr, and Prema Nedungadi

Application of Machine Learning Algorithm in Identification of Anaemia Diseases ... 111
Lata Upadhye and Sangeetha Prasanna Ram

Detection of Fruits Image Applying Decision Tree Classifier Techniques ... 127
Shivendra, Kasa Chiranjeevi, and Mukesh Kumar Tripathi

Disease Prediction Based on Symptoms Using Various Machine Learning Techniques ... 141
Deep Rahul Shah and Dev Ajay Dhawan

Anti-Drug Response and Drug Side Effect Prediction Methods: A Review ... 153
Davinder Paul Singh, Abhishek Gupta, and Baijnath Kaushik

Assessment of Segmentation Techniques for Irregular Border Lesion Images in Melanoma ... 169
K. Gnana Mayuri and L. Sathish Kumar

Secure Communication and Pothole Detection for UAV Platforms ... 183
S. Aruna, P. Lahari, P. Suraj, M. W. F. Junaid, and V. Sanjeev

An Empirical Study on Discovering Software Bugs Using Machine Learning Techniques ... 195
G. Ramesh, K. Shyam Sunder Reddy, Gandikota Ramu, Y. C. A. Padmanabha Reddy, and J. Somasekar

Action Segmentation for RGB Video Frames Using Skeleton 3D Data of NTURGB+D ... 203
Rosepreet Kaur Bhogal and V. Devendran

Prediction of Rainfall Using Different Machine Learning Regression Models ... 213
B. Leelavathy, Ram Mohan Rao Kovvur, Sai Rohit Sheela, M. Dheeraj, and V. Vivek

A Comprehensive Survey of Datasets Used for Spam and Genuineness Views Detection in Twitter ... 223
Monal R. Torney, Kishor H. Walse, and Vilas M. Thakare

Computational Intelligence–Deep Learning

Indian Classical Dance Forms Classification Using Transfer Learning ... 241
Challapalli Jhansi Rani and Nagaraju Devarakonda

Skin Cancer Classification for Dermoscopy Images Using Model Based on Deep Learning and Transfer Learning ... 257
Vikash Kumar and Bam Bahadur Sinha

Deep Neural Network Architecture for Face Mask Detection Against COVID-19 Pandemic Using Pre-trained Exception Network ... 273
S. R. Surya and S. R. Resmi

MOOC-LSTM: The LSTM Architecture for Sentiment Analysis on MOOCs Forum Posts ... 283
Purnachary Munigadiapa and T. Adilakshmi

License Plate Detection of Motorcyclists Without Helmets ... 295
S. K. Chaya Devi, G. Vishal Reddy, Y. Aakarsh, and B. Gowtham

Object Detection and Tracking Using DeepSORT ... 303
Divya Lingineni, Prasanna Dusi, Rishi Sai Jakkam, and Sai Yada

Continuous Investing in Advanced Fuzzy Technologies for Smart City ... 313
V. Lakhno, V. Malyukov, D. Kasatkin, V. Chubaieskyi, S. Rzaieva, and D. Rzaiev

Lesion Segmentation in Skin Cancer Detection Using UNet Architecture ... 329
Shubhi Miradwal, Waquas Mohammad, Anvi Jain, and Fawwaz Khilji

Getting Around the Semantics Challenge in Hateful Memes ... 341
Anind Kiran, Manah Shetty, Shreya Shukla, Varun Kerenalli, and Bhaskarjyoti Das

Classification of Brain Tumor of Magnetic Resonance Images Using Convolutional Neural Network Approach ... 353
Raghawendra Sinha and Dipti Verma

Detection of COVID-19 Infection Using Convolutional Neural Network ... 363
D. Aravind, Neha Jabeen, and D. Nagajyothi

Hybrid Classification Algorithm for Early Prediction of Alzheimer's Disease ... 373
B. A. Sujatha Kumari, Sudarshan Patil Kulkarni, and Ayesha Sultana

Data Analytics

Evaluating Models for Better Life Expectancy Prediction ... 389
Amit, Reshov Roy, Rajesh Tanwar, and Vikram Singh

Future Gold Price Prediction Using Ensemble Learning Techniques and Isolation Forest Algorithm ... 405
Nandipati Bhagya Lakshmi and Nagaraju Devarakonda

Second-Hand Car Price Prediction ... 421
N. Anil Kumar

A Study on Air Pollution Over Hyderabad Using Factor Analysis—Kaggle Data ... 431
N. Vasudha and P. Venkateswara Rao

A Comparative Study of Hierarchical Risk Parity Portfolio and Eigen Portfolio on the NIFTY 50 Stocks ... 443
Jaydip Sen and Abhishek Dutta

Collaborative Approach Toward Information Retrieval System to Get Relevant News Articles Over Web: IRS-Web ... 461
Shabina and Sonal Chawla

Patent Recommendation Engine Using Graph Database ... 475
Aniruddha Chatterjee, Sagnik Biswas, and M. Kanchana

IFF: An Intelligent Fashion Forecasting System ... 487
Chakita Muttaraju, Ramya Narasimha Prabhu, S. Sheetal, D. Uma, and S. S. Shylaja

SIR-M Epidemic Model: A SARS-CoV-2 Perspective ... 499
Lekshmi S. Nair and Jo Cheriyan

MultiCity: A Personalized Multi-itinerary City Recommendation Engine ... 509
Joy Lal Sarkar and Abhishek Majumder

Block Chain and Cloud Computing

A Fully Distributed Secure Approach for Database Security in Cloud Computing ... 523
Srinu Banothu, A. Govardhan, and Karnam Madhavi

Blockchain Technology Adoption for General Elections During COVID-19 Pandemic and Beyond ... 533
Israel Edem Agbehadji, Abdultaofeek Abayomi, Richard C. Millham, and Owusu Nyarko-Boateng

Blockchain Implementation Framework for Tracing the Dairy Supply Chain ... 551
Mohammed Marhoun Khamis Al Nuaimi, K. P. Rishal, Noel Varghese Oommen, and P. C. Sherimon

Addressing Most Common Vulnerabilities in Blockchain-Based Voting Systems ... 561
Ahmed Ben Ayed and Mohamed Amin Belhajji

Networks and Security

Privacy Preserving Intrusion Detection System for Low Power Internet of Things ... 577
S. Prabavathy and I. Ravi Prakash Reddy

Identifying Top-N Influential Nodes in Large Complex Networks Using Network Structure ... 597
M. Venunath, P. Sujatha, and Prasad Koti

Push and Pull Factors for Successful Implementation of ERP in SMEs Within Klang Valley: A Roadmap ... 609
Anusuyah Subbarao and Astra Hareyana

A Hybrid Social-Based Routing to Improve Performance for Delay-Tolerant Networks ... 619
Sudhakar Pandey, Nidhi Sonkar, Sanjay Kumar, and Yeleti Sri Satyalakshmi

Author Index ... 631

About the Editors

Dr. Rajkumar Buyya is a Redmond Barry Distinguished Professor and Director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He is also serving as the founding CEO of Manjrasoft, a spin-off company of the University, commercializing its innovations in Cloud Computing. He has authored over 750 publications and published seven textbooks that are popular internationally. Dr. Buyya is one of the highly cited authors in computer science and software engineering worldwide. "A Scientometric Analysis of Cloud Computing Literature" by German scientists ranked Dr. Buyya as the World's Top-Cited (#1) Author and the World's Most Productive (#1) Author in Cloud Computing. He served as founding Editor-in-Chief of the IEEE Transactions on Cloud Computing. He is currently serving as Editor-in-Chief of Software: Practice and Experience, a long-standing journal in the field established nearly 50 years ago.

Susanna Munoz Hernandez holds a Ph.D. in Computer Science from the Technical University of Madrid (UPM), a Master in Management of Information Technologies from the Ramón Llull University of Barcelona, and is a graduate in International Affairs from the Society of International Studies of Madrid. She won the first prize in a national competition for talented young people organized by the University of La Salle of Madrid in 2003. She has professional experience in several companies (ICT and banking sector) and in joint research (national and European) projects of recognized prestige. She has been working as an associate professor at the Computer Science School of the Technical University of Madrid since 1998. She develops her research activity in the BABEL group (http://babel.ls.fi.upm.es) with more than eighty publications. She received the 2011 UPM prize for research in international cooperation for development. She is a member of the Management Board of the itdUPM. She is part of the organizing committees of more than 15 international conferences.

Dr. Ram Mohan Rao Kovvur received his Ph.D. in Computer Science and Engineering from Jawaharlal Nehru Technology University (JNTU) in 2014, with a research specialization in Grid Computing.


He has more than 25 years of teaching experience in various cadres and is currently the Professor and Head, Information Technology, Vasavi College of Engineering, Hyderabad, Telangana, India. He has received many prestigious awards from reputed organizations. Dr. Ram Mohan Rao has published and presented more than 25 research articles in national and international journals and conferences. He obtained a grant of Rs. 19.31 Lakhs from AICTE under MODROBS and established a Deep Learning Lab. As part of his research work, he established a Grid environment using the Globus Toolkit (an open-source software toolkit used for building Grid systems). Further, he also established a Cloud Lab at VCE using the Aneka platform (US patented) of Manjrasoft Pvt Ltd. His research areas include Distributed Systems, Cloud Computing, and Data Science.

Dr. T. Hitendra Sarma received his Ph.D. in Machine Learning from JNT University Anantapur, India, in December 2013. He has more than 14 years of teaching and research experience and has served at different reputed institutes in various capacities. Currently, Dr. Sarma is working as Associate Professor at Vasavi College of Engineering, Hyderabad. Dr. Sarma has published more than 30 research articles in various peer-reviewed international journals and conferences by Springer, Elsevier, and IEEE. His research interests include Machine Learning, Hyperspectral Image Processing, Artificial Neural Networks, Data Mining and Data Science, etc. Dr. Sarma holds a project funded by SERB, India, and is an active researcher. He presented his research articles at reputed conferences such as IEEE WCCI (2016, Vancouver, Canada), IEEE CEC (2018, Rio de Janeiro, Brazil) and IEEE ICECIE (2019, Malaysia). He delivered an invited talk at the Third International Conference on Data Mining (ICDM) 2017 at Hualien, Taiwan.

Container Orchestration in Edge and Fog Computing Environments for Real-Time IoT Applications

Zhiyu Wang, Mohammad Goudarzi, Jagannath Aryal, and Rajkumar Buyya

Abstract Resource management is the principal factor to fully utilize the potential of Edge/Fog computing to execute real-time and critical IoT applications. Although some resource management frameworks exist, the majority are not designed based on distributed containerized components. Hence, they are not suitable for highly distributed and heterogeneous computing environments. Containerized resource management frameworks such as FogBus2 enable efficient distribution of the framework's components alongside IoT applications' components. However, the management, deployment, health check, and scalability of a large number of containers are challenging issues. To orchestrate a multitude of containers, several orchestration tools have been developed. But, many of these orchestration tools are heavyweight and have a high overhead, especially for resource-limited Edge/Fog nodes. Thus, for hybrid computing environments, consisting of heterogeneous Edge/Fog and/or Cloud nodes, lightweight container orchestration tools are required to support both resource-limited resources at the Edge/Fog and resource-rich resources at the Cloud. Thus, in this paper, we propose a feasible approach to build a hybrid and lightweight cluster based on K3s, for the FogBus2 framework that offers a containerized resource management framework. This work addresses the challenge of creating lightweight computing clusters in hybrid computing environments. It also proposes three design patterns for the deployment of the FogBus2 framework in hybrid environments, including (1) Host Network, (2) Proxy Server, and (3) Environment Variable. The performance evaluation shows that the proposed approach improves the response time of real-time IoT applications by up to 29% with acceptable and low overhead.

Keywords Edge computing · Fog computing · Container orchestration · Internet of Things · Resource management framework

Z. Wang · M. Goudarzi · J. Aryal · R. Buyya (B)
Cloud Computing and Distributed Systems (CLOUDS) Laboratory, School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_1


1 Introduction

With the rapid development of hardware, software, and communication technology, IoT devices have become dominant in all aspects of our lives. Traditional physical devices are connected in the Internet of Things (IoT) environment to perform humanoid information perception and collaborative interaction. They realize self-learning, processing, decision-making, and control, thereby completing intelligent production and service and promoting the innovation of people's life and work patterns [1]. On this premise, Cloud computing, with its powerful computing and storage capabilities, becomes a shared platform for IoT big data analysis and processing. In most cases, IoT devices offload complex applications to the Cloud for storage and processing, and the output results are then sent from the Cloud to IoT devices [2, 3]. As a result, users do not have concerns about insufficient storage space or computational capacity for IoT devices. However, with the explosive growth in the number of IoT devices, nowadays, the amount of raw data sensed and acquired by the IoT has been significantly increasing. Consequently, filtering, processing, and analyzing the massive amount of data in a centralized approach has become an inevitable challenge for the Cloud computing paradigm [3, 4]. Moreover, the number of real-time IoT applications has significantly increased. These applications require resources that support fast processing and low access latency to minimize the total response time [5]. Some examples of these applications are autonomous robots and disaster management applications (e.g., natural hazard management).

1.1 Case Study: Natural Disaster Management (NDM)

NDM comprises four phases, namely Prevention, Preparedness, Response, and Recovery. It is commonly referred to as the PPRR framework for disaster management. These four phases are not linear and independent as they overlap and support each other for a better balance between risk reduction and community resilience for better response and effective recovery. Geo-spatial solutions for different phases are on offer in practice considering the availability of big earth observation satellite data acquired from various satellite missions and IoT-enabled ground-based sensor information [6]. However, the optimal fusion of satellite-based sensors and IoT sensors can provide accurate and precise information in the case of natural disasters. As presented in Figs. 1 and 2, for the case of bushfire problems in Australia, satellite-based sensors and IoT-based sensors have been used in an ad hoc manner to inform the end users. For example, repositories of satellite data primarily from NASA and Digital Earth Australia and other location-based data are being used for the live alert feeds by the emergency services in different states. The potential of satellite data and their fusion in extracting the optimal information in real time is a challenge due to the granularity of sensor-specific spatial data structure on spatial, spectral, temporal, and radiometric resolutions. With the IoT-based real-time information, there is a strong potential to validate and calibrate the satellite information captured in different resolutions to inform bushfire alerts in space and time. For example, early bushfire detection, near-real-time bushfire progression monitoring, and post-fire mapping and analysis are possible with the optimal integration of ground-based sensors to the satellite-based sensor's information. The framework of integrating sensors and providing accurate information to end users in real time will help in saving lives and properties.

Fig. 1 A visualization framework on how satellite and ground-based sensors can be fused utilizing distributed computing paradigm such as edge computing in providing accurate real-time information to end users

Fig. 2 A detailed system model on various sensors integration and their utilization in disseminating data to inform end users

1.2 Edge and Fog Computing

For smooth and efficient execution of IoT applications, distributed computing paradigms, called Edge and Fog computing, have emerged. They concentrate data and processing units as close as possible to the end users, as opposed to the traditional Cloud computing paradigm that concentrates data and processing units in Cloud data centers [7]. The key idea behind Edge and Fog computing is to bring Cloud-like services to the edge of the network, resulting in less application latency and a better quality of experience for users [8, 9]. Edge computing can cope with medium to lightweight tasks. However, when the users' requirements consist of complex and resource-hungry tasks, Edge devices are often unable to efficiently satisfy those requirements since they have limited computing resources [7, 10]. To address these challenges, Fog computing, also referred to as hybrid computing, is becoming a popular solution. Figure 3 depicts an overview of the Fog/Hybrid computing environment. In our view, Edge computing only harnesses the closest resources to the end users, while Fog computing uses deployed resources at Edge and Cloud layers. In such computing environments, Cloud can act as an orchestrator, which is responsible for big and long-period data analysis. It can operate in areas such as management, cyclical maintenance, and execution of computation-intensive tasks. Fog computing, on the other hand, efficiently manages the analysis of real-time data to better support the timely processing and execution of latency-sensitive tasks. However, in practice, contradicting the strong market demand, Fog computing is still in its infancy, with problems including no unified architecture, the large number and wide distribution of Edge/Fog nodes, and a lack of technical standards and specifications.

Meanwhile, container technology has been developing significantly in recent years. Compared with physical and virtual machines, containers are very lightweight, simple to deploy, support multiple architectures, have a short start-up time, and are easy to expand and migrate. These features provide a suitable solution to the problem of severe heterogeneity of Edge/Fog nodes [11].


Fig. 3 Overview of Fog/Hybrid computing environment

Container technology is being dominantly used by industry and academia to run commercial, scientific, and big data applications, build IoT systems, and deploy distributed containerized resource management frameworks such as the FogBus2 framework [12]. FogBus2, which is a distributed and containerized framework, enables fast and efficient resource management in hybrid computing environments. Considering the ever-increasing number of containerized applications and frameworks, efficient management and orchestration of resources have become an important challenge. While container orchestration tools such as Kubernetes have become the ideal solution for managing and scaling deployments, nodes, and clusters in the industry today [13], there are still many challenges with their practical deployments in hybrid computing environments. Firstly, orchestration techniques need to consider the heterogeneity of computing resources in different environments for complete adaptability. Secondly, the complexity of installing and configuring hybrid computing environments should be addressed when implementing orchestration techniques. Thirdly, a strategy needs to be investigated to solve potential conflicts between orchestration techniques and the network model in the hybrid computing environment. Also, as Edge/Fog devices are resource-limited, lightweight orchestration techniques should be deployed to free up the resources for the smooth execution of end-user applications. Finally, integrating containerized resource management frameworks with lightweight orchestration tools is another important yet challenging issue to support the execution of a diverse range of IoT applications. To address these problems, this paper investigates the feasibility of deploying container orchestration tools in hybrid computing environments to enable scalability, health checks, and fault tolerance for containers.


The main contributions of this paper can be summarized as follows:

• Presents feasible designs for implementing container orchestration techniques in hybrid computing environments.
• Proposes three design patterns for the deployment of the FogBus2 framework using container orchestration techniques.
• Puts forward the detailed configurations for the practical deployment of the FogBus2 framework using container orchestration tools.

The rest of the paper is organized as follows. Section 2 provides a background study on the relevant technologies and reviews the container orchestration techniques in Fog computing environments. Section 3 describes the configuration properties of the K3s cluster and the detailed implementation of deploying the FogBus2 framework into the K3s cluster. Section 4 presents the performance evaluation. Finally, Sect. 5 concludes the paper and presents future directions.

2 Background Technologies and Related Work

This section discusses the resource management framework and container orchestration tools, including the FogBus2 framework and K3s. Moreover, it also reviews the existing works on container orchestration in the Cloud and Edge/Fog computing environments.

2.1 FogBus2 Framework

FogBus2 [12] is a lightweight distributed container-based framework, developed from scratch using the Python programming language, enabling distributed resource management in hybrid computing environments. It integrates Edge and Cloud environments to implement multiple scheduling policies for scheduling heterogeneous IoT applications. In addition, it proposes an optimized genetic algorithm for fast convergence of resource discovery to implement a scalable mechanism that addresses the problems that arise when the number of IoT devices increases or resources become overloaded. Besides, the dynamic resource discovery mechanism of FogBus2 facilitates the rapid addition of new entities to the system. Currently, several resource management policies and IoT applications are integrated with this framework. FogBus2 contains five key containerized components, namely Master, Actor, RemoteLogger, TaskExecutor, and User, which are briefly described below.

• Master: It handles resource management mechanisms such as scheduling, scalability, resource discovery, registry, and profiling. It also manages the execution of requested IoT applications.
• Actor: It manages the physical resources of the node on which it is running. Also, it receives commands from the Master component and initiates the appropriate TaskExecutor components based on each requested IoT application.
• RemoteLogger: It collects periodic or event-driven logs from other components (e.g., profiling logs, performance metrics) and stores them in persistent storage using either a file system or database.
• TaskExecutor: Each IoT application consists of several dependent or independent tasks. The logic of each task is containerized in one TaskExecutor. Accordingly, it executes the corresponding task of the application and can be efficiently reused for other requests of the same type.
• User: It runs on the user's IoT device and handles the raw data received from sensors and processed data from the Master. It also sends placement requests to the Master component for the initiation of an IoT application.

Figure 4 shows an overview of the five main components of FogBus2 and their interactions.

Fig. 4 Key components of FogBus2 [12]
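As a purely illustrative sketch of how these containerized components fit together on a single host (the service and image names below are hypothetical placeholders, not the framework's actual artifact names), a Docker Compose description of the four long-running components might look like this:

version: "3"
services:
  remote-logger:                      # RemoteLogger: collects logs from the other components
    image: example/fogbus2-remote-logger
  master:                             # Master: scheduling, scalability, discovery, registry, profiling
    image: example/fogbus2-master
    depends_on: [remote-logger]
  actor:                              # Actor: manages a node's resources and launches TaskExecutors
    image: example/fogbus2-actor
    depends_on: [master]
  user:                               # User: runs on the IoT device side and sends placement requests
    image: example/fogbus2-user
    depends_on: [master]
# TaskExecutor containers are deliberately not listed: in FogBus2 they are started on demand
# by an Actor, one per application task, and can be reused for later requests of the same type.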

2.2 K3s: Lightweight Kubernetes

K3s is a lightweight orchestration tool designed for resource-limited environments, suitable for IoT and Edge/Fog computing [14]. Compared to Kubernetes, K3s is half the size in terms of memory footprint, but API consistency and functionality are not compromised [15]. Figure 5 shows the architecture of a K3s cluster containing one server and multiple agents.

Fig. 5 The architecture of a single server K3s cluster

Users manage the entire system through the K3s server and make appropriate use of the resources of the K3s agents in the cluster to achieve optimal operation of applications and services. K3s clusters allow pods (i.e., the smallest deployment unit) to be scheduled and managed on any node. Similar to Kubernetes, K3s clusters also contain two types of nodes, with the server running the control plane components and the kubelet (i.e., the agent that runs on each node), and the agent running only the kubelet [16]. Typically, a K3s cluster carries a server and multiple agents. When the URL of a server is passed to a node, that node becomes an agent; otherwise, it is a server in a separate K3s cluster [14, 16].
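To make the server/agent mechanics concrete, the commands below sketch how a node joins an existing server through the installer's K3S_URL and K3S_TOKEN variables (standard K3s installation commands; the server address and token are placeholders):

# On the first node, install and start a K3s server (control plane + kubelet)
curl -sfL https://get.k3s.io | sh -

# On any other node, pass the server URL and its join token so the node registers as an agent;
# without K3S_URL, the installer would start another independent K3s server instead
curl -sfL https://get.k3s.io | K3S_URL=https://<server-ip>:6443 K3S_TOKEN=<token> sh -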

2.3 Related Work

Rodriguez et al. [17] investigate multiple container orchestration tools and propose a taxonomy of different mechanisms that can be used to cope with fault tolerance, availability, scalability, etc. Zhong et al. [18] proposed a Kubernetes-based container orchestration technique for cost-effective container orchestration in Cloud environments. FLEDGE, developed by Goethals et al. [19], implements container orchestration in an Edge environment that is compatible with Kubernetes. Pires et al. [20] proposed a framework, named Caravela, that employs a decentralized architecture, resource discovery, and scheduling algorithms. It leverages users' voluntary Edge resources to build an independent environment where applications can be deployed using standard Docker containers. Alam et al. [21] proposed a modular architecture that runs on heterogeneous nodes. Based on lightweight virtualization, it creates a dynamic system by combining modularity with the orchestration provided by Docker Swarm. Ermolenko et al. [22] studied a framework for deploying IoT applications based on Kubernetes in the Edge-Cloud environment. It achieves lightweight scaling of task-based applications and allows the addition of external data warehouses.

In the current literature, some techniques such as [18, 22] use Kubernetes directly on Edge/Fog nodes, which has a high overhead on resource-limited Edge/Fog nodes. Some techniques such as [21] are restricted to running a master node (i.e., server) only on the Cloud, which does not support different cluster deployment approaches. Moreover, some orchestration techniques such as [20] only work with nodes that have public IP addresses, which restricts many use-cases in Edge/Fog computing environments where nodes do not have public IP addresses. Considering the current literature, there exists no lightweight container orchestration technique for the complete deployment of containerized resource management frameworks in hybrid computing environments, where heterogeneous nodes are distributed in Edge/Fog and Cloud computing environments.

3 Container Orchestration Approach

In this section, we propose a feasible approach for deploying container orchestration techniques in hybrid computing environments. First, we present a high-level overview of the design. Next, we introduce the concrete implementation details of the proposed approach.

3.1 Overview of the Design

To build a complete hybrid computing environment for different IoT scenarios, we use several Cloud and Edge/Fog nodes. We choose K3s as the backbone for the hybrid computing environment because it occupies less than half of the resources of Kubernetes, but fully implements the Kubernetes API, and is specially optimized for the resource-constrained nodes at the Edge/Fog layer. In practice, we use three Cloud instances at the Cloud layer and create several Linux virtual machines (aligned with the hardware specification of the Raspberry Pi Zero) as our Edge/Fog nodes. Our Cloud nodes have public IP addresses, while Edge/Fog nodes do not hold public IP addresses. To address this problem, we use Wireguard to set up a lightweight Peer-to-Peer (P2P) VPN connection among all nodes.

After creating the hybrid computing environment, we start to embed the FogBus2 resource management framework into it. To take advantage of the container orchestration tool, we allocate only one container to each pod created by K3s, with only one component of the FogBus2 framework running inside each container. Also, to balance the load on each node between clusters, we assign pods to different nodes. The initialization of the FogBus2 components requires the binding of the host IP address, which will be used to pass information between the different components. This means that in K3s clustering, the FogBus2 component needs to bind the IP address of the pod, which poses a difficulty for the implementation, as usually the pod is created at the same time as the application is deployed. To address this challenge, we evaluate three approaches and finally decide to use host network mode to deploy the FogBus2 framework in the K3s hybrid environment. Host network mode allows pods to use the network configuration of virtual instances or physical hosts directly, which addresses the communication challenge of the FogBus2 components and the conflict between the K3s network planning service and the VPN. Figure 6 shows a high-level overview of our proposed design pattern.

Fig. 6 Overview of the design pattern

3.2 Configuration of Nodes

The deployed hybrid computing environment consists of five instances, labeled A through E. Table 1 lists the nodes together with their computing layer, specifications, public IP addresses, and the private IP addresses assigned after the VPN connection is established.

3.3 P2P VPN Establishment

As shown in Table 1, the Cloud nodes have public IP addresses, while, as in most Edge/Fog environments, the Edge/Fog nodes do not. In this case, in order to build a hybrid computing environment, we need to establish a VPN connection that integrates the Cloud and Edge/Fog nodes. We use Wireguard to establish a lightweight P2P VPN connection between all the nodes.

Table 1 Configuration of nodes in integrated computing environment

Node name | Node tag | Computing layer | Specifications | Public IP | Private IP | Port | Preparation
Nectar1 | A | Cloud | 16-core CPU, 64 GB RAM | 45.113.235.156 | 192.0.0.1 | Auto assign | Docker
Nectar2 | B | Cloud | 2-core CPU, 9 GB RAM | 45.113.232.199 | 192.0.0.2 | Auto assign | Docker
Nectar3 | C | Cloud | 2-core CPU, 9 GB RAM | 45.113.232.232 | 192.0.0.3 | Auto assign | Docker
VM1 | D | Edge | 1-core CPU, 512 MB RAM | — | 192.0.0.4 | Auto assign | Docker
VM2 | E | Edge | 1-core CPU, 512 MB RAM | — | 192.0.0.5 | Auto assign | Docker
Fig. 7 A sample configuration script for the Wireguard

In the implementation, we install Wireguard on each node and generate the corresponding configuration scripts (based on the FogBus2 VPN scripts) to ensure that each node has direct access to all other nodes in the cluster. A sample configuration script for the Wireguard VPN, derived from the FogBus2 scripts, is shown in Fig. 7.
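To make the full-mesh setup concrete, the following sketch (not part of the FogBus2 scripts; node names, keys, ports, and addresses are illustrative placeholders) shows how per-node Wireguard configuration files of the kind shown in Fig. 7 could be generated, giving every node an [Interface] section and one [Peer] section for each of the other nodes:

# Sketch: generate a full-mesh Wireguard configuration per node.
# Keys, endpoints, and addresses below are placeholders, not real values.
nodes = {
    # name: (private VPN IP, public endpoint or None for NAT-ed Edge nodes)
    "nectar1": ("192.0.0.1", "45.113.235.156:51820"),
    "vm1":     ("192.0.0.4", None),
}
keys = {name: ("PRIVATE_KEY_" + name, "PUBLIC_KEY_" + name) for name in nodes}

def wireguard_conf(name):
    private_ip, _ = nodes[name]
    private_key, _ = keys[name]
    lines = [
        "[Interface]",
        f"Address = {private_ip}/24",
        f"PrivateKey = {private_key}",
        "ListenPort = 51820",
    ]
    # One [Peer] block per remote node, so every node can reach every other node.
    for peer, (peer_ip, endpoint) in nodes.items():
        if peer == name:
            continue
        lines += ["", "[Peer]", f"PublicKey = {keys[peer][1]}",
                  f"AllowedIPs = {peer_ip}/32"]
        if endpoint:  # Edge nodes behind NAT have no Endpoint of their own
            lines.append(f"Endpoint = {endpoint}")
            lines.append("PersistentKeepalive = 25")
    return "\n".join(lines)

print(wireguard_conf("vm1"))  # content for /etc/wireguard/wg0.conf on that node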

3.4 K3s Deployment

The K3s server can be located in the Cloud or at the Edge, while the remaining four nodes act as K3s agents. As the aim of this research is to enable container orchestration for the FogBus2 framework, we need to install and enable Docker on both the server and the agents before building K3s. First, we install and start the K3s server in Docker mode. K3s allows users to choose the appropriate container runtime, but as all components of FogBus2 run natively in Docker containers, we initialize the K3s server in Docker mode so that it can access the Docker images. Then, we extract a token from the server, which is used to join agents to the server. After that, we install K3s on each agent, specifying the IP address of the server and the token obtained from the server during installation to ensure that all agents can properly connect to the server. Figure 8 shows the successful deployment of the K3s cluster.

Fig. 8 K3s deployment in computing environment
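The steps above can also be scripted. The sketch below follows the public K3s installation procedure (server in Docker mode, token extraction, agent join); it is only an illustration under the assumption of SSH access to all nodes, the VPN addresses are those of Table 1, and exact flags may differ between K3s versions:

# Sketch: automate the K3s server/agent installation described above.
# Assumes SSH access to every node; run_on() is a simplified stand-in.
import subprocess

def run_on(host, command):
    """Run a shell command on a node over SSH (simplified; no error handling)."""
    return subprocess.run(["ssh", host, command], capture_output=True,
                          text=True, check=True).stdout.strip()

server = "192.0.0.1"                                   # VPN IP of the K3s server node
agents = ["192.0.0.2", "192.0.0.3", "192.0.0.4", "192.0.0.5"]

# 1. Install and start the K3s server in Docker mode.
run_on(server, "curl -sfL https://get.k3s.io | sh -s - server --docker")

# 2. Extract the join token from the server.
token = run_on(server, "sudo cat /var/lib/rancher/k3s/server/node-token")

# 3. Install K3s on each agent, pointing it at the server IP and the token.
for agent in agents:
    run_on(agent,
           f"curl -sfL https://get.k3s.io | "
           f"K3S_URL=https://{server}:6443 K3S_TOKEN={token} sh -s - agent --docker")

# 4. Verify that all agents joined the cluster (cf. Fig. 8).
print(run_on(server, "sudo k3s kubectl get nodes"))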

3.5 FogBus2 Framework Integration

In the native design of the FogBus2 framework, all components run in containers. The pod, as the smallest unit created and deployed by K3s, can wrap one or more containers. Containers in the same pod share the same namespace and local network, and they can easily communicate with containers in the same or a different pod as if they were on the same machine while maintaining a degree of isolation. So first, we face the choice of assigning only one container per pod (i.e., one component of the FogBus2 framework) or allowing each pod to manage multiple containers. The former design balances the load on K3s nodes as much as possible to facilitate better management by the controller, while the latter design reduces the time taken to communicate between components and provides faster feedback to users. We adopt the former design to achieve batch orchestration and self-healing from failures. In order to integrate all types of FogBus2 components into K3s, we first define the YAML deployment files for the necessary components. Such a file provides the object's spec, which describes the desired state of the object, as well as some basic information about the object.

In our work, the YAML deployment file declares the number of replicas of the pod, the node it is built on, the name of the image, the image pulling policy, the parameters for application initialization, and the location of the mounted volumes. Code Snippet 1 illustrates the YAML deployment file for the Master component of the FogBus2 framework.

# YAML deployment file for the Master component
# of the FogBus2 framework
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: fogbus2-master
  name: fogbus2-master
spec:
  replicas: 1
  selector:
    matchLabels:
      app: fogbus2-master
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: fogbus2-master
    spec:
      containers:
        - env:
            - name: PGID
              value: "1000"
            - name: PUID
              value: "1000"
            - name: PYTHONUNBUFFERED
              value: "0"
            - name: TZ
              value: Australia/Melbourne
          image: cloudslab/fogbus2-master
          imagePullPolicy: ""
          name: fogbus2-master
          args: ["--bindIP", "192.0.0.1",
                 "--bindPort", "5001",
                 "--remoteLoggerIP", "192.0.0.1",
                 "--remoteLoggerPort", "5000",
                 "--schedulerName", "RoundRobin",
                 "--containerName", "TempContainerName"]
          resources: {}
          volumeMounts:
            - mountPath: /var/run/docker.sock
              name: fogbus2-master-hostpath0
            - mountPath: /workplace/
              name: fogbus2-master-hostpath1
            - mountPath: /workplace/.mysql.env
              name: fogbus2-master-hostpath2
      restartPolicy: Always
      serviceAccountName: ""
      nodeName: master
      hostNetwork: true
      volumes:
        - hostPath:
            path: /var/run/docker.sock
          name: fogbus2-master-hostpath0
        - hostPath:
            path: /home/hehe/FogBus2/containers/master/sources
          name: fogbus2-master-hostpath1
        - hostPath:
            path: /home/hehe/FogBus2/containers/master/sources/.mysql.env
          name: fogbus2-master-hostpath2
status: {}

Code Snippet 1 The YAML deployment file for the Master component of the FogBus2 framework
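Once such a file is defined, it can be applied to the cluster with the kubectl client bundled with K3s, for example with k3s kubectl apply -f fogbus2-master.yaml (the file name here is illustrative); the controller then creates the corresponding pod on the node specified by nodeName.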

In the communication design of the FogBus2 framework, the initialization of components requires the binding of the host IP address, which is used to pass information between components. For example, when a Master component is created, the IP address of its host is passed in as a required parameter, and the same address is also passed as a necessary parameter when initializing the Actor component. Because the FogBus2 framework has generic functions that are used by multiple or all components, the Master component sends its assigned host/VPN IP address to the Actor component and requests that the information be returned to this address. If this IP address is not the same as the IP address used to initialize the Actor component, communication cannot be correctly established. When the FogBus2 framework is deployed using Docker Compose (i.e., the native way FogBus2 is deployed), communication between the components is smooth because the containers run directly on the host. However, when the FogBus2 framework starts in K3s, the communication mechanism between the components must be updated, since containers run in pods and each pod has its own IP address. Components cannot listen on the IP address of the host because, by default, the pod's network environment is separate from the host, which poses a challenge for the deployment of the FogBus2 framework. To cope with this problem, we propose the following three design models.

Host Network When starting FogBus2 components in a K3s cluster, instead of using the cluster's own network services, we use the host's network configuration directly. Specifically, we connect each pod directly to the network of its host. In this case, the components of the FogBus2 framework can be bound directly to the host's network at initialization, and the IP address notified to the target component is the same as the one configured by the target component at initialization. This design implements the following functions for the FogBus2 framework:

• Batch Orchestration: It allows containers to be orchestrated across multiple hosts. In contrast, the native FogBus2 uses docker-compose, which can only create a single container instance locally.
• Health Check: The system knows when the container is ready and can start accepting traffic.
• Self-healing from Failure: When a running pod stops abnormally or is deleted by mistake, the system can restart the pod.
• Dynamic Change: Users can dynamically change the resource limits of running pods, including the size of the physical memory footprint, the number of physical CPUs, etc.
• Resource Utilization: The system can distribute applications across the nodes and choose the node with the lowest physical resource usage for deployment.

However, this design pattern sacrifices some of the functionality of K3s. When pods are connected directly to the network environment of their hosts, the K3s controller cannot optimally manage all the containers within the cluster, because these services require the K3s controller to have the highest level of access to the network services used by the pods. If the pods are on a VPN network, we cannot implement all the features of K3s. In this paper, we use Host Network mode to deploy the FogBus2 framework in the K3s cluster.

Proxy Server As the problem stems from a conflict between the communication design of the FogBus2 framework and the communication model between pods in the K3s cluster, we can create a proxy server that defines appropriate routing policies to receive and forward messages from different applications. When a FogBus2 component needs to send a message to another component, the message is first sent to the proxy server, which analyzes it to extract the destination and forwards it to the IP address of the target component according to its internal routing policy. This approach bypasses the native communication model of the FogBus2 framework, and all communication between applications is done through the proxy server. There are two types of communication methods in the FogBus2 framework: proprietary methods and generic methods. The proprietary methods are used to communicate with fixed components, such as the Master and the Remote Logger, whose IP addresses are configured and stored as global variables when most components are initialized. In contrast, the generic methods are used by all components and are called by components to transmit their IP addresses as part of the message for the target component. Therefore, to enable all components to send messages to the proxy server for processing, we would need to change the source code of the FogBus2 framework so that all components are informed of the IP address of the proxy server at initialization, and to unify the two types of communication methods so that components include information about the target in the message and send it to the proxy server. As a result, this design would involve a redesign of the communication model of the FogBus2 framework.
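The following sketch illustrates only the routing idea behind this design; it is not part of FogBus2, and the message format, routing table, and port are assumptions. Each component would send a JSON message carrying the logical name of the destination, and the proxy would look the destination up in its routing table and forward the payload:

# Sketch: a minimal message-forwarding proxy for the Proxy Server design.
# Message format, port, and routing table are illustrative assumptions.
import json
import socket
import threading

ROUTES = {                       # logical component name -> (host IP, port)
    "master":        ("192.0.0.1", 5001),
    "remote_logger": ("192.0.0.1", 5000),
    "actor_1":       ("192.0.0.4", 50101),
}

def forward(payload, destination):
    """Open a connection to the target component and relay the payload."""
    host, port = ROUTES[destination]
    with socket.create_connection((host, port)) as target:
        target.sendall(payload)

def handle(conn):
    with conn:
        data = conn.recv(65536)
        message = json.loads(data.decode())
        # The proxy inspects only the routing envelope, not the message body.
        forward(json.dumps(message["body"]).encode(), message["destination"])

def serve(bind_ip="0.0.0.0", port=6000):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((bind_ip, port))
        server.listen()
        while True:
            conn, _ = server.accept()
            threading.Thread(target=handle, args=(conn,), daemon=True).start()

if __name__ == "__main__":
    serve()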

Environment Variable In the K3s cluster, when an application is deployed, the cluster controller automatically creates a pod to manage the container in which the application resides. However, in the YAML file, we can obtain the IP address of the created pod when configuring the container information, which allows us to pass it in as an environment variable when initializing the components of the FogBus2 framework (a minimal sketch of this initialization pattern is given at the end of this subsection). Then, the IP address bound to the component is the IP address of the pod, and the component can transmit this address to the target component when communicating and receive messages back. However, in our experiments, we found that pods on different nodes have problems communicating at runtime. We traced the flow of the transmitted information and found that the reason is a conflict between the network services configured within the cluster and the VPN used to build the hybrid computing environment. The pods possess unique IP addresses and use them to communicate with each other, but these addresses cannot be recognized by the VPN on the nodes, which prevents the information from being transferred between the hosts. To solve this problem, we propose two solutions:
• Solution 1: K3s uses flannel as the Container Network Interface (CNI) by default. We can change the default network service configuration of the K3s cluster and override the default flannel backend with the Wireguard backend.
• Solution 2: We can change the Wireguard settings to add the interface of the network service created by the K3s controller to the VPN profile, allowing incoming or outgoing messages from a specific range of IP addresses.
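A minimal sketch of the initialization pattern assumed by this design is shown below: the pod IP exposed by the cluster (here through a hypothetical POD_IP environment variable, e.g., populated from the pod's status in the YAML file) is read at start-up and passed to the component as its bind address. The argument names mirror those of Code Snippet 1:

# Sketch: binding a FogBus2-style component to the pod IP taken from the
# environment (POD_IP is a hypothetical variable name set in the YAML file).
import os
import subprocess

def start_component(entry_point="master.py"):
    bind_ip = os.environ.get("POD_IP", "127.0.0.1")          # fallback for local runs
    remote_logger_ip = os.environ.get("REMOTE_LOGGER_IP", bind_ip)
    # The component advertises bind_ip to its peers, so peers reply to the pod IP.
    subprocess.run(["python", entry_point,
                    "--bindIP", bind_ip,
                    "--bindPort", "5001",
                    "--remoteLoggerIP", remote_logger_ip,
                    "--remoteLoggerPort", "5000"], check=True)

if __name__ == "__main__":
    start_component()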

4 Performance Evaluation

In this section, two experiments are conducted using three real-time applications to evaluate the performance of orchestrated FogBus2 (O-FogBus2) versus native FogBus2, as well as the performance of FogBus2 in the hybrid versus the Cloud environment. The real-time applications used in the experiments are described in Table 2.

Table 2 The list of applications

Application name | Tag | Description
NaiveFormulaSerialized | Formula | A mathematical formula where different parts are calculated as different tasks
FaceDetection (480P Res) | FD480 | Face detection from real-time/recorded video streams at 480P resolution
FaceDetection (240P Res) | FD240 | Face detection from real-time/recorded video streams at 240P resolution

Fig. 9 Response times for Orchestrated FogBus2 (O-FogBus2) versus native FogBus2 for the three applications: (a) Formula, (b) FD480, (c) FD240

4.1 Experiment 1: Orchestrated FogBus2 Versus Native FogBus2

This experiment studies the performance of the FogBus2 framework deployed in K3s and compares it with the native FogBus2. In the experiment, we run the systems in the same network environment and set the same scheduling policy to ensure the reliability of the experimental results. The environment setup for this experiment is shown in Table 1. For both deployment types, we implement the same deployment strategy to ensure fairness, with the Master and one Actor running on the Edge, and the Remote Logger and two other Actors running on the Cloud. Figure 9 shows the response times of orchestrated FogBus2 and native FogBus2 for the three applications. The red dots represent the average response time, while the top and bottom green lines represent the 95% confidence interval for the mean value. For all tested applications, when FogBus2 runs in K3s, the average response time is longer than that of the native FogBus2 framework by an average of 7%. This is because the management of deployments by the K3s cluster itself requires some overhead; however, given the resource management mechanisms, scheduling, and automatic container health checks provided by K3s, we believe this overhead is very lightweight and acceptable.

4.2 Experiment 2: Hybrid Environment Versus Cloud Environment

This experiment studies the performance of O-FogBus2 deployed in the hybrid computing environment versus the Cloud computing environment. As in Sect. 4.1, the environment setup for this experiment is shown in Table 1. For the hybrid computing environment, the Master and one Actor run on the Edge, and the Remote Logger and two other Actors run on the Cloud. For the Cloud environment, all the components run on the Cloud.

Fig. 10 Response times for Orchestrated FogBus2 (O-FogBus2) in Hybrid versus Cloud deployment for the three applications: (a) Formula, (b) FD480, (c) FD240

Figure 10 depicts the response time of FogBus2 deployed in the hybrid and Cloud environments for the three applications. For all tested applications, the average response time is shorter by up to 29% when FogBus2 runs in the hybrid environment compared to when it runs in the Cloud. This is because end users are usually located at the edge of the network and the final result should be forwarded to them. If all the components of FogBus2 run in the Cloud, this takes longer and suffers from the impact of the unstable Wide Area Network (WAN). Since FogBus2 is designed for IoT devices to integrate Cloud and Edge/Fog environments, the introduction of K3s does not remove this capability, so we believe that placing the entire system in a hybrid computing environment can reasonably utilize Cloud and Edge/Fog computing resources and improve system performance.

5 Conclusions and Future Work

In this paper, we discussed the importance of resource management to support real-time IoT applications. We presented feasible designs for implementing container orchestration techniques in hybrid computing environments. This study proposed three design patterns for deploying containerized resource management frameworks, such as the FogBus2 framework, into the hybrid environment. Besides, we described the detailed configuration of the K3s deployment and the integration of the FogBus2 framework using the host network approach. The Host Network pattern connects the components of the cluster to the host network environment, using the native communication model of the FogBus2 framework by masking the internal network environment of the cluster while avoiding the network conflict problems related to the VPN. Compared to the native FogBus2 framework, the new system (i.e., O-FogBus2) enables resource limit control, health checks, and self-healing from failure to cope with the ever-changing number and functionality of connected IoT devices.

We identified several future works to further improve container orchestration for efficient resource management in hybrid computing environments. Firstly, we can consider implementing elastic scalability to automatically add or remove computing resources according to the demands of IoT applications.

To address this challenge, the Proxy Server and Environment Variable design approaches can be investigated to enable dynamic scalability. Secondly, lightweight security mechanisms can be embedded into the container orchestration mechanisms. As IoT devices are highly exposed to users, security and privacy become important. However, the limited resources of Edge/Fog devices create difficulties for the implementation of security mechanisms. Therefore, lightweight security mechanisms that ensure end-to-end integrity and confidentiality of user information can be further investigated. Next, integrating different orchestration tools, including KubeEdge, Docker Swarm, and MicroK8s, can be considered as an important future direction. Different orchestration tools may be suitable for different computing environments, so it is essential to find the best application scenarios for them. We can explore the impact of different integrated container orchestration tools for handling real-time and non-real-time IoT applications. Also, a variety of scheduling policies, ranging from heuristics to reinforcement learning techniques [2], can be implemented to automate application deployment and improve resource usage efficiency for clusters. For example, pods can be scheduled to nodes with smaller memory and CPU footprints to automatically balance the load on the cluster, or replicated pods can be spread across different nodes to avoid severe system failures. Furthermore, since machine learning techniques [2, 23] are becoming mature and widely used in various fields, we can consider integrating them into the Edge/Fog and Cloud computing environment. Machine learning techniques can be used to analyze the state of the current computing environment, improve the system's ability to manage resources, and distribute workloads. As current machine learning tools are often designed for powerful servers, future research can optimize them to run on resource-constrained Edge/Fog devices. Finally, the adopted techniques can consider the requirements of specific application domains, such as natural disaster management, which significantly affect human life.

References

1. Gubbi J, Buyya R, Marusic S, Palaniswami M (2013) Internet of things (IoT): a vision, architectural elements, and future directions. Future Gener Comput Syst 29(7):1645–1660
2. Goudarzi M, Palaniswami MS, Buyya R (2021) A distributed deep reinforcement learning technique for application placement in edge and fog computing environments. IEEE Trans Mob Comput (accepted, in press)
3. Aazam M, Khan I, Alsaffar AA, Huh E-N (2014) Cloud of things: integrating internet of things and cloud computing and the issues involved. In: Proceedings of the 11th International Bhurban conference on applied sciences & technology (IBCAST), Islamabad, Pakistan, 14th–18th January, 2014. IEEE, New York, pp 414–419
4. Goudarzi M, Wu H, Palaniswami M, Buyya R (2021) An application placement technique for concurrent IoT applications in edge and fog computing environments. IEEE Trans Mob Comput 20(4):1298–1311
5. Goudarzi M, Palaniswami M, Buyya R (2021) A distributed application placement and migration management technique for edge and fog computing environments. In: Proceedings of the 16th conference on computer science and intelligence systems (FedCSIS). IEEE, New York, pp 37–56

6. Ujjwal KC, Garg S, Hilton J, Aryal J, Forbes-Smith N (2019) Cloud computing in natural hazard modeling systems: current research trends and future directions. Int J Disaster Risk Reduct 38:101188
7. Buyya R, Srirama SN (2019) Fog and edge computing: principles and paradigms. Wiley
8. Dastjerdi AV, Buyya R (2016) Fog computing: helping the internet of things realize its potential. Computer 49(8):112–116
9. Goudarzi M, Palaniswami M, Buyya R (2019) A fog-driven dynamic resource allocation technique in ultra dense femtocell networks. J Netw Comput Appl 145:102407
10. Shi W, Cao J, Zhang Q, Li Y, Lanyu X (2016) Edge computing: vision and challenges. IEEE Internet Things J 3(5):637–646
11. Bali A, Gherbi A (2019) Rule based lightweight approach for resources monitoring on IoT edge devices. In: Proceedings of the 5th International workshop on container technologies and container clouds, pp 43–48
12. Deng Q, Goudarzi M, Buyya R (2021) FogBus2: a lightweight and distributed container-based framework for integration of IoT-enabled systems with edge and cloud computing. In: Proceedings of the international workshop on big data in emergent distributed environments, pp 1–8
13. Cai Z, Buyya R (2022) Inverse queuing model based feedback control for elastic container provisioning of web systems in Kubernetes. IEEE Trans Comput 71(2):337–348
14. Rancher Labs (2021) K3s—lightweight Kubernetes. https://rancher.com/docs/k3s/latest/en/. Accessed 24 Jan 2022
15. Todorov MH (2021) Design and deployment of Kubernetes cluster on Raspberry Pi OS. In: Proceedings of the 29th National conference with international participation (TELECOM). IEEE, New York, pp 104–107
16. Rancher Labs (2021) Architecture. https://rancher.com/docs/k3s/latest/en/architecture/. Accessed 24 Jan 2022
17. Rodriguez MA, Buyya R (2019) Container-based cluster orchestration systems: a taxonomy and future directions. Softw Pract Exp 49(5):698–719
18. Zhong Z, Buyya R (2020) A cost-efficient container orchestration strategy in Kubernetes-based cloud computing infrastructures with heterogeneous resources. ACM Trans Internet Technol (TOIT) 20(2):1–24
19. Goethals T, De Turck F, Volckaert B (2019) FLEDGE: Kubernetes compatible container orchestration on low-resource edge devices. In: Proceedings of the international conference on internet of vehicles. Springer, Berlin, pp 174–189
20. Pires A, Simão J, Veiga L (2021) Distributed and decentralized orchestration of containers on edge clouds. J Grid Comput 19(3):1–20
21. Alam M, Rufino J, Ferreira J, Ahmed SH, Shah N, Chen Y (2018) Orchestration of microservices for IoT using Docker and edge computing. IEEE Commun Mag 56(9):118–123
22. Ermolenko D, Kilicheva C, Muthanna A, Khakimov A (2021) Internet of things services orchestration framework based on Kubernetes and edge computing. In: Proceedings of the IEEE conference of Russian young researchers in electrical and electronic engineering (ElConRus). IEEE, New York, pp 12–17
23. Agarwal S, Rodriguez MA, Buyya R (2021) A reinforcement learning approach to reduce serverless function cold start frequency. In: Proceedings of the 21st IEEE/ACM international symposium on cluster, cloud and internet computing (CCGrid). IEEE, New York, pp 797–803

Is Tiny Deep Learning the New Deep Learning? Manuel Roveri

Abstract The computing everywhere paradigm is paving the way for the pervasive diffusion of tiny devices (such as Internet-of-Things or edge computing devices) endowed with intelligent abilities. Achieving this goal requires machine and deep learning solutions to be completely redesigned to fit the severe technological constraints on computation, memory, and power consumption typically characterizing these tiny devices. The aim of this paper is to explore tiny machine learning (TinyML) and introduce tiny deep learning (TinyDL) for the design, development, and deployment of machine and deep learning solutions for (an ecosystem of) tiny devices, hence supporting intelligent and pervasive applications following the computing everywhere paradigm. Keywords Tiny machine learning (TinyML) · Tiny deep learning (TinyDL) · Internet of things · Edge computing

1 Introduction

The technological evolution and the algorithmic revolution have always represented two sides of the same coin in the machine learning (ML) field. On the one hand, advances in technological solutions (e.g., the design of high-performance and energy-efficient hardware devices) have supported the design and development of increasingly complex and technologically demanding ML algorithms and solutions [33, 38]. On the other hand, novel ML algorithms and solutions have been specifically designed for target hardware devices (e.g., embedded devices or Internet-of-Things [IoT] units), enabling these devices to be endowed with advanced intelligent functionalities [34, 42].

Interestingly, deep learning (DL) is a relevant and valuable example of this strict and fruitful relationship between the technological evolution and the algorithmic revolution. Indeed, since the appearance of the first deep neural networks [22, 24], DL algorithms have completely revolutionized the ML field. Today, they represent the state-of-the-art solution for recognition, detection, classification, and prediction (to name but a few) in numerous different application scenarios [28, 31]. Noteworthily, the basis of DL, namely the idea of deep-stacked processing layers in neural networks, dates back to the late 1960s (see the seminal works in [11, 17]). At that time, the available technological solutions were unable to support the effective and efficient training of such deep neural networks. Thirty years later, the rise of hardware accelerators, such as graphics processing units (GPUs) and tensor processing units (TPUs), saw them become technological enablers of the DL revolution. This led to what is today considered the standard computing paradigm in the field: DL models trained and executed on hardware accelerators.

From scientific and technological perspectives, it is therefore crucial to identify the next technological enabler capable of supporting the next algorithmic revolution. One of the most promising and relevant technological directions is the "computing everywhere" paradigm [2, 18]. It represents a pervasive and heterogeneous ecosystem of IoT and edge devices that support a wide range of ML-based pervasive applications, from smart cars to smart homes and cities, and from Industry 4.0 to E-health. Due to the technological evolution, these pervasive and heterogeneous devices are becoming tinier and increasingly energy-efficient (often being battery-powered); hence, they are able to support effective and efficient on-device processing [2]. This is a crucial ability, since moving the processing and, in particular, the intelligent processing as close as possible to where data are generated guarantees relevant advantages in ML-based pervasive applications. Some of these advantages are as follows:
• an increase in the autonomy of these pervasive devices, which are therefore able to make decisions locally (without sending acquired data to the Cloud through the Internet for processing and then waiting for the results);
• a reduction in the latency with which a decision is made or a reaction is activated;
• a reduction in the required transmission bandwidth, hence enabling these devices to operate even in areas where high-speed Internet connections are not available (e.g., rural areas);
• an increase in the energy efficiency of these pervasive devices, since transmitting the data is much more power-hungry than processing them locally;
• an increase in the privacy of these pervasive applications, since possibly sensitive data remain on the device;
• the ability to exploit incremental or adaptive learning mechanisms to acquire fresh knowledge directly from the field, hence improving or (whenever needed) maintaining the accuracy of ML/DL models over time [8]; and
• the capability to distribute the inference and learning of ML/DL models in the ecosystem of possibly heterogeneous pervasive devices (i.e., IoT and edge devices).
The drawback of such an approach is that strict technological constraints characterize these pervasive devices in terms of computation, memory, and power consumption. Indeed, the CPU frequency of such tiny devices is typically in the order
of MHz, the RAM is in the order of a few hundred kB, and the power consumption is typically below 100 mW. Such severe technological constraints pose huge technical challenges from a design point of view on ML and, particularly, DL solutions, which are typically highly demanding in terms of computation, memory, and power consumption. This challenge is further emphasized both by the complexities in the development of embedded software for tiny devices (i.e., the firmware that runs on them) and by the need to consider a strict co-design phase that comprises hardware, software, and ML/DL algorithms. This is exactly where tiny ML (hereinafter "TinyML") comes into play, which involves the design of ML solutions that can be executed on tiny devices, hence being able to take into account constraints on memory, computation, and power consumption.

The aim of this paper is to shed light on the state-of-the-art solutions for the design of ML and DL solutions that can be executed on tiny devices. In particular, this paper focuses on the design and development of DL solutions specifically intended to be executed on tiny devices, hence paving the way for the tiny DL (hereinafter "TinyDL") revolution for smarter and more efficient pervasive systems and applications. The remainder of this paper is organized as follows. Section 2 provides an overview of TinyML, and then Sect. 3 introduces TinyDL. Section 4 details approximate computing mechanisms for TinyDL. Section 5 introduces specific TinyDL solutions for the IoT. Lastly, Sect. 6 draws conclusions and presents open research points.

2 Tiny Machine Learning: An Overview

TinyML [13, 42] is a new and promising area of ML and DL aimed at designing ML solutions that can be executed on tiny devices. Solutions in this area aim to introduce tiny models and architectures characterized by reduced memory and computational demands of the processing layers [9], as well as approximate computing solutions, such as quantization [12] and pruning [27], to address the severe technological constraints on computation, memory, and energy that characterize tiny devices. An overview of the TinyML paradigm is presented in Fig. 1.

Fig. 1 Overview of the tiny machine learning (TinyML) computing paradigm, which comprises an embedded computing board, sensors, actuators, and a software processing pipeline, including preprocessing, TinyML model inference, and postprocessing

Specifically, TinyML comprises the following two main modules:
• the hardware module: the physical resources of the tiny device, comprising the embedded computing board, sensors, actuators, and battery (optional);
• the software module: all the software components that run on the tiny device, comprising the preprocessing module, TinyML module, and postprocessing module.
Data acquired through the sensors are preprocessed to remove noise or highlight relevant features. The TinyML module receives as input the preprocessed data to produce an inference (e.g., a classification, detection, or prediction) by means of the
trained TinyML model. The output of the TinyML module is postprocessed to make a decision or activate a reaction, which is then conducted by means of the actuators.

The basis of TinyML applications is that they are designed to be "always on", in the sense that tiny devices continuously acquire and process data (through the preprocessing, TinyML, and postprocessing modules); thus, decisions are made or reactions are activated directly on the device. Examples of TinyML applications are wake-word detection, where a given command or word acquired by a microphone is recognized by the TinyML model; person detection, where images acquired by a camera are processed by the TinyML module to detect the presence of persons therein; and gesture recognition, where data acquired by MEMS accelerometers are processed by the TinyML module to recognize gestures made by people.

The development chain of TinyML applications, detailed in Fig. 2, comprises several steps that range from the hardware setup of tiny devices to the operational mode of TinyML models. Specifically, the development chain of TinyML applications comprises the following steps:
1. Hardware setup: The embedded computing board, sensors, and actuators are selected for the purposes of the TinyML application. Remarkably, the choice of the embedded computing board imposes technological constraints on memory and computation for the design and development of the TinyML model.
2. Software setup: The development toolchain for the firmware of the embedded computing board and the framework for the design of the TinyML application (e.g., TensorFlow Lite for Micro or Quantized PyTorch) are selected and configured.
3. Data collection: This step is intended to create the training set for training the TinyML model (if needed, supervised information is provided by the expert).
4. TinyML model training: The selected TinyML model (e.g., a linear classifier, decision tree, or feedforward neural network) is trained on the acquired training set.
5. Firmware development: The trained TinyML model is included in the firmware. The firmware comprises, in addition to the inference of the TinyML model, the reading of the sensors, preprocessing, postprocessing, activation of the actuators, and (if required) communication with other computing units or devices (e.g., an edge computing unit or a gateway).
6. Firmware compilation: The developed firmware comprising all of the aforementioned software components is compiled for the embedded computing board defined in Step 1 by means of the development toolchain selected in Step 2. The compiled firmware is then flashed to the embedded computing board (to accomplish this step, the hardware constraints on memory and computation posed by the embedded computing board must be satisfied).
7. TinyML operation: The tiny device, comprising the selected hardware and the compiled and flashed firmware, operates in the environment for the purpose of the given TinyML application. Information about effectiveness and efficiency can be gathered by the tiny device to monitor the status of the TinyML application over time (and, if needed, updates, patches, or bug fixes can be introduced).

Fig. 2 Development chain of tiny machine learning applications

Notably, Steps 1–6 are conducted outside of the tiny device (e.g., on the Cloud or on personal computers), whereas only Step 7 is actually executed on it. This approach guarantees sufficient computational power and memory for accomplishing the goals with the highest computation and memory demands (i.e., training set creation and TinyML model training). However, it does not foresee on-device learning, and hence it does not support the incremental or adaptive learning of the TinyML model over time (this issue is addressed in Sect. 5 by introducing on-device learning mechanisms for ML and DL).

3 From Tiny Machine Learning to Tiny Deep Learning

The increasing popularity of DL solutions has led the scientific community to start considering not only ML but also DL solutions for tiny devices. However, DL and tiny devices are considered to be at the extremes of the technological ecosystem, where the computational and memory demands of DL models are far beyond the technological abilities of tiny devices. It is therefore very important to understand where these memory and computational demands come from in DL models. On the one hand, running a DL model requires one to allocate memory for the weights (i.e., the parameters of the deep neural network) and for the feature maps (i.e., the intermediate results of the processing of DL layers). Hence, the memory demand of a DL model concerns the amount of memory required to store both the weights and the feature maps. On the other hand, the computational demand of DL solutions depends on the number of operations required to compute the output of the DL model given an input. This computational demand is typically measured in terms of floating-point operations (FLOPs) or multiply-and-accumulate (MAC) operations.

Among the DL families of solutions present in the literature, this study focuses on convolutional neural networks (CNNs), which represent the state-of-the-art solution in many recognition, classification, and detection applications (what is described here can easily be extended to other families of DL solutions). Typically, CNNs comprise convolutional filter (CONV) layers, rectified linear unit (ReLU) layers, Max/Average Pooling (MAX POOL) layers, and fully connected (FC) layers. The ReLU and MAX POOL layers do not have trainable parameters; thus, this section focuses on CONV and FC layers. The remainder of this section details the memory demand (in terms of weights and feature maps) and the computational demand of these two types of layers.

Let H, W, and C be the sizes of the input of a CONV layer, where C is the number of input channels. Let R and S be the sizes of the C-dimensional filters and let M be the number of filters in the CONV layer. Assuming zero padding, the output feature map is characterized by the sizes E × F × M, where

E = (H − R + U) / U,   F = (W − S + U) / U,

with M being the number of output channels (i.e., the number of filters) and U the stride of the filter. Hence, the memory demand M_{CONV} of a CONV layer can be computed as follows:

M_{CONV} = (N^{w}_{CONV} + N^{fm}_{CONV}) × b,

where

N^{w}_{CONV} = M × R × S × C

accounts for the total number of weights in the CONV filters (not considering the bias),

N^{fm}_{CONV} = H × W × C + E × F × M

accounts for the sum of the number of elements in the input and output feature maps, and b is the number of bits used for the representation (e.g., 8, 16, or 32 bits). The total number of MAC operations in a CONV layer can be computed as follows:

MAC_{CONV} = E × F × R × S × C × M,

while the corresponding number of FLOPs is approximately 2 × MAC_{CONV}.

Regarding the FC layers, the number of weights N^{w}_{FC} is

N^{w}_{FC} = H × W + W,

where H and W are the sizes of the input and output feature maps, respectively. Hence, the memory demand M_{FC} of an FC layer can be computed as follows:

M_{FC} = (N^{w}_{FC} + H + W) × b.

The total number of MAC operations in an FC layer can be computed as follows:

MAC_{FC} = H × W.

As an example, the LeNet-1 CNN [25], one of the first (and smallest) CNNs to be presented in the literature, requires 242 Kb and 37 Kb to store the weights and the feature maps, respectively (242 Kb is the memory required to store all of the weights of all of the processing layers, whereas 37 Kb is the memory required to store the intermediate processing results), and approximately 830K FLOPs to compute the inference. Deeper CNNs typically have much larger memory and computational demands. For example, the AlexNet CNN [22] requires approximately 240 MB just to store the weights and 1448M FLOPs for the inference. Further details about the memory and computational demands of more complex and deeper CNNs can be found in [38].
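To make the formulas above concrete, the following sketch computes the memory and MAC demands of a single CONV and FC layer; the layer sizes in the example are illustrative and are not taken from LeNet-1 or AlexNet.

# Sketch: memory (in bits) and MAC counts of CONV and FC layers,
# directly following the formulas of Sect. 3.

def conv_layer_cost(H, W, C, R, S, M, U=1, b=32):
    """CONV layer with C input channels, M filters of size R x S, stride U."""
    E = (H - R + U) // U                      # output height
    F = (W - S + U) // U                      # output width
    n_weights = M * R * S * C                 # N^w_CONV (bias not considered)
    n_feature_maps = H * W * C + E * F * M    # N^fm_CONV (input + output)
    memory_bits = (n_weights + n_feature_maps) * b
    macs = E * F * R * S * C * M
    return memory_bits, macs

def fc_layer_cost(H, W, b=32):
    """FC layer mapping an input of size H to an output of size W."""
    n_weights = H * W + W                     # N^w_FC as given in the text
    memory_bits = (n_weights + H + W) * b
    macs = H * W
    return memory_bits, macs

# Illustrative layer: 32x32x1 input, four 5x5 filters, 32-bit representation.
mem, macs = conv_layer_cost(H=32, W=32, C=1, R=5, S=5, M=4)
print(f"CONV: {mem / 8 / 1024:.1f} KB, {macs} MACs (~{2 * macs} FLOPs)")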

Given the high computational and memory demands of DL solutions and the severe technological constraints on computation and memory of tiny devices, the following question is relevant to answer: How can we fill this gap to design DL solutions for tiny devices? The answer is through TinyDL, an area of TinyML aimed at designing approximate DL models that can be executed on tiny devices. Designing TinyDL solutions requires the following three main steps to be addressed:
1. Redesigning the CNN architecture;
2. Introducing approximate computing mechanisms; and
3. Exploiting embedded-system code optimization.
In the first step, traditional CNN architectures are redesigned to consider the severe constraints on memory and computation that characterize tiny devices. Examples of such simplified CNN architectures are MobileNet [14] and SqueezeNet [16]. However, despite being carefully designed to reduce the complexities of traditional CNNs, their memory and computational demands are still far beyond the abilities of tiny devices. In the second step, to further reduce the memory and computational demands, approximate computing mechanisms are considered. Examples of these mechanisms are precision scaling (i.e., reducing the memory demand by reducing the number of bits used to represent the weights and feature maps) and task dropping (i.e., skipping the execution of some tasks in the processing pipeline). The final step involves the use of toolchains or code optimization mechanisms to further reduce the computational and memory demands for a target hardware platform. Examples of libraries and toolboxes supporting these code optimization steps are TensorFlow Lite for Micro [42], X-CUBE-AI [37], and CMSIS-NN [23]. The remainder of this paper focuses on the second step, namely approximate computing for TinyDL. Specifically, the next section details approximate solutions and mechanisms for reducing the memory and computational demands of CNNs for TinyDL.

4 Approximate Computing for Tiny Deep Learning

Approximate computing mechanisms for TinyDL can be grouped into two main families:
• Precision scaling, which aims to reduce the memory occupation of a CNN by changing the precision (i.e., the number of bits used for the representation) of the weights and feature maps;
• Task dropping, which aims to reduce the computational load and memory occupation by skipping the execution of certain tasks associated with the processing pipeline.
Precision scaling and task dropping mechanisms are detailed in Sects. 4.1 and 4.2, respectively.

4.1 Precision Scaling Mechanisms

Precision scaling for TinyDL relies on quantization mechanisms [30] to reduce the memory requirements for storing weights and feature maps in CNNs. Following the notation introduced in Sect. 3, quantization aims at reducing the parameter b accounting for the number of bits used for the representation of weights and feature maps in M_{CONV} and M_{FC}. In TinyDL, quantization can be addressed from three different perspectives: (i) how to quantize, (ii) where to quantize, and (iii) what to quantize. These three perspectives are examined in detail as follows:

How to quantize: Quantization can be implemented in TinyDL by means of linear, log-function, or data-driven approaches [30]. Linear quantization relies on uniform distances between the quantization levels (whose values depend on both the interval of values and the considered number of bits; e.g., b = 16, b = 8, or custom). By contrast, in log-function quantization, the distance between quantization levels varies. For example, log-base-2 quantization implements a logarithmic distribution of quantization levels. Finally, in data-driven quantization, quantization levels are determined or learned directly from the data (e.g., by using clustering algorithms [43]). In some extreme cases, very-low-precision quantization mechanisms have been considered, leading to binary or ternary CNNs (i.e., where b < 2). This allows the memory demand of the CNN to be significantly reduced at the expense of a reduction in the classification accuracy, which is often a relevant reduction [38].

Where to quantize: In a CNN, the quantization mechanisms can be applied either only to the weights or to both the weights and the feature maps. In the former case, one quantizes only the weights and not the feature maps, which are still represented in full-representation mode (i.e., 32 bits). This means that, following the notation introduced in Sect. 3 for CONV layers, one differentiates the value of b used for N^{w}_{CONV} from the one used for N^{fm}_{CONV} as follows:

M_{CONV} = b_w × N^{w}_{CONV} + b_{fm} × N^{fm}_{CONV},

where b_w < b_{fm} and typically b_{fm} = 32. This approach allows one to reduce the memory demand of the CNN; however, it does not reduce the corresponding computational demand, since all the operations are still conducted in full-representation mode. In the latter case, one reduces b for both the weights and the feature maps. This allows a reduction in both the memory and the computational demand, because simpler operations (e.g., integer operations) can be activated. This is a crucial point to consider in tiny devices, since an 8-bit fixed-point addition consumes 3.3× less energy than a 32-bit fixed-point addition and 30× less energy than a 32-bit floating-point addition. The same holds for an 8-bit fixed-point multiplication, which requires 15.5× less energy than a 32-bit fixed-point multiplication and 18.5× less energy than a 32-bit floating-point multiplication [38].

What to quantize: Quantization can be implemented through different modalities within the CNN pipeline. One straightforward approach is fixed quantization, which means that the same quantization mechanism and the same value of b are considered for all CNN processing layers. By contrast, variable quantization supports different values of b and possibly different quantization mechanisms for the different layers, filters, or channels in the CNN processing pipeline.

Integrating how, where, and what to quantize in TinyDL: The effects of the different types of quantization mechanisms and bits b used for the representation in CNNs have been extensively studied in the literature (see [30, 38] for a survey). Notably, while extreme solutions comprising binarized NNs (where b = 1) might suffer from a huge drop in accuracy, solutions that implement nonlinear and variable quantization mechanisms (e.g., b = 8 for the weights of the CONV layers, b = 4 for the weights of the FC layers, and b = 16 for the feature maps) could be as effective as the original CNN in b = 32 full-representation mode in image classification problems [38].
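As a minimal, self-contained illustration of linear (uniform) quantization of a weight tensor to b bits, the sketch below maps floating-point weights to integer levels and reports the corresponding memory reduction; it is not tied to any specific TinyML toolchain, and the weight tensor is hypothetical.

# Sketch: symmetric linear quantization of a weight array to b bits.
import numpy as np

def linear_quantize(weights, b=8):
    """Map float weights to integers in [-(2^(b-1)-1), 2^(b-1)-1]."""
    levels = 2 ** (b - 1) - 1
    scale = np.max(np.abs(weights)) / levels            # uniform step size
    q = np.clip(np.round(weights / scale), -levels, levels).astype(np.int32)
    return q, scale                                      # store q (b bits each) and one scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale                  # used at inference time

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(64, 3, 3, 3)).astype(np.float32)  # hypothetical CONV weights
q, scale = linear_quantize(w, b=8)
error = np.max(np.abs(w - dequantize(q, scale)))
print(f"memory: {w.size * 32} -> {w.size * 8} bits, max abs error {error:.4f}")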

4.2 Task Dropping Mechanisms

The aim of these mechanisms is to remove some tasks from the CNN processing pipeline, thereby reducing the computational load (i.e., by reducing MAC_{CONV} or MAC_{FC}) and the corresponding memory occupation (i.e., by reducing M_{CONV} or M_{FC}) of the CNN. These techniques can be classified into three main families: (i) network pruning, (ii) network architecture design, and (iii) transfer learning. These families of techniques are described as follows:

Network pruning: To ease the training, CNNs are usually over-parameterized [10]. Thus, a large number of weights in CNNs are often redundant and can be removed (i.e., set to zero) by means of pruning mechanisms. Remarkably, to mitigate the effects on accuracy, pruning mechanisms often require fine-tuning of the network weights once the redundant weights have been removed. Simple pruning and structured pruning are examples of network pruning mechanisms. As depicted in Fig. 3a, the aim of simple pruning solutions is to remove from the CNN those weights whose value is close enough to zero, regardless of their position/role in the CNN processing pipeline. As depicted in Fig. 3b, structured pruning solutions aim to prune groups of weights that belong to a given structure, such as rows, columns, or whole convolutional filters. Noteworthily, simple and structured pruning solutions can be highly effective in CNNs since they can achieve a 3–10× reduction in memory and computational demand with a negligible loss in accuracy (approximately 1%) [38].

Network architecture design: Techniques belonging to this family are aimed at replacing large filters with a sequence of smaller filters (characterized by fewer weights and operations in total) [20, 39]. Such techniques can be applied before the training (i.e., during the design of the CNN) or after the training.

Fig. 3 Network pruning mechanisms: a simple pruning; solid lines represent weights that have been set to zero in the FC layer (hence reducing the corresponding M_{FC} and MAC_{FC}); b structured pruning; Filter #1 has been removed from the convolutional filter of the CNN (hence reducing the corresponding M_{CONV} and MAC_{CONV})

In the case of application before training, the designer of the CNN could replace a large filter with a concatenation of smaller filters (hence reducing M_{CONV} and MAC_{CONV}). For example, replacing a 5 × 5 convolutional filter with two 3 × 3 convolutional filters would guarantee a 25% reduction in memory and computational demands. A more extreme solution is to replace an N × N filter with a sequence of a 1 × N and an N × 1 filter. In the case of application after training, network architecture design techniques rely on tensor decomposition mechanisms to decompose large convolutional filters by using matrix or low-rank approximation (hence reducing, even in this case, M_{CONV} and MAC_{CONV}).

Transfer learning: Transfer learning mechanisms are often considered to reduce the memory and computational demands of CNNs. As illustrated in Fig. 4, the basis of such an approach relies on the use of pretrained CNNs (e.g., MobileNet or SqueezeNet, as mentioned in Sect. 3). These pretrained CNNs are used as feature extractors for application-specific and approximated CNNs. In more detail, with l being the number of processing layers of a pretrained CNN, only the first k < l processing layers are retained and used as feature extractors, whereas the last l − k layers are replaced by a single processing layer (often a classification layer) tailored to the specific application scenario [1]. This allows for the removal of the remaining l − k processing layers from the pretrained CNN, thereby removing the M_{FC}, MAC_{FC} and M_{CONV}, MAC_{CONV} associated with those layers (hence significantly reducing both the memory and computational demands). An example of the use of this approach was described in [1], where a tiny CNN based on the pretrained AlexNet was tailored to a two-class classification problem and ported to an STM32F7 board endowed with an ARM 167 MHz M7 CPU and 512 KB RAM. The developed tiny CNN was characterized by a memory demand of 1.4 KB, an execution time of 2700 ms, and a classification accuracy of 87.9%.

Fig. 4 Transfer learning for the design of tiny CNNs
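As a minimal illustration of the simple (magnitude-based) pruning mechanism of Fig. 3a, the sketch below zeroes the weights whose absolute value falls below a threshold chosen to reach a target sparsity; the weight tensor is hypothetical, and the fine-tuning of the surviving weights discussed above is not shown.

# Sketch: magnitude-based (simple) pruning of a weight tensor.
import numpy as np

def magnitude_prune(weights, sparsity=0.7):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold        # keep only the largest-magnitude weights
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(128, 256)).astype(np.float32)  # hypothetical FC weights
pruned, mask = magnitude_prune(w, sparsity=0.7)
print(f"non-zero weights: {mask.sum()} / {mask.size}")
# With sparse storage, only the surviving weights (and their indices) are kept,
# reducing both the stored weights and the MAC operations actually executed.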

5 Tiny Deep Learning for the Internet of Things

In addition to the approximate computing mechanisms described in Sect. 4, specific TinyDL solutions are available for tiny devices operating in an ecosystem of IoT units. These solutions can be grouped into four main families: early-exit neural networks, distributed inference for TinyDL, on-device learning for TinyDL, and federated learning (FL). This section describes these four families of solutions in detail.

5.1 Early-Exit Neural Networks

The basis of DL neural networks concerns the processing of input data through a long pipeline of (possibly nonlinear) processing layers. Remarkably, the features extracted through these processing layers are characterized by increasing complexity and meaning. Indeed, in CNNs, the first processing layers typically extract coarse-grain features, such as geometrical shapes (i.e., lines, circles, and corners). The next processing layers aggregate these coarse-grain features to provide upper-level features (e.g., the presence of an eye or mouth in face recognition). Finally, the last layers process upper-level features to provide the final classification (e.g., the presence of two eyes and a mouth are relevant features for recognizing a face). In DL, this ability to process features with increasing complexity and meaning in CNNs is crucial to be able to incrementally process the input and take a decision as soon as enough confidence is achieved [5, 35]. This approach, which is called
early-exit or gate-classification neural network, relies on the introduction of additional classification layers (called gate-classification layers) at relevant points of the CNN processing pipeline. The goal of these additional classification layers is to compute both the classification of the input image and the corresponding confidence (typically modeled as the posterior probability of the classification). When the confidence at a given gate-classification layer is larger than an automatically computed threshold, the classification provided at that layer becomes the final classification and the execution of the remaining layers in the CNN is skipped. This allows one to save the computational load (i.e., MAC_{FC} or MAC_{CONV}) and the energy consumption associated with the execution of those layers, which is a relevant advantage for battery-powered IoT units running TinyDL models. Early-exit neural networks have also been studied in conjunction with precision scaling mechanisms, combining the idea of increasing complexity and meaning of features with heterogeneous quantization mechanisms [26].
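The control flow of an early-exit (gate-classification) network can be sketched as follows; the backbone blocks, gate classifiers, and confidence thresholds below are placeholders rather than a trained model:

# Sketch: early-exit inference; exit as soon as a gate classifier is confident enough.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def early_exit_inference(x, blocks, gates, thresholds):
    """blocks: list of feature extractors; gates: one classifier per exit point."""
    features = x
    for block, gate, threshold in zip(blocks, gates, thresholds):
        features = block(features)
        probs = softmax(gate(features))
        if probs.max() >= threshold:           # confident enough: skip remaining layers
            return probs.argmax(), probs.max()
    return probs.argmax(), probs.max()         # final exit if no gate was confident

# Placeholder model: random linear blocks and gates over a 16-dimensional input.
rng = np.random.default_rng(0)
blocks = [lambda f, W=rng.normal(size=(16, 16)): np.tanh(f @ W) for _ in range(3)]
gates = [lambda f, W=rng.normal(size=(16, 4)): f @ W for _ in range(3)]
label, confidence = early_exit_inference(rng.normal(size=16), blocks, gates,
                                          thresholds=[0.9, 0.8, 0.0])
print(label, round(float(confidence), 3))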

5.2 Distributed Inference for TinyDL

To match the severe technological constraints of IoT units, the inference of a TinyDL model could be allocated not just to one IoT unit but to the whole ecosystem of units. This approach, which shares some similarities with distributed DL solutions for edge and fog computing [15, 40] and offloading mechanisms [36], is aimed at optimally allocating the layers of a DL solution in an ecosystem of (possibly heterogeneous) IoT units by considering the communication and computation capabilities as well as the memory constraints of the units [7]. In greater detail, the idea behind the distributed inference of TinyDL solutions is to achieve an optimal distributed assignment of the layers of a TinyDL solution (in the present case a CNN, but this approach can easily be extended to other DL families of solutions) to the IoT units by minimizing the latency with which a decision is made. Interestingly, this latency, which measures the time from which a new image is acquired up to the classification being made, considers both the time required for the processing of the CNN layers on the different IoT units and the time required to transmit the intermediate results (i.e., feature maps) between the IoT units during processing, thus considering the specific communication technology (e.g., Wi-Fi 4 or Wi-Fi HaLow) or the communication channel conditions (e.g., traffic congestion or packet dropping). This optimal placement of CNN layers on an ecosystem of IoT units has also been extended to support the early-exit neural networks for TinyDL described in Sect. 5.1 as well as multiple TinyDL solutions that possibly share some processing layers following the transfer learning approach introduced in Sect. 3.
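The objective of such an assignment can be sketched, under strong simplifying assumptions, as follows: for a fixed mapping of layers to IoT units, the end-to-end latency is the sum of the per-layer compute times plus the time needed to transmit intermediate feature maps whenever two consecutive layers sit on different units (all numbers below are illustrative):

# Sketch: latency of a given layer-to-unit assignment (illustrative values only).

def assignment_latency(compute_ms, fmap_bits, assignment, link_bps):
    """compute_ms[i]: time of layer i on its assigned unit (ms);
    fmap_bits[i]: size of the feature map produced by layer i;
    assignment[i]: unit running layer i; link_bps: bandwidth between units."""
    latency = 0.0
    for i, cost in enumerate(compute_ms):
        latency += cost
        next_unit = assignment[i + 1] if i + 1 < len(assignment) else assignment[i]
        if next_unit != assignment[i]:                  # feature map crosses the network
            latency += fmap_bits[i] / link_bps * 1000   # transmission time in ms
    return latency

compute_ms = [4.0, 6.0, 2.5, 1.0]             # per-layer compute time on its unit
fmap_bits = [200_000, 80_000, 20_000, 4_000]  # per-layer output feature map size
print(assignment_latency(compute_ms, fmap_bits,
                         assignment=["unit1", "unit1", "unit2", "unit2"],
                         link_bps=1_000_000))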


5.3 On-Device Learning for TinyDL As described in Sect. 2, the learning of TinyML and TinyDL models is typically conducted in the Cloud or on PCs due to the high computational and memory demands of this step. Currently, research is focused on lightweight learning mechanisms to support the (at least partial) learning or adaptation of TinyDL models directly on tiny devices. This is a particularly challenging task since the severe technological constraints that characterize these devices pose strong limitations on both the inference and learning of TinyDL models. An example of on-device learning for TinyDL was proposed in [6], which introduced the joint use of transfer learning and incremental K-nearest neighborhood classifiers to support the on-device training and adaptation over time of TinyML models initially trained with only a small amount of data. Similarly, [32] introduced an incremental on-device learning mechanism for anomaly detection in IoT units based on an encoder–decoder architecture, whereas [3] proposed an incremental learning mechanism to update the bias of the considered deep neural networks.
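A sketch of the kind of on-device scheme described for [6]: a frozen feature extractor (e.g., the convolutional part of a pre-trained CNN) produces embeddings, and a small incremental k-nearest-neighbour classifier over a bounded buffer is updated directly on the device. The class, buffer policy and parameter names are assumptions, not the authors' implementation.

```python
import numpy as np

class IncrementalKNN:
    """Incrementally updatable k-NN head over embeddings from a frozen extractor."""

    def __init__(self, k=3, max_samples=200):
        self.k, self.max_samples = k, max_samples
        self.feats, self.labels = [], []

    def add_example(self, feature, label):
        # Bounded memory for tiny devices: drop the oldest stored embedding.
        if len(self.feats) >= self.max_samples:
            self.feats.pop(0)
            self.labels.pop(0)
        self.feats.append(np.asarray(feature, dtype=np.float32))
        self.labels.append(label)

    def predict(self, feature):
        # Majority vote among the k closest stored embeddings.
        X = np.stack(self.feats)
        d = np.linalg.norm(X - np.asarray(feature, dtype=np.float32), axis=1)
        nearest = np.argsort(d)[: self.k]
        votes = [self.labels[i] for i in nearest]
        return max(set(votes), key=votes.count)
```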

5.4 Federated Learning FL algorithms [41] are aimed at supporting the learning of a global model leveraging data distributed across various pervasive devices (e.g., smartphones or edge computing units). To achieve this goal, FL [21, 44] relies on a server, which orchestrates the global learning phase, and pervasive devices, called workers, which execute a local learning phase operating on (possibly sensitive) data. The core of FL is that workers do not share data with the server or among them. They simply send the locally trained model to the server, which performs an aggregation of all of the received models (e.g., an average of the models, as in [29]). The aggregated model is then sent back to the workers, which use it as the initialization of the local model for the next local learning phase. FL currently represents one of the most promising directions in the field of TinyML and TinyDL operating on tiny devices since on-device learning can be extended in a federated manner to the whole ecosystem of devices [19]. Recently, research has also focused on FL algorithms that can be executed in time-varying conditions [4], which represent fairly common conditions where workers must operate in real-world environments (e.g., IoT units that operate in a natural environment or interact with users who might change their behavior or interests).
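A sketch of one FedAvg-style aggregation round in the spirit of [29]: the server averages the workers' locally trained parameters, weighted by their local sample counts, and no raw data is exchanged. The flattened-parameter representation and the example numbers are assumptions for illustration.

```python
import numpy as np

def federated_averaging(local_models, n_samples):
    """Aggregate locally trained parameter vectors by a sample-weighted average."""
    weights = np.asarray(n_samples, dtype=float)
    weights /= weights.sum()
    stacked = np.stack([np.asarray(m, dtype=float) for m in local_models])
    return (weights[:, None] * stacked).sum(axis=0)

# Example: three workers send flattened parameter vectors to the server, which
# returns the aggregated model used to initialise the next local learning phase.
global_model = federated_averaging(
    [np.array([0.1, 0.3]), np.array([0.2, 0.2]), np.array([0.4, 0.1])],
    n_samples=[100, 50, 150],
)
```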


6 Conclusions The aim of this paper was to introduce TinyML and TinyDL as cutting-edge research areas for the design and development of the novel generation of intelligent and pervasive applications following the computing everywhere paradigm. Approximate computing mechanisms and specific solutions for TinyDL in IoT systems were introduced and described. Despite relevant advances in this field, research has only just scratched the surface of the topic. In the future, the present authors expect a steadily increasing interest in the field from various relevant perspectives, such as TinyDL for heterogeneous platforms and systems (e.g., edge computing units endowed with hardware accelerators), TinyDL in the presence of time-varying conditions or concept drift, and the ethical and sustainable aspects of TinyML and TinyDL.

References 1. Alippi C, Disabato S, Roveri M (2018) April) Moving convolutional neural networks to embedded systems: the AlexNet and VGG-16 Case. In: 17th ACM/IEEE International conference on information processing in sensor networks (IPSN). IEEE, Porto, pp 212–223 2. Alippi C, Roveri M (2017) The (Not) far-away path to smart cyber-physical systems: an information-centric framework. Computer 50(4):38–47 3. Cai H, Gan C, Zhu L, Han S (2020) Tiny transfer learning: towards memory-efficient on-device learning 4. Canonaco G, Bergamasco A, Mongelluzzo A, Roveri M (2021) Adaptive federated learning in presence of concept drift. In: 2021 International joint conference on neural networks (IJCNN). IEEE, New York, pp 1–7 5. Disabato S, Roveri M (2018) Reducing the computation load of convolutional neural networks through gate classification. In: 2018 International joint conference on neural networks (IJCNN). IEEE, New York, pp 1–8 6. Disabato S, Roveri M (2020) Incremental on-device tiny machine learning. In: Proceedings of the 2nd International workshop on challenges in artificial intelligence and machine learning for internet of things, pp 7–13 7. Disabato S, Roveri M, Alippi C (2021) Distributed deep convolutional neural networks for the internet-of-things. IEEE Trans Comput 8. Ditzler G, Roveri M, Alippi C, Polikar R (2015) Learning in nonstationary environments: a survey. IEEE Comput Intell Maga 10(4):12–25 9. Falbo V, Apicella T, Aurioso D, Danese L, Bellotti F, Berta R, Gloria AD (2019) Analyzing machine learning on mainstream microcontrollers. In: International conference on applications in electronics pervading industry, environment and society. Springer, Berlin, pp 103–108 10. Frankle J, Carbin M (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 11. Fukushima K (1980) Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern 36:193–202 12. Gholami A, Kim S, Dong Z, Yao Z, Mahoney MW, Keutzer K (2021) A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630 13. Higginbotham S (2019) Machine learning on the edge-[internet of everything]. IEEE Spectrum 57(1):20 14. Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861


15. Hu D, Krishnamachari B (2020) Fast and accurate streaming CNN inference via communication compression on the edge. In: 2020 IEEE/ACM fifth international conference on internet-of-things design and implementation (IoTDI). IEEE, New York, pp 157–163 16. Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ, Keutzer K (2016) Squeezenet: Alexnet-level accuracy with 50× fewer parameters and <0.5 MB model size



Fig. 2 Confusion matrix of binary and multiclass classification

a. Binary classification refers to classification problems having two class labels. Usually, binary classification tasks include one class for the normal condition, denoted by 0, and another for the abnormal state, denoted by 1.

b. Multiclass classification refers to classification problems in which each instance has exactly one of more than two class labels. Unlike binary classification, multiclass classification does not have the concept of normal and abnormal outcomes; instead, samples are grouped into one of several classes.

c. Multi-label classification refers to classification tasks in which each instance may be assigned multiple class labels.

Evaluation Metrics: Evaluating a model is an essential step in developing a powerful machine learning model and in explaining the performance of a classification model. The following are some important evaluation metrics used to assess the models (a minimal code sketch computing them is given after this list):

a. Confusion matrix: An n × n matrix, where n represents the number of classes. For binary classification n = 2 (classes 0 and 1), whereas for multiclass classification n is greater than 2, as shown in Fig. 2. It compares predicted values against actual values. A confusion matrix [17] summarises the prediction outcomes of a classification problem: the numbers of correct and incorrect predictions are reported as counts. True positive (TP): an observation is predicted to belong to a class and it really belongs to that class. True negative (TN): an observation is predicted not to belong to a class and it indeed does not belong to that class. False positive (FP): an observation is wrongly classified as belonging to a class when it does not. False negative (FN): an observation is predicted not to belong to a class when it really does.

b. Accuracy: The ratio of correct predictions over the total number of predictions, i.e. (TP + TN)/(TP + TN + FP + FN). Accuracy is of prime importance for selecting the best model.

c. Sensitivity/Recall: The number of samples correctly identified as positive out of the total true positives; Sensitivity = TP/(TP + FN).

d. Specificity: Specificity is given by TN/(TN + FP), i.e. the percentage of true negative instances out of the overall actual negative instances present in the data set.
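A minimal sketch computing these metrics for the binary case with scikit-learn; the y_true and y_pred arrays are illustrative placeholders, not the study's data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels: 0 = no anaemia, 1 = anaemia
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# For binary labels ordered (0, 1), ravel() returns TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)          # recall
specificity = tn / (tn + fp)
print(accuracy, sensitivity, specificity)
```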

2 Methodology This study has been carried out on an actual patient anaemia data set obtained for academic research purposes from a clinical laboratory, to which the machine learning algorithms were applied. The data set comprises 2000 test samples covering different categories of anaemia and eight features of RBCs that are very significant for the identification of anaemia: age, gender (male/female), haemoglobin, PCV, RBC count, MCV, MCH and MCHC. These are quantifiable properties of the data objects that were employed as input variables to make the anaemia predictions. The optimum ranges of the features related to haemoglobin, RBC count and the RBC indices are listed below for ready reference [18]:

a. Haemoglobin (Hb): male 13 to 17 g per decilitre (g/dL); female 12 to 15 g/dL.
b. RBC count: men 4.5 to 5.5 million/mm3; women 3.8 to 4.8 million/mm3.
c. Haematocrit (packed cell volume, PCV): men 41 to 50%; the usual range for women is moderately lower, 36–46%.
d. MCV (mean corpuscular volume): expressed in fL (10−15 L); the normal range is 80–100 fL. Values below 80 fL are described as microcytic (MHA), and values above 100 fL as macrocytic.
e. Mean corpuscular haemoglobin (MCH): expressed in pg (10−12 g); the normal range is 26–34 pg. Values below 26 pg are found in microcytic hypochromic anaemia, and higher MCH is found in macrocytic anaemia.
f. Mean corpuscular haemoglobin concentration (MCHC): expressed in g/dL; the normal range is 31–37 g/dL. Below 31 g/dL, hypochromic RBCs are observed, as in iron-deficiency anaemia; values above 37 g/dL (hyperchromic) are unusual, as RBCs do not normally contain more haemoglobin.

As stated earlier, the aim was to develop an automatic machine learning pipeline in which both binary and multiclass classification of the 2000 anaemia samples was performed. In this study, five conditions were classified, of which four were common types of anaemia, i.e. BTT, DA, MBP and MHA, and one was the normal condition of no anaemia (NNBP), as detailed above.


Fig. 3 Machine learning process workflow

The steps followed in this study adhere to the standard machine learning process flow shown in Fig. 3:

• Gathering data
• Data pre-processing—data cleaning, data selection and data training
• Model selection
• Validation

2.1 Gathering Data The data collection procedure was performed using real-time RBC parameter data from a clinical laboratory. The acquired data cannot be used directly for the classification process because they may contain missing values, excessively large values, or disorganised and noisy entries. Since machine learning algorithms are greatly influenced by the nature of the data set, data cleaning and data preparation are crucial. Samples of the data set for both binary and multiclass classification are shown in Tables 3 and 4.


Table 3 Data set sample for binary classification before applying normalisation

Patient No. | Age | Sex | Heamoglobin | PCV | RBC | MCV | MCH | MCHC | Anemia (Y/N)
1 | 44 | F | 13.9 | 41.8 | 4.79 | 87.265136 | 29.018789 | 33.253589 | 0
2 | 49 | M | 16.6 | 49.7 | 5.10 | 97.450980 | 32.549020 | 33.400402 | 0
3 | 67 | M | 15.4 | 44.9 | 4.27 | 105.500000 | 36.600000 | 35.100000 | 1
4 | 53 | M | 15.8 | 45.1 | 4.45 | 105.600000 | 35.900000 | 34.700000 | 1
5 | 34 | M | 13.2 | 39.5 | 5.08 | 77.755906 | 25.984252 | 33.417722 | 0

Table 4 Data set sample for multiclass classification before applying normalisation

Patient No. | Age | Sex | Heamoglobin | PCV | RBC | MCV | MCH | MCHC | Type of Anemia
1 | 44 | F | 13.9 | 41.8 | 4.79 | 87.265136 | 29.018789 | 33.253589 | NNBP
2 | 5 | M | 12.5 | 37.8 | 5.34 | 70.786517 | 23.408240 | 33.068783 | BTT
3 | 51 | F | 11.0 | 33.1 | 4.76 | 69.537815 | 23.109244 | 33.232628 | MHA
4 | 81 | M | 8.8 | 26.4 | 2.67 | 98.876404 | 32.958801 | 33.333333 | DA
5 | 67 | M | 15.4 | 44.9 | 4.27 | 105.500000 | 36.600000 | 35.100000 | MBP

2.2 Data Pre-processing

a. Data cleaning: Data pre-processing is one of the most crucial processes in machine learning and the most important stage for improving the accuracy of machine learning models. The data are available in tabular form, and the data set was cleaned and made ready for further processing using the following steps:
• Reading the data—the raw data were read into the analysis system, i.e. the .csv file was read into Pandas.
• Variable identification—the predictor and target variables were identified, along with the data types of the variables and their kind (categorical or continuous). Gender was numerised by taking 0 and 1 to indicate male and female, respectively.
• Univariate analysis—the variables were analysed one by one using methods such as bar plots and histograms.
• Bivariate analysis—the relation between pairs of variables was analysed.
• Missing value treatment—missing values were treated using the mean, mode and median (as shown in Table 5).

b. Data feature selection: Since this is real-time data, after discussion with experts no specific feature was removed; all the features of the data were used for the classification process.

c. Data training: The cleaned data set was then used for training and testing. Out of the total 2000 anaemia records, 100 were

Table 5 Data exploration

Column | data_type | null_count
Age | int64 | 0
Heamoglobin | float64 | 0
PCV | float64 | 0
RBC | float64 | 0
MCV | float64 | 0
MCH | float64 | 0
MCHC | float64 | 0
Anemia (Y/N) | int64 | 0
Type of Anemia | object | 0
F | uint8 | 0
M | uint8 | 0

retained for validation purposes. Of the remaining 1900 records, 80% were used for training and 20% for testing. The classifier was trained on the training data set, its parameters were adjusted using the test data set, and its performance was then assessed on the unseen validation data set.
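A minimal sketch of the pre-processing and splitting steps described in Sects. 2.1–2.2; the CSV file name, exact column names and dummy-column prefixes are assumptions rather than details taken from the study.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("anaemia.csv")                      # hypothetical file name

# Numerise gender (one indicator column per category, as in Table 5's F/M columns)
df = pd.get_dummies(df, columns=["Sex"])

# Missing-value treatment with the column mean (median/mode are alternatives)
num_cols = ["Age", "Heamoglobin", "PCV", "RBC", "MCV", "MCH", "MCHC"]
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Hold out 100 records for final validation, then split the rest 80/20
holdout = df.sample(n=100, random_state=42)
rest = df.drop(holdout.index)
X = rest[num_cols + ["Sex_F", "Sex_M"]]
y = rest["Anemia (Y/N)"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```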

2.3 Model Selection In this study, five supervised algorithms were used for both binary and multiclass classification of anaemia, namely logistic regression, K-nearest neighbours, support vector machine, decision tree and random forest. The best model, based on hyperparameter tuning with the GridSearchCV function, was selected for the classification process. With the correct use of these hyperparameters, the accuracy of each model was increased. GridSearchCV was applied by importing it from the sklearn Python library. Trying different combinations of hyperparameters, GridSearchCV reported the best hyperparameters, listed in Table 6, via the grid_search.best_params_ attribute. By using these hyperparameters, the model accuracy was improved.
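A hedged sketch of the GridSearchCV-based model selection described above, shown for the random forest classifier; the parameter grid is illustrative, and X_train/y_train are assumed to come from a split such as the one sketched earlier.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative search space (not the exact grid used in the study)
param_grid = {
    "n_estimators": [5, 7, 10],
    "max_depth": [5, 10, None],
    "bootstrap": [True],
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)        # best hyperparameters, as in Table 6
best_model = grid_search.best_estimator_
```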

2.4 Validation Cross-validation is a technique that allows the model performance to be evaluated. The data set was split into two sections, a training set and a test set: 80% of the data were used for training and 20% for testing using the train_test_split function. As the data set contained only 2000 samples, K-fold cross-validation was applied, in which the data set is divided into folds and multiple iterations are executed so that each fold serves once as the test set while the remaining folds form the training set. The average of all fold scores was then calculated.
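A sketch of the K-fold cross-validation described above using scikit-learn; the estimator, the number of folds and the X/y variables are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=kf, scoring="accuracy")
print(scores, scores.mean())   # one score per fold, then the average
```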


Table 6 Best hyperparameters using GridSearchCV

Supervised machine learning algorithm | Binary classification | Multiclass classification
Logistic regression | 'C': 21.54434690031882, 'max_iter': 100, 'penalty': 'l2', 'solver': 'newton-cg' | 'C': 2.782559402207126, 'max_iter': 100, 'penalty': 'l2', 'solver': 'newton-cg'
K-nearest neighbours | 'n_neighbours': 3 | 'n_neighbours': 3
Support vector machine | 'C': 1, 'gamma': 0.1, 'kernel': 'rbf' | 'C': 100, 'kernel': 'linear'
Decision tree | 'criterion': 'entropy', 'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 8 | 'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 2
Random forest | 'bootstrap': True, 'max_depth': 5, 'max_features': 'auto', 'n_estimators': 5 | 'bootstrap': True, 'max_depth': 10, 'max_features': 'auto', 'n_estimators': 7

Further, with hyperparameter tuning for each model, the accuracy improved and no overfitting of the models was observed.

3 Results and Discussion In this study, an attempt was made to select the best algorithm for detecting anaemia based on the evaluation metrics: confusion matrix, accuracy, sensitivity and specificity. Firstly, the five supervised machine learning algorithms were applied for binary classification to classify the data set as anaemia or no anaemia; here the classification target is "Anaemia Y/N", where "0" means "no anaemia" and "1" means "anaemia". Then, the records with anaemia conditions were further classified using multiclass classification; here the classification target is "Type of Anaemia", with labels BTT, DA, MBP, MHA and NNBP. The different evaluation metrics for the five algorithms are given below.

3.1 Confusion Matrix In this study, for binary classification n = 2 (classes 0 and 1), whereas for multiclass classification n = 5 (BTT, DA, MBP, MHA and NNBP). TP represents the number of samples having anaemia correctly predicted as having anaemia. TN refers to the number of samples not having anaemia correctly predicted as not having anaemia. FP represents the number of samples not having anaemia that are


Table 7 Confusion matrices with normalisation (rows = actual class, columns = predicted class)

Binary classification (class order 0, 1):
Logistic regression: [[0.97 0.02] [0.08 0.92]]
K-nearest neighbours: [[0.97 0.02] [0.08 0.92]]
Support vector machine: [[0.97 0.02] [0.05 0.95]]
Decision tree: [[0.99 0.01] [0.06 0.94]]
Random forest: [[1 0] [0.06 0.94]]

Multiclass classification (class order BTT, DA, MBP, MHA, NNBP):
Logistic regression: [[0.91 0 0 0.07 0.02] [0 1 0 0 0] [0 0 1 0 0] [0 0 0 0.86 0.14] [0 0 0 0 1]]
K-nearest neighbours: [[1 0 0 0 0] [0 0.97 0 0 0.03] [0 0 1 0 0] [0 0 0 0.87 0.13] [0 0 0 0.01 0.99]]
Support vector machine: [[0.93 0 0 0.07 0] [0 1 0 0 0] [0 0 1 0 0] [0 0 0 0.86 0.14] [0 0 0 0 1]]
Decision tree: [[1 0 0 0 0] [0 1 0 0 0] [0 0 1 0 0] [0.02 0 0 0.83 0.14] [0 0 0 0 1]]
Random forest: [[1 0 0 0 0] [0 1 0 0 0] [0 0 1 0 0] [0 0 0 0.87 0.13] [0 0 0 0 1]]

incorrectly predicted as having anaemia. FN refers to the number of samples having anaemia that are incorrectly predicted as not having anaemia. The confusion matrices with normalisation for all five supervised machine learning algorithms are shown in Table 7.

3.2 Accuracy Accuracy is of prime importance to medical practitioners as it helps them schedule the correct treatment procedures. In the medical field, disease prediction should be as close to 100% accurate as possible; neither FP nor FN is desirable. The accuracy of both binary and multiclass classification is listed in Tables 8 and 9 (Figs. 4 and 5).


Table 8 Accuracy of binary classification — comparison of machine learning algorithms

Algorithm | LR | SVM | KNN | DT | RF
Accuracy (Anaemia 'Yes') | 0.9434 | 0.9665 | 0.9614 | 0.9735 | 0.9768

Table 9 Accuracy of machine learning algorithms in multiclass classification

Class | LR | SVM | KNN | DT | RF
BTT | 0.93 | 0.96 | 0.99 | 0.99 | 0.99
DA | 0.99 | 0.98 | 0.98 | 1 | 1
MBP | 0.98 | 0.98 | 0.98 | 1 | 1
MHA | 0.89 | 0.91 | 0.89 | 0.90 | 0.91
NNBP | 0.98 | 0.98 | 0.97 | 0.97 | 0.97

Fig. 4 Comparison of accuracy in binary classification

Fig. 5 Comparison of accuracy of machine algorithms

3.3 Sensitivity/Recall In this study, sensitivity describes how well a test can detect anaemia in people who actually have the condition; higher sensitivity indicates a better model. The sensitivity of both binary and multiclass classification is listed in Tables 10 and 11.


Table 10 Sensitivity of binary classification — comparison of machine learning algorithms

Algorithm | LR | SVM | KNN | DT | RF
Sensitivity (Anaemia "Yes") | 0.94 | 0.95 | 0.96 | 0.96 | 0.96

Table 11 Sensitivity of multiclass classification

Class | LR | SVM | KNN | DT | RF
BTT | 0.91 | 0.93 | 0.98 | 0.98 | 0.99
DA | 0.98 | 0.98 | 0.97 | 0.99 | 1
MBP | 0.98 | 0.98 | 0.98 | 0.99 | 1
MHA | 0.86 | 0.86 | 0.87 | 0.90 | 0.88
NNBP | 0.98 | 0.98 | 0.99 | 0.99 | 0.99

3.4 Specificity The term "specificity" relates to determining how many healthy people were correctly reported as not having anaemia. The specificity of both binary and multiclass classification is listed in Tables 12 and 13. Figures 5, 6 and 7 demonstrate the comparative performance of each classification method based on accuracy, sensitivity and specificity. Based on these evaluation metrics, the best classifier model was identified. Based on the confusion matrix, the random forest classifier was the best model. Based on accuracy as well, the random forest classifier was the best model, with an accuracy of 97.4%. Based on sensitivity, both the decision tree and random forest classifiers were the best, with 97.2%. Based on specificity, random forest was the best classifier model, with 99%.

Table 12 Specificity of binary classification — comparison of machine learning algorithms

Algorithm | LR | SVM | KNN | DT | RF
Specificity (Anaemia "Yes") | 0.94 | 0.98 | 0.97 | 0.99 | 0.99

Table 13 Specificity of multiclass classification

Class | LR | SVM | KNN | DT | RF
BTT | 0.98 | 0.99 | 0.99 | 0.99 | 0.99
DA | 0.97 | 0.98 | 0.97 | 1 | 0.99
MBP | 0.97 | 0.98 | 0.98 | 0.99 | 1
MHA | 0.98 | 0.99 | 0.99 | 0.99 | 1
NNBP | 0.96 | 0.96 | 0.96 | 0.98 | 0.98


Fig. 6 Comparison of sensitivity of machine algorithms

Fig. 7 Comparison of specificity of machine algorithms

In this study, both binary and multiclass classification have been used to detect anaemia. The results indicate that the random forest and decision tree classifier models were superior to the other algorithms based on the different evaluation metrics.

4 Conclusion and Future Work In this paper, a machine learning based automatic classifier of anaemia was developed, which classifies around 2000 samples of actual patient data into five categories (BTT, DA, MBP, MHA and NNBP) based on RBC parameters such as haemoglobin, RBC count, PCV, MCV, MCH and MCHC. As automatic prediction can help minimise the time spent on manual diagnosis, five supervised ML algorithms were developed and applied to the patient data for classification. The performance of the five ML algorithms in the prediction of anaemia was analysed and compared. The experimental results indicate that the decision tree and random forest algorithms outperform logistic regression, K-nearest neighbours and support vector machine in terms of accuracy, sensitivity and specificity. In the


future, automated systems can be developed that use the prediction results to suggest additional anaemia diagnostics, such as evaluating the levels of ferritin, transferrin and serum iron. In addition, the application of such disease prediction systems can be further expanded to propose a treatment strategy. Acknowledgements We would like to thank Nandan Clinical Laboratory, Bagalkot, Karnataka, India for sharing the patient anaemia data used in this study, which was very helpful for the analysis.

References 1. Ministry of Health and Family Welfare, Government of India & International Institute for Population (n.d.-b). National Family Health Survey (NFHS-5) 2019–20. Retrieved 18 Mar 2021, From http://rchiips.org/NFHS/NFHS5_FCTS/NFHS5%20State%20Factsheet% 20CompendiumPhase-I.pdf 2. Provan D (ed) (2009) ABC of clinical haematology, vol 73. Wiley. ISBN 9781118892480 (epub) | ISBN 9781118892343 (pbk.) 3. Singh T (2010) Atlas and text of hematology, vol 136. Avichal Publishing Company, New Delhi. ISBN: 9788177395402 4. Bharati S, Pal M, Bharati P (2020) Prevalence of anaemia among 6-to 59-month-old children in India: the latest picture through the NFHS-4. J Biosoc Sci 52(1):97–107. https://doi.org/10. 1017/S0021932019000294 5. Hamid GA, Clinical hematology. https://doi.org/10.13140/RG.2.1.1477.1683 6. Sims EG (1983) Hypothyroidism causing macrocytic anemia unresponsive to B12 and folate. J Natl Med Assoc 75(4):429. PMID: 6864824; PMCID: PMC2561557 7. Adegoke SA, Kuti BP (2013) Evaluation of clinical severity of sickle cell anemia in Nigerian children. J Appl Hematol 4(2):58 8. Garg P, Dey B, Deshpande AH, Bharti JN, Nigam JS (2017) Clinico-hematological profile of dimorphic anemia. J Appl Hematol 8(3):123. https://doi.org/10.4103/joah.joah_40_17 9. Sadiq S et al (2021) Classification of β-thalassemia carriers from red blood cell indices using ensemble classifier. IEEE Access 9:45528–45538. https://doi.org/10.1109/ACCESS.2021.306 6782 10. Seddik AF, Shawky DM (2015)Logistic regression model for breast cancer automatic diagnosis. In: 2015 SAI intelligent systems conference (IntelliSys). IEEE, New York. https://doi.org/10. 1109/IntelliSys.2015.7361138 11. Guo G et al (2003) KNN model-based approach in classification. In: OTM Confederated international conferences on the move to meaningful internet systems. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-39964-3_62 12. Chamasemani FF, Singh YP (2011) Multi-class support vector machine (SVM) classifiers—an application in hypothyroid detection and classification. In: 2011 Sixth international conference on bio-inspired computing: theories and applications. IEEE, New York. https://doi.org/10.1109/ BIC-TA.2011.51 13. Zhong Y (2016)The analysis of cases based on decision tree. In: 2016 7th IEEE international conference on software engineering and service science (ICSESS). IEEE, New York.https:// doi.org/10.1109/ICSESS.2016.7883035 14. Charbuty B, Abdulazeez A (2021) Classification based on decision tree algorithm for machine learning. J Appl Sci Technol Trends 2.01:20–28. ISSN: 2708-0757 15. Jaiswal JK, Samikannu R (2017) Application of random forest algorithm on feature subset selection and classification and regression. In: 2017 World Congress on computing and


communication technologies (WCCCT). IEEE, New York.https://doi.org/10.1109/WCCCT. 2016.25 16. Er MJ, Venkatesan R, Wang N (2016) An online universal classifier for binary, multi-class and multi-label classification. In: 2016 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, New York. https://doi.org/10.1109/SMC.2016.7844809 17. Hicks S et al (2021) On evaluation metrics for medical applications of artificial intelligence. medRxiv. https://doi.org/10.1101/2021.04.07.21254975 18. McKenzie SB, Lynne Williams J, Landis K (2020) Clinical laboratory hematology. ISBN 13: 9780134709390

Detection of Fruits Image Applying Decision Tree Classifier Techniques Shivendra, Kasa Chiranjeevi, and Mukesh Kumar Tripathi

Abstract Recognition of images of fruits and vegetables through various descriptors combined with decision tree classifier techniques is proposed. Building an accurate and efficient recognition system for fruits and vegetables is a major challenge. To address it, we examine various feature descriptors based on colour, texture and shape, as well as their combinations. In this paper, Otsu's thresholding is used for background subtraction, and the segmented images are then used in the feature extraction phase. C4.5 is used for training and classification. Finally, various performance metrics such as accuracy, sensitivity, specificity, precision, false positive rate and false negative rate are utilised to evaluate the proposed recognition system, and the classification accuracy is analysed. The outcome demonstrates that the proposed fused descriptor based on colour, texture and shape is more efficient. Keywords Recognition · Fruits · Vegetables · Descriptor · C4.5

1 Introduction India can create an assortment of cultivation item. This is because of its flexibility of climate. Foods grown from the ground hold a 90% portion of the all-out cultivation item. Different classifications of agriculture item are blossom, fragrant manor, crop, flavours and so forth the development of leafy foods is 314.65 million tones among every green item [1]. Uttar Pradesh positions first in the creation of vegetables to 26.4 million tones, followed by state West Bengal with 25.5 million tones, which Shivendra Department of E and TC, JJTU, Jhunjhunu, Rajasthan, India K. Chiranjeevi JJTU, Jhunjhunu, Rajasthan, India M. K. Tripathi (B) Vasavi College of Engineering, Hyderabad, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_9


is 30% of the age of vegetables diverged from the distinctive territory of India. In the natural product item creation, state Andhra Pradesh produces 120.98 lakh tones sought after by state Maharashtra 103.78 lakh tones which contrast as one 24% of normal item creation with rest the region of India. India sold out foods grown from the ground worth of 161 USD Million. It got fourteenth position across the word in product [2]. Generally, the actual appearance is one significant property to identify the products of the soil. This effects the market worth of natural products alongside the purchaser’s inclination. The market costs of products of the soil are normally dictated by manual assessments. Such a manual examination for quality appraisal is finished by experienced people. This manual strategy is conflicting, bringing about impacts the choice of organic products for the purchaser market. Today is the day computerized stage. Numerous clients are purchasing the organic product by utilizing the computerized stage. Numerous general stores have executed the acknowledgement framework by utilizing some equipment parts and standardized tag. One significant impediment of such a framework is that it is in static structure. Standardized identification-based framework is neglecting to get to the postharvest quality characteristic of natural products. That is the reason a productive computerized acknowledgement framework will assist with expanding the precision of the framework and customer market. It will bring about income age for labour remember for creation of organic products. Such a robotized acknowledgement framework whenever upheld by an AI framework, it may work on the precision and adequacy of the proposed framework. The leafy foods picture acknowledgement predominantly relies upon melded of shading, surface, shape, size. The proposed structure of the acknowledgement framework uses melded descriptor dependent on shading, surface, shape to order the picture natural products in an alternate class of picture. There are numerous descriptors dependent on shading, surface and shape is used to separate the component of the picture of the natural product. These elements are grouped by C4.5 in the analysis. The general goal of this paper is to perceive and order the natural products with an alternate posture, changeability on the quantity of pictures and last with trimming fractional occultation impact. The particular goal is basically to cover in five stages. (1) To propose the structure for the acknowledgement of various classifications of organic product. (2) Gather the different classes of the picture. (3) Under the assortment of accessible division procedures, play out the foundation partition of natural products utilizing Otsu’s thresholding strategy. (4) Pick the compelling and suitable shading and surface, shape methods to remove the element vector of the picture. (5) At last, by utilizing C4.5 classifier perform preparing and arrangement utilizing different execution metric.


2 Related Works In this part, we will try to give focus on a broad writing study of past conveyed work by many explores by researchers. The primary endeavour on recognition and classification [3, 4]. The products of the soil grouping are introduced by [5]. They have utilized the shading and surface descriptor to extricate the component of the image object. All the component is utilized in preparing and arrangement dependent on KNN classifier. The proposed framework produces 95% exactness rate. One restriction is dataset has been utilized for the analysis design is extremely old. Because of that framework will most likely be unable to exploit late improvements in a dataset of leafy foods. The researcher [6] presents a methodology for identification of foods grown from the ground. In their work, many elements are combined with a classifier. The store dataset having 15 distinct classes of leafy foods utilized for tests. The element descriptor is extensively identified with shading and surface. The outcomes show that the proposed framework diminishes the order mistake up to 15%. They have likewise joined the component descriptor for a more perplexing picture having fluctuation in number, enlightenment, distinctive posture and so forth. In this, one disadvantage is that if there should arise an occurrence of a mix of the frail element with high exactness classifier may not ready to create a decent precision rate. Similar informational index has been used by the researcher [7] for results analysis. They have first deducted the foundation of foods grown from the ground utilizing Kimplies procedures. Further sectioned picture is utilized in include extricated stage. They have extricated the component vector of the picture by a shading descriptor like GCH, CCV, CDH. Whilst surface descriptor highlight is SEH, LBP, CSLBP. Further, these all element utilized for preparing and arrangement by multi-class support vector machine. In one more paper conveyed by creator [8] on similar trial informational index, they have dissected the mean (μ) and determination (σ) of all classes of leafy foods. In this, CCV + LTP combined descriptor produces most elevated mean exactness rate with esteem 90.6%, where the base standard rate 3.8% by CCV + CLBP. One disadvantage framework that CDH + SEH produce less mean exactness rate and CDH + SEH+ CSLBP show best quality determination. That shows both join technique produce horrible showing contrasted with another strategy. The researcher [9, 10] proposed a system for acknowledgement of a specific image object that has a place with a bunch of pictures for every class. This methodology is known as a pack of component strategies. The creator [11, 12] show promising outcome for acknowledgement issue. The creator [13] likewise utilized diverse class of foods grown from the ground to test and they have accomplished a decent exactness rate with the worth 86%. The fascinating strategy was proposed by creator [14] for shape coordinating. They have taken out the three significant limitations for shapes, coordinating like the comparing point for the two shapes with a comparative nearby descriptor, least mathematical mutilation and enduring perfection of the exchange movement. As of late, the creator [15] present a structure for apple acknowledgement and arrangement. In their work, they have used the SVM, MLP, KNN classifier to


arrange the apple in to sound and absconded picture classification. With the outcomes, SVM classifier shows the most noteworthy precision rate with a worth of 92.5 and 89.2% for the two classes, trailed by MLP 90.0, 86.5% acknowledgement rate, at long last, KNN classifier produce less exactness rate with esteem 87.5, 85.8 % individually. The researcher [16] as depicted the component of date natural products dataset in subtleties. They have characterized the component into five classes like Obesity, size, shape, force and deformities. They have carried out the BPNN for preparing and grouping. The outcomes show a most extreme exactness rate with esteem 80% by model 2 in grouping the grade 2 organic products. One significant disservice of the proposed framework is that it could not ready to perceive the Heaviness include from date organic products pictures. A precise structure for plant recognizable proof issue was proposed by the creator [17]. They have utilized the openly accessible dataset ‘FLAVIA’ of a leaf picture. Diverse classifier, for example, KNN, Guileless Bayes, MSVM is utilized for grouping the component vector. The KNN classifier creates better execution amongst one more classifier with esteem 97.6, 98.8% for accuracy and review separately. Now days, machine learning-based techniques is popular and getting more attention in agriculture machine vision system [18, 19], education [20], health care sector [21, 22], finance [23], text processing [24, 25] and data processing [26, 27], etc. The author [26] has described the impact of different performance metric on same dataset, also briefly discuss the key factor for selection of metrics for data classification. Role and application of nature-inspired algorithms in diagnosis area is briefly presented by the author. In their paper they have used two novel algorithms such as ant colony and artificial bee colony to diagnose the disorder. They have used different kind of the disorder inform of diabetes and cancer. With the experiment, it is analytic that hybrid-based approach produces precise and better results compare to individual approach. An approach based on machine learning for classification of disease of Parkinson’s has been analysed by author [28]. They have also presented the effective framework for early diagnosis. Naïve Bayes, SVM, Random Forest, Bagging have been used for classification. The experiment shows that SVM classifiers show highest accuracy rate amongst all classifiers. Machine learning techniques are also playing a vital role in the field of the stock market. The author [29] has calculated the static feature such as mean, variance, skewness and kurtosis. They have also used difference performance metrics such as accuracy, sensitivity, specificity to evaluate the proposed by the proposed system. The logistic regression method is used for classification with accuracy rate 99.78%, where C4.5 base classifier is able to produce less performance with the value 53.6%. Now days due to the effectiveness in accuracy machine learning are used in finance sector also. The author [23] briefly highlights the various classifiers such as Deep Neural Network, SVM and Random forest. They have also given guideline for future work in term of the role and application of the ML techniques in the stock market. The machine learning has also shown great potential in the area of petroleum to predict the crude oil price, gas price and annual interest rate. 
The author [30] used different kinds of classifiers, such as LSSVM, GP, ANN and BA optimisation, to analyse various


kinds of parameters in the petroleum domain. The coefficient of determination, average absolute relative error percentage and root mean square error techniques were used to build the prediction model, and the BA-optimised technique produced the highest accuracy amongst all other classifiers.

3 Methodology The proposed research strategy for the recognition of fruits and vegetables is displayed in Fig. 1. The approach mainly involves three steps. First, the background of the fruit or vegetable image is subtracted. In the next stage, colour, texture and shape descriptor features are extracted from the segmented image. In the third step, the images of the various categories of fruits and vegetables are classified using the C4.5 classifier. Finally, the effectiveness of the proposed framework is measured with metrics such as classification accuracy, sensitivity, specificity, precision, false positive rate and false negative rate.

A. Background Subtraction: A precise strategy for background subtraction of fruit and vegetable images is vital for the recognition problem, because the end results depend directly on how well the background of the image is subtracted. The Otsu's-based segmentation method is used for background subtraction of fruits and vegetables in this paper [31]. To examine the efficiency of the recognition system, we have utilised a dataset of twenty different categories of fruits and vegetables.

An algorithm for background subtraction based on Otsu's threshold method:
• Collect the original image of the fruit or vegetable in R, G, B. The original size of the image is reduced by a cropping operation to speed up the process.
• From the input colour image, one luminance channel Y and two chrominance channels Cb, Cr are extracted.
• Perform morphological operations such as open and close on the Y channel.

Input image → Image segmentation → Feature extraction → Training and classification (C4.5) → Recognition

Fig. 1 Proposed methodology to solve recognition problem of fruit and veggies


• The Y channel is segmented using Otsu's thresholding method by selecting the threshold value.
• Perform an invert operation on the segmented image obtained in the previous step.
• Extract the R, G, B channels from the inverted image.
• Perform a concatenation operation between the inverted image and the respective channels Y, Cb, Cr; the result of this operation is denoted the intermediate image.
• Extract the R, G, B channels from the intermediate image, which is in Y, Cb, Cr form.
• Finally, the R, G, B channels are concatenated with the inverted image, and the obtained result is the background-subtracted image.
Extraction of the region of interest from an image is shown in Fig. 2. Figure 3 shows image segmentation under partial occlusion and cropping effects, and the results under noise and blurring are shown in Fig. 4.
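A simplified sketch of the Otsu-based background subtraction steps above, using OpenCV. It applies the Otsu mask of the processed Y channel directly to the original image rather than the per-channel concatenation described in the algorithm, so it is an approximation under stated assumptions, not the authors' code; whether the foreground sits above or below the Otsu threshold depends on the image.

```python
import cv2
import numpy as np

def subtract_background(path):
    """Otsu-based background subtraction sketch on the luminance channel."""
    bgr = cv2.imread(path)
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    y = ycrcb[:, :, 0]

    # Morphological open/close on the luminance channel to suppress noise
    kernel = np.ones((5, 5), np.uint8)
    y = cv2.morphologyEx(y, cv2.MORPH_OPEN, kernel)
    y = cv2.morphologyEx(y, cv2.MORPH_CLOSE, kernel)

    # Otsu's method selects the segmentation threshold automatically
    _, mask = cv2.threshold(y, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    mask = cv2.bitwise_not(mask)          # invert so the object region is kept

    # Keep only the foreground pixels of the original image
    return cv2.bitwise_and(bgr, bgr, mask=mask)
```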

Fig. 2 Image segmentation from the images a before b after

Fig. 3 Image segmentation effects with partial occlusions and cropping, a before b after


Fig. 4 Image segmentation results under noisy and blurring effect, a before b after

B. Feature Extraction
Locality, pose variance, distinctiveness and repeatability are some major properties of an efficient descriptor. In this section, we examine the feature extraction methods. From the literature survey, it was found that a single descriptor-based feature is unable to capture the characteristics of the image adequately; that is why we propose a fused descriptor based on colour, texture and shape. For the recognition problem, the colour moment histogram (CMH) and colour coherence vector (CCV) are used as colour features, the local binary pattern (LBP) and centre-symmetric local binary pattern (CSLBP) as texture features, and Zernike moments (ZM) as the shape-based descriptor. Extraction of image features is a key step in the recognition problem, and the proficiency of the system depends directly on the extracted features.

Colour moment histogram (CMH): The colour moment is a very effective technique for extracting the colour features of an image [32]. It is based on moment parameters; mean, variance and skewness are the three moments, defined for channel i as follows:

\mu_i = \frac{1}{N}\sum_{j=1}^{N} p_{ij}    (1)

\sigma_i = \left(\frac{1}{N}\sum_{j=1}^{N}\left(p_{ij}-\mu_i\right)^2\right)^{1/2}    (2)

S_i = \left(\frac{1}{N}\sum_{j=1}^{N}\left(p_{ij}-\mu_i\right)^3\right)^{1/3}    (3)
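A minimal sketch computing the colour moments of Eqs. (1)–(3) per channel with NumPy; the function name and the 9-dimensional output layout are assumptions.

```python
import numpy as np

def colour_moments(image):
    """Mean, standard deviation and cube-root skewness per colour channel.
    `image` is an H x W x 3 array; returns a 9-dimensional feature vector."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    mu = pixels.mean(axis=0)
    sigma = np.sqrt(((pixels - mu) ** 2).mean(axis=0))
    skew = np.cbrt(((pixels - mu) ** 3).mean(axis=0))
    return np.concatenate([mu, sigma, skew])
```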


Colour coherence vector (CCV): The colour coherence vector is an effective technique for extracting features based on colour histograms [33], and it has rotation-invariance properties. CCV works on discretised colours and depends on two key notions, coherent and incoherent pixels: if a pixel belongs to a large image region of the same colour it is called coherent, otherwise incoherent. In the computation of the CCV, we first eliminate the variance between neighbouring pixels by replacing each pixel value with the average value of a small local neighbourhood; here we take a 3 × 3 local neighbourhood with 8 adjacent pixels. We then discretise the colour space into N colours. Finally, each pixel group is labelled coherent or incoherent by counting pixels with the help of connected components.

Local binary pattern (LBP): The local binary pattern is used for feature extraction based on texture analysis [34]. LBP has illumination-invariance and grayscale properties. The basic LBP uses a 3 × 3 local neighbourhood: the central pixel of the image is re-encoded with the help of its neighbours, each of which is compared to the central pixel value based on a threshold. A neighbour is encoded as 1 if its value is greater than or equal to the central pixel, otherwise as 0. The operation is expressed as

s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}    (4)

The resulting bits are then weighted by 2^i and summed to obtain the centre pixel's code in decimal form [35]; the final LBP is computed by

LBP_{N,R} = \sum_{i=0}^{N-1} s(X_n - X_c)\,2^i    (5)

Finally, the LBP histogram is computed from the LBP codes of every pixel of the image [36]:

H(k) = \sum_{i=0}^{I}\sum_{j=0}^{J} f\left(LBP_{N,R}(i, j), k\right), \quad k \in [0, K], \qquad f(x, y) = \begin{cases} 1, & x = y \\ 0, & \text{otherwise} \end{cases}    (6)
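A minimal sketch of the basic 3 × 3 LBP and its histogram (Eqs. (4)–(6)) in NumPy; the neighbour ordering and the number of histogram bins are assumptions.

```python
import numpy as np

def lbp_histogram(gray, n_bins=256):
    """Encode each interior pixel from the comparison of its 8 neighbours with
    the centre (Eqs. 4-5) and summarise the codes in a histogram (Eq. 6)."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]                       # centre pixels
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        codes += ((neigh - c) >= 0).astype(np.int32) << bit
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
    return hist
```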

The major advantage of LBP is its simple definition and ease of use, yet the high dimensionality of its feature vector restricts its use in some real-time applications. One limitation of LBP is that it is local by nature, so noise and shape variance may decrease its performance.


Centre-symmetric local binary pattern (CSLBP): CSLBP is a member of the local binary pattern family. The usual LBP generates an extensive histogram, which makes it comparatively hard to utilise; to deal with this limitation, centre-symmetric pixel pairs are compared rather than each pixel against the centre. A major advantage of CSLBP is that it extracts the texture feature while reducing the descriptor size [37]. The computation of CSLBP is expressed as

CS\text{-}LBP_{R,N,T}(x, y) = \sum_{i=0}^{N/2-1} s\left(n_i - n_{i+(N/2)}\right)2^i, \qquad s(x) = \begin{cases} 1, & x > T \\ 0, & \text{otherwise} \end{cases}    (7)

Zernike moments (ZM): ZM is one of the powerful shape-based feature algorithms because of its rotation-invariance properties [38]. The computation of ZM is done by weighting the shape values of a particular image with a Zernike basis function; this operation encodes the image. One main advantage of the ZM descriptor is that it is orthogonal in nature, so the effect of noise variation is negligible [39].

C. Feature Combination
To achieve a high accuracy rate for a robust recognition system, we fuse the colour, texture and shape features. The experiment is evaluated on combinations of colour + texture and colour + texture + shape features. All descriptor features are combined using a weighted sum over their dimensions. Assuming n feature descriptors represented by (X1, X2, X3, …, Xn) with dimensions (d1, d2, d3, …, dn), the fused feature obtained by weighting the dimensions is denoted by

X = \sum_{j=1}^{n} d_j    (8)
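A minimal sketch of the weighted feature fusion described above; the equal default weights and the simple concatenation are assumptions about how the fusion could be realised, not the authors' exact scheme.

```python
import numpy as np

def fuse_features(feature_list, weights=None):
    """Concatenate the colour, texture and shape descriptors into one fused
    vector, optionally weighting each descriptor before concatenation."""
    if weights is None:
        weights = [1.0] * len(feature_list)
    return np.concatenate([w * np.asarray(f, dtype=np.float64)
                           for w, f in zip(weights, feature_list)])

# e.g. fused = fuse_features([cmh_vec, lbp_vec, zm_vec])
```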

4 Experiment and Results In this section, we discuss the attributes and properties of the fruit and vegetable data set, experiment with the various methods for the recognition of fruits


Fig. 5 Illumination differences, apple, kiwi, cabbage, bitter melon

and vegetables, and evaluate the performance of the proposed system based on various metrics.

A. Dataset

We collected all the images of fruits and vegetables from the local market at Talegaon, Maharashtra. All the images were captured with a Nikon digital DSLR camera. Figure 5 represents the illumination differences present in the apple, kiwi, cabbage and bitter melon categories. The strawberry category with different poses is represented in Fig. 6. Figure 7 shows the variance in the number of images for the tomato (red) category, and Fig. 8 represents samples with cropping and partial occlusion. The availability of these properties makes the data set more robust.

B. Experimental Results and Discussion

To evaluate the recognition accuracy of the proposed system, we examine and compare state-of-the-art colour, texture and shape feature descriptors, and also evaluate the system on their combinations. All the extracted features are utilised in the training and testing phases with the C4.5 classifier [40, 41]. Next, various performance metrics such as

Fig. 6 Pose differences, strawberry

Fig. 7 Variance on the number of images, tomato red


Fig. 8 Sample of cropping and partial occlusion

Fig. 9 Recognition accuracy, using C4.5 classifiers, a comparison between CMH, LBP, ZM, CMH + LBP, CMH + LBP + ZM feature, b comparison between CMH, CSLBP, ZM, CMH + CSLBP, CMH + CSLBP + ZM feature, c comparison between CCV, LBP, ZM, CCV + LBP, CCV + LBP + ZM feature, d comparison between CCV, CSLBP, ZM, CCV + CSLBP, CCV + CSLBP + ZM feature

classification accuracy, sensitivity, specificity, precision, false positive rate and false negative rate are utilised to evaluate the performance of the proposed system. The detailed comparison based on these metrics is given below (Fig. 9).

5 Conclusion In this paper, we have presented a novel framework for the fruit and vegetable recognition problem. One main contribution of this paper is the use of a fused


descriptor, which produces better results compared to an individual descriptor. Another contribution is the preparation of a data set of fruits and vegetables. The recognition system consists mainly of three phases: segmentation, feature extraction and classification. Firstly, the Otsu's-based thresholding method is utilised to segment the fruits and vegetables. Then, state-of-the-art features are extracted from the segmented images. Next, these extracted features are classified with C4.5. The experimental results indicate that the C4.5 classifier is effective and produces good performance. One future scope is to expand the data set by grouping a variety of fruits and vegetables in a single image. Deep learning concepts can also be applied to evaluate the efficiency of the recognition system.

References 1. Saxana SP, Mamata (2017) At a glance 2017. Horticultural statistics at a glance 2. “Exports From India of Fruits & Vegetables. [Online]. Available: http://agriexchange.apeda. gov.in/product_profile/exp_f_india.aspx?categorycode=0102. (Accessed 06 Sep 2019). 3. Pass G, Zabih R, Miller J (1996) Comparing images using color coherence vectors. In: 1996 Proceedings of the fourth ACM international conference on Multimedia—MULTIMEDIA ’96, pp 65–73 4. Stehling RO, Nascimento MA, Falcao AX (2002) A compact and efficient image retrieval approach based on border/interior pixel classification. In: International conference on information and knowledge management (CIKM), pp 102–109 5. Zhang Y, Wu L (2012) Classification of fruits using computer vision and a multiclass support vector machine. Sensors (Switzerland) 12(9):12489–12505 6. Rocha A, Hauagge DC, Wainer J, Goldenstein S (2010) Automatic fruit and vegetable classification from images. Comput Electron Agric 70(1):96–104 7. Dubey SR, Jalal AS (2013) Species and variety detection of fruits and vegetables from images. Int J Appl Pattern Recogn 1(1):108 8. Dubey SR, Jalal AS (2015) Fruit and vegetable recognition by fusing colour and texture features of the image using machine learning. Int J Appl Pattern Recogn 2(2):160 9. Agarwal S, Awan A, Roth D (2004) Learning to detect objects in images via a sparse, part-based representation. IEEE Trans Pattern Anal Mach Intell 26(11):1475–1490 10. Jurie F, Triggs B (2005) Creating efficient codebooks for visual recognition. Proc IEEE Int Conf Comput Vision I:604–610 11. Marszalek M et al (2010) Spatial weighting for bag-of-features to cite this version : HAL Id : inria-00548584 spatial weighting for bag-of-features. In: 2010 Proceedings of the IEEE computer society conference on computer vision and pattern recognition, pp 2118–2125 12. Tikkanen H, Alajoutsijärvi K, Tähtinen J A performance evaluation of local descriptors. IEEE Trans Pattern Anal Mach Intell 27(10): 1615–1630, 200AD 13. Arivazhagan S, Shebiah RN, Nidhyanandhan SS, Ganesan L (2010) Fruit recognition using color and texture features. J Emerg Trends Comput Inf Sci 1(2):90–94 14. Berg AC, Berg TL, Malik J (2005) Shape matching and object recognition using low distortion correspondences. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), 1910 1:26–33 15. Moallem P, Serajoddin A, Pourghassem H (2017) Computer vision-based apple grading for golden delicious apples based on surface features. Inf Process Agric 4(1):33–40 16. Ohali YA (2011) Computer vision based date fruit grading system : design and implementation. J King Saud Univ—Comput Inf Sci, 23(1):29–36



Disease Prediction Based on Symptoms Using Various Machine Learning Techniques

Deep Rahul Shah and Dev Ajay Dhawan

Abstract Accurate and prompt diagnosis of any health-related issue is essential for the prevention and treatment of disease. The conventional method of diagnosis may not be adequate in the case of a serious illness. Developing a clinical diagnosis framework based on machine learning algorithms for the prediction of a disease can yield a more accurate diagnosis than the conventional approach. We have built a disease prediction framework that applies several machine learning techniques to patient symptoms. The dataset used contained more than 261 diseases and 500 symptoms. The Random Forest Classifier gave the best results when compared with the Multinomial Naïve Bayes Classifier, K-Nearest Neighbors, Logistic Regression, Support Vector Machines, Decision Tree, and Multilayer Perceptron Classifier models. The accuracy of the proposed Random Forest Classifier model on the given dataset was 91.06%. Our prediction model can act as a specialist for the early diagnosis of disease, helping to ensure that treatment can start on time and lives can be saved.

Keywords Disease prediction · Machine learning · Logistic regression · Support vector machines · Multilayer perceptron classifier · Decision tree · Random forest classifier

1 Introduction

Medicine and healthcare are among the most critical parts of the economy and of human life. The world we live in now has changed enormously from the world that existed only a few weeks ago; everything has become frightening and unfamiliar. In the present circumstances, where everything has turned virtual,

D. R. Shah (B) · D. A. Dhawan
Department of Computer Engineering, NMIMS Mukesh Patel School of Technology Management and Engineering, Mumbai, Maharashtra, India
e-mail: [email protected]
D. A. Dhawan
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_10


doctors and nurses are putting in the greatest possible effort to save people's lives, even when they have to risk their own. There are also remote villages that lack medical facilities. Virtual doctors are board-certified physicians who choose to practice online through video and telephone appointments rather than in person, but this is not feasible in the case of an emergency. Machines are often considered better than humans because, with practically no human error, they can perform tasks more efficiently and with a consistent degree of accuracy. A disease prediction system can therefore act as a virtual doctor that predicts the disease of any patient with almost no human error. A few models of virtual doctors do exist, but they do not offer the necessary degree of precision, because not all of the required parameters are taken into account. Our primary objective was to develop several models and determine which of them gives the most reliable predictions. While machine learning projects differ in scale and complexity, their overall design is similar, and a few standard ML practices were followed to guide the development and deployment of the predictive model. The processed data is fed into several ML models, namely the Multinomial Naïve Bayes Classifier, Random Forest Classifier, Logistic Regression, K-Nearest Neighbors, Decision Tree, Support Vector Machines, and Multilayer Perceptron Classifier. The accuracy of prediction varied across the models: the input parameters (the symptoms) were supplied to every model, and the predicted disease was received as an output with differing accuracy levels. The model with the highest accuracy was chosen. In the remainder of this paper, we review related work to compare with our achieved accuracy, describe the methodology used, explain the implementation, and then discuss the results.

2 Literature Review

Various research works have been published on predicting disease from symptoms using ML algorithms; some of them are summarized below. Karayilan et al. [1] proposed a heart disease prediction system that uses an artificial neural network trained with the backpropagation algorithm. Thirteen clinical features were used as input to the neural network, which was then trained with backpropagation to predict the absence or presence of heart disease with an accuracy of 95%. Chae et al. [2] used four different deep learning models, namely deep neural networks (DNN), long short-term memory (LSTM), ordinary least squares (OLS), and an autoregressive integrated moving average (ARIMA), for monitoring 80 infectious diseases in 6 groups. Of all the models used, the DNN and LSTM models performed better: the DNN model performed better in terms of average performance, and the LSTM model gave close predictions when case counts were large.


Haq et al. [3] used a dataset containing records of patients with heart disease. They extracted features using three selection algorithms, Relief, minimum redundancy maximum relevance (mRMR), and the least absolute shrinkage and selection operator (LASSO), cross-validated with the K-fold method. The extracted features were passed to 6 different machine learning algorithms, which then classified the presence or absence of heart disease. An effective heart disease prediction system was developed by Mohan et al. [4]. They achieved an accuracy of 88.4% with a prediction model for heart disease that combines a hybrid random forest with a linear model (HRFLM). Maniruzzaman et al. [5] proposed diabetes prediction using ML algorithms; Logistic Regression (LR) was used to identify the risk factors for diabetes, and the overall accuracy of the ML-based system was 90.62%. Kavitha et al. [6] proposed disease prediction using the Cleveland heart disease dataset with a hybrid model of Random Forest and Decision Tree classifiers, achieving an accuracy of 88.7%. Various methods were developed by Langbehn et al. [7] to detect Alzheimer's disease. Data from 29 adults were used to train the ML algorithm, and classification models were built to detect reliable absolute changes in the scores with the help of the SmoteBOOST and wRACOG algorithms. A variety of ML methods such as artificial neural networks (ANNs), decision trees (DTs), support vector machines (SVMs), and Bayesian networks (BNs) have been widely applied in disease research for the development of predictive models, achieving robust and accurate decision making [8]. Monto et al. [9] designed a statistical model to predict whether or not a patient had influenza. They included 3744 unvaccinated adult and adolescent influenza patients who had fever and at least 2 other symptoms of influenza; out of the 3744 patients, 2470 were confirmed to have influenza by the laboratory. Based on this data, their model gave an accuracy of 79%. Different machine learning algorithms were streamlined for the effective prediction of chronic disease outbreaks by Chen et al. [10]. The data collected for training was incomplete; to overcome this, a latent factor model was used. A new convolutional neural network-based multimodal disease risk prediction (CNN-MDRP) algorithm was constructed, which reached an accuracy of around 94.8%. Battineni et al. [16] focused on Support Vector Machine and Logistic Regression algorithms and evaluated the study models associated with the diagnosis of chronic disease; these models are highly applicable in the classification and diagnosis of chronic diseases. Alotaibi [17] explored, recommended, and applied machine learning models using the Rapid Miner tool, which achieved a higher degree of precision than the MATLAB and Weka tools.

Table 1 Comparison of methodologies reported in existing literature

Method | Model used | Accuracy (%)
Karayilan et al. [1] | ANN | 95
Chen et al. [10] | CNN-MDRP | 94.80
Haq et al. [11] | HRFLM | 88.40
Maniruzzaman et al. [5] | Logistic regression | 90.62
Mir et al. [12] | SVM / Naive Bayes / Random forest / Decision tree / KNN | 75 / 74 / 71 / 71 / 76
Khourdifi et al. [13] | KNN | 99.70
Vijayarani et al. [14] | SVM | 79.66
Mohan et al. [4] | HRFLM | 88.40
Kavitha et al. [6] | Random forest + Decision tree | 88.7
Sriram et al. [15] | Random forest classifier | 90.26

Bindhika et al. [18] proposed a method for heart disease prediction using machine learning, and the results showed a high standard of accuracy in producing better estimates (Table 1).

3 Methodology

The list of diseases was retrieved from the National Health Portal of India, developed and maintained by the Centre for Health Informatics (CHI). The dataset contains 261 different diseases and more than 500 symptoms (Fig. 1).

3.1 Input Data

When designing the algorithm, we assumed that the user has a clear idea of the symptoms he or she is experiencing. The prediction is made based on the symptoms the user enters.


Fig. 1 Functioning of the machine learning models

3.2 Data Pre-processing

Data pre-processing refers to the data mining steps that transform raw data into a structured form so that it can be easily interpreted by the algorithms. The data is cleaned using measures such as filling in missing values, thereby resolving anomalies in the data. Analysis becomes difficult when dealing with a massive database, so we remove those independent variables (symptoms) that do not influence the target variables (diseases). A sketch of this step follows.
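The following is a minimal sketch of how such a binary symptom–disease matrix could be built. The file name, the delimiter, and the column names are assumptions for illustration and do not come from the paper:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical input: one record per row, a disease label and a
# ";"-separated list of symptoms (file and column names are assumptions).
raw = pd.read_csv("disease_symptoms.csv")
raw["symptoms"] = raw["symptoms"].str.split(";")

# Binary matrix: one column per symptom, 1 if the symptom occurs, else 0.
mlb = MultiLabelBinarizer()
X = pd.DataFrame(mlb.fit_transform(raw["symptoms"]), columns=mlb.classes_)
y = raw["disease"]

# Drop symptom columns that never vary; they cannot influence the target.
X = X.loc[:, X.nunique() > 1]
```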

3.3 Models

The system is designed to predict diseases using seven algorithms, i.e., the Multinomial Naïve Bayes Classifier, Random Forest Classifier, Logistic Regression, K-Nearest Neighbors, Decision Tree, Support Vector Machines, and Multilayer Perceptron Classifier models. A comparative predictive analysis is then carried out at the end of the study by examining the accuracy, speed, efficiency, and overall performance of the different algorithms on the dataset.


3.4 Output (Diseases)

Once the models are built from the training set using the chosen algorithms, reference datasets are formed. Whenever the user's symptoms are given as input to an algorithm, the output is produced according to this reference dataset, classifying the input and predicting the most probable disease.

4 Implementation

The disease prediction framework is implemented using seven data mining algorithms: the Multinomial Naïve Bayes Classifier, Random Forest Classifier, Logistic Regression, K-Nearest Neighbors, Decision Tree, Support Vector Machines, and Multilayer Perceptron Classifier models. A sketch of how such a comparison can be assembled is shown below, followed by a description of each algorithm.
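As a hedged illustration (not the authors' code), the seven classifiers could be trained on the same split and ranked by held-out accuracy with scikit-learn. Hyper-parameters below are library defaults, and X and y are the symptom matrix and disease labels from the pre-processing sketch above:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Support Vector Machine": SVC(),
    "Multilayer Perceptron": MLPClassifier(max_iter=500, random_state=0),
}

# X, y: binary symptom matrix and disease labels (see the pre-processing sketch).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
for name, model in models.items():
    model.fit(X_train, y_train)                        # train on the same split
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {acc:.4f}")
```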

4.1 Multinomial Naïve Bayes

The Multinomial Naïve Bayes (MNB) classifier is used when the data is multinomially distributed. MNB works better on discrete attributes, and the multinomial distribution normally requires integer counts; nonetheless, it also works with fractional counts such as tf-idf. The classifier uses the frequency of the features as predictors. When the predictor B itself is independent, we calculate the probability of class A according to Eq. (1):

p(A|B) = P(A) · P(B|A) / P(B)    (1)

4.2 Random Forest Classifier

The Random Forest (RF) classifier is an estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and averages their outputs to improve prediction accuracy. Trees are non-parametric models, for which the prediction is typically a constant (for example, the mean within each region when y is continuous, or the most common class in the case of classification). The regression coefficients


play a significant part when dealing with parametric models, as they describe the parameters. In the case of linear regression, for instance, the parameter is μ = Xβ, where β is the vector of regression coefficients.

4.3 K-Nearest Neighbors (KNN)

The K-nearest neighbors (KNN) algorithm is a type of supervised machine learning algorithm. It first determines the distance of a new data point to all of the training data points; the distance can be of Euclidean or Manhattan type. It then picks the K closest data points, where K can be any number, and assigns the new data point to the class to which the majority of the K points belong. In this research, the dataset is run several times for various values of N, and the most promising value of N is chosen for further experiments, where the distance is measured by the Euclidean distance shown in Eq. (2):

Dist(X^n, X^m) = sqrt( Σ_{i=1}^{D} (X_i^n − X_i^m)^2 )    (2)
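A small hedged sketch of this neighbor-count selection loop (the candidate values of k are illustrative only; the paper does not state the search range), reusing the training split from the comparison sketch above:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

best_k, best_score = None, -1.0
for k in (3, 5, 7, 9, 11):                      # candidate neighborhood sizes
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    score = cross_val_score(knn, X_train, y_train, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print(f"best k = {best_k}, cross-validated accuracy = {best_score:.4f}")
```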

4.4 Logistic Regression

Logistic Regression uses the OvR (One versus Rest) approach when the classification is multi-class. Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables. Logistic regression is also called binomial logistic regression. The fundamental equation of the generalized linear model is given in Eq. (3):

g(E(y)) = α + βx1 + γx2    (3)

where g() is the link function, E(y) is the expectation of the target variable, and α + βx1 + γx2 is the linear predictor (with α, β, γ to be estimated).


4.5 Support Vector Machines

SVM follows the OvO (One versus One) scheme when dealing with multi-class classification. The SVM classifier works by defining a separating boundary between two classes: all the data points that fall on one side of the boundary are labeled as one class, and all the points on the opposite side are labeled as the second. This appears simple for the two-class case, but the complexity grows when there are multiple classes. SVM works well when there is a reasonable margin of separation between classes. The multi-class strategies used by logistic regression and SVM are illustrated in the sketch below.
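As a hedged illustration of the two multi-class strategies just mentioned (OvR for logistic regression in Sect. 4.4 and OvO for SVM here), the explicit scikit-learn wrappers make the decomposition visible; note that SVC already applies one-versus-one internally:

```python
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

ovr_logreg = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovo_svm = OneVsOneClassifier(SVC())

# X_train, X_test, y_train, y_test come from the comparison sketch in Sect. 4.
for clf in (ovr_logreg, ovo_svm):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```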

4.6 Decision Tree

The decision tree is one of the simplest and most popular classification algorithms to understand and interpret, and it can be used for both classification and regression. A decision tree follows a tree-like structure to decide which class a given sample most likely belongs to. For prediction it uses a tree diagram starting at the top: it contains a root node, after which the data is split on the most informative attribute, and the resulting parts are then split again.

4.7 Multilayer Perceptron Classifier

A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN). The term MLP is used ambiguously, sometimes loosely to mean any feedforward ANN and sometimes strictly to refer to networks made of multiple layers of perceptrons; such networks can be applied to complex non-linear problems.

5 Results and Discussion

The dataset used was from the National Health Portal of India and contains 261 different diseases and more than 500 symptoms. The parameters considered for predicting the disease are based on the list of symptoms entered by the user; the more symptoms provided, the better the accuracy of the prediction. A single entry may record the same identified disease, for example a fungal infection, even though the symptoms in two entries for the same disease differ. This dataset is then organized into another dataset by processing the raw entries: the new modeled dataset contains symptoms as column names, and the rows indicate the identified disease. For each symptom


occurring for a disease, the column entry is marked as 1 for that symptom, while the other columns are marked as 0. Each algorithm is fitted to the training data individually, and the experimental results are discussed further in this section. The performance evaluation metrics used are accuracy and cross-validation accuracy; accuracy is given in Eq. (4):

Accuracy = (True Positive + True Negative) / Total Sample    (4)

The accuracy defined in Eq. (4) is the proportion of correct predictions made by a model. Cross-validation makes multiple random splits of the dataset into training and validation data; for each such split, the model is fitted to the training data, and predictive accuracy is assessed on the validation data. The results are then averaged over the splits, as sketched below.
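A hedged sketch of how the two reported scores could be computed for one model (the number of folds is an assumption; the paper does not state it):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

# Plain test-set accuracy (Eq. 4) on the held-out split from Sect. 4.
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
test_acc = accuracy_score(y_test, rf.predict(X_test))

# Cross-validation accuracy averaged over the folds.
cv_acc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
print(f"accuracy = {test_acc:.4f}, cross-validation accuracy = {cv_acc:.4f}")
```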

5.1 Experimental Analysis

Distinct machine learning models were used to analyze disease prediction on the available input dataset. We used 7 different ML models for the prediction. Out of the 7 models, we achieved the highest accuracy with the Random Forest classifier, i.e., 91.06%, and the highest cross-validation accuracy with K-Nearest Neighbors, at 98.03%. The Random Forest classifier achieves high accuracy with large numbers of features because of the embedded feature selection in its model-building process. This behavior varied with our dataset, which was small in some respects yet large for the training set; because of this variety, Random Forest turned out to be the most reliable model when compared with the other ML algorithms. We took the raw data and classified it based on symptoms. The least accurate model was the Multinomial Naïve Bayes, with an accuracy of 83.14%. Support vector machines had an accuracy of 89.37%, the logistic regression model 90.05%, the multilayer perceptron classifier 90.72%, K-Nearest Neighbors 90.95%, and the Decision Tree 90.95%. The model with the lowest cross-validation accuracy was the Decision Tree at 83.54%; the Multinomial Naïve Bayes Classifier had a cross-validation accuracy of 84.5%, the multilayer perceptron classifier 86.9%, the random forest classifier 87.02%, support vector machines 88.62%, and the logistic regression model 89.19%. A portion of the literature has used only the Random Forest, SVM, and KNN models for disease prediction, whereas we used 7 different ML models. We achieved a highest accuracy of 91.06%, which is high when compared with the majority of the other reported approaches. The highest accuracy reported in the literature was achieved on the basis of a weighted KNN model: Khourdifi et al. [13]


Fig. 2 Model versus accuracy for the implemented algorithms

achieved the highest accuracy of 99.7% using the KNN model for the prediction and classification of heart diseases. Sriram et al. [15] used the Random Forest model and achieved an accuracy of 90.26%, while our model reached a highest accuracy of 91.06%. Our SVM model demonstrated a higher accuracy of 89.37% when compared with the techniques used by Mir et al. [12]. Doctors and medical experts are always needed in case of an emergency, and our prediction system can prove useful for faster and more accurate diagnosis of a disease. The accuracy and cross-validation scores generated by the algorithms are shown in Figs. 2 and 3 and Table 2.

6 Conclusion

This manuscript presented a technique for predicting disease based on the symptoms of an individual patient. The Random Forest model gave the highest accuracy of 91.06%, and the K-Nearest Neighbors model gave the highest cross-validation accuracy of 98.03% for the prediction of diseases from symptoms. Practically all the machine learning models gave decent accuracy; for certain models that were sensitive to their parameters, the disease could not be predicted as well and the accuracy was comparatively lower. This model would help bring down the cost involved in managing the disease and would also improve the recovery process.


Fig. 3 Model versus cross validation accuracy for the implemented algorithms

Table 2 Comparison of achieved accuracy and cross validation accuracy

Algorithm | Accuracy (%) | Cross validation accuracy (%)
Random forest classifier | 91.06 | 87.02
Decision tree | 90.95 | 83.54
K-Nearest neighbors | 90.95 | 98.03
Multilayer perceptron classifier | 90.72 | 86.9
Logistic regression | 90.05 | 89.19
Support vector machines | 89.37 | 88.62
Multinomial Naïve Bayes classifier | 83.14 | 84.5

References

1. Karayılan T, Kiliç Ö (2017) Prediction of heart disease using neural network. In: 2017 international conference on computer science and engineering (UBMK), pp 719–723
2. Chae S, Kwon S, Lee D (2018) Predicting infectious disease using deep learning and big data. Int J Environ Res Public Health 15(8):1596. https://doi.org/10.3390/ijerph15081596. PMID: 30060525; PMCID: PMC6121625
3. Haq AU, Li JP, Memon MH, Nazir S, Sun R (2018) A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms. Mobile Inf Syst, Article ID 3860146, 21 pp
4. Mohan S, Thirumalai C, Srivastava G (2019) Effective heart disease prediction using hybrid machine learning techniques. IEEE Access 7:81542–81554. https://doi.org/10.1109/ACCESS.2019.2923707
5. Maniruzzaman M, Rahman MJ, Ahammed B, Abedin MM (2020) Classification and prediction of diabetes disease using machine learning paradigm. Health Inf Sci Syst 8(1):7. https://doi.org/10.1007/s13755-019-0095-z. PMID: 31949894; PMCID: PMC6942113
6. Kavitha M, Gnaneswar G, Dinesh R, Sai YR, Suraj RS (2021) Heart disease prediction using hybrid machine learning model. In: 2021 6th international conference on inventive computation technologies (ICICT), pp 1329–1333. https://doi.org/10.1109/ICICT50816.2021.9358597
7. Langbehn DR, Brinkman RR, Falush D, Paulsen JS, Hayden MR (2001) International Huntington's disease collaborative group. A new model for prediction of the age of onset and penetrance for Huntington's disease based on CAG length. Clin Genet 65(4):267–277
8. Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI (2015) Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J 13:8–17, ISSN 2001-0370
9. Monto AS, Gravenstein S, Elliott M, Colopy M, Schweinle J (2000) Clinical signs and symptoms predicting influenza infection. Arch Intern Med 160(21):3243–3247. https://doi.org/10.1001/archinte.160.21.3243. PMID: 11088084
10. Chen M et al (2017) Disease prediction by machine learning over big data from healthcare communities. IEEE Access 5:8869–8879
11. Haq AU et al (2018) A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms. Mob Inf Syst 3860146:1–21
12. Mir A, Dhage SN (2018) Diabetes disease prediction using machine learning on big data of healthcare. In: 2018 fourth international conference on computing communication control and automation (ICCUBEA), pp 1–6
13. Khourdifi Y, Bahaj M (2019) Heart disease prediction and classification using machine learning algorithms optimized by particle swarm optimization and ant colony optimization. Int J Intell Eng Syst
14. Vijayarani S, Dhayanand S (2015) Liver disease prediction using SVM and Naïve Bayes algorithms
15. Sriram TV, Rao MV, Narayana GS, Kaladhar D, Vital TPR (2013) Intelligent Parkinson disease prediction using machine learning algorithms. Int J Eng Innov Technol (IJEIT) 3(3):1568
16. Battineni G et al (2020) Applications of machine learning predictive models in the chronic disease diagnosis. J Personalized Med 10(2):21. https://doi.org/10.3390/jpm10020021
17. Alotaibi FS (2019) Implementation of machine learning model to predict heart failure disease. Int J Adv Comput Sci Appl (IJACSA) 10(6)
18. Bindhika GSS, Meghana M et al (2020) Heart disease prediction using machine learning techniques. Int Res J Eng Technol 7(4):5272–5276

Anti-Drug Response and Drug Side Effect Prediction Methods: A Review

Davinder Paul Singh, Abhishek Gupta, and Baijnath Kaushik

Abstract Predicting medication (drug) adverse reactions is a crucial aspect of drug development. Using simulation techniques to customize medication response predictions and identify the optimal medication holds enormous potential for improving a patient's probability of successful treatment. Unfortunately, the statistical task of forecasting medication reactions is extremely difficult, partly due to dataset limitations and partly due to algorithmic flaws. Medications are organic molecules that are ingested by humans and cause a response in the body by binding to target proteins. These medications can produce positive or negative effects within the organism. Undesirable effects of medications cause adverse reactions in the human organism, generally referred to as medication side effects. Such adverse effects may vary from mild occurrences like a migraine to more significant ones involving heart failure, malignancy, or even mortality. Medications are put through a series of laboratory tests to see whether they have any negative health effects; such tests, unfortunately, are both expensive and time-consuming. Numerical techniques can be used as a replacement for experimental research, and numerous computational strategies for detecting drug adverse reactions have recently been published. In this paper, several existing techniques of drug response prediction are surveyed. The Pearson correlation coefficient values of existing techniques are compared graphically, and the SRMF technique is found to have attained the best results. The existing methods are examined with the help of various databases; the most essential databases for drug adverse reaction prediction are BIO-SNAP, SIDER, DART, etc. The datasets associated with drug adverse reactions, as well as the parameters used to evaluate drug adverse event prediction techniques, have been described. For the evaluation of

D. P. Singh (B) · A. Gupta · B. Kaushik
School of Computer Science and Engineering, Shri Mata Vaishno Devi University, Katra, Jammu and Kashmir, India
e-mail: [email protected]
A. Gupta
e-mail: [email protected]
B. Kaushik
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_11


drug adverse reaction prediction techniques, parameters such as PCC, precision, and accuracy are used. For the prediction of drug side effects, several methods are used, such as docking-based methods, network-based methods, machine learning-based methods, and various miscellaneous methods.

Keywords Drug response prediction · Docking based · Machine learning-based prediction methods

1 Introduction

An adverse drug reaction is defined by the World Health Organization as an unanticipated and harmful reaction that is considered to be caused by a medicine taken under normal circumstances. Adverse drug reactions have long been identified as a vital public health issue around the globe. Drug safety is the primary bottleneck in medication development, and understanding the processes behind medication side effects has become increasingly important [1]. Identifying the adverse effects of a new medication throughout its production cycle is, in general, a critical aspect of its commercial potential. Detecting probable adverse drug reactions in active compounds early in the design process can enhance drug quality, possibly reduce risks, and also save money for drug companies. The chemical composition of the candidate drug is the most important piece of information available in the initial phases of drug creation, and the main constituents of bioactive compounds are the basis of many previous investigations on adverse drug effect estimation. Machine learning approaches provide a promising methodology to understand and evaluate such significant issues, even though the mechanics of adverse drug reactions are sophisticated and may not be fully understood. For example, cancer is a complicated disease that includes both inter-tumor and intra-tumor variability. To drive personalized treatment, it is critical to uncover connections between biological macromolecules and clinical manifestations, identify innovative predictive indicators, and evaluate treatment strategies. Drug susceptibility estimation is an important aspect of drug development, which relates to treatment customized to a specific individual instead of a one-size-fits-all approach developed for the general population. The concept of personalized medicine goes back to Pythagoras in the fifth century, when treatment depended on an individual's specific sensitivities [2]. The humoral method of customized therapy was popular for 200 years, but it was refuted in the second century due to advances in anatomy and physiology. In the twentieth century, personalized treatment was mostly based on enhanced visualization techniques like MRI and X-ray scans, as well as various pathological diagnostics. During the last two decades, the capacity to quantify genetic traits at an individualized level has finally opened a slew of new opportunities for customized medicine. When contrasted with phenomenological findings or non-molecular diagnostics, genetic descriptions provide significantly more complete medical records for a hereditary condition.


This study attempts to describe several features of medication adverse reactions as well as the effective methodologies that have been established for forecasting medication adverse events. In Sect. 2, different existing drug response prediction techniques are surveyed with a comparative analysis table. The databases and parameters used for drug side effect prediction are depicted in Sect. 3. In Sect. 4, various drug side effect prediction models are discussed. The conclusion and future scope are presented in Sect. 5.

2 Literature Survey

Several methodologies have been implemented to accurately predict pharmacological adverse reactions. In this section, various existing techniques are explained; the datasets and evaluation parameters are summarized in Table 1. Guan et al. [3] presented a method based on cell lines for the prediction of anticancer drug response. For the estimation, weighted graph regularized matrix factorization (WGRMF) was used. The cell line similarity matrix and the drug similarity matrix were constructed with the help of a p-nearest-neighbor (p-NN) graph, and the latent matrices of cell lines and drugs were generated by factorization. Noise was removed with the help of graph regularization, which increases the accuracy of prediction. The performance of the proposed method was evaluated with PCC and RMSE, and the method was examined on two datasets: GDSC and CCLE. The prediction of anticancer drug response remains a challenge in the medical field. To address this problem, Sakellaropoulos et al. [4] proposed a model based on a deep neural network. The model was trained with 10,001 cancer cell lines and 251 drugs and provided state-of-the-art results on random patients. The results of the deep neural network were more efficient than machine learning-based algorithms, and the authors also found that the deep neural network layers captured the structure of gene expression more successfully. Choi et al. [5] designed a prediction model for anticancer drug resistance and for identifying biomarkers of drug response. The technique was based on a neural network known as the RefDNN (reference drug-based neural network) model. The molecular structure vector and the gene expression vector were used to fully capture the properties of the drug, and the observation that chemically similar compounds have similar effects made the calculations easier. The proposed model provided efficient results on untrained drugs. An elastic net was used for the high-dimensional gene data. The model's performance was assessed using two cancer datasets: the first was obtained from the Genomics of Drug Sensitivity in Cancer site, and the second was obtained from the Cancer Cell Line Encyclopedia Web page. The model's performance was compared to that of other methods such as the multi-layer perceptron, KNN, and random forest, and the suggested model outperformed the others. Wang et al. [6] used the observation that similar cell lines treated with the same drugs give similar drug responses. Their anticancer drug response prediction was based on matrix factorization using


Table 1 Existing methods of drug response prediction

Author's name | Method used | Dataset | Performance metrics | Key-points
Guan et al. [3] | Weighted graph regularized matrix factorization (WGRMF) | GDSC and CCLE | Pearson correlation coefficient (PCC) = 0.98, root mean square error (RMSE) | Graph regularization increases the accuracy of prediction
Sakellaropoulos et al. [4] | Deep neural network based | 10,001 cancer cell lines and 251 drugs | AUC (area under the curve) | Deep neural network layers captured the structure of gene expression more successfully
Choi et al. [5] | RefDNN (reference drug-based neural network model) | GDSC and CCLE | Precision = 0.91, Accuracy = 0.91, Recall = 0.90 | Provided better results than multi-layer perceptron, KNN, random forest
Wang et al. [6] | Similarity-regularized matrix factorization (SRMF) method | GDSC and CCLE | Pearson correlation coefficient (PCC) = 0.99 and root mean square error (RMSE) | Improved the prediction accuracy of the anticancer drug response
Emdadi et al. [7] | HMM-LMF model based on hidden Markov model and logistic matrix factorization methods | GDSC and CCLE | PCC = 0.75 | The model provided efficient results on the cell lines of unseen patients
Eslahchi et al. [8] | Manifold learning | GDSC and CCLE | Pearson correlation coefficient (PCC) = 0.84, RMSE = 0.48 | Finding the most redundant cell lines
(continued)

a similarity regularization method. The technique used the chemical structure of drugs and the gene expression of cell lines for the prediction of anticancer drug response; drugs with similar composition and cell lines with similar gene expression were used as regularization terms. The performance was examined on two datasets, GDSC and CCLE, and the proposed method improved the


Table 1 (continued)

Author's name | Method used | Dataset | Performance metrics | Key-points
Liu et al. [9] | Integration of matrix completion and ridge regression | CCLE (cancer cell line encyclopedia) and GDSC (genomics of drug sensitivity in cancer) | PCC = 0.70 | Better than the dual-layer integrated cell line-drug network model
Zhang et al. [10] | Integrated CSN (cell line similarity network) and DSN (drug similarity network) | CCLE (cancer cell line encyclopedia) and CGP (cancer genome project) | Pearson correlation = 0.6 | Correctly predicted mutant cell lines
Liu et al. [11] | NCFGER (neighbor-based collaborative filtering with global effect removal) technique | GDSC and CCLE | PCC = 0.73 and RMSE | The unknown features were predicted with the K most similar neighbors
Suphavilai et al. [12] | CaDRReS (Cancer drug response prediction using a recommender system) | CCLE and GDSC | Correction for drug biases, correction for cell line biases | The ability to predict unseen cell lines was very helpful

prediction accuracy of the anticancer drug response. Emdadi et al. [7] presented the Auto-HMM-LMF model, which is based on a feature selection method. Their analysis showed that the feature space in drug response prediction has many dimensions; therefore, they designed a model that reduces the drug response features. HMM-LMF is based on the hidden Markov model and logistic matrix factorization methods. An auto-encoder model was used to select the gene expression and data variation that feed the logistic matrix factorization model. The results were examined on two datasets, GDSC and CCLE, and the model provided efficient results on the cell lines of unseen patients. Eslahchi et al. [8] used cell lines for the prediction of drug response in anticancer patients. The proposed model, known as manifold learning for anticancer drug response prediction (ADRML), used information from both cell lines and drugs; the integration of cell line and drug information was used for accurate prediction for anticancer patients. The drug response matrix was converted to a lower-rank space that provided new cell line and drug features. The performance of the model was evaluated on two pharmacogenomics datasets, namely CCLE and GDSC. Liu et al. [9] improved the prediction model of anticancer drug response with an ensemble method. The two methods, matrix


completion and ridge regression, were combined to better predict anticancer drug response. The model's performance was compared with that of the dual-layer integrated cell line-drug network model and was examined on the standard CCLE and GDSC datasets. The model provided efficient results, with a high Pearson correlation between predicted and observed drug responses on the CCLE dataset as well as on the GDSC dataset; for intracellular signal-regulated kinases, the approach covered 26 of the 30 medications. The suggested model outperformed the dual-layer integrated cell line-drug network architecture in terms of overall effectiveness. Zhang et al. [10] proposed a model for predicting anticancer drug response that combines drug and cell line characteristics. The two networks were joined to improve the model's performance: by merging the cell line similarity network and the drug similarity network, a dual-layer integrated cell line-drug network model was created to predict anticancer therapy response. The performance of the model was examined on the benchmark CCLE and CGP datasets, and the model achieved a Pearson correlation of 0.6 with the observed responses. Liu et al. [11] designed a system for the estimation of anticancer drug response with a neighbor-based collaborative filtering technique. Drug response estimation was formulated as an estimation problem over the response matrix. A global effect removal technique was used to remove the effects shared by similar cell lines and similar drugs; similar cell lines and drugs were paired and shrunk down, and the K most similar neighbors were used to estimate the unknown attributes. On the GDSC and CCLE datasets, tenfold cross-validation was used. Suphavilai et al. [12] proposed a method (CaDRReS), a recommender system that learns projections for drugs and cell lines in a latent 'pharmacogenomics' space and predicts the drug responses of unseen cell lines. The proposed CaDRReS model was compared with other models, and its performance was evaluated on the benchmark CCLE and GDSC datasets. The pharmacogenomics space inferred by CaDRReS was also shown to be useful in understanding drug mechanisms, determining cellular subtypes, and characterizing drug-pathway interactions, according to the researchers. The evaluation parameters with descriptions and formulas are shown in Table 2, and a comparison of the different existing drug response prediction techniques based on the Pearson correlation coefficient (PCC) is shown in Fig. 1.
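Most of the cell-line-based methods in Table 1 (e.g., WGRMF, SRMF, HMM-LMF) build on low-rank factorization of the cell line × drug response matrix. The following NumPy sketch shows only that shared core idea, plain factorization of the observed entries with an L2 penalty fitted by gradient descent; it is not a re-implementation of any of the published models above, and all data in it are toy placeholders:

```python
import numpy as np

def factorize(R, mask, rank=10, lam=0.1, lr=0.01, epochs=500, seed=0):
    """Fit R ≈ U @ V.T on the observed entries (mask == 1) by gradient descent."""
    rng = np.random.default_rng(seed)
    n_cells, n_drugs = R.shape
    U = 0.1 * rng.standard_normal((n_cells, rank))
    V = 0.1 * rng.standard_normal((n_drugs, rank))
    for _ in range(epochs):
        E = mask * (R - U @ V.T)        # error on observed responses only
        U += lr * (E @ V - lam * U)     # gradient step with L2 regularization
        V += lr * (E.T @ U - lam * V)
    return U @ V.T                      # completed response matrix

# Toy usage: 50 cell lines x 20 drugs with ~30% of the responses unobserved.
rng = np.random.default_rng(1)
R_true = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 20))
mask = (rng.random(R_true.shape) > 0.3).astype(float)
R_hat = factorize(R_true * mask, mask)
```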

3 Databases and Parameters Used for Drug Adverse Reaction Prediction

Predicting pharmacological adverse effects requires deriving, from existing databases, the medication and protein features that help identify risks and side effects. To anticipate likely treatment options, information regarding drug molecules, protein targets, and negative consequences is required. Table 3 lists the key datasets providing pharmaceutical adverse reaction data and the medication attributes that have been or could be employed in adverse reaction probability models.


Table 2 Various parameters used for the evaluation of drug response prediction models

Parameters | Description | Evaluation example
PCC | The Pearson correlation coefficient is a correlation coefficient that depicts the connection between two factors measured on the same quantitative scale; it expresses how strongly two continuous variables are related | PCC = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² Σ(yi − ȳ)² ) = 0.98
Precision [5] | The precision measurement describes the correctly identified occurrences among all of the predicted events; it can therefore be defined as the verified adverse reactions out of all the adverse reactions predicted by a given method | Precision = True Positive / (True Positive + False Positive) = 0.91
Accuracy [5] | This aids in calculating the overall number of adverse effects appropriately identified by the developed model | Accuracy = (True Positive + True Negative) / (True Positive + False Negative + False Positive + True Negative) = 0.91
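As a hedged illustration (all values are placeholders, not results from any surveyed model), the Table 2 quantities can be computed directly: PCC on continuous drug-response values, and precision and accuracy on binary side-effect labels:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, precision_score

observed = np.array([0.8, 1.2, 0.5, 2.0, 1.1])    # measured responses (toy)
predicted = np.array([0.9, 1.0, 0.6, 1.8, 1.3])   # model outputs (toy)
pcc, _ = pearsonr(observed, predicted)            # Pearson correlation coefficient

y_true = np.array([1, 0, 1, 1, 0, 1])             # side effect present? (toy)
y_pred = np.array([1, 0, 0, 1, 0, 1])
prec = precision_score(y_true, y_pred)            # TP / (TP + FP)
acc = accuracy_score(y_true, y_pred)              # (TP + TN) / total
print(f"PCC={pcc:.2f}  precision={prec:.2f}  accuracy={acc:.2f}")
```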


3.1 Metrics for Drug Side Effect Prediction

Metrics are tools for assessing performance; they aid in understanding what an approach achieves and in comparing different approaches. Table 4 describes the measures employed in the assessment of pharmacological adverse reaction prediction.


Fig. 1 Graphical representation of drug response prediction techniques results (y-axis: Pearson correlation coefficient; x-axis: techniques)

Table 3 Essential databases for drug adverse reactions prediction

Sr. No. | Database | Explanation
1 | BIO-SNAP 2014 [13] | BIO-SNAP 2014 provides data on the pharmaceuticals available on the market in the United States. All information was gathered from open records, medicine labeling, package inserts, as well as physician, client, and drug company records. The information in this data source is represented graphically with nodes and edges
2 | SIDER (Side effect resource) 2012 [14] | The information collected on medication responses during drug testing is a key form of phenotypic data. The SIDER dataset collects statistics and facts regarding pharmacological compounds and proteins to formulate a meaningful picture of the benefits as well as the unfavorable consequences of pharmaceuticals
3 | DART (Drug adverse reaction target) [15] | DART is a record that provides information about drug adverse reactions. It also includes details on probable targets and their toxic consequences


Table 4 Metrics of drug response prediction models with definitions

Sr. No. | Metric | Definition
1 | Sensitivity | The fraction of positives that are correctly predicted (i.e., the number of individuals who have the condition (affected) who are accurately identified as having the condition) is referred to as the true-positive rate
2 | Specificity | The fraction of negatives correctly detected (in other words, the number of individuals who do not have the condition (unaffected) who are correctly classified as not exhibiting the condition) is known as specificity (true-negative rate)
3 | Recall [5] | The total number of relevant instances that are detected out of all the actual occurrences is referred to as recall

4 Drug Side Effect Prediction Methods

Among the most challenging tasks in drug discovery is dealing with the adverse side effects that arise during the development stages. The field of pharmacological adverse reaction forecasting is still at an early stage of development. The main classes of drug side effect prediction models are described below.

4.1 Docking-Based Methods

Docking is the preferred orientation of one chemical compound with respect to another when they bind to form a stable complex. During signal transduction, interactions with molecules such as phospholipids and other organic molecules are critical, and the signal delivered is often affected by the relative orientation and the docking [16]. As a result, docking helps determine the form and intensity of the signal that is generated. Docking-based algorithms for the prediction of drug side effects are discussed below.

4.1.1 Inverse Docking

INVDOCK, which docks a chemical compound into each of the protein cavities and screens out major mismatches, was used to determine the protein targets of each clinical medicine. Refinement and energy minimization were then carried out, and the candidate protein targets were identified using a scoring scheme that considered both the fit and the binding-site interaction energies [17].

4.1.2 INVDOCK Based on Anti-HIV

Side effects have been reported whenever a drug molecule is present in excess or interacts with an enzyme in an unintended way, and a molecular docking-based technique has been developed for identifying such protein interactions. The INVDOCK technique is used to predict the probable adverse effects of several anti-HIV medications, in an attempt to see whether it can be used to anticipate prospective targets connected to the treatments' negative effects [18].

4.2 Network-Based Methods

In network-based methods, drugs, targets, and adverse reactions are viewed as nodes, and edges represent the connections between them. Graph-based representations are used in network-based methods to forecast and determine pharmacological adverse reactions [19]. Adverse reactions are induced by a variety of circumstances, including incorrect dose, binding to non-targets, and insufficient biotransformation, among others. The network-based algorithms are explained below.

4.2.1 Cooperative Pathways-Based Approach

The use of pathway networks, in addition to information about genetic variants, has also been considered as a method for predicting pharmacological adverse reactions. The set of drugs and conditions that affect each gene is represented as a dataset, and the resulting network is analyzed to recognize sub-pathways that share similar medications and substances, because these routes indicate the underlying linkages of the pathway [20].

4.2.2 GO and PPI-Based Approach

GO networks are created by combining pharmacologically relevant data with gene ontology (GO) annotations. The resulting PPI and GO connections are combined with pharmacological adverse reaction information to aid in the development of machine learning estimation techniques [21]. A network was established using data on drug concentrations and their associated negative impacts.


4.3 Machine Learning-Based Methods

Machine learning is the umbrella term for a range of modeling methodologies that gather information and then use it to understand a certain domain. A series of samples is delivered to the computational intelligence methods and employed in the learning phase; the training then aids in the detection of subsequent situations and events [22].

4.3.1 Collaborative Filtering-Based Method [23]

In this approach, side-effect profiles are first constructed for the known medications. The drug-drug similarity is then calculated using the Jaccard coefficient, cosine similarity, and Pearson's correlation. The n different models are built using the techniques outlined above, and the outputs of the n models are combined to produce the final result. A minimal sketch of the neighborhood step follows.
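As a hedged illustration of the neighborhood step only (the ensemble over Jaccard, cosine, and Pearson similarities described above is omitted), a new drug's side-effect profile could be scored as the similarity-weighted average of the known drugs; all values below are toy placeholders:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Known drug x side-effect matrix (1 = side effect reported), toy values.
profiles = np.array([[1, 0, 1, 0, 1],
                     [1, 1, 0, 0, 1],
                     [0, 0, 1, 1, 0]])
new_drug = np.array([[1, 0, 1, 0, 0]])           # partially known profile

sim = cosine_similarity(new_drug, profiles)[0]   # similarity to each known drug
scores = sim @ profiles / sim.sum()              # similarity-weighted average
print(np.round(scores, 2))  # higher score = side effect considered more likely
```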

4.3.2 Boltzmann Machine-Based Network

In a Boltzmann machine-based network, the neural network is used to estimate probabilities. A two-layer structure was used in the model, with the two layers representing the visible and hidden units, respectively. To increase performance, the k-nearest neighbor technique and the restricted Boltzmann machine have been combined to form ensembles [24].

4.4 Miscellaneous Approaches

This category covers methods such as sparse canonical correlation analysis and other scoring procedures for determining the likelihood of adverse reactions. The following are some of these methods for forecasting pharmacological adverse reactions.

4.4.1 Logistic Regression

A combination of variables is created using regression analysis, and the model estimates the likelihood that an individual belongs to either the affected or the non-affected category. LR-based propensity score matching (PSM) is used for drug side effect prediction: each record's score is computed with this technique, the records are subsequently separated into 20 score strata, and the affected and non-affected individuals are classified [25]. A hedged sketch of this step is shown below.
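A minimal sketch of the propensity-score step under these assumptions (the covariates and group labels are synthetic placeholders, and the 20 strata follow the description above):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_cov = rng.normal(size=(1000, 5))            # covariates (toy placeholders)
affected = rng.integers(0, 2, size=1000)      # 1 = adverse reaction observed

ps_model = LogisticRegression(max_iter=1000).fit(X_cov, affected)
propensity = ps_model.predict_proba(X_cov)[:, 1]   # P(affected | covariates)

# Separate records into 20 propensity-score strata and inspect each stratum's rate.
strata = pd.qcut(propensity, q=20, labels=False, duplicates="drop")
rates = pd.Series(affected).groupby(strata).mean()
print(rates.head())
```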


Table 5 Various types of techniques with adverse reaction data, protein-related data, and drug-related data

Reference | Technique type | Adverse reaction data | Protein-related data | Drug-related data
[16] | Docking based | Determine adverse reactions as well as proteins that are associated with them: DART | Protein structure represented in three dimensions: PDB | Details about drugs: DART
[19] | Network based | Adverse reactions as well as information about similar drugs: SIDER | Network information about PPI: STRING | Information about drugs: DRUGBANK
[22] | Machine learning based | Adverse reaction information: SIDER | Natural properties: DrugBank and KEGG | Chemical structures: PubChem
[25] | Miscellaneous approaches | Adverse reactions as well as information about similar drugs: SIDER | Network information about PPI: STRING | To identify pharmacological substances: PUBCHEM

4.4.2 Network Based on Multi-kernel

This approach seeks to address the class imbalance problem in adverse reaction forecasting. The 881-bit PubChem fingerprint is often used to represent the chemical properties of the medications, and the relationships between medications and adverse reactions are used to describe the similarities across them [25]. The various types of techniques with their side effect data, protein-related data, and drug-related data are depicted in Table 5, and the drug response prediction methods with their advantages and disadvantages are presented in Table 6. The fundamental disadvantage of all molecular structure-based approaches is their reliance on what is available in observed measurements.

5 Conclusion and Future Work

The purpose of this paper is to examine the computational strategies that have been established for predicting pharmacological adverse effects. The most significant datasets for pharmaceutical adverse reaction forecasting have been highlighted, as well as the metrics for analyzing adverse reaction prediction algorithms. The essential parameters such as accuracy, precision, and recall, and the databases of drug adverse reactions such as BIO-SNAP 2014, Side Effect Resource (SIDER) 2015, DART (Drug Adverse Reaction Target), etc., have been elaborated. Docking techniques, network-based techniques, ML-based approaches, and other


Table 6 Drug response prediction methods with advantages and disadvantages

Methods | Advantages | Disadvantages
Docking based | Such techniques do not rely on prior experimental data to determine unique as well as unexpected connections | Time-consuming process; 3D structures of both the drug and the target are required
Network based | Compared with docking-based approaches, it needs minimal computation | Considers only close neighbors; the reliance on existing experimental results limits the discovery of novel drug interactions | 
Machine learning based | No need for comprehensive information; computational time is low | Existence of uncertainty; the approaches' performance is influenced by the diversity of molecules in the collected data, the accuracy of the descriptors, and other factors
Miscellaneous approaches | Provided efficient results on large datasets | Need to define parameters

approaches have been categorized as strategies for medication adverse reaction forecasting. The docking-based approach relies on the preferred orientation of one chemical compound with respect to another when forming a stable complex; inverse docking and anti-HIV INVDOCK are discussed with their uses and advantages. The network-based approaches, for example the cooperative pathways and the GO and PPI-based approaches, are described. ML is the umbrella term for a family of modeling techniques that gather information and then use it to understand a certain domain; the collaborative filtering and Boltzmann machine-based methods are discussed. Miscellaneous approaches are those that are not included in the network-based and machine learning-based categories; their main advantage is that they can easily be applied to large datasets. The advantages and disadvantages of drug prediction approaches are depicted in Table 6. Although a variety of analytical models have been employed to estimate medication effects and interactions, earlier techniques can still be enhanced: kernel learning and multiple weighting schemes are not taken into consideration in traditional designs, and knowledge of the side effect space is essential for improving generalization ability. Deep learning-based approaches for drug response prediction will be analyzed in the future, and advanced learning techniques will be compared for the analysis of adverse reactions of medications.


References 1. Dey S, Luo H, Fokoue A, Hu J, Zhang P (2018) Predicting adverse drug reactions through interpretable deep learning framework. BMC Bioinform 19(21):1–13 2. Güvenç Paltun B, Mamitsuka H, Kaski S (2021) Improving drug response prediction by integrating multiple data sources: matrix factorization, kernel and network-based approaches. Brief Bioinform 22(1):346–359 3. Guan NN, Zhao Y, Wang CC, Li JQ, Chen X, Piao X (2019) Anticancer drug response prediction in cell lines using weighted graph regularized matrix factorization. Mol Therapy-Nucleic Acids 17:164–174 4. Sakellaropoulos T, Vougas K, Narang S, Koinis F, Kotsinas A, Polyzos A, Gorgoulis VG (2019) A deep learning framework for predicting response to therapy in cancer. Cell Rep 29(11):3367–3373 5. Choi J, Park S, Ahn J (2020) RefDNN: a reference drug based neural network for more accurate prediction of anticancer drug resistance. Sci Rep 10(1):1–11 6. Wang L, Li X, Zhang L, Gao Q (2017) Improved anticancer drug response prediction in cell lines using matrix factorization with similarity regularization. BMC Cancer 17(1):1–12 7. Emdadi A, Eslahchi C (2021) Auto-HMM-LMF: feature selection based method for prediction of drug response via autoencoder and hidden Markov model. BMC Bioinform 22(1):1–22 8. Moughari FA, Eslahchi C (2020) ADRML: anticancer drug response prediction using manifold learning. Sci Rep 10(1):1–18 9. Liu C, Wei D, Xiang J, Ren F, Huang L, Lang J, Tian G, Li Y, Yang J (2020) An improved anticancer drug-response prediction based on an ensemble method integrating matrix completion and ridge regression. Mol Therapy-Nucleic Acids 21:676–686 10. Zhang N, Wang H, Fang Y, Wang J, Zheng X, Liu XS (2015) Predicting anticancer drug responses using a dual-layer integrated cell line-drug network model. PLoS Comput Biol 11(9):e1004498 11. Liu H, Zhao Y, Zhang L, Chen X (2018) Anti-cancer drug response prediction using neighborbased collaborative filtering with global effect removal. Mol Therapy-Nucleic Acids 13:303– 311 12. Suphavilai C, Bertrand D, Nagarajan N (2018) Predicting cancer drug response using a recommender system. Bioinformatics 34(22):3907–3914 13. Zitnik M, Sosic R, Leskovec J (2018) BioSNAP datasets: stanford biomedical network dataset collection. Note: http://snap.stanford.edu/biodata Cited by 5(1) 14. Kuhn M, Letunic I, Jensen LJ, Bork P (2016) The SIDER database of drugs and side effects. Nucleic Acids Res 44(D1):D1075–D1079 15. Ji ZL, Han LY, Yap CW, Sun LZ, Chen X, Chen YZ (2003) Drug adverse reaction target database (DART). Drug Saf 26(10):685–690 16. Pinzi L, Rastelli G (2019) Molecular docking: shifting paradigms in drug discovery. Int J Mol Sci 20(18):4331 17. Chen YZ, Zhi DG (2001) Ligand–protein inverse docking and its potential use in the computer search of protein targets of a small molecule. Proteins: Struct, Funct, Bioinf 43(2):217–226 18. Asgaonkar KD, Patil SM, Chitre TS, Ghegade VN, Jadhav SR, Sande SS, Kulkarni AS (2019) Comparative docking studies: a drug design tool for some pyrazine-thiazolidinone based derivatives for anti-HIV activity. Curr Comput Aided Drug Des 15(3):252–258 19. Zhang F, Wang M, Xi J, Yang J, Li A (2018) A novel heterogeneous network-based method for drug response prediction in cancer cell lines. Sci Rep 8(1):1–9 20. Zhao J, Zhang XS, Zhang S (2014) Predicting cooperative drug effects through the quantitative cellular profiling of response to individual drugs. CPT: Pharmacometrics Syst Pharmacol 3(2):1–7 21. 
Dorel M, Barillot E, Zinovyev A, Kuperstein I (2015) Network-based approaches for drug response prediction and targeted therapy development in cancer. Biochem Biophys Res Commun 464(2):386–391


22. Qiu K, Lee J, Kim H, Yoon S, Kang K (2021) Machine learning based anti-cancer drug response prediction and search for predictor genes using cancer cell line gene expression. Genomics Inform 19(1) 23. Zhang J, Li C, Lin Y, Shao Y, Li S (2017) Computational drug repositioning using collaborative filtering via multi-source fusion. Expert Syst Appl 84:281–289 24. Wang Y, Zeng J (2013) Predicting drug-target interactions using restricted Boltzmann machines. Bioinformatics 29(13):i126–i134 25. Sachdev K, Gupta MK (2020) A comprehensive review of computational techniques for the prediction of drug side effects. Drug Dev Res 81(6):650–670

Assessment of Segmentation Techniques for Irregular Border Lesion Images in Melanoma K. Gnana Mayuri

and L. Sathish Kumar

Abstract Skin cancer is considered to be the deadliest disease. Lesion is a suspicious part which has an unusual growth compared to skin and also appears as a smooth surface with size variation, indiscriminate shape, and unusual colors. Segmentation plays an essential and crucial role here. When an image is divided into segments, important features can be projected and processed instead of complete image. Some expert dermatologists can see the segmented part of lesion and conclude the chances of occurring melanoma and non-melanoma. This phase plays a crucial role in early diagnosis and detection of cancer. However, selecting an apt segmentation technique for various data set images is a major challenge in the medical field. Hence, this work addresses a selection of suitable segmentation method which has to confer a good result. In this paper, three approaches of segmentation techniques binary Otsu, marker-based watershed, and K-means clustering are implemented and compared especially for irregular border lesion. Segmentation results are evaluated based on quality assessment metrics of image such as mean square error, mean absolute error, structural similarity index, and peak signal-to-noise ratio, i.e., MSE, MAE, SSIM, and PSNR. On an average, it is observed that for marker-based watershed segmentation MSE and MAE values are reduced to 35% and 10%. PSNR values are increased to 15%, and SSIM has shown an increase of 70% when compared to other two methods. This research shows that marker-based watershed segmentation works well for irregular border lesion images of melanoma. Keywords Segmentation · Image processing · Skin cancer

K. Gnana Mayuri (B) Geethanjali College of Engineering and Technology, Hyderabad, India e-mail: [email protected] L. Sathish Kumar VIT Bhopal University, Bhopal, Madhya Pradesh, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_12


1 Introduction Skin lesion [1] may appear in different sizes, shapes, and colors which are hurdles for the methods of segmentation process [2] and attaining high accuracy. Analysis of skin lesion effectively will aid and support dermatologist or general physician in quick decision making for further treatment. Many researchers prefer to use segmentation methods only for normal skin lesion images, but a lot of research has to be done for irregular border lesion, as it is difficult to mark and identify its borders. Biomedical researchers also notice that most of the medic specialists struggle for accurate visual evaluation of Irregular borders. In this paper, we focus mainly to segment the lesion with irregular and uneven borders. ABCDE rule [3] is used as one of the most important factors for detecting melanoma [4]. A-asymmetric mole, B-edges or borders are inconsistent, C-abrupt changes in color of mole (black-brown, wheatish-brown, flaky red, and pink patches), D-a mole which is more than 6mm, and E-mole evolving in size and shape. Here we focus on B, i.e., borders and edges which play a keen role in the detection of melanoma. Irregular border lesion on skin of a person is a hurdle for dermatologist to identify and calculate the border of segmented lesion. Detection of edges [5] and boundaries are also considered as key points for melanoma. Results obtained after segmentation are evaluated in terms of image quality metrics [6] which is considered to be an important factor in image processing. Generally, skin cancer is categorized in to three types: basal cell carcinoma, squamous cell carcinoma, and melanoma 1. Basal Cell Carcinoma: This develops from basal cells of lower epidermis in skin which can grow on parts of head, neck and can appear on the parts of skin exposed to sunlight. BCC appears in colors like brown, black, and dark blue or combinations of these three colors. 2. Squamous Cell Carcinoma: This is developed on squamous cells of skin and mainly caused due to high radiations of sun. SCC appears on lips, head, neck, nose, ears, legs, and hands. SCC can appear as red, pink, or yellowish-red scaly patches, bump with raised edges and a sore that does not heal. This cancer can easily spread to tissues, bones, and nearby lymph nodes if neglected and detected late. 3. Melanoma: A deadliest skin cancer [7] which is developed on melanocytes of skin and develops on areas like face, legs, arms, face and also on parts which are not exposed to sun like feet, palms of hands. Melanoma can appear as a normal mole on the body. Lesion colors are black, brown, dark black, wheatish brown, dark brown, and dark red or combinations of these colors. Melanoma is detected based on ABCDE rule. This paper is arranged in to the following. Section 2 shows the related work and background, Sect. 3 illustrates the methodology used, Sect. 4 briefs the results obtained along with its evaluation, and Sect. 5 concludes the paper.


2 Related Work and Background Segmentation [8] is categorized in to following 1. Thresholding-based segmentation: It is a basic and familiar technique [2] that separates objects from their background. In this each pixel value or intensity of the image is compared to a definite threshold which splits the image pixels in to two categories: • Pixels for which intensity values are lower than specified threshold. • Pixels for which intensity values are greater than specified threshold. Global threshold, adaptive threshold, and Otsu are few techniques used in thresholding. 2. Region-based segmentation: It divides the image in to segments by partitioning image in to different parts, acquiring similar properties, and identifying the region of interest. It will also partition the images in to similar regions. This method [9] depends on the presumption that its neighbor pixels inside a region will have an identical value. The standard process is to segregate one pixel to its corresponding neighbors. If similar pixels are found, pixel is set to same cluster as its neighbors. Watershed segmentation, region growing, merging, and splitting are few techniques used for region-based segmentation. 3. Clustering-based segmentation: This method [10] is used to segment image to form clusters or separate groups of pixels attaining equivalent features. Attributes of input images are divided in to clusters in such a way that attributes lying in same cluster are more alike related to other clusters. Clustering algorithms are unsupervised algorithms in which the user has no predetermined characteristics classes and sub-divisions. K-means and fuzzy c-means are few clustering techniques used. 4. Edge-based segmentation: Edge detection [11] is used to discover margins or boundaries for objects along with extraction of structure for objects in an image. It also allows focusing on features which are considerable for cancer and finally by eliminating those which are not necessary. Canny, Laplacian, Sobel, and Prewitt are edge detection methods. Zakwan et al. [6] have shown image quality assessment techniques for segmentation of test images available on Internet. Author has considered MSE, MAE, SSIM, average difference, and DSSIM as parameters for the evaluation of K-means, threshold, and watershed segmentation techniques, in which K-means clustering has proved to give better results. Jamil et al. [9] have implemented marker-based watershed segmentation for computer-aided diagnosis on dermoscopic images of melanoma. Initially, author has applied filters and gradient magnitude to remove noise and hairs from images. Then image enhancement was done. Finally, marker-based watershed segmentation showed a good result of accuracy to mark foreground object and lesion segmentation.


Jacily Jemila et al. [2] have compared implementation of threshold values for Otsu segmentation, edge-based segmentation, and pixel-based segmentation based on image quality metrics such as MSE, PSNR, SNR, SSIM, and EPI. Author has concluded that suitable segmentation method can be opted based on image quality metric values obtained. Metib et al. [12] used K-means clustering for obtaining the area of interest of an infected lesion. In this, the distance metrics of centroid cluster and complete cluster is used to obtain segmentation process which is displayed as an object. A new point is defined by its closest point nearby and within cluster. This texturebased segmentation is compared with K-Means cluster (K = 3) based on lightness and color. Finally, K-Means clustering segmentation achieved a better performance compared to other techniques. Sara et al. [13] have presented a comparative study for image quality assessment. This author has considered SSIM, MSE, PSNR, and DSSIM as parameters to evaluate quality of image. Gothi et al. [14] have derived an efficient way of achieving image segmentation in which Otsu, K-Means, and gradient flow vector (GVF) are used for segmentation. Otsu method was applied to noise-free image which computed a global threshold value for minimizing foreground and background pixels and finally morphological operations were applied. K-Means clustering was applied to center of cluster centroids, and value of histogram is calculated for segmentation. GVF segmentation is applicable for images whose boundaries calculation is not required. Shanthi et al. [15] have presented a method for skin lesion segmentation by using watershed algorithm in multi-channel. Basically, a Gaussian filter is used to remove hairs and noise from images, and then the processed image is applied with watershed algorithm for segmentation. Images are treated with nodes, edges, and borders where watershed lines are described in graph and morphological tool was used for region segmentation which has led to give author an high border closure and precision results. Watershed segmentation was efficient in skin cancer detection.

3 Methodology Used Binary Otsu, K-means clustering, and marker-based watershed segmentation are implemented. Overall methodology for this work is given in Fig. 1.


Fig. 1 Overall methodology

3.1 Binary Otsu Segmentation

It is one of the basic segmentation methods [16] which converts a grayscale image to a binary image. This method is used to locate objects and boundaries in an image and returns a single threshold intensity which separates pixels into two classes, i.e., foreground and background. It iterates through the possible threshold values and evaluates the spread of pixel levels above and below each candidate threshold. Every pixel of the image is then assigned a label, where pixels with the same label share particular features or properties. It is a very popular method which is still used in many applications as a basic thresholding technique. Many researchers use Otsu as the basic step of a segmentation pipeline along with morphological operations (known as binary Otsu) for accurate results and then rely on another technique for further segmentation. Given below is the procedure for binary Otsu:
1. An initial estimate for P is selected, and the image is segmented using P.
2. Two different classes of pixels are produced: C1 having pixels whose intensity value is greater than or equal to P, and C2 having pixels whose value is less than P.
3. Average intensity values μ1 and μ2 are computed for classes C1 and C2, and a new threshold value is obtained as

   p = (μ1 + μ2) / 2    (1)

4. The above steps are repeated until the difference of P in successive iterations is smaller than a predefined parameter T, and the morphological closing operation with dilation is applied.
5. A threshold value p is then selected by minimizing the within-class variance (σ²) or, equivalently, by maximizing the between-class variance (σ²).


σ²(p) = w0(p) · w1(p) · [μ1(p) − μ2(p)]²    (2), (3)

where σ²(p) is the between-class variance, w is the class weight, μ is the class mean, and P is the threshold value. The resultant image is a binary segmented image; by applying the morphological closing operation, gaps in the boundaries are filled and region boundaries in the image are enhanced:

Y • B = (Y ⊕ B) ⊖ B    (4)

where Y is the input image and B is the structuring element used for eliminating small dark regions and holes or leaks.
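A minimal sketch of this procedure with OpenCV, which exposes Otsu thresholding and morphological closing directly, could look as follows; the file name and the 5 × 5 structuring element are illustrative assumptions rather than values taken from the paper.

```python
import cv2
import numpy as np

# Read the lesion image and convert it to grayscale
img = cv2.imread("lesion.jpg")            # hypothetical file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Otsu thresholding: OpenCV iterates over candidate thresholds internally
# and returns the value that maximizes the between-class variance.
thresh_val, binary = cv2.threshold(gray, 0, 255,
                                   cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Morphological closing (dilation followed by erosion) to fill small holes
# and smooth region boundaries, as in Eq. (4).
kernel = np.ones((5, 5), np.uint8)        # assumed structuring element size
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
```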

3.2 Marker-Based Watershed Segmentation It is a region-based segmentation method [8] which uses a feature of morphological operations. In this at least one marker called (seed), a point which is inside to each object image is selected, which also includes a background to be considered as separate object. Markers [12] are selected by operators of objects. When objects are marked they can be increased using morphological operations. Following are steps used for watershed marker-based segmentation 1. Input image is converted in to grayscale, and a binary Otsu is applied to find an approximate threshold value and morphological operator; i.e., closing operation with dilation is applied to remove noise and fill the holes present. 2. Regions are labeled near to center of objects as foreground and region far away as background. To find out regions which are overlapping with each other distance transform is applied with a proper threshold. 3. Distance transform operator is applied only for binary image. This will result in a gray-level image which will be similar to input image, which excludes intensity points lying inside the foreground region and then is changed to display its nearby close boundary from each point in a gray-level image. 4. Remaining regions which are not identified are found by watershed algorithm; i.e., these regions generally would be boundaries where foreground and background meet, which is called as border obtained by subtracting foreground from background. Result obtained would be threshold image from which some unwanted regions are detached. 5. Finally, a marker is created, and a region is labeled inside it. Known regions are labeled with any positive integer and unknown area which is left is labeled as zero.


6. Lastly, connected components function is used which labels background with 0 and other objects with integers from regions of 0, which is considered as unknown by watershed. Resultant image is considered as marker-based watershed which will detect boundary for lesion, and segmentation is done along within (ROI) region of interest for image.
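The six steps above can be sketched with OpenCV roughly as below; the 0.5 distance-transform factor and the kernel sizes are illustrative assumptions rather than values reported in the paper.

```python
import cv2
import numpy as np

img = cv2.imread("lesion.jpg")                     # hypothetical input
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Step 1: binary Otsu plus closing to remove noise and fill holes
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
kernel = np.ones((3, 3), np.uint8)
closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel, iterations=2)

# Steps 2-3: sure background by dilation, sure foreground by distance transform
sure_bg = cv2.dilate(closed, kernel, iterations=3)
dist = cv2.distanceTransform(closed, cv2.DIST_L2, 5)
_, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, 0)   # assumed 0.5 factor
sure_fg = np.uint8(sure_fg)

# Step 4: unknown region = border where background and foreground meet
unknown = cv2.subtract(sure_bg, sure_fg)

# Steps 5-6: label known regions with positive integers, unknown region with 0
_, markers = cv2.connectedComponents(sure_fg)
markers = markers + 1
markers[unknown == 255] = 0

# Watershed fills the unknown region and marks the boundary pixels with -1
markers = cv2.watershed(img, markers)
img[markers == -1] = [0, 0, 255]          # draw the lesion boundary in red
```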

3.3 K-Means Clustering

K-means clustering [10] is a simple algorithm which starts by choosing K centroids randomly and repeats until it finds the best centroids. The number of clusters to be found from the given data is described by K. A cluster is formed by assigning data points in such a way that the total squared distance between the data points and their centroid is minimum:
1. Identify the specified number of K clusters.
2. Initially select K data points and assign every data point to a cluster.
3. Calculate the centroids of the clusters.
4. Repeat the above steps till an ideal centroid is obtained.

P = Σ_{k=1}^{K} Σ_{j=1}^{m} ‖y_j − v_k‖²    (5)

where K is the number of cluster centers, v_k is the center of cluster k, and y_j is point j in the data set.
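A compact OpenCV sketch of K-means colour clustering for a lesion image is shown below; K = 2 and the stopping criteria are illustrative assumptions.

```python
import cv2
import numpy as np

img = cv2.imread("lesion.jpg")                    # hypothetical input
pixels = img.reshape((-1, 3)).astype(np.float32)  # data points = pixel colours

# Stopping rule: a maximum number of iterations or a small change in centroids
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 0.2)
K = 2                                             # assumed: lesion vs. background

# cv2.kmeans repeats assignment and centroid update until the criteria are met,
# minimizing the objective of Eq. (5).
_, labels, centers = cv2.kmeans(pixels, K, None, criteria,
                                10, cv2.KMEANS_RANDOM_CENTERS)

# Rebuild the segmented image by replacing each pixel with its cluster centre
centers = np.uint8(centers)
segmented = centers[labels.flatten()].reshape(img.shape)
```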

3.4 Quality Assessment for Images

To assess the three segmentation methods used, we have considered the image quality assessment parameters [13] mean square error (MSE), mean absolute error (MAE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM).

MSE: It is one of the most familiar image evaluation methods [6], defined as the mean of the squared difference between the input image and the calculated or processed image. The lower the MSE value, the better the image quality.

MSE = (1 / PQ) Σ_{a=1}^{P} Σ_{b=1}^{Q} (d(a, b) − e(a, b))²    (6)

where P is the image height, Q is the image width, d(a, b) is the input image, and e(a, b) is the calculated or estimated image.

MAE: This illustrates [6] the error between the original image and the calculated image. The lower the MAE value, the better the image quality.

MAE = (1 / PQ) Σ_{a=1}^{P} Σ_{b=1}^{Q} |d(a, b) − e(a, b)|    (7)

where P is the image height, Q is the image width, d(a, b) is the input image, and e(a, b) is the calculated or estimated image.

PSNR: This method [17] calculates the ratio of the maximum attainable signal power to the corrupting noise power. The higher the PSNR value, the better the image quality.

PSNR = 10 log₁₀(C² / MSE)    (8)

where C represents the maximum pixel value of the input image and MSE represents the mean square error.

SSIM: This illustrates [6] the similarity in structure between the processed image and its original image. The closer the SSIM value is to 1, the better the match between the original and the calculated or processed image.

SSIM(y, z) = [(2 μ_y μ_z + A₁)(2 σ_yz + A₂)] / [(μ_y² + μ_z² + A₁)(σ_y² + σ_z² + A₂)]    (9)

where y and z are the original and processed images, μ_y and μ_z are their mean intensities, σ_y and σ_z are their standard deviations, σ_yz is the covariance between them, and A₁, A₂ are positive constants selected to avoid instability of the measure.
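For reference, the four metrics can be computed with NumPy and scikit-image as sketched below, assuming 8-bit grayscale images; the use of scikit-image for SSIM is an assumption, since the paper does not name the libraries used for evaluation.

```python
import numpy as np
from skimage.metrics import structural_similarity

def mse(d, e):
    # Eq. (6): mean of squared differences between input and processed image
    return np.mean((d.astype(np.float64) - e.astype(np.float64)) ** 2)

def mae(d, e):
    # Eq. (7): mean of absolute differences
    return np.mean(np.abs(d.astype(np.float64) - e.astype(np.float64)))

def psnr(d, e, max_val=255.0):
    # Eq. (8): peak signal power over MSE, in decibels
    m = mse(d, e)
    return float("inf") if m == 0 else 10 * np.log10((max_val ** 2) / m)

def ssim(d, e):
    # Eq. (9) as implemented by scikit-image for 8-bit grayscale images
    return structural_similarity(d, e, data_range=255)
```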

4 Results and Evaluation Images used as data set are collected from ISIC [18]. Figures 2, 3, 4, and 5 are original irregular border lesion images collected for segmentation. Results obtained were executed in anaconda jupyter notebook. Input noise-free RGB image is converted to grayscale and respective segmentation method; i.e., binary Otsu, K-means, and watershed are implemented.


Fig. 2 ISIC_001

Fig. 3 ISIC_002

Fig. 4 ISIC_003

Fig. 5 ISIC_004

Figure 6 clearly shows the result of three segmentation methods for original images, namely ISIC_001, ISIC_002, ISIC_003, and ISIC_004 whose borders are scattered and uneven. Watershed clearly identifies the region of interest required and border in which lesion is not properly defined with boundaries and edges. Image quality assessment metrics such as MSE, MAE, PSNR, and SSIM are calculated and is given in Tables 1, 2, 3, and 4, which illustrates that MSE and MAE values are low, PSNR values are high and SSIM is close to 1, for marker-based watershed segmentation when compared to binary Otsu and K-means clustering. Average bar charts for Tables 1, 2, 3, and 4 are given in Figs. 7, 8, 9, and 10 from which a blue color bar represents binary Otsu, orange bar for K-means clustering and finally a green color bar for marker-based watershed segmentation, which clearly illustrates that marker-based watershed segmentation shows good results. MSE and


Fig. 6 Implementation of segmentation methods

Table 1 MSE evaluation
Image      Binary Otsu   K-Means   Watershed
ISIC_001   36845         37426     648
ISIC_002   33912         34620     1051
ISIC_003   36501         36730     1690
ISIC_004   38398         38859     1543

Table 2 MAE evaluation
Image      Binary Otsu   K-Means   Watershed
ISIC_001   138.75        138.83    35.64
ISIC_002   130.01        130.09    60.23
ISIC_003   129.56        129.56    49.40
ISIC_004   134.94        135.09    40.94

Table 3 PSNR evaluation
Image      Binary Otsu   K-Means   Watershed
ISIC_001   2.46          2.39      20.01
ISIC_002   2.82          2.73      17.91
ISIC_003   2.50          2.48      15.84
ISIC_004   2.28          2.23      16.24

Table 4 SSIM evaluation
Image      Binary Otsu   K-Means   Watershed
ISIC_001   0.103         0.093     0.78
ISIC_002   0.07          0.06      0.61
ISIC_003   0.07          0.06      0.77
ISIC_004   0.07          0.06      0.73

Fig. 7 Average bar chart for MSE

Fig. 8 Average bar chart for MAE

MAE values are low. PSNR values are high, and SSIM values are near to 1 for watershed segmentation.


Fig. 9 Average bar chart for PSNR

Fig. 10 Average bar chart for SSIM

5 Conclusion

Generally, it is preferable to observe an image beforehand when selecting a segmentation method. A segmented image is used to interpret a specified region, part, or object, which plays an important role in the medical field, as early detection of cancer is an essential need in human life. This paper shows an assessment and evaluation of three segmentation techniques, namely binary Otsu (threshold-based segmentation), marker-based watershed (region-based segmentation), and K-means clustering (clustering-based segmentation), in terms of image quality metrics. Finally, after comparative analysis of each segmentation method, it is shown that marker-based watershed segmentation works well for irregular border lesion images of melanoma. In the future, this work can also be extended to other types of skin lesion images, and other segmentation methods can be considered.


References 1. Gutman D, Codella NC, Celebi E, Helba B, Marchetti M, Mishra N, Halpern A (2016) Skin lesion analysis toward melanoma detection: a challenge at the international symposium on biomedical imaging (ISBI) 2016, hosted by the international skin imaging collaboration (ISIC). arXiv preprint arXiv:1605.01397 2. Jacily Jemila S, Brintha Therese A (2019) Selection of suitable segmentation technique based on image quality metrics. Imaging Sci J 67(8):475–480 3. Ali AR, Li J, Yang G (2020) Automating the ABCD rule for melanoma detection: a Survey. IEEE Access 8:83333–83346. https://doi.org/10.1109/ACCESS.2020.2991034 4. Chakkaravarthy Prabhu A, Chandrasekar A (2019) Automatic detection and segmentation of melanoma using fuzzy c-means. In: 2019 Fifth international conference on science technology engineering and mathematics (ICONSTEM). IEEE, New York, pp 132–136 5. Manikandan LC, Selvakumar RK, Nair S, Anu H, Sanal Kumar KP (2021) Hardware implementation of fast bilateral filter and canny edge detector using Raspberry Pi for telemedicine applications. J Amb Intell Hum Comput 12(5):4689–4695 6. Zaini SZS, Marzuki NNSM, Abdullah MF, Ahmad KA, Isa Sulaiman SN (2019) Image quality assessment for image segmentation algorithms: qualitative and quantitative analyses. In: 9th IEEE International conference on control system, computing and engineering (ICCSCE). IEEE, Penang, Malaysia, pp 66–71 7. Sreedhar B, Manjunath Swamy BE, Sunil Kumar M (2020) A comparative study of melanoma skin cancer detection in traditional and current image processing techniques. In: Proceedings of the fourth international conference on I-SMAC (IoT in social, mobile, analytics and cloud). IEEE, New York, pp 654–658 8. Broti T, Siddika A, Rituparna S, Hossain N, Sakib N (2020) Medical image analysis system for segmenting skin diseases using digital image processing technology. Int J Appl Inf Syst 12(28):7–15 9. Jamil U, Sajid A, Hussain M, Aldabbas O, Shafiq Afshan Alam Umair M (2019) Melanoma segmentation using bio-medical image analysis for smarter mobile healthcare. Springer 10(10):4099–4120 10. Yuan C, Yang H (2019) Research on K-value selection method of K-means clustering algorithm. Multidisciplinary Sci J 2(2):226–235 11. Kaur R, Maini R (2020) Evaluation and analysis of edge detection techniques on Leukemia images. Adv Math: Sci J 9(6):3721–3732 12. Metib MH, Abdulhssien MF, Abdulmunem AA (2020) Skin dermatitis detection using image segmentation techniques. In: 2nd International scientific conference of Al-Ayen University (ISCAU-2020), IOP conference series: materials science and engineering, vol 928. IOP Publishing, pp 1–9. https://doi.org/10.1088/1757-899X/928/3/032018 13. Sara U, Akter M, Uddin MS (2019) Image quality assessment through FSIM, SSIM, MSE and PSNR-a comparative study. J Comput Commun 7(3):8–18 14. Gothi S, Baraskar R, Agrawal S (2019) An efficient approach of image segmentation for skin cancer detection. Int J Sci Technol Res 7(2):783–787 15. Shanthi V, Sridevi G, Charanya R, Josphin Mary R (2020) Watershed algorithm in multichannel for skin lesion segmentation. Euro J Mol Clin Med 7(9):1374–1378 16. Zaw MT (2018) Than HTIKE AUNG: automatic segmentation of skin lesion in dermoscopic images. Int J Sci Eng Technol Res 8(9):0223–0229 17. Mwawado RH, Maiseli BJ, Dida M (2020) Robust edge detection method for the segmentation of diabetic foot ulcer images. Eng Technol Appl Sci Res 10(4):6034–6040 18. ISIC Homepage. https://www.isic-archive.com

Secure Communication and Pothole Detection for UAV Platforms S. Aruna, P. Lahari, P. Suraj, M. W. F. Junaid , and V. Sanjeev

Abstract Everyone around the world relies on roads for transportation on a daily basis. Aging and heavy usage of roads results in deterioration of the surface. Besides potholes, we also see many cracks in roads which may expand and develop into potholes themselves. Potholes can result in vehicle damage and can also cause physical harm to people in their vehicles. Detected potholes, if integrated with information systems, can notify drivers in real time, so they can be aware. Potholes are a problem, especially in developing countries. In this paper, we discuss leveraging and integrating the best modern technologies in order to solve this problem. We have tried to make it a feasible, robust, flexible, and modular system to help solve this problem. Using unmanned aerial vehicles (UAVs) and machine learning/deep learning and networking together, we will be able to spot potholes and cracks on the roads much faster than existing methods. By using the system, the need for manual surveillance is reduced. Using a UAV, you can cover large areas. To detect potholes, a UAV will be used to take videos and take pictures along the roads. These images will be sent to a computer where ML algorithms will detect the pothole. In addition to the CNN, a transfer learning algorithm and YOLO real-time object detection can be used to identify potholes. Video is streamed securely over Wi-Fi by an RTSP server to a local workstation, so the user can view the video from the UAV and may store it for future use. Keywords UAV · YOLO · Convolutional neural networks · RTSP · Inception-V3 · Potholes · Road safety · QUIC

S. Aruna (B) · P. Lahari · P. Suraj · M. W. F. Junaid · V. Sanjeev Department of Information Technology, Vasavi College Of Engineering, Hyderabad, Telangana, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_13


1 Introduction India is known for its pothole problem, which is the primary cause of traffic accidents. Engineers are trying to develop different methods for detecting these potholes so that they can be repaired as soon as possible since their detection is a very time-consuming task. Statistically, potholes are more deadly than any other feared cause of death on roads. In India, potholes are a very common problem faced due to both climatic and administrative reasons. Roads are very dangerous for commuters due to the incessant rain and poor maintenance. In terms of road surface, a pothole is a depressed or eroded area of surface or any place where the pavement has been damaged by traffic, and pieces of the pavement have been removed. Potholes are of various types, and they can be classified as in [1]. They begin as small cracks. Unless they are repaired right away, the entire road could be damaged. Poor road construction is the primary cause of potholes. The second cause of potholes is water flowing through them, typically from rainfall or sewage overflow. A large volume of turbulent water is one of the causes of potholes, according to urban planning experts. Despite the high mortality rates from potholes, this problem is not considered a major concern in many developing countries. Nearly, 4000 people die from potholes each year. Due to these, over 9300 deaths have occurred in the last three years, and over, 25,000 people have been injured. From 2013 to 2016, 11,836 people died, and 36,421 were injured in India’s pits. Each year, the pit’s activity is increased by the rain or monsoon. It is crucial to keep the roads safe at all times. For people who drive, road infrastructure plays a crucial role. The rainy season is the most problematic time for potholes in some areas, so drivers need to be cautious. The local population sometimes jokingly gauge a driver’s skill by their ability to remember where the pothole is and maneuver around it along the way. Thus, pothole detection is important, and the collected data can be used in many ways. In the foreseeable future, this technology could be integrated with automated or semiautomated vehicles. The warnings would enable drivers to slow down and avoid them (or the car itself could adjust settings to avoid them), minimize the impact, and ensure a smooth ride. An autonomous vehicle can also detect potholes by sensing the road ahead. However, there have been relatively few studies on road-surface damage. Many use traditional image processing methods for classifying images and determining damage but are not effective for detecting damage coordinates [2]. Recently, deep learning has been demonstrated to outperform traditional methods for predicting damage coordinates [3]. A convolutional neural network (CNN) at the heart of You Only Look Once (YOLO) is used in this study as the training and evaluation tool for object detection [4, 5]. Additionally, the potholes have been identified from a plurality of datasets by utilizing a CNN architecture [6] and Inception-V3 algorithms [7]. In order to find and remove the potholes, we can combine sensor/image data and feed it to AI algorithms. By monitoring the locations of these potholes, we can prevent many injuries and damage to physical assets. A damaged asphalt pavement can cause serious accidents and restrict traffic and fragile goods from being transported. To avoid accidents,


traffic jams, increased fuel consumption, and consequent pollution, potholes must be repaired. The methods currently in use for detecting potholes rely mainly on a manual process, which is extremely time-consuming. Other existing methodologies such as vibration analysis [8] and public reporting are being developed for detection functions. The main theme of this project is to identify the potholes and report them. For the specific use-case of detecting potholes, various techniques are discussed in [9–11]. We have studied techniques for object detection from aerial images and application in civil engineering field and agriculture [12–14]. Most developing countries have good access to paved roads that cover the majority of their land surface and make moving around the country easier. As a result, roads are considered an important infrastructure system provided by the government for transportation purposes. The aim of this paper is to further the development of automatic geolocation, AI-based system to identify the potholes in the road from an orthographic view so as to increase detection accuracy, save costs, and limit the adverse effects of wear and tear of roads by using existing technologies like GPS-based UAVs, secure networking, and AI algorithms for image detection and integrating them with custom-built components. An in-house modular UAV developed and tested will be used to extend the project and enhance it further [15]. The proposed system can also act as a very good quality-note-taking platform for Quality Control Civil Engineers, where they can easily keep the quality of roads and pavements in check. We are implementing our project with a custom, modular quadcopter (UAV) that can handle additional equipment as necessary.

2 Methodology

2.1 Secure Communication

Communication between the UAV and streaming clients is important and may need to be made secure. A variety of open-sourced streaming services and clients can be used. We are currently using an open-sourced v4l2 RTSP server for transmission and the VLC client as a viewer. Other popular streaming software also uses the Real-Time Streaming Protocol (RTSP) in its implementation. The streaming would be insecure because many of them do not provide encryption. We can take the MPEG stream from the camera and encrypt it using AES; to access the stream, the client will need a pre-shared key, and the corresponding decryption is performed on the receiving side.
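As an illustration of the pre-shared-key idea, the sketch below encrypts and decrypts individual stream chunks with AES-GCM using the PyCryptodome package; the library choice, key size, and chunking scheme are assumptions, not details taken from the implementation.

```python
from Crypto.Cipher import AES            # PyCryptodome
from Crypto.Random import get_random_bytes

PRE_SHARED_KEY = get_random_bytes(32)    # in practice, distributed out of band

def encrypt_chunk(chunk: bytes) -> bytes:
    # AES-GCM provides confidentiality plus integrity for each stream chunk
    cipher = AES.new(PRE_SHARED_KEY, AES.MODE_GCM)
    ciphertext, tag = cipher.encrypt_and_digest(chunk)
    return cipher.nonce + tag + ciphertext

def decrypt_chunk(blob: bytes) -> bytes:
    # The client recovers the chunk only if it holds the same pre-shared key
    nonce, tag, ciphertext = blob[:16], blob[16:32], blob[32:]
    cipher = AES.new(PRE_SHARED_KEY, AES.MODE_GCM, nonce=nonce)
    return cipher.decrypt_and_verify(ciphertext, tag)
```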


2.2 Pothole Detection Using CNN

This CNN model uses a dataset published on Kaggle by Larxel. The dataset consists of a total of 1330 images which can be divided into train, test, and validate sets while training the model. These images are categorized into normal and pothole folders based on the presence or absence of potholes in the image. So, the normal images inside the train, test, and validate sets can be considered the "normal roads" class, and the pothole images of the same sets can be considered the "pothole roads" class. The dataset only contains 64 × 64 sized images to maintain uniformity and to boost accuracy, as referred to in Figs. 1 and 2. The model has been created using the Keras library in Python, which is also used to ingest the dataset into our model. Multiple convolutional 2D layers have been used in the model with ten filters and a Sigmoid activation function. The strides are varied as (1, 1), (3, 3), or (5, 5). The output layer uses the SoftMax function. The loss function used is "categorical_crossentropy," the optimizer used is "Adam," and the metric used is "Accuracy." The model has been run for 50 epochs, and we got a validation accuracy of 93.64%. The created model takes a maximum of 25 to 30 min to train on the Colab Web site when the settings are changed to make it run on a GPU; a higher performance GPU would reduce the training time to less than 10 min. While the accuracy of the model is 93.64%, the overall loss turned out to be 3.1445770, which might also be reduced in the future with the help of transfer learning or even better techniques. One thing we observed is that there are very few misclassifications, as the images used are of uniform size and no transfer learning is used.
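A minimal Keras sketch consistent with this description is shown below; the exact number of convolutional layers and the 3 × 3 kernel size are assumptions, since the text only specifies ten filters, sigmoid activations, the varied strides, and the softmax output.

```python
from tensorflow.keras import layers, models

# Rough sketch of the described network: stacked Conv2D layers with ten
# filters, sigmoid activations and varying strides, ending in a softmax
# over the two classes (normal road / pothole road).
model = models.Sequential([
    layers.Input(shape=(64, 64, 3)),
    layers.Conv2D(10, (3, 3), strides=(1, 1), activation="sigmoid"),
    layers.Conv2D(10, (3, 3), strides=(3, 3), activation="sigmoid"),
    layers.Conv2D(10, (3, 3), strides=(5, 5), activation="sigmoid"),
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=50)
```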

Fig. 1 Snippet of the dataset containing normal roads without potholes


Fig. 2 Snippet of the dataset containing roads with potholes

Fig. 3 Snippet of the dataset, set of images created from a video and in addition to a few normal road images in between them to test

Fig. 4 Image of the results from the CNN model, when the dataset shown in Fig. 3 is sent through the model

2.3 Pothole Detection Using Inception-V3

Inception-V3 is a variant of convolution neural network which is 48 inception layers deep. Each inception layer consists of 1*1, 3*3 and 5*5 convolution layers with all

their outputs put together in the form of a single output vector forming the input for the upcoming stage. A set of 48 inception layers form the architecture of Inception-V3. The dataset used is the same as that which is used during training the CNN model. To expand the training information, common distortions introduced to the data are as follows: pivoting the picture by some angle, shifting the widths and lengths of the picture, increasing or decreasing the brightness of the picture, zooming into the picture, laterally or vertically flipping the picture. These are used to increase the data so that accuracy of the model can be increased. Transfer learning is used to train the model. The weights that were learnt while training the architecture on Google’s ImageNet dataset are used to initialize the architecture. The whole architecture gets initialized with the weights of the ImageNet, the only exceptions being input and output layers. The input format is 224*224*3 (width, height, RGB), and the output format is a vector of two rows. If the first row is 1, then the pothole is present, and if the second row of the vector is 1, then there is no pothole. After observation, the model has 4098 trainable parameters, and all other parameters are non-trainable (Figs. 5 and 6). We have initialized the model with the weights obtained with the ImageNet dataset, and all the layers in between the input and output layer were frozen. The knowledge obtained in classifying the ImageNet dataset was used to classify road images. We retrained the last layers which have a higher significance in determining the classes of the object with a very slow learning rate so as to achieve greater accuracy. For our model, the learning rate picked up was 0.1*10−3 , and the last 100, out of the 311 layers of Inception-V3 and fully connected layers were retrained so as to achieve greater accuracy. Before fine-tuning, the training accuracy of the model was 94.5%, and test accuracy was 92.5%. After fine-tuning the model, the training accuracy became 95.9% that is more than a percent increase in accuracy, and test accuracy became 93.9%. This shows that fine-tuning the last few layers of the model and updating the higher-level parameters can increase the efficiency of the model.
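A rough transfer-learning sketch of this two-phase procedure is given below; the global-average-pooling head and the Adam optimizer are assumptions not stated by the authors, while the ImageNet initialization, the frozen base, the fine-tuning of roughly the last 100 layers, and the 0.1 × 10⁻³ learning rate follow the text.

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import InceptionV3

# Initialize with ImageNet weights and replace the input/output layers,
# as described above (224 x 224 x 3 input, two-class output).
base = InceptionV3(weights="imagenet", include_top=False,
                   input_shape=(224, 224, 3))

# Phase 1: freeze every Inception layer and train only the new head
base.trainable = False
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=...)

# Phase 2: fine-tune roughly the last 100 base layers with a very small
# learning rate (0.1e-3), as reported by the authors.
base.trainable = True
for layer in base.layers[:-100]:
    layer.trainable = False
model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
```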


Fig. 5 Training and validation accuracies against the number of epochs

Fig. 6 Training and validation loss against the number of epochs

2.4 Pothole Detection Using YOLO

A pothole is a structural failure in the road surface which causes accidents. In India, due to an increase in transportation in the form of autos, bikes, and cars, the number of accidents caused by potholes has further increased. Therefore, to reduce the loss of human life due to potholes, a number of strategies have been devised to recognize potholes using sensors. Our model is tested on diverse pothole pictures and recognizes them with reasonable accuracy. This project uses AlexeyAB's fork of Darknet YOLO, which runs on Windows and Linux. The dataset mainly consists of images from the dash camera of a car and is used for training the algorithm with Darknet. It is robust to images at different angles and potholes of different sizes. Each annotation text file must have the same name as, and be placed in the same directory as, its image file.
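For reference, Darknet expects one annotation line per object, with the class index followed by the box centre and size normalized to the image dimensions; the helper below, with its illustrative pixel coordinates, is only a sketch of that format.

```python
def to_yolo(box, img_w, img_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) into the normalized
    "<class> <x_center> <y_center> <width> <height>" line Darknet expects."""
    x_min, y_min, x_max, y_max = box
    xc = (x_min + x_max) / 2.0 / img_w
    yc = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"0 {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"   # class 0 = pothole

# e.g. a 1920x1080 dash-cam frame with one annotated pothole (illustrative values)
print(to_yolo((800, 600, 960, 660), 1920, 1080))
```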


Fig. 7 Snippet of the video showing the pothole detected

We used this command to run the Darknet to train the YOLO model. ./darknet detector train cfg/obj.data cfg/yolo_pothole_train.cfg darknet19_448.conv.23

where • obj.data and yolo_pothole_train.cfg are put into the cfg (configuration) folder in the darknet folder. • Darknet19_448.conv.23 is the file containing pretrained weights for the convolutional layers. • The third command line argument is for the type of operation testing the model. We used this command to test the trained model on the test image and find out the presence of a pothole in the image provided: ./darknet detector test cfg/obj.data cfg/yolo_pothole_test.cfg [trained-weight-file] [image-file]

We used this command to test the trained model on the test video and find out the presence of a potholes in the video provided (Figs. 7, 8, and 9): ./darknet detector demo cfg/obj.data cfg/yolo_pothole_test.cfg [trained-weight-file] [video-file] -out_filename [videooutput].avi.

3 Results In the entire course of our development, we have used a single dataset from Kaggle for different algorithms so that the results obtained can be compared (Table 1).


Fig. 8 Picture showing the pothole detected in the image sent to test the model

Fig. 9 Another image showing the potholes detected on the image of the road using YOLO

Table 1 Comparison of accuracies
Model                             Train accuracy   Train loss   Validation accuracy   Validation loss
CNN                               0.9791           0.0121       0.9364                0.0314
Inception-V3 Before fine-tuning   0.9530           0.1553       0.9140                0.1957
Inception-V3 After fine-tuning    0.9468           0.1627       0.9197                0.1729

All the results mentioned were taken at their peak performance*


The table above sums up the results obtained from both the models including results of Inception-V3 before and after fine-tuning.

4 Future Scope In the future, we can extend our project by implementing image-guided navigation without using GPS guidance and also manage the flight to avoid obstacles with the help of sensors. For location identification, we can tag geographic data with images, and for stable flights, we can also analyze flight log data and feed it into a learning algorithm. The video streaming can be done efficiently using QUIC-based streaming to deliver data remotely to multiple devices. We found some anomalies in the results produced by our algorithms. They could not detect or predict a pothole if the pothole is filled with water or if the picture contains a splash of water that shadows the pothole. We can create better algorithms in the future to detect potholes irrespective of the above problems. Acknowledgements We would like to express our heartfelt gratitude to Dr. Kovvur Ram Mohan Rao, Head of Department of Information Technology for extending us moral support and giving access to resources like the Deep Learning Server. Our special thanks to Dr. Raghavendra Kune, Scientist & Section Head—HPCDS, ADRIN for guiding us and giving us valuable insights throughout the project.

References 1. Taehyeong Kim S-KR (2014) A guideline for pothole classification. Int J Eng Technol 4 2. Nienaber, et al (2015) Detecting potholes using simple image processing techniques and realworld footage. https://doi.org/10.13140/RG.2.1.3121.8408 3. Dhiman A, Klette R (2020) Pothole detection using computer vision and learning. IEEE Trans Intell Transp Syst 21(8):3536–3550. https://doi.org/10.1109/TITS.2019.2931297 4. Redmon J, Farhadi A (2017) YOLO9000: better, faster, stronger. 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 6517–6525 5. Yik YK, et al (2021) A real-time pothole detection based on deep learning approach 6. Srinidhi G, et al (2020) Pothole detection using CNN and AlexNet (June 10, 2020). 2nd international conference on communication and information processing (ICCIP) 7. Arjapure S, et al (March 2020) Road pothole detection using deep learning classifiers. Int J Recent Technol Eng (IJRTE) 8(6). https://doi.org/10.35940/ijrte.F7349.038620. ISSN: 22773878, Retrieval Number: F7349038620/2020©BEIESP 8. Eriksson J, Girod L, Hull B, Newton R, Madden S, Balakrishnan H (2008) The pothole patrol: using a mobile sensor network for road surface monitoring, Proceeding of the 6th international conference on Mobile systems, applications, and services. pp 29–39 9. Tong V, Tran HA, Souihi S, Mellouk A (2018) Empirical study for dynamic adaptive video streaming service based on google transport QUIC protocol. 2018 IEEE 43rd conference on local computer networks (LCN), pp 343–350. https://doi.org/10.1109/LCN.2018.8638062 10. Bhavya P, Sharmila C, Sadhvi Y, Prasanna C, Ganesan V (2021). Pothole detection using deep learning. https://doi.org/10.1007/978-981-16-1773-7_19


11. Chen H, Yao M, Gu Q (2020) Pothole detection using location-aware convolutional neural networks. Int J Mach Learn Cybern 11 https://doi.org/10.1007/s13042-020-01078-7 12. Al-Shaghouri A, Alkhatib R, Berjaoui S (2021) Real-time pothole detection using deep learning 13. Radovic M, Adarkwa O, Wang Q (2017) Object recognition in aerial images using convolutional neural networks. J Imaging 3:21. https://doi.org/10.3390/jimaging3020021 14. Sadgrove EJ, Falzon G, Miron D, Lamb DW (2018) Real-time object detection in agricultural/remote environments using the multiple-expert colour feature extreme learning machine (MEC-ELM). Comput Ind 98:183–191. https://doi.org/10.1016/j.compind.2018.03.014 15. Hanu SA (2019) Feature matching and streaming using unmanned aerial vehicles. Department of information technology, Vasavi College of Engineering

An Empirical Study on Discovering Software Bugs Using Machine Learning Techniques G. Ramesh, K. Shyam Sunder Reddy, Gandikota Ramu, Y. C. A. Padmanabha Reddy, and J. Somasekar

Abstract Bug is a defect in software which needs to be identified early so as to avoid unnecessary burden caused by it later. Bug discovery from software modules has been around. However, of late, machine learning (ML) became a useful and appropriate solution to many real-world problems. In this context, usage of machine learning has become an important step forward in improving state of the art in bug detection. It is an artificial intelligence-based (AI) approach that makes it more effective due to the bulk of software modules. Many existing methods strived to incorporate ML for bug discovery. However, there is need for improvement with appropriate methodology. In this paper, we proposed a methodology that exploits two ML techniques known as decision tree (DT) and random forest (RF) for efficient means of discovering bugs from software modules. An empirical study is made using Python data science platform. Experimental results showed that RF performs better than DT in terms of accuracy of bug prediction. Keywords Machine learning · Software bug discovery decision tree · Random forest

G. Ramesh (B) Department of CSE, GRIET, Bachupally, Hyderabad, Telangana, India e-mail: [email protected] K. S. S. Reddy Department of Information Technology, Vasavi College of Engineering, Hyderabad, India G. Ramu Department of Computer Science and Engineering, Institute of Aeronautical Engineering, Dundigal, Hyderabad 500 043, India Y. C. A. P. Reddy Department of CSE, B V Raju Institute of Technology, Narsapur, Telangana, India J. Somasekar Department of CSE, Gopalan College of Engineering and Management, Bangalore, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_14


1 Introduction

Early detection of software bugs is to be given paramount importance in real-world application development. It could lead to better performance in terms of saving time, effort, and money. If any bug is not discovered early, it is carried forward to the next phase in the life cycle of the software, and it becomes difficult to fix the problem later, as it is expensive and needs more time and effort. Of late, the machine learning domain has paved the way to solve many complex real-world problems. It is widely used in different areas and applications, and the software development industry is also exploring the benefits of machine learning. Discovery of bugs from software and related documents is an important part of software engineering. From the literature, it is known that different methods are used for bug detection in software-related data sets. Most of the solutions are based on active machine learning techniques. In this paper, we propose a methodology that exploits decision tree (DT) and random forest (RF) to build bug prediction models, and we evaluate the models to determine the better performing one. Our contributions in this paper are as follows:

1. A methodology is proposed to build DT and RF models to detect bugs in software.
2. DT and RF are used with a specific approach to detect bugs.
3. An empirical study is made using the Python data science platform, and the evaluation showed that RF performs better than DT.

The remainder of the paper is structured as follows. Section 2 reviews literature. Section 3 presents the proposed system. Section 4 presents experimental results, and Sect. 5 concludes the work.

2 Related Work This section reviews literature on bug discovery using machine learning approaches. Ferreira et al. [1] employed machine learning to detect faults in wireless mesh networks associated with solar power distribution system. Tan et al. [2] used C5.0 and random forest (RF) for prediction of network faults. Duenas et al. [3] focussed on network failure prediction online by using event stream processing. Tran et al. [4] used RF to discover software bugs in the bug reports. Tran et al. [5] on the other hand explored different data analytics techniques used for detection of faults. Armbrust et al. [6] investigated on the dynamics of faults in the context of cloud computing. Hammouri et al. [7] explored machine learning methods to find bugs associated with software development. Zhang et al. [8] defined an approach known as KSAP to have bug report assignment in an efficient manner. Towards this end, they employed KNN search-based methodology. Sabor et al. [9] investigated on the automatic prediction of bugs and their severity levels with data pertaining to stack traces. Ramesh et al. [10] define an approach for technique for identifying the code smells. Pooja et al. [11] explored different techniques for analysing the software applications.


Gupta et al. [12] defined a novel XGBoost-based model to predict software bugs with supervised learning approach. Riza et al. [13] proposed an algorithm known as Knuth–Morris–Pratt to identify genomic repetitions. Sheneamer [14] focussed on finding code clones that are a kind of bugs in software development process. They used multiple similarity-based features in order to achieve this. From the literature, it is known that different methods are used for bug detection in software-related data sets. Most of the solutions are based on active machine learning techniques. In this paper, we proposed a methodology that exploits decision tree (DT) and random forest (RF) to have bug prediction models. We evaluated the models to know the better performing model.

3 Methodology

The proposed methodology includes the process of both DT and RF in prediction of bugs. DT uses the training data set in order to grow a tree based on features. It finds the best split for every feature available and then finds the node that leads to the best split. The node is split using the identified best split, subject to the stopping rules. After an iterative process over all nodes, a single decision tree is formed. The splitting rule is based on Eq. (1) and Eq. (2):

H(S) = − Σ_{x∈X} P(x) log P(x)    (1)

IG(A, S) = H(S) − Σ_{t∈T} P(t) H(t)    (2)




The attribute that has highest information gain is considered to perform splitting. In case of RF, multiple decision trees are generated, and ensemble model is used to arrive at final predictions. The given data set is divided into number of subsets. A training data set is used to construct a tree. Then, the DT-based approach is followed for each tree. The tree is used to evaluate testing data set. After the iterative approach, a RF of tree is generated.

4 Experimental Results Experimental results are presented in terms of cross-validation score, accuracy and consumption time. As presented in Table 1, the cross-validation score is provided in presence of preprocessing and absence of it. Pre-processing is used to improve quality of training by filling missing values. As shown in Fig. 1, the cross-validation score has its influence on both the number of bug reports and the presence of pre-processing.



Table 1 Shows the cross-validation score Cross-validation score 5

10

15

20

25

30

With pre-processing

0.61

0.6

0.6

0.6

0.65

0.63

Without pre-processing

0.48

0.49

0.5

0.5

0.48

0.5

Fig. 1 Shows cross validation score against number of bug reports

As presented in Table 2, different number of bug reports (×10,000) have influence on the prediction models in terms of accuracy. As shown in Fig. 2, the accuracy score is influenced by the number of bug reports. Random forest showed higher accuracy score. As presented in Table 3, different number of bug reports (×10,000) have influence on the prediction models in terms of priority accuracy. As shown in Fig. 3, the priority accuracy score is influenced by the number of bug reports. Random forest showed higher accuracy score. As presented in Table 4, different number of bug reports (×10,000) have influence on the prediction models in terms of consumption time. Table 2 shows accuracy of the models Accuracy score 5

10

15

20

25

30

Random forest

0.82

0.8

0.76

0.75

0.78

0.75

Decision tree

0.68

0.68

0.65

0.65

0.68

0.62



Fig. 2 shows accuracy score against number of bug reports Table 3 Shows priority accuracy of the models Priority accuracy score 5

10

15

20

25

30

Decision tree

0.69

0.69

0.69

0.69

0.65

0.62

Random forest

0.72

0.72

0.73

0.75

0.75

0.74

Fig. 3 Shows priority accuracy score against number of bug reports



Table 4 Shows consumption time of the models Consumption time (S) 5

10

15

20

25

30

Random forest

1

2

3

4

5

7

Decision tree

5

7

12

20

25

35

Fig. 4 shows consumption time against number of bug reports

As shown in Fig. 4, the consumption time is influenced by the number of bug reports. Random forest showed higher consumption time.

5 Conclusion and Future Work In this paper, we proposed a methodology that exploits two ML techniques known as decision tree (DT) and random forest (RF) for efficient means of discovering bugs from software modules. The methodology describes how the two techniques are able to find bugs. DT is the approach which is based on forming a decision tree to have predictions. RF on the other hand uses multiple DTs and makes an ensemble of them in order to have better prediction of bugs. An empirical study is made using Python data science platform. Experimental results showed that RF performs better than DT in terms of accuracy of bug prediction. The methodology explored in this paper has several limitations. First, it has limitations in terms of number of ML methods. Second, it needs further improvement in terms of processing of data prior to using algorithms. In future, we overcome these limitations with further improvements in methodology.



References 1. Ferreira VC, Carrano RC, Silva JO, Albuquerque CVN, Muchaluat-Saade DC, Passos DG (2017) Fault detection and diagnosis for solar-powered wireless mesh networks using machine learning. In: Proceedings of IFIP/IEEE symposium on integrated network and service management (IM’17), pp 456–62 2. Tan JS, Ho CK, Lim AH, Ramly MR (2018) Predicting network faults using Random forest and C5.0. Int J Eng Technol 7(2.14):93–6 3. Duenas JC, Navarro JM, Parada HA, Andion J, Cuadrado F (2018) Applying event stream processing to network online failure prediction. Commun Mag 56(1):166–170 4. Tran HM, Nguyen SV, Ha SVU, Le TQ (2018) An analysis of software bug reports using Random forest. In: Proceedings of 5th international conference on future data and security engineering (FDSE’18). Springer, pp 1–13 5. Tran HM, Nguyen SV, Le ST, Vu QT (2017) Applying data analytic techniques for fault detection. Trans Large Scale Data Knowl Cent Syst (TLDKS) 30–46 6. Armbrust M, Fox A, Griffith R, Joseph AD, Katz R, Konwinski A, Lee G, Patterson D, Rabkin A, Stoica I, Zaharia M (2010) A view of cloud computing. ACM Commun 53(4):50–58 7. Hammouri A, Hammad M, Alnabhan M, Alsarayrah F (2018) Software bug prediction using machine learning approach. Int J Adv Comput Sci Appl 9. https://doi.org/10.14569/IJACSA. 2018.090212 8. Zhang W, Wang S, Wang Q (2015) KSAP: an approach to bug report assignment using KNN search and heterogeneous proximity. J Inf Softw Technol 70:68–84 9. Sabor KK, Hamdaqa M, Hamou-Lhadj A (2019) Automatic prediction of the severity of bugs using stack traces and categorical features. Elsevier J Inf Softw Technol 10. Ramesh G, Mallikarjuna Rao C (2018) Code-smells identification by using PSO approach. Int J Recent Technol Eng (IJRTE) 7(4). ISSN: 2277-3878 11. Pooja ASSVL, Sridhar M, Ramesh G (2021) Application and analysis of phishing website detection in machine learning and neural networks. In: Luhach AK, Jat DS, Bin Ghazali KH, Gao XZ, Lingras P (eds) Advanced informatics for computing research. ICAICR 2020. Communications in computer and information science, vol 1394. Springer, Singapore 12. Gupta A, Sharma S, Goyal S, Rashid M (2020) Novel XGBoost tuned machine learning model for software bug prediction. 2020 international conference on intelligent engineering and management (ICIEM), pp 376–380 13. Riza LS, Rachmat AB, Munir TH, Nazir S (2019) Genomic repeat detection using the KnuthMorris-Pratt algorithm on R high-performance-computing package. Int J Adv Soft Comput Appl 11(1):94–111 14. Sheneamer AM (2021) Multiple similarity-based features blending for detecting code clones using consensus-driven classification. Expert Syst Appl 183

Action Segmentation for RGB Video Frames Using Skeleton 3D Data of NTURGB+D
Rosepreet Kaur Bhogal and V. Devendran

Abstract Action segmentation, or video segmentation, is used to extract actions from video frames. It plays a role in various applications, e.g., visual effects assistance in movies, detailed scene understanding, virtual background creation, and the design of CAD systems that automatically identify human actions in videos without interference from other objects. This paper presents a system that automatically segments actions from videos. The window size is variable and depends on the input video. The dataset used for the experiments is NTURGB+D, and action segmentation is demonstrated on its RGB videos using the accompanying 3D skeleton information. The experimental results show the performance of the approach and include test results on five randomly selected action videos. Keywords Segmentation · Actions · 3D skeleton data · RGB · Videos · NTURGB+D

1 Introduction Identification of motion from video frames consists of various steps: preprocessing, segmentation, feature calculation, feature dimension reduction, and classification. The probability of a good recognition rate is high if features are computed on a properly segmented image or video frame, whether classical techniques or deep learning approaches, which are considered cutting-edge technology nowadays, are used. Designing a deep learning model involves two main concerns: what input is used to train or test the network, and how its hyperparameters are tuned. If the input is not prepared appropriately, the system may fail to recognize some of the actions the model is expected to handle. Therefore, action segmentation can be a very important task for any designed model.
R. K. Bhogal, School of Electronics and Electrical Engineering, Lovely Professional University, Phagwara, India (e-mail: [email protected]); V. Devendran, School of Computer Science and Engineering, Lovely Professional University, Phagwara, India


Various other methods have been used for action segmentation. One line of work segments actions using trajectories, first separating motion actions from the background before classification [1]. An unsupervised action segmentation system that does not require action labels is given by [2]. A system that segments video actions using bag-of-visual-words features is given by [3]. Ref. [4] gives an approach that defines features useful for both action segmentation and neighbor segmentation. Automatic segmentation of actions using 3D joint information can be developed, but a method is needed to supply data-hungry techniques [5]. Action localization is also used for action recognition and can be useful when background interference is not considered [6]. Temporal action segmentation is gaining popularity, with approaches trained on timestamp annotations [7]. Researchers have proposed different structures to extract motion actions from video frames: the Global2Local structure searches for motion-based actions for video action segmentation [8], and dynamic manifold warping computes motion similarity on segmented videos [9]. In this paper, segmentation is performed using 3D skeleton information. Other modalities, such as depth maps, can also be used for action segmentation; depth maps have been shown to improve action recognition rates [10]. Long video sequences can be segmented using the energy of motion history images [11]. Action recognition systems can be improved by using segmented action videos; a system that segments automatically based on pose streams in the form of joints is given by [12]. Contour-based segmentation of still images is also a possible choice for action segmentation because the appropriate features can be calculated [13]. The approaches discussed above can all be used for action segmentation. When raw videos are given to a system designed with deep learning or machine learning, the system also considers background details instead of focusing on the actions. A segmented video frame can instead be given to a system whose main focus is to train on the action rather than the whole frame. This paper presents an approach based on the NTURGB+D dataset that can be used to segment actions; with this segmentation approach, a more efficient system can be designed. The paper proceeds as follows. Section 2 details the research methodology, Sect. 3 describes the dataset used, and Sect. 4 explains how the output is achieved. Finally, concluding remarks are given in Sect. 5.

2 Research Methodology This section describes the methodology used for action segmentation; the algorithm can be implemented in software such as Python or MATLAB. The task is divided into the parts shown in Fig. 1. The strategy is designed with deep learning approaches in mind, which is why the first step is to create datastores that can later be used for training, validating, and testing a deep learning model. Only one dataset is used, NTURGB+D, which has 3D information as well as RGB videos.


Fig. 1 Flowchart for action segmentation

Second, the 3D data are extracted and a bounding box of dimension [1 × 4] is created, as shown in Fig. 1; the bounding-box dimensions are based on the characteristics of the 3D data in the first frame. Simultaneously, each video frame is converted to grayscale and then cropped according to the bounding-box dimensions. The segmented frames are assembled into a video and stored per class. These data are then ready to be used with deep learning approaches. A minimal sketch of this pipeline is given below.
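The following sketch is a minimal illustration, not the authors' implementation (which may have been written in MATLAB); it shows the grayscale-conversion and cropping steps with OpenCV. The helper read_skeleton_xy() is assumed to return per-frame (colorX, colorY) joint coordinates; a matching parsing sketch is given in Sect. 4.2.

```python
# Minimal sketch of the segmentation pipeline, assuming OpenCV and NumPy;
# read_skeleton_xy() is a hypothetical helper returning per-frame
# (colorX, colorY) joint coordinates parsed from a .skeleton file.
import cv2
import numpy as np

def segment_action_video(video_path, skeleton_xy, out_path):
    cap = cv2.VideoCapture(video_path)
    # Bounding box from the joint coordinates of the first frame: [x, y, w, h]
    xs, ys = skeleton_xy[0][:, 0], skeleton_xy[0][:, 1]
    x, y = int(xs.min()), int(ys.min())
    w, h = int(xs.max() - xs.min()), int(ys.max() - ys.min())

    # 30 fps is an assumption; the last argument (False) writes grayscale frames
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"MJPG"),
                             30, (w, h), False)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # grayscale conversion
        crop = gray[y:y + h, x:x + w]                   # crop to bounding box
        writer.write(crop)
    cap.release()
    writer.release()
```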

3 Dataset for Action Segmentation Advances in depth sensors for cameras allow various modalities to be recorded. The four modalities are RGB videos, depth maps, IR sequences, and 3D structural information. Every modality has its advantages and can be used for applications such as action recognition. Out of the four modalities, two are used here for action segmentation: RGB and 3D structure. A dataset suitable for human activity recognition is already available: the NTURGB+D dataset, designed for data-hungry techniques such as deep learning approaches. The RGB videos are recorded at a resolution of 1920 × 1080. The 3D skeleton data cover 25 major body joints, which include the various parameters given in Fig. 2. This large-scale dataset contains 56,880 video samples collected from 40 subjects; for each of these videos, it also provides 3D joint information, depth maps, and IR sequences [14].


Fig. 2 Steps for extracting joints color X and joint color Y information

4 Experimental Results A 3D skeleton is an excellent source of information about body pose over time, and the 3D skeleton data have been used here to segment the RGB videos. A 3D skeleton typically provides a set of 3D coordinates for a number of body joints. Segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as image objects). The goal of segmentation is to simplify and/or change the representation of an image or frame into something more meaningful and easier to analyze. The steps involved in obtaining the results are as follows.

4.1 Create Datastore for Skeleton 3D Data and RGB Videos The first step is to create a datastore for the skeleton 3D data and RGB videos. This step matters because the segmented videos can then be saved per folder, class-wise, which is important if they are later used by a deep learning model to recognize activities.

4.2 Extraction of Color X and Color Y from Skeleton 3D Data The NTURGB+D dataset provides skeleton 3D data for each video in files with the .skeleton extension. The data contain various information related to human body actions, such as 25 skeletal joints per person, thumb tracking, end-of-hand tracking, and open and closed hand gestures. For action recognition, the 25 joint coordinates of each video frame are needed, so the skeleton file of each video is read to extract this joint information on the chosen platform. For each frame's joints, the color X and color Y coordinates are extracted for action segmentation; the steps to obtain this output are given in Fig. 2. A hedged parsing sketch is shown below.
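The following parsing sketch is not the authors' code; it assumes the commonly documented NTU RGB+D .skeleton layout, in which each joint line holds x, y, z, depthX, depthY, colorX, colorY, four orientation values, and a tracking state. That layout is an assumption to verify against the dataset documentation.

```python
# Hedged sketch of reading colorX/colorY from an NTU RGB+D .skeleton file,
# assuming colorX and colorY are the sixth and seventh values on each joint line.
import numpy as np

def read_skeleton_xy(path):
    frames = []
    with open(path) as f:
        n_frames = int(f.readline())
        for _ in range(n_frames):
            n_bodies = int(f.readline())
            joints_xy = []
            for _ in range(n_bodies):
                f.readline()                    # body info line (ID, tracking flags)
                n_joints = int(f.readline())    # usually 25
                for _ in range(n_joints):
                    vals = f.readline().split()
                    joints_xy.append([float(vals[5]), float(vals[6])])
            frames.append(np.array(joints_xy))
    return frames  # list of (num_joints_total, 2) arrays, one per frame
```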


Fig. 3 Steps for action segmentation per frame


4.3 Dimension for Bounding Box for Action Segmentation The extracted color X and color Y values of each frame are the input for creating the bounding box of each video. The minimum and maximum values of color X and color Y are computed, and from Xmin, Xmax, Ymin, and Ymax the bounding-box dimensions are decided. Each frame of the video is cropped with the bounding-box dimensions given in Eq. 1. The steps for finding the bounding-box dimensions and cropping out the action-based parts are given in Fig. 3. Bbox = [Xmin, Ymin, (Xmax − Xmin), (Ymax − Ymin)]

(1)

4.4 RGB Videos and Segmented Action Videos The NTURGB+D dataset consists of 60 action labels. The approach has been tested on all 56,880 videos. Classes A001 to A049 are daily actions and medical conditions performed by one subject, while classes A050 to A060 are mutual activities involving two subjects. Five random videos have been taken from the daily actions and mutual activities.


Fig. 4 Experimental result on class “type on a keyboard” [input video (left) and action segmented video (right)]

Fig. 5 Experimental result on class “salute” [input video (left) and action segmented video (right)]

The classes are labeled "type on a keyboard," "salute," "pushing," "clapping," and "drinking" (Figs. 4, 5, 6, 7, and 8).

5 Conclusion This paper has presented an approach to segment the RGB videos of all classes using skeleton 3D information. The dataset used is NTURGB+D, which provides four modalities for 60 actions: RGB, skeleton 3D, depth maps, and IR sequences. This dataset is one of the largest available for approaches where a greater number of classes or views are required.


Fig. 6 Experimental result on class “clapping” [input video (left) and action segmented video (right)]

Fig. 7 Experimental result on class “pushing” [input video (left) and action segmented video (right)]

The window size for segmentation is chosen adaptively for each RGB video with the help of skeleton 3D information. The technique works well on all videos of NTURGB+D, and a few random sample video results are given in the experimental results section. On visual inspection of the segmented output videos, the performance is good for all videos except mutual actions; for the mutual-action category, the window specifically requires modification. In future work, this technique will be improved for videos containing more than one person.


Fig. 8 Experimental result on class “drinking” [input video (left) and action segmented video (right)]

Acknowledgements I thank all members of the Rapid-Rich Object Search (ROSE) Lab at Nanyang Technological University, Singapore, and Peking University, China. Their vision is to create the largest collection of datasets, and they provided me with the opportunity to use an action recognition dataset.

References 1. Guo J, Li Z, Cheong LF, Zhou SZ (2013) Video co-segmentation for meaningful action extraction. Proceedings IEEE international conference computer vision, pp 2232–2239 2. Jain H, Harit G (2018) Unsupervised temporal segmentation of human action using community detection. Proceedings international conference image processing ICIP, pp 1892–1896 3. Chen G-J, Chang I-C, Yeh H-Y (2017) Action segmentation based on bag-of-visual-words models. In: 2017 10th international conference on Ubi-media computing and workshops (UbiMedia), pp 1–5 4. Shi Q, Wang L, Cheng L, Smola A (2011) Discriminative human action segmentation and recognition using {SMM}s. Int J Comput Vis 93(1):22–32 5. Lv F, Nevatia R (2006) Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost. Lectures notes computer science (Including Subser. lectures notes artificial intelligence lectures notes bioinformatics), vol 3954. LNCS, pp 359–372 6. Luo W, et al (2021) Action unit memory network for weakly supervised temporal action localization. In: 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 9964–9974 7. Li Z, Farha YA, Gall J (2021) Temporal action segmentation from timestamp supervision. In: 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 8361– 8370 8. Gao S-H, Han Q, Li Z-Y, Peng P, Wang L, Cheng M-M (2021) Global2Local: efficient structure search for video action segmentation. In: 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 16800–16809


9. Gong D, Medioni G, Zhao X (2014) Structured time series analysis for human action segmentation and recognition. IEEE Trans Pattern Anal Mach Intell 36(7):1414–1427 10. Park S, Park U, Kim D (2018) Depth image-based object segmentation scheme for improving human action recognition. International conference electronic information communication ICEIC 2018, vol 2018 Jan, pp 1–3 11. Murtaza F, Yousaf MH, Velastin SA (2018) PMHI: proposals from motion history images for temporal segmentation of long uncut videos. IEEE Signal Process Lett 25(2):179–183 12. Han Y, Chung SL, Su SF (2017) Automatic action segmentation and continuous recognition for basic indoor actions based on Kinect pose streams. 2017 IEEE International conference system man, cybernetics. SMC 2017, vol 2017 Jan, pp 966–971 ˇ Ikizler-Cinbis N (2013) Bölüt ve kontur özniteliklerini kullanarak imgel13. Tan¸isik G, Güçlü O, erdeki insan hareketlerini tanima. 2013 21st signal processing communication application conference SIU 2013 14. Shahroudy A, Liu J, Ng T-T, Wang G (2016) NTU RGB+D: a large scale dataset for 3D human activity analysis. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 1010–1019

Prediction of Rainfall Using Different Machine Learning Regression Models
B. Leelavathy, Ram Mohan Rao Kovvur, Sai Rohit Sheela, M. Dheeraj, and V. Vivek

Abstract Rainfall is one of the major climatic aspects of India, with both pros and cons. Rainfall prediction is a regression problem, and if the amount of rainfall is predicted in advance, its impact can be assessed and the risk of loss reduced. One of the major problems in predicting rainfall using machine learning today is accuracy. To increase accuracy, data preprocessing techniques can first be applied to select the best attributes, after which well-known ML models are trained and the most accurate one is chosen. We propose a process for selecting a model to predict rainfall in India using machine learning techniques. We perform visual examination of distinct data variables to uncover patterns using data exploration techniques; scatter plots, histograms, bar charts, box plots, and heatmaps are used to analyze rainfall trends. Multiple linear regression, support vector regression, random forest regression, lasso regression, and ridge regression are the five regression models used to model the data, and the selected model predicts rainfall with the highest accuracy. Different machine learning regression models for prediction were developed, and after a comparative analysis based on performance evaluation metrics, the best model for predicting rainfall for a subdivision was implemented. The rainfall trends in India are also studied using exploratory data analysis and data visualization. Keywords Rainfall · Prediction · Machine learning · Accuracy · Regression model · Data visualization

B. Leelavathy (e-mail: [email protected]), R. M. R. Kovvur (e-mail: [email protected]), S. R. Sheela, M. Dheeraj, and V. Vivek: Department of Information Technology, Vasavi College of Engineering, Hyderabad, India


Fig. 1 Categorization of algorithms in machine learning

1 Introduction Rainfall prediction and forecasting in India are always a challenge owing to its uncertainties, and serious prediction models are required to ensure accurate predictions. Machine learning is a technique that allows computers to learn and accomplish tasks without being explicitly programmed [1]. Usually, machine learning models perform tasks with a high level of accuracy. These algorithms are subdivided into unsupervised and supervised learning: supervised learning encompasses all classification and regression methods, while unsupervised learning includes all clustering algorithms. Figure 1 depicts the classification of the above-mentioned algorithms. We therefore use machine learning models for rainfall prediction in India to make the most accurate predictions possible. Beyond prediction, we also use different machine learning tools to analyze the patterns of Indian rainfall and understand it in a broader scope. The dataset is collected from Kaggle and contains rainfall (in mm) for different subdivisions of the country. We use different data visualization tools and Python libraries to perform exploratory data analysis and understand the rainfall trends of the different subdivisions and of India overall. The results of the five regression procedures are then compared; the best prediction model is the one with the highest accuracy and lowest error.

2 Related Work This section discusses some of the work related to our proposed model. Kumar Abhishek et al. devised a rainfall prediction system using an NN model; the proposed model was able to forecast rainfall in Karnataka's Udupi district [2].


Feed-forward BPNN, layer-recurrent, and cascaded feed-forward BPNN networks were experimented with; 70% of the data is used for training and 30% for testing in this model. Compared to the BPNN, the recurrent network is more accurate, and a high MSE was recorded with the BPNN. Minghui Qiu et al. introduced a short-term method for predicting rainfall accurately [3]: a CNN model was used to collect a set of weather features from nearby observations and use them to predict short-term rainfall, and to achieve a promisingly better result they compared the short-term observations with public weather forecasts. Aswin et al. developed a model using deep learning architectures (LSTM and ConvNet) [4]: 468 months of weather data were used to create models that predict the monthly average rainfall at about 10,368 locations across the world, with leading RMSE of 2.55 for LSTM and 2.44 for ConvNet. Figure 2 outlines a neural network-based rainfall prediction study in an Indian context. Xianggen Gan et al. modeled a rainfall predictor using back-propagation neural networks [5]; the model was tested on a dataset with 16 meteorological characteristics spanning the years 1970 to 2000, with a learning rate of 0.01 and a target error of 0.01 during network training. The MATLAB neural network framework was used, achieving 100% network prediction accuracy and 67% regression prediction accuracy with the BPN. Sam Cramer et al. suggested a method for predicting rainfall with genetic programming [6]: GP and MCRP were compared on 21 separate datasets from cities throughout Europe, with the past ten years of data used for training and the last year for testing. GP addresses MCRP's flaw by forecasting varied climates more accurately than MCRP.

Fig. 2 NN-based rainfall prediction (Indian context)


Mohini P. et al. conducted a survey of different NNs for rainfall prediction [7]. In comparison with other forecasting methodologies, FFNN, RNN, and TDNN were used to obtain better results [8]. The main disadvantage of NNs is that they predict annual rainfall more accurately than monthly rainfall. Sandeep Kumar et al. utilized meteorological parameters to improve monthly prediction results for data from 1901 to 2002 in Bangalore, India, analyzed using the data mining approach of linear regression [9]. Pandas, sklearn, and NumPy were used to validate and obtain the computational results, and k-fold cross-validation was used to predict rainfall season-wise and month-wise; rainy-season predictions were more accurate than summer-season ones. Sankhadeep Chatterjee et al. utilized a neural network to improve model accuracy [10], using data from the Meteorological Station at Dumdum, West Bengal, for the years 1989 to 1995. K-means clustering was used to group the data; the MLP-FFN produced 89.54% accuracy, whereas the HNN classifier produced 84.26% accuracy (without feature selection). Mani Sekhar et al. also worked on different ML models that test and predict air quality [11]. Navadia et al. [12] used a Hadoop predictive analysis (HPA) strategy to predict rainfall; data were processed with Apache PIG and rain predictions were made, and in an upcoming release Apache Hadoop can be used to improve accuracy [13]. A comparison of several rainfall forecast systems in the literature is shown in Table 1.

Table 1 Comparison of a few existing rainfall prediction methods and tools used

S. No. | Name of the method | Performance metrics | Tools used
1 | (a) Feed forward with back propagation, (b) Layer recurrent, (c) Cascaded feed forward back propagation | Mean square error (MSE) | MATLAB: nntool, nftool
2 | Deep convolutional neural network | MSE, correlation, critical success index (CSI) | Not mentioned
3 | LSTM and ConvNet | Mean absolute percentage error (MAPE), root mean square error (RMSE) | Not mentioned
4 | BP network | Accuracy | MATLAB: neural network platform
5 | Genetic programming | RMSE | Not mentioned
6 | Artificial neural network | Accuracy | Meteorological parameters
7 | Linear regression | RMSE, MAE | Pandas and Scikit learn
8 | Hybrid neural network | Accuracy, precision, recall | Not mentioned
9 | Likelihood | Accuracy | Hadoop


3 Proposed Method The dataset contains the monthly and annual rainfall values (in mm). We perform exploratory data analysis using data visualization tools like histograms, boxplots, and heatmaps. By splitting the data into training and testing sets, we create regression models to predict average rainfall (in mm), apply a variety of statistical and machine learning methodologies, and analyze the results. With 70% of the data used for training and 30% for testing, whichever model gives the best accuracy is used to predict the average rainfall in a particular region.

3.1 The Proposed Model's Algorithm See Fig. 3. A hedged sketch of the modeling and evaluation steps is given after this list.
Step 1: Based on the data gathered, evaluate the best features.
Step 2: Convert the data into the proper format for modeling and perform data preprocessing to handle missing values.
Step 3: Visually examine and find correlations between distinct data variables, the structure of the dataset, the existence of outliers, and the distribution of data values to uncover patterns using data exploration techniques. To analyze rainfall trends, use scatter plots, histograms, bar charts, box plots, and heatmaps.
Step 4: Model the data with the five regression models: multiple linear regression, support vector regression, random forest regression, lasso regression, and ridge regression. Calculate prediction accuracy and error values.
Step 5: Conclude the best model to predict the rainfall.
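To make the modeling step concrete, the following hedged sketch (not the authors' code) compares the five regression models on a 70/30 split with scikit-learn; the CSV file name and column choices are placeholders, and "accuracy" is approximated here by the R² score, which may differ from the accuracy measure used in the paper.

```python
# Hedged sketch: comparing five regression models on a rainfall dataset.
# File name and column names are hypothetical; "accuracy" is taken as R^2 here.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

df = pd.read_csv("rainfall_subdivisions.csv")      # hypothetical dataset
X = df[["JAN", "FEB", "MAR", "APR", "MAY", "JUN"]]  # example feature columns
y = df["ANNUAL"]                                    # target: annual rainfall (mm)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)            # 70/30 split

models = {
    "Multiple linear regression": LinearRegression(),
    "Support vector regression": SVR(),
    "Random forest regression": RandomForestRegressor(random_state=0),
    "Lasso regression": Lasso(),
    "Ridge regression": Ridge(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: MAE={mean_absolute_error(y_test, pred):.2f}, "
          f"R2={r2_score(y_test, pred):.3f}")
```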

Fig. 3 Workflow of the proposed model


Fig. 4 Snapshot of the dataset

3.2 Dataset Description The dataset contains rainfall values (in mm) of 36 subdivisions in India: monthly, periodic, and annual rainfall values for these 36 subdivisions from 1901 to 2017. Figure 4 gives a snapshot of the dataset.

3.3 Exploratory Data Analysis We performed exploratory data analysis using different data visualization tools; some of the results are shown in the figures below. We used a stacked bar chart, created with the Matplotlib library in Python, to show the rainfall pattern of all the states in India with their monthly average rainfall and the yearly average of the training data (Fig. 5).

Fig. 5 Rainfall patterns in several subdivisions


Figure 6 depicts the average annual rainfall of India over the period 1901–2017; it is an axis plot created with the Matplotlib library in Python. Figure 7 shows a heatmap, an important plot for understanding the correlation among the different features of the dataset. A hedged sketch of these plots is given after the figures.

Fig. 6 Average annual rainfall

Fig. 7 Heatmap showing correlation among features
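To illustrate how such plots can be produced, the following sketch (not the authors' code) uses pandas, Matplotlib, and seaborn; the file name and the column names JAN–DEC, ANNUAL, YEAR, and SUBDIVISION are assumptions about the dataset's schema, not confirmed by the paper.

```python
# Hedged EDA sketch, assuming a DataFrame with one row per subdivision-year
# and columns JAN..DEC, ANNUAL, YEAR, SUBDIVISION (assumed names).
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("rainfall_subdivisions.csv")
months = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

# Stacked bar: monthly average rainfall per subdivision (Fig. 5 style)
df.groupby("SUBDIVISION")[months].mean().plot(kind="bar", stacked=True,
                                              figsize=(16, 6))
plt.ylabel("Average rainfall (mm)")

# Line plot: average annual rainfall over 1901-2017 (Fig. 6 style)
df.groupby("YEAR")["ANNUAL"].mean().plot(figsize=(12, 4))
plt.ylabel("Average annual rainfall (mm)")

# Heatmap of correlations among monthly/annual columns (Fig. 7 style)
plt.figure(figsize=(8, 6))
sns.heatmap(df[months + ["ANNUAL"]].corr(), cmap="viridis")
plt.show()
```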


Fig. 8 Modeling results

4 Results The results of the modeling are tabulated in Fig. 8. They show that multiple linear regression, ridge, and lasso give better rainfall prediction performance than the compared methods under the accuracy measure, with MLR reaching an accuracy of 98.37%. MLR is also the best of all when MAE is taken into consideration.

4.1 Results of MLR Finally, under several performance evaluation metrics like MAE and accuracy, the multiple linear regression model has outperformed the other models and can thus be considered the best model for predicting rainfall (Fig. 9).

5 Conclusion We have developed several regression models to predict rainfall for the Indian dataset. Out of these models, the multiple linear regression model offered good results compared to the other models in terms of performance evaluation metrics like mean absolute error and accuracy. This model can be used by the government to predict floods and analyze rainfall in various vulnerable regions of the country beforehand so that safety measures can be taken.


Fig. 9 Results of MLR

Early warning and disaster prevention could be aided by effective real-time flood forecasting models.

References 1. Chattopadhyay M, Chattopadhyay S (July 2015) Elucidating the role of topological pattern discovery and support vector machine in generating predictive models for Indian summer monsoon rainfall. Theor Appl Climatol 1–12. https://doi.org/10.1007/s00704-015-1544-5 2. Abhishek K, Kumar A, Ranjan R, Kumar S (2012) A rainfall prediction model using artificial neural network. 2012 IEEE control and system graduate research colloquium (ICSGRC 2012), pp 82–87 3. Qiu M, Zhao P, Zhang K, Huang J, Shi X, Wang X, Chu W (2017) A short-term rainfall prediction model using multi-task convolutional neural networks. IEEE international conference on data mining, pp 395–400. https://doi.org/10.1109/ICDM.2017.49 4. Aswin S, Geetha P, Vinayakumar R (3–5 April 2018) Deep learning models for the prediction of rainfall. International conference on communication and signal processing. India, pp 0657– 0661 5. Gan X, Chen L, Yang D, Liu G (2011) The research of rainfall prediction models based on matlab neural network. Proceedings of IEEE CCIS, pp 45–48 6. Cramer S, Kampouridis M, Freitas AA, Alexandridis A, Predicting rainfall in the context of rainfall derivatives using genetic programming. 2015 IEEE symposium series on computational intelligence, pp 711–718 7. Darji MP, Dabhi VK, Prajapati HB, Rainfall forecasting using neural network: a survey. 2015 International conference on advances in computer engineering and applications (ICACEA) IMS engineering college, Ghaziabad, India, pp 706–713


8. Mani Sekhar SR, Siddesh GM, Tiwari A, Khator A, Singh R (May 2020) Identification and analysis of nitrogen dioxide concentration for air quality prediction using seasonal autoregression integrated with moving average. Aerosol Sci Eng 04(02):137–146. Springer 9. Mohapatra SK, Upadhyay A, Gola C, Rainfall prediction based on 100 years of meterological data, 2017 international conference on computing and communication technologies for smart nation, pp 162–166 10. Chatterjee S, Datta B, Sen S, Dey N, Rainfall prediction using hybrid neural network approach, 2018 2nd international conference on recent advances in signal processing, telecommunications and computing (SigTelCom), pp 67–72 11. Mani Sekhar SR, Siddesh GM, Jain S, Singh T, Biradar V, Faruk U (2021) Assessment and prediction of PM2.5 in Delhi in view of stubble burn from border states using collaborative learning model. Aerosol Sci Eng Springer 5:44–55 12. Navadia S, Yadav P, Thomas J, Shaikh S, Weather prediction: a novel approach for measuring and analyzing weather data, International conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC 2017), pp 414–417 13. Siddesh GM, Srinidhi H, Srinivasa KG (2021) Case based classifier for air pollution monitoring and forecasting. J Inst Eng (India): Ser B 102(3):447–454. Springer

A Comprehensive Survey of Datasets Used for Spam and Genuineness Views Detection in Twitter
Monal R. Torney, Kishor H. Walse, and Vilas M. Thakare

Abstract Social media is one of the evolving platforms for sharing views. Social media such as Twitter, Instagram, Facebook, and other microblogging sites move a lot of information from one corner of the world to another, and this information can be used for various purposes. Social media plays a major role in the collection of a wide variety of information, which can be used to extract sentiments, opinions, spam, and the genuineness of views shared by users. To perform experimentation, proper datasets should be available. In this survey paper, we throw light on the recent datasets used for experimentation, analyze the results and performance obtained with various datasets, and, by comparing them, try to identify the domain datasets that give better results after applying various methods, techniques, and algorithms. Keywords Twitter · Social media · Datasets · Machine learning · Deep learning · Facebook · Social networking

1 Introduction Today, social media plays an important role in the day-to-day life of humans. It is a crucial medium for contact, communication, opinion sharing, and writing views [1]. Due to the evolution of social media, people are able to share views, praise a moment, criticize a product, and do many other things with it; users can share views about any product, person, or type of service while sitting in one corner of the world. The Internet is full of information that flows from one corner of the world to another. This information is shared on different social media platforms such as Facebook, Twitter, Instagram, and many other microblogging sites.
M. R. Torney and V. M. Thakare, Department of CSE, Sant Gadge Baba Amravati University, Amravati, India (e-mail: [email protected]); K. H. Walse, Department of CSE, Anuradha Engineering College, Chikhali, Buldhana, India


Social media plays an important role in the generation of information in various domains such as banking and finance, election campaigns, trending topics, social analysis, microfinance, and economics [2]. Social networking sites such as Twitter provide an application programming interface, the Twitter API, to extract and download users' tweets and messages for analysis purposes [3]; a hedged sketch of such collection is given at the end of this section. These tweets are analyzed for the extraction of sentiments, opinion mining, detection of spam or genuine views, and many more purposes. Many websites such as the UCI repository and Kaggle provide readily available labeled datasets for various domains, and some researchers have used these ready-made datasets, while many others have used data extracted from social networking sites and other sources [4]. This survey paper gives insights into the datasets used in various experiments and the performance attained by the techniques. The paper is organized as follows: Sect. 1 covers the introduction and the need for a survey of the various datasets used by researchers, Sect. 2 contains a detailed survey of the datasets and their performance and results, and Sect. 3 compares the datasets and presents the summary and conclusion.
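As a hedged illustration of collecting tweets through the API (not tied to any specific surveyed paper), the sketch below uses the Tweepy library; it assumes Tweepy 4.x with Twitter API v1.1 access, and the keys, screen name, and file name are placeholders. Depending on the access level currently granted by Twitter/X, different endpoints may be required.

```python
# Hedged sketch of collecting a user's recent tweets with Tweepy and storing
# them in JSON format, as described in the survey; all credentials are placeholders.
import json
import tweepy

auth = tweepy.OAuth1UserHandler("CONSUMER_KEY", "CONSUMER_SECRET",
                                "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

tweets = api.user_timeline(screen_name="example_user", count=200,
                           tweet_mode="extended")

with open("tweets_example_user.json", "w", encoding="utf-8") as fh:
    json.dump([t._json for t in tweets], fh, ensure_ascii=False, indent=2)
```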

2 Literature Review One of the important tasks in spam review detection and spam detection on Twitter is to collect datasets containing tweets related to various domains, such as social media, news, movie reviews, product reviews, and many others. Mainly, social media platforms such as Twitter, Facebook, Instagram, and other microblogging sites have been used to extract the relevant tweets, messages, and review messages. Many researchers have applied various techniques to different datasets and obtained various results in terms of classification performance, considering parameters such as accuracy, true positives, false positives, true negatives, and false negatives. Many types of features have been extracted from the different datasets and used to obtain the results. Twitter is a popular platform and provides free access to extract tweets through the Twitter application programming interface; the extracted tweets are stored in JSON file format. Adewole et al. [5] used four different datasets for their experimentation and applied 10 different machine learning algorithms to identify the best-performing classification technique among them. Various features were also extracted from the data and fed to the machine learning algorithms. In the comparison with the other techniques, random forest outperforms the rest and produces F-measure, accuracy, and AUC/ROC of nearly 99% on dataset 1; on dataset 2, random forest again performs well and generates accuracy of up to 99%. To reduce classification time and obtain the appropriate features, the authors used an evolutionary search method. The framework proposed by the authors is able to detect and identify spam messages and spam


accounts on Twitter in a unified way; the datasets were collected from users via the Twitter application programming interface. Mahabub et al. [6] proposed work to tackle the problem of classifying news as fake or real. They used an ensemble approach to bring classifier accuracies closer together, working on a dataset of 6500 instances, some real and some fake, previously used by another author. They applied different machine learning algorithms, selected 3 of them, and fed these to the next step, where the best classifier was chosen using an ensemble method. They also suggested future work on feature selection and on using advanced techniques such as deep learning to improve classification accuracy and test scores. Sun et al. [7] implemented real-time spam detection using parallel processing. The dataset was built from near real-time tweets downloaded with the Twitter API, and a data fragment was constructed and experimented with nine machine learning algorithms, with dataset sizes varying from 1000 to 100,000 instances; they checked and compared the performance of the algorithms across these variable dataset sizes. Gao et al. [8] proposed a novel approach to detect real and fake movie reviews: all movie reviews and their scores were captured from the Chinese online community site Douban, and an unsupervised approach was used to detect spam in the movie reviews. The authors conclude that on the Douban dataset the proposed approach outperforms alternatives and shows better results. Alom et al. [9] proposed and implemented deep learning techniques using two datasets, a social honeypot dataset and the 1KS-10KN dataset. They designed a novel framework using a convolutional neural network combined with two different kinds of user-related features extracted from the datasets, and compared performance when using user text alone versus user text plus metadata. The results show an improvement over simple machine learning and deep learning techniques, with accuracies of 96.68% and 93.12% on dataset I and dataset II, respectively. Guo et al. [10] proposed and implemented a spam detection model named DeG-Spam, experimented on two datasets, a Tweet dataset and a Weibo dataset, consisting of users annotated as spam or non-spam. The proposed two-stage model outperforms the baseline models, with accuracy increasing by 5 to 10% when the DeG-Spam approach is used. Jain et al. [11] presented work using deep learning, proposing new deep architectures for spam classification such as long short-term memory (LSTM) from the class of recurrent neural networks; the advantage of this architecture is that it can learn hidden features that traditional classifiers are unable to detect. They used three feature-vector representations: WordNet, ConceptNet, and Word2Vec. The SLSTM architecture has the ability to learn automatically. These techniques were applied to two datasets, an SMS-based spam dataset and a Twitter dataset.


Using the proposed approach, accuracy of up to 96% was attained, the highest among all the techniques the authors used to show the performance of the proposed model. Alharbi et al. [12] proposed a model using deep learning techniques such as CNN along with other models. The SemEval-2016_1 and SemEval-2016_2 datasets, which contain tweets manually annotated with positive and negative sentiments, were used for the experimentation. Different ML algorithms such as Naive Bayes, support vector machine, KNN, and J48, along with some deep learning algorithms combined with user behavioral information, were used to design the model, and a tenfold cross-validation procedure was adopted to check its performance. The results show that the convolutional neural network with user behavioral information outperforms the baseline machine learning algorithms, i.e., SVM, NB, J48, and KNN. Monica et al. [13] proposed a spam message detection model that analyzes the sentiments extracted from real-time data. Using the Twitter API with its consumer key, secret key, and authentication keys, and with the help of a developer app, they collected real-time tweets per user. The sentiment score of every user is determined to identify the user's inclination toward positive, negative, or neutral. The proposed model outperforms the traditional ML-based algorithms with an accuracy of 97%, the best among them, and achieved up to 71% accuracy using 200 epochs. Aljohani et al. [14] experimented with human-annotated datasets, using a deep learning model along with some altmetrics to compare tweets posted by bots and by humans; the model attains accuracy of up to 72%. Tong et al. [15] worked on a Chinese spam detection model, since little work and no perfect technique were available for detecting spam in the Chinese language. The authors proposed a capsule network with a long attention mechanism and used three datasets, charTrec06C, WordTrec06C, and BalanceTrec06C, each containing two classes and partitioned into training, testing, and validation sets. One important conclusion is that the proposed method shows better accuracy on small datasets, whereas accuracy shows a downward trend as the dataset size increases. Nasir et al. [16] implemented a novel hybrid deep learning model, combining the two widely used deep learning techniques of convolutional and recurrent neural networks, to detect fake news. They considered two publicly available datasets containing articles written in English related to the Syrian war, labeled to indicate fake and genuine news. The results show that the proposed hybrid method works efficiently and outperforms the machine learning algorithms on performance parameters such as accuracy, precision, recall, and F-score; a brief sketch of how these measures are computed from the confusion matrix is given below.
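For reference, the evaluation measures repeatedly mentioned in this section can be computed from the confusion-matrix counts as follows; the label vectors in this sketch are toy placeholders, not data from any surveyed paper.

```python
# Hedged sketch of accuracy, precision, recall, and F-score from TP/FP/TN/FN.
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = spam, 0 = ham (toy labels)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F1={f1:.2f}")
```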


Vanta et al. [17] worked on combining different types of features taken from other researchers, with some newly added features. They used restaurant review datasets from Yelp that had been used by other researchers; the data consist of two datasets, YelpNYC and YelpZip. Barushka et al. [18] proposed an approach that combines two stages: in the first stage, a multi-objective evolutionary feature selection method is used to lower the misclassification cost, after which base learners such as deep neural networks are combined in an ensemble. They used two datasets, one gained from Hives and one from Twitter; both contain labeled and unlabeled data with different types of information, from which different information was extracted as features. The authors concluded that the combination of feature selection and the RDNN ensemble learning improves the results in terms of accuracy. Halawi et al. [19] proposed an ontology-based approach to detect spam. The approach is unique in that it explores the association between ham and spam user detection and analyzes the interrelationship between real and spam users' messages. The ontology-based approach depends on analysis and classification performed on public tweets: during the process, ontologies are extracted and a dictionary is designed to distinguish real messages from randomly generated topics, and a similarity ratio is calculated to reproduce the trueness of a tweet. The experimentation, conducted on real tweet data, shows that the ontology-based spam detection approach is better than message-to-message techniques, with more than 200% improvement in detection rate while also managing the scalability of large data. The main contribution of this work is the design and advancement of the message-to-ontology approach, which detects spam tweets through content analysis, avoids the need for external and relationship-based information when investigating spam tweets and users, and lowers the dependence on similarity overlap when comparing tweets with ontologies. After analyzing the novel approach, the authors emphasize the importance of ontologies for spam detection via content analysis and conclude that there is a need to explore private and user-relationship data, which would help all spam detection algorithms. Faris et al. [20] designed an intelligent email spam detection system using two techniques, random weight networks and genetic algorithms. The proposed model automates feature selection, which is most relevant for detecting spam, with the genetic algorithm and random weight network algorithm used for this purpose. They used three datasets containing spam messages, Spam Assassin, CSDMC2010, and Ling Spam; on the Spam Assassin dataset the model shows the highest accuracy rates. A crucial final step, feature importance analysis, identifies the most influential features that help attain the maximum classification accuracy. The performance of the proposed method is checked against networks with a variable number of neurons and different random weight networks on the various datasets. A generic sketch of genetic-algorithm feature selection in this spirit is given below.
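As an illustration of genetic-algorithm-based feature selection in the spirit of [20, 21] (not their actual GAME FEST or random-weight-network implementations), the following sketch evolves binary feature masks on synthetic data, scoring each mask by the cross-validated accuracy of a Naive Bayes classifier; all names, sizes, and rates are generic choices.

```python
# Hedged sketch of GA-style feature selection: fitness = cross-validated
# accuracy of a Naive Bayes classifier on the selected feature subset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=30, n_informative=8,
                           random_state=0)

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(GaussianNB(), X[:, mask.astype(bool)], y, cv=3).mean()

pop = rng.integers(0, 2, size=(20, X.shape[1]))            # random feature masks
for generation in range(15):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-10:]]                 # keep the best half
    children = []
    while len(children) < 10:
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        cut = rng.integers(1, X.shape[1])
        child = np.concatenate([a[:cut], b[cut:]])           # one-point crossover
        flip = rng.random(X.shape[1]) < 0.05                 # bit-flip mutation
        children.append(np.where(flip, 1 - child, child))
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("Selected features:", np.flatnonzero(best))
```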


Elakkiya et al. [21] implemented a new genetic-algorithm-based feature approach combined with newly constructed multi-evaluation measures: the GAME FEST algorithm, based on genetic algorithms, is proposed for detecting spam in networks. The authors considered datasets from Apontador, Twitter, and YouTube. The dataset used in the proposed method is fed to the genetic process, which selects the optimal features to add to the feature subset, and the selected features are then supplied to the classification algorithm to evaluate the performance of the feature subset. The experimentation is performed using four datasets and the selected features to evaluate the feature selection approach, with tools such as WEKA and MATLAB used for feature selection; accuracy differs across the datasets depending on the parameters considered. Tajalizadeh et al. [22] designed a novel framework for spam detection in Twitter based on upgrades to the existing stream clustering technique DenStream. They experimented with datasets of different sizes and different sampling methods and also considered the number of features, using the Twitter platform, which provides public tweets online through its application programming interface, and a previously available dataset of 600 k tweets divided into different parts with six different features. The proposed method was checked against the different datasets, and the authors concluded that with smaller sample sizes the performance is not up to the mark, whereas with larger samples the performance is good; performance was best when Dataset III and Dataset IV were selected for the experimentation. As future work, efficient real-time distance learning in the online phase is suggested to enhance the performance of current stream clustering methods. Wang et al. [23] used multiple detection tests for spam detection, representing the spam distribution using KL divergence and the probability of drift in tweets using a multiple drift detection test (MDDT). A classification model is first selected to label tweets as spam or not; KL divergence then measures the distributional distance among various samples, and the MDDT test is applied to check whether the current data differ from the historical data, in which case divergence (drift) is declared. This drift is used to update the model and make it more robust. They considered 12 different features of the tweet data and evaluated the proposed method against several alternatives; the MDDT method outperformed all the other methods used for comparison. Kumar et al. [24] worked on decentralized and distributed techniques: an unsupervised model that helps detect and remove spam from social networking sites. The MapReduce platform is used to handle and process the big data, and a newly designed fuzzy technique is presented that can detect spam even from a single type of message stream. The authors claimed that their contribution, using fuzzy-based methods for detection and the MapReduce platform to store the big data, is novel and


a first-of-its-kind work. The proposed model is divided into two modules: extracting features from the collected message stream and matching strings using fuzzy techniques. The MapReduce platform helps process the large data, and fuzzy string matching is used to identify spam messages from compromised accounts. For future efficiency and performance improvements, the authors propose implementing such techniques on GPUs. TalaeiPashiri et al. [25] proposed a method in which SCA is used to update the feature vector and point out the optimal features, which can then be used for training an ANN; the performance of the proposed method is better than traditional machine learning models. A metaheuristic algorithm like this can be used for feature selection to reduce or alleviate the severity of these problems when training the ANN. They considered reviews and tried to detect spam in them: the metaheuristic algorithm is responsible for updating the feature vectors, whereas the ANN is used for training and learning, since artificial neural networks can be used as a data mining and learning method for classification problems and effective spam detection. The authors also recommend reducing the detection error of the proposed algorithm by improving the SCA, and suggest combining the proposed method with other learning methods such as the support vector machine (SVM) and decision tree. Liu et al. [26] worked on an SMS spam detection model named Vanilla Transformer, which has certain changes compared with the original. The datasets are preprocessed beforehand, using the spaCy library, to obtain better accuracy, and GloVe pre-trained word vectors are used to represent the words. Deep learning techniques such as LSTM and CNN+LSTM, along with machine learning algorithms such as SVM, are used to check the performance of the proposed method; the accuracy is 98.92% and 88% on the first and second datasets, respectively, which is the highest value. Murthy et al. [27] worked on a hybrid model combining an RNN and an SVM: the RNN encoder predicts the next word from the previous words, and the SVM performs the final binary classification. The proposed model is evaluated using the IMDB movie dataset available on Kaggle, and its performance is compared with state-of-the-art techniques such as a Bayesian network classifier, random forest, RNN, and SVM; the authors conclude that the combination of the RNN and LSTM models performs better for text classification. Murthy et al. [28] worked on real-time sentiment analysis of Twitter views and also presented a framework for visualization. They implemented novel algorithms such as emotion polarity SentiWordNet (EPS) for Twitter sentiment extraction and compared the performance of the proposed approach with various classification algorithms on six different datasets. The framework consists of data collection, data parsing, and data visualization; per the results obtained, the proposed EPS algorithm attains an accuracy of 86%, the highest among all. A hedged sketch of an LSTM-based text classifier, in the spirit of the deep learning approaches above, follows.
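As a concrete illustration of the LSTM-style text classifiers discussed above (e.g., [11, 26]), the following hedged sketch builds a small Keras model; it assumes TensorFlow 2.x, and the toy texts, labels, and hyperparameters are placeholders rather than anything reported by the surveyed papers.

```python
# Hedged sketch of an LSTM text classifier for spam detection; not the
# authors' SLSTM or Transformer model, and the data here are toy placeholders.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

texts = ["win a free prize now", "meeting at 10 am tomorrow",
         "cheap loans call now", "see you at lunch"]
labels = np.array([1, 0, 1, 0])                 # 1 = spam, 0 = ham

vectorizer = layers.TextVectorization(max_tokens=1000, output_sequence_length=20)
vectorizer.adapt(np.array(texts))

model = tf.keras.Sequential([
    vectorizer,                                  # raw strings -> token ids
    layers.Embedding(input_dim=1000, output_dim=32),
    layers.LSTM(32),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(np.array(texts), labels, epochs=5, verbose=0)
print(model.predict(np.array(["free prize waiting for you"])))
```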


Murthy et al. [29] proposed and implemented Twitter trend momentum to predict Twitter trends, similar to the moving average convergence divergence indicator used in share markets. This framework depicts the Twitter trends of the next few hours. Two different types of datasets, news and trends datasets, were tested against the proposed framework, and its performance was evaluated on parameters such as throughput and scalability. The proposed framework is well suited for big data analytics. The authors show that the scalability and throughput of the proposed system are better, but the space and time complexities required to perform all these steps are not mentioned in the paper. Figure 1 indicates the various datasets used by the researchers and their respective domains. Domains such as SMS datasets, advertisements, product review datasets, real-world datasets, restaurant reviews, and fake news review datasets were used by the authors. The Twitter dataset is the one used by most researchers.

Fig. 1 Datasets used by various researchers in experimentation


This dataset consists of the views shared by users on people, products, and some events. Some researchers have used available datasets downloaded from the UCI repository, Kaggle, and other repositories. Table 1 shows the domains and the different sizes of the annotated datasets used by the authors, and Table 2 shows the type of detection, such as spam message or bot account detection, along with the model used. The authors used various methods to obtain accuracy on different datasets, and some also combined various hybrid features to attain better detection. The performance of the various models was achieved using different datasets from different domains.

3 Summary and Challenges In this comprehensive survey, we have elaborated on the various datasets from different domains, such as social media, social networks, and review datasets of products and websites. Most researchers have used social media as the domain from which to collect datasets, since social media and networks cover a wide range of topics and views related to people, services, and products, and social networking sites such as Twitter provide application programming interfaces to download tweets and messages for analysis purposes. After performing the survey, some problems remain that have not been discussed in existing research papers; they are given below.

4 Difficulty in Extraction and Collection of Data The first and foremost task is data collection. A huge amount of data is spread across the Internet, and it is very difficult to collect and store it in databases. Some social networking sites allow their APIs to be used to download data, but extracting and collecting data from the sites remains difficult.

5 Renaming of Data Most techniques, such as machine learning, require labeled data for further processing. Renaming of data is therefore important, and cleaning, preprocessing, and labeling the data for classification consume most of the time.


Table 1 Various domains and datasets used

Name of paper | Dataset used | Size of datasets | Domain of dataset
[5] | Collections of short message service (SMS) spam and spam tweets collected from Twitter | 5574 SMS, 1324 SMS, and 18,000 instances of messages | Social network
[6] | Fake news dataset | 6500 news items in total, of which 3252 are fake and 3259 are real | News dataset
[7] | Raw dataset collected | 1.5 million tweets of the general public | Social media
[8] | Dataset created using reviews collected from the Chinese blogging site Douban (Douban dataset) | Attention mechanisms along with generative adversarial networks used | Movie review
[9] | Social honeypot dataset, Twitter 1KS-10KN dataset | Dataset contains users which are spammers and non-spammers; in total, the dataset consists of 11,000 labeled tweets | Social media
[10] | Twitter dataset and Weibo dataset | 10,000 users with 2060 annotated as spammers | Social media
[11] | Two datasets: SMS spam dataset and tweets dataset collected using the Twitter API and the UCI repository | SMS spam dataset having 5574 records and tweets dataset having 5096 records | Social networks
[12] | Two datasets, SemEval-2016_1 and SemEval-2016_2, labeled as positive and negative | The first dataset has 774,244 instances and the second has 491,902 instances | Social networks
[13] | Twitter dataset | 200 tweets from each user | Social networks
[14] | Human-annotated dataset | 4540 annotated users, of which 269 are annotated as bots | Social networks
[15] | Datasets downloaded from sites | Datasets consist of 23,470, 23,178, and 8000 messages | Social media
[16] | FA-KES dataset and ISOT dataset | 804 news articles and 45,000 news articles | News review
[17] | Dataset collected from the Yelp website | Consists of 359,052 reviews, of which 10.27% are fake reviews | Restaurant reviews
[18] | Two different datasets: one gained from Hives, and tweets from Twitter | Hives dataset having 821 instances and tweets dataset having 61,674 instances | Social networks
[20] | Spam Assassin, Ling Spam, and CSDMC2010 datasets | The datasets have 8150, 2894, and 4327 instances | Social media
[21] | Four datasets from social networks | 1065 users with a total of 2520 instances | Social networks
[22] | Previously available datasets | 600 million | Social networks
[23] | Public dataset | Dataset contains 100,000 instances | Social networks
[24] | Twitter dataset | Dataset consists of 31,000 instances | Social networks
[25] | Spam base dataset, which is standard | Size is not mentioned in the paper | Advertisements for products/websites
[26] | Two different datasets: an SMS spam collection and a Twitter spam collection | 5574 and 11,968 messages, of which 747 and 5815 are spam | SMS dataset
[27] | IMDB movie dataset | 40,318 sentences | IMDB dataset
[28] | Six different types of datasets | Varying sizes | Different domain datasets
[29] | Trends and news datasets | 350 million and 1 million messages | Twitter trend and news collection

6 Lack of Standard Datasets Social media, social networking, and other microblogging sites contain data from different domains. These data come from various platforms such as Twitter, Facebook, Instagram, and many review and microblogging sites. Due to the heterogeneous nature of the Internet and the huge amount of data available, it is difficult to assemble a standard dataset for experimentation.

7 Genuineness of Data Due to the flow of data from one end to another, a huge amount of data is available on the Internet, and not all of it is genuine. Fake data can be propagated by unauthorized users and bots to gain access or to promote a product, person, or process.


Table 2 Detection method and model used

Work performed | Sources | Detection | Model used
[5] | Twitter | Spam message and account detection | Hybrid features along with machine learning algorithms
[6] | Fake news dataset | Detection of fake news | Ensemble approach
[7] | Twitter | Real-time spam detection | Different machine learning algorithms
[8] | Reviews collected from Chinese blogging sites | Spam detection in blogs | Unsupervised attention mechanisms along with generative adversarial networks
[9] | Twitter | Spam detection | Deep learning model such as CNN
[10] | Twitter | Spam detection in Twitter | Deep graph-based neural networks
[11] | Twitter and UCI depository | Spam detection in Twitter | Deep learning techniques such as RNN (SLSTM)
[12] | Twitter | Sentiment detection | Deep learning techniques
[13] | Twitter | Sentiment detection | Deep learning techniques
[14] | Twitter | Bot account detection | Deep learning techniques
[15] | Chinese Website | Chinese spam detection | Deep learning techniques along with attention mechanisms
[29] | Collected from sites | Email spam detection | Word embedding, convolutional neural networks, and bidirectional neural network
[16] | Fake news dataset | Fake news detection | Combination of deep learning techniques such as CNN and RNN
[17] | Dataset collected from the Yelp Website | Fake review detection | Majority voting approach
[18] | Twitter | Spam filter | Ensemble approach with deep neural networks
[20] | Social media | Spam email detection | Random weight generation and genetic algorithms
[21] | Social networks | Spam detection | Genetic algorithm-based approach
[22] | Twitter | Spam detection | Clustering methods
[23] | Twitter | Spam detection | MDDT method
[24] | Twitter | Spam detection | Unsupervised fuzzy technique
[25] | Reviews | Spam detection | Metaheuristic and artificial neural network
[26] | SMS messages | SMS spam detection | Deep learning and machine learning algorithms
[27] | IMDB movie dataset | Sentiment classification | Hybrid model using LSTM and support vector machine
[28] | Different datasets | Sentiment classification | EPS algorithm
[29] | Trends and news datasets | Twitter trend momentum identification | Novel TTM algorithm

The various issues discussed in this section can be handled by providing proper solutions. Nowadays, various social networking sites provide facilities to use their application programming interfaces to collect data from the sites on various topics. The data are collected in formats such as JSON and CSV, which are easy to understand and interpret using the advanced libraries available. Most of the models developed are based on supervised machine learning; exploring unsupervised machine learning to train the models would help to reduce the task of labeling and renaming the data. A lot of datasets of various types, such as image and text, are available in the repositories. Online social media platforms allow users to share their views, but the views shared on social media are not always genuine: they may be fake or spam messages spread over the Internet, and such messages also influence the decisions of users. Review texts, messages, and views must therefore be identified as spam or genuine, and to achieve this, a proper detection mechanism with a good accuracy rate is required. Combining various user-related attributes with the sentiment polarity of views leads to a proper solution to the problem.


8 Conclusion In this survey paper, we have closely explored the up-to-date datasets along with the performance of each method and technique used for experimentation. It has been observed that a technique may perform well on one dataset but fail to perform well on another. As time elapses, the way messages are written on social media changes, with different meaningful words being used; drift is therefore present in the messages. It is thus necessary to build suitable datasets for experimentation while considering all these parameters. As the domain of the dataset changes, the performance of the method, technique, or algorithm also changes. The scope for further improvement lies in the development of a dataset that is well suited to the methods and techniques, as well as datasets that require minimal preprocessing operations. In this area, mostly supervised machine learning methods have been explored and results obtained; unsupervised machine learning, deep learning, and sentiment analysis remain largely unexplored. Hybrid models can also play an important role in the detection and identification of spam on Twitter, and the recent trend has moved toward hybrid models combining machine learning and deep learning.

References
1. Wu T, Wen S, Xiang Y, Zhou W (2018) Twitter spam detection: survey of new approaches and comparative study. Comput Secur 76:265–284
2. Beigi G, Hu X, Maciejewski R, Liu H (2016) An overview of sentiment analysis in social media and its applications in disaster relief. In: Studies in computational intelligence, vol 639. Springer, pp 313–340
3. Sharma S, Jain A (2019) Cyber social media analytics and issues: a pragmatic approach for Twitter sentiment analysis. Adv Intell Syst Comput 924:473–484. https://doi.org/10.1007/978-981-13-6861-5_41
4. Shayaa S, Jaafar NI, Bahri S, Sulaiman A, Seuk Wai P, Wai Chung Y, Piprani AZ, Al-Garadi MA. Sentiment analysis of big data: methods, applications, and open challenge. IEEE Access 37807–37827
5. Adewole K, Anuar N, Kamsin A, Sangaiah A (2017) SMSAD: a framework for spam message and spam account detection. Multimedia Tools Appl 78(4):3925–3960
6. Mahabub A (2020) A robust technique of fake news detection using ensemble voting classifier and comparison with other classifiers. SN Appl Sci 2(4)
7. Sun N, Lin G, Qiu J, Rimba P (2020) Near real-time Twitter spam detection with machine learning techniques. Int J Comput Appl 1–11
8. Gao Y, Gong M, Xie Y, Qin A (2021) An attention-based unsupervised adversarial model for movie review spam detection. IEEE Trans Multimedia 23:784–796
9. Alom Z, Carminati B, Ferrari E (2020) A deep learning model for Twitter spam detection. Online Soc Netw Media 18:100079
10. Guo Z, Tang L, Guo T, Yu K, Alazab M, Shalaginov A (2021) Deep graph neural network-based spammer detection under the perspective of heterogeneous cyberspace. Futur Gener Comput Syst 117:205–218


11. Jain G, Sharma M, Agarwal B (2018) Optimizing semantic LSTM for spam detection. Int J Inf Technol 11(2):239–250
12. Alharbi A, de Doncker E (2019) Twitter sentiment analysis with a deep neural network: an enhanced approach using user behavioral information. Cogn Syst Res 54:50–61
13. Monica C, Nagarathna N (2020) Detection of fake tweets using sentiment analysis. SN Comput Sci 1(2)
14. Aljohani N, Fayoumi A, Hassan S (2020) Bot prediction on social networks of Twitter in altmetrics using deep graph convolutional networks. Soft Comput 24(15):11109–11120
15. Tong X, et al (2021) A content-based Chinese spam detection method using a capsule network with long-short attention. IEEE Sens J
16. Nasir J, Khan O, Varlamis I (2021) Fake news detection: a hybrid CNN-RNN based deep learning approach. Int J Inf Manag Data Insights 1(1):100007
17. Aono TVM (2019) Fake review detection focusing on emotional expressions and extreme rating. The Association for Natural Language Processing
18. Barushka A, Hajek P (2019) Spam detection on social networks using cost-sensitive feature selection and ensemble-based regularized deep neural networks. Neural Comput Appl 32(9):4239–4257
19. Halawi B, Mourad A, Otrok H, Damiani E (2018) Few are as good as many: an ontology-based tweet spam detection approach. IEEE Access 6:63890–63904
20. Faris H, Al-Zoubi A, Heidari A, Aljarah I, Mafarja M, Hassonah M, Fujita H (2019) An intelligent system for spam detection and identification of the most relevant features based on evolutionary random weight networks. Inf Fusion 48:67–83
21. Elakkiya E, Selvakumar S (2019) GAMEFEST: genetic algorithmic multi evaluation measure based feature selection technique for social network spam detection. Multimedia Tools Appl 79(11–12):7193–7225
22. Tajalizadeh H, Boostani R (2019) A novel stream clustering framework for spam detection in Twitter. IEEE Trans Comput Soc Syst 6(3):525–534
23. Wang X, Kang Q, An J, Zhou M (2019) Drifted Twitter spam classification using multiscale detection test on K-L divergence. IEEE Access 7:108384–108394
24. Kumar A, Singh M, Pais A (2019) Fuzzy string matching algorithm for spam detection in Twitter
25. Talaei Pashiri R, Rostami Y, Mahrami M (2020) Spam detection through feature selection using artificial neural network and sine–cosine algorithm. Math Sci 14(3):193–199
26. Liu X, Lu H, Nayak AA (2021) Spam transformer model for SMS spam detection. IEEE Access 9:80253–80263
27. Murthy JS, Siddesh GM, Srinivasa KG (2020) A hybrid model using MaLSTM based on recurrent neural networks with support vector machines for sentiment analysis. Eng Appl Sci Res 47(3):232–240
28. Murthy JS, Siddesh GM, Srinivasa KG (2019) TwitSenti: a real-time Twitter sentiment analysis and visualization framework. J Inf Knowl Manag 18(02):1950013
29. Murthy JS, Siddesh GM, Srinivasa KG (2019) A real-time Twitter trend analysis and visualization framework. Int J Semant Web Inf Syst 15(2):1–21

Computational Intelligence–Deep Learning

Indian Classical Dance Forms Classification Using Transfer Learning

Challapalli Jhansi Rani and Nagaraju Devarakonda

Abstract Human activity analysis is useful in a variety of domains, including video surveillance, biometrics, and home health monitoring systems. In the computer vision field, extraction and recognition of complex human movements from images/videos is a highly complex task. In the present work, we propose Indian classical dance (ICD) classification using the concept of transfer learning. An ICD form is a combination of gesticulation of all body parts. It comes in a variety of shapes and styles, but the most common features include single/double hand mudras, eye movements, leg alignment, hip movements, facial expressions, and leg posture. Each dance has its own gestures and costumes worn by the dancers. Indian classical dances are categorized into 8 categories. In this work, we use a dataset consisting of eight dance classes: Bharatnatyam, Odissi, Manipuri, Kuchipudi, Mohiniyattam, Sattriya, Kathakali, and Kathak. The images were collected from the Internet. Training a CNN model with little data does not give accurate results and leads to an over-fitting problem. To overcome this problem, we use transfer learning, whereby the knowledge learned from one problem is applied to solve a related target task; this reduces both time and space complexity. In our proposed work, we use the pre-trained model VGG16, which yields a high accuracy of 85.4% compared to earlier methods.
Keywords Indian classical dance · Transfer learning · Dance forms · Convolution neural network

1 Introduction India is a land of prehistoric art and has one of the richest cultural heritages among the countries of the world, with many different cultures. Indian dance is one of the most sacred identities of our culture.
C. J. Rani (B) · N. Devarakonda
School of Computer Science and Engineering, VIT-AP University, Amaravati, Andhra Pradesh, India
e-mail: [email protected]; [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_18


From prehistoric times, the classical dance forms of India have originated from the Natya Shastra. There are eleven different classical dance forms found in India: Bharatanatyam and Bhagavatha Mela are found in Tamil Nadu; Mohiniyattam and Kathakali originated in Kerala; Manipuri originates in Manipur; Kathak originated in Uttar Pradesh; Kuchipudi comes from the state of Andhra Pradesh; Odissi is famous in Odisha; Yakshagana comes from Karnataka; Sattriya is famous in Assam; and Chhau comes from eastern India. All of these ICDs primarily represent man's inner beauty, divinity, culture, nature, and tradition rather than mere entertainment. Every ICD form varies in its dancing style and costumes. Some dances, such as Yakshagana and Kathakali, appear to be similar, but the facial expressions and costumes of the dancers are different. Bharatanatyam and Odissi may appear similar in terms of hand mudras (gestures), but the costumes worn by the dancers and the hip movements differ. Since ancient times, India has maintained a Guru/Shishya system for learning classical dance, and in today's scenario many people still learn dance from a teacher. Although dance teachers have immense knowledge, due to intense work pressure many people are unable to approach them. There are a variety of ways to learn these classical dances in today's technical era, and the majority of people gain knowledge of dance by taking classes. However, differently-abled persons, such as deaf or mute self-taught learners, are not able to benefit from this approach. Multiple videos for all classical dance forms are available for self-study, but they cannot provide an in-depth understanding of classical dance. There have been various technology-enabled solutions for the identification and classification of ICD styles. For the recognition and classification of ICD genres, image processing techniques such as HOG for feature extraction and CNN-based models for mudra detection have been employed. Earlier, most methods were executed to recognize and classify images. Generally, digital images are represented in matrix form, where each cell stores a pixel intensity. Machine learning (ML) [1] and deep learning (DL) [2] are well-known methods to identify and classify any type of data, and DL and computer vision methods are generally used for image processing. Deep learning models are highly accurate when trained with large datasets and can classify them into different labels, but handling large image datasets needs high-end systems and is very costly. When the training image dataset is small, accuracy is low due to the lack of training data. To overcome this problem, we propose the concept of transfer learning: we can reuse a pre-trained model on a new task, using the knowledge gained from a previous task in order to improve generalization on another [3]. In this work, we use VGG16 (Visual Geometry Group), one of the most popular pre-trained models for large-scale image recognition [4], for dance form classification. Our proposed CNN model is trained with the pre-trained model on new training and validation data, which results in a higher accuracy of 85.4% compared to earlier methods. This model improves reliability and reduces the space and time complexity.


2 Related Work Ashwini Dayanand Naik et al. [5] proposed a CNN model with ResNet32 as the backbone architecture to classify ICD images belonging to 5 dance classes, namely Bharatanatyam, Yakshagana, Odissi, Kathakali, and Kathak, and achieved an accuracy of 78.88%. Kishore et al. [6] proposed a CNN architecture for the recognition and classification of ICD poses/mudras for both online and offline images. Their dataset consists of 200 dissimilar Bharatanatyam poses/mudras performed by 10 different subjects, and the model achieves 89.92% accuracy; however, this work mainly addresses categorizing Bharatanatyam poses only. Kishore et al. [7] also suggest an AdaBoost classifier for categorizing Indian traditional dance forms. It utilizes a segmentation model to extract and identify human gestures from online video sequences and uses separate wavelet transforms and features such as local binary patterns (LBP). The categorization accuracy achieved for Bharatnatyam dance poses is 90.8%. Ankita et al. [8] proposed a deep CNN that can identify Indian traditional dance (ICD) forms from video sequences. It uses a DCNN for the representation and extraction of features and is trained using a multi-class linear SVM. 211 videos have been tested so far on this model, resulting in 75.83% accuracy. This model takes input obtained from 5 kinds of features: Zernike moments, LBP features, shape signature, Haar features, and Hu moments. Samanta et al. [9] introduced a model that uses a histogram of oriented optical flow (HOOF) to represent every frame in a dance video as a pose descriptor in a specific structured manner. This method categorizes dance videos using an SVM with a kernel. It can classify only three dance forms, Bharatanatyam, Kathakali, and Kathak, with an accuracy of 86.67%. Anami et al. [10] introduced a three-stage methodology for categorizing single-hand mudras. The first stage consists of pre-processing the images using a Canny edge detector to obtain the contour of the mudras; in stage two, features based on eigenvalues, Hu moments, and intersections are extracted; and in the third stage, an artificial neural network (ANN) is used to categorize the mudra. This method is designed to categorize 28 kinds of (single-hand) Asamyuktha mudras of Bharatanatyam with an accuracy of 93.64% for 50 epochs. Parameshwaran et al. [11] proposed a novel method for categorizing Bharatanatyam mudras using a CNN, tested with 2 models, namely transfer learning (TL) and double transfer learning (DTL). This approach uses a dataset of 2D images of hand mudras belonging to 27 classes for training and testing and achieves accuracy levels of 94.56% and 98.25%. Gautam et al. [12] proposed a method to categorize dance images by extracting HOG features and using an SVM to categorize the dance mudras. Considering 16 accumulated Bharatanatyam dance video sequences, they extracted 50 different dance steps of dancers from framed videos of size 30 × 30. In Samanta et al. [13], a dedicated local spatio-temporal (LST) feature model on a manifold using the Jensen-Bregman LogDet divergence is applied to the ICD dataset; the same model has also been applied to datasets of human activities such as KTH [14] and UCF50 [15]. Sanghamitra Das et al. [16] propose a model for the identification of Kathak and Bharatnatyam dance forms from audio


signals to minimize the classification time. The spectral flux method is used to determine how quickly the power spectrum of the audio signal changes from one frame to the next. This work uses SVM, Naive Bayes, and REPTree as classifiers. Most earlier works implement classifiers to recognize and categorize the poses/mudras of a single dance, mostly Bharatnatyam. In this paper, we mainly focus on the classification of 8 categories of Indian classical dances.

3 Proposed Work 3.1 ICD Architecture Figure 1 illustrates the architecture of the Indian classical dance classification system. The model uses the ICD dataset, which can be downloaded from Kaggle. The dataset consists of 560 images of Indian classical dance forms (364 of which are used for training), classified into 8 classes: Bharatnatyam, Odissi, Kathak, Manipuri, Kuchipudi, Mohiniyattam, Sattriya, and Kathakali. Generally, the collected images differ in size and quality, so they must be pre-processed before training and testing. Training such a small dataset with a deep learning method like a convolutional neural network leads to over-fitting. To overcome this problem, we use the concept of transfer learning (TL). Using TL, the base network is first trained on a large image dataset and task, and then the learnt features are reused or transferred to the target network, which

Fig. 1 ICD system architecture


may subsequently be trained on the target dataset and task. Since the dataset here is small, the fully connected layers of the base network are removed and new fully connected layers that match the new classes are added. The layers of VGG16 are shown in Fig. 2, and the new model additionally adds fully connected layers. The work flow diagram is shown in Fig. 3. Initially, the ICD image dataset is downloaded from the Kaggle repository, and the images are pre-processed to size 224 × 224 × 3. Next, the class labels for the dance images are defined; the dataset consists of 8 classical dance forms. The ICD dataset is then split into training and testing data. The pre-trained model (model1) is loaded, the new training data is given to model1 at the fully connected layer, and the model is trained with the new features and labels to produce the classification result.
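To make this concrete, the following is a minimal Keras sketch of the transfer-learning setup, with the head reported later in Fig. 9 (global average pooling, two 4096-unit dense layers with dropout, and an 8-way softmax); the dropout rate is an assumption, and the learning rate of 1e-4 is used here although the paper quotes both 0.001 and 0.0001.

import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 8  # eight Indian classical dance forms

# Load VGG16 without its original fully connected layers and freeze it,
# so only the newly added head is trained on the small ICD dataset.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(4096, activation="relu", name="fc-1"),
    layers.Dropout(0.5),
    layers.Dense(4096, activation="relu", name="fc-2"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax", name="output_layer"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()  # parameter counts match the summary shown in Fig. 9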

Fig. 2 VGG16 CNN model network architecture

Fig. 3 Work flow diagram of ICD classification


3.2 VGG16 Network Architecture VGG16 consists of the layers shown in Fig. 2. The model was trained on the ImageNet dataset, which consists of 14,197,122 images. The kernel size is 3 × 3 with stride 1 and the pool size is 2 × 2 with stride 2 for all layers. The first 2 convolution layers use 64 filters, giving an output of size 224 × 224 × 64, as same-padded convolutions are used. The subsequent pooling layer reduces the image to 112 × 112 × 64. The next two convolutions and 1 pooling layer result in 56 × 56 × 128. The next 3 convolutions and 1 pooling layer reduce the size to 28 × 28 × 256. The next 2 blocks of 3 convolutions and 1 pooling layer reduce the size to 14 × 14 × 512 and 7 × 7 × 512. Finally, the 7 × 7 × 512 output is flattened into a (1, 25,088) feature vector and fed into the fully connected part, which consists of 3 dense layers: the first produces a vector of size (1, 4096), the second produces (1, 4096), and the third produces 1000 channels for the 1000 classes.

3.3 ICD Classification Algorithm: Figure 3 depicts the work flow of Indian classical dance image classification using our proposed CNN model.
Step 1: Download the ICD dataset from kaggle.com. It consists of 560 training images in the train folder and 150 test images in the test folder. Install the required libraries.
Step 2: Pre-process the images (coloring and resizing). Each image is resized to 224 × 224 × 3.
Step 3: Define the features and labels for the training dataset, as shown in Fig. 4.
Step 4: Split the dataset into 75% training data and 25% validation data.
Step 5: Load the VGG16 model, which is pre-trained on a large image dataset.
Step 6: Initialize the hyper-parameters, with a learning rate of 0.001. Fit a new CNN model that gains knowledge from VGG16 and is trained with the new ICD training data.
Step 7: Use the softmax layer to find the probabilities of the dance form classes.
Step 8: Evaluate the model performance by checking the training and testing accuracy and loss.
Step 9: Make predictions with new test data and check their performance.
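The following is a compact, hedged sketch of these steps using Keras ImageDataGenerator; the directory layout (one sub-folder per dance class under "train"), the batch size and the smaller dense head are illustrative assumptions rather than the exact configuration of the paper.

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Steps 2-4: rescale, resize to 224x224 and hold out 25% of the training images.
datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.25)
train_gen = datagen.flow_from_directory("train", target_size=(224, 224),
                                        batch_size=32, class_mode="categorical",
                                        subset="training")
val_gen = datagen.flow_from_directory("train", target_size=(224, 224),
                                      batch_size=32, class_mode="categorical",
                                      subset="validation")

# Steps 5-6: frozen VGG16 base plus a small softmax head for the 8 classes.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False
model = models.Sequential([base, layers.GlobalAveragePooling2D(),
                           layers.Dense(256, activation="relu"),
                           layers.Dense(8, activation="softmax")])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Steps 7-9: train, then evaluate the softmax probabilities on held-out data.
history = model.fit(train_gen, validation_data=val_gen, epochs=30)
print(model.evaluate(val_gen))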


3.4 Theory Consider an input image of size $I \in \mathbb{R}^{w \times h}$. If K is the size of the convolution kernel applied to the input image with stride S and padding P, then the output size of the convolution layer (CL) is defined as

$S_{\text{out}} = \dfrac{I - K + 2P}{S} + 1$  (1)
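For instance, with an input of I = 224, a kernel of K = 3, padding P = 1 and stride S = 1 (the setting used in VGG16), $S_{\text{out}} = (224 - 3 + 2)/1 + 1 = 224$, which matches the 224 × 224 × 64 output of the first convolution block described in Sect. 3.2.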

A CL is a collection of numerous feature maps with dissimilar weight vectors, and these several features are extracted for each location. The output $y_n^l$ of feature map n in the CL is defined as

$y_n^l(x, y) = \phi^l\Big( \sum_{m \in M_n} \sum_{(i,j) \in K^l} w_{mn}^l(i, j)\, y_m^{l-1}\big(x h^l + i,\; y v^l + j\big) + b_n^l \Big)$  (2)

Here, $K^l = \{(i, j) \in \mathbb{N}^2 \mid 0 \le i < K_x^l;\ 0 \le j < K_y^l\}$, where $K_x^l$ and $K_y^l$ are the height and width of the convolution filter $w_{mn}^l$ of layer l, and $b_n^l$ is the bias of the nth feature map of layer l. $M_n$ is the set of feature maps of the previous layer l − 1 that are connected to the nth feature map of layer l, $h^l$ and $v^l$ denote the horizontal and vertical step sizes of the CL in layer l, and $\phi^l$ is the activation function of layer l. The activation function $\phi^l$ used in this CNN model is the rectified linear unit (ReLU), computed as

$\phi^l = f(x) = \max(x, 0) = \begin{cases} 0, & x < 0 \\ x, & x \ge 0 \end{cases}$  (3)

After convolution, there is a pooling layer, which is used to reduce the dimension of the image; generally, max pooling or average pooling is used in this layer. The output of the pooling layer is defined as

$y_n^l(x, y) = \phi^l\Big( w_n^l \sum_{(i,j) \in S^l} y_n^{l-1}\big(x S_x + i,\; y S_y + j\big) + b_n^l \Big)$  (4)

where $S^l = \{(i, j) \in \mathbb{N}^2 \mid 0 \le i < S_x^l;\ 0 \le j < S_y^l\}$, $S_x^l$ and $S_y^l$ define the width and height of the subsampling filter of layer l, $b_n^l$ is the bias of the nth feature map in layer l, $w_n^l$ is the weight of feature map n in layer l, and $\phi^l$ is the activation function of layer l. The final layer is the fully connected layer, where classification starts. In this dense layer, neurons from the preceding layer are fully connected to each and every neuron in the present layer. The equation used for this layer is as follows:


$y^l(j) = \phi^l\left( \sum_{i=1}^{N^{l-1}} y^{l-1}(i)\, w^{(l)}(i, j) + b^{(l)}(j) \right)$  (5)

where $N^{l-1}$ represents the number of neurons in the prior layer l − 1, $w^{(l)}(i, j)$ represents the weight of the connection from neuron i in layer l − 1 to neuron j in layer l, $b^{(l)}(j)$ represents the bias of the jth neuron in layer l, and $\phi^l$ denotes the activation function of layer l. The activation function used in this layer is the softmax, which is used to find the classification probabilities and is given by (Fig. 4)

$f(x)_l = \dfrac{e^{x_l}}{\sum_j e^{x_j}}$  (6)

Fig. 4 ICD dataset labeled classes


Fig. 5 ICD training data

4 Experiment Results and Comparisons 4.1 Dataset The dataset used in this model is the ICD dataset, extracted from kaggle.com, where it is publicly available. It consists of 560 images of 8 categories of Indian classical dance forms: Bharatnatyam, Odissi, Kathakali, Manipuri, Kuchipudi, Mohiniyattam, Sattriya, and Kathak. Among them, 364 images are used for training, as shown in Fig. 5, and 156 images are used for testing. To fit the model, the training dataset is split into two parts: 75% is used for training and 25% for validation.

4.2 Performance Analysis During the training phase, the CNN model produces class scores for the training dance images, and the categorical cross-entropy loss (also referred to as softmax loss) is computed. It is defined as

$H(p, q) = -\sum_x p(x) \log q(x)$  (7)
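As a small numeric illustration (not taken from the paper), for a single image of class Kathak with one-hot ground truth p and a predicted probability q(Kathak) = 0.7, the loss is $H(p, q) = -\log 0.7 \approx 0.357$; a confident correct prediction of q = 0.99 would instead give about 0.01.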


Here, p and q are the ground-truth probability and the estimated probability; they are used to measure the performance of the output of the fully connected layer with softmax activation for multi-class classification over the 8 dance classes. To manage the learning rate, the Adam optimizer is used; it is a combination of RMSProp and momentum and uses the squared gradient to scale the learning-rate parameters. We take an initial learning rate of 0.0001. To reduce the loss, the network weights are updated using the gradient descent (GD) approach by backpropagating (BP) the gradients with respect to the loss function. The model was trained for 30 epochs. The class-score probabilities and classification accuracy over the validation images were generated for each epoch using the weights learnt from the training dance images. After training was completed, the learned weights corresponding to the highest validation accuracy were used for testing. During the testing phase, the trained ICD CNN model generates class scores for an input dance image and predicts the image's class from the highest-probability class score. In this work, we first trained a plain CNN model with the layers shown in Fig. 6, using 75% training data and 25% validation data for 30 epochs. It results in 100% training accuracy but only 38.5% validation accuracy. The training and validation accuracy for this model are shown in Fig. 8a, and the training and validation loss in Fig. 8b. This leads to an over-fitting problem, since the dataset is small. The prediction results are shown in Fig. 7.

Fig. 6 CNN model without transfer learning


Fig. 7 ICD classification on test dataset

Fig. 8 a ICD training and validation accuracy. b ICD training and validation loss

Subsequently, we trained our proposed model with VGG16 using the layers shown in Fig. 9; VGG16 is a pre-trained model that has been trained on the large ImageNet dataset. Using transfer learning, we trained our new model on top of the pre-trained model. The model was trained for a total of 50 epochs with an initial learning rate of 0.0001. To improve the model performance, we used the Adam optimizer, and we obtained a training accuracy of 99.6% and a validation accuracy of 85.4%. The testing and validation results are shown in Fig. 11a, b. The accuracy graph for training and


Model: "model"
Layer (type)                                        Output Shape              Param #
input_1 (InputLayer)                                [(None, 224, 224, 3)]     0
block1_conv1 (Conv2D)                               (None, 224, 224, 64)      1792
block1_conv2 (Conv2D)                               (None, 224, 224, 64)      36928
block1_pool (MaxPooling2D)                          (None, 112, 112, 64)      0
block2_conv1 (Conv2D)                               (None, 112, 112, 128)     73856
block2_conv2 (Conv2D)                               (None, 112, 112, 128)     147584
block2_pool (MaxPooling2D)                          (None, 56, 56, 128)       0
block3_conv1 (Conv2D)                               (None, 56, 56, 256)       295168
block3_conv2 (Conv2D)                               (None, 56, 56, 256)       590080
block3_conv3 (Conv2D)                               (None, 56, 56, 256)       590080
block3_pool (MaxPooling2D)                          (None, 28, 28, 256)       0
block4_conv1 (Conv2D)                               (None, 28, 28, 512)       1180160
block4_conv2 (Conv2D)                               (None, 28, 28, 512)       2359808
block4_conv3 (Conv2D)                               (None, 28, 28, 512)       2359808
block4_pool (MaxPooling2D)                          (None, 14, 14, 512)       0
block5_conv1 (Conv2D)                               (None, 14, 14, 512)       2359808
block5_conv2 (Conv2D)                               (None, 14, 14, 512)       2359808
block5_conv3 (Conv2D)                               (None, 14, 14, 512)       2359808
block5_pool (MaxPooling2D)                          (None, 7, 7, 512)         0
global_average_pooling2d (GlobalAveragePooling2D)   (None, 512)               0
fc-1 (Dense)                                        (None, 4096)              2101248
dropout (Dropout)                                   (None, 4096)              0
fc-2 (Dense)                                        (None, 4096)              16781312
dropout_1 (Dropout)                                 (None, 4096)              0
output_layer (Dense)                                (None, 8)                 32776
Total params: 33,630,024
Trainable params: 18,915,336
Non-trainable params: 14,714,688

Fig. 9 ICD CNN model with transfer learning (VGG16)


validation data for each epoch is shown in Fig. 10a, and the training and validation loss are depicted in Fig. 10b. Table 1 illustrates the per-class accuracy of the model on the ICD dataset. Compared to the earlier model, our model classifies 8 categories of dances and achieves higher accuracy for each dance form class than the earlier method (Table 1). Figure 12a shows the performance metrics of the proposed ICD classification model, and Fig. 12b shows the confusion matrix (CM), which represents the accuracy of the model.

Fig. 10 a ICD training and validation accuracy (Using Transfer Learning). b ICD training and validation loss (Using Transfer Learning)

Fig. 11 a ICD validation results b ICD test data results

Table 1 Model accuracy for ICD dataset

Dance/model | Our model | Ashwini Dayanand et al. [5]
Bharatnatyam | 0.91 | 0.59
Kathak | 0.89 | 0.751
Kathakali | 0.96 | 0.881
Kuchipudi | 0.95 | —
Manipuri | 0.91 | —
Mohiniyattam | 0.94 | —
Odissi | 0.97 | 0.647
Sattriya | 0.90 | —
Yakshagana | — | 0.787

Fig. 12 a Performance metrics, b confusion matrix

5 Conclusion and Future Enhancement Deep learning is the most admired technology for computer vision tasks such as image classification and image recognition, and it can be used to train large datasets. In our proposed work, we built a new model combining the concept of transfer learning with a CNN model for the classification of Indian classical dance forms. It can classify 8 categories of dance forms: Bharatnatyam, Odissi, Kathak, Kathakali, Manipuri, Kuchipudi, Mohiniyattam, and Sattriya. To build our model, we use the pre-trained VGG16 model, which transfers its learned features to our new model. The new model, with the existing features and new fully connected layers, is trained with the new training dataset to classify the dance images. Compared to earlier methods, our ICD model achieves a high accuracy of 85.4% with lower training and validation loss. This version can be used in developing an automated system for dance quizzes, can be used by any individual to evaluate the performance of dance styles in India across varying postures and styles, and can also be used to develop dance training apps that help people learn classical dance without a physical dance teacher and check whether the dance movements are


correct or not. As a future enhancement, we will develop a model for dance video sequences with a larger dataset for dance pose estimation of Indian classical dance forms.

References
1. Pandey S, Supriya M, Shrivastava A (2018) Data classification using machine learning approach. In: Proceedings of 3rd international symposium on intelligent system technologies and application, vol 683, pp 112–122
2. Tamuly S, Jyotsna C, Amudha J. Deep learning model for image classification. In: International conference on computational vision and bio inspired computing (ICCVBIC 2019)
3. Han D, Liu Q, Fan W (2018) A new image classification method using CNN transfer learning and web data augmentation. Expert Syst Appl 95(1):43–56
4. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. Published as a conference paper at ICLR 2015
5. Naik AD, Supriya M (2020) Classification of Indian classical dance images using convolution neural network. In: International conference on communication and signal processing
6. Kishore PVV, Kumar KVV, Kiran Kumar E, Sastry ASCS, Teja Kiran M, Anil Kumar D, Prasad MVD (2018) Indian classical dance action identification and classification with convolutional neural networks. Article in Hindawi:5141402
7. Kishore PVV, Kumar KVV, Kiran Kumar E, Anil Kumar D (2018) Indian classical dance action identification using adaboost multiclass classifier on multifeature fusion. In: Proceedings of the IEEE conference:17632790
8. Bisht A, Bora R, Saini G, Shukla P (2017) Indian dance form recognition from videos. In: Proceedings of the IEEE conference:7699290
9. Samanta S, Purkait P, Chanda B (2012) Indian classical dance classification by learning dance pose bases. In: 2012 IEEE workshop on the applications of computer vision (WACV):12577759
10. Anami BS, Bhandage VA (2019) A comparative study of suitability of certain features in classification of Bharatanatyam Mudra images using artificial neural network, vol 50. Springer, pp 741–769
11. Parameshwaran AP, Desai HP, Sunderraman R, Weeks M (2019) Transfer learning for classifying single hand gestures on comprehensive Bharatanatyam Mudra dataset. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR)
12. Gautam S, Joshi G, Garg N (2017) Classification of Indian classical dance steps using HOG features. Int J Adv Res Sci Eng (IJARSE) 6(08)
13. Samanta S, Chanda B (2014) Indian classical dance classification on manifold using Jensen-Bregman LogDet divergence. In: Proceedings of 22nd international conference on pattern recognition:14808923
14. Roth PM, Mauthner T, Khan I, Bischof H (2009) Efficient human action recognition by cascaded linear classification. In: Proceedings of IEEE 12th international conference on computer vision workshops (ICCV):11284193
15. Reddy KK, Shah M (2012) Recognizing 50 human action categories of web videos. Springer, pp 971–98
16. Das S, Dutta S, Benerjee D, Ghosal A. Classification of Bharatnatyam and Kathak dance form through audio signal. In: Emerging technologies and data mining and information security, pp 671–679

Skin Cancer Classification for Dermoscopy Images Using Model Based on Deep Learning and Transfer Learning

Vikash Kumar and Bam Bahadur Sinha

Abstract Cases of cancer are often misdiagnosed at an early stage, resulting in serious repercussions, including patient death. There are also instances in which individuals have other issues that are misdiagnosed as skin cancer by physicians; as a result, time and money are wasted on needless diagnostic procedures. In this paper, a convolutional neural network and transfer learning architecture to address both of these issues is discussed. Both training and testing of the proposed model were done using the publicly accessible ISIC 2019 Skin Lesion dataset. For automated skin lesion analysis, the deep learning models produce excellent results. The proposed model, used to classify melanoma and non-melanoma without any augmentation, yielded an AUC of 0.812, precision of 71%, recall of 97% and F1-score of 85%.
Keywords CNN · ISIC dataset · Melanoma · Transfer learning · Skin lesion analysis

1 Introduction Skin cancer is perhaps one of the most often diagnosed malignancies. Survival chances are extremely good if it can be identified in its early stages and the proper medication chosen. As a result, it is crucial to know as soon as possible whether a patient's symptoms are related to cancer. Doctors have traditionally detected skin cancer with the naked eye. However, since humans make errors, this often leads to less accurate detection. Even specialists struggle, particularly when the cancer is still in its early stages. That is where computer vision comes in to automate the
V. Kumar (B) Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand 247667, India
e-mail: [email protected]
B. B. Sinha Computer Science and Engineering, Indian Institute of Information Technology Ranchi, Ranchi, Jharkhand 834010, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_19


whole process. Thousands of images from both melanoma and non-melanoma categories may be used to train a deep neural network. The model aims to determine whether a new test image belongs to the melanoma or non-melanoma category by learning nonlinear relationships. This automation may not only be more efficient in eliminating both false positives and false negatives, but might even be more accurate in terms of classification. Melanoma skin cancer arises in melanocytes, the skin cells that produce melanin. Among all types of skin cancer, melanoma represents 1% of cases but 75% of deaths [1]. Early detection is very important because melanoma is curable at an early stage, and the possibility of being cured deteriorates with the passage of time [1]. Deep learning with image processing plays an important role in detecting melanoma. Several researchers have proposed different approaches [2] for detecting melanoma, but achieving higher detection accuracy is still a major challenge. The advancement of dermoscopy technology has the potential to significantly improve melanoma diagnosis accuracy and, as a result, patient survival rates. Dermoscopy [3] is a skin imaging method that reveals the underlying skin structure by using polarised light to make the contact region transparent. Manual assessment of the dermoscopy image, on the other hand, is often time-consuming, experience-based and subjective. As a result, computer-aided diagnosis (CAD) was created to offer dermatologists a quick, quantitative and objective assessment. The diagnosis accuracy of melanoma may be improved from 75 to 92% when experienced dermatologists utilise CAD systems to assess skin lesions, according to [4]. Automated melanoma categorisation for dermoscopy images, on the other hand, is a difficult job. To begin with, melanoma does not have a uniform appearance: melanomas often have a wide range of colour, texture and form, making it difficult to glean reliable characteristics. Secondly, hairs, veins and bubbles may be seen in numerous dermoscopy images, and these distracting anomalies make it difficult to detect melanoma. Lastly, the effectiveness of algorithms is severely limited by data scarcity [5] and class imbalance in dermoscopy image datasets. In this paper, the main focus is on performing super-resolution of single images to handle the problem of estimating the respective high-resolution images. One base method (a deep convolutional neural network) [6] has been used, which tries to learn prior information by observing the mapping between low-resolution and high-resolution pairs of different images. The proposed model takes advantage of transfer learning to analyse problems that have already been encountered and learnt beforehand; this also spares the model the cumbersome job of learning from scratch at the beginning of training. The proposed model uses five previously trained models, i.e. ResNet50, DenseNet169, DenseNet121, DenseNet201 and MobileNet. All five pre-trained models use pre-trained weights for learning. The remaining sections of the paper are structured as follows: Sect. 2 discusses the related background of the problem. Section 3 highlights the experimental methodologies used in the proposed architecture. Section 4 demonstrates the efficacy of the


proposed model by discussing the results obtained on the ISIC 2019 Skin Lesion dataset. The closing Sect. 5 concludes the paper with future work that can be done to further improve the model.

2 Related Background Melanoma categorisation has been researched for many years and can be found in the literature dating back to 1987 [7]. Feature extraction and classifier construction are typically the two steps in conventional melanoma classification techniques. Ali et al. [8], for example, retrieved shape and colour characteristics from dermoscopy images and used these features to distinguish between normal and cancerous lesions using an artificial neural network (ANN). Lee et al. [9] retrieved 437 colour and texture characteristics, of which 18 were chosen as the best for training a support vector machine classifier. Xie et al. [10] used different characteristics such as texture, colour and border to build a neural network (NN) model [11] that can distinguish between benign nevi and melanoma skin lesions. Full lesions are not always captured by dermoscopy images, and Barata et al. [6] retrieved local characteristics in a patch and classified lesions using a bag-of-features model. Traditional classification techniques are generally not powerful enough for complicated skin lesions, since the retrieved characteristics are limited and handcrafted. Deep learning methods [12], particularly CNNs, have recently demonstrated exceptional performance and generalisation ability in a variety of image processing tasks. Deep learning can extract higher-level and more robust features from the raw image data, allowing it to learn many layers of representation. Deng et al. [13] constructed a two-level CNN to capture global and local characteristics in order to determine lesion borders. Yuan et al. [14] developed a new loss function for their CNN model, centred on the Jaccard distance, to enhance its efficacy on dermoscopy image segmentation. For melanoma and non-melanoma classification, there are three major distinctions between deep learning and conventional machine learning techniques. Deep learning (DL) methods [15], for example, may deduce hierarchical characteristics from the raw dermoscopy pictures without requiring the formation of handcrafted features. Moreover, since DL networks are often trained end to end, they may predict the type of skin lesion even without requiring segmentation, despite the fact that segmentation is an essential step prior to classification in the traditional framework. Finally, DL networks typically contain many parameters, necessitating large amounts of training data. We present a new model based on CNN and transfer learning to automatically distinguish melanomas from non-melanomas.


3 Experimental Methodology This section discusses the different methodologies that are combined to formulate the proposed architecture. A detailed description of each methodology used in the proposed architecture is given in the following subsections.

3.1 Convolutional Neural Network The concept of a convolutional neural network is taken from artificial neural networks, in which multiple neurons are connected to each other column-wise and exchange messages with one another. The basic architecture of a CNN comprises three layers: the input layer (first layer), the output layer (last layer) and the hidden layers (intermediary layers). CNNs are being adopted in many biomedical image processing research works for classification and detection. Each node (component) in the first layer is linked to all the other nodes present in the second layer in a fully connected network. The initial layer of the neural network (NN) is used to transfer the initial input into the system, which is then processed by successive layers of the NN. The primary component of the NN is built using a fully connected layer, a series of convolutional layers and a set of max pooling layers. Since they are all overlaid with an activation function, these layers are termed hidden layers. Figure 1 illustrates the architecture of a CNN. Some of the major components of a CNN are as follows (a minimal sketch combining these components follows the list):
• Convolution layer: The convolution layer consists of a set of filters whose parameters are to be learned. The width and height of the filters are small in comparison with those of the input volume. For computing the activation map, each filter of the convolution layer is convolved with the input volume. Each filter is made of neurons. In other words, convolution layer filters slide towards the height

Fig. 1 Architecture of CNN


and width of the input, and the dot product with the input is calculated at each spatial position to get the output volume for the convolution layer. The activation maps of all filters are stacked along the depth dimension. Since the filter's height and width are smaller than those of the input, each neuron's receptive field is very small and equal to the filter size.
• Fully connected layer: The fully connected layer receives input from the convolution layer or from the pooling layer and generates the result based on the objective of the model. The fully connected layer takes its input in the form of a vector.
• Pooling layer: This layer's job is to lower the parameter count when the input image matrix is too big. Spatial pooling, also known as subsampling or down-sampling, decreases the dimension of the maps but preserves crucial information. There are several types of pooling, namely max pooling, sum pooling and average pooling; max pooling takes the largest value under its filter.
• Activation function: In neural networks, an activation function is used to activate the neurons (nodes). Using the activation function, we calculate a value that tells us whether or not to fire that particular neuron, so we need to choose an appropriate activation function for a good neural network model. It performs nonlinear transformations over the input data, and the transformed output is sent to the next layer of neurons as input. Some popular activations are the rectified linear unit (ReLU), leaky ReLU (an improved version of ReLU), sigmoid and tanh.
• Loss function: The loss function is used to evaluate how well the learning algorithm predicts the outcome, and the learning algorithm tries to improve itself by minimising it. There are many loss functions, applied depending on the situation; for example, the mean square error (MSE) is the sum of the squared differences between the predicted and actual observations, divided by the number of training examples in the dataset, and is most widely used in linear regression problems. Cross-entropy loss is also widely used; it uses the log function to compute the loss after each epoch, and it increases as the predicted value diverges from the actual value.
• Optimiser: Optimisers are techniques or strategies for altering the characteristics of a neural network, such as the learning rate and weights, in order to minimise the loss. Optimisers are employed to address optimisation issues by minimising the loss function.
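As a hedged illustration of how these components fit together (not the architecture proposed in this paper), a minimal Keras CNN for two-class dermoscopy images might look as follows; the input size, filter counts and dense width are placeholders.

import tensorflow as tf
from tensorflow.keras import layers, models

# Convolution + pooling blocks followed by a fully connected classifier head.
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),  # convolution layer
    layers.MaxPooling2D((2, 2)),                                   # pooling layer
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),                          # fully connected layer
    layers.Dense(2, activation="softmax"),                         # melanoma vs non-melanoma
])

# Loss function and optimiser, as described in the bullet list above.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()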

3.2 Transfer Learning Transfer learning can be defined as a method of learning in which a model reuses previous learning to solve a second, related task. In deep learning, it plays a very important role because it leverages previous learning to solve a new type of problem. For computer vision problems, transfer learning gives accurate


results and makes the model faster. Transfer learning is used here for classification with the help of this learning process. Instead of learning from the beginning, the model starts from a pattern that was learned while solving another problem. There are advantages to using transfer learning: it is very simple to incorporate, and it does not need as much labelled data. The model starts directly from what was learned while solving different types of problems; in this way, transfer learning takes advantage of previously learned problems and the model avoids learning from scratch. It is usually realised through a pre-trained model for image classification. A pre-trained model can be defined as a model that has been trained on a large benchmark dataset to solve a problem similar to the one we need to solve. In this paper, we have used five previously trained models, i.e. ResNet50, DenseNet169, MobileNet, DenseNet121 and DenseNet201, all of which use pre-trained weights for learning (a transfer-learning sketch using one of these backbones follows the list below).
i. ResNet50: ResNet50 is a variant of the ResNet model with 48 convolution layers; it also contains 1 max pooling and 1 average pooling layer. The number of floating point operations in ResNet50 is 3.8 × 10^9. ResNet is one of the most famous and most widely used models. On our dataset, the ResNet50 architecture was retrained by replacing the higher layers with average pooling, a fully connected layer and, ultimately, the softmax layer, which enables the two diagnostic categories to be classified. All images were resized for the model to (225, 225). The learning rate was 0.00001, and the Adam optimiser was utilised. The mapping of inputs is done using identity mappings. An identity mapping has no parameters, and the output of the preceding layer is simply added to the layer directly ahead. A linear projection multiplies the identity mapping so that the shortcut channels match the remainder of the image. The skip connections between layers combine the outputs of the preceding layer with the outputs of the stacked layer, which makes it possible to train networks considerably deeper than before. Figure 2 [16] illustrates the architecture of ResNet50.
ii. DenseNet169: It establishes direct connections between any two layers with the same feature-map size. DenseNet performs well even for hundreds of layers without showing optimisation difficulties. To solve the vanishing-gradient problem, the simple connectivity pattern used by DenseNet169 ensures maximum information flow between the layers, in both forward propagation and backward computation. In the DenseNet169 architecture, the feature maps of each layer are passed as input to all subsequent layers, which makes the down-sampling process for this architecture very easy. The entire DenseNet169 architecture is divided into several densely connected blocks; the layers between these blocks, called transition layers, perform convolution and pooling operations. The architecture of DenseNet169 is illustrated in Fig. 3.
iii. MobileNet: The first layer of the MobileNet architecture is a convolutional layer, and the network is built on depth-wise separable convolutions. All layers are followed by batch normalisation and ReLU non-linearity, except that the final layer is a fully connected layer with no non-linearity. Generally, for other

Skin Cancer Classification for Dermoscopy Images …

263

Fig. 2 Architecture of ResNet50

architectures, there are feeds to the Softmax activation for the purpose of down sampling, but in MobileNet there is no feed network. For both the first fully convolutional layer and depth-wise convolution, strided convolution is used. When incorporating point-wise and depth-wise convolution, MobileNet contains a total of 28 layers, with point-wise convolution acting as a separate layer. Figure 4 represents the architecture of MobileNet. iv. DenseNet121: DenseNet121 architecture consists of total 121 layers (3 transitions, 117 layers and 1 classification) in which each layer is directly connected to other layer in a feed-forward fashion. v. DenseNet201: It is a convolutional layer which consist of 201 layers.
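To make the retraining procedure concrete, the sketch below shows how a pre-trained backbone can be adapted in Keras in the way described above (frozen ImageNet weights, average pooling, a fully connected layer and a softmax head, Adam with a learning rate of 0.00001). This is an illustrative sketch rather than the authors' code; the dense-layer width and the exact preprocessing are assumptions.

```python
# Illustrative transfer-learning sketch (not the authors' exact code):
# a frozen ResNet50 backbone with a new average-pooling + dense + softmax head
# for the two diagnostic categories (melanoma vs. non-melanoma).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_transfer_model(input_shape=(225, 225, 3), num_classes=2):
    # Load ImageNet weights and drop the original classification head
    base = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet", input_shape=input_shape)
    base.trainable = False  # start by freezing the pre-trained layers

    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dense(256, activation="relu")(x)        # width is an assumption
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = models.Model(inputs=base.input, outputs=outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_transfer_model()
model.summary()
```

The same pattern applies to the DenseNet and MobileNet backbones by swapping the `ResNet50` constructor for the corresponding class in `tf.keras.applications`.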

3.3 Evaluation Metrics

We have used the evaluation metrics suggested in the ISIC 2019 challenge for benchmarking. These performance metrics include accuracy (AC), precision (P), F1-Score, recall (R) and ROC-AUC. Precision measures the fraction of predicted positive outcomes that are true positives, while the F1-Score combines precision and recall to capture the trade-off between them.


Fig. 3 Architecture of DenseNet169

Equation (1) defines these metrics in terms of tp, tn, fp and fn, which denote the numbers of true positives, true negatives, false positives and false negatives, respectively:

$$
\begin{aligned}
\text{Accuracy (AC)} &= \frac{tp + tn}{tp + tn + fp + fn}\\
\text{Precision (P)} &= \frac{tp}{tp + fp}\\
\text{Recall (R)} &= \frac{tp}{tp + fn}\\
\text{F1-Score} &= \frac{2 \times P \times R}{P + R}
\end{aligned}
\tag{1}
$$
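For reference, the same metrics can be computed directly with scikit-learn; the helper below is a hypothetical illustration, not part of the paper's code, and assumes binary labels and predicted probabilities.

```python
# Hypothetical evaluation helper using scikit-learn (not from the paper).
# y_true: ground-truth labels (0/1); y_prob: predicted probabilities of class 1.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(y_true, y_prob, threshold=0.5):
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)  # hard labels
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
        "roc_auc":   roc_auc_score(y_true, y_prob),  # AUC uses probabilities
    }
```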

4 Proposed Architecture

This section discusses the details of the proposed flow model, the experimental setup and the dataset. The proposed flow model is divided into five stages, and the results obtained at each stage are discussed in the upcoming section.


Fig. 4 Architecture of MobileNet

4.1 Flow Model

The proposed model is illustrated in Fig. 5. The complete flow model comprises five stages: the first stage deals with transforming the ISIC 2019 Skin Lesion dataset, and the next stage divides the dataset into training and validation sets. The third stage performs transfer learning using ResNet50, DenseNet169, DenseNet121, DenseNet201 and MobileNet. After transfer learning, the model performs prediction using the deep CNN model. The efficacy of the proposed model is tested using different performance measures, namely accuracy, recall, precision, F1-Score and ROC-AUC.


Fig. 5 Proposed model

4.2 Dataset Description

The proposed model was evaluated using the ISIC 2019 Skin Lesion dataset. The dataset consists of 10,000 training images and 5000 validation images of size 224 × 224, with the two classes represented equally in the training and validation sets. The dataset is publicly available and can be used as a benchmark for evaluating skin lesion analysis frameworks. Some sample images are shown in Fig. 6.

4.3 Experimental Setup

All experiments were performed on the NVIDIA DGX-1 cluster, which has 960 NVIDIA CUDA cores and 512 GB of 2.133 GHz DDR4 RAM. We used an NVIDIA V100 GPU for all experiments. The Adam optimiser was used to minimise the loss. The


Fig. 6 Sample dataset images (melanoma vs. non-melanoma)

learning rate was set at 0.00001, and we used batch sizes of 32 and 64 for all models. Each model was trained for 50 epochs. The ISIC 2019 Skin Cancer dataset has its own validation set, which we used as provided.
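A minimal training loop matching this setup might look as follows. It is a sketch under stated assumptions (directory layout and image size are placeholders, and `build_transfer_model` refers to the earlier illustrative function), not the authors' script.

```python
# Assumed training setup: Adam at 1e-5 (set inside the model), batch size 64,
# 50 epochs, separate training and validation folders. Paths are placeholders.
import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "isic2019/train", image_size=(225, 225), batch_size=64, label_mode="categorical")
val_ds = tf.keras.utils.image_dataset_from_directory(
    "isic2019/val", image_size=(225, 225), batch_size=64, label_mode="categorical")

model = build_transfer_model()                     # from the earlier sketch
history = model.fit(train_ds, validation_data=val_ds, epochs=50)
```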

5 Experimental Outcomes and Observation

The performance of the proposed model is demonstrated via different measures, namely accuracy, recall, precision, F1-Score and ROC-AUC. Table 1 lists the parameter details of the ResNet50, DenseNet169, DenseNet121, DenseNet201 and MobileNet models, and the results obtained for different training sizes are given in Table 2. In this section we discuss the results and show how the trained models were validated; the classification task involves two categories, and the findings are reported using standard classification metrics. After analysing the results, we observed that every model achieved better results with a batch size of 64 than with 32, as shown in Table 3. Table 4 shows the impact of the learning rate on the outcomes, and Table 5 compares Adam [17], SGD [18], RMSprop, Adadelta, Nadam and Adamax as optimisers. Compared with MobileNet, DenseNet169, DenseNet121 and DenseNet201, we found that ResNet50 as the backbone produces superior results; Table 6 compares the effect of the different pre-trained models on the outcomes.


Table 1 Parameter details of the models

Model | Total parameters | Trainable | Non-trainable
ResNet50 | 3,235,010 | 3,211,074 | 23,936
DenseNet169 | 23,600,002 | 23,542,786 | 57,216
MobileNet | 12,652,866 | 12,491,138 | 161,728
DenseNet121 | 7,043,650 | 6,957,954 | 85,696
DenseNet201 | 18,333,506 | 18,100,610 | 232,896

Table 2 Effect of training size on the overall performance

Training size | Recall (%) | Precision (%) | F1-score (%) | ROC-AUC
5000 | 86 | 67 | 76 | 0.806
10000 | 97 | 71 | 85 | 0.812

Table 3 Effect of batch size on the performance of the model

Batch size | Precision (%) | Recall (%) | F1-score (%)
32 | 70 | 95 | 83
64 | 71 | 97 | 85

Table 4 Effect of the learning rate on the final result

Learning rate | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | ROC-AUC
0.00001 | 94.2 | 71 | 97 | 85 | 0.812
0.0001 | 91.1 | 71 | 98 | 86 | 0.801
0.001 | 84.0 | 70 | 84 | 74 | 0.806
0.01 | 85.1 | 71 | 87 | 77 | 0.812

Table 5 Effect of different optimisers on the result

Optimiser | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | ROC-AUC
Adam | 94.2 | 71 | 97 | 85 | 0.812
SGD | 81.1 | 89 | 83 | 86 | 0.788
RMSProp | 89.4 | 72 | 97 | 85 | 0.801
Adadelta | 79.2 | 88 | 79 | 83 | 0.783
Nadam | 89 | 72 | 97 | 86 | 0.804
Adamax | 87.9 | 76 | 85 | 72 | 0.803

Table 6 Comparison of the pre-trained models

Pre-trained model | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | ROC-AUC
ResNet50 | 94.2 | 71 | 97 | 85 | 0.812
DenseNet169 | 88.8 | 73 | 91 | 84 | 0.750
MobileNet | 85.7 | 73 | 82 | 71 | 0.745
DenseNet121 | 85.1 | 72 | 97 | 83 | 0.743
DenseNet201 | 85.9 | 72 | 90 | 84 | 0.803

Table 7 Performance comparison with state-of-the-art models

Method | Accuracy | ROC-AUC
Transfer Learning [19] | 93.63% | –
Threshold Jaccard [20] | 93.1% | –
ResNet (multiclass) [21] | – | 0.906
ResNet (Binary) [22] | – | 0.913
ResNet (Ensembled) [23] | – | 0.915
Proposed model | 94.2% | 0.812

Table 7 shows the comparison with prior state-of-the-art results. Pomponiu et al. [19] obtain higher accuracy because they utilised the whole dataset; since our goal was to demonstrate comparable outcomes with a smaller number of images, our accuracy is adequate given the number of images on which the model was trained. Bi et al. [23] report a significantly higher ROC-AUC than our proposed model because they trained an ensemble of CNNs, whereas we used a single CNN framework. Apart from that, our study is consistent with the most up-to-date findings in skin cancer categorisation.

6 Conclusion and Future Work

This research has examined the capacity of deep CNNs to distinguish between melanoma and non-melanoma skin cancers. Our findings suggest that several existing deep learning architectures trained on dermoscopy images (10,000 training and 5000 validation) can outperform dermatologists. We demonstrated that by fine-tuning very deep CNNs we can obtain higher diagnostic accuracy than experienced doctors and clinicians. Despite the absence of a pre-processing step in this research, the experimental findings are very encouraging. To aid dermatologists, these models may be readily incorporated into dermoscopy machines. For future development, additional diversified datasets (different categories and varying ages) with significantly more dermoscopy images are required. Additionally, utilising the information associated with each image may help enhance the performance of the models.


References
1. Menegola A, Fornaciali M, Pires R, Bittencourt FV, Avila S, Valle E (2017, April) Knowledge transfer for melanoma screening with deep learning. In: 2017 IEEE 14th International symposium on biomedical imaging (ISBI 2017). IEEE, New York, pp 297–300
2. Adegun A, Viriri S (2021) Deep learning techniques for skin lesion analysis and melanoma cancer detection: a survey of state-of-the-art. Artif Intell Rev 54(2):811–841
3. Phillips M, Greenhalgh J, Marsden H, Palamaras I (2020) Detection of malignant melanoma using artificial intelligence: an observational study of diagnostic accuracy. Dermatol Pract Conceptual 10(1)
4. Almeida MA, Santos IA (2020) Classification models for skin tumor detection using texture analysis in medical images. J Imaging 6(6):51
5. Sinha BB, Dhanalakshmi R, Regmi R (2020) TimeFly algorithm: a novel behavior-inspired movie recommendation paradigm. Pattern Anal Appl 23(4):1727–1734
6. Barata C, Ruela M, Francisco M, Mendonça T, Marques JS (2013) Two systems for the detection of melanomas in dermoscopy images using texture and color features. IEEE Syst J 8(3):965–979
7. Cascinelli N, Ferrario M, Tonelli T, Leo E (1987) A possible new tool for clinical diagnosis of melanoma: the computer. J Am Acad Dermatol 16(2):361–367
8. Ali N, Quansah E, Köhler K, Meyer T, Schmitt M, Popp J, Bocklitz T (2019) Automatic label free detection of breast cancer using nonlinear multimodal imaging and the convolutional neural network ResNet50. Trans Biophotonics 1(1–2):e201900003
9. Lee HD, Mendes AI, Spolaor N, Oliva JT, Parmezan ARS, Wu FC, Fonseca-Pinto R (2018) Dermoscopic assisted diagnosis in melanoma: reviewing results, optimizing methodologies and quantifying empirical guidelines. Knowl-Based Syst 158:9–24
10. Xie F, Fan H, Li Y, Jiang Z, Meng R, Bovik A (2016) Melanoma classification on dermoscopy images using a neural network ensemble model. IEEE Trans Med Imaging 36(3):849–858
11. Sinha BB, Dhanalakshmi R (2020) Building a fuzzy logic-based artificial neural network to uplift recommendation accuracy. Comput J 63(11):1624–1632
12. Greenspan H, Van Ginneken B, Summers RM (2016) Guest editorial deep learning in medical imaging: overview and future promise of an exciting new technique. IEEE Trans Med Imaging 35(5):1153–1159
13. Deng Z, Fan H, Xie F, Cui Y, Liu J (2017, September) Segmentation of dermoscopy images based on fully convolutional neural network. In: 2017 IEEE International conference on image processing (ICIP). IEEE, New York, pp 1732–1736
14. Yuan Y, Chao M, Lo YC (2017) Automatic skin lesion segmentation using deep fully convolutional networks with Jaccard distance. IEEE Trans Med Imaging 36(9):1876–1886
15. Sinha BB, Dhanalakshmi R (2021) Building a fuzzy logic-based McCulloch-Pitts Neuron recommendation model to uplift accuracy. J Supercomput 77:2251–2267
16. Ji Q, Huang J, He W, Sun Y (2019) Optimized deep convolutional neural networks for identification of macular diseases from optical coherence tomography images. Algorithms 12(3):51
17. Jiang X, Hu B, Chandra Satapathy S, Wang SH, Zhang YD (2020) Fingerspelling identification for Chinese sign language via AlexNet-based transfer learning and Adam optimizer. Sci Programming
18. Wu Z, Ling Q, Chen T, Giannakis GB (2020) Federated variance-reduced stochastic gradient descent with robustness to byzantine attacks. IEEE Trans Sign Process 68:4583–4596
19. Pomponiu V, Nejati H, Cheung NM (2016, September) Deepmole: deep neural networks for skin mole lesion classification. In: 2016 IEEE International conference on image processing (ICIP). IEEE, New York, pp 2623–2627
20. Codella NC, Nguyen QB, Pankanti S, Gutman DA, Helba B, Halpern AC, Smith JR (2017) Deep learning ensembles for melanoma recognition in dermoscopy images. IBM J Res Dev 61(4/5):5–1


21. Yang J, Shi R, Ni B (2021, April) Medmnist classification decathlon: a lightweight automl benchmark for medical image analysis. In: 2021 IEEE 18th International symposium on biomedical imaging (ISBI). IEEE, New York, pp 191–195
22. Quang NH (2017, November) Automatic skin lesion analysis towards melanoma detection. In: 2017 21st Asia Pacific symposium on intelligent and evolutionary systems (IES). IEEE, New York, pp 106–111
23. Bi L, Kim J, Ahn E, Feng D (2017) Automatic skin lesion analysis using large-scale dermoscopy images and deep residual networks. arXiv preprint arXiv:1703.04197

Deep Neural Network Architecture for Face Mask Detection Against COVID-19 Pandemic Using Pre-trained Exception Network S. R. Surya and S. R. Resmi

Abstract The coronavirus disease (COVID-19) is an infectious disease caused by a coronavirus. The COVID-19 virus spreads mostly through droplets of saliva or discharge from the nose when an infected person coughs or sneezes, so it is important to practice respiratory etiquette. Because COVID-19 spreads through communities quickly, staying safe requires simple precautions such as physical distancing, wearing a mask, keeping rooms well ventilated, avoiding crowds, and cleaning hands, and the appropriate use of masks has become a normal part of our lives. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a novel severe acute respiratory syndrome coronavirus, and genetic variants of SARS-CoV-2 have been emerging and circulating around the world throughout the COVID-19 pandemic. To minimize the risk of transmission, the use of face masks or coverings has been recommended in public settings. Many countries and local jurisdictions encourage or mandate the use of face masks by members of the public to limit the spread of the virus, and masks are also strongly recommended for those who may have been infected and those taking care of someone who may have the disease. In this paper, face mask detection on a masked-face data set is performed using a pretrained Xception network, a deep learning architecture with depth-wise separable convolutions. The proposed method classifies, from a given face image, whether a mask is worn or not. It is tested and validated using the face mask data set obtained from Kaggle, which contains 503 face images with masks and 503 images without masks. The experimental results show that the proposed face mask detection method significantly outperforms the other pretrained models compared, and the receiver operating characteristic curve and area under the curve justify the better results in favor of the proposed method.

Keywords Covid-19 · Face mask · Machine learning · Xception · Deep learning · Classification

S. R. Surya (B) Department of Computer Science, College of Engineering Perumon, Kollam, Kerala, India e-mail: [email protected]
S. R. Resmi Thiruvananthapuram, Kerala, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_20


1 Introduction

Coronavirus disease 2019 (COVID-19) is a contagious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Convolutional neural networks have emerged as the dominant algorithms for computer vision tasks. Because COVID-19 is caused by a virus, sufficient precautions must be taken to reduce its spread. Even though vaccines are now available, they do not give 100% protection against the coronavirus; we need to protect ourselves by wearing masks and keeping social distance in public places to hamper transmission. Furthermore, as new variants and mutations of the virus appear day by day, people have to be very careful and wear a face mask whenever they interact with other people and in public places. Manually monitoring whether each and every person has worn a mask is very difficult, which necessitates the development of computer-aided techniques for automatically detecting whether a mask is worn or not. COVID-19 mainly spreads through coughing or sneezing by an infected person: when a person suffering from COVID-19 sneezes or coughs, the virus can be transferred to anyone in close contact with them, resulting in rapid spread. Face masks have been found to be an effective method of controlling the spread of the virus, so governments have imposed strict rules that everyone must wear a mask when going out. Even so, some people hesitate to wear a mask or do not wear it properly, which motivates the development of an automatic face mask detection system. This paper introduces face mask detection based on a pretrained Xception network. Although several approaches exist for automatic face mask detection using various pretrained networks, the proposed detection based on the pretrained Xception network outperforms the other pretrained approaches. This approach can be used by industries or at traffic locations, combined with automatic video capture and extraction of facial regions from images, to predict whether a mask is worn or not. The rest of this paper includes "Background", which presents the related work, "Proposed Method", "Data Set", and "Experimental Results and Discussion", which analyses the results; finally, "Conclusion" concludes the paper.

2 Background

Venkateswarlu et al. [1] used a MobileNet with a global pooling block for face mask detection. An effective solution that performs person detection, social-distancing violation detection, face detection, and face mask classification using object detection, clustering, and a convolutional neural network (CNN)-based binary classifier was proposed by Srinivasan et al. [2]. A real-time face mask detector using the YOLOv3 algorithm and a Haar cascade classifier was proposed by Vinh and Anh [3]. Xue et al. [4] proposed


an intelligent detection and recognition system for mask wearing based on an improved RetinaFace algorithm; this method determines whether a person has worn the mask properly by calculating the mask and key-point positions of the face. A mask detection model, PP-YOLO-mask, based on PP-YOLO through transfer learning was proposed by Jian and Lang [5]. Susanto et al. [6] developed a face mask detector able to detect any kind of face mask, using the YOLO V4 deep learning mask detection algorithm. A face mask detection approach with allocation of a person's ID number, using the cosine distance technique, was proposed by Maharani et al. [7]; in this method, Haar cascade and MobileNet are used to obtain the face bounding box. An artificial intelligence-based smart device using a Raspberry Pi with an AI model and camera is proposed in [8], which identifies whether a person is wearing a face mask and sends an alert message via a mobile application. The device is integrated with the mobile application, which identifies if someone enters the home when the residents are not physically present, and the smart device automatically opens the door only if the person wears a face mask. Arora et al. [9] proposed face mask detection using deep learning: a convolutional neural network model that identifies whether a person is wearing a face mask or not, trained on an image data set of 3835 images (1916 with face masks and 1919 without), achieving an accuracy of 97.98%. Singh et al. [10] proposed face mask detection using YOLOv3 and faster R-CNN models in a COVID-19 environment.

3 Proposed Method

A method to detect masked faces plays a noteworthy part in the COVID-19 situation, helping to prevent the transmission of coronavirus from one person to another. The proposed deep neural network-based face mask detection using the Xception network is shown in Fig. 1 and involves the following steps: (i) load the data set, (ii) create a deep neural network model consisting of processing units, (iii) training, and (iv) testing.

3.1 Load the Data Set

The first step in the proposed deep neural network-based face mask detection using the Xception network is to load the training, validation, and test data sets. The training data are used by the learning algorithm to learn the parameters of the model. A validation data set is a portion of data held back from the training set to get an estimate of the model's performance; it is different from the test data set. The test set is then used to


Fig. 1 Workflow of the entire model

evaluate the performance of the model. Finally, the data dimensions are normalised so that they are of approximately the same scale.
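A minimal sketch of this loading and normalisation step is shown below. It is an assumption-based illustration, not the paper's code: the folder layout is hypothetical and the 299 × 299 image size is assumed from Xception's usual input resolution.

```python
# Assumed layout: one folder per split, each with "mask" and "no_mask" subfolders.
import tensorflow as tf

def load_split(path):
    ds = tf.keras.utils.image_dataset_from_directory(
        path, image_size=(299, 299), batch_size=32, label_mode="binary")
    # Normalise pixel intensities so all inputs share approximately the same scale
    return ds.map(lambda x, y: (x / 255.0, y))

train_ds = load_split("face_mask/train")       # 300 mask + 300 no-mask images
val_ds   = load_split("face_mask/validation")  # 153 + 153 images
test_ds  = load_split("face_mask/test")        # 50 + 50 images
```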

3.2 Model Creation

The next step is to create a deep neural network model. The proposed automatic mask detection method uses a pretrained Xception network as the base model, flattens the output obtained from this base model, and adds an output classification layer on top, as shown in Fig. 2. The output layer consists of a single

Fig. 2 Model creation from pretrained Xception network


Fig. 3 Transfer learning using Xception-adding classification layer

neuron that predicts whether the person is wearing a mask or not, using the sigmoid activation function shown in Eq. (1). The Xception model [11] was proposed by François Chollet as an extension of the Inception architecture: where Inception uses standard Inception modules, Xception uses depth-wise separable convolutions. Xception is a 71-layer convolutional neural network that can be loaded with weights pretrained on more than a million images from the ImageNet database [12]. The pretrained network can classify images into 1000 object categories, so it has learned rich feature representations that are applicable to a wide range of images.

$$
f(x) = \frac{1}{1 + e^{-x}} \tag{1}
$$

In transfer learning with Xception, we take layers from this pretrained model and freeze them to avoid destroying the information they contain during future training. A trainable layer is then added on top. Since the data set obtained from Kaggle contains only a small number of images, features learned by the Xception model are reused to predict whether the person is wearing a mask or not, and only the new layers are trained on this data set. A final, optional step is fine-tuning, which consists of unfreezing the entire model obtained above (or part of it) and re-training it on the new data with a very low learning rate; this can achieve meaningful improvements by incrementally adapting the pretrained features to the new data. Figure 3 shows transfer learning using the Xception model.
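The sketch below illustrates this construction in Keras: a frozen Xception base, a flattened feature map and a single sigmoid output neuron, with an optional fine-tuning stage. It is an approximation under stated assumptions (input size, optimiser settings), not the authors' exact code.

```python
# Illustrative Xception transfer-learning model (assumptions, not the paper's code).
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.Xception(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3))
base.trainable = False                               # freeze the pretrained layers

inputs = tf.keras.Input(shape=(299, 299, 3))         # inputs scaled to [0, 1] as in Sect. 3.1
x = base(inputs, training=False)
x = layers.Flatten()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)   # Eq. (1): mask vs. no mask

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Optional fine-tuning: unfreeze the base and retrain with a very low learning rate
# base.trainable = True
# model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
#               loss="binary_crossentropy", metrics=["accuracy"])
```

For strict fidelity to the ImageNet weights one may use `tf.keras.applications.xception.preprocess_input` instead of the simple [0, 1] scaling; the exact preprocessing is an assumption here.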

3.3 Data Set

The data set used in this paper is downloaded from Kaggle and consists of a total of 1006 images: 503 face images with masks and 503 without. Out of the 503 masked face images, 300 are used for training, 153 for validation, and 50 for testing; the 503 unmasked face images are split in the same way. Since the data set contains only a limited number of images, transfer learning with the pretrained Xception network (deep learning with depth-wise separable convolutions) is used in this paper. Transfer learning means taking features learned on one problem with a large data set and reusing those features to solve a new, similar problem.


4 Experimental Results and Discussion

All the methods are coded in Python and executed in Google Colab, which allows Python to be executed in the browser. The confusion matrix layout shown in Table 1 clarifies the meaning of TP, TN, FP and FN. TP (true positive) means the actual class is positive and the predicted class is also positive; TN (true negative) means the actual class is negative and the predicted class is also negative; FP (false positive) means the actual class is negative but it is predicted as positive; and FN (false negative) means the actual class is positive but it is predicted as negative. In the confusion matrix obtained for the proposed model (Fig. 4), 99 of the 100 test images are classified correctly: one class is predicted perfectly, while the other contains a single misclassification (49 of 50 correct).

Table 1 Confusion matrix

 | Predicted mask | Predicted non-mask
Actual mask | TP | FN
Actual non-mask | FP | TN

Fig. 4 Confusion matrix

The accuracy, F-score, Kappa score and AUC portray the performance of the algorithm and are defined as follows. The accuracy captures the overall effect of the algorithm and is calculated as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

$$\text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{5}$$

$$\text{Kappa} = \frac{P_o - P_e}{1 - P_e} \tag{6}$$

where $P_o$ is the relative observed agreement among raters and $P_e$ is the hypothetical probability of chance agreement.

Table 2 shows the comparison between the various pretrained models, namely InceptionV3, VGG16, VGG19, ResNet50 and Xception, on the masked face data set. From Table 2, it is clear that the proposed method based on the pretrained Xception network outperforms all the other pretrained models. The accuracy, F-score, Kappa and AUC values show that the proposed method can make better predictions about whether a mask is worn or not than the other methods. Results obtained from the proposed model are shown in Figs. 5 and 6. A bar graph comparing the performance of the pretrained models (InceptionV3, VGG16, VGG19, ResNet50 and Xception) is shown in Fig. 7. Figure 8 shows the training and validation accuracy of the proposed deep neural network model with respect to the number of epochs, and Fig. 9 shows the corresponding training and validation loss.

Table 2 Comparison with other pretrained models

Model | Accuracy | F-score | Kappa | AUC
InceptionV3 | 0.96 | 0.94 | 0.92 | 0.96
Xception | 0.99 | 0.98 | 0.94 | 0.97
VGG16 | 0.95 | 0.93 | 0.90 | 0.95
VGG19 | 0.82 | 0.80 | 0.64 | 0.82
ResNet50 | 0.81 | 0.79 | 0.62 | 0.80

Fig. 5 Predicted as mask


Fig. 6 Predicted as non-mask

Fig. 7 Performance comparison

Fig. 8 Accuracy


Fig. 9 Loss

As the number of epochs increases, the training and validation loss of the model decrease.

5 Conclusion

This paper presents a pretrained deep learning approach based on the Xception network for automatic face mask detection. Because the data set obtained from Kaggle contains only a small number of images, the proposed approach uses the Xception network pretrained on ImageNet and achieves an accuracy of 99%. This study also analyses the applicability of different pretrained models to automatic face mask detection and finds that the pretrained Xception network gives the best classification accuracy. In future, this method can be extended, with an enlarged data set, to classify faces into mask worn correctly, no mask, and mask worn incorrectly (on head, chin, neck, etc.). The approach can also be used for real-time detection of face masks in public places, which requires detecting the face and extracting the face region before predicting whether a mask is worn or not. The proposed method is evaluated using a data set of 1006 human face images, and the experimental results show that the proposed face mask detection using the pretrained Xception network is better than the existing methods.


References
1. Venkateswarlu B, Kakarla J, Prakash S (2020) Face mask detection using MobileNet and global pooling block. In: 2020 IEEE 4th conference on information and communication technology (CICT), pp 1–5. https://doi.org/10.1109/CICT51604.2020.9312083
2. Srinivasan S, Rujula Singh R, Biradar RR, Revathi S (2021) COVID-19 monitoring system using social distancing and face mask detection on surveillance video datasets. In: 2021 International conference on emerging smart computing and informatics (ESCI), pp 449–455. https://doi.org/10.1109/ESCI50559.2021.9396783
3. Vinh TQ, Anh NTN (2020) Real-time face mask detector using YOLOv3 algorithm and Haar cascade classifier. In: 2020 International conference on advanced computing and applications (ACOMP), pp 146–149. https://doi.org/10.1109/ACOMP50827.2020.00029
4. Xue B, Hu J, Zhang P (2020) Intelligent detection and recognition system for mask wearing based on improved RetinaFace algorithm. In: 2020 2nd International conference on machine learning, big data and business intelligence (MLBDBI), pp 474–479. https://doi.org/10.1109/MLBDBI51377.2020.00100
5. Jian W, Lang L (2021) Face mask detection based on transfer learning and PP-YOLO. In: 2021 IEEE 2nd international conference on big data, artificial intelligence and internet of things engineering (ICBAIE), pp 106–109. https://doi.org/10.1109/ICBAIE52039.2021.9389953
6. Susanto S, Putra FA, Analia R, Suciningtyas IKLN (2020) The face mask detection for preventing the spread of COVID-19 at Politeknik Negeri Batam. In: 2020 3rd International conference on applied engineering (ICAE), pp 1–5. https://doi.org/10.1109/ICAE50557.2020.9350556
7. Maharani DA, Machbub C, Rusmin PH, Yulianti L (2020) Improving the capability of real-time face masked recognition using cosine distance. In: 2020 6th International conference on interactive digital media (ICIDM), pp 1–6. https://doi.org/10.1109/ICIDM51048.2020.9339677
8. Baluprithviraj KN, Bharathi KR, Chendhuran S, Lokeshwaran P (2021) Artificial intelligence based smart door with face mask detection. In: 2021 International conference on artificial intelligence and smart systems (ICAIS), pp 543–548. https://doi.org/10.1109/ICAIS50930.2021.9395807
9. Arora R, Dhingra J, Sharma A (2021) Face mask detection using deep learning. In: Abraham A, Hanne T, Castillo O, Gandhi N, Nogueira Rios T, Hong TP (eds) Hybrid intelligent systems. HIS 2020. Advances in intelligent systems and computing, vol 1375. Springer, Cham. https://doi.org/10.1007/978-3-030-73050-5_36
10. Singh S, Ahuja U, Kumar M et al (2021) Face mask detection using YOLOv3 and faster R-CNN models: COVID-19 environment. Multimed Tools Appl 80:19753–19768. https://doi.org/10.1007/s11042-021-10711-8
11. Chollet F (2017) Xception: deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357
12. ImageNet (2016) Accessed 3 Aug 2021. [Online]. Available: http://www.image-net.org

MOOC-LSTM: The LSTM Architecture for Sentiment Analysis on MOOCs Forum Posts Purnachary Munigadiapa and T. Adilakshmi

Abstract Massive open online courses (MOOCs) have been among the most energizing developments in the e-learning environment in recent years. As the number of MOOC resources in every domain grows rapidly, there is a need to evaluate MOOCs, and discussion forums are the key resource for doing so. Sentiment analysis is a well-known mechanism for identifying the opinions of students on each particular MOOC, and the long short-term memory (LSTM) architecture is used to avoid the issue of long-term dependencies in text. In this paper, we propose a sentiment analysis system containing a new LSTM architecture and the Ax hyperparameter tuner that together perform well on long texts for sequential analysis and sentiment classification. The proposed system is trained on two datasets from different platforms using optimal hyperparameters. Experimental results show that the proposed system outperforms other machine learning models in terms of accuracy and works well across different domains.

Keywords MOOCs · Sentiment · Classification · CNN · LSTM · GloVe

P. Munigadiapa (B) Osmania University, Hyderabad, India e-mail: [email protected]
T. Adilakshmi Department of CSE, Vasavi College of Engineering, Hyderabad, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_21

1 Introduction

Massive open online courses (MOOCs) go beyond traditional classroom teaching approaches and tune learners' minds toward an online teaching and learning environment. Online teaching and learning break down hurdles such as scope, time, and other constraints that exist in traditional classroom teaching. MOOCs are open to all and provide online resources and the opportunity to learn any subject from any place at any time [1]. In MOOCs, the learner and teacher can interact with each other through the discussion forum, which is the only communication system


that can help the participants and the teacher for different purposes such as clarifying doubts and expressing opinions. Gaining accurate information about the learners' experience in MOOCs has become very difficult because of the high number of participants. Learners' posts and actions in the discussion forum can provide an impression of their sentiments and concerns about a course, and forum posts are therefore the key resource for understanding the learning experience of the participants and for finding the opinion toward that course [2]. Nowadays, people express their ideas, feelings, and opinions online more boldly than ever before in the form of text, audio, video, and images. Sentiment analysis has become an important mechanism to learn and observe people's opinion, sentiment, or attitude toward a particular product, service, topic, or issue, and it is one of the most active research areas in the field of natural language processing, making it challenging to cover all the areas in this domain [3]. Artificial intelligence mimics human intelligence by machine; natural language processing and machine learning are its sub-fields. Machine learning for NLP applies ML algorithms to understand the meaning of text. Deep learning is a sub-part of machine learning and enhances the performance of sentiment analysis systems by considering the context of the text [4]. LSTM is an enhanced RNN architecture that is powerful in the sequential analysis of text data [5]. The major contributions of this paper are as follows:

1. A collection of forum posts is extracted from the Stanford MOOCs forum posts dataset and a preprocessing technique is applied to clean the text. An upsampling technique is applied to balance the processed dataset and make it ready for experimentation.
2. An effort has been made to design an efficient long short-term memory (LSTM) deep neural network architecture that can find the sentiments of participants in MOOCs.

• The hyperparameters of the LSTM model are tuned using the adaptive experimentation (Ax) package of PyTorch, improving the performance of the proposed LSTM model.
• The performance of machine learning algorithms such as linear SVM, linear regression, and Naïve Bayes is compared with the proposed model.
• Lastly, the proposed LSTM model is tested on the benchmark IMDB movie reviews dataset, and it is observed that the proposed model outperforms other machine learning algorithms, showing that our model is domain independent.

The rest of the paper is organized into the following sections: Section 2 explains the motivation for developing a sentiment analysis system and related work, such as methodologies used in designing LSTM architectures for sentiment analysis. In Sect. 3, the proposed system architecture is explained. In Sect. 4, the implementation of the proposed system is discussed. Section 5 presents the results of the proposed system and their analysis. Section 6 concludes this research work and notes some directions for extending it.


2 Motivation and Related Work

Social media platforms such as Facebook, WhatsApp, Twitter, and Instagram have become generators of text and create data at an exponential rate. In this data-rich world, understanding a person's feelings or opinion about a particular service or product across different platforms has become very difficult. Sentiment analysis gives organizations valuable information for their decision-making process, for instance to help customers or consumers pick the right item or service [6, 7]. Research on sentiment analysis has mostly been carried out in the domains of news articles, product reviews, and movie reviews, but it is also important to perform sentiment analysis in the education domain, especially on online MOOC platforms. The discussion forum is the only online platform for communication between participants and the instructor. As the number of messages posted by learners is high, it is very difficult for the instructor to reply to the posts, which leads to an increased dropout rate in MOOCs. So there is a need for a standard sentiment system to find the subjectivity and the correct sentiment of MOOCs through forum posts [6, 8]. Dos Santos (ACL, 2014) proposed a CNN architecture that performs sentiment analysis using character- to sentence-level information; implemented on two datasets from different domains, the Stanford Sentiment Treebank (SSTB) and the Stanford Twitter Sentiment (STS) dataset, it gave 85.7% accuracy for single-sentence sentiment prediction on SSTB and 86.4% on STS [9]. Ranjan Kumar Behera (Elsevier, 2020) designed a hybrid Co-LSTM architecture for sentiment analysis capable of handling extremely big data and working well with any domain; the architecture was trained on four datasets from different domains and outperforms other machine learning techniques [10]. Subarno Pal (ICITAM, 2017) proposed a bidirectional LSTM model that was compared with a simple LSTM, and it was observed in experimentation that the bidirectional LSTM outperforms the single LSTM [11]. A convolution-LSTM model for sentiment analysis has also been proposed that works at three levels: a convolution layer first finds the feature representation of each word in its local context, an LSTM then learns the post representation from these feature representations, and finally sentiment classification is performed on the posts; experimental results show the effectiveness of the framework [12]. Most sentiment classification methods focus on extracting text features from the context of the discussion. The diversity of languages in MOOC forum posts is high and leads to poor performance of these NLP algorithms. To address this issue, an ensemble model combining a gradient boosting decision tree (GBDT) with linear regression (LR) (GBDT + LR) has been proposed to reduce the cost of manually learning features for the classification of discussion threads; experimental results on a Coursera dataset for discussion-thread classification gave an accuracy of 83.4% [13].


3 Background

3.1 Dataset

The proposed classification model uses the Stanford MOOCs forum posts dataset, which contains 29,604 records with features such as text, opinion, question, answer, sentiment, confusion, urgency, and course type. For the proposed research work, only the text, sentiment, and confusion features are considered, with the question attribute additionally used for filtering. The text attribute contains the messages posted by participants in the discussion forum. The question feature has binary values: 1 indicates that the posted message is a question, and 0 indicates that it is not. The sentiment attribute gives the sentiment score of the post on a scale from 1 to 7, where 1 is the lowest and 7 the highest sentiment score. The confusion feature uses the same 1–7 scale, where 1 indicates no confusion and 7 indicates high confusion. For better accuracy, the dataset is first filtered by removing posts that pose questions, using the question attribute. To make the dataset suitable for binary classification, one extra feature called sentiment_polarity is added, calculated from the sentiment and confusion values: if sentiment > 3.5 and confusion < 4, the sentiment_polarity is "positive", otherwise "negative". After this step the dataset contains 23,646 records, of which 8421 are positive and 15,225 are negative, so it is not balanced; upsampling is used to balance it. Two datasets are used to implement the proposed model. First, the Stanford MOOCs dataset (of size 30,000, with 15,000 positive and 15,000 negative posts) is used for training and testing the proposed model. The IMDB dataset is a benchmark dataset of 50,000 movie reviews, 25,000 positive and 25,000 negative; it is used to show that the proposed model is not domain specific, and it has been proven that our sentiment analysis system works well with other domains also.

3.2 Word Embedding

Word embedding is the mechanism of transforming plain text into its equivalent vector form. Machine learning and deep learning algorithms cannot perform NLP tasks directly on plain text, so it is mandatory to convert the text into vectors, and word embedding techniques perform this vectorization. Different word embedding techniques exist for this task, such as TF-IDF, Word2Vec, and GloVe [14]. GloVe is a text vectorization model that turns text into global vectors. It introduces a word–word co-occurrence counting matrix that captures the context of each word with respect to the other words: if the matrix is X, each element Xij represents the number of times word j occurs in the context of word i. In our proposed architecture, GloVe is used to vectorize the forum posts.
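A hypothetical sketch of this vectorization step is shown below (not the paper's code): the forum posts are tokenized and an embedding matrix is built from pre-trained 300-dimensional GloVe vectors. The file name, the `posts` variable and the maximum length are assumptions.

```python
# Turn forum posts into GloVe-based inputs: tokenize, pad, and build an
# embedding matrix from pre-trained 300-dimensional GloVe vectors.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN, EMB_DIM = 199, 300                       # values assumed from the text

tokenizer = Tokenizer()
tokenizer.fit_on_texts(posts)                      # `posts`: list of forum posts
sequences = pad_sequences(tokenizer.texts_to_sequences(posts), maxlen=MAX_LEN)

# glove.6B.300d.txt: one word and its 300 coefficients per line
glove = {}
with open("glove.6B.300d.txt", encoding="utf8") as f:
    for line in f:
        word, *coefs = line.split()
        glove[word] = np.asarray(coefs, dtype="float32")

embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, EMB_DIM))
for word, idx in tokenizer.word_index.items():
    if word in glove:
        embedding_matrix[idx] = glove[word]        # unknown words stay all-zero
```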


3.3 Upsampling

An imbalanced class distribution is a common problem in machine learning classification. If the training instances are not distributed equally across the classes, the model becomes biased toward the class with more training instances, which degrades performance. Upsampling and downsampling are two techniques for handling class imbalance: upsampling generates and adds instances of the minority class until the dataset is balanced [15]. The dataset used to train our proposed model is not balanced: it contains 23,646 instances, of which 8646 are positive and 15,000 are negative. The upsampling technique is used to balance the dataset by adding positive instances, so that the final balanced dataset contains 15,000 positive and 15,000 negative instances.
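The sketch below illustrates this balancing step with scikit-learn's `resample`; the file and column names are assumptions, not taken from the paper.

```python
# Minimal upsampling sketch: duplicate minority-class posts with replacement
# until both classes have the same number of instances (15,000 each).
import pandas as pd
from sklearn.utils import resample

df = pd.read_csv("stanford_mooc_posts_processed.csv")   # hypothetical file
positive = df[df["sentiment_polarity"] == "positive"]
negative = df[df["sentiment_polarity"] == "negative"]

positive_up = resample(positive, replace=True,           # sample with replacement
                       n_samples=len(negative),          # match the majority class
                       random_state=42)

balanced = pd.concat([positive_up, negative]).sample(frac=1, random_state=42)
print(balanced["sentiment_polarity"].value_counts())     # 15,000 / 15,000
```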

3.4 Convolutional Neural Network

A CNN is a class of deep neural network that works on the basis of convolution and sub-sampling. Its architecture contains a sequence of layers, namely convolution layers, pooling layers, and a fully connected layer. The convolution layer performs the convolution operation and helps identify features in the input data. The pooling layer is very useful when working with large amounts of data: it reduces the feature dimensions without losing the quality of the information, and it is inserted from time to time between convolution layers. The structure of the fully connected layer is similar to that of a multilayer perceptron. The convolution and pooling layers extract high-level features from the input data, and the fully connected layer uses these high-level features to classify the input into different classes based on the training dataset [10].

3.5 Long Short-term Memory

LSTM neural networks are a special kind of RNN with the ability to learn long-term dependencies, which helps in sequence prediction problems. An RNN can in principle keep track of arbitrary long-term dependencies in the input sequences, but it suffers from the vanishing-gradient problem; LSTM addresses this issue to a large extent, although it can still face the exploding-gradient problem. An LSTM unit tries to remember all the relevant knowledge the network has seen so far and to erase irrelevant data. LSTM networks are mostly used in use cases such as predictions on time-series data and classification. An LSTM network consists of four kinds of gates, namely the forget gate, input gate, input modulation gate, and output gate, each with a different goal. The forget gate determines to what extent the past data should be forgotten. The input gate calculates the additional


information to be written onto the internal cell state. The input modulation gate is a sub-part of the input gate and modulates the information that the input gate writes onto the internal cell state. The output gate determines what output to generate from the current internal cell state.

3.6 Adaptive Experimentation (Ax)

Adaptive experimentation is an AI-enabled, machine-learning-guided process of repeatedly exploring a parameter space to find the best configurations while using resources efficiently. Ax currently supports Bayesian optimization and bandit optimization as exploration strategies. Ax is used here to optimize the deep learning model by tuning its hyperparameters.
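A hedged sketch of such a tuning loop with Ax is shown below; the exact API may differ between Ax versions, `train_and_score` is a hypothetical function that builds, trains and evaluates the LSTM model and returns validation accuracy, and the search ranges are illustrative.

```python
# Hyperparameter search with Ax's loop API (version-dependent; a sketch only).
from ax.service.managed_loop import optimize

best_parameters, values, experiment, model = optimize(
    parameters=[
        {"name": "learning_rate", "type": "range", "bounds": [1e-4, 1e-1], "log_scale": True},
        {"name": "dropout_rate",  "type": "range", "bounds": [0.0, 0.5]},
        {"name": "lstm_units",    "type": "range", "bounds": [16, 128], "value_type": "int"},
        {"name": "batch_size",    "type": "range", "bounds": [32, 256], "value_type": "int"},
    ],
    # train_and_score(...) is assumed to return validation accuracy as a float
    evaluation_function=lambda params: {"accuracy": (train_and_score(**params), 0.0)},
    objective_name="accuracy",
    total_trials=20,
)
print(best_parameters)
```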

4 Proposed System

The proposed sentiment analysis system uses an LSTM architecture together with a hyperparameter tuning technique to obtain the best classification results. The pipeline of the MOOCs-LSTM sentiment analysis system is presented in Fig. 1, and it works as follows. In the feature engineering phase, the dataset consisting of Stanford MOOCs forum posts is collected and preprocessed, and the GloVe word embedding technique is applied to convert the input text to vectors. Next, the upsampling technique is used to balance the data by adding extra instances, making it ready for classification. The balanced dataset is then divided into training and test sets (70% and 30%, respectively), and the proposed LSTM model is trained on the training set. Hyperparameter tuning is applied to the trained model to find the best parameters, which are used to generate the optimal LSTM model. Finally, the test set is used to evaluate the performance of the proposed model.

Fig. 1 Pipeline of the MOOCs-LSTM sentiment analysis system


Fig. 2 LSTM architecture

The LSTM architecture used to develop the proposed system is presented in Fig. 2. The LSTM model contains five layers. In the embedding layer, all words of the text dataset are converted to 300-dimensional vectors using the GloVe embedding technique; these embedded vectors are trained using back-propagation and given as input to the LSTM layer. The LSTM layer uses 58 LSTM units to analyse the input feature vectors, identifying global dependencies by tracking long-term relationships, and passes its output to the next layer. The dense layer is a fully connected layer in which each neuron is connected to all neurons of the previous layer; it uses the ReLU activation function, and its result is passed on. The dropout layer takes the output of the dense layer as input and applies regularization to avoid overfitting and improve model performance. One more dense layer with a sigmoid activation function is added to produce the probability for the final feature vectors.
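An illustrative Keras sketch of this five-layer model is given below. It is an approximation under stated assumptions (the width of the intermediate dense layer is not specified in the text, and the learning and dropout rates are taken from the hyperparameters reported later), not the authors' exact code.

```python
# Five-layer MOOC-LSTM sketch: GloVe embedding, LSTM(58), ReLU dense,
# dropout, and a sigmoid output for positive vs. negative posts.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_mooc_lstm(embedding_matrix, max_len=199, lstm_units=58, dropout_rate=0.062):
    vocab_size, emb_dim = embedding_matrix.shape
    model = models.Sequential([
        layers.Embedding(vocab_size, emb_dim,
                         embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                         input_length=max_len),
        layers.LSTM(lstm_units),
        layers.Dense(64, activation="relu"),     # width is an assumption
        layers.Dropout(dropout_rate),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.006),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_mooc_lstm(embedding_matrix)        # matrix from the GloVe sketch
model.summary()
```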

5 Implementation 5.1 Performance Measures The proposed classification model used the following performance measures. Confusion matrix describes the statistics about the actual and predicted values for all the posts in the dataset for the classifier (Table 1).


Table 1 Confusion matrix

 | Predicted positive | Predicted negative
Actual positive | TP | FN
Actual negative | FP | TN

True Positive (TP): the count of posts that are actually labeled positive and also predicted as positive by the classification model.
False Positive (FP): the count of posts that are actually labeled negative but predicted as positive by the classification model.
True Negative (TN): the count of posts that are actually labeled negative and also predicted as negative by the classification model.
False Negative (FN): the count of posts that are actually labeled positive but predicted as negative by the classification model.

The performance of the proposed classification algorithm is evaluated using the following measures:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-Measure = (2 × Precision × Recall) / (Precision + Recall)
Accuracy = (TP + TN) / (TP + FP + TN + FN)

5.2 Results and Analysis

Experiments were carried out using conventional machine learning algorithms (Naïve Bayes, SVM and linear regression) and deep learning models (CNN and LSTM) on the Stanford MOOCs forum posts dataset; the results are presented in Table 2. The experimental results show that our proposed LSTM model with GloVe embedding and Ax outperformed all other classifiers with an accuracy of 87.64%, while the linear regression classifier performed best among the remaining machine learning algorithms with an accuracy of 79.76%. Our proposed model used the following hyperparameters suggested by adaptive experimentation (Ax): learning rate = 0.0060, dropout rate = 0.062, LSTM units = 52, number of epochs = 5, number of hidden layers = 3, batch size = 240, embedding size = 49, maximum text length = 199, optimizer = Adam.


Table 2 Experimental results of different classifiers on the Stanford MOOCs forum posts dataset (confusion matrix and evaluation measures)

Model | Actual yes (pred. yes / pred. no) | Actual no (pred. yes / pred. no) | Precision | Recall | F-measure | Accuracy
Naïve Bayes | 3694 / 786 | 1073 / 3447 | 0.7921 | 0.7921 | 0.7921 | 0.7922
SVM | 3692 / 788 | 1075 / 3445 | 0.79 | 0.79 | 0.79 | 0.79
Linear regression | 3637 / 843 | 1062 / 3458 | 0.7912 | 0.79 | 0.79 | 0.7976
CNN + GloVe | 3786 / 794 | 842 / 3648 | 0.818 | 0.8266 | 0.8222 | 0.826
LSTM + GloVe + Ax (proposed) | 3956 / 548 | 566 / 3930 | 0.8748 | 0.8783 | 0.8765 | 0.8764

Performance measures of the different classifiers on the Stanford MOOCs forum posts dataset are shown in the bar chart in Fig. 3. To test the domain dependency of the proposed model, the same experiments were repeated on a dataset from a different domain: the IMDB movie reviews dataset. The experimental results are presented in Table 3 and show that our proposed LSTM model with GloVe embedding and Ax, with an accuracy of 94.26%, outperformed all other

Fig. 3 Performance measures of different classifiers on Stanford MOOCs dataset


Table 3 Experimental results of different classifiers on the IMDB movie reviews dataset (confusion matrix and evaluation measures)

Model | Actual yes (pred. yes / pred. no) | Actual no (pred. yes / pred. no) | Precision | Recall | F-measure | Accuracy
Naïve Bayes | 6685 / 814 | 805 / 6696 | 0.89 | 0.89 | 0.89 | 0.8921
SVM | 6601 / 898 | 596 / 6905 | 0.9 | 0.89 | 0.9 | 0.9004
Linear regression | 6684 / 815 | 804 / 6697 | 0.9 | 0.9 | 0.9 | 0.8976
CNN + GloVe | 6792 / 712 | 698 / 6798 | 0.9068 | 0.9051 | 0.9059 | 0.906
LSTM + GloVe + Ax (proposed) | 7089 / 427 | 434 / 7050 | 0.9425 | 0.9431 | 0.9427 | 0.9426

classifiers, while the SVM classifier performed best among the remaining machine learning algorithms with an accuracy of 90.04%. Our proposed model used the following hyperparameters suggested by adaptive experimentation (Ax): learning rate = 0.027, dropout rate = 0.013, LSTM units = 47, number of epochs = 5, number of hidden layers = 7, batch size = 228, embedding size = 94, maximum text length = 160, optimizer = Adam. Performance results of the different classifiers on the IMDB movie reviews dataset are shown in the bar chart in Fig. 4.

Fig. 4 Performance results of different classifiers on IMDB dataset


6 Conclusion and Future Work

In this study, the MOOCs-LSTM sentiment analysis system was developed using an LSTM architecture, with Ax added to improve the performance of the model. On the benchmark IMDB dataset, the proposed model outperformed the other machine learning and deep learning classifiers, and it achieved good accuracy on two datasets from different domains, showing that our model is domain independent. The accuracy of the proposed model on the MOOCs dataset can be improved by further tuning the data. There is scope to extend this work by designing ensemble models and by implementing other embedding techniques to enhance performance.

References
1. Altalhi M (2021) Toward a model for acceptance of MOOCs in higher education: the modified UTAUT. Educ Inf Technol, pp 1589–1605. Springer
2. Karsten (2020) Evaluation of student feedback within a MOOC using sentiment analysis and target groups. IRRODL 21(3)
3. Ligthart A (2021) Systematic reviews in sentiment analysis: a tertiary study. Artif Intell Rev. Springer
4. Ikonomakis et al (2005) Text classification using machine learning techniques. WSEAS Trans Comput, pp 966–974
5. Pouyanfar et al (2018) A survey on deep learning: algorithms, techniques, and applications. ACM Comput Surv (CSUR), pp 1–36
6. Gao Y et al (2020) A parallel neural network structure for sentiment classification of MOOCs discussion forums. IEEE Access J Intell Fuzzy Syst, pp 4915–4927
7. Mak S (2010) Blogs and forums as communication and learning tools in a MOOC. In: International conference on networked learning (NLC ’10), pp 275–285
8. Almatrafi O (2018) Needle in a haystack: identifying learner posts that require urgent response in MOOC discussion forums. Comput Educ, pp 1–9. Elsevier
9. Dos Santos C (2014) Deep convolutional neural networks for sentiment analysis of short texts. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: technical papers, Dublin, Ireland, 23–29 Aug 2014, pp 69–78
10. Behera RK (2021) Co-LSTM: convolutional LSTM model for sentiment analysis in social big data. Inf Process Manage 58:102435. Elsevier
11. Wei X, Lin H (2017) A convolution-LSTM-based deep neural network for cross-domain MOOC forum post classification. Information 8(3):1–16
12. Wei X (2017) A convolution-LSTM-based deep neural network for cross-domain MOOC forum post classification. MDPI Inf 8:92
13. Ntourmas A (2021) Classification of discussions in MOOC forums: an incremental modeling approach. In: Proceedings of the eighth ACM conference on learning, pp 183–194
14. Zhou M (2021) A text sentiment classification model using double word embedding methods. In: Artificial intelligence review. Springer
15. Krawczyk B (2016) Learning from imbalanced data: open challenges and future directions. Progr Artif Intell 5:221–232

License Plate Detection of Motorcyclists Without Helmets S. K. Chaya Devi, G. Vishal Reddy, Y. Aakarsh, and B. Gowtham

Abstract Over the years, motorcycle accidents have increased in various countries, as motorcycles grow more popular for many social and economic reasons. Although many countries make helmet use mandatory for motorcyclists, most riders do not wear one, and a motorcycle accident can be fatal if the rider is not wearing a helmet. Detecting such traffic-rule offenders is a highly desirable but challenging task for ensuring safety, owing to obstacles such as occlusion, illumination, poor-quality surveillance video, and fluctuating weather conditions. This paper explains and illustrates a framework for identifying the license plates of motorcyclists who ride without helmets in surveillance videos. In the proposed approach, we generated a dataset from real-time surveillance video, which is in turn fed to our custom deep learning model built with the YOLOv3 framework, a single-shot detector algorithm for object detection. We detect multiple objects in every image and recognize license plates depending on whether the motorcycle rider wears a helmet or not. Object detectors are evaluated using mean average precision; our evaluation result is 68.79% at an IoU value of 70%.

Keywords Object detection · YOLOV3 · Motorcycles · License plates · Helmets

S. K. C. Devi (B) · G. V. Reddy · Y. Aakarsh · B. Gowtham
Department of Information Technology, Vasavi College Of Engineering, Hyderabad, Telangana, India
e-mail: [email protected]
Supported by Vasavi College Of Engineering.
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_22

1 Introduction Wearing a helmet can decrease the risk of fatalities in road accidents by around 42%. Based on such statistics, governments around the world have passed laws making the use of helmets mandatory. Despite this, especially in some developing countries, compliance with motorcycle helmet laws is often minimal.


Governments must collect accurate data on the degree of compliance with helmet rules to run targeted helmet-use campaigns effectively. However, 40% of countries around the world do not assess this critical road-safety indicator. The current method of catching violators, which involves directly observing or photographing riders and then manually inspecting the images for their registration plates, is a major reason for the lack of helmet-rule compliance data. Such direct observation during roadside surveys consumes many resources because it takes a long time and costs a lot of money. Video cameras enable indirect observation, which relieves the effort required for direct photography or observation; however, classification of helmet use by human observers remains a tedious process and limits the amount of data that can be handled. Hence, there is a growing demand for technology that detects helmet wear and identifies motorcycle license plates without relying on a human observer. Machine learning is a viable way to identify helmets on motorcyclists automatically, and it has been used extensively to recognize pedestrians, bicycles, motorcyclists, and automobiles in various road-safety detection tasks. While some earlier implementations of detecting helmets on motorcyclists have shown promise, they are not developed to the extent that every step of the process is automated. The use of deep learning, hereby referred to as DL, for detecting required objects in an image has received unprecedented attention. Object detection reduces human effort, builds on image classification, and seeks to localize exactly where each object appears in the image. In this paper, we develop and test an approach based on YOLOV3, an object detector, and build a tailored YOLOV3 architecture to identify license plates from a large dataset with many views and variations in the number of riders seen.

2 Related Work The current video surveillance-based methods require significant manual effort. Because of this manual intervention, such systems can become unsustainable, since their efficiency decreases over time. In the past few years there has been some technological advancement in this field [1]. One paper describes the detection of helmets on motorcyclists using standard convolutional neural networks, with adaptive background modeling used to detect moving vehicles on busy roads [2]. Similarly, another model uses background subtraction and object segmentation to detect bike riders from surveillance video and then determines whether a rider is wearing a helmet using visual features and a binary classifier [3, 4]. A further work uses the same process to recognize motorcycles and then applies a HOG-based object classifier to identify violators; the vertical projection of a binary image is used to count the number of riders [5]. One approach performed feature extraction using the Hough transform descriptor [6], which can give misleading results when objects are aligned because it focuses on edges, whereas our approach uses all of the information available during the training phase. The approach discussed in [7] uses a Gaussian mixture model for background separation and an SVM classifier to detect motorcycles. All of these approaches address recognizing violators, but none provides an end-to-end way to catch a violator by tracking their license plate. Our process recognizes violators along with their license plates, so very little manual intervention is needed to catch them.

3 Data 3.1 Dataset The dataset consists of real-time videos recorded by surveillance cameras located throughout Hyderabad. Motorcycles were captured from various perspectives, distinguishing between positive (helmeted) and negative (non-helmeted) samples.

3.2 Data Preprocessing To train the model, we require images as input, so the first challenge is to generate images from the videos. We wrote a Python script that captures a frame every two seconds. Because the model must be trained to look for specific objects, we annotated each image with the respective labels. Our data consists of four classes—Helmet, No-Helmet, Bike, and Lic. In this way, we gathered enough data for the training and validation sets.
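
A minimal sketch of such a frame-extraction step is shown below; the file paths, the two-second interval, and the output naming scheme are illustrative, not the authors' exact script.

```python
import cv2

def extract_frames(video_path, out_dir, every_seconds=2.0):
    """Save one frame every `every_seconds` seconds from a surveillance video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0      # fall back to 25 fps if metadata is missing
    step = int(round(fps * every_seconds))       # number of frames to skip between saves
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.jpg", frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Example usage: extract_frames("cctv_clip.mp4", "dataset/images")
```

The saved frames can then be annotated with the four class labels using any bounding-box annotation tool.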

3.3 Architecture We made use of the YOLOV3 architecture [8] for the detection of the desired classes (i.e., HELMET, NO-HELMET, BIKE, LIC) from the generated frames. YOLOV3 uses only convolutional layers, which makes it a fully convolutional network. It consists of a feature extractor that extracts the essential features needed to detect the required objects in an image; the feature extractor can build feature maps at various scales so that even the smallest objects in a picture can be detected. The general network architecture used by YOLOV3 is shown in Fig. 1. YOLOV3 is generally used to classify objects of the COCO dataset, which contains about 80 different classes, and it usually employs 53 CNN layers stacked with 53 more, producing a 106-layer network. Here, however, we use a custom configuration of the YOLOV3 model tailored to our needs: a 16-layer feature extractor, as shown in Fig. 2, followed by a YOLO detector, because we have only four classes of objects to identify. We can think of the YOLO architecture as follows: IMAGE (416, 416, 3)—DEEP CNN—ENCODING (13, 13, 3, 4). Since the reduction factor is 32, the output grid size is 13 × 13. In the resulting dimensions, the third coordinate represents the number of anchor boxes for each grid cell, and the fourth coordinate is the number of classes, i.e., HELMET, NO-HELMET, BIKE, LIC. We also use two other scales, obtained by upsampling the current size by two times and four times at layer 19 and layer 26. This is done to identify small objects: if a license plate or a helmet is far from the camera and too small to identify, a finer scale is needed, so grids of 26 × 26 and 52 × 52 are also used to avoid missing smaller objects. A grid cell is responsible for detecting an object if the center or midpoint of that object falls into the cell. As we use three anchor boxes (priors), each of the 13 × 13 cells (considering the initial scale) holds data for these three anchors. Flattening the last two dimensions of the shape (13, 13, 3, 4), we get (13, 13, 12). For each anchor of each cell, we calculate the element-wise product and the probability that the box contains a particular class. Without non-max suppression (NMS), all the bounding boxes (13 × 13 × 12 of them) would be present in the final output, so NMS is applied to eliminate the unnecessary anchors. To improve accuracy, we used various combinations of activation functions: leaky ReLU for some convolutional layers and linear functions for the others.

Fig. 1 Network architecture

Fig. 2 Custom YOLOV3 architecture
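
The following sketch illustrates how the predictions of a YOLOv3-style network for the four classes above can be decoded and filtered with NMS using OpenCV's dnn module; the class list, thresholds, and helper name are assumptions for illustration and not the authors' training or inference code.

```python
import cv2
import numpy as np

CLASSES = ["Helmet", "No-Helmet", "Bike", "Lic"]   # the four classes used in this work

def detect(net, image, conf_thresh=0.75, nms_thresh=0.3):
    """Run a YOLOv3-style network and return (label, confidence, box) tuples after NMS."""
    h, w = image.shape[:2]
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())

    boxes, confidences, class_ids = [], [], []
    for output in outputs:                          # one output per detection scale (13, 26, 52)
        for row in output:                          # row = [cx, cy, bw, bh, objectness, class scores...]
            scores = row[5:]
            class_id = int(np.argmax(scores))
            confidence = float(row[4] * scores[class_id])
            if confidence >= conf_thresh:
                cx, cy, bw, bh = row[0] * w, row[1] * h, row[2] * w, row[3] * h
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                confidences.append(confidence)
                class_ids.append(class_id)

    keep = cv2.dnn.NMSBoxes(boxes, confidences, conf_thresh, nms_thresh)
    keep = np.array(keep).flatten() if len(keep) else []
    return [(CLASSES[class_ids[i]], confidences[i], boxes[i]) for i in keep]
```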


During the training phase, the model familiarizes itself with the required objects. To extract license plates, we extended the pipeline: in the background we collect all detected Bike-class objects, crop them, and run the cropped images through the same trained model again. Once all the remaining class objects have been obtained from the cropped bike images, we fetch the license plates of riders who are not wearing a helmet. This allows us to differentiate between multiple motorcycles present in the same image.
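
A minimal sketch of this two-pass association logic is shown below; it assumes a detect() helper such as the one sketched above, and the function name and box format are illustrative.

```python
def plates_of_unhelmeted_riders(net, frame):
    """Detect bikes first, re-run detection on each bike crop, and keep the license
    plate only when a No-Helmet rider is found inside that crop."""
    plates = []
    for label, _, (x, y, w, h) in detect(net, frame):
        if label != "Bike":
            continue
        crop = frame[max(y, 0):y + h, max(x, 0):x + w]
        inner = detect(net, crop)
        if any(lbl == "No-Helmet" for lbl, _, _ in inner):
            for lbl, _, (px, py, pw, ph) in inner:
                if lbl == "Lic":
                    # translate the plate box back to full-frame coordinates
                    plates.append((x + px, y + py, pw, ph))
    return plates
```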

4 Results After training the model, we evaluated its performance using Intersection over Union, hereby referred to as IoU. We set a threshold of 70% to decide whether a detected object is valid, and a confidence score of 75% for the predicted bounding boxes: boxes with scores above this threshold are treated as positive and the rest as negative. Until now there has been no end-to-end way of detecting the license plates of motorcyclists without helmets; here we tried to tackle both issues, detecting the motorcyclist without a helmet and identifying their license plate. From Fig. 3a–c we can see how the model detects objects of all the classes, and the license plates of riders not wearing a helmet are then extracted, as shown in Fig. 3d. Every detected object has an IoU value greater than 70%. Using the precision and recall metrics, we computed the mean average precision to evaluate the model; on the validation set the model achieves a value of 68.79%.
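
The IoU criterion used above can be computed as in the short sketch below; the (x, y, w, h) box format is an assumption chosen to match the decoding example earlier.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x, y, w, h)."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh

    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A predicted box counts as a valid detection when iou(pred, ground_truth) >= 0.70
```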

5 Conclusion The proposed system for license plate detection of motorcyclists driving without helmets uses the YOLO framework. The model would be helpful for Indian Law Enforcement to detect violators without depending on human observers. The main challenge was to collect the appropriate data and label it so that the model could comprehend the required features. One more factor that influenced our model’s performance is the quality of the video, as the license plates fetched from our dataset were of low resolution, so there is a lot of scope for improving the accuracy. Further, we intend to apply a convolution neural network to obtain the registration number from the license plate which will help identify the traffic violators.


Fig. 3 Objects detected by the model: (a) bike-rider with no helmet, (b) bike-rider with helmet, (c) multiple bike-riders, (d) license plate

Acknowledgements We take this moment to convey our thanks and respect to our HOD, faculty, and Principal who have helped us throughout the research for this review paper. We feel privileged to express our gratitude to our project guide for expressing her confidence in us through continuous support, help, and encouragement.


References 1. Singh D, Vishnu C, Mohan CK (2016) Visual big data analytics for traffic monitoring in smart city. https://doi.org/10.1109/ICMLA.2016.0159 2. Vishnu C, Singh D, Mohan CK, Babu S (2017) Detection of motorcyclists without helmet in videos using convolutional neural network. In: Proceedings of the international joint conference on neural networks (IJCNN) 3. Chiverton J (2012) Helmet presence classification with motorcycle detection and tracking. Intell Transp Syst IET 6:59–269. https://doi.org/10.1049/iet-its.2011.0138 4. Dahiya K, Singh D, Chalavadi KM (2016) Automatic detection of bike-riders without helmet using surveillance videos in real-time. https://doi.org/10.1109/IJCNN.2016.7727586 5. Silva R, Aires K, Veras R (2014) Helmet detection on motorcyclists using image descriptors and classifiers, pp 141–148. https://doi.org/10.1109/SIBGRAPI.2014.28 6. Silva RRV, Aires KRT, Santos TS, Abdala K, Veras RMS, Soares ACB (2013) Automatic detection of motorcyclists without helmet. In: 2013 XXXIX Latin American computing conference (CLEI), pp 1–7 7. Chen Z, Ellis TJ, Velastin SA (2012) Vehicle detection, tracking and classification in urban traffic. In: 2012 15th International IEEE conference on intelligent transportation systems, pp 951–956 8. Redmon J, Farhadi A (2018) YOLOv3: an incremental improvement. https://arxiv.org/abs/1804.02767

Object Detection and Tracking Using DeepSORT Divya Lingineni, Prasanna Dusi, Rishi Sai Jakkam, and Sai Yada

Abstract The rapid development of image detection algorithms has enabled broad safety applications, for example face recognition and monitoring. However, monitoring in real time is difficult, particularly in busy areas where an individual may be partially or fully occluded for some time. This chapter therefore aims to build an object tracking system based on the DeepSORT architecture for crowd surveillance. In contrast to pure object detection frameworks such as CNN-based detectors, this system not only detects a human being in real time but also uses the information it has learnt to monitor a person's path until they leave the frame. Using the DeepSORT framework, the trajectory of an object can be tracked either in real time or offline from an existing recording. The framework is trained on a large dataset to track people's motion based on the individual's speed, distance, and appearance. DeepSORT, a leading algorithm for object identification and tracking, is powerful and fast. Several issues affect security-oriented object tracking, and occlusion is one of the main problems: it occurs in busy locations where other objects move quickly into and out of the picture (for instance, a bustling zebra crossing), and when such an object enters the frame the trajectory of an individual may be incorrectly identified. Keywords Object detection · Object tracking · DeepSORT · Deep learning

D. Lingineni (B) · P. Dusi · R. S. Jakkam · S. Yada
Department of Information Technology, Vasavi College of Engineering, Hyderabad, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_23

1 Introduction This study aims to create an object tracking system based on deep learning, a framework for tracking the movement of an object within a recorded video or a real-time video stream. The information extracted from this tracking is displayed to users. The work focuses on monitoring object mobility for public safety and surveillance. The system should be able to detect the movement paths of individuals in the frame rapidly, especially in busy regions with congested transportation [1]. It should also be able to extract useful information from the observed movement. Applying DeepSORT to crowd security can produce a much more efficient monitoring system and, ultimately, a safer society.

2 Related Work Object tracking systems essentially locate objects in a camera frame, from identification through to tracking. A prior study hypothesized that neural networks improve their performance when their database is large and varied; for example, having training images that contain more than one person makes the system more effective. The typical neural-network development process comprises image acquisition, image pre-processing, image segmentation, feature extraction, and classification [1]. The You Only Look Once (YOLO) method differs from existing object identification algorithms in that a single neural network predicts the bounding boxes and the class probabilities for those boxes; YOLO performs detection with a single network [1, 2].

3 Existing Work 3.1 Optical Flow Optical flow is the apparent motion of image objects between two successive frames caused by the movement of the object or of the camera. It is represented as a 2D vector field in which each vector shows the displacement of a point from the first image to the second. The Lucas–Kanade method is commonly utilized to compute it (Fig. 1). Optical flow is defined as the apparent mobility of individual pixels on the picture plane and is frequently a good approximation of the real physical motion projected onto that plane. Most approaches for computing optical flow assume that a pixel's colour/intensity remains constant as it moves from one video frame to the next [3].
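
A minimal sketch of sparse Lucas–Kanade tracking with OpenCV is given below; it assumes two consecutive grayscale frames prev_gray and gray, and the corner-detection and window parameters are illustrative defaults rather than values from this chapter.

```python
import cv2
import numpy as np

def lucas_kanade_step(prev_gray, gray):
    """Track Shi-Tomasi corner points from one grayscale frame to the next with Lucas-Kanade."""
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.3, minDistance=7)
    if p0 is None:
        return np.empty((0, 2)), np.empty((0, 2))
    p1, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, p0, None,
        winSize=(15, 15), maxLevel=2,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 0.03))
    good_old = p0[status.flatten() == 1].reshape(-1, 2)
    good_new = p1[status.flatten() == 1].reshape(-1, 2)
    return good_old, good_new   # matched point pairs define the sparse flow field
```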


Fig. 1 Graphical view of optical flow method

3.2 Mean-Shift Algorithm Mean shift is an effective method for tracking objects whose appearance is characterized by histograms. It is a nonparametric, mode-seeking technique for feature-space analysis: it iteratively finds the maximum of a density estimated from sampled discrete data (Fig. 2).
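
The sketch below shows one possible histogram-based mean-shift tracker using OpenCV; it assumes a list of BGR frames and an initial (x, y, w, h) window around the target, and is an illustration rather than the exact experiment in Fig. 2.

```python
import cv2

def track_with_meanshift(frames, init_window):
    """Follow a target given an initial (x, y, w, h) window, using a hue-histogram
    back-projection and cv2.meanShift; yields the window found in each frame."""
    x, y, w, h = init_window
    roi = frames[0][y:y + h, x:x + w]
    hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

    window = init_window
    for frame in frames[1:]:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        back_proj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        _ret, window = cv2.meanShift(back_proj, window, term_crit)
        yield window
```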

4 Proposed Methodology 4.1 Dataset This study uses three different datasets (model configurations) of different sizes and classes. The choice of dataset influences the ability of the system to detect objects efficiently and correctly and therefore affects its overall performance. The datasets used are YOLOv3, YOLOv3-tiny, and a custom YOLOv3; each has its own weight file [1].
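
Each configuration can be loaded from its .cfg/.weights pair, for example with OpenCV's dnn module as sketched below; the file names are placeholders for whichever of the three configurations is being used.

```python
import cv2

def load_yolo(cfg_path="yolov3.cfg", weights_path="yolov3.weights"):
    """Load a Darknet-format YOLO configuration and its weight file."""
    net = cv2.dnn.readNetFromDarknet(cfg_path, weights_path)
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)
    return net
```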

4.2 Proposed Workflow This section discusses the methodology used in this project and the procedure followed to complete it. It describes how the image detection system, the hardware design, and the corresponding flowcharts are implemented (Figs. 3 and 4).


Fig. 2 Output after applying mean shift

Fig. 3 DeepSORT pictorial view

Fig. 4 Phases involved in the system


Fig. 5 Input Execution commands

Phase-1 This phase covers the user input to the system. The user supplies the run command, which specifies the type of video input, the YOLOv3 weights to use, and the output file type; under Windows OS this is entered through the command prompt. All these inputs are read with the Python get() method, stored in their respective variables, and used throughout the tracking process. The next stage is video collection, which obtains the kind of video indicated in the user command (Fig. 5).

Phase-2 In this phase the video input is processed. The system transforms the video into a frame-by-frame picture sequence from which colour information is extracted and object detection is performed. YOLO is used because each image needs to pass through the network only once: from a single evaluation of the image, YOLO predicts multiple bounding boxes, each containing a certain object. The picture is split into an s × s grid, and a region of interest (ROI) is responsible for an object when the object's centre falls inside it.

Phase-3 The final phase covers visual tracking and storage. DeepSORT is combined with the Kalman filter algorithm: the Kalman filter is used for state prediction, taking the state vector of the target from the visual tracker. It is simple, fast, and memory-efficient, and because it uses both position and velocity it gives better results than optical flow and mean shift, tracking the object accurately even during occlusion. The tracking stage processes each detected individual, assigns them a unique ID, and begins to follow their movement.
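
The predict/update cycle used in Phase-3 can be illustrated with a small constant-velocity Kalman filter as sketched below; the state layout and noise values are illustrative and this is not the exact DeepSORT implementation, which additionally tracks box size and appearance features.

```python
import numpy as np

class SimpleKalman:
    """Constant-velocity Kalman filter over (cx, cy, vx, vy) for one tracked object."""
    def __init__(self, cx, cy, dt=1.0):
        self.x = np.array([cx, cy, 0.0, 0.0])            # state: position and velocity
        self.P = np.eye(4) * 10.0                        # state covariance
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)   # constant-velocity motion model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)   # only position is observed
        self.Q = np.eye(4) * 0.01                        # process noise
        self.R = np.eye(2) * 1.0                         # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                                # predicted position

    def update(self, zx, zy):
        z = np.array([zx, zy])
        y = z - self.H @ self.x                          # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)         # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

During occlusion, the tracker keeps calling predict() and resumes update() once the detector re-acquires the person with the same ID.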


5 Results and Discussions The following are the results obtained with the three datasets of different sizes and classes described earlier. Each detected object is assigned a unique ID number, which is then used as the reference for tracking that object. The input videos were taken from online resources (Figs. 6, 7, 8, 9, 10 and 11).

Fig. 6 Input video from dataset-1

Fig. 7 Output video from dataset-1


Fig. 8 Input video from dataset-2

Fig. 9 Output video from dataset-2

Live recording helps to track and identify objects, especially in crowd surveillance and in cases of robbery; the technique can also be integrated with drones and used in drone surveillance and armed drone operations by the armed forces (Fig. 12). The table below lists the accuracy obtained for various inputs frame by frame; the frame numbers are arbitrary, as they were sampled at different points in the frame sequences.


Fig. 10 Input video from dataset-3

Fig. 11 Output video from dataset-3

Frames      Input1   Input2   Webcam input
Frame 00    0.00     0.00     0.00
Frame 01    0.33     0.32     0.25
Frame 02    0.40     0.39     0.41
Frame 03    0.50     0.49     0.52
Frame 04    0.57     0.55     0.59
Frame 05    0.68     0.67     0.62
Frame 06    0.75     0.73     0.67
Frame 07    0.83     0.77     0.73
Frame 08    0.87     0.83     0.75
Frame 09    0.89     0.83     0.79
Frame 10    0.90     0.88     0.82
Frame 11    0.91     0.90     0.85
Frame 12    0.93     0.91     0.90
Frame 13    0.95     0.94     0.91

The overall average accuracy obtained is around 92%.

6 Conclusion Given the significant improvements in machine learning, its use in crowd-security monitoring should be adopted. This chapter proposes the use of the YOLO and DeepSORT algorithms to follow individuals in real time. The system attempts to keep following a person even when occlusion occurs, and it can be very useful for ensuring safety in locations with high rates of entry and exit. The results demonstrate that trustworthy and precise data collection can enhance tracking. A few ideas for further development are given below.

7 Future Work In the future, detection can be implemented with the next version of the YOLO framework so that more classes of objects can be detected. The present system cannot distinguish between a living thing and a non-living thing; for example, it struggles to tell a statue from a person and treats a statue as if it were a person, which needs to be improved in a later extension of the project. The system can also be extended to detect objects when both the object and the camera are in motion simultaneously.


Fig. 12 Output from live data feed through webcam

References 1. Azhar MIH et al (2020) People tracking system using DeepSORT. In: 2020 10th IEEE international conference on control system, computing and engineering (ICCSCE). IEEE 2. Redmon J, Divvala S, Girshick R, Farhadi A (2016) You only look once: unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788 3. Turaga P, Chellappa R, Veeraraghavan A (2010) Advances in video-based human activity analysis: challenges and approaches. In: Advances in computers, vol 80. Elsevier, pp 237–290

Continuous Investing in Advanced Fuzzy Technologies for Smart City V. Lakhno , V. Malyukov , D. Kasatkin , V. Chubaieskyi , S. Rzaieva , and D. Rzaiev

Abstract The article describes a model for the computing core of a decision support system in the process of continuous mutual investment in technologies for Smart City. In contrast to the existing approaches, the proposed model is based on solving a bilinear differential game of quality with several terminal fuzzy surfaces. A new class of bilinear differential games is considered, which made it possible to adequately describe the process of searching for rational strategies of players in the course of continuous mutual investment in the rapidly developing technology market for Smart City, taking into account fuzzy information. The model was tested in the course of computational experiments in the MathCad environment. Keywords Smart city · Optimal investment strategies · Decision support · Differential game · Fuzzy sets · MathCad

V. Lakhno · V. Malyukov · D. Kasatkin
National University of Life and Environmental Sciences of Ukraine, Kyiv, Ukraine
e-mail: [email protected]
V. Chubaieskyi (B) · S. Rzaieva
Kyiv National University of Trade and Economics, Kyiv, Ukraine
e-mail: [email protected]
S. Rzaieva
e-mail: [email protected]
D. Rzaiev
Kyiv National Economics University Named After Vadym Hetman, Kyiv, Ukraine
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_24

1 Introduction A modern person needs a life without traffic jams, without fear for safety during evening walks, without queues for services, but with an awareness of where the money from the local budget is directed. This is possible only in a "smart" city, a Smart City. The concept of development of such cities is actively beginning to

be embodied in Ukraine, although in developed countries these things have already become a reality. At the moment Smart Cities are competing for the championship in international rankings. The Spanish Center for Globalization and Business School Strategy in 2020 called the “smartest” London, New York, Paris, Tokyo, Reykjavik. Kyiv is ranked 115th out of 174 cities. According to UN forecasts, by 2050, 70% of the world’s population will live in cities. The authorities of megacities of developed countries are making efforts to make life in cities as comfortable, safe, and high quality as possible. Internet of Things technology, which creates data-based solutions, can improve urban life. To become a smart city, the government, businesses, and NGOs need to analyze open data and make decisions to improve the lives of citizens, look for investors. And investors, in turn, are interested in forecasting the risks of investing in the development of a particular infrastructure of a smart city. Conceptually, the city began to be considered as a smart integral object that provides comfortable living conditions and maximum safety for its residents and guests. That is why in times of rapid popularization of the digital economy trend, two approaches were tested: information-centric and customer-oriented [1]. In fact, on the basis of these approaches, practical solutions and standards of a “smart” city have been developed and implemented. Thus, at some point in time, the digital capabilities of what are called industry solutions have already been exhausted for cities. Currently, these are information modeling technologies (in construction or intelligent transport systems). That is, the question arises—how to process data from different subsystems of urban management without duplication and make them available for use in relevant business processes and services? Management has already had considerable experience in building structures of information models, business processes, and services, etc. But in the past, they were sectoral in nature and tied to methods of storing information (such as geographic information systems). For example, take the model—an intelligent transport system that optimizes traffic by displaying the traffic situation on street information boards or smartphones of users, tells them the optimal route and performs many other useful functions [2]. The geographic information system is a common “geographical substrate” for all subsystems of a smart city. The e-learning subsystem allows students to listen to lectures on a home computer without actually being in the classroom. All recorded lectures are stored on special platforms such as Moodle, and the software allows students to participate in online learning [2–4]. In Ukraine, the development of smart cities begins with the introduction and application of certain technologies of the Internet of Things, which are aimed at solving the most acute problems facing Ukrainian cities. 
According to the nominations of the forum, the main directions of development of smart cities are: the best innovative city (effective use of technologies for the transformation of urban space); the best transport model of the city (the highest standards in the development of transport infrastructure); the most comfortable city to live in (highest living standards and population satisfaction); city of startups (support in the implementation of innovative projects and technological solutions to improve the urban space); ecological and energy efficient city (use of smart technologies to protect the urban ecosystem).


Business does not stand aside. IT-startups, mobile operators, companies working in the field of “Internet of Things” and ICT are active participants in the process of creating urban innovations (see Fig. 1). There are a number of companies in Ukraine that create “smart” houses. Their experience can be scaled to develop a system of “smart” city. Kyivstar has successfully tested NB-IoT, the Internet of Things network. For example, Drogobych has a mobile supplement “Drogobich—smart city”, which promotes a tourist map, billboard, electronic services and helps residents push about the connection of electricity and water supply. In Mariupol and Kiev, the program “Safe City” is being implemented with the installation of video cameras. Their presence allows you to reduce the number of crimes, accelerates their detection, monitors the work of public utilities and allows you to regulate traffic. In Kiev, the most successful cases are introducing a video surveillance system, a platform for restoring and servicing, which gave an opportunity to open resident’s office and a mobile supplement, a public budget. The system helps to read readings of sensors of movement of municipal transport, consumption of water and heat, to regulate illumination depending on time of day and weather, to regulate traffic for avoidance of traffic jams, to control a condition of environment, to increase safety on streets. D. Nazarov believes that one of the tools for accelerated introduction of modern technologies in cities could be a public–private partnership. However, this process is currently not regulated either at the legislative level or in practice. In 2015, the process of transformation of Kyiv into a “smart city” began, which provided for three levels of key changes [5]: technological—creation of a modern

*Information for 2021-2013 is given without taking into account the temporarily occupied territory of the Autonomous Republic of Crimea and the city of Sevastopol, for 2014 -2016 – without taking into account the temporarily occupied territory of the Autonomous Republic of Crimea, the city of Sevastopol and part of the ATO zone, for 2017 – without taking into account the temporarily occupied territory. Autonomous Republic of Crimea, the city of Sevas topol and parts of the temporarily occupied territories in Donetsk and Luhansk regions.

Fig. 1 Export of services from Ukraine in the field of information and communication technologies (for the data of the State Statistics Committee of Ukraine)


effective platform for urban infrastructure management, effective management of housing and communal services, use of security technologies, rapid response for emergency calls, timely response to the problems of Kiev, etc.; changes in city management—increasing transparency of administration and city management, developing a transparent and constructive model of public–private partnership, improving the investment climate and conditions for business development, smart integration of information between city departments, use of modern data management systems and quality analysis of events and processes in town; social change— the development of modern social infrastructure and the movement toward social equality, involving citizens in decision-making and influencing the creation of urban policy, ensuring financial stability and sustainable economic development of the city to improve the living standards of Kiev citizens [5]. Established, the concept of development of a smart city in Kyiv “Kyiv Smart City 2020” [6] was developed with the participation of the public, city government experts, representatives of Ukrainian technology companies and international business, NGOs, the scientific, and academic community. During the preparation of the project, a cooperation agreement was signed to implement the Kyiv Smart City strategy between representatives of the public, business, the IT community of Kyiv and the Kyiv City State Administration. It should be noted that the financing of the concept tasks will be carried out within the framework of the complex city target program “Electronic Capital” and the Kyiv City Informatization Program for the respective budget years. Five priority areas for improving the comfort of Kyiv have been identified: housing and communal services, security, transport, medicine, and reforming the city government management system, which use modern innovative technologies. The basic principles of the Kyiv Smart City 2020 concept include diagnostics of business processes, implementation of a transparent model of public–private partnership, cooperation with cities, analysis of regulatory restrictions and providing proposals for their elimination, investment attraction [5, 7]. As of December 2018, the following projects have already been implemented: open budget, participation budget, implemented system of electronic public procurement (thanks to which UAH 2.27 billion has been saved), electronic petitions and information system for Kyiv residents, register of Kyiv territorial community, information, and analytical. The “Property” system, the city Wi-Fi network is being developed, and the contact center of the city of Kyiv is operating. The Safe City program has been launched, in which about 6000 CCTV cameras have been installed in Kyiv, which transmit data to the city’s Data Exchange Center in real time. Also introduced 610 thousand cards of Kyiv residents, determining the arrival of transport, contactless passage in the subway, electronic registration for a doctor’s appointment and kindergarten [6]. Still, it does not feel that Kyiv is a full-fledged smart city. According to a study by the Traffic Index of the Dutch company TomTom—Kyiv took 13th place in the ranking of 400 cities in the world, where residents lose the most time in traffic jams. 
Experts analyze the IQ of cities according to nine criteria: human capital (development, involvement, and education of talents), social cohesion (consensus between social groups), economy, environment, management, urban planning, international relations, technology, mobility). However, the main desire to introduce a smart city


is more to create comfortable living conditions for people in large and small cities. The use of the latest technologies will contribute to a more rational use of resources from an economic and environmental point of view. In addition, all areas of city life are combined into a single effective system. There is still no clear definition of the concept of “smart city”, but despite this, European scientists are already studying “smart cities”. For example, a special laboratory was opened at the Technical University of Vienna in 2007, which still studies European cities, analyzing their compliance with the criteria of a smart city. Initially, small and medium-sized cities were studied, and in 2015 the study of large cities, including megacities [8]. Currently, the implementation of the concept of smart cities is actively lobbied by large IT corporations, such as IBM. The management of such companies developing information products sees great prospects for expanding their business. Such investment projects are characterized by a high degree of uncertainty and riskiness. In works [1, 2], the authors noted that, in order to increase the efficiency and effectiveness of evaluating such large projects, it is advisable to use the potential of various computerized decision support systems (DSS) [9–11]. Undoubtedly, this also applies to large interstate or interregional investment projects in Smart City technologies [3, 5, 6]. All of the above has predetermined the relevance of the topic of our research, in particular in the aspect of the need to develop new models and the corresponding software product, which will reduce the discrepancy between forecasting data and the real return on investment in Smart City technologies.

2 Literature Review In recent years, a fairly large number of works have been devoted to the issues of financial investment in Smart City technologies [12–15]. In [13, 14], the authors showed that in the course of analyzing the algorithms that are used in the process of evaluating mutual investment in Smart City technologies, it is advisable to analyze possible situations in the context of the actions of two parties (players). In the approach described by the authors [12–15], side 1 is an investor (or INV_1) from one region; side 2—an investor (or INV_2) from the second region According to [14, 15] (INV_2) is considered as a certain set of potential dangers or risks that may arise as a result of uncoordinated actions of investors in Smart City technology. This will lead to a loss of capital that was spent on IT projects within the Smart City. In works [16–18] it was noted that in relation to this class of problems, from the point of view of methodology, the most suitable methods and models based on the theory of games. Analysis of similar studies [14, 16] shows that most of the models, in particular in [17, 19], do not contain real recommendations to investors in Smart City. This also applies to the formulation of the problem that we propose in this article. Namely, models for finding rational strategies for continuous mutual financial investment in Smart City projects, taking into account fuzzy information.


The approaches proposed by the authors [16, 17, 20] do not allow finding effective recommendations and strategies for managing investments in Smart City. This circumstance determines the need and relevance of the development of new models and software products, for example, decision support systems (DSS), which are focused on intellectualizing the procedures for selecting rational strategies for continuous mutual financial investment in Smart City projects in a fuzzy continuous setting.

3 Research Objectives The objectives of the article are:
• development of a model designed for the computational core of the DSS, used in the search procedures for rational strategies of continuous mutual investment in technologies for Smart City, taking into account fuzzy information;
• approbation of the model in the course of computational experiments in the MathCad environment.

4 Methods and Models 4.1 Problem Statement This study develops the works [14, 15], which are based on the application of the apparatus of the theory of multistep and differential games. Let us consider two sides. Player #1 is the first investor (INV_1) in the information technologies of region #2. Accordingly, player #2 is the second investor (INV_2) in the information technologies of region #1. These players use financial resources (FR) to achieve their goals and control a dynamic system (hereinafter DS), which can be described by bilinear differential equations with dependent motions. Let us define the sets of strategies (U) and (V) of the players. In addition, two terminal surfaces are defined; for INV_1 and INV_2 they are denoted S0 and F0. The purpose of INV_1 is to bring the DS to the terminal surface S0 with its control strategies, and we assume that INV_1 can achieve this goal regardless of the actions of INV_2. The purpose of INV_2 is to bring the DS to the terminal surface F0 with its control strategies; likewise, INV_2 can pursue this goal regardless of the actions of INV_1. The solution consists in finding the set of initial states of the objects and their strategies, which will allow INV_1 or INV_2, respectively, to bring the DS to one or the other surface.


4.2 Research Methodology The toolkit of the theory of differential games [14, 15] was used as a methodological basis for the study. The interaction of players within the framework of the study is described as a bilinear differential game of quality with fuzzy information. This approach made it possible to find solutions for any ratio of game parameters.

4.3 Game Solution and Optimal Strategies of the First Player Additionally, we accept the following designations: FinR—financial resource (FR) of the investor; g*—coefficient that determines the beam (ray) of balance; f1 and f2—respectively, the coefficient characterizing the elasticity of INV_2 investments with respect to INV_1 investments in Smart City, and the elasticity of INV_1 investments with respect to INV_2 investments; R2+—the positive orthant; t—time parameter; u*—optimal strategy of INV_1; z1—FR value for INV_1; z2^ξ—FR value for INV_2; u(t), v(t)—implementations of the players' strategies at time t; Z1 and Z2—respectively, the preference sets of INV_1 and INV_2; g1 and g2—growth rates of the FRs of INV_1 and INV_2. In accordance with [14, 15], the solution of the differential game generates two problems: problem no. 1 is considered from the point of view of the first ally player, and problem no. 2 from the point of view of the second ally player. In problem no. 1 the ally player is INV_1 and the opponent player is INV_2; conversely, in problem no. 2 the ally player is INV_2 and the opponent player is INV_1. The first player seeks to invest in Smart City technologies in the second region; likewise, the second player is willing to invest in Smart City in the first region. We will assume that for a given period of time [0, T] (T a real positive number), INV_1 and INV_2 have allocated, respectively, z1(0) and z2^ξ(0) of FR for the implementation of their projects within the development of Smart City. There is an interaction between the players. Unlike the game with full information, the initial state of INV_2 is not known exactly to INV_1: INV_1 only knows that the state of INV_2 belongs to a fuzzy set {X, m(.)}, where X is a subset of R+ and m(.) is the membership function of the state z2^ξ(0) in the set X, with m(z2^ξ(0)) ∈ [0, 1] for z2^ξ(0) ∈ X. Moreover, at every moment t (t ∈ [0, T]) the states z1(τ) for τ ≤ t are known to him. In this case, the following conditions are met: z1(τ) > 0 when the condition z1(τ) > 0 holds with certainty ≥ p0 (0 ≤ p0 ≤ 1), and z1(τ) < 0 when the condition z1(τ) < 0 holds with certainty < p0.

The values of the implementations of the strategy of INV_1, u(τ) (τ ≤ t), allocated for interaction with INV_2, are also known. The reasoning is performed from the INV_1 position; that is, no assumptions are made about how informed INV_2 is, which is equivalent to INV_2 having any information. The players choose their control actions simultaneously. The dynamics of the interaction is set as follows:

dz1(t)/dt = −z1(t) + g1 · z1^+(t) − u(t) · g1 · z1^+(t) − f2 · v(t) · g2 · z2^{+,ξ}(t),
dz2^ξ(t)/dt = −z2^ξ(t) + g2 · z2^{+,ξ}(t) − v(t) · g2 · z2^{+,ξ}(t) − f1 · u(t) · g1 · z1^+(t),   (1)

where z^+ = z if z ≥ 0 and z^+ = 0 if z < 0.

The interaction ends when one of the following conditions is met:

(z1(t), z2^ξ(t)) ∈ S0 = {(z1(t), z2^ξ(t)) ∈ S0* with certainty ≥ p0, (z1(t), z2^ξ(t)) ∈ S1* with certainty ≥ p0}   (2)

(z1(t), z2^ξ(t)) ∈ F0 = {(z1(t), z2^ξ(t)) ∈ S0* with certainty < p0, (z1(t), z2^ξ(t)) ∈ S1* with certainty < p0}   (3)

where
S0* = {(z1, z2^ξ) : (z1, z2^ξ) ∈ R2+, z1 > 0},
S1* = {(z1, z2^ξ) : (z1, z2^ξ) ∈ R2+, z2^ξ = 0}.

The game takes place like this. At a moment in time t INV_1 multiplies the value z 1 (t) per coefficient (rate of change, rate of growth) g1 and chooses the value u(t), u(t) ∈ [0, 1]… The latter determines the fraction of FR g1 · z 1 (t) INV_1 allocated for investments in region 2 at the time t… INV_2 works in a similar way. That is, at the moment in time t INV_2 multiplies the value z 2ξ (0) per coefficient (rate of change, rate of growth) g2 and chooses the value v(t)(v(t) ∈ [0, 1])… The ξ latter determines the fraction of FR INV_2 g2 · z 2 (t), allocated for investments in the region 1 moment t… The allocation of financial resources INV_1 and INV_2 causes,


due to the relationship (elasticity) between investments, the allocation of additional FRs to regions No. 1 and No. 2. Then the states INV_1 and INV_2 at the moment of time t are determined from relations (1). If it turns out that condition (2) is satisfied, then we will say that in the procedure for investing in Smart City INV_1 achieved the desired result with certainty p ≥ p0 and the procedure is over. If it turns out that condition (3) is satisfied, then we will say that in the procedure for investing in Smart City INV_2 achieved the desired result with reliability p > 1− p0 and the procedure is over. If neither condition (2) nor condition (3) are met, then the procedure for investing in Smart City continues further. We define the function F(.) : X → R+ , F(x) = {sup m(y), f or y ≤ x}. Let us denote by F− set of such functions, through T ∗ = [0, T ],—time segment. Strategy INV_1 is a rule that allows him, on the basis of the available information, to determine the value of FR, which INV_1 allocates for investment in Smart City to region number 2. The second player (INV_2) chooses his strategy v(.) based on any information. The first player seeks to find the set of his initial states. The set of such states will be represented by [14, 15] the set of preferences of the first player Z 1 . Then, and the strategies of the first player will be called his optimal strategies. The first player’s goal (INV_1) —finding the preference set, as well as finding its strategies, applying which it will obtain the fulfillment of condition (2). The formulated game model corresponds, according to the classification of decision-making theory, to the decision-making problem under conditions of fuzzy information. To describe the sets of preferences INV_1, it is necessary to enter the value: φ(0) = inf{φ  }, F(φ  ) ≥ p0 . The following are the solutions, i.e., sets of “preferences” Z 1 and optimal strategies u ∗ (.) for all ratios of the game parameters. This is a set of such initial states (z 1 (0), φ(0)), that if the game starts from them, then there is a strategy INV_1, which, for any implementations of the strategy INV_2, “leads”, at the time t state of the system (z 1 (0), φ(0)) into such that condition (2) will be satisfied. At the same time, INV_2 lacks a strategy that can “lead” to the fulfillment of condition (3) at one of the previous points in time. Table 1 shows the options for solving the game, i.e., sets of “preferences” Z 1 and optimal strategies INV_1. Similarly, you can solve the problem from the point of view of the second ally player (INV_2). The “preference” sets (cones) from the point of view of INV_2 are “adjacent” to the “preference” sets INV_1. These sets are separated by rays of balance [14].
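
A minimal numerical sketch of the investment dynamics in system (1) is shown below; the Euler integration, the constant strategies, all parameter values, and the crisp stopping test standing in for condition (2) are illustrative assumptions, not the authors' MathCad implementation.

```python
import numpy as np

def simulate(z1_0, z2_0, u, v, g1=0.5, g2=0.4, f1=0.3, f2=0.3, dt=0.01, steps=1000):
    """Euler integration of the bilinear investment dynamics of system (1).
    u(t) and v(t) are the players' strategies, returning shares in [0, 1]."""
    z1, z2 = z1_0, z2_0
    trajectory = [(z1, z2)]
    for k in range(steps):
        t = k * dt
        z1p, z2p = max(z1, 0.0), max(z2, 0.0)          # the (.)^+ clamp from system (1)
        dz1 = -z1 + g1 * z1p - u(t) * g1 * z1p - f2 * v(t) * g2 * z2p
        dz2 = -z2 + g2 * z2p - v(t) * g2 * z2p - f1 * u(t) * g1 * z1p
        z1, z2 = z1 + dt * dz1, z2 + dt * dz2
        trajectory.append((z1, z2))
        if z2 <= 0 < z1:                               # crisp simplification of reaching S0
            break
    return np.array(trajectory)

# Example with constant strategies u(t) = 0.6 and v(t) = 0.2:
# path = simulate(10.0, 8.0, lambda t: 0.6, lambda t: 0.2)
```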


Table 1 Variants of solutions to the game, i.e., sets of "preferences" Z1 and optimal strategies of INV_1

No. 1. Game condition: r1 · r2 = 1, g2 ≥ g1.
Result: Z1 = {(z1(0), φ(0)) : (z1, φ) ∈ int R2+, g2 · φ(0) < r1 · g1 · z1(0)}; u*(z1(0), φ(0)) = 1 for (z1, φ) ∈ int R2+ with g2 · φ(0) < r1 · g1 · z1(0), and not defined otherwise.

No. 2. Game condition: r1 · r2 = 1, g2 < g1.
Result: Z1 = {(z1(0), φ(0)) : (z1, φ) ∈ int R2+, g2 · γ(0) < r1 · g1 · z1(0)}; u*(z1(0), φ(0)) = 0 for (z1, φ) ∈ int R2+ with g2 · φ(0) < r1 · g1 · z1(0) < g1 · h2(0); u*(z1(0), φ(0)) = 1 for (z1, φ) ∈ int R2+ with g2 · φ(0) < r1 · g1 · z1(0); not defined otherwise.

No. 3. Game condition: r1 · r2 > 1, g2 > r1 · g1 · r2.
Result: here u*(.) and Z1 are defined as in option 1.

No. 4. Game condition: r1 · r2 > 1, g1 ≤ g2 < r1 · g1 · r2.
Result: Z1 = {(z1(0), φ(0)) : (z1, φ) ∈ int R2+, (r1 · g1 · r2 · g2)^0.5 · φ(0) < r1 · g1 · z1(0)}; u*(z1(0), φ(0)) = 1 for (z1, φ) ∈ int R2+ with (r1 · g1 · r2 · g2)^0.5 · φ(0) < r1 · g1 · z1(0), and not defined otherwise.

No. 5. Game condition: r1 · r2 > 1, g1/(r1 · r2) < g2 < g1.
Result: here u*(.) and Z1 are defined as in option 4.

No. 6. Game condition: r1 · r2 > 1, g2 < g1/(r1 · r2).
Result: Z1 = {(z1(0), φ(0)) : (z1, φ) ∈ int R2+, r2 · g2 · φ(0) < g1 · z1(0)}; u*(z1(0), φ(0)) = 0 for (z1, φ) ∈ int R2+ with r1 · g1 · r2 · φ(0) < r1 · g1 · z1(0) < g2 · φ(0); u*(z1(0), φ(0)) = 1 for (z1, φ) ∈ int R2+ with r1 · g1 · z1(0) > g2 · φ(0); not defined otherwise.

No. 7. Game condition: r1 · r2 < 1, g2 ≥ g1.
Result: here u*(.) and Z1 are defined as in option 1.

No. 8. Game condition: r1 · r2 < 1, r1 · g1 · r2 ≤ g2 < g1.
Result: Z1 = {(z1(0), φ(0)) : (z1, φ) ∈ int R2+, r2 · g2 · φ(0) < g1 · z1(0)}; u*(z1(0), φ(0)) = 0 for (z1, φ) ∈ int R2+ with r1 · g2 · r2 · φ(0) < r1 · g1 · z1(0) < g1 · φ(0); u*(z1(0), φ(0)) = 1 for (z1, φ) ∈ int R2+ with r1 · g1 · z1(0) ≥ g1 · φ(0); not defined otherwise.

No. 9. Game condition: r1 · r2 < 1, g2 < r1 · g1 · r2.
Result: here u*(.) and Z1 are defined as in option 8.

5 Computational Experiment Computational experiments were performed in the MathCad environment, see Figs. 2, 3 and 4. The data on investment projects in Smart City technologies of large cities of Ukraine (Kiev, Kharkov, Lvov) were taken as the initial data. Figures 2, 3 and 4 show the results of three test calculations carried out in the course of the computational experiment. The purpose of the experiment is to determine the sets of strategies of the players (U) and (V). Cases are considered in which the strategies of the players bring them to the corresponding terminal surfaces S0 and F0. During the experiment, the sets of initial states of the objects and their strategies are found which allow the objects to bring the system to one or the other terminal surface. On the plane, the z1 axis represents the financial resources of INV_1 and the φ axis the financial resources of INV_2. The area under the beam, L1, is the area of "preference" of INV_1, and the area above the beam is the area of "preference" of INV_2 [15]. On the smartphone screen, the balance beams are displayed as gray lines with round markers. The results obtained demonstrate the effectiveness of the proposed approach based on solving a bilinear differential game of quality with several terminal surfaces in a fuzzy setting.

Fig. 2 Results of computational experiment 1

Fig. 3 Results of computational experiment 2

Fig. 4 Results of computational experiment 3

6 Computational Experiment Figure 2 shows the situation when INV_1 has an advantage in the ratio of initial financial resources when investing in Smart City; that is, the FRs lie in the preference set of INV_1. Figure 3 shows the situation when INV_1 uses a non-optimal strategy at the initial time, so that player 2 succeeds in "bringing" the state of the system to "his" terminal surface. Figure 4 shows the situation when the initial state of the system lies on the balance beam; this "satisfies" both INV_1 and INV_2, and we obtain a "stable" system. A disadvantage of the model is that manual calculations are somewhat cumbersome, but with the algorithmization of the model an effective software product can be obtained for predictive assessment when choosing investment strategies in Smart City, even where forecasts did not always coincide with the actual data. A further prospect for the development of this study, set out in the framework of the article, is the transfer of the accumulated experience into the real practice of optimizing investment policy in technologies for Smart City in other countries.
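
The partition of the (z1, φ) plane described above can be visualized with a short plotting sketch such as the one below; the balance-ray slope k and the axis limits are illustrative values, not the parameters of the reported experiments.

```python
import numpy as np
import matplotlib.pyplot as plt

k = 0.8                                  # illustrative slope of the balance beam
z1 = np.linspace(0, 10, 100)

plt.plot(z1, k * z1, color="gray", marker="o", markevery=10, label="balance beam")
plt.fill_between(z1, 0, k * z1, alpha=0.2, label="preference set of INV_1 (below the beam)")
plt.fill_between(z1, k * z1, 10, alpha=0.2, label="preference set of INV_2 (above the beam)")
plt.xlabel("z1 (financial resources of INV_1)")
plt.ylabel("phi (financial resources of INV_2)")
plt.xlim(0, 10)
plt.ylim(0, 10)
plt.legend()
plt.show()
```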

7 Conclusions A model for the computing core of the decision support system in the process of continuous mutual investment in technologies for Smart City is proposed. In contrast to the existing ones, the proposed model is based on solving a bilinear differential game of quality with several terminal surfaces in a fuzzy setting. A new class of bilinear differential games is considered for the first time. This makes it possible to adequately describe the process of searching for rational strategies of players in the course of continuous mutual investment in the rapidly developing market of products, services and technologies for Smart City, taking into account fuzzy information.


The model was tested in the course of computational experiments in the MathCad environment.

References 1. Kupriyanovskiy VP, Nikolaev DE, Yartsev DI et al (2016) On localization of British standards for a smart city. Int J Open Inf Technol 4(7):13–21 2. Shneps-Shneppe MA (2016) How to build a smart city. Int J Open Inf Technol 4(1):12–20 3. Glazunova OG, Kasatkin DY, Kuzminska OG, Mokriev MV, Blozva AI, Voloshina TV, Sayapin TP (2016) Integration of primary resources and services of IT companies in the educational center of the university (collective monograph). Collective monograph, ed. Glazunova O.G.— Kiev: TOV “Interservice”, 285 p 4. Hulak H, Kriuchkova L, Skladannyi P, Opirskyy I (2021) Formation of requirements for the electronic record-book in guaranteed information systems of distance learning. Paper presented at the CEUR workshop proceedings, vol 2923, pp 137–142 5. Comprehensive city target program “Electronic Capital” for 2015–2018: approved by the decision of the Kyiv City Council from April, 2, 2015 No654/1518 (2018) URL: http://kmr.gov. ua/uk/municipal-target-programs. Date of application 23 Dec 2018 6. Concept “Kyiv Smart City 2020” [Electronic resource] (2018) URL: https://www.kyivsmart city.com/concept/ (date of the blast: 12/25/2018). Kyiv Smart City Concept: Website Retrieved from https://www.kyivsmartcity.com/?lang=en 7. Lakhno V, Malyukov V, Kryvoruchko O, Desiatko A, Shestak Y (2020) Smart city technology investment solution support system accounting multi-factories. https://doi.org/10.1007/978-3030-63322-6_1 8. Anthopoulos L (2015) Understanding the smart city domain: a literature review. Anthopoulos (2015) Transforming city governments for successful smart cities, no 1, pp 9–21 9. Lakhno V, Akhmetov B, Ydyryshbayeva M, Bebeshko B, Desiatko A, Khorolska K (2021) Models for forming knowledge databases for decision support systems for recognizing cyberattacks. In: Vasant P, Zelinka I, Weber GW (eds) Intelligent computing and optimization. ICO 2020. Advances in intelligent systems and computing, vol 1324. Springer, Cham. https://doi. org/10.1007/978-3-030-68154-8_42. 10. Khorolska K, Lazorenko V, Bebeshko B, Desiatko A, Kharchenko O, Yaremych V (2022) Usage of clustering in decision support system. In: Raj JS, Palanisamy R, Perikos I, Shi Y (eds) Intelligent sustainable systems. Lecture notes in networks and systems, vol 213. Springer, Singapore. https://doi.org/10.1007/978-981-16-2422-3_49 11. Bebeshko B, Khorolska K, Kotenko N, Kharchenko O, Zhyrova T (2021) Use of neural networks for predicting cyberattacks. In: Paper presented at the CEUR workshop proceedings, vol 2923, pp 213–223 12. Albino V, Berardi U, Dangelico RM (2015) Smart cities: definitions, dimensions, performance, and initiatives. J Urban Technol 22(1):3–21 13. Angelidou M (2015) Smart cities: a conjuncture of four forces. Cities 47:95–106 14. Lakhno V, Malyukov V, Bochulia T, Hipters Z, Kwilinski A, Tomashevska O (2018) Model of managing of the procedure of mutual financial investing in information technologies and smart city systems. Int J Civ Eng Technol (IJCIET) 9(8):1802–1812 15. Lakhno V, Malyukov V, Gerasymchuk N et al (2017) Development of the decision making support system to control a procedure of financial investment. Eastern-Eur J Enterprise Technol 6(3):24–41 16. Zanella A, Bui N, Castellani A, Vangelista L, Zorzi M (2014) Internet of things for smart cities. IEEE Internet Things J 1(1):22–32 17. Paroutis S, Bennett M, Heracleous L (2014) A strategic view on smart city technology: the case of IBM Smarter Cities during a recession. Technol Forecast Soc Chang 89:262–272



Lesion Segmentation in Skin Cancer Detection Using UNet Architecture Shubhi Miradwal, Waquas Mohammad, Anvi Jain, and Fawwaz Khilji

Abstract Malignant melanoma is becoming increasingly frequent, especially among fair-skinned people who are exposed to the sun. Variations in size and colour, fuzzy boundaries, and the low contrast between lesion and normal skin are adverse factors that lead to deficient or excessive delineation of lesions, or even inaccurate detection of the lesion location. Although artificial intelligence techniques such as convolutional neural networks (CNN) are widely used for accurate segmentation, existing encoder–decoder models based on densely connected networks and residual networks (ResNet) for skin lesion applications were trained on non-biomedical data. The difficulty of parameter settings, inadequate information in pre-trained features, and a shortage of multi-scale information all limit the effectiveness of skin lesion segmentation. To overcome these issues, the design is built on the notion of an encoder–decoder-based convolutional neural network. UNet is used in the system to ensure optimum lesion segmentation performance. The recommended models are tested on ISIC 2018 lesion images, and their effectiveness is assessed using accuracy, dice coefficient, Jaccard index, sensitivity, and specificity. Keywords Medical image segmentation · Encoder–decoder-based convolution neural network · Optimum lesion segmentation

S. Miradwal Amdocs Development Center India LLP, Pune, India W. Mohammad (B) Xoriant Solutions, Pune, India e-mail: [email protected] A. Jain Wipro Limited, Bangalore, India F. Khilji Tech Mahindra, Pune, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_25



1 Introduction Skin malignancies are divided into two types: melanoma and non-melanoma. Melanoma is a cancer that forms in the pigment-producing cells of the skin (melanocytes) and spreads to other regions of the body, whereas non-melanoma seldom spreads to other parts of the body. Malignant melanoma is one of the worst kinds of skin cancer in humans, with a higher mortality rate. According to data from the Western world, melanoma is the seventh most frequent cancer in women and the sixth leading cancer in men [1]. Despite this, it is also the most treatable kind of skin cancer if detected or diagnosed early enough [2]: early-detected melanoma may frequently be treated with a simple excision, lowering the fatality rate. Skin lesion segmentation, one of the medical image segmentation domains, is required for melanoma identification, and automatic segmentation of skin lesions in dermoscopic images speeds up melanoma diagnosis. The disorder is diagnosed through clinical screening and examination of dermoscopic, biopsy, and histological images. Given the variety of appearances, irregular forms, and indistinct limits of skin lesions, successful identification of skin lesions using these methodologies is tough, time-consuming, and error-prone, even for experienced radiologists. Traditional procedures need enhanced and well-illuminated skin images for precise identification of the lesions [3]. Image segmentation creates a set of pixels that encompasses all of the image data [4], and a well-drawn border can aid the accuracy of lesion classification. Findings on skin lesion segmentation have been published in [5, 6]. Image analysis has been used to solve the problem, and deep learning has lately emerged as a helpful solution [7]. Following its successful application to non-medical data, deep learning is garnering interest in a variety of medical image segmentation problems, such as MRI hippocampus segmentation and mammographic lesion classification [8, 9]. CNN-based algorithms learn image features from image statistics to discriminate between foreground and background pixels in order to create the final prediction. Jafari et al. (2017) [10] used a deep CNN technique for skin lesion segmentation from non-dermoscopic images; in small networks, however, the sliding-window-based technique is time-consuming and inadequate in a variety of scenarios. To address the problem of class imbalance, Yuan et al. (2017) [11] established a Jaccard-index-based convolutional method for skin lesion segmentation. Li et al. (2018) [12] proposed a deep learning system consisting of two fully convolutional residual networks (FCRN) that produce segmentation and coarse classification results at the same time. Inspired by the success of fully convolutional networks (FCN), Ronneberger et al. (2015) [13] built the UNet network for biomedical image segmentation. Zhou et al. presented UNet++, an improved version of the UNet concept, by utilising layered skip connections [9]. When compared to other traditional clustering techniques for lesion segmentation, the UNet model outperforms the competition [14]. DenseNet and ResNet [15] are


network depth-maximising methods that improve picture segmentation accuracy. The dilation layer boosts the resolution of these network blocks even further. Yu et al. (2017) [16] suggested a dilated residual network for picture segmentation. The dilated residual network surpasses the standard ResNet without requiring additional adjustments.

1.1 Our Contribution This research focuses on segmenting skin lesions for the early detection of melanoma skin cancer. The outline of our contributions is as follows:
1. A model variation of the UNet architecture is proposed.
2. The designed models are evaluated on the ISIC 2018 dataset, and the results of this work are compared with the state of the art.

2 Preliminaries 2.1 UNet Architecture UNet was created in 2015 by O. Ronneberger as a fully convolutional neural network (CNN), a deep learning network architecture for biomedical image segmentation [13]. The UNet network, which has an encoding part and a decoding part, is shown in Fig. 1. The encoding step of the contracting path employs two convolutional layers (3 × 3) at every level, padded and with stride one, each using rectified linear unit (ReLU) activation, followed by a max-pooling (2 × 2) operation with stride two. After the first convolutional layer, a dropout layer is introduced. With each down-sampling step, the spatial dimensions of the input image are halved and the number of feature channels is doubled. The lowest level is made up of two convolutional layers (3 × 3) with no pooling layer between them. The expanding path, also known as the decoding step, recovers the original dimensions of the input by up-sampling the feature map. The corresponding features from the contracting path are concatenated with the feature channels, followed by two convolutional (3 × 3) layers, the first followed by ReLU activation and a dropout layer and the second followed by ReLU activation. The last layer is a convolution (1 × 1) layer that converts the feature vector into a binary segmentation prediction. The layer architecture of UNet is shown in Fig. 2.
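To make the contracting and expanding steps concrete, the following is a minimal Keras sketch of the two building blocks described above; the filter counts and the dropout rate are illustrative assumptions rather than values taken from the paper.

```python
# A minimal sketch of the UNet encoder/decoder steps described above.
import tensorflow as tf
from tensorflow.keras import layers

def encoder_step(x, filters, dropout=0.1):
    """Two 3x3 convolutions (ReLU), dropout after the first, then 2x2 max-pooling."""
    c = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    c = layers.Dropout(dropout)(c)
    c = layers.Conv2D(filters, 3, padding="same", activation="relu")(c)
    p = layers.MaxPooling2D(pool_size=2, strides=2)(c)   # halves the spatial size
    return c, p                                          # c is kept for the skip connection

def decoder_step(x, skip, filters, dropout=0.1):
    """Up-sample, concatenate the matching encoder features, then two 3x3 convolutions."""
    u = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    u = layers.concatenate([u, skip])
    c = layers.Conv2D(filters, 3, padding="same", activation="relu")(u)
    c = layers.Dropout(dropout)(c)
    c = layers.Conv2D(filters, 3, padding="same", activation="relu")(c)
    return c
```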


Fig. 1 General architecture of UNet [13]

3 Proposed Methodology Skin lesion segmentation is accomplished using the UNet architecture in the proposed study. The process has four stages: preprocessing, UNet-based model construction, model testing, and model assessment. To construct the normalised images, the original lesion images are passed through preprocessing, which involves image shuffling and scaling. For training, the preprocessed images are fed into a UNet model, which has encoder–convolution–decoder layers arranged in a U shape. The trained model is then applied to the test images to obtain the segmented images. Specificity, sensitivity, accuracy, dice coefficient, and the Jaccard index are used to evaluate the models. In the proposed study, different UNet configurations are applied to the skin lesion images and experimented with to produce the predicted masks, which are then assessed against the ground-truth mask images.
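A minimal sketch of the preprocessing stage (shuffling and scaling) is given below; the directory layout, target image size, and use of PIL are assumptions for illustration, while the 64/20/16 split follows the experimental setup reported later.

```python
# A minimal sketch of the preprocessing step: resize, normalise, shuffle and split.
import glob
import numpy as np
from PIL import Image

def load_and_normalise(paths, size=(256, 256), mode="RGB"):
    images = [np.asarray(Image.open(p).convert(mode).resize(size)) / 255.0 for p in paths]
    return np.stack(images).astype("float32")

image_paths = sorted(glob.glob("ISIC2018/images/*.jpg"))   # assumed directory layout
mask_paths = sorted(glob.glob("ISIC2018/masks/*.png"))

rng = np.random.default_rng(42)
order = rng.permutation(len(image_paths))                  # shuffle image/mask pairs together
images = load_and_normalise([image_paths[i] for i in order])
masks = load_and_normalise([mask_paths[i] for i in order], mode="L")[..., None]

n = len(images)
n_train, n_val = int(0.64 * n), int(0.20 * n)              # 64% train, 20% validation, 16% test
x_train, x_val, x_test = np.split(images, [n_train, n_train + n_val])
y_train, y_val, y_test = np.split(masks, [n_train, n_train + n_val])
```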

3.1 A Proposed Model of UNet A variation of the UNet architectural paradigm is proposed and implemented in this article. The model includes encoder, decoder, and convolution blocks. The encoding block uses pooling layers to reduce positional information while detecting abstract properties. To achieve exact localisation, the local pixel properties recognised by the decoding layers are utilised: these local features are combined with the new feature map in the up-sampling strategy to conserve crucial feature information from the preceding down-sampling steps.


Fig. 2 Layer architecture of UNet


Fig. 3 UNet based proposed model for lesion segmentation

A diagrammatic representation of the proposed model is shown in Fig. 3. The dashed lines in the figure show skip connections between the encoder and decoder layers, which allow the network to keep its low-level features. The model is made up of four encoder layers, one convolution layer, and four decoder layers. The encoder block starts with 64 feature channels, and the number doubles in each subsequent layer after the maximum pooling operation. The inputs are then fed into a convolution block with 1024 channels, which performs batch normalisation and max-pooling operations. The maximum pooling approach aids in the reduction of error while keeping image texture information. The decoder block performs up-sampling and channel reduction to restore the original image resolution, ending with 64 feature channels. The decoder layers conduct transposed convolution and concatenation of the skip-connection feature vectors. The proposed model is tuned over different hyper-parameters, such as the optimizer, activation function, type of loss function, learning rate, dropout rate, batch size, and number of epochs. The input weights of the neurons are also changed and experimented with at every layer, which is what distinguishes the proposed model from the existing UNet model.
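The following is a minimal sketch, under stated assumptions, of how such a four-level UNet could be assembled with the Keras functional API; the 256 × 256 input size, dropout rate and sigmoid 1 × 1 output are assumptions, while the 64→1024 filter progression, the bottleneck batch normalisation and the skip connections follow the description above.

```python
# A minimal sketch of the proposed four-level UNet with skip connections.
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Dropout(0.1)(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_proposed_unet(input_shape=(256, 256, 3)):
    inputs = layers.Input(input_shape)
    skips, x = [], inputs
    for filters in (64, 128, 256, 512):                      # four encoder levels
        c = conv_block(x, filters)
        skips.append(c)                                      # kept for the skip connections
        x = layers.MaxPooling2D(2)(c)
    x = conv_block(x, 1024)                                  # bottleneck convolution block
    x = layers.BatchNormalization()(x)
    for filters, skip in zip((512, 256, 128, 64), reversed(skips)):
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
        x = layers.concatenate([x, skip])                    # dashed skip connection in Fig. 3
        x = conv_block(x, filters)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)   # 1x1 convolution for the mask
    return Model(inputs, outputs)

model = build_proposed_unet()
```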


Table 1 Statistics of dataset

Database  | Total images | No. of training images | No. of validation images | No. of testing images
ISIC_2018 | 2594         | 1660                   | 518                      | 416

4 Experimental Evaluation This section presents the dataset, experimental setup, evaluation metrics, and results with discussion.

4.1 Dataset and Experimental Setup The suggested models are tested using skin lesion images from ISIC 2018 [17]. The images are split into 64% training, 20% validation, and 16% test sets; Table 1 shows the dataset statistics. The model is implemented in Python TensorFlow using the Keras module. The Adam optimizer performs well across a wide range of situations and is therefore used here. The categorical cross-entropy loss function L, defined in Eq. 1, is minimised for ten epochs with a learning rate of 1e-4 and a batch size of four. In Eq. 1, y represents the actual image, ŷ the predicted image, Q the number of training samples, and R the number of categories.

$$L(y, \hat{y}) = -\sum_{r=0}^{R} \sum_{q=0}^{Q} y_{q,r} \log \hat{y}_{q,r} \qquad (1)$$
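A minimal sketch of this training setup is given below, reusing the model and data tensors from the earlier sketches; binary cross-entropy is used here because the sketch has a single-channel sigmoid output, for which the categorical form in Eq. 1 reduces to the binary case, and the dice metric is an illustrative addition.

```python
# A minimal sketch of compiling and training the model with the reported settings.
import tensorflow as tf
from tensorflow.keras import backend as K

def dice_coefficient(y_true, y_pred, smooth=1.0):
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy",        # cross-entropy over the two mask classes
              metrics=[dice_coefficient])

history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    batch_size=4, epochs=10)
```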

4.2 Evaluation Metrics A variety of performance metrics may be used to assess the efficacy of DL approaches; the most commonly used measures are the following. Sensitivity is the proportion of abnormal images that are labelled as such, shown in Eq. 2. Specificity is the proportion of normal images that are classified as such, shown in Eq. 3. Accuracy is the proportion of images that are predicted correctly, shown in Eq. 4. True positives (TP) are the number of malignant images that have been detected as such. The number of normal images assessed as normal is known as true negative


(TN), whereas the number of normal images labelled as diseased is known as false positive (FP). The number of malignant images mislabelled as normal is known as false negative (FN). The dice coefficient and the Jaccard index are shown in Eqs. 5 and 6, respectively.

$$\text{Sen} = \frac{TP}{TP + FN} \qquad (2)$$

$$\text{Spe} = \frac{TN}{TN + FP} \qquad (3)$$

$$\text{Acc} = \frac{TN + TP}{TN + TP + FN + FP} \qquad (4)$$

$$\text{Dice} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \qquad (5)$$

$$\text{Jaccard} = \frac{TP}{TP + FN + FP} \qquad (6)$$
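The following helper, a sketch assuming binary ground-truth and predicted masks, computes Eqs. 2–6 directly from the pixel-wise confusion counts; the 0.5 threshold is an assumption.

```python
# A small sketch of Eqs. (2)-(6) computed from ground-truth and predicted masks.
import numpy as np

def segmentation_metrics(y_true, y_pred, threshold=0.5):
    t = y_true.ravel() > threshold
    p = y_pred.ravel() > threshold
    tp = np.sum(t & p); tn = np.sum(~t & ~p)
    fp = np.sum(~t & p); fn = np.sum(t & ~p)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "dice":        2 * tp / (2 * tp + fp + fn),
        "jaccard":     tp / (tp + fn + fp),
    }
```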

4.3 Results and Discussions Variations of the existing UNet architecture are presented and tested on lesion images in the proposed work. Loss and dice values are tracked for each epoch. Table 2 presents the loss and dice coefficient values per epoch for the UNet model on the training and validation datasets. Figure 4 shows a loss and dice curve comparison between training and validation, with the number of training epochs plotted against the dice coefficient.

Table 2 Loss and dice coefficient values according to epoch for training and validation data of UNet model

Epochs | Training loss | Training dice | Validation loss | Validation dice
1      | 0.3103        | 0.5393        | 0.3503          | 0.5195
2      | 0.2168        | 0.6227        | 0.224           | 0.6337
3      | 0.2160        | 0.680         | 0.2040          | 0.7101
4      | 0.1820        | 0.7110        | 0.2010          | 0.7201
5      | 0.1687        | 0.7322        | 0.2022          | 0.708
6      | 0.1487        | 0.7367        | 0.2104          | 0.7051
7      | 0.1426        | 0.7501        | 0.2030          | 0.7468
8      | 0.1366        | 0.7788        | 0.2080          | 0.7510
9      | 0.1310        | 0.7863        | 0.2011          | 0.7419


Fig. 4 Dice coefficient and loss function comparison for training and validation of UNet

There is a significant rise in dice similarity as the number of epochs increases. Throughout the training phase, the behaviour of the loss function is observed as the number of epochs grows, and during the validation phase the dice coefficient and loss function behaviour are monitored. Because there is no further progress in the validation performance, training is stopped after ten epochs: as the epochs increase the validation loss decreases at first, but after a few epochs both the loss and the accuracy stagnate. The proposed model is assessed using the evaluation metrics. The UNet model outperformed the others in terms of all measures and produced the best results on the ISIC dataset; training on larger data yields more accurate outcomes, and the model shows good specificity and sensitivity on the ISIC dataset. Figure 5 compares the performance of the proposed UNet with existing UNet models on the ISIC dataset; there is a significant disparity in sensitivity and F1-score. Table 3 compares the effectiveness of the suggested models with that of state-of-the-art approaches. Most existing work focuses on the UNet architecture. Yuan et al. devised a method for segmenting skin lesions based on a deep architecture. Kaul et al. proposed a cascaded encoder–decoder-based architecture with an attention mechanism to increase the depth of a UNet model. As the depth of the analysis rises, the specificity and accuracy scores improve. It does, however, have


Fig. 5 Performance comparison of proposed UNet and other UNet models on ISIC_2018

Table 3 Performance of UNet and ResUNet model with state-of-the-art methods

Authors                                   | Dice coefficient | Jaccard index | Accuracy | Sensitivity | Specificity
Yuan and Lo (2017) [18]                   | 85.9             | 78.5          | 90.02    | 88.5        | 98.5
Kaul et al. (2019) [19]                   | 86.15            | 78.62         | 94.14    | 78.73       | 97.96
Goyal et al. (2020) [20]                  | 78.3             | 60.6          | 91.1     | 68.2        | 98.2
Al-masni et al. (2018) [21]               | 84.1             | 68.6          | 90.8     | –           | –
SegNet (Badrinarayanan et al., 2015) [22] | 80.1             | 68.6          | 90.8     | 81.1        | 96.4
Proposed UNet                             | 83.8             | 74.41         | 94.05    | 87.42       | 98.92

a significant number of parameters due to the additional operations. In our suggested technique, the design is simple and the number of variables is restricted. In our study, the concatenated features are critical and they produce strong outcomes. To increase sensitivity and specificity, the suggested models employ dense feature maps at the higher layers. Al-masni employed the notion of a full-resolution convolutional network. In terms of consistency, sensitivity, and specificity, the proposed UNet model fared best, whereas Yuan et al. performed best on the Jaccard index and dice coefficient.


5 Conclusion The UNet and ResUNet architectures are used in this study to segment skin lesions. The model modification is developed and tested on the ISIC 2018 dataset, and a script-based approach for lesion segmentation is also proposed. The assessment measures dice coefficient, Jaccard index, sensitivity, specificity, and accuracy are used to compare the performance of the UNet and ResUNet models, and the UNet model outperformed the others on both. Models with additional encoder and decoder layers produce more significant findings. The performance of the suggested models is also compared with that of state-of-the-art approaches, and the suggested UNet model shows greater sensitivity than previous techniques. However, its sensitivity on the ISIC dataset still has to be improved, and because the segmentation images in the findings are blurry, the accuracy of the proposed models is not yet adequate. The number of convolution layers can be increased in the future for more precise results, and further hyper-parameter trials might reveal differences in the outcomes. The resulting segmented images can be used for skin cancer classification and feature extraction. In further research, we will try to combine UNet with other deep neural networks for more accurate results and will experiment with this model on similar and different datasets.

References 1. Wighton P, Lee TK, Lui H, McLean DI, Atkins MS (2011) Generalizing common tasks in automated skin lesion diagnosis. IEEE Trans Inf Technol Biomed 15:622–629 2. Sadeghi M, Razmara M, Lee TK, Atkins MS (2011) A novel method for detection of pigment network in dermoscopic images using graphs. Comput Med Imaging Graph 35:137–143 3. Adegun AA, Viriri S (2020) FCN-based DenseNet framework for automated detection and classification of skin lesions in dermoscopy images. IEEE Access 8:150377–150396 4. Hasan SN, Gezer M, Azeez RA, Gülseçen S (2019) Skin lesion segmentation by using deep learning techniques. In: 2019 Medical technologies congress (TIPTEKNO). IEEE, pp 1–4 5. Pennisi A, Bloisi DD, Nardi D, Giampetruzzi AR, Mondino C, Facchiano A (2016) Skin lesion image segmentation using Delaunay Triangulation for melanoma detection. Comput Med Imaging Graph 52:89–103 6. Yu Z, Jiang X, Zhou F, Qin J, Ni D, Chen S, Lei B, Wang T (2018) Melanoma recognition in dermoscopy images via aggregated deep convolutional features. IEEE Trans Biomed Eng 66:1006–1016 7. Ghosh S, Das N, Das I, Maulik U (2019) Understanding deep learning techniques for image segmentation. ACM Comput Surv (CSUR) 52:1–35 8. Kooi T, Litjens G, Van Ginneken B, Gubern-Mérida A, Sánchez CI, Mann R, den Heeten A, Karssemeijer N (2017) Large scale deep learning for computer aided detection of mammographic lesions. Med Image Anal 35:303–312 9. Hou B, Kang G, Zhang N, Liu K (2019) Multi-target interactive neural network for automated segmentation of the hippocampus in magnetic resonance imaging. Cogn Comput 11:630–643 10. Jafari MH, Nasr-Esfahani E, Karimi N, Soroushmehr S, Samavi S, Najarian K (2017) Extraction of skin lesions from non-dermoscopic images for surgical excision of melanoma. Int J Comput Assist Radiol Surg 12:1021–1030


11. Yuan Y, Chao M, Lo Y-C (2017) Automatic skin lesion segmentation using deep fully convolutional networks with jaccard distance. IEEE Trans Med Imaging 36:1876–1886 12. Li Y, Shen L (2018) Skin lesion analysis towards melanoma detection using deep learning network. Sensors 18:556 13. Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer, pp 234–241 14. Lin BS, Michael K, Kalra S, Tizhoosh HR (2017) Skin lesion segmentation: U-nets versus clustering. In: 2017 IEEE symposium series on computational intelligence (SSCI). IEEE, pp 1–7 15. Qamar S, Jin H, Zheng R, Ahmad P (2019) Multi stream 3D hyper-densely connected network for multi modality isointense infant brain MRI segmentation. Multimedia Tools Appl 78:25807–25828 16. Yu F, Koltun V, Funkhouser T (2017) Dilated residual networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 472–480 17. Milton MAA (2019) Automated skin lesion classification using ensemble of deep neural networks in ISIC 2018: Skin lesion analysis towards melanoma detection challenge. arXiv preprint arXiv:1901.10802 18. Yuan Y, Lo Y-C (2017) Improving dermoscopic image segmentation with enhanced convolutional-deconvolutional networks. IEEE J Biomed Health Inform 23:519–526 19. Kaul C, Manandhar S, Pears N (2019) Focusnet: An attention-based fully convolutional network for medical image segmentation. In: 2019 IEEE 16th international symposium on biomedical imaging (ISBI 2019). IEEE, pp 455–458 20. Goyal M, Oakley A, Bansal P, Dancey D, Yap MH (2019) Skin lesion segmentation in dermoscopic images with ensemble deep learning methods. IEEE Access 8:4171–4181 21. Al-Masni MA, Al-Antari MA, Choi M-T, Han S-M, Kim T-S (2018) Skin lesion segmentation in dermoscopy images via deep full resolution convolutional networks. Comput Methods Programs Biomed 162:221–231 22. Badrinarayanan V, Handa A, Cipolla R (2015) Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293

Getting Around the Semantics Challenge in Hateful Memes Anind Kiran , Manah Shetty , Shreya Shukla , Varun Kerenalli , and Bhaskarjyoti Das

Abstract Social media has evolved into a forum for people to express their thoughts and ideas on a wide range of topics. With this, there has been a correlated rise in hate and inflammatory speech against individuals and organisations, sometimes with severe consequences. Methods to classify 'hate' that is propagated through multi-modal media such as memes (which contain both image and text) fail when they do not comprehend both modes (i.e. image and text) together. While the text itself may not be hateful, images are utilised to lend additional context to the words to subtly imply and convey hatred. On the Facebook Hate Meme Dataset, specifically curated for conveying hate implicitly in memes (further complicated by the use of 'benign confounders'), the baseline established by Facebook with a state-of-the-art visual-linguistic model such as VilBERT is 64.73%. On the same dataset, our work beats the state-of-the-art baseline models by nearly 5% using an effective fusion of the semantics of both the text and the image. Keywords Cross-attention · Semantic gap · Multi-modal hate · Hate meme

A. Kiran (B) · M. Shetty · S. Shukla · V. Kerenalli · B. Das PES University, Bangalore 560085, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_26

1 Introduction The original intention of social media was to connect people, and provide a platform for people to share and express their opinions. However, today it is being used to spread hateful messages, oftentimes resulting in real and physical harm. Social media content is now primarily multi-modal, which increases the complexity of automatic classification of 'hateful' content. Existing models for hate speech detection perform effective classification on single modes (text-only models). State-of-the-art models such as BERT [6] (a neural net-based technique for natural language processing pre-training), for example, are able to understand the context of text and classify 'hateful' textual content with a high


degree of accuracy. However, today hate is also expressed through mediums such as memes—which not only adds another mode in the form of images, but also compounds the challenge as the meaning is subtly and cleverly implied only when both modes are taken into account. Deep learning techniques and models today fail at correlating the semantics and context derived from two entirely different modes, which is what we explore. Our work addresses this gap by using a combination of techniques such as fusion of modes, and models like auto-encoders and cross-attention. We explore the addition of ‘knowledge’—both common sense knowledge and highly specific knowledge such as ‘race’ or ‘gender’ identified in images—and through this, work on a semantic understanding of what constitutes as ‘hate’ in the domain of memes.

2 Literature Review Prior to reviewing the literature, it is imperative to understand what is meant by ‘hate’. An extensive survey by Fortuna and Nunes [7] has indicated how corporations have different viewpoints on what constitutes as ‘hate’, based on distinct ideals. We have worked on Facebook AI’s Hateful Memes Detection Challenge [11]. Hence, the definition of hate we have worked with is as per the definition in this dataset, i.e. ‘direct or indirect attack on people based on characteristics, including ethnicity, race, nationality, religion, caste, sex, etc.’. The problem of hate detection in multi-modal memes is rooted in approaches followed in multi-modal visual-linguistic (VL) problems such as emotion detection, image captioning and visual question answering (VQA). This line of research is essentially around two factors, i.e. unimodal learning approach and the fusion strategy of the two modes. In the first generation unimodal approach, each mode is individually preprocessed and subjected to feature engineering followed by dimensionality reduction, before the final step of supervised learning. The preprocessing steps for the text consists of steps such as tokenisation, lemmatisation, case conversion and noise removal. On the image side, the preprocessing consists of the steps such as resizing and noise removal. Feature engineering on the text side has been augmented by embeddings such as Word2Vec, FastText and GloVe that uses distributional semantics. The next generation of unimodal research distinguished itself by aspects such as unimodal pre-training using transfer learning to circumvent the lack of labelled data, use of attention mechanisms such as self-attention, hierarchical attention, multi-head attention, cross-attention and transformers. In the unimodal (text-only) domain, Fortuna and Nunes [7] and Waseem et al. [27] have elaborated how feature extractors like n-grams, TF-IDF and bag-of-words and classifiers such as SVMs and decision trees among other algorithms perform on text-based datasets— with the best models showing an F1-score of about 0.9. Mozafari et al. [18] have taken the challenge of unimodal classification a step further by trying to incorporate context using a BERT + CNN model to achieve an F1-score of 0.92. However, an evident failure of this is that it works only on unimodal datasets and cannot be used to capture implicit and subtle hate conveyed through multiple modalities.


Early work on detection of hateful multi-modal messages has used simple fusion strategy [21] across multiple modalities. In terms of fusion, three distinct strategies have emerged over these two generation of approaches, i.e. early fusion (fusing the features right after extraction), late fusion (fusing the unimodal decisions) and hybrid fusion (fusing the outputs from both early and late fusion). In order to address the late fusion challenge, Cheng et al. [2] and Khattar et al. [10] have offered novel solutions with the use of self-supervised neural networks in the form of variational autoencoders (VAEs). While Cheng et al. have applied it to rumour detection based on text alone, Khattar et al. have the unique application of VAE’s on multi-modal fake news classification. The recent approach used by VL researchers is that of multi-modal pre-training. Using multi-modal pre-training, many VL models have emerged in the last few years [13, 14, 17, 24, 25] and these can be roughly divided into single stream and dualstream models. The dual-stream models typically process the text and image tensors in two different transformer flows and align them by some kind of attention [26] mechanism whereas single stream achieves the same thing in a single transformer flow using elaborate preprocessing for aligning the tensors. In terms of the problems investigated by VL researchers, the existing research has been heavily influenced by the type of VL problem. The problem of image captioning is mostly a challenge of finding a common embedding space from multi-modal data that has the image and the caption closely aligned. A variety of CNNs and sequential models like LSTMs, etc., along with large scale pre-training has been combined in solving this kind of problem. The problem of visual question answering (VQA) essentially models the correlation between visual information and the text question. The second generation of the VL challenge consists of some graph-based as well as knowledge-based approaches, and this has been the popular approach for VQA. Specifically the problem of multi-modal hate detection can be of two types, i.e. the case when both the modalities are aligned and when they are not. The existing work on multi-modal hate has witnessed largely unimodal pre-training with various fusion strategies along with different kinds of attention techniques. In the multimodal (text + image) domain, Gomez et al. [8] introduced the MMHS-150k dataset, which scraped and curated content from Twitter, that seemed to be promising initially. However, in addition to a class imbalance, the dataset is curated using a predefined set of hateful terms which limits the scope of a model trained on this dataset. Gomez et al. also proposed the methodology of extracting features via Inception V3, and using LSTM on the image text and tweet text, achieving the highest accuracy of 0.68. However, the issue with directly concatenating the image and text features is that this late fusion does not allow the model to efficiently learn the correlation and semantics of the text and image together. As aforementioned, the subtlety of hate memes arises from the fact that there are ‘benign confounders’ in either of the modalities that contradicts and completely changes the final semantics of the meme from that of a particular modality taken on its own. 
The VL approach of Muennighoff [19] to the Facebook Hateful Memes Challenge [11] uses deep neural networks trained on massive datasets, and largely uses the pre-trained aspects of these models on a given dataset. Consequently, they



fail to make accurate predictions on memes that are not of a similar type to which these models have seen. An approach around the use of co-attention and cross-attention mechanisms has become popular in the recent past to handle multiple contexts in various domains such as in multi-modal disinformation research (e.g. detection of fake news [22]) and hate [29], QA [1], vQA [30] and aspect-based sentiment analysis [28]. Though these kind of mechanisms have been used in the various multi-modal pre-trained embeddings mentioned earlier, they alone may not be sufficient to address the ‘benign confounders’ in the Facebook Meme Dataset. Hence, an approach to augment alignment with the information around confounding or contrasting nature of a hateful meme has to be devised. To capture both complementary and contrasting information from both modalities, various attempts can be made. An approach around data augmentation [15] has attempted to address this by generating more samples of similar characteristics. Another approach has been the use of emotion expressed by each modality [5] along with the multi-modal pre-trained embedding based on the assumption that emotion will indirectly capture the contrast between modalities. Zhou et al. [32], in their work on multi-modal fake news detection, have made attempts to capture the contrast in modality as a characteristic of fake-ness by evaluating the similarity of the generated caption of the visual with the text annotation provided. Yet another approach on similar lines can be to use multi-task learning based on weakly generated labels of related tasks [4] such as sarcasm detection and emotion detection. Ensemble is another possible indirect approach to capture information offered by various models. Several attempts, including the prize winning solution of the Hateful Meme challenge [16, 31] have used an ensemble of multi-modal pre-trained VL models.

3 Approach As elaborated in the literature review, the task of hateful meme detection is difficult because it requires a composite understanding of the textual and visual components. There are multiple modalities, varying hate themes, different associated sentiments, and most importantly, there is semantic contrast at play in hateful memes. Based on this, the work described in this paper adopts a few fundamental strategies:
1. Ensemble of models that capture slightly differing information from the content. Our solution comprises an ensemble of five different models.
2. Capture additional signals such as race and gender, as they are often part of hate themes.
3. Cross-attention between modalities to get image-attended text and text-attended image.
4. Instead of data augmentation, use a variational auto-encoder (VAE), as it serves the dual purpose of model robustness and compressed information capture from multi-modal data.



3.1 A Semantic Understanding We use BiLSTM networks and knowledge-based word embedding models like ConceptNet [23] and GloVe [20] that are rooted in vast semantic networks and are therefore able to capture the common sense understanding of the text. On images, we use the pre-trained Xception model [3]. We extend Vaswani et al.'s attention [26] to a cross-attention model that attends to the text with context from the image, thereby capturing both inter-modal and intra-modal relationships, and equips the model with a deep understanding of the image, the text, and their coalescent subtext. Here, through one mode, the model learns to give attention to the most important pieces of information in the other mode, therefore promising that learning occurs across the two modes so as to capture context. The attention mechanism works on a key-value design, where a given query is the question, and checking the query against every key retrieves all matching content.

$$q = W_q x \qquad (1)$$
$$k = W_k x \qquad (2)$$
$$v = W_v x \qquad (3)$$

where q, k and v can be considered as rotations of the input. One query q is compared against the matrix of all keys, K:

$$a = \operatorname{softmax}(K^{\top} q) \qquad (4)$$

The hidden layer is the linear combination of the columns of V weighted by the coefficients in a:

$$h = V a \qquad (5)$$

Self-attention models take inputs (x) from a single mode, i.e. text embeddings in the case of text and image features in the case of images. Cross-attention employs a similar process with the inputs coming from both the text and the images, and we have

$$q_{\text{image}} = W_q x_{\text{image}} \qquad (6)$$
$$k_{\text{text}} = W_k x_{\text{text}} \qquad (7)$$
$$v_{\text{text}} = W_v x_{\text{text}} \qquad (8)$$



Each of the image feature vectors is of dimension [1 × 2048], while the word embeddings are each of dimension [1 × 3500]. With a batch size of four and two heads, weight matrices of dimension [1 × 1 × 4 × 1 × 2048] are used, where the first two dimensions correspond to the depth of the two heads, followed by the batch size, with the remainder being the feature vector size. The corresponding similarity matrix is computed as

$$a = \operatorname{softmax}(K_{\text{text}}^{\top} q_{\text{image}}) \qquad (9)$$

Finally, we get the hidden layers as follows:

$$h = V_{\text{text}} a \qquad (10)$$

A transformer encoder is built using this attention module, and the hidden representations thus obtained are used to train a binary classifier on which the classification task is performed. Our work additionally implements a simple fusion model and extends the VilBERT model by Lu et al. [17] and Singh et al. as alternate fusion strategies.
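The following is a minimal single-head sketch of this cross-attention, not the authors' exact implementation: queries are projected from the image features and keys/values from the text embeddings, following Eqs. 6–10; the shared projection width of 256 is an illustrative assumption.

```python
# A minimal single-head cross-attention sketch: image queries attend over text keys/values.
import tensorflow as tf
from tensorflow.keras import layers

class CrossAttention(layers.Layer):
    def __init__(self, d_model=256, **kwargs):
        super().__init__(**kwargs)
        self.wq = layers.Dense(d_model)   # W_q applied to image features
        self.wk = layers.Dense(d_model)   # W_k applied to text embeddings
        self.wv = layers.Dense(d_model)   # W_v applied to text embeddings

    def call(self, image_feats, text_embeds):
        q = self.wq(image_feats)                              # (batch, n_img, d_model)
        k = self.wk(text_embeds)                              # (batch, n_txt, d_model)
        v = self.wv(text_embeds)
        scores = tf.matmul(q, k, transpose_b=True)            # K^T q for every query
        weights = tf.nn.softmax(scores, axis=-1)              # attention coefficients a
        return tf.matmul(weights, v)                          # h = V a

# Example shapes: one 2048-d image feature vector, 20 words of 3500-d embeddings.
image_feats = tf.random.normal((4, 1, 2048))
text_embeds = tf.random.normal((4, 20, 3500))
hidden = CrossAttention()(image_feats, text_embeds)           # (4, 1, 256)
```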

3.2 Variational Auto-Encoders (VAE) The dataset specified by Facebook AI [11] is limited to approximately 10,000 data-points (coloured memes). Fusion via concatenation cannot capture the interdependent information that is required to build a suitable model for this challenge. A VAE [12] implementation has an encoder that gives a latent space distribution which can be sampled from indefinitely for a single data point. We construct such an encoder–decoder pair each with three dense layers for a latent space of 32 dimensions. Each of these encoded data-points effectively have some amount of noise induced in them which increases the robustness of any model trained on them. The final model is a binary classifier of three dense layers trained to be connected with the encoder’s output from the VAE.
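A minimal Keras sketch of such a VAE is given below, assuming a fused multi-modal feature vector as input; the hidden-layer widths, the input dimension and the mean-squared-error reconstruction loss are assumptions, while the three dense layers per side and the 32-dimensional latent space follow the description above.

```python
# A minimal VAE sketch: dense encoder to a 32-d latent distribution, mirrored decoder.
import tensorflow as tf
from tensorflow.keras import layers, Model

class MemeVAE(Model):
    def __init__(self, input_dim, latent_dim=32):
        super().__init__()
        self.enc1 = layers.Dense(1024, activation="relu")
        self.enc2 = layers.Dense(256, activation="relu")
        self.z_mean = layers.Dense(latent_dim)
        self.z_log_var = layers.Dense(latent_dim)
        self.dec1 = layers.Dense(256, activation="relu")
        self.dec2 = layers.Dense(1024, activation="relu")
        self.out = layers.Dense(input_dim)

    def encode(self, x):
        h = self.enc2(self.enc1(x))
        return self.z_mean(h), self.z_log_var(h)

    def call(self, x):
        mean, log_var = self.encode(x)
        eps = tf.random.normal(tf.shape(mean))
        z = mean + tf.exp(0.5 * log_var) * eps                # reparameterisation trick
        # KL divergence term added to the reconstruction loss supplied at compile time.
        self.add_loss(-0.5 * tf.reduce_mean(1 + log_var - tf.square(mean) - tf.exp(log_var)))
        return self.out(self.dec2(self.dec1(z)))

# Usage sketch: train as an auto-encoder, then feed the 32-d latent codes (and their
# noisy samples) to a small three-dense-layer binary classifier.
# vae = MemeVAE(input_dim=2348)                   # fused image+text feature size (assumption)
# vae.compile(optimizer="adam", loss="mse")
# vae.fit(features, features, epochs=20, batch_size=32)
```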

3.3 Helping Decode Hate in Memes In addition to a semantic understanding of the image and the text, our solution provides methods to supplement the ensemble with additional knowledge to aid in its work. According to Facebook, more than half of the hateful meme content online is related to race or ethnicity. Our work solves this with the implementation of the FairFace classifier [9] to extract racial and gender tags from the images, which are then simply concatenated with the text from the memes and employed as an additional feature input into our models.
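A small sketch of how such tags can be fed in as extra text is given below; `predict_race_gender` is a hypothetical placeholder for FairFace inference, whose real API is not reproduced here.

```python
# A sketch of prepending race/gender tags to the meme text before embedding.
def predict_race_gender(image_path):
    # Hypothetical placeholder: in practice this would run the FairFace model on the
    # image and return strings such as ("east asian", "female").
    return "unknown", "unknown"

def augment_text_with_tags(image_path, meme_text):
    race, gender = predict_race_gender(image_path)
    # The tags are simply concatenated with the OCR'd meme text and treated as
    # additional tokens by the downstream text encoders.
    return f"{race} {gender} {meme_text}"
```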



Fig. 1 Complete pipeline

3.4 Bringing It All Together Figure 1 shows the complete pipeline of the solution proposed. GloVe word embeddings are generated over a concatenation of the image text with the race and gender tags identified by the FairFace classifier. Image features are extracted using the Xception V3 model. These embeddings and image features are inputs to the simple fusion and cross-attention models, with the self-attention model training only on the text embeddings. The BERT model takes as input the image text concatenated with the race and gender tags. This also serves as input to the VilBERT model along with the images in the dataset. An analysis of the classification performed by each of the models reveals each to be performing better on different elements of hate. Our solution thus concludes with an ensemble learning approach that establishes a system of models that average each other out to achieve the best performance.
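A minimal sketch of the equal-weight majority vote over the five models' binary predictions is given below; the tie-breaking rule towards the 'hateful' class is an assumption.

```python
# A minimal majority-voting ensemble: the most common hateful/not-hateful label wins.
import numpy as np

def majority_vote(binary_predictions):
    """binary_predictions: array of shape (n_models, n_samples) with 0/1 votes."""
    votes = np.asarray(binary_predictions)
    return (votes.mean(axis=0) >= 0.5).astype(int)   # ties resolved towards 'hateful'

# Example: five models voting on three memes.
preds = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(majority_vote(preds))   # -> [1 0 1]
```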

4 Results and Discussion The accuracy as a measure of performance of the individual models on the test set can be given in Table 1. The accuracy of models trained on the raw dataset without the addition of information in the form of racial and gender tags is also displayed. The addition of knowledge from external sources is justified because these metrics



Table 1 Performance of the models Model Racial and gender information Accuracy supplemented? BERT on text-only Fusion with GLoVe VAE Fusion BiLSTM with simple fusion Self-attention on text Cross-attention VilBERT Ensemble—majority voting

No Yes No Yes No No No Yes No Yes Yes No Yes All models—with and without racial + gender information

62.96 66.43 62.0 62.5 57.4 65.2 61.65 62.7 60.3 62.1 63.7 66.75 69.3 68.4

are higher than the accuracy achieved by models without the additional knowledge as extra input. Figure 2 demonstrates the strengths of the cross-attention model over the other models: it performs well when a highly cohesive understanding of the two modes is key to an accurate classification. Figure 3 depicts how an ensemble model performs better by forgiving the misclassifications of a few models in favour of the prediction made by the majority of models, thereby making the correct prediction. Our work concludes with an ensemble of models trained on a knowledge-rich dataset, achieving an accuracy of 69.3% on the test set, thereby beating the baseline accuracy set by Facebook AI (64.73%) by nearly five per cent.

5 Conclusion and Future Work The solution takes a two-part approach to the task of hateful meme detection, in careful consideration of the complexity of hate in memes due to the subtlety and necessity for joint perception of the two modes. The first step was to extract a semantic understanding of the two modes, with the addition of external knowledge. This showed a significant boost in accuracy achieved by the models, justifying our design decisions and future work (which is based on the same key ideas). Following this, 1 2

The image is presented in grayscale. See Footnote 1.

Getting Around the Semantics Challenge …

Fig. 2 Comparison of the cross-attention model with the other models

Fig. 3 Motivation for ensemble

349

350

A. Kiran et al.

our work explored an effective fused learning of the semantics of the two modes, primarily with a cross-attention model with several examples of it faring better. The solution settled on an ensemble model that captured different elements of hate by being inclusive of the prediction made by each of the models implemented. This achieved a greater than four per cent boost in accuracy from the benchmark set by Facebook AI using state-of-the-art visual-linguistic models. With the ensemble achieving a significant boost in accuracy and the principal design decisions sound, the directions for future work are vast. A preliminary analysis on the classification made by the models show that some models are more efficient than others for specific kinds of hate. Thus, the first step would be an improvement of the ensemble mechanism based on the type of meme encountered as opposed to an equal weight voting system. Secondly, content augmentation in terms of emotion in two modalities, semantic contrast, etc., would be made part of the model as the aim is to capture the ‘benign confounders’ in hateful memes. Thirdly, additional ways of supplementing the model with signals such as tagging of political figures or topics in the images and text, along with using current socio-political context, will be explored.

References 1. Cai L, Zhou S, Yan X, Yuan R (2019) A stacked bilstm neural network based on coattention mechanism for question answering. Comput Intell Neurosci 2. Cheng M, Nazarian S, Bogdan P (2020) Vroc: Variational autoencoder-aided multi-task rumor classifier based on text. In: Proceedings of the web conference, pp 2892–2898 3. Chollet F (2017) Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1251–1258 4. Dai W, Cahyawijaya S, Bang Y, Fung P (2021) Weakly-supervised multi-task learning for multimodal affect recognition. arXiv preprint arXiv:2104.11560 5. Das A, Wahi JS, Li S (2020) Detecting hate speech in multi-modal memes. arXiv preprint arXiv:2012.14891 6. Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 7. Fortuna P, Nunes S (2018) A survey on automatic detection of hate speech in text. ACM Comput Surv (CSUR) 51(4):1–30 8. Gomez R, Gibert J, Gomez L, Karatzas D (2020) Exploring hate speech detection in multimodal publications. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 1470–1478 9. Kärkkäinen K, Joo J (2019) Fairface: Face attribute dataset for balanced race, gender, and age. arXiv preprint arXiv:1908.04913 10. Khattar D, Goud JS, Gupta M, Varma V (2019) Mvae: multimodal variational autoencoder for fake news detection. In: The world wide web conference, pp 2915–2921 11. Kiela D, Firooz H, Mohan A, Goswami V, Singh A, Ringshia P, Testuggine D (2021) The hateful memes challenge: detecting hate speech in multimodal memes 12. Kingma DP, Welling M (2019) An introduction to variational autoencoders 13. Li G, Duan N, Fang Y, Gong M, Jiang D (2020) Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 11336–11344 14. Li LH, Yatskar M, Yin D, Hsieh CJ, Chang KW (2019) Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557

Getting Around the Semantics Challenge …

351

15. Li Y, Huang H (2021) Enhance multimodal model performance with data augmentation: Facebook hateful meme challenge solution. arXiv preprint arXiv:2105.13132 16. Lippe P, Holla N, Chandra S, Rajamanickam S, Antoniou G, Shutova E, Yannakoudakis H (2020) A multimodal framework for the detection of hateful memes. arXiv preprint arXiv:2012.12871 17. Lu J, Batra D, Parikh D, Lee S (2019) Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in neural information processing systems, pp 13–23 18. Mozafari M, Farahbakhsh R, Crespi N (2019) A bert-based transfer learning approach for hate speech detection in online social media. In: International conference on complex networks and their applications. Springer, pp 928–940 19. Muennighoff N (2020) Vilio: State-of-the-art visio-linguistic models applied to hateful memes. arXiv preprint arXiv:2012.07788 20. Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543 21. Sabat BO, Ferrer CC, Giro-i Nieto X (2019) Hate speech in pixels: detection of offensive memes towards automatic moderation. arXiv preprint arXiv:1910.02334 22. Song C, Ning N, Zhang Y, Wu B (2021) A multimodal fake news detection model based on crossmodal attention residual and multichannel convolutional neural networks. Inf Process Manage 58(1):102437 23. Speer R, Chin J, Havasi C (2017) Conceptnet 5.5: an open multilingual graph of general knowledge, pp 4444–4451. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972 24. Sun C, Myers A, Vondrick C, Murphy K, Schmid C (2019) Videobert: a joint model for video and language representation learning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 7464–7473 25. Tan H, Bansal M (2019) Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490 26. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems, pp 5998–6008 27. Waseem Z, Hovy D (2016) Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In: Proceedings of the NAACL student research workshop, pp 88–93 28. Yang C, Zhang H, Jiang B, Li K (2019) Aspect-based sentiment analysis with alternating coattention networks. Inf Process Manage 56(3):463–478, 102437 29. Yang F, Peng X, Ghosh G, Shilon R, Ma H, Moore E, Predovic G (2019) Exploring deep multimodal fusion of text and photo for hate speech classification. In: Proceedings of the third workshop on abusive language online, pp 11–18 30. Yu Z, Yu J, Cui Y, Tao D, Tian Q (2019) Deep modular co-attention networks for visual question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6281–6290 31. Zhong X (2020) Classification of multimodal hate speech–the winning solution of hateful memes challenge. arXiv preprint arXiv:2012.01002 32. Zhou X, Wu J, Zafarani R (2020) [... formula...]: Similarity-aware multi-modal fake news detection. Adv Knowl Discov Data Min 12085:354

Classification of Brain Tumor of Magnetic Resonance Images Using Convolutional Neural Network Approach Raghawendra Sinha and Dipti Verma

Abstract The impact of brain tumors in medical field cannot be ignored and may lead to a short life in their highest grade. Thus, conduction of proper diagnosis that too in its early stage to improve the quality of life of patients is a necessity. Normally, several image processing techniques including computed tomography (CT) and magnetic resonance imaging (MRI) are being utilized to localize and calculate the size and tumor in a brain. But it has limited performance for accurate quantitative measurements that too in a small number of sample images. In this work, a simple yet robust classification using convolutional neural networks (CNN) for brain tumor is proposed. The investigational outcomes with low complication are anticipated and potentially compete the relevant state-of-the-art methods. Keywords Artificial intelligence · Convolutional neural network · Deep learning · Medical diagnosis · Medical resonance imaging

1 Introduction Recent expansion in artificial intelligence and information technology techniques is getting popularity in various real-life applications. These applications consist medical fields such as brain computing and optimization [1–3], development of autonomous vehicles [4–7], and progress of smart cities [8]. All these areas have uniform weightage but securing life through medical diagnosis is always on the top. Fusion of artificial technique into medical domain has shown remarkable performance over the traditional approach. Brain tumors are among the most serious cancers for extensive illness and death in the US [9]. Common and initial symptom of brain tumor includes mood swings, headaches, change in personality, memory loss, and vision problem. This ailment keeps the person from basic work, taking a colossal toll on livelihoods and overall progress of society and country due to its severity. Diagnosis R. Sinha (B) · D. Verma Department of Computer Science and Engineering, Vishwavidyalaya Engineering College, Ambikapur, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_27

353

354

R. Sinha and D. Verma

of a brain tumor is carried out by a neurosurgeon or a neurologist with the help of computer tomography (CT) scan and/or with magnetic resonance imaging (MRI). But, the structure of tumor in brain is not uniform and has varied representation. An affected but unidentified tumor examination sometimes may lead to the death. Nevertheless, chances of manual mistake may lead to unexpected treatment due to this short duration of investigation. Therefore, using intelligent technique a precise detection of such object is required to avoid any chances of human death. Nowadays, medical field is also being facilitated with the advancement of several intelligent techniques, which assists the precise decision-making of a neurosurgeon/neurologist. In this domain, several image processing and optimization approaches have been applied to understand the different data patterns in several applications [10–14] but sometimes their performance has been experimented against the standard conditions. In this direction, recent development in machine and deep learning techniques has leveraged the performance of the artificial intelligence (AI)-based systems [15–19]. AI techniques have the prospective to benefit some of the most challenging issues when united with standard for the betterment of our society. Among the recent AIbased techniques, deep learning assists to build strong, accessible, and operative solutions which can also be implemented to malaria detection problem in the blood cell. In this proposed approach, a convolutional neural network (CNN)-based model has been employed which not only accurately classifies between affected and unaffected brain tumor MRI images but also directs the implementation of other disease’s classification or detection in the similar domain. The schematic of the proposed approach is illustrated in Fig. 1. The structure of rest of this paper is organized as follows: Sect. 2 will discuss the relevant studies in the current domain; Sect. 3 reveals the implementation approach and the experimental results; finally, Sect. 4 discloses the performance measure of the proposed work and Sect. 5 describes the final observations and consists the insights for future directions.

Fig. 1 A process flow indicating the classification of brain tumor MRI image into infected and uninfected category

Classification of Brain Tumor of Magnetic Resonance …

355

2 Related Works Various approaches have been employed to classify brain tumor into infected and unaffected part using different standard public datasets. In this direction, the idea of dual path residual CNN had been utilized and the model attained a highest classification accuracy of 84.9% [20]. Similarly, Sajja et al. had conducted experiment utilizing integrated architecture using fuzzy C-means clustering and support vector machine (SVM) that provides the classification error rate of 5.2 and accuracy score of 94.8% [21]. Principle component analysis (PCA) and gray-level co-occurrence matrix (GLCM) had been employed for selecting the important feature to discover the presence of brain tumor and their classification into malignant and benign categories using support vector machine (SVM) [22]. Using the standard neural network architecture, Chan et al. applied VGG16 and ResNet 50 with random forest (RF) and k-means clustering approaches [23]. Authors had attained the maximum prediction performance in terms of F1 score of 86% score. Kaur et al. extracted the significant feature representation and validated their approach using confusion matrix in which SVM score (87.5%) was the maximum among KNN, boosted tree, and bagged tree methods [24]. To obtain detailed information of tumor from MRI images, Veer et al. used multilayer perceptron mechanism and categorize the brain tumor with accuracy score of 90.47% (against the 80%-20% training and testing dataset) [25]. Kaur et al. proposed two-stage brain tumor detection using morphological operation such as erosion, dilation, and clustering in the first stage and further employed Naïve Bayes classifier that attained 86% accuracy score [26]. Another feature extraction approach using CNN in which initially fuzzy c-means algorithm is employed to cluster the tumor area [27]. Further, the canny edge algorithm is applied to identify the tumor region, and then, using these two-stage information, CNN was proposed which attained the accuracy of 91.40%. Using contrast stretching and the contouring algorithm, Remya et al. had extracted the feature extraction and classified the brain tumor using KNN classifier for 250 iterations and obtained the accuracy of 85.71% [28]. Ning et al. improved the classical model of VGG19 and Inception V3 with data augmentation techniques followed by locating and extracting the brain tumor region and obtained the accuracy score of 88.96% and 91.73% for the mentioned improved model, respectively [29]. To identify the most promising features, GLCM, shape features, and local binary pattern are calculated, and then, brain tumor classification is accomplished using RF-PCA which produces accuracy of 91.95% [30]. The mentioned approaches have been utilized with the advancement of image processing and deep learning techniques. However, the performance in classifying the infected and uninfected brain tumor can be improved using an optimal model design. Considering the study in [27], in which the classification accuracy is promising but having high computation cost due to the complex design of CNN. Obtaining feature with minimum number of model is also a potential idea. In this paper, a CNNbased model to classify brain tumor into infected and uninfected classes has been

356

R. Sinha and D. Verma

proposed which offers less complex design and validated through the qualitative and quantitative assessment scores.

3 Material and Methods In order to perform classification of the brain tumor categories, the following phases have been considered; the overall process flow is given in Fig. 1.

3.1 Dataset Collection and Preprocessing The proposed CNN-based classification network requires significant features from a standard dataset, which has been accessed from a public platform [31]. The referred dataset has a total of 253 brain MRI images, of which 155 images are tumorous and the remaining 98 images are non-tumorous. Learning significant features with a CNN model requires a large amount of data; due to the limited data availability, a data augmentation approach has been adopted to increase the number of data samples and provide an optimal baseline for feature extraction. Random rotation, flipping, zooming, brightness variation, and random shifts have been applied, producing 1518 data samples after the data augmentation process. Figure 2 displays some of the sample images from the referred dataset.
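As a non-authoritative illustration of the augmentation step described above, the following Keras sketch applies the listed transformations; the exact rotation, shift, zoom, and brightness ranges are assumptions, since the paper does not report them.

```python
# Illustrative sketch of the augmentation pipeline (not the authors' code);
# the specific parameter ranges below are assumptions.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,            # random rotation
    horizontal_flip=True,         # flipping
    vertical_flip=True,
    zoom_range=0.1,               # zooming
    brightness_range=(0.8, 1.2),  # variation in brightness
    width_shift_range=0.1,        # random shift
    height_shift_range=0.1,
)
# e.g. iterate over the 253 source images until roughly 1518 augmented
# samples have been generated:
# gen = augmenter.flow_from_directory("brain_mri/", target_size=(240, 240),
#                                     batch_size=32, class_mode="binary")
```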

3.2 Model Architecture and Learning The proposed model, which is expected to obtain a larger pool of noteworthy and robust features, is built around the convolution operation. However, increasing the number of convolution operations does not guarantee higher accuracy, as the model may suffer from instability or from overfitting and underfitting during learning. To preserve a balance among the relevant set of constraints, several arrangements of the substantial parameters have been explored. After data augmentation, as the dataset now has a larger number of inputs, a total of 11,137 parameters undergo the training process, in which 32 convolution kernels are applied to the given input image across all three (red, green, blue) channels. Further, to obtain the preferred features from the input sample and to standardize the reliable pixel data, a max pooling scheme is applied. This policy reduces the computational complexity and focuses on high-level feature abstraction, as represented in the following equation.


Fig. 2 Sample grid of referred brain tumor dataset

$P_{\max}(I) = \max_{u} i_u$  (1)

where the vector $i$ contains the activation values from the occupant pooling region of $T$ pixels in the image. However, when parameter deviations occur in any layer, they shift the input distribution passed to the subsequent layers. These changes in input distributions become computationally difficult to handle when there are many layers. To lessen this covariate shift, batch normalization is employed, where normalization is applied to each activation independently and can be expressed as:

Mean (mini-batch): $\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$  (2)

Variance (mini-batch): $\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2$  (3)

Normalize: $\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$  (4)


Table 1 Configuration details of proposed CNN model

Layer number | Layer type          | Output shape         | Number of parameters
1            | Input Layer         | (None, 240, 240, 3)  | 0
2            | Zero Padding        | (None, 244, 244, 3)  | 0
3            | Conv2D              | (None, 238, 238, 32) | 4736
4            | Batch Normalization | (None, 238, 238, 32) | 128
5            | Activation          | (None, 238, 238, 32) | 0
6            | Max Pooling         | (None, 59, 59, 32)   | 0
7            | Max Pooling         | (None, 14, 14, 32)   | 0
8            | Flatten             | (None, 6272)         | 0
9            | Dense               | (None, 1)            | 6273

Scale & Shift: yi = Y xi + β ≡ B NY,β (xi )

(5)

Batch normalization (BN) is utilized to stabilize the input distribution and to support training at a higher learning rate. The configuration details of the proposed model, along with the output shape and the corresponding number of parameters of each layer, are given in Table 1.
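A minimal Keras sketch of the layer stack in Table 1 is given below. It is a reconstruction, not the authors' code: the 7 × 7 kernel size and pool size of 4 are inferred from the listed output shapes and parameter counts, while the ReLU activation, sigmoid output, and optimizer/loss choices are assumptions.

```python
# Hypothetical reconstruction of the Table 1 layer stack.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(240, 240, 3)),       # (None, 240, 240, 3)
    layers.ZeroPadding2D(padding=2),         # (None, 244, 244, 3)
    layers.Conv2D(32, kernel_size=7),        # (None, 238, 238, 32), 4736 params
    layers.BatchNormalization(),             # 128 params
    layers.Activation("relu"),               # activation type assumed
    layers.MaxPooling2D(pool_size=4),        # (None, 59, 59, 32)
    layers.MaxPooling2D(pool_size=4),        # (None, 14, 14, 32)
    layers.Flatten(),                        # (None, 6272)
    layers.Dense(1, activation="sigmoid"),   # 6273 params, binary output
])
# optimizer and loss are assumptions; the paper only reports a learning rate of 0.01
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()  # total parameters: 11,137, matching the text
```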

4 Performance Evaluation For the modest scheme of the proposed CNN architecture and the several layers employed, the first 20 epochs of the learning/training curves with a learning rate of 0.01 have been acquired and are displayed in Fig. 3a-b. They indicate how the model progressively learns the features of the provided dataset and uses this learning to categorize the type of brain tumor. Likewise, the early convergence of both the accuracy and loss curves, which reach their maximum and minimum levels respectively within very few epochs, has also been observed and reveals good learning performance as well as stability. Apart from the qualitative evaluation discussed in Fig. 3, standard quantitative measures [17] have also been employed to confirm the optimal behavior of the proposed approach; these are shown in Table 2 along with a comparison with various state-of-the-art techniques.


Fig. 3 Learning curve of proposed brain tumor classifier model. a Accuracy. b Loss

Table 2 Quantitative classification performance of proposed approach and its comparison with the relevant state-of-the-art techniques

Method   | Xue [20] | Sajja [21] | Kaur [24] | Veer [25] | Kaur [26] | Remya [28] | Proposed model
Accuracy | 84.90    | 94.80      | 87.50     | 90.47     | 86.00     | 85.71      | 95.39

5 Conclusion The developed CNN model has been designed to classify brain MRI images into infected and uninfected categories and achieves a classification accuracy of 95.39% through a simple design and within very few training epochs. To validate the performance of the proposed method, the model has been assessed using qualitative and standard quantitative scores. The performance of the model can be further improved with different optimizers and on unseen data.

References 1. Mohdiwale S, Sahu M, Sinha GR (2020) LJaya optimisation-based channel selection approach for performance improvement of cognitive workload assessment technique. Electron Lett 56(15):793–795. https://doi.org/10.1049/el.2020.1011 2. Ali U, Dewangan KK, Dewangan DK (2018) Distributed denial of service attack detection using ant bee colony and artificial neural network in cloud computing. Adv Intell Syst Comput 652:165–175. https://doi.org/10.1007/978-981-10-6747-1_19 3. Mohdiwale S, Sahu M, Sinha GR, Bhateja V (2021) Statistical wavelets with harmony searchbased optimal feature selection of EEG signals for motor imagery classification. IEEE Sens J 21(13):14263–14271. https://doi.org/10.1109/JSEN.2020.3026172 4. Dewangan DK, Sahu SP, Sairam B, Agrawal A (2021) VLDNet: vision-based lane region detection network for intelligent vehicle system using semantic segmentation. Computing 103(12):2867–2892. https://doi.org/10.1007/s00607-021-00974-2


5. Pardhi P, Yadav K, Shrivastav S, Sahu SP, Dewangan DK (2021) Vehicle motion prediction for autonomous navigation system using 3 dimensional convolutional neural network. In: 2021 5th International conference on computing methodologies and communication (ICCMC), pp 1322–1329. https://doi.org/10.1109/ICCMC51019.2021.9418449 6. Dewangan DK, Sahu SP (2021) Deep learning-based speed bump detection model for intelligent vehicle system using raspberry pi. IEEE Sens J 21(3):3570–3578. https://doi.org/10.1109/ JSEN.2020.3027097 7. Singh A, Bansal A, Chauhan N, Sahu SP, Dewangan DK (2021) Image generation using GAN and its classification using SVM and CNN. In: Proceedings of emerging trends and technologies on intelligent systems. ETTIS 2021. Advances in intelligent systems and computing, vol 1371. Springer, Singapore 8. Impedovo D, Pirlo G (2020) Artificial intelligence applications to smart city and smart enterprise. Appl Sci 10(8):1–5. https://doi.org/10.3390/APP10082944 9. Miller KD, Ostrom QT (2021) Brain and other central nervous system tumor statistics, vol 71, no 5, pp 381–406. https://doi.org/10.3322/caac.21693 10. Pandey P, Dewangan KK, Dewangan DK (2018) Enhancing the quality of satellite images by preprocessing and contrast enhancement. In: Proceedings of 2017 IEEE international conference on communications signal processing, ICCSP 2017, vol 2018, pp 56–60. https://doi.org/ 10.1109/ICCSP.2017.8286525 11. Bhattacharya N, Dewangan DK (2015) Fusion technique for finger knuckle print recognition. In: International conference on electrical, electronics, signals, communication and optimization, EESCO 2015. https://doi.org/10.1109/EESCO.2015.7253990 12. Dewangan D, Rathore YK (2016) Image quality costing of compressed image using full reference method, vol 1, no February, pp 68–71 13. Sahu SP, Dewangan DK (2021) Traffic light cycle control using deep reinforcement technique. In: International conference on artificial intelligence and smart systems (ICAIS), pp 697–702. https://doi.org/10.1109/ICAIS50930.2021.9395880 14. Pardhi P, Yadav K, Shrivastav S, Sahu SP, Kumar Dewangan D (2021) Vehicle motion prediction for autonomous navigation system using 3 dimensional convolutional neural network. In: Proceedings of 5th international conference on computing methodologies and communications. ICCMC 2021, no. Iccmc, pp 1322–1329. https://doi.org/10.1109/ICCMC51019.2021. 9418449 15. Dewangan DK, Sahu SP (2021) PotNet: Pothole detection for autonomous vehicle system using convolutional neural network. Electron Lett 57(2):53–56. https://doi.org/10.1049/ell2.12062 16. Banjarey K, Sahu SP, Dewangan DK (2021) A survey on human activity recognition using sensors and deep learning methods. In: 2021 5th International conference on computing methodologies and communication (ICCMC), pp 1610–1617. https://doi.org/10.1109/ICC MC51019.2021.9418255 17. Dewangan DK, Sahu SP (2021) RCNet: road classification convolutional neural networks for intelligent vehicle system. Intell Serv Robot 14(2):199–214. https://doi.org/10.1007/s11370020-00343-6 18. Dewangan DK, Sahu SP (2021) Road detection using semantic segmentation-based convolutional neural network for intelligent vehicle system. In: Data engineering and communication technology. Lecture notes on data engineering and communications technologies. Springer Singapore, pp 629–637 19. Dewangan DK, Sahu SP (2021) Driving Behaviour analysis of intelligent vehicle system for lane detection using vision-sensor. IEEE Sens J 21(5):6367–6375. 
https://doi.org/10.1109/ JSEN.2020.3037340 20. Xue Y et al (2020) Brain tumor classification with tumor segmentations and a dual path residual convolutional neural network from MRI and pathology images, vol 11993. Springer International Publishing, LNCS 21. Sajja VR, Kalluri HR (2020) Brain tumor segmentation using fuzzy C-means and tumor grade classification using SVM, vol 105. Springer Singapore


22. Pareek M, Jha CK, Mukherjee S (2020) Brain tumor classification from MRI images and calculation of tumor area, vol 1053. Springer Singapore 23. Chan HW, Weng YT, Huang TY (2020) Automatic classification of brain tumor types with the MRI scans and histopathology images, vol 11993. Springer International Publishing, LNCS 24. Kaur P, Singh G, Kaur P (2020) Classification and validation of MRI brain tumor using optimised machine learning approach, vol 601. Springer Singapore 25. Veer (Handore) SS, Deshpande A, Patil PM, Handore MV (2020) Segmentation and classification of primary brain tumor using multilayer perceptron, vol 1108 AISC. Springer International Publishing 26. Kaur G, Oberoi A (2020) Novel approach for brain tumor detection based on Naïve Bayes classification. Adv Intell Syst Comput 1042:451–462. https://doi.org/10.1007/978-981-32-99498_31 27. Arunnehru J, Kumar A, Verma JP (2020) Early prediction of brain tumor classification using convolution neural networks, vol 1192 CCIS. Springer Singapore 28. Remya Ajai AS, Gopalan S (2020) Analysis of active contours without edge-based segmentation technique for brain tumor classification using svm and knn classifiers, vol 656. Springer Singapore 29. Ning X, Li Z, Pang H (2020) Image classification of brain tumors using improved CNN framework with data augmentation, vol 341. Springer International Publishing 30. Saraswathi V, Jamthikar AD, Gupta D (2020) CNN and RF based classification of brain tumors in MR neurological images, vol. 1147 CCIS. Springer Singapore 31. Chakrabarty N (2018) Brain MRI images for brain tumor detection. Kaggle. https://www.kag gle.com/navoneel/brain-mri-images-for-brain-tumor-detection. Accessed 15 Nov 2021

Detection of COVID-19 Infection Using Convolutional Neural Network D. Aravind, Neha Jabeen, and D. Nagajyothi

Abstract Coronavirus disease (COVID-19) is a newly discovered viral sickness that can be fatal. The majority of patients will experience mild to severe respiratory problems and will recover without the need for special treatment. Persons over 65, and those with underlying medical disorders such as cardiovascular disease, asthma, respiratory illness, and cancer, are more prone to developing severe symptoms. In these conditions, 3D volumetric imaging has proven to be a useful technique for COVID-19 patient diagnosis and prognosis. We present a new approach for detecting and classifying COVID-19 infection using 3D volumetric lung imaging in this work. For the detection and classification process, we have used 3D volumetric image processing and deep learning techniques, respectively. Early recognition and diagnosis are essential to stop COVID-19 from spreading. Various deep learning-based approaches have been proposed for COVID-19 screening from CT scans as a tool to automate and assist diagnosis. These methods suffer from at least one of the faults listed below: (i) they treat each CT scan individually; (ii) they are trained and tested on the same dataset. To address these two challenges, we present an accurate deep learning technique for COVID-19 screening using a democratic framework in this paper. Keywords 3D volumetric image processing · Classification · Coronavirus disease (COVID-19) · Deep learning techniques · Detection

D. Aravind (B) · N. Jabeen · D. Nagajyothi Electronics and Communication Engineering, Vardhaman College of Engineering, Shamshabad, Hyderabad, India e-mail: [email protected] N. Jabeen e-mail: [email protected] D. Nagajyothi e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_28


1 Introduction Coronavirus disease (COVID-19) is a devastating infection caused by a recently discovered coronavirus. A large percentage of patients affected by COVID-19 will experience mild to severe respiratory disease and recover without any need for specific treatment. People who are older or have underlying clinical issues such as heart disease, diabetes, or a persistent respiratory infection or disease are much more prone to serious illness. Learning about the COVID-19 virus, the sickness it produces, and how it spreads is the best strategy to prevent infection and keep transmission as low as possible. To protect yourself from the virus, wash hands thoroughly with water or use an alcohol-based sanitizer, and keep your hands away from your face. Because the COVID-19 virus is primarily transmitted through droplets of saliva or secretions from the nose when an infected person coughs or sneezes, good respiratory hygiene is also essential. The respiratory infection known as the novel coronavirus or COVID-19 is making headlines around the world for triggering an outbreak of respiratory disease. The outbreak started in Wuhan, China's Hubei Province, and rapidly spread around the world, including to the United States. It has infected a large number of people, and health officials are closely monitoring how the disease spreads. Coronaviruses are a family of viruses found all over the world that can cause respiratory illness in humans and animals. There are a few known coronaviruses that infect people and normally cause only mild respiratory illness, like the common cold. However, at least two previously identified coronaviruses have caused serious disease: severe acute respiratory syndrome (SARS) coronavirus and Middle East respiratory syndrome (MERS) coronavirus.

2 Literature Survey Several researchers have proposed techniques to detect COVID-19. Here, we briefly review some of the reported techniques. Dascomb [1]: Intermountain Healthcare is a not-for-profit organization based in Utah that operates 24 hospitals (including a "virtual" hospital), a medical group with over 2400 doctors and advanced practice clinicians at around 160 locations, a SelectHealth health plans division, and other health services. NHM [2]: The National Rural Health Mission (NRHM) and the National Urban Health Mission (NUHM) are the two sub-missions of the National Health Mission (NHM). Health system strengthening, reproductive-maternal-neonatal-child and adolescent health (RMNCH + A), and communicable and noncommunicable diseases are its three main functions. The continuation of the National Health Mission (with effect from 1 April 2017 to 31 March 2020) was endorsed by the government of India in its meeting on 21 March 2018. About 80% of confirmed cases recover from the illness with no serious complications. However,


one out of every six individuals who gets COVID-19 can become seriously ill and develop difficulty in breathing. In more severe cases, the infection can cause serious pneumonia and other complications which can be treated only at higher-level facilities (district hospitals or above). In a few cases, it may even cause death. De [3]: Types of COVID-19 tests and measures: as the ICMR faced many allegations over insufficient testing, after further research different types of tests such as immunological (antibody) response tests, TrueNat and CBNAAT (tuberculosis test platforms), and antigen tests were offered by the ICMR for COVID-19 testing. Worldometer [4]: Worldometer is run by a global team of developers, researchers, and volunteers with the goal of making world statistics available in an interesting and time-relevant format to a wide audience around the world. It is published by a small and independent digital media company based in the United States. For the COVID-19 data, it collects data from official reports, directly from governments' communication channels or indirectly through local media sources when considered reliable, and provides the source of every data update in the "Latest Updates" (news) section. Timely updates are made possible thanks to the participation of users around the world and the dedication of a team of analysts and researchers who validate data from a continuously growing list of more than 5000 sources. Khuzani et al. [5]: For non-COVID-19 individuals with pneumonia, chest X-ray (CXR) radiography can be used as a first-line triage tool. However, the similarity between features of COVID-19 CXR images and pneumonia caused by other infections can lead to differential judgements when radiologists examine them. The authors hypothesized that AI-based classifiers could reliably distinguish COVID-19 patients' CXR images from other kinds of pneumonia. To construct a powerful AI classifier that can discriminate COVID-19 cases from non-COVID-19 cases with good accuracy and sensitivity, they employed a dimensionality reduction technique to generate a set of optimal CXR image features. By leveraging global features from the entire set of CXR images, the classifier could be run efficiently with a limited dataset of CXR photographs. In the event of a non-COVID-19 emergency, they recommend that the COVID classifier be used in conjunction with other tests to guarantee an optimal allocation of hospital resources. World Health Organization [6]: The WHO Foundation is an independent grant-making foundation focused on addressing the most pressing global health challenges of today and tomorrow. Headquartered in Geneva and legally independent from WHO, the Foundation works with individual donors, the general public, and corporate partners to support global public health needs, ranging from prevention, mental health, and noncommunicable diseases to emergency preparedness, outbreak response, and health system strengthening. Xie et al. [7]: Reverse transcription polymerase chain reaction (RT-PCR) testing for coronavirus disease 2019 (COVID-19) may yield negative results for certain patients with positive chest CT findings. The researchers present chest CT findings from five individuals with COVID-19 illness who had initially negative RT-PCR results. All individuals were eventually confirmed to have COVID-19 infection through


repeated swab testing after being isolated for suspected COVID-19 pneumonia. For those with a high clinical suspicion of COVID-19 disease but negative RT-PCR results, a combination of repeated swab tests and CT scanning may be beneficial. Blake [8]: TCIA is a service that de-identifies and makes a large database of clinical cancer images available for public download. Patients' imaging is usually grouped by a common disease, image modality or type (MRI, CT, digital histopathology, and so on), or research focus. TCIA uses DICOM as its primary file format for radiological imaging. When available, additional information related to the images, such as patient findings, treatment details, genomics, and expert analyses, is also provided.

3 Proposed Methodology Although initial studies using chest CT for the diagnosis of COVID-19 and the detection of infected regions have shown promising results, most current procedures rely on a conventional supervised learning strategy. This necessitates a significant amount of manual labeling of data; however, in such an outbreak situation, clinicians have very limited time to carry out the tedious manual annotation, which may hamper the performance of such supervised deep learning approaches. We propose a deep learning framework to distinguish COVID-19 cases from community-acquired pneumonia (CAP) and non-pneumonia (NP) cases in this study (Fig. 1). MRI and CT scans are examples of cross-sectional digitized images that are easy to visualize in a multi-planar volumetric projection. These projections record the physical dimensions of a person's body, which can then be reconstructed into a multi-planar volumetric image. CT scans typically generate many slices, which must be compressed to fit within the multi-planar volumetric screen. One approach is to determine the volume bounds to be visualized and then apply volume rendering techniques; another option is to create the volumetric image first by gathering the necessary data from the captured slices and then apply volume rendering techniques. Surface models and volumetric data are the two types of 3D data available. Surface models are common in the design profession, where objects are characterized by their surfaces, for example with polygons or parametric surfaces. In the medical domain, data is volumetric, which means that the inside of the object is also represented using a discretely sampled 3D collection. In the complement of a binary image, zeros become ones and ones become zeros (Fig. 2); black and white are reversed. Next, we construct the fundamental building blocks of the CNN. A convolutional neural network can consist of one or multiple convolutional layers; the number of convolutional layers depends on the amount and complexity of the data. Convolutional neural networks (CNNs) are neural networks with at least one convolutional layer and are commonly used for image processing, classification, segmentation, and other autocorrelated data. Convolution is the process of sliding a filter over the data. One helpful way to think about convolutions is this statement from Dr. Prasad

Detection of COVID-19 Infection Using …

367

Fig. 1 Methodology diagram

Samarakoon: "A convolution can be thought of as looking at a function's surroundings to make better/more accurate predictions of its result." Rather than looking at an entire image at once to find certain features, it can be more effective to look at smaller portions of the image. The most common use of CNNs is image classification, such as recognizing satellite images containing roads or identifying handwritten characters and digits. CNNs are also well suited to tasks like image segmentation and signal processing. CNNs have been utilized in natural language processing (NLP) and speech recognition, but RNNs are more often used in NLP (Fig. 3). Convolutional layer: Sliding convolutional filters are applied to the data by a 2D convolutional layer. convolution2dLayer is used to create a 2D convolutional layer. The convolutional layer is made up of different regions; while looking at an image, the layer learns the features contained in these regions. You can use the filter size input parameter to determine the size of these regions when creating a layer with the convolution2dLayer function. Rectified Linear Unit (ReLU): A rectified linear unit is utilized as a non-linear activation function. A ReLU rounds any value below zero up to zero. Create a ReLU layer using reluLayer. A ReLU layer performs a threshold operation on every element of the input, where any value less than zero is set to zero.
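To make the two ideas above concrete, a small NumPy illustration (not taken from the paper) of sliding a filter over an image and of the ReLU rule that clamps negative values to zero is sketched below; the example kernel and image sizes are arbitrary.

```python
# Minimal sketch: "valid" 2D convolution by sliding a filter, followed by ReLU.
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):                  # slide the filter over the image
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0)              # values below zero are set to zero

edge_kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])  # example 3x3 filter
feature_map = relu(conv2d_valid(np.random.rand(8, 8), edge_kernel))
```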


Fig. 2 3D lung segmentation

Fig. 3 CNN architecture

Convolutional and batch normalization layers are normally followed by a nonlinear activation function such as a rectified linear unit (ReLU), specified by a ReLU layer. A ReLU layer performs a threshold operation on every element, where any input value less than zero is set to zero; that is, the ReLU layer does not change the size of its input. Max and Average Pooling Layers: A max pooling layer performs down-sampling by dividing the input into rectangular pooling regions and computing the


maximum of every region. Create a max pooling layer using maxPooling2dLayer. An average pooling layer performs down-sampling by dividing the input into rectangular pooling regions and computing the average value of every region. Fully Connected Layer: Create a fully connected layer using fullyConnectedLayer. A fully connected layer multiplies the input by a weight matrix and then adds a bias vector. The convolutional (and down-sampling) layers are followed by one or more fully connected layers. As the name suggests, all neurons in a fully connected layer connect to all of the neurons in the previous layer. This layer combines all of the features (local information) learned by the previous layers across the image to identify the larger patterns. For classification problems, the last fully connected layer combines the features to classify the images. SoftMax and Classification Layers: A SoftMax layer applies a SoftMax function to the input. Create a SoftMax layer using softmaxLayer. A classification layer computes the cross-entropy loss for multi-class classification problems with mutually exclusive classes. Create a classification layer using classificationLayer. For classification problems, a SoftMax layer and then a classification layer must follow the final fully connected layer. The SoftMax function is also known as the normalized exponential and can be considered the multi-class generalization of the logistic sigmoid function. Regression Layer: A regression layer computes the half-mean-squared-error loss for regression problems. For typical regression problems, a regression layer must follow the final fully connected layer. For a single observation, the mean-squared-error is given by $\mathrm{MSE} = \frac{1}{R}\sum_{i=1}^{R}(t_i - y_i)^2$, where R is the number of responses, $t_i$ is the target output, and $y_i$ is the network's prediction for response i. For image and sequence-to-one regression networks, the loss function of the regression layer is the half-mean-squared-error of the predicted responses, not normalized by R.
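The description above follows the layer functions of MATLAB's Deep Learning Toolbox. As a hedged, non-authoritative sketch (not the authors' code), an equivalent classification stack can be expressed in Keras; the filter counts, kernel/pool sizes, and input shape below are placeholders chosen for illustration, and only the three-class output (COVID-19, CAP, NP) follows the text.

```python
# Hypothetical Keras equivalent of the layer stack described above
# (convolution -> batch norm -> ReLU -> max pooling -> fully connected -> softmax).
from tensorflow.keras import layers, models

def build_classifier(input_shape=(128, 128, 1), num_classes=3):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, kernel_size=3, padding="same"),   # convolution2dLayer
        layers.BatchNormalization(),                         # batchNormalizationLayer
        layers.Activation("relu"),                           # reluLayer
        layers.MaxPooling2D(pool_size=2),                    # maxPooling2dLayer
        layers.Conv2D(32, kernel_size=3, padding="same"),
        layers.BatchNormalization(),
        layers.Activation("relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Flatten(),
        layers.Dense(num_classes),                           # fullyConnectedLayer
        layers.Softmax(),                                    # softmaxLayer
    ])

model = build_classifier()
# categorical cross-entropy plays the role of the classificationLayer loss
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```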

4 Experimental Results Our proposed method has been compared with previous methods that include KNN. CNN's key advantage over its predecessors is that it automatically finds important features without the need for human intervention, and it requires less memory storage than other supervised learning algorithms. Figure 4 is the input image which is fed to the training process and from which the disease can be detected. From Fig. 4, the lung is extracted to precisely predict the virus, as shown in Figs. 5 and 6. Figure 7, shown below, is the final output image which indicates whether a person is infected with the virus or not. Figure 8 shows the output which appears on the screen.


Fig. 4 3D Volumetric lung image

Fig. 5 Middle slice image

5 Conclusions The majority of people infected with COVID-19 will experience mild to severe respiratory illness and will recover without needing additional therapy. People who are older, as well as those who suffer from chronic medical conditions such as cardiovascular disease, asthma, chronic respiratory disease, and cancer, are more likely to develop serious illness. In these conditions, 3D volumetric imaging

Fig. 6 Lung segmentation

Fig. 7 Output image

Fig. 8 Example 2


has been a valuable technique for COVID-19 patients' diagnosis and prognosis. We proposed a new approach for the diagnosis and classification of COVID-19 infection from 3D volumetric lung images in this report. We used 3D volumetric image processing and deep learning techniques for the identification and classification processes, respectively. When compared to existing models, the experimental data suggest that our model performs better.

References 1. Dascomb KK (2020) What is coronavirus (COVID-19) and how can I prepare? InterMountain Health Care, 5 Mar 2020 2. National health Mission, Ministry of Health Family Welfare Government of India. Role of frontline workers in prevention management of coronavirus 3. De A (2020) The Indian Express, Covid-19 testing: What are the tests and testing procedures being carried out in India?, New Delhi, 27June 2020 8:17:44 pm 4. Worldometer (2020) Corona cases in India, 8 Dec 2020, 05:36 GMT 5. Zargari Khuzani A, Heidari M, Shariati SA (2020) COVID-classifier: an automated machine learning model to assist in the diagnosis of COVID-19 infection in chest x-ray images. The National Center for Biotechnology Information advances science and health, 18 May 2020 6. Elaziz MA, Hosny KM, Salah A, Darwish MM, Lu S (2020) Publish With Plos One “New machine learning method for image-based diagnosis of COVID-19, 26 June 2020 7. Xie X, Zhong Z, Zhao W, Zheng C, Wang F, Liu J (2020) Chest CT for typical 2019-nCoV pneumonia: relationship to negative RT-PCR testing. Radiology, p 200343 8. Blake G (2020) TCIA, CT images in COVID-19, 15 Sep 2020

Hybrid Classification Algorithm for Early Prediction of Alzheimer’s Disease B. A. Sujatha Kumari, Sudarshan Patil Kulkarni, and Ayesha Sultana

Abstract Alzheimer’s disease is the most common type of dementia found. Dementia is actually a syndrome related to an ongoing decline of brain functioning. Alzheimer’s is caused due to increase in age, genes inherited, depression, and factors related to lifestyles. AD at its final stage cannot be treated. The intermediate stages like mild cognitive impairment (MCI), can be treated, so that the risk of developing AD can be decreased. In this work, structural MRI images are used and hybrid approach is introduced to detect MCI with good accuracy. ADNI dataset is used for project work for classification of cognitive normal (CN) and mild cognitive impairment (MCI). ADNI provides MRI data along with demographic information such as age, gender, physical examinations, and other neurobiological data. Initially, the MRI data is subjected to segmentation using K-means clustering in order to extract 2D images and gray matter. These segmented images are preprocessed using discrete wavelet transform (DWT), which is then further classified into MCI and CN classes. Random forest (RF) classifier, artificial neural network (ANN) is discretely implemented and the accuracy of prediction is calculated. Further, these algorithms are hybridized in order to achieve improved accuracy of prediction. By hybridizing the algorithms, an accuracy of 93.47% was achieved. Keywords Alzheimer’s disease · Segmentation · DWT · Magnetic resonance imaging (MRI) · Feature extraction · Prediction · Classification · RF · ANN

B. A. S. Kumari · S. P. Kulkarni · A. Sultana (B) ECE Department, Sri Jayachamarajendra College of Engineering, JSS Science and Technology University, Mysuru, Karnataka, India B. A. S. Kumari e-mail: [email protected]; [email protected] S. P. Kulkarni e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_29


1 Introduction The human body is an astonishing machine. Certain parameters such as hereditary factors, age, and nutrition may give rise to disease in people. Health may be defined as a state of complete mental, physical, and social well-being. Lifestyle, which includes the food and water we consume, the exercise we perform, the rest we give our bodies, and the habits we keep or need to control, governs how healthy a human being is. Alzheimer's disease (AD) is one of the primary and most common forms of dementia. Dementia in particular is an ailment related to a continuous decline of brain function; it affects the mental abilities of human beings. While many factors are believed to increase the risk of developing AD, the exact cause is still not completely known. This disease is most common in human beings over the age of 65 [1]. The risk of AD increases with age: it affects roughly one in every 14 people above 65 years and one in every six people above 80 years of age. Furthermore, an estimated one in every 20 cases of AD occurs in people of 40-65 years; this is termed young-onset AD. As shown in Fig. 1, in healthy aging the brain also shrinks to some extent, but there is no bulk loss of neurons. In AD, however, many neurons lose connections with other neurons, stop functioning, and die. Alzheimer's disease interrupts processes related to neurons and their networks, including metabolism, communication, and repair, and the connections associated with memory that are present in parts of the brain. In this paper, the focus is on detection of AD in its early stage using structural MRI (sMRI) neuroimaging of 2D data. Efficient classification algorithms are hybridized for the categorization of different stages of Alzheimer's disease. The ambition is to

Fig. 1 Normal brain versus brain with AD


construct an intelligent system for high-speed computation of hybridized network to improve the accuracy of prediction of the disease.

2 Related Works In [2], based on DTCWT, PCA, and FNN, the research provides an automated and accurate method for detecting Alzheimer's disease. It outperformed seven state-of-the-art algorithms, with an accuracy of 90.06%, a sensitivity of 92.00%, a specificity of 87.78%, and a precision of 89.6%. Kim et al. [3] coupled DWT with PCA for feature extraction, which made classification more effective for MCI and AD. Although this was found to be a powerful tool, it gave poor directionality and lacked phase information [4]. Liu et al. [5] use a hybrid technique for MR brain image classification which combines the advantages of 2D-DWT with a SURF-based BoW for feature extraction. On the DS-255 dataset, this approach achieved 99.61% accuracy, 100% sensitivity, 99.55% precision, and 97.14% specificity; it took very little time and produced correct findings [6]. Islam and Zhang [7] explained the removal of Gaussian noise and impulse noise arising in medical images, based on DWT and a modified median filter. It was shown how the denoising of medical images from four modalities could be achieved. The quality enhancement was based on edge detection of the medical image, performed using the Canny edge detection algorithm [8]. This was found advantageous because it enhanced the quality of the denoised images while preserving important features [7]. Qiu et al. [9] use brain MR imaging to automatically classify brain disorders into five classes. From the results obtained, it is concluded that the combination of VMD and bispectral feature techniques helps to efficiently capture subtle image changes that aid classification [10]. Kim et al. [11] proposed DTCWT as a sparsifying transformation for CS-MRI. The results of this experiment show that DTCWT can be concentrated along the directional structure to significantly reduce DWT-related artifacts. Pan et al. [12] show a new approach combining convolutional neural networks and ensemble learning: they propose a classifier developed by combining CNN and EL, i.e., the CNN-EL method, to identify subjects with MCI or AD; the CNN models were then integrated into an ensemble. Using tenfold stratified cross-validation gives ten performance estimates on the evaluation set [13]. Islam and Zhang [14] presented an effective approach to diagnosing AD using analysis of brain MRI data. While much of the existing work has focused on binary classification, this model yields a significant boost for multi-class classification [15]. This network was found to be beneficial for the early diagnosis of AD. In addition, the proposed approach shows great potential for applying CNNs to other domains with limited datasets [16].


The various papers considered for the study clearly indicated the need for preprocessing. It was also observed that using machine learning or deep learning models, or hybridizing the models, improves the accuracy of prediction.

3 Proposed Work The proposed system is shown in Fig. 2. First, the data is obtained from the ADNI web site. This data is then segregated, and preprocessing is performed. Feature selection is done using a 3D dual wavelet transform; the data is converted to a 3D array and a feature vector is then formed. A hybrid algorithm is applied to detect AD in its early stage. Magnetic resonance imaging (MRI) produces detailed images of the inside of the body with the help of strong magnetic fields and radio waves. Various brain tumors, traumatic brain injury, birth defects, stroke, dementia, infection, multiple sclerosis, and causes of headache can be detected through MRI scans, which are acquired within a large tube of powerful magnets. Feature extraction is part of the dimensionality reduction process, in which the set of raw data is divided and reduced to more manageable groups so that it is easier to process further. Feature selection is a dimensionality reduction strategy that removes noisy data and selects a small group of relevant features from the original features. Here, we concentrate on the hippocampus region of the brain, whose shrinkage is associated with AD. The implementation flow of the project is explained using the flow diagram in Fig. 3. Data collected from the ADNI database consists of MRI data of cognitively normal (CN) subjects and mild cognitive impairment (MCI) subjects. Segmentation of these images is done using K-means clustering, and the resultant gray matter images are further preprocessed. A 2-level 3D-DWT is applied for feature extraction, which gives the data in vector form. Random forest, an ANN model, and RF + ANN models are applied using libraries such as NumPy, Keras, scikit-learn, and Matplotlib. The data is divided 80:20 for training and testing purposes.

Fig. 2 Block diagram of proposed system
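A minimal scikit-learn sketch of the random forest baseline with the 80:20 split mentioned above is given below. It assumes X is the DWT feature matrix and y the CN/MCI labels; the hyperparameter values are illustrative and not the authors' settings.

```python
# Hypothetical sketch of the RF baseline with an 80:20 train/test split.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def train_rf(X: np.ndarray, y: np.ndarray) -> RandomForestClassifier:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)   # 80:20 split
    rf = RandomForestClassifier(n_estimators=200, random_state=42)
    rf.fit(X_train, y_train)
    print("RF test accuracy:", accuracy_score(y_test, rf.predict(X_test)))
    return rf
```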


Fig. 3 Flow chart of the proposed system

3.1 Details of Dataset The ADNI web site is considered one of the standard sources; the dataset of patients suffering from mild cognitive impairment (MCI) and of cognitively normal (CN) subjects is obtained from the ADNI web site. Figure 4 shows the number of MCI and CN samples
Fig. 4 Number of samples considered for study


Fig. 5 Sample MRI image

considered for the study. Figure 5 shows sample images for CN and MCI. The images obtained are structural MRI Images.

3.2 Segmentation The raw data first has to be converted into an understandable format. Raw data is usually inconsistent, lacking in certain attributes, contains many errors, and is likely to be incomplete. Figure 6 shows various slices of a coronal view of an MRI image. Clustering is used mainly for image segmentation. Among the various algorithms, the K-means clustering algorithm, an unsupervised ML algorithm, is used here for segmentation of the region of interest from the image. Initially, to improve the image quality, partial stretching enhancement is applied before applying the K-means algorithm. A subtractive clustering method is used to generate the initial centers used in K-means clustering for the purpose of segmentation. Finally, a median filter is applied in order to remove the unwanted regions. Figure 6 shows the original, median filtered, and segmented images.

Fig. 6 Original image, median filtered image, and segmented image
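The following is an illustrative sketch (not the authors' code) of the segmentation step described above: K-means clustering of pixel intensities with K = 3, followed by a median filter to remove small unwanted regions. The contrast-stretching and subtractive-clustering initialisation steps are omitted for brevity, and the filter size is an assumption.

```python
# Minimal sketch: intensity-based K-means segmentation of an MRI slice.
import numpy as np
from scipy.ndimage import median_filter
from sklearn.cluster import KMeans

def segment_slice(slice_2d: np.ndarray, k: int = 3) -> np.ndarray:
    """Return a cleaned label map of the MRI slice with k intensity clusters."""
    pixels = slice_2d.reshape(-1, 1).astype(np.float32)      # one feature: intensity
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pixels)
    label_map = labels.reshape(slice_2d.shape)
    # median filter to remove small unwanted regions, as described in the text
    return median_filter(label_map, size=3)
```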


3.3 Feature Extraction and Reduction The Daubechies wavelets, introduced by Ingrid Daubechies, are a family of orthogonal wavelets used to define a discrete wavelet transform. They are characterized by a maximal number of vanishing moments. Sinusoidal functions are perfectly localized in the frequency domain but are global in the spatial coordinate; it is therefore difficult to represent a compactly supported function of space or time in a Fourier basis. Figure 7 shows the 2-level DWT. The basic dilation equation is a two-scale, also called dyadic, difference equation given as

Fig. 7 Approximation in three-dimension


$\varphi(x) = \sqrt{2}\sum_{k=0}^{N-1} c_k\, \varphi(2x-k).$

The scaling function is normalized by $\int \varphi(x)\,dx = 1$. The expression relating the mother wavelet to the scaling function is

$\psi(x) = \sqrt{2}\sum_{k=0}^{N-1} (-1)^k c_{N-1-k}\, \varphi(2x-k).$

A 3D Daubechies wavelet transform extracts detailed wavelet coefficients. This is achieved by computing local averages and differences along the X, Y, and Z axes. Figure 8 shows the decomposition of the 3D Daubechies wavelet. K-means clustering is a partitional clustering approach where each cluster is associated with a center point called the centroid, and every point is assigned to the cluster with the closest centroid. The number of clusters, K, is specified as three in the segmentation. The aim is to reduce the sum of distances of the points to their respective centroids. The most common formulation of K-means uses the Euclidean distance, minimizing the sum of squared errors (SSE). A simple iterative procedure known as Lloyd's algorithm works quite well in practice. The initial centroids are often chosen randomly, so the clusters produced vary from one run to another; multiple runs are performed and the clustering with the smallest error is selected. The initial set of centers can also be selected by methods other than random choice. The distance function depends solely on the centroids. K-means converges for the common similarity measures mentioned above, and most of the convergence occurs in the first few iterations; often the stopping condition is "until relatively few points change clusters". The complexity

Fig. 8 Levels of DWT


is O(n * K * I * d), where n refers to the number of points, K is the number of clusters, I is the number of iterations, and d stands for dimensionality.
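A minimal sketch of the 2-level 3D Daubechies DWT feature extraction described in this section is given below, assuming the PyWavelets library; the choice of the 'db2' wavelet and of keeping only the level-2 approximation coefficients as the feature vector is an illustrative assumption.

```python
# Hypothetical 3D DWT feature extraction (2-level Daubechies decomposition).
import numpy as np
import pywt

def dwt3d_features(volume: np.ndarray, wavelet: str = "db2", level: int = 2) -> np.ndarray:
    """Decompose a 3D MRI volume and return a flattened feature vector."""
    coeffs = pywt.wavedecn(volume, wavelet=wavelet, level=level)
    approx = coeffs[0]        # coarse approximation (local averages along X-Y-Z)
    return approx.ravel()     # vector form consumed by the classifiers
```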

3.4 Classification Random forest classification is an ensemble learning method for regression, classification, and other tasks that operates by constructing a large number of decision trees at training time and outputting the class that is the mode of the classes (for classification) or the mean prediction of the individual trees (for regression). While the back-propagation algorithm uses gradient descent to update the weights of each layer going backwards from the output layer, a vanishing gradient problem occurs as layers are stacked, because the derivative values approach 0 before the optimum value is reached. When the sigmoid is differentiated, its maximum value is 0.25, which moves ever closer to 0 as derivatives are repeatedly multiplied; this is referred to as the vanishing gradient issue and is a major obstacle for deep neural networks. Substantial research has addressed the challenge of the vanishing gradient problem. One typical remedy is to replace the sigmoid activation function with other functions, such as the ReLU, softplus, and hyperbolic tangent functions. The hyperbolic tangent function enlarges the range of derivative values compared to the sigmoid. The ReLU function is the most widely used activation function; it replaces a value with 0 if the value is less than 0 and passes the value through if it is greater than 0. Figure 9 shows the architecture of the ANN, and Fig. 10 presents the details of the hidden layers. There are four hidden layers in this ANN architecture. The first dense layer H1 consists of 128 neurons with an input dimension of 512; the next layer H2 consists of 64 neurons, hidden layer H3 consists of 64 neurons, and the last layer H4 consists of 16 neurons. Figure 11 shows the block diagram of the hybridized ANN model. The hybridization is considered here in order to increase the accuracy of prediction. RF classification was used to identify CN and MCI nodes, labeled 1 and 0 respectively, from the location coordinates in the domain and the input discharge-based domain. The ANN is used here to build a model that estimates the prediction accuracy over the required domain for an inconsistent discharge and the required location in the domain, that is, coordinates x, y. The hybridized model uses ReLU for activation, applied at various dimensions. The Keras model is built with the help of optimizers; this model uses RMSprop as the optimizer, and the loss is calculated by categorical cross-entropy.
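A hedged sketch of the ANN described above (H1-H4 with 128, 64, 64, and 16 neurons, ReLU activations, RMSprop optimizer, categorical cross-entropy loss) is given below. The input dimension of 512 follows the text; the 2-unit softmax output for the CN/MCI classes is an assumption. In the hybrid RF + ANN variant, one simple realisation would be to feed the RF class-probability outputs (or RF-selected features) into this ANN, but the exact coupling used by the authors is not specified here.

```python
# Hypothetical reconstruction of the described ANN (not the authors' code).
from tensorflow.keras import layers, models

def build_ann(input_dim: int = 512, num_classes: int = 2):
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(128, activation="relu"),              # H1
        layers.Dense(64, activation="relu"),               # H2
        layers.Dense(64, activation="relu"),               # H3
        layers.Dense(16, activation="relu"),               # H4
        layers.Dense(num_classes, activation="softmax"),   # output layer (assumed)
    ])
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```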


Fig. 9 Architecture of ANN

Fig. 10 Hidden layers and number of neurons

Fig. 11 Hybridized ANN


4 Results For the analysis, the data is trained and tested with the various classification algorithms, namely RF and ANN. The training and testing sets are divided 80:20. Many different dimensionality reduction techniques, such as the dual-tree wavelet transform, the discrete wavelet packet transform, and principal component analysis, can be used for data preprocessing; in this project, the 3D Daubechies discrete wavelet transform is used for feature extraction. After applying segmentation, the image is converted to a vector. Receiver Operating Characteristics: ROC analysis is a precise tool used to evaluate diagnostic tests and predictive models. It can also be used to check accuracy numerically or to compare the accuracy of various tests or predictive models. In real-life clinical practice, continuous measures are often dichotomized to obtain binary tests. Figure 12 shows the ROC curves (true positive rate versus false positive rate) for random forest, for ANN, and for RF + ANN.
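The ROC comparison of Fig. 12 can be reproduced along the following lines; this is an illustrative sketch that assumes each fitted model provides positive-class scores (e.g., class probabilities) for the held-out test set.

```python
# Illustrative computation of ROC curves for several models (not the authors' code).
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def plot_roc(scores_by_model: dict, y_test):
    """scores_by_model maps a model name to its positive-class scores on the test set."""
    for name, scores in scores_by_model.items():
        fpr, tpr, _ = roc_curve(y_test, scores)
        plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.2f})")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```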

4.1 Comparison of Accuracy Figure 13 shows the comparison of accuracy of RF, ANN, and RF + ANN, wherein the hybrid model displays the highest accuracy.

Fig. 12 ROC curves

Fig. 13 Comparison of accuracy of RF, ANN, and RF + ANN


Fig. 14 Model loss graph; a random forest and b hybrid (RF + ANN)

Fig. 15 Model accuracy graph; a random forest and b hybrid (RF + ANN)

Loss is the penalty for a bad prediction. Figure 14 shows the graphical representation of the model loss. In essence, loss is a number that indicates how poor the model's prediction is: the loss is zero if the model's prediction is perfect; otherwise, the loss is greater than zero. MSE is the mean squared error loss, obtained by averaging the squared differences between the predicted values and the actual values. The result is always positive regardless of the sign of the predicted and actual values, and the perfect value is zero. Machine learning model accuracy is the measure used to determine which model is best at identifying the relationships and patterns between multiple variables in a dataset with respect to the input, as in Fig. 15.

4.2 Comparison of Other Performance Metrics A confusion matrix is a technique for summarizing the performance of classification algorithms. Classification accuracy alone can be misleading if there is an unequal number of observations in each class or if there are

Table 1 Comparison of performance metrics

Parameters  | RF   | ANN  | RF + ANN
Sensitivity | 0.78 | 0.85 | 0.94
Specificity | 0.93 | 0.89 | 0.84
Error       | 0.4  | 0.4  | 0.35
Precision   | 0.91 | 0.84 | 0.70

Fig. 16 Confusion matrix

more than two classes in the dataset. Table 1 shows the comparison of performance metrics in terms of parameters such as error, precision, specificity, and sensitivity. A true positive (TP) means the prediction is positive and the subject is MCI. A true negative (TN) means the prediction is negative and the subject is CN. A false positive (FP) means the prediction is positive while the subject is CN, which is considered a false alarm. A false negative (FN) means the prediction is negative while the subject is MCI, which is the worst case, as represented in Fig. 16.
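For completeness, the Table 1 metrics can be derived from the confusion matrix as sketched below; MCI is treated as the positive class (label 1) and CN as the negative class (label 0), following the definitions above.

```python
# Sketch of confusion-matrix-derived metrics (sensitivity, specificity, precision, error).
from sklearn.metrics import confusion_matrix

def summary_metrics(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),               # recall on the MCI class
        "specificity": tn / (tn + fp),
        "precision":   tp / (tp + fp),
        "error":       (fp + fn) / (tp + tn + fp + fn),
    }
```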

5 Conclusion In this paper, structural brain images have been used as input to the classification algorithms. Using the ADNI dataset, higher accuracy is obtained. From the validation of results obtained using the RF, ANN, and RF + ANN classifiers, it is clearly seen that the hybrid algorithm was more efficient and gave better performance. This work shows the classification of the CN and MCI stages of AD. The hybrid model is able to perform the classification with an accuracy of 93.47%.


References 1. Maqsood M, Nazir F, Khan U, Aadil F, Jamal H, Mehmood I, Song OY (2019) Transfer learning assisted classification and detection of Alzheimer’s disease stages using 3D MRI scans. MDPI(Sensors) 2. Gudigar A, Raghavendra U, Ciaccio EJ, Arunkumar N, Abdulhay E, Acharya UR (2019) Automated categorization of multi-class brain abnormalities using decomposition techniques with MRI images: a comparative study. IEEE Access 7 3. Kim Y, Altbach MI, Trouard TP, Bilgin A (2019) Compressed sensing using dual-tree complex wavelet transform. In: 2017 34th National radio science conference (NRSC). IEEE 4. Torbati N, Ayatollahi A (2019) A transformation model based on dual-tree complex wavelet transform for non-rigid registration of 3D MRI image. Int J Wavelets Multiresolut Inf Process 17 5. Liu S, Bai W, Zeng N, Wang S (2019) A fast fractal based compression for MRI images, vol 7. IEEE 6. Oommen L, Chandran S, Prathapan VL, Krishnapriya P (2020) Early detection of alzheimer’s disease using deep learning techniques. Int Res J Eng Technol (IRJET) 7 7. Islam J, Zhang Y (2018) Brain MRI analysis for Alzheimer’s disease diagnosis using an ensemble system of deep convolutional neural networks. Brain Inf 8. Maqsood M, Nazir F, Khan U, Aadil F, Jamal H, Mehmood I, Song OY (2019) Transfer learning assisted classification and detection of Alzheimer’s disease stages using 3D MRI scans. MDPI (Sensors) 9. Qiu S, Joshi PS, Miller MI, Xue C, Zhou X, Karjadi C, Chang GH, Joshi AS, Dwyer B, Zhu S, Kaku M (2021) Development and validation of an interpretable deep learning framework for Alzheimer’s disease classification. Brain 143(6): 1920–1933 10. Pan D, Zeng A, Jia L, Huang Y, Frizzell T, Song X (2020) Early detection of Alzheimer’s disease using magnetic resonance imaging: a novel approach combining convolutional neural networks and ensemble learning. Front Neurosci 11. Kim Y, Altbach MI, Trouard TP, Bilgin A (2019) Compressed sensing using dual-tree complex wavelet transform. IEEE 12. Pan D, Zeng A, Jia L, Huang Y, Frizzell T, Song X (2020) Early detection of Alzheimer’s disease using magnetic resonance imaging: a novel approach combining convolutional neural networks and ensemble learning. Front Neurosci 14 13. Oommen L, Chandran S, Prathapan VL, Krishnapriya P (2020) Early detection of Alzheimer’s disease using deep learning techniques. Int Res J Eng Technol (IRJET) 14. Islam J, Zhang Y (2018) Brain MRI analysis for Alzheimer’s disease diagnosis using an ensemble system of deep convolutional neural networks. Brain Inf 5(2) 15. Liu S, Bai W, Zeng N, Wang S (2019) A fast fractal based compression for MRI images. IEEE 16. Qiu S, Joshi PS, Miller MI, Xue C, Zhou X, Karjadi C, Chang GH, Joshi AS, Dwyer B, Zhu S, Kaku M (2021) Development and validation of an interpretable deep learning framework for Alzheimer’s disease classification. Brain 143(6):1920–1933 17. Torbati N, Ayatollahi A (2019) A transformation model based on dual-tree complex wavelet transform for non-rigid registration of 3D MRI images. Int J Wavelets Multiresolut Inf Process

Data Analytics

Evaluating Models for Better Life Expectancy Prediction Amit, Reshov Roy, Rajesh Tanwar, and Vikram Singh

Abstract Life expectancy is the average number of years a person or group of people tends to live. Being able to predict life expectancy may deliver potential insights, e.g., into the future prosperity of a community or into volatile environments. Life expectancy has also emerged as a key factor in evaluating a governing body's performance in improving the welfare of the population in general and health status in particular, whether it is a decision by the state about increasing expenditure in some sector or a decision by people about which country could provide the best conditions for life. In this paper, an exploratory analysis is conducted (i) to outline the most important factors responsible for affecting life expectancy through feature selection and (ii) to identify the most suitable machine learning algorithm to deliver the objectives. The systematic analysis is performed over a generic dataset that combines socio-economic and health records, though the data available for countries is scarce and inconsistent. The proposed prediction model achieved the aimed-for accurate forecast with a reduced number of features, and the predictions are satisfactory over a dataset with inconsistent and missing values. Keywords Big data · Life expectancy · Pattern recognition · Predictive analysis

Amit · R. Roy · R. Tanwar · V. Singh (B) National Institute of Technology, Kurukshetra, India e-mail: [email protected] Amit e-mail: [email protected] R. Roy e-mail: [email protected] R. Tanwar e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_30


1 Introduction Life expectancy is one of the most important measures of a population's health in a country and is used as an indicator by many policy makers and researchers to complement economic measures of prosperity such as GDP. Life expectancy depicts the average age that the members of a particular population group will reach when they die. Life expectancy varies between developed and developing countries; the ratio of births to deaths, the mortality rates of different countries, and the ratio of literate to illiterate population all affect survival time in one way or another. A country's growth, advancement, and accessibility of resources are all factors that affect the longevity of its population [1]. At present, the World Bank provides life expectancy data of different countries up to 2018, which is used directly by the Google search engine, with forecasting done directly on the time-series expectancy data. It is not very clear whether such prediction and forecasting models take into account the effect of parameters other than time which affect life expectancy [2]. Also, many countries do not expose their performance and health metrics, which makes the available data inconsistent and sparse. This makes it difficult to design a holistic prediction and forecasting model which not only accounts for multiple different metrics but is also robust and flexible enough to handle missing and inconsistent data. Our project aims to provide forecasting features which can make predictions not only for life expectancy but also for all the other factors which contribute to it, well beyond 2018. We use the dataset provided by the World Health Organization (WHO) for the years 2000-2015 to train our model and validate it using the data provided by the World Bank.

1.1 Motivation and Objective

The life expectancy of a country indicates the average number of years its people tend to live. It gives a measure of how the country is developing and of the quality of life of its people. An analysis of life expectancy enables forecasting of a country's expectancy from its underlying factors and highlights how factors such as average earnings, health expenditure, and education relate to the life expectancy of the people of that country. The estimation yields useful insights: a higher life expectancy indicates happier and healthier citizens, while lower values call for introspection into the concerning factors of a specific geographic region or country [1]. Further, an administration or public office may use the forecast values for root-cause analysis and planning, so a prediction model for life expectancy can serve the futuristic needs of an administrator or public health office in many ways. A good life expectancy model would also be particularly attractive to governments, private bodies, and NGOs that run healthcare initiatives such as immunization or vaccination drives in under-developed countries. We have taken a systematic and exploratory view of multiple machine learning models, with the objective of understanding their feasibility for this task. The adopted model also reveals the score or percentage by which a factor is lagging behind and the level of enhancement required to meet a target life expectancy. The outcomes of the forecast model may further be utilized by other application domains for planning dependent assets, e.g., the finance industry, which offers life insurance services.

1.2 Research Questions and Contribution

Identifying an appropriate machine learning (ML) model for building the expectancy model is a challenging task, as it requires exploratory analysis on real data entities; each ML model has to be implemented and configured over the data objects for the various factors. We have conceptualized the following research questions (RQs) for the smooth conduct of the work in this paper:

RQ-I: Which machine learning model is appropriate for life expectancy prediction?

RQ-II: What are the key factors affecting the prediction, and how can the score level of each factor required to meet a country's target expectancy be estimated?

The key contribution of the work is an exploratory view of the design of prediction models and an analysis of their feasibility for delivering the stated outcomes. For the experimentation, we have implemented and configured the listed machine learning models on real data entities (of several countries, extracted from the WHO and World Bank with the potential features). Each ML model is analyzed over multiple socio-economic and health factors.

2 Related Work

Existing research efforts on the problem of life expectancy prediction have considered data limited to the US and Italy and thus could not capture life expectancy trends for the whole world; one such work used only the ARIMA forecasting algorithm to capture the trend information [2]. An LSTM model has also been implemented which uses medical prescription records for training, but it did not address the correlation of other socio-economic factors with the medical records that cause changes in life expectancy [3]; moreover, such deep learning models require huge datasets to achieve low variance on real-world instances. Some papers have used traditional regression and tree regression methods only, without using the temporal nature of life expectancy data, so future forecasting was not possible and the accuracy achieved was about 80–85% [4]. With regard to numerical prediction, a variety of machine learning models, such as those of the regression category like linear regression [4], lasso, ridge, and elastic net regression [5], and tree-based models such as decision trees and random forests along with SVM [6], have been employed to address research problems in different fields such as housing price prediction [7]. Forecasting also employs machine learning models such as ARIMA [8, 9], the Prophet library [9], and variations of the same for research problems such as bitcoin price forecasting [9] and automotive and stock price forecasting [10]. Comparative studies revealing the performance of these different models have also been reported. Among the existing research efforts, however, none provides a comprehensive study and unified solution on the performance of prediction models with regard to life expectancy, and the problems of prediction and forecasting appear to have been discussed separately.

3 Proposed Strategy

As is clear from the problem statement itself, the problem of life expectancy prediction and forecasting is divided into two parts.

3.1 Life Expectancy Prediction and Forecasting Model

The task of predicting life expectancy from a set of assumed independent variables is a regression problem [4], and this approach has been taken by previous research in the area. Previous models also directly include the year or time attribute as an ordinal or timestamp value, but we have chosen to remove the time and location (country) aspects and create a generalized model which depends only on socio-economic factors. The excluded time and location aspects are brought back into consideration when we discuss the forecasting model. Figure 1a illustrates the pipeline architecture for the rigorous analysis of various statistical and ML models for the proposed objectives. We start with preprocessing techniques to clean up the WHO dataset, iterating over different techniques to handle inconsistencies such as null values and outliers, scaling the dataset, and extracting the required columns for the prediction model. In feature exploration we make different plots of the data, such as histograms and scatter plots, to observe and analyze the relationships between the dependent and independent variables. We then create feature sets produced by different feature selection algorithms such as Select K-Best [11], an entropy-based set, and correlation analysis [7], and iterate the different prediction models over each of these feature sets. A variety of numerical prediction models have been taken into account, including variations of a given model, such as multiple linear, Lasso, and Ridge regression [4, 7], Elastic Net regression [5], decision trees, and random forests. Each model is then validated and scored using metrics such as mean squared error, variance, and R squared. A sequential flow of these steps helps us create a detailed comparison matrix between combinations of feature sets and numerical prediction models, which ultimately lets us identify the best-fit model for the task at hand. This general model can be saved and used later in the forecasting model. Each step is discussed in detail in the implementation section.

As mentioned earlier, we could have included "year" as an ordinal parameter in our prediction model but chose not to, since we do not have accurate future values for the other regressors required, and most of the prediction models are multiple-regression based [5, 7]. Simply supplying a future year value to the prediction model to generate a forecast is therefore not possible, so we approach the problem differently: we perform a trend analysis on the independent variables to predict their future values and then feed them into the original prediction model to obtain the forecast for the required year. We should be able to forecast satisfactorily for up to 5–6 years ahead. Trend analysis can be done by treating each regressor as a function of time, using well-established algorithms and packages such as ARIMA, Auto-ARIMA, and Prophet [9] as well as linear regression [8]. We can fine-tune the forecasting model for each parameter and each country, and these country socio-economic forecasting models can be saved and used for other tasks such as integration with a web application. Given that we have acquired actual life expectancy values for future years (World Bank dataset), beyond what is present in our original dataset, we should be able to validate our life expectancy forecasts and produce conclusive results. Figure 1b represents a flowchart of this approach; each step of the flowchart is discussed in detail in the implementation section.

Fig. 1 Conceptual models: a prediction model and b forecasting model

4 Implementation Framework

4.1 Data Preparation

The primary dataset for training the prediction model is extracted from the WHO with 20 features, and the World Bank database is further utilized for validation of the life expectancy prediction and forecasting results. The task is inherently one of supervised learning, and both datasets consist of the numerical and nominal features listed in Table 1, with data values captured for the 16 most recent years (2000–2015). The key features, e.g., deaths, diseases, and healthcare expenditure, are used for generating a global prediction model; in the initial steps, features like "Country" and "Year" are excluded. Nominal attributes like Status, which is either "Developing" or "Developed", are converted to binary attributes. Although country names could also be encoded in a similar way, including them would lead to wrong predictions.

4.1.1 Handling Null Values, Attribute Scaling and Outliers

Dimensionality reduction is based on traditional feature selection and evaluation techniques to identify important and irrelevant features. It is, however, a complex task to obtain all features for every country, as some do not publish their data, which leads to null values. Null values are generally filled with mean values, but instances with a null GDP have been eliminated, as filling GDP with the mean would not be a proper estimate. The attributes are normalized, which improves model fitting by putting every attribute on the same scale: the data is passed through the scaler function and the independent features are scaled, while the Y value, i.e., life expectancy, is kept in its original form so that the prediction results do not need to be rescaled. The relative number of outliers in the dataset is small. Eliminating them, either by dropping instances [7] or by using methods such as a rolling mean, would only lead to loss of our already constrained dataset; keeping them also prevents over-fitting and helps the model generalize.

Table 1 Dataset features for analytics: Country, Year, Status, Life expectancy, Adult mortality, Infant deaths, Alcohol, Percentage expenditure, Hepatitis B, Measles, BMI, Under-five deaths, Polio, Total expenditure, Diphtheria, HIV/AIDS, GDP, Population, Thinness 1–19 years, Thinness 5–9 years, Income composition of resources, Schooling

4.2 Feature Extraction

Figure 2 plots the regressors versus life expectancy over the dataset; linear positive and negative correlations are visible in the scatter plots. The time-series aspect of the data is ignored for prediction purposes rather than being modeled as an ordinal value, although the prediction of the independent variables will ultimately support forecasting life expectancy over a future timeline. We reduce the dimensionality of the dataset by eliminating some unimportant or correlated features, applying the methods illustrated in Fig. 3. Checking for correlated features using a heatmap [7], we remove one feature out of every highly correlated pair, using a threshold of 0.8. In our case, GDP and percentage expenditure had a correlation close to 1; since multiple GDP values were null, we proceeded with percentage expenditure as a replacement for GDP. We can observe from the plots in Fig. 3 (higher is better) that, on average, the different methods yield almost the same ordering of important feature scores. We proceed to select the 15–16 best features and iterate our different prediction models over the feature sets obtained from the above algorithms.
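As a rough illustration of this feature-selection step, the sketch below prunes one feature from every pair whose absolute correlation exceeds 0.8 and then scores the survivors with scikit-learn's SelectKBest; the DataFrame `df` and the column name "Life expectancy" are assumed placeholders rather than the exact identifiers used in the project.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Assumed layout: scaled feature columns plus the target column "Life expectancy".
X = df.drop(columns=["Life expectancy"])
y = df["Life expectancy"]

# Drop one feature out of every highly correlated pair (|r| > 0.8),
# mirroring the heatmap-based pruning described above.
corr = X.corr().abs()
cols = corr.columns
to_drop = set()
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if corr.iloc[i, j] > 0.8:
            to_drop.add(cols[j])
X_reduced = X.drop(columns=list(to_drop))

# Score the remaining regressors and keep the k best of them (Select K-Best).
selector = SelectKBest(score_func=f_regression, k=min(15, X_reduced.shape[1]))
selector.fit(X_reduced, y)
best_features = X_reduced.columns[selector.get_support()]
print(sorted(zip(selector.scores_, X_reduced.columns), reverse=True))
print("Selected:", list(best_features))
```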

Fig. 2 Exploring relationships between different features for life expectancy

Fig. 3 a Effect of linear model on regressor, b Feature Importance, and c Estimation of Information gain (Entropy)

4.3 Life Expectancy Prediction Modeling

We start by using the train-test split function to generate training and validation datasets for our models and then iterate over the following models.
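The comparison matrix described earlier can be produced with a loop of the following shape; this is only a sketch under the assumption that `X` holds one selected feature set and `y` the life expectancy target, and the split ratio and hyperparameters are illustrative.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "Elastic net": ElasticNet(),
    "Decision tree": DecisionTreeRegressor(random_state=42),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "SVR": SVR(),
}

for name, model in models.items():
    model.fit(X_train, y_train)        # train on the training split
    pred = model.predict(X_test)       # predict on the held-out split
    print(name,
          "Var: %.2f" % (explained_variance_score(y_test, pred) * 100),
          "MSE: %.2f" % mean_squared_error(y_test, pred),
          "R2: %.2f" % r2_score(y_test, pred))
```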

4.3.1 Linear Regression

Linear regression is a supervised machine learning algorithm that performs a regression task: it models a target or dependent prediction value based on multiple independent feature variables. In general, training each model follows the same structure, with some models requiring additional tuning parameters. The first plot in Fig. 4 represents the residual errors; the errors are spread evenly on either side of the zero line, indicating a low-bias model. The scatter plot of expected versus actual life expectancy is also approximately a 45° line, which indicates that the model fits and performs well on the test set (Fig. 4).

4.3.2 Ridge Regression

Ridge regression is the regularized version of linear regression, in which a regularization term is added to the cost function. This forces the algorithm not only to fit the data but also to keep the model weights as small as possible [5].

Fig. 4 Residual error plot, predicted versus actual life expectancy plot

4.3.3 Lasso

Similar to Ridge regression, Lasso, which stands for Least Absolute Shrinkage and Selection Operator, is another regularized form of linear regression: it adds a regularization term to the cost function, but it uses the L1 norm of the weight vector instead of half the square of the L2 norm [5].

4.3.4 Elastic Net Regression

Elastic Net regression combines Ridge and Lasso regression: its regularization term is a simple mix of the Ridge and Lasso regularization terms, controlled by a mixing ratio r. When r = 0, Elastic Net is equivalent to Ridge regression, and when r = 1, Elastic Net [5] is equivalent to Lasso regression.
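In scikit-learn, the mixing ratio r corresponds to the `l1_ratio` parameter of ElasticNet; the toy data below is only there to make the sketch self-contained.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

# l1_ratio plays the role of the mixing ratio r: values near 0 approach a pure
# L2 (Ridge-like) penalty, values near 1 a pure L1 (Lasso-like) penalty.
for r in (0.1, 0.5, 0.9):
    model = ElasticNet(alpha=0.1, l1_ratio=r, max_iter=10000)
    model.fit(X, y)
    print("l1_ratio=%.1f coefficients:" % r, np.round(model.coef_, 3))
```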

4.3.5 Decision Tree

The decision tree is one of the most powerful and popular tools for both classification and numerical prediction. In decision tree regressors, the leaf nodes predict a real continuous value instead of making a classification.

4.3.6 Polynomial Regression

Polynomial regression is yet another form of linear regression, in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial curve. Such a model can estimate nonlinear relationships between the independent variables and the corresponding mean of the dependent variable. Although the model is flexible, it is difficult to decide on the degree of the regression curve.
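A minimal sketch of polynomial regression as a pipeline, using synthetic data only to show how the choice of degree affects the fit (the degrees tried here are illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy data with a quadratic relationship.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(150, 1))
y = 0.5 * x[:, 0] ** 2 - x[:, 0] + rng.normal(scale=0.2, size=150)

# Degree is the main design choice; too high a degree over-fits, too low under-fits.
for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x, y)
    print("degree", degree, "R2:", round(model.score(x, y), 3))
```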

4.3.7 Random Forest

Random forest is an ensemble technique capable of performing both regression and classification tasks using multiple decision trees and a technique called bootstrap aggregation, commonly known as bagging. The basic idea is to combine multiple decision trees in determining the final output rather than relying on an individual decision tree. Random forest uses multiple decision trees as the base learning models and randomly performs row sampling and feature sampling from the dataset, forming a sample dataset for every tree.
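The bootstrap (row) sampling and per-split feature sampling described here map directly onto scikit-learn's RandomForestRegressor parameters, as in this sketch on stand-in data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the life expectancy feature matrix.
X, y = make_regression(n_samples=500, n_features=15, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each of the 100 trees is trained on a bootstrap sample of the rows (bagging) and
# considers a random subset of features at each split; the trees' outputs are averaged.
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt", bootstrap=True, random_state=0)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)
print("MSE:", round(mean_squared_error(y_test, pred), 2),
      "R2:", round(rf.score(X_test, y_test), 3))
```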

4.3.8 Support Vector Regression

SVR is a supervised learning method used to predict continuous values. Support Vector Regression [6] uses the same principle as support vector machines: the basic idea is to find the best-fit line, i.e., the hyperplane supported by the maximum number of support vectors or points.

4.4 Life Expectancy Forecast Modeling

To generate the life expectancy forecasting model for a country, we start from the same preprocessed dataset as in the prediction modeling section and use the feature set which generated the best prediction model; in this case, the Select K-Best feature selection algorithm gives the best results, together with the random forest regressor. We started by testing the available forecasting models such as ARIMA and Prophet [9] and also experimented with linear regression [8]. Figure 5 shows the forecast of "infant deaths" for Afghanistan, where the dotted plots are the forecasts produced by the different models. Although all models give a satisfactory forecast, some form of validation is required to select the best model for a given regressor, as linear regression is unable to capture the nuances of the data and its tendency toward increasing or decreasing patterns.

Fig. 5 A comparative view of forecasting model of different time series

We start by filtering out the data for a given country; this dataset is used for generating the best forecasting models for each of its 15 regressors. For example, if we take the country India into consideration, we create a country dataframe. We have defined a Country Forecaster class which, for each selected regressor/feature, creates a time-series pandas dataframe consisting of only the "Year" and the regressor values; in our dataset this gives a time series from 2000 to 2015. To select the best model for a particular predictor, say "percentage expenditure", we first split the time-series data into two parts for training and validation. This is not the same as the random split employed in prediction: we selected the data from 2000–2011 for training and 2011–2015 for validation. The candidate models are used to generate a forecast for the years 2011–2015, the forecasts are validated against the test data, and the model with the lowest RMSE is selected as the best model for the particular regressor/predictor. Each model generates validation data as shown below; in this case, Auto-ARIMA (lowest RMSE) is selected as the best forecaster model for "percentage expenditure" in India.

Date       | Original test data | Arima    | Auto Arima | Prophet
2011-01-01 | 0.003317           | 0.003744 | 0.003269   | 0.002344
2012-01-01 | 0.003335           | 0.004409 | 0.002965   | 0.002819
2013-01-01 | 0.003474           | 0.004560 | 0.002882   | 0.001974
2014-01-01 | 0.004442           | 0.008306 | 0.002645   | 0.002467
2015-01-01 | 0.004374           | 0.006904 | 0.002546   | 0.002951
RMSE       |                    | 0.00218397645921559 | 0.0011880604126259 | 0.001370120729545121

After selecting the best model, we train it on the full time-series data of 2000–2015 and create a Parameter object from it, which is returned to the country_forecaster model. In this way we obtain a collection of parameter-to-best-model mappings which is used to generate a forecast dataframe. The model is stored to disk as a country_forecaster object using the "pickle" Python module. Figure 6 illustrates the forecasted values of polio and infant deaths in India; the yellow line indicates the forecast and follows the increasing/decreasing trend sufficiently well. In the same way, we generate forecasts for each predictor in the form of a pandas dataframe using the generate_forecast_df() function, which takes in the country model, the prediction model, and the actual life expectancy values collected from the World Bank data website for comparison purposes. The function produces a dataframe with forecast data for 5 years ahead of 2015, in this case up to 2019, for the country given in the country_model. All the forecast data for a given future year is then fed into our best-fit prediction model to obtain the life expectancy forecast for that year.
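A simplified sketch of this per-regressor model selection and pickling, assuming `series` is a yearly pandas Series for one regressor of one country and using a non-overlapping variant of the train/validation split described above; the ARIMA order, the linear-trend baseline, and the file name are illustrative stand-ins (Auto-ARIMA or Prophet candidates would be validated the same way), not the exact implementation used in the project.

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.arima.model import ARIMA

def rmse(a, b):
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

# `series`: yearly values of one regressor for one country, indexed 2000..2015 (assumed).
train, test = series.loc[2000:2010], series.loc[2011:2015]

# Candidate 1: a small ARIMA model.
arima_fit = ARIMA(train.values, order=(1, 1, 1)).fit()
arima_pred = arima_fit.forecast(steps=len(test))

# Candidate 2: a linear trend over time as a simple baseline.
lr = LinearRegression().fit(train.index.values.reshape(-1, 1), train.values)
lr_pred = lr.predict(test.index.values.reshape(-1, 1))

scores = {"arima": rmse(test.values, arima_pred), "linear": rmse(test.values, lr_pred)}
best = min(scores, key=scores.get)
print("validation RMSE:", scores, "-> best:", best)

# Refit the winner on the full 2000-2015 series and persist it for later use.
final_model = (ARIMA(series.values, order=(1, 1, 1)).fit() if best == "arima"
               else LinearRegression().fit(series.index.values.reshape(-1, 1), series.values))
with open("country_forecaster_regressor.pkl", "wb") as f:
    pickle.dump(final_model, f)
```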

Fig. 6 Infant deaths versus year and polio versus year in India

Fig. 7 Forecasted model for different features versus years (for India)

On average, the forecasted life expectancy is off by about 1–2 years, which is reasonable given the inconsistent and erratic data the prediction model has been trained on. Figure 7 illustrates the life expectancy forecasts for China based on 9 important factors, with scaled data values. Here, the forecast values are 75.49, 75.098, 75.221, and 75.444, against actual values of 75.928, 76.21, 76.47, and 76.704. Although at first glance the forecast may seem faulty, a closer look at the y-axis shows that each division corresponds to 1 year, and the designed model delivers essentially the same projection as the WHO.

5 Analytics and Discussion

The exploratory analysis of the considered ML models is conducted over metrics such as variance, SSE, and R squared. Each feature set has been iterated through the different prediction models and evaluated using functions such as variance, mean squared error, and R squared; adjusted R squared yields almost the same values as R squared (Table 2).

Table 2 The estimation of function for ML model (prediction models vs. feature models)

Prediction model      | Select k-best                    | Extra trees (feature importance)     | Correlation                      | Information gain
Linear regression     | Var: 84, MSE: 15.72, R2: 0.834   | Var: 0.0014, MSE: 98.68, R2: 0.0     | Var: 80.64, MSE: 19, R2: 0.81    | Var: 83.78, MSE: 16, R2: 0.84
Lasso                 | Var: 82.22, MSE: 17.57, R2: 0.82 | Var: −0.0013, MSE: 98.96, R2: 0.0    | Var: 79.82, MSE: 19.94, R2: 0.8  | Var: 82, MSE: 17.69, R2: 0.82
Ridge                 | Var: 84, MSE: 15.72, R2: 0.84    | Var: 0.0014, MSE: 98.68, R2: 0.0     | Var: 80.64, MSE: 19.1, R2: 0.81  | Var: 83.78, MSE: 16, R2: 0.84
Elastic net           | Var: 82.14, MSE: 17.64, R2: 0.82 | Var: −0.0001, MSE: 98.85, R2: 0.0    | Var: 79.91, MSE: 19.85, R2: 0.8  | Var: 81.83, MSE: 17.95, R2: 0.82
Decision tree         | Var: 92, MSE: 7.90, R2: 0.92     | Var: −1.035, MSE: 201.208, R2: −1.04 | Var: 91.86, MSE: 8.03, R2: 0.92  | Var: 92.28, MSE: 8.6, R2: 0.93
Polynomial regression | Var: 92.40, MSE: 6.99, R2: 0.92  | Var: −0.065, MSE: 98, R2: −0.07      | Var: 91.93, MSE: 7.4, R2: 0.92   | Var: 92.53, MSE: 6.87, R2: 0.93
Random forest         | Var: 96.26, MSE: 3.69, R2: 0.96  | Var: −0.098, MSE: 108.5, R2: −0.1    | Var: 96, MSE: 3.89, R2: 0.96     | Var: 96.19, MSE: 3.76, R2: 0.96
SVM regressor (SVR)   | Var: 94.63, MSE: 5.02, R2: 0.95  | Var: −0.098, MSE: 108.5, R2: −0.1    | Var: 93.58, MSE: 5.99, R2: 0.94  | Var: 94.93, MSE: 4.73, R2: 0.95

Here, higher variance (Var.), lower mean squared error (MSE), and higher R squared (R2) are preferred. The best-performing model is random forest with the Select K-Best feature set; this model is saved to be used later in the forecast model. Other model and feature combinations which perform similarly are highlighted with lower intensity in the original table.

5.1 Forecast Modeling Results

The life expectancy forecasting estimation is carried out for a selected mix of countries, which helps in visualizing the forecasted data. The forecast model for China has already been shown in Figs. 7 and 8. Similarly, Figs. 9 and 10 show the forecast model for India, with forecasted statistics (70.563, 71.366, 72.056, 72.476) against actual values (68.607, 68.897, 69.165, 69.416) and an RMSE of 2.6288. Further, the forecast model is also created for Italy.

Fig. 8 Comparison of life expectancy values (forecasted vs actual life expectancy)

Fig. 9 Forecasted model for different features versus years (for India)

Fig. 10 Comparison of life expectancy values (forecasted vs actual life expectancy)

Inherently inconsistent and erratic patterns are observed in the regressor forecast plots for Italy: the forecasted values are (80.782, 80.81, 80.603, 81.661) against actual values (82.543, 83.243, 82.946, 83.346), with an RMSE of 2.0833. For Zimbabwe, an underdeveloped country with a history of low life expectancy, the model forecasted (57.317, 57.995, 61.759, 62.995) against actual values (59.534, 60.294, 60.812, 61.195), with an RMSE of 1.893 (the average RMSE is 1.918).

6 Conclusion

Traditional approaches to predicting life expectancy either offer overly generalized insights or are limited in the factors they can handle. This paper conducted a systematic analysis over a range of machine learning models with the objective of identifying the most suitable one. The proposed data analysis covers the implementation and configuration of each ML model and its analysis over multiple socio-economic and health factors related to life expectancy, excluding the time factor. The study shows that a prediction model employing random forest delivers an accuracy of 96.26%. It also identifies the most important features, which include adult mortality, income composition of resources, BMI, schooling, and the development status of the country. A direction for future work is the inclusion of diverse factors from further areas of life, which should significantly enhance the accuracy of prediction and forecast estimation and also help a machine learning model deal with inconsistent and missing values.

References

1. Beltrán-Sánchez H et al (2015) Past, present, and future of healthy life expectancy. Cold Spring Harbor Persp Med 5(11)
2. Torri T, Vaupel JW (2012) Forecasting life expectancy in an international context. Int J Forecasting
3. Beeksma M, Verberne S, van den Bosch A, Das E, Hendrick I, Groenewoud S (2019) Predicting life expectancy with a long short-term memory recurrent neural network using electronic medical records. BMC Med Inf Decis Making
4. Agarwal P, Shetty N, Jhajharia K, Aggarwal G, Sharma NV (2019) Machine learning for prognosis of life expectancy and diseases. Int J Innov Technol Explor Eng (IJITEE) 8(10). ISSN: 2278-3075
5. Rodríguez O (2018) A generalization of ridge, lasso and elastic net regression to interval data, pp 1–13
6. Awad M, Khanna R (2015) Support vector regression. In: Efficient learning machines. Apress, Berkeley, CA
7. Abbasi S (2020) Advanced regression techniques based housing price prediction model. ResearchGate
8. Athiyarath S, Paul M, Krishnaswamy S (2020) A comparative study and analysis of time series forecasting techniques. SN Comput Sci 1:175
9. Cayir A, Kozan O, Dağ T, Yenidoğan I, Arslan Ç (2018) Bitcoin forecasting using ARIMA and PROPHET
10. Edward A, Manoj J (2016) Forecast model using Arima for stock prices of automobile sector. Int J Res Fin Mark 6(4)
11. Nair R, Bhagat A (2019) Feature selection method to improve the accuracy of classification algorithm. Int J Innov Technol Explor Eng (IJITEE) 8(6). ISSN: 2278-3075
12. Agbehadji IE, Awuzie BO, Ngowi AB, Millham RC (2020) Review of big data analytics, artificial intelligence and nature-inspired computing models towards accurate detection of COVID-19 pandemic cases and contact tracing. Int J Environ Res Public Health 17(15):5330
13. Rosenthal MA, Gebski VJ, Kefford RF, Stuart-Harris RC (1993) Prediction of life-expectancy in hospice patients: identification of novel prognostic factors. Palliat Med 7(3):199–204
14. Anderson KG (2010) Life expectancy and the timing of life history events in developing countries. Hum Nat 21(2):103–123

Future Gold Price Prediction Using Ensemble Learning Techniques and Isolation Forest Algorithm Nandipati Bhagya Lakshmi and Nagaraju Devarakonda

Abstract Gold is a mineral with unique characteristics that draws people's attention for a variety of reasons, including its high demand in jewelry. In addition to other forms of payment, gold has been used to finance commercial transactions all over the world. Gold is also a national financial asset: states have preserved and grown their gold holdings to signal that they are affluent and forward-thinking, and central banks across the globe keep precious metals such as gold for loans and other requirements of the people. In India and other parts of the globe, gold is also commonly given as a gift. Gold prices are heavily influenced by global commodity demand and supply. In this article, we predict the gold rate for the next 30 days and also produce predictions for the financial year 2020–21, the next fiscal year 2021–22, and up to the middle of the following fiscal year 2022–23 using ensemble learning techniques and the isolation forest algorithm. We used information gathered from the Internet to create predictions with ensemble learning methods and an unsupervised approach. We forecast the gold rate using different algorithms based on these techniques, and by applying them to the whole gold dataset we compare the accuracy of the ensemble and isolation forest algorithms. Keywords Gold · Random forest algorithm · Fiscal year

N. Bhagya Lakshmi (B) · N. Devarakonda School of Computer Science and Engineering, VIT-AP University, Inavolu, AP, India e-mail: [email protected] N. Devarakonda e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_31

1 Introduction

Gold is used in various countries, including the USA, as a form of currency. Its value has remained stable, and it has been used to measure a country's financial strength. Large investors have also been lured to this precious metal and have poured big amounts of money into it. Emerging international economies like China, Russia, and India have recently proven to be major gold buyers, while the United States, South Africa, and Australia have been big sellers. Traditional Chinese and Indian festivals also influence the price of gold, as more money is invested in purchasing the commodity at those times. Small investors prefer this commodity over other investment alternatives since it is a secure investment with no inherent dangers. Internal financial conditions in the aforementioned nations are crucial in determining gold spot pricing. Government gold investment is largely affected by the government's financial condition and interest rates; both are measures of the market's strength [1]. When interest rates in the United States fall, more economic activity occurs in the country, resulting in capital inflows into the gold market [2]. Moreover, if interest rates are low, investors actively purchase gold; otherwise they invest in alternatives such as bonds or the stock exchange. Figure 1 demonstrates that when gold rates are low, the New York Stock Exchange (NYSE) and the S&P 500 perform well. The S&P 500 is a stock market index in the United States based on the market valuations of 500 major firms listed on the NYSE or NASDAQ. The present market value at which a product is bought or sold for immediate settlement is known as the spot price [3]; it should not be confused with the futures market price, at which both parties agree to exchange at a future date. The gold price is established multiple times a day based on the gold market's supply and demand. This paper studies how the gold price impacts and performs in future days across the economy. In addition, this study analyzes the data using ensemble learning algorithms such as the random forest algorithm, the AdaBoost algorithm, SVM, and the voting classifier, together with the isolation forest algorithm, to predict the accuracy of future gold rates based on financial years. The next section of the article is a literature review, followed by sections on data collection, findings, discussion, and conclusion.

Fig. 1 Effect of index prices on gold rates

2 Literature Study

Gold is a mineral with unique characteristics that draws people's attention for a variety of reasons, including its high demand in jewelry. From its usage in coinage in previous centuries, it has since spread to dentistry, the fashion industry, medicine, and electronics; pricing, distribution, and sales are all affected by market fluctuations [4]. Gold, as an investable asset, experiences financial difficulties in traditional markets, but its value eventually rises steadily. Lawrence [5] claims that there are no substantial correlations between gold prices and economic factors such as inflation or GDP, and observes that gold yields are less linked with bond and stock indexes than the returns of other commodities. Sjaastad and Scacciavillani [6] demonstrated that gold is a decent inflation hedge, whereas Baker and Van Tassel [2] revealed that future inflation rates have an influence on gold prices. After conducting a literature review, Naser [7] argues that the historical evidence on gold's utility as an inflation hedge is inconsistent. The commodities information bureau futures index, the USD/EUR foreign exchange rate, the inflation rate, money supply, the New York Stock Exchange index, the Standard and Poor 500 index, Treasury bills, and the USD index were all used by Ismail et al. [8] to forecast gold rates. According to [9], the Commodity Research Bureau futures index, the USD/EUR foreign exchange rate, the inflation rate, and money supply all have a significant impact on the gold rate. Khaemasunun [10] investigated the impact of various currencies, oil prices, and interest rates on gold prices. Hammoudeh et al. [11] concluded that gold price volatility and exchange rate volatility are related, while according to Han et al. [12] the exchange rate shows both low and high connections with the gold price. Ewing and Malik [13] show that volatility is transmitted between precious metal futures prices. According to Ghosh et al. [14], gold rates are linked to US inflation, interest rates, and the dollar exchange rate; through a co-integration investigation they identified a long-term relationship between commodity prices and the US consumer price index. Overall, a review of relevant studies shows that the link between gold's price and many of the factors thought to impact it is erratic [15].

3 Statistics and Methodology

Based on the literature, we have identified certain factors for predicting the future gold price. The spot price of gold is expressed in rupees per ounce. We gathered monthly data from online sources to forecast the future gold rate based on the current financial year. Ensemble learning algorithms and the isolation forest algorithm are used to analyze and model the collected data: 75% of the data is used for training and 25% for testing the system. We first preprocess the data and then apply these techniques to calculate the accuracy.

Ensemble approaches create multiple models and then combine them to produce better results; in most cases, ensemble methods produce more accurate findings than a single model. Ensemble techniques include random forests, AdaBoost, decision trees, SVM, and voting classifiers. Using ensemble methods, several weak learners, such as discriminant analysis, are merged to produce a powerful learner such as a random forest. The decision tree algorithm may be used to solve a wide range of machine learning problems: a decision tree is a tree-structured model that can be used to forecast and categorize data. However, trees grown to a great depth in order to learn highly irregular patterns tend to overfit the training set, and a slight amount of noise in the data may lead the tree to grow in an unanticipated direction. Random forest is an ensemble approach that employs a large number of decision trees using a technique known as bootstrap aggregation, or bagging, to solve both regression and classification problems. Rather than relying on individual decision trees, the main concept is to combine many decision trees to arrive at a final outcome; by using bootstrapping on decision trees, random forest reduces variance while preserving the low bias of a decision tree model. Compared with most other algorithms, the random forest technique offers the following advantages: over-fitting rarely arises when it is used to solve classification problems; the same technique can solve both classification and regression tasks; and it may also be used for feature engineering to find the most significant characteristics among the available features of the training dataset (Figs. 2 and 3).

Fig. 2 Overview of random forest algorithm

Fig. 3 Workflow of random forest algorithm

Python is used to implement these machine learning techniques in this study. The prediction accuracy of each technique is measured using Mean Squared Error (MSE), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). A voting classifier is a machine learning algorithm that learns from a set of models and predicts an output (class) based on the highest likelihood of the outcome being the desired class. In this voting classifier, we have taken three algorithms: logistic regression, decision tree classifier, and support vector machine. It simply collects the results of each classifier given to the voting classifier and forecasts the output class based on the largest voting majority. Instead of developing separate specialized models and determining their correctness individually, we develop a single model that trains on these models and predicts the output based on their aggregate majority of votes for each output class.

The sigmoid transform is a mathematical function for converting values into probabilities: it maps any real value to a value between 0 and 1. Since the logistic regression output must lie between 0 and 1 and cannot exceed this limit, the result is a curve with an "S" shape, also called the sigmoid or logistic function. The logistic regression equation may be derived from the linear regression equation through the following steps. We know the equation of a straight line can be written as

y = b0 + b1x1 + b2x2 + ... + bnxn    (1)

In logistic regression, y can only lie between 0 and 1, so we divide by (1 − y):

y / (1 − y);  0 for y = 0, and infinity for y = 1    (2)

But we need a range between −∞ and +∞, so we take the logarithm of the equation, which becomes (Fig. 4):

log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn    (3)

Fig. 4 Voting classifier flowchart
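A hedged sketch of such a hard-voting ensemble built from logistic regression, a decision tree, and an SVM; the synthetic classification data merely stands in for the labelled gold-rate data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the labelled gold-rate data (e.g., price up/down classes).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Hard voting: each base classifier casts one vote and the majority class wins.
voter = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("svm", SVC()),
], voting="hard")
voter.fit(X_train, y_train)
print("voting accuracy:", round(voter.score(X_test, y_test), 3))
```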

AdaBoost Classifier: It operates in the following manner:

• The AdaBoost classifier begins by selecting a training subset at random.
• It trains the classifier iteratively, selecting the training set depending on the accuracy of the previous training round.
• It gives incorrectly categorized observations a larger weight so that they have a better likelihood of being categorized correctly in the following iteration.
• It also assigns a weight to the trained classifier in each iteration based on the classifier's accuracy; the classifier with the highest accuracy receives the most weight.
• This process continues until all of the training data fits perfectly or until the specified maximum number of estimators is reached.
• To categorize a new instance, a "vote" is cast across all of the classifiers that were created (Fig. 5).

Bagging Classifier: Bagging (bootstrap aggregating) is a popular ensemble learning technique in machine learning. It constructs several models from randomly selected portions of the training dataset and combines the learners to create a stronger overall learner. The bagging classifier involves the following steps:

• Data preparation
• Training the bagging classifier
• Predicting test results and checking accuracy
• Changing the base estimator to test accuracy
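Both ensembles can be set up with a decision tree base learner roughly as follows; note that recent scikit-learn releases name the parameter `estimator` while older ones call it `base_estimator`, and the data here is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Both ensembles use a decision tree as the base learner, as described above.
ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, random_state=1)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=100, random_state=1)

for name, model in (("AdaBoost", ada), ("Bagging", bag)):
    model.fit(X_train, y_train)
    print(name, "accuracy:", round(model.score(X_test, y_test), 3))
```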

Fig. 5 Overview of AdaBoost classifier

In both the AdaBoost and bagging classifiers, we have used a decision tree classifier as the base estimator, set the number of estimators, and then trained the model on the training data.

Support Vector Machine: The support vector machine algorithm is given a collection of labeled data in the training set and can generalize across two separate classes. The main task of the SVM is to find a hyperplane that can distinguish between the two classes. There may be several hyperplanes that accomplish this task, but the aim is to find the one with the largest margin, i.e., the maximum distance between the two classes, so that fresh data points may be readily classified in the future.

Isolation Forest Algorithm: This is an unsupervised approach for identifying anomalies in the absence of labels or ground-truth values; going through each row of data by hand to look for abnormalities would be a time-consuming process. The isolation forest model is built on trees, but the trees it creates are not the same as those used in decision trees: isolation forests and decision trees are constructed differently, and, unlike the supervised decision tree, the isolation forest is an unsupervised learning method. Partitions are generated in these isolation trees by randomly selecting a feature or variable and then randomly determining a split value between the feature's minimum and maximum value. Unlike in a decision tree, the root node is chosen at random, without any condition for becoming a root node: it is selected at random from the input parameters, and a random value between the highest and lowest values of that feature is chosen. The average anomaly score over the trees in the isolation forest is used to calculate the anomaly score of an input value; after fitting the whole dataset to the model, an average anomaly score is generated for each observation. The higher the anomaly score, the more likely the observation is to be an abnormality compared with a row with a lower anomaly score. The isolation forest algorithm is implemented in the stages below:

• Select a point to isolate.
• Set the range for each characteristic to isolate between the minimum and maximum values.
• Take a feature at random.
• Choose a value inside the range, again at random: …
• Repeat steps 3 and 4 until the point is isolated.
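A small sketch of anomaly detection with scikit-learn's IsolationForest on stand-in price values; note that, unlike the score convention described above, scikit-learn's decision_function returns lower values for more anomalous points.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic gold-rate-like values with a few injected outliers (illustrative only).
rng = np.random.default_rng(2)
prices = np.concatenate([rng.normal(47000, 600, size=200),
                         [52000, 41000, 55000]]).reshape(-1, 1)

iso = IsolationForest(n_estimators=100, contamination="auto", random_state=2)
labels = iso.fit_predict(prices)        # -1 marks an anomaly, 1 a normal row
scores = iso.decision_function(prices)  # lower scores mean more anomalous
print("anomalies found:", prices[labels == -1].ravel())
```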

4 Discussion and Outcomes

Datasets: Gold price data were taken from different Internet sources such as top10stockbroker.com. The data contain the date and the gold price per day and month, as well as the future gold rate predicted for upcoming periods; at certain positions or days, the predicted gold rate can be matched against the actual rate (Figs. 6, 7, 8, 9, 10, and 11; Tables 1 and 2). We compare the prediction accuracy across different financial years to see how the gold rate will behave in the future, measuring prediction accuracy with R squared, Mean Squared Error (MSE), and Root Mean Square Error (RMSE) (Figs. 12, 13, and 14).

Fig. 6 Gold price forecast for today, tomorrow, and the next 30 days

Fig. 7 Gold price predictions for the period between July 2021 and December 2021

Fig. 8 Gold price forecast for the period between January 2022 and June 2022

Fig. 9 Gold price target for the period between July 2022 and December 2022

Fig. 10 Gold rate prediction for the period between Jan 2023 and July 2023

Fig. 11 Gold rate prediction for the whole gold data

Table 1 Pictorial analysis

Period                                        | Mean      | Std. dev. | Coeff. of variation
Gold rate forecast or target for next 30 days | 47,606.87 | 584.1     | 1.23
July 2021–December 2021                       | 46,761.67 | 678.986   | 1.45
Jan 2021–June 2022                            | 44,593.17 | 882.493   | 1.98
July 2022–December 2022                       | 44,469.33 | 2227.927  | 5.01
Jan 2022–July 2023                            | 50,315    | 1056.6    | 2.10
Overall dataset                               | 47,282.62 | 1737.9    | 3.67

Voting classifier (logistic regression, decision tree classifier, SVM)

Random forest regression

499.46 −23.88

681.81

586,141.7

765.59

−1.319

741.5

724,978.5

851.45

MAE

MSE

RMSE

R2

MAE

MSE

RMSE

1239.66

1,536,762.5

1214.5

249,462.9

494.54

910.823

829,600.0

652.0

−0.951

581.42

338,053.6

513.93

−4.541

−2.182

R2

−0.812

January 2022–June 2022

Gold rate forecast or July target for next 30 days 2021–December 2021

Table 2 Prediction accuracy

1329.78

1,768,334.5

1328.5

0.749

928.59

862,288.96

885.81

0.776

July 2022–December 2022

898.24

806,8368.5

699.09

−0.58

861.10

741,505.5

727.21

0.116

January 2023–July 2023

699.59

489,432.5

566.5

0.038

701.32

49,185.77

506.29

0.033

Overall data


Fig. 12 Coefficient of variation of gold prices based on financial years and overall data

Fig. 13 Gold price forecasting based on financial years using random forest algorithm

For the individual gold rate datasets, i.e., the next 30 days and each six-month period such as July 2021–December 2021 and January 2022–June 2022, the accuracy is roughly the same (16.6%) for the ensemble algorithms and the isolation forest algorithm, but when we take the whole dataset at once, the accuracy of the AdaBoost algorithm is the highest of all the algorithms (Figs. 15 and 16; Table 3). In Fig. 16, the blue line indicates the actual values and the green line the predicted values; at some points the actual and predicted values overlap. The figure shows the prediction over the overall gold data based on actual versus predicted values. In terms of Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), the outcomes show that in July 2022–December 2022 the gold price will increase sharply, and from December 2022 onwards there are only moderate changes in the gold price.

Fig. 14 Gold price forecasting based on financial years using voting classifiers

Fig. 15 Gold price forecasting based on accuracy

This is an expected result; whether the gold price increases more or decreases less will depend on the COVID-19 situation and economic conditions.

5 Conclusion

This study shows how the gold rate will behave in the coming days. First, we examined how the gold rate will behave over the next 30 days. After that, we considered the gold rate for every six-month period related to the financial years.

Fig. 16 Gold price forecasting based on actual price versus predicted price

Table 3 Detailed information about algorithms

Algorithm                  | Train data (75%) | Test data (25%) | Accuracy
Random forest algorithm    | 97.6             | 80.9            | 91.5
Bagging classifier         | 90.7             | 38.8            | 38.8
AdaBoost classifier        | 50               | 50              | 98.1
Voting classifier          | 48.1             | 25              | 33
Isolation forest algorithm | 31.91            |                 |

The prediction for July 2021–December 2021 changed by 0.30 percent overall, with variations dipping to the lowest levels. For the year 2021, the closing point in January was 46,070, and there was a rise from 44,735 on February 22 to 44,745 on May 22. The following period, from July 2022 to December 2022, is similar to the first, with a steady increase in closing points starting at 42,135 on July 22 and ending at 48,846 on December 22, a 15.93 percent change. The forecast data for January to July of 2022–23 clearly shows an increase in average points and low volatility in those averages, with a corresponding total change percentage. The average point value for January is 49,037, followed by 50,027 in February and 51,528 in March, after which it climbs steadily to 51,838 in July 23. In the month of June 22, the value decreased to 43,403, a total percentage change of −5.79%. July's closing point was 47,352, and it fell to 45,892 in September before gradually increasing to 47,495 in December. In addition, extra attributes can be added to the research to improve accuracy, and the inquiry may be continued. By estimating the number of instances in the gold rate dataset in the future and how the gold rate will vary as demand increases, we can anticipate the future size of the dataset. The forecasts and predictions help investors follow the movement of the gold price in the market and analyze the current and future condition of gold. This paper studies all the possible outcomes of the analysis under different conditions; further data and additional techniques may be added to better understand the performance of these techniques.

References

1. Toraman C, Basarir Ç, Bayramoglu MF (2011) Determination of factors affecting the price of gold: a study of MGARCH model. Bus Econ Res J 2(4):37
2. Baker SA, Van Tassel RC (1985) Forecasting the price of gold: a fundamentalist approach. Atl Econ J 13(4):43–51
3. Chen M, Chiu CC, Chen HH (2016) The integration of artificial neural networks and text mining to forecast gold futures prices. Commun Statis Simul Comput 1532–414
4. Du W, Schreger J (2013) Local currency sovereign risk. In: Social science research network, Rochester, NY, SSRN Scholarly Paper ID 2976788, Dec 2013
5. Lawrence C (2003) Why is gold different from other assets? An empirical investigation. London, UK: World Gold Council
6. Sjaastad LA, Scacciavillani F (1996) The price of gold and the exchange rate. J Int Money Finance 15(6):879–897
7. Naser H (2017) Can gold investments provide a good hedge against inflation? An empirical analysis. Int J Econ Financ Issues 7(1):470–475
8. Ismail Z, Yahya A, Shabri A (2009) Forecasting gold prices using multiple linear regression method. Am J Appl Sci 6(8):1509
9. Jagerson J, Hansen SW (2011) All about investing in gold. McGraw-Hill Publishing
10. Khaemasunun P (2014) Forecasting Thai gold prices, Vol 2. Available http://www.Wbiconpro Com3-Pravit. Pdf
11. Hammoudeh SM, Yuan Y, McAleer M, Thompson MA (2010) Precious metals–exchange rate volatility transmissions and hedging strategies. Int Rev Econ Finance 19(4):633–647
12. Han A, Lai KK, Wang S, Xu S (2012) An interval method for studying the relationship between the Australian dollar exchange rate and the gold price. J Syst Sci Complex 25(1):121–132
13. Ewing BT, Malik F (2013) Volatility transmission between gold and oil futures under structural breaks. Int Rev Econ Finance 25:113–121
14. Ghosh D, Levin EJ, Macmillan P, Wright RE (2004) Gold as an inflation hedge? Stud Econ Finance 22(1):1–25
15. Mombeini H, Yazdani-Chamzini A (2015) Modeling gold price via artificial neural network. J Econ Bus Manage 3(7):699–703

Second-Hand Car Price Prediction N. Anil Kumar

Abstract Predicting the price of second-hand or used cars is an important as well as an interesting problem. Estimating the appropriate price of a used car gives a rough idea of the amount at which the car can be sold at the best price; the challenging part is to find this best price based upon the actual features of the car. The most highly correlated features are selected and used to build the model with the Random Forest Regression technique, and the features used to build the model are given as input to the system when predicting the price. Random Forest is well suited to this task because it is an ensemble of decision trees built on various subsets of the given data set whose outputs are averaged to improve the predictive accuracy on that data set. This ensures that the predicted price is trustworthy. Keywords Random Forest · Used car price · Price prediction

1 Introduction

1.1 Motivation

A buyer or dealer needs to estimate whether the price quoted for a used car is worth paying. Many factors, such as the year of manufacture and the kilometers driven, affect the real value of a car, and from the seller's point of view it is equally difficult to price a used car properly. The purpose of this work is to use machine learning on existing data to improve the pricing model for used cars.

N. Anil Kumar (B) Department of Information Technology, Vasavi College of Engineering, Hyderabad, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_32

1.2 Problem Addressed

Predicting the price of a car that has already been used, rather than a recently manufactured one, is a tough and important problem. With the increase in the number of used cars, demand and costs are increasing, so consumers are looking for alternatives to buying new cars, and there is a need for an accurate price predictor for used vehicles.

1.3 Objectives

• The aim of this paper is to predict the best reasonable price of a used car.
• To determine on which attributes of used cars the price depends.

1.4 Solution/Novelty

First, we read the data set consisting of the features of the cars and analyze the dependent and independent variables; the data are then processed to handle missing values. Next, with the help of the Random Forest Regression machine learning algorithm, we build a model that can best predict the price of a used car. Once the model is built, we measure its accuracy and try to improve it using various approaches.

2 Related Works

Gegic et al. [1] built a model for predicting the price of used cars in Bosnia and Herzegovina, applied three machine learning techniques, namely Artificial Neural Network, Support Vector Machine, and Random Forest, and gave a comparative analysis of these models. Richardson [2], in his thesis, argued that re-manufacturing old cars gives high fuel efficiency; he used multiple regression analysis to calculate the price of used cars before re-manufacturing and also showed that hybrid cars retain their value better than traditional vehicles. Utilizing a neuro-fuzzy knowledge-based system, Wu et al. [3] worked on car price prediction; their data set consisted of attributes such as the brand of the car, the year of manufacture, and the type of engine, and their model produced results comparable to a simple regression model's. They also built a system called Optimal Distribution of Auction Vehicles (ODAV) to meet the high demand for cars that come off lease every year.

Pudaruth [4] worked with various machine learning algorithms, such as naive Bayes, decision tree analysis, multiple linear regression, and k-nearest neighbors, for car price prediction in Mauritius. The data set was collected from local newspapers within a month or less and consisted of attributes such as brand, model, transmission type, price, mileage in kilometers, cubic capacity, production year, and exterior color. He concluded that decision trees and naive Bayes did not live up to expectations.

3 System Analysis

In this paper, we aim to find the best price of a used car by using a Random Forest Regression model. All the car features act as the independent variables, i.e., they become the inputs to the model, while the predicted or estimated selling price is the dependent variable generated by the model. The model is built entirely with the Random Forest Regression technique from the SciPy library. To build the model, a large data set is used, consisting of different cars along with their features. The main aim of this project is to predict a worthwhile price for any used car that is fair to both seller and buyer. The entire code is hosted on a Flask server; currently we use a local server to host the code, and later it can be hosted on any public server so that anyone can access it and make use of it.

3.1 Data Set

The data set comprises the attributes of the cars [5] as its columns: car name, year, present price, selling price, fuel type, transmission type, number of owners, and seller type. During further processing, derived attributes such as the number of years of use are created, attributes with multiple textual values are converted into single-valued (encoded) attributes, and all the attributes of the car except the selling price are used as the independent (input) variables, while the selling price is the dependent variable, i.e., the final output.

3.2 Algorithm: Random Forest Regression Random Forest [6, 7] is a very popular supervised learning algorithm. Both classification and regression problems in machine learning can be solved with Random Forest. In particular, we use Random


Forest Regression to build our model. To improve the overall performance of the model and to solve a complicated problem, Random Forest Regression uses the idea of ensemble learning, which is a method of combining multiple predictors. A Random Forest Regressor consists of a large number of decision trees built on different subsets of the given data set, and it averages their outputs to improve the predictive accuracy. Instead of relying on a single decision tree, the Random Forest considers the prediction from every tree and produces the final prediction by aggregating them: averaging for regression, and majority voting for classification, where each individual tree outputs a class and the class with the most votes becomes the model's prediction. The essential idea behind Random Forest Regression is a simple but powerful one, the wisdom of crowds: a large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models. In Fig. 1, we can see that the entire data set is initially split into N training subsets, and each training subset leads to its own decision tree; these two phases, i.e., forming the training subsets and building the decision trees, constitute the training stage. The outputs of all the decision trees are then averaged (or the majority is taken) to produce the final predicted output; this averaging and prediction stage is applied to the test data.

3.3 Assumptions for Random Forest Regression Random Forest combines multiple trees to predict the output for the data set; some decision trees may predict the correct output, while others may not. Taken together, however, by averaging, the ensemble of decision trees arrives at the correct output. Accordingly, two assumptions should hold for the Random Forest model to give its best predictions: • The data set should contain some real, informative feature values so that the model can predict accurate results rather than guesses. • The predictions of the individual decision trees should have very low correlation with one another.


Fig. 1 Working of Random Forest algorithm

4 Implementation 4.1 Pre-processing Initially, we load the data set and check for null values. After clearing all null values, we bring the data set into a proper format: some attributes (columns) are textual with more than one unique value, so during pre-processing each unique value is turned into a column of its own. Finally, we add the new columns required (such as the number of years) and proceed with building the model. A minimal sketch of this step is given below.
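The sketch below illustrates these pre-processing steps under stated assumptions: the column names follow the Kaggle CarDekho data set cited in [5] (other versions of the file may name them differently), and the reference year used for the age attribute is assumed.

import pandas as pd

# Load the used-car data set (file name as distributed on Kaggle; adjust to the
# local copy if it differs).
df = pd.read_csv("car data.csv")

# Check for and drop rows containing null values.
print(df.isnull().sum())
df = df.dropna()

# Derive the "number of years" attribute from the year of manufacture
# (2020 is used as the reference year here; this is an assumption).
df["no_of_years"] = 2020 - df["Year"]
df = df.drop(["Year", "Car_Name"], axis=1)

# Convert each multi-valued textual attribute into single-valued indicator columns.
df = pd.get_dummies(df, columns=["Fuel_Type", "Seller_Type", "Transmission"],
                    drop_first=True)

# The selling price is the output; every other column is an input feature.
X = df.drop("Selling_Price", axis=1)
y = df["Selling_Price"]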


Fig. 2 Dist plot for our model versus testing data

4.2 Visualizing The main data visualization and pre-processing have been done earlier; here, we visualize the most important features by using the ExtraTreeRegressor to extract the feature importances from the data set, so that we know how much each feature contributes to estimating the selling price of the car (Fig. 2). A sketch of this step is given below.
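A minimal sketch of this step is shown below using scikit-learn's ExtraTreesRegressor (one plausible reading of the "ExtraTreeRegressor" mentioned above, which is an assumption); X and y are the feature matrix and selling-price target from the pre-processing sketch.

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Fit an ensemble of extremely randomized trees to estimate feature importances
# (X and y come from the pre-processing sketch above).
model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Rank the features by importance and plot the most influential ones.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances.nlargest(10).plot(kind="barh")
plt.xlabel("Feature importance")
plt.tight_layout()
plt.show()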

4.3 Model Creation We use Random Forest Regression, with the hyperparameters of the decision trees tuned using RandomizedSearchCV [8]; for the scoring factor [9], we use "neg_mean_squared_error." After the model is configured with the attributes provided to the Random Forest Regressor function, we fit it to the training variables of the data set. The entire data set is divided into two parts, the independent values and the dependent value, and each part is again split into two subsets, one for training the model and the other for testing. The independent values are what we give to the model, and the dependent value is what we predict. A sketch of this step is given below.
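The sketch below illustrates this step. The hyperparameter ranges are illustrative assumptions (the paper does not list the exact grid searched), the scoring string matches the scorer described in [9], and X and y come from the pre-processing sketch.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Split the data into training and test partitions.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Illustrative hyperparameter distributions for the randomized search.
param_grid = {
    "n_estimators": [100, 300, 500, 700, 900, 1100],
    "max_features": ["sqrt", "log2"],
    "max_depth": [5, 10, 15, 20, 25, 30],
    "min_samples_split": [2, 5, 10, 15],
    "min_samples_leaf": [1, 2, 5, 10],
}

# Randomized search over the Random Forest Regressor using the
# negative mean squared error as the scoring factor.
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=param_grid,
    n_iter=10,
    scoring="neg_mean_squared_error",
    cv=5,
    verbose=1,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_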


4.4 Input The input values can be provided to the model through a basic Python file that loads the model and assigns values to the input variables; these variables are then assembled into an array that is passed to the model.

4.5 Output Once the loaded model is run on the given input, the predicted output can be displayed. A sketch of the input and output steps is given below.
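The sketch below illustrates this input/output flow; the pickle file name and the ordering of the feature values are assumptions and must match whatever was saved and used during training.

import pickle
import numpy as np

# Load the trained Random Forest model (the pickle file name is a placeholder and
# must match whatever was saved after training).
with open("random_forest_regression_model.pkl", "rb") as f:
    model = pickle.load(f)

# Assemble a single input row; the feature order shown in the comment is an
# assumption and must match the column order used during training:
# [Present_Price, Kms_Driven, Owner, no_of_years,
#  Fuel_Type_Diesel, Fuel_Type_Petrol, Seller_Type_Individual, Transmission_Manual]
features = np.array([[9.85, 6900, 0, 7, 0, 1, 0, 1]])

# Run the prediction and display the estimated selling price.
predicted_price = model.predict(features)
print("Predicted selling price:", predicted_price[0])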

5 Analysis During pre-processing, the entire data set is divided into two splits, one for training and the other for testing; the project is based on Random Forest Regression, the model is generated from X train and Y train, and the prediction is made using X test and Y test. X comprises the independent data (the inputs), and Y comprises the dependent data (the output). After the creation of the model, a variable named prediction stores the values predicted for X test. Once this array is generated, its values are compared against Y test using two plotting techniques, as sketched below. Dist Plot The X axis is plotted with the selling price and the Y axis with the prediction array; the plot shows the normalized curve for the model, obtained from the difference between Y test and the prediction (Fig. 3). Scatter Plot A scatter plot of the selling price versus the prediction is also drawn. In the scatter plot, we can see that the points lie approximately on a straight line between the two attributes, and the best regression models exhibit such linearly scattered points on the graph. In Fig. 4, we can also see the final metrics for the regression model; a model whose metric values are close to zero can be considered highly accurate, and since our values are near zero, the predictions made using this model are accurate.
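A sketch of this evaluation is given below, continuing from the RandomizedSearchCV sketch (best_model, X_test, y_test); note that recent seaborn versions replace distplot with histplot/kdeplot, which is used here as a stand-in for the dist plot described above.

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics

# Predictions on the held-out test partition (best_model, X_test, y_test come
# from the RandomizedSearchCV sketch above).
predictions = best_model.predict(X_test)

# Dist plot of the residuals: a curve centred on zero indicates a good fit.
sns.histplot(y_test - predictions, kde=True)
plt.xlabel("Selling price residual")
plt.show()

# Scatter plot of actual versus predicted selling price: points close to a
# straight line indicate an accurate regression model.
plt.scatter(y_test, predictions)
plt.xlabel("Actual selling price")
plt.ylabel("Predicted selling price")
plt.show()

# Regression metrics: values closer to zero indicate a more accurate model.
print("MAE :", metrics.mean_absolute_error(y_test, predictions))
print("MSE :", metrics.mean_squared_error(y_test, predictions))
print("RMSE:", metrics.mean_squared_error(y_test, predictions) ** 0.5)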


Fig. 3 Dist plot for our model versus testing data

Fig. 4 Scatter plot for our model vs testing data


6 Conclusions and Future Scope 6.1 Conclusion Finally, we can say that the methods used in this paper are appropriate. The accuracy of the project was assessed using a dist plot, which gave a normalized curve indicating high accuracy, and a scatter plot, in which the points form an approximately straight line, again indicating high accuracy. The scoring factor for our project is also close to 0, so we can say that the model predicts the input values well. Random Forest Regression thus gave us a good accuracy value.

6.2 Future Scope For higher performance, we can design a deep learning model, use adaptive learning rates, and train on clusters of data rather than the complete data set. To correct for overfitting in Random Forest, different choices of features and numbers of trees can be examined to test for changes in performance.

References
1. Gegic E, Isakovic B, Keco D, Masetic Z, Kevric J (2019) Car price prediction using machine learning techniques. TEM J 8(1):113
2. Richardson MS (2009) Determinants of used car resale value. PhD thesis, Colorado College
3. Wu JD, Hsu CC, Chen HC (2009) An expert system of price forecasting for used cars using adaptive neuro-fuzzy inference. Expert Syst Appl 36(4):7809–7817
4. Pudaruth S (2014) Predicting the price of used cars using machine learning techniques. Int J Inf Comput Technol 4(7):753–764
5. Vehicle dataset—used cars data from websites. https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho
6. Random Forest Regressor. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
7. VanderPlas J (2016) Python data science handbook: essential tools for working with data. O'Reilly Media, Inc
8. Randomized search CV. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
9. Scoring factors. https://scikit-learn.org/stable/modules/model_evaluation.html

A Study on Air Pollution Over Hyderabad Using Factor Analysis—Kaggle Data N. Vasudha and P. Venkateswara Rao

Abstract The present study aims to analyze the complex behavior of air pollutants over Hyderabad using factor analysis. Kaggle data for five different locations in Hyderabad, consisting of ten major air polluting components over a period of 30 months (January 2018 to June 2020), were used for this purpose. The statistical software IBM SPSS version 23 was employed to identify the contribution levels of the various air pollutant components to air pollution. The significance of the location played a prominent role in grouping the air pollutants into factors. It was noticed that the residential area had fewer factors than the purely industrial and industrial cum residential areas. Keywords Air pollutants · Varimax rotation · Factor analysis

1 Introduction A rapid increase in urbanization and industrialization has been observed in many Indian cities over the last two decades. Hyderabad, in Telangana State, is one of the fast-growing and most urbanized and industrialized cities in the country. It is located on the bank of the Musi River, covering 650 km2 at an altitude of 542 m above sea level. In recent years, the city has experienced striking development in many areas, especially in information technology, pharmaceuticals (Bharat Biotech, Aurobindo, etc.), and other industries. Many world-class companies such as Google, Amazon, etc., have started their business here, which has led to a remarkable expansion of the city in all directions. In this process, several residential communities were developed; high-rise and multi-storied buildings have come up, resulting in an exponential increase in transport vehicles. This growth has had a considerable impact on air pollution, and it is well known that


air pollution has adverse effects on human health and also on the environment. There is an immediate need to address this issue to save the environment and humanity. Factor analysis helps in identifying the interdependency of various components by grouping them into factors. Different air pollutants contribute to air pollution to different degrees depending on the locality, and factor analysis can be used to identify and group them. In this paper, an attempt is made to highlight the major air pollutant contributors at different locations, including residential and industrial areas, in Hyderabad using factor analysis.

2 Literature Survey Air is an essential natural element that has no protective barrier and cannot be isolated; therefore, there is a need to analyze the impact of pollutants on the global, national, and local scales so that measures can be taken to control pollution at all levels [4, 11]. The World Health Organization (WHO) has reported that, during the twenty-first century, the premature deaths of more than two million people each year are attributable to the effects of air pollution. According to the National Institutes of Health, industrial smog and photochemical smog are the two major forms of air pollution that can create health hazards. Generally, pollutants found in urban areas come from short-range sources that include vehicle exhaust, combustion, standby generators, construction, demolition, and kitchen exhaust [4, 6, 9, 12]. Air-borne particulate matter is a complex mixture of organic and inorganic substances that stays in the atmosphere for a long time; as a result, it can easily bypass the filters of the human nose and throat, with a considerable impact on health that includes chronic bronchitis, breathing issues, heart problems, and asthma. Many studies have shown that industrial and vehicular emissions are the two major contributors to atmospheric pollution [2, 15–17]. The air quality over Hyderabad has gradually declined due to the industrial and transport sectors [14]. The source contribution of particulate matter over Hyderabad was quantified using a chemical mass balance receptor model, which reported that more than 60% of the pollution was dominated by vehicular exhaust and road dust [8]. Bhavana and Srinivasa Rao [3] studied the correlation among various air pollutants and meteorological parameters in industrial and commercial areas of Hyderabad using PCA and cluster analysis; the results of both analyses were found to be similar. Factor analysis has also been carried out on PM 2.5 samples to identify the sources of pollution, revealing that the major contributors to the aerosol were coal combustion dust, metallurgic dust, and vehicle exhaust particles [7].

Table 1 Details of localities under the study

S. No.  Name                                  Significance of locality
1       Bollaram industrial area, Hyderabad   Industrial area
2       Central University, Hyderabad         Residential area
3       ICRISAT Patancheru, Hyderabad         Industrial cum residential
4       IDA Pashamylaram, Hyderabad           Industrial area
5       Zoo Park, Hyderabad                   Industrial cum residential

3 Data Details Daily data of ten major air pollutant components (PM 2.5, PM 10, CO, O3, NOx, NH3, SO2, benzene, toluene, and xylene) were downloaded from the Kaggle Web site for a period of 30 months (January 2018 to June 2020) (https://www.kaggle.com/rohanrao/air-quality-data-in-india). Table 1 gives the significance of the localities under the study. Significant outliers were removed from the data before the analysis. The pre-processed data of the afore-said five localities were correlated using Karl Pearson's coefficient of correlation, and the results are reported in Tables 2, 3, 4, 5, and 6, respectively. From the correlation matrices, it is observed that most of the variables are moderately to highly correlated. Factor analysis is a statistical technique used for dimension reduction in which the observed variables are considered to be linear combinations of underlying factors. Initially, the data were tested for the suitability of factor analysis using the KMO and Bartlett's tests. The KMO value was greater than 0.5, which is a reasonable and acceptable value, and the significance of Bartlett's test (Sig. < 0.001) indicates that the correlation matrix is significantly different from the identity matrix, which is consistent with the matrix being factorable.
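The study ran these tests in IBM SPSS; purely as an illustration, the same checks can be reproduced in Python with the factor_analyzer package (an assumption, not the authors' toolchain), where the input file name and column layout are placeholders.

import pandas as pd
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Daily readings for one station, one column per pollutant (PM 2.5, PM 10, NOx,
# NH3, CO, SO2, O3, benzene, toluene, xylene); the file name is a placeholder.
data = pd.read_csv("station_pollutants.csv").dropna()

# Bartlett's test of sphericity: a significance level below 0.001 indicates the
# correlation matrix differs from the identity matrix, i.e., the data are factorable.
chi_square, p_value = calculate_bartlett_sphericity(data)

# Kaiser-Meyer-Olkin measure: an overall KMO above 0.5 is treated as acceptable
# sampling adequacy for factor analysis.
kmo_per_variable, kmo_overall = calculate_kmo(data)

print(f"Bartlett chi-square = {chi_square:.2f}, p-value = {p_value:.4g}")
print(f"Overall KMO = {kmo_overall:.3f}")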

4 Methodology In factor analysis, we assume that each observed variable is generated by a small number of underlying factors. Suppose there are $p$ variables and $m < p$ factors, represented by $f_1, f_2, \ldots, f_m$. Then, for the variables $y_i$, $i = 1, 2, \ldots, p$, the model is

$y_1 - \mu_1 = \lambda_{11} f_1 + \lambda_{12} f_2 + \cdots + \lambda_{1m} f_m + \varepsilon_1$
$y_2 - \mu_2 = \lambda_{21} f_1 + \lambda_{22} f_2 + \cdots + \lambda_{2m} f_m + \varepsilon_2$

Table 2 Correlation matrix of air pollutants at Bollaram (PM 2.5, PM 10, NOx, NH3, CO, SO2, O3, benzene, toluene, xylene). ** Correlation is significant at the 0.01 level (2-tailed)


Table 3 Correlation matrix of air pollutants at HCU

          PM 2.5   PM 10    NOx      NH3      CO       SO2      O3       Benzene  Toluene  Xylene
PM 2.5    1        0.902**  0.688**  0.550**  0.498**  0.513**  0.494**  0.524**  0.462**  0.461**
PM 10              1        0.709**  0.521**  0.503**  0.600**  0.581**  0.583**  0.552**  0.508**
NOx                         1        0.579**  0.522**  0.553**  0.344**  0.658**  0.619**  0.608**
NH3                                  1        0.193**  0.258**  0.281**  0.354**  0.307**  0.324**
CO                                            1        0.301**  0.506**  0.576**  0.524**  0.545**
SO2                                                    1        0.402**  0.564**  0.570**  0.443**
O3                                                              1        0.575**  0.513**  0.559**
Benzene                                                                  1        0.944**  0.898**
Toluene                                                                           1        0.835**
Xylene                                                                                     1
** Correlation is significant at the 0.01 level (2-tailed)

$\vdots$
$y_p - \mu_p = \lambda_{p1} f_1 + \lambda_{p2} f_2 + \cdots + \lambda_{pm} f_m + \varepsilon_p$

The factor loading $\lambda_{ij}$ indicates the importance of factor $j$ to variable $i$. Although the factors are unknown, they are also considered random variables, and in the model we have $E(f_i) = 0$, $\operatorname{Var}(f_i) = 1$, and $\operatorname{Cov}(f_i, f_j) = 0$ for $i \neq j$, so the factors are assumed to be independent. The model also assumes $E(\varepsilon_i) = 0$ and $\operatorname{Var}(\varepsilon_i) = \psi_i$; in other words, the error terms differ for each variable, and it is assumed that $\operatorname{Cov}(\varepsilon_i, f_j) = 0$ and $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$. Hence, the variance of each variable $y_i$, $i = 1, 2, \ldots, p$, is

$\operatorname{Var}(y_i) = \lambda_{i1}^2 + \lambda_{i2}^2 + \cdots + \lambda_{im}^2 + \psi_i$
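The factor extraction and Varimax rotation were carried out in SPSS; a rough Python equivalent using the factor_analyzer package is sketched below purely for illustration, with a placeholder file name and the number of factors fixed at three.

import pandas as pd
from factor_analyzer import FactorAnalyzer

# Pre-processed pollutant data for one station (placeholder file name);
# only the ten numeric pollutant columns should be present.
data = pd.read_csv("station_pollutants.csv").dropna()

# Extract three factors with Varimax rotation (the study retained three factors
# at four stations and two at HCU).
fa = FactorAnalyzer(n_factors=3, rotation="varimax")
fa.fit(data)

# Rotated factor loadings lambda_ij: rows are pollutants, columns are factors.
loadings = pd.DataFrame(fa.loadings_, index=data.columns,
                        columns=["Factor 1", "Factor 2", "Factor 3"])
print(loadings.round(3))

# Variance explained by each rotated factor: SS loadings, proportion of variance,
# and cumulative proportion.
variance = pd.DataFrame(list(fa.get_factor_variance()),
                        index=["SS loadings", "Proportion", "Cumulative"],
                        columns=["Factor 1", "Factor 2", "Factor 3"])
print(variance.round(3))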

5 Results and Discussions Factor analysis using the Varimax rotation method was applied for data reduction. The rotated factor matrices of the five stations are consolidated in Table 7. From Table 7, three factors are extracted from the 10 pollutant components at Bollaram, ICRISAT, Pashamylaram, and Zoo Park, whereas only two factors are extracted at HCU. The cumulative variability contributed by the 10 pollutant components was 69.45% at Bollaram, 70.32% at HCU, and 67.61% at ICRISAT, while at Pashamylaram and Zoo Park it amounted to 70.13% and 77.74% of the total variability, respectively. The cumulative percentage of variation is well preserved by the rotation; however, the spread of the variation is distributed more evenly over the factors. The resulting factors are summarized in Table 8. Bollaram houses 26 bulk drug and pharmaceutical industries, of which 17 are categorized as highly polluting according to a report released by the Telangana Pollution Control Board in 2019 [1]. 25% of urban ambient air pollution

Table 4 Correlation matrix of air pollutants at ICRISAT (PM 2.5, PM 10, NOx, NH3, CO, SO2, O3, benzene, toluene, xylene). ** Correlation is significant at the 0.01 level (2-tailed)

Table 5 Correlation matrix of air pollutants at Pashamylaram (PM 2.5, PM 10, NOx, NH3, CO, SO2, O3, benzene, toluene, xylene). ** Correlation is significant at the 0.01 level (2-tailed)


Table 6 Correlation matrix of air pollutants at Zoo Park

          PM 2.5   PM 10    NOx      NH3      CO       SO2      O3       Benzene  Toluene  Xylene
PM 2.5    1        0.930**  0.715**  0.394**  0.695**  0.309**  0.212**  0.414**  0.317**  0.200**
PM 10              1        0.748**  0.432**  0.723**  0.472**  0.280**  0.481**  0.383**  0.244**
NOx                         1        0.486**  0.700**  0.439**  0.172**  0.402**  0.371**  0.184**
NH3                                  1        0.538**  0.516**  0.298**  0.258**  0.281**  0.226**
CO                                            1        0.574**  0.195**  0.366**  0.320**  0.228**
SO2                                                    1        0.244**  0.371**  0.377**  0.314**
O3                                                              1        0.247**  0.230**  0.134**
Benzene                                                                  1        0.915**  0.768**
Toluene                                                                           1        0.738**
Xylene                                                                                     1
** Correlation is significant at the 0.01 level (2-tailed)

from particulate matter is contributed by traffic and 15% by industrial activities. PM 2.5, PM 10, CO, and O3 are the prime contributors to the variability, amounting to 30.44% of the total, indicating that industries and traffic are the major contributors to air pollution in Bollaram. Benzene, toluene, and xylene form a sub-group of the family of volatile organic compounds (VOCs). The VOCs are the second-highest contributors, amounting to 22.71% of the total variability. It was observed that NH3, NOx, and SO2 contributed 16.3% of the variance to the pollution. The air pollution at HCU has risen due to its proximity to the IT corridor. The major variability of 40.02% at HCU is contributed by the VOCs, CO, SO2, and O3; VOCs are the most common pollutants found in urban residential areas [10]. The variability of PM 2.5, PM 10, NOx, and NH3 is 30.3% of the total, owing to construction activity and vehicular pollution. At ICRISAT, PM 2.5, PM 10, NOx, and CO are the highest contributors, amounting to 34.62% of the variance; pollution generated by vehicular traffic and industrial emissions is the major contributor here. Also, being a residential area, the second-highest contributor to atmospheric pollution is the VOCs, amounting to 19.62% of the total variance. These large amounts of VOCs prevent atmospheric ozone from decomposing, and the presence of a large amount of O3 along with NH3 and SO2 contributes 14.37% of the variance. Pashamylaram is a hub of chemical and pharmaceutical industries. Air pollutants generated by pharmaceutical industries predominantly constitute VOCs together with sulfur dioxide, nitrogen oxides, etc. [5], and it was observed that the VOCs, NOx, and SO2 contribute 37.88% of the total variation. Major chemical pollutants are released into the air in the form of smog with air-borne particulate matter [13]; this is evident from PM 2.5 and PM 10, along with CO and O3, contributing 20.02% of the total variance. It can also be noted that the contribution of NH3 is 12.23% of the total variability. At Zoo Park, 35.60% of the total variability is contributed by the air pollutants PM 2.5, PM 10, NOx, and CO. This large contribution can be attributed to Zoo Park being an industrial cum residential area. Due to the dense population, the VOCs are the

Table 7 Rotated factor matrix of the five stations (Bollaram, HCU, ICRISAT, Pashamylaram, and Zoo Park), giving the rotated factor loadings of the ten pollutants together with the percentage of variance and the cumulative percentage explained by each extracted factor


Table 8 Summary of resulting factors

Station        Factor 1                                Factor 2                    Factor 3
Bollaram       PM 2.5, PM 10, CO, O3                   Benzene, toluene, xylene    NH3, NOx, SO2
HCU            CO, SO2, O3, benzene, toluene, xylene   PM 2.5, PM 10, NOx, NH3     -
ICRISAT        PM 2.5, PM 10, NOx, CO                  Benzene, toluene, xylene    NH3, SO2, O3
Pashamylaram   NOx, SO2, benzene, toluene, xylene      PM 2.5, PM 10, CO, O3       NH3
Zoo Park       PM 2.5, PM 10, NOx, CO                  Benzene, toluene, xylene    NH3, SO2, O3

second-highest contributors amounting to 26.62% of the total variance. NH3 , SO2, and O3 contribute 16.06% of the total variability.

6 Conclusions Factor analysis succeeded in grouping the ten key air pollutants into at most three factors. The nature of the locality decides which variables are grouped into which factors. This study helps to identify, locality-wise, the prominent air pollutant contributors that are responsible for health hazards. The purely industrial areas, Bollaram and Pashamylaram, are found to have three factors each. Due to the presence of pharmaceutical industries at Pashamylaram, the contribution of NOx, SO2, and VOCs is much more significant there than at Bollaram, where particulate matter contributes more. Similar air pollutants are grouped at the industrial cum residential areas (Zoo Park and ICRISAT). At the residential area HCU, only two factors were found sufficient to group all the air pollutants; the major contributors are the VOCs, which may be due to residential activities. Acknowledgements The authors would like to thank the management of Vasavi College of Engineering, Hyderabad, India. The authors acknowledge the Kaggle database for making the data available to users.

References
1. Action plan for the restoration of environmental qualities with regard to the identified polluted industrial cluster of Patancheru-Bollaram. https://cpcb.nic.in/industrial_pollution/New_Action_Plans/CEPI_Action%20PlanPatancheru-Bollaram.pdf
2. Ajay Kumar MC, Vinay Kumar P, Venkateswara Rao P (2020) Temporal variations of PM2.5 and PM10 concentration over Hyderabad. Nat Environ Pollut Technol 19:421–428
3. Bhavana H, Srinivasa Rao GVR (2020) A study on air pollution trends in Sanathnagar, Hyderabad using principle component and cluster analysis. Int J Sci Res (IJSR) 9:844–847
4. Cichowicz R, Wielgosiński G (2015) Effect of meteorological conditions and building location on CO2 concentration in the university campus. Ecol Chem Eng S 22(4):513–525


5. Yaqub G, Hamid A, Iqbal S (2012) Pollutants generated from pharmaceutical processes and microwave assisted synthesis as possible solution for their reduction—a mini review. Nat Environ Pollut Technol 11(1):29–36
6. Gurney KR, Razlivanov I, Song Y, Zhou Y, Benes B, Massih MA (2012) Quantification of fossil fuel CO2 emissions on the building/street scale for a large U.S. city. Environ Sci Technol 46(21):12194–12202
7. Fang J, Fan JM, Lin Q, Wang YY, He X, Shen XD, Chen DL (2017) Characteristics of airborne lead in Hangzhou, Southeast China: concentrations, species, and source contributions based on Pb isotope ratios and synchrotron X-ray fluorescence based factor analysis. https://www.sciencedirect.com/science/article/pii/S1309104217305093
8. Venkateswara Rao K, Raveendhar N, Swamy AVVS (2016) Status of air pollution in Hyderabad City, Telangana State. Int J Innov Res Sci Eng Technol 5(4):4769–4780
9. Lelieveld J, Evans JS, Fnais M, Giannadaki D, Pozzer A (2015) The contribution of outdoor air pollution sources to premature mortality on a global scale. Nature 525:367–371
10. Marć M, Namieśnik J, Zabiegała B (2014) BTEX concentration levels in urban air in the area of the Tri-City agglomeration (Gdansk, Gdynia, Sopot), Poland. Air Qual Atmos Health 7:489–504
11. Ménard R, Deshaies-Jacques M, Gasset N (2016) A comparison of correlation-length estimation methods for the objective analysis of surface pollutants at Environment and Climate Change Canada. J Air Waste Manag Assoc 66:9874–9895
12. Nemitz E, Hargreaves KJ, McDonald AG, Dorsey JR, Fowler D (2002) Micrometeorological measurements of the urban heat budget and CO2 emissions on a city scale. Environ Sci Technol 36(14):3139–3146
13. Naidu R, Biswas B, Willett IR, Cribb J, Kumar Singh B, Paul Nathanail C, Coulon F, Semple KT, Jonesi KC, Barclay A, Aitken RJ (2021) Chemical pollution: a growing peril and potential catastrophic risk to humanity. Environ Int 156. https://doi.org/10.1016/j.envint.2021.106616
14. Guttikunda SK, Ramani VK (2014) Source emissions and health impacts of urban air pollution in Hyderabad, India. Air Qual Atmos Health 7:195–207. https://doi.org/10.1007/s11869013-0221
15. Singh V, Biswal A, Kesarkar AP, Mor S, Ravindra K (2020) High resolution vehicular PM10 emissions over megacity Delhi: relative contributions of exhaust and non-exhaust sources. Sci Total Environ 699:134273
16. Zhang K, Zhao C, Fan H, Yang Y, Sun Y (2019) Toward understanding the differences of PM2.5 characteristics among five China urban cities. Asia-Pacific J Atmos Sci 56:493–502
17. Zhao C, Wang Y, Shi X, Zhang D, Wang C, Jiang JH, Zhang Q, Fan H (2019) Estimating the contribution of local primary emissions to particulate pollution using high-density station observations. J Geophys Res Atmos 124:1648–1661

A Comparative Study of Hierarchical Risk Parity Portfolio and Eigen Portfolio on the NIFTY 50 Stocks Jaydip Sen and Abhishek Dutta

Abstract Portfolio optimization has been an area of research that has attracted a lot of attention from researchers and financial analysts. Designing an optimum portfolio is a complex task since it not only involves accurate forecasting of future stock returns and risks but also needs to optimize them. This paper presents a systematic approach to portfolio optimization using two approaches, the hierarchical risk parity algorithm and the Eigen portfolio on seven sectors of the Indian stock market. The portfolios are built following the two approaches on historical stock prices from January 1, 2016, to December 31, 2020. The portfolio performances are evaluated on the test data from January 1, 2021, to November 1, 2021. The backtesting results of the portfolios indicate that the performance of the HRP portfolio is superior to that of its Eigen counterpart on both training and the test data for the majority of the sectors studied. Keywords Portfolio optimization · Minimum variance portfolio · Hierarchical risk parity algorithm · Eigen portfolio · Principal component analysis · Return · Risk · Sharpe ratio · Prediction accuracy · Backtesting

1 Introduction The design of optimized portfolios has remained a research topic of broad and intense interest among the researchers of quantitative and statistical finance for a long time. An optimum portfolio allocates the weights to a set of capital assets in a way that optimizes the return and risk of those assets. Markowitz in his seminal work proposed the mean-variance optimization approach which is based on the mean and covariance matrix of returns [1]. The algorithm, known as the critical line algorithm (CLA), despite the elegance in its theoretical framework, has some major limitations. One of the major problems is the adverse effects of the estimation errors in its expected returns and covariance matrix on the performance of the portfolio.


The hierarchical risk parity (HRP) and Eigen portfolios are two well-known approaches of portfolio design that attempt to address three major shortcomings of quadratic optimization methods which are particularly relevant to the CLA [2, 3]. These problems are instability, concentration, and under-performance. Unlike the CLA, the HRP algorithm does not require the covariance matrix of the return values to be invertible. The HRP is capable of delivering good results even if the covariance matrix is ill-degenerated or singular, which is an impossibility for a quadratic optimizer. On the other hand, the Eigen portfolio design exploits the theory of principal components to construct orthogonal components out of the stock return values and use the values of the component loading of the stocks on the first principal component as their respective weights. Interestingly, even though CLA’s objective is to minimize the variance, both HRP and Eigen portfolios are proven to have a higher likelihood of yielding lower out-of-sample variance than the CLA. The major weakness of the CLA algorithm is that a small deviation in the forecasted future returns can make the CLA deliver widely divergent portfolios. Given the fact that future returns cannot be forecasted with sufficient accuracy, some researchers have proposed risk-based asset allocation using the covariance matrix of the returns. However, this approach brings in another problem of instability. The instability arises because the quadratic programming methods require the inversion of a covariance matrix whose all eigenvalues must be positive. This inversion is prone to large errors when the covariance matrix is numerically ill-conditioned, i.e., when it has a high condition number [4]. The HRP and Eigen portfolios are two among the new portfolio approaches that address the pitfalls of the CLA using techniques of machine learning and graph theory [2]. While HRP exploits the features of the covariance matrix without the requirement of its invertibility or positive-definiteness and works effectively on even a singular covariance matrix of returns, the Eigen portfolio leverages the orthogonality properties of the principal components for explaining the variances in stock return to arrive at a better and more robust estimation of future stock returns and their volatilities [3]. Despite being recognized as two approaches that outperform the CLA algorithm, to the best of our knowledge, no study has been carried out so far to compare the performances of the HRP and the Eigen portfolios on Indian stocks. This paper presents a comparative analysis of the performances of the HRP and the Eigen portfolios on some important stock from selected seven sectors listed in the National Stock Exchange (NSE) of India. Based on the report of the NSE on October 29, 2021, the ten most significant stocks of six sectors and the 50 stocks included in the NIFTY 50 are first identified [5]. Portfolios are built using the HRP and Eigen approaches for the chosen sectors using the historical prices of the stocks from January 1, 2016, to December 31, 2020. The portfolios are backtested on the in-sample data of the stock prices from January 1, 2016, to December 31, 2020, and also on the out-of-sample data of stock prices from January 1, 2021, to November 1, 2021. Extensive results on the performance of the backtesting of the portfolios are analyzed to identify the better-performing algorithm for portfolio design. The main contribution of the current work is threefold. 
First, it presents two different methods of designing robust portfolios, the HRP algorithm, and the Eigen


approach. These portfolio design approaches are applied to seven critical sectors of stocks of the NSE. The results can be used as a guide to investors in the stock market for making profitable investments. Second, a backtesting method is proposed for evaluating the performances of the algorithms based on the daily returns yielded by the portfolios and their associated volatilities (i.e., risks). Since the backtesting is done both on the training and the test data of the stock prices, the work has identified the more efficient algorithm both on the in-sample data and the out-of-sample data. Hence, a robust framework for evaluating different portfolios is demonstrated. Third, the returns of the portfolios on the seven sectors on the test data highlight the current profitability of investment and the volatilities of the sectors studied in this work. This information can be useful for investors. The paper is organized as follows. In Sect. 2, some existing works on portfolio design and stock price prediction are discussed briefly. Section 3 highlights the methodology followed. Section 4 presents the results of the two portfolio design approaches on the seven sectors. Section 5 concludes the paper.

2 Related Work Due to the challenging nature of the problems and their impact on real-world applications, several propositions exist in the literature for stock price prediction and robust portfolio design for optimizing the returns and risk of a portfolio. The use of predictive models based on learning algorithms and deep neural net architectures for stock price prediction is quite common [6–13]. Hybrid models have also been demonstrated that combine learning-based systems with the sentiments in the unstructured non-numeric contents on the social web [14, 15]. The use of multi-objective optimization, principal component analysis, and metaheuristics has also been proposed by some researchers for portfolio design [16–21]. Estimating the volatility in future stock prices using GARCH has also been proposed in some work [22]. The current work presents two methods, the Eigen portfolio approach and the HRP method, to introduce robustness while maximizing the portfolio returns for seven sectors of the NSE of India. Based on the past prices of the stocks from January 2016 to December 2020, portfolios are designed using the Eigen and the HRP algorithms for each sector. The backtesting of the portfolios is carried out on the in-sample data of stock prices from January 2016 to December 31, 2020, and also on the out-of-sample data from January 1, 2021, to August 26, 2021. The backtesting is done on the return, volatility, and the Sharpe ratio of the portfolios for each sector.

3 Data and Methodology In this section, the six-step approach adopted in designing the proposed system is discussed in detail. The six steps are as follows.


(1) Choice of the sectors: Seven important sectors of the NSE are first chosen. The selected sectors are (i) auto, (ii) consumer durable, (iii) financial services, (iv) health care, (v) information technology, (vi) oil and gas, and (vii) NIFTY 50. NIFTY 50 contains the 50 most critical stocks from several sectors of the Indian stock market. For the remaining six sectors, the top ten stocks are identified based on their contributions to the computation of the overall sector index to which they belong, as per the report published by the NSE on October 29, 2021 [5].
(2) Data acquisition: The prices of the ten most critical stocks of the six sectors and the 50 stocks listed in the NIFTY 50 are extracted using the DataReader function of the data sub-module of the pandas_datareader module in Python. The stock prices are extracted from Yahoo Finance for the period from January 1, 2016, to November 1, 2021. The stock price data from January 1, 2016, to December 31, 2020, are used for training the portfolios, while the portfolios are tested on the data from January 1, 2021, to November 1, 2021. Among all the features in the stock data, the variable close is chosen for the univariate analysis.
(3) Derivation of the return and volatility: The percentage changes in the close values on successive days represent the daily returns, which are computed with the pct_change function of Python. Based on the daily returns, the daily and yearly volatilities of the stocks of every sector are computed. Assuming that there are 250 operational days in a calendar year, the annual volatility values for the stocks are obtained by multiplying the daily volatilities by the square root of 250.
(4) Designing the Eigen portfolios: Designing Eigen portfolios involves principal component analysis (PCA), a well-known dimensionality-reduction method based on unsupervised learning. PCA retains the intrinsic variance in the data while reducing the number of dimensions. The principal components of the training data of the stock prices are determined using the PCA function defined in the sklearn library of Python. To retain 80% of the variance in the original stock price data, a minimum of five components is found to be needed for the ten stocks. The components generated by the PCA function are orthogonal to each other, and their power to explain the variance in the data decreases with a higher component number; in other words, the first component explains the maximum percentage of the total variance. The component loadings of the five principal components on each of the ten stocks provide the weights allocated to the stocks in building the candidate Eigen portfolios. Finally, the candidate yielding the maximum Sharpe ratio is selected as the best Eigen portfolio. A Python function iterating over a loop is used for deriving the weights assigned to the five principal components and for identifying the best candidate Eigen portfolio [16] (a condensed sketch of steps (2) to (4) is given after this list).
(5) Hierarchical risk parity portfolio design: As an alternative to the CLA algorithm for portfolio design, hierarchical risk parity (HRP) algorithm-based portfolios are designed for the seven sectors. The HRP algorithm works in


three phases: (a) tree clustering, (b) quasi-diagonalization, and (c) recursive bisection. These steps are briefly described in the following (a compact sketch of the three phases is also given after this list).
Tree Clustering: The tree clustering used in the HRP algorithm is an agglomerative clustering algorithm. To implement it, a hierarchy class is first created in Python. The hierarchy class contains a dendrogram method that receives the value returned by a method called linkage defined in the same class. The linkage method receives the dataset after pre-processing and transformation and computes the distances between stocks based on their return values. There are several options for computing the distance; however, the ward distance is a good choice since it minimizes the variance of the distance between two clusters in the presence of high volatility in the stock return values, and it is the distance used in this work. The linkage method performs the clustering and returns a list of the clusters formed. The computation of the linkages is followed by the visualization of the clusters through a dendrogram. In the dendrogram, the leaves represent the individual stocks, while the root depicts the cluster containing all the stocks. The distance between the clusters is represented along the y-axis; longer arms indicate less correlated clusters and vice versa.
Quasi-Diagonalization: In this step, the rows and the columns of the covariance matrix of the return values of the stocks are reorganized in such a way that the largest values lie along the diagonal. Without requiring a change of basis of the covariance matrix, the quasi-diagonalization yields a very important property of the matrix: the assets (i.e., stocks) with similar return values are placed close together, while disparate assets are put at a far distance. The working principle of the algorithm is as follows. Since each row of the linkage matrix merges two branches into one, the clusters (C_{N-1}, 1) and (C_{N-2}, 2) are replaced with their constituents recursively, until there are no more clusters to merge. This recursive merging of clusters preserves the original order of the clusters [4]. The output of the algorithm is a sorted list of the original stocks (as they were before the clustering).
Recursive Bisection: The quasi-diagonalization step transforms the covariance matrix into a quasi-diagonal form. It is proven mathematically that allocating weights to the assets in inverse proportion to their variance is an optimal allocation for a quasi-diagonal matrix [4]. This allocation may be done in two different ways. In the bottom-up approach, the variance of a contiguous subset of stocks is computed as the variance of an inverse-variance allocation of the composite cluster. In the alternative top-down approach, the allocation between two adjacent subsets of stocks is made in inverse proportion to their aggregated variances. In the current implementation, the top-down approach is followed. A Python function computeIVP computes the inverse-variance portfolio from the variances of two clusters given as its input. The variance of a cluster is computed using another Python function called clusterVar. The output of the clusterVar function is used as the input to another Python function called recBisect, which computes the final weights allocated to the individual stocks based on the recursive bisection algorithm. The HRP algorithm performs the weight allocation to n assets in the base case in time T(n) = O(log2 n), while its worst-case complexity is T(n) = O(n). Unlike the MVP, which is an approximate algorithm, the HRP is an exact and deterministic algorithm. The HRP portfolios for all the sectors are built on January 1, 2021, based on the training data from January 1, 2016, to December 31, 2020.
(6) Backtesting the portfolios on the training and the test data: In the final step, the Eigen and the HRP portfolios for each sector are backtested on the training and the test datasets. For backtesting, the daily return values are computed for each sector for both portfolios. For the purpose of comparison, the Sharpe ratios and the aggregate volatility values for each sector are also computed on both the training and the test data. For each sector, the portfolios that perform better on the training and the test data are identified; the portfolio that performs better on the test data for a given sector is considered to have exhibited superior performance for that sector. While it is well known that both HRP and Eigen portfolios usually outperform the CLA portfolio on the test data, a comparison of their performances on the training and the test data is nevertheless quite interesting.
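For steps (2) to (4), a condensed Python sketch is given below. It is only an illustration of the workflow described above, not the authors' code: the ticker symbols are illustrative examples, a risk-free rate of zero is assumed for the Sharpe ratio, and the Eigen weights are taken as the normalized loadings of a single principal component (the paper iterates over the candidates and keeps the one with the highest Sharpe ratio).

import numpy as np
import pandas as pd
from pandas_datareader import data as web
from sklearn.decomposition import PCA

# Illustrative tickers only (a subset of the auto sector); the paper uses the
# top-10 stocks of each chosen sector.
tickers = ["MARUTI.NS", "M&M.NS", "TATAMOTORS.NS", "BAJAJ-AUTO.NS", "HEROMOTOCO.NS"]

# Step (2): extract close prices from Yahoo Finance for the training window.
prices = web.DataReader(tickers, "yahoo", "2016-01-01", "2020-12-31")["Close"]

# Step (3): daily returns and annualized volatility (250 trading days assumed).
returns = prices.pct_change().dropna()
annual_volatility = returns.std() * np.sqrt(250)

# Step (4): principal components of the standardized returns; the loadings of a
# component on the stocks, normalized to sum to one, give one candidate
# Eigen portfolio (the paper keeps the candidate with the highest Sharpe ratio).
pca = PCA(n_components=5)
pca.fit((returns - returns.mean()) / returns.std())

def eigen_weights(component_index):
    """Normalize the loadings of one principal component into portfolio weights."""
    loadings = pca.components_[component_index]
    return pd.Series(loadings / loadings.sum(), index=returns.columns)

# First-component candidate and its in-sample annualized Sharpe ratio (risk-free
# rate assumed to be zero for simplicity).
weights = eigen_weights(0)
portfolio_returns = returns @ weights
sharpe_ratio = portfolio_returns.mean() / portfolio_returns.std() * np.sqrt(250)
print(weights.round(4))
print(f"In-sample Sharpe ratio: {sharpe_ratio:.3f}")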
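The three HRP phases in step (5) can be sketched compactly as follows, based on López de Prado's published formulation of the algorithm. This is a simplified stand-in for the computeIVP, clusterVar, and recBisect functions described above rather than the exact implementation used in the paper, and it assumes the returns DataFrame from the previous sketch.

import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import squareform

def correlation_distance(corr):
    """Distance d_ij = sqrt((1 - rho_ij) / 2) between stocks, used for clustering."""
    return np.sqrt(((1.0 - corr) / 2.0).clip(lower=0.0))

def inverse_variance_weights(cov):
    """Weights proportional to the inverse of each asset's variance."""
    ivp = 1.0 / np.diag(cov)
    return ivp / ivp.sum()

def cluster_variance(cov, items):
    """Variance of a cluster under an inverse-variance allocation."""
    sub_cov = cov.loc[items, items].values
    w = inverse_variance_weights(sub_cov)
    return float(w @ sub_cov @ w)

def recursive_bisection(cov, sorted_items):
    """Phase 3: top-down weight allocation in inverse proportion to cluster variances."""
    weights = pd.Series(1.0, index=sorted_items)
    clusters = [sorted_items]
    while clusters:
        # Split every cluster of more than one stock into two halves.
        clusters = [c[i:j] for c in clusters
                    for i, j in ((0, len(c) // 2), (len(c) // 2, len(c)))
                    if len(c) > 1]
        for k in range(0, len(clusters), 2):
            left, right = clusters[k], clusters[k + 1]
            alpha = 1.0 - cluster_variance(cov, left) / (
                cluster_variance(cov, left) + cluster_variance(cov, right))
            weights[left] *= alpha
            weights[right] *= 1.0 - alpha
    return weights

def hrp_weights(returns):
    """Compute HRP weights from a DataFrame of daily stock returns."""
    cov, corr = returns.cov(), returns.corr()
    dist = correlation_distance(corr)
    link = sch.linkage(squareform(dist.values, checks=False), method="ward")  # phase 1
    order = [corr.columns[i] for i in sch.leaves_list(link)]                  # phase 2
    return recursive_bisection(cov, order)                                    # phase 3

# Usage with the 'returns' DataFrame from the previous sketch:
# weights = hrp_weights(returns)   # weights sum to 1 and are all non-negative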

4 Performance Evaluation The detailed results on the performance of the portfolios and their analysis are presented in this section. The seven sectors of NSE which are selected for the analysis are as follows: (i) auto, (ii) consumer durable, (iii) financial services, (iv) healthcare, (v) IT, (vi) oil and gas, and (vii) NIFTY 50. The Eigen and the HRP portfolios are implemented in Python programming language, and the portfolios are trained and tested on the Google Colab platform. In the following sub-sections, the detailed results of the performances of the two portfolios on the seven sectors are presented.

4.1 The Auto Sector Portfolios The ten most significant stocks and their respective contributions to the computation of the auto sector index according to the report published by the NSE on October 29, 2021, are as follows: Maruti Suzuki: 19.98, Mahindra and Mahindra: 15.33, Tata Motors: 11.36, Bajaj Auto: 10.75, Hero MotoCorp: 7.73, Eicher Motors: 7.60, Bharat Forge: 4.18, Balkrishna Ind.: 4.15, Ashok Leyland: 4.12, and MRF: 3.58 [5]. The dendrogram of the clustering of the stocks of the auto sector is shown in Fig. 1. The y-axis of the dendrogram depicts the ward linkage values, where a longer length of the arms signifies a higher distance, and hence, less compactness in the cluster formed. For example, the cluster containing the Bajaj Auto and Hero MotoCorp


Fig. 1 The agglomerative clustering of the auto sector stocks—the dendrogram formed on the training data from January 1, 2016, to December 31, 2020

stocks is the most compact one, while the one containing Balkrishna Ind. and MRF is the most heterogeneous. It is evident that the tree clustering for the HRP has created four clusters for the auto sector for the allocation of weights. These four clusters contain the following stocks: (1) Bajaj Auto and Hero MotoCorp, (2) Eicher Motors, Maruti Suzuki, and Mahindra and Mahindra, (3) Tata Motors, Bharat Forge, and Ashok Leyland, and (4) Balkrishna Industries and MRF. Figure 2 depicts the weight allocations by the Eigen and the HRP portfolios for the auto sector. It is clear that both portfolios have attempted to achieve diversification in the portfolio by uniformly allocating the weights to the stocks. Figure 3a, b shows the results of backtesting for the training and the test data, respectively. These graphs plot the daily returns along the y-axis. The summary of the backtesting results is presented in Table 1. While the HRP portfolio is found to have produced a higher Sharpe ratio and lower volatility for the training (i.e., in-sample) data, the EIGEN portfolio has outperformed it on the test (i.e., out-of-sample) data producing a higher value of Sharpe ratio, albeit with slightly higher volatility.

Fig. 2 The allocation of weights to the auto sector stocks by the EIGEN and the HRP portfolios based on stock price data from January 1, 2016, to December 31, 2020


Fig. 3 The returns of the EIGEN and the HRP portfolios for the auto sector stocks on the (a) training data from January 1, 2016, to December 31, 2020, and (b) on the test data from January 1, 2021, to November 1, 2021

Table 1 Performance of the auto sector portfolios

Portfolio   Training performance              Test performance
            Volatility    Sharpe ratio        Volatility    Sharpe ratio
EIGEN       0.240137      0.500069            0.225286      1.479449
HRP         0.226378      0.620970            0.207317      1.204434

4.2 The Consumer Durable Sector Portfolios The top ten stocks and their respective contributions to the sectoral index of the consumer durable sector as per the NSE's report on October 29, 2021, are as follows: Titan Company: 35.66, Havells India: 11.54, Voltas: 10.18, Crompton Greaves Consumer Electricals: 10.03, Dixon Technologies India: 6.52, Bata India: 4.36, Kajaria Ceramics: 3.69, Relaxo Footwears: 3.50, Rajesh Exports: 3.16, and Whirlpool of India: 2.56 [5]. Figure 4 depicts the clustering of the stocks by the HRP portfolio in which six clusters are apparent. These six clusters consist of the following: (i) Crompton and Dixon, (ii) Relaxo, (iii) Titan and Britannia, (iv) Havells and Voltas, (v) Whirlpool and Kajaria, and (vi) Rajesh Exports. Figure 5 shows the

Fig. 4 The agglomerative clustering of the consumer durable sector stocks—the dendrogram formed on the training data from January 1, 2016, to December 31, 2020


Fig. 5 The allocation of weights to the consumer durable sector stocks by the EIGEN and the HRP portfolios based on stock price data from January 1, 2016, to December 31, 2020

weight allocation by the two portfolios, while the returns on the training and test data are shown in Fig. 6. The Sharpe ratio of the HRP portfolio is higher for the test data as observed in Table 2.

Fig. 6 The returns of the EIGEN and the HRP portfolios for the consumer durable sector stocks on the (a) training data from January 1, 2016, to December 31, 2020, and (b) on the test data from January 1, 2021, to November 1, 2021

Table 2 Performance of the consumer durable sector portfolios

Portfolio   Training performance              Test performance
            Volatility    Sharpe ratio        Volatility    Sharpe ratio
EIGEN       0.205992      1.253044            0.172882      2.731261
HRP         0.184684      1.205891            0.151220      3.019343


4.3 The Financial Services Sector Portfolios The top ten stocks and their respective contributions to the sectoral index of the financial services sector as per the NSE's report on October 29, 2021, are as follows: HDFC Bank: 22.49, ICICI Bank: 18.06, Housing Development Finance Corporation: 16.68, Kotak Mahindra Bank: 9.68, Bajaj Finance: 6.38, State Bank of India: 6.26, Axis Bank: 6.21, Bajaj Finserv: 3.50, HDFC Life Insurance Company: 2.06, and SBI Life Insurance Company: 1.64 [5]. The clustering done by the HRP on the stocks of this sector is shown in Fig. 7, in which the clusters consist of the following stocks: (i) Axis Bank, (ii) ICICI Bank and SBI, (iii) Bajaj Fin and Bajaj Finserv, (iv) Kotak Bank, and (v) HDFC Bank and HDFC. The allocation of the weights by the portfolios is shown in Fig. 8, while the returns yielded are shown in Fig. 9. The results presented in Table 3 indicate that the EIGEN portfolio has yielded a higher Sharpe ratio for the test data. However, on the training data, the HRP portfolio produced a higher Sharpe ratio. The volatilities of the HRP portfolios are lower for both datasets.

Fig. 7 The agglomerative clustering of the financial services sector stocks—the dendrogram formed on the training data from January 1, 2016, to December 31, 2020

Fig. 8 The allocation of weights to the financial services sector stocks by the EIGEN and the HRP portfolios based on stock price data from January 1, 2016, to December 31, 2020


Fig. 9 The returns of the EIGEN and the HRP portfolios for the financial services sector stocks on the (a) training data from January 1, 2016, to December 31, 2020, and (b) on the test data from January 1, 2021, to November 1, 2021

Table 3 Performance of the financial services sector portfolios

Portfolio   Training performance              Test performance
            Volatility    Sharpe ratio        Volatility    Sharpe ratio
EIGEN       0.262314      0.973574            0.236104      1.908093
HRP         0.242130      1.024041            0.225178      1.637399

4.4 The Healthcare Sector Portfolios The top ten stocks and their respective contributions to the sectoral index of the healthcare sector as per the NSE’s report on October 29, 2021, are as follows: Sun Pharmaceutical Industries: 17.88, Divi’s Laboratories: 13.67, Dr. Reddy’s Laboratories: 11.79, Cipla: 9.58, Apollo Hospitals Enterprise: 8.94, Lupin: 4.63, Laurus Labs: 4.21, Aurobindo Pharma: 4.04, Alkem Laboratories: 3.51, and Biocon: 3.34 [5]. Figure 10 shows the dendrogram of the clustering done by the HRP portfolio, in which seven clusters are visible. The weights allocation and the returns yielded by

Fig. 10 The agglomerative clustering of the healthcare sector stocks—the dendrogram formed on the training data from January 1, 2016, to December 31, 2020


Fig. 11 The allocation of weights to the healthcare sector stocks by the EIGEN and the HRP portfolios based on stock price data from January 1, 2016, to December 31, 2020

Fig. 12 The returns of the EIGEN and the HRP portfolios for the healthcare services sector stocks on the (a) training data from January 1, 2016, to December 31, 2020, and (b) on the test data from January 1, 2021, to November 1, 2021

Table 4 Performance of the healthcare sector portfolios

Portfolio   Training performance              Test performance
            Volatility    Sharpe ratio        Volatility    Sharpe ratio
EIGEN       0.223730      0.799622            0.184003      0.672148
HRP         0.193036      1.054018            0.169768      1.410061

the two portfolios are presented in Figs. 11 and 12, respectively. The results in Table 4 show that the Sharpe ratios produced by the HRP portfolio are higher in both cases.

4.5 The Information Technology Sector Portfolios The top ten stocks and their respective contributions to the sectoral index of the information technology sector as per the NSE's report on October 29, 2021, are as follows: Infosys: 27.21, Tata Consultancy Services: 24.05, Tech Mahindra: 9.70, Wipro: 9.50, HCL


Fig. 13 The agglomerative clustering of the information technology sector stocks—the dendrogram formed on the training data from January 1, 2016, to December 31, 2020

Fig. 14 The allocation of weights to the information technology sector stocks by the EIGEN and the HRP portfolios based on stock price data from January 1, 2016, to December 31, 2020

Technologies: 8.48, Larsen and Toubro Infotech: 5.74, MindTree: 5.45, MphasiS: 5.03, L&T Technology Services: 2.44, and Coforge: 2.40 [5]. Figure 13 depicts the dendrogram of clustering by the HRP portfolio in which seven clusters are visible. Figures 14 and 15 show the weight allocation and the returns yielded by the two portfolios on the information technology sector stocks, respectively. It is observed from Table 5 that the HRP portfolio has yielded a higher Sharpe ratio on the test data.

4.6 The Oil and Gas Sector Portfolios

The top ten stocks and their respective contributions to the sectoral index of the oil and gas sector as per the NSE's report on October 29, 2021, are as follows: Reliance Industries: 32.78, Oil and Natural Gas Corporation: 12.70, Bharat Petroleum


Fig. 15 The returns of the EIGEN and the HRP portfolios for the information technology sector stocks on (a) the training data from January 1, 2016, to December 31, 2020, and (b) the test data from January 1, 2021, to November 1, 2021

Table 5 Performance of the information technology sector portfolios

Portfolio   Training volatility   Training Sharpe ratio   Test volatility   Test Sharpe ratio
EIGEN       0.214254              1.44903                 0.234554          2.839298
HRP         0.206345              1.43674                 0.227796          2.910402

Corporation: 9.31, Adani Total Gas: 9.24, Indian Oil Corporation: 7.60, GAIL India: 6.33, Hindustan Petroleum Corporation: 4.63, Petronet LNG: 4.02, Indraprastha Gas: 3.87, and Gujarat Gas: 2.50 [5]. The HRP portfolio has created six clusters on the ten stocks of this sector as observed in Fig. 16. The weight allocation by the portfolios and their returns are depicted in Figs. 17 and 18, respectively. The results presented in Table 6 show that the Sharpe ratio yielded by the HRP portfolio is higher for both training and test data. Moreover, for both cases, the volatilities of the HRP portfolio are found to be lower.

Fig. 16 The agglomerative clustering of the oil and gas sector stocks—the dendrogram formed on the training data from January 1, 2016, to December 31, 2020


Fig. 17 The allocation of weights to the oil and gas sector stocks by the EIGEN and the HRP portfolios based on stock price data from January 1, 2016, to December 31, 2020

Fig. 18 The returns of the EIGEN and the HRP portfolios for the oil and gas sector stocks on (a) the training data from January 1, 2016, to December 31, 2020, and (b) the test data from January 1, 2021, to November 1, 2021

Table 6 Performance of the oil and gas sector portfolios

Portfolio   Training volatility   Training Sharpe ratio   Test volatility   Test Sharpe ratio
EIGEN       0.236194              0.590050                0.203017          1.723656
HRP         0.213704              0.832144                0.193843          1.848685

4.7 The NIFTY 50 Portfolios

Finally, we consider the NIFTY 50 stocks and construct the EIGEN and the HRP portfolios for them. These stocks are the market leaders across 13 sectors in the NSE and have a low-risk quotient [5]. The dendrogram is depicted in Fig. 19, in which three distinct clusters are visible. The allocation of the weights and the returns yielded by the portfolios are shown in Figs. 20 and 21, respectively. The allocations made by both portfolios appear to be fairly uniform, indicating diversified portfolios.


Fig. 19 The agglomerative clustering of the NIFTY 50 stocks—the dendrogram formed on the training data from January 1, 2016, to December 31, 2020

Fig. 20 The allocation of weights to the NIFTY 50 stocks by the EIGEN and the HRP portfolios based on stock price data from January 1, 2016, to December 31, 2020

Fig. 21 The returns of the EIGEN and the HRP portfolios for the NIFTY 50 stocks on the test data from January 1, 2021, to August 26, 2021


Table 7 Performance of the NIFTY 50 portfolios

Portfolio   Training volatility   Training Sharpe ratio   Test volatility   Test Sharpe ratio
EIGEN       0.184898              0.934880                0.153761          2.480044
HRP         0.187925              0.887088                0.163927          2.799373

The results presented in Table 7 show that the HRP portfolio has outperformed its EIGEN counterpart, as it has yielded a higher value of the Sharpe ratio. In summary, it is observed that among the seven sectors, the HRP portfolio has produced higher Sharpe ratios for four sectors on the training data, while it has outperformed the EIGEN portfolio on five sectors on the test data. The results show that the performance of the HRP portfolio is superior to that of the EIGEN portfolio on what matters most, i.e., the test data of the sectors. Although it may not be appropriate to generalize this observation, which is based on the study of seven sectors, it appears that the diversification approach of HRP is more effective than that of the eigen portfolio approach. In other words, in the presence of some highly correlated stocks that lead to a highly unstable inverse correlation matrix, HRP's approach of diversification, which assigns weights to stocks in the same cluster based on the reciprocals of their variances, seems to be more effective than forming principal components and using their loading values as the portfolio weights.
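To make the contrast concrete, the following minimal Python sketch illustrates the two weighting schemes and the annualized volatility and Sharpe ratio reported in the tables. It is not the authors' implementation: the returns are synthetic, the HRP side applies only inverse-variance weighting (the recursive bisection step is omitted), the eigen side uses only the first principal component, and the risk-free rate is taken as zero.

import numpy as np

# Synthetic daily return matrix R, one column per stock; 252 trading days assumed.
rng = np.random.default_rng(0)
R = rng.normal(0.0005, 0.02, size=(1000, 5))

# HRP-style allocation inside a cluster: weights proportional to 1 / variance.
inv_var = 1.0 / R.var(axis=0)
hrp_weights = inv_var / inv_var.sum()

# Eigen-portfolio allocation: loadings of the first principal component of the
# correlation matrix, normalized so the weights sum to one.
corr = np.corrcoef(R, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
loadings = np.abs(eigvecs[:, -1])          # component with the largest eigenvalue
eigen_weights = loadings / loadings.sum()

def annualized_vol_and_sharpe(weights):
    port = R @ weights                      # daily portfolio returns
    vol = port.std() * np.sqrt(252)
    sharpe = port.mean() * 252 / vol        # risk-free rate assumed to be zero
    return vol, sharpe

print("HRP  ", annualized_vol_and_sharpe(hrp_weights))
print("EIGEN", annualized_vol_and_sharpe(eigen_weights))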

5 Conclusion

This paper has presented portfolio design approaches for some important sectors of the Indian stock market using the EIGEN and HRP algorithms. The portfolios are designed based on the past prices of the top ten stocks of six sectors and the 50 stocks of NIFTY 50. While the stock price data from January 1, 2016, to December 31, 2020, are used for building the portfolios, the period from January 1, 2021, to November 1, 2021, is used for testing. It is found that, on both the training and the test data, the HRP portfolio yielded higher Sharpe ratios for the majority of the seven sectors studied in this work. It is evident that the HRP portfolio has not only learned more effectively from the patterns in the training data but also achieved better diversification, leading to higher Sharpe ratios on the test data.

References

1. Markowitz H (1952) Portfolio selection. J Finance 7(1):77–91
2. de Prado ML (2016) Building diversified portfolios that outperform out of sample. J Portf Manag 42(4):59–69


3. Peng Y, Albuquerque PHM, Do Nascimento I-F, Machado JVF (2019) Between nonlinearities, complexity, and noises: an application on portfolio selection using kernel principal component analysis. Entropy 21(4):376
4. Baily D, de Prado ML (2012) Balanced baskets: a new approach to trading and hedging risks. J Invest Strat 1(4):61–62
5. NSE Website: http://www1.nseindia.com. Accessed on 3 Jan 2021
6. Mehtab S, Sen J, Dutta A (2020) Stock price prediction using machine learning and LSTM-based deep learning model. In: Proceedings of the SoMMA, pp 88–106
7. Mehtab S, Sen J (2020) Stock price prediction using convolutional neural networks on a multivariate time series. In: Proceedings of the 2nd NCMLAI, New Delhi, India
8. Sen J (2018) Stock price prediction using machine learning and deep learning frameworks. In: Proceedings of the ICBAI, Bangalore, India
9. Sen J, Datta Chaudhuri T (2017) A robust predictive model for stock price forecasting. In: Proceedings of the ICBAI, Bangalore, India
10. Sen J, Mehtab S (2021) Accurate stock price forecasting using robust and optimized deep learning model. In: Proceedings of the IEEE CONIT, Hubli, India
11. Mehtab S, Sen J (2020) Stock price prediction using CNN and LSTM-based deep learning models. In: Proceedings of the IEEE DASA, Bahrain
12. Qiu J, Wang B (2020) Forecasting stock prices with long-short term memory neural network based on attention mechanism. PLoS ONE 15(1):e0227222
13. Mehtab S, Sen J (2020) A time series analysis-based stock price prediction using machine learning and deep learning models. Int J Business Forecast Mark Intell 6(4):272–335
14. Mehtab S, Sen J (2019) A robust predictive model for stock price prediction using deep learning and natural language processing. In: Proceedings of the 7th BAICONF
15. Carta SM, Consoli S, Piras L, Podda S, Recupero DR (2021) Explainable machine learning exploiting news and domain-specific lexicon for stock market forecasting. IEEE Access 9:30193–302015
16. Sen J, Mehtab S (2021) A comparative study of optimum risk portfolio and eigen portfolio on the Indian stock market. Int J Bus Forecast Mark Intell 7(2):143–193, Inderscience Publishers, in press
17. Corazza M, Di Tollo G, Fasano G, Pesenti R (2021) A novel hybrid PSO-based metaheuristic for costly portfolio selection problem. Ann Oper Res 304:109–137
18. Sen J, Dutta A, Mehtab S (2021) Stock portfolio optimization using a deep learning LSTM model. In: Proceedings of IEEE MysuruCon, Hassan, India
19. Sen J, Mondal S, Mehtab S (2021) Portfolio optimization on NIFTY thematic sector stocks using an LSTM model. In: Proceedings of IEEE ICDABI, Bahrain
20. Wang Z, Zhang X, Zhang Z, Sheng D (2021) Credit portfolio optimization: a multi-objective genetic algorithm approach. Borsa Istanbul Review, in press
21. Erwin K, Engelbrecht A (2020) Improved set-based particle swarm optimization for portfolio optimization. In: Proceedings of IEEE SSCI, Canberra, Australia
22. Sen J, Mehtab S, Dutta A (2021) Volatility modeling of stocks from selected sectors of the Indian economy using GARCH. In: Proceedings of IEEE ASIANCON, Pune, India

Collaborative Approach Toward Information Retrieval System to Get Relevant News Articles Over Web: IRS-Web

Shabina and Sonal Chawla

Abstract Effectual information retrieval from a collection of documents available over the Web, such as news articles, is a tedious task. The objective of this research work is to introduce an information retrieval system for retrieving relevant news articles available over the Web, incorporating a unified framework based on machine learning techniques. The archive of the Times of India, one of the most prominent news groups in India, has been used for the experimental setup of this research work. The MongoDB operational database platform has been used to build a cloud-based data warehouse. A collaboration of different machine learning techniques is used to process constraint-based user queries and obtain an optimized set of resultant news articles. The resultant news articles are ranked according to their score values and sorted in the required order.

Keywords Information retrieval system · Keyword-based indexing · Multi-level clustering · Ranking and web scraping

Shabina (B) · S. Chawla
Department of Computer Science and Applications, Panjab University, Chandigarh, India
e-mail: [email protected]
S. Chawla
e-mail: [email protected]

1 Introduction

With the boom of the Internet, a huge volume of data is available online, such as news articles and social media content. There is a variety of tools available to retrieve data from different news articles, but some of them do not provide efficient results: they retrieve too many documents, of which only some are relevant to the user's query, and most of the relevant documents are not in order. Extraction is a way to discern significant information from unstructured data and store the extracted data in a structured manner for efficient processing. This research paper focuses on the extraction of relevant information from news articles by implementing various machine learning


Fig. 1 System overview

techniques, viz. indexing, clustering, and ranking, as shown in Fig. 1. Information retrieval is the process by which a collection of data is represented, stored, and searched for the purpose of knowledge discovery in response to a user's request or query. The rest of the paper is organized as follows: Sect. 2 discusses related work on information retrieval from different e-newspapers; Sect. 3 focuses on the experimental setup of the proposed system, which consists of several sub-sections (data extraction, data pre-processing, cloud-based data warehousing, keyword-based indexing, multi-level clustering, and ranking of index-based clusters); Sect. 4 discusses the results returned by the retrieval system; Sect. 5 presents a comparative analysis; and Sect. 6 concludes the paper.

2 Related Work

Information retrieval from newspaper articles is a way of mining or extracting information for data analysis. In recent years, many researchers have inclined toward the extraction of knowledge from collections of web documents by applying various computational methods. The traditional approaches to information retrieval are inefficient at handling and processing huge amounts of unstructured data, and text mining offers a set of techniques to explore massive data efficiently. In the literature, various techniques and methods have emerged to retrieve information from web documents, viz. e-newspapers. Kim et al. [1] considered a newspaper dataset to identify keywords related to medical tourism in both medical and daily newspapers. The objective of this study was to understand the perception of the medical community on medical tourism using text mining. Matto et al. [2] considered different platforms, i.e., social media and newspapers, as sources of crime data that are not entered in police databases. The objective of this study was to investigate crime region-wise, identify frequently occurring crimes, and find patterns among reported crimes. Kiran et al. [3] reviewed the existing information extraction techniques, their limitations, and the challenges of extracting unstructured data, i.e., text, image, audio, and video. The authors realized that it is the need of the hour to develop novel techniques and adapt existing techniques to analyze data efficiently for knowledge discovery. Hanumanthappa [4] applied information retrieval techniques to retrieve relevant information from electronic newspapers. The author focused on a comparative study of various extraction tools on the basis of source type, programming language, and characteristics. Hanumanthappa [5] used e-newspapers as the dataset, as readers' interest in e-newspapers has increased over the past few years. The proposed approach focused on information extraction from PDF documents into text format for effective reading. Zhao [6] discussed the technique of web scraping to extract data from web documents and store it in a database for future use. To extract data, the Beautiful Soup module of Python is used, which scrapes HTML and XML documents. Liu et al. [7] used the WiseSearch database to retrieve coronavirus-related news articles from major press media between January 1, 2020, and February 20, 2020. The retrieved data was analyzed using Python and the Python package Jieba.

3 Experimental Setup

In an attempt to propose an information retrieval system, data is initially collected through a web scraper from newspaper articles [8]. Further, data pre-processing is performed to clean the data by removing the numbers, punctuation, and white spaces between the words, which makes the searching process easier. To the pre-processed data, indexing is applied for sorting and ordering the data in a more efficient way. Further, clustering is applied to the indexed documents to make clusters of each word present in an article on the basis of synonyms. In the end, ranking of the articles is done based on a score value assigned to each article on the basis of the maximum occurrence of keywords.

Algorithm: The below-mentioned sequence of steps was performed to populate the database with the required data from the news articles.

Start
Step 1: Data Acquisition and Extraction — Call web_scrapper()
Step 2: Data Pre-Processing — Invoke the 're' Python library, Call lower()
Step 3: Insert data into the MongoDB database — Document fields: {_id, Article_id, title, image_url, content, url, content_lower, content_words, keywords}
Step 4: Indexing of documents
Step 5: Clusters Generation — Call get_syn(), get_tree()
Step 6: Ranking of articles — Call Scoring()
End


3.1 Data Acquisition and Extraction

Among newspapers, "The Times of India" is the oldest English-language daily newspaper in India. According to the Brand Trust Report India study 2019, "The Times of India" is rated as the most trusted English newspaper, with a total daily readership of 2,880,144. So, this newspaper was chosen as the dataset, and the site "https://timesofindia.indiatimes.com/archive.cms" was used to get the e-newspaper articles.

Web Scraping: Zhao [9] devised a method to retrieve the relevant information by using the Python library "Beautiful Soup." This method requires only 2 parameters: the URL path of the file to be downloaded and the local path where the file is stored on the desktop. The main objective of this research is to extract the data from news articles [10] and store them in a database for future use. The mundane task involves downloading each file one by one, where every file needs approximately 35 s to download along with the time required to visit each link. The intelligent way is to implement the Python libraries requests and Beautiful Soup [11] for this job, as shown in Fig. 2. The below-mentioned Algorithm 1 of the web scraper helps to extract data from a collection of web pages [12], with prerequisite knowledge of the Python language, libraries, or frameworks, and stores the extracted data in different file formats, viz. CSV, JSON, and XML [13].

Fig. 2 Process of web scraping

Algorithm 1: WebScrapping
(# month_url = f'https://timesofindia.indiatimes.com/archive/year-{year},month-{month}.cms')

# Imports and shared state implied by the listing.
import json
import requests
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool

success = []

def get_links(url):
    # Takes the URL template as the argument; id_number needs to be incremented
    # once to change the day. Returns all the article links of that day.
    re = requests.get(url)
    soup = BeautifulSoup(re.content.decode('utf-8', 'ignore'), 'lxml').body
    links = soup.find_all('a')
    return_links = []
    for link in links:
        try:
            if "articleshow" in link['href'] and "entertainment" not in link['href']:
                return_links.append(link['href'])
        except Exception as e:
            pass
    return return_links

list_of_dates = []

def get_data_for_article(url):
    try:
        article_id = (url.split('/')[-1][:-4])
        re = requests.get(url)
        soup = BeautifulSoup(re.content.decode('utf-8', 'ignore'), 'html.parser')
        scripts = soup.find_all('script')
        for script in scripts:
            try:
                if "window.App={" in script.string:
                    op = json.loads(script.string.replace("window.App=", ""))
                    op = op['state']['articleshow_v2']['data'][article_id]
                    article_information = {}
                    article_information['Article_id'] = url.split("/")[-1].split('.')[0]
                    article_information['Title'] = op['hl']
                    article_information['Image_Url'] = op['seo']['ogimage']
                    article_information['Content'] = ""
                    for data in op['story']:
                        if 'value' in data.keys():
                            article_information['Content'] += (data['value']) + '\n'
                    article_information['Url'] = url
                    success.append(url)
                    # call the ML function
                    article_information = return_keywords(article_information)
                    # insert in the database
                    insert_data_check(article_information)
                    return
            except Exception as e:
                pass
    except Exception as e:
        pass

if __name__ == "__main__":
    year = 2020
    month = 1
    id_number = 43831
    for _ in range(1, 5):
        url_template = f'https://timesofindia.indiatimes.com/2020/1/1/archivelist/year-{year},month-{month},starttime-{id_number}.cms'
        id_number += 1
        links = get_links(url_template)
        # print(len(links))
        pool = ThreadPool(10)
        pool.map(get_data_for_article, links[:10])
        print(len(success))

We used the "PyMongo" library, which helps us insert data into the database, provides the functionality of querying from the user, retrieves the desired data, and runs various database commands. A web scraper is an efficient tool to extract information from a particular web page or from multiple web pages [12, 14].
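For illustration, a minimal PyMongo sketch of the insert and query operations described above is given below. The connection string, database name, and collection name are hypothetical placeholders, and the document fields follow Step 3 of the algorithm.

from pymongo import MongoClient

# Hypothetical connection string and names, for illustration only.
client = MongoClient("mongodb+srv://user:password@cluster.example.net")
articles = client["irs_web"]["articles"]

# Insert one scraped article document (fields follow the schema in Step 3).
articles.insert_one({
    "Article_id": "73123456",
    "Title": "Sample headline",
    "Content": "Full article text ...",
    "Url": "https://timesofindia.indiatimes.com/...",
    "Keywords": ["sample", "headline"],
})

# Constraint-based retrieval of documents whose keywords match a query term.
for doc in articles.find({"Keywords": "sample"}, {"Title": 1, "Url": 1}):
    print(doc["Title"], doc["Url"])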

3.2 Data Pre-processing

The news articles consist of unstructured and irrelevant data that has no significance for analysis. The information is not couched in a manner that is amenable to automatic processing. For effectual information processing, data cleaning is done with the help of the "re" library in Python. This library is used for evaluating regular expressions ("regex") [15]. A regular expression specifies the strings that match it. Data pre-processing helps to clean the data by removing the numbers, punctuation, and white spaces between the words. It includes lowering of text, which makes searching easier; tokenization, which is used for forming tokens out of sentences/paragraphs; and lemmatization, which is a process of grouping the various inflected forms of a word so that they can be analyzed as a single word. Data pre-processing is the process of preparing the data and making it available to the machine learning model.
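As a minimal sketch of the cleaning steps described above (lower-casing, removing numbers, punctuation, and extra white space, and a simple whitespace tokenization; the lemmatization step is omitted here), assuming only the standard "re" library:

import re

def preprocess(text):
    # Lower-case, strip digits and punctuation, collapse extra whitespace,
    # and split into tokens.
    text = text.lower()
    text = re.sub(r"[0-9]", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.split()

print(preprocess("Budget 2020: Govt. allocates Rs 1,000 crore!"))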

3.3 Cloud-Based Data Warehousing

Using the web scraper, we gathered all the content of the articles of every newspaper [14] and stored them in a MongoDB database as documents, as shown in Fig. 3. MongoDB is a well-known document-oriented, cloud-based database which is basically used for storing JSON-like data with a dynamic schema. JSON (JavaScript Object Notation) is an open-standard data interchange file format that supports human-understandable text to store and transmit data objects. Hence, we used MongoDB to store our dataset, as it offers an authoritative method to store and retrieve the useful information efficiently. Figure 3 depicts the storage view of 85,539 documents with a total document size of 498.6 MB in the database, where each document has the fields {_id, Article_id, title, image_url, content, url, content_lower, content_words, keywords} corresponding to the extracted article from the newspaper [16].


Fig. 3 Storage view of documents in MongoDB

3.4 IRS-Web

This section demonstrates the core module of our proposed information retrieval system over the Web, named IRS-Web. This module is a collaboration of various machine learning techniques to retrieve the relevant information over a cloud-based dataset. The dataset described above is made available to extract the relevant information over the Web based on the user's query. The working of our IRS-Web system starts with the indexing process. Here, various keyword-based indexes are generated based on the words listed in the user's query. This process is followed by multi-level clustering [17, 18]. Here, multi-level array-based clusters are generated up to level 4 to deeply understand the user's query based on the keyword-based indexes [19]. In continuation with this, the last step is to compute the score of each document retrieved through index-based clustering. This score is used to rank each document and sort the documents into relevant order. The subsequent sub-sections describe the details of each phase along with their algorithms. Figure 4 depicts the home window of our proposed information retrieval system, which allows the end user to retrieve the relevant articles from the cloud-based newspaper dataset. The resultant articles are based on the keywords listed in the user's query.

Keyword-based Indexing: Indexes are data structures that tend to store a small portion of data or information in an easy-to-traverse form. In simple words, indexes are used to store the values of indexed fields outside the table and also keep their location on disk, as shown in Fig. 5.


Fig. 4 Home window of IRS-Web

Fig. 5 Keyword-based indexing

This ordering of data with the help of indexing helps us to perform quality matches and range-based query operations efficiently. If indexing is not done, MongoDB scans the collection, i.e., it examines every document in order to select those that match the user's query. The index stores the value of a single field, or of multiple fields together, ordered by the value of the field. When querying data without indexes, MongoDB must search every record in the database to find the information that matches the user's query. In addition, MongoDB provides sorted and limited results with the help of the ordering in the index.
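Continuing the PyMongo sketch above, the keyword-based indexes can be created as follows; the field names follow the document schema, and this is an illustrative sketch rather than the authors' exact code:

# Create keyword-based indexes so queries do not scan the whole collection.
articles.create_index([("Keywords", 1)])
articles.create_index([("Content_words", 1)])

# With the index in place, MongoDB can also return sorted, limited results.
cursor = articles.find({"Keywords": "election"}).sort("Article_id", -1).limit(10)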


Multi-level Clustering: Clustering is a technique used for grouping unlabeled data [17, 18]. An API call is made to find the synonyms of each word by using the "get_syn()" function, and we store them in a 4-level undirected graph where only unique words are considered, by calling the "get_tree()" function, as mentioned in the cluster generation algorithm.

Algorithm 2: Cluster Generation()

def get_syn(word):
    url1 = "https://www.wordsapi.com/mashape/words/"
    url2 = "/synonyms?when=2021-06-09T17:09:42.975Z&encrypted=8cfdb18be722959bea9807bee858beb0aeb42f0939f892b8"
    request_api = requests.get(url1 + word + url2)
    text = request_api.text
    soup = BeautifulSoup(text, 'lxml')
    words = soup.find("p").getText()
    data = json.loads(words)
    syn = data.get("synonyms")
    if syn == None:
        return
    if (len(syn) != 0):
        # only taking the top 20 synonyms as we don't want to make a word clutter
        for w in syn[:20]:
            if w not in total_list and len(w.split(" ")) == 1:
                temp_level.append(w)
                total_list.append(w)
    return

def get_tree(words):
    global temp_level
    global total_list
    l2 = []
    l3 = []
    l4 = []
    ans = []
    total_ans = set()
    total_list = []
    final_tree = []
    # Level 1
    l1 = words
    final_tree.append(l1)
    for word in words:
        total_list.append(word)
    # Level 2
    pool = ThreadPool(10)
    pool.map(get_syn, l1)
    l2 = [x for x in temp_level]
    final_tree.append(temp_level)
    temp_level = []
    # Level 3
    pool = ThreadPool(10)
    pool.map(get_syn, l2)
    l3 = [x for x in temp_level]
    final_tree.append(temp_level)
    temp_level = []
    # Level 4
    pool = ThreadPool(10)
    pool.map(get_syn, l3)
    l4 = [x for x in temp_level]
    final_tree.append(temp_level)
    ans.append(final_tree)
    for i in total_list:
        total_ans.add(i)
    return ans, list(total_ans)

As an output, we get 4 lists of synonyms, one for every level. A search for each word in the tree is then done; if that word is present in the "Content_words" of any article, that article is returned, implying that the article belongs to that specific word cluster. The clusters, each consisting of a list of articles, are then merged, as shown in Fig. 6.

Ranking of Index-based Clusters: In our research work, we rank [20] the articles based on the most frequently occurring keywords and the keywords that are present in an article. The "Scoring()" function is invoked to calculate the total score, which helps us to identify in which articles the words present in the user's query occur most frequently; the higher the score of an article, the higher the priority of that article for the words searched by the user.

Fig. 6 Multi-level clusters


Algorithm 3: Scoring of documents()

def scoring(article, total_ans, new_list):
    score = 0
    outcome[article['Url']] = 0
    tokens = word_tokenize(article['Content_lower'])
    temp_total_ans = total_ans
    for word in article['Keywords']:
        if word in temp_total_ans:
            score += 10
            print("word in keywords", word)
            temp_total_ans.remove(word)
    for word in tokens:
        points = 2
        if word in temp_total_ans:
            if word in new_list[1]:
                points /= 2
                print("in first list")
            elif word in new_list[2]:
                points /= 4
            elif word in new_list[3]:
                points /= 8
            score += points
    if score > 0:
        outcome[article['Url']] += score
    else:
        outcome.pop(article['Url'], None)

This type of ranking gives more priority to the more relatable, content-driven articles based on their similarity to the query, as shown in Fig. 7. It therefore makes our searching more optimized, as we in turn rank the articles on the basis of their scores; the search engine is thus very sensitive to this function.
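For illustration, the per-article scores accumulated in the outcome dictionary of Algorithm 3 can be turned into a ranked list by sorting on the score in descending order; a minimal sketch, assuming outcome has been populated as above:

# Rank articles by their accumulated scores (highest first).
ranked = sorted(outcome.items(), key=lambda item: item[1], reverse=True)
for rank, (url, score) in enumerate(ranked, start=1):
    print(rank, url, round(score, 2))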

Fig. 7 Ranked articles with score value


Fig. 8 Resultant ranked articles with score values

4 Results

The objective of the proposed methodology is to develop a Web-based information retrieval system to retrieve relevant information from a cloud-based dataset. Therefore, the e-newspaper dataset was taken into consideration to extract the relevant articles from the Times of India cloud-based archive. The implementation of the machine learning algorithms returns the relevant articles for the entered query, which enhances the information processing and retrieval process. Figure 8 depicts the resultant articles along with their score values, computed corresponding to different parameter values, that are returned by the proposed IRS-Web system. The articles retrieved through index-based clustering have been taken into consideration corresponding to the user query. The count-based vectorization technique has been used to compute the score of each keyword that occurs in the resultant articles. The level-based scoring has also been used to compute the individual score of each resultant article. Finally, each resultant article is shown with its final score value in descending order. This descending order has been followed with the objective of showing the highest-scored document as rank 1, followed by the rest in the same sequence, as shown in Fig. 8.
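As a hedged illustration of count-based vectorization (using scikit-learn's CountVectorizer, which may differ from the authors' exact implementation; the article contents are hypothetical):

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-processed article contents.
docs = ["budget allocates funds for health",
        "election results announced in state"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)        # sparse term-count matrix
print(vectorizer.get_feature_names_out())
print(counts.toarray())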

5 Comparative Analysis

A comparative analysis has also been conducted to differentiate our proposed technique from existing ones. In this analysis, we have considered the most common clustering technique, K-means, and compared our proposed technique against it on various parameters, such as specific and spherical clusters, the upper limit on the number of clusters, broad clusters, and accuracy and efficiency. On the basis of this comparison, it is concluded in what manner our technique is better than the existing one. Figure 9 elaborates the details of each comparison against the various parameters. As discussed above, this methodology focuses on optimization, so this analysis elaborates only on the clustering section. The optimization work has been done at various stages of the whole research methodology. The minor optimization at a specific module may not be enough at that specific level, but these minor contributions at different stages together are big enough to achieve the optimization needed to retrieve relevant results over a cloud-based environment.

Fig. 9 Comparative analysis of our proposed technique with the K-means clustering algorithm corresponding to different parameters

6 Conclusion

In the present scenario, with the outburst of social content on the Web, the incorporation of keyword-based indexing, multi-level clustering, and score-based ranking of clustered articles plays an important role in the retrieval of relevant information. This research work proposes a Web-based information retrieval system that applies machine learning techniques to retrieve the relevant articles in response to the end user's query. The TOI newspaper archive has been used in the experimental setup to retrieve the data over the Web. The collaborative technique-based approach has been used to obtain optimized results over the Web-based news articles. The proposed IRS-Web handles the news articles through various steps, starting from data extraction and pre-processing, followed by cloud-based data warehousing, to which the machine learning techniques are applied. The proposed IRS approach promises to optimize the results by returning the relevant articles corresponding to the user's request.

References

1. Kim S, Lee WS (2019) Network text analysis of medical tourism in newspapers using text mining: the South Korea case. Tourism Manag Perspect 31:332–339
2. Matto G, Mwangoka J (2017) Detecting crime patterns from Swahili newspapers using text mining. Int J Knowl Eng Data Min 4(2):145–156
3. Adnan K, Akbar R (2019) Limitations of information extraction methods and techniques for heterogeneous unstructured big data. Int J Eng Bus Manage 11. https://doi.org/10.1177/1847979019890771
4. Hanumanthappa M, Nagalavi T, Kumar M (2014) A study of information extraction tools for online English newspapers (PDF): comparative analysis. Int J Innovative Res Comput Commun Eng 2(1)
5. Hanumanthappa M, Nagalavi T (2015) Identification and extraction of different objects and its location from a PDF file using efficient information retrieval tools. In: International conference on soft-computing and networks security (ICSNS), pp 1–6. https://doi.org/10.1109/ICSNS.2015.7292375
6. Ojokoh BA (2012) Automated online news content extraction. Int J Comput Sci Res Appl 2:2–12
7. Lindholm S (2011) Extracting content from online news sites
8. Zhang C, Lin Z (2010) Automatic web news content extraction based on similar pages. Int Conf Web Inf Syst Min 1:232–236
9. Zhao B (2017) Web scraping. Encyclopedia of big data
10. Chang CH, Kayed M, Girgis MR, Shaalan KF (2006) A survey of web information extraction systems. IEEE Trans Knowl Data Eng 18(10):1411–1428
11. Sirisuriya DS (2015) A comparative study on web scraping
12. Dong Y, Li Q, Yan Z, Ding Y (2008) A generic web news extraction approach. In: International conference on information and automation, pp 179–183
13. Li Y, Meng X, Li Q, Wang L (2016) Hybrid method for automated news content extraction from the web. In: International conference on web information systems engineering, pp 327–338
14. Chen J, Keli X (2008) Perception-oriented online news extraction. In: Proceedings of the 8th ACM/IEEE-CS joint conference on digital libraries
15. Norvag K, Oyri R (2005) News item extraction for text mining in web newspapers. In: International workshop on challenges in web information retrieval and integration, pp 195–204
16. Zhang D, Simoff SJ (2006) Informing the curious negotiator: automatic news extraction from the internet. In: Data mining. Springer, Berlin, Heidelberg, pp 176–191
17. Sheetrit E, Kurland O (2019) Cluster-based focused retrieval. In: Proceedings of the 28th ACM international conference on information and knowledge management, pp 2305–2308
18. Handa R, Krishna CR, Aggarwal N (2019) Document clustering for efficient and secure information retrieval from cloud. Concurrency Comput: Pract Experience 31(15)
19. Chandwani G, Ahlawat A, Dubey G (2021) An approach for document retrieval using cluster-based inverted indexing. J Inf Sci 14, 01655515211018401
20. Khan HU, Nasir S, Nasim K, Shabbir D, Mahmood A (2021) Twitter trends: a ranking algorithm analysis on real time data. Expert Syst Appl 113990

Patent Recommendation Engine Using Graph Database

Aniruddha Chatterjee, Sagnik Biswas, and M. Kanchana

Abstract Accurate analysis of patents is an essential tool for modern companies. The idea behind the patent recommendation engine is to build solutions that enhance both the quality and the quantity of data extractable from a patent and discover meaningful relations, helping companies spend fewer resources in terms of time and manpower. The recommendation engine transfers the patent data into a graph database and executes queries that answer questions specific to certain business use cases, such that the task is significantly easier, less resource-intensive, and less complex than performing the same task with a conventional relational database. The engine accepts a single input and forms clusters from that single starting point based on chained queries. It can then run the required algorithms on the clusters formed to select the best-fit data. The recommendation engine uses Neo4j as the database. Neo4j is a NoSQL graph database that focuses more on the relationships between data than on the data itself. We extract the data from our existing databases and then ingest it into Neo4j. Cypher queries power the engine, which answers very specific questions within very little time.

Keywords Patent recommendation · Graph database · NoSQL graphs · Neo4j · Jaccard distance · Overlapping algorithm · Similarity algorithm

A. Chatterjee · S. Biswas · M. Kanchana (B)
SRM Institute of Science and Technology, Chennai, India
e-mail: [email protected]
A. Chatterjee
e-mail: [email protected]
S. Biswas
e-mail: [email protected]


1 Introduction

1.1 Background

Most conventional database engines powered by Structured Query Language (SQL) store data in the form of rows and columns. While this is a very efficient and time-tested method of storing data, finding relationship patterns is not very efficient due to the nature of the queries used to relate two or more data records. A graph database is a database solution based on graph theory, where the data is stored in a graph-like structure. The data is primarily stored in the form of graphs and hence can contain cycles, i.e., we can cycle back to the root node and have connections between all the nodes, thus forming better relations between different nodes. A graph database is made in a way that treats the connections between data as equally important as the data itself. It is intended to hold data without limiting it to a pre-defined model. This makes the approach of a graph database highly suitable for storing decentralized data such as patents. A single patent can be broken down into properties, and each property can be labeled and made into a node. These nodes can then be connected to each other via relationships, which are nothing but the edges of the graph. Graph databases are also scalable and ACID compliant in nature [1]. Therefore, a graph database provides a very appealing platform for storing data alongside connections.

1.2 Application

A recommendation engine is a very simple concept where, based on one or more inputs, the system can give a single item or a list of similar items. The definition of similarity depends on the parameters specified by the user and on the algorithm(s) used to arrive at the recommendation. The system can be implemented in multiple ways, one of which is using a set of data to arrive at a recommendation. One of the ways is to adopt machine learning (ML) algorithms for natural language processing (NLP) to analyze the patent documents [2]. The proposed engine aims at using the inherent structure of a graph to interconnect data and find relationships within it for the purpose of suggesting recommendations. The proposed engine accepts a single patent and recommends a list of similar patents based on the location each patent is published in. Based on the parameter(s) required, changes can be made in the application to tailor it to the custom needs of the use case scenario. The patent recommendation engine can be used in the fields of pharmaceuticals and product research and development to find patents of similar interest.


1.3 Limitations

This research paper explores the concept of a recommendation engine using a graph database, specifically Neo4j. One of the algorithms used is developed and made available by Neo4j. The implementation, working, and efficiency of that algorithm depend on its development team, and the engine using this algorithm will be affected if and when Neo4j makes any changes to it. This research paper is also limited by the resources available to execute and time the queries used throughout. Only a single data set was used, on a machine with a specific configuration. A future study of the concept using a larger and more varied data set and on different platforms can be conducted to supplement this paper.

1.4 Objective

This research paper proposes a recommendation engine using Neo4j as the data store and two algorithms to recommend patents. The aim is to evaluate the execution time of both queries and determine a better-fit algorithm in terms of the execution time taken by each.

2 Literature Survey

2.1 Existing Works

The literature review covers graph databases, patent recommendation, and a comparison of various methodologies to find recommended patents. Graph databases are a better fit for analytical purposes than conventional relational databases [3]. Since graph databases focus on relations more than on the data itself, they are a very good choice for designing a system where we can traverse the connections and look for data. A graph is a visual representation of a set of objects connected to each other by links [4]. A graph data structure consists of vertices and edges. A vertex is the data part of the graph, and the edges connect each vertex with one or more vertices. In a directed graph, we can define the direction of the edges, making it possible to define a one-way traversal from one vertex to the next, depending on our requirement. We use this property of graphs to create our data set in the graph database, which allows us to efficiently organize and traverse through huge clusters of data.

The idea of a recommendation engine is to create a system which can use content filtering, collaborative filtering, or hybrid filtering techniques to suggest recommendations from a huge data set [5]. Content filtering works well with graphs, since graph traversal via edges links to interconnected nodes, which can be used by various algorithms to suggest the nodes with the largest number of matches as per some specified parameters. Clustering algorithms are a popular approach to the problem at hand. The research in [6] focuses on an algorithm to dynamically search related patents. It clusters users with similar patent search behaviors and, subsequently, infers new patent recommendations based on inter-cluster group member behaviors and characteristics. This method is very useful when the search pattern of a group of users is known. A clustering algorithm with non-exhaustive overlaps is proposed in [7] to overcome fallacies of exhaustive clustering methods used in the mining of patents. A domain-independent prediction technique based on the similarity of items is also a way to approach this problem [8]. The similarity algorithm used to match items is what determines the efficacy of such methods. The Hopfield net algorithm is also a potential approach while using graph data structures; high-degree associations between various labels can be used to find recommendations, as outlined in [9]. The Graph Data Science Library by Neo4j provides multiple similarity algorithms that can be used for making predictions based on the relationship of one node to another [10].

3 Method Proposed

The proposed method to recommend patents is to use the relationships between patents and assignees using two different algorithms—second-degree search and the node similarity algorithm using the Jaccard metric. This section describes the data preparation, the executed queries, and the query execution time of the operations.

3.1 Second-Degree Node Search Technique

Second-degree node search implies jumping to the next connected node from the nodes immediately connected to the source node. The second-degree node connections are important because, based on the relationships between the nodes, we can generate a list of nodes that are potentially connected with the source node and are of interest (Fig. 1). This technique works purely on the relationships that exist between the nodes, just like a graph traversal. It does not require any form of calculation by the system. The complexity of this approach depends on the number of nodes connected to both the source and the first-degree nodes. One drawback of this approach is that if two or more first-degree nodes are connected to one or more of the same nodes, those node(s) will be traversed redundantly.


Fig. 1 A reference diagram to understand first- and second-degree node connections

This technique returns a list of nodes, without the ability to determine whether some recommendations are better suited than others. That can be further achieved with the node similarity algorithm.
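For illustration, a minimal sketch of the second-degree search issued from Python with the Neo4j driver is shown below. The connection details are hypothetical, and the Cypher statement is a sketch based on the data model described later (Patent and Location nodes linked by LOCATED_IN), not necessarily the authors' exact query.

from neo4j import GraphDatabase

# Hypothetical connection details; adjust to the local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

SECOND_DEGREE_QUERY = """
MATCH (p1:Patent {patentId: $patent_id})-[:LOCATED_IN]->(l:Location)
      <-[:LOCATED_IN]-(p2:Patent)
WHERE p1 <> p2
RETURN DISTINCT p2.patentId AS recommendation
LIMIT 25
"""

def second_degree_recommendations(patent_id):
    # Traverse from the source patent to its location (first degree) and back
    # out to other patents at the same location (second degree).
    with driver.session() as session:
        result = session.run(SECOND_DEGREE_QUERY, patent_id=patent_id)
        return [record["recommendation"] for record in result]

print(second_degree_recommendations("9790318"))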

3.2 Node Similarity Algorithm Using Jaccard Distance

This algorithm is a production-grade algorithm available with the Neo4j Graph Data Science Library [11]. The algorithm uses the Jaccard metric, calculated using the formula:

J(X, Y) = |X ∩ Y| / |X ∪ Y| = |X ∩ Y| / (|X| + |Y| − |X ∩ Y|)    (1)

The concept is similar to the first-degree recommendation used in the second-degree node search technique. The neighborhood of a node X is scanned, resulting in node(s) Y. The Jaccard metric is calculated for J(X, Y), and that is used to formulate the similarity score. Since this algorithm involves the calculation of the Jaccard metric, the complexity of this approach is quadratic in the number of nodes. For larger data sets, this approach will be slower, but it returns the similarity score between a pair of nodes, allowing us to achieve more clarity on the recommendations.
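As a minimal, standalone illustration of the Jaccard metric in Eq. (1) (plain Python, independent of the Neo4j implementation; the example sets are hypothetical node neighborhoods):

def jaccard(x, y):
    # |X ∩ Y| / |X ∪ Y| for two neighborhoods given as sets of node ids.
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0
    return len(x & y) / len(x | y)

# Example: two patents sharing one of three distinct locations.
print(jaccard({"loc1", "loc2"}, {"loc2", "loc3"}))  # 1/3 ≈ 0.333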

3.3 Data Modeling—Nodes and Relationships

The data is obtained from PatentsView, available under a Creative Commons Attribution 4.0 License [11]. The data sets are in the form of tab-separated value (.tsv) files. Each file contains its own headers, the description of which is also provided by PatentsView in the form of a data dictionary [12]. A sample .tsv file looks like Fig. 2.

Fig. 2 A portion of the assignee TSV file downloaded from PatentsView

The important information to gather here is the headers. The headers act as the node properties in the data model. We consider each file as a node type, the column headers as properties, and each row in the file as an individual node. We have three nodes at hand—patent, assignee, and location—to be linked via the relationships "ASSIGNED_TO" and "LOCATED_IN." The data model would look like the representation in Fig. 3.


Fig. 3 A representation of the data model

Table 1 List of files taken into consideration for data modeling

File name         Definition                                                                                               Number of rows
Assignee          The assignee data for granted and pre-granted patents, having an id generated by the disambiguation algorithm   530,735
Patent_assignee   Descriptions for many-to-many relationships between patents and assignees                               7,121,431
Rawlocation       Raw location data for inventors and assignees                                                           30,482,996

3.4 Data Preparation and Ingestion

The data is first imported into a PostgreSQL table. For each data set, we create a separate file as per the data dictionary provided by PatentsView [13]. Then, after the data is imported, we run two simple JOIN queries in succession to generate the final data set.

SELECT patent_assignee.patent_id, assignee.id, assignee.name_first,
       assignee.name_last, assignee.organization
FROM patent_assignee
RIGHT JOIN assignee ON assignee.id = patent_assignee.assignee_id;

SELECT results.patent_id, results.name_first, results.name_last,
       results.location_id, results.organization, rawlocation.city,
       rawlocation.state, rawlocation.country, rawlocation.latlong
FROM results
RIGHT JOIN rawlocation ON results.location_id = rawlocation.location_id;

The final data set generation took 9 s and 975 ms to output 7,157,913 rows, which, when exported to a comma-separated value (CSV) file, resulted in a file size of 526.2 megabytes. The generated CSV file was imported into a Neo4j database. As shown in Fig. 4, the generated CSV contains the data set we require in the form of a tabular structure. The file import functionality of Neo4j allows us to define nodes and relationships based on the CSV headers. The ingestion process converted the tabular data into a graph, a portion of which can be seen in Fig. 5.
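The paper uses Neo4j's file import functionality; as a hedged alternative sketch (not the authors' procedure), the same CSV could be ingested with a LOAD CSV statement issued from Python, assuming the file is placed in Neo4j's import directory and the column names match the generated data set:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Assumed CSV column names (patent_id, id, organization, location_id, city, country).
LOAD_QUERY = """
LOAD CSV WITH HEADERS FROM 'file:///final_dataset.csv' AS row
MERGE (p:Patent {patentId: row.patent_id})
MERGE (a:Assignee {assigneeId: row.id, organization: coalesce(row.organization, '')})
MERGE (l:Location {locationId: row.location_id, city: coalesce(row.city, ''),
                   country: coalesce(row.country, '')})
MERGE (p)-[:ASSIGNED_TO]->(a)
MERGE (p)-[:LOCATED_IN]->(l)
"""

with driver.session() as session:
    # For a file of this size, batching the load would normally be advisable.
    session.run(LOAD_QUERY)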

Fig. 4 A portion of the data output after the JOIN query using the location parameter, showing the tabular format of the final data set

Fig. 5 A portion of the graph data showing the interconnected nodes. The pink nodes are location nodes, the khaki ones are patent nodes, and the green nodes are the assignee nodes

Table 2 List of number of records for each trial

x    N
1    3,000
2    9,000
3    27,000
4    81,000
5    243,000

3.5 Query Building and Execution

We formulated a limit on the number of records to utilize from the data set, based on the resources available for this paper. This enabled us to select break points and tabulate the query execution time for varying workloads. The number of records for each trial was calculated as per the formula:

N = 3^x × 1000, for all x such that 1 ≤ x ≤ 5, x ∈ Z+    (2)
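A one-line check of Eq. (2) over the paper's range of x:

# N = 3^x * 1000 for x = 1..5
print([3 ** x * 1000 for x in range(1, 6)])  # [3000, 9000, 27000, 81000, 243000]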

This led us to the numbers shown in Table 2. The queries for both the approaches discussed in this paper were formulated and executed to compare the performance between them. With a graph database, searching for second-degree nodes is relatively easy. We use the second-degree patent search, connected via their locations. That gives us a query of the form:

MATCH (p1:Patent {patentId: "9790318"})-[:LOCATED_IN]->(l:Location)

While (level(j) > 1)
    chlist = Getchildren(j)
    For each node c in chlist
        If (preferedparent(c) ∩ preferedparent(c+1)) != F
            Monitor(j) = c U c+1
        End if
    End for
End while
If count(Monitor(j)) > count(chlist(j))/2 + 1
    While count(Monitor(j)) = count(chlist(j))/2 + 1
        Remove a node from Monitor(j) that has large nodes in first common preferedparent(Monitor(j))
    End while
End if
If Monitor(j) = F
    Monitor = chlist
End if
If node(j) = LeafNode
    Sblist = GetSibling(j)
    For each node g in Sblist
        If (preferedparent(g) ∩ preferedparent(g+1)) != F
            Monitor(j) = g U g+1
        End if
    End for
    For each node g in Sblist
        If count(Monitor(g)) > count(chlist(g))/2 + 1
            While count(Monitor(g)) = count(chlist(g))/2 + 1
                Remove a node from Monitor(g) that has large nodes in first common preferedparent(Monitor(g))
            End while
        End if
    End for
    If Monitor(j) = F
        Monitor = sblist
    End if
End for
For each node j in Monitor
    Decision(j) = maximum occurred node in (preferedparent(j) ∩ preferedparent(j+1))
End for

4.2.2 Alert Generation by Monitoring Nodes

The monitoring node implements the watchdog mechanism and sends an alert message to the decision nodes when it detects abnormal behavior. Each monitoring node has a window of w units, during which it observes the overheard packets of its parent node.

Fig. 3 Watchdog mechanism at sensor nodes

During each window, the monitoring node generates an alert message if the observed parameter value crosses the threshold criterion, and the alert message is sent to its decision nodes. Then, the next watchdog window is started, and the same procedure is repeated periodically for all monitoring nodes, as shown in Fig. 3. Algorithm 4 is used for generating the alert by the monitoring node.

Algorithm 4: Alert generation by monitoring node

While (within the watchdog window W)
{
    If (measured parameters about the parent do not satisfy the threshold) then
    {
        Drop(packet from the parent)
        Create(alert message about the measured parameter)
        Send(alert message to its decision node)
    }
    Else
        Forward(packet from parent)
}
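A minimal Python sketch of the per-window watchdog check is given below; the threshold name, counters, and alert fields are illustrative assumptions rather than the paper's implementation.

def watchdog_window(observed_packets, forwarded_packets, drop_threshold):
    # Packet dropping rate observed for the parent during one watchdog window W.
    if observed_packets == 0:
        return None
    drop_rate = 1.0 - (forwarded_packets / observed_packets)
    if drop_rate > drop_threshold:
        # Alert message to be sent to the decision node (fields are illustrative).
        return {"type": "ALERT", "parameter": "packet_drop_rate",
                "value": round(drop_rate, 3)}
    return None

print(watchdog_window(observed_packets=40, forwarded_packets=22, drop_threshold=0.3))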

4.2.3 Attack Detection by Decision Nodes

The decision node, after receiving alert messages, decides about the intrusion and sends its decision to the 6BR for RPL topology updating. If more than half of the monitoring nodes have raised an alert, then the parent node is considered to have abnormal behavior and is declared a malicious node, as shown in Algorithm 5.

(i) Detection at Non-Leaf Nodes: When a monitoring node ascertains abnormal activity, it issues an alert message to the decision nodes. The decision nodes make the final decision on the abnormal behavior of the parent based on the alert messages received from all of its monitoring nodes and send their decision on the attack to the 6BR for topology updating.

(ii) Detection at Leaf Nodes: The monitoring nodes at the leaf level of the RPL DODAG tree monitor not only their parent but also their siblings. When abnormal activity of a sibling node is determined from the watchdog counter, the monitoring node sends the alert message to its parent, which acts as the decision node for the leaf nodes. When abnormal activity of the parent is inferred, the monitoring node issues the alert message to the decision nodes selected by the 6BR. Thus, when monitoring nodes at the leaf level determine any abnormal activity of their parent node, they send alert messages to the decision nodes selected by the 6BR, using a mechanism similar to that of the non-leaf nodes.

Algorithm 5: Attack detection at decision node

While (within the watchdog window W)
{
    If (alert messages are received from more than half of the monitoring nodes) then
    {
        Create(Result)
        Send(Result, 6BR)
    }
    Else
        Drop(Alert)
}
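A minimal sketch of the majority-vote rule applied by the decision node (function and result names are illustrative assumptions):

def decide(alerts_received, monitoring_node_count):
    # The parent is declared malicious only when more than half of its
    # monitoring nodes raised an alert within the watchdog window.
    if alerts_received > monitoring_node_count // 2:
        return "MALICIOUS"   # result forwarded to the 6BR for topology update
    return "NORMAL"

print(decide(alerts_received=4, monitoring_node_count=6))  # MALICIOUS
print(decide(alerts_received=3, monitoring_node_count=6))  # NORMAL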

5 Simulation and Results

The experiments are performed in the Cooja [38] simulator, which uses the Contiki operating system with Tmote Sky sensor nodes. These sensor nodes measure the temperature of the given environment, which ranges from −10 to 30 degrees, and the experiment is repeated for four sets of categories: S1 = {(0, 10), (10.1, 20), (20.1, 30), (30.1, 40)}, S2 = {(0, 5), (5.1, 10), (10.1, 15), (15.1, 20), (20.1, 25), (25.1, 30), (30.1, 35), (35.1, 40)}, S3 = {(0, 7), (7.1, 14), (14.1, 21), (21.1, 28), (28.1, 35), (35.1, 40)}, and S4 = {(0, 15), (15.1, 30), (30.1, 40)}. The 6BR is not a constrained node and can be a PC or laptop; however, there currently exist no PC-equivalent 802.15.4 devices; therefore, we run the 6BR natively, i.e., on Linux. Unit Disk Graph Medium, Cooja's default lossy radio model, is used for transmission. Five different configurations are used, with 16, 24, 32, 40, and 48 nodes. Figure 4 shows the sample configuration with 32 nodes. The 6BR selects the monitoring and decision nodes for each configuration.

Fig. 4 Sample node configuration of 32 nodes with node 1 as 6BR, 14 monitoring nodes, 6 decision nodes, and 3 malicious nodes

5.1 Selective Forwarding Attack

6LoWPAN is a multi-hop network which depends on the participating nodes to forward the data. A malicious node may selectively reject certain messages which carry sensitive information, such as the movement of enemy tanks on a military battlefield, so that these messages do not reach the destination. This attack is called a selective forwarding attack. The proposed IDS mechanism can efficiently detect this attack with minimum resource consumption. The monitoring node observes the packet dropping rate of its parent. Let PD(Pij) be the packet dropping rate of parent node i detected by monitoring node j, and let β be the threshold value for the packet dropping rate of i. When the monitoring node observes a packet dropping rate greater than the threshold value, the monitoring node sends an alert message to decision node k. When the decision node receives alert messages from more than half of its monitoring nodes about node i, the decision node concludes that i is a malicious node. The result of the decision node is conveyed to the 6BR to exclude the malicious node from the RPL DODAG topology. The same mechanism can be applied to detect black hole and sinkhole attacks.

5.2 Vampire Attack Vampire attacks are kinds of resource depletion attack which is not protocol specific. These attacks do not rely on flooding but they use small amount data as possible to achieve greatest energy drain. It is hard to detect and prevent the vampire attacks because it uses protocol complaint message for attack. Vampire attacks can be performed in 6LoWPAN by exploiting the RPL routing protocol internal process for loop-free and error-free routing topology. 6LoWPAN is likely to node disconnection from the network due to worst link state or lack of battery power. To handle these conditions, RPL contains local and global repair mechanism. Local repair allows the node to route temporarily through neighbor node with same rank or select next preferred parent. Global repair mechanism rebuilds the entire DODAG with increment in DODAG version number. Global repair consumes more energy and network resources due to the increased control messages in DODAG construction. A malicious node alters the version number field included in the control messages by its parents. When these altered control messages propagated into the network leads to forced rebuild of RPL, DODAG topology results in largest energy depletion. The


Fig. 5 True-positive detection rate and false-positive rate with respect to number of nodes in the network (16–48 nodes)

The monitoring node verifies the DODAG version number of each packet that is received and forwarded by its parent. If there is any change in the version number, it generates an alert to the decision node. On receiving this kind of alert, the decision node reports the attack to the 6BR.
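The version-number check can be sketched in the same spirit; the function name and example values below are hypothetical:

```python
def check_dodag_version(observed_version, current_dodag_version):
    """Monitoring node: alert the decision node if the parent forwards a control
    message whose DODAG version differs from the one currently advertised by the
    6BR (a possible vampire attack via forced global repair)."""
    return observed_version != current_dodag_version

# Example: the 6BR advertised version 240, but the parent forwards version 241.
if check_dodag_version(241, 240):
    print("alert: unexpected DODAG version increment -> notify decision node")
```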

5.3 True Positive Rate

The detection of the selective forwarding attack by the proposed IDS becomes more efficient as the number of nodes increases, because the number of monitoring nodes grows with the total number of nodes in the network. Figure 5 shows the true-positive rate and false-positive rate in the detection of the selective forwarding attack. From the results, it is evident that the proposed IDS is very efficient against attacks in large networks with a greater number of nodes.

5.4 Energy

The use of IoT applications is strictly limited by the battery life of the nodes; hence, energy is a scarce resource. To measure the energy consumption of the proposed IDS, the Contiki Powertrace [40] application is used. The energy consumed over 30 min by all the Tmote Sky nodes in the different network topologies is calculated from the operational conditions of the Tmote Sky node, as shown in Fig. 6:

Energy (mJ) = (Transmit × Current Consumption (MCU on, Radio TX) + Listen × Current Consumption (MCU on, Radio RX) + CPU × Current Consumption (MCU on, Radio off) + LPM × Current Consumption (MCU idle, Radio off)) × Voltage / (4096 × 8)    (9)
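A minimal sketch of how Eq. (9) can be evaluated from Powertrace counters. The per-state current draws and the 3 V supply are datasheet-style assumptions for a Tmote Sky class node, not values reported in the paper, and the counter values in the example are made up; dividing by 4096 × 8 assumes the counters are in rtimer ticks (32768 ticks per second).

```python
# Hypothetical per-state current draws (mA) for a Tmote Sky class node.
I_TX = 19.5     # MCU on, Radio TX
I_RX = 21.8     # MCU on, Radio RX (Listen)
I_CPU = 1.8     # MCU on, Radio off
I_LPM = 0.0545  # MCU idle, Radio off
VOLTAGE = 3.0   # supply voltage (V), assumed

def energy_mj(transmit, listen, cpu, lpm):
    """Evaluate Eq. (9) from Powertrace tick counters; 4096 * 8 ticks per second
    converts tick * mA * V into mJ."""
    return ((transmit * I_TX + listen * I_RX + cpu * I_CPU + lpm * I_LPM)
            * VOLTAGE / (4096 * 8))

# Example with made-up Powertrace counters for a 30-minute run.
print(round(energy_mj(transmit=120_000, listen=260_000,
                      cpu=900_000, lpm=58_000_000), 1))
```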

Fig. 6 Energy consumption of RPL with and without EPIDS for 16, 24, 32, 40, and 48 nodes

5.5 Privacy Metric

The privacy metric measures the probability of predicting the actual sensor data from the masked data. When the number of categories L increases, the privacy of the data increases, as shown in Fig. 7. In the proposed system, the data privacy of the nodes is achieved with minimum energy consumption because the sensor nodes do not require any additional energy to ensure privacy, which is very suitable for 6LoWPAN-based IoT networks with energy-constrained nodes. The privacy measurement is based on the Maximum A Posteriori (MAP) estimate.

Fig. 7 Average privacy rate for 3, 4, 6, and 8 categories


Privacy = 1 − Σ_{n ∈ S} P(n | p_n) · P(p_n)    (10)

where p_n = arg max_{p ∈ S} P(p | n) is the MAP estimate.    (11)
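Under the reconstruction of Eqs. (10)–(11) above, the term P(n | p_n)·P(p_n) equals the joint probability P(n, p_n) = P(p_n | n)·P(n), so the privacy value can be computed directly from the conditional distribution P(p | n) and the category distribution P(n). The sketch below assumes these distributions are available as a matrix and a vector, which is a representational assumption rather than the paper's implementation:

```python
import numpy as np

def privacy(P_p_given_n, P_n):
    """Eqs. (10)-(11): the adversary's best guess of the true category p from a
    masked category n is the MAP estimate p_n = argmax_p P(p | n).  Since
    P(n | p_n) * P(p_n) equals the joint probability P(p_n | n) * P(n), the
    metric reduces to  Privacy = 1 - sum_n max_p P(p | n) * P(n).
    P_p_given_n[n, p] = P(p | n); P_n[n] = P(n)."""
    map_success = float(np.sum(np.max(P_p_given_n, axis=1) * P_n))
    return 1.0 - map_success

# Toy example with 4 categories and a fairly uninformative P(p | n).
P_p_given_n = np.array([[0.10, 0.30, 0.30, 0.30],
                        [0.30, 0.10, 0.30, 0.30],
                        [0.30, 0.30, 0.10, 0.30],
                        [0.30, 0.30, 0.30, 0.10]])
P_n = np.array([0.25, 0.25, 0.25, 0.25])
print(privacy(P_p_given_n, P_n))  # 0.7 -> privacy grows with more categories
```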

5.6 Utility

Utility measures the accuracy of the data reconstructed at the 6BR from the masked data sent by the sensors. The deviation of the reconstructed data from the original data is measured in terms of the Mean Squared Error (MSE):

MSE = (1/k) Σ_{i=1}^{k} (p_i − n_i)²    (12)

To find the MSE, the experiment is performed ten times with 32 sensors and a varying number of categories, and the average error over the ten runs is computed. The results clearly show that the error increases with the number of categories. To find the impact of the number of sensors on the error, the experiment is performed ten times with 8 categories and a varying number of sensors, and again the average error over the ten runs is computed. The results in Figs. 8 and 9 clearly show that the error decreases as the number of sensors increases.

Fig. 8 MSE for 32 sensor nodes with varying number of categories

Fig. 9 Error measurement for 8 categories

6 Conclusion

6LoWPAN makes the Internet of Things a reality, but privacy and security remain a great challenge. The proposed intrusion detection system allows data privacy and intrusion detection to coexist, overcoming the contradiction between intrusion detection mechanisms and privacy-preserving mechanisms. The privacy mechanism at the sensor nodes is computationally simple and does not increase the communication overhead. Attacks are detected locally, which reduces energy consumption. Extensive simulations are performed to evaluate the performance of the proposed system, and the results show that it achieves high data privacy and a low false alarm rate for intrusion detection with energy efficiency. The intrusion detection algorithm detects selective forwarding, black hole, spoofing, and vampire attacks. In the future, it can be extended to detect more attacks while also providing identity privacy and location privacy.

References 1. Yang Y, Wu L, Yin G, Li L, Zhao H (2017) A survey on security and privacy issues in internetof-things. IEEE Int Things J 4(5):1250–1258 2. Sun X, Wang C (2011) The research of security technology in the internet of things. In: Advances in computer science, intelligent system and environment. Springer Berlin Heidelberg, pp 113–119 3. Olsson J (2014) 6LoWPAN demystified. Texas Instrum 13 4. Le A, Loo J, Lasebae A, Aiash M, Luo Y (2012) 6LoWPAN: a study on QoS security threats and countermeasures using intrusion detection system approach. Int J Commun Syst 25(9):1189– 1212 5. K˚ur J, Matyáš V, Stetsko A, Švenda P (2011) Attack detection versus privacy—how to find the link or how to hide it? In: Security protocols XIX. Springer Berlin Heidelberg, pp 189–199 6. Rghioui A, Khannous A, Bouhorma M (2015) Monitoring behavior-based intrusion detection system for 6loWPAN networks. Int J Innovation Appl Stud 11(4):894 7. Shelby Z, Bormann C (2011) 6LoWPAN: the wireless embedded internet, vol 43. Wiley


8. Dunkels A, Gronvall B, Voigt T (Nov 2004) Contiki-a lightweight and flexible operating system for tiny networked sensors. In: 29th annual IEEE international conference on local computer networks. IEEE, pp 455–462. 9. Winter T, Thubert P, Brandt A, Hui JW, Kelsey R, Levis P, Pister K, Struik R, Vasseur JP, Alexander RK (2012) RPL: IPv6 routing protocol for low-power and lossy networks. No. RFC 6550:1–157 10. Vasseur J-P, Kim M, Pister K, Dejean N, Barthel D (2012) Routing metrics used for path calculation in low-power and lossy networks. No. RFC 6551 11. Hennebert C, Dos Santos J (2014) Security protocols and privacy issues into 6lowpan stack: a synthesis. Int Things J, IEEE 1(5):384–398 12. Chanal PM, Kakkasageri MS (2020) Security and privacy in IOT: a survey. Wirel Pers Commun 115(2):1667–1693 13. Deep S, Zheng X, Jolfaei A, Yu D, Ostovari P, Kashif Bashir A (2020) A survey of security and privacy issues in the internet of things from the layered context. Trans Emerg Telecommun Technol e3935 14. Abdul-Ghani HA, Konstantas D (2019) A comprehensive study of security and privacy guidelines, threats, and countermeasures: an IoT perspective. J Sens Actuator Netw 8(2):22 15. Gopalakrishnan K (2020) Security vulnerabilities and issues of traditional wireless sensors networks in IoT. In: Principles of internet of things (IoT) ecosystem: insight paradigm. Springer, Cham, pp 519–549 16. Pongle P, Chavan G (Jan 2015) A survey: attacks on RPL and 6LoWPAN in IoT. In: 2015 International conference on pervasive computing (ICPC). IEEE, pp 1–6 17. Verma A, Ranga V (2020) Security of RPL based 6LoWPAN networks in the internet of things: a review. IEEE Sens J 20(11):5666–5690 18. Kasinathan P, Costamagna G, Khaleel H, Pastrone C, Spirito MA (Nov 2013) An IDS framework for internet of things empowered by 6LoWPAN. In: Proceedings of the 2013 ACM SIGSAC conference on computer and communications security, pp 1337–1340 19. Amin SO, Siddiqui MS, Hong CS, Lee S (2009) RIDES: robust intrusion detection system for IP-based ubiquitous sensor networks. Sensors 9(5):3447–3468 20. Raza S, Wallgren L, Voigt T (2013) SVELTE: real-time intrusion detection in the internet of things. Ad Hoc Netw 11(8):2661–2674 21. Kasinathan P, Pastrone C, Spirito M, Vinkovits M (2013) Denial-of-service detection in 6LoWPAN based internet of things. In: 2013 IEEE 9th international conference on wireless and mobile computing, networking and communications (WiMob). IEEE, pp 600–607 22. Pongle P, Chavan G (2015) Real time intrusion and wormhole attack detection in internet of things. Int J Comput Appl 121(9):1–9 23. Le A, Loo J, Luo Y, Lasebae A (2011) Specification-based IDS for securing RPL from topology attacks. In: Wireless days (WD), 2011 IFIP. IEEE, pp 1–3 24. Butun I, Morgera SD, Sankar R (2014) A survey of intrusion detection systems in wireless sensor networks. Commun Surv Tutorials, IEEE 16(1):266–282 25. Mehmood A, Khanan A, Umar MM, Abdullah S, Ariffin KAZ, Song H (2017) Secure knowledge and cluster-based intrusion detection mechanism for smart wireless sensor networks. IEEE Access 6:5688–5694 26. Sachan RS, Wazid M, Singh DP, Goudar RH (April 2013) A cluster based intrusion detection and prevention technique for misdirection attack inside WSN. In: 2013 international conference on communication and signal processing. IEEE, pp 795–801 27. Cervantes C, Poplade D, Nogueira M, Santos A (May 2015) Detection of sinkhole attacks for supporting secure routing on 6LoWPAN for internet of things. 
In: 2015 IFIP/IEEE international symposium on integrated network management (IM). IEEE, pp 606–611 28. Glissa G, Rachedi A, Meddeb A (Dec 2016) A secure routing protocol based on RPL for internet of things. In: 2016 IEEE global communications conference (GLOBECOM). IEEE, pp 1–7 29. Hatzivasilis G, Papaefstathiou I, Manifavas C (2017) SCOTRES: secure routing for IoT and CPS. IEEE Int Things J 4(6):2129–2141


30. Kouachi AI, Bachir A (Oct 2020) Communication-flow privacy-preservation in 6lowpansbased iot networks. In: International symposium on modelling and implementation of complex systems. Springer, Cham, pp 33–47 31. Esponda F, Guerrero VM (2009) Surveys with negative questions for sensitive items. Statist Probab Lett 79(24):2456–2461 32. Liu R, Tang S (2015) Negative survey-based privacy protection of cloud data. In: Advances in swarm and computational intelligence. Springer International Publishing, pp 151–159 33. Horey J, Forrest S, Groat M (2012) Reconstructing spatial distributions from anonymized locations. In: 2012 IEEE 28th international conference on data engineering workshops (ICDEW). IEEE, pp 243–250 34. Zhao D, Luo W, Yue L (June 2016) Reconstructing positive surveys from negative surveys with background knowledge. In: International conference on data mining and big data. Springer, Cham, pp 86–99 35. Matyas V, Kur J (2013) Conflicts between intrusion detection and privacy mechanisms for wireless sensor networks. IEEE Secur Priv 5:73–76

Identifying Top-N Influential Nodes in Large Complex Networks Using Network Structure M. Venunath, P. Sujatha, and Prasad Koti

Abstract Online social networks are popular for various activities such as spreading information, creativity, and ideas, especially for viral marketing. The main focus in social influence analysis, known as the influence maximization problem (IMP), is to select top-N nodes so as to maximize the expected number of nodes activated by these top-N nodes (a.k.a. seed nodes). This issue has received a lot of attention and many studies have looked into the IMP, but they are usually too time-consuming to be useful in a complex social media network. The problem of seed selection is NP-hard. A greedy approach to the IMP is insufficient due to its use of time-consuming Monte Carlo simulations, which confine it to small networks. The greedy approach, on the other hand, offers a good approximation guarantee. In this paper, we present an algorithm for identifying communities and computing the ranking scores of nodes in the identified communities to solve the IMP with a focus on time efficiency.

Keywords Top-N influential nodes · Node ranking score · Information propagation · k-Shell decomposition · Complex networks

M. Venunath (B) · P. Sujatha
School of Engineering and Technology, Pondicherry University, Puducherry, India
e-mail: [email protected]

P. Koti
Department of Computer Applications, Saradha Gangadharan College, Puducherry, India
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_45

1 Introduction

In recent years, with the advancement of communication technology and the widespread use of the Internet, the number of people using social media sites has been on the rise. Online social networks have grown ingrained in the lives of all users. As a result, a vast volume of information and ideas is transmitted every day across networks that may affect a large number of individuals for a brief period [1–3]. Over the last two decades, many researchers have been concerned with the Influence


Spread (IS) and influence in social networks, and their results have paved the way for many online applications such as viral marketing [4], a successful strategy and practical platform that is used as a launching pad and spreads through word-of-mouth (W-O-M) on social media sites. One of the oldest and most practical kinds of marketing, word-of-mouth, is a simple way of getting an advertising or promotional message out to people. For example, a firm may wish to target a limited number of people (a.k.a. seeds) for a trial of a new brand via a social-media-networking medium, in the hope that these early adopters will encourage their friends to buy the product. The assumption is that word-of-mouth marketing will help the firm reach a large pool of targeted users. Under the spreading model, this situation is formally characterized as the influence maximization problem [5], which seeks to choose the initial seeds who can encourage the greatest number of consumers to accept a marketed product. Domingos et al. [5] were the first to present the IMP from an algorithmic standpoint, inspired by the concept of viral marketing. The IMP was then formulated as follows by Kempe et al. [6]: the IMP seeks to find the k influential vertices/nodes in a graph G that maximize the IS, given a predefined propagation model, a constant k, and the user influence probabilities. Each of the existing IM algorithms has its own set of flaws. The greedy methods produce high-quality seeds, but they rely on time-consuming Monte Carlo (MC) simulations to determine the correct marginal IS distribution. As a result, the efficiency of greedy methods restricts their usefulness on large-scale networks. Heuristic techniques are introduced to increase scalability; to achieve high efficiency and scalability, heuristic methods sacrifice some accuracy. Mixed methods are used to maintain a trade-off between accuracy and convergence; in essence, these methods trade off efficiency against accuracy.

1.1 Motivation and Main Contributions

Our research involves combining community detection (CD) with a heuristic-based method. A variety of algorithms and heuristic approaches are available for community discovery and seed selection. However, for a big social network, this remains a difficult challenge. Some methods require that the number and size of the partitions into which the network will be divided be known ahead of time. Consequently, the CD algorithm in this article aims to recognize the most natural communities, find community structures, and build a CD technique that does not require the number of communities to be specified. The following are the key contributions of our work. The suggested IM method has two distinguishing characteristics. First, it can greatly improve the efficiency of existing algorithms for identifying the top-k influential nodes by dividing the complex network into clusters using an H-clustering algorithm, thereby reducing the search space for finding influential users. Second, after dividing the network into clusters,


it then applies a heuristic technique to directly find the target nodes that propagate information. To discover an approximate solution in an acceptable computation time, the Hybrid-IM method combines the strength of general community detection with problem-specific heuristics. In this work, we use the IC propagation model to model information dissemination in the network.

2 Related Work

Kempe et al. [6] reframed the IMP as an optimization problem, presenting a greedy algorithm for locating seed users in order to maximize the network's influence spread, and proposed a greedy [7] approximation method with a (1 − 1/e − ε) approximation ratio to solve this NP-hard problem. Existing IM methods may be classified into three groups depending on how they approach the algorithmic design: simulation-based, heuristic-based, and community-based IM. Model generality is an advantage of simulation-based IM methods [8–12], i.e., these methods may simply incorporate any conventional diffusion model. For influence spread estimates, these methods employ time-consuming MC simulations. As a result, they suffer from inefficiency and limited accuracy, as well as considerable time complexity and memory usage, so simulation-based IM is best suited to small-scale networks. The main reason is that the greedy approximation method involves tens of thousands of MC simulations. Many attempts have been presented in recent years to successfully handle the IMP. The pressure applied by a user (person) to change an individual's attitudes, views, or feelings is known as social influence; the IMP is its formal counterpart. The amount of social influence may be assessed on the vertex. In a social network, the power that a node has over each of its neighbors to influence them with an idea is represented by the amount of social influence of the vertex, or the importance of the node, which is a probabilistic value in the interval (0, 1). The vertex's social influence may be seen as a measure of centrality. Diffusion models may be used to define the IS of nodes as well as to rank them, and in recent years they have frequently been utilized for concept dissemination in social media networks [13]. Due to the difficulty and limitations of real-world network monitoring, these models mimic the dissemination process in the real world and assess the true spreading capabilities of vertices by repeatedly simulating the procedure for each node. In complicated and vast networks, using diffusion models to determine the IS of nodes and rank them takes a long time. The social influence of a vertex can be viewed as a measure of centrality. Much of the heuristic-based literature is based on centrality, and heuristic-based algorithms attempt to increase the propagation effect. As a result, in recent years many node ranking techniques have been suggested to identify the spreading ability of nodes based on network information without using propagation models. In heuristic-based approaches, one of the metrics used to specify the spreading capability of vertices is the extent to which a vertex is placed near the graph's core. The topological placement of a node in the network determines its influence.


In all of these techniques, the top-N nodes in terms of a particular centrality measure are regarded as the most influential set. These techniques have the advantages of simplicity and low time complexity. Influential nodes have been identified using centrality metrics such as degree [14], betweenness [15], closeness [16], and k-shell [17]. These approaches are categorized into local, global, and hybrid methods based on the type of network-structure information used. Local techniques derive node influence only from local information; in other words, they specify node influence based purely on the nodes and their neighbors. Examples of local structural approaches are degree centrality [14] and H-index centrality [18]. The theory behind these approaches is that high-degree vertices with a large number of neighbors are more likely to spread successfully. These techniques have the advantages of simplicity and minimal time complexity. Global techniques, on the other hand, need to traverse the whole network and access its global information. Closeness centrality [16], betweenness centrality [15], and k-shell decomposition [17] are examples of global approaches, with k-shell decomposition being a popular one. Because the global structure of the network determines node influence in these approaches, they have a higher time complexity than local methods. k-shell centrality [17]: the authors suggested a time-efficient technique for determining coreness, termed k-shell decomposition. Each node is given a K_s index in this measure. To do this, 1-degree nodes are eliminated one by one until there are no more 1-degree nodes; the deleted nodes are then given K_s = 1. In the following phase, degree-2 vertices are eliminated, and the pruning process is repeated until no vertices with degree less than or equal to 2 remain; the nodes deleted in this stage are assigned K_s = 2. This procedure is continued until the graph is emptied of all nodes. A greater K_s value implies that the vertex is closer to the graph's core, and the vertices with the greatest K_s values are termed the graph's cores. Hybrid solutions: because users' social information plays an essential part in defining their influence, hybrid approaches combine this data with the network structure. Aside from network structure, [19] considers the number of messages transmitted between two users when determining the most significant nodes. The importance of communities in the dissemination of information has inspired community-based algorithms. Instead of spreading the influence of each vertex over the whole network, these algorithms split the network into clusters and then spread the influence of each vertex inside its cluster. Another way to reduce time complexity in this context is to use community-based IM techniques [20–27], which can give solutions that take into account the social network's community structure. Almost all of the research in this category follows a three-phase method, either implicitly or explicitly: (i) community identification, (ii) candidate generation, and (iii) top-N user selection. Chen et al. [27] use the heat diffusion model for influence maximization and construct H-clustering, a hierarchical community discovery method that successfully detects social network communities.
Following that, candidate set creation is carried out by identifying some major communities and picking candidate nodes from them, taking into account the size of the communities and their connections. Finally, a heuristic approach is used to choose the target set, with target


nodes being chosen based on their position in the communities. Existing techniques rely on community structure and algorithmic community discovery (Table 1).

Table 1 Community-based solutions of the IM problem: benefits and drawbacks

Wang et al.'s method [28] — Benefits: the first research into community-based solutions to the IM problem. Drawbacks: does not guarantee an optimized approximation.
CIM [27] — Benefits: greater scalability than various centrality-based algorithms. Drawbacks: does not guarantee an optimized approximation.
ComPath [30] — Benefits: higher scalability compared to centrality-based approaches. Drawbacks: does not guarantee an optimized approximation.
INCIM [31] — Benefits: scalability is increased in comparison to specific centrality-based approaches, LDGA, and simpath. Drawbacks: works only with the LT model; does not guarantee an optimized approximation.
CoFIM [32] — Benefits: higher scalability and spread of influence compared to other community-based approaches. Drawbacks: works only with the LT model; does not guarantee an optimized approximation.

3 Proposed Method

The proposed technique is novel in that it makes use of the network structure. The algorithm is split into two sections. In the first section, it determines communities using an H-clustering algorithm; in the second, ranking scores for the nodes within each cluster are determined using the cluster's local information. The node ranking scores are computed independently within each cluster in order to select seed nodes from it. However, when multiple nodes must be selected from a large cluster (sub-network) rather than scattered over the complete network, propagation models are used to estimate the spreading capability of a node. With this technique, we can select multiple nodes effectively without them being clustered together in the subgraph (cluster), which enhances performance by allowing information to spread more quickly. To avoid MC simulations when choosing the seed nodes, the suggested technique decomposes the nodes in the cluster using k-core decomposition [17]. It then identifies the top-N nodes based on their RS values.

Algorithm 1
Input: Graph G(V, E); undirected and unweighted
Output: top-k influential spreaders
Begin:
1: IS ← Ø;   ← influential spreaders or target nodes


//* Section (i): community/cluster identification/detection *//
2: C = {c_1, c_2, c_3, …, c_p} ← H_Clustering(G);
3: C′ = {c′_1, c′_2, c′_3, …, c′_q} ← prune unimportant communities in C;
//* Section (ii): target/top-N node selection *//
4: for all v_i ∈ V in each community of the network graph G:
   Deg(v_i);   ← compute the degree of every node
   KD(v_i);   ← compute the k-core decomposition value of every node
   δ_i;   ← find each node's normalized iteration multiplier (NIM)
   NGI(v_i) = Deg(v_i) × KD(v_i) × δ_i / |V|;   ← normalized global importance (NGI)
   RLGI(v_i) = NGI(v_i) × deg(v_i) / Σ_{∀j∈Γ(v_i)} NGI(v_j);   ← relative local–global importance (RLGI)
   RS(v_i) = RLGI(v_i) / max(RLGI);   ← compute the ranking score
   IS ← add the first-ranked node
5: Return IS.

Section (i): A community is a group of individuals in the social network who connect with each other more intimately than with individuals outside the community [33]. Hierarchical clustering (abbreviated H-Clustering) is the clustering technique designed for phase one of IM. In H-Clustering, we use a bottom-up strategy to repeatedly combine vertices with significant structural similarity (SS) into communities, together with the concept of modularity [34]. H-Clustering first calculates the SS between every node and its surrounding nodes in the provided social media network, where the SS serves as the edge weight for the neighboring vertices. Equation (1) defines the similarity between two neighboring vertices u and v:

Sim(u_i, v_j) = |adj(u) ∩ adj(v)| / √(|adj(u)| × |adj(v)|)    (1)
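A small sketch of Eq. (1) using networkx; here adj(x) is taken to be the set of neighbours of x, whereas some SCAN-style definitions also include the node itself, so this is an assumption to check against the Table 2 values. The example graph is arbitrary:

```python
import math
import networkx as nx

def structural_similarity(G, u, v):
    """Eq. (1): Sim(u, v) = |adj(u) & adj(v)| / sqrt(|adj(u)| * |adj(v)|),
    with adj(x) taken as the open neighbourhood of x (an assumption)."""
    nu, nv = set(G[u]), set(G[v])
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

# Edge weights for H-Clustering: similarity of every adjacent pair.
G = nx.karate_club_graph()   # any undirected, unweighted graph
sim = {(u, v): structural_similarity(G, u, v) for u, v in G.edges()}
print(sorted(sim.items(), key=lambda kv: -kv[1])[:3])  # three most similar edges
```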

H-Clustering initially treats every node as a community and, after collecting the SS of all edges in the network, clusters every pair of vertices into a community if the SS between those two vertices is the highest among their surrounding edges. Until a termination condition is fulfilled, the procedure is repeated (Table 2). To decide whether to terminate the community identification process, we use the modularity gain (MG) [35] to assess the quality of the identified communities. The modularity function is defined as follows, based on the SHRINK algorithm [36]:

Q(C) = Σ_{i=1}^{p} [ IS_i / TS − (DS_i / TS)² ]    (2)

where IS_i = Σ_{u,v ∈ c_i} Sim(u, v), DS_i = Σ_{u ∈ c_i, v ∈ V} Sim(u, v), and TS = Σ_{u,v ∈ V} Sim(u, v) is the sum of similarity between any two vertices in the network.

Table 2 Similarity values of each edge in the graph

Edge        Sim(u_i, v_j)      Edge         Sim(u_i, v_j)
(v1, v2)    0.816              (v6, v7)     0.755
(v2, v3)    0.577              (v7, v9)     0.377
(v3, v4)    0.612              (v7, v8)     0.654
(v3, v5)    0.670              (v7, v10)    0.566
(v4, v14)   0.288              (v8, v10)    0.866
(v4, v5)    0.912              (v10, v11)   0.707
(v4, v6)    0.816              (v9, v13)    0.707
(v4, v7)    0.617              (v9, v12)    0.707
(v5, v6)    0.894
(v5, v7)    0.676

The MG value of any two different clusterings C and C′ of a graph is defined as ΔQ_{C→C′} = Q(C′) − Q(C). The MG is used as the terminating criterion in H-Clustering. In every iteration, we aggregate all pairings based on the clustering outcome of the previous iteration. Suppose the last iteration's clustering result and the current iteration's clustering result are C and C′, respectively. If the gain in modularity from C to C′ is negative, H-Clustering halts clustering, since the prior clustering result is already good enough to generate larger communities of vertices with the strongest SS among their neighbors. As a result, the network has some homeless nodes that are not part of any community.

Section (ii): Given a social media network G = (V, E), during the k-core decomposition process the overall number of iterations for the degree-k stage is m_k, and vertex v_i ∈ V is deleted in iteration number n_k, with 1 ≤ n_k ≤ m_k. The NIM is therefore defined as:

δ_i = 1 + n_k / max(m_k)    (3)

where max(m_k) is the maximum number of total iterations over every k. While conducting the k-core decomposition, the NIM captures the normalized iteration at which a vertex is eliminated from the cluster. This concept is adapted from [31]. By computing the NIM, we can establish how vital a vertex is and what its normalized k-core value in the cluster is. From a cluster perspective, a vertex with a higher NIM is more significant, since the NIM captures the node's local significance/importance. The present method then computes the NGI using the NIM, the degree of the vertex, and the k-core decomposition value as follows:

NGI(v_i) = deg(v_i) × KD(v_i) × δ_i / |V|    (4)
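The NIM and NGI of Eqs. (3)–(4) can be obtained from a k-core peeling loop that records, for each node, the shell it falls into and the iteration within that shell at which it is removed. The sketch below is a hypothetical implementation (the names, and the grouping of isolated nodes into the first shell, are assumptions), not the authors' code:

```python
import networkx as nx

def nim_ngi(G):
    """Peel the graph shell by shell, recording each node's shell index KD(v)
    and the iteration n_k (within shell k) at which it was removed; m_k counts
    the iterations needed for shell k.  Eqs. (3)-(4) then give NIM and NGI."""
    H = G.copy()
    deg0 = dict(G.degree())            # degree in the original (sub)graph
    kd, n_of, m_k = {}, {}, {}
    k = 1
    while H.number_of_nodes() > 0:
        iteration = 0
        while True:
            batch = [v for v, d in H.degree() if d <= k]
            if not batch:
                break
            iteration += 1
            for v in batch:
                kd[v], n_of[v] = k, iteration
            H.remove_nodes_from(batch)
        if iteration:
            m_k[k] = iteration
        k += 1
    max_mk = max(m_k.values())
    nim = {v: 1 + n_of[v] / max_mk for v in G}                              # Eq. (3)
    ngi = {v: deg0[v] * kd[v] * nim[v] / G.number_of_nodes() for v in G}    # Eq. (4)
    return kd, nim, ngi

kd, nim, ngi = nim_ngi(nx.karate_club_graph())
print(max(ngi, key=ngi.get), round(max(ngi.values()), 3))
```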

The NGI calculates local significance by normalizing the NIM based on the cluster size. The NGI calculates a node’s global importance in the cluster by combining its


local and global significance if we want to choose more than two nodes from a single cluster. The proposed technique then computes the RLGI from the NGI, taking into consideration the nodes' immediate surroundings to avoid clustering of influential nodes. The RLGI is calculated as follows:

RLGI(v_i) = NGI(v_i) × deg(v_i) / Σ_{∀j∈Γ(v_i)} NGI(v_j)    (5)

The RS of the nodes is calculated using the RLGI, and the set of influential vertices in the cluster is found by ranking the nodes according to their RS (Table 3):

RS(v_i) = RLGI(v_i) / max(RLGI)    (6)

Table 3 Ranking scores of nodes

Node       RS(v_i)    Rank       Node        RS(v_i)    Rank
v7         1          1          v6, v9      0.278      7
v4         0.755      2          v1          0.174      8
v5         0.463      3          v8          0.154      9
v10        0.363      4          v12, v13    0.116      10
v2         0.353      5          v11         0.077      11
v3         0.281      6          v14         0.031      12

After calculating the ranks of all nodes in each cluster, we choose the best-ranked node from each cluster as a seed node and apply the IC model to calculate the expected diffusion value (influence spread) of the selected nodes. We then choose the nodes with the highest diffusion value, depending on the value of N (top-N nodes).
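Continuing the sketch above, Eqs. (5)–(6) and the final IC-based selection can be written as follows. The uniform propagation probability, the placeholder NGI values, and the candidate set (which in the paper is the best-ranked node of each cluster) are assumptions for illustration:

```python
import random
import networkx as nx

def ranking_scores(G, ngi):
    """Eqs. (5)-(6): RLGI(v) = NGI(v) * deg(v) / sum of NGI over v's neighbours,
    then RS(v) = RLGI(v) / max(RLGI)."""
    rlgi = {v: ngi[v] * G.degree(v) / sum(ngi[u] for u in G[v])
            for v in G if G.degree(v) > 0}
    top = max(rlgi.values())
    return {v: s / top for v, s in rlgi.items()}

def ic_spread(G, seeds, p=0.05, runs=500):
    """Monte Carlo estimate of the expected spread under the Independent Cascade
    model with a uniform edge propagation probability p (an assumption)."""
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), set(seeds)
        while frontier:
            nxt = set()
            for u in frontier:
                for v in G[u]:
                    if v not in active and v not in nxt and random.random() < p:
                        nxt.add(v)
            active |= nxt
            frontier = nxt
        total += len(active)
    return total / runs

G = nx.karate_club_graph()
# Placeholder NGI values (in the paper these come from Eq. (4) per cluster).
ngi = {v: (G.degree(v) + 1) / G.number_of_nodes() for v in G}
rs = ranking_scores(G, ngi)
candidates = sorted(rs, key=rs.get, reverse=True)[:6]  # stand-in for per-cluster best nodes
top_n = sorted(candidates, key=lambda v: ic_spread(G, [v]), reverse=True)[:3]
print(top_n)
```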

4 Performance

In Fig. 1 we compare the running times of five different community detection algorithms, Kcut [37], SHRINK [36], METIS [38], H_Clustering [27], and the agglomerative clustering algorithm [39], on the NETHep dataset; in this scenario, the time performance of the algorithms is observed. From the observations, the H_Clustering method takes the least time among all the methods for discovering the communities in the given network. Next, we find the target nodes for the IMP. The proposed scalable method is effective in finding the set of top-N nodes. The NIM, degree computation, k-core decomposition, NGI, and RLGI calculations each take time of the order of O(n), so the time complexity of this method is O(n).


Fig. 1 Run time on the NETHep dataset

5 Conclusion and Future Directions

In this research, small sets of seed vertices in unweighted and undirected real-world social media complex networks are discovered using a unique approach. In the present work, we first find the clusters in the network dataset using H_Clustering; to find the seed nodes, we then compute the ranking scores, and the seeds are selected based on the scores obtained. The technique is based purely on the network structure, and no other information is required. It is good at locating key nodes and can quickly disseminate information throughout a large social media network. It is also scalable, because the ranking value of the nodes is determined by the local network topology of each community, and as a result it identifies the network's top-N influential vertices.

References 1. Kianian S, Rostamnia M (2021) An efficient path-based approach for influence maximization in social networks. Expert Syst Appl 167(Sept 2020):114168. https://doi.org/10.1016/j.eswa. 2020.114168 2. Banerjee S, Jenamani M, Pratihar DK (2020) A survey on influence maximization in a social network. Knowl Inf Syst 62(9):3417–3455. https://doi.org/10.1007/s10115-020-01461-4 3. Tong G, Wu W, Tang S, Du DZ (2017) Adaptive influence maximization in dynamic social networks. IEEE/ACM Trans Netw 25(1):112–125. https://doi.org/10.1109/TNET.2016.256 3397 4. Wang F, Jiang W, Li X, Wang G (2018) Maximizing positive influence spread in online social networks via fluid dynamics. Futur Gener Comput Syst 86:1491–1502. https://doi.org/10.1016/ j.future.2017.05.050 5. Domingos P, Richardson M (2001) Mining the network value of customers. Proceedings seventh ACM SIGKDD international conference knowledge discovery data mining, pp 57–66. https:// doi.org/10.1145/502512.502525 6. Kempe D, Kleinberg J, Tardos É (2003) Maximizing the spread of influence through a social network. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 137–146. https://doi.org/10.1145/956750.956769 7. Granovetter M (1978) Threshold models of collective behavior. Am J Sociol 83(6):1420–1443. https://doi.org/10.1086/226707


8. Leskovec J, Krause A, Guestrin C, Faloutsos C, Vanbriesen J, Glance N (2007) Cost-effective outbreak detection in networks. Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining, pp 420–429. https://doi.org/10.1145/1281192.128 1239 9. Goyal A, Lu W, Lakshmanan LVS (2011) CELF++: optimizing the greedy algorithm for influence maximization in social networks. Proceedings 20th international conference companion world wide web, WWW 2011, pp 47–48. https://doi.org/10.1145/1963192.1963217 10. Chen W, Wang C, Wang Y (2010) Scalable influence maximization for prevalent viral marketing in large-scale social networks. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1029–1038. https://doi.org/10.1145/183 5804.1835934 11. Cohen E, Delling D, Pajor T, Werneck RF (2014) Sketch-based influence maximization and computation, pp 629–638. https://doi.org/10.1145/2661829.2662077 12. Tang Y, Shi Y, Xiao X (2015) Influence maximization in near-linear time: a martingale approach, pp 1539–1554. https://doi.org/10.1145/2723372.2723734 13. Zhang H, Mishra S, Thai MT (2014) Recent advances in information diffusion and influence maximization of complex social networks. Opportunistic Mob Soc Networks 37–70 https:// doi.org/10.1201/b17231 14. Freeman LC (1978) Centrality in social networks conceptual clarification. Soc Networks 1(3):215–239. https://doi.org/10.1016/0378-8733(78)90021-7 15. Freeman LC (2016) A set of measures of centrality based on betweenness. 40(1):35–41. S Publications author (s): Published by : American Sociological Association Stable : http:// www.jstor.org/stable/3033543 16. Sabidussi G (1966) The centrality index of a graph, pp 581–603 17. Kitsak M et al (2010) Identification of influential spreaders in complex networks. Nat Phys 6(11):888–893. https://doi.org/10.1038/nphys1746 18. Zareie A, Sheikhahmadi A (2019) EHC: extended H-index centrality measure for identification of users’ spreading influence in complex networks. Phys A Stat Mech Appl 514:141–155. https://doi.org/10.1016/j.physa.2018.09.064 19. Peng S, Yang A, Cao L, Yu S, Xie D (2017) Social influence modeling using information theory in mobile social networks. Inf Sci (Ny) 379:146–159. https://doi.org/10.1016/j.ins.2016.08.023 20. Rajeh S, Savonnet M, Leclercq E, Cherifi H (2021) Comparing community-aware centrality measures in online social networks BT—computational data and social networks, pp 279–290 21. Scripps J, Tan PN, Esfahanian AH (2007) Exploration of link structure and community-based node roles in network analysis. Proceedings—IEEE international conference data mining, ICDM, pp 649–654. https://doi.org/10.1109/ICDM.2007.37 22. Cao T, Wu X, Wang S, Hu X (2011) Maximizing influence spread in modular social networks by optimal resource allocation. Expert Syst Appl 38(10):13128–13135. https://doi.org/10.1016/j. eswa.2011.04.119 23. Guo S, Yang D, Yan Q (2011) Influence maximizing and local influenced community detection based on multiple spread model. In: Advanced data mining and applications, pp 82–95 24. Peng W, Lee S, Chen Y, Chang S, Chou C (2012) Exploring community structures for influence maximization in social networks. In: The 6th workshop on social network mining and analysis held in conjunction with KDD, SNA-KDD, pp 1–6 25. Lv J, Guo J, Ren H (2013) A new community-based algorithm for influence maximization in social network. J Comput Inf Syst 9(14):5659–5666 26. 
Song G, Zhou X, Wang Y, Xie K (2015) Influence maximization on large-scale mobile social network: a divide-and-conquer method. IEEE Trans Parallel Distrib Syst 26(5):1379–1392. https://doi.org/10.1109/TPDS.2014.2320515 27. Chen YC, Zhu WY, Peng WC, Lee WC, Lee SY (2014) CIM: community-based influence maximization in social networks. ACM transactions intelligent system technology 5(2). https:// doi.org/10.1145/2532549 28. Xie K, Wang Y, Cong G, Song G (2010) Community-based greedy algorithm for mining top-K influential nodes in mobile social networks categories and subject descriptors. Processing 16th ACM SIGKDD international conference knowledge discovery data mining, pp 1039–1048


29. Chen Y, Chang S, Chou C, Peng W, Lee S (2012) Exploring community structures for influence maximization in social networks. Proceedings of the 6th SNA-KDD workshop on social network mining and analysis held in conjunction with KDD12 (SNA-KDD12) 30. Rahimkhani K, Aleahmad A, Rahgozar M, Moeini A (2015) A fast algorithm for finding most influential people based on the linear threshold model. Expert Syst Appl 42(3):1353–1361. https://doi.org/10.1016/j.eswa.2014.09.037 31. Bozorgi A, Haghighi H, Sadegh Zahedi M, Rezvani M (2016) INCIM: a community-based algorithm for influence maximization problem under the linear threshold model. Inf Process Manag 52(6):1188–1199. https://doi.org/10.1016/j.ipm.2016.05.006 32. Shang J, Zhou S, Li X, Liu L, Wu H (2017) CoFIM: a community-based framework for influence maximization on large-scale networks. Knowl-Based Syst 117:88–100. https://doi. org/10.1016/j.knosys.2016.09.029 33. Wasserman S, Faust K (1994) Social network analysis: methods and applications. Cambridge University Press 34. Newman MEJ (2006) Modularity and community structure in networks. Proc Natl Acad Sci USA 103(23):8577–8582. https://doi.org/10.1073/pnas.0601602103 35. Schweiger T, Feng Z, Xu X, Yuruk N (2007) A novel similarity-based modularity function for graph partitioning. Processes 9th international conference data warehousing knowledge discovery, pp 385–396 36. Huang J, Sun H, Han J, Deng H, Sun Y, Liu Y. SHRINK : a structural clustering algorithm for detecting hierarchical communities in networks categories and subject descriptors. Sci Technol 219–228 37. Ruan J, Zhang W (2007) An efficient spectral algorithm for network community discovery and its applications to biological and social networks. Proceedings—IEEE international conference data mining, ICDM, pp 643–648. https://doi.org/10.1109/ICDM.2007.72 38. Karypis G, Kumar V (2014) Multilevel algorithms for multi-constraint graph partitioning. 28–28. https://doi.org/10.1109/sc.1998.10018 39. Girvan M, Newman MEJ (2002) Community structure in social and biological networks. Proc Natl Acad Sci USA 99(12):7821–7826. https://doi.org/10.1073/pnas.122653799

Push and Pull Factors for Successful Implementation of ERP in SMEs Within Klang Valley: A Roadmap Anusuyah Subbarao and Astra Hareyana

Abstract Enterprise Resource Planning (ERP) is an integrated system implemented by businesses to improve internal business processes and to provide greater efficiency, automation, and cross-functional support across all business functions. Although widely adopted by multinational corporations, adoption of the system by small and medium-sized enterprises (SMEs) is not as popular due to the complexity of implementing such systems, the resources needed to go through the process, which SMEs often lack, and the failure factors that can inhibit successful ERP implementation. Therefore, this paper sets out the roadmap to formulate the ERP implementation success model for SMEs within Klang Valley, which takes into account the push factors and pull factors that either enable or inhibit the successful implementation of ERP in SMEs within Klang Valley. The research will employ qualitative methods, as it intends to have in-depth discussions with a diverse set of experts from different walks of life. The research will also provide the research community and the IT industry a better understanding of the factors that contribute to a successful implementation of ERP in SMEs, as well as provide ERP implementers a model that will guide them through real-world ERP implementation projects in SMEs. The research will cover SMEs located within the Klang Valley, Malaysia.

Keywords ERP implementation · Push factors · Pull factors

A. Subbarao (B) · A. Hareyana
Multimedia University, 63100 Cyberjaya, Selangor, Malaysia
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
R. Buyya et al. (eds.), Computational Intelligence and Data Analytics, Lecture Notes on Data Engineering and Communications Technologies 142, https://doi.org/10.1007/978-981-19-3391-2_46

1 Introduction

Enterprise Resource Planning (ERP) is an integrated system implemented by businesses to improve internal business processes in various departments in order to provide greater efficiency, increased automation of processes, and improved communication across the business [1]. Widely adopted across the world, ERP systems are becoming an important part of organizational practices and business operations [2]. The benefits that ERP systems bring into a business are massive.


ERP allows employees in an organization to work efficiently as the system breaks down barriers between business units or divisions through the automation of business processes, improves customer service through a single source for billing and relationship tracking, enhances financial compliance through regulatory standards, and provides real-time data flow and views that help in addressing operational issues efficiently [1]. These deliverables create uniformity in an organization and allow the enterprise to generate optimal outputs and run efficient operations throughout the company [3]. Although ERP is widely known in the market and its adoption rate has been growing at a rapid pace since its inception, ERP implementation can be a costly endeavor with complex elements that typically only large businesses are able to invest in [2]. With budget constraints and the high complexity, ERP is usually an afterthought for most small and medium-sized enterprises (SMEs). However, in recent years ERP vendors have started to offer affordable options to SMEs that are easy to install and pre-configured [4]. As SMEs continue their rapid growth, ERP's ability to improve operational productivity and efficiency can further grow SMEs and increase their competitive edge in the market [5]. SMEs play an important role in a country's economy as they contribute to the growth of a nation's economy and the generation of employment. Malaysian-based SMEs contributed 38.9% of GDP in 2019, a substantial increase from 2018, when SMEs contributed 38.3% [6]. ERP implementation is gaining traction among SMEs, but adoption issues do exist. Among the issues are the misfit and misalignment between a vendor's pre-configured ERP system and an SME's business strategy [4]. Even though pre-configured ERP system solutions by vendors feature "best industry practices and standards," established business processes are hard to forgo for some SMEs, as they are what makes these SMEs unique, especially when the SME is in a niche market with high market capitalization [7]. However, these are not the only constraints. Many push factors, also called barriers, exist that can inhibit the success of an ERP implementation in SMEs. These push factors affect the implementation negatively from numerous angles, ranging from project management to the ERP package itself. The pull factors, or enablers, allow the implementation to meet its goals and ensure the project runs smoothly. The purpose of this research is to investigate the various push and pull factors of ERP implementation in SMEs within the Klang Valley. As this is qualitative research, semi-structured interviews will be utilized, and all the experts to be interviewed shall have experience with SMEs within the Klang Valley. The end goal of this research is to create the ERP implementation success model for SMEs within Klang Valley, which could serve ERP implementers in the Klang Valley when conducting ERP implementation projects in SMEs within the Klang Valley area. The paper will begin by discussing the research in relation to past works on the issue to gain better insight. In the 3rd section, the paper will highlight the plan and guidelines for the formulation of the ERP implementation success model. In the 4th section, the paper will dive into the various initial results


of the research. The final section will highlight the conclusion and the future work of this research.

2 Related Work

ERP implementation is oftentimes treated as a project or a temporary process that a company and its business units undertake, but it ought to be considered a dynamic and ongoing process with no closure or end [8]. If implemented correctly, ERP systems bring plenty of benefits to a business, among them increased efficiency, reduced operating costs, and integrated and centralized business processes [2]. ERP implementation also has its challenges. It is a costly and complex endeavor that requires an intensive decision-making process by management, as ERP systems can affect the whole organization and have a high rate of failure during implementation [9]. Furthermore, ERP implementation is resource intensive, requiring substantial workforce allocation, training, and top management commitment, not to mention a long implementation time frame; this strains resource allocation, of which most SMEs have little, and creates a barrier and constraint for SMEs attempting to implement ERP [4].

2.1 ERP Success Parameters

Success parameters, with regard to ERP implementation, are used to identify whether the ERP implementation was a success or not based on a subset of parameters [10]. Among the commonly cited success parameters are: (1) completion of the project within a given budget; (2) project completion within the stipulated time frame; and (3) realization of large business benefits [5]. Completing a project within the given budget is important, as excessive expenditure and budget overruns create risk for an organization, especially for SMEs that have limited resources [10]. Secondly, project completion within the stipulated time frame is important, as delays during the ERP implementation process can be detrimental to an organization: they postpone the positive returns the system has to offer while draining further resources [10]. Third and lastly, realization of large business benefits is a vital parameter, as high expectations can lead to major disappointment, especially when many promises were made about how the ERP system will bring huge business benefits only for them not to come to fruition [5].


2.2 ERP Pull Factors

ERP implementation has its benefits, which is why many companies, particularly multinational companies, have adopted it. SMEs are also starting to adopt ERP as it gives them a competitive edge in a highly competitive marketspace [2]. However, ERP implementation has risk factors that can affect it negatively [5]. Hence, the following are several commonly cited pull factors, or enablers, that enhance the ERP implementation process in SMEs. The first pull factor is change management. Employees are typically the ones impacted by new changes that affect the way they do their work, which creates resistance toward the system when not addressed accordingly [11]. Thus, effective change management helps in reducing resistance toward the new system while maximizing the benefits of ERP implementation by preparing employees for the changes [12]. When the ERP system goes live, training, which is a part of change management, is essential. Employee involvement in the training allows the project team to understand the needs of the staff while getting sufficient feedback on the system [10]. The second pull factor is top management support. Leadership plays a vital role in the success of the ERP system, and such a role is driven by the commitment and strong support of top management [10]. The effects of such endorsement by top management are better resource allocation for the project, designation of the project as a major priority for the organization, and a proactive role by senior management in solving implementation issues through strong leadership [5]. As a result, employees' acceptance of the new system is greater due to strong support by top management, and the allocation of training for the system further improves employee commitment [11]. The third pull factor is project management. ERP system success is based on the process of project management, whose focus is on initiating, planning, implementing, and controlling project activities to achieve project goals and milestones [13]. Additionally, project management develops the project implementation plan, which defines the project activities, establishes roles and responsibilities for each activity, and encourages organizational support throughout the implementation process [14]. The fourth pull factor is effective communication. Effective communication throughout the organization before, during, and after the ERP implementation phase is vital to inform all stakeholders of the project progress, its timeline, and the implementation strategy [5]. Furthermore, communication across the business functions and departments gets the word out about the benefits of the system and the changes to business functions that will come into effect with the new system, which greatly informs the users while also enabling feedback from them [12]. The fifth pull factor is training and education. Training ERP users is crucial for the success of the entire ERP implementation project, as training provides effective understanding of the new business processes and practices that will come with the


new system [14]. Training and educating the users also ensure the system is being used effectively, which creates a more productive and effective workplace [12]. The sixth and last pull factor is the project team. Forming an effective ERP project team is a critical pull factor, as the team is in charge of achieving implementation success by meeting all expectations and goals envisioned by the project team and top management [14]. Team composition is part of creating an effective team, which means the team must consist of representatives from different departments, while also ensuring involvement of all stakeholders [13]. ERP implementation success also depends on the competence of the team: experience and knowledge of ERP implementation are positive attributes that have a positive impact on the team and the project as a whole [12].

2.3 ERP Push Factors

Although ERP is widely adopted by large enterprises, SMEs are not as convinced to adopt ERP due to push factors, or barriers, that can impact the implementation process negatively [4]. Hence, the following are several commonly cited push factors that inhibit the successful implementation of ERP in SMEs. The first push factor is lack of commitment from top management. Top management provides leadership and allocates the resources that are crucial to the success of the ERP implementation process [5]. Lack of that commitment leads to a lack of leadership, which causes the vision of the project to collapse and resources not to be allocated sufficiently [4]. The second push factor is ineffective change management. The lack of proper change management processes and procedures leads to ERP implementation failure as employees become suspicious of and behave against the implementation [5]. Training is part of the issue, as change management is required to provide training to ERP users as a way to improve acceptance of the system; inadequate training of ERP users leads to lack of confidence and anxiety among the users, which then amplifies the resistance against the implemented system [2]. The third push factor is a poor ERP package. Choosing the wrong ERP package can cause implementation and customization issues that inevitably cause the entire project to fail [15]. Misfits in ERP implementation are a major issue, as business processes that are not aligned with the ERP package lead to operational issues in the organization's business functions [7]. The fourth and last push factor is employee resistance to change. Inadequate training, poor communication, and lack of commitment and support from top management lead to resistance among employees toward the new system, as does poor adaptability [4]. Without proper knowledge and guidance on the new system, employees develop doubts about it, leading them to resent it and attempt to sabotage the implementation efforts [5].


3 Formulation

The formulation of the ERP implementation success model for SMEs within Klang Valley is the main goal of this research. This paper serves as a roadmap, with the aim of receiving feedback and further breakthroughs for our model formulation. The model features four important phases that are based upon our research objectives: Phase 1, to identify the enablers that enhance the successful implementation of ERP in SMEs by using literature; Phase 2, to identify the enablers that enhance the successful implementation of ERP in SMEs by using semi-structured interviews; Phase 3, to identify the barriers that inhibit the successful implementation of ERP in SMEs by using literature; and Phase 4, to identify the barriers that inhibit the successful implementation of ERP in SMEs by using semi-structured interviews. Figure 1 details each phase and its activities.

3.1 Research Phases

Phase 1 Objective. To identify the enablers that enhance the successful implementation of ERP in SMEs by using literature.

In Phase 1, literature review will be deployed, as it helps inform the researcher of past works and concepts, which helps the researcher develop their approach to the topic [16]. Furthermore, literature review allows new ideas to be developed based on data captured in previous literature, which allows the researcher to develop new concepts [16]. As such, Phase 1 will begin with Activity 1, where the researcher searches for previous literature relevant to the research. Activity 2 will see the review of that literature, and Activity 3 will see the extraction of relevant information from it, whereupon the final output will be generated.

Phase 2 Objective. To identify the enablers that enhance the successful implementation of ERP in SMEs by using semi-structured interview.

Phase 2 will utilize semi-structured interviews to meet the objective stated in this phase. As this research employs qualitative methods, a semi-structured interview will be deployed as a means of extracting data for the research. A semi-structured interview is an effective method for data collection as it collects qualitative and open-ended data, explores participants' thoughts and beliefs about the topic, and allows for a more in-depth look into personal and sensitive issues that are relevant to the topic [17]. Activity 4 begins with the development of the interview instrument, which is influenced by past knowledge and studies to ensure its relevancy to the issue at hand, while in Activity 5 the questions will undergo a pilot interview to confirm their relevancy and suitability for the semi-structured interview [18]. During the data collection process, purposive sampling, specifically expert sampling, will be utilized as a means of selecting the participants to be studied, to ensure



Fig. 1 Formulation process of ERP implementation success model for SMEs within Klang Valley

Experts with particular expertise will be selected for the research, and their knowledge extracted. Activity 6 therefore applies the sampling method described here to collect the data needed, and Activity 7 covers the analysis and extraction of those data to generate the output of Phase 2.

Phase 3 Objective. To identify the barriers that inhibit the successful implementation of ERP in SMEs using the literature. Phase 3 uses a literature review, as in Phase 1, to meet the objective set for this phase. The phase begins with Activity 8, in which the researcher searches for literature relevant to the topic. In Activity 9, the researcher



reviews the literature found relevant and suitable to this phase and its objective. Finally, the information deemed relevant is extracted from that literature in Activity 10.

Phase 4 Objective. To identify the barriers that inhibit the successful implementation of ERP in SMEs using semi-structured interviews. To meet the objective of Phase 4, semi-structured interviews will be conducted to generate the output set in the formulation process. The activities are similar to those in Phase 2: Activity 11 covers the development of the interview instrument and the questions relevant to the objective, Activity 12 covers a pilot interview to determine the suitability and relevance of the questions, Activity 13 begins the data collection process for this phase, and Activity 14 concludes data collection, with the output extracted and analysed.

4 Initial Results

The research is still in its early stages, so only the outputs of Phases 1 and 3 are discussed here, as the relevant literature has already been extracted. The output of Phase 1 is a list of enablers that enhance the successful implementation of ERP in SMEs, while the output of Phase 3 is a list of barriers that inhibit it. As stated previously, these outputs were generated from commonly cited past literature. They will also inform Phases 2 and 4, serving as a basis for constructing the interview instrument for the semi-structured interviews. The outputs of Phase 1 and Phase 3 are as follows.

The enablers that enhance the successful implementation of ERP in SMEs are effective change management [11], top management support [10], efficient project management [13], effective communication between business units [5], thorough training and education [14], and an effective project team [14]. The barriers that inhibit the successful implementation of ERP in SMEs are lack of commitment from top management [4], ineffective change management [5], a poor ERP package [15], and employee resistance to change [4]. Future work will test these findings further and deepen our knowledge of these fundamental issues.

5 Conclusion

This paper highlights the objectives of this research and the processes that will realize them, as detailed in the formulation of the ERP implementation success model for SMEs within Klang Valley. The proposed model aims to give the research community and the IT industry a better understanding of the factors that



contribute to a successful implementation of ERP in SMEs, and to provide ERP implementers with a model to guide them through real-world ERP implementation projects in SMEs. Future work will focus on the development of the proposed model, together with data collection and further research to solidify this work and its goals.

Acknowledgements We are grateful to Multimedia University (MMU), the Faculty of Management (FOM), and the other individuals who have supported us throughout. This work is supported by MMU Grant "MMUI/210101."

References

1. Pati A, Kumar Sagar S, Tripathi P, Professor A (2019) ERP and its successful implementation. Int Res J Eng Technol 06:709–712
2. Hasheela-Mufeti V, Smolander K (2017) What are the requirements of a successful ERP implementation in SMEs? Special focus on Southern Africa. Int J Inf Syst Proj Manag 5:5–20. https://doi.org/10.12821/ijispm050301
3. Alsayat M, Alenezi M (2018) ERP implementation failures in Saudi Arabia: key findings. Int Bus Manag 12:10–22. https://doi.org/10.3923/ibm.2018.10.22
4. Venkatraman S, Fahd K (2016) Challenges and success factors of ERP systems in Australian SMEs. Systems 4:20. https://doi.org/10.3390/systems4020020
5. Kiran TS, Reddy AV (2019) Critical success factors of ERP implementation in SMEs. J Proj Manag 4:267–280. https://doi.org/10.5267/j.jpm.2019.6.001
6. Department of Statistics Malaysia (2020) Press release: small and medium enterprises gross domestic product (SMEs GDP), p 8
7. van Beijsterveld JAA, van Groenendaal WJH (2016) Solving misfits in ERP implementations by SMEs. Inf Syst J 26:369–393. https://doi.org/10.1111/isj.12090
8. Ahmad MM, Cuenca RP (2013) Critical success factors for ERP implementation in SMEs. Robot Comput Integr Manuf 29:104–111. https://doi.org/10.1016/j.rcim.2012.04.019
9. Amini M, Sadat Safavi N (2013) Critical success factors for ERP implementation. SSRN Electron J. https://doi.org/10.2139/ssrn.2256382
10. Saini S, Nigam S, Misra SC (2013) Identifying success factors for implementation of ERP at Indian SMEs: a comparative study with Indian large organizations and the global trend. J Model Manag 8:103–122. https://doi.org/10.1108/17465661311312003
11. Rehman K, Ali MK (2019) Key success factors for ERP projects in a developing country: a case study of Jordan. Electron Bus J 18:1–7
12. Huang G, Kurnia S, Linden T (2018) A study of critical success factors for enterprise systems implementation by SMEs. In: Proceedings of the 22nd Pacific Asia conference on information systems: opportunities and challenges of the digital society. Are we ready? PACIS 2018
13. Mahmood F, Khan AZ, Bokhari RH (2019) ERP issues and challenges: a research synthesis. Kybernetes 49:629–659. https://doi.org/10.1108/K-12-2018-0699
14. Mahraz MI, Benabbou L, Berrado A (2018) Critical success factors for ERP implementations: a Moroccan case study. Proc Int Conf Ind Eng Oper Manag 2018:1122–1133
15. Menon SA, Muchnick M, Butler C, Pizur T (2019) Critical challenges in enterprise resource planning (ERP) implementation. Int J Bus Manag 14:54. https://doi.org/10.5539/ijbm.v14n7p54
16. Creswell J, Clark V (2015) Understanding research: a consumer's guide, 2nd edn. Pearson
17. Bolderston A (2012) Conducting a research interview. J Med Imaging Radiat Sci 43:66–76. https://doi.org/10.1016/j.jmir.2011.12.002
18. Moser A, Korstjens I (2018) Series: practical guidance to qualitative research. Part 3: sampling, data collection and analysis. Eur J Gen Pract 24:9–18. https://doi.org/10.1080/13814788.2017.1375091
19. Etikan I (2017) Sampling and sampling methods. Biometrics Biostat Int J 5:215–217. https://doi.org/10.15406/bbij.2017.05.00149

A Hybrid Social-Based Routing to Improve Performance for Delay-Tolerant Networks

Sudhakar Pandey, Nidhi Sonkar, Sanjay Kumar, and Yeleti Sri Satyalakshmi

Abstract A delay-tolerant network (DTN) is a computer network model for cases in which only limited connectivity is available between nodes to exchange information. DTN uses a store-carry-forward technique to enable communication in challenged networks. Several routing techniques based on the social behaviour of nodes have been proposed, but their performance is not as good as expected. In this paper, we present a study of social parameters that can affect the performance of routing protocols and propose a hybrid routing technique to improve network performance. The social parameters selected for the new routing algorithm are community, centrality and similarity, and the algorithm is accordingly named the censimcom routing algorithm. The proposed method is implemented in the ONE (Opportunistic Network Environment) simulator and compared with existing social-based techniques; the results show that the proposed censimcom routing improves network performance in terms of delivery ratio and delay.

Keywords Delay-tolerant networks · Social parameters · Community · Centrality · Similarity · ONE simulator · Censimcom router

1 Introduction

Mobile users have long been able to communicate via the data communication devices we use daily, such as mobile phones, which connect worldwide through the Internet, cellular networks or wireless networks. Connectivity between nodes can be obtained, but because of limited power, computation capacity and battery, these devices cannot always support the end-to-end connectivity required by the commonly used TCP/IP stack [1]. As a result, asynchronous message passing (store-carry-forward networking) has been proposed as a way to communicate over such networks' space-time pathways (e.g. delay-tolerant networking); this is in stark contrast to the IP approach, in which IP packets must either be sent immediately or




be dropped. Several social parameters can affect the working of a DTN routing model. The performance of an opportunistic network depends largely on the characteristics of the nodes: how many nodes there are, how far apart they are, and so on. The total time for data to reach the destination may be a few seconds or minutes, or the data may never be delivered at all. When there is no direct connection between two nodes, or an existing connection has broken, DTN comes in handy [1–3], as the protocol needs no predefined routers or connections beforehand. It creates connections dynamically, and information is exchanged between the mobile nodes. A mobile node stores the data until it finds another potential mobile node that can carry it; when such a node is found, the information is exchanged between the two nodes. In the DTN architecture, a bundle layer is introduced between the application and transport layers and is responsible for the store-carry-forward technique.

Routing in DTN is the most important task because of the dynamic topology and mobile nodes. Since no end-to-end connection is available between nodes, an intermediate node is required that can efficiently transmit data from the source node towards the destination node. Several social-based routing techniques have been devised for DTN that use the social behaviour of nodes to select the forwarder node, but their performance in terms of delivery ratio and delay is not satisfactory. In this paper, we present a study of the social parameters of DTN and propose a novel hybrid routing algorithm, named 'censimcom routing', that improves network performance by combining the social parameters centrality, similarity and community. The remainder of the paper is organised in five sections: Section 2 presents the literature review as related work, Section 3 describes the proposed hybrid routing algorithm in detail, Section 4 presents the evaluation and comparison results, and Section 5 concludes the paper.
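To make the store-carry-forward behaviour described above concrete, the following Python sketch models a node that buffers messages and hands over copies when a contact occurs. It is an illustrative toy model only; the class and method names (Message, Node, on_contact) are our own and are not taken from any DTN implementation.

```python
class Message:
    """A bundle carried through the DTN."""
    def __init__(self, msg_id, src, dst, created_at):
        self.msg_id, self.src, self.dst, self.created_at = msg_id, src, dst, created_at


class Node:
    """Store-carry-forward node: buffer messages, hand them over on contact."""
    def __init__(self, node_id, buffer_size=30):
        self.node_id = node_id
        self.buffer_size = buffer_size
        self.buffer = {}        # msg_id -> Message currently carried
        self.delivered = []     # messages for which this node is the destination

    def store(self, msg):
        if msg.dst == self.node_id:
            self.delivered.append(msg)       # destination reached
        elif msg.msg_id not in self.buffer and len(self.buffer) < self.buffer_size:
            self.buffer[msg.msg_id] = msg    # carry the message further

    def on_contact(self, peer, should_forward=lambda msg, peer: True):
        # When two nodes meet, replicate the messages the routing policy approves of.
        for msg in list(self.buffer.values()):
            if msg.dst == peer.node_id or should_forward(msg, peer):
                peer.store(msg)
```

The should_forward hook is where a routing policy such as the censimcom utility comparison proposed later in this paper would plug in.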

2 Related Work

2.1 Some of the Most Widely Known Routing Protocols for DTN

Epidemic routing [4] is a flooding-based protocol in which a message is replicated whenever a new node is encountered. The number of data transfers is therefore unbounded, and many copies of messages are wasted, with high resource usage. The epidemic protocol is effective only if all opportunistic encounters are random; in real life, however, node movements show patterns, and the PRoPHET protocol was proposed to exploit this observation. PRoPHET [5] maintains a list of probabilities of successful delivery to a given destination in the DTN, and a message is replicated only if the encountered node is a potential forwarder. The spray and wait protocol [6] operates in two phases. In the spray phase, L copies of each



message are generated, where L is the maximum number of copies that a particular message may have in the network; this constraint optimises bandwidth usage. In the spray phase, the source node sprays the message to L distinct relays and then enters the wait phase, in which a node holds the message until the destination is encountered directly. MaxProp [7] is also a flooding-based routing protocol: when a node 'A' holding messages encounters an intermediate node 'B', node 'A' checks which of its messages are not held by node 'B' and gives it copies of those messages. MaxProp prioritises high-priority messages and drops low-priority messages as well as messages whose time-to-live has expired. A queue data structure and a likelihood vector are used to compute the shortest path to the destination based on a DFS algorithm.
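As an illustration of the copy-budget idea behind spray and wait, the sketch below shows the binary variant, in which an encountered relay receives half of the remaining copies and a node holding a single copy waits for the destination. This is a simplified reading of the protocol family described in [6], not the authors' code; the function name is hypothetical.

```python
def binary_spray(copies_left):
    """Binary spray and wait: split the remaining copy budget on each encounter.

    Returns (copies_kept, copies_given). With a single copy left the node is in
    the wait phase and forwards only when it meets the destination itself.
    """
    if copies_left <= 1:
        return 1, 0                  # wait phase: direct delivery only
    given = copies_left // 2         # hand half of the budget to the relay
    return copies_left - given, given


# Example: a message created with L = 8 copies; the carrier meets three relays.
budget = 8
for encounter in range(1, 4):
    budget, given = binary_spray(budget)
    print(f"encounter {encounter}: keep {budget}, give {given}")
```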

2.2 Social-Based Routing Protocols for DTN

To capture the social relationship properties between nodes, a social graph is constructed, from which those properties can be extracted by observation [8, 9]. The social graph gives a global map of how nodes are related, reflecting social relationships such as friends, family members and co-workers. From the social graph, a variety of commonly used metrics, for example community, betweenness centrality and friendship, can be calculated or estimated. Obtaining a social graph [10] is important, as it is the main source for computing social metrics, but it is not easily available. Once the social graph is built, social metrics such as community, centrality and friendship can be derived.

Community: Interaction between people takes place in communities, for example at geographic or temporal scales [11]. If a member belongs to community 'A', it is more likely to meet another member of community 'A' than a random person in the network. Extending this observation to DTNs, nodes in the same community have a higher chance of meeting, which helps the routing protocol choose better forwarding strategies [12].

Centrality: In a given graph, the importance of a node is indicated by the number of connections it maintains at a particular time. In a social graph, a central node is connected to all its neighbours; in practice, a central node is strongly connected to other nodes. Measures of centrality:

1. Degree centrality: the number of links incident on a given node.
2. Closeness centrality: the inverse of the average shortest distance to all the nodes to which a node is connected.

3. Betweenness centrality: the total number of shortest paths passing through a given node. If a node's betweenness centrality [13] is high, it acts as a bridge node for message exchange.

Similarity: the number of common neighbours between two nodes, also called the degree of separation. Socially, two people are more likely to know each other if they share a common friend. Similarity can be based on different attributes, such as user interests or user location [14].

Friendship: two nodes are considered friends if they have a long-lasting relationship [15] with regular meetings and shared interests. This observation can be carried over to DTN by maintaining the contact history between the two nodes.

Selfishness: just as selfishness [16] is seen in real life, it can also be observed in DTN, where some nodes favour packets from nodes with which they have social ties over packets from others.
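To illustrate how two of these metrics can be obtained in practice, the following sketch builds a small contact graph with the networkx library and computes betweenness centrality and pairwise similarity (number of common neighbours). This is only an offline illustration of the definitions above, under the assumption that networkx is available; the contact graph is invented for the example, and the paper itself computes such quantities inside the ONE simulator rather than this way.

```python
import networkx as nx

# Toy contact graph: an edge means the two nodes have met (invented data).
contacts = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "E"), ("D", "E")]
G = nx.Graph(contacts)

# Betweenness centrality: how many shortest paths pass through each node.
betweenness = nx.betweenness_centrality(G, normalized=False)

def similarity(graph, u, v):
    """Number of common neighbours of u and v (the 'degree of separation' metric)."""
    return len(set(graph.neighbors(u)) & set(graph.neighbors(v)))

print("betweenness:", betweenness)
print("sim(A, E):", similarity(G, "A", "E"))   # common neighbours of A and E
```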

3 Proposed Methods

3.1 Routing Considerations

1. Information about future contacts is assumed to be readily available.
2. Mobility is considered, and nodes are mobile.
3. The battery and storage capacity of the mobile nodes are taken into consideration.

3.2 Censimcom Routing (Centrality-Similarity-Community Routing)

We combine three social parameters, centrality, similarity and community, into a new protocol called the censimcom protocol. To measure centrality, we use betweenness centrality to identify the bridge nodes in the network. The similarity parameter captures the number of neighbours common to two nodes and thus estimates their degree of separation. The community parameter estimates the interaction between nodes based on the group to which they belong; it is used because a member of a community is more likely to meet another node of the same community. Using these three social parameters helps the routing protocol choose better forwarding relays. The motivation is to avoid exchanging messages with every encountered node, thereby reducing congestion in the network and saving bandwidth.


3.2.1 Betweenness Centrality

Betweenness centrality is the total number of shortest paths passing through a given node. If a node's betweenness centrality [13] is high, it has more connections to other nodes, or it sits at a junction from which any direction can be reached; such nodes act as bridge nodes for message exchange. The betweenness of a given node is calculated using Eq. (1):

Bet(A) = number of shortest paths for which node A is an intermediate hop    (1)

3.2.2 Similarity

Similarity measures how alike two nodes are. For given nodes A and B, where A is the current node and B is the destination node, the similarity [14] between the two nodes is calculated using Eq. (2):

sim(A − B) = number of common neighbours of node A and node B    (2)

If the betweenness centrality of a node is high, that node acts as a bridge node; if its similarity is high, it acts as a forwarder.

3.2.3 Community

Since the mobile nodes in a DTN are carried by people, and messages are exchanged when people come close to each other, people tend to communicate far more with others in the same community [11] than with a random person in the network. Applying this observation reduces the number of message forwardings, and hence traffic and unnecessary overhead. To apply it, the community of the encountered node must be known; for this, each node carries a small label indicating the community to which it belongs.

3.3 Technique

In censimcom routing for DTN, when a node A meets another node M, it calculates the relative centrality utility, similarity utility and community utility for nodes A and M, where A is the current node, M is the encountered node and B is the destination node. The centrality, similarity and community utilities are calculated using Eqs. (3a), (3b) and (3c).

cenutil(A) = cen(A) / (cen(A) + cen(B))    (3a)

simutil(A) = sim(A − B) / (sim(A − B) + sim(M − B))    (3b)

comutil(A) = n / (n + m)    (3c)

where n is the number of common communities for A and B, and m is the number of common communities for M and B.

Node A can then compute the censimcom utility, a weighted combination of centrality, similarity and community. The censimcom utility for a node is calculated using Eq. (4):

Censimcomutil(A − B) = α·simutil(A) + β·cenutil(A) + (1 − α − β)·comutil(A)    (4)

Here, α and β are tuneable constants whose values can be adjusted in the range 0 to 1 to obtain the weighted average of the utility functions that gives the most optimised results. Whenever node A encounters node M with destination node B, node A calculates the censimcom utility for node M; if the censimcom utility of node M is greater than that of node A, the message is more likely to reach the destination from node M than from node A, and the message is therefore replicated from node A to node M.
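The sketch below restates Eqs. (3a)-(3c) and (4) directly in Python. It is a minimal re-statement of the formulas, not code from the paper: the raw inputs (centrality values, similarity counts, common-community counts) are assumed to come from whatever contact-history bookkeeping the node performs, the default α and β values are arbitrary, and the zero-denominator guards are an added convenience.

```python
def cenutil(cen_a, cen_other):
    """Relative centrality utility, Eq. (3a)."""
    return cen_a / (cen_a + cen_other) if (cen_a + cen_other) else 0.0

def simutil(sim_ab, sim_mb):
    """Relative similarity utility towards destination B, Eq. (3b)."""
    return sim_ab / (sim_ab + sim_mb) if (sim_ab + sim_mb) else 0.0

def comutil(n_common_ab, m_common_mb):
    """Relative community utility, Eq. (3c)."""
    total = n_common_ab + m_common_mb
    return n_common_ab / total if total else 0.0

def censimcomutil(sim_u, cen_u, com_u, alpha=0.4, beta=0.3):
    """Weighted combination of the three utilities, Eq. (4); alpha + beta <= 1."""
    return alpha * sim_u + beta * cen_u + (1 - alpha - beta) * com_u

# Invented example values for a single encounter between A and M, destination B.
u_A = censimcomutil(simutil(3, 5), cenutil(4.0, 6.0), comutil(1, 2))
u_M = censimcomutil(simutil(5, 3), cenutil(6.0, 4.0), comutil(2, 1))
print(u_A, u_M, "forward to M" if u_M > u_A else "keep at A")
```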

3.4 Algorithm

Censimcom routing:

1. Let node A have a message to send to destination node B.
2. An intermediate node M is encountered.
3. The algorithm calculates the required parameters to check whether node M is a potential relay.
4. Node A calculates the relative centrality utility, similarity utility and community utility with respect to node M.
5. First, it calculates the centrality utility, similarity utility and community utility as shown in Eqs. (3a), (3b) and (3c).
6. Next, it calculates censimcomutil(A) and censimcomutil(M) as shown in Eq. (4).
7. If censimcomutil(M − B) > censimcomutil(A − B), then forward the message to node M.
   7.1. If node M is the destination node, return.
   7.2. Else go to step 1 and label node M as A.
8. Else do nothing.
9. Via the possible multi-hop relays, the message reaches the destination B (Fig. 1). A sketch of this forwarding decision appears after the flow chart.

Fig. 1 Flow chart
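The relay-selection decision of steps 1-9 can be summarised as the small routine below. It assumes a callback censimcomutil_of(node, dest) that evaluates Eq. (4) for a node with respect to a destination, for example by wrapping the helper sketched earlier; the node objects are hypothetical stand-ins for whatever state a real DTN node maintains, so this is an outline of the decision logic rather than a working router.

```python
def try_forward(node_a, node_m, dest_b, censimcomutil_of):
    """Decide whether node A should replicate its message to encountered node M.

    censimcomutil_of(node, dest) is assumed to return the Eq. (4) utility of
    `node` with respect to destination `dest`.
    """
    if node_m is dest_b:                       # step 7.1: destination reached
        return "deliver"
    if censimcomutil_of(node_m, dest_b) > censimcomutil_of(node_a, dest_b):
        return "replicate"                     # step 7: M is the better carrier
    return "keep"                              # step 8: do nothing, keep carrying
```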



4 Simulation and Results

We implemented the proposed censimcom protocol in the ONE (Opportunistic Network Environment) simulator. ONE [17] is a node-based discrete event simulation engine. At each simulation step, the engine updates all the modules that implement the primary simulation functionality, such as node movement modelling, the connections a node has with other nodes, the routes it follows and message handling. Each node's modules have access to the basic simulation properties and state, such as the node's position, its current path and its neighbours. The MessageRouter module provides the basic functionality for all routing modules, such as buffer management and callbacks for message-related routing events. The simulation engine invokes these callbacks for a variety of events, for example when a new message arrives at the current node; the router module handles such events, specifies the actions to be performed after each time step, and defines the behaviour when a new node enters or leaves the neighbourhood. The parameters considered for the experiment are shown in Table 1. The message size is 100 kB, the node buffer size is up to 30 MB and the transmission range is 100 m. A message remains active for up to 300 min and is automatically dropped when this time is exceeded. We consider a total of 100 nodes, a simulation time of 5000 s, a connection speed of 2048 bytes per second and the random waypoint mobility model. For each protocol we perform 100 runs to avoid human error and bias, and we evaluate the following metrics.

Table 1 Parameters on which the experiment is done

Parameter             Value
Message size          100 kB
Buffer size           0–30 MB
Transmission range    100 m
Time to live          300 min
Number of nodes       20–100
Simulation time       5000 s
Data rate             2048 bytes per second
Mobility model        Random waypoint

4.1 Delivery Ratio

The delivery ratio is the ratio of the total number of messages received at the destination to the number of messages originated at the source.

delivery ratio = n / m    (5)

where n is the number of messages received at the destination and m is the number of messages originated at the source.

Fig. 2 Graph between delivery ratio and number of nodes

Fig. 3 Graph between delivery ratio and buffer size

The variation of delivery ratio with the number of nodes is shown in Fig. 2. The proposed protocol gives significantly better results than the spray and wait and SimBet protocols when the number of nodes exceeds 40. The variation of delivery ratio with buffer size is shown in Fig. 3: for buffer sizes above 20 MB, the proposed protocol achieves a better delivery ratio than SimBet and spray and wait.

4.2 End-to-End Delay

The end-to-end delay is the total time taken by a message to reach the destination from the source:

end-to-end delay = time taken by the message to reach the destination from the source    (6)

The variation of delay (in milliseconds) with the number of nodes is shown in Fig. 4; the proposed protocol takes significantly less time than SimBet and spray and wait. The variation of delay with buffer size is shown in Fig. 5; from this, we see that the proposed protocol delivers messages in significantly less time, taking approximately 4 s with a 50 MB buffer.
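For completeness, the sketch below computes the two evaluation metrics, delivery ratio (Eq. (5)) and average end-to-end delay (Eq. (6)), from a list of message records. The record format is invented for the example; in practice, equivalent statistics are produced by the simulator's report output rather than by hand.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MessageRecord:
    created_at: float                 # seconds of simulation time
    delivered_at: Optional[float]     # None if the message was never delivered

def delivery_ratio(records):
    """Eq. (5): delivered messages over created messages."""
    delivered = sum(1 for r in records if r.delivered_at is not None)
    return delivered / len(records) if records else 0.0

def average_delay_ms(records):
    """Eq. (6) averaged over delivered messages, in milliseconds."""
    delays = [r.delivered_at - r.created_at for r in records if r.delivered_at is not None]
    return 1000 * sum(delays) / len(delays) if delays else float("nan")

records = [MessageRecord(0.0, 4.2), MessageRecord(1.0, None), MessageRecord(2.0, 6.0)]
print(delivery_ratio(records), average_delay_ms(records))
```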

Fig. 4 Graph between end-to-end delay and number of nodes

Fig. 5 Graph between end-to-end delay and buffer size

5 Conclusions

In this paper, we discussed DTN, the existing routing protocols and the social parameters that can affect DTN routing. We studied the different social parameters that can help optimise the existing protocols, selected the positive social parameters similarity, centrality and community, and, exploiting the advantages they provide, developed the censimcom routing protocol. The censimcom protocol increases the delivery ratio and decreases the delay in the network: it achieves about 10% higher delivery ratio than SimBet, performs better than the spray and wait routing protocol, and reduces the delay to nearly 4 s for a 50 MB buffer size.

References

1. Zhu Y, Xu B, Shi X, Wang Y (2013) A survey of social-based routing in delay tolerant networks: positive and negative social effects. IEEE Commun Surv Tutorials 15(1):387–401, First Quarter. https://doi.org/10.1109/SURV.2012.032612.00004
2. (2006) Routing in intermittently connected mobile ad hoc networks and delay tolerant networks: overview and challenges. IEEE Commun Surv Tutorials 8(1):24–37
3. Demmer M (Nov 2008) Delay tolerant networking TCP convergence layer protocol. Internet Engineering Task Force. [Online]. Available: repository.wit.ie/1565/1/TAPR-DCC
4. Vahdat A, Becker D (2000) Epidemic routing for partially connected ad hoc networks. Duke University, Durham, NC, Tech. Rep. CS-2000-06
5. Han SD, Chung YW. An improved PRoPHET routing protocol in delay tolerant network
6. Misra S, Saha BK, Pal S. Opportunistic mobile networks: advances and applications
7. Rani A, Rani S, Bindra HS (Feb 2014) Int J Eng Res Technol (IJERT) 3(2)
8. Mapping networks of terrorist cells
9. Chaintreau A, Hui P, Crowcroft J, Diot C, Scott J, Gass R (2007) Impact of human mobility on opportunistic forwarding algorithms. IEEE Trans Mob Comp 6(6):606–620
10. Beach A, Gartrell M, Akkala S, Elston J, Kelley J, Nishimoto K, Ray B, Razgulin S, Sundaresan K, Surendar B, Terada M, Han R (2008) WhozThat? Evolving an ecosystem for context-aware mobile social networks. IEEE Network 22(4):50–55
11. Flittner M, et al (2018) A survey on artifacts from CoNEXT, ICN, IMC, and SIGCOMM conferences in 2017. ACM SIGCOMM Comput Commun Rev 48(1):75–80. https://doi.org/10.1145/3211852.3211864
12. (2013) A visual analysis approach for community detection of multi-context mobile social networks. J Comput Sci Technol 28:797–809
13. Freeman LC. A set of measures of centrality based on betweenness. https://doi.org/10.2307/3033543
14. Daly EM, Haahr M (2007) Social network analysis for routing in disconnected delay-tolerant MANETs. In: MobiHoc '07: Proceedings of the 8th ACM international symposium on mobile ad hoc networking and computing
15. Zhang Y, Zhao J (2009) Social network analysis on data diffusion in delay tolerant networks. In: MobiHoc '09: Proceedings of the tenth ACM international symposium on mobile ad hoc networking and computing
16. Srinivasan V, Nuggehalli P, Chiasserini CF, Rao RR (2003) Cooperation in wireless ad hoc networks. In: Proceedings of the 22nd annual joint conference of the IEEE computer and communications societies (INFOCOM)
17. Keränen A, Ott J, Kärkkäinen T. The ONE simulator for DTN protocol evaluation. Helsinki University of Technology (TKK)

Author Index

A
Aakarsh, Y., 295; Abdultaofeek Abayomi, 533; Abhishek Dutta, 443; Abhishek Gupta, 153; Abhishek Majumder, 509; Adilakshmi, T., 283; Ahmed Ben Ayed, 561; Amit, 389; Anil Kumar, N., 421; Anind Kiran, 341; Aniruddha Chatterjee, 475; Anusuyah Subbarao, 609; Anvi Jain, 329; Aravind, D., 363; Aruna, S., 183; Astra Hareyana, 609; Atul Negi, 63; Ayesha Sultana, 373

B
Baijnath Kaushik, 153; Bam Bahadur Sinha, 257; Bhaskarjyoti Das, 341

C
Chakita Muttaraju, 487; Challapalli Jhansi Rani, 241; Chaya Devi, S. K., 295; Chubaieskyi, V., 313

D
Dakshayini, M., 89; Davinder Paul Singh, 153; Deep Rahul Shah, 141; Dev Ajay Dhawan, 141; Devendran, V., 203; Dheeraj, M., 213; Dipti Verma, 353; Divya Lingineni, 303

F
Fawwaz Khilji, 329

G
Gandikota Ramu, 195; Georg Gutjahr, 101; Gnana Mayuri, K., 169; Govardhan, A., 523; Gowtham, B., 295; Greeshma Krishnan, 101

I
Israel Edem Agbehadji, 533

J
Jagannath Aryal, 1; Jaydip Sen, 443; Jo Cheriyan, 499; Joy Lal Sarkar, 509; Junaid, M. W. F., 183

K
Kanchana, M., 475; Karnam Madhavi, 523; Kasa Chiranjeevi, 127; Kasatkin, D., 313; Kishor H. Walse, 223

L
Lahari, P., 183; Lakhno, V., 313; Lata Upadhye, 111; Leelavathy, B., 213; Lekshmi S. Nair, 499

M
Malyukov, V., 313; Manah Shetty, 341; Mangala Gowri, 89; Manuel Roveri, 23; Mardé Helbig, 41; Meenaxi Tank, 73; Mithun Haridas, 101; Mohamed Amin Belhajji, 561; Mohammad Goudarzi, 1; Mohammed Marhoun Khamis Al Nuaimi, 551; Monal R. Torney, 223; Mukesh Kumar Tripathi, 127

N
Nagajyothi, D., 363; Nagaraju Devarakonda, 241, 405; Nandipati Bhagya Lakshmi, 405; Neha Jabeen, 363; Nidhi Sonkar, 619; Nirmala Vasudevan, 101; Noel Varghese Oommen, 551

O
Owusu Nyarko-Boateng, 533

P
Padmanabha Reddy, Y. C. A., 195; Prabavathy, S., 577; Prasad Koti, 597; Prasanna Dusi, 303; Prema Nedungadi, 101; Priyank Thakkar, 73; Purnachary Munigadiapa, 283

R
Raghawendra Sinha, 353; Rajesh Tanwar, 389; Rajkumar Buyya, 1; Ramesh, G., 195; Ram Mohan Rao Kovvur, 213; Ramya Narasimha Prabhu, 487; Ravi Prakash Reddy, I., 577; Reshov Roy, 389; Resmi, S. R., 273; Richard C. Millham, 533; Rishal, K. P., 551; Rishi Sai Jakkam, 303; Rosepreet Kaur Bhogal, 203; Rzaieva, S., 313; Rzaiev, D., 313

S
Sagnik Biswas, 475; Sai Rohit Sheela, 213; Sai Yada, 303; Sangeetha Prasanna Ram, 111; Sanjay Kumar, 619; Sanjeev, V., 183; Sathish Kumar, L., 169; Shabina, 461; Sheetal, S., 487; Sherimon, P. C., 551; Shivendra, 127; Shreya Shukla, 341; Shubhi Miradwal, 329; Shyam Sunder Reddy, K., 195; Shylaja, S. S., 487; Somasekar, J., 195; Sonal Chawla, 461; Soorya Surendran, 101; Srinu Banothu, 523; Sudarshan Patil Kulkarni, 373; Sudhakar Pandey, 619; Sujatha Kumari, B. A., 373; Sujatha, P., 597; Suraj, P., 183; Surya, S. R., 273; Susmitha Mukund Kirsur, 89

U
Uma, D., 487

V
Varun Kerenalli, 341; Vasudha, N., 431; Venkateswara Rao, P., 431; Venunath, M., 597; Vikash Kumar, 257; Vikram Singh, 389; Vilas M. Thakare, 223; Vishal Reddy, G., 295; Vivek, V., 213

W
Waquas Mohammad, 329

Y
Yeleti Sri Satyalakshmi, 619

Z
Zhiyu Wang, 1