Data Intelligence and Cognitive Informatics: Proceedings of ICDICI 2022 (Algorithms for Intelligent Systems) 9811960038, 9789811960031

The book is a collection of peer-reviewed best selected research papers presented at the International Conference on Data Intelligence and Cognitive Informatics (ICDICI 2022).


Table of contents :
Preface
Contents
About the Editors
1 Modeling Müller-Lyer Illusion Using Information Geometry
1 Introduction
2 Method
3 Information Geometric Measure of Distance
4 Computational Method and Results
5 Discussion
References
2 Building up a Categorical Sentiment Dictionary for Tourism Destination Policy Evaluation
1 Introduction
2 MCST Questionnaires for Selecting Evaluation Categories
2.1 Categories and Questions of MCST and Keywords Associated to the Categories
2.2 Incorporating the Questionnaire-Based Evaluation into Sentiment Analysis Based on Text Mining
3 Additional Information Required from Authors
3.1 Model Architecture
3.2 Data
3.3 Text Cleaning and Preprocessing
3.4 Extracting Category-Specific Tokens Using CTF-IDF
3.5 Logistic Regression for Extracting Sentiment Words
3.6 Human Evaluation on Sentiment Words in a Category
3.7 Matching
4 Results
5 Conclusion
References
3 Statistical Analysis of Stress Prediction from Speech Signatures
1 Introduction
2 Methodology
3 Database Collection
4 Results and Analysis
5 Discussion
6 Conclusion
References
4 DermoCare.AI: A Skin Lesion Detection System Using Deep Learning Concepts
1 Introduction
2 Theory
2.1 Survey of Existing Systems
2.2 Limitations Found in Existing Systems
3 Proposed Work
3.1 Data Set
3.2 Pre-processing
3.3 CNN Architecture
4 Results and Discussion
4.1 Evaluation Measures
4.2 Multi-class Classification
4.3 Benign/Malignant Classification
4.4 Integrating Models into a Web Application
5 Conclusion and Future Scope
References
5 Analysis of Phishing Base Problems Using Random Forest Features Selection Techniques and Machine Learning Classifiers
1 Introduction
2 Research Background
3 Methodology
3.1 Data Description
3.2 Features Selection Method
3.3 Algorithms Description
4 Results
5 Discussion
6 Conclusion
References
6 Cost Prediction for Online Home-Based Application Services by Using Linear Regression Techniques
1 Introduction
1.1 Motivation
2 Problem Statement
3 Literature Survey
3.1 Domestic Android Application for Home Services [9]
3.2 An Online System for Household Services [10]
3.3 E-commerce and Its Impact on Global Trade and Market [11]
3.4 Examining the Impact of Security, Privacy, and Trust, on the TAM and TTF Models for E-commerce Consumers [12]
3.5 A Research Study on Customer Expectation and Satisfaction Level of Urban Clap in Beauty Services with Special Reference to Pune [13]
3.6 Timesaverz—First of Its Kind On-Demand Home Service Provider, India [14]
4 Proposed System and Architecture Diagram
4.1 Registration Module
4.2 Login Module
4.3 Feedback Module
4.4 Admin Module
5 Project Scope
6 Prototype Model
6.1 Feature Value (Target Value) and Linear Regression
6.2 Cost Prediction
6.3 Three-Step Verification
6.4 Feedback
7 Users’ Classes and Features
8 Applications
8.1 Home and Cleaning Service Industry
8.2 Health Care
8.3 Repair and Maintenance Service Industry
8.4 Home Renovation/Shifting
8.5 Home Construction and Design
8.6 Businesses
9 Future Scope
10 Conclusion
References
7 Convolutional Neural Network Based Intrusion Detection System and Predicting the DDoS Attack
1 Introduction
2 Related Works
3 System Model
3.1 Image Conversion Algorithm from KDD Dataset to Gray Scale
3.2 Feature Selection Algorithm
3.3 CNN Algorithm
4 Performance Evaluation
4.1 Number of Convolution Layers
4.2 Kernel Size
5 Conclusion
References
8 BERT Transformer-Based Fake News Detection in Twitter Social Media
1 Introduction
2 Related Work
3 Proposed Approach
3.1 Dataset
4 Results and Discussion
5 Conclusion
References
9 The Facial Expression Recognition Using Deep Neural Network
1 Introduction
2 Related Work
2.1 Literature Review
2.2 Dataset
3 Proposed Model
3.1 Model Architecture
4 Experiment
4.1 Experimental Design
4.2 Experimental Results and Evaluation
5 Conclusion
References
10 New IoT-Based Portable Microscopic Somatic Cell Count Analysis
1 Introduction
2 Related Work
3 Proposed Work
4 Performance Analysis
5 Conclusion
References
11 A Survey on Hybrid PSO and SVM Algorithm for Information Retrieval
1 Introduction
2 Literature Survey
3 Methodology
3.1 Data Analysis
4 Results and Discussion
4.1 Ranking Models and Experiments with These Models
4.2 The Vector Space Mode
4.3 Positioning Based on Document Structure
4.4 Adjustments and Enhancements to the Basic Indexing and Search Processes
5 Conclusion
References
12 Metric Effects Based on Fluctuations in Values of k in Nearest Neighbor Regressor
1 Introduction
2 Methodology
2.1 Euclidean Distance
2.2 Manhattan Distance
2.3 Hamming Distance
2.4 Regressor
2.5 Dataset
3 Results
3.1 Root Mean Squared Error
3.2 Goodness of Fit
4 Conclusion
References
13 An Ensemble Approach to Recognize Activities in Smart Environment Using Motion Sensors and Air Quality Sensors
1 Introduction
2 Literature Review
3 Proposed Work
4 Experimental Analysis
4.1 Phase 1—Activity Recognition Using Motion Sensors Data
4.2 Phase 2—Activity Recognition Using Air Quality Sensors
5 Results
6 Conclusion
References
14 Generalization of Fingerprint Spoof Detector
1 Introduction
2 Literature Survey
3 Working Principle
3.1 Dataset
3.2 Training and Validation of Our Proposed CNN Model
4 Conclusion
References
15 Applied Deep Learning for Safety in Construction Industry
1 Introduction
2 Literature Review
3 Methodology
3.1 Data Description
3.2 Convolutional Neural Network Model
3.3 VGG16 Model
3.4 System Architecture
4 Discussion
4.1 Technologies Required
4.2 Convolutional Neural Network
4.3 VGG16 Model
5 Results
5.1 Classification Performance
5.2 Pre-processed Image Dataset
5.3 Equipment Classification
5.4 Result Comparison
6 Conclusion
References
16 Deep Learning-Based Quality Inspection System for Steel Sheet
1 Introduction
2 Literature Survey
3 Existing System
4 Proposed Work
5 Architecture Diagram
6 System Design
6.1 Analysis of Experimental Data
6.2 Building Model
6.3 System Training and Testing Procedures
6.4 User Interface
7 Results
8 Conclusion
9 Future Scope
References
17 Forecasting Prediction of Covid-19 Outbreak Using Linear Regression
1 Introduction
2 Literature Review
3 Proposed System
3.1 Dataset Description
3.2 Data Pre-processing
3.3 Training and Testing
3.4 Classification
4 Result
5 Conclusion
References
18 Proctoring Solution Using AI and Automation (Semi)
1 Introduction
2 Project Plan and Schedule
3 Literature Survey
4 System Requirements
4.1 Hardware Requirement
4.2 Software Requirements
5 Proposed System
6 System Design and Working
6.1 Proctor Module
6.2 Student Module
6.3 Examiner Module
7 Implementation
8 Results
9 Advantages
10 Disadvantages
11 Conclusion
12 Future Enhancements
References
19 Apple Leaf Disease Prediction Using Deep Learning Technique
1 Introduction
2 Related Works
3 Proposed Methodology
4 Proposed System
5 Results
6 Conclusion
References
20 Sentimental Analysis and Classification of Restaurant Reviews
1 Introduction
2 Literature Survey
3 Existing Methodology
4 Sentiment Analysis of Restaurant Reviews
5 Sentimental Analysis
6 Proposed Methodology
6.1 Data Collection
6.2 Data Preprocessing
6.3 Stop-Word Elimination
6.4 Stemming
6.5 Bag-of-Words Model
6.6 Data Classification
6.7 Splitting Dataset
7 Splitting
8 Naive Bayes
9 Logistic Regression
10 Result
11 Conclusion
12 Future Work
References
21 A Controllable Differential Mode Band Pass Filter with Wide Stopband Characteristics
1 Introduction
2 Filter Design
3 Current Distributions
4 Conclusion
References
22 Design and Analysis of Conformal Antenna for Automotive Applications
1 Introduction
2 Antenna Design Process
2.1 Bending Analysis
3 Results and Analysis
4 Conclusions
References
23 An Improved Patch-Group-Based Sparse Representation Method for Image Compressive Sensing
1 Introduction
1.1 Research Motivation and Contribution
2 Related Work and Challenges
3 Proposed Work
3.1 Phase-1: Patch-Based Adaptive Sparsifying Learning (PASL)
3.2 Phase-2: Constrained Group Sparse Representation
4 Result and Discussion
5 Conclusion
References
24 Comparative Analysis of Stock Prices by Regression Analysis and FB Prophet Models
1 Introduction
2 Dataset Used for Research Evaluation
3 Comparative Analysis of Linear Regression and FB Prophet Models
3.1 Linear Regression Model
3.2 FB Prophet Model
4 Metric Used for Evaluation of Comparative Analysis of Models
5 Result and Discussion
5.1 Comparative Study
6 Conclusion and Future Scope
References
25 g POD—Dual Purpose Device (Dustbin and Cleaning)
1 Introduction
2 Brief Overview of Dual Purpose Device
2.1 Robot Vacuum Cleaner Base
2.2 Pair of ARMs
2.3 Mounted Dustbin
3 Conclusion
References
26 An Attractive Proposal Based on Big Data for Sentiment Analysis Using Artificial Intelligence
1 Introduction
2 Related Work
3 Techniques for Analyzing Emotional States
4 Implementation
5 Conclusion
References
27 SqueezeNet Deep Neural Network Embedder-Based Brain Tumor Classification Using Supervised Machine Intelligent Approach
1 Introduction
2 Related Works
3 Methodology
4 Results and Discussion
5 Conclusion
References
28 Detection of Malicious Unmanned Aerial Vehicle Carrying Unnecessary Load Using Supervised Machine Intelligence Model with SqueezeNet Deep Neural Network Image Embedder
1 Introduction
2 Related Works
3 Dataset and Methodology
3.1 Dataset
3.2 Methodology
4 Simulation and Results
5 Conclusion
References
29 Face Mask Detection Using Artificial Intelligence to Operate Automatic Door
1 Introduction
2 Literature Review
3 Hardware Components
3.1 Arduino UNO
3.2 IR Sensor
3.3 LED
3.4 LCD
3.5 Servo MG90S
4 Mobile Application
4.1 Creating Application Using Android Studio
4.2 Front End
4.3 Back End
5 Proposed System
5.1 Flow Diagram
5.2 Schematic Circuit View
6 Result Analysis and Discussion
6.1 Output Without Mask
6.2 Output with Mask
7 Limitations
8 Future Scope
9 Conclusion
References
30 Marine Weather Prediction Using Preprocessing Techniques in Big Data
1 Introduction
2 Related Work
3 Mean Missing Data Imputation-Based Preprocessing Model
4 Result Analysis
4.1 Mean Absolute Error (MAE)
4.2 Result
5 Conclusion
References
31 Yolov4 in White Blood Cell Classification
1 Introduction
2 Materials and Methods
2.1 Database
2.2 Architecture of YOLOv4 Model
2.3 Data Augmentation
2.4 Backbone
2.5 Neck
2.6 Prediction
2.7 Evaluation Parameters
3 Results and Discussions
3.1 Classification Results
3.2 Performance Comparison with Other Published Models
4 Conclusion
References
32 Efficient Data Hiding Model by Using RDH Algorithm
1 Introduction
2 Related Work
3 Methodologies
3.1 Image Compression
3.2 Lossless Compression
3.3 Reversible Data Hiding (RDH)
3.4 Least Significant Bit (LSB)
4 Data Hiding Using RDH
5 Measuring Compression Performances
5.1 Compression Ratio (CR)
5.2 Compression Time (CT)
5.3 Saving Percentage (SP%)
5.4 Mean Squared Error (MSE)
5.5 Peak Signal-to-Noise Ratio (PSNR)
6 Results and Discussions
7 Conclusion
References
33 Enhanced Preprocessing Technique for Air Pollution Forecasting System Using Big Data and Internet of Things
1 Introduction
2 Related Works
3 Bilateral Discretized Z- Wavelet Transform
4 Conclusion
References
34 Pre-processing of Leukemic Blood Cell Images Using Image Processing Techniques
1 Introduction
2 Literature Review
3 Methodology
3.1 RGB to Grayscale Conversion
3.2 Filtering
3.3 Contrast Enhancement
3.4 Edge Detection
4 Results and Discussions
5 Conclusion
References
35 Automated Grocery List Item Add-to-Cart Leveraging Optical Character Recognition with Transformer
1 Introduction
1.1 Core Focus of the Research
2 Literature Survey
2.1 Add-to-Cart Functionality
2.2 Optical Character Recognition Using Machine Learning
3 Proposed System Architecture
4 List Item Recognition Using Transformer
4.1 Encoder Block
4.2 Decoder Block
5 Results and Evaluation
6 Conclusion and Future Work
References
36 Anomaly Detection in Image Sequences Using Weakly Supervised Learning
1 Introduction
2 Related Work
3 Proposed Work
3.1 Feature Extraction
3.2 Weakly Supervised Learning
4 Experiments and Result
4.1 Dataset for Training and Testing
4.2 Implementation Details
4.3 Evaluation and Analysis
5 Conclusion
References
37 Sentiment Analysis of Twitter Data for COVID-19 Posts
1 Introduction
2 Related Work
2.1 Steps Involved in Sentiment Analysis
2.2 Various Approaches for Sentiment Analysis
3 Our Methodology
3.1 Collecting the Dataset
4 Implementation and Results
5 Conclusion
References
38 Brain Tumor Detection Using Image Processing Approach
1 Introduction
2 Imaging Techniques
3 Methodology
4 Result and Discussion
5 Conclusion
References
39 Routing Method for Interplanetary Satellite Communication in IoT Networks Based on IPv6
1 Introduction
2 Protocols and Transmission
3 Implementation, Validation, and Testing
4 Conclusions
References
40 Parameterization of Sequential Neural Networks for Predicting Air Pollution
1 Introduction
2 Preliminaries and Related Work
2.1 Background
2.2 Prediction of Air Pollution with Deep Learning Models
3 The Proposed Methodology
4 Experimental Results
4.1 Data
4.2 Prediction Results with Different Sequential Networks
4.3 Discussion
5 Conclusion
References
41 Customer Analytics Research: Utilizing Unsupervised Machine Learning Techniques
1 Introduction
2 Literature Survey
3 Proposed System
4 Experimental Results
4.1 Dataset
4.2 Evaluation Metrics
4.3 Performance Analysis
5 Conclusion
References
42 Multi-class IoT Botnet Attack Classification and Evaluation Using Various Classifiers and Validation Techniques
1 Introduction
2 Related Work
3 Data Set Description
4 Experimental Evaluation
4.1 Preprocessing: Min–Max Normalization
4.2 KCV and SKCV
4.3 Machine Learning Classifiers
4.4 Algorithm
5 Results and Discussion
5.1 Accuracy
5.2 Execution Time
5.3 F1 Score and Cohen’s Kappa Coefficient (Ҟ)
6 Conclusion
References
43 IoT-Based Dashboards for Monitoring Connected Farms Using Free Software and Open Protocols
1 Introduction
2 Literature Study
2.1 Use of IoT in Agricultural Sectors
2.2 Managing and Controlling in Connected Farms Using IoT Technology
2.3 Real-Time Monitoring Connected Farms
2.4 Challenges of IoT in Agriculture
2.5 Opportunities and Applications of IoT in Agriculture
3 Design and Implementation of Connected Farms
3.1 Involvement of Sensing Technologies
3.2 Framework for Design and Implementation
3.3 Real-Time Sensing Images Communicated by ESP32-CAM
3.4 Incorporation of Positioning Systems
4 Integration of Data Sources and Management
4.1 Real-Time Sensing Images Communicated by ESP32-CAM
4.2 Incorporation of Positioning Systems
5 Integration of Data Sources and Management
6 Configuration of Devices and Performance of Result Analysis
7 Conclusions
References
44 Predicting the Gestational Period Using Machine Learning Algorithms
1 Introduction
2 Literature Review
3 Dataset
4 Methodology
5 Results
6 Discussions
7 Conclusion and Future Work
References
45 Digital Methodologies and ICT Intervention to Combat Counterfeit and Falsified Drugs in Medicine: A Mini Survey
1 Introduction
2 Selection of Papers for Literature Review
3 Existing Works
4 Literature Survey
5 Discussion
6 Result
7 Conclusion
References
46 Utilizing Hyperledger-Based Private Blockchain to Secure E-Passport Management
1 Introduction
2 Related Work
3 Methodology
3.1 Characterization of the Assets
3.2 Transaction Execution via Chaincode
3.3 Setting Up the Categories of Participants
3.4 Placement of Assets in the Registry
3.5 Registering Transactions in Historian Records
3.6 Implementation of Permission Regulations Inside Access Control Module
3.7 InterPlanetary File System (IPFS) to Store Non-textual Data
4 Result
5 Conclusion
References
47 An Exploratory Data Analysis on SDMR Dataset to Identify Flood-Prone Months in the Regional Meteorological Subdivisions
1 Introduction
2 Exploratory Data Analysis
2.1 Data Collection
2.2 Understanding Features
2.3 Data Cleaning
2.4 Data Visualization and Analysis
2.5 Examining Relationship Between Features and Finding Patterns
3 Inferences from the Dataset and Discussion
3.1 Visualization of Annual Rainfall (1916–2017)
3.2 Visualization of Seasonal Rainfall
3.3 Visualization of Average Monthly Rainfall
3.4 Visualization of Subdivision-Based Rainfall
4 Conclusion
References
48 Segmentation of Shopping Mall Customers Using Clustering
1 Introduction
2 Related Work
2.1 Clustering
3 Proposed Work
3.1 General View of Data
3.2 Data Collection and Preparation
3.3 Data Analysis and Exploration
4 Methodology
4.1 Clustering
4.2 K-means Clustering
4.3 Hierarchical Clustering
4.4 Mini-Batch K-means Clustering
4.5 Elbow Method
5 Clustering Using Different Algorithms
5.1 Elbow Method
5.2 K-means Clustering Algorithm
5.3 Mini-Batch K-means Clustering
5.4 Hierarchical Clustering
6 Performance Analysis
7 Conclusion
References
49 Advanced Approach for Heart Disease Diagnosis with Grey Wolf Optimization and Deep Learning Techniques
1 Introduction
2 Literature Survey
3 Proposed System
3.1 Without Optimization
3.2 With Optimization (Grey Wolf Optimization)
4 Results and Discussions
5 Conclusion and Future Work
References
50 Hyper-personalization and Its Impact on Customer Buying Behaviour
1 Introduction
2 Literature Review
3 Methods
3.1 PLS-SEM Model
3.2 KNN Model
3.3 Matrix Factorization
3.4 Support Vector Machines
3.5 Association Rule of Mining
3.6 Decision Tree Analysis
3.7 Frequent Sequential Pattern
3.8 Recency Frequency Monetary Analysis
3.9 Genetic Algorithms
4 Proposed Model
5 Experimental Results
6 Conclusion and Future Work
References
51 Proof of Concept of Indoor Location System Using Long RFID Readers and Passive Tags
1 Introduction
1.1 UHF Frequency Band
1.2 Passive Elements
1.3 Basic UHF Passive RFID Components
1.4 Basic Long RFID Location System Concept
2 Research Equipment and Configuration
2.1 Equipment Configuration
2.2 Software Monitor
3 Measurements and Recommendations
3.1 Key Components
4 Further Work
5 Conclusions
References
52 Autism Detection in Young Children Using Optimized Long Short-Term Memory
1 Introduction
2 Literature Review
2.1 Related Works
3 Overall Framework of Autism Detection in Young Children
4 Preprocessing and Feature Extraction Phase
4.1 Preprocessing
4.2 Feature Extraction
5 Detection Using Optimized Long Short-Term Memory
5.1 Optimized LSTM
6 Weight Optimization of LSTM Using Arithmetic Crossover Insisted Shark Smell Optimization Scheme
6.1 Objective Function and Solution Encoding
6.2 Proposed ACSSO Model
7 Results and Discussions
7.1 Simulation Procedure
7.2 Performance Analysis
7.3 Statistical Analysis
7.4 Analysis on Optimization
7.5 Analysis on Classifiers
7.6 Convergence Analysis
8 Conclusion
References
53 A Comparative Review Analysis of OpenFlow and P4 Protocols Based on Software Defined Networks
1 Introduction
2 Background of the Work
2.1 Open Signaling
2.2 Active Networking
2.3 4D Project
2.4 NETCONF
2.5 Ethane
3 Protocols Enabled Software Defined Network
3.1 The Forwarding Control Element Separation (ForCES)
3.2 OpenFlow
3.3 Comparison Between OpenFlow and ForCES
4 Overview of Openflow
5 Overview of Programming Protocol-Independent Packet Processor (P4)
5.1 Comparison Between OpenFlow and P4 Protocols
6 Switch Architecture
6.1 The Switch Architecture of the Openflow
6.2 The Switch Architecture of P4
7 Conclusion and Future Recommendation
References
54 NLP-Driven Political Analysis of Subreddits
1 Introduction
2 Inspiration and Related Work
3 Data Collection
4 EDA
5 Biased Embedding Sentiment Models
5.1 Method
5.2 Sentiment Observations
6 Political Leaning Classification
6.1 Method
6.2 Model Performance
6.3 Model Interpretation
7 Conclusion
References
55 Feature Extraction and Selection with Hyperparameter Optimization for Mitosis Detection in Breast Histopathology Images
1 Introduction
2 Related Work
3 Data Set
4 Image Pre Processing
5 Proposed Method
5.1 Overview of Proposed Methods
6 Feature Extraction
6.1 Oriented FAST and Rotated BRIEF (ORB)
6.2 Center Surround Extremas (Censure)
6.3 Edge Detection
6.4 Histogram of Oriented Gradients (HOG)
6.5 Corner Peak
6.6 Grayscale Pixel Values
6.7 Principal Component Analysis (PCA)
7 Classification
7.1 Support Vector Machines (SVM)
7.2 K-Nearest Neighbor (KNN)
7.3 Random Forest
8 Hyperparameter Optimization
8.1 Grid Search
8.2 Random Search
8.3 Bayesian Optimization with Gaussian Process
8.4 Genetic Algorithm
8.5 TPOT
9 Feature Selection
9.1 Red Deer Algorithm
9.2 Cuckoo Search Algorithm
9.3 Harmony Search
9.4 Whale Optimization Algorithm
9.5 Genetic Algorithm
9.6 Binary Bat Algorithm
9.7 Gray Wolf Optimizer
10 Results and Discussion
11 Conclusion
References
56 Review and Comparative Analysis of Unsupervised Machine Learning Application in Health Care
1 Introduction
2 Machine Learning Applications in Health Care
3 Methods and Datasets Used in Research
4 Results
5 Conclusions
References
57 A Systematic Literature Review on Cybersecurity Threats of Virtual Reality (VR) and Augmented Reality (AR)
1 Introduction
2 Problem Statement
3 Objective
4 Research Questions
5 Scope of the Study
6 Expected Results
7 Selection of Research Papers for Review
8 Literature Survey
9 Results and Discussion
10 Conclusions
References
58 A Review on Risk Analysis of Cryptocurrency
1 Introduction
2 Cryptocurrency Risk Assessment
3 Selection of Papers for Literature Review
3.1 Search String
3.2 Selection of Papers by PRISMA
4 Cryptocurrencies and Blockchains
5 Literature Survey
6 Future Challenges
7 Conclusion
References
59 Sine Cosine Algorithm with Tangent Search for Neural Networks Dropout Regularization
1 Introduction
2 Related Works and Preliminaries
3 Sine Cosine Metaheuristics and Proposed Enhancements
3.1 Cons of Basic SCA and Proposed Improved Version
4 Simulations and Discussion
5 Conclusion
References
60 Development of a Web Application for the Management of Patients in the Medical Area of Nutrition
1 Introduction
2 Objectives
3 Architecture
4 Functionality
4.1 Nutritionist Functionality
4.2 Patient Functionality
5 Evaluation
6 Conclusions and Future Work
References
61 Exploring the Potential Adoption of Metaverse in Government
1 Introduction to Metaverse
2 Metaverse Integration in Government
3 Metaverse Opportunities
3.1 To Find Innovative Ways to Communicate with the Citizens
3.2 To Establish Team Working Operation Inside the Workplace
3.3 To Find New Employees
3.4 To Develop a New Economy
4 Metaverse Challenges
5 Epilogue
References
62 Hyperparameter Tuning in Random Forest and Neural Network Classification: An Application to Predict Health Expenditure Per Capita
1 Introduction
2 Materials and Methods
2.1 The Data Classification Task with RF and NN
2.2 K-Fold Cross Validation to Improve Classification Performance
2.3 Measuring the Discriminatory Power of a Model
3 Findings
3.1 Descriptive Statistics
3.2 Binary Coding of Health Expenditure Per Capita Variable
3.3 Multicollinearity Check
3.4 Random Forest and Neural Network Classification Performances by Changing Hyperparameters and Incorporating K-Fold Cross Validation into the Model
4 Conclusions
References
63 Dual-RvCore32IMA: Implementation of a Peripheral Device to Manage Operations of Two RvCores
1 Introduction
2 Related Work
3 Core Management Unit
3.1 Data Control
3.2 Flow Control
4 C Code Multi-core Task for Testing
5 Conclusion
6 Future Work
References
64 A Comparative Study of SVM, CNN, and DCNN Algorithms for Emotion Recognition and Detection
1 Introduction
2 Related Work
3 Description of Support Vector Machine, Convolutional Neural Network, and Deep Convolutional Neural Network Models
3.1 Support Vector Machine
3.2 Convolutional Neural Network (CNN)
3.3 Proposed Deep Convolutional Neural Network (DCNN) MODEL
4 System Design
5 Implementation Results and Discussions
6 Conclusion
References
65 Monitoring and Prediction of Smart Farming Using Hybrid PSO-ELM Model
1 Introduction
2 Related Works
3 Proposed Work
4 Result and Discussion
5 Conclusion
References
66 Recommendation System Using Different Approaches
1 Introduction
2 Related Work
3 Dataset
4 Proposed Methodology
4.1 Recommendations Based on Face Detection Technique
4.2 Recommendations Based on ML Approach
4.3 Recommendations Based on User’s Conversation with Chatbot
5 Experimental Results
6 Conclusion
7 Future Work
References
67 An Exploratory Data Analysis to Examine the Influence of Confinement on Student Learning, Sociability, and Well-Being Under COVID-19
1 Introduction
2 Data Acquisition and Preprocessing
2.1 Data Collection
2.2 Data Cleaning
3 Methodology
3.1 Pandas
3.2 Numpy
3.3 Matplotlib
3.4 Seaborn
3.5 Open Datasets
4 Analysis and Results
4.1 Age Division of Participants
4.2 Demographic Details
4.3 Dedicated Class Time
4.4 Time Spent on Other Pursuits
4.5 Platforms of Social Media Preference
4.6 Stress Relievers
4.7 Things that They Miss Out
4.8 Issues with Health
4.9 Weight Distribution
4.10 Impact of Fitness on Weight
4.11 Satisfaction Over Different Mediums
4.12 Rating of Online Class
5 Conclusion
References
Correction to: A Systematic Literature Review on Cybersecurity Threats of Virtual Reality (VR) and Augmented Reality (AR)
Correction to: Chapter 57 in: I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_57
Correction to: A Review on Risk Analysis of Cryptocurrency
Correction to: Chapter 58 in: I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_58
Author Index


Algorithms for Intelligent Systems Series Editors: Jagdish Chand Bansal · Kusum Deep · Atulya K. Nagar

I. Jeena Jacob · Selvanayaki Kolandapalayam Shanmugam · Ivan Izonin, Editors

Data Intelligence and Cognitive Informatics Proceedings of ICDICI 2022

Algorithms for Intelligent Systems Series Editors Jagdish Chand Bansal, Department of Mathematics, South Asian University New Delhi, Delhi, India Kusum Deep, Department of Mathematics, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India Atulya K. Nagar, School of Mathematics, Computer Science and Engineering, Liverpool Hope University, Liverpool, UK

This book series publishes research on the analysis and development of algorithms for intelligent systems with their applications to various real world problems. It covers research related to autonomous agents, multi-agent systems, behavioral modeling, reinforcement learning, game theory, mechanism design, machine learning, metaheuristic search, optimization, planning and scheduling, artificial neural networks, evolutionary computation, swarm intelligence and other algorithms for intelligent systems. The book series includes recent advancements, modification and applications of the artificial neural networks, evolutionary computation, swarm intelligence, artificial immune systems, fuzzy system, autonomous and multi agent systems, machine learning and other intelligent systems related areas. The material will be beneficial for the graduate students, post-graduate students as well as the researchers who want a broader view of advances in algorithms for intelligent systems. The contents will also be useful to the researchers from other fields who have no knowledge of the power of intelligent systems, e.g. the researchers in the field of bioinformatics, biochemists, mechanical and chemical engineers, economists, musicians and medical practitioners. The series publishes monographs, edited volumes, advanced textbooks and selected proceedings. Indexed by zbMATH. All books published in the series are submitted for consideration in Web of Science.

I. Jeena Jacob · Selvanayaki Kolandapalayam Shanmugam · Ivan Izonin Editors

Data Intelligence and Cognitive Informatics Proceedings of ICDICI 2022

Editors I. Jeena Jacob Department of Computer Science and Engineering GITAM University Bengaluru, Karnataka, India

Selvanayaki Kolandapalayam Shanmugam Department of Mathematics and Computer Science Ashland University Ashland, OH, USA

Ivan Izonin Department of Artificial Intelligence Lviv Polytechnic National University Lviv, Ukraine

ISSN 2524-7565 ISSN 2524-7573 (electronic) Algorithms for Intelligent Systems ISBN 978-981-19-6003-1 ISBN 978-981-19-6004-8 (eBook) https://doi.org/10.1007/978-981-19-6004-8 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023, corrected publication 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

The ICDICI 2022 is solely dedicated to all the editors, reviewers, and authors of the conference event.

Preface

With deep satisfaction, we write this foreword to welcome you all to the 3rd International Conference on Data Intelligence and Cognitive Informatics [ICDICI 2022], held in Tirunelveli, India, on 6–7 July 2022. The theme of ICDICI 2022 is Data Intelligence, a research topic that is quickly gaining traction in both industry and academia owing to its relevance to emerging societal and economic issues in areas such as healthcare, transportation, industry, and education. The well-established research track record on intelligent data systems and the mandate to integrate artificial intelligence techniques and processes make ICDICI an excellent venue for exploring the cognitive foundations of emerging data systems. We would like to express our appreciation and gratitude for the hard work of the ICDICI 2022 conference committee: the technical program committee members, the international and national advisory board members, and the review committee members who have made this conference possible and successful. Finally, we extend our warm thanks to all the keynote speakers, session chairs, and fellow researchers who have willingly shared their research experience and knowledge with the readers of these extended conference proceedings. We hope that the proceedings of ICDICI 2022 will further stimulate research in data mining and intelligent systems and provide practitioners with advanced algorithms, techniques, and tools for deployment. We feel honored and privileged to bring the significant recent developments in the field of intelligent systems and data intelligence to you through this exciting program.

Dr. I. Jeena Jacob, Bengaluru, India
Dr. Selvanayaki Kolandapalayam Shanmugam, Ashland, USA
Assoc. Prof. Dr. Ivan Izonin, Lviv, Ukraine


Contents

1 Modeling Müller-Lyer Illusion Using Information Geometry (Debasis Mazumdar, Soma Mitra, Mainak Mandal, Kuntal Ghosh, and Kamales Bhaumik) ..... 1
2 Building up a Categorical Sentiment Dictionary for Tourism Destination Policy Evaluation (Kang Woo Lee, Ji Won Lim, Myeong Seon Kim, Da Hee Kim, and Soon-Goo Hong) ..... 15
3 Statistical Analysis of Stress Prediction from Speech Signatures (Radhika Kulkarni, Utkarsha Gaware, and Revati Shriram) ..... 27
4 DermoCare.AI: A Skin Lesion Detection System Using Deep Learning Concepts (Adarsh Singh, Sourabh Bera, Pranav Chaturvedi, Pranav Gadhave, and C. S. Lifna) ..... 39
5 Analysis of Phishing Base Problems Using Random Forest Features Selection Techniques and Machine Learning Classifiers (Mithilesh Kumar Pandey, Munindra Kumar Singh, Saurabh Pal, and B. B. Tiwari) ..... 53
6 Cost Prediction for Online Home-Based Application Services by Using Linear Regression Techniques (Rounak Goje, Vaishnavi Kale, Ritik Raj, Shivkumar Nagre, Geeta Atkar, and Geeta Zaware) ..... 65
7 Convolutional Neural Network Based Intrusion Detection System and Predicting the DDoS Attack (R. Rinish Reddy, Sadhwika Rachamalla, Mohamed Sirajudeen Yoosuf, and G. R. Anil) ..... 81
8 BERT Transformer-Based Fake News Detection in Twitter Social Media (S. P. Devika, M. R. Pooja, M. S. Arpitha, and Vinayakumar Ravi) ..... 95
9 The Facial Expression Recognition Using Deep Neural Network (Vijay Mane, Rohan Awale, Vipul Pisal, and Sanmit Patil) ..... 103

10 New IoT-Based Portable Microscopic Somatic Cell Count Analysis (A. Sivasangari, D. Deepa, R. M. Gomathi, P. Ajitha, and S. Poonguzhali) ..... 113
11 A Survey on Hybrid PSO and SVM Algorithm for Information Retrieval (D. R. Ganesh and M. Chithambarathanu) ..... 121
12 Metric Effects Based on Fluctuations in Values of k in Nearest Neighbor Regressor (Abhishek Gupta, Raunak Joshi, Nandan Kanvinde, Pinky Gerela, and Ronald Melwin Laban) ..... 131
13 An Ensemble Approach to Recognize Activities in Smart Environment Using Motion Sensors and Air Quality Sensors (Shruti Srivatsan, Sumneet Kaur Bamrah, and K. S. Gayathri) ..... 141
14 Generalization of Fingerprint Spoof Detector (C. Kanmani Pappa, T. Kavitha, I. Rama Krishna, V. Venkata Lokesh, and A. V. L. Narayana) ..... 151
15 Applied Deep Learning for Safety in Construction Industry (Tanvi Bhosale, Ashwini Biradar, Kartik Bhat, Sampada Barhate, and Jameer Kotwal) ..... 167
16 Deep Learning-Based Quality Inspection System for Steel Sheet (M. Sambath, C. Sai Bhargav Reddy, Y. Kalyan Reddy, M. Mohit Sairam Reddy, M. Kathiravan, and S. Ravi) ..... 183
17 Forecasting Prediction of Covid-19 Outbreak Using Linear Regression (Gurleen Kaur, Parminder Kaur, Navinderjit Kaur, and Prabhpreet Kaur) ..... 195
18 Proctoring Solution Using AI and Automation (Semi) (Ravi Sridharan, Linda Joseph, and B. Sandhya Reddy) ..... 223
19 Apple Leaf Disease Prediction Using Deep Learning Technique (Thota Rishitha and G. Krishna Mohan) ..... 239


20 Sentimental Analysis and Classification of Restaurant Reviews (P. Karthikeyan, V. Aishwariya Rani, B. Jeyavarshini, and M. N. Muthupriyaadharshini) ..... 247
21 A Controllable Differential Mode Band Pass Filter with Wide Stopband Characteristics (K. Renuka, Ch. Manasa, P. Sriharitha, and B. Vijay Chandra) ..... 263
22 Design and Analysis of Conformal Antenna for Automotive Applications (Sk. Jani Basha, R. Koteswara Rao, B. Subbarao, and T. R. Chaitanya) ..... 271
23 An Improved Patch-Group-Based Sparse Representation Method for Image Compressive Sensing (Abhishek Jain, Preety D. Swami, and Ashutosh Datar) ..... 283
24 Comparative Analysis of Stock Prices by Regression Analysis and FB Prophet Models (Priyanka Paygude, Aatmic Tiwari, Bhavya Goel, and Akshat Kabra) ..... 295
25 g POD—Dual Purpose Device (Dustbin and Cleaning) (R. Brindha, Vinoth Kumar Balan, Harri Srinivasan, Kartik Rajayria, and Rohit Kumar Singh) ..... 309
26 An Attractive Proposal Based on Big Data for Sentiment Analysis Using Artificial Intelligence (Omar Sefraoui, Afaf Bouzidi, Kamal Ghoumid, and El Miloud Ar-Reyouchi) ..... 329
27 SqueezeNet Deep Neural Network Embedder-Based Brain Tumor Classification Using Supervised Machine Intelligent Approach (Kalyan Kumar Jena, Sourav Kumar Bhoi, Kodanda Dhar Naik, Chittaranjan Mallick, and Rajendra Prasad Nayak) ..... 337
28 Detection of Malicious Unmanned Aerial Vehicle Carrying Unnecessary Load Using Supervised Machine Intelligence Model with SqueezeNet Deep Neural Network Image Embedder (Sourav Kumar Bhoi, Kalyan Kumar Jena, Kodanda Dhar Naik, Chittaranjan Mallick, and Rajendra Prasad Nayak) ..... 349
29 Face Mask Detection Using Artificial Intelligence to Operate Automatic Door (Suhaila Mohammed, Fahim Ahmed, Mohammad Azwad Saadat Sarwar, Rubayed Mehedi, Kaushik Sarker, and Mahady Hasan) ..... 363


30 Marine Weather Prediction Using Preprocessing Techniques in Big Data (J. Deepa Anbarasi and V. Radha) ..... 379
31 Yolov4 in White Blood Cell Classification (Luong Duong Trong, Tung Pham Thanh, Hung Pham Manh, and Duc Nguyen Minh) ..... 387
32 Efficient Data Hiding Model by Using RDH Algorithm (K. Renuka Devi, R. S. Hari karthikkeyyan, B. Balakumar, and G. Chandru) ..... 401
33 Enhanced Preprocessing Technique for Air Pollution Forecasting System Using Big Data and Internet of Things (M. Dhanalakshmi and V. Radha) ..... 411
34 Pre-processing of Leukemic Blood Cell Images Using Image Processing Techniques (Saranya Vijayan and Radha Venkatachalam) ..... 419
35 Automated Grocery List Item Add-to-Cart Leveraging Optical Character Recognition with Transformer (Tejaswi Kashyapi, Rohan Pawar, Nomesh Sarode, Aniket Dakhore, Geeta Atkar, and Geeta Zaware) ..... 431
36 Anomaly Detection in Image Sequences Using Weakly Supervised Learning (Suyash Dhondkar, Manish Khare, and Pankaj Kumar) ..... 443
37 Sentiment Analysis of Twitter Data for COVID-19 Posts (Salil Bharany, Shadab Alam, Mohammed Shuaib, and Bhanu Talwar) ..... 457
38 Brain Tumor Detection Using Image Processing Approach (Abhinav Agarwal, Himanshu Arora, Shivam Kumar Singh, and Vishwabandhu Yadav) ..... 467
39 Routing Method for Interplanetary Satellite Communication in IoT Networks Based on IPv6 (Paweł Dobrowolski, Grzegorz Debita, and Przemysław Falkowski-Gilski) ..... 477
40 Parameterization of Sequential Neural Networks for Predicting Air Pollution (Farheen and Rajeev Kumar) ..... 491
41 Customer Analytics Research: Utilizing Unsupervised Machine Learning Techniques (Anuj Kinge, P. B. Hrithik, Yash Oswal, and Nilima Kulkarni) ..... 501


42 Multi-class IoT Botnet Attack Classification and Evaluation Using Various Classifiers and Validation Techniques (S. Chinchu Krishna and Varghese Paul) ..... 517
43 IoT-Based Dashboards for Monitoring Connected Farms Using Free Software and Open Protocols (K. Deepika and B. Renuka Prasad) ..... 529
44 Predicting the Gestational Period Using Machine Learning Algorithms (R. Jane Preetha Princy, Saravanan Parthasarathy, S. Thomas George, and M. S. P. Subathra) ..... 545
45 Digital Methodologies and ICT Intervention to Combat Counterfeit and Falsified Drugs in Medicine: A Mini Survey (Munirah Alshabibi, Elham Alotaibi, M. M. Hafizur Rahman, and Muhammad Nazrul Islam) ..... 561
46 Utilizing Hyperledger-Based Private Blockchain to Secure E-Passport Management (Nusrat Jahan and Saha Reno) ..... 579
47 An Exploratory Data Analysis on SDMR Dataset to Identify Flood-Prone Months in the Regional Meteorological Subdivisions (J. Subha and S. Saudia) ..... 595
48 Segmentation of Shopping Mall Customers Using Clustering (D. Deepa, A. Sivasangari, R. Vignesh, N. Priyanka, J. Cruz Antony, and V. GowriManohari) ..... 619
49 Advanced Approach for Heart Disease Diagnosis with Grey Wolf Optimization and Deep Learning Techniques (Dimple Santoshi, Sangita Chaudhari, and Namita Pulgam) ..... 631
50 Hyper-personalization and Its Impact on Customer Buying Behaviour (Saurav Kumar, R. Ashoka Rajan, A. Swaminathan, and Ernest Johnson) ..... 649
51 Proof of Concept of Indoor Location System Using Long RFID Readers and Passive Tags (Piotr Łoziński and Jerzy Demkowicz) ..... 665
52 Autism Detection in Young Children Using Optimized Long Short-Term Memory (S. Guruvammal, T. Chellatamilan, and L. Jegatha Deborah) ..... 677


53 A Comparative Review Analysis of OpenFlow and P4 Protocols Based on Software Defined Networks (Lincoln S. Peter, Hlabi Kobo, and Viranjay M. Srivastava) ..... 699
54 NLP-Driven Political Analysis of Subreddits (Kuldeep Singh and Sai Venkata Naga Saketh Anne) ..... 713
55 Feature Extraction and Selection with Hyperparameter Optimization for Mitosis Detection in Breast Histopathology Images (Suchith Ponnuru and Lekha S. Nair) ..... 727
56 Review and Comparative Analysis of Unsupervised Machine Learning Application in Health Care (Mantas Lukauskas and Tomas Ruzgas) ..... 751
57 A Systematic Literature Review on Cybersecurity Threats of Virtual Reality (VR) and Augmented Reality (AR) (Abrar Alismail, Esra Altulaihan, M. M. Hafizur Rahman, and Abu Sufian) ..... 761
58 A Review on Risk Analysis of Cryptocurrency (Almaha Almuqren, Rawan Bukhowah, and M. M. Hafizur Rahman) ..... 775
59 Sine Cosine Algorithm with Tangent Search for Neural Networks Dropout Regularization (Luka Jovanovic, Milos Antonijevic, Miodrag Zivkovic, Dijana Jovanovic, Marina Marjanovic, and Nebojsa Bacanin) ..... 789
60 Development of a Web Application for the Management of Patients in the Medical Area of Nutrition (Antonio Sarasa-Cabezuelo) ..... 803
61 Exploring the Potential Adoption of Metaverse in Government (Vasileios Yfantis and Klimis Ntalianis) ..... 815
62 Hyperparameter Tuning in Random Forest and Neural Network Classification: An Application to Predict Health Expenditure Per Capita (Gulcin Caliskan and Songul Cinaroglu) ..... 825
63 Dual-RvCore32IMA: Implementation of a Peripheral Device to Manage Operations of Two RvCores (Demyana Emil, Mohammed Hamdy, and Jihan Nagib) ..... 837
64 A Comparative Study of SVM, CNN, and DCNN Algorithms for Emotion Recognition and Detection (R. Prabha, G. A. Senthil, M. Razmah, S. R. Akshaya, J. Sivashree, and J. Cyrilla Swathi) ..... 849


65 Monitoring and Prediction of Smart Farming Using Hybrid PSO-ELM Model (A. Sridevi and M. Preethi) ..... 865
66 Recommendation System Using Different Approaches (Devata Dinesh Vamsi Durga Bhaskar, Madabhushi Aditya, N. Chaithanya, Dande Dharani, and Greeshma Sarath) ..... 883
67 An Exploratory Data Analysis to Examine the Influence of Confinement on Student Learning, Sociability, and Well-Being Under COVID-19 (Amandeep Kaur and Karanjeet Singh Kahlon) ..... 895
Correction to: A Systematic Literature Review on Cybersecurity Threats of Virtual Reality (VR) and Augmented Reality (AR) (Abrar Alismail, Esra Altulaihan, M. M. Hafizur Rahman, and Abu Sufian) ..... C1
Correction to: A Review on Risk Analysis of Cryptocurrency (Almaha Almuqren, Rawan Bukhowah, and M. M. Hafizur Rahman) ..... C3
Author Index ..... 911

About the Editors

I. Jeena Jacob is working as Professor in the Computer Science and Engineering department at GITAM University, Bangalore, India. She actively participates in the development of the research field by conducting international conferences, workshops, and seminars. She has published many articles in refereed journals and has guest edited an issue of the International Journal of Mobile Learning and Organization. Her research interests include mobile learning and computing.

Selvanayaki Kolandapalayam Shanmugam holds a Bachelor's degree in Mathematics and a Master's in Computer Applications from Bharathiar University, a Master of Philosophy in Computer Science from Bharathidasan University, and a Ph.D. in Computer Science from Anna University. She has held various positions as Teaching Faculty, Research Advisor, and Project Coordinator in academia since 2002 at various reputed institutions. She was also associated with the IT industry for more than five years in the role of Business Analyst Consultant. Her primary research interests are in the application of computing and information technologies to problems with societal impact.

Ivan Izonin graduated from Lviv Polytechnic National University in 2011 (M.Sc. in Computer Sciences) and Ivan Franko National University of Lviv in 2012 (M.Sc. in Economic Cybernetics). He received his Ph.D. in Artificial Intelligence in 2016. He is currently working as Assistant at the Publishing Information Technologies Department, Lviv Polytechnic National University, Lviv, Ukraine. He has published more than 80 publications, including eight patents for inventions and one tutorial. His major research interests are computational intelligence, high-speed neural-like systems, and non-iterative machine learning algorithms. Dr. Izonin participates in the development of two international Erasmus+ projects, MASTIS and DocHub. He is a Technical Committee Member of several international conferences.


Chapter 1

Modeling Müller-Lyer Illusion Using Information Geometry

Debasis Mazumdar, Soma Mitra, Mainak Mandal, Kuntal Ghosh, and Kamales Bhaumik

1 Introduction

Visual illusory stimuli generally refer to percepts that deviate from what would be veridically predicted based on the information collected from the physical stimulus. Therefore, they are considered to be powerful tools for studying the neurobiological aspects of vision in a non-invasive way. Moreover, successful modeling of the perceptual mechanism underlying the perception of illusory stimuli indicates the possibility of introducing bioplausible algorithms in computer vision. Geometrical illusions constitute a subclass of illusions in which spatial extensions, orientations, and angles are distorted and misperceived [1]. The Müller-Lyer illusion is one of the most extensively studied geometrical optical illusions, in which the visually perceived length of a line is wrongly estimated when the line is terminated by inward or outward arrows [2]. When the ends of the line are terminated with inward arrows (wing-in), the length is underestimated, and with outward arrows (wing-out), the length is overestimated (Fig. 1). Besides this, the strength of illusion has been found to be a function of various geometrical parameters of the context [3, 4]. For example, the strength of illusion varies with the wing tilt angle. For inward wings, the tilt angles are acute, and the strength of illusion decreases with the angle up to 90°. When the wings are outward, i.e., the wing tilt angle is obtuse, the illusion of extent reverses its direction and increases with the angle [5, 6]. Variation of the strength of illusion is also a function of the wing length. The error defining the strength of illusion reverses its sign for outward arrows (overestimation)


and inward arrows (underestimation) [4, 5, 7]. Restle and Decker [5] observed that, according to their definition, the percentage error in the estimation of length is positive when the wing tilt angles are obtuse, i.e., the arrows are outward, while it is negative in the case of acute wing tilt angles, i.e., inward arrows. In the first case, they observed that the error is an inverted-U function of the length of the wing, while in the case of wings-in the measured errors are relatively insensitive to wing length. The illusion strength is also a function of shaft length and is found to be proportional to it [4, 5].

A number of hypotheses have been proposed, and various mathematical models have been offered based on qualitative and quantitative analysis of the experimental data. One of the oldest models explaining the Müller-Lyer illusion involved a combination of two factors, namely confluxion, i.e., two points are seen closer together than the objective display would justify, and contrast, meaning that they are seen too far apart [8]. Gregory et al. proposed another type of explanation, which essentially described that primary cues of depth and perspective elicit the perceived length distortion [9]. Pressey's theory of assimilation predicted that the focal shaft is averaged with the magnitude of the contextual figures flanking the shaft [10]. In parallel, Anderson [11] proposed a qualitative weighted averaging model largely supporting Pressey's assimilation theory. Other theories, such as confusion theory [12] and the receptive field model [13], advocated that the proximal figures are responsible for the magnitude of distortion. Digressing from these concepts, Morgan [14] introduced the concept of center of gravity, which explains that while perceiving a complex object with many constituent parts, its position appears to be located at its centroid. The perceptual distortions might result from feedback from an inappropriate tendency to fixate the center of gravity of contextual patterns when estimating the end points of any object.

Against the backdrop of these several opposing concepts and theories, the present work reports an experimental and theoretical investigation of the Müller-Lyer illusion from a completely new angle. It commences with quantification of the strength of illusion of the Müller-Lyer stimulus as a function of different geometrical parameters through a series of psychometric experiments and reconfirms the past archived data. Further, to fit these experimental data, a model is proposed using the framework of information geometry and the population code of the Müller-Lyer stimulus. It is hypothesized that the end points of any line stimulus are represented as the peaks of the firing pattern of a population of neurons in the human brain. Further, the estimated distance between two such points is shown to be a function of the spread of the population firing of neurons as well.

The research outcome is organized as follows: In Sect. 2, we describe our experimental arrangement to quantify the strength of illusion as a function of different geometrical parameters of the Müller-Lyer stimuli, and Sect. 3 is devoted to describing the role of Fisher information in estimating the error of extension in the Müller-Lyer stimulus. Further, it is shown that in the framework of information geometry the stimulus can be considered as a geodesic on a so-called visual space having a metric different from that of the physical space.
The visual space is essentially a statistical parametric space of constant negative curvature in which form and localization are completely uncorrelated, ensuring the free mobility of rigid bodies. The Fisher-Rao distance function emerges as a natural choice of measure of proper distance in the visual space. Section 4 describes the method to compute the Fisher-Rao distance function to yield the percentage error in measuring the length of the Müller-Lyer stimulus under variation of different geometric parameters. The simulated results demonstrate that the proposed model captures all the experimental observations very well. Finally, we conclude the paper in Sect. 5 by briefly discussing applications of the information geometric model to different real-life problems.

Fig. 1 Müller-Lyer stimulus. Both the horizontal shafts are of the same length but are perceived differently due to the change in the context
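As a concrete illustration of the distance measure named above, the sketch below evaluates the closed-form Fisher-Rao distance between two univariate Gaussian densities; the (μ, σ) manifold under the Fisher information metric is a space of constant negative curvature, matching the description of the visual space. This is only a minimal sketch assuming Gaussian population profiles for the two end points, with hypothetical means and spreads; it is not the authors' full population-code computation.

```python
import numpy as np

def fisher_rao_distance(mu1, sigma1, mu2, sigma2):
    """Closed-form Fisher-Rao distance between N(mu1, sigma1^2) and N(mu2, sigma2^2).

    Under the Fisher information metric the (mu, sigma) half-plane is hyperbolic,
    and the geodesic distance is sqrt(2) times the Poincare half-plane distance
    between the points (mu/sqrt(2), sigma).
    """
    delta = ((mu1 - mu2) ** 2 / 2.0 + (sigma1 - sigma2) ** 2) / (2.0 * sigma1 * sigma2)
    return np.sqrt(2.0) * np.arccosh(1.0 + delta)

if __name__ == "__main__":
    # Hypothetical population codes for the two end points of a shaft:
    # peaks (means) 176 units apart, with different spreads of firing.
    narrow = fisher_rao_distance(0.0, 10.0, 176.0, 10.0)
    broad = fisher_rao_distance(0.0, 25.0, 176.0, 25.0)
    print(f"Fisher-Rao distance, narrow spread: {narrow:.3f}")
    print(f"Fisher-Rao distance, broad spread:  {broad:.3f}")
    # The same physical separation yields a smaller statistical distance
    # when the population response is more spread out.
```

The printed values show that the same physical separation maps to a smaller statistical distance when the population response is broader, which is the kind of spread dependence the model appeals to.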

2 Method

To quantify the variation of the strength of illusion as a function of different geometric parameters of the contextual flank, we conducted psychophysical experiments. The experiments were conducted by a computer program designed using the Python imaging library [15]. It generates Müller-Lyer stimuli with different geometrical configurations of the shaft and the contextual flanks, implements alterations according to the choice of subjects, and records the subject's response. Finally, it computes the strength of illusion.

The experimental setup consists of an LCD monitor, a chin rest, and a keyboard. The distance between the monitor and the chin rest was kept at a comfortable viewing distance of 70 cm. Experiments were conducted in ambient light. The size of the stimulus figures was 400 × 400 pixels on a white background of 1024 × 768 pixels. The subjects were shown an image together with an adjustable reference line. Their task was to look at the image and set the length of the reference line equal to the shaft length. The initial length of the reference line was set at random within the shaft length ±22 pixels. The strength of illusion is a measure of the difference between the actual length of the shaft and its perceived length. Measurements were made for the Müller-Lyer stimulus with variation in (i) the tilt angle of the wings from the shaft (15° to 165°, including both inward and outward wings), (ii) the length of the wings of the arrows, and (iii) the length of the shaft.

In all the experiments, the actual length of the shaft is denoted by L_o, while the apparent length as perceived by the subject is denoted by L_A. The strength of illusion, expressed in percentage, is given by

E = \frac{L_o - L_A}{L_o} \times 100\% \qquad (1)

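As a quick numerical illustration of Eq. (1), the following sketch (written for this text, not part of the authors’ experimental program; the sample lengths are made up) computes the strength of illusion from an actual and a perceived shaft length.

```python
def illusion_strength(L_o: float, L_A: float) -> float:
    """Strength of illusion in percent, Eq. (1): positive for wing-in
    (underestimation), negative for wing-out (overestimation)."""
    return (L_o - L_A) / L_o * 100.0

# Hypothetical example: a 176-pixel shaft judged as 160 pixels gives about +9.1%.
print(illusion_strength(176, 160))
```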

Therefore, the error is positive as long as $L_o > L_A$ (wing-in) and is negative when $L_o < L_A$ (wing-out). It is worth mentioning here that the error (E) is defined in [8] by the same formula except that the numerator is defined as $(L_A - L_o)$; therefore, the sign of the error appears to be opposite to our convention. Subjects: Data were collected from three adult subjects (one female and two males) in each case. All the subjects gave their consent to participate. Subjects either had normal vision or wore their usual corrective glasses. None of them had any history of visual disorder. Their visual acuity was within the range of normal vision (~1.0). Two out of three subjects were naive regarding the goal of the study. The judgment of all the subjects was qualitatively found consistent. In the first set of experiments, the Müller-Lyer stimuli had a shaft length ($L_o$) of 176 pixels while the wing length was fixed at 56 pixels. In each run, a subject repeated this task for 11 different Müller-Lyer stimuli with varied wing angles, and each stimulus was repeated three times in random order. In each case, the length of the wings was kept constant. No limit was imposed on the observation time. The results of the experiments are given in Fig. 2. It is to be noted here that Biervliet and Heyman conducted the same experiment and their results are reproduced in [5]. They conjectured that if the length of the shaft of the reference stimulus is X and W is the length of the wings, then according to the confluxion theory the judged length J(X) would lie between X and X − 2W cos θ, where θ is the angle of the wings to the shaft. Therefore, it would be a function of cos(θ). The data obtained by us are qualitatively of the same nature and corroborate the same conjecture. Further, our results are quite in agreement with the more recent experimental data obtained by Bulatov et al. [4]. In the second set of experiments, we recorded the judgments of three subjects regarding the perceived length of the shaft, varying the length of the wing in the range of 5–50 pixels in intervals of 5 pixels. Here again we followed the protocol of the first experiment regarding the randomized presentation of the stimuli to the subjects and the collection of their judgments. The error in estimating the length of the stimulus for different tilt angles of the wing from the shaft was collected. For stimuli with an inward arrow (i.e., the acute-angled context), the angles are taken to be 18°, 36°, 54°, and 72°, while in the case of the obtuse-angled contextual pattern, i.e., the outward arrowheads, the angles are chosen as 108°, 126°, 144°, and 162°. In between the acute- and obtuse-angled contexts, data for the 90° orientation of the wings are also recorded. The angles are chosen to compare our data with those observed in past experiments [4, 5]. In all the cases, the shaft length ($L_o$) of the Müller-Lyer stimuli was kept at 283 pixels. The average percentage of error as a function of the length of the wing for different tilt angles with the shaft is represented in Fig. 3a and b. As mentioned earlier, the sign of the error according to our definition following Eq. 1 is opposite to that considered in [5]. When we compare the data collected by us with those reported in [4, 5], we find that the percentage error is an inverted U-shaped function of the wing length in the case of inward wings.
The maxima of the curves occur at 7.07% of the shaft length when the tilt angle of the wings with the shaft is 18°, 36°, and 54°. For the wing tilt angle of 72°, within the range of wing lengths considered, the curve does not show such an inversion. Instead, it increases monotonically to reach the maximum


value for the wing length of 50 pixels, which is 17.67% of the shaft length. In Fig. 3b, the experimentally obtained curves representing the change in error as a function of wing length for outward wings are presented. The curves show a similar variation to the inward-wing case but inverted with respect to the former. As the perceived length is greater than the length of the shaft, the error is negative in these cases according to Eq. 1. For 90°, 108°, 126°, 144°, and 162°, the maxima occur at 7.07, 8.84, 8.83, 7.07, and 8.83% of the shaft length. There are deviations between the present experimental data and those observed by Restle and Decker [5]. Apart from the reversal of sign convention, it was reported that for the inward arrow the variation of error was insensitive to the variation of the wing length, while in our experiment considerable variation is observed. Besides this, there are other quantitative differences too due to inter-subject perceptual variation. Further, in the past experiments [5] it was reported that the maxima occur when the wing lengths are approximately 30–40% of the shaft length, which differs from our experimental data. The third experiment was conducted to record the percentage of error in perceiving the length of the Müller-Lyer stimulus while varying the length of the shaft. Appropriate randomization in displaying the stimuli to the subjects, as explained above, is maintained to keep the judgments free of bias. The results are displayed in Fig. 3c. In the past reported data for the same effects, the illusion strength was found to be proportional to the shaft length [4]. In our experiment, there are slight nonlinearities in the data. At this juncture, we can assume that the error in perception of extension follows an overall qualitative consistency for different subjects, but quantitatively there are variations in their values. Possibly, covert cognitive variables like the experimental

Fig. 2 Variation of strength of illusion with the tilt angle of wings (15°–165°) from the shaft. The curve shows the average value of strength of illusion for three subjects


Fig. 3 Variation of strength of illusion with the length of the wings for different tilt angle from the shaft and with the length of the shaft. a Represents the average of the percentage error perceived by three subjects. The angle of tilt of the wings with the shaft is taken as 18°, 36°, 54°, and 72°, and the corresponding curves are marked as 1, 2, 3, and 4. b Shows the average percentage error for three subjects when the contextual arrowheads are outward. The angles are taken as 108°, 126°, 144°, and 162°. The corresponding curves are marked as 2, 3, 4, and 5. The curve marked 1 represents the data for 90° orientations of the wings. c Variation of the strength of illusion with the length of the shaft. The shaft length is varied in a range of 183–273 pixels. Percentage error is averaged for three subjects

ecosystem, state of attention of the subjects, and viewing states are responsible for these deviations.

3 Information Geometric Measure of Distance

In systems neuroscience, one of the fundamental quests is how the brain encodes external stimuli in the early sensory cortex. The present understanding is that an elaborate network, consisting of a large pool of neurons, carries out the sensory processing, motor coordination, and higher brain functions [16]. To customize and model the above-mentioned neuronal behavior for the Müller-Lyer stimulus, we conjecture that the mapping of the external world onto the cortical sheet is essentially initiated through filtering of the stimulus by the center-surround receptive field. The conjecture corroborates the findings of Chen et al. [17] that there is a center-surround structure in the optimal set of weights and that the optimal whitening filter is generated from the spatiotemporal activity of the neuronal population. At this juncture, we consider the difference of Gaussians filter (DOG) as the mathematical model of the center-surround receptive field. When the Müller-Lyer stimulus is convolved with the DOG, the generated output pattern of neural firing looks like Fig. 4. The following descriptive comments can be made about the obtained result. Firstly, the two large peaks appearing at the two ends of the figure may be conjectured to be the landmarks used by the brain to fix the end points of the Müller-Lyer stimulus. The peaky profiles can be modeled by Gaussian functions of suitable parameters and can be considered as the population code, or the coarse map of the firing pattern, of a set of neurons engaged in finding the location of the end points of the Müller-Lyer stimulus. In computational neuroscience, the representation of the average firing rate of a population of neurons is


termed the tuning curve. Secondly, the contour plot (projected on the floor of the figure) clearly shows that the mean of the Gaussian profiles approximately coincides with the centroid of the virtual triangle formed by the two wing lines at each end of the stimulus. The last result corroborates the center of gravity model proposed by Morgan [14]. Based on these preliminary observations, we model the visual process of locating the end points of the Müller-Lyer stimulus. We consider that the locational information of the end points of the Müller-Lyer stimulus is encoded by two identical Gaussian profiles representing the tuning curves, featured by their mean and standard deviation, respectively. The mean of the tuning curve represents the position of the end point. The standard deviation is significant as well, because it determines the highest slope of the curve to encode the maximum change in the firing rates of nearby neurons. For two infinitesimally close stimuli, the slope of the Gaussian population code is used to discriminate them. Therefore, our model is based on the following mathematical framework: (i) the image of the stimulus is analyzed in the brain by projecting it on a statistical parametric space which we designate as the visual space, (ii) the neural firing corresponding to the localization of the end points of the Müller-Lyer stimulus is represented by a Gaussian distribution function in the visual space, called the tuning curve, and (iii) the coordinatization of the visual space is made using the mean and the standard deviation of the distribution function. Representation of neural firing by a statistical distribution is supported by experimental evidence in computational neuroscience [18]. Finally, the proposed model of the visual space is a half plane with frames of reference defined by the mean μ and standard deviation σ of the tuning curve and can be written as,

Fig. 4 Neural firing pattern obtained by convolving a Müller-Lyer stimulus with inward arrowheads with a DOG filter. The shaft of the Müller-Lyer stimulus is 176 pixels, the length of the wing lines is 56 pixels, and the tilt angle with the shaft is 30°. The excitatory center of the DOG filter is generated with a Gaussian of unit amplitude and scale factor σ_e = 1.608 in arbitrary scale, while that of the inhibitory surround was fixed at σ_i = 2.59


$$H_f = \{(\mu, \sigma) \in \mathbb{R}^2 \mid \sigma > 0\}. \quad (2)$$

In the half plane, each tuning curve is represented as a point and is mathematically described by a univariate Gaussian pdf,

$$\int_X p(r, \Theta)\, dr = 1. \quad (3)$$

where r is a stochastic variable representing the firing rate of neurons and $\Theta = (\mu, \sigma)$ is the parameter characterizing the tuning curve. The distance between two infinitesimally separated tuning curves represented as points $p(\mu_1, \sigma_1)$ and $p(\mu_2, \sigma_2)$ in the half plane is given by the differential line element [19],

$$ds_f^2 = \sum_{\mu,\gamma=1}^{2} g_{\mu\gamma}\, d\mu\, d\gamma. \quad (4)$$

$g_{\mu\gamma}$ is the metric tensor of the visual space, known as the Fisher information metric, and is expressed as,

$$g_{\mu\gamma}(\Theta) = \int_r p(r, \Theta)\, \frac{\partial \varphi}{\partial \mu} \frac{\partial \varphi}{\partial \gamma}\, dr. \quad (5)$$

The function $\varphi(r, \Theta) = -\ln p(r, \Theta)$ is the negative of the log-likelihood function and is termed the spectrum. The importance of the information geometric Fisher information metric can be realized by considering the fact that in each step of measurement we actually gain information about the system. The maximum amount of information that can be gained by measuring the change of any observable due to a change in the parameter is encoded in the metric. Considering $p(r, \Theta)$ as a Gaussian pdf, we first compute the log-likelihood function $\varphi(r, \Theta)$. Finally, using $\varphi(r, \Theta)$, the different components of the metric tensor, computed using Eq. 5, are found to be $g_{\mu\mu} = 1/\sigma^2$ and $g_{\sigma\sigma} = 2/\sigma^2$, while $g_{\mu\sigma} = g_{\sigma\mu} = 0$. Therefore, the differential line element representing the proper distance in the visual space, known as the Fisher-Rao distance, can be written as [20],

$$ds_f^2 = \frac{d\mu^2 + 2\, d\sigma^2}{\sigma^2}. \quad (6)$$

When the distributions differ only in mean, the metric reduces to [21],

$$ds_f^2 = \frac{d\mu^2}{\sigma^2}. \quad (7)$$

For a univariate normal distribution with standard deviation σ, the distance between two distributions $N(\mu_1, \sigma)$ and $N(\mu_2, \sigma)$ is given by,

$$s_f(\mu_1, \mu_2, \sigma) = \frac{|\mu_1 - \mu_2|}{\sigma}. \quad (8)$$

We further computed the Ricci scalar curvature R using the components of the metric tensor and found that R = −1 [22]. Therefore, the visual space is a Riemannian space of constant negative curvature, or in other words a hyperbolic space.
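A minimal numerical sketch of Eqs. (7)–(8) is given below (illustrative code written for this text, not the authors’ implementation; the sample means and spread are arbitrary). It evaluates the closed-form distance of Eq. (8) and checks it against a direct integration of the line element of Eq. (7).

```python
import numpy as np

def fisher_rao_distance_equal_sigma(mu1: float, mu2: float, sigma: float) -> float:
    """Closed-form Fisher-Rao distance between N(mu1, sigma) and N(mu2, sigma), Eq. (8)."""
    return abs(mu1 - mu2) / sigma

def fisher_rao_distance_numeric(mu1: float, mu2: float, sigma: float, steps: int = 1000) -> float:
    """Integrate the line element ds = |d mu| / sigma of Eq. (7) along the path mu1 -> mu2."""
    mus = np.linspace(mu1, mu2, steps + 1)
    return np.sum(np.abs(np.diff(mus)) / sigma)

# Hypothetical tuning-curve means of 50 and 220 (pixels) with a common spread of 12:
print(fisher_rao_distance_equal_sigma(50, 220, 12))   # 14.166...
print(fisher_rao_distance_numeric(50, 220, 12))       # same value, by construction
```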

4 Computational Method and Results

The computational model consists of two components. Firstly, we convolve the image of a Müller-Lyer stimulus with a center-surround receptive field defined by the DOG kernel,

$$R(x, y) = \frac{k_e}{2\pi\sigma_e^2} \exp\!\left(\frac{-(x^2 + y^2)}{2\sigma_e^2}\right) - \frac{k_i}{2\pi\sigma_i^2} \exp\!\left(\frac{-(x^2 + y^2)}{2\sigma_i^2}\right). \quad (9)$$

There are four parameters in the expression of the DOG. σ_e and σ_i are the scale factors of the center and the surround, respectively, while k_e and k_i represent the excitatory and inhibitory gain, respectively. In our model, we fixed the values k_e = 1 and k_i = 0.892 based on our previous findings [23]. Adjustment of the values of σ_e and σ_i requires a bit of explanation. The experimental values of the percentage error E vary with variations of the geometrical attributes of the shaft and the contextual wings. To reproduce the same variation in the computed value of the error, one has to change the scale factors of the DOG kernel adaptively with the change in the geometrical attributes of the Müller-Lyer stimulus. The situation is not surprising, because such adaptive adjustment of scale factors depending on some geometric feature or context of the figure is not uncommon. We have observed [23] how the scale factor changes adaptively with the sharpness of discontinuity of the Mach bands. We have also observed [24] how, in the process of edge detection in a natural image, the scale factor adapts itself to the context of the image. Before explaining the process of adjustment of the scale factors of the DOG kernel, we complete our description of the computation of the error, as these are closely related. The percentage error between the actual length of the shaft ($L_\text{shaft}$) and the information geometric distance between the two peaks appearing in the convolved image is computed as,

$$L_s = \frac{(L_{p2} - L_{p1})/\sigma}{L_\text{shaft}} \times 100\% \quad (10)$$

$L_{p2}$ and $L_{p1}$ are the two peak response points, and σ is the scale factor of the peaky profile. We assume that the firing patterns generated at both ends have identical scale factors. To determine the scale factor σ, we cropped the region of the peaky


profile by windowing the convolved data and computed the standard deviation. For each stimulus, we adjusted the scale factors of the DOG kernel (σ_i) so that E − L_s → 0. Figure 5 represents the comparative results of the computer-simulated data with the experimentally obtained results of the percentage illusion as a function of the wing tilt angle with the shaft. In the second set, we compute the percentage error in estimating the length of Müller-Lyer stimuli of different wing lengths and wing tilt angles with the shaft. As described in Sect. 2, for inward arrowheads we consider the angles 18°, 36°, 54°, and 72°. In each case, the length of the wing lines is varied from 5 to 50 pixels in intervals of 5 pixels. The results are shown in Fig. 6. The curves represent the data for 18°, 36°, 54°, and 72°, marked as 1, 2, 3, and 4. Other details are explained in the figure caption. Repeating the same method for vertical and outward arrow-headed Müller-Lyer stimuli, we compute the percentage of error for wing tilt angles of 90°, 108°, 126°, 144°, and 162°. Results are shown in Fig. 7, where we represent the curves of percentage of error as a function of wing length for 90°, 108°, 126°, 144°, and 162° in downward order. Finally, in Fig. 8 the experimental and our simulated data on the variation of strength of illusion as a function of the length of the shaft are exhibited. It is interesting to note that, as per our model, the end points of the Müller-Lyer stimulus are perceived as the means of the tuning curves occurring at both ends of the neuronal population firing map, which causes the deviation of the apparent length from the actual length of the shaft. If a virtual line is considered from the outer endpoints of the wings to the perceived point, it becomes apparent that an error in the estimation of the wing tilt angle also takes place. In the case of the wing-in context, the acute angles are overestimated, and in the case of the wing-out context, the obtuse angles are underestimated. In 1890, Brentano proposed a hypothesis regarding the human perception of the

Fig. 5 Simulation of variation of percentage of error in Müller-Lyer stimulus as a function of wing tilt angle with the shaft. a Open circles (o) represent the experimental data. Dashed line (–) represents the simulated data computed using the Euclidean distance between the two peaks, and the solid line (-) represents the result of simulation using the information geometric measure of distance. b Variation of the scale factors of the DOG kernel as a function of the wing tilt angle with the shaft. The solid line corresponds to the excitatory center while the dashed line represents the same for the inhibitory surround


Fig. 6 Experimental and computer simulated data for inward arrowheads. a Experimental data fitted with polynomial fitting. b1–b4 Comparative results of experimental and simulated data for curves 1–4 drawn in (a). c1–c2 Variations of s.d. of DOG kernels offering best matches with the experimental data [1–4 in (a)] for acute angled stimulus

angles between two straight lines, which states that perceptually the acute angles are overestimated while the obtuse angles are underestimated. Equivalently, it may be stated that the tendency of visual perception is to estimate angles with a regression toward the right angle. Such hypotheses, though correct, are not yet provable through any generalized mathematical principle. In the present work, we obtain evidence that supports the Brentano hypothesis, thereby raising the debate of whether the illusion in the case of the Müller-Lyer stimulus is related to a perceptual deviation in the estimate of extension or of the angle.
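To make the pipeline of Sect. 4 concrete, the following sketch (illustrative NumPy/SciPy code written for this text; the stimulus drawing routine, the window size around a peak, and the way the peak spread σ is estimated are our assumptions, not the authors’ code) builds a DOG kernel per Eq. (9), convolves it with a synthetic Müller-Lyer figure, locates the two peak responses, and evaluates Eq. (10).

```python
import numpy as np
from scipy.signal import fftconvolve

def dog_kernel(size=41, sigma_e=1.608, sigma_i=2.59, k_e=1.0, k_i=0.892):
    """Difference-of-Gaussians (DOG) kernel of Eq. (9)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    center = k_e / (2 * np.pi * sigma_e ** 2) * np.exp(-r2 / (2 * sigma_e ** 2))
    surround = k_i / (2 * np.pi * sigma_i ** 2) * np.exp(-r2 / (2 * sigma_i ** 2))
    return center - surround

def draw_segment(img, p0, p1):
    """Rasterize a straight line segment between (row, col) points p0 and p1."""
    n = int(max(abs(p1[0] - p0[0]), abs(p1[1] - p0[1]))) + 1
    rows = np.linspace(p0[0], p1[0], n).round().astype(int)
    cols = np.linspace(p0[1], p1[1], n).round().astype(int)
    img[rows, cols] = 1.0

def muller_lyer(shape=(400, 400), shaft=176, wing=56, tilt_deg=30.0):
    """Binary Müller-Lyer figure; tilt_deg < 90 gives inward wings, > 90 outward."""
    img = np.zeros(shape)
    row = shape[0] // 2
    c0, c1 = (shape[1] - shaft) // 2, (shape[1] + shaft) // 2
    draw_segment(img, (row, c0), (row, c1))
    t = np.deg2rad(tilt_deg)
    for c, mirror in ((c0, 1), (c1, -1)):          # wings are mirrored at the two ends
        for up in (1, -1):
            draw_segment(img, (row, c),
                         (row + up * wing * np.sin(t), c + mirror * wing * np.cos(t)))
    return img, row, c0, c1

img, row, c0, c1 = muller_lyer(tilt_deg=30.0)
response = fftconvolve(img, dog_kernel(), mode="same")
profile = response[row]                            # firing profile along the shaft row
mid = profile.size // 2
p1, p2 = np.argmax(profile[:mid]), mid + np.argmax(profile[mid:])

# Spread of the peaky profile: weighted s.d. of positions in a window around the left peak
# (the 41-sample window is an assumed stand-in for the paper's cropping step).
w = profile[p1 - 20:p1 + 21].clip(min=0)
pos = np.arange(p1 - 20, p1 + 21)
sigma = np.sqrt(np.average((pos - np.average(pos, weights=w)) ** 2, weights=w))

L_s = ((p2 - p1) / sigma) / (c1 - c0) * 100        # Eq. (10)
print(p1, p2, sigma, L_s)                          # sigma_i would be tuned until E − L_s → 0
```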


Fig. 7 Experimental and computer simulated data for outward arrowheads. a Experimental data fitted with polynomial fitting. b1–b5 Comparative results of experimental and simulated data. c1–c2 Variations of s.d. of DOG kernels offering best matches with the experimental data [1–5 in (a)] for obtuse angled stimulus

5 Discussion

The proposed information geometric model of the visual space endowed with the Fisher-Rao information metric has the potential for modeling many real-life visual phenomena. In computer graphics, one of the live problems is the reconstruction of a 3D image from 2D images collected from a video sequence. Simultaneous Localization and Mapping (SLAM) is one of the popular algorithms which incrementally builds a 3D model of the surrounding environment while concurrently localizing the camera. Key frame selection is an important task in this algorithm. To select the key frames of a video sequence, many researchers have demonstrated that the summarized statistics of Fisher information provide improved performance in selecting the key frames [25, 26]. In robot navigation, metric representation is the most common task to estimate the coordinates of the objects placed in a three-dimensional space.


Fig. 8 Computer simulated data of percentage of error in measuring the length of Müller-Lyer stimulus as a function of the length of the shaft. a % of error in measuring the length as function of shaft length. The solid line represents the experimental data while the dotted line represents the same obtained through computer simulation. b Represents the variation of the standard deviation of the center (solid line) and surround Gaussian (dashed line) of the DOG kernel generating the best matches with the experimental data

The proposed information geometric model of the visual space can be extended to model the neural mechanism underlying the reconstruction of the 3D world from the two-dimensional retinal images in the human visual cortex. It is further interesting to note that the information geometric model and the Fisher-Rao metric are also becoming important in shape matching. Peter and Rangarajan [27] proposed a unifying framework for shape matching that uses mixture models to couple both the shape representation and deformation. The theoretical foundation is drawn from information geometry, wherein information matrices are used to establish intrinsic distances between parametric densities. All the experimental data and research codes are uploaded on GitHub (https://github.com/somucdac/Muller-Lyer-Experimental-Data). Finally, the authors have followed the COPE guidelines for ethical responsibilities. There is no potential conflict of interest.

References

1. Ninio J (2014) Geometrical illusions are not always where you think they are: a review of some classical and less classical illusions, and ways to describe them. Front Hum Neurosci 8:856
2. Valentin D, Gregory L (1999) Context-dependent changes in visual sensitivity induced by Muller-Lyer stimuli. Vision Res 39:1657–1670
3. Greist-Bousquet S, Schiffman HR (1981) Size of the Mueller-Lyer illusion as a function of its dimensions: theory and data. Percept Psychophys 30(5):505–511
4. Bulatov A, Bertulis A, Mickien L (1997) Geometrical illusions: study and modelling. Biol Cybern 77:395–406
5. Restle F, Decker J (1977) Size of the Muller-Lyer illusion as a function of its dimensions: theory and data. Percept Psychophys 21:489–503
6. Pressey AW, Di Lollo V, Tait RW (1977) Effects of gap size between shaft and fins and of angle of fins on the Muller-Lyer illusion. Perception 6:435–439
7. Fisher GH (1970) An experimental and theoretical appraisal of the perspective and size-constancy theories of illusions. Q J Exp Psychol 22:631–652
8. Lewis EO (1909) Confluxion and contrast effects in the Muller-Lyer illusion. Br J Psychol 3:21–41
9. Gregory RL (1970) The intelligent eye. Wiedenfeld and Nicolson, London
10. Pressey AW (1967) A theory of the Muller-Lyer illusion. Percept Mot Skills 25:569–572
11. Anderson NH (1974) Methods for studying information integration (Technical Report CHIP-43). University of California, San Diego, Center for Human Information Processing, La Jolla, Calif, pp 215–298
12. Erlebacher A, Seculer R (1969) Explanation of the Muller-Lyer illusion: confusion theory examined. J Exp Psychol 80:462–467
13. Walker EH (1973) A mathematical theory of optical illusions and figural aftereffects. Percept Psychophys 13:467–486
14. Morgan MJ (1999) The Poggendorff illusion: a bias in the estimation of the orientation of virtual lines by second-stage filters. Vision Res 39(14):2361–2380
15. Pierce JW (2007) PsychoPy—psychophysics software in Python. J Neurosci Methods 813
16. Zhou D, Rangan AV, McLaughlin DW, Cai D (2013) Spatiotemporal dynamics of neuronal population response in the primary visual cortex. PNAS 110(23):9517–9522
17. Chen Y, Geisler WS, Seidemann E (2006) Optimal decoding of correlated neural population responses in the primate visual cortex. Nat Neurosci 9(11):1412–1420
18. Pouget A, Dayan P, Zemel R (2000) Information processing with population codes. Nat Rev Neurosci 1:125–132
19. Burbea J, Radhakrishna Rao C (1982) Entropy differential metric, distance and divergence measures in probability spaces: a unified approach. J Multivar Anal 12:575–596
20. Costa SIR, Santos SA, Strapasson JE (2014) Fisher information distance: a geometrical reading. arXiv:1210.2354v3 [stat.ME], 10 Jan 2014
21. Atkinson C, Mitchell AFS (1981) Rao’s distance measure. Sankhya: Indian J Stat 43(Series A):345–365
22. Mazumdar D (2021) Representation of 2D frameless visual space as a neural manifold and its information geometric interpretation. arXiv:2011.13585 [cs.NE]
23. Mazumdar D, Mitra S, Ghosh K, Bhaumik K (2016) A DOG filter model of the occurrence of Mach bands on spatial contrast discontinuities. Biol Cybern. https://doi.org/10.1007/s00422-016-0683-9
24. Mazumdar D, Mitra S, Ghosh K, Bhaumik K (2021) Analyzing the patterns of spatial contrast discontinuities in natural images for robust edge detection. Pattern Anal Appl 24. https://doi.org/10.1007/s10044-021-00976-y
25. Lim H, Lim J, Kim HJ (2014) Online 3D reconstruction and 6-DoF pose estimation for RGB-D sensors. In: Agapito L, Bronstein M, Rother C (eds) Computer vision—ECCV 2014 workshops. ECCV 2014. Lecture notes in computer science, vol 8925. Springer, Cham. https://doi.org/10.1007/978-3-319-16178-5_16
26. Kerl C, Sturm J, Cremers D (2013) Dense visual SLAM for RGB-D cameras. In: Proceedings of the international conference on intelligent robots and systems (IROS)
27. Peter AM, Rangarajan A (2009) Information geometry for landmark shape analysis: unifying shape representation and deformation. IEEE Trans PAMI 31(2):337–350

Chapter 2

Building up a Categorical Sentiment Dictionary for Tourism Destination Policy Evaluation

Kang Woo Lee, Ji Won Lim, Myeong Seon Kim, Da Hee Kim, and Soon-Goo Hong

1 Introduction

Information and communication technologies (ICT) are driving rapid changes in tourism. Such technologies are not only transforming tourism-related businesses, organizations, and infrastructures, but also reshaping the tourism policy-making process. Policy evaluation, which has been traditionally measured using qualitative and quantitative studies, currently faces many challenges; hence, big data, generated as a result of the massive growth of ICT, has become the main mechanism of tourism evaluation. Additionally, it is difficult to identify immediate issues and rapid changes in tourism trends because the traditional evaluation methods cannot adequately manage the fast and massive flow of data [1]. The evaluation methods, such as surveys and interviews, are costly and time-consuming and are strongly affected by the limited answers to the designed questions and a small number of samples, thereby leading to biased conclusions [2]. Furthermore, the tourism sector and local economy have been affected acutely by the COVID-19 pandemic, and the resilience of the tourism sector is under immense stress. To maintain the sustainable development of tourism destinations beyond the pandemic, it is even more important to analyze timely data accurately. This

K. W. Lee · J. W. Lim · S.-G. Hong (B) Smart Governance Research Center, Dong-a University, 225 Gudeok-ro, Seo-gu, Busan, South Korea, e-mail: [email protected]
M. S. Kim · D. H. Kim Department of Computer Engineering, Dong-a University, 37 Nakdong-daero 550 gil, Saha-gu, Busan, South Korea
S.-G. Hong Department of Management Information System, Dong-a University, 225 Gudeok-ro, Seo-gu, Busan, South Korea


is particularly challenging at the moment when the impact of the pandemic is unprecedented. This study attempts to overcome the limitations of the existing tourism-destination evaluation methods through text mining. A large amount of data is being generated as many people share tourism-related information and new experiences obtained from travel through SNS [3]. Text mining is a suitable method to find new and meaningful information on tourism destinations through analyzing the large amounts of data available in social media [4]. In this study, our research efforts are focused on developing a dictionary-based sentiment analysis for evaluating tourism destinations, in which the categorical items of the system are adopted from the existing questionnaire. The questionnaire has been developed by the Korean Ministry of Culture, Sports, and Tourism (MCST) and continuously updated [1, 5]. The questionnaire is composed of 8 sections that are further divided into 27 categories. The questionnaire evaluates the satisfaction of the tourists on a 7-point Likert scale. In this study, we attempt to bring this questionnaire into the computational domain by creating a system of sentiment analysis. With a series of computational methods, sentiment lexicons that are ‘sentimentally polarizable’ and ‘categorically discriminable’ are extracted and used to construct a dictionary. Based on this sentiment dictionary, a particular tourism destination in Korea is evaluated.

2 MCST Questionnaires for Selecting Evaluation Categories

2.1 Categories and Questions of MCST and Keywords Associated to the Categories

The MCST evaluation has been conducted to identify issues across national tourism destinations and support sustainable development for the destinations since the 1990s. Through the evaluation, ‘local tourism zones’ were selected and invested upon to be built as attractive tourism destinations for foreigners [6]. Depending on the dynamically changing issues of tourism, question items of the evaluation were changed slightly. Nonetheless, the MCST questionnaire is designed to identify the pros and cons of tourism destinations—What makes tourists satisfied or dissatisfied? For our dictionary-based sentiment analysis, 11 categories, including diversity and quality of goods, taste and diversity of foods, and price, were selected. In Table 1, the 11 categories, the questions of the corresponding categories used in the MCST questionnaire, and the keywords associated with the categories for the sentiment analysis are presented.


Table 1 The categories, questions, and keywords

| Category | Question | Keywords |
| Diversity and quality of goods | There are various types of souvenirs related to tourism destinations, the quality of souvenirs is good | Gift, store, shopping, brand, goods, discount, souvenir, duty-free shop, fraud, craft, quality, variety, etc. |
| Taste and diversity of foods | There are various types of food, the food is delicious | Kimchi, taste, fresh, fish smell, food poisoning, delicious, etc. |
| Price | Price in the tourism destination is appropriate | Price, rip-off, expensive, cheap, cost, free, etc. |
| Hospitality | Staffs in the tourist information are kind | Kindness, service, impression, disregard, anger, disrespect, attitude, etc. |
| Public transport | It is easy to reach the tourism destination by public transport | Taxi, subway, bus, train, passenger, traffic, congestion, destination, pier, ferry, Busan port, etc. |
| Parking | It is convenient to use parking facilities | Parking, street parking, parking lot, etc. |
| Rest facility | Rest area (bench, lounge) is well established | Comfortable, convenient, rest, lounge, bench, chair, stroller, etc. |
| Toilet | Toilet is clean | Toilet, bathroom, sanitation, etc. |
| Experience of local culture | I learned the culture of the tourism destination well | Culture, tradition, relics, museum, history, etc. |
| Safety | I think tourist facilities are safe | Safety, protection, danger, accident, etc. |
| Recommendation | I am willing to recommend tourist attractions to people around me | Recommendation, highly recommended, not recommended, etc. |

The questions for each category are used in the MCST questionnaire, and keywords associated with the categories are used to construct our sentiment dictionary

2.2 Incorporating the Questionnaire-Based Evaluation into Sentiment Analysis Based on Text Mining

In the traditional tourism policy evaluation, several issues may arise while attempting to incorporate the evaluation system into the computational domain. Unlike the simplest sentiment analysis that classifies sentiment lexicons into positive or negative, the tourism policy evaluation has multiple evaluation categories [7]. Furthermore, some categories (e.g., cultural understanding) cannot be adequately implemented in text mining-based sentiment analysis. These issues are directly linked to the details


of the answers with respect to the items of the questionnaire using sentiment analysis. This implies that the sentiment analysis not only extracts the positive and negative sentiment that a lexicon conveys, but also assigns the lexicon to a category of the questionnaire [8].

3 Additional Information Required from Authors

3.1 Model Architecture

The model architecture of the sentiment analysis for the tourism policy evaluation is illustrated in Fig. 1. The model can be separated into 2 modules: sentiment word extraction and matching. Following the text cleaning and text pre-processing stages, the sentiment words are extracted using logistic regression and cTF-IDF. The logistic regression is used to classify the text dataset into positive or negative sentiment, and it extracts regression parameters that are used for evaluating word polarity. Furthermore, the cTF-IDF is used to determine the category that is appropriate for a word. Combining them together, the sentiment dictionary is constructed. Subsequently, a tourist review is analyzed by matching with the sentiment lexicons in the dictionary. Thus, the tokenized words of the review are evaluated not only to determine whether they are negative or positive, but also which categories they belong to. The analyzed results are then visualized.

Fig. 1 Architecture of dictionary-based sentiment analysis


3.2 Data

Tourist reviews on Busan, which is the second-largest city of South Korea, are collected from TripAdvisor webpages [9]. Busan has a variety of natural, cultural, and historical sites, such as Haeundae [10], Gamcheon Culture Village [11], and Beomeosa (temple) [12]. Among the 598 registered tourist sites, reviews on 400 tour sites were stored in a comma-separated values file. The data is composed of information on aspects such as ‘date reviewed,’ ‘language written,’ ‘name of tourist site,’ ‘review,’ and ‘rating score.’ We collected a total of 6139 tourist reviews created from 2015 to 2021. Only the reviews written in Korean were collected for this study. The rating scores ranged from 1 to 5, corresponding to ‘terrible (very negative)’ and ‘excellent (very positive),’ respectively.
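A minimal sketch of loading such a review file with pandas is shown below (illustrative only; the file name and column names are assumptions, since the paper does not specify them).

```python
import pandas as pd

# Hypothetical file and column names; the paper only lists the kinds of fields stored.
reviews = pd.read_csv("busan_tripadvisor_reviews.csv")

# Keep Korean-language reviews and binarize the 1-5 rating for later sentiment training:
# 4-5 -> positive (1), 1-3 -> negative (0), mirroring Sect. 3.5.
reviews = reviews[reviews["language"] == "ko"].copy()
reviews["label"] = (reviews["rating"] >= 4).astype(int)
print(len(reviews), reviews["label"].mean())
```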

3.3 Text Cleaning and Preprocessing

• Text Cleaning. The text data contained special characters, numbers, and punctuation that would make analyzing the text more difficult. We removed this irrelevant data (noise) from the text data.
• Tokenization. Tokenization is the process of splitting a piece of text, such as a phrase, sentence, paragraph, or document, into smaller units called tokens. The tokenization of the Korean language is generally incorporated with part of speech (POS) tagging [13], in which categories of tokens are assigned according to their functional, semantic, and morphological criteria (e.g., noun, verb, adjective, etc.) [14]. In this study, five parts of speech, including noun, root word, verb, adjective, and negative designator, were extracted through the tokenization process using the KoNLPy package [15]. The first four parts of speech are content words that possess semantic content and contribute to the meaning of the sentence in which they occur. The negative designator is a unique Korean POS that negates a sentence. It may correspond to ‘NOT’ in English.
• N-gram. An n-gram is a consecutive sequence of n tokens extracted from a given text. Depending on the number of tokens in the sequence, the term n can be ‘uni,’ ‘bi,’ and ‘tri’ if n = 1, 2, and 3, respectively. The n-gram is essential in any task in which a single word is not semantically sufficient to be identified and classified. For a dictionary-based sentiment analysis, a single word suffers from semantic ambiguity because of the absence of the context that provides the appropriate interpretation of the subject matter. In particular, by combining a negative designator with a noun (e.g., ‘good-place-anida’ can be translated into ‘not a good place’), the n-gram clarifies its negative meaning.
• Vectorization.


In natural language processing, different methods transform texts into vectors. Of these methods, two text-encoding methods, count vectorization and term frequency-inverse document frequency (TF-IDF), were employed in this study. The count vectorization method transforms a text into a vector on the basis of word counts. The dimensionality of a vector in the count vectorization method is equal to the vocabulary size of the entire text. An element of a vector corresponds to a word in the vocabulary, and its value corresponds to how often the word appears in the document. In the TF-IDF method, where the TF is multiplied by the IDF, a word is vectorized in terms of how important it is to a document in a collection or corpus; that is, the term frequency is modulated by decreasing the weight of the terms that occur very frequently across the documents and increasing the weight of the terms that occur infrequently.
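A compact sketch of this preprocessing chain is given below (illustrative; it uses KoNLPy’s Okt tagger and scikit-learn vectorizers as stand-ins, since the exact tagger, POS set, and vectorizer settings are not stated in the text, and the two toy reviews are invented).

```python
from konlpy.tag import Okt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

okt = Okt()
KEEP = {"Noun", "Verb", "Adjective"}        # content-word POS tags kept in this sketch

def tokenize(review: str) -> str:
    """POS-tag a Korean review and keep content words, joined back into one string."""
    return " ".join(word for word, pos in okt.pos(review, stem=True) if pos in KEEP)

docs = [tokenize(r) for r in ["주차가 어렵고 복잡해요", "경치가 좋고 강추합니다"]]

# Count vectors and TF-IDF vectors over uni-, bi-, and trigrams of the kept tokens.
count_vec = CountVectorizer(ngram_range=(1, 3)).fit(docs)
tfidf_vec = TfidfVectorizer(ngram_range=(1, 3)).fit(docs)
print(count_vec.transform(docs).shape, tfidf_vec.transform(docs).shape)
```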

3.4 Extracting Category-Specific Tokens Using CTF-IDF

Data gathering using keywords. It is assumed that the review data of an evaluation category (e.g., parking) is defined by a set of keywords that are associated strongly with the category. The keywords for a category are connected by disjunctions as: keyword_1 ∪ keyword_2 ∪ · · · ∪ keyword_n. Thus, the dataset for a category c is obtained as follows:

$$D_c = \{x \mid f(x, k_c)\} \quad (1)$$

where $k_c$ denotes the keywords for a category that are connected by disjunctions such that $k_1 \cup k_2 \cup \cdots \cup k_n$, and $f(\cdot)$ is the matching function that specifies the members of the category containing the keywords.

Category-based TF-IDF. The fittedness of a token to a category can be measured using the category-based TF-IDF. The category-based TF-IDF (cTF-IDF) method allows the selection of a category-discriminative token [16]. The cTF-IDF works on the collective documents created by joining all individual documents in a category together. The cTF-IDF of a token i can be obtained from the following equation:

$$\mathrm{cTF\text{-}IDF}_i = \frac{t_i}{w_i} \times \log\frac{m}{\sum_{j=1}^{n} t_j} \quad (2)$$

where $t_i$ indicates the frequency of the words that are extracted for each class i, divided by the total number of words w. The total number of individual documents across all classes, m, is divided by the total sum of the words across all categories.

Selecting the best category of a word using the argmax function. The same tokens in a category may appear in different categories, as reviews are generally a mixture of categories (e.g., parking and food). However, tokens associated with multiple categories are more likely due to the ambiguity of the sentiment evaluation of a word. To remove this ambiguity, it is also assumed that the sentiment words belonging to a


categorical dictionary do not belong to other categorical dictionaries. Therefore, a set of sentiment words in categorical dictionaries is mutually exclusive. In this regard, the argmax(·) function is applied to a token across all categories, and subsequently, the best fitted category is selected for a token.
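The following sketch illustrates Eq. (2) and the argmax assignment (written for this text; the toy category documents and the exact normalization details are assumptions rather than the authors’ implementation).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy "collective documents": all reviews of a category joined into one string (assumed data).
category_docs = {
    "parking": "parking difficult parking lot narrow parking hard",
    "food":    "food delicious fresh taste good food taste",
}
names = list(category_docs)
counts = CountVectorizer().fit(category_docs.values())
t = counts.transform(category_docs.values()).toarray().astype(float)  # t[c, i]: freq of token i in class c

w = t.sum(axis=1, keepdims=True)                 # total words per class
m = len(names)                                   # number of collective documents (classes)
ctf_idf = (t / w) * np.log(m / t.sum(axis=0))    # Eq. (2), one score per (class, token)

# Assign each token to the single category where its cTF-IDF is maximal (the argmax step).
vocab = counts.get_feature_names_out()
best = {tok: names[c] for tok, c in zip(vocab, ctf_idf.argmax(axis=0))}
print(best)
```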

3.5 Logistic Regression for Extracting Sentiment Words

Logistic regression is a statistical model often used in machine learning for classification. The model is basically used to predict a binary output, even though multinomial extensions of logistic regression exist [17]. In logistic regression, β is a vector of weight parameters of length m, with the same dimension as the vocabulary. A sigmoid function is used to generate an output by taking a linear combination of the input vector, as shown in Eq. (3). The model is trained to estimate the parameters β based on n training example reviews such that the classification error is minimized. The purpose of logistic regression in this study is to classify a tourist review into either a positive or a negative sentiment.

$$z = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m + b, \qquad y(z) = \frac{1}{1 + e^{-z}} \quad (3)$$

where z is the weighted sum of a document vector, in which an element of the vector corresponds to a word, and b is the bias. A logistic regression method is applied to extract sentiment tokens for our tourism policy evaluation system. First, 1s are assigned to positive reviews rated between 4 and 5, and 0s are assigned to negative reviews rated between 1 and 3. Following training of the regression model, the weight parameters for both positive and negative reviews were extracted and used for further processing [18, 19].
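A minimal sketch of this step with scikit-learn is shown below (illustrative; the vectorizer settings, the default regularization, and the toy reviews/labels are assumptions).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy tokenized reviews and their binarized ratings (1 = rated 4-5, 0 = rated 1-3).
docs = ["parking hard crowded", "view good recommend", "food delicious kind staff", "expensive rip-off"]
labels = [0, 1, 1, 0]

vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# The sign and magnitude of each coefficient give the token's sentiment polarity (beta in Eq. 3).
for word, beta in sorted(zip(vec.get_feature_names_out(), clf.coef_[0]), key=lambda p: p[1]):
    print(f"{word:20s} {beta:+.3f}")
```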

3.6 Human Evaluation on Sentiment Words in a Category

The tokens were sorted in descending order of the values obtained by multiplying a regression parameter by the argmax(cTF-IDF) of a word. The ordered tokens were evaluated by individuals in terms of their categorical suitability as well as their sentimental valence. Five students were involved in the rating task. They were asked to evaluate whether the extracted lexicons were suitable and sentimentally polarizable (either positive or negative) for a given category. Subsequently, the lexicons that were not categorically suitable and sentimentally polarizable were eliminated. After the evaluation, the selected lexicons were re-organized in terms of their forms. Three lexicon forms—unigram, bigram, and trigram—were identified.


The lexicon that is composed of a ‘category-related (sentiment-related)’ word + ‘sentiment-related (category-related)’ word is defined as the basic form. For example, the lexicon ‘주차 (parking)-어렵 (hard)’ is the basic form in the parking category. The word ‘parking’ is categorically related, and ‘hard’ is sentimentally polarizable. The unigram lexicon can specify a category with only a single word and has a sentimental polarity. For example, the single word ‘강추,’ which translates to ‘strongly recommend,’ is an example of the unigram lexicon. In contrast, the trigram lexicon combines a negative word, such as ‘not,’ ‘no,’ and ‘nothing (zero),’ with the basic form. For example, in ‘주차장 (parking lot)-넓 (spacious)-않 (not),’ the negative word ‘않’ is combined with the basic form ‘주차장-넓.’

3.7 Matching

The assessment of the tourism destinations can be performed with the matching process. The tourist review coming into the system as input is cleaned, tokenized, and n-grammed. The tokenized review is compared with the sentiment lexicons in the dictionary and assessed in terms of how many dictionary lexicons are matched along the 11 categories.
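A sketch of this matching step follows (illustrative; the dictionary structure and the toy entries are assumptions about how the lexicons built in Sects. 3.4–3.6 might be stored).

```python
from collections import Counter

# Hypothetical dictionary: (category, polarity) per lexicon.
dictionary = {
    "parking_hard": ("Parking", -1),
    "parking_good": ("Parking", +1),
    "food_delicious": ("Taste and diversity of foods", +1),
}

def assess(review_ngrams):
    """Tally positive/negative matches per category for one tokenized, n-grammed review."""
    tally = Counter()
    for gram in review_ngrams:
        if gram in dictionary:
            category, polarity = dictionary[gram]
            tally[(category, "positive" if polarity > 0 else "negative")] += 1
    return tally

print(assess(["parking_hard", "food_delicious", "beach"]))
```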

4 Results

In order to demonstrate the construction of the sentiment dictionary, the results are graphically presented. In Fig. 2, ten sentiment words with their regression parameter values are presented. These words belong to the parking category. The top five words have negative sentiment associated with parking, whereas the bottom five words have positive sentiment. For example, ‘many_parking’ (많/VA_주차/NNG), ‘parking_difficult’ (주차/NNG_힘들/VA), and ‘parking_hard’ (주차/NNG_어렵/VA) are negative-sentiment lexicons that have negative regression parameters. In contrast, ‘no_car’ (차/NNG_없/VA), ‘good_parking’ (좋/VA_주차/NNG), and ‘parking_lot_use’ (주차장/NNG_이용/NNG) are positive-sentiment lexicons with positive regression parameters. The argmax(cTF-IDF) values of the sentiment lexicons are presented in Fig. 3. The cTF-IDF can be best understood as a feature of the textual documents belonging to the category they are in. Higher cTF-IDF values imply that the lexicons are more typically found in the context describing the category. However, the same sentiment lexicon can appear in many different categories with different cTF-IDF values. Upon applying argmax with cTF-IDF, the sentiment lexicon can be assigned to the category that is thought to be most suitable. In Fig. 3, the lexicons were chosen and grouped into the parking category by applying argmax(). As shown in the figure, the lexicons ‘good parking’ (좋/VA_주차/NNG) and ‘many parking’ (많/VA_주차/NNG) have relatively higher argmax(cTF-IDF) values than


Fig. 2 The words with positive and negative regression coefficients

Fig. 3 The argmax(cTF-IDF) values of the sentiment lexicons

the lexicons ‘move-up parking good’ (올라가/VV_주차/NNG_좋/VA) and ‘large parking’ (넓/VA_주차/NNG). Following the multiplication of the regression parameter and argmax(cTF-IDF) values for each lexicon, the characteristics that are categorically discriminable and sentimentally polarizable can be imposed on a lexicon. In Fig. 4, the lexicons ‘no_car’ (차/NNG_없/VA) and ‘many_parking’ (많/VA_주차/NNG) are the most typical lexicons found in the parking category, but their sentiment orientations are opposite. Following the construction of the sentiment dictionary, the sentiment analysis is performed on Haeundae beach, which is one of the most popular tourism destinations located in Busan, South Korea. It is generally considered as the best tourism destination for the summer holidays. The results of the sentiment analysis on the beach are presented in Fig. 5. Among 11 evaluation categories, more negative sentiments are found in two categories— parking and safety. The beach suffers from severe parking difficulties during the high season of July and August. Tourists and cars from all around the country flock to the


Fig. 4 The multiplication between cTF-IDF and regression coefficient for a word

beach. It is generally recommended for people to use public transport. Concerning the safety issue, the rip current, which can occur near the beach with breaking waves, is often referred to in the negative reviews. In 2017, approximately 50 tourists were swept away by such a current, although all of them were rescued. In other categories, such as diversity and quality of goods, food, price, hospitality, and rest facility, the tourist evaluations were on the positive side of the scale.

5 Conclusion

Even though our sentiment analysis model is promising, its performance has not been fully tested yet. In particular, as noted earlier, the model is based on the MCST questionnaire; hence, a comparative study between them is evidently necessary. Our future course of study can be an extension of the analysis presented here to include this comparison.


Fig. 5 The sentiment analysis of Haeundae beach. The beach is evaluated in the 11 categories

Acknowledgements This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2018S1A3A2075240).

References

1. Korean Ministry of Culture, Sports, and Tourism (2018) A comprehensive evaluation report for culture, tourism, and festival. MCST
2. Kim JH (1995) A study on reliability analysis of questionnaire items. Master’s thesis, Jeonju University
3. Parra-Lopez E, Bulchand-Gidumal J, Gutierrez-Tano D, Diaz-Armas R (2004) Intentions to use social media in organizing and taking vacation trips. Comput Hum Behav 27(2):640–654
4. Kar AK, Dwivedi YK (2020) Theory building with big data-driven research—moving away from the “what” towards the “why”. Int J Inf Manage 54
5. Korean Ministry of Culture, Sports, and Tourism (2011) A comprehensive evaluation report for culture, tourism, and festival. MCST
6. Ministry of Culture and Tourism (2003) Tourism evaluation report. MCT
7. Taboada M, Brooke J, Tofiloski M, Voll M, Stede M (2011) Lexicon-based methods for sentiment analysis. Comput Linguist 37(2):267–307
8. Mowlaei ME, Abadeh MS, Keshavarz H (2020) Aspect-based sentiment analysis using adaptive aspect-based lexicons. Expert Syst Appl 148(15)
9. Kim C (2000) A model specification for measuring competitiveness of the tourism industry. Korea Tourism Research Institute
10. Isojima A (2006) Analysis of a consumer questionnaire pertaining to rice by using text mining. Agric Inf Res 15(1):49–60
11. TripAdvisor Busan page. https://www.tripadvisor.com/Tourism-g297884-Busan-Vacations.html. Last accessed 1 Oct 2021
12. Wikipedia Haeundae Beach page. https://en.wikipedia.org/wiki/Haeundae_Beach. Last accessed 1 Feb 2022
13. Wikipedia Gamcheon Culture Village page. https://en.wikipedia.org/wiki/Gamcheon_Culture_Village. Last accessed 1 Feb 2022
14. Wikipedia Beomeosa page. https://en.wikipedia.org/wiki/Beomeosa. Last accessed 1 Feb 2022
15. Park K, Lee J, Jang S, Jung D (2020) An empirical study of tokenization strategies for various Korean NLP tasks. In: Proceedings of the 1st conference of the Asia-Pacific chapter of the association for computational linguistics and the 10th international joint conference on natural language processing, Suzhou, China, pp 133–142
16. Han C-H, Palmer M (2004) A morphological tagger for Korean: statistical tagging combined with corpus-based morphological rule application. Mach Transl 18(4):275–297
17. Park EL, Cho S (2014) KoNLPy: Korean natural language processing in Python. In: Proceedings of the 26th annual conference on human and cognitive language technology, Chuncheon, Korea
18. Zhang T, Ge SS (2019) An improved TF-IDF algorithm based on class discriminative strength for text categorization on desensitized data. In: Proceedings of the 2019 3rd international conference on innovation in artificial intelligence, pp 39–44
19. Jurafsky D, Martin JH (2021) Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 3rd edn. Prentice Hall PTR, Upper Saddle River, NJ, United States

Chapter 3

Statistical Analysis of Stress Prediction from Speech Signatures

Radhika Kulkarni, Utkarsha Gaware, and Revati Shriram

1 Introduction

Stress can affect the dynamics of our memory and hamper our ability to remember things effectively [1]. Stress in day-to-day life is followed by performance degradation whilst carrying out tasks [2]. When a person is agitated or exposed to a risky situation, the heart rate rises as the heart beats faster to send more blood to the muscles. To prepare the muscles for a fight-or-flight situation, blood is directed to them. Micro-muscle tremors (MMTs) are the vibrations of muscles caused by this. The muscles of the vocal tract can transmit these vibrations through speech [3]. A person’s voice carries information that can be divided into two categories. The first section contains linguistic information, in which the utterances are made according to the language’s standards of pronunciation [4]. The paralinguistic information is the second part. Voice quality, rhythm, intonation, speech pauses and prosody are all examples of paralinguistic information [5]. Spectral analysis of voice signals helps us to find out how acoustic energy is distributed across frequencies. In addition to that, other physiological signals like photoplethysmography and electroencephalogram signals can be analysed whilst performing Stroop task tests to find stressed conditions in individuals [9]. These signals can be analysed for the R–R wave interval, which gives us information about the heart rate, which may increase during stress conditions [6]. As opposed to the approach used whilst carrying out this work to measure exam stress, HR interviews of employees in a workplace environment can be recorded and monitored to perform voice-based analysis [7].

R. Kulkarni (B) · U. Gaware · R. Shriram MKSSS’s Cummins College of Engineering for Women, Pune, India, e-mail: [email protected]


2 Methodology

With a motive to focus more on the paralinguistic information in voice signals for stress detection in students, 50 subjects were asked to read an English paragraph which triggered no extreme emotions but was on the slightly tougher side with regard to pronunciation. Whilst reading, their audio was recorded, and preprocessing was carried out on the audio recordings later to remove any noise present. The audio files were then converted from MPEG-4 and M4A to MP3 format for performing the analysis in MATLAB with the help of the Audio Toolbox. After that, the spectral features were extracted from the voice signals. Features like spectral rolloff, spectral entropy, log energy, power spectral density, spectral flux, spectral centroid, maximum amplitude, pitch, energy function, energy, and power were extracted from the voice signals. The audios were recorded on three different devices, each of a different model (iPhone 11 pro max, Samsung A30s and OnePlus 6T), to see if the quality of the recorder has any impact on the results obtained along with the individual’s stress levels. Figure 1 shows the system block diagram which describes the flow of the process.

Fig. 1 System block diagram

Spectral Flux (dB/Decade). The spectral flux, which measures the spectral change between two consecutive frames, is calculated using the squared difference between the normalised magnitudes of the spectra of two consecutive short-term windows [8].

$$Fl(i, i-1) = \sum_{k=1}^{Wf_L} \left(EN_i(k) - EN_{i-1}(k)\right)^2 \quad (1)$$


Spectral Rolloff (dB/Decade). The frequency below which a certain percentage of the total spectral energy is found is known as the spectral rolloff [8].

Spectral Entropy (dB). It is a computation of the spectral power distribution and the forecastability of the time-series signal. The Shannon (information) entropy of the data is used to calculate this entropy [8].

$$SSH(F) = -\sum Ph(F)\, \log_e\!\left(Ph(F)\right) \quad (2)$$

Pitch (Hz). The degree of highness or lowness in a tone is referred to as pitch. Pitch is a non-cognitive feature that allows for the frequency-related ordering of sounds. The rate at which vibrations are created is referred to as the rate of vibration. When the tone’s frequency is high, the pitch is high [9].

Energy Function (Joules). The energy function is used to estimate the time it takes for voiced speech to become unvoiced and vice versa. In high-quality voice, energy can be used to distinguish between sound and quiet (high signal-to-noise ratio) [10, 11].

Energy (Joules). The overall magnitude of a signal correlates to its energy. In audio broadcasts, it refers to the signal’s loudness. A signal’s energy is defined as:

$$\sum_n |x(n)|^2 \quad (3)$$

Maximum Amplitude (dB). A waveform’s maximum amplitude is the maximum positive or negative departure from its zero reference level [12].

Power Spectral Density (Vrms²/Hz). The power spectral density function aids in calculating the overall power contained in each spectral component of a signal, providing us with the signal’s power over the whole frequency spectrum or range [13, 14].

Power (dB). The energy rate, or sound energy per unit of time emitted by a source, is referred to as sound power. When sound travels through a medium, its acoustic power is transferred. The sound intensity is defined as the sound power transmission over a surface (W/m²), which is a vector quantity with a direction.
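A brief sketch of how a few of these spectral features can be computed from one audio frame is given below (illustrative Python/NumPy code written for this text; the original analysis was done with MATLAB’s Audio Toolbox, and the frame length and rolloff percentage here are assumptions).

```python
import numpy as np

def frame_features(frame, fs, rolloff_pct=0.85):
    """Spectral centroid, rolloff, entropy and energy of a single audio frame."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    power = spectrum ** 2

    centroid = np.sum(freqs * power) / np.sum(power)                 # spectral centroid (Hz)
    cumulative = np.cumsum(power)
    rolloff = freqs[np.searchsorted(cumulative, rolloff_pct * cumulative[-1])]
    p = power / np.sum(power)                                        # normalized spectral distribution
    entropy = -np.sum(p * np.log(p + 1e-12))                         # Shannon entropy, cf. Eq. (2)
    energy = np.sum(frame ** 2)                                      # Eq. (3)
    return centroid, rolloff, entropy, energy

# Synthetic 20 ms frame of a 220 Hz tone sampled at 16 kHz, just to exercise the function.
fs = 16000
t = np.arange(int(0.02 * fs)) / fs
print(frame_features(np.sin(2 * np.pi * 220 * t), fs))
```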

3 Database Collection

The data were collected from 50 students of Cummins College of Engineering. All 50 subjects were females. The age group of the subjects was between 18 and 23 years. All the subjects were normal and healthy. The data were collected on 3 devices at the same time (iPhone 11 pro max, Samsung A30s, OnePlus 6T). The feature extraction was done with the help of MATLAB R2021a. Verbal consent was


given by all the participants before recording the audios and for their use in the research.

4 Results and Analysis

Shown in Figs. 2, 3, 4 and 5 are the plots of Subject 1’s audio feature results from the 3 devices. The plots of all the features for Subject 1 are shown in these figures. The spectral centroid, spectral rolloff and power spectral density are the highest in the Samsung A30s and lowest in the OnePlus 6T. In the case of spectral entropy and pitch, it is the highest in the OnePlus 6T and lowest in the iPhone 11 pro max. Log energy and power are the

(a)

(b)

(c) Fig. 2 Sample plots of spectral centroid for the signal recorded by 3 devices. a iPhone 11 pro max. b Samsung A30s. c OnePlus 6T

3 Statistical Analysis of Stress Prediction from Speech Signatures

31

(b)

(a)

(c) Fig. 3 Sample plots of spectral rolloff for the signal recorded by 3 devices. a iPhone 11 pro max. b Samsung A30s. c OnePlus 6T

lowest in case of Samsung A30s and highest in case of iPhone 11 pro max. Spectral flux is the lowest in case of iPhone 11 pro max and highest in case of the OnePlus 6T. Maximum amplitude is found to be highest in case of iPhone 11 pro max and lowest in case of OnePlus 6T. Energy function is found to be lowest in case of the OnePlus 6T and highest in case of Samsung A30s. Energy is highest in OnePlus 6T and lowest in iPhone 11 pro max. Box plots of spectral centroid, energy function, power, log energy, spectral entropy and spectral rolloff are plotted as shown in Fig. 6a–f, respectively, to show the distribution of the data around the values. Looking at the box plots considering all 50 subjects together, the range of power and spectral entropy are highest in case of the OnePlus 6T and lowest in case of Samsung A30s. Range of energy function is highest in case of iPhone and lowest in case of OnePlus 6T. Log energy on the other hand is highest in case of iPhone and lowest in case of Samsung A30s. The range of spectral rolloff is less in the OnePlus 6T than that of iPhone 11 pro max and Samsung A30s.

Fig. 4 Sample plots of spectral entropy for the signal recorded by 3 devices. a iPhone 11 pro max. b Samsung A30s. c OnePlus 6T

The mean, median, standard deviation, skewness and kurtosis of all the 50 obtained values of the features were calculated for statistical analysis. A sample table of the statistical analysis consisting of data from 10 subjects is shown in Tables 1, 2 and 3.

Mean = \frac{\sum x}{N}    (4)

Median = \frac{(\frac{n}{2})\text{th observation} + (\frac{n}{2} + 1)\text{th observation}}{2}    (5)

Standard deviation = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}    (6)

Fig. 5 Sample plots of pitch for the signal recorded by 3 devices. a iPhone 11 pro max. b Samsung A30s. c OnePlus 6T

Skewness = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^3}{(N - 1) \times \sigma^3}    (7)

where N is the number of observations, X_i the random variable, \bar{X} the mean of the distribution and \sigma the standard deviation.

Kurtosis = \frac{\mu_4}{\sigma^4}    (8)
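As an informal illustration of how the statistics in Eqs. (4)–(8) can be computed for each extracted feature across the subjects, the following Python sketch uses NumPy and SciPy. Note that scipy.stats uses the standard definitions (Pearson kurtosis when fisher=False), which differ slightly from the (N − 1)-normalised skewness written in Eq. (7); the variable names and the synthetic sample values are assumptions.

```python
# Sketch only: per-feature descriptive statistics over the 50 subjects.
import numpy as np
from scipy import stats

# Hypothetical array: one spectral-centroid value per subject for one device.
feature_values = np.random.default_rng(0).normal(850, 190, size=50)

mean = feature_values.mean()                              # Eq. (4)
median = np.median(feature_values)                        # Eq. (5)
std = feature_values.std()                                # Eq. (6), population form
skewness = stats.skew(feature_values)                     # cf. Eq. (7)
kurtosis = stats.kurtosis(feature_values, fisher=False)   # Eq. (8): mu_4 / sigma^4

print(mean, median, std, skewness, kurtosis)
```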



Fig. 6 Box plots of spectral centroid, energy function, power, log energy, spectral entropy and spectral rolloff. a Spectral centroid. b Energy function. c Power. d Log energy. e Spectral entropy. f Spectral rolloff


Table 1 Mean and median values for the features used

Parameter | Mean: iPhone 11 pro max | Mean: Samsung A30s | Mean: OnePlus 6T | Median: iPhone 11 pro max | Median: Samsung A30s | Median: OnePlus 6T
Log energy | 0.6922 | −5.679 | −0.6163 | 0.6438 | −5.4638 | −0.5886
Spectral flux | 0.0963 | 0.002 | 6.6986 | 0.0016 | 0.00185 | 6.256
Spectral centroid | 826.1953 | 946.5018 | 381.2158 | 819.995 | 923.9682 | 337.261
Spectral rolloff | 2198.326 | 2594.1508 | 1493.257 | 2197.15 | 2332.3 | 1218.8
Max amplitude | 0.7004 | 1.065 | 0.3583 | 0.7201 | 0.79275 | 0.3293
Pitch | 208.4925 | 483.4746 | 235.088 | 210.545 | 220.9098 | 233.3914
Spectral entropy | 0.3762 | 0.3819 | 4.0137 | 0.3752 | 0.3747 | 3.603
Energy function | 0.0075 | 0.008 | 0.3762 | 0.006 | 0.0077 | 0.3925
Energy | 4.1158 | 3.4842 | 0.0018 | 2.7383 | 2.3234 | 0.0012
Power | 3.2349 | 3.1278 | 70.0316 | 2.8518 | 3.0569 | 2.9019
PSD | 3874.3 | 3029.626 | 1854.6865 | 3787.65 | 3542.35 | 1896.6

Table 2 Standard deviation values for the features used

Parameter | iPhone 11 pro max | Samsung A30s | OnePlus 6T
Log energy | 0.6801 | 1.0694 | 0.829
Spectral flux | 0.6664 | 0.001 | 3.0704
Spectral centroid | 193.2207 | 258.21 | 224.2364
Spectral rolloff | 464.6977 | 2056.4333 | 1246.7585
Max amplitude | 0.1945 | 2.4618 | 0.1525
Pitch | 22.1742 | 340,303.2976 | 20.0687
Spectral entropy | 0.02623 | 0.056 | 2.3768
Energy function | 0.0048 | 0.0039 | 0.0939
Energy | 2.9506 | 2.7315 | 0.0014
Power | 1.7644 | 1.59162 | 192.415
PSD/FFT | 1306.4344 | 2011.3334 | 683.7151


Table 3 Kurtosis and skewness values for the features used

Parameter | Kurtosis: iPhone 11 pro max | Kurtosis: Samsung A30s | Kurtosis: OnePlus 6T | Skewness: iPhone 11 pro max | Skewness: Samsung A30s | Skewness: OnePlus 6T
Log energy | −0.8325 | −0.9315 | −0.1403 | 0.0282 | −0.3107 | 0.0716
Spectral flux | 49.9995 | −0.9122 | 0.952 | 7.071 | 0.6422 | 0.8561
Spectral centroid | −0.1379 | −0.6289 | 0.6461 | 0.1668 | 0.268 | 0.9436
Spectral rolloff | −0.0722 | 41.4344 | 12.9577 | −0.2417 | 6.1268 | 2.8969
Max amplitude | −0.7142 | 49.3708 | 0.6135 | −0.5459 | 7.005 | 0.7793
Pitch | 1.7164 | 49.9999 | 0.8282 | −0.399 | 7.071 | −0.4501
Spectral entropy | 0.111 | 13.1311 | −0.2789 | 0.2586 | 3.326 | 0.6839
Energy function | −0.1764 | −0.5372 | −0.1716 | 0.8468 | 0.4175 | −0.2228
Energy | −0.906 | 0.167 | 1.293 | 0.8154 | 1.2383 | 1.3842
Power | −0.2777 | −0.06685 | 7.1234 | 0.6991 | 0.7189 | 2.8552
PSD | −0.1023 | −0.8997 | −0.4052 | 0.2792 | −0.3388 | 0.359

where μ_4 is the fourth central moment and σ the standard deviation.

5 Discussion

Stress is responsible for approximately 80% of the illnesses suffered by students today. It can become a major concern for students wanting to pursue higher education, because of built-up stress and exposure to new academic situations. Often, it is not the stress itself that causes harm, but rather the individual's reaction to it [15]. A subject's social interaction behaviour can change according to stress levels, which can be reflected in various ways, including on social media [16]. Oftentimes, built-up stress in students contributes to workplace stress later on; this is particularly evident in IT sector employees [17]. To address this, our study's main purpose was to discover whether a person's voice could be analysed to detect whether or not they were stressed. For this, an English passage was used that elicited no strong emotions but had more difficult pronunciations. Other sorts of induced stressors, however, can also be used for voice analysis. Making the subjects, who were essentially students, sleep for a shorter period of time the day before the audio is recorded can also have a substantial impact on the feature values. Stress has an impact on the body's other physiological signals as well. The impact of meditation on physiological signals, as opposed to stressful settings, is also being examined [18]. In many forensic and security scenarios, vocal stress analysis can be useful in detecting dishonesty in the voices of convicted individuals [19]. Pre-recorded data sets are not considered in this study, since such data are collected after the event has already occurred or the stressor has been given, making it possible to fake the effect. These audio recordings, on the other hand, contain real-time data that can be used to effectively distinguish between stress levels. The research can be extended with a much larger and more varied database.

6 Conclusion

Our work is largely dependent on the human body's vocal response to stressful situations. If stress is not treated properly, it can become chronic and lead to mental health issues such as anxiety and depression. The goal of the research thus far has been to find and measure acoustic features of the voice that, when altered, reflect the impact of stress on people. Based on statistical analysis of the data, the results revealed that the numerical differences between the features recorded in the resting state and in a state of stress are significant. The changes in the acoustic markers caused by stress can be seen, even if the differences are only a few Hertz, dB, or milliseconds. Taking one numerical value of a one-dimensional feature for the resting and stressed states, however, does not allow us to clearly identify the kind or degree of stress. When a person is under stress, the changes in the voice's harmonic structure caused by the vibration of the vocal cords do not extend to the high-pitched regions of the voice, and there is less vibration of the vocal cords. It has been found that the spectral properties of voice signals vary in stressed individuals: the energy, spectral rolloff, spectral centroid and spectral entropy of the voice signals tend to increase. It can also be observed that when the pitch level is low, stress levels tend to be high, and when the pitch of the voice is high, stress levels are low. Since the data were collected at MKSSS's Cummins College of Engineering for Women, all the collected audio recordings were of females. This study can be carried forward by including male recordings as well; the inclusion of both male and female recordings in the data set would give values spanning a wider range.

References 1. Vogel S, Schwabe L (2016) Learning and memory under stress: implications for the classroom. npj Sci Learn 1:16011 2. Manjunath P, Pola S, Ashok V, Twinkle S Predictive analysis of student stress level using machine learning. Int J Eng Res Technol (IJERT) 3. Hopkins CS, Ratley RJ, Benincasa DS, Grieco JJ (2005) Evaluation of voice stress analysis technology. In: System sciences. HICSS’05. Proceedings of the 38th annual Hawaii international


conference. IEEE, pp 20b–20b 4. Nwe TL, Foo SW, De Silva LC (2003) Speech emotion recognition using hidden Markov models. Speech Commun 41(4):603–623 5. Zhang B (2017) Stress recognition from heterogeneous data. Human-computer interaction [cs.HC]. Université de Lorraine. English. ffNNT: 2017LORR0113f 6. Rothkrantz LJM, Wiggers P, van Wees J-WA, van Vark RJ (2004) Voice stress analysis. In: International conference on text, speech and dialogue. Springer, pp 449–456 7. Tomba K, Dumoulin J, Mugellini E, Abou Khaled O, Hawila S (2018) Stress detection through speech analysis. In: ICETE (1), pp 560–564 8. Giannakopoulos T, Pikrakis A (2014) Introduction to audio analysis 9. James J, Kulkarni S, George N, Parsewar S, Shriram R, Bhat M (2020) Detection of Parkinson’s disease through speech signatures. In: Raju K, Govardhan A, Rani B, Sridevi R, Murty M (eds) Proceedings of the third international conference on computational intelligence and informatics. Advances in intelligent systems and computing, vol 1090. Springer, Singapore 10. Shete DS (2014) Zero crossing rate and energy of the speech signal of Devanagari script. 4(1):01–05 Ver I 11. Repovs G (2004) The mode of response and the Stroop effect: a reaction time analysis. Horiz Psychol 13(2):105–114 12. Rawlins J, Basics MS (2000) AC circuits 13. Shriram R, Baskar VV, Martin B, Sundhararajan M, Daimiwal N (2018) Connectivity analysis of brain signals during colour word reading interference. Biomedicine 38(2):229–243. ISSN: 0970-2067 14. Kate S, Malkapure V, Narkhede B, Shriram R (2021) Analysis of electroencephalogram during coloured word reading interference. In: Santhosh KV, Rao K (eds) Smart sensors measurements and instrumentation. Lecture notes in electrical engineering, vol 750. Springer, Singapore 15. Udeshi N, Shah N, Shah U, Correia S (2021) Destress it—detection and analysis of stress levels. In: Data intelligence and cognitive informatics. Springer, Singapore, pp 19–33 16. Sharma S, Sharma I, Sharma AK (2019) Automated system for detecting mental stress of users in social networks using data mining techniques. In: ICDICI Publications—2021 2020 International conference on computer networks, big data and IoT, pp 769–777. Springer, Cham 17. Sreedharshini S, Suresh M, Lakshmi Priyadarsini S (2021) Workplace stress assessment of software employees using multi-grade fuzzy and importance performance analysis. In: Data intelligence and cognitive informatics. Springer, Singapore, pp 433–443 18. Ingle R, Awale RN Impact analysis of medication on physiological signals 19. Lacerda F (2013) Voice stress analyses: science and pseudoscience. Proc Mtgs Acoust 19:060003

Chapter 4

DermoCare.AI: A Skin Lesion Detection System Using Deep Learning Concepts Adarsh Singh, Sourabh Bera, Pranav Chaturvedi, Pranav Gadhave, and C. S. Lifna

1 Introduction

Melanoma is the most threatening type of skin lesion. Although melanoma is the least common skin cancer, it is the cause of 75% of deaths due to skin lesions. As with other cancers, early and correct detection, potentially aided by data and science, can make treatment more effective. Currently, computer-aided diagnosis (CAD) has become a necessity for various skin diseases, as the gap between the number of patients and the number of doctors is quite high. Also, the cost of testing and the time taken to determine the type of skin lesion are far higher than with CAD. Deep learning algorithms, powered by advances in computing and very large data sets, have recently outpaced human performance in games such as chess and Go. The paper proposes a deep learning model which gives results earlier and more cheaply than manual diagnosis; further work on this model could even match the level of medical diagnosis. In this paper, the authors demonstrate a web application which determines whether a dermoscopic image is benign or malignant with the help of a robust predictive model. It also predicts the severity of the lesion at the current stage.

A. Singh · S. Bera (B) · P. Chaturvedi · P. Gadhave · C. S. Lifna Vivekanand Education Society’s Institute of Technology, Chembur, Mumbai, India e-mail: [email protected] A. Singh e-mail: [email protected] P. Chaturvedi e-mail: [email protected] P. Gadhave e-mail: [email protected] C. S. Lifna e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al, (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_4


2 Theory

Considering the importance of CAD in skin lesion detection, various studies have been carried out on skin lesion detection. This section discusses the related state-of-the-art techniques that have been proposed.

2.1 Survey of Existing Systems

In paper [1], Kassem et al. proposed a GoogleNet model on the ISIC 2019 data set. The authors first pre-trained the GoogleNet and then replaced the last three layers of the GoogleNet architecture with a multi-class SVM. After refining the weights of all the architectural layers, the authors could improve the performance measures. In paper [2], Arshad et al. proposed a framework for the HAM10000 data set from the ISIC 2018 Challenge. The authors introduced a solution consisting of stepwise data augmentation followed by training with ResNet50 and ResNet101, and achieved better accuracy after augmentation. In paper [3], Hasan et al. worked on binary classification on the ISIC 2016 data set, which consists of benign and malignant images. The authors used a variety of models such as SVM, VGG16 and ResNet50 and found VGG16 to be the best model. In paper [4], Esteva et al. proposed a model to classify whether a lesion is benign or malignant using the ISIC 2020 data set. The authors trained several models, including customised CNN models; the GoogleNet Inception V3 architecture gave the best performance. In paper [5], Alom et al. proposed an improved version of the U-Net model, named NABLA-N, which classifies lesions as benign or malignant. Hence, most of the earlier research uses CNNs, and many of these approaches give quite good results (e.g. R2U-Net).

2.2 Limitations Found in Existing Systems

The major limitations of existing systems are discussed in this section. The authors surveyed and compared many papers related to skin lesion detection and classification on the basis of the type of classification, the data set used and whether the model is deployed for public use. Table 1 lists the related papers reviewed by the authors; there is no existing system that is capable of both multi-class classification (as discussed in Table 1) and binary classification in the form of a website or mobile application for public use.


Table 1 Research papers related to skin lesion detection/classification

Publication | Data sets | Website/Mobile integration | Usage
Hassan et al. [6] | ISIC 2020 | No | Benign/Malignant classification
Cullell et al. [7] | ISIC 2019 | No | Multi-class (8 classes)
Wu et al. [8] | ISIC 2016, 2017 | No | Binary classification
Kadambur and Al Riyaee [9] | HAM10000 | Yes | Multi-class (7 classes)
Dutta et al. [10] | ISIC 2017 | No | Multi-class (3 classes)
Albahar [11] | Custom data set (ISIC) | No | Benign/Malignant classification
Wang and Hamian [12] | ISIC 2020 | No | Benign/Malignant classification
Ali et al. [13] | ISIC 2020, HAM10000 | No | Benign/Malignant classification
Gessert et al. [14] | ISIC 2019 | No | Multi-class (8 classes)
Abuzaghleh et al. [15] | Custom data set | Yes | Benign/Malignant classification
Chaturvedi et al. [16] | HAM10000 | Yes | Multi-class (7 classes)
Moldovan [17] | HAM10000 | No | Multi-class (7 classes)

*Multi-class classification. **Custom data set: a data set made using the ISIC archive [18]

3 Proposed Work

This section discusses the data set used, data pre-processing, the CNN architecture and the classification process of the proposed system. Figure 1 depicts the flow of the proposed system discussed in the paper. The user provides a dermoscopic image which is passed to two models, ResNet50 and DenseNet121. ResNet50 is used for multi-class classification, which classifies the image into eight classes, and DenseNet121 is used for binary classification, which determines whether the image is benign or malignant. The output is then shown on the website (DermoCare.AI), in which both models are integrated.


Fig. 1 Proposed system

3.1 Data Set

Having a large data set is a critical aspect of training deep learning neural networks. This paper considered a combination of the 2019 and 2020 ISIC (International Skin Imaging Collaboration) Challenge data sets with proper labelling. The ISIC 2019 [19, 20] data set consists of 25,331 training images with metadata entries for age, sex, general anatomic site and a common lesion identifier. The data set was used for the classification of skin lesions into 9 classes: nevus (NV), melanoma (MEL), basal cell carcinoma (BCC), benign keratosis (BK), actinic keratosis (AK), squamous cell carcinoma (SCC), vascular (VASC), dermatofibroma (DF) and unknown (UNK). In parallel, skin lesions were classified into benign (non-cancerous) and malignant (cancerous) using the data set from the ISIC 2020 challenge [21]. This data set includes 33,126 unique benign and malignant dermoscopic training images of skin lesions from more than 2000 people; each image is linked to its patient through a unique patient identifier.

3.2 Pre-processing

Image pre-processing is an important stage for enhancing the efficiency of a neural network model. It was observed that the ISIC 2019 challenge data set is highly imbalanced. To overcome this, data augmentation was performed on the VASC, DF, SCC and AK classes of skin lesions using the ImageDataGenerator API [22] provided by the TensorFlow library. The unknown (UNK) class of the ISIC 2019 data set was dropped because no images of that class were present amongst the images available for training. During augmentation, the images were rescaled to 224 × 224 pixels. The following augmentation techniques were used: width shifting, height shifting, shearing, zooming, horizontal flipping and constant fill mode. The model was trained using 3000 images of each class. The ISIC 2020 data set, in turn, was highly skewed in favour of benign images. As all melanoma (MEL) skin lesions fall into the malignant category, a random selection of 4433 images from the ISIC 2019 data set was made. The number of benign images was scaled down to 5000 images, and a data set of 10,000 images, with 5000 belonging to each category (benign and malignant), was then taken forward for model training.
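As a rough, hedged illustration of the augmentation step described above, the following sketch uses TensorFlow's ImageDataGenerator API, which the paper names. The specific range values and the directory path are assumptions for illustration; the paper does not report the exact parameters used.

```python
# Minimal augmentation sketch using the ImageDataGenerator API referenced in the paper.
import tensorflow as tf

augmenter = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,          # normalise pixel values
    width_shift_range=0.1,      # width shifting (assumed range)
    height_shift_range=0.1,     # height shifting (assumed range)
    shear_range=0.1,            # shearing (assumed range)
    zoom_range=0.1,             # zooming (assumed range)
    horizontal_flip=True,       # flipping (horizontal)
    fill_mode="constant",       # fill mode (constant)
    cval=0,
)

# Images are resized to 224 x 224 pixels while streaming from a class-wise
# directory layout (one sub-folder per lesion class); the path is a placeholder.
train_gen = augmenter.flow_from_directory(
    "isic2019_train/",          # hypothetical path to the training images
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
)
```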


3.3 CNN Architecture

In this paper, the models were trained on the following pre-trained CNN architectures: ResNet50, VGG16, AlexNet and DenseNet121. Amongst these models, ResNet50 gave the most accurate results for the class identification of skin lesions from the ISIC 2019 Challenge data set. There are 5 stages in the ResNet50 model, each with a convolution block and an identity block, and there are three convolution layers in each convolution block as well as in each identity block. More than 23 million parameters of ResNet50 can be trained. The layers are connected through skip connections, and the activation function used with these layers is the rectified linear unit (ReLU), which is then connected to the fully connected layers. Figure 2 depicts the architecture of ResNet50. Further, the images were examined to check whether the skin lesion is cancerous or not using the popular generic model DenseNet121. In the DenseNet architecture, each layer is connected with every other layer, hence the name densely connected convolutional neural network. The DenseNet121 model contains one 7 × 7 convolution layer, 58 3 × 3 convolution layers, 61 1 × 1 convolution layers, 4 average-pooling layers and one fully connected layer. DenseNets require fewer parameters: each layer receives additional input from all previous layers and passes its own feature maps to all subsequent layers, so each layer inherits knowledge from the layers before it. This makes DenseNets more powerful than other pre-trained models, with stronger gradient flow, more versatility and smaller network sizes, resulting in better results. Figure 3 depicts the architecture of the DenseNet121 model.

Fig. 2 ResNet50 architecture

Fig. 3 DenseNet121 architecture
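To make the architecture choices concrete, the sketch below shows one plausible way to adapt ImageNet-pretrained ResNet50 and DenseNet121 backbones to the two classification tasks. This is not the authors' code (the paper trains its models with the FAST AI library on Google Colab, as noted in Sect. 4); the dropout rate, optimiser and head design here are illustrative assumptions.

```python
# Illustrative Keras sketch only; the paper itself uses fastai for training.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(backbone_fn, num_outputs, activation):
    base = backbone_fn(include_top=False, weights="imagenet",
                       input_shape=(224, 224, 3), pooling="avg")
    x = layers.Dropout(0.3)(base.output)          # assumed regularisation
    out = layers.Dense(num_outputs, activation=activation)(x)
    return models.Model(base.input, out)

# ResNet50 head for the 8-class lesion-type classifier (ISIC 2019).
multi_class_model = build_classifier(
    tf.keras.applications.ResNet50, num_outputs=8, activation="softmax")
multi_class_model.compile(optimizer="adam",
                          loss="categorical_crossentropy", metrics=["accuracy"])

# DenseNet121 head for the benign/malignant classifier (ISIC 2020).
binary_model = build_classifier(
    tf.keras.applications.DenseNet121, num_outputs=1, activation="sigmoid")
binary_model.compile(optimizer="adam",
                     loss="binary_crossentropy", metrics=["accuracy"])
```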


4 Results and Discussion

This section discusses the stepwise procedure of the developed system, which includes developing the models, creating a web application and deploying it to the web. The CNN models were trained using FAST AI [23] on Google Colab. Also, to test the capabilities of a model, it was submitted to the ISIC 2020 challenge [24] on Kaggle.

4.1 Evaluation Measures

For quantitative analysis of the experimental results, several performance metrics were considered. These metrics are evaluated using the variables true positives (tp), true negatives (tn), false positives (fp) and false negatives (fn), which are obtained from a confusion matrix. Using the confusion matrix in Fig. 4, the evaluation metrics were calculated with the formulas given in Table 2.

Loss Function: For model training, two different classifiers were implemented. The first classification model predicts the type of skin lesion, and the second one predicts whether the skin lesion is benign or malignant. The two loss functions are used as follows:

1. The authors have used the following cross-entropy loss function for binary classification:

Fig. 4 Confusion matrix

Table 2 Evaluation measures used

Performance metric | Formula | Performance metric | Formula
Accuracy | (tp + tn)/(tp + fp + tn + fn) | Jaccard index | tp/(tp + fp + fn)
Recall | tp/(tp + fn) | F1-score | 2 × (Precision × Recall)/(Precision + Recall)
Specificity | tn/(fp + tn) | Precision | tp/(tp + fp)
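As a small, non-authoritative illustration of how the measures in Table 2 follow from the confusion-matrix counts, the sketch below derives them with scikit-learn and NumPy; the toy labels and the 1 = malignant convention are assumptions.

```python
# Sketch: computing the Table 2 metrics from predicted vs. true labels.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # 1 = malignant, 0 = benign (toy data)
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + fp + tn + fn)
recall      = tp / (tp + fn)
specificity = tn / (fp + tn)
precision   = tp / (tp + fp)
f1_score    = 2 * precision * recall / (precision + recall)
jaccard     = tp / (tp + fp + fn)

print(accuracy, recall, specificity, precision, f1_score, jaccard)
```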

Table 3 Comparison of testing accuracies using the HAM10000 data set

CNN model | Training epochs | Testing accuracies (%)
ResNet50 | 50 | 93.87
VGG16 | 50 | 90.3115
AlexNet | 50 | 83.6614
DenseNet | 50 | 90.0057

L_loss = -W_i [ y_i \log(x_i) + (1 - y_i) \log(1 - x_i) ]    (1)

where x_i is the probability of the ith lesion image being predicted as positive, y_i is the label of the image and W_i is its weight.

2. The authors have used the following categorical cross-entropy loss function for multi-class classification:

L_loss = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} 1_{y_i \in C_c} \log(p_{model}[y_i \in C_c])    (2)

where i iterates over the N observations, c iterates over the C classes, 1 is the indicator function (as in binary cross-entropy, but operating on length-C vectors) and p_{model}[y_i \in C_c] is the predicted probability of observation i belonging to class c.
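The following small NumPy sketch is only a numerical restatement of Eqs. (1) and (2) for illustration; the actual training uses the loss implementations of the deep learning framework, and the toy inputs below are assumptions.

```python
# Numerical sketch of the two loss functions in Eqs. (1) and (2).
import numpy as np

def weighted_binary_cross_entropy(y_true, p_pred, weights):
    """Eq. (1): -W_i [ y_i log(x_i) + (1 - y_i) log(1 - x_i) ], averaged over images."""
    p_pred = np.clip(p_pred, 1e-7, 1 - 1e-7)      # numerical stability
    per_image = -weights * (y_true * np.log(p_pred)
                            + (1 - y_true) * np.log(1 - p_pred))
    return per_image.mean()

def categorical_cross_entropy(y_true_onehot, p_pred):
    """Eq. (2): -(1/N) sum_i sum_c 1[y_i in C_c] log(p_model[y_i in C_c])."""
    p_pred = np.clip(p_pred, 1e-7, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(p_pred), axis=1))

# Tiny toy example: two images for the binary case, two for a 3-class case.
print(weighted_binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2]),
                                    np.array([1.0, 1.0])))
print(categorical_cross_entropy(np.array([[1, 0, 0], [0, 1, 0]]),
                                np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])))
```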

4.2 Multi-class Classification

The augmented ISIC 2019 data set was trained on four different pre-trained models, namely ResNet50, AlexNet, DenseNet and VGG16. Later, the models were tested on the similar HAM10000 data set [19, 25, 26]. The testing accuracies given by the models are shown in Table 3; ResNet50 exhibited the best performance. Table 4 depicts the class-wise results generated using the confusion matrix of the ResNet50 model. The confusion matrix in Fig. 5 is generated with reference to Fig. 4. Figure 6 shows the trends of training and validation losses across the epochs for which the ResNet50 model was trained. The losses were calculated using Eq. (2).

4.3 Benign/Malignant Classification

Apart from identifying the type of skin lesion using the ISIC 2019 Challenge data set, the authors also examined whether a skin lesion is benign (non-cancerous) or malignant (cancerous) after performing data pre-processing as explained in Sect. 3.2. The newly prepared data set was trained on four different pre-trained models, namely ResNet50, DenseNet121, DenseNet169 and VGG16.


Table 4 Class-wise results generated using the confusion matrix of the ResNet50 model

Skin lesion class | Accuracy | Precision | Recall | F1-score | Specificity | Jaccard index
AK | 0.995987 | 0.977742 | 0.988746 | 0.983213 | 0.996964 | 0.966981
BCC | 0.978980 | 0.930269 | 0.898928 | 0.914330 | 0.990393 | 0.842181
BKL | 0.965030 | 0.787194 | 0.856557 | 0.820412 | 0.976185 | 0.695507
DF | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000
MEL | 0.956430 | 0.873602 | 0.871652 | 0.872626 | 0.973945 | 0.774034
NV | 0.978406 | 0.931587 | 0.906516 | 0.918880 | 0.989618 | 0.879934
SCC | 0.995987 | 0.988333 | 0.976936 | 0.982601 | 0.998487 | 0.965798
VASC | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000
Average | 0.983852 | 0.936091 | 0.937417 | 0.936508 | 0.990699 | 0.886804

The values are calculated using the mathematical expressions discussed in Table 2

Fig. 5 Confusion matrix of ResNet50 model

Fig. 6 Losses across epochs for the ResNet50 model

Table 5 Testing accuracies of the models submitted on the ISIC 2020 challenge website

Model | Final score
ResNet50 | 0.875
DenseNet121 | 0.892
VGG16 | 0.853
DenseNet169 | 0.853

Fig. 7 Confusion matrix of DenseNet121 model

Amongst the models, DenseNet121 exhibited the highest training accuracy of 93.78%. Later, the authors submitted the results to the official ISIC 2020 Challenge website, and the final scores obtained are shown in Table 5. The confusion matrix in Fig. 7 is generated with reference to Fig. 4. Figure 8 shows the trends of training and validation losses across the epochs for which the DenseNet121 model was trained. The losses were calculated using Eq. (1). The relevant performance metrics for the DenseNet121 model are shown in Table 6 and were calculated using the mathematical formulas given in Table 2. Later, the results obtained in this paper were compared with the other papers mentioned in Sect. 2.1; the comparison is given in Table 7.

4.4 Integrating Models into a Web Application

The two models were then integrated into the website named DermoCare.AI. Figures 9, 10 and 11 show snapshots of the website. Initially, the authors created an API using FastAPI [27] and then integrated both models mentioned above into this API, which also serves the web application. In the web app, users first have to select an image. Once the image is selected and the


Fig. 8 Losses across epochs for DenseNet121 model

Table 6 Class-wise results generated using the confusion matrix for the DenseNet121 model

Class | Accuracy | Precision | Recall | F1-score | Specificity | Jaccard index
Benign | 0.937685 | 0.947826 | 0.931624 | 0.939655 | 0.94472 | 0.886179
Malignant | 0.937685 | 0.927052 | 0.944272 | 0.935583 | 0.931624 | 0.878962
Average | 0.937685 | 0.937439 | 0.937948 | 0.937619 | 0.937948 | 0.882571

Table 7 Comparison of our solution with the other solutions discussed in Sect. 2.1

Evaluation measure | Proposed solution | Paper [1] | Paper [2] | Paper [3] | Paper [4] | Paper [5]
Accuracy for binary classification | 93.76% | NA | NA | 93.18% | 72.1% | NA
Accuracy for type classification | 93.87% | 94.92% | 91.7% | NA | NA | 87.6%
Website application | Yes | No | No | No | No | No

'Analyse' button is clicked, two functions are called simultaneously. The first function makes a call to the multi-class classification model and provides the image to the model as input. Similarly, the second function passes the skin image to the binary classification model. Using JavaScript [28], the predicted results from the models are displayed on the website along with the necessary information regarding the skin disease.
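As a hedged illustration of how two trained models could be exposed through a FastAPI endpoint of this kind (not the authors' actual implementation), consider the following sketch. The model file names, the class list, the endpoint name and the response fields are assumptions made only for the example.

```python
# Illustrative FastAPI sketch for serving the two classifiers.
import io

import numpy as np
import tensorflow as tf
from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI(title="DermoCare.AI (sketch)")

# Hypothetical paths to the trained multi-class and binary classifiers.
multi_class_model = tf.keras.models.load_model("resnet50_multiclass.h5")
binary_model = tf.keras.models.load_model("densenet121_binary.h5")

CLASS_NAMES = ["AK", "BCC", "BKL", "DF", "MEL", "NV", "SCC", "VASC"]

def preprocess(data: bytes) -> np.ndarray:
    """Resize the uploaded image to 224x224 and scale pixels to [0, 1]."""
    image = Image.open(io.BytesIO(data)).convert("RGB").resize((224, 224))
    return np.expand_dims(np.asarray(image, dtype="float32") / 255.0, axis=0)

@app.post("/analyse")
async def analyse(file: UploadFile = File(...)):
    batch = preprocess(await file.read())
    class_probs = multi_class_model.predict(batch)[0]          # 8-class softmax output
    malignant_prob = float(binary_model.predict(batch)[0][0])  # sigmoid output
    return {
        "lesion_type": CLASS_NAMES[int(np.argmax(class_probs))],
        "malignant": malignant_prob >= 0.5,
        "malignant_probability": malignant_prob,
    }
```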

5 Conclusion and Future Scope

The paper discusses a system to automate the process of detecting the types of skin lesions. Two pre-trained deep learning models have been developed on the data sets from the ISIC 2019 and ISIC 2020 Challenges. As per the experimental study, ResNet50 proved to be the most accurate model on the resampled ISIC 2019 data set, whereas DenseNet121 was the most accurate on the ISIC 2020 data set.


Fig. 9 Dashboard of the web application (Dermocare.AI)

Fig. 10 Uploading the image section

Based on the findings, the authors have developed a web application named DermoCare.AI in which both models are deployed. One model predicts whether the uploaded image is benign or malignant, and the other identifies the class to which the image belongs. Finally, the website generates a report for the patient and serves as a handy tool for dermatologists. The work can be further upgraded into an easy and handy mobile app for the stakeholders.


Fig. 11 Test result of the image uploaded on the Dermocare.AI website

Further, the model can be revitalised by training it with upcoming ISIC data sets and by incorporating unknown classes through analysis of the confidence values for images with contrasting skin lesions.

References 1. Kassem MA, Hosny KM, Fouad MM (2020) Skin lesions classification into eight classes for ISIC 2019 using deep convolutional neural network and transfer learning. IEEE Access 8:114822–114832 2. Arshad M et al (2021) A computer-aided diagnosis system using deep learning for multi-class skin lesion classification. Comput Intell Neurosci 2021 3. Hasan MR et al (2021) Comparative analysis of skin cancer (Benign vs. Malignant) detection using convolutional neural networks. J Healthcare Eng 2021 4. Estava A et al (2017) Dermatologist level classification of skin cancer with deep neural networks. Nature 542(7639):115–118 5. Alom MZ et al (2019) Skin cancer segmentation and classification with NABLA-N and inception recurrent residual convolutional networks. arXiv:1904.11126 (2019) 6. Hassan HA et al (2019) Skin lesion classification using deep learning techniques 7. Cullell-Dalmau M et al (2021) Convolutional neural network for skin lesion classification: understanding the fundamentals through hands-on learning. Front Med 8:213 8. Wu J et al (2020) Skin lesion classification using densely connected convolutional networks with attention residual learning. Sensors 20(24):7080 9. Kadampur MA, Al Riyaee S (2020) Skin cancer detection: applying a deep learning based model driven architecture in the cloud for classifying dermal cell images. Inf Med Unlocked 18:100282 10. Dutta A, Hasan K, Ahmad M (2021) Skin lesion classification using convolutional neural networks for melanoma recognition. In: Proceedings of international joint conference on advances in computational intelligence. Springer, Singapore 11. Albahar MA (2019) Skin lesion classification using convolutional neural network with novel regularizer. IEEE Access 7:38306–38313 12. Wang S, Hamian M (2021) Skin cancer detection based on extreme learning machine and a developed version of thermal exchange optimization. Comput Intell Neurosci 2021


13. Ali MS et al (2021) An enhanced technique of skin cancer classification using deep convolutional neural networks with transfer learning models. Mach Learn Appl 5:100036 14. Gessert N et al (2020) Skin lesion classification using ensembles of multi-resolution EfficientNets with metadata. MethodsX 7:100864 15. Abuzaghleh O, Barkana BD, Faezipour M (2015) Noninvasive real-time automated skin lesion analysis system for melanoma early detection and prevention. IEEE J Transl Eng Health Med 3:1–12 16. Chaturvedi SS, Gupta K, Prasad PS (2020) Skin lesion analyser: an efficient seven-way multiclass skin cancer classification using MobileNet. In: International conference on advanced machine learning technologies and applications. Springer, Singapore 17. Moldovan D (2019) Transfer learning based method for two-step skin cancer images classification. In: E-health and bioengineering conference (EHB). IEEE 18. The International Skin Imaging Collaboration (ISIC) Accessed: 22 Dec 2018 [Online]. Available: https://www.isicarchive.com/#!/topWithHeader/onlyHeaderTop/gallery 19. Codella NCF, Gutman D, Celebi ME, Helba B, Marchetti MA, Dusza SW, Kalloo A, Liopyris K, Mishra N, Kittler H, Halpern A (2017) Skin lesion analysis toward melanoma detection: a challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC). arXiv:1710.05006 20. Combalia M et al (2019) Bcn20000: dermoscopic lesions in the wild. arXiv preprint arXiv: 1908.02288 21. Rotemberg V et al (2021) A patient-centric dataset of images and metadata for identifying melanomas using clinical context. Sci Data 8(1):1–8 22. ImageDataGenerator API: https://www.tensorflow.org/api_docs/python/tf/keras/preproces sing/image/ImageDataGenerator 23. FAST AI official documentation [Online]. https://www.fast.ai/ 24. SIIM-ISIC Melanoma Classification-ISIC 2020 challenge on Kaggle. https://www.kaggle.com/ c/siim-isic-melanoma-classification 25. Codella N, Rotemberg V, Tschandl P, Emre Celebi M, Dusza S, Gutman D, Helba B, Kalloo A, Liopyris K, Marchetti M, Kittler H, Halpern A (2018) Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the international skin imaging collaboration (ISIC). https://arxiv.org/abs/1902.03368 26. Akram T et al (2020) A multilevel features selection framework for skin lesion classification. Human-Centric Comput Inf Sci 10(1):1–26 27. FastAPI Documentation. https://fastapi.tiangolo.com/ 28. Javascript Documentation. https://developer.mozilla.org/en-US/docs/Web/JavaScript

Chapter 5

Analysis of Phishing Base Problems Using Random Forest Features Selection Techniques and Machine Learning Classifiers Mithilesh Kumar Pandey, Munindra Kumar Singh, Saurabh Pal, and B. B. Tiwari

1 Introduction

Phishing is a deceptive attack that uses social and technological trickery to acquire a person's identity and financial information. By means of faked emails from real firms and agencies, users are led to open bogus Web sites and provide financial data such as usernames and passwords. Attackers utilize a variety of techniques and interfaces to collect user information, including email, URLs, instant chats, forum comments, phone calls, and text messages. Phishing material usually has a structure similar to legitimate content, which can fool people into giving away sensitive information. The main goal of a phishing attack is to obtain specific personal information for the sake of financial gain or to commit identity theft. Phishing attacks are inflicting havoc on businesses all over the world [1], with a majority of phishing efforts focusing on financial/payment institutions as well as Webmail. Attackers make unlawful copies of genuine Web sites and mails in order to acquire personal information [2–4]. In addition to slogans, such an email is displayed with the logos of a respectable firm, and beyond the HTML structure, the design allows copying of images or even a full Web site [5]. This is also considered one of the reasons for the Internet's rapid expansion as a communication medium; on the other hand, it leads to the misuse of brands and trademarks [6–8]. The "spoofed" emails are sent by the attacker to as many individuals as possible in order to reach different users. When users read these emails, they are often directed away from the actual company Web site to a fake Web site, and there is a good probability that the user information will be exploited. As a result, phishing has become an extremely urgent, difficult, and crucial problem in modern culture [9–11].

M. K. Pandey · M. K. Singh · S. Pal (B)
Department of Computer Applications, VBS Purvanchal University, Jaunpur, India
e-mail: [email protected]

B. B. Tiwari
Department of Electronics and Communication, VBS Purvanchal University, Jaunpur, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_5


However, there is a dearth of efficient anti-phishing techniques within businesses to identify dangerous URLs and safeguard users from malicious URLs available on the Internet using machine learning (ML) techniques [12–15], and conventional methods are unable to detect new dangerous URLs. Researchers have proposed different machine learning-based approaches to identify harmful URLs in order to overcome the limitations of the blacklist-based system [16–18]. Malicious URL detection is considered a binary classification problem with two possible outcomes: malicious and benign [19–21]. When compared to the blacklist technique, this strategy offers a higher generalization ability for detecting unknown harmful URLs. One combination of ML approaches that provides a solution to complex real-time challenges is the RNN with LSTM: by using LSTM, an RNN can store inputs over a longer length of time, which can be compared to the idea of computer storage. Furthermore, each feature is handled in accordance with a uniform distribution [22].

2 Research Background

Nisha and Madheswari [23] discussed phishing attacks. Phishing attacks now come in a variety of forms: messages requiring users to verify account information, requests that users re-enter their information, bogus account charges, unwanted account changes, new free services requiring immediate action, and many other malicious messages are sent to a large number of recipients in the hope that an unsuspecting person will react by clicking on a link or signing on to a fake site. Kazemian and Ahmed [24] discussed different phishing Web sites. The disadvantage of the blacklist strategy is that blacklists cannot generally include all phishing Web sites, since it takes time before a newly built fraudulent Web site is added. Thomas et al. [25] analyzed phishing attacks. Usually, the malware is sent in the form of an email attachment, which may be easily opened and downloaded; often, the malware gets installed automatically on the user's device. A blacklist-based strategy, a content-based approach, and a heuristic-based approach are all employed to combat phishing attacks. A blacklist refers to a collection of harmful URLs. Firdaus et al. [26] and Razak et al. [27] considered malware-based phishing, which refers to attacks that cause malicious software to be installed and executed on consumer systems. Chaudhry et al. [28] discussed key loggers and screen grabbers. Malware is usually sent in the form of an email attachment that may be opened and downloaded. Key loggers and screen grabbers are two forms of malware commonly used in phishing attacks; they capture and log keyboard input or the screen contents and deliver the information to the phisher. In certain cases, the attacker's purpose is to gain control of the victim's computer.


Gowtham and Krishnamurthi [29] analyzed some phishing methods. A phishing technique in which a phisher alters part of the information on a trustworthy Web site is called content injection; this is done to reroute the user away from the legitimate Web site to a page where personal information is collected. Xiang et al. [30] analyzed heuristic-based systems and used a content-based approach for detecting phishing Web sites. Heuristic-based systems collect different characteristics from Web sites in order to determine the possibility of phishing.

3 Methodology

The five components of the phishing detection system are data collection, identification of phishing attributes, model development, testing, and finally comparison of the findings. These components of the phishing detection system are addressed in the following sections.

3.1 Data Description

The first stage of the implementation is data collection. The dataset phase is critical for assuring the accuracy of the outcomes. The dataset helps in the identification and explanation of phishing and legitimate operations. The dataset is then examined for further analysis, and the results are used to forecast or predict possible phishing attacks. All of the characteristics were gathered from the UCI repository. The shape of the dataset is (11,055, 16), i.e. 16 phishing Web site characteristics were collected for 11,055 instances. This information was mostly gathered from a well-known phishing database. The categorical values of the dataset are visualised as blue and red histograms in Fig. 1.


Fig. 1 Representation of histogram plotting of phishing dataset

The categorical values are represented in Fig. 1. Further, the gathered dataset was converted to numerical values by substituting the values "1," "0," and "−1."

3.2 Features Selection Method

Data science provides an environment for experimenting with feature selection techniques such as random forest and other tree-based algorithms. This tree-based method ranks features by how much they increase node purity: the nodes with the greatest decrease in impurity are located at the beginning of the trees, while the nodes with the smallest decrease in impurity are found at the end. A subset of the most essential characteristics is then produced by trimming the trees below a certain node. Figure 2 shows the feature importance results with respect to the target variable of the phishing dataset.
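The following Python sketch illustrates one plausible way to implement the random forest feature selection described above with scikit-learn. It assumes the UCI phishing data have been loaded into a DataFrame with a "Result" target column; the file name and column names are assumptions, since the paper does not list them.

```python
# Minimal sketch of random forest based feature selection (assumed data layout).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

df = pd.read_csv("phishing.csv")                 # hypothetical file name
X, y = df.drop(columns=["Result"]), df["Result"]

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Rank features by mean decrease in impurity across the trees.
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Keep only features whose importance exceeds the mean importance.
selector = SelectFromModel(forest, prefit=True)
X_selected = selector.transform(X)
```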


Fig. 2 Representation of phishing dataset features selection by random forest algorithm

3.3 Algorithms Description

3.3.1 Naïve Bayes Algorithm

Abhilash and Chakradhar introduced effective inductive learning algorithms [31]. Naive Bayes is considered an efficient machine learning algorithm that works on class-labelled datasets on the basis of the target variable. Naive Bayes text classification has been widely utilized in document categorization tasks since the 1950s to categorize any sort of data, including text, network characteristics, phrases, and so on. This method is referred to as a generative model: it describes how a dataset is created by a probabilistic model, and it can produce fresh data comparable to the data on which the model was trained by sampling. For textual characteristics and word embeddings, the most basic version of the Naive Bayes classifier has been employed [31].

3.3.2 KNN Algorithm

Khorshid and Abdulazeez introduced the KNN-based supervised algorithm [32]. The k-nearest neighbors method solves dataset problems on the basis of multi-class functions. The distance between a fresh sample and its neighbors is employed in this approach for categorization: the k nearest neighbors in the training set are found, and the item is assigned to the class with the most members among its k nearest neighbors. KNN is a non-parametric learning method with almost no assumptions about the distribution of the underlying data [32].

3.3.3 Decision Tree Algorithm

Charbuty and Abdulazeez introduced the decision tree algorithm [33]. The decision tree is a well-known classification algorithm and one of the most extensively used inductive learning methods in the machine learning domain. It can handle both continuous and discrete characteristics as well as training data with missing values. The idea is to utilize information entropy to construct decision trees from labeled training data. Their capacity to learn disjunctive statements and their tolerance to noisy data make them an excellent option for performing text categorization [33].

3.3.4 Random Forest Algorithm

Zhang et al. [34] introduced the random forest (RF) classification and regression technique, which uses an ensemble of decision trees. RF constructs numerous decision tree classifiers on random subsets of the data samples and characteristics, and a fresh sample is classified by the majority vote of the decision trees. The fundamental benefit of RF is that it scales well to big datasets, is a solid approach for predicting missing data, and provides excellent accuracy even when a significant amount of the data is missing.

3.3.5 Support Vector Machine Algorithm

Chandra and Bedi [35] introduced the SVM-based supervised learning pattern recognition technique to categorize and solve dataset challenges. SVM identifies the different classes present in a search space by finding a separating hyperplane; the hyperplane is defined by the crucial training tuples, called support vectors.

3.3.6 Proposed Model

The tagged synthetic data are then used to train a model, which is subsequently applied to the actual data in order to determine whether there is evidence of a threat. The data are divided into two categories, training and testing, as shown in Fig. 3. The model is then trained with the Naïve Bayes, decision tree, KNN, random forest, and SVM algorithms. Random forest is an ensemble supervised machine learning methodology and an effective feature selection method that has previously been proved useful for estimating feature significance in a dataset. To assess the performance of the proposed model, the synthetic data are divided into ten folds; the model is trained on 80% of the data and tested on 20% of the data, with one fold preserved for testing in each iteration. The obtained results improved the prediction accuracy across the 10 tests.
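A minimal sketch of such an evaluation pipeline is given below, assuming an 80/20 train–test split followed by tenfold cross-validation of the five classifiers on the features selected in the previous step. The classifier settings are scikit-learn defaults, which is an assumption; the paper does not report hyperparameters.

```python
# Sketch: train/test the five classifiers on the random forest selected features.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

classifiers = {
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(),
}

# X_selected and y come from the feature selection sketch in Sect. 3.2.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=42)

for name, clf in classifiers.items():
    cv_acc = cross_val_score(clf, X_train, y_train, cv=10).mean()
    test_acc = clf.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: 10-fold CV accuracy={cv_acc:.3f}, test accuracy={test_acc:.3f}")
```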


Fig. 3 Representation of proposed model on phishing dataset

3.3.7 Performance Evaluation

Performance measures assess certain aspects of a categorization task and do not always provide the same information. Any classification method requires an understanding of how the model works. Different evaluation measures may have different underlying mechanics, so understanding what each of these metrics reflects and what sort of information it conveys is critical for comparison. A classifier's performance may be measured in a variety of ways, including accuracy, F-measure, and kappa values [36–40]. In this research paper, five machine learning classifiers are used: NB, KNN, DT, RF, and SVM. All five classifiers were put to the test in a variety of scenarios. The proposed framework is also tested in a binary environment to evaluate whether the proposed multi-class strategy works better in detecting cyberbullying behavior in tweets even with a binary classification challenge. The proposed


study excluded the trials with poor performance from the list in order to standardise on the top findings for cross-comparison across each layer of features that provides experimental outcomes. A tenfold cross-validation approach was used in all trials.

4 Results

This section compares the effectiveness of the several classifiers when it comes to categorizing tweets at various levels. Table 1 displays the training results of multi-class classification on the 80% training split of the dataset for each classifier under various circumstances. With the feature selection method "Random Forest," all the classifiers (Naïve Bayes, KNN, decision tree (J48), random forest, and SVM) are evaluated for classification accuracy, F-measure and kappa statistics under 5-, 10-, 15- and 20-fold cross-validation.

Table 1 Classifiers performance under various settings in multi-class classification

Features selection method | Cross-fold validation | Classifier | Accuracy | Kappa statistics | F-measure
Random forest | 5 | NB | 67.214 | 0.397 | 0.744
Random forest | 5 | KNN | 86.692 | 0.416 | 0.864
Random forest | 5 | DT | 89.714 | 0.475 | 0.886
Random forest | 5 | RF | 89.759 | 0.474 | 0.886
Random forest | 5 | SVM | 86.576 | 0.417 | 0.864
Random forest | 10 | NB | 76.91 | 0.276 | 0.794
Random forest | 10 | KNN | 86.679 | 0.415 | 0.864
Random forest | 10 | DT | 89.731 | 0.479 | 0.887
Random forest | 10 | RF | 90.363 | 0.471 | 0.889
Random forest | 10 | SVM | 89.747 | 0.475 | 0.886
Random forest | 15 | NB | 76.71 | 0.256 | 0.774
Random forest | 15 | KNN | 85.679 | 0.401 | 0.873
Random forest | 15 | DT | 87.731 | 0.461 | 0.874
Random forest | 15 | RF | 90.253 | 0.311 | 0.779
Random forest | 15 | SVM | 88.747 | 0.451 | 0.756
Random forest | 20 | NB | 75.68 | 0.232 | 0.761
Random forest | 20 | KNN | 84.655 | 0.381 | 0.753
Random forest | 20 | DT | 86.691 | 0.371 | 0.694
Random forest | 20 | RF | 89.321 | 0.279 | 0.671
Random forest | 20 | SVM | 86.747 | 0.621 | 0.696


Table 2 Classifiers performance under various settings in binary classification

Features selection method | Cross-fold validation | Classifier | Accuracy | Kappa statistics | F-measure
Random forest | 10 (fold) | NB | 76.81 | 0.281 | 0.786
Random forest | 10 (fold) | KNN | 87.679 | 0.315 | 0.764
Random forest | 10 (fold) | DT | 88.731 | 0.479 | 0.887
Random forest | 10 (fold) | RF | 91.363 | 0.281 | 0.789
Random forest | 10 (fold) | SVM | 89.891 | 0.515 | 0.716

Table 2 displays the test results on the remaining 20% of the dataset for each classifier under various circumstances. With the feature selection method "Random Forest," all the classifiers (Naïve Bayes, KNN, decision tree (J48), random forest, and SVM) are evaluated for classification accuracy, F-measure, and kappa statistics under tenfold cross-validation.

5 Discussion

The proposed research study takes a step forward by highlighting the shortcomings of the current cyberbullying detection methods. A holistic framework has been developed for determining the results of cyberbullying on Twitter, based on past research from other fields. Random forest achieved the highest accuracy values in each iteration with 5-, 10-, 15- and 20-fold cross-validation: (accuracy, kappa, F-measure) of (89.759, 0.474, 0.886), (90.363, 0.471, 0.889), (90.253, 0.311, 0.779) and (89.321, 0.279, 0.671), respectively. As shown in Fig. 4, tenfold cross-validation yielded the highest accuracy of 90.363, and in all four experiments random forest consistently achieved high accuracy on the 80% training dataset. The test results also showed enhanced accuracy for random forest on the 20% test dataset. To identify results in tweets, a large number of trials were conducted with the multi-classification strategy. The main goal of this research study is to establish a systematic technique to apply target variable levels for performing multi-class classification. In binary classification, the suggested method for detecting cyberbullying behavior outperforms numerous feature-engineered techniques and methodologies. Random forest obtained the greatest overall classifier performance. The capacity to limit the number of selected characteristics while keeping as much overall prediction information as feasible is a fundamental requirement for implementing successful feature selection. The majority of the published literature focuses on structured data approaches. Previous models did not explain the class distribution of the learning problem; as a result, many of them only produce a marginal improvement in performance.



Fig. 4 Representation of random forest results on the 80% training dataset with various cross-validation folds

Multi-minority classes are used for creating new discriminatory characteristics of the data that increase classifier accuracy.

6 Conclusion

Although the Internet and social media offer demonstrable benefits for society, their widespread usage may have severe negative implications. This study has successfully developed a model for identifying cyberbullying and the severity of its results on the Twitter platform. The suggested methodology is a feature-based model using "Random Forest," which uses features from tweet content to build a machine learning classifier for categorizing tweets as cyberbullying or non-cyberbullying and assessing the severity of outcomes as 1 and −1. The training and test results also showed enhanced accuracy with the 80% training and 20% testing datasets, respectively. Other sources of dataset problems should be examined to determine whether there is a similar trend of cyberbullying intensity. Future research could improve automated machine learning models and artificial intelligence-based ensembles for dealing with social networking difficulties by including early cyberbullying detection mechanisms.

References 1. Jain AK, Gupta BB (2018) PHISH-SAFE: URL features-based phishing detection system using machine learning. In: Cyber security. Advances inside intelligent systems and computing, vol 729. https://doi.org/10.1007/978-981-10-8536-9_44


2. Purbay M, Kumar D (2021) Split behavior of supervised machine learning algorithms on the behalf of phishing URL detection. Lecture notes inside electrical engineering, vol 683. https:// doi.org/10.1007/978-981-15-6840-4_40 3. Gandotra E, Gupta D (2021) An efficient approach on the behalf of phishing detection using machine learning. In: Algorithms on the behalf of intelligent systems, Springer, Singapore.https://doi.org/10.1007/978-981-15-8711-5_12 4. Le H, Pham Q, Sahoo D, Hoi SCH (2017) URLNet: learning a URL representation with deep learning on the behalf of malicious URL detection. In: Conference’17, Washington, DC, USA. arXiv:1802.03162 5. Hong J, Kim T, Liu J, Park N, Kim SW Phishing URL detection with lexical features and blacklisted domains. In: Autonomous secure cyber systems. Springer, https://doi.org/10.1007/ 978-3-030-33432- 1_12. 6. Kumar J, Santhanavijayan A, Janet B, Rajendran B, Bindhumadhava BS (2020) Phishing website classification and detection using machine learning. In: International conference on computer communication and informatics (ICCCI), Coimbatore, India, pp 1–6, https://doi.org/ 10.1109/ICCCI48352.2020.9104161 7. Hassan YA, Abdelfettah B (2017) Using case-based reasoning on the behalf of phishing detection. Procedia Comput Sci 109:281–288 8. Rao RS, Pais AR (2019) Jail-Phish: an improved search engine based phishing detection system. Comput Secur 1(83):246–267 9. Aljofey A, Jiang Q, Qu Q, Huang M, Niyigena JP (2020) An effective phishing detection model based on character level convolutional neural network from URL. Electronics 9(9):1514 10. AlEroud A, Karabatis G (2020) Bypassing detection of URL-based phishing attacks using generative adversarial deep neural networks. In: Proceedings of the sixth international workshop on security and privacy analytics 2020 Mar 16, pp 53–60 11. Gupta D, Rani R (2020) Improving malware detection using big data and ensemble learning. Comput Electron Eng 86:106729 12. Anirudha J, Tanuja P (2019) Phishing attack detection using feature selection techniques. In: Proceedings of international conference on communication and information processing (ICCIP). https://doi.org/10.2139/ssrn.3418542 13. Wu CY, Kuo CC, Yang CS (2019) A phishing detection system based on machine learning. In: International conference on intelligent computing and its emerging applications (ICEA), pp 28–32 14. Chiew KL, Chang EH, Tiong WK (2015) Utilisation of website logo on the behalf of phishing detection. Comput Secur 16–26 15. Srinivasa Rao R, Pais AR (2017) Detecting phishing websites using automation of human behavior. In: Proceedings of the 3rd ACM workshop on cyber-physical system security, ACM, pp 33–42 16. Sahingoz OK, Buber E, Demir O, Diri B (2019) Machine learning based phishing detection from URLs. Expert Syst Appl 117:345–357 17. Zamir A, Khan HU, Iqbal T, Yousaf N, Aslam F et al (2019) Phishing web site detection using diverse machine learning algorithms. Electron Libr 38(1):65–80 18. Almseidin M, Zuraiq AA, Al-kasassbeh M, Alnidami N Phishing detection based on machine learning and feature selection methods. Int J Interact Mob Technol 13 19. Tan CL, Chiew KL, Wong K (2016) PhishWHO: phishing webpage detection via identity keywords extraction and target domain name finder. Decis Support Syst 88:18–27 20. Gull S, Parah SA (2019) Color image authentication using dual watermarks. In: Fifth international conference on image information processing (ICIIP), pp 240–245 21. 
Giri KJ, Bashir R, Bhat JI (2019) A discrete wavelet based watermarking scheme on the behalf of authentication of medical images. Int J E-Health Med Commun 30–38 22. Gandotra E, Bansal D, Sofat S (2016) Malware threat assessment using fuzzy logic paradigm. Cybern Syst 29–48 23. Nisha S, Madheswari AN (2016) Secured authentication on the behalf of internet voting in corporate companies to prevent phishing attacks. 22(1):45–49

64

M. K. Pandey et al.

24. Kazemian HB, Ahmed S (2015) Comparisons of machine learning techniques on the behalf of detecting malicious webpages. Expert Syst Appl 42(3):1166–1177 25. Thomas K, Grier C, Ma J, Paxson V, Song D (2011) Design and evaluation of a real-time URL spam filtering service. In: IEEE symposium on security and privacy, pp 447–462 26. Firdaus A, Anuar NB, Razak MFA, Hashem IAT, Bachok S, Sangaiah AK (2018) Root exploit detection and features optimization: mobile device and blockchain based medical data management. J Med Syst 42(6) 27. Razak MFA, Anuar NB, Othman F, Firdaus A, Afifi F, Salleh R (2018) Bio-inspired on the behalf of features optimization and malware detection. Arab J Sci Eng 28. Chaudhry JA, Chaudhry SA, Rittenhouse RG (2016) Phishing attacks and defenses. Int J Secur Appl 10(1):247–256 29. Gowtham R, Krishnamurthi I (2014) A comprehensive and efficacious architecture on the behalf of detecting phishing webpages. Comput Secur 40:23–37 30. Xiang G, Hong J, Rose CP, Cranor L (2011) Cantina+. ACM Trans Inf Syst Secur 14(2):1–28 31. Abhilash PM, Chakradhar D (2021) Sustainability improvement of WEDM process by analysing and classifying wire rupture using kernel-based naive Bayes classifier. J Braz Soc Mech Sci Eng 43(2):1–9 32. Khorshid SF, Abdulazeez AM (2021) Breast cancer diagnosis based on k-nearest neighbors: a review. PalArch’s J Archaeol Egypt/Egyptol 18(4):1927–1951 33. Charbuty B, Abdulazeez A (2021) Classification based on decision tree algorithm on the behalf of machine learning. J Appl Sci Technol Trends 2(01):20–28 34. Zhang W, Wu C, Zhong H, Li Y, Wang L (2021) Prediction of undrained shear strength using extreme gradient boosting and random forest based on Bayesian optimization. Geosci Front 12(1):469–477 35. Chandra MA, Bedi SS (2021) Survey on SVM and their application in image classification. Int J Inf Technol 13(5):1–11 36. Yadav DC, Pal S (2021) An ensemble approach on the behalf of classification and prediction of diabetes mellitus disease. In: Emerging trends in data driven computing and communications. Springer, Singapore, pp 225–235 37. Yadav DC, Pal S (2021) Performance based evaluation of algorithms on chronic kidney disease using hybrid ensemble model in machine learning. Biomed Pharmacol J 14(3):1633–1646 38. Yadav DC, Pal S (2021) Discovery of thyroid disease using different ensemble methods with reduced error pruning technique. In: Computer-aided design and diagnosis methods on the behalf of biomedical applications. CRC Press, pp 293–318 39. Hamdan YB (2021) Construction of statistical SVM based recognition model for handwritten character recognition. J Inf Technol 3(02):92–107 40. Tripathi M (2021) Sentiment analysis of Nepali COVID19 tweets using NB, SVM AND LSTM. J Artif Intell 3(03):151–168

Chapter 6

Cost Prediction for Online Home-Based Application Services by Using Linear Regression Techniques

Rounak Goje, Vaishnavi Kale, Ritik Raj, Shivkumar Nagre, Geeta Atkar, and Geeta Zaware

1 Introduction

The Covid-19 pandemic has created a huge demand for various categories of home services [1–4]. This paper develops a solution to address the growth of the online on-demand home service market (home care, repair and maintenance, health and wellness, and others) [4–8]. The features of the designed Web portal are as follows:
• A complete and efficient Web site that provides information on different business domains, skilled workers, and more.
• A Web platform where customers can deal directly with service providers, since they are just one click away.
• A Web site that can predict the cost of various home services with good accuracy.
• The ability for clients to compare various service providers for the same service.
The Web site performs several tasks. Clients and service providers can access the Web site at the same time and reach its functions through simple links. The Web site provides a reliable environment for both clients and service providers.



1.1 Motivation

During the lockdown, people faced many issues with small but essential services such as plumbing, electrical repairs, key making, and security. These service providers also lost their source of income during the pandemic. This motivated the project, which provides a platform where service providers and the general public can solve these everyday problems.

2 Problem Statement

The aim is to design a Web platform where customers and service providers register themselves, so that providers can offer their services and customers can benefit from the services listed on the portal. Customers can deal directly with service providers through the portal as needed.

3 Literature Survey

3.1 Domestic Android Application for Home Services [9]

Survey: Clients use the proposed mobile application to raise a request for any particular service. It connects clients and service providers globally. The application fetches the location coordinates (latitude and longitude) and sends them to the service provider; it also compares the coordinates of the client and the service provider to estimate the distance between them.
Workflow: First, the client raises a request by filling out the request form on the portal. The application records the location after the request is submitted successfully. Similarly, when the service provider accepts the request, their location is recorded in the database. The Global Positioning System (GPS) then calculates the distance between the two parties.
Drawbacks: GPS alone is not accurate enough for locating users; an API could instead be used to fetch the real-time location. In addition, both parties update their locations only once and are not required to refresh them later.

3.2 An Online System for Household Services [10]

Survey: The developed portal provides the best and most affordable services by searching across many service providers.


The portal offers many categories of services, and the system tracks each activity of a service provider. The portal also has a review system for each category of service to improve service quality.
Workflow: An efficient portal for searching the listed service providers was developed. The client can easily obtain details about the services and register for a service, and the activities performed by the service provider are verified by the client in real time.
Drawbacks: The functionality framework requires many updates. New users who are unfamiliar with smart devices found the portal difficult to access. In addition, verification should be mandatory when registering service providers so that clients receive the best service.

3.3 E-commerce and Its Impact on Global Trade and Market [11]

Survey: This work focused on the benefits of e-commerce for home services, supported by case studies. It examined how e-commerce is revolutionizing global trade and local markets, using the European Commission's 1997 definition of e-commerce as an example, and illustrated the expansion of e-commerce in daily life through the Internet.
Workflow: Starting from the B2B, B2C, C2B, and C2C business models, the paper discussed how the Internet enables e-commerce on an online platform, including the exchange of purchase and sales data. Financial transactions between customers and banks became faster, and relevant information could easily be updated using the cloud.
Drawbacks: E-commerce depends on the Internet and needs cloud support, which increases the overall cost of the business model. Limited communication between the two parties was observed. The lack of smart devices among poorer sections of society and poor Internet facilities in some regions must also be considered.

3.4 Examining the Impact of Security, Privacy, and Trust on the TAM and TTF Models for E-commerce Consumers [12]

Survey: This work mainly focused on the problems clients face with service providers, for example feedback issues raised by service providers and services that are not kept up to date.
Workflow: Service providers are listed according to the entries in the database, so clients must search at random to find the best provider near their home. An authentication module was used to verify genuine users, and a secure payment gateway supported cashless online payments.


Drawbacks: All service providers are registered in the portal without curation, so clients find it difficult to search for the best providers among the large number of database entries. The overall charges for services are not fixed, rates are not standardized, and several required categories are missing. Verification is not kept up to date, and a cost estimator is needed to check the cost of the various categories of services.

3.5 A Research Study on Customer Expectation and Satisfaction Level of Urban Clap in Beauty Services with Special Reference to Pune [13]

Survey: The research focused on the quality of services from the client's point of view, targeting the huge demand created by the growth of e-commerce. It examined how satisfied customers are with the services and the competition among many service providers within a single service category.
Workflow: Clients can simply search for services in a large listing, get detailed information about the selected service providers, and book a service in a short time by providing valid details.
Drawbacks: The relationship between clients (customers) and service providers is not strong. The application is not easy for new users to understand. Proper feedback is not maintained, and day-to-day service quality is not up to the mark.

3.6 Timesaverz—First of Its Kind On-Demand Home Service Provider, India [14]

Survey: The business model targets a specific category of customers.
Workflow: The client–service model represents the relationship between the two parties and the integration of service quality.
Drawbacks: The company faced difficulties in listing service providers, since each provider must be verified over a large region. It also faced problems related to business operations such as cloud-based applications, expansion to new cities, and marketing. The feedback module needs further improvement, and verification is required before either party fills in feedback. The service has not expanded to many cities and towns, marketing across large areas is lacking, and services are not available in village areas.


4 Proposed System and Architecture Diagram

The proposed system and architecture diagram form the fundamental structure of the software, built from several modules: the registration module, login module, feedback module, and admin module. Each of these modules plays an important role, as explained below.

4.1 Registration Module

The system contains a registration module for customers and service providers. When a service provider or customer wants to offer or avail services, respectively, they must go through the registration process, which includes email verification. This module records the details of all individuals, and those records can be reused later for reference. Since registration is done via email verification, the panel is more authentic and provides a good experience throughout.

4.2 Login Module

A login is a set of credentials used for authentication, most commonly a username and a password, although it may use other information such as a PIN, passphrase, or passcode. This module ensures that only authorized users can gain access. The login module supports both customers and service providers; to use it, a user must first complete the registration process.

4.3 Feedback Module

Customer feedback plays an important role in the growth of any platform. Therefore, a feedback module has been added so that customers can rate the service provider for the service they received. The module creates surveys to collect feedback and, unlike plain survey tools, also lets users describe their concerns in their own words. The feedback module helps collect precise feedback, covering the pros and cons of both the application and the service, with the aim of improving performance.


4.4 Admin Module

The admin (administration) module holds the details of all users, including service providers and customers. The admin has the privilege to add, modify, create, or delete a particular service or customer record in the database as needed. When adding a new user, the admin fills in fields such as username, category, place (location), email ID, and phone number, and uploads an image. These fields help maintain records of all customers in the database for easy future reference.
The admin module allows the administrator to set up all back-end processes and perform system configuration. It also acts as user management, allowing definable access levels for single or multiple branches. The admin can configure overall security settings, including session timeout, required password strength, lockout of inactive accounts, and the password reset period.
From the admin module, all accessible features can be handled directly, including those related to clients. The admin can sort and search the listings, track every action, list all required home services, and manage them easily by adding or removing services, which helps deliver a better customer experience. The admin can also add new service providers manually, verify the profiles of all service providers, and prevent fraudulent activity during registration. Figure 1 shows the proposed system.

Fig. 1 Proposed system


Fig. 2 System architecture diagram

5 Project Scope

Clients can compare various service providers for the same service. The Web site is designed to perform several tasks, and users and service providers can access it at the same time. Service providers can check reviews to assess the quality and methods of the services provided, which makes the system reliable. The system architecture is shown in Fig. 2.

6 Prototype Model

The dataset is generated by the users according to the types of services, and the primary dataset contains six features used to predict the cost. Cost prediction is performed with a linear regression model, which is economical in terms of the resources needed for the analysis and, with appropriate feature values, can be more efficient than several well-known algorithms.

6.1 Feature Value (Target Value) and Linear Regression

The dataset describes each type of service by its service category, years of service, ratings, and the price of the service. These values are used to fit the linear regression model; note that the prices in the dataset do not represent current market prices.


The model first relates the data, year by year, to the different price segments through a simple linear equation. The feature columns considered for the currently available dataset are:
• Year
• Reviews
• Category.
The target value is "Price." Before implementing the algorithm, exploratory data analysis is performed: data cleaning, checking for missing values with an imputer (especially NaN values), and deleting or ignoring special symbols. In short, the data must be prepared before further processing, as sketched below.
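The following is a minimal sketch of this cleaning step, assuming a hypothetical services.csv file with Year, Reviews, Category, and Price columns (the file name and column names are illustrative, not taken from the paper):

import pandas as pd
from sklearn.impute import SimpleImputer

# Load the (hypothetical) services dataset
df = pd.read_csv("services.csv")

# Strip special symbols such as currency signs from the price column
df["Price"] = (
    df["Price"].astype(str).str.replace(r"[^0-9.]", "", regex=True).astype(float)
)

# Impute missing numeric values (NaN) with the column mean
num_cols = ["Year", "Reviews", "Price"]
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])

# Drop rows whose category is still missing and save the cleaned data
df = df.dropna(subset=["Category"])
df.to_csv("services_clean.csv", index=False)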

6.2 Cost Prediction

A CSV file stores the data in tabular form, and pandas is used to import the dataset. A feature selector keeps the number of input variables to a minimum. Two sets are created, one for training and one for testing the model. OneHotEncoder converts categorical data into numerical data for the feature selector, and a column transformer applies the appropriate handling to the categorical and numerical columns. Pipelining chains these steps together to obtain the feature values, and the pickle library is used to dump the trained model. The steps of the cost prediction model are shown in Fig. 3 and listed below.

Fig. 3 Cost prediction model


• CSV File: The primary dataset mainly contains the service provider name, service category, years of service, ratings, and price of the service. It is auto-generated from user entries.
• Pandas Data Frame Object: When dealing with multiple rows and columns in a large dataset, pandas data frame objects make analysis more accessible; pandas indexes the records so they can be handled as a single collection.
• Feature Selector: Year, reviews, and category are chosen as the main features to achieve the best possible prediction of the target value, i.e., cost. Processing string values takes more time than processing numerical values.
• Train and Test: Relevant samples are taken to recognize patterns under specific criteria; the split also helps estimate how well the designed model will predict unseen outcomes.
• OneHotEncoder Function: Raw categorical data are hard to interpret directly, so OneHotEncoder turns categorical variables into binary vectors, with each integer value represented by one position of the vector.
• Column Transformer: Separate transformers are applied to the numerical and categorical columns; here it is used mainly to prepare the feature values for predicting the target value.
• Pipelining with Linear Regression Model: Pipelining combines the various processing steps into one object. A linear regression algorithm is used inside the pipeline to model the relationship between the feature values and the target value.
• Fitting the Machine Learning Model: Fitting adjusts the parameters so that the model generalizes from the training data and reduces the prediction error; ideally, the fitted model makes accurate predictions with minimal error. Here the target value of interest is the cost as a function of year, review, and category.
• Dumping the Pipeline using Pickle: The pickle library serializes Python objects to a file through its dump function; it handles small array objects efficiently compared with the joblib library.
• Analysis using the Model: After these steps, the machine learning model is ready for analysis: given an array of the required parameters, it predicts the cost (a code sketch of this pipeline is given after the list).
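A minimal sketch of this training pipeline, assuming the cleaned data from the earlier step with Year, Reviews, and Category as features and Price as target (file and column names are illustrative; the 80/20 split ratio is an assumption):

import pickle
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("services_clean.csv")
X = df[["Year", "Reviews", "Category"]]
y = df["Price"]

# Hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One-hot encode the categorical column, pass numeric columns through unchanged
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["Category"])],
    remainder="passthrough",
)

# Pipeline: preprocessing followed by linear regression
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])
model.fit(X_train, y_train)
print("R^2 on test set:", model.score(X_test, y_test))

# Dump the trained pipeline with pickle for later use in the web application
with open("cost_model.pkl", "wb") as f:
    pickle.dump(model, f)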

6.3 Three-Step Verification

Transaction (request for a service): Every client wants the best service among the different categories listed in the database. After the cost has been predicted, the client can proceed to book a service through the portal by entering a few basic details. Two auto-generated emails are then created at the back end.


Fig. 4 Flowchart for transaction—request for a service

The first auto-generated email, sent to the client, contains a token and the detailed information of the service provider; this token is the crucial element of the three-step verification shown in Fig. 4. The second auto-generated email asks the service provider to accept the client's request: by clicking the link in the email, the provider confirms the service and its status becomes active. This status component ensures that both parties later fill out the feedback form using their tokens.
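A minimal sketch of this booking step, assuming a hypothetical send_email helper and an in-memory booking store (neither is specified in the paper; a real deployment would use a mail service and a database):

import secrets

bookings = {}  # token -> booking record (stand-in for the database)

def send_email(to, subject, body):
    # Hypothetical helper; a real system would use SMTP or a mail API.
    print(f"To: {to}\nSubject: {subject}\n{body}\n")

def book_service(client_email, provider_email, provider_details):
    """Create a pending booking and send the two auto-generated emails."""
    token = secrets.token_hex(8)          # verification token shared with both parties
    bookings[token] = {"client": client_email, "provider": provider_email,
                       "status": "pending"}
    send_email(client_email, "Booking token",
               f"Your token is {token}. Provider details: {provider_details}")
    send_email(provider_email, "New service request",
               f"Accept the request by confirming token {token}.")
    return token

def accept_request(token):
    """Called when the provider clicks the confirmation link: activate the booking."""
    bookings[token]["status"] = "active"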

6.4 Feedback

The purpose of this feedback module is to keep the database populated with genuine clients and professional service providers, who fill in the feedback after the service. The token from the transaction module plays an essential role in the verification process, as shown in Fig. 5. After the user is verified, two feedback forms are available:
• Services: Genuine clients may enter honest feedback on the service provided by the professional service provider, after verification with the token created when registering for the service. The new ratings are combined with the previous ratings to update the service rating, and all entries are recorded in the database at the back end for future reference and analysis.


Fig. 5 Flowchart for feedback

• Client: Here, the portal gives access to the service provider. After verification with the token provided by the client, the client's rating is calculated at the back end and merged with the existing records. If a client receives poor ratings from service providers, that client is removed automatically from the database. All entries are again recorded at the back end for future reference and analysis. A small sketch of this rating update is given below.
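The following is a minimal sketch of this rating update and automatic removal, assuming a simple in-memory user table and the removal threshold of 2 mentioned later in the paper (the data layout is illustrative):

users = {}  # user_id -> {"rating": float, "count": int}

REMOVAL_THRESHOLD = 2.0  # accounts rated below this are removed, per the paper

def add_rating(user_id, new_rating):
    """Merge a new rating into the running average and drop low-rated users."""
    record = users.setdefault(user_id, {"rating": 0.0, "count": 0})
    total = record["rating"] * record["count"] + new_rating
    record["count"] += 1
    record["rating"] = total / record["count"]
    if record["rating"] < REMOVAL_THRESHOLD:
        del users[user_id]          # automatic removal from the database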

7 Users' Classes and Features

• Planned Approach Toward the Working: Work in the organization proceeds in an organized way; data are stored properly so that they can be retrieved on demand.
• Accuracy: The system architecture is designed for high accuracy; all operations must be carried out accurately to obtain high efficiency, and with linear regression the system works efficiently on large datasets.


• Reliability: Reliability is increased primarily by storing data properly.
• No Redundancy: It is ensured that no duplicate information is stored in the database, which is verified through periodic checks; this keeps the information consistent and uses storage space efficiently.
• Immediate Retrieval of Information: The proposed system is designed to return information as quickly as possible by reducing time complexity.
• Immediate Storage of Information: Data are stored immediately so that correct information can later be retrieved; handling a large database manually would cause many problems.
• Easy to Operate: The system can be operated from any device with a Web browser and can be developed in a short time.
• User Friendly: Clear structure, navigation, and page names, with a responsive and compatible design.
• Quick Registration and Profile Approval: The registration process is simple and fast for both service providers and clients; once the form is submitted, approval is granted immediately.
• Real-Time Request: When a client asks for a service, the service provider receives the inquiry notification email in real time.
• Feedback/Review Feature: Feedback works in both directions; customers as well as service providers can give feedback.
• Creation of Service List and Management: The admin lists the services and can easily manage them by adding or removing entries in the database.
• Manage and Verify Service Providers and Customers: The admin has full rights to manage service providers and customers and verifies all profiles to avoid fraud.
• Manage Reviews: If a user's feedback/review score falls below 2, the account is deleted directly from the database.

8 Applications

8.1 Home and Cleaning Service Industry

Demand for home and cleaning services is increasing around the world, and handling it efficiently has become difficult. Instead of moving from place to place to find services, consumers can now find them at their fingertips. The system lets providers offer expert services to customers and manage all their work from a powerful dashboard.


8.2 Health Care

Health care is one of the largest industries and one that changes and updates continuously. People can benefit from technological innovation and the rapid discovery of new treatments, but bringing these services to the people who need them remains a challenge.

8.3 Repair and Maintenance Service Industry

Earlier, it was difficult for people to find an electrician, a plumber, a key maker, or other repair and support workers. With this portal, finding one has become easy: a tap on the mobile phone connects users to trusted professional workers.

8.4 Home Renovation/Shifting

People looking for labor support to move their belongings from an old home to a new house can use this portal. Service providers here include technicians, pest control, kitchen designers, and maintenance experts.

8.5 Home Construction and Design

Consumers can hire skilled workers or technicians for installing CCTV cameras, interior designing, or modular kitchens.

8.6 Businesses

Professionals such as Web designers, chartered accountants, lawyers, and others can also offer their services.

9 Future Scope

This system accommodates the changing needs of the client; the proposed system can be designed so that its capacity expands in response to further requirements, even providing services overseas.


The Web site can be extended simply by adding the required services, additional payment systems, and further levels of security verification. For example, the existing system covers plumbers, electricians, home cleaners, painters, and so on, and it can be extended as per user requirements to laundry, computer and mobile repair, catering services, and more. A secure payment system can be added through a separate payment gateway for the Web application, with online payment options such as UPI and credit/debit cards, and Internet banking could support international transactions through different banks. To resolve user issues, video tutorials and 24 × 7 chat or call support can be implemented.

10 Conclusion

To reduce the burden of finding home services from the best providers, the proposed system offers multiple service options and brings service providers to the doorstep. In a systematic online environment, consumers can access home services easily and simply; with skilled workers, all required services are available in a click, anytime and from anywhere. This online home-based application is an innovative idea in the current market. Although services such as Urban Clap, Zimmber, and Timesaverz already dominate the on-demand home service market, there are still many opportunities to explore. Proper research on the current market, the introduction of vital features, and proper implementation of expert suggestions can make this portal successful. Demand for domestic home services will remain high, as everyone expects more convenience in daily life, and harnessing this demand in the right way is what the current scenario requires.

References

1. Adlakha N (2021) Everything now comes home: on-demand service apps and their teething troubles. The Hindu. https://www.thehindu.com/real-estate/on-demand-service-apps-teethingtroubles-2021-urban-company-housejoy-construction-renovation/article35760132.ece
2. Yin C (2015) An empirical study on users' online payment behaviour of tourism website. In: IEEE 12th international conference on e-business engineering
3. Bhuvaneswari T, Keerthana KP (2016) Image segmentation based on dilation and erosion to reduce background noise. Int J Mod Trends Eng Sci 3:245–250
4. Keerthana KP, Kavitha K (2012) Comparative analysis of fault coverage methods. Bonfring Int J Power Syst Integr Circuits, Special Issue Commun Technol Interv Rural Soc Dev 2:110–113
5. Yrnn-ping CA, Yuying W (2010) Simple said about online payment risks and preventive measure. In: China located international conference on information systems for crisis response and management. IEEE


6. Kovachev D, Klammadriano R (2011) Beyond the client server architectures: a survey of mobile cloud techniques. In: Workshop on mobile computing in 2011
7. Mantoro T, Milišic A, Ayu MA (2010) Online payment procedure involving mobile phone network infrastructure and devices. IEEE
8. Pooventhan K, Arun Mozhi Devan P, Mukesh Kumar C, Midhun Kumar R (2019) IoT based water usage monitoring system using LabVIEW. In: Smart technologies and innovation for a sustainable future. Springer, Cham, pp 205–212
9. Bandekar S, Avril D (2016) Domestic android application for home services. Int J Comput Appl
10. Indravasan NM, Adarsh G, Shruthi C, Shanthi K (2018) An online system for household services. Int J Eng Res Technol
11. Shahriari S, Mohammadreza S, Saeid G (2015) E-commerce and its impact on global trade and market. Int J Res Granthaalayah
12. Basak SK, Govender I (2009) Examining the impact of security, privacy and trust on the TAM and TTF models for e-commerce consumers: a pilot study. IEEE
13. Pathak R, Salunkhe P (2018) A research study on customer expectation and satisfaction level of Urban Clap in beauty services with special reference to Pune. Int J Manage Technol Eng 412–421. ISSN: 2249-7455
14. Sangwan S (2017) Timesaverz—first of its kind on-demand home service provider in India. Businessworld. http://www.businessworld.in/article/Timesaverz-First-Of-Its-Kind-OnDemand-Home-Service-Provider-In-India/07-03-2017-113954/

Chapter 7

Convolutional Neural Network Based Intrusion Detection System and Predicting the DDoS Attack

R. Rinish Reddy, Sadhwika Rachamalla, Mohamed Sirajudeen Yoosuf, and G. R. Anil

1 Introduction

The Internet has shrunk the world in many ways, but it has also brought both positive and negative consequences. The world of cyberattacks has evolved at the same rate as security, so a strong understanding of cybersecurity principles is necessary to defend against threats in cyberspace [1]. Cybersecurity is the technique of protecting computers, networks, servers, and data from malicious assaults [2]. The majority of media activities are transmitted online, most financial transactions are completed online, and a considerable amount of people's time is invested in socializing in this space. Online fraud is among the largest cyberthreats and is spreading rapidly across the African continent, where about 38% of citizens have access to the Internet. The botnet-driven DDoS attack is the fifth most ubiquitous cyberattack in the report, with around 50,000 DDoS attack victims and a monthly average of 3900. Amazon Web Services was targeted by a massive DDoS attack regarded as one of the most severe of 2020; in this attack, CLDAP reflection was used against an anonymous client, and insecure third-party CLDAP servers amplified the quantity of data delivered to the victim's IP address by 56–70 times. Since 2020, DDoS attacks have increased by over 151%, with 91% of the attacks lasting about four hours.


To identify and prevent these DDoS attacks, machine learning and artificial intelligence are widely used. The most common attack on the Internet is the distributed denial of service (DDoS) attack: a harmful attempt to obstruct the normal traffic of a chosen server or network by drowning the target and its surrounding infrastructure in a flood of traffic. DDoS attacks succeed by using several computer systems as the origin of the attack traffic [3]. In a DDoS attack, the attacker controls compromised nodes, called bots [4], and then launches a wide-ranging attack on the targeted server using this swarm of bots, known as a botnet. Two ways of detecting DDoS assaults are in-line packet inspection and out-of-band detection using traffic flow record analysis, deployed either on-premises or as cloud-based services.
Machine learning is a rapidly expanding field of computational algorithms that aims to replicate human intelligence by learning from the environment [5, 6]. Machine learning approaches have advanced pattern recognition, spacecraft engineering, finance, computer vision, and entertainment, as well as biological and medical applications [7]. Because signature-based methods are inefficient at identifying zero-day attacks or even modest variants of known assaults, researchers are adopting machine learning-based detection in many security measures [8]. A machine learning-based DDoS attack detection approach has two steps: feature extraction and model detection. In the feature extraction stage, the network characteristics that contribute most to identifying DDoS attacks are extracted by comparing data packets categorized according to rules [9]. In the model detection stage, the extracted features are used as machine learning inputs to train the attack detection model [10].
A network is established for a computer system, and data are transferred through it as packets, which are discrete units of data. Sniffers are used to monitor and analyze the network and its packet transactions. One of the most widely used sniffers is Wireshark [4, 11, 12], a packet capture and analysis tool that captures packets from the desired network connection, whether from the targeted computer to the home office or to the Internet [13]. It is most often used to trace connections and inspect suspicious network transactions. Users can view a packet sequence graphically in Wireshark by selecting the suspected packets and opening the flow graph from the statistics menu.
For harmful network intrusions, IDS approaches that use convolutional neural networks (CNNs) significantly improve classification accuracy compared with other machine learning methods [10]. The fundamental advantage of CNNs over their predecessors is that they discover essential features without human intervention. In image recognition problems, the CNN algorithm achieves a high level of accuracy, automatically detects essential characteristics, and shares weights, which also makes it computationally efficient. The ability of CNNs to construct an internal representation of a two-dimensional image is one of their advantages. CNNs are inspired by human biology [1]: each neuron has its own receptive field and is coupled to other neurons so that together they span the full visual field, similar to neurons in the human brain [14, 15].


Owing to their ability to abstractly represent low-level intrusion traffic data as high-level features and their strong feature learning capability as a semi-supervised neural network, CNNs have become increasingly prominent in network intrusion detection in recent years [16, 17]. A CNN enables the model to learn position- and scale-invariant structures in the data, which is crucial when working with images [1]. By employing link information in the network, ranking items in a network can mean categorizing them according to importance, popularity, impact, authority, relevance, similarity, and proximity.

2 Related Works

Cheng et al. [13] suggested a feature weight calculation approach that relies on principal component analysis to assess the relative relevance of different features, fusing them into a multi-element fusion feature (MEFF) value. Using the CAIDA DDoS Attack 2007 dataset, they achieved an accuracy of 92% on 2000 samples.
Sambangi and Gondi [4] analyzed the problem of DDoS attack detection on the CICIDS 2017 benchmark dataset, choosing a Friday afternoon log file. They used multiple linear regression to predict DDoS and bot attacks; the model's accuracy was 73.79% with 16 attributes, which was the highest.
Marcin et al. [11] offered a measurement-based detection method that uses network traffic and machine learning techniques to infer whether packet sniffer software is operating on the target machine. The proposed detection method achieved around 99% accuracy.
Shin et al. [16] used three major CNN techniques: reusing already trained CNN features, training the CNN from scratch, and performing unsupervised CNN pre-training followed by supervised fine-tuning on datasets taken from ImageNet. They studied thoraco-abdominal lymph node (LN) and interstitial lung disease (ILD) data.
Li and Liu [2] subjected the suggested approach to a class of stealthy assaults, together with a thorough security study. The results reveal that a variety of sensors may be uniquely identified with 98% accuracy, and the scheme detects a number of attack scenarios from the literature with excellent precision.
Bi et al. [10] gave a detailed explanation of machine learning from scratch, briefly described five standard machine learning algorithms and four ensemble-based algorithms, and concluded with epidemiological applications.
Jia et al. [14] constructed a large dataset using the DDoS simulators BoNeSi and SlowHTTPTest and evaluated the model's efficiency and accuracy in identifying and classifying attacks on the CICDDoS2019 dataset. The suggested convolutional neural network reaches a classification accuracy of up to 99.9%.


Kim et al. [18] developed a deep learning-based intrusion detection model focusing on DoS attacks, applying a CNN to the CSE-CIC-IDS2018 and KDD CUP 1999 datasets and comparing it with an RNN model. The CNN detection model obtained 99% accuracy, almost the same as the RNN model.
Zhou et al. [9] reviewed design ideas and different node embedding methodologies for representation learning over homogeneous networks. They presented a unified reference framework that divides the process into three steps, node feature extraction, preprocessing, and node embedding model training, together with design and development suggestions for the next generation of network representation algorithms.
Farda et al. [7] chose the principal component analysis (PCA) network for the deep learning design because of its superior performance. PCA filters, binary hashing, and a block-wise histogram were used to analyze CT calcaneal images. To evaluate the proposed system's performance with and without data augmentation, two training methodologies and five data sample sizes were explored; the proposed deep CNN model attained an accuracy of 72%.
Ferrag et al. [5] thoroughly examined deep learning algorithms for intrusion detection and evaluated the efficiency of numerous strategies using the most essential performance metrics, such as accuracy, detection rate, and false-alarm rate.
Liu et al. [8] brought forward a multi-classification network intrusion detection system based on a CNN. The data are first preprocessed, the initial one-dimensional data are converted to two-dimensional data, and the features are trained using optimized convolutional neural networks. The multi-classification experiments in this research were conducted on the KDD CUP 99 dataset.
Aljuhani [17] reviewed recent research on DDoS detection algorithms that use single and hybrid machine learning techniques in current networking environments, examining DDoS mitigation solutions that exploit virtualized environments such as cloud computing and software-defined networking.
Hadeel et al. [3] contrasted a classic method for binarizing continuous swarm intelligence algorithms with a new method for binarizing a continuous pigeon-inspired optimizer. Three prominent datasets were used to test the proposed algorithm: KDD CUP99, NSL-KDD, and UNSW-NB15. In terms of TPR, FPR, accuracy, and F-score, the suggested approach beats numerous feature selection techniques from state-of-the-art related research.

3 System Model

The proposed intrusion detection system (IDS) uses a convolutional neural network to monitor data packets and their behavior and aims to predict a DDoS attack at an early stage.


To train the model, both live packets and the KDD CUP 1999 dataset are used. The Wireshark packet sniffer tool is used to capture the live packets. It offers advanced capabilities such as live capture and offline analysis, a three-pane packet explorer, and coloring rules for analysis. It monitors the messages transmitted and received by the computer's applications and protocols, but it never sends packets itself; the contents of the various protocol fields in the acquired packets are stored and displayed.
Generally, the required training data amount to about 4 GB of compressed TCP dump data, and the testing set contains around 2 million connection records. KDD CUP99 is used here in addition to Wireshark because Wireshark cannot sustain such a large amount of data: it can hold only a limited portion of the network stream, whereas a machine learning model must be trained on a large dataset to obtain accurate results. The KDD CUP 1999 dataset, developed by the Defense Advanced Research Projects Agency (DARPA), is therefore downloaded and used to train the IDS model. It is the most widely used dataset for IDS testing. KDD comprises a very large amount of data split into a training set and a testing set. The training set contains around 24 attacks grouped into 4 main categories, DoS, U2R, R2L, and Probing, and has 41 features. This paper focuses on the DoS (DDoS) assault.
All the required information is first exported to an Excel sheet, with all necessary parameters and values, to begin preprocessing. The data then have to be converted to a grayscale image so that the CNN algorithm can take it as input. Preprocessing the sample data is the initial step: it takes a binary version of an executable or decompiled file from the Excel sheet and converts it to a grayscale image. A grayscale image uses only shades of gray, so every pixel carries less data than in other types of color image. Every pixel has a numerical value, the pixel value, that denotes its intensity; for a grayscale image, pixel values range from 0 to 255, where 0 represents black, 255 represents white, and intermediate values represent shades of gray (values close to 0 are dark, values close to 255 are light).
Image conversion can also be achieved by thresholding, a technique that creates a binary picture by applying a threshold to the original image's pixel intensities. The threshold is set to 255/2 = 127.5 and applied to every pixel: if a pixel value is below the threshold, it is taken as 0 (black), and if it is above the threshold, it is taken as 1 (white). Through this process, a binary grayscale image with values of 0 and 1 is formed.


3.1 Image Conversion Algorithm from KDD Dataset to Gray Scale

Input: Data packet information
Output: Grayscale image

1. Let the pixel values taken from the dataset be p1, p2, …, pN.
2. Pixel values range from 0 to 255, where 0 is the minimum and 255 is the maximum value.
3. Set the threshold as (max value/2) = 127.5.
4. Apply the threshold to each pixel and read the values.
5. If (value < 127.5), take it as 0.
6. If (value > 127.5), take it as 1.
7. Set 0-valued pixels to black and 1-valued pixels to white.
A minimal code sketch of this conversion is given below.
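The sketch below uses NumPy and assumes the packet records have already been flattened into an array of byte values; the image side length and the example record are illustrative, not taken from the paper:

import numpy as np

def to_binary_image(pixel_values, side=32):
    """Threshold 0-255 byte values at 127.5 and reshape them into a square image."""
    pixels = np.asarray(pixel_values, dtype=np.float32)
    binary = (pixels > 127.5).astype(np.uint8)   # 1 -> white, 0 -> black
    # Pad with black pixels so the data fits an N x N image
    size = side * side
    padded = np.zeros(size, dtype=np.uint8)
    padded[:min(size, binary.size)] = binary[:size]
    return padded.reshape(side, side)

# Example: convert one record of 41 KDD features scaled to 0-255
record = np.random.randint(0, 256, size=41)
image = to_binary_image(record)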

Unlike multilayer perceptrons, where each neuron has its own weight vector, CNN neurons share weights. This weight sharing reduces the total number of trainable weights and introduces sparsity. The grayscale image is then preprocessed to satisfy the CNN's input constraints: for an image classification task, the CNN expects input images of the same size, and in general the image data should have the same length and width (a 1:1 aspect ratio). Because executable files come in a variety of sizes, the grayscale images vary significantly in size, so all grayscale images must be normalized. A re-sampling mechanism based on bilinear interpolation is used for normalization: the four input cell centers nearest to the center of each output cell are weighted by distance and averaged. The more information the CNN receives as input, the better the detection result, but also the more complicated the network structure and the longer the training time. As a result, a normalized grayscale image is formed.
The dataset is a collection of digital images used to train, test, and evaluate the algorithms and their performance; the algorithm learns from the dataset's samples. It is important to perform feature selection, the process of limiting the number of input variables by picking the elements that contribute most to the prediction target, whether manually or automatically, to increase the accuracy of the model and to avoid training on irrelevant features that do not contribute to the model. To apply feature selection to this network, the attributes to be removed are identified as follows: one attribute is eliminated from the trained network by setting its input weight to zero, and the accuracy rate is recomputed. The resulting networks' accuracy rates are then ranked. As long as the network suffers an accuracy reduction of no more than R percent with one more attribute eliminated, the attribute is deleted and the accuracy recalculated; otherwise, the algorithm stops.


3.2 Feature Selection Algorithm

Input: Grayscale image
Output: Identified important features

1. Consider A = {A1, A2, …, An} as the set of input attributes to the CNN, and consider R as the maximum allowed drop in the accuracy rate on the test dataset.
2. Train network N with A as input so as to reduce the loss value and make the accuracy rate on the training set acceptable.
3. For all k = 1, 2, …, n, build network Nk with the weight from input Ak set to zero and the weights from the other inputs equal to the weights of network N.
4. Calculate the accuracy rates of the training set (Rk) and test set (R′k), respectively.
5. Rank the networks Nk in order of their training-set accuracy rates.
6. Starting from k = 1, calculate the change in test-set accuracy, r, for each Nk. If r ≤ R, remove Ak from the input set A and set n = n − 1. If k < n, set k = k + 1 and repeat; otherwise, terminate the algorithm.

Of the dataset, 80% is used for training and the remaining 20% for testing. When data are sent for training, they must be prepared, cleaned, and labeled so that the model learns what is required; entries that are incorrect, lacking information, ambiguous, or misleading are eliminated. Machine learning is ineffective without high-quality data; patterns are found by analyzing the data, and the trained model can then accept fresh data and make predictions for the system. After the machine learning software has been trained on the training dataset, the unused data, referred to as the test set, are used to test it [19]. The test data are evaluated against a threshold setting, a percentage that determines how many data values must meet the threshold before the system can draw a conclusion. A sketch of the selection loop in steps 1–6 is given below.
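The sketch assumes an sklearn-style classifier with a score method; zeroing a feature column is used here as the practical equivalent of zeroing that input's first-layer weights. Both the interface and this equivalence are assumptions for illustration, not details taken from the paper:

import numpy as np

def select_features(model, X_test, y_test, max_drop=0.01):
    """Drop features whose removal costs at most max_drop of test accuracy."""
    keep = list(range(X_test.shape[1]))
    base_acc = model.score(X_test, y_test)
    for k in list(keep):
        X_masked = X_test.copy()
        X_masked[:, k] = 0.0                     # emulate a zeroed input weight
        acc_without = model.score(X_masked, y_test)
        if base_acc - acc_without <= max_drop:   # r <= R: feature is expendable
            keep.remove(k)
            X_test = X_masked                    # keep the feature switched off
            base_acc = acc_without
    return keep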

3.3 CNN Algorithm

The weights, biases, and filters are initialized randomly before forward propagation; the CNN algorithm treats these values as parameters. The collected data are fed forward through the network: every hidden layer receives data, processes them with the activation function, and passes the result to the next layer.

1. Save the input images in a variable k.
2. Create a filter matrix f and convolve the images with the filter: Z1 = k ∗ f.
3. Apply the sigmoid activation function to the result: A = sigmoid(Z1).


4. Create a bias vector and weight matrix and apply a linear transform: Z2 = (WT ∗ A) + b.
5. For the final output, apply the sigmoid function: O = sigmoid(Z2).

A CNN is made up of an input layer, convolutional layers, pooling layers, a fully connected layer, and an output layer. The stages of the classification process, when data are fed into the CNN, are depicted in Fig. 1 and explained as follows:
(i) Dataset collection stage: The data are taken from the grayscale image.
(ii) Feature extraction stage: Feature extraction is carried out by the filters of the convolution layers. To train and evaluate the CNN model, every input image is passed through a set of convolution layers with filters (kernels), pooling layers, fully connected (FC) layers, and finally a softmax function, which assigns each class a probability between 0 and 1 [20]. This stage involves two kinds of layers, convolution (ConvNet) and pooling layers. Convolution, the initial layer, gathers information from the input image and preserves the relationship between pixels by learning visual features over small squares of input data; it is a mathematical operation with two inputs, an image matrix and a filter (kernel). The kernel is the filter used to extract features from the image. For the non-linear operation, the ReLU activation is applied to introduce nonlinearity into the ConvNet: all negative numbers are set to zero, so the output is f(x) = max(0, x). When the feature maps are too large, the pooling layer that follows reduces the number of parameters.
(iii) Classification stage: In the fully connected layer (FCL), the feature matrix is flattened into a vector and fed in, as in a standard neural network. Finally, an activation function such as softmax classifies the outputs. With its end-to-end structure, the CNN accepts the raw image as input and delivers the classification result, which is compared against the trained model built from the training dataset. Since the model was built on a training dataset, it should be tested on a separate test dataset that was not used during training; in practice, the test dataset is independent of the training dataset.
The computer system used for the implementation has a Xeon processor with 16 GB of RAM and 1 TB of SSD storage. The CNN model was developed in the Python programming language using TensorFlow. The convolution layer extracts the image's distinctive features while preserving the spatial relationships of the input, and adding a pooling layer after the convolution layer reduces the size of the feature data. Equation (1) below gives the output length of a convolution layer:

L′ = (L − K + 2P)/S + 1    (1)


Fig. 1 CNN architecture

The length of the input image is denoted by L, and L′ is the resulting output length. K denotes the kernel size and P the amount of zero padding applied at both ends. Finally, S denotes the stride of the kernel on the convolution layer. Although multiple convolution layers may help learn images with complicated characteristics more effectively, the number of convolutional layers and the resulting performance are not always related: because performance depends on the characteristics of the input images, several designs must be tried to find the best one. The models are therefore built around hyperparameters such as the picture type (grayscale or RGB), the number of convolutional layers, the kernel size, and the weights used to create a hidden layer in the convolution layer [17].
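As a quick check of Eq. (1), a 32 × 32 input with K = 5, P = 0, and S = 1 gives L′ = (32 − 5 + 0)/1 + 1 = 28. The following is a minimal sketch of such a model in TensorFlow/Keras, assuming 32 × 32 binary grayscale inputs, the 5 × 5 kernels with three convolutional layers favored in the results, and binary (benign vs. DDoS) output; the layer widths are illustrative, not taken from the paper:

import tensorflow as tf
from tensorflow.keras import layers, models

def build_ids_cnn(input_shape=(32, 32, 1)):
    """Three 5x5 convolution blocks followed by a dense binary classifier."""
    model = models.Sequential([
        layers.Conv2D(32, (5, 5), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (5, 5), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (5, 5), activation="relu"),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),   # benign vs. DDoS
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_ids_cnn()
model.summary()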

4 Performance Evaluation

In the CNN model it is also possible to vary two additional parameters, namely the number of convolution layers and the kernel size, as described previously [19]. The number of convolutional layers matters: adding more layers helps extract more features, and more features allow a better evaluation of the model and a sharper distinction between classes. However, adding more layers than necessary is counterproductive, since it leads to overfitting and complicates the model; beyond a certain point, additional layers skew the results. These parameters are termed hyperparameters and are listed in the tables. Weight initialization is used to prevent layer activation outputs from exploding or vanishing during the forward pass of a deep neural network, and the weights are frequently adjusted to reach or maintain optimum accuracy; if all weights were set to 0, every neuron in every layer would perform the same calculation and produce the same output, rendering the deep network useless. The kernel size is commonly 3 × 3 or 4 × 4; 3 × 3 is chosen as a midpoint, and the experiment is conducted with sizes ranging from 2 × 2 to 5 × 5 to determine the best size.


The kernel creates a feature map by traversing the image with a designated step, the stride; to extract features densely, the stride is set to 1. The results of the experiments demonstrate that the majority of scenarios are 99% accurate. The CNN model is evaluated using four performance indicators: accuracy, precision, recall, and the F1 measure. The F1-score is used to assess the effectiveness of the proposed model [21]; the F-score combines recall and precision, and the F1-score is the F-score with the weighting parameter beta set to 1 [21]. Accuracy is the metric that gives the percentage of correct predictions. The accuracy and F1-score are calculated as follows:

Accuracy = (TP + TN)/(TP + TN + FP + FN)    (2)

F1-score = (2 × precision × recall)/(precision + recall)    (3)

where

precision = TP/(TP + FP)    (4)

recall = TP/(FN + TP)    (5)

The number of samples that are correctly identified as benign is known as true positive (TP). The number of samples in which harmless data are mistakenly detected as an attack is known as false negative (FN). The number of samples in which an attack is mistakenly identified as benign is referred to as false positive (FP). The number of samples correctly identified as an attack is called true negative (TN). Precision is measured as the number of relevant documents retrieved divided by the overall number of documents retrieved, and recall is the number of relevant documents retrieved divided by the total number of relevant documents in the system [21]. The detection performance of KDD with grayscale images is high, with GS-4, GS-8, GS-12, and GS-14 being the highest. When the kernel size is 3 × 3, 4 × 4, or 5 × 5, the performance of the case with three convolutional layers is the greatest, regardless of whether RGB or grayscale images are used. The case of three convolutional layers performs best when the kernel size is 5 × 5. The following is a more extensive study based on these hyperparameters.
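As a small, self-contained sketch of Eqs. (2)–(5), the snippet below computes accuracy, precision, recall, and F1 from the entries of a confusion matrix; the counts used in the example are made-up placeholders, not results from the paper.

```python
def metrics(tp, tn, fp, fn):
    # Eq. (2): fraction of all samples predicted correctly
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Eqs. (4) and (5)
    precision = tp / (tp + fp)
    recall = tp / (fn + tp)
    # Eq. (3): harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for illustration only
print(metrics(tp=900, tn=850, fp=60, fn=40))
```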

Table 1 KDD binary classification accuracy as a function of the number of convolution layers

Num. of Conv. layer   GS scenarios   Kernel size   Accuracy
1                     GS-1           2 × 2         0.93145
2                     GS-2           2 × 2         0.93223
3                     GS-3           2 × 2         0.93321
4                     GS-4           2 × 2         0.93455
1                     GS-5           3 × 3         0.93567
2                     GS-6           3 × 3         0.93888
3                     GS-7           3 × 3         0.94002
4                     GS-8           3 × 3         0.93943
1                     GS-9           4 × 4         0.94102
2                     GS-10          4 × 4         0.94286
3                     GS-11          4 × 4         0.94355
4                     GS-12          4 × 4         0.94476
1                     GS-13          5 × 5         0.94568
2                     GS-14          5 × 5         0.95879
3                     GS-15          5 × 5         0.95840
4                     GS-16          5 × 5         0.95012

4.1 Number of Convolution Layers The outcomes of employing KDD are shown in Table 1. When the kernel is 2 × 2, accuracy increases as the number of convolutional layers grows (1L, 2L, 3L, 4L); the more layers there are, the greater the performance in terms of precisely extracting characteristics. Grayscale performance is best when three convolutional layers are used, as shown by the 5 × 5 kernel. Among the grayscale scenarios, GS-14 with kernel size 5 × 5 has the best accuracy. The set of hyperparameters with the greatest result is the kernel size 5 × 5 with three convolutional layers.

4.2 Kernel Size The performance of the kernel sizes 2 × 2, 3 × 3, 4 × 4, and 5 × 5 for KDD is compared. For the GS scenarios with two or three convolutional layers, there is no regular pattern (positive or negative) in accuracy with respect to kernel size. When the kernels are 2 × 2, 3 × 3, and 4 × 4, the grayscale accuracy of scenarios with four convolutional layers is significantly higher than that of scenarios with fewer convolutional layers. When the kernel size is 5 × 5, the accuracy of two and three convolutional layers is good. The larger the kernel, the better the performance. Similar to KDD, there is no discernible pattern in the 4 × 4 kernel size scenarios depicted in Fig. 2.
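The hyperparameter search over layer count and kernel size described in Sects. 4.1 and 4.2 could be organized as in the Keras loop below; this is an illustrative sketch, not the authors' code, and the input shape, filter counts, and dense-layer size are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(num_conv_layers, kernel_size, input_shape=(48, 48, 1)):
    """Stack `num_conv_layers` Conv2D+MaxPooling blocks with the given kernel size."""
    model = models.Sequential()
    model.add(layers.InputLayer(input_shape=input_shape))
    for i in range(num_conv_layers):
        model.add(layers.Conv2D(32 * (i + 1), kernel_size, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(2))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))   # binary: attack vs. benign
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Grid over the hyperparameters compared in Table 1
for n_layers in (1, 2, 3, 4):
    for k in (2, 3, 4, 5):
        model = build_cnn(n_layers, (k, k))
        # model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10)
```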

Fig. 2 Accuracy by the kernel size in binary classification for KDD (accuracy plotted against 1–4 convolution layers for kernel sizes 2 × 2, 3 × 3, 4 × 4, and 5 × 5)

Table 2 Performance of the proposed DDoS detection

Paper                   Accuracy (%)
Kim et al. (2020)       91.5
Shaaban et al. (2019)   93.01
Proposed method         95.8

In comparison to other deep learning models, the CNN uses the fewest trainable parameters thanks to weight sharing, which is a substantial benefit. A model evaluated only on data it was trained on will show extremely high false-positive rates and poor performance when used on new data. As a result, the best way to assess the competence of intrusion detection models is to check how they perform on new data that have never been seen during training. The datasets of the existing models and the proposed model have similar characteristics, which is why they are compared. When compared to the other research methods (Kim et al. [18] and Shaaban et al. [12]), the proposed model has the highest accuracy, as shown in Table 2.

5 Conclusion DDoS attacks have become a danger to the security and integrity of computer networks and information systems, which are critical components of today's infrastructure. The main objective of this research work is to create an effective IDS using CNN to detect DDoS assaults. The model is built and trained using the KDD CUP 1999 dataset and live data packets. Image conversion, feature selection, and the CNN algorithm are used for classification. The convolution layer extracts the image's unique features, and the size of the feature data is reduced by combining the convolution layer with a pooling layer. The results are visualized in the form of tables and


graphs as a function of the number of convolution layers and kernel size, using the calculated accuracy, precision, recall, and F1 measure. According to this study and its results, the proposed CNN fared better than the other techniques compared, and our model obtains 95.8% detection accuracy. In future, the proposed CNN model will be integrated with fuzzy logic.

References 1. Liu H, Patras P (2020) NetSentry: a deep learning approach to detecting incipient large-scale network attacks. arXiv:2202.09873 2. Li Y, Liu Q (2018) A comprehensive review study of cyber-attacks and cyber-security; emerging trends and recent developments. Energy Rep 7:8176–8186. ISSN 2352-4847. https://doi.org/ 10.1016/j.egyr.2021.08.126 3. Alazzam H, Sharieh A, Sabri KE (2020) A feature selection algorithm for intrusion detection system based on Pigeon Inspired Optimizer. Expert Syst Appl 148:113249. ISSN 0957-4174. https://doi.org/10.1016/j.eswa.2020.113249 4. Sambangi S, Gondi L (2020) A machine learning approach for DDoS (distributed denial of service) attack detection using multiple linear regression. Proceedings 63:51. https://doi.org/ 10.3390/proceedings2020063051 5. Ferrag MA, Maglaras L, Janicke H, Smith R (2019) Deep learning techniques for cyber security intrusion detection: a detailed analysis. https://doi.org/10.14236/ewic/icscsr19.16 6. Smitha TV, Madhura S, Sindhu R, Brundha R (2021) A study on various mesh generation techniques used for engineering applications 7. Farda NA, Lai J-Y, Wang J-C, Lee P-Y, Liu J-W, Hsieh I-H (2021) Sanders classification of calcaneal fractures in CT images with deep learning and differential data augmentation techniques. Injury 52(3):616–624. ISSN 0020-1383. https://doi.org/10.1016/j.injury.2020. 09.010 8. Liu G, Zhang J (2020) CNID: research of network intrusion detection based on convolutional neural network. Discrete Dyn Nat Soc 2020:11, Article ID 4705982. https://doi.org/10.1155/ 2020/4705982 9. Zhou J, Liu L, Wei W, Fan J (2023) Network representation learning: from preprocessing, feature extraction to node embedding. ACM Comput Surv 55(2):35, Article 38. https://doi.org/ 10.1145/3491206 10. Bi Q, Goodman KE, Kaminsky J, Lessler J (2019) What is machine learning? A primer for the epidemiologist. Am J Epidemiol 188(12):2222–2239. https://doi.org/10.1093/aje/kwz189 ˙ 11. Gregorczyk M, Zórawski P, Nowakowski PT, Cabaj K, Mazurczyk W (2020) Sniffing detection based on network traffic probing and machine learning. IEEE Access 8:149255–149269 12. Shaaban A, Abd-Elwanis E, Hussein M (2019) DDoS attack detection and classification via convolutional neural network (CNN) 233–238. https://doi.org/10.1109/ICICIS46948.2019.901 4826 13. Cheng J, Cai C, Tang X, Sheng V, Guo W, Li M (2020) A DDoS attack information fusion method based on CNN for multi-element data. Comput Mater Continua. 62:131–150. https:// doi.org/10.32604/cmc.2020.06175 14. Jia Y, Zhong F, Alrawais A, Gong B, Cheng X (2020) FlowGuard: an intelligent edge defense mechanism against IoT DDoS attacks. IEEE Internet Things J 7(10):9552–9562. https://doi. org/10.1109/JIOT.2020.2993782 15. Kumar T (2020) Video based traffic forecasting using convolution neural network model and transfer learning techniques. J Innovative Image Process 2:128–134. https://doi.org/10.36548/ jiip.2020.3.002


16. Shin H-C, Roth HR, Gao M, Lu L, Xu Z, Nogues I, Yao J, Mollura DJ, Summers RM (2016) Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans Med Imaging 35:1285–1298 17. Aljuhani A (2021) Machine learning approaches for combating distributed denial of service attacks in modern networking environments. IEEE Access 9:42236–42264. https://doi.org/10. 1109/ACCESS.2021.3062909 18. Kim J, Kim J, Kim H, Shim M, Choi E (2020) CNN-based network intrusion detection against denial-of-service attacks. Electronics 9(6):916. https://doi.org/10.3390/electronics9060916 19. Dodiya B, Singh U (2022) Malicious traffic analysis using Wireshark by collection of indicators of compromise. Int J Comput Appl 183:975–8887. https://doi.org/10.5120/ijca2022921876 20. Wu J, Wang X, Gao X, Chen J, Fu H, Qiu T, He X (2022) On the effectiveness of sampled softmax loss for item recommendation. arXiv:2201.02327 21. Muhammad MI, Hussain H, Khan AA, Ullah U, Muhammad Z, Ahmed A, Raza M, Rahman I, Haleem M (2022) A machine learning-based classification and prediction technique for DDoS -attacks. IEEE Access. https://doi.org/10.1109/ACCESS.2022.3152577

Chapter 8

BERT Transformer-Based Fake News Detection in Twitter Social Media S. P. Devika, M. R. Pooja, M. S. Arpitha, and Vinayakumar Ravi

1 Introduction The World Health Organization (WHO) was notified at the end of December 2019 of a cluster of pneumonia cases of unknown origin found in Wuhan, Hubei Province, China. In early January 2020, China informed the WHO about the problem and its unknown cause as the number of cases increased. COVID-19 became a worldwide health threat that necessitated the utmost caution, stringent individual and general hygiene, and sanitation in all public spaces. According to WHO assessments, the epidemiological situation was extremely serious, and researchers worked rapidly to create a vaccine to eradicate the infection [1–5]. The Internet is a no-cost platform whose content and news are not verified before being shared. Fake news and misinformation, as per the World Health Organization, could harm the COVID-19 vaccination program. The coronavirus pandemic required a significant shift in how government officials, healthcare professionals, and the general public go about everyday activities while battling COVID-19. Increased knowledge of the pandemic's behavior has resulted from the widespread use of communication.

S. P. Devika (B) · M. R. Pooja · M. S. Arpitha
Department of Computer Science and Engineering, Vidyavardhaka College of Engineering, Mysuru, Karnataka, India
e-mail: [email protected]
M. R. Pooja
e-mail: [email protected]
M. S. Arpitha
e-mail: [email protected]
V. Ravi
Center for Artificial Intelligence, Prince Mohammad Bin Fahd University, Khobar, Saudi Arabia
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_8


As a result of this increased awareness, countries have gone into quarantine to combat the virus's high rate of contagion [6–9]. Owing to the rapid advancement of computational technology, researchers have recently proposed deep neural networks for extracting significant information from textual content in various natural language processing tasks. Understanding the intent of the language used in news items, and of their authors, is also a challenging undertaking because people interpret language differently; as a result, different groups of people may perceive the same news as both real and fake. Every day, a massive volume of information is shared via social media and the Internet. Fake news detection is a difficult undertaking since it requires the identification of a wide range of news genres, including clickbait, propaganda, satire, disinformation, falsification, shoddy journalism, and others. Due to the high volume of Internet communication, manual recognition of this fake news is not possible. As a result, a framework that can automatically recognize fake news concerning COVID-19 on the Internet is desirable. Even in the field of journalism, there is no specific definition of fake news [10–14]. In light of the many deep neural classifiers and representation models available for text categorization, the primary question is: "For a difficult job like fake news detection, which combination of pre-trained models and neural classifiers can perform accurately?" We investigate the performance of various combinations of pre-trained models and neural classifiers in a comparative study. To accomplish this, we implement neural classifiers such as BERT [15]. We evaluate the outcomes of all of the proposed models and examine their benefits and drawbacks. We then choose the best model, compare its results to those of the state-of-the-art model, and demonstrate the usefulness of the technique. The rapid advancement of Web-based communication, along with fingertip access to the Internet, has accelerated the spread of fake information that reaches a worldwide audience at minimal cost through news channels, independent columnists, and websites. During the COVID-19 epidemic, people are harmed by these bogus and potentially destructive claims and stories, which may hurt the vaccination process. Psychological studies reveal that the human ability to identify deception is only slightly better than chance; accordingly, there is a growing need for automated systems to combat counterfeit news that traverses these platforms at an alarming rate. This paper systematically surveys current fake news detection technologies by investigating various machine learning and deep learning methods pre- and post-pandemic. Significant work has been done to use artificial intelligence, deep learning, and natural language processing to automate the process of classifying news as fake or genuine. To classify a piece of information, we need to know the problem definition first; then we build our model and evaluate the outcome. Machine learning and deep learning algorithms are two of the most common approaches for recognizing fake news, and machine learning opens up a wide area of opportunities for research. Hence, this paper focuses on machine learning and deep learning perspectives to group the most popular fake news detection procedures.


The concept of employing a block of layers as a structural unit is gaining traction among researchers. In this research, we present a deep learning approach based on the BERT technique (FakeBERT) that combines multiple parallel blocks of single-layer CNNs with Bidirectional Encoder Representations from Transformers (BERT). BERT is a sentence encoder that we employ to properly extract a sentence's context representation.
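A minimal sketch of such a BERT-plus-parallel-CNN architecture is shown below using Keras and the Hugging Face transformers library; the filter counts, kernel sizes, sequence length, and learning rate are illustrative assumptions, not the configuration reported by the authors.

```python
import tensorflow as tf
from transformers import TFBertModel

MAX_LEN = 128  # assumed maximum sequence length

bert = TFBertModel.from_pretrained("bert-base-uncased")
input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Token-level contextual embeddings from BERT: (batch, MAX_LEN, 768)
embeddings = bert(input_ids, attention_mask=attention_mask).last_hidden_state

# Parallel single-layer 1D CNN blocks with different (assumed) kernel sizes
branches = []
for kernel_size in (3, 4, 5):
    x = tf.keras.layers.Conv1D(128, kernel_size, activation="relu")(embeddings)
    x = tf.keras.layers.GlobalMaxPooling1D()(x)
    branches.append(x)

merged = tf.keras.layers.Concatenate()(branches)
merged = tf.keras.layers.Dense(128, activation="relu")(merged)
output = tf.keras.layers.Dense(2, activation="softmax")(merged)  # real vs. fake

model = tf.keras.Model([input_ids, attention_mask], output)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```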

2 Related Work Researchers have offered numerous techniques due to the significance of building accurate false news detection systems, and several academicians attempt to use data other than the semantic elements retrieved from news stories. One paper introduced a new hybrid feature learning method that included explicit and latent features. Another recent work tuned hyperparameters and selected variables, such as the named entities of every news item and the number of words, characters, and sentences in every news story, to obtain optimal accuracy using ensemble machine learning models. Researchers have generally used deep contextualized embeddings to represent input texts since they emerged. During a worldwide epidemic such as COVID-19, however, everyone is perplexed and concerned, and to address this challenge a lot of work has gone into developing natural language processing systems. For a set of data assertions that are true versus false about COVID-19, a pipeline based on the BERT transformer concept produced better results than individual models. On the Fake News Challenge dataset, a straightforward classifier was employed that exploits Term Frequency (TF), Term Frequency–Inverse Document Frequency (TF-IDF), and the cosine similarity between term-frequency vectors as features to give a benchmark for fake news identification. BERT was used to explore the context of the title as well as the main content of the news. It has been shown that fine-tuning a BERT that has already been pre-trained for specific purposes, such as misleading information detection, outperforms traditional sequence models. Natural language representation, syntax analysis, semantic analysis, and natural language processing with machine learning classifiers are all part of this approach's linguistic component. The network method uses knowledge networks to fact-check assertions made in the news, with a focus on evaluating user attitude and behavior. Compared to fake news on social media today, detecting fake news in conventional media was easy: while detecting false news in traditional media simply required paying attention to news behavior, detecting fake news on social media necessitates additional user and post behavior. All training data should go through four phases: data preparation and preprocessing, feature engineering (feature selection and feature extraction),


and model selection and construction, before being used to develop misleading-information detection models using machine learning techniques. These standard phases make it easier to handle the vast quantity of data required to create a detection model.

3 Proposed Approach On a variety of phrase classification and sentence-pair regression tasks, BERT has achieved new state-of-the-art results. BERT employs a cross-encoder: two sentences are fed into the transformer network and the target value is predicted. Individual sentences can also be fed into BERT to generate fixed-size sentence embeddings. The most frequent method is to use the output of the first token (the [CLS] token) or the average of BERT's output layer (also known as BERT embeddings). BERT (Bidirectional Encoder Representations from Transformers) was introduced as a natural language processing representation model by Google Research in 2018. When it was proposed, it obtained state-of-the-art accuracy on a variety of NLP and NLU tasks. BERT Model Architecture (Fig. 1): BERTBASE and BERTLARGE are the two sizes available for BERT. The BASE model is used to compare the design's performance to that of other architectures, while the LARGE model delivers the state-of-the-art findings described in the study article. On numerous natural language processing and language modeling tasks, BERT was able to enhance accuracy (or F1-score). This paper's key contribution is that it enables the use of semi-supervised learning for various NLP tasks, allowing for transfer learning in NLP. In our suggested model, shown in Fig. 2, the input is mapped to a representation in which neighboring local features are associated with one another in a tensor. In many existing and valuable examinations, the issue of fake news has been
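For reference, a common way to obtain the fixed-size [CLS]-token representation and the mean-pooled alternative mentioned above is sketched here with the Hugging Face transformers library; the model name and example sentence are illustrative choices, not the authors' exact setup.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModel.from_pretrained("bert-base-uncased")

sentence = "COVID-19 vaccines are distributed free of charge."   # example input
inputs = tokenizer(sentence, return_tensors="tf", truncation=True, max_length=128)
outputs = model(inputs)

cls_embedding = outputs.last_hidden_state[:, 0, :]                  # [CLS] vector, shape (1, 768)
mean_embedding = tf.reduce_mean(outputs.last_hidden_state, axis=1)  # average of token vectors
```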

Fig. 1 Architecture of the BERT


Fig. 2 Fake BERT model

analyzed using a unidirectional pre-trained sentence embedding followed by a 1D convolution and pooling layer. Our approach uses the vectors obtained from BERT's word embedding as inputs. The critical advantage of the ReLU function over other activation functions is that it does not activate all of the neurons at the same time. ReLU is a nonlinear activation function, similar to sigmoid or tanh, and can be expressed as

σ = max(0, z) → Activation Function, where z = input.

The model's predictions are then compared to the actual results, i.e., the true probability distribution; if the forecast is perfect, the loss becomes zero. Consequently, it is possible to use cross-entropy as the loss function to train the classification model. For example, forecasting a probability of 0.14 when the observation's true label is 1 is a poor prediction and results in a significant loss value. When the number of classes (M) equals two, cross-entropy reduces to the binary classification case:

L = −(y log(p) + (1 − y) log(1 − p)) → Loss Function (L)    (1)

If M > 2 (e.g., multi-class classification), a separate loss is calculated for each class label of each observation and the results are summed:

−Σ_{c=1}^{M} y_{o,c} log(p_{o,c})    (2)

Here y_{o,c} is a binary indicator (0 or 1) of whether class label c is the correct classification for observation o, and p_{o,c} is the predicted probability that observation o belongs to class c.
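As a small numerical illustration of Eqs. (1) and (2), the snippet below computes the binary and multi-class cross-entropy for made-up predictions; the probability values are placeholders chosen only to show the behavior described above.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """Eq. (1): loss for a single observation with true label y in {0, 1}."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    """Eq. (2): sum over classes of -y_{o,c} * log(p_{o,c}) for one observation."""
    p = np.clip(p, eps, 1 - eps)
    return -np.sum(y_onehot * np.log(p))

print(binary_cross_entropy(1, 0.14))   # large loss: poor prediction for label 1
print(binary_cross_entropy(1, 0.95))   # small loss: confident correct prediction
print(categorical_cross_entropy(np.array([0, 1, 0]),
                                np.array([0.1, 0.8, 0.1])))
```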


Table 1 Data distribution between labels and splits

Labels   Training   Test   Total
True     3035       758    3793
Fake     3036       759    3795
Total    6071       1517   7588

3.1 Dataset The dataset was obtained from the "Battling an Infodemic: COVID-19 Fake News" dataset. This English-language dataset was built from data related to the COVID-19 pandemic. Table 1 shows the distribution of the dataset as well as the ground truth labels. • True News: tweets about COVID-19 that come from reliable sources and provide important information. • Fake News: tweets, posts, and articles that make COVID-19-related assertions and speculations that have been proven to be false. These data were collected from social media applications such as Twitter. The data were split into two parts: train (80%) and test (20%).
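The 80/20 split described above could be reproduced as sketched below; the file name and column names are hypothetical placeholders, not the actual dataset schema.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical CSV with columns "tweet" and "label" ("real" / "fake")
df = pd.read_csv("covid19_fake_news.csv")

train_df, test_df = train_test_split(
    df, test_size=0.20, stratify=df["label"], random_state=42
)
print(len(train_df), len(test_df))   # roughly an 80/20 split per label
```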

4 Results and Discussion The fake news dataset is a collection of short news stories that has been frequently used to test algorithms for identifying fake news as a multi-class classification challenge. As shown in Table 2, we employed our suggested models for multi-class categorization in this investigation, using the credit history feature as metadata in addition to the article contents. Fake news is false or misleading information that is sometimes purposely broadcast within real news media material in order to fool people. It is written to damage the reputation of a person, group, or agency, or to gain financially, by employing deceptive headlines or clickbait to boost online sharing and reading of the material. Because of the widespread use of the Internet, technology, and social media platforms, a significant number of people may now easily like, share, and promote their ideas, resulting in both positive and negative societal consequences.

Table 2 Classification report of confusion matrix

                Precision   Recall   F1-score   Support
Europe          0.72        0.84     0.78       392
India           0.81        0.80     0.80       351
United States   0.72        0.51     0.60       206


Furthermore, a lack of awareness of COVID-19 exacerbates the pandemic situation, especially during the vaccination stage. As a result, the demand for automated solutions to combat false news, which has been spreading at an unprecedented rate across several platforms, is increasing. We use BERT as a sentence encoder to properly extract a sentence's context representation. Sequential neural networks have been used to encode the required data, and several current and usable approaches have been proposed; a deep neural network with a bidirectional training technique, on the other hand, may be the most optimal and accurate option for detecting fake news. The proposed approach improves fake news detection performance through its remarkable capacity to recognize semantic and long-distance links in phrases. To generate our recommended architecture, we applied a classification layer on top of the encoder output, multiplied the embedding matrix by the output vector, and then used the softmax function to evaluate the probability of each vector. To extract information from the training dataset, several filters were applied at each layer. BERT in conjunction with a deep convolutional network (DCN) is a type of deep neural network that excels at handling large-scale structure and unordered collections of words. Many performance assessment criteria (training and validation accuracy, the False Positive Rate (FPR), and the False Negative Rate (FNR), which were used to calculate the error rate) validate the classification findings. We show that our pre-trained bidirectional model (BERT) has the greatest accuracy when compared to other models such as bag of words, TF-IDF, TF-IDF with SVD, and TF-IDF with NMF, as well as Google's word embeddings (skip-gram and CBOW); BERT performed better than all of them.

5 Conclusion In this evaluation, we examined the reliability of the suggested FakeBERT, a deep convolutional BERT-based strategy designed to identify fraudulent news. The findings show that FakeBERT produces more precise outcomes, with an accuracy of 97%, and beats existing state-of-the-art techniques on a real-world fake news dataset. Moreover, because the posts and reports published online are not all in English, researchers should adapt their models so that their NLP approaches for recognizing false news and hate speech can be used in other languages too. Consequently, in this digital era, when people are sensitive to any information they read on the Web, state authorities should find ways to prevent the spread of fake news and hateful propaganda to maintain order among their residents.


References 1. Elhadad MK, Li KF, Gebali F (2020) Detecting misleading information on COVID-19, Canada 2. Verma S, Kariyannavar SS, Paul A, Katarya R (2020) Understanding the applications of natural language processing on COVID-19 data, India 3. Li X, Xia Y, Long X, Li Z, Li S (2021) COVID-19 fake news detection in English, China 4. Bangyal WH, Qasim R, ur Rehman N, Ahmad Z, Dar H, Rukhsar L, Aman Z, Ahmad J (2021) Detection of fake news text classification on COVID-19 using deep learning approaches, Pakistan 5. Koirala A (2020) COVID-19 fake news classification with deep learning. Computer Science and Information Management Asian Institute of Technology 6. Samadi M, Mousavian M, Momtazi S (2021) Deep contextualized text representation and learning for fake news detection, Iran 7. Sadiq-Ur-Rahman Shifath SM, Khan MF, SaifulIslam M (2021) A transformer based approach for fighting COVID-19 fake news, Bangladesh 8. Kaliyar RK, Goswami A, Narang P (2020) FakeBERT: fake news detection in social media with a BERT-based deep learning approach 9. Szczepa´nski M, Pawlicki M, Kozik R, Chora´s M (2020) New explainability method for BERTbased model in fake news detection, Poland 10. Reimers N, Gurevych I (2019) Sentence-BERT: sentence embeddings using Siamese BERTnetworks. Darmstadt 11. Jwa H, Oh D, Park K, Kang JM, Lim H (2019) exBAKE: automatic fake news detection model based on bidirectional encoder representations from transformers (BERT), Korea 12. Bondielli A, Marcelloni FA (2019) A survey on fake news and rumour detection techniques, Italy 13. Ahmed H, Traore I, Saad S (2017) Detection of online fake news using n-gram analysis and machine learning techniques, Canada 14. Clark K, Luong M-T, Le QV, Manning CD (2020) Electra: pre-training text encoders as discriminators rather than generators 15. Vijjali R, Potluri P, Kumar S, Teki S (2020) Two stage transformer model for covid-19 fake news detection and fact checking

Chapter 9

The Facial Expression Recognition Using Deep Neural Network Vijay Mane, Rohan Awale, Vipul Pisal, and Sanmit Patil

1 Introduction A facial expression recognition system is a technology which uses biometric markers to detect emotions in human faces. The system proposed here can be used as a sentiment analysis tool which will help us to detect the basic expressions like anger, disgust, neutral, fear, happy, sad, and surprise. Such a system is helpful mainly due to its ability to imitate human coding skills. Facial expressions help in conveying nonverbal communication which plays an important role in relations between people. These nonverbal communication expressions help the listener to comprehend the speakers’ words in a much better way. Thus, in short, this facial recognition system will extract information from the images we provide it, analyze it, and then provide us with the most likely expression which is present in the image. These useful cases of facial expression recognition have found its way in several applications like security companies, immigration checkpoints, retail services, and many more. Facial expression recognition is being used by security companies to secure their premises and make sure that unauthorized personnel are not allowed to private locations. Border control has been improved at immigration checkpoints due to facial expression recognition being used to detect criminals or persons of interests. Certain vending machines use facial expressions to recommend the drink to the customer based on their expressions while looking at a certain drink. V. Mane (B) · R. Awale · V. Pisal · S. Patil Vishwakarma Institute of Technology, Pune 411037, India e-mail: [email protected] R. Awale e-mail: [email protected] V. Pisal e-mail: [email protected] S. Patil e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al, (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_9


In recent years, facial expression has been developed with the help of convolution neural networks. A typical convolution network consists of 5–10 neural networks or a bit more as well. This has proved effective in a lot of cases, but it is widely believed that increasing the number of layers could help us to extract even more information which can improve the overall performance of the model. Hence in our proposed approach, we have used deep convolutional neural networks over the conventional neural networks. This paper is organized as follows. In Sect. 2, related work on facial expression recognition is briefly summarized. In Sect. 3, proposed neural network architecture is explained. The experiments and evaluation of different models are discussed in Sect. 4. The conclusion is in Sect. 5 followed by references.

2 Related Work 2.1 Literature Review The recognition of facial expressions from expression-specific features is presented in [1]. They utilized a CNN and obtained an accuracy of 96.76%. Deep learning models for facial expression recognition are implemented in [2]; the L2 multi-class SVM loss is preferred over the cross-entropy loss in FER. A facial recognition model based on CNN is presented in [3]. The stochastic gradient descent algorithm is utilized to extract and classify facial features, and the dropout method is used to solve the overfitting problem; they succeeded in achieving an accuracy of up to 99.82%. A new deep neural network architecture for the automatic recognition of facial expressions is presented by Mollahosseini et al. [4]. The presented algorithm is a single-component architecture that takes registered facial images as input and groups them into one of the six basic expressions. A unique methodology for facial expression recognition without preprocessing and feature extraction stages is presented in [5]. The classification is performed using CNN; they achieved a test accuracy of 61.7% on FER-2013 in a classification task with seven classes, compared to 75.2% in the current state of the art. A real-time facial recognition system for seven expressions is presented in [6]. Facial images were preprocessed after recording, from which traits and emotions recognized by CNN's training-based model were extracted. This system yields precision of 91.2 and 74.4% on the JAFFE and FER-2013 databases [7]. An AI-based system for emotion recognition from facial expression is presented in [8]. The accuracies achieved with the proposed model are 70.14 and 98.65% for the FERC 2013 and JAFFE datasets. A study of recent work on FER using deep learning is elaborated in [9]. The modified CNN used for emotion classification based on visual geometry is presented in [10]; they achieved a precision of 69.40. The detection of facial feelings is achieved by assessing biological markers to categorize the affective state of a person [7, 11–15]. The experimental outcomes on


Table 1 Comparisons between different models

Reference   Model                                                 Test accuracy (%)
[12]        CNN + batch normalization + different filters         65
[13]        CNN + batch normalization + global average pooling    66
[4]         CNN architecture                                      66.4

the dataset FER 2013 proved that the proposed approach achieves a higher precision [15–18]. The comparison between models is given in Table 1.

2.2 Dataset The experimentation was conducted in this presented system using FERC 2013 dataset as in Fig. 1. It consists of 48 × 48 pixel grayscale images of faces. It has seven different categories of emotion such as Angry, Disgust, Fear, happy, Sad, Surprise, Neutral, and all have class indices ranging from 0 to 6, respectively. In the training set, there are 28,709 images and the public test set consists of 3589 images.

3 Proposed Model The proposed deep neural network model is obtained after multiple experiments on the different combinations. These different combinations include the use of different optimizers such as Adam, SGD, and RMS Prop along with inclusion and noninclusion of early stopping. In this proposed model, kernel used is 3 × 3 throughout the different research. The main reason to use 3 × 3 instead of any different kernel such as 5 × 5 or 7 × 7 is less number of trainable parameters. A 3 × 3 kernel is

Fig. 1 FER 2015 dataset


used in a stacked convolution network which performs the same as one convolution neural network with a higher kernel size (5 × 5 or 7 × 7).

3.1 Model Architecture The 12-layer proposed model in Table 2 is divided into four different blocks. The first block has two 3 × 3 convolution layers with 32 filters and a stride equal to 1. The second block has the same two 3 × 3 convolution layers, with 64 filters, keeping a stride equal to 1. The third block also has two 3 × 3 convolution layers, with 128 filters. The last block has three 3 × 3 convolution layers with 256 filters. A max pooling layer with a filter size of 2 × 2 follows each block. After the four blocks, two fully connected layers of 256 neurons are connected. Finally, the output layer is also a fully connected dense layer, with 7 neurons representing the 7 emotion classes. At the output layer, the 'SoftMax' function is used along with the cross-entropy loss function to train the model, as shown in Fig. 2.

Table 2 Architecture of proposed model

Type                Layer
Feature extractor   Convolution (32, 3 × 3)
                    Convolution (32, 3 × 3)
                    Max pooling
                    Convolution (64, 3 × 3)
                    Convolution (64, 3 × 3)
                    Max pooling
                    Convolution (128, 3 × 3)
                    Convolution (128, 3 × 3)
                    Max pooling
                    Convolution (256, 3 × 3)
                    Convolution (256, 3 × 3)
                    Convolution (256, 3 × 3)
                    Max pooling
Classifier          Dense/fully connected (256)
                    Dense/fully connected (256)
                    Dense/fully connected (7)
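A minimal Keras sketch of the architecture listed in Table 2 is given below; it follows the table (including the pooling after the last block) and places a batch normalization layer after each convolution, which is an assumption since the exact placement is not stated in the text.

```python
from tensorflow.keras import layers, models

def build_fer_model(input_shape=(48, 48, 1), num_classes=7):
    """Feature extractor from Table 2 followed by the dense classifier."""
    model = models.Sequential()
    model.add(layers.InputLayer(input_shape=input_shape))
    for filters, n_convs in [(32, 2), (64, 2), (128, 2), (256, 3)]:
        for _ in range(n_convs):
            model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
            model.add(layers.BatchNormalization())   # assumed placement
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_fer_model()
model.summary()
```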


Fig. 2 Model architecture

4 Experiment In this research, several different training approaches were used. The experiments were conducted using Keras and a Kaggle notebook with GPU acceleration support; when starting model training there were several options to consider and experiment with. These options include the use of image augmentation, the inclusion or non-inclusion of batch normalization, and the selection of different optimizers, namely SGD, RMS Prop, and Adam.
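The training options just listed could be combined as in the hedged sketch below; the augmentation parameters and batch size are assumptions rather than the authors' settings, and `build_fer_model` refers to the architecture sketch given after Table 2 in Sect. 3.1.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import Adam, SGD, RMSprop

# Augmentation settings are illustrative, not the exact values used by the authors
train_datagen = ImageDataGenerator(rescale=1.0 / 255,
                                   rotation_range=10,
                                   width_shift_range=0.1,
                                   height_shift_range=0.1,
                                   horizontal_flip=True)

optimizers = {"adam": Adam(), "sgd": SGD(), "rmsprop": RMSprop()}
for name, opt in optimizers.items():
    model = build_fer_model()   # assumes the sketch from Sect. 3.1 is in scope
    model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])
    # model.fit(train_datagen.flow(x_train, y_train, batch_size=64),
    #           validation_data=(x_val, y_val), epochs=50)
```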

4.1 Experimental Design Since there are many different combinations possible and each model training run takes around 40–50 min, the best possible combinations were selected by experimenting. After each combination was trained, the combination that provided the highest test accuracy was used for further training and model selection. Considering all possibilities, 18 combinations were trained in the end, some also with different hyper-parameters. Conv2D: Convolution is a simple operation that starts with the kernel, which is a small weight matrix. The kernel hovers or slides over the 2D input data, performs elementwise multiplication with the part of the input it covers, and then sums up the result into a single output pixel. The kernel repeats this


process for every location of the input and converts the 2D feature matrix into another 2D feature matrix. The stride is kept at 1 throughout the experiment. Batch Normalization and Max Pooling: Batch normalization is a technique used for training very deep neural networks that standardizes the input to a layer for every mini-batch. It also helps to stabilize the learning process and reduces the number of epochs required for training the neural network. Pooling provides local translation invariance: a pooling layer is an approach to downsampling feature maps. The two methods used most often are average pooling and max pooling, which summarize the average of the features and the most activated features, respectively. Activation Function: A nonlinear transformation is applied to the input neuron, and this is introduced by the activation function. For our experiment, we have used 'ReLU' as the activation function for the input and hidden layers. ReLU stands for Rectified Linear Unit. The gradient of ReLU is equal to 1 for positive inputs, so neurons are only deactivated if the input value is less than zero. F(x) = max(0, x)
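The sliding-kernel operation described above can be written out directly; the small NumPy sketch below is purely illustrative (valid convolution, stride 1) and uses made-up values rather than anything from the paper.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image; elementwise-multiply and sum at each location (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(0, x)   # F(x) = max(0, x)

image = np.arange(25, dtype=float).reshape(5, 5)                       # toy 5x5 "image"
kernel = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]], dtype=float)   # toy 3x3 edge filter
print(relu(conv2d(image, kernel)))                                     # 3x3 feature map
```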

4.2 Experimental Results and Evaluation All the experimental results are discussed in the tables below. All combinations were re-tested for the same number of epochs. The best result was obtained using the 12-layer model. This 12-layer model, referred to as model 4, succeeded in achieving 68.71% accuracy with the help of batch normalization, image augmentation, and Adam as the optimizer. The validated results are shown in Figs. 3, 4 and 5. Effect of Data Augmentation and Batch Normalization: The first 6-layer model (model 1) was used to decide on further training by trying a combination of different methods. Model 1 was first trained with the Adam optimizer, batch normalization, and augmentation, which resulted in an accuracy of 61.38%. As given in Table 3, the same model was trained without batch normalization, and the accuracy was reduced to 59.47%. Model 1 was also trained without data augmentation, and it turned out to give much lower accuracy while the model tended to overfit. So, from the results acquired with the different combinations on model 1, the use of batch normalization and data augmentation gives better accuracy, as shown in Table 3; data augmentation significantly improves the model's accuracy. This is why both features were used for further training. Effect of Using Different Optimizers: Based on the observations on model 1, the best combination of Adam optimizer, batch normalization, and data augmentation was taken forward for further experiments. For comparison, different optimizers were used, namely RMS Prop and SGD. One can easily conclude that Adam gives the best accuracy among the three, followed by RMS Prop and then SGD. The accuracies obtained are given in Table 4.


Fig. 3 Model without batch normalization

Fig. 4 Model with batch normalization

Increasing Number of Layers: With the best combination obtained for model 1 after the different experiments, model 1 was used as the basis for comparing deeper networks. Models with 8 layers (model 2), 10 layers (model 3), and 12 layers (model 4) were trained using the best features mentioned above. One can easily observe that an increase in network depth results in better accuracy; the results are given in Table 5. The model proposed in this paper benefits from the Adam optimizer, batch normalization, and data augmentation. There are many parameters to take into consideration while building a model, and by using different possible approaches the model's accuracy can also be


Fig. 5 12 layer model accuracy graph

Table 3 Effect on accuracy using batch normalization and data augmentation

Optimizer   Data augmentation   Batch normalization   Test accuracy
Adam        Yes                 No                    59.47
Adam        Yes                 Yes                   61.38
Adam        No                  Yes                   53.79

Table 4 Effect of optimizer

Optimizer   Test accuracy (%)
Adam        61.38
Rms Prop    59.00
SGD         57.27

Table 5 Effect of number of layers

Model                Test accuracy (%)
Model 1 (6 layer)    61.38
Model 2 (8 layer)    62.45
Model 3 (10 layer)   64.96
Model 4 (12 layer)   68.71


Table 6 Comparisons between different models

Reference    Model                                                 Test accuracy (%)
[12]         CNN + batch normalization + different filters         65
[13]         CNN + batch normalization + global average pooling    66
[4]          CNN architecture                                      66.4
This model   CNN + batch normalization                             68.71

increased. The best obtained model is compared with previously proposed models in Table 6.

5 Conclusion The proposed architecture is effective in tackling many problems in facial expression recognition. It uses stacked convolution layers. A total of 18 different models with different combinations were trained, and the results were obtained using different optimizers, pooling, and normalization. Despite having fewer parameters, the model performs better than many other models and gives 68.71% accuracy. Future work can explore different hyper-parameter tuning, and other improvements can be made by using existing pre-trained models for better classification.

References 1. Lopes AT et al (2017) Facial expression recognition with convolutional neural networks: coping with few data and the training sample order. Pattern Recognit 61:610–628 2. Sang DV, Van Dat N (2017) Facial expression recognition using deep convolutional neural networks. In: 9th international conference on knowledge and systems engineering (KSE). IEEE 3. Yan K et al (2017) Face recognition based on convolution neural network. In: 36th Chinese control conference (CCC). IEEE 4. Mollahosseini A, Chan D, Mahoor MH (2016) Going deeper in facial expression recognition using deep neural networks. In: IEEE winter conference on applications of computer vision (WACV). IEEE 5. Singh S, Nasoz F (2020) Facial expression recognition with convolutional neural networks. In: 10th annual computing and communication workshop and conference (CCWC). IEEE 6. John A et al (2020) Real-time facial emotion recognition system with improved preprocessing and feature extraction. In: Third international conference on smart systems and inventive technology (ICSSIT). IEEE 7. Talegaonkar I et al (2019) Real time facial expression recognition using deep learning. In: Proceedings of international conference on communication and information processing (ICCIP) 8. Jaiswal A, Raju AK, Deb S (2020) Facial emotion detection using deep learning. In: International conference for emerging technology (INCET). IEEE 9. Mellouk W, Handouzi W (2020) Facial emotion recognition using deep learning: review and insights. Procedia Comput Sci 175:689–694


10. Kusuma GP, Jonathan APL Emotion recognition on FER-2013 face images using fine-tuned VGG-16 11. Shirisha K, Buddha M (2020) Facial emotion detection using convolutional neural network. Int J Sci Eng Res 11(3):51. ISSN 2229-5518 12. Agrawal A, Mittal N (2020) Using CNN for facial expression recognition: a study of the effects of kernel size and number of filters on accuracy. Vis Comput 36(2):405–412 13. Arriaga O, Valdenegro-Toro M, Plöger P (2017) Real-time convolutional neural networks for emotion and gender classification. arXiv:1710.07557 14. Quinn M-A, Sivesind G, Reis G (2017) Real-time emotion recognition from facial expressions. Standford University 15. Minaee S, Minaei M, Abdolrashidi A (2021) Deep-emotion: facial expression recognition using attentional convolutional network. Sensors 21(9):3046 16. Kumar P, Kishore A, Pandey R (2019) Emotion recognition of facial expression using convolutional neural network. In: International conference on innovative data communication technologies and application. Springer, Cham, pp 362–369 17. Suneeta VB, Purushottam P, Prashantkumar K, Sachin S, Supreet M (2019) Facial expression recognition using supervised learning. In: International conference on computational vision and bio inspired computing. Springer, Cham, pp 275–285 18. Mishra S, Talashi S (2021) Facial expression recognition system using different methods. In: Computer networks and inventive communication technologies. Springer, Singapore, pp 185–196

Chapter 10

New IoT-Based Portable Microscopic Somatic Cell Count Analysis A. Sivasangari, D. Deepa, R. M. Gomathi, P. Ajitha, and S. Poonguzhali

1 Introduction The world’s milk production is increasing day by day. The cold chain management has to be monitored. The milk quality has to be monitored for its biochemical and microbial changes during these stages of milk collection, chilling, and transportation. The quality of the finished product and its shelf life always depends on the initial quality of raw milk. Therefore, the need of collecting quality milk not only in its constituents but milk should be free from harmful bacterial contamination, free from adulteration and should have continuous chilling or maintain chilling temperature during transportation. So there is an urge for the organized dairy sectors to enhance quality milk collection and processing of the same for a quality product of appeal in the world market in the present scenario which ensures the safety and good health for the consumers. Despite these properties, the dairy sector’s productivity is relatively low because of some complex fiscal, technological, and institutional challenges. The difficulties of dairy farming are a lack of feed/poor ration formulation, a lack of fodder, low milk output per cow, a lack of chilling facilities on a farm, and the fact that most dairy farmers are small farmers. Other constraints include insufficient access to finance, poor linkages between actors in the chain, a restricted private sector, restricted IA facilities, and impeding animal diseases. Nevertheless, the above-mentioned problems can also help foreign direct investors and traders understand quickly the holes they can fill, as the dairy industry is at a turning point in its history. The quality milk production and production of quality milk production usually result in hygiene and nutritional products for public consumption which ensures safety. The safe and secure milk and milk products increase milk consumption and the entire business grows rapidly which will create market demand for consumption, A. Sivasangari (B) · D. Deepa · R. M. Gomathi · P. Ajitha · S. Poonguzhali School of Computing, Sathyabama Institute of Science and Technology, Chennai, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al, (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_10


country like India where the economy depends on agri-based and agro-based industries, the milk production with quality in their milk shed can slow down imports and build up their economy.

2 Related Work Zajac et al. [1] proposed an improved method for somatic cells with nuclei staining. They increased the sample temperature to 100 °C to perturb the somatic cell wall membrane and improve the penetration of ethidium bromide into somatic nuclei. The rising temperature during the sampling preparation improved the penetration of ethidium bromide into the nuclei of somatic cells. Methods for fluorescence microscopy are suited for precise somatic laboratory analysis of cells containing raw cow milk. They figured out the statistics differential results between different fluorescence methods of microscopy should be 11,000 SC/ml. Their findings show a high degree of precise know-how, talent, and experience the technicians in the laboratory. Zecconi et al. [2] proposed a method that inquiry confirmed the repeatability of the measurement and enables its use under conditions in the field. No association was found in the observational data in very small SCC animals, between DSCC and the intramammary infections. That all the major components of milk decreased as the DSCC grew. Becheva et al. [3] proposed that significant numbers of polymorphonuclear neutrophil cells in milk are strong evidence for bruising in the udder. It has prepared polyclonal antibodies specific to bovine neutrophils. Anti-bovine antibody neutrophil—FITC conjugate was obtained to visualize the capture of anticorps neutrophils. Anti-bovine neutrophil antibody specificity—FITC. The conjugate has been proven using a fluorescent microscope. Damm et al. [4] proposed the study aimed to create a new method for rapid and simultaneous establishment of SCC and a Fresh Parameter Somatic Cell Differential Count (DSCC), samples of cow’s milk using flow cytometry. The method is pursued use in central milk laboratory research, to maintain current DHI infrastructures. Biswas et al. [5] proposed the results suggest the time taken to sample the milk and analyze milk SCC by microscopic direct method should reduce the buffalo milk samples to less than 2 h in Indian Air state. The analysis could, however, be carried out during milking if milking should take longer just two hours, and the effect of preservatives on buffalo milk SCC using a direct microscopic method. Becheva and Godjevargova [6] proposed the Quantum Dots (QDs) and Fluorescein Coupling. Antibody isothiocyanate (FITC) has been shown to comparison of ultraviolet and fluorescent emission spectra conjugate bandwidth and initial elements. The activity indirect ELISA was measured at the same time as conjugates conditions. Gao et al. [7] proposed the method of blended segmentation for counting them in images of bovine milk. Firstly, it uses the cloud model for threshold images of cells; secondly, watersheds are used to recreate distance images to obtain the initial


segmentation result; finally, an area-wide similarity criterion is established according to the real sense of area similarity of human vision. Wall et al. [8] proposed the stepped DSCC qualities significantly when contrasting pre and post-tests immune response stimulation using the LPS and LTA are components of the bacterial cell wall. That, in turn, reflects the migration of cell populations from milk macrophages to predominantly PMN. After stimulation, the SCC increased dramatically, as expected. Different types of IoT applications are described in [9–16].

3 Proposed Work The somatic cell count is used in the mastitis screening task as an important key indicator in the dairy industry; mastitis remains a significant problem that causes losses to the dairy industry. Milk somatic cells are a mixture of milk-secreting cells and immune cells present in milk. Such cells are used as an index for estimating a dairy animal's health and milk quality; they are influenced by the health, parity, stage of lactation, and breed of the animal. Milk with a low somatic cell count yields the finest dairy products. A combined sample of raw milk from dairy farms is obtained. Inflammatory conditions of the mammary glands cause high milk SCC: mammary gland inflammation modifies the composition of the milk, making it more blood-like, because increased permeability of the blood–mammary barrier lets more ions, enzymes, and inflammatory cells flow into the milk. An increase in milk SCC is associated with a decrease in milk yield. The slide preparation for milk microscopy consists of spreading 10 µl of milk over a 1 cm² area and drying it. The smear is then stained with Newman's staining solution, and the slide is visualized and photographed during the procedure. Under a 40-fold magnifying microscope, somatic cells appear composed of cytoplasm and a nucleus ranging from 4 µm to 8 µm; the nucleus is deep blue, and the cytoplasm is light blue. The Arduino board is attached to the conductivity and pH sensors and sends the sensor data over the serial port to the Raspberry Pi. The Raspberry Pi gathers all the information and sends it to the cloud and Web servers; the data are stored in a MySQL database. The data are encoded as URL parameters in an HTTP GET request; the server collects the details and then transfers them to a PHP file, whose script connects to the database and executes a SQL command that retrieves the data from the database. The Raspberry Pi's Internet connection can be established over Wi-Fi. Figure 1 shows the proposed work architecture. Figure 2 shows the processing of a somatic image. Several morphological operators such as dilate, erode, open, and close are applied to extract a clear image. The dilation operation enlarges the input object, whereas the erode operator reduces or shrinks it. The proposed work exploits the dilate and erode operators. Figure 3 shows the crucial parameters used in the proposed model. The width of the cells is determined by the structuring element.
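A rough illustration of this erode/dilate-based cleanup followed by cell counting is sketched below with OpenCV; the threshold method, structuring-element size, and file name are assumptions for illustration, not the parameters used in the proposed system.

```python
import cv2

img = cv2.imread("somatic_smear.png", cv2.IMREAD_GRAYSCALE)   # hypothetical slide image

# Threshold the stained nuclei (assumed dark on a lighter background)
_, mask = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Structuring element width chosen to roughly match the expected cell size (assumption)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
mask = cv2.erode(mask, kernel, iterations=1)    # shrink objects, remove specks
mask = cv2.dilate(mask, kernel, iterations=1)   # restore object size

# Count connected components as a proxy for the somatic cell count in the field
num_labels, _ = cv2.connectedComponents(mask)
print("Estimated cells in field:", num_labels - 1)   # subtract the background label
```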


Fig. 1 Proposed work architecture

Fig. 2 Somatic image processing



Fig. 3 Crucial parameters

4 Performance Analysis Milk somatic cells are a mixture of milk-secreting cells and immune cells present in milk. These cells are used as an index for estimating a dairy animal's health and milk quality; they are affected by the health, parity, stage of lactation, and breed of the animal. Milk with a low somatic cell count yields the best dairy products. A composite sample of the raw milk from dairy farms is obtained. A rise in milk SCC is associated with decreases in milk production. The slide is visualized after the staining process and photographed under a 40-fold magnifying microscope; the nucleus is deep blue, and the cytoplasm is light blue. Figure 4 illustrates the performance of the somatic image processing.

5 Conclusion The proposed health monitoring system monitors the quantity of milk stored in the bulk milk cooler (BMC), which helps to track changes in case of adulteration or spillage of milk. The collection center agent, or a trained person who can start the diesel generator, may not be available on the spot when mains power fails for various reasons. This system automatically sends SMS alerts to the agent, so that immediate alternative arrangements can be made and the spoilage of large quantities of milk in the bulk milk cooler can be prevented.


Fig. 4 Performance analysis

References 1. Zajac P, Zubricka S, Capla J, Zelenakova L (2016) Fluorescence microscopy methods for the determination of somatic cell count in raw cow’s milk. Vet Med 61(11):612–622 2. Zecconi A, Dell’Orco F, Vairani D, Rizzi N, Cipolla M, Zanini L (2020) Differential somatic cell count as a marker for changes of milk composition in cows with very low somatic cell count. Animals 10:604 3. Becheva Z, Gabrovska K, Godjevargova T (2017) Immunofluorescence microscope assay of neutrophils and somatic cells in bovine milk. Food Agric Immunol 28(6):1196–1210 4. Damm M, Holm C, Blaabjerg M, Bro MN, Schwarz D Differential somatic cell count—a novel method for routine mastitis screening in the frame of dairy herd improvement testing programs. J Dairy Sci 100:4926–4940 5. Biswas S, Mukherjee R, Mahto RP, De UK (2016) Effect of storage temperature on somatic cell count of buffalo milk using direct microscopic method. Indian J Anim Sci 86(1):32–34 6. Becheva Z, Godjevargova T (2017) Preparation of anti-elastase antibody conjugated with quantum dots 710 Nm and fluorescein isothiocyanate for immunoassay of milk somatic cells. Becheva Godjevargova, J Nanomater Mol Nanotechnol 7. Gao X, Xue H, Pan X, Jiang X, Bo Y, Wang Y (2016) Segmentation of somatic cells based on cloud model. Rev Téc Ing Univ Zulia 39(2):93–101 8. Wall SK, Wellnitz O, Bruckmaier RM, Schwarz D Differential somatic cell count in milk before, during, and after lipopolysaccharide- and lipoteichoic-acid-induced mastitis in dairy cows. J Dairy Sci 101:5362–5373



Chapter 11

A Survey on Hybrid PSO and SVM Algorithm for Information Retrieval D. R. Ganesh and M. Chithambarathanu

1 Introduction The meaning of the term Information Retrieval (IR) is often very broad. Calvin Mooers coined the term "Information Retrieval" in 1950 to describe how a prospective user of information might convert a request for information into a useful collection of references. As per Calvin Mooers, Information Retrieval (IR) embraces the intellectual aspects of the description of information and its specification for search, and furthermore whatever systems, techniques, or machines are employed to carry out the operation. Information Retrieval (IR) is the process of finding material (generally documents) of an unstructured nature (text) that satisfies an information need from within large collections (normally stored on computers). The goal of information retrieval (IR) is to develop, illustrate, and evaluate frameworks that are able to provide efficient and robust content-based access to large amounts of information. The goal of an IR framework is to assess the relevance of information sources such as text documents, images, and video to a user's information need. Such an information need is expressed in the form of a query, which generally corresponds to a collection of words. Users are only interested in the information that is relevant to their information needs. The representation and organization of the material should provide the user with easy access to the information in which he or she is interested. The primary objective of an IR framework is to retrieve the information that is relevant to a user query while not retrieving non-relevant items. Moreover, the retrieved items are to be ranked from the most relevant to the least relevant. Information retrieval is not the same as data retrieval. Data retrieval merely determines which documents of a collection contain the keywords in the user query, which is insufficient to satisfy the user's information D. R. Ganesh (B) · M. Chithambarathanu Department of ISE, CMR Institute of Technology, Bangalore, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_11


requirement [1, 2]. The user of an information retrieval system is concerned more with retrieving information about a subject than with retrieving data that answers one specific query. A retrieved item may be somewhat inaccurate, and small errors are likely to go unnoticed in an information retrieval setting; in a data retrieval system, by contrast, a single erroneous object among the retrieved items signifies total failure. Predicting document content entails extracting syntactic and semantic information from the document text and using it to match the user's information need. The notion of relevance is central to information retrieval. The most difficult part of the retrieval process is to determine which documents are relevant to, or answer, a given query. Documents are ranked from most relevant to least relevant; an IR framework achieves its maximum retrieval effectiveness when the documents most relevant to the query are ranked higher and the less relevant ones are ranked lower.

2 Literature Survey To gather the required information about the different ideas related to the current application, the existing research literature was analyzed thoroughly. Some of the significant works are explained below. Ashutosh Kumar Singh and Ravi Kumar P demonstrated that the outcome is based on the presence of phrases inside the query; this method eliminates the incorrect Boolean syntax used by end-users and returns a few inaccurate query results. The probabilistic model of Norbert Fuhr and the vector space model represent documents mathematically using vectors. Ed Greengrass described personalized information retrieval, which deals with settings where the user's expectation remains the primary consideration; the serious issue is the selection of the optimal framework on which it can be built, and the analysis reveals that this is done by building a user profile and using evolutionary algorithms. Border et al. [3] analyzed the bankruptcy prediction approach, which has been addressed by different information processing methods; since this is a fundamental problem, a combination of PSO and SVM algorithms is used to predict bankruptcy. Cao [4] proposed SVM parameter selection based on CPSO with the goal of deriving optimal parameters for the support vector machine; new approaches are developed to overcome the existing issues, and SVM parameter determination is performed before solving the QP problem. Lee [5] described a hybrid particle swarm optimization and support vector machine-based approach for successfully developing a Devanagari script framework. Jones et al. [6] addressed the feature extraction challenge in the information retrieval process by focusing on the ranking problem, and also proposed a linear feature extraction approach for ranking, referred to as the LifeRank algorithm.


According to Aksoy [7], sequential minimal optimization is an effective method for training support vector machines and computing their accuracy. Aksoy and Haralick [8] proposed a combination of two algorithms for the intrusion detection problem; they used a standardized particle swarm optimization algorithm for determining the optimal feature subset. Liu et al. [9] proposed a technique for feature selection based on SVM, which performs better than other standard algorithms. They considered a leave-one-out methodology together with two classifiers, and performance was assessed using standard performance metrics. Mehrdad Dianati, Insop Song, and Mark Treiber of the University of Waterloo, Ontario, described evolutionary algorithms in terms of Darwinian evolution combined with the ideas of Mendel, as applied in different computing models and applications. To summarize, Djoerd Hiemstra and Donna Harman, University of Twente, describe the classical Information Retrieval models as early strategies for performing retrieval, used as initial models to find data within a wide range of available information [10–12]. This approach is based on pure arithmetic and Boolean algebra, which together provide a model for data selection.

3 Methodology In the proposed framework, the query is split and processed by utilizing the Support Vector Machine (SVM) algorithm. The parameters are then applied, and the further course of action is planned; if this does not help, the knowledge base is updated and the plan of action is revised. Finally, using the particle swarm optimization-based ranker, the appropriate responses are located and the best result is obtained. The major role of this section is to determine whether or not the system is feasible. Following that, several types of assessments, such as performance analysis, technical analysis, and operational evaluation, are carried out. The modules involved here are: Knowledge base building: this module converts the documents to feature vectors and stores them within an object. Query pre-processing: the query is pre-processed to obtain its feature vector. SVM clustering: this module clusters the feature vectors in the content and returns the cluster with which the query feature vector has the highest association. PSO ranking: PSO ranks the documents within the cluster returned by the SVM clustering module and returns the most suitable answer.


Fig. 1 System architecture

The primary factor in system development is design. Software design is a process in which the requirements are translated into a software representation; it is the stage where the software takes shape. The new framework should be planned based on the customer requirements, and hence a detailed assessment of the entire framework should be incorporated; this process is known as system design. Design is the best way to accurately translate a customer's requirements into a finished product. The design produces a representation or model and provides details on the program structure, data design, interfaces, and components required to complete the system, and Fig. 1 depicts the entire system architecture. Requirement Analysis: this stage is concerned with the collection of requirements for the system and produces the requirements document. System Design: keeping the requirements in mind, they are translated into a software representation; during this stage, the developer focuses on algorithms, data organization, software architecture, and so on. Coding: during this stage, the developer begins coding in order to realize the complete design of the product; as an outcome, the design details are converted into machine-readable code. Implementation: the implementation stage incorporates the actual coding or programming of the product. The following algorithms are used by many authors. 1. Support vector clustering: Support Vector Domain Description (SVDD) is used to describe the domain of the data space in which the data samples are concentrated.


SVDD belongs to the family of kernel-based learning methods. In its "linear" variant, SVDD searches for the smallest sphere that encloses the data; when used with a kernel function, it finds the smallest enclosing sphere in the feature space defined by that kernel. When this sphere in feature space is projected back into the input space, it corresponds to a set of non-linear contours that enclose the data. SVDD provides a function that indicates whether a point lies inside the feature-space sphere or not, i.e., whether a particular point belongs to the support of the distribution. 2. PSO ranking Particle swarm optimization (PSO) is an algorithm for solving optimization problems in which a candidate solution is represented as a point or surface in an n-dimensional search space. 3. Document pre-processing The input documents and the query are processed using the following steps: a. lower-case letter conversion, b. removal of special characters, c. removal of commas, d. stemming of each word.

4. TF-IDF Feature Vectorization TF-IDF feature vectorization measures the term frequency (TF) and the inverse document frequency (IDF). Each word has its own TF and IDF value, and the product of the two gives the TF * IDF weight. In effect, the weight reflects how relevant the keyword is across the whole collection of documents, generally known as the corpus (a short sketch of the pre-processing and vectorization steps follows below).
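As an illustration of steps a-d and of TF-IDF vectorization, the following Python sketch uses NLTK's Porter stemmer and scikit-learn's TfidfVectorizer; the sample documents are invented, and the snippet is not the authors' implementation.

    # Minimal sketch of the document pre-processing steps (a-d) and TF-IDF feature
    # vectorization, using scikit-learn and NLTK. Sample documents are illustrative.
    import re
    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer

    stemmer = PorterStemmer()

    def preprocess(text: str) -> str:
        text = text.lower()                       # a. lower-case conversion
        text = re.sub(r"[^a-z0-9\s]", " ", text)  # b./c. remove special characters and commas
        return " ".join(stemmer.stem(w) for w in text.split())  # d. stem each word

    documents = [
        "Information retrieval ranks documents for a user query.",
        "Particle swarm optimization searches an n-dimensional space.",
        "Support vector machines separate classes with a maximum margin.",
    ]

    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(preprocess(d) for d in documents)
    print(doc_vectors.shape)   # (number of documents, number of unique stemmed terms)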

3.1 Data Analysis A use-case diagram is a type of behavioral diagram developed from a use-case analysis. Its purpose is to present a graphical view of the functionality delivered by a system in terms of actors, their goals (referred to as use cases), and any dependencies between those use cases. Figure 2 depicts the use-case diagram. The corpus is used as input, and the knowledge base is constructed using the knowledge-base building step. The input query is processed in order to match it against the corpus and build the result. Unit testing is performed on each module that makes up the overall framework. Unit testing focuses on assertion checks carried out within the smallest unit of code present in each module; this is often referred to as module testing. The unit testing table then documents the functions that were tested during the development process. At the start of the coding stage, only the functions required in the different parts of the framework are implemented; each function is coded and tested. Following a


Fig. 2 Use case diagram

thorough examination of the correctness of as many functions as possible, they are grouped into classes. Combinations of related classes are then tested together; after confirming the correctness of the outputs in each class, the classes are merged and tested again. The developed project is a combination of front-end and back-end development. The front-end is built in the Java Swing environment. The user interface is intended to let the user pose a clear query to the framework and to observe the system's normal and faulty behavior and its outcomes. The back-end code is combined with the GUI and executed. Integration testing: information can be lost across an interface, and one module can adversely affect another; when the sub-problems are combined, the intended functionality must not be reduced. Integration testing is a systematic methodology for building up the framework structure, and it helps to address issues of incremental development and verification. Output testing of the proposed system is described as part of validation testing, because no system is useful unless it produces the required output in the specified format. Output testing therefore involves, first of all, gathering information about the output format required, in order to examine the output produced or displayed by the system. The output format is considered in two forms: 1. on-screen and 2. printed; these are used for validating the sites crawled, the relevance score of sites, and the ranking score of sites. This study deals with a number of testing combinations such as unit testing, which is a technique used for verifying the precise operation of a particular module of the source code; module testing is another term used for it. It also provides a brief overview of several types of integration testing in which independent software modules are integrated and tested as a group.


4 Results and Discussion Boolean retrieval systems were developed and marketed more than three decades ago, at a time when computing power was far more limited. As a result, these systems require the user to add syntactic constraints to the query to limit the total number of records retrieved, and the retrieved documents are not ordered by any relationship to the user's query. Although Boolean systems provide extensive online search capabilities to librarians and other trained intermediaries, they tend to provide inadequate support to end-users, particularly those who use the system on an occasional basis. Because of the complicated query syntax required by these systems, end-users are likely to be familiar with the terminology of the data they are looking for but lack the training and practice required to obtain consistently good results from a Boolean system. The ranked retrieval approach is aimed precisely at these end-users. This technology enables the user to enter a basic query such as a sentence or a phrase (no Boolean connectors) and to receive a collection of records ranked in such a way that the most likely relevant ones come first.

4.1 Ranking Models and Experiments with These Models The experimental results presented for these models are based on standard test collections and on standard recall and precision measures for evaluation. As noted above, end-users are likely to be familiar with the terminology of the data they are searching for but lack the preparation and experience required to obtain consistently good results from a Boolean framework, because of its complicated query syntax. The ranked retrieval strategy is targeted at these end-users: it enables the user to enter a basic query such as a sentence or a phrase (no Boolean connectors) and to receive a list of records ranked by likely relevance.

4.2 The Vector Space Model The documents and the query are represented as vectors in an n-dimensional vector space, where n corresponds to the number of unique terms present in the data collection. A vector matching operation, based on the cosine correlation, is used to measure the cosine of the angle formed between the vectors; this is then used to compute the similarity between a document and the query, and the documents can be ranked on that basis.


Fig. 3 Matrix representation

\mathrm{similarity}(d_j, q_k) = \frac{\sum_{i=1}^{n} td_{ij} \times tq_{ik}}{\sqrt{\sum_{i=1}^{n} td_{ij}^2} \times \sqrt{\sum_{i=1}^{n} tq_{ik}^2}}    (1)

The symbols of Eq. 1 are explained below: td_ij is the weight of the ith term in the vector for document j, tq_ik is the weight of the ith term in the vector for query k, and n is the number of unique terms in the data set. In the related term-weighting notation, N is the number of documents in the collection, R is the number of relevant documents for query q, n is the number of documents containing term t, and r is the number of relevant documents containing term t. The matrix representation is shown in Fig. 3.
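The ranking step implied by Eq. 1 can be illustrated with a small NumPy sketch; the term weights below are made-up values, not data from the paper.

    # Illustrative computation of Eq. (1): cosine similarity between a query vector
    # and each document vector, followed by ranking. The term weights are invented.
    import numpy as np

    # Rows are documents d_j, columns are the n unique terms (td_ij weights).
    doc_term = np.array([
        [0.0, 1.2, 0.8, 0.0],
        [0.9, 0.0, 0.4, 1.1],
        [0.5, 0.5, 0.0, 0.0],
    ])
    query = np.array([0.0, 1.0, 0.5, 0.0])   # tq_ik weights for the query

    def cosine_similarity(d: np.ndarray, q: np.ndarray) -> float:
        # Eq. (1): dot product divided by the product of the vector norms.
        return float(np.dot(d, q) / (np.linalg.norm(d) * np.linalg.norm(q)))

    scores = [cosine_similarity(d, query) for d in doc_term]
    ranking = np.argsort(scores)[::-1]        # document indices, most to least similar
    print(scores, ranking)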

4.3 Positioning Based on Document Structure Some ranking experiments depend more on the document or intra-document structure than on the term weighting described earlier. Bernstein and Williamson [10] built a ranked retrieval system for a highly structured knowledge base, the Hepatitis Knowledge Base. Their ranking algorithms considered not only term importance throughout the entire collection and within a specific document, but also the structural location of the term, such as within summary paragraphs versus within ordinary text sections. The sorting and retrieval were based on a singular value decomposition (related to factor analysis) of a term-document matrix obtained from the entire document collection. This was combined with weighting based on a function of term frequency within a document (the root mean square normalization) and a function of term frequency across the whole collection (the noise or entropy measure, or alternatively the IDF measure). The results were better than those obtained by using term weighting alone, although additional development is required before this technique can be used in large retrieval frameworks.


4.4 Adjustments and Enhancements to the Basic Indexing and Search Processes There are significant possible changes and improvements to the fundamental indexing and search processes, some of which are critical because of special retrieval conditions (those involving large and extremely large data collections are investigated), and some of which are techniques for improving the response time or the ease of use. Clearly, two separately modified files could be created and stored, one for stemmed terms and another for unstemmed terms. Query words would normally use the stemmed form; however, query terms marked with a "do not stem" character would be directed to the unstemmed form. Although this would address the issue for smaller data collections, it creates a storage concern for large data sets. A hybrid modified file can be created to merge these entries, saving no space in the dictionary section but saving substantial storage space over that required to store two forms of the entries. This storage saving comes at the expense of some extra searching time and so may not be the best option.

5 Conclusion In this study, the developed framework is a combination of both SVM and PSO. The proposed hybrid framework overcomes the shortcomings in information retrieval ranking reported in several research works and improves the performance of the ranking framework. This study offers a monolingual ranking system, which is based only on standard algorithms; the work is being extended toward a cross-lingual and continuous retrieval framework. As a result of this research, a hybrid model created by combining the best characteristics of different optimization algorithms is often required to determine the parameters of the classification methods, and the model's success rate is thereby increased.


References 1. Lee DL, Chuang H, Seamons K (1997) Document ranking and the vector-space model. IEEE Softw 14(2):67–75 2. Luo J, Xiang G, Pan C (2017) Discovery of microRNAs and transcription factors co-regulatory modules by integrating multiple types of genomic data. IEEE 3. Border A (2020) A taxo nomy of web search. ACM SIGIR Forum 36(2) 4. Cao Y Adapting ranking SVM to document retrieval. In: Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval. ACM 5. Lee JH (1994) Properties of extended Boolean models in information retrieval. In: Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval. Springer, New York 6. Jones KS, Walker S, Robertson SE A probabilistic model of information retrieval: development and comparative experiments: part 2. Information Processing and Management 7. Aksoy S (1998) Textural features for content-based image database retrieval. Master’s thesis, University of Washington, Seattle, WA 8. Aksoy S, Haralick RM (1998) Textural features for image database retrieval. In: Proceedings of IEEE workshop on content-based access of image and video libraries, in conjunction with CVPR’98, pp 45–49, Santa Barbara, CA 9. Song Y, Liu S, Liu X, Wang H (2015) Automatic taxonomy construction from keywords via scalable Bayesian rose trees. IEEE 10. Bernstein LM, Williamson RE (1984) Testing of a natural language retrieval system for a full-text knowledge base. 35(4):235–247 11. Agarwal A, Biadsy F, Mckeown K (2009) Contextual phrase-level polarity analysis using lexical affect scoring and syntactic n-grams. In: Proceedings of the 12th conference of the European chapter of the ACL (EACL 2009) 12. Barbosa L, Feng J (2010) Robust sentiment detection on twitter from biased and noisy data. Coling 2010: Posters

Chapter 12

Metric Effects Based on Fluctuations in Values of k in Nearest Neighbor Regressor Abhishek Gupta, Raunak Joshi, Nandan Kanvinde, Pinky Gerela, and Ronald Melwin Laban

1 Introduction The field of machine learning has made great advances in terms of performance and adaptability with respect to different forms of data. Machine learning has classification [1] and regression [2] as its major learning tasks; these in turn have subdivisions of supervised [3] and unsupervised [4] learning. Regression is used for the prediction of continuous values and classification for discrete value prediction. The learning methods are divided into parametric and nonparametric [5] models. Among the parametric models, the types of algorithms are linear regression [6], logistic regression [7], and discriminant analysis [8], whereas among the nonparametric models, the types are k-nearest neighbors [9], support vector machines [10], decision trees [11], bagging ensemble methods [12], and boosting ensemble [13] methods. Regression is the focus of this paper. Regression has many varied models, viz. linear regression, Lasso regression [14], and ridge regression [15], which are linear models. Similarly, there are regression-based models built on support vector machines, bagging ensembles, and distance-based algorithms. This paper is not a comprehensive comparison but a focused observation of the distance-based learning method known as the k-nearest neighbor regressor. The algorithm is supervised and nonparametric. The point that we are trying to prove in this paper relates to the effect of fluctuations in the value of k on regression-based metrics. The metrics used are going to be the root mean squared error and the goodness-of-fit measure known as the R-squared score. A. Gupta (B) · R. Joshi University of Mumbai, Mumbai, India e-mail: [email protected] N. Kanvinde · P. Gerela Thakur Institute of Management Studies, Career Development and Research, Mumbai, India R. M. Laban St. John College of Engineering and Management, Palghar, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_12


2 Methodology K-nearest neighbors, abbreviated as KNN, is a supervised learning algorithm. It is a nonparametric approach to density estimation [16] that makes few assumptions about the form of the distribution. It requires labeled input data, which is given to the model for training, and the model is later tested on validation data by comparing its output with the expected output. The kernel width is an important parameter and is denoted by h. In regions of high data density, a large kernel width over-smooths the data to the extent of washing out valuable structure, whereas reducing h increases the noise in the estimate, giving vague predictions in regions with fewer data points. Thus, the optimal choice for h may depend on the location inside the data space at which the KNN estimate is needed.

P(X) = \frac{K}{NV}    (1)

Equation 1 gives the estimated probability density, where K is the number of data points falling inside a region of volume V and N is the total number of data points. The equation gives the general result for local density estimation; instead of fixing V, the value of K is fixed and V is allowed to vary. The KNN algorithm uses feature similarity to predict the values of newly observed data points: a new point is assigned a value based on how closely it resembles the points in the training set. The judgment of the algorithm is made using distance-based formulas, which are required to distinguish the data points efficiently from one another.
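As a hedged illustration of Eq. 1, the following one-dimensional sketch fixes K and grows the volume V until it contains the K nearest points; the sample values are invented.

    # Tiny 1-D illustration of Eq. (1), P(x) = K / (N * V): fix K nearest neighbours
    # and let the volume V grow until it contains them. Values are illustrative.
    import numpy as np

    data = np.array([0.1, 0.4, 0.45, 0.5, 0.9, 1.3, 1.35, 2.0])  # N sample points
    N, K = len(data), 3

    def knn_density(x: float) -> float:
        distances = np.sort(np.abs(data - x))
        radius = distances[K - 1]          # distance to the K-th nearest neighbour
        volume = 2.0 * radius              # "volume" of a 1-D ball of that radius
        return K / (N * volume)

    print(knn_density(0.45), knn_density(1.8))  # higher density where points cluster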

2.1 Euclidean Distance Euclidean distance [17] is calculated as the square root of the sum of the squared differences between a new observation and an existing observation. The formula for this can be given as

d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}    (2)

Equation 2 takes x and y as the two points for the calculation, with the summation taken over the n dimensions. The differences are squared to avoid negative values.


2.2 Manhattan Distance Manhattan distance [18] is the distance between real vectors computed as the sum of their absolute differences. The formula can be given as

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|    (3)

Equation 3 is a representation of the Manhattan distance, where the two values taken into consideration are x and y, and absolute values are taken to avoid negative values.

2.3 Hamming Distance Hamming distance [19] is used for categorical variables. The formula is exactly the same as for the Manhattan distance, with an additional set of rules imposed: if the value x and the value y are the same, the distance is equal to 0; otherwise it is 1.
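The three distance measures of Sects. 2.1-2.3 can be written directly from Eqs. 2 and 3 and the Hamming rule; the following NumPy sketch is illustrative only.

    # Plain NumPy versions of the three distance measures (Eqs. 2 and 3 plus the
    # Hamming rule). The vectors used here are illustrative.
    import numpy as np

    def euclidean(x: np.ndarray, y: np.ndarray) -> float:
        return float(np.sqrt(np.sum((x - y) ** 2)))     # Eq. (2)

    def manhattan(x: np.ndarray, y: np.ndarray) -> float:
        return float(np.sum(np.abs(x - y)))             # Eq. (3)

    def hamming(x: np.ndarray, y: np.ndarray) -> int:
        return int(np.sum(x != y))                      # 0 where values match, 1 otherwise

    a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0])
    print(euclidean(a, b), manhattan(a, b), hamming(a, b))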

2.4 Regressor The k-nearest neighbors regressor operates on the principles of k-nearest neighbors at every instance, where the significant arbitrary value declared by the user as a parameter is known as k. With distance-weighted nearest neighbors, nearby points have more influence on the regression than points that are far away.

2.5 Dataset Since the aim is a focused observation of the k-nearest neighbor regressor, the horizon for selecting datasets was very broad for us. We decided to perform the method on multiple standardized datasets. The datasets used are regression-based datasets, namely Boston Housing Prices, QSAR fish toxicity LC50, and CO2 emission by vehicles. These are widely used regression-based datasets and should suffice to serve the purpose of this paper.
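A minimal sketch of the experiment described in this paper, assuming scikit-learn, is shown below: it sweeps k for a k-nearest neighbor regressor and records RMSE and the R-squared score. A synthetic dataset stands in for the three datasets named above, so the numbers it prints are not the paper's results.

    # Sweep the value of k for a k-nearest neighbor regressor and record RMSE and
    # the R-squared score. A synthetic dataset stands in for the paper's datasets.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import mean_squared_error, r2_score

    X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    for k in range(1, 31):
        model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
        pred = model.predict(X_test)
        rmse = np.sqrt(mean_squared_error(y_test, pred))
        r2 = r2_score(y_test, pred)
        print(f"k={k:2d}  RMSE={rmse:8.2f}  R2={r2:.3f}")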


3 Results This section of the paper gives the outcomes that we have achieved after implementation. The results can be measured in a broad sense using regression-based metrics. The metrics are explained below along with the performed results.

3.1 Root Mean Squared Error As the words that make up the metric suggest, its meaning can be understood word by word. The first consideration should be given to the error. The error is also known as the residual; residuals are a measure of how far the data points are from the regression line. These are nothing but prediction errors, obtained by subtracting the predicted value from the actual value. This error is later squared to avoid negative values, and the squared errors can then be summed to give the sum of squared errors. The formula for it can be represented as

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (4)

Equation 4 considers the true label and the predicted label, and the summation of all such values is taken over the range of n values. Now, the mean is calculated, which gives the mean squared error [20], a metric for regression-based models. The formula for the mean squared error is given by

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (5)

Further, the root of the mean squared error is calculated by the formula

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}    (6)

In RMSE [21], the residuals are squared before averaging. This is an indication that RMSE is useful when large residuals are present, since they strongly affect the reported performance of the model. It avoids taking the absolute value of the residuals, and this attribute is convenient in many mathematical calculations. For this metric, the lower the value, the better the performance of the model. Standard deviation is a measure of how spread out the data points are; it is the square root of the variance, and the variance is the average of the squared differences from the mean. So, to get the RMSE, one applies the standard deviation idea, taking the square root of the average of the squared residuals. The key point taken into consideration is that RMSE is most useful when large errors matter. Absolute


Fig. 1 RMSE score for Boston housing prices

fit of the model on the data is assessed by RMSE. These are negatively oriented scores, which means that lower values are better. After the prediction process, the RMSE is considered individually for every single dataset over the set of varied k values. This can be visualized efficiently, and it gives a detailed outlook on the fluctuation across the k values. Figure 1 is a representation of the RMSE score calculated for Boston Housing Prices over k values in the range of 76. The graph shows the fluctuation in the RMSE score, where the lowest value of the RMSE is observed at a very early stage. Similarly, the RMSE scores for the other two datasets can also be observed using visualizations. Figure 2 gives the representation of the RMSE values along a line in a quadratic-curve-like fashion; the lowest value is observed before k reaches 5. This indicates inconsistency in the metric across the different datasets. Figure 3 is the representation of the RMSE values over the 76 values of k, where the optimal score is observed between k values of around 10 and 20. This again shows the fluctuation of the score and suggests that RMSE is not a robust metric from which to infer anything here.

3.2 Goodness of Fit Goodness of fit [22] is the accuracy metric for regression-based models. The mathematical terminology for goodness of fit is R^2, or the R-squared score; coefficient of determination is the term given to it. R-squared is a perfect indication of how effectively a model fits the given dataset. It can also be considered as an


Fig. 2 RMSE score for carbon dioxide emissions

Fig. 3 RMSE score for LC50


Fig. 4 Goodness of fit for Boston housing prices over k values

indication of how closely the regression line, i.e., the predicted values, relates to the actual test set of the data. The metric gives a value between 0 and 1 as an indication of model performance; values closer to 1 indicate that the model is very good, and vice versa. R-squared is a comparison of the Sum of Squared Residuals (SSR) with the Sum of Squared Totals (SST). SST is the summation performed over the squared perpendicular distances between the average line and the corresponding data points. SSR is the summation performed over the squares of the perpendicular distances between the best-fit line and the data points. The equation for R-squared is represented by the formula

R^2 = 1 - \frac{SSR}{SST}    (7)

where the formula for SSR is given by

SSR = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2    (8)

and the formula for SST is given by

SST = \sum_{i=1}^{n} (y_i - \bar{y})^2    (9)

where \bar{y} is the mean of the observed values.
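A short worked check of Eqs. 7-9, with invented values, confirms that the manual SSR/SST computation matches scikit-learn's r2_score:

    # Compute SSR and SST by hand and compare the resulting R-squared with
    # scikit-learn's r2_score. The values below are illustrative only.
    import numpy as np
    from sklearn.metrics import r2_score

    y_true = np.array([3.0, 5.0, 7.0, 9.0])
    y_pred = np.array([2.8, 5.3, 6.6, 9.4])

    ssr = np.sum((y_pred - y_true) ** 2)          # Eq. (8): residual sum of squares
    sst = np.sum((y_true - y_true.mean()) ** 2)   # Eq. (9): total sum of squares
    r_squared = 1.0 - ssr / sst                   # Eq. (7)

    print(r_squared, r2_score(y_true, y_pred))    # the two values agree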


Fig. 5 Goodness of fit for CO2 emissions over k values

The graphical representation of the varied values of R-squared with respect to values of k ranging up to 9 is given. The highest value is observed at k = 2, and the score becomes substantially lower as k increases, which indicates that the values of the data are less varied and that a broad grouping of the k neighbors has more influence than getting into intricate details. Similar observations can also be made on the other datasets (Fig. 4). The same type of pattern can be seen in Fig. 5, where higher values of k have no positive influence on the accuracy score; the smaller disparity between the classes has a certain influence on the performance of the algorithm. This can also be checked on the final dataset for confirmation (Fig. 6). The goodness of fit for the LC-50 dataset proved to be an effective procedure for checking the effect of the k values on the R-squared accuracy metric. Higher values of k do not necessarily indicate better performance of the model; the optimal solution is what matters, and the choice of k makes a lot of difference in most cases.

4 Conclusion The main purpose of this paper emphasizes on the point of effect on metrics with respect to the fluctuations of k values for nonparametric regression-based model. The nonparametric model we used for implementation is k-nearest neighbor regressor


Fig. 6 Goodness of fit for LC-50 over k values

which is supervised learning model. The metrics we used to prove the subtle point were root mean squared error and R-squared goodness of fit. The RMSE did not prove to be a better fit metric for proving the point as there were fluctuations in the values with respect to different datasets. The R-squared on the other hand was able to prove the necessary point and performed very efficiently. It gives the optimal value of k in every single situation. The point was proved that the higher values of k do not influence the performance of the model and less distinctions between the values for separation holds more value than arbitrarily increasing the amount of k values. This is definitely not the end of the paper for focusing on such subtle observations and for sure opens many doors for new research which we will be glad to be a part of with our best belief and knowledge.

References 1. Cormack RM (1971) A review of classification. J Roy Stat Soc Ser A (General) 134(3):321– 367. http://www.jstor.org/stable/2344237 2. Maulud D, Abdulazeez AM (2020) A review on linear regression comprehensive in machine learning. J Appl Sci Technol Trends 1(4):140–147. https://doi.org/10.38094/jastt1457, https:// jastt.org/index.php/jasttpath/article/view/57 3. Kotsiantis SB (2007) Supervised machine learning: a review of classification techniques. In: Proceedings of the 2007 conference on emerging artificial intelligence applications in computer engineering: real word AI systems with applications in EHealth, HCI, information retrieval and pervasive technologies. IOS Press, NLD, pp 3–24


4. Längkvist M, Karlsson L, Loutfi A (2014) A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recogn Lett 42:11–24 5. Bashir F, Wei HL (2015) Parametric and non-parametric methods to enhance prediction performance in the presence of missing data. In: 2015 19th international conference on system theory, control and computing (ICSTCC), pp 337–342. https://doi.org/10.1109/ICSTCC.2015. 7321316 6. Wallisch C, Bach P, Hafermann L, Klein N, Sauerbrei W, Steyerberg EW, Heinze G, Rauch G (2022) On behalf of topic group 2 of the STRATOS initiative: review of guidance papers on regression modeling in statistical series of medical journals. PLOS ONE 17(1):1–20. https:// doi.org/10.1371/journal.pone.0262918 7. Cramer JS (2002) The origins of logistic regression 8. Gupta A, Soni H, Joshi R, Laban RM (2022) Discriminant analysis in contrasting dimensions for polycystic ovary syndrome prognostication. arXiv preprint arXiv:2201.03029 9. Taunk K, De S, Verma S, Swetapadma A (2019) A brief review of nearest neighbor algorithm for learning and classification. In: 2019 international conference on intelligent computing and control systems (ICCS), pp 1255–1260. https://doi.org/10.1109/ICCS45141.2019.9065747 10. Hearst M, Dumais S, Osuna E, Platt J, Scholkopf B (1998) Support vector machines. IEEE Intell Syst Appl 13(4):18–28. https://doi.org/10.1109/5254.708428 11. Quinlan JR (1986) Induction of decision trees. Mach Learn 1(1):81–106. https://doi.org/10. 1023/A:1022643204877 12. Kanvinde N, Gupta A, Joshi R (2022) Binary classification for high dimensional data using supervised non-parametric ensemble method. arXiv preprint arXiv:2202.07779 13. Gupta AM, Shetty SS, Joshi RM, Laban RM (2021) Succinct differentiation of disparate boosting ensemble learning methods for prognostication of polycystic ovary syndrome diagnosis. In: 2021 international conference on advances in computing, communication, and control (ICAC3). IEEE, pp 1–5 14. Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Roy Stat Soc Ser B (Methodological) 58(1):267–288. http://www.jstor.org/stable/2346178 15. Hoerl AE, Kennard RW (2000). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 42(1):80–86. http://www.jstor.org/stable/1271436 16. Sheather SJ (2004) Density estimation. Stat Sci 19(4):588–597. http://www.jstor.org/stable/ 4144429 17. Liberti L, Lavor C, Maculan N, Mucherino A (2014) Euclidean distance geometry and applications. SIAM Rev 56:3–69 18. Ranjitkar HS, Karki S (2016) Comparison of A*, Euclidean and Manhattan distance using influence map in MS. Pac-Man 19. Norouzi M, Fleet DJ, Salakhutdinov RR (2012) Hamming distance metric learning. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems, vol 25. Curran Associates, Inc., USA 20. Sammut C, Webb GI (2011) Encyclopedia of machine learning. Springer Science & Business Media, Berlin 21. Chai T, Draxler RR (2014) Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geoscientific Model Dev 7(3):1247–1250. https://doi.org/10.5194/gmd-7-1247-2014 22. Colin Cameron A, Windmeijer FA (1997) An R-squared measure of goodness of fit for some common nonlinear regression models. J Econometrics 77(2):329–342

Chapter 13

An Ensemble Approach to Recognize Activities in Smart Environment Using Motion Sensors and Air Quality Sensors Shruti Srivatsan, Sumneet Kaur Bamrah, and K. S. Gayathri

1 Introduction Activities of Daily Living (ADL) refers to the routine tasks which can be performed by any healthy individual in the absence of assistance. ADLs are broadly classified as basic or physical (BADLs) and instrumental (IADLs). BADLs focus on six activities, which include ambulating, feeding, dressing, personal hygiene, continence, and toileting. IADLs focus on transportation, management of finances, shopping and meal preparation, house cleaning and home maintenance, management of communication with others, and management of medications. Different scales are used to monitor a person's capability of performing simple routine tasks; the Lawton IADL scale and the Katz Index of Independence scale (Katz ADL) are examples of scales applied. ADL is monitored using wearable motion sensors. Motion sensors detect minute movements, which are compiled and stored for further processing to recognize whether a routine task is being performed or not. Some of the types of motion sensors are ultrasonic, passive infrared (PIR), tomographic, and microwave sensors. Sensors have a sensor field; in the presence of any interruption, the field is triggered and sends an activation signal to move the mechanical component. In some smart environments, video feeds are also used to detect postures and movements of an individual under observation for health care. Humans exhibit heat signatures which are captured by the


sensors in the camera. Any movement made sends a signal to the system. Processing of video feeds requires constant monitoring and is computationally intensive. Sensors are used for various purposes, and their applicability has extended to elderly care as well. Medical professionals analyze medical data regularly for diagnosis. There are different types of sensors in use, such as blood oxygen, temperature, pressure, electrocardiogram (ECG), heart, image, and motion sensors. Wearable motion sensors are widely used to understand the workings of the body and help improve movement capabilities [1, 2]. Traditionally, simple motion sensors are attached to different surfaces to identify whether any activity is performed. Most of the data is captured and transmitted on a continual basis for further processing. The sensors are spread across the room, present in the kitchen, dining, and bathing areas. When an individual picks up any item from the kitchen or dining space in a particular sequence, this can be used for further analysis, sometimes even for inferring the proper functioning of cognitive abilities. Newer mechanisms are built to capture ADL data in a non-invasive manner [3, 4]. Machine and deep learning techniques are used to classify the activities performed, and systems are built to recognize ADL [2, 5, 6]. A variety of sensor data is used for the purpose of ADL classification; based on the intent of use, the sensors are pre-installed in a particular space. Novel approaches to ADL classification with the use of indoor air quality are open to exploration. Gas sensors capture information about the chemical composition of the surroundings in a smart environment, providing details about the activity performed in different scenarios. Researchers collect and utilize similar unique sensory input in a useful manner to predict daily routine activities performed by the individual. The paper is divided into various sections. Section 2 summarizes existing studies using sensors in activity recognition tasks for smart environments. The proposed model for activity recognition using motion and air quality sensor data is discussed in Sect. 3. Experimentation performed to draw analysis for the proposed activity recognition system is expanded in Sect. 4. Results for the study are indicated in Sect. 5, followed by the conclusion.

2 Literature Review Intelligent systems are built to understand the physical world. Assisted home care implements intelligent systems that identify activities performed by an occupant in the system or environment. Smart environments are built with IoT systems. An IoT system combines several daily routine objects with the Internet using sensors and actuators [4]. A variety of data is captured by sophisticated sensors in a non-invasive manner such as physiological, behavioral, environmental, and dietary. Sensing technologies are evolving to keep pace with the needs and requirements of the aging population. Smart health care combines wireless communication systems, machine, and


deep learning techniques. There are different types of sensors used to build smart environments such as binary, physiological, location-based, image-based, and environmental. Designing sensors face challenges. Solutions or models such as LoCATE, mORAL, and SUGAR are implemented to track routine tasks such as walking, sitting, climbings, etc., of the occupant in the given space [7]. The multi-class window problem is addressed by U-Net, an activity recognition framework [8]. Health monitoring systems can analyze different aspects of an individual at physical and mental levels. By detecting symptoms at an early stage, relevant and necessary measures can be taken at the right time. Problems related to stress, anxiety, and hypertension are detected using regression techniques from BP and PPG signals. Inertial measurement units (IMUs) and electromyography (EMG) sensors are used to detect body postures or posture recognition and force exertions [9]. Pre-fall detection systems assist the elderly largely, and a dataset KFall is developed for research in the domain [10]. Daily activities or ADL are related to sensorimotor impairments. Providing improvements to impairments using wearable and wireless sensors are explored in [1]. Activity recognition is performed by pattern-recognition algorithms. Sensing systems are developed using mobile devices. The various signals are processed using machine learning to classify the human activities performed. For activity recognition, different forms of sensory inputs can be applied to predict the routine task including wearable, environmental, radio frequencies, and video. The use of multiple sensors to detect the task is explored with the use of machine learning. Activity recognition is based on the classification of sensor data [11]. Activity recognition classification can also be extended to analyze how much energy is utilized for the physical activity involved using metabolic equivalents [METS] values [12]. ADL recognition systems understand the flow of interactions in the environment indicating the occurrence of an activity. An activity is a series of interactions that can also be recognized. Door and motion sensors are most commonly used in activity recognition [5]. In order to identify and classify activities more effectively and further infer cognitive capabilities Hidden Markov chains, fuzzy logic, and web ontology are used. Supervised and unsupervised forms of machine learning are adapted to understand activities in varied contexts. Convnet architectures and adaptive boosting (AdaBoost) help to extract and select unique characteristics to be used for classification tasks. Some of the classification techniques used were bagging, random forest, OneR, Multilayer Perceptron, LogiBoost, etc. [13]. Due to the lack of fully labeled data, other techniques such as Bayesian and Markov networks are implemented. Sensor data have uncertainties that can be addressed by the probabilistic machine learning models [1]. Activity recognition modeling used probabilistic reasoning and ontology approaches. Markov Logic Network (MLN) is proposed to address the series of sub-tasks involved in activity predictions involving monitoring, modeling, and decision-making. The behavior of the occupant and associated pattern generated helps to generate an activity model. Routine activities focus on the repetition of tasks. Activity recognition is based on vision or sensors using data or knowledge-driven approaches. 
Generative approaches apply data-driven mechanisms where Dynamic Bayesian Networks (DBN), Naive Bayes Classifier (NBC),


and Hidden Markov Model (HMM) are used for activity detection. Discriminative approaches are also used to identify daily routine tasks using Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs) [14]. There are several challenges related to the capturing of ADL data [1, 13, 15]. Some of the challenges associated with the overall design of sensors and health care are stated in detail in [3]. A combination of the several movements and even the surrounding atmosphere in the smart environment enables researchers to collect and utilize sensory input in a novel manner to predict the activity performed by the individual. Air quality data is a new approach explored in the prediction of daily routine activities using electrochemical gas sensors [2, 11]. Gas sensors are used to infer the quality of air in an indoor environment, and data analysis on the presence or absence of certain chemicals is performed to identify the activities. Careful calibration of the sensors is not required in all instances; with a combination of electrochemical sensors, relevant details can be captured for the major chemicals present in the air.

3 Proposed Work Activity recognition relies on the interaction of the occupant in a smart environment with the installed sensors capturing data for the prescribed activity. There are two phases explored in this study to predict the activity in different smart environments using different types of sensor data. The phases are • Activity recognition using motion sensors • Activity recognition using air quality sensors. Figure 1 depicts the overall proposed architecture of activity prediction using the two types of ADL datasets. In Phase 1, ADL data captured in a study performed by researchers at Washington State University, the CASAS dataset, is used to predict the daily routine tasks, such as eating, watching TV, or grooming, performed by an occupant under observation in the smart environment built. Activities can also be recognized from indoor concentrations of chemicals captured using gas sensors. Phase 2 showcases the utility of an AQI-ADL dataset used to identify and predict daily routine tasks, such as sleeping, studying, or cooking, in a different smart environment. In order to choose the most relevant features for the task of activity recognition, feature selection is performed on this unique dataset; the mechanism implemented enhances the overall performance of the machine learning model generated for the classification task. Individual machine learning models are built to recognize tasks in Phase 1 and Phase 2. The final proposed approach for activity recognition showcases the different forms of sensor data used, with ensemble learning as the preferred approach for the prescribed classification task. Since most ADL activities may not involve gas sensors, classification is performed using motion sensor data as the preferred choice; in the case of indoor smart environments using gas sensors, the preferred model for the task of activity recognition shifts accordingly.


Fig. 1 Proposed architecture for activity recognition using motion sensor and air quality data

4 Experimental Analysis Activity prediction using motion sensor data and air quality data is performed individually. Simple and ensemble machine learning algorithms are applied to understand the influence and impact of the ADL data used in the activity prediction process.

4.1 Phase 1—Activity Recognition Using Motion Sensors Data In the CASAS study, relevant sensors are installed in locations such as the main door, seat, microwave, cooktop, toaster, cupboard, fridge, shower, toilet, basin, cabinet, and bed. Sensor event files are generated for each participant. Based on the surface touched or used the sensor input is captured and used for activity prediction. The various activities predicted include having a meal, leaving, watching television, grooming, showering, toileting, or sleeping. Details of the activities are seen in Fig. 2. A variety of machine learning algorithms are tested to identify the activity using an activity score. RandomizedSearchCV is used for the process of hyperparameter tuning. The method utilizes a smaller set of hyperparameters to tune rather than scan


Fig. 2 Overview of the number of occurrences of each activity along with the duration of activity versus time

through the entire hyperparameter space. It uses cross-validation to perform the task of activity prediction, where the classifier, the parameter distribution, and the number of folds are considered. An ensemble form of learning, random forest (RF), yields the best results. RF applies bagging, or bootstrap aggregation, so that the individual decision trees are trained on different bootstrap samples of the motion sensor data and remain less correlated with one another; the variance of the decision trees produced for activity recognition is thereby reduced.
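A minimal sketch of this model selection step is shown below, assuming scikit-learn and SciPy; the feature matrix X and activity labels y are assumed to have been prepared from the CASAS sensor event files beforehand, and the parameter distributions are illustrative, not the authors' settings.

    # RandomizedSearchCV samples a small set of hyperparameter candidates for a
    # random forest classifier and evaluates them with cross-validation.
    from scipy.stats import randint
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    param_dist = {                       # illustrative parameter distributions
        "n_estimators": randint(100, 500),
        "max_depth": randint(3, 20),
        "min_samples_leaf": randint(1, 10),
    }

    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions=param_dist,
        n_iter=20,                       # number of sampled candidates
        cv=5,                            # number of cross-validation folds
        scoring="accuracy",
        random_state=0,
    )
    # search.fit(X, y)                   # X, y: prepared feature vectors and labels
    # print(search.best_params_, search.best_score_)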


4.2 Phase 2—Activity Recognition Using Air Quality Sensors Air quality sensor data was collated using six gas sensors, namely MQ2, MQ9, MQ135, MQ137, MQ138, and MQ811, in a smart environment. MQ sensors are a family of sensors used to detect gases such as propane, alcohol, methane, LPG, smoke, benzene, etc. Activity recognition information is generated for four situations. The data focuses on the use of the six sensors, generating around 1900 samples for the varied situations, and each sample has 7 values. The target variables are labeled 1, 2, 3, and 4 and indicate the following: 1. A 'Normal' situation indicates the presence of clean air and allows the individual to perform regular tasks of sleeping, resting, or studying with ease. 2. A 'Preparing a meal' situation indicates forced air circulation with one or more individuals in a given space cooking a meal. 3. A 'Presence of smoke' situation indicates that some article is burning with closed doors and windows, causing smoke to be present in the environment. 4. A 'Cleaning' situation indicates the use of alcohol, a spray, or a liquid detergent. Feature selection is performed on the air quality sensor data using SelectKBest with the ANOVA F-classification function. The method uses score functions in order to drop the irrelevant features that would impact the overall performance of the models generated for the task of activity recognition. The experiment uses feature subsets of 3, 4, or 5 features combined on the basis of the individual 'k' scores generated. Prediction of the daily routine activities using air quality sensor data is the unique approach extended here. Standard machine learning algorithms are applied to the air quality sensor data while strategically selecting relevant features from the feature set. The algorithms tested in Phase 2 include Logistic Regression, K-Nearest Neighbor, Naive Bayes, and Random Forest. The machine learning models were tested with 3-feature, 4-feature, 5-feature, and 6-feature subsets, and the results showed Random Forest to be the preferred choice for the classification task: even with the least number of features involved, the accuracy of the model remained high.
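The feature selection step can be sketched as follows, assuming scikit-learn; the six-column sensor matrix and labels below are synthetic stand-ins for the AQI-ADL data, and k=4 is shown purely as an example.

    # SelectKBest with the ANOVA F-statistic keeps the k highest-scoring gas-sensor
    # channels before the classifier is trained. The sensor matrix here is synthetic.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1900, 6))          # stand-in for the MQ-sensor readings
    y = rng.integers(1, 5, size=1900)       # situations labeled 1-4

    model = make_pipeline(
        SelectKBest(score_func=f_classif, k=4),     # keep the 4 best-scoring sensors
        RandomForestClassifier(random_state=0),
    )
    model.fit(X, y)
    print(model.named_steps["selectkbest"].get_support())   # mask of retained sensors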

5 Results The experiment focused on activity prediction using two different datasets in individual smart environments. Activity prediction performed with simple motion sensor data and with the less conventional air quality sensor data showed that computationally inexpensive sensor data can reliably support activity recognition. Machine learning algorithms are implemented on both forms of data. Results of the machine learning models generated for Phase 1 are indicated in Table 1. In order to maintain diversity across the motion sensor data features, sampling with replacement is used. Performing activity recognition using an


Table 1 Activity recognition using motion sensor data

Model name             Accuracy (%)
Random forest           86.8
Logistic regression     71.71
KNN                     70.23
Naive Bayes             77.42

Table 2 Activity recognition using air quality sensor data

Model name             Accuracy (%)
Logistic regression     88.58
KNN                     98.36
Naive Bayes             83.15
Random forest           97.2

Table 3 Selection of air quality sensor features for dementia detection using ANOVA

Model name            3 features        4 features            5 features                All 6 input
                      S5, S3, S4 (%)    S5, S3, S4, S1 (%)    S5, S3, S4, S1, S2 (%)    features (%)
Logistic regression   79.34             77.71                 81.52                     88.58
KNN                   96.19             95.65                 96.19                     98.36
Naive Bayes           80.43             79.89                 77.71                     83.15
Random forest         96.19             97.82                 96.73                     97.28

ensemble form of learning is more accurate: the Random Forest model achieves the highest accuracy at 86.8%. In Phase 2, the air quality data provides essential information about the presence of a person, smoke, and other conditions needed to recognize daily routine tasks. With a compact feature set, the process of feature selection keeps time and space complexity manageable. Table 2 reports the performance on the full feature set, and Table 3 shows that even a smaller feature set yields an accuracy above 96% for the given classification task. The Random Forest technique is effective in Phase 2 of activity recognition as well, in comparison to the other standard algorithms applied (Table 3).


6 Conclusion Smart health care is emerging to ensure better health services are provided to the needy. Healthcare professionals also benefit from the advent and synchronization of technology in the medical field. Machine learning techniques are applied to sensor data to predict activities performed by an occupant in smart environments. Routine tasks that would consume a lot of expert time if monitored manually can instead be monitored using simple sensors. CASAS is a study performed by researchers which explores the use of motion sensors to detect routine tasks. In addition to this well-known sensory input, an additional set of data is explored: air quality data captured by gas sensors, which provides useful information about the environment without the need for expert calibration. In the experiment conducted, both datasets are used for activity recognition within the individual smart environments. The overall process is conducted in two phases, one for each form of sensor data. Standardized machine learning models are tested to understand the impact of the sensors on activity prediction. In Phase 1, the Random Forest approach provided the highest accuracy of 86.8%. In Phase 2, activity prediction is performed best by a Random Forest model with an accuracy of 96.19% using fewer features. The scope of the experiment can be extended to elderly care and assisted living, improving geriatric care by designing and building suitable frameworks.

References
1. Dobkin BH (2013) Wearable motion sensors to continuously measure real-world physical activities. Curr Opin Neurol 26(6):602
2. Liu J, Sohn J, Kim S (2017) Classification of daily activities for the elderly using wearable sensors. J Healthc Eng 2017:7. https://doi.org/10.1155/2017/8934816
3. Cook D, Crandall A, Thomas B, Krishnan N (2013) CASAS: a smart home in a box. Computer 46(07):62–69
4. Wang J, Spicher N, Warnecke JM, Haghi M, Schwartze J, Deserno TM (2021) Unobtrusive health monitoring in private spaces: the smart home. Sensors 21(3). https://www.mdpi.com/1424-8220/21/3/864
5. Camp N, Lewis M, Hunter K, Johnston J, Zecca M, Di Nuovo A, Magistro D (2021) Technology used to recognize activities of daily living in community-dwelling older adults. Int J Environ Res Public Health 18(1). https://www.mdpi.com/1660-4601/18/1/163
6. Zhu H, Samtani S, Nunamaker J (2020) Human identification for activities of daily living: a deep transfer learning approach. J Manag Inf Syst 37. https://doi.org/10.1080/07421222.2020.1759961
7. Nthubu B (2021) An overview of sensors, design and healthcare challenges in smart homes: future design questions. Healthcare 9(10). https://www.mdpi.com/2227-9032/9/10/1329
8. Zhang Y, Zhang Z, Zhang Y, Bao J, Zhang Y, Deng H (2019) Human activity recognition based on motion sensor using U-Net. IEEE Access 7:75213–75226. https://doi.org/10.1109/ACCESS.2019.2920969
9. Alazzam M, Alassery F, Almulihi A (2021) A novel smart healthcare monitoring system using machine learning and the internet of things. Wirel Commun Mobile Comput 2021:1–7. https://doi.org/10.1155/2021/5078799


10. Yu X, Jang J, Xiong S (2021) A large-scale open motion dataset (KFall) and benchmark algorithms for detecting pre-impact fall of the elderly using wearable inertial sensors. Front Aging Neurosci 13. https://doi.org/10.3389/fnagi.2021.692865, https://www.frontiersin.org/article/10.3389/fnagi.2021.692865
11. Gambi E, Temperini G, Galassi R, Senigagliesi L, De Santis A (2020) ADL recognition through machine learning algorithms on IoT air quality sensor dataset. IEEE Sens J 20(22):13562–13570. https://doi.org/10.1109/JSEN.2020.3005642
12. Kim TS, Cho JH, Kim JT (2013) Mobile motion sensor-based human activity recognition and energy expenditure estimation in building environments. In: Hakansson A, Höjer M, Howlett RJ, Jain LC (eds) Sustainability in energy and buildings. Springer, Berlin, pp 987–993
13. Johanna GR, Paola Patricia AC, Alvaro Agustín OB, Eydy del Carmen SB, Miguel UT, la Hoz-Franco Emiro D, Jorge Luis DM, Shariq Aziz B, Diego M (2021) Predictive model for the identification of activities of daily living (ADL) in indoor environments using classification techniques based on machine learning. Procedia Comput Sci 191:361–366. https://doi.org/10.1016/j.procs.2021.07.069, https://www.sciencedirect.com/science/article/pii/S1877050921014721
14. Gayathri K, Easwarakumar K, Elias S (2017) Probabilistic ontology based activity recognition in smart homes using Markov logic network. Knowl Based Syst 121(C):173–184. https://doi.org/10.1016/j.knosys.2017.01.025
15. Gambi E (2020) Air quality dataset for ADL classification. Mendeley Data 1. https://doi.org/10.17632/kn3x9rz3kd.1

Chapter 14

Generalization of Fingerprint Spoof Detector C. Kanmani Pappa, T. Kavitha, I. Rama Krishna, V. Venkata Lokesh, and A. V. L. Narayana

1 Introduction In the realm of forensics, the history of fingerprint spoofing is almost as old as the history of fingerprint classification. In fact, before it was ever formally posed as a research topic in 1936, the question of whether fingerprints left behind at a crime scene could be forged had already been answered positively in 1924. The adoption of new spoofing materials and techniques to defeat the technologies that are specifically developed to prevent fingerprint spoofing is a recurring theme in the research literature. In the field of biometrics, a spoofing attack happens when an attacker mimics another person's biometric attribute in an attempt to undermine a biometric authentication system. For instance, a fake finger may be created using widely accessible materials such as latex, adhesive, and gelatin, onto which the fingerprint is subsequently transferred: the ridge patterns of a person are imprinted on the surface. An attacker can then mimic the owner of the genuine ridges by presenting the fake finger to a fingerprint sensor. Such attacks pose a direct danger since they rely on freely available resources and do not demand any understanding of the fundamental operation of the underlying biometric identification system. This type of fingerprint spoofing attack can have a success rate of more than 70%.

C. Kanmani Pappa (B) · T. Kavitha · I. Rama Krishna · V. Venkata Lokesh · A. V. L. Narayana Electronics and Communication Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al, (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_14


Biometric recognition systems are being employed in a variety of identification industries due to their convenience and robustness when compared to traditional mechanisms such as passwords. Biometric recognition systems make use of physiological and behavioral features of humans. Fingerprints are one of the most extensively used authentication technologies since they ensure good identification accuracy, are cost-effective, and can be used to process large image databases. Attendance, smartphone identification, forensics, healthcare systems, banking, and many more applications employ fingerprint recognition technologies. As the number of attack tools has increased, researchers have been drawn to create systems that can analyze fingerprints and provide a solution for fingerprint liveness detection, a field that has been developing quickly and attracting academics in recent years. Figure 1 depicts typical fingerprint fabrication attacks.

Fig. 1 Fingerprint fabrication attacks


This section discusses the characteristics of fingerprint images, the most important elements of fraudulent images, and public liveness databases. The first step in developing an anti-spoofing fingerprint identification system is to understand the properties of the images. Numerous characteristics for each feature have been presented in various studies. The general characteristics of fingerprints can be split into three categories.
• Global level: considers the global ridgeline. This is the most popular categorization level, where classes may be derived from global characteristics.
• Local level: refers to the minutiae obtained from the ridges. This level is typically used in the matching process.
• A further level covers other fingerprint image attributes, including the most important features of fraudulent images in public liveness databases.
A third design consideration is the degree of parallelism in processing. In a fully parallel system, each pixel of an N × M array is paired with a separate processor that communicates directly with nearby pixels; this comes at a high price in chip area. The Chinese Academy of Sciences Institute of Automation (CASIA) dataset was released in 2008 to assist researchers with model training and testing; mostly due to the limited number of real and fake images it contains, it is usually integrated with additional datasets. The ATVS dataset is formed from two sub-sets, DWC and DWNC; each set includes both real and fake images obtained from three different sensors. Fake groups are categorized based on the fabrication materials, such as silicone, latex, ecoflex, and wood glue. Figure 2 shows different samples of real and fake images from the ATVS dataset, and Fig. 3 shows a fake image. Software-based alternatives use one of the following descriptors: Local Phase Quantization (LPQ), Binarized Statistical Image Features (BSIF), or the Weber Local Descriptor (WLD); LPQ-related work has also introduced a 2-dimensional local contrast-phase descriptor (LCPD) that analyses the image in the spatial and frequency domains. Convolutional neural networks (CNN) have also been proposed as an alternative to such custom-tailored anti-spoof features. One of the disadvantages of many known anti-spoof algorithms is their poor generalization effectiveness across spoof materials: there might be materials that were not seen during the training phase, and the spoof detection error rate increases for them. To generalize an algorithm's effectiveness across spoof fabrication materials, cross-material performance has been studied in certain works.


Fig. 2 Samples of real fingerprint images (above) and fake fingerprint images (below)

Fig. 3 Live finger versus fake finger identification



CNN follows a hierarchical model, which acts like a funnel when building a network, eventually ending in fully connected layers; in these layers all neurons are interconnected and the output is produced. Figure 3 shows the image of a live thumbprint and its corresponding spoof fingerprint, with the distortions introduced in the spoof highlighted accordingly; local patches extracted around the minutiae near the artifacts are also shown. Images acquired from the MSU Fingerprint Presentation Attack Dataset (MSU-FPAD) are used, with silicone as the fake material for the Cross-Match sensor.

2 Literature Survey In [1, 2], the artifacts introduced in the spoofs are highlighted in red in an example of a real fingerprint and its associated spoof fingerprint, and local patches based on the minutiae are obtained around them. The fake material for the Cross-Match sensor is silicone (Ecoflex). For the real and spoof samples, the proposed method gives spoof scores of 0.06 and 0.99, respectively. The AGNN classifier was also employed in a three-step model. The model attempts to improve the fingerprint image through a process known as image denoising, which aids in delivering better image recognition and categorization. When compared to existing neural network and non-neural network methods, the model outperformed the existing frameworks in terms of precision, sensitivity, specificity, accuracy, and F1-measure. The large volume of the dataset is one of the most significant issues in fingerprint identification. To address this issue, [3] devised a novel strategy to divide the dataset into sub-datasets in order to reduce the search space. The proposed method was built on modified orientation histograms. Instead of the Histogram of Oriented Gradients (HOG) descriptor, the Extreme Learning Machine (ELM) model with an RBF kernel was employed for classification. The model was trained and tested using a noisy fingerprint dataset, the FVC2004 dataset, and the fingerprint classification models were then compared to cutting-edge technologies [2]. An identification approach has been incorporated to assure quality, correctness, and abstention. Feature extraction is based on the wave atoms transform, which does not rely on image quality measurements or image enhancement to lessen the chance of error. The FVC2002 fingerprint datasets were used in that study. To be appropriate for the wave atoms transform, the images were separated into groups of 16. For classifying the fingerprint images, the SVM classifier algorithm was applied and good performance was observed. Researchers have also attempted [4] to increase the generalization performance of fingerprint spoof detectors against spoofs created from materials that are not seen during training. The style (texture) properties of fingerprint images of known materials have been transferred to synthesize fingerprint images corresponding to unknown materials that may occupy the space between the known and unknown materials.


In the deep feature space, there are many known materials [5]. In addition, live synthetic fingerprint images have been added to the database. A CNN trained on this dataset is thus compelled to develop generative-noise-invariant features that differentiate between live fingers and spoofs. In an automated fingerprint identification system, a fingerprint spoof detector is a pattern classifier used to distinguish a real finger from a spoof finger. Such spoof detectors work on the basis of learning from training images obtained from a dataset. As a result, any spoof detector's performance suffers dramatically when it encounters a spoof made from an unusual material that is not represented in the training dataset. When dealing with real-world applications, fingerprint spoof detection therefore has to be approached as an open set recognition problem. The proposed methodology alleviates this security risk by introducing a Weibull-calibrated SVM (W-SVM), which is relatively robust for open set recognition.

3 Working Principle 3.1 Dataset The most recent benchmark dataset, LivDet 2015 [3], has been obtained and used. The provided data includes both training and testing data, each with its own set of variables, as shown in Fig. 4. A set of real and fake fingerprint images is scanned using various optical devices. Throughout this work, the focus is on the digital fingerprint scans, of which there are a total of 2500 testing images (1000 live and 1500 spoof) and 1000 training images (both real and simulated). The optical devices include the Green Bit, Biometric, and Cross-Match sensors. The materials utilized to create the fake fingerprints include ecoflex, gelatin, latex, and wood glue. Furthermore, the test set includes additional spoof materials. In the proposed methodology, there are two phases: the first phase extracts the features from the image, and the second phase trains and tests the extracted features using different machine learning models. Convolutional neural network (CNN) models take the images as direct input. The methods used for extracting the features are explained below, followed by the models that were utilized. In summary:
• A collection of real and fake fingerprint images was scanned by a variety of optical devices.
• This study focused on digital fingerprint image scans during the training phase.
• There are 2500 test images (1000 live and 1500 spoof) and 1000 training images (real and fake).


Fig. 4 Block diagram

• The optical devices used to acquire the images include Green Bit, Biometric, and Cross-Match.
• The fake fingerprints were created from materials including ecoflex, gelatin, latex, and wood glue.
The normalizing factor arctan is used to keep the differential excitation of xj within a restricted range. The number of pixels surrounding xj, such that xi is a neighbor pixel of xj, is denoted n. We selected a 3 × 3 neighborhood for our studies, which is capable of extracting excellent features for liveness identification [6]. Weber's Law [1] is the inspiration for WLD; it states that the just-noticeable difference is a constant proportion of the original stimulus. The gradient orientation of xj is then calculated. The differential excitation (ξ) and orientation (θ) for each pixel are collected and reduced into a single vector; we used 120 bins for differential excitation and 8 bins for orientation in our scenario. Image augmentation adds more variety and quantity to our training data, allowing us to fine-tune our classifiers and avoid overfitting. We used a two-step approach to augment the images, as described in [2]: (1) horizontally flip the image and (2) crop five smaller overlapping images from the original and its flipped copy. These five images are generated separately from the image's four corners and the image's center (see Fig. 5), yielding a total of ten new images for each original sample. This method avoids some of the modifications used in conventional augmentation procedures, since the appropriate preprocessing should be limited to cropping and should not introduce image rotation. Overfitting can also be avoided by reducing the dimensionality. Dimensionality reduction is the process of reducing the original feature set to a


Fig. 5 Image augmentation scheme

smaller collection of features that may be utilized to recover the majority of the variability in the data [7]. One of the most often used strategies for dimensionality reduction is Principal Component Analysis (PCA). By returning lower-dimensional values, PCA generates a collection of linear approximations of high-dimensional data; the resulting low-dimensional linear manifold captures the maximal variability of the higher-dimensional data, as shown in Fig. 5. Another way to extract a one-dimensional feature vector from a picture is to use a CNN. The CNN family of architectures is designed to generate "biologically inspired" feature vectors of pictures. The CNN model consists of a single image input layer with numerous interweaving convolution (with linear filters) and spatial pooling layers, which are then flattened and normalized to create the feature vector of the input image. Spoofing attacks have resulted in a myriad of countermeasures, on both the hardware and software interfaces. Software countermeasures address the liveness detection issue, which is to determine whether a fingerprint image was taken from a real person. Liveness detection is a critical security issue that requires exceedingly appropriate solutions. The difference between a real and a fake image is so subtle that even humans face difficulty in differentiating the images. As a result, texture-based pattern recognition is frequently used in performing liveness detection. PCA is generally used to reduce the dimensionality of the features derived from the distinct local image descriptors. The SNN-BSIF/WLD/LPQ models are multi-layer neural networks with the reduced features as input, and the SNN-MixFeat model combines the three models SNN-BSIF, SNN-WLD, and SNN-LPQ. This study's major contributions are listed below; the models tested are shown in Fig. 6.
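To make the PCA-based reduction step concrete, here is a minimal sketch assuming the descriptor vectors (e.g. BSIF/LPQ/WLD histograms) have already been computed; the array shapes, the 95% variance threshold, and the downstream SVM classifier are illustrative choices, not the authors' exact configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder descriptor vectors (one per fingerprint image) and live/spoof labels.
rng = np.random.default_rng(0)
features = rng.random((1000, 512))
labels = rng.integers(0, 2, 1000)   # 0 = live, 1 = spoof

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),   # keep the components explaining 95% of the variance
    SVC(probability=True),
)
model.fit(features, labels)
print("reduced dimensionality:", model.named_steps["pca"].n_components_)
```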


Fig. 6 Models tested

• Drawing on domain experience, a robust fingerprint spoof detector is developed: a CNN model is trained on local patches centered at fingerprint minutiae. This differs from what has been reported earlier, as most systems use the full fingerprint image for spoof identification.
• The fine-grained representation of fingerprint image patches can detect partial fake fingerprints and localized fingerprint changes and regions.
SVM with BSIF/LPQ/WLD features: BSIF, LPQ, and WLD features were created individually for the entire dataset (training and test images), and corresponding SVM models were developed and tested individually for each set of features. SVM with CNN-RFW features: CNN-RFW characteristics were used to train SVM classifiers. SNN-BSIF/WLD/LPQ and SNN-MixFeat are neural networks that use the BSIF/WLD/LPQ characteristics; PCA is used to reduce the dimensionality of the features derived by the distinct local picture descriptors, the SNN-BSIF/WLD/LPQ models are multi-layer neural networks with the reduced features as input, and the SNN-MixFeat model combines the SNN-BSIF, SNN-WLD, and SNN-LPQ models. CNN-RFW neural network features: NN classifiers were trained on characteristics extracted from convolutional neural networks. The construction of our NN classifier is shown in Fig. 7. A neural network concatenating the CNN-RFW and BSIF feature sets using a Merge layer was used to test how the NN classifier could benefit from several feature sets. Local patches were extracted around the fingerprint minutiae (e.g. for the gelatin spoofs). Each patch receives a spoofness score between 0 and 1; the greater the score, the more likely the patch comes from a forged fingerprint. The spoofness scores of


Fig. 7 a Authentic fingerprints and b spoof fingerprints

the individual patches are combined into a global score for the specific picture. The resulting decision depends on a classification threshold identified on the training dataset: an image whose global spoofness score is below the threshold is recognized as live; otherwise, it is recognized as a spoof. CNNs differ from typical neural networks in that they account for image spatial structure. We employed a CNN that consists of a series of convolutional and pooling layers; for more information, see [8, 9]. VGG16 [9]: we trained the whole VGG16 convolutional model with the output dense layer modified to suit our binary classification. Inception-v3: a custom classifier block was placed on top of the pre-trained model, consisting of a fully connected layer with ReLU activation, a dropout layer with a standard rate of 0.5, and a single output with sigmoid activation. Various CNN architectures have been proposed in the literature; this research study employs a compact convolutional neural network framework because it significantly reduces the model size and training time while improving spoof detection, and it remains a low-latency network able to summarize an input in the range of 100 ms rather than the roughly 800 ms otherwise required. The number of parameters is also greatly reduced by the Inception-v3 network, which requires less computational power; data augmentation and batch normalization are used so that overfitting is avoided. We used the TF-Slim library, a lightweight library built on top of TensorFlow. In the MobileNet-v1 architecture, the last layer is a 1000-unit softmax layer intended to predict the 1000 classes of the ImageNet dataset. A two-unit


layer was substituted for the two-class problem, i.e., live versus spoof. RMSProp was the optimizer used to train the network, with a batch size of 100 and a synchronous gradient descent. To ensure that the trained model is resistant to changes in the pixels, transformations such as lighting manipulation, irregular cropping, and vertical flipping are used. Each model was trained on every edge length of the multiresolution local patches, using the same settings as before. Partial spoofs and incomplete alterations are uncommon ways of disguising the genuine identity from a fingerprint biometric system. Partial spoof fingerprints that hide only a small area of the live finger are mistakenly missed by spoof detectors trained on complete fingerprint images. Furthermore, many smartphones and other embedded devices can only sense a small area of the live finger, so making a decision based on the available detection zone, rather than a complete depiction of the finger, is essential. The fine-grained description of the input image pixels is one of the key advantages of using a patch-based technique for spoof detection, for example when facing a silicone fingerprint spoof that masks only a portion of the live finger. The suggested approach labels small regions as live or fake using minutiae-based local patches. Fingerprints can also be altered by incisions, mutilations, and stitching, among other things, carried out by surgical or chemical methods, which may create fictitious minutiae; the proposed method can be used to draw attention to such specific places. The recommended method identified both fingerprint images as spoofs with spoofness scores of 0.78 and 0.65, respectively.
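The following is a hedged sketch of the head-replacement idea described above, using the Keras MobileNet implementation: the ImageNet classification head is replaced by a two-unit softmax (live versus spoof) and the network is compiled with RMSProp. The input size, pooling choice, and learning rate are assumptions, not values taken from the paper.

```python
import tensorflow as tf

# MobileNet-v1 backbone without the 1000-class ImageNet head.
base = tf.keras.applications.MobileNet(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3), pooling="avg"
)

# Two-unit softmax head for the live-versus-spoof problem.
outputs = tf.keras.layers.Dense(2, activation="softmax")(base.output)
model = tf.keras.Model(base.input, outputs)

model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(patch_images, patch_labels, batch_size=100, epochs=...)  # minutiae-centred patches
```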

3.2 Training and Validation of Our Proposed CNN Model The spoofness score is the output of the softmax layer of the trained MobileNet-v1 model and ranges from 0 to 1; the higher the spoof score, the more likely the input is fake. The spoof scores for the k minutiae-based local patches recovered from an input test image I are averaged to provide a global spoof score S_I for that image: the patch scores centered at each minutia are averaged to obtain the final spoof score, and the local patches include several resolutions. A cutoff threshold chosen to minimize the average classification error is then applied, as shown in Fig. 8. Results and Discussions The performance metrics of the proposed SVM and Neural Network models are shown in this section. Aside from test accuracy, two further metrics are reported: AUC and ACE. The AUC evaluates the probability that the classifier ranks a randomly selected live example higher than a randomly selected fake sample. The average of the false-positive and false-negative rates, i.e., the rate of misclassified


Fig. 8 Correctly classified live fingerprint and incorrectly classified live fingerprint

real instances and the rate of misclassified spoof examples, is measured by ACE. Figures 9, 10 and 11 show the simulation results. Datasets Used The proposed neural network-based implementations are based on the Keras library [4], which runs on top of TensorFlow. For the proposed NN models, a stratified ten-fold cross-validation, similar to that used for the SVM models, is employed. Binary cross-entropy is employed as the loss function. Some of the considered models employ the AdaDelta optimizer, while others use the Stochastic Gradient Descent (SGD) optimizer; compared to SGD, AdaDelta delivers quicker convergence. Moreover, PCA helps in combating overfitting and reducing the gap in validation loss.
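As a minimal sketch of these evaluation quantities (not code from the paper), the global spoof score of an image can be computed as the mean of its patch scores, and ACE as the average of the false-positive and false-negative rates; the variable names and example values are illustrative.

```python
import numpy as np

def global_spoof_score(patch_scores):
    """Average the spoofness scores of the minutiae-centred patches of one image."""
    return float(np.mean(patch_scores))

def average_classification_error(y_true, y_pred):
    """ACE = (false-positive rate + false-negative rate) / 2, with 1 = spoof, 0 = live."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fpr = np.mean(y_pred[y_true == 0] == 1)   # live images classified as spoof
    fnr = np.mean(y_pred[y_true == 1] == 0)   # spoof images classified as live
    return (fpr + fnr) / 2.0

print(global_spoof_score([0.9, 0.8, 0.95]))                        # clearly spoof-like patches
print(average_classification_error([0, 0, 1, 1], [0, 1, 1, 1]))    # -> 0.25
```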

4 Conclusion The proposed study has successfully analyzed machine learning-based fingerprint identification systems and anti-spoofing strategies. A comparison of such models has been presented, and a comparison across multiple datasets has also been carried out. The most commonly used learning classifier in the literature is the SVM classifier. For the training and testing phases, the LivDet 2011 and LivDet 2013 datasets have been used. When a fingerprint spoof detection system is presented with spoofs


Fig. 9 Spoof fingerprint

created from materials that were not seen during the training stage, the error rate of the fingerprint spoof detector tends to increase abruptly. To overcome this, the presented technique provides a strategy for automatically detecting and adapting a spoof detector to locate spoofs generated with novel materials that are discovered during the operational phase, hence addressing the basic open set identification difficulty. To achieve this purpose, a novel-material detector based on W-SVM was developed to detect spoofs built with advanced techniques. The future scope is to incorporate an advanced machine learning-based methodology to detect and distinguish fake fingerprints from live ones by using datasets with newly published live fingerprints.


Fig. 10 Image processing output

Fig. 11 Digital parameters of the image



References
1. Mura V, Ghiani L, Marcialis GL, Roli F, Yambay DA, Schuckers SA (2015) LivDet 2015 fingerprint liveness detection competition 2015. In: 2015 IEEE 7th international conference on biometric theory, applications and systems (BTAS), pp 1–6
2. Fumera G, Marcialis GL, Roli F, Biggio B, Akhtar Z (2012) Biometric authentication solutions are tested for security under real-world spoofing attacks. IET Biometrics 1(1):11–24
3. Bengio Y (2009) Deep architectures for AI learning
4. Pinto N, Cox D (2011) Beyond simple features: a large-scale feature search approach to unconstrained face recognition. In: 2011 IEEE international conference on automatic face and gesture recognition and workshops (FG 2011), IEEE
5. Verdoliva L, Gragnaniello D, Poggi G, Sansone C (2013) Fingerprint liveness detection based on weber local image descriptor. In: 2013 IEEE workshop on biometric measurements and systems for security and medical applications (BIOMS), IEEE, pp 46–50
6. Ghodsi A (2006) A short tutorial on dimensionality reduction, vol 37, no. 38. Department of Statistics and Actuarial Science, University of Waterloo, Ontario, Canada
7. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:1–27
8. Marcialis GL, Coli P, Roli F (2007) Fingerprint vitality detection using the power spectrum. In: Proceedings of IEEE international workshop on automatic identification advanced technologies autoID, Alghero, Italy, pp 169–173
9. Chang S, Liu C, Niu B, Tang M, Zhou Z, Huang Q (2015) An evaluation of fake fingerprint databases utilizing SVM classification. Pattern Recogn Lett 60(1):1–7
10. de Alencar Lotufo R, Nogueira RF, Machado RC (2014) Evaluating software-based fingerprint liveness detection using convolutional networks and local binary patterns. In: Proceedings of the 2014 IEEE workshop on biometric measurements and systems for security and medical applications (BIOMS), IEEE, pp 22–29
11. Rahtu E, Kannala J (2012) BSIF: binarized statistical image features. In: Proceedings of the 21st international conference on pattern recognition (ICPR2012), IEEE, pp 1363–1366
12. Matsumoto T, Matsumoto H, Yamada K, Hoshino S (2002) Impact of artificial "gummy" fingers on fingerprint systems. In: Optical security and counterfeit deterrence techniques IV, pp 275–289
13. Arora SS, Cao K, Jain AK, Paulter NG (2013) 3D fingerprint phantoms. Department of Computer Science and Engineering, Michigan State University, East Lansing, Technical paper
14. Coles S (2001) Statistical modeling of extreme values: an overview. Springer

Chapter 15

Applied Deep Learning for Safety in Construction Industry Tanvi Bhosale, Ashwini Biradar, Kartik Bhat, Sampada Barhate, and Jameer Kotwal

1 Introduction Technology combined with common problems in the construction industry helps to overcome its obstacles. There are [1] various reasons, such as not checking equipment before using it or not wearing that equipment properly, that lead to fatalities and hence make construction one of the most dangerous industries. The top causes of fatalities in the construction industry are falls from heights, being struck by moving objects or moving vehicles, being trapped between objects, and electrocution, which increases the responsibility of the manager/contractor concerning employee protection. For this reason, the CPWD of India [2] has also introduced penalties to be deducted from contractors to increase the safety of workers: contractors are charged an amount for fatal accidents, for injuries, for delays in reporting these, and for failing to preserve the scene of an accident for further inquiries. Worker safety is therefore an important aspect in the field. The safety measures currently performed in the industry make little use of technology. For instance, a training session is given to workers before they start, wearing the equipment is made compulsory, warning signs are installed in risk-prone areas, the weather is taken into account to prevent after-effects on workers, and the construction site is physically monitored for safety. The use of technology to solve such problems will promote easy management and higher safety assurance.

T. Bhosale (B) · A. Biradar · K. Bhat · S. Barhate · J. Kotwal Pimpri Chinchwad College of Engineering and Research, Ravet, Pune, Maharashtra 412101, India e-mail: [email protected] J. Kotwal e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al, (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_15


This paper presents a deep learning solution that makes it easier for managers/contractors to manage the safety of their workers. A convolutional neural network (CNN) model is used to carry out classification and recognition of the safety equipment worn by the workers, i.e. hard hats and safety vests [3]; CNN has been an efficient deep learning model in construction for a long period. The model is trained on annotated data with thousands of images that are augmented and pre-processed. CNNs reduce the time consumed in manual feature engineering, which is the reason why a fully connected convolutional neural network is used for classification in this paper; a pre-trained VGG16 model is also implemented on the same input data. The result accuracies of the CNN model and the VGG16 model are compared, and the best model is chosen for running real-time object detection. Usually, object recognition or classification is carried out using well-known algorithms or pre-trained models, among which the YOLO V3 algorithm and the VGG16 model stand out [4]. When the YOLO V3 model is compared to the VGG16 model using nearly 1000 hand-drawn images, VGG16 turns out to give more accurate results than YOLO. As VGG16 is an object recognition model that recognizes images with one object, and most of that dataset comprised images with a single object, like the input dataset used for this paper, this comparison is relevant here. Various approaches have been used to detect safety equipment on-site in real time [5], such as implementing the YOLO algorithm for detecting the worker, their hat and vest, and indicating that the worker is safe when he/she has worn all the safety equipment [6]. This paper tries to close the gap by implementing the VGG16 model instead of YOLO V3 for object recognition. The author performs recognition using VGG16, a pre-trained model, and recognition using a convolutional neural network model, additionally providing a comparative view of the VGG16 model against the CNN model and exporting the weights of the model with the best accuracy. This is followed by using those weights to carry out real-time object detection and, additionally, indicating the number of workers following and not following the norms by wearing safety hats and vests in the current video frame. This analytical format of the real-time video will help managers/contractors observe and analyse easily [7]. Provided an accurate model, the surveillance of worker safety will become simpler and more efficient [8], hence reducing the fatality rate on construction sites.

2 Literature Review Filatov et al. have implemented object detection on annotated datasets of over a thousand images using neural networks and their pre-trained models, concluding with performance metrics of precision, recall and F1 score for both the train and test datasets [9]. Rubaiyat et al. have implemented a system architecture in which images are segmented and feature extraction is performed using HOG and DCT; on which,


a machine learning algorithm is applied and the result is carried forward for hard hat detection. Their proposed system has two main parts: one integrates frequency domain information from the image with the common person detection algorithm HOG to detect people (i.e. construction workers); the other performs helmet-wearing detection by combining colour information and the Circular Hough transform [10]. The SSD-MobileNet algorithm is also an alternative for performing hat detection [11]. Hoang has presented a paper with a comparison between the YOLO V3 algorithm and the VGG16 algorithm, performing model training on a dataset of around 1000 hand-drawn images and analysing the results [4]. Kamal et al. present a paper to detect workers with and without hard hats, using various algorithms and performing comparisons between them to identify the most optimal option [12]. Natha et al. have discussed three approaches for the safety detection of workers using the YOLO v3 algorithm. The first approach recognizes the worker, hard hat and safety vest separately [13, 14], then applies a machine learning classifier to check which worker is wearing which equipment by considering its bounding region. The second approach shows how the YOLO v3 algorithm can be used directly to recognize the worker together with his/her respective safety equipment. Lastly, in the third approach, the YOLO v3 algorithm is used to recognize the worker, on which a convolutional neural network classifier is applied to further categorize the detections into worker, safety vest and hard hat, concluding with a comparison between these approaches [5]. Xua et al. have presented a paper promoting the use of machine learning algorithms in solving construction industry issues. The paper presents how machine learning has developed and its different subsections, namely shallow learning and deep learning, where supervised and unsupervised learning come under shallow learning and ANNs, RNNs and CNNs come under deep learning; it also discusses applications and challenges and provides future directions for the optimal use of existing algorithms [3]. Akinosho et al. have reviewed the present status and future innovations of deep learning in the construction industry. The paper notes that deep learning is a subset of machine learning and that, even though its numerous layers take longer to train, it gives good accuracy. It also highlights how convolutional neural networks perform well in application areas like image classification, image captioning, object detection and object tracking [15].


3 Methodology 3.1 Data Description This paper deals with a dataset of approximately 2000 images, collected from various online open-source platforms. These images are in the "JPG" format. Pre-processing is carried out on these images, which includes annotating the data and performing data augmentation. Data Annotation Data annotation is the process of classifying and labelling data available in various formats such as text, video, or images. In image annotation, the common annotation types used are bounding box annotations, polygon annotations, semantic segmentation, landmark annotations, multiline annotations and 3D point cloud annotations. Since the annotated image data is used to train the machine learning model, the resulting accuracy will be higher. In this system, the input data is annotated with VoTT (Visual Object Tagging Tool) to extract labels and bounding boxes. VoTT is a free and open-source app for image annotation and labelling developed by Microsoft. The software is used for building end-to-end object detection models from image and video assets for computer vision algorithms. The annotations are exported in the "JSON" file format with the bounding box values, type and various details. The dataset of 2000 images was annotated via VoTT, with a specific colour code distinguishing the safety gear (hat/helmet) and labelling human beings. In Fig. 1, the input is in the form of images and the output is in the form of labelled images that identify the features in the image.

Fig. 1 The VoTT annotation tool software. Performing annotations for vest, hat and human labelling


Fig. 2 Convolutional neural network model architecture used in this paper

Data Augmentation The performance of neural networks usually improves with data availability. Data augmentation is the technique of artificially generating new training data from the existing data. This is done by applying domain-specific techniques to examples from the training data to generate new and different training examples. Transformations cover a wide range of image manipulation operations, such as shift, flip, zoom and more. The aim is to extend the training dataset with new plausible examples. With the help of the bounding boxes in the annotated JSON files, the system crops the labelled regions, resizes them to (32 × 32 × 3), and saves them as new labelled images as part of the data augmentation.
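A minimal sketch of this cropping and resizing step is given below; it assumes the VoTT 2.x JSON layout of an "asset" entry plus "regions" carrying "tags" and a "boundingBox" with left/top/width/height fields, so the exact field names should be checked against the exported files.

```python
import json
from pathlib import Path
from PIL import Image

def crop_regions(json_path, image_dir, out_dir, size=(32, 32)):
    """Crop each tagged bounding box from a VoTT-style export and save a resized copy."""
    data = json.loads(Path(json_path).read_text())
    image = Image.open(Path(image_dir) / data["asset"]["name"]).convert("RGB")
    for i, region in enumerate(data["regions"]):
        box = region["boundingBox"]
        left, top = int(box["left"]), int(box["top"])
        right, bottom = left + int(box["width"]), top + int(box["height"])
        label = region["tags"][0]                    # e.g. "hat", "vest" or "human"
        crop = image.crop((left, top, right, bottom)).resize(size)
        crop.save(Path(out_dir) / f"{label}_{i}.jpg")
```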

3.2 Convolutional Neural Network Model Convolutional neural networks (CNNs) are deep learning algorithms that take an image input, assign significance (learnable weights and biases) to various aspects/objects in the image, and are able to differentiate between them. Figure 2 displays how the CNN model trained by the author takes an RGB image input of size (32, 32). The input passes through convolutional layers and max pooling, is then flattened, and after passing through the dense layers gives the corresponding output classification over the class names [hat, vest, human]. The implemented model is summarized in Fig. 3.
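The following Keras sketch is consistent with this description and with the layer counts in Fig. 13 (two convolutional layers, two dense layers, three output classes); the filter counts, kernel sizes, and optimizer are assumptions rather than the exact configuration summarized in Fig. 3.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(3, activation="softmax"),   # [hat, vest, human]
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```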

3.3 VGG16 Model The VGG16 architecture shown in Fig. 4 is a convolutional neural network (CNN) design that was used to win the ILSVRC (ImageNet) competition in 2014 [16]. It is considered to be among the finest vision models in use today.


Fig. 3 The CNN model summary

Fig. 4 VGG16 model architecture

3.4 System Architecture The system architecture shown in Fig. 5 explains the end-to-end flow of the whole process. It starts with data collection, which is forwarded as an input to the system. This data is pre-processed by carrying out functionalities like data annotation and data augmentation. The refined data is split into training and testing datasets with a ratio of 8:2. The training dataset is then used by the defined CNN and VGG16 models separately. Subsequently, the test dataset is run through these models, the accuracy is calculated, and the accuracy graphs are plotted. By comparing both, we obtain the best model, which will next export its weights for


Fig. 5 System architecture

deploying it into a real-time application, for performing safety equipment detection and providing decisive analytics.

4 Discussion 4.1 Technologies Required 1. Jupyter notebook—Jupyter notebook is a freely available, open-source and very interactive programming environment that allows you to create, open, edit and share Jupyter documents. It is a web-based interactive environment and supports several programming languages popular in data science, such as Python, Scala and R. 2. TensorFlow—TensorFlow is an open-source library for numerical computation with an enormous scope in AI. It additionally bundles a set of deep learning models and algorithms, making them usable through a common interface. It uses Python to provide a convenient and interactive frontend API for building applications. TensorFlow trains and runs deep neural networks for image recognition and supports scalable production prediction with the same models used for training. TensorFlow applications can be run conveniently on a local machine, on a cloud cluster, and on CPUs and GPUs, and it is largely used in this paper. 3. Keras—Keras is a library for neural networks. With Keras, deep learning prototypes can be built quickly and easily and run seamlessly on CPU and GPU. The


Python code allows for easy debugging and extensibility. Keras is user-friendly, modular, composable and easy to use, and is a major component of the implementation in this paper. 4. VoTT—The Visual Object Tagging Tool, abbreviated as VoTT, is free data annotation software. It provides a platform to carry out annotations and to label those annotations simultaneously. It exports the annotations for each image as a separate file with the "JSON" extension. 5. Python—Python is an interpreted, general-purpose programming language. In this paper, Python is used extensively with various libraries: the Python Image Library (PIL, lightweight image processing functionality available in Python), Pandas (used for data analytics and for dealing with various file formats, like JSON in this paper), NumPy (used for working with arrays and matrices), OS (provides functionality for manipulating files and directories), Matplotlib (mainly used for data visualization) and OpenCV (helps with image processing and computer vision tasks).

4.2 Convolutional Neural Network Convolutional neural networks (CNNs or ConvNets) are used to process data with a grid-like topology, such as images. As Fig. 6 shows, the CNN takes as input an image with dimensions (width × height × depth) and eventually outputs a classification over the specified classes. The hyperparameters are where the mathematics of the model takes place: 1. Filter: filters are f × f matrices used for detecting spatial patterns such as edges, embossing, sharpness, etc. 2. Padding: used to keep the output at the size of the original image, using the formula

Fig. 6 CNN architecture


n = m + 2p − f + 1
where m, f and n are the input, filter and output sizes, and the padding amount is calculated as p = (f − 1)/2. In this paper, the padding mode is kept "SAME", so the output size is the same as the input size. 3. Strides: the stride is the step size with which the filter jumps across the matrix. With a stride s, the output size is calculated as n = ⌊(m + 2p − f)/s⌋ + 1. 4. Pooling: it is done to preserve the salient values while reducing the spatial size. Generally, it depends on the filter (f) and stride (s) values.
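As a quick numerical check of these formulas (a sketch, not code from the paper), the helper below computes the output size for a few illustrative settings.

```python
def conv_output_size(m, f, p, s):
    """n = floor((m + 2p - f) / s) + 1 for input size m, filter f, padding p, stride s."""
    return (m + 2 * p - f) // s + 1

print(conv_output_size(32, 3, 1, 1))  # 3x3 filter, SAME padding, stride 1 -> 32
print(conv_output_size(32, 3, 1, 2))  # same filter with stride 2          -> 16
print(conv_output_size(32, 2, 0, 2))  # 2x2 pooling with stride 2          -> 16
```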

4.3 VGG16 Model This CNN model was proposed in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition" at the University of Oxford. It takes an RGB image input of size (224, 224) that passes through convolutional and pooling layers, and finally through three fully connected layers. It is one of the best-known convolutional neural network models because of its performance on the ImageNet dataset of about 15 million labelled images.

5 Results 5.1 Classification Performance Given an input dataset of approximately 9000 images, we analyse the output. To compare the performance, we use the aspects of accuracy and loss. A model's accuracy is quantified in percentages by comparing its predictions with the true values, whereas the loss is a measurement of how well (or poorly) the model is doing: if the errors are high, the loss will be high and the model is not doing a great job. Either way, the lower the loss, the better the model performs.


Fig. 7 CNN model performance

Fig. 8 Graphical performance representation of CNN model

CNN Model Performance In Fig. 7, the CNN model is run for 15 epochs and gives an accuracy of 0.8998 (89.98%) and a loss value of 0.4851 (48.51%). This is also represented graphically in Fig. 8, using matplotlib, a Python library. The performance of VGG16 can be enhanced by transfer learning or fine-tuning. These two are optimization techniques that use a base model and a custom network: in transfer learning, the base model is fully frozen, whereas in fine-tuning, the base model is only partially frozen. To briefly generalize which approach should be used under which conditions:
Large and different dataset → train a CNN model from scratch
Large and similar dataset → use fine-tuning
Small and different dataset → use fine-tuning
Small and similar dataset → use transfer learning.
VGG16 Model Performance In Fig. 9, the VGG16 model is run for 15 epochs and gives an accuracy of 0.7985 (79.85%) and a loss value of 0.5169 (51.69%). This is also represented graphically in Fig. 10, using matplotlib, a Python library.
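To illustrate the fine-tuning option mentioned above, here is a hedged Keras sketch that partially freezes a VGG16 base and adds a small custom head; the number of frozen layers, head sizes, and optimizer are assumptions, not settings reported in this paper.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Pre-trained VGG16 base without its ImageNet classification head.
base = VGG16(weights="imagenet", include_top=False, input_shape=(32, 32, 3))
for layer in base.layers[:-4]:      # freeze everything except the last convolutional block
    layer.trainable = False

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(3, activation="softmax"),   # [hat, vest, human]
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```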


Fig. 9 VGG16 model performance

Fig. 10 Graphical performance representation of VGG16 model

5.2 Pre-processed Image Dataset The dataset consists of approximately 2000 RGB images. These images undergo annotation using VoTT to label the features as "Hat", "Vest" or "Human". On completion of the annotations, the exported JSON file with the details of labels and bounding boxes is forwarded for data augmentation. Data augmentation increases the amount of relevant data by performing functions like flipping, rotating, scaling, cropping and translation for better prediction, which generates a dataset of over 9000 images in this paper. This dataset is divided in a ratio of 8:2 for training and testing. The images are resized to the dimension (32 × 32 × 3), as displayed in Fig. 11.

5.3 Equipment Classification After an image is passed as input, it is converted to an array of pixels and then scaled to (32, 32, 3). This image is given to the CNN classification model, which then predicts the percentage for each class label, i.e. Vest, Hat and Human. A few examples of the results obtained are shown below; the class label with the maximum predicted percentage is taken as the classification result, as per Fig. 12.
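A minimal sketch of this prediction step is shown below; it assumes a trained Keras model such as the one sketched in Sect. 3.2, and the file name and class ordering are placeholders.

```python
import numpy as np
from PIL import Image

# Load an image, convert it to a normalized pixel array of shape (1, 32, 32, 3).
img = Image.open("worker_patch.jpg").convert("RGB").resize((32, 32))
x = np.asarray(img, dtype="float32") / 255.0
x = np.expand_dims(x, axis=0)

probs = model.predict(x)[0]                   # trained CNN classification model
labels = ["hat", "vest", "human"]
print({label: f"{p * 100:.2f}%" for label, p in zip(labels, probs)})
print("predicted:", labels[int(np.argmax(probs))])
```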


Fig. 11 Pre-processed image output. With a dimension of (32, 32, 3)

5.4 Result Comparison The results are obtained after training the CNN and VGG16 models, and a brief comparison is provided in the table shown in Fig. 13.


Fig. 12 Classification percentage output

Parameters                 CNN model         VGG16 model
Input image size/Type      (32,32,3)/RGB     (32,32,3)/RGB
Convolutional layers       2                 4
Dense layers               2                 3
Loss                       0.4851            0.5169
Accuracy                   0.8998            0.7985

Fig. 13 Comparison table for CNN and VGG16 model

6 Conclusion This paper provides a classification of safety equipment and workers used in the construction industry to enhance the monitoring of worker safety, by providing a


processed dataset of thousands of images to the two convolutional neural network models. Firstly, giving this dataset as input to the CNN model returns an accuracy of 89.98%. Similarly, giving the dataset as input to the VGG16 model returns an accuracy of 79.85%. Comparing these models, we conclude that the CNN model performs best. Further, this model can be exported and deployed for real-time detection, which will provide the manager/contractor ease in managing and instructing the workers and monitoring their safety.

References
1. Construction industry-common problems and fatal injuries, Blog (2014). https://www.linkedin.com/pulse/20140823165519-346201029-construction-industry-common-problems-and-fatal-injuries/
2. Safety health and environment handbook (2019)–CPWD. https://cpwd.gov.in/Publication/SAFETY_HEALTH_AND_ENVIRONMENT_HANDBOOK_2019.pdf
3. Xua Y, Zhou Y, Sekula P, Ding L (2021) Machine learning in construction: from shallow to deep learning. Dev Built Environ 6. https://www.sciencedirect.com/science/article/pii/S2666165921000041
4. Hoang L (2018) An evaluation of VGG16 and YOLO v3 on hand-drawn images. Portland State University. https://pdxscholar.library.pdx.edu/cgi/viewcontent.cgi?article=1904&context=honorstheses
5. Natha ND, Behzadanb AH, Paala SG (2020) Deep learning for site safety: real-time detection of personal protective equipment. Autom Constr 112. https://www.sciencedirect.com/science/article/abs/pii/S0926580519308325
6. Han K, Zeng X (2021) Deep learning-based workers safety helmet wearing detection on construction sites using multi-scale features. IEEE Access 10. https://ieeexplore.ieee.org/document/9663184
7. Bahire G, Dhawade SM, Sabihuddin S (2020) A review paper on construction site monitoring and predictive analysis using artificial intelligence. SSRG Int J Civ Eng 7. https://www.internationaljournalssrg.org/IJCE/paper-details?Id=382
8. Daghan ATAA, Kesh SV, Manek A (2021) A deep learning model for detecting PPE to minimize risk at construction sites. In: IEEE international conference on electronics, computing and communication technologies (CONECCT). https://ieeexplore.ieee.org/document/9622658
9. Filatov N, Maltseva N, Bakhshiev A (2020) Development of hard hat wearing monitoring system using deep neural networks with high inference speed. In: 2020 International Russian automation conference (RusAutoCon). https://ieeexplore.ieee.org/document/9208155
10. Rubaiyat A, Toma T, Kalantari-Khandani M, Rahman S, Chen L, Pan C (2016) Automatic detection of helmet uses for construction safety. In: 2016 IEEE/WIC/ACM international conference on web intelligence workshops (WIW). https://ieeexplore.ieee.org/document/7814495
11. Li Y, Wei H, Han Z, Huang J, Wang W (2020) Deep learning-based safety helmet detection in engineering management based on convolutional neural networks. Adv Civ Eng. https://doi.org/10.1155/2020/9703560
12. Kamal R, Chemmanam A, Jose BA, Mathews S, Varghese E (2020) Construction safety surveillance using machine learning. In: 2020 International symposium on networks, computers and communications (ISNCC). https://ieeexplore.ieee.org/document/9297198
13. Thakur R (2019) Towards Data Science-step by step VGG16 implementation in Keras. https://towardsdatascience.com/step-by-step-vgg16-implementation-in-keras-for-beginners-a833c686ae6c
14. Artificial intelligence for construction safety. https://www.linkedin.com/pulse/artificial-intelligence-construction-safety-ahmed-safwat-pmp-

15 Applied Deep Learning for Safety in Construction Industry

181

15. Akinosho TD, Oyedele LO, Bilal M, Ajavi AO, Delgado MD, Akinade O, Ahmed A (2020) Deep learning in the construction industry: a review of present status and future innovations. J Build Eng 32. https://www.sciencedirect.com/science/article/pii/S2352710220334604

Chapter 16

Deep Learning-Based Quality Inspection System for Steel Sheet
M. Sambath, C. Sai Bhargav Reddy, Y. Kalyan Reddy, M. Mohit Sairam Reddy, M. Kathiravan, and S. Ravi

1 Introduction Nowadays, metals such as steel have a significant influence on our everyday lives. Steel is used in a multitude of applications, including towers that carry our electricity lines, pipes in the gas sector, machine tools, weaponry for the army, and so on. With all of these applications, it is clear that steel has earned its place in safeguarding and simplifying our everyday lives. Unlike other metals, steel is the backbone of economies because of its strength and inherent applicability. Steel can be produced at low manufacturing cost, requiring only about one-quarter of the energy needed to produce aluminum, and it is recyclable, so it is not only affordable but also sustainable.

M. Sambath (B) · C. Sai Bhargav Reddy · Y. Kalyan Reddy · M. Mohit Sairam Reddy · M. Kathiravan · S. Ravi Computer Science and Engineering, Hindustan Institute of Technology and Science, Chennai, India e-mail: [email protected] C. Sai Bhargav Reddy e-mail: [email protected] Y. Kalyan Reddy e-mail: [email protected] M. Mohit Sairam Reddy e-mail: [email protected] M. Kathiravan e-mail: [email protected] S. Ravi e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al, (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_16


Steel companies have worked diligently to develop new technologies and enhance the world's strongest and most versatile commodity. Around 2000 steel grades have been developed, 1500 of which are high-grade steels, and there is still plenty of room for new grades with varying characteristics. Steel can realize this potential through new, higher and more flexible grades, which can be achieved by enhancing existing steel structures and applying alloying procedures that maximize the product's utility value.

Industrial parts are critical components in a wide range of industrial products and are widely used in sectors such as aircraft, manufacturing, electronics, and automobiles. Because modern industrial production is highly competitive, the quality of industrial parts and materials is closely tied to end-product quality. Changing tool position, steel material properties, vibrations, tool damage, and poor polishing-process management during production result in surface defects such as dents and scratches on the steel surface; further difficulties include damage and insufficient reflective qualities. Surface flaws in industrial parts affect the product's appearance, so flaw inspection of machined metal sheet has become a key step in assuring product quality, and the need for such inspection keeps growing as the processing requirements of modern industrial products continue to develop.

Most businesses, on the other hand, still rely on manual visual inspection. This method cannot keep up with today's high-speed factory production: the sampling rate is low, the number of missed samples is considerable, the workload and labor demands are heavy, and inspection staff make many errors. The test results are easily influenced by the subjective judgment of the test personnel, so there is no consistent standard. Companies therefore urgently require modern technology and equipment for surface-flaw detection of industrial products. With the rapid advancement of artificial intelligence, it is now possible to address such problems with machine vision rather than manpower. Deep learning, a branch of machine learning and one of its most significant recent developments, is well suited to this task, and convolutional neural networks (CNNs), a type of deep learning algorithm, have been steadily gaining traction in the field of image recognition.

2 Literature Survey Many systems for automated surface inspection are being introduced as technology makes tremendous progress; using this technology, researchers have built a variety of machines and techniques, and short notes on the existing systems are given below.


Prihatno et al. proposed a convolutional neural network model with six convolutional layers to detect flaws in steel using an open dataset on Kaggle; it achieved an accuracy score of 73% with a loss of 0.9, but the model was overfitted and therefore detected defects poorly [1]. Gai et al. used a new Gigabit Ethernet-based acquisition network that is stable and versatile, provides external trigger input for controlling acquisition and output signals, and delivers crisp, stable image quality; the images have high magnification, a large depth of field, little optical distortion, good quality, and strong contrast, and the focal-length approach keeps measurement error low. Their work focuses primarily on image acquisition for a better outcome [2]. Amin et al. chose and explored two deep learning methods, the U-Net model and the deep residual U-Net, to detect defects in steel and compared them; the resulting accuracy was fairly low [15]. He et al. suggested an automated defect inspection system that combines a baseline convolutional neural network with a proposed multilevel fusion network (MFN), both of which need considerable computing resources and are graphically demanding [3]. Fang et al. provided a survey of surface-flaw detection techniques, evaluating around 120 papers from the previous two decades on three popular flat steel products: continuously cast slabs and hot- and cold-rolled steel strips. Based on the nature of the algorithms and the picture attributes, existing approaches are grouped into four categories, statistical, spectral, model-based, and machine learning, and summarized to provide a better understanding of the methods [5]. Wang et al. proposed a template-based algorithm for strip steel surface detection: a guidance template is built through a sorting procedure, and a pixel-wise detection step then locates the flaws precisely by subtracting the guidance template from the sorted test image [4]. Nand et al. used an entropy-based flaw detection algorithm, with image illumination compensation as a preprocessing step [6]. Mao et al. presented an auto-encoder algorithm for intelligent fault diagnosis, and the efficiency of the proposed method is demonstrated using data from rolling element bearings [7]. Manual examination of steel products for faults is a time-consuming and inconvenient way to carry out steel defect inspection: inspectors can identify defects in only a limited number of sheets per hour, the resulting delay affects production rates, and this causes economic losses for the industrial sector as a whole [8]. As part of Industry 4.0, which incorporates IoT, big data, and artificial intelligence (AI), automatic defect detection is essential for steel production enterprises [9]. To address the inefficiencies of manual inspection, machine vision-based solutions have been developed. Deep learning techniques can be used in the steel sector to identify flaw patterns in steel sheets, and these techniques give improved results when dealing with large volumes of steel image data. Many researchers have studied steel flaw detection using machine learning approaches in an attempt to overcome this problem, and many ways of training neural network models for defect detection in machine vision applications have been presented [10].
At the same time, AVI (automated visual inspection) surveys covering a wide range of inspection issues are available, and recent surveys increasingly focus on specific planar materials, such as cloth, in their


findings [11]. Defect identification may be performed using a multi-class defect classification approach based on the similarity of image pixels, which specializes in mining information implicitly present in texture images. Industrial defect detection requires real-time and anti-noise capabilities, according to Bulnes et al. [12]. Although statistical techniques make up the majority of the steel-surface detection literature, many of them fail to identify defects with delicate intensity changes (such as thin roll marks and microscopic scratches) when illumination fluctuates or pseudo-defects appear often. As a result, emerging AVI approaches for detecting steel surface defects in real-world manufacturing are eagerly anticipated; an early review of an AVI system for hot steel slabs dates back to 1983 [13].

3 Existing System The primary existing framework is a technique based on a CNN architecture with six convolutional layers; there is no evidence that simply adding more layers to the model gives better accuracy, and hence the model failed to be a good fit [1]. Another line of work develops machine learning models that detect multilevel faults in sample steel-sheet photographs and classify them into their respective classes; to solve the steel defect identification problem, two deep learning methods are investigated, U-Net and deep residual U-Net. Other systems categorize and detect steel surface flaws using the convolutional neural network approach in deep learning: first, industrial cameras capture and pre-process photos of steel defects in order to obtain useful datasets; second, the VGG model is utilized to strengthen the network in order to improve defect detection and realize defect classification and recognition. This approach outperforms existing methods in terms of accuracy and efficiency. Because of the varied patterns of defects, the disruption caused by pseudo-defects, and the unpredictable gray-level organization of the background, automatic defect identification on strip steel surfaces is a difficult job in computer vision. A simple guidance template-based approach for detecting strip steel surface defects has also been proposed. First, a large number of defect-free photos are gathered in order to calculate the statistical characteristics of normal texture. Second, an initial template for each test picture is created based on the statistical features and the size of the test image. The given test picture is then subjected to a sorting procedure. Furthermore, based on the specific intensity distribution of the sorted test picture, a unique guiding template is constructed by modifying the initial template; the background of each test image is thus roughly reconstructed in the guiding template. Finally, using pixel-wise detection, the faults can be reliably found by subtracting the guiding template from the sorted test picture, reverse sorting, and determining an adaptive threshold.


4 Proposed Work The proposed idea is to build a deep learning model to detect and classify defects in steel sheet. The proposed system uses images of steel acquired from cameras during steel production. Manufacturing steel involves various processes, and as the steel passes through many machines it can be damaged; these damages include inclusions of other metals, cracks, scratches, and so on. The collected dataset of steel images contains images belonging to different classes based on the damage on the steel. A model is then built based on a CNN. The convolutional neural network, commonly known as convnet or CNN, is a well-known approach in computer vision applications. It is a type of deep neural network used to assess visual data, commonly employed to recognize items in a photograph or video, and also applied in image or video recognition, natural language processing, and other applications. The system has two modules: one trains the model, and the second tests it. After training, the model is able to detect the defect on the steel. Because multilabel classification is used, the model can classify an image into different categories according to the defect, such as tiny spots and large patches; from this category-wise classification it can be determined whether the steel has a minor (or close to no) defect or a major defect. The existing system has low accuracy and high loss, so first the dataset is pre-processed using various libraries available in Python, such as pandas and NumPy. NumPy is a Python package that makes it possible to work with arrays and includes utilities for working in the linear algebra domain; pandas is an open-source Python library for data analysis and manipulation. After data pre-processing, the image dataset is loaded and every image is read using the OpenCV library, a widely used image processing and computer vision toolkit. Every image is then resized; because neural networks only accept inputs of the same size, all photos must be scaled to the same size before being fed into the CNN, and the resized images are fed to the model.
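A minimal sketch of the loading-and-resizing step described above is given below, assuming the images sit in one subfolder per defect class and using the 120 × 120 target size mentioned later in Sect. 6.4; the folder name and variable names are illustrative, not taken from the paper.

```python
import os
import cv2
import numpy as np

IMG_SIZE = 120                 # target width/height (Sect. 6.4 mentions a 120 x 120 x 3 input)
DATA_DIR = "steel_images"      # hypothetical folder with one subfolder per defect class

def load_images(data_dir=DATA_DIR, img_size=IMG_SIZE):
    """Read every image with OpenCV, resize it, and collect its class label."""
    images, labels = [], []
    class_names = sorted(os.listdir(data_dir))
    for label, class_name in enumerate(class_names):
        class_dir = os.path.join(data_dir, class_name)
        for fname in os.listdir(class_dir):
            img = cv2.imread(os.path.join(class_dir, fname))   # BGR image as a NumPy array
            if img is None:                                     # skip unreadable files
                continue
            img = cv2.resize(img, (img_size, img_size))         # CNN needs a fixed input size
            images.append(img)
            labels.append(label)
    return np.array(images, dtype="float32") / 255.0, np.array(labels)

X, y = load_images()
print(X.shape, y.shape)   # e.g. (12997, 120, 120, 3) (12997,)
```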

5 Architecture Diagram The system starts by taking an input image from the user in order to detect defects. The model architecture is based on a CNN, with three convolutional layers and three max-pooling layers. Any image given to the model passes through the three convolutional layers; each convolutional operation produces a feature map for the given image, to which an activation function is applied. The proposed model uses ReLU as the activation function because it is fast to compute. The feature map obtained after each convolutional operation then goes through max pooling, which reduces the size of the feature map without losing the main features. This process is repeated three times.


Fig. 1 Architecture diagram

The resulting features are then passed through a fully connected (dense) layer of neurons, and the output layer has four neurons corresponding to the four classes of defects. Figure 1 shows the architecture diagram of the framework, which is used to follow the flow and design of the model; it gives a clear idea of the model design and explains how an input image flows through the model.

6 System Design 6.1 Analysis of Experimental Data This module describes the steps relevant to dataset collection and pre-processing. After acquiring the dataset [14], the dataset and the required libraries are imported to perform data cleaning and visualization; since the images belong to 4 classes, categorical mode is used. Among the 12,997 images, 7095 photos contain flaws, while 5902 images are defect-free. Looking at the distribution of training photos across the four categories, Class 3 has the most defects, with the largest class accounting for about 56% of all defect photographs and the smallest class contributing only about 0.03% of the defect images; the number of faulty sample photos of each class in our dataset was then increased. As a next step, the dataset is pre-processed using various libraries available in Python, such as pandas and NumPy. NumPy is a Python package that makes it possible to work with arrays and includes utilities for working in the linear algebra domain; pandas is an open-source Python library for data analysis and manipulation. After data pre-processing, the next step is to load the image dataset and read all the images using the OpenCV library. OpenCV


is a widely used image processing and computer vision library. Every image is then resized: because neural networks only accept inputs of the same size, all photos must be scaled to the same size before being fed into the CNN, and the resized images are fed to the model.

6.2 Building Model A Sequential model is utilized for this system; the Sequential class is a way of developing deep learning models by adding the different model layers one after another. The first convolutional layer has 32 filters, a value chosen by trial and error; the beauty of the convolutional neural network is that it learns which filters matter, so one only needs to load and specify the filters and the filter size used to search for features in the images. The ReLU activation function is used for the hidden layers because it is fast to compute. After convolution, the model has a pooling layer; among the many types of pooling, max pooling is used here to retain the important features while reducing the size of the image. For better feature extraction, the images pass through such layers several times, and the output layer has 4 neurons. In this module, the dataset is split into two parts, a test set and a train set, the data is reshaped, and the train set is fed to the model for training.
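A minimal Keras sketch of the model described above and in Sect. 5 follows: three convolution/max-pooling stages starting at 32 filters, ReLU activations, a dense layer, and a four-neuron softmax output. The 120 × 120 × 3 input shape is taken from Sect. 6.4, while the filter counts beyond the first layer and the dense-layer width are illustrative assumptions.

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(120, 120, 3), num_classes=4):
    # Three convolution + max-pooling stages, as described in Sect. 5
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),    # filter counts here are assumptions
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"), # four defect classes
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",       # labels in categorical mode (Sect. 6.1)
                  metrics=["accuracy"])
    return model

model = build_model()
model.summary()
```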

6.3 System Training and Testing Procedures This is the most crucial aspect of our model. 15 epochs are given as a parameter to train the model, which improves accuracy. The benefit is that training runs only once: the trained model is saved, which avoids having to train it again and again. Once an image is fed to the flaw detection model, it can predict the result in very little time; the model has three convolutional layers. The hardware and software platforms on which the testing was performed are shown in Table 1. Table 1 Software and hardware requirements

Software and hardware | Model features
OS | Windows 10 or 11
Python framework | Tensorflow-gpu 1.9.0
System processing | Intel Core i7 8th Gen
Graphic cards | Radeon 2G
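A minimal sketch of the training and saving step is shown below, reusing the `model` from the previous sketch and the `X`, `y` arrays from the loading sketch in Sect. 4; the one-hot encoding, the 80–20 split (reported in Sect. 7), and the file name are assumptions made for illustration.

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# Labels are one-hot encoded for the 4 defect classes
y_cat = to_categorical(y, num_classes=4)
X_train, X_test, y_train, y_test = train_test_split(X, y_cat, test_size=0.2, random_state=42)

history = model.fit(X_train, y_train,
                    epochs=15,                         # 15 epochs, as stated above
                    validation_data=(X_test, y_test))

model.save("steel_defect_model.h5")                    # saved once so training need not be repeated
print(model.evaluate(X_test, y_test))
```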


6.4 User Interface Now that the CNN model is ready, a UI (user interface) is created using Streamlit. A new Python file is created, and it requires libraries such as streamlit, PIL, TensorFlow Hub, and the TensorFlow/Keras model utilities. The script starts with a header that gives the title of the Streamlit UI, which is the project title. Two functions are created: one to upload an image and the other to obtain the predicted class of the given image and the class probabilities. In the first (main) function, the system asks the user to upload an image with st.file_uploader, displaying the text "please upload an image", and restricts uploads to the specified allowed file types. If the uploaded image is not null, the system proceeds to process and plot the uploaded image. The second function predicts the class of the image: as a first step, the saved model (with the .h5 extension) is loaded from the specified path using tf.keras.models, and then the image is reshaped to the same size used during training of the model; the input shape is 120 * 120 * 3, where three represents the RGB channels. The dimensions are then expanded, and the four class names are declared. The prediction is made by passing the input image through the model with the softmax activation, and NumPy's argmax picks the class with the highest probability, which indicates the major defect in the steel.
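A minimal Streamlit sketch of the interface described above follows, assuming the `steel_defect_model.h5` file saved in Sect. 6.3; the model path and the class names are illustrative assumptions.

```python
import numpy as np
import streamlit as st
import tensorflow as tf
from PIL import Image

CLASS_NAMES = ["defect_1", "defect_2", "defect_3", "defect_4"]   # illustrative labels

st.title("Deep Learning-Based Quality Inspection System for Steel Sheet")

@st.cache_resource
def load_model():
    # Load the model saved during training (Sect. 6.3)
    return tf.keras.models.load_model("steel_defect_model.h5")

def predict(image, model):
    # Resize to the training input shape (120 x 120 x 3) and add a batch dimension
    img = np.array(image.resize((120, 120)), dtype="float32") / 255.0
    img = np.expand_dims(img, axis=0)
    probs = model.predict(img)[0]
    return CLASS_NAMES[int(np.argmax(probs))], probs

uploaded = st.file_uploader("Please upload an image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Uploaded image")
    label, probs = predict(image, load_model())
    st.write(f"Predicted major defect: {label}")
    st.write({name: float(p) for name, p in zip(CLASS_NAMES, probs)})
```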

7 Results The data is gathered from Kaggle [14] and contains pictures of various types of defects. The steel dataset has around 12K images, of which 7095 are defective. The dataset is separated into train and test sub-directories for easy access, and the train-test split was done in an 80–20 ratio. After training on the train data for 15 epochs, the model gave an initial accuracy of 72.5% at the first epoch and reached 77.8% at the final epoch; both accuracy and loss are plotted for the model. The loss curve in Fig. 2 shows how well the model fits the training data and the validation data; the curve indicates that the model is a good fit, or close to one. Figure 3 shows the accuracy curve of the model. The experimental outcomes are displayed in Fig. 4 and contrasted with the best outcomes produced by a generative model; it is evident from the Fig. 4 outcomes that discriminative strategies can efficiently handle the test image. Testing the interface is an important step: the interface created with the Streamlit app lets the user upload an image and get the defect type along with the probability of each defect class, and the system takes input from the user to check whether the web app is able to load and plot the image.


Fig. 2 Loss curve

Fig. 3 Accuracy curve

The system should display the image uploaded by the user; it then resizes the image, feeds it to the model, and the output is displayed at the end.


Fig. 4 Output

8 Conclusion Thus, a model based on a CNN architecture has been developed and deployed to detect and classify defects into four different classes with an accuracy of 77.8%. A user interface has also been developed using Streamlit, through which the user can submit an image and get the major defect class as the output. The model divides the output into four classes, each covering a different kind of defect such as dots, lines, scratches, and patches, so the model can also detect multiple defects in a single image and display the probability of each class for the image.

9 Future Scope The system can be further developed with more accuracy by providing more data and can be developed into automated software with more control over production. New deep learning algorithms are emerging every day, so a new model can be built, and the system can also be trained on images of different surfaces. As an extension, hardware could be developed to mark the defects on the steel so that a person can identify and correct them.

References 1. Prihatno AT, Utama IBKY, Kim JY (2021) Metal defect classification using deep learning 2. Gai X, Ye P, Wang J, Wang B (2020) Research on defect detection method for steel metal surface based on deep learning 3. He Y, Song K, Meng Q, Yan Y (2019) An end-to-end steel surface defect detection approach via fusing multiple hierarchical features


4. Wang H, Zhang J, Tian Y, Chen H, Sun H, Liu K (2018) A simple guidance template-based defect detection method for strip steel surfaces 5. Fang X, Liu L, Yang C, Luo Q (2019) Automated visual defect detection for flat steel surface: a survey. IEEE Trans Instrum Meas 637–641 6. Nand GK, Neogi V (2014) Defect detection of steel surface using entropy segmentation 7. Mao W, Yan Y (2016) Bearing fault diagnosis with auto encoder extreme learning machine 8. Abu M, Amir A, Lean YH, Zahri NAH, Azemi SA (2021) The performance analysis of transfer learning for steel defect detection by using deep learning. J Phys Conf Ser 1755 9. Prihatno AT, Nurcahyanto H, Jang YM (2020) Smart factory based on IoT platform. pp 2–4 10. Tulbure AA, Tulbure AA, Dulf EH (2021) A review on modern defect detection models using DCNNs–deep convolutional neural networks. J Adv Res 11. Kumar A (2008) Computer-vision-based fabric defect detection: a survey. IEEE Trans Ind Electron 348–363 12. Bulnes FG, Usamentiaga R, Garcia DF, Molleda J (2012) Vision-based sensor for early detection of periodical defects in web materials. Sensors 10788–10809 13. Suresh BR, Fundakowski RA, Levitt TS, Overland JE (1983) A real-time automated visual inspection system for hot steel slabs. IEEE Trans Pattern Anal Mach Intell 563–572 14. Severstal steel defect detection, https://www.kaggle.com/c/severstal-steel-defect-detection/ data 15. Amin D, Akther S (2020) Deep learning based defect detection system in steel sheet surfaces

Chapter 17

Forecasting Prediction of Covid-19 Outbreak Using Linear Regression
Gurleen Kaur, Parminder Kaur, Navinderjit Kaur, and Prabhpreet Kaur

1 Introduction Covid-19 is the name given to the disease caused by a virus that leads to a disorder of the respiratory system. Because of the "crown-like spikes" around the surface of the virus, it was named "corona." Coronaviruses are RNA viruses and fall into 4 types, Alpha, Beta, Gamma, and Delta; the first two infect humans, and the other two infect birds [1]. The virus is also termed "2019-nCoV" or "SARS-CoV-2", and the disease is very commonly called Covid-19. The Covid-19 outbreak was declared a pandemic by the World Health Organization in March 2020. The first case of a SARS virus in the human body was found in 1965. More than 1000 coronaviruses have been found on Earth, of which 7 affect humans; SARS-CoV (which started in China in 2003) and MERS-CoV (which started in Saudi Arabia in 2012) are the best known [2]. As the situation grew disastrous with time, it forced nations and the governments of every state and country to take strict actions such as the implementation of partial or full lockdowns as per the need [3]. Moreover, many organizations such as the WHO and staff members of various hospitals encouraged societies and individuals to keep protective measures such as avoiding physical contact, avoiding going out of the house, washing face and hands as much as possible, avoiding crowds, wearing masks while talking and going out, and sanitizing regularly.
G. Kaur (B) · P. Kaur Computer Engineering and Technology, Guru Nanak Dev University, Amritsar, India e-mail: [email protected] P. Kaur e-mail: [email protected] P. Kaur Guru Nanak Dev University, Amritsar, Punjab, India e-mail: [email protected] N. Kaur Department of Computer Science, Guru Nanak Dev University, Amritsar, Punjab, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_17


People were also urged to sanitize properly and to seek medical care when needed. Moreover, the government nowadays provides free vaccines to everyone to protect them from the illness [4, 5]. When someone presents with fever or a respiratory illness, a sample is taken by medical staff through the nostril to check whether he or she is suffering from Covid-19. If a person gets a positive report, it means he or she is a corona patient; special care is given, and the patient is kept isolated to avoid further spread [6].

The starting point of Covid-19 was the seafood market of Wuhan, a city in China [7]. The coronavirus started in China, where 44 patients suffered from pneumonia, and spread to all 7 continents within no time due to the lack of awareness. The first case of coronavirus in India was reported in Kerala: a lady who came back from Wuhan on 23 Jan 2020 with a dry cough and sore throat; a sample was taken and sent to the ICMR (Pune), where she was declared corona positive [8]. According to later reports, India has the second highest number of confirmed Covid-19 cases after the USA and the third highest number of death cases after the USA and Brazil. The virus is transmitted from human to human when people cough, sneeze, come in contact, or stay at a distance of less than 6 ft from an infected person. The risk of coronavirus is higher for older people, diabetic persons, people with neurological disease, heart disease, or lung disease, pregnant women, and people with HIV [9].

The government imposed a lockdown to fight the pandemic. On the other hand, the lockdown became a stressful period for every individual, be it a working person or a student. Students felt difficulty with their studies in online mode as their concentration was not stable, and some students did not attend classes due to the lack of Internet and hardware facilities. More than half of the population lost their jobs during the pandemic as every task was shifted online. Apart from these, there was an increase in depression, anxiety, weight gain, and many other illnesses [10]. At the beginning of Covid-19, no vaccines were available; the only measure was to isolate oneself and stay motivated. Later, several vaccines and vaccination programs such as COWIN, COVAX, and Sputnik V were introduced by different countries. Vaccination helps a nation fight the coronavirus and new variants of SARS, as it boosts immunity and helps the human body fight against the virus [11].

As doctors play an important role in fighting Covid-19, so do researchers, who implement various methods to overcome the pandemic using different technologies. In this era, machine learning and deep learning have become a helping hand, as they detect the future pattern of coronavirus and estimate the end of the pandemic. Using different classification and regression algorithms, one can analyze the trend of Covid-19. So, this paper works on a real dataset of Covid-19 using machine learning models and finds which model is the best at predicting the Covid-19 result. Figure 1 portrays the confirmed cases of Covid-19 with respect to countries; it shows how Covid-19 cases spread across the world.


Fig. 1 Covid-19 confirmed case growth rate across the world

2 Literature Review Since 2020, a lot of study and research has been performed to analyze the future trend of Covid-19. Nemati et al. [12] used Gradient Boosting and various other machine learning models to predict the upcoming trends; for metric evaluation, they used the C-index. The algorithms used in their research are IPCRidge, Fast SVM, COXPH, Fast kernel SVM, Stagewise GB, Componentwise GB, and Coxnet, and the data was collected from a GitHub repository. Lauer et al. [13] estimated the incubation period of coronavirus and discussed the 3 most used distributions for the incubation period (Erlang, Gamma, and Weibull); to calculate the probability of symptomatic infection, they used the lognormal model, and they used the monitor package and CoarseDataTools for analysis purposes. Sharma et al. [14] used the SVM kernel function and the Cuckoo search algorithm to improve accuracy. The main focus of the study was to apply the SVM machine learning classifier, and mRMR (Minimum Redundancy Maximum Relevance) was the hybrid feature selection technique used in the research; accuracy, specificity, F-score, precision, and sensitivity were used for performance evaluation. Ghafouri-Fard et al. [15] developed a model to analyze and predict the daily new Covid-19 cases. Various methods and technologies were used for predicting coronavirus future results, such as LSTM, ANN, RNN, ANFIS, MLP, ARIMA, and PRISMA. To compare the accuracy of the models, RMSE, MAPE, MAE, and R2 were used for performance evaluation; LSTM and ARIMA had the highest values of MAPE. Also, three hybrid methods were proposed for the prediction of Covid-19: LSTM, CNN (with the Bayesian algorithm), and multi-head attention.


Pahar et al. [16] worked with two datasets (the Coswara and SARCOS datasets). They used various ML classifiers such as MLP, Logistic Regression, KNN, SVM, DNN, LSTM, Resnet50, and SMOTE. Resnet50 had the highest AUC (0.98) when working on the Coswara dataset, and they achieved an accuracy of 95.3%, sensitivity of 93%, and specificity of 98%. Yadav et al. [17] proposed a model to analyze Covid-19 using ML models; SVM was used to examine 5 different tasks relevant to coronavirus, with the SVR model used for the first four tasks and Pearson's correlation method used for the 5th task. Dairi et al. [18] used LSTM-CNN, GAN-GRU, CNN, GAN, LSTM, RBM, SVR, and Logistic Regression to improve performance; LSTM-CNN produced the best output, followed by the GRU-GAN model, and prediction was carried out on time-series data taken from Johns Hopkins University. Gothai et al. [19] trained a model to analyze data using various supervised ML algorithms such as Linear Regression, SVR, the Holt-Winters model, and time-series algorithms. The Covid-19 dataset used was collected from the Johns Hopkins University repository, and the pandas and NumPy libraries were used to extract relevant features. The Holt-Winters model was used to accurately analyze the future pattern of Covid-19 cases, and they achieved 87% accuracy (Table 1).

3 Proposed System The workflow of the proposed Covid-19 prediction system using the various models is shown in Fig. 2.

3.1 Dataset Description While doing research in any field, the first step is to collect the dataset. The data used in the present study is taken from GitHub and Kaggle (Johns Hopkins University) [36, 37]. The downloaded data contains 6 CSV files. The dataset contains time-series data of confirmed cases, recovered cases, and death cases for 277 countries/regions listed alphabetically, from 22/01/2020 to 29/05/2021. Along with the Country/Region, the dataset also gives the latitude, longitude, and State/Province, so that it becomes easy for researchers to understand the data and work with it. The data shows how the confirmed, death, and recovered cases increase or decrease with time. The total number of confirmed cases is 169,951,560, the total number of death cases is 3,533,619, and the total number of recovered cases is 107,140,669. The dataset gives an accurate picture of the pandemic as it is a real-time dataset collected from Johns Hopkins University [36]. The dataset is examined in detail using a Jupyter notebook and Python. Tables 2, 3, and 4 show an excerpt of the dataset used in the present study of Covid-19.

Table 1 Comparison of literature review

Author, Year | Dataset | Classification methods | Parameters, merits, and demerits
Punn et al. [20] | Johns Hopkins (JHU_CSSE, https://github.com/CSSEGISandData/COVID-19) | EDA, SVR, DNN, RNN, LSTM, polynomial regression | Supervised ML and DL models used on real-time data with graphical trend prediction
Bhadana et al. [21] | api.covid19india.org | Supervised ML classifiers | —
Rustam et al. [22] | Johns Hopkins | Linear regression, LASSO, SVM, ES (exponential smoothing) | R2-score, adjusted R2, MSE, RMSE, MAE; ES performed best on the real dataset, SVM performed poorly, and the dataset used was limited
Kumari and Toshniwal [23] | Mygov.in/Covid-19 (time-series data) | Linear regression, ARIMA, ANN | ANN used to predict future Covid-19 patterns and verified with mathematical models; the study can be extended with more metrics and data
Yao et al. [24] | www.tjh.com | SVM-based severity detection | Accuracy 81.48%, sensitivity 76%, specificity 69%; limited dataset and few performance metrics
Hossen and Karmoker [25] | Kaggle.com | Random forest, SVM, KNN | Graphical representation of results and future trends; no evaluation metrics used in the study
Liu and Xiao [26] | National Health Commission | LSTM with a SEIR model and gradient-descent curve fitting | MSE; requires more data
Khanday et al. [27] | Johns Hopkins University | Logistic regression, SVM, naive Bayes, decision tree, random forest, bagging, AdaBoost (TF/IDF features) | Accuracy 96.2%, precision 94%, recall 96%; classification and ensemble techniques used
Nemati et al. [12] | GitHub repository | IPCRidge, Coxnet, stagewise GB, componentwise GB, fast SVM, fast kernel SVM | C-index; gradient boosting best for predicting future patterns; small dataset with unknown survival-data distribution
Aljameel et al. [28] | King Fahad University | Logistic regression, random forest, XGBoost, SVM, KNN, K-means (with SMOTE and 10-fold cross-validation) | Accuracy 95%, AUC 99%; XGBoost best at overcoming overfitting; some imbalance remains in the dataset
De Souza et al. [29] | Espirito Santo state portal | SVM, naive Bayes, KNN, neural network, decision tree | Accuracy and sensitivity reported; few performance metrics used
Pourhomayoun and Shakibi [30] | Nature.com | SVM, ANN, KNN, random forest, decision tree, logistic regression | Accuracy 89%; data-driven analytics with KNN-based imputation of missing values
Abirami and Kumar [31] | www.who.org | Logistic regression, linear regression, naive Bayes, decision tree, XGB, KNN, SVM (Python) | Accuracy, AUC, sensitivity, recall, F1-score, and confusion matrix reported; both regression and classification used
Anupam et al. [32] | Coswara | Random forest, SVM, KNN, decision tree, linear regression | Accuracy 96.9%, sensitivity 96.7%, precision 99.1%; non-contact screening test; SVM best among the used ML models
Chowdhury et al. [33] | Worldometer | ANFIS, LSTM (MATLAB, Python, TensorFlow, Keras) | MAPE and RMSE reported; correlation coefficient of 0.75; ANFIS best for predicting future patterns; limited data
Gupta et al. [34] | NewYork-Presbyterian Hospital | Naive Bayes, random forest, decision tree | Accuracy 98.12%
Kwekha-Rashid et al. [35] | Science-Direct, Springer, Hindawi, and MDPI | Review of supervised and unsupervised learning (naive Bayes, logistic regression, KNN, K-means, ANN, reinforcement learning) | Accuracy 92.9%; classification 86%, regression 7.1%, clustering 7%; supervised learning gives better output than unsupervised


Fig. 2 Workflow of the proposed system of Covid-19 using various models

3.2 Data Pre-processing Data preprocessing is an important step to remove noise from the dataset and enhance its quality [38]; this step is important for maintaining accuracy. Noise is removed by deleting unwanted columns and rows from the dataset without affecting the overall result, so preprocessing helps clean the data by removing undesired values. While working with the dataset in the current work, it was noticed that some irregular values are present. Some of the given information, such as latitude, longitude, and Province/State, is not useful; moreover, the Province/State column contains null values, which create a disturbance during implementation. In the present work, these columns are removed without affecting the actual values of the Covid-19 cases. The number of rows remains the same after preprocessing, whereas the number of columns is reduced. Tables 5, 6, and 7 show the preprocessed data of the coronavirus pandemic.

Table 2 Dataset showing confirmed cases of Covid-19

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | … | 5/20/21 | 5/21/21
NaN | Afghanistan | 33.93911 | 67.709953 | 0 | 0 | 0 | 0 | 0 | 0 | … | 64,575 | 65,080
NaN | Albania | 41.15330 | 20.168300 | 0 | 0 | 0 | 0 | 0 | 0 | … | 132,118 | 132,153
NaN | Algeria | 28.03390 | 1.659600 | 0 | 0 | 0 | 0 | 0 | 0 | … | 126,156 | 126,434
NaN | Andorra | 42.50630 | 1.521800 | 0 | 0 | 0 | 0 | 0 | 0 | … | 13,569 | 13,569
NaN | Angola | −11.20270 | 17.873900 | 0 | 0 | 0 | 0 | 0 | 0 | … | 31,661 | 31,909


Table 3 Dataset showing recovered cases of Covid-19

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | … | 5/20/21 | 5/21/21
NaN | Afghanistan | 33.93911 | 67.709953 | 0 | 0 | 0 | 0 | 0 | 0 | … | 55,687 | 55,790
NaN | Albania | 41.15330 | 20.168300 | 0 | 0 | 0 | 0 | 0 | 0 | … | 127,869 | 128,425
NaN | Algeria | 28.03390 | 1.659600 | 0 | 0 | 0 | 0 | 0 | 0 | … | 87,902 | 88,066
NaN | Andorra | 42.50630 | 1.521800 | 0 | 0 | 0 | 0 | 0 | 0 | … | 13,234 | 13,234
NaN | Angola | −11.20270 | 17.873900 | 0 | 0 | 0 | 0 | 0 | 0 | … | 26,483 | 26,513


Table 4 Dataset showing death cases of Covid-19

Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | … | 5/20/21 | 5/21/21 | 5/22/21
NaN | Afghanistan | 33.93911 | 67.709953 | 0 | 0 | 0 | 0 | 0 | 0 | … | 2772 | 2782 | 2792
NaN | Albania | 41.15330 | 20.168300 | 0 | 0 | 0 | 0 | 0 | 0 | … | 2440 | 2441 | 2442
NaN | Algeria | 28.03390 | 1.659600 | 0 | 0 | 0 | 0 | 0 | 0 | … | 3401 | 3405 | 3411
NaN | Andorra | 42.50630 | 1.521800 | 0 | 0 | 0 | 0 | 0 | 0 | … | 127 | 127 | 127
NaN | Angola | −11.20270 | 17.873900 | 0 | 0 | 0 | 0 | 0 | 0 | … | 704 | 709 | 715



Polynomial features with degree = 3 are also used during preprocessing so as to improve the accuracy of the models.
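A minimal sketch of the preprocessing steps described in Sect. 3.2 follows, assuming a local copy of the JHU time-series CSV of confirmed cases; the file name is illustrative, and applying the degree-3 polynomial expansion to a day index used as the regression input is an assumption about how the features were built.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical local copy of the JHU CSSE time-series file
df = pd.read_csv("time_series_covid19_confirmed_global.csv")

# Drop the columns that are not useful for forecasting (Sect. 3.2)
df = df.drop(columns=["Province/State", "Lat", "Long"])

# Aggregate over countries to get one world-wide cumulative series per day
daily_totals = df.drop(columns=["Country/Region"]).sum(axis=0)

# Day index (0, 1, 2, ...) is the independent variable; cases are the target
days = np.arange(len(daily_totals)).reshape(-1, 1)
cases = daily_totals.values

# Degree-3 polynomial features, as used to improve model accuracy
poly = PolynomialFeatures(degree=3)
days_poly = poly.fit_transform(days)
print(days_poly.shape)
```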

3.3 Training and Testing Training and testing are performed to calculate the accuracy of the models. In this task, the dataset is split into two parts, training and testing, in a 75–25 ratio: 75% of the data is used for training and the remaining 25% for testing, using the train_test_split method imported from the Python sklearn library. First, the model is trained with 75% of the data, and then the model is tested with the remaining 25%. Train/test is performed separately for the confirmed cases, then for the recovered cases, and finally for the death cases of Covid-19.
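A minimal sketch of the 75–25 split described above, reusing the `days_poly` and `cases` arrays from the preprocessing sketch; `shuffle=False` is an assumption that keeps the time ordering, and the same call would be repeated for the recovered-case and death-case series.

```python
from sklearn.model_selection import train_test_split

# 75% of the samples for training, 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
    days_poly, cases, test_size=0.25, shuffle=False)

print(len(X_train), len(X_test))
```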

3.4 Classification Machine learning is considered one of the best techniques for classifying the future outcome of Covid-19. In this proposed work, a number of techniques are used, such as Logistic Regression, Linear Regression, the KNN classifier, Random Forest, and SVM. Supervised learning techniques used in machine learning give excellent results [39]; supervised learning is considered best because it first teaches and then tests the model using a dataset. Supervised algorithms are further categorized into two groups: regression algorithms and classification algorithms [40].

3.4.1 Linear Regression

Linear Regression is used to find the relationship between two variables; one variable is independent and the other is the dependent variable. It is a predictive, statistical approach [19] and comes under the supervised learning algorithms. The output is in a continuous range, it is used for predictive analysis, and it helps to predict the trend in the data while also indicating the relative significance of the data features [41]. Equation (1) shows the relation of the dependent variable Y with the independent variable X. To improve the accuracy of Linear Regression, the sum of residuals between the actual and predicted values has to be reduced [42]; each residual is obtained by subtracting the observed value from the predicted value, and the main aim is to fit a straight line. In Eq. (1), S is the slope, T is the intercept, and e represents the noise (error).

Y = S X + T + e    (1)
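A minimal sklearn sketch of fitting Eq. (1) on the polynomial day features from Sect. 3.2 is given below; the variable names follow the earlier sketches and are assumptions.

```python
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)       # learn the slope terms and intercept of Eq. (1)
y_pred = lin_reg.predict(X_test)    # forecast cases for the held-out days
```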

Table 5 Confirmed cases of Covid-19 after removing noise

 | 1/22/20 | 1/23/20 | 1/24/20 | … | 1/31/20 | … | 5/20/21 | 5/21/21
0 | 0 | 0 | 0 | … | 0 | … | 64,575 | 65,080
1 | 0 | 0 | 0 | … | 0 | … | 132,118 | 132,153
2 | 0 | 0 | 0 | … | 0 | … | 126,156 | 126,434
3 | 0 | 0 | 0 | … | 0 | … | 13,569 | 13,569
4 | 0 | 0 | 0 | … | 0 | … | 31,661 | 31,909


Table 6 Recovered cases of Covid-19 after removing noise

 | 1/22/20 | 1/23/20 | 1/24/20 | … | 1/31/20 | … | 5/20/21 | 5/21/21
0 | 0 | 0 | 0 | … | 0 | … | 55,687 | 55,790
1 | 0 | 0 | 0 | … | 0 | … | 127,869 | 128,425
2 | 0 | 0 | 0 | … | 0 | … | 87,902 | 88,066
3 | 0 | 0 | 0 | … | 0 | … | 13,234 | 13,234
4 | 0 | 0 | 0 | … | 0 | … | 26,483 | 26,513


Table 7 Death cases of Covid-19 after removing noise

 | 1/22/20 | 1/23/20 | 1/24/20 | … | 1/31/20 | … | 5/20/21 | 5/21/21
0 | 0 | 0 | 0 | … | 0 | … | 2772 | 2782
1 | 0 | 0 | 0 | … | 0 | … | 2440 | 2441
2 | 0 | 0 | 0 | … | 0 | … | 3401 | 3405
3 | 0 | 0 | 0 | … | 0 | … | 127 | 127
4 | 0 | 0 | 0 | … | 0 | … | 704 | 709


3.4.2 K-Nearest Neighbor

K-nearest neighbor was introduced in 1951 by Joseph Hodges and Evelyn Fix and then improved in 1967; it is a non-parametric method [43]. KNN is a supervised learning algorithm that stores all available data and then groups new data based on it. It is best described as a lazy-learner algorithm because of its nature: it first stores the data and only carries out computation during classification. The KNN algorithm computes, for each attribute of a new sample, its distance to the corresponding attribute of the samples in the stored database; for two points (x1, y1) and (x2, y2), the Euclidean distance is √((x2 − x1)² + (y2 − y1)²). Moreover, KNN also solves regression problems along with classification [44]. Using the KNN algorithm, samples of unknown category can be classified based on the already known categories, and a large training dataset leads to effective output. Figure 3 shows pseudo-code of how KNN works.

Fig. 3 Working of KNN algorithm

3.4.3 SVM

Support Vector Machine is a nonlinear ML algorithm. SVM is a binary classifier that can also be extended to multiclass problems; Vapnik, Boser, and Cortes are its co-founders [45]. It is widely recommended because it can work with both nonlinear and linear data and supports various kernel functions, and it helps with the curse-of-dimensionality issue. Another advantage of SVM is that it works for both classification and regression. An optimal hyperplane is chosen to maximize the distance between the hyperplane and the support vectors [45], and the main aim of SVM is to classify new data using a suitable hyperplane [46]. SVM's objective is to find the best hyperplane to separate the classes, either with a linear or a nonlinear SVM, and a sequential training method is used to train it [47]. SVM is used to reduce classification error and increase the geometric margin, which is why it is also called a maximum-margin classifier [48]; its main motive is to address the binary classification problem [49]. Moreover, kernels in SVM are of 3 main types: the RBF kernel, the linear kernel, and the polynomial kernel. For small datasets, SVM is a good option.

3.4.4 Random Forest

Breiman proposed the Random Forest. Random subspace and bagging are the two methods that combine to form a Random Forest. In RF, the data is separated into two parts, training and testing. RF has a black-box nature, in that each tree is not separately tested or examined [50]. RF is an ensemble learning algorithm built on the concept of decision trees: it draws random samples from the original samples using bootstrap technology and builds one decision tree per sample. Ensemble methods are further categorized into two families, bagging and boosting [51]. More trees generally means higher accuracy and also helps to get rid of the overfitting problem. The main benefit of RF is that it keeps accuracy stable even when data is missing and takes less training time than other algorithms. The voting method and randomization are the two mechanisms in RF that help to boost the accuracy of the model and bring down the correlation among trees [52].

3.4.5 Logistic Regression

Logistic Regression is similar in procedure and technique to Linear Regression. Logistic Regression (LR) is a statistical data-analysis method used to predict the relation between variables; it comes in 3 types: binary, ordinal, and multinomial Logistic Regression [53]. LR follows a white-box formula, which shows what it really does: it takes real-valued inputs and then performs a prediction. LR treats the output as 0 (here, 0 means non-churner) when the prediction is > 0.5; otherwise, the output is 1 (defined as churner) [54]. Equation (2) shows the mathematical representation of LR. Figure 4 shows the relation between Linear Regression (blue line) and Logistic Regression (orange line).


Fig. 4 Relationship between linear and logistic regression

p = 1 / (1 + e^(−(b0 + b1 x)))    (2)
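A minimal sketch showing how the methods discussed in Sect. 3.4 can be fitted with scikit-learn on the same training split follows. Because the targets are cumulative case counts, regression variants of the named methods are used here, which is an assumption about how the classifiers were applied; the hyperparameters (number of neighbors, number of trees) are also illustrative.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

models = {
    "Linear regression": LinearRegression(),
    "KNN": KNeighborsRegressor(n_neighbors=5),                     # illustrative k
    "SVM": SVR(kernel="rbf"),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)            # train on 75% of the series
    predictions[name] = model.predict(X_test)   # forecast the held-out 25%
```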

4 Result The proposed work is tested using real data from Johns Hopkins University to predict the upcoming results or trends of Covid-19 using various algorithms such as Linear Regression, KNN, SVM, Logistic Regression, and Random Forest. The data is reliable as it is collected from the official website of an authenticated organization. Very few columns of the whole dataset contain irrelevant values, and these were removed before continuing with the implementation process. In this research, data is collected first and then preprocessed (i.e., null values and irrelevant columns are removed without affecting the real values of the actual data) to get accurate outputs. Moving further, the data is split and training and testing are carried out using the various classification techniques. Then, performance is evaluated using different parameters like RMSE, MSE, R2-score, and MAE to check the effectiveness of the proposed system. Figure 5 shows the graphical view of confirmed cases marked in purple, recovered cases marked in green, and death cases marked in red; this graph shows the status of Covid-19 with respect to time, that is, whether the cases are increasing, decreasing, or remaining the same. Based on the given data, the total numbers of confirmed, recovered, and death cases are 169,951,560, 107,140,669, and 3,533,619, respectively, within the given time frame from 22/01/2020 to 29/05/2021. The graph in Fig. 5 shows that Covid-19 cases are increasing rapidly with time, whereas the red line is almost constant, which shows that the death rate is very low. A polynomial feature of degree 3 is added during classification to achieve more accurate results. Figure 5 also reveals that patients are recovering from Covid-19 with time, but in a small ratio with respect to the confirmed cases; the way the confirmed cases are increasing is a worrying situation for the world. The main aim behind the research is to evaluate the pattern of Covid-19 in the upcoming days. Linear Regression, SVM, Random Forest, KNN, and Logistic Regression are used to achieve the results.


Fig. 5 Graphical representation of confirmed cases, recovered cases, and death cases with respect to time

All the algorithms used for forecasting are imported from sklearn, the in-built Python library. To judge the performance of the machine learning (ML) algorithms, various evaluation metrics are used, such as RMSE, MAE, MSE, and R2-score, to measure the performance of the models during classification and training-testing; the main aim of the metrics is to differentiate the performance and accuracy of the ML models. RMSE: Root mean squared error (RMSE) is used to quantify precision, for example when checking the accuracy of a transformation from one coordinate model to another, and it measures the difference between the actual output and the desired output. The mathematical equation of RMSE is given in Eq. (3), where k is the index of the current observation, m is the number of observations, xk is the actual value, and x̂k is the predicted value [55]. It is thus a measure of how closely the real data points lie around the line of best fit [22].

RMSE = √( (1/m) Σ_{k=1}^{m} (x_k − x̂_k)² )    (3)

MSE: The main purpose of mean squared error (MSE) is to determine the regression model performance. It takes the difference between the actual and predicted data points and squares it, which removes the negative sign from the values [22]. Equation (4) shows the mathematical representation of MSE, where m is the number of observations, xk is the actual value, and x̂k is the predicted value. A lower value of MSE means the predictions lie closer to the fitted line, and because of the squaring the MSE value is always positive [13]. MSE, also named MSD (mean squared deviation), is used to quantify the error of the given model.


MSE = (1/m) Σ_{k=1}^{m} (x_k − x̂_k)²    (4)

MAE: Mean absolute error (MAE), also called a scale-dependent accuracy measure, is used to calculate the distance between the true and predicted values over the objects. The MAE value lies between 0 and ∞ [22]. Equation (5) shows the mathematical equation of MAE, where the total number of observations, the predicted value, and the true value are represented by m, xk, and yk, respectively. The value of MAE is reduced when predictions are rounded to the nearest neighbor using rounding operators [56].

MAE = (1/m) Σ_{k=1}^{m} |x_k − y_k|    (5)

R2-Score: This evaluation metric helps to check the accuracy of the model. The R2-score calculates the strength of the relationship between the regression model and the dependent variable on a suitable scale. R2 reflects how much of the scatter of the observations is explained by the model; its value always lies between 0 and 100%, and a higher R2 value indicates a better model [22]. Equation 6 shows the implementation formula, where SSR is the sum of squared residuals and SST is the total sum of squares. The main motive of R2 is to measure the predictive performance of Linear Regression.

R^2\text{-score} = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}}    (6)
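As a concrete illustration of how these four metrics can be computed with scikit-learn (the library the study reports using), the following is a minimal sketch; the array values are placeholders for illustration, not the study's data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Placeholder actual and predicted case counts (not the study's data)
y_true = np.array([120.0, 150.0, 180.0, 210.0, 260.0])
y_pred = np.array([115.0, 158.0, 175.0, 220.0, 255.0])

mse = mean_squared_error(y_true, y_pred)    # Eq. (4)
rmse = np.sqrt(mse)                         # Eq. (3)
mae = mean_absolute_error(y_true, y_pred)   # Eq. (5)
r2 = r2_score(y_true, y_pred)               # Eq. (6), reported below as a percentage
print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2 * 100:.2f}%")
```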

Table 8 presents the values of the performance metrics for predicting Covid-19 confirmed cases using machine learning. Table 9 reveals the accuracy of the model for future recovered cases of Covid-19, and Table 10 shows the results of the various metrics used to check performance and analyze the future trend of Covid-19 death cases using the different classification methods. In all cases of Covid-19, Linear Regression has a lower MSE value than the other approaches for forecasting the Covid-19 situation. A lower MSE value generates a better fit line and a superior forecast, and hence Linear Regression shows excellent results on the Johns Hopkins database. The R2-score for Linear Regression is 14% for confirmed cases, 91.18% for recovery rate forecasting, and 90.32% for death cases. The comparison of test data (actual values) versus polynomial regression prediction (predicted output) for confirmed, recovered, and death cases of Covid-19 is shown in Fig. 6. Linear Regression achieves an accuracy of 99.92% for Covid-19 confirmed cases, 99.73% for recovered cases, and 99.60% for death cases. All these values are achieved by implementing the real database in a Jupyter notebook using Python. Figure 6 shows the graphical representation of the various algorithms, namely Linear Regression, KNN, Random Forest, SVM, and Logistic Regression. It depicts the actual output (red line) versus the observed output (green line) and shows that Linear Regression gives better results than the other techniques. Here, it is clear that Linear Regression is the best algorithm to analyze and predict the future outcome of Covid-19.


Table 8 Evaluation of performance of the model for future predictions of confirmed cases

Model                   | MSE                     | RMSE          | MAE           | R2-score
Linear regression       | 1,275,718,555,405,083.8 | 35,717,202.5  | 30,941,675.08 | 14.734
SVM                     | 1,446,764,840,633,415.0 | 31,730,752.17 | 38,036,362.08 | 0.0
K-neighbors classifiers | 1,589,197,341,584,735.2 | 39,864,738.07 | 33,900,931.17 | 0.0
Random forest           | 1,446,764,840,633,415.0 | 31,730,752.17 | 38,036,362.08 | 0.0
Logistic regression     | 1,544,946,139,781,014.2 | 39,305,802.87 | 33,241,870.17 | 0.0

Table 9 Evaluation of performance of the model for future predictions of recovered cases

Model                   | MSE                     | RMSE          | MAE           | R2-score
Linear regression       | 13,363,796,117,335.4    | 3,655,652.62  | 2,978,547.219 | 91.183
SVM                     | 655,562,942,654,381.9   | 25,603,963.41 | 20,931,718.37 | 0.0
K-neighbors classifiers | 1,589,197,341,584,735.2 | 39,864,738.07 | 33,900,931.17 | 0.0
Random forest           | 655,562,942,654,381.9   | 25,603,963.41 | 20,931,718.37 | 0.0
Logistic regression     | 695,880,895,606,561.4   | 26,379,554.49 | 21,873,609.37 | 0.0

Table 10 Evaluation of performance of the model for future predictions of death cases

Model                   | MSE                | RMSE       | MAE          | R2-score
Linear regression       | 25,957,538,218.41  | 161,113.43 | 134,134.5663 | 90.32
SVM                     | 612,361,142,740.19 | 782,535.07 | 681,224.62   | 0.0
K-neighbors classifiers | 681,596,641,898.51 | 825,588.66 | 730,275.6290 | 0.0
Random forest           | 612,361,142,740.19 | 782,535.07 | 681,224.62   | 0.0
Logistic regression     | 658,453,822,634.06 | 811,451.67 | 714,254.62   | 0.0

While implementing the preprocessed dataset in Python, the other algorithms produce very similar results to one another, as shown in Fig. 6. However, the Linear Regression graph in Fig. 6 shows noticeably different results, which proves that Linear Regression is better than the other algorithms.
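The degree-3 polynomial feature expansion combined with Linear Regression described above can be reproduced with a small scikit-learn pipeline such as the sketch below; the day/case arrays and the 80/20 split are illustrative assumptions, not the exact preprocessing used in the study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Days since 22/01/2020 and cumulative confirmed cases (placeholder values)
days = np.arange(1, 101).reshape(-1, 1)
cases = (50 * days ** 2 + 300 * days).ravel()

X_train, X_test, y_train, y_test = train_test_split(
    days, cases, test_size=0.2, shuffle=False)

# Degree-3 polynomial features feeding a Linear Regression model
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X_train, y_train)

future_days = np.arange(101, 111).reshape(-1, 1)
print(model.predict(future_days))  # forecast for the next ten days
```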

5 Conclusion

The globe has been suffering since 2019 from a catastrophic disease called coronavirus (Covid-19). Because of its deadly behavior, WHO has declared it a pandemic. SARS-CoV-2 originated in China and then slowly spread globally through physical interaction among humans. The administrations of all countries have taken several strict steps, as per requirement, to control the growth of the virus by imposing full or partial lockdowns. The study proposes the future forecasting of Covid-19 by utilizing Python and machine learning. It helps to predict the situation of corona and take mandatory steps in advance to stop the growth of the pandemic.


Fig. 6 Graphs a confirmed cases, b recovered cases, and c death cases of (i) linear regression, (ii) KNN, (iii) random forest, (iv) SVM, and (v) logistic regression, respectively, depicting the actual output (red line) versus observed output (green line)


This study has been performed in various stages, namely data preprocessing, training and testing, and then performance evaluation. The implementation is performed using the real database from the "Johns Hopkins dashboard" containing data on confirmed cases, death cases, and recovered cases of Covid-19, and the outcome shows that Linear Regression has the minimum mean squared error among the used classification techniques and attains an accuracy of 99.92% for Covid-19 confirmed cases, 99.73% for recovered cases, and 99.60% for death cases. According to the database used, although confirmed cases are increasing day by day, the death rate is very small, from which it can be concluded that the situation will get better if people avoid social contact as much as possible. The work can further be expanded by combining the updated databases of different organizations and evaluating them using advanced machine learning and deep learning techniques.

References 1. Ahmed MH (2020) Dexamethasone for the treatment of coronavirus disease (COVID-19): a review. http://doi.org/10.1007/s42399-020-00610-8 2. WHO | World Health Organization. https://www.who.int/. Accessed 24 Sept 2021 3. Pikoulis E et al (2021) The effect of the COVID pandemic lockdown measures on surgical emergencies: experience and lessons learned from a Greek Tertiary Hospital. World J Emerg Surg 16(1):1–8. https://doi.org/10.1186/s13017-021-00364-1 4. How to protect yourself & others | CDC. https://www.cdc.gov/coronavirus/2019-ncov/preventgetting-sick/prevention.html. Accessed 19 Jan 2022 5. Khan MA, Abbas S, Khan KM, Ghamdi MAA, Rehman A (2020) Intelligent forecasting model of covid-19 novel coronavirus outbreak empowered with deep extreme learning machine. Comput Mater Contin 64(3):1329–1342. https://doi.org/10.32604/cmc.2020.011155 6. Caso V, Federico A (2020) No lockdown for neurological diseases during COVID19 pandemic infection. Neurol Sci 41(5):999–1001. https://doi.org/10.1007/s10072-020-04389-3 7. Rameshrad M, Ghafoori M, Mohammadpour AH, Nayeri MJD, Hosseinzadeh H (2020) A comprehensive review on drug repositioning against coronavirus disease 2019 (COVID19). Naunyn Schmiedebergs Arch Pharmacol 393(7):1137–1152. https://doi.org/10.1007/s00210020-01901-6 8. SARS-CoV-2 resources—NCBI. https://www.ncbi.nlm.nih.gov/sars-cov-2/. Accessed 19 Jan 2022 9. Bajracharya T, Kumar RS (2020) COVID-19 (novel coronavirus) a global disease, vol 19 10. Biancalana E, Parolini F, Mengozzi A, Solini A (2021) Short-term impact of COVID-19 lockdown on metabolic control of patients with well-controlled type 2 diabetes: a single-centre observational study. Acta Diabetol 58(4):431–436. https://doi.org/10.1007/s00592-020-016 37-y 11. Ndwandwe D, Wiysonge CS (2021) COVID-19 vaccines. Curr Opin Immunol 71:111–116. https://doi.org/10.1016/j.coi.2021.07.003 12. Nemati M, Ansary J, Nemati N (2020) Machine-learning approaches in COVID-19 survival analysis and discharge-time likelihood prediction using clinical data. Patterns 1(5):100074. https://doi.org/10.1016/j.patter.2020.100074 13. Lauer SA et al (2020) The incubation period of coronavirus disease 2019 (CoVID-19) from publicly reported confirmed cases: estimation and application. Ann Intern Med 172(9):577– 582. https://doi.org/10.7326/M20-0504


14. Sharma DK, Subramanian M, Malyadri P, Reddy BS, Sharma M, Tahreem M (2021) Classification of COVID-19 by using supervised optimized machine learning technique. Mater Today Proc. http://doi.org/10.1016/j.matpr.2021.11.388 15. Ghafouri-Fard S, Mohammad-Rahimi H, Motie P, Minabi MAS, Taheri M, Nateghinia S (2021) Application of machine learning in the prediction of COVID-19 daily new cases: a scoping review. Heliyon 7(10):e08143. https://doi.org/10.1016/j.heliyon.2021.e08143 16. Pahar M, Klopper M, Warren R, Niesler T (2021) COVID-19 cough classification using machine learning and global smartphone recordings. Comput Biol Med 135:104572. http://doi.org/10. 1016/j.compbiomed.2021.104572 17. Yadav M, Perumal M, Srinivas M (2020) Analysis on novel coronavirus (COVID-19) using machine learning methods. Chaos Solitons Fractals 139:110050. https://doi.org/10.1016/j. chaos.2020.110050 18. Dairi A, Harrou F, Zeroual A, Hittawe MM, Sun Y (2021) Comparative study of machine learning methods for COVID-19 transmission forecasting. J Biomed Inform 118:103791. http:// doi.org/10.1016/j.jbi.2021.103791 19. Gothai E, Thamilselvan R, Rajalaxmi RR, Sadana RM, Ragavi A, Sakthivel R (2021) Materials today: proceedings prediction of COVID-19 growth and trend using machine learning approach. Mater Today Proc. http://doi.org/10.1016/j.matpr.2021.04.051 20. Punn NS, Sonbhadra SK, Agarwal S (2020) COVID-19 epidemic analysis using machine learning and deep learning algorithms. medRxiv. http://doi.org/10.1101/2020.04.08.20057679 21. Bhadana V, Jalal AS, Pathak P (2020) A comparative study of machine learning models for COVID-19 prediction in India. In: 4th IEEE conference on information and communication technology CICT 2020, Dec 2020. http://doi.org/10.1109/CICT51604.2020.9312112 22. Rustam F et al (2020) COVID-19 future forecasting using supervised machine learning models. IEEE Access 8:101489–101499. https://doi.org/10.1109/ACCESS.2020.2997311 23. Kumari P, Toshniwal D (2020) Real-time estimation of COVID-19 cases using machine learning and mathematical models—the case of India. In: 2020 IEEE 15th international conference on industrial and information systems, ICIIS 2020, pp 369–374. http://doi.org/10.1109/ICIIS5 1140.2020.9342735 24. Yao H et al (2020) Severity detection for the coronavirus disease 2019 (COVID-19) patients using a machine learning model based on the blood and urine tests. Front Cell Dev Biol 8:1–10. https://doi.org/10.3389/fcell.2020.00683 25. Hossen MS, Karmoker D (2020) Predicting the probability of Covid-19 recovered in south Asian countries based on healthy diet pattern using a machine learning approach. In: 2020 2nd international conference on sustainable technologies for industry 4.0, STI 2020, pp 19–20. http://doi.org/10.1109/STI50764.2020.9350439 26. Liu Y, Xiao Y (2020) Analysis and prediction of COVID-19 in Xinjiang based on machine learning. In: Proceedings 2020 5th international conference on information science, computer technology and transportation, ISCTT 2020, pp 382–385. http://doi.org/10.1109/ISCTT51595. 2020.00072 27. Khanday AMUD, Rabani ST, Khan QR, Rouf N, Mohi Ud Din M (2020) Machine learning based approaches for detecting COVID-19 using clinical text data. Int J Inf Technol 12(3):731– 739. http://doi.org/10.1007/s41870-020-00495-9 28. Aljameel SS, Khan IU, Aslam N, Aljabri M, Alsulmi ES (2021) Machine learning-based model to predict the disease severity and outcome in COVID-19 patients. Sci Program 2021. http:// doi.org/10.1155/2021/5587188 29. 
De Souza FSH, Hojo-Souza NS, Dos Santos EB, Da Silva CM, Guidoni DL (2021) Predicting the disease outcome in COVID-19 positive patients through machine learning: a retrospective cohort study with Brazilian data. Front Artif Intell 4:1–13. https://doi.org/10.3389/frai.2021. 579931 30. Pourhomayoun M, Shakibi M (2021) Predicting mortality risk in patients with COVID-19 using machine learning to help medical decision-making. Smart Health 20:100178. http://doi. org/10.1016/j.smhl.2020.100178


31. Abirami RS, Kumar GS (2022) Comparative study based on analysis of coronavirus disease (COVID-19) detection and prediction using machine learning models. SN Comput Sci 3(1). http://doi.org/10.1007/s42979-021-00965-2 32. Anupam A, Mohan NJ, Sahoo S, Chakraborty S (2021) Preliminary diagnosis of COVID-19 based on cough sounds using machine learning algorithms. In: Proceedings of 5th international conference on intelligent computing and control systems, ICICCS 2021, pp 1391–1397. http:// doi.org/10.1109/ICICCS51141.2021.9432324 33. Chowdhury AA, Hasan KT, Hoque KKS (2021) Analysis and prediction of COVID-19 pandemic in Bangladesh by using ANFIS and LSTM network. Cognit Comput 13(3):761–770. https://doi.org/10.1007/s12559-021-09859-0 34. Gupta JP, Singh A, Kumar RK (2021) A computer-based disease prediction and medicine recommendation system using machine learning approach. Academia.Edu 12(3):673–683. https://doi.org/10.34218/IJARET.12.3.2021.0 35. Kwekha-Rashid AS, Abduljabbar HN, Alhayani B (2021) Coronavirus disease (COVID-19) cases analysis using machine-learning applications. Appl Nanosci (0123456789). http://doi. org/10.1007/s13204-021-01868-7 36. COVID-19 data from John Hopkins University | Kaggle. https://www.kaggle.com/antgoldbl oom/covid19-data-from-john-hopkins-university. Accessed 23 Jan 2022 37. CSSEGISandData/COVID-19: novel coronavirus (COVID-19) cases, provided by JHU CSSE. https://github.com/CSSEGISandData/COVID-19. Accessed 06 Jan 2022 38. Hota HS, Handa R, Shrivas AK (2021) 27—COVID-19 pandemic in India: forecasting using machine learning techniques. Elsevier Inc., Amsterdam 39. Manco L, Maffei N, Strolin S, Vichi S, Bottazzi L, Strigari L (2021) Basic of machine learning and deep learning in imaging for medical physicists. Phys Medica 83:194–205. https://doi.org/ 10.1016/j.ejmp.2021.03.026 40. Vrindavanam J, Srinath R, Shankar HH, Nagesh G (2021) Machine learning based COVID19 cough classification models—a comparative analysis. In: Proceedings 5th international conference on computing methodologies and communication, ICCMC 2021, pp 420–426. http://doi.org/10.1109/ICCMC51019.2021.9418358 41. Date P, Potok T (2021) Adiabatic quantum linear regression. Sci Rep 11(1):1–11. https://doi. org/10.1038/s41598-021-01445-6 42. Pandey G, Chaudhary P, Gupta R, Pal S (2020) SEIR and regression model based COVID-19 outbreak predictions in India, pp 1–10. http://doi.org/10.1101/2020.04.01.20049825 43. Arslan H, Arslan H (2021) A new COVID-19 detection method from human genome sequences using CpG island features and KNN classifier. Eng Sci Technol Int J 24(4):839–847. https:// doi.org/10.1016/j.jestch.2020.12.026 44. Fan Z, Xie JK, Wang ZY, Liu PC, Qu SJ, Huo L (1930) Image classification method based on improved KNN algorithm. J Phys Conf Ser 1:2021. https://doi.org/10.1088/1742-6596/1930/ 1/012009 45. Albagmi FM, Alansari A, Al Shawan DS, AlNujaidi HY, Olatunji SO (2022) Prediction of generalized anxiety levels during the Covid-19 pandemic: a machine learning-based modeling approach. Inform Med Unlocked 28:100854. http://doi.org/10.1016/j.imu.2022.100854 46. Tigga NP, Garg S (2020) ScienceDirect prediction of type 2 diabetes using machine learning classification methods. Procedia Comput Sci 167(2019):706–716. https://doi.org/10.1016/j. procs.2020.03.336 47. Utami NA, Maharani W, Atastina I (2021) Personality classification of facebook users according to big five personality using SVM (Support Vector Machine) method. Procedia Comput Sci 179(2020):177–184. 
https://doi.org/10.1016/j.procs.2020.12.023 48. Majumder S, Aich A, Das S (2021) Sentiment analysis of people during lockdown period of COVID-19 using SVM and logistic regression analysis. SSRN Electron J. http://doi.org/10. 2139/ssrn.3801039 49. Faris H, Habib M, Faris M, Alomari M, Alomari A (2020) Medical speciality classification system based on binary particle swarms and ensemble of one vs. rest support vector machines. J Biomed Inform 109:103525. http://doi.org/10.1016/j.jbi.2020.103525


50. Ye¸silkanat CM (2020) Spatio-temporal estimation of the daily cases of COVID-19 in worldwide using random forest machine learning algorithm. Chaos Solitons Fractals 140. http://doi.org/ 10.1016/j.chaos.2020.110210 51. Zhu L, Zhou X, Zhang C (2021) Rapid identification of high-quality marine shale gas reservoirs based on the oversampling method and random forest algorithm. Artif Intell Geosci 2:76–81. https://doi.org/10.1016/j.aiig.2021.12.001 52. Rustam Z, Saragih G (2021) Prediction insolvency of insurance companies using random forest. J Phys Conf Ser 1752(1). http://doi.org/10.1088/1742-6596/1752/1/012036 53. Wibowo FW, Wihayati (2021) Prediction modelling of COVID-19 outbreak in Indonesia using a logistic regression model. J Phys Conf Ser 1803(1). http://doi.org/10.1088/1742-6596/1803/ 1/012015 54. Jain H, Khunteta A, Srivastava S (2020) Churn prediction in telecommunication using logistic regression and logit boost. Procedia Comput Sci 167(2019):101–112. https://doi.org/10.1016/ j.procs.2020.03.187 55. Moses KP, Devadas MD (2012) An approach to reduce root mean square error in toposheets. Eur J Sci Res 91(2):268–274 56. Wang W, Lu Y (2018) Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model. IOP Conf Ser Mater Sci Eng 324(1). http://doi. org/10.1088/1757-899X/324/1/012049

Chapter 18

Proctoring Solution Using AI and Automation (Semi) Ravi Sridharan, Linda Joseph, and B. Sandhya Reddy

1 Introduction

Since the COVID-19 pandemic, every sector, including the education industry, has moved online. This resulted in the blossoming of remote learning, and a large number of online education institutions were opened. While many educational institutions have transformed from classroom teaching to using applications like Google Classroom or Microsoft Teams for online teaching and interaction purposes, there has been no effective, practical solution for examinations. A majority of institutions have opted for simple take-home assignments, where cheating is almost the norm, while some canceled examinations completely. If the current circumstances are to be the future's norm, there is an immense need for an effective solution. Many institutions allow students to take exams from home while being monitored remotely by a camera. This solution, however, is not scalable and is infeasible at large scale due to the workforce required, and even if it is implemented, there is an extremely high possibility for the students to cheat. To ensure that students do not cheat and to make their learning and testing experience valid, fair, and valuable, free and good proctoring software is needed. So, the right solution might be a proctoring system which can monitor the students using the laptop's built-in webcam and microphone, so that the software and hardware requirements remain affordable and easily available. The software is built to be scalable, and the examiners can monitor multiple students at once.

R. Sridharan (B) · L. Joseph · B. S. Reddy, Department of Computer Science and Engineering, Hindustan Institute of Technology and Science, Chennai, Tamil Nadu, India. e-mail: [email protected]; L. Joseph e-mail: [email protected]; B. S. Reddy e-mail: [email protected]


2 Project Plan and Schedule

This project is organized into six chapters. The first chapter is an introductory chapter where the project is introduced and the project scope is defined; the problem statement is specified, and clear goals and objectives are identified and listed. Chapter two covers the literature review and the proposed methodology of this project. It begins by reviewing existing proctoring systems and describing the proposed one; the methodology and approach of this project are then explained. Chapter three is the most important part, as it describes the system design of this project. The system requirements are collected, and the requirements are then evaluated and structured. Chapter four covers the project implementation, in which the proper programming language is selected; details of the implementation are illustrated through a system flow diagram, forms, and reports. Chapter five briefly explains the system evaluation, advantages, and drawbacks, as it is essential to examine whether the application is delivering the required output. Chapter six, the last chapter, concludes the project by summarizing the overall process and methodology followed during the development of this work. It also outlines the possible future work that needs to be carried out to complement this project and further develop and expand it.

3 Literature Survey

Labayen et al. [1] described in depth a solution based on the authentication of biometric technologies and an automatic proctoring system that incorporates many other features in their paper titled "Online Student Authentication and Proctoring System Based on Multimodal Biometrics Technology." Liu et al. [2] presented a system that performs online exam proctoring automatically through multimedia analytics in their paper titled "Automated Online Exam Proctoring." The system assesses six basic components: (i) gaze estimation, (ii) voice detection, (iii) text detection, (iv) active window detection, (v) user verification, and (vi) phone/prohibited device detection. They continuously combined the estimation components and applied a temporal sliding window to determine if any malpractice was happening. Mukhanbet et al. [3] in their paper titled "Hybrid Architecture of Face and Action Recognition Systems for Proctoring on a Graphics Processor" developed a hybrid architecture for face and action recognition. They used the FaceNet algorithm to implement the system. To reduce the processing time of the recognition algorithms, techniques such as bandwidth reduction and image resizing were used. In addition, the entire process of video processing for proctoring was launched on the graphics server. Asep Hadian and Bandung [4] in their paper "A Design for Online Exam Proctoring on M-Learning or Mobile Learning through Continuous User Verification" proposed a method to enhance the robustness to pose and lighting variations by performing an incremental training process using the training data set obtained from m-learning online lecture sessions. Their paper resulted in the design of the incremental training using online lecture images, which is the third stage of DRM.


The proposed method was expected to increase the resistance of face verification in online exam proctoring to variations in pose and lighting, without adding further image processing such as face alignment or histogram equalization. Muzaffar et al. [5] in their paper titled "A Systematic Review of Online Exams Solutions in E-Learning: Techniques, Tools, and Global Adoption" provided a consolidated repository where interested people can pick out features and developments for any custom exam proctoring requirements. Similarly, a few more papers were published regarding the assessment part [6] of the online examination, a few were subject or course specific, and a few were more focused on the media submitted by the examinee [7]. In order to make online learning and teaching better and more effective, specifically for a discrete mathematics course [8], Q. Chen released a paper at the ACM Turing Celebration Conference China. Similarly, in 2018 a prototype of online examination [9] was made on the MoLearn application focusing specifically on plagiarism control. A paper in August 2016 [10] used data mining technology to analyze and study teaching techniques to find teaching problems, with an aim to guide students to improve learning [11] in the US and to help instructors benefit teaching. A similar paper [12] was published in 2018, which studied the drawbacks of and researched the current online examination and monitoring systems. Online proctoring has been very much focused on the face detection and image analysis aspect, so many papers were published [13] and prototypes [14] were presented on various algorithms of face detection for online proctoring. A few papers [15, 16] were published trying to study the reasons and trends in cheating among college students. In 2021, a paper [17] was published in the Journal of Innovative Image Processing which proposed an eye-tracking device for specially abled people to benefit from electric wheelchairs; it reported 90% better results compared with other existing systems. A part of it could be extremely beneficial to educational institutes when used for proctoring. In August 2019, an IEEE paper [18] was published focusing on the security aspect of proctoring. Many systems [19–21] were built and introduced with the primary aim of authentication; a few involved biometrics, image analysis and pattern recognition [22], fingerprint authentication, liveness detection [23], gaze estimation [24, 25], and gesture recognition [26]. However, for a system to be effective and affordable, one needs to have human intervention, which is precisely the motive of this paper.

4 System Requirements

The system requirements can be classified into two categories:

4.1 Hardware Requirement

1. CPU processor 4 or higher.
2. Memory at least 2 GB or more.


3. Hard disk at least 160 GB.
4. A network card.
5. Any operating system, preferably Windows or Linux.

4.2 Software Requirements

The system was developed using the Python programming language and required the PyQt5 framework to code the user interface part of the project. Many Python packages were imported into the project to implement the proctoring features of the application.

OpenCV. It is a library for computer vision, machine learning, and image processing.

Dlib. It is a library used for implementing a variety of machine learning algorithms.

TensorFlow. It is used in machine learning and is an open-source library for numerical computation.

PyAudio. It is used to capture audio from the microphone.

NLTK. It is a natural language toolkit used for processing text.

PyQt5 (For UI). It is a GUI widgets toolkit for Qt and a cross-platform GUI library. It is a blend of the Python programming language and the Qt library.

Pyrebase. It is a Python interface to Firebase's REST API; it basically is a Python wrapper for the Firebase API. It allows us to use Python to connect with a Firebase project by providing apiKey, authDomain, databaseURL, and storageBucket values, and serviceAccount credentials can also be added. The current project mainly used it to manipulate its database and also to use other features of Firebase such as authentication, sign-in methods, and docs.
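A minimal sketch of how Pyrebase is typically wired to a Firebase project is shown below; the configuration values, the student credentials, and the "exams" node are illustrative placeholders, not the project's actual credentials or database schema.

```python
import pyrebase

# Placeholder Firebase project configuration (values must come from your own project)
config = {
    "apiKey": "<api-key>",
    "authDomain": "<project>.firebaseapp.com",
    "databaseURL": "https://<project>-default-rtdb.firebaseio.com",
    "storageBucket": "<project>.appspot.com",
}

firebase = pyrebase.initialize_app(config)
auth = firebase.auth()
db = firebase.database()

# Sign in an examinee with credentials previously created by the examiner
user = auth.sign_in_with_email_and_password("student@example.com", "password123")

# Read the metadata of a question paper stored under a hypothetical "exams" node
exam = db.child("exams").child("EXAM_CODE_123").get(user["idToken"]).val()
print(exam)
```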

5 Proposed System

The existing proctoring software is not open source, and the majority of schools and colleges require an examination portal system. Moreover, no technology can assure better results than a fusion of manpower and artificial intelligence. The major drawback of the existing proctoring systems is that not only are they costly, but they are also not effective, as a student can find ways to get past an algorithm. Proctoring software assists an examiner with the best possible solution to avoid inappropriate practices and cheating by a student; popular e-cheating trends among students have been explained in [3]. Students are smarter than any artificial intelligence algorithm or proctoring system. So, this paper proposes the concept of semi-automation in proctoring software: a human proctor, with the help of AI, can be more effective than completely automated proctoring software. The proposed system is a complete package for exam monitoring and conduction. It does not only support proctoring, but also has features like a portal system and data visualization about previously conducted exams.


Fig. 1 Architectural diagram of the proposed proctoring system

It is semi-automated proctoring software based on capabilities like video and audio features along with routine authorization features. It can help eradicate cheating and inappropriate behavior in large-scale online exams, since the human proctor gets notified every time a student or examinee scores high on the cheating scale. The software also allows the human proctor to monitor live actions to decide for themselves whether the system is overestimating or not. The features of the project are combined through multithreading. The software application takes complete control over the device and will not let the student open any browser or file. The architecture diagram of the proposed system is shown in Fig. 1, and the use case diagram is shown in Fig. 2.

6 System Design and Working

6.1 Proctor Module

This is the heart of the whole project. It contains major features of proctoring such as eye-tracking, face detection, and noise and voice detection. This module is implemented on the exam page of the student module. The proctor module is divided into two sub-modules, each having different functionalities.


Fig. 2 Use case diagram of the proposed system

1. Vision module: It has the following functionalities:

Gaze tracking. It is responsible for tracking the eye pupils through the webcam and gives the real-time coordinates of the eyes in the video stream. It also provides eye-tracking and blink detection, which helps prevent the student from cheating. Dlib's facial landmark detector algorithm was used. When deciding whether to use a DNN (Deep Neural Network) or Dlib, the latter was chosen because it is faster and can give predictions in real-time live proctoring; the pre-trained facial key-point network in the Dlib library, which detects 68 key points, was used to locate the eyes.

Mouth open or close. It gives an affirmation if the student is frequently opening and closing their mouth (talking). It was also built using Dlib, extending the eye-tracking code.
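A rough illustration of the Dlib-based approach described above is sketched below: it detects the 68 facial landmarks in a webcam frame and flags an open mouth from the inner-lip landmark distance. The model file path and the 0.4 threshold are assumptions for the example, not values taken from the project.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# The 68-point model file must be downloaded separately from the dlib model zoo
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for face in detector(gray):
        shape = predictor(gray, face)
        # Landmarks 60-67 outline the inner lips; compare vertical gap to mouth width
        top, bottom = shape.part(62), shape.part(66)
        left, right = shape.part(60), shape.part(64)
        gap = abs(bottom.y - top.y) / max(abs(right.x - left.x), 1)
        if gap > 0.4:  # illustrative threshold
            print("Mouth open - possible talking")
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
```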


Head-pose estimation. It is used for finding where the head is facing. OpenCV, TensorFlow, and a Caffe model of OpenCV's DNN module were used for converting the points to 3D space, and cv2.solvePnP was used to find the rotational and translational vectors.

Person counting and mobile phone detection. It is for counting persons and detecting mobile phones in the camera view. YOLOv3 is used with TensorFlow, and the YOLOv3 weights were downloaded using wget in Python. YOLOv3 is a pre-trained model which can classify 80 objects quickly and accurately. Overlapping anchor boxes were filtered using non-maximal suppression (NMS), which uses IoU to support its functionality.

Face spoofing. It is used to make sure that the face is real and not a photograph or some other image. A pre-trained model built with sklearn version 0.19.1 was used, and OpenCV was used to access the webcam, handle images, and change the color channels using a Caffe model of OpenCV's DNN module. Histograms are calculated and concatenated together as required by the model. The probability of whether the face is genuine or fake is predicted with the threshold set to 0.7. The model is trained on a replay-attack database which contains 1300 videos of 50 clients under different lighting conditions.

2. Audio module: It has the following functionalities.

Noise/Voice detection. This noise detection sub-module makes sure that the student is in a calm and quiet environment; otherwise, it raises a warning and notifies the examiner. The detected voice is converted to text and then compared with the questions on the screen. This makes sure that students are not using any voice assistant software like Google Assistant, Alexa, or Siri. Google's audio-to-text API was used to code this function, which gets activated every 10 s. The question paper and the detected audio text are compared; if the score is higher than the limit, the examiner or human proctor is notified.
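A hedged sketch of the noise/voice check follows. The chapter states that Google's audio-to-text API is invoked every 10 s; here the widely used SpeechRecognition package (whose recognize_google call wraps the free Google Web Speech API) stands in for it, and the word-overlap score and 0.3 threshold are assumptions for illustration, not the project's actual scoring rule.

```python
import speech_recognition as sr

def audio_matches_question_paper(question_text, threshold=0.3):
    """Record roughly 10 s of audio, transcribe it, and compare with the question paper."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source, duration=1)
        audio = recognizer.listen(source, phrase_time_limit=10)
    try:
        spoken = recognizer.recognize_google(audio).lower()
    except sr.UnknownValueError:
        return False  # nothing intelligible was heard
    question_words = set(question_text.lower().split())
    spoken_words = set(spoken.split())
    overlap = len(question_words & spoken_words) / max(len(spoken_words), 1)
    return overlap > threshold  # True -> notify the human proctor

# Example (hypothetical): if audio_matches_question_paper(question_paper_text): alert_examiner()
```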

6.2 Student Module

This module contains the programming dynamics of a portal. To access the portal, the student would need a username and password set by the examiner through their examiner portal. Through this module, the student will be able to view their previous exam data along with the reviews issued by the examiner and also the proctoring system. The exam can be attended using a pin set by the examiner to access the correct question paper. Once the authentication is done, the screen is navigated to the instructions page and then the "exam page." This page consists of the questions along with a multithreading algorithm running to proctor the examinee. The proctor module is implemented in this section of the student module.
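The exam page runs the proctoring checks alongside the question interface through multithreading; a minimal sketch of that pattern is shown below, where run_vision_checks and run_audio_checks are placeholder names standing in for the proctor-module functions, not functions defined by the project.

```python
import threading

def run_vision_checks():
    # Placeholder: gaze tracking, head pose, person/phone detection loop
    ...

def run_audio_checks():
    # Placeholder: periodic noise/voice detection loop (e.g., every 10 s)
    ...

def start_exam_page():
    # Daemon threads stop automatically when the exam window is closed
    for target in (run_vision_checks, run_audio_checks):
        threading.Thread(target=target, daemon=True).start()
    # The PyQt5 exam window keeps running in the main thread
```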


6.3 Examiner Module

This module is similar to the student module except that an examiner can register and set their username and password for authentication. This module contains the programming dynamics of a portal. To access the portal, the examiner has to log in and will then be redirected to the portal. Previous exam data, archived data, and upcoming exam data can be viewed through the examiner portal. The examiner can set the question paper; after giving the necessary credentials, the system will generate a PIN which can be shared with the student or examinee to access the exam.

7 Implementation

Users can access the system from any computer, laptop, mobile, or tablet, provided the application's .exe file is installed. Since authentication is enabled using an online database by Google (Firebase), an Internet connection is a must. The system consists of two parts: the first is the student part of the application, namely the portal (used by students, with authentication credentials set by the examiner); the second is the proctoring software that runs in the background when the student starts their examination. So the following environment characteristics should be available for the system to work properly: operating system: any operating system; web browser: any modern web browser; Internet connectivity: broadband Internet connection with an appropriate bandwidth.

Following are the sample screenshots of our project (Fig. 3). As shown in Fig. 4, the application starts with a login page which navigates the user to either the proctor login page or the examinee (student) portal login page, as depicted in Figs. 7 and 8, respectively. The proctor portal has options to add or configure the details of a student as shown in Fig. 5. The examiner first needs to enter the details of the question paper beforehand to create a question paper as shown in Fig. 6. When the examination starts, the examiner can proctor multiple students at a time in the proctor console as shown in Fig. 9. In Fig. 9, there are options to refresh the proctoring logs, and the proctor will also be able to warn the examinee by switching on the "Show Alerts" option. The proctor can also monitor the student's actions in real time by clicking on the "Monitor Actions" button, which shows the camera view of the selected student as shown in Figs. 10 and 11. The proctor can monitor the student's noise level, excessive lip movement, phone detection in the camera view, and authentication of the student all at once. Each question paper is mapped to a code known as the "Exam Code," and the examinee or student needs to enter the Exam Code to access the question paper or exam, which is pre-configured by the examiner in the "Exam Configuration Page" as mentioned in Fig. 6.


Fig. 3 Activity diagram depicting a brief on the activity depending on the user

Fig. 4 Home page of the application built for demo purposes



Fig. 5 Examinee register page (to add the profiles of the students or examinees) of the application built for demo purposes

Fig. 6 Exam Qs enter form (to store the metadata of each question paper) of the application built for demo purposes

8 Results

The following outcomes were observed while analyzing the final result:


Fig. 7 Proctor login page of the application built for demo purposes

Fig. 8 Examinee login page of the application built for demo purposes

• Module-2 and Module-3, which are the student module and the examiner module, were able to execute in a few milliseconds.
• Module-1, which was the proctor module, took several seconds to execute.
• The functions incorporated in the code through the multithreading concept, such as person detection, noise detection, phone detection, and lip movement detection, were partly successful in terms of performance and partly unsuccessful due to the amount of time they took to execute.
• Human audio-to-text conversion worked remarkably well with plain English. It has room for improvement when the input is in other dialects.


Fig. 9 Proctor console page (for the examiner to proctor all the students simultaneously and be able to access their screen and communicate with all the examinees along with help of AI) of the application built for demo purposes

Fig. 10 Screenshot of lip movement detection of the examinee from the application built for demo purposes

9 Advantages

The proposed system has the following advantages:

• Enables the teacher or examiner to monitor multiple students at once.


Fig. 11 Screenshot of gaze estimation of the examinee from the application built for demo purposes

• Separate portals for both the examiner and examinee.
• Live proctoring with special video and audio functionalities.
• Makes sure the examinee does not cheat using personal assistants like Alexa and Siri.
• Would notify the examiner every time a student is suspected of cheating.

10 Disadvantages

The proposed system has the following drawbacks:

• It is not completely automated and requires a proctor to be present to perform certain operations.
• Speech-to-text conversion does not work well for all dialects.
• There is always scope for students or examinees to cheat.

11 Conclusion

This paper presents an attempt to solve the current issue of virtually invigilating students during their online examinations. The system is designed to be convenient, affordable, and easy to use from both the examinee's and the examiner's perspectives, since it only requires one inexpensive camera and a microphone. From the captured video and audio, we extract features from the following basic components: phone detection, speech detection, text detection, gaze estimation, active window detection, and user verification. These features are then processed to acquire high-level features, which are then used for detecting whether the examinee is cheating or not.


The proctoring solution that was built not only has audio and video functionalities but also has a portal for both the examinee and the examiner. It also lets the examiner proctor multiple examinees at a time.

12 Future Enhancements

To improve the current design, the following features are being considered as future enhancements. The need for a proctor or an examiner cannot be eliminated, as they are required to perform certain operations and can also assert authority during the examinations so that the students or examinees do not cheat. There are still certain ways to cheat. To completely avoid cheating, external hardware such as an additional camera would be required to cover the whole field of view of the test taker; a third camera can be added, making a multimedia analytics system that could cover the whole field of view on its feed. Other planned enhancements include creating a clearer and more efficient user interface, adding more analytical features to both the examiner and examinee portals, and improving speech-to-text conversion.

References 1. Labayen M, Ricardo V, Julián F, Naiara A, Basilio S (2021) Online student authentication and proctoring system based on multimodal biometrics technology. IEEE Access 9:72398–72411 2. Atoum Y, Chen L, Liu AX, Hsu SDH, Liu X (2020) Automated online exam proctoring. IEEE Trans Multimedia 19(7) 3. Mukhanbet AA, Nurakhov ES, Imankulov TS, Al-Farabi (2021) Hybrid architecture of face and action recognition systems for proctoring on a graphic processor. In: 2021 IEEE smart information systems and technologies (SIST), 28–30 April, 2021 4. Asep Hadian SG, Bandung Y (2019) A design of continuous user verification for online exam proctoring on M-learning. In: 2019 international conference on electrical engineering and informatics (ICEEI), July 2019, 9–10, Bandung, Indonesia 5. Muzaffar AW, Tahir M, Anwar MW, Chaudry Q, Mir SR, Rasheed Y. A systematic review of online exams solutions in e-learning: techniques, tools, and global adoption. IEEE Access. http://doi.org/10.1109/ACCESS.2021.3060192 6. Boussakuk M, Ghazi ME, Bouchboua A, Ouremchi R (2019) Online assessment system based on IMS-QTI specification. In: Proceedings of 7th Mediterranean Congress of telecommunications (CMT), Oct 2019, pp 1–4 7. Wagstaff B, Lu C, Chen X (2019) Automatic exam grading by a mobile camera: snap a picture to grade your tests. In: Proceedings of 24th international conference on intelligent user interfaces: companion, pp 3–4 8. Chen Q (2018) An application of online exam in a discrete mathematics course. In: Proceedings of ACM Turing celebration conference (China), May 2018, pp 91–95 9. Lemantara J, Dewiyani Sunarto MJ, Hariadi B, Sagirani T, Amelia T (2018) Prototype of online examination on MoLearn applications using text similarity to detect plagiarism. In: Proceedings of 5th international conference on information technology, computer, and electrical engineering (ICITACEE), Sept 2018, pp 131–136


10. Fan Z, Xu J, Liu W, Cheng W (2016) Gesture-based misbehavior detection in online examination. In: Proceedings of 11th international conference on computer science and education (ICCSE), Aug 2016, pp 234–238 11. Allen IE, Seaman J (2019) Grade change: tracking online education in the United States, 2013, vol 3, no 5. Babson Survey Research Group, Quahog, pp 3–18 12. Guo P, Feng Yu H, Yao Q (2018) The research and application of online examination and monitoring system. In: Proceedings of IEEE international symposium on IT in medicine and education, Dec 2018, pp 497–502 13. Bashier HK, Abdu Abusham E, Khalid F (2012) Face detection based on graph structure and neural networks. Trends Appl Sci Res 7:683–691 14. Jain V, Patel D (2016) A GPU based implementation of robust face detection system. Procedia Comput Sci 156–163 15. King DL, Case CJ (2019) E-cheating: incidence and trends among college students. Issues Inf Syst 15(1):20–27 16. Cluskey G Jr, Ehlen CR, Raiborn MH (2021) Thwarting online exam cheating without proctor supervision. J Acad Bus Ethics 4:1–7 17. Tesfamikael HH, Fray A, Mengsteab I, Semere A, Amanuel Z (2021) Simulation of eye tracking control based electric wheelchair construction by image segmentation algorithm. J Innov Image Process (JIIP) 3(01):21–35 18. Jung I, Yeom H (2019) Enhanced security for online exams using group cryptography. IEEE Trans Educ 52(3):340–349 19. Wahid A, Sengoku Y, Mambo M (2015) Toward constructing a secure online examination system. In: Proceedings of the 9th international conference on ubiquitous information management and communication, Art. no 95 20. Rosen W, Carr M (2013) An autonomous articulating desktop robot for proctoring remote online examinations. In: Proceedings of IEEE frontiers in education conference, Oct 2013, pp 1935–1939 21. Das I, Sharma B, Rautaray SS, Pandey M (2019) An examination system automation using natural language processing. In: Proceedings of international conference on communication and electronics systems (ICCES), July 2019, pp 1064–1069 22. Potapov A (2017) Automatic image analysis and pattern recognition. LAP Lambert Academic Publishing, 292 p 23. Hamdan YB, Sathesh A (2021) Construction of efficient smart voting machine with liveness detection module. J Innov Image Process 3(3):255–268 24. Reale M, Canavan S, Yin L, Hu K, Hung T (2021) A multi-gesture interaction system using a 3-D iris disk model for gaze estimation and an active appearance model for 3-D hand pointing. IEEE Trans Multimedia 13(3):474–486 25. Xiao B, Georgiou P, Baucom B, Narayanan S (2018) Head motion modeling for human behavior analysis in dyadic interaction. IEEE Trans Multimedia 17(7):1107–1119 26. Abisado MB, Gerardo BD, Vea LA, Medina RP (2018) Towards academic affect modeling through experimental hybrid gesture recognition algorithm. In: Proceedings of international conference on data science and information technology (DSIT), pp 48–52

Chapter 19

Apple Leaf Disease Prediction Using Deep Learning Technique Thota Rishitha and G. Krishna Mohan

1 Introduction

Leaf diseases are the main reason for the damage to apple production. Apple scab, frozen spots, cedar rust, black rot, and powdery mildew are the types of diseases that affect apple growth. Therefore, the identification of diseases has attracted more attention and requires early therapeutic intervention. In the past, the process was divided into manual identification and expert systems; however, these rely largely on fruit growers and professionals and take a lot of time. Leaves are prone to many types of fungal infections that affect apples at certain temperatures. Diseases can be discovered using image processing; however, plenty of existing methods require expensive, high-end equipment, and achieving good accuracy is also significantly difficult. A disease can leave many symptoms on the leaves, and the afflicted sections of the leaves can be partitioned using the above technique, so that a model can be trained on the data. This enhances accuracy and yields faster and more cost-effective results. The disease can be detected by the proposed scheme without delay. This methodology can also detect healthy leaves. Whenever a disease is discovered, an accurate ratio of the zone where it has spread is provided. Black rot is caused by the fungus Botryosphaeria obtusa and shows traits like purple spots on the upper leaf surfaces. As the flecks age, the margins remain purple, but the middle part dries out and changes from yellow to brown. Cedar apple rust is caused by the fungal infectious agent Gymnosporangium juniperi-virginianae; affected leaves show yellow and bright-orange round lesions on the upper surface. Apple scab is caused by the fungus Venturia inaequalis, producing olive-green to black velvety stains with indistinct margins on the leaves. These are the types of diseases used in this study.

T. Rishitha (B) · G. Krishna Mohan, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Andhra Pradesh, India. e-mail: [email protected]; G. Krishna Mohan e-mail: [email protected]


The goal of this research is to identify diseases in the leaves by using Inception v3, one of the pre-trained CNN models. Features are extracted from the dataset, and after training and testing, the inception layers are applied with activation functions such as Softmax and the Adam optimizer to get the best accuracy. The proposed model's main objective is to train the model in less time and achieve an accuracy of 97%.

2 Related Works

In this work, the authors introduced apple leaf disease identification using VGG and ROI subnetworks to detect the background, leaf area, and spot area from images. The proposed ROI-Aware DCNN performs best among all compared models and achieved an accuracy of 82% [1]. In this work, the authors introduced the identification of ATLD using pattern recognition and image processing methods for diseased leaf images. A color transformation structure for the input RGB image was designed first, and then the image was converted to HSI, YUV, and gray models. The background was removed based on a specific threshold value, and then the disease spot image was segmented with a region-growing algorithm [2–5].


3 Proposed Methodology

Dataset. We selected the PlantVillage dataset from Kaggle [16], which provides the images in jpg/png format. Each plant has a separate directory for its diseases, of which we used only the apple folder [17]. It specifically includes three types of leaf diseases and one healthy-leaf folder, consisting of 2016 apple scab images, 1987 black rot images, and 1760 cedar apple rust images. The training set and testing set are two subsets of the dataset, as in Fig. 1.

A convolutional neural network is a well-known type of deep neural network that convolves learned features with the input data and uses 2D convolutional layers, making this architecture well suited to processing 2D data such as images and distinguishing one class from another, as in Fig. 2. The need for preprocessing in a ConvNet is minimal compared with other classification algorithms. A CNN eliminates the need for manual feature extraction, so one does not need to hand-design the features used to classify images. The relevant features are not pre-trained; they are learned while the network trains on a collection of images. The well-known GoogleNet, which competed in the ImageNet recognition challenge and finds applications in object recognition, is used for classification.

Fig. 1 Dataset images


Fig. 2 Existing architecture

Inception v3. Inception v3 is a pre-trained model with a depth of 48 layers. It is a variant of a network that has already been trained on over a million images from the ImageNet database. It is the third version of Google's CNN Inception model, which is based on the well-known GoogleNet that competed in the ImageNet recognition challenge and finds applications in object recognition. In this paper, the Inception v3 architecture is used for feature extraction. It was co-developed by Google and several other researchers. The building blocks of Inception v3 are convolutions, max pooling, concatenation, and average pooling, and batch normalization is applied to the activations throughout the model. The feature extraction helps the model to clearly distinguish among all the features of the image and gain knowledge for further interpretation. We have made some changes based on the existing architecture, reducing our computational resources and time to make the system fast; the whole process is summarized in the form of a table shown in Table 1.

Transfer Learning. Transfer learning is an approach in which a section of a model already trained on some task is reused in a new model. So, when we analyze our original dataset with the new model, we reuse the previously extracted features and train the model on our dataset. Thus, we use the bare minimum of resources, data, and time to train the model.

Image Segmentation. The process whereby a digital image is divided into various segments is known as image segmentation; the segments are also called image objects or regions of the image. The process facilitates the representation of the image and makes its evaluation easier and smoother. Image segmentation is used to locate objects and image boundaries. The process assigns a label to each and every pixel in an image in such a way that pixels with the same label share certain characteristics.
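A minimal Keras sketch of this transfer-learning setup follows. The four output classes (three diseases plus healthy), the frozen base, and the classification head are illustrative choices consistent with the description above, not the exact training script used in the study.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

# Pre-trained Inception v3 used as a frozen feature extractor (299 x 299 RGB input)
base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(4, activation="softmax"),  # scab, black rot, cedar rust, healthy (assumed classes)
])

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```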


Table 1 Proposed architecture table

Type                    | Filters/stride | Output size
Input                   |                | 299 × 299 × 3
Conv                    | 3 × 3/2        | 149 × 149 × 32
Batch normalization     | 3 × 3/1        | 147 × 147 × 32
Activation (activation) | 3 × 3/1        | 147 × 147 × 64
Max pooling             | 3 × 3/2        | 73 × 73 × 64
Batch normalization     | 80/1           | 73 × 73 × 80
Conv                    | 192/1          | 71 × 71 × 192
Max pooling             | 3 × 3/2        | 35 × 35 × 64
Block 1 × 3             |                | 35 × 35 × 288
Block 2 × 5             |                | 17 × 17 × 768
Block 3 × 2             |                | 8 × 8 × 2048
Fully connected layer   | Classifier     | 1 × 1 × 1000

Feature Selection. Feature selection (FS) is a well-known preprocessing step that makes it possible to detect and remove redundant or irrelevant features and decrease the data size without compromising performance.

Data Augmentation. Data augmentation refers to approaches for boosting the amount of data in a dataset by modifying the existing samples to increase the number of samples derived from the actual dataset. Data augmentation aids in expanding a dataset's size and diversity.
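One common way to apply such augmentation in Keras is through ImageDataGenerator, sketched below; the directory path, the specific transform ranges, and the 80/20 validation split are assumptions for illustration, not the study's exact settings.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augment the training images while reserving 20% of them for validation
datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.2,
    horizontal_flip=True,
    validation_split=0.2,
)

train_gen = datagen.flow_from_directory(
    "plantvillage/apple", target_size=(299, 299), batch_size=32,
    class_mode="categorical", subset="training")
val_gen = datagen.flow_from_directory(
    "plantvillage/apple", target_size=(299, 299), batch_size=32,
    class_mode="categorical", subset="validation")
```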

4 Proposed System

The following objectives should be achieved through our proposed system, shown in Fig. 3. Apply preprocessing to the dataset, which involves segmentation, feature extraction, and feature selection; this makes the task of training the model on images easy. Build the model with Inception v3 and train it on the data. Plot the necessary graphs showing accuracy and loss. Save the model and the model weights for future use. Load the saved model using Flask to predict the disease of the leaf.
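A sketch of the final step, serving the saved model behind a Flask endpoint, is shown below; the file names, route, and class-label order are illustrative assumptions, not the project's actual values.

```python
import numpy as np
from flask import Flask, request, jsonify
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing import image

app = Flask(__name__)
model = load_model("apple_leaf_inceptionv3.h5")  # assumed file name of the saved model
CLASSES = ["apple_scab", "black_rot", "cedar_apple_rust", "healthy"]  # assumed label order

@app.route("/predict", methods=["POST"])
def predict():
    f = request.files["leaf_image"]
    f.save("upload.jpg")
    img = image.load_img("upload.jpg", target_size=(299, 299))
    x = image.img_to_array(img)[np.newaxis] / 255.0
    probs = model.predict(x)[0]
    return jsonify({"disease": CLASSES[int(np.argmax(probs))],
                    "confidence": float(np.max(probs))})

if __name__ == "__main__":
    app.run(debug=True)
```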

5 Results

We have applied different activation functions to the same dataset and obtained different accuracy results; these variations are provided in Table 2.


Fig. 3 Our proposed model

Table 2 Function table

Activation functions | Optimizer | Result
SoftMax              | Adam      | 97% accuracy
Relu                 | Adam      | 27% accuracy
Sigmoid              | Adam      | 98% accuracy
tanh                 | Adam      | 27% accuracy

Both training and validation produce loss and accuracy values. These are plotted, and Fig. 4 shows the graphical representation for the Inception v3 model. Our model is fitted on the training and validation sets with 20 epochs and a batch size of 32. It achieved an accuracy of 97% and a loss of 0.07%. The calculations are made on the data obtained from validation and training and show how the model behaves on the two sets. For each epoch, the loss is the sum of errors on the training and validation sets, and it also indicates how poorly the model behaves after optimization over all iterations.

Fig. 4 Loss and accuracy graphs of inception v3 model


Fig. 5 Output

The algorithmic performance is measured by the accuracy in a clear and quite understandable way; it describes the model's prediction accuracy by contrasting the predictions with the true data. Next, we created a web page in order to detect diseases in leaves by uploading images and predicting them, as in Fig. 5.
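The accuracy and loss curves of Fig. 4 can be produced from the Keras History object returned by model.fit; a small sketch, assuming the model and data generators defined in the earlier sketches (the batch size of 32 is set in the generators), follows.

```python
import matplotlib.pyplot as plt

history = model.fit(train_gen, validation_data=val_gen, epochs=20)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history.history["accuracy"], label="train")
ax1.plot(history.history["val_accuracy"], label="validation")
ax1.set_title("Accuracy"); ax1.legend()
ax2.plot(history.history["loss"], label="train")
ax2.plot(history.history["val_loss"], label="validation")
ax2.set_title("Loss"); ax2.legend()
plt.tight_layout()
plt.show()
```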

6 Conclusion

Our task is to spot the types of diseases and healthy leaves of the apple. Image segmentation was adopted as a preprocessing phase, and an experiment to find the disease in the leaves will also benefit farmers; it would be a huge step toward increasing apple yields. We assembled more data and applied an algorithm that achieved the best accuracy. There are some other pre-trained models like ResNet-50, VGG-16, VGG-19, and AlexNet; among them, Inception v3 achieved good accuracy because it requires less computing time and fewer parameters. Moreover, the CNN produced good results. In the future, farmers can use the proposed model's web page for detecting diseases in leaves, which saves time and cost.

Acknowledgements I thank my guide, Dr. G. Krishna Mohan, for supporting and helping me with the successful completion of this project.



Chapter 20

Sentimental Analysis and Classification of Restaurant Reviews P. Karthikeyan , V. Aishwariya Rani, B. Jeyavarshini, and M. N. Muthupriyaadharshini

P. Karthikeyan · V. Aishwariya Rani · B. Jeyavarshini (B) · M. N. Muthupriyaadharshini Department of Electronics and Communication Engineering, Velammal College of Engineering and Technology, Madurai, Tamil Nadu, India e-mail: [email protected] P. Karthikeyan e-mail: [email protected] V. Aishwariya Rani e-mail: [email protected] M. N. Muthupriyaadharshini e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_20

1 Introduction Sentiment analysis is the study of text analysis, natural language processing, and linguistics to systematically identify, extract, and study subjective information from source material. A sentiment or opinion is the perspective of customers obtained from reviews, survey responses, online social media, healthcare media, etc. In general, sentiment analysis determines the attitude of a speaker, writer, or other subject with respect to a specific topic, or the contextual polarity toward a particular event, discussion, forum, interaction, document, etc. The fundamental task of opinion analysis is to determine the polarity of a given message at the entity, sentence, and document level. With the increase in Internet users, every user is interested in placing an opinion on the web through different media, and as a result opinionated data is generated on the web. With the growing popularity of these platforms, the negative side also arrives along with the benefits. This increasingly negative scenario on the web has created an enormous demand for social media platforms to take up the task of detecting objectionable content and taking suitable action to prevent the situation from getting worse. Detecting offensive content can be performed manually by human moderators, but this is practically impossible in reasonable time given the


amount of information generated on these social media platforms, so there is a need to fill this gap. Sentiment analysis helps to analyze this opinionated information and extract important insights which can help other users make decisions [1–3]. Social media data can be of different kinds, such as product reviews, movie reviews, airline reviews, cricket reviews, hotel reviews, employee interactions, healthcare reviews, news, articles, and so on. Therefore, the need to identify and moderate emotional or offensive speech is important; detecting such content, however, is a very difficult problem owing to the large volume of user-generated multilingual data on the web, especially on social media platforms. Customers do not simply trust blindly while picking a restaurant or hotel, and they rely on reviews while purchasing a phone, vehicle, or clothes online. They believe that these reviews are pragmatic and tell them what to expect. Although a negative review can come as a shock for owners, they should realize that even the best establishments get bad reviews and that the overall aggregate is a genuine picture of what they offer. In this way, restaurant, bar, or accommodation owners need to urge people to review online and share their experience; by doing so, they are in effect saying, "we do quality work and our service is at a high level all the time—your opinion matters to us!" [4, 5]. Online reviews make it possible for people to state their opinion from home, or from the back seat of a car while driving home, without confronting anyone. The most significant number is the number of reviews, which creates a summary from which one can judge how popular a restaurant, for instance, is. A review is composed of grades for service, surroundings, and cleanliness, and the impact can be huge: it has been shown that an increase of one grade can raise income by 5 to 9%, which positively affects the whole firm. This sort of visibility of restaurants, bars, and accommodation has given smaller establishments in less attractive areas the opportunity to reach a large number of visitors. Today, it is not important where you are or the history of your place; what matters is the level of your service. Social media websites and product review forums give users the opportunity to create content in informal settings. Moreover, to enhance the user experience, these platforms make sure that the user can communicate an opinion comfortably, either in natural language or by switching between one or more languages within the same conversation. For example, there are many brands available on the market, and picking one can be a difficult task for a customer. The progress of e-commerce affects the shopping habits of customers, and shoppers make their decisions based on the reviews present on e-commerce sites [6–9]. Reviews of a product can also be seen on social networking sites. In recent years, social networks have become extremely popular; consequently, there is both an opportunity and a risk, because through these sites the spread of information may become uncontrolled in the future, since everyone seems to be posting comments online.


2 Literature Survey Consumer satisfaction is a fundamental concern in the field of marketing and research into customer behavior. As with hotel consumers, when they receive excellent service they will tell others by word of mouth. Through the analysis of several text mining perspectives, information can be generated that can be used to increase profits and improve service. The purpose of sentiment evaluation is to use Natural Language Processing (NLP), text analysis, and several computational areas to remove or block unnecessary parts and to determine whether a sentence is negative or positive. In the eighteenth century, Reverend Thomas Bayes devised the method now known as Naive Bayes, which uses probability and likelihood to reach a decision. One attribute of the Naive Bayes classification is the assumption of independent input variables: the presence of a particular feature in a class is treated as independent of the other features. As a comparison, the results of this study are also contrasted with TextBlob, a sentiment analyzer built on the Natural Language Toolkit (NLTK) and pattern processing. TextBlob can also be used for text mining, as a text-processing module for Python, and even for text analysis; in addition, it provides simple APIs for common natural language processing (NLP) tasks, for instance part-of-speech tagging, sentence tokenizing, noun-phrase extraction, sentiment analysis, classification, and translation. Sumbal Riaz et al. proposed a text-mining approach for examining client reviews in order to learn the customers' viewpoints, executing SA on a large dataset of product reviews (6 categories) provided by different customers on the web. In this technique, SA was applied at the phrase level rather than the document level to compute each term's sentiment polarity (SP) [10–12]. Keyword extraction was then applied, aimed at extracting the keywords of each document with high-frequency terms, and the intensity of the sentiment polarity was evaluated by estimating its strength. Janice M. Wiebe discussed the classification of documents and sentences. She gathered review data for a variety of product classes, including automobiles, banking, movies, and travel, labeled the terms as positive or negative, and computed the text's overall positive or negative score: if the text has more positive than negative terms, it is considered positive; otherwise, it is considered negative. According to the study, Schrauwen conducted sentiment analysis using the Naive Bayes technique, Maximum Entropy, and a Decision Tree classifier. Accuracy, precision, and recall are used to assess performance with the N-fold cross-validation technique. Another study examined the accuracy obtained when a feature selection procedure was incorporated, using the Naive Bayes and AdaBoost techniques; by combining different approaches, that study achieves higher accuracy values than when only a single technique is used. Sentiment analysis research has also been conducted using the Probabilistic Latent Semantic Analysis technique, where the data is obtained from the review's title rather than the entire comment; that analysis revealed that the identification findings were 76% accurate.


Table 1 Related work
Study | Author | Citation | Key findings
[4] | Spoorthi C., Pushpa Ravikumar, Adarsh M. J. | Sentiment analysis of customer feedback on restaurant reviews | Performs text analysis on a large dataset of restaurant reviews and classifies the text data with a Naïve Bayes classifier using a target quality value
[5] | Liangqiang Li, Liang Yang and Yuyang Zeng | Improving sentiment classification of restaurant reviews with an attention-based Bi-GRU neural network | Applies deep learning to review sentiment analysis in online food-ordering platforms to improve the performance of sentiment analysis in the restaurant review domain
[6] | Dhiraj Kumar, Gopesh, Avinash Choubey, Pratibha Singh | Restaurant review classification and analysis | Enhances the user experience by analyzing restaurant reviews and categorizing them by aspect so that a user can easily learn about the restaurant
[11] | Rachmawan Adi Laksono, Kelly Rossa Sungkono, Riyanarto Serno, Cahyaningtyas Sekar Wahyuni | Sentimental analysis of restaurant customer reviews on TripAdvisor using Naïve Bayes and TextBlob | Classifies Surabaya restaurant customer satisfaction using Naïve Bayes and TextBlob

Table 1 shows the literature survey of the reference papers.

3 Existing Methodology Customer satisfaction is an essential concern in marketing and in research on consumer behavior. As is the case with hotel consumers, when they receive remarkable service they spread the word by word of mouth. Text mining is the process of extracting information from a collection of regularly stored documents using analytical tools [13]. By analyzing several text-mining perspectives, information can be generated that can be used to improve revenue and service. Sentiment analysis is used to learn the author's feelings about a specific item; the sentiment analysis of a review is an evaluation


of the opinion expressed about an item. Sentiment analysis relies on Natural Language Processing, text analysis, and certain computational components to extract or remove unnecessary data and decide whether an expression is negative or positive. In the early centuries, Thomas Bayes created the method now called Naive Bayes, which combines probability and likelihood analysis; the standard form of the Naive Bayes rule is reproduced after this paragraph. Naive Bayes estimates future probabilities in light of previously gathered data or experience. One of the Naive Bayes classification's traits is the assumption of independent record features: the presence of a particular feature in a category is assumed to be independent of the other features. The existing accuracy is 72.77%. Alec Go et al. used an assortment of ML techniques; various ML techniques are available, including Naive Bayes, maximum entropy, and support vector machines. Janice M. Wiebe discussed the classification of documents and sentences. She gathered review data for a variety of product classes, including automobiles, banking, movies, and travel, and labeled the terms as positive or negative [14]. She also computed the text's overall positive or negative score: if the text has more positive than negative terms, it is considered positive; otherwise, it is viewed as negative. The recommended framework performs three functions: association rule mining is performed to extract product attributes from customer reviews; WordNet is used to estimate the semantic orientations of opinion terms; and a feature-based summary is created. Only a few techniques for feature-based summaries have been proposed over the last twenty years. Summarizers are used in a variety of areas, including product reviews, movie reviews, local service assessments, and hotel reviews.
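The equation referred to in the paragraph above is not reproduced in the extracted text; the standard form of the Naive Bayes decision rule, which the description matches, is

\[
P(c \mid x_1,\ldots,x_n) \;\propto\; P(c)\prod_{i=1}^{n} P(x_i \mid c),
\qquad
\hat{c} \;=\; \arg\max_{c \in \{\text{positive},\,\text{negative}\}} \; P(c)\prod_{i=1}^{n} P(x_i \mid c),
\]

where the features \(x_i\) (here, word occurrences in a review) are assumed conditionally independent given the class \(c\), and the review is assigned the class with the largest posterior.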

4 Sentiment Analysis of Restaurant Reviews For a long time, food and hospitality businesses have operated on the understanding that good food and service are the way to attract more customers. More importantly, the data created by the use of online platforms has pointed toward new findings and opened new doors. Most consumers nowadays rate a product online, more than one-third of them write reviews, and almost 88% of people trust online reviews. Review services such as Google Reviews give customers and businesses a way to connect with each other. Reviews and ratings are useful sources of information, but significant problems remain in extracting relevant information and anticipating the future through analysis and correlation with the existing data. Every day, a great many restaurants and businesses are reviewed by customers [3]. The fundamental goal of the work proposed in this paper is to improve the customer's experience by analyzing the reviews of restaurants and classifying them by aspect, so that the customer can easily learn about the restaurant. Restaurants cannot readily use raw reviews for their businesses. We want to use the aspects that are significant in the food and service industry so that we can analyze the sentiment of text reviews and help restaurants improve


their businesses. This research work can be applied to many other industries related to food and hospitality.

5 Sentimental Analysis Sentimental analysis, also known as opinion mining, is a computational study of people's needs, perspectives, and feelings toward an entity. Figure 1 shows the classification approaches of sentiment polarity in machine learning. Opinion mining can obtain positive or negative opinions about the subjects of evaluation and their intensity, and the results of sentimental analysis can be useful in many fields, such as online preference analysis, topic monitoring, informal assessment of products, and so on. Feature selection is a principal task in the field of sentiment analysis, and effective feature selection for subjective texts can fundamentally improve the efficiency of sentiment analysis. Many researchers have studied the feature perspective to find an effective feature selection method. Zhang et al. applied Boolean keyword-based processing to weight the features, and their results show that the selected features are a good choice. Hogenboom et al. used a vectorized representation based on message structure for multi-domain English text sentiment analysis and ultimately showed that this technique works better than word-based feature representation [5]. Sentimental analysis is highly sensitive to feature selection, which can be seen as the identification of domain-specific named entities; as a result, most sentiment analysis methods require domain-specific information to improve the performance of the system. Most existing feature selection studies have limitations, and the efficiency of sentiment analysis decreases significantly when the method is taken out of a particular domain. Many studies have used sentiment dictionaries as well as ML techniques to analyze restaurant reviews, and although generally good results have been achieved, the data-processing effort is relatively high and the domain is less

Fig. 1 Classification approaches of sentiment polarity in machine learning


adaptable. Meanwhile, deep learning-based sentiment analysis methods are gaining popularity, since deep learning combines feature extraction with richer representations and better performance. Abdi et al. proposed a deep learning-based method for classifying users' opinions expressed in reviews (called RNSA), which overcomes the shortcomings of traditional techniques that lose word-order and positional information and achieves excellent results in sentence-level sentiment classification. In the field of sentiment analysis, many researchers have used methods based on sentiment dictionaries or traditional machine learning. The results of these techniques are not satisfactory, as the performance of the model depends heavily on the feature selection methodology and the tuning of the parameters [9]. Deep learning encompasses Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), and other network structures. Deep learning-based sentiment models use neural networks to learn how to extract complex features from data with minimal external input, and they have achieved good performance in natural language processing. Compared with sentiment analysis techniques using classical machine learning, deep learning-based sentiment analysis is more generalizable, and deep learning-based techniques also perform better in terms of feature extraction and nonlinear fitting abilities.

6 Proposed Methodology The existing methodology used Naive Bayes and reported an accuracy of 72.77%; the same algorithm is used in the proposed methodology, which attains an accuracy of 66.66% with Naive Bayes and 92.22% with Logistic Regression. Sentiment analysis is a leading research field under natural language processing (NLP), comprising the process of detecting and extracting feelings/opinions from text and classifying their sentiment. Figure 2 depicts the sentimental analysis process, in which opinion analysis focuses on people's views, evaluations, emotions, and attitudes toward individuals, organizations, products, movies, issues, events, and so on. In this technique, only one item is handled, and we determine the opinion about that same item; sentiments are expressed about just a single entity. This proposed work automatically predicts the sentiment of text based on the stored dataset values. By training on the dataset values, it is feasible to predict the sentiment of text data using the Naive Bayes classifier algorithm.


(Flowchart, Fig. 2: input data — restaurant review dataset → preprocessing (handle missing values, drop unwanted columns, remove punctuation, stop words and stem words via NLP techniques) → feature extraction/vectorization (count vectorization) → train/test data splitting → classification (LR and NB) → accuracy, predictions, visualizations.)

Fig. 2 System architecture

6.1 Data Collection In this step, the data is extracted from Kaggle in a plain format. Missing fields are discarded in this process, and the data is then transformed. Sentiment analysis can be viewed as a classification task. There are three fundamental classification levels in sentiment analysis: document-level, sentence-level, and aspect-level. Document-level analysis aims to classify an opinion document as expressing a positive or negative opinion; it considers the full document as the basic information unit. In this study, we used the reference dataset "Restaurant Reviews.csv" for analyzing restaurant reviews. These reviews are written in plain language and contain some work-related discussion and casual language [15]. They include both positive and negative reviews that are distinct from one another, so classification models can be trained to capture the customer's point of view. We collected 900 training samples. There are two columns in this dataset: the first column contains text data denoted by the term "Review," while the second column holds binary values denoted by the term "Liked." For example,


if a review is favorable to the restaurant, i.e., a positive review, the corresponding label is set to "1." Conversely, if a review is unfavorable to the restaurant, i.e., a negative review, it is labeled "0."
dataset = pd.read_csv('Restaurant_Reviews.csv', delimiter='\t', quoting=3)

6.2 Data Preprocessing Given that we are analyzing text data for sentiment analysis, data preparation is essential to ensure that the model receives clean input. Text data is densely packed with noise, so it is challenging to sanitize the texts intelligently [11]. Preprocessing substantially reduces the size of the text documents. It happens in a series of steps: each review goes through a preprocessing stage that removes all uncertain information, for example stop words, numerals, and special characters. We imported the NLTK library, its stop-word list, and the Porter stemmer for removing stop words, numbers, and special characters. NLTK (Natural Language Toolkit) is a standard Python framework for working with human language data. It provides intuitive interfaces to over 50 corpora and lexical resources, including WordNet, as well as a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, together with wrappers for industrial-strength natural language processing libraries [3]. The collected raw data of restaurant reviews comprises a large number of attributes; reducing the attributes is required, and extracting the necessary ones is even more essential. In data cleaning, attributes are removed, missing values are filled in, inconsistent data is eliminated, and central-tendency measures of the attributes, such as the mean, median, and quartiles, are computed. In the data preprocessing step, the data is cleaned and extracted before analysis. Non-textual contents and contents that are irrelevant for the analysis are identified and discarded.

6.3 Stop-Word Elimination Stop words are functional terms that appear frequently in the language of the text (for instance, "a," "the," "an," and "of" in English), making them unusable for classification. We remove stop words by using the Natural Language Toolkit package; we do not want these terms to consume valuable storage space or processing time in our dataset. This is essentially achieved by keeping a list of words that are regarded as stop words. In Python, NLTK stores a list of stop words for 16 different languages. Stop-word removal is the process of eliminating words that occur frequently but carry little meaning in the language, for example "the," "a," "an," and "in." For example, after stop-word removal a review sentence such as "let me preface this review by saying …" is reduced to


only its content-bearing words. Punctuation removal is the process of eliminating punctuation marks that appear frequently but usually carry little meaning, such as "-", "/", ":", and "?". After the preprocessing of the text is done, the reviews are classified using Naive Bayes [11].

6.4 Stemming Stemming is the process of reducing inflected words to their root (or stem) in information retrieval, so that related terms map to the same stem. This approach automatically reduces the number of distinct words associated with each document, thus shrinking the feature space. In our experiments, we use an implementation of the Porter stemming algorithm. For example, the English word "generalizations" is successively reduced to "generalization," "generalize," "general," and finally the stem "gener."
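A minimal sketch of the cleaning, stop-word removal, and Porter stemming steps described in Sects. 6.2–6.4, assuming NLTK is installed and the dataset loaded as in Sect. 6.1; the regular expression and variable names are illustrative assumptions.

```python
# Sketch: clean reviews, drop stop words, and stem with the Porter stemmer.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

nltk.download("stopwords")                    # one-time download of the stop-word list
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean_review(text):
    text = re.sub("[^a-zA-Z]", " ", text).lower()                       # keep letters only
    tokens = [stemmer.stem(w) for w in text.split() if w not in stop_words]
    return " ".join(tokens)

corpus = [clean_review(r) for r in dataset["Review"]]                   # dataset from Sect. 6.1
```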

6.5 Bag-of-Words Model The bag-of-words model is a technique for representing text data for ML algorithms, and it guides us in this task. The bag-of-words paradigm is simple to understand and use: it is a method of extracting textual features for use in ML algorithms. In ML, the act of converting natural language text into numbers is referred to as vectorization. Potential features are extracted from the cleaned dataset and converted into a numeric format; vectorization is therefore a strategy that transforms textual input into numerical data. A matrix is produced through vectorization, with each column representing a feature and each row representing a single review. In the first major phase of natural language processing, we not only cleaned all of the reviews but also built a corpus. Corpus is a term that refers to a collection of texts; our model works with a corpus of 1000 cleaned reviews. We built the bag-of-words model from the corpus, and it contains all of the corpus's distinct words. There are 1000 reviews in our corpus, and each unique term contributes one column; since 1000 reviews contain a large number of unique words, the matrix has a large number of columns. We created a table with all of these columns and a row count of 1000. Using the bag-of-words model, we simply eliminate duplicate values and data redundancy. Every cell contains a number representing the frequency with which the corresponding word appears in the review. For example, the first review contains the phrase "wow love place," which results in a value of 1 for the "wow" cell; the second row contains no occurrence of "wow," resulting in a value of "0" for the "wow" cell in the second row. This is how the bag-of-words model was created.
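The bag-of-words matrix described above can be produced with scikit-learn's CountVectorizer; the max_features cap is an assumed value, not a figure from the paper.

```python
# Sketch: turn the cleaned corpus into a document-term (bag-of-words) matrix.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=1500)   # cap the vocabulary size (assumed value)
X = vectorizer.fit_transform(corpus).toarray()    # one row per review, one column per word
y = dataset["Liked"].values                       # 1 = positive review, 0 = negative review
```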


6.6 Data Classification The lexicon-based approach to opinion mining is used to analyze or predict the sentiment of the text. There are two strategies within this approach. The dictionary-based strategy relies on finding opinion seed words and then searching the dictionary for their synonyms and antonyms. The corpus-based approach starts with a seed list of opinion words and then finds other opinion words in a large corpus, which helps in identifying opinion words with context-specific orientations; this can be done using statistical or semantic methods. Data mining has two most frequent stated goals: classification and prediction. A classification model characterizes discrete, unordered values or data. In this prediction process, the classification technique used is the Naive Bayes classifier [4].

6.7 Splitting Dataset Dividing the dataset into two parts is a basic step of a machine learning model: 1. Training set 2. Testing set. Machine learning's essential goal is to generalize beyond the data examples used to train the model. We wish to test the model to determine the quality of its generalization on data it has not been trained on. However, since future instances will have unknown target values and we cannot check the accuracy of our predictions for future instances right now, we must use some of the data for which we already know the answer as a proxy for future data; this portion is referred to as our test set. When dealing with large datasets, the most common practice is to partition them into training and test subsets, typically with a ratio of 70–80% for training and 20–30% for testing. The train_test_split function, loaded from the scikit-learn package, performs this split randomly. Training Set: 80% of the data is included in our training set; both the independent variables (x_train) and the dependent variable (y_train) are known in the training set. Testing Set: the test set contains 20% of the data from the 1000 reviews, where the independent variables are denoted by (x_test) and the dependent variable by (y_test).
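The 80/20 split described above corresponds directly to scikit-learn's train_test_split; the random_state value is an arbitrary choice for reproducibility.

```python
# Sketch: hold out 20% of the reviews as a test set.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)
```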


Fig. 3 Splitting dataset

7 Splitting Figure 3 shows the splitting of the dataset.

8 Naive Bayes The classification technique Naive Bayes is based on Bayes' theorem. The Naive Bayes classifier's essential characteristic is a very strong independence assumption between conditions and events. The Naive Bayes classifier combines this model with a decision rule; one common rule is to pick the hypothesis that is most probable, known as the maximum a posteriori or MAP decision rule. For text classification at the word-feature level, the Naive Bayes assumption of attribute independence works well. When the number of attributes is large, the independence assumption allows each attribute's parameters to be learned independently, considerably simplifying the learning process [11]. It is one of the well-known classification algorithms used in data mining and is a probabilistic classifier. Figure 4 shows the confusion matrix obtained with the Naive Bayes classifier. The classifier relates the attributes to each other and depends on a number of parameters; the governing assumption is that the variables are independent. It produces accurate results with suitable estimation and gives fast results.

9 Logistic Regression Logistic Regression is one of the most popular machine learning algorithms and comes under the supervised learning technique. It is used for predicting the


Fig. 4 Confusion matrix for Naïve Bayes

categorical dependent variable using a given set of independent variables; Logistic Regression predicts the outcome of a categorical dependent variable.
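A sketch of training and comparing the two classifiers discussed in Sects. 8 and 9; the choice of MultinomialNB (a common variant for count features — the conclusion mentions a Bernoulli NB classifier) and the default hyperparameters are assumptions rather than the authors' exact setup.

```python
# Sketch: fit Naive Bayes and Logistic Regression on the bag-of-words features and compare them.
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

nb = MultinomialNB().fit(X_train, y_train)
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

for name, clf in [("Naive Bayes", nb), ("Logistic Regression", lr)]:
    pred = clf.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))           # compare with Figs. 4 and 5
    print(classification_report(y_test, pred))      # precision/recall/F1 as in Tables 2 and 3
```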

10 Result First, we analyzed the "Restaurant Reviews.xlsx" document; the overall number of positive and negative reviews is the same in this case. Steps such as data collection, data preparation, the bag-of-words model, and fitting the algorithm to the training dataset are performed, and the Naive Bayes technique is then used to classify a review as positive or negative, yielding an accuracy score of 66.66 percent. The accuracy obtained through Logistic Regression is 92.22%, as in Fig. 5. Figure 6 shows the classification of reviews for our given dataset. Table 2 shows the performance metrics for Naïve Bayes, Table 3 shows the performance metrics for Logistic Regression, and Table 4 shows the performance comparison.

11 Conclusion After exploring a large corpus of reviews, we conclude that the NB model beats competing techniques on almost every evaluation metric. We propose the Naive Bayes classifier model for sentiment analysis in this work. This model may be used to determine the sentiment of any kind of text data, including tweets,


Fig. 5 Confusion matrix graph for logistic regression
Fig. 6 Classification of restaurant reviews

Table 2 Performance metrics for Naïve Bayes
 | Precision | Recall | F1-score | Support
0 | 0.53 | 0.77 | 0.63 | 30
1 | 0.85 | 0.67 | 0.75 | 60
Micro-avg | 0.70 | 0.70 | 0.70 | 90
Macro-avg | 0.69 | 0.72 | 0.69 | 90
Weighted avg | 0.75 | 0.70 | 0.71 | 90


Table 3 Performance metrics for logistic regression
 | Precision | Recall | F1-score | Support
0 | 0.07 | 1.00 | 0.13 | 3
1 | 1.00 | 0.54 | 0.70 | 87
Micro-avg | 0.56 | 0.56 | 0.56 | 90
Macro-avg | 0.53 | 0.77 | 0.42 | 90
Weighted avg | 0.97 | 0.56 | 0.68 | 90

Table 4 Performance comparison
Existing accuracy using Naïve Bayes and TextBlob | Proposed accuracy using Naïve Bayes and logistic regression
72.06% and 69.12% | 66.66% and 92.22%

brand/product reviews, and tourist attraction reviews. The model was tried on a dataset of 1000 restaurant reviews. Sentiment analysis is essential for customers and service providers alike. Nowadays, in the modern era of the Internet and globalization, both customers and service providers are interested in the general public's opinion of a particular brand, product, or place [3]. It helps the service provider because it includes a business aspect, and it also helps customers because it assists them in picking the best product. We have completed our work on the Bernoulli NB classifier, which is an excellent ML model for this evaluation and improves the sentiment estimate. Distinguishing a sarcastic review or text is a hard problem in the field of sentiment analysis; a machine may be trained to recognize sarcasm. Finally, a review is categorized into one of a few types of customer satisfaction, which is key to the restaurant business. The results also show that the Naive Bayes technique achieves 66.6% accuracy and Logistic Regression produces 92.22% accuracy.

12 Future Work In future work, the same technique can be further developed to handle sarcastic reviews, so as to identify precisely what people liked or disliked. Future studies could focus on sarcastic expressions, which are notoriously hard to understand, both for people and for computers. Another challenging problem is recognizing spam content in customer reviews.


References 1. Jagdal RS, Shirsat VS, Desphmukh SN (2019) Sentiment analysis on product reviews using machine learning techniques. Springer Nature Singapore Pte Ltd., Singapore 2. Ring CE (2012) Hate speech in social media: an exploration of the problem and its proposed solution 3. Reddy KN, Indira Reddy P (2021) Restaurant review classification using Naïve Bayes model. J Univ Shanghai Sci Technology 4. Spoorthi C, Puspha R, Adarsh (2018) Sentiment analysis of customers feedback on restaurants. Int J Eng Res Technol (IJERT) 5. Li L, Yang L, Zeng Y (2021) Improving sentiment classification of restaurant reviews with attention-based Bi-GRU neural network. https://www.mdpi.comjournalsymmetry 6. Kumar D, Gopesh, Choubey A, Singh P (2020) Restaurant review classification and analysis, vol 11, no 8. https://www.jespublication.com 7. Sasikala P, Mary Immaculate Sheela L (2020) Sentiment analysis of online product reviews using DLMNN and future prediction of online product using IANFIS. J Big Data 8. Tontodimamma A, Nissi E, Sarra A, Fontanella L (2020) Thirty years of research into hate speech: topics of interest and their evolution. https://doi.org/10.1007/s11192-020-03737 9. Prathan R, Chaturvedi A, Tripathi A, Sharma DK (2020) A review on offensive language detection. https://www.researchgate.net/publication/338355806 10. Dang NC, Moreno-Garcia MN, De la Prieta F (2020) Sentiment analysis based on deep learning: a comparative study. www.mdpi.com/journal/electronics 11. Laksono RA, Sungkono KR, Serno R, Wahyuni CS (2019) Sentimental analysis Laksono restaurant customer reviews on Trip Advisor Kelly using Naïve Bayes. In: 12th international conference on information and communication technology and system (ICTS) 12. Tripathi M (2021) Sentiment analysis of Nepali COVID19 tweets using NB, SVM AND LSTM. J Artif Intell 3(03):151–168 13. Sungheetha A, Sharma R (2020) Transcapsule model for sentiment classification. J Artif Intell 2(03):163–169 14. Pandian AP (2021) Performance evaluation and comparison using deep learning techniques in sentiment analysis. J Soft Comput Paradigm (JSCP) 3(02):123–134 15. Kottursamy K (2021) A review on finding efficient approach to detect customer emotion analysis using deep learning analysis. J Trends Comput Sci Smart Technol 3(2):95–113

Chapter 21

A Controllable Differential Mode Band Pass Filter with Wide Stopband Characteristics K. Renuka, Ch. Manasa, P. Sriharitha, and B. Vijay Chandra

K. Renuka (B) · Ch. Manasa · P. Sriharitha Department of ECE, PACE Institute of Technology and Sciences, Ongole, Andhra Pradesh, India e-mail: [email protected] Ch. Manasa e-mail: [email protected] P. Sriharitha e-mail: [email protected] B. Vijay Chandra Department of EEE, PACE Institute of Technology and Sciences, Ongole, Andhra Pradesh, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_21

1 Introduction Filters are used extensively in millimeter-wave and microwave applications, and the development of the BPF is the most important component in radar and UWB applications. The essential BPF configuration should have a small size, low insertion loss, low cost, and good selectivity [1]. Nowadays, radar has major widespread benefits in the areas of telecommunications, civil, and navigation applications. The evolution of multi-passband filters has acquired more awareness and attention with the advancement of wireless communications [2]. In wireless communications, specific frequency-band signals have to be filtered out of mixed-frequency signals. The band pass filter can be designed by combining low pass and high pass filter properties [3]. The behavior of discrete elements varies at higher frequencies, so the discrete elements are replaced with microstrip transmission lines. In a microstrip line, a thin conductor strip is placed above the dielectric substrate mounted on the ground plate at the dielectric bottom. As WiMAX and CDMA work at higher gigahertz frequencies, higher-data-rate circuits are required to filter these signals. The conventional filtering techniques will not be sufficient to filter the noise


rendered at higher frequencies due to skin effects. Even if L, R, C element filtering techniques are used, they lead to poor filtering because the L and C values vary substantially at higher frequencies. Therefore, microstrip lines are used to overcome the problems caused by the L and C values, as these element values can be easily realized through modifications of the termination impedances and lengths, and also by scaling the frequency, load impedances, and source impedances. Microstrip line filters can be realized and modeled in various ways; the coupled-resonator and image-parameter methods are the most widely used. The design mentioned in [4] is a cost-efficient image-parameter technique for lumped LC elements with a low pass filter. A filter design with fewer circuit elements will always be more efficient, and this technique is also used for high pass filter design. The design in [5] proposed an equal-ripple and maximally flat passband filter with a general stopband, and vice versa. This design is analyzed using one or two transformed frequency variables to improve numerical conditioning and to solve the approximation problem. Combinations of reconfigurable microstrip antennas and filter designs are reported in the literature to deal with multi-functional wireless communication tasks. Electronic tuning methods help to improve the performance of these applications and hence reduce the large number of RF elements; thus, the entire wireless communication unit becomes cost-efficient [6]. The microstrip band pass filter (BPF) is widely used in various applications such as wireless communication and radio frequency (RF) because of its ability to suppress interference and noisy signals [7–9]. The filter is the most important component for enhancing the performance of radar: it has to pass the necessary frequencies and reject the unwanted signals. Radar is mostly used to detect the existence of an object by means of electromagnetic waves. The radar operation is to radiate the desired electromagnetic waves from the antenna; the antenna then captures the incoming signal reflected from the object and sends it to the radar, where the signal is processed to locate the object. The band pass filter passes the frequency signals between the first and second cutoff frequencies and attenuates the frequencies outside this range [10–12]. In this work, the major focus is on a single band pass filter for controlled frequencies, which has been designed and analyzed. The microstrip line, attached to a stepped-type resonator and with a slot on the ground, was designed on an RT/duroid substrate; the resulting filter works in the WiMAX band at 3.5 GHz.

2 Filter Design In this work, the band pass filter is designed on a Rogers RT/duroid substrate with a thickness of 0.8 mm. The proposed balanced single band pass filter is designed using ANSYS HFSS. The structure comprises U-type microstrip lines and a rectangular stepped-impedance slot-line resonator which is etched on the ground. In between the two microstrip lines, a resonator is placed to enhance the impedance of the resonator circuit. Each circuit acts as a half-wavelength resonator, which is used to produce the band pass characteristics. The filter is characterized by its characteristic impedance: the band pass and band stop responses are obtained by transforming the inductance and capacitance values of the low pass and high pass prototypes. In this work, we have concentrated on a balanced band pass filter for single-band applications (Fig. 1).
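For reference, the standard low-pass-prototype to band-pass transformation implied by this description is shown below in its textbook form; these are not the authors' specific element values. Here \(\omega_0\) is the center frequency and \(\Delta\) the fractional bandwidth:

\[
\omega \;\longrightarrow\; \frac{\omega_0}{\Delta}\left(\frac{\omega}{\omega_0}-\frac{\omega_0}{\omega}\right),
\qquad
\omega_0=\sqrt{\omega_1\omega_2},
\qquad
\Delta=\frac{\omega_2-\omega_1}{\omega_0},
\]

so each series inductor \(L\) of the low-pass prototype becomes a series resonator with \(L_s = L/(\Delta\omega_0)\) and \(C_s = \Delta/(\omega_0 L)\), and each shunt capacitor \(C\) becomes a parallel resonator with \(C_p = C/(\Delta\omega_0)\) and \(L_p = \Delta/(\omega_0 C)\). A half-wavelength microstrip resonator resonates near

\[
f_0 \;\approx\; \frac{c}{2\,\ell\,\sqrt{\varepsilon_{\text{eff}}}},
\]

where \(\ell\) is the resonator length and \(\varepsilon_{\text{eff}}\) the effective permittivity of the line.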


Fig. 1 Band pass characteristics

Fig. 2 a Iteration 1, b Iteration 2 and c Iteration 3 evolution of proposed antenna

two microstrip lines, a resonator is placed to enhance the impedance in the resonator circuit. Each circuit acts as a half wavelength resonator which is used to produce the band pass characteristics. Filter characterization based on the characteristic impedance is as follows as to design a band pass filter using low pass the inductance and capacitance and for high pass and for band pass and band stop. In this work, we have concentrated on balanced band pass filter for single band applications (Fig. 1). The evolution of proposed antenna structure analyzed in Fig. 2 the iteration one consists of a microstrip line (u shaped) and its response has been observed. Later at the ground for the second iteration, the ground is modified with a slot to enhance the impedance characteristics. The final iteration consists of a H -shaped resonator in the middle of the u-shaped microstrip lines as shown in Fig. 2 (Fig. 3). Figure 4 shows how s parameters is varied for band pass filter from port-to-port variations The u-shaped microstrip line plays a important role for obtaining the


Fig. 3 Band pass filter quarter wave structure

Fig. 4 S-parameters response of proposed filter

band pass filter characteristics. The first pass band works over the range of 3–4.5 GHz, with the corresponding 3-dB bandwidth of the pass band. The total dimension of 49 × 25 gives common-mode suppression greater than 50 dB, while the return losses remain good in the normal pass band (Fig. 5 and Table 1).

3 Current Distributions The current distributions of the band pass filter show how the microstrip line behaves for the given response: in Fig. 6a, port 1 is excited, showing how the microstrip line influences the band pass filter (Fig. 7).


Fig. 5 S-parameters response of proposed filter when different ports are excited

Table 1 Dimension table
Wi1 = 2.5 mm, Wi2 = 4.0 mm, Wi3 = 2.5 mm, Li1 = 14.0 mm, Li2 = 14.0 mm, Li3 = 17.0 mm, Ls1 = 8.0 mm, Ls2 = 5.9 mm, Ls3 = 8.0 mm, Ls4 = 10.2 mm, Ws1 = 5.0 mm, Ws2 = 6.0 mm, Ws3 = 0.6 mm, Ws4 = 0.2 mm, Ws5 = 0.2 mm, Ws6 = 1.0 mm, Lm1 = 17.0 mm, Lm2 = 17.75 mm, Wm1 = 1.2 mm, Wm2 = 0.5 mm, L1 = 22.7 mm, L2 = 11.7 mm, L3 = 18.2 mm, W1 = 0.5 mm, W2 = 0.6 mm, W3 = 0.5 mm, g1 = 0.4 mm, g2 = 0.6 mm, g3 = 0.3 mm, g4 = 6.0 mm, g5 = 0.4 mm

Fig. 6 Current distribution responses of proposed filter when port 1 and port 2 excitation


Fig. 7 Current distribution responses of proposed filter when port 3 and port 4 excitation

4 Conclusion In this article, a balanced single band pass filter is presented. The proposed filter achieves controllable DM center frequencies and FBW, with high common-mode suppression, good selectivity, and good stopband characteristics. The simulated results show good band pass filter characteristics from 3 to 4.5 GHz in the S-band.

References 1. Zhu L, Sun S, Menzel W (2005) Ultra-wideband (UWB) bandpass filters using multiple-mode resonator. IEEE Microw Wirel Compon Lett 15(11):796–798 2. Chen Y, Dai Z, Chiu C, Chiou S, Chen Y, Lin Y, Chen K, Wu H, Lee H, Su Y, Chang S (2016) Compact dual-band bandpass filter based on quarter wavelength stepped impedance resonators. Int J Electr Comput Eng 10(4):517–520 3. Sun S, Zhu L (2006) Capacitive-ended interdigital coupled lines for UWB bandpass filters with improved out-of-band performance. IEEE Microw Wirel Compon Lett 16(8):440–442 4. Hao Z, Hong JS (2010) Ultrawideband filter technologies. IEEE Microw Mag 56–68 (2010) 5. Viswavardhan Reddy K, Dutta M, Dutta M (2014) Design and analysis of band pass filter for wireless communication. IJLTEMAS III(VI). ISSN 2278-2540 6. Bianchi G (1988) Image parameter design of parallel coupled microstrip filters. In: 18th European microwave conference, 12–15 Sept 1988 7. Orchard HJ, Temes GC (1968) Filter design using transformed variables. IEEE Trans Circuit Theor 15(4) 8. Tu Y, Al-Yasir YI, Ojaroudi Parchin N, Abdulkhaleq AM, Abd-Alhameed RA (2020) A survey on reconfigurable microstrip filter-antenna integration: recent developments and challenges. Electronics 9(8):1249 9. Hong J-S, Lancaster MJ (2004) Microstrip filters for RF/microwave applications, vol 167. Wiley, Hoboken 10. Richard JC, Chandra MK, Raafat RM (2017) Microwave filters for communication systems fundamentals, design, and applications. Wiley, Hoboken


11. Ian H (2006) Theory and design of microwave filters. IET electromagnetic waves series 48. IET, London 12. Young M (1989) The technical writer’s handbook. University Science, Mill Valley, CA

Chapter 22

Design and Analysis of Conformal Antenna for Automotive Applications Sk. Jani Basha, R. Koteswara Rao, B. Subbarao, and T. R. Chaitanya

Sk. Jani Basha (B) · R. Koteswara Rao · B. Subbarao Department of ECE, PACE Institute of Technology and Sciences, Ongole, Andhra Pradesh, India e-mail: [email protected] R. Koteswara Rao e-mail: [email protected] B. Subbarao e-mail: [email protected] T. R. Chaitanya Department of CSE, PACE Institute of Technology and Sciences, Ongole, Andhra Pradesh, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_22

1 Introduction Automotive industries mainly focus on the advancement of existing vehicle features which help in exchanging information from vehicle to vehicle (V2V) [1]. Most of the challenges faced by the industry are in terms of the functioning, design, styling, and location of antennas [2]. Intelligent transportation systems (ITS) and telematics help in sharing traffic information between linked vehicles; to achieve this, a compact antenna is needed. In car antennas, dual-frequency bands are used to fulfill the advanced requirements of MIMO systems. Long-term evolution (LTE) provides the cellular communication between vehicles [3]. Owing to its high data rates and scalability, LTE delivers satisfactory performance for automotive applications. Numerous multiband antenna solutions for automobile applications have been presented, including the GPS, GSM, LTE, and WiMAX bands. To achieve low-noise, smooth operation with a high quality of service (QoS), conformal antennas are used [3]. In this article, a polyamide substrate is used for its stable behavior in different weather conditions. Here, the antenna is analyzed in two cases, i.e., stand-alone mode and inside a radome. Low moisture absorption, a low coefficient of hygroscopic expansion, low permeation due to the high molecular order in the crystalline regions, adjustability of its coefficient of thermal expansion, and stable


performance over a range of humidity and moist environments are just a few of the benefits of this substrate [4]. Some of these properties are compatible with the proposed antenna band (5.850–5.925 GHz). The dielectric constant of the antenna inside a radome ranges from 2 to 2.1 [5]; when the relative permittivity exceeds 2.1, the operating band shifts further or the reflection losses eventually rise. Here, the antenna is placed inside a shark fin. A front position is also closer to the control electronics (shorter cables). Conformal antenna cavities inside a shark fin provide omnidirectional coverage around the car. When the antenna faces the driving direction, the antenna gain increases, which increases the coverage range [6]. The antenna is designed to reduce complexity and to improve RF performance for the Internet of Vehicles (IoV). This approach does not require complex coding or selection algorithms, yet it strengthens the signal. Dual-port antennas, meaning a combination of two antennas, one linearly polarized and one circularly polarized, are used [7]. The results for these antennas show good impedance matching and radiation performance, and the cost of the antenna is minimized. In this article, three fundamental approaches are considered, which are required to identify and reassure the models for advanced systems used in current wireless technology and some radar techniques: tunable antennas integrated with radio-frequency switching devices, wideband or multiband antennas integrated with tunable filters, and array topologies in which the same aperture is used for several operational modes [8, 9]. In this modern era, antennas have a wide range of applications in many fields, with certain fixed characteristics such as gain, radiation pattern, frequency band, etc. [10]. Desired features include reduced insertion loss, excellent isolation, great linearity, low power losses, little or no DC power consumption, and broad bandwidth, together with a very low driving voltage, a fast tuning speed (1–100 ns), high power-handling capability, exceptional dependability due to the lack of moving parts, and relatively low cost [11]. Other switching devices, by contrast, exhibit nonlinear behavior, a poor quality factor, and discrete tuning, and consume a substantial amount of DC power in their on state [12]. Varactors are nonlinear with a small dynamic range, necessitating sophisticated bias circuitry.

2 Antenna Design Process The design layout and fabricated prototype of the proposed conformal antenna are illustrated and simulated using the ANSYS Electronics Suite 19.2 (an upgraded version of HFSS) simulator in Fig. 1. The antenna consists of a polyimide substrate carrying three concentric rings linked together. The proposed antenna has a size of 30 mm × 26.6 mm × 0.4 mm, a dielectric constant of 2.9, and a loss tangent of 0.008. The antenna occupies a 30 × 26.6 mm board area and is printed using the PCB etching technique. The analysis is done in discrete mode. The antenna is fed through a 50-Ω microstrip line. The proposed antenna cavity provides omnidirectional coverage around the car. Observations are taken by placing the antenna at different positions on the car. We also know that the antenna gain increases when the


Fig. 1 Microstrip antenna iterations. a Iteration-1. b Iteration-2. c Iteration-3. d Iteration-4. e Iteration-5


Fig. 2 Reflection coefficient and frequency plot at different iterations

Table 1 Dimensions of the proposed antenna (units in mm)
Ls = 30, Ws = 26.6, Wg = 11.8, Wf = 2, Lg = 6.8, Lf = 7, g = 0.5, R1 = 2, R2 = 2, R3 = 2, R4 = 5, R5 = 6, R6 = 7, R7 = 8

antenna is in the driving direction. Hence, the position of the antenna on the car is a major consideration for the coverage area and for cellular communication in the driving direction. The iterations of the proposed antenna and the simulated S-parameters of each iteration are shown in Fig. 1. The proposed antenna is intended to be used as a conformal antenna in a radome of the vehicle, operating in a spectrum suitable for vehicular communication bands. Very thin antennas are suitable for such applications, since they can be bent easily with good flexibility for conformal use. These substrates are based on polyamide; good thickness uniformity, dimensional stability, extremely low moisture absorption, flame-resistant characteristics, and stable dielectric properties suitable for vehicle applications are all features of this material (Fig. 2 and Table 1).


2.1 Bending Analysis The analysis starts with a basic circular antenna. First, a circular patch antenna is taken with a patch connected at the bottom of the circle, as shown in Fig. 3a. Another circle is then drawn with a radius one unit smaller than the previous one; in this way, four similar circles with successively smaller radii (each smaller by 1) are used. A box is drawn with coordinates (15, 0, −0.5), positioned on the axes at x: 0.5, y: 13, z: 2.5, and assigned the material PEC. In the simulation process in the ANSYS software there are primary materials such as perfect electric conductor (PEC) and vacuum; PEC is the idealized material with infinite electrical conductivity and no resistivity. Another box is then drawn with coordinates −ls1/2, −ws/2, −0.3, i.e. −15, −13.3, −0.3. A region is created around the box with a value of −75 mm/fr and then hidden using the eye-shaped option in the software. Next, a circle with a radius of 3 mm is drawn in the box at position −5.65, 0, 0.1. Similarly, further circles with different radii are drawn (z, 8, 0 and z, 7, 0), forming two circles of different radii. Selecting both circles while holding the control key and choosing the subtract option creates a circular-shaped region in

Fig. 3 Proposed antenna bent at different angles


the box. Similarly, three different circular paths are created and placed at three different positions: (0, 0, 0.1), (0, 6, 0.1) and (0, −6, 0.1). The second box is assigned the material polyamide, preferred here for its flexibility, high wear resistance and high thermal stability. Up to this point, a patch has been created inside a polyamide box with a PEC section at the end of the patch for the excitation. Next, a polyline with nine segments is created; this introduces a path disturbance that breaks the flow of the current distribution inside the patch and helps in recognizing the traffic in those areas. These nine segments are joined so that they create a break in the path. Two rectangles are then drawn with dimensions (0, −8, 0.1) and (7, −4.9, 2). Finally, everything inside the polyamide box is combined: holding the control key, all the circles and the two rectangles are selected and united into a single object, which is the patch. The region created around the antenna uses vacuum as its material. In the bending process, a total of six cases were simulated using the ANSYS software, with bending angles of 30°, 60°, 90°, 120°, 150° and 180°. At 180°, the obtained frequency is 5.8 GHz, which is suitable for providing the WLAN connection to the user inside the car and also for vehicle-to-vehicle (V2V) communication. At the different bending angles the substrate bends, and the patch design included in the substrate bends with it; the current distribution plots therefore differ between iterations, as represented in the images below, and the output differs between regions of the substrate. A box made of polyamide material is mainly used for the substrate (Fig. 4).

Fig. 4 Graphical representation at different angles


Fig. 5 VSWR versus frequency

Fig. 6 Graphical representation between axis and frequency


Fig. 7 Graphical representation between gain and frequency

Fig. 8 Graphical representation between real and imaginary


Fig. 9 Current distribution responses of proposed design


Fig. 10 Radiation pattern at different frequency’s



3 Results and Analysis In general commercial applications, an acceptable VSWR lies between 1 and 2; values in this range indicate good input impedance matching. The simulated and measured results of VSWR and isolation −|S21| in dB are depicted in Fig. 5. It can be seen that the measured bandwidth is wider than in simulation over the operating bands; with the higher VSWR, the notch band is shifted to a lower frequency compared to the simulation (Figs. 6, 7, 8, 9 and 10).
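For readers reproducing the matching check, the standard relation between the reflection coefficient and VSWR can be used; the short sketch below (plain NumPy, not taken from the paper) converts an S11 value given in dB into the corresponding VSWR.

```python
import numpy as np

def vswr_from_s11_db(s11_db):
    """Convert a reflection coefficient magnitude in dB to VSWR.

    |Gamma| = 10**(S11_dB / 20) and VSWR = (1 + |Gamma|) / (1 - |Gamma|).
    """
    gamma = 10.0 ** (np.asarray(s11_db, dtype=float) / 20.0)
    return (1.0 + gamma) / (1.0 - gamma)

# Example: S11 = -20 dB at the 5.8 GHz resonance gives VSWR ~= 1.22,
# well inside the 1-2 range quoted for acceptable matching.
print(vswr_from_s11_db(-20.0))
```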

4 Conclusions The proposed model has been demonstrated as a compact and ultrathin antenna for dedicated short-range (DSR) communication, with a usable operating band at 5.8 GHz for automotive applications. The paper also presents the radiation pattern and current distribution of the proposed antenna. The antenna was bent at different angles and its current distribution pattern observed; the current flow in the antenna decreases as the antenna is bent. The simulation and measurement results show that the antenna retains its frequency and reliability characteristics at the different bending angles.

References 1. Artner G, Kotterman W, Galdo GD, Hein MA (2018) Conformal automotive roof-top antenna cavity with increased coverage to vulnerable road users. IEEE Antennas Wirel Propag Lett 17(12):2399–2403 2. Kwon OY, Song R, Kim BS (2018) A fully integrated shark-fin antenna for MIMO-LTE, GPS, WLAN, and WAVE applications. IEEE Antennas Wirel Propag Lett 17:600–603 3. Leelaratne R, Langley R (2005) Multiband PIFA vehicle telematics antennas. IEEE Trans Veh Technol 54(2):477–485 4. Navarro-Méndez DV et al (2017) Wideband double monopole for mobile, WLAN, and C2C services in vehicular applications. IEEE Antennas Wirel Propag Lett 16:16–19 5. Braham Chaouche Y, Bouttout F, Nedil M, Messaoudene I, Mabrouk IB (2018) A frequency reconfigurable U-shaped antenna for dual-band WiMAX/WLAN systems. Prog Electromagn Res C 87:63–71 6. Choukiker YK, Behera SK (2016) Wideband frequency reconfigurable Koch snowflake fractal antenna. IET Microw Antennas Propag 3(1):203 7. De Mingo J, Roncal C, Carro PL (2012) 3-D conformal spiral antenna on elliptical cylinder surfaces for automotive applications. IEEE Antennas Wirel Propag Lett 11:148–151 8. Madhav BTP, Anilkumar T, Kotamraju SK (2018) Transparent and conformal wheel-shaped fractal antenna for vehicular communication applications. AEU Int J Electron C 91:1–10 9. Artner G, Langwieser R, Mecklenbrauker CF (2017) Concealed CFRP vehicle chassis antenna cavity. IEEE Antennas Wirel Propag Lett 16:1415–1418 10. Li T, Zhai H, Wang X, Li L, Liang C (2015) Frequency-reconfigurable bow-tie antenna for bluetooth, WiMAX, and WLAN applications. IEEE Antennas Wirel Propag Lett 14:171–174


11. Li PK, Shao ZH, Wang Q, Cheng YJ (2015) Frequency- and pattern-reconfigurable antenna for multistandard wireless applications. IEEE Antennas Wirel Propag Lett 14:333–336 12. Kashanianfard M, Sarabandi K (2017) Vehicular optically transparent UHF antenna for terrestrial communication. IEEE Trans Antennas Propag 65(8):3942–3949

Chapter 23

An Improved Patch-Group-Based Sparse Representation Method for Image Compressive Sensing Abhishek Jain, Preety D. Swami, and Ashutosh Datar

1 Introduction With the advancement in medical imaging, remote sensing, satellite imaging, IoT imaging, etc., efficient storage and transmission of images have become a significant issue. Candes et al. [1] proposed the groundbreaking theory of compressive sensing (CS), which challenges the famous sampling theorem. According to CS theory, any signal can be perfectly recovered from far fewer samples than conventional sampling requires, provided the signal is sparse in some domain. The conventional method of acquiring a signal begins with sampling under Nyquist constraints, which produces a large number of signal elements; these acquired samples are then compressed by suitable techniques that discard the smaller measurements. CS offers a solution by combining signal acquisition with compression. This breakthrough in signal processing motivated researchers to utilize it in various imaging applications; it is highly efficient in fields where the deployment of imaging sensors, sensing time or energy supply is limited. CS can be easily described using the single-vector sparse model shown in Fig. 1, which depicts the generation of a sparse representation by compressing the signal with basic matrix operations and transformation functions. The model can be described mathematically by Eq. (1) as:

y = Φx + η    (1)

A. Jain (B) · A. Datar Department of Electrical and Electronics Engineering, SATI, Vidisha, India e-mail: [email protected] A. Datar e-mail: [email protected] P. D. Swami Department of Electronics and Communication Engineering, UIT, RGPV, Bhopal, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al, (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_23


Fig. 1 CS framework

where y ∈ R^m represents an m-dimensional compressed sparse measurement, Φ ∈ R^(m×n) is the non-orthogonal projection matrix of dimension m × n, x ∈ R^n is the n-dimensional input signal, and η ∈ R^m is additive Gaussian noise, with m ≪ n. Figure 2 shows the basic block diagram of signal acquisition and recovery in compressive sensing. The recovery of the original signal x from Eq. (1) is the basic CS problem: the reconstruction model of CS approximates the sparse signal by solving the underdetermined linear system with minimization methods using the l0 or l1 norm. Several transform-domain methods based on block compressive sensing (BCS) have been developed to improve the sparsity constraints of the image recovery process [2]. A major strength of BCS-based image compression is that its encoder is independent of the image signal and is less computationally complex [2–4].


Fig. 2 Basic block diagram of signal acquisition and recovery in compressive sensing
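To make the measurement model of Eq. (1) concrete, the following minimal NumPy sketch generates a k-sparse signal, measures it through a random projection matrix with additive Gaussian noise, and illustrates that far fewer samples than signal elements are kept. All names and sizes here are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, k = 256, 64, 8                      # signal length, measurements (m << n), sparsity
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)   # k-sparse signal

Phi = rng.standard_normal((m, n)) / np.sqrt(m)   # random (non-orthogonal) projection matrix
eta = 0.01 * rng.standard_normal(m)              # additive Gaussian noise
y = Phi @ x + eta                                # compressed measurements, Eq. (1)

print(y.shape)   # (64,) -- far fewer samples than the original 256-element signal
```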

23 An Improved Patch-Group-Based Sparse Representation …

285

On the other hand, complexity at CS decoder side becomes extremely high. As a specific advantage, the random projection matrix does not depend on image features, and same random projection can be applied on all the input images at CS encoders. The complex optimization problem is solved at the decoder for recovering randomly sampled image data in the sparse domain [3].

1.1 Research Motivation and Contribution The existing patch-based methods are computationally complex, and perfect recovery of the images is not always guaranteed by them. Group-based methods are adequately fast. They provide better PSNR but they may tend to oversmoothen the recovered images. Thus, a superior method is needed to be developed in this domain. This paper proposes an effective and fast hybrid model which provides better results than the existing methods. Experiments are done on a set of standard images with different measurements/sampling ratios (r). Simulation results show the efficacy of the proposed work in image compressive sensing.

2 Related Work and Challenges CS recovery algorithms need to address several challenges at the design stage. The major challenge is perfect reconstruction of the original image with low computational expense and memory requirement, and much research has gone into strategies for efficiently recovering images with shorter execution times. The past decade has seen several efficient and fast image recovery algorithms. The memory challenge was addressed using a block-based sampling operation in BCS [2], although the possibility of a blocking effect cannot be ignored. Smoothed projected Landweber (SPL) iterations [2, 5] were proposed to accomplish fast block compressive sensing (BCS) reconstruction while eliminating the blocking artifacts. Another challenge is to tune the recovery method parameters for better optimization. A collaborative reconstruction approach was developed in [3] by exploiting 2D local and 3D non-local sparsities. In [6], CS recovery was done by adaptively learning the sparsifying basis (ALSB) using l0 optimization; sparse image patches for local sparsity were represented using this adaptive l0-norm basis. For further effectiveness in solving the l0 minimization problem, a split Bregman iteration method [4, 7] is utilized. This method is computationally complex, and perfect recovery of the images is not guaranteed. Image recovery in [6], and in all approaches using sparse patch-based representation, is time consuming. Thus, instead of an image patch, a set of non-local patches called a “group” was considered for representation in [8], together with a novel and efficient dictionary learned for every group; the authors named the method group-based sparse representation (GSR). In GSR, the overall computation complexity is

286

A. Jain et al.

reduced as compared to the previous methods because local and non-local resemblances are sparsified concurrently. GSR significantly improves the PSNR, but it may tend to oversmoothen the recovered images. Several GSR-based image reconstruction algorithms were proposed recently [9–11] to improve image recovery performance. GSR with non-convex regularization (GSR-NCR) is presented in [11], where the authors used a non-convex penalty on the GSR components instead of l1 regularization. Xu et al. combined non-local total variation (NLTV) with GSR to better constrain the solution; the model showed perceptually better results when matched against other CS methods [12]. Hybrid non-local sparsity regularization (HNLSR), developed in [13], used a singular value decomposition (SVD) dictionary and exploited non-local sparsity in both the 2D and 3D transform domains. For fast image reconstruction, Gaussian pyramid-based recovery with collaborative sparsity (GP-RCoS) was proposed in [14], where an adaptive pyramid is constructed first and the RCoS recovery method is then applied.

3 Proposed Work The dictionary learning in patch-based CS recovery methods leads to high computational time for estimation. Self-similarity between the patches is enforced in GSR with group adaptive learning. The proposed work is a fusion of adaptively sparsifying patch-based [6] and group-based representation method in a restricted manner. The proposed work is implemented in two phases of reconstruction. In the initialization phase, adaptive sparsifying l0 optimization model is exploited for local overlapping patches in bounded iterations. Further, constrained GSR is applied on the outcome of the first phase. The BCS-GSR decoder utilizes restricted split Bregman iteration (SBI) for fast convergence.

3.1 Phase-1: Patch-Based Adaptive Sparsifying Learning (PASL) The sparsity of true images is represented in terms of patches x_k. For an image x of size N, n image patches of size √Bs × √Bs are extracted at locations k (k = 1, 2, 3, …, n), and x_k is given by Eq. (2) as:

x_k = R_k(x)    (2)

Extraction of the patch x_k from the image x is performed by the operator R_k(·). To put the patch back at its location k during reconstruction, the transpose of this operator, R_k^T(·), is used [6]. The image x is recovered using Eq. (3).


x = Σ_{k=1}^{n} R_k^T(x_k) ./ Σ_{k=1}^{n} R_k^T(1_Bs)    (3)

Here, 1_Bs is an all-ones vector of size Bs and ./ denotes element-wise division. For every patch, a sparse vector α_k is to be determined over a certain dictionary D. The sparsifying problem is represented in Eq. (4):

α_k = argmin_α (1/2) ‖x_k − D α_k‖₂² + λ ‖α_k‖_p    (4)

Here, λ is the regularization factor and p is 0 or 1. Further, the dictionary D and the coefficient matrix Λ = {α_1, α_2, α_3, …, α_j} are optimized for a certain collection of training patches S = {s_1, s_2, s_3, …, s_j}, such that s_k = D α_k and ‖α_k‖_p ≤ L. The minimization problem is described as:

(D̂, Λ̂) = argmin_{D,Λ} Σ_{k=1}^{j} ‖s_k − D α_k‖₂²  s.t. ‖α_k‖_p ≤ L    (5)

The dictionary is updated by adaptive learning from the updated image for a constrained number of iterations [6]. Finally, the estimated image x̂ is reconstructed using patch-based sparse estimation over D.
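The patch operators R_k and R_k^T and the element-wise averaging of Eq. (3) can be illustrated with a short NumPy sketch. This is only an assumed implementation of the extraction/re-assembly step (patch size, step and the toy image are illustrative), not the authors' MATLAB code.

```python
import numpy as np

def extract_patches(img, bs=8, step=4):
    """R_k: collect overlapping bs x bs patches (as columns) with their positions."""
    H, W = img.shape
    patches, coords = [], []
    for i in range(0, H - bs + 1, step):
        for j in range(0, W - bs + 1, step):
            patches.append(img[i:i + bs, j:j + bs].ravel())
            coords.append((i, j))
    return np.stack(patches, axis=1), coords

def assemble_patches(patches, coords, shape, bs=8):
    """Eq. (3): sum R_k^T(x_k) and divide element-wise by the patch counts."""
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    for k, (i, j) in enumerate(coords):
        acc[i:i + bs, j:j + bs] += patches[:, k].reshape(bs, bs)
        cnt[i:i + bs, j:j + bs] += 1.0
    return acc / np.maximum(cnt, 1.0)

img = np.random.rand(64, 64)
P, C = extract_patches(img)
rec = assemble_patches(P, C, img.shape)
print(np.abs(rec - img).max())   # ~0: extraction followed by averaging reproduces the image
```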

3.2 Phase-2: Constrained Group Sparse Representation In GSR, a searching window looks for its best matched C patches for every patch xk . Similar patches are searched based on Euclidean distance d over the search

window S_a. A group matrix (simply termed a group) x_Gk, k = 1, 2, …, n, is formulated using the similar patches. Figure 3 shows the formation of the groups x_Gk by extracting the patches x_k and stacking the similar set S_xk from the image x. The reconstruction problem from x_Gk to x stated in Eq. (3) can be reformulated as:

x = Σ_{k=1}^{n} R_Gk^T(x_Gk) ./ Σ_{k=1}^{n} R_Gk^T(1_{Bs×C})    (6)

Extraction of the patch x_k from the image x is performed by the operator R_Gk(·); R_Gk^T(·) is its transpose, and 1_{Bs×C} is an all-ones matrix of size Bs × C. Further, the GSR regularization-based image restoration process, with subproblems as described in [8], is carried out using split Bregman iteration (SBI). The modified GSR algorithm under constrained iterations is given in Table 1, and the flowchart of the proposed work is shown in Fig. 4.
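The group-building step (best-matched C patches by Euclidean distance inside a local search window) can be sketched as below. This is a hedged illustration of the idea only: the function name, the synthetic patch data and the window handling are assumptions, not the paper's implementation.

```python
import numpy as np

def build_group(patches, coords, ref_idx, num_similar=10, window=41):
    """Form one GSR group: the C patches closest (Euclidean distance) to a
    reference patch, restricted to a local search window around it."""
    ri, rj = coords[ref_idx]
    half = window // 2
    # candidates whose top-left corner lies inside the search window
    cand = [k for k, (i, j) in enumerate(coords)
            if abs(i - ri) <= half and abs(j - rj) <= half]
    ref = patches[:, ref_idx]
    dist = [np.sum((patches[:, k] - ref) ** 2) for k in cand]
    chosen = [cand[t] for t in np.argsort(dist)[:num_similar]]
    # stack the similar patches column-wise into the group matrix x_Gk (Bs x C)
    return patches[:, chosen], chosen

# toy usage with random "patches" (each column is one vectorised 8x8 patch)
rng = np.random.default_rng(0)
patches = rng.random((64, 500))
coords = [(int(rng.integers(0, 120)), int(rng.integers(0, 120))) for _ in range(500)]
group, idx = build_group(patches, coords, ref_idx=0)
print(group.shape)   # (64, C) with C <= 10 similar patches in the group
```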



Fig. 3 Group building in GSR

Table 1 Algorithm 1: Constrained group sparse representation

Phase 1 (reconstructed PASL phase component)
1. Initialization: take the initial input as x = xRec
2. Set the parameters t = 0, λ, η, Bs, C, μ, B, r (subrate), F (factor); calculate the hard-threshold parameter τ = λ·F/μ and the threshold Th = √(2τ)
3. Input: measurement vector y and projection matrix Φ
4. Do (SBI, until the constrained number of iterations) — BCS-GSR-SBI decoding:
   – update û = u − η[Φ^T(Φu − y) + μ(u − D_G ∘ α_G − b)]
   – update α̂_Gk = hard(γ_Gk, √(2τ)) = γ_Gk ⊙ 1(abs(γ_Gk) − √(2τ))
   – update D_G^(t+1) and α̂_G^(t+1)
   – update b^(t+1) = b^t − (u^(t+1) − D_G ∘ α_G^(t+1))
5. Output the final restored image x̂
6. Evaluate the final PSNR and FSIM → end
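The key nonlinearity in step 4 is the hard-thresholding of the group coefficients against √(2τ). A minimal sketch of that operator (illustrative only; the threshold value below is made up) is:

```python
import numpy as np

def hard_threshold(gamma, tau):
    """Hard-thresholding used in step 4 of Algorithm 1: keep a group sparse
    coefficient only if its magnitude exceeds sqrt(2 * tau)."""
    t = np.sqrt(2.0 * tau)
    return gamma * (np.abs(gamma) > t)

gamma = np.array([0.05, -0.30, 1.20, -0.02])
print(hard_threshold(gamma, tau=0.02))   # small coefficients are zeroed out
```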

4 Result and Discussion The proposed work is simulated in MATLAB 2019b on a computer with an Intel Core i7-8565U CPU at 1.80 GHz and 8 GB RAM. The performance of the hybrid patch-group-based image recovery CS method is evaluated on several standard test images. The objective image quality of the test images is evaluated with PSNR [15], and the visual quality with FSIM [16]. For obtaining the CS measurements in the experiments, a block size B of 32 × 32 is taken for BCS. In patch-based adaptive sparsifying learning, the patch size Bs is set to 8, and the other parameters μ, λ, η are set optimally for a restricted run of 5 iterations. The measurement/subrate (sampling ratio) r is varied from 0.1 (10%) to 0.4 (40%). For the constrained GSR phase, the local window L is set to 41 × 41 and the number of best-matched blocks C is set to 10; empirically, the number of iterations for the GSR recovery is limited to 15. The proposed algorithm is compared with 3 standard CS recovery methods, and the performance in terms of PSNR and FSIM is assessed for the test images at measurements (sampling ratio/subrate) from 0.1 (10%) to 0.4 (40%).


The flowchart in Fig. 4 proceeds as follows: define the CS parameters; read the image x; generate the measurement matrix from the projection matrix; transmit/store and observe the measurement matrix; apply patch-based adaptive sparsifying learning to obtain the recovered PASL image; form the group sparse representation of the recovered PASL image; and run the constrained GSR algorithm with split Bregman iteration decoding to obtain the final recovered image.

Fig. 4 Flowchart of the proposed model

For this paper, the test images are cameraman, house and monarch of size 256 × 256. Table 2 depicts the PSNR performance and Table 3 the FSIM performance of the proposed work. It can be seen from the results that the proposed model offers higher PSNR and FSIM values than the other existing methods.
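The PSNR figures in Table 2 follow the usual definition based on the mean squared error against the reference image. The small sketch below restates that definition in NumPy; the toy arrays are illustrative, not the paper's test images.

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference and a recovered image."""
    mse = np.mean((ref.astype(float) - rec.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.full((4, 4), 120.0)
rec = ref + 2.0
print(round(psnr(ref, rec), 2))   # 42.11 dB for a uniform error of 2 grey levels
```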


Table 2 PSNR evaluation (in dB) of the proposed method with 3 standard CS recovery methods

Image      Subrate  ALSB   GSR    GSR-NCR  Proposed method
Cameraman  0.1      22.92  22.89  22.50    23.21
           0.2      26.62  27.17  26.30    27.34
           0.3      29.01  29.62  29.37    29.95
           0.4      31.01  31.64  31.59    31.90
House      0.1      32.18  32.33  32.11    32.90
           0.2      36.07  37.21  36.57    37.35
           0.3      38.36  39.19  39.38    39.41
           0.4      40.25  40.56  41.12    41.24
Monarch    0.1      24.34  25.02  24.67    25.32
           0.2      28.30  30.29  29.46    30.45
           0.3      31.41  33.97  34.68    34.23
           0.4      34.12  36.56  36.43    36.96
Average             31.22  32.20  32.01    32.52

Table 3 FSIM evaluation of the proposed method with 3 standard CS recovery methods

Image      Subrate    ALSB    GSR     GSR-NCR  Proposed method
Cameraman  0.1 (10%)  0.8021  0.8134  0.8012   0.8181
           0.2 (20%)  0.8759  0.8941  0.8796   0.9015
           0.3 (30%)  0.9190  0.9321  0.9305   0.9351
           0.4 (40%)  0.9413  0.9539  0.9530   0.9545
House      0.1 (10%)  0.9201  0.9215  0.9211   0.9263
           0.2 (20%)  0.9561  0.9633  0.9508   0.9687
           0.3 (30%)  0.9732  0.9775  0.9795   0.9793
           0.4 (40%)  0.9824  0.9823  0.9862   0.9876
Monarch    0.1 (10%)  0.8251  0.8553  0.8318   0.8601
           0.2 (20%)  0.8907  0.9334  0.9216   0.9355
           0.3 (30%)  0.9303  0.9617  0.9668   0.9623
           0.4 (40%)  0.9610  0.9711  0.9708   0.9768
Average               0.9148  0.9300  0.9244   0.9338

Average PSNR and FSIM performance is also depicted graphically in Fig. 5. A visible comparison of the hybrid model along with other existing methods is shown in Fig. 6.


Fig. 5 Comparison of average PSNR and FSIM performance

Fig. 6 Reconstruction of cameraman and house image by the proposed and the standard methods at subrate 0.4 (40%) a ALSB (PSNR = 31.01 dB); b GSR (PSNR = 31.64 dB); c GSR-NCR (PSNR = 31.59 dB); d Proposed Method (PSNR = 31.90 dB); e ALSB (PSNR = 40.25 dB); f GSR (PSNR = 40.56 dB); g GSR-NCR (PSNR = 41.12 dB); and h proposed method (PSNR = 41.24 dB)

The computational complexity of the proposed model is also measured during simulation. The total execution time taken by the proposed model to recover the images at different measurements is compared with the standard group-based sparse representation (GSR) method in Table 4. The chart in Fig. 7 shows the average execution time of the proposed work relative to GSR; the average execution time taken by the proposed work is 29% less than that of the GSR method.

292

A. Jain et al.

Table 4 Computational time comparison of the proposed method (in seconds)

Image      Subrate    GSR     Proposed method
Cameraman  0.1 (10%)  392.35  185.59
           0.2 (20%)  340.45  176.48
           0.3 (30%)  220.13  184.57
           0.4 (40%)  290.17  248.57
House      0.1 (10%)  428.20  243.57
           0.2 (20%)  270.18  162.60
           0.3 (30%)  216.82  160.23
           0.4 (40%)  189.89  161.27
Monarch    0.1 (10%)  207.20  180.32
           0.2 (20%)  227.50  198.45
           0.3 (30%)  270.82  243.54
           0.4 (40%)  294.90  215.98
Average               279.05  196.76


Fig. 7 Average execution time of the proposed model over GSR

Figure 8 illustrates the PSNR evolution of the proposed method and the standard GSR image recovery method for the house image at a subrate of 0.4 (40%). The proposed work reaches a higher PSNR in fewer iterations than the GSR method, which clearly demonstrates the excellent convergence behaviour of the proposed hybrid model.


Fig. 8 Comparison of PSNR evolution

5 Conclusion This paper proposes a fast and efficient hybrid patch-group-based sparse representation model for image compressive sensing. Unlike previous methods, the proposed work exploits both local and non-local image patches in two phases of constrained reconstruction; with this model, fast convergence is achieved, as shown in the result section. Simulation results show that the proposed algorithm performs better than the existing standard methods in terms of PSNR and FSIM: the average PSNR gain offered by the proposed method is 1.3 dB, 0.32 dB and 0.51 dB over the ALSB, GSR and GSR-NCR methods, respectively. Execution time, a key parameter of a good CS method, is also evaluated; the average execution time taken by the proposed work is 29% less than that of the GSR method. Overall, the proposed work is fast and outperforms the existing methods both quantitatively and qualitatively. As future scope, the proposed work can be extended to image inpainting, deblurring and denoising applications.


References 1. Candes EJ, Wakin MB (2008) An introduction to compressive sampling. IEEE Signal Process Mag 25(2):21–30 2. Mun S, Fowler JE (2009) Block compressed sensing of images using directional transforms. In: Proceedings—international conference on image processing, ICIP 3. Zhang J, Zhao D, Zhao C, Xiong R, Ma S, Gao W (2012) Image compressive sensing recovery via collaborative sparsity. IEEE J Emerg Sel Top Circuits Syst 2 4. Goldstein T, Osher S (2009) The split Bregman method for L1-regularized problems. SIAM J Imaging Sci 2 5. Gan L (2007) Block compressed sensing of natural images. In: 2007 15th International conference on digital signal processing, DSP 6. Zhang J, Zhao C, Zhao D, Gao W (2014) Image compressive sensing recovery using adaptively learned sparsifying basis via L0 minimization. Signal Process 103 7. Afonso MV, Bioucas-Dias JM, Figueiredo MAT (2010) Fast image recovery using variable splitting and constrained optimization. IEEE Trans Image Process 19 8. Zhang J, Zhao D, Gao W (2014) Group-based sparse representation for image restoration. IEEE Trans Image Process 23 9. Zha Z, Yuan X, Wen B, Zhang J, Zhou J, Zhu C (2020) Image restoration using joint patchgroup-based sparse representation. IEEE Trans Image Process 29:7735–7750 10. Zha Z, Yuan X, Wen B, Zhou J, Zhu C (2018) Joint patch-group based sparse representation for image inpainting. In: Zhu J, Takeuchi I (eds) Proceedings of machine learning research (PMLR), vol 95, pp 145–160 11. Zha Z, Zhang X, Wang Q, Tang L, Liu X (2018) Group-based sparse representation for image compressive sensing reconstruction with non-convex regularization. Neurocomputing 296 12. Xu J, Qiao Y, Fu Z, Wen Q (2019) Image block compressive sensing reconstruction via groupbased sparse representation and nonlocal total variation. Circuits Syst Signal Process 38 13. Li L, Xiao S, Zhao Y (2020) Image compressive sensing via hybrid nonlocal sparsity regularization. Sensors (Switzerland) 20 14. Jain A, Swami PD, Datar A (2022) Fast Gaussian pyramid based recovery with collaborative sparsity for image compressive sensing. In: 2021 IEEE international conference on smart technologies for power, energy and control (STPEC). IEEE, pp 1–6 15. Kaushik P, Sharma Y (2012) Comparison of different image enhancement techniques based upon PSNR & MSE. Int J Appl Eng Res 7 16. Zhang L, Zhang L, Mou X, Zhang D (2011) FSIM: a feature similarity Index for image quality assessment. IEEE Trans Image Process 20

Chapter 24

Comparative Analysis of Stock Prices by Regression Analysis and FB Prophet Models Priyanka Paygude, Aatmic Tiwari, Bhavya Goel, and Akshat Kabra

P. Paygude · A. Tiwari (B) · B. Goel · A. Kabra, College of Engineering, Bharati Vidyapeeth (Deemed to be University), Pune, India; e-mail: [email protected]; P. Paygude e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_24

1 Introduction The twenty-first century is changing at lightning pace. One of the most significant assets in today's world is data and the ability to use it to one's advantage. New techniques such as data visualization, data mining and machine learning are being introduced at very short intervals. These tools help to analyse raw data more strategically and straightforwardly, and can even help us predict future outcomes, such as weather forecasts or next-day electricity prices based on previous statistics. Different prediction models and techniques are used for forecasting values in real-time applications. One of the critical inputs for forecasting is previous data: whether it is regression analysis, a neural network or some other model, every method needs to analyse past behaviour, and predictions are made on that basis. One of the tools used is FB prophet. It fits well to historical data with several seasons and substantial seasonal effects, and it is fully automatic with limited manual involvement [1]. Despite its importance, there are challenges associated with producing reliable, high-quality forecasts, especially when there is a wide variety of time series and analysts [2]. Forecasting using regression analysis, on the other hand, constructs mathematical models that describe or explain relations between the variables of a linear equation [3]. In our paper, two models are discussed, regression analysis and FB prophet; we compare indexes such as absolute error and error % and carry out a comparative study. The datasets are taken from Yahoo Finance, and the stock used is Infosys, ranging from Jan 2015 to Dec 2019 [4]. Based on the available data, forecasting was done for Dec 2019, Jan 2020 and Feb 2020. Stock values and actual prices are also


mentioned for clarity, for both models, so that it is evident which one is more accurate than the other. To get more insight, we considered daily stock prices when analysing the accuracy and error % of each model. We have also studied various papers on both models, for example on how FB prophet is useful for short-term traffic prediction, for maximum power demand (MPD) prediction, and even for one of the most persistent global problems, air pollution [5–7]. FB prophet is used in various scenarios such as evaluating COVID-19 cases, forecasting sales of a particular product and weather forecasts. Regression analysis, in turn, is widely adopted across the world in different domains, even for citation-impact prediction of papers, and it can also be used for economic growth as a whole, for forecasting electric loads and for interdisciplinary topics such as processor performance analysis [8–11]. At the end of the discussion, we draw some conclusions about both models based on the mean absolute percentage error (MAPE) analysis; absolute error and root mean square error are also taken into consideration before concluding. This discussion is thus intended as an accessible paper for readers who want to learn about prediction models and their future scope.

2 Dataset Used for Research Evaluation The dataset plays a significant role in every piece of research, as it acts as the baseline or reference for proposing different approaches or optimal solutions to a specific problem. In our paper we took the dataset from one of the renowned sources where almost every stock price is available with exact dates, Yahoo Finance [4]. The stock used is that of the Infosys company, represented as INFY, and the selected dates run from January 2015 to November 2019; the prediction is applied to the following three months, December 2019, January 2020 and February 2020. We have taken the original date and the actual closing prices. There are various options for building datasets over different timelines, for instance monthly or even yearly stock prices, but we took daily data because it helps produce more accurate results. There are numerous papers in which Yahoo Finance is taken as the reference dataset, such as comparing stock prices from Google Trends and Yahoo Finance [12], and even forecasting bitcoin prices via an LSTM model [13], because it is the best real-time source, available free of cost and updated with the current stock price daily. There are several risks attached to the dataset, such as missing data or sudden changes in stock values due to external factors like government news, new policies or strategies enforced by the company, or changes in board members. Table 1 shows the Infosys dataset; since the stock price is considered on a daily basis, the full dataset runs to more than 1200 rows and is therefore presented here in abbreviated form.


Table 1 Infosys dataset

Date         Close
21-01-2015   8.7975
22-01-2015   8.95
23-01-2015   8.875
26-01-2015   8.9225
27-01-2015   8.785
28-01-2015   8.785
29-01-2015   8.655
30-01-2015   8.52
02-02-2015   8.5875
03-02-2015   8.63
…            …
29-11-2019   9.83

3 Comparative Analysis of Linear Regression and FB Prophet Models The proposed approach takes the Infosys dataset from Yahoo Finance [4] for the past five years, and the models are applied to forecast December 2019, January 2020 and February 2020. Daily stock market prediction is used for better accuracy. As depicted in Fig. 1, the dataset is first taken into consideration and then both models are applied separately. After the models have been run, the forecasted values are evaluated, initially on the basis of absolute error and error %; the model with the smaller error is the more accurate one.

Fig. 1 Evaluation process


3.1 Linear Regression Model A time series is a set of observations of the values that a variable takes at different times; the data can be collected monthly, weekly, quarterly or even annually. For scenarios with a single variable and no covariates, univariate time series models are used, and linear regression is one of them. This approach can be applied in real-time scenarios such as weather forecasting and stock prediction, and the method is widely adopted across domains, even for citation-impact prediction of papers; it can also be used for economic growth as a whole, for forecasting electric loads and for interdisciplinary topics such as processor performance analysis [8–11]. Regression analysis is a reliable and trusted method for determining which factors affect an outcome most, while other minor ones are ignored. In our case we are forecasting future prices, which is our only interest, so one variable is independent and the dependent quantity is the one we have to find; hence regression is used. This model is not only used to forecast stock values but also plays a significant role in other domains, for example in astronomy to find the bisector of two OLS lines [14], or in face recognition, where linear regression was presented as a novel way of formulating the pattern recognition problem [15]. The model uses the relationship between one dependent variable and one or more independent variables and thus helps to find the future relationship between them [16–19]; the same technique can also be used to find relationships between different stocks. Linear regression can also be used to estimate a set of time series, with the average financial and macroeconomic variables of identified resources at the beginning of each year serving as the independent inputs for these estimations [20]. There are various forms of regression, such as linear regression, multiple linear regression and nonlinear regression. The regression graph for the price variation of the stock over 5 years is shown in Fig. 2.

Fig. 2 Regression graph


This model assesses one dependent variable against an independent one and can be expressed as:

Y = mX + c + ε    (1)

Y — dependent variable
X — independent variable
c — intercept
m — slope
ε — residual (error)

Fitting this model to the dataset gives:

y = 0.0013x − 49.586    (2)

R² = 0.5608. The above equation is the one obtained by analysing our dataset prices; with the help of the graph we were able to find the relationship, which is exactly the linear regression model discussed above. If we additionally calculate the mean average of all the errors, the prediction accuracy can be assessed more precisely. One more quantity appears here, R², the coefficient of determination, which ranges from 0 to 1. R² is the square of the coefficient of correlation (R) between predicted and actual values; the closer R² is to 1, the better the model fits the data [21].
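A minimal sketch of the straight-line fit behind Eq. (2) is given below. It assumes the Table 1 data has been saved to a local CSV file with "Date" and "Close" columns (the file name "INFY.csv" is an assumption, not a path from the paper); dates are converted to ordinal numbers so that a slope, intercept and R² can be computed with NumPy.

```python
import numpy as np
import pandas as pd

# Assumed CSV export of Table 1 with columns "Date" (dd-mm-yyyy) and "Close".
df = pd.read_csv("INFY.csv", parse_dates=["Date"], dayfirst=True)
x = df["Date"].map(pd.Timestamp.toordinal).to_numpy(dtype=float)   # independent variable
y = df["Close"].to_numpy(dtype=float)                              # dependent variable

m, c = np.polyfit(x, y, 1)                 # slope and intercept of Y = mX + c
y_hat = m * x + c
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot                 # coefficient of determination R^2
print(f"y = {m:.4f}x + {c:.3f}, R^2 = {r2:.4f}")

# Forecast: evaluate the fitted line at future business days (e.g. December 2019).
future = pd.date_range("2019-12-02", "2019-12-27", freq="B").map(pd.Timestamp.toordinal)
print(m * future.to_numpy(dtype=float) + c)
```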

3.2 FB Prophet Model FB prophet is an additive time series forecasting model that decomposes a series into trend, seasonality and holiday components (see Sect. 3.2.1). It fits well to historical data with several seasons and substantial seasonal effects, and it is fully automatic with limited manual involvement [1].


Fig. 3 FB prophet workflow

In Fig. 3 it can be seen that the FB prophet workflow is cyclic: modelling is the first phase, in which models are selected; the forecast is then evaluated on the given data; after this an automated process surfaces any problems; and finally the forecasts are evaluated visually for better understanding.

3.2.1

Handling Missing Data in the FB Prophet Model

In the case of FB prophet, missing values do not need to be imputed, as the model does not require regularly spaced data. FB prophet can be defined by the equation a(t) = b(t) + s(t) + h(t) + e_t, where:


b(t): represents non-periodic changes in the time series value
s(t): represents periodic changes in the time series value
h(t): represents the effect of holidays, which occur on an irregular schedule
e_t: represents any other changes that the model does not accommodate.

None of the components mentioned above requires regularly spaced data; if some data is missing, the model interpolates between the known values and does not use the missing data to estimate anything.

3.2.2 FB Prophet Pseudocode

(1) Import the NumPy library as np
(2) Import the pandas library as pd
(3) Read the CSV data from '/content/SAIL.NS.csv' into the dataframe df
(4) Perform data preprocessing, i.e. remove missing data
(5) Remove all irrelevant columns from the data frame and keep only the date and closing stock price: df = df[["Date", "Close"]]
(6) Import the matplotlib.pyplot library as plt
(7) Create the fbprophet model: m = Prophet(daily_seasonality=True)
(8) Fit the data frame in the FB prophet model
(9) Predict values for the next three months
(10) Plot the graphs of the predicted values using plt.plot
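A runnable version of the pseudocode above is sketched here with the open-source Prophet package; the CSV path and column names are assumptions (the paper's own file path is '/content/SAIL.NS.csv'), and the 90-day horizon is only an approximation of the three forecast months.

```python
import pandas as pd
from prophet import Prophet          # older installs expose the same API as "fbprophet"
import matplotlib.pyplot as plt

# Assumed local CSV with "Date" and "Close" columns (path is illustrative).
df = pd.read_csv("INFY.csv", parse_dates=["Date"], dayfirst=True)
df = df[["Date", "Close"]].dropna()
df = df.rename(columns={"Date": "ds", "Close": "y"})   # Prophet's required column names

m = Prophet(daily_seasonality=True)
m.fit(df)

future = m.make_future_dataframe(periods=90)   # roughly Dec 2019 - Feb 2020
forecast = m.predict(future)

m.plot(forecast)
plt.show()
print(forecast[["ds", "yhat"]].tail())
```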

4 Metric Used for Evaluation of the Comparative Analysis of Models MAPE is the mean, or average, of the absolute percentage errors of a forecast [22]. MAPE is used here to make the comparison between the two models easy to understand, since it expresses errors as percentages. Averaging over all points gives a clearer perspective, as both the absolute error and the error % are calculated, which supports clearer and more effective decisions. After each error and error % is obtained, the mean is calculated by adding all the values and dividing by their number:

MAPE = (1/n) Σ_{i=1}^{n} |V_i − P_i| / V_i × 100%

where n is the number of forecast points, V is the actual value and P is the predicted value of the forecast. MAPE is thus an efficient way of comparing different models, as it shows the result in an easily understandable form.
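For completeness, a few-line implementation of this metric is shown below; the example values are simply the first three December 2019 rows of Table 2 (actual closing prices and the FB prophet forecasts).

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error: mean of |V - P| / V, expressed in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs(actual - predicted) / np.abs(actual)) * 100.0

# First three December 2019 rows of Table 2 (FB prophet column):
print(round(mape([9.69, 9.67, 9.86], [10.48, 10.5, 10.51]), 2))   # ~7.78 %
```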


5 Result and Discussion With the regression model we could keep the error below roughly 10%. On a more practical level, however, FB prophet was more efficient, reaching absolute errors as low as about 0.1, which is quite good. The absolute error measures how closely the model output matches the original value; to make it easier to interpret, the mean of all the absolute errors is taken, and the same is done for the error %. The FB prophet model also analyses previous data patterns and reaches its final predictions on that basis. In all three months the forecasted values show good results and the error % is quite small, which further reduces the overall error and improves the accuracy of the forecasted values. The results show that the values mostly carry only a 5–6% error; thus, if the patterns are recognized sharply, the forecasts are very accurate.

5.1 Comparative Study The comparative graph for December 2019 is shown in Fig. 4. Table 2 shows the forecast of the December 2019 stock prices obtained from the previous 5-year pattern: the equation obtained from the regression fit (Eq. 2) is applied to the later period, so the forecast is in linear regression form. Here we can observe that the MAPE of the regression model is 1.40 in absolute error and 14.00% in error %, whereas the MAPE of FB prophet comes to around 0.61 in absolute error and 6% in error %. Table 3 shows the forecast of the January 2020 stock prices, obtained in the same way from the previous 5-year pattern by applying the regression equation (Eq. 2) to the later period.

Fig. 4 Comparative graph of month Dec 2019


Table 2 Forecasting in December 2019

                      Regression model             FB prophet model
Date        Actual    Forecast  ABS err  Error %   Forecast  ABS err  Error %
02-12-2019  9.69      8.66      1.03     10.63     10.48     0.79     8.15
03-12-2019  9.67      8.66      1.01     10.44     10.5      0.83     8.58
04-12-2019  9.86      8.66      1.2      12.17     10.51     0.65     6.59
05-12-2019  9.89      8.66      1.23     12.44     10.41     0.52     5.26
06-12-2019  9.94      8.66      1.28     12.88     10.44     0.5      5.03
09-12-2019  9.94      8.67      1.27     12.78     10.56     0.62     6.24
10-12-2019  10.04     8.67      1.37     13.65     10.6      0.56     5.58
11-12-2019  10.07     8.67      1.4      13.90     10.64     0.57     5.66
12-12-2019  10.03     8.67      1.36     13.56     10.69     0.66     6.58
13-12-2019  10.08     8.67      1.41     13.99     10.71     0.63     6.25
16-12-2019  10.13     8.68      1.45     14.31     10.63     0.5      4.94
17-12-2019  10.29     8.68      1.61     15.65     10.67     0.38     3.69
18-12-2019  10.33     8.68      1.65     15.97     10.81     0.48     4.65
19-12-2019  10.27     8.68      1.59     15.48     10.85     0.58     5.65
20-12-2019  10.29     8.68      1.61     15.65     10.9      0.61     5.93
23-12-2019  10.29     8.69      1.6      15.55     10.94     0.65     6.32
24-12-2019  10.24     8.69      1.55     15.14     10.97     0.73     7.13
26-12-2019  10.17     8.69      1.48     14.55     10.88     0.71     6.98
27-12-2019  10.26     8.69      1.57     15.30     10.91     0.65     6.34
MAPE                            1.404    14.00               0.6115   6.00

Here we can observe that the MAPE of the regression model is 2.01 in absolute error and 19.00% in error %, whereas the MAPE of FB prophet comes to around 0.16 in absolute error and 2% in error %. The comparative graph for January 2020 is shown in Fig. 5. Table 4 shows the forecast of the February 2020 stock prices, again obtained by applying the regression equation (Eq. 2) to the later period. The comparative graph for February 2020 is shown in Fig. 6. Here we can observe that the MAPE of the regression model is 0.83 in absolute error and 8.00% in error %, whereas the MAPE of FB prophet comes to around 0.41 in absolute error and 4% in error %. The graphs give a clearer overview of all the stock prices: the regression line (orange) always lies well below the actual


Table 3 Forecasting in January 2020

                      Regression model             FB prophet model
Date        Actual    Forecast  ABS err  Error %   Forecast  ABS err  Error %
02-01-2020  10.29     8.7       1.59     15        10.48     0.19     2
03-01-2020  10.31     8.7       1.61     16        10.50     0.19     2
06-01-2020  10.21     8.7       1.51     15        10.51     0.3      3
07-01-2020  10.1      8.71      1.39     14        10.41     0.31     3
08-01-2020  10.09     8.71      1.38     14        10.44     0.35     3
09-01-2020  10.48     8.71      1.77     17        10.56     0.08     1
10-01-2020  10.65     8.71      1.94     18        10.60     0.05     1
13-01-2020  10.88     8.71      2.17     20        10.64     0.24     2
14-01-2020  10.86     8.71      2.15     20        10.69     0.17     2
15-01-2020  10.85     8.72      2.13     20        10.71     0.14     1
16-01-2020  10.85     8.72      2.13     20        10.63     0.22     2
17-01-2020  10.88     8.72      2.16     20        10.67     0.21     2
21-01-2020  10.85     8.72      2.13     20        10.81     0.04     1
22-01-2020  11.04     8.72      2.32     21        10.85     0.19     2
23-01-2020  11.06     8.73      2.33     21        10.90     0.16     1
24-01-2020  11.03     8.73      2.3      21        10.94     0.09     1
27-01-2020  10.99     8.73      2.26     21        10.97     0.02     1
28-01-2020  11.1      8.73      2.37     21        10.88     0.22     2
29-01-2020  11.06     8.73      2.33     21        10.91     0.15     1
30-01-2020  11.06     8.74      2.32     21        11.04     0.02     1
MAPE                            2.01     19                  0.167    2

Fig. 5 Comparative graph of month Jan 2020


Table 4 Forecasting in February 2020

                      Regression model             FB prophet model
Date        Actual    Forecast  ABS err  Error %   Forecast  ABS err  Error %
03-02-2020  10.94     11.82     0.88     8.00      10.69     0.25     2
04-02-2020  11.1      11.83     0.73     7.00      10.69     0.41     4
05-02-2020  11.04     11.83     0.79     7.00      10.68     0.36     3
06-02-2020  10.96     11.83     0.87     8.00      10.69     0.27     2
07-02-2020  10.85     11.83     0.98     9.00      10.68     0.17     2
10-02-2020  10.92     11.83     0.91     8.00      10.62     0.30     3
11-02-2020  10.95     11.83     0.88     8.00      10.62     0.33     3
12-02-2020  11.22     11.84     0.62     5.00      10.64     0.58     5
13-02-2020  11.1      11.84     0.74     7.00      10.65     0.45     4
14-02-2020  11.01     11.84     0.83     8.00      10.66     0.35     3
18-02-2020  11.21     11.84     0.63     6.00      10.68     0.53     5
19-02-2020  11.3      11.85     0.55     5.00      10.68     0.61     5
20-02-2020  11.36     11.85     0.49     4.00      10.64     0.72     6
21-02-2020  11.41     11.85     0.44     4.00      10.65     0.76     7
24-02-2020  11.22     11.85     0.63     6.00      10.69     0.53     5
25-02-2020  11.02     11.85     0.83     8.00      10.71     0.31     3
26-02-2020  11        11.86     0.86     8.00      10.74     0.26     2
27-02-2020  10.62     11.86     1.24     12.00     10.76     0.14     1
28-02-2020  10.07     11.86     1.79     18.00     10.74     0.67     4
MAPE                            0.83     8.00                0.41     4

closing price (blue line), and FB prophet (grey line) is very much close to the original value. Hence, it can be seen that FB prophet is more accurate in forecasting the prices.

6 Conclusion and Future Scope This research work gives a detailed understanding of statistical regression and FB prophet models in stock forecasting. We have presented results showing which model gives the more accurate forecasts, but there will still be cases in which both models fail; in such cases we would switch to other models, which may include neural networks. One shortfall of FB prophet is that it cannot use any side information for prediction. It also limits the magnitude of the rate change through a sparse prior, and decreasing the changepoint prior scale below a certain value can lead to underfitting [23]. There are some models in which this issue is addressed.


Fig. 6 Comparative graph of month Feb 2020

Among those models is the ARIMA model, which would help us to go into more depth and improve our understanding of the shortcomings of the previous models. This will be our future scope, and we would like to explore more such models.

References 1. Battineni G, Chintalapudi N, Amenta F, Forecasting of COVID-19 epidemic size in four high hitting nations (USA, Brazil, India and Russia) by Fb-prophet machine learning model. Applied Computing and Informatics Emerald Publishing Limited, e-ISSN: 2210-8327, p-ISSN: 26341964. https://doi.org/10.1108/ACI-09-2020-0059 2. Gaur S (2020) Int J Eng Appl Sci Technol 5(2):463–467. ISSN No. 2455-2143 3. Seber GAF, Lee AJ, Linear regression analysis. Wiley Publication 4. Yahoo Finance homepage. https://finance.yahoo.com/quote/INFY/history?period1=158302 0800&period2=1585526400&interval=1d&filter=history&frequency=1d&includeAdjusted Close=true 5. Chikkakrishna NK, Hardik C, Deepika K, Sparsha N (2015) Short-term traffic prediction using Sarima and fb-prophet. IEEE Trans Hum Mach Syst 45(4) 6. Guo C, Ge Q, Jiang H, Yao G, Hua Q, Maximum power demand prediction using Fbprophet with adaptive Kalman filtering. IEEE Access. https://doi.org/10.1109/ACCESS.2020.2968101 7. Samal KKR, Babu KS, Das SK, Acharaya A, Time series based air pollution forecasting using Sarima and prophet model. In: ITCC 2019: proceedings of the 2019 international conference on information technology and computer communications 8. Yu T, Yu G, Li PY, Wang L, Citation impact prediction for scientific papers using stepwise regression analysis. Scientometrics. https://doi.org/10.1007/s11192-014-1279-6 9. Benos N, Zotou S (2014) Education and economic growth: a meta-regression analysis. World Dev 64:669–689, 0305-750X


10. Hong T, Gui M, Baran ME, Willis HL, Modeling and forecasting hourly electric load by multiple linear regression with interaction. In: IEEE conference. https://doi.org/10.1109/pes. 2010.5589959 11. Joseph PJ, Vaswani K, Thazhuthaveetil MJ (2006) Construction and use of linear regression models for processor performance analysis. In: The twelfth international symposium on highperformance computer architecture, pp 99–108 12. Xu SY, Berkely CU (2014) Stock price forecasting using information from Yahoo Finance and Google trend. UC Brekley 13. Andi HK (2021) An accurate Bitcoin price prediction using logistic regression with LSTM machine learning model. J Soft Comput Paradigm 3(3):205–217 14. Isobe T, Feigelson ED, Akritas MG, Babu GJ (1990) Linear regression in astronomy. I. Astrophys J 364:104–113 15. Naseem I, Togneri R, Bennamoun M (2010) Linear regression for face recognition. IEEE Trans Pattern Anal Mach Intell 32(11) 16. Taylor SJ, Letham B, Forecasting at scale. PeerJ Projects 17. Jha BK, Pande S, Time series forecasting model for supermarket sales using FB-prophet. In: Proceedings of the fifth international conference on computing methodologies and communication (ICCMC 2021) 18. Chikkakrishna NK, Hardik C, Deepika K, Sparsha N (2019) Short-term traffic prediction using Sarima and FbPROPHET. In: 2019 IEEE 16th India council international conference (INDICON), pp 1–4. https://doi.org/10.1109/INDICON47234.2019.9028937 19. Lounis M (2021) Predicting active, death and recovery rates of COVID-19 in Algeria using facebook’ prophet model. Preprints 2021, 2021030019. https://doi.org/10.20944/preprints202 103.0019.v1 20. Mohan S, Mullapudi S, Sammeta S, Vijayvergia P, Anastasiu DC, Stock price prediction using news sentiment analysis. In: 2019 IEEE fifth international conference on big data computing service and applications (BigDataService) 21. Cakra YE, Trisedya BD (2015) Stock price prediction using linear regression based on sentiment analysis. In: 2015 international conference on advanced computer science and information systems (ICACSIS), pp 147–154. https://doi.org/10.1109/ICACSIS.2015.7415179 22. Swamidass P (2000) Mean absolute percentage error (MAPE). In: Encyclopedia of production and manufacturing management. Springer, Boston. https://doi.org/10.1007/1-4020-06128_580 23. Ahangar RG, Yahyazadehfar M, Pournaghshband H (2010) The comparison of methods artificial neural network with linear regression using specific variables for prediction stock price in Tehran stock exchange. Int J Comput Sci Inf Secur (IJCSIS) 7(2)

Chapter 25

g POD—Dual Purpose Device (Dustbin and Cleaning) R. Brindha, Vinoth Kumar Balan, Harri Srinivasan, Kartik Rajayria, and Rohit Kumar Singh

R. Brindha (B) · V. K. Balan · H. Srinivasan · K. Rajayria · R. K. Singh, Department of Electrical and Electronics Engineering, SRM Institute of Science and Technology, Kattankulathur, Chennai, India; e-mail: [email protected]; V. K. Balan e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_25

1 Introduction Robot vacuum cleaners and intelligent dustbins are well-known products. A smart dustbin, as described here, is a can for collecting dry waste whose lid opens and closes automatically as rubbish is added. Although robot vacuum cleaners also need batteries for power during operation, they use entirely different technology and usage patterns from conventional battery-operated vacuum cleaners, making them a distinct category, as in Fig. 1. For the review study, market data on vacuum cleaners was acquired, including performance data based on annual energy use for the years 2013–2018 as well as sales data for the years 2006–2016. Based on stakeholder inputs, the estimates reflect the current development of the market in Table 1 and are based on predicted life spans and sales. The overall vacuum cleaner market from 2005 to 2029 is shown in Fig. 2, and Table 2 shows the total stock of the various vacuum cleaner types. Robot vacuum cleaners are mostly used in domestic settings to remove particulates from interior surfaces, with roughly 220 million households in the EU28 in 2016. The goal of this chapter is to learn more about how to develop a robot that can both clean and act as a dustbin, and about the design and requirements for it, in order to improve some features and ultimately obtain a robot that can clean and collect waste. Object detection techniques, object identification, face detection and other industrial applications all use image recognition. Numerous studies have been published on the robot


Fig. 1 New products in the market, products with new or enhanced capabilities

Table 1 Sales of different vacuum cleaner types from 1990 to 2030 (sales in million units)

Type                  1990   2000   2005   2010   2015   2018   2020   2025   2030
Cylinder domestic     14.81  16.92  25.01  25.28  25.07  23.43  22.06  17.88  12.07
Cylinder commercial    1.78   2.03   3.00   3.01   2.95   2.95   2.95   2.95   2.95
Upright domestic       2.61   2.99   4.41   3.44   2.91   2.60   2.56   2.38   2.01
Upright commercial     0.31   0.36   0.53   0.41   0.35   0.31   0.31   0.31   0.31
Handstick mains        0.30   0.34   0.50   0.91   1.25   1.66   1.87   2.38   3.22
Handstick cordless     0.51   0.59   0.87   1.56   4.24   7.39   9.11  13.51  18.10
Robot                  0.0    0.0    0.0    0.8    1.5    2.0    2.5    3.6    4.8
Total                 20.3   23.2   34     35     38.3   40.4   41.3   43.0   43.5

Fig. 2 Overall vacuum cleaner market from 2005 to 2029 (annual total sales and stock estimates)


Table 2 Stock of different vacuum cleaners (million units)

Type                  2005    2010    2015    2020    2025    2030
Cylinder domestic     209.97  217.34  213.00  20.71   179.89  140.38
Commercial cylinder   17.0    17.0    16.0    17.0    16.0    16.0
Upright domestic      34      28.5    25      23.6    21.5    19.4
Commercial upright    2.6     2.1     1.9     1.8     1.7     2.1
Handstick (mains)     5.0     8.0     11      12.0    17.0    22.0
Cordless handstick    8       14.0    28.0    39.2    68.6    98.1
Robot                 2.2     6.7     9.5     11.7    18.4    27.8
Total                 278.5   294.12  304.7   311.7   322.8   326.4

vacuum cleaner, smart dustbin, fuzzy item recognition, neural and image processing-based identification, and LIDAR navigation. Using a pair of attached arms and a running path to the cleaning machine, we focus on constructing a robot that can clean even beneath objects suited to be picked up and, if necessary, collect debris in a dustbin mounted on it. Trash is handled using an appropriate algorithm that helps the robot recognize and manoeuvre staircases, so that it can safely navigate a staircase and clean it efficiently by detecting detritus while climbing.

2 Brief Overview of Dual Purpose Device g POD

A g POD is an automatic, battery-powered floor-cleaning device with a smart dustbin management system and an automatic arm, capable of deciding its own cleaning trajectory, collecting waste, climbing steps and tracking its power charger/docking station. The whole device has three distinct parts: a robot vacuum cleaner base, an attached pair of arms and a mounted smart dustbin. With all features added, the estimated consumer price in the market ranges from under 17,000 rupees for models with moderate cleaning performance to 40,000–60,000 rupees for the best cleaning models, depending on battery performance [1–3].

2.1 Robot Vacuum Cleaner Base

Manufacturers include:
• iRobot (Roomba brand) and Neato (Botvac, Connected) are US robotics specialists.
• Dyson (UK, 360 Eye), Bosch (DE, Roxxter) and Miele (DE, Scout) are European vacuum cleaner makers.


Fig. 3 1. Collecting, 2. Vacuum suction, 3. Exhaust

• Samsung (Powerbot, Navibot), LG (Hombot), Techtronic Industries TTI (Dirt Devil, VAX, Hoover brands) and Chuwi (ILIFE) are Asian vacuum cleaner manufacturers.

Figure 3 illustrates a typical high-end robot vacuum cleaner. The design is typically cylindrical or D-shaped, with a diameter of 35–37 cm and a height of 8–11 cm, and incorporates a 0.4–0.8 L "bag-less" dustbin, filter, battery and the active components listed below.

Motors
• Two drive wheels, each operated individually with its own motor and gearbox
• Centrifugal backwards-curved fan powered by a DC motor (similar to a PC graphics card cooling fan)
• Turbo-compressor type and cyclonic dust separation
• Main brushes, spring-hinged
• Castor wheel, controlled by a tiny DC motor (with a belt drive)
• Side-brush with DC motor

Required Sensors
• IR receivers for tracking the virtual wall and the docking station
• Infrared sensors (LED + receiver) for side and cliff detection
• Keypad sensors with mechanical bumpers for detecting collisions
• Laser distance camera
• Ultrasonic sensor
• Piezo-electric sensor for dirt detection
• Magnetic tape sensor
• Drop sensor
• Tachometer
• Gyroscope
• Fan speed control (sensor-assisted)

Printed Circuit Board (PCB)
A high-end robot vacuum cleaner printed circuit board (PCB) is equivalent to that of a less capable smartphone or laptop. This model includes a quad-core system-on-chip (SoC), 512 MB RAM, 4 GB flash memory managed by a 32-bit microcontroller unit, and a Wi-Fi module. A universal asynchronous receiver-transmitter (UART) is integrated into the SoC and the STM microcontroller for serial communication. A UART is also available for the LIDAR laser rangefinder. Large inductors and capacitors, transistors, diodes and connectors for wiring to and from the PCB are the only other active components on the board.

Communication
• One or two push-buttons
• Remote control (battery-powered controller)
• LED display
• Voice command
• Control via smartphone over Wi-Fi and Bluetooth

Peripheral
• A docking station including a battery charger
• Infrared transmitter
• Either magnetic tape or virtual walls
• Additional supplies (e.g. mops)

The path that the robot takes to manoeuvre the floor and clean it is different for each model, as in Fig. 4, and the algorithm will vary. The possible algorithms are:
• Random mapping
• Criss-cross mapping
• Zigzag mapping
• Helical mapping

Simultaneous localization and mapping (SLAM) can be used to control it, although this takes additional processing power (Figs. 5, 6 and 7).
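As a toy illustration of the simplest of these coverage strategies, the following Python sketch simulates a random-bounce walk on a small occupancy grid and reports how much of the free floor it covers; the grid size, obstacle layout and step budget are invented for the example and do not describe any particular cleaner.

import random

def random_bounce_coverage(grid, start, max_steps=5000):
    """Simulate a random-bounce cleaning pattern on a 0/1 occupancy grid.
    grid: list of rows, 0 = free floor, 1 = obstacle/wall.
    Returns the fraction of free cells visited within max_steps."""
    rows, cols = len(grid), len(grid[0])
    headings = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # four travel directions
    r, c = start
    dr, dc = random.choice(headings)
    visited = {(r, c)}
    for _ in range(max_steps):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
            r, c = nr, nc                       # keep moving in a straight line
            visited.add((r, c))
        else:
            dr, dc = random.choice(headings)    # bounce off in a random direction
    free_cells = sum(row.count(0) for row in grid)
    return len(visited) / free_cells

# Toy 10 x 10 room with a single obstacle block in the middle.
room = [[0] * 10 for _ in range(10)]
for i in range(4, 6):
    for j in range(4, 6):
        room[i][j] = 1
print("covered: {:.0%}".format(random_bounce_coverage(room, (0, 0))))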


Fig. 4 Robot vacuum cleaner (illustrative only)

Fig. 5 A random bounce pattern using SLAM


Fig. 6 A random plus spirals pattern

Fig. 7 Room is mapped by a robot cleaner utilizing SLAM technology

Early models, for example, used a combination of an imaginary "wall following" pattern, which helps the robot walk along walls, and a "random bounce" mode, in which it crosses the floor in a straight line until it finds the next obstruction and then moves away in a random direction. Modern SLAM-based systems use extra power because of the higher computing demand, but they achieve a significantly shorter coverage time. No matter which model is used, the algorithm should ensure that the full available portion of the floor is covered [4–6]. The result here cannot be compared to the double strokes expected in manual vacuuming [7, 8]. There is no indication that bags are used to capture the waste in robotic vacuum cleaners; instead, they come with a storage container that needs emptying on a regular basis. Most of them come with extra filters and brushes, which must be cleaned and replaced regularly.


2.2 Pair of ARMs

Descriptions of Hardware and Software
An Arduino coupled with a DC motor controller and sensors makes up the hardware architecture of the attached arms. The NVIDIA Jetson Nano is the central control unit (CCU) that oversees the robot's whole operation. The module, which runs on Ubuntu 18, has an ARM CPU and a 128-core GPU with 4 GB memory to run real-time deep learning inference. Other components include an ambient perception unit and Arduino control blocks, and the processing unit uses the rosserial communication interface to facilitate communication between ROS nodes. This bridge is used to transfer sensor data and trajectory information between the Arduino Mega microcontroller and the Jetson Nano unit. The locomotion control and on-board sensor interface are handled by the Arduino Mega microcontroller. It gathers data from numerous sensor modules and transmits it to the CCU. The inbuilt PWM module generates the input signal to the DC motor module based on the trajectory information received. To execute self-reconfiguration and locomotion duties, many sensors are combined (Fig. 8). The inner string of the staircase was detected using bump sensors (on the left and right sides of the robot). Time-of-flight distance sensors are fixed on the front face of the first block, and two mechanical limit switches are installed in front of the second and third blocks for stair riser identification. We also provide time-of-flight distance sensors at the bottom of all three blocks to detect staircase climbing. The details of the motors, sensors and communication protocols are given in Table 3.

Environmental Perception System (EPS)
The environmental perception system is critical for the robot's autonomous control systems to work. The goal of this system is to recognize items in the surroundings. The RGB-D vision sensor, an SSD MobileNet-based object identification module and a depth-based error correcting unit make up the EPS system (Fig. 9).

Fig. 8 Hardware architecture of control system
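Because the paper does not reproduce the CCU software, the following Python sketch only illustrates the kind of ROS node that could sit on the Jetson side of such a rosserial bridge; the node name, topic names, message types and the 10 cm stop threshold are all assumptions made for the example, not details of the g POD firmware.

#!/usr/bin/env python
# Illustrative CCU-side ROS node: listen to a range reading forwarded by the
# Arduino (via rosserial) and publish a simple motor command back.
import rospy
from std_msgs.msg import Float32, Int16

cmd_pub = None

def tof_callback(msg):
    # Stop the drive motors when the time-of-flight sensor reports an
    # obstacle closer than 10 cm, otherwise drive forward.
    cmd = Int16(0 if msg.data < 0.10 else 100)   # 0 = stop, 100 = forward PWM duty
    cmd_pub.publish(cmd)

if __name__ == "__main__":
    rospy.init_node("gpod_ccu_bridge")
    cmd_pub = rospy.Publisher("motor_cmd", Int16, queue_size=10)
    rospy.Subscriber("tof_range", Float32, tof_callback)
    rospy.spin()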


Table 3 Hardware and sensor specification

Description        Specification               Interface
ToF sensor         SEN-02815, range 10 cm      I2C
Vision sensor      Intel RealSense D435        USB 3.0
Bump sensor        Limit switch mechanism      Binary logic
Worm gear motor    12 V, 100 RPM               UART

Fig. 9 Climbing system flow diagram

The EPS covers the fundamental difficulties of a staircase-cleaning robot, such as detecting staircases, debris (e.g. liquid spillages), and static and dynamic barriers (e.g. flowerpots, people) that may hinder the robot's route while climbing. Through a deep learning system, the vision sensor data was used to detect any objects on the stairs in the current investigation. The CCU will control the robot's navigation direction based on the obstacle position.

Identifying and Aligning with the Stairwell is the First Step
RGB along with depth location information is used to determine the initial step and to calculate the angle of the staircase relative to the robot. This information is critical in directing the robot to the stairwell. The use of edge detection algorithms is a common strategy for detecting steps.
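As an illustration of that common strategy (and not of Algorithm 1 below), a minimal OpenCV sketch that keeps near-horizontal Canny/Hough line segments as candidate step edges is given here; the thresholds and the 10° horizontality tolerance are assumed values.

import cv2
import numpy as np

def detect_step_edges(image_path):
    """Rough sketch: find near-horizontal edges (candidate stair risers)
    with Canny edge detection and a probabilistic Hough transform."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    blurred = cv2.GaussianBlur(img, (5, 5), 0)           # suppress texture noise
    edges = cv2.Canny(blurred, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                            minLineLength=img.shape[1] // 3, maxLineGap=10)
    steps = []
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            angle = abs(np.degrees(np.arctan2(y2 - y1, x2 - x1)))
            if angle < 10:                                # keep roughly horizontal lines only
                steps.append((x1, y1, x2, y2))
    return steps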


This contour detection algorithm is used specifically for recognizing the first step. Points along the gradient are given a higher priority by the algorithm. This keeps the observed contour from being heavily impacted by noise. The algorithm also looks for points that have strayed from the gradient. As a result, the contour recognized by this approach can be curved as well as straight, making it useful on a variety of stairs.

Algorithm 1: Detection algorithm for contour
Data: Current point [x, y], Edge map img
Result: Contour Z
Z = ∅; Z.add([x, y]);
δ = grad(Z[−5:]);
A = ∅;
x1 = δ × t + x; y1 = y + t; z1 = 0; A.add([x1, y1, z1]);
if img[x2 + z2][y2] ≠ 0 and x2 + z2 < y2 + δlimit × t then Z.add([x2 + z2, y2]); goto 8;
if img[x2 − z2][y2] ≠ 0 and x2 − z2 > y2 − δlimit × t then Z.add([x2 − z2, y2]); goto 8;
z2 = z2 + 1;
repeat 4 to 10 ∀ [x2, y2, z2] ∈ A;
repeat 3 to 11 ∀ t ∈ [0, thresh];
repeat 2 to 12 until t = thresh or end of image is reached;

Tag Granularity
Independent of the tag arrangement patterns, the optimal tag interval is found where the RE is low, according to the simulation results. We establish that there is an ideal link between the tag interval and the read range (tag granularity) by running simulations on different sets of read ranges and deriving the lowest RE values. Table 4 sorts the best results for different read ranges. In order to get the lowest navigational errors with respect to the RE values, a ratio of 4:1 has to be maintained (four parts tag interval to one part read range); for example, a read range of 10 cm corresponds to an optimal interval of roughly 40 cm (41 cm in Table 4). This has been demonstrated above.

Table 4 RE values and the tag intervals

Read range (cm)   Optimal interval (cm)   Relative error (RE) in %
                                          (T)      (S)      (P)
6                 27                      36.94    40.36    39.82
8                 33                      32.51    35.97    35.59
10                41                      29.75    32.76    33.08
12                45                      27.53    30.58    30.21
14                53                      26.01    28.35    29.32

T tilted-square, P parallelogram, S square


Pattern for Tag Arrangement
By examining the RE columns in Table 4, we can compare the three different arrangement patterns in terms of performance at the best tag granularity. In every example, the "tilted-square" layout yields the optimal results in terms of relative error. Using this data, we infer that the "tilted-square" arrangement proposed in this paper delivers the optimum efficiency required for navigation. The radio-frequency identification tags (RFID tags), which are the most appropriate sensors, are strategically placed beneath the floorboard of the living room. The "tilted-square" approach is used to place 128 RFID tags. Figure 10 depicts the layout. A water bottle, a piece of paper, a big trash can, and a compact disc are the objects contained in Fig. 10. Figure 11 shows the forward motion under a stable state.

Distance Calculation of Image
Photographs taken with different cameras have distinct characteristics. Because different settings cause distinct features, determining the parameters is the first step before turning on the camera. Figure 11 shows the mapping of pixel position to distance by means of a flowchart. This is used to measure the distance of any picture pixel from its centre. Figure 12 depicts the camera, which has a range of 180°; Fig. 12a depicts the camera's output, and Fig. 12b depicts the images captured by the camera in (a). Clearly, the content has been skewed significantly. One of the most significant aspects in distance measuring is the camera altitude, which must be set up initially. The camera's distance computation is shown in Fig. 13. In order to compute the marks, the values of P1–P4 are used.

Fig. 10 Living-room floor equipped by RFID tags


Fig. 11 Flowchart depicting the placement of pixels in relation to distance (steps: set the camera altitude H; set the floor marks f1, f2, f3, …, fn; obtain the distances d1, d2, …, dn between the marks and the centre point; measure d1, d2, …, dn in pixels; compute any point from its pixels)

Fig. 12 Whole-scene camera; a camera’s look, b photographs taken by the camera in (a)

Correction and Distortion
The content of a whole-scene camera is always distorted in some way. We employ Eqs. (1) and (2) to compute the angle between the item and the camera's centre in this study.

25 g POD—Dual Purpose Device (Dustbin and Cleaning)

321

Fig. 13 Camera’s distance calculation; a the red point represents the camera’s position, d1–d4 represents the distance between pixels and the red point, and b the distance is proportional to the altitude

Figure 14 displays the scenes and their degree of distortion in order to describe the compensation effect, whereas Fig. 14b shows the distortion adjustment result.

θo = tan⁻¹( W / (E + D) )   (1)

θm = tan⁻¹( W / D )   (2)

Fig. 14 180° whole-scene distortion compensation; a the distorted scenes, b the result after distortion compensation
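A small numeric sketch of Eqs. (1) and (2) follows; the reading of W as the lateral offset of the object and D, E as distances along the optical axis is assumed from the context, and the sample measurements are placeholders.

import math

def object_angles(w, d, e):
    """Angles from Eqs. (1) and (2): theta_o = arctan(W / (E + D)),
    theta_m = arctan(W / D).  The meanings of W, D and E are assumed here;
    the paper defines them in its figures."""
    theta_o = math.degrees(math.atan2(w, e + d))
    theta_m = math.degrees(math.atan2(w, d))
    return theta_o, theta_m

# Example with placeholder measurements in centimetres.
print(object_angles(w=30.0, d=60.0, e=40.0))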


The Global Path Planning
Global path planning is done automatically; Fig. 15 depicts the autonomous global path planning. The basic setup and parameter calibration are completed during system initialization, so a camera is used to capture a picture and calibrate the parameters. After that, a whole-scene image is only required at each interval period. In most cases, when a partial route is taken, a new path for movement will be deduced. When the system acquires an image of the entire scene, image pre-processing is applied to remove noise and improve object recognition. Keeping the robot progressing over the blank area of the image constitutes the key work in route planning, together with calculating a route that follows the rules.

Fig. 15 Flowchart for automatic global path planning


Fig. 16 Circuit block diagram of the dustbin (inputs: ultrasonic sensor and battery; controller: Arduino Uno R3; output: servo motor that opens and closes the smart bin automatically)

2.3 Mounted Dustbin

Block Diagram
An Arduino Uno R3-based smart dustbin is made up of an Arduino, a distance-detecting ultrasonic sensor (HC-SR04), and a servo. Figure 16 depicts the circuit block diagram. The fundamental system architecture is as follows: a battery powers the smart dustbin, and an Arduino Uno R3 is used for processing and control. When waste is being disposed of, the ultrasonic sensor detects hand motions, and the servo motor automatically opens and closes the smart trash can lid.

Code
The sketch below implements this behaviour on the Arduino Uno R3.


#include <Servo.h>

Servo servo;
int trigPin = 5;
int echoPin = 6;
int servoPin = 7;
long duration, dist, average;
long aver[3];

void setup() {
  servo.attach(servoPin);
  pinMode(trigPin, OUTPUT);
  pinMode(echoPin, INPUT);
  servo.write(0);                  // start with the lid closed
  delay(2500);
  servo.detach();
}

void measure() {
  // Trigger the HC-SR04 and convert the echo time into a distance in cm
  digitalWrite(trigPin, LOW);
  delayMicroseconds(6);
  digitalWrite(trigPin, HIGH);
  delayMicroseconds(15);
  digitalWrite(trigPin, LOW);
  pinMode(echoPin, INPUT);
  duration = pulseIn(echoPin, HIGH);
  dist = (duration / 2) / 29.1;
}

void loop() {
  // The listing is truncated in the source; a minimal completion is assumed:
  // average three readings and open the lid when a hand is closer than 10 cm.
  for (int i = 0; i < 3; i++) {
    measure();
    aver[i] = dist;
    delay(10);
  }
  dist = (aver[0] + aver[1] + aver[2]) / 3;
  if (dist < 10) {
    servo.attach(servoPin);
    servo.write(90);               // open the lid
    delay(3000);
    servo.write(0);                // close the lid
    delay(1000);
    servo.detach();
  }
}

max_{i∈Ba} f(V_a^i) > max_{i∈Bn} f(V_n^i)   (2)

In Eq. (2), we first select the maximum scores from the positive and negative bag and use these values in our ranking loss function. We use the maximum value from the positive bags because the abnormal event can occur over multiple segments, and we know that only one abnormal event is captured in a single video. We have created our training sets such that every set has instances from both positive and negative bags. Thus, the error is backpropagated from maximum instances of both the bags. Another advantage of using Eq. (2) is that it makes sure that the segment selected from the positive bag is more likely to be true positive as we have applied the tightest bound on the score of the negative bag. After this step, we get a set of maximum possible scores for both the bags, and similar to a regression problem, we get a maximum margin hyperplane to separate both the instances. In [8], Sultani et al. have noted some shortcomings of the loss calculated by Eq. (1). They discuss that an anomaly occurs for a short duration of time; hence,


the occurrences of anomalous instances in the negative bag would be sparse. They have also proposed that there is an abrupt change in the anomaly score in consecutive segments when an anomaly occurs. As a result, they have added sparsity and smoothness constraints to their loss function to adjust for irregularities. We also use these constraints in our loss function, which is then given by

sparsity = λ1 Σ_{i=1}^{n} f(V_a^i)   (3)

smoothness = λ2 Σ_{i=1}^{n−1} ( f(V_a^i) − f(V_a^{i+1}) )²   (4)

It is important to note that the constraints are added only for the anomalous instances of the positive bags. On adding these constraints to the loss function of Eq. (1), we get the final loss value as

lk(Ba, Bn) = Lk(Ba, Bn) + sparsity + smoothness   (5)

The objective function obtained in Eq. (5) is then used to train the model. We assume that when training the model for a large number of instances from both the positive and negative bags, the model will score the anomalous instances in the positive bags higher.
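A minimal PyTorch sketch of the objective in Eqs. (2)–(5) is given below, assuming per-segment scores f(V) from the network for one positive and one negative bag; since Eq. (1) is not reproduced in this excerpt, the usual hinge ranking form with margin 1 from Sultani et al. [8] is assumed for the first term.

import torch

def mil_ranking_loss(scores_abnormal, scores_normal, lambda1=8e-5, lambda2=8e-5):
    """Sketch of the MIL ranking objective of Eqs. (2)-(5).
    scores_abnormal, scores_normal: tensors of shape (n_segments,) holding
    f(V_a^i) and f(V_n^i) for one positive and one negative bag."""
    # Ranking term: the top-scoring abnormal segment should outscore the
    # top-scoring normal segment (Eq. 2 plugged into a hinge loss, margin 1).
    ranking = torch.relu(1.0 - scores_abnormal.max() + scores_normal.max())
    # Sparsity (Eq. 3) and temporal smoothness (Eq. 4) on the positive bag only.
    sparsity = lambda1 * scores_abnormal.sum()
    smoothness = lambda2 * ((scores_abnormal[1:] - scores_abnormal[:-1]) ** 2).sum()
    return ranking + sparsity + smoothness            # Eq. (5)

# Example with random segment scores for a 32-segment video pair.
loss = mil_ranking_loss(torch.rand(32), torch.rand(32))
print(loss.item())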

4 Experiments and Result

After creating the proposed network, we need to train and test it to check its effectiveness. The dataset choice affects how well our model will learn; hence, we need to select a robust dataset. To check the effectiveness of the model, we also need a parameter for benchmarking.

4.1 Dataset for Training and Testing

The most widely used datasets for video anomaly detection are UCSD Ped1, Ped2, abnormal crowds, UMN, and Avenue [15, 16, 20, 21]. These datasets contain a very small number of video samples in a very limited range of environments. Therefore, they are not ideal for creating generalized anomaly detection systems for practical applications. We have used the UCF-Crime [8] dataset for training and testing the model. This dataset is a compilation of 128 h of surveillance camera footage of 13 types of abnormal incidents like Abuse, Arrest, Arson, Assault, Road Accident, Burglary, Explosion, Fighting, Robbery, Shooting, Stealing, Shoplifting, and Vandalism. It also


Fig. 4 Sample snapshots from video snippets of various types of anomalous activities of the UCF-Crime dataset, with the frame number at the top right

has normal videos in equal numbers of similar environments. The dataset is divided into 1610 training and 290 testing videos. The train set comprises 800 normal and 810 abnormal videos; the test set is split into 150 normal and 140 abnormal instances. Although the dataset provides a large number of videos for each type of activity, we do not distinguish between the activities. Our objective is to identify the presence of abnormal activity in the video. Hence, we ignore the activity specific labels and label each anomalous video as 1 and normal video as 0. While testing the model, we follow the same convention. The aim is to detect the presence of anomalous activity in the abnormal video; such detection is our true positive instance (Fig. 4).

4.2 Implementation Details

For the implementation, each video is converted into the 240 × 320 format at 30 fps and then divided into 16-frame clips. We use the standard weights for the I3D network available as "mixed.c". The features are calculated for each 16-frame clip and then aggregated to create video segments. For our implementation, we divide each video into 32 segments of equal length depending on the size of the video. After passing these segments through the fully connected I3D network, as discussed in Sect. 3.1, we obtain a 2048-D vector of composite features. The feature vector is used as input to the first layer, which has 2048 nodes; this layer is connected to a 512-node hidden layer with a ReLU activation function and a dropout rate of 0.6. The next layer has 32 nodes with the same activation function. The final anomaly score is obtained by


flattening the output of the 32 nodes via a sigmoid function. For optimization, we have used the Adagrad optimizer [22] with a learning rate of 0.001. The hyperparameters λ1 and λ2, used as sparsity and smoothness constraints, are set to 0.00008. We used these parameter values based on the observations provided by Sultani et al. [8]. We use OpenCV for preprocessing the videos and PyTorch to create the neural network model. We have trained the setup on the Google Colab platform using Tesla K80 GPUs for 200 epochs.
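A PyTorch sketch of this scoring network and optimizer follows; reducing the 32-node layer to a single sigmoid score through one final linear unit is our reading of the description above, not a detail stated explicitly in the text.

import torch
import torch.nn as nn

class AnomalyScorer(nn.Module):
    """Fully connected scorer described above: 2048-D I3D feature ->
    512 (ReLU, dropout 0.6) -> 32 (ReLU) -> 1 (sigmoid anomaly score)."""
    def __init__(self, in_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Dropout(0.6),
            nn.Linear(512, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (n_segments, 2048)
        return self.net(x).squeeze(-1)       # per-segment anomaly scores

model = AnomalyScorer()
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.001)
scores = model(torch.randn(32, 2048))        # 32 segments of one video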

4.3 Evaluation and Analysis

To evaluate the model, we use the AUC value of the ROC plot for the test set results. The ROC curve is a plot of the true positive rate (TPR) on the Y-axis and the false positive rate (FPR) on the X-axis for evaluating the performance of models at various thresholds (FPR values). To calculate TPR and FPR, we calculate the number of true positives, true negatives, false positives, and false negatives. When a test case has an anomaly, and the model detects it, that is a true positive (TP). If the test case does not have an anomaly but the model detects one, it is a false positive (FP). Similarly, we have a true negative (TN) for the correct identification of a normal case and a false negative (FN) for an incorrect identification. Using these terms, TPR and FPR are defined as follows:

TPR = TP / (TP + FN)   (6)

FPR = FP / (FP + TN)   (7)
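For example, the ROC curve and its AUC can be computed from test labels and anomaly scores with scikit-learn; the arrays below are placeholders standing in for the real test-set predictions.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# y_true: 1 = anomalous test video, 0 = normal; y_score: model anomaly scores.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.1, 0.8, 0.6])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # Eqs. (6) and (7) at every threshold
auc = roc_auc_score(y_true, y_score)
print("AUC = {:.3f}".format(auc))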

After the ROC graph is plotted, we calculate the AUC value, and this value is a benchmark of the effectiveness of the model in classifying the test cases. The value of the area under the ROC curve is always in the range [0, 1]; the higher the value, the better the model is at predicting the label. The AUC value shows the measure of separability between the classes. A value close to 1 means the separability is excellent, while 0 means the classification is precisely the opposite (all samples with label 'A' are labeled as 'B' and all samples with label 'B' as 'A'). An AUC value of 0.5 indicates that the model has learned nothing and has no criteria for classification. The ROC plot for Sultani et al. [8] and our proposed model is compared in Fig. 5, and we also compare the AUC values with some recently reported techniques used for video anomaly detection in Table 1. We use the results that have been reported by various publications [5, 9, 11, 13] for these methods. We have compared our results with popular unsupervised and semi/weakly supervised techniques used for anomaly detection. Lu et al. [21] had based their method on a dictionary-based technique to understand normal behavior and used the reconstruction approach to detect the presence of anomalies. Basic one-class discriminative subspaces (BODS) and generalized one-class discriminative subspaces (GODS) are unsupervised clustering techniques for anomaly detection given by [2]. Another


Fig. 5 Comparison of ROC plot of Sultani et al. (red) with our model (yellow). The shaded portion denotes the AUC of our model

Table 1 Comparison of various AUC values of the ROC curve for the UCF-Crime dataset

Method               AUC (%)
Lu et al. [21]       65.51
BODS [2]             68.26
GODS [2]             70.46
GMM-based [5]        75.90
Sultani et al. [8]   77.92
Proposed work        82.03

clustering technique, based on a Gaussian mixture model (GMM) with a Bayesian distribution, is also provided by [5]. Our proposed method with the modified loss function improves the AUC value by 8.07% over the highest value among the unsupervised learning techniques. We also compare our results with the MIL technique introduced by Sultani et al. [8] and have achieved an improvement of about 5.27% in the AUC metric.

5 Conclusion

We have proposed a deep learning solution based on the MIL ranking loss model and a two-stream I3D network. The results from RGB and optical flow are concatenated to create a composite feature vector. Due to the range and complexity of real-world


scenarios, we use both the abnormal and normal videos in the training set. We tested our implementation on the UCF-Crime dataset and evaluated the AUC of the ROC curve. The proposed method has achieved better results than some of the recent approaches on the AUC metric. For future work, we can try to improve the feature extraction by extracting more contextual information.

Acknowledgements This work was supported by the Science and Engineering Research Board (SERB), Department of Science and Technology (DST), New Delhi, India, under Grant No. CRG/2020/001982.

References 1. Nayak R, Pati C (2021) A comprehensive review on deep learning-based methods for video anomaly detection. Image Vis Comput 106:104078. https://doi.org/10.1016/j.imavis.2020. 104078 2. Wang J, Cherian A (2019) Gods: generalized one-class discriminative subspaces for anomaly detection. In: IEEE international conference on computer vision, pp 8201–8211. https://doi. org/10.1109/ICCV.2019.00829 3. Hasan M, Choi J (2016) Learning temporal regularity in video sequences. Comput Vis Pattern Recogn (CVPR) 733–742. https://doi.org/10.1109/CVPR.2016.86 4. Yang B, Cao J (2018) Anomaly detection in moving crowds through spatiotemporal autoencoding and additional attention. Adv Multimedia 1–8. https://doi.org/10.1155/2018/2087574 5. Degardin B (2020) Weakly and partially supervised learning frameworks for anomaly detection. https://doi.org/10.13140/RG.2.2.30613.65769 6. Kiran B, Thomas D (2018) An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos. J Imag 4(2):36. https://doi.org/10.3390/ jimaging4020036 7. Liu Y, Li Z (2020) Generative adversarial active learning for unsupervised outlier detection. IEEE Trans Knowl Data Eng 32(8):1517–1528. https://doi.org/10.1109/TKDE.2019.2905606 8. Sultani W, Chen C (2018) Real-world anomaly detection in surveillance videos. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 6479–6488. https://doi. org/10.1109/CVPR.2018.00678 9. Tian Y, Pang G (2021) Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4975–4986. https://doi.org/10.48550/arXiv.2101.10030 10. Li J, Zhang S (2020) Multi-scale temporal cues learning for video person re-identification. IEEE Trans Image Process 29:4461–4473. https://doi.org/10.1109/TIP.2020.2972108 11. Feng J, Hong F (2021) MIST: multiple instance self-training framework for video anomaly detection. In: Conference on computer vision and pattern recognition (CVPR). https://doi.org/ 10.1109/CVPR46437.2021.01379 12. Carreira J, Zisserman A (2017) Quo Vadis, action recognition? A new model and the kinetics dataset. CVPR 4724–4733. https://doi.org/10.1109/.2017.502 13. Kay W, Carreira J (2017) The kinetics human action video dataset CVPR. https://doi.org/10. 48550/ARXIV.1705.06950 14. Christopher Z, Thomas P (2007) A duality based approach for realtime TV-L1 optical flow. Pattern Recogn 4713:214–223. https://doi.org/10.1007/978-3-540-74936-3-22 15. Li W, Mahadevan V (2014) Anomaly detection and localization in crowded scenes. IEEE Trans Pattern Anal Mach Intell 36(1):18–32. https://doi.org/10.1109/TPAMI.2013.111


16. Mehran R, Oyama A (2009) Abnormal crowd behavior detection using social force model. In: IEEE conference on computer vision and pattern recognition, pp 935–942. https://doi.org/10. 1109/CVPR.2009.5206641 17. Popoola O, Wang K (2012) Video-based abnormal human behavior recognition—a review. IEEE Trans Syst Man Cybernet Part C Appl Rev 42(6):865–878. https://doi.org/10.1109/ TSMCC.2011.2178594 18. Maximilian I, Jakub M (2018) Attention-based deep multiple instance learning. In: International conference on machine learning (ICML). PMLR, vol 80, pp 2127–2136 19. Bing L, Wang W (2017) Sparse representation based multi-instance learning for breast ultrasound image classification. Comput Math Methods Med 1–10. https://doi.org/10.1155/2017/ 7894705 20. Rabiee H, Haddadnia J (2016) Novel dataset for fine-grained abnormal behavior understanding in crowd. In: 13th IEEE international conference on advanced video and signal based surveillance (AVSS), pp 95–101. https://doi.org/10.1109/AVSS.2016.7738074 21. Lu C, Shi J (2013) Abnormal event detection at 150 fps in MATLAB. In: IEEE international conference on computer vision, pp 2720–2727. https://doi.org/10.1109/ICCV.2013.338 22. Duchi J, Hazan E (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12:2121–2159 23. Zhao Y, Deng B (2017) Spatio-temporal autoencoder for video anomaly detection. In: Proceedings of the 25th ACM international conference on multimedia, pp 1933–1941. https://doi. org/10.1145/3123266.3123451 24. Tran D, Bourdev L (2015) Learning spatiotemporal features with 3D convolutional networks. In: International conference on computer vision (ICCV), pp 4489–4497. https://doi.org/10. 1109/ICCV.2015.510 25. Nguyen T, Meunier J, Anomaly detection in video sequence with appearance-motion correspondence. In: IEEE/CVF international conference on computer vision (ICCV), pp 1273–1283. https://doi.org/10.1109/ICCV.2019.00136

Chapter 37

Sentiment Analysis of Twitter Data for COVID-19 Posts

Salil Bharany, Shadab Alam, Mohammed Shuaib, and Bhanu Talwar

1 Introduction

Nowadays, data is attested to be the most valuable resource globally. Some industry experts state that data is the new oil for boosting any country's economy. The motto behind this phrase is that raw data is not valuable in and of itself, just like oil; the value is created when it is collected accurately and attached to some relevant source. Data, when properly refined, can become the most powerful decision-making tool. These days, data is in more demand than ever before [1, 2]. The web covers most of the areas that are growing exponentially in terms of volume, as sites are dedicated to gathering as much information as possible. Social media is one of the major platforms where we can connect and communicate, and Twitter is trending more than ever before [3]. Twitter has become a ubiquitous platform for anyone to share their views. Tweets convey opinions [4]. Therefore, it is a very tedious job to collect and maintain tweet data so that it can help predict any insights. Tweets seem to be positive for some and harmful for others. We attempt to classify the polarity of a tweet as either positive or negative; if the tweet has both positive and negative elements, the more dominant sentiment should be picked as the final label. To better understand these tweets, we have used sentiment analysis. Sentiment analysis, also known as opinion mining, is a process where a dataset that contains emotions,


viewpoints, or judgment is taken into account in the way a human thinks. In any tweet that follows up, it is sometimes challenging to understand the negative or positive aspects [5, 6]. The tweet should have an extreme adjective to get better cognizance out of it. These days, people are very fond of using emoticons or emojis while writing any sentence. All these emojis have different definitions as per their structure, so it is again a difficult task to get the perception of these [7, 8]. Sentiment analysis can also speed up the completion of time-consuming jobs. Sentiment research aids companies in swiftly gaining a comprehensive understanding of their consumers' views. Making quick and correct judgments is also a benefit [9]. We are teaching the model to think like a person using sentiment analysis, and we will use that knowledge to classify the individual tweet in question. One of the most challenging jobs in natural language processing is sentiment analysis, since even humans cannot effectively interpret sentiments [10, 11]. The paper is organized as follows. Various scholars' work on sentiment analysis in diverse areas is briefly addressed in the second section. The third section explains the method we used to analyze sentiment. Section 4 discusses the execution and outcomes, and the last section concludes with thoughts on the future of the work.

2 Related Work

In recent years, much work has been done in sentiment analysis. As stated earlier, sentiment analysis is a fantastic way to discover how people, particularly customers, feel about a particular product or idea [12]. It is human nature that before buying any product or good, we tend to take the opinions of our friends and family members. Moreover, it has become very easy in the Internet era based on the recommendations that we get when buying a product. The Internet eases our efforts to get the opinions of the general population.

2.1 Steps Involved in Sentiment Analysis

• Data Acquisition: Collection of data.
• Text Pre-processing: Reduces noise in data.
• Feature Selection and Extraction: Extracting all the essential features for better results.
• Sentiment Classification: Choose the best classification technique.
• Polarity Detection: Check whether the text is positive, negative or neutral.
• Validation and Evaluation: Validate and evaluate the overall result.


2.2 Various Approaches for Sentiment Analysis

ML Based Approach: The machine learning (ML) approach used for sentiment analysis is basically a supervised classification and text classification technique, which is why it is referred to as "supervised learning". Two sets of documents are needed in the ML-based approach: a training dataset and a test dataset [13, 14]. An automatic classifier uses the training dataset to learn the distinguishing characteristics of documents, and a test set is used to see the performance of the automated classifier. Many machine learning techniques are used to define and divide the reviews. Techniques like naive Bayes (NB), maximum entropy (ME), and support vector machines are very prominent and have been very successful in sentiment analysis. Naive Bayes is used to divide and define textual data and is also a very effective and simple algorithm. The naive Bayes algorithm has advantages like low complexity and a simple training procedure, and it can be beneficial in some instances [15]. However, when applied to high-dimensional data, NB suffers from sparsity. Due to the expense of labeling, the training data is usually limited to a few short documents, such as tweets [16–18]. Using the naive Bayes technique to analyze high-dimensional datasets is therefore problematic: training data with few documents, such as tweets, and a small training set (because of human labeling's high cost) might lead to this problem. Smoothing techniques are employed to avoid the zero-probability issue [19, 20]. NB has been utilized to categorize a massive volume of textual material, and Wikipedia semantic smoothing has been proposed for the sparsity problem. Topic signatures, for example, may be extracted from training documents using the semantic smoothing technique. The expectation-maximization (EM) technique calculates term probabilities in the semantic smoothing approach. Vapnik developed a statistical classification approach known as the support vector machine (SVM). Senti-WordNet can be a better tool for sentiment classification than SVM. Machine learning may be used to analyze any form of data, and this study proves that it works well [21].

Lexicon Based Approach: This is unsupervised learning because it does not need prior training to classify the data. This approach classifies text by comparing it against sentiment lexicons whose values are determined before use. A sentiment lexicon has lists of expressions and words used to express a person's feelings and different opinions [22]. For example, the text is analyzed against both positive and negative word lexicons to find the sentiment; if the document has more negative lexicon words, it is negative, otherwise positive. Antonio Moreno-Ortiz and Chantal Pérez Hernández used this approach and applied it to sentiment analysis using Sentitext [23, 24]. Sentitext is a web-based application written in Python and C++. They experimented with testing whether lexically motivated systems can work on very short texts, generally generated from social networking sites like Twitter. The observations concluded that it is tough to get the desired results in the "neutral" and "no polarity" categories as


it is tough to differentiate between them. The lexicon-based approach is helpful for small texts, comments, and tweets on the web [25]. Firstly, sentences are classified into objective and subjective, and their semantic scores are checked using SentiWordNet. Then the final weight of each sentence is calculated. This method of Khan et al. has an accuracy of 86.6%.
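As a toy illustration of the word-counting idea behind the lexicon-based approach, the sketch below labels a text by comparing matches against small positive and negative word lists; the lists are invented placeholders rather than entries from an actual sentiment lexicon such as SentiWordNet.

# Toy lexicon-based polarity: count matches against small word lists and
# label the text by the majority.  The word lists are placeholders.
POSITIVE = {"good", "great", "safe", "recovered", "hope", "effective"}
NEGATIVE = {"bad", "worse", "death", "fear", "scam", "sick"}

def lexicon_polarity(text):
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos == neg:
        return "neutral"
    return "positive" if pos > neg else "negative"

print(lexicon_polarity("Vaccines are effective and give hope"))   # positive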

3 Our Methodology

A simple activity such as obtaining data was the first obstacle we faced in our research; we used tweets from Twitter to acquire information. We used ASP.NET 2012 to create this system. One individual can send several tweets on Twitter, so we first deleted multiple tweets from a single source to avoid bias in the results. Afterwards, we used WEKA 3.8 to do sentiment analysis on each tweet and determine its polarity. Detailed explanations of each phase are provided in the following sections, and a diagram of the proposed methodology is shown in Fig. 1. The semantic analysis offers a large set of synonyms and similarities that provide the content's polarity. The complete workflow of the project is discussed in the following sub-sections, and the schematic drawing for it is graphically represented in Fig. 1.

3.1 Collecting the Dataset

In order to collect Twitter data on "Corona" and "COVID-19", we only collected tweets in the English language; tweets in other languages were not considered. "Vaccines" was also included in the search. The data was collected from English speakers in the UK and India. About 57,187 tweets were included in this analysis. Real-time sentiment analysis was studied in [16]. Twitter was used to gather data for our study. Visual Studio [15] was used to create an ASP.NET solution. It was easy to connect the Twitter API [16] with the .NET

Fig. 1 Proposed workflow


Fig. 2 Hashtags used for collecting tweets

framework because it is free and open source. Figure 1 shows the proposed method's flowchart, which defines a technique based on the hashtags (#) of tweets posted by people during COVID. The hashtags (#) used to get tweets from Twitter can be seen in Fig. 2.
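For illustration only, the sketch below shows hashtag-based collection with the Tweepy library; the authors' actual collector was an ASP.NET application built on the Twitter API, and the bearer token, query string and requested fields here are placeholder assumptions.

# Illustrative hashtag-based tweet collection with Tweepy (not the ASP.NET
# collector described above).  BEARER_TOKEN and the query are placeholders.
import tweepy

BEARER_TOKEN = "YOUR_TWITTER_API_BEARER_TOKEN"
QUERY = "(#corona OR #COVID19 OR #vaccine) lang:en -is:retweet"

client = tweepy.Client(bearer_token=BEARER_TOKEN)
response = client.search_recent_tweets(
    query=QUERY,
    max_results=100,
    tweet_fields=["created_at", "author_id"],
)
for tweet in response.data or []:
    print(tweet.author_id, tweet.created_at, tweet.text[:80])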

4 Implementation and Results

The number of tweets is not the determining element in the success of any particular way of thinking; instead, we used sentiment analysis to determine if each tweet was polarizing (positive or negative) [10]. Sentiment analysis investigates how people feel about a wide range of themes and subjects, including products and services, personalities, organizations, issues, events, topics, and qualities. The open-source software WEKA 3.8 [17], which uses machine learning techniques to perform data mining tasks, was used to build this classification model. We used a supervised machine learning technique called support vector machines (SVM) for the sentiment analysis [26]. When dealing with two-class classification problems, the SVM uses a kernel function to split dataset instances in a multidimensional feature space [27]. Because SVM is widely considered one of the most acceptable classification algorithms, we used it to create the model. According to Kotzias et al. [28], the training data set included reviews and scores from three separate datasets: Amazon, IMDb, and Yelp. There were 1500 positive and negative statements in each dataset; the data set had two columns, one for the sentence and one for the sentiment of each sentence in the form of "0" (negative) or "1" (positive). Because in the real world a person can be either positive or negative at a given time, only one tweet per individual was taken into account. Moreover, numerous organizations and agencies are recruited to influence such analyses nowadays. This oddity was ruled out in code: if a person tweeted many times, only the first tweet by that individual would be selected for the review of findings. This constraint can be explained with an example: "salil" has tweeted three times, whereas "Shael" has


Table 1 Daily tweet collection

Date                          Daily tweets
01-09-2021 to 08-09-2021      6214
09-09-2021 to 16-09-2021      7558
17-09-2021 to 23-09-2021      4814
24-09-2021 to 30-09-2021      8547
01-10-2021 to 07-10-2021      6957
08-10-2021 to 13-10-2021      6714
14-10-2021 to 20-10-2021      5714
21-10-2021 to 27-10-2021      4587
28-10-2021 to 31-10-2021      6082
Total tweets                  57,187

tweeted two times. Any tweets other than the first one are flagged with a '1'. As a result, for "Salil" and "Shael," only one tweet each will be tallied, negating the effect of several tweets. After removing 20,085 duplicate tweets (35%), we were left with 37,102 (65%) tweets. A takeaway from this is how many individuals were publishing several tweets [29]. The 37,102 tweets we used as the basis for our whole experiment were critical to its success. Between September 1 and October 31, 2021, we gathered a total of 57,187 tweets from the Twitter-verse. We picked this time frame to collect data from various types of Twitter users, since the third wave was beginning in some regions and daily cases were diminishing in others [30, 31]. The total number of tweets can be seen in Table 1. We employed filtered classifiers, which let us create a classifier with our own filter: "String To Word Vector" is used as a filter to convert a string attribute to an array of word occurrence frequencies. A technique known as rotation estimation, i.e. 10-fold cross-validation, was employed to see how well the model would perform on an unknown dataset. To put this into perspective, 2359 of the training set's instances were correctly identified, while just 641 were wrongly identified. The confusion matrix accurately identified 1190 negative cases (class a) and 1177 positive ones (class b). Figure 4 displays the complete results and the confusion matrix, while Fig. 3 provides a graph of the data set's area under the curve (ROC = 0.793). We used tweets from Twitter as the testing set. In order to avoid biased findings, we preprocessed the data before testing to eliminate any undesirable HTML elements, web links, and special symbols ("!';: @ #). The preparation of data was done automatically. Each tweet was preprocessed prior to testing, its polarity was determined by the classification model, and these statistics were used to calculate the net positive score (NPS), the difference between the total number of positive tweets and the total number of negative tweets. Figure 5 and Table 2 present the final result of our analysis.
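The sketch below mirrors this pipeline in scikit-learn rather than WEKA: link and symbol cleaning, keeping only the first tweet per user, bag-of-words features as an analogue of the StringToWordVector filter, a linear SVM, and the net positive score; all sample sentences and tweets are placeholders, and with the full training set the commented 10-fold cross-validation call would be used.

import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder training sentences; the authors trained on the Amazon/IMDb/Yelp
# sentences of Kotzias et al. [28] with labels 1 (positive) / 0 (negative).
train_sentences = ["great product loved it", "really helpful and fast",
                   "terrible waste of money", "worst experience ever"]
train_labels = [1, 1, 0, 0]

# Placeholder tweets as (user, text); only the first tweet per user is kept.
tweets = [("salil", "Vaccines give hope! #COVID19 http://t.co/x"),
          ("salil", "tweeting again"),
          ("shael", "cases rising, terrible situation")]

def clean(text):
    text = re.sub(r"http\S+|<[^>]+>", " ", text)        # strip links / HTML remnants
    return re.sub(r"[^A-Za-z\s]", " ", text).lower()    # strip special symbols

def first_tweet_per_user(pairs):
    seen, kept = set(), []
    for user, text in pairs:
        if user not in seen:
            seen.add(user)
            kept.append(clean(text))
    return kept

# Bag-of-words features (analogue of WEKA's StringToWordVector) + linear SVM.
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(train_sentences, train_labels)
# With the full training set, 10-fold cross-validation would be run here, e.g.
# cross_val_score(model, train_sentences, train_labels, cv=10).

pred = model.predict(first_tweet_per_user(tweets))      # 1 = positive, 0 = negative
nps = int(np.sum(pred == 1)) - int(np.sum(pred == 0))   # net positive score
print(pred, nps)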


Fig. 3 Area under the curve (ROC = 0.793)

Fig. 4 Results of classification model

Table 2 shows the overall tweet collection related to COVID. Of the 37,102 unique tweets, 19,487 (51.88%) were positive and 17,605 (48.12%) were negative. Positivity was shown to have an NPS nearly three times higher than negativity in the tweets analyzed. Our research showed that people are responding positively to the current scenario. A visual representation of these findings may be seen in Fig. 6.


Fig. 5 Final results (tweet counts and percentages by sentiment label)

Table 2 Overall tweet collection related to COVID

Total tweets       57,187
Positive tweets    45.4%
Negative tweets    30%

Fig. 6 Number of positive and negative tweets


5 Conclusion

Communication media are still developing, owing to the fact that we now use social media applications on a regular basis. In order to fulfill their objectives, several institutions and government systems are motivated to incorporate textual classification as a proactive component of their work. This article proposed a model for sentiment analysis of Twitter messages connected to the COVID-19 vaccination. This might aid the health system and organizations like the World Health Organization (WHO) in utilizing social media, such as Twitter or Facebook, to promote the necessity of immunization on these platforms. Additionally, by promoting pro-vaccine tweets and rejecting anti-vaccine ones, it may be possible to put a stop to the epidemic. A total of 57,187 tweets were analyzed using the SVM machine learning classifier. The findings are optimistic and enticing enough to continue investigating this issue in other nations and with a wider range of languages. For the purpose of evaluating additional models, we want to employ both supervised and unsupervised learning approaches in conjunction. Moreover, we can combine two classifiers in our future model and extend the time span for the collection of tweets.

References 1. Liu C-L, Hsaio W-H, Lee C-H, Lu G-C, Jou E (2011) Movie rating and review summarization in mobile environment. IEEE Trans Syst Man Cybern Part C (Appl Rev) 42(3):397–407 2. Diviya M, Karmel A (2022) Review on technological advancement and textual data management algorithms in NLP and CBIR systems. In: Artificial intelligence and technologies. Springer, pp 311–321 3. Luo Y, Huang W (2011) Product review information extraction based on adjective opinion words. In: 2011 Fourth international joint conference on computational sciences and optimization, pp 1309–1313 4. Alabid NN, Katheeth ZD (2021) Sentiment analysis of twitter posts related to the COVID-19 vaccines. Indones J Electr Eng Comput Sci 24(3):1727–1734 5. Gautam G, Yadav D (2014) Sentiment analysis of twitter data using machine learning approaches and semantic analysis. In: 2014 Seventh international conference on contemporary computing (IC3), pp 437–442 6. Gowda SR, Archana BR, Shettigar P, Satyarthi KK (2022) Sentiment analysis of twitter data using Naïve Bayes classifier. In: ICDSMLA 2020. Springer, pp 1227–1234 7. Shuaib M et al (2022) Identity model for blockchain-based land registry system: a comparison. Wirel Commun Mob Comput 2022:1–17. https://doi.org/10.1155/2022/5670714 8. Shuaib M et al (2022) Land registry framework based on self-sovereign identity (SSI) for environmental sustainability. Sustainability 14(9):5400 9. Bhatia S, Alam S, Shuaib M, Alhameed MH, Jeribi F, Alsuwailem RI (2022) Retinal vessel extraction via assisted multi-channel feature map and U-net. Front Public Heal 10 10. Bharany S et al (2021) Energy-efficient clustering scheme for flying ad-hoc networks using an optimized LEACH protocol. Energies 14(19):6016 11. Shuaib M et al (2022) Self-sovereign identity solution for blockchain-based land registry system: a comparison. Mob Inf Syst 2022:1–17. https://doi.org/10.1155/2022/8930472


12. Bharany S et al (2022) A systematic survey on energy-efficient techniques in sustainable cloud computing. Sustainability 14(10):6256 13. Rahmani MKI et al (2022) Blockchain-based trust management framework for cloud computing-based internet of medical things (IoMT): a systematic review. Comput Intell Neurosci 2022 14. Alam S et al (2021) Blockchain-based initiatives: current state and challenges. Comput Netw 198. https://doi.org/10.1016/j.comnet.2021.108395 15. Khubrani MM, Alam S (2021) A detailed review of blockchain-based applications for protection against pandemic like COVID-19. 19(4):1185–1196. https://doi.org/10.12928/TELKOM NIKA.v19i4.18465. 16. Nisha KA, Kulsum U, Rahman S, Hossain M, Chakraborty P, Choudhury T (2022) A comparative analysis of machine learning approaches in personality prediction using MBTI. In: Computational intelligence in pattern recognition. Springer, pp 13–23 17. Kalaivani MS, Jayalakshmi S (2022) Text-based sentiment analysis with classification techniques—a state-of-art study. In: Computer networks and inventive communication technologies. Springer, pp 277–285 18. Reegu FA, Daud SM, Alam S, Shuaib M (2021) Blockchain-based electronic health record system for efficient covid-19 pandemic management. https://doi.org/10.20944/preprints202 104.0771.v1 19. Rajalakshmi S, Asha S, Pazhaniraja N (2017) A comprehensive survey on sentiment analysis. In: 2017 Fourth international conference on signal processing, communication and networking (ICSCN), pp 1–5 20. Talwar B, Arora A, Bharany S (2021) An energy efficient agent aware proactive fault tolerance for preventing deterioration of virtual machines within cloud environment. In: 2021 9th International conference on reliability, infocom technologies and optimization (Trends and Future Directions) (ICRITO), pp 1–7 21. Alam S, Shuaib M, Samad A (2019) A collaborative study of intrusion detection and prevention techniques in cloud computing. In: Lecture notes in networks and systems, vol 55. Springer, pp 231–240. https://doi.org/10.1007/978-981-13-2324-9_23 22. Shuaib M, Daud SM, Alam S, Khan WZ (2020) Blockchain-based framework for secure and reliable land registry system. TELKOMNIKA Telecommun Comput Electron Control 18(5):2560–2571. https://doi.org/10.12928/TELKOMNIKA.v18i5.15787 23. Shuaib M, Daud SM, Alam S (2021) Self-sovereign identity framework development in compliance with self sovereign identity principles using components. Int J Mod Agric 10(2):3277–3296 24. Khan ZA, Khubrani MM, Alam S, Hui SJ, Wang Y Method for measuring the similarity of multiple metrological sequences in the key phenological phase of rice-based on dynamic time 25. Bharany S, Sharma S, Bhatia S, Rahmani MKI, Shuaib M, Lashari SA (2022) Energy efficient clustering protocol for FANETS using moth flame optimization. Sustainability 14(10):6159 26. Abdus S et al (2018) Internet of vehicles (IoV) requirements, attacks and countermeasures. In: 2018 5th International conference on computing for sustainable global development, pp 4037–4040 27. Tripathi M (2021) Sentiment analysis of Nepali COVID19 tweets using NB SVM and LSTM. J Artif Intell 3(03):151–168 28. Kotzias D, Denil M, De Freitas N, Smyth P (2015) From group to individual labels using deep features. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp 597–606 29. Singh P, Singh S, Sohal M, Dwivedi YK, Kahlon KS, Sawhney RS (2020) Psychological fear and anxiety caused by COVID-19: insights from twitter analytics. 
Asian J Psychiatr 54:102280 30. Brahmi N, Singh P, Sohal M, Sawhney RS (2020) Psychological trauma among the healthcare professionals dealing with COVID-19. Asian J Psychiatr 54:102241 31. Pandian AP (2021) Performance evaluation and comparison using deep learning techniques in sentiment analysis. J Soft Comput Paradig 3(02):123–134

Chapter 38

Brain Tumor Detection Using Image Processing Approach

Abhinav Agarwal, Himanshu Arora, Shivam Kumar Singh, and Vishwabandhu Yadav

1 Introduction

The brain is the most complex organ in the human body. It contains a very large number of cells, and their uncontrolled division and the aberrant, uncontrollable proliferation of tissue create brain tumors. Brain tumors are classified according to their origin, type, size, pace of development, region of aberrant tissue, and stage of progression [1]. Tumors are usually divided into two types: (i) Benign tumor (non-cancerous): it has the ability to spread but seldom affects nearby healthy cells and has definite boundaries; it has a sluggish pace of advancement. (ii) Malignant tumor (cancerous): this form of tumor has fuzzy boundaries and may swiftly spread and attack nearby tissues in the brain; it worsens the patient's situation by hastening the growth of tumor tissue. Figure 1 represents the different types of brain tumor according to the WHO categorization of tumor grade. Many lives may be spared if a tumor is detected early, and the therapy of a tumor depends on its prompt diagnosis [2]. In recent years, various computer-aided diagnostic approaches for identifying and categorizing tumors using MRI images have been developed [3]. These range from traditional medical image processing methods to modern machine learning approaches [4, 5]. Essentially, the foundation of machine learning approaches is nothing more than a circumstance in which a machine has a job to complete and experience improves machine performance [6]. In the realm of medical analysis, the machine learning technique has been extensively applied. Many common machine


Fig. 1 Types of brain tumor

learning methods for brain tumor identification, classification, and segmentation include SVM, KNN, ANN, and decision trees, among others. These ML strategies have set certain benchmarks in the tumor detection and classification field, each with its own set of disadvantages, for example classification error, edge blurring, increased noise sensitivity, long processing time, increased computational cost, complexity, and lower accuracy. With the increasing performance of high-performance computers, falling hardware prices, and the need to eliminate some of the shortcomings of prior ML methods, deep learning models, a sub-field of machine learning, have recently been adopted.

2 Imaging Techniques Tumor diagnosis may follow one of two approaches. The first is the invasive approach (biopsy), which involves making an incision in order to obtain a tumor sample for examination; pathologists use a microscope to evaluate distinct aspects of the tumor tissue for a quick overview [7]. The other is non-invasive brain scanning utilizing imaging technologies, which is a quicker and safer approach than biopsy. Image processing is one of the most popular techniques used in various fields such as healthcare, military security, and medicine [8–14]. Imaging is crucial in the treatment of brain tumors. There are several imaging modalities, such as CT, MRI, and PET, among others. Radiologists can detect brain abnormalities and progression


Fig. 2 (1) Normal brain image; (2) Benign; (3) Malignant

rates using imaging modalities, which supports surgical planning and computer-assisted surgery. However, when comparing CT with MRI imaging modalities, MRI is the safer, more popular, and more widely used imaging technology [3–5], because it produces pictures that are free of radiation. With high contrast, MRI images give extensive information on brain tissues [15], which is why MRI is the most used radiological imaging technology. The MRI images in Fig. 2 show normal brain imaging as well as the tumor-infected region of the brain. Displaying MRI data as 3D representations simplifies and adapts the analysis.

3 Methodology Preprocessing, segmentation, feature extraction, and classification of brain tumor MRI images are the four main processes in the proposed approach. As shown in Fig. 3, a DCNN architecture is used for tumor identification in this technique. There are various phases to this procedure. Pre-processing, or data normalization, is first conducted on the brain MRI image, followed by image thresholding and dilation. After the data has been gathered, it undergoes pre-processing and further enhancement. Here, by using the CNN architecture, any section of the brain may be segmented into tumor regions. Finally, a pre-trained CNN takes the resized image as input and classifies the images into two classes, benign and malignant tumors. This methodology also has the potential to extend the categorization of glioma into four classes. Brain MRI images are used as this study's dataset. In the suggested paradigm, the training and testing modules are separated, and different MRI samples are used for training and testing.
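The chapter does not list the exact layer configuration of the network. As a rough, hedged illustration of the benign/malignant classification stage described above, a minimal CNN could be sketched as follows; the input resolution, filter counts, and training settings are assumptions made here for illustration, not the authors' settings.

```python
# Minimal sketch (not the authors' exact network): a small Keras CNN that
# classifies resized brain-MRI slices as benign (0) or malignant (1).
# Input size, filter counts, and training settings are illustrative assumptions.
from tensorflow.keras import layers, models

def build_tumor_cnn(input_shape=(128, 128, 1)):
    model = models.Sequential([
        layers.Input(shape=input_shape),        # pre-processed, resized MRI slice
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),  # benign vs. malignant
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Training and test sets stay separate, as in the proposed paradigm:
# model = build_tumor_cnn()
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=20)
```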


Fig. 3 Proposed model flowchart

Training module: this module is designed to classify a brain MRI scan as normal or abnormal and, if abnormal, to then determine the tumor's grade. Testing module: this is where we run a huge number of scans through the system to make sure everything works properly and accurately. For the most part, it is used to evaluate the system's accuracy, precision, and efficiency.


4 Result and Discussion The purpose of image pre-processing is to improve the picture quality and make it acceptable for subsequent classification. MRI acquisition is basically a trade-off between signal-to-noise ratio (SNR) and acquisition time; speed and resolution have the greatest impact on picture quality during acquisition. The best way to boost the SNR is to lengthen the acquisition period. However, due to technological limitations, this is not feasible in practice. As a result, a reduction in acquisition time lowers the SNR as well as the contrast. It is therefore hard to recognize the tumor in noisy, low-contrast MRI images, especially since most segmentation techniques are sensitive to noise, intensity irregularities, and poor contrast. Therefore, pre-processing is required to remove noise and enhance contrast across areas. Brain tumor segmentation outcomes may be greatly influenced by these steps, which include image scaling and de-noising as well as skull-stripping, image enhancement, and intensity normalization. This is done so that all images may be used as input to the neural network in the same way. Step one of the image segmentation process involves separating the picture into sections with varying levels of brightness and texture (contrast, shadow, and degree of gray). The digital grayscale image is used as the input for the system's functionality. Large amounts of data are extracted from images during the segmentation process. The neural network-based supervised learning approach has been selected for the proposed system among a variety of algorithms. Brain MRI scans are classified as either normal or abnormal in this stage, and the characteristics are extracted based on the classifications. Because of the wide range of possible appearances, the characteristics employed in brain tumor segmentation are highly dependent on the kind and grade of the tumor. There are several ways to obtain an image's visual information: intensity-, shape-, and texture-based properties may be extracted from the MRI scans. The process of categorizing images based on their qualities is known as image classification; these automated systems employ the CNN classifier. When an abnormality is seen in an image, it is classed as either a benign or malignant tumor. Finally, the malignant tumor is classified into LGG and HGG with four grades of glioma. In other words, classification is simply labeling an image according to its features. Figures 4, 5, 6, 7, 8, and 9 show the results obtained from the proposed experimental analysis for tumor detection: the median filtered image, the malignant tumor alone, the bounding box around the tumor, the eroded image, the tumor outline, and the detected tumor, respectively. Figures 4 and 5 show the median filtered image and the malignant tumor alone, Fig. 8 displays the tumor outline, and Fig. 9 shows the tumor detected by the proposed image processing approach.
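As a rough, hedged sketch of the classical image-processing steps visualized in Figs. 4–9 (median filtering, thresholding, erosion, bounding box, and outline extraction), the pipeline could look like the following; the threshold level, kernel sizes, and the assumption that the largest bright region is the tumor are illustrative choices, not the authors' exact parameters.

```python
# Illustrative median-filter/threshold/erosion pipeline for isolating a bright
# tumor region in a grayscale MRI slice (compare Figs. 4-9).
# Threshold, kernel sizes, and "largest contour = tumor" are assumptions.
import cv2
import numpy as np

img = cv2.imread("mri_slice.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input file

filtered = cv2.medianBlur(img, 5)                          # Fig. 4: median filtered image

# Keep only the brightest tissue as tumor candidates (Fig. 5)
_, mask = cv2.threshold(filtered, 160, 255, cv2.THRESH_BINARY)

# Erode to remove small speckles and thin connections (Fig. 7)
kernel = np.ones((5, 5), np.uint8)
eroded = cv2.erode(mask, kernel, iterations=2)

# Take the largest connected region as the tumor; draw its bounding box (Fig. 6)
contours, _ = cv2.findContours(eroded, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
    tumor = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(tumor)
    boxed = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
    cv2.rectangle(boxed, (x, y), (x + w, y + h), (0, 0, 255), 2)
    outline = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
    cv2.drawContours(outline, [tumor], -1, (0, 255, 0), 2)  # Fig. 8: tumor outline
```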


Fig. 4 Median filtered image

Fig. 5 Malignant tumor alone

5 Conclusion ANN, SVM, PNN, k-NN, DBN, CNN, DNN, and others are some of the different techniques for detecting and classifying brain tumors from MRI images. All of these methods are more efficient and convenient than manual segmentation. As a result of several developments in tumor analysis, medical interventions have become dependable and life-saving thanks to the early diagnosis of tumors. Because of this, the patient


Fig. 6 Bounding box in tumor

Fig. 7 Eroded image

is able to obtain therapy sooner. However, to further improve the current techniques, which still fall short in terms of accuracy, precision, sensitivity, and processing time, the segmentation and feature extraction stages of the current CNN strategy are modified and a DNN-based model is implemented to achieve the highest predictive quality for the clinical domain.


Fig. 8 Tumor outline

Fig. 9 Detected tumor

References
1. Raut G, Raut A, Bhagade J, Bhagade J, Gavhane S (2020) Deep learning approach for brain tumor detection and segmentation. In: IEEE international conference on convergence to digital world—Quo Vadis (ICCDW), pp 1–5
2. Soni GK, Rawat A, Yadav D, Kumar A (2021) 2.4 GHz antenna design for tumor detection on flexible substrate for on-body biomedical application. In: 2021 IEEE Indian conference on antennas and propagation (InCAP), pp 136–139
3. Shahriar Sazzad TM, Tanzibul Ahmmed KM, Hoque MU, Rahman M (2019) Development of automated brain tumor identification using MRI images. In: 2019 International conference on electrical, computer and communication engineering (ECCE), pp 1–4
4. Jha P, Biswas T, Sagar U, Ahuja K (2021) Prediction with ML paradigm in healthcare system. In: 2021 Second international conference on electronics and sustainable communication systems (ICESC), pp 1334–1342
5. Soni GK, Rawat A, Jain S, Sharma SK (2020) A pixel-based digital medical images protection using genetic algorithm with LSB watermark technique. In: Smart systems and IoT: innovations in computing. Springer, pp 483–492
6. Ahuja K, Sekhawat H, Mishra S, Jha P (2021) Machine learning in artificial intelligence: towards a common understanding. Turk Online J Qual Inq (TOJQI) 12(8):1143–1152
7. Vijayakumar T (2019) Classification of brain cancer type using machine learning. J Artif Intell 1(02):105–113
8. Soni GK, Arora H, Jain B (2020) A novel image encryption technique using Arnold transform and asymmetric RSA algorithm. In: International conference on artificial intelligence: advances and applications 2019 algorithm for intelligence system. Springer, pp 83–90
9. Singh V, Choubisa M, Soni GK (2020) Enhanced image steganography technique for hiding multiple images in an image using LSB technique. TEST Eng Manag 83:30561–30565
10. Arora H, Soni GK, Kushwaha RK, Prasoon P (2021) Digital image security based on the hybrid model of image hiding and encryption. In: 2021 6th International conference on communication and electronics systems (ICCES), pp 1153–1157
11. Kumar M, Soni A, Shekhawat ARS, Rawat A (2022) Enhanced digital image and text data security using hybrid model of LSB steganography and AES cryptography technique. In: 2022 Second international conference on artificial intelligence and smart energy (ICAIS), pp 1453–1457
12. Mishra S, Singh D, Pant D, Rawat A (2022) Secure data communication using information hiding and encryption algorithms. In: 2022 Second international conference on artificial intelligence and smart energy (ICAIS), pp 1448–1452
13. Arora H, Kumar M, Tiwari S (2020) Improve image security in combination method of LSB stenography and RSA encryption algorithm. Int J Adv Sci Technol (IJAST) 28(8):6167–6177
14. Agarwal A, Arora H, Mehra M, Das D (2021) Comparative analysis of image security using DCT, LSB and XOR techniques. In: 2021 Second international conference on electronics and sustainable communication systems (ICESC), pp 1131–1136
15. Malik M, Jaffar MA, Naqvi MR (2021) Comparison of brain tumor detection in MRI images using straightforward image processing techniques and deep learning techniques. In: 2021 3rd International congress on human-computer interaction, optimization and robotic applications (HORA), pp 1–6

Chapter 39

Routing Method for Interplanetary Satellite Communication in IoT Networks Based on IPv6 Paweł Dobrowolski, Grzegorz Debita, and Przemysław Falkowski-Gilski

1 Introduction The topic of interplanetary network connection is a current and complex scientific problem [1]. With the growing number of satellite missions [2], including particularly small satellites (SmallSats), sometimes referred to as CubeSats, providing and maintaining reliable data transmission between newly launched and already present artificial satellites in space become a challenging task. The concept of designing an efficient communication link, including the ground as well as space segment, is shown in Fig. 1. This concept becomes even more challenging when examining the interplanetary communication between numerous satellites [3], grouped in various satellite formations [4], coming from different satellite systems, and in future, interplanetary satellite systems [5]. An exemplary scenario is shown in Fig. 2. Let us focus on the following scenario. The arrangement of satellites in Saturn’s geostationary orbit is shown in Fig. 3. Due to the rotation of the planet about its axis, satellites placed in its orbit could be easily located. This would facilitate communication, because the time of the planet’s rotation through a satellite placed in a geostationary orbit is equal to the rotation of P. Dobrowolski IT Services NetPD, Wrocław, Poland e-mail: [email protected] G. Debita Faculty of Management, General Tadeusz Kosciuszko Military University of Land Forces, Wrocław, Poland e-mail: [email protected] P. Falkowski-Gilski (B) Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Gda´nsk, Poland e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al, (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_39


Fig. 1 Exemplary communication link between the ground and space segment

Fig. 2 Exemplary interplanetary satellite communication link

Fig. 3 Arrangement of satellites in Saturn’s geostationary orbit

the planet about its own axis [6]. Furthermore, it would ensure orderly, continuous movement of satellites in a given area of the solar system, around a particular planet. The path that satellites placed in the geostationary orbit of each planet of the solar system would travel is shown in Fig. 4. By using the rotation of the planets around their axes, the measuring devices would be able to collect data from the indicated areas of the solar system. Thanks


Fig. 4 Geostationary orbits of satellites placed in our solar system

to the combination of the aforementioned movement around the planets [7], with the movement of the planets themselves (in their orbits around the Sun), artificial satellites would provide even more data, as shown in Fig. 5.

Fig. 5 Collection of data by artificial satellites in our solar system


2 Protocols and Transmission The IPv6 protocol is an improved version of the IPv4 protocol, which is described in the RFC 791 specification [8]. The most important change, compared to the superseded IPv4 protocol, is the increased size of the address space [9], which grows from 32 bits in IPv4 to 128 bits in IPv6, as shown in Fig. 6. The main reason for such a huge change is the necessity to adapt to changing technological needs, mainly in terms of the number of devices working in IT networks; currently, this field of study is referred to as the Internet of things (IoT) [10]. This brings the possibility of assigning public addressing to end stations, thus removing the need for the network address translation (NAT) used in IPv4 networks. It also brings much easier auto-configuration of device addressing, as well as a stimulus to the development of network technologies such as multicast or anycast [11]. The IPv6 protocol allows automatic provision of basic connectivity on each active interface, by calculating the address for a given interface based on the reserved IPv6 link-local FE80::/10 network and the EUI-64 mechanism applied to the MAC address of the specific interface [12, 13]. The use of link-local addressing is limited. However, based on RFC 2545 [14], FE80::/10 addresses can be used, for example, as next-hop addresses for prefixes received via the border gateway protocol (BGP) [15]. The correct link-local address calculation for an ens3 interface, after the interface is put into an active state, is shown in Fig. 7.
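For illustration, the modified EUI-64 derivation described above can be sketched in a few lines of Python; the MAC address used is a made-up example, not one taken from the test bed.

```python
# Sketch: derive an IPv6 link-local address (fe80::/10 + modified EUI-64)
# from an interface MAC address. The MAC below is an illustrative value.
import ipaddress

def mac_to_link_local(mac: str) -> ipaddress.IPv6Address:
    octets = [int(b, 16) for b in mac.split(":")]
    octets[0] ^= 0x02                                    # flip the universal/local bit
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]       # insert FF:FE in the middle
    iid = int.from_bytes(bytes(eui64), "big")
    return ipaddress.IPv6Address((0xFE80 << 112) | iid)

print(mac_to_link_local("52:54:00:12:34:56"))            # fe80::5054:ff:fe12:3456
```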

Fig. 6 Header format of the IPv6 packet


Fig. 7 Calculated correct link-local address

Thanks to the received information, it is also possible to verify all neighbors found within a given segment of the network, that is, those which have also correctly calculated the link-local address for their interface [16]. One of the main problems of today's networks is the use of outdated protocols, including numerous protocols introduced several dozen years ago, which have to be adjusted to various IT business requirements. The IPv6 protocol was extended with the functionality of the routing segment, allowing the implementation of source routing functionality, which is constantly being developed by the Internet Engineering Task Force (IETF). In contrast to the traditional approach used in current networks, regardless of the version of the IP protocol, which is based on routing packets only toward the destination, segment routing (SR) offers the possibility of sending packets along longer routes (in terms of the number of devices within the path). This is possible by specifying a list of tags, called segments. These segments contain specific instructions that are delivered to each device along the route [17]. Functionalities related to SR in IPv6 networks are realized thanks to the use of an additional header in the IPv6 packet [14]. The segment routing extension header (SRH), shown in Fig. 8, contains the aforementioned list of segments. Segments included in the SRH can define conditions and guidelines for traffic control and/or traffic engineering within the network, by specifying capacity requirements or delays on a given link, thus defining the routing method for specific packets.
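As a hedged illustration of the SRH layout shown in Fig. 8 (following the RFC 8754 field order), the header carrying a segment list could be assembled as follows; the field values and the reuse of the documentation-prefix SIDs from the later test scenario are purely illustrative, not the test-bed configuration itself.

```python
# Sketch: byte layout of an SRv6 Segment Routing Header (SRH) carrying the
# segment list that steers a packet through intermediate devices.
# Field values are illustrative assumptions.
import ipaddress
import struct

def build_srh(segments, next_header=41):
    seg_bytes = b"".join(ipaddress.IPv6Address(s).packed for s in segments)
    hdr_ext_len = 2 * len(segments)        # length in 8-octet units, excluding the first 8
    segments_left = len(segments) - 1      # index of the next segment to visit
    last_entry = len(segments) - 1
    fixed = struct.pack("!BBBBBBH",
                        next_header,       # protocol following the SRH
                        hdr_ext_len,
                        4,                 # Routing Type 4 = Segment Routing
                        segments_left,
                        last_entry,
                        0,                 # flags
                        0)                 # tag
    return fixed + seg_bytes

srh = build_srh(["2001:db8::40:86b1", "2001:db8::40:8aa0", "2001:db8::30:20"])
print(len(srh), "bytes")                   # 8 + 3*16 = 56
```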


Fig. 8 Structure of the SRH

3 Implementation, Validation, and Testing The prepared solution was based on the following laboratory components:
• Linux Ubuntu Server VMs 20.04;
• Linux Kernel 5.6;
• FRRouting 7.4;
• IProute2-ss200330;
• Tshark v3.2.6;
• BWM-NG v0.6.3;
• TMUX 3.1c.

The hardware platform was the Nexys 4, shown in Fig. 9, an FPGA device from Xilinx. We chose this model because of its broad collection of communication ports, including, e.g., USB and Ethernet. Furthermore, it offers several built-in peripherals, including an accelerometer, temperature sensor, digital microphone, amplifier, speaker, etc. [18]. This board ran a LEON3 32-bit processor, compliant with the SPARC architecture, as shown in Fig. 10. It is a highly customizable system-on-a-chip (SoC) described in VHDL. Each of the 16 available cores can be configured to operate independently, in either a single-core or multi-core configuration. Additionally, the configuration of each processor core, in terms of, e.g., cache, is independent, as solutions with asymmetric configurations of individual cores are supported [19].


Fig. 9 Hardware FPGA-based platform utilized during tests

Fig. 10 Block diagram of the LEON3 processor

Based on the assumptions from RFC 3849 [14], the proposed network structure utilized IPv6 addressing from the dedicated prefix 2001:DB8::/32 (reserved for documentation purposes). The proposed subnetting plan for the entire network, based on the /32 prefix for the project, is shown in Fig. 11. In order to standardize the configuration between all devices as much as possible, apart from establishing neighborhood relationships based on link-local addressing and using the same router-id values, a peer-group functionality was also configured, as shown in Fig. 12. This allowed the number of configuration entries within each device to be significantly reduced, by applying the same connection rules to each of the potential peers, regardless of their autonomous system number, through setting the remote-as parameter to external. Router-ID addresses were configured


Fig. 11 Proposed subnetting of the tested network

with the same values, based on the RFC6286 [14] document, which requires unique Router-ID values only within the same autonomous system. Due to the assumed large number of devices that can work in IPv6 networks, the numbering of autonomous systems had to be adjusted accordingly. In accordance

Fig. 12 Sample show command of unification features


with the documents RFC6996 and RFC4893 [14], 32-bit numbering of autonomous systems was utilized, with a separate range of private numbers (4,200,000,000– 4,294,967,294), which gave the total number of 94,967,294 autonomous systems that could be used for the needs of the satellite network. Due to the uniqueness of addresses assigned to satellites, from subnets assigned to specific planets, the hexadecimal loopback address is converted into an adequate number in the decimal system, thus constituting the number of the autonomous system in a separate private pool. Additionally, in order to ensure the correct operation of the segment routing functionality, two route maps were configured, as shown in Fig. 13. The first one, RM_NEXT_HOP, was designed to change the value of the next-hop address to the loopback address of a given satellite (for all advertised subsets, except for the loopback interface itself). This configuration would lead to the recursive route lookup function, when sending packets to the destination. Then, the next-hop address, pointing to the loopback interface address, would be translated into a link-local address, calculated in the device on which the loopback is configured. The second RM_PREF_GLOBAL route map had the opposite effect, and the function configured in it was designed to force the next-hop address to be indicated in its own RIB table to the appropriate loopback address of the device advertising the given route. The last stage of implementation of the discussed method was to configure the SR function for the IPv6 dataplane on Linux. The idea behind such a solution is the possibility of selecting the forwarding path of the sent packets by the transmitting device. An additional advantage of such a solution is the possibility of choosing and reacting appropriately to the current path of transmitted packets, with regard to the quality of a given link. Currently, the path selection and the method of data delivery to the receiver are selected by the BGP protocol, which, despite its flexibility, allows manipulation of the traffic route selection only by AS_PATH_PREPENDING or MED attributes. Due to this fact, the proposed solution utilizes eBGP sessions. Despite such a choice, these changes would really only affect the choice of the first device in the route and not the entire path. Due to this, the role of the BGP protocol could be minimized, only to propagate datacenter networks and loopback addresses of each device, which can then be used as SID address for the SR function. The current route selection, offered by the eBGP protocol, is shown in Fig. 14 and depicted in Fig. 15. In order to change this selection, and to run the segment routing function, a route to the 2001:db8:0:0:1::/80 destination was configured on satellite 2001:db8::50:b3c/128, with the function to encapsulate IPv6 packets to that destination with the following segments [2001:db8::40:86b1, 2001:db8::40:8aa0, 2001:db8::30:20], each of which represented the SID of the device to which the traffic had to be delivered. The configuration of the encapsulating method, for IPv6 packets to a 2001:db8:0:0:1::/80 network, is shown in Fig. 16. A segment decapsulation instruction was also configured on each device, if the destination address matched the configured loopback address on the device. The method of decapsulating an SRv6 segment is shown in Fig. 17.
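As a rough illustration of the loopback-to-ASN mapping described above, the hexadecimal host part of a satellite's loopback address can be converted to a decimal offset within the private 32-bit range; the exact offset formula is our assumption, since the chapter does not spell it out.

```python
# Sketch: derive a private 32-bit ASN from a satellite's IPv6 loopback address
# by converting its hexadecimal host part to decimal and offsetting it into the
# private range 4,200,000,000-4,294,967,294. The exact mapping is an assumption.
import ipaddress

PRIVATE_ASN_BASE = 4_200_000_000
PRIVATE_ASN_MAX = 4_294_967_294

def loopback_to_asn(loopback: str, prefix: str = "2001:db8::/32") -> int:
    addr = ipaddress.IPv6Address(loopback)
    host_part = int(addr) - int(ipaddress.IPv6Network(prefix).network_address)
    asn = PRIVATE_ASN_BASE + host_part
    if asn > PRIVATE_ASN_MAX:
        raise ValueError("loopback does not fit in the private ASN pool")
    return asn

print(loopback_to_asn("2001:db8::50:b3c"))   # 4,200,000,000 + 0x500b3c
```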


Fig. 13 Implemented routing function

The packet flow path from satellite 2001:db8::50:b3c to destination 2001:db8:0:0:1::/80, for which the SRv6 function was configured, is shown in Fig. 18. A part of an output from packet capture, proving the functionality of decapsulation, configured within the satellites, is shown in Fig. 19. The output was taken from satellite 2001:db8::40:86b1, after the first segment on the list was stripped from the IPv6 packet, and now, the destination field was changed to the value of the next segment on the list, being 2001:db8::40:8aa0.


Fig. 14 Route selection using the eBGP protocol
Fig. 15 Depiction of route selection using the eBGP protocol

Fig. 16 Configuration of the encapsulating method for IPv6 packets

Fig. 17 Decapsulation of the SRv6 segment


Fig. 18 Packet flow path with SRv6 function configured

Fig. 19 Output packet capture with decapsulation

4 Conclusions The presented solution, based on a combination of the BGP protocol and SRv6 instructions, could be utilized in multiple IT network scenarios, used in IoT solutions or intranet networks, including satellite communication [20, 21]. The method proved how revolutionary and efficient SRv6 functions could be, compared to traditional


destination-based routing, with protocols such as open shortest path first (OSPF) or enhanced interior gateway routing protocol (EIGRP) [22]. In the described method, the routing decision was moved to the 3rd layer of the traditional OSI model [23], namely to the IPv6 protocol itself. This brings new possibilities and opportunities when designing complex systems and services. Inspiration for future work may be found in [24, 25], including space educational activities [26], as well as aspects related to mobile applications [27, 28]. The concept of using a large number of satellites, connected in such a wireless network, can take astronomical research to a completely new level. What is more, the idea of using FPGA systems with a synthesizable processor such as LEON3 opens a new chapter. The main question, namely how to interact with a device already placed in space so that engineers can improve the synthesized hardware logic or change how data from numerous measurement systems is interpreted by the processor and then transmitted to the data center, could finally be solved.

References 1. Velazco J (2020) An inter planetary network enabled by smallsats. In: 2020 IEEE aerospace conference. IEEE Press, Big Sky, pp 1–10 2. Di Mauro G, Lawn M, Bevilacqua R (2018) Survey on guidance navigation and control requirements for spacecraft formation-flying missions. J Guid Control Dyn 41(3):581–602 3. Davarian F, Babuscia A, Baker J, Hodges R, Landau D, Lau CW, Lay N, Kuroda V (2020) Improving small satellite communications in deep space—a review of the existing systems and technologies with recommendations for improvement. Part I: direct to earth links and smallsat telecommunications equipment. IEEE Aerosp Electron Syst Mag 35(7):8–25 4. Liu GP, Zhang S (2018) A survey on formation control of small satellites. Proc IEEE 106(3):440–457 5. Hippke M (2020) Interstellar communication network. I. overview and assumptions. Astron J 159(3):1–10 6. Kuai ZZ, Cao XL, Shen HX, Li HN (2020) Maneuver planning of geostationary satellites using mean orbital elements. In: 2020 Chinese automation congress. IEEE Press, Shanghai, pp 6596–6600 7. Sommer M, Yano H, Srama R (2020) Effects of neighbouring planets on the formation of resonant dust rings in the inner solar system. Astron Astrophys 635:1–19 8. IETF specification: RFC 791-1980—internet protocol (1981) 9. Ordabayeva GK, Othman M, Kirgizbayeva B, Iztaev ZD, Bayegizova A (2020) A systematic review of transition from IPv4 to IPv6. In: 6th International conference on engineering and MIS 2020. ACM Digital Library, Almaty, pp 1–15 10. Almagrabi AO, Al-Otaibi YD (2020) A survey of context-aware messaging-addressing for sustainable internet of things (IoT). Sustainability 12(10):1–26 11. Jia WK, Dong X (2019) Deploying lightweight anycast services based-on explicit multicast routing for evolved Internet. In: 11th International conference on ubiquitous and future networks. IEEE Press, Zagreb, pp 233–238 12. Abdullah SA (2019) SEUI-64, bits an IPv6 addressing strategy to mitigate reconnaissance attacks. Eng Sci Technol Int J 22(2):667–672 13. Tang J (2021) Research on IPv6 protocol transition mechanism. In: 6th International conference on intelligent computing and signal processing. IEEE Press, Xi’an, pp 702–705 14. IETF Tools. https://datatracker.ietf.org/


15. Karimi M, Jahanshahi A, Mazloumi A, Sabzi HZ (2019) Border gateway protocol anomaly detection using neural network. In: 2019 IEEE international conference on big data. IEEE Press, Los Angeles, pp 6092–6094 16. Halavachou Y, Yubian W (2019) Research on IPv4, IPv6 and IPv9 address representation. J Adv Netw Monit Control 4(2):48–61 17. Ventre PL, Salsano S, Polverini M, Cianfrani A, Abdelsalam A, Filsfils C, Camarillo P, Clad F (2020) Segment routing: a comprehensive survey of research activities, standardization efforts and implementation results. IEEE Commun Surv Tutor 23(1):182–221 18. Ibraimov MK, Tynymbayev ST, Park J, Zhexebay DM, Alimova MA (2021) Hardware implementation of the coding algorithm based on FPGA. IOP Conf Ser Mater Sci Eng 1047(1):1–5 19. Bansal R, Karmakar A (2017) Efficient integration of coprocessor in LEON3 processor pipeline for system-on-chip design. Microprocess Microsyst 51:56–75 20. Mugunthan SR (2020) Novel cluster rotating and routing strategy for software defined wireless sensor networks. J ISMAC 2(3):140–146 21. Smys S, Vijesh Joe C (2021) Metric routing protocol for detecting untrustworthy nodes for packet transmission. J Inf Technol Digital World 3(2):67–76 22. Habib MS, Shehu HA, Bello I (2018) Performance analysis of EIGRP and OSPF routing protocols for a client network. Int J Adv Acad Res Sci Technol Eng 4(7):46–56 23. Radhakrishnan R, Edmonson WW, Afghah F, Rodriguez-Osorio RM, Pinto F, Burleigh SC (2016) Survey of inter-satellite communication for small satellite systems: physical layer to network layer view. IEEE Commun Surv Tutor 18(4):2442–2473 24. Tripathy AK, Sarkar M, Sahoo JP, Li KC, Chinara S (eds) (2021) Advances in distributed computing and machine learning. In: Proceedings of ICADCML 2020. Springer, Singapore 25. Jacob IJ, Shanmugam SK, Piramuthu S, Falkowski-Gilski P (eds) (2021) Data intelligence and cognitive informatics. In: Proceedings of ICDICI 2020. Springer, Singapore 26. Łubniewski Z, Falkowski-Gilski P, Chodnicki M, Stepnowski A (2019) Three editions of interuniversity studies on space and satellite technology. Candidate and/vs graduate a case study. In: 3rd Symposium on space educational activities. University of Leicaster, Leicaster, pp 48–50 27. Weichbroth P (2020) Usability of mobile applications: a systematic literature study. IEEE Access 8:55563–55577 28. Weichbroth P, Łysik Ł (2020) Mobile security: threats and best practices. Mob Inf Syst 2020:828078

Chapter 40

Parameterization of Sequential Neural Networks for Predicting Air Pollution Farheen and Rajeev Kumar

1 Introduction Globally, air pollution is affecting several cities. There are multiple types of air pollutants; one of them is particulate matter (PM), a mixture of solid and liquid particles suspended in air. Micro-particles smaller than 10 µm are finer, can easily penetrate our organs, and pose health risks. They can affect the lungs, the respiratory system, and other internal organs, which can cause serious health issues [1, 2]. Hence, the prediction of fine particles (PM2.5) is important. Prediction can be made in many ways, including statistical methods, traditional machine learning methods, and deep learning methods [3–7]. Deep learning techniques have gained much attention in recent years. In particular, recurrent neural networks (RNNs) perform well with sequence data. RNN and its variants, such as LSTM, GRU, and bi-directional LSTM, are used for predictions and are expected to give reliable results. Previously, we used two benchmark datasets, reduced the data dimensions using correlation-based feature selection, and fed these reduced dimensions into an LSTM to predict PM2.5 levels [8]. This paper inputs the reduced dimensions into the most commonly used sequential neural networks, namely RNNs, LSTMs, and GRUs. In addition, the tuning of hyperparameters plays an imperative role in model performance. Hyperparameters exist in two forms in deep learning: model parameters and optimization parameters. The goal here is to adjust the model's parameters, such as the number of units in each layer. The number of hyperparameters to be used will vary according to the network architecture, and the characteristics of the problem and the data will also influence the value of each. The methods of optimizing the parameters can be divided into four categories: trial-and-error, grid, random,
Farheen (B) · R. Kumar Data to Knowledge (D2K) Lab, School of Computer and Systems Sciences, Jawaharlal Nehru University, New Delhi 110 067, India e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_40


and probabilistic. In this study, we use the most straightforward method, trial-and-error, to tune the parameters [9, 10]. We work with variable node sizes, from six nodes to ninety nodes, for each model. Using data of the past 24 h, we wish to predict PM2.5 concentrations for the next hour. Lastly, we compare each model's RMSE, MAE, and training time (TT) for different node sizes. As a result of the performance analysis, we show that LSTM and GRU perform better than RNN for most parameter sizes. In addition, RNNs take less time to train than the other models. The rest of the paper is organized as follows: Relevant literature review is discussed in Sect. 2. The proposed work is presented in Sect. 3. Experimental results are included in Sect. 4. Finally, the work is concluded in Sect. 5.

2 Preliminaries and Related Work We divide this section into two parts. We first briefly describe the basic idea of RNN and its variants and then review different air pollution prediction models.

2.1 Background Recurrent Neural Network (RNN): In recent years, the use of RNNs in time series forecasting has increased significantly. Hewamalage et al. [11] used an RNN for time series prediction. Unlike feed-forward neural networks, an RNN is well suited for sequence data: it takes an input sequence x_1, x_2, ..., x_T and produces an output sequence y_1, y_2, ..., y_T, where t ranges from 1 to T. The hidden state is one of the essential components of an RNN; it remembers sequence information, so we can say that RNNs have a memory that stores information about past data. Their main problem is that they cannot handle long-term dependencies: when a weight matrix is recurrently multiplied by the previous value, the gradient vanishes or explodes. Long short-term memory (LSTM) networks were introduced to resolve this issue. Long Short-Term Memory Network (LSTM): Hochreiter and Schmidhuber [12] presented the LSTM cell in 1997 to address long-term dependencies. Gates were introduced into the RNN cell to improve its memory. In the LSTM, each neuron is a memory cell, and these cells keep track of previous information. There are three gates in each cell: the input gate, the forget gate, and the output gate. The LSTM cell can add or omit information. The additional gates resolve the long-term dependency problem but increase the computing overhead: the structure of the LSTM becomes complex, which makes it computationally expensive. The GRU was introduced to simplify the LSTM.
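For reference, the standard textbook form of the recurrence that gives an RNN its memory (not a formula stated explicitly in this chapter) can be written as:

```latex
% Vanilla RNN recurrence; W_h, W_x, W_y, b, c are learned parameters.
h_t = \tanh\left(W_h\, h_{t-1} + W_x\, x_t + b\right), \qquad
y_t = W_y\, h_t + c
```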


Gated Recurrent Unit (GRU): Cho et al. [13] proposed the gated recurrent unit (GRU). The GRU cell merges the forget and input gates of the LSTM cell into a single update gate, which helps to decrease the number of parameters. GRU cells contain only two gates: an update gate and a reset gate.

2.2 Prediction of Air Pollution with Deep Learning Models Yi-Ting et al. [14] proposed an approach to find PM2.5 concentration using RNN with LSTM. Training data is accessed from the Environmental Protection Administration (EPA) of Taiwan from 2012 to 2016 and is combined into 20-dimensional data; the test data is used of the year 2017. At 66 locations around Taiwan, they conducted studies to evaluate predicting value of PM2.5 concentration for the next four hours. The results show that the proposed approach effectively forecasts the value of PM2.5 . They follow different data preprocessing steps, e.g., feature selection, data smoothing, data normalization. Due to instrument failure or other reasons that generated missing values issues, they used the arithmetic average of the data fields for each dimension to replace the missing values. For data normalization, they used min–max normalization to limit the values of different dimensions between 0 and 1. The data used by the LSTM cell must be sorted according to the time. So they arranged the 72-h data into a 72 ∗ 20 matrix and used the 73-h PM2.5 data as their forecast target. Artificial neural networks (ANN) and long short-term memory networks are compared in terms of RMSE and MAE. In their analysis, they found LSTMs to be more accurate. Athira et al. [15] predicted the particulate matter 10 (PM10 ) using LSTM, GRU, and RNN with the AirNet dataset, which contains data on air quality and climate. The data was collected from 1 April 2015 to 1 September 2017. Data from 2017 was considered for the test and the remaining data for training a model. The batch size, data dimension, and the number of epochs are 32, 6, and 100, respectively. They performed three trials for window size and learning rate to achieve satisfactory results. They found that results were optimized with four layers, so they set four layers for each network. Three different models were used to predict particulate matter 10. They compared the mean square error (MSE) across layers 1 to 4 of each model. The GRU layer 1 has the least mean square error compared to other proposed models. The authors compared the RMSE and MAPE of all three models and concluded that the results were comparable. Guan and Sinnott [16] analyzed data from a wide range of Web sources and compared prediction results with machine learning models such as linear regression, artificial neural networks, and long short-term memory. They compared the accuracy of these models and concluded that LSTM performs better than others. Rao et al. [17] proposed air quality prediction in Visakhapatnam with LSTMbased recurrent neural networks. The source of data is the central pollution control board. Data includes pollutants and meteorological conditions. Data gaps were filled with averages. Their study aimed to forecast the PM2.5 concentration for the next hour based on past data. They used the RNN-LSTM model to make predictions.


Fig. 1 Basic idea of the proposed methodology

The results were evaluated based on RMSE, MAE, and the coefficient of determination. The different models used for comparison include support vector regression (SVR) and variants of the baseline regression technique. As a result of the study, deep learning-based strategies were found to be more promising than conventional strategies in forecasting air quality.

3 The Proposed Methodology Figure 1 illustrates the basic idea behind our previous and current work. We converted the full feature set into the reduced feature set in our earlier work by applying correlation-based feature selection. The present study applied those reduced dimensions in different sequential models to predict PM2.5 concentration. We used multiple dimensions of the past 24 h to predict a single step of PM2.5 concentration. Deep learning models are used for prediction tasks such as RNN, LSTM, and GRU. The model consists of an input layer, a hidden layer, and a dense layer with unit one. We are parameterizing the model and comparing the results. We used different batches of nodes, including (6, 18, 30, 42, 54, 66, 78, 90). Models are evaluated based on the RMSE, MAE, and training time (TT).
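The chapter describes the architecture only at a high level (an input layer, recurrent layers with node sizes (l1, l2) as swept in Table 3, and a dense output layer with one unit). A minimal sketch of how such a parameterized model could be built, assuming a Keras implementation, a 24-step input window, and an Adam/MAE training setup (all assumptions on our part, not details given in the paper), is shown below.

```python
# Sketch: build a two-layer RNN/LSTM/GRU model whose hidden sizes (l1, l2)
# are swept over (6,4), (18,12), ..., (90,60) as in Table 3.
# Optimizer, loss, and epochs are illustrative assumptions.
from tensorflow.keras import layers, models

CELLS = {"RNN": layers.SimpleRNN, "LSTM": layers.LSTM, "GRU": layers.GRU}

def build_model(cell_name, l1, l2, n_features, window=24):
    cell = CELLS[cell_name]
    model = models.Sequential([
        layers.Input(shape=(window, n_features)),   # past 24 h of reduced features
        cell(l1, return_sequences=True),
        cell(l2),
        layers.Dense(1),                             # next-hour PM2.5
    ])
    model.compile(optimizer="adam", loss="mae", metrics=["mae"])
    return model

# Example sweep over the node sizes used in the study:
for l1, l2 in [(6, 4), (18, 12), (30, 20), (42, 28), (54, 36), (66, 44), (78, 52), (90, 60)]:
    model = build_model("GRU", l1, l2, n_features=6)
    # model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=...)
```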

4 Experimental Results 4.1 Data This paper focuses on two benchmark datasets from the UCI machine learning repository [18]. Beijing PM2.5 Dataset: This is an hourly dataset from 2010 to 2014. PM2.5 concentrations are from the US embassy, and meteorological data are from the capital international airport in Beijing. There are eight variables in this dataset: dew point, wind direction, temperature, snow, rain, pressure, wind speed, and PM2.5 concentration.

Table 1 Data description

Dataset | Total data | Training data | Validation data | Test data
Beijing PM2.5 dataset | 43,824 | 35,000 | 7000 | 18-10-2014 to 31-12-2014
Multi-site air quality data (Aotizhongxin and Changping station) | 35,065 | 28,000 | 6008 | 17-01-2017 to 28-02-2017

Table 2 Reduced dimensions

Dataset | Reduced feature set
Beijing PM2.5 data | PM2.5, wind direction, temperature, snow, rain, wind speed
Multi-site air quality data | PM2.5, PM10, SO2, NO2, CO, O3, temperature, rain, wind direction, wind speed

Multi-site Air Quality Dataset: The data was collected from 12 national air quality monitoring stations in Beijing, China. It is an hourly dataset from March 2013 to February 2017. The dataset contains twelve features: PM2.5 , PM10 , SO2 , NO2 , CO, O3 , temperature, pressure, dew point, rain, wind speed, wind direction. Table 1 shows the data description, and Table 2 shows the reduced feature set that we are using for our present work.
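As a generic illustration (not the authors' preprocessing code) of how the 24-hour input windows and next-hour PM2.5 targets described in Sect. 3 can be built from such an hourly dataset:

```python
# Sketch: turn an (n_hours, n_features) array of reduced features into
# (samples, 24, n_features) inputs and next-hour PM2.5 targets.
import numpy as np

def make_windows(data: np.ndarray, target_col: int = 0, window: int = 24):
    x, y = [], []
    for t in range(window, len(data)):
        x.append(data[t - window:t])       # past 24 hours of all features
        y.append(data[t, target_col])      # PM2.5 at the next hour
    return np.asarray(x), np.asarray(y)
```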

4.2 Prediction Results with Different Sequential Networks Table 3 shows the prediction results with different parameters for all three datasets. We use accuracy measures (MAE and RMSE) and training time (TT is in minutes) to check the performance. We also share the prediction plots with parameter size (30, 20) in Figs. 2, 3, and 4. The orange color denotes the original value in the plot, and the blue color represents the predicted results. Figure 5 shows the bar chart for training time, MAE, and RMSE. Here, we only share the bar chart for the Beijing PM2.5 dataset.

4.3 Discussion We trained the RNN and its variants with varying parameters. Our results were analyzed on two levels, intra and inter: the intra-analysis looks at variations within a single model, while the inter-analysis compares variations among the different models. The following are some points we have analyzed.


Table 3 Prediction results for different datasets (TT in minutes; each cell lists RNN / LSTM / GRU)

Nodes (l1, l2) | Metric | Beijing PM2.5 data | Multi-site air quality data, Aotizhongxin | Multi-site air quality data, Changping
6, 4 | TT | 1.11 / 2.58 / 2.92 | 1.57 / 1.77 / 2.69 | 1.29 / 4.83 / 2.69
6, 4 | MAE | 12.79 / 12.56 / 12.43 | 12.09 / 10.98 / 11.50 | 10.43 / 10.38 / 10.23
6, 4 | RMSE | 22.54 / 22.32 / 22.37 | 23.23 / 22.93 / 23.11 | 19.58 / 19.26 / 19.15
18, 12 | TT | 1.34 / 2.89 / 3.07 | 1.82 / 2.05 / 1.49 | 1.48 / 4.22 / 2.55
18, 12 | MAE | 12.94 / 12.59 / 12.61 | 13.60 / 11.14 / 11.15 | 10.61 / 10.38 / 10.40
18, 12 | RMSE | 22.60 / 22.35 / 22.38 | 24.34 / 22.71 / 23.24 | 19.44 / 19.19 / 19.06
30, 20 | TT | 1.07 / 3.27 / 4.01 | 1.67 / 2.30 / 1.24 | 1.12 / 4.34 / 2.70
30, 20 | MAE | 14.49 / 12.56 / 12.96 | 12.94 / 11.26 / 11.22 | 11.08 / 9.97 / 10.49
30, 20 | RMSE | 23.54 / 22.20 / 22.93 | 24.27 / 23.10 / 23.04 | 19.94 / 19.04 / 19.45
42, 28 | TT | 2.55 / 4.23 / 5.65 | 1.91 / 6.01 / 2.15 | 1.85 / 5.89 / 2.80
42, 28 | MAE | 13.21 / 13.31 / 12.66 | 11.84 / 11.08 / 10.91 | 10.70 / 11.40 / 10.39
42, 28 | RMSE | 22.65 / 22.91 / 22.36 | 23.03 / 23.13 / 23.01 | 19.27 / 19.61 / 19.17
54, 36 | TT | 2.24 / 3.83 / 3.68 | 2.79 / 3.97 / 3.64 | 2.74 / 3.55 / 3.56
54, 36 | MAE | 16.04 / 13.50 / 12.50 | 11.29 / 11.05 / 11.12 | 10.72 / 10.58 / 10.09
54, 36 | RMSE | 24.58 / 22.79 / 22.38 | 23.26 / 23.16 / 23.27 | 19.49 / 19.34 / 19.09
66, 44 | TT | 1.35 / 3.46 / 2.04 | 2.52 / 3.58 / 4.35 | 2.25 / 3.93 / 3.56
66, 44 | MAE | 13.61 / 12.88 / 13.03 | 13.88 / 11.03 / 11.18 | 14.19 / 10.76 / 11.12
66, 44 | RMSE | 23.19 / 22.29 / 22.54 | 24.48 / 23.93 / 23.41 | 21.91 / 19.28 / 19.40
78, 52 | TT | 2.15 / 4.50 / 3.15 | 3.25 / 3.61 / 2.81 | 2.27 / 4.86 / 4.34
78, 52 | MAE | 15.56 / 12.79 / 12.96 | 18.70 / 11.10 / 12.28 | 11.56 / 10.17 / 12.58
78, 52 | RMSE | 24.72 / 22.50 / 22.45 | 27.51 / 23.39 / 23.55 | 20.37 / 19.02 / 20.37
90, 60 | TT | 3.24 / 8.95 / 9.61 | 2.92 / 4.30 / 4.28 | 4.73 / 6.81 / 5.05
90, 60 | MAE | 13.22 / 12.84 / 12.89 | 14.94 / 11.10 / 10.96 | 10.71 / 10.27 / 10.66
90, 60 | RMSE | 22.56 / 22.27 / 22.64 | 24.81 / 23.35 / 23.06 | 19.85 / 19.29 / 19.86

Fig. 2 Prediction plots for Beijing PM2.5 dataset: (a) RNN, (b) LSTM, (c) GRU


Fig. 3 Prediction plots for multi-site air quality dataset (Aotizhongxin station): (a) RNN, (b) LSTM, (c) GRU

Fig. 4 Prediction plots for multi-site air quality dataset (Changping station): (a) RNN, (b) LSTM, (c) GRU

• In the Beijing PM2.5 dataset, RNN and GRU show good results with parameter size (6, 4) and LSTM with (30, 20).
• In the Aotizhongxin station data, RNN and GRU show good results with parameter size (42, 28) and LSTM with (6, 4).
• In the Changping station data, RNN shows good results with parameter size (6, 4), LSTM with (30, 20), and GRU with (54, 36).
• Comparatively, the RNN shows more fluctuation with changes in parameters.
• We compared the training times of all three models and found that RNNs take the least amount of time to train for these two datasets.
• We also analyzed that RNN with some parameter values gives lower accuracy than the other two models.
• LSTM and GRU take longer to train than RNN but provide better accuracy for most parameter values.


Fig. 5 Bar charts for Beijing PM2.5 dataset: (a) training time, (b) MAE, (c) RMSE

5 Conclusion In this study, we investigated the parameterization of the RNN and its variants. Using data from the past 24 h, we predicted PM2.5 concentrations for the next hour. We used two benchmark datasets, including pollutant concentrations and meteorological features. Our previous work had calculated correlation coefficients to reduce the data dimensions; we used those reduced dimensions for this study. To measure accuracy, we calculated RMSE and MAE, and we compared the accuracy metrics and training time of the three models. We found that the RNN takes less time to train for these two datasets, but RNNs with some parameter sizes show poor accuracy and more fluctuation with changing parameters compared to the other models. We also found that LSTM and GRU take longer to train than RNN, though they provide superior accuracy with most parameter values. In future, we wish to repeat these experiments with air pollution data captured in real time and with other variants of RNNs. The experiments could also be performed using convolution layers.


References 1. World Health Organization (2005) WHO air quality guidelines for particulate matter, ozone, nitrogen dioxide and sulphur dioxide. Global update 2. Zhang R, Wang G, Guo S, Zamora ML, Ying Q, Lin Y, Wang W, Hu M, Wang Y (2015) Formation of urban fine particulate matter. Chem Rev 115(10):3803–3855 3. Ariyo AA, Adewumi AO, Ayo CK (2014) Stock price prediction using the ARIMA. In: Proceedings of the UKSim-AMSS 16th international conference on computer modelling and simulation, pp 106–112 4. Sharma N, Taneja S, Sagar V, Bhatt A (2018) Forecasting air pollution load in Delhi using data analysis tools. Proc Comput Sci 132:1077–1085 5. Li X, Peng L, Yao X, Cui S, Yuan H, You C, Chi T (2017) Long short-term memory neural network for air pollutant concentration predictions: method development and evaluation. Environ Pollut 231:997–1004 6. Chang Y-S, Chiao H-T, Abimannan S, Huang Y-P, Tsai Y-T, Lin K-M (2020) An LSTM-based aggregated model for air pollution forecasting. Atmos Pollut Res, pp 1451–1463 7. Zhao R, Gu X, Xue B, Zhang J, Ren W (2018) Short period PM2.5 prediction based on multivariate linear regression model. PLoS ONE 13(7) 8. Farheen, Kumar R, Correlated features in air pollution prediction. In: Proceedings of the international conference on artificial intelligence: advances and applications 9. Torres JF, Hadjout D, Sebaa A, Martínez-Álvarez F, Troncoso A (2021) Deep learning for time series forecasting: a survey. Big Data 9:3–21 10. Mukhriya A, Kumar R (2021) Building outlier detection ensembles by selective parameterization of heterogeneous methods. Pattern Recogn Lett 146:126–133 11. Hewamalage H, Bergmeir C, Bandara K (2021) Recurrent neural networks for time series forecasting: current status and future directions. Int J Forecast 37:388–427 12. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9:1735–80 13. Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1724–1734 14. Tsai Y-T, Zeng Y-R, Chang Y-S (2018) Air pollution forecasting using RNN with LSTM. In: Proceedings of the IEEE 16th international conference on dependable, autonomic and secure computing, 16th international conference on pervasive intelligence and computing, 4th international conference on big data intelligence, computing, cyber science and technology congress (DASC/PiCom/DataCom/CyberSciTech), pp 1074–1079 15. Athira V, Geetha P, Vinayakumar R, Soman KP (2018) DeepAirNet: applying recurrent networks for air quality prediction. In: Proceedings of the international conference on computational intelligence and data science, pp 1394–1403 16. Sinnott RO, Guan Z (2018) Prediction of air pollution through machine learning approaches on the cloud. In: Proceedings of the IEEE/ACM 5th international conference on big data computing applications and technologies (BDCAT), pp 51–60 17. Rao K, Devi G, Ramesh N (2019) Air quality prediction in visakhapatnam with LSTM based recurrent neural networks. Int J Intell Syst Appl 11:18–24 18. Dua D, Graff C (2017) UCI machine learning repository

Chapter 41

Customer Analytics Research: Utilizing Unsupervised Machine Learning Techniques Anuj Kinge, P. B. Hrithik, Yash Oswal, and Nilima Kulkarni

1 Introduction In the 21st-century post-pandemic era, we define retail analytics as the study of data recorded by retail businesses with the purpose of making profitable business choices. The pandemic has made things difficult for small retail stores, many of which are struggling to keep their businesses up and running. They have been hit not just by shop closures due to the pandemic but also by the need to understand customer behavior and adjust to ever-changing needs once the shops were allowed to reopen. The way consumers buy products has changed: they now have many options to choose from, not only in terms of products but also in the channels available to them, while big retail giants capture the market with retail tactics built on studying vast amounts of data about customer decision-making and purchasing cycles. Certainly, this impact served as a wake-up call for many shops, prompting them to see the value of data analytics and artificial intelligence [1]. Many studies have shown that machine learning technology is particularly effective in such situations and may be utilized by learning from past data [2]. There are mainly three customer analytics techniques that can foster small retail businesses, namely, customer churn, customer segmentation, and market basket analysis. Customer churn is the process where customers undergo attrition after having opted for services in the past, which ultimately affects the business. Customers do not abruptly cease buying from a shop; instead, they gradually switch to the shop's competitors [3]. It is typically preferable for businesses to maintain their present customers rather than seek new ones. Following the identification of
A. Kinge (B) · P. B. Hrithik · Y. Oswal · N. Kulkarni Department of Computer Science and Engineering, MIT School of Engineering, MIT Arts Design and Technology University, Pune, India e-mail: [email protected]
N. Kulkarni e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_41


these customers, the businesses may provide services based on the reasons, uses, or criteria that they value the most. Segmentation is the process of breaking down markets into smaller groups (segments) of people who have similar demands for a certain product or service [4]. In simple words, Customer segmentation is the process of determining how to associate with consumers in different categories in order to maximize each customer’s value to the company. Market Basket Analysis is the process of understanding customer purchase patterns based on their history, which data mining achieves. This is often accomplished by identifying itemsets in a transaction and developing rules with a high degree of correlation [5]. Businesses must adopt a strategy that will help them grow their enterprises by developing marketing tactics to attract new customers. Figure 1 shows the ideal approach that can be used in the Customer Analytics domain. Initially, the data would be stored in the corresponding database, which is taken into consideration during the Data Collection Phase. The collected data is enriched, cleansed, and prepared for Analysis during the Preprocessing phase. Various parameters are scrutinized, and the best parameters are selected in the Analysis phase. Customer churn, Customer Segmentation, and Market Basket Analysis have been achieved through Machine learning and Data Mining techniques. Finally, the results are displayed in the form of reports and dashboards through which a vendor can take necessary decisions to grow business, improve customer relationships, and manage stocks as per needs. The following is a description of the layout of the paper: The first segment gives an analysis of customer analytics methodologies, while the second segment conducts a literature study on Customer Churn, Customer Segmentation, and Market Basket Analysis. The working procedure is discussed in the third segment. The experimental results and analysis are addressed in the fourth part. The conclusion of the research is provided in the final segment.

2 Literature Survey An in-depth study is conducted to learn about the most prevalent methods used for Clustering and Association Rule Mining which are mentioned in Table 1.

3 Proposed System Recency, Frequency, and monetary (RFM) analysis is a way of determining a company’s most important customers by measuring and analyzing spending behavior. RFM analysis takes sales data into consideration and calculates Recency, Frequency, and Monetary values for every customer. RFM then assigns data points on the hyperplane, with Recency, Frequency, and Monetary as the three respective axes, to all the RFM scores generated. In short, RFM analysis uses sales data to calculate RFM


Fig. 1 Customer analytics approach


Table 1 Literature survey (each entry lists the author name, followed by the method and the observation)

Xia and He [6]

The recency-frequency-monetary (RFM) is used to classify the customers as per the criteria of recency, frequency, and monetary. To handle non-contract customer churn, the Schmittlien Morrison and Colombo (SMC) model is used to anticipate consumers’ future activity

When the ANN and SVM models were utilized separately, it was clear that the ANN model had a higher correction rate and a lower error rate than the latter. The empirical data show that the combined predictive model enhances accuracy, with a correction rate of 93.01% and an error rate of 6.99%

Wu et al. [7]

Recency-frequency-monetary (RFM) is utilized on sales data along with K-means clustering for segmenting customers into different groups. Four different segments are formed based on purchasing habits

Using this approach normalization of the RFM model and index weight analysis, the total purchase volume and total consumption amount increased by 279% and 101.97%, respectively

Sharma et al. [8]

K-means (Center-Based), hierarchical (Connectivity), and DBSCAN (Density-Based) clustering algorithms were used to investigate the usefulness of customer segmentation as a basic customer relationship management feature (CRM) for segmenting customers using bunching processes

This research shows three alternative methods for choosing a clustering algorithm in various contexts. Although K-means outperformed the other clustering algorithms, they all have drawbacks that make them unsuitable when used individually

Raorane et al. [9]

Association rule mining is used to achieve market basket analysis along with its metrics such as support and confidence. A support of 20% and confidence of 60% are considered, which yields eight association rules for output

The method proposed in this work is limited to support = 20% and confidence = 60%. As total transactions were only 50, these values could serve the purpose of generating association rules. Only a single ARM algorithm is used, which undermines the efficiency

Kaur and Singh [10]

The frequent pattern growth technique is used in this paper to mine the repeating patterns in huge datasets extensively. Support, confidence, and lift were considered as metrics for pruning

Minimum confidence = 100%, minimum support = 52.70% and lift = 1.897 was observed. The ARM model yielded 28 association rules (continued)

41 Customer Analytics Research …

505

Table 1 (continued) Author name

Method

Observation

Venkatachari and Chandrasekaran [11]

This work uses market basket transactions, which may be used in big datasets to uncover interesting relationships. When employing association rule mining to detect the frequency of item sets, a binary representation format is used. For association rules, frequent pattern (FP) growth and apriori algorithms are utilized on rapid miner

FP growth takes longer than apriori for a large number of transactions, according to the research. More time is necessary for apriori to create frequent itemsets using rapid miner

Krishnan et al. [12]

This work analyzes the comparative study of the apriori and ECLAT algorithm using the cross-industry standard process. ARM is used on both apriori and ECLAT, followed by comparing the number of rules formed and the time taken

It was observed that apriori took 0.4 s to generate 11 rules, whereas ECLAT took only half of what apriori took to generate nine rules. The ECLAT approach outperforms the apriori algorithm in terms of scalability

Kottursamy [13]

This work presents a comparative study of Naive Bayes, support vector machine, decision tree, random forest, and convolutional neural network in sentiment analysis. Moreover, it provides a novel approach that uses CNN with expression net to reduce generalization errors

The proposed CNN method yielded 96.12% accuracy, 98% classification success rate, and 95.19% improved efficiency rate. Having worked on a small dataset, the authors explicitly mention the quantization method is yet to be incorporated into their work

Karuppusamy [14]

In this study, an artificial recurrent neural network architecture and a long short term memory mechanism are engaged to estimate consumer consumption behavior utilizing product consumption information based on consumer age and gender

LSTM, compared to other models like KNN, ELM, and BPNN, was more efficient for all the cases when the number of epochs varied from 10 to 50. Because of the prefix scan method, it took 0.198 s for 50 epochs with 35% memory utilization, while other models doubled the memory

In short, RFM analysis uses sales data to calculate RFM scores for every customer, where Recency refers to how recently a customer has made a purchase, Frequency refers to how often the customer makes purchases, and Monetary refers to the amount of money a customer has spent to date. With clusters, we can bring customers with similar interests together in a group. Thus, understanding the patterns of different customers with reference to similar customers in the same group (cluster) becomes easier. RFM analysis without clustering does not give us a clear idea of the number of segments into which our data should be divided, whereas clustering algorithms take the output of RFM analysis and cluster the data into different groups by decreasing the sum of distances between points and centroids, and suggest the optimal number of clusters/segments based on the Silhouette, Davies-Bouldin, and Calinski-Harabasz scores.


Agglomerative Hierarchical Clustering is the most popular type of clustering used to group objects together based on their similarity. It is also known as a bottom-up clustering algorithm as it starts at the bottom by merging similar objects and moves upwards. DBSCAN is most often used when one has no predefined number of clusters to be allotted to the data. With only a few parameters, namely the epsilon radius and the minimum number of nearest neighbors, DBSCAN computes clusters from the nearest data points. The epsilon radius is the local radius used to expand the cluster to the specified distance, whereas the minimum number of nearest neighbors is the number of data points that must fall within the given epsilon radius. If the circle surrounding a data point contains the specified minimum number of points, it is called a core point. The point is categorized as a border point if the number of points is less than the specified minimum. Points that satisfy neither criterion are left behind as outliers. DBSCAN obtains the optimum value of epsilon using the knee method on the KNN distance plot. A general rule of thumb for choosing the minimum points is a value greater than or equal to dimensionality + 1. In our case, since we have a three-dimensional feature space, we consider the minimum points to be at least 4. With minimum points set to 4, we obtain 3 clusters, whereas setting the value of minimum points to 5 or above leads to only 2 clusters, which is not ideal, as customers would be classified into loyal and disloyal alone. Since DBSCAN depends solely on two parameters—epsilon radius and the minimum number of nearest neighbors—tweaking these values varies the densities of the clusters. The GMM clustering algorithm is also known as a soft clustering algorithm as it decides the clusters based on probability densities. Suppose x is a random data point, Pax is the probability that x belongs to cluster A, and Pbx is the probability that x belongs to cluster B. If Pax is 0.8 and Pbx is 0.2, x is assigned to cluster A. This is how soft clustering uses probability densities to form clusters. It uses the Expectation-Maximization algorithm and iterates until a stable cluster configuration is formed. K-means clustering is a technique that aims to segment the dataset into a number of clusters with the help of the within-cluster sum of squares (WCSS). Figure 2 shows the whole process that goes on behind the scenes when RFM analysis and cluster analysis are performed on the dataset. Association Rule Mining is an unsupervised learning method that mainly deals with mining patterns out of data based on their associativity. It has mainly three metrics: support, confidence, and lift. A general association rule can be given by A (antecedent) ⇒ B (consequent), where A and B are items of an itemset.


Fig. 2 RFM and cluster analysis

The support of A ⇒ B is defined as:

Support(A ⇒ B) = (A ∪ B).count / n  (1)

where n is the total number of transactions. The confidence of A ⇒ B is defined as:

Confidence(A ⇒ B) = (A ∪ B).count / A.count  (2)

The lift of A ⇒ B is defined as:

Lift(A ⇒ B) = Support(A ∪ B) / (Support(A) × Support(B))  (3)

The Apriori algorithm is the very first Association Rule Mining algorithm. Confidence and support are two of the most important metrics used by Apriori for filtering rules. It is based on the breadth-first search (BFS) approach, which is the reason why it is computationally inefficient. Eclat is a vertical-layout algorithm that finds the elements through depth-first search (DFS). It is similar to the Apriori algorithm except that it converts horizontal data into vertical data. Unlike Apriori, Eclat does not make use of the confidence and lift metrics; the only metric on which it depends is support. Eclat is more computationally efficient than Apriori as it scans the dataset only once. Frequent Pattern Growth is an ARM algorithm that finds frequent patterns without generating candidates, unlike Apriori. It makes use of a tree data structure to mine frequent patterns. The FP Growth algorithm depicts the database as an FP tree that not only stores the itemsets in the database but also keeps track of the relationships between itemsets. It is built by mapping each itemset to a different path in the tree one at a time; that is, the database is fragmented one item at a time. The itemsets of these fragments are then examined, and the search for common itemsets is significantly reduced in the case of huge datasets.


Fig. 3 Optimal ARM algorithm selection

Fig. 4 Comparison of clustered and non-clustered approach

FP Growth is more efficient than Apriori as the dataset is scanned only twice. Figure 3 shows the process of selecting the optimal ARM algorithm based on performance. Apriori, Eclat, and FP Growth can also be applied separately on the individual clusters formed by K-means, Agglomerative Hierarchical, DBSCAN, and Gaussian Mixture Models; the ARM models could become more efficient by doing so. This approach has two main steps: clustering the data with the best-performing clustering algorithm, and then applying Apriori, Eclat, and FP Growth on each cluster. Figure 4 shows the process of this approach.
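A minimal sketch of the frequent-itemset mining step is shown below, using the mlxtend library and the support/confidence thresholds chosen later in the paper (minimum support 0.4%, minimum confidence 35%); the file and column names are illustrative assumptions, and Eclat would require a separate implementation since mlxtend only ships Apriori and FP Growth.

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules

# Hypothetical transaction export with one row per purchased item
# (columns assumed: InvoiceNo, ProductName).
sales = pd.read_csv("transactions.csv")

# One-hot encode transactions: rows = invoices, columns = products.
basket = (sales.groupby(["InvoiceNo", "ProductName"]).size()
               .unstack(fill_value=0)
               .astype(bool))

# Frequent itemsets with the Apriori and FP Growth algorithms.
freq_apriori = apriori(basket, min_support=0.004, use_colnames=True)
freq_fpgrowth = fpgrowth(basket, min_support=0.004, use_colnames=True)

# Association rules filtered by minimum confidence = 35%.
rules = association_rules(freq_fpgrowth, metric="confidence", min_threshold=0.35)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]].head())
```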

4 Experimental Results

4.1 Dataset

For this research work, the dataset used is obtained from a local retail shop named 'GURU BASVA ENTERPRISES' based in Pune, India. The original data set has 18,414 records and eight features, namely, Date, Invoice number, Customer ID, Product ID, Product Name, Quantity, Maximum Retail Price, and Total Amount.


The data was recorded for a period of 3 months ranging from November 2021 to January 2022. Upon basic analysis, we found that there were a total of 5983 transactions and 974 unique customers.
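As an illustration of how the Recency, Frequency, and Monetary values described in Sect. 3 can be derived from a transaction table like the one above, the pandas sketch below groups the sales records by customer; the column names follow the dataset description, while the file name and snapshot date are assumptions.

```python
import pandas as pd

# Assumed export of the retail dataset described above.
df = pd.read_csv("guru_basva_sales.csv", parse_dates=["Date"])

# Reference date for Recency: one day after the last recorded transaction.
snapshot = df["Date"].max() + pd.Timedelta(days=1)

rfm = df.groupby("Customer ID").agg(
    Recency=("Date", lambda d: (snapshot - d.max()).days),   # days since last purchase
    Frequency=("Invoice number", "nunique"),                  # number of distinct transactions
    Monetary=("Total Amount", "sum"),                         # total money spent to date
)
print(rfm.head())
```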

4.2 Evaluation Metrics

The Silhouette score, Davies-Bouldin score, and Calinski-Harabasz score are used to determine the best possible number of clusters. By doing so, we can choose the most efficient clustering algorithm for our dataset. The Silhouette score is computed for each sample using the mean intra-cluster distance and the mean nearest-cluster distance. Let a(i) be the mean distance between point i and all other points in its own cluster, and let C be the number of clusters. Similarly, let b(i) be the smallest mean distance from point i to all points of any cluster of which i is not a part [15]. Finally, s(i) is the silhouette value if C is greater than one; otherwise s(i) is equal to zero.

s(i) = (b(i) − a(i)) / max{a(i), b(i)}  (4)

The Davies-Bouldin score is the average similarity measure of each cluster to its most comparable cluster, where similarity is defined as the ratio of within-cluster to between-cluster distances. Let N be the number of clusters and D_i the similarity of the ith cluster to its closest cluster [16]. The DB score is given by:

DB = (1/N) Σ_{i=1}^{N} D_i  (5)

The Calinski-Harabasz score is defined as the ratio of between-cluster dispersion to within-cluster dispersion. Let N be the number of points in the sample, K the number of groups formed, B_K the between-group variance, and W_k the intra-group variance [17]. The CH score is given by:

S_CH = ((N − K) B_K) / ((K − 1) Σ_{k=1}^{K} W_k)  (6)
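A compact sketch of how these three scores can be computed for the clustering algorithms compared in the next subsection is given below, using scikit-learn; the RFM feature matrix X is assumed to be the (scaled) Recency/Frequency/Monetary table produced earlier, and the DBSCAN epsilon value is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

def evaluate(X, labels):
    """Return the three cluster-validity scores for one labelling."""
    return (silhouette_score(X, labels),
            davies_bouldin_score(X, labels),
            calinski_harabasz_score(X, labels))

X = np.loadtxt("rfm_scaled.csv", delimiter=",")   # assumed 3-column RFM matrix

for k in range(3, 7):
    results = {
        "Agglomerative": AgglomerativeClustering(n_clusters=k).fit_predict(X),
        "GMM": GaussianMixture(n_components=k, random_state=0).fit_predict(X),
        "K-means": KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X),
    }
    for name, labels in results.items():
        print(k, name, evaluate(X, labels))

# DBSCAN chooses its own number of clusters from eps and min_samples
# (noise points labelled -1 are treated as one extra group in this sketch).
db_labels = DBSCAN(eps=0.5, min_samples=4).fit_predict(X)   # eps value is an assumption
print("DBSCAN", evaluate(X, db_labels))
```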


4.3 Performance Analysis

Table 2 shows all the Silhouette scores, Table 3 shows all the Davies-Bouldin scores, and Table 4 shows all the Calinski-Harabasz scores for different numbers of clusters on four distinct clustering algorithms. The Silhouette score ranges from −1 to 1, where 1 is the best or ideal case, indicating that the cluster configuration is good and the objects can be clearly distinguished within their clusters. A lower score means that the clusters are poorly organized, with possibly too many or too few clusters, making it much more difficult to identify which cluster an object belongs to.

Table 2 Silhouette scores

No. of clusters | Agglomerative hierarchical clustering | DBSCAN clustering | GMM clustering | K-means clustering
3 | 0.31932 | 0.29679 | 0.24135 | 0.39527
4 | 0.31508 | 0.29679 | 0.22744 | 0.35580
5 | 0.30762 | 0.29679 | 0.22698 | 0.33240
6 | 0.29222 | 0.29679 | 0.21635 | 0.33704

Bold indicates the optimal number of clusters and clustering algorithm

Table 3 Davies-Bouldin scores

No. of clusters | Agglomerative hierarchical clustering | DBSCAN clustering | GMM clustering | K-means clustering
3 | 0.87481 | 0.99586 | 1.22654 | 0.89880
4 | 0.94345 | 0.99586 | 1.10548 | 0.93743
5 | 0.90840 | 0.99586 | 1.28772 | 0.98665
6 | 1.03874 | 0.99586 | 1.07857 | 0.96012

Bold indicates the optimal number of clusters and clustering algorithm

Table 4 Calinski-Harabasz scores

No. of clusters | Agglomerative hierarchical clustering | DBSCAN clustering | GMM clustering | K-means clustering
3 | 581.29561 | 173.95721 | 391.51492 | 754.46723
4 | 612.50462 | 173.95721 | 421.87422 | 734.62554
5 | 575.55638 | 173.95721 | 412.02633 | 675.82484
6 | 558.15064 | 173.95721 | 458.94193 | 668.77665

Bold indicates the optimal number of clusters and clustering algorithm


In general, a silhouette value near one implies the most optimal selection. The general trend observed in Agglomerative Hierarchical, GMM, and K-means is a gradual decrease in silhouette value as the number of clusters increases, whereas in DBSCAN the silhouette score is constant, as its clusters cannot be changed manually and remain the same because of its predetermined cluster formation. In the Davies-Bouldin table, the score nearer to 0 is preferred, as it indicates that the clusters formed are distinct and unique. There is no specific pattern observed here as the number of clusters increases, but Agglomerative Hierarchical clustering and K-means clustering seem to be good choices for either three or five clusters, considering that DBSCAN gives its score on predetermined clusters. The Calinski-Harabasz score ranges from 0 to infinity, so the higher the value, the better the distribution of clusters. From the results, it can be observed that DBSCAN proves to be the worst of all, while K-means clustering performs the best for the given dataset. For a comparative analysis of clustering algorithms, we have considered values of k (number of clusters) ranging from 3 to 6. The idea behind not selecting k = 2 is that it makes little sense to segment the data into only two groups, that is, loyal and disloyal customers, whereas having more than two clusters helps us understand multiple loyalty levels of customers instead of constraining them to two groups. As observed in Table 2, the K-means algorithm with 3 clusters has the highest Silhouette score. As observed in Table 3, the Agglomerative Hierarchical algorithm with 3 clusters has the lowest Davies-Bouldin score. As observed in Table 4, the K-means algorithm with 3 clusters has the highest Calinski-Harabasz score. Figure 5 shows RFM analysis with Agglomerative Hierarchical and K-means clustering. According to Tables 2, 3, and 4, it is quite evident that when k = 3, the results obtained are optimal. Between the Agglomerative Hierarchical and K-means clustering algorithms, the former has time complexity O(n³) while the latter has time complexity O(n²) [18, 19]. Hence, we select the K-means algorithm for RFM analysis. Further, we have used ARM algorithms such as Apriori, Eclat, and FP Growth. Before implementing these algorithms, it is essential to fix the values of Support and Confidence. The first graph of Fig. 6 shows the number of association rules produced for different Support and Confidence values. The green curve performs well compared to the other curves, and hence we set Minimum Confidence = 35% and Minimum Support = 0.4%. The second graph of Fig. 6 shows that a triangle-like structure is formed for the selected Minimum Confidence and Minimum Support, which confirms that the selected values are appropriate. Figure 7 shows the behavior of the Apriori, Eclat, and FP Growth algorithms based on their performance. From the above analysis, we can conclude that FP Growth outperforms Apriori and Eclat in terms of time complexity for different Minimum Support values.


Fig. 5 RFM analysis with clustering

Fig. 6 Deciding the values of support and confidence

Apriori, Eclat, and FP Growth took 1046.20194 ms, 275.23636 ms, and 78.78923 ms, respectively. Hence, we can conclude that FP Growth is largely insensitive to the Minimum Support value, unlike Apriori and Eclat. After clustering the data through RFM analysis with the clustering algorithms above, among which K-means performed best, we applied K-means clustering before each of the three ARM algorithms to examine whether efficiency increases. Figure 8 shows the time taken by the ARM algorithms to generate rules with this segmentation approach. From the analysis, it is quite evident that the efficiency of Eclat and FP Growth decreased upon clustering with K-means. On the other hand, the Apriori algorithm took less time upon clustering with K-means, thereby increasing its efficiency.
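The segmentation approach in Fig. 8 can be sketched as follows: K-means labels computed on the RFM table are used to split the transactions, and an ARM algorithm is run on each segment separately. The inputs rfm and basket, the join on a customer column, and the timing code are assumptions, and FP Growth stands in for all three algorithms here.

```python
import time
from sklearn.cluster import KMeans
from mlxtend.frequent_patterns import fpgrowth

# Assumed inputs: rfm (index = Customer ID, columns = Recency/Frequency/Monetary, scaled)
# and basket (one-hot transaction matrix with an extra "Customer ID" column).
rfm["Segment"] = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(rfm)

for segment, customers in rfm.groupby("Segment"):
    sub = basket[basket["Customer ID"].isin(customers.index)].drop(columns="Customer ID")
    start = time.perf_counter()
    itemsets = fpgrowth(sub.astype(bool), min_support=0.004, use_colnames=True)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"segment {segment}: {len(itemsets)} frequent itemsets in {elapsed_ms:.2f} ms")
```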


Fig. 7 Apriori, ECLAT, and FP growth performance

Fig. 8 Performance of ARM algorithms with segmentation approach


5 Conclusion

Many small-scale retail businesses are currently not able to utilize the advanced technologies used by large-scale companies. Hence, small stores are limited to conventional methods that do little to improve their business. This research has utilized K-means, DBSCAN, GMM, and Agglomerative clustering to perform RFM analysis, followed by ARM algorithms. According to the analysis, K-means outperformed the other algorithms and had an optimal number of clusters equal to five with 0.33240, 0.98665, and 675.82484 as Silhouette, Davies-Bouldin, and Calinski-Harabasz scores, respectively. When it came to time complexity, FP Growth completely outperformed Apriori and Eclat. Upon using the segmentation approach along with association rule mining, Apriori's performance improved, whereas Eclat and FP Growth performance worsened. Using the proposed approach, a significant impact can be made in small-scale retail businesses, helping them accelerate their growth.

References 1. Adulyasak Y, Cohen MC, Khern-am-nuai W, Krause M (2022) Retail analytics in the new normal. Available at SSRN: https://ssrn.com/abstract=4007401 or https://doi.org/10.2139/ssrn. 4007401 2. Umayaparvathi V, Iyakutti K (2016) A survey on customer churn prediction in telecom industry: datasets, methods and metric. Int Res J Eng Technol 3(4):1065–1070; Foster I, Kesselman C (1999) The grid: blueprint for a new computing infrastructure. Morgan Kaufmann, San Francisco 3. Buckinx W, den Poel DV (2005) Customer base analysis: partial defection of behaviourally loyal clients in a non-contractual FMCG retail setting. Eur J Oper Res 252–268; Foster I, Kesselman C, Nick J, Tuecke S (2002) The physiology of the grid: an open grid services architecture for distributed systems integration. Technical report, Global Grid Forum 4. Anifa M, Jeyanthi M, Hack-Polay D, Mahmoud AB, Grigoriou N Segmenting the retail customers 5. Wu X, Kumar V, Quilan JR, Ghosh J, Yang Q, Motoda H (2007) Top 10 algorithms in data mining. 14:1–37. Springer-Verlay London Limited 6. Xia G, He Q (2018) The research of online shopping customer churn prediction based on integrated learning. In: Proceedings of the 2018 international conference on mechanical, electronic, control and automation engineering 7. Wu J, Shi L, Lin WP, Tsai SB, Li Y, Yang L, Xu G (2020) An empirical study on customer segmentation by purchase behaviours using an RFM model and K-means algorithm. Math Probl Eng 2020 8. Sharma BTSS, Ahmad BTSK, Singh BTSV (2021) Clustering approaches to offer business insights 9. Raorane AA, Kulkarni RV, Jitkar BD (2012) Association rule–extracting knowledge using market basket analysis. Res J Recent Sci 2502. ISSN: 2277 10. Kaur H, Singh K (2013) Market basket analysis of sports store using association rules. Int J Recent Trends Electr Electr Eng 3(1):81–85 11. Venkatachari K, Chandrasekaran ID (2016) Market basket analysis using fp growth and apriori algorithm: a case study of mumbai retail store. BVIMSR’s J Manag Res 8(1):56


12. Krishnan MS, Nair AS, Sebastian J (2022) Comparative analysis of apriori and ECLAT algorithm for frequent itemset data mining. In: Ubiquitous intelligent systems. Springer, Singapore, pp 489–497 13. Kottursamy K (2021) A review on finding efficient approach to detect customer emotion analysis using deep learning analysis. J Trends Comput Sci Smart Technol 3(2):95–113 14. Karuppusamy DP (2020) Artificial recurrent neural network architecture in customer consumption prediction for business development. J Artif Intell Capsule Netw 2(2):111–120 15. Shahapure KR, Nicholas C (2020) Cluster quality analysis using silhouette score. In: 2020 IEEE 7th international conference on data science and advanced analytics (DSAA), pp 747–748. https://doi.org/10.1109/DSAA49011.2020.00096 16. Sitompul B, Sitompul O, Sihombing P (2019) Enhancement clustering evaluation result of Davies-Bouldin index with determining initial centroid of k-means algorithm. J Phys Conf Ser 1235:012015. https://doi.org/10.1088/1742-6596/1235/1/012015 17. Cengizler C, Ün M (2017) Evaluation of Calinski-Harabasz criterion as fitness measure for genetic algorithm based segmentation of cervical cell nuclei. Brit J Math Comput Sci 22:1–13. https://doi.org/10.9734/BJMCS/2017/33729 18. Whittingham H, Ashenden SK (2021) Chapter 5—Hit discovery, the era of artificial intelligence, machine learning, and data science in the pharmaceutical industry. Academic Press, pp 81–102. https://doi.org/10.1016/B978-0-12-820045-2.00006-4. ISBN: 9780128200452 19. Pakhira MK (2014) A linear time-complexity k-means algorithm using cluster shifting. In: 2014 International conference on computational intelligence and communication networks, pp 1047–1051. https://doi.org/10.1109/CICN.2014.220

Chapter 42

Multi-class IoT Botnet Attack Classification and Evaluation Using Various Classifiers and Validation Techniques

S. Chinchu Krishna and Varghese Paul

1 Introduction

IoT is an integrated computing environment that is capable of transferring data over a network [1]. It makes the global system connected [2]. IoT environments [3] usually consist of diverse devices with limited security mechanisms, which increases the vulnerabilities of such systems. An IoT botnet is a cluster of hacked, interconnected devices that collaborate for security-breaching purposes. The proliferation of IoT devices and the pitfalls of their security features have drawn malicious users' attention to DDoS attacks through many IBA [3–5]. BASHLITE and MIRAI are the two variants of IoT botnet [6]. IoT botnets are prevalent because IoT devices usually run a simple version of Linux with minimal security features, so malware can be compiled easily. In this paper, the proposed work evaluates machine learning classifiers for multi-class classification of attacks using cross-validation techniques. When IoT devices are compromised into a botnet, we can observe deviations from expected behavior, so supervised machine learning is formulated to classify abnormal behavior into multiple classes. This helps in the identification and classification of attacks. The training dataset is used to train the model, but if the number of training cases is small, model prediction runs into problems: a smaller training dataset degrades the performance of model prediction. A simple but popular solution is to use cross-validation (CV). Here we propose a comparison of CV approaches in the training phase of the classifier.


The contributions of this research work are summarized as:

1. Proposes an exhaustive identification and classification of IBA. The IBA is classified into multiple classes of MIRAI and BASHLITE attacks. In the related work, several experimental studies have been summarized, and most of them concentrate only on detecting the occurrence of IBA, i.e., binary classification. Several published works focus on multi-class classification, but works on the multi-class classification of IBA are very limited.
2. There are different experimental studies linked with the identification of IoT attacks on simulated or emulated data [7–10]. We propose an experimental evaluation on real-time traffic data.
3. The training phase of machine learning classifiers is associated with adjusting the training parameters. Being a multi-class classification, the training phase should be done with proper proportions of the training dataset from all classes [11–13] so that the training parameters are adapted properly for all the classes. For this purpose, the CV techniques applied are KCV and SKCV. The results of these CV techniques are compared, using various evaluation metrics, with the datasets without cross-validation.

2 Related Work

The IoT botnet problem is addressed in [4, 6]; in [6], a combination of the grey wolf optimization algorithm (GWO) and a one-class support vector machine (OCSVM) is used. The hyper-parameters are optimized by GWO while OCSVM is deployed as the learning algorithm. The classification involves multiple phases. The focus is on IBA detection, and the work is not extended to deal with multiple classes of attacks. In [14], the methodology for attack detection is binary classification. IoT devices are very limited in computational power; taking this limitation into account, this work focused on providing an algorithmic solution that uses a sparse representation framework. The research work tries to minimize the impact of an attack by isolating attacked IoT devices. Here also the focus primarily goes into detecting the attack, as in the previous work, so a multi-class classification approach could be introduced such that further algorithmic solutions can be fine-tuned for multiple attacks. The paper [15] presents another framework for online network intrusion detection and introduces a plug-and-play mechanism. The proposed system detects attacks in a video surveillance network with a real IP camera. Out of many attacks, the prominent ones are OS Scan, Fuzzing, Video Injection, ARP MitM, Active Wiretap, SSDP Flood, SYN DoS, SSL Renegotiation, and Mirai. The developed system uses both online and offline algorithms. Here IBA is not explored further. In [16], the training in the learning phase is done with features from benign traffic data; the anomalies detected point to the compromised device. The features are extracted by a neural network called an autoencoder, and the training is carried out on benign traffic only. The classification adopted in this work is binary.


The related works explored different approaches focused on IoT attack detection and classification. In all the papers analyzed, the emphasis is on attack detection, which is a binary classification. Multi-class classification of IoT attacks appears in [15], but research works concentrating on the multi-class classification of IBA are not common.

3 Data Set Description

Multi-class IBA classification performs an experimental evaluation with a real-time IoT traffic dataset rather than using simulated or emulated data. The dataset is the N-BaIoT dataset [16, 17], extracted from the network traffic of nine IoT devices. Port mirroring, deployed in switches, is used to collect data, and the format of the collected data is Packet Capture (pcap). The benign and malicious datasets are collected separately. The benign dataset is collected immediately after the installation of the network since it is the basis for identifying other types of attacks. The packet's contextual information regarding protocols and hosts is captured. The number of features in the dataset is 115. The statistical features are identified from five temporal windows: 100 ms, 500 ms, 1.5 s, 10 s, and 1 min. The attacks executed [16] are Bashlite attacks and Mirai attacks. The IoT devices are two doorbells, one thermostat, four security cameras, one baby monitor, and a webcam.

4 Experimental Evaluation

4.1 Preprocessing: Min–Max Normalization

Feature-wise normalization can be applied to the dataset before model fitting. The normalization applied in this paper is Min–Max normalization [18]. Min–Max normalization maps a value d of feature p to a value d′ in the range [0, 1], as shown in Eq. (1):

d′ = (d − min(p)) / (max(p) − min(p))  (1)

where d′ is the normalized value of d, d is the original value, and max(p) and min(p) are the maximum and minimum values of feature p.
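A minimal sketch of Eq. (1), applied column-wise to a feature matrix, is shown below; the same result can be obtained with scikit-learn's MinMaxScaler, and the feature matrix X here is an assumed stand-in for the 115 N-BaIoT features.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def min_max_normalize(X):
    """Apply Eq. (1) to every feature (column) of X, mapping it into [0, 1]."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

X = np.random.rand(1000, 115) * 100           # assumed placeholder feature matrix
X_manual = min_max_normalize(X)
X_sklearn = MinMaxScaler().fit_transform(X)   # equivalent library call
assert np.allclose(X_manual, X_sklearn)
```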


Fig. 1 10-fold stratified CV

4.2 KCV and SKCV

The dataset train-test ratio is 67–33. CV techniques are applied to the training dataset, and the model is then evaluated with the test dataset. There are different categories of CV based on how the samples are selected from the dataset. The first is leave-one-out CV [12], which has issues such as high computational cost, although it has low bias. The next category is KCV [11], which provides an unbiased estimate of the expected prediction error (EPE). In Fig. 1 the training dataset is applied to 10-fold CV. For multi-class classification, this may generate imbalanced folds: each fold may not contain proper proportions of all the classes. SKCV [13] eliminates this issue. Figure 1 shows the configuration of a 10-fold stratified CV with the N-BaIoT dataset: out of the 10 folds, nine folds are used for training and one for validation in each iteration. All the experiments are done with the scikit-learn library in Python.
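The difference between KCV and SKCV can be illustrated with scikit-learn directly: StratifiedKFold preserves the per-class proportions in every fold, while plain KFold may not. The toy label vector below is an assumption used only to make the imbalance visible.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced multi-class labels (assumed): 900 of class 0, 90 of class 1, 10 of class 2.
y = np.array([0] * 900 + [1] * 90 + [2] * 10)
X = np.zeros((len(y), 1))  # feature values are irrelevant for fold construction

for name, splitter in [("KFold", KFold(n_splits=10)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=10))]:
    # Class counts in the held-out part of the first fold.
    _, val_idx = next(splitter.split(X, y))
    print(name, np.bincount(y[val_idx], minlength=3))
```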

4.3 Machine Learning Classifiers

The classifiers [19] used to create the model are Linear Discriminant Analysis (LDA) [20], the K-Nearest Neighbor classifier (KNN) [21], the Decision Tree Classifier (DTC) [22], the Random Forest Classifier (RFC) [23], and the Extra Trees Classifier (EXT) [24]. All the classifiers use the dataset train-test ratio of 67–33. In LDA [20], 10 non-zero eigenvalues are formed for the 11-class classification. Euclidean distance is used to classify data points. KNN [21] uses the already classified points to assign a new sample point to a particular class. Given n pairs of data points {(x1, θ1), (x2, θ2), …, (xn, θn)}, xi takes values in a metric space X with metric d, and θi is the index of the category to which xi belongs. A new point (x, θ) is assigned the class of its nearest neighbor x′n ∈ {x1, x2, x3, …, xn}, where min d(xi, x) = d(x′n, x), i = 1, 2, …, n.


In DTC [22], the first phase is to perform the selection of splits and the second phase is to identify the terminal nodes. The last phase maps each terminal node to a labeled class. To minimize the misclassification rate in the class-assignment problem, terminal nodes are mapped to the highest-probability class. RFC [23] incorporates a sequence of tree classifiers, each built from a random vector drawn from the input. The total number of trees in the forest is set to 100. The criterion used to measure the quality of a split is the Gini index. The nodes are expanded until all leaves are pure or until all leaf nodes contain fewer than two samples. EXT [24] uses several randomized decision trees for classification and is implemented with a configuration similar to RFC.
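A minimal sketch of instantiating these five classifiers with the configurations stated above (100 trees and Gini criterion for the ensembles, Euclidean distance for KNN) is shown below; any parameter not mentioned in the text is left at its scikit-learn default, which is an assumption.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

classifiers = {
    "LDA": LinearDiscriminantAnalysis(),                 # up to 10 discriminants for 11 classes
    "KNN": KNeighborsClassifier(metric="euclidean"),     # Euclidean nearest neighbours
    "DTC": DecisionTreeClassifier(criterion="gini"),
    "RFC": RandomForestClassifier(n_estimators=100, criterion="gini", min_samples_split=2),
    "EXT": ExtraTreesClassifier(n_estimators=100, criterion="gini", min_samples_split=2),
}
```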

4.4 Algorithm

# Generation of the imbalanced dataset
1. For AT1 to AT11 in IoT-D1 to IoT-D9
2. DS = imbalanced(IoT-D[i], AT[j])
# Preprocessing
3. ND = MinMaxNormalize(DS)
# Train and test data splitting
4. TRD = 67% of ND
5. TSTD = 33% of ND
# Model training with KCV
6. Generate 10 folds from TRD
7. Use TRD in LDA, KNN, DTC, RFC, EXT
8. ACC_TR = mean accuracy (10 folds)
9. Test the model with TSTD
# Model training and testing with SKCV
# Model training and testing without CV

The entire process is pictorially represented in Fig. 2.
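The steps above can be sketched end-to-end for one classifier as follows; the load_nbaiot helper and the choice of DTC as the illustrated classifier are assumptions, and the same loop applies to the other four models.

```python
from sklearn.model_selection import train_test_split, cross_val_score, KFold, StratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical helper returning the 115 features and the 11-class labels.
X, y = load_nbaiot()

X = MinMaxScaler().fit_transform(X)                    # Eq. (1) preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.67, stratify=y, random_state=0)

clf = DecisionTreeClassifier(criterion="gini")
for name, cv in [("KCV", KFold(n_splits=10, shuffle=True, random_state=0)),
                 ("SKCV", StratifiedKFold(n_splits=10, shuffle=True, random_state=0))]:
    acc_tr = cross_val_score(clf, X_tr, y_tr, cv=cv).mean()   # mean accuracy over 10 folds
    print(name, "mean training-phase accuracy:", acc_tr)

# Without CV: fit on the full training split and evaluate on the held-out 33%.
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```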

5 Results and Discussion

The multi-class classification assessment methods are formulated from a confusion matrix of the 11-class classification. The evaluation metrics used for the empirical evaluation are the accuracy of the model, the F1 score, and Cohen's kappa coefficient (κ).


Fig. 2 System architecture



5.1 Accuracy

Accuracy [25] is a measure of classification performance. It is the ratio of correctly classified samples to the total number of samples, as shown in Eq. (2):

Accuracy = (TP + TN) / (TP + TN + FP + FN)  (2)

where TP = true positive, TN = true negative, FP = false positive, and FN = false negative. From the 11-class confusion matrix C, the number of TP of class i is C_{i,i}, while

FN of class i = (Σ_{j=1}^{11} C_{i,j}) − C_{i,i} and FP of class i = (Σ_{j=1}^{11} C_{j,i}) − C_{i,i}

The accuracy of all the cases is plotted against the number of samples. The empirical evaluation of LDA, KNN, DTC, RFC, and EXT is shown in Fig. 3. All the results point to the fact that the stratified 10-fold CV always ensures the participation of at least one instance of every class, whereas the 10-fold CV does not. LDA and KNN achieve noticeably higher training accuracy with 10-fold CV than with 10-fold stratified CV, while their test accuracy is the same for both types of CV. The best-performing classifiers among these five are DTC, RFC, and EXT.

5.2 Execution Time

The two types of CV do not have any impact on DTC, RFC, and EXT, which give the same results as the datasets without CV. There are some dips in the graphs because of random imbalance in the dataset. To analyze the difference in performance, the execution time against the number of samples is plotted in Fig. 4. The execution time of stratified K-fold CV is higher than that of K-fold CV for LDA, KNN, and DTC, whereas the two types of CV take almost the same time for RFC and EXT. In LDA, KNN, and DTC, the two types of CV give the same accuracy, but in terms of execution time, K-fold CV is better than stratified K-fold CV. In RFC and EXT, the two types of CV give the same accuracy, and the execution time is almost the same. The noticeable observation regarding execution time is that all the classifiers perform best with the dataset with no CV.


Fig. 3 Accuracy evaluation of classifiers

5.3 F1 Score and Cohen's Kappa Coefficient (κ)

The F1 score [25, 26] is an estimate of accuracy in the testing phase. It involves the precision (PR) and the recall (RC) of the test phase; precision is the percentage of the returned results that are relevant. It is shown in Eq. (3):

F1 score = 2 · (PR · RC) / (PR + RC)  (3)

where PR = TP / (TP + FP) and RC = TP / (TP + FN). κ is a quantifier used to measure inter-rater reliability for qualitative items [27, 28], shown in Eq. (4).


Fig. 4 Time of execution of classifiers

k=

po − pe 1 − pe 11

(4) C ∗C

i: i=1 :i where po is the accuracy calculated and pe = Total . samples Where C:i and Ci: are the sums of elements in the ith column and ith row of the confusion matrix, respectively. The F1 score and ҟ for the machine learning classifiers with datasets without CV are shown in Table 1 (F1 Score and ҟ of classifiers). The largest sample is selected for these two evaluation matrices.
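Equations (2)–(4) can be reproduced directly from an 11-class confusion matrix; the sketch below computes per-class TP/FP/FN, overall accuracy, the macro-averaged F1 score, and Cohen's kappa. The small confusion matrix used is an illustrative assumption, and scikit-learn's f1_score and cohen_kappa_score give the same values when computed from raw labels.

```python
import numpy as np

def metrics_from_confusion(C):
    """Accuracy (Eq. 2), macro F1 (Eq. 3) and Cohen's kappa (Eq. 4) from confusion matrix C."""
    n = C.sum()
    tp = np.diag(C)
    fp = C.sum(axis=0) - tp          # column sums minus the diagonal
    fn = C.sum(axis=1) - tp          # row sums minus the diagonal

    accuracy = tp.sum() / n
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)

    p_o = accuracy
    p_e = (C.sum(axis=1) * C.sum(axis=0)).sum() / n**2
    kappa = (p_o - p_e) / (1 - p_e)
    return accuracy, f1.mean(), kappa

# Assumed 3-class confusion matrix, purely for illustration.
C = np.array([[50, 2, 1],
              [3, 45, 2],
              [0, 4, 43]])
print(metrics_from_confusion(C))
```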


Table 1 F1 score and κ of ML classifiers

CV scheme | κ (LDA) | κ (KNN) | κ (DTC) | κ (RFC) | κ (EXT) | F1 (LDA) | F1 (KNN) | F1 (DTC) | F1 (RFC) | F1 (EXT)
10-fold CV | 0.75 | 0.93 | 0.96 | 0.96 | 0.96 | 0.72 | 0.93 | 0.95 | 0.95 | 0.96
10-fold stratified CV | 0.75 | 0.94 | 0.96 | 0.96 | 0.97 | 0.72 | 0.93 | 0.95 | 0.95 | 0.96
Without CV | 0.75 | 0.94 | 0.96 | 0.96 | 0.97 | 0.72 | 0.93 | 0.96 | 0.96 | 0.96

6 Conclusion

IoT is more vulnerable to botnet attacks and other malicious programs because of insufficient computational resources. The two broad categories of botnet classes are Bashlite and Mirai attacks. Methods for detecting and classifying attacks are crucial because of the proliferation of the attacks. In this paper, a multi-class classification approach is proposed, which classifies the IBA of 9 IoT devices into 11 classes (10 IBA types and one benign type) and is empirically evaluated using machine learning classifiers. Every classifier is highly dependent on the training and testing phase since the model parameters are tuned in this phase. To ensure proper proportions of every class in the training phase, the CV techniques applied are 10-fold CV and stratified 10-fold CV. Another set of models is also trained without CV for comparative purposes. All the classifiers produce the same accuracy values with and without CV. The best-performing classifiers in terms of accuracy are DTC, RFC, and EXT with 96–99% accuracy. There are some dips in the graphs, which are caused by imbalances in the dataset. To analyze the performance, another parameter adopted is execution time. All the algorithms perform best, in terms of execution time, on the dataset without CV. LDA, KNN, and DTC have higher execution times with SKCV than with KCV, while RFC and EXT have almost the same execution time with SKCV as with KCV. DTC performs best when both execution time and accuracy are considered. Cohen's kappa coefficients and F1 scores are also evaluated.

References 1. Smys S (2020) A survey on internet of things (IoT) based smart systems. J ISMAC 2(04):181– 189 2. The Statistics Portal (2017) Internet of things (IoT) connected devices installed base worldwide from 2015 to 2025 (in Billions). [Online]. Available: https://www.statista.com/statistics/471 264/iotnumber-of-connected-devices-worldwide/ 3. Angrishi K (2017) Turning internet of things (iot) into internet of vulnerabilities (iov): Iot botnets. arXiv Preprint. arXiv:1702.03681 4. Kamel DK (2021) Wireless IoT with blockchain-enabled technology amidst attacks. IRO J Sustain Wireless Syst 2(3):133–137 5. Sivaganesan D (2021) A data driven trust mechanism based on blockchain in IoT sensor networks for detection and mitigation of attacks. J Trends Comput Sci Smart Technol (TCSST) 3(01):59–69


6. Al Shorman A, Faris H, Aljarah I (2020) Unsupervised intelligent system based on one class support vector machine and grey wolf optimization for IoT botnet detection. J Ambient Intell Humanized Comput 11(7):2809–2825 7. Bostani H, Sheikhan M (2017) Hybrid of anomaly-based and specification-based IDS for internet of things using unsupervised OPF based on mapreduce approach. Comput Commun 98:52–71 8. Smys S, Basar A, Wang H (2020) Hybrid intrusion detection system for internet of things (IoT). J ISMAC 2(04):190–199 9. Snehi M, Bhandari A (2021) Vulnerability retrospection of security solutions for softwaredefined cyber-physical system against DDoS and IoT-DDoS attacks. Comput Sci Rev 40:100371 10. Sedjelmaci H, Senouci SM, Al-Bahri M (2016) A lightweight anomaly detection technique for low-resource IoT devices: a game-theoretic methodology. In: 2016 IEEE international conference on communications (ICC). IEEE, pp 1–6 11. Al-Abdaly NM, Al-Taai SR, Imran H, Ibrahim M (2021) Development of prediction model of steel fiber-reinforced concrete compressive strength using random forest algorithm combined with hyperparameter tuning and k-fold cross-validation. Eastern-Eur J Enterp Technol 5(7):113 12. Kelter R (2021) Bayesian model selection in the M-open setting—approximate posterior inference and subsampling for efficient large-scale leave-one-out cross-validation via the difference estimator. J Math Psychol 100:102474 13. Dei-Cas I, Giliberto F, Luce L, Dopazo H, Penas-Steinhardt A (2020) Metagenomic analysis of gut microbiota in non-treated plaque psoriasis patients stratified by disease severity: development of a new psoriasis-microbiome index. Sci Rep 10(1):1–11 14. Tzagkarakis C, Petroulakis N, Ioannidis S (2019) Botnet attack detection at the IoT edge based on sparse representation. In: 2019 Global IoT summit (GIoTS). IEEE, pp 1–6 15. Mirsky Y, Doitshman T, Elovici Y, Shabtai A (2018) Kitsune: an ensemble of autoencoders for online network intrusion detection. arXiv Preprint. arXiv:1802.09089 16. Meidan Y, Bohadana M, Mathov Y, Mirsky Y, Shabtai A, Breitenbacher D, Elovici Y (2018) Nbaiot—network-based detection of iot botnet attacks using deep autoencoders. IEEE Pervasive Comput 17(3):12–22 17. Meidan Y, Bohadana M, Mathov Y, Mirsky Y, Breitenbacher D, Shabtai A, Elovici Y (2018) detection_of_IoT_botnet_attacks_N_BaIoT Data Set. Mar 19. https://archive.ics.uci.edu/ml/ datasets/detection_of_IoT_botnet_attacks_N_BaIoT. Accessed 26 Oct 2021 18. Panda SK, Bhoi SK, Singh M (2020) A collaborative filtering recommendation algorithm based on normalization approach. J Ambient Intell Humanized Comput 11(11):4643–4665 19. Osarogiagbon AU, Khan F, Venkatesan R, Gillard P (2021) Review and analysis of supervised machine learning algorithms for hazardous events in drilling operations. Process Saf Environ Prot 147:367–384 20. Li Y, Liu B, Yu Y, Li H, Sun J, Cui J (2021) 3E-LDA: three enhancements to linear discriminant analysis. ACM Trans Knowl Discov Data (TKDD) 15(4):1–20 21. Qiu L, Qu Y, Shang C, Yang L, Chao F, Shen Q (2021) Exclusive lasso-based k-nearest-neighbor classification. Neural Comput Appl 33(21):14247–14261 22. Yoo SH, Geng H, Chiu TL, Yu SK, Cho DC, Heo J, Choi MS et al (2020) Deep learning-based decision-tree classifier for COVID-19 diagnosis from chest X-ray imaging. Front Med 7:427 23. 
Herce-Zelaya J, Porcel C, Bernabé-Moreno J, Tejeda-Lorente A, Herrera-Viedma E (2020) New technique to alleviate the cold start problem in recommender systems using information from social media and random decision forests. Inf Sci 536:156–170 24. Saeed U, Jan SU, Lee Y-D, Koo I (2021) Fault diagnosis based on extremely randomized trees in wireless sensor networks. Reliab Eng Syst Saf 205:107284 25. Tharwat A (2020) Classification assessment methods. Appl Comput Inform 26. Miao J, Zhu W (2021) Precision–recall curve (PRC) classification trees. Evol ˙Intell 1–25


27. Roldán-Nofuentes JA, Regad SB (2021) Estimation of the average Kappa coefficient of a binary diagnostic test in the presence of partial verification. Mathematics 9(14):1694 28. Chicco D, Warrens MJ, Jurman G (2021) The Matthews correlation coefficient (MCC) is more informative than Cohen’s Kappa and Brier score in binary classification assessment. IEEE Access 9:78368–78381

Chapter 43

IoT-Based Dashboards for Monitoring Connected Farms Using Free Software and Open Protocols

K. Deepika and B. Renuka Prasad

1 Introduction

The Internet of things (IoT) is an intelligent network that connects devices, people, processes, data, and things into an interconnected network. In addition to the intelligent network connection, IoT connects sensors and actuators to extract and analyze data with the use of automated and people-based processes across verticals. Connected farms enable end-users to use predictive analysis, remote monitoring, and sensor-based mapping to monitor field condition data remotely and in a sustainable way. The architecture of IoT-based dashboards for monitoring connected farms using free software and open protocols is represented in Fig. 1. Connected farms yield benefits like:

• Reliable fault-tolerant mechanisms for data collection from sensors to monitor climatic conditions, plant growth statistics, and humidity levels
• Real-time data visualizations with location awareness and historical monitoring of connected farms
• Customizable dashboards for the end-user to view the results from connected farms
• Integration of free software solutions and deployment of open protocols to achieve smart farming
• Configuring devices remotely based on input optimization and performing result analysis.

Connected farms employ a multitude of sensors attached to development boards for gathering data in real time and rely on sensor technology that is low power, location-aware, and integrated into a single view. Sensors are hardware devices that measure physical data and are installed at a certain distance to record, indicate, or respond to the same.


Fig. 1 Architecture of IoT-based dashboards for monitoring connected farms using free software and open protocols

The development boards act as networking nodes that enable the flow of data from the sensors to a Web browser-based flow editor over the same network and provide dependable access to sensors, actuators, and cameras in connected farms using open protocols. The data is communicated from the Web browser editors to a free software-based analytics and monitoring time-series database to handle massive volumes of time-stamped sensor data. The aggregations are further analyzed using interactive visualization tools and analytics, with experts generating queries, visualizations, alerts, and metrics over the massive amount of data. Over the last few decades, numerous research works have been carried out on IoT-based connected farms in agriculture. Therefore, it is necessary to overview, outline, analyze, and classify the state-of-the-art research in this domain. The objective of this study is to provide comprehensive background work in the field of IoT-based connected farms in agriculture. The contributions on IoT-based dashboards for monitoring connected farms using free software and open protocols are structured as follows. In Sect. 2, a systematic literature study in the field of IoT-based connected farms is presented. In Sect. 3, the design of IoT-based connected farms involving sensors with development boards and the construction of the experimental setup incorporating free software and open protocols is elaborated. Section 4 presents the results, synthesizing the publisher-subscriber connectivity, communication protocols, and observability dashboards. In Sect. 5, the summary of findings, open issues, and challenges of the implementation is listed. Research validity threats, conclusions, and future works are given in Sect. 6.


2 Literature Study

The study of the literature is organized into five divisions, namely: use of IoT in agricultural sectors, managing and controlling connected farms using IoT technology, real-time monitoring of connected farms, challenges of IoT in agriculture, and opportunities and applications of IoT in connected-farm agricultural sectors. Numerous research practices have been constituted over the years in the field of agriculture to increase production and reduce the workforce. Several researchers have implemented various techniques to increase quality and agricultural productivity. The usage of IoT in agriculture has been identified from the literature and summarized in the following sections.

2.1 Use of IoT in Agricultural Sectors

The idea of neural networks for validating sensor failures and plant monitoring is implemented in [1]. A plant health management system is deployed in [2], which estimates plant growth based on various parameters like temperature, humidity, and light intensity. A game-based learning system is studied by Tangworakitthaworn et al. [3] to understand the importance of plant growth using IoT technology and to create harmony in plant care. The concept of smart farming is studied in [4] with the usage of modern sensors like moisture, temperature, and humidity to determine plant growth. In an automated irrigation system using IoT, temperature sensors sense variations in heat and coldness and communicate interrupt signals to microcontroller-based devices [5]. Environmental parameters are investigated by Dursun and Ozden [6] along with topography, soil types, and vegetation yields. The results of the investigations prove that soil moisture sensors help determine the crop sowing period and tree growth rings. A low-cost IoT-based water contamination detection system using Raspberry Pi was developed by Anam and Devender [7]; the water temperature is obtained by an immersed water sensor. A cloud-based IoT greenhouse monitoring structure is studied by Keerthi and Kodandaramaiah [8]. Modern sensors like temperature, soil, and light have been applied to achieve a greenhouse monitoring system, and the results from the investigations prove that a positive impact has been achieved using the system.

2.2 Managing and Controlling in Connected Farms Using IoT Technology

Kang et al. [9] investigated an electronic IoT-based sow feeder in swine farms using radio frequency identification (RFID) as a commercial technology. RFID transponders ensured that the right amount of feed was dispensed. A study on IoT and cloud-based smart agriculture by Tonke [10] aims to build automated control of production in a plant factory.


RFID technology with GPS sensing is used to gather information from feed sowing to plant growth. An IoT-based digital agricultural monitoring system was researched by Sarkar and Chanagala [11] to obtain an overall view of optimal resource utilization using IoT protocols and to communicate the information using ZigBee technology. The system proved that a cycle from seeding to selling, improved production, product quality, and prediction of supply to estimate demand were achieved using IoT. Barakat [12] points out that RFID technology can be embedded in agricultural lands, and location awareness can be processed using global positioning systems (GPS); the data obtained from the system is communicated using Wi-Fi technology [13]. The supply chain value of agricultural products has been managed and controlled by RFID and IoT according to Zhao and Wang. Scanning of Electronic Product Code (EPC) transponders on goods for automatic identification and information retrieval is mentioned by Abdul et al. [14] and Deepika et al. [15]. Qin et al. [16] predicted RFID to be a critical technology of IoT, used to achieve positioning, monitoring, and controlling in the agricultural domain.

2.3 Real-Time Monitoring of Connected Farms

An IoT sensor network-based approach has been investigated by Jaishetty and Patil [17] for remote monitoring and controlling of data from multiple farms. The information was retrieved on hand-held devices, and the system employed ultrasonic sensors to monitor soil quality and water levels. Environmental changes and patterns of connected farms have been identified using optimal sensor methods. An Arduino-based drip irrigation process for automatic, timely watering of crops as water levels decrease in connected farms has been achieved by Parameswaran and Sivaprasath [18]; a smart agricultural system with a significant impact on monitoring and controlling crop growth is achieved. Real-time soil observations for keeping the topsoil intact, and an alerting system for fertilizing the plants with intimations of the required amount, are experimented with by Channe et al. [19]. Analysis of the soil moisture sensor helped to water the plants, while ZigBee and IoT technology assist in monitoring the croplands.

2.4 Challenges of IoT in Agriculture

The authors Perera et al. [20], Borgohain [21], and Whitmore et al. have mentioned the challenges faced while implementing IoT in agriculture. The major concerns include scalability, network configuration, connectivity between sensors, interoperability of devices, storage and interpretation of data, fault tolerance, and erroneous data. Minimal software infrastructures are required for smart objects to execute. The author Ashton [22] states that smart objects should execute with minimal resources, and fault tolerance has to be incorporated by maintaining various levels of redundancy.


Table 1 Comparative analysis of challenges, opportunities, and applications of IoT in agriculture

Challenges:
• Software complexity: spatiality and dynamics affect the quality and production of the crop [2]
• Security: front-end sensors and equipment, the network, and the back-end of IT systems are referred to as the security threats of IoT [23]
• Skill requirement: necessary farming and operational skills are required to create an impact on the overall performance [11]
• Lack of infrastructure: poor informative infrastructure facilities of IoT have created a negative impact on the productivity of farming [17]

Opportunities:
• Low power wireless sensors: the low energy consumption of sensors supports battery-powered devices for longer periods without data interruption [26]
• Better connectivity: M2M connectivity assists farmers in improving the efficiency of farming operations [27]
• Operational efficiency: timely maintenance of equipment enables great operational efficiency in agricultural sectors [18]
• Remote management: remote monitoring allows end-users to monitor the climatic conditions and water requirements of the field [7]

Applications:
• Plant sensing: soil moisture and temperature sensors play a vital role in retrieving the climatic conditions and water parameters of a plant [19]
• Managing and controlling: aggregated dashboards and cloud-based data storage assist in managing and controlling agricultural systems [28]
• Monitoring aggregated farms: mapping and scheduling of aggregated farms are achieved with LoRa, UWB, and GPS technologies [29]
• Real-time monitoring environment: remote monitoring of the real-time feed from the environment is achieved with IoT-based wireless technologies [30]

IoT development is constituted by software, devices, and technologies in energy harvesting, contributing to the development of IoT technology [23]. Construction and deployment of software to cater to these requirements are complicated in the agricultural domain. Patil et al. [24] overviewed IoT and cloud computing in the agricultural domain. The characteristics associated with plant growth were identified to be complexity, spatio-temporal variability, and diversity. The growth in the IoT sector is driven by the usage of sensors, actuators, software, location awareness, and wireless and RFID technology [25]. The comparative analysis of the five divisions is depicted in Table 1.

2.5 Opportunities and Applications of IoT in Agriculture

Chen [31] identified the opportunities prevailing in IoT technology. Low-power sensors are found to be the major application of IoT. Energy-saving transmitters are used as low-power sensor nodes to extract information from connected farms. IoT technology achieved high success compared to other technologies with the usage of low-powered sensor nodes [26, 30].


The transmitters range over a few meters of area, and low energy consumption is the major feature, allowing battery-powered devices to operate for long periods while helping to reduce cost and increase maintainability. Wired technology is replaced with low-cost, low-powered sensor devices with the capability to communicate using wireless technology [32]. Reduction of the sensing area and implementation of wireless sensor networks (WSN) increase the application of IoT in the agricultural domain. Remote sensing and data gathering are the advantages of low-powered wireless sensor technology.

3 Design and Implementation of Connected Farms

Cultivation and monitoring of plants require the involvement of consistent technology. The need for an integrated system arises as the data involved in the agricultural domain keeps multiplying. Single solutions do not exist for the creation of connected farms, as every operation handles data in different and unique ways. A detailed literature study has been accomplished to understand the uniqueness of farms and facilities in order to enable and incorporate open systems of components communicating through the software layer. The reference architecture of the connected farm addresses these needs with a focus on scalability, connectivity, and communication to address every unique farm. IoT-enabled edge devices like soil probes, weather stations, and surveillance-based Internet Protocol (IP) cameras fetch data about the land constantly or at regular intervals. Remote environments focus on connected farm devices operating in three states—online, partial, and offline. The equipment design depends on onboard power, latency, communication method (Bluetooth, Wi-Fi, cellular, or satellite), whether the device is portable or fixed, and the size of the data. Devices capturing soil moisture and weather information communicate minimal data streams over the MQTT protocol, whereas the surveillance-based IP cameras communicate using AI and deep learning algorithms. The design and implementation of connected farms involve five steps—sensing technologies, system implementation, communication process, positioning technology, and data management. Crop growth and overall farm productivity are increased with sensing technologies like sensors, actuators, development boards, and geo-mapping. System implementation is achieved with a hardware and software setup to accurately monitor and map field information in real time. The sensor values are fed to an IoT-based cloud platform with predefined rule-based decision management. The communication process employs protocols to communicate automated data from the farm to a central repository on the network. Positioning technology is employed with cellular and GPS technologies for field plotting, crop scouting, and yield mapping. The data obtained from the positioning systems is integrated with satellite imagery of crops, weather, and farm monitoring using the sensors deployed on crops in the field. The communicated data is shared with the end-users, i.e., farmers and experts, to analyze agricultural data and make forecasts for the end-users.
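As a sketch of how a field device could push its minimal data stream over MQTT to the free software stack described above, the snippet below publishes one JSON reading with the paho-mqtt client; the broker address, topic name, and payload fields are assumptions, not values prescribed by the architecture.

```python
import json
import time
import paho.mqtt.client as mqtt

BROKER = "farm-gateway.local"          # assumed broker host on the farm network
TOPIC = "farm/plot1/soil"              # assumed topic naming scheme

client = mqtt.Client()
client.connect(BROKER, 1883, keepalive=60)

reading = {
    "ts": int(time.time()),            # Unix timestamp for the time-series database
    "soil_moisture": 41.7,             # example sensor values
    "temperature_c": 27.3,
    "humidity_pct": 68.0,
}
# QoS 1 asks the broker to acknowledge the message at least once.
client.publish(TOPIC, json.dumps(reading), qos=1)
client.disconnect()
```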


3.1 Involvement of Sensing Technologies The agricultural industry is facing a major challenge of meeting the demands of the global population without depleting the available resources. Technology and innovation help agricultural holdings to become productive and the processing systems to be more sustainable. The sensor development boards are capable of measuring the ground conditions with exact accuracy and communicating data to assist the end-users with precision values. Remote sensing technology involves sensor technologies for soil quality, weather information, geo-mapping, and monitoring the environmental changes using intelligent cameras. Smart network of sensors measures temperature, humidity, soil moisture, light intensity, detect rainfall, monitor using surveillance-based IP cameras and accumulate location data using GPS. The hydrometric humidity sensor—AM2302 from Adafruit—measures the concentration of water vapor present in the air. The sensor is deployed in unrestricted air circulation zones to indicate the water level present in the air and be placed in a sheltered area protected from rainfall. The variation in temperature with the amount of heat or coldness is sensed with temperature sensor. Semiconductor-based sensors are incorporated with integrated circuits (ICs). Temperature sensor-DHT11 from Adafruit utilizes two identical diodes with sensitivity voltage and current characteristics with linear response. Soil moisture is a dielectric soil sensor used for measuring water levels by calculating the dielectric constant of the soil. The moisture level is identified with the stationary sensors placed between plants within a crop row at desired depths. Light dependent resistor (LDR)—photoresistor from Robo India—is a variable resistance component to estimate the amount of light intensity retrieved from the electrode of the sensor. Resistance range and sensitivity in various kinds of LDR sensors differ. Rain sensor—SEN5 from Robodo—detects the unpredicted rainfall assisting for water conservation and irrigation system. The sensor temporarily suspends watering as the rainfall is detected and records the amount of rain received. The sensor is mounted on clear, and waterproof enclosures to retrieve the light levels. Camera— ESP32-CAM from Adafruit—is used for imaging, mapping, and surveying farms. The insights obtained from the data are used to map crop health, irrigation, soil monitoring, plant growth, yield prediction. IP camera sensor is used for surveillance, remote sensing, and real-time monitoring is achieved over the same network. GPS devices—Quectel MC20 from WIO—are used to position and identify plots in multiple farms in real-time monitoring and remote sensing. Coupled with Satellite technology, GPS systems receive historical and real-time climatic data. The system relies on the integration of data collection sensors communicating to the central server using development boards and the involvement of GPS systems to position the farmlands in connected farms.
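As a rough illustration of how such a humidity and temperature reading could be taken on the gateway side, the following Python sketch polls an AM2302/DHT22-class sensor with the legacy Adafruit_DHT library; the GPIO pin number is an assumption and depends on the actual wiring.

```python
import Adafruit_DHT  # legacy Adafruit DHT driver, e.g., on a Raspberry Pi gateway

SENSOR = Adafruit_DHT.AM2302   # AM2302 (wired DHT22-class) humidity/temperature sensor
GPIO_PIN = 4                   # assumed GPIO pin; depends on the wiring

# read_retry keeps polling until the sensor answers (these sensors need ~2 s between reads).
humidity, temperature = Adafruit_DHT.read_retry(SENSOR, GPIO_PIN)
if humidity is not None and temperature is not None:
    print(f"Temperature: {temperature:.1f} degC  Humidity: {humidity:.1f} %")
else:
    print("Sensor read failed, retry later")
```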


3.2 Framework for Design and Implementation With the impetus of IoT, the agricultural domain maintains the criteria of interoperability across verticals, scalability, and traceability. IoT is divided into three layers from the literature study by the authors. Some authors sub-divide the layers of IoT [33], determined with necessary research and implementation. The sub-divided layers of IoT are found to be suitable for fog or edge computing, and the layer edge/fog computing is positioned between device and network layers. The naming pattern of the layers differs among authors, the general trend is dividing the layers as device, network, and application layers. The device layer consists of physical objects to collect data from the environment—automatic identification, sensing or actuation, and connecting to the Internet. The network layer transmits data to a gateway (proxy server) to the cloud (Internet) with the implementation of communication protocols. The application layer provides services and determines the set of protocols to assist the end-user with analyzed information. The accumulated data experiences diverse stages of transition from sensing or actuation to cloud and the stages influence the implemented technologies. The authors say that data obtained is from sensing devices, a process mediated (data retrieved from the business process), human-generated (recorded and calculated by humans). Overall data acquisition system needs to analyze the process, retrieval, and use of data in IoT. Device Layer—Sensor devices sense the parameters from a physical environment and communicate the data to the cloud wirelessly. An actuator is a device that receives commands from the cloud to activate/deactivate a mechanical unit of a system. The device layer is also named—the perception layer (Tzounis et al.), the sensing layer, or the physical layer. The devices are constituted by with a transceiver (smartphone), microcontrollers (Arduino Boards, NodeMCU), interfacing unit (Raspberry Pi), and one or many sensors (temperature, humidity, soil moisture, rain, light, etc.) and actuators (motors, sprinklers hydraulic cylinders, etc.). The location information is fetched using GPS receivers and accessed via controller area network (CAN) for field plotting, crop scouting, and yield mapping. Many sensors and actuators are employed in connected farms. The sensors used in the system are to obtain temperature, humidity, calculate soil moisture, estimate the light intensity, and detect rainfall. IP camera communicates real-time sensing images to the end-user for remote monitoring and GPS devices position the latitude and longitude coordinates of the location in Fig. 2. The network layer of the IoT-based connected farm is discussed in terms of technology in Sect. 3.3 and as positioning systems in Sect. 3.4. The application layer of IoT in connected farms is presented in Figs. 3, 4 and 5.


Fig. 2 Sensing images communicated by ESP32-CAM

Fig. 3 Communication between MQTT protocol, InfluxDB, and Grafana

3.3 Real-Time Sensing Images Communicated by ESP32-CAM Network Layer—The network layer first sends the data to a middleware and subsequently communicates it to the cloud (Internet), which in turn connects to the actuators. The layer employs short-range wireless communication technologies like Bluetooth low energy (BLE), Ethernet, wireless fidelity (Wi-Fi), RFID, and near-field communication (NFC) to communicate the data to a gateway. The Internet gateway is positioned within the proximity range of the connected devices and includes a proxy server. The data is collected and processed by the proxy server, which communicates it to the end-user using protocols such as hypertext transfer protocol (HTTP) and MQTT over TCP, or the constrained application protocol (CoAP) over the user datagram protocol (UDP). The camera lens incorporated in smart devices captures ultrahigh definition (UHD) pictures and records videos. The GPS receiver and camera module in smart devices are programmed for computing data and displaying graphical user interfaces (GUI) on the phone screen. Smart devices thus span all three layers of IoT: they achieve sensing using embedded sensors (device layer), serve as a gateway and compute the retrieved data (network layer), and display the GUIs (application layer).


Fig. 4 Sensors communicating parameters to Node-RED

Fig. 5 Node-RED displaying MQTT data

3.4 Incorporation of Positioning Systems The incorporation of GPS-based positioning systems achieves precision farming in connected farms. GPS technology communicates with satellites revolving in orbit to provide real-time data with accurate position information, further leading to efficient manipulation and analysis of the gathered data. The real-time data is utilized for precision farming, field planning, yield mapping, and location assistance.


A nano SIM card is inserted into the board slot to transmit data. The board is flashed using the Arduino IDE, an open-source programming environment for Arduino and similar boards, by connecting a micro-USB cable and selecting the Wio GPS board under Tools → Board. Test runs are executed using the Arduino serial monitor, which displays the satellite connection data along with the latitude and longitude information. A series of test runs is executed in multiple locations before proceeding with the actual deployment.
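Once the board is deployed, its position reports can also be read on the host side over the serial interface. The following sketch is only an illustration of that idea: the serial port name and baud rate are assumptions, and the pyserial and pynmea2 packages are used here for convenience, not because the authors used them.

```python
import serial    # pyserial: raw serial access to the GPS board
import pynmea2   # NMEA 0183 sentence parser

# Port name and baud rate are assumptions; they depend on how the board is attached.
with serial.Serial("/dev/ttyUSB0", 9600, timeout=1) as port:
    while True:
        line = port.readline().decode("ascii", errors="ignore").strip()
        # GGA sentences carry the position fix with latitude and longitude.
        if line.startswith(("$GPGGA", "$GNGGA")):
            fix = pynmea2.parse(line)
            print(f"lat={fix.latitude:.6f}, lon={fix.longitude:.6f}")
            break
```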

4 Integration of Data Sources and Management Application Layer—The application layer, the most important layer in IoT, is responsible for providing services and determining the protocols used for communication. The layer retrieves the parameters communicated through the controlling devices. This third layer of IoT provides several services like data storage, analytics, and access via application programming interfaces (APIs). APIs support user interface (UI)-based software applications that help integrate heterogeneous cloud data and improve interoperability. Data storage is accomplished with different types of databases to suit the design and application of the system. The sensing data from the device layer is sent to the local repository via Wi-Fi using development boards such as an Arduino board with an ESP8266 Wi-Fi module or a NodeMCU Wi-Fi development board, communicating through a single-board computer, i.e., a Raspberry Pi. Data storage is cloud-based, involving local repositories that subsequently communicate with multiple other servers. The data gathered from the sensing layer is sent to Node-RED using the MQTT protocol. Node-RED is a programming tool licensed under Apache License 2.0 for event-based applications that combines visual wiring to implement IoT development.
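For illustration only, a minimal gateway-side publisher that pushes one set of sensor readings to the MQTT broker consumed by Node-RED might look as follows; the broker address, topic name, and payload fields are assumptions, not the authors' actual configuration.

```python
import json
import paho.mqtt.client as mqtt

BROKER = "192.168.1.10"            # assumed broker (e.g., Mosquitto) address
TOPIC = "farm/plot1/environment"   # assumed topic that Node-RED subscribes to

client = mqtt.Client()
client.connect(BROKER, 1883, keepalive=60)

# One JSON payload carrying the parameters described above (values are placeholders).
reading = {"temperature": 27.4, "humidity": 61.0, "soil_moisture": 43.2,
           "light": 512, "rain": 0}
client.publish(TOPIC, json.dumps(reading), qos=1)
client.disconnect()
```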


The application is further extended into InfluxDB, an optimized time series database used to store all the sensor data. InfluxDB is released under the MIT license and handles high write and query loads. Mosquitto is a lightweight, portable, EPL/EDL open-source message broker with a publisher/subscriber model that implements the MQTT protocol and is suitable for IoT. The data is then sent to Grafana for dashboards and visualizations. Grafana is an open-source analytics and monitoring solution, with contributions accepted under a Contributor License Agreement (CLA), for performing queries, creating dynamic visualizations, and generating alerts. Grafana visualizes the collected data in an observability dashboard to provide insight and to generate alerts based on the data. The observability dashboard is created in Grafana by adding multiple panels. The display panels are set up for humidity, temperature, light-dependent


resistor, and soil moisture. The humidity panel is specified by selecting the table (monitor), choosing the field as humidity, and updating the information every 5 s. The data is displayed in the dashboard in a time series format. The other parameters are configured in their respective panels using the same method.
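As a sketch of how one reading could be written into the "monitor" measurement so that such a Grafana panel can query it, assuming an InfluxDB 2.x instance with placeholder URL, token, organization, and bucket names:

```python
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# URL, token, organization, and bucket are placeholders for illustration only.
client = InfluxDBClient(url="http://localhost:8086", token="<token>", org="farm-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One point in the "monitor" measurement that a Grafana panel can chart.
point = (Point("monitor")
         .tag("plot", "plot1")
         .field("humidity", 61.0)
         .field("temperature", 27.4)
         .field("soil_moisture", 43.2))
write_api.write(bucket="connected-farm", record=point)
client.close()
```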

5 Configuration of Devices and Performance of Result Analysis Configuration of Devices—The configuration of devices involves device management and dashboard configuration. Device management registers the sensors (humidity, temperature, LDR, and soil moisture) and devices (GPS and camera). The sensors and the GPS device communicate with the ESP8266 development board, while the camera sends the captured images to the ESP32 board. The boards are configured to reconnect to the remote Wi-Fi network during system failures or after losing the connection; the ESP8266 and ESP32 development boards are able to reconnect to the router in case of a Wi-Fi outage with the necessary code for auto-reconnect and persistent usage. Dashboard Configurations—The configurations of the dashboards are adjusted remotely based on the requirements. The dashboards—Node-RED and Grafana—are configured with a public IP. The public IP can be reset or changed to any other public IP to access the data globally. The dashboards can also be configured on localhost using the loopback interface. The URL can be specified with the term localhost or with the IP address of the localhost, 127.0.0.1. The localhost address can also be specified with a static IP associated with Ethernet or Wi-Fi. Result Analysis—Result analysis visually tracks, compares, analyzes, and displays key performance indicators (KPIs). The parameters communicated by a particular sensor are compared with a secondary sensor or with standard hardware tools. The parameters are also compared under similar environment and weather conditions. The location information in latitude and longitude coordinates is matched using the Google Maps API. The results obtained on the Grafana dashboard are in line with the parameters communicated from the sensors and the data fetched in the Node-RED application. No delay is noticed between the sensor value in the serial monitor and the Node-RED application (Fig. 5), while a delay of 3 s is found between the data communicated from the sensor and the Grafana dashboard.

6 Conclusions Precision agriculture has improved the efficiency of managing connected farms with the inclusion of digital data systems that apply the technique of delivering what is needed, when and where it is needed. Adoption of precision farming in connected farms improves agricultural yields and reduces environmental risks. Data-based digital systems with free software and open protocols represent a modus operandi for connected farms with visualizations. The location-based information assisted in identifying precise parameters accumulated at the exact location. The analysis confirms that the parameters fetched from the sensors are communicated to free-software-based visualization dashboards using open protocols. The data-based management systems handle data from the connected farms, and the results are orchestrated from the implemented system. Digital systems combine analog/digital signals, communicate them to a time series database using open protocols, and visualize them in dashboards. The project implementation experienced some hardware failures and networking issues in connecting devices and software during the deployment of the system. The failures were addressed by replacing hardware and setting the network topology to local and global network configurations. The data is generated in various forms and is consolidated for future usage. The evolution of Agriculture 5.0 is dealt with through connected farms and the digitization of data to make next-generation machines smarter for agriculture.

References 1. Upadhyaya BR, Eryurek E (1992) Application of neural networks for sensor validation and plant monitoring. Nucl Technol 97(2):170–176 2. Siddagangaiah S (2016) A novel approach to IoT based plant health monitoring system. Int Res J Eng Technol 3(11):880–886 3. Tangworakitthaworn P, Tengchaisri V, Rungsuptaweekoon K, Samakit T (2018) A game-based learning system for plant monitoring based on IoT technology. In: 2018 15th international joint conference on computer science and software engineering (JCSSE), 2018, pp 1–5. https://doi. org/10.1109/JC-SSE.2018.8457332 4. Bangera T, Chauhan A, Dedhia H, Godambe R, Mishra M (2016) IoT based smart village. Int J Eng Trends Technol (IJETT) V32(6):301–305. ISSN: 2231-5381 5. Kansara K, Zaveri V, Shah S, Delwadkar S, Jani K (2015) Sensor based automated irrigation system with IOT: a technical review. Int J Comput Sci Inf Technol 6(6):5331–5333 6. Dursun M, Ozden S (2011) A wireless application of drip irrigation automation supported by soil moisture sensors. Sci Res Essays 6:1573–1582 7. Anam SM, Devender M (2015) A low cost internet of things network for contamination detection in drinking water systems using raspberry pi. Int J Electr Electr Comput Sci Eng 2:49–53 8. Keerthi V, Kodandaramaiah GN (2015) Cloud IoT based greenhouse monitoring system. Int J Eng Res Appl 5(10):35–41 9. Kang JJ, Larkin H (2016) Inference of personal sensors in Internet of Things. Int J Inf Commun Technol Appl 2(1). https://doi.org/10.17972/ijicta20162125 10. TonKe F (2013) Smart agriculture based on cloud computing and IoT. J Convergence Inf Technol 8(2):210–216 11. Sarkar PJ, Chanagala S (2016) A survey on IoT based digital agricultural monitoring system and their impact on optimal utilization of resources. IOSR J Electr Commun Eng 11:01–04. https://doi.org/10.9790/2834-11120104 12. Barakat SM (2016) Internet of things: ecosystems and applications. J Curr Res Sci 4:32–34


13. Buckely J (2006) The Internet of Things: from RFID to the next-generation pervasive networked systems, 1st edn. Auerbach Publications, New York 14. Abdul Aziz MHHID, Ismail MJ, Mehat M, Haroon NS (2009) Remote monitoring in agricultural greenhouse using wireless sensor and Short Message Service (SMS). Int J Eng Technol 9:35–43 15. Deepika K, Usha J (2020) Implementation of personnel localization & automation network (PLAN) using internet of things (IoT). Procedia Comput Sci 171:868–877. ISSN: 1877-0509 16. Qin P, Lu Z, Zhu T (2015) Application research on agricultural production throughout the internet. In: Proceedings of the 3rd international conference on management, education and information and control (EIC’ 15) 17. Jaishetty SA, Patil R (2016) IoT sensor network based approach for agricultural field monitoring and control. Int J Res Eng Technol 5:45–48 18. Parameswaran G, Sivaprasath K (2016) Arduino based smart drip irrigation system using internet of things. Int J Eng Sci Comput 6:5518–5521. https://doi.org/10.4010/2016.1348 19. Channe H, Kothari S, Kadam D (2015) Multidisciplinary model for smart agriculture using Internet-of-Things (IoT), sensors, cloud-computing, mobile-computing and big-data analysis. IJCTA 6:374–382 20. Perera C, Liu CH, Jayawardena S (2015) The emerging internet of things marketplace from an industrial perspective: a survey. IEEE Trans Emerg Top Comput 3:585–598. https://doi.org/ 10.1109/TETC.2015.2390034 21. Borgohain T, Kumar U, Sanyal S (2015) Survey of security and privacy issues of internet of things. Int J Adv Netw Appl 6:2372–2378 22. Ashton K (2009) That-Internet of Things. RFiD J 23. Guang Y, Guining G, Jing D, Zhaohui L, He H (2011) Security threats and measures for the Internet of Things. J Tsinghua Univ 51:1335–1340 24. Patil VC, Al-Gaadi KA, Biradar DP, Rangaswamy M (2012) Internet of Things (IoT) and cloud computing for agricultural: an overview. In: Proceedings of the agro-informatics and precision agriculture (AIPA’ 12), India, pp 292–296 25. Shao W, Li L (2009) Analysis of the development route of IoT in China. Perking: China Sci Technol Inform 24:330–331 26. Sen J (2009) A survey on wireless sensor network security. Int J Commun Netw Inform Secur 1:55–78 27. Wu Z, Li S, Yu M, Wu J (2015) The actuality of agriculture internet of things for applying and popularizing in China. In: Proceedings of the international conference on advances in mechanical engineering and industrial informatics (EII’ 15) 28. Deepika K, Usha J (2021) Automation of smart monitoring for person localisation and alerting network. Int J Inf Technol Manag 20(1/2):145–159 29. Deepika K, Renuka Prasad B (2022) High-precision indoor tracking using ultra-wide band devices and open standards. In: Smys S, Balas VE, Palanisamy R (eds) Inventive computation and information technologies. Lecture notes in networks and systems, vol 336. Springer, Singapore 30. IEC (2014) Internet of things: wireless sensor networks. International Electro-Technical Commission, Switzerland 31. Chen YK (2012) Challenges and opportunities of internet of things. In: Proceedings of the 17th Asia and South Pacific design automation conference, 30 Jan–2 Feb, IEEE Xplore Press, Sydney, NSW, Australia, pp 383–388. https://doi.org/10.1109/ASP-DAC.2012.6164978 32. Gutierrez J, Villa-Medina JF, Nieto-Garibay A, Porta-Gandara MA (2014) Automated irrigation system using a wireless sensor network and GPRS module. IEEE Trans Instrument Meas 63:166–176. https://doi.org/10.1109/TIM.2013.2276487 33. 
Deepika K, Usha J (2016) Investigations & implications on location tracking using RFID with global positioning systems. In: 2016 3rd International Conference on Computer and Information Sciences (ICCOINS), 2016, pp 242–247. https://doi.org/10.1109/ICCOINS.2016. 7783221

Chapter 44

Predicting the Gestational Period Using Machine Learning Algorithms R. Jane Preetha Princy, Saravanan Parthasarathy, S. Thomas George, and M. S. P. Subathra

1 Introduction The term "gestation" comes from the Latin word "gestare," which means "to carry or bear." The gestational period is defined as the time between conception and the birth of the fetus. Women's normal gestation periods range from 37 to 42 weeks [1]. Preterm or premature birth occurs when the gestational period is less than 37 weeks; post-term or post-mature birth occurs when it is greater than 42 weeks. According to the WHO, 15 million babies are born prematurely worldwide each year, with approximately 1 million babies dying as a result [2]. During the final stages of pregnancy, the baby experiences rapid growth; if the baby is born before the specified time, the risk of disability increases. Premature birth is a global issue: more than one in every ten infants is born prematurely, more than three weeks before the due date. High blood pressure, smoking, maternal BMI, alcohol and drug consumption, maternal age, previous preterm birth, C-section, and diabetes all play a role in preterm birth [3–5]. Early detection of premature births improves infant survival and allows medical practitioners to be better prepared for safe delivery [6]. Maternal smoking refers to women who have a proclivity to smoke while pregnant. According to a study conducted by Popova et al., maternal smoking is prevalent


worldwide, with Ireland, Uruguay, and Bulgaria being the worst-affected countries [7]. Maternal smoking during pregnancy causes restricted growth and low birth weight (LBW) in infants. Infants born with LBW are at an increased risk of subsequent illness, sudden infant death syndrome (SIDS), and long-term health issues in childhood and adulthood [8]. Furthermore, LBW may result in future negative consequences such as heart disease, type 2 diabetes, high blood pressure, and obesity. Malformations of the musculoskeletal, digestive, and cardiovascular systems are also possible, and facial clefts, eye defects, clubfoot, cryptorchidism, and hernia might occur [9]. The mother's age is an important factor in the safe delivery of the child. Conceiving a child as a teenager or as a woman over the age of 35 carries significant risks. According to the WHO [10], approximately 12 million teenage births occur worldwide each year, with 3.9 million facing maternal mortality, morbidity, and long-term health issues. Mothers under 15 years old are especially vulnerable to anemia, hypoglycemia, and pregnancy-related hypertension [11]. Pregnant teenage girls who smoke, drink, or use drugs during pregnancy are twice as likely as women over the age of 25 to have a low-birth-weight baby. Smoking also increases the risk of pregnancy complications such as preterm birth and stillbirth. Similarly, risks associated with pregnancy after the age of 35 include gestational diabetes, pre-eclampsia, low birth weight, miscarriage, premature birth, and Down syndrome [12]. Maternal height is another factor that may influence childbirth. According to studies, a shorter mother may be at a higher risk of pre-term delivery with shorter babies and a shorter gestation period [13]. The reason for this could be the mother's height influencing uterine or pelvic size, or a lack of the energy required for fetal development. Underweight and overweight mothers have also been identified as facing several risks during pregnancy. Infants born to overweight mothers (BMI > 25) are prone to heart defects, neural tube deformity, high cholesterol, and type 2 diabetes, whereas infants born to underweight mothers (BMI < 18) face premature birth, low birth weight, and other health complications [14]. The pregnancy risk dataset is large and contains a variety of attributes. Because humans' analytical abilities are limited, machine learning models are used to address the challenges. Machine learning methodologies use a statistical, data-driven approach to handle large datasets and provide a suitable solution to the problem by developing a model. Machine learning-enabled models are used in a variety of healthcare domains to improve decision-making, personalized treatment, surgical simulations, drug discovery, and the speed of disease detection. In this study, the mothers' behavior was examined, and preterm deliveries were predicted. Early detection of pregnancy risks may benefit the mother, the child, and healthcare providers [15, 16]. Considering the foregoing facts and advances, machine learning methodologies were used in this study to forecast the gestational period. The remainder of the paper is arranged as follows: Sect. 2 specifies the existing research accomplishments relating to machine learning methodologies. Section 3 reveals the characteristics of the dataset and the proposed model. Section 4 outlines the various methodologies that have been employed. Section 5 discloses the results of exploratory


analysis. Section 6 enumerates the experimental results. Finally, the paper concludes in Sect. 7 with the forthcoming research direction.

2 Literature Review Machine learning has the potential to make quick and accurate predictions, which could improve clinical diagnostic results. Machine learning models that use effective feature selection methods could be useful in ensuring the health of pregnant women and fetuses. Huang et al. investigated the effects of maternal smoking on fetal outcomes during pregnancy [17]. The findings revealed that smoking during pregnancy increases the likelihood of having an LBW fetus with a small chest circumference. Inoue et al. developed a logistic regression model to determine the impact of paternal and maternal smoking on fetal growth [18]. The study’s findings confirmed that both maternal and paternal smoking raises the risks of short birth length, LBW, and small head circumference in infants. Kobayashi et al. presented an overview of the relationship between smoking and small-for-gestational-age (SGA) fetus from cotinine levels in plasma [19]. Hoyt et al. created a multivariate logistic regression (MLR) model that confirmed the connection between secondhand smoking and preterm birth [20]. However, it lacked evidence of SGA birth occurrences. Wang et al. conducted a cohort survey among non-smoking Chinese women whose husbands were found to be smokers [21]. The proposed MLR model’s findings indicated that pregnant women whose husbands smoked had an increased risk of spontaneous abortion. Xaverius et al. compared nonsmoking mothers to smoking mothers and discovered that the latter had significantly higher LBW rates [22]. Patole and Paprikar [23] investigated the relationship between maternal BMI and infant birth weight. The findings demonstrated statistically that underweight mothers are more likely to have SGA babies, whereas overweight mothers have LGA babies. Ludwig and Currie predicted that mothers who gained more than 24 kg during pregnancy would have heavier babies [24]. Suzuki et al. used multiple linear and logistic regression models to determine that mothers who stopped smoking before or during early pregnancy were able to avoid negative consequences and have healthy babies [25]. Using logistic regression models, Ling et al. investigated the relationship between pre-pregnancy BMI and gestational weight gain (GWG) [26]. The findings suggested that mothers with high GWG may have heavier babies. Liu et al. used logistic regression to investigate the relationship between smoking doses and stages and preterm birth [27]. The results revealed that smoking increased the risk of pre-term birth regardless of dosage or stage. Ward et al. discovered that women who smoked and were exposed to tobacco had LBW babies and pre-term birth [28]. Ludvigsson et al. discovered that SGA is linked to an increased risk of developing infections and neurological diseases, which could lead to death [29]. Karthiga et al. created an application to calculate fetal weight by analyzing the mother’s age,


gestation period, plurality, and gender of the baby [30]. Knowing the fetus’s birth weight ahead of time allows the doctor to plan ahead of time and mothers to take special care of the child. Pan et al. [31] developed a machine learning-based web application model for predicting adverse birth risks. The logistic regression outperformed the others by serving all the objectives of the study. Kuhle et al. [32] compared various machine learning techniques for the prediction of fetal growth complications. SGA/LGA is heavily influenced by smoking, gestational weight gain, previous low birth infant or macrosomic infant, and pre-pregnancy BMI. Hang et al. developed a J48 model for predicting LBW that produced 90.3% AUC [33]. To predict the vital factors that contribute to prenatal death, Mboya et al. used artificial neural network, Naive Bayes, random forest, boosting, logistic regression, and bagged trees [34]. Except for bagged trees, the results revealed that all models performed indistinguishably. Poly aromatic hydrocarbon (PAH) was studied by Jain et al. in the development of an LBW fetus [35]. The SVM-based models revealed the most accurate predictors of LBW. Borson et al. investigated the factors that contributed to LBW [36]. The SVM and logistic regression models outperformed other models by predicting the factors with high accuracy. Senthil Kumar and Paulraj proposed a decision-making system for predicting LBW in infants [37]. The classification trees outperformed the other models with accuracy of 89.95%. Metgud et al. developed Indian Council of Medical Research (ICMR) scoring method for LBW prediction. The proposed model achieved 80.6% sensitivity, but its positive predictive value is 43.8% [38]. Singha et al. discovered that logistic regression was the best performing model for predicting infant mortality [39]. Vovsha et al. [40] overcame the difficulty of predicting preterm birth. Despite the unbalanced dataset, SVM produced better results. Ghosh et al. conducted a follow-up study to assess the prevalence of LBW caused by air pollution around the perimeter of the Air Toxic Monitoring Station in Los Angeles, California [41]. To estimate pollution and LBW, land-use-based regression and logistic regression were used. Using data mining techniques, Chen et al. investigated the vulnerability of pre-term birth [42]. Multiple births were found to be the most common risk factor, followed by hemorrhage during pregnancy and other risk factors such as diseases, maternal smoking, maternal body weight, and so on.

3 Dataset The Maternity and Child Health Dataset from Kaggle [43] was used in this study. The dataset includes 1174 records with the following characteristics: birth weight, gestational days, maternal age, maternal height, maternal pregnancy weight, and maternal smoking. As birth weight is a post-gestational period attribute, it was removed. Entries that are more than 42 weeks (>300 days) were excluded based on the ideal gestational period [44]. The final dataset contains 1108 entries with 4 predictors to forecast the mother’s gestational days. Table 1 displays the attributes of the dataset.
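A minimal sketch of the preprocessing described above is shown below, assuming the Kaggle CSV has been downloaded locally; the file name and column names are illustrative assumptions, not the dataset's actual labels.

```python
import pandas as pd

# File name and column names are assumptions for illustration.
df = pd.read_csv("maternity_and_child_health.csv")

# Drop the post-gestational attribute and exclude entries beyond the ideal gestational period.
df = df.drop(columns=["birth_weight"])
df = df[df["gestational_days"] <= 300]

X = df[["maternal_age", "maternal_height", "maternal_weight", "maternal_smoker"]]
y = df["gestational_days"]
print(X.shape, y.shape)   # roughly (1108, 4) and (1108,) after filtering
```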

Table 1 Feature information of the dataset

S. No. | Attribute name   | Description                 | Data type
1      | Maternal age     | Age of the mother           | Int (years)
2      | Maternal height  | Height of the mother        | Int (inches)
3      | Maternal weight  | Weight of the mother        | Int (lbs)
4      | Maternal smoking | Smoking habit of the mother | Binary
5      | Gestational days | Number of gestational days  | Int (days)

According to the data presented in Fig. 1a, the gestational period ranges from 148 to 300 days. The dataset consists of entries pertaining to women between the ages of 15 and 45 (refer to Fig. 1b). The mothers varied in height and weight from 53 to 72 inches and from 87 to 250 pounds, respectively (refer to Fig. 1c and d). A total of 39.4% of the mothers are smokers, while 60.6% are non-smokers (refer to Fig. 1e). Figure 1f represents the correlation between the attributes of the study.

4 Methodology Figure 2 represents the proposed approach of the study. The system is developed using Python and executed in a Jupyter notebook. Linear regression, robust regression, ridge regression, lasso regression, elastic net regression, polynomial regression, stochastic gradient descent, random forest regressor, SVM regressor, and artificial neural network models were employed in this study. Linear regression plots the independent variable against the dependent variable and establishes a relationship between the predictor and the target variable. Linear regression is represented as

$y = a_0 + a_1 x$  (1)

where $y$ is the dependent variable, $x$ is the independent variable, $a_0$ is the intercept of the line, and $a_1$ is the linear regression coefficient.
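As a simple illustration (not the authors' exact code), Eq. (1), generalised to the four predictors, can be fitted with scikit-learn using the X and y prepared from the dataset above; the split ratio and random seed are assumptions.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hold out part of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lin_reg = LinearRegression().fit(X_train, y_train)
print("intercept a0:", lin_reg.intercept_)
print("coefficients:", lin_reg.coef_)

y_pred = lin_reg.predict(X_test)   # predicted gestational days
```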

Ridge regression is used to analyze data that possess multicollinearity. With multicollinearity, due to high variability, the estimates may be distant from the true values. Ridge regression minimizes the standard errors by adding a degree of bias to the regression estimates.

Fig. 1 a Range of gestational days. b Range of maternal age. c Range of maternal height. d Range of pregnancy weight. e Smoking habit of mother. f Correlation between the attributes

Fig. 2 Proposed approach

$E(\hat{B} - B) = \left[(X'X + kI)^{-1} X'X - I\right] B$  (2)

Robust regression identifies and handles outliers in the data by minimizing their influence on the coefficient estimates. A special curve, called an influence function, governs the weight applied to each observation in robust regression.

$\min_{\beta} \sum_{j=1}^{N} \rho\left(y_j - x'_j \beta\right) = \min_{\beta} \sum_{j=1}^{N} \rho\left(e_j(\beta)\right)$  (3)

Least absolute shrinkage and selection operator is abbreviated as LASSO. Lasso shrinks the data values toward a central point, such as the mean. This method of regression is well suited to models with high levels of multicollinearity and simplifies certain aspects of the model selection process, such as variable selection and parameter elimination.

$\sum_{i=1}^{n}\left(y_i - \sum_{j} x_{ij}\beta_j\right)^2 + \lambda \sum_{j=1}^{p} \left|\beta_j\right|$  (4)

where $\lambda$ denotes the amount of shrinkage. Elastic net is a common form of regularized linear regression that precisely integrates the two penalties from the lasso and ridge methods. By learning from their limitations, this approach incorporates both LASSO and ridge regression to enhance the regularization of mathematical models.

$L_{enet}\left(\hat{\beta}\right) = \frac{\sum_{i=1}^{n}\left(y_i - x'_i\hat{\beta}\right)^2}{2n} + \lambda\left(\frac{1-\alpha}{2}\sum_{j=1}^{m}\hat{\beta}_j^2 + \alpha\sum_{j=1}^{m}\left|\hat{\beta}_j\right|\right)$  (5)

where $\alpha$ is the mixing parameter between ridge ($\alpha = 0$) and lasso ($\alpha = 1$). If a linear regression simply does not fit all the data points, polynomial regression may be suitable. Polynomial regression uses the relation between the variables $x$ and $y$ to find a curve that passes through the data points.

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$  (6)
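A hedged sketch of how the polynomial model of Eq. (6) could be built in the same scikit-learn setting follows; the degree-2 expansion is an assumption.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Degree-2 polynomial features followed by ordinary least squares.
poly_reg = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_reg.fit(X_train, y_train)
y_pred_poly = poly_reg.predict(X_test)
```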

In stochastic gradient descent, a few samples are selected at random from the entire dataset on every iteration. The notion of a "batch" in gradient descent represents the total number of samples that are randomly selected from the dataset. For each sample $i$ in $1, \ldots, m$:

$\theta_j = \theta_j - \alpha\left(\hat{y}^{(i)} - y^{(i)}\right) x_j^{(i)}$  (7)

Random forest regression employs an ensemble method, which combines the predictions from several learners to make a more precise prediction. To determine the final output, random forest combines multiple decision trees instead of relying on an individual decision tree. The importance of feature $i$ is computed as

$RFfi_i = \frac{\sum_{j \in \text{all trees}} normfi_{ij}}{\sum_{j \in \text{all features},\ k \in \text{all trees}} normfi_{jk}}$  (8)

A statistical method of examining the linear relationship between two continuous variables is support vector regression (SVR). SVM regression gives us the flexibility to define how much error is permissible in the model and will attempt to fit the error within a specific threshold.

$\arg\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{n=1}^{N}\xi_n$  (9)

The part of artificial intelligence that is intended to mimic the working of the human brain is the artificial neural network (ANN). ANNs consist of several nodes that collect the input and transmit it to the other nodes. There are three layers namely input layer, hidden layer, and the output layer. The actual processing takes place in the hidden layers. The values entering a hidden node are multiplied by predetermined numbers called weight. The weighted inputs are then summed up to produce a single number.

$E(w) = (w_0 + w_1 x_1 - y_1)^2 + (w_0 + w_1 x_2 - y_2)^2 + \cdots + (w_0 + w_1 x_n - y_n)^2$  (10)
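As an illustrative sketch only, a small fully connected network of the kind described above can be defined with Keras; the layer sizes, activations, and optimizer are assumptions, while the 100-epoch budget follows the setting reported later in the discussion.

```python
from tensorflow import keras

# A small multilayer perceptron for regression on the four predictors.
ann = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(X_train.shape[1],)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1),                 # single output node: gestational days
])
ann.compile(optimizer="adam", loss="mse", metrics=["mae"])
ann.fit(X_train, y_train, epochs=100, verbose=0)
y_pred_ann = ann.predict(X_test)
```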

5 Results In this study, we assess how well each model has learned using the mean absolute error (MAE), root mean square error (RMSE), R-squared score (R²), and explained variance score.

Mean Absolute Error (MAE) In statistics, the mean absolute error is the average of the absolute differences between the actual value and the measured value. The MAE ranges from 0 to infinity, and lower values indicate better model performance, which is why it is often called a negatively oriented score.

$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|x_i - \hat{x}_i\right|$  (11)

The MAE comparison of the various regression models is shown in Fig. 3. By achieving the minimal error, robust regression outperformed the other models, while the linear and ridge regression models performed almost equally well. The ANN model showed the weakest performance.

Fig. 3 Comparison of mean absolute errors


Root Mean Squared Error (RMSE) RMSE is the square root of the mean of the squared errors. Since it is scale-dependent, RMSE is used to compare the prediction errors of various models for a single variable and not between variables.

$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$  (12)

In a series of predictions, the MAE and the RMSE may be used together to identify the variation in the errors. The RMSE value is always larger than or equal to the MAE; the larger the difference between them, the greater the variance of the individual errors. In the comparison of RMSE values shown in Fig. 4, linear regression stands as the best-performing model, followed by ridge regression, elastic net regression, and the other models. Again, the ANN model is the worst performer.

Fig. 4 Comparison of root mean square errors

R-Squared Score (R²) The R² score is the statistical measure of the proportion of the variance in the dependent variable of a regression model that is determined by the independent variable or variables. It describes one variable's degree of variance in relation to another and is also called the coefficient of determination. R-squared scores range from 0 to 1 and are generally represented as percentage values from 0 to 100%. A score of 0% denotes that the model explains none of the variability of the response data around its mean, while 100% denotes that the model explains all of that variability. The higher the R-squared score, the better the model matches the results.

$R^2 = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}}$  (13)

From Fig. 5, linear regression tops the board by obtaining the highest score among all the models, followed by the ridge and elastic net regression models, while ANN falls to the last position.

Fig. 5 Comparison of R-squared scores

Explained Variance Score (EVS) The explained variance score is used to find the dissimilarity between a model and the real results. The difference between explained variance and R² is that the former uses the biased variance to calculate the fraction of variance explained, whereas the latter uses raw sums of squares. If the prediction error is unbiased, the two scores are the same.

$\text{EVS} = 1 - \frac{\mathrm{Var}\left(y - \hat{y}\right)}{\mathrm{Var}\left(y\right)}$  (14)

Fig. 6 Comparison of explained variance scores

According to the findings, the best performance was achieved using the stochastic gradient descent algorithm. On the other hand, the ridge regression model and the linear regression model both performed admirably. Both the ANN and random forest regressor models performed quite poorly (Refer to Fig. 6).
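For reference, the four scores discussed above can be computed for any of the fitted models with scikit-learn; this is a generic sketch, not the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, explained_variance_score)

def evaluate(y_true, y_pred):
    """Return the four evaluation scores used in this study."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
        "R2": r2_score(y_true, y_pred),
        "EVS": explained_variance_score(y_true, y_pred),
    }

print(evaluate(y_test, y_pred))   # e.g., for the linear regression predictions
```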

6 Discussions Table 2 contains a comprehensive summary of the results of the regression models. The models' efficiency is measured using parameters such as MAE, RMSE, R² score, and explained variance score.

Table 2 Performances of various machine learning models

Model                       | MAE       | RMSE      | R²         | EVS
Linear regression           | 8.658248  | 11.801136 | 0.010901   | 0.021675
Robust regression           | 8.53003   | 12.001586 | − 0.022985 | − 0.017308
Ridge regression            | 8.674093  | 11.807708 | 0.009799   | 0.02057
Lasso regression            | 8.714046  | 11.868197 | − 0.000372 | 0.010419
Elastic net regression      | 8.672124  | 11.808913 | 0.009597   | 0.020362
Polynomial regression       | 8.789797  | 11.91812  | − 0.008806 | 0.001009
Stochastic gradient descent | 8.740095  | 11.848645 | 0.002921   | 0.021743
Random forest regressor     | 10.943913 | 15.003016 | − 0.598635 | − 0.551122
SVM regressor               | 9.959998  | 13.652525 | − 0.323787 | − 0.322722
Artificial neural network   | 11.56576  | 15.179818 | − 0.636534 | − 0.592671

The results show that the robust and linear regression models had the lowest error rates in MAE and RMSE. Linear regression and SGD, on the other hand, achieved the highest positions in determining the R² and explained variance scores, respectively. When comparing the overall performance of each model, robust regression achieved the lowest error in MAE. However, RMSE,

R. J. P. Princy et al.

7 Conclusion and Future Work This study’s primary objective is to calculate the number of gestational days by examining the mother’s age, height, weight, and smoking habit in addition to other demographic information. Several different machine learning regression models were used, and their performances were evaluated using statistical measures such as the mean absolute error, root mean squared error, R2 score, and explained variance score. It has been noticed that, among the many machine learning models, the linear regression model demonstrated the best performance. As a result, it might be possible to use it to forecast the duration of the gestational period. It would be beneficial to the healthcare professionals in providing proper therapy if they had this information. In addition to that, it would be helpful in dealing the undesired scenario with a plan that involves mitigation. As a next step, we plan to work on improving the outcome by applying advanced deep learning-based models.

References 1. Pregnancy Lingo: what does gestation mean? (2018). https://www.healthline.com/health/pre gnancy/what-is-gestation#gestation-period 2. Preterm birth (2018). https://www.who.int/news-room/fact-sheets/detail/preterm-birth 3. What are the risk factors for preterm labor and birth? (2018). https://www.nichd.nih.gov/hea lth/topics/preterm/conditioninfo/who_risk 4. Di Renzo GC, Giardina I, Rosati A, Clerici G, Torricelli M, Petraglia F (2011) Maternal risk factors for preterm birth: a country-based population analysis. Eur J Obstet Gynecol Reprod Biol 159(2):342–346. https://doi.org/10.1016/j.ejogrb.2011.09.024 5. Zhang Y-P, Liu X-H, Gao S-H, Wang J-M, Gu Y-S, Zhang J-Y, Zhou X, Li Q-X (2012) Risk factors for preterm birth in five maternal and child health hospitals in Beijing. PLoS ONE 7(12):e52780. https://doi.org/10.1371/journal.pone.0052780 6. Georgiou HM, Di Quinzio MKW, Permezel M, Brennecke SP (2015) Predicting preterm labour: current status and future prospects. Dis Markers 2015:1–9. https://doi.org/10.1155/ 2015/435014 7. Lange S, Probst C, Rehm J, Popova S (2018) National, regional, and global prevalence of smoking during pregnancy in the general population: a systematic review and meta-analysis. Lancet Glob Health 6(7):e769–e776. https://doi.org/10.1016/s2214-109x(18)30223-7 8. Greenhalgh EM, Ford C, Winstanley MH (2020) 3.8 Child health and maternal smoking before and after birth. In: Greenhalgh EM, Scollo MM, Winstanley MH (eds) Tobacco in Australia: facts and issues. Cancer Council Victoria, Melbourne. Available from http://www.tobaccoin australia.org.au/chapter-3-health-effects/3-8-chid-health-and-maternal-smoking 9. Hackshaw A, Rodeck C, Boniface S (2011) Maternal smoking in pregnancy and birth defects: a systematic review based on 173 687 malformed cases and 11.7 million controls. Hum Reprod Update 17(5):589–604. https://doi.org/10.1093/humupd/dmr022 10. Adolescent pregnancy (2020). https://www.who.int/news-room/fact-sheets/detail/adolescentpregnancy 11. Risks of teenage pregnancy. https://reverehealth.com/live-better/risks-teen-pregnancy/#: ~:text=Teens%20often%20don’t%20get,pregnancy%2Drelated%20high%20blood%20pres sure. 12. Pregnancy after age 35. https://www.marchofdimes.org/complications/pregnancy-after-age35.aspx

44 Predicting the Gestational Period Using Machine Learning …

559

13. Shorter women have shorter pregnancies (2015). https://www.marchofdimes.org/news/shorterwomen-have-shorter-pregnancies.aspx 14. Weight, fertility, and pregnancy (2018) Womenshealth.Gov. https://www.womenshealth.gov/ healthy-weight/weight-fertility-and-pregnancy#:%7E:text=Babies%20born%20to%20moth ers%20who,risk%20for%20health%20problems%2C%20including%3A&text=Premature% 20birth%20(also%20called%20preterm,5%201%2F2%20pounds) 15. Sazawal S, Ryckman KK, Das S, Khanam R, Nisar I, Jasper E, Dutta A, Rahman S, Mehmood U, Bedell B, Deb S, Bahl R (2021) Machine learning guided postnatal gestational age assessment using new-born screening metabolomic data in South Asia and sub-Saharan Africa. BMC Pregnancy Childbirth 21(1):1–11 16. Wylie BJ, Lee AC (2022) Leveraging artificial ıntelligence to ımprove pregnancy dating in low-resource settings. NEJM Evid 1(5):EVIDe2200074 17. Huang S-H, Weng K-P, Huang S-M, Liou H-H, Wang C-C, Ou S-F, Lin C-C, Chien K-J, Lin C-C, Wu M-T (2017) The effects of maternal smoking exposure during pregnancy on postnatal outcomes: a cross sectional study. J Chin Med Assoc 80(12):796–802. https://doi.org/10.1016/ j.jcma.2017.01.007 18. Inoue S, Naruse H, Yorifuji T, Kato T, Murakoshi T, Doi H, Subramanian SV (2016) Impact of maternal and paternal smoking on birth outcomes. J Public Health 39(3):1–10. https://doi. org/10.1093/pubmed/fdw050 19. Kobayashi S, Sata F, Hanaoka T, Braimoh TS, Ito K, Tamura N, Araki A, Itoh S, Miyashita C, Kishi R (2019) Association between maternal passive smoking and increased risk of delivering small-for-gestational-age infants at full-term using plasma cotinine levels from the Hokkaido study: a prospective birth cohort. BMJ Open 9(2):e023200. https://doi.org/10.1136/bmjopen2018-023200 20. Hoyt AT, Canfield MA, Romitti PA, Botto LD, Anderka MT, Krikov SV, Feldkamp ML (2018) Does maternal exposure to secondhand tobacco smoke during pregnancy increase the risk for preterm or small-for-gestational age birth? Matern Child Health J 22(10):1418–1429. https:// doi.org/10.1007/s10995-018-2522-1 21. Wang L, Yang Y, Liu F, Yang A, Xu Q, Wang Q, Shen H, Zhang Y, Yan D, Peng Z, He Y, Wang Y, Xu J, Zhao J, Zhang H, Zhang Y, Dai Q, Ma X (2018) Paternal smoking and spontaneous abortion: a population-based retrospective cohort study among non-smoking women aged 20– 49 years in rural China. J Epidemiol Community Health 72(9):783–789. https://doi.org/10. 1136/jech-2017-210311 22. Xaverius PK, O’Reilly Z, Li A, Flick LH, Arnold LD (2019) Smoking cessation and pregnancy: timing of cessation reduces or eliminates the effect on LBW. Matern Child Health J 23(10):1434–1441. https://doi.org/10.1007/s10995-019-02751-2 23. Patole KP, Paprikar DS (2018) To study the correlation between maternal body mass index and birth weight of the baby. MVP J Med Sci 5(2):222–225. https://doi.org/10.18311/mvpjms/ 2018/v5i2/18672 24. Ludwig DS, Currie J (2010) The association between pregnancy weight gain and birth weight: a within-family comparison. Lancet 376(9745):984–990. https://doi.org/10.1016/S0140-673 6(10)60751-9. ISSN: 0140-6736 25. Suzuki K, Sato M, Zheng W, Shinohara R, Yokomichi H, Yamagata Z (2014) Effect of maternal smoking cessation before and during early pregnancy on fetal and childhood growth. J Epidemiol 24(1):60–66. https://doi.org/10.2188/jea.je20130083 26. Tsai IH, Chen CP, Sun FJ, Wu CH, Yeh SL (2012) Associations of the pre-pregnancy body mass index and gestational weight gain with pregnancy outcomes in Taiwanese women. Asia Pac J ClinNutr 21(1):82–87. 
PMID: 22374564 27. Liu B, Xu G, Sun Y, Qiu X, Ryckman KK, Yu Y, Snetselaar LG, Bao W (2020) Maternal cigarette smoking before and during pregnancy and the risk of preterm birth: a dose–response analysis of 25 million mother–infant pairs. PLoS Med 17(8):e1003158. https://doi.org/10.1371/ journal.pmed.1003158 28. Ward C, Lewis S, Coleman T (2007) Prevalence of maternal smoking and environmental tobacco smoke exposure during pregnancy and impact on birth weight: retrospective study using Millennium Cohort. BMC Public Health 7(1). https://doi.org/10.1186/1471-2458-7-81


29. Ludvigsson JF, Lu D, Hammarström L, Cnattingius S, Fang F (2018) Small for gestational age and risk of childhood mortality: a Swedish population study. PLoS Med 15(12):e1002717. https://doi.org/10.1371/journal.pmed.1002717 30. Karthiga S, Indira K, Nisha Angeline CV (2019) Machine learning model to predict birth weight of new born using tensorflow. In: First ınternational conference on secure reconfigurable architectures & intelligent computing (SRAIC 2019), pp 72–90. https://doi.org/10.5121/csit. 2019.91506 31. Pan I, Nolan LB, Brown RR, Khan R, van der Boor P, Harris DG, Ghani R (2017) Machine learning for social services: a study of prenatal case management in Illinois. Am J Public Health 107(6):938–944. https://doi.org/10.2105/AJPH.2017.303711. Epub 2017 Apr 20. PMID: 28426306 32. Kuhle S, Maguire B, Zhang H et al (2018) Comparison of logistic regression with machine learning methods for the prediction of fetal growth abnormalities: a retrospective cohort study. BMC Pregnancy Childbirth 18:333. https://doi.org/10.1186/s12884-018-1971-2 33. Hange U, Selvaraj R, Galani M, Letsholo K (2018) A data-mining model for predicting LBW with a high AUC. In: Lee R (ed) Computer and information science, vol 719; Studies in computational ıntelligence, vol 719. Springer Nature, Switzerland AG, pp 109–121. https:// doi.org/10.1007/978-3-319-60170-0_8 34. Mboya IB, Mahande MJ, Mohammed M, Obure J, Mwambi HG (2020) Prediction of perinatal death using machine learning models: a birth registry-based cohort study in northern Tanzania. BMJ Open 10(10):e040132. https://doi.org/10.1136/bmjopen-2020-040132. PMID: 33077570 35. Kumar SN, Saxena P, Patel R, Sharma A, Pradhan D, Singh H, Deval R, Bhardwaj SK, Borgohain D, Akhtar N, Raisuddin S, Jain AK (2020) Predicting risk of LBW offspring from maternal features and blood polycyclic aromatic hydrocarbon concentration. Reprod Toxicol 94:92–100. https://doi.org/10.1016/j.reprotox.2020.03.009. Epub 2020 Apr 10. PMID: 32283251 36. Borson NS, Kabir MR, Zamal Z, Rahman RM (2020) Correlation analysis of demographic factors on LBW and prediction modeling using machine learning techniques. In: 2020 Fourth world conference on smart trends in systems, security and sustainability (WorldS4). London, United Kingdom, pp 169–173. https://doi.org/10.1109/WorldS450073.2020.9210338 37. Senthilkumar D, Paulraj S (2015) Prediction of LBW infants and its risk factors using data mining techniques. In: Proceedings of the 2015 international conference on industrial engineering and operations management, pp 186–194 38. Metgud C, Naik V, Mallapur M (2013) Prediction of LBW using modified Indian council of medical research antenatal scoring method. J Matern Fetal Neonatal Med 26(18):1812–1815. https://doi.org/10.3109/14767058.2013.804046. Epub 2013 Jun 10. PMID: 23662690 39. Singha AK, Phukan D, Bhasin S, Santhanam R (2016) Application of machine learning in analysis of infant mortality and its factors. Work Pap 1–5 40. Vovsha I, Rajan A, Salleb A, Raja A, Radeva A, Diab H, Tomar A, Wapner R (2014) Predicting preterm birth is not elusive: machine learning paves the way to individual wellness. In: AAAI Spring symposium—technical report, pp 82–89 41. Ghosh JKC, Wilhelm M, Su J, Goldberg D, Cockburn M, Jerrett M, Ritz B (2012) Assessing the influence of traffic-related air pollution on risk of term LBW on the basis of land-use-based regression models and measures of air toxics. Am J Epidemiol 175(12):1262–1274. https:// doi.org/10.1093/aje/kwr469 42. 
Chen HY, Chuang CH, Yang YJ, Wu TP (2011) Exploring the risk factors of preterm birth using data mining. Expert Syst Appl 38(5):5384–5387 43. Maternity and child health (2020). https://www.kaggle.com/athulmathewkonoor/maternityand-child-health 44. Ideal pregnancy length: an unsolved mystery (2013). https://blog.oup.com/2013/08/idealpregnancy-length-human-reproduction/#:~:text=If%20healthy%20pregnancies%20can%20v ary,it%20might%20be%2042%20weeks

Chapter 45

Digital Methodologies and ICT Intervention to Combat Counterfeit and Falsified Drugs in Medicine: A Mini Survey

Munirah Alshabibi, Elham Alotaibi, M. M. Hafizur Rahman, and Muhammad Nazrul Islam

M. Alshabibi · E. Alotaibi · M. M. Hafizur Rahman: Department of Computer Networks and Communications, CCSIT, King Faisal University, Al Hassa 31982, Saudi Arabia
M. N. Islam: Department of Computer Science and Engineering, Military Institute of Science and Technology, Dhaka 1216, Bangladesh

1 Introduction

Sickness and healing are natural phenomena of life, but it is very harmful to take medicines that are not suitable for the treatment of a disease. One of the most widely used techniques to prevent the counterfeiting of medicines is making them traceable through serialization. Patients also have a role to play by choosing reliable, certified companies and following their instructions. The channels used to reach patients with fraudulent drugs are unauthorized, illegal, and online pharmacies [1].

To frame the problem addressed in this paper, we note that the spread of fraudulent medicines is one of the greatest threats to humanity. The reason lies in when medicines are used: they are taken in cases of physical or psychological weakness, when a person is at their weakest, with the aim of returning the physical or psychological condition to its normal, healthy state. The increasing spread of fraudulent medicines has created many risks and threats to human health, and demands to address this problem have therefore grown. Analysis, study, and attention to this problem are critical for the present and the future. Providing and evaluating proposals and the cyber measures taken to prevent and address this problem are extremely important for people in general and for the health sectors in particular, so that technologies that may prevent the spread of counterfeit medicines and contribute to the protection and preservation of human health can be applied around the world. One of the main risks is that the drug itself makes the patient sicker, which in turn threatens the further spread of disease. Suggesting a solution to prevent and mitigate this spread is therefore extremely important, and the solutions considered here are digital technologies that can contribute to achieving that goal. It is important to clarify the value of integrating cybersecurity and digital technologies to ensure that medicines are of high quality; in addition, this study surveys a group of articles related to the topic, identifies the most important techniques proposed in those articles, and assesses the most efficient ones.

Recent studies have found that developing countries are affected most by counterfeit drugs. Counterfeit drugs are medicines used in the treatment of diseases, including infectious diseases, that have a detrimental effect on health [2]. Substandard and counterfeit medicines generally do not respect intellectual property rights or the law, and counterfeiting may include changing the brand or replacing it with another brand [3–5]. The problem must of course be addressed to maintain people's health and safety, and most of the tools used for this purpose focus on detecting counterfeit medicines. One class of solutions is information and communication technology (ICT), which is concerned with the development and maintenance of computer systems and software, including data processing [6]. Many digital methods and solutions may contribute to the prevention of counterfeiting and its proliferation, for example computational methods, radio frequency identification, and online verification [7]. The ineffectiveness of medicines has become a major problem, as it has been discovered that medicines are often falsified and their chemical compositions altered. The goals of preventing counterfeit drugs using ICT solutions can be summarized as follows:

1. Scientific cooperation between drug companies around the world, law enforcement, and the application of the highest penalties to manufacturers of counterfeit drugs.
2. Saving money rather than wasting it on ineffective medicines.
3. Stopping the erosion of people's trust in the health system and drug manufacturers.
4. Making full use of technology and its services in the service of mankind.

The objectives of this review study are to examine the earlier studies focusing on digital interventions to protect against counterfeit drugs, and to highlight the evolution of the techniques used to combat counterfeit drugs along with their merits and demerits in preventing drug counterfeiting.


2 Selection of Papers for Literature Review

We used the PRISMA approach (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) to select and analyze the articles relevant to our subject. This systematic review was carried out in four main phases. In the first phase, the search terms were formulated as: (counterfeit medicines OR fake medicines), (e-procurement OR transparency), (hyper-ledger OR pharmaceutical industry), (security intervention AND technology intervention), and (counterfeit OR falsified). The search was conducted using the research papers available in the Saudi Digital Library and Google Scholar, with the following inclusion criteria: papers that focus on digital intervention to prevent counterfeiting in medicines, and papers published between August 2018 and March 2021. Papers were excluded mainly because they fell outside the digital scope and digital analysis, because their publication date was old, or because they focused on the medical field more than on the technical one.

In the identification stage, 70 research papers were identified from different databases; after removing duplicates, 50 studies remained. These 50 studies entered the screening stage, where 19 papers were excluded after reading the title and abstract. In the eligibility stage, 11 more papers were excluded after reading the introduction of each paper, leaving 20 approved research papers: 9 quantitative and 11 qualitative, as shown in Fig. 1.

Fig. 1 Schematic diagram of selection of papers for literature review by PRISMA


3 Existing Works

The research studies were reviewed and summarized, highlighting the threats and the techniques used to mitigate counterfeit drugs. The articles related to the topic are presented below so that the reason for selecting them is clear. The articles share a common aim: to reduce harm and to provide patients with good, effective medicines that do not endanger their lives.

Kaihlanen et al. [8] aimed to identify the challenges that vulnerable groups face when using digital health services during the COVID-19 period. The needs for implementation were identified in order to meet health and comfort needs in digital access, and it was noted that, despite the benefits and abundance of digital health services, traditional manual services cannot be dispensed with if digital and traditional needs are to be reconciled. Difficulty in accessing the services was caused by a lack of digital skills and by a lack of training and support.

Mackey and Cuomo [9] aimed to explore how digital technologies are used to prevent drug fraud and corruption, and to increase transparency and fraud detection for population health outcomes. They examined a number of related articles and highlighted that counterfeit medicines are one of the biggest challenges facing global health, but that viable solutions to improve transparency and reduce counterfeit drugs have not yet been developed.

Uddin [10] described the new MedLedger system, which is tracked by blockchain technology and uses codes embedded inside it. This system helps carry out drug-supply transactions with high efficiency and security, which enhances reliability and reduces the possibility of interference with records, transactions, and all data stored in the system. The registration, transportation, and tracing of drugs are achieved through the codes and serialization of these drugs, combined with digital notarial certificates and cryptographic functions. The solution is based on decentralized technology, so data cannot be changed after it has been stored in MedLedger.

Ratta et al. [11] examined the use of IoT and blockchain technology to enhance health functions; the usability of these technologies was tested in patient monitoring, drug tracking, and medical records management. The use of these systems in various fields is discussed, among them the health field, and the methods tried for successfully integrating the two technologies are presented. However, in the blockchain the identity of the sender and receiver is not revealed, which can cause problems, and invalid transactions are simply ignored, which can force a transaction to be executed again and waste time.

Ciapponi et al. [12] focused on smart mobile applications, downloaded and used on smartphones, that can affect public health outcomes. Techniques for assessing the downloaded applications and their quality were used to judge the validity of clinical and public health results. Applications were also used to identify pills by shape, color, and imprint, and only rarely by photographs, in order to identify pills correctly. The report did not find a direct evaluation of substandard and falsified (SF) products in the real world, it was not possible to perform a meta-analysis, and in the field of drug identification only a few applications provide the feature of identifying pills from photographs.

Barrett [13] targeted pharmacies in England in terms of their readiness to implement the European Union directive on counterfeit medicines. The study was carried out at the national level, so practitioners abroad and respondents who were not pharmacists were excluded; a survey was sent by e-mail, and the responses received were classified into several categories of readiness for implementation. Not all pharmacies had adopted this technology, as many of them were not ready and had not implemented it, likely due to the lack of policies around the technology and the lack of professional awareness, as well as limited awareness among patients about techniques for detecting fraudulent drugs.

Rasheed et al. [6] presented the new techniques that have been adopted to catch drugs that are counterfeit or of poor quality, covering several IT tools, including mobile authentication applications, online drug safety alerts, databases, and other tools used to detect poor-quality drugs. These tools have a profound impact on success and are easy to use, operate, and access, as well as being low cost, which also means they can be easy to disable.

Arora and Sharma [14] discussed the substandard medicines that have spread in some geographical areas and their impact on patients, with a focus on malaria. They pointed out the impact of malaria on the world and the attempts to combat deaths due to this disease, and noted that the threat posed by counterfeit antimalarials continues, affecting not only the community but also the reliability and credibility of health institutions. Searches were conducted with specific terms, mostly within particular geographical areas where forged treatments have spread. The authors clarified that there is a need to define a method to prevent the entry of counterfeit medicines and that medicines must be examined before being distributed to patients; to identify counterfeit medicines effectively, better techniques are needed, which means the current techniques are not effective enough.

Fittler et al. [15] addressed the period of the COVID-19 epidemic, when access to illegal or counterfeit medicines became easy and fast. An infodemiology methodology was used to collect and analyze data and to assess patient safety risks. Because of the epidemic, the Internet and digitization accelerated rapidly; because of the distancing rules applied all over the world, people resorted to the Internet for everything they needed, access to illegal pharmacies became easy and simple as these pharmacies exploited the epidemic conditions, and thus the purchase of fake and harmful medicines from bad pharmacies spread to a harmful level.
It was claimed that ivermectin can prevent the spread of COVID-19 in the body, studies appeared on this, and illegal pharmacies were set up to sell this treatment. The infodemiology methodology was used to detect points that sell ivermectin illegally. There are still no strict solutions to eliminate the illicit sources of counterfeit medicines.

Rebiere et al. [16] addressed the importance of X-rays, chemical measurements, and spectrophotometric analysis in the analysis of counterfeit products. X-ray fluorescence reveals the chemicals in suspicious samples and can be used to validate samples properly. With this methodology, counterfeit and falsified drugs were detected in the composition of several samples. The technique is excellent but not sufficient on its own; other techniques have to be applied with it to ensure efficiency in some tests, and a further shortcoming is that the technique is still not widely used.

Raijada et al. [17] noted that patient dosing, drug and drug-product tracing, health platforms, treatment tracking, and diagnostics are all difficult to combine in a single model, and few works offer a model that does. It is important that an increasing number of advanced digital health solutions be integrated to activate the feedback loop between patients and medicines. The work provides an overview of the designs proposed for new products and medicines, an example being therapies that link pharmacies and the digital world. These designs still face challenges, among them technological and economic challenges as well as issues related to information security and privacy.

Baker [1] addressed the false and misleading allegations about the coronavirus, its alleged treatments, and its mode of transmission that circulated through social networks and were among the most important challenges during the corona pandemic. Dealing with misleading information relies on methods such as updating the policies of technology companies by promoting trusted content and blocking unreliable content, and removing accounts that spread malicious claims on social media. Raising the level of trusted content may be useful for people who trust public authorities, but this method is ineffective for those who mistrust or are suspicious of experts and political elites; information may also not reach people at all if they do not follow these authorities.

Mackey [3] addressed healthcare records management, which aims to share data and information about patients, employees, and health workers across the various stakeholders in health care while maintaining the privacy and confidentiality of the data; despite that, such data is vulnerable to theft and forgery. To maintain data confidentiality, blockchain technology has been applied to facilitate better data management, provenance, and security. There is no data exchange outside the blockchain, and the more chains there are, the more standards will be required for the interoperability of one chain with another.

Troncoso [5] addressed primary health care: a person's ability to access quality, comprehensive medical services throughout life. Primary care is not available to all people in the world, especially in low- and middle-income countries. Machine learning and artificial intelligence have created a revolution in improving primary healthcare conditions and encouraging individuals to maintain a good and stable health condition, transforming patients from passive recipients into active participants in their own care. The biggest challenge for the application of artificial intelligence is citizens' sense of responsibility for identifying data while giving priority to data protection.


Ghanem [2] addressed one of the most important health problems: counterfeit and fraudulent medicines and the places where they circulate, whether online illegally or in pharmacies in rural countries and regions. The methods discussed to prevent the spread of fraudulent medicines are: (1) collecting data and information about fake medicines, where the difficulty is the lack of consensus, which leads to differences in the recorded numbers; (2) addressing the lack of global coordination, since fake medicines are not limited to low-income countries but concern the world as a whole, so cooperation between the strongest medical regulatory authorities is needed to prevent fraudulent medicines and their spread; and (3) preparing reports on fraudulent drugs through a rapid electronic alert model using a global supply-chain monitoring and control system, which allows alerts to be distributed all over the world and thus enhances detection, response, and prevention. More comprehensive insight is needed, relative to regions and economic status, to determine the best approach to tackle this global problem in a localized manner.

Pascu et al. [7] noted that counterfeit medicines are one of the most important public health threats all over the world and that a unified system must be applied to prevent their spread. To solve the problem of counterfeit medicines, the latest traceability technologies have been used through the serialization of medicines: each pharmaceutical unit is given a unique number, which is used to track the product and verify its authenticity in the distribution chain. Counterfeit medicines are a global threat to health, requiring a clear, standardized system that can be implemented and applied around the world.

Khurshid [4] pointed out that the lack of accurate data and the spread of misinformation are among the most important obstacles to protecting human health and well-being, as the corona pandemic revealed shortcomings in health institutions. To solve the problem of missing and poorly protected data, the use of blockchain technology has been studied and applied, as it builds networks of trust and security based on encryption of the data. The digital transformation from paper to digital systems is still in progress, and the costs of this technology are very high.

Troein [18] noted that self-diagnosis and self-prescription are among the reasons for the spread of the illegal website market, which makes access to fraudulent medicines more likely; fraudulent medicines are one of the most serious threats to human health. To confront this danger it is necessary, first, to increase the knowledge of doctors and the public; second, to prepare reports on the names of fraudulent medicines and update them continuously; and third, to educate patients through publications and to report any suspected negative effects of medicines. No effective technique was suggested or used to prevent this problem.

Hassan et al. [19] examined blockchain technology and its characteristics from the perspective of prevention, safety from epidemics, control, and effectiveness. The technology has been used to face challenges in the health sector such as violation of patient data privacy and counterfeit products; however, there is a lack of sufficient study of blockchain technology, and its cost is high.

Harrington et al. [20] aimed at reducing counterfeit drugs by using number and code sequences that depend on the technology. The challenges this technology may face have not yet been identified and detailed.

4 Literature Survey

From the articles, some of the technologies used are scanning and tracking of the drug serialization sequence, so that each drug unit has a unique number through which the drug is tracked; mobile phone applications created to monitor drugs; X-ray technology used to determine whether a drug is counterfeit or not through its components; and, finally, the most widely used of the technologies, the blockchain, which has been combined with Internet of Things technologies to monitor, prevent, and combat fraudulent medicines. Tables 1 and 2 list further technologies used to protect drugs.

5 Discussion

Healthcare has developed applications with the ability to monitor from a distance, which reduces the burden on doctors and helps preserve human health [21]. For the same reason, techniques have been developed to detect incorrect or counterfeit medicines that harm human health and undermine these efforts. Among the most important techniques used to prevent or reduce cases of drug counterfeiting are blockchain, X-ray analysis, and digital serialization. A strong infrastructure is what allows a technology to work well and reliably, and the structure of these technologies is an example of this.

Blockchain is a cryptographic technology that provides a distributed ledger for storing, transmitting, and displaying secure information over a mutually untrusted network. It consists of a protocol that ensures that the network nodes verify that the information stored in the blocks is correct [10].

Digital serialization (supply-chain) systems and digital alerts have been proposed and implemented by national regulatory authorities: fraudulent drugs are monitored by assigning each drug package a unique number that distinguishes it from all others. This serial can then be checked to determine whether the package is genuine or fraudulent. If an adulterated drug is detected, an electronic alert is sent and published to all countries of the world, so that doctors and users can check drug packages to ensure their quality [2]. This process of system implementation is shown in Fig. 2.
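As an illustration of this verification flow, the minimal Python sketch below checks a scanned package serial against a hypothetical national registry and raises an alert when the serial is unknown or has already been dispensed; the registry contents, field names, and alert strings are illustrative assumptions, not part of any cited system.

```python
from dataclasses import dataclass

@dataclass
class PackRecord:
    serial: str          # unique number printed on the drug package
    product: str
    dispensed: bool = False

# Hypothetical registry populated by manufacturers during serialization.
REGISTRY = {
    "PK-0001": PackRecord("PK-0001", "Amoxicillin 500 mg"),
    "PK-0002": PackRecord("PK-0002", "Artemether 80 mg"),
}

def verify_package(serial: str) -> str:
    """Return a 'genuine' message or an alert string for suspect packages."""
    record = REGISTRY.get(serial)
    if record is None:
        return f"ALERT: serial {serial} not found - possible counterfeit"
    if record.dispensed:
        return f"ALERT: serial {serial} already dispensed - possible duplication"
    record.dispensed = True          # mark as used so a cloned pack is caught later
    return f"genuine: {record.product}"

if __name__ == "__main__":
    print(verify_package("PK-0001"))   # genuine
    print(verify_package("PK-0001"))   # duplicate -> alert
    print(verify_package("PK-9999"))   # unknown -> alert
```

In a real deployment the registry lookup would be served by the national or global monitoring system described above, and the alert would be broadcast rather than returned locally.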

Table 1 Literature survey on counterfeit and falsified medicines

1. Kaihlanen et al. [8]
   Technology to combat counterfeit medicine: Qualitative data analysis and descriptive analysis of semi-structured interviews
   Dataset description: Semi-structured interviews conducted with different age groups, users of health and mental health services, including the unemployed
   Similar techniques and data used: Digital health services
   Major findings: Seeks to understand how digital technologies are used to prevent fraud and corruption in medicines; different types of technology were used to communicate with people in order to collect and analyse information

2. Uddin [10]
   Technology to combat counterfeit medicine: MedLedger system
   Dataset description: The focus was on medicines; roughly one medicine out of every ten produced in developing regions has harmful effects on human life
   Similar techniques and data used: Treatments made in less developed regions
   Major findings: The new MedLedger system is tracked by blockchain technology with codes embedded inside; with this technology, people can know the start and end dates of a drug's use and other information from MedLedger

3. Ratta et al. [11]
   Technology to combat counterfeit medicine: Blockchain and Internet of Things
   Dataset description: Tracking treatment, monitoring patients, and managing medical records
   Similar techniques and data used: Healthcare applications
   Major findings: Using IoT and blockchain technology to enhance health functions, their usability was tested in patient monitoring, drug tracking, and medical records management; the integration of IoT with blockchain is used to monitor patients remotely, collecting their information and relaying it to the hospital

4. Ciapponi et al. [12]
   Technology to combat counterfeit medicine: Systematic review of the literature using the validated MARS scale
   Dataset description: Databases, applications, and studies from medical databases
   Similar techniques and data used: Comparison of the apps available on smart devices that can be used in public health, and their clinical results
   Major findings: The study relied on smart mobile applications used on smartphones that can affect public health outcomes; applications from the Apple and Android stores were collected and analysed, and applications that provided no benefit or could not detect and collect information about falsified medicines were removed

5. Barrett [13]
   Technology to combat counterfeit medicine: Survey
   Dataset description: Evaluating readiness to implement the directive on falsified treatments
   Similar techniques and data used: Pharmacies in England
   Major findings: The first study targeting pharmacies in England in terms of readiness to implement the European Union directive on counterfeit medicines; e-mails were used to send the survey to the pharmacies to study their readiness to implement this directive

6. Baker [1]
   Technology to combat counterfeit medicine: Updating technology companies' policies
   Dataset description: False allegations about the coronavirus, treatments, and the way it is transmitted through social media
   Similar techniques and data used: The approach suggested to technology companies is to raise the level of trusted content
   Major findings: False and misleading claims about the coronavirus, its alleged treatments, and its mode of transmission through social media have been among the most important challenges during the COVID-19 pandemic

7. Pascu et al. [7]
   Technology to combat counterfeit medicine: Pharmaceutical serialization, a global effort to combat counterfeit medicines
   Dataset description: Tracking and controlling medicines and identifying fraudulent ones
   Similar techniques and data used: Advantages of implementing and applying serialization to prevent and reduce counterfeit drugs
   Major findings: The lack of accurate data and the spread of misinformation is one of the most important obstacles to protecting human health and well-being

8. Mackey [3]
   Technology to combat counterfeit medicine: Blockchain
   Dataset description: A wide variety of data (such as remote patient monitoring, hospital data, consumer health)
   Similar techniques and data used: Addressing the issue of healthcare data silos
   Major findings: Healthcare records management aims to share data and information about patients, employees, and health workers across the various stakeholders in health care while maintaining the privacy and confidentiality of the data; despite that, it is vulnerable to theft and forgery

9. Khurshid [4]
   Technology to combat counterfeit medicine: Blockchain
   Dataset description: Discusses the corona pandemic and its shortcomings, the most important of which is the lack of accurate data
   Similar techniques and data used: Applied to patient data to track medical supplies and provide solutions to trust, data storage, and data loss problems
   Major findings: Proposes blockchain technology and a secure, robust, and distributed framework, the aim of which is to maintain data confidentiality and limit the spread of misinformation

10. Troncoso [5]
    Technology to combat counterfeit medicine: AI/ML
    Dataset description: People in low-income countries and the health care provided to them
    Similar techniques and data used: Lack of trust and value of health data in primary health care
    Major findings: Counterfeit medicines are among the most significant threats to public health worldwide; a unified system should be implemented to prevent the spread of counterfeit medicines, using artificial intelligence/machine learning for every member of society
Table 2 Literature survey on counterfeit and falsified medicines

11. Mackey and Cuomo [9]
    Technology to combat counterfeit medicine: Literature review
    Dataset description: Data in different fields, such as computer science and engineering
    Similar techniques and data used: Online portals and electronic databases
    Major findings: Improving concepts of how techniques can be used to detect counterfeit drugs; the definition of digital technology in this study covered databases, Internet-dependent calls, online portals, and others

12. Rasheed et al. [6]
    Technology to combat counterfeit medicine: IT tools such as mobile applications and messaging
    Dataset description: Drug safety alert systems, web-based drug safety alerts, radio-frequency identification tags, databases to support visual inspection, and others
    Similar techniques and data used: 2D bar-coding approaches
    Major findings: Shows the new techniques adopted to catch counterfeit or poor-quality medicines, for example assigning a 12-digit number to the medicine and performing the track-and-trace process in three steps: systematic serialization of the products at the manufacturing site, authentication, and documentation

13. Arora and Sharma [14]
    Technology to combat counterfeit medicine: Performing a literature search
    Dataset description: Terms 'counterfeit antimalarials', 'substandard', 'falsified', and 'drug resistance'; free searches in other search engines included the terms 'antimalarial counterfeit drugs' and 'drug resistance'
    Similar techniques and data used: PubMed and Google
    Major findings: Reducing the impact of infectious diseases, especially malaria; among the methods are HPLC, mass spectrometry, infrared spectroscopy, X-ray powder diffraction, and other methods pointed out in the study

14. Fittler et al. [15]
    Technology to combat counterfeit medicine: Infodemiology methodology
    Dataset description: Patient safety risks
    Similar techniques and data used: Google
    Major findings: Preventing the use of fraudulent treatments that can harm human health; the method allows the risk and its strength to be evaluated through evaluation of search trends, assessment of triggering news, obtaining and evaluating search-engine results, and content evaluation of websites offering ivermectin for retail use

15. Rebiere et al. [16]
    Technology to combat counterfeit medicine: X-ray
    Dataset description: Information about the inorganic composition of samples
    Similar techniques and data used: Treatment samples
    Major findings: Detection of falsification in several types of drug samples by performing X-ray analysis on them to determine their components, in order to reduce the risk to human health

16. Raijada et al. [17]
    Technology to combat counterfeit medicine: Drug delivery systems (PDDS)
    Dataset description: Customized medicines and doses tailored to each patient, how to apply them, and their pros and cons
    Similar techniques and data used: Modern products such as customized interactive therapy
    Major findings: An increasing number of advanced digital health solutions are being integrated to enable the feedback loop between patients and medicines

17. Ghanem [2]
    Technology to combat counterfeit medicine: Data on falsified medicines and the creation of a quick electronic alert form
    Dataset description: Studying the situation of pharmacists and the extent of their experience and knowledge in identifying fraudulent drugs
    Similar techniques and data used: Adulterated drugs and how to identify them
    Major findings: Using the global monitoring system through an alert form, after collecting accurate data on fraudulent drugs and issuing alerts when they are discovered

18. Troein [18]
    Technology to combat counterfeit medicine: A pilot survey
    Dataset description: A survey of a group of doctors online in Sweden
    Similar techniques and data used: A literature review and information gathering to determine physicians' level of knowledge and experience
    Major findings: Primary health care does not reach all people in the world, especially in low- and middle-income countries; among the most important health problems are counterfeit and fraudulent medicines and where they circulate, as well as self-prescription and self-diagnosis of diseases; an online survey was conducted to narrow the answers, identify the problem, and try to treat it

19. Hasan et al. [19]
    Technology to combat counterfeit medicine: Blockchain
    Dataset description: A set of data mentioned in other research papers concerned with examining blockchain technology and its effectiveness
    Similar techniques and data used: The use of blockchain technology to face some challenges during the corona pandemic, such as privacy violations and fake medical data
    Major findings: Blockchain technology has been proposed to meet these challenges, but there are not enough studies about it to establish its effectiveness in combating and preventing epidemics through the possibility of tracking, auditing, detecting fraud, and other characteristics of blockchain networks

20. Harrington et al. [20]
    Technology to combat counterfeit medicine: Advanced manufacturing techniques (AMT)
    Dataset description: A set of data collected from the academic literature
    Similar techniques and data used: Pharmaceutical sectors and how to reduce adulterated drugs
    Major findings: Reducing counterfeit medicines by using sequences of numbers and codes depending on the technology

Fig. 2 Process of checking the sequence number

Other technologies and their infrastructures support the detection of tampering, which most often concerns the chemical composition of the treatment, since a difference in the chemical composition can cause great harm to humans. Counterfeit medicines are produced in unsanitary conditions by unknown manufacturers and contain incorrect quantities and materials of the ingredients, which can affect human health and may also be contaminated; therefore, most of the techniques contribute to analysing the components of a drug in order to detect whether it is genuine or counterfeit [2].

6 Result

The methods identified include scanning, radio-frequency identification, online verification by smart devices, invisible markings, and drug serialization; some other studies concern health care and pharmacy browsing, while another approach uses X-rays to infer a drug's components and find out whether it is a real or fake treatment. In [16], five similar samples were examined by X-ray, and the differences and falsifications among them were deduced from their components. Mobile applications are also used to track medicines through their digital numbers. Drug serialization helps pharmaceutical factories deliver valid, safe, and controlled medicine; this method will be required by law and will be adopted as an approved, standardized authentication method worldwide [7].

Fig. 3 Effective technique used

The most used technology, from our point of view and from the survey reviews we carried out, is blockchain. This technology is considered a key factor for facilitating, enabling, and containing many challenges around counterfeit drugs; it has been used together with IoT, the MedLedger system, and others, and it is used to record the dates of a medicine's use, to track, audit, and trace the drug, and to keep the information secret [10, 11] (Fig. 3).

The results boil down to two main points. The first concerns the first goal of the research, determining whether modern techniques have helped protect against counterfeit drugs: by reviewing the studies, we concluded that techniques play a major role in protection, whether by preventing such medicines from reaching people or by educating people about this risk. The second concerns proving the effectiveness of the techniques in preventing counterfeiting: we concluded that some techniques are effective, while others need to be studied further and continuously developed to work more efficiently.


7 Conclusion

Protecting human health and primary care from fraudulent and poor-quality medicines is one of the most important goals of global health organizations. Many reasonable techniques, such as digital serialization, blockchain, and other related techniques, help maintain human health by reducing the spread of fraudulent drugs. Some of these technologies are successful and beneficial in some places, but others have little effect in preventing or reducing falsified medicine, and it should be noted that a given technology tends to be beneficial only for a specific period. It will be quite difficult to alleviate fraudulent activities in medicine in the near future, especially in low-income countries, because of the many challenges that must be taken into consideration. Rich and poor countries therefore need to work together to combat counterfeit and falsified drugs in medicine and to achieve the goal of complete health care for people in all parts of the world.

Acknowledgements The authors would like to thank the anonymous reviewers for their insightful comments and suggestions to improve the clarity and quality of the paper.

References

1. Baker SA (2020) Tackling misinformation and disinformation in the context of Covid-19. City Research Online
2. Ghanem N (2019) Substandard and falsified medicines: global efforts to address a growing problem. Pharm J 11(5)
3. Mackey TK (2019) 'Fit-for-purpose?'—challenges and opportunities for applications of blockchain technology in the future of healthcare. BMC Med 17(68)
4. Khurshid A (2020) Applying blockchain technology to address the crisis of trust during the COVID-19 pandemic. National Library of Medicine
5. Troncoso EL (2020) The greatest challenge to using AI/ML for primary health care: mindset or datasets? Front Artif Intell 3(53)
6. Rasheed H, Höllein L, Holzgrabe U (2018) Future information technology tools for fighting substandard and falsified medicines in low- and middle-income countries. Front Artif Intell 9(995)
7. Pascu GA, Hancu G, Rusu A (2020) Pharmaceutical serialization, a global effort to combat counterfeit medicines. Acta Marisiensis Seria Medica 4(66)
8. Kaihlanen AM, Virtanen L, Buchert U, Safarov N, Valkonen P, Hietapakka L, Hörhammer I, Kujala S, Kouvonen A, Heponiemi T (2020) Towards digital health equity—a qualitative study of the challenges experienced by vulnerable groups in using digital health services in the COVID-19 era. BMC Health Serv Res 22(188)
9. Mackey TK, Cuomo RE (2020) An interdisciplinary review of digital technologies to facilitate anti-corruption, transparency, and accountability in medicines procurement. Glob Health Action 13(104)
10. Mueeen U (2021) Blockchain Medledger: hyperledger fabric enabled drug traceability system for counterfeit drugs in pharmaceutical industry. Int J Pharm 597(120235)
11. Ratta P, Kaur A, Sharma S, Shabaz M, Dhiman G (2021) Application of blockchain and internet of things in healthcare and medical sector: applications, challenges, and future perspectives. J Food Qual 2021(7608296)


12. Ciapponi A, Donato M, Gülmezoglu AM, Alconada T, Bardach A (2021) Mobile apps for detecting falsified and substandard drugs: a systematic review. PLoS ONE 16(2)
13. Barrett R (2020) Evaluation of community pharmacists' readiness to implement the Falsified Medicines Directive (Directive 2011/62/EC): an English cross-sectional survey with geospatial analysis. BMJ Open 10(1136)
14. Arora T, Sharma S (2019) Global scenario of counterfeit antimalarials: a potential threat. J Vector Borne Dis 58(4)
15. Fittler A, Adeniye L, Katz Z, Bella R (2021) Effect of infodemic regarding the illegal sale of medications on the internet: evaluation of demand and online availability of ivermectin during the COVID-19 pandemic. Int J Environ Res Public Health 18(7475)
16. Rebiere H, Kermaïdic A, Ghyselinck C, Brenier C (2019) Inorganic analysis of falsified medical products using X-ray fluorescence spectroscopy and chemometrics. Talanta 195
17. Raijada D, Wac K, Greisen E, Rantanen J, Genina N (2021) Integration of personalized drug delivery systems into digital health. Adv Drug Deliv Rev 176:113857
18. Troein M (2019) Substandard and falsified medical products are a global public health threat. A pilot survey of awareness among physicians in Sweden. J Public Health 41(1)
19. Hasan MR, Deng S, Sultana N, Hossain MZ (2020) The applicability of blockchain technology in healthcare contexts to contain COVID-19 challenges. Library Hi Tech 40(2)
20. Harrington TS et al (2017) Reconfiguring global pharmaceutical value networks through targeted technology interventions. Int J Prod Res 55(5)
21. Balasubramaniam V (2020) IoT based biotelemetry for smart health care monitoring system. J Inf Technol Digit World 2(3)

Chapter 46

Utilizing Hyperledger-Based Private Blockchain to Secure E-Passport Management

Nusrat Jahan and Saha Reno

N. Jahan · S. Reno: Bangladesh Army International University of Science and Technology, Cumilla Cantonment, Cumilla 3501, Bangladesh

1 Introduction

In the South Asian region, Bangladesh is the first country to have issued e-passports for all eligible residents. An electronic passport carries the owner's biometric data as well as a Radio Frequency Identification (RFID) tag, and the effective integration of biometrics with RFID technologies seeks to improve security [14]. Bangladesh's passport is an ICAO-compliant, machine-readable, biometric e-passport issued to passport holders for traveling to foreign nations. E-passports governed by the International Civil Aviation Organization Public Key Directory (ICAO PKD) and Extended Access Control (EAC) have dominated the international landscape since 2004; these two standards specify principles and processes for standardizing and securing e-passports [5]. As a result, nations worldwide are increasingly proposing the use of electronic passports and identification cards to boost national security against international terrorism and crime.

The e-passport is difficult to counterfeit, and it improves security by requiring more personal verification [13]. It holds biometric data as well as an RFID chip for identification. Biometric data is irreplaceable and helps stay ahead of advances in fraud; a password or PIN, by contrast, can be maliciously altered if it is hacked. However, when biometrics are converted to data and stored, especially in places or countries with extensive monitoring, users run the risk of leaving a permanent digital record that shady actors can track [12]; biometric data can become a permanent digital tag that can be used to identify anyone with or without their knowledge. Integrating RFID technology into e-passports, on the other hand, raises the issue of providing adequate security in restricted environments, where memory, bandwidth, and computational resources are limited. The system must scale to process a large number of applicants while using the limited bandwidth and memory resources that are available [3].


Blockchain is a network of interconnected blocks that grows continuously as transactions are stored in the blocks, where each block contains a cryptographic hash of the previous block, transaction data, and a timestamp [15]. A blockchain is thus a persistently growing list of transaction records. Once transaction data is kept in the blockchain, blockchain technology ensures data integrity, which makes blockchain an appropriate infrastructure for notarization services. However, there remains a fundamental but difficult problem of ensuring that input data is not fabricated before being entered into a block [17].

In our e-passport security management system, we have utilized a Hyperledger-based private blockchain to strengthen the security of the system. The primary reason for preferring Hyperledger is to shield our e-passport management system from anonymous and unwanted access. The Access Control feature of Hyperledger prevents unauthorized access and limits the availability of resources based on the identity of the participants. The Historian Registry ensures the immutability of the transactions, and the SQL supportability makes the system faster and more convenient than the existing, classical approaches. Moreover, building a secure system with Hyperledger is practical because of the chaincode feature, which makes the system even more customizable. The core attributes of a private ledger that provide the required trust inside the network are:

• Immutability: There are several intriguing blockchain properties, but 'immutability' is one of the most important. Once a transaction is executed, the information regarding that transaction cannot be altered without the consent of 51% of the nodes in the network.
• Increased Security: Each block in the ledger has its own hash and records the previous block's hash. Changing or attempting to interfere with the data will not only alter the current block's hash but will also break the links to all subsequent blocks (a small sketch of this hash chaining follows the list below).
• Irreversible Hashing: Hashing is a complicated process that cannot be reversed. No one can generate a private key from a public key. Even slight modifications are not tolerated by the system, because a single change in the information results in drastic changes in the ledger.

Three distinct features of Hyperledger set it apart from other systems:

• Privacy and Confidentiality.
• Efficient and Faster Processing.
• Chaincode Functionality.
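To make the hash-chaining argument concrete, here is a minimal Python sketch (not Hyperledger code, just an illustration under our own simplified assumptions) showing how each block stores the previous block's hash and how tampering with one block breaks verification of every later block.

```python
import hashlib
import json
import time

def block_hash(block: dict) -> str:
    """Hash the block contents (excluding its own 'hash' field) deterministically."""
    payload = {k: v for k, v in block.items() if k != "hash"}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def new_block(data: str, prev_hash: str) -> dict:
    block = {"data": data, "prev_hash": prev_hash, "timestamp": time.time()}
    block["hash"] = block_hash(block)
    return block

def build_chain(items):
    chain = [new_block("genesis", "0" * 64)]
    for item in items:
        chain.append(new_block(item, chain[-1]["hash"]))
    return chain

def verify(chain) -> bool:
    """Every block must hash to its stored hash and point at its predecessor."""
    for i, block in enumerate(chain):
        if block["hash"] != block_hash(block):
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

if __name__ == "__main__":
    chain = build_chain(["createEPassport:BD-123", "updateDateOfBirth:BD-123"])
    print(verify(chain))                           # True
    chain[1]["data"] = "createEPassport:BD-999"    # tamper with a stored transaction
    print(verify(chain))                           # False: the chain no longer verifies
```

The transaction strings are placeholders; in the actual system the blocks would carry the Composer transaction payloads described in Sect. 3.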

2 Related Work

The authors of [14] presented a new method for improving e-passport authentication security using a novel protocol based on Elliptic Curve cryptography, Identity-Based Encryption, and a shared secret between entities. The researchers employed the BAN logic language, which is intended for designing and validating security protocols. Unfortunately, the authors did not consider two significant drawbacks of this system, (i) the key escrow problem and (ii) the identity revocation problem, which prevent it from being implemented in practice. In [10], the authors proposed a distributed infrastructure for anonymous data set exchange that does not rely on a centralized trustworthy third party, built on a Hyperledger Fabric-based blockchain. However, the authors did not examine an economic model for their platform nor appropriate platform parameters for achieving a long-term data trading market. For e-passport detection and verification, the authors of [4] proposed an RFID-based Location Authentication Protocol (RFIDLAP). A C# simulator was created to test the accuracy of e-passport detection, and RSA and ECC encryption methods were employed to reduce the time necessary for authentication; however, ECC substantially increases the size of the encrypted messages, and RSA cannot encrypt data larger than the key's length. The authors of [13] investigate several privacy and security vulnerabilities in the RFID systems used in e-passports and propose an authentication protocol to overcome those difficulties; they did not, however, consider future active attacks in which the random number is altered, which is something an active adversary can do.

The authors of [9] look into how blockchain technology could be applied to vehicular networking, particularly for distributed and secure data storage. They also propose a model for the outward transmission of vehicle blockchain data, together with a detailed theoretical analysis and numerical results. Apart from that, the authors' proposed model ignores traffic between vehicles and the reliability of cellular network channels. In [18], the authors look into how to keep Electronic Health Records (EHRs) safe and use a blockchain-based system for the effective storage and maintenance of EHRs; regrettably, the researchers failed to consider the relationship between noise and the size of the ledger. The authors of [1] present the evolution of passports over time in order to establish a taxonomy of faults and to act as a reference point for the security flaws associated with RFID e-passport characteristics in the first and second generations of e-passports; however, this article can only serve as a review of, or reference for, the security risks related to RFID e-passport features in current passport generations. The authors of [2] suggested a cost-effective solution that uses Bitcoin-based blockchain technology to improve the long-term security of breeder documents; nevertheless, they did not test the long-term security of breeder records with payment systems. In [7], the researchers assess a Blockchain-as-a-Service application for implementing distributed electronic voting systems using an Ethereum-based blockchain; however, they did not present further methods for larger countries to accommodate higher transaction volumes per second. Shahnaz, Qamar, and Khalid in [15] suggested a framework for applying blockchain technology to electronic health records in the healthcare sector, also based on an Ethereum blockchain; on the other hand, they did not include a payment module in this framework. In [17], Sharma and Zodpe proposed an architecture that uses blockchain technology to archive social media material authentically, and a proof-of-concept based on the proposed strategy is shown; however, the authors failed to address the scalability issue or to develop a reputation system to lessen reliance on an official service provider. The researchers of [6] proposed a healthcare architecture based on a permissioned blockchain and developed a blockchain-based remote patient monitoring (RPM) system to accomplish this goal; despite that, the system does not demonstrate how different doctors or caregivers from various organizations would work together to produce an accurate diagnosis. Vora and co-authors, in [16], compared the modular multiplication algorithms used in the RSA algorithm for keys with a length of 1024 bits; this encryption and decryption technique was implemented on a Virtex-5 FPGA board using the Xilinx ISE 14.3 platform. The authors of [11] introduced BasGit, an Administrative Web Interface (AWI)-based protocol that allows users to make their e-passport portable using a mobile application and to print the e-passport details on paper without any specialized appliances. After the passport data is stored in the central database, it is digitally signed by the issuing authority; a Universally Unique Identifier (UUID) is assigned to each valid passport, and the AWI administers all the processing steps from the beginning up to delivering the passport details. Unfortunately, this scheme is vulnerable to some common security threats such as Man-in-the-Middle attacks, Denial-of-Service attacks, and insider attacks.

Our proposed system attempts to overcome the aforementioned limitations by utilizing Hyperledger's Asset and Historian Registries. Each asset represents an e-passport, and these assets are stored inside the Asset Registry. Whenever someone creates, updates, or removes assets, the corresponding transaction must be executed. The transactions are stored in the Historian Registry, which is read-only and cannot be tampered with; thus, any adversarial attempt can easily be traced. Moreover, in a public blockchain every transaction is publicly available to anyone around the world, which breaks the confidentiality and security of the system. Using the Hyperledger-based private blockchain and its Access Control feature, the encrypted transactions are not accessible to everyone; instead, access is restricted to certain users through CRUD operations defined in the rules of the Access Control module.

3 Methodology

Controlling accessibility, and the transactions in charge of creating, changing, or omitting the assets, are the two fundamental elements of our proposed architecture, as the private blockchain is built on Hyperledger Composer. The asset ID and transaction ID are used in our approach to detect whether an e-passport's data has been misused or changed. The workflow of our proposed system, from the blockchain client to the peers, is shown in Fig. 1. The following subsections outline the systemic protocol for detecting any illegal activity involving an e-passport.


Fig. 1 Workflow of the proposed system

3.1 Characterization of the Assets

The passport number, personal number, name, ethnicity, date of birth, date of issue, country code, and other information about a specific e-passport, together with other relevant data, are jointly denoted as a particular asset of our system. The assets list of our private blockchain, referred to as the Asset Registry, is where all of the digital e-passport information is kept. Both the passport ID and the asset ID are used to identify specific assets in our system, and this identification number is indexed in the Asset Registry. Transactions are used to ensure the immutability of the actions related to creating, updating, or deleting the assets, such as correcting information, creating a new passport, or approving visas.
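The paper does not list a full schema, so the following is a minimal Python sketch of how such an e-passport asset could be modelled; any field beyond those named above is an illustrative assumption, and in an actual Hyperledger Composer deployment the asset would be expressed in the Composer modelling language rather than in Python.

```python
from dataclasses import dataclass, field

@dataclass
class EPassport:
    """Illustrative asset record keyed by asset ID, mirroring the fields named in Sect. 3.1."""
    asset_id: str            # indexed key in the Asset Registry
    passport_number: str
    personal_number: str
    name: str
    ethnicity: str
    date_of_birth: str       # ISO date string, e.g. "1990-01-31"
    date_of_issue: str
    country_code: str
    status: str = "Active"   # soft-delete flag, used later in Sect. 3.4
    visas: list = field(default_factory=list)

# The Asset Registry can be sketched as a dictionary indexed by asset_id.
asset_registry = {}

def register_asset(passport: EPassport) -> None:
    asset_registry[passport.asset_id] = passport

register_asset(EPassport("A-001", "BD1234567", "P-99", "Jane Doe",
                         "Bengali", "1990-01-31", "2022-01-01", "BGD"))
```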

3.2 Transaction Execution via Chaincode

Like any other blockchain framework, all the e-passport-related tasks of our suggested blockchain-based protocol are carried out via transactions. When a valid transaction is stored in a block of the private ledger, it becomes immutable, meaning it cannot be altered without an attack such as the 51% attack, which is nearly impossible to carry out. Our scheme consists of two main transactions that are used to prevent the malicious alteration or processing of passports: (i) creating assets and (ii) omitting assets. The transaction responsible for asset formation has the same parameters as the asset definition; removing an asset, on the other hand, only needs the asset ID to delete it from the Asset Registry. A unique transaction ID is produced whenever a transaction for creating, deleting, or updating an asset is carried out, and this transaction ID is used to track all kinds of activities in our private blockchain. Figure 2 shows the implementation of utility functions for the transactions accountable for generating and removing assets.

All the biometric data, such as fingerprint templates and photographs, is non-textual and heavyweight in size and cannot be stored as an attribute of a transaction; consequently, the Historian Registry is unable to include the mandatory biometric details of e-passports. Therefore, a secure peer-to-peer distributed file management system, the InterPlanetary File System (IPFS), is used to store the biometric data [8], as discussed in the subsequent subsections. IPFS returns a content identifier which is stored as a transaction attribute's value inside the Historian Registry. Figure 2 represents two algorithms for creating and updating the e-passport assets.

Fig. 2 Algorithm for a execution of createEPassport transaction and emission of events, b execution of updateDateOfBirth transaction

Fig. 3 Type of assets and participants of our system
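As a rough illustration of the flow just described (again a Python sketch under our own simplifying assumptions, not the authors' chaincode), the snippet below stores the heavyweight biometric payload in a content-addressed store, keeps only the returned content identifier in the transaction record, and appends the transaction to an append-only historian list.

```python
import hashlib
import uuid

content_store = {}   # stand-in for IPFS: content-addressed blobs
historian = []       # append-only transaction log (the Historian Registry)

def store_content(blob: bytes) -> str:
    """Content-addressed put: the identifier is derived from the data, as with an IPFS CID."""
    cid = hashlib.sha256(blob).hexdigest()
    content_store[cid] = blob
    return cid

def create_epassport_tx(asset_id: str, fields: dict, biometric_blob: bytes, participant: str) -> str:
    """Execute a 'create asset' transaction: log the transaction with a unique ID."""
    cid = store_content(biometric_blob)          # heavyweight data goes to the file store
    tx = {
        "tx_id": str(uuid.uuid4()),              # unique transaction ID
        "type": "createEPassport",
        "asset_id": asset_id,
        "fields": fields,                        # same parameters as the asset definition
        "biometric_cid": cid,                    # only the content identifier is logged
        "participant": participant,
    }
    historian.append(tx)                         # historian entries are never edited
    return tx["tx_id"]

tx_id = create_epassport_tx("A-001", {"passport_number": "BD1234567"},
                            b"\x00fingerprint-template", "admin01")
print(tx_id, historian[-1]["biometric_cid"][:12])
```

The participant and field names are placeholders chosen for the example; the point is simply that the ledger records a small, traceable transaction while the biometric payload lives in the distributed file store.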

3.3 Setting Up the Categories of Participants

In Hyperledger Composer, the private blockchain must contain participants for smooth system management. Accordingly, our proposed e-passport system consists of two types of participants: (i) system administrators and (ii) end users. The system administrator can generate assets and has access to retrieve both the Historian Record (discussed in the next section) and the Asset Registry. Furthermore, the administrator has the authority to create additional participant categories (if necessary) to improve the manageability of the system. The default permission range for end users is restricted to retrieving specific asset details, but their access to the system resources can be widened or narrowed. In addition, end users have the privilege to execute certain transactions accountable for services related to their own passport. Figure 3 illustrates the two types of participants along with the genre of the assets in our system.


3.4 Placement of Assets in the Registry The Asset Registry stores the e-passport-type assets along with the information regarding the participant responsible for asset generation and the timestamp of that asset. Corresponding transactions are required to be executed for creating, updating or removing assets. Although assets can be updated and deleted, the immutability of the private ledger is preserved by the transaction executions, as transactions cannot be tampered with by any means. When a system administrator or end-user requires information about a specific asset, they must provide the asset ID to this registry. Using our system's querying technique, users can search for unique passport information. When an asset is deleted from the registry, our system does not actually destroy the asset information, in order to maintain the immutability of the blockchain. Instead, the asset's status is changed from 'Active' to 'Deleted'. The Asset Registry displays the non-existence of any specific passport if the status is 'Deleted', even though that asset still remains in the registry.

3.5 Registering Transactions in Historian Records All of the transactions on our private blockchain are stored in and retrieved from the system's Historian Registry. This registry assigns each transaction a unique ID that is linked with the ID of the participant who executed the transaction. Although asset details can be updated or omitted, no one can tamper with the transaction details inside the registry. Participants can extend the Transaction Registry by including new transactions in it but cannot modify the details of already executed and valid transactions. Hyperledger Composer allows our system to query this Historian Registry using SQL as in a traditional database, making our system much faster at searching for any particular information. Our proposed method utilizes both this Historian Record and the Asset Registry to detect any unusual system intrusion, such as altering passport information with wrong intent or declining valid requests from system users. Figure 4 shows the functionalities of the system for inserting assets and transactions in the corresponding registries.

3.6 Implementation of Permission Regulations Inside Access Control Module The Access Control module restricts access to retrieving or modifying the system resources, such as the assets, transactions, registries, and participant creation. Access Control exploits the Create, Read, Update, Delete (CRUD) operations along with the type of the system resource to regulate the actions and activities of the users. For example, a rule that only allows a certain user to read the Asset Registry and Historian Record lets that user retrieve asset and transaction information from the registries but does not grant the right to create new assets or execute any transaction.


Fig. 4 Updation of asset and historian registry by the execution of transactions

System administrators have access to almost all properties, but end-users are only privileged to retrieve information from the Asset Registry and the Historian Registry. Rules inside the Access Control module only allow end-users to edit the registries when applying for a service via executing transactions. To prevent fraudulent end-users from storing bogus e-passport information in the blockchain, the regulations set inside the Access Control mechanism do not grant end-users access to the transactions responsible for creating, removing or modifying details of assets.

3.7 InterPlanetary File System (IPFS) to Store Non-textual Data Our private ledger can only store textual information; therefore, heavyweight data like images and fingerprint templates must be saved to IPFS. IPFS is a decentralized, distributed peer-to-peer file storing and distributing system. Every piece of data stored in IPFS is assigned a unique Content Identifier (CID), which works in much the same way as a block hash. If someone tampers with any information, the CID changes, and thus the traceability of the alteration is ensured. Non-textual data corresponding to a particular e-passport is uploaded to IPFS and the returned CID is stored inside our ledger as an attribute value of that e-passport's asset. In this way, the immutability of the media files is guaranteed using this file-sharing network.


Fig. 5 Submission of createEPassport transaction using hyperledger test network

Fig. 6 Submission of updateDateOfBirth transaction using hyperledger playground

4 Result We have used Hyperledger Playground to set up a test network for our system. Composer is used instead of Fabric, as Fabric does not provide any testing environment. The transaction logic and asset definitions are written in JavaScript. At first, the definitions of assets and transactions are declared in the 'model.cto' module of Composer, and then the transaction logic is configured in the 'logic.js' module. After successful configuration, the transactions are ready to execute in the Playground. Figure 5 shows the submission request of the CreateEPassport transaction for execution. Among several other transactions used to update e-passports inside the ledger, Fig. 6 demonstrates the execution of the UpdateDateOfBirth transaction, responsible for correcting a wrong date of birth of any passport holder.


Fig. 7 All valid transactions inside historian record
Fig. 8 Retrieval of details of validated transactions from historian registry

All the successfully executed and valid transactions can be accessed from the Historian Record, which lists every transaction indexed by its timestamp and the identity of the user who executed it. This Historian Record, along with all the detailed transactions, is illustrated in Fig. 7. Among all the transactions displayed in Fig. 7, the details of a particular transaction, which was used to create an e-passport asset, are retrieved from the Historian Registry and shown in Fig. 8. Unlike the asset information residing in the Asset Registry, the attributes of these transactions are non-editable and read-only. The Access Control module prevents assigning the Update privilege of the CRUD operations in the case of transactions in the Historian Record. To utilize IPFS to store heavyweight data like images and fingerprint templates, the IPFS daemon is run first. After successful initialization of the daemon, the 'add' command is used to upload an image or fingerprint template. IPFS then returns a Content Identifier for the uploaded data, and this hash is stored inside the ledger. The overall process is demonstrated in Figs. 9 and 10.
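The IPFS step can also be scripted. The following is a minimal Python sketch that assumes the go-ipfs/kubo command-line client is installed and its daemon is already running (as in Fig. 9); the file name and the asset fields are illustrative placeholders, not values from our system.

import json
import subprocess

def upload_to_ipfs(path: str) -> str:
    """Run `ipfs add` (the daemon must already be running) and return the CID."""
    result = subprocess.run(
        ["ipfs", "add", "-Q", path],   # -Q prints only the final content identifier
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    cid = upload_to_ipfs("fingerprint_template.png")   # hypothetical biometric file
    asset_update = {
        "passportID": "EP-0001",                        # hypothetical asset ID
        "biometricCID": cid,                            # stored in the ledger instead of the raw file
    }
    print(json.dumps(asset_update, indent=2))

The returned CID is then written into the corresponding e-passport asset as an attribute value, so only the lightweight identifier, not the media file itself, resides in the ledger.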


Fig. 9 Initiating the daemon of IPFS service

Fig. 10 Uploading biometric data to IPFS and retrieving the corresponding CID

We generated several random assets from fictitious e-passport details and measured the transaction processing time of our proposed Hyperledger + IPFS-based system. The 'python-bitcoinlib' library was used to implement our method on the Bitcoin blockchain, and the Remix IDE was utilized for the development of the Ethereum-based e-passport management. To calculate the throughput of the Bitcoin-based system, the 'time' and 'now' methods were used, and 'block.timestamp' along with 'block.number' in the case of Ethereum. From Table 1 and Fig. 11, it is evident that our system surpasses the two public blockchain systems in terms of transaction throughput.
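The timing measurement can be sketched as follows. The function submit_transaction is a placeholder for the framework-specific submission call (Composer REST request, web3 send, bitcoinlib broadcast, and so on); only the batch sizes are taken from Table 1.

import time

def submit_transaction(tx):
    pass  # placeholder: send tx to the blockchain network under test

def measure_batch(transactions):
    """Return the wall-clock time taken to submit one batch of transactions."""
    start = time.perf_counter()
    for tx in transactions:
        submit_transaction(tx)
    return time.perf_counter() - start

batch_sizes = [500, 430, 357, 245, 119, 83]          # batches used in Table 1
elapsed = [measure_batch([{"id": i} for i in range(n)]) for n in batch_sizes]
print("average execution time:", sum(elapsed) / len(elapsed), "seconds")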


Table 1 Transaction throughput in three different blockchain protocols

Transactions amount | Required time in Hyperledger + IPFS (max file size: 25 MB) (in seconds) | Required time in Ethereum (in seconds) | Required time in Bitcoin (in seconds)
500 | 17.31 | 29.36 | 39.78
430 | 15.86 | 26.77 | 36.24
357 | 13.72 | 24.29 | 31.91
245 | 11.21 | 20.55 | 28.36
119 | 10.93 | 18.37 | 25.98
83 | 08.40 | 16.41 | 23.15
Total amount of transactions: 1734 | Execution time on average: 12.91 (IPFS + Hyperledger) | 22.63 (Ethereum) | 31 (Bitcoin)

Fig. 11 Three-dimensional graph representing the comparison among transaction processing times of three different blockchain protocols

5 Conclusion In our proposed system, we have utilized Hyperledger Composer, a blockchain framework, to protect e-passport information. In an e-passport, biometric information such as fingerprint and iris data is mandatory; these data are saved in IPFS, and IPFS returns a hash to the Hyperledger ledger. It is nearly impossible to alter the fingerprint and iris data inside IPFS, as its security is almost similar to that of the blockchain. The system eliminates the possibility of fraudulent activities by providing an immutable record history which is permanently connected to the system, simplifying paperwork and record-keeping. For this, a permissioned blockchain framework like Hyperledger Composer is used to create a peer-to-peer tamper-proof and forge-proof network.


In future work, we will utilize Hyperledger Sawtooth to secure e-passports, as its transactions can be executed in parallel, thereby providing faster processing times compared to Hyperledger Composer.

References
1. Bogari EA, Zavarsky P, Lindskog D, Ruhl R (2012) An investigative analysis of the security weaknesses in the evolution of RFID enabled passport. Int J Int Technol Secur Trans 4(4):290–311
2. Buchmann N, Rathgeb C, Baier H, Busch C, Margraf M (2017) Enhancing breeder document long-term security using blockchain technology. In: 2017 IEEE 41st annual computer software and applications conference (COMPSAC), vol 2. IEEE, pp 744–748
3. Dubbaka G (2021) e-passport using RFID. Available at SSRN 3918306
4. Hamad F, Zraqou J, Maaita A, Taleb AA (2015) A secure authentication system for epassport detection and verification. In: 2015 European intelligence and security informatics conference. IEEE, pp 173–176
5. Hanzlik L, Kutyłowski M (2021) epassport and EID technologies. In: Security of ubiquitous computing systems. Springer, pp 81–97
6. Hathaliya J, Sharma P, Tanwar S, Gupta R (2019) Blockchain-based remote patient monitoring in healthcare 4.0. In: 2019 IEEE 9th international conference on advanced computing (IACC). IEEE, pp 87–91
7. Hjálmarsson FÞ, Hreiðarsson GK, Hamdaqa M, Hjálmtýsson G (2018) Blockchain-based e-voting system. In: 2018 IEEE 11th international conference on cloud computing (CLOUD). IEEE, pp 983–986
8. Hossan MS, Khatun ML, Rahman S, Reno S, Ahmed M (2021) Securing ride-sharing service using IPFS and hyperledger based on private blockchain. In: 2021 24th international conference on computer and information technology (ICCIT), pp 1–6. https://doi.org/10.1109/ICCIT54785.2021.9689814
9. Jiang T, Fang H, Wang H (2018) Blockchain-based internet of vehicles: distributed network architecture and performance analysis. IEEE Int Things J 6(3):4640–4649
10. Kiyomoto S, Rahman MS, Basu A (2017) On blockchain-based anonymized dataset distribution platform. In: 2017 IEEE 15th international conference on software engineering research, management and applications (SERA). IEEE, pp 85–92
11. Kocaogullar C, Yıldırım K, Sakaogulları MA, Küpçü A (2021) Basgit: a secure digital epassport alternative. In: ISCTURKEY
12. Mauw S, Horne R (2021) Discovering epassport vulnerabilities using bisimilarity. Log Methods Comput Sci 17
13. Morshed MM, Atkins A, Yu H (2011) Privacy and security protection of RFID data in epassport. In: 2011 5th international conference on software, knowledge information, industrial management and applications (SKIMA) proceedings. IEEE, pp 1–7
14. Saoudi S, Yousfi S, Robbana R (2017) Elliptic curve cryptography on e-passport authentication protocol. In: 2017 IEEE/ACS 14th international conference on computer systems and applications (AICCSA). IEEE, pp 1253–1260
15. Shahnaz A, Qamar U, Khalid A (2019) Using blockchain for electronic health records. IEEE Access 7:147782–147795
16. Sharma S, Zodpe H (2016) Implementation of cryptography algorithm for e-passport security. In: 2016 international conference on inventive computation technologies (ICICT), vol 3. IEEE, pp 1–3


17. Song G, Kim S, Hwang H, Lee K (2019) Blockchain-based notarization for social media. In: 2019 IEEE international conference on consumer electronics (ICCE). IEEE, pp 1–2
18. Vora J, Nayyar A, Tanwar S, Tyagi S, Kumar N, Obaidat MS, Rodrigues JJ (2018) Bheem: a blockchain-based framework for securing electronic health records. In: 2018 IEEE Globecom workshops (GC Wkshps). IEEE, pp 1–6

Chapter 47

An Exploratory Data Analysis on SDMR Dataset to Identify Flood-Prone Months in the Regional Meteorological Subdivisions J. Subha and S. Saudia

1 Introduction Floods are natural disasters which affect almost all the states of India, from Jammu and Kashmir, Himachal Pradesh and Uttarakhand in the North to Tamil Nadu, Andaman and Nicobar and Lakshadweep in the South, and from Arunachal Pradesh, Assam and West Bengal in the East to Gujarat, Maharashtra and Goa in the West. According to the India Today report published in the year 2019 [1], flood-related damage over the last 65 years in India is estimated to have killed at least 107,535 humans and 6,049,349 cattle; a total of 807,117,993 homes were destroyed, and an area of 466.335 million ha was affected. According to the latest report of the Intergovernmental Panel on Climate Change (IPCC) published in the year 2021 [2], rainfall in the Indian subcontinent will increase by 20% and will be more frequent and erratic, leading to flooding. During floods, people get caught in their homes without adequate food and water. The occurrences in Uttarakhand in 2013, Srinagar in 2014, Chennai in 2015, Gujarat in 2017 and Kerala in 2018 [3] show that protecting the lives and physical assets of people is important. Nearly every year, people are affected by floods [4], which makes the affected regions dangerous to live in unless preventive measures are taken. Thus, flood control is important, and long-term forecasts are needed so that preventive action is possible to overcome the effects of floods and to improve the safety of people based on forecasts and warnings. So, a detailed analysis is made in this paper to study the incidence of floods in different meteorological subdivisions of the country and to identify flood-prone months, since rainfall is the most important factor in generating floods [5].
J. Subha (B) · S. Saudia, Center for Information Technology and Engineering, Manonmaniam Sundaranar University, Tirunelveli, Tamil Nadu 627012, India, e-mail: [email protected]
S. Saudia, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023, I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_47


Srivastava et al. [6] conducted a trend analysis of heavy rainfall occurrences in India using annual and seasonal rainfall data of IMD from 1901 to 1992 and discovered a rising annual rainfall trend in Punjab, Telangana, North Interior Karnataka and South Interior Karnataka, as well as an increasing seasonal rainfall trend in West Bengal, Punjab, Himachal Pradesh, and Telangana. Guhathakurta et al. [7] made an analysis of one-day extreme rainfall using the daily rainfall dataset of IMD and found that there has been an increase in the intensity of extreme rainfall over Coastal Andhra Pradesh and adjoining areas and over parts such as Saurashtra Kutch, Orissa and West Bengal. There has been an increase in the frequency and intensity of monsoon rain events in Central India, and it was also found that the Indian monsoon leads to a high risk of floods over Central India [8, 9]. Rao et al. [10] made an analysis of gridded daily rainfall data in the Western Ghats and found that the trend in annual, monsoon and post-monsoon rainfall is increasing in the state of Goa and the coastal region of Karnataka. Recent empirical investigations on the daily gridded rainfall dataset [10–12] have demonstrated that extreme precipitation events and monsoon rain events [13] have become more common all over the world in the past 50 years. Several authors [6, 8–11] have reported an increase in excessive monsoon rainfall by analyzing datasets such as daily rainfall data and daily gridded rainfall data. Adhikari et al. [14] observed, using the DFO and EMDAT datasets, that the incidence of floods starts increasing during May and reaches its peak in July and August. Table 1 shows the various rainfall inferences made by different authors using different daily precipitation datasets. From Table 1, it is clear that understanding and analyzing changes in rainfall, its duration and its trend becomes vital to predict flood disasters. So, this paper proposes an EDA to visualize the seasons and months of high rainfall in different parts of the country to plan associated flood prevention and rehabilitation processes, based on the SDMR dataset collected from 1901 to 2017 [15]. The aim of the EDA is to find the variations in annual, seasonal, and monthly rainfall, to discover patterns, trends, correlations and outliers in different meteorological subdivisions of India, and to identify the flood-prone months in the subdivisions using the Indian SDMR dataset. This paper is organized in four sections. The workflow of the EDA conducted in this work is detailed in Sect. 2. In Sect. 3, the results and inferences obtained from the SDMR dataset are discussed. Finally, the conclusion is presented in Sect. 4.

2 Exploratory Data Analysis The practice of systematically using statistical and/or logical tools to explain, illustrate, recapitulate, condense and assess data is known as Exploratory Data Analysis [16]. The different tools used for EDA, like graphs, plots and charts, help form a picture of the data and what they represent by displaying and summarizing significant properties and characteristics of datasets. EDA is an effective primary step before predicting future floods by evaluating past data such as precipitation events.


Table 1 Rainfall inferences

Author | Dataset | Inferences
Srivastava et al. [6] | Annual, seasonal rainfall data of 35 subdivisions in all districts (1901–1992) [6] | Increasing trend in annual, seasonal rainfall over India; rising annual rainfall trend in Punjab, Telangana, North Interior Karnataka, South Interior Karnataka; increasing seasonal rainfall trend in West Bengal, Punjab, Himachal Pradesh, Telangana, etc.
Guhathakurta et al. [7] | Daily rainfall dataset—IMD 1901–2005 [7] | Increasing intensity of extreme rainfall over Coastal Andhra Pradesh, Saurashtra, Orissa, West Bengal, etc.; flood risk increased in the decades 1981–1990, 1991–2000
Mondal and Mujumdar [11] | High-resolution daily gridded (1° latitude × 1° longitude) dataset, ENSO [11] | Varying rainfall frequency is inducing flood; variability of Indian monsoon rainfall is high
Rajeevan et al. [8] | Daily rainfall data (104 years), SST [8] | Indian monsoon leads to high frequency of very high rainfall (VHR) events; risk of flood increased over Central India
Goswami et al. [9] | Daily rainfall data [9] | Increase in the frequency and intensity of monsoon rain events in Central India
Sinha et al. [12] | Chhattisgarh state rainfall data [12] | Increasing patterns in Ambikapur, Baloda Bazar, Bemetara, Durg and Gariyaband and decreasing pattern in Dantewada, Kondagaon and Sukma
Adhikari et al. [14] | DFO, EMDAT, others [14] | Flood events increasing in May and reaching peaks in the months of July to August
Rao et al. [10] | Gridded daily rainfall data in Western Ghats [10] | The trend in annual, monsoon and post-monsoon rainfalls is increasing in Goa and the coastal region of Karnataka and decreasing in some parts of Kerala and Maharashtra


Fig. 1 Workflow diagram of proposed EDA

The different stages in the EDA workflow of the proposed analysis, as shown in Fig. 1, are detailed in the subsections below. The proposed EDA is carried out in Python using its NumPy, Pandas, and Matplotlib libraries.

2.1 Data Collection Data collection is the process of collecting data from relevant sources to find solutions to the research problem under consideration [17]. Data collection methods are divided into two categories: Primary and Secondary methods of data collection. In the Primary data collection method, the data are first-hand information collected by the investigator. In Secondary data collection methods, data are collected from previously published sources like books, newspapers, magazines, journals, online portals, government publications, websites, etc. [18]. The rainfall dataset used in this work, SDMR, is acquired from the website 'data.gov.in' [15], so the data collection method used in this EDA is a Secondary data gathering method. The SDMR dataset is a meteorological dataset collected across 36 different meteorological subdivisions of the country [15]. The dataset has detailed qualitative and quantitative information regarding precipitation from the years 1901 to 2017 in the different meteorological subdivisions of India. The dataset is stored in CSV file format both month-wise and subdivision-wise; the rainfall is given in millimeters (mm). The dataset consists of 4188 records and 19 features: subdivision, year, January to December rainfall, annual rainfall, and seasonal rainfall for January–February, March–May, June–September and October–December. The summary of features of the dataset is shown in Table 2. The quantitative features in the dataset are: YEAR, JAN (January), FEB (February), MAR (March), APR (April), MAY (May), JUN (June), JUL (July), AUG (August), SEP (September), OCT (October), NOV (November), DEC (December), JF (January to February), MAM (March to May), JJAS (June to September), OND (October to December), and ANNUAL. The qualitative feature in the dataset is SUBDIVISION (meteorological subdivision). A sample portion of the SDMR dataset is shown in Table 3.


Table 2 Summary of SDMR dataset

Description name | Summary of description
Data source | IMD
Year of events recorded | 1901–2017
Number of records | 4188
Number of features | 19
Features name | JAN–DEC, JF, MAM, JJAS, OND, ANNUAL, YEAR, SUBDIVISION
Unique features data | SUBDIVISION: 36, YEARS: 117

Table 3 Sample space of SDMR dataset (subdivision: Andaman and Nicobar Islands for all five records)

Feature | 1901 | 1902 | 1903 | 1904 | 1905
JAN | 49.2 | 0 | 12.7 | 9.4 | 1.3
FEB | 87.1 | 159.8 | 144 | 14.7 | 0
MAR | 29.2 | 12.2 | 0 | 0 | 3.3
APR | 2.3 | 0 | 1 | 202.4 | 26.9
MAY | 528.8 | 446.1 | 235.1 | 304.5 | 279.5
JUN | 517.5 | 537.1 | 479.9 | 495.1 | 628.7
JUL | 365.1 | 228.9 | 728.4 | 502 | 368.7
AUG | 481.1 | 753.7 | 326.7 | 160.1 | 330.5
SEP | 332.6 | 666.2 | 339 | 820.4 | 297
OCT | 388.5 | 197.2 | 181.2 | 222.2 | 260.7
NOV | 558.2 | 359 | 284.4 | 308.7 | 25.4
DEC | 33.6 | 160.5 | 225 | 40.1 | 344.7
ANNUAL | 3373.2 | 3520.7 | 2957.4 | 3079.6 | 2566.7
JF | 136.3 | 159.8 | 156.7 | 24.1 | 1.3
MAM | 560.3 | 458.3 | 236.1 | 506.9 | 309.7
JJAS | 1696.3 | 2185.9 | 1874 | 1977.6 | 1624.9
OND | 980.3 | 716.7 | 690.6 | 571 | 630.8


2.2 Understanding Features In this stage of the EDA workflow, the characteristics of the different features of the SDMR dataset used in the analysis are explored to find the behavior of features, missing values in features, and outliers in features, using the summary and unique values of the dataset. The shape attribute and the head() function of the Pandas library of Python are used in this stage to find the shape (number of rows and columns) of the dataset and to visualize the top-most instances of the dataset, respectively. It is identified that there are 4188 rows and 19 columns in the SDMR dataset. The first 5 instances of the dataset as produced by the head() function are shown in Table 3; the types of values of the different features in the SDMR dataset can be understood from this output. The functions columns, unique() and describe() in the Pandas library are used to identify the columns in the dataset, find the unique values in each column, and summarize the values of the different columns in the dataset, respectively. The different columns in the dataset are identified as: 'SUBDIVISION', 'YEAR', 'JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC', 'ANNUAL', 'JF', 'MAM', 'JJAS', 'OND'. The unique() function returns the unique values in a particular column/feature of the dataset. The size and the statistical summary of the quantitative data in the dataset are determined using the describe() function, along with the boundaries of the precipitation values such as minimum, maximum, mean, percentiles, and standard deviation, as shown in Table 4.
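A minimal Pandas sketch of this feature-exploration step is given below; the CSV file name is an assumption about how the SDMR export from data.gov.in is stored locally.

import pandas as pd

df = pd.read_csv("sub_division_monthly_rainfall_1901_2017.csv")  # hypothetical file name

print(df.shape)                         # (4188, 19) rows and columns
print(df.head())                        # first five records (Table 3)
print(df.columns.tolist())              # feature names
print(df["SUBDIVISION"].nunique())      # 36 meteorological subdivisions
print(df["YEAR"].nunique())             # 117 years (1901-2017)
print(df.describe())                    # count/mean/std/min/quartiles/max (Table 4)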

2.3 Data Cleaning Data cleaning is the process of correcting or removing incorrect, corrupt, duplicate or incomplete data within a dataset before data analysis [19, 20]. The procedure for data cleaning varies from dataset to dataset [21] but generally involves identification of null values and outliers and replacement of missing values. There are some missing values in the SDMR dataset obtained from the meteorological department. The subdivisions Arunachal Pradesh, Andaman and Nicobar Islands, and Lakshadweep have missing records against a few years. Data values of all other features are available from 1916 to 2017 for all subdivisions, so the proposed EDA is carried out using the data available for all subdivisions from 1916 to 2017 (102 years). Dropping the null entries is a simple and quick solution to a missing data problem, but it can influence the final model. So, instead of data-dropping methods, imputation approaches which replace missing values with the mean or median of the dataset or some other summary statistic are used in the proposed work. The mean() function from the Pandas library is used to fill in the remaining missing values in the dataset. Figure 2 is a line plot which shows the count of null values in different columns of the SDMR dataset before and after imputation as blue and red lines, respectively.
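The cleaning step can be sketched as follows, assuming the DataFrame df loaded in the previous snippet; the year filter and the column-mean imputation mirror the procedure described in this subsection.

print(df.isnull().sum())                          # null count per column before imputation (Fig. 2, blue line)

# restrict to 1916-2017, the years available for every subdivision
df = df[(df["YEAR"] >= 1916) & (df["YEAR"] <= 2017)].copy()

# impute any remaining gaps with the column means instead of dropping rows
value_cols = df.columns.drop(["SUBDIVISION", "YEAR"])
df[value_cols] = df[value_cols].fillna(df[value_cols].mean())

print(df.isnull().sum())                          # all zeros after imputation (Fig. 2, red line)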


Table 4 Dataset description table

Feature | Count | Mean | Std | Min | 25% | 50% | 75% | Max
YEAR | 4188 | 1959.2 | 33.7 | 1901.0 | 1930.0 | 1959.0 | 1988.0 | 2017.0
JAN | 4188 | 18.9 | 33.8 | 0.0 | 0.6 | 6.0 | 22.0 | 583.7
FEB | 4188 | 21.6 | 35.7 | 0.0 | 0.5 | 6.5 | 26.6 | 403.5
MAR | 4188 | 27.4 | 46.9 | 0.0 | 1.0 | 7.9 | 31.3 | 605.6
APR | 4188 | 43.1 | 68.1 | 0.0 | 3.0 | 15.5 | 49.5 | 595.1
MAY | 4188 | 85.7 | 122.7 | 0.0 | 8.7 | 36.9 | 97.7 | 1168.6
JUN | 4188 | 230.1 | 234.2 | 0.4 | 70.9 | 139.0 | 304.3 | 1609.9
JUL | 4188 | 347.0 | 268.6 | 0.0 | 175.9 | 285.3 | 418.4 | 2362.8
AUG | 4188 | 289.7 | 188.3 | 0.0 | 156.0 | 258.9 | 377.5 | 1664.6
SEP | 4188 | 197.3 | 135.5 | 0.1 | 100.4 | 173.9 | 265.8 | 1222.0
OCT | 4188 | 95.3 | 99.1 | 0.0 | 14.6 | 65.8 | 148.1 | 948.3
NOV | 4188 | 39.5 | 68.3 | 0.0 | 0.6 | 9.5 | 45.0 | 648.9
DEC | 4188 | 19.0 | 43.0 | 0.0 | 0.1 | 3.1 | 17.7 | 617.5
ANNUAL | 4188 | 1414.7 | 905.6 | 62.2 | 805.0 | 1123.4 | 1651.0 | 6331.1
JF | 4188 | 40.5 | 59.3 | 0.0 | 4.0 | 19.0 | 50.2 | 699.5
MAM | 4188 | 156.1 | 201.3 | 0.0 | 24.2 | 75.2 | 197.6 | 1745.8
JJAS | 4188 | 1064.2 | 706.3 | 57.4 | 574.2 | 881.3 | 1287.4 | 4537.0
OND | 4188 | 153.8 | 166.9 | 0.0 | 34.2 | 98.3 | 211.6 | 1252.5

Fig. 2 Line plot showing null values of SDMR before and after imputation


2.4 Data Visualization and Analysis The dataset obtained from the data cleaning stage is subjected to the next stage of the EDA workflow, data visualization and analysis, to draw various inferences from the SDMR dataset. This cleaned dataset, when subjected to visualization, produces accurate inferences. Data visualization approaches create graphical or pictorial representations of the data to identify different patterns, trends and correlations in the data. The visualizations which find the relationships between features and the patterns in the feature values of the SDMR dataset are discussed in Sect. 2.5. The visualizations made and the inferences obtained from the dataset are discussed in Sect. 3.

2.5 Examining Relationship Between Features and Finding Patterns It is very important to understand the relationship between the features of a dataset in order to extract correct conclusions from the statistical analysis. Since this paper proposes an EDA to visualize the seasons and months of high rainfall in different meteorological subdivisions of the country based on the SDMR dataset collected from 1901 to 2017 [15], it is important to identify the correlation between the features of the dataset. Correlation is the statistical measure that expresses the extent to which two features are linearly related [22]. The correlation between features can be positive or negative. Positive correlation means that an increase in the values of one feature leads to an increase in the other, whereas in negative correlation, an increase in the value of one feature leads to a decrease in the other. In Python, the corr() function finds the correlation between features. The positive and negative correlation values between the features of the SDMR dataset are shown in the correlation table in Fig. 3. From Fig. 3, only the feature YEAR has a negative correlation with all other features. While all the other features show a positive correlation with one another, it is found from Fig. 3 that the annual rainfall has higher positive correlations with the features corresponding to the monsoon months June, July, August, and September. The linear pattern of higher rainfall during the monsoon months from June to September is shown in the heat maps in Fig. 4. Heat maps are useful for understanding the relationship between numerical columns in a dataset. A heat map contains different shades of the same color representing different values; generally, darker shades indicate larger values than lighter shades, as shown in Fig. 4. Correlation analysis is done on a monthly and seasonal basis for the years from 1916 to 2017, as shown in the heat maps in Fig. 4a and b, respectively. The dark blue shade in Fig. 4a and b corresponds to the months and seasons receiving high rainfall throughout the years.


Fig. 3 Correlation values between the features in SDMR Dataset

These figures clearly indicate high monthly and seasonal rainfall patterns in the months from June to September.
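A sketch of this correlation analysis with Pandas and Matplotlib is shown below; it assumes the cleaned DataFrame df from Sect. 2.3 and reproduces the kind of views shown in Figs. 3 and 4.

import matplotlib.pyplot as plt

corr = df.drop(columns=["SUBDIVISION"]).corr()     # pairwise correlations (Fig. 3)

fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(corr, cmap="Blues")                 # darker cells = stronger correlation
ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
ax.set_title("Correlation between SDMR features")
plt.tight_layout()
plt.show()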

3 Inferences from the Dataset and Discussion Four climatological seasons are defined by the IMD in the SDMR dataset: January to February (JF), March to May (MAM), June to September (JJAS), and October to December (OND). The winter season is from January through February, and the summer season, also known as the pre-monsoon season, lasts from March through May. The monsoon season, also known as the rainy season, lasts from June through September. From October to December, the post-monsoon or autumn season occurs [23]. In this paper, the monthly rainfall data, SDMR, collected from IMD is analyzed, and visualizations of rainfall variations on a monthly, seasonal, and annual basis are discussed and elaborated in this section. The visualizations of monthly, seasonal and annual rainfall are made year-wise and subdivision-wise to determine the flood-prone months and subdivisions, as shown and discussed in the subsections below.


Fig. 4 Heat map for a monthly and b seasonal rainfall of SDMR

The plots are made using the different plot functions in the Matplotlib library of Python.

3.1 Visualization of Annual Rainfall (1916–2017) The mean annual rainfall across the different subdivisions in the SDMR dataset is represented using the stem plot in Fig. 5. The plot is drawn using the stem() function in the Matplotlib library of Python. From Fig. 5, the highest mean annual rainfall is recorded in the year 1961. It is also found from Fig. 5 that the annual rainfall increases or decreases from one year to the next without showing a definite pattern. The ten years during which the highest rainfall was recorded are tabulated in Table 5 along with the mean annual rainfall in mm. Even from Table 5, information to predict floods in a particular subdivision cannot be obtained. So, a detailed analysis of the seasonal and monthly precipitation data on a subdivision basis is found essential to predict flood-prone months and subdivisions and to plan preventive measures in those subdivisions.


Fig. 5 Stem plot showing the mean annual rainfall from 1916 to 2017

Table 5 Top 10 highest mean annual rainfall recorded years

Year | Mean rainfall | Year | Mean rainfall
1961 | 1717.08 | 1956 | 1612.82
1917 | 1671.57 | 1988 | 1605.47
1933 | 1666.74 | 1946 | 1584.08
1959 | 1648.23 | 1916 | 1581.08
1990 | 1614.31 | 1936 | 1564.66

Bold indicates the highest mean annual rainfall recorded in the year

Seasonal and month-wise EDA based on the subdivisions is made in the subsections below.
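The annual view of Fig. 5 and Table 5 can be sketched as follows, again assuming the cleaned DataFrame df.

import matplotlib.pyplot as plt

annual_mean = df.groupby("YEAR")["ANNUAL"].mean()            # mean annual rainfall per year

plt.figure(figsize=(12, 4))
plt.stem(annual_mean.index, annual_mean.values)               # stem plot as in Fig. 5
plt.xlabel("Year")
plt.ylabel("Mean annual rainfall (mm)")
plt.title("Mean annual rainfall, 1916-2017")
plt.show()

print(annual_mean.sort_values(ascending=False).head(10))      # top 10 years (Table 5)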

3.2 Visualization of Seasonal Rainfall The average rainfall received in the different seasons JF, MAM, JJAS and OND in the SDMR dataset is represented as a pie chart in Fig. 6 and tabulated in Table 6. The pie chart is made using the pie() function in the Matplotlib library of Python. From Table 6 and Fig. 6, the monsoon season (JJAS: June, July, August and September) accounts for more than 75% of the rainfall. It is thus indicated that in the monsoon season the chances of flood are higher. To find the month of the season when rainfall is highest, visualization of the average monthly data is provided in Sect. 3.3.
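The seasonal split of Fig. 6 and Table 6 can be sketched as follows, assuming the DataFrame df.

import matplotlib.pyplot as plt

seasons = df[["JF", "MAM", "JJAS", "OND"]].mean()             # average seasonal rainfall (Table 6)

plt.figure(figsize=(5, 5))
plt.pie(seasons.values, labels=seasons.index, autopct="%.2f%%")
plt.title("Average seasonal share of rainfall, 1916-2017")
plt.show()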


Fig. 6 Piechart on seasonal rainfall data from 1916 to 2017

Table 6 Average seasonal rainfall from 1916 to 2017

Seasons | Average rainfall in mm | Percentage
JF | 40.25 | 2.82
MAM | 158.60 | 11.11
JJAS | 1073.12 | 75.18
OND | 153.43 | 10.89
Total | 1427.40 | 100.00

Bold indicates the monsoon season (JJAS) which recorded highest rainfall

3.3 Visualization of Average Monthly Rainfall The percentage of average monthly rainfall across the country is given in Table 7. It shows that the rainfall is higher in the months of June to September and peaks in July. The mean monthly rainfall across India from the years 1916 to 2017 is also represented as a bar plot in Fig. 7. The bar plot is drawn using the bar() function in the Matplotlib library of Python. It is observed from Fig. 7 that the average rainfall is higher in the monsoon season from June to September. Table 8 and Fig. 8 show the trend of the mean monthly rainfall for the last 10 years from 2008 to 2017 across the different subdivisions; here too the heavy rainfalls are recorded in the months from June to September, with the highest rainfall (372.5 mm in 2013) recorded in the month of July and the minimum rainfall in January, February, and December. But the subdivision which receives the maximum rainfall will be the flood-prone subdivision and is to be identified for prediction of floods and for planning flood preventive measures in that subdivision. So, subdivision-based visualization of the SDMR dataset is made in Sect. 3.4.
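The monthly view of Fig. 7 and Table 7 can be sketched as follows, assuming the DataFrame df.

import matplotlib.pyplot as plt

months = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
monthly_mean = df[months].mean()                               # mean rainfall per month

plt.figure(figsize=(10, 4))
plt.bar(months, monthly_mean.values)                           # bar plot as in Fig. 7
plt.ylabel("Average rainfall (mm)")
plt.title("Average monthly rainfall, 1916-2017")
plt.show()

print((100 * monthly_mean / monthly_mean.sum()).round(2))      # percentages as in Table 7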

Table 7 Percentage of average monthly rainfall

No. | Month | Percentage of rainfall
1 | JAN | 1.31
2 | FEB | 1.50
3 | MAR | 1.92
4 | APR | 3.05
5 | MAY | 6.12
6 | JUN | 16.24
7 | JUL | 24.50
8 | AUG | 20.45
9 | SEP | 13.98
10 | OCT | 6.79
11 | NOV | 2.77
12 | DEC | 1.31

Bold indicates the month (July) which recorded highest rainfall

Fig. 7 Bar plot on average monthly rainfall for years from 1916 to 2017


Table 8 Average monthly rainfall data for the last ten years from 2008 to 2017 for all subdivisions

Month | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017
JAN | 19 | 8 | 9 | 14 | 27 | 12 | 16 | 19 | 10 | 25
FEB | 17 | 10 | 12 | 22 | 9 | 36 | 25 | 16 | 10 | 9
MAR | 50 | 17 | 18 | 27 | 11 | 17 | 29 | 51 | 27 | 33
APR | 40 | 30 | 49 | 49 | 58 | 31 | 22 | 76 | 31 | 47
MAY | 69 | 79 | 86 | 71 | 50 | 82 | 91 | 75 | 84 | 77
JUN | 259 | 137 | 202 | 241 | 183 | 297 | 142 | 239 | 214 | 232
JUL | 293 | 342 | 361 | 314 | 295 | 372 | 309 | 269 | 349 | 322
AUG | 307 | 220 | 305 | 329 | 296 | 264 | 291 | 235 | 242 | 277
SEP | 192 | 174 | 221 | 225 | 234 | 179 | 198 | 158 | 198 | 191
OCT | 71 | 93 | 95 | 53 | 76 | 144 | 81 | 61 | 66 | 103
NOV | 33 | 60 | 75 | 29 | 38 | 27 | 21 | 55 | 14 | 22
DEC | 13 | 15 | 29 | 12 | 13 | 8 | 16 | 21 | 23 | 25

Fig. 8 Line plot on average monthly rainfall data for the last ten years from 2008 to 2017


3.4 Visualization of Subdivision-Based Rainfall The visualization of SDMR based on subdivisions is important to identify the subdivisions that are prone to higher rainfall and the month during which those subdivisions will be subjected to the highest rainfall, so that such subdivisions can be kept under vigilance for incidence of flood and flood prevention. The bar plot in Fig. 9 is made between the average annual rainfall received in mm and the different meteorological subdivisions. It is found that more than half of the precipitation falls on ten subdivisions: Andaman and Nicobar Islands, Arunachal Pradesh, Assam and Meghalaya, Coastal Karnataka, Kerala, Konkan & Goa, Naga Mani Mizo Tripura (Nagaland, Manipur, Mizoram and Tripura), Lakshadweep, Gangetic West Bengal, and Sub-Himalayan West Bengal and Sikkim. Coastal Karnataka receives the most precipitation and West Rajasthan the least. The higher average annual rainfall recorded in the top 10 subdivisions is shown in Table 9 and the bar plot in Fig. 10. The highest average annual rainfall across the years from 1916 to 2017, recorded in Coastal Karnataka, is 3430.40 mm. The months in which heavier rainfall is received by the top ten subdivisions are analyzed subsequently. From the bar plots in Figs. 11 and 12, the highest average rainfall is received by most subdivisions in the month of July for the years from 1916 to 2017. Especially, the subdivisions Coastal Karnataka and Konkan & Goa receive higher rainfall in the month of July, as shown in Fig. 11a and Fig. 12c, respectively. The subdivisions Coastal Karnataka, Arunachal Pradesh and Sub-Himalayan West Bengal and Sikkim receive higher rainfall in all the months from June to October, as is clear in Fig. 11a, b and f.

Fig. 9 Average annual rainfall in different meteorological subdivisions of India


Table 9 Highest average annual rainfall of top ten subdivisions from 1916 to 2017

Subdivision | Annual rainfall (mm)
Coastal Karnataka (CK) | 3430.40
Arunachal Pradesh (AP) | 3410.77
Konkan and Goa (K&G) | 3039.72
Kerala (KE) | 2898.15
Andaman and Nicobar Islands (A&NI) | 2892.06
Sub-Himalayan West Bengal and Sikkim (SHWB&S) | 2767.11
Assam and Meghalaya (AM) | 2563.45
Naga Mani Mizo Tripura (NMMT) | 2438.48
Lakshadweep (LA) | 1580.41
Gangetic West Bengal (GWB) | 1498.17

Bold indicates the subdivision (Coastal Karnataka) which recorded highest average annual rainfall

Fig. 10 Top ten subdivisions with highest average annual rainfall from 1916 to 2017

Also, the subdivisions Assam and Meghalaya, Konkan & Goa and Kerala receive higher rainfall in all the months from June to October, as is clear in Fig. 12a, c and d. From the inferences made in Figs. 10, 11 and 12 and Table 9, the ten subdivisions corresponding to the higher average annual rainfall and the months in which they receive higher rainfall are identified. The average month-wise precipitation data of all subdivisions, as shown in the line plot in Fig. 13, identifies different peaks for the different subdivisions. From Fig. 13, though most subdivisions receive higher rainfall in July, Tamil Nadu, Rayalseema and Coastal Andhra Pradesh receive higher rainfall in the month of October. To emphasize the month in which different subdivisions receive the highest rainfall, the mode of the months which received the highest rainfall from 1916 to 2017 is identified for each subdivision and tabulated in Table 10. The data in Table 10 and Fig. 13 give the possible month of highest rainfall for each subdivision.
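The subdivision-wise ranking (Table 9) and the mode of the peak-rainfall month (Table 10) can be sketched as follows, assuming the DataFrame df; the exact aggregation used by the authors may differ in detail.

months = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

top10 = (df.groupby("SUBDIVISION")["ANNUAL"].mean()
           .sort_values(ascending=False).head(10))
print(top10)                                   # ten wettest subdivisions (Fig. 10, Table 9)

wettest_month = df[months].idxmax(axis=1)      # month with highest rainfall in each record
modal_month = (wettest_month.groupby(df["SUBDIVISION"])
                            .agg(lambda s: s.mode().iloc[0]))
print(modal_month)                             # most frequent peak month per subdivision (Table 10)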


Fig. 11 Bar plot showing the month-wise average rainfall received by a Coastal Karnataka, b Arunachal Pradesh, c Lakshadweep, d Andaman and Nicobar Islands, e Naga Mani Mizo Tripura, f Sub-Himalayan West Bengal and Sikkim


Fig. 12 Bar plot showing the month-wise average rainfall received by a Assam and Meghalaya, b Gangetic West Bengal c Konkan Goa, d Kerala

The subdivisions will be vulnerable to flood during the months of frequent heavy rainfall as highlighted bold in Table 10 and so the necessary flood prevention strategies shall be followed for those months. Flood preventive measures can be adopted in the subdivisions during the months when rainfall is less as shown in Table 10. Also, water wastage can be prevented by deepening the ponds, clearing waterways. These preventive and water conserving measures will protect lives and property from floods and increase water levels in reservoirs. Inferences obtained from proposed EDA can also be used to design time series or machine learning-based models for the prediction and management of flood as done in the publications [24–27].

4 Conclusion In this EDA-based paper, the damage and devastation caused by floods to human lives and property are discussed, and a brief survey is made of the inferences made by different authors on different precipitation datasets.


Fig. 13 Line plot on average month-wise precipitation data for all subdivisions from 1916 to 2017

The proposed EDA is an important step in the research, which included understanding features, detecting outliers, examining feature relationships and visualization. The proposed EDA analyzed the SDMR dataset of IMD, made detailed visualizations on an annual, seasonal, monthly, and subdivisional basis, and identified the top ten meteorological subdivisions which receive the highest rainfall as the flood-prone subdivisions. The month in which these flood-prone subdivisions receive the highest rainfall is also inferred and recorded. This information is helpful to plan flood preventive measures in the different flood-prone subdivisions of the country. In future, the design of a more accurate Machine Learning-based flood forecasting model is planned.

Table 10 Count of highest rainfall month from 1916–2017 by subdivisions of India. For each of the 36 meteorological subdivisions, the table lists, month by month (JAN–DEC), the number of years between 1916 and 2017 in which that month recorded the highest rainfall of the year. Bold indicates the months which recorded highest rainfall in different subdivisions.


Acknowledgements The authors are very thankful to the India Meteorological Department for providing open access to the Sub-Divisional Monthly Rainfall (SDMR) data from 1901 to 2017 for the EDA.

References
1. India today, 27 Aug 2019 report. https://www.indiatoday.in/india/story/loss-due-floods-indiapeople-killed-crop-houses-damaged-in-65-years-1591205-2019-08-27
2. IPCC climate change report. https://idronline.org/article/environment/ipcc-climate-change-report-what-does-it-mean-for-india/. Accessed 21 Feb 2022
3. Niloy P, Rituparna A, Sandip M, Sudipta M, Indrajit P, Debashis M, Anirban M (2021) SAR based flood risk analysis: a case study Kerala flood 2018. In: Advances in space research, vol 69, Issue 4. Elsevier, B.V., pp 1915–1929
4. Gogoi C, Goswami DC, Phukan S (2013) Flood risk zone mapping of the Subansiri sub-basin in Assam, India. Int J Geomatics Geosci 4(1):75–78
5. Office of the Queensland Chief Scientist. https://www.chiefscientist.qld.gov.au/publi-cations/understanding-floods
6. Srivastava HN, Sinha Ray KC, Dikshit SK, Mukhopadhaya RK (1998) Trends in rainfall and radiation over India. Vayu Mandal 1:41–45
7. Guhathakurta P, Sreejith OP, Menon PA (2011) Impact of climate change on extreme rainfall events and flood risk in India. J Earth Syst Sci 120(3):359–373
8. Rajeevan M, Bhate J, Jaswal AK (2008) Analysis of variability and trends of extreme rainfall events over India using 104 years of gridded daily rainfall data. Geophys Res Lett 35(18)
9. Goswami BN, Venugopal V, Sengupta D, Madhusoodanan MS, Xavier PK (2006) Increasing trend of extreme rain events over India in a warming environment. Science 314(5804):1442–1445
10. Rao PSB, Shetty S, Umesh P, Shetty A (2018) An exploratory analysis of rainfall: a case study on western ghats of India. In: Proceedings of the international conference on industrial engineering and operations management. USA, pp 1607–1617
11. Mondal A, Mujumdar PP (2015) Modeling non-stationarity in intensity, duration and frequency of extreme rainfall over India. J Hydrol 521:217–231
12. Sinha HK, Manikandan N, Chaudhary JL, Nag S (2020) Extreme rainfall trends over Chhattisgarh state of India. J Agrometeorology 22(2):215–219
13. Hsu PC, Li T, Luo JJ, Murakami H, Kitoh A, Zhao M (2012) Increase of global monsoon area and precipitation under global warming: a robust signal? Geophys Res Lett 39(6)
14. Adhikari P, Hong Y, Douglas KR, Kirschbaum DB, Gourley J, Adler R, Robert Brakenridge G (2010) A digitized global flood inventory (1998–2008): compilation and preliminary results. Nat Hazards 55(2):405–422
15. Open government data (OGD) platform India. https://data.gov.in/
16. Responsible conduct in data management. https://ori.hhs.gov/education/products/n_illinois_u/datamanagement/datopic.html
17. Business research methodology. https://research-methodology.net/research-methods/-data-collection/
18. Formplus, from https://www.formpl.us/blog/secondary-data
19. Wu X, Zhu X (2008) Mining with noise knowledge: error-aware data mining. IEEE Trans Syst Man Cybern-Part A Syst Humans 38(4):917–932. IEEE Press
20. Tableau, a Salesforce company, from https://www.tableau.com/learn/articles/what-is-data-cleaning
21. Analytics Vidhya, from https://www.analyticsvidhya.com/blog/2021/06/data-cleaning-usingpandas/


22. Statistics knowledge portal. https://www.jmp.com/en_in/statistics-knowledge-portal/what-iscorrelation.html
23. Wikipedia the free Encyclopedia. https://en.wikipedia.org/wiki/Climate_of_India
24. Chawan AC, Kakade VK, Jadhav JK (2020) Automatic detection of flood using remote sensing images. J Inf Technol 2(01):11–26
25. Smys S, Basar A, Wang H (2020) CNN based flood management system with IoT sensors and cloud data. J Artif Intell
26. Laio F, Porporato A, Revelli R, Ridolfi L (2003) A comparison of nonlinear flood forecasting methods. Water Resour Res
27. Toth E, Brath A, Montanari A (2000) Comparison of short-term rainfall prediction models for real-time flood forecasting. J Hydrol 132–147

Chapter 48

Segmentation of Shopping Mall Customers Using Clustering D. Deepa, A. Sivasangari, R. Vignesh, N. Priyanka, J. Cruz Antony, and V. GowriManohari

1 Introduction Data is very precious in today's ever-competitive world. Every day, organizations and people encounter large amounts of data. An efficient way to handle this data is to classify or categorize it into clusters, groups, or partitions. "Usually, the classification methods are either supervised or unsupervised, depending on whether they have labeled datasets or not". Unsupervised classification is exploratory data analysis in which there is no training dataset and hidden patterns are extracted from data with no labeled responses, whereas supervised classification is the machine learning task of deducing a function from a training dataset. The main focus is to enhance the propinquity or closeness of data points belonging to the same group and to increase the variance among the various groups, and all this is achieved through some measure of similarity [1]. Exploratory data analysis through clustering deals with a wide range of applications such as "engineering, text mining, pattern recognition, bioinformatics, spatial data analysis, mechanical engineering, voice mining, textual document collection, artificial intelligence, and image segmentation". This diversity explains the importance of clustering in scientific research, but it can also lead to contradictions due to different purposes and nomenclature [2]. Maintaining and managing customer relationships has always played a very key role in providing business intelligence to companies to build, develop and manage very important long-term relationships with customers [3]. The importance of treating customers as a main asset of the organization is increasing in the present-day era. By using clustering techniques like K-means, mini-batch K-means, and hierarchical clustering, customers with the same habits are clustered as one cluster. This allows marketing to identify different customer segments that differ in their thinking and approach to purchasing and that follow different strategies to purchase products.
D. Deepa (B) · A. Sivasangari · R. Vignesh · N. Priyanka · J. Cruz Antony · V. GowriManohari, School of Computing, Sathyabama Institute of Science and Technology, Chennai, India, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023, I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_48


The purpose of customer segmentation is to group customers who have similar interests so that marketing or business can tailor their offers to those interests. The process involves figuring out the customers' purchasing habits, expectations, desires, preferences, and attributes. The techniques of clustering consider data tuples as objects and partition the data objects into clusters or groups [4]. Customer segmentation is the process of dividing customers into various groups, called customer segments, such that each segment comprises customers who have similar interests and patterns. The segmentation process is mostly based on similarity in ways that are relevant to marketing, such as age, gender, interests, and miscellaneous spending habits. Customer segmentation is important because it enables the marketing programs to be modified so that they suit each of the segments, supports business decisions, helps in identifying the products associated with each customer segment and managing the demand and supply of those products, predicts customer defection, identifies and targets the potential customer base, and provides directions in finding solutions. Clustering is an iterative process of knowledge discovery from unorganized and huge amounts of raw data [5]. Clustering is one of the kinds of exploratory data mining used in several applications such as classification, machine learning, and pattern recognition [6].

2 Related Work 2.1 Clustering In clustering, data points are divided into groups that have similar characteristics or exhibit similar behavior so that the groups as a whole have similar characteristics or behavior. In short, segregating the data points into different clusters based on their similar traits [7].

2.1.1 Types of Clustering Algorithms

The main aim of clustering is subjective, which means there are several routes to achieving the goal of clustering. Each methodology has its own set of rules to segregate data points into different clusters. There are a number of clustering algorithms, of which the most commonly used are the K-means clustering algorithm, hierarchical clustering algorithms, and the mini-batch K-means clustering algorithm [8].

2.1.2 Applications of Clustering

Clustering is used in our daily lives, such as in data mining, academics, web cluster engines, bioinformatics, image processing, and many more areas. A few common applications where clustering is used as a tool are recommendation engines, market segmentation, customer segmentation, social network analysis (SNA), search result clustering, identification of cancer cells, biological data analysis, and medical imaging analysis [9].

2.1.3 Customer Segmentation

Customer segmentation is a tool that businesses use to organize their customer strategies and better target customers. Every customer differs in their shopping pattern, so a single approach does not work for everyone; this is where customer segmentation comes into play. Customers are grouped by characteristics such as spending culture, shopping patterns, age, income, demographics, or behaviors so that the organization can market its products to its customers more effectively [10].

2.1.4 K-means Clustering Algorithm

The K-means clustering algorithm follows an unsupervised learning model and is utilized to solve clustering problems in machine learning. This algorithm groups an unlabeled dataset into different clusters. The algorithm is iterative and segregates the unlabeled dataset into k different clusters in such a way that each data point belongs to only one cluster, grouping points with similar properties or characteristics. The algorithm is centroid-based, where each cluster is associated with a centroid, and it aims to minimize the sum of the distances between the data points and the centroids of their corresponding clusters [11].
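A minimal sketch of K-means with scikit-learn is shown below; the synthetic feature matrix and the choice of k = 5 are illustrative assumptions, not results from this work.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))            # stand-in for the scaled customer features

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)            # cluster index for every data point

print(kmeans.cluster_centers_.shape)      # (5, 4) centroids
print(np.bincount(labels))                # cluster sizes
print(kmeans.inertia_)                    # sum of squared distances to the centroids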

2.1.5 Hierarchical Clustering Algorithm

Hierarchical cluster analysis is a method of grouping similar objects into groups or clusters arranged in a hierarchy. The data points in the same cluster are broadly similar to each other, whereas the clusters themselves differ from one another. There are two kinds of hierarchical clustering algorithms: divisive, which follows a top-down approach, and agglomerative, which follows a bottom-up approach [12].

2.1.6 Mini-Batch K-means Clustering Algorithm

The idea of the mini-batch K-means clustering algorithm is to utilize small random batches of data of a fixed size that can be stored in memory. In each iteration, a new random sample is obtained and used to update the clusters, and this is repeated until convergence; the algorithm thus works on small, randomly chosen batches of the dataset in every iteration. It is faster, but sometimes gives slightly different results compared to K-means, and it reduces the computational cost of finding the clusters. When working on huge datasets, the mini-batch K-means algorithm gives better results compared to the normal K-means clustering algorithm. It is the mini-batch version of the K-means algorithm, which helps in dealing with large datasets [3].

2.1.7 Comparative Study of K-means and Mini-Batch K-means Clustering Algorithms

The mini-batch K-means algorithm is very similar to the K-means algorithm and is used especially when working with huge datasets. Because mini-batch K-means does not iterate over the entire dataset in every single pass, the computational cost of finding the clusters is lower; this is its main advantage and the reason it can outperform K-means on huge datasets [4].

2.1.8 Comparative Study of K-means and Hierarchical Clustering Algorithms

K-means and hierarchical clustering differ in how they form groups. Hierarchical clustering starts by treating each data point as its own cluster and then repeatedly merges clusters; the main aim of this algorithm is to produce a hierarchical series of nested clusters. Comparing the two algorithms on this task, the K-means algorithm performs better than hierarchical clustering [5].

3 Proposed Work
The main aim of this project is to cluster a dataset about the behaviour of credit card customers using several unsupervised algorithms and to determine the most effective algorithm after evaluating the results.


3.1 General View of Data
The dataset has 8950 records of details about accounts that belong to customers. The features and their descriptions are listed below:

C_ID – ID of the credit cardholder
BAL – The amount left in the account
BAL_FREQ – The pace at which the balance is updated
PURCHASES – Number of purchases made from the account
ONE-OFF PURCHASES – Maximum purchase amount done at a time
ADV_CASH – The cash paid by the user in advance
PURCH_FREQ – The pace at which purchases are being made
ADV_CASH_FREQ – The pace at which the cash is being paid in advance
PURC_TRXN – The total number of purchase transactions made
PAYMENTS – Payment amount done by the user
MINIMUM PAYMENTS – Minimum number of payments made by the user
PRC_FULL_PAYMENT – Percent of full payment done by the user

3.2 Data Collection and Preparation
The dataset, having 8950 records of information about the customers, has been taken from Kaggle. The more records, the better, because a larger dataset enables us to find more patterns and trends. A set of features is then selected depending on the metrics that matter most for the business, followed by preprocessing the data to remove inconsistencies, as this eventually helps in better data analysis. Figures 1, 2, and 3 are examples of the exploration and plotting of the data.

3.3 Data Analysis and Exploration
This step is a crucial one, as it can help us find interesting relations and patterns in the data. With this, we can better understand the customers' interests, choices, and purchasing patterns, so that we know which attributes are most closely related to the customers and to the business as well.


Fig. 1 Violin plot for the attributes of the data

Fig. 2 Data exploration

4 Methodology
4.1 Clustering
Clustering is the decomposition of a set of data into natural groups called clusters. The two major factors that determine the quality of a clustering are as follows: (i) the algorithmic conditions for identifying a particular cluster, i.e., tractability, and (ii) the quality of the computed clusters.


Fig. 3 Plot of gender

4.2 K-means Clustering
The K-means clustering method is an unsupervised, partition-based clustering technique that decomposes an unlabeled dataset into different clusters. The algorithm works by determining an appropriate number of clusters K, then finding the K centroids, and finally forming the clusters by assigning each data point to its closest centre.

4.3 Hierarchical Clustering
Agglomerative hierarchical clustering deviates from partition-based clustering as it builds a binary merge tree, with leaves containing the data elements and a root that contains the full dataset. The graphical representation of this tree, which embeds the nodes on the plane, is called a dendrogram.

4.4 Mini-Batch K-means Clustering
K-means is a prominent clustering algorithm because of its performance and low time cost, but as the size of the dataset under analysis increases, the computation time of K-means increases as well. To overcome this, a different approach called the mini-batch K-means algorithm is introduced. Its main idea is to divide the whole dataset into small fixed-size batches, use a new random mini-batch from the dataset to update the clusters, and repeat this iteration until convergence.


4.5 Elbow Method
Determining the optimal number of clusters for a given dataset is the most fundamental step for any unsupervised algorithm. The elbow method helps us to determine the best value of k. The sum of squared distances between the data points and their assigned cluster centroids is plotted for increasing k, and the k value is selected at the point where the curve starts to flatten out, forming an elbow in the graph. In this way, the optimal number of clusters is determined.
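A minimal sketch of the elbow computation with scikit-learn (the range of k and the synthetic data are illustrative assumptions):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

    # Sum of squared errors (inertia) for k = 1..10
    sse = []
    for k in range(1, 11):
        sse.append(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)

    plt.plot(range(1, 11), sse, marker="o")
    plt.xlabel("k (number of clusters)")
    plt.ylabel("Sum of squared distances")
    plt.show()  # the 'elbow' of this curve suggests the k to use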

5 Clustering Using Different Algorithms
Clustering is a mathematical model for discovering or decomposing a dataset into different groups based on the similarities the data points share. After conducting the literature survey, three clustering algorithms were selected, namely K-means, mini-batch K-means, and hierarchical clustering. All three algorithms have been deployed on the dataset.

5.1 Elbow Method
The idea behind the elbow method is to run K-means on the given data for a range of k values (num_clusters, e.g., k = 1–10) and, for each k value, evaluate the sum of the squared errors (SSE). Based on the resulting graph, it looks like K = 5, or five clusters, is the correct number of clusters for this analysis. The graph for the elbow method used to find the K value is shown in Fig. 4.

5.2 K-means Clustering Algorithm
As the appropriate value of K has already been determined using the elbow method, deploying the K-means algorithm on the data constructs the K clusters. The dataset undergoes an iterative process of being divided into K subgroups, with each individual data point tied to exactly one of the K groups based on similarities in the features. The "CC GENERAL" dataset is then visualized in three-dimensional space; the 3D clusters formed using mini-batch K-means are shown in Fig. 6.


Fig. 4 Graph for elbow method to find k value

Fig. 6 3D Representation of clusters formed by mini-batch K-means

5.3 Mini-Batch K-means Clustering
5.4 Hierarchical Clustering
Agglomerative hierarchical clustering deviates from partition-based clustering as it builds a binary merge tree, with leaves containing the data elements and a root that contains the full dataset. The graphical representation of this tree, which embeds the nodes on the plane, is called a dendrogram, as shown in Fig. 7 [13].
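A minimal sketch of building such a dendrogram with SciPy (Ward linkage and the synthetic data are illustrative choices, not necessarily those used in the paper):

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

    # Agglomerative merge tree: each row of Z records one merge of two clusters
    Z = linkage(X, method="ward")
    dendrogram(Z)
    plt.title("Dendrogram for hierarchical clustering")
    plt.show()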


Fig. 7 Dendrogram for hierarchical clustering

6 Performance Analysis
Unlike supervised algorithms such as a linear regression model, where there is a target to predict and the accuracy can be measured using metrics such as RMSE, MAPE, and MAE, a clustering model has no target to aim at, so it is not possible to calculate an accuracy score. Hence, the aim is to create clusters with distinct or unique characteristics. The two most common metrics for measuring the distinctness of clusters are the following. Silhouette Coefficient: this score ranges between −1 and 1, where higher scores indicate well-defined and distinct clusters. Davies-Bouldin Index: contrary to the Silhouette score, this score measures the similarity among the clusters, so the lower the score, the better the clusters that are formed. Both scores can be calculated using scikit-learn (a sketch is given after Table 1). The comparison of performance among the three different clustering methods is given in Table 1.
Table 1 Comparison of performance among the three different clustering methods

Algorithms | Silhouette score | Davies-Bouldin score
K-means | 0.444286 | 0.821878
Hierarchical clustering | 0.444286 | 0.821878
Mini-batch K-means | 0.440189 | 0.821672
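A minimal sketch of how these two scores can be computed with scikit-learn, assuming X is the preprocessed feature matrix and labels are the cluster assignments produced by one of the algorithms (the synthetic data here is only a stand-in):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score, davies_bouldin_score

    X, _ = make_blobs(n_samples=500, centers=5, random_state=0)  # stand-in for the prepared data
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

    print("Silhouette score:", silhouette_score(X, labels))          # higher is better
    print("Davies-Bouldin score:", davies_bouldin_score(X, labels))  # lower is better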


7 Conclusion
Customer segmentation plays a significant role in attracting customers toward products, which in turn helps increase the scale of the business in the market. Segmenting customers into different groups according to the similarities they possess, on one hand, helps marketers provide customized ads, products, and offers; on the other hand, it supports customers by saving them from confusion about which products to buy. The clusters obtained by deploying the three different clustering algorithms on the customer data were compared using metrics that measure the distinctness and uniqueness of the clusters. It is observed that the K-means algorithm produces the best clusters by obtaining the highest Silhouette score and the lowest Davies-Bouldin score, followed by hierarchical clustering and mini-batch K-means clustering. However, it cannot be said that K-means is the most effective clustering algorithm in every case; this depends on various factors such as the size and the attributes of the data.

References
1. Sungheetha A (2021) Assimilation of IoT sensors for data visualization in a smart campus environment. J Ubiquitous Comput Commun Technol 3(4):241
2. Ginimav I (2020) Live streaming architectures for video data-a review. J IoT Soc Mobile Analytics Cloud 2(4):207–215
3. Banduni A, Ilavedhan A, Customer segmentation using machine learning
4. Bindra K, Mishra A (2017) A detailed study of clustering algorithms
5. Peng K, Leung VCM, Huang Q (2018) Clustering approach based on mini batch K-means
6. Murtagh F, Contreras P (2018) Methods of hierarchical clustering
7. Kushwaha DPY, Prajapati D (2008) Customer segmentation using K-means Algorithm. 8th Semester
8. Kaushik M, Mathur B (2014) Comparative study of k-means and hierarchical clustering techniques
9. Feizollah A, Anuar NB, Salleh R, Amalina F (2014) Comparative study of k-means and mini batch k-means clustering algorithms. Int J Softw Hardware Res Eng
10. Ishantha A (2021) Mall customer segmentation using clustering algorithm. Future University Hakodate, Conference Paper
11. Dogan O, Aycin E, Bulut ZA (2018) Customer segmentation by using RFM model and clustering methods: a case study in retail industry. Int J Contemp Econ Adm Sci
12. Sari JN, Ferdiana R, Nugroho L, Santosa PI (2016) Review on customer segmentation technique
13. Deepa D, Jena S, Ganesh Y, Roobini MS, Ponraj A (2021) Threat level detection in android platform using machine learning algorithms. In: Advances in electronics, communication and computing. Springer, Singapore, pp 543–551

Chapter 49

Advanced Approach for Heart Disease Diagnosis with Grey Wolf Optimization and Deep Learning Techniques Dimple Santoshi, Sangita Chaudhari, and Namita Pulgam

1 Introduction
Heart disease (HD) is one of the most frequent illnesses today, owing to an assortment of contributing factors such as high blood pressure, diabetes, fluctuating cholesterol levels, tiredness, and many more. For many years, researchers have worked to make an early diagnosis of this illness, and several data analytic tools have been developed to assist healthcare professionals in recognizing some of the first indicators of HD. Predicting HD in its early stages may save lives via the use of several tests that can be done on prospective patients [1]. A patient's heart status may be assessed using ECG, which is a critical diagnostic technique. It has gained popularity in recognizing abnormally quick and slow heart rates (tachycardia and bradycardia). These signals help to understand the electrical activity of the human heart and comprise a few waveforms (P, QRS, and T). The length and shape of each waveform and the separations between distinctive peaks are utilized to analyse heart illnesses. When DL methods are applied to ECG data, several studies have shown experimentally that deep learning features are more informative than expert-designed features [2, 3]. When it comes to ECG analysis tasks such as illness identification [4], sleep staging [5], and other activities, deep learning approaches outperform classical methods on a wide range of tasks; however, DL performs well only when huge amounts of data are available, and training such data into complex models makes the entire process expensive.

D. Santoshi (B) · S. Chaudhari · N. Pulgam Ramrao Adik Institute of Technology, D Y Patil Deemed to be University, Nerul, Navi Mumbai, Maharashtra, India e-mail: [email protected] S. Chaudhari e-mail: [email protected] N. Pulgam e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al, (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_49


Hence, with optimization algorithms one can figure out the values of the parameters (weights) that minimize the error while mapping inputs to outputs. These optimizers strongly affect the accuracy of DL models and also affect the training speed of the model. With grey wolf optimization (GWO) in the diagnosis of heart disease, one can obtain highly competitive results in terms of local optima avoidance. GWO distinguishes itself among swarm-based algorithms through its simple principle, fast search speed, high search accuracy, and ease of implementation, and it is more easily combined with practical engineering problems than other optimization algorithms such as dragonfly, particle swarm optimization, the cuckoo algorithm, ant colony optimization, and many more. This is the reason why GWO has high theoretical research value, and this is how GWO helps us in a better diagnosis of HD. This investigation contributes to the improvement of heart disease prediction strategies by utilizing ECG signal data combined with clinical data. The grey wolf optimizer is a relatively new, nature-inspired optimization algorithm for this problem statement. This research contributes mainly to the prediction of heart diseases, as it uses ECG signal data and grey wolf optimization and reports the results with and without optimization. The rest of the paper is organized as follows. Section 2 reviews the research areas of HD diagnosis, and Sect. 3 describes the proposed system in two phases, followed by the results and discussions in Sect. 4. Finally, Sect. 5 concludes the paper.

2 Literature Survey
The GWO strategy is utilized for feature selection, which eliminates features that are unrelated to each other and those that are redundant; it has a significant impact on prediction performance. Following feature selection, the autoencoder (AE)-based recurrent neural network (RNN) technique overcomes the dimensionality issues associated with feature selection. Additionally, the UCI database may be used to accurately predict the various types of illnesses. As per [6], the authors studied the GWO method on the Cleveland data set and applied it to identify important features. In an inventive method to tune the parameters of a power system modulator using the foraging behaviour of grey wolves [7], the results concluded that the formulated way of tuning based on the grey wolves' hunting behaviour reduces the peak overshoot by a maximum of 9.4% and a minimum of 2.38% when compared to the conventional and other algorithms. In addition, a body sensor network has been used to identify heart diseases using enabling data ensemble technology in a fog computing environment [8]. To guarantee high accuracy, a forest ensemble is utilized for prediction; it can be combined with a deep neural network to further improve the accuracy of the forecast.


An ECG is a fast and painless test that records the electrical signals of the heart. It can spot abnormal heart rhythms. One can have an ECG while at rest or while exercising (stress electrocardiogram). Automation in ECG interpretation systems has advanced tremendously in the previous several years [2]. Deep learning-based procedures have achieved or even outperformed cardiologist-level performance in particular subtasks [2], or have enabled claims that were previously impossible to make, such as reliably estimating age and gender from the ECG. A comparative study of these ML and DL models in the detection of HD is shown in [9]. Multiple research articles every year show that the apparent simplicity and decreased dimensionality of ECG classification data has piqued the attention of the greater machine learning community; a recent contribution of this kind is illustrated in [10]. ECG/EKG data has also been annotated or localized using deep learning approaches [11]. Real-time ventricular tachycardia exit detection using an autoencoder [1] and a recurrent neural network (RNN) is achieved using a 12-lead ECG. To better understand arrhythmias, Pourbabaee et al. [2] used the MIT-BIH arrhythmia data set and focussed on annotation of the foetal QRS complex (detection of the Q and R waves as well as heart rate calculation) rather than foetal QRS complex detection. The QT database (QTDB) on PhysioNet was used to investigate P-wave annotation [2], as well as other forms of ECG/EKG wave annotation, including P-wave annotation [1]. Later, researchers implemented a secure multimodal biometric framework designed around multiple levels of fusion, employing a CNN and a Q-Gaussian multi-support vector machine (QG-MSVM). Both the PTB Diagnostic data set (containing 549 15-channel ECG recordings from 290 subjects) and the CYBHi data set (containing 65 patients with an average age between 21.64 and 40.56 years) have been utilized to test the proposed technique's performance. An ECG signal denoising approach based on denoising autoencoders and fully convolutional networks has also been developed [12]. These findings show that the bidirectional recurrent autoencoder (BRDAE), as introduced in [13], offers advantages over the traditional denoising strategy by providing PPG feature accentuation for pulse waveform analysis. The convolutional neural network algorithm can be applied for early heart disease risk determination using structured data [14, 15]; the accuracy obtained using such models reaches up to 85–88%. These works also compare various DL models in the prediction of heart disease diagnosis.

3 Proposed System
As per the points mentioned in the introduction and literature survey of this paper, it is very clear that HD diagnosis with DL models alone has various limitations; hence, GWO has been considered to overcome them. In addition, many such HD diagnoses and analyses


have been done in consideration of very old data. To bridge this gap, the proposed model adopts an advanced approach for classifying and diagnosing heart disease by introducing the new concept of merging clinical data with ECG data through an index matching technique and performing its analysis in two categories, i.e. "Without Optimization" and "With Optimization". For this, the newly released, publicly available clinical 12-lead PTB-XL ECG data set has been used, and the nature-inspired grey wolf optimization (GWO) technique is used for selecting the best features from the clinical data, which are further merged with the ECG data. Class imbalance is overcome by using grey wolf optimization at the feature selection stage and proper weight adjustments in GWO. Later, deep neural networks have been used on the merged data set for further evaluations.

3.1 Without Optimization
In this category of the experimentation, the flow of the system starts with reading the ECG 12-lead readings; later, the classification into scored and unscored classes is done. Further, data splitting into training and validation sets using k-fold cross-validation is performed on the classified data. Later, the data is represented using a confusion matrix and then normalized. Lastly, the normalized data is fed to the DL models and the classification of heart diseases is done. The algorithm for HD diagnosis with the PTB-XL 12-lead ECG data set is presented in Algorithm 1 (a minimal sketch of step 5 appears after the algorithm), and the flow chart is shown in Fig. 1.
Algorithm 1 Heart Disease diagnosis with PTB-XL 12-lead ECG data set
1. Read PTB-XL 12-lead data set into an array
2. Classify the data into scored and unscored classes using Systematized Nomenclature of Medicine (SNOMED) class labels: SNOMED-scored to encode labels, SNOMED-unscored to define target variables
3. Split data into training and validation using K-fold
4. Compute confusion matrix and perform data normalization
5. Repeat for ANN, CNN and RNN:
(i) Model selection in ANN, CNN, and RNN, i.e. sequential()
(ii) Set the number of dense, globalAveragePooling1D, input shape, and activation values
(iii) Compile the model with hyper-parameters
(iv) Fit the model with batch size (let's say, 10)
(v) Perform predictions
(vi) Find best threshold
(vii) Plot normalized confusion matrix
6. End the process
7. Determine the best model
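The following is a minimal, hedged sketch of step 5 of Algorithm 1 using TensorFlow/Keras; the layer sizes, the number of diagnostic classes, and the input shape are illustrative assumptions and not values reported by the authors:

    import tensorflow as tf
    from tensorflow.keras import layers, models

    # Assumed shapes: 12-lead ECG records of 5000 samples each, 27 scored classes (illustrative)
    n_timesteps, n_leads, n_classes = 5000, 12, 27

    model = models.Sequential([
        layers.Input(shape=(n_timesteps, n_leads)),
        layers.Dense(64, activation="relu"),            # dense layer applied per time step
        layers.GlobalAveragePooling1D(),                # average over the time axis, as in step 5(ii)
        layers.Dense(n_classes, activation="sigmoid"),  # multi-label diagnosis output
    ])

    # Step 5(iii): compile with hyper-parameters
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.AUC(name="auc"),
                           tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])

    # Step 5(iv): fit with a batch size of 10 on prepared arrays X_train, y_train (not shown here)
    # model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=10)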


Fig. 1 The flowchart of proposed architecture (without optimization)

3.2 With Optimization (Grey Wolf Optimization)
Among research on stochastic algorithms, the introduction, development, and implementation of nature-inspired computing (NIC) algorithms has become a popular domain. NIC algorithms are put forward by drawing motivation from nature and have been demonstrated to be efficient in solving the problems that humans deal with [16, 17]. The most important NIC algorithms are the metaheuristic algorithms, many of which are population-based [17]. Global optimal solutions cannot be guaranteed; therefore, most metaheuristic algorithms introduce randomness to avoid local optima. The individuals in swarms are well controlled through separation, alignment, and cohesion [18] together with randomness, and their current velocities incorporate previous velocities, random multiples of the frequency [19], or Euclidean distances to specific individuals' positions [20]. A few enhancements have been made with inertia weight adjustment [21], hybridization with invasive weed optimization [22], chaos [23], and binary [24] vectors, among others. Many of these changes result in better performance of the particular algorithm, irrespective of the general structure. Many such algorithms and their corresponding modifications are distantly inspired by the behaviours of organisms, such as searching, hunting [25], pollinating, and flashing [26]. Within the classical metaheuristic algorithms, such as the genetic algorithm (GA), simulated annealing (SA), and the


ant colony optimization (ACO) algorithm [27], all individuals are treated in the same way, and the best fitness values are the final outputs. Such swarm algorithms perform their behaviour under the same governing conditions. For illustration, in an ant colony the queen is the leader irrespective of the reproduction role; the dinergates are soldiers who protect the colony, whereas the ergates are workers tasked with building, gathering, and breeding. It can therefore be concluded that the hierarchy of the ant colony, when classified by occupation, is queen, dinergates, ergates. The behaviour of the ergates may be coordinated by the experience of the elders along with the queen or the dinergates. If the ergates are commanded by the queen, a few dinergates, or elders, and such movements are mathematically described and introduced into ant colony optimization (ACO) in some way, one can check whether the ACO algorithm performs better in solving the given problems or not. In short, the social hierarchy of the swarm deserves consideration in these algorithms. Work on this was done by Mirjalili et al., and a new optimization method known as the grey wolf optimization (GWO) algorithm was proposed [28]. The GWO algorithm is easy to use and converges quickly; it has been shown to be more efficient than the PSO algorithm in terms of training multilayer perceptrons [28], and also than other bionic algorithms [29]. According to Mirjalili et al. [28], a grey wolf pack lives together in the same place and hunts in groups. The searching and hunting process can be described as follows: (1) if prey is located, the wolves begin tracking and chasing to approach it; (2) if the prey escapes, the grey wolves encircle and harass the prey until it stops moving; (3) finally, the attack starts.

3.2.1 Mathematical Representation of GWO Algorithm

Mirjalili [28] designed the optimization algorithm by mimicking the searching and hunting process of wolves. In the mathematical model, the fittest solution is called alpha (α), the second-best is beta (β), and the third-best is delta (δ). The remaining candidate solutions are termed omega (ω). All omega wolves are guided by the other three wolves during optimization and hunting. Here, the betas are lower-ranking wolves that assist the alpha in decision-making or other pack activities. A beta wolf can be either male or female, and it is probably the best candidate to become the alpha in case one of the alpha wolves passes away or becomes very old. The beta wolf should respect the alpha but commands the other lower-ranking wolves; it plays the role of an advisor to the alpha and a discipliner for the pack, reinforcing the alpha's commands throughout the pack and giving feedback to the alpha. An omega wolf can also be either male or female; they are the lowest-ranking members of the pack. The omegas live on the edges of the pack, usually eating last, and they serve as both a stress reliever and an instigator of play.


Once prey is located, the cycle starts (t = 1). From that point onwards, the alpha, beta, and delta wolves guide the omegas to track the prey and, in the long run, encircle it. The coefficient vectors A and C and the distances D are put forward to describe the encircling behaviour:

D_α = |C_1 · X_α − X(t)|, D_β = |C_2 · X_β − X(t)|, D_δ = |C_3 · X_δ − X(t)|    (1)

where t denotes the current iteration, X(t) is the position vector of a grey wolf, and X_α, X_β, and X_δ are the position vectors of the alpha, beta, and delta wolves. The candidate positions X_1, X_2, and X_3 are computed as follows:

X_1 = X_α − A_1 · D_α    (2)

X_2 = X_β − A_2 · D_β    (3)

X_3 = X_δ − A_3 · D_δ    (4)

X(t + 1) = (X_1 + X_2 + X_3) / 3    (5)

Here, X(t + 1) is the position vector of the grey wolf at the next iteration, and the parameters A and C are combinations of the controlling parameter "a" and the random numbers r_1 and r_2 [28]:

A = 2a · r_1 − a and C = 2 · r_2    (6)

The controlling parameter a changes A and makes the omega wolves either approach or run away from the dominant wolves, i.e. the alpha, beta, and delta. Theoretically, when |A| > 1 the grey wolves move away from the dominants, which means the omega wolves escape from the prey and explore much more of the space; this is called global search (exploration). When |A| < 1 they move towards the dominants, meaning the omega wolves follow the dominants that are approaching the prey; this is known as local search (exploitation). The controlling parameter is defined to decrease linearly from a maximum value of two to zero as the iterations proceed:

a = 2(1 − it/N)    (7)

where N is the maximum iteration number, which is also initialized by the user at the beginning, and it denotes the current iteration number. The application procedure can be divided into three parts: (1) the given problem is understood and mathematically described, and the relevant environmental parameters are then known; (2) a pack of grey wolves is randomly initialized throughout the search space; (3) the alpha


and the other dominant grey wolves lead the pack to search for, pursue, and encircle the prey. When the prey is encircled by the grey wolves and stops moving, the search finishes and the attack starts. The pseudocode is shown in Algorithm 2 [30], and a small Python sketch follows it.
Algorithm 2 GWO Pseudocode
1. Optimization preparation:

– Dimension of the given problem
– Constraints of the given problem
– Population size
– Controlling parameter
– Stopping criterion (maximum iteration count or allowable error)

2. Initialization:
– Positions of all of the grey wolves, including the α, β, and δ wolves
3. Searching:

– While the stopping criterion is not met, calculate the new fitness function
– Update the positions
– Constrain the range of the positions
– Refresh α, β, and δ
– End
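The following is a minimal, self-contained sketch of this pseudocode in Python (a generic GWO loop on an illustrative sphere objective; an assumption-based illustration, not the authors' implementation):

    import numpy as np

    def gwo(objective, dim=5, n_wolves=20, n_iter=100, lb=-10.0, ub=10.0, seed=0):
        rng = np.random.default_rng(seed)
        X = rng.uniform(lb, ub, size=(n_wolves, dim))        # initialize pack positions
        for it in range(n_iter):
            a = 2 * (1 - it / n_iter)                        # Eq. (7): a decreases from 2 to 0
            fitness = np.apply_along_axis(objective, 1, X)
            alpha, beta, delta = X[np.argsort(fitness)[:3]]  # three best (dominant) wolves
            for i in range(n_wolves):
                new_pos = np.zeros(dim)
                for leader in (alpha, beta, delta):
                    r1, r2 = rng.random(dim), rng.random(dim)
                    A, C = 2 * a * r1 - a, 2 * r2            # Eq. (6)
                    D = np.abs(C * leader - X[i])            # Eq. (1)
                    new_pos += (leader - A * D) / 3.0        # Eqs. (2)-(5)
                X[i] = np.clip(new_pos, lb, ub)              # constrain the range of positions
        return X[np.argmin(np.apply_along_axis(objective, 1, X))]

    # Illustrative usage on a sphere objective
    print(gwo(lambda x: np.sum(x ** 2)))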

3.2.2 Index Matching Technique

This is the main phase of the proposed system, i.e. the concept of merging the clinical and ECG data sets. The flow of this technique is given in Algorithm 3 (a small sketch of the merge follows the algorithm), and the flow diagram is shown in Fig. 2.
Algorithm 3 Index Matching Technique
1. Read selected features (from GWO) from clinical data into a variable named ptb
2. Check condition: if ecg-id is matched in both ptb and the matlab file
(i) Then extract the id of the clinical data
(ii) Return the selected clinical data
3. Load challenge data:

(i) Extract ecg-id from the matlab file
(ii) Load the matlab file
(iii) Extract the header data
(iv) Add the clinical data to the clinical array
(v) Convert the clinical array to numpy format


Fig. 2 Flow chart of index matching technique (merging)

(vi) Append the matlab data and clinical values to a single data array
(vii) Return the data and header values
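A minimal, hedged sketch of this index matching in Python (the column name ecg_id, the "val" key, and the use of pandas/scipy.io are illustrative assumptions; the authors' actual loader is not given in the paper):

    import numpy as np
    import pandas as pd
    from scipy.io import loadmat

    # Hypothetical stand-in for the GWO-selected clinical features, indexed by ecg_id
    ptb = pd.DataFrame({"ecg_id": [101, 102], "age": [63, 70], "sex": [1, 0]}).set_index("ecg_id")

    def load_challenge_record(mat_path, ecg_id):
        signal = loadmat(mat_path)["val"]        # 12 x n_samples ECG array (assumed key)
        clinical = ptb.loc[ecg_id].to_numpy()    # clinical features for the matching id
        # Append clinical values to the flattened ECG data (index matching / merging step)
        return np.concatenate([signal.ravel(), clinical])

    # merged = [load_challenge_record(p, i) for p, i in zip(mat_paths, ecg_ids)]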

3.2.3 Proposed Architecture for HD Diagnosis with GWO and Merged Data Set of Clinical with ECG

This step illustrates the overall flow of the proposed system. The flow of this technique is given in Algorithm 4, and the flow diagram is shown in Fig. 3.
Algorithm 4 Heart disease diagnosis with GWO and merged data set of clinical with ECG
1: Optimization:

(i) Read clinical data
(ii) Perform data pre-processing
(iii) Apply GWO and perform feature selection
(iv) Read ECG data
(v) Split data into training and validation sets


Fig. 3 Flow chart of proposed architecture (with optimization)

2: Merging:
(i) Read selected features of clinical data
(ii) If ecg-id of clinical data matches with ecg-id of matlab file, then extract that clinical data
(iii) Extract ecg-id from matlab file
(iv) Get clinical data of given ecg-id
(v) Load matlab file
(vi) Extract ecg image from matlab files
(vii) Extract header data
(viii) Append clinical data and matlab data in array
3: Deep Learning Models:
(i) Get the merged and cleaned data
(ii) Feed the data to DL models, namely ANN, CNN and RNN
4: Determine the best model for classification of HD
5: End


4 Results and Discussions
The implementation of the proposed methodology used data drawn from three continents with diverse and distinctly different populations, encompassing 111 diagnoses. The data is a combination of the data sets represented in Table 1. The results of the model are examined with respect to evaluation parameters such as Validation Accuracy, Recall, Precision, and AUC, where (i) Validation Accuracy is the percentage of correct classifications, derived through cross-validation; (ii) Validation Recall is the proportion of actual positives that were identified correctly; (iii) Validation Precision is the quality of a positive prediction made by the model; and (iv) Validation AUC is a measure of the ability of a classifier to distinguish between classes (the higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes). Table 2 summarizes the results that the models produced after execution; based on these results, we can conclude that the CNN model AlexNet performs better than the remaining deep learning models for the chosen data set, with Validation Accuracy 95.77, Validation Recall 26.01, Validation Precision 77.47, and Validation AUC 72.41, while Table 3 summarizes the corresponding results with optimization. Graphs of some DL models such as ANN, CNN-Lenet5, and Alexnet are shown with respect to their accuracies in Figs. 4, 5, 6, 7 and 8 in two categories, "without optimization" and "with optimization", for the ease of analysis of the results and for

Table 1 Data sets used
Data set name | No. of samples collected from patients | Parameters considered | Recorded time
PTB (Kaggle) | 2648 | 9 | –
PTB-XL ECG [31] | 18,885 | 71 distinct records with age, gender, Dx, Rx, Hx and Sx | 10 s
Georgia 12-lead ECG challenge [31] | 10,344 | Dx, Rx, Hx and Sx (default) | 10 s, sampled 500 Hz
St Petersburg INCART 12-lead arrhythmia [12] | 32 Holter records | Dx, Rx, Hx and Sx (default) | 30 min, sampled 257 Hz
China 12-lead ECG challenge [4, 32] | 43,436 | Dx, Rx, Hx and Sx (default) |


Table 2 Summary result of deep learning models without optimization
Model name | Validation accuracy | Validation recall | Validation precision | Validation AUC
ANN [11] | 95.38 | 17.98 | 72.16 | 63.93
LeNet-5 [11, 14–16] | 95.27 | 17.04 | 67.34 | 64.48
AlexNet [11, 14–16] | 95.77 | 26.01 | 77.47 | 72.41
VGG-16 [11, 14–16] | 95.57 | 25.38 | 70.16 | 71.47
LSTM [11] | 95.59 | 31.04 | 66 |

Table 3 Summary result of deep learning models with optimization
Model name | Validation accuracy | Validation recall | Validation precision | Validation AUC
ANN | 95.26 | 23.18 | 61.48 | 61.95
LeNet-5 | 95.27 | 15.15 | 70.82 | 69.99
AlexNet | 95.72 | 24.24 | 78.5 | 76.77
VGG-16 | 96.15 | 36.66 | 77.28 | 78.87
LSTM | 95.55 | 26.59 | 68.34 | 61.02

Fig. 4 ANN accuracy (1) without optimization; (2) with optimization

comparison. It is observed that in ALexnet, the accuracy is increased with involvement of optimization. Other DL models like ANN, CNN-Lenet5, CNN-Alexnet, and VGG-16 are showing a slight variation. Similarly, Table 3 briefs about the results in optimization scenario. Based on Table 3 results, We can conclude that on involving merging technique of both clinical and ECG data and the usage of Grey Wolf Optimization technique, CNN model, i.e. VGG-16 performs better than remaining deep learning models for the data set we


Fig. 5 CNN-Lenet5 accuracy (1) without optimization; (2) with optimization

Fig. 6 CNN-Alexnet accuracy (1) without optimization; (2) with optimization

Fig. 7 CNN-VGG-16 accuracy (1) without optimization; (2) with optimization

Fig. 8 RNN-LSTM accuracy (1) without optimization; (2) with optimization


Fig. 9 ANN precision, recall, and AUC (1) without optimization; (2) with optimization

Fig. 10 CNN-LENET-5 precision, recall, and AUC (1) without optimization; (2) with optimization

Fig. 11 CNN-ALexnet precision, recall, and AUC (1) without optimization; (2) with optimization

chose for this analysis, with Validation Accuracy 96.15, Validation Recall 36.66, Validation Precision 72.28, and Validation AUC 78.87. Similar to the without-optimization scenario, the graphs of some DL models such as CNN-Alexnet and VGG-16 are plotted with respect to their Recall, Precision, and AUC in Figs. 9, 10, 11, 12 and 13 in two categories, "without optimization" and "with optimization", for ease of analysis of the results and for comparison. One can observe that in VGG-16, all three parameters, Recall, Precision, and AUC, are increased with the involvement of optimization. Other DL models such as ANN, CNN-Lenet5, CNN-Alexnet, and RNN show a slight variation.


Fig. 12 CNN-VGG-16 precision, recall, and AUC (1) without optimization; (2) with optimization

Fig. 13 RNN-LSTM precision, recall, and AUC (1) without optimization; (2) with optimization

Comparing the results from both scenarios, in the without-optimization scenario (Table 2) the accuracies are found to be lower and the computational speed is also observed to be lower than in the optimization results. Finally, the results have shown improvement for the data set we considered, and they can be further improved if merged data is also used in the without-optimization scenario; this can be taken up as future work.

5 Conclusion and Future Work
With the advancement in technology, many researchers have shown that there are various limitations in analysing the results of HD predictions made with DL algorithms alone. To overcome these, more advanced methods are required. The concept of optimization has shown great results in diagnosing HD by helping researchers reduce the overall cost in many ways. At the same time, choosing the right optimization for the challenges in the problem statement is itself quite challenging. With GWO in the diagnosis of heart disease, one can obtain very competitive results in terms of local optima avoidance. GWO distinguishes itself among swarm-based algorithms through its simple principle, fast search speed, high search accuracy, and ease of implementation, and it is more easily combined with


practical engineering problems than other optimization algorithms such as dragon-fly, swarm optimization, the cuckoo algorithm, ant colony optimization, and many more. This is the reason why GWO has high theoretical research value, and this is how GWO helped us in a better diagnosis of HD in this study. Choosing a good data set that is new and has little noise is also recommended for better results; hence, the publicly and freely available 12-lead PTB-XL ECG data set is considered. In addition, the concept of merging both clinical and ECG data is taken into consideration, and the results are evaluated and analysed in both scenarios, without optimization and with optimization, using DL models. In a comparison of five different neural networks with and without optimization, it is observed that VGG-16 is the most efficient algorithm, which is the conclusion suggested by this study. Consequently, deep learning-based algorithms have a promising future in ECG analysis, not only in terms of quantitative accuracy but also in terms of additional quality criteria. As part of future work, the merging of clinical and ECG data can also be considered for the without-optimization scenario on the same data set, and the results can be evaluated, compared, and analysed; the results may well improve to a great extent, and hence this research work will significantly contribute to heart disease prediction systems in the medical industry.

References 1. Wagner P, Strodthoff N, Bousseljot R, Samek W, Schaeffter T (2020) PTB-XL, a large publicly available electrocardiography dataset (version 1.0.1). PhysioNet. https://doi.org/10.13026/ x4td-x982 2. Pourbabaee B, Roshtkhari MJ, Khorasani K (2017) Deep convolutional neural networks and learning ECG features for screening paroxysmal atrial fibrillation patients. IEEE Trans Syst Man Cybern: Syst 48(12):2095–2104 3. Golrizkhatami Z, Acan A (2018) ECG classification using three-level fusion of different feature descriptors. Expert Syst Appl 114:54–64 4. Sellami A, Hwang H (2019) A robust deep convolutional neural network with batch-weighted loss for heartbeat classification. Expert Syst Appl 122:75–84 5. Jiang J, Zhang H, Pi D, Dai C (2019) A novel multi-module neural network system for imbalanced heartbeats classification. Expert Syst Appl: X 1:100003 6. Babu SB, Suneetha A, Babu GC, Nagendra Kumar YJ, Karuna G (2018) Medical disease prediction using grey wolf optimization and auto encoder based recurrent neural network. Period Eng Nat Sci 6(1):229–240 7. Kumar AD (2020) Flawless attuning for parameters of power system modulator applying grey wolf optimization. J Electr Eng Autom 2(2):102–111 8. Shakya S, Joby PP (2021) Heart disease prediction using fog computing based wireless body sensor networks (WSNs). IRO J Sustain Wirel Syst 3(1):49–58 9. Martin-Isla C, Campello VM, Izquierdo C, Raisi-Estabragh Z, Baeßler B, Petersen SE, Lekadir K (2020) Learning image-based cardiac diagnosis with machine: a review. Front Cardiovasc Med. https://doi.org/10.3389/fcvm.2020.00001

49 Advanced Approach for Heart Disease Diagnosis …

647

10. Faust O, Shenfield A, Kareem M, San TR, Fujita H, Acharya UR (2018) Automated detection of atrial fibrillation using long short-term memory network with RR interval signals. Comput Biol Med 102:327–335 11. Mehta S, Fernandez F, Villagran C, Niklitschek S, Frauenfelder A, Nola F, Ceschim MR, Matheus C, Chaves C, Quintero S et al (2019) Application of artificial intelligence to detect ST elevation MI with a single lead EKG. J Am Coll Cardiol 73(9 Suppl 1):1328 12. Shashikumar SP, Shah AJ, Clifford GD, Nemati S (2018) Detection of paroxysmal atrial fibrillation using attention-based bidirectional recurrent neural networks. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery data mining. ACM, pp 715–723 13. Goto S, Kimura M, Katsumata Y, Goto S, Kamatani T, Ichihara G, Ko S, Sasaki J, Fukuda K, Sano M (2019) Artificial intelligence to predict needs for urgent revascularization from 12-leads electrocardiography in emergency patients. PLoS ONE 14(1):e0210103 14. Shankar V, Kumar V, Devagade U et al (2020) Heart disease prediction using CNN algorithm. SN Comput Sci 1:170. https://doi.org/10.1007/s42979-020-0097-6 15. Kumara D (2021) Study of heart disease prediction using CNN algorithm. JETIR 8(7) 16. Xin-She Y (2014) Nature-inspired optimization algorithms. Elsevier, Amsterdam, Netherlands 17. Yang XS, Chien SF, Ting TO (2015) Chapter 1–Bioinspired computation and optimization: an overview. In: Yang XS, Chien SF, Ting TO (eds) Bio-inspired computation in telecommunications. Morgan Kaufmann, Boston, MA, USA 18. Reynolds CW (1987) Flocks, herds and schools: a distributed behavioral model. ACM SIGGRAPH Comput Graph 21(4):25–34 19. Juan Z, Zheng-Ming G (2015) The bat algorithm and its parameters. Electronics, communications and networks IV. CRC Press, Boca Raton, FL, USA 20. Yu JJQ, Li VOK (2015) A social spider algorithm for global optimization. Appl Soft Comput 30:614–627 21. Yan C-m, Guo B-l, Wu X-x (2012) Empirical study of the inertia weight particle swarm optimization with constraint factor. Int J Soft Comput Softw Eng [JSCSE] 2(2):1–8 22. Basak A, Maity D, Das S (2013) A differential invasive weed optimization algorithm for improved global numerical optimization. Appl Math Comput 219(12):6645–6668 23. Yuan X, Zhang T, Xiang Y, Dai X (2015) Parallel chaos optimization algorithm with migration and merging operation. Appl Soft Comput 35:591–604 24. Kang M, Kim J, Kim JM (2015) Reliable fault diagnosis for incipient low-speed bearings using fault feature analysis based on a binary bat algorithm. Inf Sci 294:423–438 25. Azizi R (2014) Empirical study of artificial fish swarm algorithm. Int J Comput Commun Netw 3(1–3):1–7 26. Marichelvam MK, Prabaharan T, Yang XS (2014) A discrete firefly algorithm for the multiobjective hybrid flowshop scheduling problems. IEEE Trans Evol Comput 18(2):301–305 27. Dorigo M, Birattari M, Stutzle T (2006) Ant colony optimization. IEEE Comput Intell Mag 1(4):28–39. https://doi.org/10.1109/MCI.2006.329691 28. Mirjalili S, Mirjalili SM, Lewis A (2014) Grey wolf optimizer. Adv Eng Softw 69:46–61 29. Komaki GM, Kayvanfar V (2015) Grey wolf optimizer algorithm for the two-stage assembly flow shop scheduling problem with release time. J Comput Sci 8:109–120 30. Gao Z-M, Zhao J (2019) An improved grey wolf optimization algorithm with variable weights. Comput Intell Neurosci. https://doi.org/10.1155/2019/2981282 31. 
Dagenais GR, Leong DP, Rangarajan S, Lanas F, Lopez-Jaramillo P, Gupta R et al (2019) Variations in common diseases, hospital admissions, and deaths in middle-aged adults in 21 countries from five continents(PURE): a prospective cohort study. Lancet 32. Li Y, Pang Y, Wang J, Li X (2018) Patient-specific ECG classification by deeper CNN from generic to dedicated. Neurocomputing 314:336–346

Chapter 50

Hyper-personalization and Its Impact on Customer Buying Behaviour Saurav Kumar, R. Ashoka Rajan, A. Swaminathan, and Ernest Johnson

1 Introduction
The various stages of the industrial revolution have impacted human life in various ways. Industrial revolution 1.0 improved our lives through steam-powered engines. With 2.0, mass production and electricity came into existence. Electronics, IT systems development, and automation rose in industrial revolution 3.0. Currently, we are in the era of 4.0, where cyber-physical systems play an important role. The maturity and popularity of IT technology over the last decade has resulted in a large amount of data. The biggest challenge in 4.0 is to cope with the exabytes of data produced every day. Artificial intelligence, Internet of Things, robotics and automation, and big data are used in different ways to utilize this huge amount of data. A 3-stage circular cycle of data generation has been observed through the physical-digital-physical process. However, the data feedback cycle from a physical process back to digital is optimized. It is represented in Fig. 1, which explains the constant cyclic process. A constant exchange of information occurs between the physical world and the machine world and vice-versa. Earlier, the movement generation between these two worlds was largely taken care of by humans. With the maturity of technology,

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al, (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_50

649

650

S. Kumar et al.

Fig. 1 Physical-digital-physical loop cycle

software programmes played the role until recently. However, lately algorithms have started to take key ownership for generating these movements between these two worlds. As shown in Fig. 1, first stage is to digitize from physical process, i.e. to capture information from the physical world and convert it into digital record. Second stage is digital, i.e. to analyse and visualize the records such that machines share information among each other and do analytics of real-time data sources. The third stage is to generate movement, i.e. to translate actions and decisions from the digital process into movements in the physical process by applying algorithms and automation. Industry 4.0 has three stages of implementation and usage. Basic levels of data collection analysis and intercommunication must be set in order to establish three stages, (a) process optimization, (b) process flow and quality, and (c) new business models. (a) Process Optimization: In this stage, improvement in current process and optimization of use of assets is done by increasing networking and digitization. Up gradation and automation of standards reduces cost. The data collected in this stage is leveraged through advance algorithms. (b) Process Flow and Quality: A digital thread is created from beginning to end process. Various cyber security measures can be used to address the risk of connectivity and improve quality. (c) New Business Models: The collected data is used to generate insights, visualization, and idea for new processes, improved process, and products. Hyper-personalization aims to improve the data collected for various customer activities and leverage automation algorithms via process optimization. It focusses on increasing networking and digitization, improving current processes, optimizing use of assets, and maximizing data collected to use data with advanced algorithms. The need for hyper-personalization can be explained with the following examples. Generally, in an e-commerce website when a customer browses through the page, there is less than 10 s to capture the attention of the user. According to a survey made

50 Hyper-personalization and Its Impact on Customer Buying Behaviour

651

by Google, the phrase ‘best’ has increased over 80%. Information overloading is affecting the customer to narrow down their approach. Another survey made by the giant company Accenture the customers is likely to buy a product if it is personalized in some sense. Personalization in e-commerce industry is widely prevalent however uses basic technique. Marketing companies personalize content via data collected from various channels like social media, own website, and emails. The most common way to personalize is through email. Customized website content is also widely used method by retail marketers. Personalization targets large customer group with similar interests, however, not specific to user. On the contrary, hyper-personalization is more specific with a specific interest related target. This is called hyper-targeting which results in improved customer engagement and response. Hyper-personalization is far more beneficial for the customer and provides greater sales efficiency to the retailers. The data is tailored from services, products, and contents. This paper is focussed on analysing the customer behaviour. This behavioural data is generated via various customer activities like browsing; purchasing and can affect behavioural responses of browsing and purchasing.

2 Literature Review Meddour et al. [1] have tried to study the role of cultural importance, motivation, and the price impact on the customer buying behaviour of the Saudis expatriates. For this, partial least square method (PLM) method is used. Customer buying behaviour data was collected using questionnaire survey. The study shows that there is a direct link between price, motivation, and cultural importance on the customer buying behaviour. Jain et al. [2] have studied on the role of hyper-personalization by digital client on fashion. The method adopted was structural equation modelling. The data was collected from website with focus on the segment that is related to women ethnic wear. After the analysis, it was found that usefulness of product has strong relationship on the customer buying behaviour. Weinlich and Semeradova [3] from Czech Republic took 840 Facebook advertisements with different extents of personalization to check the performance of various targets. The feature considered were profitability, number of conversions, viewed page count, number of clicks made by the user, average browsing time on the website, and count of reactions clicks made to the viewed content. The result established that excessive targeting leads to negative impact on customers. Therefore, hyper-personalization should be done in an optimized way. Maheswari and Packia Amutha Priya [4] from Kalasalingam University made a study on customer behaviour using inventory and sales dataset. Support vector machine (SVM) has been used as the classification method. The SVM performs classification using a multidimensional hyper plane. The support vector machine uses neural network techniques to train radial base function, polynomial, and multilayer

652

S. Kumar et al.

perceptron classifier methods. It was found out that customer behaviour has a positive impact due to hyper-targeting and they will purchase more when there are offers in e-commerce. Pervaiz and Gull [5] from the University of Lahore studied and analysed customer behaviour based on time spent by customer on home decoration shopping websites. The authors have used associative rule of data mining technique and Apriori algorithm to suggest the relevant item based on purchasing preferences. The conclusion they reached is that products that are closely related if they are given as combo offers are likely to purchase. Paramania et al. [6] had conducted a study on the personalization from various perspectives. The approach used was a systematic analysis of Kitchenham’s literature, in which various forms of personalization were thoroughly analysed in 21 articles between 2012 and 2017. The paper highlights the characteristics of personalisation in four dimensions. They are commercial, relational, architectural, and the fourth is instrumental. These parameters are based on three factors, i.e. the process, objective, and model of the user. The author concluded that commercial and instrumental personalization had been shown to be the most common dimensions and influenced the customer’s purchasing behaviour. Silahtaroglu [7] from Turkey studied customer behaviour based on the clickstream data obtained from e-shopping websites in Turkey. The authors used a multilayer neural network prediction model and decision tree algorithm for research to figure out whether the consumer would purchase a product based on clicks made on the website. An approach to derive consumer purchasing habits was made from website logs and mouse movements to predict the chances of purchase of a product from the shopping basket. This pattern is gender based and increases likeliness of shopping for men and not for women. Similarly, timing spent by women on website is more. So, there is no clear picture based on the clicks made to infer on customer behaviour. Doddegowda et al. [8] studied the web personalization based on the user needs. The authors used frequent sequential patterns (FSPs) algorithms for analysis by taking group of users in consideration to know customer buying behaviour. The author had also explored various The FSP mining algorithms, such as WAP-tree, SPADE, and Prefix Span, for the extraction of FSPs from the WUD of the said academic website for a period ranging from weekly basis to quarterly basis. The analysis of these FSP algorithms has been performed and the results of the Prefix Span FSP algorithm that describes user navigation behaviour can be used for website personalization applications. Therefore, the conclusion that was reached is when there is an increase in the minimum support, the pattern decreases due to higher frequency in a large database. Symeonidis et al. [9] have studied hyper-personalization based on conversation web. Authors have recommended PRCW model which is a combination of offline and online recommendation techniques using RFMG. The datasets used is from two online retailers. The proposed approach improves current approach in small and medium datasets and can boost efficiency in large datasets when grouped with other methods. The results differ significantly in various datasets, which depends on size and characteristics of the dataset, so to find the correct method for every dataset

50 Hyper-personalization and Its Impact on Customer Buying Behaviour

653

can be a difficult and complex job. Offline and online algorithms help in achieving optimized results because offline algorithm provides better efficiency. It is concluded that with large datasets the performance is higher than small datasets. Also, offline and online together can give better recommendations. Luo et al. [10] have focussed their work on relationship marketing. The researchers have tried to find out the connection between customer stickiness and relational benefit. It is found that personalization and confidence benefits have an important and positive result on customer stickiness. Customized services are being given too much attention by consumers today, though economic benefits are not considered to be an important factor. The emotional attachment acts as a mediator between the customer’s stickiness and the benefits of the relationship. This is confirmed by the results obtained from the data which means that the benefits of the relationship have a direct as well as indirect effect on the customer stickiness. Wang and Zhao [11] studied customer behaviour by taking into account the combinations of the products and the services and tried to observe stimulus response. In this experiment, it was divided into two conditions. The first was customized preferences given to the customer, whereas second was general services. This survey was conducted on a sample of 20 participants. Customization of the service attracted the customers more than that of the general service provided. The final result showed that when a product is being developed, the service design should be considered. In obtaining the result, neural networks are used to provide offers and obtain response in customer buying behaviour. Vempati et al. [12] have proposed the need for hyper-personalization. The main focus is on the banners which is the first point of interaction of a customer to a website. The main emphasis is laid on the need for hyper-personalize the customer content based on the click through rate. The paper deals with generating large numbers of large banners based on the click data. The method used is deep learning techniques which detect the various tags on the object. Also, genetic algorithm has been used to optimize the banner content. Ranking method helps in sorting and selecting the right banner for a page. Thomaz et al. [13] have proposed a method to solve the problem of hyper privacy which will hinder the progress of hyper-personalization. Hyper privacy of the data will not in the implementation of the new marketing techniques. The author has found that there are two types of consumers one that are willing to share some information and others who keep everything private. Therefore, it proposed to use conversational agent’s deployment and also using of chat bots to get deep understanding in both type of consumers. The final result is that conversational agents and chat bots will help in getting understanding of the upcoming hindrance in the path of personalization and solving the same in this new era of hyper privacy. Tyrvainen et al. [14] have studied and developed eight hypotheses based on two surveys from Finland and Sweden. The participants were selected by using sampling method, and detailed reviews were taken. The result was that there is a strong and positive relationship between hedonic motivation and personalized content. The hedonic

654

S. Kumar et al.

component is based on emotional customer experiences. The author found positive effects are related to the loyalty of the customers, thus providing insights on theoretical and managerial views. Kapoor et al. [15] have studied the impact of e-commerce industry personalization. The quality of the good personalization depends on the quality of the data, good observations, and automated and customized marketing team. The real-time data helps in getting instant recommendation which is personalized. The final result is to show the various aspects a good hyper-personalized content can throw at an ecommerce industry. Riegger et al. [16] have tried to study reaction of the consumer based on technology enabled personalization in the retail industry. A detailed qualitative study from various consumer interviews was done. The result obtained can be divided into two parameters. The factors that drive personalization are control, interaction, integration, hedonistic, and utilitarian. The main four hindrances to the personalization of the retail industry are privacy concerns by the customers, lack of confidence on the online retail service, interactions misfit of the communication, and exploitation of the customers. The author has also highlighted the combined factor which can lead to the success for the technology driven personalization. The factors are the presence and absence of the staff, the privacy and personalization brought together, use of personal retail-oriented devices and to limit the exploitation of the customers. Karuppusamy [17] proposed a customer consumption prediction model using artificial neural network. Kumar [18] have experimented various retail applications using virtual and augmented reality methods.

3 Methods Hyper-personalization is the advanced version of personalization which uses real-time data and artificial intelligence to analyse user behaviour and give relevant product and content recommendations. With the vast advancement of technology, the e-commerce platform has undergone radical changes. MACH architecture is one of the modern infrastructure platforms that will revolutionize the e-commerce platform in the near future. Hyper-personalization can utilize the MACH architecture quite effectively to address the challenges of segmentation in the industry and also to produce solutions towards cost optimization. There are classical methods such as matrix factorization and neighbourhood approaches like KNN, but these have some shortcomings: matrix factorization does not work well with past transactional and click data, and the neighbourhood approach considers only the last item clicks. Some of the methods that have long been used for personalization are described below:


3.1 PLS-SEM Model Partial least square method and structural equation modelling (PLS-SEM) is another method by which we can study the consumer behaviour. These both use the measurement and the structural model in which the reliable values are measured and the indicator variable helps in assessing the threshold value. To test the hypotheses, structural model is used. The data used is mostly questionnaire type collected from various customers.

3.2 KNN Model k-nearest neighbours is a popular approach that has long been used to segment customers based on similar criteria. The distance of a new point is calculated, mostly using the Euclidean or Manhattan distance, and based on the initial k clusters selected, the new point joins the nearest cluster. But there are some disadvantages of this approach: the computational cost is high, and finding k, i.e. the number of clusters, is difficult.
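A minimal sketch of the nearest-neighbour idea described above, using scikit-learn. The customer features (recency, spend, visits), the segment labels, and the value of k are illustrative assumptions, not values taken from the paper.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Toy customer features: [recency in days, monthly spend, visits per month]
    customers = np.array([
        [5, 220.0, 12],
        [40, 35.0, 2],
        [7, 180.0, 9],
        [55, 20.0, 1],
    ])
    segments = np.array([0, 1, 0, 1])  # 0 = frequent buyer, 1 = occasional buyer

    # k and the Euclidean metric are typical defaults; both are assumptions here
    knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
    knn.fit(customers, segments)

    new_customer = np.array([[10, 150.0, 8]])
    print(knn.predict(new_customer))     # segment assigned from the nearest neighbours
    print(knn.kneighbors(new_customer))  # distances used for the assignment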

3.3 Matrix Factorization Matrix factorization is another method used in creating recommender systems. Its aim is to identify the relationship between users and items. From this, the similarity index is found and predictions are made based on both the user and the item entities. It has a faster computation time, easily learns complex and dense features, and works well when dealing with high dimensionality. But when it comes to dealing with past transactional and clickstream data, it lags in performance.
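An illustrative low-rank factorization of a small user-item matrix with truncated SVD. The ratings and the chosen rank are invented for the sketch; they are not data from the study.

    import numpy as np

    # Rows = users, columns = items; 0 means "not interacted yet"
    R = np.array([
        [5, 3, 0, 1],
        [4, 0, 0, 1],
        [1, 1, 0, 5],
        [0, 1, 5, 4],
    ], dtype=float)

    # Factorize into user and item latent factors (rank-2 approximation)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    k = 2
    R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # Predicted affinity for items a given user has not interacted with yet
    user = 1
    unseen = np.where(R[user] == 0)[0]
    print({int(i): round(R_hat[user, i], 2) for i in unseen})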

3.4 Support Vector Machines The support vector machine is used for classification and regression problems as well as outlier detection. It is effective when working with higher dimensions and also when the number of samples is less than the number of dimensions. The disadvantage of SVM is that it does not work well with large datasets or when the dataset has more noise, which leads to overlapping outputs. Since the data dealt with here is large and arrives from many streams, SVM may fail to customize the data, and hence performance may drop for personalization.
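A small SVC example on synthetic behaviour features; the data, labels, and kernel settings are made up purely to illustrate the classifier mentioned above.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    # Synthetic clickstream features: [pages viewed, time on site, cart adds]
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)   # 1 = likely purchase (synthetic rule)

    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    model.fit(X, y)
    print(model.predict(X[:5]))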


3.5 Association Rule of Mining This is another important machine learning model which is used to analyse the customer behaviour. This is done by searching and finding out frequent data patterns. The support, lift, and confidence criteria are used to find relationships which are important. Market basket analysis is widely used to find patterns in the retail stores. When the data size is very large there are too many rules formed which are unnecessary to process.
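A plain-Python computation of the support, confidence, and lift criteria named above for a single rule over a toy basket list; the transactions and the chosen rule are illustrative.

    transactions = [
        {"bread", "butter", "jam"},
        {"bread", "butter"},
        {"bread", "milk"},
        {"butter", "milk"},
        {"bread", "butter", "milk"},
    ]
    n = len(transactions)

    def support(itemset):
        # fraction of transactions that contain the whole itemset
        return sum(itemset <= t for t in transactions) / n

    # Rule {bread} -> {butter}: support, confidence and lift
    a, b = {"bread"}, {"butter"}
    sup = support(a | b)
    conf = sup / support(a)
    lift = conf / support(b)
    print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")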

3.6 Decision Tree Analysis For analysing customer behaviour, bootstrapping needs to be applied using both the Gini index and entropy functions. The data is always divided into two parts, where 70–80% of the data is used for training and the rest is used for analysis. The confusion matrix can be used to find out the accuracy of the model.
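A sketch of the 70/30 split, Gini/entropy criterion, and confusion matrix described above; the dataset is synthetic and only illustrates the workflow.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import confusion_matrix, accuracy_score

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 4))               # synthetic behaviour features
    y = (X[:, 0] - X[:, 3] > 0).astype(int)     # synthetic purchase label

    # 70% for training, 30% held out for analysis, as described above
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

    tree = DecisionTreeClassifier(criterion="entropy", max_depth=4)  # or criterion="gini"
    tree.fit(X_tr, y_tr)

    pred = tree.predict(X_te)
    print(confusion_matrix(y_te, pred))
    print("accuracy:", accuracy_score(y_te, pred))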

3.7 Frequent Sequential Pattern In order to know about the behaviour, the pattern of the data needs to be analysed. Frequent sequential pattern (FSP) mining algorithms are mainly of three types: • WAP-tree: The database is scanned once to find the discrete events that are frequent. The WAP-tree is then built by scanning the database again and using the frequent sub-sequences to build the tree; the sub-sequence events are inserted starting from the root of the tree. • Prefix Span: This method uses a pattern growth approach to get the sequential patterns in the data. The result is analysed from the minimum support value obtained. • SPADE: Sequential pattern discovery using equivalence classes is an Apriori-style algorithm on a vertical data format. The process is to scan the database using the vertical format and check the equivalence classes starting from the first sequence. The minimum support is used to select the elements of each class.

3.8 Recency Frequency Monetary Analysis This method is widely used in analysing customer behaviour. Recency deals with how recently the customer bought a product, frequency shows how often the customer buys a product, and monetary deals with how much the customer buys.
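A pandas sketch of computing recency, frequency, and monetary scores from an order log; the column names, the example orders, and the snapshot date are assumptions made for the illustration.

    import pandas as pd

    orders = pd.DataFrame({
        "customer_id": [1, 1, 2, 2, 2, 3],
        "order_date": pd.to_datetime(
            ["2022-01-05", "2022-03-01", "2022-02-10",
             "2022-02-20", "2022-03-15", "2021-12-01"]),
        "amount": [120.0, 80.0, 40.0, 60.0, 55.0, 300.0],
    })
    snapshot = pd.Timestamp("2022-04-01")   # reference date for recency

    rfm = orders.groupby("customer_id").agg(
        recency=("order_date", lambda d: (snapshot - d.max()).days),
        frequency=("order_date", "count"),
        monetary=("amount", "sum"),
    )
    print(rfm)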


3.9 Genetic Algorithms This technique is based on the natural evolution process. The best-fit individuals from the parent generation are used for reproduction to produce the next generation of children. Personalization can be achieved by taking into account the maximum interaction of the customer with a particular page, i.e. a higher rank. A recurrent neural network (RNN) is used to model the user behaviour within the current session. The current session contains information on the user's clicks and purchases. The input to the RNN model is the click data, and the model returns a score for each item indicating how likely it is to be the next item in the session. The next item ID for the session is used to calculate the loss during model training. In terms of ranking measures, RNNs have lately been utilized for session-based recommendations and have outperformed item-based techniques by 15–30%. One disadvantage of the RNN technique is that it only considers the current session when training the model, learning about user preferences, and making suggestions. However, there are instances where a user is logged in; in these cases, it does not consider the past sessions, which an HRNN can consider. The main algorithm used in this research is the hierarchical recurrent neural network (HRNN). An extra gated recurrent unit (GRU) layer is present to model information across user sessions and track the change of user interests over time. In this model, unlike the RNN, there are two GRU layers: one is the session-level GRU, which is comparable to the RNN's GRU layer, and the other is the user-level GRU. The session-level GRU creates suggestions by modelling user activity within sessions. The session-level GRU's hidden state is initialized by the user-level GRU, which models the user's evolution across sessions and provides personalization capabilities to the session-level GRU. In this way, the information about the user's choices indicated in prior sessions is transmitted to the session-level GRU. The modern e-commerce platform, i.e. the MACH architecture, has many benefits in the digital commerce industry and is also promising for hyper-personalization. Hyper-targeting is primarily based on (1) behavioural targeting, (2) selective content, and (3) technical maturity of the overall information technology landscape. As depicted in Fig. 2, IT maturity is an important component in futuristic and modern e-commerce platforms. Online retailers can leverage the MACH architecture to deliver a seamless shopping experience to all customers across all devices. The four principles of the MACH architecture can further augment hyper-targeting, as explained below. • Micro service-based: Independent applications developed, deployed, and managed to perform a single function enable multiple business logics to be applied for hyper-targeting. • API-first: Various systems can be connected through an application programming interface (API) to enable all separate components to operate together. Multiple algorithms and techniques for hyper-targeting can be connected together using this principle.


Fig. 2 Factors affecting hyper-personalization

• Cloud-native: Hosting on cloud enables scalable infrastructure and less managed technology. Various cloud native applications for personalization can be hosted so as to enable maximum reach and minimum management. • Headless: Headless e-commerce solution decouples frontend and backend. This enables the creation of custom-made frontend customer experiences to use segregated backend interface for selective content to hyper-target users. MACH architecture enables companies the flexibility to replace any digital component and helps in dynamic business requirement. AI driven hyperpersonalization now can utilize MACH architecture to independently operate for multiple kinds of services and products. Now, flexibility of cloud micro services and various API can enable varied multiple business logics and algorithms to be used for different kinds of customers, user-segments, and demographics. As the world is recovering from current pandemic MACH forecast shows 20% growth across multiple e-commerce sectors like food growing at 10%, non-food at 20%, department stores at 28% which definitely shows promising usage of hyper-personalization. It will also help in dealing with the problem of information overload for organization and largely benefits end customers for easy and faster decision making.
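A hypothetical illustration of the API-first principle above: a headless frontend asking a personalization microservice for content over HTTP. The endpoint URL, payload fields, and response shape are invented for this sketch and do not correspond to any specific MACH product.

    import requests

    # Hypothetical recommendation microservice; URL and fields are assumptions
    RECS_URL = "https://api.example-shop.com/v1/recommendations"

    payload = {
        "user_id": "u-1001",
        "session_events": ["view:sku-42", "add_to_cart:sku-42", "view:sku-97"],
        "segment_hint": "high-intent",
        "max_items": 5,
    }
    resp = requests.post(RECS_URL, json=payload, timeout=2)
    resp.raise_for_status()
    for item in resp.json().get("items", []):
        print(item["sku"], item["score"])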

4 Proposed Model The proposed model is implemented using the retail demo store dataset; users, interactions, and products are some of the details in the dataset. Figure 3 describes the model implemented: deep learning operations personalize content using 'experience analysis' and 'behaviour analysis'. Figure 3 shows the data flow in two directions. • The information data flow, i.e. from the user of the e-commerce website using the frontend towards the backend. This information is used by the backend system to personalize the content for the user using deep learning algorithms. Behaviour and experience analysis of the information is done in conjunction with the big data repository.


Fig. 3 Model description

• The action information flow is generated from the backend based on the information of the current user action, past actions, history of usage, demography, and algorithm predictions. This reverse flow can be used not only for hyper-targeting but also for predictive modelling, which enables e-commerce users and owners to improve experiences. The model used for the deep learning operation is the HRNN, where the following process happens in order to hyper-personalize the user content. (a) Users' sessions are grouped based on multiple parameters depending on the selection criteria. (b) Purchase history is grouped by time stamp within each group of data. (c) The HRNN model is trained with users in a random sequence through multiple runs. (d) The model training follows similar steps in subsequent iterations. The steps followed in the first iteration are described below and are represented in Fig. 4. • The input to the HRNN is the initial item of the first session of the first set of N users. • The output of each session is the second item, as shown in Fig. 4. • The output of this iteration of training is subsequently used as input for the following iteration of training. • If the consumer only bought three products, i1, i2, and i3, then the user ID and i1 will be the inputs and i2 the output in the first iteration. • The complete user transaction set will be used for training the model. • Scores are generated by the model, and i2 will help in adjusting the weights. The model continues to learn in this way and makes hyper-personalized recommendations. Thus, it tells the likelihood of a product being in the next session.
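A sketch of the training-pair preparation described in steps (a)–(d): interactions are grouped per user, ordered by timestamp, and each item becomes the input for predicting the next one (i1 -> i2, i2 -> i3, and so on). The record format is an assumption made for the illustration.

    from collections import defaultdict

    # (user_id, timestamp, item_id) interaction records -- format assumed for the sketch
    events = [
        ("u1", 3, "i3"), ("u1", 1, "i1"), ("u1", 2, "i2"),
        ("u2", 1, "i7"), ("u2", 2, "i9"),
    ]

    by_user = defaultdict(list)
    for user, ts, item in events:
        by_user[user].append((ts, item))

    training_pairs = []          # (user, input_item, target_next_item)
    for user, seq in by_user.items():
        items = [item for _, item in sorted(seq)]        # order by timestamp
        for current, nxt in zip(items, items[1:]):
            training_pairs.append((user, current, nxt))

    print(training_pairs)
    # [('u1', 'i1', 'i2'), ('u1', 'i2', 'i3'), ('u2', 'i7', 'i9')]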


Fig. 4 HRNN modelling

5 Experimental Results The results show how the customer clicks have been tracked, how long the customer paused, what products they scrolled and what products that added to cart. In Fig. 5 using HeatMap we can see the interaction of the users with the different products. Accessories, groceries, and books as well as furniture’s and home decors lead the chart with more than 350,000 clicks. By this diagram, it can be inferred that HeatMaps helps in knowing about which type of products customers are more interested in. As shown in Fig. 6, similar user’s recommendation, different product searches, rank of the product lists based on observed behaviour of the users is found. The figure shows user information for daily and monthly interactions. In this interaction, the selection factor is 50% for the given set. The selection factor demonstrates and defines

Fig. 5 HeatMap showing the user interaction with the various products


the user interaction aspects. This factor is an important metric in hyper-targeting. The HRNN algorithm can rank based on different key performing indicators (KPI) of which one is selection factor. Based on the selection factor, user behaviour can be predicted. As shown in Fig. 7, the different number of the products explored by the user in a particular session can be seen. In the figure, it is evident that how the user has a favourable selection compared to 500 views versus 10,000 views. It is highly likely that the sale prediction for a larger viewed product will be large. This will help ecommerce owners to focus on right product supply and sourcing resulting in better customer experience and larger profit for themselves.

Fig. 6 User behaviour analysis

Fig. 7 User interactions based on viewing of different products


Usages like activity tracking, sticky factors, retention etc., are tracked as shown in Fig. 8 for a two-day period. The session’s information, device information, geography, number of sign-ins, and duration of sessions are shown in Fig. 8. These parameters can also be used to predict the behaviour of the customer. Failure of the login, type of device, and clicks are also used in the HRNN model. App version, device model, device make, user location, and campaigns are the demographics information that can use by the e-commerce provider to understand and track product demands based on demographics. In Fig. 9, the different type of events and the corresponding users’ interaction are shown. The figure shows ‘product viewed’ has increased considerably compared to ‘product added’ and ‘cart viewed’. User’s interest in the products shopping has improved due to hyper-personalisation. In Fig. 10, the user’s activity impact on the discount is shown. When a discount is offered on the product the user events increase irrespective of the price. However, the low-priced products show a greater interest. The data can be grouped based on week, month, and time of year for a good and relevant recommendation. Events are tracked and used for capturing experience

Fig. 8 Demographics related analysis

Fig. 9 User interactions based on event types


Fig. 10 Impact of discounts on user activity

analysis and behaviour analysis. The model gets better as more and more data is added and trained using the training algorithm.

6 Conclusion and Future Work The product suggested is based on the events and activity and improves as more data and user interactions are processed over time. The proposed model provides 92% prediction accuracy of customer buying behaviour. The various factors that can impact the result can be divided into five categories.
• Quality: It requires real-time information.
• Experience: Users should feel they have a personal connection.
• Profiles: Customer analysis should be made on a regular basis.
• Traffic: Customers that feel connected are likely to become repeat visitors.
• Revenue: Customers respond to offers based on their current interest.

The marketing automation space provides capabilities for data ingestion, storage, processing, and analytics, allowing customers to build scalable, real-time, robust solutions that will continue to fit their marketing needs. E-commerce providers can use this data as highly relevant and specific to the user, i.e. for hyper-personalization, which can be further investigated and explored to find new ways to make relevant recommendations.


References 1. Meddour H, Abu Auf MA, Saoula O, Abdul Majid AH (2018) Consumer buying behaviour: the roles of motivation, price, perceived culture importance, and religious orientation”. J Bus Retail Manag Res 12(4) 2. Jain G, Chaturvedi KR, Nabi MK, Rakesh S (2018) Hyper-personalization—fashion sustainability through digital clienteling. Res J Text Apparel 3. Weinlich P, Semeradova (2019) Computer estimation of customer similarity with Facebook lookalikes: advantages & disadvantages of hyper-targeting. In: IEEE access, vol 7 4. Maheswari K, Packia Amutha Priya P (2017) Predicting customer behaviour in online shopping using SVM classifier. In: IEEE international conference on intelligent techniques in control, optimization and signal processing 5. Pervaiz A, Gull M (2018) Customer behaviour analysis towards online shopping using data mining 6. Paramania B, Sensuse DI, Solichah L, Dzulfikara MF, Prima P, Wilarso I (2018) In: Fourth IEEE international conference on information management in personalization features on business to consumer e-commerce: review and future direction 7. Silahtaroglu G (2015) Analysis and prediction of e-customers’ behaviour by mining clickstream data. In: IEEE international conference on big data 8. Doddegowda BJ, Raju GT, Kumar S, Manvi S (2016) Extraction of behavioural patterns from pre-processed web usage data for web personalization. In: IEEE international conference on recent trends in the electronics information and communication technology 9. Symeonidis AL, Vavliakis KN, Kotouza MT, Mitkas PA (2018) Recommendation systems in a conversational web. In: 14th international conference on web information systems and technologies 10. Luo H, Wu J, Sun Z, Guo Y (2010) An empirical study on effect of relationship benefit on customer stickiness in online shopping. In: IEEE 11. Wang J, Zhao M (2015) Differential effects of service content on event-related potentials in buying decision. In: 12th international conference on service systems and service management (ICSSSM) 12. Vempati S, Malayil KT, Sruthi V, Sandeep R (2019) Enabling hyper-personalisation: automated ad creative generation and ranking for fashion e-commerce. In: recsysXfashion’19, Copenhagen, Denmark 13. Thomaz F, Salge C, Karahanna E, Hulland J (2020) Learning from the dark web: leveraging conversational agents in the era of hyper-privacy to enhance marketing. J Acad Market Sci 14. Tyrvainen O, Karjaluoto H, Saarijarvi H (2020) Personalization and hedonic motivation in creating customer experiences and loyalty in monichannel retail. J Retail Consum Serv 15. Kapoor R, Shirilkar A, Gupta E (2018) E-commerce personalization. IJSDR 3(10) 16. Riegger AS, Henkel S, Klein JF, Merfeldand K (2021) Technology-enabled personalization in retail stores: Understanding drivers and barriers. J Bus Res 17. Karuppusamy P (2020) Artificial recurrent neural network architecture in customer consumption prediction for business development. J Artif Intell 2(02):111–120 18. Kumar ST (2021) Study of retail applications with virtual and augmented reality technologies. J Innovative Image Process 3(2):144–156

Chapter 51

Proof of Concept of Indoor Location System Using Long RFID Readers and Passive Tags Piotr Łozinski ´ and Jerzy Demkowicz

1 Introduction Radio-frequency identifications (RFIDs) are a general term used to describe technology, which enables automatic identification (or object recognition) using radio waves. RFID is basically a small electronic device that consists of small chip and an antenna. A chip is usually able to hold several bytes of data. It can be said that the RFID device performs the same function as the barcode or magnetic stripe on the back of a credit card containing the unique identifier of that item. Just like a barcode or magnetic strip needs to be scanned to get information, the RFID device needs to be scanned to get the identification information as shown in Fig. 1 [1]. RFID is a technology that uses radio waves not only to transmit data, but also to power an electronic circuit known as an RFID tag. Such a system usually consists of a reader, i.e., an antenna containing a transmitter, receiver, and decoder. An RFID tag is a passive electronic system that takes energy from the resonant frequency electromagnetic field generated by the antenna and stores the energy in a capacitor contained in the tag structure. When the accumulated energy level is sufficient, a reply is sent containing the tag code. The process is controlled by the corresponding software to obtain the relevant information. The most famous use of this technology is payment cards. They do not need to be charged, and yet they fulfill their function all the time, the only downside of this solution in the context of wider application is the small range, so for the purposes of, e.g., location of warehouse goods, long RFID technology can be used, which is a technology very similar to payment cards, but with a greater range [2–4]. Long RFID technology may prove useful in the location P. Łozi´nski · J. Demkowicz (B) Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Gda´nsk, Poland e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al, (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_51


Fig. 1 RFID model diagram with one tag [1]

of warehouse resources. Using a different solution here, without a passive label, may be pointless as even with very economical systems, we have to change the batteries every few years, and in addition, the cost of one label is very large compared to RFID labels [5–7].

1.1 UHF Frequency Band In order to be able to present RFID correctly, ultra-high frequency (UHF) must be discussed first. UHF refers to the range of electromagnetic waves with frequencies ranging from 300 MHz to 3 GHz as presented in Fig. 2. Thus, wavelengths range from one to ten decimeters (from 10 cm to 1 m) [1, 8–10]. Contemporary mobile phones transmit and receive in the UHF band. The UHF is widely used by public service agencies for two-way radio communication, usually using narrowband frequency modulation, but also digital services. Narrowband radio modems use UHF and VHF frequencies to transmit data for long-range communications, for example for the supervision and control of power distribution networks, as well as SCADA and automation applications. Until recently, there was very little of this type of broadcast communication in this band [11, 12].

1.2 Passive Elements Another issue that requires clarification are the passive elements (Fig. 3). We call them devices or components that store or hold energy in the form of voltage or current,


Fig. 2 UHF frequencies [1]

for example: resistor, capacitor, inductor, etc. In simple terms, passive elements are energy absorbing elements, and active elements are energy donors. They are linear and nonlinear, respectively. Typical examples of active devices are diodes, SCR transistors, etc. [1]. A significant advantage of RFID devices over barcodes is that the RFID device does not have to be positioned exactly in relation to the scanner. RFID devices can operate within a radius of several feet (up to 20 feet for high-frequency devices) from the scanner [13, 14].

Fig. 3 Long RFID passive tags


Fig. 4 Schematic depiction of simple passive RFID tag [1]

1.3 Basic UHF Passive RFID Components There are three basic components that all RFID passive components contain—an interrogator or reader, transponder or tag, and antenna. It should be noted that communication between the tag and the reader is based on the principle of cellular communication. The communication between the reader and tag antennas is called downlink, and the communication between the tags and the reader is called uplink. There are basically three types of RFID: passive, semi-passive, and active tags [1, 8, 15, 16]. In Fig. 4, the antenna generates electromagnetic fields, and therefore, highfrequency RF generates a voltage. The voltage is rectified by a diode (a device that only allows current to flow in one direction), and the resulting signal is smoothed using a storage capacitor to produce a roughly constant voltage which is then used to power the tag’s logic circuits and access memory. Passive tag memory circuits are always non-volatile as tag power is usually turned off. A similar rectifier circuit, using a lower capacitance value to make the voltage vary with the reader type. The logic module is used to demodulate the information from the reader. This technique is known as envelope detection. A field-effect transistor (FET) is used as a switch when the FET is connected, the antenna is grounded, which allows a lot of current to flow, and when it is turned off, the antenna passes very little current [1, 17, 18].

1.4 Basic Long RFID Location System Concept The aim and scope of the proposed approach is to develop a system for locating resources in, e.g., large warehouses as presented in Fig. 5, so the main goals are to determine the characteristics of the obtained RFID antenna, create a functional antenna handling kit, and prepare documentation and recommendations [19, 20].


Fig. 5 Warehouse long RFID location concept [21]

The main idea of the designed solution is primarily to determine the appropriate configuration of RFID antennas in order to increase their operating range. The second important topic is exploring the possibility of detecting unique tags and their identification and naming. An important step is also to investigate the actual characteristics of the RFID antenna. Finally, tests are carried out using two cooperating antennas. Important aspects are also determining further directions of work. To make the project realistic, we need to calculate the system cost estimate for a sample warehouse. After a series of tests, the first concept of a locating system is developed.

2 Research Equipment and Configuration 2.1 Equipment Configuration For research purposes, equipment consists of RFID antenna, converter, passive RFID tags, and additional software monitor. For mentioned reasons, the proposed equipment set consists of the low price long RFID reader to answer the question if a cheap generally available equipment can be suitable for such locating systems as shown in Figs. 5 and 6.

Fig. 6 Long RFID location equipment set


Table 1 Gen2 UHF-105 long RFID antenna parameters

ID | Parameter | Value
1 | Working frequency | 902–925 MHz
2 | Protocol support | ISO18000-6B, ISO18000-6C (EPC GEN2)
3 | Frequency hopping | FHSS or a fixed frequency set by the software
4 | Power of transmitter | 0–30 dBm, can be customized by software
5 | Working range | 1–6 m
6 | Antenna | Built-in circular polarization antenna, gain 8 dB / built-in linear polarization antenna, 12 dBi gain

The long RFID reader includes a transmitter, receiver, and decoder. Its parameters are presented in Table 1. The RFID tag is a passive electronic circuit that sends a response containing the tag code. The final composition of the transmitting kit includes a Windows computer, an usb-rs232 converter, an antenna, and tags.

2.2 Software Monitor The software makes it possible to obtain relevant information from the reader. Figure 7 shows the monitor application menu. The application allows to choose the method of connection with the antenna and allows to change the transmission parameters. The lower bar shows the antenna status, and above it a window displaying information from the read tags. The application allows you to change the most important settings of the antenna connection and read the data stored in the tags. You can choose the method of communication between the computer and the antenna. There are three ways of communication: through the serial port (RS232, RS485, etc.), through the USB interface, and through the IP network. The active mode (the antenna is on all the time and transmits data to the computer all the time) and passive mode (the antenna is on, but only transmits information to the computer after receiving such an order) are available. The application can determine the power of the connected antenna, as well as the length of the data stored in one tag. Interestingly, the number of tags that the antenna reads in one command cycle after which the antenna will stop reading can be set. The measurement interval in the active mode can also be set, but it is recommended that the interval between them should be min 10 ms. The authors managed to configure the reader to receive multiple tags simultaneously; however, the reader is able to receive 25 or more tags depending on the mode. Additionally, which is also important, it was possible to successfully enable two antennas on computer, in separate applications, suggesting that even more devices could be connected. Authors see a real possibility of using these readers in warehouses.
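A hedged pyserial sketch of polling such a reader over the USB–RS232 converter in passive mode. The port name, baud rate, and the command/response bytes are placeholders; the actual frame format of the Gen2 UHF-105 reader is vendor-specific and is not given here.

    import time
    import serial  # pyserial

    PORT = "COM3"          # or "/dev/ttyUSB0"; placeholder
    BAUD = 57600           # placeholder, check the reader documentation
    INVENTORY_CMD = bytes.fromhex("A0030189AA")   # placeholder command frame

    with serial.Serial(PORT, BAUD, timeout=0.5) as reader:
        while True:
            reader.write(INVENTORY_CMD)   # ask the reader for one inventory round
            frame = reader.read(64)       # raw reply; parsing is reader-specific
            if frame:
                print(time.strftime("%H:%M:%S"), frame.hex())
            time.sleep(0.01)              # keep >= 10 ms between polls, as advised above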


Fig. 7 Software monitor

3 Measurements and Recommendations 3.1 Key Components In order to study all aspects of a proposed long RFID location system, several key elements that determine the system should be analyzed, Fig. 8. Then, the features of the tested elements can be considered and the way in which they are to be examined. Finally, scenarios that will optimally check the system can be prepared. The crucial elements from the point of view of the tested system are: antenna gain, antenna polarization, RFID tags, reader settings, cable length, multiplexers or adapters, and environmental factors. Identifying all these key elements is one of the

Fig. 8 Key parameters and components from the point of view of the system evaluation


problems that needed to be resolved. The room where the research was carried out is sketched in Fig. 9. RFID antennas and tags were placed at different distances from each other and in different positions as presented in Fig. 9. As a result of the measurements carried out the following characteristics of the antennas are obtained, as shown in Fig. 10. The conducted research has shown that the antenna can detect several tags at once, it distinguishes them using the tag ID, which is unique for each tag. This is very important from a proposed system capability perspective. A significant disadvantage of the antenna (however, it is advantage at the same time) is the operating range, and the measurements show that the antenna is directional, but the maximum distance is only 7 m at an angle of about 60°, as shown in the Fig. 10. At full range, the tags were not constantly detected, but only periodically. A good connection where the tag is received all the time the range is about 4 m. In addition, the connection

Fig. 9 Long RFID locating system research room

Fig. 10 Long RFID reader antenna characteristic


deteriorated a lot when there are obstacles and positioning the antenna close to the wall even made the antenna find its own signal as a tag, so sometimes the reader read its own tags, especially in the case of strong reflections from the wall, however, it as a side effect of the tested system. This side effect is difficult to eliminate, but is easy to ignore with no consequences for the overall system performance. To sum up, a single antenna allows for the detection of number of tags, but has a short range and the transmission can be easily disturbed by obstacles. After all, in the location systems used in warehouses, it is not necessary that every tag is located all the time, as the location of most tags does not change. The operation of two readers from one central computer through a simple USB connection with the RS232 converter was positive. The study showed that there are no problems with such communication, and the antennas do not influence each other in a visible way, but actually support each other. The authors suggest following recommendations when designing such type of system: • Use the system option to detect and distinguish multiple tags simultaneously. • Depending on the recommended accuracy, use the appropriate range (maximum range for the presented set—7 m). • The use of antenna directivity and transmission reliability is more effective for smaller ranges (4 m in the tested case). • Increasing the selectivity of the antenna and active directional antennas improves the accuracy of the object position location. • Obstacles in the path of the long RFID reader radiation affect the range. • Location in the warehouse does not require constant receiving of a signal from a given Tag. Detecting a tag every minute or so is more than enough. • The use of communication via the IP protocol. • Use of wireless connections with readers. A communication method necessary for large warehouses. • For cost reasons, it is necessary to operate several and more antennas by one computer/driver. • The application managing the readers should cooperate with the database system which can be used to update the current location of the tags.
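A sketch of the last recommendation above: writing each tag sighting to a small database so a managing application can keep the latest known zone per tag. The table layout and the antenna/zone naming are assumptions.

    import sqlite3
    import time

    db = sqlite3.connect("tag_locations.db")
    db.execute("""CREATE TABLE IF NOT EXISTS sightings (
                      tag_id TEXT, antenna TEXT, seen_at REAL)""")

    def record_sighting(tag_id, antenna):
        # called whenever a reader reports a tag, e.g. from a polling loop
        db.execute("INSERT INTO sightings VALUES (?, ?, ?)",
                   (tag_id, antenna, time.time()))
        db.commit()

    def last_known_zone(tag_id):
        return db.execute(
            "SELECT antenna, seen_at FROM sightings WHERE tag_id=? "
            "ORDER BY seen_at DESC LIMIT 1", (tag_id,)).fetchone()

    record_sighting("E200-1234", "antenna-A")
    print(last_known_zone("E200-1234"))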

4 Further Work Tests give a positive results. The next stage of the research is to assemble the antennas in a warehouse and test how they will detect tags mounted on cardboard boxes in a real working environment. Further, work is the examination of the reflection characteristics in the sense of how the reflection of the wave affects the reading of the cards. Launching an application that records RFID data and writes it to a database is straightforward, so the final step would be to write an application that would collect data from that application and send them to the cloud. Investigating whether it is possible to determine the signal strength, and thus the distance from the


reader, could significantly improve the accuracy of the system, which is currently 2–3 m. Designing and training a neural network to support optimal antenna locations in a specific room or warehouse are to be another research direction.

5 Conclusions The conducted tests showed that the readers can detect several tags at once, it distinguishes them by the obtained ID, which is unique for each tag. A significant disadvantage of the antenna is the operating range. In summary, a single antenna allows detecting a large number of tags, but has a short range, and the transmission can be easily disturbed by obstacles. Nevertheless, in the localization systems used in warehouses, the use of this system is very likely. Due to the inability to use cheap microcontrollers in the tested system, it is necessary to support minimum several antennas by one computer. This functionality is essential if the entire system is to be profitable. The application itself has the ability to communicate via the IP server. However, studies of the antennas have shown that they do not have such capabilities, but there is still the possibility of using the tested set in real conditions. The disadvantage of the proposed set is the need to use a large number of this type of readers and the problem of arranging antennas in buildings of height of more than 7 m. However, this is typical of all known indoor positioning systems. Acknowledgements Authors would like to thank K. Konwerski, M. Lewalski, M. Sułkowski for their support during the study.

References 1. Zaman Tanim MM (2016) How does passive RFID works, briefly explained. Technical report 2. Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197 3. May P, Ehrlich HC, Steinke T (2006) ZIB structure prediction pipeline: composing a complex biological workflow through web services. In: Nagel WE, Walter WV, Lehner W (eds) Euro-Par 2006, vol 4128. LNCS. Springer, Heidelberg, pp 1148–1158 4. Foster I, Kesselman C (1999) The grid: blueprint for a new computing infrastructure. Morgan Kaufmann, San Francisco 5. Foster I, Kesselman C, Nick J, Tuecke S (2002) The physiology of the grid: an open grid services architecture for distributed systems integration. Technical report, Global Grid Forum 6. National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov 7. Kitayoshi H, Sawaya K (2005) Long range passive RFID-tag for sensor networks. In: VTC2005-Fall. 2005 IEEE 62nd vehicular technology conference, IEEE Press, Dallas, pp 2696– 2700 8. Hou X, Arslan T (2017) Monte Carlo localization algorithm for indoor positioning using Bluetooth low energy devices. In: 2017 international conference on localization and GNSS (ICL-GNSS), Nottingham, pp 1–6


9. Radoi IE, Cirimpei D, Radu V (2019) Localization systems repository: a platform for opensource localization systems and datasets 10. Radoi IE (2019) A platform for open-source localization systems and dataset. In: International conference on indoor positioning and indoor navigation (IPIN), Pisa, pp 1–8 11. El-Hadidy M, Yasser YESB (2019) Realistic chipless RFID tag modeling, mathematical framework and 3D EM simulation. In: 2019 IEEE international conference on RFID technology and applications (RFID-TA), Pisa, pp 201–206 12. Alvarez-Narciandi G, Laviada J, Pino MR, Las-Heras F (2017) 3D location system based on attitude estimation with RFID technology. In: 2017 IEEE international conference on RFID technology & application (RFID-TA), Warsaw, pp 80–82 13. Czajkowski K, Fitzgerald S, Foster I, Kesselman C (2001) Grid information services for distributed resource sharing. In: 10th IEEE international symposium on high performance distributed computing. IEEE Press, New York, pp 181–184 14. Ahmad U, Poon K, Altayyari AM, Almazrouei MR (2019) A low-cost localization system for warehouse inventory management. In: 2019 international conference on electrical and computing technologies and applications (ICECTA), Ras Al Khaimah, pp 1–5 15. Madany YM, Mohamed DAE, Ali WAE, Emara RF (2017) Modelling and simulation of indoor reverse RFID tag localization method based on mobile antenna reader position. In: 2017 UKSim-AMSS 19th international conference on computer modelling & simulation (UKSim), Cambridge, pp 235–239 16. Xu C, Zhao Y, Zhang Y (2009) Localization technology in wireless sensor networks based on UWB. In: 2009 international conference on wireless networks and information systems, Washington, pp 35–37 17. Zhi-yuan Z, He R, Jie T (2010) A method for optimizing the position of passive UHF RFID tags. In: 2010 IEEE international conference on RFID-technology and applications, Guangzhou, pp 92–95 18. Rao KVS, Nikitin PV, Lam SF (2005) Antenna design for UHF RFID tags: a review and a practical application. IEEE Trans Antennas Propag 53(12):3870–3876 19. Wei D, Hung W, Wu KL (2016) A real time RFID locationing system using phased array antennas for warehouse management. In: 2016 IEEE international symposium on antennas and propagation (APSURSI), Fajardo, pp 1153–1154 20. Hislop G, Lekime D, Drouguet M, Craeye C (2010) A prototype 2D direction finding system with passive RFID tags. In: Proceedings of the fourth European conference on antennas and propagation, Barcelona, pp 1–5 21. WMS OPTIPROMAG, https://www.optidata.pl/en/oprogramowanie/wms-optipromag/

Chapter 52

Autism Detection in Young Children Using Optimized Long Short-Term Memory S. Guruvammal, T. Chellatamilan, and L. Jegatha Deborah

Nomenclature

FPR False positive rate
SSO Shark smell optimization
DT Decision trees
KNN K-nearest neighbors
ASD Autism spectrum disorder
LSTM Long short-term memory
SVM Support vector machines
AI Artificial intelligence
LDA Linear discriminant analysis
LR Logistic regression
FNR False negative rate
PRO Poor and rich optimization
CNN Convolutional neural network
TD Typical developing
RF Random forests
NDK Near-duplicated key frames
Social SO Social spider optimization
NN Neural network
NPV Negative predictive value
ALO Ant lion optimization
SMO Spider monkey optimization
CA Correspondence analysis

S. Guruvammal (B) · L. Jegatha Deborah University College of Engineering Tindivanam, Tamil Nadu, Tindivanam, India e-mail: [email protected] T. Chellatamilan Vellore Institute of Technology, Tamil Nadu, Vellore, India © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al, (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_52


MCC Mathews correlation coefficient
MCA Multiple correspondence analysis
ADOS Autism diagnostic observation schedule
DNN Deep neural network
SVR Support vector regression
ML Machine learning
ACSSO Arithmetic crossover insisted shark smell optimization

1 Introduction A person’s capacity to interact with people and communicate is permanently disrupted by autism. For instance, children with ASD have issues in forming meaningful relationships with others. Numerous efforts have been made by researchers from various nations to assist autistic children to recover as well as appreciate their childhood. An endeavor is to use NAO, an AI robot to converse with the youngsters, as the interaction with people and behavior are affected by ASD. Because symptoms usually arise during the first two years of life, it is typically referred to as a developmental illness. As a consequence, early diagnosis is critical, and it is been shown that implementing early effective interventions helps children to gain the necessary abilities to participate in kindergarten along with TD children. Deficiency in social behavior and nonverbal interactions are some of the features of ASD. In particular, ASD youngsters avoid eye contact, struggle with social relationships, and repeat bodily gestures and activities. They primarily rely on behavioral signals and data gathered from parents [1]. The evaluation of bio-signals has gained growing interest among available markers, owing to fast improvements in the software as well as hardware components. Visual and gaze exploration is recognized as the distinguishing characteristic among TD and ASD children. Further, the modern eye-tracking scheme enables seamless recording of gaze information to analyze the behavior. Furthermore, these kinds of data might be processed using current developments in machine learning and the quick adoption of new, potent, and reasonably priced hardware. To use the traditional method of facial expression, therapists or other caretakers must be able to identify the emotional states of autistic children. The ability of thermal imaging to measure emotional states in a “masked” manner as they are described in TD children’s facial appearance inspires research to extend the same to autistic children’s affective reactions. Accurate detection of emotional state might help existing rehabilitation and training regimes in improving their performance [2–4]. Anxiety treatment in people with ASD is difficult. Traditional anxiety treatments are based on anxiety symptoms, which is difficult for people with ASD. This is a treatment roadblock that is essential for the timely and successful implementation of administration methods. Physiological responses, gathered by non-invasive and widely accessible wearable sensors, provide a real-time, objective, as well as
language-free assessment of anxiety states, which could also help to overcome the aforementioned difficulty. This method is based on quantitative measures of autonomic arousal that have been linked to anxiety states. Modeling the baseline physiological features of users as well as recognizing deviations from this baseline that correlate with anxiety states is a fundamental technological difficulty in designing an anxiety detection method. Numerous techniques have been developed [5–7] to classify anxiety vs. baseline states; LDA, KNN, AdaBoost, DT, LR, RF, and SVM are examples of the supervised learning algorithms used. Following is a summary of this paper's main contribution: • A novel data normalization procedure for the preprocessing stage was proposed in this research. • For the optimal tuning of weights in the LSTM, an arithmetic crossover insisted shark smell optimization (ACSSO) algorithm was implemented in this work. The review of autism detection in young children is established in Sect. 2 of this research. Section 3 depicts the entire framework for detecting autism in early childhood. Section 4 specifies the preprocessing and feature extraction phases. The detection utilizing the improved long short-term memory is shown in Sect. 5. Section 6 presents the weight optimization of the LSTM utilizing the arithmetic crossover insisted shark smell optimization method.

2 Literature Review 2.1 Related Works In 2020, Sarabadani et al. [8] examined the ASD detection of autonomic reactions to negative and positive stimuli in children by employing four physiological parameters. Although 15 children with ASD saw common pictures intended to generate altering levels of arousal (high and low intensity), electrocardiograms, valence (negative and positive) respiration, temperature, and skin conductance were evaluated. Affective states caused via stimuli of lower and higher arousal or positive and negative valence were discriminated with average levels of accuracy reaching or above 80% using an ensemble classifier. These findings show that by using physiological signals, this might be possible to objectively distinguish the emotional states of people with ASD. In 2019, Yang et al. [9] introduced a new multi-modal picture book recommendation system and accessed via a testing dataset including textual and visual information to compute the similarity among picture books and discussion subjects. An image neighbor finding approach and an NDK neighbor detection technique were suggested in the conceptual framework. The methods used include CA and MCA. Furthermore, six performance metrics were used to evaluate the booklist created during the experiment, and the findings show that the suggested methodology provides promising and effective outcomes.


In 2020, Rusli et al. [1] adopted thermal imaging usage for noninvasively analyzing physiological data related to subjective states as a passive medium. The researchers reasoned that coetaneous temperature variations caused by pulsing blood flow at the frontal face area in the blood vessels assessed by the method have a direct effect on autistic children’s emotional states. Thermal imaging data obtained from diverse expressions of emotional state by various groups of audio–video stimuli was measured in the controlled experimental setting. The classifier’s results demonstrated the technique’s usefulness, with an 88% classification accuracy in recognizing autistic children’s emotional states. In 2020, Eni et al. [10] suggested a DNN for diagnosing autism severity in small children employing speech signals. Voice records of Hebrew-speaking young people who took the ADOS test were used by researchers to gather a range of prosodic, acoustic, and personal characteristics. The observations of 72 children yielded 60 attributes, 21 of which were shown to be closely connected to the children’s ADOS scores. They have created a number of DNN approaches for predicting ADOS scores based on these characteristics as well as compared the performance to LR and SVR algorithms. After being trained and evaluated on several subsamples of the data, this method forecast ADOS values with a median RMSE of 4.65 and an average correlation of 0.72 with the actual ADOS scores. In 2021, Mazumdar et al. [11] offered a method depending on the collective usage of ML and eye-tracking data. Particularly, features including the fixations or object’s presence toward an imaging center were derived from image content as well as viewing activity. The feature extraction to be employed is the initial phase in RM3ASD for classification. Three categories of features were studied in further depth: (i) features derived from image content, (ii) features acquired through exploiting fixations on image stimuli, and (iii) features chosen regarding the respondents’ biases when investigating visual stimuli. An ML-based model was trained using those features. The findings demonstrated that the features examined could distinguish children with ASD from normal children. Numerous works have been focused on autism detection systems. However, the existing method has a drawback like more consumption time, low accuracy, overfitting, and so on. Hence, to overcome the above-mentioned problem, this paper proposed arithmetic crossover insisted metaheuristic optimization algorithm.

3 Overall Framework of Autism Detection in Young Children The purpose of this study is to develop a unique method for detecting autism in early infants with three key phases: (a) preprocessing, (b) feature extraction, and (c) detection. At first, in preprocessing phase, the input data is processed using modified data normalization. From the normalized data, it extract features like statistical and
higher-order statistical features. Moreover, the detection is carried out using an optimized LSTM classifier. To enhance the detection performance, the LSTM weights are tuned optimally via the developed ACSSO model. As a result, the output is categorized in an efficient manner. Figure 1 depicts the framework of the developed approach. Fig. 1 Designed scheme’s entire design

[Fig. 1 pipeline: input data -> preprocessing (modified data normalization) -> feature extraction (statistical and higher-order statistical features) -> detection (LSTM with optimal weights tuned by the proposed ACSSO model) -> detected output.]


4 Preprocessing and Feature Extraction Phase 4.1 Preprocessing The given data is initially put through preprocessing, where modified data normalization happens. Data normalization modifies the feature scales to have a uniform scale of measurement. Modified Data Normalization: Let D be the data with m records and n attributes as $D_{m \times n}$. The modified data normalization process is determined as per Eq. (1).

$D_{\mathrm{norm}} = \dfrac{D - \mu_j}{S_d}$  (1)

Here,

$\mu_j = \dfrac{\sum_{j=1}^{n} w_j}{\sum_{j=1}^{n} w_j / D_j}$  (2)

Equation (2) indicates the weighted harmonic mean.

$S_d = \sqrt{\dfrac{\sum_{t=1}^{n} D_t \, (D_t - \mu_j)^2}{n}}$  (3)

where $S_d$ indicates the standard deviation.
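A numpy sketch of Eqs. (1)–(3) as reconstructed above: a weighted harmonic mean, a weighted standard deviation, and the resulting normalization. The attribute weights w are an input the paper does not fix, so uniform weights are assumed; the sums are taken per attribute over the records here, which is a common convention, since the paper's indexing is ambiguous.

    import numpy as np

    def modified_normalize(D, w=None):
        # D: m records x n attributes with strictly positive values
        D = np.asarray(D, dtype=float)
        m, n = D.shape
        if w is None:
            w = np.ones(m)                                      # uniform weights (assumption)
        mu = w.sum() / (w[:, None] / D).sum(axis=0)             # weighted harmonic mean, Eq. (2)
        sd = np.sqrt((D * (D - mu) ** 2).sum(axis=0) / m)       # weighted SD, Eq. (3)
        return (D - mu) / sd                                    # Eq. (1)

    X = np.array([[1.0, 10.0],
                  [2.0, 12.0],
                  [4.0, 9.0]])
    print(modified_normalize(X))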

4.2 Feature Extraction After modified data normalization, the features extracted include statistical features and higher-order statistical features.

(i) Statistical features: The following characteristics are computed: mean, median, SD, min, and max.

(a) Mean (Average) [12]: The mean value is the result of dividing the sum of all values by the number of values.

$\bar{R} = \dfrac{1}{g} \sum_{c=1}^{g} R_c$  (4)

In Eq. (4), g represents the number of values, $R_c$ denotes the observed values, and $\bar{R}$ is the symbol for the sample mean.

(b) Median [12]: The middle value of a dataset arranged in ascending order. When there are two values in the middle of a dataset, the median is defined as the mean of those two values.

$\mathrm{Median} = \begin{cases} R_{\frac{g+1}{2}} & \text{if } g \text{ is odd} \\ \frac{1}{2}\left(R_{\frac{g}{2}} + R_{\frac{g}{2}+1}\right) & \text{if } g \text{ is even} \end{cases}$  (5)

(c) SD: It measures the amount of variation or dispersion of a collection of values. A smaller SD [13] indicates values that are typically closer to the average, whereas a bigger SD indicates values that are spread out over a wider range. Equation (6) shows the SD, where σ is the SD symbol.

$\sigma = \sqrt{\dfrac{1}{g-1} \sum_{c=1}^{g} (R_c - \bar{R})^2}$  (6)

The statistical characteristics SF are specified in Eq. (7), where Min and Max denote the minimum and maximum values.

$SF = \bar{R} + \mathrm{Median} + \sigma + \mathrm{Min} + \mathrm{Max}$  (7)

(ii) Higher-order statistical features: The following features are listed: skewness, kurtosis, and moment.

(a) Skewness [14]: A measure of asymmetry. A set of data or a distribution is said to be symmetric only if the right and left sides of the central axis are comparable. It is given in Eq. (8).

$\mathrm{Skewness} = \dfrac{\sum_{c=1}^{g} (R_c - \bar{R})^3 / g}{\sigma^3}$  (8)

In Eq. (8), $R_c = R_1, R_2, \ldots, R_g$, the mean value is denoted as $\bar{R}$, σ denotes the SD, and the number of data points is denoted as g.

(b) Kurtosis [14]: This is the criterion for determining whether the data is light tailed or heavy tailed relative to the normal distribution. Outliers and long tails are less common in datasets with reduced kurtosis and much more likely to be present in datasets with higher kurtosis. Equation (9) expresses the kurtosis formula for univariate data $R_1, R_2, \ldots, R_g$; in computing the kurtosis, the SD in the denominator is calculated with g rather than g − 1.

$\mathrm{Kurtosis} = \dfrac{\sum_{c=1}^{g} (R_c - \bar{R})^4 / g}{\sigma^4}$  (9)

(c) Moment [15]: In probability and statistics, it is a moment of the probability distribution of a random variable, i.e. the expected value of an integer power of the departure from the mean of the random variable. The shape and spread of the distribution are related to the higher-order moments. E signifies the expectation operator, $\bar{R}_c = E\big[(R - E[R])^c\big]$, and the cth moment corresponds to the central moment of a real-valued random variable R. For a continuous univariate probability distribution with probability density function f(y), the cth moment about the mean $\bar{R}$ is calculated as in Eq. (10).

$\mathrm{moment} = \bar{R}_c = E\big[(R - E[R])^c\big] = \int_{-\infty}^{+\infty} (y - \bar{R})^c f(y)\, dy$  (10)

The higher-order statistical features are denoted as HF in Eq. (11).

$HF = \mathrm{Skewness} + \mathrm{Kurtosis} + \mathrm{moment}$  (11)

The entire extracted feature set is indicated as FE in Eq. (12).

$FE = SF + HF$  (12)
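A short sketch of the feature set in Eqs. (4)–(12) using numpy and scipy. scipy's skew, kurtosis, and moment follow the same definitions used above (kurtosis with fisher=False gives the plain fourth-moment ratio of Eq. (9)); the features are returned as a vector here rather than literally summed, and the choice of the third central moment is an example.

    import numpy as np
    from scipy.stats import skew, kurtosis, moment

    def extract_features(signal, c=3):
        r = np.asarray(signal, dtype=float)
        stat = [r.mean(), np.median(r), r.std(ddof=1), r.min(), r.max()]   # Eqs. (4)-(7)
        higher = [skew(r), kurtosis(r, fisher=False), moment(r, c)]        # Eqs. (8)-(11)
        return np.array(stat + higher)                                     # Eq. (12)

    x = np.array([0.2, 0.5, 0.1, 0.9, 0.4, 0.3])
    print(extract_features(x))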

5 Detection Using Optimized Long Short-Term Memory

5.1 Optimized LSTM

The extracted feature set FE is provided to the LSTM classifier as its input. Through the use of gate control units and linear connections, the LSTM network provides an efficient way to alleviate the vanishing-gradient problem, and it captures the significant dependencies in time-series data. The LSTM [16] is built as a series of connected LSTM cells. Each LSTM cell contains three gating units: an input gate, an output gate, and a forget gate. Because of this characteristic, LSTM cells can retain and store information over long time spans. Let H and C denote the hidden state and the cell state. (H_l, C_l) and (F_l, C_{l−1}, H_{l−1}) represent the output and input of the cell, correspondingly. At time l, the output gate, input gate, and forget gate are denoted as O_l, I_l, and G_l. The cell first uses G_l to filter the data; G_l is determined in Eq. (13).

G_l = κ(W_L F_l + h_L + W_J H_{l−1} + h_J)   (13)

The weight matrices and bias parameters are specified as (W_J, h_J) and (W_L, h_L). The gate activation function κ is chosen to be the sigmoid operation. The LSTM cell then uses the input gate to mix in the appropriate data, as defined by Eqs. (14), (15), and (16). The weight matrices and bias parameters that map the input and hidden layers to the cell gate are denoted as (W_X, h_X) and (W_Y, h_Y), while (W_p, h_p) and (W_q, h_q) represent the weight and bias parameters that relate the hidden and input layers to K_l. Because the weight parameters play such an important role in the network, they must be tuned carefully for effective detection.

U_l = tanh(W_Y F_l + h_Y + W_X H_{l−1} + h_X)   (14)

K_l = κ(W_q F_l + h_q + W_p H_{l−1} + h_p)   (15)

C_l = G_l C_{l−1} + K_l U_l   (16)

Here, the LSTM attains the hidden layer (output) from the output gate as determined in Eqs. (17) and (18).

o_l = κ(W_e F_l + h_e + W_r H_{l−1} + h_r)   (17)

H_l = o_l tanh(C_l)   (18)

Here, (W_e, h_e) and (W_r, h_r) denote the weight and bias parameters for mapping the input and hidden layers to o_l. The LSTM output is indicated as CL_LSTM.
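To make the gate computations in Eqs. (13)–(18) concrete, the following NumPy sketch implements one LSTM cell step; the weight and bias names mirror the symbols above, but the shapes and parameter handling are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(F_l, H_prev, C_prev, p):
    """One LSTM cell update following Eqs. (13)-(18).

    F_l    : input feature vector at time l
    H_prev : previous hidden state H_{l-1}
    C_prev : previous cell state C_{l-1}
    p      : dict of weight matrices W_* and bias vectors h_*
    """
    # Forget gate, Eq. (13)
    G_l = sigmoid(p["W_L"] @ F_l + p["h_L"] + p["W_J"] @ H_prev + p["h_J"])
    # Candidate cell content, Eq. (14)
    U_l = np.tanh(p["W_Y"] @ F_l + p["h_Y"] + p["W_X"] @ H_prev + p["h_X"])
    # Input gate, Eq. (15)
    K_l = sigmoid(p["W_q"] @ F_l + p["h_q"] + p["W_p"] @ H_prev + p["h_p"])
    # Cell state update, Eq. (16)
    C_l = G_l * C_prev + K_l * U_l
    # Output gate and hidden state, Eqs. (17)-(18)
    o_l = sigmoid(p["W_e"] @ F_l + p["h_e"] + p["W_r"] @ H_prev + p["h_r"])
    H_l = o_l * np.tanh(C_l)
    return H_l, C_l

# Toy dimensions: 8 input features (FE), hidden size 4.
rng = np.random.default_rng(0)
params = {k: rng.normal(size=(4, 8)) for k in ["W_L", "W_Y", "W_q", "W_e"]}
params.update({k: rng.normal(size=(4, 4)) for k in ["W_J", "W_X", "W_p", "W_r"]})
params.update({k: np.zeros(4) for k in ["h_L", "h_J", "h_Y", "h_X", "h_q", "h_p", "h_e", "h_r"]})
H, C = lstm_step(rng.normal(size=8), np.zeros(4), np.zeros(4), params)
```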

6 Weight Optimization of LSTM Using Arithmetic Crossover Insisted Shark Smell Optimization Scheme

6.1 Objective Function and Solution Encoding

The LSTM weights are optimally tuned using the developed ACSSO scheme. Figure 2 represents the input solution to the proposed ACSSO scheme, where the LSTM weights are denoted as W and the weight count is denoted as N; the solution is encoded as the vector [W_1, W_2, …, W_N]^T. The objective function is specified in Eq. (19).

Fig. 2 Solution encoding

Obj = Max(accuracy)   (19)
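As an illustration of how a candidate solution from Fig. 2 could be scored against Eq. (19), the sketch below reshapes a flat weight vector into named LSTM weight matrices and returns validation accuracy as the fitness; the reshaping scheme and the `predict_fn` callable are hypothetical placeholders, since the paper does not spell out these details.

```python
import numpy as np

def fitness(weight_vector, shapes, predict_fn, X_val, y_val):
    """Objective of Eq. (19): accuracy of the LSTM built from the candidate weights.

    weight_vector : flat candidate solution [W_1, ..., W_N]
    shapes        : list of (name, shape) pairs describing how to unflatten the vector
    predict_fn    : callable(weights_dict, X) -> predicted labels (assumed LSTM forward pass)
    """
    weights, offset = {}, 0
    for name, shape in shapes:
        size = int(np.prod(shape))
        weights[name] = weight_vector[offset:offset + size].reshape(shape)
        offset += size
    y_pred = predict_fn(weights, X_val)
    return np.mean(y_pred == y_val)   # accuracy; ACSSO searches for its maximum
```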

6.2 Proposed ACSSO Model

SSO [17] is based on the shark's ability to hunt using its sense of smell, and the approach has been applied to tackle real-world engineering challenges. Still, the model lags in convergence rate and convergence speed when solving optimization problems. To sort out these issues, the ACSSO scheme is implemented. The capability of self-enhancement has been demonstrated in established optimization methods [18–22]. SSO includes four primary phases.

Initialization: The initial guess population is arbitrarily formed over the entire search space for SSO modeling. Each solution represents a particle of the initial shark odor used in the search process. The first solution vector is established according to Eq. (20), in which B_i^1 is the ith starting population vector (position) and d denotes the population size.

B^1 = [B_1^1, B_2^1, …, B_d^1]   (20)

Each position is expanded along the decision variables as in Eq. (21), wherein B_{i,b}^1 is the bth dimension of the ith shark position and u denotes the count of decision variables.

B_i^1 = [B_{i,1}^1, B_{i,2}^1, …, B_{i,u}^1]   (21)

Forward movement: At each position, the shark senses the potent odor particles produced where blood and water mix (the prey) and moves toward the goal with a velocity V. The initial velocity vector is therefore defined over the positions according to Eq. (22), and each velocity consists of dimensional components as specified in Eq. (23).

V^1 = [V_1^1, V_2^1, …, V_d^1]   (22)

V_i^1 = [V_{i,1}^1, V_{i,2}^1, …, V_{i,u}^1]   (23)

Therefore, the velocity in every dimension is evaluated as in Eq. (24), where k = 1, 2, …, k_max symbolizes the stage count, k_max denotes the maximum number of stages, ∂(OB)/∂χ_b evaluated at χ_{i,b}^k points out the derivative of the objective OB at position χ_{i,b}^k, and rand1 symbolizes a random value in (0, 1).

V_{i,b}^{k} = \eta_k \cdot rand_1 \cdot \left.\frac{\partial(OB)}{\partial\chi_b}\right|_{\chi_{i,b}^{k}}   (24)

The shark's forward movement is bounded by the velocity limiter of the SSO model at each stage, as determined in Eq. (25), where i = 1, 2, …, NP, b = 1, 2, …, ND, rand2 is a random value in (0, 1), and Q_{i,b}^{k−1} is the velocity component of the previous stage.

\left|V_{i,b}^{k}\right| = \min\left[\left|\eta_k \cdot rand_1 \cdot \left.\frac{\partial(OB)}{\partial\chi_b}\right|_{\chi_{i,b}^{k}} + \psi_k \cdot rand_2 \cdot Q_{i,b}^{k-1}\right|,\ \left|\beta_k \cdot Q_{i,b}^{k-1}\right|\right]   (25)

As per the proposed model, η_k and ψ_k are calculated using the circle map of Eq. (26), wherein ψ_k denotes the momentum rate (inertia coefficient) and η_k takes a value in the interval [0, 1].

x_{k+1} = \left(x_k + z - \frac{P}{2\pi}\sin(2\pi x_k)\right) \bmod 1   (26)

where P = 0.5 is a control parameter and z = 0.2. The current shark position is then obtained from its previous position and velocity according to Eq. (27), where Δt_k points out the time interval of stage k.

T_i^{k+1} = A_i^k + V_i^k · Δt_k   (27)
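A small sketch of the chaotic circle-map sequence of Eq. (26), which the proposed scheme uses to generate η_k and ψ_k, is shown below; the initial seed values are assumptions, since the paper does not state them.

```python
import numpy as np

def circle_map_sequence(n, x0=0.7, P=0.5, z=0.2):
    """Generate n chaotic values in [0, 1) with the circle map of Eq. (26)."""
    xs = np.empty(n)
    x = x0
    for k in range(n):
        x = (x + z - (P / (2.0 * np.pi)) * np.sin(2.0 * np.pi * x)) % 1.0
        xs[k] = x
    return xs

# eta_k and psi_k drawn from the chaotic sequence for, e.g., 25 stages.
eta = circle_map_sequence(25)
psi = circle_map_sequence(25, x0=0.3)
```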

Rotational movement: For the purpose of locating a stronger odor particle, the shark rotates. This procedure is known as a local search and is modeled in Eq. (28), where ĥ = 1, 2, …, Ĥ and rand3 denotes a random value in (0, 1).

Z_i^{k+1,ĥ} = T_i^{k+1} + rand3 · T_i^{k+1}   (28)

Particle position update: As the shark rotates along its search path, it approaches the most potent odor particle, as shown in Eq. (29); here, A_i^{k+1} represents the subsequent shark location, selected as the candidate with the highest OB value.

A_i^{k+1} = arg max[OB(T_i^{k+1}), OB(Z_i^{k+1,1}), …, OB(Z_i^{k+1,Ĥ})]   (29)

An arithmetic crossover procedure is then carried out in accordance with the proposed ACSSO model.


Arithmetic crossover: Arithmetic crossover is employed when real-value encoding is applied. The arithmetic crossover operator connects two parent chromosomes linearly: two chromosomes are randomly chosen for crossing, and two children are produced by a linear combination of these chromosomes, calculated as follows:

C̃1 = W̃ · G̃1_gen + (1 − W̃) · G̃2_gen   (30)

C̃2 = W̃ · G̃2_gen + (1 − W̃) · G̃1_gen   (31)

where C̃ is an individual of the new generation, G̃_gen is an individual of the old generation, and W̃ is a weight between 0 and 1 that determines the dominant individual. Algorithm 1 specifies the procedural code of the proposed ACSSO scheme.

Algorithm 1: Adopted ACSSO scheme
Start
  Initialization
    Assign the constraints d, ψ_k, k_max, η_k, with k = 1, 2, …, k_max
    Generate the primary population containing all solutions
    Initialize k = 1
  For k = 1 : k_max
    Forward movement
      Calculate every element of V_{i,b} as per Eq. (24) with the velocity limiter of Eq. (25)
      Obtain a new shark position T_i^{k+1} from the forward movement as per Eq. (27)
      As per the proposed logic, calculate η_k and ψ_k using the circle map of Eq. (26)
    Rotational movement
      Obtain the novel shark positions Z_i^{k+1,ĥ} as per Eq. (28)
      Choose the next shark position based on the two moves as per Eq. (29)
    Perform arithmetic crossover as per Eqs. (30) and (31)
    Fix k = k + 1
  End for k
  Return the best shark position with the highest OB value
End
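The sketch below strings the SSO moves and the arithmetic crossover together into a minimal ACSSO-style loop; it assumes a generic maximization objective `fitness`, η_k and ψ_k drawn from the circle map, a numerical gradient, and a simple replace-the-worst crossover rule, so it illustrates the structure of Algorithm 1 rather than the authors' exact implementation.

```python
import numpy as np

def acsso(fitness, dim, pop=20, stages=25, rotations=5, beta=4.0, dt=1.0, seed=0):
    """Minimal ACSSO-style loop (Algorithm 1): forward move, rotation, crossover."""
    rng = np.random.default_rng(seed)
    B = rng.uniform(-1.0, 1.0, size=(pop, dim))       # initial positions, Eqs. (20)-(21)
    V = rng.uniform(-0.1, 0.1, size=(pop, dim))       # initial velocities, Eqs. (22)-(23)

    def circle_map(x):                                 # chaotic schedule, Eq. (26)
        return (x + 0.2 - (0.5 / (2 * np.pi)) * np.sin(2 * np.pi * x)) % 1.0

    def grad(b, eps=1e-3):                             # numerical stand-in for d(OB)/d(chi), Eq. (24)
        g = np.zeros(dim)
        for j in range(dim):
            e = np.zeros(dim); e[j] = eps
            g[j] = (fitness(b + e) - fitness(b - e)) / (2 * eps)
        return g

    x = 0.7
    for k in range(stages):
        x = circle_map(x); eta = x                     # eta_k from the circle map
        x = circle_map(x); psi = x                     # psi_k from the circle map
        for i in range(pop):
            # Forward movement with velocity limiter, Eqs. (24)-(25) and (27).
            v_new = eta * rng.random(dim) * grad(B[i]) + psi * rng.random(dim) * V[i]
            V[i] = np.sign(v_new) * np.minimum(np.abs(v_new), np.abs(beta * V[i]))
            T = B[i] + V[i] * dt
            # Rotational movement (local search), Eq. (28); a two-sided factor is used here.
            Z = [T + rng.uniform(-1, 1, dim) * T for _ in range(rotations)]
            B[i] = max([T] + Z, key=fitness)           # position update, Eq. (29)
        # Arithmetic crossover on two random parents, Eqs. (30)-(31); children replace the worst.
        a, b = rng.choice(pop, size=2, replace=False)
        w = rng.random()
        for child in (w * B[a] + (1 - w) * B[b], w * B[b] + (1 - w) * B[a]):
            worst = int(np.argmin([fitness(s) for s in B]))
            if fitness(child) > fitness(B[worst]):
                B[worst] = child
    return max(B, key=fitness)

# Example: maximize a toy objective with its peak at the origin.
best = acsso(lambda w: -np.sum(w ** 2), dim=5)
```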


Table 1 Algorithm parameters

Methods                              Parameters              Values
Shark smell optimization (SSO)       Eta                     0.5
                                     Alpha                   0.5
                                     Beta                    0.5
                                     Delta                   0.1
                                     V                       0.5
Spider monkey optimization (SMO)     Perturbation rate pr    0.1
Social spider optimization (SSO)     fp                      (0.65, 0.9)

7 Results and Discussions

7.1 Simulation Procedure

The developed LSTM + ACSSO model for autism detection in young children was implemented in Python, and its outcomes were verified. In addition, the performance of the developed LSTM + ACSSO model was compared with previous methods, including LSTM + SMO [23], LSTM + PRO [24], LSTM + Social SO [25], LSTM + ALO [26], and LSTM + SSO [17]. The dataset was collected from [27]; it was originally downloaded from the UCI machine learning repository and contains screening data of 292 patients. In this project, supervised learning is used to diagnose ASD based on behavioral features and individual characteristics. The performance was calculated by varying the learning percentage over 60, 70, 80, and 90% for different performance metrics. The parameters used in the various algorithms are listed in Table 1.
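The evaluation metrics referred to throughout this section can all be computed from the confusion matrix; the sketch below shows one way to do so, with the learning-percentage split left as a placeholder rather than the authors' exact protocol.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Confusion-matrix metrics used in the comparisons (binary labels 0/1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sens = tp / (tp + fn)                      # sensitivity (recall)
    spec = tn / (tn + fp)                      # specificity
    prec = tp / (tp + fp)                      # precision
    npv = tn / (tn + fn)                       # negative predictive value
    acc = (tp + tn) / (tp + tn + fp + fn)      # accuracy
    f1 = 2 * prec * sens / (prec + sens)       # F1-score
    mcc = (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"accuracy": acc, "sensitivity": sens, "specificity": spec,
            "precision": prec, "NPV": npv, "F1": f1, "MCC": mcc,
            "FPR": fp / (fp + tn), "FNR": fn / (fn + tp)}

# Example: a 90% learning percentage would split the 292 records as below.
n = 292
n_train = int(0.9 * n)   # remaining records are used for testing
```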

7.2 Performance Analysis

The performance of the proposed LSTM + ACSSO model is compared with extant schemes, namely LSTM + SMO, LSTM + PRO, LSTM + Social SO, LSTM + ALO, and LSTM + SSO, with respect to certain metrics, and the results are represented in Figs. 3, 4 and 5. The adopted LSTM + ACSSO method achieves higher accuracy (~0.92) for a learning percentage of 60 than the other extant methodologies (LSTM + SMO, LSTM + PRO, LSTM + Social SO, LSTM + ALO, and LSTM + SSO), as shown in Fig. 3b. This proves that the developed model's accuracy is greater than that of the conventional models. Figure 3 also illustrates the favorable metrics precision, sensitivity, accuracy, and specificity. The created LSTM + ACSSO model likewise demonstrates improved performance over the older methods at a training percentage of 70 in Fig. 3c. Additionally, the specificity of the developed LSTM + ACSSO model for learning percentage 90 is 68.9, 40, 72.2, 46.7, and 20% superior to the existing schemes LSTM + SMO, LSTM + PRO, LSTM + Social SO, LSTM + ALO, and LSTM + SSO, respectively, as shown in Fig. 3d. In Fig. 3a, the created LSTM + ACSSO model achieves greater sensitivity (0.99) for a training percentage of 70. These results demonstrate the effect of training the LSTM classifier with appropriate features. Thus, the adopted model paves the way for better results in autism detection in young children with lower error.

The FPR and FNR of the achieved LSTM + ACSSO scheme versus the older methods LSTM + SMO, LSTM + PRO, LSTM + Social SO, LSTM + ALO, and LSTM + SSO are represented in Fig. 4. The execution demonstrates that the chosen work coincides with the goal: as the learning percentage increases, the error outcomes of the adopted work decrease. In Fig. 4b, the suggested LSTM + ACSSO model outperforms the traditional models with a lower FPR value (0.1) at learning percentage 60. The suggested LSTM + ACSSO model's minimum FNR value, shown in Fig. 4a, suggests that the method is less error-prone and more likely to produce accurate results. The suggested LSTM + ACSSO model's MCC, NPV, and F1-score in comparison with the existing schemes are shown in Fig. 5.

Fig. 3 Performance analysis of the developed approach to the traditional approaches for a sensitivity, b accuracy, c precision, and d specificity


Fig. 4 Performance analysis of the developed model to the traditional scheme for a FNR and b FPR

Fig. 5 Performance analysis of the proposed approach to the traditional schemes for a NPV, b F-measure, and c MCC


Further, the F1-score (~0.96) of the presented LSTM + ACSSO scheme at learning percentage 90 is higher than at learning percentage 70. Figure 5b shows an improvement over the conventional schemes LSTM + SMO, LSTM + PRO, LSTM + Social SO, LSTM + ALO, and LSTM + SSO. The MCC of the chosen LSTM + ACSSO model attains its highest value at learning percentage 80, as shown in Fig. 5c, whereas the compared older models achieve lower values. Similarly, in Fig. 5a, the selected LSTM + ACSSO model achieves a higher maximum NPV at learning percentage 70 than the previous methods. As a consequence, the presented LSTM + ACSSO model has demonstrated that it outperforms existing approaches for detecting autism in children.

7.3 Statistical Analysis

The statistical comparison of the suggested LSTM + ACSSO approach with the current methods, based on the accuracy measure, is shown in Table 2. In practice, metaheuristic procedures are stochastic, which means they must be run several times to determine whether the given aim has been reached. In the best-case scenario, the proposed LSTM + ACSSO model achieves 0.0427, a more accurate finding than the other standard models LSTM + SMO, LSTM + PRO, LSTM + Social SO, LSTM + ALO, and LSTM + SSO. The mean performance of the developed LSTM + ACSSO approach likewise holds better outcomes than the traditional schemes. The created LSTM + ACSSO model has almost always demonstrated its improvement. As a result, the recommended LSTM + ACSSO strategy has been successfully validated for the early identification of autism in children.

Table 2 Statistical analysis based on accuracy measure: developed versus extant approaches

Methods                 Best       Worst      Mean       Median     Standard deviation
LSTM + SMO [23]         0.487179   0.7        0.606935   0.620281   0.081666
LSTM + PRO [24]         0.393162   0.542373   0.483884   0.5        0.055161
LSTM + Social SO [25]   0.363636   0.433333   0.399264   0.400043   0.02569
LSTM + ALO [26]         0.239316   0.322034   0.280678   0.280682   0.032278
LSTM + SSO [17]         0.128205   0.266667   0.199305   0.201175   0.050085
LSTM + ACSSO            0.042735   0.102273   0.078342   0.084181   0.025111


Table 3 Analysis based on optimization

Metrics       Proposed model without data normalization   Proposed model without optimization   Proposed LSTM + ACSSO model
Accuracy      0.824742                                     0.845361                              0.933333
Sensitivity   1                                            1                                     1
Specificity   0.690909                                     0.727273                              0.904762
Precision     0.711864                                     0.736842                              0.818182
F-measure     0.831683                                     0.848485                              0.9
MCC           0.701308                                     0.732042                              0.860383
NPV           1                                            1                                     1
FPR           0.309091                                     0.272727                              0.095238
FNR           0                                            0                                     0

7.4 Analysis on Optimization

Table 3 illustrates the effect of the optimization in terms of specific metrics. The recommended LSTM + ACSSO model holds better accuracy (~0.933) than both the proposed model without data normalization and the proposed model without optimization. Furthermore, compared with the model without data normalization and the model without optimization, the provided LSTM + ACSSO model has a lower FPR and produces superior results. This shows that the suggested LSTM + ACSSO model helps to examine cases more precisely, whereas the other variants perform worse. This demonstrates that the chosen combination is suitable for detecting autism in young children.

7.5 Analysis on Classifiers

The analysis of the adopted work with different classifiers, in terms of certain metrics, is represented in Table 4. The developed LSTM + ACSSO model holds maximum sensitivity compared with the other classifiers CNN, SVM, RF, and NN. Likewise, the adopted LSTM + ACSSO scheme attains a higher MCC (~0.860) when compared with the other extant schemes. From Table 4, the adopted LSTM + ACSSO scheme shows a lower FPR with maximum outcomes when compared with the other extant approaches, including CNN, SVM, RF, and NN. Therefore, the betterment of the suggested LSTM + ACSSO method has been attained effectively.


Table 4 Analysis of proposed work with different classifiers

Metrics       CNN [28]   SVM [29]   RF [30]    NN [31]    Proposed LSTM + ACSSO model
Accuracy      0.824742   0.814433   0.835052   0.721649   0.933333
Sensitivity   1          0.97619    0.928571   1          1
Specificity   0.690909   0.690909   0.763636   0.509091   0.904762
Precision     0.706897   0.75       0.711864   0.608696   0.818182
F-measure     0.831683   0.82       0.829787   0.756757   0.9
MCC           0.701308   0.67414    0.687756   0.55667    0.860383
NPV           1          0.974359   0.933333   1          1
FPR           0.309091   0.309091   0.236364   0.490909   0.095238
FNR           0          0.02381    0.071429   0          0

7.6 Convergence Analysis

The convergence of the developed ACSSO framework relative to conventional techniques is examined by varying the iteration count over 0, 5, 10, 15, 20, and 25. Figure 6 displays the convergence study of the proposed scheme against the common schemes. In comparison with the other current models SMO, PRO, Social SO, ALO, and SSO, the cost function of the chosen ACSSO scheme settles at a lower constant value (1.032) from iteration 15 to iteration 25. The cost function of the ACSSO method is reduced as the number of iterations increases; in particular, the suggested ACSSO model's cost function fell between the 12th and 14th iterations. As a consequence, it can be seen that the constructed ACSSO model has achieved the minimal cost function.

Fig. 6 Convergence analysis of developed scheme and previous schemes


8 Conclusion

This paper has implemented a novel autism detection approach for young children that comprises three phases: (a) preprocessing, (b) feature extraction, and (c) detection. First, modified data normalization was used to process the input data. Statistical features and higher-order statistical features (kurtosis, skewness, and moment) were then extracted from the normalized data. Finally, the detection was carried out by an optimized LSTM classifier. To enhance the detection performance, the weights of the LSTM were optimally tuned by the ACSSO algorithm, so the classified output was obtained in an effective manner. Finally, the results of the proposed approach were compared with those of the existing approaches using several metrics, including F1-score, specificity, NPV, accuracy, FNR, sensitivity, precision, FPR, and MCC. From the graphs, the specificity of the developed LSTM + ACSSO model for learning percentage 90 was 68.9%, 40%, 72.2%, 46.7%, and 20% superior to the existing schemes LSTM + SMO, LSTM + PRO, LSTM + Social SO, LSTM + ALO, and LSTM + SSO, respectively. The suggested LSTM + ACSSO model's minimal FNR value demonstrated that, at an 80% learning percentage, the model was less error-prone and more likely to produce accurate results. Compared with the other existing schemes, the chosen LSTM + ACSSO model achieves maximum NPV at a training percentage of 70%. The created LSTM + ACSSO model achieves 0.0427 with accurate findings, which is an improvement over the standard LSTM + SMO, LSTM + PRO, LSTM + Social SO, LSTM + ALO, and LSTM + SSO models. To determine whether physiological patterns can be classified more accurately in one group than the other, and to explore the possibility of establishing a global classifier of emotional states, future work will evaluate the classification accuracy against that of typically developing children.

References 1. Rusli N, Sidek SN, Yusof HM, Ishak NI, Khalid M, Dzulkarnain AAA (2020) Implementation of wavelet analysis on thermal images for affective states recognition of children with autism spectrum disorder. IEEE Access 8:120818–120834. https://doi.org/10.1109/ACCESS.2020. 3006004 2. Nazmul H, Islam MN (2019) Exploring the design considerations for developing an interactive tabletop learning tool for children with autism spectrum disorder. In: International conference on computer networks, big data and IoT. Springer, Cham, pp 834–844 3. Shanthi S, Palanisamy P, Parveen S (2019) Autism spectrum disorder prediction using machine learning algorithms. In: International conference on computational vision and bio inspired computing. Springer, Cham, pp 496–503 4. Samy N, Fathalla R, Belal NA, Badawy O (2019) Classification of autism gene expression data using deep learning. In: International conference on intelligent data communication technologies and internet of things. Springer, Cham, pp 583–596 5. Fernandis JR (2021) ALOA: ant lion optimization algorithm-based deep learning for breast cancer classification. Multimedia Res 4(1)


6. Liu Y (2020) Hybrid shark smell optimization based on world cup optimization algorithm for minimization of THD. J Computat Mech Power Syst Control 3(3) 7. Rajeyyagari S (2020) Automatic speaker diarization using deep LSTM in audio lecturing of e-Khool platform. J Netw Commun Syst 3(4) 8. Sarabadani S, Schudlo LC, Samadani AA, Kushski A (2020) Physiological detection of affective states in children with autism spectrum disorder. IEEE Trans Affect Comput 11(4):588–600. https://doi.org/10.1109/TAFFC.2018.2820049 9. Yang X, Shyu ML, Yu HQ, Sun SM, Yin NS, Chen W (2019) Integrating image and textual information in human-robot interactions for children with autism spectrum disorder. IEEE Trans Multimedia 21(3):746–759. https://doi.org/10.1109/TMM.2018.2865828 10. Eni M, Dinstein I, Ilan M, Menashe I, Meiri G, Zigel Y (2020) Estimating autism severity in young children from speech signals using a deep neural network. IEEE Access 8:139489– 139500. https://doi.org/10.1109/ACCESS.2020.3012532 11. Mazumdar P, Arru G, Battisti F (2021) Early detection of children with autism spectrum disorder based on visual exploration of images. Signal Process Image Commun 94 (Cover date: May 2021)Article 116184 12. https://en.wikipedia.org/wiki/Statistic 13. https://en.wikipedia.org/wiki/Standard_deviation 14. https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm#:~:text=Skewness% 20is%20a%20measure%20of,relative%20to%20a%20normal%20distribution 15. https://en.wikipedia.org/wiki/Central_moment#:~:text=In%20probability%20theory% 20and%20statistics,random%20variable%20from%20the%20mean 16. Zhou X, Lin J, Zhang Z, Shao Z, Liu H (2019) Improved itracker combined with bidirectional long short-term memory for 3D gaze estimation using appearance cues. Neuro computing In press, corrected proof, Available online 17. Mohammad-Azari S, Bozorg-Haddad O, Chu X, Shark smell optimization (SSO) algorithm. In: Bozorg-Haddad O (eds) Advanced optimization by nature-inspired algorithms. Studies in computational intelligence, vol 720. Springer, Singapore. https://doi.org/10.1007/978-981-105221-7_10 18. Rajakumar BR (2013) Impact of static and adaptive mutation techniques on genetic algorithm. Int J Hybrid Intell Syst 10(1):11–22. https://doi.org/10.3233/HIS-120161 19. Rajakumar BR (2013) Static and adaptive mutation techniques for genetic algorithm: a systematic comparative analysis. Int J Computat Sci Eng 8(2):180–193. https://doi.org/10.1504/ IJCSE.2013.053087 20. Swamy SM, Rajakumar BR, Valarmathi IR (2013) Design of hybrid wind and photovoltaic power system using opposition-based genetic algorithm with Cauchy mutation. In: IET Chennai fourth international conference on sustainable energy and intelligent systems (SEISCON 2013), Chennai, India. https://doi.org/10.1049/ic.2013.0361 21. George A, Rajakumar BR (2013) APOGA: an adaptive population pool size based genetic algorithm. In: AASRI Procedia—2013 AASRI conference on intelligent systems and control (ISC 2013), vol 4, pp 288–296. https://doi.org/10.1016/j.aasri.2013.10.043 22. Rajakumar BR, George A (2012) A new adaptive mutation technique for genetic algorithm. In: Proceedings of IEEE international conference on computational intelligence and computing research (ICCIC), pp 1–7, 18–20 Dec 2012, Coimbatore, India. https://doi.org/10.1109/ICCIC. 2012.6510293 23. Harish S, Garima H, Jagdish B (2019). Spider monkey optimization algorithm. https://doi.org/ 10.1007/978-3-319-91341-4_4 24. 
Moosavi SHS, Bardsiri VK (2019) Poor and rich optimization algorithm: a new human-based and multi populations algorithm. Eng Appl Artif Intell 86 (Cover date: November 2019):165– 181 25. Ahmed F (2015) Social spider optimization algorithm. https://doi.org/10.13140/RG.2.1.4314. 5361 26. Modestus O, Lagouge T (2020). Ant lion optimization algorithm. https://doi.org/10.1007/9783-030-61111-8_9


27. https://github.com/saadhaxxan/Autism-spectrum-disorder-Detection-using-Deep-Learning/ blob/master/Autism-Child-Data.txt 28. LeCun Y, Kavukcuoglu K, Farabet C (2010) Convolutional networks and applications in vision. In: International symposium on circuits and systems, pp 253–256 29. Yuan J, Holtz C, Smith T, Luo J (2016) Autism spectrum disorder detection from semistructured and unstructured medical data. EURASIP J Bioinform Syst Biol 3 30. Masetic Z, Subasi A (2016) Congestive heart failure detection using random forest classifier. Comput Methods Program Biomed 130:54–64 31. Mohan Y, Chee SS, Xin DKP, Foong LP (2016) Artificial neural network for classification of depressive and normal in EEG. In: 2016 IEEE EMBS conference on biomedical engineering and sciences (IECBES)

Chapter 53

A Comparative Review Analysis of OpenFlow and P4 Protocols Based on Software Defined Networks Lincoln S. Peter, Hlabi Kobo, and Viranjay M. Srivastava

1 Introduction

A Software Defined Network intelligently manages and centrally controls all network activities through a software application. This is so efficient that the performance and monitoring of the network are improved compared with a traditional network. The idea of the Software Defined Network (SDN) reignited the interest of researchers in the field of programmable networks. The SDN is redefining the whole network system design and its management. It has two main features: firstly, an SDN separates the control and data plane; secondly, the control plane oversees the whole operation of the data plane [1]. To elaborate on these characteristics, the separation of the data and control planes ensures central control of the SDN network. The control plane manages all the decision-making processes and communicates the decisions to the data (infrastructure) plane via an interface called the application programming interface (API). The data plane simply executes the decisions communicated by the controller and forwards packets to their destinations as instructed [2]. In traditional networks, the tight coupling of the control and data planes created challenges in the management and evolution of the network [3]. To introduce new policies on the network, a network operator must manually log in to each device on the network. Imagine if the network operator had more than 200 network devices and had to log in to each one individually and change the policies; that is a daunting task. If new functionality or new protocols needed to be added to the network, this required new infrastructure, which then slowed down the process of evolution. There are also interoperability issues between vendors when it comes to implementation, testing, and operation. Integrating all of this and making sure that everything works was very complex, challenging, and prone to errors. This is where the idea of programmable networks, in the form of SDN, came in to remedy the situation. The SDN promotes innovation in managing the network and deploying services through programmability. This leads to flexible networks, where a network operates according to the user's requirements. The SDN is embraced in the academic and industrial space. It is already being applied in data centers, wide area networks (WANs), and mobile core networks. All these applications use the southbound interface called the OpenFlow protocol, which resides between the data and controller planes. However, the adoption of SDN has slowed down recently due to the obsolescence of the OpenFlow protocol, whose development organically stalled because it was not programmable enough. This meant that OpenFlow-compliant routers and switches could only be used as per the initially programmed customization and could not be reprogrammed with new headers. This led to the development of the Programmable Protocol-Independent Packet Processor (P4). P4 uses a different switch architecture, whose flexibility enables programmability and re-programmability. This paper compares these two protocols, namely OpenFlow and P4. A comparative study between OpenFlow and P4 has been done before, which mainly focused on the central processing units (CPUs) of both protocols. This paper extends this to include the switch architecture and the packet flows. The paper is organized as follows. Section 2 gives the background of the work. Section 3 explains the protocols that enabled SDN. Sections 4 and 5 give overviews of OpenFlow and the Programming Protocol-Independent Packet Processor (P4), respectively. Section 6 covers the switch architectures. Finally, Sect. 7 concludes the work and recommends future aspects.

2 Background of the Work The increase in Internet usage and expansion of communication networks led researchers to develop and experiment with new ideas for network services. In working through these ideas, the researchers identified some of the obstacles like managing complex network infrastructure, network devices that support specific protocols and interoperability between different vendors. This severely hindered the progress toward programmable networks. Some efforts were made to remedy this situation, like the proposal to separate the management and decision-making from


the network devices and provide management and control on a separate open interface [4]. These early initiatives were led by the open signaling working group and active networking [5]. This section looks at some of the efforts and interventions that led to today’s popularly known as Software Defined Network.

2.1 Open Signaling The open signaling working group started testing their idea of programmable networks on the asynchronous transfer mode (ATM) networks. The main aim of this effort was to separate the control and data plane. Then, signaling would be used through the open interface between these two planes. The idea was to be able to program and control the ATM switches remotely. At that level, the network operator can tune and deploy new services on the network. This idea of open signaling interface was further developed and led to the discovery of the tempest framework [6], which allowed multiple switch controllers. This approach assisted network operators in a way that they were not forced to define a unified control. Another project that was running parallel to the tempest was called Devolved Control of ATM Networks (DCAN) [6]. This project was meant to lay the foundation for developing the infrastructure necessary for control of the ATM networks. The objective was to remove all management functions and control on the ATM network devices and place this in the dedicated external workstation. This project was finished around the middle of 1998.

2.2 Active Networking Around the mid-90s, Defense Advanced Research Projects Agency (DARPA) supported the idea of the active networking project [7, 8]. Its objective was to develop programmable and flexible networks that would lead to an environment conducive to network innovation. Network engineers or operators had full control and could make changes as they wished on the active networking. This is contrary to open signaling, as this one embraces faster implementation and dynamism in the configuration of the network. The system design of active networking has three layers. The bottom layer contains the operating system. The second layer executes and writes active networking applications, including ANTS [9] and PLAN [10]. The last one contains source code that is self-developed by operators themselves. Active networking has two models [10, 11], namely capsule and programmable router/switch model. The capsule model executes the code within the data packets, whereas the switch model or programmable router contains a code to be executed on the network devices. The capsule model became popular as it was seen as the most innovative and close to active networking [10]. The main reason for the choice of


the capsule model was that it offered a more radical way in terms of managing the network and had a straightforward methodology for deploying new functionality in the data plane throughout the network.

2.3 4D Project

This project started around 2004 [12–14] with a proposal to separate the protocols governing the communication between network devices from the routing decisions. It proposed that the control plane maintain a global view of the network, assisted by discovery and dissemination planes, in order to control the data plane that forwards the traffic. These ideas later inspired NOX [15], a network operating system aimed specifically at OpenFlow-enabled networks.

2.4 NETCONF

In early 2006, the Internet Engineering Task Force (IETF) network configuration working group came up with NETCONF [16] as a management protocol for making changes to network device configurations. The protocol exposed an application programming interface (API) through which configuration information could be transmitted and received. An earlier management protocol, the simple network management protocol (SNMP) [16], was proposed in the late 1980s and remains popular. SNMP uses the structure of management information (SMI) to retrieve management information base (MIB) data. It was later found that SNMP did not achieve what it was intended to do, namely configure network devices, and instead was used mostly for performance and fault monitoring. When the IETF proposed NETCONF, it was widely expected to remedy the shortcomings experienced with SNMP. NETCONF managed to simplify device (re)configuration and laid a foundation for management. However, NETCONF did not call for separation of the data and control planes, and the same can be said for SNMP. NETCONF was not primarily designed to assist with direct control or to enable quick deployment of network services and applications, but rather for automated configuration.

2.5 Ethane

The Ethane project is regarded as the predecessor of OpenFlow [17]. In 2006, it proposed a new system design for enterprise networks. Its main idea was to manage security and policy using a centralized controller on the network, providing an identity-based control system. Ethane is characterized by two components: a controller and the Ethane switch. The controller decides how to forward packets, while the Ethane switch is made up of flow tables and a secured communication channel. This is a similar approach to SDN in the context of the OpenFlow protocol; Ethane is called the predecessor of OpenFlow because it laid the foundation that led to SDN.

3 Protocols that Enabled the Software Defined Network

Two specific SDN architectures, namely forwarding and control element separation (ForCES) and OpenFlow, abide by the standing principle of SDN, that of separating the data and control planes. Although they are both SDN protocols, they differ in design, forwarding rules, and interface protocol.

3.1 The Forwarding and Control Element Separation (ForCES)

The IETF working group's approach proposed separating the forwarding and control elements within the internal architecture of the network device. In this case, the network device remains a single entity [18], which means that both the control and data planes are kept in one box. ForCES is composed of two logical entities, namely the control element (CE) and the forwarding element (FE), which use the ForCES protocol to communicate [18]. The forwarding element contains the hardware for handling packets, whereas the control element deals with signaling and control. The FE operates at the data layer and is therefore responsible for the underlying hardware. The CE uses the ForCES protocol to instruct the FEs on all matters of packet processing. The FEs are built from logical functional blocks (LFBs), as shown in Fig. 1. The protocol works in a master-and-slave manner: the CEs are the masters and the FEs are the slaves [18]. The LFB is the foundation of the ForCES system design. LFBs reside on the FEs and are interconnected logical functional blocks that process packets within the FE architecture. All of this is managed and controlled by the CEs through the ForCES protocol, which acts as an enabler for the CEs to configure how the LFBs should handle packets.

3.2 OpenFlow

OpenFlow, like ForCES, decouples the data forwarding and control planes. The forwarding devices are OpenFlow-enabled, or OpenFlow-compliant, switches. These switches contain flow tables, an abstraction layer, and a secure communication channel between the planes. The abstraction layer communicates


Fig. 1 ForCES architecture

to the controller through the OpenFlow protocol. Inside the flow tables, there are flow entries that determine the forwarding of packets. Each flow entry consists of three components: match fields, counters, and actions (a set of instructions) [18]. Upon the arrival of a packet at the OpenFlow switch, its header fields are extracted and matched against the match fields of the flow entries. If the packet matches one of the entries, the switch executes the corresponding actions. If no entry is found, the packet is handled by the table-miss entry, which either drops the packet, continues the matching in the next flow table, or forwards the packet to the controller through the OpenFlow protocol. The controller then installs rules into the switch tables telling it how to handle packets of that nature in the future. The exchange of information between the data and controller planes is managed through the OpenFlow protocol, which defines a set of messages that can be shared between the control and data planes over a secured communication channel.
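To illustrate the match-field/action/table-miss behavior described above, here is a small Python model of an OpenFlow-style flow table; the field names and actions are simplified placeholders rather than the actual OpenFlow message formats.

```python
# Toy model of OpenFlow-style flow-table matching with a table-miss entry.
flow_table = [
    # (match fields, action); a None value in a match field means "wildcard".
    ({"ip_dst": "10.0.0.2", "tcp_dst": 80}, ("output", 2)),
    ({"ip_dst": "10.0.0.3", "tcp_dst": None}, ("output", 3)),
]
TABLE_MISS = ("controller", None)   # send unmatched packets to the controller

def lookup(packet):
    """Return the action of the first flow entry whose match fields fit the packet."""
    for match, action in flow_table:
        if all(v is None or packet.get(k) == v for k, v in match.items()):
            return action
    return TABLE_MISS

print(lookup({"ip_dst": "10.0.0.2", "tcp_dst": 80}))   # ('output', 2)
print(lookup({"ip_dst": "10.0.0.9", "tcp_dst": 22}))   # table miss: sent to the controller
```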

3.3 Comparison Between OpenFlow and ForCES

The similarities and differences between OpenFlow and ForCES are discussed in ref. [19]. One of the highlighted differences between the two is the forwarding model: ForCES uses logical functional blocks, whereas OpenFlow uses flow tables. The combination of actions in OpenFlow can be used to assist with development, administration, better network management, and control, and the same can be said for ForCES by combining logical functional blocks. It has been noted that although ForCES does not abide by the principles of the SDN model in the way OpenFlow does, it can still deploy the same functions [19]. The SDN and OpenFlow proposals gained very strong support from academia, industry, and the research community, producing significant deliverables in terms of white papers, reference software, and hardware implementations. This led the Open Networking Foundation (ONF) [15] to argue that OpenFlow's SDN architecture is the current de facto SDN standard [18]. An important thing to note is that both OpenFlow and ForCES switches are fixed-function.


4 Overview of OpenFlow

The SDN has a standard protocol that is used in real applications, the most popular being the OpenFlow protocol [21–24]. The OpenFlow protocol has made it very simple to implement the SDN on both hardware and software platforms. The OpenFlow protocol relies on switches and ports to manage the flow tables. The controller in the SDN manages a group of switches, which are layer-2 devices, through the OpenFlow protocol. An OpenFlow switch contains several flow tables, a group table, and an OpenFlow channel. Every flow table has its own entries, which are communicated with the controller [21]. In the beginning, the routing table of the routing device is empty. The routing table inside an OpenFlow routing device has packet fields, such as a destination, as well as an action field that encodes every step that must be taken by the routing device. The routing table is populated as packets arrive and are forwarded to the destination port. A new packet with no matching conditions in the flow table is passed to the controller to handle. The controller is tasked with making the decision: either the packet is discarded, or a new policy is added on how to manage such packets in the future. In the SDN architecture, the routing table is generated on the control plane, and the data plane uses that routing table to see where each packet should go. The OpenFlow protocol is straightforward, and some data centers have used it in their operations; it is easy to manage large networks using the OpenFlow protocol and SDN. The OpenFlow protocol architecture has three important components [22]:
• The OpenFlow switch has the authority to change the flow tables on another layer-2 device. The OpenFlow switch contains flow tables, a communication channel, and the OpenFlow protocol. The flow tables have an action field that relates to each flow entry. The communication channel connects the OpenFlow switch and the controller so that packets can be transmitted, and the OpenFlow protocol enables the communication between the OpenFlow switches and the controller.
• Controllers continuously update the flow entries in the flow tables and add or delete routes. The controller must be configured to update the routes dynamically, which helps to ease the flow of data.
• Flow entries: each entry has an action to be undertaken. OpenFlow switches send matching flows to a certain port; otherwise, the packet is encapsulated and passed to the controller, or dropped, if there is no matching entry.
The OpenFlow protocol has evolved through versions 1.0–1.5 [8]. It started with only 12 fixed match fields and a single flow table and now features multiple tables with around 50 match fields. Reconfigurable OpenFlow switches were initially very slow, but they became faster due to improvements in micro-electromechanical systems. Toward the end of the twentieth century, funding agencies and researchers developed a desire to experiment with networks. This was motivated mainly by the quest to introduce new protocols and services to improve the Quality of Service (QoS) and performance of larger-scale enterprise networks. The success of experimentation on infrastructure like PlanetLab gave much support to the idea [20]. Researchers mostly used simulation tools for evaluation, which at times could not give results as representative as a real testbed would. This speaks to the need for infrastructure-based programmability that would simplify management of the network and services and would also assist in running experiments concurrently, using different kinds of forwarding rules. The researchers at Stanford were motivated by these developments and proposed the OpenFlow protocol to run experimental protocols in an everyday networking environment [4]. The OpenFlow protocol followed the same approach as previous interventions such as ForCES by separating the forwarding and control planes, with the interaction between these planes facilitated through a secured communication channel. The solution that OpenFlow provided laid the foundation for the architectural support of programmable networks and paved the way for SDN.

5 Overview of the Programming Protocol-Independent Packet Processor (P4)

The Programming Protocol-Independent Packet Processor (P4) language is a programming language for packet processing that extends the SDN concept [25]. Initially, SDN could only configure switches built on a fixed switch chip that supported a limited set of protocols [26]. A fixed switch chip is a switch that can support only up to a certain number of flow entry tables. This limitation in table capability led to the development of the Protocol-Independent Switch Architecture (PISA), which supports high-speed packet processing [27]. PISA supports the P4 language as a special case of the reconfigurable match table (RMT) architecture. A P4 switch needs to be programmed before it can understand any protocol; hence, it is protocol-independent. There are two versions of the P4 language, namely P414 and P416 [26]. Figure 2 shows the P416 packet processing pipeline. It has six programmable blocks: ingress parser, ingress match-action, ingress deparser, egress parser, egress match-action, and egress deparser. It also has two non-programmable blocks, the buffer queuing engine (BQE) and the packet replication engine (PRE).

Fig. 2 P4 packet processing pipeline


P4 pipeline: The ingress parser receives the packet and transforms it from a binary representation into headers [28]. After that, the ingress match-action stage decides how to process the packet. The ingress deparser then queues the packet for further processing, and the packet is passed to the egress match-action stage. Finally, the egress deparser specifies how the headers are reassembled back into a binary representation as the packet departs. A P4 program is written in P416; once written, it is sent to the P4 compiler and deployed onto the P4 switch, and the match-action rules are populated in the matching tables inside the P4 switch. A notable point about the P4 protocol is that it can be used on both SDN and non-SDN switches, whereas OpenFlow can only be used on SDN switches.
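The following Python sketch mimics the parse, match-action, and deparse stages of a PISA-style pipeline on a toy Ethernet header; it is only a conceptual model of the flow described above, not P4 code or an actual switch implementation.

```python
# Conceptual model of a PISA-style pipeline stage sequence for a toy Ethernet frame.
def parse(frame: bytes) -> dict:
    """Ingress parser: turn the binary representation into header fields."""
    return {"dst": frame[0:6], "src": frame[6:12], "ethertype": frame[12:14],
            "payload": frame[14:]}

def match_action(headers: dict, table: dict) -> dict:
    """Match-action stage: look up the destination MAC and apply the action."""
    action = table.get(headers["dst"], ("drop", None))
    headers["egress_port"] = action[1] if action[0] == "forward" else None
    return headers

def deparse(headers: dict) -> bytes:
    """Deparser: reassemble the headers back into a binary representation."""
    return headers["dst"] + headers["src"] + headers["ethertype"] + headers["payload"]

# Example: a one-entry forwarding table keyed on destination MAC.
table = {bytes.fromhex("aabbccddeeff"): ("forward", 1)}
frame = bytes.fromhex("aabbccddeeff112233445566" + "0800") + b"payload"
out = deparse(match_action(parse(frame), table))
```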

5.1 Comparison Between OpenFlow and P4 Protocols

Historically, ForCES did not gain enough momentum, hence the compromise of OpenFlow switches, which are characterized by a fixed chip. The current OpenFlow standard relies on centralized control, meaning that a single controller handles the flow tables for all the switches. For a very small-scale network this is workable, but when the network grows, for example by adding more switches and routers, it becomes difficult for a single controller to manage [29, 30]. Especially if wireless media are also deployed to connect distant sites, this is a threat to the network, as it would result in a single point of failure. Another challenge with OpenFlow is that when the network expands, it becomes flooded and begins to discard packets because it cannot cope with the high volume of data processing; this is due to the memory limitations of OpenFlow switches. OpenFlow works on fixed-chip switches and is strictly for SDN; it cannot work with any switch that is non-SDN. The introduction of P4 has resolved these challenges of OpenFlow, including the speed. In the case of P4, the most important feature of the SDN is the programmability of the data plane. Network administrators can configure how the packets are processed inside the pipelines of the hardware switch, and this can only be achieved using a P4 switch. P4 allows the packets to be processed through the pipelines of the data plane. Packet headers are processed through match + action tables, and the P4 program describes how these packet headers are parsed. P4 switch programmers can control the packet headers, which are operated on as user-defined headers. P4 switches cannot manipulate the payload, only the packet headers; the payload is treated as different headers, which assists the P4 switch in aggregating and disaggregating the payload inside the pipelines. The P4 protocol can process data more quickly than OpenFlow, does not operate on a fixed-chip switch, and is compatible even with non-SDN switches.


Fig. 3 OpenFlow switch architecture

6 Switch Architecture

6.1 The Switch Architecture of OpenFlow

Figure 3 depicts the OpenFlow switch architecture. The operation of the OpenFlow switch has been explained in Sect. 3.2; here, we only illustrate the actual design of the switch. The OpenFlow switch is built on a fixed switch chip that supports, in silicon, the different protocols defined by the IEEE and IETF. The switch contains a fixed set of match tables, and inside the silicon each table has its match fields and actions.

6.2 The Switch Architecture of P4

The Protocol-Independent Switch Architecture (PISA) shown in Fig. 4 is a programmable, non-fixed switch model. In this case, the switch contains no fixed set of match tables, which enables faster introduction of protocols and network functionality. Early PISA microchips were very slow, which is why the fixed switch chip (OpenFlow) became prevalent in that period; after improvements in the microchips, however, they became faster and were considered for deployment. This switch comprises a parser, ingress/egress pipelines, and a deparser. When a packet arrives at the parser, the headers are extracted and sent to the pipelines for processing. The ingress and egress pipelines are the main engines that contain the packet processing units, which pass the packet through the match-action tables. The headers are matched against the match-action tables based on the set of rules from the control plane, and the packet is processed according to the corresponding action. A runtime interface assists the control plane in pushing the forwarding rules to the data plane of the switch. Once the process is complete, the headers are regrouped to reconstruct the packet at the deparser.


Fig. 4 PISA switch architecture

7 Conclusion and Future Recommendation

In this work, the authors have provided a comparative analysis of the OpenFlow and P4 protocols and have looked at the historical background of programmable networks and the lead-up to the Software Defined Network. The protocols that laid a solid foundation for SDN were also covered in detail. It is worth noting that ForCES was the first protocol to be tested for SDN but did not get enough support due to the breakthroughs of OpenFlow. The OpenFlow protocol started with about 10 match headers and grew as it went through different versions; today, OpenFlow has about 50 headers but still faces shortcomings as it targets a fixed switch. This work has also provided an overview of OpenFlow and P4 and drawn a comparison between the two protocols. The P4 protocol has proven to be the better of the two, as it allows programmability of the data plane. In conclusion, it is important to highlight that more research needs to be done on the flexibility and programmability of the data plane for network evolution.

References 1. Zhang X, Cui L, Wei K, Tso FP, Ji Y, Jia W (2021) A survey on stateful data plane in software defined networks. Comput Netw 184. https://doi.org/10.1016/j.comnet.2020.107597 2. Sezer S et al. (2013) Are we ready for SDN? Implementation challenges for software-defined networks. IEEE Commun Mag 51(7):36–43. https://doi.org/10.1109/MCOM.2013.6553676 3. Duan Q, Ansari N, Toy M (2016) Software-defined network virtualization: an architectural framework for integrating SDN and NFV for service provisioning in future networks. IEEE Network 30(5):10–16. https://doi.org/10.1109/MNET.2016.7579021 4. Foukas X, Marina MK, Kon-vasilis K (2015) Software-defined networking concepts, software defined mobile networks (SDMN): concepts and challenges. Wiley Telecom, pp 21–44. https:// doi.org/10.1002/9781118900253.ch3 5. Benabbou J, Elbaamrani K, Idboufker N (2019) Security in OpenFlow-based SDN, opportunities and challenges. Photon Netw Commun 37(1):1–23. https://doi.org/10.1007/s11107-0180803-7 6. van der Merwe JE, Rooney S, Leslie I, Crosby S (1998) The tempest—a practical framework for network programmability. IEEE Network 12(3):20–28


7. Qadir J, Ahmed N, Ahad N (2014) Building programmable wireless networks : an architectural survey. EURASIP J Wirel Commu Netw 8. Feamster N, Rexford J, Zegura E (2014) The road to SDN: an ıntellectual history of programmable networks. ACM SIGCOMM Comput Commun Rev 44(2):87–98 9. Wetherall DJ, Guttag JV, Tennenhouse DL (1998) ANTS : a toolkit for building and dynamically deploying network protocols. In: IEEE open architectures and network programming, San Francisco, CA, USA, pp 117–129 10. Hicks M, Kakkar P, Moore JT, Gunter CA, Nettles S (1999) PLAN: a packet language for active networks. ACM SIGPLAN Not 34(1):86–93 11. Nunes BAA, Mendonca M, Nguyen X, Obraczka K, Turletti T (2014) A survey of softwaredefined networking: past, present, and future of programmable networks. IEEE Commun Surv Tutorials 16(3):1617–1634 12. Rexford J, Greenberg A, Hjalmtysson G, Maltz D, Myers A, Xie G, Zhan J, Zhang H (2004) Network-wide decision making : toward a wafer-thin control plane. In: Proceedings of HotNets III 13. Greenberg A, Hjalmtysson G, Maltz DA, Myers A, Rexford J, Xie G, Yan H, Zhan J, Zhang H (2005) A clean slate 4D approach to network control and management. ACM SIGCOMM Comput Commun Rev 35(5):41–54 14. Caesar M, Caldwell D, Feamster N, Rexford J, Shaikh A, van der Merwe J (2005) Design and ımplementation of a routing control platform. In: 2nd Symposium on networked systems design & implementation, pp 15–28 15. Gude N, Koponen T, Pettit J, Pfaff B, Casado M, McKeown N, Shenker S (2008) NOX: towards an operating system for networks. ACM SIGCOMM Comput Commun Rev 38(3):105–110 16. Wallin S, Wikstrom C (2011) Automating network and service configuration using NETCONF and YANG. In: 25th International conference on large installation system administration, pp 1–22 17. Casado M, Freedman MJ, Pettit J, Luo J, McKeown N, Shenker S (2007) Ethane: taking control of the enterprise. ACM SIGCOMM Comput Commun Rev 37(4):1–12 18. Nunes BAA, Mendonca M, Nguyen XN, Obraczka K, Turletti T (2014) A survey of softwaredefined networking: past, present, and future of programmable networks. IEEE Commun Surv Tutorials 16(3):1617–1634. https://doi.org/10.1109/SURV.2014.012214.00180 19. Kaljic E, Maric A (2019) A survey on data plane flexibility and programmability in softwaredefined networking. IEEE Access 7:47804–47840. https://doi.org/10.1109/ACCESS.2019.291 0140 20. Chun B et al (2003) PlanetLab: an overlay testbed for broad-coverage services. Comput Commun Rev 33(3):3–12. https://doi.org/10.1145/956993.956995 21. Hu F, Hao Q, Bao K (2014) A survey on software-defined network and OpenFlow: from concept to implementation. IEEE Commun Surv Tutorials 16(4): 2181–2206. Institute of Electrical and Electronics Engineers Inc., https://doi.org/10.1109/COMST.2014.2326417 22. Sherwood R, Gibb G, Kobayashi M (2010) Carving research slices out of your production networks with open flow. Comput Commun Rev 40(1):129–130. https://doi.org/10.1145/167 2308.1672333 23. Dely P, Kassler A, Bayer N (2011) OpenFlow for wireless mesh networks. In: International conference on computer communications and networks. https://doi.org/10.1109/ICCCN.2011. 6006100 24. Tsai PW, Tsai CW, Hsu CW, Yang ChuS (2018) network monitoring in software-defined networking: a review. IEEE Syst J 12(4):3958–3969 25. Nobach L, Rimac I, Hilt V, Hausheer D (2016) SliM: enabling efficient, seamless NFV state migration. In: International conference on network protocols, vol 2016, no. 3, pp 87–95. https:// doi.org/10.1109/ICNP.2016.7784459 26. 
Seeber S, Stiemert L, Rodosek GD (2015) Towards an SDN-enabled IDS environment. In: IEEE conference on communications and network security, pp 751–752. https://doi.org/10. 1109/CNS.2015.7346918


27. Bosshart P et al (2013) Forwarding metamorphosis: fast programmable match-action processing in hardware for SDN. Comput Commun Rev 43(4):99–110. https://doi.org/10.1145/ 2534169.2486011 28. Hu F, Hao Q, Bao K (2014) A survey on software-defined network and OpenFlow: from concept to implementation. IEEE Commun Surv Tutorials 16(4):2181–2206. https://doi.org/10.1109/ COMST.2014.2326417 29. Kobo HI, Abu-Mahfouz AM, Hancke GP (2017) A distributed control system for softwaredefined wireless sensor networks through containerisation. In: International multidisciplinary information technology and engineering conference, vol 5, pp 1872–1899. https://doi.org/10. 1109/ACCESS.2017.2666200 30. Kobo HI, Abu-Mahfouz AM, Hancke GP (2019) Fragmentation-based distributed control system for software-defined wireless sensor networks. IEEE Trans Industr Inf 15(2):901–910. https://doi.org/10.1109/TII.2018.2821129

Chapter 54

NLP-Driven Political Analysis of Subreddits Kuldeep Singh and Sai Venkata Naga Saketh Anne

1 Introduction Politics has always been an emotive subject with very passionate disagreements between separate parties. The heated political climate since the 2016 US presidential election, and the power to anonymously express opinions on online message boards such as Reddit, has made the language behind political thought much more accessible and very interesting to explore. Using embedding-based sentiment analysis and feature analysis of a classification model, we will be examining ways of extracting information from the language used in political subreddits.

2 Inspiration and Related Work

The inspiration for our work is based on a paper that explores various applications of sentiment analysis [1]. In particular, the paper covers applications of opinion mining and sentiment analysis in intelligence gathering and how they can be used to better understand the subjective judgments of people on various topics. In one of the examples, the authors illustrate how laptop manufacturers can combine surveys, reviews, and complaints with machine learning to better answer business questions (e.g., why customers have not been buying the company's products, or what customers have been complaining about). This type of application can save companies hundreds of hours of reading through collections of surveys, reviews, and complaints to answer their business questions.

K. Singh (B) Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India
e-mail: [email protected]
S. V. N. S. Anne Kakatiya Institute of Technology and Science, Warangal, Telangana, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
I. J. Jacob et al. (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_54


Following this example, we are putting the application into practice and using various machine learning techniques to gain an understanding of a body of text. In our case, we are trying to gain an understanding of the differences between the sentiments and language used by two political camps (liberals and conservatives).

3 Data Collection We collected our data from Reddit, which is organized into various forums, called subreddits, that are dedicated to specific topics. We chose Reddit because the anonymity of the platform may encourage people to be more emotive with their language choice about certain topics. Although trolling is an undesired outcome of anonymity, all the subreddits that we chose have anti-trolling rules so we are relying on the admins to remove these posts and comments. We pulled comments and post data from several political subreddits that are associated with liberals and conservatives. The timeframe that we included lies between 2015 and 2019. For the liberal camp, we chose r/democrats, r/Liberal, and r/progressive. For the conservative camp, we chose r/Republican and r/Conservative. We chose the aforementioned subreddits since we needed to know the political affiliation associated with the subreddits in order to build the politically biased word embeddings and to train the liberal/conservative classifier model. For this reason, we did not draw data from the massively popular r/politics because it does not label the political affiliation of each post and comment [2]. The relevant data points that we extracted from Reddit are: (1) The post’s subreddit, (2) the UTC timestamp of when the post was created, (3) the post title, (4) selftext, which contains the body of text associated with the post. (5) The UTC timestamp of when the comment was written (6) comments associated with the post.

4 EDA

All of the subreddits we used were created in 2008, which was the first election year since Reddit was created. Even though each of the subreddits had the same amount of time to mature, as shown in Fig. 1, r/Conservative has the same number of members as the top three liberal subreddits combined. In total, the conservative subreddits have 1.4 times the number of members of the liberal subreddits (34.2% difference). What we found to be very interesting is that the conservative subreddits have 1.9 times the number of posts (64.1% difference) and up to 5 times the number of comments (133.3% difference) compared to the liberal subreddits. Considering the number of members in each group, the conservative subreddits have a disproportionately larger amount of interaction compared to the liberal subreddits. Perhaps conservatives are more attracted to a forum-based medium like Reddit to express their ideals compared to their liberal counterparts. It is also possible that the


Fig. 1 Bar charts representing the number of members in each subreddit, the number of comments extracted from each subreddit, and the number of posts extracted from each subreddit

majority of liberal Reddit users chose r/politics which has 5.3 million members as opposed to the smaller liberal-specific subreddits. According to Fig. 2, we see that the word “trump” is the most common word used in both corpora which is not much of a surprise. What is very interesting, however, is the fact that most of the other words imply some form of “us versus them” context. From this basic word count, we can almost assume that most of the comments and posts in each of our subreddits are about comparing liberals to conservatives, right to left, democrats to republicans, and vice versa.

5 Biased Embedding Sentiment Models

5.1 Method

When it comes to natural language processing, word embedding is one of the most frequently used and important tools. Word embeddings are essentially vector representations of words with the advantage of preserving some form of the context of each word in their respective use cases. Words with similar context would be grouped closer together and the angle between their respective vectors would be close to 0 (the cosine of the angle would be close to 1). Despite being such a useful tool, word embeddings are by no means perfect. The context of each word held within their embedding vectors will inherit many of the biases that exist in the corpus that the embedding is trained from [3]. For example, in the Word2Vec embedding trained from an extremely large corpus of Google News articles, we can observe the proper relationship that:

$\overrightarrow{\mathrm{man}} - \overrightarrow{\mathrm{woman}} \approx \overrightarrow{\mathrm{king}} - \overrightarrow{\mathrm{queen}}$   (1)


Fig. 2 Table showing the list of top words used by liberals and conservatives (out of a random sample of 10,000 sentences and excluding stop words)

At the same time, we will also see:

$\overrightarrow{\mathrm{man}} - \overrightarrow{\mathrm{woman}} \approx \overrightarrow{\mathrm{computer\ programmer}} - \overrightarrow{\mathrm{homemaker}}$   (2)

This second example is a clear case of how the bias from the news articles is preserved within their word embeddings [3]. While a lot of research is going into developing strategies for removing human bias from word embeddings, we aim to take advantage of this flaw to gain insight into the sentiment that different political philosophies have regarding certain topics. To build our word embeddings, we pulled all the titles, posts, and comments from each of the subreddits and grouped them by political affiliation [4]. As mentioned in the EDA section, there are many more comments and posts by conservatives than liberals. Because of this, we randomly sampled comments from the conservative corpus to match the number of comments in the liberal corpus. The same thing was done for posts. We wanted to keep the size of data in each group as similar as possible because our goal is to compare the embeddings created from these two corpora. This also had the added benefit of significantly cutting down the time it takes to train our conservative embedding model. After this, we preprocessed the text in both corpora to lowercase, tokenize URLs, tokenize different lengths of numbers, and strip out certain symbols and stop words. After that, each of the sentences is tokenized into words and stored in two separate DataFrames, one for each corpus. We decided to utilize the Skip Gram Word2Vec method developed by Tomas Mikolov in 2013, which is a popular technique that only requires a shallow neural


network [5]. According to Mikolov, the Skip Gram Word2Vec method is ideal for a smaller amount of data as well as being able to represent rare words fairly well. Therefore, this method is ideal because our training corpora are not very large, and there are many niche political words that may not occur often. We also incorporated negative sampling for faster training times and learning high-quality word representations [6]. After the embeddings are trained, we take a look at which words are grouped in the vector space. Figure 3 shows examples of words that are grouped with “hillary” and “trump” in the vector space. Unsurprisingly, we see that most of these tokens are the different variations of how each group would refer to the two people. What is interesting is that in the liberal word embedding, “hillary” is grouped with “sanders”, while in the conservative embedding, “hillary” is grouped with “trump”. This suggests that in the liberal subreddits, Hillary Clinton is often discussed in a similar context as Bernie Sanders, while in the conservative subreddits, Hillary Clinton is often discussed in a similar context as Donald Trump. We verified that our embeddings are performing as expected by running numerous examples through them and looking at the closest words. After we felt confident about the performance of our embedding model, we began to look at various sentiment lexicons to train our model. We considered Bing Liu’s Lexicon which would provide binary labels to a list of positive and negative words [7]. We also looked at the SentiWords lexicon which consists of a large corpus of words, each containing a continuous positive or negative float as an indication of the word’s sentiment [8]. To train our models, we would map each of the sentiment

Fig. 3 Lists showing which words are closest to the word “hillary” and “trump” in the liberal and conservative embeddings vector space


Table 1 Table showing the performance of various sentiment models

Model parameters | Liberal model Acc | Conservative model Acc | Liberal RMSE | Conservative RMSE
BingLiu + SGDClassifier + remove stopwords | 0.803 | 0.802 | N/A | N/A
BingLiu + RandomForest + remove stopwords | 0.747 | 0.739 | N/A | N/A
BingLiu + SGDClassifier + keep stopwords | 0.797 | 0.798 | N/A | N/A
BingLiu + RandomForest + keep stopwords | 0.743 | 0.736 | N/A | N/A
SentiWords(binarized) + SGDClassifier + remove stopwords | 0.791 | 0.790 | N/A | N/A
SentiWords(binarized) + SGDClassifier + keep stopwords | 0.791 | 0.788 | N/A | N/A
SentiWords + RandomForest + remove stopwords | N/A | N/A | 0.927 | 0.934
SentiWords + RandomForest + keep stopwords | N/A | N/A | 0.933 | 0.931

words to a vector using the liberal and conservative embeddings that we have built. Then, we trained these sentiment vectors with various combinations of the sentiment lexicons combined with various types of models. We tried binarizing the SentiWords lexicon in conjunction with classifier models as well as using the continuous labels with regression models. Table 1 includes some of the combinations we used and their resulting accuracy and RMSE. In the end, the model that resulted in the highest accuracy is an SGDClassifier on Bing Liu’s sentiment lexicon while removing stop words when training our embeddings.
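The sketch below illustrates how such a pipeline could be assembled with gensim and scikit-learn: one skip-gram embedding per political camp, followed by a classifier trained on lexicon word vectors. The variable names (`liberal_tokenized`, `bing_liu_lexicon`) and all hyperparameter values are illustrative assumptions, not the exact configuration used in this work.

```python
# Sketch: biased skip-gram embeddings plus a lexicon-based sentiment model (assumed setup).
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

def train_embedding(tokenized_sentences):
    # Skip-gram (sg=1) with negative sampling, as described above; sizes are assumed.
    return Word2Vec(sentences=tokenized_sentences, vector_size=100,
                    window=5, sg=1, negative=10, min_count=5, workers=4)

liberal_w2v = train_embedding(liberal_tokenized)            # list of token lists per corpus
conservative_w2v = train_embedding(conservative_tokenized)

# Sanity check: nearest neighbours of a token in each embedding space.
print(liberal_w2v.wv.most_similar("hillary", topn=5))
print(conservative_w2v.wv.most_similar("hillary", topn=5))

def lexicon_to_xy(w2v, lexicon):
    # lexicon: dict {word: 0/1 sentiment label}, e.g. built from Bing Liu's word lists.
    words = [w for w in lexicon if w in w2v.wv]
    X = np.vstack([w2v.wv[w] for w in words])
    y = np.array([lexicon[w] for w in words])
    return X, y

X_lib, y_lib = lexicon_to_xy(liberal_w2v, bing_liu_lexicon)
clf_lib = SGDClassifier(random_state=0)
print(cross_val_score(clf_lib, X_lib, y_lib, cv=5).mean())   # embedding-based sentiment accuracy
```

The same classifier can then be applied to the vector of any word of interest (e.g. a politician's name) in either embedding to read off a camp-specific sentiment estimate.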

5.2 Sentiment Observations After we have built our biased embedding-based sentiment models, we did a sanity check with a list of Democratic and Republican Presidents. We also included an “other” category which holds all the presidents that fell into neither party such as George Washington and Thomas Jefferson. The full list of words we used can be seen in the code associated with this project. Figure 4 shows the results, and as expected, liberals have a higher sentiment for Democratic presidents and conservatives have a higher sentiment for Republican presidents. What was interesting about these results is that even though each group has an obviously low sentiment for the presidents


Fig. 4 Bar graph showing the distribution of sentiments for Democratic presidents, Republican presidents, and presidents that do not fall under either category

belonging to the opposing party, they do not seem to have an obviously high sentiment for the presidents in their own party. In a way, this suggests that the members of each political philosophy seem to dislike the opposing party more than they like their own party. It is also very possible that most of the posts and comments on these political subreddits are geared towards attacking the opposing party rather than praising their party. To further explore this finding, we put together a list of people that each group tends to dislike in order to look at the resulting sentiment [9]. For example, under “Conservative Dislikes”, we included names such as “Hillary Clinton”, “AOC”, and “Bernie Sanders”, and under “Liberal Dislikes”, we included names such as “Donald Trump”, “Ted Cruz”, and “Sean Spicer”. The full list of words we used in each group can be found in the code associated with this project. Figure 5 shows that conservatives have a very strong negative sentiment towards the “Conservatives Dislike” list and the liberals have a very strong negative sentiment towards the “Liberals Dislike” list. What is very interesting is that the conservatives have a very neutral sentiment towards the people that liberals dislike, and liberals also have a very neutral sentiment towards the people that conservatives dislike. We expected to see at least some positive sentiment, but it seems like dislike for the other group is a lot more detectable than admiration within each political group. This finding suggests that the majority of political discussion regarding politics centres around attacking the opposing group. Furthermore, it is possible to argue that maybe what keeps people in their respective political tribes is not the love for their leaders, but rather the dislike towards the opposing tribe. It is impossible to extract any inference from a study like this, but these findings may influence some extensive studies and field experiments to look


Fig. 5 Bar graph showing the distribution of sentiments for political figures that liberals dislike and political figures that conservatives dislike

into the claim of “what keeps us in our political tribes? Mutual agreement or mutual disagreement?”.

6 Political Leaning Classification

6.1 Method

The project aims to use machine learning to understand how language is used. In particular, we are interested in identifying the word features within post titles that distinguish a conservative post from a liberal one. To accomplish this, we needed a classification model and word representation that are interpretable. Thus, we chose to use Term Frequency-Inverse Document Frequency (TF-IDF) in conjunction with random forest classification to easily draw interpretations about the language used in the data. Prior to training the classification model, we applied informed oversampling and undersampling methods to the training data to account for class imbalance and improve our model's performance. We applied both informed oversampling and undersampling because previous experience with applying both demonstrated success in improving model performance. For oversampling, we used the synthetic minority oversampling technique (SMOTE), which is grounded on k-nearest neighbours. Specifically, SMOTE selects the specified k-nearest neighbours based on each document's TF-IDF vectors, randomly selects one


of those neighbours and multiplies it with a random number between 0 and 1 to generate the new data [10]. For undersampling, we used the NearMiss method, which is also grounded on k-nearest neighbour. In particular, we used a version of NearMiss, called NearMiss2, that chooses the majority class examples that have the shortest average distance, based on their TF-IDF vectors, to the three furthest minority examples [10].
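A minimal sketch of this resampling-plus-classification setup is shown below, using imbalanced-learn's pipeline so that SMOTE and NearMiss-2 are applied only during fitting. The variable names (`titles`, `labels`) and parameter values are assumptions for illustration, not the authors' exact settings.

```python
# Sketch (assumed setup): TF-IDF features, SMOTE oversampling, NearMiss-2 undersampling,
# and a random forest classifier, chained in an imbalanced-learn pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss

# titles: list of post titles; labels: 0 = liberal, 1 = conservative (assumed encoding)
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.2, stratify=labels, random_state=42)

pipeline = Pipeline(steps=[
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("smote", SMOTE(k_neighbors=5, random_state=42)),   # oversample the minority class
    ("nearmiss", NearMiss(version=2)),                   # undersample the majority class
    ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
])

pipeline.fit(X_train, y_train)        # resampling steps are applied only during fit
print(classification_report(y_test, pipeline.predict(X_test)))
```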

6.2 Model Performance

Three models were trained and evaluated for the political leaning classification. The first is a baseline model that simply predicts the majority class. The second is the random forest classification model without under/oversampling. The third model is another random forest classifier, but this time it utilizes oversampling and undersampling. Based on Table 2, both of the random forest models clearly perform better than the baseline. However, the plain random forest model does not achieve as good a recall on the minority class as the one that utilizes oversampling and undersampling. Since we have a minority class, we opted to use the final model for our analysis as it represents the minority class a bit better despite the tradeoff in precision.

6.3 Model Interpretation Feature importance in a decision tree is the decrease in node impurity weighted by the probability of reaching that particular node. In random forest, the feature importance is averaged across all the decision trees that get generated out of the algorithm. Feature importance gives us an idea of what are the different variables that are important in making the predictions. In the below Fig. 6, we can observe the various word features that are important for making the class predictions. This gives us an understanding of some of the keywords in our corpus that are significant for differentiating a conservative post versus a liberal one. Furthermore, we produced partial dependency (PD) plots based on the training data that show us the marginal effect of the TF-IDF score on the predicted classification. In addition, we also overlaid the individual conditional expectation (ICE) lines (one line per instance) on top of the PD plot to show how the prediction probability changes as the TF-IDF score changes for various instances. Lastly, there is also a rug plot at the bottom of every PD plot to prevent over-interpretation in areas of the PD plots, where there are not a lot of examples. As observed from the PD plot below, with the word “conservative” and “liberal”, the average probability of predicting that the post is a liberal one decreases as the TFIDF score for the word increases as in Figs. 7 and 8. In looking at the ICE lines, there seems to be only a bit of heterogeneous effect, since there are few instances, where the probability of predicting a liberal class increases as the TF-IDF score increases.
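The interpretation workflow described here can be reproduced with scikit-learn's inspection utilities, as sketched below; the pipeline step names and the example term "conservative" carry over from the earlier sketch and are assumptions rather than the authors' exact code.

```python
# Sketch (assumed setup): impurity-based feature importances and a partial-dependence
# plot with overlaid ICE lines (kind="both") for a single TF-IDF term.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

vectorizer = pipeline.named_steps["tfidf"]     # fitted TF-IDF vectorizer from the pipeline above
forest = pipeline.named_steps["rf"]            # fitted random forest
terms = vectorizer.get_feature_names_out()

# Top-10 most important word features, averaged over all trees in the forest.
top = np.argsort(forest.feature_importances_)[::-1][:10]
for idx in top:
    print(f"{terms[idx]:20s} {forest.feature_importances_[idx]:.4f}")

# PD + ICE plot for one term; partial dependence needs a dense feature matrix.
X_dense = vectorizer.transform(X_train).toarray()
word_idx = int(np.where(terms == "conservative")[0][0])
PartialDependenceDisplay.from_estimator(forest, X_dense, features=[word_idx], kind="both")
plt.show()
```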

Table 2 Table showing the model performance of various classification models

Model | Conservative Precision | Conservative Recall | Conservative F1-score | Liberal Precision | Liberal Recall | Liberal F1-score | Macro-Avg Precision | Macro-Avg Recall | Macro-Avg F1-score
Majority class baseline | 0.66 | 1.00 | 0.80 | 0.00 | 0.00 | 0.00 | 0.33 | 0.50 | 0.40
Random forest | 0.80 | 0.88 | 0.84 | 0.70 | 0.57 | 0.63 | 0.75 | 0.72 | 0.73
Random forest w/ over/undersampling | 0.82 | 0.82 | 0.82 | 0.64 | 0.63 | 0.64 | 0.73 | 0.73 | 0.73


Fig. 6 Chart showing the top 10 most important features

This can be interpreted as the more important the terms “conservative” and “liberal” are to the post, the more likely that the post is a conservative one. The ICE lines also imply that the majority of the posts that have the words “conservative” and “liberal” tend to be conservative one. Using this in conjunction with the sentiments Fig. 11, we can see the associated sentiment for the words from both political sides. The word “liberal” seems to be used in a negative manner in the conservative subreddit, and positively in the liberal subreddit. The word “conservative” seems to have a positive sentiment with both the liberal and conservative embedding. In the below PD plot, the word “democrat” displays little heterogeneous effect as in Fig. 9, very similar to the words “liberal” and “conservative”. However in this

Fig. 7 PD plot for the word “conservative” along with the ICE lines (blue lines) that represents individual examples and its probability changes


Fig. 8 PD plot for the word “liberal” along with the ICE lines (blue lines) that represents individual examples and its probability changes

case, the probability of predicting the liberal class gets stronger with a higher TF-IDF score. This PD plot implies that the more important the word democrat is to the post, the more likely it is that the post is a democratic one. In addition, the ICE lines, for the most part, go in the upward direction, implying that most of the posts that contain the word “democrat” tend to be liberal one. Using the sentiments Fig. 11, we can see that the word “democrat” seems to have a negative sentiment based on both the conservative and liberal subreddit embeddings. For the last example, we will look at the word “trump” as this offers a more complex story. In this case, the PD plot is not linear as the PD plot flattens out after

Fig. 9 PD plot for the word “democrat” along with the ICE lines (blue lines) that represents individual examples and its probability changes


Fig. 10 PD plot for the word “trump” along with the ICE lines (blue lines) that represents individual examples and its probability changes

Fig. 11 Sentiments attached to the top 4 features of the classification model based on liberal and conservative biased embeddings

reaching a certain TF-IDF score as in Fig. 10. In addition, the PD plot is also affected by a stronger heterogeneous effect compared to the previous examples as there are a lot of instances, where the probability of predicting the liberal class decreases. With this plot, it is hard to decipher from the ICE lines which camp tends to have more posts with the word trump. Using the sentiments Fig. 11, we can see that the word “trump” has a negative sentiment based on both the conservative and liberal subreddit embeddings, but is more negative with the liberal embeddings.

7 Conclusion By building custom embeddings for two groups with vastly different points of view, we were able to witness interesting associations between words that are used in similar contexts based on the cosine similarities. We saw that words like “hillary” and “trump” are associated closely with nicknames and aliases given to the two people by the two opposing groups. On top of this, we were able to see that interestingly enough, “hillary” is commonly used in the same context as “bernie” by liberals, and “trump” is commonly used in the same context as “hillary” by conservatives. This may reflect the number of comparisons made between the pairs within each


of the groups. From the embedding-based sentiment models we built, we were able to explore the sentiment bias that each political group has towards certain words, topics, and people. Most of what we found were not surprising such as liberals having a very negative sentiment towards “trump” and conservatives having a very low sentiment towards “hrc”. What we did find to be interesting is that our results tend to skew much more towards a low sentiment. Liberals seem to have a very negative sentiment towards people associated with conservatives and conservatives seem to have a very negative sentiment towards people associated with liberals. However, when it comes to the sentiment of liberals towards Democratic presidents and people associated with liberals, the sentiment is not very high. The same thing can be said for conservatives regarding Republican presidents and people associated with conservatives. This finding opens up an interesting question of whether modern political tribalism is based more strongly on dislike of the opposing party rather than an agreement on policies. Using the political leaning classification model, we were able to infer how the various words in the post title and their corresponding importance (via TF-IDF score) affect the probability of a post belonging to a liberal or conservative subreddit. When combined with the sentiment model, we are also able to observe the type of sentiment associated with those words, which reflects how differently those words are being used by the different political sides. Using the tools above.

References

1. Pang B, Lee L et al (2008) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1–2):1–135
2. Sarker A, Al-Garadi MA, Ge Y, Nataraj N, Jones CM, Sumner SA (2022) Trends in co-mention of stimulants and opioids: a natural language processing driven analysis of reddit forums
3. Bolukbasi T, Chang KW, Zou JY, Saligrama V, Kalai AT (2016) Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In: Advances in neural information processing systems, pp 4349–4357
4. Spadaro A, Sarker A, Hogg-Bremer W, Love JS, O'Donnell N, Nelson LS, Perrone J (2022) Reddit discussions about buprenorphine associated precipitated withdrawal in the era of fentanyl. Clin Toxicol 1–8
5. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
6. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
7. Hu M, Liu B (2004) Mining opinion features in customer reviews. AAAI 4:755–760
8. Baccianella S, Esuli A, Sebastiani F (2010) SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In: LREC, vol 10, pp 2200–2204
9. Gamage D, Ghasiya P, Bonagiri V, Whiting ME, Sasahara K (2022) Are deepfakes concerning? Analyzing conversations of deepfakes on reddit and exploring societal implications. arXiv preprint arXiv:2203.15044
10. He H, Garcia EA (2008) Learning from imbalanced data. IEEE Trans Knowl Data Eng 9:1263–1284

Chapter 55

Feature Extraction and Selection with Hyperparameter Optimization for Mitosis Detection in Breast Histopathology Images

Suchith Ponnuru and Lekha S. Nair

1 Introduction

Breast cancer is one of the most common types of cancer. As per WHO's recent statistics, over the past five years, 7.8 million women have been diagnosed with breast cancer. In 2020, around 2.3 million women were diagnosed with breast cancer, with a total of 685,000 deaths reported globally. Fortunately, if identified early, breast cancer has many effective treatment options available. Many image modalities are being used for the detection and diagnosis of cancer. One effective image modality is histopathology images, which are microscopic biopsy images; pathologists carefully examine these images to look at specific features in the cells and tissue structures. Cancerous cells and tissue structures can have abnormal features compared to standard tissue regions. Pathologists mainly focus on specific characteristics like the shape and size of the cells, the size and shape of the cell's nucleus, and the distribution of the cells in tissue. Hematoxylin and eosin (H&E) is the most commonly used stain in medical diagnosis, specifically histology, and is frequently used as the gold standard. These H&E stains are evaluated using three features, tubule formation, nucleus polymorphism, and mitotic count, based on the Elston and Ellis grading system.

S. Ponnuru (B) · L. S. Nair Department of Computer Science and Engineering, Amrita School of Computing, Amrita Vishwa Vidyapeetham, Amritapuri Campus, Kollam, Kerala 690525, India e-mail: [email protected] L. S. Nair e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al, (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_55


This paper focuses on mitotic count detection involving a comparative analysis of various features as input to three different classifiers by applying multiple hyperparameter optimization techniques. We studied different combinations: applying principal component analysis (PCA) separately to each feature after extraction, applying PCA after combining the features, and then utilizing feature selection. Mitosis detection using microscopy images of breast cancer is a challenging task in clinical practice. Hematoxylin and eosin (H&E) stained images come with various challenges for mitosis detection, as mitotic nuclei pass through four phases, prophase, metaphase, anaphase, and telophase, and each stage differs from the others in shape and texture configuration. The paper is divided into several sections: Sect. 2 contains details about related work, Sects. 3 and 4 describe the data set and image pre-processing, and Sect. 5 gives an overview of our proposed method. We explain the different feature extraction techniques in Sect. 6 and the classifiers tuned with the optimizers in Sect. 7. Section 8 describes the hyperparameter optimization techniques used in the implementation, and Sect. 9 the feature selection methods. Section 10 presents results demonstrating the efficacy of combining various features as input to different hyperparameter-optimized classifiers for mitosis detection. The final section concludes the paper.

2 Related Work

There has been a lot of research and advancement in mitosis detection and classification using hematoxylin- and eosin-stained (HE-stained) biopsy images. Since the invention of whole-slide imaging scanners, the use of these stained images in mitosis research has increased extensively. In previous work, combinations of different feature extraction methods and machine learning methods were used for detection. Irshad [15] presented a technique [12] in which a decision tree classifier was used together with statistical and morphological features. Vuksanovic [3] proposed the usage of the Haralick algorithm to extract features at the pixel level with different window sizes and then used SVM to classify the input. Tashk et al. [21] presented a method that involves the utilization of local binary patterns (LBP) and SVM [21] as the classification algorithm. Here, mitoses were segmented using k-means clustering, and features of different types (morphological, intensity-based, and textural) were extracted from the segmented regions for SVM classification [11, 22]. In this approach, three types of features were extracted (completed local binary pattern (CLBP), statistical moment entropy, and stiffness matrix (SM)); these were fused with each other and fed to a support vector machine (SVM) with an RBF kernel and to random forest classifiers. These methods require high computational resources, so they cannot be used in real-world applications. Different ways of optimizing hyperparameters for common machine learning models are studied in [24]. Generally, a model based on deep features tends to be more effective than a classical machine learning method because it considers a larger number of meaningful features while training. Albayrak and Bilgin [2] presented a


work that achieved good performance by extracting features using a CNN and then applying PCA and LDA as dimensionality reduction methods. Wang et al. [23] presented a technique in which two classifiers were used, one trained with handcrafted features and the other with features extracted from a CNN. Cai et al. [5] used ResNet-101 for feature extraction in combination with a modified Faster R-CNN. In [20], the YOLOv4 model is used with RGB images as input, and an F1-score of 0.76 was achieved.

3 Data Set

To evaluate our proposed model, we use the MITOS-ATYPIA-14 challenge data set from the 22nd International Conference on Pattern Recognition. The data set contains 20× and 40× magnification images; we used the 40× magnification images and extracted patches from them. The total number of patches extracted is 432, evenly divided between mitosis and non-mitosis patches.

4 Image Pre-Processing

Each image of the data set is processed to obtain the most information out of it. The image is first converted to a gray-scale image and then resized to 128 * 128 pixels.
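A minimal sketch of this pre-processing step is shown below using scikit-image; the directory name and file pattern are assumptions for illustration.

```python
# Sketch (assumed file layout): convert each extracted patch to gray scale
# and resize it to 128 x 128 pixels, as described above.
from pathlib import Path
import numpy as np
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.transform import resize

def preprocess_patch(path, size=(128, 128)):
    img = imread(path)
    if img.ndim == 3:                       # colour patch -> gray scale
        img = rgb2gray(img)
    return resize(img, size, anti_aliasing=True)

patches = [preprocess_patch(p) for p in sorted(Path("patches").glob("*.png"))]
X_pixels = np.stack(patches).reshape(len(patches), -1)   # 128*128 = 16,384 raw pixel values per patch
```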

5 Proposed Method 5.1 Overview of Proposed Methods In this paper, we have done a comparative analysis of the classification results based on the extraction of different features from the 432 breast cancer biopsy slides images at a pixel level, local level (shape, size, etc.), global level (hog, shift, surf) and find the best combination of features that can be utilized for efficient detection of mitosis. The extracted features are then combined, and their dimensionality is reduced using principal component analysis (PCA). The resultant features obtained from dimensionality reduction are passed onto support vector machine (SVM), K-nearest neighbor (KNN), and random forest classifiers alongside a different optimization technique for classification. After the features are extracted, an optimal subset of features is selected using feature selection and passed to support vector machine (SVM) classifier. Different combinations of analysis were done: application of principal component analysis (PCA) separately to each feature after extraction, principal component


Fig. 1 Proposed architecture

analysis (PCA) after combining features, and then using feature selection technique to obtain the best classification accuracy (see Fig. 1).

6 Feature Extraction

Feature extraction is vital for the classification and recognition/detection of images/videos. We explore three levels of features from our images: 1. local level, 2. global level, 3. pixel level. Table 1 lists the algorithm used for each feature level.

Local Level: At this level, we try to identify critical points in the image, which help us identify points or edges. These points/edges are measurements from a region centered on a local feature. They are usually extracted from image patches that differ by texture, color, or intensity from their immediate surroundings. The local level reveals features centered around the nucleus. The local features used in our work are ORB, censure, and edge (results in Table 2).

Table 1 Features algorithm used

Features | Algorithm used
Local | ORB, censure, edge
Global | Corner, HOG
Pixel | Gray scale pixel value

Table 2 Local level results: accuracy, F1-score, recall, and precision of the RF, KNN, and SVM classifiers on the ORB, Censure, and Edge features under each hyperparameter optimizer (Grid, Random, BOGP, SKOPT GP, BO TPE, Genetic, TPOT)
Acc Accuracy; F1 F1-score; Rec Recall; Prec Precision


6.1 Oriented FAST and Rotated BRIEF (ORB)

Oriented FAST and Rotated BRIEF (ORB) is an image matching algorithm that generally consists of three steps: feature point extraction, generation of feature point descriptors, and matching of feature points. ORB [7] is based on the FAST keypoint detector and the BRIEF descriptor, which are popular algorithms due to their high performance and low cost. ORB performs a greedy search among all possible binary tests to find those with high variance and means close to 0.5, as well as being uncorrelated; the result is called rBRIEF.

Steps for calculation of ORB from each mitosis/non-mitosis image:

1. The input images are passed to FAST to identify the key points, using the Harris corner measure to select the top points, with orientation

$\theta = \mathrm{atan2}(m_{01}, m_{10})$   (1)

2. Apply grid filtering, which helps distribute key points more evenly across the image.
3. Extract feature orientation using "steered" BRIEF, which uses a Gaussian kernel to avoid the high-frequency noise to which our descriptor is sensitive:

$g_n(p, \theta) = f_n(p) \mid (x_i, y_i) \in S_\theta$   (2)

4. Extract the features.

Results of ORB when applied to our patches are shown in Fig. 2.

Fig. 2 ORB result
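The following sketch shows one way to obtain ORB keypoints and binary descriptors from a pre-processed 128 × 128 gray-scale patch with scikit-image; the number of keypoints and the padding to a fixed-length vector are assumptions for illustration.

```python
# Sketch (assumed parameters): ORB keypoints and rBRIEF descriptors for one patch.
import numpy as np
from skimage.feature import ORB

def orb_features(gray_patch, n_keypoints=100):
    extractor = ORB(n_keypoints=n_keypoints)      # FAST detection + Harris ranking + rBRIEF
    extractor.detect_and_extract(gray_patch)
    desc = extractor.descriptors.astype(np.float32).ravel()   # (n_detected, 256) flattened
    fixed = np.zeros(n_keypoints * 256, dtype=np.float32)     # pad/truncate to a fixed length
    fixed[:min(desc.size, fixed.size)] = desc[:fixed.size]
    return fixed

orb_vector = orb_features(patches[0])
```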


Fig. 3 Censure results

6.2 Center Surround Extremas (Censure)

The Censure [1] feature detector is a scale-invariant center surround detector that is popular owing to its capability for real-time implementation. Steps for calculation of censure from each mitosis/non-mitosis image:

1. Find extrema in a local neighborhood using center surround filters over multiple scales, utilizing the original image resolution at each scale.
2. The extrema are then filtered using the Harris measure, and those with a weak corner response are eliminated.
3. The remaining extrema are considered as features for our classification.

Results of censure when applied to our patches are shown in Fig. 3.
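A short sketch of CenSurE detection with scikit-image follows; the scale range and the way keypoints are summarised into a feature vector are assumptions for illustration.

```python
# Sketch: CenSurE keypoint detection on a gray-scale patch with scikit-image.
import numpy as np
from skimage.feature import CENSURE

detector = CENSURE(min_scale=1, max_scale=7, mode='DoB')   # assumed settings
detector.detect(gray_patch)
keypoints = detector.keypoints      # (row, col) coordinates of the surviving extrema
scales = detector.scales            # scale at which each keypoint was detected
censure_vector = np.concatenate([keypoints.ravel(), scales]).astype(np.float32)
```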

6.3 Edge Detection

Edge detection [6] is a technique used in image processing to identify points in a digital image where there is a steep difference in image brightness. In general, the image's boundaries are the points which are classified as edges. In order to identify these edges from images, we used two filters, vertical and horizontal Sobel filters, and created two histograms from them.

The above methods were used for the extraction and analysis of local level features (ORB, Censure, Edge), which helped us get information about the nucleus. The following portion of the paper provides details about the methods used to extract global level features (corner, HOG), which helped us get information regarding the whole images.

Global Level: Global-level features describe the image as a whole and are extensively used to generalize the entire image, including contour representations, shape descriptors, and texture features. Corner and HOG were the global features, shown in Table 3, that we focused on in our paper.

Table 3 Global level results: accuracy, F1-score, recall, and precision of the RF, KNN, and SVM classifiers on the HOG and Corner peak features under each hyperparameter optimizer (Grid, Random, BOGP, SKOPT GP, BO TPE, Genetic, TPOT)
Acc Accuracy; F1 F1-score; Rec Recall; Prec Precision

6.4 Histogram of Oriented Gradients (HOG)

The histogram of oriented gradients [3], or HOG, is frequently used to extract features from image data and is widely used for object detection in computer vision tasks. The HOG descriptor focuses on an object's structure or shape; it can indicate whether or not a pixel lies on an edge, together with the edge direction. This is accomplished by determining the gradient and orientation (or magnitude and direction) of the edges (see Fig. 4). Steps for calculation of HOG for each mitosis/non-mitosis image:

1. First we divide the image into small regions, generally grids of size 8 * 8.


Fig. 4 Results of HOG when applied to our images

2. Gradients of two types are calculated for each region: vertical and horizontal.

$G_x(r, c) = I(r, c + 1) - I(r, c - 1)$   (3)

$G_y(r, c) = I(r - 1, c) - I(r + 1, c)$   (4)

3. For each pixel in the 8 * 8 grid, the orientation is calculated using the gradient magnitude and gradient angle

$\theta = \tan^{-1}(G_y / G_x)$

4. The obtained 64 gradient vectors are compressed into 9 vectors without losing much information.
5. A histogram is generated for each of these regions individually; integrating them together to form the structure of the original image results in the histogram below.
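These steps correspond closely to scikit-image's built-in HOG routine, as in the sketch below; the block size and normalisation scheme are assumed settings rather than the exact configuration used here.

```python
# Sketch: HOG descriptor for a 128 x 128 gray-scale patch with scikit-image,
# using 8 x 8 cells and 9 orientation bins as described above.
from skimage.feature import hog

hog_vector = hog(gray_patch,
                 orientations=9,
                 pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2),   # assumed block size
                 block_norm='L2-Hys',
                 feature_vector=True)
print(hog_vector.shape)    # e.g. (8100,) for a 128 x 128 input with these settings
```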

6.5 Corner Peak

Corners [16] are among the most important features in an image and are commonly referred to as interest points because they are insensitive to translation, rotation, and illumination. Steps for calculation of corner peaks for each mitosis/non-mitosis image:

1. Consider a small window surrounding each pixel in the image, which helps identify candidate corner windows.
2. The amount of change in pixel values is measured by shifting each window by a small amount in a given direction and measuring the resulting change.
3. Identify pixel windows with a large sum of squared differences (SSD).
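A possible realisation of this procedure with scikit-image is sketched below; the minimum peak separation is an assumed value.

```python
# Sketch: Harris corner response with local peak detection, approximating the
# corner-peak procedure described above.
import numpy as np
from skimage.feature import corner_harris, corner_peaks

response = corner_harris(gray_patch)                 # windowed, SSD-based corner measure
coords = corner_peaks(response, min_distance=3)      # assumed minimum peak separation
corner_vector = coords.ravel().astype(np.float32)    # flatten (row, col) pairs into a feature vector
print(len(coords), "corner peaks found")
```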


The next section describes the methods used to identify pixel-level features in the images.

Pixel Level: A pixel is the smallest block of an image, representing the amount of brightness (or gray intensity) to be displayed for that particular portion. An integer is used to signify the pixel value. There are 256 possible values for a pixel, ranging from 0 (black) to 255 (white). Pixel-level features (Table 4) are calculated from each pixel, such as color and location. We extracted the gray-scale pixel value at each point.

Table 4 Pixel results

Optimizer | Classifier | Accuracy | F1-score | Recall | Precision
Grid search | RF | 62.07 | 0.65 | 0.60 | 0.70
Grid search | KNN | 73.56 | 0.69 | 0.81 | 0.60
Grid search | SVM | 78.16 | 0.78 | 0.77 | 0.79
Random search | RF | 0.65 | 0.71 | 0.71 | 0.70
Random search | KNN | 0.67 | 0.41 | 0.62 | 0.30
Random search | SVM | 81.61 | 0.81 | 0.81 | 0.81
BOGP | RF | 0.66 | 0.60 | 0.59 | 0.60
BOGP | KNN | 0.68 | 0.47 | 0.71 | 0.35
BOGP | SVM | 0.75 | 0.81 | 0.81 | 0.81
SKOPT GP | RF | 0.60 | 0.60 | 0.59 | 0.60
SKOPT GP | KNN | 0.68 | 0.47 | 0.71 | 0.35
SKOPT GP | SVM | 0.75 | 0.81 | 0.81 | 0.81
BO TPE | RF | 0.60 | 0.60 | 0.59 | 0.60
BO TPE | KNN | 0.68 | 0.47 | 0.71 | 0.35
BO TPE | SVM | 0.75 | 0.81 | 0.81 | 0.81
Genetic | RF | 0.65 | 0.57 | 0.57 | 0.58
Genetic | KNN | 0.68 | 0.47 | 0.71 | 0.35
Genetic | SVM | 0.75 | 0.81 | 0.81 | 0.81
TPOT | RF | 67.82 | 0.68 | 0.67 | 0.70
TPOT | KNN | 60.92 | 0.47 | 0.71 | 0.35
TPOT | SVM | 81.61 | 0.81 | 0.81 | 0.81

Acc Accuracy; F1 F1-score; Rec Recall; Prec Precision


6.6 Grayscale Pixel Values Because images are represented by pixels, the most basic way to create image features is to use the raw pixel values as separate features. The number of features will be the same as the number of pixels, which in this case is 128 times 128, or 16,384.

6.7 Principal Component Analysis (PCA)

PCA is a dimensionality reduction process that deals with large data sets by transforming a large set of variables into a smaller set without losing a majority of the important information in the large set. We extracted several features (local, global, pixel). Since the number of features extracted is high, we used different ways of applying PCA to determine which provides the best accuracy:

1. Applying PCA separately to each feature and then combining, with n components of 100, 200, and 400
2. Applying PCA after combining the features, with n components of 400
3. Combining without PCA.

Table 5 lists the number of features extracted by each algorithm.
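The two PCA strategies can be sketched as follows with scikit-learn; the feature-block variable names and the component counts shown are illustrative assumptions consistent with the settings listed above.

```python
# Sketch (assumed variable names): the PCA strategies compared in this work.
import numpy as np
from sklearn.decomposition import PCA

feature_blocks = [X_hog, X_orb, X_censure, X_corner, X_edge, X_pixels]  # one matrix per feature type

# (1) PCA applied separately to each feature block, then the reduced blocks are concatenated.
def reduce_separately(blocks, n_components=100):
    reduced = [PCA(n_components=n_components).fit_transform(b) for b in blocks]
    return np.hstack(reduced)

# (2) PCA applied once after concatenating all blocks.
def reduce_combined(blocks, n_components=400):
    return PCA(n_components=n_components).fit_transform(np.hstack(blocks))

X_sep = reduce_separately(feature_blocks, n_components=100)   # e.g. 6 blocks x 100 components
X_comb = reduce_combined(feature_blocks, n_components=400)
X_raw = np.hstack(feature_blocks)                              # (3) no PCA
```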

7 Classification

The goal of classification is to determine which category an observation belongs to, which is accomplished by comprehending the associations between the dependent and independent variables. The dependent variable is categorical in this case, while the independent variables can be either numerical or categorical. Classification is predictive modeling that works under a supervised learning setup because there is

Table 5 No. of features extracted

Features | No. of features extracted
Hog | 15,876
ORB | 4992
Censure | 300
Corner | 974
Pixel value | 128 * 128


a dependent variable that allows us to establish the relationship between the input variables and the categories. There are several types of classification:

1. Binary classification
2. Binomial classification
3. Multi-class classification
4. Multi-label classification.

Once the different features are collected and dimensionality is reduced, the next task is to apply different classifiers to obtain better results. For our data set, we are going to use binary classification, the most basic and widely used type of classification. The dependent variable in this case is divided into two distinct categories denoted by the numbers 1 and 0, hence the term binary classification. We are going to employ three types of classifiers to detect mitosis: SVM, KNN, and random forest.
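The sketch below shows how the three classifiers could be trained and compared on the reduced feature matrix; the split ratio, model hyperparameters, and variable names (`X_comb`, `y`) are illustrative assumptions.

```python
# Sketch (assumed setup): compare SVM, KNN, and random forest on a hold-out split.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

X_train, X_test, y_train, y_test = train_test_split(
    X_comb, y, test_size=0.2, stratify=y, random_state=42)   # y: 1 = mitosis, 0 = non-mitosis

models = {
    "SVM": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name:14s} acc={accuracy_score(y_test, pred):.3f} "
          f"f1={f1_score(y_test, pred):.3f} rec={recall_score(y_test, pred):.3f} "
          f"prec={precision_score(y_test, pred):.3f}")
```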

7.1 Support Vector Machines (SVM)

The support vector machine (SVM) [8] is a supervised machine learning algorithm that can be used for classification and regression tasks; however, it is mostly used for classification problems. In the SVM algorithm, the data points are plotted in n-dimensional space, where n is the total number of features, with the value of each feature being the value of a specific coordinate. A hyperplane is then located that distinguishes between the various classes and classifies them. Table 6 lists the accuracy, F1-score, recall, and precision obtained using the support vector machine (SVM).

Table 6 SVM results

Feature | n = 100: Prec, Rec, F1, Acc | n = 200: Prec, Rec, F1, Acc | n = 400: Prec, Rec, F1, Acc
HOG | 0.818, 0.837, 0.828, 82.7 | 0.825, 0.767, 0.795, 83.4 | 0.857, 0.837, 0.847, 85
ORB | 0.660, 0.721, 0.689, 70.4 | 0.692, 0.628, 0.659, 70.4 | 0.63, 0.791, 0.701, 69.2
CEN | 0.648, 0.814, 0.722, 68.9 | 0.600, 0.767, 0.673, 65.5 | 0.66, 0.814, 0.729, 70.1
COR | 0.571, 0.651, 0.609, 65.0 | 0.617, 0.674, 0.644, 67.8 | 0.591, 0.605, 0.598, 66.9
EDGE | 0.647, 0.512, 0.571, 62.3 | 0.638, 0.698, 0.667, 61.1 | 0.722, 0.605, 0.658, 68.9
PX | 0.717, 0.767, 0.742, 75.5 | 0.781, 0.581, 0.667, 71.2 | 0.735, 0.581, 0.649, 77.6

Acc Accuracy; F1 F1-score; Rec Recall; Prec Precision


Table 7 KNN results

Feature | n = 100: Prec, Rec, F1, Acc | n = 200: Prec, Rec, F1, Acc | n = 400: Prec, Rec, F1, Acc
HOG | 0.625, 0.233, 0.339, 55.1 | 0.786, 0.256, 0.386, 59.7 | 0.636, 0.163, 0.259, 59.7
ORB | 0.609, 0.651, 0.629, 68.9 | 0.694, 0.581, 0.633, 70.1 | 0.619, 0.605, 0.612, 70.1
CEN | 0.617, 0.674, 0.644, 62.7 | 0.600, 0.698, 0.645, 62.0 | 0.617, 0.674, 0.644, 63.2
COR | 0.583, 0.651, 0.615, 66 | 0.608, 0.721, 0.660, 66.3 | 0.608, 0.721, 0.66, 68.4
EDGE | 0.561, 0.535, 0.548, 62.0 | 0.614, 0.628, 0.621, 62.0 | 0.625, 0.581, 0.602, 62
PX | 0.714, 0.465, 0.563, 64.8 | 0.800, 0.093, 0.167, 63.4 | 0.571, 0.093, 0.160, 65.5

Acc Accuracy; F1 F1-score; Rec Recall; Prec Precision

7.2 K-Nearest Neighbor (KNN)

The k-nearest neighbors (KNN) [14] algorithm is a data classification algorithm. KNN is a non-parametric and lazy learning algorithm. First, we select a value of k, the number of nearest data points to consider. We then calculate the Euclidean distance from the query point to each training point and sort the results in ascending order; the k closest points then vote on the class of the query point. For this method, we selected a random k value. Table 7 lists the accuracy, F1-score, recall, and precision obtained using k-nearest neighbors (KNN).

7.3 Random Forest

Random forest [4] is a well-known supervised machine learning algorithm based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and improve the overall performance of the model.

1. A set of random samples is selected from the given data set.
2. Next, a decision tree is constructed for each sample, and a prediction is obtained from every decision tree.
3. Voting is then performed over the predicted results.
4. The most voted prediction is selected as the final prediction result.

Table 8 lists the accuracy, F1-score, recall, and precision obtained using random forest.


Table 8 Random forest results

Feature | n = 100: Prec, Rec, F1, Acc | n = 200: Prec, Rec, F1, Acc | n = 400: Prec, Rec, F1, Acc
HOG | 0.81, 0.81, 0.81, 81.6 | 0.77, 0.79, 0.78, 78.1 | 0.88, 0.74, 0.8, 82.7
ORB | 0.64, 0.58, 0.61, 71.8 | 0.72, 0.67, 0.69, 67.2 | 0.63, 0.65, 0.64, 69.8
CEN | 0.62, 0.69, 0.65, 64.3 | 0.55, 0.58, 0.56, 63.4 | 0.61, 0.69, 0.65, 63.1
COR | 0.56, 0.62, 0.59, 64.6 | 0.64, 0.62, 0.63, 64.6 | 0.60, 0.60, 0.60, 62.8
EDGE | 0.53, 0.53, 0.53, 56.9 | 0.60, 0.62, 0.61, 60.8 | 0.60, 0.67, 0.63, 62
PX | 0.52, 0.44, 0.48, 63.7 | 0.60, 0.48, 0.53, 65.7 | 0.78, 0.58, 0.66, 63.7

Acc Accuracy; F1 F1-score; Rec Recall; Prec Precision

8 Hyperparameter Optimization

A hyperparameter is a pre-defined value set to control the learning process. More specifically, if a hyperparameter is changed, the parameters learnt by the model will also vary. Hyperparameter optimization [24] is a technique to find the right combination of hyperparameter values to achieve the best performance and accuracy on the data. Table 9 lists the hyperparameters optimized for each classifier. For our mitosis detection, we used several hyperparameter optimizers: grid search, random search, Bayesian optimization with Gaussian process, Bayesian optimization with tree-structured Parzen estimator (TPE), genetic algorithm, and the TPOT classifier. We used different optimizers because we wanted to find the optimizer that provides the best accuracy with the least execution time. We started by examining grid search, which takes a significant amount of time and resources, and moved on to TPOT, which needs less time and fewer resources.

8.1 Grid Search

Grid search is the process of fine-tuning hyperparameters to find the best values for a given model. We must iteratively test all the values to determine the best possible combination.

Table 9 Hyperparameters optimized for each classifier

Classifier | Hyperparameters optimized
SVM | C, Gamma, Kernel
KNN | n neighbors
Random forest | N-estimator, max features, max depth, criterion, min samples split, min samples leaf


Since the manual method of doing this could take a significant amount of time and resources, we use grid search CV to automate hyperparameter tuning.
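A minimal sketch of such a grid search over the SVM hyperparameters in Table 9 is shown below; the grid values themselves are assumptions for illustration.

```python
# Sketch (assumed search space): exhaustive grid search over SVM hyperparameters.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", 0.1, 0.01, 0.001],
    "kernel": ["rbf", "linear"],
}

grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```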

8.2 Random Search

Random combinations of parameters are sampled to find the potentially best combination of hyperparameters.

8.3 Bayesian Optimization with Gaussian Process

Bayesian optimization: This process uses Bayes' theorem to direct the search in order to find the minimum or maximum of a given objective function. This approach is particularly useful for objective functions that are complex and noisy.

Gaussian process: We used a Gaussian process as a "surrogate model" to estimate the performance of our predictive algorithm. The Gaussian process also expresses how uncertain a prediction is. In practice, this process proved to be only marginally better than the random search algorithm.
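One common way to run such a search is scikit-optimize's BayesSearchCV, which uses a Gaussian-process surrogate by default; the sketch below is an assumed configuration, not the exact one used in this work.

```python
# Sketch (assumed search space): Bayesian optimization of SVM hyperparameters
# with a Gaussian-process surrogate via scikit-optimize.
from skopt import BayesSearchCV
from skopt.space import Real, Categorical
from sklearn.svm import SVC

search_space = {
    "C": Real(1e-2, 1e2, prior="log-uniform"),
    "gamma": Real(1e-4, 1e0, prior="log-uniform"),
    "kernel": Categorical(["rbf", "linear"]),
}

opt = BayesSearchCV(SVC(), search_space, n_iter=30, cv=5,
                    scoring="accuracy", random_state=42, n_jobs=-1)
opt.fit(X_train, y_train)
print(opt.best_params_, opt.best_score_)
```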

8.4 Genetic Algorithm

The genetic algorithm (GA) is a technique based on the principles of genetics and natural selection. GAs not only try to find the best combination of the given parameters but also apply recombination and mutation, producing children and iterating over the process again and again. Each individual is assigned a fitness value, and better-fitted individuals have a better chance of being selected.

8.5 TPOT

The tree-based pipeline optimization tool, or TPOT for short, is a Python library for automated machine learning. TPOT constructs tree-based pipeline structures and finds the one that performs best for a given data set. It searches over a broad range of feature selectors, constructors, and models to find the most optimal series of operators that minimizes the error of the model predictions.
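A short sketch of running TPOT on the reduced features follows; the generation and population settings are assumptions for illustration.

```python
# Sketch (assumed settings): let TPOT evolve a classification pipeline.
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=5, population_size=20, cv=5,
                      scoring="accuracy", random_state=42, verbosity=2, n_jobs=-1)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_mitosis_pipeline.py")   # writes the winning pipeline as Python code
```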


Among all the hyperparameter optimizers, the most promising results were obtained from TPOT and Bayesian optimization with Gaussian process.

9 Feature Selection

Feature selection [13] is the process of selecting the important features for the model. Feature engineering is split into two processes: feature selection and feature extraction. Here, we used different feature selection techniques to select subsets of features, making it easier to pass them to an SVM classifier.

9.1 Red Deer Algorithm

The red deer algorithm [9] is a metaheuristic algorithm that provides the benefits of both heuristic search techniques and evolutionary algorithms. In this technique, we first select a random population, called red deer (RDs), which is then divided into two types; a certain percentage of the best RDs is selected:

$N_{\mathrm{male.Com}} = \mathrm{round}(c \cdot N_{\mathrm{male}})$

$N_{\mathrm{stag}} = N_{\mathrm{male}} - N_{\mathrm{male.Com}}$

Figure 5 shows the plot of fitness versus iteration. The accuracy obtained using the red deer algorithm is 83.9% with dimension 1831.

Fig. 5 Red deer convergence curve


Fig. 6 Cuckoo search convergence curve

9.2 Cuckoo Search Algorithm

Cuckoo search [10] is one of many widely used nature-inspired algorithms and is based on dynamically increasing switching parameters. The algorithm iteratively replaces an ineffective solution with a new and potentially superior one, to find the best set of features at the end. Figure 6 shows the plot of fitness versus iteration. The accuracy obtained using the cuckoo search algorithm is 82.7% with dimension 12,778.

9.3 Harmony Search

Harmony search (HS) [12] is a simple algorithm that is generally used along with other subset evaluation techniques. The simplicity of harmony search is often exploited to solve complex problems. Figure 7 shows the plot of fitness versus iteration. The accuracy obtained using the harmony search algorithm is 71.2% with dimension 11,156.

9.4 Whale Optimization Algorithm

The whale optimization algorithm [17] is a metaheuristic optimization inspired by humpback whales. This technique is generally more useful for high-dimensional data and has various variants. Figure 8 shows the plot of fitness versus iteration.


Fig. 7 Harmony search convergence curve

Fig. 8 Whale optimization convergence curve

The accuracy obtained using whale optimization algorithm is 74.7% with dimension 11,046.

9.5 Genetic Algorithm

Genetic algorithms are based on evolution and can be leveraged to select the most important set of features. Mutations and children are considered instead of only the feature subsets provided by the input model. This helps the system move away from local optima and explore multiple different feature sets. Figure 9 shows the plot of fitness versus iteration. The accuracy obtained using the genetic algorithm is 72.4% with dimension 10,866.


Fig. 9 Genetic convergence curve

9.6 Binary Bat Algorithm

The bat algorithm (BA) [19] is a heuristic algorithm inspired by the advanced search mechanism of bats, which use echolocation to locate targets in a global search space. The BA is widely used in various optimization problems due to its excellent overall performance. Figure 10 shows the plot of fitness versus iteration. The accuracy obtained using the binary bat algorithm is 72.4% with dimension 10,866.

Fig. 10 Binary bat algorithm convergence curve


Fig. 11 Gray wolf optimizer convergence curve

9.7 Gray Wolf Optimizer

The gray wolf optimization algorithm (GWO) [18] is a newer metaheuristic optimization that mimics the behaviour of gray wolves, which hunt in a coordinated and cooperative manner. GWO differs from the others in terms of its model structure. Figure 11 shows the plot of fitness versus iteration. The accuracy obtained using the gray wolf optimizer is 77% with dimension 14,236.

10 Results and Discussion

To evaluate the proposed model, we used the MITOS-ATYPIA-14 challenge data set from the 22nd International Conference on Pattern Recognition. The data set contains images at 20× and 40× magnification; we used the 40× images and extracted patches from them, for a total of 432 patches. Three approaches were compared. In the first, features were extracted and dimensionality reduction was applied to each feature family separately, for three different numbers of components, before the reduced features were combined and passed to a classifier to compute the accuracy. In the second, the features were extracted and combined first, and dimensionality reduction with the same three numbers of components was then applied before classification. In the third, once the features were extracted, different feature selection techniques were applied to find an optimal subset, with a support vector machine (SVM) as the classifier. Table 10 reports the accuracy obtained with the different optimizers, together with the number of features extracted and the number of features retained, for these approaches.
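The sketch below illustrates the first two of these strategies under simple assumptions: PCA is applied either to each feature family before concatenation or once to the concatenated vector, and an SVM is scored by cross-validation. The six random blocks, 100 components, and default SVM stand in for the real extracted features and tuned settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def pca_before(blocks, n_components=100):
    """Reduce each feature family separately, then concatenate."""
    return np.hstack([PCA(n_components=n_components).fit_transform(b) for b in blocks])

def pca_after(blocks, n_components=100):
    """Concatenate all feature families first, then reduce once."""
    return PCA(n_components=n_components).fit_transform(np.hstack(blocks))

rng = np.random.default_rng(0)
blocks = [rng.normal(size=(432, 300)) for _ in range(6)]   # placeholder feature families
y = rng.integers(0, 2, size=432)                           # placeholder labels
for name, X in (("PCA before combining", pca_before(blocks)),
                ("PCA after combining", pca_after(blocks))):
    print(name, round(cross_val_score(SVC(), X, y, cv=5).mean(), 3))
```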


Table 10 Best accuracy achieved and combination (classifier accuracies in %)

Feature extracted | PCA | No. of features reduced | Best optimizer | SVM | Random forest | KNN
HOG, CenSurE, ORB, corner, edge, pixel | Before | 100 (600) | TPOT | 81.7 | 73.1 | 72.0
 | Before | 200 (1200) | TPOT | 82.2 | 77.0 | 62.0
 | Before | 400 (2000) | TPOT | 87.4 | 80.2 | 73.5
 | After | 400 | Grid | 75.6 | 74.7 | 73.6
 | None | 25,635 | BOGP | 76.5 | 73.3 | 71.3
 | Red deer | 25,635 | – | 83.7 | 0 | 0

11 Conclusion

This paper presents a comprehensive study of different combinations of feature extraction and feature selection methods in order to determine the combination with the highest accuracy. The proposed method for mitosis detection in H&E image patches is based on various features and classifiers. The methodology uses six types of feature extraction methods together with different combinations of the extracted features and two dimensionality-reduction strategies, one applied before combining the features and the other after. The results are passed to three types of classifiers, support vector machine (SVM), K-nearest neighbor (KNN), and random forest, with different hyperparameter optimizers used to find the best hyperparameters for each classifier. An accuracy of 87.4% was obtained by performing dimensionality reduction before combining all the features, with a support vector machine (SVM) as the classifier and the tree-based pipeline optimization tool (TPOT) as the optimizer. Instead of using the entire feature set, we also searched for a subset that performs efficiently by applying seven different feature selection methods; the red deer optimization technique produced an accuracy of 83.9% with a support vector machine (SVM) as the classifier.

References 1. Agrawal M, Konolige K, Blas MR (2008) CenSurE: center surround extremas for realtime feature detection and matching. In: Forsyth D, Torr P, Zisserman A (eds) Computer vision— ECCV 2008. Lecture notes in computer science, vol 5305. Springer, Berlin, Heidelberg, pp 102–115. https://doi.org/10.1007/978-3-540-88693-88 2. Albayrak A, Bilgin G (2016) Mitosis detection using convolutional neural network based features. In: 2016 IEEE 17th international symposium on computational intelligence and informatics (CINTI). IEEE. https://doi.org/10.1109/cinti.2016.7846429 3. Albayrak A, Bilgin G (2013) Breast cancer mitosis detection in histopathological images with spatial feature extraction, p 90670L. https://doi.org/10.1117/12.2050050 4. Breiman L (2001) Random forests. Mach Learn 45(1):5–32. https://doi.org/10.1023/A:101093 3404324


5. Cai D, Sun X, Zhou N, Han X, Yao J (2019) Efficient mitosis detection in breast cancer histology images by RCNN. In: 2019 IEEE 16th international symposium on biomedical imaging (ISBI 2019). IEEE. https://doi.org/10.1109/isbi.2019.8759461 6. Cui FY, Zou LJ, Song B (2008) Edge feature extraction based on digital image processing techniques. In: 2008 IEEE international conference on automation and logistics. IEEE. https:// doi.org/10.1109/ical.2008.4636554 7. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society conference on computer vision and pattern recognition (CVPR’05), vol 1. IEEE, pp 886–893. https://doi.org/10.1109/CVPR.2005.177 8. Evgeniou T, Pontil M: Support vector machines: theory and applications. In: Paliouras G, Karkaletsis V, Spyropoulos CD (eds) Machine learning and its applications. Lecture notes in computer science, vol 2049. Springer, Berlin, Heidelberg, pp 249–257. https://doi.org/10.1007/ 3-540-44673-712 9. Fathollahi-Fard AM, Hajiaghaei-Keshteli M, Tavakkoli-Moghaddam R (2020) Red deer algorithm (RDA): a new nature-inspired meta-heuristic. Soft Comput 24(19):14637–14665. https:// doi.org/10.1007/s00500-020-04812-z 10. Gandomi AH, Yang XS, Alavi AH (2011) Cuckoo search algorithm: a metaheuristic approach to solve structural optimization problems. Eng Comput 29(1):17–35. https://doi.org/10.1007/ s00366-011-0241-y 11. Gandomkar Z, Brennan P, Mello-Thoms C (2017) Determining image processing features describing the appearance of challenging mitotic figures and miscounted nonmitotic objects. J Pathol Inform 8(1):34. https://doi.org/10.4103/jpi.jpi2217 12. Gao XZ, Govindasamy V, Xu H, Wang X, Zenger K (2015) Harmony search method: theory and applications. Comput Intell Neurosci 2015:1–10. https://doi.org/10.1155/2015/258491 13. Guha R, Chatterjee B, Hassan SKK, Ahmed S, Bhattacharyya T, Sarkar R (2021) Py FS: a python package for feature selection using meta-heuristic optimization algorithms. In: Computational intelligence in pattern recognition. Springer, Singapore, pp 495–504. https://doi.org/ 10.1007/978-981-16-2543-542 14. Guo G, Wang H, Bell D, Bi Y (2004) KNN model-based approach in classification 15. Irshad H (2013) Automated mitosis detection in histopathology using morphological and multichannel statistics features. J Pathol Inform 4(1):10. https://doi.org/10.4103/2153-3539.112695 16. Jain M (2020) An analytical approach on feature extraction for image classification with any regression method using matlab. Int J Adv Sci Technol 29(06):9057–9075. http://sersc.org/jou rnals/index.php/IJAST/article/view/31978 17. Mirjalili S, Lewis A (2016) The whale optimization algorithm. Adv Eng Softw 95:51–67. https://doi.org/10.1016/j.advengsoft.2016.01.008 18. Mirjalili S, Mirjalili SM, Lewis A (2014) Grey wolf optimizer. Adv Eng Softw 69:46–61. https://doi.org/10.1016/j.advengsoft.2013.12.007 19. Mirjalili S, Mirjalili SM, Yang XS (2013) Binary bat algorithm. Neural Comput Appl 25(3– 4):663–681. https://doi.org/10.1007/s00521-013-1525-5 20. Nair LS, Ramkishor, RP, Sugathan G, Gireesh KV, Nair AS (2021) Mitotic nuclei detection in breast histopathology images using YOLOv4. In: 2021 12th international conference on computing communication and networking technologies (ICCCNT). IEEE. https://doi.org/10. 1109/icccnt51525.2021.9579969 21. 
Tashk A, Helfroush MS, Danyali H, Akbarzadeh M (2013) An automatic mitosis detection method for breast cancer histopathology slide images based on objective and pixel-wise textural features classification. In: The 5th conference on information and knowledge technology. IEEE. https://doi.org/10.1109/ikt.2013.6620101 22. Tashk A, Helfroush MS, Danyali H, Akbarzadeh-jahromi M (2015) Automatic detection of breast cancer mitotic cells based on the combination of textural, statistical and innovative mathematical features. Appl Math Model 39(20):6165–6182. https://doi.org/10.1016/j.apm. 2015.01.051 23. Wang H, Cruz-Roa A, Basavanhally A, Gilmore H, Shih N, Feldman M, Tomaszewski J, Gonzalez F, Madabhushi A (2014) Cascaded ensemble of convolutional neural networks


and handcrafted features for mitosis detection. In: Gurcan MN, Madabhushi A (eds) SPIE proceedings. SPIE. https://doi.org/10.1117/12.2043902 24. Yang L, Shami A (2020) On hyperparameter optimization of machine learning algorithms: theory and practice. Neurocomputing 415:295–316. https://doi.org/10.1016/j.neucom.2020. 07.061

Chapter 56

Review and Comparative Analysis of Unsupervised Machine Learning Application in Health Care Mantas Lukauskas and Tomas Ruzgas

1 Introduction

Artificial intelligence has made its most significant leap in the last two decades, although the term was first mentioned back in 1956. The ever-increasing computing power of computers has driven this leap. Today almost everyone is confronted with artificial intelligence every day, in fields ranging from product manufacturing (quality control, inventory management) to everyday life (recommendation systems, image recognition). Health care is a significant area for applying artificial intelligence, and it can be applied there in a variety of ways. Unsupervised learning is a ubiquitous type of machine learning because it does not require prior class labels. Clustering distinguishes well-separated groups and creates profiles of these groups; in health care, forming such groups makes it possible to give more precise recommendations for the different profiles created. Scientific research shows that clustering algorithms can be applied to identify different diseases. K-means, hierarchical agglomerative clustering, and k-modes are still the most widely used algorithms, as they are fast and work well with specific datasets. Clustering methods can be used for different applications in health care, for example to identify migraine, breast cancer, heart disease and diabetes, Parkinson's or Huntington's disease, or various psychological and psychiatric disorders.

M. Lukauskas (B) · T. Ruzgas Department of Applied Mathematics, Kaunas University of Technology, Kaunas, Lithuania e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al, (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_56


Moreover, these applications are just a few among many others. There are many different clustering methods, but the best-known and most widely used is k-means. This paper aims to present the current methods of unsupervised learning in health care and the results they achieve, in order to expand knowledge about possible clustering methods in health care. The paper presents the results of different clustering methods and a comparative analysis of them. The second section presents machine learning applications in health care, the third presents the methods and datasets, and the fourth reports the results of different clustering methods on real healthcare datasets. Finally, the last section presents conclusions and future research directions.

2 Machine Learning Applications in Health Care

Machine learning is a field of computer science that aims to 'teach' computers to recognize specific patterns and the connections between individual data points. Machine learning can be regarded as a part of artificial intelligence and is often described as the area of artificial intelligence that aims to improve performance by using computer technology to increase operational efficiency. Machine learning algorithms use existing data to develop practical models [1]. There are three distinct categories of machine learning: unsupervised learning, supervised learning, and reinforcement learning; a fourth, less prominent category is semi-supervised learning, as shown in Fig. 1. In reinforcement learning, an agent receives information about the environment and learns to choose actions that maximize an objective function [2]. Supervised learning uses labeled historical data, such as historical sales or user behavior, to train a model: the model receives input data with labels and adjusts its weights to minimize the error on the predicted outcome (classification, regression) [3]. Semi-supervised learning uses slightly different data: a small portion of labeled data together with a large amount of unlabeled data, from which predictive models are trained [4]. Although there are several types of machine learning, the most important type in this work is unsupervised learning. In unsupervised learning, data with no labels at all are used.

Fig. 1 Types of machine learning: reinforcement learning, supervised learning, unsupervised learning, and semi-supervised learning

Since the classes of the available observations are not known in advance, these methods attempt to find the most similar observations. The most similar observations can be understood as belonging to a particular group and sharing a specific hidden connection. It is difficult to determine precisely how these groups were formed because of the lack of pre-existing classes, but many different metrics can assess the similarities and differences between observations within a group [5]. Creating groups makes the data much easier to interpret and allows profiles to be built for these groups. The division of observations into similar groups is called data clustering. Cluster analysis is widely used in various disciplines: business insights, image recognition, web searches, and more.

The methods of cluster analysis are difficult to classify sharply because the categories often overlap, but judged by their principle of operation the primary families are partitioning, density-based, and hierarchical methods. Partitioning methods divide the available data into k groups, where each observation belongs to exactly one group; most of them are distance-based. In the first cycle, the data are divided into k groups, and subsequent iterations attempt to find the most appropriate division so that the elements within a cluster are as similar as possible (the distance between them is smallest) while observations in different clusters are as different as possible (the distance between them is largest). The most commonly used partitioning algorithms are k-means and k-medoids, which iterate to find the optimal cluster composition; they are particularly good at discovering spherical clusters when the number of observations is relatively small or medium. Hierarchical methods create a hierarchical decomposition of the observations and can be divided into two subgroups: agglomerative (bottom-up) and divisive; in agglomerative methods, the initial number of clusters equals the number of observations [6]. Density-based clustering methods, unlike the other families, can discover clusters that are not spherical. The choice of data clustering methods is currently quite wide: spectral clustering, density-based spatial clustering, MULIC, DENCLUE, SOMs (NeuralNet), SVM, HIERDENC, UNIC [7], k-medoids (PAM) [8], Gaussian mixture [8], TCLUST [8], trimmed k-means [9], and deep embedded clustering [10]. It is essential to note that clustering methods are not universal, so they are chosen based on the data at hand.

In health care, data clustering can be applied quite widely. Clustering can be performed on different patient data to profile patients, adapt treatments to them better, and make more accurate health recommendations [11]. Similar persons can be distinguished based on their various characteristics; given the similarity of different individuals, and knowing that a drug selected for a particular patient has worked, the same drug can be assigned to other individuals in the same cluster. Research reports many applications of clustering algorithms to detect different diseases [12–16].
One example is the detection of cancer from patient data (lung, breast, and other forms of cancer) [17]; others include Parkinson's disease [18, 19], migraine [20], various psychological and psychiatric disorders [21], heart disease and diabetes [22], Alzheimer's disease [23], and multiple predictive analytics tasks [24], among many others. Data clustering allows specific groups of patients to be identified and significantly different observations to be found that can be described as outliers. When different vital signs, test results, and other indicators are assessed together, individuals can be evaluated more comprehensively than by assessing individual indicators alone. Cluster analysis makes it possible to determine which patients have a unique indicator, or combination of indicators, that could otherwise be overlooked; such an assessment helps identify much earlier the patients who need more attention and further examination. It is also important to mention that data clustering can be applied to structured and unstructured data. One example is clustering documents according to the text they contain: clustering documents, prescriptions, and other medical-related textual data can considerably reduce physicians' workload [25]. These algorithms are also used to summarize information from documents, grouping the documents themselves as well as their parts, prescriptions, and other textual information that previously could not be analyzed so easily. Image analytics is another increasingly popular clustering application. Clustering can be applied to group specific images and highlight key areas within an image (X-rays, MRIs, and other images), allowing a doctor to evaluate existing X-rays and other images much faster, since attention can be focused on suspicious images only. Clustering in this area also reduces the risk of human error, for example due to a doctor's limited experience [26–28]. One further example is hierarchical clustering applied to healthy and pathological aortic arches [29]; that study aimed to determine the influence of different clustering metrics.

3 Methods and Datasets Used in Research

This section provides information on the datasets and methods used in the empirical study of the accuracy of medical data clustering. More than 150 datasets (Mice protein [30], Flame, cancer, E. coli, and others) were used throughout the larger study; Table 2 reports the results for 3 of these datasets, whose parameters are listed in Table 1.

Table 1 Datasets used in this paper's results section

Dataset | Samples | Dimensions | Classes
Flame | 240 | 2 | 2
E. coli | 327 | 7 | 5
Cancer | 569 | 30 | 2


Table 2 Sample of research results (only three datasets with 12 methods) (authors' results). Each cell lists NMI / AMI / ARI; empty cells in the original are shown as –

Method | Flame (NMI / AMI / ARI) | E. coli (NMI / AMI / ARI) | Cancer (NMI / AMI / ARI)
ADBSCAN | 1.000 / 1.000 / 1.000 | 0.499 / 0.490 / 0.499 | 0.265 / 0.61 / 0.220
DBSCAN | 1.000 / 1.000 / 1.000 | 0.499 / 0.490 / 0.499 | 0.265 / 0.61 / 0.220
OPTICS | 0.992 / 0.992 / 0.995 | 0.507 / 0.498 / 0.461 | 0.141 / 0.138 / 0.117
BIRCH | 0.025 / 0.019 / 0.022 | 0.734 / 0.723 / 0.778 | 0.566 / 0.565 / 0.689
FINCH | 0.092 / 0.086 / 0.024 | 0.650 / 0.634 / 0.649 | 0.585 / 0.584 / 0.706
CURE | 0.234 / 0.229 / 0.188 | – / – / – | 0.349 / 0.348 / 0.423
GMM | – / – / – | 0.643 / 0.626 / 0.641 | 0.668 / 0.667 / 0.780
BGMM | – / – / – | 0.649 / 0.639 / 0.681 | 0.653 / 0.652 / 0.767
Elastic | 0.140 / 0.133 / 0.139 | 0.406 / 0.381 / 0.262 | 0.262 / 0.261 / 0.356
AP | 0.435 / 0.416 / 0.151 | 0.552 / 0.512 / 0.226 | 0.271 / 0.256 / 0.058
AGG | 1.000 / 1.000 / 1.000 | 0.726 / 0.714 / 0.748 | 0.416 / 0.416 / 0.538
Spectral | – / – / – | 0.617 / 0.601 / 0.585 | 0.554 / 0.553 / 0.623

In the original table, bold values mark the best result obtained for each dataset.

More than 30 different clustering methods were used in the study; the main ones are discussed below. The first clustering method is density-based spatial clustering of applications with noise (DBSCAN). It aims to find higher-density areas in multidimensional data and thereby determine which observations should be assigned to the same cluster, and it requires only a few parameters: the neighbourhood radius and the minimum number of points. Adaptive density-based spatial clustering of applications with noise (ADBSCAN) differs from DBSCAN in that the value of Eps is not static but adaptive and variable, and thus adjusts to the available data. The OPTICS algorithm is quite similar in operation to the methods above, but its advantage is that it is reported to distinguish clusters of varying density better [31]. Balanced iterative reducing and clustering using hierarchies (BIRCH) is an unsupervised data mining algorithm that performs hierarchical clustering over very large datasets [32]; with modifications, it can also be used to accelerate k-means clustering and Gaussian mixture modeling with the expectation-maximization algorithm [33]. The FINCH algorithm first computes a partition from first-neighbor relations and then recursively merges clusters by computing distances between them [34]. Clustering Using Representatives (CURE) is an efficient clustering algorithm for large databases; compared with k-means clustering, it is robust to outliers and can identify clusters with non-spherical shapes and varying sizes. A Gaussian mixture model (GMM) is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters; the Bayesian Gaussian mixture model (BGMM) is a possible modification of it.
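As a concrete illustration of how several of these methods can be run side by side, the sketch below uses scikit-learn implementations on the built-in breast cancer data as a stand-in for the Cancer dataset; the parameter values are illustrative and not the settings used to produce Table 2.

```python
from sklearn import cluster, datasets, mixture
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)          # clustering is distance-based, so scale

models = {
    "DBSCAN": cluster.DBSCAN(eps=2.0, min_samples=5),
    "OPTICS": cluster.OPTICS(min_samples=5),
    "BIRCH": cluster.Birch(n_clusters=2),
    "Agglomerative": cluster.AgglomerativeClustering(n_clusters=2),
    "GMM": mixture.GaussianMixture(n_components=2, random_state=0),
    "BGMM": mixture.BayesianGaussianMixture(n_components=2, random_state=0),
}
labels = {name: model.fit_predict(X) for name, model in models.items()}
```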


4 Results

This section describes the performance metrics used for the clustering algorithms and presents the results obtained during the study. It is essential to choose appropriate metrics, and the choice depends on whether the classes are known in advance. Pre-known classes exist only for experimental data; in real situations they are not known in advance. NMI, AMI, the Silhouette Coefficient, Davies-Bouldin, J-Score, and other metrics are used to assess clustering accuracy. The metrics used in this work are described below. Normalized mutual information (NMI) is a clustering assessment metric based on mutual information that can be calculated as [35]:

NMI(X, Y) = I(X, Y) / sqrt(H(X) · H(Y))   (1)

Here X can be understood as the predicted labels, Y as the true labels, H as the entropy, and I as the mutual information between predicted and true labels. Normalizing the MI metric improves its sensitivity to differences between clusterings [36]. Another modification of MI is adjusted mutual information (AMI), which allows one to assess whether the clusters truly provide more information [37]. The Rand index (RI), which considers all pairs of observations, is also used to assess the similarity of two clusterings: for each pair it checks whether the actual and predicted assignments place the pair in the same cluster or in different clusters. More often, the adjusted Rand index (ARI) is used instead of the Rand index, as it can be interpreted somewhat more reliably [38, 39]; the closer this estimate is to zero, the worse the clustering, and the closer to one, the better. Table 2 presents some of the results examined in the study. Based on these results, different clustering algorithms behave differently on different datasets, from the most commonly used algorithms, e.g., k-means, to more sophisticated ones. For the Flame dataset, the best methods are ADBSCAN, DBSCAN, and agglomerative clustering. The best results for the healthcare E. coli dataset were obtained with the BIRCH clustering method. Lastly, the best results for the second healthcare dataset, cancer, were obtained with a different clustering method, the Gaussian mixture model. These results show that different clustering methods work best for different datasets.
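These external metrics are available directly in scikit-learn; the snippet below scores the predicted labels from the previous sketch against the known classes. The average_method="geometric" argument is passed so that the NMI normalization matches Eq. (1).

```python
from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                             normalized_mutual_info_score)

for name, pred in labels.items():
    nmi = normalized_mutual_info_score(y, pred, average_method="geometric")
    ami = adjusted_mutual_info_score(y, pred)
    ari = adjusted_rand_score(y, pred)
    print(f"{name:15s} NMI={nmi:.3f} AMI={ami:.3f} ARI={ari:.3f}")
```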

5 Conclusions

The primary purpose of this short article is to give a brief overview of machine learning, with particular attention to unsupervised machine learning and clustering.


Studies have shown that data clustering can distinguish different groups well and thus facilitate the work of medical professionals, for example by analyzing data, separating groups, making recommendations, and analyzing X-ray images (COVID-19 and other diseases). Every year more research is carried out that allows these algorithms to be applied in health care and helps doctors assess a situation and start treatment faster and more accurately, so it can be assumed that the use of AI and clustering in health care will only grow in the coming years. In this paper, different datasets were used to compare multiple clustering methods. Based on the results obtained, for the Flame dataset the best methods are ADBSCAN, DBSCAN, and agglomerative clustering; the best results for the E. coli dataset were obtained with the BIRCH clustering method; and the best results for the second healthcare dataset, cancer, were obtained with a different clustering method, the Gaussian mixture model. These results show that different clustering methods work best for different datasets.

Limitations. This article reports only a subset of the datasets used in the larger study, so a broader comparison of datasets and methods would allow the different clustering methods to be assessed better. Different dimensionality reduction techniques could also be used in the data preparation step for more accurate results; such use would allow an even more extensive comparison of data clustering results.

Future research. This paper is the second step in a larger overall study of clustering algorithms. There are ongoing contacts with foreign researchers regarding the inclusion of their clustering methods, so the comparative study will be expanded with new methods. In addition, the authors are currently developing a new clustering method based on the inversion formula, which is expected to be suitable for application in health care; in the next step of the research, this new method will be applied to healthcare data clustering and compared with other methods.

References 1. Mohammed M, Khan MB, Bashier EBM (2016) Machine learning: algorithms and applications. CRC Press 2. Chollet F (2021) Deep learning with Python. Simon and Schuster 3. Mohri M, Rostamizadeh A (2012) A. Talwalkar Foundations of machine learning. MIT Press, Cambridge, MA, USA 4. Van Engelen JE, Hoos HH (2020) A survey on semi-supervised learning. Mach Learn 109:373– 440 5. Duda RO, Hart PE, Stork DG (2001) Unsupervised learning and clustering. Pattern classification, 2nd edn. 6. Witten IH, Frank E (2002) Data mining: practical machine learning tools and techniques with Java implementations. ACM SIGMOD Rec 31:76–77 7. Leopold N, Rose O (2020) UNIC: a fast nonparametric clustering. Pattern Recogn 100:107117


8. El Attar A, Khatoun R, Birregah B, Lemercier M (2014) Robust clustering methods for detecting smartphone’s abnormal behavior. In: 2014 IEEE wireless communications and networking conference (WCNC). IEEE, pp 2552–2557 9. Cuesta-Albertos JA, Gordaliza A, Matrán C (1997) Trimmed k-means: an attempt to robustify quantizers. Ann Stat 25:553–576 10. Ren Y, Hu K, Dai X, Pan L, Hoi SC, Xu Z (2019) Semi-supervised deep embedded clustering. Neurocomputing 325:121–130 11. Nezhad MZ, Zhu D, Sadati N, Yang K, Levi P (2017) SUBIC: a supervised bi-clustering approach for precision medicine. In: 2017 16th IEEE international conference on machine learning and applications (ICMLA). IEEE, pp 755–760 12. Nugent R, Meila M (2010) An overview of clustering applied to molecular biology. Stat Methods Mol Biol 369–404 13. Li X, Zhu F (2013) On clustering algorithms for biological data. Engineering 5. https://doi. org/10.4236/eng.2013.510B113 14. Nithya N, Duraiswamy K, Gomathy P (2013) A survey on clustering techniques in medical diagnosis. Int J Comput Sci Trends Technol (IJCST) 1:17–23 15. Wiwie C, Baumbach J, Röttger R (2015) Comparing the performance of biomedical clustering methods. Nat Methods 12:1033–1038 16. Chen C-H (2014) A hybrid intelligent model of analyzing clinical breast cancer data using clustering techniques with feature selection. Appl Soft Comput 20:4–14 17. Polat K (2012) Classification of Parkinson’s disease using feature weighting method on the basis of fuzzy C-means clustering. Int J Syst Sci 43:597–609 18. Nilashi M, Ibrahim O, Ahani A (2016) Accuracy improvement for predicting Parkinson’s disease progression. Sci Rep 6:1–18 19. Wu Y, Duan H, Du S (2015) Multiple fuzzy c-means clustering algorithm in medical diagnosis. Technol Health Care 23:S519–S527 20. Trevithick L, Painter J, Keown P (2015) Mental health clustering and diagnosis in psychiatric in-patients. BJPsych Bulletin 39:119–123 21. Yilmaz N, Inan O, Uzer MS (2014) A new data preparation method based on clustering algorithms for diagnosis systems of heart and diabetes diseases. J Med Syst 38:48–59 22. Nikas JB, Low WC (2011) Application of clustering analyses to the diagnosis of Huntington disease in mice and other diseases with well-defined group boundaries. Comput Methods Programs Biomed 104:e133–e147 23. Alashwal H, El Halaby M, Crouse JJ, Abdalla A, Moustafa AA (2019) The application of unsupervised clustering methods to Alzheimer’s disease. Front Comput Neurosci 13:31 24. Smys S (2019) Survey on accuracy of predictive big data analytics in healthcare. J Inf Technol 1:77–86 25. Renganathan V (2017) Text mining in biomedical domain with emphasis on document clustering. Healthc Inform Res 23:141–146 26. Suetens P, Bellon E, Vandermeulen D, Smet M, Marchal G, Nuyts J, Mortelmans L (1993) Image segmentation: methods and applications in diagnostic radiology and nuclear medicine. Eur J Radiol 17:14–21 27. Boudraa A-O, Zaidi H (2006) Image segmentation techniques in nuclear medicine imaging. Quantitative analysis in nuclear medicine imaging. Springer, pp 308–357 28. Qu P, Zhang H, Zhuo L, Zhang J, Chen G (2017) Automatic tongue image segmentation for traditional Chinese medicine using deep neural network. In: International conference on intelligent computing. Springer, pp 247–259 29. 
Bruse JL, Zuluaga MA, Khushnood A, McLeod K, Ntsinjana HN, Hsia T-Y, Sermesant M, Pennec X, Taylor AM, Schievano S (2017) Detecting clinically meaningful shape clusters in medical image data: metrics analysis for hierarchical clustering applied to healthy and pathological aortic arches. IEEE Trans Biomed Eng 64:2373–2383 30. Higuera C, Gardiner KJ, Cios KJ (2015) Self-organizing feature maps identify proteins critical to learning in a mouse model of down syndrome. PLoS ONE 10:e0129126


31. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: kdd, pp 226–231 32. Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Rec 25:103–114 33. Lang A, Schubert E (2020) BETULA: numerically stable CF-trees for BIRCH clustering. In: International conference on similarity search and applications. Springer, pp 281–296 34. Sarfraz S, Sharma V, Stiefelhagen R (2019) Efficient parameter-free clustering using first neighbor relations. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8934–8943 35. Strehl A, Ghosh J (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617 36. Wu J, Xiong H, Chen J (2010) Adapting the right measures for k-means clustering. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 877–886 37. Vinh NX, Epps J, Bailey J (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 11:2837–2854 38. Steinley D (2004) Properties of the Hubert-Arable Adjusted Rand Index. Psychol Methods 9:386 39. Rand WM (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66:846–850

Chapter 57

A Systematic Literature Review on Cybersecurity Threats of Virtual Reality (VR) and Augmented Reality (AR) Abrar Alismail, Esra Altulaihan, M. M. Hafizur Rahman, and Abu Sufian

1 Introduction

Virtual reality (VR) and augmented reality (AR) are two closely related terms that are sometimes used interchangeably and fall under the same umbrella: each aims to expand the user's sensory environment by integrating reality with augmenting technology. VR provides a virtual, alternative environment for the user to experience, while AR enhances the actual reality the user perceives. The differences between AR and VR are compared in Table 1. In the 1970s and 1980s, virtual reality and augmented reality research grew rapidly. The Virtual Interface Environment Workstation (VIEW) framework was created at NASA Ames Research Center in the mid-1980s; it incorporated a head-mounted device and gloves to enable haptic feedback. At the 2012 public E3 videogame expo, the Oculus Rift appeared as the earliest of the current generation of virtual reality devices [1]. Figure 1 shows the timeline of how VR/AR technology has evolved from its start to the present.

The original version of this chapter was revised: The complete acknowledgement has been updated. The correction to this chapter can be found at https://doi.org/10.1007/978-981-19-6004-8_68

A. Alismail (B) · E. Altulaihan · M. M. H. Rahman Department of Computer Networks and Communications, CCSIT, King Faisal University, Al Hassa 31982, Saudi Arabia e-mail: [email protected] E. Altulaihan e-mail: [email protected] M. M. H. Rahman e-mail: [email protected] A. Sufian Department of Computer Science, University of Gour Banga, Malda, West Bengal, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023, corrected publication 2023 I. J. Jacob et al, (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_57


Table 1 Comparison between AR and VR

VR | AR
Completely virtual setting | Real-world settings are used
Changes our vision of the world | Adds augmented elements to the real world
User is controlled by the system | User controls his or her presence in the real world
VR is 25% real and 75% virtual | AR is 25% virtual and 75% real
Requires a headset device | No headset needed; AR can be accessed with a smartphone
Enhances a fictional world, mainly for gaming | Both real and virtual worlds are enhanced

Fig. 1 Timeline of VR and AR

VR and AR provide many new and exciting opportunities for innovation, but they also present new cyber challenges. VR, in particular, completely disconnects users from the outside world because of its visual and auditory nature, so considering the physical security and safety of a user's environment is always the most important step. In AR as well, it is imperative that users keep a high level of situational awareness, especially in more immersive environments. Conventional cybersecurity threats have a limited effect on users, whereas VR and AR threats have a significant impact that endangers their safety, privacy, and security (SPS) [2]. Despite its importance to information security and the fact that it affects every device, identity and access management is often overlooked when AR and VR systems are adopted. In some situations, avatars might allow you to identify the people you are working with, but there is also a risk that an avatar might be copied and used by someone uninvited. Used in the right way, AR and VR can significantly improve identity and access management, for example by using eye-tracking sensors to check your identity before granting access to the rest of the system. In spite of the challenges, there is a great deal of potential for virtual and augmented reality in the workplace. Therefore, this study aimed: 1. To review the recent threats and risks that have been associated with VR and AR. 2. To review existing mitigation techniques for VR and AR security threats and identify the methods that individuals and organizations can use to protect themselves from cyber-attacks that occur via VR and AR.


2 Problem Statement

The rapid development of virtual reality and augmented reality has made them disruptive technologies that have changed the way we interact with information and with each other [3]. The immersive quality of virtual and augmented reality may also make them more vulnerable to cybersecurity attacks. Vulnerabilities in VR and AR systems may result in severe consequences and major cybersecurity attacks such as malware attacks, data loss, and data leakage. We must therefore learn about the threats posed by these technologies and find solutions to mitigate their risks; knowing the types of attacks that can be mounted and the techniques used to defend against them is important for protecting ourselves.

3 Objective

Technology has permeated nearly every aspect of our lives. We are much more connected than ever before, whether we are operating machines with virtual reality, designing and marketing with augmented reality, or simply surfing the Internet. The Internet has become one of the most important resources in society and is available everywhere with a WiFi connection. As we rely more and more on this technology, it becomes increasingly vulnerable to hacking and malware that target our devices. A company's use of AR/VR to innovate and create a competitive advantage can make this a concern, especially when the technology is used to collect data for further product improvement. As a result, this paper explores the vulnerabilities, threats, and risks associated with VR/AR environments as well as the protection methods that can be applied to them, and it outlines and discusses techniques for mitigating risks in VR and AR systems. The goal of this project is to improve VR and AR security by increasing awareness, including among individuals and organizations that have been or may be victimized by cyber-crime involving VR/AR.

4 Research Questions This paper will attempt to address the following three important questions: 1. What are the current and potential cybersecurity threats facing VR and AR systems? 2. What are the most common mitigation techniques to control VR and AR threats? 3. How can we improve cybersecurity for VR and AR systems?


5 Scope of the Study

This paper focuses on cybersecurity attacks in VR and AR. It discusses the most common cybersecurity attacks that have occurred in VR and AR and examines their impact in terms of direct financial losses and indirect effects on users. It provides a comprehensive assessment and classification of attack techniques targeting VR and AR users. Furthermore, it guides VR and AR developers toward security countermeasures that can be adopted to protect VR and AR from such attacks and avoid financial losses, and it illustrates how VR and AR developers are improving their security policies and procedures to better safeguard users' data, detect cybersecurity attacks early, and recover from financial and data losses. The paper is intended as a source of education and awareness for VR and AR developers on the importance of choosing an appropriate risk management technique that contributes to mitigating the impact of cybersecurity risks.

6 Expected Results

At the end of the project, we expect to provide a paper that helps raise awareness about VR and AR security risks and how to mitigate them. To accomplish this goal, the paper is expected to include:

1. Recent threats and risks that have been associated with VR and AR.
2. An analysis of VR/AR attacks and threats.
3. The mitigation techniques for VR and AR risks.
4. Ways for individuals and organizations to protect themselves from VR and AR attacks.

7 Selection of Research Papers for Review

The search follows PRISMA, which proceeds through four stages. In the identification stage, the Saudi Digital Library and Google Scholar databases were searched with the following inclusion criteria: papers that describe cybersecurity threats in virtual reality and augmented reality, and papers published between January 2005 and March 2022. Table 2 shows the four exclusion criteria: papers that do not address VR, AR, or MR security; papers not written in English; papers not directly related to cybersecurity threats in virtual reality and augmented reality; and papers not available online. Academic journals or conference papers were specified as the source type. Figure 2 shows that a total of 5428 papers were identified in the identification stage; after removing duplicates, 3985 papers remained.

Table 2 Inclusion and exclusion criteria

S/N | Inclusion criteria | Exclusion criteria
1 | Journals, conferences, preprints, chapters | Does not address VR or AR or MR security
2 | Published between 2005 and 2022 | Not written in English
3 | Papers that describe cybersecurity threats in virtual reality and augmented reality | Papers not directly related to cybersecurity threats in virtual reality and augmented reality
4 | Mitigation techniques for VR and AR threats | Journals, conferences, preprints, papers not available online

Fig. 2 Schematic diagram of selection of papers for literature review by PRISMA

At the screening stage, of the 200 papers screened by title and abstract, 150 were excluded for not fitting the criteria closely. In the eligibility stage, 50 studies remained eligible to go to the final stage. In the inclusion stage, 50 articles were considered; 30 of them were excluded, leaving 20 articles for review. The most complex level of PRISMA is the screening stage, in which we must read the papers and exclude those that are unsuitable for a variety of reasons, including abstract-only availability, data duplication, and lack of sensitivity. Figure 3 illustrates the distribution of the chosen papers by year.


Fig. 3 Distribution of selected papers by years

Most of the selected papers were published between 2018 and 2022, which reflects the intention to address the most recent threats posed by VR and AR.

8 Literature Survey

This section highlights some noteworthy state-of-the-art works on VR and AR cybersecurity concerns; the literature review focuses on the vulnerabilities addressed and the mitigation techniques suggested. The study [3] shows that few VR users are aware of the cyber threats posed by virtual reality. The presence of an unreliable third party that depends on the data of AR users is identified as one of the weaknesses of augmented reality devices [4]; for effective and secure AR systems, the security policies for the input, output, and processing units that handle AR user data must be strengthened. Aggarwal and Singhal [6] state that augmented reality (AR) aims to enlarge actual reality by adding virtual components that enrich the user experience while interacting with the real world. Stobiecki [7] studied the possible challenges and threats of AR usage, where participants were able to immerse themselves in a virtual environment and treat virtual objects as real. Researchers [7] provide insights into


how security and privacy can be addressed in multiuser AR systems. The study [8] presented RubikAuth, a novel authentication scheme that uses a handheld controller to let users enter numbers from a virtual 3D cube. Syal and Mathew [10] reviewed the different attacks on MR systems, including performance issues, physical interface issues, and data aggregation attacks; their aim is to help researchers, developers, and neuroscientists take these issues into account, and to anticipate them, before AR technologies and applications are widely used in the real world. Such experiments must be done in an ethical and safe manner. Many security vulnerabilities have been discovered in educational VR applications, which pose a threat to the security, safety, and privacy of VRLE users; the study [11] proposed a "risk assessment framework" that uses attack trees to assess internal and external vulnerabilities. A survey of privacy policies in VR experiences (i.e., applications) and a code of ethics for VR developers have also been proposed: the study [12] shows that users and developers worry about three categories of risks, namely well-being, security, and privacy. The literature also includes an overview of data collection in AR/VR and its relationship to the broader landscape of digital technologies for information gathering and privacy protection. The study [13] analyzed the policies and regulations that govern user privacy in the augmented reality and virtual reality industry. The study [14] stated that the urgent need for granular authentication and authorization in commercial virtual reality applications calls for a multidimensional authentication approach. Virtual reality learning environments (VRLEs) are social and geographically distributed, which exposes them to attacks; researchers studied how design principles for VRLEs can provide users with a reliable level of privacy, and using the principle of least privilege they [15] reduced the disruption probability from 0.82. The study [16] stated that because there is a lack of investigation, implementation, and evaluation of data protection approaches in MR, there is an opportunity for developing, researching, and implementing security and privacy mechanisms that can be integrated with current MR systems. Since Web browsers do not provide AR support, AR browsers must resort to ad-hoc cross-origin mechanisms, which amplify existing threats such as cross-site scripting and clickjacking [17]. A computer vision-based side-channel attack employing a stereo camera to extract numerical passwords on touch-enabled mobile devices was also disclosed; the researchers [18] hope to raise awareness of possible security and privacy breaches from seemingly harmless VR and AR products that have been gaining in popularity. Sensors collect data from the user's real-world surroundings on a continuous basis, while feedback devices convey sensory data directly to the user, so as augmented reality technology becomes more widely used, new security and privacy concerns will emerge [19]; researchers need to figure out how to allow legitimate advertising while avoiding spam and deceit. Virtual reality [20] can be used for entertainment, simulation training, modeling, and visualization. In the medical field, BioSimMER is an example of a distributed, multi-user VR simulation training platform; unlike live training, virtual training simulations can be carried out with no risk to human life.


Table 3 Summary of existing studies: addressed threats and suggested mitigation

S/N | Authors | Pub. year | Addressed threats | Suggested mitigation
1 | Adams et al. [4] | 2018 | Privacy threats from data collection and security vulnerabilities such as data leakage affecting VR users | Authentication system to protect VR users and defend against VR cyber threats
2 | Dissanayake [5] | 2019 | Cybersecurity risks on AR concerning the confidentiality, integrity, and availability (CIA) of augmented reality users | Approaches to developing AR applications that are secure and safe
3 | Aggarwal and Singhal [6] | 2019 | Lack of use cases, legal and privacy concerns, digital fatigue, miniaturization issues, poor experience, social rejection | No mitigation techniques proposed
4 | Stobiecki [7] | 2018 | Cybercriminal attacks, including exploitation, data theft, and remote control of AR devices | Enhance AR users' understanding of the risks involved in their usage of AR technology
5 | Lebeck et al. [8] | 2018 | The privacy and security issues related to emerging AR technologies | No mitigation techniques proposed
6 | Mathis et al. [9] | 2021 | The need for usable and secure authentication in VR: established concepts (e.g., graphical PINs in 2D) are vulnerable to observation attacks and proposed alternatives are relatively slow | Using manipulable 3D objects for frequent authentications in VR
7 | Syal and Mathew [10] | 2020 | The various threats mixed reality faces, such as hidden security risks, latent data, and privacy risks, as well as the different attacks on MR systems, including performance issues, physical interface issues, and data aggregation attacks | Protection methods covering input, data access, output, interactivity, and device integrity
8 | Baldassi et al. [11] | 2020 | The computer security perspective on perceptual and sensory risks associated with AR | Framework for evaluating possible AR threats; by considering and addressing these issues now, future AR technologies can achieve their full potential
9 | Gulhane et al. [12] | 2018 | Cyber concerns linked with VRLEs, corresponding to threats to security, privacy, and safety (SPS) | Risk assessment approach, structured attack trees, policy change control during VRLE sessions
10 | Adams et al. [13] | 2019 | VR risks and design for safer experiences, including understanding end-user perceptions of risks and how, if at all, developers are addressing them | A mixed-methods approach to address human-centered privacy and security risks in VR and a "code of ethics" for VR development
11 | Dick [14] | 2021 | Data collection in AR/VR and its relationship to the broader landscape of digital technologies for information gathering and privacy protection | Recommendations on how to deal with the unique challenges VR/AR technologies present for privacy
12 | Viswanathan [15] | 2022 | The need for granular authentication and authorization in VR applications | A multidimensional authentication process for authenticating and authorizing virtual reality applications
13 | Valluripally et al. [16] | 2020 | The security and privacy of virtual reality applications | Framework for evaluating the security and privacy of virtual reality applications using attack tree theory and model checking statistics
14 | De Guzman et al. [17] | 2019 | MR risks, and the latest security and privacy work on MR | No mitigation techniques proposed
15 | McPherson et al. [18] | 2015 | Analysis of the security and privacy features of AR browsers | Guidelines to ensure that AR functionality is implemented in a secure manner

The study of interactions between people and information, with a focus on decision-making outcomes, is known as human-information interaction (HII). Display solutions [21] must figure out how to present telepresence users to local users in a way that is compatible both with the existing environment and with their relationship to the data. An output module [23] is a component of an augmented reality (AR) app that lets developers decide how and where AR material should be shown. Free-drawing allows apps to "free-draw" anywhere inside the user's view, while fine-grained AR objects manage visual content at the level of AR objects rather than windows. We have summarized the main findings of the existing studies in terms of addressed threats and suggested mitigation as depicted in Tables 3 and 4.

9 Results and Discussion

This section investigates and analyzes the questions posed in the research questions section; the results are based on the literature papers that have been reviewed.


Table 4 Summary of existing studies: addressed threats and suggested mitigation

S/N | Authors | Pub. year | Addressed threats | Suggested mitigation
16 | Chen et al. [19] | 2018 | The security and privacy threats in VR and AR | Sought to raise awareness about possible security and privacy vulnerabilities from seemingly innocuous VR/AR technologies
17 | Roesner et al. [20] | 2021 | The four major challenges associated with AR technologies: real-world interfaces, tensions between privacy and functionality, understanding and control, application permissions, and cross-cutting challenges | Presented two practical approaches for using AR to improve security and privacy based on personal views, information overlays, and easier authentication
18 | Chow et al. [21] | 2005 | The latency problem in immersive VR, defined as the delay between a user's actions and when those actions are reflected by the display | Presented two mitigation techniques based on the ARP system and a priority rendering technique for latency problems in virtual worlds; Level-of-Detail (LOD) management is a technique used in many graphics systems to minimize latency and improve system performance
19 | Spicer et al. [22] | 2017 | The four challenges of HII in terms of asymmetry in relative positions, asymmetry in local view, asymmetry in information, and asynchronous access and artifacts | Researchers should consider when and how MR might aid HII tasks to enhance results, keeping in mind that some activities will be better suited to spatial MR interfaces than others
20 | Lebeck et al. [23] | 2016 | The two major security and privacy challenges associated with AR platforms: input privacy, and output safety and privacy | Established three AR visual output models that balance OS control with application flexibility, based on windowing, free-drawing, and fine-grained AR objects; through these models, the operating system can regulate visual output at the granularity of individual AR objects controlled by applications, matching the elements that apps want to add to the user's viewpoint

1. What are the current and potential cybersecurity threats facing VR and AR systems? Figure 4 summarizes the results from the previous section for the most common threats in the selected papers. Violations of confidentiality, integrity, and availability (CIA), data theft, observation attacks on graphical 2D PINs, threats to security, privacy, and safety (SPS), weaknesses in granular authentication and authorization, and the latency problem are the most common threats in VR and AR. User authentication is the most essential security control, since user data in VR and AR applications is held by the application provider and users do not have complete control over their own data. Security, privacy, and safety (SPS) are the most commonly identified threats for VR and AR users, as seen in Fig. 4.


Fig. 4 Most common threats mentioned in VR and AR

in VR and AR applications is owned by the application provider, and users do not have complete control over their own data. Security, privacy, and safety (SPS) are the most commonly identified threats for VR and AR users, as seen in Fig. 4. 2. What are the most common mitigation techniques to control VR and AR threats? Figure 5 illustrates that development of an authentication system, development of secure and safe AR applications, implementation of a “code of ethics” during the development of VR application, utilization manipulable 3D objects for frequent authentications in VR, implementation of risk assessment approach, and utilization Level of-Detail (LOD) management are among the suggested mitigation techniques to address these threats. The most effective mitigation techniques and countermeasures for the VR and AR threats are adopting a “code of ethics” and using a risk assessment approach, as shown in Fig. 5. 3. How can we improve cybersecurity for VR and AR systems? Cybersecurity for virtual reality and augmented reality (VR/AR) is increasingly important for future developments in this field. These include development of an authentication system, development of secure and safe AR applications, and implementation of a “code of ethics” during the development of VR application. Furthermore, the use of manipulable 3D objects for frequent authentications in VR, the

772

A. Alismail et al.

Fig. 5 Most common mitigation techniques mentioned in VR and AR

deployment of a risk assessment approach, and the use of Level of Detail (LOD) management to create a secure and safe environment for VR and AR users.

10 Conclusions

In this paper, we present a systematic literature review of 20 existing research publications in virtual reality, augmented reality, and mixed reality. Most of the threats that virtual reality and augmented reality pose to users are hidden security risks and latent data and privacy risks. We then present how ARP can be utilized as a tool to enhance security and privacy. Once the technology has been deployed, it will be challenging to incorporate security and privacy controls; the moment has come to prioritize security and privacy in the development of emerging AR technology. This systematic literature review aims to raise public awareness of possible privacy and security breaches associated with virtual and augmented reality (VR and AR) environments, which are becoming increasingly popular. There are some shortcomings in this paper that need to be addressed in future research:

1. The developers of VR and AR applications do not reveal all cybersecurity vulnerabilities and keep them hidden, making them harder to access.
2. Future research will look into the influence of cybersecurity concerns on the stability of VR and AR applications.
3. Future research can build on this work by employing a systematic mixed-methods model to address human-centered privacy and security threats in VR and by incorporating a “code of ethics” into VR and AR application development.


Acknowledgements The authors would like to thank the anonymous reviewers for their insightful comments and suggestions to improve the clarity and quality of the paper. This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant No. GRANT932].

References

1. Laghari A, Jumani A, Kumar K, Chhajro M (2021) Systematic analysis of virtual reality & augmented reality. Int J Inf Eng Electron Bus
2. Truong TC, Zelinka I, Plucar J, Candik M, Šulc V (2020) Artificial intelligence & cybersecurity: past, presence, and future. Springer
3. Kumar ST (2018) Study of retail applications with virtual & augmented reality technologies. J Innov Image Process
4. Adams D, Bah A, Barwulor C, Musabay N, Pitkin K, Redmiles E (2018) Perceptions of the privacy & security of virtual reality. In: iConference 2018 proceedings
5. Dissanayake VD (2019) A review of cyber security risks in an augmented reality world. University of Sri Lanka, Institute of Information Technology
6. Aggarwal R, Singhal A (2019) Augmented reality & its effect on our life. In: 2019 9th international conference on cloud computing, data science & engineering (Confluence). IEEE, pp 510–515
7. Stobiecki P et al (2018) Augmented reality: challenges & threats. J Ekon Probl Uslug 131(2):197–205
8. Lebeck K, Ruth K, Kohno T, Roesner F (2018) Towards security & privacy for multi-user augmented reality: foundations with end users. In: IEEE symposium on security and privacy (SP). IEEE, pp 392–408
9. Mathis F, Williamson JH, Vaniea K, Khamis M (2021) Fast & secure authentication in virtual reality using coordinated 3D manipulation & pointing. ACM Trans Comput-Hum Interact (ToCHI) 28(1):1–44
10. Syal S, Mathew R (2020) Threats faced by mixed reality & countermeasures. J Procedia Comput Sci 171(2):2720–2728
11. Baldassi S, Kohno T, Roesner F, Tian M (2018) Challenges & new directions in augmented reality. J Comput Secur Neurosci—Part 1(2)
12. Gulhane A, Vyas A, Mitra R, Oruche R, Hoefer G, Valluripally S, Calyam P, Hoque KA (2019) Security, privacy & safety risk assessment for virtual reality learning environment applications. In: 16th IEEE annual consumer communications & networking conference (CCNC), pp 1–9
13. Adams D, Bah A, Barwulor C, Musaby N, Pitkin K, Redmiles EM (2018) Ethics emerging: the story of privacy & security perceptions in virtual reality. In: Fourteenth symposium on usable privacy & security (SOUPS), pp 427–442
14. Dick E (2021) Balancing user privacy & innovation in augmented and virtual reality. J Comput Secur Neurosci—Part 1(2)
15. Viswanathan K (2022) Security considerations for virtual reality systems. arXiv preprint arXiv:2201.02563
16. Valluripally S, Gulhane A, Mitra R, Hoque KA, Calyam P (2020) Attack trees for security & privacy in social virtual reality learning environments. In: IEEE 17th annual consumer communications & networking conference, pp 1–9
17. De Guzman JA, Thilakarathna K, Seneviratne A (2019) Security & privacy approaches in mixed reality: a literature survey. J ACM Comput Surv (CSUR) 52(6):1–37
18. McPherson R, Jana S, Shmatikov V (2015) No escape from reality: security & privacy of augmented reality browsers. In: Proceedings of the 24th international conference on world wide web, pp 743–753


19. Chen S, Li Z, Dangelo F, Gao C, Fu X (2018) A case study of security & privacy threats from augmented reality (AR). In: 2018 international conference on computing, networking & communications (ICNC). IEEE, pp 442–446
20. Roesner F, Kohno T, Molnar D (2021) Augmented reality: challenges & opportunities for security and privacy. J Comput Secur Neurosci—Part 1(2)
21. Chow Y-W, Pose R, Regan M (2005) The ARP virtual reality system in addressing security threats & disaster scenarios. In: TENCON 2005—2005 IEEE Region 10 conference. IEEE, pp 1–6
22. Spicer RP, Russell SM, Rosenberg ES (2017) The mixed reality of things: emerging challenges for human-information interaction. J Next-Gener Anal V 7(2):97–108
23. Lebeck K, Kohno T, Roesner F (2016) How to safely augment reality: challenges & directions. In: Proceedings of the 17th international workshop on mobile computing systems & applications, pp 45–50

Chapter 58

A Review on Risk Analysis of Cryptocurrency Almaha Almuqren, Rawan Bukhowah, and M. M. Hafizur Rahman

1 Introduction

The digital economy era contributes to the rapid development of the global financial system. That creates both new opportunities and risks for society. The advancement of computer technology has produced a new financial instrument known as cryptographic money or cryptocurrencies [1]. The rise of cryptocurrencies poses a concern for many traditional financial operations. Cryptocurrencies rely on a peer-to-peer method to eliminate the “middle man,” the banking institution [2]. Cryptocurrencies are not placed in banks or safe boxes; they exist as data processed and stored on the Internet and traded as information. Cryptocurrency is acquired without dealing with the owners of capital, and it is exchanged without being listed on the stock exchanges. The cryptocurrency market ranges from well-known currencies such as Bitcoin, Ripple, and Ethereum to more obscure ones. The general acceptance of digital banking transactions allows the development of alternative types of money. These new types are not tied to traditional bank accounts and are based entirely on the digital environment; therefore, they are called digital currencies. With the substantial volatility of cryptocurrency exchange rates, great opportunities and high risks appear at the same time for investors. In addition to the enormous fluctuations, there are other issues such as theft risk [3, 4]. Other threats to the economic system exist as well. The Covid-19 virus, for example, caused the financial system to become unstable: oil prices and global stock markets both dropped, and the correlation between most asset classes rose dramatically.


Since its inception, governments, consumers, entrepreneurs, and economists have faced difficulties and opportunities related to cryptocurrency. Cryptocurrency is unlike any other financial asset on the market [5]. The following issues are explored based on the literature review:

1. The methodology used in cryptocurrency.
2. Fluctuations in the pricing process.
3. What is blockchain technology, and how does it apply to cryptocurrencies?
4. Is there a need for cryptocurrency as a counterpart to traditional money?
5. Cryptocurrency's potential risks.

By reviewing the literature, we can extract the risks to cryptocurrencies, both from a technical point of view and from a user's perspective. Knowing these risks is technically necessary in order to mitigate them according to the threat they pose. On the other hand, knowing which risks users consider to be the most significant threats helps to understand the adoption of cryptocurrencies. We also analyze additional concepts from the research that may impact the use of cryptocurrencies, such as knowledge and trust. The following are the paper's primary contributions and findings. First, an introduction explains the workings of cryptocurrencies, their use, and how the perception of security affects the decision to use them. Second, we describe the method of selecting papers, literature, and data sources: PRISMA and a search string; notably, most cryptocurrencies are designed around a blockchain. Third, a literature survey studies the papers that identify the vulnerabilities found in cryptocurrencies and the risks these vulnerabilities pose to users. Our analysis shows the volatility of cryptocurrencies.

2 Cryptocurrency Risk Assessment

Cyber risk is the confluence of assets, threats, and vulnerabilities: threats combined with vulnerabilities produce risk. To assess the amount of cyber risk, one must comprehend the many threats and be aware of the system's vulnerabilities. Cryptocurrency threats are divided into four categories: financial, technological, legal, and political risks. The technical risks include smart contract attacks, network-level attacks, and cryptographic key attacks. The first step in addressing cryptocurrency threats is comprehending and analyzing the vulnerabilities. Additionally, technological risks may entail financial risks. Some of the cryptocurrency risks are depicted in Fig. 1. A summary of the risks and their solutions is tabulated in Table 1.
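To make the "threats plus vulnerabilities" view concrete, the following minimal Python sketch scores hypothetical technical risks on a qualitative scale. The category names echo the text, but the ratings and the multiplicative scoring scheme are illustrative assumptions, not part of the reviewed methodology.

```python
# Illustrative sketch only: qualitative scoring in the spirit of
# "threats combined with vulnerabilities produce risk".
# The numeric levels and example ratings are invented for this example.
LEVELS = {"low": 1, "medium": 2, "high": 3}

def risk_score(threat: str, vulnerability: str) -> int:
    """Combine a threat rating and a vulnerability rating into a risk score."""
    return LEVELS[threat] * LEVELS[vulnerability]

# Hypothetical ratings for the technical risk category mentioned in the text.
technical_risks = {
    "smart contract attack": ("high", "medium"),
    "network-level attack": ("medium", "medium"),
    "cryptographic key attack": ("high", "low"),
}

for name, (threat, vuln) in technical_risks.items():
    print(f"{name}: risk score {risk_score(threat, vuln)}")
```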

3 Selection of Papers for Literature Review

A Systematic Literature Review (SLR) is the research methodology used here, because it supports writing research by identifying, selecting, and critically assessing all results from all studies that answer the research questions.


Fig. 1 Cryptocurrency risk

Table 1 Cryptocurrency risks and their solutions

Risks | Solutions
Man-in-the-middle attack | Certificate authority
Fake crypto exchanges | Two key strategies to protect crypto exchanges involve focusing on payments and ID verification
Ransomware | Disrupting the ransomware supply chain is a big step in the right direction. Blockchain analytics tools track how money on the blockchain has changed hands; this data can reveal supply chain disruptions and suspicious transaction patterns
Fake investment scams | Report cryptocurrency fraud and any questionable behavior, and read carefully from various sources before you invest
Art and money laundering (including NFTs) | Know-your-customer policies and ongoing monitoring

The reason for conducting an SLR in this paper is to clarify and analyze cryptocurrency risks in the digital market. In addition, a search string is the full set of text, numbers, and symbols that a user enters into a search engine to locate the desired results. We then use the PRISMA flow diagram to structure our systematic review. In the beginning, we create a search string to facilitate finding research papers in the databases.


3.1 Search String

The search string used to find the papers for this literature review is composed as follows: (cryptocurrency OR virtual currencies OR digital currency OR bitcoin) AND (stablecoins OR financial flow) AND (cryptocurrency trading OR currency exchange) AND (blockchain) AND (factors model OR model) AND (proof of work OR proof of stake) AND (security OR cybersecurity).
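As a small illustration of how such a query can be assembled, the sketch below joins the keyword groups from the search string above with AND operators; the group contents simply mirror the text and do not represent an additional methodology.

```python
# Minimal sketch: assembling the boolean search string from keyword groups.
# The groups mirror the search string given in the text.
groups = [
    ["cryptocurrency", "virtual currencies", "digital currency", "bitcoin"],
    ["stablecoins", "financial flow"],
    ["cryptocurrency trading", "currency exchange"],
    ["blockchain"],
    ["factors model", "model"],
    ["proof of work", "proof of stake"],
    ["security", "cybersecurity"],
]

search_string = " AND ".join("(" + " OR ".join(g) + ")" for g in groups)
print(search_string)
```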

3.2 Selection of Papers by PRISMA

This is a systematic literature review paper in which the research articles and relevant documents are selected using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) method, as illustrated in Fig. 2. It is a benchmark procedure for selecting papers for a literature review [6]. For this review study, we considered different publisher databases and Google Scholar, and by using PRISMA we selected 20 research papers for the reading and analysis required for this literature review. We restricted the search to papers published between 2019 and 2022 and removed duplicated research and papers marked as ineligible by automation tools. A Google Scholar search on cryptocurrency risks, specifically security concerns, registered 17,500 records, of which 17,300 were removed before screening. Of the 200 papers that were screened, 110 were excluded. Of the remaining 90 reports sought for retrieval, 40 could not be retrieved. Accordingly, 50 papers were assessed for eligibility, and 30 of these were excluded as irrelevant to this study. Hence, 20 research papers are used for this systematic literature survey.
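The record counts reported above can be laid out as a simple PRISMA-style tally; the sketch below uses the values stated in the text and merely checks that the numbers at each stage are internally consistent.

```python
# PRISMA-style tally of the counts reported in the text.
identified = 17_500                         # records found in Google Scholar
removed_before_screening = 17_300
screened = identified - removed_before_screening                 # 200
excluded_at_screening = 110
sought_for_retrieval = screened - excluded_at_screening          # 90
not_retrieved = 40
assessed_for_eligibility = sought_for_retrieval - not_retrieved  # 50
excluded_as_irrelevant = 30
included = assessed_for_eligibility - excluded_as_irrelevant     # 20

assert included == 20  # the number of papers used in this review
print(f"Included studies: {included}")
```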

4 Cryptocurrencies and Blockchains

Cryptocurrency is a digital asset that uses cryptography to safeguard transactions, govern the production of new value units, and validate asset transfers. There are many distinct types of cryptocurrencies, each with its own rules. The consensus method, latency, and cryptographic hashing techniques are possible differences between cryptocurrencies. The majority of cryptocurrencies are based on blockchains. A blockchain is a decentralized database shared across the nodes of a computer network; it acts as a database, storing information in a digital format.
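As a conceptual illustration of how a blockchain chains records together, the short Python sketch below hashes each block together with the previous block's hash, so that altering an earlier record breaks the links that follow it. It is a generic toy example, not the design of any particular cryptocurrency.

```python
import hashlib
import json

# Minimal sketch of the blockchain idea: each block stores data plus the hash
# of the previous block, so past records cannot be altered unnoticed.
def block_hash(block: dict) -> str:
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

chain = [{"index": 0, "data": "genesis", "prev_hash": "0" * 64}]
for i, payload in enumerate(["tx: A pays B 1.0", "tx: B pays C 0.4"], start=1):
    chain.append({"index": i, "data": payload, "prev_hash": block_hash(chain[-1])})

# Tampering with an earlier block breaks the hash link to the next block.
chain[1]["data"] = "tx: A pays B 100.0"
print(block_hash(chain[1]) == chain[2]["prev_hash"])  # False
```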


Fig. 2 Selection of papers for literature review using PRISMA

Blockchain technology is the technology that powers cryptocurrencies and is used for sending and receiving money; Fig. 3 shows how it works. Cryptocurrencies and blockchains come with a slew of potential dangers. Cryptocurrencies are ideal instruments for money laundering and tax evasion because of their pseudonymous or entirely anonymous accounts, censorship-resistant networks, and worldwide accessibility. Arguably the most well-known and infamous cryptocurrency application is the illegal purchase of narcotics, firearms, stolen identity papers, credit card data, and cybercrime tools. Due to their secrecy and the inability to reverse transactions, cryptocurrencies are also extensively utilized for ransomware, blackmail, and extortion.

5 Literature Survey

Cryptocurrency risk is the subject of this chapter; as a result, this section focuses on analyzing cryptocurrency data. Even though there are more cryptocurrency exchanges than venues for other financial assets, cryptocurrency is subject to being hacked and is the target of other criminal activities.


Fig. 3 Cryptocurrencies-blockchain technology for sending and receiving money

Furthermore, in this section we summarize the literature survey of cryptocurrency risks; Table 2 lists the surveyed literature analyzing cryptocurrency. Cryptographic money combines the characteristics of a means of payment and trade with those of an investment instrument. The critical risk identified in this study is that it is an unstable means of circulation due to the lack of a central emitter and a regulating administrator, posing certain risks to its owner [1]. There are many conventional financial operations that face an existential danger due to the advent of cryptocurrencies. One threat in the young cryptocurrency industry is the anonymous character of some cryptocurrencies' transactions, which might allow malicious individuals to conduct unlawful business or, worse, represent a greater danger to our society and institutions [2]. This article examines the success of zero-investment long-short strategies using size-related characteristics such as market capitalization, price, maximum price, and age. First, the average excess return of each portfolio over the risk-free rate in the following week is calculated; then the excess returns of long-short strategies based on the difference between the fifth and first quintiles are computed [3]. The security and advantages of local digital currencies are not seen as positively by older and more experienced potential users as they are by their younger and less experienced peers. Income levels, technical interest, and the experience of younger and less experienced potential users in utilizing new technologies might all be convincing evidence to explain disparities in people's traits. Because most community currencies are designed to support low-income individuals who do not have steady employment, are at risk, and are young, local digital money was implemented in the examined region.


Their readiness to gain additional advantages and their enthusiasm to adopt innovative services may be even more helpful for young and less experienced residents. On the other side, these currencies have some flaws or issues that make potential users hesitant to utilize them, because they may be seen as complicated, inconvenient, or unpleasant [4]. Except for oil returns, all assets produced positive returns. Gold has the lowest risk overall, whereas Bitcoin has the lowest risk among the cryptocurrencies. On the other hand, oil returns pose the most danger, while Monero poses the most significant risk among cryptocurrencies. The gold market has the most delicate risk-return ratio and is a safe refuge for investors. All market returns (cryptocurrencies and financial assets) have kurtosis values greater than three, and the distributions of returns for all investments are negatively or positively skewed, indicating that none of the return markets are normally distributed; thus, the Jarque-Bera test rejects the assumption of Gaussian returns for all assets and cryptocurrencies [5]. The authors believe equity risk will be reassessed during and after COVID-19. Furthermore, various studies in the literature show the influence of COVID-19 on equity markets. Finally, they explore the underlying distinctions among the cryptocurrencies in their sample [7]. Digital money is defined in this article as a type of currency utilized in the digital environment, such as in digital forms of electronic gadgets. The electronic economy is precarious and should be treated with extreme caution to prevent or minimize the hazards. A deep neural network (DNN) method was enhanced to estimate the Bitcoin price and thereby meet the main aim of lowering financial risks in electronic commerce. A good prediction was obtained using valuable data such as transactions and currency returns. Digital currency pricing is based on traditional pricing principles. The critical risk in dealing with digital currency is the systematic and fluctuation risk in the pricing process; consequently, the focus of this study is on these two categories of risks, with theft and fraud risks being ignored. The extreme volatility in the price of digital currencies makes them a risky investment for non-professional investors [8]. The FOMC's monetary policy announcements also contain "extended information shocks" that are not linked to present and future risk-free interest rates but can be reflected by changes in the prices of risky assets. The authors use factor analysis to measure risk shifts based on data on surprise responses of CDS spreads, the VIX, and the USD exchange rate. They show that these risk shifts are responsible for a significant percentage of the variation in equity excess returns that is not explained by changes in risk-free rates. It will be interesting to see how Bitcoin values react to risk shifts, much as with central bank information shocks. The former captures changes in the economic outlook that may be relevant to the choice to hold cryptocurrencies, while the latter covers orthogonal alterations in the willingness to take risky asset positions, which is perhaps more essential [9]. Because of the significant volatility of the blockchain markets, investors and market players are focusing their attention on the diversification routes of NFTs, DeFi tokens, and cryptocurrencies. At the median, extremely low, and extremely high


volatility levels, the excessive risk transmission across blockchain markets is examined using the quantile connectedness approach, and with the substantial disconnection of NFTs, significant risk spillovers among blockchain marketplaces are discovered. Meanwhile, numerous unequal economic conditions have time-varying characteristics. Therefore, NFTs provide more diversification opportunities, with a significant risk-bearing capacity to protect investors and prevent severe hazards among other blockchain markets [10]. People are persuaded to believe in cryptocurrencies through faith and technical reasons. Money is the essential component of the economic system, and there is a significant danger of getting it wrong when the monetary system is modified. The bar for crypto success is significantly higher than for previous technological advancements, and cryptocurrencies must deliver on their promises without posing new hazards to economic and financial stability [11]. Traditional malware detection systems focus primarily on signature matching and heuristic detection. However, this form of rule matching has limited generalization and detection capabilities for unknown malware. Machine learning algorithms have advanced in recent years and are now a viable option for detecting malware. Features of network traffic and CPU use have been used to identify miner malware. Converting bytecode data into grayscale pictures and training CNN models to recognize grayscale image attributes is used to detect browser cryptocurrency mining attacks. Although opcode and grayscale image properties can be utilized as feature vectors to identify miner malware, they struggle to reflect the pattern of miner malware activities [12]. The legislation forbids "all private cryptocurrencies" but allows "limited exclusions" to encourage blockchain, cryptocurrency's underlying technology, and its applications. The authorities had previously attempted to prohibit the use of virtual currencies, such as bitcoin. In addition, the country's monetary policy authority warned banks not to deal with virtual money, citing "different hazards connected with such virtual currencies" [13]. Despite numerous talks regarding blockchain technology, no solutions have been found to the problem of a single legal definition and legal status of cryptocurrencies and their regulation at the state and international levels. The activity of purchasing, selling, trading, and converting into cryptocurrency, in particular, entails several hazards. First, there is still the possibility of such payment systems being abused [14]. Almost everyone is familiar with cryptocurrencies. However, their widespread use, extensive diversity, and financial value raise concerns about security dangers. People's risk awareness and how their security risk perception influences their decision to utilize cryptocurrencies are just as essential as the actual hazards. Users are mostly exposed to risks in the bitcoin environment. The environment around cryptocurrencies poses the greatest threat to their adoption, since it contains technological hazards and is also viewed as having a significant influence [15]. A cryptocurrency built along the lines of the Libra plan might have a lot of advantages. The provision of a fast and cheap payment system, a store of value for residents of countries with unstable currencies, a disciplining effect on national central bank actions, and the possibility of the blockchain-based ledger making it easier


to track money laundering activities are all highlighted. However, the most severe dangers are also linked with Libra. The risks are analyzed and categorized as political risks, financial risks, systemic risks related to situations similar to bank runs in which a large number of users decide to convert their Libra to fiat currencies, economic risks such as the creation of an oligopolistic market, technological risks such as cyber-attacks, fraud, or failure, as well as ethical and regulatory risks [16]. The authors use an Autoregressive Fractionally Integrated GARCH model with non-normal innovations. This research investigates the effects of long memory in conditional volatility and of conditional non-normality on market risks in Bitcoin and other cryptocurrencies. Two tail-based risk measures, namely Value at Risk (VaR) and Expected Shortfall (ES), are used to analyze the tail behavior of market risks in Bitcoin and other cryptocurrencies. Empirical studies of tail behavior are undertaken using real-time cryptocurrency exchange rate data [17]. The rigidity produced by the cryptocurrency price matching membership demand with the speculative supply of tokens might lead to a market collapse, especially if there is high complementarity in membership demand. Speculator emotion exacerbates market fragility by squeezing out consumers, whereas user optimism mitigates market fragility by increasing user involvement. By reducing price volatility and platform performance, informational frictions lessen the danger of a systemic failure. Furthermore, the market's instability is exacerbated by consumers' expectations of losses from strategic attacks by miners [18]. The main issue in [19] was the absence of clear definitions of the fraud types identified, as well as of the level of risk presented by the offenses. Thirty-three sources on cryptocurrency were either inaccessible or were omitted because they constituted a privacy or security risk if they were opened. Eleven of the authorities did not mention cryptocurrency fraud, and three were found to be duplicates after additional analysis. In contrast, the amount of information disseminated on phishing appeared to be disproportionate to the perceived risk [19]. Research on the dangers and potential of cryptocurrencies in the global financial system was undertaken in this paper. It investigated the theoretical aspects of the concept of cryptocurrency, the stages of development of the term itself and the technologies associated with it, the analysis of the current market of digital currencies, a comparison of existing cryptocurrencies with fiat money and assets, and data on market competition [20]. Developing and enhancing online lending rules and regulations is required, increasing the online lending industry's oversight, regulating market behavior, and preventing the problem from worsening. Understanding the primary influencing variables of platform risk and developing a reliable platform risk early warning model have become two critical objectives for improving safety. This paper first established a dynamic early warning system for Internet financial platforms based on information collection and processing technology, with classification algorithms as the core, using different classification algorithms and horizontally comparing their learning efficiency and accuracy. When analyzing risk warning situations involving structural alterations, classification algorithms offer a distinct edge [21].


The Libra, a new financial asset meant for processing payments, has been unveiled by a Facebook-led consortium of banking, IT, and social media organizations. According to its creators, Libra will be a global currency capable of operating in a global infrastructure and serving billions of people worldwide. Libra, like Bitcoin, is digital money [10]. The Libra was developed and is administered by the Libra Association, a non-profit organization with headquarters in California and a registered seat in Geneva, Switzerland. Switzerland was picked since it is one of the few nations having cryptocurrency/crypto-asset regulation. Libra will have a reseller network: Libra currency units will not be available for purchase directly from the Libra Association; instead, approved dealers will operate as a link between the association and its members. Libra is a cryptocurrency with a steady value. In the document's terminology, Libra is referred to as a crypto-asset or cryptocurrency. First, to keep the value constant, a reserve will be established, mainly consisting of secure and liquid assets, such as short-term government securities issued by several nations and bank deposits in various currencies. Libra's value is determined by a weighted average of the reserve currencies. The Libra reserve will be built up in two stages. Libra's value must be steady and equivalent to a basket of assets designed with the least amount of volatility feasible. As a result, the assets must be widely distributed geographically and, above all, issued by creditworthy nations and central banks. The money invested by the founding members forms the first stage. After Libra has been launched, the reserve is expanded as users purchase Libra for their national currencies. The payment information in Libra will be maintained in a distributed database, which is the foundation of Distributed Ledger Technology (DLT). Unlike centralized databases, which store information centrally in one location, a distributed database exists in numerous places simultaneously in a network of computers (nodes). Each node in the network has a copy of the database. Payment systems such as Bankgirot and card networks are centralized. In the case of Libra, the network's nodes are made up of Libra Association members. The blockchains Bitcoin and Ethereum are the two most well-known uses of Distributed Ledger Technology. The transaction history in Libra is recorded in a structured database rather than a connected chain of transaction blocks (hence the name blockchain). Unlike Bitcoin's blockchain, Libra's "blockchain" means that only Libra Association members may register transactions.

6 Future Challenges

The first future challenge is that cryptocurrencies are fundamentally different and not interchangeable. The confusing array of cryptocurrencies differs in various ways, notably in terms of security, programmability, and governance. One of the first stages in managing any financial instrument's risk is assessing and determining its exposure using standard market-wide procedures. However, cryptocurrencies are unique in that there is no consensus valuation technique, no widely acknowledged measures, and reported pricing information can vary significantly between venues.

References

Melnikov et al. [1]

Härdle et al. [2]

Liu and Tsyvinski [3]

Bulgaria [4]

Ghorbel and Jeribi [5]

Goodell and Goutte [7]

Chen [8]

Karau [9]

Soren and Deutsche [10]

Danielsson [11]

Fu and Zheng [12]

No.

1

2

3

4

5

6

7

8

9

10

11

MBGINet

Fiat system

Systematic empirical analysis

Quantile var estimates and Volatility estimates

Deep learning algorithms

Descriptive statistics

GARCH model

Binary logistic regression test

Cross section

Analysis

Analysis systematization and generalization

Proposed techniques

Includes two malware datasets with varying data sizes

The technology world of fiat money dates back at least 800 years

In 2009–2018. The dataset is composed of approximately 811 million observations

NFTs, DeFi tokens, and cryptocurrencies 2018–2021

Bitcoins in Europe

Principal component analysis

Money assets

The massive number of computers between 30 and 50

Coins from 2014 till 2018

From 2017 to 2019, money assets

Not disclosed

Dataset description

Table 2 Literature survey cryptocurrency risks

The Wild Dataset Simulated

No evaluation

Structural analysis

No evaluation

Optimization algorithm

No evaluation

Comparison

Digital currency and cash

Comparing the factors

No evaluation

No evaluation

Comparison techniques

(continued)

A behavioral signature representation for cryptocurrency mining malware is created by combining these characteristics or by using blockchain computing alone

The risk and reward of a cryptocurrency financial system are dependent on its structure. Tokenized fiat money serves as financial stability in its most basic form. If we move beyond that, replacing central bank-managed fiat money with cryptocurrencies, there will be a disruption

Then the risk transmission of blockchain markets using the quantile connectivity approach With substantial disconnection of NFTs, we discover large risk spillovers among blockchain marketplaces

The Monetary policy advances in cryptocurrency markets are based on high frequency and blockchain transaction data

The deep learning algorithms have been upgraded in a deep neural network to respond to this issue statement, making them more suited for addressing bitcoin concerns

According to the authors, equity risk will be reevaluated during and after COVID-19. The impact of the COVID-19 epidemic on certain cryptocurrencies, such as Bitcoin and Tether

That asset shocks have a minor influence on Bitcoin’s risk. However, the BEKK-GARCH model reveals that cryptocurrencies have a more significant volatility spillover

The risk and cost are significant variables in introducing local currencies, digital currencies provide some benefits to address these concerns

In the cross-section of cryptocurrencies, standard asset pricing methodologies can be employed

The most prevalent risks associated with cryptocurrency. They create cryptocurrency analogs by considering different pricing and market-related indicators in the stock market

In the current state of digital economy growth, the potential for using electronic and virtual money was studied; digital fraud prospects were assessed, and ways for managing cryptographic money turnover were identified

Major findings


Not disclosed

Margulescu and Margulescu [13]

Anush et al. [14]

Roppelt [15]

Sebastia et al. [16]

Siu [17]

Sockin and Xiong [18]

Trozze et al. [19]

Titov et al. [20]

Fu et al. [21]

12

13

14

15

16

17

18

19

20

Machine learning algorithms

EOS system

PRISMA-ScR flow diagram

Systematic framework

VaR and ES matrix, GARCH model

Compares and utilizes additional long-memory techniques

Theoretical and technical analysis

Historical method

Proposed techniques

No. References

Table 2 (continued)

The customer data in the bank

Cryptocurrencies appear as they cannot be predicted, and their volatility

Types and characteristics of cryptocurrency fraud are about 29

The few months after its ICO, the event that assured when DAO failed, raised 150 million dollars in 2016

The impact of long memory in volatility, non-normality, and behavioral insights

From 2013 to 2019, a comparison

Bitcoin has approximately reached 20,000 euros for six months, and then it is increased by 1000 euros

In 2018 the top ten cryptocurrencies

Data is collected by recording every transaction and purchase done over the computer screen

Dataset description

No evaluation

No evaluation

Types of fraud that exist and will exist

Baseline model

Risks matrix

Different methodologies

Qualtrics, SPSS, MANOVA

No evaluation

Traditional cryptocurrencies and digital currencies

Comparison techniques

They build a preliminary credit risk management strategy, check the primary method to optimize it, and then design a new credit risk control strategy using the machine learning and logistic regression models

Their drawbacks, include significant volatility, forecasting problems, poor throughput platforms, and scalability concerns

Thirty-three sources were either inaccessible or were omitted because they constituted a privacy or security risk if they were opened. Applying three steps for defining any cryptocurrency fraud

They reduce price volatility and platform performance. Informational frictions lessen the danger of a systemic failure. Furthermore, the market’s instability is exacerbated by consumers’ expectations of losses from strategic attacks by miners

Value at risk and expected shortfall are very important to determine and studying market risks and cryptocurrency risks

There will risks in terms of financial stability, consumer privacy, and protection, as well as other operational hazards associated with the blockchain technology to cyber-attacks and vulnerability

Security is a big concern for cryptocurrency and not only protect blockchain technology but also on its developers and algorithms that use and defining all vulnerabilities

There are numerous risks associated with the operations of buying, selling, exchanging, and converting into cryptocurrencies, and there is still the possibility of such payment systems being misused

Monetary policy regulators alerted banks that they must stop dealing with them, citing various risks associated with dealing with such virtual currencies

Major findings



In addition, unlike financial instruments, cryptocurrencies are not regulated and do not have the same level of legal protection as traded financial items. That creates complicated legal issues and uncertainties, which can significantly impact the instability and risk of digital assets. As a result, cryptocurrency risk managers may not have the data needed to anticipate future bitcoin exposures and hazards. Nevertheless, cryptocurrencies have progressively gained traction as an asset class over the last decade and are now drawing institutional investors. The increased demand necessitates a more thorough examination of the underlying causes of risks and opportunities. The requests for enhanced cryptocurrency risk management [22] are part of the market's maturing, eventually replacing self-regulation and automated governance with effective supervisory and regulatory frameworks.

7 Conclusion

This review presents a systematic examination of the research on analyzing currently existing cryptocurrencies. It uniquely identifies expert practitioners' assessments of these issues through a survey of the literature and a study of the technical literature. Moreover, the best technique that can be used to reduce the risk of cryptocurrency is deep learning algorithms, which are used to solve complex cases that require advanced prediction of financial transactions. Additionally, some literature reviews analyze and compare cryptocurrencies, US indices (S&P 500, Nasdaq, VIX), oil prices (WTI), and gold prices through the volatility relationship between cryptocurrencies and other financial assets. The other blockchain markets are affected by investors, and the market is focusing its attention on the diversification avenues of NFTs, DeFi tokens, and cryptocurrencies. The extreme risk transmission in the blockchain markets can occur at low, median, and extremely high volatility levels. The most significant risks that affect cryptocurrency are the risk of currency loss as a result of hacker attacks and the risk of depreciation due to the unlimited issue (mining) of electronic money. Furthermore, there is the risk of changes in the legislative framework regulating virtual currencies, up to a ban on the mining and trading of cryptocurrency. Finally, there are the risks of participation in the illegal receipt of crypto money and the danger in transfers due to the inability to cancel a launched transaction. This is a review paper; adding more research papers to enhance its quality, clarity, and versatility is kept as future work.

Acknowledgements The authors would like to thank the anonymous reviewers for their insightful comments and suggestions to improve the clarity and quality of the paper. This work was supported by the Deanship of Scientific Research, Vice Presidency for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia, [Grant No. GRANT818].


References

1. Melnikov VA, Luchkin AG, Lukasheva OL, Novikova NE, Zyatkova AV, Yarotskaya EV (2022) Cryptocurrencies in the global financial system. In: Russian conference on digital economy and knowledge management, vol 4, no 7, pp 423–430
2. Härdle WK, Harvey CR, Reule RCG (2020) Understanding cryptocurrencies. J Financ Econom 18(2):181–208
3. Liu Y, Tsyvinski A, Wu X, Borri N, Brunnermeier M, Daniel K, He Z (2022) Common risk factors in cryptocurrency. Nber J 12(5)
4. Ključnikov A, Civelek M, Polách J, Mikoláš Z, Banot M (2020) How do security and benefits instill trustworthiness of a digital local currency. Oecon Copernic 11(3):433–465
5. Ghorbel A, Jeribi A (2021) Investigating the relationship between volatilities of cryptocurrencies and other financial assets. Decis Econ Finance J Appl Math 44(2):817–843
6. Matthew J et al (2021) PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. BMJ 372(n160):1–35
7. Goodell JW, Goutte S (2021) Diversifying equity with cryptocurrencies during COVID-19. Int Rev Financ Anal 76:101781
8. Chen S (2022) Cryptocurrency financial risk analysis based on deep machine learning. Secur Commun Netw Hindawi J 27(1):430–473
9. Karau S (2021) Monetary policy and cryptocurrencies. SSRN Electron J 7(6):3949–3549
10. Karim S, Lucey BM, Naeem MA, Uddin GS (2022) Examining the interrelatedness of NFTs, DeFi tokens and cryptocurrencies. Finance Res Lett 47:102696
11. Danielsson J (2019) Cryptocurrencies: policy economics and fairness. SSRN Electron J 15(7):99–103
12. Suri G, Zheng R, Wang Q, He J, Fu J, Jiang Z (2022) Cryptocurrency mining malware detection based on behavior pattern and graph neural network. Secur Commun Netw Hindawi J 22(3)
13. Margulescu S, Margulescu E (2021) Traditional cryptocurrencies and fiat-backed digital currencies. Glob Econ Observ 9(1):116–123
14. Anush B, Inna G, Tatyana S, Aleksey D, Tetyana B (2021) Comparative and informative characteristic of the legal regulation of the blockchain and cryptocurrency. State Prospects Ilköğretim Online 20(3):1541–1550
15. Roppelt JC (2019) Security risks surrounding cryptocurrency usage. Dataversity 2(9):217–7500
16. Sebastia HMCV, Cunha PJORD, Cortesa PM, Godinho O (2021) Cryptocurrencies and blockchain. Int J Econ Bus Res 21(3):305–342
17. Siu TK (2021) The risks of cryptocurrencies with long memory in volatility, non-normality and behavioural insights. Appl Econ 53(17):1991–2014
18. Sockin M, Xiong W (2020) A model of cryptocurrencies. National Bureau of Economic Research 8(15):26816
19. Trozze A, Kamps J, Akartuna EA, Hetzel FJ (2022) Cryptocurrencies and future financial crime. Crime Sci J 11(1):1–35
20. Titov V, Uandykova M, Litvishko O (2021) Cryptocurrency open innovation payment system. J Open Innov: Technol Market Complex 7(1):1–102
21. Fu W, Liu M, Gao R (2021) Analysis of Internet financial risk control model based on machine learning algorithms. J Math 2021:8541929
22. Shakya S, Smys S (2021) Big data analytics for improved risk management and customer segregation in banking applications. J ISMAC 3(3):235–249

Chapter 59

Sine Cosine Algorithm with Tangent Search for Neural Networks Dropout Regularization Luka Jovanovic , Milos Antonijevic , Miodrag Zivkovic , Dijana Jovanovic , Marina Marjanovic , and Nebojsa Bacanin

1 Introduction

Artificial intelligence can be considered metaheuristics under the presumption that the recreation of human intelligence is attempted. Note that this has still not been achieved, and artificial intelligence outperforms human intelligence only in specialized scenarios. The excellent results of hybrid artificial intelligence solutions that exploit metaheuristics further add to this argument. Swarm intelligence has distinguished itself in this hybridization trend, resulting in various excellent optimizers for NP-hard problems. Convolutional neural networks (CNNs) are the closest solution to human intelligence, as they attempt exactly that [25, 41, 43]. The process of creating a CNN is modeled after principles observed in the visual cortex of animals. Acquired data is progressively improved between layers as it is forwarded through each layer. The process is based on the simplification of the data after each layer while retaining the features critical to the data. Furthermore, the output acquires more data progressively


as well, while processing it in a form that allows faster computing. The example process for this behavior is the shaping of objects that can be observed on the output of layers in the following order: edges, the corners and sets of edges, parts of objects and sets of corners and contours, and finally full objects constructed from the previous elements. Nevertheless, CNNs are not without shortcomings. The problems that require optimization are overfitting, as is the case with all learning methods, and the tuning of hyperparameters. General well-performing CNN architecture does not exist; hence, every problem requires generating a specific one. The main defining characteristics of a CNN are the quantity of layers, types of layers, neurons per layer, the rate of learning of the loss function, and the activation function. The preceding components are referred to as hyperparameters and are not trainable. The process of selecting appropriate optimal hyperparameters values is a task considered NP-hard [4]. Overfitting occurs when the weight and biases are adjusted to certain data during training making them inefficient for solving problems with unknown data. To rephrase, the models’ generalization ability is weak. The issue is considered a problem of bias-variance trade-off [35]. The solutions to this problem can be various, and some are proposed in [32]. The most effective method has proven to be the dropout [37]. The main principle of this method is the removal of neurons at random from the network including their connections. Dropout probability is the main controlling parameter of this process, and it influences the number of units to be dropped in percentage. Furthermore, this is another hyperparameter to be adjusted according to the problem. The optimal way to do so is through the use of frameworks that provide optimal or suboptimal solutions. The metaheuristic algorithms have proven to yield results with such tasks. This work proposes a swarm intelligence-based automated framework for the determination of the dropout probability for the dropout layer of CNN. The research is a follow-up to the previously published research [11] which tackled the same problem. This paper proposes an improved version of well-known sine cosine algorithm (SCA) [28] for dropout regularization, tested on the following benchmark datasets for image classification: MNIST, CIFAR-10, UPS, and Semeion. The primary focus of this research is to enhance the classification performance of the CNN even further while avoiding overfitting by improved dropout regularization. The following structure is applied throughout the paper: Sect. 2 provides CNN and dropout regularization background as well as related research in the field of swarm intelligence, Sect. 3 explains the used metaheuristic along with its pseudocode, Sect. 4 displays the results of the research and comparison against other methods, and finally Sect. 5 finalizes the obtained observations and proposes further improvements.

2 Related Works and Preliminaries

Unlike artificial intelligence, humans do not process information based on given labels and tags, and obvious limits are thus formed for understanding the presented results.


The success of an AI-based software solution that used word-based descriptions to process images would be unlikely. Consequently, CNN-based solutions find great application in the field of visual tasks [26]. Recent progress in the field includes climate change prediction [21], analysis of documents [1], classification of medical images [36, 39], as well as facial recognition [30, 31]. CNNs consist of different types of layers, each with a different purpose. The first layer is naturally the input layer containing the image data. The second type of layer, the convolution layer, is tasked with extracting features, utilizing filters to reduce the input size alongside the application of the convolution operation. The usually employed filters are of sizes 3 × 3, 5 × 5, and 7 × 7. After the kernel application, feature maps are produced. The convolution operation is mathematically represented as [20, 40]:

z_{i,j,k}^{[l]} = w_k^{[l]} x_{i,j}^{[l]} + b_k^{[l]}   (1)

where z_{i,j,k}^{[l]} represents the kth feature map value at output position (i, j) of layer l, the input values are given as x at position (i, j), the filters are represented as w, and the bias is denoted b. The following equation denotes the activation function:

g_{i,j,k}^{[l]} = g(z_{i,j,k}^{[l]})   (2)

in which g(\cdot) represents the non-linear function applied to the output. The most commonly used pooling layers are average and max, while pooling can generally be local or global. To reduce the resolution of the input, the pooling function is applied:

y_{i,j,k}^{[l]} = \mathrm{pooling}(g_{i,j,k}^{[l]})   (3)
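As an illustration of Eqs. (1)–(3), the NumPy sketch below computes one feature map with a single 3 × 3 filter, applies an activation, and performs 2 × 2 max pooling. The input size, the random filter values, and the choice of ReLU as g(·) are arbitrary assumptions made only for the example.

```python
import numpy as np

# Toy input "image" (8x8) and one 3x3 filter; values are arbitrary.
x = np.random.rand(8, 8)
w = np.random.rand(3, 3)
b = 0.1

# Eq. (1): valid convolution (implemented as cross-correlation) -> feature map z.
H, W = x.shape[0] - 2, x.shape[1] - 2
z = np.empty((H, W))
for i in range(H):
    for j in range(W):
        z[i, j] = np.sum(w * x[i:i + 3, j:j + 3]) + b

# Eq. (2): non-linear activation g(.), here ReLU.
g = np.maximum(z, 0.0)

# Eq. (3): local 2x2 max pooling to reduce the resolution.
y = g[:H - H % 2, :W - W % 2]
y = y.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))
print(y.shape)  # (3, 3) for the 8x8 input
```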

The fully connected layers of CNNs perform operations identical to those in a standard ANN. Generally, CNNs contain multiple dense layers, in which the final layer carries out multi-class classification with softmax or binary classification with sigmoid or tanh functions. Regardless of their various capabilities, CNNs are not perfect. As previously discussed in Sect. 1, the main problem is the avoidance of overfitting [24, 37]. The techniques usually used to tackle this problem are [32]: data augmentation, early stopping, model simplification, and regularization. Different regularization techniques have been suggested and applied, such as dropout [37], L1 [29], and L2 regularization [29]. When a neuron is removed (dropped) from a layer, every connection it had to the rest of the network is removed as well. The selection of these neurons is stochastic, and such units are withdrawn only temporarily from the training phase. This is performed to produce a better generalizing network, as it becomes less susceptible to the removed neurons' weights. To achieve this goal, the neighbors of the removed neurons are required to process heavier loads to compensate.


This behavior is what is believed to positively impact independent internal representations. The described process is applied exclusively to the final fully connected layers and is performed ahead of the classification layer. The feed-forward operation is represented by the following equations:

z_i^{[l+1]} = w_i^{[l+1]} y^{[l]} + b_i^{[l+1]}   (4)

y_i^{[l+1]} = g(z_i^{[l+1]})   (5)

in which w and b, as previously, respectively represent the weight and bias, l denotes the lth hidden layer of the network, z and y are the input and output vectors, and g is the activation function. After dropout regularization, the feed-forward functions become [9]:

r_j^{[l]} \sim \mathrm{Bernoulli}(p)   (6)

\tilde{y}^{[l]} = r^{[l]} \cdot y^{[l]}   (7)

z_i^{[l+1]} = w_i^{[l+1]} \tilde{y}^{[l]} + b_i^{[l+1]}   (8)

\tilde{y}_i^{[l+1]} = g(z_i^{[l+1]})   (9)

with r denoting the vector of independent Bernoulli random variables. The percentage of dropped neurons is represented by the hyperparameter dp (the dropout probability), which is consequently not trainable. Its value lies in the range [0, 1]; hence, the tuning of this parameter is considered an NP-hard problem. The swarm intelligence field has provided substantial solutions to these types of problems. The field of swarm intelligence relies on population-based techniques of a stochastic nature that resemble the behavior of animals organizing in large groups toward the same goals. The basis of swarm intelligence algorithms is the examination of previously identified sections of the search space, referred to as exploitation, and of unexplored areas, referred to as exploration [42]. Algorithms from the swarm intelligence domain find diverse application in practical optimization problems, some of which are global numerical optimization [16], scheduling of workflows and tasks in cloud-based environments [13, 19, 47], wireless sensor network problems [3, 10, 44, 46, 50], artificial neural network optimization [2, 5–8, 12, 17, 22, 23, 27, 38], MRI classifier optimization for medical diagnostics [14, 15, 18], and COVID-19 case forecasting [45, 49]. The blooming field of hybridization has yielded many successful solutions that combine machine learning techniques and swarm algorithms.
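A minimal NumPy sketch of the dropout feed-forward pass of Eqs. (6)–(9) is given below. The layer sizes, the Bernoulli parameter p, and the use of ReLU as g(·) are arbitrary choices for illustration, and the inverted scaling commonly used in practice is deliberately omitted so that the code follows the equations as written.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary layer sizes and parameters for illustration.
n_in, n_out = 16, 8
y_l = rng.random(n_in)                  # output of layer l
W = rng.standard_normal((n_out, n_in))  # weights of layer l+1
b = np.zeros(n_out)                     # biases of layer l+1
p = 0.8                                 # Bernoulli parameter (assumed value)

r = rng.binomial(1, p, size=n_in)       # Eq. (6): Bernoulli mask
y_tilde = r * y_l                       # Eq. (7): thinned layer output
z_next = W @ y_tilde + b                # Eq. (8): pre-activation of layer l+1
y_next = np.maximum(z_next, 0.0)        # Eq. (9): activation g(.), here ReLU
print(y_next.shape)                     # (8,)
```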


3 Sine Cosine Metaheuristics and Proposed Enhancements

The sine cosine algorithm (SCA) is inspired by the trigonometric functions upon which its mathematical model is based [28]. Position updating is performed according to these functions, making solutions prone to oscillations around the optimum. The returned values of these functions lie in the range [−1, 1]. During the initialization phase, the algorithm generates multiple candidate solutions within the limits of the search space. Randomized adaptive parameters control the phases of exploration and exploitation. Position updating is performed by the following equations [28]:

X_i^{t+1} = X_i^t + r_1 \cdot \sin(r_2) \cdot |r_3 \cdot P_i^{*t} - X_i^t|   (10)

X_i^{t+1} = X_i^t + r_1 \cdot \cos(r_2) \cdot |r_3 \cdot P_i^{*t} - X_i^t|   (11)

with X_i^t and X_i^{t+1} denoting the position of a given solution in the ith dimension at iterations t and t+1, respectively, r_{1-3} the generated pseudorandom numbers, P_i^* the position of the target (best) solution in the ith dimension, and |\cdot| the absolute value. With the use of the control parameter r_4, the two equations are combined:

X_i^{t+1} = \begin{cases} X_i^t + r_1 \cdot \sin(r_2) \cdot |r_3 \cdot P_i^{*t} - X_i^t|, & r_4 < 0.5 \\ X_i^t + r_1 \cdot \cos(r_2) \cdot |r_3 \cdot P_i^{*t} - X_i^t|, & r_4 \geq 0.5 \end{cases}   (12)

where r_4 denotes a number randomly selected from the interval [0, 1]. It is important to note that the pseudorandom values r_{1-4} are regenerated for every solution in the population. The search process of the algorithm is therefore controlled by four randomly generated parameters that affect the best as well as the current solution. The range of the base functions is adjusted on the go, ensuring balance toward the global best solution. The cyclic sequences exhibited by the sine and cosine functions allow repositioning near a solution, which guarantees exploitation. Range changes of the main functions also push the search beyond the current destinations while avoiding overlaps with other solutions' areas. To guarantee exploration and increase the quality of randomness, the range of the parameter r_2 is set to [0, 2\pi]. The balance between diversification and exploitation is controlled by the following equation:

r_1 = a - t \frac{a}{T}   (13)

in which t represents the ongoing repetition, T denotes the maximum allowed amount of possible repetitions per run, and the a represent a constant. While the a variable is internally hard-coded control parameter, therefore it is not adjustable by the user.


A satisfactory value for this parameter was determined empirically; for the dropout regularization experiments it was set to 2.0, as suggested in the original SCA paper [28] for continuous global optimization. The dropout regularization experiment also belongs to the group of continuous global optimization NP-hard challenges.
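The update rules above translate directly into code. The following is a minimal NumPy sketch of one basic SCA iteration over a population matrix; the variable names, the per-dimension sampling of r2 and r3, and the use of NumPy are assumptions made for illustration, not the authors' implementation:

```python
import numpy as np

def sca_step(X, best, t, T, a=2.0):
    """One basic SCA position update, following Eqs. (10)-(13).

    X    : (N, dim) array of agent positions
    best : (dim,) array, the best solution found so far (P*)
    """
    r1 = a - t * a / T                                  # Eq. (13)
    X_new = np.empty_like(X)
    for i, x in enumerate(X):
        # fresh pseudorandom values for every agent
        r2 = np.random.uniform(0.0, 2.0 * np.pi, x.shape)
        r3 = np.random.uniform(0.0, 2.0, x.shape)
        r4 = np.random.rand()
        if r4 < 0.5:                                    # sine branch of Eq. (12)
            X_new[i] = x + r1 * np.sin(r2) * np.abs(r3 * best - x)
        else:                                           # cosine branch of Eq. (12)
            X_new[i] = x + r1 * np.cos(r2) * np.abs(r3 * best - x)
    return X_new
```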

3.1 Cons of Basic SCA and Proposed Improved Version

The SCA metaheuristic provides admirable performance when handling both bound-constrained and unconstrained benchmarks, with the added benefit of relative simplicity and a small number of control parameters [28]. Additionally, it shows admirable performance on various real-world, practical challenges [34, 48]. Despite these advantages, results from extensive practical testing on standard Congress on Evolutionary Computation (CEC) benchmarks make it clear that in some runs the algorithm converges too fast toward the current best solutions and the population's diversity shrinks. This occurs primarily because Eq. (12) is executed, for every agent's parameter, using the sine or cosine function guided toward the location of the best solution obtained so far, P*. The consequence of this directed search toward P* is that if the initially generated solutions lie too far from the optimum, the whole population quickly converges toward an unpromising domain of the search space, and worse results are generated at the end of a run. With this in mind, while the original SCA provides efficient exploitation, further improvements are possible in terms of its exploration abilities. Although many different methods for enhancing metaheuristics' exploration abilities exist in the modern literature, one recent promising method, based on the tangent flight operator, was introduced in [39]. Inspired by the tangent flight, the method proposed in this manuscript incorporates a tangent search operator with a large flight, which further intensifies exploration, by using the following equation:

$$X_i^{t+1} = X_i^t + \tan(r_5 \cdot \pi) \tag{14}$$

where $r_5$ represents a pseudorandom value selected from the [0, 1] range, and $\pi$ is the mathematical constant. Therefore, in each iteration, Eq. (14) is employed in addition to the sine and cosine search expressions. However, in later iterations, under the rational presumption that the algorithm has converged toward optimum (or suboptimum) domains of the search space, the tangent search exploration is no longer needed; for this reason, the following search expression is applied only in the first 50% of iterations, or fitness function evaluations (FFEs):

$$X_i^{t+1} =
\begin{cases}
X_i^t + r_1 \cdot \sin(r_2) \cdot \left| r_3 \cdot P_i^{*t} - X_i^t \right|, & r_4 < 0.35 \\
X_i^t + r_1 \cdot \cos(r_2) \cdot \left| r_3 \cdot P_i^{*t} - X_i^t \right|, & r_4 \in [0.35, 0.7) \\
X_i^t + \tan(r_5 \cdot \pi), & r_4 \in [0.7, 1]
\end{cases} \tag{15}$$

The phase of the algorithm's execution in which the tangent search is applied was determined empirically. The proposed approach is dubbed SCA with tangent search (SCA-TS), and its pseudocode is given in Algorithm 1.

Algorithm 1 The proposed SCA-TS algorithm's pseudocode
  create randomized set of solutions (agents) (X)
  while (t < T) do
    assess agents using objective function
    memorize top performing agent thus far (P* = X*)
    revise r1, r2, r3, r4 and r5
    if (t ≤ T/2) then
      update agents' positions using Eq. (15)
    else
      update agents' positions using Eq. (12)
    end if
  end while
  return best performing agent as global optimal solution
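Read end to end, Algorithm 1 could look roughly as follows in code. This is a self-contained sketch under several assumptions (NumPy, minimization of the objective, the FFE budget used as the time axis as in Sect. 4, and illustrative variable names); it is not the authors' implementation:

```python
import numpy as np

def sca_ts(objective, dim, bounds, N=7, max_ffe=77, a=2.0):
    """Sketch of SCA-TS (Algorithm 1): hybrid update (Eq. 15) in the first half
    of the FFE budget, plain SCA update (Eq. 12) afterwards. Minimizes `objective`."""
    low, high = bounds
    X = np.random.uniform(low, high, (N, dim))         # randomized initial agents
    fit = np.array([objective(x) for x in X])
    ffe = N
    best_idx = int(np.argmin(fit))
    best, best_fit = X[best_idx].copy(), float(fit[best_idx])
    while ffe < max_ffe:
        r1 = a - ffe * a / max_ffe                      # Eq. (13); FFEs as time axis (assumption)
        for i in range(N):
            r2 = np.random.uniform(0.0, 2.0 * np.pi, dim)
            r3 = np.random.uniform(0.0, 2.0, dim)
            r4, r5 = np.random.rand(), np.random.rand()
            if ffe <= max_ffe / 2:                      # first half of the budget: Eq. (15)
                if r4 < 0.35:
                    X[i] = X[i] + r1 * np.sin(r2) * np.abs(r3 * best - X[i])
                elif r4 < 0.7:
                    X[i] = X[i] + r1 * np.cos(r2) * np.abs(r3 * best - X[i])
                else:
                    X[i] = X[i] + np.tan(r5 * np.pi)    # tangent flight, Eq. (14)
            else:                                       # second half: plain SCA, Eq. (12)
                if r4 < 0.5:
                    X[i] = X[i] + r1 * np.sin(r2) * np.abs(r3 * best - X[i])
                else:
                    X[i] = X[i] + r1 * np.cos(r2) * np.abs(r3 * best - X[i])
            X[i] = np.clip(X[i], low, high)             # keep agents inside the search space
            f = objective(X[i])
            ffe += 1
            if f < best_fit:                            # memorize the top performing agent
                best, best_fit = X[i].copy(), f
            if ffe >= max_ffe:
                break
    return best, best_fit
```

For the dropout experiments described below, `dim` would be 1 and `bounds` would be (0, 1), with the objective wrapping the CNN training and evaluation.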

4 Simulations and Discussion

The experimentation has been conducted with the same model as in the paper used as a reference for the results [32]. Following the experimental setup of [32], the environment was arranged in the same way in order to establish objective grounds for comparison. The potential of swarm intelligence in this domain has not been thoroughly explored, and the goal of this research is to contribute in that direction. Python was used for the development of the testing framework, together with standard Python modules and APIs: NumPy, scikit-learn, Keras, SciPy, pandas, and matplotlib for graphical representation. The hardware on which the tests were performed consists of six NVIDIA GTX 1080 GPUs, an Intel Core i9-11900K CPU, and 64 GB of RAM running the Windows 10 OS. Validation was performed on four datasets standard for these types of problems: MNIST (http://yann.lecun.com/exdb/mnist/), Semeion (https://archive.ics.uci.edu/ml/datasets/Semeion+Handwritten+Digit), USPS (http://statweb.stanford.edu/tibs/ElemStatLearn/datasets/zip.info.txt), and CIFAR-10 (http://www.cs.toronto.edu/kriz/cifar.html). A detailed specification of each dataset can be obtained from the given links. The referenced work [33] utilizes two distinct CNN architectures, and the same practice was applied in the performed experiments. Default Caffe examples suggest the use of one architecture for the first three datasets, MNIST, Semeion, and USPS, while CIFAR-10 requires a different architecture.
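For readers reproducing the setup, two of the four datasets can be pulled directly through the Keras loaders, while Semeion and USPS have to be downloaded from the URLs above; this is a minimal sketch, not part of the original experiments:

```python
from tensorflow import keras

# MNIST and CIFAR-10 are bundled with Keras; Semeion and USPS must be fetched manually.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0   # scale pixel values to [0, 1]
x_test = x_test.astype("float32") / 255.0
print(x_train.shape, x_test.shape)            # (60000, 28, 28) (10000, 28, 28)
```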


L1 regularization (penalty) α and L2 regularization (weight decay) λ are employed in all simulations, together with the dropout probability (dp) of the dropout layer. The learning rate η was applied to the RMSProp optimizer used for training the models. Only dp was optimized, while the tuple of parameters (η, α, λ) was fixed. Consequently, the encoding of solutions is straightforward: each solution has one parameter with a value in [0, 1]. The fitness function is based on the classification error, making this a minimization problem. The fitness of an individual i is formulated as inversely proportional to the error:

$$\mathrm{fit}_i = \frac{1}{1 + \mathrm{error}_i} \tag{16}$$
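Expressed in code, the solution encoding and the fitness of Eq. (16) reduce to a few lines; the CNN training step is represented here by a hypothetical callable, since the full training pipeline is outside the scope of a snippet:

```python
def fitness(dp, train_and_evaluate_cnn):
    """Eq. (16): fitness of an individual encoding a single dropout probability dp in [0, 1].

    `train_and_evaluate_cnn` is a hypothetical callable that trains the CNN with the given
    dp (and the fixed eta, alpha, lambda from Table 1) and returns the classification error.
    """
    error = train_and_evaluate_cnn(dp)
    return 1.0 / (1.0 + error)
```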

To add to the objectivity of the compared results, the regular Caffe architecture, both with and without dropout, is also included alongside every metaheuristic in the comparison. The parameters used are the default Caffe parameters for η, α, and λ; the parameters for the four datasets are summarized in Table 1. The standard procedure of splitting each dataset into training, validation, and testing sets is followed, and the classification accuracy on the test set is used to determine each individual's fitness. The categorical_crossentropy loss function is used in all experiments; the model for the CIFAR-10 dataset is trained for 4000 epochs, while the three remaining datasets use 10,000 epochs. The batch sizes, numbers of samples, and all other relevant details are given in Table 2.

Table 1 CNN η, α, and λ adjustments for simulations

| Dataset  | η     | α   | λ       | dp     |
|----------|-------|-----|---------|--------|
| CIFAR-10 | 0.001 | 0.9 | 0.004   | [0, 1] |
| MNIST    | 0.01  | 0.9 | 0.00005 | [0, 1] |
| Semeion  | 0.001 | 0.9 | 0.00005 | [0, 1] |
| USPS     | 0.01  | 0.9 | 0.00005 | [0, 1] |

Table 2 Used datasets' experimental configuration

| Dataset  | Training set samples (batch size) | Validation set samples (batch size) | Testing set (batch size) | Epochs |
|----------|-----------------------------------|-------------------------------------|--------------------------|--------|
| CIFAR-10 | 20,000 (100)                      | 30,000 (100)                        | 10,000 (100)             | 4000   |
| MNIST    | 20,000 (64)                       | 40,000 (100)                        | 10,000 (100)             | 10,000 |
| Semeion  | 200 (2)                           | 400 (400)                           | 993 (993)                | 10,000 |
| USPS     | 2406 (32)                         | 4885 (977)                          | 2007 (2007)              | 10,000 |


Table 3 Comparison of the proposed SCA-TS algorithm and other metaheuristic optimizers for the mean classification accuracy

| Method        | MNIST acc. | MNIST dp | Semeion acc. | Semeion dp | USPS acc. | USPS dp | CIFAR-10 acc. | CIFAR-10 dp |
|---------------|------------|----------|--------------|------------|-----------|---------|---------------|-------------|
| Caffe         | 99.07      | 0        | 97.62        | 0          | 95.81     | 0       | 71.48         | 0           |
| Dropout Caffe | 99.18      | 0.5      | 98.14        | 0.5        | 96.22     | 0.5     | 72.09         | 0.5         |
| BA            | 99.15      | 0.492    | 98.35        | 0.693      | 96.44     | 0.761   | 71.49         | 0.632       |
| CS            | 99.15      | 0.488    | 98.21        | 0.544      | 96.32     | 0.714   | 71.21         | 0.668       |
| PSO           | 99.16      | 0.494    | 97.79        | 0.371      | 96.33     | 0.724   | 71.51         | 0.622       |
| EHO           | 99.14      | 0.476    | 98.11        | 0.481      | 96.23     | 0.681   | 71.15         | 0.704       |
| WOA           | 99.14      | 0.488    | 98.23        | 0.561      | 96.33     | 0.721   | 71.23         | 0.684       |
| SSA           | 99.18      | 0.498    | 98.31        | 0.642      | 96.41     | 0.754   | 71.58         | 0.528       |
| GOA           | 99.15      | 0.491    | 98.14        | 0.513      | 96.16     | 0.480   | 70.95         | 0.848       |
| BBO           | 99.14      | 0.473    | 98.16        | 0.515      | 96.16     | 0.484   | 71.08         | 0.769       |
| FA            | 99.19      | 0.494    | 98.29        | 0.619      | 96.41     | 0.759   | 71.55         | 0.584       |
| SCA           | 99.18      | 0.495    | 98.24        | 0.581      | 96.28     | 0.704   | 71.54         | 0.598       |
| SCA-TS        | 99.21      | 0.505    | 98.35        | 0.693      | 96.52     | 0.773   | 72.11         | 0.481       |

The proposed SCA-TS was compared with the bat algorithm (BA), cuckoo search (CS), particle swarm optimization (PSO), elephant herding optimization (EHO), the whale optimization algorithm (WOA), the salp swarm algorithm (SSA), the grasshopper optimization algorithm (GOA), biogeography-based optimization (BBO), the firefly algorithm (FA), and the basic SCA. All opponent metaheuristics were assessed under identical experimental conditions for the purpose of an objective comparative analysis, and their control parameters can be retrieved from the previous studies [11, 33]. The methods were tested with 77 FFEs per run and N = 7. FFEs are used instead of T as the termination criterion to provide a more robust comparative analysis, because some algorithms utilize varying numbers of FFEs in a single iteration. The reported metrics represent mean values obtained over 20 independent runs. Table 3 reports the average accuracy and the mean dp value obtained from the MNIST, Semeion, USPS, and CIFAR-10 simulations; the approaches that achieved the best overall accuracy are shown in bold. The table shows that the average accuracy differs across datasets, which is justified by each dataset's structure in terms of the number of images, the number of features, and the content itself. Table 3 also attests to the clearly better performance of the proposed solution with respect to the optimized dp value. The proposed SCA-TS approach attained an accuracy of 99.21% on the MNIST dataset, with a determined dp value of 0.505. The other contemporary metaheuristics attained dp values below the standard Dropout Caffe value of dp = 0.5. The experiments on this dataset suggest that, in the ideal case, the dp value ought to be slightly above 0.5 to achieve higher accuracy, and the proposed SCA-TS was evidently the only metaheuristic approach that obtained such a value.


For the Semeion dataset, the SCA-TS and BA approaches obtained the best accuracy of 98.35%, determining dp values of 0.693 and 0.692, respectively. For this dataset, the findings indicate that accuracy increases as dp grows, exceeding the standard Dropout Caffe value of 0.5. The second best method was SSA, which obtained a slightly lower accuracy of 98.31% with dp = 0.642. The basic Caffe model that does not use dropout (dp = 0) obtained 97.62% accuracy, while Dropout Caffe (dp = 0.5) attained 98.14%. The same trends were observed on the USPS dataset. The suggested SCA-TS generated a CNN that obtained the best accuracy of 96.52% with a determined dp value of 0.773; here, too, the obtained accuracy rises as the value of dp grows. The BA algorithm again obtained slightly lower accuracy, 96.45% with dp = 0.762. The basic Caffe and Dropout Caffe models obtained noticeably lower accuracy than the proposed SCA model, by approximately 0.7% and 0.3%, respectively. The fourth dataset, CIFAR-10, exposed a different pattern compared with the rest of the datasets. The experimental results showed that when dp rises above the standard Dropout Caffe value (dp = 0.5), the accuracy starts to decline; conversely, if dp is set too low, the obtained accuracy decreases as well. The best performance on this particular dataset is obtained for dp in a range marginally below 0.5. The suggested SCA-TS method attained the highest accuracy of 72.11% by estimating dp = 0.481. SCA-TS is the only algorithm that determined a dp value below the 0.5 limit, since all other metaheuristic methods ended up with dp values in (0.5, 1]. Based on the experimental findings, the most important observations are therefore twofold: First, the proposed SCA-TS proved to be a solid method for tackling continuous optimization challenges and showed superior performance in comparison with other contemporary state-of-the-art methods. Second, the deficiencies of the basic SCA were overcome by the proposed algorithm. Finally, to better visualize the performance improvements of the proposed SCA-TS over the basic SCA, average convergence speed graphs for the four datasets, generated for 77 FFEs per run, are shown in Fig. 1.

Fig. 1 Average convergence speed graphs for MNIST, Semeion, USPS, and CIFAR-10 datasets for SCA-TS and SCA

5 Conclusion

This experimental research proposes an automated method for selecting the dropout regularization parameter dp in convolutional neural networks by employing the novel SCA-TS metaheuristic algorithm. The scientific contributions of the presented research are twofold: First, improvements for the well-known SCA metaheuristic are devised and a new version is proposed. Second, by establishing better dropout regularization, the performance of CNNs on image classification challenges is improved. The performance of the proposed SCA-TS metaheuristic has been verified on an applied CNN task, dropout probability optimization, which is significant for overfitting prevention, one important challenge in the deep learning domain. The classification accuracy obtained on the MNIST, CIFAR-10, Semeion, and USPS datasets indicates that the suggested SCA-TS method has significant potential in deep learning applications. In future research, the focus will be on testing SCA-TS on other machine learning problems and adapting it to resolve other applied NP-hard tasks. Finally, CNN regularization will be addressed in more depth by employing other methods and by optimizing the remaining variables η, α, and λ as well.

References 1. Afzal MZ, Capobianco S, Malik MI, Marinai S, Breuel TM, Dengel A, Liwicki M (2015) Deepdocclassifier: document classification with deep convolutional neural network. In: 2015 13th international conference on document analysis and recognition (ICDAR). IEEE, pp 1111– 1115 2. Bacanin N, Alhazmi K, Zivkovic M, Venkatachalam K, Bezdan T, Nebhen J (2022) Training multi-layer perceptron with enhanced brain storm optimization metaheuristics. Comput Mater Contin 70(2):4199–4215. https://doi.org/10.32604/cmc.2022.020449


3. Bacanin N, Arnaut U, Zivkovic M, Bezdan T, Rashid TA (2022) Energy efficient clustering in wireless sensor networks by opposition-based initialization bat algorithm. In: Computer networks and inventive communication technologies. Springer, pp 1–16 4. Bacanin N, Bezdan T, Tuba E, Strumberger I, Tuba M (2020) Monarch butterfly optimization based convolutional neural network design. Mathematics 8(6):936 5. Bacanin N, Bezdan T, Venkatachalam K, Zivkovic M, Strumberger I, Abouhawwash M, Ahmed A (2021) Artificial neural networks hidden unit and weight connection optimization by quasirefection-based learning artificial bee colony algorithm. IEEE Access 6. Bacanin N, Bezdan T, Zivkovic M, Chhabra A (2022) Weight optimization in artificial neural network training by improved monarch butterfly algorithm. In: Mobile computing and sustainable informatics. Springer, pp 397–409 7. Bacanin N, Petrovic A, Zivkovic M, Bezdan T, Antonijevic M (2021) Feature selection in machine learning by hybrid sine cosine metaheuristics. In: International conference on advances in computing and data sciences. Springer, pp 604–616 8. Bacanin N, Stoean R, Zivkovic M, Petrovic A, Rashid TA, Bezdan T (2021) Performance of a novel chaotic firefly algorithm with enhanced exploration for tackling global optimization problems: application for dropout regularization. Mathematics 9(21). https://doi.org/10.3390/ math9212705 9. Bacanin N, Tuba E, Bezdan T, Strumberger I, Jovanovic R, Tuba M (2020) Dropout probability estimation in convolutional neural networks by the enhanced bat algorithm. In: 2020 international joint conference on neural networks (IJCNN). IEEE, pp 1–7 10. Bacanin N, Tuba E, Zivkovic M, Strumberger I, Tuba M (2019) Whale optimization algorithm with exploratory move for wireless sensor networks localization. In: International conference on hybrid intelligent systems. Springer, pp 328–338 11. Bacanin N, Zivkovic M, Al-Turjman F, Venkatachalam K, Trojovsk`y P, Strumberger I, Bezdan T (2022) Hybridized sine cosine algorithm with convolutional neural networks dropout regularization application. Sci Rep 12(1):1–20 12. Bacanin N, Zivkovic M, Bezdan T, Cvetnic D, Gajic L (2022) Dimensionality reduction using hybrid brainstorm optimization algorithm. In: Proceedings of international conference on data science and applications. Springer, pp 679–692 13. Bacanin N, Zivkovic M, Bezdan T, Venkatachalam K, Abouhawwash M (2022) Modified firefly algorithm for workflow scheduling in cloud-edge environment. Neural computing and applications, pp 1–26 14. Basha J, Bacanin N, Vukobrat N, Zivkovic M, Venkatachalam K, Hubálovsk`y S, Trojovsk`y P (2021) Chaotic Harris hawks optimization with quasi-reflection-based learning: an application to enhance CNN design. Sensors 21(19):6654 15. Bezdan T, Milosevic S, Venkatachalam K, Zivkovic M, Bacanin N, Strumberger I (2021) Optimizing convolutional neural network by hybridized elephant herding optimization algorithm for magnetic resonance image classification of glioma brain tumor grade. In: 2021 zooming innovation in consumer technologies conference (ZINC). IEEE, pp 171–176 16. Bezdan T, Petrovic A, Zivkovic M, Strumberger I, Devi VK, Bacanin N (2021) Current best opposition-based learning salp swarm algorithm for global numerical optimization. In: 2021 zooming innovation in consumer technologies conference (ZINC). IEEE, pp 5–10 17. 
Bezdan T, Stoean C, Naamany AA, Bacanin N, Rashid TA, Zivkovic M, Venkatachalam K (2021) Hybrid fruit-fly optimization algorithm with k-means for text document clustering. Mathematics 9(16):1929 18. Bezdan T, Zivkovic M, Tuba E, Strumberger I, Bacanin N, Tuba M (2020) Glioma brain tumor grade classification from MRI using convolutional neural networks designed by modified FA. In: International conference on intelligent and fuzzy systems. Springer, pp 955–963 19. Bezdan T, Zivkovic M, Tuba E, Strumberger I, Bacanin N, Tuba M (2020) Multi-objective task scheduling in cloud computing environment by hybridized bat algorithm. In: International conference on intelligent and fuzzy systems. Springer, pp 718–725 20. Bouvrie J (2006) Notes on convolutional neural networks


21. Chattopadhyay A, Hassanzadeh P, Pasha S (2020) Predicting clustered weather patterns: a test case for applications of convolutional neural networks to spatio-temporal climate data. Sci Rep 10(1):1–13 22. Cuk A, Bezdan T, Bacanin N, Zivkovic M, Venkatachalam K, Rashid TA, Devi VK (2021) Feedforward multi-layer perceptron training by hybridized method between genetic algorithm and artificial bee colony. Data science and data analytics: opportunities and challenges. p 279 23. Gajic L, Cvetnic D, Zivkovic M, Bezdan T, Bacanin N, Milosevic S (2021) Multi-layer perceptron training using hybridized bat algorithm. In: Computational vision and bio-inspired computing. Springer, pp 689–705 24. Gavrilov AD, Jordache A, Vasdani M, Deng J (2018) Preventing model overfitting and underfitting in convolutional neural networks. Int J Softw Sci Comput Intell (IJSSCI) 10(4):19–28 25. Hongtao L, Qinchuan Z (2016) Applications of deep convolutional neural network in computer vision. J Data Acquis Process 31(1):1–17 26. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25:1097–1105 27. Milosevic S, Bezdan T, Zivkovic M, Bacanin N, Strumberger I, Tuba M (2021) Feed-forward neural network training by hybrid bat algorithm. In: Modelling and development of intelligent systems: 7th international conference, MDIS 2020, Sibiu, Romania, 22–24 Oct 2020. Revised selected papers, vol 7. Springer, pp 52–66 28. Mirjalili S (2016) SCA: a sine cosine algorithm for solving optimization problems. KnowlBased Syst 96:120–133. https://doi.org/10.1016/j.knosys.2015.12.022 29. Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the twenty-first international conference on Machine learning, p 78 30. Ramaiah NP, Ijjina EP, Mohan CK (2015) Illumination invariant face recognition using convolutional neural networks. In: 2015 IEEE international conference on signal processing, informatics, communication and energy systems (SPICES). IEEE, pp 1–4 31. Ranjan R, Sankaranarayanan S, Castillo CD, Chellappa R (2017) An all-in-one convolutional neural network for face analysis. In: 2017 12th IEEE international conference on automatic face & gesture recognition (FG 2017). IEEE, pp 17–24 32. de Rosa G, Papa J, Yang XS (2018) Handling dropout probability estimation in convolution neural networks using metaheuristics. Soft Comput 22. https://doi.org/10.1007/s00500-0172678-4 33. de Rosa G, Papa J, Yang XS (2018) Handling dropout probability estimation in convolution neural networks using metaheuristics. Soft Comput 22. https://doi.org/10.1007/s00500-0172678-4 34. Salb M, Zivkovic M, Bacanin N, Chhabra A, Suresh M (2022) Support vector machine performance improvements for cryptocurrency value forecasting by enhanced sine cosine algorithm. In: Computer vision and robotics. Springer, pp 527–536 35. Sammut C, Webb GI (eds) (2010) Bias-variance trade-offs. Springer, Boston, MA, p 110. https://doi.org/10.1007/978-0-387-30164-8_76 36. Špetlík R, Franc V, Matas J (2018) Visual heart rate estimation with convolutional neural network. In: Proceedings of the British machine vision conference, Newcastle, UK, pp 3–6 37. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958 38. 
Strumberger I, Tuba E, Bacanin N, Zivkovic M, Beko M, Tuba M (2019) Designing convolutional neural network architecture by the firefly algorithm. In: 2019 international young engineers forum (YEF-ECE). IEEE, pp 59–65 39. Ting FF, Tan YJ, Sim KS (2019) Convolutional neural network improvement for breast cancer classification. Expert Syst Appl 120:103–115 40. Wu J (2017) Introduction to convolutional neural networks. National Key Lab for Novel Software Technology. Nanjing University, China 5(23):495 41. Xiao T, Xu Y, Yang K, Zhang J, Peng Y, Zhang Z (2015) The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 842–850


42. Yang XS (2015) Recent advances in swarm intelligence and evolutionary computation. Springer 43. Zhang Y, Zhao D, Sun J, Zou G, Li W (2016) Adaptive convolutional neural network and its application in face recognition. Neural Process Lett 43(2):389–399 44. Zivkovic M, Bacanin N, Tuba E, Strumberger I, Bezdan T, Tuba M (2020) Wireless sensor networks life time optimization based on the improved firefly algorithm. In: 2020 international wireless communications and mobile computing (IWCMC). IEEE, pp 1176–1181 45. Zivkovic M, Bacanin N, Venkatachalam K, Nayyar A, Djordjevic A, Strumberger I, Al-Turjman F (2021) Covid-19 cases prediction by using hybrid machine learning and beetle antennae search approach. Sustain Cities Soc 66:102669 46. Zivkovic M, Bacanin N, Zivkovic T, Strumberger I, Tuba E, Tuba M (2020) Enhanced grey wolf algorithm for energy efficient wireless sensor networks. In: 2020 zooming innovation in consumer technologies conference (ZINC). IEEE, pp 87–92 47. Zivkovic M, Bezdan T, Strumberger I, Bacanin N, Venkatachalam K (2021) Improved Harris hawks optimization algorithm for workflow scheduling challenge in cloud–edge environment. In: Computer networks, big data and IoT. Springer, pp 87–102 48. Zivkovic M, Jovanovic L, Ivanovic M, Krdzic A, Bacanin N, Strumberger I (2022) Feature selection using modified sine cosine algorithm with covid-19 dataset. In: Evolutionary computing and mobile sustainable networks. Springer, pp 15–31 49. Zivkovic M, Venkatachalam K, Bacanin N, Djordjevic A, Antonijevic M, Strumberger I, Rashid TA (2021) Hybrid genetic algorithm and machine learning method for covid-19 cases prediction. In: Proceedings of international conference on sustainable expert systems: ICSES 2020, vol 176. Springer, p 169 50. Zivkovic M, Zivkovic T, Venkatachalam K, Bacanin N (2021) Enhanced dragonfly algorithm adapted for wireless sensor network lifetime optimization. In: Data intelligence and cognitive informatics. Springer, pp 803–817

Chapter 60

Development of a Web Application for the Management of Patients in the Medical Area of Nutrition

Antonio Sarasa-Cabezuelo

1 Introduction

An important part of doctors' jobs is managing the data they obtain from patients during visits. From these data, doctors can track the evolution of the pathologies their patients present, predict possible problems, know the overall health status of a patient, and develop appropriate treatments. For this reason, they require an efficient, simple, and intuitive information management system that allows them to capture data, visualize it, and calculate different metrics, as well as more advanced functions such as predicting pathologies or recommending appropriate treatments or medications. These general needs become specific in the field of nutrition [1]. In this sense, a nutritionist requires specific management of patient information [3], such as the possibility of storing the anthropometric measurements of patients, calculating metrics based on those measurements (somatotype, body mass index, …), assigning specific diets to the pathologies of the patients, and monitoring compliance [5] with the diet and its effectiveness based on the follow-up of anthropometric measurements. The main problems of medical information management applications are [2] the general nature of their management and the difficulty of adapting them to particular needs. In most cases, the applications allow a standard management of clinical records [4] that treats them as the set formed by the characteristic information of the patient plus a set of consultations or visits. The characteristic information is made up of personal data, diseases suffered, pathologies, medications used, allergies, and medical history. Consultations are managed [9] as independent documents associated with the clinical history.


Thus, each query consists of a set of meta-information, such as date and time of the query, relationship with previous queries, or reason for the query, which are usually managed as fixed text fields, together with a set of free-text fields in which the doctor describes the pathology, symptoms, tests performed or requested, and the indicated treatment. On the other hand, these applications are normally complex to adapt [15] to the particular needs of each medical specialty and of each doctor, since they hardly allow any configuration. This situation is due to an application design [16] whose objective is that the user interface simply acts as a collector of fixed information, therefore not admitting changes. In the field of nutrition, the applications mostly used are general [6] or, when specific [7], they present very simple functions reduced to capturing anthropometric measurements or displaying data, such as nutrium.io [8], nutritioapp [10], grunumur [11], and nutri-educa [12]. Likewise, another problem is the lack of integration [13] between diet management applications and the applications used for managing patient data, which complicates the work of the nutritionist, who must consult two independent types of applications. Furthermore, in the latter case, there is no quality control [14] carried out by specialists on some diet applications. For all these reasons, it would be necessary to have applications that integrate the management of nutrition patient information with the applications that manage diets, so that the nutritionist, in addition to capturing patient data and being able to view it, could also manage diets (create, modify, and adapt them to the particular pathologies of each patient) and actively monitor the effectiveness of the diet. This work proposes a web application specifically oriented to nutritionists, offering the particular functionality they require to manage their patients as well as diet-based treatments and their follow-up. The article is structured as follows. Section 2 presents the objectives of the application. Section 3 describes the architecture of the application and the data model used. Next, Sect. 4 shows the functionality of the web application. Subsequently, Sect. 5 describes the usability evaluation carried out. Finally, Sect. 6 presents the conclusions and a set of lines of future work.

2 Objectives

The objective of the proposed application is to offer a tool that allows the nutritionist to manage a history of each patient in which the data collected at each visit are saved, so that the information can be consulted and exploited in a simple and efficient way. Likewise, the application will allow the patient to access and manage their personal information. This objective is specified in the following more specific objectives:

1. Availability of a function that allows storing the information of a patient's clinical history, as well as making modifications, additions, or deletions to it.
2. Availability of a function that allows managing patient visit shifts.


3. Availability of a function to assign diets to patients and manage them.
4. Possibility of consulting and exploiting the information of a patient through statistical tools as well as the graphical representation of this information.
5. Possibility of monitoring the evolution of the pathologies of each patient.
6. Availability of functionality for the patient, so that they can access, consult, and exploit the information associated with them, such as modifying their profile, viewing their anthropometric constants, or accessing the diets that the nutritionist has assigned to them.
7. Development of the functionality as a web application that integrates all the functions and presents an easy-to-use, friendly, and intuitive user interface.
8. Integration of visual tools that provide the doctor with the possibility of graphically representing the information as well as manipulating it.

In order to implement these objectives, a web application oriented toward professionals and patients has been developed, in which a set of functionalities grouped by user type has been created. For the patient actor, the following functionalities have been defined: consulting the associated diet and visualizing its effects in terms of anthropometric constants, as well as consulting the next appointment or contacting the nutritionist by email. For the nutritionist, functions have been defined to register and deregister patients, search for patients, manage patient appointments (create, modify, and cancel appointments), manage a patient's medical information (add, modify, and/or delete anthropometric measurements and pathologies), create predefined diets for certain pathologies, create particular diets for a patient, and manage diets (associate, modify, or delete the diet associated with a patient).

3 Architecture

The web application has been developed using an MVC model, so that the application data, the graphical user interface, and the logic responsible for controlling the application are independent, and the communication between the server and the client occurs asynchronously. Figure 1 shows the architecture scheme. The front-end is made up of blocks that share information with each other. Each block contains three files that represent the view and the controller. The HTML file (what the client will display), together with the CSS file (the style sheet applied to a block), forms the view responsible for receiving user requests. The controller is represented by TypeScript files that respond to the changes and requests of the view: it contains all the functionality for obtaining, modifying, and saving the data generated by the user or by the application itself, and it is responsible for the two-way communication between the HTML code and the data model.


Fig. 1 Architecture scheme

In this sense, the controller modifies the model in the way requested by the user through the view, and once the model is updated, the data are reflected in the view. To carry out this communication, classes implemented in Angular are used, from which calls are made to an API hosted on the server (back-end), which returns the information provided by the databases in a correctly formatted way or saves the information in the database. In the back-end, there is an API implemented in PHP that interacts with an Apache web server and makes access requests to a MariaDB relational database. The database has 12 tables, in which everything necessary to manage the information generated by the application activity is stored:

• The registration code table contains the registration codes necessary to allow the registration of a professional. These codes are unique and are entered by the system administrator.
• The professional table contains the information that represents the professional in the application.
• The patient table maintains the data of a patient and the relationship with the associated professional. A patient is associated with a single professional, and a professional can have multiple patients.


• The appointment table contains the history of appointments that the associated patient has had. An appointment contains only one patient, while a patient can have multiple appointments associated with it.
• The pathology_patient table maintains the relationship between the patient's information and the pathologies presented by the patient.
• The pathology table contains the name of a pathology and the email of the professional who created it. A professional can add different pathologies. By storing the email of the professional who created it, these pathologies can be made accessible only to that professional. It is also possible to store pathologies common to all professionals.
• The anatomy table represents the measurements of a patient. This information is part of the patient's clinical history, so it is stored together with the measurement date. These measurements of a patient's anatomy are entered by a professional through the application.
• The metric table represents the metrics of a patient. These data are part of the patient's clinical history, so they are stored together with the measurement date. Metrics are automatically calculated by the app based on patient data (a small sketch of one such derived metric follows this list). All of a patient's metrics can be retrieved, since this information is associated with the patient's email.
• The diet table contains the diets associated with a patient. A patient can have multiple diets, but a diet belongs to only one patient.
• The day table contains five fields that represent the different time bands. Each time slot has a meal associated with it.
• The food table relates the foods in the diets to the foods that are part of a meal. A meal can have multiple foods, and a food can be referenced in multiple meals.
• The nutritional table contains all the nutritional information of the foods that are part of a diet.
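As an illustration of the kind of derived metric stored in the metric table, the body mass index can be computed from two anthropometric measurements. The exact formulas and field names used by the application are not given in the text, so everything below is a hypothetical sketch:

```python
def body_mass_index(weight_kg, height_m):
    """BMI = weight / height^2; one of the metrics derivable from stored measurements."""
    return weight_kg / (height_m ** 2)

# hypothetical measurement record as it might be stored in the anatomy/metric tables
record = {"weight_kg": 72.5, "height_m": 1.78, "date": "2022-05-10"}
print(round(body_mass_index(record["weight_kg"], record["height_m"]), 1))  # 22.9
```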

4 Functionality

4.1 Nutritionist Functionality

In order to use the application, a professional must contact the administrator to be provided with a unique access code and be able to register. Once the code is obtained, the user must validate it by entering it in the application. After validation, the professional must enter the personal data and the credentials. Next, the log-in screen is shown, where the credentials must be entered, and then the main view of the professional is displayed (Fig. 2). Three actions can be performed in the professional view:


Fig. 2 Main view of the professional

1. Register a new patient. Click on the "+" button in the "controls" ("Controles") section, and a form is displayed where the data of the new patient are entered (name, surname, age, gender, email, and password). Next, the new patient is created, it appears in the professional's patient list, and the patient file is loaded.
2. Create default diets. This allows creating diet templates, so that when a diet must be assigned to a patient, a template can be taken and adapted to the patient. To do this, click on the "Default Diets" ("Dietas predeterminadas") button, and a diet editor opens. These diets are saved associated with the professional and can be edited and viewed through the list of available diets.
3. Search for a specific patient in the patient list. To do this, the patient's first or last name is entered in the search box at the bottom of the patient list, so that the names of those who do not meet the search criteria are hidden from the list.

From the patient view (Fig. 3a), the professional can manage all the information of a patient:

Fig. 3 a Patient view, b assign pathologies, c new appointment, d assign a diet, e view the progress

• Add a new appointment. To add an appointment to a patient, click on the "New Appointment" ("Nueva Cita") button, which displays a form (Fig. 3c) where the date of the appointment is selected. Appointments are stored in a history in the appointments section. There can only be one active appointment, represented by a green color in the same section. If a new appointment is added while an appointment is active, the old appointment is deactivated and the new one becomes active. In order to cancel an appointment, just click on the X icon next to the active appointment.
• Assign measures and calculate metrics. Measurements can be assigned to a patient by filling in the text fields that appear on their record in the "Measurements" ("Medidas") section. Once added, they are saved by clicking on the "Update measurements/Save" ("Actualizar medidas/Guardar") button, and as a result, a set of metrics is calculated in real time in the "Metrics" ("Métricas") section. It is possible to retrieve measurements and metrics from the patient's history by selecting them from the list located above the "Update measurements/Save" button. Selecting any of the historical measures loads the selected measures and metrics. When measurements are loaded from the history and modified, they are saved in the patient's history as new measurements; the history remains intact.
• Assign pathologies. Pathologies can be associated with a patient. To do this, click on the "Pathologies" ("Patologías") button, and a window is displayed with all the available pathologies (Fig. 3b) that can be assigned. One or more pathologies can be assigned by clicking on the slider next to them. When a pathology is assigned, it appears in a red bar at the top of the patient file. Pathologies not available in the list can also be added by entering the name of the new pathology and clicking on the "+" button.
• Delete a patient. To delete a patient, click on the "Deactivate Patient" ("Desactivar Paciente") button, and a confirmation box is displayed. Once confirmed, the patient remains in the database but is not accessible to the professional. To reactivate the patient, a new patient should be registered with the email associated with this patient.


• View the progress of a patient. It is possible to view the progress of any patient from the history of measurements and metrics. To do this, click on the "Progress" ("Progreso") button. A view is loaded where the evolution and different patient data are shown by means of graphs (Fig. 3e). To view the different graphs (weight, somatotype, body composition, skinfolds and perimeters, and waist-hip ratio), simply move through the different available tabs. Likewise, it is possible to obtain more detail on a specific point: just hold the mouse pointer over the point of interest, and a small note is shown with more details about the information of that point.
• End appointment. Once a patient has been attended to, click on the "Attended" button.
• Assign a diet. The professional can assign a diet to a patient. To do this, click on the "Diets" ("Dietas") button. A view is loaded in which it is possible to add as many foods as desired in the different daily slots available, on any of the days of the week. To do this, click on the "+" button. A new food is then added to the list, and the professional has to fill in the name of the food, its quantity, and its units. If the food is already present in the database, its units are assigned automatically. Once the entire diet is filled in, the professional assigns a name to it. After clicking on the "Save" button, the diet is assigned to the patient and is shown in the list of diets on the left side of the view. A patient can have different diets assigned over time, and it is possible to edit them. To do this, click on the "Edit" ("Editar") button next to each diet, and the selected diet is automatically loaded in the diet editing area to add, delete, or edit any food. Likewise, it is possible to retrieve any predefined diet for the patient by selecting it from the list of predefined diets found below the diet history. Selecting any of the predefined diets loads it into the diet edit view to add, remove, or modify foods. Once the diet is completed, the "Save" button is pressed and the diet is assigned to the patient, becoming the active diet. Any diet can be visualized graphically and correctly formatted; to do this, simply click on the diet to visualize it.

4.2 Patient Functionality

Each patient has a personal account in the application, so that when they access the application with their credentials, they have access to their personal data and a series of functionalities. In the patient view, three functions are available (Fig. 4a). First, each patient can check their next appointment by clicking on the "My next appointment" ("Mi próxima cita") button; pressing it displays a pop-up window with the next appointment assigned by the professional. Likewise, when the "My Diet" ("Mi dieta") button is pressed, the patient visualizes the active diet assigned by the nutritionist (Fig. 4b). Finally, clicking on the "Contact Nutritionist" ("Contactar Nutricionista") button opens an email editor with the email address of the professional, so that the patient can communicate with them. Likewise, in the patient's view, in addition to these three options, the graphs of the anthropometric measurements and metrics of the patient corresponding to their clinical history are shown (these graphs are the same ones that the nutritionist can observe).

Fig. 4 a Patient view, b patient diet

5 Evaluation

An evaluation of the usability of the application has been carried out. For this, a script was written describing a set of actions that the evaluating user must carry out within the application. Once the evaluation is finished, the user must fill out an online form answering questions related to the test carried out. Eleven people participated in the evaluation, 7 women and 4 men, aged between 25 and 56 years, most of them with studies related to the field of health sciences. The tests carried out were:

• Creation of a user with a professional role and creation of a patient associated with an appointment, a pathology, and different measures.
• Adding some foods to a diet in different time slots and days, as well as saving and editing the diet and creating a default diet.
• Creation of new patients from the data provided.
• Adding and removing appointments for different patients, updating their measurements, editing their diets, and observing their progress through graphs.
• Logging in as one of the created patients and carrying out the functions associated with the patient: viewing the progress graphs, consulting the diet and the next appointment, and writing an email to the nutritionist.


The results obtained from the evaluation were the following:

• Question: How intuitive is the application? On a scale of 1 to 10, with 1 not at all intuitive and 10 very intuitive: 6 respondents rate it with 9, 3 respondents rate it with 8, 1 respondent rates it with 10, and 1 respondent rates it with 6.
• Question: How intuitive is the way to access patient records? On a scale of 1 to 5, 1 being not at all intuitive and 5 being very intuitive: 7 respondents evaluate it with 5, 4 respondents with 4, and 1 respondent evaluates it with 3.
• Question: How intuitive is it to add measurements to a patient? On a scale of 1 to 5, 1 being not at all intuitive and 5 being very intuitive: 7 respondents evaluate it with 5, 2 respondents with 4, and 2 respondents evaluate it with 3.
• Question: Would you add any more metrics or measures to the application? Most answer "NO." However, one respondent suggested the inclusion of bioimpedance data.
• Question: How intuitive is the way to create predetermined diets? On a scale of 1 to 5, 1 being not at all intuitive and 5 being very intuitive: 9 respondents evaluate it with 5, 1 respondent with 1, and 1 respondent evaluates it with 3.
• About anomalies and problems in the application: difficulty in returning to the main page, difficulty in finding a patient with the search if the name and surnames are entered at the same time, and difficulty in modifying an appointment for a patient.
• About future improvements of the application: save more clinical data of the patient as part of the clinical history; generate a report every 24 h about the diet; generate a food consumption frequency questionnaire; make physical exercise recommendations; propose cooking recipes related to the diet; a function to print a patient's diet; a function that relates anthropometric data with certain pathologies; collect data to obtain the Harris–Benedict calculation.
• About the problems encountered in carrying out the tasks:
  – Two people had problems with the step "Open one of the required web browsers."
  – One person had problems with the step "Enter the registration code."
  – One person had problems with the step "Introduce the measurements to a patient."
  – One person had problems with the step "Modify a diet by adding one more food."
  – One person had problems with the step "Save the diet with a given name."
  – One person had problems with the step "Go to the professional's main page."
  – Two people had problems with the step "Show a patient her progress graphs."
  – One person had problems with the step "Add a predetermined diet to a patient by adding one more food."
  – One person had problems with the step "Mark the patient as cared for."
  – One person had problems with the step "Change a patient's appointment."


6 Conclusions and Future Work

This article has presented a web application that aims to facilitate the work of a nutritionist and offer patients access to their information. Regarding the nutritionist, the application allows managing patients and their clinical histories, as well as organizing consultation shifts. Regarding the management of patient information, the application facilitates monitoring the patients' progress by tracking the anthropometric measurements and their derived metrics and by visualizing these data through different graphs. It also offers the possibility of associating pathologies with patients and, in this way, creating better adjusted diagnoses and treatments for each type of patient. For this last task, the application facilitates the preparation of diets by specifying the foods, quantities, and units in the different time slots of the day. Each diet is associated with a specific patient, who can consult it through their patient account in the application. In addition, any diet can be edited or deleted by the nutritionist, and the application allows creating diet templates that can be adapted to each patient. Regarding the patient, the application allows viewing the diet created by the nutritionist, consulting the next appointment, contacting the nutritionist by email, and viewing the evolution of the measurements and metrics through the different graphs.

The application can be improved in various ways, which leads to the following lines of future work. First, the inclusion of more measures and metrics relevant to the nutritionist, as well as a greater number of graphs that complement these new measures and metrics, with the aim of providing more information on the patient's health status. Second, the inclusion of a section of cooking recipes using the foods in the patient's diet, so that when certain foods are included in a time slot, the application suggests cooking recipes that contain those foods. Third, allowing the patient to mark in the diet the foods they have consumed, to improve the monitoring of the patient's eating habits. Fourth, the inclusion of a questionnaire on food consumption to be used in the first consultation, providing the nutritionist with knowledge of the eating habits of the new patient. Fifth, expanding the functionality so that the application recommends physical exercise routines that complement the nutrition treatment. Finally, a functionality to calculate the calories and nutritional values of the foods in a diet, with the aim of setting the daily calories and nutrients for the patient.

Acknowledgements I would like to thank Iván Canas Ramos for developing the application.


References 1. Akdur G, Aydin MN, Akdur G (2020) Adoption of mobile health apps in dietetic practice: case study of diyetkolik. JMIR Mhealth Uhealth 8(10):e16911 2. Alamoodi AH, Garfan S, Zaidan BB, Zaidan AA, Shuwandy ML, Alaa M et al (2020) A systematic review into the assessment of medical apps: motivations, challenges, recommendations and methodological aspect. Heal Technol 10(5):1045–1061 3. Braz VN, de Moraes Lopes MHB (2019) Evaluation of mobile applications related to nutrition. Public Health Nutr 22(7):1209–1214 4. Canaway R, Boyle DI, Manski-Nankervis JAE, Bell J, Hocking JS, Clarke K et al (2019) Gathering data for decisions: best practice use of primary care electronic records for research. Med J Aust 210:S12–S16 5. Choi J, Chung C, Woo H (2021) Diet-related mobile apps to promote healthy eating and proper nutrition: a content analysis and quality assessment. Int J Environ Res Public Health 18(7):3496 6. Ghelani DP, Moran LJ, Johnson C, Mousa A, Naderpoor N (2020) Mobile apps for weight management: a review of the latest evidence to inform practice. Front Endocrinol 11:412 7. Gurinovi´c M, Mileševi´c J, Kadvan A, Nikoli´c M, Zekovi´c M, Djeki´c-Ivankovi´c M et al (2018) Development, features and application of DIET ASSESS & PLAN (DAP) software in supporting public health nutrition research in Central Eastern European Countries (CEEC). Food Chem 238:186–194 8. Holzmann SL, Holzapfel C (2019) A scientific overview of smartphone applications and electronic devices for weight management in adults. J Personalized Med 9(2):31 9. Kruse CS, Stein A, Thomas H, Kaur H (2018) The use of electronic health records to support population health: a systematic review of the literature. J Med Syst 42(11):1–16 10. Lashinsky JN, Suhajda JK, Pleva MR, Kraft MD (2021) Use of integrated clinical decision support tools to manage parenteral nutrition ordering: experience from an academic medical center. Nutr Clin Pract 36(2):418–426 11. Martinon P, Saliasi I, Bourgeois D, Smentek C, Dussart C, Fraticelli L, Carrouel F (2022) Nutrition-related mobile apps in the French App stores: assessment of functionality and quality. JMIR Mhealth Uhealth 10(3):e35879 12. Schumer H, Amadi C, Joshi A (2018) Evaluating the dietary and nutritional apps in the google play store. Healthc Inform Res 24(1):38–45 13. S, erban CL, Sima A, Hogea CM, Chirit, a˘ -Emandi A, Perva IT, Vlad A et al (2019) Assessment of nutritional intakes in individuals with obesity under medical supervision. A cross-sectional study. Int J Environ Res Public Health 16(17):3036 14. Shinozaki N, Murakami K (2020) Evaluation of the ability of diet-tracking mobile applications to estimate energy and nutrient intake in Japan. Nutrients 12(11):3327 15. Sun W, Cai Z, Li Y, Liu F, Fang S, Wang G (2018) Data processing and text mining technologies on electronic medical records: a review. J Healthc Eng 16. Welch BM, Wiley K, Pflieger L, Achiangia R, Baker K, Hughes-Halbert C et al (2018) Review and comparison of electronic patient-facing family health history tools. J Genet Couns 27(2):381–391

Chapter 61

Exploring the Potential Adoption of Metaverse in Government

Vasileios Yfantis and Klimis Ntalianis

1 Introduction to Metaverse

The term "Metaverse" debuted in Neal Stephenson's science fiction novel Snow Crash in 1992 [1]. The Metaverse refers to digital environments that blur the lines between physical and virtual space [2]. Views of the Metaverse are split in divergent directions. One view is of a privately run, centrally controlled long-term future in which giant firms, such as Facebook's "Meta," decide how persons "socialize, learn, collaborate, and play" [3]. Mark Zuckerberg, the CEO of Facebook, disclosed a rebranding to "Meta" on October 28, 2021, with a new goal that included building up the "Metaverse," a three-dimensional representation based on virtual and augmented reality. The new name of the social media company appears to be a breakthrough, indicating a significant impact on the company's entire business strategy [4]. Another view of the Metaverse is based on a distributed technical design, such as blockchain-based Internet infrastructure, in which distributed, mission-driven communities identified as "Decentralized Autonomous Organizations" (or DAOs) create their own environments [5]. These communities are usually autonomous cryptocurrency communities and adopt a virtual system of governance that can jointly make deals, finance, construct, preserve, and recreate without relying on outside resources [6]. The Metaverse, according to the company, will mirror a mix of today's online social encounters in a three-dimensional space or reflected into reality [7]. This innovative medium, as a technology, has the potential to greatly alter user-platform connections by addressing the visual, aural, haptic, and olfactory senses while allowing for movement- and touch-based interactions [8]. Matthew Ball, a Metaverse entrepreneur

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 I. J. Jacob et al, (eds.), Data Intelligence and Cognitive Informatics, Algorithms for Intelligent Systems, https://doi.org/10.1007/978-981-19-6004-8_61

815

816

V. Yfantis and K. Ntalianis

and author, claims the Metaverse could be incredibly profitable, up to $30 trillion in the next ten years [9]. The architecture of the Metaverse consists of both tangible and intangible elements. Our interpretation of the architecture is an adaptation of the Metaverse Value-Chain by Jon Radoff [10]. The seven elements of the Metaverse are: 1. Perception: The Metaverse is associated with the incredibly powerful virtualization of physical space, distance, and objects. Metaverse users produce content that derives from their relationships within their virtual societies. 2. Exploration: The medium that introduces people to innovative perceptions is the experience of exploration. Metaverse is revealing digital social structures as we virtualize physical reality, which becomes the target of exploration for users. Previous phases of the Web were characterized by static social networking activities around a few centralized services; a decentralized ecosystem of virtual experiences may transfer power back to the users, leading them to collective experiences and collective intelligence. 3. Hardware: This element includes all of the tangible technology that producers use often to generate experiences for the users of the Metaverse [11]: • Head-Mounted Displays (HMD): The HMD offers a visual representation on the screen and performs audio through the speaker. The images displayed on the screen could cover the entire screen (e.g., virtual reality) or overlay the virtual world (e.g., augmented reality). • Hand-Based Input Device (HBID): The scope of these devices is to provide a tactile experience during the user’s presence in the Metaverse. There are two types of tactile features offered through these devices: A passive haptic provides the sense of real objects, and an active haptic creates virtual pressure, simulating the interaction with objects. • Non-Hand Input Device: These devices offer eye-tracking, head-tracking, voice-input equipment, and other input options. For instance when the user moves his eyes without moving his head, the eye-tracking feature estimates eye movement and changes the user’s viewpoint. 4. Software: This category includes software that helps us understand, explore, and retrieve information from 3D worlds. Some types of software that are being used include 3D engines for the visualization of animation, geographic mapping and object recognition, voice and gesture detection. 5. Decentralization: The implementation of distributed computing creates a decentralized ecosystem of users that act without being monitored by a central entity. A plethora of virtual worlds are produced and each world creates its own facilities and security measures, which prevents the total collapse of the Metaverse if one of the virtual worlds is under attack by an unauthorized entity. Moreover, decentralization enhances the transparency of the virtual world due to technologies such as blockchain. This technology allows the exchange of digital values among entities in real time by requesting permission for the implementation of the exchange from the members of a computing network.


Fig. 1 The Metaverse’s architecture

The Metaverse’s architecture is depicted in Fig. 1 and includes all the previously mentioned elements.

2 Metaverse Integration in Government

According to a survey carried out by Ipsos on behalf of the World Economic Forum, almost half of adult citizens in 29 countries feel familiar with the Metaverse [12]. Developing nations are more optimistic about the Metaverse's prospects, and most participants expect Metaverse-based applications to alter their everyday lives within the next ten years. The questionnaire was implemented between April and May 2022 and contains the replies of over 21,000 adults. Several of the most important findings are displayed in Table 1, which answers the question about the citizens' familiarity with the related technologies.

This familiarity with the Metaverse has inspired several cities all over the world to adopt it. Seoul, South Korea, is preparing the ground for a Metaverse environment known as "Metaverse Seoul" for all parts of the municipal government [13]. The action plan integrates digital twins, virtual reality (VR), and cooperation to strengthen municipal services and management, as well as virtual tourism.


Table 1 Familiarity with the Metaverse

Type of technology | Percentage feeling familiar (%) | Country with the highest percentage | Country with the lowest percentage
Virtual reality | 80 | Turkey (94%) | France (46%)
Augmented reality | 61 | Turkey (84%) | Belgium (36%)
The Metaverse | 52 | Turkey (86%) | Poland (27%)

As part of Mayor Oh Se-Hoon's SeoulVision 2030 plan, the South Korean city has invested approximately €2.8 billion in the project. Cities in the U.S.A. are also positive toward adopting the Metaverse. The National League of Cities (NLC) is a group of city, town, and community representatives who work to make life better for their current and future fellow citizens. According to a recently published report [14], the NLC wants to develop a future in which citizens in the United States can quickly access city services and public gatherings via the Metaverse, in a much more user-friendly way than on other digital platforms. This could end up making services more accessible to people with physical disabilities or time constraints. Moreover, Barbados, a small island country in the Lesser Antilles of the West Indies, decided to open an embassy in Decentraland [15], a current form of the Metaverse. Gabriel Abed, the ambassador of Barbados to the United Arab Emirates and the official in charge of the country's digital diplomacy, stated to Bloomberg [15]: "We recognize that we are a 166-square mile island—we are tiny—but in the Metaverse we are as large as America or Germany." In Asia, Joko Widodo, the President of the Republic of Indonesia, openly admitted that Indonesia has to be ready to participate in the Metaverse [16]. Actions have already been taken in Indonesia to adopt Metaverse innovation. Anies Baswedan, the Governor of Jakarta, decided in 2022 to establish a partnership agreement with WIR Group, a popular augmented reality technology firm in Southeast Asia [17]. Michael Budi, President Director of WIR Group, stated that through this agreement the company will support Jakarta by guiding its implementation of interaction in the Metaverse. Since the launch of Meta (formerly Facebook), several governmental organizations have expressed interest in using the Metaverse. However, because the virtual world is such a recent trend, there are not yet significant deployments of the Metaverse in government, which is why it is important to shed light on the main opportunities and challenges of adopting it.


3 Metaverse Opportunities

3.1 To Find Innovative Ways to Communicate with the Citizens

Physical communication with the citizens was partially replaced with Web communication when the Internet became available to the majority of citizens. The virtual world seems to be the next big thing for the new era of communication with the citizens because it attracts them. Second Life, the forerunner of the Metaverse, was established in 2003 as a virtual world, presenting a landscape where the virtual and physical worlds collide and meaning is generated through a number of social activities [18]. Users interact in Second Life in many ways, including sending online messages, co-visiting locations, participating in multi-player games, and creating, selling, and purchasing virtual objects. Although the number of visitors to the U.S. National Oceanic and Atmospheric Administration (NOAA) Second Life site is only a fraction of that of the institution's regular portal, the virtual world online experience is highly engaging. For instance, according to the NOAA's official statistics, visitors to its Second Life island spend 10 times as much time there as receivers of the institution's other web services [19]. The Metaverse as a virtual world could create an innovative communication channel with the citizens, because citizens will be able to visit the virtual space of the public offices and carry out all their transactions with the government just like in the real world. As long as citizen avatars adopt a virtual passport/identification card, their virtual presence in the public office will be equivalent to their physical presence.

3.2 To Establish Team Working Operation Inside the Workplace

The Metaverse has the advantage of giving employees the impression that someone else is working beside them. The psychological perception of being in an area with someone is formed by the online identity and the capacity to interact with the environment and virtual elements from multiple perspectives, such as the third-person perspective [20]. The advantages of meeting in cyberspace include the ability to: (1) remove the alienation of employees who are located across the country; and (2) decrease the number of individuals who need to travel to participate in meetings or conferences. The co-presence of public servants as avatars in the workplace will enhance team spirit and creative collaboration among employees.


3.3 To Find New Employees

The US Army is a recognized leader in the use of virtual worlds as a recruiting tool. "America's Army" is considered the US government's first massive use of gaming technology/virtual worlds as a platform for effective communication and recruitment, as well as the first use of gaming software in support of US Army recruiting [21]. The advantage of the Metaverse as a recruiting tool is that it could act as a monitoring tool for both the digital literacy skills and the behavioral skills of candidates. While the digital literacy (e.g., IT skills) of candidates can be assessed through a variety of written tests, the behavior of a potential employee inside the workplace is hard to predict. Since the Metaverse would be the actual workplace, candidates could be asked to join a real working team and work for a day in the real working environment. In this way, human resources executives could monitor and confirm the behavioral skills of the candidates in relation to their work at the office.

3.4 To Develop a New Economy

Another important scenario to investigate is the rebranding of the government's business model with the launch of the Metaverse for both administrative and tourism operations. If the public office of a government is located on a virtual island in the Metaverse, citizens can also enjoy the virtual beach, purchase touristic memorabilia from the government's kiosks, go on paid vacation with other citizens, and so on. In general, the Metaverse, as a new experience for the citizens, could also be a new source of funds for the government. Many governments, notably the Maldives, have set up virtual embassies in Second Life to learn more about the virtual world. In May 2007, the Maldives became the first nation to establish a "virtual embassy" in Second Life, in Diplomacy Island's Diplomatic Quarter. The virtual embassy, which is themed after a beach holiday, offers tourism products and services as well as information and connections to official government offices [22]. If governments reconsider the role of citizen service centers and public offices, then the Metaverse could be the hub for a new economy that offers opportunities for additional governmental income through new touristic products and services.

4 Metaverse Challenges

Despite the previously mentioned advantages of the Metaverse, the transition from the physical to the digital world includes several serious challenges for officials to take into account. In this section, we discuss issues relevant to the potential difficulties of implementing the operation of the public sector in the Metaverse.


• Security: Because the Metaverse gathers information on activity that is more insightful than transactions and browsing habits, privacy and security are critical concerns. We should be more cautious in the case of unexpected crimes in the Metaverse, since data protection is essential. Moreover, as a consequence of the rise in users, tracking activities implies that agencies like the police and the army are necessary. Due to their online anonymity in the Metaverse, individuals who are respectable in real life may disobey the law on certain occasions.
• Health: Health issues and negative impacts (such as feeling tired, headset heaviness, and movement injuries) exist too. Moreover, in specific augmented reality systems, disruption of the user's focus has resulted in significant risk, such as unplanned accidents. Head or neck discomfort, caused by the weight of virtual reality headsets, is another limitation for longer usage periods. Extensive virtual reality use might also lead to Internet addiction, social alienation, isolation from real-life activities and muscular pains [23].
• Staff: Although autonomous avatars can immediately welcome visitors to a virtual world, it is advised that an organization's virtual world be supervised by a virtual avatar who can really welcome visitors, answer questions, and guide them to the right offices. Having a virtual public servant would require staffing the position with more than one actual servant, particularly during the public office's normal office hours. If the public office on the Metaverse operates 24/7, then more real public servants are needed to work in shifts.
• Ethical Issues: Ethics concerns involve unauthorized expansion and deception of the truth into predisposed opinions. Metaverse participants may develop physiological and psychographic profiles of users based on data about their feelings [24]; these factors could be used to begin forming unintended psychological judgments that encourage bias. Another well-known drawback of virtual social worlds is toxic behavior, such as anxiety, cyberbullying, and verbal harassment [1]. Moreover, virtual world contexts can lead to stressful circumstances. Identity theft could be committed using artificial intelligence methods and deep learning algorithms, which are associated with questions of Internet morality.

5 Epilogue

The opportunities and challenges of adopting the Metaverse in government are shaping the future framework for delivering public services. The best way for the government to deal with the current topic is to explore how each single element (opportunity or challenge) fits with each element of the Metaverse (perception, exploration, hardware, software, and decentralization). Afterwards, a SWOT analysis [25] is suggested to help decision-makers find the way to move to the next stage of adopting the Metaverse. Figure 2 displays a schema including all the appropriate information for the implementation of the SWOT analysis that will reveal the public organization's current status toward adopting the Metaverse.


Fig. 2 Challenges, opportunities, and elements of the Metaverse

The government's decision to focus on the Metaverse places it as a trailblazer of new technology with huge future potential. Virtualization of government business may result in lower operational costs in the near future because less hardware will be required. Although the Metaverse would be a new environment to host administrative transactions, the core of the business would not change substantially, because the government has already formed its most critical business processes in terms of operations. As a result, government servants will have to concentrate on improving service efficiency instead of reinventing services entirely from the initial concept. Citizen-oriented activities that require new techniques of interaction, or data security subject areas that are not important in the current workplace environment but are expected to gain importance in the future, are examples of activities that will require further development. While government organizations have traditionally used and accepted concepts like "extended reality" and "virtual worlds," the realization of the Metaverse has yet to materialize. The most important requirement for the adoption of the Metaverse is to train the potential new staff on how to adjust the culture of servicing the citizens to the new framework of the Metaverse. For this purpose, virtual world techniques such as gamification [26] will be useful to teach the staff about the prospects of the virtual world and how to affect the motivation of the citizens toward accepting this innovative environment. Additional technologies such as artificial intelligence [27], blockchain [28] and cloud computing [29] could be beneficial for the training of public servants, as tools to improve the delivery of public services by reducing the cost of the required resources.

References

1. Stephenson N (1992) Snow crash. Bantam Books, New York
2. Dionisio J, Burns W, Gilbert R (2013) 3D virtual worlds and the metaverse: current status and future possibilities. ACM Comput Surv 145:1–38
3. Facebook. https://about.facebook.com/meta/
4. Sascha K, Kanbach D, Krysta P, Steinhoff M, Tomini N (2022) Facebook and the creation of the metaverse: radical business model innovation or incremental transformation? Int J Entrep Behav Res 28:52–77
5. Building the Metaverse: 'Crypto states' and corporates compete, down to the hardware. https://ssrn.com/abstract=3981345
6. Foresight Institute. https://foresight.org/salon/balaji-s-srinivasan-the-network-state/
7. Meta. https://about.fb.com/news/2021/10/facebook-company-is-now-meta/
8. Studen L, Tiberius V (2020) Social media, quo vadis? Prospective development and implications. Future Internet 12:1–22
9. Metaverse Economy Could Value up to $30 Trillion Within Next Decade. https://beincrypto.com/metaverse-economy-could-value-30-trillion-in-a-decade/
10. The Metaverse Value-Chain. https://medium.com/building-the-metaverse/the-metaversevalue-chain-afcf9e09e3a7
11. Park SM, Kim YG (2022) A Metaverse: taxonomy, components, applications, and open challenges. IEEE Access 10:4209–4251
12. How enthusiastic is your country about the rise of the metaverse? https://www.weforum.org/agenda/2022/05/countries-attitudes-metaverse-augmented-virtual-reality-davos22/
13. Kit K (2022) Sustainable engineering paradigm shift in digital architecture, engineering and construction ecology within Metaverse. Int J Comput Inf Eng 16:112–115
14. Geraghty L, Lee T, Glickman J, Rainwater B (2022) Cities and the Metaverse. Report, National League of Cities
15. Barbados is Opening a Diplomatic Embassy in the Metaverse. https://www.bloomberg.com/news/articles/2021-12-14/barbados-tries-digital-diplomacy-with-planned-metaverse-embassy
16. Ifdil I, Situmorang D, Firman F, Zola N, Rangka I (2022) Virtual reality in Metaverse for future mental health-helping profession: an alternative solution to the mental health challenges of the COVID-19 pandemic. J Public Health fdac049:1–2
17. Jakarta Provincial Government & WIR Group to Develop Jakarta Metaverse. https://www.asiaone.com/business/jakarta-provincial-government-wir-group-develop-jakarta-metaverse
18. Boellstorff T (2015) Coming of age in second life. Princeton University Press, Princeton
19. Wyld D (2008) Government in 3D: how public leaders can draw on virtual worlds. Report, IBM Center for the Business of Government
20. Mystakidis S (2019) Motivation enhanced deep and meaningful learning with social virtual reality. University of Jyväskylä, Jyväskylä
21. America's Army. https://americasarmy.com/
22. Diplomacy for the digital age. http://archive1.diplomacy.edu/pool/fileInline.php?idpool=463
23. Slater M, Gonzalez-Liencres C, Haggard P, Vinkers C, Gregory-Clarke R, Jelley S, Watson Z, Breen G, Schwarz R, Steptoe W, Szostak D, Halan S, Fox D, Silver J (2020) The ethics of realism in virtual and augmented reality. Front Virtual Reality 1:1–13
24. Metaverse Wiki. https://en.wikipedia.org/wiki/Metaverse
25. Leigh D (2009) SWOT analysis. In: Silber KH, Foshay WR, Watkins R, Leigh D, Moseley JL, Dessinger JC (eds) Handbook of improving performance in the workplace, vol 1-3. Pfeiffer, San Francisco, pp 115–140
26. Yfantis V, Ntalianis K, Xuereb PA, Garg L (2018) Motivating the citizens to transact with the government through a gamified experience. Int J Econ Stat 6:81–86
27. Yfantis V, Ntalianis K, Ntalianis F (2020) Exploring the implementation of artificial intelligence in the public sector: welcome to the clerkless public offices. Applications in education. WSEAS Trans Adv Eng Educ 17:76–79
28. Yfantis V, Leligou HC, Ntalianis K (2021) New development: Blockchain—a revolutionary tool for the public sector. Public Money Manag 41:408–411
29. Yfantis V, Ntalianis K (2020) The exploration of government as a service through community cloud computing. Int J Hyperconnectivity Internet Things (IJHIoT) 4:58–67

Chapter 62

Hyperparameter Tuning in Random Forest and Neural Network Classification: An Application to Predict Health Expenditure Per Capita
Gulcin Caliskan and Songul Cinaroglu

1 Introduction

Deep modelling for classification is popular in the intelligence computing and information systems domain [1]. Hyperparameter tuning improves the classification performance of learning techniques by setting proper hyperparameters for prediction models [2]. Random forest (RF) and neural network (NN) are well-known learning techniques in classification and regression modelling. The number of trees in the forest and the number of neurons in the hidden layer are well-known hyperparameters of RF and NN, respectively. The number of trees in the forest, the degree of complexity of each tree, the sampling procedure and the splitting rule used during tree construction are hyperparameters of RF [3]. The size of layers and kernels, the number of neurons, the activation function, the learning rate and the batch size are the hyperparameters of a neural network to tune [4]. The existing literature states that hyperparameter tuning improves the classification performance of RF and NN in many medical data set applications [1, 5]. However, there is a lack of knowledge about the effect of hyperparameter tuning on the classification performance of deep learning techniques for grouping countries in terms of health expenditure per capita. In this study, the classification performances of RF and NN models are compared by changing the number of trees in the forest and the number of neurons in the hidden layer, respectively. Cross validation is applied simultaneously by changing the "k" parameter from 2 to 20. The data classification task with the RF and NN models and the discriminatory power of these models when changing the number of trees in the forest and the number of neurons in the hidden layer are examined. The following section overviews existing knowledge about the data classification task with RF and NN and gives a brief summary of the hyperparameters of these techniques.


2 Materials and Methods

2.1 The Data Classification Task with RF and NN

RF classification is used for developing prediction models [3]. RF, developed by Breiman [6], is one of the most widely used machine learning models; it generates trees using the classification and regression tree (CART) algorithm. The RF, as well as CART, has the objective of exploring a prediction function for a $p$-dimensional random vector $X = (X_1, \ldots, X_p)^T$ representing the real-valued predictors, and a random variable $Y$ representing the output. If the dependent variable is categorical, RF conducts classification; otherwise, RF conducts regression [7]. According to Breiman [3], the generalization error of classification RFs can be proved to converge as the number of trees increases. RF classification is a popular machine learning technique for developing predictive models in a number of studies [5]. In this study, a dichotomous outcome variable is used; therefore, RF classification is performed to predict the classes of the health expenditure per capita variable.

The number of trees in the forest is one of the hyperparameters of RF. The literature states that increasing the number of trees improves the accuracy of RF prediction [8]. RF generates a forest of classification trees by increasing the number of trees in the forest. CART builds a massive tree before pruning it; it has been claimed that pruning a massive tree, rather than building a small number of trees, increases the accuracy of RF prediction [9]. A training sample of $n$ cases is used to build a classification tree for a dichotomous outcome variable. A tree-structured classification rule is generated for case $i$ using a vector of covariates $x_i$. Through repetitive partitioning, the training sample is split into increasingly homogeneous groups. Using the vector of inputs $x$, three types of splits are possible [10]:

1. Univariate split: Is $x_i \le t$?
2. Linear combination split: Is $\sum_{i=1}^{p} (w_i x_i) \le t$?
3. Categorical split: Is $x_i \in S$?

Separating the cases into two maximally homogeneous groups is the goal of the split, which has been defined in various ways in the literature. The tree is grown according to Eq. (1):

$$Y = \sum_{j=1}^{r} \beta_j I(x \in R_j) + \varepsilon \qquad (1)$$

where the regions $R_j$ and the coefficients $\beta_j$ are estimated from the data. The $R_j$ are usually disjoint, and $\beta_j$ is the average of the $Y$ values in $R_j$.
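As a minimal illustration of RF classification with the number of trees in the forest as the tuned hyperparameter, the following sketch uses scikit-learn on synthetic stand-in data rather than the Orange workflow and World Bank data used in the study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the 188-country data set: 4 predictors, binary outcome.
X, y = make_classification(n_samples=188, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for n_trees in (5, 50, 100):                      # number of trees in the forest
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X_train, y_train)
    auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
    print(f"NumTrees={n_trees:3d}  test AUC={auc:.3f}")
```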


Artificial NN is a type of computing technique that simulates the structure and processing capabilities of the human brain, and it is a model of choice for classifying medical data [11]. NN is made up of interconnected, adaptive, simple processing elements, often known as artificial nodes or neurons, that mimic the brain's ability to learn. NNs remain somewhat frustrating tools in the data mining literature, because they exhibit good modelling performance but give little insight into the structure of their models [12]. NNs are aggregations of perceptrons. The output of a multi-layer feedforward network is:

$$O_N = \frac{1}{1 + e^{-(\beta o_H + \beta_0)}} \qquad (2)$$

and the result is once more taken as $P(1 \mid x, \beta, \beta_0, \alpha)$. In this case, $o_H$ is a vector of perceptron outputs, each of which has an $\alpha$ parameter; these perceptrons are commonly referred to as hidden neurons. An artificial NN's output $y$ is a nonlinear function of the inputs due to the nonlinearity of these hidden neurons. This implies that the decision boundary in a classification context can also be nonlinear, giving the model more flexibility than logistic regression [11]. NN is capable of creating knowledge and modelling complicated, non-linear relationships between output and input data [12]. Most data sets can be classified using one layer of hidden neurons. The number of neurons in the hidden layer needs to be set empirically, e.g. by cross validation or bootstrapping. The time required to complete training increases as the number of hidden layers required to capture the properties of the data increases [11]. NN models provide a better overall model fit when forecasting financial ratios and estimating healthcare expenditure per capita when incorporated with genetic algorithm-based feature selection [13].
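A comparable sketch for the NN side, where the hyperparameter varied is the number of neurons in a single hidden layer. This is again a scikit-learn sketch on synthetic data, not the study's Orange workflow; the logistic activation merely mirrors the sigmoid form of Eq. (2), since the study does not specify the activation used:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=188, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

for n_neurons in (5, 50, 100):                    # neurons in the single hidden layer
    nn = MLPClassifier(hidden_layer_sizes=(n_neurons,), activation="logistic",
                       max_iter=2000, random_state=0).fit(X_train, y_train)
    auc = roc_auc_score(y_test, nn.predict_proba(X_test)[:, 1])
    print(f"NumNHL={n_neurons:3d}  test AUC={auc:.3f}")
```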

2.2 K-Fold Cross Validation to Improve Classification Performance

k-fold cross validation is a very well-known tool for assessing the performance of classification algorithms [14]. This process splits a data set at random into k disjoint folds of roughly equal size, with each fold being used in turn to test the model induced from the other k-1 folds by a classification algorithm. The classification algorithm's performance is assessed by averaging the k accuracies obtained from k-fold cross validation, assuming that the level of averaging is the fold. The same number of instances is assumed in all folds except when explicitly specified. In this study, k-fold cross validation was performed to improve the classification performance of the machine learning techniques; it is used to estimate how accurately a predictive model will perform in practice. During cross validation, a sample of data is partitioned into complementary subsets, training is performed on one subset, and the analysis is validated on the testing or validation set.
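A sketch of varying the fold count k over the same values used in the study (k = 2, 3, 5, 10, 20), again using scikit-learn's cross validation utilities on synthetic data in place of the Orange workflow:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=188, n_features=4, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

for k in (2, 3, 5, 10, 20):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"k={k:2d}  mean AUC over folds={scores.mean():.4f}")
```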


2.3 Measuring the Discriminatory Power of a Model

In this study, deep learning techniques are used to assess model performance on the health expenditure per capita classes. The area under the ROC curve (AUC), sensitivity, specificity and accuracy are the most often used metrics of discriminating power. The conventional receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC-ROC) are limited by the presence of many true-negative cases, which can lead to a greater AUC and misinterpretation of actual performance. Over all conceivable thresholds, the AUC reflects a combined measure of sensitivity and specificity [15]. The sensitivities and specificities associated with different values of a continuous test measure are tabulated to create a ROC curve; the result is essentially a list of different test values and their corresponding sensitivity and specificity. To generate a graphical ROC curve, sensitivity (the true positive, TP, rate) on the y-axis is plotted against 1 − specificity (the false positive, FP, rate) on the x-axis. ROC graphs are thus two-dimensional plots showing the TP rate on the y-axis and the FP rate on the x-axis. An ROC graph depicts relative trade-offs between benefits (true positives) and costs (false positives) [16]. Accuracy is the only discrimination measure influenced by the data set's class distribution; when the case distribution in the training set differs from the case distribution of the population to which the classifier is applied, this measure must be used with caution [11]. Metrics such as positive predictive value (or precision), sensitivity (or recall) and the F1 score (the harmonic mean of precision and recall, i.e. Eq. (6) with $\beta = 1$, giving $F_1 = \frac{2pr}{r + p}$) can be used to evaluate overall model performance [17]. With $TP_c$ = true positives, $FP_c$ = false positives and $FN_c$ = false negatives for class $c$ of $C$ classes, the metrics mentioned above are calculated as follows (reference):

Precision: $p = \dfrac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} TP_c + \sum_{c=1}^{C} FP_c}$  (3)

Recall: $r = \dfrac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} TP_c + \sum_{c=1}^{C} FN_c}$  (4)

Per class: $p = \dfrac{TP_c}{TP_c + FP_c}, \qquad r = \dfrac{TP_c}{TP_c + FN_c}$  (5)

F1 score: $F_\beta = (1 + \beta^2)\dfrac{pr}{r + \beta^2 p} = \dfrac{(1 + \beta^2)\, TP_c}{(1 + \beta^2)\, TP_c + \beta^2 FN_c + FP_c}$  (6)
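For reference, the same quantities can be computed with standard library routines; a short sketch with hypothetical predicted labels and class-1 scores, using β = 1 as in the study:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]                   # hypothetical true classes
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]                   # hypothetical predicted classes
y_score = [0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3]   # hypothetical class-1 probabilities

print("CA        =", accuracy_score(y_true, y_pred))
print("Precision =", precision_score(y_true, y_pred))   # Eqs. (3)/(5): p = TP/(TP+FP)
print("Recall    =", recall_score(y_true, y_pred))      # Eqs. (4)/(5): r = TP/(TP+FN)
print("F1        =", f1_score(y_true, y_pred))          # Eq. (6) with beta = 1
print("AUC       =", roc_auc_score(y_true, y_score))
```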

The following section presents the descriptive statistics, the binary coding of the health expenditure per capita variable, and a comparison of RF and NN classification results incorporating hyperparameter tuning and cross validation. All computations and visual representations of the data set are performed in the Orange (version 3.29) program.


Table 1 Baseline characteristics of WB countries in terms of study variables

Variables | Mean | SD | Min | Max | Median
Dependent variable
Health expenditure per capita | 1549.44 | 1902.64 | 40.61 | 10,921.01 | 757.97
Independent variables
GDP per capita | 21,023.73 | 21,307.12 | 343 | 117,341.90 | 13,816.73
Mortality rate | 21.06 | 19.03 | 1.60 | 82.40 | 14.30
Life expectancy | 70.46 | 13.79 | 53.28 | 84.36 | 73.25
Population aged 65 years and over (%) | 8.88 | 6.28 | 1.16 | 28 | 6.82

Explanations: Current health expenditure per capita, PPP (current international $); GDP per capita, PPP (current international $); mortality rate, infant (per 1000 live births); life expectancy at birth, total (years); population aged 65 years and over (% of total population). Min. Minimum, Max. Maximum, SD Standard deviation. Data are taken from official statistics of the WB for the year 2019: https://data.worldbank.org/

3.2 Binary Coding of Health Expenditure Per Capita Variable In this study, member of WB countries is classified in terms of groups of health expenditure per capita variable, by using NNs and changing neurons in hidden layer and incorporating k-fold cross validation into the model. Health expenditure per capita is an independent variable of this study in line with the existing knowledge [19], and the distribution of health expenditure per capita variable is skewed and can be seen in Fig. 1. The histogram and density plot of the variable are highly positively skewed. Balancing the groups of dependent variable is strategically important for

830

G. Caliskan and S. Cinaroglu

classification tasks. In this study, due to heavily skewed distribution of dependent variable, health expenditure per capita variable is categorized by using the median value of this variable (757.97$) as a cut-off point. After binary coding of health expenditure per capita variable, the mean value of the first group of countries is 304.99$ and the mean value of second group of countries is 2793.88$. Density plot of countries verifies skewed distribution of dependent variable. Balancing the groups of dependent variable is strategically important to improve classification performance of machine learning techniques [20]. Following section explains generation of balanced categories for health expenditure per capita variable by using appropriate measure of central tendency. Balanced categories of health expenditure per capita variable after binary coding of this variable, obtained from the member of 188 WB countries, are presented in Table 2. In this study, median value (757.97$) is determined as a cut-off point for binary classification of the variable. In this regard, WB countries that have median health expenditure per capita values equal and higher than 757.97$ are in one group and WB countries which have median health expenditure per capita values smaller
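A sketch of this binary coding step in pandas (the column name is hypothetical; the study uses the WB 2019 indicator "Current health expenditure per capita, PPP (current international $)"): countries at or above the median are coded 1, the rest 0.

```python
import pandas as pd

# Hypothetical column name standing in for the WB 2019 indicator values.
df = pd.DataFrame({"health_exp_per_capita": [40.61, 300.0, 757.97, 2500.0, 10921.01]})

cutoff = df["health_exp_per_capita"].median()                   # 757.97$ in the study
df["he_group"] = (df["health_exp_per_capita"] >= cutoff).astype(int)
print(df)
```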

Fig. 1 Highly skewed distribution of HE variable for the year 2019


Table 2 Binary groups of WB countries in terms of median values of health expenditure per capita variable

Categories of health expenditure per capita variable | N | %

All of the resulting models reach AUC values well above 0.80 (Tables 3 and 4).

Table 3 Random forest classification performance by changing number of trees in the forest

NumTrees | AUC (Mean^a) | CA (Mean^a) | F1 (Mean^a) | Precision (Mean^a) | Recall (Mean^a)
NumTrees = 5 | 0.9513 | 0.8745 | 0.8744 | 0.8749 | 0.8745
NumTrees = 10 | 0.9545 | 0.8819 | 0.8819 | 0.8822 | 0.8819
NumTrees = 15 | 0.9554 | 0.8798 | 0.8798 | 0.8802 | 0.8798
NumTrees = 20 | 0.9588 | 0.8787 | 0.8787 | 0.8790 | 0.8787
NumTrees = 25 | 0.9601 | 0.8755 | 0.8755 | 0.8756 | 0.8755
NumTrees = 30 | 0.9599 | 0.8787 | 0.8787 | 0.8790 | 0.8787
NumTrees = 35 | 0.9623 | 0.8819 | 0.8819 | 0.8823 | 0.8819
NumTrees = 40 | 0.9622 | 0.8734 | 0.8734 | 0.8735 | 0.8734
NumTrees = 45 | 0.9601 | 0.8766 | 0.8766 | 0.8766 | 0.8766
NumTrees = 50 | 0.9633 | 0.8713 | 0.8713 | 0.8714 | 0.8713
NumTrees = 55 | 0.9625 | 0.8755 | 0.8755 | 0.8756 | 0.8755
NumTrees = 60 | 0.9616 | 0.8830 | 0.8830 | 0.8831 | 0.8830
NumTrees = 65 | 0.9620 | 0.8798 | 0.8798 | 0.8799 | 0.8798
NumTrees = 70 | 0.9642 | 0.8787 | 0.8787 | 0.8788 | 0.8787
NumTrees = 75 | 0.9628 | 0.8766 | 0.8766 | 0.8769 | 0.8766
NumTrees = 80 | 0.9642 | 0.8745 | 0.8745 | 0.8746 | 0.8745
NumTrees = 85 | 0.9627 | 0.8787 | 0.8787 | 0.8789 | 0.8787
NumTrees = 90 | 0.9637 | 0.8766 | 0.8766 | 0.8768 | 0.8766
NumTrees = 95 | 0.9649 | 0.8798 | 0.8798 | 0.8799 | 0.8798
NumTrees = 100 | 0.9631 | 0.8713 | 0.8713 | 0.8714 | 0.8713
Mean | 0.9609 | 0.8773 | 0.8773 | 0.8775 | 0.8773

Abbreviations: NumTrees Number of trees in the forest
^a 5 different "k" parameters (2, 3, 5, 10, 20) are generated by changing the "k" parameter from 2 to 20 in k-fold cross validation while changing the hyperparameter of RF


Table 4 Neural network classification performance by changing number of neurons in hidden layer

NumNHL | AUC (Mean^a) | CA (Mean^a) | F1 (Mean^a) | Precision (Mean^a) | Recall (Mean^a)
NumNHL = 5 | 0.9392 | 0.8010 | 0.7953 | 0.8389 | 0.8011
NumNHL = 10 | 0.9597 | 0.8723 | 0.8713 | 0.8835 | 0.8723
NumNHL = 15 | 0.9539 | 0.9042 | 0.9040 | 0.9079 | 0.9043
NumNHL = 20 | 0.9529 | 0.8872 | 0.8870 | 0.8893 | 0.8872
NumNHL = 25 | 0.9603 | 0.9010 | 0.9009 | 0.9029 | 0.9011
NumNHL = 30 | 0.9602 | 0.9042 | 0.9041 | 0.9067 | 0.9043
NumNHL = 35 | 0.9576 | 0.8968 | 0.8967 | 0.8971 | 0.8968
NumNHL = 40 | 0.9619 | 0.9085 | 0.9084 | 0.9093 | 0.9085
NumNHL = 45 | 0.9669 | 0.9021 | 0.9021 | 0.9025 | 0.9021
NumNHL = 50 | 0.9636 | 0.9000 | 0.8999 | 0.9013 | 0.9000
NumNHL = 55 | 0.9598 | 0.8989 | 0.8989 | 0.8993 | 0.8989
NumNHL = 60 | 0.9647 | 0.9021 | 0.9020 | 0.9031 | 0.9021
NumNHL = 65 | 0.9604 | 0.9031 | 0.9031 | 0.9045 | 0.9032
NumNHL = 70 | 0.9600 | 0.8989 | 0.8988 | 0.8996 | 0.8989
NumNHL = 75 | 0.9633 | 0.9000 | 0.8999 | 0.9013 | 0.9000
NumNHL = 80 | 0.9602 | 0.8968 | 0.8967 | 0.8980 | 0.8968
NumNHL = 85 | 0.9610 | 0.9010 | 0.9009 | 0.9023 | 0.9011
NumNHL = 90 | 0.9615 | 0.9010 | 0.9010 | 0.9019 | 0.9011
NumNHL = 95 | 0.9622 | 0.9000 | 0.8999 | 0.9010 | 0.9000
NumNHL = 100 | 0.9633 | 0.9010 | 0.9010 | 0.9019 | 0.9011
Mean | 0.9596 | 0.8940 | 0.8935 | 0.8976 | 0.8940

Abbreviations: NumNHL Number of neurons in hidden layer
^a 5 different "k" parameters (2, 3, 5, 10, 20) are generated by changing the "k" parameter from 2 to 20 in k-fold cross validation while changing the hyperparameter of NN

In this case, 20 different hyperparameter tuning applications are performed for RF by changing the number of trees from 5 to 100. During the hyperparameter tuning process, k-fold cross validation is applied by changing the "k" parameter from 2 to 20, and 5 different "k" values (k = 2, 3, 5, 10, 20) are generated. Table 3 shows that the mean AUC obtained from the RF model with 5 trees in the forest, averaged over the 5 k-fold settings, is 0.9513. The average of the AUC values obtained from the 20 different RF hyperparameter tuning applications is 0.9609; the corresponding average CA is 0.8773, F1 is 0.8773, precision is 0.8775 and recall is 0.8773.

Table 4 presents the 20 different hyperparameter tuning applications performed for NN by changing the number of neurons in the hidden layer from 5 to 100. During the hyperparameter tuning process, k-fold cross validation is again applied by changing the "k" parameter from 2 to 20, with 5 different "k" values (k = 2, 3, 5, 10, 20).
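A sketch of how such a grid of runs could be generated (scikit-learn on synthetic data; the study performs the equivalent runs in Orange): each hyperparameter value is crossed with each fold count, and the five per-k results are averaged, mirroring the "Mean" columns of Tables 3 and 4.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=188, n_features=4, random_state=0)

for n_trees in range(5, 105, 5):                          # NumTrees = 5, 10, ..., 100
    aucs = []
    for k in (2, 3, 5, 10, 20):                           # k-fold settings used in the study
        cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
        rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        aucs.append(cross_val_score(rf, X, y, cv=cv, scoring="roc_auc").mean())
    print(f"NumTrees={n_trees:3d}  AUC averaged over k settings={np.mean(aucs):.4f}")
```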


Table 4 shows that the mean AUC obtained from the NN model with 5 neurons in the hidden layer, averaged over the 5 k-fold settings, is 0.9392. The average of the AUC values obtained from the 20 different NN hyperparameter tuning applications is 0.9596; the corresponding average CA is 0.8940, F1 is 0.8935, precision is 0.8976 and recall is 0.8940.

Table 5 presents the classification performance differences between RF and NN. The RF and NN performances are obtained through hyperparameter tuning combined with k-fold cross validation, changing the "k" values from 2 to 20. In this case, 20 different numbers of trees in the forest (from 5 to 100) are generated for RF, and 20 different numbers of neurons in the hidden layer (from 5 to 100) are generated for NN. Statistically significant differences exist between the mean rank values obtained from RF and NN in terms of CA (U = 38, p < 0.001), F1 (U = 39, p < 0.001), precision (U = 20, p < 0.001) and recall (U = 38, p < 0.001). However, no statistically significant mean rank difference exists between the RF and NN classifiers in terms of AUC values (U = 163.50, p = 0.323).

Figure 3 presents the areas under the ROC curves obtained from the RF and NN classification results generated with 70 trees in the forest (NumTrees) for RF (AUC = 0.9717) and 60 neurons in the hidden layer (NumNHL) for NN (AUC = 0.9674), with k = 2 fold cross validation performed simultaneously. Existing knowledge suggests that the larger the AUC, the better the model. The ROC curve plot shows excellent validity for both models in differentiating the WB country groups in terms of the health expenditure per capita variable for 2019 (see Fig. 3). Moreover, RF presents outstanding performance compared with NN classification. In Fig. 3, orange represents RF and light green represents NN; the RF curve lies closer to the upper-left corner of the ROC plot than the NN curve.

Table 5 Performance comparison of random forest and neural network classification performances
Performance measures

Model

N

Mean rank

U

p

AUC

RF

20

22.33

163.50

0.323

NN

20

18.68

RF

20

12.40

38