Proceedings of International Conference on Computational Intelligence and Data Engineering: ICCIDE 2022 9819906083, 9789819906086

English Pages 563 [564] Year 2023


Lecture Notes on Data Engineering and Communications Technologies 163

Nabendu Chaki · Nagaraju Devarakonda · Agostino Cortesi, Editors

Proceedings of International Conference on Computational Intelligence and Data Engineering ICCIDE 2022

Lecture Notes on Data Engineering and Communications Technologies Volume 163

Series Editor Fatos Xhafa, Technical University of Catalonia, Barcelona, Spain

The aim of the book series is to present cutting-edge engineering approaches to data technologies and communications. It will publish the latest advances on the engineering task of building and deploying distributed, scalable and reliable data infrastructures and communication systems. The series will have a prominent applied focus on data technologies and communications, with the aim of promoting the bridging from fundamental research on data science and networking to data engineering and communications that lead to industry products, business knowledge and standardisation.

Indexed by SCOPUS, INSPEC, EI Compendex. All books published in the series are submitted for consideration in Web of Science.

Editors

Nabendu Chaki, Department of Computer Science and Engineering, University of Calcutta, Kolkata, India

Nagaraju Devarakonda, VIT-AP University, Amaravati, Andhra Pradesh, India

Agostino Cortesi, Ca’ Foscari University, Venice, Italy

ISSN 2367-4512
ISSN 2367-4520 (electronic)
Lecture Notes on Data Engineering and Communications Technologies
ISBN 978-981-99-0608-6
ISBN 978-981-99-0609-3 (eBook)
https://doi.org/10.1007/978-981-99-0609-3

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

This volume is devoted to the quality contributions from the authors and invited speakers of the 5th International Conference on Computational Intelligence and Data Engineering (ICCIDE-22), held on 12 and 13 August 2022 at VIT-AP University. The conference brought together 480 participants from academic institutes and industry across the globe. The two-day international conference was organized by the School of Computer Science and Engineering of VIT-AP University.

The academic program of the conference commenced with the keynote address by Prof. Johan Debayle, FIET, FIACSIT, SMIEEE, Head of Department PMDM, MINES Saint-Etienne, France, on “Image Processing, Analysis and Modelling of Particle Populations-Interests for Chemical Engineering”. Prior to this, ICCIDE 2022 was inaugurated in the presence of Hon’ble Chief Guest Dr. Anupam Sharma, Associate Director, Directorate of Special Projects, DRDO, Hyderabad; Guest of Honor Mr. Caleb Andrews, Associate General Manager at HCL Technologies, Chennai; and Guest of Honor Dr. G. Viswanathan, Founder and Chancellor, VIT. The Hon’ble Chief Guest Dr. Anupam Sharma addressed the audience with an insight into the initiatives of the conference. The Guests of Honor Mr. Caleb Andrews and Dr. G. Viswanathan also addressed the audience, emphasizing the need for such a conference in shaping research practices. The conference was also addressed by Dr. Sekar Viswanathan, Vice President, VIT; Dr. S. V. Kota Reddy, Vice Chancellor, VIT-AP University; and Dr. Jagadish Chandra Mudiganti, Registrar, VIT-AP University. The inaugural session also witnessed addresses by eminent speakers. Dr. Nagaraju Devarakonda, Convener of ICCIDE-22, VIT-AP University, noted that constantly updating technical knowledge in new research areas is necessary and also explained the schedule of the conference.

A thorough peer-review process was carried out by the PC members and associates. While reviewing the papers, the reviewers mainly looked at the novelty of the contributions, besides the technical content, the organization, and the clarity of the presentation. The entire process of paper submission, review, and acceptance was done electronically. While nearly 500 articles across different themes of the conference were received from ten countries across the globe, only 40 papers were accepted for presentation and publication in this post-conference proceedings. These figures themselves reflect the high quality and standard of the research presented in ICCIDE 2022.

The conference had five presidential keynote speakers from various universities across the globe:

Keynote-1: Mr. Karthick Vankayala, Director and Principal Consultant at First Identity, Australia
Keynote-2: Dr. Selvakumar Manickam, Universiti Sains Malaysia, Malaysia
Keynote-3: Wadii Boulila, Prince Sultan University, Saudi Arabia
Keynote-4: Prof. Johan Debayle, MINES Saint-Etienne, France

We thank all the members of the Program Committee for their excellent and time-bound review work. We are thankful to the entire management of VIT-AP University for their warm patronage and continual support in making the event successful. We especially thank Dr. G. Viswanathan, Founder and Chancellor, VIT, for his inspiring presence whenever we approached him for advice. We sincerely appreciate the parental role of Dr. S. V. Kota Reddy, Vice Chancellor, VIT-AP University. We appreciate the initiative of Mr. Aninda Bose and his colleagues at Springer Nature and their strong support toward publishing this volume. Finally, we thank all the authors, without whom the conference would not have reached the expected standards.

Dr. Nabendu Chaki, Kolkata, India
Dr. Nagaraju Devarakonda, Andhra Pradesh, India
Dr. Agostino Cortesi, Venice, Italy

Contents

A Review of Deep Learning Methods in Automatic Facial Micro-expression Recognition
  Lalasa Mukku and Jyothi Thomas

Mathematical Modeling of Diabetic Patient Model Using Intelligent Control Techniques
  Subashri Sivabalan and Vijay Jeyakumar

End-to-End Multi-dialect Malayalam Speech Recognition Using Deep-CNN, LSTM-RNN, and Machine Learning Approaches
  Rizwana Kallooravi Thandil, K. P. Mohamed Basheer, and V. K. Muneer

JSON Document Clustering Based on Structural Similarity and Semantic Fusion
  D. Uma Priya and P. Santhi Thilagam

Solar Power Forecasting to Solve the Duck Curve Problem
  Menon Adarsh Sivadas, V. P. Gautum Subhash, Sansparsh Singh Bhadoria, and C. Vaithilingam

Dynamic Optimized Multi-metric Data Transmission over ITS
  Roopa Tirumalasetti and Sunil Kumar Singh

Solar Energy-Based Intelligent Animal Reciprocating Device for Crop Protection Using Deep Learning Techniques
  Ch. Amarendra and T. Rama Reddy

Toward More Robust Classifier: Negative Log-Likelihood Aware Curriculum Learning
  Indrajit Kar, Anindya Sundar Chatterjee, Sudipta Mukhopadhyay, and Vinayak Singh

Design of Fuzzy Logic Controller-Based DPFC Device for Solar-Wind Hybrid System
  V. Sowmya Sree, G. Panduranga Reddy, and C. Srinivasa Rao

Analysis of EEG Signal with Feature and Feature Extraction Techniques for Emotion Recognition Using Deep Learning Techniques
  Rajeswari Rajesh Immanuel and S. K. B. Sangeetha

Innovative Generation of Transcripts and Validation Using Public Blockchain: Ethereum
  S. Naveena, S. Bose, D. Prabhu, T. Anitha, and G. Logeswari

Windows Malware Hunting with InceptionResNetv2 Assisted Malware Visualization Approach
  Osho Sharma, Akashdeep Sharma, and Arvind Kalia

Custom-Built Deep Convolutional Neural Network for Breathing Sound Classification to Detect Respiratory Diseases
  Sujatha Kamepalli, Bandaru Srinivasa Rao, and Nannapaneni Chandra Sekhara Rao

Infrastructure Resiliency in Cloud Computing
  K. Tirumala Rao, Sujatha, and N. Leelavathy

Deep Learning Model With Game Theory-Based Gradient Explanations for Retinal Images
  Kanupriya Mittal and V. Mary Anita Rajam

A Comparative Analysis of Transformer-Based Models for Document Visual Question Answering
  Vijay Kumari, Yashvardhan Sharma, and Lavika Goel

Develop Hybrid Wolf Optimization with Faster RCNN to Enhance Plant Disease Detection Performance Analysis
  M. Prabu and Balika J. Chelliah

An Efficient CatBoost Classifier Approach to Detect Intrusions in MQTT Protocol for Internet of Things
  P. M. Vijayan and S. Sundar

Self-regulatory Fault Forbearing and Recuperation Scheduling Model in Uncertain Cloud Context
  K. Nivitha, P. Pabitha, and R. Praveen

A Comprehensive Survey on Student Perceptions of Online Threat from Cyberbullying in Kosova
  Atdhe Buja and Artan Luma

Addressing Localization and Hole Identification Problem in Wireless Sensor Networks
  Rama Krushna Rath, Santosh Kumar Satapathy, Nitin Singh Rajput, and Shrinibas Pattnaik

The Impact of ICMP Attacks in Software-Defined Network Environments
  Kamlesh Chandra Purohit, M. Anand Kumar, Archita Saxena, and Arpit Mittal

An Area-Efficient Unique 4:1 Multiplexer Using Nano-electronic-Based Architecture
  Aravindhan Alagarsamy, K. Praghash, and Geno Peter

Digital Realization of AdEx Neuron Model with Two-Fold Lookup Table
  Nishanth Krishnaraj, Alex Noel Joesph Raj, Vijayarajan Rajangam, and Ruban Nersisson

A Novel Quantum Identity Authentication Protocol Based on Random Bell Pair Using Pre-shared Key
  B. Devendar Rao and Ramkumar Jayaraman

Analysis of Hate Tweets Using CBOW-based Optimization Word Embedding Methods Using Deep Neural Networks
  S. Anantha Babu, M. John Basha, K. S. Arvind, and N. Sivakumar

Performance Analysis of Discrete Wavelets in Hyper Spectral Image Classification: A Deep Learning Approach
  Valli Kumari Vatsavayi, Saritha Hepsibha Pilli, and Charishma Bobbili

Tyro: A Mobile Inventory Pod for e-Commerce Services
  Aida Jones, B. Ramya, M. P. Sreedharani, R. M. Yuvashree, and Jijin Jacob

Segmentation and Classification of Multiple Sclerosis Using Deep Learning Networks: A Review
  V. P. Nasheeda and Vijayarajan Rajangam

Malware Detection and Classification Using Ensemble of BiLSTMs with Huffman Feature Optimization
  Osho Sharma, Akashdeep Sharma, and Arvind Kalia

Detection of Location of Audio-Stegware in LSB Audio Steganography
  A. Monika, R. Eswari, and Swastik Singh

Hybrid Quantum Classical Neural Network-Based Classification of Prenatal Ventricular Septal Defect from Ultrasound Images
  S. Sridevi, T. Kanimozhi, Sayantan Bhattacharjee, Soma Sekhar Reddy, and Durri Shahwar

Experimental Evaluation of Reinforcement Learning Algorithms
  N. Sandeep Varma, Vaishnavi Sinha, and K. Pradyumna Rahul

An Approach to Estimate Body Mass Index Using Facial Features
  Dipti Pawade, Jill Shah, Esha Gupta, Jaykumar Panchal, Ritik Shah, and Avani Sakhapara

An Approach to Count Palm Tress Using UAV Images
  Gireeshma Bomminayuni, Sudheer Kolli, Shanmukha Sainadh Gadde, P. Ramesh Kumar, and K. L. Sailaja

Comparative Analysis on Deep Learning Algorithms for Detecting Retinal Diseases Using OCT Images
  G. Muni Nagamani and S. Karthikeyan

PCB-LGBM: A Hybrid Feature Selection by Pearson Correlation and Boruta-LGBM for Intrusion Detection Systems
  Seshu Bhavani Mallampati, Seetha Hari, and Raj Kumar Batchu

Extractive Summarization Approaches for Biomedical Literature: A Comparative Analysis
  S. LourduMarie Sophie, S. Siva Sathya, and Anurag Kumar

SMS Spam Detection Using Federated Learning
  D. Srinivasa Rao and E. Ajith Jubilson

Data Extraction and Visualization of Form-Like Documents
  Dipti Pawade, Darshan Satra, Vishal Salgond, Param Shendekar, Nikhil Sharma, and Avani Sakhapara

Author Index

A Review of Deep Learning Methods in Automatic Facial Micro-expression Recognition

Lalasa Mukku and Jyothi Thomas

Abstract Facial expression analysis to understand human emotion is the basis of affective computing. Until the last decade, researchers mainly used facial macro-expressions for classification and detection problems. Micro-expressions are the tiny muscle movements in the face that occur as responses to feelings and emotions. They often reveal true emotions that a person attempts to suppress, hide, mask, or conceal, and so reflect a person’s real emotional state. They can be used to achieve a range of goals, including public protection, criminal interrogation, clinical assessment, and diagnosis. Using computer vision to assess facial micro-expressions in video sequences is still relatively new. Accurate machine analysis of facial micro-expressions is now conceivable thanks to rapid progress in computational methodologies and video acquisition methods, whereas a decade ago this was the realm of therapists and assessment was manual. Even though the study of facial micro-expressions is a longstanding topic in psychology, it is still a comparatively recent computational science with substantial obstacles. This paper provides a comprehensive review of current databases and various deep learning methodologies for analyzing micro-expressions. The automation of these procedures is broken down into individual steps, which are documented and discussed.

Keywords Micro-expressions · Micro-expression recognition · Deep learning · Optical flow · LBP-TOP · Facial image analysis · Emotion recognition

L. Mukku (B) · J. Thomas
CHRIST (Deemed to be University), Bangalore 560064, India
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_1

1 Introduction

As technologies such as artificial intelligence, machine learning, and computer vision have improved, alongside smart speakers, autonomous driving, and companion robots, intelligent human–computer communication has become an increasingly common element of everyday life, and it will be used even more widely in the future. Emotionally intelligent human–computer interaction requires not just the machine’s ability to complete tasks through various types of engagement but also emotional recognition, expression, and feedback skills comparable to those of humans. According to psychologists [1], language communicates 7% of human emotional expression, speech communicates 38%, and facial expressions transmit the remaining 55%. Although facial expressions can indicate an individual’s state of mind, people frequently conceal or purposefully exhibit certain facial expressions in a given circumstance. In such cases, facial micro-expressions should be used to analyze a person’s genuine emotional state. Facial micro-expressions unwittingly reveal actual feelings that an individual is trying to repress, obscure, cover, or disguise. They are difficult to alter through willpower, and they accurately portray a person’s present emotional state.

While researching psychotherapy in 1966, Haggard and Isaacs [2] highlighted the prevalence of short-lived, difficult-to-detect facial emotions and adopted the name “micro-expressions.” When Ekman and colleagues [3] analyzed a taped conversation between a psychologist and a depressed patient in 1969, they noted that the patient showed a brief facial display of anguish while she tried, with a smiling expression, to persuade the physician that she was no longer depressed. Later, it was determined that the patient was suffering from severe suicidal thoughts, and the micro-expression analysis had revealed her depressed state. Micro-expressions, according to psychologists, are quick, unconscious facial motions that individuals make when they are experiencing significant feelings. Micro-expressions are more successful at portraying actual emotions than conscious facial expressions. Because they convey true emotional information, micro-expressions are especially important in high-risk situations such as lie detection [4], criminal investigation, and clinical medical treatment [5].

Micro-expressions are low-intensity, short-duration facial expressions that appear when individuals mask their genuine feelings, whether intentionally or involuntarily. As a result, discerning such genuine emotional information is difficult. One of the most troubling difficulties is that a micro-expression is very brief, lasting between 0.04 and 0.2 s. Some researchers have found that the duration of a micro-expression is shorter than 0.33 s and no more than 0.5 s [6]. Locating and detecting micro-expressions is significantly hampered by their fast emergence and disappearance. Unlike macro-expressions, micro-expressions are low in intensity and involve only a subset of facial action units [7]. As a result, the human eye readily misses them, and most people cannot identify micro-expressions quickly unless they have had professional training. Ekman et al. created the micro-expression training tool (METT) [8]. An ordinary individual may learn to detect seven fundamental micro-expressions through prolonged and repetitive training with METT. According to Frank et al., however, the overall identification rate of micro-expressions by METT-trained individuals is just 40%. Thanks to the rapid progress of visual sensors, computer vision and video processing systems have emerged as a new route for research teams to apply facial micro-expression detection in therapeutic diagnosis and high-risk circumstances.
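To make the timing constraint concrete, the following minimal sketch converts the durations quoted above into frame counts. The frame rates (25, 60, 100, and 200 fps) are illustrative assumptions matching the cameras discussed later in this review, and `frames_spanned` is a hypothetical helper, not part of any cited method.

```python
def frames_spanned(duration_s, fps):
    """Whole frames a camera running at `fps` records in `duration_s` seconds."""
    # round() removes floating-point noise before int() truncates to whole frames
    return int(round(duration_s * fps, 6))


# Durations quoted in the text (seconds) and assumed camera frame rates
DURATIONS_S = (0.04, 0.2, 0.5)
FRAME_RATES = (25, 60, 100, 200)

for fps in FRAME_RATES:
    counts = [frames_spanned(d, fps) for d in DURATIONS_S]
    print(f"{fps:>3} fps -> frames for {DURATIONS_S} s: {counts}")
```

At 25 fps the shortest micro-expression occupies a single frame, which illustrates why high-speed cameras are preferred for recording them.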

Spontaneous facial expression recognition is a common research topic across a variety of research backgrounds and application scenarios, and the Micro-expressions Grand Challenge (MEGC) [9] has promoted research on facial micro-expressions. However, video-based micro-expression detection and recognition is a recent problem with new obstacles. First, micro-expression datasets obtained through spontaneous induction are difficult to come by, and existing data samples are limited. Second, it is difficult to identify the occurrence of a micro-expression in a video clip: owing to their subtlety and rapidity, spotting micro-expressions in longer video sequences is hard. Finally, micro-expression recognition is difficult because micro-expression changes involve only a few facial action units and the magnitude of the change is small. Some progress in computer-based micro-expression analysis has been made in recent years.

The remainder of the paper is organized as follows: Sect. 2 describes the micro-expression recognition benchmark datasets, Sect. 3 describes the micro-expression recognition pipeline and several feature representation approaches, Sect. 4 describes performance evaluation measures, and Sect. 5 concludes the paper.

2 Micro-expression Datasets

Having adequate annotated emotional data is essential for constructing any automatic micro-expression recognition system. The datasets for micro-expressions are of two types: spontaneous and posed. The posed datasets, as the name suggests, contain facial expressions that are deliberately posed. The participants were instructed to display each emotion class, and those images were captured. In contrast, spontaneous micro-expression datasets contain the participants’ facial responses to emotional stimuli. Spontaneous dataset generation has attracted the interest of computer scientists working in affective computing. Since this line of computer vision research has only recently gained popularity, the number of publicly available spontaneous datasets is considerably smaller.

The earlier datasets used for micro-expression recognition testing were Polikovsky’s Database [10], USF-HD [10], and the York Deception Detection Test (York-DDT) [11], which are all posed. All of these datasets, however, were controversial because of flaws in their acquisition. Data from Polikovsky’s Database and USF-HD, for example, were gathered by having individuals replicate or mimic feelings, which is at odds with the spontaneous nature of micro-expressions. York-DDT, on the other hand, includes only 18 spontaneous micro-expression samples, which is insufficient for data analysis.

In this section, we cover the benefits and drawbacks of the publicly accessible spontaneous expression datasets. Micro-expressions are more likely to happen spontaneously and involuntarily. Spontaneous Micro-Expression Corpus (SMIC) [12], Chinese Academy of Sciences Micro-Expressions (CASME) [13], and Chinese Academy of Sciences Micro-Expressions II (CASME II) [14] are popular spontaneous datasets for evaluating state-of-the-art methodologies. Instead of identifying emotions, these datasets concentrate on micro-expression recognition. Table 1 displays the details of these benchmark datasets, and Table 2 presents the sample’s distribution of each emotion class under those benchmark datasets.

Table 1 Summary of spontaneous micro-expression datasets

Parameters               SMIC         CASME                   CASME II
Number of participants   16           35                      35
Number of samples        164          195                     247
Number of emotions       3            7                       5
Resolution               640 × 480    640 × 480, 720 × 1280   640 × 480
Framerate (fps)          100          60                      200
Ethnicities              3            1                       1

Table 2 Summary of spontaneous micro-expression datasets

Techniques                Accuracy (%)   F1-score
LBP-TOP + SVM             96.26          0.867
MAP-LBP-TOP               77.29          0.654
RoI-selective (LBP-TOP)   46             0.32
RSTR                      57.74          0.5587
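Table 2 reports both accuracy and F1-score. As a minimal sketch of why the two can diverge on imbalanced emotion classes, the code below computes plain accuracy and a macro-averaged F1 on a toy label set. The labels and predictions are invented for illustration, and macro averaging is an assumption here: the source does not state which F1 variant the tabulated results use.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (macro averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, invented example: a classifier that always predicts the majority class
y_true = ["others"] * 8 + ["surprise"] * 2
y_pred = ["others"] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case the always-majority predictor looks acceptable on accuracy while its macro F1 is much lower, because the minority class contributes an F1 of zero.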

2.1 Spontaneous Micro-expression Corpus (SMIC)

The full SMIC dataset was released in 2013. From 328 video recordings of 20 participants, 164 spontaneous micro-expression samples were identified from 16 participants. While watching the emotionally stimulating video clips, each subject was alone, and his or her face was recorded remotely from another room. All video data were collected at 100 fps with a resolution of 640 × 480 pixels using a high-speed (HS) camera. All micro-expressions in the dataset were elicited by video clips no longer than 4.3 min. However, the dataset lacks AU labels, and the index of the most expressive (apex) frame is not given. A 25 fps visual camera (VIS) and a near-infrared camera (NIR) were also used to record the remaining ten participants. As a result, the SMIC database has three subsets, SMIC-HS, SMIC-VIS, and SMIC-NIR, which have been used for micro-expression analysis with multi-source data. The SMIC-VIS and SMIC-NIR data were found to be inadequate; therefore, they are not often used in micro-expression research. Figure 1 shows positive samples from the dataset.

Fig. 1 Positive samples from SMIC dataset (onset, apex, and offset frames)

2.2 Chinese Academy of Sciences Micro-expression (CASME) Dataset

In 2013, around the same time that the full version of SMIC became available, Yan et al. constructed CASME, a more comprehensive dataset. The study included 35 participants (13 women and 22 men). A 60-fps camera was used to record the 195 micro-expression samples in the dataset, which were selected from 1500 facial expressions. The eight emotion classes are happiness (5 samples), sadness (6 samples), disgust (88 samples), surprise (20 samples), contempt (3 samples), fear (2 samples), repression (40 samples), and tension (28 samples), so the sample distribution is severely uneven across classes. All samples are AU labeled, and the images have a face size of 150 × 190 pixels. To induce the micro-expressions, emotionally stimulating videos were shown to the participants, who were instructed to maintain neutral expressions throughout. The videos were roughly 1–4 min long, and the recorded data were analyzed using AU recognition. Figure 2 illustrates a negative sample from the CASME dataset.

Fig. 2 Sample from CASME dataset (onset, apex, and offset frames)

2.3 Chinese Academy of Sciences Micro-expression (CASME II) Dataset Although CASME has a large number of micro-expressions, some of the recordings are incredibly brief, lasting less than 0.2 s, making micro-expression analysis challenging. Yan et al. enhanced the CASME dataset and produced CASME II. There are 247 video sequences in this collection. These images were captured with 200 fps camera at a resolution of 640 × 480 pixels. Out of 2500 facial expressions, a rigorous selection process was used to choose the best samples. A few films with a maximum duration of 3 min and high emotional valance were played to elicit emotional reactions from the participants. Furthermore, facial expressions were collected at 200 fps within a controlled atmosphere. Participants were asked to report their thoughts once the recording was completed in order to modify the experience. The face sizes were limited to 280 × 340 pixels; then, expressions were divided into five emotion classes: happy (33 samples), disgust (60 samples), surprise (25 samples), and repression (27 samples), among others (102 samples). By facilitating a consistent as well as highintensity lighting condition, CASME II also fixed the illumination difficulties that brought down the efficiency in the prior dataset. It also has a bigger sample size and has higher temporal and geographical resolution than CASME. The unbalanced

A Review of Deep Learning Methods in Automatic Facial …

Fig. 3 Negative sample from CASME II dataset

sample distributions among classes are one of CASME II’s drawbacks. Furthermore, the participants were restricted to young people of a single ethnicity. Figure 3 illustrates a negative sample from the CASME II dataset.
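The imbalance noted above can be made concrete with a short sketch. It uses the CASME II class counts quoted in the text; the inverse-frequency weighting is only one common way to compensate for imbalance and is an assumption here, not something the dataset authors prescribe. The function and variable names are hypothetical.

```python
# Class counts for CASME II as reported in the text above.
casme2_counts = {
    "happiness": 33,
    "disgust": 60,
    "surprise": 25,
    "repression": 27,
    "others": 102,
}


def inverse_frequency_weights(counts):
    """Return one weight per class: n_total / (n_classes * n_class).

    Assumption: a simple inverse-frequency scheme, so rare classes
    receive weights above 1 and frequent classes weights below 1.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = inverse_frequency_weights(casme2_counts)
# Ratio between the largest and the smallest class.
imbalance_ratio = max(casme2_counts.values()) / min(casme2_counts.values())
```

Weights of this kind could be passed to a classifier or a loss function so that errors on small classes count more; whether that helps on a given dataset has to be checked empirically.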

3 Micro-expression Recognition

Micro-expression recognition attempts to assign a video of a micro-expression to one of the expression classes (happiness, sadness, disgust, fear, contempt, anger, and surprise). Like ordinary facial expressions, micro-expressions convey human feelings, and recognizing the emotion conveyed by a face sequence is the most common task in micro-expression analysis. Unfortunately, not all classes are present in the current datasets, owing to the difficulty of eliciting micro-expressions. The emotion classes in the datasets are also unevenly distributed: some emotions were easier to elicit than others and therefore have more samples. Feature extraction and classification are the technical stages of a recognition problem [15]. Before feature extraction, however, a preprocessing stage is used to increase the amount of descriptive information available to the descriptors. A high-speed camera is required to capture the quick and subtle movements of facial components that are invisible to the human eye or in low-frame-rate videos. All of these stages are described in detail below. Figure 4 depicts the pipeline of micro-expression recognition [16].

3.1 Preprocessing

Face detection, face registration, motion magnification, and temporal normalization are among the preprocessing steps employed in micro-expression recognition. Noise is removed and the available features are enhanced to improve subsequent performance, and the detected faces are normalized during face registration. The approaches used include noise reduction with a Wiener or Gaussian filter [17], magnification of subtle features, and temporal normalization.

L. Mukku and J. Thomas

Fig. 4 Micro-expression recognition pipeline

3.1.1 Face Detection and Registration

In 2001, Viola and Jones [18] were among the first researchers to propose a framework for face detection. The Viola–Jones cascade classifier is a machine learning algorithm that detects faces in images rapidly and effectively. Face detection using multi-task learning was also demonstrated with a mixture-of-trees approach built on a shared pool of parts. Alternatively, a joint cascade approach has been proposed to detect faces and landmark points at the same time. Variation in pose and lighting remains a problem, and many CNN-based face detection algorithms have been developed to address it.

In the face registration stage, a detected face is aligned onto a reference face. This is a crucial step for improving performance and for dealing with head-pose variation. A lot of work has gone into face registration, which may be divided into two categories: techniques based on fiducial landmark points and generic techniques. Landmark-based techniques detect landmark points on a face and then transform the input to a prototype face. Owing to its versatility and robustness, the active shape model (ASM) [19] is commonly employed to model facial features. ASM employs 68 landmark points that are iteratively refined to improve the estimates of the pose, size, and shape of the object in the image. Several studies have shown that increasing the number of landmark points improves performance by reducing registration error and making the system more robust to head-pose variation. The more recent OpenPose [20] library can detect body, hand, and face landmarks in real time. Face registration and facial point detection are performed using 70 key points in this library. In recent years, CNN-based landmark localization approaches that outperform other methods have been presented. For face registration, the Lucas–Kanade (LK) approach has also been proposed, in which the difference between two frames is minimized by assuming only minor changes in pixel values between two neighboring frames. Several optimization extensions of LK have been used, although the pixel-based technique remains susceptible to variations in lighting. The Gabor filter is therefore used to address this problem.

3.1.2 Motion Magnification

Because of their subtlety, the facial movements of micro-expressions are difficult to discern, and their low intensity makes distinguishing between micro-expression classes challenging. Motion magnification techniques are therefore used to exaggerate these facial micro-movements. Eulerian Video Magnification (EVM) [21] magnifies tiny muscle movements by amplifying temporal variations in the video; to help with feature extraction, EVM amplifies both motion and color. In recent years, the Eulerian Motion Magnification (EMM) approach has been used to magnify the small movements in micro-expression videos. The EMM method decomposes an input video into several spatial frequency bands, applies band-pass filters to extract the frequency bands of interest, and then amplifies the band-passed signals at the different spatial levels by a magnification factor. By widening the gap between different classes of micro-expressions (i.e., the interclass difference), the EMM method helps to improve recognition rates. On the other hand, larger magnification factors may also amplify unwanted noise (i.e., motions that are not caused by micro-expressions), compromising recognition performance.
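The temporal part of Eulerian magnification can be illustrated on the intensity of a single pixel over time. The sketch below is a simplified stand-in, not the full method: it omits the spatial decomposition, and it assumes a band-pass built as the difference of two first-order low-pass filters. The function name and the values of alpha, r_fast, and r_slow are illustrative assumptions.

```python
def magnify_signal(signal, alpha=10.0, r_fast=0.4, r_slow=0.05):
    """Amplify temporal variations of one pixel's intensity over time.

    Assumption: the band-pass is the difference of two first-order
    low-pass (IIR) filters; alpha is the magnification factor.
    """
    fast = slow = signal[0]
    out = []
    for x in signal:
        fast = r_fast * x + (1.0 - r_fast) * fast  # follows quick changes
        slow = r_slow * x + (1.0 - r_slow) * slow  # follows slow drift
        band = fast - slow                         # band-passed component
        out.append(x + alpha * band)               # add amplified band back
    return out


# A constant intensity with a brief, tiny bump (a stand-in for subtle motion).
frames = [100.0] * 10 + [100.5, 101.0, 100.5] + [100.0] * 10
magnified = magnify_signal(frames)
```

A constant signal passes through unchanged, while the small bump is enlarged. Raising alpha enlarges any noise in the same frequency band as well, which is the trade-off described above.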

3.1.3 Temporal Normalization

The temporal interpolation model (TIM) is often used to normalize the length of a video. It was created primarily to deal with sudden, subtle emotions that are difficult to detect. TIM is also employed to reduce redundant frames that carry no emotion by interpolating the sequence to fewer frames. This strategy, however, does not affect recognition performance. Linear interpolation is used instead in some work, since it is reported to be more accurate at capturing motion patterns. Moreover, TIM removes redundant data at evenly spaced points without regard to sparse information within the frames, which may be accidentally eliminated [22]. Sparsity-promoting dynamic mode decomposition (DMDSP) [23] has been proposed to address these weaknesses of TIM, and the reported results show that DMDSP leads to better outcomes than TIM.
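Length normalization by plain linear interpolation, mentioned above as an alternative, can be sketched as follows. This is not an implementation of TIM itself, which is a more elaborate model; frames are represented as flat lists of pixel values, and the function name is hypothetical.

```python
def resample_sequence(frames, target_len):
    """Linearly interpolate a list of frames to target_len frames.

    Each frame is a flat list of floats; all frames have equal length.
    """
    if target_len < 2 or len(frames) < 2:
        raise ValueError("need at least two source and two target frames")
    last = len(frames) - 1
    out = []
    for k in range(target_len):
        pos = k * last / (target_len - 1)  # fractional source index
        lo = min(int(pos), last - 1)       # left neighbouring frame
        t = pos - lo                       # blend weight in [0, 1]
        a, b = frames[lo], frames[lo + 1]
        out.append([(1.0 - t) * p + t * q for p, q in zip(a, b)])
    return out


clip = [[0.0, 10.0], [2.0, 12.0], [4.0, 14.0]]  # three tiny 2-pixel frames
normalized = resample_sequence(clip, 5)
```

The same function shortens a clip when target_len is smaller than the number of source frames, so clips of differing lengths can be brought to one common length before feature extraction.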

3.2 Feature Extraction

The primary focus of automatic micro-expression analysis in recent years has been micro-expression recognition: given a micro-expression video sequence or clip, the aim is to identify its emotion label (or class). In face processing, feature representation is the transformation of raw input data into a compact form; representations can be geometric or appearance-based. Geometric-based features describe the shape and location of facial components and landmarks. In contrast, appearance-based features describe intensity and texture information, such as wrinkles, furrows, and other emotional patterns. Earlier investigations [24] in facial emotion recognition have found that appearance-based features cope better with lighting variation and misalignment errors than geometric-based features. Because they depend on accurate landmark detection and alignment, geometric-based features are less reliable than appearance-based features. For much the same reasons, appearance-based feature representations have become more common in studies on micro-expression recognition.

3.2.1 LBP-Based Methods

Local binary pattern on three orthogonal planes (LBP-TOP) [25] is an extension of the local binary pattern (LBP), which uses binary codes to describe local texture variations within a circular region and then encodes them into a histogram. LBP-TOP has been used in many studies and is reported as the baseline evaluation technique for the majority of known datasets (SMIC, CASME II, and SAMM). Whereas classic LBP operates on a single image, LBP-TOP extracts features from local spatio-temporal regions in three planes: spatial (XY), vertical spatio-temporal (YT), and horizontal spatio-temporal (XT), which allows LBP-TOP to encode temporal changes as well.

For feature extraction, Adegun and Vadapalli [26] use the local binary pattern on apex micro-expression frames [4]. For the LBP-TOP spatio-temporal feature extraction technique, micro-expression videos are first split into image sequences. LBP values were calculated for each gray-scale image by comparing the center pixel value with each neighboring pixel value, yielding an 8-digit binary number that was then converted to decimal. Histograms of these LBP values were then generated, with the pixel values (bins 1–256) on the X axis and the intensity levels on the Y axis, yielding a feature vector of size 1 × 256 for every sample. To extract LBP-TOP features, the gray-scale image sequences (i.e., micro-expression videos converted into frame sequences of differing lengths) were first read, and the LBP-TOP features were then computed. Radii values in the X and Y directions were varied from 1 to 4, while radii values in the T direction were varied from 2 to 4. The number of neighboring points for the XY, XT, and YT planes was fixed to 8, resulting in a total of 2^8 (256) patterns for each sample. The histograms of each image sequence were concatenated, generating a 3 × 256 feature vector. For dimensionality reduction of the feature vector, uniform binning was employed to keep only the uniform patterns out of the 256 histogram bins, which reduced the feature vector from 256 to 59. The features are then used to train an SVM, which obtains an average training accuracy of 94%.

Sun et al. [27] proposed a multi-scale active patches fusion-based spatiotemporal LBP-TOP descriptor (MAP-LBP-TOP), which takes into account the differing contributions of the various local regions of the face. To set a threshold for merging local and global features, they use the average value over all patches at each scale. The LBP-TOP histogram is computed for patches at different scales, and the active patches are then selected with an SVM, using the average accuracy as the criterion for identifying active patches during feature extraction. Finally, multi-scale feature fusion is performed to recognize the micro-expression. The MAP-LBP-TOP descriptor enhances feature representation and selection capabilities. As a result, the technique not only considers local and global aspects but also selectively merges them, and it obtains 77.3% accuracy.

Liong et al. [28] used both LBP and optical flow for feature extraction. To better distinguish spontaneous micro-expressions, a hybrid facial region extraction framework that blends heuristic and automated techniques was developed. The holistic use of the full face region and the use of relevant facial areas were compared analytically, relying mostly on the occurrence frequency of facial action units. The zones were selected automatically based on the locations of the facial landmarks. Zhang et al. [29] address cross-database micro-expression recognition using TIM, LBP, and SVM. Table 3 presents the comparison of performances such as accuracy and F1-measure of LBP-TOP-based approaches in the CASME II dataset. The results show that the combination technique using local binary pattern on three orthogonal planes performed best among the models.

Table 3 Comparison of accuracy of optical flow methods on CASME, CASME II datasets

Techniques           CASME (%)   CASME II (%)
MDMO                 68.86       67.37
3D flow-based CNN    54.44       59.11
OF-DI fusion         -           80.77
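The basic LBP coding step described in this section (compare the center pixel with its eight neighbors, read the result as an 8-bit number, build a 256-bin histogram, and keep the uniform patterns) can be sketched as follows. This is a minimal single-image version with radius 1 and hypothetical function names; LBP-TOP applies the same coding on the XY, XT, and YT planes and concatenates the three histograms.

```python
def lbp_code(image, r, c):
    """8-neighbour LBP code of pixel (r, c) in a 2D list of gray values."""
    center = image[r][c]
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for dr, dc in offsets:
        bit = 1 if image[r + dr][c + dc] >= center else 0
        code = (code << 1) | bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


def is_uniform(code):
    """True if the circular 8-bit pattern has at most two 0/1 transitions."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8)) <= 2


patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
code = lbp_code(patch, 1, 1)                          # bits 01011000
n_uniform = sum(is_uniform(v) for v in range(256))    # uniform patterns
```

There are 58 uniform 8-bit patterns; adding a single bin that collects all non-uniform patterns gives the 59 bins mentioned above.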

3.2.2 Optical Flow-Based Methods

According to many studies, the temporal dynamics of video sequences are significant in improving micro-expression recognition performance. As a result, approaches based on optical flow (OF), which measures spatio-temporal variations in intensity, became a focus of attention. Liu et al. [30] proposed the simple and efficient main directional mean optical flow (MDMO) feature for micro-expression recognition. They apply a robust optical flow technique to micro-expression video clips and divide the face into regions of interest (ROIs) based on action units. MDMO is a normalized statistical feature based on ROIs that takes into account both local motion statistics and spatial location. Its small feature dimension is one of its distinguishing properties: an MDMO feature vector has length 36 × 2 = 72, where 36 is the number of ROIs. They also show how to align all frames of a micro-expression video clip using optical flow to reduce the influence of noise caused by head motion. Li et al. [31] used a 3D flow-based CNN for video-based micro-expression recognition that learns deep features capable of representing fine motion flow arising from minute facial movements and achieves 59.11% accuracy on the CASME II dataset. Thi Thu Nguyen et al. [32] computed the optical flow of the video sequence to capture the magnified motion. A dynamic motion representation of the frame sequence was then generated using dynamic image computation, and the optical flow and dynamic image motion information were combined using a feature engineering approach.
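The idea behind a main-direction mean flow feature can be sketched as follows, under simplifying assumptions: the optical flow vectors of each ROI are already available, directions are quantized into eight bins, and each ROI contributes the magnitude and angle of the mean vector in its most populated bin. The normalization and frame alignment of the published MDMO feature are omitted, and all names are hypothetical.

```python
import math


def main_direction_mean_flow(flow_vectors, n_bins=8):
    """Return (magnitude, angle) of the mean flow in the dominant bin.

    flow_vectors is a list of (u, v) optical-flow vectors inside one ROI.
    """
    if not flow_vectors:
        return 0.0, 0.0
    bins = [[] for _ in range(n_bins)]
    width = 2.0 * math.pi / n_bins
    for u, v in flow_vectors:
        angle = math.atan2(v, u) % (2.0 * math.pi)  # map to [0, 2*pi)
        bins[min(int(angle / width), n_bins - 1)].append((u, v))
    main = max(bins, key=len)                        # most populated bin
    mean_u = sum(u for u, _ in main) / len(main)
    mean_v = sum(v for _, v in main) / len(main)
    return math.hypot(mean_u, mean_v), math.atan2(mean_v, mean_u)


def mdmo_like_feature(rois):
    """Concatenate (magnitude, angle) over all ROIs: two values per ROI."""
    feature = []
    for vectors in rois:
        feature.extend(main_direction_mean_flow(vectors))
    return feature


# 36 hypothetical ROIs, each with a few mostly rightward flow vectors.
rois = [[(1.0, 0.1), (1.2, 0.2), (-0.5, 0.0)] for _ in range(36)]
feature = mdmo_like_feature(rois)
```

With 36 ROIs the resulting vector has 36 × 2 = 72 entries, which matches the feature length quoted above.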

3.2.3 Deep Learning-Based Methods

Although handcrafted features can produce high recognition rates, they tend to neglect additional details in the raw image data. A convolutional neural network (CNN) is a form of neural network that has become popular in recent years. Its design was inspired by the work of Hubel and Wiesel, who around 1960 studied how connected neurons in the visual cortex of cats respond to visual patterns. A Nobel Prize was awarded for this work, and extensive research on improving CNNs followed. CNNs are predominantly employed in image processing, where they are capable of quickly recognizing and classifying images. LeNet, AlexNet, VGGNet, GoogLeNet, and ResNet are some of the best-known CNN architectures.

Kim et al. [33] used a CNN to encode the spatial details of distinct expression states such as the onset, apex, and offset frames. It is among the more recent studies to examine micro-expressions using CNNs. To perform micro-expression recognition, the CNN features were fed into a long short-term memory (LSTM) network, which works well on time-series data. The spatial features learned with expression-state constraints were passed to the LSTM to learn the temporal characteristics of the different states of a micro-expression, achieving 60.98% accuracy on the CASME II dataset.

Takalkar and Xu [34] investigated how deep learning can be employed to recognize micro-expressions. Building a reliable deep neural network requires huge training sets containing a large number of labeled image examples. The proposed CNN achieves 74.25% and 75.57% accuracy on the CASME and CASME II datasets, respectively.

Liong et al. [35] devised the Shallow Triple Stream Three-dimensional CNN (STSTNet), which extracts high-level discriminative features and micro-expression details while being computationally cheap. The network is trained on three optical flow features (optical strain and the horizontal and vertical optical flow fields) computed from the onset and apex frames of each video.

Van Quang et al. [36] presented a CapsuleNet for micro-expression recognition that uses only the apex frames. They examined the ability of CapsuleNet to model part-whole relationships and to be trained adequately on small datasets, such as those available for micro-expression recognition.

Verma et al. [37] presented a Lateral Accretive Hybrid Network (LEARNet) to capture micro-level features of a facial expression. This network refines the expressive features in an accretive way by introducing accretion layers (AL) into the network. The response of an AL holds the hybrid feature maps created by the preceding laterally connected convolution layers. The LEARNet model also includes a cross-decoupled relationship between convolution layers that helps to preserve small but significant facial muscle changes. Table 4 compares the accuracy of deep learning-based micro-expression recognition approaches on the SMIC, CASME, and CASME II datasets.

Table 4 Comparison of deep learning-based methods

Techniques    Accuracy (%)
              SMIC     CASME    CASME II
CNN-LSTM      -        -        60.98
CNN           -        74.25    75.57
STSTNet       70.13    -        76.92
CapsuleNet    58.77    -        70.18
LEARNet       81.60    80.62    76.57
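Of the three optical flow inputs mentioned for STSTNet, optical strain is the least familiar. A minimal sketch of how it can be derived from the horizontal and vertical flow fields is given below, assuming central finite differences and the common definition of strain magnitude as the norm of the symmetric flow-gradient tensor. The function name is hypothetical, and the flow fields themselves are taken as given.

```python
import math


def optical_strain(u, v):
    """Optical strain magnitude per interior pixel from flow fields u, v.

    u and v are 2D lists (rows x cols) of horizontal and vertical flow.
    Derivatives use central differences; border pixels are left at 0.
    """
    rows, cols = len(u), len(u[0])
    strain = [[0.0] * cols for _ in range(rows)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            du_dx = (u[y][x + 1] - u[y][x - 1]) / 2.0
            du_dy = (u[y + 1][x] - u[y - 1][x]) / 2.0
            dv_dx = (v[y][x + 1] - v[y][x - 1]) / 2.0
            dv_dy = (v[y + 1][x] - v[y - 1][x]) / 2.0
            shear = 0.5 * (du_dy + dv_dx)  # off-diagonal tensor term
            strain[y][x] = math.sqrt(du_dx ** 2 + dv_dy ** 2
                                     + 2.0 * shear ** 2)
    return strain


# A flow field that stretches horizontally: u grows with x, v is zero.
u = [[0.5 * x for x in range(4)] for _ in range(4)]
v = [[0.0] * 4 for _ in range(4)]
strain = optical_strain(u, v)
```

A uniform translation of the whole face gives zero strain, whereas local deformation gives a non-zero value, which is why strain is useful for isolating facial muscle movement from head motion.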

3.3 Classification

The classification of the emotion type is the final step in the micro-expression recognition process. K-nearest neighbor, support vector machine (SVM), random forest, sparse representation classifier (SRC), relaxed K-SVD, group sparse learning (GSL), and extreme learning machine (ELM) are among the classifiers that have been used for micro-expression recognition. According to the literature, the SVM is the most widely used classifier. An SVM constructs a hyperplane or a set of hyperplanes in a high-dimensional or infinite-dimensional space. During SVM training, the margin between the boundaries of the different classes is made as large as possible. Even when the amount of training data is limited, SVMs are reported to outperform conventional classifiers in terms of robustness, accuracy, and effectiveness.
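The margin-maximizing idea can be illustrated with a small linear SVM trained by subgradient descent on the regularized hinge loss. This is a toy sketch on hypothetical two-dimensional data, not the solver or kernel used in any of the cited works; the learning rate, regularization strength, and number of epochs are illustrative assumptions.

```python
def train_linear_svm(xs, ys, lr=0.1, lam=0.01, epochs=200):
    """Full-batch subgradient descent on the regularized hinge loss.

    xs: list of feature vectors, ys: labels in {-1, +1}.
    lr, lam, and epochs are illustrative, not tuned values.
    """
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]      # from (lam / 2) * ||w||^2
        grad_b = 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:                 # point violates the margin
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Sign of the decision function w . x + b."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


xs = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
      [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
ys = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(xs, ys)
predictions = [predict(w, b, x) for x in xs]
```

Only points with a margin below 1 contribute to the update, so the final hyperplane is determined by the samples closest to the class boundary, which is one reason SVMs can cope with small training sets.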

4 Performance Metrics

Accuracy, which is often employed in image and video recognition tasks, is the performance metric usually reported for micro-expression recognition. Most studies in the literature report accuracy, which is the number of correctly classified video sequences divided by the total number of video sequences in the dataset:

$$\text{Accuracy} = \frac{TP_1 + TN_1}{TP_1 + FP_1 + TN_1 + FN_1} \tag{1}$$

where $TP_1$, $TN_1$, $FP_1$, and $FN_1$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Because of the imbalanced nature of the micro-expression datasets, accuracy results may be heavily skewed toward the larger classes, as classifiers have a hard time learning from underrepresented classes. It is therefore more appropriate to also report the F1-Score (or F-measure), which is the harmonic mean of precision and recall:

$$\text{Recall} = \frac{TP_1}{TP_1 + FN_1} \tag{2}$$

$$\text{Precision} = \frac{TP_1}{TP_1 + FP_1} \tag{3}$$

$$\text{F1-Score} = \frac{2}{\dfrac{1}{\text{Precision}} + \dfrac{1}{\text{Recall}}} \tag{4}$$
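Equations (1)–(4) translate directly into code. The sketch below uses hypothetical confusion counts for one minority class to show how accuracy can stay high while recall and the F1-Score remain low; the function name and the counts are illustrative only.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, recall, precision, and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision == 0.0 or recall == 0.0:
        f1 = 0.0
    else:
        f1 = 2.0 / (1.0 / precision + 1.0 / recall)  # harmonic mean
    return accuracy, recall, precision, f1


# Hypothetical counts for one minority class in an imbalanced test set.
accuracy, recall, precision, f1 = classification_metrics(tp=2, fp=1, tn=95, fn=2)
```

With these counts the accuracy is 0.97, although only half of the minority-class samples are found (recall 0.5) and the F1-Score is about 0.57.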

5 Conclusion

This paper reviewed the datasets and techniques related to micro-expression recognition. We first compiled a list of existing posed and spontaneous micro-expression datasets and compared their merits. We then described the micro-expression recognition pipeline and discussed the individual steps involved in automating it. In particular, feature representation techniques such as LBP-TOP, optical flow, and deep learning-based methods were reviewed, comparative performance analyses were provided for each, and the performance evaluation metrics were defined. From this review, it is clear that the creation of spontaneous micro-expression datasets is pivotal for developing and running robust deep learning models for the automatic recognition of micro-expressions.

References

1. Ekman P, Oster H (1979) Facial expressions of emotion. Annu Rev Psychol 30(1):527–554
2. Haggard EA, Isaacs KS (1966) Micromomentary facial expressions as indicators of ego mechanisms in psychotherapy. In: Methods of research in psychotherapy. Springer, Boston, MA, pp 154–165
3. Ekman P, Friesen WV (1969) The repertoire of nonverbal behavior: categories, origins, usage, and coding. Semiotica 1(1):49–98
4. Owayjan M, Kashour A, Al Haddad N, Fadel M, Al Souki G (2012) The design and development of a lie detection system using facial micro-expressions. In: 2012 2nd international conference on advances in computational tools for engineering applications (ACTEA). IEEE, pp 33–38
5. Datz F, Wong G, Löffler-Stastka H (2019) Interpretation and working through contemptuous facial micro-expressions benefits the patient-therapist relationship. Int J Environ Res Public Health 16(24):4901
6. Yan W-J, Wu Q, Liang J, Chen Y-H, Fu X (2013) How fast are the leaked facial expressions: the duration of micro-expressions. J Nonverbal Behav 37(4):217–230
7. Zhao Y, Xu J (2018) Necessary morphological patches extraction for automatic micro-expression recognition. Appl Sci 8(10):1811
8. Ekman P (2003) Darwin, deception, and facial expression. Ann N Y Acad Sci 1000(1):205–221
9. Yap MH, See J, Hong X, Wang S-J (2018) Facial micro-expressions grand challenge 2018 summary. In: 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018). IEEE, pp 675–678
10. Polikovsky S, Kameda Y, Ohta Y (2009) Facial micro-expressions recognition using high speed camera and 3D-gradient descriptor, p 16
11. Warren G, Schertler E, Bull P (2009) Detecting deception from emotional and unemotional cues. J Nonverbal Behav 33(1):59–69
12. Li X, Pfister T, Huang X, Zhao G, Pietikäinen M (2013) A spontaneous micro-expression database: inducement, collection and baseline. In: 2013 10th IEEE international conference and workshops on automatic face and gesture recognition (FG). IEEE, pp 1–6
13. Yan W-J, Wu Q, Liu Y-J, Wang S-J, Fu X (2013) CASME database: a dataset of spontaneous micro-expressions collected from neutralized faces. In: 2013 10th IEEE international conference and workshops on automatic face and gesture recognition (FG). IEEE, pp 1–7
14. Yan W-J, Li X, Wang S-J, Zhao G, Liu Y-J, Chen Y-H, Fu X (2014) CASME II: an improved spontaneous micro-expression database and the baseline evaluation. PLoS ONE 9(1):e86041
15. Kondaveeti HK, Goud MV (2020) Emotion detection using deep facial features. In: 2020 IEEE international conference on advent trends in multidisciplinary research and innovation (ICATMRI). IEEE, pp 1–8
16. Bruni V, Vitulano D (2021) A fast preprocessing method for micro-expression spotting via perceptual detection of frozen frames. J Imaging 7(4):68
17. Wang C-P, Zhang J-S (2012) Image denoising via clustering-based sparse representation over Wiener and Gaussian filters. In: 2012 spring congress on engineering and technology. IEEE, pp 1–4

18. Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition. CVPR 2001, vol 1. IEEE, p I
19. Zhou D, Petrovska-Delacrétaz D, Dorizzi B (2009) Automatic landmark location with a combined active shape model. In: 2009 IEEE 3rd international conference on biometrics: theory, applications, and systems. IEEE, pp 1–7
20. Cao Z, Simon T, Wei S-E, Sheikh Y (2017) Realtime multi-person 2D pose estimation using part affinity fields. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7291–7299
21. Wang Y, See J, Oh Y-H, Phan RC-W, Rahulamathavan Y, Ling H-C, Tan S-W, Li X (2017) Effective recognition of facial micro-expressions with video motion magnification. Multimed Tools Appl 76(20):21665–21690
22. Mayya V, Pai RM, Manohara Pai MM (2016) Combining temporal interpolation and DCNN for faster recognition of micro-expressions in video sequences. In: 2016 international conference on advances in computing, communications and informatics (ICACCI). IEEE, pp 699–703
23. Jovanović MR, Schmid PJ, Nichols JW (2014) Sparsity-promoting dynamic mode decomposition. Phys Fluids 26(2):024103
24. Xia Z, Hong X, Gao X, Feng X, Zhao G (2019) Spatiotemporal recurrent convolutional networks for recognizing spontaneous micro-expressions. IEEE Trans Multimed 22(3):626–640
25. Guo Y, Tian Y, Gao X, Zhang X (2014) Micro-expression recognition based on local binary patterns from three orthogonal planes and nearest neighbor method. In: 2014 international joint conference on neural networks (IJCNN). IEEE, pp 3473–3479
26. Adegun IP, Vadapalli HB (2020) Facial micro-expression recognition: a machine learning approach. Sci Afr 8:e00465
27. Sun Z, Hu Z-P, Zhao M, Li S (2020) Multi-scale active patches fusion based on spatiotemporal LBP-TOP for micro-expression recognition. J Vis Commun Image Represent 71:102862
28. Liong S-T, See J, Phan RC-W, Wong K, Tan S-W (2018) Hybrid facial regions extraction for micro-expression recognition system. J Signal Process Syst 90(4):601–617
29. Zhang T, Zong Y, Zheng W, Philip Chen CL, Hong X, Tang C, Cui Z, Zhao G (2020) Cross-database micro-expression recognition: a benchmark. IEEE Trans Knowl Data Eng
30. Liu Y-J, Zhang J-K, Yan W-J, Wang S-J, Zhao G, Fu X (2015) A main directional mean optical flow feature for spontaneous micro-expression recognition. IEEE Trans Affect Comput 7(4):299–310
31. Li J, Wang Y, See J, Liu W (2019) Micro-expression recognition based on 3D flow convolutional neural network. Pattern Anal Appl 22(4):1331–1339
32. Thi Thu Nguyen N, Thi Thu Nguyen D, The Pham B (2021) Micro-expression recognition based on the fusion between optical flow and dynamic image. In: 2021 the 5th international conference on machine learning and soft computing, pp 115–120
33. Kim DH, Baddar WJ, Ro YM (2016) Micro-expression recognition with expression-state constrained spatio-temporal feature representations. In: Proceedings of the 24th ACM international conference on multimedia, pp 382–386
34. Takalkar MA, Xu M (2017) Image based facial micro-expression recognition using deep learning on small datasets. In: 2017 international conference on digital image computing: techniques and applications (DICTA). IEEE, pp 1–7
35. Liong S-T, Gan YS, See J, Khor H-Q, Huang Y-C (2019) Shallow triple stream three-dimensional CNN (STSTNet) for micro-expression recognition. In: 2019 14th IEEE international conference on automatic face & gesture recognition (FG 2019). IEEE, pp 1–5
36. Van Quang N, Chun J, Tokuyama T (2019) CapsuleNet for micro-expression recognition. In: 2019 14th IEEE international conference on automatic face & gesture recognition (FG 2019). IEEE, pp 1–7
37. Verma M, Vipparthi SK, Singh G, Murala S (2019) LEARNet: dynamic imaging network for micro expression recognition. IEEE Trans Image Process 29:1618–1627

Mathematical Modeling of Diabetic Patient Model Using Intelligent Control Techniques

Subashri Sivabalan and Vijay Jeyakumar

Abstract Diabetes mellitus (DM) is one of the most common chronic diseases and is categorized into two types, type 1 and type 2. Type 1 diabetic patients experience fluctuations in blood glucose (BG) levels, mainly because the pancreas fails to secrete insulin, which results in hyperglycemia, i.e., a rise in BG level above 150 mg/dl. In such conditions, a patient needs continuous insulin injections throughout their life span. Because the metabolism of diabetic patients is very complex and nonlinear, the paper mainly focuses on dynamic simulations of glucose–insulin interaction to obtain a new method for regulating blood glucose levels in diabetic patients. A model is therefore developed to describe the process of insulin–glucose interaction, and the system is simulated using state-space analysis. The state-space model is developed with the classic linearization method, followed by a closed-loop simulation. In this work, two different human patient models, based on Lehmann and Sorensen, are proposed to regulate the blood glucose levels of diabetic patients, and the two type 1 diabetic patient models are compared through analysis of their mathematical models. Later, the models are validated with the help of machine learning algorithms using Simulink software.

Keywords Diabetes mellitus (DM) · Blood glucose (BG) · Type 1 diabetes (T1D) · Glucose–insulin (GI) · Net hepatic glucose balance (NHGB) · Central nervous system (CNS) · Red blood cells (RBC) · Creatinine clearance rate (CCR) · Type 1 diabetes mellitus (T1DM) · Multiple inputs and multiple output (MIMO) · Renal threshold glucose (RTG) · Oral glucose tolerance test (OGTT)

S. Sivabalan · V. Jeyakumar (B)
Department of Biomedical Engineering, Sri Sivasubramaniya Nadar College of Engineering, Chennai 603110, India
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_2

S. Sivabalan and V. Jeyakumar

1 Introduction

Diabetes is a group of metabolic disorders that cause a marked rise in the blood glucose (BG) level, known as hyperglycemia. The normal BG concentration is around 60–120 mg/dl. This concentration is regulated by two hormones, glucagon and insulin, which are secreted by the alpha and beta cells of the pancreas, respectively. Insulin lowers the blood glucose level, and glucagon raises it. Type 1 diabetes mellitus is caused by insufficient secretion of insulin, whereas in type 2 diabetes the pancreas produces insulin but the body fails to respond to it. Our body cells need fuel, namely glucose, but glucose needs a key to enter the cell, and insulin acts as that key. People with type 1 diabetes do not produce insulin, which is like not having the key. People with type 2 diabetes do not respond to insulin, which is like having a broken key. This work mainly concentrates on type 1 diabetes, which requires an external infusion of insulin. Many research works have developed mathematical models that represent the behavior of people with type 1 diabetes. The most important are the Bergmann minimal model (BMM), the Sorensen model (nineteenth-order) [1], the Hovorka model (seventh-order), and the Lehmann model (ninth-order). The Bergmann model [2] is a very simple model that is mostly used for controller design. Different models of diabetes are described in the literature for capturing the dynamics of insulin in regulating BG concentration. Among these, the models proposed by Lehmann and Sorensen are the most widely used for simulation and are considered in this study for the correction of blood glucose, including the effect of meal disturbances [3]. The main goal of the work is to develop a model that describes insulin–glucose interaction [4] for blood glucose regulation in diabetic patients. To develop an accurate model, parameter prediction [5] and closed-loop control have to be performed satisfactorily [6]. To improve the quality of the model, the datasets are obtained by applying an input signal to the diabetic patient model [7], which will be used in the future for designing an artificial pancreas.

2 Literature Survey
Various authors have proposed different type 1 diabetes patient models of the insulin–glucose interaction in the human body. They are described as follows. Topp et al. proposed modeling blood glucose regulation using potential measurements in the pancreatic islets, based on the beta cells found inside the pancreas. This model of the glucose–insulin interaction process in type 2 diabetes [8] is based on an earlier method known as the Guyton model. Tiran et al. studied the simulation of the insulin–glucose interaction process, which

Mathematical Modeling of Diabetic Patient Model Using Intelligent …


describes the different flows of data and volume in the blood across various organs. Bergman et al. developed a minimal model consisting of three compartments that represent various organs [9] of the human body. The dynamics of glucose transport in tissues were neglected in this study. Ahlborg et al. introduced the dynamics of the glucose–insulin process under disturbances such as stress, meals, and exercise. Puckett presented a model that describes the metabolism of glucose based on the physiological process [10], using prior knowledge and nonlinearities captured with an artificial neural network. Quoin et al. gave an input–output modeling solution that incorporates a recurrent predictive model and predicts the insulin–glucose metabolism of the diabetic patient [11]. Fischer et al. described a state estimation method based on the extended Kalman filter, which linearizes the nonlinear dynamic model [12] of T1DM patients. Camellia et al. reported Quoin uncertainty cases in modeling [13], using a robust control strategy that obeys the law of conservation. Lehmann et al. used a pharmacokinetic model to calculate the active and plasma insulin, considering linearized lumped parameters. They followed the compartment model for predicting and controlling specific parameters in diabetic patients. Parker et al. described a detailed nineteenth-order model for designing glucose–insulin control using predictive algorithms. The same authors developed a model with six major parameters that describe the transfer of insulin from the insulin compartment into the plasma space, which affects the uptake of glucose. Brannon presented a design strategy for treating cancer. Gampetelli et al. presented a predictive control algorithm for type 1 diabetic patients [14] based on the nineteenth-order pharmacokinetic and pharmacodynamic model. Kirubakaran et al. described the patient model in terms of controlled outputs and manipulated inputs, together with a disturbance model [15]. Cochin et al. described a microprocessor-based adaptive control patient model that reports the status of the patient, including details of blood chemistry, together with switching control algorithms [16] used to describe the patient's resting condition under heart failure. Malik et al. performed a comparative analysis of machine learning techniques for the early prediction of diabetes mellitus among women; in that work, the datasets were taken from the existing literature. Pethunachiyar presented a diabetes mellitus classification system using a support vector machine with different kernel functions, with datasets taken from the UCI Machine Learning Repository. In contrast, to validate the model developed here, the datasets are generated from the developed model itself, with insulin and glucose as features with respect to time, which was not done in the previous works.

3 Materials and Methods
Several models have been suggested in the literature [17] to represent the dynamics of the glucose–insulin


Fig. 1 Simplified model of GI interaction

(GI) interaction for controlling the blood glucose (BG) level. The Lehmann model and the revised Sorensen model are used in this study to design a human patient model for blood glucose regulation.
(a) Section 1: Modeling of Lehmann model
The glucose–insulin (GI) interaction of the Lehmann-based diabetic patient model comprises six compartments: gut, kidney, periphery, heart/lungs, brain, and liver, as shown in Fig. 1. The solid arrows in Fig. 1 represent glucose transport, with promotion and inhibition indicated by +1 and −1, respectively. The general block diagram of the glucose metabolism process is described as follows. The steps involved in designing the Lehmann (ninth-order) model are as follows:
Step 1: Insulin action: The carbohydrate present in a meal [18] is given to the patient, and the pancreas injects a suitable amount of insulin; the rate at which this insulin is absorbed is called the insulin absorption rate $I_{abs}$. A first-order filter is used to calculate the insulin concentration in plasma:

$$\frac{dI_p}{dt} = \frac{I_{abs}}{V_I} - K_e I_p \tag{1}$$

$$I_p = \frac{1}{s}\left[\frac{I_{abs}}{V_I} - K_e I_p\right] \tag{2}$$

where $I_p$ denotes the insulin concentration in plasma (mU/dl), $I_{abs}$ denotes the insulin absorption rate (mU/min), $K_e$ denotes the insulin elimination rate constant, and $V_I$ denotes the insulin distribution volume (dl).
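As an illustration of the first-order plasma insulin filter in Eq. (1), the following minimal Python sketch integrates the equation with a forward-Euler step. The numerical values of `I_ABS`, `V_I`, and `K_E` are hypothetical placeholders chosen only for the example; they are not taken from this paper. For a constant absorption rate, the plasma concentration should settle at $I_{abs}/(V_I K_e)$.

```python
def simulate_plasma_insulin(i_abs, v_i, k_e, ip0=0.0, dt=1.0, steps=600):
    """Forward-Euler integration of dIp/dt = Iabs/VI - Ke*Ip (Eq. 1)."""
    ip = ip0
    trajectory = [ip]
    for _ in range(steps):
        ip += dt * (i_abs / v_i - k_e * ip)
        trajectory.append(ip)
    return trajectory


# Hypothetical illustrative values (assumptions, not from the paper).
I_ABS = 10.0   # insulin absorption rate, mU/min
V_I = 100.0    # insulin distribution volume, dl
K_E = 0.09     # insulin elimination rate constant, 1/min

traj = simulate_plasma_insulin(I_ABS, V_I, K_E)
steady_state = I_ABS / (V_I * K_E)  # analytical equilibrium of Eq. (1)
print(f"final Ip = {traj[-1]:.4f} mU/dl, steady state = {steady_state:.4f} mU/dl")
```

The explicit Euler step is used here only for readability; a stiff or variable-step solver may be preferable in a full patient simulation.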


$I_a$ denotes the concentration of active insulin. Its rate of change is defined as

$$\frac{dI_a}{dt} = K_1 I_p - K_2 I_a \tag{3}$$

$$I_a = \frac{1}{s}\left[K_1 I_p - K_2 I_a\right] \tag{4}$$

where $I_a$ denotes the active insulin concentration (mU/dl), $I_{abs}$ denotes the insulin absorption rate (mU/min), $K_1$ denotes the insulin elimination rate constant, and $K_2$ denotes the delay in insulin action. The two important types of insulin are as follows:
• Effective active insulin ($I_{ea}$)
• Effective plasma insulin ($I_{ep}$)

$$I_{ea} = S_p \, \frac{K_2 I_a}{K_1} \tag{5}$$

$$I_{ep} = S_h \, \frac{I_p}{I_{basal}} \tag{6}$$

where $S_p$ denotes the specific peripheral insulin sensitivity (range 0 to 1), $S_h$ denotes the specific hepatic insulin sensitivity (range 0 to 1), and $I_{basal}$ denotes the basal plasma insulin level. Hence, all the above dynamic equations are used for the simulation of the insulin part [19] of the Lehmann human patient model.
Step 2: Glucose action: Glucose is delivered by the gastric emptying subsystem, whose rate increases up to its maximum value $V_{maxge}$ = 360 mg/min and then decreases back to 0. The critical carbohydrate quantity $Ch_{critical}$ evaluates to 10.8 g for the default durations. When the carbohydrate intake $Ch$ of the meal is less than this value, the ascending and descending durations are calculated from $Ch$; for larger meals, the plateau duration $T_{maxge}$ is calculated. The equations are

$$Ch_{critical} = \frac{\left(T_{ascge} + T_{desge}\right) V_{maxge}}{2} \tag{7}$$

$$T_{ascge} = T_{desge} = \frac{2\,Ch}{V_{maxge}} \tag{8}$$

$$T_{maxge} = \frac{Ch - \frac{1}{2} V_{maxge}\left(T_{ascge} + T_{desge}\right)}{V_{maxge}} \tag{9}$$

The rate of gastric emptying $G_{empt}$ for meals that contain carbohydrates is given by the following equations:

$$G_{empt} = \frac{V_{maxge}}{T_{ascge}}\, t \tag{10}$$


$$G_{gut} = \frac{1}{s}\left[G_{empt} - G_{in}\right] \tag{11}$$

$$G_{in} = K_{gabs}\, G_{gut} \tag{12}$$

where $G_{gut}$ denotes the amount of glucose in the gut (mg), $G_{in}$ denotes the glucose intake rate, $K_{gabs}$ denotes the glucose absorption rate constant, $T_{maxge}$ denotes the duration for which gastric emptying stays at its maximum rate, and $T_{ascge}$ and $T_{desge}$ denote the durations of the ascending and descending branches, with default values of 30 min. The net hepatic glucose balance (NHGB) represents the production and utilization of glucose by the liver. Its values are obtained from a lookup table as a function of two factors, implemented in the Simulink model of the liver. Glucose utilization is modeled by a Simulink subsystem in which the utilization rate of the human body is taken as 72 mg/h/kg. The renal excretion of glucose is given by

$$G_{ren} = CCR\,(G - RTG) \tag{13}$$

where $G$ represents the plasma glucose level in the blood and $RTG$ is the renal threshold of glucose. Here, the kidney extracts some amount of glucose from the blood as a function of the creatinine clearance rate (CCR). The simulation diagram of the Lehmann model is developed with the help of MATLAB Simulink software [20]. Glucose is produced by the gastric system (digestive organ); it is utilized by the liver, peripheral cells, CNS, and RBC, and it is excreted through the kidney.
Step 3: Insulin delivery system: The transfer function of the insulin dispenser is described briefly in the paper by Cochin et al., which was mentioned in the literature survey. The components of the system are a micropump, an insulin capsule, an accumulator, and suitable electronic valves. The pump and valves are all part of the micro-insulin dispenser system, which is described with its corresponding simulation diagram.
Step 4: Simulation diagram: Figure 3 represents the Simulink model of the different subsystems of the T1DM patient model, which include the gastric emptying system, the CNS subsystem, the kidney subsystem, and the insulin dispenser system shown in Fig. 2. In this study, the system is designed with and without disturbances, and its responses are tested in detail.
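The gastric emptying and gut absorption relations of Step 2 can be sketched in Python as follows. This is a minimal illustration, not the authors' Simulink implementation. It assumes a meal above the critical carbohydrate load, an ascending ramp as in Eq. (10), and, as an assumption not spelled out in the text, a plateau at $V_{maxge}$ followed by a linear descent. The value of `k_gabs` used in the example call is a hypothetical placeholder.

```python
V_MAXGE = 360.0    # maximal gastric emptying rate, mg/min (from the text)
T_DEFAULT = 30.0   # default ascending/descending duration, min (from the text)


def critical_carbs(v_max=V_MAXGE, t_asc=T_DEFAULT, t_des=T_DEFAULT):
    """Eq. (7): critical carbohydrate load in mg."""
    return (t_asc + t_des) * v_max / 2.0


def gastric_emptying_rate(t, ch_mg, v_max=V_MAXGE, t_asc=T_DEFAULT, t_des=T_DEFAULT):
    """Trapezoidal emptying rate (mg/min) for a meal above the critical load."""
    t_max = (ch_mg - 0.5 * v_max * (t_asc + t_des)) / v_max  # Eq. (9)
    if t < 0.0:
        return 0.0
    if t < t_asc:
        return v_max / t_asc * t                             # Eq. (10)
    if t < t_asc + t_max:
        return v_max                                         # assumed plateau
    if t < t_asc + t_max + t_des:
        return v_max * (t_asc + t_max + t_des - t) / t_des   # assumed descent
    return 0.0


def simulate_gut(ch_mg, k_gabs, dt=0.01, t_end=600.0):
    """Euler integration of Eqs. (11)-(12); returns (gut glucose, absorbed glucose) in mg."""
    g_gut = 0.0
    absorbed = 0.0
    for i in range(int(round(t_end / dt))):
        g_empt = gastric_emptying_rate(i * dt, ch_mg)
        g_in = k_gabs * g_gut            # Eq. (12)
        g_gut += dt * (g_empt - g_in)    # Eq. (11)
        absorbed += dt * g_in
    return g_gut, absorbed


# Example: a 50 g carbohydrate meal with a hypothetical absorption constant.
g_gut, absorbed = simulate_gut(50000.0, k_gabs=1.0 / 60.0)
print(critical_carbs() / 1000.0, "g critical load")
```

With the default durations of 30 min, `critical_carbs()` reproduces the 10.8 g threshold quoted in the text, and the glucose remaining in the gut plus the glucose absorbed adds up to the ingested carbohydrate, which is a convenient sanity check on the trapezoidal profile.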

Fig. 2 Simulation diagram of insulin delivery system


Fig. 3 Simulation diagram of human patient (Lehmann) model

Step 5: Linearization of the T1DM patient (Lehmann) model
The patient model has been completely simulated using MATLAB/Simulink as shown in Fig. 3. The glucose–insulin interaction process has three inputs and one output: the two disturbances considered in the study are meal and exercise, the third input is the control signal, and the output signal is the glucose level. The output parameter to be monitored is the blood glucose (BG). For simulation purposes, the closed-loop manipulated variable is the insulin injection rate, represented by $U(t)$. The general state-space form of the human patient model, including white noise, is described by the following equations:

$$\dot{X}_m(t) = A_m(t)\,X_m(t) + B_m(t)\,U(t) + B_d(t)\,W(t) \tag{14}$$

$$Y(t) = C_m(t)\,X_m(t) + D_m(t)\,U(t) \tag{15}$$

where $U(t)$ represents the control input signal, $Y(t)$ represents the output glucose concentration level, and $X_m(t)$ represents the state vector, which comprises the glucose level in the blood $X_{m1}(t)$, the production rate of glucose by the gut system $X_{m2}(t)$, the glucose utilization rate of the peripheral cells $X_{m3}(t)$, the excretion rate of glucose from the kidney $X_{m4}(t)$, the NHGB rate $X_{m5}(t)$, the glucose utilization rate of the CNS $X_{m6}(t)$, the insulin dose $X_{m7}(t)$, the plasma insulin rate $X_{m8}(t)$, and the active insulin rate $X_{m9}(t)$. $W_1(t)$ represents the meal disturbance, and $W_2(t)$ represents the exercise disturbance. The equations are converted into linearized form, and the Laplace transform is then applied to obtain the transfer function model.
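The step from the state-space form to transfer functions can be written out explicitly. The derivation below is a sketch under two assumptions that the text does not state: the linearized matrices are constant (time-invariant), and the initial state is zero. The symbols $G_u(s)$ and $G_d(s)$ are introduced here only for illustration.

```latex
% Laplace transform of Eq. (14) with X_m(0) = 0 and constant matrices
s X_m(s) = A_m X_m(s) + B_m U(s) + B_d W(s)
\;\Longrightarrow\;
X_m(s) = (sI - A_m)^{-1}\left[B_m U(s) + B_d W(s)\right]

% Substituting into the transform of Eq. (15)
Y(s) = \underbrace{\left[C_m (sI - A_m)^{-1} B_m + D_m\right]}_{G_u(s)} U(s)
     + \underbrace{C_m (sI - A_m)^{-1} B_d}_{G_d(s)}\, W(s)
```

Here $G_u(s)$ relates the insulin injection rate to the blood glucose output, and $G_d(s)$ is a row of two transfer functions relating the meal and exercise disturbances to the same output.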


Step 6: Linearization results

$$A_m = \begin{bmatrix}
-0.0001 & 0.000016 & -0.00008 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -0.0013 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -0.052 & 0 & 0 & 110 & 0 & 0 & 0 \\
0 & 0 & 0.000127 & -0.003 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -14.0 & -206.0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -12.5 & -72.67 & -33 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}$$

$$B_m = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \end{bmatrix}^{T}$$

$$B_d = \begin{bmatrix}
0 & -0.5 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0
\end{bmatrix}, \qquad
C_m = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix}, \qquad
D_m = [0]$$

Step 7: Results and discussion from the Lehmann model (also called the ninth-order system)
Figure 4 shows the blood glucose concentration (mg/dl), which attains the normal value for the designed model when the disturbances are considered. The general block diagram of the diabetic patient in Fig. 5 indicates the ninth-order Lehmann-based human patient model, which consists of six compartments: gut, liver, kidney, periphery, heart/lungs, and brain.
(b) Section 2: Modeling of revised Sorensen model


Fig. 4 a Blood glucose concentration level expressed in terms of (mg/dl), b the response of the utilization of glucose by the GUT subsystem, c response of utilization of glucose by the central nervous system (CNS), d the control input signal given as input to trigger the developed ninth-order Lehmann-based human patient model

Several types of models were described briefly in the literature survey for representing the glucose–insulin system, such as the UVA/PADOVA model, the Bergman minimal model, and the Hovorka model [20]; these are simpler, with few equations and a limited number of parameters describing the interactions between the glucose and insulin systems. In this section, the Sorensen-based [21] diabetic patient model is described. It has 22 differential equations, most of them nonlinear, that represent three submodels describing the glucose concentrations in the brain, liver, heart, lungs, gut, periphery, kidney, and gastrointestinal tract, and that include the release of insulin and glucagon from the pancreas. In the present work, an improved version of the Sorensen model inherits equations from the original SIMO model [22], which describes the glucose absorption rate in the stomach, the jejunum, the ileum, and a delay compartment. In this paper, the glucose/insulin system of the human body is represented by this revised version of the original Sorensen model.


Fig. 5 Compartment representation of the glucose–insulin interaction of human patient model (Lehmann)

The steps involved in designing the revised Sorensen (nineteenth-order) model are as follows:
Step 1: Insulin action: The insulin compartment consists of organs such as the brain, heart/lungs, liver, kidney, gut, and periphery. The equations used to represent the insulin submodel are given below, arranged in the order of the organs [23]: brain, heart/lungs, gut, liver, kidney, and periphery.

$$\frac{dI_B}{dt} = \left(I_H - I_B\right)\frac{Q_B^I}{V_B^I} \tag{16}$$

$$\frac{dI_H}{dt} = \frac{Q_B^I I_B + Q_L^I I_L + Q_K^I I_K + Q_P^I I_{PV} - Q_H^I I_H}{V_H^I} \tag{17}$$

$$\frac{dI_J}{dt} = \left(I_H - I_J\right)\frac{Q_J^I}{V_J^I} \tag{18}$$

$$\frac{dI_L}{dt} = \frac{Q_A^I I_H + Q_J^I I_J - Q_L^I I_L + r_{PIR} - r_{LIC}}{V_L^I} \tag{19}$$

$$\frac{dI_K}{dt} = \frac{Q_K^I \left(I_H - I_K\right) - r_{KIC}}{V_K^I} \tag{20}$$

$$\frac{dI_{PV}}{dt} = \frac{Q_P^I \left(I_H - I_{PV}\right) - \dfrac{V_{PI}}{T_P^I}\left(I_{PV} - I_{PI}\right)}{V_{PV}^I} \tag{21}$$

$$\frac{dI_{PI}}{dt} = \frac{\dfrac{V_{PI}}{T_P^I}\left(I_{PV} - I_{PI}\right) - r_{PIC}}{V_{PI}} \tag{22}$$
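The insulin submodel of Eqs. (16)–(22) can be sketched as a single right-hand-side function in Python. The volumes, flow rates, and $T_P^I$ below are the values with superscript $I$ tabulated in this paper; the gut compartment is keyed as `"G"` to match the tabulated symbols. The metabolic rates `r_pir`, `r_lic`, `r_kic`, and `r_pic` are not specified numerically in the text, so they are left as function arguments with a placeholder default of zero. This is an illustrative sketch, not the authors' Simulink implementation.

```python
# Tabulated insulin-system parameters (volumes in l, flows in l/min, time in min).
V = {"B": 0.26, "H": 0.99, "G": 0.94, "L": 1.14, "K": 0.51, "PV": 0.74, "PI": 6.74}
Q = {"B": 0.45, "H": 3.12, "A": 0.18, "K": 0.72, "P": 1.05, "G": 0.72, "L": 0.90}
T_PI = 20.0


def insulin_rhs(conc, r_pir=0.0, r_lic=0.0, r_kic=0.0, r_pic=0.0):
    """Time derivatives of the insulin concentrations, Eqs. (16)-(22).

    conc maps compartment keys (B, H, G, L, K, PV, PI) to concentrations.
    The r_* arguments are metabolic source/sink rates (placeholders here).
    """
    d = {}
    d["B"] = Q["B"] * (conc["H"] - conc["B"]) / V["B"]                      # Eq. (16)
    d["H"] = (Q["B"] * conc["B"] + Q["L"] * conc["L"] + Q["K"] * conc["K"]
              + Q["P"] * conc["PV"] - Q["H"] * conc["H"]) / V["H"]          # Eq. (17)
    d["G"] = Q["G"] * (conc["H"] - conc["G"]) / V["G"]                      # Eq. (18)
    d["L"] = (Q["A"] * conc["H"] + Q["G"] * conc["G"] - Q["L"] * conc["L"]
              + r_pir - r_lic) / V["L"]                                     # Eq. (19)
    d["K"] = (Q["K"] * (conc["H"] - conc["K"]) - r_kic) / V["K"]            # Eq. (20)
    exchange = V["PI"] / T_PI * (conc["PV"] - conc["PI"])
    d["PV"] = (Q["P"] * (conc["H"] - conc["PV"]) - exchange) / V["PV"]      # Eq. (21)
    d["PI"] = (exchange - r_pic) / V["PI"]                                  # Eq. (22)
    return d


# Sanity check: a uniform concentration with no sources or sinks is an equilibrium.
uniform = {key: 15.0 for key in V}
derivs = insulin_rhs(uniform)
```

Because the tabulated flows balance (the heart/lungs flow equals the sum of the brain, liver, kidney, and periphery flows, and the liver flow equals the arterial plus gut flows), a uniform concentration with zero metabolic rates yields zero derivatives in every compartment.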

All the above dynamic equations are used in the study and are coded in MATLAB Simulink for the design of the insulin subsystem.
Step 2: Glucose action: The glucose compartment consists of organs such as the brain, heart/lungs, liver, kidney, gut, and periphery. The equations used to represent the glucose submodel are as follows.

$$\frac{dG_{BV}}{dt} = \frac{\left(G_H - G_{BV}\right) Q_B^G - \dfrac{V_{BI}}{T_B}\left(G_{BV} - G_{BI}\right)}{V_{BV}^G} \tag{23}$$

$$\frac{dG_{BI}}{dt} = \frac{\dfrac{V_{BI}}{T_B}\left(G_{BV} - G_{BI}\right) - r_{BGU}}{V_{BI}} \tag{24}$$

$$\frac{dG_H}{dt} = \frac{Q_B^G G_{BV} + Q_L^G G_L + Q_K^G G_K + Q_P^G G_{PV} - Q_H^G G_H - r_{RBCU}}{V_H^G} \tag{25}$$

$$\frac{dG_J}{dt} = \left(G_H - G_J\right)\frac{Q_J^G}{V_J^G} \tag{26}$$

$$\frac{dG_L}{dt} = \frac{Q_A^G G_H + Q_J^G G_J - Q_L^G G_L + r_{HGP} - r_{HGU}}{V_L^G} \tag{27}$$

$$\frac{dG_K}{dt} = \frac{Q_K^G \left(G_H - G_K\right) - r_{KGE}}{V_K^G} \tag{28}$$

$$\frac{dG_{PV}}{dt} = \frac{\left(G_H - G_{PV}\right) Q_P^G - \dfrac{V_{PI}}{T_P^G}\left(G_{PV} - G_{PI}\right)}{V_{PV}^G} \tag{29}$$

$$\frac{dG_{PI}}{dt} = \frac{G_{PV} - G_{PI}}{T_P^G} - \frac{r_{PGU}}{V_{PI}} \tag{30}$$

The values of all the parameters are tabulated, and they can be taken from the reference paper cited below.
Step 3: Glucagon system: In addition to the insulin and glucose systems, the glucagon system is added as a third compartmental submodel in the design of the Sorensen model. Its dynamic equation is given below:

$$\frac{d\Gamma}{dt} = \frac{r_{P\Gamma R} - r_{P\Gamma C}}{V^{\Gamma}} \tag{31}$$


The revised model adds compartments representing the stomach, the jejunum, and the ileum, together with a delay compartment between the jejunum and the ileum; these form the part of the submodel that is related to the absorption rate of glucose. The equations inherited from that model are given below.

$$\frac{dS}{dt} = -K_{js}\, S; \qquad S(0) = D \tag{32}$$

$$\frac{dJ}{dt} = K_{js}\, S - K_{gj}\, J - K_{rj}\, J; \qquad J(0) = 0 \tag{33}$$

$$\frac{dR}{dt} = -K_{lr}\, R + K_{rj}\, J; \qquad R(0) = 0 \tag{34}$$

$$\frac{dL}{dt} = K_{lr}\, R - K_{gl}\, L; \qquad L(0) = 0 \tag{35}$$

$$r_{oga} = f\left(K_{gj}\, J + K_{gl}\, L\right); \qquad r_{oga}(0) = 0 \tag{36}$$

where $D$ is the orally administered quantity of glucose; $S$, $J$, and $L$ represent the amounts of glucose present in the stomach, jejunum, and ileum, respectively; and $R$ represents the delay compartment. In other words, $r_{oga}$ is the absorption rate of glucose by the gut system, which links the new compartment model to the Sorensen model through the equation

$$\frac{dG_J}{dt} = \frac{Q_J^G G_H - Q_J^G G_J + r_{oga} - r_{GGU}}{V_J^G} \tag{37}$$
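The gastrointestinal absorption chain of Eqs. (32)–(36) can be sketched in Python as a forward-Euler simulation. The sketch assumes that the ileum compartment gains glucose from the delay compartment at rate $K_{lr} R$ and loses it by absorption at rate $K_{gl} L$, so that mass is conserved along the chain. All rate constants and the dose in the example call are hypothetical placeholders; the paper does not list them.

```python
def simulate_gi_tract(dose, k_js, k_gj, k_rj, k_lr, k_gl, f=1.0, dt=0.01, t_end=300.0):
    """Euler integration of the stomach (S), jejunum (J), delay (R), ileum (L) chain.

    Returns the final amounts in each compartment and the cumulative glucose
    absorbed (the time integral of r_oga).
    """
    s, j, r, l = dose, 0.0, 0.0, 0.0
    absorbed = 0.0
    for _ in range(int(round(t_end / dt))):
        ds = -k_js * s                          # Eq. (32)
        dj = k_js * s - k_gj * j - k_rj * j     # Eq. (33)
        dr = k_rj * j - k_lr * r                # Eq. (34)
        dl = k_lr * r - k_gl * l                # Eq. (35), assumed mass-conserving form
        r_oga = f * (k_gj * j + k_gl * l)       # Eq. (36)
        s += dt * ds
        j += dt * dj
        r += dt * dr
        l += dt * dl
        absorbed += dt * r_oga
    return s, j, r, l, absorbed


# Hypothetical example: 75 g oral glucose dose, illustrative rate constants (1/min).
s, j, r, l, absorbed = simulate_gi_tract(75000.0, 0.05, 0.03, 0.02, 0.04, 0.03)
print(f"absorbed after 300 min: {absorbed / 1000.0:.1f} g")
```

With `f = 1.0`, the glucose remaining in the four compartments plus the absorbed amount equals the administered dose at every step, which provides a simple check that the compartment equations are wired consistently.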

Most of the parameters are related to the metabolism rate of insulin and describe the pancreatic release of insulin. The ability of the revised Sorensen model can be tested using an oral glucose tolerance test (OGTT), which is included in the future scope of the work.
Step 4: Simulation method: All the dynamic equations of the glucose, insulin, and glucagon models used for the mathematical representation are implemented in MATLAB Simulink [18]. The parameters used for the simulation of the insulin, glucose, and glucagon models are listed in Tables 1 and 2. The metabolic source and sink components of insulin [24, 25], glucose [5, 26], and glucagon are given in the reference papers cited below.
Step 5: Linearization of the revised Sorensen model
The dynamic simulation of the glucose–insulin interaction was conducted by solving Eqs. (16)–(37) in MATLAB. For this purpose, the equations are converted into linearized form, and the Laplace transform is applied to obtain the transfer functions.

Table 1 Estimated parameters of the insulin system

Volumes             Flow rates              Other parameters
V_B^I = 0.26 l      Q_B^I = 0.45 l/min      T_P^I = 20 min
V_H^I = 0.99 l      Q_H^I = 3.12 l/min      β_PIR1 = 3.27
V_G^I = 0.94 l      Q_A^I = 0.18 l/min      β_PIR2 = 132 mg/dl
V_L^I = 1.14 l      Q_K^I = 0.72 l/min      β_PIR3 = 5.93
V_K^I = 0.51 l      Q_P^I = 1.05 l/min      β_PIR4 = 3.02
V_PV^I = 0.74 l     Q_G^I = 0.72 l/min      β_PIR5 = 1.11
V_PI^I = 6.74 l     Q_L^I = 0.90 l/min      M1 = 0.00747 min⁻¹
M2 = 0.0958 min⁻¹   Q_0 = 6.33 U            α = 0.0482 min⁻¹
β = 0.931 min⁻¹     K = 0.575 U/min

Table 2 Estimated parameters of the glucose system

Volumes             Flow rates              Other parameters
V_BV^G = 3.5 dl     Q_B^G = 5.9 dl/min      T_B = 2.1 min
V_BI = 4.5 dl       Q_H^G = 43.7 dl/min     T_P^G = 5.0 min
V_H^G = 13.8 dl     Q_A^G = 2.5 dl/min      V^Γ = 11,310 ml
V_L^G = 25.1 dl     Q_L^G = 12.6 dl/min
V_G^G = 11.2 dl     Q_G^G = 10.1 dl/min
V_K^G = 6.6 dl      Q_K^G = 10.1 dl/min
V_PV^G = 10.4 dl    Q_P^G = 15.1 dl/min

$$\frac{dX}{dt} = AX + BU \tag{38}$$

$$Y = CX + DU \tag{39}$$

where $X$ is the state vector, $U$ is the input vector (insulin and meal intake), $A$ is the process matrix, $B$ is the input matrix, $C$ is the output matrix, and $D$ is the feedthrough matrix. The revised Sorensen model, expressed in the state-space form of Eqs. (38) and (39), is as follows:


Step 6: Linearization results ⎡

0.51 ⎢ 0.01 ⎢ ⎢ ⎢ 0.20 ⎢ ⎢ 0 ⎢ ⎢ 0 ⎢ ⎢ 0 ⎢ ⎢ Am = ⎢ 0 ⎢ ⎢ 0 ⎢ ⎢ 0 ⎢ ⎢ ⎢ 0 ⎢ ⎢ 0 ⎢ ⎣ 0 0

0 0.5 0 0 0 0 0 0 0 0 0 0 0

0.20 0.01 0 0.5 0 0.66 0 0 0 0 0 0 0



0 0 0 0 0 0 0 0 0 0 0 0 0.22 0 0 0 0 0.30 0 0.60 0 0.2 0 0 0 0 0 0 0.5 0 0 0.001 0 0 0.08 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3.2 0 0 0 0 0 0 0 0 16 0 52 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 10.5 0 0 0 0 0 0 0 2.5 0.02 0 0 0 0 0 0 0 2.1 0 0 0 0 0 0 0 3 3.6

0 ⎢ 0 ⎢ ⎢ 0 ⎢ ⎢ 0 ⎢ ⎢ ⎢ 0 ⎢ ⎢ 0 ⎢ ⎢ 0 ⎢ Bm = ⎢ ⎢ 12 ⎢ 0 ⎢ ⎢ 1.4 ⎢ ⎢ ⎢ 0 ⎢ ⎢ 0 ⎢ ⎢ 0 ⎢ ⎣ 0 1

⎤ ⎡ 2.1 100000 ⎥ ⎢ 3 ⎥ ⎢0 1 0 0 0 0 ⎢0 0 1 0 0 0 5 ⎥ ⎥ ⎢ ⎥ ⎢0 0 0 1 0 0 0 ⎥ ⎢ ⎥ ⎢ 0 ⎥ ⎢0 0 0 0 1 0 ⎥ ⎢ ⎥ ⎢0 0 0 0 0 1 0 ⎥ ⎢ ⎥ ⎢0 0 0 0 0 0 0 ⎥ ⎢ ⎥ 0 ⎥ Cm = ⎢ ⎢0 0 0 0 0 0 ⎢0 0 0 0 0 0 0 ⎥ ⎥ ⎢ ⎥ ⎢0 0 0 0 0 0 0 ⎥ ⎢ ⎥ ⎢ 0 ⎥ ⎢0 0 0 0 0 0 ⎥ ⎢ ⎢0 0 0 0 0 0 4 ⎥ ⎥ ⎢ ⎢0 0 0 0 0 0 0 ⎥ ⎥ ⎢ ⎣0 0 0 0 0 0 5 ⎦ 1 000000

0 0 0 0 0 0 1 0 0 0 0 0 0 0 0

⎤ 0 0 0 00 0 0 0 0 0 0 00 0 0 0 ⎥ ⎥ ⎥ 0 0 0 00 0 0 0 ⎥ ⎥ 0 0 0 00 0 0 0 ⎥ ⎥ 0 0 0 00 0 0 0 ⎥ ⎥ 0 0 0 00 0 0 0 ⎥ ⎥ ⎥ 0 0 0 00 0 0 0 ⎥ ⎥ 0 0.21 0 0 0 0.5 0 16 ⎥ ⎥ 0 0 0 00 0 8 0 ⎥ ⎥ ⎥ 70 0 10.2 0 9 01.1 0 9 ⎥ ⎥ 0 0 0 0 8 0 3.2 0 ⎥ ⎥ 1.2 1.1 0 0 0 00.2 0 12 ⎦ 0 0 2.3 0 0 0 11 0.02

0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

⎤ 00000 0 0 0 0 0⎥ ⎥ 0 0 0 0 0⎥ ⎥ 0 0 0 0 0⎥ ⎥ ⎥ 0 0 0 0 0⎥ ⎥ 0 0 0 0 0⎥ ⎥ 0 0 0 0 0⎥ ⎥ 0 0 0 0 0⎥ ⎥ 0 0 0 0 0⎥ ⎥ 0 0 0 0 0⎥ ⎥ ⎥ 1 0 0 0 0⎥ ⎥ 0 1 0 0 0⎥ ⎥ 0 0 1 0 0⎥ ⎥ 0 0 0 1 0⎦ 00001

Step 7: Results and discussion from the revised Sorensen model (also called the nineteenth-order system)
The output response plots of glucose, insulin, and glucagon [27, 28] with the addition of the extra compartment are shown in Fig. 6.
Step 8: The general block diagrams of the insulin and glucose subsystems are shown in Figs. 7 and 8.
On comparing the Lehmann and the revised Sorensen models, the following points are to be noted:
• The Sorensen model appears to be the most complex and describes the physiological parameters and their values in detail, with 22 differential equations and 135 parameters.


Fig. 6 a Glucose absorption rate of brain, b glucose absorption rate by net hepatic system, c glucose absorption rate by gastric emptying system, d insulin absorption rate by gut system

Fig. 7 Schematic diagram of Sorensen model (insulin system)


Fig. 8 Schematic diagram of Sorensen model (glucose system)

• Among the models present in the literature, the Sorensen model is the one most frequently used for representing virtual patients.
• The limitation of this model lies mainly in deciding the values of its many parameters [29] (about 100 parameters).
• The mathematical description of the Sorensen model gives the complete values of several types of parameters that are not reported for other models; however, because of the large number of parameters, most researchers do not work with this model.
• From the user side, the Lehmann model [30] is the simplest and most widely used model, with few parameters, and hence it is easier to use for simulation purposes.
• The Sorensen model is known as a maximal model since it demonstrates the interactions in the metabolism of glucose (glucose, insulin, and glucagon) for both normal and diabetic patients.
The traditional model, when used for control, provides poor performance because of incorrect gain. In this work, more than one disturbance, namely physical activity and meal intake, was considered in designing both models. In future, disturbances such as stress will also be considered.
Experiment Validation
The major novelty of the work, which addresses the identified research gap, is that the datasets are generated from the developed patient model. The data are preprocessed, and machine learning techniques such as support vector machine, K-nearest neighbor, decision tree, logistic regression, and random forest are then applied to obtain the best accuracy. In this work, it was seen that the ensemble (AdaBoost) method achieves the best accuracy results

Table 3 Datasets are taken from ninth- and nineteenth-order systems (percentage of accuracy)

Types of classifiers      % of accuracy
Logistic regression       96
Naïve Bayes               94
KNN                       95.5
Ensemble (AdaBoost)       97
Wide neural network       93
Neural network            92

compared to the other machine learning algorithms. Table 3 shows the accuracy results of the machine learning methods on the developed datasets. The attributes considered in the designed human patient model are insulin and glucose with respect to time. In comparison with the existing state-of-the-art techniques [31–33], the datasets generated from the model play a major role in the design.
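To make the accuracy figures concrete, the following Python sketch shows how a percentage accuracy can be computed for one of the listed classifier types (K-nearest neighbor) on a held-out subset. The samples are synthetic and purely hypothetical: the glucose and insulin values, the labeling rule, and the train/test split are assumptions for illustration and are not the datasets or the procedure used in this work.

```python
import math


def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    nearest = sorted(train, key=lambda sample: math.dist(sample[0], query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def accuracy(train, test, k=3):
    """Percentage of test samples whose predicted label matches the true label."""
    correct = sum(knn_predict(train, features, k) == label for features, label in test)
    return 100.0 * correct / len(test)


# Synthetic samples: (glucose mg/dl, insulin mU/dl) -> 1 if glucose is above 120 mg/dl.
samples = [((float(g), 10.0 + 0.05 * g), 1 if g > 120 else 0) for g in range(60, 241, 5)]
train = samples[::2]    # every other sample used for training
test = samples[1::2]    # remaining samples held out for testing

acc = accuracy(train, test)
print(f"hold-out accuracy: {acc:.1f} %")
```

The same `accuracy` function could be applied to any classifier that exposes a prediction step; only the K-nearest-neighbor rule is sketched here to keep the example self-contained.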

4 Conclusion
The present work focuses on modeling the glucose–insulin interaction process. Human patient models based on the Lehmann (ninth-order) system and the revised Sorensen (nineteenth-order) system are developed; they are effective in regulating the blood glucose (BG) level under meal and exercise disturbances in the treatment of type 1 diabetes patients. The mathematical model of the nonlinear glucose–insulin interaction of the physiological process of diabetic patients was tested with external disturbances such as meals and exercise. The major highlight of this work compared with previous work is the calculation of the meal disturbance, which takes into account the amounts of both carbohydrates and proteins present in the food. Earlier methods of designing the


human patient model for diabetic patients mostly do not include glucose absorption in the gastrointestinal tract. In this work, the design is based on a reimplementation of the original Sorensen model that takes four additional parts into consideration: the stomach, the jejunum, and the ileum, together with a delay compartment. These form an additional compartmental model that accounts for glucose absorption in the gastrointestinal tract, which was not considered in other studies. Thus, the newly developed model is suitable for treating both type 1 and type 2 diabetic patients, and its results were tested. A limitation of this work is that other disturbances, such as stress, are not yet considered in the design of the diabetic patient model. Both the Lehmann and the revised Sorensen models are suitable for controller design even in the presence of external disturbances. Finally, the model is validated, and the accuracy results are tested using machine learning algorithms [34].

References
1. Panunzi S, Pompa M, Borri A, Piemonte V, De Gaetano A (2021) A revised Sorensen model: simulating glycemic and insulinemic response to oral and intra-venous glucose load. PLoS ONE 15(8):e0237215
2. Owens C, Zisser H, Jovanovic L, Srinivasan B, Bonvin D, Doyle FJ III (2006) Run-to-run control of blood glucose concentrations for people with type 1 diabetes mellitus. Biomed Eng 53(12):996–1005. PMID: 16761826
3. Blechert J, Meule A, Busch NA, Ohla K (2014) Food-pics: an image database for experimental research on eating and appetite. Front Psychol 5:617
4. Pompa M, Panunzi S, Borri A, De Gaetano A (2021) A comparison among three maximal mathematical models of the glucose-insulin system. PLoS ONE 16(9):e0257789
5. Kovacs L, Benyo B, Bokor J, Benyó Z (2011) Induced L2-norm minimization of glucose-insulin system for type I diabetic patients. Comput Methods Programs Biomed 102:105–118. PMID: 20674065
6. Kovatchev B, Breton M, Dalla Man C, Cobelli C (2009) In silico preclinical trials: a proof of concept in closed-loop control of type 1 diabetes. J Diabetes Sci Technol 3:44–55
7. Dalla Man C, Micheletto F, Lv D, Breton M, Kovatchev B, Cobelli C (2014) The UVA/Padova type 1 diabetes simulator: new features. J Diabetes Sci Technol 8(26–34):21
8. Visentin R, Campos-Náñez E, Schiavon M, Lv D, Vettoretti M, Breton M et al (2018) The UVA/Padova type I diabetes simulator goes from single meal to single day. J Diabetes Sci Technol 12:273–281. PMID: 29451021
9. Hovorka R, Shojaee-Moradie F, Caroll P, Chassin L, Gowrie I, Jackson N et al (2002) Partitioning glucose distribution/transport, disposal, and endogenous production during IVGTT. Am J Physiol 282:992–1007
10. Guyton C, Hall E (2006) Textbook of medical physiology, 11th edn. Elsevier Saunders
11. Bergman RN, Ider YZ, Bowden CR, Cobelli C (1979) Quantitative estimation of insulin sensitivity. Am J Physiol 236:667–677. PMID: 443421
12. ADA (2010) Standards of medical care in diabetes. Diabetes Care 33:11–61
13. Martin-Timon I, Sevillano-Collantes C, Canizo Gomez FJ (2016) Update on the treatment of type 2 diabetes mellitus. World J Diabetes 7(7):354–395
14. Kaiser AB, Zhang N, Der Pluijm WV (2018) Global prevalence of type 2 diabetes over the next ten years (2018–2028). Diabetes 67
15. Cobelli C, Renard E, Kovatchev B (2011) Artificial pancreas: past, present, future. Diabetes 60:2672–282. PMID: 22025773


16. Doyle FJ III, Huyett LM, Lee JB, Zisser HC, Dassau E (2014) Closed-loop artificial pancreas systems: engineering the algorithms. Diabetes Care 37:1191–1197
17. Peyser T, Dassau E, Breton M, Skyler S (2014) The artificial pancreas: current status and prospects in the management of diabetes. Ann NY Acad Sci 1311:102–123. PMID: 24725149
18. Steil GM, Rebrin K (2005) Closed-loop insulin delivery—what lies between where we are and where we are going? Ashley Publ 2:353–362
19. Hovorka R, Canonico V, Chassin L, Haueter U, Massi-Benedetti M, Federici M et al (2004) Nonlinear model predictive control of glucose concentration in subjects with type 1 diabetes. Physiol Meas 25:905–920. PMID: 15382830
20. Sorensen JT (1978) A physiologic model of glucose metabolism in man and its use to design and improved insulin therapies for diabetes
21. Parker RS, Doyle FJ III, Peppas NA (1999) A model-based algorithm for blood glucose control in type I diabetic patients. Biomed Eng 46(2):148–157
22. Abate A, Tiwari A, Sastry S (2009) Box invariance in biologically-inspired dynamical systems. Automatica 45:1601–1610
23. Chee F, Fernando T (2007) Closed-loop control of blood glucose. Springer
24. Galwani S, Tiwari A (2008) Constraint-based approach for analysis of hybrid systems. In: Gupta A, Malik S (eds) Computer aided verification. CAV 2008. Lecture notes in computer science, vol 5123. Springer, Berlin, Heidelberg, pp 190–203
25. Campos-Delgado DU, Hernandez-Ordoñez M, Fermat R, Gordillo-Moscoso A (2006) Fuzzy based controller for glucose regulation in type-1 diabetic patients by subcutaneous route. Biomed Eng 53(11):2201–2210
26. Gillis R, Palerm CC, Zisser H, Jovanovič L, Seborg DE, Doyle FJ III (2007) Glucose estimation and prediction through meal responses using ambulatory subject data for advisory mode model predictive control. J Diabetes Sci Technol 1:825–833. PMID: 19885154
27. Cameron BD, Baba JS, Coté GL (2007) Measurement of the glucose transport time delay between the blood and aqueous humor of the eye for the eventual development of noninvasive glucose sensor. Diabetes Technol Ther 3(2):201–207
28. Markakis MG, Georgios DM, Papavassilopoulos GP, Marmarelis VZ (2008) Model predictive control of blood glucose in type 1 diabetes: the principal dynamic modes approach. In: 2008 30th annual international conference of the IEEE engineering in medicine and biology society, pp 5466–5469
29. Galvanin F, Barolo M, Macchietto S, Bezzo F (2009) Optimal design of clinical tests for the identification of physiological models of type 1 diabetes mellitus. Ind Eng Chem Res 48:1989–2002
30. Singh N, Singh P (2020) Stacking-based multi-objective evolutionary ensemble framework for prediction of diabetes mellitus. Biocybern Biomed Eng 40(1):1–22
31. Kumari S, Kumar D, Mittal M (2021) An ensemble approach for classification and prediction of diabetes mellitus using soft voting classifier. Int J Cogn Comput Eng 2
32. Islam MMF, Ferdousi R, Rahman S, Bushra HY (2020) Likelihood prediction of diabetes at early stage using data mining techniques. In: Computer vision and machine intelligence in medical image analysis. Springer, Singapore, pp 113–125
33. Malik S, Harous S, Sayed HE (2020) Comparative analysis of machine learning algorithms for early prediction of diabetes mellitus in women. In: Proceedings of the international symposium on modelling and implementation of complex systems, Oct 2020. Springer, Batna, Algeria, pp 95–106
34. Hussain A, Naaz S (2021) Prediction of diabetes mellitus: comparative study of various machine learning models. In: Proceeding of the international conference on innovative computing and communications, Jan 2021. Springer, Delhi, India, pp 103–115

End-to-End Multi-dialect Malayalam Speech Recognition Using Deep-CNN, LSTM-RNN, and Machine Learning Approaches

Rizwana Kallooravi Thandil, K. P. Mohamed Basheer, and V. K. Muneer

Abstract Research in Malayalam speech recognition is constrained by the scarcity of speech data. Accent variation poses one of the greatest challenges for automatic speech recognition (ASR) in any language. Malayalam, spoken in the southernmost state of India, has a wide range of accents that reflect regional, cultural, and religious differences. Malayalam is a low-resource language with little published ASR work, which makes this study both significant and challenging. Most ASR experiments for Malayalam use traditional HMM methods, and no benchmark dataset of accented speech is available for research. The authors have therefore constructed an accent-based dataset for this experiment. The proposed methodology comprises three distinct stages: dataset preparation, feature engineering, and classification using machine learning and deep learning approaches. A hybrid approach is adopted for feature engineering: mel frequency cepstral coefficient (MFCC), short-time Fourier transform (STFT), and mel spectrogram techniques are used to extract features from the input accent-based speech signals so that the data are represented as well as possible. The features are then used to build machine learning models using multi-layer perceptron, decision tree, support vector machine, random forest, k-nearest neighbor, and stochastic gradient descent classifiers. In the deep learning approach, the feature set is first fed to an LSTM-RNN architecture to construct the accented ASR system. The next approach plots the spectrograms of the speech signals and thus represents the speech data as images. Features are then extracted from these spectrograms and fed into a deep convolutional network architecture to build a deep learning model. Finally, a hybrid ASR system is constructed from all the independent models. The results of the experiments are compared with one another to find the better approach for modeling the end-to-end accented ASR (AASR).

R. K. Thandil · K. P. Mohamed Basheer · V. K. Muneer
Sullamussalam Science College, Areekode, Kerala, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_3


R. K. Thandil et al.

Keywords Automatic speech recognition (ASR) · Accent-based speech recognition · Long short-term memory (LSTM) · Deep convolutional neural network (DCNN) · Speech feature extraction · Malayalam speech recognition

1 Introduction

An accent is a variation in pronunciation, and it poses a great challenge for building ASR systems. According to the Dravidian Encyclopedia, accent variation in Malayalam depends on geographic distribution, religion, community, and social class, and the language has 15 regional dialects. Differences between the utterances of different genders and age groups add to the complexity of developing ASR. In this experiment, we consider utterances from five districts of Kerala, a southern state of India where Malayalam is the local and official language; a different dialect is spoken in each of these districts. We construct an accent-independent ASR for Malayalam that is intended to work well across the five accents, so the model is trained on audio samples covering all of them. The authors focus on constructing a unified accented ASR for the Malayalam language. This work first contributes to bridging the large research gap by collecting, analyzing, and publishing a corpus of spontaneous Malayalam speech with accents from five different regions of Kerala. The authors first experimented with various machine learning algorithms for modeling the accented speech and then with LSTM-RNN and DCNN for building the multi-dialect ASR. Although the multi-layer perceptron performed well among the machine learning approaches, the LSTM-RNN and DCNN approaches outperformed all other methodologies adopted in this experiment. We also constructed a hybrid ASR system by combining the above-mentioned approaches. While the individual systems produce equivalent overall recognition scores, we demonstrate that the systems have complementary strengths. In this article, we provide an in-depth analysis of different strategies in the context of a low-resource, orthographically non-standardized, and morphologically rich language.

2 Related Work

The field of speech recognition in Malayalam is still in its infancy. The development of speech recognition and ASR technologies has advanced significantly in recent years. Speech recognition technology plays an important role in many applications today. There have been several works on ASR in a variety of languages. Improvements in feature extraction techniques and improved approaches to modeling speech data are the major contributions of this work.


Yang et al. [1] proposed a method that uses quantum convolutional neural networks (QCNNs) for decentralized feature extraction in federated learning. Zhu et al. [2] suggested feeding the noisy speech and its clean equivalent into the same feature encoder, with the clean speech being used as training data. Hamed et al. [3] worked on code-switched Egyptian Arabic-English ASR, using end-to-end models based on transformers and DNNs to construct the ASR systems. Hida et al. [4] proposed an approach to polyphone disambiguation and accent prediction that incorporates both implicit and explicit features from pre-trained language models (PLMs) into the morphological analysis; BERT and Flair embeddings are used to integrate the implicit features with the explicit features. Purwar et al. [5] developed an accent classification tool based on machine learning and deep learning. Dokuz and Tüfekci [6] proposed hybrid algorithms that use the gender and accent attributes of the voice database to select mini-batch samples when training deep learning architectures. According to their experimental results, using both attributes yields better speech recognition results than using either one separately. Rusnac and Grigore [7] studied a frequency-domain method of feature extraction based on covariance, which was superior to other time-domain methods. They also studied different convolutional neural network (CNN) architectures and concluded that a more sophisticated architecture does not necessarily produce better results. Świetlicka et al. [8] proposed a method for reducing the dimensionality of speech signal variables and demonstrated its suitability for both fluent and disturbed speech recognition. A model was developed that described the examined utterances through four main components. A new set of coordinates was calculated between the standard original variables and the components of the observation matrix, and these distances were used in the recognition. Bhaskar and Thasleema [9] rely on video footage that contains both speech content and facial expression information. In their study, audio and visual expressions of unimpaired Malayalam speech were collected, and the system processes the video with a deep learning approach based on convolutional neural networks and long short-term memory. Imaizumi et al. [10] conducted experiments to illustrate the interdependence of the Japanese dialects and proposed a multi-task learning framework for Japanese dialect identification and ASR. Radzikowski et al. [11] proposed a method that modifies the accent of a non-native speaker so that it closely resembles the accent of a native speaker. The authors experimented on spectrograms, the graphical representation of speech signals, and obtained better outcomes with an autoencoder based on CNN. Chen et al. [12] attempted to construct an accent-invariant ASR system using generative adversarial nets.


3 Proposed Methodology and Design

The main objective of this study is to develop a more effective method for creating ASR for Malayalam speech with multiple accents. To this end, we propose a word-based ASR for the Malayalam language that makes use of machine learning algorithms, deep learning algorithms, and a hybrid approach that combines the strengths of the preceding approaches. The dataset was created by the authors using crowdsourcing methods from individuals across five different districts. We discuss the outcomes of the experiments conducted on the accented speech data using different machine learning approaches, LSTM-RNN, DCNN, and the hybrid approach; the entire experimental process is described in detail in the following sections. We present a comparative analysis of different approaches for constructing accent-based ASR for the Malayalam language. Each of these approaches has individually yielded good outcomes for many low-resource languages, and the independent systems offer complementary strengths when combined. The performance of an accented ASR system depends heavily on the nature of the dataset, which varies with the language, so an approach suitable for one language may not suit another. Only by repeatedly experimenting with different approaches can the better approach for a particular language be identified.

This experiment used a sound dataset covering 20 classes. The spoken words were recorded in a natural recording setting using crowdsourcing techniques, and a corpus of 4000 samples was gathered from speakers of diverse ages, genders, and locations. Initially, the ASR system is built using multi-layer perceptron, decision tree, support vector machine, random forest, k-nearest neighbor, and stochastic gradient descent classifiers. Later, the LSTM-RNN and CNN algorithms are used to train the model, which is then mapped onto the word models. Finally, a hybrid ASR system is constructed from the models already built. The test set is split arbitrarily from the dataset, which is preprocessed before the features are extracted. The signals are vectorized using one-hot encoding and passed to the LSTM and CNN networks. During the training stage, these vectors are compared against the target classes and the weights are updated accordingly. The features of the test signals are then fed to the networks, and the results are predicted.
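The random split and one-hot encoding described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the one-hot encoding is applied to the 20 class labels, it borrows the 80/20 ratio reported later for the DCNN experiment, and the names `one_hot`, `train_test_split`, and `corpus` are hypothetical.

```python
import random

NUM_CLASSES = 20  # the experiment uses 20 word classes


def one_hot(label, num_classes=NUM_CLASSES):
    """Return a list of zeros with a single 1 at position `label`."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical corpus: 4000 (sample_id, class_label) pairs, 200 per class.
corpus = [(i, i % NUM_CLASSES) for i in range(4000)]
train, test = train_test_split(corpus)
train_targets = [one_hot(label) for _, label in train]
```

With 4000 samples this gives 3200 training and 800 test items; fixing the seed only makes the arbitrary split reproducible.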

3.1 The Proposed Methodology

The proposed methodology involves:
1. Construct the dataset.
2. Feature extraction.
3. Construct an accent-based ASR system using different machine learning approaches.
4. Construct accent-based ASR using LSTM-RNN.
5. Construct accent-based ASR using DCNN.
6. Construct a hybrid ASR system.
7. Model evaluation and prediction.
8. Comparative analysis.
9. Conclusion and future scope.

3.2 Dataset

We created a dataset for Malayalam with multiple utterances of the accented words. The dataset consists of samples from five districts in north Kerala: Kasaragod, Kannur, Kozhikode, Wayanad, and Malappuram. The dialects used in these areas differ greatly from the standard dialect. To represent the signals in the dataset, samples were gathered from all age groups and genders. The speech corpus is constructed from 30 distinct native speakers belonging to different age groups. Multiple utterances of every word in the dataset were recorded by different speakers (both male and female). As the first preprocessing step, every recording is sampled at 16,000 Hz and then converted and saved in the .wav format. Table 1 lists the number of audio samples collected from each district, and Table 2 contains the statistics of the speech recordings by age group. The dataset includes recordings from speech donors of all ages. To reflect accurate and high-quality data, we gathered much of the data from people between the ages of 20 and 45.

Table 1 Statistics of accented data collected from different districts

District      Size of audio samples
Kasaragod     760
Kannur        760
Kozhikode     1090
Malappuram    760
Wayanad       630
Total         4000

Table 2 Statistics of the dataset based on age groups

Age group     Size of audio samples
5–12          660
13–19         660
20–45         1500
46–65         690
66–85         490
Total         4000


Fig. 1 Speech feature extraction

3.3 Feature Extraction

Feature extraction is the most important phase in creating machine learning and deep learning models. Including only the important features in the experiment produces a more effectively trained model, whereas choosing insignificant features would significantly degrade the quality of the model. In this experiment, we extracted 180 features from each speech signal to better represent the prominent accented speech features. The features considered for the experiment are:
1. Mel frequency cepstral coefficients (MFCC)
2. Short-time Fourier transform (STFT) components
3. Mel spectrogram features.

We considered 40 MFCC features, 12 STFT components, and 128 mel spectrogram features, for a total of 180 speech features representing the accented data. The 40 MFCC features correspond to the prominent speech frequencies. The STFT components include the amplitude data of each speech frame. The mel spectrogram features closely correspond to the low-frequency features to which human ears are sensitive (Fig. 1).
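As a rough illustration of how the 180-dimensional feature vector is put together, the sketch below concatenates three blocks of the sizes given above. The extraction itself is not shown; the zero-valued blocks are placeholders, the helper names are hypothetical, and the Hz-to-mel formula is one commonly used definition of the mel scale, which the paper does not specify.

```python
import math

N_MFCC, N_STFT, N_MEL = 40, 12, 128  # block sizes reported in the text


def hz_to_mel(freq_hz):
    """Convert Hz to mel with a common formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def build_feature_vector(mfcc, stft, mel):
    """Concatenate the three per-utterance blocks into one vector."""
    if (len(mfcc), len(stft), len(mel)) != (N_MFCC, N_STFT, N_MEL):
        raise ValueError("unexpected block sizes")
    return list(mfcc) + list(stft) + list(mel)


# Placeholder values standing in for real extracted features.
features = build_feature_vector([0.0] * N_MFCC, [0.0] * N_STFT, [0.0] * N_MEL)
```

The mel mapping is roughly linear at low frequencies and compresses higher ones, which is why mel-based features emphasize the range to which hearing is most sensitive.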

3.4 Building the Accented ASR System

The accented ASR system is built using the following approaches:
1. Machine learning approach
2. Deep learning approach
3. Hybrid approach.

3.4.1 Machine Learning Approach

The features extracted from the accented speech data are used to build the ASR systems using the machine learning approaches. Here in this experiment, we used six machine learning approaches for constructing the model. The experiment using different machine learning approaches is discussed in detail below.


Fig. 2 Performance evaluation of various ML classifiers

Multi-layer perceptron is the simplest form of neural network for performing the classification of data. The ASR model is constructed with the hidden layer size set to 3000 and the maximum number of iterations set to 10,000. With this architecture, the model produced an accuracy of 94.82%.

Decision tree is a supervised learning method that can be applied to classification and regression problems, although it is most frequently used for classification. It is a tree-structured classifier in which internal nodes represent the features of the dataset, branches represent the classification rules, and each leaf node represents a result. The ASR system constructed using a decision tree classifier produced an accuracy of 55.67%.

Support vector machines (SVM) are used for classification in supervised learning problems. We adopted SVM in our experiment to classify the accented data, and the resulting ASR system had an accuracy of 66.15%.

A random forest classifier was used for modeling the accented speech, and the ASR system was built with an accuracy of 24.53%. After hyperparameter tuning, the accuracy of the model rose to 78.76%.

The k-nearest neighbor classifier was used for constructing the accented ASR with 63.73% accuracy. After hyperparameter tuning, an ASR with 81.69% accuracy was constructed.

The stochastic gradient descent (SGD) classifier was also used for constructing the ASR system. The experiment with SGD produced an accuracy of 19.86%, which hyperparameter tuning raised to 33.16%.

The performance of all the machine learning algorithms used in this experiment is summarized in Fig. 2.
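To make one of these classifiers concrete, here is a minimal k-nearest neighbor sketch in plain Python. It is illustrative only: the toy two-dimensional points stand in for the 180-dimensional speech features, the labels are invented, and the paper does not say which hyperparameters were tuned, so treating `k` as the tuned value is an assumption.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D data standing in for 180-dimensional speech features.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["word_a", "word_a", "word_a", "word_b", "word_b", "word_b"]

prediction = knn_predict(train_x, train_y, (0.15, 0.15), k=3)
```

In such a setup, hyperparameter tuning would amount to trying several values of `k` (and possibly other distance measures) on held-out data and keeping the best one.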

3.4.2 Deep Learning Approach

ASR with LSTM-RNN and ASR with DCNN are the two deep learning approaches used for the construction of the accented ASR in this experiment.


Fig. 3 Sample spectrogram used in the experiment

ASR with LSTM-RNN

The LSTM-RNN is used for constructing the accented ASR model because it can retain the information that will aid future prediction while discarding irrelevant information through the forget gate. Since speech is sequential data, the LSTM gates can be used to remember the order of occurrences, which makes LSTM well suited to processing speech signals. A cell in a typical LSTM is made up of an input gate, a forget gate, and an output gate. The three gates control how information flows through the cell, which can store information for any length of time. The algorithm for model building using LSTM-RNN is:
Step 1: Construct the dataset.
Step 2: Pre-process every audio signal to 16 kHz and save the signals in .wav format.
Step 3: Extract appropriate features from the dataset using MFCC, STFT, and mel spectrogram.
Step 4: Split the dataset into a train set and a test set.
Step 5: Feed these features into the LSTM-RNN architecture and construct the model.
Step 6: Make the predictions and evaluate the performance accordingly.

Model Building Using Deep CNN

In this work, the spectrograms of the speech signals are used as input to the model, and the model is constructed using image classification techniques. This takes advantage of CNN, which has already proved to be an efficient tool for image classification problems (Fig. 3). The speech features are plotted as spectrograms, and this set of spectrograms is the input for building the model. The algorithm for building the ASR using DCNN is:
Step 1: Construct the speech dataset.
Step 2: Construct the spectrogram dataset that corresponds to the accented speech.
Step 3: Initialize the model.
Step 4: Add the CNN layers.
Step 5: Add the dense layers.
Step 6: Configure the learning process.
Step 7: Train the model.
Step 8: Evaluate the model by making predictions.

The same speech dataset has been used for both studies; for this experiment, its spectrogram representation is used. The model is constructed as a sequential model with three kinds of CNN layers: the convolutional layer, the pooling layer, and the flattening layer. The input is resized to (224, 224, 3) and fed to the convolutional layer, which gives an activation size of (222, 222, 32). Max-pooling is then applied to downsample the feature maps to a width of 111, a height of 111, and a depth of 32. The result is fed to a convolutional layer with activation size (109, 109, 64), followed by max-pooling to (54, 54, 64) and a dropout layer that prevents overfitting. Next comes a convolutional layer of activation size (52, 52, 64), whose output is fed to a max-pooling layer of size (26, 26, 64) and then to another dropout layer to reduce overfitting. The output of this layer is fed to a further convolutional layer with activation size (24, 24, 128), followed by max-pooling to (12, 12, 128), and is then flattened to a one-dimensional array of size 18,432. This array is passed to a dense layer of 64 neurons and then to a dropout layer. The output of the dropout layer is finally fed into a dense layer of 20 neurons, corresponding to the 20 classes of data in our experiment. The model is trained for 4000 epochs, with a total of 53,000 training steps, on a total of 4000 spectrograms. Eighty percent of the input is used for training, and the remaining 20% is used for testing.
Figure 4 shows the visualization of the model built using CNN. The convolutional layers are represented in yellow, the max-pooling layer in red, the dropout layer in green, and the long blue layer represents the flattening to represent the features in a vector and the darker layer toward the output represents the dense layer of neurons.
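The activation sizes quoted for the DCNN can be checked with a little shape arithmetic. The sketch below assumes 3 × 3 unpadded, stride-1 convolutions and 2 × 2 max-pooling; these kernel sizes are not stated in the text but are the ones that reproduce the reported shapes. Dropout layers are omitted because they do not change the shape.

```python
def conv_valid(shape, filters, kernel=3):
    """Output shape of a stride-1, unpadded convolution."""
    h, w, _ = shape
    return (h - kernel + 1, w - kernel + 1, filters)


def max_pool(shape, size=2):
    """Output shape of non-overlapping max-pooling (floor division)."""
    h, w, c = shape
    return (h // size, w // size, c)


shape = (224, 224, 3)
trace = [shape]
for filters in (32, 64, 64, 128):  # filter counts taken from the text
    shape = conv_valid(shape, filters)
    trace.append(shape)
    shape = max_pool(shape)
    trace.append(shape)

flattened = shape[0] * shape[1] * shape[2]
```

Under these assumptions the final feature map is (12, 12, 128), and 12 × 12 × 128 = 18,432 matches the size of the flattened array given above.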

Hybrid Approach

In this approach, all the models that have already been constructed are combined to build a hybrid ASR system. The performance of this system represents the performance of all the ASR systems put together for performing the classification. Voting techniques are employed to combine the predictions of the models.
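The voting step can be pictured with a small sketch. The paper does not say which voting scheme is used, so this example assumes plain majority (hard) voting over the labels predicted by the individual models; the function name and the sample labels are hypothetical.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the most frequent label; ties go to the label listed first."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical outputs of five models for one utterance.
model_outputs = ["word_07", "word_07", "word_12", "word_07", "word_03"]
winner = hard_vote(model_outputs)
```

A soft-voting variant would average class probabilities instead of counting labels, which lets a confident model outweigh several uncertain ones.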


Fig. 4 Layered architecture of the DCNN model

4 Experimental Results

Eight models were constructed for this experiment: six using the machine learning approach and two using the deep learning approach. All the models were built using 4000 speech samples collected across five districts in Kerala. The dataset was recorded in a natural recording environment. The experiment used 1.08 h of speech data as input and took 15 h of training to build the model. The focus of this experiment was to construct an ASR for multi-accent speech in the Malayalam language. The machine learning model evaluation results are shown in Table 3.

Table 3 Performance evaluation of machine learning approaches

Model            Before tuning        After hyperparameter tuning
MLP classifier   94.82% (7 min 3 s)   97.50% (8 min 5 s)
Decision tree    55.61% (4 min)       56.13% (4 min 5 s)
SVM              66.15% (8.22 s)      53.54% (2 min 43 s)
Random forest    24.53% (465 ms)      78.76% (3 min 62 s)
KNN              63.73% (2.46 ms)     81.69% (37 min 38 s)
SGD              19.86% (822 ms)      33.16% (4 min)

The LSTM-RNN model produced a training accuracy of 95% over 98,000 steps. The training features, which are split randomly, are one-hot encoded and fed to the LSTM-RNN. The prominent 180 features of the speech data are extracted using the different methods and fed as input to the LSTM-RNN architecture. The maximum height of the utterances considered for the experiment is a thousand for all the speech samples. The resulting model classifies the test features into one of the twenty classes of data mentioned above. The accuracy of the model across the training steps is visualized in Fig. 5. The training loss was 2.5 at the beginning and fell to 0.24 by the end of training at step 98,000. The model reached a validation accuracy of 82% over 98,000 steps (Fig. 5). In Fig. 5, the faded line is the original classification curve, whereas the darker line is obtained with a smoothing value of 0.5. A further model was constructed using CNN with 4000 epochs on 4000 spectrograms, of which 3200 samples were used for training and the remaining 800 for testing. The model achieved 98% training accuracy and 71% test accuracy. Figure 6 visualizes the training and test accuracy and the training and test loss of the model constructed using CNN.
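The relation between the faded and the darker line can be illustrated with an exponential moving average. This is an assumption about what a "smoothing value of 0.5" means, since the paper does not name the plotting tool; the readings below are invented and are not results from the experiment.

```python
def smooth(values, weight=0.5):
    """Exponential moving average: s_t = weight * s_(t-1) + (1 - weight) * x_t."""
    smoothed = []
    last = values[0]  # start from the first reading
    for x in values:
        last = weight * last + (1.0 - weight) * x
        smoothed.append(last)
    return smoothed


# Hypothetical noisy accuracy readings, not values from the experiment.
raw = [0.20, 0.60, 0.40, 0.80, 0.70]
curve = smooth(raw)
```

A weight of 0 leaves the raw curve unchanged, while weights closer to 1 give a flatter line that reacts more slowly to new readings.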

Fig. 5 Performance evaluation of the ASR model constructed using LSTM-RNN

Fig. 6 Performance evaluation of the ASR model constructed using DCNN


5 Conclusion and Future Scope

The authors carefully experimented with extracting the influential and spectral features of the accented speech signals, covering the frequency, amplitude, pitch, and low-frequency values to which the human hearing system is sensitive. The features selected for the experiment also contained values representing the age and gender of the speaker. The feature set constructed here greatly influenced the building of a high-performance accented ASR system. The experiment was conducted using three different approaches, and all of them performed well on the selected dataset. A hybrid approach was employed to boost the performance of weak models such as SGD, and that attempt also worked well for the dataset. In this paper, we proposed three different approaches to accented ASR for the Malayalam language, made a comparative analysis, and identified the methodology that worked well for the dataset we constructed. The limited availability of data was a major challenge for this research, since ASR needs a large amount of data to yield better results. In the future, the authors plan to work on hybrid approaches for constructing an accented ASR system that also works for unknown accents, and on multi-class speech in a single audio wave.

References

1. Yang C-HH et al (2021) Decentralizing feature extraction with quantum convolutional neural network for automatic speech recognition. In: ICASSP 2021 - 2021 IEEE international conference on acoustics, speech, and signal processing (ICASSP), pp 6523–6527. https://doi.org/10.1109/ICASSP39728.2021.9413453
2. Zhu Q-S, Zhang J, Zhang Z-Q, Wu M-H, Fang X, Dai L-R (2022) A noise-robust self-supervised pre-training model based speech representation learning for automatic speech recognition. In: ICASSP 2022 - 2022 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 3174–3178. https://doi.org/10.1109/ICASSP43922.2022.9747379
3. Hamed I, Denisov P, Li C-Y, Elmahdy M, Abdennadher S, Vu NT (2022) Investigations on speech recognition systems for low-resource dialectal Arabic–English code-switching speech. Comput Speech Lang 72:101278. ISSN 0885-2308. https://doi.org/10.1016/j.csl.2021.101278
4. Hida R, Hamada M, Kamada C, Tsunoo E, Sekiya T, Kumakura T (2022) Polyphone disambiguation and accent prediction using pre-trained language models in Japanese TTS frontend. In: ICASSP 2022 - 2022 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 7132–7136. https://doi.org/10.1109/ICASSP43922.2022.9746212
5. Purwar A, Sharma H, Sharma Y, Gupta H, Kaur A (2022) Accent classification using machine learning and deep learning models. In: 2022 1st international conference on informatics (ICI), pp 13–18. https://doi.org/10.1109/ICI53355.2022.9786885
6. Dokuz Y, Tüfekci Z (2022) Feature-based hybrid strategies for gradient descent optimization in end-to-end speech recognition. Multimed Tools Appl 81:9969–9988. https://doi.org/10.1007/s11042-022-12304-5
7. Rusnac A-L, Grigore O (2022) CNN architectures and feature extraction methods for EEG imaginary speech recognition. Sensors 22(13):4679. https://doi.org/10.3390/s22134679
8. Świetlicka I, Kuniszyk-Jóźkowiak W, Świetlicki M (2022) Artificial neural networks combined with the principal component analysis for non-fluent speech recognition. Sensors 22(1):321. https://doi.org/10.3390/s22010321
9. Bhaskar S, Thasleema TM (2022) LSTM model for visual speech recognition through facial expressions. Multimed Tools Appl. https://doi.org/10.1007/s11042-022-12796-1
10. Imaizumi R, Masumura R, Shiota S, Kiya H. End-to-end Japanese multi-dialect speech recognition and dialect identification with multi-task learning. ISSN 2048-7703. https://doi.org/10.1561/116.00000045
11. Radzikowski K, Wang L, Yoshie O et al (2021) Accent modification for speech recognition of non-native speakers using neural style transfer. J Audio Speech Music Proc 2021:11
12. Chen Y-C, Yang Z, Yeh C-F, Jain M, Seltzer ML (2020) Aipnet: generative adversarial pretraining of accent-invariant networks for end-to-end speech recognition. In: ICASSP 2020 - 2020 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 6979–6983. https://doi.org/10.1109/ICASSP40776.2020.9053098

JSON Document Clustering Based on Structural Similarity and Semantic Fusion

D. Uma Priya and P. Santhi Thilagam

Abstract The emerging drift toward real-time applications generates massive amounts of JSON data exponentially over the web. Dealing with the heterogeneous structures of JSON document collections is challenging for efficient data management and knowledge discovery. Clustering JSON documents has become a significant issue in organizing large data collections. Existing research has focused on clustering JSON documents using structural or semantic similarity measures. However, differently annotated JSON structures are also related by the context of the JSON attributes. As a result, existing research work is unable to identify the context hidden in the schemas, emphasizing the importance of leveraging the syntactic, semantic, and contextual properties of heterogeneous JSON schemas. To address the specific research gap, this work proposes JSON Similarity (JSim), a novel approach for clustering JSON documents by combining the structural and semantic similarity scores of JSON schemas. In order to capture more semantics, the semantic fusion method is proposed, which correlates schemas using semantic as well as contextual similarity measures. The JSON documents are clustered based on the weighted similarity matrix. The results and findings show that the proposed approach outperforms the current approaches significantly.

Keywords JSON · Clustering · Structural similarity · Semantic similarity

D. Uma Priya · P. Santhi Thilagam
National Institute of Technology Karnataka, Mangalore, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_4

1 Introduction

In recent years, JavaScript Object Notation (JSON) has established itself as the primary standard for data interchange over the Web due to its varied applicability in numerous applications. NoSQL document stores manage self-describing JSON documents in the form of collections. The data first, schema later approach facilitates flexibility in storing the data without defining a schema. Thus, JSON documents do not conform to any fixed structure within the collection. As documents evolve, it becomes necessary to organize the JSON documents efficiently for data retrieval [1]. Clustering is the process of grouping similar documents into clusters based on similarity metrics. Although JSON is the most popular data representation and interchange format, research on JSON document clustering is sparse due to its complex hierarchical structure. The flexibility of the JSON data format allows documents to have varying schemas in a collection. Therefore, a few researchers have worked on grouping similar documents based on the structural similarity of JSON schemas [2–4]. However, JSON schemas vary not only in syntax but also in semantics. As a result, the need has emerged for clustering JSON documents based on both the structural and semantic properties of JSON schemas.

In recent years, a number of similarity metrics between a pair of words have been defined based on ontology-based measures (semantic similarity) and distributional measures (contextual similarity). The former use well-known knowledge bases such as WordNet [5], which provide a semantic network to identify the meaning of JSON attributes and compare them to find similar attributes. The latter study the co-occurrence of words to examine the similarity of documents. The major issue in using external knowledge bases is that they do not handle recent concepts [6] such as “bitcoin” and “selfie”. Due to the dynamic nature of JSON documents, semantics based on word co-occurrence helps to improve similarity. While existing works on eXtensible Markup Language (XML) and JSON data determine semantic similarity based on the meaning of the attribute names, semantics based on context are left unaddressed.
Deep neural network-based language models can be used to overcome the issues related to domain knowledge in JSON document clustering [7]. A deep language model learns to represent attributes in a vector space. These vectorized representations of attributes are good at capturing the syntactic and semantic information of schemas. Context-free embeddings or semantic representations of words using traditional knowledge bases produce a single vector for each attribute within a corpus. On the other hand, contextualized embeddings provide word vector representations that change dynamically depending on the context in which the words appear. We employ Bidirectional Encoder Representations from Transformers (BERT) [8] as the underlying language model to identify the contextual similarity of JSON schemas.

Most existing work uses only structural or semantic similarity metrics to cluster hierarchical data such as XML and JSON documents. To address this research gap, we propose an approach, namely JSON Similarity (JSim), to investigate the power of structural, semantic, and contextual relationships among schemas for better organization of JSON documents and to support efficient data retrieval. The significant contributions of this work include the following:

1. Exploiting the contextual similarity of JSON schemas by using syntactic and semantic properties of JSON schemas.
2. Determining the structural, semantic, and contextual similarity of JSON schemas and clustering JSON documents based on the weighted similarity matrix.
3. Evaluating the performance of JSim against the existing approaches.

The rest of the paper is organized as follows. Section 2 discusses the related works, and Sect. 3 presents the proposed approach in detail. Section 4 presents the experimental study and performance analysis. Section 5 presents the conclusions and future work.

2 Literature Survey

Over the last decade, several approaches for clustering hierarchical data, such as XML, have been developed. Clustering JSON data, on the other hand, has received less attention. This section discusses the related works on XML and JSON similarity approaches based on their use in various applications such as clustering and schema matching.

The majority of research on JSON data focuses on analyzing the structural similarities across JSON schemas by examining the attribute names and data types. Wang et al. [2] proposed the concept of a skeleton, which clusters similar schemas and provides a summarized representation of heterogeneous schemas called a "skeleton schema." Gallinucci et al. [9] developed the Build Schema Profile (BSP) approach using a decision tree that classifies the schema variants based on the attribute values. Blaselbauer and Josko [10] integrated linguistic, semantic, and instance-level methods to determine the degree of similarity between different schemas. Uma Priya and Santhi Thilagam [11] used the term frequency-inverse document frequency (TF-IDF) measure to identify similar schemas. To capture the contextual similarity of JSON schemas, the authors of [7] designed the SchemaEmbed model based on Word2Vec and a deep autoencoder.

For XML data, Wang and Koopman [12] identified similar journal articles based on the entities of the XML documents. Nevertheless, this approach uses conventional semantic models to capture the contextual information, which are limited by high dimensionality. Laddha et al. [13] modeled the text and structure of semi-structured documents in a joint vector space, which allowed them to capture the semantics of the documents. Costa and Ortale [14] recommended using topical similarity to cluster the documents rather than structural or content similarity. Wu et al.
[15] grouped semantically similar registers such as interviews, news, and novels by inferring their syntactic structures. Dongo et al. [16] presented an approach for semantic similarity of XML documents based on structure and content analysis. XML path-based algorithms [17–19] take into account the existence of patterns, and the similarity of patterns improves the clustering efficiency. Accottillam et al. [19] proposed the TreeXP framework, which identifies the most frequent XML paths (patterns) and clusters documents based on structural pattern similarity. Costa and Ortale [20] identified XML features by capturing the structure-constrained phrases as n-grams and grouped the documents based on their contextualized representations.

A considerable number of approaches for comparing hierarchical data formats such as XML have been proposed in the literature. Yet, the existing approaches fall short of identifying similarities in JSON documents. Moreover, the existing literature focuses on either structural or semantic measures alone to identify the similarity of schemas. Therefore, it is necessary to identify the underlying semantic relatedness in the schema by considering both semantic and contextual similarity.

3 JSim

Consider a JSON document collection G = {D_1, D_2, …, D_n} with D_i = {A_1, A_2, …, A_m}, where m denotes the number of attributes and n denotes the size of the collection. This work proposes JSim, an approach that assigns the documents of G to K clusters {C_1, C_2, …, C_K}, where C_i contains equivalent JSON documents based on structural and semantic relevance. This section presents the design of our approach to clustering JSON documents using a weighted similarity matrix. The matrix is constructed based on the structural, semantic, and contextual similarity of JSON schemas. This work aims to meet the following objectives:

1. To exploit the contextual relationship of JSON schemas and generate schema embeddings.
2. To cluster the JSON documents based on structural, semantic, and contextual similarities of JSON schemas.

JSim works in three phases: (i) schema extraction, (ii) similarity computation, and (iii) clustering. The overall workflow of the proposed approach is shown in Fig. 1.

3.1 Schema Extraction

In general, when the data does not contain explicit schema information, extracting a schema from the data can be seen as a reverse engineering step. Hence, as a first step, the JSON document must be parsed before further analysis. The most delicate aspects of JSON documents are nesting levels and arrays. To determine the depth of every field, the path to reach every attribute should be preserved. Hence, all the documents are flattened to preserve the structural information of JSON attributes. In the case of nested levels, the documents are flattened by concatenating their parents' attribute names. A source of flexibility in JSON documents is that the attributes are unordered. To preserve this feature, a schema is represented as a bag of root-to-leaf paths that also preserves the ancestor–descendant relationship. In other words, the schema of a JSON document is represented as a set of paths.

Fig. 1 Flow description of JSim
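The flattening step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_schema`, the `.` separator, and the `[]` marker for arrays are assumptions chosen for the example, and value types are not yet attached to the paths.

```python
import json


def extract_schema(value, prefix=""):
    """Return the set of root-to-leaf paths found in a parsed JSON value."""
    paths = set()
    if isinstance(value, dict):
        if not value and prefix:
            paths.add(prefix)  # an empty object is kept as a leaf
        for key, child in value.items():
            # Nested attributes are named by concatenating the parents' names.
            child_prefix = f"{prefix}.{key}" if prefix else key
            paths |= extract_schema(child, child_prefix)
    elif isinstance(value, list):
        # Assumption: all array elements share one "[]" segment, so the
        # number and order of elements do not create distinct paths.
        child_prefix = f"{prefix}[]"
        if not value:
            paths.add(child_prefix)
        for item in value:
            paths |= extract_schema(item, child_prefix)
    else:
        paths.add(prefix)  # scalar leaf: string, number, boolean, or null
    return paths


# Two hypothetical documents with the same structure but different key order.
doc_a = json.loads(
    '{"title": "JSim", "author": {"name": "A", "affil": {"city": "X"}},'
    ' "refs": [{"id": 1}, {"id": 2}]}'
)
doc_b = json.loads(
    '{"refs": [{"id": 9}], "author": {"affil": {"city": "Y"}, "name": "B"},'
    ' "title": "Other"}'
)
schema_a = extract_schema(doc_a)
schema_b = extract_schema(doc_b)
```

Because the result is a set, two documents whose attributes appear in a different order map to the same schema, which matches the unordered treatment of attributes described above.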

3.2 Similarity Computation

Considering the nature of JSON schemas, relying on a single similarity metric may not be sufficiently informative. Therefore, considering multiple similarity metrics with a threshold produces better clustering. The pseudocode of the proposed work is given in Algorithm 1. This section defines the similarity measures of JSON schemas as follows:

Structural Similarity: Structural similarity is a measure of how close two documents are with respect to their schemas. The prominent feature of a JSON document is its hierarchical structure. Two attribute names may match even though the attributes differ in data type. For instance, the attribute title can be present at different levels in different documents, which yields different structures. The structural matcher prunes these false positives by including the type of values in the path information. To compute the similarity between two attributes, say a_1 and a_2, the paths P_1 and P_2 for a_1 and a_2 are compared. This work uses the Jaccard coefficient to measure the similarity between schemas, where the paths act as features. This is perhaps the most important of the three similarity measures because its score is high only if the major part of the structure is the same. The Jaccard coefficient [21] for schemas s_i and s_j is calculated as

\[ \mathrm{StS}[i][j] = \frac{|\mathrm{knn}(s_i) \cap \mathrm{knn}(s_j)|}{|\mathrm{knn}(s_i) \cup \mathrm{knn}(s_j)|}, \tag{1} \]

where knn(s_i) represents the k-nearest neighbor of s_i.

Semantic Fusion: This section describes interdependent schema representations based on word embeddings and external knowledge sources. It is worth noting that when these interdependent representations of input texts are combined, we get low-dimensional, dense, and real-valued vectors.

Algorithm 1 JSON document clustering
Input: JSON data collection G = {D_1, D_2, …, D_n}
Output: C = {C_1, C_2, …, C_k}

Start
Initialize: Structural Similarity Matrix StS ∈ R^{n×n}, Contextual Similarity Matrix CoS ∈ R^{n×n}, Semantic Similarity Matrix SeS ∈ R^{n×n}, Semantic Fusion Matrix SFS ∈ R^{n×n}, Similarity Matrix SM ∈ R^{n×n}, dimension = 768
Extract the schemas S = {S_1, S_2, …, S_n} and attributes A = {A_1, A_2, …, A_m} from G
for each (S_i, S_j) ∈ S do
    StS[i][j] = Jaccard Coefficient (S_i, S_j)
end for
for each (S_i) ∈ S do
    construct trigrams and add to T
    bert_emb = BERT (T, dimension)
end for
for each (e_i, e_j) ∈ bert_emb do
    CoS[i][j] = cos (e_i, e_j)
end for
for each (A_i) ∈ A do
    find synsets (A_i)
end for
Construct SeS using WordNet database
SFS[i][j] = Max (CoS[i][j], SeS[i][j]) if CoS[i][j] > T, SeS[i][j] > T else 0
SM[i][j] = (0.5 * StS[i][j]) + (0.5 * SFS[i][j])
Let clusters C = {C_1, C_2, …, C_k}
// Clustering
Construct Degree Matrix D from SM
Construct Laplacian Matrix L = D − SM
Normalize L
Determine the top-k eigenvectors from L and construct U ∈ R^{n×k}
Consider each row of U as a vertex in R^k, then use the k-means algorithm to cluster them
End
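The structural step of Algorithm 1 can be illustrated with a short sketch. It assumes that each schema is given as a set of path strings, following the statement that the paths act as features; treating that set as the feature set of Eq. 1 is an assumption, and the function names and the three toy schemas are hypothetical.

```python
def jaccard(paths_a, paths_b):
    """Jaccard coefficient of two path sets: |A ∩ B| / |A ∪ B|."""
    union = paths_a | paths_b
    if not union:
        return 1.0  # assumption: two empty schemas count as identical
    return len(paths_a & paths_b) / len(union)


def structural_similarity_matrix(schemas):
    """Build the n x n matrix StS from a list of path sets."""
    n = len(schemas)
    return [[jaccard(schemas[i], schemas[j]) for j in range(n)] for i in range(n)]


# Hypothetical schemas: s1 and s2 share two of four distinct paths.
s1 = {"title", "author.name", "year"}
s2 = {"title", "author.name", "venue"}
s3 = {"isbn"}
sts = structural_similarity_matrix([s1, s2, s3])
```

Here `sts[0][1]` is 0.5 because two paths are shared out of four distinct ones, while `sts[0][2]` is 0 because the schemas have no path in common.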

Contextual Similarity: The proposed work uses an embeddings-based representation model, namely BERT, for identifying the contextual similarity of JSON schemas. The vector for each schema describes how it appears in context with other schemas. In general, the attributes that co-occur in schemas generate similar vectors. After mapping the schemas to the vector space, the contextual similarity CoS is computed using the cosine similarity measure. The process of generating schema embeddings starts with the feature-based approach of the BERT model. Initially, the schemas are tokenized by the WordPiece model to obtain the attributes (paths). JSON documents comprise both ordered and unordered attributes due to the presence of arrays. In order to support both ordered and unordered properties of JSON schemas, this work takes trigram attributes as input and feeds them to the BERT model. The text representation obtained for a schema with a set of n trigram attributes is n numeric vectors of length 768. Therefore, the output vectors, or schema embeddings E_s, of all the attributes in a schema are placed into a matrix of size n × 768. Given the schema embeddings E_s of G, the cosine similarity measure for any two schemas s_i, s_j ∈ S is formally defined as follows:

\[ \cos(s_i, s_j) = \frac{s_i^{T} s_j}{\lVert s_i \rVert \, \lVert s_j \rVert} \tag{2} \]

If cos(s_i, s_j) for any two schemas s_i and s_j is less than the threshold T, then they are dissimilar. In this way, we can deal with the situation where cos(s_i, s_j) is very small and hence should be assigned the value 0. The Contextual Similarity Matrix CoS ∈ R^{n×n} for JSON schemas S of size n is calculated as

\[ \mathrm{CoS}[i][j] = \cos(s_i, s_j) \quad \forall\, i, j \le n \tag{3} \]
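Equations 2 and 3, together with the thresholding rule, can be sketched in a few lines. The sketch assumes that each schema has already been reduced to a single vector (for example by averaging its trigram vectors; this pooling step is an assumption, since the text does not state it), and it uses three-dimensional toy vectors instead of 768-dimensional BERT outputs. The default threshold of 0.7 is the value reported in Sect. 4.1.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (Eq. 2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def contextual_similarity_matrix(embeddings, threshold=0.7):
    """Build CoS (Eq. 3), setting scores below the threshold to 0."""
    n = len(embeddings)
    cos_matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            score = cosine(embeddings[i], embeddings[j])
            cos_matrix[i][j] = score if score >= threshold else 0.0
    return cos_matrix


# Hypothetical schema vectors standing in for pooled BERT embeddings.
toy_embeddings = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cos_m = contextual_similarity_matrix(toy_embeddings)
```

The first two toy vectors have a cosine of about 0.707 and stay above the threshold, while the third is orthogonal to both and is therefore zeroed out.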

Semantic Similarity: In addition to the schema embeddings generated, we focus on the similarity of semantic information carried by the attributes. To exploit the semantic similarity of JSON schemas, we use a well-known knowledge base, WordNet, that determines the similarity between attributes by comparing their synonyms. The semantic similarity for any two schemas s_i, s_j ∈ S is formally measured using the Wu–Palmer similarity measure [22] as follows:

\[ \mathrm{SeS}[i][j] = \mathrm{wup\_similarity}(s_i, s_j) \tag{4} \]
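To make the Wu–Palmer measure concrete, the sketch below computes it on a tiny hand-made taxonomy rather than on WordNet. The taxonomy, the concept names, and the helper functions are all hypothetical; the formula used is the standard one, 2 · depth(lcs) / (depth(a) + depth(b)), with the root at depth 1. In practice one would more likely query WordNet itself, for example through the `wup_similarity` method that NLTK provides on synsets.

```python
# Hypothetical toy taxonomy: each concept maps to its parent (None = root).
PARENT = {
    "entity": None,
    "person": "entity",
    "author": "person",
    "writer": "person",
    "publication": "entity",
    "book": "publication",
    "journal": "publication",
}


def ancestors(concept):
    """Return the chain from the concept up to the root, concept first."""
    chain = [concept]
    while PARENT[chain[-1]] is not None:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(concept):
    """Depth of a concept, counting the root as depth 1."""
    return len(ancestors(concept))


def wup_similarity(a, b):
    """Wu-Palmer similarity based on the least common subsumer (lcs)."""
    ancestors_b = set(ancestors(b))
    # Walking upward from a, the first shared ancestor is the deepest one.
    lcs = next(c for c in ancestors(a) if c in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


close_pair = wup_similarity("author", "writer")  # lcs is "person"
far_pair = wup_similarity("author", "book")      # lcs is "entity"
```

With a threshold of 0.7, neither toy pair would count as similar, which shows how strongly the outcome depends on the depth of the taxonomy being used.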

If wup_similarity(s_i, s_j) for any two schemas s_i, s_j is less than the threshold T, then they are dissimilar. The semantic fusion matrix SFS ∈ R^{n×n} is computed as Max(CoS, SeS). Finally, the similarity between any two schemas s_i, s_j ∈ S is calculated by combining the structural similarity StS and the merged semantic similarity SFS using the following equation.

\[ \mathrm{Similarity}(s_i, s_j) = \mathrm{SM}[i][j] = (0.5 \cdot \mathrm{StS}[i][j]) + (0.5 \cdot \mathrm{SFS}[i][j]) \quad \forall\, i, j \le n \tag{5} \]

In Eq. 5, the weights determine the significance of the similarity measures. The structural similarity score alone receives half of the total weight because, if the structural similarity score is high, the two documents are very likely to be clustered together, so their overall score needs to be high as well. Since this may not be true for the other two measures, we have merged them into a single score that receives the same weight as the structural score.
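The fusion step and Eq. 5 can be sketched as follows. The sketch follows the prose reading, in which CoS and SeS are thresholded separately and the element-wise maximum is then taken; step 18 of Algorithm 1 could also be read as requiring both scores to exceed T, so this choice is an assumption. The function names and the 2 × 2 toy matrices are hypothetical.

```python
def apply_threshold(matrix, threshold):
    """Set every score below the threshold to 0."""
    return [[v if v >= threshold else 0.0 for v in row] for row in matrix]


def combined_similarity(sts, cos_m, ses, threshold=0.7, w_struct=0.5):
    """Return (SFS, SM): fused semantic scores and the weighted sum of Eq. 5."""
    cos_t = apply_threshold(cos_m, threshold)
    ses_t = apply_threshold(ses, threshold)
    n = len(sts)
    sfs = [[max(cos_t[i][j], ses_t[i][j]) for j in range(n)] for i in range(n)]
    sm = [
        [w_struct * sts[i][j] + (1 - w_struct) * sfs[i][j] for j in range(n)]
        for i in range(n)
    ]
    return sfs, sm


# Hypothetical scores for two schemas.
sts_toy = [[1.0, 0.5], [0.5, 1.0]]
cos_toy = [[1.0, 0.8], [0.8, 1.0]]
ses_toy = [[1.0, 0.6], [0.6, 1.0]]
sfs_toy, sm_toy = combined_similarity(sts_toy, cos_toy, ses_toy)
```

For the off-diagonal pair, the semantic score 0.6 falls below the threshold and is dropped, the contextual score 0.8 survives, and the combined similarity becomes 0.5 · 0.5 + 0.5 · 0.8 = 0.65.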

3.3 Clustering

Given a data collection G = {D_1, D_2, …, D_N}, i.e., N data points (vectors) each with n attributes, and the similarity matrix SM, the clustering algorithm divides G into K mutually exclusive clusters C = {C_1, C_2, …, C_K} such that ⋃_{i=1}^{K} C_i = G and C_i ∩ C_j = ∅ for 1 ≤ i ≠ j ≤ K. The similarity matrix SM determines the quality of clustering. In this work, similar schemas are clustered using the spectral clustering algorithm [23] with SM. The clustering problem can now be reformulated on an undirected similarity graph G = (V, E), where V = {v_1, v_2, …, v_n} represents the vertices and E represents the weighted edges between v_i and v_j. Each edge is weighted using s_ij ∈ SM. The degree matrix D is the diagonal matrix with d_1, d_2, …, d_n on its diagonal, where the degree d_i of a vertex v_i ∈ V is defined as

\[ d_i = \sum_{j=1}^{n} s_{ij}. \tag{6} \]

Table 1 Dataset features

Dataset   No. of documents   No. of schema variants   No. of attributes
DBLP      200,000            40                       76
SD        50,000

Given a similarity matrix SM, the normalized Laplacian matrix L ∈ R^{n×n} is defined as

\[ L = D^{-\frac{1}{2}} (D - \mathrm{SM}) D^{-\frac{1}{2}}. \tag{7} \]

Compute the first k eigenvectors u_1, u_2, …, u_k of L and let U ∈ R^{n×k} be the matrix that has these vectors as its columns. The ith row of U is denoted y_i ∈ R^k. The data points y_1, y_2, …, y_n in R^k are clustered using K-means clustering.
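The degree computation of Eq. 6 and the normalized Laplacian of Eq. 7 can be sketched in plain Python. The function name and the toy matrix are hypothetical, and the sketch stops before the eigen-decomposition and the K-means step, which would normally be delegated to a numerical library.

```python
import math


def normalized_laplacian(sm):
    """Return L = D^{-1/2} (D - SM) D^{-1/2} for a symmetric matrix SM."""
    n = len(sm)
    degrees = [sum(row) for row in sm]  # Eq. 6: d_i = sum_j s_ij
    # Assumption: an isolated vertex (degree 0) gets a scaling factor of 0.
    inv_sqrt = [1 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    laplacian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            unnormalized = (degrees[i] if i == j else 0.0) - sm[i][j]
            laplacian[i][j] = inv_sqrt[i] * unnormalized * inv_sqrt[j]
    return laplacian


# Hypothetical similarity matrix for three schemas.
sm_example = [
    [1.0, 0.65, 0.0],
    [0.65, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]
lap = normalized_laplacian(sm_example)
```

A quick sanity check on the result is that the vector with entries √d_i is mapped to zero by L, because every row of D − SM sums to zero.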

4 Experimental Evaluation

JSim has been evaluated on two datasets: DBLP [24] and a synthetic dataset (SD). DBLP contains 2 million XML documents scraped from DBLP and converted to JSON. In this paper, we have chosen 200,000 documents randomly from these 2 million documents. The synthetic dataset (SD) is also populated for publication scenarios with references from various publishers such as IEEE¹, ACM², and so on. Table 1 describes the characteristics of the datasets. Both datasets together have 13 classes, such as conferences, journals, books, and so on. Hence, the number of clusters is set to 13 for evaluating the existing and proposed approaches. To evaluate the performance of the proposed approach, internal and external cluster validity measures are used, namely the silhouette coefficient (SC), adjusted mutual information (AMI), normalized mutual information (NMI), and adjusted Rand index (ARI). The most common external cluster validity metrics, such as precision, recall, and F1-measure, depend on the alignment of cluster labels to ground truth labels. The NMI, AMI, and ARI scores are appropriate for this work because they are not influenced by the absolute label values [25].
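The claim that these scores do not depend on the absolute label values can be checked with a small sketch of the adjusted Rand index, written from its standard pair-counting definition. The function name and the toy labelings are hypothetical; library implementations would normally be used in an actual evaluation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # assumption: trivial input counts as perfect agreement
    cell_counts = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_cells - expected) / (max_index - expected)


truth = [0, 0, 1, 1, 2, 2]
renamed = [2, 2, 0, 0, 1, 1]   # same partition, different label names
coarser = [0, 0, 0, 1, 1, 1]   # a genuinely different partition
ari_renamed = adjusted_rand_index(truth, renamed)
ari_coarser = adjusted_rand_index(truth, coarser)
```

Renaming the cluster labels leaves the score at 1, whereas a different partition lowers it, which is why no alignment between cluster labels and ground truth labels is needed.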

4.1 Results

In order to show the effect of combining all similarities, the proposed approach is compared with structure-only [4], semantic-only [10], and contextual approaches [26–28]. Since the related work on JSON document clustering is sparse, state-of-the-art language models such as InferSent [26], Universal Sentence Encoder (USE) [27], and Embeddings for Language Models (ELMo) [28] are considered to evaluate the proposed approach.

¹ www.ieee.org.
² www.acm.org.

Bawakid [4] used a TF-IDF-based approach (frequency of attributes) to find the structural similarity of JSON schemas and performed clustering. It extracted the exact schema variants rather than clustering contextually similar schemas. It is evident from Fig. 2 and Table 2 that JSim achieves better performance than the TF-IDF-based approach on all the evaluation measures. The high SC value of JSim indicates that both structurally and semantically similar documents are captured and grouped in a cluster. The threshold T for JSim is set to 0.7 for deciding the semantic similarity because the external metrics do not yield promising results for other threshold values. JSONGlue [10] calculates the semantic similarity for JSON schema matching. In this work, JSONGlue has been extended such that the semantic similarity of JSON schemas with a threshold of 0.7 is considered for clustering the schemas. It is observed from Fig. 2 and Table 2 that the similarities are better captured by JSim, which is more effective in grouping the schemas. The results of JSim are better than those of USE, InferSent, and ELMo in every case we tried. We also evaluated the quality of clusters using external validity measures such as NMI, AMI, and ARI. It is observed from Table 2 that the proposed approach outperforms the existing approaches over the datasets.

Fig. 2 Clustering performance using silhouette coefficient score

Table 2 Clustering performance using external cluster validity metrics

Metrics   InferSent [26]   USE [27]   ELMo [28]   JSONGlue [10]   TF-IDF [4]   JSim
NMI       0.64             0.84       0.85        0.61            0.66         0.91
AMI       0.63             0.83       0.86        0.59            0.64         0.9
ARI       0.34             0.65       0.68        0.29            0.35         0.71

4.2 Discussion

The differences between the results of JSim and the existing approaches in Table 2 and Fig. 2 appear to come from their abilities to handle the different similarity types. Compared with existing language models, the use of trigram attributes as input to the BERT model in JSim enhances the performance of the embeddings in feature space. The trigram attributes capture subsets of attributes in unordered data, resulting in better contextualized embeddings. Hence, the overall score of JSim is improved compared to existing models. Results from Table 2 and Fig. 2 show that JSim contributes significantly toward finding both structurally and semantically relevant JSON documents compared to the existing approaches. The clustering performance of JSim demonstrates the power of semantic fusion on unordered JSON data. JSim therefore performs well on datasets with structural and semantic heterogeneity and can support tasks such as data integration and efficient data retrieval.

5 Conclusions

In this paper, we proposed an approach named JSim for clustering JSON documents based on the structural, semantic, and contextual similarity of JSON schemas. JSim captures the semantics in JSON schemas with the help of semantic fusion. The advantage of JSim lies in its ability to create structurally and semantically consistent clusters using the weighted similarity matrix. The results show that JSim outperforms each of the existing approaches. In the future, we plan to extend this study to frequently ordered attributes and compare their performance in a real-world scenario.

References

1. Bourhis P, Reutter JL, Suárez F, Vrgoč D (2017) JSON: data model, query languages and schema specification. In: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI symposium on principles of database systems, PODS'17. ACM, New York, NY, pp 123–135
2. Wang L, Zhang S, Shi J, Jiao L, Hassanzadeh O, Zou J, Wang C (2015) Schema management for document stores. Proc VLDB Endow 8(9):922–933
3. Gallinucci E, Golfarelli M, Rizzi S (2019) Approximate OLAP of document-oriented databases: a variety-aware approach. Inf Syst 85:114–130
4. Bawakid F (2019) A schema exploration approach for document-oriented data using unsupervised techniques. PhD thesis, University of Southampton
5. Miller GA (1998) WordNet: an electronic lexical database. MIT Press
6. Nguyen HT, Duong PH, Cambria E (2019) Learning short-text semantic similarity with word embeddings and external knowledge sources. Knowl Based Syst 182:104842
7. Uma Priya D, Santhi Thilagam P (2022) JSON document clustering based on schema embeddings. J Inf Sci 01655515221116522
8. Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
9. Gallinucci E, Golfarelli M, Rizzi S (2018) Schema profiling of document-oriented databases. Inf Syst 75:13–25
10. Blaselbauer VM, Josko JMB (2020) JSONGlue: a hybrid matcher for JSON schema matching. In: Proceedings of the Brazilian symposium on databases
11. Uma Priya D, Santhi Thilagam P (2022) ClustVariants: an approach for schema variants extraction from JSON document collections. In: 2022 IEEE IAS global conference on emerging technologies (GlobConET), pp 515–520
12. Wang S, Koopman R (2017) Clustering articles based on semantic similarity. Scientometrics 111(2):1017–1031
13. Laddha A, Joshi S, Shaikh S, Mehta S (2018) Joint distributed representation of text and structure of semi-structured documents. In: Proceedings of the 29th hypertext and social media, pp 25–32
14. Costa G, Ortale R (2019) Mining cluster patterns in XML corpora via latent topic models of content and structure. In: Yang Q, Zhou Z-H, Gong Z, Zhang M-L, Huang S-J (eds) Advances in knowledge discovery and data mining. Springer International Publishing, Cham, pp 237–248
15. Wu H, Liu Y, Wu Q (2020) Stylistic syntactic structure extraction and semantic clustering for different registers. In: 2020 international conference on Asian language processing (IALP). IEEE, pp 66–74
16. Dongo I, Ticona-Herrera R, Cardinale Y, Guzmán R (2020) Semantic similarity of XML documents based on structural and content analysis. In: Proceedings of the 2020 4th international symposium on computer science and intelligent control, ISCSIC 2020. Association for Computing Machinery, New York, NY
17. Piernik M, Brzezinski D, Morzy T (2016) Clustering XML documents by patterns. Knowl Inf Syst 46(1):185–212
18. Costa G, Ortale R (2017) XML clustering by structure-constrained phrases: a fully-automatic approach using contextualized N-grams. Int J Artif Intell Tools 26(01):1760002
19. Accottillam T, Remya KTV, Raju G (2021) TreeXP: an instantiation of xpattern framework. In: Data science and security. Springer, pp 61–69
20. Costa G, Ortale R (2018) Machine learning techniques for XML (co-)clustering by structure-constrained phrases. Inf Retr J 21(1):24–55
21. Hennig C, Hausdorf B (2006) Design of dissimilarity measures: a new dissimilarity between species distribution areas. In: Data science and classification. Springer, pp 29–37
22. Wu Z, Palmer M (1994) Verb semantics and lexical selection. arXiv preprint arXiv:cmp-lg/9406033
23. Von Luxburg U (2007) A tutorial on spectral clustering. Statist Comput 17(4):395–416
24. Chouder ML, Rizzi S, Chalal R (2017) JSON datasets for exploratory OLAP. https://doi.org/10.17632/ct8f9skv97.1. Accessed 21 Dec 2020
25. Vinh NX, Epps J, Bailey J (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 11:2837–2854
26. Conneau A, Kiela D, Schwenk H, Barrault L, Bordes A (2017) Supervised learning of universal sentence representations from natural language inference data. In: Proceedings of the 2017 conference on empirical methods in natural language processing, Copenhagen, Denmark, Sept 2017. Association for Computational Linguistics, pp 670–680
27. Cer D, Yang Y, Kong S, Hua N, Limtiaco N, St John R, Constant N, Guajardo-Céspedes M, Yuan S, Tar C et al (2018) Universal sentence encoder. arXiv preprint arXiv:1803.11175
28. Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365

Solar Power Forecasting to Solve the Duck Curve Problem

Menon Adarsh Sivadas, V. P. Gautum Subhash, Sansparsh Singh Bhadoria, and C. Vaithilingam

Abstract Solar power is widely regarded as the power of a green future. However, excessive generation of solar energy can cause damage to the existing power sources if it is not balanced properly. This is where the concept of the duck curve comes in: a load curve that shows how much power must be produced by non-solar sources when solar power generation is in full swing. In order to calculate the duck curve, it is necessary to calculate the solar power generated. There are several methods to achieve this, and this project focuses on the machine learning side of it. Using weather data with parameters like temperature, humidity and wind temperature of a region, the solar output is predicted. This allows easier allocation of power demand to the non-solar sources. As expected, solar power peaks during the daytime, and this results in a sharp drop in demand for non-solar sources. This is followed by a steady increase as the sun sets. This load curve essentially shows the impact of solar power on the load demand and gives valuable information on how the other sources have to be adjusted for efficient power generation. Using the dataset obtained from NSRDB, we were able to predict the per-hour GHI of VIT Chennai and corroborate it with official sources. We were able to plot the duck curve of VIT and then use the model to observe the GHI of other locations in India.

Keywords Random forest · Solar energy prediction · Duck curve

M. A. Sivadas · V. P. Gautum Subhash · S. S. Bhadoria · C. Vaithilingam (B) School of Electrical Engineering, Vellore Institute of Technology, Chennai, India e-mail: [email protected] M. A. Sivadas e-mail: [email protected] V. P. Gautum Subhash e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_5


1 Introduction

Solar power is a vital asset for humanity in the years to come. However, it comes with its own problems. It is not always dependable, and we will still need access to non-solar power sources until our batteries and panels are highly efficient. Balancing the power generation from solar plants and traditional power plants poses a problem that has recently come to be known as the duck curve. Excess solar energy generation makes it challenging for utilities to balance supply and demand on the grid. There is a steep drop in demand from non-solar sources during mid-day, followed by an increased demand for electricity generators to quickly ramp up energy production when the sun sets. Another challenge with high solar adoption is the potential for PV to produce more energy than can be used at one time, called over-generation. This leads system operators to curtail PV generation, reducing its economic and environmental benefits. Thus, a prediction algorithm with high accuracy can help in creating an ideal generating environment. The aim of this study is to test different machine learning algorithms and find the most suitable one to predict the drop and subsequent rise in demand for energy from non-solar power sources.

2 Literature Review

Several works have already been done in this field. Most of them focus on finding a good algorithm or technique to predict solar energy with reasonable accuracy. Work by Inman et al. [1] highlights the different AI techniques currently used to predict solar power. Techniques like ADALINE and MADALINE are discussed, along with techniques that use camera footage and satellite footage. In [2], different machine learning models are tested on a dataset of the USA, obtained via NSRDB. The authors discuss the training time in detail and suggest models with lower training time to ensure faster setup. In [3], the data is from Australia. The variables considered include surface pressure, relative humidity, cloud cover, etc. The work uses SVR and suggests combining models to increase accuracy. In [4], the data is from Morocco and covers 2016 to 2018. Parameters like temperature, humidity and pressure are selected based on their Pearson correlation. Several models like RNN and LSTM are used, and their accuracy parameters are listed. In [5], the data consists of 15 min observations from rooftops in Denmark, and adaptive linear time series models are used to predict solar power. Clear sky modelling is also discussed in detail. In the work done by Wan et al. [6], the characteristics of solar forecasting are explained and different models are described along with their advantages and disadvantages. In [7], detailed work is presented on existing forecasting techniques and the optimisation of input and network parameters. Modelling approaches are discussed, and convolutional neural networks are regarded as the best. In [8], applicable and value-based metrics for solar forecasting in different scenarios are discussed. In [9], the authors stress the importance of forecasting and of varying the electricity prices using demand prediction models so that loads are balanced. In [10], the importance of predicting the duck curve is discussed. In the work by Prasad and Kay [11], the prediction of solar power using satellite data is discussed. However, satellite-based models have errors in calculating the intensity, contrast and movement of clouds, and these errors may reduce the accuracy of the prediction. In [12], solar generation is predicted using KNN, and a comparison is made with other models like MLP and neural networks. The paper [13] discusses the challenges of switching to renewable energy and suggests solutions to the possible destabilisation of the power grid. The duck curve for the whole of India is also plotted in that work. Finally, in [14], the authors use a neural network with backpropagation to predict data sourced from Singapore.

As is evident, the papers mentioned focus mostly on predicting the solar power generated. In this paper, we predict the solar power generated and, at the same time, forecast the duck curve in advance for our university, using real-time load data and predicted solar power data. This allows easier load scheduling in a solar-dependent power generation scenario.

3 Methodology

3.1 Duck Curve

The duck curve is the power demand on non-solar energy resources. When solar generation peaks at noon, consumers move away from non-solar options. This leads to a steep drop in demand followed by a sudden increase after evening. This demand, when plotted, looks like a duck, hence the name. The ramifications of this include problems in adjusting the power generation from traditional sources like hydropower. Figure 1 shows an example of high solar generation affecting the overall load demand over the span of a day, with peak impact at 12 p.m., when solar generation is near its maximum.
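The relationship described above can be expressed as a small sketch: the duck curve is the total demand minus the solar generation for each hour. All numbers below are hypothetical hourly values (in kW) made up for illustration, and the function names are not taken from the paper.

```python
def net_load(total_demand, solar_generation):
    """Demand left for non-solar sources in each hour (the duck curve)."""
    if len(total_demand) != len(solar_generation):
        raise ValueError("both series must cover the same hours")
    return [d - s for d, s in zip(total_demand, solar_generation)]


def steepest_ramp(net):
    """Return (hour index, increase) of the largest hour-to-hour rise."""
    ramps = [net[h + 1] - net[h] for h in range(len(net) - 1)]
    hour = max(range(len(ramps)), key=ramps.__getitem__)
    return hour, ramps[hour]


# Hypothetical values for hours 0..23 of one day, in kW.
demand = [50, 48, 47, 46, 47, 50, 56, 62, 66, 68, 70, 71,
          72, 71, 70, 70, 72, 78, 86, 90, 84, 74, 62, 54]
solar = [0, 0, 0, 0, 0, 0, 2, 8, 18, 28, 36, 41,
         43, 41, 35, 26, 15, 5, 0, 0, 0, 0, 0, 0]

duck = net_load(demand, solar)
belly_hour = duck.index(min(duck))        # hour with the lowest net load
ramp_hour, ramp_size = steepest_ramp(duck)
```

In this toy day, the net load bottoms out at noon and the steepest evening ramp occurs between 16:00 and 17:00, which is the part of the curve that non-solar generators have to follow.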

3.2 Data Acquisition and Inputs

The solar irradiance data (in W/m²), along with atmospheric parameters such as temperature, dew point, humidity and pressure, was obtained from the website of the National Solar Radiation Database (NSRDB). It provides a collection of hourly values of meteorological data and global horizontal solar radiation. NSRDB is a collaboration of US organisations. Figure 2 shows a sample of the data obtained from the website. The target variable is set to global horizontal irradiance (GHI), which is


M. A. Sivadas et al.

Fig. 1 Duck curve

the total amount of radiation received from the sun on a surface horizontal to the ground. GHI is expressed in W/m². After examining the correlation between GHI and the remaining variables in the Excel file, such as dew point, temperature and humidity, the variables that correlate well with GHI are used to train the machine learning algorithm. Pearson correlation is used to create the correlation matrix (Fig. 3). NSRDB collects several parameters for an area. Relative humidity is the percentage of moisture present in the atmosphere at a given temperature and pressure. It is an indicator of rainfall, as it increases immediately after rain. Direct normal irradiance (DNI) and diffuse horizontal irradiance are components of the target variable, GHI, and hence were excluded. Dew point is the temperature at which air reaches 100% relative humidity. Pressure is the force exerted by the atmosphere per unit area. Pressure, relative humidity, dew point and temperature affect cloud formation and are therefore important factors to consider when predicting solar generation. Other factors such as wind direction are important in coastal areas, as wind from the oceans may bring in rain. Solar zenith angle and year appear to have very little correlation with the target variable.
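The correlation-based input selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 0.3 cut-off and the sample values are hypothetical and are not taken from the paper or from NSRDB.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)

def select_features(columns, target, threshold=0.3):
    """Keep the columns whose |r| with the target reaches the threshold.

    The threshold is an assumed value; the paper does not state one.
    """
    scores = {name: pearson(values, target) for name, values in columns.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores

# Made-up hourly values, for illustration only (not NSRDB data).
ghi = [0.0, 120.0, 480.0, 810.0, 760.0, 300.0]
columns = {
    "temperature": [24.0, 26.0, 30.0, 34.0, 33.0, 29.0],
    "relative_humidity": [90.0, 82.0, 65.0, 50.0, 52.0, 70.0],
    "year": [2020.0] * 6,
}
kept, scores = select_features(columns, ghi)
```

With these sample values, temperature and relative humidity pass the filter (one strongly positive, one strongly negative), while the constant year column is dropped. The absolute value is used so that strongly negative correlations are retained as well.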

Fig. 2 Sample data


Fig. 3 Correlation between the parameters and GHI

Using the value of GHI and other parameters like solar panel efficiency, it is possible to calculate the amount of solar energy produced.

3.3 Flowchart

The flowchart starts with the solar data obtained from the NSRDB website. Next, we calculate the correlation of each variable with the target variable, GHI, and remove the variables that correlate poorly. The remaining data is used to train the machine learning algorithm, which then predicts the GHI for a specific location from the input variables. We then compare the predicted values with the actual power generated by the solar panels at that location to find the coefficient that relates solar power generation to GHI. With this coefficient, we can predict the generated power from the input variables. The predicted solar power generation is subtracted from the hourly load demand to give the predicted duck curve for a day (Fig. 4).

3.4 Machine Learning Algorithms Tested

• Gradient Boosting: Gradient boosting is one of the most powerful algorithms in machine learning. Errors in machine learning can be classified into bias errors and variance errors, and gradient boosting is a boosting algorithm used to minimise the bias error of the model. Gradient


Fig. 4 Flowchart

boosting algorithms can be used to predict not only continuous target variables (as a regressor) but also categorical target variables (as a classifier). Since this paper presents a regression case, the cost function used is the mean squared error.
• Random Forest: Random forest is a supervised machine learning algorithm widely used in classification and regression problems. It is an ensemble learning method: the model builds multiple decision trees and, for regression, averages their results.
• KNN: K-nearest neighbour is one of the simplest supervised machine learning algorithms. It measures the similarity between a new data point and the available data, and assigns the new point to the category of the most similar available points.
• Neural Network: A neural network is a machine learning algorithm that mimics the working of the human brain using a series of nodes and layers. Neural networks have a wide range of applications and are especially popular in fields such as image classification. The number of layers has to be chosen by the user, along with other settings such as weights and activation functions. A well-designed neural network can achieve higher accuracy than conventional ML models.

Table 1 Performance table

Algorithm                      Accuracy (%)   RMS error (W/m²)
K-nearest neighbour (n = 3)    81.161         222.2084
Gradient boosting              51.06          74.7
Random forest                  92.034         26.8843
Polynomial regression          −52.2          218
Logistic regression            −400           231

• Logistic Regression: A supervised learning technique used to predict categorical target variables. It is best suited to classification problems.
• Polynomial Regression: An algorithm that models the relationship between the input variables and the target variable as an nth-degree polynomial. It is used mostly for regression problems.
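As one illustration of the algorithms listed above, the following is a minimal sketch of k-nearest-neighbour regression with k = 3, the setting reported in Table 1. It assumes the common regression variant, in which the targets of the nearest neighbours are averaged; the paper does not say how its KNN model was configured beyond n = 3, and the function name and sample rows are hypothetical.

```python
import math

def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training rows nearest to the query."""
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)

# Made-up rows: (temperature in °C, relative humidity in %) -> GHI in W/m².
train_x = [(24.0, 90.0), (26.0, 82.0), (30.0, 65.0), (34.0, 50.0), (33.0, 52.0)]
train_y = [0.0, 120.0, 480.0, 810.0, 760.0]

prediction = knn_predict(train_x, train_y, query=(32.0, 55.0))
```

Here the three closest rows are the hot, dry ones, so the prediction is the mean of 760, 810 and 480 W/m². In practice the inputs would usually be scaled first, since a raw Euclidean distance lets the feature with the largest numeric range dominate.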

4 Results

4.1 Performance of Machine Learning Models

The neural network gave an RMSE of 402.0120 W/m². The results for the other models are given in Table 1.

4.2 Random Forest Performance Metrics

Random forest is a popular machine learning model used for regression and classification problems. It fits decision trees on different samples of the data and averages their outputs to obtain the predicted values. Across the datasets analysed, random forest performed best of all the models tested. Figure 5 illustrates the accuracy of the model. With an accuracy of 92.034%, the predicted values (red) and the actual values (blue) of GHI are almost identical. This degree of accuracy is very useful for estimating the output of the solar panels. The RMSE was measured to be 26.8843 W/m², much lower than that of the other models. Figure 6 plots the RMSE against the number of trees used in the random forest model. The number of trees was chosen by iterating over candidate values and plotting the corresponding RMSE.
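The two metrics reported above can be computed as in the sketch below. The RMSE follows its standard definition. The paper does not define its "accuracy" figure; the negative values in Table 1 would be consistent with a coefficient of determination (R²) expressed as a percentage, so that interpretation is used here purely as an assumption. The function names and sample values are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error, in the same units as the inputs."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r2_percent(actual, predicted):
    """Coefficient of determination as a percentage.

    Negative when the model is worse than always predicting the mean
    of the actual values. Assumed, not confirmed, to match the paper's
    accuracy column.
    """
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 100.0 * (1.0 - ss_res / ss_tot)

# Made-up GHI values in W/m².
actual = [0.0, 200.0, 600.0, 800.0]
good = [10.0, 190.0, 610.0, 790.0]   # close to the actual values
bad = [800.0, 600.0, 200.0, 0.0]     # anti-correlated predictions

good_rmse = rmse(actual, good)
good_score = r2_percent(actual, good)
bad_score = r2_percent(actual, bad)
```

The anti-correlated predictions give a negative score, which shows how a regression model can end up with a negative "accuracy" under this reading.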


Fig. 5 Solar prediction comparison with actual values plot

Fig. 6 RMS error for random forest

4.3 Calculation of Generated Power for VIT Chennai

Using the selected machine learning algorithm, the solar irradiance for a specific day is predicted. The coefficient relating irradiance to generated power is obtained by taking the ratio between these hourly predictions and the actual hourly solar generation data of VIT Chennai, and averaging the hourly ratios. This coefficient captures location-specific parameters such as panel tilt, panel efficiency and inverter efficiency, and allows us to calculate the generated power of VIT Chennai. Figure 7 shows the predicted solar irradiance (GHI) for a specific date; the actual solar generation for the same date (Fig. 8) is used for comparison. Multiplying the predicted irradiance by the coefficient then gives the generated power. Figure 9 shows the actual and predicted power for 9th March 2022, with a peak-to-peak error of 4 kW.
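The coefficient step can be sketched as follows. The text does not fix the direction of the ratio unambiguously; this sketch assumes the coefficient is the mean hourly ratio of actual power to predicted GHI, so that multiplying predicted GHI by the coefficient, as in the final step above, yields power. Function names and hourly values are hypothetical; only the per-date coefficients 190, 210 and 200 come from the figure captions.

```python
def fit_coefficient(predicted_ghi, actual_power):
    """Mean hourly ratio of actual power to predicted GHI.

    Hours with zero predicted GHI (night) are skipped to avoid
    division by zero.
    """
    ratios = [p / g for g, p in zip(predicted_ghi, actual_power) if g > 0]
    return sum(ratios) / len(ratios)

def predict_power(predicted_ghi, coefficient):
    """Scale predicted GHI to generated power with the fitted coefficient."""
    return [coefficient * g for g in predicted_ghi]

# Made-up hourly values; power is in whatever unit the generation data uses.
predicted_ghi = [0.0, 200.0, 600.0, 900.0, 500.0]   # W/m²
actual_power = [0.0, 38.0, 126.0, 180.0, 95.0]

coefficient = fit_coefficient(predicted_ghi, actual_power)
estimated = predict_power(predicted_ghi, coefficient)

# Averaging the per-date coefficients quoted in the captions of Figs. 9-11.
average_coefficient = sum([190, 210, 200]) / 3
```

Averaging over several dates, as the paper does, smooths out day-specific effects such as passing clouds or partial shading.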


Fig. 7 Solar data prediction for 3/9/2022

Fig. 8 Actual solar-generated power from VIT Chennai

Fig. 9 Generated power coefficient-190 (9/3/2022)


From the data for three dates, the average value of the coefficient was found to be 200 (Figs. 10 and 11).

Fig. 10 Generated power coefficient-210 (10/4/2022)

Fig. 11 Generated power coefficient-200 (31/3/2022)

Fig. 12 Duck curve formula: Duck curve = Load demand per hour − Predicted solar power

Fig. 13 Solar generation (19/04/2022)

4.4 Duck Curve for VIT Chennai

To find the duck curve for VIT Chennai, the predicted solar power is subtracted from the hourly load demand of VIT Chennai (Fig. 12). Figure 13 shows the hourly solar power generation at VIT Chennai on 19th April 2022, and Fig. 14 shows the predicted duck curve for the same day. The dip in the blue line shows the impact of solar power generation on the load demand, and this determines the load that has to be met by conventional power sources.
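The subtraction in Fig. 12 amounts to an element-wise difference of two hourly series. A minimal sketch, with hypothetical names and made-up values rather than VIT Chennai data:

```python
def duck_curve(load_demand, solar_power):
    """Hourly net load: demand left for conventional sources after solar."""
    if len(load_demand) != len(solar_power):
        raise ValueError("load and solar series must cover the same hours")
    return [load - solar for load, solar in zip(load_demand, solar_power)]

# Made-up values in kW for 06:00 to 18:00 in three-hour steps.
load_demand = [300.0, 420.0, 450.0, 440.0, 480.0]
solar_power = [0.0, 90.0, 180.0, 110.0, 0.0]

net_load = duck_curve(load_demand, solar_power)

# Largest increase between consecutive samples: the evening ramp that
# conventional generation has to follow.
steepest_ramp = max(b - a for a, b in zip(net_load, net_load[1:]))
```

In this example the net load dips at midday and then climbs by 150 kW in the last interval, which is the kind of ramp a load scheduler would need to plan for.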

4.5 Case Study

In each case study, the ML algorithm used is random forest, since it had the highest accuracy among all the tested algorithms. Six cities were chosen for their


Fig. 14 Duck curve and solar generation (19/04/2022)

uniqueness in geographical position and climate. The findings are described in detail below (Table 2).
• Delhi
Choosing Delhi allowed us to observe the effects of weather and geographical position on solar power production. It is located in northern India and, being the capital city, offered easy access to the other parameters needed to calculate duck curves. Figure 15 shows the predicted versus actual average GHI per month. The winter months, as expected, have lower GHI. Delhi is known for its smoggy winter mornings, which contribute to lower efficiency.

Table 2 Case study

City          Accuracy (%)   RMSE (W/m²)
Delhi         92.79          38.459824992076
Chennai       88.074         42.048679041929
Cherrapunji   87.943         46.33957551891
Jaisalmer     96.580         27.682932281395
Kochi         91.084         47.8903317343846
Shimla        86.104         38.08241701586331


Fig. 15 Average GHI per month of Delhi

• Chennai
Chennai was chosen for its location close to the eastern coast. Shadowed by the Western Ghats, the city is known for its intense sun and was therefore studied. When the monthly GHI graph was plotted, Chennai had higher GHI levels than Delhi throughout the year. This was most evident in the hot summer months, when temperatures rise close to 35 °C (Fig. 16).
• Cherrapunji
Situated in Meghalaya on the eastern side of India, Cherrapunji is famous for its heavy rains. It has one of the highest average annual precipitation levels, at 11,430 mm. It rains almost every day due to the summer air currents over the plains of Bengal. Compared with Delhi and Chennai, Cherrapunji has a lower average GHI per month. Random forest was able to predict the annual GHI with good accuracy. The GHI level drops considerably during the monsoon season due to increased cloud cover (Fig. 17).


Fig. 16 Average GHI per month of Chennai

Fig. 17 Average GHI per month of Cherrapunji


Fig. 18 Average GHI per month of Jaisalmer

• Jaisalmer
Located in Rajasthan in the west of India, Jaisalmer is one of the hottest cities. Temperatures have reached 47 °C, with dry winds in the summer. All of this leads to much higher GHI levels throughout the year than in the other cases, with only a slight dip in the winter months. Even then, GHI remains quite high, making Jaisalmer an ideal place for solar energy generation (Fig. 18).
• Kochi
Located on the western coast in the south Indian state of Kerala, Kochi is uniquely affected by the Western Ghats. The mountain range acts as a barrier to the monsoon winds, causing rain to fall in Kerala rather than pass over to Tamil Nadu. As such, Kerala has higher rainfall than the other southern states. GHI drops considerably in the monsoon season, when it is cloudy throughout. GHI averages above 650 W/m² only in the summer months and in some of the winter months. The monsoon begins in June, as can be clearly seen in Fig. 19.


Fig. 19 Average GHI per month of Kochi

• Shimla
Shimla, located in the Himalayan foothills, was an interesting addition to this list because, unlike the other cities in this case study, it receives snowfall. The resulting reduction in solar power is clearly seen, especially in the winter months, when GHI falls well below the levels seen in the other cities (Fig. 20).

5 Conclusion

The project focused on finding the duck curve of a selected region. To find the duck curve, the solar power generation of that region must be calculated. This depends on a variety of factors such as rain, cloud cover, day length and temperature. Taking these factors into account, machine learning models were implemented to predict the GHI, that is, the power per unit area received from the sun. To verify the calculations, the region initially considered was our college, VIT Chennai, located in the sunny city of Chennai, India. The GHI


Fig. 20 Average GHI per month of Shimla

per hour from previous years was obtained from the website of NSRDB, an international organisation that has backing from major players like NASA. This data was used to train a number of ML models, of which random forest was identified as the best. Using random forest, the hourly GHI of a particular day was predicted. A coefficient that captures parameters such as inverter efficiency and panel efficiency was then calculated, and the solar power generated per hour was predicted. This was later verified against the actual generation data obtained from VIT. The load curve of VIT was also plotted using data obtained from official sources. Using these data, the duck curve of VIT was calculated. To further test the model, data from different regions were taken, chosen for their geographical and climate-related uniqueness. The model was able to predict the GHI for these regions with high accuracy. Using this predicted data, it is possible to properly plan the load schedule for a solar-dependent society. With more time devoted to neural networks, it may be possible to build a custom neural network that outperforms the current random forest model. Another major problem encountered was that several cities have not had weather data collected for more than 5 years. This is likely due to financial and geographical constraints. The model can be more accurate if the data obtained is recent, as threats like global warming change weather patterns every few years. With easier access to hourly load demand, the duck curve for major cities can be easily predicted.


References
1. Inman RH, Pedro HT, Coimbra CF (2013) Solar forecasting methods for renewable energy integration. Prog Energy Combust Sci 39(6):535–576
2. Yagli GM, Yang D, Srinivasan D (2019) Automatic hourly solar forecasting using machine learning models. Renew Sustain Energy Rev 105:487–498
3. Abuella M, Chowdhury B (2017) Solar power forecasting using support vector regression. arXiv preprint arXiv:1703.09851
4. Jebli I, Belouadha FZ, Kabbaj MI, Tilioua A (2021) Deep learning based models for solar energy prediction. Adv Sci 6:349–355
5. Bacher P, Madsen H, Nielsen HA (2009) Online short-term solar power forecasting. Sol Energy 83(10):1772–1783
6. Wan C, Zhao J, Song Y, Xu Z, Lin J, Hu Z (2015) Photovoltaic and solar power forecasting for smart grid energy management. CSEE J Power Energy Syst 1(4):38–46
7. Ahmed R, Sreeram V, Mishra Y, Arif MD (2020) A review and evaluation of the state-of-the-art in PV solar power forecasting: techniques and optimization. Renew Sustain Energy Rev 124:109792
8. Zhang J, Florita A, Hodge BM, Lu S, Hamann HF, Banunarayanan V, Brockway AM (2015) A suite of metrics for assessing the performance of solar power forecasting. Sol Energy 111:157–175
9. Sheha M, Powell K (2019) Using real-time electricity prices to leverage electrical energy storage and flexible loads in a smart grid environment utilizing machine learning techniques. Processes 7(12):870
10. Hou Q, Zhang N, Du E, Miao M, Peng F, Kang C (2019) Probabilistic duck curve in high PV penetration power system: concept, modeling, and empirical analysis in China. Appl Energy 242:205–215
11. Prasad AA, Kay M (2021) Prediction of solar power using near-real time satellite data. Energies 14(18):5865
12. Ramli NA, Hamid MFA, Azhan NH, Ishak MAAS (2019) Solar power generation prediction by using k-nearest neighbor method. AIP Conf Proc 2129(1):020116. AIP Publishing LLC
13. Bandyopadhyay J, Gupta P, Kamla PE, Kapoor J (2022) Impact of renewable energy sources on Indian electricity grid. Gas 25329(38):8
14. Sivaneasan B, Yu CY, Goh KP (2017) Solar forecasting using ANN with fuzzy logic preprocessing. Energy Procedia 143:727–732

Dynamic Optimized Multi-metric Data Transmission over ITS

Roopa Tirumalasetti and Sunil Kumar Singh

Abstract Intelligence Transport System (ITS) is a wirelessly connected, self-configurable, structureless network with multiple sensor hosts. Here, all the hosts in the network, which move independently, follow a centralized authority present at every host. Sensor networks have various random features, such as unreliable wireless connections between hosts and a topology that can change at any time. A variety of algorithms have been developed to address these issues. However, challenges such as intermittent connectivity, heterogeneous vehicle management, energy consumption, and support for network intelligence remain unanswered. To support random data transfer and to address energy consumption and mobility management issues in intelligent transport networks, "A New Optimal Searchable Multi-metric Routing (NOSMR)" is proposed. This approach controls mobility from a new perspective to manage the load maintenance of each host in intelligence sensor networks. Power-aware advanced routing scenarios are introduced for efficient connection between wireless hosts and for efficient mobility management in ad hoc networks. Extensive simulations are carried out using NS3 to evaluate the performance of the proposed approach. The performance metrics used to analyze its efficiency are throughput, time, and end-to-end delay. The proposed study is compared with state-of-the-art studies in the literature.

Keywords Mobility model · Multi-cast routing · Intelligence transport networks · Quality of service

R. Tirumalasetti (B) · S. K. Singh
School of Computer Science and Engineering, VIT-AP University, Near Vijayawada, Guntur, Andhra Pradesh, India
e-mail: [email protected]
S. K. Singh
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_6


R. Tirumalasetti and S. K. Singh

1 Introduction

Intelligence Transport Sensor Networks (ITSNs) are self-deployed and send data from specially designated locations. They find applications in fields such as disaster-area communication, crowd control, emergency services, and traffic management [3, 4]. Topology maintenance and information flow control are two major topics addressed by different authors in the ITSN literature. The usage of information transport devices such as mobile phones, PCs, and tablets has recently increased significantly. As a result, the processes of topology interfacing and spontaneous infrastructure connection have progressed. Position-based Opportunistic Routing (POR) [5, 10, 23] has been proposed as a solution to the difficulties of traditional routing in intelligent transport networks. Intelligent routing reduces poor transport connections by exploiting broadcast data transmission with congestion diversity [10, 18]. Diversity Backpressure Routing (DIVBAR) is designed to explore a limited amount of redundant routing over all of the routes [23]. The backpressure algorithm uses the feasible paths between the source and destination instead of considering only the shortest path for transmission. In DIVBAR, packets are sent over unnecessarily long distances, resulting in a severe delay in packet delivery. Extended DIVBAR (E-DIVBAR) [7] evaluates the number of alternative routes with an estimated hop count in view [14] of the destination's cost parameters. E-DIVBAR does not provide better performance than DIVBAR. Extreme Opportunistic Routing (Ex-OR) was proposed [5] to exploit the broadcast nature of transmission in intelligent transport ad hoc networks. Some of the neighbor nodes overhear the data transmission when other network nodes transmit packets.

The proposed New Optimal Searchable Multi-cast Routing (NOSMR) addresses random data transfer and mobility management issues in intelligent transport networks. The main contributions of the proposed approach are as follows:
1. The performance of NOSMR's practical, distributed and asynchronous 802.11-compliant implementation was examined using a complete set of NS3 simulations on real-world networks.
2. The approach controls mobility in a new way to manage the load maintenance of each host in intelligence sensor networks.
3. NOSMR is a packet-based variant of min-backlogged-path routing that eliminates the need for path enumeration across the network and for expensive computation of the total delay along routes.
4. The proposed approach gives better and more efficient quality-of-service parameters than traditional approaches to intelligent data transmission in ad hoc networks.
The remainder of the paper is structured as follows: related work is discussed in Sect. 2, the proposed algorithm for optimized multi-metric data transmission is presented in Sect. 3, simulation results are given in Sect. 4, and the conclusion in Sect. 5 is followed by a proof of concept in Sect. 6.


2 Related Multi-cast Mobility Models

We present a few synthetic mobility models [20] that have been proposed for the performance evaluation of ad hoc networks (Venitta Raj [19]). Trust-aware similarity-based source routing to ensure effective communication, and safety message communication through dynamic power control mechanisms, were designed by Alabbas [2]. For example, we would not expect SNs to move in straight lines at constant speeds for the whole duration, since SNs would not travel in such a restricted manner. The Waypoint-based Random Mobility Model (WRMM) includes pause times between changes in direction and speed. Optimal path selection for logistics transportation based on an improved ant colony algorithm is defined in Wang et al. [22]. When the pause time expires, the mobile host chooses a random destination and a speed between the minimum and maximum speed, and then travels toward the newly chosen destination at that speed [6, 9, 12]. On arrival, the mobile host pauses for a pre-defined time before the process starts again. In the Mobility-based City Section Model (CSMM), the simulation area is a street network that represents a section of a city. The streets and their speed limits depend on the type of city being simulated. Analysis of different routing scenarios with opportunistic routing and reinforcement learning-based routing is described in Hussain et al., Nazib and Moh [8, 11]. Each mobile host begins the simulation at a defined point on some street. The SN then randomly chooses a destination, which is also represented by a point on some street. In addition, safe-driving characteristics, such as a speed limit and a minimum distance allowed between any two SNs, are observed while moving from the current destination to the new destination. After reaching the destination, the SN pauses for a predetermined amount of time before randomly selecting another destination (for example, a point on the street) and repeating the process Wang [21]. The Boundless Simulation Area Mobility Model (BSAMM) is unique in how it handles the boundary of the simulation area (Poongodi et al. [7]). In all of the previously described mobility models, SNs reflect off or stop moving once they reach a simulation boundary. In BSAMM, SNs that reach one side of the simulation area continue travelling and reappear on the opposite side. Vehicle-to-vehicle communication, dedicated short-range communication and safety awareness are discussed in Vershinin and Zhan [20].
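The boundless-area behaviour described above, in which a node leaving one side of the simulation area re-enters on the opposite side, can be sketched with modular arithmetic. This is an illustrative sketch only: the function name, the time step and the 50 × 50 area are hypothetical and are not taken from the NS3 setup used in the paper.

```python
def step_boundless(x, y, vx, vy, width, height, dt=1.0):
    """Advance a node by one time step and wrap it around the area edges."""
    new_x = (x + vx * dt) % width    # leaving the right edge re-enters on the left
    new_y = (y + vy * dt) % height   # leaving the bottom edge re-enters at the top
    return new_x, new_y

# A node near the right edge of a 50 x 50 area, moving right and down.
position = step_boundless(48.0, 1.0, vx=5.0, vy=-3.0, width=50.0, height=50.0)
```

Python's `%` operator returns a non-negative result for a positive divisor, so a node that moves past the lower or left edge wraps correctly without any special-casing.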


Optimization of the multi-hop broadcast protocol and MAC protocol in vehicular ad hoc networks is presented by Pei et al. [13]. The Gauss-Markov Mobility Model was designed to adapt to different levels of randomness through a single tuning parameter. Every SN is initially assigned a current speed and direction. An enhanced hybrid ant colony optimization routing protocol is defined by Ramamoorthy et al. [14]. Zou et al. [24] define an improved ant colony optimization routing scenario for efficient node selection, but the node location is not kept up to date.

3 Proposed Implementation

This section introduces the proposed system, New Optimal Searchable Multi-cast Routing (NOSMR), together with a smart node selection scheme for effective routing between nodes in wireless network communication (Chinmoy Ghorai et al. [24]). We further improve congestion diversity based on the opportunistic technique to overcome overshooting issues in data transmission between nodes. The NOSMR implementation has three parts: the Smart Node Selection Procedure, the Acknowledgment (ACK) Procedure, and the Data and Acknowledgment Frame Representation.

3.1 Procedure for Selection of Multi-cast Node

NOSMR's performance depends on its ability to select a prioritized node from the set of nodes closest to the data transmission destination. The proposed method follows the opportunistic node selection method for effective operation (nodes are prioritized by packet delivery rate). The node selection methodology in NOSMR, which is based on the opportunistic scheme, is as follows. A mathematical representation of effective packet transmission must consider energy consumption and the packet re-transmission factor. Consider a node that successfully sends a data packet over an unreliable radio link with probability $v$. The probability of successful transmission within the maximum limit of $t_s$ can be computed using Eq. (1).

$$\mu = 1 - (1 - v)^{t_s} \tag{1}$$
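Eq. (1) can be evaluated directly. The sketch below assumes, as an interpretation rather than something the paper states, that $t_s$ is the maximum number of transmission attempts and that attempts succeed independently with probability $v$; the function name and the sample values are hypothetical.

```python
def success_probability(v, t_s):
    """Eq. (1): probability that at least one of t_s attempts succeeds."""
    if not 0.0 <= v <= 1.0:
        raise ValueError("v must be a probability")
    if t_s < 0:
        raise ValueError("t_s must be non-negative")
    return 1.0 - (1.0 - v) ** t_s

# Success probability for a per-attempt probability of 0.6 and limits 1 to 3.
mu_by_limit = {t_s: success_probability(0.6, t_s) for t_s in (1, 2, 3)}
```

The probability rises quickly with the limit (about 0.6, 0.84 and 0.936 here), which illustrates the trade-off between delivery probability and the extra energy spent on re-transmissions.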

The study also expresses the cost of effective data transmission as $d_c$, which can be calculated in relation to the packet re-transmission count $p$, with $1 \le p \le t_s$. The cost $d_c$ can be calculated using Eq. (2) as follows:

$$d_c = p \cdot t_s \cdot \varphi\{\, p \mid p \le t_s \,\} \tag{2}$$


Algorithm 1 Selection procedure of multi-cast nodes in transport intelligence networks.

Start ITS deployment with different nodes ρ(v, e)
mn (node count) = 30–50
AODV routing model
Time instance at each node (ti)
  a. Subset vehicle node selection → mn
  b. < ns, and −> Ex(rTable)
  c. If E(nr) > cut-off
       i. Enable count(p) and describe the transmission of a message at the interval of time
       ii. Control congestion based on Ack1 and Ack2 received from other vehicle nodes
       iii. Success_transmission with relay message
       iv. Update the routing table from the agent (SA) and the configuration of the relay
  d. Else if the congestion-control Ack1 is not received
       i. U_count(p) = p
  e. Else if the congestion-control Ack2 is not received
       i. Drop packet relay transmission
       ii. U_relay_transmission
  f. Else
       i. End data transmission with updated details
  g. End
End

From Eq. (2), it can be concluded that if a packet does not arrive at its intended destination within the set time period, the failure cost is expressed as $p \times t_s$. Here, φ denotes the probability of the events that can occur, and a flag F ⟨0 | 1⟩ is used to indicate the success or failure of the event. The packet forwarding game shows that nodes connected to the dynamic ITS are often selfish and tend to drop packets to conserve energy. The game also shows that their actions are distinguishable, which is a driving factor. A selfish node either selects the action AF, forwarding the packet using a cooperative packet forwarding mechanism, or selects the action AD, which is a discard strategy. This choice of action is conditional and depends on whether the node thereby maximizes its benefit in resources or reputation. The distribution of actions that each node adopts between forwarding and declining is determined by the payoff estimation, a probabilistic mathematical model that handles every event that can occur during packet forwarding scenarios (Fig. 1).

3.2 Acknowledgment (ACK) Procedure in NOSMR

Ensuring which of the chosen nodes forwards packets is a critical challenge behind NOSMR's opportunistic routing. A customized variant of 802.11 Medium Access Control (MAC) is proposed for this, which reserves a number of time slots for receiving nodes to return acknowledgments, based on the strategy of


Fig. 1 Multi-cast node selection processing in transport intelligence networks

AODV with three-way handshakes in Poongodi et al. [7]. The source sends an acknowledgment to its neighbor nodes with the destination sequence id, using their sequence ids, which are stored in the header of each node. Every node in the network is required to listen for the ACK before sending data to neighbor nodes. If a node has high priority in the ACK section, it forwards the data to the next hop. If a node has low priority in the ACK section, it sends a report to the high-priority node (based on sequence id). When node A receives a request for data transmission, it becomes a high-priority node and sends an ACK to its network neighbors. Node B is the network's second-highest-priority node and does not hear node A's acknowledgment, while nodes C and D do hear A's ACK. Because node B can hear node C, and node C holds the sequence id of node A, node B can obtain the packets of node A indirectly through C, despite not having received A's ACK.

3.3 Representation of Acknowledgement Data Frame

NOSMR defines its data and acknowledgment frame formats based on the most recent version of 802.11, with changes to medium access control. Figure 2 depicts the basic representation of the data frame and the acknowledgment frame.

Fig. 2 Representation of sender data and receiver data in intelligence transfer networks


The data frame representation, arranged with the optimized node selection for vehicular ad hoc networks based on the honey bee and genetic algorithms defined by Ahmad [1], is shown in Fig. 2. The most critical components of a data frame are the source address and time duration, which regulate all frame progress in data transmission, together with the neighbor node sequence id and delivery ratio. The destination address, frame control, and candidate id are all included in the acknowledgment frame format. When the data frame changes, the routing table cached at each node in the network is updated through the exposed acknowledgment. NOSMR adapts this approach to reduce packet loss and provide effective data communication.

4 Experimental Results
Using NS3, a simulation of the proposed approach is compared with existing routing algorithms such as the City Section-Based Mobility Model (CSMM) [16], Boundless Simulation Area Mobility Model (BSAMM) [16], Diversity Backpressure Routing (DIVBAR) [15], Opportunistic Multi-Hop Routing for Wireless Networks (Ex-OR) [17], and Extended Diversity Backpressure Routing (E-DIVBAR) [15].

Simulation
For the experimental study, simulation is done in NS3, and a multi-path topology is evaluated under various settings. Using the pre-defined protocol headers in NS3, a network with several nodes is built, and routing is then simulated among all the nodes for successful data transmission. Every node in the network is able to route traffic. Table 4 shows the parameters used for the network simulation.

Variable | Representation
Co-ordinate node representation | 50 mm
Region between nodes | 10 m
Intermediate connection | 60 mm
Size of maximum packets | 1024 bits
Size of network window | 50 * 50 mm
Time of simulation | 35–45 s
Number of vehicles | 30

Extending the network’s lifetime requires describing the network topology while data are transmitted from one host to another. Figure 3 analyzes data transfer in the ad hoc network as the number of communicating nodes varies; it shows the performance of the proposed approach in terms of constant bit rate for data transfer in ITSs. The proposed


R. Tirumalasetti and S. K. Singh

Fig. 3 Performance of different approaches with respect to transmitted bit rate

Fig. 4 Performance of different approaches with respect to packet delivery ratio

approach maintains efficient data transmission in ITSs as bandwidth grows with the data transfer rate. Figure 4 compares the packet delivery ratio of the suggested approach with that of traditional approaches in ad hoc networks for different numbers of host connections available in ITSs. Figure 5 shows power optimization in ad hoc networks as node communication varies: as the number of nodes increases, power optimization with the suggested approach is better than with standard mobility-related approaches in ITSs. These results illustrate the efficiency of NOSMR in terms of packet delivery ratio, time efficiency, power consumption, and end-to-end latency for multi-cast connections in intelligence-based networks.

5 Conclusion and Future Work
This paper proposes a New Optimal Searchable Multi-cast Routing (NOSMR) for multi-cast transmissions with practical routing scenarios and up-to-date data transmission. The simulations are carried out in NS3. The suggested NOSMR


Fig. 5 Performance evaluations of different approaches with respect to energy utilization

reduces end-to-end delay and increases throughput. Moreover, its execution time is shorter than that of other algorithms available in the literature. Work on adaptable sensor networks, particularly intelligence-based sensor networks, might be developed further by combining advanced routing with better congestion control to reduce delay and increase the packet delivery ratio. The proposed algorithm currently considers just 90 vehicles; future work will extend it to 100–150 vehicles.

6 Proof of the Concept
One of the most challenging aspects of opportunistic routing is getting the candidate nodes to agree on which one of them should forward the message. We suggest using a modified version of the 802.11 MAC that reserves multiple time slots for receiving nodes to return acknowledgments. Instead of simply indicating whether or not the packet was received successfully, each acknowledgment includes the ID of the highest-priority successful recipient known to the ACK’s sender. All candidates listen to all ACK slots before deciding whether to forward, in case a low-priority candidate’s ACK reports a high-priority candidate’s ID. Including the ID of the sender of the highest-priority ACK heard so far helps prevent duplicate forwarding. Assume node A receives a message, is the highest-priority candidate, and sends an ACK. Node B, the second highest-priority candidate, does not hear the ACK, but node C does. Assume that node B hears node C’s ACK. If the ACKs did not carry IDs, node B would forward the packet, since it would believe it was the highest-priority recipient. Because node C’s ACK contains node A’s ID, B learns indirectly that node A did receive the packet. Refer to https://dspace.mit.edu/bitstream/handle/1721.1/34115/67618057-MIT.pdf;sequence=2
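The forwarding decision described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name should_forward, the list-based priority order, and the representation of an ACK as the single ID it reports are all assumptions made for the example.

```python
def should_forward(my_id, priority_order, heard_ack_ids):
    """Decide whether candidate `my_id` should forward the packet.

    priority_order: candidate IDs, highest priority first (assumed input).
    heard_ack_ids: for each ACK this node heard, the ID that ACK reported,
                   i.e. the highest-priority recipient known to its sender.
    """
    rank = {node: position for position, node in enumerate(priority_order)}
    best_known = my_id  # this node received the packet itself
    for reported_id in heard_ack_ids:
        if rank[reported_id] < rank[best_known]:
            best_known = reported_id
    # Forward only if no higher-priority recipient is known.
    return best_known == my_id


priority = ["A", "B", "C"]
# With IDs: C's ACK reports A, so B learns that A already has the packet.
print(should_forward("B", priority, ["A"]))  # False
# Without IDs: C's ACK only identifies C itself, so B would forward a duplicate.
print(should_forward("B", priority, ["C"]))  # True
```

The second call mirrors the failure case in the text: when ACKs carry no recipient ID, node B cannot tell that node A already holds the packet.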


References
1. Ahmad M (2020) Optimized clustering in vehicular ad hoc networks based on honey bee and genetic algorithm for internet of things. Peer-to-Peer Network Appl 1–6
2. Amjed Razzaq Alabbas LAH (2020) Performance enhancement of safety message communication via designing dynamic power control mechanisms in vehicular ad hoc networks. Comput Intell
3. Yang C, Li Z (2020) Traffic path planning method based on vanet and ant colony algorithm. DEStech Trans Eng Technol Res. https://doi.org/10.12783/dtetr/mcaee2020/35048
4. Elhoseny M (2020) Intelligent firefly-based algorithm with levy distribution (ff-l) for multicast routing in vehicular communications. Expert Syst Appl 140:112889. https://doi.org/10.1016/j.eswa.2019.112889
5. Fatemidokht H (2021) Rafsanjani: efficient and secure routing protocol based on artificial intelligence algorithms with uav-assisted for vehicular ad hoc networks in intelligent transportation systems. IEEE Trans Intell Transport Syst 22(7):4757–4769. https://doi.org/10.1109/TITS.2020.3041746
6. Gawas MA, Govekar SS (2019) A novel selective cross layer based routing scheme using aco method for vehicular networks. J Netw Comput Appl 143:34–46. https://doi.org/10.1016/j.jnca.2019.05.010
7. Guerrieri A (2020) Security and privacy in vehicular ad hoc network and vehicle cloud computing: a survey. Wirel Commun Mob Comput. https://doi.org/10.1155/2020/5129620
8. Hussain R, Lee J, Zeadally S (2021) Trust in vanet: A survey of current solutions and future research opportunities. IEEE Trans Intell Transport Syst 22(5):2553–2571. https://doi.org/10.1109/TITS.2020.2973715
9. Ji X (2018) Efficient and reliable cluster-based data transmission for vehicular ad hoc networks. Mob Inf Syst. https://doi.org/10.1155/2018/9826782
10. Nazib RA, Moh S (2020) Routing protocols for unmanned aerial vehicle-aided vehicular ad hoc networks: a survey. IEEE Access 8:77535–77560. https://doi.org/10.1109/ACCESS.2020.2989790
11. Nazib RA, Moh S (2021) Reinforcement learning-based routing protocols for vehicular ad hoc networks: a comparative survey. IEEE Access 9:27552–27587. https://doi.org/10.1109/ACCESS.2021.3058388
12. Osman RA, Peng XH, Omar MA (2019) Adaptive cooperative communications for enhancing qos in vehicular networks. Phys Commun 34:285–294. https://doi.org/10.1016/j.phycom.2018.08.008
13. Pei Z, Wang X, Chen W (2021) Joint optimization of multi-hop broadcast protocol and mac protocol in vehicular ad hoc networks. Sensors (09). https://doi.org/10.3390/s21186092
14. Ramamoorthy R (2022) An enhanced hybrid ant colony optimization routing protocol for vehicular ad-hoc networks. IEEE Trans Intell Transp Syst 13. https://doi.org/10.1007/s12652-021-03176-y
15. Togou MA, Hafid A, Khoukhi L (2016) Scrp: stable cds-based routing protocol for urban vehicular ad hoc networks. IEEE Trans Intell Transp Syst 17(5):1298–1307. https://doi.org/10.1109/TITS.2015.2504129
16. Togou MA, Khoukhi L, Hafid A (2018) Performance analysis and enhancement of wave for v2v non-safety applications. IEEE Trans Intell Transp Syst 19(8):2603–2614. https://doi.org/10.1109/TITS.2017.2758678
17. Togou MA, Khoukhi L, Hafid AS (2016) Throughput analysis of the ieee802.11p edca considering transmission opportunity for non-safety applications, pp 1–6. https://doi.org/10.1109/ICC.2016.7511454
18. Ullah A, Yao X, Shaheen S, Ning H (2020) Advances in position based routing towards its enabled fog-oriented vanet-a survey. IEEE Trans Intell Transp Syst 21(2):828–840. https://doi.org/10.1109/TITS.2019.2893067
19. Venitta Raj R (2021) Trust aware similarity-based source routing to ensure effective communication using game-theoretic approach in vanets. J Amb Intell Human Comput 12. https://doi.org/10.1007/s12652-020-02306-2
20. Vershinin YA, Zhan Y (2020) Vehicle to vehicle communication: dedicated short range communication and safety awareness, pp 1–6. https://doi.org/10.1109/IEEECONF48371.2020.9078660
21. Wang H (2020) Research on data transmission optimization of communication network based on reliability analysis. Informatica 44. https://doi.org/10.31449/inf.v44i3.3280
22. Wang X, Li H, Y J (2020) Optimal path selection for logistics transportation based on an improved ant colony algorithm. J Phys Conf Ser 2083(3):032011. https://doi.org/10.1088/1742-6596/2083/3/032011
23. Xia Z, J W (2021) Survey of the key technologies and challenges surrounding vehicular ad hoc networks. ACM Trans Intell Syst Technol 12. https://doi.org/10.1145/3451984
24. Zou Z (2019) Wireless sensor network routing method based on improved ant colony algorithm. J Amb Intell Hum Comput 10. https://doi.org/10.1007/s12652-018-0751-1

Solar Energy-Based Intelligent Animal Reciprocating Device for Crop Protection Using Deep Learning Techniques Ch. Amarendra and T. Rama Reddy

Abstract Much research and numerous attempts have been made to apply emerging technology to agriculture. The main objective of this research is to protect crops from animal attacks. Conventional techniques apply the same security measure to every type of animal detected by a passive IR sensor, and only single-stage protection is applied. In the proposed system, cameras are fixed to capture images, and the animals are identified with support vector machine and convolutional neural network techniques; based on the animal identified, different levels of security are applied, and the dB level of the reciprocating sound changes with the level of the animal. With the help of IoT devices, the information is sent to the farm owner if the primary protection fails. The proposed method is evaluated by comparing it with the conventional technique in terms of complexity, implementation cost, reciprocating time, and accuracy of animal detection.
Keywords Animal classification · CNN · IoT · Crop protection · Deep learning

1 Introduction
Agriculture is vital to the economies of many countries throughout the world. Despite economic progress, agriculture remains the economy’s backbone [1]. Agriculture satisfies people’s dietary needs while also producing a variety of raw resources for industry. However, animal interference and fires in agricultural fields cause significant crop loss; a crop can be completely ruined, leaving farmers with substantial losses [2].

Ch. Amarendra (B) · T. Rama Reddy
Aditya Engineering College (A), Surampalem, India
Jawaharlal Nehru Technological University, Kakinada, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_7


It is critical to protect agricultural fields and farms from animal and fire damage to minimize financial losses. To address this issue, many methods have been developed to prevent animals from entering the farm [3]. These intruder alerts safeguard the crop from damage, increasing agricultural output. The embedded system that is developed should not be dangerous to animals or humans. Creating an intelligent security system for agricultural safety is a challenging issue [4–6]. Motion-sensitive cameras, known as camera traps, are non-invasive and simple to operate, and are a popular tool for collecting animal images [7]. Camera traps are the most successful and cost-effective solution for numerous species when compared to other wildlife monitoring approaches [8]. Several ambitious projects are growing the scale at which cameras are used on the landscape, with hundreds of sensors already rotating across thousands of places [9], sometimes with the help of citizen scientists [10]. Convolutional neural network (CNN) techniques have lately demonstrated exceptional performance in image classification and object recognition [11]. The CNN is one of the most extensively used deep learning models. Convolution, pooling, and classification layers make up a basic CNN model. The convolution layers function as local, translation-invariant operators between the input image and a collection of filters [12]. A CNN learns a hierarchy of features from pixels to classifier, with training guided by stochastic gradient descent [13]. The classification layer generates three scores that correspond to the human, animal, and background classes. Computer vision has the potential to provide an automated tool for evaluating camera trap photographs if it can first detect the moving animal in the image, remove the background, and then identify the moving object [14]. Although similar problems have been solved in many indoor settings [15], the difficulty with camera trap photographs is significantly greater because of the dynamic backgrounds with waving trees, changing shadows, and sunspots. Foreground detection has previously been offered as a method of distinguishing animals from the background in camera trap images. In general, foreground areas are chosen using one of two methods: pixel by pixel, where a decision is made for each pixel individually, or region-based, where a decision is made on a collection of spatially adjacent pixels [16]. Examples of analytical approaches include using the median pixel value to build a background model [17], a nonparametric method in which the pixel-level background model is represented by a series of background samples [18], and robust principal component analysis (RPCA) [19]. Unfortunately, these attempts have been hampered by a high number of false positives and the inability to distinguish between animals and humans [20]. The accuracy of the proposed CNN method was compared with recent studies [21, 22], which reported accuracies of 82% and 92%; the proposed CNN model achieves 92.68%.
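The pixel-by-pixel, median-based background model mentioned above can be illustrated with a small sketch. It is a minimal example under stated assumptions, not the method of [17]: frames are plain 2-D lists of grayscale values, and the function names and the threshold of 30 are hypothetical choices made for the illustration.

```python
from statistics import median


def median_background(frames):
    """Build a background model as the per-pixel median over a list of frames.

    frames: equally sized 2-D lists of grayscale values (assumed format).
    """
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [median(frame[y][x] for frame in frames) for x in range(width)]
        for y in range(height)
    ]


def foreground_mask(frame, background, threshold=30):
    """Mark a pixel as foreground when it differs from the background model
    by more than `threshold` (a hypothetical value)."""
    return [
        [abs(pixel - bg) > threshold for pixel, bg in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


frames = [
    [[10, 10], [10, 10]],
    [[12, 10], [10, 11]],
    [[10, 200], [10, 10]],  # a bright object appears in the top-right pixel
]
background = median_background(frames)
print(foreground_mask(frames[2], background))  # [[False, True], [False, False]]
```

A per-pixel rule like this reacts to any large intensity change, which is consistent with the high false-positive rate noted above for scenes with waving trees and moving shadows.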


Fig. 1 A good image classification model must be invariant to the cross-product of all these variations, while simultaneously retaining sensitivity to the inter-class variations

2 Real-World Problems of Image Classification
Before going through the image classification techniques that can be used, consider some of the computer vision challenges, shown in Fig. 1, that a person would handle with ease:
1. Viewpoint variation
2. Scale variation
3. Deformation
4. Occlusion
5. Illumination conditions
6. Background clutter
7. Intra-class variation.

3 Test System Description
Figure 2 shows the detailed block diagram of the test system. It covers a sample area of the monitored agricultural land. The animal reciprocating system consists of a solar panel with a battery, so that it can operate remotely without power from an external source; a camera fixed below the solar panel to capture images and video of the farm; and, at the bottom of the pole, a speaker and lighting equipment that produce the necessary repelling action depending on the detected animal.

Fig. 2 Block diagram of the system (blocks: agriculture area; surveillance area; solar panel with battery, camera, and speakers; GSM; network gateway; system monitor)

4 Methodology
This section describes the methods used to classify wild animals. Various methodologies were analyzed for the task. Two popular methods were run under the same test conditions and their results were compared. The methods are explained below together with their algorithms.


4.1 Convolutional Neural Networks (CNNs)
In recent years, CNNs have risen to prominence as the dominant approach in computer vision, with improved processing power allowing them to approach superhuman performance on some complicated visual tasks. Xception is one of the best-performing architectures and is used in this research via transfer learning. It combines ideas from several well-known CNN designs, including Inception-v4, GoogLeNet, and ResNet, but replaces the inception modules with depth-wise separable convolutional layers. Unlike traditional convolutions, which learn spatial and cross-channel patterns at the same time, the separable convolution layer divides feature learning into two phases: the first phase applies a single spatial filter to each input feature map, and the second phase searches for cross-channel patterns (Table 1).

Table 1 Percentage of damage caused by the different wild animals [23–25]
No. | Wild animal name | Name of the crop | Percentage of damage (%) | Reciprocating action
1 | Elephant | Sugarcane, coconut, plantain, paddy, maize | 72 | Use bright lights, noise
2 | Gaur | Mulberry, sandal | 62 | High-frequency sound waves
3 | Sambar deer | Crops, pasture, forestry plantations, gardens | 17 | Loud noise and dazzling lights
4 | Wild boar | Paddy, maize, bean corn, and fruit trees | 16 | Loud noise
5 | Monkey | Maize, wheat, rice, vegetable crops | 75 | Lighting with sound
6 | Porcupine | Maize, potatoes, groundnuts, sugarcane | 65 | Dazzling of lights
7 | Goral | Maize, potato, millet, wheat, paddy | 20 | Loud noise and dazzling lights
8 | Bear | Field corn, oats, and sweet corn | 55 | Fire and noise
9 | Wolf | Rice, wheat, maize, pulses, and mustard | 18 | Loud noise and dazzling lights
10 | Zebra | Maize or corn, potato, tomato, carrot, and other vegetables | 15 | Loud noise and dazzling lights
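The animal-specific reciprocating actions of Table 1 lend themselves to a simple lookup once the classifier has produced a label. The sketch below is only an illustration of that idea: the dictionary name REPEL_ACTIONS and the function select_action are hypothetical, and returning None for an unknown label is an assumption, not a behavior described in the paper.

```python
# Reciprocating action per animal, transcribed from Table 1.
REPEL_ACTIONS = {
    "elephant": "Use bright lights, noise",
    "gaur": "High-frequency sound waves",
    "sambar deer": "Loud noise and dazzling lights",
    "wild boar": "Loud noise",
    "monkey": "Lighting with sound",
    "porcupine": "Dazzling of lights",
    "goral": "Loud noise and dazzling lights",
    "bear": "Fire and noise",
    "wolf": "Loud noise and dazzling lights",
    "zebra": "Loud noise and dazzling lights",
}


def select_action(predicted_label):
    """Return the reciprocating action for a classifier label.

    Unknown labels return None (an assumption made for this sketch).
    """
    return REPEL_ACTIONS.get(predicted_label.strip().lower())


print(select_action("Wild boar"))  # Loud noise
print(select_action("Tiger"))      # None
```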

Figure 5 illustrates an example of an image processed by a CNN. In addition to providing better performance, the separable convolution layer requires fewer parameters, less memory, and fewer computations than traditional convolutional layers. The flowchart of the CNN algorithm is shown in Fig. 3. Convolutional neural networks (CNNs) are multilayer neural networks that are largely used for image processing and object detection. Yann LeCun developed the original CNN, which he named LeNet, in the late 1980s; it was used to recognize characters such as ZIP codes and digits. CNNs are frequently used to identify satellite images, analyze medical imaging, forecast time series, and detect anomalies.
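The parameter saving of a depth-wise separable convolution can be checked with simple arithmetic. The sketch below counts weights only (biases are ignored), and the layer sizes in the example (3 × 3 kernel, 64 input channels, 128 output channels) are arbitrary values chosen for illustration rather than figures from the paper.

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    """Weights of a standard convolution: every output channel has one
    kernel_size x kernel_size filter per input channel."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_params(kernel_size, in_channels, out_channels):
    """Weights of a depth-wise separable convolution."""
    depthwise = kernel_size * kernel_size * in_channels  # one spatial filter per input map
    pointwise = in_channels * out_channels               # 1 x 1 cross-channel mixing
    return depthwise + pointwise


print(standard_conv_params(3, 64, 128))   # 73728
print(separable_conv_params(3, 64, 128))  # 8768
```

For these illustrative sizes the separable layer uses roughly one eighth of the weights, which matches the two-phase description: spatial filtering per feature map first, cross-channel mixing second.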


Fig. 3 CNN algorithm flowchart

4.2 Recurrent Neural Networks (RNNs)
The detailed RNN flowchart is shown in Fig. 4. Recurrent neural networks can also be built from units called long short-term memory (LSTM) units, shown in Fig. 5. There are four types of recurrent neural networks: (i) one to one, (ii) one to many, (iii) many to one, and (iv) many to many, shown in Fig. 6. RNNs have connections that form directed cycles, which allow the outputs of the LSTM to be fed back as inputs to the current phase. The output of the LSTM thus becomes an input to the current phase, and its internal memory lets the network recall previous inputs. RNNs are used in a variety of applications, including image captioning, time series analysis, natural language processing, handwriting recognition, and machine translation.
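The feedback of a previous output into the current step can be shown with a minimal recurrent cell. Note the assumptions: this is a plain (Elman-style) recurrent unit with scalar weights, not an LSTM and not the network used in the paper; the names rnn_forward, w_x, w_h, and b are hypothetical.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a scalar recurrent cell over a sequence.

    At each step the new state depends on the current input and on the
    previous state: h_t = tanh(w_x * x_t + w_h * h_(t-1) + b).
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


states = rnn_forward([1.0, 0.0, 0.5], w_x=1.0, w_h=1.0, b=0.0)
many_to_many = states     # one output per input step
many_to_one = states[-1]  # only the final state is used
print(len(many_to_many), many_to_one)
```

Reading out every state corresponds to the many-to-many layout, while keeping only the last state corresponds to many-to-one, as would be used when a whole sequence is mapped to a single class label.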


Fig. 4 RNN algorithm flowchart

Fig. 5 Image processing using CNN (labels: CNN, fully connected layer; example output classes: Monkey, Wolf)

Fig. 6 a Recurrent neural network, b one to one, c one to many, d many to one, e many to many

5 Experimental Processing
In this research, photographs of ten wild animals were obtained using a 96-megapixel camera in various weather conditions. Image resolution may be increased to accommodate large images, but this comes at the cost of greater memory and computing needs. Data augmentation was required before training to boost image variety and improve image representation. To achieve higher identification accuracy and robustness, the images were preprocessed using augmentation methods such as jitter, image rotation, flipping, cropping, multi-scale transformation, hue, saturation, Gaussian noise, and intensity. The virtual machine used has a T4 GPU with 16 GB of GDDR6 on-board memory, NVIDIA Tensor Cores for faster training, and RTX hardware acceleration for faster ray tracing. The T4 is a passively cooled board that relies on system airflow to maintain its operating temperature. This GPU is designed for inference, that is, predictions generated by deep learning models, with low latency and high throughput. TensorBoard is a convenient tool for visualizing various metrics while training and validating a neural network. Training and validation loss and accuracy are often not enough, and more detail is needed on how a model performs on validation data. Using a confusion matrix as a visual aid is one of the options. In machine learning, a confusion matrix, also known as an error matrix, is a table layout that allows visualization of the performance of an algorithm, commonly a supervised learning algorithm, in statistical classification (in unsupervised learning it is usually called a matching matrix). The rows of the matrix represent the instances in a predicted class, whereas the columns represent the instances in an actual class. The name derives from how easy it is to see whether the system is confusing two classes. For the deep learning methods described in this research, a confusion matrix was built to understand the complexity, accuracy, and validation of each method. Tables 2, 3, 4, 5, 6, 7, 8 and 9 give the confusion, precision, recall, and F1-score matrices for the two methods. The detailed comparison is shown in Table 10; it is based on the complexity of the method, the time taken to identify the animal, and the time taken to produce the repelling action. The CNN method provides the best results compared to the other methodologies. The CNN provides better results because its inputs and outputs have a fixed size, whereas in an RNN the input and output sizes are dynamic and flexible. The proposed CNN method is compared with two similar methods carried out in recent years, as shown in Table 11.
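The relation between a confusion matrix and the reported scores can be sketched as follows. The matrix values are the CNN counts from Table 2, with rows taken as predicted classes and columns as actual classes, as stated in the text; the function name per_class_metrics and the exact way the authors computed their tables are assumptions of this sketch.

```python
LABELS = ["Elephant", "Gaur", "Sambar deer", "Wild boar", "Monkey",
          "Porcupine", "Goral", "Bear", "Wolf", "Zebra"]

# CNN confusion matrix from Table 2 (rows: predicted class, columns: actual class).
CNN_CONFUSION = [
    [945, 19, 0, 0, 0, 0, 0, 36, 0, 0],
    [22, 931, 11, 0, 0, 0, 12, 0, 10, 14],
    [0, 21, 912, 12, 0, 0, 18, 0, 13, 24],
    [0, 21, 19, 902, 0, 0, 21, 0, 19, 18],
    [0, 0, 11, 9, 953, 0, 12, 0, 15, 0],
    [0, 2, 12, 12, 2, 955, 6, 0, 6, 5],
    [0, 23, 32, 11, 0, 0, 881, 0, 31, 22],
    [11, 10, 0, 0, 0, 0, 0, 971, 3, 5],
    [0, 11, 13, 12, 0, 0, 14, 0, 925, 25],
    [0, 17, 19, 21, 0, 0, 29, 0, 21, 893],
]


def per_class_metrics(matrix):
    """Return overall accuracy and a (precision, recall, F1) tuple per class."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    metrics = []
    for i in range(n):
        predicted_as_i = sum(matrix[i])                     # row sum
        actually_i = sum(matrix[r][i] for r in range(n))    # column sum
        precision = matrix[i][i] / predicted_as_i if predicted_as_i else 0.0
        recall = matrix[i][i] / actually_i if actually_i else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return correct / total, metrics


accuracy, metrics = per_class_metrics(CNN_CONFUSION)
print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.9268
for label, (p, r, f1) in zip(LABELS, metrics):
    print(label, round(p, 3), round(r, 3), round(f1, 3))
```

The overall accuracy obtained from the diagonal of Table 2 agrees with the 92.68% reported for the CNN method in Table 10.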

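Table 2 Confusion matrix for CNN method
Class | Elephant | Gaur | Sambar deer | Wild boar | Monkey | Porcupine | Goral | Bear | Wolf | Zebra
Elephant | 945 | 19 | 0 | 0 | 0 | 0 | 0 | 36 | 0 | 0
Gaur | 22 | 931 | 11 | 0 | 0 | 0 | 12 | 0 | 10 | 14
Sambar deer | 0 | 21 | 912 | 12 | 0 | 0 | 18 | 0 | 13 | 24
Wild boar | 0 | 21 | 19 | 902 | 0 | 0 | 21 | 0 | 19 | 18
Monkey | 0 | 0 | 11 | 9 | 953 | 0 | 12 | 0 | 15 | 0
Porcupine | 0 | 2 | 12 | 12 | 2 | 955 | 6 | 0 | 6 | 5
Goral | 0 | 23 | 32 | 11 | 0 | 0 | 881 | 0 | 31 | 22
Bear | 11 | 10 | 0 | 0 | 0 | 0 | 0 | 971 | 3 | 5
Wolf | 0 | 11 | 13 | 12 | 0 | 0 | 14 | 0 | 925 | 25
Zebra | 0 | 17 | 19 | 21 | 0 | 0 | 29 | 0 | 21 | 893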

Table 3 Precision matrix for CNN method
Class | Elephant | Gaur | Sambar deer | Wild boar | Monkey | Porcupine | Goral | Bear | Wolf | Zebra
Elephant | 0.95 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00
Gaur | 0.02 | 0.93 | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | 0.01
Sambar deer | 0.00 | 0.02 | 0.91 | 0.01 | 0.00 | 0.00 | 0.02 | 0.00 | 0.01 | 0.02
Wild boar | 0.00 | 0.02 | 0.02 | 0.90 | 0.00 | 0.00 | 0.02 | 0.00 | 0.02 | 0.02
Monkey | 0.00 | 0.00 | 0.01 | 0.01 | 0.95 | 0.00 | 0.01 | 0.00 | 0.02 | 0.00
Porcupine | 0.00 | 0.00 | 0.01 | 0.01 | 0.00 | 0.96 | 0.01 | 0.00 | 0.01 | 0.01
Goral | 0.00 | 0.02 | 0.03 | 0.01 | 0.00 | 0.00 | 0.88 | 0.00 | 0.03 | 0.02
Bear | 0.01 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.97 | 0.00 | 0.01
Wolf | 0.00 | 0.01 | 0.01 | 0.01 | 0.00 | 0.00 | 0.01 | 0.00 | 0.93 | 0.03
Zebra | 0.00 | 0.02 | 0.02 | 0.02 | 0.00 | 0.00 | 0.03 | 0.00 | 0.02 | 0.89


Table 4 Recall matrix for CNN method
Class | Elephant | Gaur | Sambar deer | Wild boar | Monkey | Porcupine | Goral | Bear | Wolf | Zebra
Elephant | 0.97 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00
Gaur | 0.02 | 0.88 | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | 0.01
Sambar deer | 0.00 | 0.02 | 0.89 | 0.01 | 0.00 | 0.00 | 0.02 | 0.00 | 0.01 | 0.02
Wild boar | 0.00 | 0.02 | 0.02 | 0.92 | 0.00 | 0.00 | 0.02 | 0.00 | 0.02 | 0.02
Monkey | 0.00 | 0.00 | 0.01 | 0.01 | 1.00 | 0.00 | 0.01 | 0.00 | 0.01 | 0.00
Porcupine | 0.00 | 0.00 | 0.01 | 0.01 | 0.00 | 1.00 | 0.01 | 0.00 | 0.01 | 0.00
Goral | 0.00 | 0.02 | 0.03 | 0.01 | 0.00 | 0.00 | 0.89 | 0.00 | 0.03 | 0.02
Bear | 0.01 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.96 | 0.00 | 0.00
Wolf | 0.00 | 0.01 | 0.01 | 0.01 | 0.00 | 0.00 | 0.01 | 0.00 | 0.89 | 0.02
Zebra | 0.00 | 0.02 | 0.02 | 0.02 | 0.00 | 0.00 | 0.03 | 0.00 | 0.02 | 0.89

Table 5 F1-score for CNN method Elephant Gaur Sambar Wild Monkey Porcupine Goral Bear Wolf Zebra deer boar Elephant

0.96

0.02









0.04 –



Gaur

0.02

0.91

0.01







0.01



0.01

0.01

Sambar deer



0.02

0.90

0.01





0.02



0.01

0.02

Wild boar –

0.02

0.02

0.91





0.02



0.02

0.02

Monkey



0.01

0.01

0.97



0.01



0.01



0.00

0.01

0.01

0.00

0.98

0.01



0.01

0.00

0.88





Porcupine – Goral



0.02

0.03

0.01





Bear

0.01

0.01









Wolf



0.01

0.01

0.01





Zebra



0.02

0.02

0.02





0.03

0.02

0.97 0.00

0.00

0.01



0.91

0.02

0.03



0.02

0.89

6 Conclusion
The test was carried out to classify various wild animals in order to protect crops; once an animal is identified, a repelling action is initiated as part of the protection scheme. The results indicate that deep learning methods with the proposed setup can reduce crop damage from wild animal attacks. Among the described methods, CNN provided the best results, with an identification accuracy of 92.68% and a repelling time of less than 200 ms. All the described methods were validated in software on the same test system using Python code and TensorBoard.


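Table 6 Confusion matrix for RNN method
Class | Elephant | Gaur | Sambar deer | Wild boar | Monkey | Porcupine | Goral | Bear | Wolf | Zebra
Elephant | 937 | 27 | 0 | 0 | 0 | 0 | 0 | 36 | 0 | 0
Gaur | 28 | 931 | 12 | 0 | 0 | 0 | 9 | 0 | 12 | 8
Sambar deer | 0 | 12 | 921 | 11 | 0 | 0 | 17 | 0 | 18 | 21
Wild boar | 0 | 21 | 22 | 902 | 0 | 0 | 22 | 0 | 23 | 10
Monkey | 0 | 0 | 11 | 12 | 956 | 0 | 9 | 0 | 12 | 0
Porcupine | 0 | 21 | 12 | 22 | 9 | 895 | 12 | 0 | 12 | 17
Goral | 0 | 13 | 28 | 10 | 0 | 0 | 874 | 0 | 29 | 46
Bear | 12 | 13 | 0 | 0 | 0 | 0 | 0 | 952 | 12 | 11
Wolf | 0 | 12 | 29 | 13 | 0 | 0 | 12 | 0 | 906 | 28
Zebra | 0 | 31 | 25 | 0 | 0 | 0 | 38 | 0 | 32 | 874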

Table 7 Precision matrix for RNN method
Class | Elephant | Gaur | Sambar deer | Wild boar | Monkey | Porcupine | Goral | Bear | Wolf | Zebra
Elephant | 0.94 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00
Gaur | 0.03 | 0.93 | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | 0.01
Sambar deer | 0.00 | 0.01 | 0.92 | 0.01 | 0.00 | 0.00 | 0.02 | 0.00 | 0.02 | 0.02
Wild boar | 0.00 | 0.02 | 0.02 | 0.90 | 0.00 | 0.00 | 0.02 | 0.00 | 0.02 | 0.01
Monkey | 0.00 | 0.00 | 0.01 | 0.01 | 0.96 | 0.00 | 0.01 | 0.00 | 0.01 | 0.00
Porcupine | 0.00 | 0.02 | 0.01 | 0.02 | 0.01 | 0.90 | 0.01 | 0.00 | 0.01 | 0.02
Goral | 0.00 | 0.01 | 0.03 | 0.01 | 0.00 | 0.00 | 0.87 | 0.00 | 0.03 | 0.05
Bear | 0.01 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.95 | 0.01 | 0.01
Wolf | 0.00 | 0.01 | 0.03 | 0.01 | 0.00 | 0.00 | 0.01 | 0.00 | 0.91 | 0.03
Zebra | 0.00 | 0.03 | 0.03 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.03 | 0.87


Table 8 Recall matrix for RNN method
Class | Elephant | Gaur | Sambar deer | Wild boar | Monkey | Porcupine | Goral | Bear | Wolf | Zebra
Elephant | 0.96 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00
Gaur | 0.03 | 0.86 | 0.01 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.01 | 0.01
Sambar deer | 0.00 | 0.01 | 0.87 | 0.01 | 0.00 | 0.00 | 0.02 | 0.00 | 0.02 | 0.02
Wild boar | 0.00 | 0.02 | 0.02 | 0.93 | 0.00 | 0.00 | 0.02 | 0.00 | 0.02 | 0.01
Monkey | 0.00 | 0.00 | 0.01 | 0.01 | 0.99 | 0.00 | 0.01 | 0.00 | 0.01 | 0.00
Porcupine | 0.00 | 0.02 | 0.01 | 0.02 | 0.01 | 1.00 | 0.01 | 0.00 | 0.01 | 0.02
Goral | 0.00 | 0.01 | 0.03 | 0.01 | 0.00 | 0.00 | 0.88 | 0.00 | 0.03 | 0.05
Bear | 0.01 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.96 | 0.01 | 0.01
Wolf | 0.00 | 0.01 | 0.03 | 0.01 | 0.00 | 0.00 | 0.01 | 0.00 | 0.86 | 0.03
Zebra | 0.00 | 0.03 | 0.02 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.03 | 0.86

Table 9 F1-score for RNN method Elephant Gaur Sambar Wild Monkey Porcupine Goral Bear Wolf Zebra deer boar Elephant

0.95

0.03









0.04 –



Gaur

0.03

0.89

0.01







0.01



0.01

0.01

Sambar deer



0.01

0.89

0.01





0.02



0.02

0.02

Wild boar – Monkey



Porcupine –

0.02

0.02

0.92





0.02



0.02

0.01



0.01

0.01

0.97



0.01



0.01



0.02

0.01

0.02

0.01

0.94

0.01



0.01

0.02

Goral



0.01

0.03

0.01





0.88



0.03

0.05

Bear

0.01

0.01











0.96 0.01

0.01

Wolf



0.01

0.03

0.01





0.01



0.88

0.03

Zebra



0.03

0.02







0.04



0.03

0.87

Table 10 A comparative analysis
S. No. | Method | Complexity | Classification time (ms) | Time taken for reciprocating action (ms) | Classification efficiency (%)
1 | CNN | Low | 140 | 194 | 92.68
2 | RNN | Low | 156 | 210 | 91.48

Table 11 Comparison of latest works
Existing methods | Obtained accuracy for image classification using conventional CNN (%) | Proposed CNN method (%)
Yu et al. [21] | 82 | 92.68
Norouzzadeh et al. [22] | 92 | 92.68

References
1. Liu Y, Ma X, Shu L, Hancke GP, Abu-Mahfouz AM (2020) From industry 4.0 to agriculture 4.0: current status, enabling technologies, and research challenges. IEEE Trans Ind Inform 17(6):4322–4334
2. Farooq MS, Riaz S, Abid A, Abid K, Naeem MA (2019) A survey on the role of IoT in agriculture for the implementation of smart farming. IEEE Access 7:156237–156271
3. Kirkpatrick K (2019) Technologizing agriculture. Commun ACM 62(2):14–16
4. Ojo MO, Adami D, Giordano S (2020) Network performance evaluation of a LoRa-based IoT system for crop protection against ungulates. In: 2020 IEEE 25th international workshop on computer aided modeling and design of communication links and networks (CAMAD), Sept 2020. IEEE, pp 1–6
5. Levisse A, Rios M, Simon WA, Gaillardon PE, Atienza D (2019) Functionality enhanced memories for edge-AI embedded systems. In: 2019 19th non-volatile memory technology symposium (NVMTS), Oct 2019. IEEE, pp 1–4
6. Shuja J, Bilal K, Alasmary W, Sinky H, Alanazi E (2021) Applying machine learning techniques for caching in next-generation edge networks: a comprehensive survey. J Netw Comput Appl 181:103005
7. Dai W, Nishi H, Vyatkin V, Huang V, Shi Y, Guan X (2019) Industrial edge computing: enabling embedded intelligence. IEEE Ind Electron Mag 13(4):48–56
8. Li E, Zeng L, Zhou Z, Chen X (2019) Edge AI: on-demand accelerating deep neural network inference via edge computing. IEEE Trans Wireless Commun 19(1):447–457
9. Zhou Z, Chen X, Li E, Zeng L, Luo K, Zhang J (2019) Edge intelligence: paving the last mile of artificial intelligence with edge computing. Proc IEEE 107(8):1738–1762
10. Codeluppi G, Cilfone A, Davoli L, Ferrari G (2020) LoRaFarM: a LoRaWAN-based smart farming modular IoT architecture. Sensors 20(7):2028
11. Ojo MO, Adami D, Giordano S (2021) Experimental evaluation of a LoRa wildlife monitoring network in a forest vegetation area. Future Internet 13(5):115
12. Martinez-Alpiste I, Casaseca-de-la-Higuera P, Alcaraz-Calero J, Grecos C, Wang Q (2019) Benchmarking machine-learning-based object detection on a UAV and mobile platform. In: 2019 IEEE wireless communications and networking conference (WCNC), Apr 2019. IEEE, pp 1–6
13. Yu Y, Zhang K, Zhang D, Yang L, Cui T (2019) Optimized faster R-CNN for fruit detection of strawberry harvesting robot. In: 2019 ASABE annual international meeting. American Society of Agricultural and Biological Engineers, p 1
14. Shi R, Li T, Yamaguchi Y (2020) An attribution-based pruning method for real-time mango detection with YOLO network. Comput Electron Agric 169:105214
15. Wang J, Shen M, Liu L, Xu Y, Okinda C (2019) Recognition and classification of broiler droppings based on deep convolutional neural network. J Sens 2019
16. Aburasain RY, Edirisinghe EA, Albatay A (2020) Drone-based cattle detection using deep neural networks. In: Proceedings of SAI intelligent systems conference, Sept 2020. Springer, Cham, pp 598–611
17. Hong SJ, Han Y, Kim SY, Lee AY, Kim G (2019) Application of deep-learning methods to bird detection using unmanned aerial vehicle imagery. Sensors 19(7):1651
18. Partel V, Nunes L, Stansly P, Ampatzidis Y (2019) Automated vision-based system for monitoring Asian citrus psyllid in orchards utilizing artificial intelligence. Comput Electron Agric 162:328–336
19. Shadrin D, Menshchikov A, Ermilov D, Somov A (2019) Designing future precision agriculture: detection of seeds germination using artificial intelligence on a low-power embedded system. IEEE Sens J 19(23):11573–11582
20. Codeluppi G, Davoli L, Ferrari G (2021) Forecasting air temperature on edge devices with embedded AI. Sensors 21(12):3973
21. Yu X, Wang J, Kays R, Jansen PA, Wang T, Huang T (2013) Automated identification of animal species in camera trap images. EURASIP J Image Video Process 2013(1):1–10
22. Norouzzadeh MS, Nguyen A, Kosmala M, Swanson A, Palmer MS, Packer C, Clune J (2018) Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proc Natl Acad Sci 115(25):E5716–E5725
23. Jayson EA (1999) Studies on crop damage by wild animals in Kerala and evaluation of control measures. KFRI Res Rep (169)
24. https://krishijagran.com/featured/technology-to-reduce-economic-losses-in-agriculture-dueto-wildlife-attacks/
25. Mehta P, Negi A, Chaudhary R, Janjhua Y, Thakur P (2018) A study on managing crop damage by wild animals in Himachal Pradesh. Int J Agric Sci 10(12):6438–6442

Toward More Robust Classifier: Negative Log-Likelihood Aware Curriculum Learning
Indrajit Kar, Anindya Sundar Chatterjee, Sudipta Mukhopadhyay, and Vinayak Singh

Abstract Curriculum learning has shown immense potential for improving computer vision tasks. However, drawbacks remain for multiclass classification because of both data and model uncertainties. In this paper, we introduce a novel curriculum sampling strategy that takes into account uncertainty, confidence score, and negative log-likelihood (NLL). We also propose a novel method of grading the samples, which has proved very successful. Curriculum learning is usually put into practice during the training period; in our experimental setting, we apply it after the preliminary training is finished. We used the CIFAR-10 dataset and demonstrate the effectiveness of our approach through faster convergence, more accurate results, and a robust deep learning model for image classification. In contrast to the state of the art, where curriculum learning is applied before model training, we apply NLL-based curriculum learning (CL) post-training on the same model to achieve these results. The technique can also be applied to problems involving multiclass object detection and segmentation.

Keywords Curriculum learning · Uncertainty estimation · Negative log-likelihood · Image classification

I. Kar · A. S. Chatterjee · S. Mukhopadhyay (corresponding author) · V. Singh
Siemens Technology and Services Private Limited, Bengaluru, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_8


1 Introduction

The best way for a person to acquire knowledge and useful skills is to start with fundamental, straightforward concepts and then build on that knowledge to understand more complicated, challenging subjects. Bengio et al. first introduced the ideas of curriculum learning, which is defined as "training from easier data to harder data." The concept is to first train the model on a small set of easy data and then gradually feed in data of increasing difficulty [1]. This is where curriculum learning and conventional stochastic gradient descent differ. In conventional training, the classifier's weights are updated at each iteration using mini-batches selected at random from the training set. In contrast, curriculum learning arranges the mini-batches in ascending order of difficulty [2]. Curriculum learning methods are commonly characterized by two components: the scoring function, which scores the samples according to difficulty, and the pacing function, which decides how the number of samples is increased during training. There are two distinct approaches to scoring sample difficulty. The first is transfer learning from a network that has already been trained, such as Google's Inception network [3]: the pre-trained network is run on the training set, and the samples for which it has a poor classification score are identified, which gives the difficulty of each sample. The alternative approach, called bootstrapping, obtains the scores from a pre-trained network with the same architecture as the network being trained with curriculum learning [4]. The main distinction between the two scoring methods is that bootstrapping uses the same network to compute the scores, whereas transfer learning uses a pre-trained network that does not share parameters with the network trained with curriculum learning.

The other concept is the pacing function, which introduces progressively more difficult instances into the training set. It is parameterized by the step length, the increment applied at each step, and the starting percent. The first pacing function, fixed exponential pacing, starts from a small portion of the training data and increases it exponentially after a fixed number of iterations. Varied exponential pacing follows the same concept, except that the step length can now vary, which adds a hyperparameter to the pacing function. The simplest approach is single-step pacing, which samples first from the simplest cases and then from the entire dataset.

This study introduces a novel approach to curriculum learning based on uncertainty-aware negative log-likelihood, and the results demonstrate a balance between robust classification and effective curriculum learning. By applying structured rather than random sampling, the inputs fed to the neural network can be ordered by ascending difficulty (curriculum) or by descending difficulty (the anti-curriculum technique).
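
The pacing functions described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the samples are already sorted from easiest to hardest, and the names `fixed_exponential_pacing`, `single_step_pacing`, `curriculum_batch`, `starting_percent`, `increase`, and `step_length` are hypothetical labels for the parameters named in the text.

```python
import random


def fixed_exponential_pacing(iteration, n_samples, starting_percent=0.1,
                             increase=2.0, step_length=100):
    """Number of easiest samples exposed at a given training iteration.

    The exposed fraction starts at `starting_percent` and is multiplied by
    `increase` after every `step_length` iterations, capped at the full set.
    """
    fraction = starting_percent
    for _ in range(iteration // step_length):
        fraction *= increase
        if fraction >= 1.0:
            return n_samples
    return max(1, int(fraction * n_samples))


def single_step_pacing(iteration, n_samples, starting_percent=0.1,
                       step_length=100):
    """Expose only the easiest fraction first, then the whole training set."""
    if iteration < step_length:
        return max(1, int(starting_percent * n_samples))
    return n_samples


def curriculum_batch(sorted_indices, iteration, batch_size, pacing, rng):
    """Draw a mini-batch uniformly from the currently exposed easy prefix."""
    limit = pacing(iteration, len(sorted_indices))
    pool = sorted_indices[:limit]
    return rng.sample(pool, min(batch_size, len(pool)))


# Assumption: indices are already sorted from easiest (0) to hardest (999).
indices = list(range(1000))
rng = random.Random(0)
early_batch = curriculum_batch(indices, 0, 32, fixed_exponential_pacing, rng)
late_batch = curriculum_batch(indices, 400, 32, fixed_exponential_pacing, rng)
```

Early in training the batch is drawn only from the easiest tenth of the data; once the exposed fraction reaches one, sampling is uniform over the whole set, which matches ordinary mini-batch training.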


2 Literature Survey

Curriculum learning (CL) is crucial for training a deep learning architecture from simpler to more complex data. The two key benefits of applying CL training strategies in real-world settings are faster training and better model performance on target tasks, which address two of the most important needs in machine learning research. For instance, in [5], compared with traditional training without a curriculum, CL helps a neural machine translation model reduce training time by up to 70% and improve performance by up to 2.2 BLEU points. In [6], CL yields a relative MAP improvement of 45.8% over normal batch training, with faster convergence, in the multimedia event detection task. In [7], CL allows reinforcement learning agents to solve complex goal-oriented puzzles that they cannot solve without a curriculum. In addition to these two key benefits, CL is simple to employ since, according to most of the CL literature, it is a flexible plug-and-play submodule that is independent of the original training procedure.

Most predefined difficulty measurers are developed for image and text data in various CV and NLP contexts. With the exception of some measures based on domain knowledge [8], predefined difficulty measurers are constructed from the angles of complexity, diversity, and noise estimation, which are distinct but correlated. First, complexity refers to the structural complexity of the data: data with higher structural complexity have more dimensions and are therefore more difficult for models to represent [9]. Examples include the number of objects in an image for semantic segmentation [10], or the number of conjunctions (such as "and" or "or") or phrases (such as prepositional phrases) [11], which measures the complexity of the instruction set in task execution.

Second, diversity refers to the distributional diversity of a group of data elements, such as words or regular and irregular shapes [1], or of a group of data (e.g., sentences). A higher diversity value indicates that the data are more varied, contain more (unusual) kinds of data or elements, and are therefore harder for models to learn. Both high complexity and high diversity give the data more degrees of freedom, requiring a model with larger capacity and greater training effort. Data can also become noisier as diversity increases. A third angle is therefore noise estimation, which quantifies the amount of noise in data instances and treats cleaner data as easier. Following an intuitive approach used in [12], images retrieved from search engines such as Google are considered cleaner, whereas images posted on photo-sharing websites such as Flickr are considered more realistic and noisy. In [13], the authors use CNNs to map images to vectors and assume that cleaner images tend to resemble one another and so have higher local density values; consequently, examples with lower local density should be noisier and more difficult to predict. Additionally, the signal-to-noise ratio/distortion (SNR/SND) [14, 15] is frequently used to evaluate noise levels. Signal intensity [16, 17] and human-annotation-based image difficulty scores [18, 19], both intended for image data, are other potential difficulty measurers.


Signal intensity can be used to estimate how informative a given data feature is. For instance, more exaggerated faces are considered easier data than poker faces in the task of facial expression recognition [16]. More severe symptoms are easier to identify and provide more information for diagnosing thoracic disease [17]. Additionally, the image difficulty score [19] is proposed to quantify the difficulty of an image from annotators' response times, as follows: (i) annotators are asked whether the image contains a given object class (such as an elephant), and (ii) the time the annotator takes to respond "Yes" or "No" is measured and used to calculate the image difficulty score. Naturally, a harder visual example corresponds to a longer response time. After collecting the annotation data, the authors build a regression model that maps the CNN features of new images to the difficulty score.

Using a multi-scale convolutional neural network (CNN) with curriculum learning, Lotter et al. [20] classified mammograms on the DDSM dataset and obtained an AUC-ROC score of 0.92. In 2019, Hacohen and Weinshall [4] demonstrated the influence of curriculum learning on the training of CNNs for image recognition using the CIFAR-10, CIFAR-100, and ImageNet subset datasets, based on nonuniform sampling of mini-batches. Park et al. [21] evaluated the effectiveness of curriculum learning on two CXR image datasets. The baseline model had AUC values of 0.967 and 0.99, whereas the curriculum learning-based model had AUC scores of 0.99 and 1.00 for the detection of pulmonary abnormalities. Wei et al. [22] used curriculum learning for the classification of histopathology images on a dataset of 3152 samples. Vanilla training achieved an AUC score of 0.837, whereas curriculum learning achieved 0.882, showing that curriculum learning improved the classifier. Wang et al. [23] introduced two curriculum schedulers, for sampling and for loss backpropagation, and proposed a unified framework named dynamic curriculum learning (DCL), which adaptively adjusts the sampling strategy and the loss weight of each branch for better generalization and discrimination ability. They outperformed the state of the art on the face attribute dataset CelebA, the pedestrian attribute dataset RAP, and the CIFAR-100 dataset. Tang et al. [17] proposed an attention-guided curriculum learning (AGCL) framework for joint classification and weakly supervised localization of thoracic diseases from chest radiographs. In 2020, Yu et al. [24] presented a multi-task curriculum learning framework based on estimating the probability that a sample is out-of-distribution (OOD).

Negative log-likelihood is primarily used as the loss function in multiclass classification. In 2019, Yao et al. [25] proposed a discriminative loss function based on the negative log-likelihood ratio (NLLR) between the correct and competing classes, replacing the cross-entropy loss. The MNIST and CIFAR-10 datasets were used in the study. With the NLLR loss function, the architecture reached an accuracy of 0.8601 on CIFAR-10 and 0.9928 on MNIST, outperforming other loss functions on the CIFAR-10 dataset.


3 Uncertainty Estimation

Uncertainty estimation and quantification (UQ) methods play an important role in reducing uncertainties so that the model can be optimized properly.

3.1 Mathematical Formulas for Uncertainty Quantification

Predictive uncertainty results from uncertain model parameters, the irreducible noise in the data, and the distributional difference between training and testing data [22]. It has two components, aleatoric uncertainty (AU) and epistemic uncertainty (EU), so that the predictive uncertainty (PU) is

$$\mathrm{PU} = \mathrm{AU} + \mathrm{EU}.$$

Epistemic uncertainty can be formulated as a probability distribution over the model parameters. Let the training dataset be defined as

$$d_{\mathrm{train}} = \{p, q\} = \{(p_i, q_i)\}_{i=1}^{N}.$$

The inputs are $p_i \in \mathbb{R}^d$ and the corresponding class labels are $q_i \in \{1, \ldots, n\}$, where $n$ denotes the number of classes. The main target is to optimize the parameters $\omega$ of the output-producing function $f^{\omega}(p)$. For a given model, the likelihood is $P(q \mid p, \omega)$. For classification, the softmax likelihood is used:

$$P(q = n \mid p, \omega) = \frac{\exp\left(f_n^{\omega}(p)\right)}{\sum_{n'} \exp\left(f_{n'}^{\omega}(p)\right)}, \tag{1}$$

and for regression a Gaussian likelihood is assumed:

$$P(q \mid p, \omega) = \mathcal{N}\left(q;\, f^{\omega}(p),\, \sigma^{-1} I\right), \tag{2}$$

where $\sigma$ represents the model precision. For a given dataset $d_{\mathrm{train}}$, the posterior distribution over the parameters $\omega$ is $P(\omega \mid p, q)$. Applying Bayes' theorem, it can be written as

$$P(\omega \mid p, q) = \frac{P(q \mid p, \omega)\, P(\omega)}{P(q \mid p)}. \tag{3}$$
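
To make Bayes' theorem in (3) concrete, the sketch below evaluates the posterior on a small grid of candidate parameter values. It is an illustration under stated assumptions, not part of the paper: the toy two-class model with logits `(0, omega * p)`, the uniform prior, the data, and all function names are hypothetical, and class labels are 0-based here for convenience.

```python
import math


def softmax(logits):
    """Softmax likelihood of Eq. (1), shifted by the maximum for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def likelihood(omega, inputs, labels):
    """P(q | p, omega) for a toy two-class model with logits (0, omega * p)."""
    prob = 1.0
    for p, q in zip(inputs, labels):
        prob *= softmax([0.0, omega * p])[q]
    return prob


def grid_posterior(omegas, prior, inputs, labels):
    """Bayes' theorem on a finite grid: likelihood * prior / evidence."""
    joint = [likelihood(w, inputs, labels) * pr for w, pr in zip(omegas, prior)]
    evidence = sum(joint)  # P(q | p), the normalizing constant in Eq. (3)
    return [j / evidence for j in joint]


omegas = [-2.0, -1.0, 0.0, 1.0, 2.0]   # candidate parameter values
prior = [0.2] * len(omegas)            # uniform prior P(omega)
inputs = [-1.5, -0.5, 0.5, 1.5]        # toy inputs p_i
labels = [0, 0, 1, 1]                  # toy labels q_i (0-based)
posterior = grid_posterior(omegas, prior, inputs, labels)
```

For a real network the evidence cannot be summed over a grid like this, which is why the posterior has to be approximated.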

If $p^{*}$ is a given test sample, its class label $q^{*}$ can be predicted from the posterior $P(\omega \mid p, q)$ using

$$P(q^{*} \mid p^{*}, p, q) = \int P(q^{*} \mid p^{*}, \omega)\, P(\omega \mid p, q)\, d\omega. \tag{4}$$

This process is termed inference or marginalization. $P(\omega \mid p, q)$ cannot be computed analytically, but it may be approximated by a variational distribution $x_{\theta}(\omega)$ with variational parameters $\theta$. The main target is an approximating distribution that is as close as possible to the posterior, so the Kullback–Leibler (KL) divergence is minimized with respect to $\theta$. The similarity between the two distributions is measured by

$$\mathrm{KL}\left(x_{\theta}(\omega) \,\|\, P(\omega \mid p, q)\right) = \int x_{\theta}(\omega) \log \frac{x_{\theta}(\omega)}{P(\omega \mid p, q)}\, d\omega. \tag{5}$$

After minimizing the KL divergence, the predictive distribution is approximated as

$$P(q^{*} \mid p^{*}, p, q) \approx \int P(q^{*} \mid p^{*}, \omega)\, x_{\theta}^{*}(\omega)\, d\omega =: x_{\theta}^{*}(q^{*}, p^{*}). \tag{6}$$
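
The KL divergence in (5) and the approximate predictive distribution in (6) can be illustrated numerically. The sketch below is a hedged example rather than the authors' code: it uses a discrete analogue of (5), and it assumes a Gaussian variational distribution over a single weight of a toy two-class model; `kl_divergence`, `mc_predictive`, and their arguments are hypothetical names.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(x, p):
    """Discrete analogue of Eq. (5): sum of x(w) * log(x(w) / p(w))."""
    return sum(xi * math.log(xi / pi) for xi, pi in zip(x, p) if xi > 0.0)


def mc_predictive(p_star, mean, std, n_samples, rng):
    """Monte Carlo estimate of Eq. (6) for a toy two-class model.

    Weights are drawn from an assumed Gaussian variational distribution
    N(mean, std^2); the logits for the test input are (0, omega * p_star).
    """
    totals = [0.0, 0.0]
    for _ in range(n_samples):
        omega = rng.gauss(mean, std)
        probs = softmax([0.0, omega * p_star])
        totals = [t + pr for t, pr in zip(totals, probs)]
    return [t / n_samples for t in totals]


rng = random.Random(0)
predictive = mc_predictive(1.0, mean=2.0, std=0.5, n_samples=1000, rng=rng)
gap = kl_divergence([0.9, 0.1], [0.5, 0.5])
```

Averaging the softmax outputs over sampled weights replaces the integral over the weights with a finite sum, which is how the predictive distribution is usually estimated in practice.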

Here $x_{\theta}^{*}$ denotes the optimum of the objective. Minimizing the KL divergence can be rearranged into maximizing the evidence lower bound (ELBO):

$$D_{\mathrm{VI}}(\theta) := \int x_{\theta}(\omega) \log P(q \mid p, \omega)\, d\omega - \mathrm{KL}\left(x_{\theta}(\omega) \,\|\, P(\omega)\right), \tag{7}$$

where maximizing the first term makes $x_{\theta}(\omega)$ describe the data well, and minimizing the second term keeps it as close as possible to the prior $P(\omega)$. This procedure is variational inference (VI). For approximate inference in complex models, the most common approach is dropout VI, whose minimization objective is

$$D(\theta, \rho) = -\frac{1}{N} \sum_{i=1}^{N} \log P(q_i \mid p_i, \omega) + \frac{1 - \rho}{2N} \|\theta\|^{2}, \tag{8}$$

where $\rho$ and $N$ denote the dropout probability and the number of samples, respectively. The aim now is to obtain data-dependent uncertainty, so the precision defined in (2) is formulated as a function of the data. One approach to obtaining epistemic uncertainty is to combine two functions, the predictive mean $f^{\theta}(p)$ and the model precision $h^{\theta}(p)$, so that the likelihood can be written as

$$q_i = \mathcal{N}\left(f^{\theta}(p),\, h^{\theta}(p)^{-1}\right). \tag{9}$$
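
The dropout objective in (8) is straightforward to evaluate once the per-sample log-likelihoods are available. The following sketch assumes those log-likelihoods and a flat list of weights are given; the function name `dropout_vi_objective` and the example values are hypothetical.

```python
import math


def dropout_vi_objective(log_likelihoods, weights, dropout_prob):
    """Eq. (8): mean negative log-likelihood plus a scaled L2 penalty."""
    n = len(log_likelihoods)
    nll = -sum(log_likelihoods) / n
    l2 = sum(w * w for w in weights)          # squared norm of theta
    return nll + (1.0 - dropout_prob) / (2.0 * n) * l2


# Assumed probabilities that a model assigns to the correct class of 3 samples.
log_liks = [math.log(0.9), math.log(0.6), math.log(0.8)]
weights = [0.5, -1.0, 2.0]
objective = dropout_vi_objective(log_liks, weights, dropout_prob=0.5)
```

A larger dropout probability shrinks the weight penalty, while the first term is the same average negative log-likelihood that is used later as the curriculum score.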


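A prior distribution is placed over the weights of the model, and the amount of change is then computed. The Euclidean distance loss

$$(\mathrm{ED})^{W_1, W_2, b}(p, q) = \frac{1}{2N} \sum_{i=1}^{N} \left\| q_i - \hat{q}_i \right\|^{2} \tag{10}$$

is adapted as follows:

$$(\mathrm{ED})^{W_1, W_2, b} := \frac{1}{2} \left(q - f^{W_1, W_2, b}(p)\right) h^{W_1, W_2, b}(p) \left(q - f^{W_1, W_2, b}(p)\right)^{T} - \frac{1}{2} \log \det h^{W_1, W_2, b}(p) + \frac{M}{2} \log 2\pi = -\log \mathcal{N}\left(f^{\theta}(p),\, h^{\theta}(p)^{-1}\right).$$

Hence, the predictive variance is

$$\widetilde{\mathrm{Var}}(q^{*}) := h^{\theta}(p^{*})^{-1} I + \frac{1}{T} \sum_{t=1}^{T} f^{\tilde{\omega}_t}(p^{*})^{T} f^{\tilde{\omega}_t}(p^{*}) - \tilde{E}(q^{*})^{T}\, \tilde{E}(q^{*}) \xrightarrow[T \to \infty]{} \mathrm{Var}_{x_{\theta}^{*}(q^{*} \mid p^{*})}(q^{*}),$$

where $T$ is the number of stochastic forward passes and $\tilde{\omega}_t$ are the weights sampled in pass $t$.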
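
For a scalar output, the adapted loss and the predictive variance reduce to a few lines of arithmetic. The sketch below is an assumption-laden illustration: it treats the output as one-dimensional (M = 1), takes the noise variance to be the inverse of the precision, and uses the hypothetical names `gaussian_nll` and `mc_predictive_variance`.

```python
import math


def gaussian_nll(q, mean, precision):
    """Scalar form of the adapted loss: -log N(q; mean, 1 / precision)."""
    return (0.5 * precision * (q - mean) ** 2
            - 0.5 * math.log(precision)
            + 0.5 * math.log(2.0 * math.pi))


def mc_predictive_variance(sampled_means, precision):
    """Scalar predictive variance from T stochastic forward passes.

    The result is the noise variance (1 / precision) plus the spread of the
    sampled predictive means around their average.
    """
    t = len(sampled_means)
    mean = sum(sampled_means) / t
    second_moment = sum(m * m for m in sampled_means) / t
    return 1.0 / precision + second_moment - mean ** 2


loss = gaussian_nll(q=0.0, mean=0.0, precision=1.0)
variance = mc_predictive_variance([1.0, 3.0], precision=4.0)
```

The first term of the variance reflects noise in the data, while the remaining terms grow when the sampled networks disagree, which is the epistemic part.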

4 Negative Log-Likelihood and Uncertainty

4.1 Likelihood Versus Probability

Likelihood expresses how plausible a particular parameter value of a model is given some observed data; that is, for a fixed set of data, the likelihood is used to assess how likely it is that the model has a certain parameter value.

4.2 Maximum Likelihood Estimation

When building statistical machine learning models, there are often several parameters whose values need to be found. The main objective is to find the optimal parameters for the model, i.e., given f(x | a, b, c, ...), to find a, b, c, .... For example, when fitting a Gaussian distribution to a set of data, there are two parameters, the mean μ and the standard deviation σ, and maximum likelihood estimation determines which parameter values, i.e., which curve, were most likely to have generated the data points. Maximum likelihood estimation is thus a way of determining the parameters of the model that best fit the given data.
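
As a small worked example of maximum likelihood estimation for a Gaussian, the sketch below computes the closed-form estimates of μ and σ and compares the log-likelihood at those values with nearby alternatives. The data values and function names are made up for illustration.

```python
import math


def gaussian_mle(data):
    """Closed-form maximum likelihood estimates of the mean and std."""
    n = len(data)
    mu = sum(data) / n
    # The MLE of the variance divides by n, not by n - 1.
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return mu, sigma


def gaussian_log_likelihood(data, mu, sigma):
    """Log-likelihood of the data under N(mu, sigma^2)."""
    return sum(-0.5 * math.log(2.0 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2.0 * sigma ** 2) for x in data)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, sigma_hat = gaussian_mle(data)
best = gaussian_log_likelihood(data, mu_hat, sigma_hat)
shifted = gaussian_log_likelihood(data, mu_hat + 0.5, sigma_hat)
wider = gaussian_log_likelihood(data, mu_hat, 1.5 * sigma_hat)
```

Any other choice of mean or standard deviation gives a lower log-likelihood on the same data, which is what "most likely to have generated the data points" means.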


Fig. 1 SoftMax activation function and negative log-likelihood as the loss function in a neural network

4.3 Negative Log-Likelihood and Its Relationship with SoftMax Activations

SoftMax is the most commonly used activation at the output layer of a neural network in multiclass problems, where each input belongs to one of N classes, e.g., the 10 classes of the CIFAR-10 dataset. It is used to classify the features extracted from the input image (i × j pixels) into one of those 10 classes. SoftMax squashes each entry of a vector of size N (here 10) into the range between 0 and 1. Because it normalizes the exponentials, the entries of the output vector sum to 1, so they can be read as the probabilities that the extracted features belong to each class. The SoftMax function is used alongside the negative log-likelihood (NLL) loss, L(y) = −log(y), where y is the SoftMax probability assigned to the correct class. The goal of the neural network during training is to find the minimum of the loss function with respect to the weights and biases: the lower the loss, the better. With NLL as the loss function, higher probabilities for the correct class are therefore better. The total loss is the sum of the per-sample losses for the correct classes; when the model assigns high confidence to the correct class the loss is low, and when it assigns low confidence to the correct class the loss is high (Fig. 1).
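
The relationship between SoftMax and NLL can be checked with a few lines of code. This is a minimal sketch with three made-up logits instead of the ten CIFAR-10 classes; `softmax` and `nll` are illustrative helper names.

```python
import math


def softmax(logits):
    """Map raw scores to probabilities that lie in (0, 1) and sum to 1."""
    m = max(logits)                       # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nll(probs, true_class):
    """Negative log-likelihood L(y) = -log(y) of the correct class."""
    return -math.log(probs[true_class])


logits = [2.0, 1.0, 0.1]                  # assumed raw network outputs
probs = softmax(logits)
loss_if_class_0 = nll(probs, 0)           # correct class has highest score
loss_if_class_2 = nll(probs, 2)           # correct class has lowest score
```

The loss is small when the correct class receives most of the probability mass and grows without bound as that probability approaches zero.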

5 Methodology

For training, the ResNet [26], EfficientNet [27], and ResNeXt [28] architectures were applied to the CIFAR-10 dataset [29]. EfficientNet performed best, and its predictions are shown below. It also provides a measure of prediction uncertainty. The scoring and pacing were determined using this measure of uncertainty, the confidence score, and the negative log-likelihood. Taking these quantities into account, we create ten axes on a two-dimensional plane, with each axis representing the likelihood of one class.


Fig. 2 10-simplex

In Fig. 2, the first edge indicates the probability of being an automobile, the second edge a bird, the third edge a cat, and so on. On any axis or edge, 1 is the greatest value. A geometric coordinate depicts the discrete probability distribution over the 10 classes. The 10-edge simplex is drawn as a 2D plane on which the coordinates must sum to one, because the model outputs the probabilities of all classes; every discrete probability distribution over the 10 classes therefore lies on this plane. Each prediction, along with its confidence score, was plotted on the edges for the CIFAR-10 images, using the validation and test sets only. When the images are plotted onto the simplex, it is visually evident that images the model is more certain about tend to lie close to the simplex's edges, while more uncertain images tend to lie closer to the simplex's center, which is equidistant from the 10 classes. Therefore, for the same inputs, the model provides a discrete distribution over classes, and the predictions are a collection of points on the simplex (Fig. 3). Grad-CAM and Ablation-CAM are used to examine the class activation maps in order to determine what causes a high NLL. For the true classes, the negative log-likelihood is lower. The model confuses a truck with a dog because of the unusual image structure, and likewise an ostrich with a deer because the activations are similar: parts resembling an ostrich, most likely the legs, are present in the image of the deer. To train the model, additional curriculum bins were created in the following ranges: 0–9, 10–19, 20–29, 30–39, 40–49, 50–59, 60–69, 70–79, 80–89, and 90–100. These are referred to as the curriculum bins.
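
The binning and the center-versus-edge behaviour on the simplex can be sketched as follows. Two assumptions are made here that the text does not state: the bin ranges are read as confidence scores in percent, and Shannon entropy is used as a simple stand-in for "distance from the simplex center". The names `confidence_bin` and `entropy` are hypothetical.

```python
import math


def confidence_bin(confidence_percent):
    """Index (0 to 9) of the bin 0-9, 10-19, ..., 90-100 holding the score."""
    if not 0.0 <= confidence_percent <= 100.0:
        raise ValueError("confidence must lie in [0, 100]")
    return min(int(confidence_percent // 10), 9)   # 100 joins the last bin


def entropy(probs):
    """Shannon entropy: 0 at a simplex vertex, log(n) at the simplex center."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


certain = [0.91] + [0.01] * 9      # prediction close to one class
uncertain = [0.1] * 10             # prediction at the center of the simplex
certain_bin = confidence_bin(100.0 * max(certain))
```

A confident prediction lands in the top bin and has low entropy, whereas the uniform prediction has the maximum entropy of log(10) and sits at the center of the simplex.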


Fig. 3 Graphs of train and validation losses per epoch without curriculum learning (a) and using curriculum learning (b)

Steps to construct the bins for the curriculum:
• Set A: for each class, sort the complete collection of images from the validation and test sets from highest to lowest confidence score.
• Set B: across classes, sort the entire collection of images from the validation and test sets from highest to lowest confidence score, where, given the true class, the score considered is the confidence assigned to the other (non-true) classes.
• Calculate the negative log-likelihood of each image in Set A and Set B and sort the results from highest to lowest, constructing Set ANL1 and Set BNL1.
• Plot Set ANL1 on the 10-edge simplex, with the red inner boundary representing the greatest degree of uncertainty and the outer boundary representing the greatest certainty.
• Follow the same procedure for Set BNL1 (Figs. 4 and 5).
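
One possible reading of the steps above is sketched below. It is not the authors' code, and it makes two labeled assumptions: the "non-class" confidence in Set B is taken to be the largest probability assigned to a wrong class, and the NLL in each set is computed from the confidence score held in that set. The record layout and the name `build_curriculum_sets` are hypothetical.

```python
import math


def build_curriculum_sets(records):
    """Build Set A, Set B and their NLL-sorted versions from model outputs.

    `records` is a list of (image_id, true_class, probs) tuples, where
    `probs` is the softmax output of an already trained model.
    """
    set_a, set_b = [], []
    for image_id, true_class, probs in records:
        true_conf = probs[true_class]
        # Assumption: the non-class score is the largest wrong-class probability.
        wrong_conf = max(p for c, p in enumerate(probs) if c != true_class)
        set_a.append({"id": image_id, "cls": true_class, "conf": true_conf})
        set_b.append({"id": image_id, "cls": true_class, "conf": wrong_conf})
    for row in set_a + set_b:
        # NLL of the score held in each set (clamped to avoid log(0)).
        row["nll"] = -math.log(max(row["conf"], 1e-12))
    set_a.sort(key=lambda r: (r["cls"], -r["conf"]))   # per class, high to low
    set_b.sort(key=lambda r: -r["conf"])               # across classes
    anl1 = sorted(set_a, key=lambda r: -r["nll"])      # highest NLL first
    bnl1 = sorted(set_b, key=lambda r: -r["nll"])
    return set_a, set_b, anl1, bnl1


# Toy three-class outputs standing in for CIFAR-10 predictions.
records = [
    ("img0", 0, [0.90, 0.05, 0.05]),
    ("img1", 0, [0.40, 0.50, 0.10]),
    ("img2", 1, [0.20, 0.70, 0.10]),
]
set_a, set_b, anl1, bnl1 = build_curriculum_sets(records)
```

In this toy run the misclassified image `img1` has the highest NLL in Set A, so it would be treated as the hardest sample in the curriculum.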

Fig. 4 Bins per classes for CIFAR-10 and cross classes


Fig. 5 Set ANL1 and BNL1

6 Results

After testing with the best model, EfficientNet, and repeating the aforementioned curriculum stages, a WGAN was used to generate additional curriculum images to add to the curriculum bins. The results demonstrate that the approach is more accurate, that the confidence score distribution is more positively skewed and leptokurtic than before (when it was mesokurtic), and that the learning curves are significantly smoother and converge more smoothly.
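
Skewness and kurtosis of a confidence score distribution can be computed from its central moments. The sketch below uses the population (biased) moment estimators and made-up values; it is only meant to show what "positively skewed" and "leptokurtic" measure, and `skewness_and_excess_kurtosis` is a hypothetical helper name.

```python
def skewness_and_excess_kurtosis(values):
    """Population skewness and excess kurtosis from central moments.

    Excess kurtosis is about 0 for a mesokurtic (normal-like) distribution
    and positive for a leptokurtic one. Values must not all be identical.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]      # toy values, not paper data
long_tail = [1.0, 1.0, 1.0, 1.0, 10.0]     # toy values with a right tail
sym_skew, sym_kurt = skewness_and_excess_kurtosis(symmetric)
tail_skew, tail_kurt = skewness_and_excess_kurtosis(long_tail)
```

Applying the same function to the confidence scores before and after the curriculum would make the reported change in distribution shape directly comparable.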

7 Conclusion

According to the study's findings, using curriculum learning to classify images decreases the negative log-likelihood, which increases the certainty of the outcomes. With this approach, the confidence score is raised for images that are more challenging for the best model to classify. For the same inputs, the model gives a discrete distribution over the 10 classes as well as a set of points on the simplex. With this method, the multiclass image segmentation and classification capabilities of any trained model may be improved by using the validation and test sets as an NLL curriculum. Finally, this approach also successfully reduced the epistemic and aleatoric uncertainties.

References
1. Bengio Y, Louradour J, Collobert R, Weston J (2009) Curriculum learning. In: Proceedings of the 26th annual international conference on machine learning, June 2009, pp 41–48
2. Peng X, Li L, Wang FY (2019) Accelerating minibatch stochastic gradient descent using typicality sampling. IEEE Trans Neural Netw Learn Syst 31(11):4649–4659
3. Brewer E, Lin J, Kemper P, Hennin J, Runfola D (2021) Predicting road quality using high resolution satellite imagery: a transfer learning approach. PLoS ONE 16(7):e0253370
4. Hacohen G, Weinshall D (2019) On the power of curriculum learning in training deep networks. In: International conference on machine learning, May 2019. PMLR, pp 2535–2544
5. Liu F, Ge S, Wu X (2022) Competence-based multimodal curriculum learning for medical report generation. arXiv preprint arXiv:2206.14579
6. Jiang L, Meng D, Yu S-I, Lan Z, Shan S, Hauptmann A (2014) Self-paced learning with diversity. In: Advances in neural information processing systems, vol 27
7. Klink P, Yang H, D’Eramo C, Peters J, Pajarinen J (2022) Curriculum reinforcement learning via constrained optimal transport. In: International conference on machine learning. PMLR, pp 11341–11358
8. Penha G, Hauff C (2020) Curriculum learning strategies for IR. In: European conference on information retrieval, Apr 2020. Springer, Cham, pp 699–713
9. Zhou Y, Yang B, Wong DF, Wan Y, Chao LS (2020) Uncertainty-aware curriculum learning for neural machine translation. In: Proceedings of the 58th annual meeting of the association for computational linguistics, July 2020, pp 6934–6944
10. Wei Y, Liang X, Chen Y, Shen X, Cheng M-M, Feng J, Zhao Y, Yan S (2016) STC: a simple to complex framework for weakly-supervised semantic segmentation. IEEE Trans Pattern Anal Mach Intell 39(11):2314–2320
11. Kocmi T, Bojar O (2017) Curriculum learning and minibatch bucketing in neural machine translation. arXiv preprint arXiv:1707.09533
12. Chen X, Gupta A (2015) Webly supervised learning of convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 1431–1439
13. Guo S, Huang W, Zhang H, Zhuang C, Dong D, Scott MR, Huang D (2018) CurriculumNet: weakly supervised learning from large-scale web images. In: Proceedings of the European conference on computer vision (ECCV), pp 135–150
14. Braun S, Neil D, Liu S-C (2017) A curriculum learning method for improved noise robustness in automatic speech recognition. In: 2017 25th European signal processing conference (EUSIPCO). IEEE, pp 548–552
15. Ranjan S, Hansen JHL (2017) Curriculum learning based approaches for noise robust speaker recognition. IEEE/ACM Trans Audio Speech Lang Process 26(1):197–210
16. Gui L, Baltrušaitis T, Morency L-P (2017) Curriculum learning for facial expression recognition. In: 2017 12th IEEE international conference on automatic face & gesture recognition (FG 2017). IEEE, pp 505–511
17. Tang Y, Wang X, Harrison AP, Lu L, Xiao J, Summers RM (2018) Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs. In: International workshop on machine learning in medical imaging. Springer, Cham, pp 249–258
18. Soviany P, Ardei C, Ionescu RT, Leordeanu M (2020) Image difficulty curriculum for generative adversarial networks (CuGAN). In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 3463–3472
19. Tudor Ionescu R, Alexe B, Leordeanu M, Popescu M, Papadopoulos DP, Ferrari V (2016) How hard can it be? Estimating the difficulty of visual search in an image. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2157–2166
20. Lotter W, Sorensen G, Cox D (2017) A multi-scale CNN and curriculum learning strategy for mammogram classification. In: Deep learning in medical image analysis and multimodal learning for clinical decision support. Springer, Cham, pp 169–177
21. Park B, Cho Y, Lee G, Lee SM, Cho YH, Lee ES et al (2019) A curriculum learning strategy to enhance the accuracy of classification of various lesions in chest-PA X-ray screening for pulmonary abnormalities. Sci Rep 9(1):1–9
22. Wei J, Suriawinata A, Ren B, Liu X, Lisovsky M, Vaickus L et al (2021) Learn like a pathologist: curriculum learning by annotator agreement for histopathology image classification. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 2473–2483
23. Wang Y, Gan W, Yang J, Wu W, Yan J (2019) Dynamic curriculum learning for imbalanced data classification. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 5017–5026
24. Yu Q, Ikami D, Irie G, Aizawa K (2020) Multi-task curriculum framework for open-set semi-supervised learning. In: European conference on computer vision, Aug 2020. Springer, Cham, pp 438–454
25. Yao H, Zhu DL, Jiang B, Yu P (2019) Negative log likelihood ratio loss for deep neural network classification. In: Proceedings of the future technologies conference, Oct 2019. Springer, Cham, pp 276–282
26. Wu Z, Shen C, Van Den Hengel A (2019) Wider or deeper: revisiting the ResNet model for visual recognition. Pattern Recogn 90:119–133
27. Koonce B (2021) EfficientNet. In: Convolutional neural networks with swift for tensorflow. Apress, Berkeley, CA, pp 109–123
28. Orhan AE (2019) Robustness properties of Facebook’s ResNeXt WSL models. arXiv preprint arXiv:1907.07640
29. https://www.cs.toronto.edu/~kriz/cifar.html

Design of Fuzzy Logic Controller-Based DPFC Device for Solar-Wind Hybrid System
V. Sowmya Sree, G. Panduranga Reddy, and C. Srinivasa Rao

Abstract Grid-connected systems rely heavily on electricity generated from Renewable Energy Sources. Connecting Renewable Energy Sources to the grid has given rise to power quality problems, which cause harmonics, voltage swells, sags, and other grid concerns. As solar and wind energy are both free and environmentally beneficial, they are regarded as the best options for remote (or rural) electricity. The combination of solar and wind power is a reliable source of energy that creates a constant energy flow by avoiding fluctuations. However, this hybrid system gives rise to complications related to power system stability. Most industrial loads are controlled by power electronic converters that are sensitive to power system disturbances. Hence, the mitigation of power quality issues has received more attention in recent times, as it is vital to the power supply industry. A number of power semiconductor devices have been developed to overcome these power quality issues. The Distributed Power Flow Controller, which is derived from the Unified Power Flow Controller, is considered the most reliable of these devices. The DC link is the key distinction between the two: in the Distributed Power Flow Controller, the DC link connecting the two converters does not exist. The system is then examined with a Fuzzy Logic Controller for shunt control of the Distributed Power Flow Controller. The results of the investigation demonstrate that the Distributed Power Flow Controller achieves improved performance in terms of harmonics reduction and voltage compensation. MATLAB/Simulink has been used to study the proposed integrated hybrid system under unbalanced voltage conditions.

V. Sowmya Sree (B) JNTUA, Anantapuramu, India e-mail: [email protected] V. Sowmya Sree · G. Panduranga Reddy · C. Srinivasa Rao GPCET, Kurnool, A.P. 518452, India e-mail: [email protected] C. Srinivasa Rao e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_9


Keywords Unified Power Flow Controller (UPFC) · Distributed Power Flow Controller (DPFC) · Fuzzy Logic Controller · Power quality · Solar and wind system

1 Introduction Contemporary power system networks rely on thermal generation and fossil fuels, which cause pollution, are expensive and are being depleted [1]. The current electrical power system is under severe threat from the rising demand for electricity, the depletion of conventional supplies and the deterioration of distribution and transmission networks. The energy crisis and the associated constraints can be overcome through Renewable Energy Sources (RES) [2]. RES penetration in the distribution system causes instability problems similar to those in the transmission network, but at a considerably higher cost. The volatile nature of RES further increases the severity of unstable situations in the distribution grid. Since the distribution system is always evolving, a regulatory system for regular coordination and monitoring is required to provide a protected, reliable and efficient system. Moreover, using power electronics to manage and convert RES electricity is more efficient [3] and improves the power supplied to users. Distributed renewable energy generation has greatly influenced research on hybrid systems. A hybrid system combines multiple RES, and the load is supplied by hybrid sources linked to the existing grid [4]. Solar and wind are the most advanced renewable energy conversion systems (RECS). Because of their intermittent nature, problems arise when wind and solar energy systems are connected to the grid: power quality concerns such as sag, swell and harmonics occur in the system. As a result, the grid may experience an abrupt voltage change that trips it, and frequent tripping can seriously damage grid dependability. Disconnection can be avoided by complying with grid standards and keeping the system stable. Swell and sag are the main issues in the distribution network [5, 6]. A lower-quality electricity supply affects the loads and electronic equipment and degrades their performance.
These issues can be mitigated using distributed filters and custom power devices. In a distribution network, series devices such as the Dynamic Voltage Restorer (DVR) and the Static Synchronous Series Compensator (SSSC) compensate the voltage, while shunt devices such as the Distribution Static Compensator (DSTATCOM) and the Thyristor Controlled Reactor (TCR) correct the voltage [7]. Although it provides real and reactive power at low voltages, the DSTATCOM is not able to reduce load harmonics [8]. The DVR deals with load voltage compensation [9]. However, these devices require extra energy storage capacitors/transformers and a voltage compensator. The Unified Power Flow Controller (UPFC) is a combined shunt and series controller with a common DC link. The DC link voltage rating is poorly matched, since the shunt controller requires a higher rating than the series controller. The DC link capacitor rating and the voltage and current ratings of the Voltage Source Inverter increase, resulting in high cost. To reduce these issues, the Distributed Power Flow Controller (DPFC) is utilized. The DPFC is a series-shunt converter without any DC link. Each converter has its own DC capacitor, which supplies the converter's DC voltage, and the series converters are modestly rated [10]. The failure of one series converter does not affect the system. The shunt converter side has a high-pass filter. The DPFC permits the exchange of active power between converters without the need for a common DC link. Figure 1 shows the various types of power quality issues.

Fig. 1 Classification of issues in power quality (voltage, voltage variations, frequency, flicker, interruption, harmonics, transients)

In this view, the proposed dynamic system, including the concept and mathematical modeling of the wind and solar energy systems, is presented in Sect. 2. Section 3 deals with the principle of operation of the Distributed Power Flow Controller (DPFC). The design of the Fuzzy Logic Controller for the DPFC is discussed in Sect. 4. Section 5 presents the results and discussion of the proposed system and existing topologies in terms of voltage level adjustment, DC ripple, active power and harmonics, together with a comparative analysis of the proposed FLC-based DPFC and the existing conventional methods. Section 6 concludes the paper.

2 Proposed Dynamic System Figure 2 shows the block diagram of the proposed dynamic system, which consists of a hybrid combination of solar and wind systems connected to the grid, with a custom power device connected at the Point of Common Coupling (PCC). Whenever a problem occurs at the PCC, it affects both the output and the grid power. Therefore, in this paper, the DPFC is used to mitigate power quality problems at the load.

2.1 Concept of Wind Energy Conversion System The essential components of a wind energy conversion system are the turbine, the permanent magnet synchronous generator and the power electronic converter [2].


Fig. 2 Photovoltaic-wind hybrid system block diagram

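2.1.1 Mathematical Design of Wind Turbine

The wind turbine converts the kinetic energy of the wind into mechanical power [7]. The power of the wind turbine is expressed as:

$$P_m = \frac{1}{2}\,\rho\,C_p(\lambda,\beta)\,\pi R^2 V_w^3 \tag{1}$$

$$C_p(\lambda,\beta) = 0.73\left(\frac{115}{\lambda_i} - 0.58\beta - 0.002\beta^{2.14} - 13.2\right)e^{-18.4/\lambda_i} \tag{2}$$

where $\lambda_i$ is

$$\lambda_i = \frac{1}{\dfrac{1}{\lambda - 0.02\beta} - \dfrac{0.035}{\beta^3 + 1}} \tag{3}$$

and the tip speed ratio (TSR) is

$$\lambda = \frac{\omega_r R_r}{V_w} \tag{4}$$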

$C_p$ is assumed to be 0.59 according to Betz's Law, and the rotor pitch angle is considered to remain constant. In practice, the performance coefficient ($C_p$) varies from 0.2 to 0.4. Figure 3 shows the MATLAB/Simulink diagram of the wind turbine system.
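The power and tip speed ratio relations above can be sketched numerically. The following is a minimal illustration in Python rather than the authors' Simulink model; the function names, the default air density of 1.225 kg/m³ and the example values are assumptions made here, and $C_p$ is passed in as a number instead of being evaluated from the empirical curve.

```python
import math

# Theoretical upper bound on C_p (Betz's Law), about 0.593.
BETZ_LIMIT = 16.0 / 27.0


def tip_speed_ratio(omega_r, rotor_radius, wind_speed):
    """Tip speed ratio: lambda = omega_r * R_r / V_w (Eq. 4)."""
    if wind_speed <= 0:
        raise ValueError("wind speed must be positive")
    return omega_r * rotor_radius / wind_speed


def mechanical_power(cp, rotor_radius, wind_speed, air_density=1.225):
    """Mechanical power in watts: P_m = 0.5 * rho * C_p * pi * R^2 * V_w^3 (Eq. 1).

    air_density defaults to 1.225 kg/m^3, an assumed sea-level value.
    """
    if not 0.0 <= cp <= BETZ_LIMIT:
        raise ValueError("C_p must lie between 0 and the Betz limit")
    swept_area = math.pi * rotor_radius ** 2
    return 0.5 * air_density * cp * swept_area * wind_speed ** 3


# Hypothetical operating point: 10 m rotor, 10 m/s wind, C_p = 0.4.
print(mechanical_power(0.4, 10.0, 10.0))
print(tip_speed_ratio(4.0, 10.0, 8.0))
```

Because the power grows with the cube of the wind speed, doubling the wind speed in this sketch raises the mechanical power by a factor of eight for the same $C_p$.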

Fig. 3 MATLAB/Simulink model of wind turbine

2.1.2 Permanent Magnet Synchronous Generator Modeling

To derive the dynamic model of the PMSG, the 2-phase synchronous reference frame is utilized, with the d-axis and q-axis in quadrature (90° apart) with regard to the direction of rotation, as seen in Fig. 4. Equations (5)–(7) represent the dynamics of the PMSG in the synchronous reference frame, where $\omega_e$ is the speed of the reference frame with respect to the generator [7].

$$V_{gd} = R_{sg}\, i_{gd} + L_{sg}\frac{di_{gd}}{dt} - \omega_e L_{sg}\, i_{gq} \tag{5}$$

$$V_{gq} = R_{sg}\, i_{gq} + L_{sg}\frac{di_{gq}}{dt} + \omega_e\left(L_{sg}\, i_{gd} + \lambda_m\right) \tag{6}$$

The electromagnetic torque is given as

$$T_e = \frac{3}{2}\,\frac{P}{2}\,\lambda_m\, i_{gq} \tag{7}$$

Fig. 4 d-q axes of synchronous machine
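As a rough numerical check of the PMSG relations in Eqs. (5)–(7), the sketch below evaluates the d-q voltages and the electromagnetic torque for given currents. It is an illustration under assumed parameter values, not the authors' model; the function names and the sample numbers are hypothetical.

```python
def pmsg_dq_voltages(i_d, i_q, did_dt, diq_dt, r_s, l_s, omega_e, lambda_m):
    """Stator voltages in the synchronous d-q frame (Eqs. 5 and 6)."""
    v_d = r_s * i_d + l_s * did_dt - omega_e * l_s * i_q
    v_q = r_s * i_q + l_s * diq_dt + omega_e * (l_s * i_d + lambda_m)
    return v_d, v_q


def electromagnetic_torque(poles, lambda_m, i_q):
    """Electromagnetic torque T_e = (3/2) * (P/2) * lambda_m * i_q (Eq. 7)."""
    return 1.5 * (poles / 2.0) * lambda_m * i_q


# Hypothetical steady-state point (current derivatives are zero).
v_d, v_q = pmsg_dq_voltages(i_d=0.0, i_q=10.0, did_dt=0.0, diq_dt=0.0,
                            r_s=0.5, l_s=0.01, omega_e=100.0, lambda_m=0.2)
print(v_d, v_q, electromagnetic_torque(4, 0.2, 10.0))
```

In steady state the derivative terms vanish, so the d-axis voltage is set by the cross-coupling term and the q-axis voltage by the resistive drop plus the back-EMF term.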


2.2 Concept of Solar PV System The equivalent circuit of a solar PV module is seen in Fig. 5. It transforms solar energy into electrical energy. A typical PV cell is represented by a current source connected in parallel with a diode [2]. From Kirchhoff's current law,

$$I_{ph} = I_d + I_{RP} + I \tag{8}$$

$$I = I_{ph} - (I_{RP} + I_d) \tag{9}$$

3 Distributed Power Flow Controller The DPFC consists of a shunt converter and distributed static series compensators, which compensate active and reactive power as well as the negative- and zero-sequence components of the current [11]. Because there is no DC link, the shunt and series converters can be placed independently, which gives the DPFC greater flexibility. In the UPFC, real power is exchanged through the DC link, whereas in the DPFC the third-harmonic frequency component carries the real power exchange. The DPFC uses multiple single-phase converters instead of a single large three-phase converter, which reduces the component ratings and provides high reliability because of redundancy [12]. Figure 5 depicts the internal circuit of the DPFC [8, 9]. The converter control scheme is employed at the respective terminals of each converter and is called the local controller. The DPFC is made up of three main controllers, as represented in Fig. 5. The central controller serves as the primary controller for the other controllers in the system, while the series and shunt controllers are responsible for compensating for current harmonics and voltage. Figures 6 and 7 show the internal control circuits of the series and shunt controllers, respectively.

Fig. 5 Internal circuit of Distributed Power Flow Controller

Fig. 6 Internal circuit of series controller

Fig. 7 Internal circuit of shunt controller

3.1 Central Control The central control handles the balancing of unbalanced components and the control of power flow. Reference voltage signals are generated for the series and shunt controllers in response to the requirements of the system. All the reference signals are produced at the fundamental frequency.

3.2 Series Control The DC voltage of a series converter is maintained by using the third-harmonic frequency component. The series converter generates the necessary series voltage as directed by the central controller. If a problem occurs in the distribution system, the series filter mitigates the voltage disturbance, so that sags, swells and other disturbances are prevented. The error signal obtained by comparing the supply and line voltages is fed to the PLL structure. The algorithm of the proposed series controller generates a voltage, with VSxyz transformed to d-q-0 coordinates. Figure 6 shows the schematic diagram of the series controller. Switching pulses are generated by comparing the reference voltage (V*Lxyz) with the load voltage (VLxyz). The PWM controller generates the pulses, which are supplied to the switches as needed. The hysteresis controller has input limitations, since it must operate within a limited hysteresis band to activate the series filter. The phase voltage used as feedback serves to calculate the error, which is compared with the hysteresis band (h) using the input VLxyz provided to the controller. The gate signals generated by this comparison are used to switch the series filter.
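The hysteresis comparison described above can be sketched as a simple two-level comparator. This is only an illustrative sketch: the switching polarity (gate on when the load voltage falls below the reference by more than the band h) and the function names are assumptions, not details given in the paper.

```python
def hysteresis_gate(reference, measured, band, previous_state):
    """Return the new gate state for one sample.

    The state changes only when the error leaves the band [-band, +band];
    inside the band the previous state is held.
    """
    error = reference - measured
    if error > band:
        return True   # measured value too low: switch on (assumed polarity)
    if error < -band:
        return False  # measured value too high: switch off
    return previous_state


def gate_sequence(references, measurements, band, initial_state=False):
    """Apply the comparator sample by sample and collect the gate states."""
    state = initial_state
    states = []
    for ref, meas in zip(references, measurements):
        state = hysteresis_gate(ref, meas, band, state)
        states.append(state)
    return states


# Hypothetical per-unit samples of the reference and load voltages.
print(gate_sequence([1.0, 1.0, 1.0, 1.0], [0.5, 0.95, 1.2, 1.0], band=0.1))
```

A narrower band tracks the reference more tightly at the cost of more frequent switching, which is the usual trade-off when choosing h.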

3.3 Shunt Control The shunt controller injects a current into the line in order to deliver real power to the series converters. Shunt controllers are commonly used in power systems because they use reactive current to maintain a constant DC voltage across a capacitor. Current harmonics can also be compensated by the shunt control. The reference current generation technique is depicted in Fig. 7. Reactive power theory is employed to regulate the shunt controller. The 3-phase voltages and currents are converted into α-β-0 coordinates. If reactive power and harmonic compensation are required, the reference currents of the shunt filter in α-β-0 coordinates are converted back to the 3-phase system and denoted i*sx, i*sy and i*sz. Reactive, neutral and harmonic currents are compensated by adjusting the reference currents at the load end. These reference currents are compared with the source currents, and the PWM controller generates the necessary switching signals from the resulting errors.
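The conversion between the 3-phase system and α-β-0 coordinates can be illustrated with the Clarke transformation. The sketch below assumes the power-invariant form of the transformation and one common sign convention for the instantaneous reactive power q; the paper does not state which form or convention it uses, and the function names are hypothetical.

```python
import math

K = math.sqrt(2.0 / 3.0)          # power-invariant scaling factor
SQRT3_2 = math.sqrt(3.0) / 2.0
INV_SQRT2 = 1.0 / math.sqrt(2.0)


def clarke(x, y, z):
    """Transform phase quantities (x, y, z) to (alpha, beta, zero)."""
    alpha = K * (x - 0.5 * y - 0.5 * z)
    beta = K * SQRT3_2 * (y - z)
    zero = K * INV_SQRT2 * (x + y + z)
    return alpha, beta, zero


def inverse_clarke(alpha, beta, zero):
    """Transform (alpha, beta, zero) back to phase quantities (x, y, z)."""
    x = K * (alpha + INV_SQRT2 * zero)
    y = K * (-0.5 * alpha + SQRT3_2 * beta + INV_SQRT2 * zero)
    z = K * (-0.5 * alpha - SQRT3_2 * beta + INV_SQRT2 * zero)
    return x, y, z


def instantaneous_powers(v_xyz, i_xyz):
    """Instantaneous real (p), reactive (q) and zero-sequence (p0) powers."""
    v_a, v_b, v_0 = clarke(*v_xyz)
    i_a, i_b, i_0 = clarke(*i_xyz)
    p = v_a * i_a + v_b * i_b
    q = v_b * i_a - v_a * i_b   # sign convention is an assumption
    p0 = v_0 * i_0
    return p, q, p0
```

For a balanced set of voltages with in-phase currents, q and p0 evaluate to zero, so any nonzero q or oscillating component of p points to the reactive and harmonic content that the shunt reference currents are meant to cancel.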


Fig. 8 Basic structure of Fuzzy Logic Controller

Fig. 9 Input and output membership functions

4 DPFC with Fuzzy Logic Controller The Fuzzy Logic Controller (FLC) is based on fuzzy-set theory and human reasoning processes. Figure 8 depicts the FLC structure with its three essential blocks: fuzzification, rule-base inference and defuzzification [13]. Triangular membership functions are used in the FLC for simplicity, as shown in Fig. 9. Fuzzification is performed over a continuous universe of discourse, and defuzzification uses the centroid method. In the DPFC, the DC voltage of the shunt controller is compared with a reference value, and the error is supplied to the FLC, which produces the power required to regulate the shunt controller [14]. The rule base is constructed by developing rules that relate the input variables to the attributes of the model. Table 1 gives the rules for constructing the FLC.
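The fuzzification, rule evaluation and centroid defuzzification steps described above can be sketched as follows. This is a minimal Mamdani-style illustration, not the authors' Simulink controller: the normalized universe [-1, 1], the evenly spaced triangular sets, the min/max operators and the reading of Table 1 with e as rows and Δe as columns are all assumptions made here.

```python
LABELS = ["NGL", "NGM", "NGS", "ZE", "PSS", "PSM", "PSL"]
CENTERS = [-1.0 + i / 3.0 for i in range(7)]  # assumed evenly spaced peaks
HALF_WIDTH = 1.0 / 3.0                        # assumed triangle half-width

# Rule table transcribed from Table 1 (rows: e, columns: delta-e; assumed).
RULES = [
    ["NGL", "NGL", "NGL", "NGL", "NGM", "NGM", "ZE"],
    ["NGL", "NGL", "NGL", "NGM", "NGM", "ZE", "PSS"],
    ["NGL", "NGL", "NGM", "NGS", "ZE", "PSS", "PSM"],
    ["NGL", "NGM", "NGS", "ZE", "PSS", "PSM", "PSL"],
    ["NGM", "NGS", "ZE", "PSS", "PSM", "PSL", "PSL"],
    ["NGS", "ZE", "PSS", "PSM", "PSL", "PSL", "PSL"],
    ["ZE", "PSS", "PSM", "PSL", "PSL", "PSL", "PSL"],
]


def tri(x, center):
    """Triangular membership value of x for the set peaking at center."""
    return max(0.0, 1.0 - abs(x - center) / HALF_WIDTH)


def flc_output(e, de, samples=201):
    """Return the crisp controller output for error e and change of error de."""
    e = max(-1.0, min(1.0, e))
    de = max(-1.0, min(1.0, de))
    mu_e = [tri(e, c) for c in CENTERS]
    mu_de = [tri(de, c) for c in CENTERS]
    # Rule evaluation: min for AND, max to aggregate rules with the same output.
    strength = [0.0] * len(LABELS)
    for i, row in enumerate(RULES):
        for j, label in enumerate(row):
            k = LABELS.index(label)
            strength[k] = max(strength[k], min(mu_e[i], mu_de[j]))
    # Centroid defuzzification over a sampled universe of discourse.
    num = den = 0.0
    for n in range(samples):
        u = -1.0 + 2.0 * n / (samples - 1)
        mu = max(min(strength[k], tri(u, CENTERS[k])) for k in range(len(LABELS)))
        num += u * mu
        den += mu
    return num / den if den else 0.0
```

With zero error and zero change of error only the ZE rule fires, so the output stays near zero, while large positive inputs push the output toward the PSL end of the range.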

5 Results and Discussion The PV/Wind hybrid system has been simulated in MATLAB/Simulink. The system is connected to the 3-phase distribution grid as well as to the electrical load. Figure 10 shows the voltage sag and swell at 0.2–0.4 s and 0.6–0.8 s, respectively. The following waveforms have been used to observe the system's performance.

Table 1 Rule base of FLC

e \ Δe   NGL   NGM   NGS   ZE    PSS   PSM   PSL
NGL      NGL   NGL   NGL   NGL   NGM   NGM   ZE
NGM      NGL   NGL   NGL   NGM   NGM   ZE    PSS
NGS      NGL   NGL   NGM   NGS   ZE    PSS   PSM
ZE       NGL   NGM   NGS   ZE    PSS   PSM   PSL
PSS      NGM   NGS   ZE    PSS   PSM   PSL   PSL
PSM      NGS   ZE    PSS   PSM   PSL   PSL   PSL
PSL      ZE    PSS   PSM   PSL   PSL   PSL   PSL

Fig. 10 Waveform of voltage at load point before device connected

5.1 Simulation Results of PV/Wind Hybrid System Without Any Custom Device The results of the proposed PV/Wind hybrid system without any device are shown first. Figures 10 and 11 show the voltage and current waveforms at the load point. Figure 12 depicts the active power at the load point. In this system, a voltage sag and a voltage swell are created between 0.2 and 0.4 s and between 0.6 and 0.8 s, respectively, as shown in Fig. 10. The load voltage harmonic spectra are shown in Fig. 13. The THD is 9.31% at t = 0.2 s (voltage sag) and 6.06% at t = 0.6 s (voltage swell).
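For readers who want to reproduce a %THD figure from a sampled waveform, the calculation can be sketched as below. This is a generic illustration, not the FFT analysis tool used in the paper: it assumes that the samples span exactly one fundamental period, and the function names and the cut-off at the 25th harmonic are choices made here.

```python
import cmath
import math


def harmonic_amplitude(samples, harmonic):
    """Amplitude of one harmonic from a single-period record (direct DFT bin)."""
    n = len(samples)
    acc = sum(x * cmath.exp(-2j * math.pi * harmonic * k / n)
              for k, x in enumerate(samples))
    return 2.0 * abs(acc) / n


def thd_percent(samples, max_harmonic=25):
    """Total harmonic distortion in percent, relative to the fundamental."""
    fundamental = harmonic_amplitude(samples, 1)
    distortion = math.sqrt(sum(harmonic_amplitude(samples, h) ** 2
                               for h in range(2, max_harmonic + 1)))
    return 100.0 * distortion / fundamental


# Synthetic test waveform: fundamental plus 5% third and 3% fifth harmonic.
N = 256
wave = [math.sin(2 * math.pi * k / N)
        + 0.05 * math.sin(3 * 2 * math.pi * k / N)
        + 0.03 * math.sin(5 * 2 * math.pi * k / N)
        for k in range(N)]
print(thd_percent(wave))
```

The THD is the root-sum-square of the harmonic amplitudes divided by the fundamental amplitude, so the synthetic waveform should give a value close to 100·√(0.05² + 0.03²).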


Fig. 11 Waveform of current at load point before device connected

Fig. 12 Active power at load point before device connected

Fig. 13 Load voltage harmonic spectra at t = 0.2 s (for sag) and at t = 0.6 s (for swell) before device connected


5.2 Simulation Results of PV/Wind Hybrid System with UPFC Figures 14 and 15 show that the UPFC compensates for the sag and swell caused by disturbances in the system during the evaluation. Figure 16 depicts the active power required by the load: 150 kW up to t = 0.5 s and 200 kW after t = 0.5 s. The load voltage harmonic spectra are shown in Fig. 17. The THD is 5.91% at t = 0.2 s and 5.02% at t = 0.6 s.

Fig. 14 Waveform of voltage at load point with UPFC

Fig. 15 Waveform of current at load point with UPFC


Fig. 16 Active power at load point with UPFC

Fig. 17 Load voltage harmonic spectra at t = 0.2 s (for sag) and at t = 0.6 s (for swell) with UPFC

5.3 Simulation Results of PV/Wind Hybrid System with DPFC Figures 18 and 19 show that the DPFC compensates better for the sag and swell caused by disturbances in the system during the evaluation. Figure 20 depicts the active power required by the load: 150 kW up to t = 0.5 s and 200 kW after t = 0.5 s. The load voltage harmonic spectra are shown in Fig. 21. The THD is 4.36% at t = 0.2 s and 3.80% at t = 0.6 s.


Fig. 18 Waveform of voltage at load point with DPFC

Fig. 19 Waveform of current at load point with DPFC

Fig. 20 Active power at load point with DPFC


Fig. 21 Load voltage harmonic spectra at t = 0.2 s (for sag) and at t = 0.6 s (for swell) with DPFC

5.4 Simulation Results of DPFC Device with Fuzzy Logic Controller The voltage and current waveforms are shown in Figs. 22 and 23; the FLC-DPFC device effectively compensates for the varying voltage and current levels in the system. Figure 24 shows the active power required by the load: 150 kW up to t = 0.5 s and 200 kW after t = 0.5 s. Figure 25 presents the harmonic spectra of the load voltage at different time instants. The THD is 0.31% at t = 0.2 s (voltage sag) and 0.30% at t = 0.6 s (voltage swell).

Fig. 22 Voltage waveform at load point for DPFC with FLC


Fig. 23 Current waveform at load point for DPFC with FLC

Fig. 24 Active power at load point for DPFC with FLC

Fig. 25 V L harmonic spectrum at t = 0.2 s and at t = 0.6 s DPFC with FLC

5.5 Comparative Analysis Table 2 shows the comparative analysis of DPFC device with FLC and conventional PI controller along with UPFC device [15]. It is observed that the proposed method in DPFC gives better performance compared to existing methods.

Table 2 %THD values of load voltage waveforms

%THD for V_load   Without device   With UPFC   With DPFC   With FLC-DPFC
At t = 0.2 s      9.31             5.91        4.36        0.31
At t = 0.6 s      6.06             5.02        3.80        0.30
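To make the comparison in Table 2 easier to read, the relative reduction in %THD with respect to the uncompensated case can be computed directly from the tabulated values. The short sketch below only performs that arithmetic; the dictionary layout and the function name are illustrative choices made here.

```python
# %THD values copied from Table 2.
THD = {
    "t = 0.2 s": {"Without device": 9.31, "With UPFC": 5.91,
                  "With DPFC": 4.36, "With FLC-DPFC": 0.31},
    "t = 0.6 s": {"Without device": 6.06, "With UPFC": 5.02,
                  "With DPFC": 3.80, "With FLC-DPFC": 0.30},
}


def reduction_percent(baseline, value):
    """Relative reduction of value with respect to baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


for instant, row in THD.items():
    base = row["Without device"]
    for device, value in row.items():
        if device != "Without device":
            print(f"{instant}, {device}: {reduction_percent(base, value):.1f}% lower THD")
```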

6 Conclusion The DPFC device is employed to mitigate voltage sags and swells. The DPFC has a structure comparable to that of the UPFC and can influence the system parameters. It has three control loops: central, shunt and series. The system under investigation is a PV-wind hybrid system. Sags and swells are created near the load to assess the dynamic performance. The performance of the DPFC device is studied using its internal control mechanisms with a conventional PI controller. In addition, the harmonic content is assessed at 0.2 and 0.6 s. The simulation findings indicate that both the UPFC and the DPFC compensate for the sag and swell; however, the DPFC has higher compensation capability and lower harmonic distortion. The simulation results also show that the Fuzzy Logic Controller outperforms the standard PI controller in terms of compensation and harmonic distortion. Future work will apply advanced controllers and optimization techniques to further improve the performance of the DPFC device.

References
1. Alsammak AN, Mohammed HA (2021) Power quality improvement using fuzzy logic controller based unified power flow controller (UPFC). Indones J Electr Eng Comput Sci 21(1)
2. Elyaalaoui K, Labbadi M, Ouassaid M, Cherkaoui M (2021) Optimal fractional order based on fuzzy control scheme for wind farm voltage control with reactive power compensation. Math Probl Eng
3. Sowmya Sree V, Srinivasa Rao C (2022) Modeling of GA-ANFIS controller for DPFC coupled solar-wind microgrid system. Inform J 33(6):75–86. ISSN: 0868-4952
4. Lenin Prakash S, Arutchelvi M, Stanley Jesudaiyan A (2016) Autonomous PV-array excited wind-driven induction generator for off-grid application in India. IEEE J Emerg Sel Top Power Electron 4(4):1259–1269
5. Pota HR, Hossain MJ, Mahmud MA, Gadh R, Bansal RC (2014) Islanded operation of micro grids with inverter connected renewable energy resources. In: IEEE PES general meeting, Washington, DC, 27–31 July 2014
6. Mandi RP, Yaragatti UR (2016) Power quality issues in electrical distribution system and industries. Asian J Eng Technol Innov Spec Conf Issue 3:64–69
7. Lavanya V, Senthil Kumar N (2018) A review: control strategies for power quality improvement in micro grid. Int J Renew Energy Res 8(1):149–165
8. Pandu Ranga Reddy G (2020) Power quality improvement in DFIG based WECS connected to the grid using UPQC controlled by fractional order PID and ANFIS controllers. J Mech Contin Math Sci 5:1–13 (ESCI Indexed Journal)
9. Narasimha Rao D, Srinivas Varma P (2019) Enhancing the performance of DPFC with different control techniques. Int J Innov Technol Explor Eng (IJITEE) 8(6):1002–1007. ISSN: 2278-3075
10. Duvvuru R, Rajeswaran N, Sanjeeva Rao T (2019) Performance of distributed power flow controller in transmission system based on fuzzy logic controller. Int J Recent Technol Eng (IJRTE) 8(3):2039–2043. ISSN: 2277-3878
11. Pandu Ranga Reddy G, Vijaya Kumar M (2015) Analysis of wind energy conversion system employing DFIG with SPWM and SVPWM type converters. J Electr Eng (JEE) 15(4):95–106
12. Sowmya Sree V, Panduranga Reddy G, Srinivasa Rao C (2021) A mitigation of power quality issues in hybrid solar-wind energy system using distributed power flow controller. In: 2021 IEEE international women in engineering (WIE) conference on electrical and computer engineering (WIECON-ECE). ISBN: 978-1-6654-7849-6/21/$31.00 ©2021 IEEE
13. Raut A, Raut SS (2019) Review: different technology for distributed power flow controller. Int Res J Eng Technol (IRJET) 06(03):6803–6809
14. Pratihar DK (2013) Soft computing: fundamentals and applications, 1st edn. Alpha Science International Ltd., pp 208–220
15. Pavan Kumar Naidu R, Meikandasivam S (2020) Power quality enhancement in a grid-connected hybrid system with coordinated PQ theory & fractional order PID controller in DPFC. Sustain Energy Grids Netw 21:100317

Analysis of EEG Signal with Feature and Feature Extraction Techniques for Emotion Recognition Using Deep Learning Techniques Rajeswari Rajesh Immanuel and S. K. B. Sangeetha

Abstract In affective computing, recognizing emotions using an EEG signal is challenging. A three-dimensional model is used to identify the emotion. We have used real-time data in this study. One-minute videos were played to the subject as stimuli, and the EEG signal was recorded using an EEG recorder. The significant features for emotion are identified by comparing different EEG features. Four types of features are extracted from the EEG signal using feature extraction methods, and the best features among them are identified. PCA is employed to select the important features from the extracted dataset, and the selected features are given to three deep learning classifiers: gated recurrent unit (GRU), convolutional neural network (CNN), and deep emotion recognizer (DER). The performance of the proposed deep learning system is 81.23%, 80.41%, and 81.75% for arousal, valence, and dominance, respectively. Our findings show that the time-domain statistical features of EEG signals can effectively distinguish between different emotional states. The accuracy of the proposed model is 81% and the model loss is 1.2.

Keywords Emotion recognition · Deep learning · EEG signals · Classification · CNN · Stress · Dataset · Feature extraction · GRU

R. R. Immanuel (B) · S. K. B. Sangeetha Department of Computer Science and Engineering, SRM Institute of Science and Technology, Vadapalani, Chennai, Tamil Nadu, India e-mail: [email protected] S. K. B. Sangeetha e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_10

1 Introduction An emotion is a mental and physical condition that encompasses a wide range of feelings, ideas, and behaviors. Human-brain interaction and medical applications are the fields most involved in emotional states [1]. Furthermore, a deeper understanding of emotions can help in the treatment of stress, memory disorders, and other related problems. Bio-signals have recently gained a lot of attention in emotion detection research [2]. The two important steps in developing an emotion identification system are selecting accurate features associated with emotions and effective classification. The features extracted from the signal must carry all the required information. Among all the bio-signals, EEG signals are vital since they originate from the brain. EEG signals are the small amounts of electricity produced by the central nervous system (CNS) when a human experiences an emotion, and they directly reflect the psychological aspects of emotions. In affective computing, the emotion detection system (EDS) is a demanding and active research subject. The major objective of this study is to find the features that best distinguish emotions. Wavelet, statistical, fractal, power, and entropy features are extracted from the EEG signal and examined to identify the best features for emotion recognition (Padhmashree et al. [3]). An EDS depends on the emotion stimulus method, feature extraction techniques, feature selection algorithms, classifiers, number of subjects involved, and number of emotions identified. The open questions behind the EDS are which kinds of features produce better accuracy, the position of the electrodes, the number of electrodes, and the quality of the stimuli (Gao et al. [4]). These factors drive the output accuracy of the EDS. As a result, different features are retrieved using different feature extraction techniques, and a feature selection method is used to achieve high accuracy and demonstrate the utility of the proposed system. The proposed model uses feature extraction methods, feature selection methods, and a deep learning classifier to classify the different emotions. This paper focuses on the various features associated with EEG signals, especially the features related to emotion identification. It also sheds light on the PCA feature selection method, discussing how it works and how it performs.

2 Related Works Deep learning differs from machine learning, which processes a large amount of data linearly, in that it approaches the problem in a nonlinear manner. Consequently, deep learning has been used to study stress and other emotions. EEG signals have been used in numerous studies on human-computer interface (HCI), with a focus on emotion recognition. Research on emotion identification focuses on brain signals rather than EEG signals because EEG signals are chaotic, nonlinear, and non-stationary (Dongkoo Shon et al. (2018); Sangeetha et al. [5]). The KNHANES VI dataset is used for general lifestyle features like gender, age, and habits, and a deep belief network (DBN) is used to classify stress. Support vector machine (SVM), Naive Bayes classification (NBC), and random forest (RF) are some other common machine learning algorithms that have been studied in depth. People's stress levels are predicted with 75.32% specificity and 66.23% accuracy. Layer one of the model focuses on gathering and cleaning data, while layer two examines that data. The t-test and Chi-square test are used to extract features from the data. Using this information, Song et al. (2017) were able to develop a stress prediction model. In an EEG-based EDS, the features and the ways to obtain them are very important. Many studies have analyzed EEG signals in the time, frequency, and time-frequency domains [2]. Statistical features, such as mean, standard deviation, and difference, are examined in relation to emotions [6]. EEG signals have received less attention due to their nonlinear properties, including fractal dimension (FD) [7]. Positive and negative emotions are associated with the left and right frontal regions, respectively [8]. Similarly, one of the key indications of emotional states is a change in the power spectrum of different EEG bands [9]. However, it is still not clear which EEG properties are the most appropriate for emotion identification [10]. The implementation of emotion recognition systems necessitates a thorough examination of many types of features (Li et al. [11]). Few researchers have compared the importance of different EEG signal characteristics for emotion identification (Nawaz et al. [2]). Automatic feature extraction, which does not require any prior knowledge, works better than manual feature extraction [12]. Higher frequency subbands, such as beta (16–32 Hz) and gamma (32–64 Hz), have been widely tested and shown to outperform lower frequency subbands for emotion identification [13, 14]. Deep learning classifiers, which extract the features for emotion recognition, perform better for EEG signals. The extracted features are passed through feature selection methods to choose only the most promising features that can yield the best results for the EDS. In the next section, we propose different methodologies to extract features and identify significant features for emotion detection. Multimodal interfaces, affective computing, and healthcare all rely on the ability to understand how people express their emotions. Electroencephalogram (EEG) signals are an easy, cheap, compact, and precise way to detect emotions. Human emotion recognition using multivariate EEG signals is proposed by Padhmashree et al. [3]. Their approach begins with a method known as multivariate variational mode decomposition (MVMD), which uses multiple channels of electroencephalograms to create an ensemble of MMOs.

3 Methodology In this study, we use a deep learning approach to classify emotions from EEG data. Emotions are identified with a three-dimensional model comprising arousal, valence, and dominance, as shown in Fig. 1. The different emotions are illustrated in Fig. 2. Arousal levels range from high to low, with high arousal indicating excitement and low arousal indicating tranquility. Valence is a measure of happiness, with a low value indicating melancholy and a high value indicating happiness. Dominance indicates the intensity of an emotion. Among the many input types for emotion detection, EEG plays an important role as it captures the emotion at its place of origin (the brain), and the performance of an EDS is more accurate with EEG signals (Agunaga et al. [15]).


Fig. 1 Three-dimensional emotional model [2]

Fig. 2 Different emotions with respect to arousal and valence [1]

3.1 Datasets We have previously worked with many online datasets; in this study, we work with a real-time dataset comprising the EEG signals of 20 subjects. Each subject watches different videos that evoke different emotions. The EEG signal is recorded using the Emotiv device, whose semi-dry polymer sensors capture the brain signals and send them to an EEG recorder. We use a three-dimensional emotional model deployed with a deep convolutional neural network, so we only look at arousal, valence, and dominance. The EEG signals of every subject are recorded for 2 min. Since the EEG signal of a single subject comprises a huge volume of data, we limited the recording to 20 subjects. The subjects in this study are aged 30 to 40.


3.2 Preprocessing In preprocessing, a filter retains only the signal components related to emotions, which lie between 4 and 45 Hz. The EEG signals were downsampled from 512 to 128 Hz and processed to remove electrooculogram artifacts. The mean value is calculated for each selected channel and subtracted from all the channels. Normalization is also used to lessen the computational complexity and the effect of individual variances owing to basic frequency cycles. All data is normalized to the range [0, 1].
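Part of this preprocessing chain can be sketched for a single channel as follows. The sketch is an illustration under stated assumptions, not the authors' pipeline: the 4–45 Hz filter and the electrooculogram artifact removal are omitted, downsampling is done by keeping every fourth sample (which presumes the signal has already been band-limited below 64 Hz), the mean is removed per channel as one possible reading of the description, and all function names are hypothetical.

```python
def downsample(signal, factor=4):
    """Keep every factor-th sample (512 Hz -> 128 Hz for factor 4).

    Assumes the signal was already band-limited, e.g. by the 4-45 Hz filter.
    """
    return signal[::factor]


def remove_mean(signal):
    """Subtract the channel mean from every sample."""
    mean = sum(signal) / len(signal)
    return [x - mean for x in signal]


def min_max_normalize(signal):
    """Rescale the samples linearly to the range [0, 1]."""
    low, high = min(signal), max(signal)
    if high == low:
        return [0.0 for _ in signal]  # constant channel: no spread to rescale
    return [(x - low) / (high - low) for x in signal]


def preprocess_channel(signal, factor=4):
    """Downsample, remove the mean and normalize one EEG channel."""
    return min_max_normalize(remove_mean(downsample(signal, factor)))


# Hypothetical 16-sample channel used only to show the call.
print(preprocess_channel([float(v) for v in range(16)]))
```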

3.3 Extraction of Features

Feature extraction is the central part of this study, as it plays a vital role in extracting the most salient features for detecting emotions. Various features are extracted to investigate the performance of the EDS. All of the extracted features have been widely used in EEG-based EDS, as explained below.

Entropy Feature of EEG Signal

The human brain is complex, and EEG signals are nonlinear and chaotic in nature [16]. It is therefore recommended to study the nonlinear properties of the EEG signal in addition to performing linear analysis. Entropy is used for this nonlinear analysis.

Permutation Entropy (PE) of EEG Signal

The PE is calculated by forming a series of patterns known as motifs [2]. A motif is a small segment of the EEG signal extracted for analysis. The continuous EEG signal is broken down into motifs (six motifs are defined in [2]). The probability of occurrence of each motif class is calculated and denoted by $y_i$. Finally, the PE is calculated using the classic Shannon formula:

\[
\mathrm{PE} = -\frac{\sum_{i} y_i \ln(y_i)}{\ln(\mathrm{motifs\_count})} \tag{1}
\]
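The permutation entropy calculation can be illustrated with a short sketch. It assumes that a motif is the ordinal (rank) pattern of a window of three consecutive samples, which gives 3! = 6 possible motifs and so matches the six motifs mentioned above; the window length, the unit delay, and the function name are assumptions rather than details taken from the chapter.

```python
import math
from collections import Counter


def permutation_entropy(signal, order=3):
    """Normalized permutation entropy with embedding delay 1."""
    counts = Counter()
    for i in range(len(signal) - order + 1):
        window = signal[i:i + order]
        # The motif is the permutation that sorts the window (ties keep index order).
        motif = tuple(sorted(range(order), key=window.__getitem__))
        counts[motif] += 1
    total = sum(counts.values())
    # Shannon entropy of the motif probabilities y_i.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Divide by ln(number of possible motifs) so the result lies in [0, 1].
    return entropy / math.log(math.factorial(order))
```

A strictly increasing signal contains a single motif and therefore has zero entropy, while a signal in which all motifs are equally likely has entropy 1.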

Information Entropy (IE) of EEG Signal

The IE is defined on the power spectral density and measures the spectral power distribution (SPD) of EEG signals:

\[
\mathrm{IE} = -\sum_{f=0}^{f_n} \mathrm{POWSD}(f)\,\log_2\big(\mathrm{POWSD}(f)\big) \tag{2}
\]

Here POWSD is the normalized spectral power distribution, and $f_n$ is the upper frequency limit given by the Nyquist criterion.

Wavelet Feature (WF)

The wavelet feature is a time-frequency feature employed in EEG-based EDS. Wavelet decomposition yields a time-frequency transform of the original signal for each time point and frequency resolution. To achieve this, the mother wavelet is applied to the original signal to obtain the wavelet coefficients. The mother wavelet was selected because of its near-optimal time-frequency representation properties. Wavelet decomposition is used to decompose the signal into the theta, alpha, beta, and gamma bands, and the energy of the jth band is

\[
E_j = \sum_{k=1}^{N} D_j(k)^2 \tag{3}
\]

Here $D_j(k)$ denotes the kth coefficient of the jth wavelet transform, and $N$ is the number of coefficients.

Statistical Feature

Five statistical features, adapted from [1, 38], are extracted in the current study. They characterize the time-series EEG data $Y(m)$, $m = 1, \dots, M$.

Mean

\[
\mu_y = \frac{1}{M}\sum_{m=1}^{M} Y(m) \tag{4}
\]

Standard deviation [2]

\[
\sigma_y = \sqrt{\frac{1}{M}\sum_{m=1}^{M}\big(Y(m)-\mu_y\big)^2} \tag{5}
\]

Mean of absolute values of first difference [2]

\[
\delta_y = \frac{1}{M-1}\sum_{m=1}^{M-1}\left|Y(m+1)-Y(m)\right| \tag{6}
\]

Mean of absolute values of second difference [2]

\[
\gamma_y = \frac{1}{M-2}\sum_{m=1}^{M-2}\left|Y(m+2)-Y(m)\right| \tag{7}
\]

Mean of absolute values of second difference of the normalized signal [2]

\[
\tilde{\gamma}_y = \frac{1}{M-2}\sum_{m=1}^{M-2}\left|\tilde{Y}(m+2)-\tilde{Y}(m)\right| \tag{8}
\]

where $\tilde{Y}$ denotes the normalized signal.
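The five statistical features can be computed directly from a list of samples. The sketch below is illustrative only: the function name and dictionary keys are hypothetical, the first difference is taken between neighbouring samples and the second difference between samples two steps apart, and "normalized" is assumed to mean z-scored, in which case the normalized second difference equals the raw one divided by the standard deviation.

```python
import math


def statistical_features(y):
    """Return the five time-domain features for a sequence of at least 3 samples."""
    m = len(y)
    mean = sum(y) / m
    std = math.sqrt(sum((v - mean) ** 2 for v in y) / m)
    # Mean absolute difference between neighbouring samples.
    first_diff = sum(abs(y[i + 1] - y[i]) for i in range(m - 1)) / (m - 1)
    # Mean absolute difference between samples two steps apart.
    second_diff = sum(abs(y[i + 2] - y[i]) for i in range(m - 2)) / (m - 2)
    # Assumption: z-score normalization, so the feature scales by 1 / std.
    norm_second_diff = second_diff / std if std > 0 else 0.0
    return {
        "mean": mean,
        "std": std,
        "first_diff": first_diff,
        "second_diff": second_diff,
        "norm_second_diff": norm_second_diff,
    }
```

Each EEG channel would yield one such five-element feature set per analysis window.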

Fractal Dimension Features

Fractal dimension (FD) [14] examines the temporal sequences of EEG signals directly in the time domain, treating the data as a geometric object. By calculating the fractional space occupied, FD determines the geometric complexity of time-series EEG signals [17] and assesses their correlation and evolutionary aspects. It is regarded as a successful feature for EEG-based emotion recognition [2]. In another study, a multimodal fusion technique is used for emotion recognition, with FD features generated from EEG signals combined with musical data [54]. Several algorithms can compute the FD; one of them is the Petrosian FD.
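As an illustration of the Petrosian FD named above, the following sketch uses the commonly cited form of the estimator, log10(N) / (log10(N) + log10(N / (N + 0.4 N_delta))), where N is the number of samples and N_delta is the number of sign changes in the first difference of the signal. The chapter does not state which variant was used, so this should be read as an assumption.

```python
import math


def petrosian_fd(signal):
    """Petrosian fractal dimension of a one-dimensional signal."""
    n = len(signal)
    diff = [b - a for a, b in zip(signal, signal[1:])]
    # Count sign changes between consecutive first differences.
    n_delta = sum(1 for a, b in zip(diff, diff[1:]) if a * b < 0)
    return math.log10(n) / (math.log10(n) + math.log10(n / (n + 0.4 * n_delta)))
```

A monotonic signal has no sign changes and gives a value of 1, while a rapidly oscillating signal gives a value above 1.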

3.4 Feature Selection (FS)

Features derived by this method retain the relevant information of the existing features. This reduces dimensionality issues and improves the efficiency of the model. Principal component analysis (PCA) is a common way to identify the most important features from the overall feature set. The technique evaluates the relevance of the feature components by leveraging the eigenvectors of the covariance matrix, and it performs well with time-series data. The step-by-step working of PCA is shown in Fig. 3: the input EEG signal is given to the PCA module, and the mean is calculated from the resulting matrix. The mean is then used to derive the standard deviation of the EEG signal, after which the covariance matrix is computed. The principal components are calculated using the eigenvectors and eigenvalues of the covariance matrix (Zhang et al. [17]). Figure 3 thus summarizes how the principal components are extracted from the input data. Figures 4 and 5 show the data visualization of the PCA-extracted features and a 3D representation of the principal components of the EEG signals, respectively.
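The PCA steps described above (mean, standard deviation, covariance matrix, eigenvector) can be sketched without external libraries. This is a simplified illustration under stated assumptions: it extracts only the first principal component, it uses power iteration in place of a full eigendecomposition, and every function name is hypothetical. A practical implementation would normally rely on a numerical library.

```python
import math


def standardize(rows):
    """Center each column on its mean and scale by its standard deviation (assumed non-zero)."""
    columns = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        columns.append([(v - mean) / std for v in col])
    return [list(r) for r in zip(*columns)]


def covariance_matrix(rows):
    """Sample covariance of already centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_principal_component(rows, iterations=200):
    """Return (unit eigenvector, eigenvalue) of the largest covariance eigenvalue."""
    cov = covariance_matrix(standardize(rows))
    d = len(cov)
    vec = [1.0] + [0.5] * (d - 1)  # arbitrary non-zero starting vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(d))
                     for i in range(d))
    return vec, eigenvalue
```

Projecting each standardized row onto the returned vector gives the first PCA feature. Further components would require deflating the covariance matrix and repeating the iteration.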

Fig. 3 Workflow of PCA algorithm

Fig. 4 Data visualization using PCA

Fig. 5 3D graph for principal components

3.5 Classifier

The classifier in this study is a deep learning neural network. The input layer receives the PCA output from the previous step. The received data is passed to the convolutional ReLU layer, which applies an activation function to activate the desired features. The next layer is the max-pooling layer, which performs the pooling operation using a 2D filter to retain the required information or features (Yang et al. [18]). The following layer flattens the output of the preceding layer into a 1D vector. This vector is fed to the final classification stage, known as the fully connected layer. The final output layer uses the softmax activation function, which assigns each output node a probability between 0.0 and 1.0. Because of this additional constraint, training converges faster than it otherwise would. Figure 6 depicts the overall layout of the DER (Song et al. [19]). The proposed model uses feature extraction and feature selection approaches, and the DER receives the features retrieved from the EEG signal. We trained the DER model on the dataset, which was divided into training and test data in an 8:2 ratio. Figure 4 depicts the several layers that make up the proposed model (Chen et al. 2019). The architecture of the DER is explained layer by layer. The data is given to the convolutional layer. The PCA output, with 13 components totaling 12,500 data points, is given as input to the next layer.

Fig. 6 Architecture of proposed model (DER)

Fig. 7 Convolutional neural network (DER) architecture

The size of each layer in the DER is calculated as

Layer output size = (input size − filter size) / stride

Layer 1 = (1 × 12500 − 2) / 1 = 12498
Layer 2 = (1 × 12498 − 2) / 3 ≈ 4165 (rounded down)

The remaining layers are calculated in the same way, and the result is represented visually in Fig. 7.
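The layer-size arithmetic can be checked with a one-line helper. The sketch follows the formula exactly as given in this chapter, with the quotient rounded down; many deep learning frameworks add 1 to this quotient for an unpadded convolution, so the numbers reported by a specific framework may differ by one. The function name is hypothetical.

```python
def layer_output_size(input_size, filter_size, stride):
    """Output length per the chapter's formula: (input - filter) / stride, rounded down."""
    return (input_size - filter_size) // stride


# Reproduce the two worked values for Layer 1 and Layer 2.
layer1 = layer_output_size(12500, 2, 1)
layer2 = layer_output_size(layer1, 2, 3)
```

Chaining the calls in this way gives the input size of each subsequent layer.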

4 Discussion and Results

In this study, various features of EEG signals are compared, and the best-performing features among them are selected for emotion identification. In addition, we compared different classifier approaches and identified the optimal classifier for emotion identification systems (Rajeswari and Patil [20]). The consistency across all subjects is then calculated for EEG-based emotion recognition. Figure 8 shows the average classification results for each of the four categories of features with respect to arousal. Statistical features achieve higher accuracy than the other features for all three classifiers, GRU, CNN, and DER (the proposed model) (Sangeetha et al. [5]). With the proposed model, statistical features outscored the other three types of features, with valence, arousal, and dominance accuracies of 80.41%, 81.23%, and 81.75%, respectively. Figures 9 and 10 compare the different EEG features for the three classifiers (GRU, CNN, and DER). Table 1 compares the different models on the same real-time data. The proposed model performs better than the other two classifiers on the same dataset, as it uses the PCA feature selection method to select significant features. The model accuracy of the DER is 81.04% and the model loss calculated by the system is 0.12, as shown in Figs. 11 and 12, respectively. As the number of epochs increases, the model accuracy also increases. Eventually the accuracy levels off, and the value at that point is taken as the training accuracy of the model. The loss decreases slowly and finally reaches the threshold point. The features obtained are effective for recognizing human emotions from EEG signals.

Table 1 Comparison of different classifiers' performance

Sr. No   Method/classifier        Accuracy (%)   Validation
1        GRU (Lew et al. [21])    78.23          1.9
2        CNN (Lan et al. [9])     79.56          2.1
3        DER (proposed method)    81.04          1.2

Fig. 8 Comparison chart of the different EEG features with respect to arousal

Fig. 9 Comparison chart of the different EEG features with respect to valence

Fig. 10 Comparison chart of the different EEG features with respect to dominance

Fig. 11 Model accuracy

Fig. 12 Model loss


5 Conclusion

This study presents the constructed model (DER), which uses a bio-signal (EEG) to determine whether a person is stressed or not. This is accomplished by combining the real-time dataset with feature extraction, PCA-based feature selection, and deep learning techniques. The three-dimensional model (arousal, valence, and dominance) is evaluated with three different classifiers: GRU, CNN, and DER. The significant features are derived from the extracted features using PCA. The signals are preprocessed, which aids in obtaining high accuracy in emotion detection. Among all the features, the statistical features show the best accuracy for emotion identification. The model accuracy is 81.04% and the model loss is 0.12. The same model can be refined in the future to distinguish other emotions of varying intensity.

References

1. Mohammadi Z, Frounchi J, Amiri M (2017) Wavelet-based emotion recognition system using EEG signal. Neural Comput Appl 28:1985–1990. https://doi.org/10.1007/s00521-015-2149-8
2. Nawaz R, Cheah KH, Nisar H, Yap VV (2020) Comparison of different feature extraction methods for EEG-based emotion recognition. Biocyber Biomed Eng 40(3):910–926. ISSN 0208-5216. https://doi.org/10.1016/j.bbe.2020.04.005
3. Padhmashree V, Bhattacharyya A (2022) Human emotion recognition based on time-frequency analysis of multivariate EEG signal. Knowledge-Based Syst 238:107867
4. Gao Q et al. (2022) EEG-based emotion recognition with feature fusion networks. Int J Mach Learn Cybern 13(2):421–429
5. Sangeetha SKB, Dhaya R, Shah DT, Dharanidharan R, Praneeth Sai Reddy K (2021) An empirical analysis of machine learning frameworks digital pathology in medical science. J Phys Conf Ser 1767:012031. https://doi.org/10.1088/1742-6596/1767/1/012031
6. Han CH et al. (2016) Data-driven user feedback: an improved neurofeedback strategy considering the interindividual variability of EEG features. BioMed Res Int
7. Xing XF et al. (2019) SAE+LSTM: a new framework for emotion recognition from multi-channel EEG. Front Neurorobot 13:37
8. Harmon-Jones E, Gable PA, Peterson CK (2010) The role of asymmetric frontal cortical activity in emotion-related phenomena: a review and update. Biol Psychol 84:451–462
9. Lan Z, Sourina O, Wang L, Scherer R, Müller-Putz G (2017) Unsupervised feature learning for EEG-based emotion recognition. Int Conf Cyberworlds 182–185
10. Jenke R, Peer A, Buss M (2014) Feature extraction and selection for emotion recognition from EEG. IEEE Trans Affect Comput 5:327–339
11. Li C et al. (2022) Emotion recognition from EEG based on multi-task learning with capsule network and attention mechanism. Comput Biol Med 143:105303
12. Wang J, Wang M (2021) Review of the emotional feature extraction and classification using EEG signals. Cognitive Rob
13. Koelstra S, Muhl C, Soleymani M et al. (2012) DEAP: a database for emotion analysis using physiological signals. IEEE Trans Affect Comput 3(1):18–31
14. Wichakam I, Vateekul P (May 2014) An evaluation of feature extraction in EEG-based emotion prediction with support vector machines. In: Proceedings of the 2014 11th international joint conference on computer science and software engineering (JCSSE '14). Chon Buri, Thailand, pp 106–110
15. Aguiñaga AR et al. (2022) EEG-based emotion recognition using deep learning and M3GP. Appl Sci 12(5):2527
16. Liu Y, Sourina O, Nguyen MK (2011) Real-time EEG-based emotion recognition and its applications. Trans Comput Sci XII 256–277
17. Zhang H (2020) Expression-EEG based collaborative multimodal emotion recognition using deep autoencoder. IEEE Access 8:164130–164143. https://doi.org/10.1109/ACCESS.2020.3021994
18. Yang Y, Wu Q, Qiu M, Wang Y, Chen X (2018) Emotion recognition from multi-channel EEG through parallel convolutional recurrent neural network. Int Joint Conf Neural Networks (IJCNN) 2018:1–7. https://doi.org/10.1109/IJCNN.2018.8489331
19. Song T, Zheng W, Song P, Cui Z (2020) EEG emotion recognition using dynamical graph convolutional neural networks. IEEE Trans Affective Comput 11(3):532–541. https://doi.org/10.1109/TAFFC.2018.2817622
20. Rajeswari IB, Patil D (2014) Detection of intrusion and recovery for smartphones using cloud services. J Comput Technol 3(7):2278–3814
21. Lew W-CL et al. (2020) EEG-based emotion recognition using spatial-temporal representation via Bi-GRU. In: 2020 42nd Annual international conference of the IEEE engineering in medicine & biology society (EMBC). pp 116–119. https://doi.org/10.1109/EMBC44109.2020.9176682
22. Yin Z, Zhao M, Wang Y, Yang J, Zhang J (2017) Recognition of emotions using multimodal physiological signals and an ensemble deep learning model. Comput Methods Programs Biomed 140:93–110
23. Gao Q, Wang Ch, Wang Z et al. (2020) EEG based emotion recognition using fusion feature extraction method. Multimed Tools Appl 79:27057–27074. https://doi.org/10.1007/s11042-020-09354-y
24. Gao Z, Wang X, Yang Y, Li Y, Ma K, Chen G (2021) A channel-fused dense convolutional network for EEG-based emotion recognition. IEEE Trans Cognitive Develop Syst 13(4):945–954. https://doi.org/10.1109/TCDS.2020.2976112
25. Garg D, Verma GK (2020) Emotion recognition in valence-arousal space from multi-channel EEG data and wavelet based deep learning framework. Proc Comput Sci 171:857–867. ISSN 1877-0509. https://doi.org/10.1016/j.procs.2020.04.093
26. Lan Z, Sourina O, Wang L, Scherer R, Müller-Putz G (2017) Unsupervised feature learning for EEG-based emotion recognition. Int Conf Cyberworlds 2017:182–185. https://doi.org/10.1109/CW.2017.19

Innovative Generation of Transcripts and Validation Using Public Blockchain: Ethereum

S. Naveena, S. Bose, D. Prabhu, T. Anitha, and G. Logeswari

Abstract University certificates are the most important asset for proving our excellence and worth to others. They are essential for helping students succeed in higher education or gain employment. The methodology for creating, verifying, and managing certificates in the current analog system is unreliable and slow. Verifying a digital certificate requires institutions to reach out to the issuing authority. To ensure the security and authenticity of such certificates, a system must maintain records of transcripts and make them available online. With such a system, authorities and students could easily confirm the authenticity of certificates. We propose a permissioned blockchain-based system for the secure verification of academic certificates. Using a hash-based storage system to ensure the authenticity and security of the digital content stored on the platform, it allows universities to upload student certificates and verify certificates from other member universities. We use the Raft ordering service and the Paxos service to handle numerous distributed orderers. Deploying the system in a cloud environment enables users to access it from anywhere in the world.

Keywords Blockchain · Digital certificate · Verification · Raft

S. Naveena · S. Bose · T. Anitha (B) · G. Logeswari
Department of Computer Science and Engineering, College of Engineering Guindy, Anna University, Chennai, Tamil Nadu, India

D. Prabhu
Department of Computer Science and Engineering, University College of Engineering, Arni, Tamil Nadu, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_11


1 Introduction

A blockchain is an open, distributed ledger [1] made up of cryptographically linked blocks. Each block contains a cryptographic hash of the previous block, a timestamp, and transactions, and the linked blocks form a chain. A blockchain is typically managed by a distributed network whose nodes collectively adhere to a protocol for inter-node communication and for validating new blocks. The core innovation of blockchain is consensus, which gives a high assurance that an adversary cannot alter a transaction once it is sufficiently deep in the blockchain, assuming that honest nodes control the majority of the nodes in the system. Although blockchain records are not unalterable, a blockchain may be considered secure by design and exemplifies a distributed computing system with crash tolerance. Blockchain technology can be applied in a variety of industries, including education, healthcare, government, travel and hospitality, retail, and consumer packaged goods [2]. By increasing the protection, security, and interoperability of educational data, blockchain can play an important role in the educational sector [3, 4]. It has the potential to overcome a variety of interoperability issues in the sector and to enable secure information sharing among the many entities and individuals involved [5, 6]. The primary use of blockchain today is as a distributed ledger for cryptocurrencies, most notably Bitcoin. To control who has access to the network, a permissioned blockchain uses an access control layer [7]. Employers and institutions must occasionally contact the certificate-granting authority to ensure that a transcript is authentic. This is difficult and time-consuming, which is one of the main reasons for fraud [8]. The main contributions of this work are:

• To build a secure digital system that provides a hassle-free experience in transferring and verifying certificates.
• To verify digital certificates and ensure their security and authenticity.
• To deploy the system in a cloud environment so that it can be accessed from anywhere while the data remains protected.
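The hash linking described in the introduction, in which each block stores a cryptographic hash of the previous block together with a timestamp and transactions, can be sketched as follows. This is a toy illustration with hypothetical names and a JSON-based block encoding chosen only for the example; it has no consensus, networking, or signatures.

```python
import hashlib
import json


def block_hash(block):
    """SHA-256 over a canonical JSON encoding of the block."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain, transactions, timestamp):
    """Append a block that stores the hash of its predecessor."""
    previous_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({
        "previous_hash": previous_hash,
        "timestamp": timestamp,
        "transactions": transactions,
    })
    return chain


def is_valid(chain):
    """Check that every block still points at the hash of the block before it."""
    return all(chain[i]["previous_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))
```

Changing the contents of an earlier block changes its hash, so the stored `previous_hash` of its successor no longer matches and `is_valid` returns `False`.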

2 Related Work

A permissioned blockchain-based framework allows institutions to securely and seamlessly transfer and verify academic records at the student's request. Permissioned blockchains, such as Hyperledger, provide a more scalable, practical, and private solution for enterprise applications [9]. The system developed by Rama Reddy et al. works as follows: a legitimate electronic file of the certificate, i.e., an e-certificate, is generated at the request of the student. A corresponding QR code or unique serial number is also sent to the student along with the e-certificate [10]. The goal of this research project is to use the blockchain to verify the legitimacy of issued certificates. The initial stage of the project is a prototype that permits the registration of academic institutions and their institutes/faculties, the registration of students, and the issuing of certificate awards. The issued certificates are registered on the blockchain [11, 12] so that any third party who needs to verify the authenticity of a certificate can do so independently of the academic institution, even if that institution has closed. BlockCert [13] is free software that any university or institute may use to issue credentials on the Bitcoin blockchain [14]. The Interplanetary File System (IPFS) is a distributed file system that seeks to decentralize the web and make it faster and more efficient [15]. The rapid adoption of this distributed file system is partly because IPFS is designed to work on top of various protocols, such as FTP and HTTP [16]. A university can be added only by the owner of the smart contract. Each created certificate is stored in IPFS, which in turn returns a unique hash generated using the SHA-256 algorithm [17, 18]. This hash serves as a unique identity for each document. The generated hash and the details of the certificate [19] are stored in the blockchain, and the resulting transaction ID is sent to the students. In centralized reputation systems [20], data about the performance of a given participant is gathered as ratings from other members of the community who have had direct experience with that participant. Afrianto et al. reported several use cases where blockchain [21] technology can be adopted in the education context. One approach [22] uses a distributed ledger incorporated with the Ethereum blockchain for privacy preservation and confidentiality, along with significant research value. This is valid only in areas of technical knowledge; in less technical fields, further assessment is required. Rather than a distributed ledger, a private and permissioned blockchain can be used for private transactions [23, 24] of data. Another approach incorporates Docschain and an OCR template in the blockchain within the social workflow of degree processing. Institutes face the limitation of being tightly bound to students [25] because the data of degree documents cannot be placed on Blockcerts; students have to bind with a Blockcerts application. OCR scans documents very quickly, but QR codes can be adopted instead of OCR because of their efficiency. Blockcerts, database servers, and cryptographic techniques create a reliable and safe environment for data storage and sharing using the Ethereum blockchain. The redundancy of blockchains [26] makes them hard to scale, since every device in the network must hold a copy of every transaction made. Digital signatures protect documents from attackers [8, 27]; however, it is safe to use efficient algorithms such as AES and SHA. Certificate data can be stored in the public Ethereum blockchain [28] infrastructure, with its supporting files stored in the IPFS environment.

3 Proposed Methodology

The proposed system consists of 10 blocks, namely certificate creation, certificate verification, Ethereum, Hyperledger Fabric, IPFS, the Raft ordering service, the Paxos ordering service, organization inclusion, organization deletion, and the feedback system, as shown in Fig. 1. Universities can upload and verify certificates and can interact directly with the system through this module. The module contains chaincode, a set of rules by which the participants communicate within the blockchain network. It also saves the log information of all previous transactions, making them immutable. We use the Raft ordering service and the Paxos service to handle many distributed orderers. The organization inclusion and deletion module is responsible for adding a university to, or removing it from, the blockchain network. A reputation value (min: −1, max: 1, default: 0) is stored on the web server for each university in the network. Initially, a university applies to join the network through the web interface. The organization's credentials are reviewed, and the organization is added to the blockchain network after passing the check. If the value matches, the certificate is approved by the receiving university. If there are no such blocks, the request is rejected, and the student needs to make a request to the university. Cloud deployment makes the system more secure, reduces the number of local machines, and reduces IT costs.

Fig. 1 The overall architecture of the proposed system


3.1 Web Server Module

To authenticate the university, a two-way verification system is built. This module is necessary for interacting with the blockchain network and accessing its functionalities. Universities need to be registered in the network before issuing or verifying a certificate. The web server acts as an interface between the user and the blockchain mechanism. A certificate can be uploaded to the blockchain network, and the generated hash value is sent to the user. Through this web server, we can interact with the user and store and verify digital certificates. IPFS is used to store the certificates in the distributed file system and to fetch them during verification.

Algorithm 1: Password credential checking

// Login: two-factor authentication
Step 1: Get the username (email) and password from the university
    username = get_username()
    password = get_password()
Step 2: Remove the tags from the username and password to prevent cross-site scripting attacks
    username = remove_tag(username)
    password = remove_tag(password)
Step 3: Check whether such a username and password combination exists
    result = query_in_DB(username, password)
    if result == true then
        go_to_dashboard()
    else
        go back to Step 1
    end

// SMTP pseudocode
function sendEmail() {
    Email.send({
        Host: "smtp.gmail.com",
        Username: "sender@email_address.com",
        Password: "Enter your password",
        To: "receiver@email_address.com",
        From: "sender@email_address.com",
        Subject: "Sending Password",
        Body: "xxxxxxxx!!"
    }).then(function (message) {
        alert("mail sent successfully");
    });
}

// OTP generation
Input: Randomly generated OTP
Output: OTP received on the registered mobile number
    Send the OTP using sid and token:
    client = new Client(sid, token)
    client->messages->create("from", "body", pin)

// OTP verification
Step 1: Get the OTP sent to the registered mobile number
    user_otp = pin()
Step 2: Check whether this OTP was sent by the system
    result = otp_in_DB(otp)
    if result == true then
        go_to_dashboard()
    else
        go back to Step 1
    end
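The two-factor login of Algorithm 1 can be sketched with the Python standard library. This is a hedged illustration rather than the authors' implementation: all function names and the in-memory `users` dictionary are hypothetical, input is escaped with `html.escape` instead of having its tags removed, and the password is compared as a salted PBKDF2 hash, which is a substitution for the plain database lookup in the pseudocode. Delivery of the OTP by SMS or e-mail is not shown.

```python
import hashlib
import hmac
import html
import secrets


def sanitize(text):
    """Trim whitespace and escape HTML special characters in user input."""
    return html.escape(text.strip())


def hash_password(password, salt):
    """Derive a password hash with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def check_credentials(users, username, password):
    """users maps username -> (salt, stored_hash); returns True on a match."""
    record = users.get(sanitize(username))
    if record is None:
        return False
    salt, stored_hash = record
    return hmac.compare_digest(stored_hash, hash_password(password, salt))


def generate_otp(digits=6):
    """Generate a numeric one-time password from a cryptographic random source."""
    return "".join(secrets.choice("0123456789") for _ in range(digits))


def verify_otp(expected, submitted):
    """Compare the submitted OTP with the issued one in constant time."""
    return hmac.compare_digest(expected, submitted)
```

A login handler would call `check_credentials` first, then issue `generate_otp()` to the registered number and admit the user only if `verify_otp` succeeds.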


Algorithm 2: AES encryption

Input: Roll number from the database
Output: Encrypted string
Step 1: Get the roll number from the database
Step 2: Encrypt the roll number
    encryption = openssl_encrypt(simple_string, ciphering, encryption_key, options, encryption_iv)

// QR encoding algorithm
Input: Encrypted string
Output: Encoded QR
Step 1: Get the string
Step 2: Encode the string as a QR code image
Step 3: Resize the image

// QR decoding algorithm
Input: QR code
Output: Encrypted string
Step 1:
    function decodeImageFromBase64(data, callback) {
        // Set the callback
        qrcode.callback = callback;
        // Start decoding
        qrcode.decode(data);
    }

3.2 Certificate Verification Module

Universities can verify a certificate by fetching the document from the Inter-Planetary File System. Once the verifier is authenticated, he/she enters the hash value given by the user, and that hash value returns the document. If the submitted hash value is not available in the ledger, or if the certificate is no longer valid, a prompt saying “no such certificate exists” is displayed, indicating the absence of the file.

Algorithm 3: Certificate verification

Input: Certificate
Output: Same or not same certificate
    certificate_in_ledger = invoke_chaincode("query certificate", cert_hash)
    if certificate_in_ledger == true then
        certificate = ipfs_fetch(cert_hash)
        if exists(certificate) == true then
            cert_feedback = get_certificate_feedback_from_verifier()
            if cert_feedback == yes then
                reputation_change(increase)
                document verified
            else
                reputation_change(decrease)
            end
        else
            reputation_change(decrease)
        end
    else
        prompt "no such certificate exists"
    end
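The control flow of Algorithm 3 can be sketched with plain dictionaries standing in for the ledger and the file store. Everything here is a hypothetical simplification: the function name, the fixed reputation step of 0.1, the return strings other than the prompt quoted above, and the assumption that the reputation change is applied to the issuing university are not taken from the chapter.

```python
def verify_certificate(cert_hash, ledger, file_store, reputation,
                       verifier_accepts, step=0.1):
    """ledger maps cert_hash -> issuer; file_store maps cert_hash -> document."""
    if cert_hash not in ledger:
        return "no such certificate exists"
    issuer = ledger[cert_hash]
    certificate = file_store.get(cert_hash)
    if certificate is not None and verifier_accepts(certificate):
        # Positive feedback raises the issuer's reputation, capped at 1.
        reputation[issuer] = min(1.0, reputation[issuer] + step)
        return "document verified"
    # A missing document or negative feedback lowers it, floored at -1.
    reputation[issuer] = max(-1.0, reputation[issuer] - step)
    return "verification failed"
```

The `verifier_accepts` callback represents the verifier's manual feedback on the fetched document.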


3.3 Raft Consensus Module

Raft is a consensus algorithm for managing a replicated log. Raft follows a “leader and follower” model, in which a leader node is elected (per channel) and its decisions are replicated by the followers. If two entries in different logs (the leader's and a follower's) have the same index and term, they store the same data, and the logs are identical up to that index. If the leader crashes, the logs might become inconsistent, so a leader election takes place. The Raft algorithm uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly. When the number of user requests is high, they should be processed without any failure. If an orderer fails, Raft runs a leader election to elect an existing available orderer, thus making the system crash tolerant. Log replication followed by a log consistency check is performed whenever a leader election takes place due to a network change.
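The log consistency check and the randomized election timeout can be illustrated with a small sketch. It is a simplification under stated assumptions: logs are lists of (term, command) tuples held in memory, the agreeing prefix is found by scanning from the start rather than by the message exchange a real Raft implementation uses, and the 150 to 300 ms timeout range is an illustrative choice. All names are hypothetical.

```python
import random


def last_matching_index(leader_log, follower_log):
    """Index of the last entry of the common prefix with equal terms, or -1."""
    match = -1
    for i, (lead, follow) in enumerate(zip(leader_log, follower_log)):
        if lead[0] != follow[0]:  # same index but different term: logs diverge here
            break
        match = i
    return match


def repair_follower(leader_log, follower_log):
    """Keep the agreeing prefix and replace the rest with the leader's entries."""
    keep = last_matching_index(leader_log, follower_log) + 1
    return follower_log[:keep] + leader_log[keep:]


def election_timeout_ms(rng=random):
    """Randomized timeout so that followers rarely start elections at the same time."""
    return rng.uniform(150, 300)
```

After `repair_follower`, the follower's log equals the leader's log, which is the state that log replication aims to restore after a leader change.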

3.4 Inter-planetary File System

The Inter-Planetary File System (IPFS) is a peer-to-peer network and protocol for sharing data in a distributed file system. IPFS uses content addressing to identify each document in a global namespace that connects all participating devices. IPFS is used to store certificates in the distributed file system and to retrieve them for certificate verification. Any node in the network can serve a document by its content address, and other peers can locate and request that content from any node using a distributed hash table (DHT).
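Content addressing can be illustrated with an in-memory store in which the address of a document is the SHA-256 digest of its bytes. This is only a conceptual sketch with a hypothetical class name: real IPFS splits files into blocks, uses its own content identifier format, and locates blocks across peers through the DHT, none of which is modelled here.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is derived from the data itself."""

    def __init__(self):
        self._blocks = {}

    def add(self, data: bytes) -> str:
        """Store the data and return its content address."""
        address = hashlib.sha256(data).hexdigest()
        self._blocks[address] = data
        return address

    def get(self, address: str):
        """Return the stored data, or None if the address is unknown."""
        return self._blocks.get(address)
```

Because the address depends only on the content, uploading the same certificate twice yields the same address, and any change to the file yields a different one.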

3.5 Hyperledger Fabric Channel Module

Every transaction on the network is executed on a channel, where each party must be authenticated and authorized to transact on that channel. Each peer that joins a channel has its identity issued by a membership service provider (MSP), which authenticates each peer to its channel peers and services. The genesis block is the first block on the ledger and is created when the channel is created.


3.6 Hyperledger Fabric Transaction

In a Hyperledger Fabric network, the transaction is the most important aspect. A transaction reflects the certificates uploaded to the Fabric network. The client sends a transaction proposal to the specified endorsing peer nodes. After verifying that the client is authorized to submit a transaction proposal, the endorsing peers execute the chaincode according to the information contained in the proposal and the world state. They return the proposal response to the client after execution.

Algorithm 4: Hyperledger Fabric channel creation

Step 1: Generate the crypto artifacts using the cryptogen tool. The configtx.yaml file defines the configuration of the channel, such as anchor peers and members.
    cryptogen generate --config=./crypto-config.yaml
Step 2: Generate the genesis block
    configtxgen -profile $PROFILE_NAME -outputBlock ./channel-artifacts/genesis.block
Step 3: Generate the channel artifacts
    configtxgen -profile $PROFILE_NAME -outputCreateChannelTx ./channel-artifacts/channel.tx -channelID $CHANNEL_NAME
Step 4: Generate the anchor peer transaction for each peer organization
    configtxgen -profile $PROFILE_NAME -outputAnchorPeersUpdate ./channel-artifacts/Org1MSPanchors.tx -channelID $CHANNEL_NAME -asOrg Org1MSP

3.7 Paxos Module

Consensus is the process of agreeing on one decision among a group of participants. Paxos is a family of protocols for solving consensus in a network of unreliable or fallible processors. This algorithm is used when the system requires durability. Each run of the Paxos protocol decides on a single output value. The processors in Paxos each play a specific role: client, acceptor, proposer, learner, or leader.
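The proposer and acceptor roles can be illustrated with a minimal single-value Paxos sketch. It is an in-memory toy under stated assumptions: messages are plain method calls that never fail or arrive out of order, there are no learners or leader election, and the class and function names are hypothetical. It shows only how a chosen value is preserved by later proposals.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1    # highest proposal number promised so far
        self.accepted = None  # (number, value) of the last accepted proposal

    def prepare(self, number):
        """Phase 1: promise to ignore proposals numbered below `number`."""
        if number > self.promised:
            self.promised = number
            return True, self.accepted
        return False, None

    def accept(self, number, value):
        """Phase 2: accept unless a higher-numbered promise has been made."""
        if number >= self.promised:
            self.promised = number
            self.accepted = (number, value)
            return True
        return False


def propose(acceptors, number, value):
    """Run both phases; return the chosen value, or None if no majority."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(number) for a in acceptors]
    promises = [accepted for ok, accepted in replies if ok]
    if len(promises) < majority:
        return None
    prior = [p for p in promises if p is not None]
    if prior:
        # A value may already be chosen: adopt the highest-numbered accepted one.
        value = max(prior, key=lambda p: p[0])[1]
    votes = sum(a.accept(number, value) for a in acceptors)
    return value if votes >= majority else None
```

Once a value has been chosen, a later proposal with a higher number returns that same value instead of its own, which is the safety property consensus relies on.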

3.8 Organization Inclusion and Deletion

An organization (university) is added when it needs to upload certificates and verify the certificates issued by other members of the system. For successful inclusion in the system, a university has to pass the two-factor authentication and a manual verification by the admin. Upon verification, an org key is assigned to that university by the membership service provider (MSP) and stored on the web server. A university needs to be removed from the system when it misuses the system by uploading an invalid certificate. If the university's reputation value falls below a certain level (−0.5), the university is removed from the system and the status of all its certificates is changed to invalid.

3.9 Feedback Module To counter malicious users and counterfeit certificates, the system needs a feedback mechanism that accepts feedback from the receiver of a certificate and rates the issuer accordingly. Since the system is centralized, the beta reputation system (refer to chapter 2) is implemented. The beta reputation system uses beta probability density functions to aggregate feedback and calculate reputation ratings, and it has the advantages of flexibility and simplicity. The change in reputation depends directly on the reputation value of the university that gives the feedback: the higher its reputation, the greater the impact of its feedback.
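A minimal sketch of such a rating follows. It assumes the standard beta reputation rating (r − s)/(r + s + 2), which lies between −1 and 1, and the removal threshold of −0.5 quoted in Sect. 3.8. How the weight of a feedback is derived from the reputation of the university that gives it is not specified in the text, so the weight is simply a parameter here. The names ReputationRecord and REMOVAL_THRESHOLD are hypothetical.

```python
from dataclasses import dataclass

REMOVAL_THRESHOLD = -0.5  # threshold quoted in Sect. 3.8


@dataclass
class ReputationRecord:
    r: float = 0.0  # accumulated weighted positive feedback
    s: float = 0.0  # accumulated weighted negative feedback

    def add_feedback(self, positive: bool, weight: float = 1.0) -> None:
        """Record one feedback; the weight models the influence of its sender."""
        if weight < 0:
            raise ValueError("weight must be non-negative")
        if positive:
            self.r += weight
        else:
            self.s += weight

    @property
    def rating(self) -> float:
        """Beta reputation rating (r - s) / (r + s + 2)."""
        return (self.r - self.s) / (self.r + self.s + 2.0)

    def should_be_removed(self) -> bool:
        """True once the rating has fallen below the removal threshold."""
        return self.rating < REMOVAL_THRESHOLD
```

A university with no feedback starts at a neutral rating of 0, and feedback from a highly weighted sender moves the rating further than the same feedback from a lightly weighted one.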

3.10 Cloud Deployment Module Deploying the system in a cloud makes it more secure, reduces the number of local machines required, lowers IT costs, and replaces data centers located in different places. This module uses a hybrid cloud, that is, a combination of private and public clouds: the public cloud is used for business-to-consumer transactions and the private cloud for business-to-business transactions.

4 Experimental Results Web pages provide the interaction between the blockchain and its clients. The web server interface is built using the Flask framework, connects to the blockchain network via API calls, and exposes the functionalities of the blockchain. Clicking the login icon opens a login page where a registered data manager enters their credentials (email and password). A new user must first register by clicking the register icon, entering an email address and a password, and confirming these credentials. After a successful login, the user receives an OTP on the registered mobile number as the second step of two-factor authentication. Only a user with the correct OTP is allowed to access the system: the user enters the OTP correctly, clicks “enter pin”, and confirms with OK to proceed. Afterwards, the data manager enters the student details, such as name, roll no, semester, and marks, uploads the student’s photo, and submits the form. Only the data manager has the credentials to enter and change the details of a particular student. After all the details of the student are submitted, the system produces a digital certificate (Fig. 2), which is used for future certificate validation in other organizations. The certificate includes a QR code for easy scanning and retrieval during verification. It can be downloaded as a PDF, or as an image by clicking the “Convert to Image” option. Organizations, including multinational corporations, use this certificate to verify the student when the student applies for a job, higher studies, or on-site education. Once the certificate has been created, the next step is to verify it. The user enters their email address and password to establish whether they are authorized. Users with a wrong id or password are not allowed to access the system, and access is locked after three failed attempts. Universities can either upload a single certificate or bulk upload multiple certificates, provided that the certificate type and student details are already in the system. Once a certificate is uploaded, a hash value of it is generated and mailed to the respective student’s mail id. These hash values, along with a few other details, are submitted as a transaction to the blockchain network. During verification, the mark sheet provided by the authority is compared with the mark sheet held by the user, as shown in the screenshot of Fig. 3. The system rejects a mark sheet even if it carries only a small pen scratch.

Fig. 2 Digital certificate

The system displays “NOTSAME” when the two mark sheets differ and “SAME” otherwise (Fig. 3). If the certificates held by the authority and the user are the same, with no discrepancy, the certificate is confirmed as original. Once the certificate is approved, the issuing university gains reputation value within the network and the certificate is marked as valid. The status of a certificate after upload and verification can be tracked by clicking “UPLOAD LOG” and “VERIFY LOG”: the upload log shows the details of the uploaded certificates, and the verify log shows the recently verified certificates with their timestamps.

Fig. 3 Digital certificate verification

To test the React client, we start the local server. The Truffle Box provides some React boilerplate. First, navigate into the client/ folder and run “npm run start” in the terminal. This opens a new browser tab and attempts to connect the browser to the blockchain. During this process, the Metamask extension shows a notification such as “you are attempting to connect”; press CONFIRM to proceed. React JS is an open-source JavaScript library for building user interfaces (UI), used in particular for single-page applications. Metamask offers a straightforward and safe way to connect to blockchain applications. The gas fee is the charge paid for the computation required by any transaction on the Ethereum blockchain. After building the user interface and the plug-in that connects to the blockchain network, we add the details of the upload log and the verify log to the React app. These details are reflected in the transaction history of the blockchain. Using this information, we can track the status of the certificates and identify the generation of fake certificates. After all the upload and verify log details are entered, the app reports that the log was added successfully to the blockchain network. Ganache is a personal Ethereum blockchain that is used for setting up the network of blocks by using the chain code. Fig. 4 shows the transactions previously made by the user. When the upload and verify logs are stored using the React app, those details are reflected here with their timestamps. The highlighted content shows the details of the inputs, the functions, and the purpose of the contract.
Using these details, it is easy to keep track of a certificate and to keep the system free from malicious users.
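The hash generation and the SAME/NOTSAME check described in this section can be sketched as follows. The text does not name the hash function or say how the comparison is implemented, so SHA-256 over the raw file bytes is an assumption made for illustration, and the function names certificate_hash and verify_certificate are hypothetical.

```python
import hashlib
import hmac


def certificate_hash(data: bytes) -> str:
    """Return the hex digest that would be stored on the ledger (SHA-256 assumed)."""
    return hashlib.sha256(data).hexdigest()


def verify_certificate(presented: bytes, stored_hash: str) -> str:
    """Compare the presented file with the stored digest."""
    # compare_digest avoids leaking, through timing, where the digests differ.
    same = hmac.compare_digest(certificate_hash(presented), stored_hash)
    return "SAME" if same else "NOTSAME"
```

Because any single changed byte produces a different digest, a scanned mark sheet with even a small mark on it would be reported as NOTSAME under this scheme.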


Fig. 4 View transactions

4.1 Performance Metrics The performance of any system must be assessed against a set of criteria. These criteria are called performance metrics.
Throughput Throughput is the amount of data that can be processed in a given time frame. Here the data comprises enrolling new data of a candidate and updating or retrieving existing data of the candidate. It can be measured either in bits per second or in data items per second.
Throughput = (Sum of Enrolling/Updating/Retrieving Data) / Time Frame
The quantitative results and observations for the calculated throughput values are given in Fig. 5. Let the X-axis be the number of peers performing operations, i.e., issuing requests to fetch documents from the file system, and let the Y-axis be the time taken in seconds to process a request. The three bars represent the number of requests issued by one peer at a time: the blue bar represents one request at a time and the orange bar represents more than 1 and fewer than 10 requests at a time. From Fig. 5, processing a request takes an average time of at least 0.9 s.
Delay Network delay is a design and performance characteristic of a network. It specifies the latency for a bit of data to travel across the network. Here the delay is calculated from the time taken to receive the OTP, to establish the host connection, and to enter and retrieve data.
Delay = Host Connection Delay + Data Transmission Delay


Fig. 5 Throughput

Let X represent the number of peers issuing requests to access a certificate from the file system, and let Y represent the time taken to process a request. The graph shows the additional delay incurred each time the number of peers increases by one: the delay increases by 0.27 s for each peer. A delay of 0.25 s is normal for the system, and Fig. 6 shows a peak increase of about 0.25 s on average.
Latency Latency is the time between the submission of a request and the receipt of the response.
Latency (L) = Time when the response is received − Time when the request is submitted
The ledger height is directly proportional to the number of transactions successfully submitted to the network. The greater the ledger size, the greater the latency, since the system needs to process more blocks.
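The latency definition above, together with throughput counted as completed requests per second, can be measured with a small harness. This is a sketch under assumptions: requests are issued sequentially from one client, and the request callable stands in for whatever enrol, update, or retrieve call the web server exposes. The name measure is hypothetical.

```python
import time
from typing import Callable, List, Tuple


def measure(request: Callable[[], object], count: int) -> Tuple[List[float], float]:
    """Issue `count` sequential requests.

    Returns the per-request latencies in seconds and the overall
    throughput in requests per second.
    """
    latencies: List[float] = []
    start = time.perf_counter()
    for _ in range(count):
        submitted = time.perf_counter()
        request()  # blocks until the response is received
        latencies.append(time.perf_counter() - submitted)
    elapsed = time.perf_counter() - start
    throughput = count / elapsed if elapsed > 0 else float("inf")
    return latencies, throughput
```

Running the same harness from several client processes at once would give the multi-peer curves; the sequential version only covers the one-request-at-a-time case.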

Fig. 6 Delay


Fig. 7 The critical point of RAFT

Compromised Host Percentile The compromised host percentile (CHP) of the RAFT ordering system is defined as the minimum count of orderers that need to be in an available state to achieve consensus on the ledger. The CHP of the system can be inferred from the HP as the minimum HP at which the system can achieve consensus. The following graph shows the number of available orderers versus the total number of orderers before a system crash.
$$\mathrm{HP} = \frac{\text{orderers assigned} - \text{orderers crashed}}{\text{no. of orderers assigned to the system}}$$
The capacity of the system to handle requests is directly proportional to the count of peers present in the system. A single peer in a uni-peer setup can handle 0.08323 operations per second (Fig. 7). As the number of peers increases, the throughput increases and reaches a maximum of 37.613 operations per second, which is considerably higher than Bitcoin’s throughput [12], which averages 8 transactions per second.
Varying Weight The following analysis shows how the reputation rating evolves as a function of accumulated positive feedback with varying weight. Let $r_{X,T}$ and $s_{X,T}$ respectively represent the (collective) amount of positive and negative feedback about university T provided by university X. Suppose university T receives only a sequence of n positive feedback from university X, each with weight ω:
$$r_{X,T} = n\,\omega, \qquad s_{X,T} = 0$$
The reputation of university T, as rated by university X, can then be defined as
$$\mathrm{Rep}_{X,T} = \frac{n\,\omega}{n\,\omega + 2}$$
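Returning to the compromised host percentile defined earlier in this subsection, the critical point can be computed directly. The sketch below assumes the usual Raft rule that a strict majority of the assigned orderers must be available; the function names are hypothetical and the numbers are not the measured values of Fig. 7.

```python
def host_percentile(assigned: int, crashed: int) -> float:
    """HP = (orderers assigned - orderers crashed) / orderers assigned."""
    if assigned <= 0 or not 0 <= crashed <= assigned:
        raise ValueError("need assigned > 0 and 0 <= crashed <= assigned")
    return (assigned - crashed) / assigned


def raft_quorum(assigned: int) -> int:
    """Smallest number of available orderers that forms a strict majority."""
    if assigned <= 0:
        raise ValueError("assigned must be positive")
    return assigned // 2 + 1


def compromised_host_percentile(assigned: int) -> float:
    """Minimum HP at which consensus is still reachable."""
    return raft_quorum(assigned) / assigned
```

With five orderers the quorum is three, so the ordering service tolerates two crashes and stops reaching consensus when a third orderer fails.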


5 Conclusion and Future Work Today, students must request official records from the college registrar’s office and pay a fee for each copy requested. Employers and universities still sometimes need to call the issuing authority to be sure that a transcript was not faked. With the proposed system, universities interact with the system directly and can upload and transact certificates. Students, in turn, hold a digital copy of their certificates that other universities can easily verify, while the security and authenticity of the certificate are preserved. Deploying the system in a cloud environment makes it accessible from anywhere in the world. The reputation of each university is calculated, and based on it the university is included in or deleted from the network. Cloud-based storage and verification of certificates replace various data centers. Future work to improve the system includes additional automation by incorporating IoT devices such as bar code sensors and scanners. Using IoT technology, a certificate can be scanned and sent to the cloud for verification, and the result displayed to the user and the respective student.

References
1. Kim TH, Kumar G, Saha R, Rai MK, Buchanan WJ, Thomas R (2020) A privacy-preserving distributed ledger framework for global human resource record management: the blockchain aspect. IEEE Access 8:96455–96467
2. Mishra RA, Kalla A, Braeken A, Liyanage M (2021) Privacy protected blockchain based architecture and implementation for sharing of students’ credentials. Inf Process Manage 58(3)
3. Anitha T, Bose S, Logeswari G (2021) Dynamic PHAD/AHAD analysis for network intrusion detection and prevention system for cloud environment. In: Proceedings IEEE international conference on computing and communications technologies (ICCCT). Chennai, India, pp 273–279
4. Logeswari G, Bose S, Anitha T (2021) An intrusion detection and prevention system for DDoS attacks using a 2-player bayesian game theoretic approach. In: Proceedings IEEE international conference on computing and communications technologies (ICCCT). Chennai, India, pp 319–324
5. Caldarelli G, Ellul J (2021) Trusted academic transcripts on the blockchain: a systematic literature review. Appl Sci 11(4)
6. Alam S, Ayoub HAY, Alshaikh RAA, AL-Hayawi AHH (2021) A blockchain-based framework for secure educational credentials. Turkish J Comput Math Educ 12(10):5157–5167
7. Dalal J, Chaturvedi M, Gandre H, Thombare S (2020) Verification of identity, and educational certificates of students using biometric and blockchain. In: Proceedings of the 3rd international conference on advances in science & technology (ICAST). Mumbai, India
8. Logeswari G, Bose S, Anitha T (2022) An intrusion detection system for SDN using machine learning. Intell Autom Soft Comput 35(1):867–880
9. Rasool S, Saleem A, Iqbal M, Dagiuklas T, Mumtaz S, ul Z (2020) Docschain: blockchain-based IoT solution for verification of degree documents. IEEE Trans Comput Soc Syst 7(3):827–837
10. Rama Reddy T, Prasad Reddy PVGD, Srinivas R, Raghavendran CV, Lalitha RVS, Annapurna B (2021) Proposing a reliable method of securing and verifying the credentials of graduates through blockchain. EURASIP J Inf Secur 7
11. Gresch J, Rodrigues B, Scheid E, Kanhere SS, Stiller B (2020) The proposal of a blockchain-based architecture for transparent certificate handling. Bus Inf Syst
12. Yue D, Li R, Zhang Y, Tian W, Peng (2018) Blockchain based data integrity verification in P2P cloud storage. In: IEEE 24th international conference on parallel and distributed systems (ICPADS). Singapore, pp 561–568
13. Li H, Dezhi H (2019) A blockchain-based educational records secure storage and sharing scheme. IEEE Access 7:179273–179289
14. Rodel A, Fernandez PL (2019) Credence Ledger: a permissioned blockchain for verifiable academic credentials conference. In: IEEE international conference on engineering, technology, and innovation (ICE/ITMC). Stuttgart, Germany
15. Afrianto I, Heryanto Y (2019) Design and implementation of work training certificate verification based on public blockchain platform. In: Fifth international conference on informatics and computing (ICIC). Gorontalo, Indonesia
16. Teymourlouei H, Jackson L (2019) Blockchain: enhance the authentication and verification of the identity of a user to prevent data breaches and security intrusions. In: Proceedings of the international conference on scientific computing (CSC)
17. Alrawais A, Alhothaily A, Cheng X, Hu C, Yu J (2018) A certificate validation system in public key infrastructure. In: The proceedings of IEEE transactions on vehicular technology, vol 67, pp 5399–5408
18. Zhu WT, Lin J (2016) Generating correlated digital certificates: framework and applications. In: The proceedings of IEEE transactions on information forensics and security, vol 11, pp 1117–1127
19. Cheng J, Lee N, Chi C, Chen Y (2018) Blockchain and smart contract for a digital certificate. In: IEEE International conference on applied system invention (ICASI). Chiba, Japan, pp 13–17
20. Imam T, Arafat Y, Alam KS, Shahriyar SA (2021) DOC-BLOCK: a blockchain-based authentication system for digital documents. In: Third international conference on intelligent communication technologies and virtual mobile networks (ICICV)
21. Blockcerts.in the open initiative for blockchain certificates. http://www.blockcerts.org/
22. Kim TH, Kumar G, Saha R, Rai MK (2020) A privacy-preserving distributed ledger framework for global human resource record management: the blockchain aspect. IEEE Access 99
23. Rasool S, Saleem A, Iqbal M, Dagiuklas T, Mumtaz S, Qayyum Z (2020) Docschain: blockchain-based IoT solution for verification of degree documents. IEEE Trans Comput Soc Syst 7(3)
24. Rama Reddy T, Prasad Reddy PVGD, Srinivas R, Raghavendran CV, Lalitha RVS, Annapurna B (2021) Proposing a reliable method of securing and verifying the credentials of graduates through blockchain. EURASIP J Inf Secur 7
25. Li H, Han D (2019) A blockchain-based educational records secure storage and sharing scheme. In: This work was supported in part by the National Natural Science Foundation of China under Grant 61672338 and Grant 61873160.7. pp 179273–179289
26. Coin market cap hyperledger archwg paper 1 consensus.pdf, access December 12 2017. https://www.hypuploadsds/2017/08/HyperLedgerArchWG Paper 1 Consensus.pdf. Accessed December 12 (2019)
27. Mani S, Sundan B, Thangasamy A, Govindaraj L (2022) A new intrusion detection and prevention system using a hybrid deep neural network in cloud environment. Comput Netw Big Data IoT. (Lecture notes on data engineering and communications technologies) 117:981–994
28. Afrianto I (2019) Design and implementation of work training certificate verification based on public-blockchain platform. IEEE Transaction

Windows Malware Hunting with InceptionResNetv2 Assisted Malware Visualization Approach
Osho Sharma, Akashdeep Sharma, and Arvind Kalia

Abstract Context: With rapidly growing information transfer speeds and easier code development strategies, recent years have witnessed an increase in the volume, velocity, and voracity of malware attacks. Existing consumer-level malware detection solutions are inefficient at detecting ‘zero-day’, obfuscated, and unknown malware variants. Machine learning and deep learning solutions overcome these issues and demonstrate promising results. Malware visualization-based techniques in particular have demonstrated significant efficacy in the past but still offer room for improvement, which is the subject of the current work. Objectives: The current study proposes a method for malware detection and classification that uses grayscale malware images created from Windows malware binaries, followed by a pretrained InceptionResNetv2 CNN for effective malware detection and classification. Methods and design: We begin by creating grayscale images of the latest malware binaries collected from the Internet. We apply image resizing and byte reduction techniques to equalize the image sizes, and we use a pretrained InceptionResNetv2 CNN architecture, trained on 1.5 million images in the ImageNet repository, for malware detection and classification. Results and Conclusion: To evaluate the performance of the suggested method, we use one public benchmark malware image dataset (Malimg) and one custom-built malware image dataset created from the latest malware samples from the Internet. Our model demonstrates state-of-the-art classification accuracy of 99.2% on both datasets and proves to be an effective yet computationally inexpensive choice for real-time malware detection and classification.

O. Sharma (B) · A. Kalia
Department of Computer Science, Himachal Pradesh University, Shimla, India
A. Sharma
Department of Computer Science and Engineering, UIET, Panjab University, Chandigarh, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_12


Keywords Deep learning · Information security · Malware classification · Convolutional neural networks

1 Introduction Malware, short for malicious software, refers to computer programs that are designed to carry out undesirable or destructive behavior, such as intruding into a system and capturing sensitive information, or manipulating or destroying a hacked system [1]. The area of malware detection and classification has received considerable critical attention due to an increase in cybercrimes. The Windows OS receives the highest number of malware threats due to its popularity and the frequent piracy of software [2]. According to an estimate by VirusTotal (https://www.virustotal.com/gui/stats), nearly 1.6 million distinct malware scan hits were registered in their system in 2022. Moreover, the problem has become more severe because commonly available anti-malware solutions struggle to detect zero-day and obfuscated malware [3]. Additionally, the lack of updated malware datasets makes the problem more difficult [4]. Recent computer vision solutions have shown decent success in the domain of malware detection and classification [5–7]. The idea behind malware visualization-based solutions is to convert malicious executables into images, turning the malware classification problem into an image classification problem. Several past studies have shown that malware instances of the same family exhibit matching patterns when converted into image format, and that neural networks can be effective at identifying these patterns [8]. Image-based malware detection and classification techniques have several advantages. For instance, they have been shown to be resilient against code obfuscation, hiding, packing, and encryption methods [7]. Moreover, image-based approaches have been shown to produce better results in detecting ‘zero-day’ and unknown malware samples. The current study proposes a malware detection and classification model built on a widely used image classification network, InceptionResNetv2 [9]. We have chosen this network for its ability to identify highly discriminating features from malware images. Our approach converts malware instances into grayscale malware images and performs malware detection and family classification using a pretrained InceptionResNetv2 CNN, which is trained on 1.5 million images of the ImageNet repository. Additionally, we observe that past studies have utilized obsolete malware or old benchmark datasets [1, 10]. To avoid the problem of obsolete malware, we create our own data corpus from scratch by collecting the latest malware executables from the Internet. The following points outline the contributions of the current work:

1. We develop a malware family classification system based on transfer learning using a pretrained InceptionResNetv2 CNN, which is used to classify grayscale malware images into their respective families.
2. We compare the InceptionResNetv2 CNN with other image classification networks (Densenet50, Nasnetlarge, VGG16, and Mobilenetv2) and with other malware image classification methodologies in the literature, and demonstrate good detection results for the proposed solution.
3. To address the issue of outdated malware datasets used in past studies, we create our own grayscale malware image dataset using the latest malware samples from the Internet.
The study presents new directions toward designing cheap and robust malware detectors capable of detecting and classifying zero-day and unknown malware, and our model has significant implications and usability in the area of cyber-security. The remainder of the work is arranged as follows: Sect. 2 presents recent studies in the domain of image-based malware detection. Section 3 presents the stages of the proposed model. Section 4 describes the dataset details and the evaluation metrics, and Sect. 5 presents the results and analysis of our proposed system. Finally, Sect. 6 concludes the work and presents future directions.

2 Related Works The published work in the domain of malware detection and classification can be divided into static methods [11, 12], dynamic analysis methods [3, 13], hybrid methods [14], machine learning [15, 16] and deep learning solutions [17, 18], and image-based malware detection approaches [5–7]. In this section, we discuss some of the most recent research involving malware visualization-based approaches. Kalash et al. [19] developed a malware image classification framework using the VGG16 convolutional neural network. Their framework, M-CNN, was evaluated on two popular benchmark datasets, Malimg and the Microsoft image dataset, and demonstrated malware detection accuracy scores of nearly 99%. Vasan et al. [6] proposed an ensemble-based approach to integrate multiple CNN architectures. Their model, named IMCEC, combined a fine-tuned VGG16 and a fine-tuned ResNet50, employing SoftMax classifiers with SVM classifiers, to achieve an accuracy score of nearly 98% on the same benchmark datasets. Xiao et al. [20] integrated malware visualization, automatic feature extraction, and malware classification in their model MalFCS. The model can process malware executables without disassembly and decryption by using structural entropy graphs, and it achieved accuracy scores of nearly 99% on each of the Malimg and Microsoft benchmark datasets. Verma et al. [21] proposed a method based on analyzing memory dumps left by malware. The memory dump files are converted to grayscale images, and the histogram of gradient (HOG) technique is used for feature extraction. Their approach shows promising malware family classification scores of nearly 98%. To bypass RNN and CNN-based malware detection methods, Peng et al. [4] suggested an approach for creating adversarial malware. Their method is built on a black-box approach that uses word embeddings to map API calls, which are then utilized to evade malware classifiers based on RNN and CNN. Their best-case accuracy score, over 96%, was obtained with a BiLSTM model. Jeon and Moon [22] developed a convolutional recurrent neural network (CRNN) model, combining a convolutional neural network (CNN) and a recurrent neural network (RNN), to detect malicious executables using n-grams and extracted opcode sequences. They also recommended using dynamic RNNs (DRNNs) to extract the opcode sequences without executing the file, saving computational time and power. Their approach demonstrates a 99% AUC, a 95% TPR, and a 96% detection accuracy. Narayanan et al. [23] used their CNN-LSTM model to examine and assess the nine classes of Windows malware in the Microsoft Malware Challenge dataset (BIG-2015) provided on Kaggle. The LSTM neural network achieved a 97.2% accuracy score, whereas the CNN model achieved a 99.4% accuracy score. For efficient Android-based malware image detection, De Lorenzo et al. [24] suggested an RNN-LSTM model. Their work also includes a tool called VizMal, which uses system call metadata to visualize the execution path of Android programs. They used user-based validation to confirm and demonstrate the promising outcomes, but built-in evaluation and assessment metrics would be a better way of evaluating their model. Zhang et al. [25] proposed a ransomware classification technique based on static opcode sequence analysis. They used the self-attention approach to capture complementary data of distance-aware relationships and compared its effectiveness with CNN and RNN-based models. Their proposed method showed precision and recall scores of nearly 87.5%.
By turning malicious executables into grayscale pictures, Nataraj et al. [26] developed an image-based malware classification model. Their research focuses on turning malware binaries into vectors of 8-bit unsigned integers, with each integer corresponding to a pixel value for grayscale image conversion. According to their findings, texture analysis of binary data delivers accurate detection results, with a classification accuracy of 98%, while taking less time. To test their hypothesis that malware memory dumps can be used to detect and categorize malware samples, Dai et al. [27] employ a backdoor malware dataset and an MLP-based model for malware image classification. Malware memory dumps are converted to grayscale images and fed into multilayer perceptron (MLP) networks, and the technique shows promising accuracy of nearly 95%. Yue [28] offered a remedy for class imbalance: a weighted SoftMax loss layer added as the last layer of a deep CNN-based malware detection model. The model, based on ‘vgg-verydeep-19’, showed a classification accuracy of over 98%. Using two popular malware datasets, ‘Malimg’ and ‘Microsoft Malware Dataset (BIG-15)’, Lo et al. [29] demonstrate results for their proposed Xception CNN-based model. The authors evaluated their modified version of Xception CNN against KNN, SVM, and VGG16 models and found that it outperforms these models with a classification accuracy of 99.03%. Jung et al. [30] propose an intuitive method for classifying Android malware images, based on the assumption that the complete .dex file in Android may contain noisy patterns and that the portion useful for analysis is the data section, which they used to create their data corpus. The data corpus was transformed into image files and fed into a convolutional neural network (CNN) for classification. Their method not only achieves a detection accuracy of over 98% but also reduces the model’s storage requirements by 17.5%. With Malimg and Microsoft’s malware dataset, Sudhakar and Sushil [10] demonstrate a conventional and a transfer learning-based strategy for classifying malware images. By altering the last layer of the ResNet-50 model, they show improvements over the existing architecture. On the same two benchmark datasets as the previous study, their redesigned structure demonstrates classification accuracy of > 98%. Yuan et al. [31] present MDMC, a byte-level classification approach based on Markov images. They also describe a method for converting executables to Markov images, and their deep CNN-based architecture is tested on the Microsoft dataset and the Drebin dataset, with accuracy scores of almost 99% and 97%, respectively. Sharma et al. [32] utilized three types of malware images (grayscale, color, and Markov images) and employed a custom deep CNN and a transfer learning-based Xception CNN to achieve malware classification performance of nearly 99% on the Microsoft dataset and a custom-built dataset. Pinhero et al. [33] utilized 12 different neural network architectures to classify 20,199 malware images and achieved an F-measure of approximately 99%. Based on a slightly different approach of color image visualization, Naeem et al. [7] created a deep neural network architecture to identify malware attacks on the Industrial Internet of Things (IIoT).
Their method has been proved to attain a classification score of about 99%. Liu et al. [34] offer a machine learning-based malware detection solution based on malware visualization and adversarial training, which they test against two benchmark datasets: Ember and Microsoft’s BIG-15. Their recommended technique has a 97.73% accuracy rate for blocking zero-day attacks, with an average of 96.25% for all malwares examined. A colored labeled boxes (CoLab)-based method was presented to pin the sections of a PE file to highlight the section information in images of malicious executables by Xiao et al. [35] using a unique visualization method based on VGG16 and SVM. Their testing results from malware collected from VX-Heaven, VirusShare, and the Microsoft BIG-2015 dataset show nearly 96% accuracy and detection rates. A few points can be inferred from the previous works such as (a) previous works incorporate complex solutions making them ineffective for lightweight environments and (b) lack of updated malware datasets in the previous works. To address these challenges, we suggest (a) a malware visualization-based classification approach which is computationally lightweight, (b) in addition to a public benchmark malware image dataset (Malimg), we create our own malware image dataset from latest malware samples from the Internet, and (c) we utilize transfer learning and a state-of-theart image classification framework InceptionResNetv2 to produce good malware classification results.


Fig. 1 Overview of the suggested framework

3 Methodology
Malware developers often change the code of known malware variants to generate new malware. Deep learning methods can detect these alterations when malware code structures are expressed as images. The next subsection gives an overview of the general architecture, followed by a description of each stage in the subsequent subsections.

3.1 Overview
The task in the current study can be seen as a semi-supervised multiclass image classification problem, in which the labels of training instances are obtained and the labels of testing instances are predicted. Malware visualization, i.e., the conversion of malware instances into malware images, is the main consideration, and the technique is outlined in the following paragraphs. After that, the images are classified by a pretrained InceptionResNetv2 CNN using a transfer learning approach. The overview of the framework is depicted in Fig. 1.

3.2 Grayscale Malware Images
As part of our data conversion process, we transform malicious files into grayscale malware images to extract distinctive qualities from malware binaries. The hexadecimal values of executable files are converted into binary data. We map each byte to a gray value ranging from 0 to 255, with 0 signifying black and 255 indicating white. These decimal integers are arranged into a 256 × 256 two-dimensional matrix that can be rendered as an image. The process of malware-to-image conversion, as originally suggested by Nataraj et al. [26], is shown in Fig. 2. Figure 3 presents the sections of a popular Trojan, Dontovo.A, from the Malimg dataset for illustration purposes.
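The byte-to-pixel mapping described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the handling of files that are shorter or longer than 256 × 256 bytes (zero-padding and truncation) is an assumption, since the text does not specify it.

```python
def bytes_to_grayscale(data, side=256):
    """Map raw bytes to a side x side matrix of gray values in 0..255."""
    needed = side * side
    # Assumption: truncate long inputs and zero-pad short ones.
    padded = data[:needed].ljust(needed, b"\x00")
    return [list(padded[r * side:(r + 1) * side]) for r in range(side)]


def to_pgm(matrix):
    """Serialize the matrix as a binary PGM (P5) grayscale image."""
    height, width = len(matrix), len(matrix[0])
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(v for row in matrix for v in row)


# Stand-in for the bytes of an executable (hypothetical data).
sample = bytes(range(256)) * 3
image = bytes_to_grayscale(sample)
pgm = to_pgm(image)
print(len(image), len(image[0]), len(pgm))
```

Writing `pgm` to a file with a `.pgm` extension yields an image that common viewers can open, which is convenient for inspecting the texture of a sample by eye.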


Fig. 2 Converting malware into grayscale image as suggested by Nataraj et al. [26]

Fig. 3 Sections of a common trojan downloader Dontovo.A malware

3.3 InceptionResNetv2 CNN
Artificial neural networks are able to automatically generate high-level feature descriptions from low-level data streams, eliminating the need for manual analysis. CNNs are artificial neural networks modeled on the visual cortex of animals, and their receptive fields are built from layered sub-regions that cover the whole visual area. CNNs have long been used in visual recognition tasks such as image categorization and object identification because they successfully extract discriminating local features from visual data. At the core of our malware classification system lies the InceptionResNetv2 [9] convolutional neural network, a popular CNN architecture for image detection and computer vision tasks. It is built by combining the Inception structure with residual connections. The residual connections not only address the degradation issue that arises in deep networks but also reduce the training time. The original version of InceptionResNetv2 is 164 layers deep and has been shown to classify images into 1000 object categories [9]. As a consequence, the network has learnt rich feature representations for a wide variety of images.


Fig. 4 InceptionResNetv2 architecture [9]

The network takes a 299-by-299 image as input and outputs a list of predicted class probabilities. It is built on the Inception architecture combined with residual connections. In the Inception-ResNet block, convolutional filters of multiple sizes are combined with residual connections. The introduction of residual connections overcomes the degradation problem caused by deep structures while also reducing training time. We utilize a pretrained version of the InceptionResNetv2 CNN. The core network structure of InceptionResNetv2 is depicted in Fig. 4. After the input image passes through the network structure, the result is passed to a batch normalization layer to standardize the input shape to 1792. This is followed by a ReLU activation layer and then an average pooling layer, which takes the average of the previous inputs and returns a uniform output. Further, a dropout layer (dropout rate = 0.2) randomly discards 20% of the inputs before they are sent to a SoftMax layer for classification. The number of neurons in the SoftMax layer depends on the number of classes in the malware dataset: 25 for the Malimg dataset and 6 for the self-created dataset.
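To make the order of operations in the classification head concrete, the following pure-Python sketch walks a small set of feature maps through ReLU, average pooling, dropout, and SoftMax. It is an illustration under stated assumptions rather than the trained model: the pretrained backbone is replaced by random feature maps, batch normalization is omitted, the dense weights are random, and all function names are hypothetical.

```python
import math
import random


def relu(x):
    return x if x > 0.0 else 0.0


def global_average_pool(feature_maps):
    """Average each channel's 2-D map down to a single value."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each input with probability `rate`, rescale the rest."""
    if not training:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def softmax(logits):
    peak = max(logits)                        # subtract the max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feature_maps, weights, biases, rate=0.2, rng=None, training=False):
    rng = rng if rng is not None else random.Random(0)
    activated = [[[relu(v) for v in row] for row in fmap] for fmap in feature_maps]
    pooled = global_average_pool(activated)
    dropped = dropout(pooled, rate, rng, training)
    logits = [sum(w * x for w, x in zip(ws, dropped)) + b
              for ws, b in zip(weights, biases)]
    return softmax(logits)


rng = random.Random(42)
num_channels, num_classes = 8, 6   # 6 outputs, as for the self-created dataset
feature_maps = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
                for _ in range(num_channels)]
weights = [[rng.uniform(-1, 1) for _ in range(num_channels)]
           for _ in range(num_classes)]
biases = [0.0] * num_classes
probs = classify(feature_maps, weights, biases, rate=0.2, rng=rng, training=True)
print(len(probs), round(sum(probs), 6))
```

Dropout is only active when `training=True`; at prediction time the pooled features pass through unchanged, which is the usual convention for this layer.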

4 Experiments
This section specifies the experimental setup, dataset details, and evaluation metrics used for the suggested model. We used an 11th Generation Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz with 16 GB RAM and a Geforce GTX 1080 Ti 2 GB graphics card, running the open-source Ubuntu 21.04 operating system (OS). The code is written in Python3 with the Keras2, Scikit-learn, TensorFlow, and Matplotlib libraries. The InceptionResNetv2 CNN is trained for 60 epochs with the categorical cross-entropy loss function. The InceptionResNetv2 is pretrained on the ImageNet repository with 1.5 million images, and the pretrained weights are utilized in the current work.


4.1 Dataset Details
To evaluate the trained networks in the current work, we utilize two malware image datasets. The first is a public benchmark dataset called ‘Malimg’ [36], originally proposed by Nataraj et al., which consists of 9342 grayscale malware images categorized into 25 classes. It contains malware from the Yuner.A, VB.AT, Malex.gen!J, Autorun.K, and Rbot!gen families, among others. The number of observations in each class, however, is imbalanced. The second dataset is a self-created Windows malware image dataset that includes five different types of malware and one class of benign files. Infected files were downloaded from the theZoo,² VirusShare,³ and VX-Heaven⁴ websites. To avoid duplication, we verified the MD5 checksums of the samples. We used the VirusTotal API to label the malware samples with a majority vote technique in which 80% of the anti-malware engines in the VirusTotal API agreed on the classification of a sample into a certain family. After the malware samples were collected, they were converted into image format following the procedure described in the previous section. Tables 1 and 2 show the details of the two datasets.
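The two curation steps described above, MD5-based de-duplication and labeling by an 80% agreement threshold, can be sketched as follows. This is a hedged illustration: the function names and sample data are hypothetical, the per-engine labels are assumed to be available as a plain list of strings, and no VirusTotal API calls are made.

```python
import hashlib
from collections import Counter


def deduplicate(samples):
    """Keep the first sample seen for each MD5 digest."""
    seen, unique = set(), []
    for name, payload in samples:
        digest = hashlib.md5(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((name, payload))
    return unique


def consensus_label(engine_labels, threshold=0.8):
    """Return the family named by at least `threshold` of the engines, else None."""
    if not engine_labels:
        return None
    family, votes = Counter(engine_labels).most_common(1)[0]
    return family if votes / len(engine_labels) >= threshold else None


# Hypothetical samples: (file name, raw bytes).
samples = [("a.exe", b"MZ\x90"), ("b.exe", b"MZ\x90"), ("c.exe", b"MZ\x00")]
unique = deduplicate(samples)

# Hypothetical per-engine verdicts for two samples.
agreed = consensus_label(["Worm"] * 8 + ["Adware"] * 2)
disputed = consensus_label(["Worm"] * 7 + ["Adware"] * 3)
print([name for name, _ in unique], agreed, disputed)
```

Samples for which `consensus_label` returns `None` would be left out of the dataset under this reading of the procedure; how the authors handled such cases is not stated.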

4.2 Evaluation Metrics
The evaluation metrics used to test the performance of the proposed model are given below:
• True Positive (TP) denotes the correctly classified positive category instances.
• True Negative (TN) denotes the correctly classified negative category instances.
• False Positive (FP) denotes negative category instances that have been incorrectly classified as positive category instances.
• False Negative (FN) denotes positive category instances that have been incorrectly classified as negative category instances.
• Accuracy, precision, F1, and recall are calculated from these counts using the formulas given below.
• The AUC is defined as the likelihood that the classifier gives a randomly selected positive sample a higher value than a randomly selected negative sample. It takes a value between 0 and 1, and the nearer it is to 1, the better the model’s performance.
• Accuracy: the percentage of accurately predicted samples among all samples, as shown in Eq. (1)

² https://github.com/ytisf/theZoo
³ https://virusshare.com/
⁴ https://vx-underground.org/archive/VxHeaven/index.html

Table 1 Details of malware in Malimg dataset [36]

| Family | Class ID | Number of samples |
|---|---|---|
| Adialer.C | 1 | 125 |
| Agent.FYI | 2 | 116 |
| Allaple.A | 3 | 2949 |
| Allaple.L | 4 | 1591 |
| Allueron.gen!J | 5 | 198 |
| Autorun.K | 6 | 106 |
| C2Lop.P | 7 | 146 |
| C2Lop.gen!g | 8 | 200 |
| Dialplatform.B | 9 | 177 |
| Dontovo.A | 10 | 162 |
| Fakerean | 11 | 381 |
| Instantaccess | 12 | 431 |
| Lolyda.AA1 | 13 | 213 |
| Lolyda.AA2 | 14 | 184 |
| Lolyda.AA3 | 15 | 123 |
| Lolyda.AT | 16 | 159 |
| Malex.gen!J | 17 | 136 |
| Obfuscator.AD | 18 | 142 |
| Rbot!gen | 19 | 158 |
| Skintrim.N | 20 | 80 |
| Swizzor.gen!E | 21 | 128 |
| Swizzor.gen!I | 22 | 132 |
| VB.AT | 23 | 408 |
| Wintrim.BX | 24 | 97 |
| Yuner.A | 25 | 800 |
| Total | | 9342 |

Table 2 Details of malware in custom built windows malware dataset

| Family | Class ID | Number of samples |
|---|---|---|
| Adware | 1 | 1146 |
| Exploit | 2 | 138 |
| Spyware | 3 | 582 |
| Downloader | 4 | 1512 |
| Worm | 5 | 1620 |
| Benign | 6 | 508 |
| Total | | 5400 |

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

• Precision: the fraction of samples predicted as malware that are actually malware, as shown in Eq. (2)

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

• Recall (sensitivity): the fraction of actual malware instances that are correctly predicted as malware, as shown in Eq. (3)

$$\text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN} \tag{3}$$

• F1 score: the harmonic mean of the precision and recall values, as shown in Eq. (4)

$$\text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
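The four formulas and the pairwise definition of the AUC translate directly into code. The sketch below is illustrative only: the confusion counts and classifier scores are made-up numbers, and the AUC is computed by brute-force comparison of every positive and negative pair (ties counted as one half), which matches the probabilistic definition given in the list but is not necessarily how the reported values were obtained.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)


def auc(pos_scores, neg_scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical confusion counts for one class treated as "positive".
tp, tn, fp, fn = 90, 85, 15, 10
acc = accuracy(tp, tn, fp, fn)
prec = precision(tp, fp)
rec = recall(tp, fn)
f1 = f1_score(tp, fp, fn)

# Hypothetical classifier scores for positive and negative samples.
area = auc([0.9, 0.8, 0.4], [0.7, 0.3])
print(acc, prec, rec, f1, area)
```

For a multiclass problem such as the 25-class Malimg task, these per-class values are typically averaged across classes; the averaging scheme is not stated in the text.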

5 Results and Analysis
The current work is evaluated and compared with similar works in the literature. The binary code of malicious samples is used to generate grayscale malware images, as described in the previous section, and the images are classified with a pretrained InceptionResNetv2 CNN. The accuracy and loss plots for InceptionResNetv2 on the two malware image datasets are shown in Fig. 5. As observed in Fig. 5, the accuracy for the two datasets begins at nearly 3% and 5% and increases to nearly 99.2% by the 50th epoch. Similarly, the learning loss for both datasets begins at 1.60% and gradually reduces to 0.37% after the 50th epoch. The networks tend to stabilize near the 50th epoch. The classification performance of the InceptionResNetv2 CNN on the two image datasets described earlier is discussed in this section. For both Malimg and our custom malware image dataset, we use 70% of the samples for training and 30% for testing. Figures 6 and 7 show the outcomes of the experiments, including classification accuracy, AUC, sensitivity, and F-measure for the two datasets. In addition to the InceptionResNetv2 CNN, the comparison includes other deep CNNs used in image classification: Densenet50, Mobilenetv2, Nasnetlarge, and VGG16. Figures 8 and 9 show the confusion matrices obtained with the InceptionResNetv2 CNN on the two datasets. Tables 3 and 4 compare our work with a number of major publications for a more comprehensive analysis of the InceptionResNetv2 system. Based on the comparison, we conclude that our strategy is more effective and lightweight than other approaches. As a result, our technique is shown to achieve state-of-the-art accuracy in detecting and classifying malware.

Fig. 5 Training accuracy and loss plots for the CNNs used in the study

Fig. 6 Accuracy, AUC, F-measure, and sensitivity of the tested models on Malimg [36] dataset
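The 70/30 division of samples used in the experiments can be sketched as below. Because both datasets are class-imbalanced, this sketch splits each class separately (a stratified split); that choice is an assumption made here for illustration, since the text states the proportions but not how the split was drawn. The function name and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        cut = round(train_fraction * len(members))
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return train_idx, test_idx


# Toy, imbalanced label list (hypothetical class names and counts).
labels = ["A"] * 10 + ["B"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting per class keeps rare families represented in both partitions, which matters when a class such as Skintrim.N has only 80 samples.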

6 Conclusions and Future Scope
Malware attacks are responsible for substantial economic losses across global economies, and there is a dire need to address the growing speed and complexity of malware. The current work utilizes a pretrained InceptionResNetv2 CNN for effective Windows malware detection and classification. Apart from one public benchmark malware image dataset (Malimg), we create our own malware image dataset from the latest Windows binaries on the Internet. Our solution demonstrates excellent family classification accuracy (≈ 99.2%), and the malware visualization approach makes our method resilient toward obfuscated, packed, and unknown malware variants. In the future, we plan to work with adversarial malware samples capable of evading machine learning and deep learning classifiers. Additionally, we plan to deploy our suggested model on a cloud hosting solution for use by the general public.

Fig. 7 Accuracy, AUC, F-measure, and sensitivity of the tested models on self-created malware image dataset

Statements and Declarations


Fig. 8 Confusion matrix of InceptionResNetv2 CNN on Malimg [36] dataset


Fig. 9 Confusion matrix of InceptionResNetv2 CNN on self-created malware image dataset

Table 3 Comparison of different works on Malimg [36] dataset

| Model | Accuracy (%) | F1 (%) | Train time (≈min) | Test time (≈s) |
|---|---|---|---|---|
| Anandhi et al. [37] | 98.97 | 98.88 | - | 0.05 |
| Vasan et al. [5] | 98.82 | 98.75 | - | 0.081 |
| Gibert et al. [1] | 94.81 | - | - | - |
| Sudhakar and Kumar [10] | 99.05 | 99.21 | 225 | 15 |
| Current (InceptionResNetv2 CNN) | 99.23 | 99.25 | 210 | 9 |

Table 4 Comparison of different works on custom built Windows malware image dataset

| Model | Accuracy (%) | F1 (%) | Train time (≈min) | Test time (≈s) |
|---|---|---|---|---|
| Stamp et al. [38] | 97.51 | 97.72 | 140 | 30 |
| Xiao et al. [20] | 98.45 | 98.20 | 150 | 50 |
| Yuan et al. [31] | 98.75 | 98.30 | 140 | 60 |
| Sudhakar and Kumar [10] | 98.81 | 97.44 | 120 | 65 |
| Current (InceptionResNetv2 CNN) | 99.21 | 98.76 | 110 | 35 |


Acknowledgements Not applicable

Funding Not applicable

Conflict of Interest The authors state that they have no known competing financial interests or personal ties that could have appeared to affect the work reported in this study.

Consent for publication Not applicable

Credit authorship contribution statement All authors contributed equally to this manuscript

References
1. Gibert D, Mateu C, Planes J, Vicens R (2019) Using convolutional neural networks for classification of malware represented as images. J Comput Virol Hack Tech 15(1):15–28. https://doi.org/10.1007/s11416-018-0323-0
2. Ring M, Schlör D, Wunderlich S, Landes D, Hotho A (2021) Malware detection on windows audit logs using LSTMs. Comput Secur 109:102389. https://doi.org/10.1016/j.cose.2021.102389
3. Amer E, Zelinka I (2020) A dynamic Windows malware detection and prediction method based on contextual understanding of API call sequence. Comput Secur 92:101760. https://doi.org/10.1016/j.cose.2020.101760
4. Peng X, Xian H, Lu Q, Lu X (2021) Semantics aware adversarial malware examples generation for black-box attacks. Appl Soft Comput 109:107506. https://doi.org/10.1016/j.asoc.2021.107506
5. Vasan D, Alazab M, Wassan S, Naeem H, Safaei B, Zheng Q (2020) IMCFN: Image-based malware classification using fine-tuned convolutional neural network architecture. Comput Netw 171:107138. https://doi.org/10.1016/j.comnet.2020.107138
6. Vasan D, Alazab M, Wassan S, Safaei B, Zheng Q (2020) Image-Based malware classification using ensemble of CNN architectures (IMCEC). Comput Secur 92:101748. https://doi.org/10.1016/j.cose.2020.101748
7. Naeem H et al (2020) Malware detection in industrial internet of things based on hybrid image visualization and deep learning model. Ad Hoc Netw 105:102154. https://doi.org/10.1016/j.adhoc.2020.102154
8. Ding Y, Zhang X, Hu J, Xu W (2020) Android malware detection method based on bytecode image. J Ambient Intell Human Comput. https://doi.org/10.1007/s12652-020-02196-4
9. Szegedy C, Ioffe S, Vanhoucke V, Alemi A (Aug 2016) Inception-v4, inception-ResNet and the impact of residual connections on learning. arXiv:1602.07261 [cs]. Accessed 11 Nov 2021. [Online]. Available: http://arxiv.org/abs/1602.07261
10. Sudhakar, Kumar S (Dec 2021) MCFT-CNN: Malware classification with fine-tune convolution neural networks using traditional and transfer learning in Internet of Things. Future Gener Comput Syst 125:334–351. https://doi.org/10.1016/j.future.2021.06.029
11. Amin M, Tanveer TA, Tehseen M, Khan M, Khan FA, Anwar S (2020) Static malware detection and attribution in android byte-code through an end-to-end deep system. Futur Gener Comput Syst 102:112–126. https://doi.org/10.1016/j.future.2019.07.070
12. Liu L, Wang B (2017) Automatic malware detection using deep learning based on static analysis. In: Data science. Singapore, pp 500–507. https://doi.org/10.1007/978-981-10-6385-5_42
13. Escudero García D, DeCastro-García N (June 2021) Optimal feature configuration for dynamic malware detection. Comput Secur 105:102250. https://doi.org/10.1016/j.cose.2021.102250


14. Darabian H et al (2020) Detecting cryptomining malware: a deep learning approach for static and dynamic analysis. J Grid Computing 18(2):293–303. https://doi.org/10.1007/s10723-020-09510-6
15. Bai Y, Xing Z, Ma D, Li X, Feng Z (2021) Comparative analysis of feature representations and machine learning methods in Android family classification. Comput Netw 184:107639. https://doi.org/10.1016/j.comnet.2020.107639
16. Dehkordy DT, Rasoolzadegan A (2021) A new machine learning-based method for android malware detection on imbalanced dataset. Multimed Tools Appl 80(16):24533–24554. https://doi.org/10.1007/s11042-021-10647-z
17. Gibert D, Mateu C, Planes J (2020) HYDRA: A multimodal deep learning framework for malware classification. Comput Secur 95:101873. https://doi.org/10.1016/j.cose.2020.101873
18. Moti Z et al (2021) Generative adversarial network to detect unseen Internet of Things malware. Ad Hoc Netw 122:102591. https://doi.org/10.1016/j.adhoc.2021.102591
19. Kalash M, Rochan M, Mohammed N, Bruce NDB, Wang Y, Iqbal F (Feb 2018) Malware classification with deep convolutional neural networks. In: 2018 9th IFIP International conference on new technologies, mobility and security (NTMS), pp 1–5. https://doi.org/10.1109/NTMS.2018.8328749
20. Xiao G, Li J, Chen Y, Li K (2020) MalFCS: An effective malware classification framework with automated feature extraction based on deep convolutional neural networks. J Parallel Distrib Comput 141:49–58. https://doi.org/10.1016/j.jpdc.2020.03.012
21. Verma V, Muttoo SK, Singh VB (2020) Multiclass malware classification via first- and second-order texture statistics. Comput Secur 97:101895. https://doi.org/10.1016/j.cose.2020.101895
22. Jeon S, Moon J (2020) Malware-detection method with a convolutional recurrent neural network using opcode sequences. Inf Sci 535:1–15. https://doi.org/10.1016/j.ins.2020.05.026
23. Narayanan BN, Davuluru VSP (May 2020) Ensemble malware classification system using deep neural networks. Electronics 9(5) Art. no. 5. https://doi.org/10.3390/electronics9050721
24. De Lorenzo A, Martinelli F, Medvet E, Mercaldo F, Santone A (2020) Visualizing the outcome of dynamic analysis of Android malware with VizMal. J Inf Secur Appl 50:102423. https://doi.org/10.1016/j.jisa.2019.102423
25. Zhang B, Xiao W, Xiao X, Sangaiah AK, Zhang W, Zhang J (Sep 2020) Ransomware classification using patch-based CNN and self-attention network on embedded N-grams of opcodes. Future Gener Comput Syst 110:708–720. https://doi.org/10.1016/j.future.2019.09.025
26. Nataraj L, Karthikeyan S, Jacob G, Manjunath BS (July 2011) Malware images: visualization and automatic classification. In: Proceedings of the 8th International symposium on visualization for cyber security, New York, NY, USA, pp 1–7. https://doi.org/10.1145/2016904.2016908
27. Dai Y, Li H, Qian Y, Lu X (2018) A malware classification method based on memory dump grayscale image. Digit Investig 27:30–37. https://doi.org/10.1016/j.diin.2018.09.006
28. Yue S (Aug 2017) Imbalanced malware images classification: a CNN based approach. arXiv:1708.08042 [cs, stat]. Accessed 19 Oct 2021. [Online]. Available: http://arxiv.org/abs/1708.08042
29. Lo WW, Yang X, Wang Y (June 2019) An Xception convolutional neural network for malware classification with transfer learning. In: 2019 10th IFIP International conference on new technologies, mobility and security (NTMS), pp 1–5. https://doi.org/10.1109/NTMS.2019.8763852
30. Jung J, Choi J, Cho S, Han S, Park M, Hwang Y (2018) Android malware detection using convolutional neural networks and data section images. In: Proceedings of the 2018 conference on research in adaptive and convergent systems, New York, NY, USA, Oct 2018, pp 149–153. https://doi.org/10.1145/3264746.3264780
31. Yuan B, Wang J, Liu D, Guo W, Wu P, Bao X (2020) Byte-level malware classification based on markov images and deep learning. Comput Secur 92:101740. https://doi.org/10.1016/j.cose.2020.101740


32. Sharma O, Sharma A, Kalia A (2022) Windows and IoT malware visualization and classification with deep CNN and Xception CNN using Markov images. J Intell Inf Syst. https://doi.org/10.1007/s10844-022-00734-4
33. Pinhero A et al (2021) Malware detection employed by visualization and deep neural network. Comput Secur 105:102247. https://doi.org/10.1016/j.cose.2021.102247
34. Liu X, Lin Y, Li H, Zhang J (2020) A novel method for malware detection on ML-based visualization technique. Comput Secur 89:101682. https://doi.org/10.1016/j.cose.2019.101682
35. Xiao M, Guo C, Shen G, Cui Y, Jiang C (2021) Image-based malware classification using section distribution information. Comput Secur 110:102420. https://doi.org/10.1016/j.cose.2021.102420
36. Nataraj L, Karthikeyan S, Jacob G, Manjunath BS (2011) Malware images: visualization and automatic classification. In: Proceedings of the 8th international symposium on visualization for cyber security-VizSec ’11, Pittsburgh, Pennsylvania, pp 1–7. https://doi.org/10.1145/2016904.2016908
37. Anandhi V, Vinod P, Menon VG (2021) Malware visualization and detection using DenseNets. Pers Ubiquit Comput. https://doi.org/10.1007/s00779-021-01581-w
38. Stamp M, Chandak A, Wong G, Ye A (2022) On ensemble learning. arXiv:2103.12521 [cs], Mar 2021. Accessed 22 Jan 2022. [Online]. Available: http://arxiv.org/abs/2103.12521

Custom-Built Deep Convolutional Neural Network for Breathing Sound Classification to Detect Respiratory Diseases

Sujatha Kamepalli, Bandaru Srinivasa Rao, and Nannapaneni Chandra Sekhara Rao

Abstract In any living being, the respiratory system plays a vital role and is responsible for taking in the oxygen required by the body’s organs and blood. Respiratory illness has a substantial impact on global health. Detecting and diagnosing respiratory diseases is a challenging task for medical practitioners. In the current COVID-19 situation, respiratory illness becomes substantially more dangerous and can lead to death, since the virus directly affects the human respiratory system. In COVID times, wearing masks for long periods is also leading to respiratory diseases. The traditional method used for observing respiratory disorders is auscultation. It is well known for being inexpensive, non-invasive, and safe, and for requiring little diagnosis time. However, the accuracy of diagnosis with auscultation depends on the physician’s expertise and understanding and therefore necessitates substantial training. This paper suggests a solution based on a deep CNN for diagnosing respiratory diseases. A customized deep convolutional neural network was developed by considering the stacked LSTM model to classify breathing sounds and detect respiratory diseases. The developed model was built to categorize the six different types of breathing cycles included in the “ICBHF17 scientific challenge respiratory sound database”, and it performs well with 98.6% accuracy. We compared the developed model’s efficiency against state-of-the-art models.

Keywords Deep learning models · Respiratory disease diagnosis · Breathing sound cycles · Stacked LSTM model

S. Kamepalli (B) · B. S. Rao
VFSTR University, Guntur, Andhra Pradesh, India
e-mail: [email protected]

N. C. S. Rao
VR Siddhartha Engineering College, Vijayawada, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_13


S. Kamepalli et al.

1 Introduction
The “Forum of International Respiratory Societies” (FIRS) conducts various studies and research on respiratory diseases (RDs). According to its recent studies, RDs are leading causes of severe sickness throughout the globe, claiming more than 4 million lives each year. According to statistics produced by the “World Health Organization (WHO)” in 2017, chronic respiratory diseases are becoming more severe and account for 10% of deaths throughout the world [1–3]. In Sri Lanka, these disorders were responsible for 4.15% of deaths [4]; this was before the COVID-19 epidemic. Compared with the period before COVID, the mortality rate from respiratory disorders is now extraordinarily high. COVID-19-related respiratory problems are poorly understood and little researched. It is critical to diagnose these lethal diseases early and accurately to prevent mortality and the severe consequences patients face [5]. Respiratory diseases can be diagnosed by a clinical examination known as auscultation. The auscultation process involves listening to the sounds that the lungs produce during breathing. A stethoscope is usually placed on the “anterior and posterior chest walls” to hear the lungs during the breathing process. Auscultation is a medical procedure in which a doctor examines a patient’s lungs to look for abnormal noises overlaid on the patient’s normal breathing [3]. Auscultation alone cannot provide a definitive diagnosis; however, it can help diagnose lung disorders. Auscultation depends on the physician’s knowledge and familiarity, and becoming proficient in respiratory disease diagnosis via auscultation requires substantial training. Researchers have recently come up with various artificial intelligence algorithms to detect lung noises that are not the result of a medical condition [6].

1.1 Significance of the Proposed Model
Existing research focuses mainly on the classification of breathing sounds and ignores the class imbalance problem. The proposed method focuses primarily on feature extraction and categorization. Several signal processing techniques are used extensively to characterize breathing sounds during the feature extraction step, and deep learning approaches are applied to the extracted features. The class imbalance problem is also considered: class labels with a negligible number of samples are removed from the dataset, and the proposed model is configured to assign greater costs to the class labels with less data during the learning phase.
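One common way to assign greater costs to under-represented classes is to weight each class inversely to its frequency. The sketch below illustrates that idea together with the removal of classes that have a negligible number of samples. It is an assumption-laden example, not the configuration used in the paper: the weighting formula, the `min_count` cutoff, the function name, and the toy labels are all chosen here for illustration.

```python
from collections import Counter


def class_weights(labels, min_count=1):
    """Drop classes with fewer than `min_count` samples, then weight the rest
    inversely to their frequency (weights average to 1 over the kept samples)."""
    counts = {c: n for c, n in Counter(labels).items() if n >= min_count}
    total = sum(counts.values())
    return {c: total / (len(counts) * n) for c, n in counts.items()}


# Toy label distribution (hypothetical class names and counts).
labels = (["class_a"] * 80 + ["class_b"] * 15
          + ["class_c"] * 5 + ["class_d"] * 1)
weights = class_weights(labels, min_count=3)
print(weights)
```

The resulting dictionary can be used to scale each sample's loss by the weight of its class during training, so that mistakes on rare classes cost more than mistakes on common ones.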


1.2 Research Gap and Contributions of the Paper
Existing research on breathing sound classification ignores the class imbalance problem. In this paper, the class imbalance problem is addressed, and the accuracy improves compared with contemporary models. The significant contributions and results of this research are summarized as follows:
• A customized deep convolutional neural network based on a stacked LSTM model is proposed for multi-class classification of breathing sounds.
• The model is designed to assign greater costs to the class labels with less data during the learning phase.
• Experiments were conducted with different numbers of epochs to find the setting at which the network performs best in classifying the breathing cycles.
• The experimental results are presented and compared, in terms of accuracy, with the state-of-the-art models used for breathing sound classification.
The paper is arranged as follows: Sect. 1 covers the basics of respiratory disease diagnostics and the role deep learning models play in identifying lung sounds. Sect. 2 summarizes the findings of previous investigations into this topic. A custom-built deep neural network is proposed in Sect. 3. Finally, Sect. 4 summarizes our findings, and Sect. 5 concludes the paper.

2 Literature Survey Back ground work related to breathing sound classification was analyzed in this section. Five different lung sounds were included in the dataset: “normal, coarse crackle, fine crackle, monophonic, and polyphonic wheezes”. The authors employed various techniques such as “HOS, genetic algorithms (GA), and Fisher’s discriminant ratio (FDR)” to minimize dimensionality in the given dataset. The classification methods k-nearest neighbors and Naive Bayes are used to extract the features from the lung sounds. Cross-validation and Tukey’s honestly significant difference criterion were used to compare the performance of the classifiers. In feature selection, the genetic algorithms surpassed Fisher’s discriminant ratio. Each lung class had a distinct signature pattern, indicating that HOS is a potential feature extraction method for lung sounds. Other advantages include being able to classify distinct lung sounds appropriately. The top tree-based classifier achieved a classification accuracy of 98.1% on learning and 94.6% on validation respectively. Even with just one feature extraction tool, the proposed approach obtained good results (higher-order statistics) [1]. The diagnosis of respiratory diseases focused on this research [5], developed a Smart Stethoscope. The “Smart Stethoscope” is a cutting-edge artificial intelligence-powered platform for diagnosing respiratory diseases and teaching new doctors how to use it. The primary functions of this system include (modes). There

192

S. Kamepalli et al.

are three distinct aspects of this research. (1) The real-time prediction mode: this mode delivers real-time respiratory diagnosis predictions based on auscultation lung sound recordings. (2) Offline mode: Doctors who are in training period and medicos can use the offline mode to practice their skills. (3) The system’s prediction performance is continually improved by obtaining feedback from pulmonologists through the usage of expert mode. Prediction models for respiratory disease diagnosis are built using state-of-the-art artificial neural networks and recurrent convolutional networks. In doing the classification of respiratory sounds for the “ICBHF17 scientific challenge respiratory sound database”, the suggested C-Bi LSTM model obtained 98% accuracy. A unique CNN architecture and the extraction of “mel frequency cepstral coefficients (MFCC)” are used to discover unhealthy indications in respiratory sound data to give clinicians a possibly life-saving tool [2]. Various ensemble classification algorithms were used in this study to classify respiratory disorders into multiple classes. There were 215 participants in the study, and a total of 308 clinically acquired lung sound recordings and 1176 recordings from the “ICBHI Challenge” database were used. These recordings represented a wide range of health states, including “normal, asthma, pneumonia, heart failure, bronchiectasis, and chronic obstructive pulmonary disease”. It was found that Shannon entropy, logarithmic energy entropy, and spectrogram-based spectrum entropy were used to represent the features of the lung sound signals. Bootstrap aggregation and adaptive boosting ensembles were built using decision trees and discriminant classifiers as basis learners. Bayesian hyper parameter optimization was used to determine the ideal ensemble model structure, then compared to well-known classifiers in the literature. 
Boosted decision trees were the most effective, with overall accuracy, sensitivity, specificity, F1-score, and Cohen’s kappa coefficient of 98.27%, 95.28%, 99.99%, 93.6%, and 92.28%, respectively. Among the baseline approaches, SVM had the best overall accuracy (98.20%), sensitivity (91.5%), and specificity (97.5%) but performed somewhat worse than the others (98.55%). Despite their simplicity, the ensemble classification approaches studied showed promising results for detecting a wide spectrum of respiratory illness states [3].

In another experiment, a rectangular window is first created to contain a single cycle of respiratory sound (RS). The windowed samples are then normalized. Segments with a length of 64 samples are used to extract features from the normalized RS signal. The power spectrum components are summed synchronously for each segment and then averaged to provide 32-dimensional feature vectors [7]. Classification performance for nine RS classes (bronchial, bronchovesicular, vesicular, crackle, wheeze, stridor, grunting, squawk, and friction rub sounds) is compared using MLP, GAL, and a novel incremental supervised neural network (ISNN) [8].

An attention-based encoder-decoder model is used in [9] to solve the problem of breath sound segmentation. The model is used to segment precisely the inhalation and exhalation of individuals with lung disorders. Spectrograms and time labels for each time interval were used to train the model. The spectrogram is encoded first, and an attention-based decoder then uses the encoded image to detect inhalation or exhalation sounds. The attention mechanism makes the outputs more interpretable, which would help doctors make a more precise diagnosis. Twenty-two people participated in the study, and their breath sounds were recorded with digital stethoscopes or noise-canceling microphone sets. With 0.5-s time segments and ResNet101 as the encoder, the experimental results indicated a high accuracy of 92.006%. Ten-fold cross-validation trials show that the proposed approach performs consistently [9].

In [10], COPD is diagnosed with a multi-class classifier that labels breath sounds as normal or as aberrant sounds such as wheeze, crackle, and rhonchi. Descriptor features such as the MFCC from the mel spectrum and spectral descriptor features from the linear spectrum are extracted. A total of 596 lung sound signals are used for experimentation and classification. Decision trees and K-NNs are used to achieve better classification accuracy than binary machine learning. The multi-class classifier based on a deep learning CNN model achieves 96.7% overall accuracy. The results of the multi-class classifier (MCC) are compared with those of the SVM classifier [10].

A method for classifying multi-channel lung sounds using spectral, temporal, and spatial information is presented in [11]. A convolutional recurrent neural network is used with a frame-wise classification framework to handle multi-channel lung sound recordings over the whole breathing cycle. Patients with idiopathic pulmonary fibrosis (IPF) and healthy volunteers were recorded using the authors’ newly built 16-channel lung sound recording system. For binary classification, i.e., healthy versus pathological, spectrogram features were extracted from the lung sound recordings, and different deep neural network architectures were compared [11].
The study in [12] implements robust respiratory disease classification (RRDCBS) by extracting many features from sounds, constructing multiple models, and conducting appropriate testing procedures for multi-class and binary classification of respiratory disorders. For binary classification of respiratory disease against healthy sounds, decision-level fusion of indices on features provided 100% accuracy for VQ, support vector machine (SVM), and K-nearest neighbor (KNN) modeling techniques. Multi-class and binary categorization of respiratory disorders is also tested using deep recurrent and convolutional neural networks [12].

A multimodal architecture was proposed to detect COVID-19. The CovScanNet model was developed by combining an existing deep learning convolutional neural network (CNN) based on Inception-v3 with a multi-layer perceptron (MLP). The model reports an accuracy of 80% for breathing sound analysis and 99.66% for COVID-19 detection on the chest X-ray (CXR) image dataset [13].

A simple CNN-based model, called RespireNet, was developed in [14]. After extensive evaluation on the ICBHI dataset, the authors improve on the state of the art by 2.2% for 4-class classification [14].

The mel frequency cepstral coefficients (MFCC) approach was used in [15]. MFCC features were extracted from each audio file in the dataset, resulting in a visual representation of each audio sample. With this method, over 93% of the respiratory disorders in the database were classified correctly. Upper respiratory tract infection (URTI), bronchiectasis, pneumonia, and bronchiolitis are among the categories considered [15].

A two-stage approach for classifying the audio files of the ICBHI benchmark dataset is described in [16]. In the first step, the optimal combination of intrinsic mode function (IMF) features for classifying respiratory illnesses is found by extracting feature vectors from lung sounds. In the next step, the IMF features with the best combination of gammatone filters and gammatone cepstral coefficients (GTCC) are used. A recurrent neural network-based stacked BiLSTM classifier is used to classify the GTCC inputs. In comparison with other IMFs and with MFCCs, the third IMF provides more useful data and improves performance when used in conjunction with GTCCs. According to these findings, the GTCC of the third IMF component applied to the stacked BiLSTM framework outperforms the rival convolutional neural network approach to classification [16].
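The segment-based spectral features summarized in this survey (64-sample segments of a normalized respiratory cycle whose power spectra are averaged into a 32-dimensional vector) can be illustrated with a minimal sketch. The peak normalization, the non-overlapping segmentation, and the function names below are assumptions made for illustration; the surveyed work may differ in these details.

```python
import cmath
import math


def power_spectrum(segment):
    """Return the one-sided power spectrum (first N/2 bins) of a real segment."""
    n = len(segment)
    spectrum = []
    for k in range(n // 2):
        # Direct DFT of bin k; adequate for short 64-sample segments.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(segment)
        )
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def segment_features(signal, segment_len=64):
    """Normalize one windowed cycle, split it into segments, average their spectra."""
    peak = max(abs(x) for x in signal) or 1.0  # assumed peak normalization
    normalized = [x / peak for x in signal]
    segments = [
        normalized[i:i + segment_len]
        for i in range(0, len(normalized) - segment_len + 1, segment_len)
    ]
    if not segments:
        raise ValueError("signal is shorter than one segment")
    spectra = [power_spectrum(seg) for seg in segments]
    # Average each frequency bin across segments: segment_len // 2 values.
    return [sum(bins) / len(spectra) for bins in zip(*spectra)]


# Synthetic test tone with exactly 4 periods per 64-sample segment.
tone = [math.sin(2 * math.pi * 4 * i / 64) for i in range(256)]
features = segment_features(tone)
print(len(features), features.index(max(features)))
```

With 64-sample segments, keeping the first half of the spectrum gives the 32 values per cycle mentioned in the text; the synthetic tone is only there to show that the energy lands in the expected bin.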

3 Methodology

Fig. 1 shows the basic framework of a breathing sound classification system using a customized deep convolutional neural network.

Fig. 1 Basic framework of breathing sound classification system


The stethoscope was used to listen to distinct parts of the chest known as zones. Typically, the lungs are divided into five lobes: three on the right (the upper, middle, and lower lobes) and two on the left (the upper and lower lobes). The sounds recorded with the stethoscope formed the dataset for analysis. After the dataset was preprocessed, the features of the breathing sounds were extracted. The samples in the dataset were divided into training, validation, and test sets for classification of breathing cycles. A customized deep convolutional neural network was trained, and its performance was evaluated on the validation set. Finally, the test set was used to predict the class label of a randomly selected sound signal.

A neural network is trained to make the predictions. A simple feedforward neural network is suited to data points that are independent of each other, whereas a recurrent neural network (RNN) can model sequential data such as time series and speech. In this type of network, the output depends not only on the current input but also on the inputs of previous time steps. The hidden state of an RNN is given by

$$h_t = \tanh(W_h h_{t-1} + W_x x_t) \tag{1}$$

Here, $x_t$ is the input to the network at time step $t$, $h_t$ is the hidden state, and $W_h$ and $W_x$ are weight matrices. The loss of the model is calculated using the cross entropy

$$L_\theta(y, \hat{y})_t = -y_t \log \hat{y}_t \tag{2}$$
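As an illustration of the recurrence in Eq. (1) and the loss in Eq. (2), the following sketch runs one RNN step and evaluates the cross entropy in plain Python. The softmax output used to produce the predicted distribution, the toy weights, and all function names are assumptions for illustration, not the authors' implementation.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(w_h, w_x, h_prev, x_t):
    """Eq. (1): h_t = tanh(W_h h_{t-1} + W_x x_t)."""
    pre = [a + b for a, b in zip(matvec(w_h, h_prev), matvec(w_x, x_t))]
    return [math.tanh(p) for p in pre]


def softmax(logits):
    """Assumed output nonlinearity that turns scores into probabilities."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_pred):
    """Eq. (2) summed over classes: -sum_k y_k log(y_hat_k)."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


# Toy weights (hypothetical): 2 hidden units, 1 input feature, 2 classes.
w_h = [[0.5, 0.0], [0.0, 0.5]]
w_x = [[1.0], [-1.0]]
w_y = [[1.0, 0.0], [0.0, 1.0]]  # assumed output projection

h = [0.0, 0.0]
for x in ([0.5], [1.0]):  # a two-step input sequence
    h = rnn_step(w_h, w_x, h, x)

y_hat = softmax(matvec(w_y, h))
print(cross_entropy([1, 0], y_hat))
```

The hidden state is carried from one step to the next, which is what lets the output depend on earlier inputs.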

In Eq. (2), $\theta = \{W_h, W_x, W_y\}$ denotes the set of trainable weights.

RNNs come in a variety of flavors. A classical RNN consists of a chain of simple recurrent neural network modules. In theory, RNNs can learn long-term dependencies from their training data. In practice, these long-term dependencies lead to the vanishing gradient and exploding gradient problems, which make RNNs less useful and more difficult to train. “Long short-term memory (LSTM)” was proposed as a solution by Hochreiter and Schmidhuber [17]. A deep neural network model was developed using the “LSTM model” as the basis. After the initial LSTM layer, a dropout layer is added. The final LSTM layer generates a vector $h_i$, which is fed into a fully connected multi-layer network. This network has three layers: two dense layers and one output layer. The two dense layers use the rectified linear unit (ReLU) activation, and the output layer uses an exponential activation function (Fig. 2).

Fig. 2 Architecture of proposed LSTM network

The respiratory sound dataset was collected from the Kaggle data science community. It was developed by two “research teams in Greece and Portugal” [18]. The dataset contains 920 annotated lung sound recordings, ranging in length from 10 to 90 s, collected from a total of 120 patients. The collection comprises 5.5 h of recordings and includes 6898 respiratory cycles. Of these respiratory cycles, “1864 have crackles, 886 have wheezes, and 506 have both crackles and wheezes”. The dataset includes both clean respiratory sounds and noisy recordings that imitate real-life settings. The patients range in age from children to the elderly. The dataset covers eight classes: URTI, healthy, asthma, COPD, LRTI, bronchiectasis, pneumonia, and bronchiolitis. Since the asthma and LRTI classes have significantly fewer samples, we removed those two class labels from the experimentation and considered only “healthy, COPD, bronchiectasis, bronchiolitis, pneumonia, and URTI” (Fig. 3).
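The dataset figures quoted above imply a few derived quantities that are easy to check. The short sketch below assumes that the crackle, wheeze, and "both" counts are mutually exclusive groups, so the remaining cycles contain neither sound; the function and key names are illustrative.

```python
def summarize_cycles(total, crackles, wheezes, both):
    """Derive the count and share of cycles per annotation group.

    Assumes the three annotated groups are mutually exclusive.
    """
    neither = total - (crackles + wheezes + both)
    groups = {
        "crackles": crackles,
        "wheezes": wheezes,
        "both": both,
        "neither": neither,
    }
    return {name: (count, count / total) for name, count in groups.items()}


summary = summarize_cycles(total=6898, crackles=1864, wheezes=886, both=506)
for name, (count, share) in summary.items():
    print(f"{name:9s} {count:5d} {share:6.1%}")

# Average recording length implied by 5.5 h spread over 920 recordings.
mean_seconds = 5.5 * 3600 / 920
print(f"mean recording length: {mean_seconds:.1f} s")
```

Under that assumption, 6898 − (1864 + 886 + 506) = 3642 cycles contain neither crackles nor wheezes, and the mean recording length of about 21.5 s is consistent with the stated 10 to 90 s range.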

4 Results and Discussions

We trained the developed model on a Kaggle notebook with an Intel i7 processor at 2.30 GHz, 8 GB of RAM, and a 16 GB Nvidia P100 GPU. The results of the multi-class respiratory disease classification experiments are shown below. Breathing cycles were classified into six classes: “URTI, healthy, COPD, bronchiectasis, pneumonia, and bronchiolitis”. The proposed custom-built deep neural network was trained on the considered dataset for different numbers of epochs and achieved its best accuracy, 98.6%, at 30 epochs. Fig. 4 shows the training and validation accuracy of the implemented stacked LSTM model, and Fig. 5 shows the corresponding loss at different numbers of epochs.


Fig. 3 a Dataset distribution among various class labels, b class labels considered for experimentation

The results were compared with contemporary models implemented on the same dataset. Table 1 lists the accuracies of these models and of the proposed model, and Fig. 6 shows the same comparison graphically.


Fig. 4 Training and validation accuracies at a 10 epochs, b 20 epochs, and c 30 epochs

The proposed stacked LSTM model classifies the respiratory sounds into six classes (URTI, healthy, COPD, bronchiectasis, pneumonia, and bronchiolitis) with an accuracy of 98.6%. Class imbalance in our dataset was a major issue during this study. To address it, our custom-built stacked LSTM model is configured to assign greater costs to the class labels with less data during the learning phase.
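One common way to assign greater costs to under-represented classes is inverse-frequency weighting of the loss. The sketch below uses the "balanced" heuristic w_c = N / (K · n_c), where N is the number of samples, K the number of classes, and n_c the count of class c. The paper does not state which weighting scheme was used, so this heuristic, the function names, and the toy label counts are all assumptions for illustration.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: w_c = N / (K * n_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


def weighted_cross_entropy(true_label, probs, weights):
    """Cross entropy for one sample, scaled by the weight of its true class."""
    return -weights[true_label] * math.log(probs[true_label])


# Toy, hypothetical label counts (not the real dataset distribution).
labels = ["COPD"] * 8 + ["healthy"] * 2
weights = balanced_class_weights(labels)
print(weights)

# The same prediction error costs more on the rare class.
probs = {"COPD": 0.5, "healthy": 0.5}
print(weighted_cross_entropy("COPD", probs, weights))
print(weighted_cross_entropy("healthy", probs, weights))
```

In this toy case the rare class receives four times the weight of the frequent one, so misclassifying it contributes proportionally more to the training loss.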

Fig. 5 Training and validation loss at a 10 epochs, b 20 epochs, and c 30 epochs

Table 1 Comparison of accuracy of proposed model with other contemporary models

S. No | Paper | Classifier algorithm | Accuracy (%)
1 | [1] | k-nearest neighbors and Naive Bayes classifier | 98.1
2 | [5] | Convolutional bidirectional long short-term memory | 98
3 | [3] | SVM | 98.2
4 | Proposed model | Stacked LSTM model | 98.6

5 Conclusions and Future Scope

Respiratory diseases are becoming a global cause of death in the COVID era. Detecting and diagnosing respiratory diseases is a critical task, and with traditional methods such as auscultation, medical practitioners need expertise to identify differences between breathing cycles. An automated system that analyzes breathing sounds and distinguishes the various sounds of the respiratory system can support this task. A custom-built deep convolutional neural network based on the LSTM model was developed to classify respiratory sounds. The dataset was taken from the Kaggle data science community. It contains eight classes of respiratory sounds: URTI, healthy, asthma, COPD, LRTI, bronchiectasis, pneumonia, and bronchiolitis. Because the number of samples in the asthma and LRTI classes is so low, we removed those two class labels from the experiments and considered only the healthy, COPD, bronchiectasis, bronchiolitis, pneumonia, and URTI samples. The proposed stacked LSTM model classifies the respiratory sounds into six categories, “URTI, healthy, COPD, bronchiectasis, pneumonia, and bronchiolitis”, with an accuracy of 98.6%. A significant class imbalance in our dataset was a major concern throughout the study. To address it, our custom-built stacked LSTM model is configured to assign greater costs to the class labels with less data during the learning phase. Future work will primarily focus on refining the performance of the “respiratory disease prediction model”.

[Figure: bar chart titled “Comparison of Various Models”; vertical axis: accuracy (%), from 97.6 to 98.8; bars: k-Nearest Neighbors and Naive Bayes classifier, Convolutional Bi-directional Long Short-Term Memory, SVM, Stacked LSTM Model]

Fig. 6 Graphical representation of comparison of accuracy of proposed model with other contemporary models

References

1. Naves R, Barbosa BHG, Ferreira DD (2016) Classification of lung sounds using higher-order statistics: a divide-and-conquer approach. Comput Methods Programs Biomed 129:12–20. https://doi.org/10.1016/j.cmpb.2016.02.013
2. Perna D (2018) Convolutional neural networks learning from respiratory data. IEEE Int Conf Bioinf Biomed BIBM 2018:2109–2113. https://doi.org/10.1109/BIBM.2018.8621273
3. Fraiwan L, Hassanin O, Fraiwan M, Khassawneh B, Ibnian AM, Alkhodari M (2020) Automatic identification of respiratory diseases from stethoscopic lung sound signals using ensemble classifiers. Biocybern Biomed Eng 41:1–14. https://doi.org/10.1016/j.bbe.2020.11.003
4. Abbafati C et al (2020) Global burden of 87 risk factors in 204 countries and territories, 1990–2019: a systematic analysis for the global burden of disease study 2019. Lancet 396(10258):1223–1249. https://doi.org/10.1016/S0140-6736(20)30752-2
5. Subasinghe A et al (2022) Smart stethoscope: intelligent respiratory disease prediction system. In: 2nd International conference on advanced research in computing (ICARC), pp 242–247
6. Sujatha K, Srinivasa Rao B (2019) Recent applications of machine learning: a survey. Int J Innov Technol Explor Eng (IJITEE) 8(6C2). ISSN: 2278–3075. https://www.ijitee.org/wp-content/uploads/papers/v8i6c2/F10510486C219.pdf
7. Sujatha K, Krishna Kishore KV, Srinivasa Rao B (2020) Performance of machine learning algorithms in red wine quality classification based on chemical compositions. Indian J Ecol 47(11):176–180
8. Dokur Z (2009) Respiratory sound classification by using an incremental supervised neural network. Pattern Anal Appl 12(4):309–319. https://doi.org/10.1007/s10044-008-0125-y
9. Hsiao CH et al (2020) Breathing sound segmentation and detection using transfer learning techniques on an attention-based encoder-decoder architecture. In: Annual international conference of the IEEE engineering in medicine and biology society. EMBS, pp 754–759. https://doi.org/10.1109/EMBC44109.2020.9176226
10. Jayalakshmy S, Priya BL, Kavya N (2020) CNN based categorization of respiratory sounds using spectral descriptors. In: IEEE International conference on communication, computing and industry 4.0. C2I4 2020, pp 1–5. https://doi.org/10.1109/C2I451079.2020.9368933
11. Messner E et al (2020) Multi-channel lung sound classification with convolutional recurrent neural networks. Comput Biol Med 122:1–10. https://doi.org/10.1016/j.compbiomed.2020.103831
12. Revathi A, Sasikaladevi N, Arunprasanth D, Amirtharajan R (2022) Robust respiratory disease classification using breathing sounds (RRDCBS) multiple features and models. Neural Comput Appl 34:8155–8172. https://doi.org/10.1007/s00521-022-06915-0
13. Sait U et al (2021) A deep-learning based multimodal system for Covid-19 diagnosis using breathing sounds and chest X-ray images. Appl Soft Comput 109:107522. https://doi.org/10.1016/j.asoc.2021.107522
14. Gairola S, Tom F, Kwatra N, Jain M (2021) RespireNet: a deep neural network for accurately detecting abnormal lung sounds in limited data setting. In: Annual international conference of the IEEE engineering in medicine and biology society. EMBS, pp 527–530. https://doi.org/10.1109/EMBC46164.2021.9630091
15. Mridha K, Sarkar S, Kumar D (2021) Respiratory disease classification by CNN using MFCC. In: IEEE 6th international conference on computing, communication and automation. ICCCA 2021, pp 517–523. https://doi.org/10.1109/ICCCA52192.2021.9666346
16. Jayalakshmy S, Sudha GF (2021) GTCC-based BiLSTM deep-learning framework for respiratory sound classification using empirical mode decomposition. Neural Comput Appl 33(24):17029–17040. https://doi.org/10.1007/s00521-021-06295-x
17. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
18. Rocha B, Filos D, Mendes L (2017) A respiratory sound database for the development of automated classification. In: IFMBE Proceeding 66. Springer Nature, pp 29–31. https://doi.org/10.1007/978-981-10-7419-6.

Infrastructure Resiliency in Cloud Computing

K. Tirumala Rao, Sujatha, and N. Leelavathy

Abstract We propose a novel software testing environment based on Amazon cloud services such as Amazon Elastic Compute Cloud and Amazon Relational Database Service, along with a fault injection tool, for defining infrastructure resiliency in terms of attributes such as availability, zero downtime, near-zero data loss, mission-critical services, and maintainability. The performance of resources is evaluated experimentally by hosting services in multiple regions on the AWS cloud with an open-source cloud-based application and comparing the average response time against the number of transactions and concurrent users.

Keywords Resiliency · Fault Injection · Amazon Elastic Compute Cloud · Amazon Relational Database Service

1 Introduction

Cloud computing is quickly gaining popularity in today’s information technology by combining technologies such as virtualization, storage, networks, processing power, sharing, the web, and software applications [1]. It is a type of dynamically scaled computing that enables the provision of a diverse range of internet-based services. Users can remotely access these resources, including compute, storage, networking, and applications, from many locations and devices worldwide. The National Institute of Standards and Technology (NIST) proposed a set of characteristics of cloud computing (CC) [2]. Some of them are:

K. T. Rao Godavari Institute of Engineering and Technology (Autonomous), Rajahmundry, India Sujatha (B) · N. Leelavathy (B) Department of CSE, Godavari Institute of Engineering and Technology (Autonomous), Rajahmundry, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_14


1. On-demand self-service: A customer can provision computing capabilities, such as server time and network storage, as needed without requiring human interaction with each service provider.
2. Broad network access: All cloud features are available over the network and can be accessed through standard mechanisms that work with various devices, such as mobile phones, tablets, laptops, and workstations.
3. Resource pooling: The provider’s computing resources are pooled to serve several customers through a multi-tenant model, with physical and virtual resources dynamically assigned and reassigned in response to customer demand. There is a sense of location independence because the client usually has no say in where exactly the resources are, although the client may be able to specify the location at a higher level of abstraction (e.g., country, state, or datacenter).
4. Rapid elasticity: Capabilities can be elastically allocated and released, in some cases automatically, allowing resources to scale up and down in response to demand. To the customer, the capabilities available for provisioning often appear unlimited and can be appropriated in any quantity.
5. Measured service: Cloud systems automate resource management and optimization by using a metering capability at a level of abstraction appropriate for the type of service (e.g., storage, processing, bandwidth, and active user accounts). In addition, monitoring, controlling, and reporting on resource utilization provides transparency for both the service provider and the customer.

Many individuals and organizations adopt CC because of these features. CC enables businesses to avoid the ongoing expenditures associated with operating a business, product, or system and with deploying and maintaining IT infrastructure [1]. Additionally, it provides on-demand service, enabling them to scale up and down by adjusting the service to fit their demands [1].
However, infrastructure hosted in the cloud may face several challenges. The most critical concern is reliability, since a failure can interrupt service delivery. Failures may be attributed to the growing number of cloud customers, which increases the number of required services and raises the likelihood of failure. In addition to hardware failures such as network or power outages, software failures caused by malicious activity or high workload may occur [3]. The planet is always prone to disaster, regardless of where we live. A disaster is an event caused by humankind, nature, or technology, and its outcome is invariably loss: physical damage, loss of human life, environmental change, and, in the case of technology, loss of data and information. The total impact of disasters on people and the economy has been devastating. According to Deng et al. [4], over 20 years, catastrophes claimed 2.8 million lives and affected the lives of about 820 million people [5]. David M. Neal (2014) discussed the worst disasters of the recent decade, which are tabulated in Table 1. These disasters have caused organizations to lose money, and data loss is still a big problem for everyone involved [4].


Table 1 Top disasters and their effects

Sr. No | Type of disaster | Location | Date (MM/DD/YYYY) | Estimated loss (USD)
1 | Tsunami | Uttarakhand, India | 06/23/2013 | 150 K
2 | Hurricane | US | 10/29/2012 | 29.5 B
3 | Earthquake | Japan | 03/11/2011 | 34.6 B
4 | Earthquake | Haiti | 01/12/2010 | 14 B
5 | Earthquake | Java, Indonesia | 05/27/2006 | 3.1 B
6 | Katrina Hurricane | US | 08/29/2005 | 125 B
7 | Earthquake | Sichuan, China | 05/12/2008 | 85 B
8 | Earthquake | Kashmir, Pakistan | 10/08/2005 | 5.2 B
9 | Floods | Mumbai, India | 07/26/2005 | 3.3 B
10 | Cyclone | Nargis, Myanmar | 05/02/2008 | 4 B
11 | Earthquake | Bam, Iran | 12/26/2003 | 500 M
12 | Tsunami | Indonesia, India, Sri Lanka, Myanmar | 12/26/2004 | 9.2 B

Note: K = thousand, M = million, B = billion

The statistics in Table 1 on the top disasters and their effects show that disaster management is critical for the survival of both society and business. Traditional recovery methods were arduous because of the many processes involved, such as restoring data to a recovery site and restarting servers. These difficulties led businesses to look for ways to keep applications available at a lower cost and with less recovery time. Additionally, service-interrupting events can occur at any moment: the network may go down, or the latest application release may include a severe defect. A resilience-focused and well-tested infrastructure is critical when things go wrong. A fully managed solution on a cloud platform can avoid most, if not all, of these problems and thus eliminate several business costs.

The discussion so far has covered how things can go wrong. To have a resilient infrastructure, we must also address how quickly services can be restored, which requires terms such as SLO and SLA. A service level objective (SLO) is a key measurable component of a service level agreement (SLA), and the two terms are frequently used interchangeably. An SLA is a comprehensive agreement that details the service to be provided, how it will be supported, the hours and locations of the service, the charges, the performance, the penalties, and the parties’ duties. SLOs are quantifiable properties of the service level agreement, such as availability, throughput, frequency, response time, or quality. The SLA topic is relevant here because the existing architecture proposed by Al-Said Ahmad and Andras [6], shown in Fig. 1, consumes a few Amazon Web Services (AWS) services whose SLAs do not guarantee five nines (99.999%) availability. Outages of those services must therefore be expected, and during such periods our infrastructure should be resilient enough to maintain business continuity.

Fig. 1 Existing architecture proposed by Al-Said Ahmad and Andras

The paper [6] uses availability zones (AZ) for high availability (HA) and other options such as autoscaling for scalable infrastructure. However, using another AZ does not guarantee that the data centers of the AZs are located physically far apart, and the whole region might go down. To address this, we propose an infrastructure resiliency architecture in the AWS cloud. After choosing a strategy, we must validate the suggested design against various faults, so fault tolerance is a significant concern. Treating fault tolerance as a significant constraint [7], this paper proposes an infrastructure that hosts an application in more than one region. If one of the virtual machines (VM) fails, the other continues to operate and delivers the appropriate response to the client. The primary benefits of a fault-tolerant cloud are cost reduction and performance improvement [8, 9]. Faults can be injected at different stages using the API. Fault classes include physical node faults, virtualization-level faults, service-level faults, and network faults. The proposed architecture in Fig. 2 has been used to test these faults, which are discussed in the section on experimental results. This article proposes a way of combining resilience methods to increase the availability of services and meet the expectations of cloud clients.

Fig. 2 Proposed architecture in AWS cloud

The rest of this article is organized as follows. Section 2 discusses the proposed architecture and design. Section 3 discusses the experimental results obtained from JMeter performance testing and fault injection testing and compares the effective resource utilization in both regions. Finally, Sect. 4 summarizes the proposed architecture.
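The availability figures discussed in the introduction translate directly into a downtime budget, and hosting in a second region changes that budget. The sketch below shows the arithmetic. It assumes that regional failures are independent and that either region alone can serve traffic; the 99.9% per-region figure is a hypothetical input, not a published AWS SLA value.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability):
    """Yearly downtime allowed by an availability target (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def parallel_availability(*availabilities):
    """Availability of redundant deployments where any single one suffices.

    Assumes failures are independent, which is an idealization.
    """
    unavailable = 1.0
    for a in availabilities:
        unavailable *= 1.0 - a
    return 1.0 - unavailable


for target in (0.999, 0.9999, 0.99999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target:.5f} -> {minutes:8.2f} min/year")

# Two regions, each with a hypothetical 99.9% availability.
combined = parallel_availability(0.999, 0.999)
print(f"two regions: {combined:.6f}")
```

Five nines corresponds to roughly 5.26 minutes of downtime per year, and two independent deployments at 99.9% each combine to about 99.9999% under these assumptions, which is the quantitative motivation for a multi-region design.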

2 Methodology

This section discusses the architecture and the components used to achieve infrastructure resiliency in the AWS cloud (Fig. 2).

1. Regions: To achieve infrastructure resilience in AWS, we selected two regions, one primary and one secondary. AWS uses the term “Region” to refer to a geographical area in which it clusters data centers. Each set of logical data centers is referred to as an Availability Zone (AZ). Each AWS Region comprises several discrete and physically distinct AZs within a specific geographic area [9, 10].
2. Amazon Route 53: Route 53 is a DNS service that routes traffic to services across AWS regions. Route 53 switches to the other region if one of the regions becomes unavailable [11].
3. Virtual Private Cloud (VPC): A VPC is a logical grouping of resources on a virtual network. Each region is assigned its own virtual network [12].
4. Amazon Relational Database Service (RDS) for SQL Server: RDS is deployed across availability zones for high availability [13].
5. VPC peering: VPC peering establishes a networking link for private routing between two VPCs. Resources communicate as if they resided on the same network. Communication through the peering connection therefore has no bandwidth bottleneck and no single point of failure [14].
6. Application Load Balancer (LB): The application load balancer routes incoming traffic based on rules and health probes [15, 16].
7. AWS Data Migration Service (DMS): In this paper, DMS is used to replicate data [17] from the primary region to the secondary region and vice versa [18].
8. DMS Endpoint: An endpoint provides information about a data store’s connection, type, and location. AWS DMS uses this information to connect to a data store and transfer data from a source to a target [19].
9. DMS Task: All the work is performed by an AWS DMS task, which defines the tables (or views) and schemas to use for the migration [20]. The working mechanism of the AWS DMS service is shown in Fig. 3.
10. Network Address Translation (NAT) Gateway: NAT Gateway is an AWS-managed service that simplifies connecting instances within a private subnet of an Amazon Virtual Private Cloud (Amazon VPC) to the Internet [21].
11. Amazon Elastic Compute Cloud (Amazon EC2): Amazon EC2 is a cloud computing web service that offers secure, resizable compute capacity. Regions and Availability Zones provide several physical locations for hosting resources, including instances and Amazon EBS volumes [22].
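The failover behavior described for the primary and secondary regions can be sketched as a small routing decision: traffic goes to the primary region while it is healthy and to the secondary otherwise. This is a conceptual illustration in plain Python, not the Route 53 API; the class, function, and region names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RegionEndpoint:
    """A regional deployment and the latest result of its health check."""
    name: str
    healthy: bool


def resolve(primary, secondary):
    """Return the name of the region that should receive traffic."""
    if primary.healthy:
        return primary.name
    if secondary.healthy:
        return secondary.name
    raise RuntimeError("no healthy region available")


primary = RegionEndpoint("region-a", healthy=True)     # hypothetical names
secondary = RegionEndpoint("region-b", healthy=True)

print(resolve(primary, secondary))   # normal operation
primary.healthy = False              # simulate a regional outage
print(resolve(primary, secondary))   # traffic fails over
```

Because data is replicated in both directions between the regions, the secondary can serve requests during the outage and hand traffic back once the primary's health check recovers.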

Fig. 3 Working of DMS


2.1 System Set-Up Stage

Using an AWS CloudFormation template, an Amazon EC2 instance and RDS services were provisioned to host the DotNetNuke (https://www.dnnsoftware.com/) application [15, 23]. DotNetNuke (DNN) is a .NET-based open-source platform and a popular content management system, with over 1 million active users. We deployed the scalable DNN content management application on Internet Information Services (IIS) to test the proposed architecture with performance and fault injection tests. Because of its high adoption rate, the application is heavily used by cloud-based apps and service providers. DNN is a CRUD (Create-Read-Update-Delete)-based application with modules for user management, payment, cart management, and additional operations such as record deletion and user administration. Web content is delivered from the web tier, while transactions related to other modules are processed in the data layer. The technical specifications of the resources are listed in Table 2.
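To make the set-up stage more concrete, the sketch below assembles a minimal CloudFormation-style template as a Python dictionary and serializes it to JSON. It uses the m5.large and db.r5.large classes and storage sizes reported for the experiments; everything else (the AMI placeholder, the root device name, the SQL Server edition, the user name, and the logical resource names) is an assumption for illustration and not the authors' actual template.

```python
import json


def build_template(ami_id="ami-EXAMPLE"):
    """Return a minimal template with one web server and one database."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            # Password is supplied at deploy time and hidden from output.
            "DbPassword": {"Type": "String", "NoEcho": True},
        },
        "Resources": {
            "WebServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "InstanceType": "m5.large",
                    "ImageId": ami_id,  # placeholder, region specific
                    "BlockDeviceMappings": [
                        # Assumed root device name; 30 GB as in Table 2.
                        {"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": 30}}
                    ],
                },
            },
            "Database": {
                "Type": "AWS::RDS::DBInstance",
                "Properties": {
                    "DBInstanceClass": "db.r5.large",
                    "Engine": "sqlserver-se",  # assumed edition
                    "LicenseModel": "license-included",
                    "AllocatedStorage": "100",
                    "MasterUsername": "dnnadmin",  # hypothetical
                    "MasterUserPassword": {"Ref": "DbPassword"},
                },
            },
        },
    }


template = build_template()
print(json.dumps(template, indent=2)[:200])
```

Deploying the same template in the primary and the secondary region, each with its own AMI identifier, is one way to keep the two environments identical for the comparison in the experiments.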

3 Experimental Results

Performance testing was conducted using Apache JMeter [24]. Figure 4 illustrates the working of the JMeter server and load generators, and Table 3 reports the response times of the ten most critical transactions. The test scripts were deployed and executed via the JMeter server, and the tests were repeated in both regions without changing the test settings. JMeter supports performance testing of web applications and measures response time, error rate, throughput, and other metrics. Test automation tooling also keeps improving, which makes it easier to run tests that resemble a real production environment and to find the root cause of performance problems by analyzing test results.

Additional virtual machines were used as workload generators, generating HTTP requests for the web application via JMeter [24]. JMeter was configured to simulate 500 concurrent users, each of which called the web application to perform a random operation drawn from the most significant transactions. We used an AWS LB to route requests to the web application using the round-robin method. Only a single web server is shown in the proposed architecture in Fig. 2. Previous studies on web server scaling and load balancing routing techniques are discussed in the paper related to the scalability resilience framework [25]. Therefore, we do not discuss the working of the AWS Application Load Balancer and Auto Scaling in detail.

Table 2 Technical specifications of resources
Resource      Resource type    RAM (GB)   vCPU   Storage (GB)
Amazon EC2    m5.large         8          2      30
RDS           db.r5.large      16         2      100
DMS           dms.c5.large     4          2      30

Fig. 4 Working of the JMeter server and load generators

Table 3 The performance metrics for ten transactions
Trans. ID   Transaction name   Transaction count   Avg. response time (in s)   90th percentile response time (in s)
1           Home page          338                 0.116                       0.212
2           OnPageClickNext    332                 0.035                       0.138
3           Search             307                 7.597                       11.75
4           Submit             266                 0.341                       0.554
5           Back               692                 0.249                       0.515
6           Next               450                 1.693                       3.179
7           Addtocart          713                 0.079                       0.125
8           RemoveAll          674                 0.064                       0.106
9           Login              903                 0.144                       0.207
10          Logout             872                 0.079                       0.106
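The two response-time columns of Table 3 can be derived from raw per-request timings. The sketch below is a minimal illustration that assumes a nearest-rank percentile rule; JMeter's own percentile calculation may differ in detail, and the sample values are hypothetical, not data from the study.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank rule."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def summarize(samples):
    """Count, average, and 90th percentile response time (seconds)."""
    return {
        "count": len(samples),
        "avg_s": sum(samples) / len(samples),
        "p90_s": percentile_nearest_rank(samples, 90),
    }

# Hypothetical response times (seconds) for one transaction type.
search_times = [0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9, 2.1, 6.3]
print(summarize(search_times))
```

In this example a single slow request raises the average well above the median but leaves the 90th percentile untouched, which illustrates why Table 3 reports both figures.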

Fig. 5 Transactional metrics in both the regions (x-axis: transactions 1 to 10; y-axis: response time in seconds; series: PR-Average RT, PR-90th Percentile RT, SR-Average RT, SR-90th Percentile RT). Note PR–Primary Region, SR–Secondary Region, and RT–Response Time

We have not used any caching service or mechanism in the proposed architecture to improve the response time of the search transaction. Implementing such services can improve response times by lowering the time required to serve HTTP requests. In addition, adopting the Representational State Transfer (REST) architecture could increase the efficiency and scalability of cloud computing applications. Based on the results of the JMeter tests in both regions, shown in Fig. 5, we infer that the application's performance [26] and resource utilization are comparable across the two regions.

The other objective is to perform fault injection (FI) tests [7, 27–30] and to support capacity and redundancy planning by measuring the response time during the injection and post-injection phases [31]. To perform the FI tests, we selected a tool called Toxy [30]. Toxy adds poisons, which can be filtered by rules, that alter the HTTP flow and perform actions such as limiting bandwidth, delaying network packets, injecting network jitter latency, or responding with a custom error or status code. Another reason for choosing Toxy is its control over status codes: AWS Route 53 performs health checks regularly, and based on the returned status code, requests can be routed between the regions. Toxy is primarily an L7 network simulator, although it can replicate L3 network situations. Toxy's operating mechanism is shown in Fig. 6. Fault injection is performed on one of the two replicated network regions, i.e., primary or secondary [13, 32]. Specific tests are performed for different types of faults (limited bandwidth, delayed network packets, injected network jitter latency, and custom error or status code responses) to assess whether the resources perform well even in the event of failures, as shown in Fig. 7. The performance metrics of the primary and secondary regions are tabulated in Tables 4 and 5, respectively.

The tests aim to evaluate whether the resources allocated to the EC2 instance and the AWS relational database service are sufficient to achieve the desired results. By analyzing the results of the FI tests, a cloud architect can learn about the ability of the resources to restore a steady state during the FI and post-injection phases [11, 12]. In addition, these metrics help assess whether the service level requirements are met.
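The idea of a poison that either answers with a custom status code or adds latency can be illustrated with a small wrapper around a request handler. This is a hedged sketch in Python and is not Toxy's API (Toxy is a separate HTTP proxy tool); the names `make_poisoned_handler` and `healthy_backend`, the error rate, and the delay are all assumptions chosen for illustration.

```python
import random

def make_poisoned_handler(handler, error_rate, delay_s, status_code=503, rng=None):
    """Wrap handler so that a fraction of calls fail and the rest are delayed."""
    rng = rng or random.Random()

    def poisoned(request):
        if rng.random() < error_rate:
            # Respond with a custom error status instead of calling the backend.
            return {"status": status_code, "delay_s": 0.0, "body": "injected fault"}
        response = dict(handler(request))
        # Record the injected latency instead of sleeping, to keep the sketch fast.
        response["delay_s"] = delay_s
        return response

    return poisoned

def healthy_backend(request):
    """Hypothetical backend that always succeeds."""
    return {"status": 200, "body": "ok:" + request}

poisoned = make_poisoned_handler(healthy_backend, error_rate=0.3, delay_s=0.25,
                                 rng=random.Random(42))
responses = [poisoned("GET /") for _ in range(1000)]
error_fraction = sum(r["status"] != 200 for r in responses) / len(responses)
print(f"injected error fraction: {error_fraction:.3f}")
```

A health check that observes the injected status code would mark the region as unhealthy, which is the signal the text describes for routing requests to the other region.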


Fig. 6 The working mechanism of the Toxy testing tool

Fig. 7 Resource utilization metrics during FI (y-axis: percentage; series: CPU % busy for the web server and RDS, memory % available for the web server and RDS; groups: primary region, secondary region)

Table 4 Performance metrics of resources in the primary region
Category                                                        Threshold                          Rating
Average CPU utilization                                         % CPU use < 70%                    Acceptable
Memory utilization                                              Memory available > 20%             Acceptable
Transaction timings (based on 90th percentile response time)    < 5% of transactions have > 5 s    Acceptable
Error rate                                                      Error rate < 5%                    Acceptable

Table 5 Performance metrics of resources in the secondary region
Category                                                        Threshold                          Rating
Average CPU utilization                                         % CPU use < 70%                    Acceptable
Memory utilization                                              Memory available > 20%             Acceptable
Transaction timings (based on 90th percentile response time)    < 5% of transactions have > 5 s    Acceptable
Error rate                                                      Error rate < 5%                    Acceptable
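The thresholds of Tables 4 and 5 can be applied mechanically to measured values. The following sketch is illustrative only: the function name, the "Not acceptable" label, and the sample measurements are assumptions and are not reported in the study.

```python
def rate_region(avg_cpu_pct, mem_available_pct, slow_txn_pct, error_rate_pct):
    """Rate each category against the thresholds listed in Tables 4 and 5."""
    checks = {
        "Average CPU utilization": avg_cpu_pct < 70,
        "Memory utilization": mem_available_pct > 20,
        # slow_txn_pct: share of transactions slower than 5 s
        "Transaction timings": slow_txn_pct < 5,
        "Error rate": error_rate_pct < 5,
    }
    return {name: "Acceptable" if ok else "Not acceptable"
            for name, ok in checks.items()}

# Hypothetical measurements for one region during fault injection.
ratings = rate_region(avg_cpu_pct=55.0, mem_available_pct=35.0,
                      slow_txn_pct=3.0, error_rate_pct=1.2)
for category, rating in ratings.items():
    print(f"{category}: {rating}")
```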


4 Conclusion

The architecture for a project is chosen based on the business needs and the technical requirements. We found during our analysis that some organizations require a unique infrastructure, whereas others want assistance in determining the best architecture for their company or for meeting customer needs. The suggested architecture is well suited to those seeking a platform with a wide range of dependable and stable services at a reasonable price. It is also suitable for startups and other fast-growing businesses that run 24 × 7 and generate large volumes of unstructured data.

4.1 Limitations

This study does not account for regulatory factors such as a country's local governing bodies, regional laws, policies, and restrictions. These factors change over time, and their scope is too broad to cover all countries or regions. They are therefore excluded from the study.

4.2 Recommendation

Expertise in a cloud service environment, together with substantial practical experience in optimizing cloud service operations for features and cost, would be beneficial.

4.3 Future Research

Another direction for future research is the selection of a specific service model, such as SaaS, PaaS, IaaS, or a combination of these. This is worthwhile because not all organizations need every model at the same time: some services can be provided in-house, and the rest can be obtained from expert cloud service providers. This is particularly relevant for large-scale organizations, where optimizing cost, efficacy, and efficiency is an ongoing concern.

References

1. Saxena VK, Pushkar S (2016, March) Cloud computing challenges and implementations. In: 2016 International conference on electrical, electronics, and optimization techniques (ICEEOT). IEEE, pp 2583–2588
2. Accessed from https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-145.pdf
3. Amoon M (2016) Adaptive framework for reliable cloud computing environment. IEEE Access 4:9469–9478
4. Deng J, Huang SCH, Han YS, Deng JH (2010, December) Fault-tolerant and reliable computation in cloud computing. In: 2010 IEEE globecom workshops. IEEE, pp 1601–1605
5. Housner GW (1989) An international decade of natural disaster reduction: 1990–2000. Nat Hazards 2(1):45–75
6. Gill SS, Buyya R (2018) Failure management for reliable cloud computing: a taxonomy, model, and future directions. Comput Sci Eng 22(3):52–63
7. Piscitelli R, Bhasin S, Regazzoni F (2017) Fault attacks, injection techniques, and tools for simulation. In: Hardware security and trust. Springer, Cham, pp 27–47
8. Jhawar R, Piuri V (2017) Fault tolerance and resilience in cloud computing environments. In: Computer and information security handbook. Morgan Kaufmann, pp 165–181
9. Bala A, Chana I (2012) Fault tolerance-challenges, techniques and implementation in cloud computing. Int J Comput Sci Iss (IJCSI) 9(1):288
10. Regions and Availability Zones. https://aws.amazon.com/about-aws/global-infrastructure/regions_az/. Accessed 4 May 2022
11. Amazon Route 53. https://aws.amazon.com/route53/. Accessed 4 May 2022
12. Virtual Private Cloud. https://aws.amazon.com/vpc/. Accessed 4 May 2022
13. AWS Relational Database for SQL Server. https://aws.amazon.com/rds/sqlserver/. Accessed 4 May 2022
14. VPC Peering. https://docs.aws.amazon.com/vpc/latest/peering/what-is-vpc-peering.html. Accessed 4 May 2022
15. Amazon Application Load Balancing. https://docs.aws.amazon.com/elasticloadbalancing/latest/application/introduction.html. Accessed 4 May 2022
16. Security Groups. https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ec2-securitygroups.html. Accessed 4 May 2022
17. Casanova H, Vivien F, Zaidouni D (2015) Using replication for resilience on exascale systems. In: Fault-tolerance techniques for high-performance computing. Springer, Cham, pp 229–278
18. AWS Database Migration Service. https://aws.amazon.com/dms/. Accessed 4 May 2022
19. DMS Endpoint. https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Endpoints.html. Accessed 4 May 2022
20. DMS Task. https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.html. Accessed 4 May 2022
21. Network Address Translation Gateway. https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html. Accessed 4 May 2022
22. Amazon Elastic Compute Cloud. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html. Accessed 4 May 2022
23. DotNetNuke. https://www.dnnsoftware.com. Accessed 4 May 2022
24. JMeter (2022) JMeter HTTP Request. https://jmeter.apache.org/usermanual/component_reference.html#HTTP_Request. Accessed 4 May 2022
25. Avižienis A, Laprie J-C, Randell B, Landwehr C (2004) Basic concepts and taxonomy of dependable and secure computing. IEEE Trans Depend Secure Comput 1(1):11–33
26. Huber N, Brosig F, Dingle N, Joshi K, Kounev S (2012) Providing dependability and performance in the cloud: case studies. In: Resilience assessment and evaluation of computing systems. Springer, Berlin, Heidelberg, pp 391–412
27. Natella R, Cotroneo D, Madeira HS (2016) Assessing dependability with software fault injection: a survey. ACM Comput Surv (CSUR) 48(3):1–55
28. Herscheid L, Richter D, Polze A (2015, April) Experimental assessment of cloud software dependability using fault injection. In: Doctoral conference on computing, electrical and industrial systems. Springer, Cham, pp 121–128
29. Ye K, Liu Y, Xu G, Xu CZ (2018, June) Fault injection and detection for artificial intelligence applications in container-based clouds. In: International conference on cloud computing. Springer, Cham, pp 112–127
30. Toxy testing tool. https://github.com/h2non/toxy, https://mitmproxy.org/, https://github.com/h2non/toxy/tree/master/benchmark. Accessed 4 May 2022
31. Feinbube L, Pirl L, Tröger P, Polze A (2017, July) Software fault injection campaign generation for cloud infrastructures. In: 2017 IEEE international conference on software quality, reliability, and security companion (QRS-C). IEEE, pp 622–623
32. Deng Y, Mahindru R, Sailer A, Sarkar S, Wang L (2017) U.S. Patent No. 9,753,826. Washington, DC: U.S. Patent and Trademark Office
33. Annotated bibliography of hazard and flood-related articles (n.d.). Accessed 20 April 2022, from http://socialscience.focusonfloods.org/2014/neal-d-m-1997-reconsidering-thephases-ofdisaster-international-journal-of-mass-emergencies-and-disasters-152-239-264-2/

Deep Learning Model With Game Theory-Based Gradient Explanations for Retinal Images

Kanupriya Mittal and V. Mary Anita Rajam

Abstract Due to the black-box nature of deep learning models and the inability to explain their results to medical experts, they are still not fully adopted in clinics. Explainable models enhance the confidence of medical experts in deep learning models. This paper proposes an explainable diabetic retinopathy (DR) model with ResNet 50 architecture and gradient-based Shapley values for explainability. The model explains the contribution of each image pixel towards the final classification using a game theory-based gradient approach. A quadratic weighted kappa score of 0.784 and a classification accuracy of 89.34% are achieved with our DR classification model. The image classification is explained by the colour of each pixel, which indicates how much each pixel contributes to the positive or negative outcome. The method is novel in that it explains the importance and the contributions of the intermediate layers in making predictions. The proposed explainable DR model provides meaningful explanations of the predictions made for five DR classes, and the image plots of different retinal images give visual explanations which can be easily understood by ophthalmologists and will help them in the diagnosis of diabetic retinopathy in patients.

Keywords Model explainability · Deep learning · Game theory · Diabetic retinopathy

K. Mittal (B) · V. M. A. Rajam
Department of CSE, CEG, Anna University, Chennai, Tamil Nadu, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_15

1 Introduction

Deep learning is one of the leading artificial intelligence (AI) methods for solving medical imaging problems and has been used for cancer detection, retinal disease detection, and so on, from medical images. Despite the remarkable achievements of deep learning in medical imaging problems, deep learning models are still not fully adopted in clinics. The central problem is the black-box nature of the models and the inability to explain their results to medical experts. A medical computer-aided diagnosis system has to be understandable, transparent, and explainable. The availability of transparent and explainable models will enhance the trust of medical experts in AI systems [7]. Hence, an explainable AI system is needed to verify the system, to understand the decisions made, and to comply with legislation [6, 18]. Recently, a number of explainable AI methods have been developed which could be powerful tools to help medical experts in interpreting medical images [2, 8, 21, 23]. In this work, we have chosen diabetic retinopathy (DR) identification and grading as a use case for understanding the internal representation of deep neural networks (DNNs) and explaining the classification.

Diabetic retinopathy is one of the leading causes of blindness in the world. By the year 2030, it is estimated that around 370 million people will be diagnosed with diabetes, raising the risk of DR [17]. DR is a progressive disease, and as it progresses, symptoms like distorted and blurred vision start appearing, so early diagnosis is needed to prevent blindness. Lesions such as microaneurysms (MA), haemorrhages (HEM), cotton wool spots (CWS), and hard exudates (HE) characterize DR. Based on the presence of lesions and their severity, DR is classified as non-proliferative diabetic retinopathy (NPDR) or proliferative diabetic retinopathy (PDR). NPDR can be further classified as mild (at least one MA, with or without HEM or CWS), moderate (presence of a number of MAs and HEM), and severe (multiple HEM and MA, presence of HE, and bleeding). PDR refers to the advanced stage where new blood vessels grow with fragile walls, raising the risk of blood leakage that affects vision [3, 13]. The medical community has a standardized classification for grading DR: class 0 is referred to as no apparent retinopathy, class 1 as mild NPDR, class 2 as moderate NPDR, class 3 as severe NPDR, and class 4 as PDR [27].
Deep learning methods have been applied to diabetic retinopathy detection and grading in recent years. Nazir et al. used a content-based image retrieval method for diabetic retinopathy detection: tetragonal local octa pattern features were used to represent fundus images, and an extreme learning machine was then used for classification [14]. A semi-supervised multichannel-based generative adversarial network method was proposed by Wang et al. for DR grading [26]. Kathiresan et al. used histogram-based segmentation followed by a synergic deep learning model for DR classification [9]. Fractional max pooling layers in a deep convolutional neural network were used by Li et al. [10]. Mansour et al. [12] proposed a diabetic retinopathy diagnosis method based on transfer learning with a deep convolutional neural network. Adly et al. performed diabetic retinopathy grading on a Kaggle dataset using a binary tree-based multi-class VggNet classifier [1]. Quellec et al. proposed a back-propagation-based convolutional network method to create heat-maps showing which image pixels were relevant to the image-level predictions for diabetic retinopathy grading [16]. An integrated gradient-based approach was used by Sayres et al. for understanding the impact of deep learning on DR grading [19]. The interpretability of neural retrieval models was studied by Fernando et al. using the deep Shapley value method [5].


The novelty of the work lies in explaining the importance and contribution of the intermediate layers of a deep learning model towards the decision about the final outcome. The main contributions of our work are as follows:
• An explainable DR model with ResNet 50 architecture for classification
• Shapley values with gradient for explainability
• Visualization of the internal layers of the model and explanation of how the different layers affect the outcomes.

The rest of the paper is structured as follows: Section 2 describes the integrated gradients, Smooth-grad, and Shapley values concepts and gives details about the proposed DR classification model and the formulation of the proposed explainable DR model. Section 3 provides the results, analysis, and visual interpretations of a set of samples and explainable plots. Section 4 presents the conclusion and the future scope of the work.

2 Materials and Method

Different methods have recently been proposed to address the correct interpretation and explanation of a model's output for medical images. In our work, we have constructed a DR classification model with ResNet 50 architecture and then applied a Shapley gradient model to it to explain the DR classification model. The proposed Shapley gradient model is a combination of the integrated gradients, Smooth-grad, and Shapley values concepts. This section first gives brief details about integrated gradients, Smooth-grad, and Shapley values. Then, it describes the proposed SHAP gradient DR model.

2.1 Integrated Gradients

A gradient-based explanation approach explains a given prediction by using the gradient of the output with respect to the input features. One of the methods that applies the gradient operator is the integrated gradients (IG) method. The integrated gradients method needs no modification to the original model and is simple to implement in most machine learning frameworks [24]. It considers a straight-line path in the feature space between an image in the training set and a certain baseline image, and as it moves along this path, it integrates the gradients of the prediction with respect to the input features.
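The path integration described above can be sketched on a toy model. The example below is an illustration under stated assumptions and not the model used in this work: it uses a hypothetical three-feature logistic model with an analytic gradient, an all-zero baseline, and a midpoint Riemann sum in place of the integral.

```python
import math

WEIGHTS = [0.8, -1.2, 0.5]  # hypothetical weights of a toy logistic model

def model(x):
    """Toy differentiable model: sigmoid of a weighted sum."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))

def gradient(x):
    """Analytic gradient of the toy model with respect to its input."""
    s = model(x)
    return [s * (1.0 - s) * w for w in WEIGHTS]

def integrated_gradients(x, baseline, steps=200):
    """Approximate IG with a midpoint Riemann sum along the straight-line path."""
    total = [0.0] * len(x)
    for step in range(steps):
        alpha = (step + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)
print(attributions)
print(sum(attributions), model(x) - model(baseline))
```

The two values on the last printed line should nearly coincide: the attributions sum to the difference between the model output at the input and at the baseline, which is a useful sanity check for any IG implementation.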


2.2 Smooth-Grad

Smooth-grad is a method that visually sharpens gradient-based sensitivity maps and reduces visual noise, and it can be combined with other gradient-based methods [22]. It creates sample images of the input image by adding pixel-wise Gaussian noise and averages the sensitivity maps obtained for these samples.
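A minimal sketch of the noise-and-average step follows. It assumes a toy model whose gradient is known analytically (standing in for a network's gradient), and the noise level, sample count, and pixel values are hypothetical choices for illustration.

```python
import math
import random

def model_gradient(x):
    """Gradient of a toy model f(x) = sum(sin(x_i)); stands in for a network gradient."""
    return [math.cos(xi) for xi in x]

def smooth_grad(x, grad_fn, noise_std=0.1, n_samples=50, seed=0):
    """Average gradients over noisy copies of x (Gaussian noise per element)."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]
        for i, g in enumerate(grad_fn(noisy)):
            total[i] += g
    return [t / n_samples for t in total]

pixels = [0.0, 0.5, 1.0, 1.5]  # hypothetical flattened pixel intensities
smoothed = smooth_grad(pixels, model_gradient)
print(smoothed)
```

With `noise_std=0.0` the function reduces to the plain gradient, so the noise level controls how strongly the sensitivity map is smoothed.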

2.3 Shapley Values

The Shapley values approach is commonly known as SHAP, an acronym for Shapley additive explanations. SHAP builds on a game theory-based concept developed to determine the contribution of a player in a team environment. The total gain is distributed amongst the team players by calculating the relative importance of each player's contribution to the final outcome [11]. A fair reward can be assigned to each player by using the Shapley values solution [20]. SHAP can be used for explaining the output of any machine learning model. Each feature is treated as a player in a game, and the prediction is the pay-out; the Shapley value then explains the prediction by how fairly the pay-out is distributed amongst the features. In the case of images, the image pixels are grouped to form super-pixels, which are groups of neighbouring pixels with similar brightness and colour. These super-pixels are the players for the image, and the prediction is the pay-out. SHAP values satisfy three desirable properties (local accuracy, missingness, and consistency) and hence give a unique additive feature importance measure [11]. Shapley values show the importance of each feature in making the prediction. Different features contribute differently (in magnitude and sign) to the model's output, and the Shapley values estimate this contribution for each feature. The Shapley value $\phi_k$ for a feature $k$ is calculated using Eq. (1).

\[
\phi_k = \frac{1}{|N|!} \sum_{M \subseteq N \setminus \{k\}} |M|!\,(|N| - |M| - 1)!\,\bigl[f(M \cup \{k\}) - f(M)\bigr] \tag{1}
\]

In Eq. (1), $f(M)$ is the output of the model to be explained for a set $M$ of features, and $N$ is the complete set of all the features. $\phi_k$ is computed as a weighted average, over all possible subsets $M$ of features that exclude the $k$-th feature, of the change in the model output when feature $k$ is added to $M$.
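Eq. (1) can be evaluated directly when the number of features is small. The sketch below enumerates every subset for three hypothetical super-pixels; the pay-out function is invented for illustration, and exact enumeration is not practical for full images, where approximations are used instead.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values per Eq. (1); value_fn maps a frozenset of features to a pay-out."""
    n = len(features)
    phi = {}
    for k in features:
        others = [f for f in features if f != k]
        total = 0.0
        for size in range(len(others) + 1):
            weight = factorial(size) * factorial(n - size - 1)
            for subset in combinations(others, size):
                m = frozenset(subset)
                total += weight * (value_fn(m | {k}) - value_fn(m))
        phi[k] = total / factorial(n)
    return phi

def payout(coalition):
    """Hypothetical model output for a set of super-pixels: additive terms plus one interaction."""
    base = {"A": 0.2, "B": 0.1, "C": 0.0}
    score = sum(base[f] for f in coalition)
    if {"A", "B"} <= coalition:
        score += 0.3  # A and B contribute extra only when both are present
    return score

phi = shapley_values(["A", "B", "C"], payout)
print(phi)
```

In this example the interaction bonus of 0.3 is split equally between A and B, C receives zero because it never changes the pay-out, and the three values sum to the pay-out of the full set, which illustrates the local accuracy property.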


2.4 Proposed SHAP Gradient DR Model

In our work, we present an explainable model called the SHAP gradient DR model. The proposed method learns and explains the outcome of the DR classification model via the SHAP gradient approach. Figure 1 shows the basic diagram of the proposed SHAP gradient DR model.

Fig. 1 Proposed SHAP gradient DR model

DR classification model

Different deep learning architectures like AlexNet, VGGNet, ResNet, and GoogLeNet have been applied to diabetic retinopathy detection and grading by researchers [25, 28]. In this work, we have used a residual neural network (ResNet50), a 50-layer deep convolutional neural network (CNN) pre-trained on the ImageNet dataset, for the detection of diabetic retinopathy. First, the image data is pre-processed with circular cropping, Gaussian blur, and resizing of the images. Then, the dataset is divided into training and testing sets in a 70–30 ratio. The dataset is imbalanced, as the number of images varies widely between classes. The class imbalance is handled by assigning higher weights to the minority classes. The pre-processed images are fed into the pre-trained ResNet50 model. The transfer learning approach makes it possible to reuse a model that was trained on a large annotated database and then to train the layers added on top of it. The proposed deep CNN DR classification model is shown in Fig. 2. Here, we have used the ResNet50 pre-trained model (transfer learning model). Then, two convolutional neural network blocks are built on top of it with a global average pooling layer, a flatten layer (to convert the pooled feature map to a single column), and two dense layers with rectified linear unit (ReLU) and Softmax activation functions, respectively. The activation function acts as a decision maker that specifies which neurons should be triggered. A dropout layer is added to handle over-fitting and to increase the training speed. The Adam optimizer is used to manage the learning rate of the neural network in order to reduce the losses. The binary cross entropy loss function is used to measure the performance of the model. For each retinal image, our model extracts the deep features and classifies the image into one of the five DR classes: no apparent retinopathy (class 0), mild NPDR (class 1), moderate NPDR (class 2), severe NPDR (class 3), and PDR (class 4). The DR classification model is then given as one of the inputs to the SHAP gradient model to provide an explanation of the prediction made for a particular input retinal image.

Fig. 2 Proposed deep CNN DR classification model

SHAP gradient model

The explainability model of our work is based on the Shapley values with gradient approach. The SHAP gradient model combines concepts from integrated gradients, Smooth-grad, and Shapley values into a single expected value equation. The SHAP gradient model can use the entire image dataset as the background distribution, which helps in local smoothing. A linear path is formed between each background data sample and the input image (the image for which the explanation is to be given). Some random points are selected on this path, and the gradients of the output with respect to these points are computed. The gradients and the difference between the input and the background sample then give the SHAP values, as in Eq. (2).

\[
\text{Final SHAP value} \approx \text{gradients} \times (\text{input} - \text{background data sample}) \tag{2}
\]
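The estimate in Eq. (2) can be sketched as an average over background samples and random points on each path. The code below is an illustration under stated assumptions rather than the implementation used in this work: the model is a hypothetical linear function so that the result can be checked by hand, and the names `expected_gradients` and `linear_gradient` are ours.

```python
import random

def expected_gradients(x, background, grad_fn, points_per_sample=20, seed=0):
    """Average gradient * (input - background sample) over random points on each path."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    count = 0
    for b in background:
        diff = [xi - bi for xi, bi in zip(x, b)]
        for _ in range(points_per_sample):
            alpha = rng.random()  # random position on the line from b to x
            point = [bi + alpha * d for bi, d in zip(b, diff)]
            for i, g in enumerate(grad_fn(point)):
                total[i] += g * diff[i]
            count += 1
    return [t / count for t in total]

WEIGHTS = [2.0, -1.0, 0.5]  # hypothetical linear model f(x) = sum(w_i * x_i)

def linear_gradient(point):
    """Gradient of the linear model; constant for every point."""
    return list(WEIGHTS)

x = [1.0, 2.0, 3.0]
background = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 0.0, 4.0]]
values = expected_gradients(x, background, linear_gradient)
print(values)
```

For a linear model each value reduces to the weight times the difference between the input feature and the mean background feature, so the values sum to the model output for the input minus the mean output over the background samples.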

The DR classification model and the input image are given as input to the SHAP gradient model. The predictions made by the complete DR model are explained first. The effect of various intermediate layers on the final prediction is also studied, and hence, these intermediate layers are also given as input to our SHAP gradient model for explainability. The image plots are used for visualization, and the image pixels are marked with different colours, based on their contribution towards the final classification output. The image pixels that contribute positively are marked in pink, and the pixels that contribute negatively are marked in blue.


3 Results and Discussion

This section gives details about the dataset used in this work, the accuracy and performance of the DR classification model, and the analysis done using the explainable DR model. We provide explanations for individual predictions, and a few explanations are given for the internal layers of the classification model.

3.1 Dataset Description

The proposed work uses the EyePACS dataset hosted on the Kaggle platform for the diabetic retinopathy competition [4]. The images were taken with different types of cameras and under different imaging conditions. The dataset consists of more than 35,000 images; of these, only 5000 images are used in this work due to computational resource constraints.

3.2 Performance Evaluation of DR Classification

The 5000-image dataset is split in a 70–30 ratio for training and testing. The binary cross entropy function is used to measure the performance of the model. The quadratic weighted kappa (QWK) score is used as the evaluation metric. The QWK score measures the agreement between the annotated (labelled) score and the predicted score. A value of 0 indicates agreement no better than random, and a value of 1 indicates complete agreement. The QWK score is calculated using Eq. (3), where $\rho_e$ is the probability of random agreement and $\rho_o$ is the probability of observed agreement.

\[
\mathrm{QWK} = \frac{\rho_o - \rho_e}{1 - \rho_e} \tag{3}
\]

In this work, we have achieved a QWK score of 0.784. A score above 0.60 is considered good, and a QWK score between 0.61 and 0.80 indicates substantial agreement between the true values and the predicted values. The model achieved a classification accuracy of 89.34% on the training dataset and 82.7% on the testing dataset. The model loss is computed using the binary cross entropy function. Figure 3 gives the model loss graph for the training and testing datasets. The graph indicates that the model keeps learning from the loss and adjusts the weights as the number of iterations increases, thereby decreasing the loss value and increasing the model accuracy.

Fig. 3 Model loss graph

A subset of images from the EyePACS dataset has been used to train the proposed model. According to the literature, very little work has been done on applying deep neural networks to DR grading with a small training dataset. Most of the state-of-the-art works have used binary classification for diabetic retinopathy grading. In the proposed work, we have considered multi-class classification, that is, diabetic retinopathy grading has been performed on five levels. Table 1 compares works in which five-class DR grading has been performed. Our method achieves higher accuracy than the existing work listed.

Table 1 Performance comparison between the state-of-the-art methods and the proposed method
Authors                            Datasets    Acc %
Pratt et al. [15]                  Kaggle      75
Adly et al. [1]                    Kaggle      83.2
Wang et al. [26]                   Messidor    84.23
Li et al. [10]                     Kaggle      86.17
Proposed SHAP gradient DR model    Kaggle      89.34
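Eq. (3) can be made concrete with a short computation. The sketch below assumes the usual quadratic weighting, in which the agreement between grades i and j is 1 − (i − j)²/(K − 1)² for K classes; this weighting is not spelled out in the text above, and the label lists are hypothetical.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """QWK as (rho_o - rho_e) / (1 - rho_e) with quadratic agreement weights."""
    n = len(y_true)
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1.0 / n
    true_marg = [sum(row) for row in observed]
    pred_marg = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    rho_o = 0.0  # weighted observed agreement
    rho_e = 0.0  # weighted agreement expected by chance
    for i in range(n_classes):
        for j in range(n_classes):
            agreement = 1.0 - ((i - j) ** 2) / ((n_classes - 1) ** 2)
            rho_o += agreement * observed[i][j]
            rho_e += agreement * true_marg[i] * pred_marg[j]
    return (rho_o - rho_e) / (1.0 - rho_e)

# Hypothetical DR grades (0-4) for eight images.
y_true = [0, 1, 2, 3, 4, 2, 1, 0]
y_pred = [0, 1, 2, 3, 3, 2, 2, 0]
print(round(quadratic_weighted_kappa(y_true, y_pred), 3))
```

Because the weights grow with the squared distance between grades, predicting class 3 for a class 4 image is penalized far less than predicting class 0. Systematic disagreement can also produce negative scores under this formula.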

3.3 Analysis of the Explainability SHAP Gradient DR Model

In this section, we explain the impact of the complete DR classification model and of its various intermediate layers on the output (class labels 0 to 4). Here, we have used 20 random images from the 5000 images as the background dataset for the explanations. Image plots are used for visualization. In the image plots, a colour code (pink and blue pixels) marks the regions that are important to the deep neural network for the final prediction. Pink pixels contribute towards a positive prediction, and blue pixels contribute towards a negative prediction. The size of the pink area decides the image label, which is either "0.0" or "1.0": if the pink area is larger than the blue area, the image is labelled "1.0". Thus, the image classification is explained by the colour of each pixel, which indicates how much each pixel contributes to the positive or negative outcome.

Fig. 4 Image plots of SHAP gradient DR model for four retinal images: a input images to the SHAP gradient model, b SHAP explanation

Figure 4 depicts the explanations of the outputs (five DR classes) for four images given as input to the classification model. Each of these four input images corresponds to a different class (Fig. 4a). For each input image, the five columns show the contribution of the pixels of the image to each class (0–4). Consider the first row of Fig. 4b. The input image is Image 0 and corresponds to class 2. In the first image from the left in that row, the blue area is larger than the pink area, and hence the image is marked "0.0", as class 0 cannot be the correct prediction for this input image. As we keep moving towards the right, the pink area increases in size. In the third image from the left, the pink area is larger than the blue area, and hence this image is marked "1.0". The two rightmost images are empty and hence marked "0.0". This shows that the class label for this input image is 2. Similarly, for any input image, the class label can be determined from the size of the pink area.

Analysis of the contribution of intermediate layers

The effect and contribution of the intermediate convolutional layers of the DR classification model in predicting the output are also studied in this work. The output of each intermediate layer of the classification model is passed through a global average pooling layer and a dense layer with Softmax activation function before being given as input to the SHAP gradient model. The global average pooling layer helps to identify the pixels in an image which are being used for predicting the class label. The outputs of three intermediate convolutional layers are taken, and their outcomes are explained for an input image in Fig. 5.
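The colour-area rule used to read the image plots of Fig. 4 can be written down as a small function. This is an illustrative sketch under stated assumptions: pink is taken to mean a positive SHAP value and blue a negative one, areas are compared by pixel count, and the 3 × 3 maps are hypothetical rather than taken from the experiments.

```python
def label_from_shap_map(shap_map):
    """Return 1.0 if more pixels push towards the class (positive) than away from it."""
    pink = sum(1 for row in shap_map for v in row if v > 0)
    blue = sum(1 for row in shap_map for v in row if v < 0)
    return 1.0 if pink > blue else 0.0

# Hypothetical 3x3 SHAP maps for one image, one map per DR class (0-4).
maps_per_class = [
    [[-0.2, -0.1, 0.0], [-0.3, 0.1, 0.0], [-0.1, 0.0, 0.0]],
    [[-0.1, 0.1, 0.0], [0.0, -0.2, 0.0], [0.0, 0.0, 0.0]],
    [[0.3, 0.2, 0.0], [0.1, -0.1, 0.0], [0.2, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
labels = [label_from_shap_map(m) for m in maps_per_class]
print(labels)  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

Only the map for class 2 is dominated by positive contributions, so the image is assigned class 2, mirroring the reading of the first row of Fig. 4b; empty maps are marked "0.0".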
Figure 5 shows how the prediction of the class label for a particular input image changes over the different layers of the deep classifier. Figure 5a shows the image (with class label 0) given as input to the DR classification model. For a clearer view, only the images corresponding to the highest predicted output for each of the three intermediate layers are shown in Fig. 5. The output of the stage 4 convolutional layer of the DR classification model predicts the class of the image as 1, as seen in Fig. 5b. The output of the stage 5 convolutional layer block 1 predicts the class of the image as 0, as seen in Fig. 5c. The output of the stage 5 convolutional layer block 2 also predicts the class of the image as 0, as seen in Fig. 5d. This shows that the prediction is refined as we move to the higher layers. The overall prediction of the complete classification model shows the correct class, as in Fig. 5e.

Fig. 5 Explanations for an image: (a) input image with class 0, (b) stage 4 convolutional layer output, (c) stage 5 convolutional layer block 1 output, (d) stage 5 convolutional layer block 2 output, (e) complete model output

Analysis of the incorrect predictions

The image plots for two incorrectly predicted images are shown in Fig. 6. In Fig. 6a, the actual label of the input image is class 1, but the model has predicted class 2. This is because some of the pixels in the optic disc region are misclassified, and hence, the result is inaccurate. In Fig. 6b, some pixels are misinterpreted as normal, which leads to the incorrect result. The proposed SHAP gradient model has identified the pixels that falsely contribute to the prediction, whilst the ones that make positive contributions are absent, thus justifying the incorrect prediction. Thus, the SHAP model also helps in understanding model errors and the reasons for incorrect predictions. This also shows the complexity of deep neural network models.


Fig. 6 Explanations for incorrect predictions: (a) image with class 1 label, (b) image with class 3 label

4 Conclusion

In this paper, we have proposed an explainable model for deep learning-based diabetic retinopathy classification. The proposed SHAP gradient DR model classifies retinal images into five classes based on the severity level of diabetic retinopathy, visualizes the important pixels on which the prediction is based, and provides an explanation for the prediction, which helps make the model understandable to the human expert. The SHAP gradient model sheds light on the black-box nature of many deep learning techniques. We found that our model provided meaningful explanations of the predictions made for the five diabetic retinopathy classes, as well as of the pixel contributions behind the inaccurate predictions made by the DR classification model. The importance and contributions of the intermediate layers in making predictions are also explained by the SHAP gradient model. The study of the intermediate layers’ contributions helps in analyzing the model better. In medical imaging applications, such insights are of vital importance. The SHAP approach can be used for the analysis of deep learning algorithms in further medical applications and can assist medical experts by making the model understandable and trustworthy. As healthcare increasingly moves online, the number of virtual patients is also increasing, so this kind of analysis and explanation of the severity levels of diabetic retinopathy will help both ophthalmologists and patients. Future work will focus on developing interpretable computer-aided diagnosis systems for disease detection in different healthcare problems.


References 1. Adly MM, Ghoneim AS, Youssif AA (2019) On the grading of diabetic retinopathies using a binary-tree-based multiclass classifier of CNNS. Int J Comput Sci Inf Secur 17(1) 2. Arrieta AB, Díaz-Rodríguez N, Del Ser J, Bennetot A, Tabik S, Barbado A, García S, Gil-López S, Molina D, Benjamins R et al (2020) Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inf Fusion 58:82–115 3. Chu J, Ali Y (2008) Diabetic retinopathy: a review. Drug Develop Res 2008 4. Cuadros J, Bresnick G (2009) Eyepacs: an adaptable telemedicine system for diabetic retinopathy screening. J diab Sci Technol 3(3):509–516 5. Fernando ZT, Singh J, Anand A (2019) A study on the interpretability of neural retrieval models using deepSHAP. In: Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval, pp 1005–1008 6. Goodman B, Flaxman S (2017) European Union regulations on algorithmic decision-making and a “right to explanation”. AI magazine 38(3):50–57 7. Holzinger A, Biemann C, Pattichis CS, Kell DB (2017) What do we need to build explainable AI systems for the medical domain? ArXiv 8. Holzinger A, Langs G, Denk H, Zatloukal K, Müller H (2019) Causability and explainability of artificial intelligence in medicine. Wiley Interdisc Rev: Data Min Knowl Discov 9(4):e1312 9. Kathiresan S, Sait ARW, Gupta D, Lakshmanaprabu S, Khanna A, Pandey HM (2020) Automated detection and classification of fundus diabetic retinopathy images using synergic deep learning model. Patt Recognit Lett 2020. https://doi.org/10.1016/j.patrec.2020.02.026 10. Li YH, Yeh NN, Chen SJ, Chung YC (2019) Computer-assisted diagnosis for diabetic retinopathy based on fundus images using deep convolutional neural network. Mob Inf Syst 2019 11. Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems, pp 4765–4774 12. 
Mansour RF (2018) Deep-learning-based automatic computer-aided diagnosis system for diabetic retinopathy. Biomed Eng Lett 8(1):41–57 13. Mittal K, Mary Anita Rajam V (2020) Computerized retinal image analysis—a survey. Multimedia Tools Appl 2020. DOIurlhttps://doi.org/10.1007/s11042-020-09041-y 14. Nazir T, Irtaza A, Shabbir Z, Javed A, Akram U, Mahmood MT (2019) Diabetic retinopathy detection through novel tetragonal local octa patterns and extreme learning machines. Artif Intell Med 99:101695 15. Pratt H, Coenen F, Broadbent DM, Harding SP, Zheng Y (2016) Convolutional neural networks for diabetic retinopathy. Proc Comput Sci 90:200–205 16. Quellec G, Charrière K, Boudi Y, Cochener B, Lamard M (2017) Deep image mining for diabetic retinopathy screening. Med Image Anal 39:178–193 17. Raman R, Gella L, Srinivasan S, Sharma T (2016) Diabetic retinopathy: an epidemic at home and around the world. Indian J Ophthalmol 64(1):69 18. Samek W, Wiegand T, Müller KR (2017) Explainable artificial intelligence: understanding, visualizing and interpreting deep learning models. ITU J: ICT Discov–Spec Issue 1–Impact Artif Intell (AI) Commun Networks Serv 1:1–10 19. Sayres R, Taly A, Rahimy E, Blumer K, Coz D, Hammel N, Krause J, Narayanaswamy A, Rastegar Z, Wu D et al (2019) Using a deep learning algorithm and integrated gradients explanation to assist grading for diabetic retinopathy. Ophthalmology 126(4):552–564 20. Shapley LS (1953) A value for n-person games. Contrib Theory Games 2(28):307–317 21. Singh A, Sengupta S, Lakshminarayanan V (2020) Explainable deep learning models in medical image analysis. J Imaging 6 22. Smilkov D, Thorat N, Kim B, Viégas F, Wattenberg M (2017) SmoothGrad: removing noise by adding noise. In: Thirty-fourth international conference on machine learning 23. Stiglic G, Kocbek P, Fijacko N, Zitnik M, Verbert K, Cilar L (2020) Interpretability of machine learning based prediction models in healthcare. Wires Data Mining Knowl Discov 24. 
Sundararajan M, Taly A, Yan Q Axiomatic attribution for deep networks. In: Proceedings of the 34th international conference on machine learning, vol 70, pp 3319–3328 (2017)

Deep Learning Model With Game Theory-Based Gradient …

229

25. Wan S, Liang Y, Zhang Y (2018) Deep convolutional neural networks for diabetic retinopathy detection by image classification. Comput Electr Eng 72:274–282 26. Wang S, Wang X, Hu Y, Shen Y, Yang Z, Gn M, Lei B (2020) Diabetic retinopathy diagnosis using multichannel generative adversarial network with semisupervision. IEEE Trans Autom Sci Eng 2020. https://doi.org/10.1109/TASE.2020.2981637 27. Wilkinson C, Ferris FL III, Klein RE, Lee PP, Agardh CD, Davis M, Dills D, Kampik A, Pararajasegaram R, Verdaguer JT et al (2003) Proposed international clinical diabetic retinopathy and diabetic macular edema disease severity scales. Ophthalmology 110(9):1677–1682 28. Zhang W, Zhong J, Yang S, Gao Z, Hu J, Chen Y, Yi Z (2019) Automated identification and grading system of diabetic retinopathy using deep neural networks. Knowl-Based Syst 175:12–25

A Comparative Analysis of Transformer-Based Models for Document Visual Question Answering

Vijay Kumari, Yashvardhan Sharma, and Lavika Goel

Abstract Visual question answering (VQA) is one of the most exciting problems at the intersection of computer vision and natural language processing. It requires understanding and reasoning over an image to answer a human query. Text Visual Question Answering (Text-VQA) and Document Visual Question Answering (DocVQA) are two subproblems of VQA that require extracting text from natural scene images and from document images, respectively. Since answering questions about documents requires an understanding of layout and writing patterns, models that perform well on the Text-VQA task perform poorly on the DocVQA task. As transformer-based models achieve state-of-the-art results across deep learning fields, we train and fine-tune various transformer-based models (BERT, ALBERT, RoBERTa, ELECTRA, and DistilBERT) and examine their validation accuracy. This paper provides a detailed analysis of these transformer models and compares their accuracies on the DocVQA task.

Keywords DocVQA · TextVQA · Bidirectional Encoder Representations from Transformers (BERT) · Tesseract OCR Engine · Natural language processing (NLP) · Long Short-Term Memory (LSTM)

V. Kumari (corresponding author) · Y. Sharma
Birla Institute of Technology and Science, Pilani, India

L. Goel
Malaviya National Institute of Technology, Jaipur, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_16

1 Introduction

Visual question answering (VQA) is an important research area in artificial intelligence that answers text-based questions using images from everyday life [1]. Specific tasks in VQA require the ability to read text from the image to answer the question. Since VQA models are predominantly deep neural networks trained on a particular task, they do not perform well on tasks that require extracting and reasoning over text present in the images [2]. The development of models that perform well on such tasks will help in automating manual work, enhancing accessibility for people with disabilities, and so on.

The DocVQA [3] task aims to develop a model capable of extracting information from scanned documents and responding to queries in natural language. Answering the questions requires the ability to analyze the scanned text and to understand the significance of the text’s layout. Certain writing conventions require specific placement of text and data to convey meaning effectively, and these conventions must be learned by the model. The model should be able to extract and understand the handwritten, typewritten, or printed text in the document images, and in addition use a variety of visual cues, such as page structure, tables, forms, non-textual elements (marks, tick boxes, separators, and diagrams), and style (font, colors, and highlighting). The automation of administrative tasks that involve answering queries about older documents would greatly benefit from such a model.

The main contributions of the paper are as follows. This work focuses on a text-based Document Visual Question Answering (DocVQA) system that answers a question from a document image. We used Tesseract OCR to extract the text from the document images and arranged the text in paragraph form. We implemented Bidirectional Encoder Representations from Transformers (BERT) and several of its transformer-based variants to answer the questions and analyzed their performance on the DocVQA task.

The remainder of the paper is organized as follows. Section 2 explains the related work, and Sect. 3 describes the language representation techniques and text extraction methods. Section 4 presents the comparative results of the different BERT variants on the DocVQA task, followed by the conclusion and future work in Sect. 5.

2 Related Work

2.1 Text-Based Visual Question Answering

Datasets such as VQA 2.0 [4] contain only a small percentage of questions about text. As a result, models developed using these datasets perform poorly on tasks that require reasoning about text to answer questions. To overcome this issue, TextVQA [5] was developed. It consists of questions about everyday scenes that require reading text in the image. DocVQA is a specialized dataset of questions whose answers require extracting textual and visual cues from documents. Generic VQA and text-based VQA approaches do not perform well on DocVQA, as answering questions about documents requires knowledge of implicit written communication conventions.

A Text-VQA model called Look, Read, Reason & Answer (LoRRA) uses Rosetta for text detection and extraction, FastText for OCR embeddings, and GloVe for question embeddings [5]. The M4C model, in contrast, uses a transformer-based architecture to extract the question’s features and combines all of these modalities into a single space [6].

2.2 Transformer-Based Natural Language Processing Models

The development of Bidirectional Encoder Representations from Transformers (BERT) [7] marked a new era for natural language processing. By employing transformer networks to build a bidirectional model, BERT surpassed human performance on SQuAD 1.1 [8]. RoBERTa [17] outperformed BERT and gave state-of-the-art results on 4 of the 9 GLUE tasks by improving upon the pre-training process of BERT. DistilBERT [16] uses knowledge distillation to obtain performance comparable to BERT with a much lower overhead. At the time of its release, ALBERT [18] reported state-of-the-art results, with a GLUE score of 89.4 and an F1 score of 92.2 on the SQuAD 2.0 (Stanford Question Answering Dataset) benchmark.

2.3 Multimodal Learning in Vision and Language Tasks

Various models that involve multimodal learning have been proposed to solve these tasks. Pythia [9] was inspired by the detector-based bounding box prediction approach of the bottom-up, top-down attention network, which has a multimodal attention mechanism. LXMERT [10] applied transformer-based fusion between image and text with self-supervision. To utilize layout features obtained from an OCR system without having to completely re-learn language semantics, LAMBERT [11] modifies the Transformer encoder architecture. It does not use image data; it simply adds the coordinates of token bounding boxes to the model’s input. This results in a layout-aware language model that can be fine-tuned on downstream tasks. LayoutLMv2 [12] presents a multimodal learning approach for visually rich document understanding that considers text, layout, and image information. Similarly, LaTr (Layout-Aware Transformer for Scene-Text VQA) was developed for the TextVQA task on scene text images [13].


3 Methodology

3.1 Method for Text Extraction

Tesseract OCR Engine

Tesseract is an optical character recognition (OCR) engine that uses long short-term memory (LSTM) networks in its pipeline to recognize characters. It takes an image, runs multiple networks on it, and stacks their inputs to recognize text lines. Tesseract is compatible with many languages and can also be trained to recognize other languages. We chose Tesseract OCR for typed text because it supports different languages and its architecture can be extended to additional languages [14].

In Tesseract’s design [14], shown in Fig. 1, a grayscale or color image is given to the framework as input, and adaptive thresholding (which helps clean up noisy images) produces a binary image. Connected component analysis is then used to identify character outlines, which are organized into text lines and words. In the final step, Tesseract recognizes each word and passes it on as output.

Fig. 1 Architecture of Tesseract

Handwritten Text Recognition

OCR programs give very accurate results for printed text; however, they are unable to recognize handwritten text effectively, as handwriting varies widely. The proposed model is based on the methodology described in [15], which recognizes handwritten text in four steps:

1. Removal of background: The background of the image is removed to improve accuracy and speed.
2. Detection of words: The words are detected using their spatial properties in the image, and bounding boxes are created around each word.
3. Normalization: The words detected in the previous step are resized to a standard size and converted to grayscale. The rotation of the text is also corrected to make recognition easier.
4. Word recognition: The best performing model uses a bidirectional recurrent neural network followed by a convolutional neural network. However, the model that uses the Bidirectional Recurrent Neural Network (BiRNN) + Convolutional Neural Network (CNN) architecture cannot be trained on images that contain whole words. Hence, a Seq2Seq (sequence-to-sequence) model was combined with the BiRNN + CNN network.

The datasets used to train handwriting OCR systems usually have very clean images, which is not the case with DocVQA. Handwriting is sometimes difficult to read, and the images are often blurry. Moreover, graphs and pie charts are difficult to read because the inference depends on both the text and the drawing. To obtain a reading from a bar graph, for example, one would need to determine the precise length of a bar and relate it to the axis scale. Furthermore, there are many kinds of graphs and forms, and no standard dataset can be used to train a model to understand plots and pie charts. Therefore, OCR transcriptions for graphs and handwritten text are provided along with the dataset. The proposed model uses a Python script to transform these OCR transcriptions into a specific format.
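The conversion of OCR transcriptions into paragraph-form text can be illustrated with a short sketch. The paper does not give the script or the transcription schema, so everything below is an assumption: the record keys (`text`, `left`, `top`, `height`), the function name, and the line-grouping tolerance are hypothetical and only show one plausible way to put recognized words into reading order.

```python
def words_to_paragraph(words, line_tolerance=0.5):
    """Join OCR words into one paragraph string in reading order.

    words: list of dicts with hypothetical keys 'text', 'left', 'top', 'height'.
    line_tolerance: two words share a line when their 'top' values differ
    by at most this fraction of the word height.
    """
    lines = []  # each entry: {'top': reference top value, 'words': [...]}
    for word in sorted(words, key=lambda w: (w["top"], w["left"])):
        for line in lines:
            if abs(word["top"] - line["top"]) <= line_tolerance * word["height"]:
                line["words"].append(word)
                break
        else:
            # No existing line is close enough vertically: start a new one.
            lines.append({"top": word["top"], "words": [word]})

    ordered = []
    for line in sorted(lines, key=lambda entry: entry["top"]):
        ordered.extend(sorted(line["words"], key=lambda w: w["left"]))
    return " ".join(w["text"] for w in ordered if w["text"].strip())


# Toy input: four word boxes given out of order (made-up coordinates).
sample_words = [
    {"text": "Head", "left": 60, "top": 12, "height": 10},
    {"text": "Hilton", "left": 10, "top": 10, "height": 10},
    {"text": "Island,", "left": 10, "top": 30, "height": 10},
    {"text": "SC", "left": 70, "top": 31, "height": 10},
]
context = words_to_paragraph(sample_words)
```

The resulting string plays the role of the text context that is later paired with a question. Multi-column layouts or tables would need a more careful ordering than this top-to-bottom, left-to-right rule.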

3.2 Method for Answer Extraction

Transformer-Based Models

BERT-Large: Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained unsupervised NLP model [7] built upon the transformer architecture. Attention is applied in both directions to relate the words in the text to one another. The input representation is the sum of three embeddings: position, segment, and token. BERT uses positional embeddings to capture the location of a word in a sentence. Segment embeddings distinguish the two parts of a sentence-pair input, as used in question-answering tasks. Token embeddings are learned from the WordPiece token vocabulary. Implementing the BERT model involves two stages: pre-training and fine-tuning. Pre-training involves training the model on unlabeled data over different pre-training tasks, and fine-tuning involves initializing the model with the pre-trained parameters and then training it on labeled data from the downstream task.

Distilled BERT (DistilBERT): The size and complexity of natural language processing models have increased, which negatively affects their use in real-time applications. To address these challenges, DistilBERT was developed using a technique called knowledge distillation [16]. A smaller (student) model is trained to reproduce the behavior of a larger (teacher) model by attempting to replicate its output at each level. There are three steps: correspondence setup, forward pass, and backpropagation. During the design phase, a correspondence is set up between intermediate outputs of the student and teacher networks. A forward pass through the teacher network is done to obtain all the intermediate outputs. The teacher outputs and the correspondence relation are then used to compute the backpropagation errors, so that the student network learns to mimic the behavior of the teacher network. DistilBERT has an architecture similar to BERT. The number of layers is reduced by a factor of 2, and the token-type embeddings and the pooler are removed.
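The idea of a student network mimicking a teacher network can be made concrete with a small sketch of a distillation loss. This is not DistilBERT's actual training objective, which the text does not spell out. It assumes the common formulation in which both sets of logits are softened with a temperature and the student is penalized by the Kullback-Leibler divergence from the teacher distribution; the function names and the temperature value are illustrative.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy logits for one token position (made-up numbers).
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 1.0, 4.0]

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

A student whose outputs are close to the teacher's receives a small loss, and the loss is zero when the two distributions coincide. Minimizing this quantity by backpropagation is what drives the student toward the teacher's behavior.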
RoBERTa (robustly optimized BERT): RoBERTa is a streamlined strategy for pre-training natural language processing models that improves upon BERT. The changes made to the training procedure are [17]: dynamic masking, removal of next sentence prediction (NSP), training with large batches, and a different text encoding. In BERT, word masking is done once during data pre-processing, creating only a single static mask for training. In RoBERTa, the mask is generated each time a sequence is fed to the model, which is helpful when the model is trained on a larger dataset. Dynamic masking results in slightly improved accuracy over static masking. Studies have shown that NSP degrades downstream task performance after fine-tuning, and removing NSP led to a slight improvement in RoBERTa’s performance. When the model is trained using larger batch sizes, the accuracy on the final task is observed to increase over the BERT model. Moreover, training in larger batches makes it easier to parallelize the workload. The original implementation of BERT used a character-level Byte-Pair Encoding (BPE) vocabulary of size 30k. RoBERTa uses a larger byte-level BPE vocabulary of 50k units, ensuring that no tokens are regarded as “unknown.”

A Lite BERT (ALBERT): ALBERT was designed to address memory constraints and communication overhead issues. ALBERT achieves significant improvements over BERT using only about 70% of its parameters [18]. The model makes three main contributions to the design of BERT: factorized embedding parametrization, cross-layer parameter sharing, and an inter-sentence coherence loss. In BERT, the size of the WordPiece embedding is the same as that of the hidden layer. Tying together two quantities that serve different purposes leads to inefficient use of parameters. To dramatically reduce the number of parameters in the embedding matrix, the embedding parameters are factorized into two smaller matrices. ALBERT shares all of its parameters across layers by default to improve parameter efficiency. Compared to BERT, this results in a smoother transition between layers.
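The saving from ALBERT's factorized embedding parametrization can be checked with simple arithmetic. In the sketch below, the vocabulary size of 30,000 follows the 30k figure quoted for BERT above, while the hidden size of 768 and the embedding size of 128 are illustrative assumptions rather than values taken from this paper. The function name is hypothetical.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters.

    Tied (BERT-style): one V x H matrix.
    Factorized (ALBERT-style): a V x E matrix followed by an E x H projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30_000, 768, 128  # H and E are assumed, illustrative sizes

tied = embedding_params(V, H)           # 30,000 * 768
factorized = embedding_params(V, H, E)  # 30,000 * 128 + 128 * 768
reduction = tied / factorized
```

With these sizes, the tied matrix has 23,040,000 parameters and the factorized pair has 3,938,304, a reduction of roughly a factor of 5.9 in the embedding layer alone.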
ALBERT uses a Sentence Order Prediction (SOP) loss instead of next sentence prediction (NSP) to model the inter-sentence relation. This drives the model to learn finer-grained distinctions about discourse-level coherence, and as a result, ALBERT performs better on downstream tasks.

Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA): ELECTRA is an NLP model that improves upon the pre-training strategies of BERT. ELECTRA uses replaced token detection instead of masked language modeling (MLM) [19]. Experiments have shown that ELECTRA outperforms the RoBERTa model when the same computing power is provided. Replaced token detection uses two transformer encoder networks, a generator and a discriminator. The generator is trained to perform masked language modeling: a set of positions in the input token sequence is randomly selected and masked out, and the generator learns to predict the original identities of the masked-out tokens. The discriminator is trained to distinguish between tokens produced by the generator and tokens present in the data. After pre-training, the generator is discarded, and the discriminator is fine-tuned on downstream tasks.

The complete architecture of the proposed model, built using the above-mentioned techniques, is shown in Fig. 2. An image of a document is given as input, together with a question about that document. The proposed model first extracts the text from the document image using Tesseract OCR, and embeddings for both the question text and the OCR text are generated using the transformers to get the answer. The answer model fetches the answer from the text-based context.

Fig. 2 Architecture of the proposed model
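The last step of this pipeline, fetching the answer from the text context, is commonly done in extractive question answering by scoring every token as a possible answer start and answer end and picking the best valid span. The sketch below illustrates that selection rule only. It assumes per-token start and end scores are already available from a fine-tuned model, the scores are made up, and the function names and the interpretation of the maximum answer length as a token count are assumptions.

```python
def best_answer_span(start_scores, end_scores, max_answer_len=50):
    """Return ((start, end), score) maximizing start + end score,
    with start <= end and a span of at most max_answer_len tokens."""
    best_span = (0, 0)
    best_score = float("-inf")
    for i, start in enumerate(start_scores):
        last = min(i + max_answer_len, len(end_scores))
        for j in range(i, last):
            score = start + end_scores[j]
            if score > best_score:
                best_score = score
                best_span = (i, j)
    return best_span, best_score


def extract_answer(tokens, start_scores, end_scores, max_answer_len=50):
    """Join the tokens of the best-scoring span into an answer string."""
    (i, j), _ = best_answer_span(start_scores, end_scores, max_answer_len)
    return " ".join(tokens[i:j + 1])


# Toy OCR context with made-up start and end scores.
tokens = ["The", "congress", "is", "in", "Hilton", "Head", "Island"]
start_scores = [0.1, 0.0, 0.0, 0.2, 3.0, 0.5, 0.1]
end_scores = [0.0, 0.1, 0.0, 0.0, 0.2, 0.4, 2.5]

answer = extract_answer(tokens, start_scores, end_scores)
short_answer = extract_answer(tokens, start_scores, end_scores, max_answer_len=2)
```

The length cap keeps the search from returning very long spans. Lowering it, as in the second call, forces a shorter answer even when a longer span scores higher.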

4 Experiments and Results

This section describes the dataset used, the experimental setup, the results obtained by the various models on the task, and the qualitative experiments used to test the model.

4.1 DocVQA

We used the DocVQA [3] dataset to train the model. It consists of more than 12,000 document images covering many document types. The documents include tables, figures, and text, such that visual cues like layout can be used to answer the questions. The dataset contains 50,000 questions and answers over the document images. The document images were hand-picked such that binarized images were minimized, in order to preserve image quality. Documents with forms, figures, etc., were prioritized over images with only long-running text. The documents include handwritten, typed, printed, and born-digital text.

4.2 Experimental Setup

A BERT-based question-answering system aims to extract the answer from a text-based context; hence, the model is trained and fine-tuned on pairs of questions and answers with contexts. In our task, the context is the document image, which contains text. We therefore first extract the text strings from the document using the OCR model, and then fine-tune the model for the DocVQA task. We used the state-of-the-art Tesseract model to extract the OCR tokens from the document. BERT and its variants are fine-tuned on DocVQA on an Nvidia GPU with CUDA enabled for six epochs, with a training batch size of 8, an evaluation batch size of 64, and a maximum answer length of 50. The learning rate is 2e-05.

Table 1 Validation accuracy of different models on the DocVQA dataset

Model             Accuracy (correct) (%)   Accuracy (correct + similar) (%)
BERT-Large [7]    48.1                     68.9
DistilBERT [16]   39.15                    61.5
RoBERTa [17]      39.7                     67.3
ELECTRA [19]      31.5                     60.9
ALBERT [18]       37.9                     63.9

Table 2 Accuracy of the Text-VQA models on DocVQA

Method      Object feature   Fixed vocab   Dynamic vocab size   Val ANLS   Val accuracy (%)
LoRRA [5]   Yes              Yes           500                  0.094      6.41
M4C [6]     No               Yes           500                  0.385      24.73
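The fine-tuning settings reported in Sect. 4.2 can be gathered in one place, as in the following sketch. Only the numeric values come from the paper. The class and field names are hypothetical, and the helper for counting optimizer steps is an added convenience that is not part of the described setup.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters reported for fine-tuning BERT and its variants."""
    num_epochs: int = 6
    train_batch_size: int = 8
    eval_batch_size: int = 64
    max_answer_length: int = 50
    learning_rate: float = 2e-5


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once (ceiling division)."""
    return -(-num_examples // batch_size)


config = FineTuneConfig()
settings = asdict(config)
# Example with a made-up number of training examples.
example_steps = steps_per_epoch(1000, config.train_batch_size)
```

Keeping the settings in a frozen dataclass makes it harder to change them by accident between runs of the different BERT variants, so that the comparison in Table 1 stays like for like.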

4.3 Results Obtained by Various Models on DocVQA

The proposed model classifies the outputs into three categories: (1) Correct: The predicted answer exactly matches the ground truth. (2) Similar: The predicted answer is a substring of the ground truth or vice versa. (3) Incorrect: The predicted answer is neither correct nor similar. Average Normalized Levenshtein Similarity (ANLS) and accuracy are used as the performance evaluation metrics. Results of the transformer-based BERT models, pre-trained on the SQuAD dataset and then fine-tuned on the DocVQA dataset, are shown in Table 1. The ANLS score of the best-performing model (BERT-Large) is 0.67. Comparing Table 1 with Table 2, we can observe that all the BERT-based models outperform the Text-VQA models LoRRA [5] and Multimodal Multi-Copy Mesh (M4C) [6] by a significant margin.
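The three output categories and the ANLS metric can be sketched as follows. The paper does not give its evaluation code, so this is an illustration under stated assumptions: answers are compared after lowercasing and trimming whitespace, and ANLS uses the usual definition in which a normalized edit distance of 0.5 or more scores zero. All function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(text):
    """Assumed normalization: trim whitespace and lowercase."""
    return text.strip().lower()


def categorize(prediction, truth):
    """Label a prediction as 'correct', 'similar', or 'incorrect'."""
    p, t = normalize(prediction), normalize(truth)
    if p == t:
        return "correct"
    if p and t and (p in t or t in p):
        return "similar"
    return "incorrect"


def nls(prediction, truth, threshold=0.5):
    """Normalized Levenshtein similarity, zeroed at or above the threshold."""
    p, t = normalize(prediction), normalize(truth)
    if not p and not t:
        return 1.0
    distance = levenshtein(p, t) / max(len(p), len(t))
    return 1.0 - distance if distance < threshold else 0.0


def anls(predictions, ground_truths, threshold=0.5):
    """Average over questions of the best similarity to any accepted answer."""
    scores = [
        max(nls(p, g, threshold) for g in answers)
        for p, answers in zip(predictions, ground_truths)
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

For instance, `categorize("Hilton Head", "Hilton Head Island, SC")` returns `"similar"` because the prediction is a substring of the ground truth. ANLS gives partial credit to near matches such as small OCR errors, which is why it is reported alongside exact-match accuracy.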

4.4 Experiments

The qualitative results of the experiments using the BERT-Large model are presented in this section. The model is tested on images that are part of the DocVQA dataset as well as on images that are not, and the retrieved answers are shown.

Experiment 1 (testing on a scanned document image belonging to the DocVQA dataset, with spatial questions)

Fig. 3 Image of the scanned document belonging to the DocVQA dataset, used for spatial questions
Q1: What is the location of 11th congress on women health?
Q2: What is full form of ACOG?
Q3: What is date of ACOG?

Answers retrieved by the model for Experiment 1 (Fig. 3):
Prediction for Q1: Hilton Head Island, SC (confidence 0.999)
Prediction for Q2: American College of Obstetricians and gynecologists (confidence 0.999)
Prediction for Q3: April 26–30, 2003 (confidence 0.999)

Experiment 2 (testing on an image belonging to the DocVQA validation set)

Fig. 4 Image of the scanned document belonging to the DocVQA validation set
Q1: What is the name of heading?
Q2: What is minimum age of smokers?
Q3: What brand of cigarettes?

Answers retrieved by the model for Experiment 2 (Fig. 4):
Prediction for Q1: The WINSTON RACING NATION (confidence 0.999)
Prediction for Q2: 21+ (confidence 0.994)
Prediction for Q3: WINSTON (confidence 0.992)

Experiment 3 (testing on a random scanned document image not belonging to the DocVQA dataset)

Fig. 5 Image of the scanned document not belonging to the DocVQA dataset
Q1: What is the name of officer?
Q2: What is worth ore?
Q3: How is condition of equipment?

Answers retrieved by the model for Experiment 3 (Fig. 5):
Prediction for Q1: Captain james K. Powell (confidence 0.990)
Prediction for Q2: over a million dollars (confidence 0.999)
Prediction for Q3: As our equipment was crude in the extreme (confidence 0.374)

5 Conclusion and Future Work

The TextVQA and DocVQA datasets contain questions that can only be answered by reading and reasoning about the text in images and documents. We experimented with various transformer models and calculated the ANLS score of the best-performing model on the test set. The experiments show that the BERT-based model can accurately infer textual information and account for spatial factors, which is crucial for the DocVQA problem. Existing OCR algorithms are good at recognizing clean, typed text, but their performance suffers significantly when the image includes noise or is slightly distorted. This could be a focus of future research.

Acknowledgements The authors would like to convey their sincere thanks to the Department of Science and Technology (ICPS Division), New Delhi, India, for providing financial assistance under the Data Science (DS) Research of Interdisciplinary Cyber-Physical Systems (ICPS) Program [DST/ICPS/CLUSTER/Data Science/2018/ Proposal-16: (T-856)] at the Department of Computer Science, Birla Institute of Technology and Science, Pilani, India. The authors are also thankful to the authorities of Birla Institute of Technology and Science, Pilani, for providing basic infrastructure facilities during the preparation of the paper.

References

1. Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Zitnick CL, Parikh D (2015) VQA: visual question answering. In: Proceedings of the IEEE international conference on computer vision, pp 2425–2433
2. Zhong H, Chen J, Shen C, Zhang H, Huang J, Hua X-S (2020) Self-adaptive neural module transformer for visual question answering. IEEE Trans Multimedia 23:1264–1273
3. Mathew M, Karatzas D, Jawahar CV (2021) DocVQA: a dataset for VQA on document images. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 2200–2209
4. Ross H, Mannion G (2012) Curriculum making as the enactment of dwelling in places. Stud Philos Educ 31(3):303–313
5. Singh A, Natarajan V, Shah M, Jiang Y, Chen X, Batra D, Parikh D, Rohrbach M (2019) Towards VQA models that can read. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8317–8326
6. Hu R, Singh A, Darrell T, Rohrbach M (2020) Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9992–10002
7. Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
8. Rajpurkar P, Zhang J, Lopyrev K, Liang P (2016) SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250
9. Jiang Y, Natarajan V, Chen X, Rohrbach M, Batra D, Parikh D (2018) Pythia v0.1: the winning entry to the VQA challenge 2018. arXiv preprint arXiv:1807.09956
10. Tan H, Bansal M (2019) LXMERT: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490
11. Garncarek Ł, Powalski R, Stanisławek T, Topolski B, Halama P, Turski M, Graliński F (2021) LAMBERT: layout-aware language modeling for information extraction. In: International conference on document analysis and recognition. Springer, pp 532–547
12. Xu Y, Xu Y, Lv T, Cui L, Wei F, Wang G, Lu Y, Florencio D, Zhang C, Che W et al (2020) LayoutLMv2: multi-modal pre-training for visually-rich document understanding. arXiv preprint arXiv:2012.14740
13. Biten AF, Litman R, Xie Y, Appalaraju S, Manmatha R (2022) LaTr: layout-aware transformer for scene-text VQA. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16548–16558
14. Smith R (2007) An overview of the Tesseract OCR engine. In: Ninth international conference on document analysis and recognition (ICDAR 2007), vol 2. IEEE, pp 629–633
15. Hajek B. Handwriting OCR. https://github.com/Breta01/handwriting-ocr
16. Sanh V, Debut L, Chaumond J, Wolf T (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108
17. Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692
18. Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942
19. Clark K, Luong M-T, Le QV, Manning CD (2020) ELECTRA: pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555

Develop Hybrid Wolf Optimization with Faster RCNN to Enhance Plant Disease Detection Performance Analysis

M. Prabu and Balika J. Chelliah

Abstract Plant diseases have become a major threat to global food security, in terms of both production and supply. In this paper, we present a real-time plant disease detector that relies on modified deep convolutional neural networks. The plant disease images were first expanded using image processing techniques to build the plant disease dataset. A Wolf Optimization with Faster Region-based Convolutional Neural Network (WO-FRCNN) system with improved feature extraction was then used to identify plant diseases. The proposed method improved the detection of plant diseases and achieved a precision of 96.32%. According to the experimental data, detection reaches a rate of 15.01 FPS, comparable to the existing methods. This study indicates that the real-time detector Improved WO-FRCNN, which is based on deep learning, is a viable option for diagnosing plant diseases and could be used to identify other diseases in plants. The evaluation indicates that the proposed method provides good reliability.

Keywords Plant disease · Deep Convolutional Neural Networks (DCNN) · R-CNN · Data Fusion

1 Introduction
Plant diseases are responsible for 10–16% of yearly crop failure, costing an estimated US $220 billion in worldwide agricultural yields [1]. The impairment of agricultural production by crop diseases contributes to chronic food shortages, which have become a worldwide problem that crop pathologists must not disregard [2]. As a result, agricultural productivity should be increased by up to 70% to maintain an adequate food supply for the fast-growing population [3]. Rice leaf disease diagnosis and prediction are critical for preserving the quantity and quality of paddy cultivation, because early diagnosis of the disease allows prompt action to control disease development and protect plant growth, hence enhancing rice production and supply [4–7]. Because conventional identification suffers from a large ecological impact, a slow detection rate, and low accuracy, recent automated identification processes need to be widely adopted [6, 8, 9].

M. Prabu (B) · B. J. Chelliah, Department of Computer Science and Engineering, SRM Institute of Science and Technology, Ramapuram, Chennai, India. e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_17

2 Related Works
Feature-extraction neural networks (NN) must correctly capture the properties of images that contain damaged tomatoes to achieve high recognition accuracy for tomato diseases [10]. Some common hand-crafted features, for example, have performed well in past research. These hand-crafted features, on the other hand, do not generalize well [11]. A CNN, as a learning algorithm, is capable of hierarchical learning. Earlier studies have shown that features learned by a CNN have better discriminative and representational capabilities than hand-crafted features. Four kinds of DCNN were used in this research [12]. In Fast R-CNN, the training approach was compressed into a single stage, removing the need for a multi-stage training pipeline. By mapping the region proposals onto a shared feature map, Fast R-CNN only needs to compute features for the whole image once, eliminating a substantial amount of repeated calculation [13–16]. However, both R-CNN and Fast R-CNN generate region proposals using selective search or edge-box detection based on low-level characteristics, which is time-consuming and results in low-quality regions of interest. In the Faster R-CNN developed in 2017, the region proposal network benefits from deep features and replaces selective search or edge-box detection. Mask R-CNN adds an instance segmentation branch to Faster R-CNN to precisely extract the positions and contours of objects.

3 Proposed Method
First, images of plant diseases were obtained from a laboratory and a real vineyard. Data augmentation was then used to enlarge the set of initial plant disease images, which were subsequently annotated by experts. Lastly, the dataset was split into three sections: a training set used to train the Improved WO-FRCNN framework, a validation set used to fine-tune parameters and validate the system, and a test set used to check that the model generalizes. The basic workflow of the Improved WO-FRCNN system for identifying four common plant diseases is shown in Fig. 1. The proposed Improved WO-FRCNN has three parts: (1) A pre-network for extracting features of plant disease images. INSE-ResNet is a pre-network that combines a residual architecture, Inception modules, and SE-blocks. (2) A region proposal network (RPN) for locating objects. Feature maps are passed to the RPN once the pre-network processing is completed, and bounding boxes are used to locate and predict diseased regions in this part. (3) Fully connected layers for classification and regression.

Fig. 1 Proposed architecture of FRCNN

3.1 Data Collection
Diseases affect grapevines differently depending on the season, temperature, and moisture level. Black rot, for example, can be devastating to the grape business if the weather is consistently hot and humid, but it is uncommon in a dry summer. The four types of plant disease were chosen for two reasons. First, some of the diseased patches are difficult to discern by eye, but the Improved WO-FRCNN can easily extract their features. Second, the incidence of these diseases has a significant impact on the grape business. Figure 2 shows typical images of the four types of plant disease. The four kinds of diseased spots on grape leaves show both consistency and variety: the symptoms caused by the same disease under comparable natural conditions are essentially similar, whereas the features of diseased spots caused by different diseases are often distinct.

Fig. 2 Four typical plant diseases. a Black rot b Black measles c Leaf blight d Leaf mites

Algorithm: FRCNN
The algorithm is given as follows.
Input: Image with Rician noise.
Output: Pre-processed image.
Begin
  // Neighbourhood (mean of local frame)
  βn = D(Uj)
  // Full noise image (mean of global frame)
  βn = D(U / k)
  // Calculate noise
  αh = ∂2c
  // Compute median in every pixel
  Li = Median-Filtration
  // Local
  Rnon-local = Non-local filtrations
  // Compute similarity among pixels
  Rlocal = Li. Rlocal-1, I non-local
  // Calculate computed images
  G(y.z) = Rlocal-1 *, Li * Rlocal * Rnon-local)
This algorithm pre-processes the data with FRCNN.
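The median-filtration step of the pre-processing algorithm can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows and a square window; the function name `median_filter` and the `radius` parameter are hypothetical, and the non-local filtering and the final combination step of the algorithm are not covered here.

```python
from statistics import median


def median_filter(image, radius=1):
    """Replace every pixel by the median of its (2*radius+1)^2 neighbourhood.

    Windows are clipped at the image border, so edge pixels use fewer
    neighbours. This is only the 'Li = Median-Filtration' step; the window
    size is an assumption, not a value given in the paper.
    """
    height, width = len(image), len(image[0])
    filtered = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
            ]
            filtered[r][c] = median(window)
    return filtered


# A flat patch with one impulse-noise pixel (255).
noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 10, 10],
]
clean = median_filter(noisy)
```

In this toy patch the single outlier never forms the majority of any window, so the filter restores the flat background while leaving the other pixels unchanged.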

3.2 Feature Extraction
ResNet34 achieves high classification accuracy on the characteristics of diseased leaf spots. As a result, ResNet34 was chosen as the backbone network of the detection model. The residual connections in ResNet allow the network architecture to be deepened further without vanishing gradients, which overcomes the degradation issue in WO-FRCNN and suits tiny diseased patches. Furthermore, it is simple to optimize and attains good classification accuracy.


Algorithm for WO-FRCNN
Input: Scanned images.
Output: Predicted leaf disease.
Begin
Step 1: Extract the features from the segmented image.
Step 2: Use the F-RCNN features to predict the leaf disease as follows.
(i) Calculate the input features:
  FE(b(n)) = \sum_{q=1} wei_q \, \delta_q(n)    (1)
(ii) Weighted quantum estimation using Eq. (2):
  wei = (\iota^{T} \iota)^{-1} \iota^{T} y    (2)
(iii) RBF estimation:
  \delta_q(n) = \exp\left[ -\frac{|b(n) - cen_q|^{2}}{2\omega_q} \right]    (3)
Step 5: Use the WO algorithm to calculate the optimum value using Eq. (3).
Step 6: Predict the type of leaf disease.
End

FRCNNs learn low-level characteristics such as color and edges in the first few layers, and then extract comprehensive and discriminative features in the later layers. Res1 to Res3 of ResNet34 were therefore preserved in their entirety (Fig. 3).
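Eqs. (1)–(3) can be read as a radial basis function (RBF) model whose weights are fitted by least squares. The sketch below works under that reading and is not the authors' implementation: it assumes scalar features b(n), takes the denominator of Eq. (3) to be 2ω_q, and uses hypothetical centres, widths, and helper names (`rbf`, `fit_weights`, `predict`, `solve`).

```python
import math


def rbf(b, centre, width):
    """Eq. (3): delta_q(n) = exp(-|b(n) - cen_q|^2 / (2 * omega_q))."""
    return math.exp(-((b - centre) ** 2) / (2.0 * width))


def design_matrix(samples, centres, widths):
    """One row per sample, one column per basis function."""
    return [[rbf(b, c, w) for c, w in zip(centres, widths)] for b in samples]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_weights(phi, y):
    """Eq. (2): wei = (phi^T phi)^-1 phi^T y, solved via the normal equations."""
    q = len(phi[0])
    gram = [[sum(row[i] * row[j] for row in phi) for j in range(q)] for i in range(q)]
    proj = [sum(row[i] * target for row, target in zip(phi, y)) for i in range(q)]
    return solve(gram, proj)


def predict(b, weights, centres, widths):
    """Eq. (1): FE(b(n)) = sum over q of wei_q * delta_q(n)."""
    return sum(w * rbf(b, c, s) for w, c, s in zip(weights, centres, widths))


# Hypothetical data: targets generated from known weights, then recovered.
samples = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
centres, widths = [0.5, 2.0], [0.5, 0.5]
true_weights = [1.5, -0.7]
targets = [predict(b, true_weights, centres, widths) for b in samples]
weights = fit_weights(design_matrix(samples, centres, widths), targets)
```

Because the targets are noise-free outputs of the same model, the least-squares fit recovers the generating weights up to floating-point error; on real features the fit would only approximate the targets.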

Fig. 3 Structure of INSE-ResNet


3.3 Predicting Diseased Spots
The Improved WO-FRCNN detection method relies heavily on the region proposal network, in which candidate boxes are obtained from the placed anchors. A double-RPN architecture, influenced by feature pyramid networks, is presented for finding irregular and multi-scale diseased areas, as shown in Fig. 4. The rich semantic information of Inception 5b is combined with the high-resolution information of Inception-ResNet-v2 through a deconvolution procedure. As a result, the proposed detection method can predict diseased regions in each feature map independently.
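How anchors are placed on a feature map can be sketched as follows. This is a generic Faster R-CNN-style illustration, not the configuration used in the paper: the stride, scales, and aspect ratios are hypothetical values, and `generate_anchors` is an illustrative helper name.

```python
import math


def generate_anchors(feature_height, feature_width, stride, scales, ratios):
    """Return anchor boxes (x_min, y_min, x_max, y_max) in image coordinates.

    One anchor per (scale, ratio) pair is centred on every feature-map cell.
    `ratio` is height / width, and each anchor keeps an area of scale ** 2.
    """
    anchors = []
    for row in range(feature_height):
        for col in range(feature_width):
            centre_x = (col + 0.5) * stride
            centre_y = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    width = scale / math.sqrt(ratio)
                    height = scale * math.sqrt(ratio)
                    anchors.append((
                        centre_x - width / 2,
                        centre_y - height / 2,
                        centre_x + width / 2,
                        centre_y + height / 2,
                    ))
    return anchors


# Hypothetical 2 x 3 feature map with stride 16, two scales and three ratios.
anchors = generate_anchors(2, 3, stride=16, scales=(64, 128), ratios=(0.5, 1.0, 2.0))
```

The RPN would then score each anchor and regress offsets to turn it into a candidate box; a double-RPN design would repeat this on feature maps of different resolution.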

4 Experimental Evaluation
Tests were carried out on an Ubuntu 16.04.2 system with an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20 GHz × 48. Computation is accelerated by an NVIDIA Tesla P100 PCI-E GPU with 16 GB of memory and 3,584 CUDA cores. The core frequency can reach 1328 MHz, and the single-precision floating-point performance is 9.3 trillion floating-point operations per second. Caffe, a deep learning framework, was used to implement the proposed Improved WO-FRCNN model. The conventional one-stage method SSD and two mainstream detection algorithms were used to compare the contribution of the different detection techniques. Table 1 summarizes the results of the study. Mean average precision (mAP) is a standard index for evaluating object detection systems. Among the two-stage methods, with a similar input size of 500 × 500, the proposed Improved WO-FRCNN model obtains a good precision of 96.31% mAP, and its detection rate in all categories is greater than that of the existing detectors based on Faster R-CNN.

4.1 Prediction of Accuracy and Speed
Table 2 shows the accuracy results for every recognition network in the tests. The recognition performance of the ResNet and Inception networks, as shown in Fig. 5, was satisfactory, inspiring us to create INSE-ResNet by leveraging their capabilities. Figure 6 shows the detection of disease using the proposed system.


Fig. 4 Double RPN in proposed system

Table 1 Outcomes for various CNN models in terms of detection

Method             Feature extractor  Input  Iterations (k)  Black rot  Black measles  Leaf blight  Leaf mites  mAP (%)
SSD                VGG 16             513    120             74.9       81.8           72.2         79          76.8
R-FCN              ResNet 50          500    200             79.2       82.6           69.3         69.5        74.9
Faster R-CNN       ZF                 500    200             63.9       75.6           59.5         69.7        68.1
Faster R-CNN       VGG 16             500    200             64.7       81.1           60.2         70.5        68.9
Faster R-CNN       ResNet 50          500    200             64.6       79.2           60.9         70.4        68.9
Faster R-CNN       ResNet 34          500    200             69.5       81.6           64.6         70.9        71.7
Faster R-CNN       ResNet 18          500    200             65.9       75.2           64.6         73.7        69.9
Faster R-CNN       INSE-ResNet        500    280             74.2       85.4           71.2         84.3        78.6
Improved WO-FRCNN  INSE-ResNet        500    280             82.9       88.2           73.9         96.5        86.3

Table 2 The performance measures

Pre-network model    Input size  Recognition accuracy (%)
VGG16                224 × 224   98.52
GoogLeNet            224 × 224   98.93
ResNet18             224 × 224   98.95
ResNet34             224 × 224   98.42
ResNet50             224 × 224   97.03
ResNet101            224 × 224   88.65
Inception-ResNet v2  224 × 224   99.41
INSE-ResNet          224 × 224   99.52

Fig. 5 Accuracy curve for pre-networks models

Fig. 6 Detection of disease


5 Conclusion
By incorporating SE-blocks, the Inception-v1 module, and the Inception-ResNet-v2 module, the proposed Improved WO-FRCNN detector increased the detection accuracy for multi-scale and tiny diseased spots. The deep-learning-based detection method was implemented on the Caffe platform on a GPU architecture. Improved WO-FRCNN achieved a detection accuracy of 96.31% mAP and a speed of 15.01 frames per second. The findings demonstrate that the proposed Improved WO-FRCNN approach can efficiently and precisely detect four major plant diseases, making it a viable alternative for real-time plant disease detection.

References
1. Wang J, Yu L, Yang J, Dong H (2021) DBA_SSD: a novel end-to-end object detection algorithm applied to plant disease detection. Information 12(11):474
2. Sun X, Gu J, Huang R, Zou R, Giron Palomares B (2019) Surface defects recognition of wheel hub based on improved faster R-CNN. Electronics 8(5):481
3. Singh A, SV, HJ, Aishwarya D, Jayasree JS (2022, January) Plant disease detection and diagnosis using deep learning. In: 2022 International conference for advancement in technology (ICONAT). IEEE, pp 1–6
4. Devi Priya R, Devisurya V, Anitha N, Geetha B, Kirithika RV (2021, December) Faster R-CNN with augmentation for efficient cotton leaf disease detection. In: International conference on hybrid intelligent systems. Springer, Cham, pp 140–148
5. Ozguven MM, Adem K (2019) Automatic detection and classification of leaf spot disease in sugar beet using deep learning algorithms. Phys A 535:122537
6. David HE, Ramalakshmi K, Gunasekaran H, Venkatesan R (2021, March) Literature review of disease detection in tomato leaf using deep learning techniques. In: 2021 7th International conference on advanced computing and communication systems (ICACCS), vol 1. IEEE, pp 274–278
7. Prabu M, Chelliah BJ (2022) Mango leaf disease identification and classification using a CNN architecture optimized by crossover-based levy flight distribution algorithm. Neural Comput Appl 34(9):7311–7324
8. Wang Y, Liu M, Zheng P, Yang H, Zou J (2020) A smart surface inspection system using faster R-CNN in a cloud-edge computing environment. Adv Eng Inform 43:101037
9. Mohan HM, Rao PV, Kumara HC, Manasa S (2021) A non-invasive technique for real-time myocardial infarction detection using faster R-CNN. Multimedia Tools Appl 80(17):26939–26967
10. Sethy PK, Barpanda NK, Rath AK, Behera SK (2020) Rice false smut detection based on faster R-CNN. Indonesian J Electr Eng Comput Sci 19(3):1590–1595
11. Jadhav S, Garg B (2022) Comprehensive review on machine learning for plant disease identification and classification with image processing. In: Proceedings of international conference on intelligent cyber-physical systems. Springer, Singapore, pp 247–262
12. Bai T, Yang J, Xu G, Yao D (2021) An optimized railway fastener detection method based on modified faster R-CNN. Measurement 182:109742
13. Fang F, Li L, Zhu H, Lim JH (2019) Combining faster R-CNN and model-driven clustering for elongated object detection. IEEE Trans Image Process 29:2052–2065
14. Rehman ZU, Khan MA, Ahmed F, Damaševičius R, Naqvi SR, Nisar W, Javed K (2021) Recognizing apple leaf diseases using a novel parallel real-time processing framework based on MASK RCNN and transfer learning: an application for smart agriculture. IET Image Proc 15(10):2157–2168
15. Jin S, Su Y, Gao S, Wu F, Hu T, Liu J, Guo Q (2018) Deep learning: individual maize segmentation from terrestrial lidar data using faster R-CNN and regional growth algorithms. Front Plant Sci 9:866
16. Prakash V, Raghav S, Singh S, Sood S, Aggarwal AK, Pandian MT (2022, January) A comparative study of various techniques for crop disease detection and segmentation. In: 2022 4th International conference on smart systems and inventive technology (ICSSIT). IEEE, pp 1580–1587

An Efficient CatBoost Classifier Approach to Detect Intrusions in MQTT Protocol for Internet of Things

P. M. Vijayan and S. Sundar

Abstract Recent advancements in Internet of Things (IoT) infrastructures have led to a rise in undesirable issues specific to network security. As the number of IoT devices connected to the network rises daily, the network becomes more vulnerable to cyber-attacks. Hence, an intrusion detection system (IDS) is vital for automatically detecting the type of cyber-attack in a time-bound manner. Moreover, the network often uses the MQTT protocol for communication among IoT devices. This work proposes the CatBoost algorithm, a machine learning (ML) algorithm, to classify given traffic as SlowITe, Malformed, Brute force, Flood, DoS, or Legitimate. The algorithm is trained on a publicly available MQTT network dataset after balancing the dataset. Despite the significant disparity in the number of labeled records for each dataset class, the algorithm achieves state-of-the-art performance. The test results suggest that the algorithm can classify the type of attack with an accuracy of 94% within 78.45 s on the balanced dataset.

Keywords Machine learning · IoT · MQTT dataset · Detection system

1 Introduction
New technology in the IoT environment is exposed to attacks, hence the need to secure the IoT context. The security of IoT devices has recently been a major worry, particularly in the healthcare arena, where recent attacks have revealed catastrophic gaps in IoT security. Conventional network security solutions are well established. However, traditional security processes cannot be used directly to defend IoT devices and networks from cyber-attacks because of the resource restrictions of IoT devices and the unusual behavior of IoT protocols. As a result, depending on where the attack happens, IoT can be attacked in a variety of ways. With physical attacks, the attacker has physical access to the device and can thus damage it or physically manipulate it. The IoT refers to the ability of non-computer equipment such as sensors or actuators to connect with one another without the need for human involvement in order to generate, exchange, and utilize data [1, 2]. IoT protocols such as Advanced Message Queuing Protocol (AMQP), Constrained Application Protocol (CoAP), Extensible Messaging and Presence Protocol (XMPP), and Message Queuing Telemetry Transport (MQTT) have been designed to provide a secure and dependable flow of data across IoT nodes [2]. In terms of connectivity, these protocols enable similar functions to some extent [3]; however, they differ in the extent to which these functionalities are provided. MQTT has been widely used in a variety of applications, including smart homes, agricultural IoT, and industrial applications. Support for communication over low bandwidths, low memory needs, and reduced packet loss are just a few of the reasons [4–6]. In MQTT, the central server is known as the broker, and it receives messages from the clients, which are effectively all the nodes involved in the communication process [7]. The information may take the form of a publish or subscribe topic [8]. MQTT is the most extensively used protocol in IoT [7]. As a result, security threats in IoT domains that use MQTT must be identified, hence the need for attack detection systems based on machine learning. Machine learning (ML) has proven useful in a wide range of applications, including IDS systems for IoT [9, 10]. Some researchers believe that, when properly trained with model data, ML has the potential not only to identify but also to forecast attacks. As a result, in this paper, we offer an IDS system for the MQTT protocol that is based on the ML approach. There are three categories of ML: supervised learning, unsupervised learning, and reinforcement learning. In supervised learning, the desired output of the model is known, even though the intrinsic relationships of the data are unknown.

P. M. Vijayan · S. Sundar (B), Vellore Institute of Technology, Vellore, Tamil Nadu, India. e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_18

2 Related Work
Cyber-threat protection in IoT network systems is an open research challenge, because new threats aimed at such platforms keep emerging and affect IoT systems considerably [11]. Such novel threats arise from sensors, protocols, and network datasets [12–14]. Hence, to guard against these attacks, we use detection systems based on ML.

In 2022, Makhija et al. [15] applied different machine learning models, such as Random Forest (RF), the KNN classifier, and SVM, to MQTT-based IoT systems to estimate detection performance on the attacked dataset. The evaluation parameters for comparing the models' performance were precision, accuracy, and F measure. The results revealed that Random Forest was exceptionally accurate, with a 96% accuracy rate. It has a high level of precision, but it detects only one type of attack.

In 2021, Khan et al. [16] reported a DNN-based model that achieved 99.92, 99.75, and 94.94% accuracy for Uni-flow, Bi-flow, and Packet-flow features, respectively, for the first dataset and binary classification. These accuracies decreased to 97.08%, 98.12%, and 90.79%, respectively, in multi-label classification. On the second dataset, however, the suggested DNN model had the greatest accuracy, 97.13%, when compared to LSTM and GRUs. It was not analyzed against different attacks.

In 2022, Dissanayake [17] noted that antivirus software employs a signature-based strategy for malware detection. A signature is a brief byte pattern that can be used to recognize known viruses. The signature-based detection method, on the other hand, cannot protect against zero-day attacks. Furthermore, malware production toolkits such as Zeus can generate thousands of variations of the same infection by employing various obfuscation techniques. With the present rate of malware growth, signature generation, which is frequently a human-driven operation, will inevitably become infeasible.

In 2019, Hari Priya and Kulothungan [18] proposed Secure-MQTT, a lightweight fuzzy-logic-based IDS for detecting malicious activity during IoT device communication. The proposed solution uses a fuzzy logic system with a fuzzy rule interpolation mechanism to detect a node's hostile behavior. Secure-MQTT avoids a dense rule base by using fuzzy rule interpolation, which generates rules dynamically. The proposed method provides an efficient mechanism for defending low-configuration devices against DoS attacks. Compared to existing methods, the simulation results suggest that the proposed method detects attacks with greater accuracy. It was not analyzed against different attacks, and it does not use an optimization algorithm.

3 Message Queuing Telemetry Transport (MQTT)
MQTT is a messaging protocol that was developed to aid communication between smart devices that use IoT. MQTT is built on top of the Transmission Control Protocol (TCP) to facilitate these connections between IoT devices. Furthermore, the MQTT protocol is particularly beneficial for devices with unreliable networks and low bandwidths. MQTT has become one of the most widely used messaging protocols. For example, MQTT is used by websites such as Facebook that allow their users to communicate with each other. MQTT uses the publish-subscribe (pub-sub) pattern for communication [19, 20]. For the pub-sub system to work, three components must be present: publishers, brokers, and subscribers, as depicted in Fig. 1. Publishers have data, and their job is to disseminate this data to broker-managed topics. Subscribers are subscribed to the topics inside the broker. Data is transmitted from the publisher to the broker, and the broker subsequently sends the data to the subscribers. This means that the publisher and subscriber are unaware of each other, and the broker serves as a conduit between them [4]. Brokers are an essential component of the MQTT protocol, and their purpose is to ensure that the pub-sub method is functioning properly [21, 22]. To accomplish this, the broker must, for example, ensure that clients can accept messages and that a client can subscribe to and unsubscribe from a topic at any time.

Fig. 1 MQTT basic diagram
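The publisher, broker, and subscriber roles described above can be sketched with a toy in-memory broker. This is only an illustration of the pub-sub pattern: it does not implement the MQTT wire protocol, networking, QoS levels, or topic wildcards, and the class `Broker`, the callback `on_message`, and the topic names are hypothetical.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker that routes payloads by exact topic match."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a subscriber callback for a topic."""
        self._subscribers[topic].append(callback)

    def unsubscribe(self, topic, callback):
        """Remove a subscriber callback; unknown callbacks are ignored."""
        if callback in self._subscribers[topic]:
            self._subscribers[topic].remove(callback)

    def publish(self, topic, payload):
        """Deliver a payload to every current subscriber of the topic."""
        for callback in list(self._subscribers[topic]):
            callback(topic, payload)


received = []


def on_message(topic, payload):
    """Subscriber side: record what the broker delivers."""
    received.append((topic, payload))


broker = Broker()
broker.subscribe("home/temperature", on_message)
broker.publish("home/temperature", "21.5")  # delivered to the subscriber
broker.publish("home/humidity", "40")       # nobody subscribed, so dropped
broker.unsubscribe("home/temperature", on_message)
broker.publish("home/temperature", "22.0")  # no longer delivered
```

The publisher never references the subscriber directly; both only know the broker and the topic name, which is the decoupling the section describes.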

4 Proposed Method for CatBoost Classifier
The proposed model is evaluated on multi-class attack classification. As a result, two distinct activation functions are applied at the output layer. For multi-attack classification, the CatBoost classifier is efficient at handling categorical features [23]. CatBoost is a gradient boosting method that supports binary and multi-class prediction. It is a machine learning method that was recently open-sourced, and it interfaces easily with deep learning frameworks such as TensorFlow and Apple's Core ML. It is also the most accurate in its class. The name "CatBoost" comes from the two words "Category" and "Boosting." As previously indicated, the library is capable of handling a wide range of data kinds, including audio, text, image, and historical information. The term "Boost" is derived from the gradient boosting ML algorithm, since the library is based on gradient boosting. In the proposed CatBoost classifier, we set the learning rate to 0.083513, search the depth over 4–10 and the number of iterations over 10–50, sample the learning rate with sp_randFloat(), and fix the random state at 18; in this way we obtain better accuracy than the remaining approaches, using the ReLU and Sigmoid activation functions. The diagram of the proposed CatBoost intrusion detection system is shown in Fig. 2.

4.1 Research Methodology
The proposed model is carried out in several stages: (a) data collection, (b) data pre-processing, (c) machine learning models, (d) prediction, and (e) accuracy calculation. The IoT data is initially gathered from the conventional MQTT dataset. The obtained data is subjected to a pre-processing phase, which includes data cleaning and data balancing techniques. The pre-processed data is then used in several machine learning models, and the predicted output is assessed using the standard equations of the evaluation metrics. A diagrammatic illustration of the multi-class intrusion detection system in the Internet of Things is also shown.

Fig. 2 Diagrammatic representation of proposed intrusion detection system in IoT

Validation of datasets
This work uses the MQTT dataset, a publicly available dataset for detection purposes that was generated using sensor networks. As previously stated, the MQTT dataset contains IoT network traffic, specifically MQTT communications. To validate the MQTT dataset, we created an IDS that was then applied to the dataset, which mixes legitimate MQTT data traffic with various cyber-attacks targeting the network's MQTT broker; the MQTT dataset therefore includes both legitimate and malicious data traffic. Following that, the datasets pertaining to the legitimate and malicious scenarios were combined and used to train and test our algorithms, thereby validating the use of MQTT datasets to test and deploy a novel IDS technique. This work investigated the following methods for intrusion detection system validation: Neural Network (NN), Random Forest (RF), Naive Bayes (NB), Decision Tree (DT), Gradient Boost (GB), Multilayer Perceptron (MLP), and the CatBoost (CB) algorithm. In each scenario, a data pre-processing phase is performed with the goal of obtaining the attributes capable of characterizing abnormal, and thus attack, traffic or connections. To put the chosen intrusion detection methods to the test, the dataset is split into two parts, training and testing: 70% of the traffic data records are used for training and 30% for testing. As with other similar systems, the test step is carried out after training is completed. The accuracy and execution time of each of the selected algorithms are reported.

Pseudocode
1. Import scikit-learn
2. Import the dataset
3. Pre-process the data to handle missing values, clean the dataset, and check for zero values (data analysis, SMOTE, data cleaning, feature selection)
4. Scale the data
5. Save the different machine learning models in a variable, models
6. Set name as the model's name
7. On the balanced dataset, set the parameters 'depth': sp_randint(4,10), 'learning_rate': sp_randfloat(), 'iterations': sp_randint(10,50) and run randomized search CV (model, parameters)
8. For each name: store the model-selection value using 10 splits in a variable; calculate and store the result using the cross-validation score technique of model selection in sklearn; append the result to the list of existing results; print the mean accuracy
9. End for
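Two steps of this workflow, the 70/30 split and the random sampling of CatBoost hyperparameters, can be sketched in plain Python. This is a minimal sketch under assumptions: the split is stratified by class, the upper bounds of the integer ranges are treated as exclusive, the learning rate is drawn uniformly from [0, 1), and the helper names `stratified_split` and `sample_params` are hypothetical. In practice the sampled space would be handed to a randomized search over the classifier.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=18):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


def sample_params(rng):
    """Draw one candidate from the search space listed in step 7."""
    return {
        "depth": rng.randrange(4, 10),        # integer in [4, 10)
        "learning_rate": rng.random(),        # float in [0, 1)
        "iterations": rng.randrange(10, 50),  # integer in [10, 50)
    }


# Hypothetical labels: ten records for each of three traffic classes.
labels = ["legitimate"] * 10 + ["flood"] * 10 + ["dos"] * 10
train_idx, test_idx = stratified_split(labels)
candidates = [sample_params(random.Random(18 + i)) for i in range(5)]
```

Each candidate dictionary would be scored with 10-fold cross-validation on the training indices, and the best-scoring one kept for the final evaluation on the held-out 30%.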

Evaluation Metrics
Understanding model performance requires evaluation metrics computed against ground truth values, which quantify how well the model separates attacks from normal data. Several evaluation metrics can be used to analyze an IDS's performance. In the equations below, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives.

Accuracy
Accuracy measures how often the classifier predicts correctly. It is the ratio of the number of correct predictions to the total number of predictions.

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{1} \]

Precision
Precision is the fraction of the cases predicted as positive that are actually positive. Precision is useful when false positives are more of a concern than false negatives. It is essential in recommendation systems, e-commerce websites, and other settings where inaccurate results can lead to customer churn, which can be costly to a firm.

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]

Recall (Sensitivity)
Recall is the fraction of the actual positive cases that the model predicts correctly. It is a useful indicator when false negatives are more of a concern than false positives.

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3} \]

F measure
The F measure summarizes the precision and recall measures as their harmonic mean [16]. For a given sum of precision and recall, it is greatest when the two are equal.

\[ F\text{-measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4} \]
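Equations (1) to (4) translate directly into code. The following is a minimal sketch; the function name `classification_metrics` and the example counts are illustrative assumptions and do not come from the paper's experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F measure from the four counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)             # Eq. (1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0       # Eq. (2)
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # Eq. (3)
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)  # Eq. (4)
    else:
        f_measure = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Hypothetical counts for a single class treated as "positive".
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The guards against empty denominators are a design choice of this sketch; they return 0.0 when a class is never predicted or never present, rather than raising an error.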

5 Results and Discussion
This section reports the results of the CatBoost approach, which was trained and tested on the MQTT dataset. The multi-class intrusion detection system is shown in Fig. 3. The training and validation results obtained on the MQTT data are as follows.

Fig. 3 Diagrammatic representation of multi-class intrusion detection system in IoT


All approaches were tested on the same host (a 2.5 GHz quad-core Intel Core i7), which keeps the tests and results consistent. In the prior study [24], the imbalanced dataset yielded high accuracies across the traffic classes, i.e., SlowITe, Malformed, Bruteforce, Flood, DoS, and Legitimate: Neural Network (NN) achieved 0.9932683, Random Forest (RF) 0.9942991, Naïve Bayes (NB) 0.9897062, Decision Tree (DT) 0.985021, Gradient Boost (GB) 0.991639, and Multilayer Perceptron (MLP) 0.9468814 [24]. In the proposed work, the CatBoost algorithm outperforms these methods with an accuracy of 0.995134. For attacks on IoT networks, our results indicate that ensemble approaches offer higher accuracy and lower loss than linear models. Multi-class classification problems are harder than binary problems, which makes good results more difficult to achieve. For classification, both studies obtained their best metrics with ensemble models. Although this paper addressed the imbalance, significant inequalities between classes remain. This may have had a negative impact on the accuracy of some of our models; thus, the analysis also needs to use a balanced dataset and focus on the true positive findings. By taking the sequencing of the problem into consideration, the model was able to retain a decent result.

5.1 Imbalanced Dataset
Looking at the outcomes in detail, NN achieved an accuracy of 0.9932683 with an F measure of 0.993246, RF achieved 0.9942991 with an F measure of 0.9943007, NB achieved 0.9897062 with an F measure of 0.9897062, DT achieved 0.985021 with an F measure of 0.985021, GB achieved 0.991639 with an F measure of 0.991639, and MLP achieved 0.9468814 with an F measure of 0.963694 [24]. Finally, the proposed CatBoost classifier obtained an accuracy of 0.995134 and an F measure of 0.982122. The imbalanced traffic dataset is summarized in Table 1. The confusion matrix is also calculated and reported in order to examine the data more closely.
Table 1 Imbalanced dataset

S No   Classes     Number of bytes   Percentage (%)
1      Class = 0   10,150            4.382
2      Class = 1   91,156            39.351
3      Class = 2   429               0.185
4      Class = 3   115,824           50.000
5      Class = 4   7646              3.301
6      Class = 5   6441              2.781
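The percentage column of Table 1 follows from the per-class counts. A short sketch, assuming only the counts listed in the table (the variable names are illustrative), recomputes the shares and the ratio between the largest and smallest class:

```python
# Per-class counts transcribed from Table 1 (class label -> count).
class_counts = {0: 10150, 1: 91156, 2: 429, 3: 115824, 4: 7646, 5: 6441}

total = sum(class_counts.values())
percentages = {label: round(100 * count / total, 3)
               for label, count in class_counts.items()}

# Ratio between the largest and the smallest class, a simple imbalance indicator.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, share in percentages.items():
    print(f"Class = {label}: {class_counts[label]:>7} ({share:.3f} %)")
print(f"total = {total}, largest/smallest = {imbalance_ratio:.1f}")
```

The ratio makes the degree of imbalance explicit: the largest class has well over two hundred times as many entries as the smallest one.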


Table 2 Results for imbalanced dataset
ML algorithm   Accuracy   F measure   Training time (s)   Testing time (s)
NN             0.993268   0.99324     262.850             74.205
RF             0.994299   0.994300    1375.648            35.872
NB             0.98790    0.989706    45.0247             7.144
DT             0.977972   0.98502     88.713              1.293
GB             0.991131   0.99163     1584.301            10.626
MLP            0.946881   0.96369     3024.188            18.438
CatBoost       0.995134   0.982122    6578.32             283.45
Note: Bold marks the CatBoost results; compared with the remaining classifiers, CatBoost achieved the highest accuracy on both the imbalanced and the balanced dataset.

Table 2 reports the performance metrics for the imbalanced dataset: accuracy, F measure, and processing times. The samples of class = 0 to class = 5 are distinct from one another. The data from every class are fed into each machine learning algorithm, with the CatBoost method achieving the highest accuracy. Figure 4 shows the imbalanced dataset for all classes; the X-axis gives the traffic class and the Y-axis the number of bytes of network data.

Fig. 4 Imbalanced network dataset

Balanced dataset
Because the total size of the legitimate traffic was far greater than that of the malicious traffic, this work resized the individual MQTT datasets to build a more balanced dataset. By replicating each threat, we matched the amount of malicious traffic to the legitimate traffic, resulting in a final size of the same order as the dataset of the legitimate scenario. The balanced dataset for all classes is given in Table 3; each class from class = 0 to class = 5 contains 12,000 bytes. Then, for all classes, common samples were chosen, and the training data were fed into the various machine learning algorithms, with the CatBoost algorithm providing the best accuracy. Figure 5 shows the balanced dataset for all classes; the X-axis gives the traffic class and the Y-axis the number of bytes of network data.

Table 3 Balanced dataset
S No   Classes     Number of bytes   Percentage (%)
1      Class = 0   12,000            16.667
2      Class = 1   12,000            16.667
3      Class = 2   12,000            16.667
4      Class = 3   12,000            16.667
5      Class = 4   12,000            16.667
6      Class = 5   12,000            16.667

Fig. 5 Balanced network dataset

Table 4 reports the performance metrics for the balanced dataset: accuracy, F measure, and processing times. Comparing Tables 2 and 4, the accuracy and F measure differ noticeably. On the balanced dataset, the CatBoost method has the highest accuracy, 0.941201, with an F measure of 0.940122. A detailed comparison of the confusion matrices of the balanced and imbalanced datasets shows that NN accurately classifies legitimate traffic, RF recognizes flood and malicious traffic, and NB correctly classifies bruteforce. All algorithms, in turn, are capable of identifying the SlowITe attack. Because the dataset is balanced, balanced tests can be performed on it, and the results can be considered more reliable. Table 5 gives the confusion matrix of the CatBoost classifier for all attack types. The algorithms nevertheless show certain shortcomings in attack detection, as the classification process is occasionally unable to identify the traffic correctly.
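The balancing step described above (replicating the under-represented classes until every class has the same size) can be sketched with plain random resampling. This is an assumption-laden illustration rather than the paper's procedure: the pseudocode mentions SMOTE, which synthesizes new samples instead of copying existing ones, and the function name `balance_by_resampling`, the toy records, and the target size of 12 per class are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_resampling(records, target_per_class, seed=0):
    """Return a dataset with exactly target_per_class records per label.

    Smaller classes are enlarged by replicating randomly chosen records;
    larger classes are reduced by sampling without replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in records:
        by_class[label].append((features, label))

    balanced = []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) >= target_per_class:
            chosen = rng.sample(members, target_per_class)
        else:
            extra = rng.choices(members, k=target_per_class - len(members))
            chosen = members + extra
        balanced.extend(chosen)
    rng.shuffle(balanced)
    return balanced


# Toy imbalanced data: 50, 5, and 20 records for labels 0, 1, and 2.
records = ([([i], 0) for i in range(50)]
           + [([i], 1) for i in range(5)]
           + [([i], 2) for i in range(20)])
balanced = balance_by_resampling(records, target_per_class=12)
print(Counter(label for _, label in balanced))
```

Plain replication adds no new information to the minority classes, which is one reason why synthetic oversampling is often preferred; resampling should also be applied to the training split only, so that replicated records do not leak into the test set.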


Table 4 Results for balanced MQTT dataset
ML algorithm   Accuracy    F measure   Training time (s)   Testing time (s)
NN             0.9044728   0.9023636   778.180             144.218
RF             0.9159708   0.9140355   2298.276            125.850
NB             0.643889    0.6872843   85.284              13.783
DT             0.9159608   0.9140241   148.811             2.303
GB             0.8795693   0.872704    8840.004            18.137
MLP            0.903852    0.9018922   5714.481            27.284
CatBoost       0.941201    0.940122    456.732             78.453
Note: Bold marks the CatBoost results; compared with the remaining classifiers, CatBoost achieved the highest accuracy on both the imbalanced and the balanced dataset.

Table 5 Confusion matrix of CatBoost algorithm (rows: actual class; columns: predicted class)
Actual \ Predicted   Bruteforce   DoS      Flood   Legitimate   Malformed   SlowITe
Bruteforce           3369         550      0       7            425         0
DoS                  200          35,580   0       3250         47          0
Flood                1            4        59      90           30          0
Legitimate           0            3171     0       46,468       0           0
Malformed            1030         297      0       472          1479        0
SlowITe              0            0        0       0            0           2761
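Per-class precision, recall, and F measure can be derived from a multi-class confusion matrix by treating each class in turn as the positive class. The sketch below uses the values of Table 5, with rows read as actual classes and columns as predicted classes (the orientation assumed here from the table header); the function name `per_class_metrics` is illustrative.

```python
labels = ["Bruteforce", "DoS", "Flood", "Legitimate", "Malformed", "SlowITe"]

# Rows: actual class, columns: predicted class (values transcribed from Table 5).
confusion = [
    [3369, 550, 0, 7, 425, 0],
    [200, 35580, 0, 3250, 47, 0],
    [1, 4, 59, 90, 30, 0],
    [0, 3171, 0, 46468, 0, 0],
    [1030, 297, 0, 472, 1479, 0],
    [0, 0, 0, 0, 0, 2761],
]


def per_class_metrics(matrix):
    """Return a list of (precision, recall, f_measure), one tuple per class."""
    n = len(matrix)
    results = []
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                         # actual k, predicted other
        fp = sum(matrix[r][k] for r in range(n)) - tp    # predicted k, actual other
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if (precision + recall) else 0.0)
        results.append((precision, recall, f_measure))
    return results


class_metrics = per_class_metrics(confusion)
for name, (precision, recall, f_measure) in zip(labels, class_metrics):
    print(f"{name:<11} precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```

Reading the matrix this way shows which classes are confused with each other, for instance Flood records predicted as Legitimate, which a single overall accuracy figure does not reveal.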

MQTT Dataset
This work used a public Kaggle dataset built on attacks against the MQTT protocol for IoT systems. Using ML techniques, the model in this paper identifies MQTT attacks on IoT systems, based on a proprietary attack model. The dataset is available at https://www.kaggle.com/cnrieiit/mqttset

6 Conclusion
This work presents models for detecting attacks in IoT environments, which can be used as an IDS pattern for IoT. Machine learning techniques like these can be used effectively in cybersecurity to protect against harmful attacks. Cybersecurity is a challenging task, however, and on the supplied dataset the CatBoost classifier model provided the highest accuracy among the models compared. This work used state-of-the-art methods in building these models, such as cross-validation and feature selection. In conclusion, the CatBoost classifier model achieves the highest accuracy, 94%, with a testing time of 78.45 s on the balanced dataset.


References
1. Minerva R, Biru A, Rotondi D (2015) Towards a definition of the internet of things (IoT). IEEE Internet Initiat 1–86
2. Al-Masri E, Kalyanam KR, Batts J, Kim J, Singh S, Vo T, Yan C (2020) Investigating messaging protocols for the internet of things (IoT). IEEE Access 8:94880–94911. https://doi.org/10.1109/ACCESS.2020.2993363
3. Stolojescu-crisan C, Crisan C, Butunoi B (2021) An IoT-based smart home automation system. 1–23
4. Safaei B, Monazzah AMH, Bafroei MB, Ejlali A (2017) Reliability side-effects in internet of things application layer protocols. 2017 2nd Int Conf Syst Reliab Saf 207–212
5. Soni D, Makwana A (2017) A Survey on Mqtt: a protocol of internet of things (IoT). Int Conf Telecommun Power Anal Comput Tech (Ictpact–2017) 0–5
6. Hunkeler U, Truong HL, Stanford-clark A MQTT-S–A publish/subscribe protocol for wireless sensor networks
7. Niruntasukrat A, Issariyapat C, Pongpaibool P, Meesublak K, Aiumsupucgul P, Panya A (2016) Authorization mechanism for MQTT-based internet of things. 2016 IEEE Int Conf Commun Work 290–295
8. Dorsemaine B, Gaulier J-P, Wary J-P, Kheir N, Urien P (2016) A new approach to investigate IoT threats based on a four layer model. In: Proceedings of the 2016 13th international conference on new technologies for distributed systems (NOTERE), pp 1–6
9. Mahdavinejad MS, Rezvan M, Barekatain M, Adibi P, Barnaghi P, Sheth AP (2018) Machine learning for internet of things data analysis: a survey. Digit Commun Networks 4:161–175. https://doi.org/10.1016/j.dcan.2017.10.002
10. da Costa KAP, Papa JP, Lisboa CO, Munoz R, de Albuquerque VHC (2019) Internet of things: a survey on machine learning-based intrusion detection approaches. Comput Networks 151:147–157. https://doi.org/10.1016/j.comnet.2019.01.023
11. Vaccari I, Cambiaso E, Aiello M (2019) Evaluating security of low-power internet of things networks. Univ Bahrain Sci J 2210–142X
12. Vaccari I, Aiello M, Cambiaso E (2020) SlowITe, a novel denial of service attack affecting MQTT. Sensors 20. https://doi.org/10.3390/s20102932
13. Vaccari I, Cambiaso E, Aiello M (2017) Remotely exploiting AT command attacks on ZigBee networks. Secur Commun Networks 2017:1723658. https://doi.org/10.1155/2017/1723658
14. Vaccari I, Aiello M, Cambiaso E (2020) Innovative protection system against remote AT command attacks on ZigBee networks. Comput Sci 2:2–8
15. Makhija J, Shetty AA, Bangera A (2022) Classification of attacks on MQTT-based IoT system using machine learning techniques. In: Proceedings, international conference innovation computer communication, pp 217–224
16. Khan MA, Khan MA, Jan SU, Ahmad J, Jamal SS, Shah AA, Pitropakis N, Buchanan WJ (2021) A deep learning-based intrusion detection system for Mqtt enabled Iot. Sensors 21:1–25. https://doi.org/10.3390/s21217016
17. Dissanayake MB (2022) Feature engineering for cyber-attack detection in Internet of Things. https://doi.org/10.5815/ijwmt.2021.06.05
18. Haripriya AP, Kulothungan K (2019) Secure-MQTT: an efficient fuzzy logic-based approach to detect DoS attack in MQTT protocol for internet of things. EURASIP J Wireless Commun Netw 2019(90)
19. Casteur G, Aubert A, Blondeau B, Clouet V, Quemat A, Pical V, Zitouni R (2020) Fuzzing attacks for vulnerability discovery within MQTT protocol. In: Proceedings of the 2020 international wireless communications and mobile computing (IWCMC), pp 420–425
20. Hwang HC, Park J, Shon JG (2016) Design and implementation of a reliable message transmission system based on MQTT protocol in IoT. Wirel Pers Commun 91:1765–1777. https://doi.org/10.1007/s11277-016-3398-2
21. Mishra B, Kertesz A (2020) The use of MQTT in M2M and IoT systems: a survey. IEEE Access 8:201071–201086. https://doi.org/10.1109/ACCESS.2020.3035849
22. Dinculeană D, Cheng X (2019) Vulnerabilities and limitations of MQTT protocol used between IoT devices. Appl Sci 9. https://doi.org/10.3390/app9050848
23. Ismail S, Khoei TT, Marsh R, Kaabouch N (2021) A comparative study of machine learning models for cyber-attacks detection in wireless sensor networks. In: Proceedings of the 2021 IEEE 12th annual ubiquitous computing, electronics mobile communication conference (UEMCON), pp 313–318
24. Vaccari I, Chiola G, Aiello M, Mongelli M, Cambiaso E (2020) MQTTset, a new dataset for machine learning techniques on MQTT. Sensors 20. https://doi.org/10.3390/s20226578

Self-regulatory Fault Forbearing and Recuperation Scheduling Model in Uncertain Cloud Context
K. Nivitha, P. Pabitha, and R. Praveen

Abstract Cloud computing has become indispensable with its offerings of multiple services, deployment, resource provisioning, data management, etc. Efficient resource consumption demonstrates the standing of a reliable cloud service provider in the market. However, the cloud environment often encounters uncertainties with respect to resource provisioning. The most significant uncertainty is a fault that paves the way for a failure of the whole process. Existing methods fail to take such uncertainty into account, leading to performance degradation of the cloud service requested by the cloud user. The proposed scheme, robust fault detection and recuperation (RFDR), is a reactive fault-tolerant strategy that monitors the system continuously and recovers it from failures caused by VMs, hosts, and PEs. The proposed mechanism is run with baseline scheduling algorithms, namely first come first serve (FCFS), shortest job first (SJF), priority scheduling (PS), min–min, and max–min scheduling, for performance evaluation. The experiments indicate that the proposed method handles fault detection and recovery efficiently, with 99% resource availability and reliability.
Keywords Cloud computing · Fault tolerance · Scheduling · Fault recovery · Availability · Reliability

K. Nivitha
Department of Information Technology, Rajalakshmi Engineering College, Anna University, Chennai, Tamil Nadu 602105, India
e-mail: [email protected]
P. Pabitha · R. Praveen (B)
Department of Computer Technology, Madras Institute of Technology Campus, Anna University, Chennai 600044, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_19


1 Introduction
Cloud computing is governed by the service level agreement (SLA) between cloud service providers and cloud consumers, on which the fulfilment of the cloud users' requirements heavily depends [1]. SLA [2] violations need to be avoided in order to deliver the quality of service (QoS) that cloud users require. Service requests are handled by various scheduling algorithms that allocate the required resources. The key purpose of a scheduling algorithm is to enhance overall performance through increased reliability and minimal execution cost. The probability of a fault occurring is higher at scheduling time; hence, fault tolerance should be given priority when developing the system. A system is said to be fault-tolerant when its operations continue to execute in spite of an uncertainty (failure) occurring in the system [3]. The proposed methodology concentrates on a detection and recovery mechanism that makes the cloud environment efficient in terms of availability and reliability.

1.1 Cloud Computing
Cloud computing is an extension of grid, parallel, and distributed computing that adopts virtualization and provides its users with on-demand access to computing resources such as data storage, memory, and processing power. The cloud computing paradigm has a self-oriented architecture that offers itself to users through data centers available on the Internet; the data centers consist of resources such as virtual machines (VMs), which are used to provide efficient performance [4]. The cloud computing domain is growing steadily and is widely preferred for its low-cost storage devices, its high-capacity networks and systems, and its acceptance of hardware virtualization as an integral feature [5]. It follows a pay-as-you-go model, in which cloud providers lease their resources to cloud consumers for a temporary period. If administrators adopt this model without being aware of the cloud pricing schemes, they may incur unexpected operating charges [6]. Aside from virtualization, the other notable features are scalability, high availability, elasticity, and dynamic scheduling.

1.2 Task Scheduling
Scheduling in the cloud is the efficient mapping of tasks to resources and is regarded as an NP-hard (non-deterministic polynomial-time hard) problem. It occurs at two levels in the cloud environment: the task level (allocation of user requests to VMs) and the virtual machine level (allocation of VMs to host machines) [7]. Even though several heuristic and metaheuristic approaches have been put forth, task scheduling in the cloud environment remains a major open research area [8]. The baseline scheduling algorithms considered in this work are:
• First Come First Serve (FCFS)
• Shortest Job First (SJF)
• Priority Scheduling
• Min–Min Scheduling
• Max–Min Scheduling
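To make the listed baselines concrete, the sketch below contrasts FCFS and SJF on a single machine and the min–min and max–min heuristics on two VMs. It follows the textbook formulations of these algorithms, not the paper's CloudSim implementation; the task lengths, VM speeds, and function names are hypothetical.

```python
def average_waiting_time(burst_times):
    """Average waiting time when tasks run back to back in the given order."""
    waited, elapsed = 0.0, 0.0
    for burst in burst_times:
        waited += elapsed
        elapsed += burst
    return waited / len(burst_times)


def min_min(task_lengths, vm_speeds, pick_largest=False):
    """Min-min mapping of tasks to VMs (max-min when pick_largest is True).

    In each round the earliest completion time of every unscheduled task is
    computed; min-min schedules the task with the smallest such time,
    max-min the task with the largest.
    """
    ready = [0.0] * len(vm_speeds)       # time at which each VM becomes free
    remaining = dict(task_lengths)       # task name -> length
    schedule = []
    while remaining:
        best = {name: min((ready[vm] + length / speed, vm)
                          for vm, speed in enumerate(vm_speeds))
                for name, length in remaining.items()}
        chooser = max if pick_largest else min
        name = chooser(best, key=lambda task: best[task][0])
        finish, vm = best[name]
        ready[vm] = finish
        schedule.append((name, vm, finish))
        del remaining[name]
    return schedule, max(ready)


bursts = [4, 1, 2, 3]                               # arrival order
fcfs_wait = average_waiting_time(bursts)            # FCFS keeps arrival order
sjf_wait = average_waiting_time(sorted(bursts))     # SJF runs shortest first

tasks = {"t1": 400, "t2": 100, "t3": 200, "t4": 300}  # hypothetical lengths
speeds = [100, 50]                                    # hypothetical VM speeds
_, min_min_makespan = min_min(tasks, speeds)
_, max_min_makespan = min_min(tasks, speeds, pick_largest=True)
print(fcfs_wait, sjf_wait, min_min_makespan, max_min_makespan)
```

On this toy input, max–min places the long task first and ends with a shorter makespan than min–min, which illustrates why the two heuristics are usually evaluated side by side; the outcome depends on the workload and is not a general ranking.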

1.3 Fault Tolerance
The faults that occur in a dynamic environment include application failures, virtual machine faults arising from virtualization, network problems, and hardware faults, in which the system fails to adapt to the changing cloud environment. Fault tolerance ensures that the system works efficiently even in the presence of uncertainty (faults): the system continues to function and to provide the required resources, which benefits overall performance. The aim is to find the shortest and fastest alternative way to fix a failure caused by faulty nodes; the virtual machines on a physical machine are usually referred to as nodes.

1.4 Types of Failures in Cloud Environment
A failure is a deviation of the actual result from the intended outcome, the opposite of success. Figure 1 shows how failures originate as a result of faults. The failure types identified in the cloud environment are as follows [7]:
• VM Failure
• Hardware Failure
• Application Failure

Fig. 1 Origination of failure

The main contributions of the proposed work are as follows:
• To provide a fault-tolerant system that detects faults and failures in the various components of the cloud, such as VMs, hosts, and processing elements, using a reactive policy.
• The uncertainties (faults) in the system are induced automatically by a fault injection mechanism: faults are induced in the PEs of a host according to a failure occurrence value, resulting in VM and host failures. These unexpected faults are detected by the proposed system through a continuous monitoring strategy that checks both the VMs and the hosts.
• The recovery method is implemented in such a way that no cloudlet is dropped from the execution process, ensuring that all submitted cloudlets complete. This is achieved by cloning the faulty VMs or by creating a new host to hold the newly generated VMs, which makes the system more reliable and available and thereby fault tolerant.

The rest of the paper is organized as follows. Section 2 gives a literature survey on topics such as uncertainty in the cloud, resource provisioning strategies, fault tolerance, detection methods and management techniques, and the challenges in existing systems. Section 3 describes the concept, working, and implementation of the proposed methodology. Section 4 reviews the proposed work as incorporated in algorithms such as FCFS, SJF, round robin, priority scheduling, and the max–min and min–min algorithms, and presents a comparison graph for every result generated. The conclusion is delivered in Sect. 5, which discusses what has been achieved in terms of enhancements and graphical analysis, followed by the possible future scope of the proposed system.

2 Literature Review
2.1 Cloud and Its Uncertainties
Singh et al. [9] compiled a QoS metric-based resource provisioning methodology that focuses mainly on the workload to be scheduled. The algorithm categorizes the workloads based on common patterns before the actual scheduling, which reduces the execution time and cost of the workloads, among other parameters. They also presented a detailed study of resource provisioning and scheduling methods and of their transformation from traditional methods to autonomic strategies. A comparative analysis of the various techniques is also given, which shows the advantages of autonomic cloud systems. The past research, current status, and future possibilities of cloud resource management are set out in [10]. Tchernykh et al. [9] have carried out extensive research on uncertainties in the cloud. Most of the uncertainties concern the user's perspective of QoS and the provider's actions, and the work focuses primarily on the effects of these uncertainties on privacy, confidentiality, availability, and reliability. Their effects on service and resource provisioning are also considered.

2.2 Resource Provisioning Strategies for Improving QoS
The framework in [11] uses reinforcement learning (RL) to assure QoS in an open, dynamic environment, where all clients are grouped according to two aspects, computing resource adaptation and service admission control, based on the RL model. It is the first study to integrate these aspects in order to increase the cloud provider's profit and avoid SLA violations [11]. Calheiros et al. [12] designed an auto-regressive integrated moving average (ARIMA) model that predicts the cloud workload for SaaS providers. Evaluated with real traces, the ARIMA-based prediction of future Web service requests significantly improved resource utilization and QoS. Chen discusses a self-adaptive model that dynamically considers inputs at run time and tunes accuracy using hybrid dual learners, which permit concurrent learning algorithms to select the optimal model for an efficient QoS function; the possible input space is also partitioned into two subspaces [13]. Homsi developed a prototype to validate the implemented methods and algorithm; the model proved efficient, performing well in terms of guaranteed QoS, power consumption, and resource demand [14]. Kamboj and Ghumman [15] suggested an algorithm for efficient load balancing that uses K-means clustering to schedule jobs in the cloud environment. Experiments with different parameter computations and different cloudlet and virtual machine configurations were conducted in the CloudSim environment, and the results show that the algorithm can outperform other job scheduling algorithms. Karamoozian et al. [16] implemented a resource assignment methodology that allocates cloud resources optimally for media services, based on a learning automata technique, while providing the QoS required by the applications. Three important factors are considered: total response time, level of uncertainty, and computational capability. The results show that the methodology can optimally map tasks to services while meeting the minimum QoS of the applications. Liu put forth four algorithms for improving the quality of virtual machine consolidation, in which QoS is guaranteed by using a selective repetitive method for VMs to save energy. A novel method using flexible reserved resources was proposed, and the model achieved improvements in VM migrations, energy consumption, and QoS guarantees [17]. Mireslami and Rakai presented a multi-objective, cost-effective, and runtime-friendly optimization algorithm that minimizes cost while meeting the QoS performance requirements, offering an optimal choice for deploying a Web application in the cloud environment [18]. Ahmed and Minjie proposed two RL-based approaches to deal with the uncertainties prevailing in open and dynamic cloud environments, resulting in a set of Pareto-optimal solutions that satisfy multiple QoS objectives with different user preferences [19]. Mugen Peng et al. explain the issues in underlying HetNets in improving spectral efficiency and energy efficiency when combined with energy harvesting in cloud computing [20]. Shrisha Rao concentrated on the principle of uncertainty and coalition formation for resource allocation, which significantly improved resource utilization and request satisfaction. The approach analyzes workloads by grouping them according to common patterns and then provisions the cloud workloads before the actual scheduling [21].

2.3 Fault Tolerance and Detection Techniques
AbdElfattah et al. put forth a model that uses replication and resubmission techniques to tolerate faults. Amin et al. [22] proposed an artificial neural network-based fault detection algorithm intended to close the gaps of previously implemented algorithms and provide a more effective fault-tolerant model [23]. The algorithm is a proactive fault-tolerance mechanism designed for dynamic clouds that uses an artificial neural network (ANN) for fault detection and also reports the detection time [22]. Ataallah et al. [24] surveyed fault-tolerance techniques overall and the popular FT approaches in the cloud environment suggested by researchers in the field [12]. The existing work [24, 25] covers the taxonomy of faults, errors, and failures, their causes, existing approaches to FT in cloud computing, the underlying topologies of data centers, and various miscellaneous cloud-based problems, and enumerates a few research directions [26]. Mittal D provides a better understanding of fault tolerance and gives a detailed report on FT techniques and some existing fault-tolerance models used for managing faults in the cloud [27]. Wang and Zhang proposed a fault diagnosis system that handles the problem of workload variations in huge, complex systems, which otherwise demands deep domain knowledge. The work monitors access behavior patterns with an incremental online clustering algorithm, thereby capturing the relationship between the pattern metrics and the workload, detecting unexpected changes in pattern correlation, and identifying the malicious metrics. The system's accuracy is evaluated by injecting faults into different Web applications, and the analysis shows that the faults are identified efficiently in the various physical layers [28, 29]. Ghani et al. [30] discussed the properties of fault detectors and proposed a self-tuning failure detector; they compared the proposed method with various other failure detection methods through control parameters and were able to maintain better performance. Kumar designed various fault detection and reduction approaches. To evaluate performance, faults are injected through the fault injection feature of the CloudSim simulation tool. Through strict monitoring, workloads are allocated only to VMs that the detection approaches consider healthy. Failed VMs are identified at periodic intervals and reported to the fault-tolerance manager. Jobs are migrated in the next phase, which concludes with the mitigation strategy. Kumar et al. [31] proposed a novel method for detecting faults in the cloud environment using unsupervised ML outlier detection techniques. The method uses the top three features and is called the fault detection system (FDS). The framework can be applied to multiple fault instances, which are collected and reported together as a batch, providing an efficient way to identify faults.

2.4 Fault Tolerance and Management Techniques
Devi proposed a multi-level fault-tolerance mechanism for the real-time cloud environment that handles faults efficiently at multiple levels. In the first level, trustworthy virtual machines are identified through a reliability assessment method; then, by analyzing data availability, a replication mechanism is used to achieve the highest reliability and availability. Jhawar and Piuri [32] presented an approach for analyzing fault-tolerance mechanisms. Considering factors such as failure behavior, network, and power distribution, the system uses virtualization technology to increase the reliability and availability of applications running in virtual machines. Jhawar et al. [25] introduced a comprehensive, modular, high-level system that allows users to specify and obtain the desired level of fault tolerance without prior knowledge of the FT techniques already available in the cloud. Mohammed et al. [33] proposed an optimal FT mechanism in which a model increases the reliability of each VM by handling faults and replacing the VM when its performance and outcome are not optimal. A series of experiments using Petri nets demonstrates the effectiveness and correctness of the suggested method. Modi implemented an automatic monitoring and management system with an ontology-based learning methodology and prediction-based service allocation. In this approach, monitoring is performed at the cloud broker, which acts as an intermediary between the client and the provider. When an SLA violation occurs, the client and the provider receive alerts, and the broker reschedules tasks to reduce further violations [34].

2.5 Challenges Identified
The existing mechanisms and methods concentrate only on a single VM failure and its associated processes. They assume either an initial single VM failure or a single VM failure during the execution of processes, and they try to find another VM to which the resources or processes of the failed VM can be allocated. The proposed approach differs considerably from the existing methodology. It concentrates on failing the processing elements of each host, so that an increasing number of failed processing elements leads to an increasing number of failed VMs or even host failures. The failure rate is set very high compared with the existing approaches. This introduces real complexity, because the model has to allocate all resources concurrently. This modular approach helps to overcome successive VM failures.

3 Proposed Methodology

The general architecture of the proposed methodology is given in Fig. 2, which conveys how a user submits a request to the cloud broker and how the request is served through the RFDR mechanism. The detailed architecture of RFDR is given in Fig. 3. Algorithm 1 continuously monitors the system, identifies the occurrence of faults in each VM, and triggers the recovery mechanism when a VM failure is recognized. In a dynamic cloud environment, error detection and recovery are the most critical features to be handled. The faults and failures considered here are mostly host failures or failures of one or several VMs. When a server or physical computer breaks down, all the virtual machines hosted on it break down as well. Crashes may occur because of inadequate software or hardware resources; causes include power failure, network failure, excessive load, overuse of RAM, CPU, or bandwidth, and mismatches between VM availability or performance and job requirements. The proposed methodology, the robust fault detection and recuperation (RFDR) algorithm, is represented in Fig. 3. The goal is to deal with the uncertainty (faults) that arises from unpredictable situations at runtime in the dynamic cloud environment, and to handle faults and reconfigure resources with little human intervention. Any uncertainty that arises in the system is handled through the recovery mechanism without breaching the SLA. Round robin is the default scheduling scheme in CloudSim; hence, it is ignored in the performance evaluation.

3.1 Experimental Setup and Dataset Description

A CloudSim simulation requires an initial number of hosts, virtual machines, and data centers. For the model to run, the host specifications must be given: bandwidth, millions of instructions per second (MIPS), storage capacity, number of processing elements, and random-access memory (RAM) in megabytes (MB). The CPU core unit is the processing element of the hosts which is

Self-regulatory Fault Forbearing and Recuperation Scheduling Model …

Fig. 2 General architecture of proposed methodology

Fig. 3 Architecture diagram of robust fault detection and recuperation (RFDR) model


represented using MIPS. Millions of instructions per second defines the number of instructions, that is, the cloudlet length, that a processing element can process in one second. Conventionally it is measured in millions, hence the name. In the same way, the VM characteristics (MIPS, RAM, bandwidth (BW), number of PEs, and size) are required, and the cloudlet requirements are also considered. With all these details initialized, the overall experiment is established. The cloudlets are defined by their length, the number of processing elements they require to run, their priority, and their input and output file sizes.

CloudSim is configured to run the proposed algorithm with one data center and two host machines. Each host has 4 processing elements, and each virtual machine uses 2 of its host's processing elements. This initializes the basic hardware and virtualization units. The cloudlets, which vary in length, are scheduled onto the virtual machines by the implemented scheduling algorithms. The algorithms used for work allocation are simple ones such as first come, first serve and priority scheduling. To test performance with this setup, a total of 10 cloudlets of varying length are allocated to 2 VMs configured on 2 host machines. As MEAN FAILURE PER HOUR is set to 0.005, one failure is expected every 200 h. The simulation therefore requires a dataset of cloudlets that run for at least 200 h, so 10 cloudlets of randomly generated length are used. Each cloudlet has parameters such as input file size, output file size, required number of processing elements, and priority. Each cloudlet requires 2 processing elements, and the priority alternates between the values 2 and 0.

Similarly, each virtual machine to be created has its own parameters: the number of processing elements, RAM, MIPS, bandwidth, and VM image size. They are assigned the values 2, 2048 MB, 1000, 1000, and 10000 MB, respectively.
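The setup described above can be summarized in a few lines of Python. This is only an illustrative sketch, not the CloudSim configuration used in the experiments: the names `VmSpec`, `expected_hours_between_failures`, and `cloudlet_priorities` are hypothetical, and the assumption that the alternating priorities start with 2 is ours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSpec:
    """VM parameters as listed in Sect. 3.1 (field names are hypothetical)."""
    pes: int = 2
    ram_mb: int = 2048
    mips: int = 1000
    bandwidth: int = 1000
    image_size_mb: int = 10000


MEAN_FAILURES_PER_HOUR = 0.005  # failure rate stated in the text


def expected_hours_between_failures(rate_per_hour):
    """Mean interval between failures for a given hourly failure rate."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return 1.0 / rate_per_hour


def cloudlet_priorities(count):
    """Priorities alternating between 2 and 0 (assumed to start with 2)."""
    return [2 if i % 2 == 0 else 0 for i in range(count)]
```

With a rate of 0.005 failures per hour, `expected_hours_between_failures` returns 200 h, which is why the cloudlets must run for at least that long for a fault to be likely during the simulation.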

3.2 RFDR Model

As shown in Fig. 3, Algorithm 1 describes the fault detection and recovery approach, which continuously monitors all hosts and VMs across all data centers, looks for faults, and initiates the recovery process when a failure occurs. The algorithm's flow is described in detail below. A counter for the total number of VM failures during the program's execution is initially set to 0 and is incremented by one whenever the conditions for a failed VM are met. The approach monitors the simulation through a listener object that tracks the simulation time and is notified whenever the simulation clock advances. The variable VMlist in Table 1 lists all VMs. A loop over all VMs on all hosts checks for failure every second: it is driven by the clock listener object, which follows the simulation time continuously, enabling a continuous monitoring

Table 1 RFDR algorithm

Algorithm 1: Robust fault detection and recuperation algorithm
Input: VMlist
Output: Failed VM and HOST
1. No.of.Faults ← 0
2. For each VM in VMlist
3.   Check VM.Failed
4.   No.of.Faults ← No.of.Faults + 1
5.   HOST ← VM.getHOST
6.   Check HOST.Failed
7.   Create newHOST
8.   Add newHOST to DC
9.   If (HOST.Failed = FALSE) then
10.    Create newVM
11.    Submit newVM to Broker
12.    Add newVM to VMlist
13.    CloudletList ← VM.getCloudletList
14.    Submit CloudletList
15.  End if
16. End for

mechanism. The status of each VM is checked using two flags: one holds the current failure status of the VM, and the other records whether a failure of that VM has already been registered. If the current-status flag is TRUE and the previous-failure flag is FALSE, the VM has newly failed, and a VM failure is detected and recorded. Once the total number of VM failures has been incremented, the previous-failure flag marks the VM as already failed, so its status need not be checked again. When a VM fails, the status of the PEs is checked in the same way with two flags: a TRUE flag indicates a current failure, and a FALSE flag indicates no host failure.
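The two-flag detection step can be sketched as follows. This is a minimal, simulator-independent illustration under our own assumptions: `Vm`, `FailureMonitor`, and `on_clock_tick` are hypothetical names and do not correspond to CloudSim classes or listeners.

```python
from dataclasses import dataclass


@dataclass
class Vm:
    vm_id: int
    failed: bool = False            # current failure status
    failure_recorded: bool = False  # True once this failure has been counted


class FailureMonitor:
    """Counts each VM failure exactly once, as described for Algorithm 1."""

    def __init__(self, vms):
        self.vms = list(vms)
        self.failure_count = 0

    def on_clock_tick(self, time):
        """Called whenever the simulation clock advances.

        Returns the ids of VMs whose failure is detected for the first time.
        """
        newly_failed = []
        for vm in self.vms:
            if vm.failed and not vm.failure_recorded:
                vm.failure_recorded = True   # prevents double counting
                self.failure_count += 1
                newly_failed.append(vm.vm_id)
        return newly_failed
```

A VM that stays failed across several clock ticks is reported only on the first tick, which mirrors the role of the previous-failure flag in the text.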

3.3 Fault Injection Mechanism

Because failures of PEs, VMs, and hosts at cloud runtime are unavoidable and unpredictable, this mechanism is used to simulate such unforeseen events in CloudSim. The fault injection mechanism generates random faults, at times drawn from a Poisson distribution, in the processing elements of the hosts located in the data center. The Poisson distribution is used to create events at random time intervals in an experiment. In Fig. 3, the Poisson distribution


is a discrete probability distribution (DPD) that gives the probability of a given number of occurrences within a specified period of time or region of space. The process starts once workloads have been assigned to their resources and workload execution begins. First, the event rate of the Poisson process is determined, which is the total number of failures expected per hour. After the host has created the virtual machines, a fault object must be initialized for the injection. Where cloning of failed VMs is permitted, this keeps the function descriptions simple. As indicated previously, events occur in the following sequence in this phase.

1. A host failure injection time is generated using a random generator, and at that time a randomly chosen host is made to fail. The host is selected using an internal generator that uses the same seed value as the preceding generator.
2. The internal generator randomly chooses the number of processing elements that fail on the affected host. The failed processing elements (PEs) are removed from the VMs, VMs with no working PEs are terminated, and a copy of each failed VM is sent to the data center (DC) broker.

The Poisson distribution is one of the discrete probability distributions used for counting the number of events that occur in a particular time period. It is parameterized by MEAN FAILURE NUMBER PER HOUR, the projected average rate of failure, which is a simulation parameter; a larger value means more errors per hour. The seed variable is used to make each outcome reproducible. When a seed is provided, the number of processing element failures generated in each hourly interval stays the same from run to run. When the seed is removed, the number of processing element failures in each hour is randomized. The data center containing the physical host machines and virtual machines is passed as a parameter to the host fault injection mechanism. This states that faults should be induced in that data center and the hosts associated with it. There is also a set MaxTimeToFailure method in the host fault injection mechanism, which states the maximum time after which fault induction stops. The maximum time to failure is set to 800 h for the simulation. This specifies that after 800 h no host processing element fault can be induced, even if the simulation is still running.
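The timing part of the injection mechanism can be illustrated with a short sketch. It relies on a standard property of a Poisson process: if failures arrive at rate λ per hour, the gaps between consecutive failures are exponentially distributed with mean 1/λ. The function name `fault_injection_times` and its parameters are hypothetical; the sketch uses only Python's standard library and does not reproduce the fault injection API of the simulator.

```python
import random


def fault_injection_times(rate_per_hour, max_time_hours, seed=None):
    """Generate failure times (in hours) of a Poisson process.

    rate_per_hour  -- mean number of failures per hour (e.g. 0.005)
    max_time_hours -- no fault is injected after this time (e.g. 800)
    seed           -- fixed seed gives a reproducible sequence; None randomizes
    """
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate_per_hour)      # first inter-arrival gap
    while t <= max_time_hours:
        times.append(t)
        t += rng.expovariate(rate_per_hour)  # next gap
    return times
```

Calling the function twice with the same seed yields the same list of times, while omitting the seed gives a different sequence on each run, which matches the role of the seed variable described above.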

3.4 Fault Detection Mechanism

Let R be the number of PEs required by a cloudlet running in an affected VM. If the cloudlet requires R PEs but the VM has fewer, the VM executes the cloudlet with the PEs it currently has, and completion takes longer than estimated. The monitoring process therefore checks that the number of working PEs never falls below the number required by the process. As an example, let N be the number of failed PEs of a running VM. As the N PEs are removed one at a time, the monitoring process keeps checking that the VM still has at least the required number of PEs.


Throughout the runtime of the program, the monitoring system tracks every VM independently at the data center level and detects any malfunction. Continuous monitoring checks the liveness status of each VM allocated on every host; if the status reports a failure, the recovery mechanism is triggered. Whenever a VM failure is identified, a snapshot of the failed VM is sent to the data center broker, and cloning proceeds until the recovery mechanism has restored the VM. The host fault injection creates errors in the set of processing elements available to the virtual machine. As the elements fail one by one, there is no hazard until the number of working processing elements falls below the required threshold, which reduces the VM's computation and execution power. When a fault is identified, a new host is created with a new virtual machine, which is added to the data center, and the cloudlets submitted during the failure are resubmitted for scheduling so that the workload is executed completely; this is termed the rescheduling process.
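The PE threshold check can be made concrete with a small sketch. The timing model here is our own simplifying assumption (the cloudlet length in million instructions is divided evenly over the PEs it can actually use), and the function names are hypothetical; the simulator's internal scheduling model may differ.

```python
def has_pe_shortfall(required_pes, working_pes):
    """True when the VM has fewer working PEs than the cloudlet requires."""
    return working_pes < required_pes


def estimated_run_time(length_mi, pe_mips, required_pes, working_pes):
    """Estimated run time in seconds under an even-split assumption.

    length_mi -- cloudlet length in million instructions
    pe_mips   -- capacity of one PE in MIPS
    """
    usable = min(required_pes, working_pes)
    if usable <= 0:
        raise ValueError("VM has no working PEs and must be recovered")
    return length_mi / (pe_mips * usable)
```

For a cloudlet that requires 2 PEs on 1000-MIPS cores, losing one PE doubles the estimated run time under this model, which is the "completion takes longer than estimated" effect described above; losing both PEs leaves the VM unusable and calls for recovery.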

3.5 Recovery Mechanism

An automatic recovery mechanism should be an integral part of the design of any system, so that robustness, fault tolerance, and efficiency are not put at risk. After the detection mechanism has detected a fault, the recovery mechanism starts working to ensure that the environment operates normally [35]. If only one of the many VMs on a host has failed, the recovery mechanism may attempt to clone the faulty VM onto the same host, provided the host has free PEs that satisfy the failed VM's requirements. If no PEs remain, the mechanism creates a new host with the stated requirements and clones the failed VM onto it. Once the cloned VMs are configured, cloudlets or tasks that had not completed on the destroyed VM are resubmitted to the new cloned VM. A variation in ID numbering distinguishes the recreated systems from the failed ones. The host ID is increased by one for each newly created host, and the VM ID is likewise increased by one for each virtual machine that is built. The cloudlet ID, however, is not incremented for resubmitted cloudlets. Instead, a 0 is appended to the previous cloudlet ID to mark the cloudlet as resubmitted. For example, a cloudlet with ID 5 becomes 50 when resubmitted.
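The placement decision and the ID convention described above can be sketched as follows. The function names, the dictionary of free PEs per host, and the default of 4 PEs for a new host (taken from the experimental setup) are illustrative assumptions rather than the actual implementation.

```python
def resubmitted_cloudlet_id(cloudlet_id):
    """Append a 0 to the previous cloudlet ID, e.g. 5 becomes 50."""
    return int(f"{cloudlet_id}0")


def place_clone(free_pes_by_host, failed_host_id, required_pes, new_host_pes=4):
    """Choose the host that receives the clone of a failed VM.

    free_pes_by_host maps host id to its number of free PEs and is updated
    in place. The same host is reused when it has enough free PEs; otherwise
    a new host is created with the next host id.
    """
    if new_host_pes < required_pes:
        raise ValueError("a new host could not satisfy the VM requirements")
    if free_pes_by_host.get(failed_host_id, 0) >= required_pes:
        target = failed_host_id
    else:
        target = max(free_pes_by_host, default=-1) + 1  # host ID increases by one
        free_pes_by_host[target] = new_host_pes
    free_pes_by_host[target] -= required_pes
    return target
```

One caveat of the appended-zero convention is that a resubmitted ID can coincide with an existing one once the original ID range is large enough (cloudlet 1 resubmitted becomes 10), so the scheme is only unambiguous for small ID ranges.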

3.6 Failure Metrics and QoS Parameters Used

The mean time to failure (MTTF) is the expected time for which the system operates before a failure occurs. The mean time to repair (MTTR) represents the expected


Fig. 4 Failure metrics differentiation

time to repair the system once a failure occurs. Mean time between failures (MTBF) is the average time between failures during normal system operation. A pictorial representation of the same is shown in Fig. 4 for better comprehension. MTBF can be calculated using (1):

$$\mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR} \tag{1}$$

Availability indicates the extent to which resources are actually available for the workload. Availability can be expressed as:

$$\mathrm{Availability} = \frac{\mathrm{MTBF}}{1 + \mathrm{MTBF}} \tag{2}$$

When scheduling the services, the resource's efficiency must be taken into account. The resource's fault tolerance can be checked with the aid of reliability parameters. Reliability of the resource is calculated as:

$$\mathrm{Reliability} = \frac{\mathrm{MTTF}}{1 + \mathrm{MTTF}} \tag{3}$$
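Equations (1) to (3) translate directly into code. The sketch below reads (2) and (3) as fractions whose denominator is one plus the metric, which is our interpretation of the flattened formulas; the time unit of the metrics is not stated in the text, and the function names are hypothetical.

```python
def mtbf(mttf, mttr):
    """Eq. (1): mean time between failures."""
    return mttf + mttr


def availability(mtbf_value):
    """Eq. (2): availability as MTBF / (1 + MTBF)."""
    return mtbf_value / (1.0 + mtbf_value)


def reliability(mttf):
    """Eq. (3): reliability as MTTF / (1 + MTTF)."""
    return mttf / (1.0 + mttf)
```

Both ratios approach 1 as the corresponding metric grows, so a low failure rate (large MTTF and MTBF) yields high reliability and availability, and a high failure rate pushes both down.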

Thus, the uncertainty measure is calculated and predicted with the proposed model on the basis of the failure metrics and QoS parameters, and workloads are executed completely to provide higher QoS awareness to incoming services and workloads. Equations (2) and (3) are used to calculate the availability and reliability. Some of the QoS parameters that were considered for the performance analysis of the proposed work are discussed as:


4 Experimental Evaluation and Results

The outputs of the various scheduling algorithms were collected, and the results are analyzed graphically. The focus of the proposed work is to recover failed virtual machines and hosts even under the continuous failures generated dynamically by the host fault injection module, thereby increasing the reliability of the virtual machines and hosts. The scheduling algorithms considered for the comparison are FCFS, SJF, priority, max–min, and min–min.

4.1 Analyzing Performance of RFDR in Various Scheduling Algorithms

The scheduling algorithms with the RFDR model are discussed in detail.

4.1.1 First Come First Serve (FCFS)

The simulation was run with and without faults, and the results were recorded and plotted. The graph in Fig. 5 shows the finishing time of each cloudlet under FCFS in both fault and non-fault conditions. The non-fault line indicates the time taken for a cloudlet to complete execution without a fault, and the fault line indicates the time taken for a cloudlet to finish after a fault is induced. The small deviation in the fault line reflects the fault induced during task execution. The fault is induced to check how quickly the proposed strategy recovers when applied over FCFS. FCFS is basic and not very effective, so the finishing time of each cloudlet increases gradually as the number of cloudlets increases. FCFS is non-pre-emptive, meaning that it runs the current process to completion. The deviation of the fault line from the non-fault line is caused by the time required for fault recovery, that is, the time the system takes to detect the fault and recover the VM as described in the proposed method.

4.1.2 Shortest Job First (SJF)

SJF executes the jobs in ascending order of length. The finishing times from the tabulated results are plotted for both fault and non-fault conditions in Fig. 6, which shows the finishing time of each cloudlet under the SJF scheduling algorithm. In the graph, the time taken for cloudlet without fault and the time taken for cloudlet to finish the


Fig. 5 Finish time against cloudlets for FCFS

task after a fault is induced are displayed as separate lines. Comparing the two graphs shows that SJF gives shorter finishing times than FCFS. When a fault is induced, however, SJF reacts differently: its deviation is somewhat larger than that of FCFS, yet its overall finishing time is still lower. This holds because the execution time of SJF without faults is shorter than that of FCFS without faults, and SJF with induced faults likewise shows a shorter execution time than FCFS with induced faults. Pre-emption was not used, because a fault would cause all concurrent tasks to fail during execution, making them tedious to complete later.

4.1.3 Priority Scheduling

Figure 7 shows the relationship between fault and non-fault conditions when the priority-based scheduling algorithm is applied. Priority-based scheduling executes processes according to a priority factor, which may be the time required to complete the task, the memory required to complete the task, or any performance metric related to the executing process. The major disadvantage of this algorithm is that it gives heavy weight to the priority of the cloudlets: if the system fails, only the high-priority tasks are given importance. This explains the large deviation of the fault line in the graph, and the algorithm performs worse than the other two algorithms. The time taken to complete each cloudlet is far higher than the expected value in both fault and non-fault conditions. This algorithm also possesses a poor performance


Fig. 6 Finish time against cloudlets for SJF

Fig. 7 Finish time against cloudlets for priority scheduling

than FCFS algorithm because of its complex implementation. It can be concluded that this algorithm is not suitable for the proposed approach. Figure 7 plots the data for both the fault and non-fault conditions of the priority scheduling algorithm. Priority scheduling uses the priority assigned to each cloudlet to determine the order of execution. This deviation is caused due


to the fault induced mechanism. When the fault is induced, the simulation takes some time to identify it. Once the fault has been identified, its type is determined through continuous monitoring, after which the recovery process is deployed.
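The three simple orderings discussed so far differ only in the key used to arrange the cloudlets before dispatch. The sketch below is illustrative: the `Cloudlet` fields are hypothetical, and treating a larger number as a higher priority is our assumption, since the text does not state the direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cloudlet:
    cloudlet_id: int
    length: int    # million instructions
    priority: int


def fcfs_order(cloudlets):
    """First come, first serve: keep the arrival order."""
    return list(cloudlets)


def sjf_order(cloudlets):
    """Shortest job first: ascending cloudlet length."""
    return sorted(cloudlets, key=lambda c: c.length)


def priority_order(cloudlets):
    """Priority scheduling: higher priority value first (assumed)."""
    return sorted(cloudlets, key=lambda c: c.priority, reverse=True)
```

Because Python's sort is stable, cloudlets with equal length or equal priority keep their arrival order, so ties fall back to FCFS behavior.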

4.1.4 Min–Min Scheduling

The min-min algorithm is examined under both fault and non-fault conditions. The results compare the number of cloudlets against the finish time of the cloudlets in seconds, as shown in Fig. 8. The min-min algorithm works by choosing the task with the minimum completion time in its processing queue. Comparing the fault line with the non-fault line shows that the times taken under fault and non-fault conditions deviate only slightly from one another. Compared with the other scheduling algorithms, min-min gives better performance, as shown in Table 2. Its only drawback is that resource utilization becomes poor when the number of larger tasks increases; the algorithm works better with small tasks than with large ones. The comparison of finish time against number of cloudlets under fault and non-fault conditions clearly shows that the time taken is nearly the same for 90% of the cloudlets. The experimental output indicates the self-regulatory nature of the scheduling scheme in various scenarios.
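A textbook form of min-min can be sketched in a few lines. This is a generic illustration, not the implementation evaluated in the paper: completion time is modeled as VM ready time plus length divided by VM MIPS, and all names are hypothetical. The `pick_max` flag switches to the max-min variant of Sect. 4.1.5, which differs only in selecting the task whose best completion time is largest.

```python
def min_min_schedule(task_lengths, vm_mips, pick_max=False):
    """Return a list of (task index, vm index, finish time) assignments.

    In each round, every unscheduled task is matched with the VM that would
    finish it earliest; min-min then schedules the task with the smallest
    such completion time (max-min the largest when pick_max is True).
    """
    ready = [0.0] * len(vm_mips)            # time at which each VM is free
    unscheduled = dict(enumerate(task_lengths))
    plan = []
    while unscheduled:
        best = {}
        for tid, length in unscheduled.items():
            vm = min(range(len(vm_mips)),
                     key=lambda v: ready[v] + length / vm_mips[v])
            best[tid] = (ready[vm] + length / vm_mips[vm], vm)
        chooser = max if pick_max else min
        tid = chooser(best, key=lambda t: best[t][0])
        finish, vm = best[tid]
        ready[vm] = finish
        plan.append((tid, vm, finish))
        del unscheduled[tid]
    return plan
```

For three tasks of length 400, 100, and 200 on two 100-MIPS VMs, min-min schedules the short tasks first and leaves the long task to finish at time 5.0, whereas the max-min variant starts the long task immediately and finishes everything by time 4.0. This small case illustrates why min-min favors small tasks and max-min copes better with large ones.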

Fig. 8 Finish time against cloudlets for min–min scheduling


Table 2 Finish time comparison of baseline scheduling algorithms

| No. of cloudlets | FCFS | SJF | Priority | Min–min | Max–min |
|---|---|---|---|---|---|
| 1 | 10,422.858 | 10,422.858 | 504,591.159 | 10,422.86 | 10,422.86 |
| 2 | 504,591.18 | 504,591.18 | 2,018,441.477 | 504,591.2 | 504,591.2 |
| 3 | 1,743,570.124 | 601,213.17 | 2,955,333.023 | 2,461,165 | 1,938,446 |
| 4 | 2,334,360.433 | 1,569,241.918 | 3,946,464.520 | 2,523,033 | 2,523,033 |
| 5 | 2,523,032.697 | 1,879,266.386 | 3,956,887.4 | 3,051,955 | 3,216,499 |
| 6 | 3,612,413.653 | 3,302,389.199 | 5,138,467.613 | 4,256,180 | 4,256,180 |
| 7 | 4,973,774.535 | 3,807,289.419 | 5,718,170.297 | 4,979,978 | 5,979,336 |
| 8 | 5,741,714.467 | 6,258,031.229 | 7,451,317.561 | 6,258,031 | 7,043,987 |
| 9 | 6,901,797.544 | 7,339,271.734 | 8,729,370.763 | 7,742,814 | 8,430,077 |
| 10 | 9,664,634.7309 | 9,020,868.454 | 9,794,021.502 | 8,807,465 | 9,020,867 |

4.1.5 Max–Min Scheduling

The max-min algorithm is examined under both fault and non-fault conditions. The results compare the number of cloudlets against the finish time of the cloudlets in seconds. The max-min algorithm works by choosing the task with the maximum completion time in its processing queue. Comparing the fault line with the non-fault line shows that the times under fault and non-fault conditions deviate much more than in the result produced by the min-min algorithm. Table 2 shows the finish times of the cloudlets considered. Max–min performs better than the other algorithms only under the non-fault condition, as given in Table 3. Under faults it does not show such performance and takes more time for the cloudlets' processes overall, as shown in Fig. 9. The max–min algorithm overcomes the drawback of the min-min algorithm, so its resource utilization is better. The algorithm works better with large tasks than with small ones. Figure 10 and Table 2 show the comparison of all scheduling algorithms under fault conditions. The result analysis plots the number of cloudlets against finish time for all scheduling algorithms under the fault condition. The comparative study reveals that the max-min, min-min, and SJF algorithms perform best among the scheduling algorithms, while FCFS and priority-based scheduling perform comparatively poorly under faults and better in the non-fault condition. Under the fault condition, min-min is better than all other scheduling algorithms. The order of the scheduling algorithms from best to worst is min-min, followed by SJF, max-min, FCFS, and finally priority scheduling.

Result inferences: The QoS parameters taken into consideration are availability and reliability. Both make use of failure metrics such as MTTF, MTTR, and MTBF. MEAN FAILURE PER HOUR is as low as 0.005, and the number of failures expected


Table 3 Finish time comparison of baseline scheduling algorithms with non-fault conditions

| No. of cloudlets | FCFS | SJF | Priority | Min–min | Max–min |
|---|---|---|---|---|---|
| 1 | 10,422.96 | 10,422.96 | 504,591.26 | 10,422.86 | 10,422.86 |
| 2 | 504,591.26 | 504,591.26 | 2,018,441.56 | 504,591.2 | 504,591.2 |
| 3 | 1,743,570.19 | 601,213.25 | 2,955,333.06 | 2,461,165 | 1,938,446 |
| 4 | 2,334,360.48 | 1,569,241.97 | 3,946,464.54 | 2,523,033 | 2,523,033 |
| 5 | 2,523,032.72 | 1,879,266.43 | 3,956,887.4 | 3,051,955 | 3,216,499 |
| 6 | 3,612,413.67 | 3,302,389.21 | 4,547,677.69 | 4,256,180 | 4,256,180 |
| 7 | 4,677,064.38 | 3,807,289.41 | 5,612,328.4 | 4,979,978 | 5,979,336 |
| 8 | 4,973,774.52 | 5,320,830.67 | 5,718,170.22 | 6,258,031 | 6,706,922 |
| 9 | 6,901,797.5 | 6,258,031.21 | 7,451,317.46 | 7,019,017 | 7,043,987 |
| 10 | 9,664,634.67 | 8,083,667.83 | 8,729,370.64 | 8,083,668 | 7,297,712 |

Fig. 9 Finish time against cloudlets for max–min scheduling

at each hour is also low; as a result, the MTTF increases. Reliability depends on MTTF and reaches 99.92%, and the increased MTBF results in an availability of 99.93%. When MEAN FAILURE PER HOUR is increased, both MTTF and MTBF decrease, resulting in lower availability and reliability. The proposed RFDR model is used with scheduling algorithms such as FCFS, shortest job first, priority scheduling, min–min, and max–min, and performance graphs of the total finish time of all these algorithms under fault and non-fault conditions have been presented. This gives a detailed analysis of how the scheduling works in fault and non-fault conditions. The failure metrics such as MTTF, MTTR, and MTBF are calculated, and these values are used to estimate the


Fig. 10 Finish time against cloudlets for all scheduling algorithms after fault is induced

QoS parameters such as availability and reliability. The estimated values show that the implemented system is highly reliable. From the analysis performed, it can be concluded that the max-min algorithm works best in the system without faults in comparison with the other algorithms, whereas in the presence of faults the min-min algorithm fares better than the rest (Fig. 10). Figure 11 and Table 3 show the comparison of all scheduling algorithms under non-fault conditions; the result analysis plots the number of cloudlets against finish time for all scheduling algorithms. The comparative study reveals that the max-min algorithm performs best among all the scheduling algorithms, while FCFS performs worst. The stability of max-min is high compared with all other scheduling algorithms, and this is the main factor behind its performance. The order of the scheduling algorithms from best to worst is max-min, followed by SJF, min-min, priority scheduling, and finally FCFS.

5 Conclusion

The proposed robust fault detection and recuperation (RFDR) model is a reactive fault-tolerant system that uses continuous monitoring to recognize faults and malfunctions in VMs, hosts, and PEs and to recover from their consequences. By prioritizing their workloads based on the QoS criteria such as reliability and availability provided by them, the built system focuses


Fig. 11 Finish time against cloudlets for all scheduling algorithms under non-fault conditions

on ensuring customer satisfaction. It can also manage resources automatically by discovering and responding to sudden faults and adapting the resources with minimal human intervention. RFDR is used with scheduling algorithms such as FCFS, SJF, priority, max-min, and min-min in both fault and non-fault conditions. In this experiment, max-min works well in non-fault conditions, while min-min performs better than the other algorithms in the presence of faults. The performance analysis shows that in times of failure the model functions effectively by extending the reach of the resources, thereby creating room for the complete execution of unfinished tasks. This ensures QoS for users in terms of availability and reliability, since resources are made available, and the complete execution of all user requests makes the system secure. The high values of MTTF and MTBF in the system yield high reliability and availability. The reliability and availability of the proposed work increase with the MTTF and MTBF, respectively, so reliability is high if the MTTF is high and vice versa. In other words, the system's reliability decreases if the number of faults per hour is high. The same relationship applies to availability. The uncertainty addressed in the proposed work concerns primarily fault tolerance, although other uncertainties in the cloud environment, such as scalability, resource provisioning, virtualization, and scheduling, also need to be addressed. To make the cloud more reliable, future work will analyze and implement machine learning algorithms such as reinforcement learning (RL), which learns from the situation and makes decisions that maximize the reward, along with failure prediction models based on Bayesian methods and neural networks.


References

1 Nivitha K, Solaiappan A, Pabitha P (2021) Robust service selection through intelligent clustering in an uncertain environment. In: Intelligence in big data technologies—beyond the hype: proceedings of ICBDCC 2019. Springer, Singapore, pp 325–332. https://doi.org/10.1007/978-981-15-5285-4_32
2 Jagatheswari S, Praveen R, Chandra Priya J (2022) Improved grey relational analysis-based TOPSIS method for cooperation enforcing scheme to guarantee quality of service in MANETs. Int J Inf Technol 14(2):887–897. https://doi.org/10.1007/s41870-022-00865-5
3 Nivitha K, Pabitha P (2022) C-DRM: Coalesced P-TOPSIS entropy technique addressing uncertainty in cloud service selection. Inf Technol Control 51(3):592–605. https://doi.org/10.5755/j01.itc.51.3.30881
4 Nivitha K, Pabitha P (2020) A survey on machine learning based fault tolerant mechanisms in cloud towards uncertainty analysis. In: Proceeding of the international conference on computer networks, big data and IoT (ICCBI-2019). Springer, pp 13–20. https://doi.org/10.1007/978-3-030-43192-1_2
5 Nivitha K, Pabitha P (2020) Fault diagnosis for uncertain cloud environment through fault injection mechanism. In: 2020 4th International conference on intelligent computing and control systems (ICICCS). IEEE, pp 129–134. https://doi.org/10.1109/ICICCS48265.2020.9121168
6 Moorthy RS, Pabitha P (2020) A novel resource discovery mechanism using sine cosine optimization algorithm in cloud. In: 2020 4th International conference on intelligent computing and control systems (ICICCS). Madurai, India, pp 742–746. https://doi.org/10.1109/ICICCS48265.2020.9121165
7 Praveen R, Pabitha P (2023) Improved Gentry–Halevi's fully homomorphic encryption-based lightweight privacy preserving scheme for securing medical Internet of Things. Trans Emerging Telecommun Technol 34(4). https://doi.org/10.1002/ett.4732
8 Jagadish Kumar N, Balasubramanian C (2023) Hybrid gradient descent golden eagle optimization (HGDGEO) algorithm-based efficient heterogeneous resource scheduling for big data processing on clouds. Wireless Personal Commun 129(2):1175–1195. https://doi.org/10.1007/s11277-023-10182-0
9 Singh S, Chana I, Singh M (2017) The journey of QoS-aware autonomic cloud computing. IT Prof 19(2):42–49. https://doi.org/10.1109/MITP.2017.26
10 Pillai PS, Rao S (2014) Resource allocation in cloud computing using the uncertainty principle of game theory. IEEE Syst J 10(2):637–648. https://doi.org/10.1109/JSYST.2014.2314861
11 Alsarhan A, Itradat A, Al-Dubai AY, Zomaya AY, Min G (2017) Adaptive resource allocation and provisioning in multi-service cloud environments. IEEE Trans Parallel Distrib Syst 29(1):31–42. https://doi.org/10.1109/TPDS.2017.2748578
12 Calheiros RN, Masoumi E, Ranjan R, Buyya R (2014) Workload prediction using ARIMA model and its impact on cloud applications' QoS. IEEE Trans Cloud Comput 3(4):449–458. https://doi.org/10.1109/TCC.2014.2350475
13 Chen T, Bahsoon R (2016) Self-adaptive and online qos-modeling for cloud-based software services. IEEE Trans Softw Eng 43(5):453–475. https://doi.org/10.1109/TSE.2016.2608826
14 Homsi S, Liu S, Chaparro-Baquero GA, Bai O, Ren S, Quan G (2016) Workload consolidation for cloud data centers with guaranteed QoS using request reneging. IEEE Trans Parallel Distrib Syst 28(7):2103–2116. https://doi.org/10.1109/TPDS.2016.2642941
15 Kamboj S, Ghumman NS (2016) A novel approch of optimizing performance using K-means clustering in cloud computing. Int J 15(14). https://doi.org/10.24297/ijct.v15i14.4942
16 Karamoozian A, Hafid A, Boushaba M, Afzali M (2016) QoS-aware resource allocation for mobile media services in cloud environment. In: 2016 13th IEEE Annual consumer communications and networking conference (CCNC). IEEE, pp 732–737. https://doi.org/10.1109/CCNC.2016.7444870

292

K. Nivitha et al.

17 Liu Y, Sun X, Wei W, Jing W (2018) Enhancing energy-efficient and QoS dynamic virtual machine consolidation method in cloud environment. IEEE Access 6:31224–31235. https:// doi.org/10.1109/ACCESS.2018.2835670 18 Mireslami S, Rakai L, Far BH, Wang M (2017) Simultaneous cost and QoS optimization for cloud resource allocation. In: IEEE Transa Network Ser Manage 14(3):676–689. https://doi. org/10.1109/TNSM.2017.2738026 19 Ahmed M, Zhang M (2015) Multi-objective service composition in uncertain environments. IEEE Trans Serv Comput. https://doi.org/10.1109/TSC.2015.2443785 20 Nita M-C, Pop F, Mocanu M, Cristea V (2014) FIM-SIM: fault injection module for CloudSim based on statistical distributions. J Telecommun Inf Technol 4:14–23. https://www.infona.pl/ resource/bwmeta1.element.baztech-b656c17f-c18e-41b2-991f-f960ad6935b3 21 Mugen P, Wang C, Li J, Xiang H, Lau V (2015) Recent advances in underlay heterogeneous networks: interference control, resource allocation, and self-organization. IEEE Commun Surv Tutorials 17(2):700–729. https://doi.org/10.1109/COMST.2015.2416772 22 Amin Z, Singh H, Sethi N (2015) Review on fault tolerance techniques in cloud computing. Int J Comput Appl 116(18). http://research.ijcaonline.org/volume116/number18/pxc3902768. pdf 23 AbdElfattah E, Elkawkagy M, El-Sisi A (2017) A reac tive fault tolerance approach for cloud computing. In: 2017 13th Interna tional computer engineering conference (ICENCO).IEEE, pp 190–194. https://doi.org/10.1109/ICENCO.2017.8289786 24 Ataallah SMA, Nassar SM, Hemayed EE (2015) Fault tolerance in cloud computing-survey. In: 2015 11th International computer engineering conference (ICENCO). IEEE, pp 241–245. https://doi.org/10.1109/ICENCO.2015.7416355 25 Jhawar R, Piuri V, Santambrogio M (2012) Fault tolerance management in cloud computing: a system-level perspective. IEEE Syst J 7(2):288–297. https://doi.org/10.1109/JSYST.2012.222 1934 26 Kumari P, Kaur P (2018) A survey of fault tolerance in cloud computing. 
J King Saud Univ Comput Inf Sci. https://doi.org/10.1016/j.jksuci.2018.09.021 27 Mittal D, Agarwal N (2015) A review paper on fault tolerance in cloud computing. In: 2015 2nd International conference on computing for sustainable global development (INDIACom). IEEE, pp 31–34 28 Tchernykh A, Schwiegelsohn U, ghazaliTalbi E, Babenko M (2019) Towards understanding uncertainty in cloud computing with risks of confidentiality, integrity, and availability. J Comput Sci A36:100581 29 Tchernykh A, Schwiegelsohn U, Talbi EG, Babenko M (2019) Towards understanding uncertainty in cloud computing with risks of confidentiality, integrity, and availability. J Comput Sci 36:100581. https://doi.org/10.1016/j.jocs.2016.11.011 30 Kim Y, Jeong SR (2015) Opinion-mining methodology for social media analytics. KSII Trans Internet Inf Syst (TIIS) 9(1):391–406. https://doi.org/10.3837/tiis.2015.01.024 31 Kumar M, Mathur R (2014) Outlier detection based fault-detection algorithm for cloud computing. In: International conference for convergence for technology-2014. IEEE, pp 1–4. https://doi.org/10.1109/I2CT.2014.7092201 32 Jhawar R, Piuri V (2012) Fault tolerance management in IaaS clouds. In: 2012 IEEE first AESS European conference on satellite telecommunications (ESTEL). IEEE, pp 1–6. https://doi.org/ 10.1109/ESTEL.2012.6400113 33 Bashir M, Kiran M, Awan I-U, Maiyama KM (2016) Optimising fault tolerance in real-time cloud computing IaaS environment. In: 2016 IEEE 4th International conference on future internet of things and cloud (FiCloud). IEEE, pp 363–370. https://doi.org/10.1109/FiCloud. 2016.58 34 Modi KJ, Chowdhury DP, Garg S (2018) Automatic cloud service monitoring and management with prediction-based service provisioning. Int J Cloud Comput 7(1):65–82. https://doi.org/10. 1504/IJCC.2018.091684 35 Pabitha P, Chandra Priya J, Praveen R, Jagatheswari S (2023) ModChain: a hybridized secure and scaling blockchain framework for IoT environment. Int J Inf Technol 15(3):1741–1754. 
https://doi.org/10.1007/s41870-023-01218-6

Self-regulatory Fault Forbearing and Recuperation Scheduling Model …

293

36. Mehmi S, Verma H.K., Sangal AL (2017) Simulation modeling of cloud computing for smart grid using CloudSim. J Electr Syst Inf Technol 4(1):159–172. https://doi.org/10.1016/j.jesit. 2016.10.004 37. Rajalakshmi SM, Pabitha P (2019) Optimal provisioning and scheduling of analytics as a service in cloud computing. Trans Emerging Telecommun Technol 30(9). https://doi.org/10. 1002/ett.3609

A Comprehensive Survey on Student Perceptions of Online Threat from Cyberbullying in Kosova

Atdhe Buja and Artan Luma

Abstract Cyberbullying is a major concern for every school, parent, and university. Past events show that news media and awareness campaigns strongly influence society and its readers in other cases as well. The purpose of this study is to examine students' perceptions of cyberbullying and to assess the current situation in the national educational institutions (high schools). From the findings of a student survey, the authors learn that a threat of cyberbullying exists, that mobile phones are the most common means used, particularly in the higher classes, and that students' perception of incidents taking place is high. The study is useful insofar as its data reflect the actual state of affairs, since cyberbullying appears to be more frequent at high-school age. We conclude by discussing the results and offering suggestions for intervention and prevention by high schools.

Keywords Cyberbullying · Higher education · Internet

1 Introduction

Cyberbullying is a digital-era form of attack in which an individual or a group uses electronic communication, the Internet, and technological devices to express aggression or to tease with intentional messages, once or repeatedly, in order to cause disruption and loss of control in one or more other students [6, 13]. Its impact is closely tied to psychology and sociology, manifesting as depression, loneliness, alcohol use, smoking, etc. [8]. Once psychological problems appear, victims may be driven toward fatal outcomes as a way of escaping the torment [4]. High school students use the Internet for activities including gaming, communicating with friends and relatives, and obtaining various information resources [13, 14].

A. Buja (B) · A. Luma, Faculty of Contemporary Sciences and Technologies, South East European University, 1200 Tetovo, North Macedonia
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_20

1.1 Cyberbullying Consequences

Bullying that happens through technology is called cyberbullying [2]. Reported figures show that cyberbullying is most common in middle schools (33%), followed by high schools (30%), combined schools (20%), and primary schools (5%) [1]. Cyberbullying is categorized as highly harmful and leaves negative consequences on individuals, especially young people. It places its targets in uncomfortable situations, including humiliation, threats, and attacks on personal online profiles [1]. A typical example is a message or picture that is distributed over the Internet or technological devices and is then redistributed by people other than the initial perpetrator. In this way, a single act by one person can be multiplied very quickly, by the hundreds, against the victim [10] (Fig. 1).

Fig. 1 Statistics [1]

2 Related Work

It is very hard to identify who is behind cyberbullying, because most of the time perpetrators conceal their real identities by using fake names, hidden Internet addresses, etc. For example, students might take an inappropriate picture while a relationship is good, which can later be altered and posted on websites once the relationship sours [5], leaving victims of cyberbullying exposed on the Internet for others to see. Cyberbullies use aggression to improve their social position by criticizing a particular group, and anonymity lets them extend the violence to a larger audience [3]. The authors of [7] conducted a survey on a representative sample (n = 15,425) of students from classes 9–12. Overall, girls were more likely to report being cyberbullied (22.0% vs. 10.8%), while boys were more likely to report school bullying only (12.2% vs. 9.2%). Machine learning is an emerging technology that, through different algorithms, can help predict and prevent cyber harassment. In [9], classification algorithms, library functions, and natural language processing were used to identify and prevent negative viewpoints in social media. Such a model can be a tool for the government entity responsible for cyberbullying and online threats to identify and act against harmful content.
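To make the idea of classifying text as harmful or benign concrete, the following is a minimal sketch of a bag-of-words naive Bayes classifier. It is not the model used in [9], which the text does not specify; naive Bayes is chosen here only as a simple stand-in, and the function names `train` and `predict` as well as the four training sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (text, label). Returns per-label word counts and label counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(text, word_counts, label_counts):
    """Return the label with the highest log-probability (Laplace smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy data, for illustration only (not from any study dataset).
toy = [
    ("you are stupid and ugly", "harmful"),
    ("nobody likes you loser", "harmful"),
    ("see you at school tomorrow", "benign"),
    ("great game last night", "benign"),
]
model = train(toy)
print(predict("you are a loser", *model))
print(predict("great school game", *model))
```

A real system would need a much larger labeled corpus and proper text preprocessing; the sketch only shows the shape of the classification step.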

3 Methodology

As for the research gap, cyberbullying is a growing issue among young people in schools; this paper describes and evaluates the current state of this phenomenon and presents the findings of a questionnaire. The purpose of this study is to address how young high school students perceive the online threat of cyberbullying. The study uses a quantitative method that attempts to compare and identify perceptions of cyberbullying given the factors that cause it [11]. The primary data come mainly from questionnaires distributed to national educational institutions (most high schools in every city of Kosovo), supplemented by information on the actual state of affairs from the Kosovo Government agencies responsible for a safer Internet against online threats and by previous research studies. The methodology serves the aims set at the beginning of the study. First, a literature review was conducted for the respective fields of the study (Table 1).

Table 1 Detailed statistics of respondents

Places/participated school            No. of students
28 Nentori                            12
Don Bosko                             1
Fan S. Noli                           19
Fehmi Agani (Kline)                   3
Fehmi Agani (Gjilan)                  1
Gjimnazi “Gjon Buzuku”                17
Gjimnazi natyror “Xhavit Ahmeti”      43
Gjin Gazull                           63
Kadri Kusari                          1
Pjeter Bogdani                        5
SHMT “TAFIL KASUMAJ”                  7
Shtefan Gjeqovi                       1
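As a quick consistency check, the per-school counts in Table 1 can be tallied. The sketch below only re-adds the numbers printed in the table; the names `respondents`, `share`, and `largest` are illustrative.

```python
# Respondent counts per school, copied from Table 1.
respondents = {
    "28 Nentori": 12,
    "Don Bosko": 1,
    "Fan S. Noli": 19,
    "Fehmi Agani (Kline)": 3,
    "Fehmi Agani (Gjilan)": 1,
    "Gjimnazi Gjon Buzuku": 17,
    "Gjimnazi natyror Xhavit Ahmeti": 43,
    "Gjin Gazull": 63,
    "Kadri Kusari": 1,
    "Pjeter Bogdani": 5,
    "SHMT TAFIL KASUMAJ": 7,
    "Shtefan Gjeqovi": 1,
}

total = sum(respondents.values())

def share(school):
    """Percentage of all respondents that came from one school."""
    return round(100 * respondents[school] / total, 2)

largest = max(respondents, key=respondents.get)
print(total, largest, share(largest))
```

The total agrees with the 173 responses noted under Figs. 3 and 8.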


3.1 Research Questions

The following research questions address the purpose of this study:

RQ1: What methods or technological devices are used for cyberbullying among students, and does it come more from mobile devices or from PCs?
RQ2: How realistic is the perception of cyberbullying, given the factors that cause it?

3.2 Limitations

Although this study was addressed to all national educational institutions (high schools), only 11 high schools and 14 cities at the national level of the Republic of Kosovo took part. At the outset, we tried to obtain actual data on cases related to cyberbullying from relevant stakeholders such as the Department of Cybercrime at the police and the judicial system; no such data were available, which limited the research. In some institutions the researcher also faced obstacles in accessing documentation and reports, because digitalization is not yet part of the institutional culture. The time available to conduct the study was short. To compensate, we used web scraping on several online news media to extract the text of around 30,000 news articles published between 01.01.2015 and 31.12.2018. Content analysis was conducted on this dataset of articles, but none of them was identified as related to cyberbullying.
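The content analysis step described above can be pictured as a keyword filter over already-scraped article texts. The sketch below is an assumption about how such a filter might look, not the authors' procedure: the keyword list, the record layout (`published`, `text`), the function name `related_articles`, and the three sample articles are all hypothetical.

```python
from datetime import date

# Hypothetical keyword list; the study does not state which terms were searched.
KEYWORDS = ("cyberbullying", "cyber bullying", "online harassment")

# Study window stated in the text: 01.01.2015 to 31.12.2018.
START, END = date(2015, 1, 1), date(2018, 12, 31)

def related_articles(articles, keywords=KEYWORDS):
    """Return articles published in the study window whose text mentions a keyword."""
    hits = []
    for article in articles:
        if not (START <= article["published"] <= END):
            continue
        text = article["text"].lower()
        if any(k in text for k in keywords):
            hits.append(article)
    return hits

# Made-up sample records, for illustration only.
sample = [
    {"published": date(2016, 5, 2), "text": "Local school opens a new library."},
    {"published": date(2017, 9, 14), "text": "Police warn parents about cyberbullying."},
    {"published": date(2019, 1, 3), "text": "Cyberbullying report released."},
]
hits = related_articles(sample)
print(len(hits))
```

In the study itself this kind of filtering returned no related articles; the sample above is constructed only to show one match and one article outside the date window.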

3.3 Data Analysis Procedures

For the purpose of this study, the dataset gathered from the questionnaire is analyzed as follows, and the results are presented in the next section (Results):

• organizing the data;
• summarizing the findings;
• describing how the data are distributed;
• exploring connections between parts of the data.

The analysis covers the following respondent characteristics:

• Gender
• Class
• Cyberbullied cases
• School of study
• Living place (City).

3.4 Data Presentation

As part of the data analysis, this study presents tables, bar charts, pie charts, and scatter plots. All of these are given in the next section (Results).

3.5 Types of Questionnaire Questions

The questionnaire opens with an introduction that presents its purpose. It then asks the respondent for the school attended, the city of residence, the current class, and gender.

4 Results

The study reveals some important results. When asked "Have you ever been cyberbullied?", 21.97% of students reported having experienced cyberbullying, 17.34% reported that they were not sure whether they had been cyberbullied, and the remaining 60.69% reported that they had not been cyberbullied. Figure 2 shows the percentages of the types of methods of cyberbullying encountered. Students experienced cyberbullying mostly through mobile phones (55%); 30% did not declare a method, 16% reported SMS, 15% laptop, 13% e-mail, 12% PC, and 8% tablet.

Fig. 2 Types of methods of cyberbullying at high schools of Kosovo (in percentages). Note 121 responses

Figure 3 presents the results on students' understanding and perception of cyberbullying. This information can be useful in improving education and expanding knowledge regarding Internet threats, and cyberbullying attacks in particular. The results are based on the percentage of students who selected one or more options. The most frequent answer, chosen by 37% of students, was "Friends of mine have been cyberbullied". In second place, 25% of students chose "We've had cyberbullying incidents in my school". The third most frequent selection was "I don't know what cyberbullying is", at 18%.

Fig. 3 Understanding and perception of cyberbullying among students. Note 173 responses

As can be seen in Table 2, the sample consists of high school students in classes 10 to 12 in Kosovo. The majority of students (98.27%) lived in the city and 1.73% outside the city (suburban area). No significant differences were found by gender, class, or residence. No differences by gender were found in the use of mobile phones as a method of cyberbullying or in other cyberbullying variables. Mobile phones as a method of cyberbullying were more dominant among students of class 12 (56 responses) than among those of class 11 (27 responses) and class 10 (13 responses). In addition, no big differences were found in cyberbullying with regard to sexual orientation.

Table 2 Demographic characteristics

Variable     Study sample (%)
Gender       F = 43.35, M = 56.65
Class 10     12.72
Class 11     25.43
Class 12     61.85

As Fig. 4 shows, the distribution is almost equal, with a somewhat higher share of males among those who have experienced cyberbullying. Most of these respondents are in class 12 (55.26%), followed by class 11 (31.58%), whereas class 10 has the lowest share (13.16%). There is a strong link between gender and cyberbullying, as males are more involved in experiencing cyberbullying than females. Figure 5 shows that 56.65% of males have experienced cyberbullying, while about 43.35% of females admitted to having experienced it.

Fig. 4 Distribution of students who have experienced cyberbullying by class and gender

Fig. 5 Differences between genders on experiencing cyberbullying

Figure 6 shows how frequently students think cyberbullying happens: too often, but not all the time (36.42%); all the time (32.37%); sometimes (19.08%).

Fig. 6 Dissemination of cyberbullying frequency

Did the students who were victims of cyberbullying tell anyone? To successfully reduce cyberbullying in schools, everyone in a school, from students to school staff, must report it. As Fig. 7 shows, only 13.29% of students claim to have reported the cyberbullying they experienced to someone, while 64.74% of students preferred not to tell anyone, and 21.97% of students did not answer the question.

Students were also asked: if you were to draft a law on cyberbullying, what would it lay out? As Fig. 8 shows, the largest group of students (48%) answered that there should be a cyberbullying police squad to investigate cyberbullying, 44% answered that schools should have to help victims of cyberbullying, and so on.

Fig. 7 Reporting cases of cyberbullying


Fig. 8 Needed content for a law in the fight against cyberbullying. Note 173 responses

This study may open possibilities for cooperation with high schools and with other research and educational institutes to improve school environments and to raise awareness in high schools of prevention, education, and guidance.
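The percentages reported in this section can be converted back to approximate respondent counts. The sketch below assumes that the percentages for the question on having experienced cyberbullying are taken over the 173 responses noted under Figs. 3 and 8; the helper name `to_count` is illustrative.

```python
TOTAL_RESPONSES = 173  # sample size stated in the notes of Figs. 3 and 8

def to_count(percent, total=TOTAL_RESPONSES):
    """Convert a reported percentage to the nearest whole number of respondents."""
    return round(percent * total / 100)

# Answers to the question on having experienced cyberbullying.
experienced = {"yes": 21.97, "not sure": 17.34, "no": 60.69}
counts = {answer: to_count(p) for answer, p in experienced.items()}
print(counts, sum(counts.values()))
```

Under that assumption the three answers add up to the full sample, and the count for "yes" matches the n = 38 quoted in the Discussion.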

5 Discussion

The educational system (high schools) should be required to address new social problems related to online participation. This study also explored demographic characteristics and the methods used in cyberbullying. The methods used differ only by class, with students of class 12 experiencing higher rates than those in class 10. Only a small portion of the students who experienced cyberbullying (n = 38) reported it to anyone, and more than half of the students did not report cyberbullying to anyone. This study suggests that high schools should treat cyberbullying as a main threat that disturbs the social life of students, and should develop materials and activities that include information about the threats, such as cyberbullying, that come from using the Internet and being engaged online. Prevention is a very important tool and should be considered by limiting access to content or communication that is dangerous to students. At the same time, while protecting the privacy of each student, the school information system should give students the opportunity to use secure libraries provided by the school itself to obtain knowledge and more. Moreover, cyberbullying can lead to health and social problems that interfere with students' study time [5]. "It appears that parents are slower to address issues such as cyberbullying, as evidence from the report found that parents were more likely to give advice and monitor their children's Internet use only after their child had already experienced something upsetting online." Summers [12]. High school education institutions should investigate the extent of cyberbullying in their environment and, if necessary, involve other prevention or intervention institutions.

6 Future Work

In recent years, societies and people have become more dependent on the Internet. As noted in this study, further research will be conducted on the sources, impact, and distribution channels of cyberbullying in high schools.

7 Conclusions

This study finds that cyberbullying harassment takes place through technological channels, the Internet, and devices in high schools. The paper and its results contribute on the technical side by giving school institutions knowledge about the current situation of students with regard to cyberbullying, which in turn helps in choosing protection measures in schools. Moreover, as the Internet has become very common in student life, problems related to cyberbullying will become more likely. High schools and educational institutions should educate students about these threats, create guidelines for online communication, and provide intervention when it is needed.

References

1. Blair J (2003) New breed of bullies torment their peers on the internet. Educ Week 22(21):6. http://www.mskennedysclass.com/NewBreedofBullies.pdf
2. CDC, Preventing Bullying, Bullying research. Retrieved (2021). Available https://www.cdc.gov/violenceprevention/youthviolence/bullyingresearch/fastfact.html
3. Fernandez MT, Ortiz-Marcos JM, Olmedo-Moreno EM (2019) Educational environments with cultural and religious diversity: psychometric analysis of the cyberbullying scale. MDPI 10(7):443. https://doi.org/10.3390/rel10070443
4. Ghadampour E, Shafiei M, Heidarirad H (2017) Relationships among cyberbullying, psychological vulnerability and suicidal thoughts in female and male students. J Res Psychol Health 11(3):28–40, 12. https://doi.org/10.29252/rph.11.3.28
5. Hoff DL, Mitchell SN (2009) Cyberbullying: causes, effects, and remedies. J Educ Adm 47(2009):652–665. https://doi.org/10.1108/09578230910981107
6. Lawler JP, Molluzzo JC (2015) A comprehensive survey on student perceptions of cyberbullying at a major metropolitan university. Contemp Issues 8(Third Quarter):159–170. https://files.eric.ed.gov/fulltext/EJ1069888.pdf
7. Messias E, Kindrick K, Castro J (2011) School bullying, cyberbullying, or both: correlates of teen suicidality in the 2011 CDC youth risk behavior survey. Compr Psychiatry 55(5). https://doi.org/10.1016/j.comppsych.2014.02.005
8. Peled Y (2019) Cyberbullying and its influence on academic, social, and emotional development of undergraduate students. Heliyon 5(3). https://doi.org/10.1016/j.heliyon.2019.e01393
9. Rajesh S, Sharanya B (2021) Recognition and prevention of cyberharassment in social media using classification algorithms. Mater Today Proc 06(01). https://doi.org/10.1016/j.matpr.2020.10.502
10. Slonje R, Smith PK, Frisen A (2013) The nature of cyberbullying, and strategies for prevention. Comput Human Behav 29(1):26–32. https://doi.org/10.1016/j.chb.2012.05.024
11. Stanimirovic D, Jukic T, Nograsek J, Vintar M (2012) Analysis of the methodologies for evaluation of e-government policies. In: IFIP international federation for information processing
12. Summers N (2015) Cyberbullying: experiences and support needs of students in a secondary school. School of Environment, Education and Development University of Manchester. https://www.research.manchester.ac.uk/portal/en/theses/cyberbullying-experiences-and-support-needs-of-students-in-a-secondary-school(1294d779-9745-4ee2-b018-bb5b29ba9699).html
13. Tokunaga RS (2010) Following you home from school: a critical review and synthesis of research on cyberbullying victimization. Comput Human Behav 26(3):277–287. https://doi.org/10.1016/j.chb.2009.11.014
14. Zymeri T, Latifi L (2016) SIGURIA E FËMIJËVE NË INTERNET, FIT

Addressing Localization and Hole Identification Problem in Wireless Sensor Networks

Rama Krushna Rath, Santosh Kumar Satapathy, Nitin Singh Rajput, and Shrinibas Pattnaik

Abstract Wireless sensor network (WSN) is one of the latest developments in communication networks. The major components of a wireless sensor network are the sensor nodes, which monitor a particular area and environment in the deployed region. A WSN is therefore responsible for controlling and managing the system and environmental conditions in the region of interest (RoI). Sensor nodes can be deployed manually or randomly; random deployment is easy but creates many challenges in mobility, coverage and connectivity, transmission capability, etc. In this research, we focus on the coverage and connectivity problem by healing the holes created by external factors. In the first step, we check the sensor positions and the holes in the network. In the second step, we calculate the centroid of each hole and identify its neighbor nodes, and in the last step, we choose a target node to move to the centroid for hole healing. We consider only the holes within the network; holes that result from the initial deployment and lie on the border are not addressed.

Keywords WSN · Localization · Coverage and connectivity · Node mobility · Hole healing

R. K. Rath, Department of Computer Science and Engineering, Indian Institute of Information Technology (IIIT) Sri City, Chittoor, Andhra Pradesh, India
S. K. Satapathy (B) · N. S. Rajput, Department of Information and Communication Technology, Pandit Deendayal Energy University (PDEU), Gandhinagar, India
S. Pattnaik, Department of Electrical and Electronics Engineering, Gandhi Institute of Science and Technology, Rayagada, Odisha, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_21

Fig. 1 Simple wireless sensor network [6]

1 Introduction

The concept of the wireless sensor network (WSN) came into being nearly three decades ago, and the US military was behind the start of WSN research. At first, WSN was more a vision than a technology that could be widely exploited, because of the state of sensor, computer, and wireless communication technologies at the time [1]; it was limited to military applications only. However, recent advances in micro-electronic mechanical systems (MEMS) [2], together with cheaper manufacturing technologies and wireless communication, have led to the development of cheap, tiny sensors with not only sensing but also processing and transmission abilities, which in turn has encouraged a wide variety of WSN applications. A wireless sensor network, as shown in Fig. 1, is made for locations and fields that are generally inaccessible to humans; as such, the sensors need to use their battery power as efficiently as possible to stay alive for as long as possible. These differences and constraints require routing protocols for WSN to be designed specifically for the application in which they are to be used [3]. Routing protocols in WSN should be designed in such a way that they satisfy the constraints above. The cluster-based routing protocols have been identified as energy-efficient among the routing algorithms [4, 5]. In this paper, our major contributions are organized in three phases. First phase: network boundary discovery, hole detection, and identification of hole characteristics. Second phase: hole centroid calculation, neighbor node selection, and target node identification. Third phase: the process of hole healing.

2 Technological Background

Wireless sensor networks have significant advantages over traditional wired networks. Wireless sensors reduce the deployment cost and the delay in services, and they can be used in any environment [7]. They are most desirable for locations and terrains that are hostile or normally unreachable, like a battlefield, outer space, deep forests, oceans, etc. Recent advancements in wireless communication technologies and micro-electronic mechanical systems (MEMS) have made inexpensive sensor nodes possible; these nodes are deployed randomly in the area of interest and connected through wireless links. The Internet presents ample scope and applications in our physical world. Microsensors are used in varied applications, including the military, the health care industry [8], the food industry, geosciences, household activities, and vehicle theft detection [9]. WSN has captured the attention of academia and industry, and many research experiments have been and are being done to solve various issues in design and application.

Basically, a sensor network's efficiency depends on the coverage of the monitoring area [10]. In random deployment, some regions can be densely covered while others are poorly covered, and we may find many disconnected regions, called holes, in the network. In an actual environment, obstacles like walls, buildings, ponds, small hills, and trees may be present, and sinkhole attacks on the network [11] also disturb the deployment of sensor nodes. Because of the various environmental factors, the sensing and communication ranges cannot be assumed accurately. A region of the network with no coverage, i.e., one not covered by any sensor node, is considered a hole or sensing void, as shown in Fig. 2. Different kinds of holes, like wormholes [12], coverage holes [13], jamming holes [14], routing holes [15], etc., can exist within the network due to those obstacles, which cause the network to be partitioned and uncovered. Because of the holes in WSNs, much research has been devoted to sensor deployment that addresses both coverage and connectivity [16]. A coverage hole occurs in a wireless sensor network when no sensor node actively covers an area. Sometimes, a hole is useful, for example for finding geographic regions or detecting the boundary of disaster regions [16]. Moreover, holes may cause routing failure when a node transmits sensing data back to the sink. In practice, sensor networks generally have coverage holes, i.e., regions without adequate working sensors; this is what is meant by holes in the rest of the paper.
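The notion of a coverage hole as a region not covered by any sensor can be illustrated with a simple grid test. This is a sketch under stated assumptions, not the detection method of the paper: the monitored area is sampled on an integer grid, each sensor covers a disc of radius `sensing_range`, and the function name `uncovered_points` is hypothetical.

```python
import math

def uncovered_points(sensors, sensing_range, width, height, step=1):
    """Return the grid points that lie outside the sensing disc of every sensor."""
    holes = []
    for gy in range(0, height + 1, step):
        for gx in range(0, width + 1, step):
            covered = any(
                math.hypot(gx - sx, gy - sy) <= sensing_range for sx, sy in sensors
            )
            if not covered:
                holes.append((gx, gy))
    return holes

# Two sensors on a line 10 units apart: a range of 5 closes the gap, 4 leaves a hole.
line_sensors = [(0, 0), (10, 0)]
print(uncovered_points(line_sensors, 5, 10, 0))
print(uncovered_points(line_sensors, 4, 10, 0))
```

A finer `step` gives a closer approximation of the hole area at a higher computation cost.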

3 Proposed System

A wireless sensor network (WSN) comprises several sensor nodes deployed and scattered over a specific monitoring region to collect sensed data. There are two critical issues in wireless sensor networks: (1) controlling power resources to prolong the network lifetime and (2) coverage and connectivity. There are also two ways to deploy the sensor nodes: (1) manually, controlled by humans, and (2) randomly, without human control. Power resource control is improved by routing protocols, device design, and packet transmission control. The locations of the sensor nodes, on the other hand, decide the coverage and connectivity in the monitoring area. Our research targets the coverage issue of the wireless sensor network; that is, holes in the network field that create unwanted problems are recovered. The entire mechanism is described in Fig. 3.


Fig. 2 Holes in a sensor network

3.1 Addressing Localization The positions of randomly deployed sensors and the overlapped regions in a network range are shown in Fig. 4a. All the sensors have a fixed range (both sensing range and communication range as shown in Fig. 5a) of input, which can be changed for different networks. The sensing range determines the area a sensor node can monitor, known as the sensing coverage region. This is denoted by Rs . Though the sensing ability of sensor nodes is limited, so a node can sense only a restricted area of a network. Communication range determines the radio coverage, also known as radio area coverage, within which a sensor node can communicate to other sensor nodes in active mode. The communication range is denoted by Rc . In our implementation, it is assumed to have better performance that the sensing range should be less (nearly half) or equal to the communication range. Now, this can be seen in Fig. 5b that node 1 is in the communication range of node 2, but node 3 is outside the communication range, hence denoted as a black node for node 2. After finding the node positions, the source node is chosen to be given as input as in Fig. 4b, from which intra-node distances are calculated using euclidean distance formula. Here, the distance is shown in by taking ten randomly positioned sensors. Again, the shortest path between two nodes is found using Dijkstra’s shortest path algorithm, where the one-to-one and one-to-all approaches are used. Further, a path matrix (shown in Fig. 6) is developed which shows the existing paths as follows:


Fig. 3 Work flow diagram of proposed mechanism


Fig. 4 Sensor coverage parameters: (a) randomly deployed sensors and overlapping areas; (b) Euclidean distance from the source (node 10) to other destination nodes

$$
\text{path}(s_1, s_2) =
\begin{cases}
1 & \text{if a path exists between } s_1 \text{ and } s_2 \\
0 & \text{otherwise}
\end{cases}
$$

where $s_1$ and $s_2$ are two sensor nodes.
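The localization steps described above (Euclidean distances, Dijkstra's shortest path, and the path matrix) can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB implementation used in the paper; the function names, the rule that two nodes are linked when their distance is at most the communication range, and the choice of 0 on the matrix diagonal are assumptions made for the example.

```python
import heapq
import math


def euclidean(p, q):
    """Euclidean distance between two (x, y) positions."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def build_graph(positions, comm_range):
    """Link two nodes when they lie within the communication range (assumed rule)."""
    graph = {i: [] for i in range(len(positions))}
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d = euclidean(positions[i], positions[j])
            if d <= comm_range:
                graph[i].append((j, d))
                graph[j].append((i, d))
    return graph


def dijkstra(graph, source):
    """One-to-all shortest distances from `source`; unreachable nodes stay at infinity."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def path_matrix(graph):
    """path(s1, s2) = 1 if a path exists between s1 and s2, 0 otherwise."""
    n = len(graph)
    matrix = [[0] * n for _ in range(n)]
    for s in graph:
        dist = dijkstra(graph, s)
        for t in graph:
            if s != t and dist[t] < math.inf:
                matrix[s][t] = 1
    return matrix


# Hypothetical positions: nodes 0-1-2 form a chain, node 3 is isolated.
positions = [(0, 0), (3, 4), (6, 8), (50, 50)]
graph = build_graph(positions, comm_range=6)
print(dijkstra(graph, 0))
print(path_matrix(graph))
```

The one-to-one case is the same computation with a single destination read from the returned distances.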

Fig. 5 Sensor connectivity parameters: (a) sensing range and communication range of a sensor node; (b) neighbor node and black node

Fig. 6 Path matrix


3.2 Hole Identification and Healing

This section describes the implementation details of hole identification and recovery in the target network. The basic scenario is as follows: during the normal operation of a sensor network, a large number of nodes are lost due to an external attack, creating one or several large holes within the network and making it ineffective. We propose a mechanism for detecting and recovering holes by exploiting only the mobility of the nodes. Only the holes within the network are considered; holes on the border that result from the initial deployment are not addressed. In executing this mechanism, we make the following assumptions:

• A dense mobile WSN is deployed randomly in an obstacle-free region of interest (RoI).
• All deployed nodes are time-synchronized and homogeneous.
• Location information of each sensor node is available.
• The considered network is connected, i.e., any node can be reached from any source node. The path may be direct or via intermediate nodes.
• No isolated node is considered in the network.
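One simple way to picture a coverage hole under the assumptions above is to sample the region of interest on a grid and mark the cells that no sensor covers. The sketch below is only an illustration of that idea, not the detection mechanism proposed in the paper; the grid sampling, the cell size, and the function names are all assumptions.

```python
import math


def uncovered_cells(sensors, sensing_range, width, height, cell=1.0):
    """Return the centers of grid cells that lie outside every sensor's sensing range."""
    holes = []
    for ix in range(int(width / cell)):
        for iy in range(int(height / cell)):
            cx = (ix + 0.5) * cell
            cy = (iy + 0.5) * cell
            covered = any(
                math.hypot(cx - sx, cy - sy) <= sensing_range for sx, sy in sensors
            )
            if not covered:
                holes.append((cx, cy))
    return holes


def centroid(points):
    """Centroid of a non-empty list of (x, y) points, e.g. of one hole's cells."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


# Hypothetical 4 x 4 region: one central sensor with range 3 covers every cell.
print(uncovered_cells([(2, 2)], 3, 4, 4))
# With no sensors, the whole region is a hole whose centroid is the region center.
print(centroid(uncovered_cells([], 3, 4, 4)))
```

A mobile node moved toward the centroid of the uncovered cells would reduce the hole, which is one possible reading of healing "through hole centroid".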

3.3 Implementation and Result Analysis

The simulation is carried out in a Windows environment using MATLAB. In a basic scenario, assume the given inputs are 500 sensors, a sensing range of 30 units, and a threshold of 5. The randomly deployed sensor positions can be seen in Fig. 7.

Observations:

• From the above scenario, a total of 7 steps are required to cover all the holes.

Fig. 7 Number of sensors = 500, sensing range = 30 unit, threshold = 5

Table 1 Step-wise hole recovery in the sensor network

| Step | Number of holes | Reference |
|------|-----------------|-----------|
| 1 | 63 | Figure 8a |
| 2 | 31 | Figure 8b |
| 3 | 9 | Figure 8c |
| 4 | 6 | Figure 8d |
| 5 | 4 | Figure 8e |
| 6 | 2 | Figure 8f |
| 7 | 1 | Figure 8g |
| 8 | 1 | Figure 8h |
| 9 | 1 | Figure 8i |
| 10 | 0 | Figure 8j |

• However, 2 new holes are generated during the hole healing process, at step 8 and step 9.
• So, a total of 9 steps are needed to obtain a sensor network with no holes.

Table 1 shows the stepwise output.

4 Conclusion and Future Work

An ideal wireless sensor network includes hundreds of sensor nodes and a few base stations (gateway nodes). Primarily, there are three types of nodes (as in Fig. 1): sensor node, base station, and sink. Base stations are the more powerful nodes. While there has been significant progress in the optimization of node positioning in WSNs [17, 18], many challenging problems remain unsolved. The three main design challenges of sensor nodes are low processing power, battery capacity, and connectivity. The first limitation directly constrains the algorithms that can be used. For example, asymmetric key cryptography or modern encryption algorithms [19] should not be used to secure communication, as they need more processing power. The second limitation concerns the attributes of the algorithms used. The use of solar batteries, which can be charged automatically from the environment, has increased [20], but the size of the battery determines the size of a node. The third limitation concerns localization and connectivity issues, and the authors have focused on this issue of a wireless sensor network. The mechanism to fill a hole in a wireless network where sensors are deployed manually is still challenging [21]. To an extent, we have tried to find an existing path between the sensor nodes along with variable distances, holes in a sensor network, and processes of hole healing through hole centroid and threshold values. For future work, we will focus on:

1. Addressing the connectivity of isolated nodes in the network and reducing overlap areas.
2. Implementing mathematical modeling for the resource discovery problem and a manual deployment technique using a matrix for the sensor network.


Fig. 8 Process of hole healing in each step: (a) step 1, 63 holes; (b) step 2, 31 holes; (c) step 3, 9 holes; (d) step 4, 6 holes; (e) step 5, 4 holes; (f) step 6, 2 holes; (g) step 7, 1 hole; (h) step 8, 1 hole; (i) step 9, 1 hole; (j) final output with no hole


References

1. Bensky A (2019) Chapter 14—technologies and applications. In: Bensky A (ed) Short-range wireless communication, 3rd edn. Newnes, pp 387–430. https://doi.org/10.1016/B978-0-12815405-2.00014-2
2. Reverter F (2018) 2—Interfacing sensors to microcontrollers: a direct approach. In: Nihtianov S, Luque A (eds) Smart sensors and MEMs, 2nd edn. Woodhead Publishing Series in Electronic and Optical Materials, Woodhead Publishing, pp 23–55. https://doi.org/10.1016/B978-0-08102055-5.00002-4
3. Chen X (2020) Chapter 6—stochastic scheduling algorithms. In: Chen X (ed) Randomly deployed wireless sensor networks. Elsevier, pp 89–102. https://doi.org/10.1016/B978-0-12819624-3.00011-2
4. Yousefi S, Derakhshan F, Aghdasi HS, Karimipour H (2020) An energy-efficient artificial bee colony-based clustering in the internet of things. Comput Electr Eng 86:106733
5. Chen J, Sackey SH, Anajemba JH, Zhang X, He Y (2021) Energy-efficient clustering and localization technique using genetic algorithm in wireless sensor networks. Complexity 2021
6. Othman MF, Shazali K (2012) Wireless sensor network applications: a study in environment monitoring system. Procedia Eng 41:1204–1210
7. Das S, Bala PS (2013) A cluster-based routing algorithm for WSN based on residual energy of the nodes. Int J Comput Appl 74(2)
8. Girish MVS, Pallam A, Divyashree P, Khare A, Dwivedi P (2021) IoT enabled smart healthcare assistance for early prediction of health abnormality. In: 2021 IEEE international symposium on smart electronic systems (iSES), pp 244–248. https://doi.org/10.1109/iSES52644.2021.00063
9. Kommaraju R, Kommanduri R, Rama Lingeswararao S, Sravanthi B, Srivalli C (2020) IoT based vehicle (car) theft detection. In: International conference on image processing and capsule networks. Springer, pp 620–628
10. More A, Raisinghani V (2017) A survey on energy efficient coverage protocols in wireless sensor networks. J King Saud Univ-Comput Inf Sci 29(4):428–448
11. Raju I, Parwekar P (2016) Detection of sinkhole attack in wireless sensor network. In: Proceedings of the second international conference on computer and communication technologies. Springer, pp 629–636
12. Verma MK, Dwivedi RK (2020) A survey on wormhole attack detection and prevention techniques in wireless sensor networks. In: 2020 International conference on electrical and electronics engineering (ICE3). IEEE, pp 326–331
13. Latif K, Javaid N, Ahmad A, Khan ZA, Alrajeh N, Khan MI (2016) On energy hole and coverage hole avoidance in underwater wireless sensor networks. IEEE Sens J 16(11):4431–4442
14. Tsiota A, Xenakis D, Passas N, Merakos L (2019) On jamming and black hole attacks in heterogeneous wireless networks. IEEE Trans Veh Technol 68(11):10761–10774. https://doi.org/10.1109/TVT.2019.2938405
15. Zhou J, Lu J, Huang S, Fan Z (2010) Location-based routing algorithms for mobile ad hoc networks with holes. In: 2010 International conference on cyber-enabled distributed computing and knowledge discovery, pp 376–379. https://doi.org/10.1109/CyberC.2010.74
16. Senouci MR, Mellouk A, Assnoune K (2013) Localized movement-assisted sensor-deployment algorithm for hole-detection and healing. IEEE Trans Parallel Distrib Syst 25(5):1267–1277
17. Kulkarni RV, Venayagamoorthy GK (2010) Particle swarm optimization in wireless-sensor networks: a brief survey. IEEE Trans Syst Man Cybern Part C (Applications and Reviews) 41(2):262–267
18. El Alami H, Najid A (2020) Optimization of energy efficiency in wireless sensor networks and internet of things: a review of related works. In: Nature-inspired computing applications in advanced communication networks, pp 89–127
19. Rath RK, Hema S, Manohar M, Ch N, Varma I (2019) A new approach to the data security using modern encryption standard. Int J Emerging Technol Innovative Res 6(4):96–100
20. Sharma H, Haque A, Jaffery ZA (2018) Solar energy harvesting wireless sensor network nodes: a survey. J Renew Sustain Energy 10(2):023704
21. Priyadarshi R, Gupta B, Anurag A (2020) Deployment techniques in wireless sensor networks: a survey, classification, challenges, and future research issues. J Supercomput 76(9):7333–7373

The Impact of ICMP Attacks in Software-Defined Network Environments

Kamlesh Chandra Purohit, M. Anand Kumar, Archita Saxena, and Arpit Mittal

Abstract Due to the tremendous growth in network technologies, the number of devices connected to the Internet increases hugely on a daily basis. Currently, network techniques are used in interdisciplinary domains like health care, agriculture, manufacturing industries, and commerce. Traditional network architecture faces huge challenges due to the increase in demand for connectivity. In the current scenario, more than 60% of industries are still using traditional networks, whereas software-defined networks (SDN) and virtualization concepts are slowly starting to replace the current infrastructure. In recent years, SDN has been deployed rapidly due to its advanced features like flexibility, scalability, and centralized monitoring. Software-defined networks also add network programmability as well as other capabilities that increase network adaptability by reacting to continually shifting network conditions and making network certification and implementation easier. Although new capabilities have been added to SDN to make network management easier, various new kinds of vulnerabilities have emerged because security was not adequately considered in the original design of the SDN architecture. This research work mainly focuses on SDN security attacks based on the Internet control message protocol (ICMP). Even though ICMP is used by network administrators to troubleshoot network devices, the protocol itself is more vulnerable in the SDN network environment. The paper addresses three recent attacks that have a huge impact on SDN networks: application layer flooding attacks, distributed denial of service attacks, and unintended DoS attacks.

Keywords Attack · Denial of service · Controller · Networks · Software-defined networks · Intelligence system · Virtualization

K. C. Purohit
Department of Computer Science, Graphic Era Deemed to be University, Dehradun, India

M. Anand Kumar (B) · A. Saxena · A. Mittal
Department of Computer Applications, Graphic Era Deemed to be University, Dehradun, India
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_22


1 Introduction

The software-defined network plays a vital role in the modern communication era by providing a more flexible and dynamic software-controlled environment. The technology is slowly transforming telecommunication networks to the next level of controller-specific environments [1]. The SDN architecture consists of three layers, namely the data plane, the control plane, and the application plane. Data transmissions between the components are done through interfaces [2]. The data plane is separated from the control plane, with the controller being the vital component. The SDN controller interacts with the data plane through the southbound interface and with applications through a northbound interface. Distributed controllers can be interconnected with the help of an east–west bound interface [3]. The controller provides a centralized view of the entire network and simplifies the management of other devices in the network. Figure 1 presents the SDN architecture and its components. Due to its enormous advantages, SDN is deployed in several applications instead of traditional network architecture. As far as large-scale applications are concerned, it is necessary to define strong security policies for sharing information and resources across different platforms [4]. The components of SDN networks present several security vulnerabilities to the underlying applications and devices that need to be addressed seriously [5]. One such class of security attacks exploits the Internet control message protocol, for example denial of service and flooding attacks, which have a huge impact on the devices in the SDN networks. This paper addresses some of the most recent ICMP attacks and their impacts in SDN environments.

Fig. 1 SDN architecture

2 Literature Review

The paper [6] presented DoS attacks in the SDN network based on flooding the data plane. It also addressed several types of DoS attacks and countermeasures, and pointed out how DoS attacks can consume the entire resources of the components of SDN networks, including the memory of the data plane, the CPU of the SDN controller, and the bandwidth of the data plane. The work proposed a lightweight framework called flood-shield to overcome DoS-related issues in SDN networks. The paper [7] proposed an SDN-based security service approach to block unwanted access to the SDN network. This model blocks all the false IP addresses in the entire SDN network with the use of the OpenFlow protocol. The main issue with this approach is that it blocks all the devices in the network, which is not suitable for the majority of applications. The authors [6] suggested a saturation attack detection system based on machine learning that uses a variety of machine learning algorithms as the platform for testing the model. The findings showed that machine learning classifiers can detect attacks easily when compared to other traditional methods. They also showed that the proposed machine learning classifiers can reduce different saturation attacks by up to 90%. They concluded that the machine learning testing tool efficiently undermines saturation attack detection systems based on machine learning.

The authors [8] presented SA-detector, an anomaly detection approach for coping with IP spoofing, UDP spoofing, ICMP flooding, and other types of flooding attacks. FloodGuard [9] and FloodDefender [10] are extensions to SYN flooding attack prevention. They are protocol-agnostic defense mechanisms against many sorts of attack flow (e.g., UDP-based flooding attacks, ICMP-based flooding attacks, or any other flooding-related attacks). The work also proposed a middleware-like component that is used to connect a controller platform to other applications. The paper [11] proposed a security framework for NOX controllers. It provides an integrated environment that lets scripts be written to identify certain attacks, such as controller-based attacks, and to block or drop malicious packets. The main focus of this model is to examine the total traffic in the SDN network to detect flooding attacks. The author [12] employed policy-based SDN functions to isolate network functions, reduce the attacks, and then combine them into composite services. To describe security issues in SDN, the authors [13] offered threat vectors. According to that report, there are no convincing mechanisms for establishing a trust relationship between the controller and applications in SDNs. SDN controllers provide abstractions that applications turn into configuration commands for the underlying infrastructure, so a rogue application could potentially cause havoc in the network.

In SDN, the authors [14] proposed a new approach based on backpropagation neural network classifiers to defend against DoS attacks. They used features of the flow table to defend against targeted attacks on the SDN controller. The test results show that the proposed scheme can reduce DoS-based attacks on the controller to 90% with minimum overhead in terms of efficiency. The authors [15] utilize the generalized entropy approach to identify traffic on the switch and then use a neural network to find the abnormal switch based on whether the generalized entropy value exceeds the threshold. However, because such detection relies on only a few indicators, it is easy to misinterpret normal random bursts in real-world networks as attacks. Furthermore, these procedures require extremely high threshold accuracy.

In [16], the authors presented a model for authentication systems that are deployed outside the controller to authenticate the applications. Only authenticated applications can log in to perform operations or request certain services from the controller. The authentication system is responsible for managing resources, access permissions, application certificates, encryption, authorization, and other security-related services. Using the Floodlight controller architecture as a model, the study [17] presented an authorization method between applications and the SDN environment. The authors of [18] designed a model that only allows applications to share data with trusted third-party applications. The design authorizes the application to use a northbound interface in conjunction with digital signatures and encryption. Such cryptography-based models degrade the performance of the entire network. Shield [19] is a model that uses the control-flow graph (CFG) to analyze the behavior of applications. Using this solution, it is possible to detect suspicious attacks that could lead to changes in internal network parameters. According to the authors, their solution provides defenses against several attack vectors, including data manipulation, impersonation, authorization assignment, and information disclosure, that can result in the network service being shut down.

3 Internet Control Message Protocol (ICMP)

The Internet control message protocol (ICMP) [20] is one of the TCP/IP model protocols that use client/server terminology. All IP-enabled end systems and intermediate devices like routers use ICMP frequently for troubleshooting the network. ICMP is used to report issues in the network or in intermediate devices like routers, hubs, and switches. Important features of ICMP include reporting when end systems (ES) are not responding to requests, congestion in the network, IP header issues, and other network-related issues. The protocol is frequently used by network administrators to ensure that end systems are functioning properly and that routers are correctly directing packets to their intended destinations. ICMP is one of the network layer protocols of the TCP/IP model [21]. Although it belongs to the network layer, its messages are not passed directly to the data link layer; they are encapsulated in IP datagrams before being sent to the lower layer. The protocol field of the IP header has a value of one to indicate that the IP data is an ICMP message.


3.1 Operations of ICMP Protocol

ICMP, as its name suggests, is a protocol that defines control messages [22]. It is primarily concerned with providing a mechanism for any IP-enabled device to deliver error messages to another IP machine in the network. ICMP has several message formats that allow different sorts of data to be transferred. For example, in response to a message sent by Host0 to Host1 and forwarded by router0, router1 generates ICMP packets. When the MTU of the link between router0 and router1 is less than the size of the IP packet, and the packet has the don't fragment (DF) bit set in its IP header, the ICMP message is delivered to Host0.

3.2 ICMP Messages

One of the most significant protocols of the Internet protocol suite is the Internet control message protocol (ICMP). It is largely used by operating systems in computer networks to transmit error messages. ICMP [23] is a critical element of IP that must function. It differs from TCP and UDP in that it is rarely utilized for data transmission between end systems. User network programs and devices rarely use this protocol directly, except through the ping and traceroute commands. The issues it reports include unannounced network faults, such as the inaccessibility of a host or a network portion owing to a malfunction. A TCP or UDP packet directed at a port number with no receiver attached is also reported through ICMP. A router buffers packets when more packets arrive than it can transmit in a given time interval, and this congestion can also be announced through ICMP. To assist in troubleshooting, the Echo function in ICMP simply sends a message back and forth between two hosts. Ping, one of the popular network administration tools for checking the availability of a device in the network, sends out a series of packets to calculate loss percentages and average round-trip times. Timeouts are also announced: when the TTL field of an IP packet becomes zero, the router discards the packet and sends an ICMP message to the source to report the failure to deliver the packet to the destination. Traceroute is a command that uses packets with small TTL values to map network paths while monitoring ICMP timeout notices.
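The TTL behaviour that traceroute relies on can be illustrated with a small simulation. The sketch below does not send any real packets; the path, the router names, and the two helper functions are hypothetical and only model the rule that a router decrements the TTL and reports a timeout when it reaches zero.

```python
def forward(path, ttl):
    """Walk a packet along `path` (router names, destination last).

    Each router decrements the TTL; the router that brings it to zero
    discards the packet and reports a time-exceeded message.
    """
    for hop in path[:-1]:
        ttl -= 1
        if ttl == 0:
            return ("time_exceeded", hop)
    return ("delivered", path[-1])


def traceroute(path):
    """Probe with TTL = 1, 2, 3, ... and collect the node that answers each probe."""
    hops = []
    ttl = 1
    while True:
        status, node = forward(path, ttl)
        hops.append(node)
        if status == "delivered":
            return hops
        ttl += 1


# Hypothetical path: two routers in front of the destination host.
print(traceroute(["router0", "router1", "Host1"]))
```

Each probe with a larger TTL travels one hop farther, so the sequence of reporting routers reconstructs the path.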

3.3 ICMP Message Types

Network errors are reported to the host using ICMP messages. The faults may be in the network, a router, or any other intermediate device. The source can quickly determine the cause of errors by observing these types of messages. Query and error reporting are the two types of ICMP messages that can be used for troubleshooting network issues. When a device such as a host or router processes an IP packet, error reporting messages report the errors that are encountered. Destination unreachable, source quench, time exceeded, parameter problem, and redirection are the error reporting messages that ICMP provides to hosts and routers [24]. Query messages, which occur in pairs, help a host, router, or network manager obtain information from another host or router in the network. Devices in the network can locate routers and collect router information for further processing. Routers can also assist hosts through redirection messages carrying updated information about the router and routing table. Echo request and reply, timestamp request and reply, and router advertisement and solicitation are the query message types provided by ICMP.

The following are some key points to remember regarding ICMP error messages:

1. ICMP error messages are not generated in response to a datagram that itself carries an ICMP error message.
2. ICMP error messages are not generated for a fragmented datagram that is not the first fragment.
3. ICMP error messages are not generated for a datagram that has a multicast address.
4. ICMP error messages are not generated for a datagram that has a special address such as 127.0.0.0 or 0.0.0.0.
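The four rules above can be written as a single check. The following is a minimal sketch using Python's standard `ipaddress` module; the function name and parameters are assumptions for illustration, the type numbers are the standard ICMP type values for the error reporting messages listed earlier, and applying the special-address test to both source and destination is a choice made for this example.

```python
import ipaddress

# ICMP type numbers of the error reporting messages:
# 3 destination unreachable, 4 source quench, 5 redirect,
# 11 time exceeded, 12 parameter problem.
ICMP_ERROR_TYPES = {3, 4, 5, 11, 12}


def may_send_icmp_error(src, dst, fragment_offset=0, icmp_type=None):
    """Return True if an ICMP error may be generated for the offending datagram.

    `icmp_type` is the ICMP type carried by the datagram, or None if it
    does not carry an ICMP message.
    """
    if icmp_type in ICMP_ERROR_TYPES:
        return False  # rule 1: never answer an ICMP error with another error
    if fragment_offset != 0:
        return False  # rule 2: only the first fragment may trigger an error
    source = ipaddress.ip_address(src)
    destination = ipaddress.ip_address(dst)
    if source.is_multicast or destination.is_multicast:
        return False  # rule 3: multicast address
    for address in (source, destination):
        if address.is_loopback or address.is_unspecified:
            return False  # rule 4: special addresses such as 127.0.0.0 or 0.0.0.0
    return True


print(may_send_icmp_error("10.0.0.1", "10.0.0.2"))
print(may_send_icmp_error("10.0.0.1", "224.0.0.1"))
```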

3.4 ICMP Message Format

The structure of an ICMP message can be thought of as having a common part and a unique part [24]. The common part of all ICMP messages consists of three fields with the same size and meaning (although the values in the fields vary depending on the ICMP message type). Each type of message has its own set of fields in the unique part (Fig. 2).

Fig. 2 ICMP packet format

Network troubleshooting entails identifying and resolving networking issues so that the network keeps performing at its best. The primary role of a network administrator is to maintain network connectivity to all the devices. ICMP assists administrators by helping them track the status of connections and improve their performance. To troubleshoot with ICMP, the ICMP traffic on the network should first be captured. A network analyzer can be used to record all TCP/IP traffic while filtering for just ICMP traffic. After configuring the network analyzer to filter ICMP traffic, examine the ICMP traffic that passes through the network. Although some redirect messages are common (especially during morning start-up hours), if one device is frequently being redirected before talking with other network devices, that device should be assigned a different default gateway.
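As an illustration of the message format in Sect. 3.4, the three common fields are the type (1 byte), the code (1 byte), and the checksum (2 bytes). The sketch below builds an echo request (type 8, code 0), whose unique part is an identifier and a sequence number, and computes the standard Internet checksum. The function names are chosen for this example, and the code only constructs bytes in memory; it does not send anything on the network.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of the data, complemented."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF


def build_echo_request(identifier: int, sequence: int, payload: bytes = b"") -> bytes:
    """Pack type, code, checksum (common part) and identifier, sequence (unique part)."""
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload


def parse_common_header(message: bytes):
    """Return (type, code, checksum) from the first four bytes of an ICMP message."""
    return struct.unpack("!BBH", message[:4])


message = build_echo_request(identifier=1, sequence=1, payload=b"ping")
print(parse_common_header(message))
# A message with a correct checksum sums to zero when checked again.
print(internet_checksum(message))
```

A capture filter of the kind described above would first select IP datagrams whose protocol field is 1 and then read these four bytes to classify each ICMP message.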

4 ICMP Attacks in SDN Environments

In the SDN environment, it is very difficult to detect ICMP attacks because they occupy little bandwidth. ICMP-based denial of service attacks are usually launched by a single host over a very short duration, whereas distributed denial of service attacks are launched through multiple hosts or devices at the same time. Network administrators find it difficult to detect ICMP-based attacks because they use little network traffic and closely resemble legitimate traffic or hosts. ICMP attacks occupy most of the system resources in a very short period and degrade the performance of the entire network. High-rate DDoS attacks are frequently aggressive, bombarding the host with a large number of malicious packets to drain its resources. These flooding attacks consume a large amount of bandwidth, requiring the intruder to deploy numerous machines to route the huge traffic toward the victim system and application layers. To launch an attack, the attacker uses only limited resources, but later the attacker takes complete control over the network devices, including the SDN controller.

4.1 Application Layer Flooding

Application layer flooding is one of the serious issues in any SDN network environment, where the entire service goes down in a short period. Flooding attacks use spoofed IP addresses to send unwanted service requests to intermediate devices like switches in the SDN network. Within a short period, the attack uses all the available resources in the network and makes the connection unavailable to other trusted applications that are running simultaneously. In Fig. 3, host D with IP address 172.16.23.142 is a device with a spoofed IP address in the network. The host continuously requests service through the switch (S2), which in turn forwards the requests to the controller. Within 30 s, the connection becomes unavailable to the other devices in the entire network. Flooding attacks also increase CPU utilization within a fraction of a second due to unwanted traffic in the network.


Fig. 3 Attack scenario in SDN

Distributed attacks are similar to DoS attacks, except that flow requests are initiated from several clients or devices at the same time. They are among the most complicated attacks in an SDN network environment because it is very difficult to detect the source of the attack. These attacks use little bandwidth initially, but within a short time they start using the entire bandwidth. Most DDoS attacks target firewalls, which increases the packet filtering time and CPU utilization. Some of the important issues related to DDoS attacks in SDN are as follows:

• They flood the entire bandwidth of the switches, the control plane, and the SDN controller.
• They cause network failure for legitimate hosts in the network.
• They generate massive numbers of packet-in messages to the controller.
• They consume higher communication bandwidth, memory, and CPU utilization.

4.2 SYN Flooding

SYN flooding attacks are among the typical DoS attacks in SDN environments; they not only block a particular device or host but also flood all the devices in the network. The attack transmits a large number of control messages, like the ICMP messages discussed in the previous section. It repeatedly sends connection requests to the SDN switch, which in turn sends the requests to the SDN controller. This generates traffic between the SDN switch and the controller that overloads request processing, causes delays, and affects the entire communication channel.
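The reason spoofed requests load the controller can be seen in a toy model of a switch. The sketch below assumes a deliberately simplified switch that keeps one flow rule per (source, destination) pair and asks the controller once for every pair it has not seen; the function name and the addresses are hypothetical, and real OpenFlow matching, rule timeouts, and table limits are ignored.

```python
def count_packet_in(packets):
    """Count controller requests and installed rules for a stream of (src, dst) packets."""
    flow_table = set()
    packet_in = 0
    for src, dst in packets:
        if (src, dst) not in flow_table:
            packet_in += 1              # table miss: the switch asks the controller
            flow_table.add((src, dst))  # the controller installs a rule for this pair
    return packet_in, len(flow_table)


# 1000 packets from one legitimate host reuse a single flow rule.
legit = [("10.0.0.1", "10.0.0.9")] * 1000
# 1000 packets with a different spoofed source each miss the table every time.
spoofed = [(f"172.16.{i // 256}.{i % 256}", "10.0.0.9") for i in range(1000)]

print(count_packet_in(legit))
print(count_packet_in(spoofed))
```

In this model the same number of packets produces one controller request for the legitimate host and one request per packet for the spoofed stream, which is the overload pattern described above.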

5 Experiment and Results

The security attacks are tested using the Mininet simulator running on a virtual machine with the Windows operating system. Hyper-V virtualization technology, which enables virtualized computer systems on the Windows platform, was used. Analytical modeling, measurement, and evaluation have been identified as the three main approaches commonly used for evaluating communication network systems. Sixty Core i7 machines at 3.40 GHz, one IBM Intel server, 8 GB of RAM, and 64-bit Windows operating systems are used for the evaluation. The experimental topology is presented in Fig. 4. Before analyzing the impact of attacks in an SDN environment, threshold values are set for the network components with different parameters. After the attacks are evaluated, these values are compared to get accurate results. The threshold parameters are CPU utilization, memory utilization, storage space, and throughput. Table 1 presents the recorded parameter values before the attack scenario. The impact of ICMP attacks on the SDN environment is analyzed in this section. The metrics used for this study are CPU performance, control channel bandwidth, packet delivery ratio, and flow request analysis. Based on these parameters, the impact of ICMP attacks is evaluated.

Fig. 4 Experimental topology

Table 1 Threshold value before analysis

| S. No. | Parameter | Initial threshold values (%) |
|--------|-----------|------------------------------|
| 1 | CPU utilization | 12 |
| 2 | Memory utilization | 23 |
| 3 | Storage space | 42 |
| 4 | Throughput | 96 |

Table 2 CPU utilization: performance (%) under each attack

| Time in seconds | Test run 1: App. flooding | Test run 1: DDoS | Test run 1: SYN flooding | Test run 2: App. flooding | Test run 2: DDoS | Test run 2: SYN flooding |
|-----|------|------|------|------|------|------|
| 100 | 8.3 | 18.7 | 9.3 | 8.9 | 18.1 | 9.2 |
| 200 | 14.7 | 38.1 | 13.2 | 13.2 | 32.4 | 12.2 |
| 300 | 28.3 | 50.1 | 21.3 | 23.7 | 40.1 | 21.6 |
| 400 | 32.7 | 61.2 | 36.3 | 33.6 | 57.1 | 35.2 |
| 500 | 44.1 | 73.6 | 49.1 | 45.2 | 63.3 | 48.5 |
| 600 | 52.1 | 81.2 | 61.9 | 51.6 | 71.6 | 60.7 |
| 700 | 59.3 | 92.3 | 72.8 | 61.2 | 82.3 | 72.7 |

5.1 CPU Utilization

The analysis shows that CPU utilization is very high for DDoS attacks when compared to application flooding and SYN flooding. The reason is that the DDoS attacks took place on 12 of the 60 machines in the network. In the initial stages, CPU utilization is low for all three attacks, but it gradually increases within a short period, with DDoS attacks occupying more than 90% of the total CPU. Table 2 shows the statistics for all three ICMP attacks in two rounds of analysis at different time intervals. Figures 5 and 6 show the comparison of CPU utilization for ICMP attacks.
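The comparison can be checked directly against the values reported in Table 2. The short sketch below transcribes the table (readings at 100 s to 700 s) and finds the peak CPU utilization per attack in each test run; the variable and function names are chosen for this example.

```python
# CPU utilization (%) from Table 2, sampled at 100, 200, ..., 700 seconds.
RUN_1 = {
    "app_flooding": [8.3, 14.7, 28.3, 32.7, 44.1, 52.1, 59.3],
    "ddos": [18.7, 38.1, 50.1, 61.2, 73.6, 81.2, 92.3],
    "syn_flooding": [9.3, 13.2, 21.3, 36.3, 49.1, 61.9, 72.8],
}
RUN_2 = {
    "app_flooding": [8.9, 13.2, 23.7, 33.6, 45.2, 51.6, 61.2],
    "ddos": [18.1, 32.4, 40.1, 57.1, 63.3, 71.6, 82.3],
    "syn_flooding": [9.2, 12.2, 21.6, 35.2, 48.5, 60.7, 72.7],
}


def peak(run):
    """Highest CPU utilization reached by each attack in one test run."""
    return {attack: max(values) for attack, values in run.items()}


def highest_attack(run):
    """Name of the attack with the highest peak CPU utilization."""
    peaks = peak(run)
    return max(peaks, key=peaks.get)


for name, run in (("test run 1", RUN_1), ("test run 2", RUN_2)):
    print(name, highest_attack(run), peak(run))
```

With these values, DDoS has the highest peak in both runs: 92.3% in test run 1 and 82.3% in test run 2.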

5.2 Control Channel Bandwidth When the host requests pass through a control channel during the ICMP attacks, then the channel becomes unavailable due to the lack of bandwidth. During the attack

The Impact of ICMP Attacks in Software-Defined Network Environments


Fig. 5 CPU utilization analysis

Fig. 6 CPU performance

period, bandwidth consumption increases in all three cases, with DDoS attacks consuming the most bandwidth. Table 3 presents the consumption of channel bandwidth by ICMP attacks during the evaluation period. Figure 7 shows the bandwidth distribution among the three ICMP attacks.

5.3 Packet Delivery Ratio

The packet delivery ratio is calculated as the ratio of the number of packets received by the destination machine to the total number of packets sent by the source machine. The packet loss ratio also plays a vital role in evaluating the packet delivery ratio. In our experiment, TCP packets are sent from the source host to the destination host. Then,


Table 3 Consumption of channel bandwidth (Kbps), test run 1

| Time in seconds | DDoS | App. flooding | SYN flooding |
|---|---|---|---|
| 100 | 141.7 | 102.3 | 40.2 |
| 200 | 163.2 | 104.6 | 32.3 |
| 300 | 153.6 | 101.2 | 41.1 |
| 400 | 138.1 | 111.1 | 40.7 |
| 500 | 156.7 | 104.8 | 31.3 |
| 600 | 162.3 | 108.3 | 41.3 |
| 700 | 138.2 | 102.5 | 44.4 |

Fig. 7 Control channel bandwidth analysis

the counter is used to store the number of successfully delivered and dropped packets. In the attack scenario, the packet delivery ratio is low for all three attacks. Table 4 presents the complete statistics on the packets sent and the packet delivery ratio in the SDN network during the analysis period. Figure 8 shows that the packet delivery rate is below 50% in all three cases.

Table 4 Packet delivery ratio

| Attacks | Test run I: Dropped | Test run I: Delivery percentage | Test run II: Dropped | Test run II: Delivery percentage |
|---|---|---|---|---|
| App. flooding | 27,913 | 48.62 | 30,667 | 49.09 |
| DDoS | 26,940 | 40.63 | 36,138 | 35.74 |
| SYN flooding | 37,833 | 26.00 | 33,766 | 48.05 |
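The packet delivery ratio described in this subsection can be sketched in a few lines of Python. This is only an illustration of the definition (packets received divided by packets sent, as a percentage); the function names and the counter values are hypothetical and are not taken from the authors' Mininet setup.

```python
def packet_delivery_ratio(sent: int, received: int) -> float:
    """Return the delivery percentage: received / sent * 100."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return 100.0 * received / sent


def packet_loss_ratio(sent: int, received: int) -> float:
    """Return the loss percentage, the complement of the delivery percentage."""
    return 100.0 - packet_delivery_ratio(sent, received)


# Hypothetical counter values, not measurements from the paper.
example_sent = 1000
example_received = 450
example_pdr = packet_delivery_ratio(example_sent, example_received)
example_loss = packet_loss_ratio(example_sent, example_received)
```

With counters like these, a delivery percentage under 50 corresponds to the situation reported for all three attacks.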


Fig. 8 Packet delivery analysis

5.4 Flow Request Analysis

Flow requests are one of the vital components of SDN traffic. Due to DoS attacks, more flow rules are installed by the end switch. Most of the attacks rely only on the flow request feature to overload the SDN controller. Figure 9 indicates that DDoS attacks generate more packet-in messages than the other two attacks. It also shows that there are more unwanted packet-in requests during the attack. The average number of packet-in messages is below 1000 per minute for application flooding attacks in the initial stages of execution. The analysis shows that most ICMP attacks are very dangerous in the SDN environment. All the ICMP attacks had a huge impact on the SDN environment, where these attacks flood the entire bandwidth of the switches, control plane, and SDN controller, generating massive numbers of packet-in messages to the controller and consuming more communication bandwidth, memory, and CPU.

Fig. 9 Flow request analysis

6 Conclusion

Even though ICMP is used by network administrators for troubleshooting network devices, the protocol itself is more vulnerable in the SDN environment. High-rate DDoS assaults are frequently aggressive, bombarding the host with a large number of malicious packets to drain its resources. Usually, ICMP-based denial of service attacks are launched by a single host over a very short duration, whereas distributed denial of service attacks are launched through multiple hosts or devices at the same time in an SDN environment. Network administrators find it difficult to detect ICMP-based attacks because they generate little traffic and closely resemble legitimate traffic or hosts. The paper evaluated three recent attacks that have a huge impact on SDN networks: application layer flooding attacks, distributed denial of service attacks, and SYN flooding attacks. The final analysis has shown that ICMP attacks have a huge impact on SDN networks in terms of the efficiency of providing services to the end devices in the network.


An Area-Efficient Unique 4:1 Multiplexer Using Nano-electronic-Based Architecture

Aravindhan Alagarsamy, K. Praghash, and Geno Peter

Abstract Quantum dot cellular automata (QCA) computing is a new way to develop systems with less power consumption. Nanotechnology-based computing has made the QCA principles more relevant with respect to the critical limitations of current VLSI-based design. In this paper, a novel 4:1 multiplexer design based on the QCA concept is presented. Compared to previous multiplexer designs, this novel design is area efficient and power efficient. A five-input majority voter is used for the design of the multiplexer. The 4:1 multiplexer is constructed from three 2:1 multiplexers.

Keywords Quantum dot cellular automata · Multiplexer · Majority gates

A. Alagarsamy (B)
Multi-Core Architecture Computation (MAC) Lab, Department of Electronics and Communication Engineering, Koneru Lakshmaiah Education Foundation, Vaddeswaram, AP 522501, India
e-mail: [email protected]

K. Praghash
Department of Electronics and Communication Engineering, CHRIST University, Bengaluru, India

G. Peter
CRISD, School of Engineering and Technology, University of Technology Sarawak, Sibu, Malaysia
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_23

1 Introduction

Quantum dot cellular automata (QCA) are a new approach to transistor-free computation. According to the 2011 International Technology Roadmap for Semiconductors, current VLSI-based CMOS implementations were expected to reach their scaling limits by 2019, which motivated QCA-based computation methods. QCA enables area-, delay-, and power-efficient designs that avoid other effects common to CMOS implementations. The basic components used in the QCA design method are inverters, majority gates, and binary wires. Implementing digital logic circuits based on a combination of majority gates is very attractive and time-consuming [1].

Fig. 1 Two ground state polarization of QCA cell

1.1 QCA Cell

Quantum dot cell placement is used to build logical Boolean expressions. There is no actual movement of electrons from one cell to another. The flow of information in the QCA structure is realized by the change in the polarization of each quantum dot cell, which greatly reduces power consumption. A QCA cell is a square planar structure with four quantum dots and two electrons, which tunnel from one dot to another; the resulting electron positions determine the polarization of the cell [2]. As shown in Fig. 1, there are two polarization states, “+1” and “−1”, depending on the positions of the electrons, which are treated as the binary values “1” and “0”.

1.2 QCA Majority Gate

As shown in Fig. 2, a three-input majority voter consists of three input cells, a center cell, and an output cell [3]. The output of the majority voter is the majority of the three inputs provided. The central cell does not have its own configuration; Coulombic interaction with the adjacent cells drives it to a polarization state [4]. Then, depending on the polarization of the central cell, the output cell adjusts its polarization. By fixing one input cell to logical “1” or “0”, the majority voter acts as an “OR” or “AND” gate, respectively [5]. The functional formula for the majority voter [6] is given below:

$$M_3(A, B, C) = AB + BC + CA$$


Fig. 2 Three-input majority gates

1.3 QCA Inverter

The inverter structure is used to implement the “NOT” function. Due to the general placement of QCA cells, the output cell achieves a polarization opposite to that of the input cell, as shown in Fig. 3 [2].

Fig. 3 QCA inverter cell


Fig. 4 Five-input majority gate QCA structure

2 Five Input Majority Gate

A five-input majority voter is used to create an area-efficient layout of the structure. As shown in Fig. 4, the use of a five-input majority voter significantly reduces design complexity and the delay observed in simulation. This majority voter can be used to implement an ALU [4], a flip-flop, or similar digital logic functions [7]. The output is the majority of the five given inputs [1]. The output formula for a five-input majority voter is shown below [8–10]:

$$M_5(A, B, C, D, E) = ABC + ABD + ABE + ACD + ACE + ADE + BCD + BCE + BDE + CDE$$

Five-input majority gates can be used to implement complex structures by fixing the polarization of some inputs or by providing different inputs, depending on the required function [11].
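The sum-of-products forms of the three-input and five-input majority functions can be checked exhaustively against a direct "more than half the inputs are 1" definition. The following Python sketch is purely a logic-level check with hypothetical helper names; it says nothing about the QCA cell layout itself.

```python
from itertools import combinations, product


def majority(*bits: int) -> int:
    """Return 1 when more than half of the inputs are 1 (odd input count)."""
    return int(sum(bits) > len(bits) // 2)


def sum_of_products(bits, k):
    """OR of the AND of every k-input combination, as in the M3 and M5 formulas."""
    return int(any(all(c) for c in combinations(bits, k)))


# M3 = AB + BC + CA: one product term per pair of inputs.
m3_ok = all(majority(*v) == sum_of_products(v, 2) for v in product((0, 1), repeat=3))
# M5: one product term per triple of inputs (ten terms in total).
m5_ok = all(majority(*v) == sum_of_products(v, 3) for v in product((0, 1), repeat=5))
# Fixing one input of M3 gives AND (fixed 0) or OR (fixed 1).
and_ok = all(majority(a, b, 0) == (a & b) for a, b in product((0, 1), repeat=2))
or_ok = all(majority(a, b, 1) == (a | b) for a, b in product((0, 1), repeat=2))
```

All four flags evaluate to true, which confirms that the ten product terms listed for the five-input voter are exactly the three-input combinations of its five inputs.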

3 2:1 Multiplexer

The multiplexer is a very important component for realizing digital logic circuits. It passes the required input to the output based on the selection line signal [12]. The traditional 2:1 multiplexer design requires two three-input majority voters for the “AND” operations and one for the “OR” operation. Using a five-input majority voter reduces structural complexity, so that the 2:1 multiplexer logic function can be achieved using only one three-input majority voter and one five-input majority voter, as shown in Fig. 2. The output of the three-input majority voter drives the 2nd and 3rd inputs of the five-input majority voter, and the other inputs are A, S, and the constant fixed input ‘1’ [1] (Fig. 5). The functional expression for the five-input majority gate is as follows:

$$M_5\left(A,\; M_3(B, 0, \bar{S}),\; M_3(B, 0, \bar{S}),\; S,\; 1\right) = M_5\left(A,\; B\bar{S},\; B\bar{S},\; S,\; 1\right) = AS + B\bar{S}$$

Fig. 5 Five-input majority gates-based 2:1 multiplexer
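A truth-table check of the 2:1 multiplexer expression is sketched below in Python. It assumes that the three-input voter receives the complemented select line, so that the five-input voter yields AS + B·S̄; the function names are hypothetical and the model is purely logical.

```python
from itertools import product


def majority(*bits: int) -> int:
    """Return 1 when more than half of the inputs are 1 (odd input count)."""
    return int(sum(bits) > len(bits) // 2)


def mux2(a: int, b: int, s: int) -> int:
    """2:1 multiplexer from one three-input and one five-input majority voter."""
    s_bar = 1 - s
    and_term = majority(b, 0, s_bar)              # M3(B, 0, S') = B AND S'
    # The AND term drives the 2nd and 3rd inputs; the last input is fixed to 1.
    return majority(a, and_term, and_term, s, 1)


# Compare against the target expression AS + B S' for all eight input vectors.
mux2_ok = all(
    mux2(a, b, s) == ((a & s) | (b & (1 - s)))
    for a, b, s in product((0, 1), repeat=3)
)
```

Because the AND term occupies two of the five inputs and one input is tied to 1, the voter outputs 1 either when B·S̄ is 1 or when both A and S are 1, which is the multiplexer behaviour.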

4 Proposed 4:1 Multiplexer

The proposed 4:1 multiplexer design is implemented by treating the five-input-majority-based 2:1 multiplexer as a module, as shown in Fig. 1. Three such modules are used to build the proposed design. The design is conceptually much simpler and requires less implementation effort than the previous designs. The proposed design is considered a reusable module because it is on a single layer with no crossover. As shown in Fig. 6, A, B, C, and D are the inputs to the 4:1 multiplexer, and S1 and S0 are the selection lines. The required output of the multiplexer is obtained based on the combination of the selection lines.
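The composition of a 4:1 multiplexer from three 2:1 modules can be sketched behaviourally as follows. The assignment of S0 to the first stage and S1 to the second stage, and the convention that a select value of 1 picks the first input, are assumptions made for illustration; the paper does not state the mapping explicitly.

```python
from itertools import product


def mux2(a: int, b: int, s: int) -> int:
    """Behavioural 2:1 multiplexer: returns a when s is 1, otherwise b."""
    return a if s else b


def mux4(a: int, b: int, c: int, d: int, s1: int, s0: int) -> int:
    """4:1 multiplexer built from three 2:1 modules (assumed select mapping)."""
    upper = mux2(a, b, s0)        # first-stage module for inputs A and B
    lower = mux2(c, d, s0)        # first-stage module for inputs C and D
    return mux2(upper, lower, s1) # second-stage module picks between the pairs


# Under the assumed mapping, (S1, S0) selects: 11 -> A, 10 -> B, 01 -> C, 00 -> D.
expected_index = {(1, 1): 0, (1, 0): 1, (0, 1): 2, (0, 0): 3}
mux4_ok = all(
    mux4(*inputs, s1, s0) == inputs[expected_index[(s1, s0)]]
    for inputs in product((0, 1), repeat=4)
    for s1, s0 in product((0, 1), repeat=2)
)
```

The same pattern extends to 8:1 and 16:1 multiplexers by adding further stages, each driven by one more selection line.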

5 Simulation and Experimental Result Analysis

QCA designs are implemented and simulated using the open software QCA Designer 2.0.3. The QCA Designer tool has two simulation engines: a bistable vector engine and a coherence vector engine. In recent years, the implementation and design of QCA-based arithmetic logic circuits have been studied. Existing designs were created in multiple layers to improve performance.


Fig. 6 Logic diagram for five-input majority gates-based 2:1 multiplexer

This paper describes a single-layer approach from the 4444 QCA design that provides the same functionality as the implemented multi-layer logic. We used QCA Designer to implement and validate all the proposed designs. The proposed structure is mainly evaluated with the coherence vector simulation engine. A 4:1 multiplexer based on five-input majority voters is implemented using the QCA Designer tool. The practical realization of multi-layer QCA designs is still an open problem. In this situation, QCA designs limited to single-layer implementations are much easier to realize. The proposed design avoids this problem by being restricted to a single layer. It is superior to previously developed designs in terms of cell count, area, and circuit delay. The designs proposed in this paper consist of a 2:1 multiplexer and a 4:1 multiplexer.

5.1 Implementation of 2:1 and Proposed 4:1 Multiplexer

Figure 7 shows the implementation of the QCA-based 2:1 multiplexer built on a five-input majority voter. Figure 8 shows the simulation results of the 2:1 multiplexer. Figure 9 shows the implementation of the five-input-majority-based 4:1 multiplexer, built from 2:1 multiplexer modules. Figure 10 shows the simulation results of the 4:1 multiplexer. The simulation was performed in both the bistable vector environment and the coherence vector environment, and the correct output waveform was obtained under both conditions.


Fig. 7 Implementation of 2:1 multiplexer

5.2 Complexity of Proposed 4:1 Multiplexer

The 2:1 multiplexer design consists of 23 cells and occupies an area of approximately 0.05 µm², and it requires 11 s of simulation time in the coherence vector environment. The 4:1 multiplexer design consists of 98 cells and occupies an area of approximately 0.21 µm², with a simulation time of 47 s in the coherence vector environment.

5.3 Analysis of Proposed 4:1 Multiplexer

The proposed 4:1 multiplexer is compared in this section with existing QCA structures. Table 1 compares the proposed QCA design with previous approaches in terms of area, complexity, and maximum delay. The results indicate that the proposed QCA approach for the 4:1 mux improves the QCA area by 28.5% and 92% over the multiplexer structures in [13, 14], respectively. The area of the QCA structure in [15] is not available (NA in Table 1), so it is not compared with our proposed structure. Similarly, the proposed architecture achieves a complexity (cell count) improvement of 8.9% and 52.7% over the QCA structures in [13, 14], respectively. In terms of maximum delay in clock cycles, the proposed QCA architecture improves by 6.6%, 29.2%, and 28.8% over the mux structures in [13–15], respectively. Figure 11 shows the cell count comparison of the proposed QCA architecture against the existing multiplexer architectures. The 4:1 multiplexer in this work provides an improvement in cell count over the existing approaches. The area and delay comparison charts are shown in Figs. 12 and 13, respectively. The results show that the QCA architecture of the 4:1 mux in this work outperforms the existing designs. According to the plots in Fig. 12, the proposed QCA architecture achieves improvements of 40%, 32%, and 52.46% over the work in [16–18], respectively.

Fig. 8 Simulation result of 2:1 multiplexer

Fig. 9 Implementation of 4:1 multiplexer

6 Conclusion

The quantum dot cellular automata (QCA)-based 4:1 multiplexer design is a compact and efficient design implemented in a single layer with no crossover. The five-input-majority-based 2:1 multiplexer structure is the basic module used in the design of the 4:1 multiplexer. Multiplexers are important elements with a variety of uses in the design of complex high-end systems such as ALU blocks. By treating the proposed 4:1 multiplexer as a module, higher-order multiplexers such as 8:1 and 16:1 multiplexers can be designed.


Fig. 10 Simulation result of 4:1 multiplexer

Table 1 Comparison of the proposed 4:1 mux QCA over existing approach

| 4:1 Multiplexer with QCA | Area (µm²) | Complexity (cell #) | Maximum delay (clock cycle) |
|---|---|---|---|
| 4:1 mux in [13] | 0.14 | 112 | 1.35 |
| 4:1 mux in [14] | 1.25 | 216 | 1.78 |
| 4:1 mux in [15] | NA | 96 | 1.77 |
| Proposed 4:1 mux | 0.10 | 102 | 1.26 |
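The improvement percentages quoted in Sect. 5.3 can be recomputed from the Table 1 values, assuming the usual relative-reduction formula (existing − proposed) / existing × 100. The short Python sketch below uses only the numbers in Table 1; the function and variable names are hypothetical.

```python
def improvement_percent(existing: float, proposed: float) -> float:
    """Relative reduction of the proposed value with respect to an existing one."""
    return 100.0 * (existing - proposed) / existing


# Values copied from Table 1; None marks the area that is not available for [15].
existing_designs = {
    "[13]": {"area": 0.14, "cells": 112, "delay": 1.35},
    "[14]": {"area": 1.25, "cells": 216, "delay": 1.78},
    "[15]": {"area": None, "cells": 96, "delay": 1.77},
}
proposed = {"area": 0.10, "cells": 102, "delay": 1.26}

improvements = {
    ref: {
        metric: improvement_percent(value, proposed[metric])
        for metric, value in row.items()
        if value is not None
    }
    for ref, row in existing_designs.items()
}
# Note: the cell-count entry for [15] is negative, because that design
# uses 96 cells against 102 for the proposed one.
```

Under this assumption the recomputed area, cell-count, and delay figures for [13] and [14], and the delay figure for [15], agree with the quoted percentages to within about 0.1 percentage point.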

Fig. 11 Comparison of cell counts of the proposed 4:1 multiplexer QCA architecture


Fig. 12 Comparison of area of the proposed 4:1 multiplexer QCA architecture

Fig. 13 Comparison of delay of the proposed 4:1 multiplexer QCA architecture

Acknowledgements The first author would like to thank DST—FIST for funding the lab facility for supporting this research under grant number SR/FST/ET-II/2019/450.


References

1. Peter G, Sherine A, Teekaraman Y, Kuppusamy R, Radhakrishnan A (2022) Histogram shifting-based quick response steganography method for secure communication. Wirel Commun Mob Comput 2022:1–11
2. Chabi AM, Sayedsalehi S, Angizi S, Navi K (2014) Efficient QCA exclusive-or and multiplexer circuits based on a nanoelectronic-compatible designing approach. Int Sch Res
3. Jaiswal R, Sasamal TN (2017) Efficient design of exclusive-or gate using 5-input majority gate in QCA. IOP Conf Ser Mater Sci Eng 225(1):012143
4. Alagarsamy A, Praghash K, Arunmetha S, Kumar KS, Sekar R (2012) Review on nanoelectronic computing approach: quantum dot cellular automata. In: 5th International conference on electronics, communication and aerospace technology (ICECA). IEEE, Coimbatore, pp 106–111
5. Perri S, Corsonello P, Cocorullo G (2014) Area-delay efficient arithmetic logic unit using QCA. IEEE Trans. on VLS 22(5):1174–1179
6. Waje MG, Dakhole PK (2013) Design and implementation of 4-bit arithmetic logic unit using quantum dot cellular automata. In: Proceedings of 2013 3rd IEEE international advance computing conference (IACC), vol 1, pp 1022–1029, Ghaziabad, India
7. Snider GL et al (1999) Quantum-dot cellular automata: line and majority logic gate. Jpn J Appl Phys Part 1 Regul Pap Short Notes Rev Pap 38(12 B):7227–7229
8. Akeela R, Wagh MD (2011) A five-input majority gate in quantum-dot cellular automata. NSTI Nanotech 2:978–981
9. Tougaw PD, Lent CS (1994) J Appl Phys 75(3):1818–1825
10. Morris Mano M (2001) Computer system architecture, 3rd edn. California State University, Los Angeles
11. Thapliyal H, Labrado C (2016) Design of adder and subtractor circuits in majority logic-based field-coupled QCA nanocomputing. Electron Lett 52(6):464–466
12. Balali M, Rezai A, Balali H, Rabiei F, Emadi S (2017) A novel design of 5-input majority gate in quantum-dot cellular automata technology. In: ISCAIE 2017 - 2017 IEEE symposium computer applications industrial electronics 5(1):13–16
13. Kianpour M, Sabbaghi-Nadooshan R (2012) A novel design and successful simulation of QCA-based multiplexer. In: Proceedings of mediterranean electrotechnical conference - MELECON. Yasmine Hammamet, Tunisia, pp 183–186
14. Morris MM (2004) Digital logic and computer design PHI. California State University, Los Angeles
15. Chabi AM, Sayedsalehi S, Angizi S, Navi K (2014) Int Sch Res Notices 2014
16. Sabbaghi-Nadooshan R, Kianpour M (2014) J Comput Electron 13(1):198–210
17. Mardiris VA, Karafyllidis IG (2010) Int J Circuit Theory Appl 38(8):771–785
18. Roohi A, Khademolhosseini H, Sayedsalehi S, Navi K (2011) Int J Comput Sci Issue (IJCSI) 8(6)

Digital Realization of AdEx Neuron Model with Two-Fold Lookup Table

Nishanth Krishnaraj, Alex Noel Joesph Raj, Vijayarajan Rajangam, and Ruban Nersisson

Abstract This paper presents a novel approach to reduce the storage size of lookup tables and hardware resource consumption whilst implementing the adaptive-exponential integrate and fire neuron model. This approach uses a two-fold lookup table architecture, consisting of an exponent-lookup table and a fractional-lookup table, and a pair of logical shift registers to approximate the exponential function. The proposed technique is synthesised and simulated using Verilog as a proof of concept. The model, which is tested with a 64 kHz clock, had a latency of 31.25 µs to retrieve data from the two lookup tables. The simulation results show that the two-fold lookup table requires 64.2% less storage than a conventional single lookup table, with an average error of 0.14%. Furthermore, the use of shift registers negates the need for multipliers, thereby reducing the latency, power consumption, and hardware initialisation, making the proposed model suitable for large-scale neuromorphic and biologically inspired neural network implementations targeting low-cost hardware platforms. In addition, the proposed model also generates different spiking patterns of the neuron with minimal computational error.

Keywords Adaptive-exponential model · Fractional-lut · Exponent-lut · Two-fold lookup table

N. Krishnaraj
Electrical Engineering, Penn State University, State College 16801, PA, USA

A. N. Joesph Raj
Department of Electronics, Shantou University, Shantou 515063, Guangdong, China

V. Rajangam (B)
Centre for Healthcare Advancement, Innovation and Research, SENSE, Vellore Institute of Technology, Chennai, Tamil Nadu, India
e-mail: [email protected]

R. Nersisson
SELECT, Vellore Institute of Technology Vellore, Vellore, Tamil Nadu, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_24


N. Krishnaraj et al.

1 Introduction

Over the last few years, research in neuromorphic computing, which involves the use of very-large-scale integration (VLSI) systems to mimic the neuronal structure of the human brain, has risen drastically. The hardware implementation of a neural network consisting of billions of neurons, interconnected through trillions of synapses, is crucial for understanding the human brain and possibly diagnosing and treating brain diseases. The processing of electric spikes by the neurons can be modelled as a computational structure [1]. Whilst analysing the dynamic response of a large-scale simulation of the brain, a deep knowledge of a single neuron's response plays a vital role. A single neuron's response behaviour can be described using a couple of differential equations known as the biological neuron model, which explain the electrical potentials across the cell membrane. Such mathematical modelling of neural dynamics has been robust in analysing the behaviour of biological neural networks. Furthermore, neuron models can be categorised into groups depending on their levels of computational cost and biological accuracy. They can be broadly classified into two modelling groups: conductance-based models and spiking-based models [2]. Conductance-based models provide high precision and are accurate to the biological details of the neuron cells. For example, models like Hodgkin-Huxley [3] are based on an equivalent circuit representation of a cell membrane that can offer accurate ionic activity (Na+ and K+) across the ion channels. However, the major drawback of such models is their considerably high computational overhead and cost, which makes them unsuitable for large-scale simulation. On the other hand, spiking-based models describe the temporal behaviour of the cortical spikes and are computationally low-cost for large-scale simulation. In addition, models like the Izhikevich model [4] and the adaptive-exponential (AdEx) model are based on conductance-based models and can produce a wide range of spiking patterns. The AdEx model [5] is a two-dimensional integrate and fire model [6, 7] similar to the Izhikevich model, in which an exponential function replaces the quadratic term. It can also describe more realistic neural dynamics than the Izhikevich model.

This paper presents a novel method to realise the AdEx model with shorter latency and lower hardware utilisation, along with Verilog simulation results that validate the method. The method uses a two-fold lookup table (LUT) architecture to approximate the exponential function. Storing the floating-point values in a two-fold LUT can significantly reduce the storage size compared to a single LUT, thus reducing hardware implementation costs [8]. The architecture uses a LUT, called the fractional-LUT (f-LUT), to store the mantissa bits of the exponential function. The f-LUT values with the same exponent value are grouped, and the exponent of each group is stored in another LUT called the exponent-LUT (e-LUT). The original values of the exponential function can later be calculated by binary shifting the f-LUT values, with the corresponding e-LUT value denoting the direction and number of bits to be shifted.

In this paper, the original AdEx neuron model and the proposed model using a two-fold LUT are compared to evaluate the proposed model's response and the accuracy of the LUTs. Furthermore, the proposed two-fold LUT is compared to a regular LUT to determine the reduction in memory size. Moreover, the use of shift registers instead of multipliers significantly decreases the hardware consumption and improves the speed of operation. The rest of the paper is organised as follows. The following section presents a brief background of the AdEx neuron model, whilst in the third section, the proposed method using LUTs is presented. The hardware implementation is shown in the fourth section. The fifth section discusses the Verilog simulation results of the proposed model. The final section concludes this article by discussing the advantages of the proposed method.

2 AdEx Model

The adaptive exponential integrate and fire model is a two-dimensional neuron model that describes the dynamic behaviour of the neuron and can reproduce a wide range of physiological firing patterns despite its simplicity [9]. The model consists of two coupled non-linear differential equations and an auxiliary reset equation:

$$C \frac{dV}{dt} = -g_l (V - E_l) + g_l \Delta_T \exp\left(\frac{V - V_T}{\Delta_T}\right) + I - W \tag{1}$$

$$\tau_\omega \frac{dW}{dt} = a (V - E_l) - W \tag{2}$$

$$\text{If } V > v_{peak}, \text{ then } V \to V_r;\; W \to W + b \tag{3}$$

In these equations, V denotes the neuron's membrane potential when the presynaptic input current I is injected into the neuron. W represents the recovery variable of the neuron, which provides negative feedback to V. The exponential function represents the quasi-instantaneous reaction of the activation variable of the Na+ ion channel when the membrane potential crosses the threshold potential (V_T). Once the membrane potential crosses the peak voltage (v_peak), the reset equation sets it to its reset value (V_r) (Fig. 1). The parameters that describe the model are as follows:

C: membrane capacitance (pF)
g_l: leak conductance (nS)
E_l: rest potential (mV)
Δ_T: threshold slope factor (mV)
V_T: threshold potential (mV)
V_r: reset potential (mV)
τ_ω: adaptation time constant (ms)
a: subthreshold adaptation (nS)
b: spike-triggered adaptation (pA)
I: input current (pA)

By varying the values of these parameters, the AdEx model can generate different types of spiking patterns, such as tonic spiking and tonic bursting.

Fig. 1 Neuron membrane potential (tonic spiking) generated by the original adaptive-exponential integrate and fire model in response to the input injected current

3 Proposed AdEx Model Using Two-Fold Lookup Table

In this section, the proposed modifications to the original AdEx model are presented. The primary motive of these changes is the approximation of the exponential function using a two-fold LUT [10, 11]. The original AdEx model Eqs. (1) and (2) can be rewritten in discrete form as:

$$\Delta V = \left(-g_l (V - E_l) + I - W\right)\left(\frac{\Delta t}{C}\right) + q(V) \tag{4}$$

$$\Delta W = \left(a (V - E_l) - W\right)\left(\frac{\Delta t}{\tau_\omega}\right) \tag{5}$$

where

$$q(V) = g_l \Delta_T \exp\left(\frac{V - V_T}{\Delta_T}\right)\left(\frac{\Delta t}{C}\right) \tag{6}$$
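The discrete update of Eqs. (4) to (6), together with the reset rule of Eq. (3), can be sketched as a forward-Euler loop in Python. The parameter values, time step, and input current below are illustrative assumptions and are not the values used in the paper; the exponential is evaluated directly here rather than through the lookup tables.

```python
import math

# Illustrative parameter values (assumptions, not taken from the paper).
C = 281.0        # membrane capacitance (pF)
G_L = 30.0       # leak conductance (nS)
E_L = -70.6      # rest potential (mV)
V_T = -50.4      # threshold potential (mV)
DELTA_T = 2.0    # threshold slope factor (mV)
TAU_W = 144.0    # adaptation time constant (ms)
A = 4.0          # subthreshold adaptation (nS)
B = 80.5         # spike-triggered adaptation (pA)
V_R = -70.6      # reset potential (mV)
V_PEAK = 20.0    # peak voltage (mV)


def q(v: float, dt: float) -> float:
    """Exponential term of Eq. (6), already scaled by dt / C."""
    return G_L * DELTA_T * math.exp((v - V_T) / DELTA_T) * (dt / C)


def simulate(i_in: float, t_end: float = 500.0, dt: float = 0.1):
    """Forward-Euler simulation; returns (voltage trace in mV, spike times in ms)."""
    v, w = E_L, 0.0
    trace, spikes = [], []
    for n in range(int(t_end / dt)):
        dv = (-G_L * (v - E_L) + i_in - w) * (dt / C) + q(v, dt)   # Eq. (4)
        dw = (A * (v - E_L) - w) * (dt / TAU_W)                    # Eq. (5)
        v, w = v + dv, w + dw
        if v > V_PEAK:                                             # Eq. (3)
            spikes.append(n * dt)
            v, w = V_R, w + B
        trace.append(v)
    return trace, spikes


trace, spikes = simulate(i_in=1000.0)   # constant 1000 pA input (assumed)
```

In a hardware realisation along the lines proposed here, the call to `q` would be replaced by a table lookup, while the remaining terms of the update need only additions and constant scalings.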

Figure 2 shows the exponential function q(V) that is approximated with the two-fold LUT architecture, which can significantly reduce the storage size compared to a single LUT, thus reducing hardware implementation costs. The architecture uses a LUT, called the f-LUT, to store the mantissa bits of the exponential function. The f-LUT values with the same exponent value are grouped, and the exponent of each group is stored in another LUT called the e-LUT.

Digital Realization of AdEx Neuron Model …


Fig. 2 Approximating the exponential function (q(V )) with two-fold lookup table architecture, which can significantly reduce the storage size

3.1 f-LUT and e-LUT Architecture

To design and determine the values approximated by the LUT architecture for a given set of model parameters, the values of the exponential function q(V) (q(V) > 0) must be represented in floating-point format [12, 13] as described by Eq. 7.

$$X = k \times 2^{E} \tag{7}$$

where X ∈ R+, k ∈ R+ in [1, 2) and E ∈ Z. Since k is a real number in the range [1, 2), its integral part is 1 for all values of k. Therefore, only the fractional part of k is needed, and these fractional values of k are stored in the f-LUT [14]. The number of fractional digits of 'k' that are stored determines the bit width of the f-LUT. The depth of the f-LUT is determined by the number of 'k' values. The bit width of the f-LUT is user-determined and decides the error and the precision of the two-fold lookup table [15]. There is a trade-off between the width of the f-LUT and the percentage error of the two-fold LUT, as shown in Fig. 3. The values of k with the same E value are grouped together to form a band. The E values of the bands are stored in another lookup table referred to as the e-LUT. The exponent values in the e-LUT are stored in sign-magnitude binary format rather than two's complement format to reduce computation whilst retrieving the data. The bit width of the e-LUT is determined by the range of E values, and the depth of the e-LUT is determined by the number of 'k' bands.
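The decomposition of Eq. 7 and the grouping into bands can be sketched in Python as below. This is a hedged illustration, not the authors' table generator: the function names are hypothetical, the fraction is truncated (the paper does not state the rounding rule), and representing each band as an (E, first index) pair is an assumption made only so that the sketch is concrete.

```python
import math


def split_mantissa_exponent(x, frac_bits=8):
    """Return (fraction, E) such that x ~= (1 + fraction / 2**frac_bits) * 2**E."""
    if x <= 0:
        raise ValueError("q(V) must be positive")
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= m < 1
    k, exponent = 2.0 * m, e - 1      # rescale so that 1 <= k < 2
    fraction = int((k - 1.0) * (1 << frac_bits))  # keep frac_bits bits (truncation assumed)
    return fraction, exponent


def build_two_fold_lut(values, frac_bits=8):
    """Build an f-LUT (one fraction per value) and an e-LUT (one entry per band)."""
    f_lut, e_lut = [], []
    for index, x in enumerate(values):
        fraction, exponent = split_mantissa_exponent(x, frac_bits)
        f_lut.append(fraction)
        # A new band starts whenever the exponent changes (hypothetical layout).
        if not e_lut or e_lut[-1][0] != exponent:
            e_lut.append((exponent, index))
    return f_lut, e_lut
```

For example, 0.1875 = 1.5 × 2^−3, so with an 8-bit fraction it is stored as the fraction 128 (binary 10000000) in a band whose exponent is −3.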


Fig. 3 Plot describing the relationship between the percentage error of the two-fold lookup table and the bit width of the f-LUT. As the bit width of the fractional-LUT increases, the percentage error of the two-fold lookup table decreases

3.2 Bit Width Requirements and Compressibility

The compressibility of the two-fold lookup table is defined as

$$C = 1 - \frac{T_t}{T_s} \tag{8}$$

where T_t is the storage size (bits) of the two-fold LUT and T_s is the storage size (bits) of an equivalent single LUT used to approximate the exponential function q(V). The compressibility of the two-fold LUT describes the percentage decrease in the storage size of the two-fold LUT compared to the storage size of an equivalent single LUT. The storage size of the two-fold LUT, T_t, is the combined storage size of the f-LUT and the e-LUT:

$$T_t = T_f + T_e \tag{9}$$

where T_f and T_e are the storage sizes of the f-LUT and the e-LUT, respectively. The size of a LUT is determined by its depth and bit width:

$$T = W \times D \tag{10}$$

where W is the bit width, and D is the depth of the LUT. Since the f-LUT stores the fractional digits of k, the bit width of the f-LUT is decided by the user. Figure 3 describes the trade-off between the bit width of the f-LUT and the approximation error of the two-fold LUT. The depth of the f-LUT is the same as the depth of an equivalent single LUT. The depth of the e-LUT is given by the number of bands into which the 'k' values are grouped, and its width is determined by the range of 'E' values. The total storage size of the two-fold LUT is given by

$$T_t = (W_f \times D_f) + (W_e \times D_e) \tag{11}$$

and the storage size of a single LUT is given by

$$T_s = W_s \times D_s \tag{12}$$

The depth of the single LUT, which depends upon the range of 'X' values, is the same as the depth of the f-LUT (D_s = D_f). The bit width of the single LUT is determined by the bit width of the f-LUT and the range of 'E' values. Therefore, the compressibility of a two-fold LUT for a given set of parameters is

$$C = 1 - \frac{(W_f \times D_f) + (W_e \times D_e)}{W_s \times D_s} \tag{13}$$

which, since D_s = D_f, can be rewritten as

$$C = 1 - \frac{W_f + \dfrac{W_e \times D_e}{D_f}}{W_s} \tag{14}$$

where W_f is the parameter that determines the approximation error and the compressibility of the two-fold LUT. Figure 4 shows the relationship between the approximation error and the compressibility of the two-fold LUT, which is determined by the bit width of the f-LUT. Increasing the bit width of the f-LUT reduces both the approximation error and the compressibility of the two-fold LUT, and vice versa.

3.3 Data Retrieval from the Two-Fold Lookup Table

To approximate the exponential function 'q(V)' using the two-fold lookup table, the data from the f-LUT and the corresponding data from the e-LUT are retrieved using an address decoder. The value of the function q(V) for the given input is then calculated using Eq. 7. The data retrieved from the f-LUT is the fractional part of 'k'; the full 'k' value is obtained by prepending a single '1' bit to the left of this data. The data retrieved from the e-LUT is the exponent 'E' of base 2. Figure 5 shows the plot comparing the original exponential function and the exponential function approximated using the two-fold LUT.


Fig. 4 Plot describing the relationship between the percentage error of the two-fold lookup table and the compressibility of the two-fold lookup table. As the compressibility increases, the percentage error of the two-fold lookup table also increases

Fig. 5 Plot comparing the original exponential function and the exponential function approximated using the two-fold LUT

Since the data from the two-fold LUT is in binary format, Eq. 7 can be realised efficiently using shift registers. The use of shift registers instead of multipliers leads to a high-speed and low-cost digital realisation of the AdEx model.
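The shift-based evaluation of Eq. 7 described above can be sketched as follows. The sketch is written in Python rather than Verilog and is only illustrative: the function name is hypothetical, the default widths (8-bit fraction, 5-bit sign-magnitude exponent with the sign in the most significant bit, 22 fractional bits in the result) follow the values reported in Sects. 4 and 5, and the exact bit layout of the authors' design is an assumption.

```python
def lut_value_fixed(frac, e_word, w_f=8, e_mag_bits=4, frac_bits=22):
    """Rebuild k * 2**E as a fixed-point integer with `frac_bits` fractional bits.

    frac   : fractional part of k read from the f-LUT (w_f bits)
    e_word : sign-magnitude exponent read from the e-LUT (sign bit is the MSB)
    """
    mantissa = (1 << w_f) | frac              # prepend the implicit leading 1
    fixed = mantissa << (frac_bits - w_f)     # align k to the fixed-point format
    sign = (e_word >> e_mag_bits) & 1         # 1 means a negative exponent
    magnitude = e_word & ((1 << e_mag_bits) - 1)
    # A negative exponent is a right shift, a positive one a left shift.
    return fixed >> magnitude if sign else fixed << magnitude


# k = 1.5 (fraction 0b10000000), E = -3 (sign-magnitude 0b10011): 1.5 * 2**-3 = 0.1875
example = lut_value_fixed(0b10000000, 0b10011)
print(example / (1 << 22))
```

No multiplication is needed: the leading '1' is attached with a bitwise OR and the factor 2^E is applied purely by shifting, which is the property the text relies on.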


4 Hardware Implementation

The proposed AdEx model using a two-fold lookup table architecture is designed in Verilog to verify its validity. The AdEx model parameters used to implement the design are as follows: C = 281 pF, gl = 30 nS, El = −70.6 mV, ΔT = 2 mV, VT = −50.4 mV, Vr = −70.6 mV, τω = 144 ms, a = 4 nS, b = 80.5 pA, vpeak = 20 mV, Δt = 0.0625 ms. Using these parameters, the exponential function 'q(V)' (Eq. 6) is first calculated. Using Eq. 7, the 'k' values and their corresponding 'E' values are calculated. The bit width of the e-LUT for the 'E' values was found to be 5 bits. The 'E' values stored in the e-LUT are in sign-magnitude representation. For the f-LUT, a bit width of 8 is chosen for its high precision and low error. The storage sizes of the f-LUT and the e-LUT for the given parameters are 2.4843 Kb and 0.0733 Kb, respectively. Therefore, the total size of the two-fold lookup table architecture is 2.5576 Kb, with a mean error of 0.14%. To calculate the compressibility of the two-fold lookup table, the storage size of an equivalent single LUT has to be found first. Using the same parameter values, the size of the single LUT with a bit width of 23 bits is calculated to be 7.1426 Kb. Therefore, the compressibility of the two-fold lookup architecture is 64.2%.
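The storage figures above can be cross-checked with a short Python sketch of Eqs. (10) to (14). The bit widths (8, 5 and 23) are taken from the text; the depths of 318 f-LUT entries and 15 bands are not stated in the paper and are inferred here from the reported sizes under the assumption that 1 Kb = 1024 bits, so they should be read as assumptions. The function names are hypothetical.

```python
KB = 1024  # bits per Kb (assumed; the reported sizes are consistent with this)


def lut_bits(width, depth):
    """Eq. (10): storage size of a LUT in bits."""
    return width * depth


def compressibility(w_f, d_f, w_e, d_e, w_s):
    """Eq. (14), valid when the single LUT has the same depth as the f-LUT."""
    return 1 - (w_f + (w_e * d_e) / d_f) / w_s


W_F, W_E, W_S = 8, 5, 23   # bit widths reported in the text
D_F, D_E = 318, 15         # depths inferred from the reported sizes (assumption)

t_f = lut_bits(W_F, D_F)   # f-LUT size in bits
t_e = lut_bits(W_E, D_E)   # e-LUT size in bits
t_s = lut_bits(W_S, D_F)   # equivalent single LUT, D_s = D_f

print(t_f / KB, t_e / KB, (t_f + t_e) / KB, t_s / KB)
print(compressibility(W_F, D_F, W_E, D_E, W_S))
```

Under these assumptions the two-fold LUT occupies 2619 bits against 7314 bits for the single LUT, which agrees with the reported compressibility of about 64.2%.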

5 Simulation Results

Figure 6 shows the digital hardware design of the two-fold lookup table module, written in Verilog, that approximates the exponential function 'q(V)'. An address decoder is used to calculate the memory address for the f-LUT and the e-LUT. The module uses a pair of shift registers to implement Eq. 7, performing the multiplication without multipliers. The most significant bit of the e-LUT stores the sign bit of the 'E' values and is used as the select line for the multiplexer 'MUX 1'. If the lookup tables are not enabled, the multiplexer 'MUX 2' sets the output of the module to '0'. Figure 7 shows the digital hardware design of the proposed adaptive-exponential integrate and fire model. The 'V module' and 'W module' implement Eqs. 4 and 5, respectively. To increase the accuracy and the precision of the proposed model, a 32-bit two's complement binary representation (1 sign bit, 9 integral bits, and 22 fractional bits) is used for the proposed hardware design.


Fig. 6 Digital hardware design for the module that approximates the exponential function q(V ) using a two-fold lookup table architecture

Fig. 7 Digital hardware design for implementing the proposed adaptive exponential integrate and fire model. Equations 4, 5, and 6 are implemented in this design

The model was tested with a 64 kHz clock (clock period 15.625 µs) and had a latency of 31.25 µs, i.e. two clock cycles, to retrieve data from the two lookup tables. The simulation results showed that the two-fold lookup table required 64.2% less storage than a conventional single lookup table, i.e. only 35.8% of the storage of an equivalent single LUT, with an average error of 0.14%.

6 Conclusion

In summary, this paper presents a novel approach to reduce the storage size of the lookup tables and the hardware resource consumption when implementing the adaptive-exponential integrate and fire neuron model. The proposed method uses a two-fold lookup table architecture to approximate the exponential function rather than a single lookup table. The use of shift registers instead of multipliers to perform the multiplication reduces the complexity and resource consumption of the hardware. It also has a lower power consumption compared with the original AdEx neuron model because of its multiplierless implementation. The storage size of the two-fold lookup table was compared with that of the single lookup table to determine the compressibility and the storage reduction of the proposed method. The advantages of the proposed method make it suitable for large-scale neuromorphic and biologically inspired neural network implementations targeting low-cost hardware platforms. The proposed technique was synthesised and simulated using Verilog as a proof of concept.

References

1. Kaiser J, Billaudelle S, Müller E, Tetzlaff C, Schemmel J, Schmitt S (2022) Emulating dendritic computing paradigms on analog neuromorphic hardware. Neuroscience 489:290–300
2. Heidarpour M, Ahmadi A, Rashidzadeh R (2016) A CORDIC based digital hardware for adaptive exponential integrate and fire neuron. IEEE Trans Circ Syst I: Regular Pap 63(11):1986–1996
3. Hodgkin AL, Huxley AF (1952) A quantitative description of membrane current and its application to conduction and excitation in nerve. J Phys 117(4):500–544
4. Izhikevich EM (2003) Simple model of spiking neurons. IEEE Trans Neural Netw 14(6):1569–1572
5. Brette R, Gerstner W (2005) Adaptive exponential integrate-and-fire model as an effective description of neuronal activity. J Neurophysiol 94(5):3637–3642
6. Hertag L, Haß J, Golovko T, Daniel D (2011) An analytical approximation to the AdEx neuron model allows fast fitting to physiological data. BMC Neurosci 12:81. https://doi.org/10.1186/1471-2202-12-S1-P81
7. Hertäg L, Hass J, Golovko T, Durstewitz D (2012) An approximation to the adaptive exponential integrate-and-fire neuron model allows fast and predictive fitting to physiological data. Front Comput Neurosci 6:62
8. Xie Y, Raj ANJ, Hu Z, Huang S, Fan Z, Joler M (2020) A twofold lookup table architecture for efficient approximation of activation functions. IEEE Trans Very Large Scale Integr (VLSI) Syst 28
9. Soman S, Suri M (2016) Recent trends in neuromorphic engineering. Big Data Analytics 1(1):1–19
10. Haghiri S, Ahmadi A (2019) A novel digital realization of AdEx neuron model. IEEE Trans Circ Syst II: Express Briefs 67(8):1444–1448
11. Haghiri S, Ahmadi A, Saif M (2016) VLSI implementable neuron-astrocyte control mechanism. Neurocomputing 214:280–296
12. Seidner D (2008) Efficient implementation of log10 lookup table in FPGA. In: 2008 IEEE International conference on microwaves, communications, antennas and electronic systems
13. Martinez WL, Martinez AR (2001) Computational statistics handbook with MATLAB. Chapman and Hall/CRC
14. Zamanlooy B, Mirhassani M (2013) Efficient VLSI implementation of neural networks with hyperbolic tangent activation function. IEEE Trans Very Large Scale Integr (VLSI) Syst 22
15. Tang PTP (1991) Table-lookup algorithms for elementary functions and their error analysis (No. CONF-9106103-1). Argonne National Lab., IL (USA)

A Novel Quantum Identity Authentication Protocol Based on Random Bell Pair Using Pre-shared Key B. Devendar Rao and Ramkumar Jayaraman

Abstract Modern communication faces the security problem of not only transferring messages securely but also determining whether the receiving party is authentic or fake. Before an authentication protocol is initiated, a secure key or personal secret, known as the pre-shared key, should be shared between the legitimate users. The pre-shared key is used to send secure messages in encrypted form between trusted parties. X% of the pre-shared key is used to verify the identity of the users, and the remaining key is used to transfer secure messages in trusted communication. An adversary's attack on the quantum channel can be identified by the trusted parties based on the principles of quantum mechanics. Frequent attacks by an adversary cause the protocol to abort in the current transmission, but over multiple runs they leak information about the pre-shared key. A novel quantum identity authentication protocol based on Bell pairs is proposed, in which the Bell pair selection depends on the pre-shared key. The security of the proposed protocol is analyzed under the intercept-measure-and-resend attack.

Keywords EPR · Pre-shared key · Authentication · Entanglement swapping

B. Devendar Rao · R. Jayaraman (B) Department of Computing Technologies, School of Computing, SRM Institute of Science and Technology, Kattankulathur, Chennai, India e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_25

1 Introduction

Millions of devices, from smartphones to CCTV cameras, are connected to each other across the globe through the Internet. People connect with one another with the help of gadgets and communicate across continents. The major problem the current generation faces is whether they are communicating with the right person or with someone pretending to be the original. Secure communication means not only the secure transmission of information from one device to another but also verifying whether the other device is authentic or fake. To verify the identity of users, both users should have some pre-shared information or private secret that is known only to them. An open question is what percentage of the shared key can be used for verifying the identity of a user without leaking it to a third party. In the current scenario, the classical computation used to perform this authentication is under threat due to developments in quantum computing. These developments can also be utilized to verify the identity of both users, because quantum communication relies on the principles of quantum mechanics. Various quantum authentication schemes have been developed in recent years [1–4]. The real problem is what percentage of the pre-shared information is utilized for verifying the identity and how frequent attacks by an adversary lead to the leakage of sequential information from the pre-shared key.

In quantum communication, many quantum identity authentication (QIA) protocols have been designed without Einstein–Podolsky–Rosen (EPR) states, where single photons are communicated among trusted users [1, 2]. An alternative is to design a QIA protocol using EPR states, where two qubits are entangled and communicated among trusted parties [3, 4]. Many QIA protocols are designed using either a communication task [5, 6] or a computation task [7, 8].

In the existing QIA protocol [9], Alice and Bob share a common key, known as the pre-shared key, used for authentication purposes. Some percentage of the pre-shared key is utilized to identify the legitimate user, and the remainder is used to perform secure communication. Alice generates a random authentication key A_K and encodes each qubit by selecting the basis {Z, X} from the pre-shared key bits, as shown in Fig. 1. The selection criterion {00, 11 for the Z basis and 01, 10 for the X basis} is shared between them. If the two consecutive key bits are 00 and the authentication bit is 1, Alice encodes the qubit in the Z basis and sends the qubit |1⟩ to Bob. If the two consecutive key bits are 10 and the authentication bit is 0, Alice encodes the qubit in the X basis and sends the qubit |+⟩ to Bob. Once the qubit is received, Bob measures it in the X or Z basis according to the pre-shared key position and stores the authentication key as A'_K. If A_K = A'_K, authentication is successful; otherwise the protocol is aborted and continues with the next random authentication key. Frequent attacks by an adversary lead to information leakage, through which the adversary tries to obtain complete knowledge of the pre-shared key used for authentication. If the authentication protocol fails, the adversary (Eve) can act as Bob towards Alice and receive the secret message.
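The basis selection of the existing protocol [9], as summarised above, can be sketched classically in Python. The sketch only tracks labels of the transmitted states and does not simulate a quantum channel. The text gives two cases explicitly (key bits 00 with bit 1 gives |1⟩, key bits 10 with bit 0 gives |+⟩); mapping bit 0 to |0⟩ in the Z basis and bit 1 to |−⟩ in the X basis is the usual convention and is an assumption here, as are all function names.

```python
def basis_from_key_bits(k1, k2):
    """Key bits 00 or 11 select the Z basis; 01 or 10 select the X basis."""
    return "Z" if k1 == k2 else "X"


# State labels per (basis, authentication bit); the 0 -> |0> and 1 -> |-> entries are assumed.
STATE = {("Z", 0): "|0>", ("Z", 1): "|1>", ("X", 0): "|+>", ("X", 1): "|->"}
BIT = {label: bit for (_, bit), label in STATE.items()}


def encode(pre_shared_key, auth_key):
    """Alice: one qubit label per authentication bit, using two key bits each."""
    states = []
    for j, bit in enumerate(auth_key):
        basis = basis_from_key_bits(pre_shared_key[2 * j], pre_shared_key[2 * j + 1])
        states.append(STATE[(basis, bit)])
    return states


def decode(pre_shared_key, states):
    """Bob: measuring in the key-selected basis recovers the bit on an undisturbed channel."""
    bits = []
    for j, label in enumerate(states):
        basis = basis_from_key_bits(pre_shared_key[2 * j], pre_shared_key[2 * j + 1])
        # A state prepared in the other basis would give a random outcome; not modelled here.
        assert STATE[(basis, BIT[label])] == label, "basis mismatch"
        bits.append(BIT[label])
    return bits


print(encode([0, 0, 1, 0], [1, 0]))
```

The sketch makes visible why this scheme consumes two pre-shared key bits for every authentication bit, a point the paper returns to when comparing key utilization.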

Fig. 1 Quantum identity authentication model


In our proposed protocol, EPR states are used to perform authentication process in random position method without leaking sequential information about the pre-shared key.

2 Basic Ideas on EPR and Entanglement Swapping

Two qubits, prepared from two classical bits, are entangled to form a new state called an EPR state or Bell state. The four Bell states that can be generated from two classical bits are shown in Eq. (1). In quantum computing, the EPR states are denoted by {ϕ → Phi and ψ → Psi}. A Bell-basis measurement of an EPR state yields two classical bits equal to the classical bits used for its generation. Measuring an EPR state in the computational or diagonal basis can give an output that differs from the bits used for EPR generation.

$$|\phi^{+}\rangle = \tfrac{1}{\sqrt{2}}\bigl(|00\rangle + |11\rangle\bigr), \qquad |\phi^{-}\rangle = \tfrac{1}{\sqrt{2}}\bigl(|00\rangle - |11\rangle\bigr),$$
$$|\psi^{+}\rangle = \tfrac{1}{\sqrt{2}}\bigl(|01\rangle + |10\rangle\bigr), \qquad |\psi^{-}\rangle = \tfrac{1}{\sqrt{2}}\bigl(|01\rangle - |10\rangle\bigr) \tag{1}$$

Suppose four classical bits {1, 0, 1, 0} are used to form EPR states in positions {(1,4), (2,3)}; the resulting Bell states are |ϕ−⟩_14 and |ψ+⟩_23. Measuring the EPR states with a Bell state measurement (BSM) in the same positions {(1,4), (2,3)} returns the states as created. Measuring with BSM in the different positions {(1,2), (3,4)} gives different results, as shown in Eq. (2).

$$|\phi^{-}\rangle_{14}|\psi^{+}\rangle_{23} = \tfrac{1}{2}\Bigl(|\phi^{-}\rangle_{12}|\psi^{+}\rangle_{34} + |\psi^{-}\rangle_{12}|\phi^{+}\rangle_{34} - |\phi^{+}\rangle_{12}|\psi^{-}\rangle_{34} + |\psi^{+}\rangle_{12}|\phi^{-}\rangle_{34}\Bigr) \tag{2}$$

The probability of obtaining the original classical bits is 25%; in the remaining cases incorrect classical bits are generated. Applying BSM across qubits of different Bell pairs yields one of four new Bell state combinations, which is known as entanglement swapping.
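The 25% figure can be checked numerically with a small pure-Python sketch that builds the four-qubit state for |ϕ−⟩ on qubits (1,4) and |ψ+⟩ on qubits (2,3) and projects it onto Bell pairs in a chosen pairing. It is an illustrative calculation only, not the Qiskit implementation mentioned later in the paper; the function names are hypothetical, and only real amplitudes are handled, which suffices for the four Bell states.

```python
from itertools import product
from math import isclose, sqrt

S = 1 / sqrt(2)
# Amplitudes of each Bell state over (bit of first qubit, bit of second qubit), Eq. (1).
BELL = {
    "phi+": {(0, 0): S, (1, 1): S},
    "phi-": {(0, 0): S, (1, 1): -S},
    "psi+": {(0, 1): S, (1, 0): S},
    "psi-": {(0, 1): S, (1, 0): -S},
}


def pair_state(name_a, pair_a, name_b, pair_b):
    """Amplitudes of Bell state name_a on qubits pair_a times name_b on pair_b (1-based)."""
    state = {}
    for bits in product((0, 1), repeat=4):
        a = BELL[name_a].get((bits[pair_a[0] - 1], bits[pair_a[1] - 1]), 0.0)
        b = BELL[name_b].get((bits[pair_b[0] - 1], bits[pair_b[1] - 1]), 0.0)
        if a * b != 0.0:
            state[bits] = a * b
    return state


def bsm_probabilities(state, pair_a, pair_b):
    """Outcome probabilities of Bell state measurements on pair_a and pair_b."""
    probabilities = {}
    for name_a, name_b in product(BELL, repeat=2):
        basis = pair_state(name_a, pair_a, name_b, pair_b)
        overlap = sum(amp * state.get(bits, 0.0) for bits, amp in basis.items())
        if not isclose(overlap, 0.0, abs_tol=1e-12):
            probabilities[(name_a, name_b)] = overlap ** 2
    return probabilities


state = pair_state("phi-", (1, 4), "psi+", (2, 3))
same = bsm_probabilities(state, (1, 4), (2, 3))      # measured where it was created
swapped = bsm_probabilities(state, (1, 2), (3, 4))   # measured in the other pairing
print(same)
print(swapped)
```

Measuring in the pairing used for creation returns the original pair with certainty, whereas the other pairing produces four equally likely outcomes, only one of which matches the original Bell states.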

3 Proposed Protocol

3.1 General Procedure

Step 1: Alice and Bob hold private knowledge known only to them, obtained either through a quantum key distribution process or from a personal identity used as a key (the pre-shared key):

K = {K_1, K_2, K_3, ..., K_n}

Step 2: Before initiating secure communication, both trusted parties should reveal their identities using the authentication identity keys A_K (Alice) and B_K (Bob).

Step 3: Alice and Bob agree that each of the four Bell states {00 → |ϕ+⟩, 01 → |ψ+⟩, 10 → |ϕ−⟩, 11 → |ψ−⟩} carries two bits of classical information to encode the authentication key.

Step 4: If Bob wants to communicate with Alice, Bob randomly chooses a subscript i from the pre-shared key K and a random position distance r, where (i, r ≤ n). He transfers i and r to Alice through the classical channel for the encoding process.

Step 5: Alice and Bob agree on the position information based on the pre-shared key: ki ⊕ kr = 0 for positions {(1,2) and (3,4)}, or ki ⊕ kr = 1 for positions {(1,4) and (2,3)}.

Step 6: The Pauli operator to apply, and the qubit to which it is applied, depend on the pre-shared key as shown in Table 1.

Table 1 Applying Pauli operator depends on pre-shared key

ki    kr    Pauli operator    Applied on qubit
0     0     I                 1
0     1     X                 2
1     0     Z                 3
1     1     Y                 4

3.2 Quantum Procedure

Step 7: Once Alice has received the subscript i and the distance r, she generates the authentication key A_K = {A_k1, A_k2, ..., A_km} of length m, where m = 10% of n, and initializes t = 1.

Step 8: Alice selects 4 sequential classical bits from her authentication key A_K and obtains the position choice from the pre-shared key values, x_t = ki ⊕ kr. If x_t = 0, she generates the Bell pairs in positions {(1,2) and (3,4)}; otherwise in {(1,4) and (2,3)}.

Step 9: Alice applies the Pauli operator {I, X, Z, Y} to the Bell state qubit selected by the values of ki and kr, as given in Table 1, and sends the qubits to Bob.

Step 10: Bob receives the Bell pairs and applies the Pauli operator {I, X, Z, Y} based on ki and kr, as given in Table 1.

Step 11: Bob performs a Bell state measurement on the Bell pairs in the positions given by the pre-shared key, x_t = ki ⊕ kr, and stores Alice's authentication key as A'_K.

Step 12: Repeat while t ≠ m, fixing the next key position choice as i = (i + r) mod n and r = (2r) mod n, which forms a circular selection from the pre-shared key K.
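The classical bookkeeping behind Steps 3 to 12 (position choice, Bell state labels, Pauli operator and the circular key index update) can be sketched in Python as below. No quantum operation is simulated. The function name `plan_rounds` is hypothetical, key indices are taken as 0-based so that the modulo update stays in range, and the authentication key length is assumed to be a multiple of 4; none of these details are fixed by the paper.

```python
# Step 3: two classical bits per Bell state.
BELL_LABEL = {(0, 0): "phi+", (0, 1): "psi+", (1, 0): "phi-", (1, 1): "psi-"}
# Table 1: (ki, kr) -> (Pauli operator, qubit it is applied to).
PAULI_TABLE = {(0, 0): ("I", 1), (0, 1): ("X", 2), (1, 0): ("Z", 3), (1, 1): ("Y", 4)}


def plan_rounds(key, auth_key, i, r):
    """Return the per-round choices Alice and Bob derive from the pre-shared key."""
    n = len(key)
    rounds = []
    for start in range(0, len(auth_key), 4):
        block = auth_key[start:start + 4]
        ki, kr = key[i], key[r]
        # Step 5 / Step 8: the XOR of the two key bits selects the qubit pairing.
        pairs = ((1, 2), (3, 4)) if ki ^ kr == 0 else ((1, 4), (2, 3))
        # Each pair encodes the two authentication bits found at its positions.
        bells = [(BELL_LABEL[(block[a - 1], block[b - 1])], (a, b)) for a, b in pairs]
        rounds.append({"positions": pairs, "bell_states": bells, "pauli": PAULI_TABLE[(ki, kr)]})
        # Step 12: circular selection of the next key positions.
        i, r = (i + r) % n, (2 * r) % n
    return rounds


key = [1, 0, 1, 1, 0, 0, 1, 0]            # hypothetical pre-shared key
auth = [1, 0, 1, 1, 1, 0, 1, 0]           # hypothetical authentication key
for entry in plan_rounds(key, auth, i=2, r=3):
    print(entry)
```

In this run the first round reproduces the example used in Sect. 4.1 (bits 1, 0, 1, 1 become |ϕ−⟩ on qubits (1,2) and |ψ−⟩ on qubits (3,4), with 'Y' on qubit 4), and the second round reproduces the pairing example of Sect. 2.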

3.3 Classical Procedure

Step 13: Bob compares the authentication keys A_K and A'_K over the classical channel. If the difference between the authentication keys is more than the QBER [10], the protocol is aborted; otherwise it continues to Step 14, as shown in Fig. 2.

Step 14: Once Bob has verified Alice's identity, Alice should verify Bob's identity by randomly choosing a subscript i from K and a random position distance r, where (i, r ≤ n), and repeating the quantum procedure with the roles exchanged.

Step 15: Alice compares the authentication keys B_K and B'_K over the classical channel. If the difference between the authentication keys is more than the QBER [10], the protocol is aborted; otherwise the parties continue with secure message communication.

4 Security of Proposed Protocols

In secure communication, a third party (also known as Eve) interferes and tries to obtain information from the trusted parties. Eve applies various attacking techniques, such as the intercept-measure-and-resend (IR) attack or the entangle-measure attack, to break the identity of trusted users. In an IR attack, Eve performs a Bell state measurement (BSM) on the incoming Bell states to decode the identity key; the decoded identity key is then encoded again into Bell states and sent to the other user. The proposed protocol has been implemented in IBM Quantum Lab using the Qiskit tool, and the visual representation is shown in IBM Quantum Composer [11].

4.1 Eve Guess on Correct Bell States Position

Eve has no information about the pre-shared key, but she knows the pre-shared position {i} and the distance from that position {r}. Eve has to guess the Bell state positions randomly, either {(1,2) and (3,4)} or {(1,4) and (2,3)}. Eve has a ½ chance of guessing the correct Bell state positions, in which case she may trespass the security measures of the proposed protocol. As an example, Alice selects the positions based on the pre-shared key, ki ⊕ kr = 1 ⊕ 1 = 0, and generates the Bell states |ϕ−⟩_12 |ψ−⟩_34 for the classical bits {1, 0, 1, 1}. Since ki and kr are both equal to 1, Alice applies the Pauli operator 'Y' on qubit 4, as given in Table 1, and sends the Bell states to Bob. Eve has a 50% chance of guessing the correct position, with either eki ⊕ ekr = 0 ⊕ 0 = 0 or eki ⊕ ekr = 1 ⊕ 1 = 0, but she is not able to decide which Pauli operator to apply, 'I' or 'Y'. If Eve applies the correct Pauli operator, she obtains the right identity information about Alice. If Eve applies the wrong Pauli operator, she guesses only 50% of Alice's identity, as shown in Fig. 3. Eve's information gain about Alice's authentication key, given a correct guess of the position, is therefore 75%: ½ × 1 + ½ × ½ = ¾. Alice's authentication key is still correctly decoded by Bob; therefore, Eve is not detected in this scenario, as shown in Table 2.

Fig. 2 Working model of proposed protocol

Fig. 3 Eve guess right position but wrong gate selection

4.2 Eve Guess on Wrong Bell States Position

The probability that Eve selects the wrong position is 50%, and in this case the probability that the trusted parties detect her also increases, as in Fig. 4. As an example, Alice selects the positions based on the pre-shared key, ki ⊕ kr = 0 ⊕ 0 = 0, and sends the generated Bell states |ϕ−⟩_12 |ψ−⟩_34 to Bob after applying the Pauli operator as discussed above (Table 2). Eve has two ways to select the incorrect position: eki ⊕ ekr = 0 ⊕ 1 = 1 or eki ⊕ ekr = 1 ⊕ 0 = 1. With probability ½ Eve selects eki = 0 and ekr = 1; she then applies the incorrect Pauli operator on the incoming qubit and decodes the information using an incorrect Bell state measurement. Applying BSM in the wrong position leads to entanglement swapping and yields an incorrect authentication key. The incorrect Pauli operator and the incorrect position make Eve generate incorrect information, as shown in Table 3. Having received an incorrect key, Eve generates Bell states based on the wrong position, applies the wrong Pauli operator, and sends them to Bob. Once Bob receives the incoming Bell states, he applies the correct Pauli operator based on the pre-shared key information and decodes Alice's authentication key by applying BSM, as shown in Table 4. Since Eve's position choice does not match Bob's position, entanglement swapping takes place again and Bob generates a different authentication key A'_K. During classical communication, Bob shares the authentication identity key information with Alice to verify whether it is him or not. The total information gain by Eve in this scenario is null because of the incorrect Bell position and the incorrect Pauli operator. The probability that the trusted parties detect Eve is 75% in both cases of selecting the wrong position. The total information gain by Eve about Alice's authentication key over the entire process is ½ × ¾ + ½ × 0 = 3/8. The probability that Eve trespasses the security measures over both scenarios is ½ × 1 + ½ × ¼ = 5/8. The total probability that Eve is detected by the trusted users is

$$P_d = 1 - \left(\tfrac{5}{8}\right)^{4n}$$

so that, as n increases, the probability of detecting Eve becomes high.

When Eve is detected, the legitimate users abort the protocol and generate a new authentication key. If the same pre-shared key is used as the basis for the next cycle, the authentication key bits lead to leakage of the pre-shared key information used for authentication. In the existing protocol, a different authentication key is used with the previous pre-shared key position when the legitimate users abort the protocol. Eve's detection rate is directly proportional to her information gain about the pre-shared key [9, 12], as shown in Fig. 5. In the proposed protocol, a different authentication key is used with a random position of the pre-shared key. On each detection of Eve, the legitimate users change the initial key position {i} and its distance {r}; therefore, Eve's information gain about the authenticating pre-shared key is around 20%, as shown in Fig. 5. The main purpose of the pre-shared key is to communicate secure information between trusted parties; therefore, only a small number of key bits should be used for verifying the sending and receiving parties. In [9], two consecutive bits ki and ki+1 from the pre-shared key are used to send a single authentication key bit Aki to Bob. In [12], an even bit k2i is used to send decoy states, and an XOR operation on the two consecutive bits k2i and k2i−1 is used to send the authentication key bit Aki. In the existing protocols, the authentication key size is less than or equal to the pre-shared key size; therefore, a large number of pre-shared key bits are used for authentication. In the proposed protocol, the authentication key size is larger than the number of pre-shared key bits used, in contrast to the existing protocols, as shown in Fig. 6. For example, for an authentication key of size m = 100, 26 pre-shared key bits are used to communicate the qubits to Bob.

Fig. 4 Eve guesses wrong position and wrong gate selection

Table 2 Eve guess correct position

Alice generates position information from key: ki ⊕ kr = 0 if ki = 1, kr = 1, positions {(1,2) and (3,4)}
Classical bit grouping: 1 2 3 4
Alice authenticated bits: 1 0 1 1
Alice generated Bell states: |ϕ−⟩_12 |ψ−⟩_34
Alice Pauli operator: 'Y' on qubit 4
Bell states after Pauli operation: |ϕ−⟩_12 |ϕ+⟩_34
Alice sends Bell states to Bob; Eve interferes

Eve has two choices                    eki ⊕ ekr = 0 ⊕ 0 = 0      eki ⊕ ekr = 1 ⊕ 1 = 0
Classical bit grouping                 1 2 3 4                    1 2 3 4
Eve Pauli operator                     'I' on qubit 1             'Y' on qubit 4
Bell states after Pauli operation      |ϕ−⟩_12 |ϕ+⟩_34            |ϕ−⟩_12 |ψ−⟩_34
Eve decode authenticated key           1 0 0 0                    1 0 1 1
Eve generates Bell states              |ϕ−⟩_12 |ϕ+⟩_34            |ϕ−⟩_12 |ψ−⟩_34
Eve Pauli operator                     'I' on qubit 1             'Y' on qubit 4
Bell states after Pauli operation      |ϕ−⟩_12 |ϕ+⟩_34            |ϕ−⟩_12 |ϕ+⟩_34
Eve sends Bell states to Bob
Bob Pauli operator                     'Y' on qubit 4             'Y' on qubit 4
Bell states after Pauli operation      |ϕ−⟩_12 |ψ−⟩_34            |ϕ−⟩_12 |ψ−⟩_34
Bob applying BSM                       {(1,2) and (3,4)}          {(1,2) and (3,4)}
Bob generating Alice key               1 0 1 1                    1 0 1 1

Table 3 Eve selection on incorrect position leads to incorrect Pauli operator selection

Alice key (Ak), bits 1 2 3 4: 1 0 1 1
Alice position choice: ki ⊕ kr = 0 with ki = 0, kr = 0 (key = 00), positions {(1,2) and (3,4)}, Bell states |ϕ−⟩_12 |ψ−⟩_34

Eve wrong selection            Pauli operator    Bell states after Pauli    BSM {(1,4) and (2,3)}    Eve key (EAk), bits 1 4 2 3
eki = 0, ekr = 1 (key = 01)    X on qubit 2      |ψ−⟩_12 |ψ−⟩_34            |ψ−⟩_14 |ψ−⟩_23          1 1 1 1
                                                                            |ψ+⟩_14 |ψ+⟩_23          0 1 0 1
                                                                            |ϕ−⟩_14 |ϕ−⟩_23          1 0 1 0
                                                                            |ϕ+⟩_14 |ϕ+⟩_23          0 0 0 0
eki = 1, ekr = 0 (key = 10)    Z on qubit 3      |ϕ−⟩_12 |ψ+⟩_34            |ϕ−⟩_14 |ψ+⟩_23          1 0 0 1
                                                                            |ψ+⟩_14 |ϕ−⟩_23          0 1 1 0
                                                                            |ϕ+⟩_14 |ψ−⟩_23          0 0 1 1
                                                                            |ψ−⟩_14 |ϕ+⟩_23          1 1 0 0

Table 4 Eve generating incorrect Bell states based on received BSM

Eve key (EAk), bits 1 4 2 3    Eve encoding bits      Apply Pauli operator    Bob Pauli (I) operator    Bob BSM {(1,2) and (3,4)}    Bob key (A'k), bits 1 2 3 4
1 1 1 1                        |ψ−⟩_14 |ψ−⟩_23        |ψ−⟩_14 |ϕ−⟩_23         |ψ−⟩_14 |ϕ−⟩_23           |ϕ+⟩_12 |ψ+⟩_34              0 0 0 1
0 1 0 1                        |ψ+⟩_14 |ψ+⟩_23        |ψ+⟩_14 |ϕ+⟩_23         |ψ+⟩_14 |ϕ+⟩_23           |ϕ−⟩_12 |ψ−⟩_34              1 0 1 1
1 0 1 0                        |ϕ−⟩_14 |ϕ−⟩_23        |ϕ−⟩_14 |ψ−⟩_23         |ϕ−⟩_14 |ψ−⟩_23           |ψ−⟩_12 |ϕ−⟩_34              1 1 1 0
0 0 0 0                        |ϕ+⟩_14 |ϕ+⟩_23        |ϕ+⟩_14 |ψ+⟩_23         |ϕ+⟩_14 |ψ+⟩_23           |ψ+⟩_12 |ϕ+⟩_34              0 1 0 0
1 0 0 1                        |ϕ−⟩_14 |ψ+⟩_23        |ϕ−⟩_14 |ψ−⟩_23         |ϕ−⟩_14 |ψ−⟩_23           |ψ+⟩_12 |ϕ+⟩_34              0 1 0 0
0 1 1 0                        |ψ+⟩_14 |ϕ−⟩_23        |ψ+⟩_14 |ϕ+⟩_23         |ψ+⟩_14 |ϕ+⟩_23           |ψ−⟩_12 |ϕ−⟩_34              1 1 1 0
0 0 1 1                        |ϕ+⟩_14 |ψ−⟩_23        |ϕ+⟩_14 |ψ+⟩_23         |ϕ+⟩_14 |ψ+⟩_23           |ϕ−⟩_12 |ψ−⟩_34              1 0 1 1
1 1 0 0                        |ψ−⟩_14 |ϕ+⟩_23        |ψ−⟩_14 |ϕ−⟩_23         |ψ−⟩_14 |ϕ−⟩_23           |ϕ+⟩_12 |ψ+⟩_34              0 0 0 1

Fig. 5 Eve information gain when the protocol aborts

Fig. 6 Pre-shared key utilization versus authentication key size
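The probabilities quoted in Sect. 4.2 can be reproduced exactly with rational arithmetic. The following Python sketch is only a check of that arithmetic under the paper's own case probabilities (½ for each position guess, ¾ and 0 information gain, 1 and ¼ probability of passing undetected); the variable and function names are hypothetical.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# Eve's average information gain: correct position (3/4) or wrong position (0).
info_gain = HALF * Fraction(3, 4) + HALF * 0
# Probability that Eve passes undetected: correct position (1) or wrong position (1/4).
pass_probability = HALF * 1 + HALF * Fraction(1, 4)


def detection_probability(n):
    """P_d = 1 - (5/8)**(4n), as given in Sect. 4.2."""
    return 1 - pass_probability ** (4 * n)


print(info_gain, pass_probability)
for n in (1, 2, 5):
    print(n, float(detection_probability(n)))
```

For n = 1 the detection probability is 3471/4096, roughly 0.847, and the information gain of 3/8 is 37.5%; these correspond to the 84% and 37% figures quoted in the conclusion.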

5 Conclusion

The proposed QIA protocol is based on a Bell state selection driven by random values from the pre-shared key shared among the users. Two levels of authentication are provided in the proposed model: an adversary must first guess the correct position and then guess the correct Pauli operator. The overall adversary information gain and the detection rate by the trusted parties under the intercept-measure-and-resend attack are 37% and 84%, respectively. The proposed protocol can be extended to 3-qubit states, selecting between GHZ and W states depending on the value of the pre-shared key.

References

1. Zawadzki P (2019) Quantum identity authentication without entanglement. Quantum Inf Process 18(1):7
2. Zhu H, Wang L, Zhang Y (2020) An efficient quantum identity authentication key agreement protocol without entanglement. Quantum Inf Process 19(10):381
3. Li X, Barnum H (2004) Quantum authentication using entangled states. Int J Found Comput Sci 15(04):609–617
4. Zhang, Zeng G, Zhou N, Xiong J (2006) Quantum identity authentication based on ping-pong technique for photons. Phys Lett A 356(3):199–205
5. Bennett CH, Brassard G (2014) Quantum cryptography: public key distribution and coin tossing. Theoret Comput Sci 560(1):7–11
6. Ekert AK (1991) Quantum cryptography based on Bell’s theorem. Phys Rev Lett 67(6)
7. Shan R-T, Chen X, Yuan K-G (2021) Multi-party blind quantum computation protocol with mutual authentication in network. Sci China Inf Sci 64(6):162302
8. Quan J, Li Q, Liu C, Shi J, Peng Y (2021) A simplified verifiable blind quantum computing protocol with quantum input verification. Quantum Eng 3(1):e58
9. Chang G, Heo J, Jang JG, Kwon D (2017) Quantum identity authentication with single photon. Quantum Inf Process 16(10):236
10. Shor PW, Preskill J (2000) Simple proof of security of the BB84 quantum key distribution protocol. Phys Rev Lett 85:441–444
11. IBM Quantum. https://quantum-computing.ibm.com/. Last accessed 15 Mar 2022
12. Arindam D, Anirban P (2021) A short review on quantum identity authentication protocols: how would Bob know that he is talking with Alice? e-Print: 2112.04234 [quant-ph]

Analysis of Hate Tweets Using CBOW-based Optimization Word Embedding Methods Using Deep Neural Networks

S. Anantha Babu, M. John Basha, K. S. Arvind, and N. Sivakumar

Abstract In recent years, governments, corporations, and academics have invested heavily in addressing the rising frequency of hate speech, reflecting the pressing need for effective solutions. Many methods for identifying hate speech have been developed and published. The task is to categorise textual content as non-hate or hate speech, with the algorithm able to detect the targeted characteristics in the latter case. Our proposed study uses Continuous Bag of Words (CBOW)-based word embedding, which predicts the target word by analysing the context of the surrounding words, as a feature extractor, together with deep learning-based structures that explicitly capture the meaning of offensive speech. Our approaches are thoroughly tested on the largest collection of Twitter-based hate speech data sets. For hate speech identification, we investigate the influence of several extra-linguistic features in combination with character n-grams. In addition, we provide a lexicon based on the most important words in our data. The proposed approach achieves 95% word embedding accuracy on real-time Twitter hate speech test data.

S. Anantha Babu (B)
Department of Computer Science and Engineering, KL Deemed to-be University, Hyderabad 500043, India

M. John Basha · K. S. Arvind
Department of Computer Science and Engineering, Jain Deemed to-be University, Bangalore 562112, India

N. Sivakumar
School of Computer Science and Information Technology, Jain Deemed to-be University, Bangalore 560041, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_26


S. Anantha Babu et al.

Keywords CBOW technique · Word embedding · BERT + CNN · Natural language processing

1 Introduction

The distribution of meta-features across classes and data sets has recently been found to help in the detection of disaster tweets. Since most crisis tweets come from media sources, they tend to be written in a more formal tone and with more words than non-disaster tweets. Non-disaster tweets contain more errors than disaster tweets because they are posted by individuals [1]. Automated speech act recognition has a considerable impact on Twitter and its users, and the classification of tweet acts is useful in its own right. In terms of speech acts, the Twitter platform may be used to determine what defines a certain subject or theme, as well as whether a particular topic contains any contradiction. In areas such as news and category representation, claims, assertions, and facts are common [2]. Automatic hate speech detection on social media is a topic of considerable interest in artificial intelligence, particularly in Natural Language Processing [3]. Although the literature offers several definitions of hate speech, it is frequently defined as language that targets or denigrates an individual or group based on certain characteristics, such as physical appearance, country of origin, religion, or sexual identity, among others. Given the massive volume of material available to readers and the current pace of information distribution, a crucial first step in combating hate speech is to identify those who propagate it, not just individual hate speech remarks. The responsibility for preventing online hate speech falls on the social media platforms that offer the services and websites used by racist communicators [4]. A large amount of work has already been done in computational linguistics on detecting speech acts in the context of conversations [5], a problem known as dialogue act classification (DAC) that comprises many tasks.
Traditional text collection and information retrieval procedures are inefficient for tweets because of the restricted tweet length (currently 280 characters; originally 140) and the chaotic, strange, and quirky nature of tweets [6]. The semantics of words and the distances between words are seldom captured adequately by the traditional bag-of-words and bag-of-n-grams representations. Under these representations the terms “walk,” “run,” and “eat” are equally far apart, even though “walk” should be semantically closer to “run” than to “eat.” With word embeddings, “walk” and “run” end up reasonably close to one another. This study uses an unsupervised neural network to construct the word embedding representation model from a vast collection of billions of phrases. The main contributions of this research are that the resulting embeddings can be used for tasks involving textual data from different sources, such as tweets from social media. Depending on their use cases, users may select among multiple embedding sets and can quickly test them all to discover which one works best for their solution. We also provide various examples to show how these embeddings may be applied in real-world settings. Section 2 presents a background study. Section 3 describes the system model of the proposed CBOW-based word embedding technique. Section 4 presents and illustrates the performance evaluation. Finally, Sect. 5 concludes this article.
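The contrast drawn in the introduction between bag-of-words and embedding representations can be illustrated with a minimal sketch. The three dense vectors below are hypothetical values chosen only for illustration; they are not taken from the trained model. With one-hot (bag-of-words style) vectors every pair of distinct words has cosine similarity zero, whereas dense vectors can place "walk" nearer to "run" than to "eat".

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vocab = ["walk", "run", "eat"]

# One-hot vectors: each word gets its own axis, so all words are equally far apart.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hypothetical dense embeddings (illustrative values, not learned ones).
dense = {
    "walk": [0.9, 0.8, 0.1],
    "run": [0.8, 0.9, 0.2],
    "eat": [0.1, 0.2, 0.9],
}

for name, table in (("one-hot", one_hot), ("dense", dense)):
    print(name,
          "walk~run:", round(cosine(table["walk"], table["run"]), 3),
          "walk~eat:", round(cosine(table["walk"], table["eat"]), 3))
```

The same cosine measure is what Sect. 3.4 later uses to list the nearest terms to a validation word.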

2 Background Study

Deep neural network-based approaches, including ResNet [7], as well as sentiment analysis, sarcasm detection, and other applications of subjective language analysis on online social networks (OSN), have been thoroughly investigated and implemented. Sentiment analysis is a technique used to examine people’s thoughts, sentiments, and emotions and categorise them as positive, negative, or neutral [7, 8]. There are several ways for people to convey their sentiments and emotions. Sarcasm is sometimes used in conjunction with these feelings, especially when expressing strong emotions. Sarcasm is a seemingly positive phrase with an underlying negative intent [9]. Hate speech identification, on the other hand, has attracted significantly fewer attempts than the aforementioned problems. Some of these studies, such as Warner and Hirschberg’s [9], concentrated on online sentences. The binary classification approach had a classification accuracy of 94% and an F1 score of 63.75%, whereas the previous attempt had an accuracy of 80% and an F1 score of 63.75% [8]. By utilising word embeddings trained on content from the extremist website Daily Stormer and tweets from people with high centrality, Liu et al. [10] were able to detect hate speech with a 0.78 F1 score. Convolutional neural networks (CNN) outperformed LSTM owing to the short-range dependencies in tweets. The research used tweets of up to 140 characters, but Twitter now allows messages of up to 280 characters. They concluded that domain-specific word embedding offers superior classification results and is suitable for imbalanced classes. Davidov et al. [11] developed a method that uses Twitter’s user-defined hashtags to classify sentiment types. Grammar, individual words, n-grams, and patterns serve as separate feature types, which are then combined into a single feature vector for sentiment classification. The method builds a feature vector for each sample in the training and test sets and uses the K-Nearest Neighbour approach to assign sentiment labels.

The first deep learning-based tweet act classifier was proposed by Saha et al. [12]. For best results on the multiclass classification problem, they used a convolutional neural network-based model with SVM gradient descent. To make their model more robust, they included a few handcrafted features. BERT [13], which is based on the original transformer model, is a multilayer bidirectional transformer encoder representation [12]. It was trained on BooksCorpus (800 million words) [12] and English Wikipedia (2,500 million words). The model’s input representation is made up of WordPiece embeddings [12], positional embeddings, and segment embeddings. Its architecture may be described as a stack of N repeated blocks of multi-head attention, add, normalisation, and feed-forward layers [7].

Other research focuses on detecting offensive utterances on Twitter. Kwok and Wang [14] concentrated on identifying racist tweets directed towards black people. Using unigram features, they achieved a binary classification accuracy of 76%. Focusing on hate speech aimed at a specific gender, ethnic group, race, or other group naturally causes the collected unigrams to be associated with that group. In terms of predicting tweet acts more accurately, the proposed model surpasses the baselines and state-of-the-art approaches. A pre-trained word embedding model and a regression model were used to identify abusive language across many domains [10]. Their method produced F1 scores of 0.60 on the financial domain and 0.65 on the news domain. However, Word2Vec domain-specific word embedding performed better, with improvements of 5% on both domains. Badjatiya et al. [15] investigated various deep learning and machine learning models for detecting hate speech on a benchmark data set, including Logistic Regression (LR), Random Forest (RF), Gradient Boosted Decision Trees (GBDT), Support Vector Machine (SVM), and various word embedding models. They report that domain-specific embeddings learned with deep neural networks reveal “racist” or “sexist” biases for various terms. According to the aforementioned research, domain-specific detection performs well because it offers more precise semantic representations of the hateful terms that users often employ in a certain domain [16, 17].

Malicious content cannot be stopped at the scale of the Internet by manually reviewing every piece of information. As a result, machine learning and artificial intelligence are becoming increasingly crucial in mitigating serious societal issues such as the prevalence of hate speech. Our challenge set can therefore be thought of as serving two purposes: tracking progress on the optimised CBOW model and encouraging progress on a practical application of hate speech detection. This further distinguishes our problem from other tasks, many of which have sporadic or hazy real-world applications.

3 System Model

3.1 Word Embedding Technique

In neural word embedding, each word is represented by a vector of numbers. It is a simple but unusual translation. Word2vec, like an autoencoder, encodes every word as a vector. Unlike a restricted Boltzmann machine, which trains against the input words through reconstruction, Word2vec trains words against the other words that neighbour them in the input corpus. It does this in one of two ways: by using the context to predict a target word (CBOW), or by using a word to predict its context (skip-gram). Each (context, target) pair is treated as an input-output pair; with CBOW, the embeddings of the surrounding window are combined [7, 14]. CBOW is a network that attempts to predict the middle word from a set of surrounding words: $[W_n[-3], W_n[-2], W_n[-1], W_n[1], W_n[2], W_n[3]] \Rightarrow W_n[0]$. Skip-gram is the exact opposite of CBOW: it predicts the surrounding words from the centre word: $W_n[0] \Rightarrow [W_n[-3], W_n[-2], W_n[-1], W_n[1], W_n[2], W_n[3]]$. In mathematical terms, a word embedding is a parameterised function of the word, $f_{\theta}(w_n) = \theta_n$, where $\theta$ is the parameter and $w_n$ is a word of the sentence. A word embedding is also known as a dense representation of words in the form of vectors. The words cat and dog, for example, can be represented as $W(T_1) = (0.9, 0.1, 0.3, -0.23, \ldots)$ and $W(T_2) = (0.76, 0.1, -0.38, 0.3, \ldots)$. The words will be close in the vector space if the model maintains contextual similarity. Bengio’s method trains a neural network so that each training sentence informs the model about a number of semantically neighbouring words, a technique called distributed representation of words. The neural network not only creates associations between distinct words but also retains semantic and grammatical relationships [18]. Words that are used often and appear in similar contexts tend to mean the same thing. In Word2vec, there are two ways to find a vector that represents a word:

• Continuous Bag of Words (CBOW): the model predicts the current word from the embeddings of the adjacent words.
• Continuous skip-gram: the model predicts the context words from the current word (Fig. 1).
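The difference between the two Word2vec training schemes described above comes down to how training pairs are drawn from a sliding window. The following is a minimal sketch; the function names, the sample sentence, and the window size are illustrative assumptions rather than the authors' code.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs: context predicts the middle word."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of position i, excluding the target itself.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window=2):
    """Return (centre_word, context_word) pairs: the middle word predicts each neighbour."""
    return [
        (target, neighbour)
        for context, target in cbow_pairs(tokens, window)
        for neighbour in context
    ]


tokens = "the quick brown fox jumps".split()
for context, target in cbow_pairs(tokens, window=1):
    print("CBOW     ", context, "->", target)
for centre, neighbour in skipgram_pairs(tokens, window=1)[:4]:
    print("skip-gram", centre, "->", neighbour)
```

CBOW yields one training example per position, with the whole window as input, while skip-gram yields one example per (centre, neighbour) pair.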

Fig. 1 Train the model of nearest neighbouring words

3.2 Problem Statement

Each released embedding data set includes a binary file that contains the words (and phrases, if the data set has them) together with their embeddings. It also includes a text file that lists each word (or phrase) in the data collection along with its probability. The text file primarily serves as a reference; users can work with the binary file without it. (1) Use one of the proposed techniques to load the model into memory. (2) The model is stored as a map whose keys are words or phrases and whose values are their embeddings, each represented as a list of 300 real numbers. The embedding of a word or phrase can then be obtained by looking it up in the map. The lookup returns a null value if a word is absent from the data collection. One straightforward fix is to set the embedding of such missing terms to zero. To address this problem, we take a two-part approach:

1. Create a novel deep neural network architecture that improves on the current state of the art in the probability that a tweet is accurately categorised.
2. Determine whether including more data from the social network, such as tweet analytics or user data, improves classification precision.
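The map lookup and zero-vector fallback described in the problem statement can be sketched as follows. The toy map, its values, and the function name are hypothetical; only the dimensionality (300) and the fallback rule come from the text.

```python
EMBEDDING_DIM = 300  # each embedding is a list of 300 real numbers, as stated in the text


def lookup(word, embedding_map, dim=EMBEDDING_DIM):
    """Return the embedding of `word`, or a zero vector if the word is absent."""
    vector = embedding_map.get(word)  # None plays the role of the null value
    if vector is None:
        return [0.0] * dim
    return vector


# Hypothetical in-memory model: keys are words or phrases, values are embeddings.
embedding_map = {
    "hate": [0.01] * EMBEDDING_DIM,
    "hate speech": [-0.02] * EMBEDDING_DIM,
}

known = lookup("hate", embedding_map)
missing = lookup("unseen-word", embedding_map)
print(len(known), len(missing), any(missing))
```

A zero vector keeps downstream tensor shapes fixed, at the cost of treating every unknown word identically.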

3.3 Proposed Architecture

The input consists of the context words, of size (2 × window size). These are passed to an embedding layer of size (number of words × embedding dimension), which yields a dense vector representation (1 × embedding dimension) for each context word. A lambda layer then aggregates these word embeddings into a single dense embedding (1 × embedding dimension), which is passed to a softmax layer that outputs the most likely target word. We compare this prediction with the actual target word, compute the loss, backpropagate the errors to adjust the parameters (in the embedding layer), and repeat for all (context, target) pairs over multiple epochs. Figure 2 shows the optimised CBOW model.
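The forward pass of the architecture described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the vocabulary size, embedding dimension, and random weights are made up, the aggregation is taken to be an average, and no backpropagation step is shown.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context_ids, embeddings, output_weights):
    """Average the context embeddings, then score every vocabulary word."""
    dim = len(embeddings[0])
    hidden = [
        sum(embeddings[i][d] for i in context_ids) / len(context_ids)
        for d in range(dim)
    ]
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in output_weights]
    return softmax(scores)


random.seed(0)
VOCAB_SIZE, EMBED_DIM = 6, 4  # toy sizes, assumed for illustration
embeddings = [[random.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)]
              for _ in range(VOCAB_SIZE)]
output_weights = [[random.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)]
                  for _ in range(VOCAB_SIZE)]

context_ids = [0, 1, 3, 4]  # 2 x window size context words, window size 2
target_id = 2               # the word in the middle of the window
probs = cbow_forward(context_ids, embeddings, output_weights)
loss = -math.log(probs[target_id])  # cross-entropy against the true target
print(round(sum(probs), 6), round(loss, 4))
```

In training, the gradient of this loss would be propagated back to both weight matrices for every (context, target) pair.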

Fig. 2 Optimised CBOW model

3.4 Optimised CBOW Techniques

1. Within the window size, the input comprises the sum of the one-hot encoded vectors of the context words. The input is of size n × 1. The data loading and text normalisation routines are moved to a separate file that is imported at the start and can now be called. Only reviews with three or more words are considered.
2. We build a vocabulary dictionary to look up words, together with a reverse dictionary that maps indices back to terms. Next, we initialise the word embeddings that we want to fit and declare the model data placeholders.
3. The CBOW model sums the embeddings of the context window; we set up a loop and add up all of the window’s embeddings.
4. To get an impression of our embedding, we use cosine similarity to print the closest terms to the words in our validation set.
5. Finally, we run the training loop, printing the loss at the specified intervals and storing the embeddings and dictionaries.

The loss is the binary cross-entropy:

$$\text{loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log\big(p(y_i)\big) + (1 - y_i) \log\big(1 - p(y_i)\big) \right]$$
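The loss stated above has the form of a binary cross-entropy averaged over N samples. A minimal sketch of that computation follows; the clipping constant `eps` is an added assumption to avoid taking the logarithm of zero and is not part of the original formula.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() stays finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
uncertain = [0.5, 0.5, 0.5, 0.5]
print(binary_cross_entropy(labels, confident) < binary_cross_entropy(labels, uncertain))
```

The leading minus sign makes the loss positive, so better-calibrated predictions give a smaller value.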

When a single aggregated embedding is used to predict the target word, such neural network-based language models have two primary disadvantages: (i) lengthy training and testing, and (ii) the inability to capture statistical information on a global scale.


4 Performance Evaluation

Numerous experiments were conducted to detect hate speech. Words that fall into the hate/offensive category were classified first; deep convolutional neural networks and other computational modelling techniques were then used to learn abstract representations from the input data, and the model was evaluated on the Twitter data set [19]. The data set comprises 1,600,000 tweets retrieved through the Twitter API and taken from the Sentiment140 sample.

4.1 Pre-processing

Because of their varied origins, most real-world data sets are susceptible to missing, inconsistent, and noisy data. Data mining tools produce poor results on such data because they cannot detect patterns in the noise. Data pre-processing is therefore essential for improving overall data quality. Duplicate or missing values can distort the overall statistics of the data. Outliers and inconsistent data points can disrupt the model’s learning, leading to unreliable predictions. At the start of the cycle, we therefore need to work on the available tweets and adjust them so that our algorithms can interpret the language in a comprehensible manner. At the same time, we must combine the two data sets in order to have instances of both negative and positive tweets [20].

4.2 Pre-processing Steps

1. Retweets are Twitter messages that have been reposted; they include the RT tag, the content of the retweeted message, and the user’s name. The retweeted content was kept, but the “RT @username” prefix was deleted because the username provides no further information.
2. URL links to other websites, tweets, online content, and so forth are common in tweets. Both http and bit.ly URLs were removed; bit.ly is a URL shortening service.
3. HTML entities for Unicode characters begin with the characters &# followed by a number, and can represent emoji, characters, punctuation, and so on. These entities were removed in addition to conventional punctuation; the URL links were removed before the punctuation was eliminated.
4. Hashtags are words or phrases prefixed with the hash (#) symbol to indicate that they belong to a certain topic. We decided to keep the hashtag text because it frequently carries semantic significance.
5. Leading and trailing whitespace was deleted, and any run of more than one whitespace character was reduced to a single space.
6. Concatenated words, such as AaaaBbbbCccc, were split at the inner capital letters.
7. Finally, since no English word repeats a letter more than twice in a row, the extra repetitions of any letter repeated more than twice were deleted.

4.3 Techniques to Deal with Unbalanced Data

Unbalanced data sets are ubiquitous in real-world problems. In simple terms, an unbalanced data set is one with an uneven distribution of classes. Unbalanced data can make classification more difficult. Before we examine how to deal with unbalanced data, it is necessary to understand what difficulties an unbalanced data set can bring. One of the goals is to detect hate speech, and because few such examples are available, the classifier should give the available ones a lot of weight. To do this, per-class weights are supplied to Keras as a parameter. As a result, samples from the minority class receive “more attention” from the model. Disaster tweets may be predicted using meta-feature patterns in classes and data sets. Since most crisis tweets come from news sources, they tend to be written in a more formal tone and with more words than non-disaster tweets. Non-disaster tweets include more errors than disaster tweets since they originate from individual people [2, 21]. The main purpose of this project is to categorise Twitter speech acts, also known as tweet act classification. Given a tweet N, the aim is to assign the most relevant tweet act (such as $y_2$) from a collection of tags $Y = \{y_1, y_2, \ldots, y_i\}$, where i is the number of tweet acts. It is therefore a multiclass classification problem, which can be expressed formally as

$$\hat{y} = \arg\max_{y \in Y} F(y \mid N)$$

where F is the tweet act classifier. We note that this formulation does not always reflect real contexts, since a single tweet may express more than one tweet act.
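Two ideas from this subsection lend themselves to a short sketch: weighting classes inversely to their frequency, and choosing the tag with the highest classifier score. The weighting rule below (the common "balanced" heuristic, n / (k × count)) is an assumption, since the text does not state which weights were used, and the tag names and scores are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def predict_tweet_act(scores):
    """Return the tag y in Y with the highest score F(y | N)."""
    return max(scores, key=scores.get)


# Toy label distribution: 90 non-hate tweets (0) and 10 hate tweets (1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)

# Hypothetical classifier scores for one tweet over a small tag set.
scores = {"statement": 0.2, "question": 0.7, "threat": 0.1}
print(predict_tweet_act(scores))
```

A dictionary of this shape is what the `class_weight` argument of Keras `Model.fit` accepts, so the minority class contributes more to the loss.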

4.4 Training and Testing Data

We set up three pre-trained word embeddings for the trial: Word2Vec embeddings trained with the skip-gram approach on the billion-word-scale Google News corpus, “GloVe” embeddings trained on a crawled text corpus of 840 billion words [22], and Twitter embeddings learnt from 1.2 billion tweets with spam eliminated [23]. According to our findings, however, no embedding consistently outperformed the others across all tasks and data sets. This is not unexpected, given that earlier research has found similar trends: the superiority of a particular word embedding model on intrinsic tasks generally does not transfer to downstream applications, whether across tasks, domains, or even data sets. We will go over the results obtained with the Word2Vec embeddings in greater detail.

4.5 Evaluation Metrics

First, we cleaned up our raw text data. Next, we extracted four distinct types of feature sets from the text data. Finally, we used these feature sets to build sentiment analysis models [24, 25]. The Word2Vec features of our proposed CBOW model proved to be the most helpful, which demonstrates the effectiveness of word embedding in solving NLP problems. In the proposed CBOW model, the highest word similarity score for the “sexist” class is 95%, and the minimum loss is 43%. A well-designed loss (objective) function should be minimised by both the skip-gram and CBOW models, and a variety of loss functions can be used to train these language models. Table 1 gives the F1 score, loss, and word similarity of each model. Figure 3a, b shows the loss and accuracy of the optimised CBOW technique. The evaluation metric applied is the F1 score, the harmonic mean of precision and recall. Both false positives and false negatives are therefore taken into account when calculating this score, which makes it appropriate for problems with unequal class distributions [26, 27].

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (1)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (2)$$

$$\text{F1 Score} = \frac{2\,(\text{Recall} \times \text{Precision})}{\text{Recall} + \text{Precision}} \qquad (3)$$
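Equations (1) to (3) can be checked with a few lines of Python. The confusion-matrix counts below are made-up values used only to exercise the formulas; they are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f1


# Hypothetical counts for the hate class.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because F1 ignores true negatives, a classifier that labels every tweet as non-hate scores zero on the hate class even when its overall accuracy is high.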

Table 1 Performance metrics of the optimised CBOW technique

Model                          F1 score   Loss   Word similarity with "sexist"
Full Softmax [28]              87         63     62
Hierarchical Softmax [28, 29]  88         70     72
Cross entropy [23]             89         59     80
NCE [29]                       90         58     81
NEG [23]                       92         56     92
Optimised CBOW                 95         43     95


Fig. 3 a Loss of CBOW b Accuracy of CBOW

Table 2 Evaluation metrics compared with other traditional ML models

Model                Precision   Recall   F1 score   Loss
SVM                  61          63       62         15
One hot              76          74       72         12
LSTM                 82          80       80         11
Bidirectional LSTM   88          82       81         8
Optimised CBOW       97          96       96         4

4.6 Comparison with Existing Methods

For a variety of reasons, including differences in the re-generated data sets, possibly different pre-processing methods, and unknown hyperparameter settings in earlier work, our re-implementations are rarely exact replicas of the prior methods. We therefore compare the results of our method with previously published results in Tables 2 and 3. Compared with other ML models, our proposed model achieves state-of-the-art results, with a score of 96 [28, 29]. When tested on different traditional data sets against CNN + GRU and BERT + GRU, the CBOW model performs well in terms of word embedding accuracy.

5 Conclusion

Word-token counts and TF-IDF weighting were used to extract features from the most common words and hashtags, both in general and in racist or sexist tweets. The tokens included unigrams, bigrams, and trigrams. Finally, to classify future tweets, we created an efficient CBOW model and a combined BERT and CNN classifier.

Table 3 Evaluation metrics compared with other traditional data set with CNN + GRU and BERT + GRU

F1 score
Data set      CNN + GRU [11, 12]   BERT + CNN [2]   (unlabelled)
WZ-S.am       0.84                 0.92             0.92
WZ-S.exp      0.91                 0.92             0.92
LSTM          0.91                 0.92             0.92
WZ-S.gb       0.78                 0.93             0.93
WZ.pj         0.83                 0.83             0.83
WZ            0.74                 0.83             0.83
DT            0.90                 0.83             0.83
Twitter API   0.91                 0.94             0.94

We will examine whether features learned for one hate class can be transferred to another, so that the classes improve each other’s training. In future, we intend to use the capability of word embedding to classify text into more precise categories and to capture minor differences. Additionally, we want to use other machine learning ideas, such as those from unsupervised learning, to improve the model’s accuracy even further.

References

1. Song G, Huang D (2021) A sentiment-aware contextual model for real-time disaster prediction using twitter data. Future Internet 13(7):163
2. Saha T, Jayashree SR, Saha S, Bhattacharyya P (2020) BERT-caps: a transformer-based capsule network for tweet act classification. IEEE Trans Comput Soc Syst 7(5):1168–1179
3. Gupta V, Jain N, Shubham S, Madan A, Chaudhary A, Xin Q (2021) Toward integrated CNN-based sentiment analysis of tweets for scarce-resource language - Hindi. Trans Asian Low-Resour Lang Inf Process 20(5):1–23
4. Mahajan R, Mansotra V (2021) Predicting geolocation of tweets: using combination of CNN and BiLSTM. Data Sci Eng 6(4):402–410
5. Stolcke A, Ries K, Coccaro N, Shriberg E, Bates R, Jurafsky D, Taylor P, Martin R, Ess-Dykema CV, Meteer M (2000) Dialogue act modeling for automatic tagging and recognition of conversational speech. Comput Linguist 26(3):339–373
6. Veale T, Cook M (2018) Twitterbots: making machines that make meaning. MIT Press
7. Sun C, Qiu X, Xu Y, Huang X (2019) How to fine-tune BERT for text classification? In: China national conference on Chinese computational linguistics. Springer, Cham, pp 194–206
8. Krallinger M, Vazquez M, Leitner F, Salgado D, Chatr-Aryamontri A, Winter A, Perfetto L, Briganti L, Licata L, Iannuccelli M, Castagnoli L, Cesareni G, Tyers M, Schneider G, Rinaldi F, Leaman R, Gonzalez G, Matos S, Kim S, Wilbur WJ, Rocha L, Shatkay H, Tendulkar AT, Agarwal S, Liu F, Wang X, Rak R, Noto K, Elkan C, Lu Z, Dogan RI, Fontaine JF, Andrade-Navarro MA, Valencia A (2011) The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinf 12(8):1–31
9. Antonakaki D, Fragopoulou P, Ioannidis S (2021) A survey of twitter research: data model, graph structure, sentiment analysis and attacks. Expert Syst Appl 164:114006
10. Kapil P, Ekbal A (2020) A deep neural network based multi-task learning approach to hate speech detection. Knowl-Based Syst 210:106458
11. Chikersal P, Poria S, Cambria E, Gelbukh A, Siong CE (2015) Modelling public sentiment in Twitter: using linguistic patterns to enhance supervised learning. In: International conference on intelligent text processing and computational linguistics. Springer, Cham, pp 49–65
12. Akter F, Tushar SA, Shawan SA, Keya M, Khushbu SA, Isalm S (2021) Sentiment forecasting method on approach of supervised learning by news comments. In: 2021 12th International conference on computing communication and networking technologies (ICCCNT). IEEE, pp 1–7
13. Kumar A, Cambria E, Trueman TE (2021) Transformer-based bidirectional encoder representations for emotion detection from text. In: 2021 IEEE symposium series on computational intelligence (SSCI). IEEE, pp 1–6
14. Dong L, Xu S, Xu B (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 5884–5888
15. Mozafari M, Farahbakhsh R, Crespi N (2020) Hate speech detection and racial bias mitigation in social media based on BERT model. PLoS ONE 15(8):e0237861
16. Samuel Raj RJ, Anantha Babu S, VL HJ, Varalatchoumy M, Kathirvel C (2022) Implementing multiclass classification to find the optimal machine learning model for forecasting malicious URLs. In: 2022 6th International conference on computing methodologies and communication (ICCMC), pp 1127–1130. https://doi.org/10.1109/ICCMC53470.2022.9754005
17. Joshua Samuel Raj R, Anantha Babu S, Jegatheesan A, Arul Xavier VM (2022) A GAN-based triplet facenet detection algorithm using deep face recognition for autism child. In: Peter JD, Fernandes SL, Alavi AH (eds) Disruptive technologies for big data and cloud applications. Lecture notes in electrical engineering, vol 905. Springer, Singapore. https://doi.org/10.1007/978-981-19-2177-3_18
18. Ji Y, Haffari G, Eisenstein J (2016) A latent variable recurrent neural network for discourse relation language models. arXiv preprint arXiv:1603.01913
19. Saif H, Fernandez M, He Y, Alani H (2013) Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold
20. Ali A, Shamsuddin SM, Ralescu AL (2013) Classification with class imbalance problem. Int J Adv Soft Compu Appl 5(3)
21. Branco P, Torgo L, Ribeiro RP (2016) A survey of predictive modeling on imbalanced domains. ACM Comput Surv (CSUR) 49(2):1–50
22. Zhou M et al (2020) A text sentiment classification model using double word embedding methods. Multimedia Tools Appl 1–20
23. Stein RA, Jaques PA, Valiati JF (2019) An analysis of hierarchical text classification using word embeddings. Inf Sci 471:216–232
24. Nagarajan SM, Gandhi UD (2019) Classifying streaming of Twitter data based on sentiment analysis using hybridization. Neural Comput Appl 31(5):1425–1433
25. Poonguzhali R, Ahmad S, Sivasankar PT, Anantha Babu S, Joshi P et al (2023) Automated brain tumor diagnosis using deep residual u-net segmentation model. Comput Mater Continua 74(1):2179–2194
26. Senthil Murugan N, Usha Devi G (2018) Detecting streaming of Twitter spam using hybrid method. Wireless Pers Commun 103(2):1353–1374
27. Chen JIZ, Zong JI (2021) Automatic vehicle license plate detection using k-means clustering algorithm and CNN. J Electr Eng Autom 3(1):15–23
28. Kouretas I, Paliouras V (2019) Simplified hardware implementation of the softmax activation function. In: 2019 8th International conference on modern circuits and systems technologies (MOCAST). IEEE, pp 1–4
29. Goldberg Y, Levy O (2014) word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722

Performance Analysis of Discrete Wavelets in Hyper Spectral Image Classification: A Deep Learning Approach

Valli Kumari Vatsavayi, Saritha Hepsibha Pilli, and Charishma Bobbili

Abstract Convolutional neural networks (CNNs) are widely used in contemporary research to classify hyperspectral images (HSIs). Because of its extensive spectral information, HSI data poses a challenge to current data analysis techniques. It has been noted that a conventional CNN primarily captures the spatial characteristics of an HSI while ignoring the spectral data, and it therefore performs poorly. As a result, spectral feature extraction now plays a major role in HSI data processing. Of the several existing strategies for HSI spectral feature extraction, the discrete wavelet transform (DWT) is selected for analysis here. Spectral feature extraction by wavelet decomposition can be helpful because it preserves the contrast between spectral signatures. This work analyses two basic DWTs, the Haar and Daubechies wavelets, and gives a thorough examination of deep learning-based HSI classification. To this end, the paper examines the wavelet CNN, which emphasises spectral characteristics by layering DWTs. The extracted spectral features are then passed to a 2D CNN, which emphasises spatial characteristics and generates a spatial–spectral feature vector for classification. Specifically, factor analysis is first used to reduce the HSI dimension. The discrete wavelet decomposition algorithm is then used to obtain four-level decomposition features, which are concatenated with four-layer convolution features to merge spectral and spatial information. The overall approach aims to improve the final HSI classification performance through an appropriate choice of mother wavelet. Experiments with the wavelet feature fusion CNN were conducted on the Indian Pines benchmark data set, and the classification results were analysed to determine the overall classification accuracy. For spectral feature extraction, Daubechies wavelets are found to give better classification performance than Haar wavelets.

V. K. Vatsavayi (B) · S. H. Pilli · C. Bobbili Department of Computer Science and Systems Engineering, Andhra University College of Engineering, Visakhapatnam, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_27


Keywords Hyperspectral images (HSIs) classification · Convolutional neural network (CNN) · Haar wavelet · Daubechies wavelet · Multi-level feature decomposition

1 Introduction

A hyperspectral image (HSI) is a three-dimensional data cube that combines spectral and spatial information and is widely used in remote sensing. Each band of an HSI encodes the pixel intensity values for a particular part of the spectrum [1]. Because they can discern small spectral differences, HSIs are used in numerous fields. Nevertheless, HSI classification poses particular challenges: high dimensionality, a small number of labelled samples, and considerable spatial variability of spectral signatures [2]. How to classify HSI data so as to obtain the most discriminative information is therefore the most active research area in the hyperspectral community.

According to the research reported in [3], deep learning is highly capable at feature extraction. As a result, a growing number of researchers use deep learning techniques such as CNNs to investigate HSI classification. Numerous studies have used convolutional neural networks (CNNs) to classify HSIs, since CNNs can process large volumes of data and help extract inherent data properties. However, because of the limitations of HSI data, it is difficult to obtain adequate spectral–spatial features with standard convolution alone. To address this shortcoming, researchers began to look at ways of extracting useful spectral–spatial features. Flexible models such as a CNN integrated with a graph residual architecture [4] and a spectral–spatial-dependent global learning model [5] have been proposed to classify HSIs with high accuracy.

In addition, to exploit spectral–spatial information fully, researchers have turned to wavelet transforms as a means of obtaining important spectral–spatial features. In [1], the authors examined the shortcomings of several models (2D CNNs, 3D CNNs, and 3D-2D CNNs) and presented a wavelet CNN for multiresolution HSI classification. They called it SpectralNET, a variant of the 2D CNN. Because the wavelet transform is computationally cheaper than a 3D CNN, the authors conclude that the resulting model is superior. The wavelet transform was established as an effective feature extractor for HSI classification in [6]. The authors found that the wavelet transform and a 2D CNN model work together to extract spectral and spatial information from an HSI. These features are concatenated channel by channel and then fed to the dense classification layers of the 2D CNN.

Based on the foregoing, we infer that a 2D CNN model yields better classification accuracy on HSI data when the DWT is used for spectral feature extraction. Since DWT-based systems are typically sensitive to the choice of mother wavelet, a careful choice could result in higher classification accuracy. Therefore, in this paper, we examine the HSI classification accuracies of the two fundamental discrete wavelets, Haar and Daubechies, when paired with a 2D CNN. We then give a comprehensive evaluation that verifies the improvement the wavelet technique brings to HSI classifiers. The purpose of this study is thus to give a basic and clear comparative analysis of DWT-based spectral feature extraction methods used to improve the classification accuracy of 2D CNNs on HSI data.

Using the factor analysis (FA) algorithm, we first reduce the dimensionality of the original HSI data to build the input features of the model. The features are then passed to the wavelet CNN module. Next, a discrete wavelet decomposition generates multi-level decomposition features, which are concatenated with multi-layer convolution features to combine spatial and spectral information. In this work, we use a four-level wavelet decomposition and a four-layer CNN, implemented in Python, to create the classifier under investigation. We use the HSI data cube from the Indian Pines data set [1].

The remainder of this paper is organised as follows. Section 2 gives an overview of some common deep models and wavelet transforms for HSI classification. Section 3 presents the implementation methodology of the wavelet functions and the discrete wavelet CNN framework. Section 4 analyses the wavelet CNN-HSI classification performance for the two basic discrete wavelet functions. Section 5 concludes the paper.
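The patch-based input described above can be pictured with a short sketch. It is only an illustration under stated assumptions: the function name `extract_patches`, the zero padding at the image border, the convention that label 0 means "unlabelled", and the synthetic cube standing in for an FA-reduced HSI are all hypothetical and are not taken from the paper.

```python
import numpy as np

def extract_patches(cube, labels, patch_size):
    """Cut an (M, M, B) patch around every labelled pixel of a (P, Q, B) cube."""
    margin = patch_size // 2
    # Zero-pad the two spatial axes so that border pixels also get full patches.
    padded = np.pad(cube, ((margin, margin), (margin, margin), (0, 0)))
    rows, cols = np.nonzero(labels)           # assumption: label 0 = unlabelled
    patches = np.stack([
        padded[r:r + patch_size, c:c + patch_size, :]
        for r, c in zip(rows, cols)
    ])
    # The labelled pixel sits at index (margin, margin) inside its patch.
    return patches, labels[rows, cols]

rng = np.random.default_rng(0)
cube = rng.normal(size=(10, 12, 3))           # stand-in for an FA-reduced cube (P, Q, B)
labels = rng.integers(0, 4, size=(10, 12))    # synthetic ground truth
patches, targets = extract_patches(cube, labels, patch_size=8)
print(patches.shape, targets.shape)
```

An even patch size is used here so that the patch can later be halved repeatedly by a wavelet decomposition; the paper does not state the value of M it used.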

2 Related Work

In this section, we review a few popular deep learning models for HSI classification. We also give a brief overview of some of the techniques employed in our model for reference.

HSI classification is the process of accurately predicting the class of each pixel in a remotely sensed HSI. An important characteristic of an HSI is that it includes both spectral and spatial information. Because deep learning-based algorithms perform well on image data, different CNNs (i.e. 2D, 3D, 2D-3D, and FuSENet) have been proposed in the literature for HSI classification [7–11]. To train a 3D CNN with a 3D kernel, the authors of [12] used small 3D patches derived from the original HSI cube. In [13], the impact of HSI deep feature extraction on 3D CNNs was examined. A new residual hybrid 3D-2D CNN was presented in [14]. Recurrent Neural Networks (RNN), Generative Adversarial Networks (GAN), and Graph CNNs have all been used [15]. The authors concluded that although 3D-2D CNNs model both the spatial and spectral components of an HSI cube, they perform poorly across a wide range of data sets. Moreover, 3D CNNs are computationally costly in comparison with 2D CNNs.

Many studies in the literature have proposed methods that use only a 2D CNN and can extract both spatial and spectral features. However, because of the limitations of HSI data, it is difficult to obtain sufficient spectral–spatial features with a CNN alone. For HSI classification, the researchers in [6] used the wavelet transform. They investigated how combining the wavelet transform with a 2D CNN model can extract both spectral and spatial features from an HSI. In multiresolution image classification, their technique is reported to outperform existing methods, and it opens the path for the wavelet CNN. A multi-level Haar wavelet features fusion network with CNN enhancement (CNN-MHWF2N) was proposed by Guo et al. [3]. This approach yields superior results. However, the model relies only on the popular Haar wavelet to achieve the four-level decomposition of spectral–spatial information.

There are infinitely many possible mother wavelets. Wavelet decomposition can be effective for spectral data extraction because it preserves the distinction between spectral signatures. The accuracy of a 2D CNN can be considerably influenced by the mother wavelet. Our work studies the efficacy of two commonly used mother wavelets, which represent the Haar and Daubechies families. Daubechies wavelets capture polynomial patterns well, whereas the Haar wavelet is discontinuous and resembles a step function. This research compares the classification accuracies obtained with these wavelets. Specifically, the paper focuses on key issues that influence the accuracy of 2D CNNs on HSI data.

3 Methodology

In this section, we describe the wavelet CNN strategy used for HSI classification.

3.1 Wavelet CNN Framework

A variety of methods exist for extracting the essential spectral components of HSI data and thereby reducing its dimensionality. Because a wide variety of materials is represented in a scene, it is critical to understand which features matter most to a classification algorithm. More specifically, a deep learning system is required to classify HSI data properly. For this study, we therefore implemented a simple wavelet CNN (see Fig. 1), which incorporates spectral analysis into a CNN and is a good example of currently known approaches.

As shown in Fig. 1, the implemented model proceeds as follows. Factor analysis (FA) is used as a pre-processing step to reduce the large dimensionality of an HSI of dimensions (P, Q, R), where P and Q are the spatial dimensions and R is the spectral dimension. The wavelet CNN then processes the extracted patches, which have reduced dimensions (M, M, B). As a result, training takes less time. In addition, the spectral features obtained from the wavelet transform require less computation than a 3D CNN.

Fig. 1 Wavelet CNN model for HSI classification overview

In the current work, a discrete wavelet function decomposes the spectral–spatial features into four levels. Likewise, four convolution layers extract the 2D spatial features. The output of each convolution layer is combined with the decomposed features of the corresponding level. Furthermore, the features of the different levels are progressively cascaded during feature extraction so that information is exchanged between levels. Finally, an average pooling layer, several fully connected (FC) layers, and a softmax classifier produce the predictions from which classification accuracy is measured.

In the discrete wavelet CNN technique, the labelled samples of each data set are divided into two parts, training and testing. Y1 and Y2 are the labels of the training data X1 and the testing data X2, respectively. The training data are used to update the model parameters, whereas the testing data are used to evaluate how well the method generalises. We chose the stochastic gradient descent (SGD) optimiser and the cross-entropy loss function for this work. The loss function is

$$L = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y_i \log \hat{y}_i + (1 - y_i)\log\left(1 - \hat{y}_i\right) \right] \qquad (1)$$

where $n$ is the number of categories, $y_i$ is the probability value for the $i$th class in the actual sample labels $Y = \{y_1, y_2, \ldots, y_n\}$, and $\hat{y}_i$ is the probability value for the $i$th class in the predicted sample labels $\hat{Y} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n\}$.
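As a small numerical illustration of Eq. (1), the sketch below evaluates the loss for one sample with NumPy. The function name `cross_entropy_eq1` and the clipping constant `eps` are assumptions added for numerical safety; they are not part of the paper.

```python
import numpy as np

def cross_entropy_eq1(y_true, y_pred, eps=1e-12):
    """Loss of Eq. (1) for one sample: mean over the n class entries."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 so that log() stays finite (assumption).
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    terms = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return -terms.mean()

y_true = [0, 1, 0]            # one-hot label for a three-class toy example
y_pred = [0.1, 0.8, 0.1]      # softmax-like output
loss = cross_entropy_eq1(y_true, y_pred)
print(round(loss, 4))
```

The sketch follows Eq. (1) as printed, including the $(1 - y_i)$ term. Loss functions that deep learning frameworks call "categorical cross-entropy" usually keep only the $y_i \log \hat{y}_i$ term for one-hot labels, so the value computed here may differ from what a given framework reports.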








3.2 Feature Decomposition Using Discrete Wavelets

In this study, the discrete 2D wavelet transform is used to reduce hyperspectral data in the spectral domain for each pixel independently. As demonstrated in [16], the DWT can extract the most discriminative multi-scale features. The authors of that work investigated reducing the dimensionality of HSIs with several wavelet filters, and they explained why they preferred the DWT to a variety of other state-of-the-art HSI reduction methods. Because of the wavelet's intrinsic multi-resolution capability, which retains both high- and low-frequency features, the peaks and troughs of typical spectra are preserved.

Haar Wavelet

The Haar wavelet is the simplest wavelet. The idea behind the Haar wavelet transform is straightforward, and the transform is fast and memory efficient. The Haar wavelet is a step-like, discontinuous function. To achieve a hierarchical decomposition of the HSI data, the multi-resolution CNN uses a pair of kernels ($K_l$ and $K_h$), and it applies different kernels at each stage of the decomposition. In this study, the high-pass wavelet kernel ($K_h$) is the Haar wavelet kernel function, while the low-pass wavelet kernel ($K_l$) is the scaling function. As stated in [17], the 2D Haar wavelet transform uses four kernels ($f_{LL}$, $f_{LH}$, $f_{HL}$, $f_{HH}$). The input image $f(x, y)$ in Fig. 2 is first filtered and downsampled along the rows (i.e. the horizontal direction) to create the low-pass and high-pass coefficient matrices $L(x, y)$ and $H(x, y)$. These are then filtered and downsampled along the columns (i.e. the vertical direction). The first wavelet decomposition thus yields four sub-image features: $f_{HH}(x, y)$, $f_{HL}(x, y)$, $f_{LH}(x, y)$, and $f_{LL}(x, y)$.

As described in [1], if $X$ is an HSI patch with M × M dimensions that is passed through a Haar transform, then the $(i, j)$th spectrum position value can be written as:

$$\mathrm{Haar}(i, j) = X(2i - 1, 2j - 1) + X(2i - 1, 2j) + X(2i, 2j - 1) + X(2i, 2j) \qquad (2)$$

Fig. 2 Flow chart for Haar wavelet one-level decomposition
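Equation (2) amounts to summing each non-overlapping 2 × 2 block of the patch, which halves both spatial dimensions. A minimal NumPy sketch is given below; the function name `haar_ll` is hypothetical, and the 1-based indices of Eq. (2) are mapped to 0-based array slices.

```python
import numpy as np

def haar_ll(patch):
    """Eq. (2): sum every non-overlapping 2x2 block of an (M, M) patch."""
    m = patch.shape[0]
    assert patch.shape[:2] == (m, m) and m % 2 == 0, "expects an even, square patch"
    # X(2i-1, .) in 1-based indexing is the even-indexed row in 0-based indexing.
    return (patch[0::2, 0::2] + patch[0::2, 1::2]
            + patch[1::2, 0::2] + patch[1::2, 1::2])

patch = np.arange(16.0).reshape(4, 4)
ll = haar_ll(patch)
print(ll)   # [[10. 18.] [42. 50.]]
```

Because the slicing acts only on the first two axes, the same function also works on an (M, M, B) patch and processes every band independently.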

As shown in Fig. 3, the wavelet transform splits the HSI patch into sub-bands, and the spectral and spatial features are learned by feeding these sub-bands through a convolution layer. In the next layer, the wavelet transform decomposes the sub-band component again and passes it to the convolution layer. By repeating this process in each layer, the CNN keeps learning the spectral and spatial characteristics of the HSI patch.

Daubechies Wavelet

The Daubechies wavelets are a family of wavelet transforms that are simple to apply and to invert. Like the Haar transform, the Daubechies wavelet transform is implemented as a series of decompositions; the only difference is that the filter length exceeds two (see Fig. 4). As a result, it is smoother and better localised.
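The difference in filter length can be made concrete with a one-level 1D decomposition. The sketch below is an illustration, not the paper's implementation: it assumes the four-tap Daubechies filter (often called db2 or D4), periodic signal extension, and hypothetical helper names `high_pass` and `dwt_level`.

```python
import numpy as np

SQRT2, SQRT3 = np.sqrt(2.0), np.sqrt(3.0)
HAAR_LO = np.array([1.0, 1.0]) / SQRT2                       # filter length 2
DB2_LO = np.array([1 + SQRT3, 3 + SQRT3,
                   3 - SQRT3, 1 - SQRT3]) / (4 * SQRT2)      # filter length 4

def high_pass(lo):
    """Quadrature-mirror high-pass filter: g[k] = (-1)^k * h[L-1-k]."""
    return lo[::-1] * (-1.0) ** np.arange(lo.size)

def dwt_level(signal, lo):
    """One decomposition level with periodic extension and downsampling by 2."""
    hi = high_pass(lo)
    n = signal.size
    # Row k holds the samples signal[2k], signal[2k+1], ... (wrapping around).
    idx = (2 * np.arange(n // 2)[:, None] + np.arange(lo.size)[None, :]) % n
    windows = signal[idx]
    return windows @ lo, windows @ hi                        # approximation, detail

ramp = np.arange(8.0)                                        # a linear "spectrum"
a_haar, d_haar = dwt_level(ramp, HAAR_LO)
a_db2, d_db2 = dwt_level(ramp, DB2_LO)
print(np.round(d_haar, 3))
print(np.round(d_db2, 3))
```

On the linear ramp, the Haar detail coefficients are all non-zero, whereas the Daubechies detail coefficients vanish except for the last one, which is affected by the periodic wrap-around. This is one way to see why a longer filter represents smooth, polynomial-like trends more compactly.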

Fig. 3 Four-level feature decomposition for wavelet

Fig. 4 Haar and Daubechies wavelets


Because the wavelet transform has strong time–frequency localisation, it maps distinct underlying image features to different wavelet coefficients, which allows deeper feature extraction. In the final step, the extracted four-level features interact sequentially, which can enrich the information flow and the HSI features. An average pooling layer, two FC layers, and a softmax classifier are then applied to the fused features to obtain the prediction.
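The shape flow of the four-level fusion can be walked through with a NumPy sketch. Everything specific here is an assumption made for illustration: the patch size, the number of convolution channels, the way the previous level is cascaded, the use of a single untrained dense layer in place of the two FC layers, and the random arrays standing in for convolution outputs.

```python
import numpy as np

def haar_approx(x):
    """Sum non-overlapping 2x2 spatial blocks of an (M, M, C) array."""
    m, _, c = x.shape
    return x.reshape(m // 2, 2, m // 2, 2, c).sum(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
patch = rng.normal(size=(32, 32, 3))     # hypothetical (M, M, B) input patch
n_classes = 16                           # illustrative number of classes

fused, level_shapes = None, []
wavelet = patch
for level in range(4):
    wavelet = haar_approx(wavelet)                        # wavelet features of this level
    conv = rng.normal(size=wavelet.shape[:2] + (8,))      # stand-in for a conv-layer output
    parts = [wavelet, conv]
    if fused is not None:
        parts.append(haar_approx(fused))                  # cascade the previous level
    fused = np.concatenate(parts, axis=-1)                # channel-wise fusion
    level_shapes.append(fused.shape)

pooled = fused.mean(axis=(0, 1))                          # global average pooling
weights = rng.normal(size=(pooled.size, n_classes))       # untrained dense layer
probs = softmax(pooled @ weights)
print(level_shapes, probs.shape)
```

The point of the sketch is only that each level halves the spatial size while the channel count grows through concatenation, so the pooled vector that reaches the classifier carries features from all four levels.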

4 Experiments and Result Analysis

In this section, we carry out a set of experiments from two perspectives. First, a set of experiments shows the benefits of including wavelet transforms in deep learning models for HSI classification. Second, the classification performance of the two discrete wavelets is carefully compared. The benchmark HSI data set Indian Pines is used for our experiments. The experiments were run on the Google Colab cloud platform with a GPU, using Python 3.7.13 and Keras 2.8.0.
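The experiments vary the share of labelled samples used for training. One common way to form such a split is sketched below; the paper does not state how its split was drawn, so the per-class (stratified) sampling, the function name `stratified_split`, and the toy labels are all assumptions.

```python
import numpy as np

def stratified_split(labels, train_ratio, seed=0):
    """Pick roughly train_ratio of the samples of every class for training."""
    rng = np.random.default_rng(seed)
    train_idx = []
    for cls in np.unique(labels):
        members = rng.permutation(np.flatnonzero(labels == cls))
        n_train = max(1, int(round(train_ratio * members.size)))  # at least one per class
        train_idx.extend(members[:n_train])
    train_idx = np.sort(np.array(train_idx))
    test_idx = np.setdiff1d(np.arange(labels.size), train_idx)
    return train_idx, test_idx

labels = np.repeat([1, 2, 3], [40, 20, 10])   # toy, imbalanced class labels
train_idx, test_idx = stratified_split(labels, train_ratio=0.15)
print(train_idx.size, test_idx.size)
```

Sampling within each class keeps rare classes represented in the training set, which matters for a metric such as average accuracy that weights all classes equally.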

4.1 Classification Results

Tables 1 and 2 give the classification results. We assessed accuracy to examine the influence of DWT-based spectral feature extraction on the performance of the CNN model. We chose three standard accuracy metrics: Overall Accuracy (OA), Average Accuracy (AA), and the Kappa coefficient (Kappa). OA is the proportion of samples that are labelled correctly out of all samples. AA is the mean of the class-wise classification accuracies, and Kappa measures the agreement between the true labels and the predicted labels. Bold text in Tables 1 and 2 marks the higher value in each comparison, which makes it easy to see which wavelet gives the better OA, AA, and Kappa when the CNN model classifies the Indian Pines HSI data. The classification results of the CNN with the two DWT methods are compared for varying training ratios and numbers of epochs. Tables 1 and 2 show that, on the Indian Pines data set, the CNN with the Daubechies wavelet outperforms the CNN with the Haar wavelet in terms of OA, AA, and Kappa.

Table 1 Experiment results for CNN with different wavelets on Indian Pines data set for distinct amounts of training ratios with epochs fixed at 80

| Training ratio | Haar OA (%) | Haar AA (%) | Haar Kappa (%) | Daubechies OA (%) | Daubechies AA (%) | Daubechies Kappa (%) |
|---|---|---|---|---|---|---|
| 15% | 98.34 | 97.25 | 98.11 | **98.48** | **97.52** | **98.27** |
| 20% | 99.46 | 98.61 | 99.38 | **99.52** | **98.80** | **99.45** |
| 30% | 99.76 | 97.92 | 99.72 | **99.86** | **99.44** | **99.84** |
| 40% | 99.75 | 99.19 | 99.72 | **99.93** | **99.94** | **99.92** |

Table 2 Experiment results for CNN with different wavelets on Indian Pines data set for various number of epochs with training ratio fixed at 15%

| No. of epochs | Haar OA (%) | Haar AA (%) | Haar Kappa (%) | Daubechies OA (%) | Daubechies AA (%) | Daubechies Kappa (%) |
|---|---|---|---|---|---|---|
| 20 | 98.85 | 97.19 | 98.69 | **99.05** | **99.14** | **98.92** |
| 40 | 98.00 | 97.48 | 97.71 | **99.03** | **98.66** | **98.90** |
| 60 | 98.63 | 98.38 | 98.44 | **98.72** | **98.94** | **98.54** |
| 80 | 98.34 | 97.25 | 98.11 | **98.48** | **97.52** | **98.27** |

Further, the output maps of the CNN model with the two wavelets for different training ratios and numbers of epochs are shown in Figs. 5, 6, 7 and 8. The maps with lower classification accuracy are visibly rougher. The reason is that it is difficult to extract sufficient and effective spectral–spatial features using a CNN model with a basic wavelet like Haar. The CNN model with the Daubechies wavelet was able to capture more discriminative information from the HSI data, resulting in higher OA, AA, and Kappa values and more accurate visualisation maps.

Fig. 5 Visualisation maps of IP data set for CNN with Haar using various training ratios and epochs fixed at 80: a ground truth, b training ratio = 15% (98.34%), c training ratio = 20% (99.48%), d training ratio = 30% (99.76%), and e training ratio = 40% (99.75%)
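The three metrics used in this section can all be derived from a confusion matrix. The sketch below shows one way to compute them with NumPy; the function name `classification_scores` and the toy labels are hypothetical, and the code assumes that every class occurs at least once among the true labels.

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Return (OA, AA, Kappa) computed from the confusion matrix."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    cm = np.zeros((n_classes, n_classes))
    np.add.at(cm, (y_true, y_pred), 1)           # rows: true class, columns: prediction
    total = cm.sum()
    oa = np.trace(cm) / total                    # overall accuracy
    aa = (np.diag(cm) / cm.sum(axis=1)).mean()   # mean of the per-class accuracies
    # Agreement expected by chance, from the row and column marginals.
    chance = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - chance) / (1.0 - chance)
    return oa, aa, kappa

y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 2, 2, 0]
oa, aa, kappa = classification_scores(y_true, y_pred, n_classes=3)
print(round(oa, 4), round(aa, 4), round(kappa, 4))   # 0.8 0.8056 0.697
```

In this toy case OA and AA differ only slightly because the classes are nearly balanced; on a data set with rare classes, AA can drop well below OA when the rare classes are misclassified.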

Fig. 6 Visualisation maps of IP data set for CNN with Daubechies using various training ratios and epochs fixed at 80: a ground truth, b training ratio = 15% (98.48%), c training ratio = 20% (98.90%), d training ratio = 30% (99.86%), and e training ratio = 40% (99.93%)

Fig. 7 Visualisation maps of IP data set for CNN with Haar using various numbers of epochs and a fixed training ratio of 15%: a ground truth, b epochs = 20 (98.85%), c epochs = 40 (98.00%), d epochs = 60 (98.63%), and e epochs = 80 (98.34%)


Fig. 8 Visualisation maps of IP data set for CNN with Daubechies using various numbers of epochs and a fixed training ratio of 15%: a ground truth, b epochs = 20 (99.05%), c epochs = 40 (99.03%), d epochs = 60 (98.72%), and e epochs = 80 (98.48%)

Overall, the results show that the CNN model with the Daubechies wavelet achieves higher accuracy, which reflects the superior performance of this configuration. This is because Daubechies is a compactly supported orthonormal wavelet that preserves signal energy, whereas Haar compresses by averaging and differencing. Thus, Daubechies filters outperform Haar filters in terms of OA, AA, and Kappa and thereby provide better classification. Finally, we contend that evaluating the classification model against multiple standard accuracy measures is crucial for reducing false or inaccurate findings and for suggesting best practices going forward.

5 Conclusion

In this paper, we explored how spectral feature extraction with the Haar and Daubechies wavelet filters affects the performance of a CNN model that classifies HSI data. The classification results were evaluated in terms of OA, AA, and Kappa. In our experiments, Indian Pines, one of the benchmark HSI data sets, was used to evaluate the CNN model with the Haar and Daubechies wavelets. The experimental results demonstrate that the CNN model with the Daubechies wavelet consistently provides over 98% classification accuracy. The experiments show that, on HSI data such as the Indian Pines data set, the accuracy of the CNN model with the Daubechies wavelet is substantially higher than that obtained with the Haar wavelet, and that it produces good visualisation maps.

References

1. Chakraborty T, Trehan U (2021) SpectralNET: Exploring spatial–spectral WaveletCNN for hyperspectral image classification. arXiv preprint arXiv:2104.00341
2. Shambulinga M, Sadashivappa G (2021) Hyperspectral image classification using convolutional neural networks. Int J Adv Comput Sci Appl (IJACSA) 12(6)
3. Guo W, Xu G, Liu B, Wang Y (2022) Hyperspectral image classification using CNN-Enhanced multi-level haar wavelet features fusion network. IEEE Geosci Remote Sens Lett 19
4. Guo W, Xu G, Liu W, Liu B, Wang Y (2021) CNN-combined graph residual network with multilevel feature fusion for hyperspectral image classification. IET Comput Vis 15(8):592–607
5. Zhu et al (2021) A spectral–spatial-dependent global learning framework for insufficient and imbalanced hyperspectral image classification. IEEE Trans Cybern. https://doi.org/10.1109/TCYB.2021.3070577
6. Prabhakar TN, Geetha P (2017) Two-dimensional empirical wavelet transform based supervised hyperspectral image classification. ISPRS J Photogramm Remote Sens 133:37–45
7. Li S, Song W, Fang L, Chen Y, Ghamisi P, Benediktsson JA (2019) Deep learning for hyperspectral image classification: an overview. IEEE Trans Geosci Remote Sens 57(9):6690–6709
8. Yang X, Ye Y, Li X, Lau RYK, Zhang X, Huang X (2018) Hyperspectral image classification with deep learning models. IEEE Trans Geosci Remote Sens 56(9):5408–5423
9. Han M, Cong R, Li X, Fu H, Lei J (2020) Joint spatial-spectral hyperspectral image classification based on convolutional neural network. Pattern Recogn Lett 130:38–45. Image/Video Understanding and Analysis (IUVA)
10. Roy SK (2020) Fusenet: fused squeeze-and-excitation network for spectral-spatial hyperspectral image classification. IET Image Process 14(8):1653–1661
11. Zheng J, Feng Y, Bai C, Zhang J (2020) Hyperspectral image classification using mixed convolutions and covariance pooling. IEEE Trans Geosci Remote Sens 1–13
12. Ahmad M, Khan AM, Mazzara M, Distefano S, Ali M, Sarfraz MS (2022) A fast and compact 3-D CNN for hyperspectral image classification. IEEE Geosci Remote Sens Lett 19:1–5. Art no. 5502205. https://doi.org/10.1109/LGRS.2020.3043710
13. Chen Y, Jiang H, Li C, Jia X (2016) Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Trans Geosci Remote Sens 54(10):1–20
14. Roy SK, Krishna G, Dubey SR, Chaudhuri BB (2020) Hybridsn: exploring 3-d–2-d cnn feature hierarchy for hyperspectral image classification. IEEE Geosci Remote Sens Lett 17(2):277–281
15. Mou L, Lu X, Li X, Zhu XX (2020) Nonlocal graph convolutional networks for hyperspectral image classification. IEEE Trans Geosci Remote Sens 1–12
16. Mallat SG (1989) A theory for multi resolution signal decomposition: the wavelet representation. IEEE Trans Pattern Anal Mach Intell 11(7):674–693
17. Liu P, Zhang H, Lian W, Zuo W (2019) Multi-level wavelet convolutional neural networks. IEEE Access 7:74973–74985

Tyro: A Mobile Inventory Pod for e-Commerce Services

Aida Jones, B. Ramya, M. P. Sreedharani, R. M. Yuvashree, and Jijin Jacob

Abstract Due to the increasing use of smartphones and Internet access, e-commerce start-ups are spreading rapidly throughout the world. Robotics and automation, together with cutting-edge technologies such as artificial intelligence (AI), machine learning, and deep learning, are some of the factors contributing to the explosive growth of the e-commerce sector. The e-commerce industry as a whole is performing well in comparison with other sectors, but start-up e-commerce services are falling behind in terms of automation and mechanical integrity. In light of this, this study proposes a reasonably priced mobile inventory pod for e-commerce services that incorporates artificial intelligence and stores the requested items in the designated region of the warehouse. Goods are picked from the rack by a robotic arm with four degrees of freedom, and a camera module serves as the pod's eye. AI helps to train the bot to take the right path, choose the right object, and identify obstacles in its path. Mecanum wheels allow the robot to manoeuvre in any direction, even in confined spaces, and an Arduino microcontroller ensures that the entire system runs smoothly.

Keywords Artificial intelligence · Robotic arm · Camera module · Mecanum wheels · Arduino microcontroller

A. Jones (B) · B. Ramya · M. P. Sreedharani · R. M. Yuvashree Department of ECE, KCG College of Technology, Chennai, India e-mail: [email protected] J. Jacob Landmark International Auto Spare Trading LLC, Dubai, UAE © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_28

1 Introduction

The e-commerce sector is changing rapidly as Internet access expands across much of the globe. Traditional retail businesses are moving online and, as a result, are growing their customer base and maintaining their competitiveness. Consumers prefer online shopping because of wider Internet access, the ease of transactions, the variety of items and personalised offers, and the absence of the restrictions imposed by physical presence and contact. E-commerce is currently growing very quickly, particularly since the COVID-19 pandemic, during which numerous new e-commerce companies emerged. Although the pandemic did little harm to e-commerce itself, it highlighted the sector's significance and increased demand while COVID-19 curfews were in effect. Leading online retailers like Amazon, Alibaba Group, eBay, and Flipkart were able to meet demand with a mix of fewer employees and increased automation and robotics. Emerging e-commerce businesses, however, found it difficult to meet customer demand because they lacked labour and equipment support. This demonstrates how crucial automation and robotics are to the e-commerce sector.

Robots are used on e-commerce platforms for a wide range of tasks, including product delivery, stock management, and picking. Many e-commerce businesses rely on fulfilment and logistics to keep track of inventory and to deliver goods to customers on schedule and with minimal fuss. Robots can help businesses by doing tasks that are too difficult for people to carry out, such as retrieving items from a warehouse floor or organising goods according to size or colour. Additionally, robots can complete comparable activities in warehouses more affordably than hiring high-priced labour or investing in new machinery. Employees can then devote more time to higher-level tasks like business planning and improvement.

Taking all of this into account, the main goal of our project is to create a low-cost mobile inventory pod with artificial intelligence. With AI, businesses will be able to expand their capabilities, and the stress of managing high levels of product unpredictability will also be reduced. The best integrators are those who have worked with machine learning and artificial intelligence.
The pod has a four-degree-of-freedom robotic arm that is actuated by servo motors, driven by a servo motor driver, and can pick up objects from racks with its gripper. The robotic arm sits directly in front of the AI-enabled camera module, which is mounted on top of the chassis and helps the pod follow the object and identify obstacles in its path. The rest of this paper is organised as follows: Sect. 2 reviews related work; Sect. 3 elaborates on the functioning and flow of the proposed system; Sect. 4 discusses the project's outcome; and Sect. 5 wraps up the project's overall process.

2 Related Work

This section describes earlier work on pick-and-place robotic arms. Chandrika et al. [1] presented an Arduino-based pick-and-place robotic arm with the Arduino serving as the primary controller. The goal of the project was to incorporate a pick-and-place robot into a four-wheel drive vehicle that could perform both industrial and non-industrial tasks. In this instance, the robotic arm had a flex sensor, and an Arduino was used to control the navigation. The end gripper and robotic arm could be adjusted to fit the needs of the application. The future advances discussed in that paper included the addition of a feedback system and other sensors to make the system autonomous.


A pick-and-place robot was developed by Neeraja et al. [2] to help people with disabilities carry out activities on their own. They developed a robot that could be controlled from an Android phone. The prototype was made up of an XLR8 Development Board (an FPGA-based microcontroller that can be programmed using the Arduino IDE), a power source, motor drivers, motors, and a Bluetooth module. The XLR8 microcontroller is faster, more powerful, and more scalable. An Android app called “Arduino Bluetooth controller” was installed on the user's device, and commands were sent to the robot to gather items and move them from the source location to the desired location. Their design has a number of limitations, such as the restricted range of the Bluetooth module, which could be replaced with a Wi-Fi module to increase the operating range. To address the robot's difficulty in handling smooth surfaces, the hand gripper could be roughened so that it grasps objects better.

The design and construction of a pick-and-place robotic arm controlled by a programmable logic controller (PLC) were proposed by Abhiraj et al. [3]. A PLC was used to operate and control the robotic arm. The goal was to develop an automated industrial system that could be managed at any time and from any location. The system starts when the start push button is pressed, and the conveyor belt then begins to move. The item is placed on the conveyor belt, and as soon as the conveyor begins to move, the sensor detects the object. The conveyor stops automatically whenever an item is detected. The drawback of the device is that it was not cost-effective and needed more upkeep.

Robotic pick-and-place time optimisation for shoe production was studied in [4]. The robot in that system located shoe pieces placed on a tray, then picked them up and put them in a shoe mould for processing. The shoe components were spread out on the tray in random order. A decision tree model was created to identify the pattern and predict the best order for putting the pieces back together. Decision trees are rarely used in research to address sequencing and scheduling problems in robotics. According to the study's findings, the decision tree technique makes task planning easier in a complex environment with several paths and the potential for robot collisions.

Roshni and Kumar [5] discussed a pick-and-place operation carried out by a robot with built-in artificial intelligence. This allows the robot to pick objects that are oriented in various directions by using the centre of gravity, which increases the precision of the pick-and-place operation. The recommended method suggests that pick-and-place procedures will become more accurate and safe. Compared with the other methods mentioned above, this method required less human involvement. The suggested approach, however, tended to concentrate on similar goods.

After examining numerous research articles on successful manipulators, the review study by Surati et al. [6] emphasised the various components of a robotic arm. They cited a number of empirically verified research publications to survey the different controller types used and the strategies adopted by various authors to determine the degrees of freedom of a manipulator that picks an object and places it in a predetermined location. Consequently, the information gleaned from these publications will help with the design of the robotic arm.


A. Jones et al.

A thorough survey [7] on the development of robotic arms provides a technical summary of some of the most recent studies on the subject. The topic still has a number of unresolved research issues. Robotic arms available on the market offer many alternatives, and some of them are fairly precise and reliable. The survey examines the development of the robotic arm over the preceding twenty years and defines numerous arm parameters; the type of robotic arm is determined by these parameters alone. It concludes with a list of unresolved issues and potential future studies, and its results can be used to guide future research.

3 Methodology

3.1 Mecanum Wheels

As seen in Fig. 1, a mecanum wheel has rollers all the way around its circumference. The rollers are mounted at an angle, typically 45°, to the rotational axis of the wheel, so the wheel generates a diagonal force when it turns forward or backward. By turning the wheels in particular combinations, these diagonal forces can move the robot in any direction. Adjusting the rotational speed and direction of each wheel sets the sum of the force vectors from the wheels, which gives the vehicle linear motion, rotation, or both, and enables it to manoeuvre in a small space. When both wheels on one side are driven in one direction and the wheels on the other side in the opposite direction, the transverse vectors cancel out and the longitudinal vectors couple to generate a torque around the central vertical axis, which causes the vehicle to rotate on the spot. Driving the wheels on one diagonal in one direction and the wheels on the other diagonal in the opposite direction produces a sideways movement, because the transverse vectors add up while the longitudinal vectors cancel out.

Fig. 1 Mecanum wheels
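The rotation and sideways cases described above can be checked numerically with the standard inverse-kinematics relations for a four-wheel mecanum platform. The following Python sketch is an illustration only, not the pod's firmware: the function name, the wheel radius, and the chassis half-dimensions are assumed values, and the sign convention (x forward, y to the left, counter-clockwise rotation positive) is one common choice.

```python
def mecanum_wheel_speeds(vx, vy, wz, wheel_radius=0.03, half_length=0.10, half_width=0.08):
    """Return wheel angular speeds in rad/s as (front_left, front_right, rear_left, rear_right).

    vx: forward speed (m/s), vy: leftward speed (m/s), wz: counter-clockwise yaw rate (rad/s).
    The default dimensions are hypothetical values chosen for illustration.
    """
    k = half_length + half_width
    front_left = (vx - vy - k * wz) / wheel_radius
    front_right = (vx + vy + k * wz) / wheel_radius
    rear_left = (vx + vy - k * wz) / wheel_radius
    rear_right = (vx - vy + k * wz) / wheel_radius
    return front_left, front_right, rear_left, rear_right


if __name__ == "__main__":
    print("forward :", mecanum_wheel_speeds(0.2, 0.0, 0.0))
    print("sideways:", mecanum_wheel_speeds(0.0, 0.2, 0.0))
    print("rotate  :", mecanum_wheel_speeds(0.0, 0.0, 1.0))
```

In the sideways case the two diagonals receive opposite speeds, and in the rotation case the left and right sides receive opposite speeds, which matches the force-vector argument in the text.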


Fig. 2 Husky lens camera

3.2 Husky Lens Camera Module

The husky lens is an easy-to-use AI camera with a vision sensor. It features face recognition, object tracking, object recognition, line tracking, colour recognition, and tag identification. It also has a 2.0-inch IPS display, so the parameters can be adjusted without a computer. The husky lens, depicted in Fig. 2, learns a new object when the learning button is pressed and held, even from varying angles and distances, and it becomes more accurate the more it learns. A husky lens can therefore act as the eyes of a robot, allowing the robot to recognise objects and react to them.

3.3 Proposed System

The mobile inventory pod is designed primarily to help warehouse employees, particularly in warehouses run by start-ups. With the use of artificial intelligence, the pod is taught to locate an object using the husky lens, pick it up with the robotic arm, and take it to the designated location. The entire system is managed by an Arduino Uno microcontroller. The pod uses mecanum wheels because they allow omnidirectional movement while searching for products inside the warehouse and let the pod manoeuvre in confined areas that workers cannot reach. Each of the four mecanum wheels is connected to a separate DC motor, and the motors are driven through the L298N DC motor driver. The husky lens camera module enables the pod to distinguish objects and detect obstacles with great accuracy. Obstacle detection helps to prevent mishaps when several pods operate in one warehouse. The robotic arm of the pod has four degrees of freedom, is actuated by servo motors, and picks up the designated item from the rack. To ensure reliable operation of the arm, the pod is fitted with a PCA9685 servo motor driver. The block diagram and circuit diagram of the arrangement are shown in Figs. 3 and 4.
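To move a joint of the arm, a servo angle has to be converted into a PWM setting for the PCA9685, which is a 12-bit driver (4096 steps per PWM period). The arithmetic can be sketched in Python as follows. The 50 Hz update rate and the 500 to 2500 µs pulse range are assumed typical hobby-servo values rather than figures from this work, and `angle_to_ticks` is a hypothetical helper name.

```python
PWM_RESOLUTION = 4096  # the PCA9685 divides each PWM period into 4096 steps (12 bits)


def angle_to_ticks(angle_deg, freq_hz=50.0, min_pulse_us=500.0,
                   max_pulse_us=2500.0, max_angle_deg=180.0):
    """Convert a servo angle to the number of PWM steps the pulse should stay high.

    The pulse limits and update rate are assumptions; real servos should be calibrated.
    """
    if not 0.0 <= angle_deg <= max_angle_deg:
        raise ValueError("angle outside the servo range")
    # Linear map from angle to pulse width in microseconds.
    pulse_us = min_pulse_us + (max_pulse_us - min_pulse_us) * angle_deg / max_angle_deg
    period_us = 1_000_000.0 / freq_hz
    return round(pulse_us / period_us * PWM_RESOLUTION)


if __name__ == "__main__":
    for angle in (0, 90, 180):
        print(angle, "deg ->", angle_to_ticks(angle), "ticks")
```

On the Arduino side, the resulting tick count would be the value handed to the PWM driver for the corresponding channel.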


Fig. 3 Block diagram

Fig. 4 Circuit diagram

3.4 Working

When an ordered item is assigned to the pod, the pod takes the order and leaves its starting point. The pod moves to the product storage racks with the aid of the


mecanum wheels. The pod has the warehouse map stored in its memory, so it knows where the various categories of products are located. It thus reaches the area where the ordered goods are stored. Figure 5 shows the detailed process involved in the system as a use case diagram.

Fig. 5 Use case diagram of tyro


The husky lens is then used to check whether the ordered item is present in the rack. If the item is recognised, the pod picks it up with the robotic arm gripper and then looks for the next order in the pipeline. If there is no further order, it moves the collected items to the delivery area. Figure 6 illustrates the work flow of the project.

4 Results and Discussion

The pod can be used in any industry to assemble items. The proposed solution is an effective and cost-effective aid for e-commerce start-ups. The pod enables the required goods to be supplied on time while reducing human interaction, errors, and the time needed to complete a task. The obstacle detection capability extends the pod's lifespan by avoiding collisions with other pods or barriers, and the mecanum wheels allow the pod to travel freely even in crowded areas. Figure 7 shows a photograph of our prototype, and Fig. 8 shows the front view of the chassis. The robotic arm gripper and a side view of the pod holding an object with the gripper are shown in Figs. 9 and 10.

5 Conclusion

The mobile inventory pod is a helpful tool for e-commerce start-ups, since it enables them to deliver ordered items to the proper location. What sets the pod apart is the object recognition capability of the husky lens camera module, which allows it to recognise the right goods and to avoid obstacles while moving. The mecanum wheels allow the pod to move sideways. The pod is therefore a suitable solution to deploy as a bot in the assembly area of an industrial warehouse.

Fig. 6 Flow chart


Fig. 7 Prototype

Fig. 8 Front view of the chassis

Fig. 9 Robotic arm gripper view


Fig. 10 Side view of the pod while lifting the object using a robotic arm gripper

References

1. Chandrika P et al (2021) Arduino based pick and place robotic Arm. Int J Emerg Technol Innovative Res 8(2):2266–2269. ISSN: 2349-5162
2. Neeraja R et al (2018) Implementation of pick and place robot. Int J Creative Res Thoughts (IJCRT) 6(2):15–157. ISSN: 2320-2882
3. Abiraj B et al (2019) Pick and place robotic ARM using PLC. Int J Eng Res Technol (IJERT) 8(08)
4. Mendez JB et al (2020) Robotic pick-and-place time optimization: application to footwear production. IEEE Access 8:209428–209440
5. Roshni N, Kumar TKS (2017) Pick and place robot using the centre of gravity value of the moving object. In: 2017 IEEE international conference on intelligent techniques in control, optimization and signal processing (INCOS), pp 1–5. https://doi.org/10.1109/ITCOSP.2017.8303079
6. Surati S et al (2021) Pick and place robotic arm: a review paper. Int Res J Eng Technol (IRJET) 8(2)
7. Patidar V, Tiwari R (2016) Survey of robotic arm and parameters. In: 2016 International conference on computer communication and informatics (ICCCI), pp 1–6. https://doi.org/10.1109/ICCCI.2016.7479938
8. Omijeh BO et al (2014) Design analysis of a remote controlled ‘pick and place’ robotic vehicle. Int J Eng Res Dev 10(5)
9. Chandak LP, Junghare A, Naik T, Ukani N, Chakole S (2020) Mobile gantry robot for pick & place application. In: 2020 IEEE international students’ conference on electrical, electronics and computer science (SCEECS), pp 1–5. https://doi.org/10.1109/SCEECS48394.2020.171
10. Baby A et al (2017) Pick and place robotic arm implementation using Arduino. IOSR J Electr Electron Eng (IOSR-JEEE) 12(2):38–41. e-ISSN: 2278-1676, p-ISSN: 2320-3331
11. Jones A, Abisheek K, Dinesh Kumar R, Madesh M (2022) Cataract detection using deep convolutional neural networks. In: Reddy VS, Prasad VK, Wang J, Reddy K (eds) Soft computing and signal processing. ICSCSP 2021. Advances in intelligent systems and computing, vol 1413. Springer, Singapore


12. Thanzeem Mohamed Sheriff S, Venkat J, Vigeneshwaran S, Jones A, Anand J (2021) Lung cancer detection using VGG net architecture. In: IOP Publishing 2021 journal of physics conference series, vol 2040, p 012001
13. Kasiviswanathan S, Vijayan TB, John S, Simone L, Dimauro G (2020) Semantic segmentation of conjunctiva region for non-invasive anemia detection applications. Electronics 9:1309. https://doi.org/10.3390/electronics9081309
14. Kasiviswanathan S, Vijayan TB, John S (2020) Ridge regression algorithm based non-invasive anaemia screening using conjunctiva images. J Ambient Intell Humanized Comput. https://doi.org/10.1007/s12652-020-02618-3

Segmentation and Classification of Multiple Sclerosis Using Deep Learning Networks: A Review

V. P. Nasheeda and Vijayarajan Rajangam

Abstract The central nervous system is potentially disabled by multiple sclerosis (MS), in which the myelin sheaths of neurons are destroyed, causing communication problems between the brain and the rest of the body. Magnetic resonance imaging (MRI) is used to track new and enlarged lesions. This is particularly challenging since new lesions are very small and changes are often subtle. Lesion activity is determined by observing their tactile sensation and their position. MS lesion activity is used as a secondary endpoint in numerous MS clinical drug trials, and the detection of lesion activity between two time points is a crucial biomarker since it determines the disease progression. Segmentation and classification of MS lesions are very important in supporting MS diagnosis and patient follow-up. This paper reviews deep learning networks for segmenting and classifying the brain tissues of MS patients from magnetic resonance images.

Keywords Multiple sclerosis · MRI · Lesion · Deep learning

1 Introduction

The elementary unit of the human nervous system is the neuron, which carries messages throughout the human body. A single neuron comprises the soma (cell body), dendrites, an axon, etc. The axon is the long fiber that connects the cell body to another neuron. A protective covering called the myelin sheath surrounds the axon. Myelin is responsible for the quick transmission of impulses through the neuron [1]. A disease called multiple

V. P. Nasheeda
School of Electronics Engineering, Vellore Institute of Technology Chennai, Chennai, Tamil Nadu, India

V. Rajangam (B)
Centre for Healthcare Advancement, Innovation and Research, Vellore Institute of Technology Chennai, Chennai, Tamil Nadu, India
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_29


V. P. Nasheeda and V. Rajangam

sclerosis (MS) arises when the human immune system attacks the myelin, which results in nerve damage and disrupts communication between the brain and the rest of the body. The main symptoms of MS are vision loss, pain, fatigue, and impaired coordination. The symptoms, duration, and period of the disease vary from person to person. MS is influenced by many factors such as age, gender, family history, and weather. The disease creates plaques, termed lesions, in the central nervous system (CNS) [2].

1.1 Stages and Types of MS

According to the International Advisory Committee on MS clinical trials in 2013, there are mainly four stages of MS [3]: Clinically Isolated Syndrome (CIS), Relapsing-Remitting MS (RRMS), Secondary-Progressive MS (SPMS), and Primary-Progressive MS (PPMS). CIS is the first episode of neurological symptoms caused by myelin destruction in the CNS, and it lasts for at least 24 h. CIS can be either monofocal or multifocal. A monofocal episode presents a single symptom, such as an optic neuritis attack [4]. Health care providers face two main challenges in identifying CIS: the first is to identify the present neurological damage due to CIS, and the second is to determine whether it will lead to MS [5]. RRMS involves relapsing and remitting disease symptoms and begins in the age group of 20–30. The symptoms may be visible for a few weeks to months. The stage that follows RRMS is SPMS. The symptoms of SPMS include eye pain, double vision, numbness and tingling, pain, dizziness, etc. The progression of the disease is referred to as PPMS. After progression, the chance of remission decreases, and the condition of the patient becomes worse [6]. Figure 1 shows the MRI of an MS patient from different angles.

The starting point of demyelination is inflammation, and an MS patient has lesions in different parts of the CNS. MS is a white matter (WM) disease, but gray matter (GM) demyelination occurs in chronic MS patients [7]. WM and GM lesions are pathologically different; GM lesions are less inflammatory. There are three types of cortical lesions in MS patients: leukocortical lesions (Type-1), intracortical lesions (Type-2), and pial surface lesions (Type-3). Type-1 lesions include GM and the nearest WM, Type-2 lesions lie completely within the neocortex, and Type-3 lesions extend from the pial surface [8]. The inflammation in MS is due to the penetration of lymphocytes and plasma cells around the damaged myelin. Commonly, MRI measures are used to estimate the inflammatory activity, and the presence of T1 gadolinium-enhancing WM lesions reflects the breakdown of the blood-brain barrier in the CNS. Gadolinium enhancement can show only the pivotal and temporary swelling of the WM, and these are present in infected and normal lesions of WM and GM. The immune cell infiltrates, particularly lymphocytes and plasma cells, are present in the leptomeninges, where they aggregate into ectopic lymphoid follicle-like structures (ELFs) with features of tertiary lymphoid tissue. ELFs are a cause of damage to the outer brain, such as neuroaxonal loss and myelin destruction. The modern approaches in


Fig. 1 MRI of MS patient

brain imaging have preferred the study of this combination by detecting the core of leptomeningeal contrast enhancement [9]. MRI is classified into conventional and non-conventional MRI. Apart from brain imaging, spinal cord imaging, myelin imaging, and optic nerve imaging can also be used for the detection of MS [10]. The McDonald criteria, a set of principles that provide good sensitivity, support the use of MRI for MS detection [11]. Diagnosing MS using MRI is time-consuming, tedious, and prone to manual errors. Automatic MS lesion segmentation and classification can be done by using deep learning networks.

This paper is organized as follows. The next section elaborates on MRI, datasets for MS, and evaluation measures. Deep learning methods for segmentation and classification are given in the third and fourth sections, respectively, followed by issues and challenges in the fifth section, discussion in the sixth section, and conclusion in the seventh section.

2 MRI, Datasets, and Evaluation Measures

2.1 MRI

MRI is a medical imaging technique that uses a magnetic field and radio waves to create detailed images of the organs and tissues in the human body [12]. T1-weighted imaging highlights differences in the T1 relaxation time of tissues, which is also called the spin-lattice or longitudinal relaxation time. A radio frequency (RF) pulse disturbs the proton spins from their equilibrium state, and T1 gives the time taken by the spins to realign with the main magnetic field [13]. For T1 weighting, the echo time (TE) and repetition time (TR) tend to be short. To get better contrast, agents such as gadolinium-containing compounds are used. Black holes are areas of focal T2 hyperintensity that appear as hypointense lesions on unenhanced T1 weighted (T1W) images. Ten to thirty percent demyelination and axonal loss leads to the formation of a black hole [14]. Black holes are not correlated with the degree of demyelination but are correlated with axonal density. The MRI sequences are less sensitive to black holes of the spinal cord than conventional MRI. These black holes of T1W


images provide a high correlation with disability level, demonstrate less of the clinico-radiological paradox in MS, and are markers of more severe chronic disease [15]. T2 weighted (T2W) imaging is one of the basic pulse sequences of MRI. It highlights differences in the T2 relaxation time of tissues, which is also called the spin-spin or transverse relaxation time. T2W MRI can play a role in the clinical evaluation of MS [16]. New MS lesions, such as periventricular WM lesions, appear bright against the dark WM background and can be identified by this imaging. Complementary information is provided by PD-weighted (PDW) images in one spin-echo sequence [17]. Non-conventional MRI markers in MS include Diffusion Tensor Imaging (DTI), functional MRI (fMRI), etc. Lesion tissue can be evaluated by DTI. The main drawback of DTI is that fibers of different alignments can lie inside the same voxel, called crossing fibers, which leads to the failure of the diffusion ellipsoid model. The effect of multiple sclerosis on neural activation patterns, under task-based and resting-state paradigms, can be measured by an MRI scanner with a main magnetic field strength of 7 tesla (7T) or greater. Such a scanner provides an improved ability to detect smaller and earlier WM and GM MS lesions with high localization. To operate 7T MRI within a comfortable acquisition time and accepted safety limits, new signal transmission and read-out methods are needed [19].

2.2 Dataset

The shortage of databases for MS detection and classification was addressed by the introduction of the Medical Image Computing and Computer-Assisted Intervention (MICCAI) and IEEE International Symposium on Biomedical Imaging (ISBI) MS lesion segmentation challenge datasets. The MICCAI 2008 database was acquired with a 3T Siemens Allegra system. It contains high-resolution T1W, T2W, and T2-FLAIR images. All images have the same axial orientation and have undergone suitable processing to ease registration and interpolation [19]. The ISBI 2015 dataset deals with longitudinal MS lesions and is used for the validation of segmentation. The images were acquired with a 3T Philips Medical Systems MRI scanner and comprise T1W, T2W, FLAIR, and PDW images. Training is done with ground truth segmentations [20]. The MICCAI 2016 challenge was organized for the study of multiple sclerosis segmentation algorithms. It covers a large range of automatic algorithms for independent evaluation, comprising 13 methods for the segmentation of MS lesions evaluated on 53 MS cases. The dataset contains T1W, T2W, FLAIR, and PDW images acquired with various MRI scanners at different magnetic field strengths, and it has undergone preprocessing and anonymization. The training cases are provided with ground truths from manual segmentation [20].

Table 1 Evaluation measures

Measure                      Calculation
DSC or F1 score              DSC = 2TP/(2TP + FP + FN)               (1)
Dice loss                    Dice Loss = 1 − DSC                     (2)
Specificity                  SPE = TN/(TN + FP)                      (3)
Positive predictive value    PPV = TP/(TP + FP)                      (4)
Sensitivity                  SEN = TP/(TP + FN)                      (5)
Accuracy                     ACC = (TN + TP)/(TN + TP + FN + FP)     (6)

2.3 Evaluation Measures

A number of measures for evaluating MS lesion segmentation methods are based on the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts. For a TP, the estimated value is 1 and matches the ground truth value of 1. For an FP, the estimated value is 1 but the ground truth is 0. For an FN, the estimated value is 0 but the ground truth is 1. For a TN, the estimated value is 0 and matches the ground truth of 0 [21]. Some of the evaluation measures are given in Table 1. DSC is the most commonly used performance metric. It is similar to the Intersection over Union (IoU) coefficient; the only difference is that it counts TP twice. Like IoU, it varies from 0 to 1, where 0 represents no overlap between the evaluated image and the ground truth and 1 represents a perfect overlap [21]. The DSC values of different deep learning architectures for MS lesion segmentation are shown in Fig. 5.
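The measures in Table 1 follow directly from the four counts. The Python sketch below computes them for a pair of flattened binary masks; the function names and the ten-voxel example are illustrative assumptions, and IoU is included only to show its relation to DSC.

```python
def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN for two equally long sequences of 0/1 labels."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, tn, fp, fn


def evaluation_measures(pred, truth):
    """Return the measures of Table 1 (plus IoU) as a dictionary."""
    tp, tn, fp, fn = confusion_counts(pred, truth)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    dsc = ratio(2 * tp, 2 * tp + fp + fn)
    return {
        "DSC": dsc,
        "DiceLoss": 1.0 - dsc,
        "SPE": ratio(tn, tn + fp),
        "PPV": ratio(tp, tp + fp),
        "SEN": ratio(tp, tp + fn),
        "ACC": ratio(tn + tp, tn + tp + fn + fp),
        "IoU": ratio(tp, tp + fp + fn),
    }


if __name__ == "__main__":
    truth = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # hypothetical ground-truth lesion mask
    pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]   # hypothetical segmentation output
    for name, value in evaluation_measures(pred, truth).items():
        print(f"{name}: {value:.3f}")
```

For this example TP = 2, FP = 1, FN = 1, and TN = 6, so DSC = 4/6 while IoU = 2/4, which is consistent with the identity DSC = 2·IoU/(1 + IoU).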

3 Deep Learning (DL) Networks

DL is a subgroup of machine learning (ML) that uses networks of three or more layers, which learn from huge amounts of data and then perform analytic and physical tasks to improve automatic performance [22]. Deep learning can solve complicated segmentation problems with high accuracy [23]. A DL architecture has five stages, as shown in Fig. 2. The first stage of every deep learning network defines

Fig. 2 Stages of DL


the architecture based on the task to be performed. Every deep learning network needs historical data, called training data, which contains previous observations of a specific problem. The deep learning model becomes more useful when the training data is large. The model is fitted to the dataset over many iterations, followed by validation and prediction. After achieving the expected level of accuracy, the model is deployed for real-time predictions [24].
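The stages described above (define the model, collect training data, fit over iterations, validate, then predict) can be illustrated with a deliberately tiny model. The Python sketch below fits a one-feature logistic regression by gradient descent and stops once a validation accuracy target is met. It is a stand-in for a deep network, not a DL model; the synthetic data, the target of 0.9, and all function names are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_data(n, rng):
    """Synthetic 1-D samples labelled 1 when x > 0.5 (a stand-in for real training data)."""
    return [(x, 1 if x > 0.5 else 0) for x in (rng.random() for _ in range(n))]


def predict(w, b, x):
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


def accuracy(w, b, data):
    return sum(predict(w, b, x) == y for x, y in data) / len(data)


def train(train_set, val_set, target_acc=0.9, lr=1.0, max_epochs=500):
    w, b = 0.0, 0.0                              # define the (tiny) model
    epoch = 0
    for epoch in range(1, max_epochs + 1):       # fit over several iterations
        grad_w = grad_b = 0.0
        for x, y in train_set:
            err = sigmoid(w * x + b) - y         # gradient of the cross-entropy loss
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / len(train_set)
        b -= lr * grad_b / len(train_set)
        if accuracy(w, b, val_set) >= target_acc:  # validate after each iteration
            break
    return w, b, epoch


if __name__ == "__main__":
    rng = random.Random(0)
    train_set, val_set = make_data(200, rng), make_data(100, rng)
    w, b, epochs = train(train_set, val_set)
    print("epochs:", epochs, "validation accuracy:", accuracy(w, b, val_set))
    print("prediction for x = 0.9:", predict(w, b, 0.9))  # use the fitted model
```

A real pipeline would replace the toy model with a network and the synthetic samples with labeled MRI data, but the control flow stays the same.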

3.1 DL Methods for MS Lesion Segmentation

The unique feature of segmentation using DL is learning without human intervention. The model is trained using unlabeled images to identify the patterns that are common in MS [25]. In supervised learning, selected training images are labeled for segmentation. This approach is considered the first attempt at automatic learning of discriminative three-dimensional features. The performance improves as the amount of labeled data increases [26]. Both tissue and lesion segmentation are used to evaluate therapeutic effectiveness. Most studies concentrate on lesion segmentation or are single-center studies, and the accuracy in multi-center studies is modest. There are a number of DL networks for MS detection and segmentation. The most common are Convolutional Neural Networks (CNN), U-Net, Fully Convolutional Neural Networks (FCNN), Generative Adversarial Networks (GAN), and Recurrent Neural Networks (RNN).

3.2 CNN

CNN is an important type of DL network for image segmentation and classification. A CNN consists of convolution, pooling, batch normalization, and fully connected layers. Figure 3 shows the architecture of a 2D-CNN.
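To make the convolution and pooling layers concrete, the following Python sketch applies one convolution, a ReLU activation, and one max-pooling step to a small 2D image using plain lists. The 5 × 5 image with a bright 2 × 2 blob and the all-ones kernel are invented for illustration; a real CNN learns its kernels and would use an array library.

```python
def conv2d(image, kernel):
    """Valid 2-D cross-correlation with stride 1, as used in CNN convolution layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(fmap):
    """Element-wise rectified linear activation."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows or columns that do not fill a window are dropped."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]


# Hypothetical 5 x 5 image containing a bright 2 x 2 blob.
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
kernel = [[1, 1], [1, 1]]  # responds most strongly where a 2 x 2 window is fully bright

feature_map = conv2d(image, kernel)      # 4 x 4 feature map
pooled = max_pool2d(relu(feature_map))   # 2 x 2 map after pooling

if __name__ == "__main__":
    for row in pooled:
        print(row)
```

The strongest response in the pooled map sits in the quadrant that contains the blob, which is the kind of spatially coarse evidence the later layers of a CNN build on.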

Fig. 3 2D-CNN architecture


3.3 U-Net

As shown in Fig. 4, U-Net is a type of CNN that has encoding and decoding networks. The encoder consists of a number of downsampling sections along with convolutional layers. The decoder has a number of upsampling sections and convolutional layers [26]. Skip connections link the downsampling layers to the upsampling layers, and the network uses multi-scale information via these skip connections to combine coarser and finer information [27]. In U-Net, the stacking of feature maps is responsible for learning important features. The obtained feature maps undergo upsampling and deconvolution to recover an image of the original resolution. Mathematical operations such as addition and subtraction provide voxel-wise fusion, so that the features are combined. For long-range connections, the feature maps are joined and the output is given to the decoding part. The drawbacks of the U-Net structure and its training time can be reduced to a significant extent by reducing the number of filters and concatenations related to the convolutional layers. This method is termed a lightweight deep learning framework. MRI usually constitutes unbalanced data because it contains more non-lesion voxels than lesion voxels. The lightweight framework helps to solve the false positive problem and also provides higher accuracy and DSC [28]. The robustness and efficiency of the network in MS lesion segmentation can be analyzed through randomized fivefold subject-independent cross-validation. Lesion detection can be done by the convolutional layer and a sigmoid function. This architecture with a binary cross-entropy loss function provides an accuracy of 96.79. A 3D U-Net with multi-modal MRI, T2 weighted lesion maps, and an attention mechanism has also been used. These correspond to the difference

Fig. 4 U-Net architecture for image segmentation


between two MRIs that have been taken at different times. This assists the network in learning to classify real anatomical change and syntactical change during the construction of place findings for small plaques. The framework can also classify the appearance and disappearance of New and Enlarging (NE) lesions [29].

The Synergy Net framework is a 3D MS MRI lesion segmentation method that maintains the good performance of U-Net for medium to large lesions and combines it with Mask R-CNN to improve the segmentation performance on fine lesions [30]. 3D FLAIR MRIs are used in this framework. Synergy Net tackles the MS lesion irregularly.

Joint segmentation of MS lesions and whole-brain structures in MRI of any contrast and resolution can be done using a CNN without any new supervised data [31]. This step only requires segmentations, and no images are needed. It uses the generative model of Bayesian segmentation to generate synthetic scans with simulated lesions, which are then used to train the CNN [32].

The graph convolutional network (GCN) approach is a hybrid method in which a CNN with an autoencoder is used. A GCN is a type of neural network that takes advantage of structural information. It is applied to graph datasets constructed by treating the voxels of a 3D MRI as nodes [33]. The graph is defined by its vertices (V) and an adjacency matrix A, where A holds the data on node connections, with weights that are updated.

SMORE is a DL-based self-supervised anti-aliasing (SAA) and self-supervised super-resolution (SSR) method. The main difference from other deep learning frameworks is that SMORE does not need any external training data and can be applied to a wide variety of acquired MRI pulse sequences without any preprocessing steps. It provides high accuracy and robustness under low levels of Rician noise [34].

In 2015, Branch et al. proposed the first semantic segmentation. The input is a large MRI patch, so there is no redundant calculation from overlapping patches [35]. Convergence during training is faster due to random sampling. Only one forward propagation pass is required to classify all pixels in an image, so the computational efficiency increases. The appearance of plaque results in class imbalance, because the lesion area is tiny compared to the non-lesion area.

The GAN performs image translation and works by adversarial training of a generator and a discriminator using pixel-to-pixel mapping. The GAN framework does not rely on existing data processing steps to solve segmentation problems. It is more efficient than other popular semantic segmentation models and the patch-based 3D CNN model made for brain MRI [36].

Image segmentation using deep learning needs a huge amount of data, and there is a shortage of available images in the medical field. An image registration framework for augmenting the MS dataset handles this problem by registering images of two different patients into one new image. The method smoothly adds lesions from the first patient into a brain image structured like that of the second patient [37]. Image subtraction is used in a deep learning framework for the automatic detection and segmentation of NE T2w lesions from longitudinal brain MRIs of RRMS patients [38] (Fig. 5).
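The graph view of an MRI volume mentioned for the GCN above can be sketched in a few lines of Python. The example builds the adjacency matrix A of a small voxel grid using face neighbours (6-connectivity) and applies one graph-convolution step with the widely used symmetric normalisation, ReLU(D^-1/2 (A + I) D^-1/2 X W). The neighbourhood rule, the normalisation, and the function names are assumptions for illustration and are not taken from the cited method [33].

```python
import math
from itertools import product


def voxel_adjacency(shape):
    """Adjacency matrix A that links each voxel of a 3-D grid to its face neighbours."""
    coords = list(product(range(shape[0]), range(shape[1]), range(shape[2])))
    index = {c: i for i, c in enumerate(coords)}
    n = len(coords)
    adj = [[0.0] * n for _ in range(n)]
    for (x, y, z), i in index.items():
        # Only positive offsets are needed because each edge is written in both directions.
        for dx, dy, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
            j = index.get((x + dx, y + dy, z + dz))
            if j is not None:
                adj[i][j] = adj[j][i] = 1.0
    return adj


def gcn_layer(adj, features, weights):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    norm = [[d_inv_sqrt[i] * a_hat[i][j] * d_inv_sqrt[j] for j in range(n)] for i in range(n)]
    n_in, n_out = len(weights), len(weights[0])
    # Aggregate neighbour features, then mix feature channels with the weight matrix.
    agg = [[sum(norm[i][k] * features[k][f] for k in range(n)) for f in range(n_in)]
           for i in range(n)]
    return [[max(0.0, sum(agg[i][f] * weights[f][o] for f in range(n_in)))
             for o in range(n_out)]
            for i in range(n)]


if __name__ == "__main__":
    adj = voxel_adjacency((2, 2, 2))              # 8 voxels, 3 neighbours each
    intensities = [[float(i)] for i in range(8)]  # one hypothetical feature per voxel
    print(gcn_layer(adj, intensities, [[1.0]]))
```

In a trained network the weight matrix W would be learned, and several such layers would be stacked.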


Fig. 5 DSC values corresponding to different segmentation models

4 Lesion Classification

MS lesions can form in different parts of the CNS. There are mainly two types of lesions: acute active and chronic active lesions. The acute lesion patient will have only a small life span [39]. The commonly used classification is active or chronic active. Another type of classification is related to the location of MS lesions: pancortical, leukocortical, intracortical, and subpial lesions. The generation of MS lesions is an active process. The presence of lesions is characterized by the sensual formation of demyelination with the destruction of myelin elements by macrophages or microglia. Non-parenchymal macrophages and microglia are present in high numbers in CNS-related disease [40]. MS lesions can be classified by the detection of macrophages, histological myelin stains, and the use of antibodies against the Myelin Basic Protein (MBP) [6]. Active lesions are further classified into active demyelinating and active post-demyelinating lesions [40]. Myelin degradation products from phagocytosis, such as MBP, are present in the cytoplasm of the macrophages/microglia. In this type, the macrophages are present in the outer lining of the lesion; hence, they are active. Mixed active/inactive MS lesions have a hypocellular center and a rim of macrophages at the outer edge [40]. The center is completely detached. That is, in mixed active/inactive lesions, the border of macrophages/microglia is not essential to the lesion around it. In inactive lesions, only a few macrophages/microglia are present, but there may be axonal damage. Plaques corresponding to inactive lesions are seen after more than 15 years [40]. The classification of MS lesions from MRI can be carried out by many methods. These include an unsupervised method [41], weighted radial basis function kernels [6], a sparse representation and adaptive dictionary paradigm [42], and a differential evolution method [43].
An automatic way to classify lesions uses unsupervised learning that combines spatiotemporal mean shift (STM-S) with dynamic time warping (DTW), which identifies the lesions [40]. A fully automatic Bayesian framework, based on probability distributions and entropy values, classifies lesions as normal or diseased and gives results similar to manual classification [6]. Another classification method is based on a sparse representation built from basic dictionary elements; it also separates normal from diseased tissue [44]. Its classification accuracy depends on the size of the dictionary, and its main performance limitation arises when a mixture of different tissues is present [45]. A classification method based on differential evolution uses if-then criteria. Differential evolution is an optimization method that iteratively improves a group of candidate solutions, called a population, to reach good accuracy. It is an efficient decision-making system that gives insight into lesion problems [46]. The method includes several groups of rules for every cluster.
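
The population-based iteration described above can be illustrated with a minimal sketch of the classic DE/rand/1/bin scheme. This is not the rule-based lesion classifier of the cited work: the fitness function (`sphere`), the bounds, and all parameter values (`pop_size`, `f`, `cr`, `generations`) are assumptions chosen only to make the example self-contained.

```python
import random


def sphere(x):
    """Toy fitness (assumed for illustration): sum of squares, minimum at 0."""
    return sum(v * v for v in x)


def differential_evolution(fitness, bounds, pop_size=20, f=0.7, cr=0.9,
                           generations=300, seed=0):
    """Minimise `fitness` with a basic DE/rand/1/bin loop."""
    rng = random.Random(seed)
    dim = len(bounds)
    # The population: a group of random candidate solutions inside the bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [fitness(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct members other than the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one mutated coordinate
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    value = min(max(value, lo), hi)  # keep inside the bounds
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_score = fitness(trial)
            # Greedy selection: keep the trial only if it is not worse.
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


if __name__ == "__main__":
    solution, value = differential_evolution(sphere, [(-5.0, 5.0)] * 3)
    print(solution, value)
```

In a lesion-classification setting, the fitness would instead score a candidate set of if-then rules, for example by its classification error on labeled data.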

5 Issues and Challenges

MS lesion segmentation and classification methods face many challenges when applied to MRI. The main challenges in applying DL to MS segmentation include data scale, data imbalance, and domain shift [45]. In T1 and T2 MRIs, MS lesions and CSF have the same appearance, which makes classification difficult. Classification methods should therefore achieve better accuracy, specificity, and sensitivity.
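
For reference, the three measures named above can be computed from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and do not come from any study cited here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity and specificity from confusion counts.

    tp/fn count diseased samples (detected/missed); tn/fp count normal
    samples (correctly rejected/falsely flagged). Both classes are assumed
    to be non-empty, so no division by zero occurs.
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,      # all correct decisions
        "sensitivity": tp / (tp + fn),      # true positive rate
        "specificity": tn / (tn + fp),      # true negative rate
    }


if __name__ == "__main__":
    # Hypothetical counts: 100 lesion voxels and 100 normal voxels.
    print(classification_metrics(tp=80, tn=90, fp=10, fn=20))
```

With strongly imbalanced data, accuracy alone can look high even when sensitivity is poor, which is why the three values are usually reported together.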

6 Discussion

DL networks can be applied to MS lesion classification. The main DL algorithms are CNN, U-net, FCNN, etc. DL-based classification involves a large number of mathematical computations and a long computation time. These challenges can be overcome by optimizing the DL networks. The best classification can be achieved with the help of fitness function optimization [47].

7 Conclusion

DL networks can be applied to MS lesion classification. The main DL algorithms are CNN, U-net, FCNN, etc. DL-based classification involves a large number of mathematical computations and a long computation time. These challenges can be overcome by optimizing the DL networks. The best classification can be achieved with the help of fitness function optimization [47].

References

1. Carey JE (2013) Brain facts: a primer on the brain and nervous system. Society of Neuroscience, 11 Dupont Circle, Washington
2. Udupa JK, Wei L, Samarasekera S, van Buchem MA, Grossman RI (1997) Multiple sclerosis lesion quantification using fuzzy-connectedness principles. IEEE Trans Med Imaging 16(5):598–609

3. Jangi S, Gandhi R, Cox LM, Li N, Von Glehn F, Yan R, Patel B et al (2016) Alterations of the human gut microbiome in multiple sclerosis. Nat Commun 7(1):1–11
4. Mangalam A, Shahi AK et al (2017) Human gut-derived commensal bacteria. Nat Rev Neurol 25–36. ISSN 1759-4766
5. Didonna A et al (2015) A non-synonymous single-nucleotide polymorphism associated with multiple sclerosis risk affects the EVI5 interactome. Human Mol Genet 24(24):7151–7158
6. Leary SM, Porter B, Thompson AJ (2005) Multiple sclerosis: diagnosis and the management of acute relapses. Postgrad Med J 81(955):302–308
7. Geurts JJG, Barkhof F (2008) Grey matter pathology in multiple sclerosis. Lancet Neurol 7(9):841–851. ISSN 1474-4422
8. Cortese R, Collorone S, Ciccarelli O, Toosy AT (2019) Advances in brain imaging in multiple sclerosis. Ther Adv Neurol Disord 12,5(3):246–55. https://doi.org/10.1016/j.jceh.2015.08.001. Epub 2015 Aug 20. PMID: 26628842; PMCID: PMC4632105
9. Cortese R, Collorone S, Ciccarelli O, Toosy AT (2019) Advances in brain imaging in multiple sclerosis. Therapeutic Adv Neurol Disord 12. PMID: 31275430
10. Akbar N, Rudko DA, Parmar K (2007) Magnetic resonance imaging of multiple sclerosis. Sci J Mult Scler 1:008–020
11. Efendi H (2015) Clinically isolated syndromes: clinical characteristics, differential diagnosis, and management. Nöro Psikiyatri Arşivi 52(1):S1
12. Grover VP, Tognarelli JM, Crossey MM, Cox IJ, Taylor-Robinson SD, McPhail MJ (2015) Magnetic resonance imaging: principles and techniques: lessons for clinicians. J Clin Exp Hepatol 5(3):246–255
13. Bushong SC, Clarke G (2013) Magnetic resonance imaging - e-book. Elsevier Health Sciences
14. Polman et al (2010) Department of Neurology, VU Medical Center Amsterdam. In: “Diagnostic criteria for multiple sclerosis: 2010 Revisions to the McDonald criteria” article online at wileyonlinelibrary.com. https://doi.org/10.1002/ana.22366
15. Nelson RE, Butler J, LaFleur J, Knippenberg KC, Kamauu AWC, DuVall SL (2016) Determining multiple sclerosis phenotype from electronic medical records. J Managed Care Specialty Pharm (JMCP) 22
16. Kumari S, Sharma AK, Chaurasia S (2021) Brain tumor detection and segmentation of MRI images through the integration of different methods: a review. In: 2021 9th International conference on reliability, infocom technologies and optimization (trends and future directions) (ICRITO), 2021, pp 1–5. https://doi.org/10.1109/ICRITO51393.2021.9596334
17. Kholmovski Eugene G et al (2002) Motion artifact reduction technique for dual-contrast FSE imaging. Magnetic resonance imaging 20(6):455–462
18. Kholmovski Eugene G et al (2002) Motion artifact reduction technique for dual-contrast FSE imaging. Magnetic resonance imaging 20(6):455–462
19. Hemond CC, Bakshi R (2018) Magnetic resonance imaging in multiple sclerosis. Cold Spring Harbor perspectives in medicine 8(5)
20. Bruschi N, Boffa G, Inglese M (2020) Ultra-high-field 7-T MRI in multiple sclerosis and other demyelinating diseases: from pathology to clinical practice. European radiology experimental 4(1):1–13
21. Tiu E (2019) Metrics to evaluate your semantic segmentation model. Towards Data Sci. https://towardsdatascience.com/metrics-to-evaluate-your-semantic-segmentationmodel-6bcb99639aa2
22. Nelson RE, Butler J, LaFleur J, Knippenberg KC, Kamauu AW, DuVall SL. Determining multiple sclerosis phenotypes from electronic medical records. J Managed Care Specialty Pharmacy 22(12):1377–1378
23. Valverde S, Cabezas M, Roura E, Gonzalez-Villa S, Pareto D, Vilanova JC, Rami-Torrent L et al (2017) Improving automated multiple sclerosis lesion segmentation with a cascaded 3D convolutional neural network approach. NeuroImage 155:159–168. https://towardsdatascience.com/5-essential-steps-for-every-deep-learning-model-30f0af3ccc37
24. Yoo Y, Brosch T, Traboulsee A, Li DK, Tam R (2014) Deep learning of image features from unlabeled data for multiple sclerosis lesion segmentation. In: International workshop on machine learning in medical imaging 1st edn. Springer, Cham, pp 117–124. 319-10581-9

25. Shoeibi A, Khodatars M, Jafari M, Moridian P, Rezaei M, Alizadehsani R, Khozeimeh F, Gorriz JM, Heras J, Panahiazar M, Nahavandi S (2021) Applications of deep learning techniques for automated multiple sclerosis detection using magnetic resonance imaging: a review. Comput Biol Med 136:104697
26. Narayana PA, Coronado I, Sujit SJ, Wolinsky JS, Lublin FD, Gabr RE (2020) Deep-learning-based neural tissue segmentation of MRI in multiple sclerosis: effect of training set size. J Magn Reson Imaging 51(5):1487–1496
27. Kumar P, Nagar P, Arora C, Gupta A (2018) U-segnet: fully convolutional neural network based automated brain tissue segmentation tool. In: 2018 25th IEEE international conference on image processing (ICIP), pp 3503–3507. IEEE
28. Abolvardi A, Hamey L, Ho-Shon K (2019) Registration based data augmentation for multiple sclerosis lesion segmentation. In: 2019 digital image computing: techniques and applications (DICTA), pp 1–5. https://doi.org/10.1109/DICTA47822.2019.8946022
29. Vang S et al (2020) SynergyNet: a fusion framework for multiple sclerosis brain MRI segmentation with local refinement. In: 2020 IEEE 17th international symposium on biomedical imaging (ISBI), pp 131–135. https://doi.org/10.1109/ISBI45749.2020.9098610
30. de Oliveira M et al (2020) Quantification of brain lesions in multiple sclerosis patients using segmentation by convolutional neural networks. In: 2020 IEEE international conference on bioinformatics and biomedicine (BIBM), pp 2045–2048. https://doi.org/10.1109/BIBM49941.2020.9313244
31. Wargnier-Dauchelle V, Grenier T, Durand-Dubief F, Cotton F, Sdika M (2021) A more interpretable classifier for multiple sclerosis. In: 2021 IEEE 18th international symposium on biomedical imaging (ISBI), pp 1062–1066. https://doi.org/10.1109/ISBI48211.2021.9434074
32. Billot B, Cerri S, Leemput KV, Dalca AV, Iglesias JE (2021) Joint segmentation of multiple sclerosis lesions and brain anatomy in MRI scans of any contrast and resolution with CNNs. In: 2021 IEEE 18th international symposium on biomedical imaging (ISBI), pp 1971–1974. https://doi.org/10.1109/ISBI48211.2021.9434127
33. Wargnier-Dauchelle V, Grenier T, Durand-Dubief F, Cotton F, Sdika M (2021) A more interpretable classifier for multiple sclerosis. In: 2021 IEEE 18th international symposium on biomedical imaging (ISBI), pp 1062–1066. https://doi.org/10.1109/ISBI48211.2021.9434074
34. Gabr RE, Coronado I, Robinson M, Sujit SJ, Datta S, Sun X, Allen WJ, Lublin FD, Wolinsky JS, Narayana PA (2020) Brain and lesion segmentation in multiple sclerosis using fully convolutional neural networks: a large-scale study. Multiple Sclerosis J 26(10):1217–1226
35. Dayananda C, Choi JY, Lee B (2021) Multi-scale squeeze U-SegNet with multi global attention for brain MRI-segmentation. Sensors 21(10):3363
36. Sepahvand NM, Arnold DL, Arbel T (2020) CNN detection of new and enlarging multiple sclerosis lesions from longitudinal MRI using subtraction. In: 2020 IEEE 17th international symposium on biomedical imaging (ISBI), pp 127–130. https://doi.org/10.1109/ISBI45749.2020.9098554
37. Filippi M, Cercignani M, Inglese M, Horsfield MA, Comi G (2001) Diffusion tensor magnetic resonance imaging in multiple sclerosis. Neurology 56(3):304–311
38. Mark A, van Buchem MA (2007) Demyelinating diseases–II. Springer Japan, pp 247–251
39. Li Q, Barres BA (2018) Microglia and macrophages in brain homeostasis and disease. Nat Rev Immunol 18(4):225–242
40. Kuhlmann T, Ludwin S, Prat A, Antel J, Brück W, Lassmann H (2017) An updated histological classification system for multiple sclerosis lesions. Acta Neuropathol 133(1):13–24. https://doi.org/10.1007/s00401-016-1653-y
41. Tsai C, Chen HM, Chai J, Chen CC, Chang C (2011) Classification of magnetic resonance brain images by using weighted radial basis function kernels. In: 2011 International conference on electrical and control engineering, pp 5784–5787. https://doi.org/10.1109/ICECENG.2011.6058066

42. De Falco I, Scafuri U, Tarantino E (2016) A differential evolution approach for classification of multiple sclerosis lesions. In: 2016 IEEE symposium on computers and communication (ISCC), pp 141–146. https://doi.org/10.1109/ISCC.2016.7543729
43. Deshpande H, Maurel P, Barillot C (2015) Adaptive dictionary learning for competitive classification of multiple sclerosis lesions. In: 2015 IEEE 12th international symposium on biomedical imaging (ISBI), pp 136–139. https://doi.org/10.1109/ISBI.2015.7163834
44. Conti A, Treaba CA, Mehndiratta A, Barletta VT, Mainero C, Toschi (2021) An interpretable machine learning model to explain the interplay between brain lesions and cortical atrophy in multiple sclerosis. In: 2021 43rd annual international conference of the IEEE engineering in medicine and biology society (EMBC), pp 3757–3760
45. Zeng C, Gu L, Liu Z, Zhao S (2020) Review of deep learning approaches for the segmentation of multiple sclerosis lesions on brain MRI. Front Neuroinformatics 55
46. Geurts JJG, Barkhof F (2008) Grey matter pathology in multiple sclerosis. Lancet Neurol 7(9):841–851. ISSN 1474-4422
47. De Falco I, Scafuri U, Tarantino E (2016) A differential evolution approach for classification of multiple sclerosis lesions. In: 2016 IEEE symposium on computers and communication (ISCC), pp 141–146

Malware Detection and Classification Using Ensemble of BiLSTMs with Huffman Feature Optimization

Osho Sharma, Akashdeep Sharma, and Arvind Kalia

Abstract Context: Malware attacks are responsible for data breaches and financial losses across the globe. Traditional signature-based malware detection methods fail against ‘zero-day’ and ‘unknown’ malware variants, whereas data conversion-based malware detection methods are computationally intensive and time-consuming. Additionally, the challenge of malware detection and family classification has become more severe on Windows devices due to a shortage of updated malware datasets. Objectives: The goal of this study is to use ensemble learning to aggregate multiple BiLSTM networks to improve malware detection and classification performance. Methods and Design: We begin by collecting the latest Windows malware samples from the Internet, followed by extraction of Application Programming Interface (API) call sequences from malware binaries through dynamic malware analysis in a virtual environment. The API call data is encoded using the Enhanced Huffman Features (EHF) method. To identify the long-term dependencies between API call sequences, three BiLSTM networks are used, which are later combined using an ensemble technique. Results and Conclusion: To evaluate our proposed method, we utilize one public and one self-created Windows malware dataset. Our model outperforms earlier methods in the literature by at least 6% in classification accuracy, and our system can be hosted on the web for commercial use.

Keywords Deep learning · Information security · Malware detection and classification · Ensemble learning · Bidirectional long short-term memory

O. Sharma (B) · A. Kalia Department of Computer Science, Himachal Pradesh University, Shimla, India e-mail: [email protected] A. Sharma (B) Department of Computer Science and Engineering, UIET, Panjab University, Chandigarh, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_30

1 Introduction

Malware, or malicious software, is responsible for a majority of cybercrimes such as service disruption, security breaches, and data theft. Automatic detection and classification of malware has become a popular research topic due to the increasing complexity and volume of malware in recent years. In the first half of 2022, the volume of new malware programs registered in AV-TEST’s database each day was above 450,000.¹ VirusTotal, a free online tool, registered nearly one million distinct hits for malware scans per day in 2022.² A large proportion of malware is designed to affect Windows environments, mainly because of the huge market share of Windows OS and also because of software piracy, which accounts for a significant proportion of security threats. Moreover, researchers have demonstrated the inefficiency of commercially available malware detectors at detecting ‘zero-day’, obfuscated and unknown malware, which has led to a proliferation of studies utilizing machine learning and deep learning approaches for malware detection [1]. A popular approach to malware detection involves extracting the features contained in executables by disassembling them and then utilizing deep neural networks to classify the malware samples into their respective family classes. However, taking into account the entire contents of an executable may not effectively model the behavior of the malware. The information contained in an executable is primarily spatial, so this approach may ignore temporal aspects [2]. Application Programming Interface (API) calls contain temporal information which may be useful for effectively modeling the behavior of a malware executable [1]. Earlier studies have utilized API calls to build malware detection systems [1, 3–6], but in certain cases the length of these API call sequences may exceed a million characters, which increases the detection time and computational load of the system.
To address some of the current issues faced by malware analysts, we present a novel malware detection and classification framework, which is an ensemble of three separate Bidirectional Long Short-Term Memory (BiLSTM) deep networks and integrates Huffman encoding in the deep feature engineering phase. We choose BiLSTM networks because of their effectiveness in spotting long-term relationships in sequential data. The proposed model can automatically create and optimize highly discriminating features from executables by extracting the API call sequence data and utilizing Huffman encoding to construct Enhanced Huffman Features (EHFs). The proposed API call-based EHFs increase the efficiency of our malware detection and classification model, as demonstrated in the current study. The key contributions of the study are as follows:
1. The study presents a novel malware detection and classification framework which is developed by integrating three separate BiLSTM networks using ensemble learning to improve the aggregate performance of the proposed system.

1 https://www.av-test.org/en/statistics/malware
2 https://www.virustotal.com/gui/stats

2. Inspired by Huffman’s lossless data compression algorithm [7], we create and incorporate Windows API call-based ‘Enhanced Huffman Features’ (EHFs), which are capable of representing malware behavior more effectively and can be easily used in deep neural network architectures.
3. We use one benchmark Windows malware dataset and one custom-made Windows malware dataset to test and assess the suggested method, and we achieve promising detection and classification results.

The study contributes novel prospects for developing effective, low-cost and highly reliable malware detection systems for automatic detection of malware in uncontrolled environments. The proposed ensemble BiLSTM model has good implications and usability in the area of cyber-security and can be deployed and tested using any web framework such as Flask [8]. The API-EHFs lower the binary code complexity, require less domain knowledge and can be easily used in deep neural network architectures, thus making the proposed model an easy-to-implement and scalable solution for malware detection and classification.

The remainder of the work is organized as follows. The second section discusses related works in malware analysis that utilize similar architectures. The third section presents the stages of the proposed model, followed by the fourth section, which presents the data collection details and evaluation criteria. The fifth section presents the results of the experiments, followed by the sixth section, which presents some key discussion points. Finally, the conclusion and future scope of the study are discussed in the seventh section.

2 Related Works

The recently published literature in the domain of malware detection can be classified into static analysis, dynamic analysis, hybrid analysis, machine learning solutions, deep learning solutions and malware visualization-based solutions. The task of malware detection typically involves either code-based or behavior-based features to represent benign and malicious samples. In this section, we focus on some of the most recent research that uses deep structures similar to those in the current study to build malware detection and classification systems. Several high-performance malware detection and classification systems in the literature rely on ensemble methods, which aggregate the performance of multiple machine learning classifiers or deep structures. Vasan et al. [9] utilized malware images in their proposed ensemble of Convolutional Neural Network (CNN)-based architectures for the detection of packed and unpacked malware. Their ensemble of CNNs achieved nearly 98% classification accuracy; however, since the model converts malware binaries into images, its overall complexity was high. Similarly, Narayanan and Davuluru [10] utilized an ensemble of multiple machine learning classifiers along with CNN and Long Short-Term Memory (LSTM) deep networks to classify malware in the 500 GB Microsoft Malware classification challenge. They demonstrated an accuracy score of 99.8% using their proposed ensemble method; however, their method involves heavy data conversion into image format, which is a computationally intensive activity. Mallik et al. [11] proposed the Conrec model, which incorporates convolutional layers to capture visual similarities in malware samples and two subsequent BiLSTM layers to capture sequential similarities. The researchers tested the model on two public benchmark datasets, Malimg and Microsoft, and their system is shown to achieve an accuracy score of nearly 99% on both datasets. Roy and Chen [12] proposed a model for early ransomware detection called DeepRan. Their model incorporates an attention-based BiLSTM with FC layers to capture normal and abnormal behavioral logs. The authors captured attack data from 17 ransomware attacks and demonstrated nearly 99% accuracy for early detection of the 17 candidate ransomware families. In a similar approach to classifying ransomware families, Aurangzeb et al. [13] proposed an ensemble method utilizing multiple machine learning classifiers and deep networks. Their method was evaluated on a dataset of nearly 2000 samples and is shown to achieve nearly 98% accuracy; however, the problem of feature dependence remains, as the method utilizes a combination of static and dynamic features. In a slightly different approach that uses Recurrent Neural Networks (RNNs) in the Internet of Things (IoT), Saharkhizan et al. [14] utilized a decision tree made of RNN nodes to choose the best RNN for detecting cyber-attacks from network traffic data. Their approach, although not specifically addressing malware classification, achieved a detection rate of over 99% for IoT-related cyber-attacks. Lu [15] proposed a two-stage RNN-LSTM model to classify malicious and benign files on a dataset of 969 malicious and 128 benign files.
They managed to achieve an average AUC score of 99%. Narayanan et al. [10] utilized the Microsoft Malware Challenge dataset (BIG 2015) hosted on Kaggle to study and analyze the nine distinct classes of Windows malware using their CNN-LSTM model. The LSTM neural network demonstrated an accuracy score of 97.2%, whereas the CNN model achieved a 99.4% accuracy score. In an intuitive approach, Stephen et al. [7] incorporated Huffman encoding to develop optimized features by combining the symbols extracted from binary code, their frequencies and the Huffman codes. The authors compared the developed features with current compression-based methods and demonstrated accuracy scores of over 97% for their proposed features. Jeon and Moon [16] combined a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) to form a Convolutional Recurrent Neural Network (CRNN) model, which is used to detect malicious executables using n-grams and extracted opcode sequences. They also proposed a method to extract the opcode sequences without executing the file by using Dynamic RNNs (DRNNs). Their method demonstrates a promising AUC of 99%, a TPR of 95% and a detection accuracy of 96%. De Lorenzo et al. [17] proposed an RNN-LSTM model for Android malware detection. Their work also includes a tool called VizMal that visualizes the execution path of Android applications using system call information. They incorporated user-based validation to verify and demonstrate the promising results; however, their model could be better evaluated with built-in evaluation metrics. Jha et al. [18] developed an RNN model to detect malware using Natural Language Processing (NLP), random vector representation and the Word2Vec model. Their approach is verified on 10,000 executables, of which 5000 are malicious and 5000 are benign. Their results demonstrated nearly 91% accuracy for the RNN-Word2Vec model. Hiai and Shimada [19] tried to correlate RNNs with relation vectors. Their work develops relation information between various pairs of expressions, e.g., ‘boss and staff’. They also proposed a relation information-based sarcasm detection method, which is a unique use of RNNs for classification, and their approach demonstrates decent precision, recall and F1 scores of 80.2%, 80.2% and 80.1%, respectively. Zhong and Gu [20] suggested a multilayer deep learning system for malware detection that uses a cluster tree network to combine multiple types of deep learning methods, in order to handle complex data distributions in malware datasets and to enhance scalability. They organized RNNs into a tree structure, divided the data into multi-level clusters and selected a suitable deep learning model for each cluster, demonstrating a TPR above 95% and an AUC above 90%. Peng et al. [21] proposed an algorithm for creating adversarial malware that bypasses RNN- and CNN-based malware detection models. Their algorithm is based on a black-box approach that utilizes word embeddings to map the API calls, which are later used to bypass RNN- and CNN-based malware classifiers. Their best-case accuracy score, over 96%, is demonstrated on a BiLSTM model. Zhang et al. [22] suggested a ransomware categorization system based on static analysis of opcode sequences.
They used the self-attention method to capture complementary information of distance-aware dependencies and compared the method’s efficiency to CNN- and RNN-based models; their precision and recall scores are both nearly 87.5%. Gao et al. [23] proposed an RNN-based byte classifier that extracts byte features from files without disassembling them. They also proposed a semi-supervised transfer learning model for malware detection, which demonstrates a high detection accuracy of 96.9%. Yazdinejad et al. [24] proposed an RNN-based cryptocurrency malware hunting framework. Their model includes five separate Long Short-Term Memory (LSTM) modules. Although their dataset is comparatively small, with only 500 malware and 200 benign files, they demonstrated a good detection accuracy of nearly 98%. Pei et al. [25] developed a deep learning framework for Android malware detection that utilizes Graph Convolution Networks (GCNs) for modeling high-level graphical semantics. The authors also utilized an Independent Recurrent Neural Network (IndRNN) to independently extract and decode the semantic information. Their model performs well on five different benchmark datasets and demonstrates a high detection accuracy of nearly 99%. In an intuitive approach, Sharma et al. [26] proposed a malware visualization-based method in which malware samples are converted into three types of malware images (grayscale, color and Markov images), and then a custom-built deep CNN trained with traditional learning and a transfer learning-based Xception CNN are utilized for malware family classification. The Markov images achieved good classification performance scores of nearly 99% on the custom-built and benchmark ‘Microsoft’ datasets.

Previous studies have suffered from limitations such as: (a) increased model complexity, rendering the models less effective for lightweight settings; (b) a lack of updated Windows datasets for assessing the suggested approaches; and (c) the use of data conversion-based procedures, which add noise to the data and are computationally costly. To address these difficulties, (a) we suggest a lightweight ensemble of BiLSTM networks that is computationally efficient and uses fewer resources than rival methods; (b) in addition to one public Windows benchmark dataset, we develop our own Windows malware dataset using the most recent malware samples; and (c) instead of computationally costly data conversion approaches to extract discriminating features, we use a new feature optimization technique called Enhanced Huffman Features [7].

3 Methodology

Most static analysis approaches fail to detect packed and obfuscated malware types; however, dynamic analysis techniques (where the malware is deployed in a secure environment and its behavior is recorded) outperform static analysis techniques in detecting complex malware types. The current work is based on this premise and uses dynamically extracted API call data to identify and categorize malware variants. To develop an effective malware detection and classification system, we utilize an ensemble approach to integrate three separate BiLSTM networks. Additionally, we employ Huffman feature optimization in the feature optimization phase. The proposed malware detection system can be divided into five different stages, where each stage performs a specific task. The next subsection presents the overall architecture, and the subsequent parts discuss the individual stages of the malware detection and classification tasks.

3.1 Overview

As illustrated in Fig. 1, the first stage of our model deals with data collection and labeling of the samples to form the data corpus. The second stage deals with the extraction of discriminating features (API call data) from malware instances using a dynamic malware analysis approach. The third stage deals with the generation of EHFs by processing, optimizing and converting the API call data. The fourth stage deals with feeding the EHF vectors into separate BiLSTM networks, which are later combined using ensemble learning. The main assumption is that, rather than relying on the predictions of a single BiLSTM, combining multiple BiLSTMs may result in higher detection and classification performance.
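
The combination step can be sketched as follows. The text does not state which ensemble rule is used, so this example assumes soft voting (averaging the class-probability vectors of the individual networks); the function name and the probability values are hypothetical, and the three BiLSTM networks are represented only by their outputs.

```python
def soft_vote(prob_sets):
    """Average class-probability vectors from several models.

    `prob_sets` holds one probability vector per model for a single sample.
    Returns the index of the winning class and the averaged probabilities.
    Soft voting is an assumed combination rule, not one taken from the text.
    """
    n_models = len(prob_sets)
    n_classes = len(prob_sets[0])
    avg = [sum(p[c] for p in prob_sets) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=avg.__getitem__)
    return label, avg


if __name__ == "__main__":
    # Hypothetical outputs of three networks for one sample, three families.
    outputs = [
        [0.6, 0.3, 0.1],
        [0.2, 0.5, 0.3],
        [0.5, 0.4, 0.1],
    ]
    print(soft_vote(outputs))
```

In this example the second network alone would pick class 1, but the averaged vote selects class 0, which illustrates how an ensemble can override a single dissenting model.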

Fig. 1 Overview of the suggested model

3.2 Feature Extraction

Compared to statically collected data, dynamically collected data is more resilient to packing and obfuscation techniques. Therefore, we first extract the API call data from malware binaries using a dynamic analysis environment. In this work, the API calls of executables are treated as the basic features, which are later used to generate Enhanced Huffman Features (EHFs). The virtual environment (Cuckoo Sandbox [27]) executes each malware file for a fixed duration of 3 min and generates an analysis report in JavaScript Object Notation (JSON) format. The JSON reports contain both static features (binary metadata, packer detection and information about sections) and dynamic features (dynamically linked libraries, dropped files, Windows API calls, mutex actions, file operations, registry operations, network activities and processes) of executables. The execution logs generated by the sandbox contain detailed runtime information of the executable files, whose size ranges from several KBs to GBs. These logs are filtered to extract only the API call information. The JSON reports are then converted into NumPy array (.npy) files. Since the sample size was variable, we reduced the input of each sample to 1000 bytes. Additionally, zero vectors were appended to the end of each sample so that every input has dimension (1000, 102). We then normalized the data by dividing each input by a global maximum value. We apply the Enhanced Huffman Feature generation method to convert the features into EHF form [13]. Once the EHFs are generated, we train our deep learning model on a model server with GPUs for malware detection and classification.
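
The truncation, zero-padding and normalization steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: each sample is taken to be a list of rows that already have width 102, the helper names are hypothetical, and nested lists are used instead of the NumPy arrays mentioned in the text so that the example has no dependencies.

```python
def pad_or_truncate(sample, length=1000, width=102):
    """Return exactly `length` rows: cut long samples, zero-pad short ones."""
    rows = [list(row) for row in sample[:length]]
    # Append zero vectors at the end until the sample has `length` rows.
    rows.extend([0.0] * width for _ in range(length - len(rows)))
    return rows


def normalize(samples):
    """Divide every value by the global maximum over all samples."""
    global_max = max(v for s in samples for row in s for v in row)
    if global_max == 0:
        return samples  # nothing to scale
    return [[[v / global_max for v in row] for row in s] for s in samples]


if __name__ == "__main__":
    short = [[1.0] * 102 for _ in range(3)]      # fewer than 1000 rows
    long_ = [[4.0] * 102 for _ in range(1200)]   # more than 1000 rows
    batch = normalize([pad_or_truncate(short), pad_or_truncate(long_)])
    print(len(batch[0]), len(batch[0][0]), batch[1][0][0])
```

Padding at the end, as the text describes, keeps the real call sequence at the start of every input, so the networks see genuine data first and zeros afterwards.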

3.3 Huffman Feature Optimization

The collected API calls data is raw and unoptimized. As part of our feature optimization strategy, we optimize this data using Huffman encoding and turn it into EHFs. The EHF conversion code is written in Python and follows the method described below. Each EHF is constructed by combining the API call sequence string, its frequency and the code generated by Huffman's procedure. The API call sequence corresponding to a single malware binary is denoted by s. The first step is to calculate the frequency of each API call sequence in an executable. The frequencies are stored in a dictionary d, where each API call sequence is the key and its frequency is the value [13]. Next, a heap h is constructed from entries that each hold a symbol's frequency together with a nested array containing the alphabet symbol k and an initially empty string c, in which the symbol's Huffman code is accumulated. To produce the Huffman codewords, the two lowest-frequency entries, which represent the left and right nodes in the tree, are dequeued from the heap [13]. The edge to the left node is assigned the binary bit 0 and the edge to the right node the binary bit 1. For every symbol already contained in a dequeued node, the corresponding edge bit (0 or 1) is added to its string c, so that the Huffman code of a symbol is the concatenation of the edge bits along its path in the tree. An array containing the left and right nodes together with the sum of their frequencies, which is the weight of their parent node in the tree, is then pushed back onto the heap. This step of dequeuing and merging the two lowest-frequency entries is repeated until the root node is reached [13]. The heap then holds a single element containing the sum of all weights and the symbol: code pair for each alphabet symbol. This element is dequeued from the heap and its pairs are sorted by Huffman code length in ascending order. Finally, the sorted pairs are traversed once more, aggregating each API call symbol, its Huffman code and its corresponding frequency value into a feature. The feature is appended to the EHF vector at each iteration. Once the EHF vector is complete, it is returned [13].
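The procedure above can be sketched with Python's standard `heapq` module. This is an illustrative reading of the description rather than the authors' implementation: the names `huffman_codes` and `ehf_vector` are hypothetical, each individual API call name is assumed to be one symbol, and a feature is assumed to be the tuple (symbol, frequency, code). Because the tree is built bottom-up, each edge bit is prepended to the partial code.

```python
import heapq
from collections import Counter


def huffman_codes(freq):
    """Map each symbol to its Huffman code, given a symbol -> frequency dict."""
    if not freq:
        return {}
    if len(freq) == 1:  # degenerate tree with a single symbol
        return {sym: "0" for sym in freq}
    # Heap entries: [weight, [symbol, code], [symbol, code], ...]
    heap = [[weight, [sym, ""]] for sym, weight in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        left = heapq.heappop(heap)   # lowest weight -> edge bit 0
        right = heapq.heappop(heap)  # second lowest -> edge bit 1
        for pair in left[1:]:
            pair[1] = "0" + pair[1]
        for pair in right[1:]:
            pair[1] = "1" + pair[1]
        # The merged node carries the summed weight of its two children.
        heapq.heappush(heap, [left[0] + right[0]] + left[1:] + right[1:])
    return {sym: code for sym, code in heap[0][1:]}


def ehf_vector(api_calls):
    """Build (symbol, frequency, code) features, shortest codes first."""
    freq = Counter(api_calls)
    codes = huffman_codes(freq)
    ordered = sorted(codes.items(), key=lambda item: (len(item[1]), item[0]))
    return [(sym, freq[sym], code) for sym, code in ordered]
```

Frequent API calls receive short codes, so sorting by code length places the most frequent calls at the front of the vector.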

3.4 Training Networks

An ensemble of BiLSTMs is used to build the proposed malware detection and classification system because of the strong ability of BiLSTMs to retain and predict sequential data. To construct a successful model, attention must be given to the architectural layers and the learning process. The inner components of the BiLSTM, known as gates, control the information flow. A BiLSTM processes the inputs in two directions, one from the start of the sequence to its end and the other from the end to the start, and can therefore examine the data from both sides. The design of the training networks is shown in Fig. 2. As shown in Fig. 2a, the input data is first passed to a BiLSTM layer with output dimension size 128. After the BiLSTM layer, Global Max Pooling is applied before the data is sent to a dense layer with the ReLU activation function. The data is subsequently forwarded to a dropout layer with a dropout rate of 0.2 and then reshaped using a dense layer for classification. This model uses tenfold cross-validation. After splitting the training data into 10 sets (folds), 10 models are trained independently, where each model is trained on 9 of the 10 sets and validated on the remaining set. The final prediction is the average of the predictions of the 10 models. The second model has the same layer architecture as model 1 but differs in BiLSTM output size (180), global max pooling output size (180) and dropout rate (0.1). In addition, eightfold cross-validation is applied to this model. Figure 2b shows the layers in model 2, which differs from model 1 in layer dimension sizes.

Fig. 2 Training network design a Model 1, b Model 2, c Model 3.1: Bagging, d Model 3.2: Boosting

Model 3 has the same architecture and hyperparameters as model 1, with additional dropout and dense layers. Similarly, it trains 10 different models and averages the result. Unlike models 1 and 2, however, it uses bagging and boosting in order to reduce the variance and validation loss. Each time before training a new model, the bagging algorithm samples the original training data into a bag whose size is 60% of the original training data. The boosting algorithm additionally applies a higher selection probability to samples that were poorly predicted by the models trained in earlier rounds. With the help of bagging and boosting, this model reduces the chance of overfitting. Figure 2c, d presents the layers in the bagging and boosting networks, respectively. We try to enhance the classification confidence of the models using the ensemble strategy. Our aggregate model integrates the three models by weighting their predictions, and we tested different combinations of model weights. After testing several ensemble configurations, we selected the best-performing configuration, which resulted in the highest accuracy score of nearly 98.9%. Weights of 0.5, 0.25, 0.125 and 0.125 for models 1, 2, 3.1 and 3.2, respectively, produced the best results.
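The weighted combination of the four models can be illustrated with a small soft-voting sketch. It assumes that each model outputs one probability per class for a sample and that the ensemble prediction is the weight-averaged probability vector; the function names are hypothetical and the text does not state exactly how the weights are applied.

```python
def weighted_ensemble(prob_lists, weights):
    """Combine per-model class probabilities for one sample.

    prob_lists[m][c] is the probability of class c according to model m.
    """
    if len(prob_lists) != len(weights):
        raise ValueError("one weight per model is required")
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [
        sum(w * probs[c] for probs, w in zip(prob_lists, weights)) / total
        for c in range(n_classes)
    ]


def predict_class(prob_lists, weights):
    """Return the index of the class with the highest combined probability."""
    combined = weighted_ensemble(prob_lists, weights)
    return max(range(len(combined)), key=combined.__getitem__)
```

Passing equal weights reduces this to the plain averaging used across cross-validation folds, while `[0.5, 0.25, 0.125, 0.125]` corresponds to the weighting reported for models 1, 2, 3.1 and 3.2.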


Table 1 Details of malware samples in Stamp et al. [28] malware dataset

| Class_ID | Class_name | No of samples | Class_ID | Class_name | No of samples |
|---|---|---|---|---|---|
| 1 | Adload | 162 | 12 | Renos | 532 |
| 2 | Agent | 184 | 13 | Rimecud | 153 |
| 3 | Allaple | 986 | 14 | Small | 180 |
| 4 | BHO | 332 | 15 | Toga | 406 |
| 5 | Bifroze | 156 | 16 | VB | 346 |
| 6 | CeeInject | 873 | 17 | VBInject | 937 |
| 7 | Cycbot | 597 | 18 | Vobfus | 929 |
| 8 | FakeRean | 553 | 19 | Vundo | 762 |
| 9 | Hotbar | 129 | 20 | Winwebsec | 837 |
| 10 | Injector | 158 | 21 | Zbot | 303 |
| 11 | OnLineGames | 210 | | Total | 9725 |

4 Experiments

This section describes the experimental setup, dataset design and evaluation criteria for the proposed model. The experimental environment is as follows: the experiments were run on the open-source Ubuntu 21.04 Operating System (OS) with an 11th gen Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz, 16 GB RAM and a GeForce GTX 1080 Ti 2 GB graphics card. To generate malware analysis reports, we prepared an automated environment using distributed Cuckoo Sandbox [27]. The host system runs Ubuntu, while the two guest machines run Windows 10 and are set up with Oracle VM VirtualBox. We used Keras2 to implement all neural network experiments and Python3 for writing ensemble scripts.

4.1 Dataset Details

We used two malware datasets to evaluate the performance of the proposed system. The first is a public benchmark malware dataset originally proposed by Stamp et al. [28]. It contains 21 families of malware totaling 9725 samples, as given in Table 1. The second is a custom-built malware dataset whose malware files come from the public malware repositories VirusShare (https://virusshare.com/), VX-Heaven (https://vx-underground.org/archive/VxHeaven/index.html) and 'theZoo' (https://github.com/ytisf/theZoo). This dataset contains 8 families of malware and one class of benign (harmless) Windows executables, as given in Table 2.

Table 2 Details of the self-created Windows malware dataset

| Class_ID | Class_name | No of samples |
|---|---|---|
| 1 | Virus | 850 |
| 2 | Spyware | 986 |
| 3 | Trojan | 897 |
| 4 | Worms | 988 |
| 5 | Adware | 350 |
| 6 | Dropper | 868 |
| 7 | Exploit | 849 |
| 8 | Backdoor | 789 |
| 9 | Benign | 2000 |
| | Total | 8577 |

In the custom-built Windows malware dataset, we verified the MD5 checksums of the instances to eliminate duplicates. We used the VirusTotal API to label the malware samples with a majority vote technique, in which a class label was assigned when 80% of the security software available through the VirusTotal API voted for that label.
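The deduplication and labeling steps can be sketched as below. The sketch does not call the VirusTotal API; it assumes the per-engine labels have already been collected into a list, and it reads the 80% rule as "accept the most common label only if at least 80% of the engines agree on it". The function names and the `threshold` parameter are hypothetical.

```python
import hashlib
from collections import Counter


def deduplicate_by_md5(samples):
    """Keep the first sample for each distinct MD5 digest.

    samples: iterable of (name, content_bytes) pairs.
    Returns a list of (name, md5_hex) pairs.
    """
    seen = set()
    unique = []
    for name, content in samples:
        digest = hashlib.md5(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((name, digest))
    return unique


def majority_label(engine_labels, threshold=0.8):
    """Return the most common label if enough engines agree, else None."""
    if not engine_labels:
        return None
    label, votes = Counter(engine_labels).most_common(1)[0]
    return label if votes / len(engine_labels) >= threshold else None
```

Samples for which `majority_label` returns `None` would need manual review or exclusion, a design choice the text leaves open.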

4.2 Evaluation Metrics

The evaluation metrics used to test the performance of the proposed model are given below:

• True Positive (TP) denotes the correctly classified positive category instances.
• True Negative (TN) denotes the correctly classified negative category instances.
• False Positive (FP) denotes negative category instances that have been incorrectly classified as positive category instances.
• False Negative (FN) denotes positive category instances that have been incorrectly classified as negative category instances.
• Accuracy, precision, recall and F1 score are calculated from these four counts using the formulas given below.
• The AUC is the probability that the classifier assigns a higher score to a randomly selected positive sample than to a randomly selected negative sample. It takes a value between 0 and 1, and the closer it is to 1, the better the model's performance.
• Accuracy is the proportion of correctly predicted samples among all samples, as shown in Eq. (1).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

• Precision is the ratio of correctly predicted malware to the total predicted malware, as shown in Eq. (2).

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

• Recall, or sensitivity, is the ratio of correctly predicted malware instances to the total number of malware instances, as shown in Eq. (3).

$$\text{Sensitivity} = \text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

• F1 score is the harmonic mean of the precision and recall values, as shown in Eq. (4).

$$\text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
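Equations (1) to (4) and the pairwise definition of AUC translate directly into code. The following is a small sketch with hypothetical function names; the AUC helper counts ties as half a win, which is a common convention that the text does not specify.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def auc_from_scores(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumption: a tie counts as half
    return wins / (len(pos_scores) * len(neg_scores))
```

For example, with TP = 90, TN = 80, FP = 10 and FN = 20, accuracy is 170/200 = 0.85 and precision is 90/100 = 0.9.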

5 Results and Analysis

Following the collection of malware samples and the dynamic malware analysis to collect API calls data, we use the EHFs to train the deep networks. The accuracy and training loss for the custom malware dataset are shown in Figs. 3 and 4. All the models tend to stabilize and converge by the 50th epoch. The maximum accuracy achieved by the ensemble technique is nearly 98.3% for the self-created malware dataset. The training loss for the self-created malware dataset starts at 1.75% in the first epoch and gradually decreases to 0.48% in the 50th epoch. The same process is repeated for the Stamp et al. [28] dataset, where the maximum accuracy by the 50th epoch is 98.9% and the loss decreases to nearly 0.47%. We evaluated the proposed models by comparing their performance with standard machine learning classifiers: decision trees, support vector classifier, random forest, KNN and logistic regression, all of which were kept at their default settings. To perform the comparison, we randomly selected malware samples from the VX-Heaven and 'theZoo' repositories, extracted the API calls data, performed EHF creation and trained the models. Figure 5 shows that the accuracy of the proposed ensemble model exceeds that of all the individual deep networks as well as the machine learning classifiers, implying that combining the networks in an ensemble boosts the overall accuracy. We next discuss the classification performance of the proposed ensemble model on the two datasets described earlier. For both the Stamp et al. [28] dataset and our self-created malware dataset, we select 70% of the samples for training and the remaining 30% for testing. The results of the experiments, including classification score, AUC, Sensitivity and F_measure, are presented in Figs. 6 and 7 for the two datasets. The confusion matrices for the two datasets are shown in Figs. 8 and 9. The results show that the ensemble technique improves the overall performance of the model and that the suggested ensemble framework surpasses the other classifiers in aggregate performance.

Fig. 3 Accuracy of the proposed models on self-created malware dataset

Fig. 4 Training loss of the proposed models on self-created malware dataset

Fig. 5 Comparison of accuracy scores of suggested ensemble model with other models

Fig. 6 Accuracy, AUC, F_measure and Sensitivity of the proposed ensemble model on Stamp et al. [28] dataset


Fig. 7 Accuracy, AUC, F_measure and Sensitivity of the proposed ensemble model on self-created Windows dataset

For a more thorough assessment of the suggested ensemble framework, we compare it with several prominent publications in Tables 3 and 4. On both datasets, our approach attains the highest accuracy and F1 score among the compared works, which indicates that it can detect and classify malware with state-of-the-art accuracy.

6 Discussion

The current study presents an ensemble learning-based approach that combines multiple BiLSTM networks and uses an Enhanced Huffman Feature [7]-based approach for encoding API call sequences. However, some points need to be considered in order to assess the validity of the proposed model. First, the extraction of API calls is time-consuming, and certain types of packed malware need to be preprocessed before dynamic malware analysis can be performed. In addition, certain malware variants may produce API call traces that are incomplete or empty. These traces need to be eliminated in order to improve the quality of the datasets. Furthermore, the labels of the malware variants need to be verified to ensure reliable malware classification, because some malware shares characteristics of multiple malware families. This task can be carried out by verification with the nearly 70 malware engines available through the VirusTotal API. Finally, a more diverse set of malware family variants would make the findings of the study more generalizable.


Fig. 8 Confusion matrix of the suggested ensemble model on Stamp et al. [28] dataset

Fig. 9 Confusion matrix of the suggested ensemble model on self-created Windows dataset


Table 3 Comparison of different works on Stamp et al. [28] dataset

| Model | Accuracy (%) | F1 (%) | Train time (min) | Test time (s) |
|---|---|---|---|---|
| Stamp et al. [28] | 91.8 | 91.51 | - | - |
| Jha et al. [18] | 91.5 | 91.5 | 200 | 20 |
| Moti et al. [29] | 97.5 | 97.6 | 150 | 20 |
| Jahromi et al. [30] | 98.5 | 98.2 | 175 | 25 |
| Current (ensemble) | 98.9 | 99.1 | 125 | 15 |

Table 4 Comparison of different works on self-created Windows malware dataset

| Model | Accuracy (%) | F1 (%) | Train time (min) | Test time (s) |
|---|---|---|---|---|
| Stamp et al. [28] | 92.02 | 90.33 | 200 | 10 |
| Jha et al. [18] | 91.3 | 92.2 | 150 | 15 |
| Moti et al. [29] | 97.1 | 97.5 | 150 | 15 |
| Jahromi et al. [30] | 97.5 | 97.4 | 175 | 20 |
| Current (ensemble) | 98.3 | 98.6 | 125 | 14 |

7 Conclusions and Future Scope

Malware detection and classification has been an important research topic in recent years as malware threats have proliferated at an exponential rate. The current study proposes an ensemble strategy that combines multiple BiLSTM networks to handle malware detection and classification in the Windows environment. To capture the interdependence among API calls data in Windows malware instances, we used Enhanced Huffman Features. The findings show that the suggested ensemble model outperforms previous works in the literature in terms of overall accuracy and other evaluation metrics. The model is evaluated on two different datasets, one a publicly available benchmark dataset and the other a custom dataset created from recent Windows malware obtained from the internet. In conclusion, the suggested approach can efficiently identify and categorize malware with high accuracy. In the future, we wish to develop adversarial malware capable of fooling machine learning and deep learning-based techniques.

Statements and Declarations

Conflict of Interest The authors state that they have no known competing financial interests or personal ties that could have appeared to affect the work reported in this study.

Consent for Publication Not Applicable.

Credit Authorship Contribution Statement All authors contributed equally to this manuscript.


Acknowledgements Not Applicable. Funding Not Applicable.

References

1. Li C, Zheng J (2021) API call-based malware classification using recurrent neural networks. J Cyber Secur Mobility 617–640. https://doi.org/10.13052/jcsm2245-1439.1036
2. Cruickshank I, Johnson A, Davison T, Elder M, Carley KM (2020) Detecting malware communities using socio-cultural cognitive mapping. Comput Math Organ Theory 26(3):307–319. https://doi.org/10.1007/s10588-019-09300-w
3. Alam S, Alharbi SA, Yildirim S (2020) Mining nested flow of dominant APIs for detecting android malware. Comput Netw 167:107026. https://doi.org/10.1016/j.comnet.2019.107026
4. Alazab M, Alazab M, Shalaginov A, Mesleh A, Awajan A (2020) Intelligent mobile malware detection using permission requests and API calls. Futur Gener Comput Syst 107:509–521. https://doi.org/10.1016/j.future.2020.02.002
5. Amer E, Zelinka I (2020) A dynamic Windows malware detection and prediction method based on contextual understanding of API call sequence. Comput Secur 92:101760. https://doi.org/10.1016/j.cose.2020.101760
6. Xiaofeng L, Fangshuo J, Xiao Z, Shengwei Y, Jing S, Lio P (2019) ASSCA: API sequence and statistics features combined architecture for malware detection. Comput Netw 157:99–111. https://doi.org/10.1016/j.comnet.2019.04.007
7. O'Shaughnessy S, Breitinger F (2021) Malware family classification via efficient Huffman features. Forensic Sci Int Digit Investig 37:301192. https://doi.org/10.1016/j.fsidi.2021.301192
8. Welcome to Flask - Flask Documentation (2.0.x). https://flask.palletsprojects.com/en/2.0.x/. Accessed 27 Oct 2021
9. Vasan D, Alazab M, Wassan S, Safaei B, Zheng Q (2020) Image-Based malware classification using ensemble of CNN architectures (IMCEC). Comput Secur 92:101748. https://doi.org/10.1016/j.cose.2020.101748
10. Narayanan BN, Davuluru VSP (2020) Ensemble malware classification system using deep neural networks. Electronics 9(5). Art. no. 5. https://doi.org/10.3390/electronics9050721
11. Mallik A, Khetarpal A, Kumar S (2022) ConRec: malware classification using convolutional recurrence. J Comput Virol Hack Tech. https://doi.org/10.1007/s11416-022-00416-3
12. Roy KC, Chen Q (2021) DeepRan: attention-based BiLSTM and CRF for ransomware early detection and classification. Inf Syst Front 23(2):299–315. https://doi.org/10.1007/s10796-020-10017-4
13. Aurangzeb S, Anwar H, Naeem MA, Aleem M (2022) BigRC-EML: big-data based ransomware classification using ensemble machine learning. Cluster Comput. https://doi.org/10.1007/s10586-022-03569-4
14. Saharkhizan M, Azmoodeh A, Dehghantanha A, Choo K-KR, Parizi RM (2020) An ensemble of deep recurrent neural networks for detecting IoT cyber attacks using network traffic. IEEE Internet Things J 7(9):8852–8859. https://doi.org/10.1109/JIOT.2020.2996425
15. Lu R (2021) Malware detection with LSTM using opcode language. arXiv:1906.04593 [cs], June 2019, Accessed: 15 Oct 2021. [Online]. Available: http://arxiv.org/abs/1906.04593
16. Jeon S, Moon J (2020) Malware-Detection method with a convolutional recurrent neural network using opcode sequences. Inf Sci 535:1–15. https://doi.org/10.1016/j.ins.2020.05.026
17. De Lorenzo A, Martinelli F, Medvet E, Mercaldo F, Santone A (2020) Visualizing the outcome of dynamic analysis of Android malware with VizMal. J Inf Secur Appl 50:102423. https://doi.org/10.1016/j.jisa.2019.102423
18. Jha S, Prashar D, Long HV, Taniar D (2020) Recurrent neural network for detecting malware. Comput Secur 99:102037. https://doi.org/10.1016/j.cose.2020.102037
19. Hiai S, Shimada K (2019) Sarcasm detection using RNN with relation vector. IJDWM 15(4):66–78. https://doi.org/10.4018/IJDWM.2019100104
20. Zhong W, Gu F (2019) A multi-level deep learning system for malware detection. Expert Syst Appl 133:151–162. https://doi.org/10.1016/j.eswa.2019.04.064
21. Peng X, Xian H, Lu Q, Lu X (2021) Semantics aware adversarial malware examples generation for black-box attacks. Appl Soft Comput 109:107506. https://doi.org/10.1016/j.asoc.2021.107506
22. Zhang B, Xiao W, Xiao X, Sangaiah AK, Zhang W, Zhang J (2020) Ransomware classification using patch-based CNN and self-attention network on embedded N-grams of opcodes. Future Gener Comput Syst 110:708–720. https://doi.org/10.1016/j.future.2019.09.025
23. Gao X, Hu C, Shan C, Liu B, Niu Z, Xie H (2020) Malware classification for the cloud via semi-supervised transfer learning. J Inf Secur Appl 55(October):102661. https://doi.org/10.1016/j.jisa.2020.102661
24. Yazdinejad A, HaddadPajouh H, Dehghantanha A, Parizi RM, Srivastava G, Chen M-Y (2020) Cryptocurrency malware hunting: a deep recurrent neural network approach. Appl Soft Comput 96:106630. https://doi.org/10.1016/j.asoc.2020.106630
25. Pei X, Yu L, Tian S (2020) AMalNet: a deep learning framework based on graph convolutional networks for malware detection. Comput Secur 93:101792. https://doi.org/10.1016/j.cose.2020.101792
26. Sharma O, Sharma A, Kalia A (2022) Windows and IoT malware visualization and classification with deep CNN and Xception CNN using Markov images. J Intell Inf Syst. https://doi.org/10.1007/s10844-022-00734-4
27. Cuckoo Sandbox - Automated Malware Analysis. https://cuckoosandbox.org/. Accessed 26 Oct 2021
28. Stamp M, Chandak A, Wong G, Ye A (2021) On ensemble learning. arXiv:2103.12521 [cs], Mar 2021, Accessed: 22 Jan 2022. [Online]. Available: http://arxiv.org/abs/2103.12521
29. Moti Z et al (2021) Generative adversarial network to detect unseen Internet of Things malware. Ad Hoc Netw 122:102591. https://doi.org/10.1016/j.adhoc.2021.102591
30. Namavar Jahromi A et al (2020) An improved two-hidden-layer extreme learning machine for malware hunting. Comput Secur 89:101655. https://doi.org/10.1016/j.cose.2019.101655

Detection of Location of Audio-Stegware in LSB Audio Steganography

A. Monika, R. Eswari, and Swastik Singh

Abstract Steganography in malware, often known as stegomalware or stegware, is growing in popularity as attackers continue to show their versatility by inventing new tactics and re-inventing old ones in the quest to disguise their dangerous software. Malware authors are modernizing the old steganography method by hiding dangerous code in seemingly harmless files such as images, audio and video. Many of these files are regarded as low security risks and are frequently excluded from further investigation. This offers a perfect entry point for cyber-criminals to hide their malicious programs. Therefore, this paper proposes an Audio-Stegware Location Detection model, which aims to find the exact location of stegware in an audio cover medium. The proposed system functions in three main phases: LSB clustering phase, ASCII conversion phase, and location finder. In the LSB clustering phase, the given audio files are converted to binary, and the LSB pixels that may carry stegware are clustered together. In the ASCII conversion phase, the binary form of the suspected stegware is converted to ASCII codes. Finally, in the location finding phase, the cluster of ASCII codes is processed by a language processor to differentiate the audio pixels from the steganography pixels. The effectiveness of the proposed system is evaluated by analyzing numerous audio files collected from various sources with obfuscated malicious codes. Business application programming interfaces (APIs) are used to collect a selection of the most recent virus codes. The simulation results show that the proposed system detects the stegware location with accuracy ranging from 80 to 97% for lower to higher embedding rates.

Keywords Stegware · StegAudio · Stegopixels · ASCII codes · Language processor

A. Monika (B) · R. Eswari · S. Singh Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamil Nadu, India e-mail: [email protected] R. Eswari e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_31


A. Monika et al.

1 Introduction

Steganography [1] is the process of concealing confidential information inside a cover medium. Cover media such as images, audio, video or text are most commonly used to conceal information. Anyone can use steganography techniques to share data invisibly for legitimate reasons. For example, a government may use steganography to transmit confidential information to stakeholders secretly. Although steganography improves the privacy of user data sharing, hackers are always looking for new ways to attack, and they may use steganography for nefarious purposes. For example, hackers may inject a malicious payload into the cover medium to evade detection by antimalware solutions. A malicious payload obfuscated inside a cover medium is termed "Stegomalware" or "Stegware." Stegware [2] can be used throughout the cyber-attack life cycle to deceive security defender tools and carry out the intended harmful activities. Our state-of-the-art study reveals that previous research has treated steganography as a tool for concealing data, and steganalysis as the corresponding technique for detecting stego data. There has been little research on detecting hidden malware in multimedia files such as images, and no solution has yet been found for other digital media, particularly cover media such as audio and video. Since stegware is regarded as one of the most sophisticated techniques used by hackers in cyber-attacks to evade security detection, stegomalware detection technology is in high demand to identify hidden data in cover media, protect enterprise assets and prevent users from becoming victims of data breaches. Furthermore, according to our analysis of recent real-world cyberattacks, all past cases of malware exploiting steganography [3–8] have used image file formats such as PNG or JPEG. Two recently published reports show the use of WAV audio files in malware operations. Waterbug (or Turla) using WAV files [9] and the BlackBerry Cylance threat using WAV files [10] are the two most recent cases of audio-stegware. Most existing antivirus products fail to detect these types of stegware, as shown in Fig. 1. These recent attacks and the failure of existing systems provide a significant motivation for the proposed system. This paper proposes a model that aims to find the exact location of stegware in an audio cover medium. The proposed system functions in three main phases: LSB clustering phase, ASCII conversion phase, and location finder. In the LSB clustering phase, the given audio files are converted to binary, and the LSB pixels that may carry stegware are clustered together. In the ASCII conversion phase, the binary form of the suspected stegware is converted to ASCII codes. Finally, in the location finding phase, the cluster of ASCII codes is processed by a language processor to differentiate the audio pixels from the steganography pixels. The rest of the article is structured as follows: related works are presented in Sect. 2; the suggested Audio-Stegware Location Detection System is described in Sect. 3; Sect. 4 covers the experimental setup, datasets and assessment metrics; Sect. 5 describes the evaluated outcomes of the proposed system as well as the comparison; Sect. 6 concludes with major findings and future improvements.


Fig. 1 Result of stegware detection by existing antivirus system

Contribution: This work aims to find the location of stegware, which is invisible to existing endpoint defense mechanisms. Numerous systems are available to detect and defend against visible malware. However, there is a lack of systems that can detect or defend against invisible malware, especially malware that hides dangerous code in seemingly harmless files such as audio and video. Many of these files are regarded as low security risks and are frequently excluded from further investigation. The major contribution of this work is finding malware that is hidden inside such low-security-risk, frequently ignored files. It also contributes to finding the location of malware that is in encrypted form.

2 Related Work

Puchalski et al. [11] proposed a method for detecting malware and other undesirable information in digital images. This method was designed to deal with images compressed using the Graphics Interchange Format. Because such files follow a well-defined standard, abnormal data can be identified by looking for the file's end. Verma et al. [12] created a Python script to identify malware in the popular JPEG image format. They noted that prior methods had mostly focused on detecting steganography artifacts or had relied on feature-based analysis to discover concealed malicious data. Their program, in addition to classification, finds malicious content in JPEG images and outputs the detected malicious data as well as its location. They also showed that this functionality had not been reported in the available literature. Their software analyzes three types of JPEG images: harmful, benign and stego. Although malicious images can also be stego, their tool uses the term stego images for those that hide non-malicious information. SIMARGL [13] is a European Commission-funded project aimed at combating the growing threat of malware. It tries to address new cybersecurity concerns such as information concealment tactics, network anomalies and stegware. Its research on stegware aims to improve the available information, raise awareness about different types of stegware and mitigate the hazards. Its most recent work focuses on detecting steganographic changes to digital images delivered over the network and estimating the size of a PowerShell script embedded within the pixels of PNG images. Monika and Eswari [14] proposed a neutralization system against image stegware. Their model first detects the presence of any obfuscated items inside images, then finds the exact obfuscated location inside the digital medium, and finally deactivates the location where the hidden item is found. Cohen et al. [15] created MalJPEG, a technique for analyzing unknown hazardous JPEG images. Their method uses machine learning classification algorithms to separate genuine from malicious JPEG images based on nine key attributes collected from the JPEG image file format. Their approach is domain-specific and only works with JPEG files; other formats such as .bmp, .png and .gif are incompatible with their system. George et al. [16] developed a steganalysis method for analyzing, identifying and extracting hidden data in a group of image files. It tries to inspect the payload and determines whether it is malicious or not.
The disadvantage of their technique is that obfuscated data containing the PE file format as payload is always classified as malware. Moreover, the authors do not describe the decoding strategy for hidden payloads. Choudhury et al. [2] presented a showering technique that eliminates malware while providing a typical genuine image to the end user. The authors proposed a blind destructor: this system eliminates the hidden payload regardless of whether the disguised data is malicious or not. However, this strategy also affects legitimate applications of steganography, such as protecting data against attacks like packet sniffing. Prasad et al. [17] proposed a detection model for images sent across noisy channels. This model uses neural network-based classification to perform steganographic analysis, but the detection process in this case is more complicated due to noise. Their methodology does not classify the obfuscated data inside the image as normal or malicious; the model can only determine whether an image is clean. The following issues are identified from the literature survey: (1) The existing endpoint defense mechanisms described above deal only with the image cover medium. Other major digital media that are likely to be used in steganography, such as audio, are not yet covered. (2) Ongoing research such as SIMARGL focuses on the image medium, whereas recent attacks such as Waterbug and BlackBerry Cylance have extended their methods to media like audio, which are considered less risky and are overlooked in security analyses. To the best of our knowledge, the audio cover medium has not yet been covered in stegware analysis. Therefore, this work focuses on the audio cover medium to defend against audio-stegware attacks. Unlike images, audio media are composed of waveforms, which makes detecting hidden malware more challenging. This paper proposes an Audio-Stegware Location Detection Model, which aims to find the exact location of stegware in an audio cover medium.

3 Proposed System

This paper proposes an “Audio-Stegware Location Detection model” that aims to provide a general mechanism for finding the exact location of stegware in an audio cover medium. The proposed system works in three main phases, as shown in Fig. 2: LSB clustering, ASCII conversion, and location finding. In the LSB clustering phase, the given audio files are converted to binary, and the LSB pixels that may carry stegware are clustered together. In the ASCII conversion phase, the binary form of this cluster is converted to ASCII codes. Finally, in the location finding phase, the cluster of ASCII codes is processed by a language processor to differentiate the audio pixels from the steganography pixels, as illustrated in Algorithm 1.

3.1 LSB Location Clustering

Least significant bit (LSB) embedding is one of the most widely preferred techniques for hiding a secret message in audio files. The LSB audio steganography method keeps the embedding distortion of the host audio low while offering a high capacity for secret text. In this technique, the message is hidden in the last bit of several bytes of the given file. Because changing the last bit does not perceptibly change a digital medium, it is very difficult to identify differences in the file before and after the content is added. Therefore, to find the location of the obfuscated payload in the given audio file, it is essential to extract the last bit of every byte of the file. These bits are clustered together to identify the location of the hidden message in the file. Collecting them requires a bit stream of the audio file, which cannot be obtained directly: the audio file is first converted into a byte stream, and the byte stream is then converted into a bit stream. Once the bit stream of the complete audio file is available, the cluster of required bits is created. The required bits are every 8th bit of the bit stream, because these are the last bits of each byte of the audio file. The extracted bits are stored in the output file.

Fig. 2 Proposed system: audio-stegware location detection model

A. Audio to Bit Stream Conversion

To convert a file into a byte stream, the data must be read and written in the form of 8-bit bytes. Streams are unidirectional: a stream can transfer data in only one direction, either reading data from a source into the program or writing data from the program to a destination. For this reason, two different classes are used: an input stream class for reading 8-bit data and an output stream class for writing 8-bit data. Let S(x) be the source file, P(x) our program, I(c) the input stream class, O(c) the output stream class, and D(x) the destination. Through the input stream class we read the data from S(x) into P(x), and through the output stream class we write the data from P(x) to D(x), as shown in Eq. (1).

$$S(x) \xrightarrow{\; I(c) \;} P(x) \xrightarrow{\; O(c) \;} D(x) \tag{1}$$
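The one-directional transfer in Eq. (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `copy_byte_stream`, the file names, and the 16 stand-in bytes are hypothetical, and a binary read handle and a binary write handle play the roles of I(c) and O(c).

```python
import os
import tempfile


def copy_byte_stream(source_path, destination_path, chunk_size=4096):
    """Read 8-bit bytes from the source (I(c)) and write them to the destination (O(c))."""
    copied = 0
    with open(source_path, "rb") as source, open(destination_path, "wb") as destination:
        while True:
            chunk = source.read(chunk_size)   # S(x) -> P(x)
            if not chunk:
                break
            destination.write(chunk)          # P(x) -> D(x)
            copied += len(chunk)
    return copied


with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "cover.wav")   # hypothetical file name
    dst = os.path.join(workdir, "copy.bin")    # hypothetical file name
    with open(src, "wb") as handle:
        handle.write(bytes(range(16)))         # stand-in bytes, not a real WAV file
    copied = copy_byte_stream(src, dst)
    with open(dst, "rb") as handle:
        round_trip = handle.read()

print(copied)
```

Each handle moves data in one direction only, which is why two separate handles are opened.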

B. LSB Bits Identification

Once the bit stream of the complete audio file is available, the cluster of required bits is created. The required bits are every 8th bit of the bit stream, because these are the last bits of each byte of the audio file. They are selected by taking the position of each bit modulo 8. Let N be the size of the input bit stream, b_i the i-th bit of the input stream, and E[b] the output stream of required LSB bits. Then E[b] is obtained by applying the modulo function as in Eq. (2):

$$E[b] = \{\, b_i \mid i \bmod 8 = 0,\ 1 \le i \le N \,\} \tag{2}$$
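The selection in Eq. (2) can be sketched in a few lines of Python. The helper names `bytes_to_bits` and `lsb_cluster` and the four sample bytes are illustrative assumptions; the bits of each byte are assumed to be ordered most significant first, so that every 8th bit is the LSB of a byte.

```python
from typing import List


def bytes_to_bits(data: bytes) -> List[int]:
    """Expand each byte into its 8 bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def lsb_cluster(bits: List[int]) -> List[int]:
    """Keep the bits at 1-based positions 8, 16, 24, ... (i mod 8 == 0)."""
    return [bit for position, bit in enumerate(bits, start=1) if position % 8 == 0]


# Hypothetical 4-byte input standing in for the byte stream of a .wav file.
sample = bytes([0b10110101, 0b00000000, 0b11111111, 0b01010100])
bit_stream = bytes_to_bits(sample)
cluster = lsb_cluster(bit_stream)
print(cluster)  # [1, 0, 1, 0]
```

Under this bit ordering the result is the same as taking `byte & 1` for every byte, which avoids building the full bit stream when only the LSBs are needed.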


3.2 ASCII Conversion

In the ASCII format, every character is represented by eight bits. During binary-to-ASCII conversion, if the bytes are space-separated, characters can be identified even when a group contains fewer than eight bits. If the stream is not space-separated, the converter takes eight bits at a time and converts each group into the corresponding ASCII character. In the proposed model, the bit stream extracted from the audio file is not space-separated. Therefore, the first eight bits are converted into an ASCII character, then the next eight bits, and so on. To convert any eight bits into the equivalent ASCII character, the bits are first converted into their integer value.

$$01000001_2 \;\rightarrow\; 65_{10} \;\rightarrow\; \text{A} \tag{3}$$

For 8-bit data, the maximum value is 255, and the (extended) ASCII table assigns a fixed symbol to every integer from 0 to 255. For example, the integer equivalent of the 8-bit pattern 01000001 is 65, and in the ASCII table 65 is the letter A (Eq. 3). In this way, the complete output stream is converted to ASCII format.
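As a hedged illustration of the conversion in Eq. (3), the sketch below decodes an unseparated bit string eight bits at a time. The function name `bits_to_text` is hypothetical, trailing bits that do not fill a group of eight are assumed to be ignored, and Python's `chr` is assumed as the symbol table for the values 0 to 255.

```python
def bits_to_text(bits: str) -> str:
    """Decode an unseparated bit string, eight bits per character."""
    usable = len(bits) - len(bits) % 8           # drop a trailing partial group
    codes = [int(bits[i:i + 8], 2) for i in range(0, usable, 8)]
    return "".join(chr(code) for code in codes)


# 01000001 -> 65 -> 'A', 01000010 -> 66 -> 'B'; the last three bits are ignored.
decoded = bits_to_text("0100000101000010010")
print(decoded)  # AB
```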

3.3 Location Finder

In this phase, the exact location of the hidden item inside the given file is identified. The cluster of ASCII codes of the audio file contains a combination of distorted audio and hidden messages. This cluster is processed by the language processor, which labels each part as distorted audio pixels or hidden message pixels. The language processor searches the ASCII file for the first word that matches a dictionary word. Once this word is found, all the letters that precede it are ignored, because they belong to the distorted audio. To find the location of the hidden message in the original audio, the processor counts the letters that precede the first word and multiplies this count by eight, because the ASCII file being processed was built by selecting only every 8th bit of the original bit stream. To recover the complete message, the processor continues from the first word for as long as the words match the dictionary. In this way, both the complete message and its location in the given audio file are obtained.
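The search described above can be sketched as follows. The paper does not specify the language processor, so the function `find_first_word`, the two-word dictionary, and the sample string are hypothetical. The sketch also assumes one reading of the multiply-by-eight rule: each decoded letter is built from the LSBs of eight consecutive audio bytes, so the product is a byte offset into the audio byte stream.

```python
def find_first_word(text, dictionary, min_len=3):
    """Return (letter_offset, word) for the earliest dictionary word in text, or None."""
    if not dictionary:
        return None
    max_len = max(len(word) for word in dictionary)
    for start in range(len(text)):
        longest_end = min(len(text), start + max_len)
        # Prefer the longest dictionary match at this position.
        for end in range(longest_end, start + min_len - 1, -1):
            if text[start:end].lower() in dictionary:
                return start, text[start:end]
    return None


# Five noise letters from distorted audio, followed by a hidden message.
decoded = "\x03q\x7fz#hello world"
offset, word = find_first_word(decoded, {"hello", "world"})
byte_location = offset * 8   # assumed: eight audio bytes contribute one letter
print(offset, word, byte_location)  # 5 hello 40
```

A real dictionary would be far larger, and short accidental matches in the noise become more likely as it grows, which is why a minimum word length is included in the sketch.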


Algorithm 1: Audio-Stegware Location Identifier
Input: Audio-stegware
Output: Location of stegware inside the audio cover medium
// N ← number of bits in the bit stream
// b_i ← each bit in the bit stream; i = 1 to N
// E[b] ← output stream (LSB cluster)
byte stream ← audio signal
bit stream ← byte stream
for (i = 1 to N)
    if (i % 8 == 0)
        b_i ∈ E[b]
    end if
end for
for (i = 1; i <= length(E[b]); i = i + 8)
    for (j = i to i + 7)
        lsbbitgroup[j] ← E[b_j]
    end for
    convert lsbbitgroup to ASCII
    store in output file
end for
stegware_location ← language_processor(output_file)

4 Experimental Setup

This section briefly describes the experimental setup used to simulate the proposed system and evaluate its efficiency in locating stegware in audio cover media. It also covers the evaluation metrics and parameter settings.

4.1 Dataset

The accuracy of location detection is known to vary with the distinctiveness of the audio sources. Hence, this study uses audio files (.wav) from two separate datasets and multiple online sources to evaluate the proposed Audio-Stegware Location Finder model. For the payloads, 100 recent malicious code samples were gathered from commercial APIs after the samples had been examined with approximately 15 antivirus product engines.

a. Arabic Speech Corpus was created as part of Nawar Halabi’s Ph.D. research at the University of Southampton. The corpus was recorded in a professional studio in south Levantine Arabic (Damascene accent). Speech synthesized with this corpus is a high-quality, natural voice. The corpus contains 1813 .wav files with spoken utterances, 1813 .lab files with text utterances, and 1813 .TextGrid files with phoneme labels and time stamps of the boundaries where these appear in the .wav files. Praat software can be used to open these files.
b. TIMIT read speech corpus was created to provide speech data for acoustic-phonetic knowledge acquisition and for the development and evaluation of automatic speech recognition systems. TIMIT contains a total of 6300 sentences, 10 sentences spoken by each of 630 speakers from eight major dialect regions of the United States.
c. Open-source (.wav) files: 500 .wav files of different kinds were downloaded from various Internet sources such as social media, music Web sites, and learning sites.

The UNION dataset (self-created) includes a subset of the audio files from all the corpora and open sources mentioned above. A total of 200 cover audio files and 100 malware samples are used to create 200 audio-stegware files for this simulation.

4.2 Simulation Setup

The proposed system is simulated using 200 audio-stegware files created from 200 .wav audio files of the sources mentioned above and 100 malware samples collected from an open-source API. The malware codes in the UNION dataset are embedded at different insertion rates (10, 20, 30, 40, and 50%); payloads embedded at 10% are the hardest to detect, whereas those embedded at 50% are the easiest to detect with steganography detection algorithms. As described earlier, the given audio files are converted to binary, and the LSB pixels that may carry stegware are clustered together. The binary form of this cluster is converted to ASCII codes. Finally, the cluster of ASCII codes is processed by a language processor to differentiate the audio pixels from the steganography pixels.

4.3 Evaluation Metrics

The performance metrics used to evaluate the proposed system are as follows:

(a) Accuracy: This metric evaluates the correctness of the identified location. The correctness of the location is estimated by extracting the hidden content and comparing the extracted content with the original content. According to the steganographic taxonomy, the accuracy of the extracted content is directly proportional to the accuracy of the detected hidden location; that is, the original payload can be extracted correctly if the location is properly identified. Therefore, the accuracy of the identified location is defined as the ratio of the number of letters correctly extracted to the total number of letters hidden inside the audio cover medium.

$$\text{Accuracy} = \frac{\text{total number of correctly decoded letters}}{\text{total number of letters inserted}} \tag{4}$$

(b) False Positive Rate: It is measured as the ratio of the number of letters wrongly extracted to the total number of letters hidden inside the audio cover medium.

$$\text{FPR} = \frac{\text{total number of wrongly decoded letters}}{\text{total number of letters inserted}} \tag{5}$$

(c) Missed-Pixel Rate: It is measured as the ratio of the number of letters missed during extraction to the total number of letters hidden inside the audio cover medium.

$$\text{MPR} = \frac{\text{total number of letters missed in extraction}}{\text{total number of letters inserted}} \tag{6}$$
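Equations (4) to (6) can be computed as in the sketch below. The position-by-position comparison is an assumption, since the paper does not state how letters are aligned: a letter counts as correct if it matches at the same position, wrong if it differs, and missed if the extracted text ends before it. The function name and the two strings are hypothetical.

```python
def payload_metrics(original: str, extracted: str) -> dict:
    """Accuracy, FPR and MPR in percent, relative to the number of letters inserted."""
    total = len(original)
    pairs = list(zip(original, extracted))
    correct = sum(1 for o, e in pairs if o == e)
    wrong = sum(1 for o, e in pairs if o != e)
    missed = max(total - len(extracted), 0)
    return {
        "accuracy": 100 * correct / total,   # Eq. (4)
        "fpr": 100 * wrong / total,          # Eq. (5)
        "mpr": 100 * missed / total,         # Eq. (6)
    }


# Hypothetical payload of 10 letters: 7 correct, 1 wrong, 2 missed.
metrics = payload_metrics("powershell", "powerzhe")
print(metrics)  # {'accuracy': 70.0, 'fpr': 10.0, 'mpr': 20.0}
```

With this alignment the three rates sum to 100% whenever the extracted text is no longer than the original.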

5 Experimental Results and Discussions

This section analyzes the proposed system in terms of the evaluation metrics, which are computed from the similarity and dissimilarity between the encoded and decoded payloads. The results show that the proposed system is effective in finding the exact location of stegware in an audio cover medium, which helps defend against hidden information attacks.

5.1 Effectiveness of Proposed Algorithm in Identifying Audio-Stegware Location

The performance of the proposed system in finding the location of audio-stegware is evaluated in terms of accuracy, FPR, and MPR for the dataset described above. Experiments were conducted on the audio-stegware files created from the Arabic Speech Corpus .wav files, the TIMIT .wav files, the open-source .wav files, and the UNION dataset, which is a subset of the combination of these three sources. The payload in the identified location is extracted and compared with the embedded original payload to find the accuracy, FPR, and MPR of the audio-stegware location identified by the proposed algorithm.

Fig. 3 Results obtained for UNION dataset (x-axis: embedding rate, 10% to 50%; y-axis: similarity and dissimilarity rate of decoded payload, 0 to 100; series: accuracy, FPR, MPR)

The average accuracy, FPR, and MPR obtained in the experiments mentioned above are depicted in Fig. 3. The results show the accuracy, FPR, and MPR of the detected location for each embedding rate used (10, 20, 30, 40, and 50%), with 10% being the most difficult to detect and 50% the easiest. Accuracy is calculated from the similarity of the original and decoded payloads (Eq. 4). FPR is calculated from the dissimilarity of the original and decoded payloads (Eq. 5), whereas MPR (missed-pixel rate) is calculated from the number of pixels missed during extraction (Eq. 6). The graph shows that the proposed model finds the exact location with high accuracy, rising from 80% at the 10% embedding rate to 97% at the 50% embedding rate, while the FPR and MPR remain low. Table 1 lists the simulation setup and the results obtained by the proposed model. The model is tested on the UNION dataset of 200 .wav files, which is a subset of the Arabic Speech Corpus .wav files, the TIMIT .wav files, and the .wav files collected from open sources. The embedding rate of 10–50% is the proportion of bits used by the steganography: a rate of 10% means that only 10% of the total .wav pixels are used for steganography, and 50% means that half of the total pixels are used for obfuscation. According to the steganography taxonomy, the difficulty of steganalysis is inversely related to the number of pixels used; that is, the fewer pixels are used, the harder the obfuscation is to detect. Therefore, the proposed model is tested from harder to easier embedding rates.

Table 1 Results of proposed system in terms of accuracy, FPR, and MPR
Dataset: UNION dataset (200 .wav files), composed of Arabic speech corpus (100 .wav files), TIMIT corpus (50 .wav files), and open source (50 .wav files)

Embedding rate                        Accuracy (%)   FPR (%)   MPR (%)
10% (40 files)                        80             15        5
20% (40 files)                        83             11        6
30% (40 files)                        86             10        4
40% (40 files)                        95             3         2
50% (40 files)                        97             2         1
200 files of mixed embedding rates    89             8         3

6 Conclusion

On-demand platforms continue to face massive risks from file formats that are usually benign and unanticipated as threats, yet can lead to malicious applications. Malware authors are reviving the old steganography method by hiding dangerous code in seemingly harmless files such as images, audio, and video. Many of these files are regarded as low security risks and are easily passed over without further investigation. This has offered an entry point for cyber-criminals to hide their malicious programs. Therefore, this paper proposes a model to find the exact location of stegware in an audio cover medium. The proposed system works in three main phases: LSB clustering, ASCII conversion, and location finding. In the LSB clustering phase, the given audio files are converted to binary, and the LSB pixels that may carry stegware are clustered together. In the ASCII conversion phase, the binary form of this cluster is converted to ASCII codes. Finally, in the location finding phase, the cluster of ASCII codes is processed by a language processor to differentiate the audio pixels from the steganography pixels. The proposed model is evaluated using 200 audio-stegware files created from 200 .wav audio files of various sources and 100 malware samples collected from an open-source API. The payload in the UNION dataset is embedded at rates from 10 to 50%; payloads embedded at 10% are the hardest to detect, whereas those embedded at 50% are the easiest. The performance of the proposed system in finding the location of audio-stegware is evaluated in terms of accuracy, FPR, and MPR for the dataset described above. Experiments were conducted on the audio-stegware files created from the Arabic Speech Corpus .wav files, the TIMIT .wav files, the open-source .wav files, and the UNION dataset, which is a subset of the combination of these three sources. The payload in the identified location is extracted and compared with the embedded original payload to find the accuracy and FPR of the audio-stegware location identified by the proposed algorithm. Accuracy is calculated from the similarity of the original and decoded payloads, and FPR from their dissimilarity.
The simulation results show that the proposed system finds the exact location of audio-stegware with an accuracy of 80% even at the lowest embedding rate and up to 97% at the highest embedding rate, with a low FPR and MPR. The proposed system was tested on audio-stegware files. It will be extended to distinguish legitimate stego-audio from malicious stego-audio and to deactivate the location of malicious stego-audio.

References
1. Monika A, Eswari R (2021) Ensemble-based stegomalware detection system for hidden ransomware attack. In: Inventive systems and control. Springer, Singapore, pp 599–619
2. Choudhury S, Amritha PP, Sethumadhavan M (2019) Stegware destruction using showering methods. Int J Innov Technol Explor Eng (IJITEE) 8:256–259
3. Soni A, Barth J, Marks B (2018) Research and intelligence. https://blogs.blackberry.com/en/2019/10/malicious-payloads-hiding-beneath-the-wav
4. Cabaj K, Caviglione L (2018) The new threats of information hiding: the road ahead. IEEE IT Prof 20:31–39. https://doi.org/10.1109/MITP.2018.032501746
5. Caviglione L, Choras M (2020) Tight arms race: overview of current malware threats and trends in their detection. IEEE Access 9. https://doi.org/10.1109/ACCESS.2020.3048319
6. Wiseman SR (2017) Stegware: using steganography for malicious purposes. Deep Secure Technical Report DS-2017-4. https://doi.org/10.13140/RG.2.2.15283.53289
7. Jung D-S et al. (2020) ImageDetox: method for the neutralization of malicious code hidden in image files. Comput Eng Sci Symmetry Asymmetry. https://doi.org/10.3390/sym12101621
8. Caviglione L, Mazurczyk W, Repetto M, Schaffhauser A, Zuppelli M (2021) Kernel-level tracing for detecting stegomalware and covert channels in Linux environments. Int J Comput Telecommun Netw. https://doi.org/10.1016/j.comnet.2021.108010
9. Zdnet (2019) Steganography malware trend moving from PNG and JPG to WAV files. https://www.zdnet.com/article/wav-audio-files-are-now-being-used-to-hide-malicious-code/
10. Symantec Enterprise Blogs and Threat Intelligence (2019) https://symantec-enterprise-blogs.security.com/blogs/threat-intelligence/waterbug-espionage-governments
11. Puchalski D et al. (2020) Stegomalware detection through structural analysis of media files. In: ARES ‘20: proceedings of the 15th international conference on availability, reliability and security, article no. 73, pp 1–6. https://doi.org/10.1145/3407023.3409187
12. Verma V, Muttoo SK, Singh VB (2022) Detecting stegomalware: malicious image steganography and its intrusion in windows. Secur Priv Data Anal 103–116. https://doi.org/10.1007/978-981-16-9089-1_9
13. SIMARGL (2022) Stegware: the latest trend in cybercrime. https://simargl.eu/blog/technical/stegware-the-latest-trend-in-cybercrime
14. Monika A, Eswari R (2022) Prevention of hidden information security attacks by neutralizing stego-malware. Comput Electr Eng. https://doi.org/10.1016/j.compeleceng.2022.107990
15. Cohen A, Nissim N, Elovici Y (2020) MalJPEG: machine learning based solution for detection of malicious JPEG images. Expert Syst Appl 8
16. George G, Savaridassan P, Devi K (2018) Detect images embedded with malicious programs. Int J Pure Appl Math 120(6):2763–2777
17. Prasad SS, Hadar O, Polian I (2020) Detection of malicious spatial-domain steganography over noisy channels using convolutional neural networks. In: Media watermarking, security, and forensics, pp 76–1–76–7(7)

Hybrid Quantum Classical Neural Network-Based Classification of Prenatal Ventricular Septal Defect from Ultrasound Images S. Sridevi, T. Kanimozhi, Sayantan Bhattacharjee, Soma Sekhar Reddy, and Durri Shahwar

Abstract Prenatal ventricular septal defect (VSD) is the second most common congenital heart defect (CHD) in the developing fetus. Prenatal VSD is diagnosed clinically from ultrasound images, which are safe to acquire but are distorted by inherent speckle noise, making the diagnosis more challenging. In this paper, we present a hybrid quantum classical neural network (HQCNN) model, executed on the IBM Aer simulator, to recognize prenatal VSD from two-dimensional fetal cardiac ultrasound images. We trained the HQCNN model while varying the number of qubits and the number of shots of the parameterized quantum circuit. The model trained with 2 qubits and executed for 1500 shots performed best, yielding a testing accuracy of about 95.8% in classifying VSD.
Keywords Ventricular septal defect · Hybrid quantum classical neural network · Parameterized quantum circuit · Aer simulator

S. Sridevi · T. Kanimozhi Associate Professor, School of Computer Science and Engineering, Vellore Institute of Technology Chennai, Chennai 600127, Tamil Nadu, India e-mail: [email protected] T. Kanimozhi e-mail: [email protected] T. Kanimozhi Associate Professor, Department of Electronics and Communication Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai 600062, Tamil Nadu, India S. Bhattacharjee (B) · S. S. Reddy · D. Shahwar Department of Electronics and Communication Engineering, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai 600062, Tamil Nadu, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_32


1 Introduction

CHDs are among the most common heart diseases in infants and arise during fetal development. These defects can be classified as structural and functional heart diseases. Among the several variants of CHD, one of the most common is the cardiac septal defect, which is characterized by the formation of a hole in the atrial septum, the ventricular septum, or both. These variants are called atrial septal defect (ASD) and ventricular septal defect (VSD) [1]. In India, VSD accounts for 20–30% of all CHD cases [2]. VSDs occur in approximately 2–6 of every 1000 births and account for approximately 30% of all CHDs in children and adolescents. They require early intervention for the patient to be treated and to survive [1].

The technique currently used for diagnosis is ultrasonography, an ultrasound-based diagnostic imaging modality. In this technique, sound waves are transmitted, and the reflected waves are used to visualize the anatomical structures of the body. The technique is a gold standard, but it requires extensive experience in identifying the anatomical structures of the fetal heart chambers and blood vessels. Only experienced radiologists can interpret the ultrasound images and identify the diseases. New obstetricians and gynecologists find it hard to extract the various planes of the ultrasound images and interpret them. Moreover, the speckle noise inherent in ultrasound images hinders diagnosis. Thus, diagnosing disease from ultrasound images is itself a hard task from a clinical perspective. Other methods exist, such as auscultation, in which a stethoscope is used to listen for heart sounds and murmurs, but this method can often lead to misdiagnosis. With the current revolution in artificial intelligence, computer-aided diagnosis systems have been developed, for example for the automatic interpretation of echocardiogram images [3–5].

Previous papers describe different ways of identifying the defects by segmenting the images using deep learning techniques [6, 7]. The authors used a CNN (U-Net) and a Markov random field technique [8], respectively, to extract the region of interest (ROI) from the images. The literature offers very limited research on the classification of CHDs from ultrasound images [9]. In this context, deep learning models could be a more reliable solution [10]. However, these data-driven approaches require a large dataset to give the best accuracy. Collecting a large number of clinical images of abnormal categories is difficult, which is the major motivation for using a smaller dataset. When the dataset is imbalanced, techniques such as data augmentation can be used.

At this juncture, developments in quantum machine learning enhance the learning process owing to quantum properties such as superposition and entanglement [11–13]. In superposition, a qubit exists in a combination of its basis states at the same time. Entanglement is a correlation between two qubits that persists regardless of the distance between them. Quantum systems can produce atypical patterns that conventional computers are thought unable to produce efficiently, which gives them a potential advantage in machine learning. The main focus of this paper is to use a hybrid quantum classical neural network model (i.e., a combination of a classical neural network and a parameterized quantum circuit) for the classification task. Researchers have used HQCNN to classify chest X-ray images in order to detect COVID-19 and achieved an accuracy of 98.2%. Hence, in this study, we train a hybrid quantum classical model for the binary classification task of discriminating between VSD and normal heart images.

2 Methodology

The proposed HQCNN model for classifying VSD-based prenatal CHD abnormalities from ultrasound images is described as follows. The main intent is to use a parameterized quantum circuit to partially quantize the classical convolution neural network (CNN) model and thereby boost the performance of the HQCNN model. Thus, the architecture comprises two phases, namely the classical CNN phase and the parameterized quantum circuit phase. The architecture of the proposed model is depicted in Fig. 1.

2.1 Classical Convolution Neural Network Phase

The CNN phase extracts the diagnostically relevant features from the given input images. To extract these features, the model is built with two convolution layers with multiple filters; the convolution is mathematically expressed as

$$F(x, y) = \sum_{m} \sum_{n} K(m, n)\, U(x - m, y - n) \tag{1}$$

where U(x, y) denotes the ultrasound image, K(m, n) denotes the kernel used to extract features, and F(x, y) denotes the resulting feature map. In this conventional convolution operation, the nonlinear rectified linear unit (ReLU) is used as the activation function. After the max pooling operation, the convolved feature map is extracted as a feature vector from the fully connected layer. This feature vector is then passed to the parameterized quantum circuit for further processing.

Fig. 1 Block diagram of the proposed HQCNN model
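A small worked instance of Eq. (1) followed by ReLU is sketched below in plain Python. It is an illustration rather than the model's actual layer: the 2 × 2 image and kernel are made-up values, the helper names are hypothetical, and pixels outside the image are assumed to be zero, which gives the so-called full output size.

```python
def conv2d_full(image, kernel):
    """F(x, y) = sum_m sum_n K(m, n) * U(x - m, y - n), with U = 0 outside the image."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * (w + kw - 1) for _ in range(h + kh - 1)]
    for x in range(h + kh - 1):
        for y in range(w + kw - 1):
            total = 0.0
            for m in range(kh):
                for n in range(kw):
                    i, j = x - m, y - n
                    if 0 <= i < h and 0 <= j < w:
                        total += kernel[m][n] * image[i][j]
            out[x][y] = total
    return out


def relu(feature_map):
    """Set negative responses to zero."""
    return [[max(0.0, value) for value in row] for row in feature_map]


U = [[1, 2], [3, 4]]     # made-up 2 x 2 "image"
K = [[1, 0], [0, -1]]    # made-up 2 x 2 kernel
feature = conv2d_full(U, K)
activated = relu(feature)
print(feature)    # [[1.0, 2.0, 0.0], [3.0, 3.0, -2.0], [0.0, -3.0, -4.0]]
print(activated)  # [[1.0, 2.0, 0.0], [3.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
```

For example, F(1, 1) = K(0,0)·U(1,1) + K(1,1)·U(0,0) = 4 − 1 = 3, since the other two kernel entries are zero.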

2.2 Parameterized Quantum Circuit Phase

In this hybrid model, the parameterized quantum circuit is used to further enhance the classification task with the quantum properties of superposition and entanglement. The feature vector from the fully connected layer is fed to the two-qubit parameterized quantum circuit, which consists of three stages, as shown in Fig. 2: a Hadamard gate in stage I, a rotation (RY) gate with trainable parameter β in stage II, and a Pauli-Z measurement unit in stage III.

Fig. 2 Parameterized quantum circuit

The Hadamard gate (H) maps the qubit state $|\psi\rangle = a|0\rangle + b|1\rangle$ to another qubit state; in particular, it moves a state at a pole of the Bloch sphere to the equator, which places the input qubit in superposition. The corresponding 2 × 2 unitary matrix of the H gate is

$$H = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \tag{2}$$

In stage II, the parameterized quantum circuit is built with the RY rotation gate, whose 2 × 2 unitary matrix is

$$R_y(\beta) = \begin{pmatrix} \cos\frac{\beta}{2} & -\sin\frac{\beta}{2} \\ \sin\frac{\beta}{2} & \cos\frac{\beta}{2} \end{pmatrix} \tag{3}$$

The RY rotation gate carries the trainable quantum weight parameter β. In this HQCNN model, the parameterized quantum circuit is initialized with β = π/4 radians to compute the qubit feature vector. Finally, in stage III, the parameterized quantum circuit comprises the Pauli-Z gate, whose unitary matrix is

$$\sigma_z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \tag{4}$$

This quantum gate is primarily used to transform the qubit feature map into classical data: the expectation value of the observable is measured with the required number of shots, as specified in Eq. (5).

$$y = \langle \sigma_z \rangle = \sum_{i=0}^{N} z_i\, p(z_i) \tag{5}$$

where N denotes the number of shots. The final measured value of the Pauli-Z gate, ⟨σ_z⟩, is taken as the label y predicted by the HQCNN model, in the form of a classical result. For learning and model optimization, we use the Adam optimizer to minimize the negative log-likelihood loss function. During training, the weight matrix W is learned in hybrid mode, simultaneously in the classical part and the quantum part, using the parameter shift rule.
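The three stages can be traced numerically with a small state-vector sketch in plain Python. This is a simplification under stated assumptions, not the Qiskit circuit used in the paper: it simulates a single qubit starting in |0⟩ rather than the two-qubit circuit, the function names are hypothetical, and the sampled estimate uses a seeded pseudo-random generator in place of real measurement shots. The gradient uses the standard parameter shift rule for a rotation gate, with shifts of ±π/2.

```python
import math
import random


def state_after_circuit(beta):
    """Amplitudes of |0> and |1> after H followed by RY(beta), starting from |0>."""
    a = b = 1 / math.sqrt(2)                        # stage I: H|0>
    c, s = math.cos(beta / 2), math.sin(beta / 2)
    return c * a - s * b, s * a + c * b             # stage II: RY(beta) applied to (a, b)


def expectation_z(beta):
    """Exact <sigma_z> = P(0) - P(1)."""
    amp0, amp1 = state_after_circuit(beta)
    return amp0 ** 2 - amp1 ** 2


def sampled_expectation_z(beta, shots, rng):
    """Stage III: average of +1/-1 measurement outcomes over a number of shots."""
    amp0, _ = state_after_circuit(beta)
    p0 = amp0 ** 2
    outcomes = [1 if rng.random() < p0 else -1 for _ in range(shots)]
    return sum(outcomes) / shots


def parameter_shift_gradient(beta):
    """d<sigma_z>/d(beta) from two shifted evaluations of the same circuit."""
    shift = math.pi / 2
    return 0.5 * (expectation_z(beta + shift) - expectation_z(beta - shift))


beta = math.pi / 4
exact = expectation_z(beta)                  # equals -sin(beta) for this circuit
estimate = sampled_expectation_z(beta, shots=1500, rng=random.Random(0))
gradient = parameter_shift_gradient(beta)    # equals -cos(beta) for this circuit
print(round(exact, 4), round(gradient, 4))
```

For this single-qubit circuit the exact expectation is −sin β, so the sampled estimate approaches that value as the number of shots grows, which is one way to see why the shot count affects the model's output.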

3 Results

The experimental results were obtained using IBM Quantum Lab together with the Qiskit and PyTorch libraries. The model is run on an IBM quantum system equipped with a 27-qubit Falcon R10 processor with a quantum volume of 32. The Qiskit Aer simulator is used for simulating the quantum computer. To test the performance of the proposed model, we analyzed our imbalanced dataset comprising 112 ultrasound images of normal and VSD CHD cases utilized in previous work. As the dataset is imbalanced, with a much smaller number of abnormal images, we used the synthetic minority oversampling technique (SMOTE) to augment and balance the dataset. To analyze the performance of the proposed model, the number of qubits is varied from 1 to 4 and the number of shots is set to 100, 500, 1000, and 1500 in order to observe the model's behavior in terms of testing loss and accuracy, as shown in Table 1. The model is trained at a learning rate of 0.001 for 100 epochs. The system's typical training time is approximately 10 min.

Table 1 Proposed model performance by varying qubits and shots

Qubits   Shots   Test accuracy (%)   Test loss
1        100     89.6                −0.9476
1        500     94.8                −0.9482
1        1000    45.8                −0.4861
1        1500    52.1                −0.9566
2        100     93.8                −0.9640
2        500     50.0                −0.2653
2        1000    81.2                −2.4211
2        1500    95.8                −5.1686
3        100     68.8                −50.82
3        500     56.2                −5.19
3        1000    50.0                −3.05
3        1500    94.8                −5.506
4        100     52.1                −51.88
4        500     45.8                −11.6300
4        1000    62.5                −47.36
4        1500    75.0                −41.32

As shown in Table 1, the loss and accuracy of the model change as the quantum depth and the number of shots are varied. From Table 1, we can infer that the best classification results are obtained with the hyperparameter setting of 2 qubits and 1500 shots, which gives the highest test accuracy in the table (95.8%). Despite a smaller dataset, the proposed HQCNN model has achieved excellent classification accuracy by leveraging the benefits of quantum superposition and entanglement features. Figure 3 depicts the proposed HQCNN model's sample prediction results.

Fig. 3 HQCNN sample prediction results

The convergence loss plot and the confusion matrix are the two tools used for understanding the model. The loss plot obtained after running the experiment helps us to understand whether the model is learning appropriately; the loss is plotted against the number of epochs. The confusion matrix is used to measure the accuracy and precision score on the test dataset. For our model, we have plotted the loss graph and confusion matrix as shown in Fig. 4.

Fig. 4 a Confusion matrix and b loss graph for the proposed model
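The SMOTE balancing step mentioned in the results can be sketched as follows. The paper does not say how SMOTE was applied to the ultrasound images, so this is a minimal illustration that assumes each minority sample is already a flat feature vector and that at least two minority samples exist; the function name and parameters are hypothetical.

```python
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority samples by interpolating towards neighbours.

    Each new point lies on the segment between a randomly chosen minority
    sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    abnormal = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(len(smote(abnormal, n_synthetic=10)))
```

Because every synthetic point is a convex combination of two real minority samples, the new samples stay inside the region spanned by the minority class rather than duplicating existing images.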

4 Conclusion

Quantum computing can enhance conventional deep learning tasks. In this paper, we proposed an HQCNN model for classifying CHD VSD from normal ultrasound images. Specifically, the hybrid learning of classical convolution and a parameterized quantum circuit together improved the classification performance. We also used a data augmentation technique in order to compensate for the small number of abnormal-class image samples in the dataset. In summary, the proposed HQCNN model performed well with the hyperparameter setting of 2 qubits and 1500 shots, reaching a classification accuracy of 95.8%. This model can be further improved by adding entanglement layers and by running it on real quantum processors.

Acknowledgements The authors would like to appreciate and recognize the financial support granted by the University of Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Avadi, Chennai, Tamil Nadu, under the “SEED FUND.”


Experimental Evaluation of Reinforcement Learning Algorithms

N. Sandeep Varma, Vaishnavi Sinha, and K. Pradyumna Rahul

Abstract Reinforcement learning is an active field of machine learning that deals with developing agents that take actions in an environment with the end goal of maximizing the total reward. The field of reinforcement learning has gained increasing interest in recent years, and efforts to improve the algorithms have grown substantially. To aid in the development of better algorithms, this paper tries to evaluate the state-of-the-art reinforcement learning algorithms for solving the task of learning with raw pixels of an image as input to the algorithm by testing their performance on several benchmarks from the OpenAI Gym suite of games. This paper compares their learning capabilities and consistency throughout the multiple runs and analyzes the results of testing these algorithms to provide insights into the flaws of certain algorithms.

Keywords Machine learning · Reinforcement learning · Algorithm reliability

1 Introduction

Reinforcement learning is a field of machine learning in which an agent makes a sequence of decisions when placed in an environment. For every decision that it takes, it earns a reward or a penalty. The environment could be anything from a simulation or a game to the surroundings faced by an autonomous vehicle. Such environments are potentially complex and uncertain. The models, when put in such environments, learn from experience and try to maximize the total reward. Recently, reinforcement learning has seen a dramatic rise in interest, mainly in the areas of continuous control in robotics systems and of playing Go, Atari and competitive video games. To ensure continued progress in RL research, it is vital that current works can be readily replicated and contrasted with new approaches, so that improvements can be reliably determined and novel techniques developed.

N. Sandeep Varma · V. Sinha · K. Pradyumna Rahul (B) Department of ISE, BMS College of Engineering, Bangalore, Karnataka 560019, India e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_33


Fig. 1 Number of reinforcement learning papers published on a yearly basis. The number of RL publications (y-axis) per year (x-axis)

Figure 1 consists of data scraped from Google Scholar searches for the term “reinforcement learning” [1]. The data cover the years 2005 to 2020 and illustrate the growth of the field in terms of papers published per year. To sustain rapid development in reinforcement learning research, existing works must be readily replicable and comparable, so that the changes introduced by novel approaches can be judged effectively.

Reliability is a very important aspect in deciding whether a model can be used in a real-life application: it should work without risking the lives of individuals or damaging the equipment used. Reinforcement learning models have a huge range of potential applications, but their use in real-life scenarios depends heavily on reliability [2]. RL models can be unreliable, as shown by the fact that multiple runs over the same environment may produce varying results. A systematic and demanding assessment method is therefore needed to measure progress in improving reinforcement learning algorithms and to identify their shortcomings, both to guide future study and to further the understanding of the capabilities of current algorithms. Systematic reliability metrics can support the RL industry by enabling researchers and practitioners to compare algorithms in a comprehensive and reliable manner [3]. In essence, this helps the field to evaluate progress and also guides the choice of algorithms for both testing and development environments.

This paper also identifies unique strengths and shortcomings of algorithms by evaluating different facets of reliability, helping users to identify particular areas of development. The paper suggests a principled assessment process that quantifies the complexity of using an algorithm in order to both make and track progress toward creating accurate and reliable algorithms. It runs several trials of each algorithm to collect the required reward data, from which the variance in performance can be estimated on the basis that independent runs are supposed to perform equally. The paper proposes training the algorithms without their predefined hyperparameters so that the actual learning capability of the algorithm can be tested.

2 Related Work

Reinforcement learning has been extensively applied to game-theory-based examples, largely because reinforcement learning algorithms can learn optimal strategies by trial and error without an initial understanding of the rules of the game. Learning systems have been built to play various games well above human proficiency, such as Atari [4, 5], Go, Chess [6] and several other games [7]. Recent progress has been made by combining developments in deep learning for learning feature representations [8] with reinforcement learning. Notable examples of the current state-of-the-art learning agents are agents trained to play Atari games from the image data of the game without any knowledge of their underlying dynamics [9], and agents that use a simpler form of trust region policy optimization for more general applicability and better empirical sample complexity [10].

Several other studies examine the process of evaluating reinforcement learning algorithms. Establishing the baseline performance of algorithms is essential for further research and development in the field. The authors of [11] benchmark several learning algorithms on continuous control environments and provide baseline implementations. The authors of [12] provide a means of basing evaluations on multiple environments sampled from a distribution, based on their generalized methodologies. The authors of [13] propose improved evaluation methodologies for benchmarks based on the Arcade Learning Environment, which several algorithm papers use to demonstrate the learning ability of their proposed models. The current evaluation processes demonstrated in several studies do not account properly for the uncertainty in the produced results [14], and they neglect the difficulty of applying reinforcement learning algorithms to a given problem statement. Consequently, as shown in [15], existing reinforcement learning algorithms can be difficult to apply to real-world application scenarios. However, while the problem of reproducibility and general good experimental practice has been discussed in related studies (references), there haven’t been many works focusing on the reliability of reinforcement learning algorithms [16] in the context of reproducing their demonstrated rewards in new environments without using their predefined hyperparameters.


3 Methodology

3.1 Data Collection

Owing to the intrinsic nature of reinforcement learning, this study does not have a fixed input dataset that is split into training, validation and testing subsets. Instead, the input data for our models comes from live observations of the environment that depend strictly on the actions actually taken by the agent. The experiments use OpenAI Gym Retro, a tool collection that helps users to turn retro video games into Gym environments for reinforcement learning. It supports different emulators through the Libretro API. Every game integration contains files listing memory locations for in-game variables, reward functions based on those variables, episode end conditions, save states at the start of stages and a file containing hashes of the ROMs that work with these files [17].

An algorithm may not perform well consistently over an environment, nor earn large rewards for a significant period of time. When looking at the consistency of an algorithm, it is important to run it multiple times over the environment and to record the rewards from all the runs. Recording all the rewards provides insight into the variations in performance, which cannot be noticed if only the best performance is considered. Some approaches assess consistency using the predefined parameters for the environments on which the algorithms perform best [18]. But an assessment made after fixing the hyperparameters doesn’t help if the algorithm faces a new and unknown environment. In such conditions, the algorithm itself needs to optimize and work to gain rewards.

3.2 Model Evaluation Approach

Along with the Atari environments, the paper uses the OpenAI classic control and Box2D environments to test the algorithms' performance. The classic control environments are small-scale tasks described in the reinforcement learning literature, where the agent must interact with the environment by generating a real-valued number rather than depending on a predefined discrete action space. The Box2D series of environments provides a number of continuous control tasks that run in the Box2D simulator. These environments include BipedalWalker, BipedalWalkerHardcore and LunarLander.

To test the algorithms even further, the paper devises a new environment based on the well-known game of Go. Go is played by two players who use the black and white stones, respectively [19]. The players place their stones on the board with the main goals of surrounding parts of the board and capturing the opponent’s stones. Unlike in most grid-based board games, the stones are placed on the intersections of the lines on the board rather than in the squares. The paper defines a 6 × 6 Go board, as shown in Fig. 2, meaning that the board has 6 rows and 6 columns.

Fig. 2 Custom 6 × 6 Go Board

Two types of reward function are defined for this environment. The first is the real reward r, which evaluates the performance of an agent based on the final outcome of the game. This is shown in Eq. (1).

\[ r = \begin{cases} -1, & \text{if White won and the game ended} \\ 0, & \text{if the game is tied and the game ended} \\ 1, & \text{if Black won and the game ended} \\ 0, & \text{otherwise} \end{cases} \tag{1} \]

In order to provide a means to evaluate the performance of the agent based on the area occupied when compared to the other agent and the final outcome, the paper defines a Heuristic reward R that takes into account the difference in the overall board area occupied by the agent and whether it won the current episode of the game as mentioned in Eq. (2).

\[ R = \begin{cases} \text{blackarea} - \text{whitearea}, & \text{if the game is ongoing} \\ \text{boardsize} * 2, & \text{if Black won} \\ -\text{boardsize} * 2, & \text{if White won} \\ 0, & \text{if tied} \end{cases} \tag{2} \]
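The two reward definitions translate directly into code. The sketch below is a hedged illustration, not the paper's implementation: it assumes the winner is decided simply by comparing the areas held by Black and White once the game has ended, and it follows Eq. (2) as printed for the terminal bonus (board size times 2). All names are hypothetical.

```python
BOARD_SIZE = 6  # the custom board is 6 x 6


def real_reward(game_ended, black_area, white_area):
    """Real reward r of Eq. (1): the final outcome from Black's perspective."""
    if not game_ended:
        return 0
    if black_area > white_area:
        return 1
    if white_area > black_area:
        return -1
    return 0  # tie


def heuristic_reward(game_ended, black_area, white_area, board_size=BOARD_SIZE):
    """Heuristic reward R of Eq. (2): area difference plus a terminal bonus."""
    if not game_ended:
        return black_area - white_area
    if black_area > white_area:
        return board_size * 2   # terminal value as printed in Eq. (2)
    if white_area > black_area:
        return -board_size * 2
    return 0  # tie


if __name__ == "__main__":
    print(real_reward(True, 20, 16), heuristic_reward(False, 20, 16))
```

The real reward is sparse and only informative at the end of an episode, whereas the heuristic reward gives a signal on every move, which is why both are useful when comparing agents.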

To represent the state of the environment, the paper defines a multidimensional array with dimensions 6 × 6 × 6. The state object is returned by the reset and step functions of the environment. The state representation is illustrated in Fig. 3. The state array holds only binary values. It can be broadly divided into six channels (indexes), each of which represents a different piece of information about the board and its pieces.

Fig. 3 State representation of the environment

Each channel holds relevant information about the state. The first and second channels represent the black and white pieces present on the board, respectively. The third channel indicates which player's turn it is. The fourth channel marks the invalid moves for the next action, including those ruled out by ko protection. The fifth channel indicates whether the previous move was a pass. The sixth and final channel indicates whether the game has reached its end state.

Actions are applied through the step functions that the environment provides. Each takes an input and the current board state and responds with the next board state. The action step functions are as follows:

• Coordinate-based step function. Takes an input coordinate (x, y) of the desired location on the board and the current board state. This is expressed in Eq. (3), where board' denotes the next board state.

\[ f(x, y, \text{board}) = \text{board}' \tag{3} \]

• Position integer step function. Takes a position integer i for the action in a 1D space and then updates the board state based on that input. This is expressed in Eq. (4).

\[ f(i, \text{board}) = \text{board}' \tag{4} \]
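The state layout and the two step functions can be illustrated with a small sketch. Everything here is an assumption made for illustration: the channel order follows the description above, the turn channel is taken to be all zeros for Black and all ones for White, positions are numbered row by row, and the toy step function only places a stone and switches the turn (no captures, ko handling or end-of-game detection).

```python
import copy

BOARD_SIZE = 6
# Channel indices, in the order described in the text (names are hypothetical).
BLACK, WHITE, TURN, INVALID, PASSED, DONE = range(6)


def empty_state(size=BOARD_SIZE):
    """Binary state array with 6 channels of size x size cells."""
    return [[[0] * size for _ in range(size)] for _ in range(6)]


def coordinate_to_position(x, y, size=BOARD_SIZE):
    """Map a board coordinate to a 1D position integer (row-major)."""
    return x * size + y


def position_to_coordinate(i, size=BOARD_SIZE):
    """Inverse mapping from a position integer to (x, y)."""
    return divmod(i, size)


def step_coordinate(state, x, y):
    """Toy version of Eq. (3): place a stone and hand the turn over."""
    if state[BLACK][x][y] or state[WHITE][x][y] or state[INVALID][x][y]:
        raise ValueError("invalid move")
    nxt = copy.deepcopy(state)
    player = WHITE if state[TURN][0][0] else BLACK
    nxt[player][x][y] = 1
    flipped = 1 - state[TURN][0][0]
    nxt[TURN] = [[flipped] * len(row) for row in state[TURN]]
    nxt[PASSED] = [[0] * len(row) for row in state[PASSED]]  # not a pass
    return nxt


def step_position(state, i):
    """Toy version of Eq. (4): same update, addressed by position integer."""
    x, y = position_to_coordinate(i, len(state[0]))
    return step_coordinate(state, x, y)
```

Both step functions return a new state instead of mutating the input, which keeps earlier states available for search or replay.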

To get better insight into the consistency and reliability of these algorithms, the paper trains them over multiple runs. Each algorithm is trained on each environment for a total of 3 runs. This is essential, as the variability in the performance of these algorithms can be calculated on the premise that independent runs are meant to perform similarly. The paper looks for drops in performance over both short and long time periods, using the total reward as the measure. These measures are also used to identify the worst-case drop in performance of these algorithms over the same time periods.
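The paper does not give formulas for these measures, so the sketch below shows one plausible way to compute them and should be read as an assumption: dispersion across runs is taken as the per-step standard deviation averaged over training, and the worst-case drop is the largest decrease in reward between two points at most `window` steps apart. Function names are hypothetical.

```python
from statistics import mean, pstdev


def across_run_dispersion(runs):
    """Average over training steps of the standard deviation across runs.

    `runs` is a list of equally long reward curves, one per run.
    """
    per_step = [pstdev(values) for values in zip(*runs)]
    return mean(per_step)


def worst_case_drop(rewards, window):
    """Most negative change (later minus earlier) within `window` steps."""
    worst = 0.0
    for start in range(len(rewards)):
        last = min(start + window, len(rewards) - 1)
        for end in range(start + 1, last + 1):
            worst = min(worst, rewards[end] - rewards[start])
    return worst


if __name__ == "__main__":
    curves = [[10, 12, 5, 8], [11, 12, 6, 9], [9, 13, 4, 8]]
    print(across_run_dispersion(curves))
    print(worst_case_drop(curves[0], window=1), worst_case_drop(curves[0], window=3))
```

A small window captures short-term instability between consecutive evaluations, while a large window captures long-term collapses in performance.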


3.3 Training

Training reinforcement learning algorithms takes a long time to run and evaluate, and training several algorithm-environment pairs sequentially would take even longer. To optimize the usage of both CPU and GPU resources, this paper uses a system known as the Sync Runner, which divides the work of training the algorithm-environment pairs by running multiple pairs on each processor core. These environments are then trained in parallel using threading and PyTorch.

Algorithm 1: Training and Result collection
Input: Models, Environments, K, NumEnvironments
Output: trainingResults
1  trainingResults ← ∅
2  for model_i in Models do
3      for environment_j in Environments do
4          for run_k in K do
5              model_i ← initializeModel(model_i)
6              environment_j ← initializeEnvironment(environment_j)
7              for process ← 1 to NumEnvironments do
8                  train(model_i, environment_j, run_k, process)
9              model_i ← updateModel(model_i)
10             trainingResults ← trainingResults ∪ {storeResults(model_i, environment_j)}
11 return trainingResults

In Algorithm 1, the paper defines the procedure that SyncRunner uses to train multiple algorithms simultaneously. It takes the algorithms, given as Models, the game environments, given as Environments, and the number of runs K for which to train and evaluate each algorithm-environment pair. Step 1 initializes an empty set, trainingResults, which will hold the results of training from all the algorithm-environment pairs. Steps 2–4 iterate through every algorithm-environment pair, and then through K runs for each pair. In step 5, the model is initialized using the initializeModel procedure, which takes a model model_i as an argument and sets it up for training. Step 6 initializes the environment environment_j by setting up the environment variables and creating the multiple instances of the environment. In this step, the SyncRunner has to allocate space for these environment instances in GPU memory. Steps 7–8 iterate over NumEnvironments, the number of parallel environments created for the given algorithm-environment pair. This involves creating a thread for each parallel environment and then training the model on the NumEnvironments instances simultaneously.

In steps 9–10, the model weights are updated with the final collected weights. The results of training for that run are then collected using the storeResults procedure and appended to the results set trainingResults. This process is repeated for every run of the algorithm-environment pair.
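The control flow of Algorithm 1 can be sketched in Python with a thread pool. This is an illustration under stated assumptions rather than the authors' SyncRunner: models and environments are represented by plain names, `train` is any callable that returns an episode return, model and environment initialization (steps 5–6) is omitted, and `dummy_train` is a placeholder for a real training call.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(models, environments, k_runs, num_environments, train):
    """Train every (model, environment) pair for k_runs runs.

    Each run uses num_environments worker threads, mirroring steps 7-8.
    """
    training_results = []
    for model in models:
        for environment in environments:
            for run in range(k_runs):
                # Steps 7-8: one worker thread per parallel environment copy.
                with ThreadPoolExecutor(max_workers=num_environments) as pool:
                    futures = [
                        pool.submit(train, model, environment, run, process)
                        for process in range(num_environments)
                    ]
                    returns = [future.result() for future in futures]
                # Steps 9-10: aggregate the workers' output and store it.
                training_results.append({
                    "model": model,
                    "environment": environment,
                    "run": run,
                    "mean_return": sum(returns) / len(returns),
                })
    return training_results


def dummy_train(model, environment, run, process):
    # Placeholder for a real training call; returns a fake episode return.
    return float(len(model) + len(environment) + run + process)


if __name__ == "__main__":
    results = run_all(["A2C", "PPO"], ["Pong"], 3, 2, dummy_train)
    print(len(results), results[0])
```

Threads are a reasonable fit when the heavy work happens inside native code that releases the interpreter lock; a process pool would be the alternative for pure-Python environments.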

4 Experiments

4.1 Computation Environment

The experiments were executed on an 8-core Intel 2.30 GHz Xeon processor with 61 GB RAM and an NVIDIA V100 Tensor Core GPU with 16 GB memory. The machine was a p3.2xlarge instance running on AWS. In order to perform controlled experiments comparing different learning methods with the same computational budget, three types of experiments are performed, as reported in the following subsection. These experiments were executed one after another using the Sync Runner described in Sect. 3.3.

4.2 Experiments Setup

This section defines three experiments to evaluate the algorithms:

1. Evaluating A2C on the Atari 2600 environments. This experiment trains the A2C algorithm on the Atari environments. The algorithm is trained with three random seeds on every environment, for 500 K environment steps per run. The average collected returns, the minimum and maximum returns and the consistency metrics mentioned in the previous sections are reported.

2. Comparing three policy-based learning algorithms (PPO, SAC, DDPG) on Classic Control and Box2D environments. The experiment trains three policy-based learning algorithms (PPO, SAC, DDPG) on OpenAI Gym classic control and Box2D environments. The environments used for the experiment are Bipedal Walker, Bipedal Walker Hardcore, Lunar Landing, Mountain Car and Pendulum. On these environments, the learning agents are trained for approximately 6–16 million environment steps.

3. Evaluating MuZero on the custom 6 × 6 Go environment. The paper trains the MuZero algorithm on the 6 × 6 Go environment described in Sect. 3.2 for approximately 20 million steps. The experiment doesn’t provide any information about the rules, domain knowledge or human data; the purpose is to understand the capability of the learning algorithm to master an environment with unknown dynamics.

4.3 Results

To ensure that the algorithms are compared on a fair basis, the paper uses their default hyperparameters and trains the algorithms in sequential order, so that the results are not skewed by hardware usage, as they could be if the algorithms were trained in parallel.

Experiment 1: Evaluating A2C on the Atari 2600 environments. This experiment trains the A2C algorithm on the Atari 2600 suite of games, with 3 runs on each environment. The performance of the algorithm was fairly consistent across multiple environments, but for certain environments there is a large gap in performance in one or more of the runs. The performance of A2C is recorded in Table 4. Environments such as Tutankham, Ms Pacman and Road Runner showed a large gap in the average reward returns for one of the runs compared with the other environments. On the other hand, environments like Zaxxon showed large fluctuations in the average reward return across the training period. This is illustrated in Fig. 4 (Tables 1, 2 and 3).

Experiment 2: Comparing three policy-based learning algorithms (PPO, SAC, DDPG) on Classic Control and Box2D environments. The algorithms are trained on the OpenAI Gym and Box2D continuous control environments. Figure 5 illustrates that the performance of PPO is similar across the 3 runs on all the environments. The performance results for SAC are illustrated in Fig. 6. The figure shows that the performance of SAC is not consistent across the 3 runs, as deduced from the large gaps in performance between runs in environments like Mountain Car and Bipedal Walker Hardcore. The performance results collected for DDPG are illustrated in Fig. 7. DDPG had consistent runs only on the Mountain Car and Pendulum environments. In the other environments, the rewards collected by the algorithm fluctuate heavily throughout the training period and differ between the three runs. The comparison of these algorithms in terms of median and maximum rewards is given in Tables 1 and 2, respectively.

Fig. 4 Average collected returns for the A2C experiment on the timepilot, videopinball, spaceinvaders, yarsrevenge and upndown game environments. Average rewards (y-axis) and training steps in thousands (x-axis). Each plot in a graph represents the average collected returns from a run (1-orange, 2-red, 3-blue)

Table 1 Median reward for Box2D and continuous control environments

Algorithm   Bipedal walker   Bipedal walker hard core
PPO         235              −145
SAC         168              −70
DDPG        −60              −75

Table 2 Median reward for Box2D and continuous control environments

Algorithm   Lunar landing   Mountain car   Pendulum
PPO         205             88             −182
SAC         284             84             −193
DDPG        120             −198           −192

Table 3 Training time for Box2D and continuous control environment algorithms

Algorithm   Training time
PPO         5 h 10 m
SAC         4 h 45 m
DDPG        4 h 40 m

The disparity in results across the different algorithms in this experiment can be primarily attributed to the learning policy implemented by the algorithms. DDPG and SAC implement off-policy learning, which makes them more sample efficient. However, because they combine bootstrapping, off-policy learning and function approximation via deep learning, they are susceptible to the “deadly triad” issue [20], which might cause instabilities in their value functions. PPO, in contrast, implements on-policy learning: it learns its value function from observations gathered by exploring the environment with the current policy, which leads to more stability given a large enough batch size. This doesn’t necessarily mean that PPO takes significantly longer to train, as shown in Table 3.

Experiment 3: MuZero. The experiment trained and evaluated the MuZero algorithm on our custom Go environment. As explained in the previous sections, the algorithm was trained on the environment across 3 runs. As shown in Fig. 8, the reported rewards across the runs are consistent and maintain the same performance. Given the complexity of the game of Go and the fact that MuZero trained by playing against itself, the collected reward across the training time is ideal for the implementation. MuZero implements a Q-learning and model-based approach in which it estimates Q-values but also has a model of the environment over which it plans. This enables it to perform well in a Markovian environment with a discrete action space, while limiting its capabilities in a continuous action environment.

Table 4 Median rewards and maximum rewards of the A2C on the Atari environments

Environment      Median reward   Maximum returns
Alien            265             286
Ms pacman        92              610
Name this        2430            2650
Phoenix          812             823
Pitfall          −198            10
Pong             −20             15
Private eye      45              60
Qbert            267             284
Riverraid        746             813
Road runner      103             504
Robot tank       2.2             2.3
Sea quest        251             279
Skiing           −25,340         −25,540
Solaris          2489            2494
Space invaders   158             178
Star gunner      802             923
Tennis           −22             −10
Time pilot       3430            3564
Tutankham        24              125
Up n down        612             879
Venture          −1              −0.3
Video pinball    2103            2214
Wizard wor       741             824
Yars revenge     4120            5831
Zaxxon           12              51
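Returning to the A2C results, the gap between median reward and maximum return in Table 4 gives a quick numerical view of the run-to-run inconsistency discussed for Experiment 1. The sketch below copies a few rows from Table 4 and flags environments whose maximum exceeds the median by more than the median itself. The relative-gap measure and the threshold of 1.0 are illustrative assumptions, not metrics defined in the paper, and the measure is only meaningful for non-zero medians.

```python
# A few (median reward, maximum return) pairs copied from Table 4.
TABLE_4_SUBSET = {
    "Alien": (265, 286),
    "Ms pacman": (92, 610),
    "Phoenix": (812, 823),
    "Road runner": (103, 504),
    "Tutankham": (24, 125),
    "Zaxxon": (12, 51),
}


def relative_gap(median, maximum):
    """Gap between maximum and median, relative to the size of the median."""
    return (maximum - median) / abs(median)


def flag_inconsistent(rows, threshold=1.0):
    """Names of environments whose relative gap exceeds the threshold."""
    return sorted(
        name for name, (median, maximum) in rows.items()
        if relative_gap(median, maximum) > threshold
    )


if __name__ == "__main__":
    print(flag_inconsistent(TABLE_4_SUBSET))
```

With these rows, the flagged environments include Ms Pacman, Road Runner and Tutankham, which matches the environments singled out in the discussion of Experiment 1.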


Fig. 5 Average collected returns for the PPO experiment. Average rewards(y-axis) and training steps in thousands (x-axis). Each plot in a graph represents the average collected returns from a run(1-orange, 2-red, 3-blue)

Fig. 6 Average collected returns for the SAC experiment. Average rewards(y-axis) and training steps in thousands (x-axis). Each plot in a graph represents the average collected returns from a run(1-orange, 2-red, 3-blue)


Fig. 7 Average collected returns for the DDPG experiment. Average rewards(y-axis) and training steps in thousands (x-axis). Each plot in a graph represents the average collected returns from a run(1-orange, 2-red, 3-blue)

Fig. 8 Average collected returns for the MuZero experiment. Average rewards(y-axis) and training steps in thousands (x-axis). Each plot in a graph represents the average collected returns from a run(1-orange, 2-red, 3-blue)

5 Conclusion

This paper evaluates five deep reinforcement learning algorithms on several environments, ranging from environments with a discrete action space to those with a continuous action space, and then evaluates the performance of the current state-of-the-art MuZero on our custom Go environment. It reports the comparison of these algorithms in terms of their reward returns, training time and reliability metrics. To run and evaluate these algorithms, the paper uses its proposed multiple algorithm-environment training system, which adapts to the available hardware setup. To ensure a fair evaluation, the algorithms are trained using their default hyperparameters. To conclude, the reliability of the algorithms has been measured using the performance metrics, with various OpenAI Gym environments used to evaluate their performance over multiple runs.

6 Future Work

This work primarily concerns the process of training and evaluating the reliability of algorithms in environments other than the Atari suite, but this hasn’t been fully explored, in the sense that most of the environments used were those that OpenAI Gym provides. Real-world uses of reinforcement learning algorithms tend not to resemble a sandbox game environment. Custom environments that have some amount of uncertainty in terms of operation and agent interaction would be more suitable for evaluating these algorithms. Future research should focus more on evaluation using custom environments.

References 1. Google Scholar (2021) Searches for reinforcement learning. https://scholar.google.com/sch olar?q=%22reinforcement+learning%22&hl=en&as_sdt=0%2C5&as_ylo=2020&as_yhi= 2021 2. Whittlestone J, Arulkumaran K, Crosby M (2021) The societal implications of deep reinforcement learning. J Artif Intell Res 70:1003–1030 3. Cobbe K, Klimov O, Hesse C, Kim T, Schulman J (2019) Quantifying generalization in reinforcement learning. In: International conference on machine learning, PMLR, pp 1282–1289 4. Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller MA (2013) Playing Atari with deep reinforcement learning, CoRR abs/1312.5602. arXiv:1312.5602. URL http://arxiv.org/abs/1312.5602 5. Hosu I-A, Rebedea T (2016) Playing Atari games with deep reinforcement learning and human checkpoint replay. arXiv preprint arXiv:1607.05077 6. Heinrich J, Silver D (2016) Deep reinforcement learning from self-play in imperfectinformation games, CoRR abs/1603.01121. arXiv:1603.01121. URL http://arxiv.org/abs/1603. 01121 7. Gamble C, Gao J (2018) Safety-first AI for autonomous data centre cooling and industrial control 8. Krizhevsky I, Sutskever GE (2017) Hinton, ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84–90 9. Schrittwieser J, Antonoglou I, Hubert T, Simonyan K, Sifre L, Schmitt S, Guez A, Lockhart E, Hassabis D, Graepel T et al (2020) Mastering Atari, go, chess and shogi by planning with a learned model. Nature 588(7839):604–609 10. Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. Corr abs/1707.06347. arXiv preprint arXiv:1707.06347 11. Duan Y, Chen X, Houthooft R, Schulman J, Abbeel P (2016) Benchmarking deep reinforcement learning for continuous control. In: International conference on machine learning, PMLR, pp 1329–1338


N. Sandeep Varma et al.

12. Whiteson S, Tanner B, Taylor ME, Stone P (2011) Protecting against evaluation overfitting in empirical reinforcement learning. In: 2011 IEEE symposium on adaptive dynamic programming and reinforcement learning (ADPRL). IEEE, pp 120–127
13. Machado MC, Bellemare MG, Bowling M (2017) A Laplacian framework for option discovery in reinforcement learning. In: International conference on machine learning, PMLR, pp 2295–2304
14. Henderson P, Islam R, Bachman P, Pineau J, Precup D, Meger D (2018) Deep reinforcement learning that matters. In: Proceedings of the AAAI conference on artificial intelligence, vol 32, pp 2–5
15. Dulac-Arnold G, Mankowitz D, Hester T (2019) Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901
16. Papoudakis G, Christianos F, Schäfer L, Albrecht SV (2020) Comparative evaluation of multi-agent deep reinforcement learning algorithms. arXiv preprint arXiv:2006.07869
17. Nichol A, Pfau V, Hesse C, Klimov O, Schulman J (2018) Gotta learn fast: a new benchmark for generalization in RL. arXiv preprint arXiv:1804.03720
18. Haarnoja T, Ha S, Zhou A, Tan J, Tucker G, Levine S (2018) Learning to walk via deep reinforcement learning. arXiv preprint arXiv:1812.11103
19. Kim J, Jeong S-H (1997) Learn to play Go, 2nd edn, vol. five volumes. Good Move Press, New York
20. van Hasselt H, Doron Y, Strub F, Hessel M, Sonnerat N, Modayil J (2018) Deep reinforcement learning and the deadly triad, CoRR abs/1812.02648. arXiv:1812.02648

An Approach to Estimate Body Mass Index Using Facial Features
Dipti Pawade, Jill Shah, Esha Gupta, Jaykumar Panchal, Ritik Shah, and Avani Sakhapara

Abstract People need a healthy daily routine to stay away from health-related problems, and they should keep their body measures in check. A simple measure like body mass index (BMI) relates weight to height. Generally, health risk increases with the BMI value, so it is advisable to keep a check on it. However, owing to busy routines, people are usually reluctant to measure their weight and height regularly and track the BMI value. This motivated us to develop a system that estimates the BMI from just a face image of an individual. Facial features are extracted from the face image using three methods and are then used to estimate the person's BMI with machine learning algorithms. Seven regression algorithms and nine classification algorithms were tested. The multiple linear regression technique outperformed the other methods. The main objective of this study is to estimate the BMI from a face image accurately and in a hassle-free manner.
Keywords Facial feature extraction · BMI estimation · Classification · CNN · FaceNet model · Regression
D. Pawade · J. Shah (corresponding author) · E. Gupta · J. Panchal · R. Shah · A. Sakhapara
Department of Information Technology, K. J. Somaiya College of Engineering, Vidyavihar, Mumbai, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_34


1 Introduction
The BMI is a straightforward, low-cost screening tool for adults and children to identify potential weight concerns [1]. The BMI is calculated as an individual's weight (in kilograms) divided by the square of their height (in meters) [2]. It is therefore used as a proxy for body fat and as an indicator of the risk of diseases associated with excess body fat. Apart from the traditional method for calculating BMI, there are various online Web applications, such as BMI calculator [3], the National Heart, Lung, and Blood Institute calculator [4], and online calculator [5], which compute the BMI automatically from the weight and height entered by the user. Even though this appears easy, some people are unaware of their height or weight, and some have disabilities that prevent them from using these applications. As a result, calculating the BMI can be a tedious process. To address this difficulty, a system is developed that estimates a person's BMI directly from a facial image. This is accomplished by extracting facial features from the face using deep learning algorithms. Machine learning techniques such as regression and classification are then used to predict the BMI. The major objectives of this research are as follows:
OBJ 1: Preprocessing of the facial data to obtain more accurate results.
OBJ 2: Exploration of the 68 facial landmarks, such as eyes, eyebrows, and nose.
OBJ 3: Estimation of BMI using the face image of an individual and classification of the BMI into categories.
OBJ 4: Testing and evaluation of the BMI prediction model and results.
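The BMI definition above can be sketched in a few lines of Python. The category thresholds are the commonly used adult cut-offs (18.5, 25, and 30 kg/m²); they are an assumption here, since the paper does not list the thresholds it uses, and the function names are illustrative only.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to a category.

    The cut-offs are the common adult thresholds and are an assumption;
    the paper does not state which thresholds it uses.
    """
    if value < 18.5:
        return "underweight"
    if value < 25.0:
        return "normal"
    if value < 30.0:
        return "overweight"
    return "obese"


example = bmi(70.0, 1.75)
print(round(example, 2), bmi_category(example))
```

For a person weighing 70 kg with a height of 1.75 m, this gives a BMI of about 22.86, which falls in the normal range under these assumed cut-offs.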

2 Literature Survey
Human faces carry a variety of cues, including age, identity, personality traits, expression, and gender, that can be studied and used for face recognition in security, facial expression detection, and face tracking. Building on these applications, researchers have investigated how facial features relate to body weight. In one such study [6], feature extraction and image processing were applied to face images to categorize people into BMI groups (normal, overweight, and obese). The face database yielded better accuracy after images of individuals with caps, glasses, or facial hair, side-view or tilted faces, and faces blocked by other individuals were manually filtered out. The processing steps are face detection, feature extraction, construction of the facial matrix, and classification. Face detection includes iris detection, in which a bounding box is used to examine each facial component, which is then cropped out for further processing. The user must manually draw a circle around the iris in the cropped eye image. Because the hand-drawn circle may not be perfectly round, the system then fits a full circle around it, from which the coordinates of the left and right sides of the iris can be calculated. Face detection also includes image enhancement, in which the cropped facial image is resized to improve image quality. Feature extraction consists of facial landmark extraction, which marks the points of each facial feature. The identified locations represent specific facial components such as the jawline, eye contour, nose, lip corners, and hairline. The active shape model (ASM) is the technique used to extract the facial landmarks. The classification algorithms used are artificial neural network (ANN), K-nearest neighbors (KNN), and support vector machine (SVM). ANN achieved the highest recognition rate, and KNN was the least favorable classifier. The major goals of the paper [6] are to investigate BMI prediction using a geometric approach, extract facial features for BMI category classification, and evaluate the performance of BMI prediction using a machine learning approach. The good results show that it is possible to estimate an individual's BMI without knowing their weight or height. Finally, the BMI prediction system can support doctors, personal trainers, and healthcare professionals in obtaining the BMI quickly and efficiently [6].
A small number of studies [7] suggest that BMI can be inferred from facial images using machine learning and deep learning techniques. However, the current literature lacks an understanding of the effectiveness of different deep learning models, such as VGG and ResNet, in BMI prediction from facial images; the efficacy of different deep CNN architectures for this task has not been evaluated. Different CNN architectures are bound to produce different results because of differences in feature representation arising from their distinct structures. A comparison of various CNNs is therefore essential for advancing knowledge in this area.
For completeness and comparability, the authors of [7] also developed a custom CNN that was trained from scratch for performance evaluation. The proposed CNN takes a 224 × 224 image as input and has three convolution layers, followed by batch normalization and max pooling. Two fully connected layers with 200 channels each are then added, followed by a single channel for the regression output. Training used the Adam optimizer with a learning rate of 0.001 for 150 epochs, and the loss function was MAE [7].
BMI is used in many health applications and other areas. Estimating the BMI from facial features is a regression problem. The authors of [8] used support vector machines, random forests, and gradient boosting and compared the prediction results. The dataset was collected from online sources and is small, so the authors augmented it with synthetic data and applied transfer learning. Deep learning may not be suitable for such a small dataset, and support vector machines were used because they work better on this data. The models are trained using the differential regression technique. The first stage of training is feature extraction, and the second is model training. Before training, the dataset is cleaned by removing unwanted features. For the support vector method, support vector regressors (SVR) are used with an epsilon of 0.5 and radial basis function, polynomial, and linear kernels, of which the polynomial kernel gave the best model. Random forest is an ensemble learning method based on bagging; its parameters were the number and maximum depth of the trees. Finally, the gradient boosted regressor showed the best results, and the support vector regressor also gave good results [8].
Facial representations, both geometric and deep learning based, can be evaluated and analyzed with respect to BMI prediction performance, sensitivity to head pose, and redundancy of facial features [9]. The authors found that a deep learning-based model works better than the geometric one, that removing redundancy improves performance, and that large head poses reduce accuracy. A face carries much biometric information, such as gender, age, height, and weight. Estimating BMI from facial features is called visual BMI. BMI is important for monitoring health conditions and is also used by researchers who study obesity in large populations. The authors describe four steps for estimating BMI from 2D facial images: face detection, image alignment, extraction of the facial representation, and regression. The face detection and alignment steps are important because they determine the performance of BMI estimation. The facial representation is extracted in two ways: one using geometric features (PIGF) and the other using deep representations. They used a face-in-the-wild database that contains faces of individuals labeled with gender, weight, and height. The perception of weight, i.e., adiposity, in the face is related to health. One study examined the relation between BMI and facial measures such as the face width to height ratio, the face perimeter to area ratio, and the cheek to jaw width ratio. Various other facial measures, such as eye size, lower face to face height ratio, face width to lower face height ratio, and mean eyebrow height, as well as skin color and hue, texture, and shape in females, can also be used, as can a ResNet architecture with 50 layers.
Geometry-based learning: Facial shape is related to body fat. The authors used seven geometric features: mean eyebrow height, face width to lower face height ratio, lower face to face height ratio, face perimeter to area ratio, face width to height ratio, eye size, and cheek to jaw width ratio. After extracting these features, they used statistical methods to calculate the BMI. The accuracy of this approach depends on facial feature detection. The computation is fast, and the feature dimension is very low.
Deep learning based: The authors used LightCNN for BMI estimation. The features are trained using the softmax loss function [9].
The authors of [10] demonstrate how computer vision can be used to infer a person's BMI from social media photographs. They note that earlier work explored the impact of a variety of facial features on people's health judgments and found that features such as skin yellowness, mouth curvature, and shape were positively correlated with perceived health, whereas facial shape associated with obesity was negatively correlated with it. An earlier approach detected a number of fiducial points in each face image and generated hand-crafted geometric facial features for training a regression model for BMI estimation. However, because that dataset only included passport-style frontal face images with a clear background, the effectiveness of the resulting BMI prediction algorithm on noisy social media photos is uncertain. The authors of [10] therefore used the set of annotated images from the VisualBMI project to ensure that their algorithm works with noisy, frequently low-quality social media pictures such as profile pictures.


The proposed BMI prediction method has two stages: (i) deep feature extraction and (ii) regression model training. Two well-known deep models were employed for feature extraction: one trained on general object classification (i.e., VGG-Net) and the other trained on a face recognition task. Epsilon support vector regression models were used for BMI regression because of their robust generalization characteristics. When presented with a pair of profile pictures, the tool performs as well as humans in determining the more overweight person [10].
To test Wen and Guo's method, a facial landmark detector was used to identify several facial fiducial points, which were then used to compute seven facial features. The facial features include WHR (width to upper face height ratio), PAR (perimeter to area ratio), ES (eye size), MEH (mean of eyebrow height), CJWR (cheek to jaw width ratio), and LF/FH (lower face to face height ratio). Feature normalization was performed, and a regression function was fitted to model the relationship between BMI values and the facial measures. This function was then used to compute the BMI for every test face image. The computed facial BMI (fBMI) was compared with the actual measured BMI (mBMI) through correlation analysis, matched-pair comparisons, and varying the sensitivity of the algorithm to assess the reliability of fBMI relative to mBMI. The techniques for analyzing the photographs are explained in a previous paper and are described only in general terms [11]. The authors of [12] use a residual network in their work, the 50-layer ResNet architecture.
Deep neural networks exposed challenges such as exploding/vanishing gradients and degradation. Residual connections address these challenges and have been shown to outperform other techniques, such as initialization schemes, better optimizers, skip connections, knowledge transfer, and layer-wise training, in enabling the training of deeper neural networks. Incorporating residual connections has become essential for training very deep convolutional networks. ResNets have significantly improved the accuracy of object classification, object detection, and segmentation, while also speeding up training. The architectural characteristic of residual networks is that they rely on successively stacked residual blocks. Basic blocks with two consecutive 3 × 3 convolutions were used, with normalization and ReLU preceding each convolution; the ordering of activation, convolution, and normalization within the residual blocks follows the original convolution-ReLU design. The paper introduced a novel approach for estimating height, weight, and BMI from single facial images, based on ResNet-50. No noticeable gender bias was observed in estimating height, weight, and BMI, although more work is needed in this regard. The work was motivated by the current demand for self-diagnostic tools for remote healthcare, as well as for soft biometrics categorization in security applications [12].
Since ancient times, people have been curious about how to read human faces. Recent quantitative and systematic investigations of facial cues have led researchers to hypothesize that specific structural differences in the face may serve as accurate indicators of psychological and physical patterns [13]. The authors of [14] used a computational approach with a database of 14,500 images and also applied statistical analysis in their experiments.


3 Methodology
Because the system aims to predict the BMI from facial features, the user first uploads a picture in which the face is clearly visible. The facial features are then extracted [15] in the backend, and the BMI is estimated using regression. The estimated BMI value is shown to the user as the output. The overall system is divided into the following main modules:
1. Facial feature extraction
2. BMI value prediction
I. Facial Feature Extraction
BMI is estimated from the facial features obtained by a feature extraction method. The following approaches are explored for feature extraction:
a. Using a convolutional neural network (CNN)
b. Using OpenCV
c. Using the FaceNet model
These techniques were applied to two datasets: the VIP dataset [12] and the FIWBMI dataset [9]. The VIP dataset contains 1026 facial images, 513 female and 513 male. It consists of a csv file with the image name, height, weight, and BMI value, and a folder with all the images. The FIWBMI dataset consists of 8370 face images, with 1–4 images per individual. The annotation of each image consists of five parts: the first identifies the individual; the second distinguishes different images of the same individual; the third gives the body weight in lb; the fourth gives the height in inches; and the fifth gives the gender (true is female; false is male). To preprocess this dataset, we created a dataframe, extracted these fields from the image name, stored them in columns, and saved the dataframe as a csv file. For the CNN approach, the image pixel values of both datasets were rescaled from the range 0–255 to the range 0–1, which is better suited to neural network models. Rescaling ensures that all images are treated in the same manner.
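The FIWBMI preprocessing described above can be sketched as follows. The file-name layout used here (five underscore-separated fields) is hypothetical, since the paper does not give the exact naming scheme, and `FaceRecord`, `parse_annotation`, and `rescale_pixels` are illustrative names. The unit conversion factors are the standard definitions of the pound and the inch.

```python
from dataclasses import dataclass

LB_TO_KG = 0.45359237  # kilograms per pound
IN_TO_M = 0.0254       # meters per inch


@dataclass
class FaceRecord:
    person_id: str
    image_index: int
    weight_lb: float
    height_in: float
    is_female: bool

    @property
    def bmi(self) -> float:
        """BMI in kg/m^2, computed from the weight in lb and height in inches."""
        weight_kg = self.weight_lb * LB_TO_KG
        height_m = self.height_in * IN_TO_M
        return weight_kg / height_m ** 2


def parse_annotation(file_name: str) -> FaceRecord:
    """Parse a HYPOTHETICAL name of the form
    '<person>_<image>_<weight lb>_<height in>_<true|false>.jpg'."""
    stem = file_name.rsplit(".", 1)[0]
    person, image, weight, height, female = stem.split("_")
    return FaceRecord(person, int(image), float(weight), float(height),
                      female.lower() == "true")


def rescale_pixels(pixels):
    """Map 8-bit intensities (0-255) to floats in the range 0-1."""
    return [value / 255.0 for value in pixels]


record = parse_annotation("f0001_2_154_65_true.jpg")
print(record.person_id, record.is_female, round(record.bmi, 2))
print(rescale_pixels([0, 51, 255]))
```

Storing the parsed fields per image, as in the dataframe described above, keeps the BMI label next to each image name so that the same csv layout can be used for both datasets.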
Each dataset is split 80:20 into training and testing sets, respectively.
a. Feature Extraction Using CNN
A CNN is a deep learning algorithm that takes an image as input, assigns trainable weights and biases to various aspects of the image, and learns to differentiate between them. The layers available in a CNN are the convolution layer, the ReLU correction layer, the max-pooling or average-pooling layer, and the fully connected layer. The convolutional layer, which is always at least the first layer of a convolutional neural network, is its most important component. Its purpose is to identify a set of features in the input images. The pooling layer is frequently placed between two convolutional layers: it accepts a number of feature maps and applies the pooling operation to each one. Pooling reduces the size of the feature maps while preserving their essential properties. Max-pooling layers, which return the maximum value from the part of the image covered by the kernel, have been used. The ReLU correction layer replaces all negative input values with zeros and acts as an activation function. The last layer, the fully connected layer, takes an input vector and produces a new vector by applying a linear combination and, optionally, an activation function to the input values.
For feature extraction, we started with a model consisting of four 2D convolution layers, three max-pooling layers, two fully connected layers, and a flatten layer, which gives 16 features. The loss obtained was 4.34, so the architecture was changed to reduce it. The revised model has five 2D convolution layers, four max-pooling layers, two fully connected layers, and a flatten layer. This model gave 49 features, and the loss was reduced to 4.32. As there was still scope for reducing the loss, the model was revised again to four 2D convolution layers, three max-pooling layers, and two dense layers, giving 144 features. This model was also tested with different image sizes. Here, the loss was reduced to 4.17, the lowest of the three models. Figure 1 depicts the final model, which gives the best result.
Fig. 1 CNN model for facial feature extraction
b. Feature Extraction Using OpenCV
OpenCV detects the facial landmarks in the image given by the user; if landmarks such as the jaw points, left and right brow points, nose points, left and right eye points, mouth points, and lip points are obtained, further processing is carried out. The Euclidean distance between related landmarks is then used to calculate different widths and ratios.
These widths and ratios are the face width, face height, and eye widths, the ratios between two different widths and between widths and heights, and the ratios of eye widths. The steps are as follows: the points to be detected are stored in a predictor file, and a Haar cascade file, a pretrained face detector from the OpenCV library, is used for face detection. First, a rectangle containing the face is chosen, and the landmark points inside it are then detected using the predictor file. Finally, the landmark coordinates are stored and used for further calculation. In total, we get 68 landmarks.
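The distance and ratio computation described above can be sketched as follows, assuming the 68 landmarks are already available as (x, y) pairs. The landmark indices follow the widely used 68-point annotation scheme, which is an assumption here because the paper only states that 68 landmarks are obtained; the particular widths and ratios are illustrative and not the authors' exact feature set.

```python
import math

# Indices in the common 68-point landmark scheme (assumed, not confirmed by the paper).
JAW_LEFT, CHIN, JAW_RIGHT = 0, 8, 16
NOSE_BRIDGE_TOP = 27
RIGHT_EYE_OUTER, RIGHT_EYE_INNER = 36, 39
LEFT_EYE_INNER, LEFT_EYE_OUTER = 42, 45


def geometric_features(landmarks):
    """Compute a few Euclidean widths and ratios from 68 (x, y) landmark points."""
    if len(landmarks) != 68:
        raise ValueError("expected 68 landmark points")
    face_width = math.dist(landmarks[JAW_LEFT], landmarks[JAW_RIGHT])
    # Hypothetical height proxy: top of the nose bridge to the chin.
    face_height = math.dist(landmarks[NOSE_BRIDGE_TOP], landmarks[CHIN])
    right_eye_width = math.dist(landmarks[RIGHT_EYE_OUTER], landmarks[RIGHT_EYE_INNER])
    left_eye_width = math.dist(landmarks[LEFT_EYE_INNER], landmarks[LEFT_EYE_OUTER])
    return {
        "face_width": face_width,
        "face_height": face_height,
        "width_to_height": face_width / face_height,
        "eye_width_ratio": right_eye_width / left_eye_width,
        "eye_to_face_width": right_eye_width / face_width,
    }


# Synthetic landmarks for illustration only; real values come from the detector.
landmarks = [(0.0, 0.0)] * 68
landmarks[JAW_LEFT], landmarks[JAW_RIGHT] = (0.0, 50.0), (100.0, 50.0)
landmarks[NOSE_BRIDGE_TOP], landmarks[CHIN] = (50.0, 30.0), (50.0, 150.0)
landmarks[RIGHT_EYE_OUTER], landmarks[RIGHT_EYE_INNER] = (20.0, 40.0), (40.0, 40.0)
landmarks[LEFT_EYE_INNER], landmarks[LEFT_EYE_OUTER] = (60.0, 40.0), (80.0, 40.0)
demo = geometric_features(landmarks)
print(demo)
```

Ratios are useful here because they do not change with image resolution or the distance of the face from the camera, whereas raw pixel widths do.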


c. Feature Extraction Using FaceNet Model
The FaceNet model [16] helps extract a facial feature vector. It is used through a face recognition module with two functions: load-image-file, which loads the image file and returns a NumPy array, and face-encoding, which returns a 128-dimensional feature vector for a given image. The next step converts this 128-feature vector into separate columns for easier processing during training.
II. BMI Value Prediction
Different regression and classification methods are used for BMI value prediction.
a. Regression: The regression algorithms used for BMI prediction are support vector regression (SVR) with various kernels, lasso regression, random forest, multiple linear regression, and ridge regression. SVR allows the user to choose the amount of model error that is acceptable and finds an appropriate boundary (or hyperplane in higher dimensions) to fit the data. Random forest regression is a supervised learning method that uses ensemble learning. Lasso regression is a regularization approach that uses shrinkage and can give more accurate predictions than unregularized regression. Ridge regression is a regularization approach suited to multicollinear data. The feature matrix computed using the CNN method was split into separate columns, one per feature, after which the regression algorithms were applied with the features as independent variables and the BMI value from the csv file as the dependent variable. In the OpenCV method, the regression algorithms are used to find the relation between the BMI value and the calculated facial widths and ratios. In the FaceNet method, the algorithms are applied to the 128 extracted features and the BMI value.
b. Classification: The classification algorithms used are support vector machines (SVM) with different kernels, Naive Bayes, logistic regression, stochastic gradient descent, K-nearest neighbors (KNN), random forest, and decision tree. The objective of SVM is to find the optimal decision boundary that separates the n-dimensional space into classes so that subsequent data points can be quickly assigned to the appropriate category. This optimal decision boundary is called a hyperplane. SVM selects the extreme points (support vectors) that help build the hyperplane. A logistic regression model predicts a dependent variable by examining its relationship with one or more independent variables; it is used when the dependent variable (target) is categorical. The number of parameters required by a Naive Bayes classifier is linear in the number of features (predictors). In contrast to many other types of classifiers, maximum-likelihood training can be carried out by evaluating a closed-form expression in linear time rather than by time-consuming iterative approximation. Stochastic gradient descent (SGD) is a fast and simple method for fitting linear classifiers and regressors under convex loss functions, such as (linear) SVM and logistic regression. KNN works by computing the distances between a query and each example in the data, choosing the K examples closest to the query, and then selecting the most frequent label among them. Decision trees create tree-like classification models by breaking a dataset down into progressively smaller subsets while the tree is built. Random forest is a classification method that uses bagging and feature randomness to build an uncorrelated forest of trees whose committee prediction is more accurate than that of any individual tree. As with the regression algorithms, we applied the different classification algorithms to estimate the BMI category.
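As a concrete illustration of one of the classifiers described above, the following is a minimal sketch of the KNN decision rule: compute the distance from the query to every training example, keep the K nearest, and return the most frequent label. The feature vectors and labels are toy values for illustration, and `knn_predict` is a hypothetical helper rather than the implementation used in the paper.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Return the most frequent label among the k training examples closest to the query."""
    if not 1 <= k <= len(train_features):
        raise ValueError("k must be between 1 and the number of training examples")
    # Distance from the query to every training example.
    distances = [
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    ]
    # Keep the k nearest neighbors and vote on their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy two-dimensional feature vectors (illustrative values, not from the paper).
features = [(0.80, 0.20), (0.82, 0.22), (0.95, 0.30), (0.97, 0.33), (0.99, 0.31)]
labels = ["normal", "normal", "overweight", "overweight", "overweight"]
print(knn_predict(features, labels, (0.96, 0.31), k=3))
```

In the setting of this paper, each feature vector would be the extracted facial features of one image and each label a BMI category.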

4 Result and Discussion
For testing, both the VIP and FIWBMI datasets are used to estimate the BMI values. For regression, mean squared error (MSE) and root mean squared error (RMSE) are used as metrics; for classification, accuracy, precision, and recall are used. Table 1 gives the MSE and RMSE values for the methods discussed in the methodology section. The lowest MSE and RMSE in each dataset for each method is highlighted. From Table 1, for the VIP dataset, the CNN model with random forest regression has the lowest RMSE, 4.17. Similarly, for the FIWBMI dataset, the CNN model with SVM regression with the 'rbf' kernel has an RMSE of 7.68, the lowest among the algorithms. For the OpenCV method, SVM regression with the 'linear' kernel has the minimum error on the VIP dataset, with an RMSE of 3.91, and lasso regression has the minimum error on the FIWBMI dataset, with an RMSE of 8.13. The pretrained FaceNet model gives the lowest error on both datasets with multiple linear regression: the RMSE is 3.25 for the VIP dataset and 6.24 for the FIWBMI dataset. To conclude, according to the RMSE values, the most accurate BMI estimation is obtained using the pretrained FaceNet model with multiple linear regression, as it gives the minimum RMSE on both datasets compared with the other methods. Table 2 shows the accuracy, precision, and recall of the BMI category classification on each dataset. The regression method gives less error than the classification method. The maximum classification accuracy is about 55%, and the precision values are quite low, which shows that classification performs poorly on the BMI categories. Hence, the regression method gives more accurate values and is used for implementation.
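The two regression metrics above are related by RMSE = sqrt(MSE), so the RMSE is expressed in BMI units (kg/m²). The sketch below computes both on illustrative values that are not taken from either dataset, and then checks one reported pair from Table 1 for consistency.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


# Illustrative BMI values (not taken from either dataset).
measured = [22.0, 27.5, 31.0, 19.5]
predicted = [24.0, 25.5, 28.0, 21.5]
print(mse(measured, predicted), round(rmse(measured, predicted), 3))

# Consistency check on a reported pair: FaceNet with multiple linear
# regression on the VIP dataset has MSE 10.55 and RMSE 3.25.
print(round(math.sqrt(10.55), 2))
```

Read this way, an RMSE of 3.25 on the VIP dataset means that the predicted BMI typically deviates from the measured BMI by roughly three BMI units.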
Finally, the pretrained FaceNet model with multiple linear regression is employed for implementation, as it yields the lowest error. To summarize, the first objective, preprocessing, is achieved by removing the outliers, creating a csv file for the FIWBMI dataset [9], and rescaling the images. The second objective, exploration of feature extraction, is achieved by using three techniques and implementing the best one. The third objective, estimation of BMI, is achieved using two methods (regression and classification) and various algorithms. The last objective, testing and evaluation, is achieved using the MSE metric.

Table 1 Performance metrics for regression
                 CNN model                       OpenCV                          FaceNet
                 VIP dataset     FIWBMI dataset  VIP dataset     FIWBMI dataset  VIP dataset     FIWBMI dataset
Algorithm        MSE     RMSE    MSE     RMSE    MSE     RMSE    MSE     RMSE    MSE     RMSE    MSE     RMSE
SVM-linear       21.21   4.61    60.64   7.79    15.30*  3.91*   68.50   8.28    12.14   3.48    41.72   6.46
SVM-rbf          19.47   4.41    58.98*  7.68*   17.19   4.15    68.81   8.29    12.24   3.50    41.62   6.45
SVM-Poly         22.50   4.74    62.14   7.88    16.79   4.10    68.89   8.30    12.14   3.48    41.38   6.43
Multiple linear  20.05   4.48    59.58   7.72    17.19   4.15    68.81   8.29    10.55*  3.25*   38.93*  6.24*
Ridge            20.04   4.48    59.58   7.72    17.19   4.15    68.81   8.29    12.24   3.50    41.62   6.45
Lasso            19.89   4.46    62.63   7.91    15.90   3.99    66.03*  8.13*   16.87   4.11    66.03   8.13
Random forest    17.40*  4.17*   64.53   8.03    17.77   4.21    71.31   8.44    14.28   3.78    45.13   6.72
The lowest MSE and RMSE in each dataset for each method is highlighted with an asterisk (*)

Table 2 Performance metrics for classification
                            VIP dataset                      FIWBMI dataset
Algorithm                   Accuracy   Precision  Recall     Accuracy   Precision  Recall
SVM linear                  54.85      13.71      25         34.77      6.95       20
SVM RBF                     54.85      13.71      25         34.77      6.95       20
SVM poly                    52.91      27.08      21.06      34.77      6.95       20
Logistic regression         54.85      13.71      25         34.77      6.95       20
Naïve Bayes                 44.17      23.97      24.98      34.77      6.95       20
Stochastic gradient descent 13.11      18.45      19.77      33.42      14.21      20.46
K-nearest neighbors         53.40      17.64      19.72      26.86      21.09      21.68
Decision tree               44.17      24.32      19.8       25.34      18.88      19
Random forest               54.85      13.71      25         34.68      6.95       20
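Several rows of Table 2 share one pattern: on the VIP dataset, an accuracy of 54.85 with a precision of 13.71 and a recall of 25. This is what macro-averaged metrics look like when a classifier assigns every sample to the most frequent of four classes (54.85/4 ≈ 13.71 and 1/4 = 25%); the FIWBMI rows with 34.77, 6.95, and 20 fit the same reading with five classes. The sketch below illustrates this interpretation. Macro averaging, a precision of zero for classes that are never predicted, and the class counts are all assumptions, since the paper does not state them.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision and recall, in percent.

    A class that is never predicted contributes a precision of 0.
    """
    classes = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls = [], []
    for c in classes:
        true_pos = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        actual = sum(t == c for t in y_true)
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual if actual else 0.0)
    n = len(classes)
    return (100 * accuracy, 100 * sum(precisions) / n, 100 * sum(recalls) / n)


# Hypothetical test set of 206 samples in four classes, 113 of them "normal".
y_true = (["normal"] * 113 + ["overweight"] * 50
          + ["obese"] * 30 + ["underweight"] * 13)
y_pred = ["normal"] * len(y_true)  # always predict the majority class
result = [round(m, 2) for m in macro_metrics(y_true, y_pred)]
print(result)
```

With these assumed counts the sketch yields 54.85, 13.71, and 25.0. If this reading is right, those classifiers have not learned to separate the BMI categories, which supports the choice of regression for the implementation.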

5 Conclusion
In this paper, a system was implemented that estimates the BMI of an individual from their facial image. The system extracts facial features from the face image, which are then used to estimate BMI. The facial features were extracted using a CNN, a pretrained FaceNet model, and OpenCV. The best results were obtained with the pretrained FaceNet model. Both regression and classification were used to estimate BMI. Support vector regression (SVR) with various kernels, lasso regression, random forest, multiple linear regression, and ridge regression were employed for regression. Naive Bayes, logistic regression, stochastic gradient descent, K-nearest neighbors, random forest, and decision tree were among the classification methods employed. Accuracy is the metric used for classification: the best accuracy is 54.85% for the VIP dataset and 34.77% for the FIWBMI dataset. The metric chosen for regression is the root mean square error (RMSE). The lowest RMSE is the best, and for both datasets it was obtained using multiple linear regression: 3.25 for the VIP dataset and 6.24 for the FIWBMI dataset. In future, training the model with more data could improve the results. A feature to capture live face photos could also be added, which would improve the user experience and help obtain the best results from a recent photo of the user.


D. Pawade et al.


An Approach to Count Palm Tress Using UAV Images Gireeshma Bomminayuni, Sudheer Kolli, Shanmukha Sainadh Gadde, P. Ramesh Kumar, and K. L. Sailaja

Abstract The demand for edible vegetable oils has risen dramatically, and a large number of palm tree plantations are springing up to meet it. Because palm tree cultivation is viable only on a large scale, farmers cultivate palm trees over large areas, which makes counting the trees manually a difficult task. Predicting the count before harvesting aids in harvest planning, storage requirements, and delivery estimation. The proposed method estimates the crop count using image processing and gives faster results than prior models. The input images are thresholded using Otsu's method for plant segmentation, and the model extracts the plant's pattern to get reliable findings.

Keywords Harvesting · Threshold · Otsu's method · Segmentation

1 Introduction

The palm tree is a high-yielding crop that is generally grown over large areas and yields several major by-products. Manually identifying and counting plants in such large fields through visual examination is therefore time-consuming, labor-intensive, and expensive. Improper crop management may result in an overestimation or underestimation of storage capacity for the harvested crop. Furthermore, poor crop size estimation can lead to the overuse of fertilizers, which can harm the crop and waste water resources. Many researchers have applied various techniques and algorithms to match the actual count, namely deep learning techniques with various architectures, image processing, computer vision, and machine learning approaches, using either remote sensing data or UAV images. When these datasets are closely examined, ground truth is less obvious in satellite photos. Furthermore, obtaining high-quality photos from remote sensing data is costly. To address the problem, UAV photographs are used, which provide relatively high-resolution images with more ground detail while remaining within budget. This work intends to develop a model that counts the trees in a given palm tree crop image. To this end, two top-view unmanned aerial vehicle (UAV) photos are used as input: an image of the palm trees in the area of interest and an image of a single palm tree, both in RGB (red, green, blue) format. Both photos are subjected to a series of operations built around Otsu's thresholding method, named after Nobuyuki Otsu. The plant count that the model produces indicates crop emergence and can be used to evaluate different plant breeding techniques. The results can also support yield prediction and crop monitoring.

G. Bomminayuni (B) · S. Kolli · S. S. Gadde · P. Ramesh Kumar · K. L. Sailaja
Computer Science and Engineering, VR Siddhartha Engineering College, Vijayawada, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_35

2 Literature Survey

Crop counts have been forecast using deep learning architectures such as AlexNet, CNN, and Inception v2, v3, and v4, as well as approaches based on computer vision and machine learning. The models use either a drone image dataset or a remote sensing imagery dataset. Ribera et al. [1] used a convolutional neural network as a regression technique to count the plants in an orthorectified photograph of a complete sorghum field acquired by UAVs. Among the architectures tested, Inception-v3 gave a MAPE of 6.7%. García-Martínez et al. [2] employed image segmentation and filtering to convert RGB images into binary images, and then used normalized cross-correlation (NCC) to find the crop count. This process works best when weeds do not overlap with the plants. Karami et al. [3] developed a model based on anchor-free detectors, using RGB photos of corn plants captured with UAVs. The findings were obtained with a modified CenterNet architecture. Djerriri et al. [4] introduced a regression-based methodology for counting palm trees from high spatial resolution remote sensing data. Based on the results achieved on a variety of datasets, the method appears to be a viable alternative to tiresome, expensive, and time-consuming manual counting.


Aparna et al. [5] devised a system for automatically counting the coconut trees present in UAV images. Precisely locating and counting coconut plants saves time and labor. Although the developed classification system is not robust to a wide range of image types, it responds well to extra training data and adapts quickly. For semantic segmentation of field photos, Mukhtar et al. [6] used a cross-consistency-based semi-supervised approach together with an Inception-based regression network for plant counting: small plant groupings are recovered from the RGB picture using loose semantic segmentation and passed to the regression network to calculate the count. Gnädinger and Schmidhalter [7] presented a technique to count the plants in a maize field. A decorrelation stretch (decorrstretch) filter converts the UAV RGB image into a contrast-enhanced image. The resulting image is then converted to HSV and subsequently to the L*a*b* color space to get the field count. Lavania and Matey [8] proposed a method that uses double thresholding and image segmentation to distinguish a crop from a weed. The algorithm distinguished the two with 97.1% accuracy. After retrieving a set of key points, Bazi et al. [9] trained an extreme learning machine (ELM) classifier on the points, which then classifies them. A follow-up step extracts the outline of each tree by combining the classified points with an active contour approach, which makes it possible to distinguish the textures of the regions generated by the trees. To detect the number of plants that have emerged in an image captured by an unmanned aerial vehicle, Valente et al. [10] proposed a method that combines machine vision and machine learning.

Table 1 Analysis of models

References | Crop | Technique | Input format | Accuracy
[1] | Sorghum plants | CNN | Orthorectified UAV image | 6.7% MAPE
[2] | Corn crop | Cross-correlation of templates | UAV RGB | R² = 0.98 to 0.16
[3] | Corn crop | FSL using CenterNet architecture | UAV RGB | Precision ≥ 95%
[4] | Date palm trees | Unique regression-based methodology | High spatial resolution remote sensing data | R² = 0.99
[5] | Coconut trees | CNN algorithm | UAV | 90%
[6] | Wheat crop | Semantic segmentation of field photos; Inception-based regression network | Semantic segmentation of field RGB photos | Absolute difference in the count of 0.87
[7] | Corn crop | Image phenotyping | UAV RGB | R² = 0.35 to 0.90
[9] | Palm farm | ELM algorithm | UAV | 91.11%
[10] | Spinach plants | Machine vision and machine learning | UAV | 67.5%

Table 1 summarizes the approaches surveyed, the crop each was applied to, the input format, and the reported accuracy. Of these, the Hough transform and the regression-based methodology produced the highest accuracy.

3 Proposed Method

3.1 Methodology

Converting the Input RGB Image to the HSV Color Channel

The RGB image is converted into an HSV image to cope with varying degrees of light. The processed image is depicted in Fig. 1. The r, g, and b values are divided by 255 to change their range from 0–255 to 0–1:

$$r' = r/255, \quad g' = g/255, \quad b' = b/255$$

Fig. 1 HSV image
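The conversion described in this subsection, normalization followed by the hue, saturation, and value formulas of Eqs. (1)–(3), can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `rgb_to_hsv` and the choice to return hue in degrees are assumptions, not something the paper specifies.

```python
def rgb_to_hsv(r, g, b):
    """Convert 8-bit RGB to (hue in degrees, saturation, value)."""
    # Normalise each channel from 0..255 to 0..1.
    rp, gp, bp = r / 255, g / 255, b / 255
    cmax, cmin = max(rp, gp, bp), min(rp, gp, bp)
    d = cmax - cmin

    # Hue, Eq. (1): the branch depends on which channel is the largest.
    if d == 0:
        hue = 0.0
    elif cmax == rp:
        hue = 60 * (((gp - bp) / d) % 6)
    elif cmax == gp:
        hue = 60 * ((bp - rp) / d + 2)
    else:
        hue = 60 * ((rp - gp) / d + 4)

    # Saturation, Eq. (2), and value, Eq. (3).
    sat = 0.0 if cmax == 0 else d / cmax
    return hue, sat, cmax
```

For example, pure green `rgb_to_hsv(0, 255, 0)` falls in the second hue branch and gives a hue of 120 degrees with full saturation and value.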


Hue calculation:

$$\text{Hue} = \begin{cases} 0^\circ, & d = 0 \\ 60^\circ \times \left(\dfrac{g' - b'}{d} \bmod 6\right), & c_{\max} = r' \\ 60^\circ \times \left(\dfrac{b' - r'}{d} + 2\right), & c_{\max} = g' \\ 60^\circ \times \left(\dfrac{r' - g'}{d} + 4\right), & c_{\max} = b' \end{cases} \tag{1}$$

Saturation calculation:

$$\text{Saturation} = \begin{cases} 0, & c_{\max} = 0 \\ \dfrac{d}{c_{\max}}, & c_{\max} \neq 0 \end{cases} \tag{2}$$

Value calculation:

$$\text{Value} = c_{\max} \tag{3}$$

where $c_{\max} = \max(r', g', b')$, $c_{\min} = \min(r', g', b')$, and $d = c_{\max} - c_{\min}$.

Thresholding the Image with Otsu's Method

The concept behind Otsu's method is that it examines the pixel values and determines the threshold that best divides them into two classes by minimizing the within-class variance over the histogram. It checks every potential threshold value in turn and, for each, measures the spread of the pixel levels on either side of the threshold, that is, in the background and in the foreground. Otsu thresholding leaves the image regions better distributed after segmentation, with clearer edges and better noise resistance. It offers both good segmentation and fast operation, as reported by Ribera et al. [1]. Hence, Otsu thresholding is well suited to this methodology. The within-class variance at any threshold $\theta$ is calculated as

$$\sigma^2(\theta) = \varphi_{bg}(\theta)\,\sigma_{bg}^2(\theta) + \varphi_{fg}(\theta)\,\sigma_{fg}^2(\theta) \tag{4}$$

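where $\varphi_{bg}(\theta)$ and $\varphi_{fg}(\theta)$ are the proportions of pixels in each class at threshold $\theta$, and $\sigma^2$ is the variance of the pixel values in the class.

$$\varphi_{bg}(\theta) = \frac{P_{bg}(\theta)}{P_{all}} \tag{5}$$

$$\varphi_{fg}(\theta) = \frac{P_{fg}(\theta)}{P_{all}} \tag{6}$$

where $P_{bg}(\theta)$ is the number of pixels in the background at threshold $\theta$, $P_{fg}(\theta)$ is the number of pixels in the foreground at threshold $\theta$, and $P_{all}$ is the total number of pixels. The variance can be computed using the formula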


Fig. 2 Thresholding image

$$\sigma^2(\theta) = \frac{\sum_i (x_i - x_{mean})^2}{N - 1} \tag{7}$$

where $x_i$ is the value of pixel $i$ in the background or foreground, $x_{mean}$ is the mean of the pixel values in the background or foreground, and $N$ is the number of pixels in that class. The Otsu algorithm returns a single threshold that separates the pixel levels into two classes, namely the foreground and the background. The resulting thresholded image is shown in Fig. 2.

Filtering on the Binary Image and Applying Morphological Close Operation

The median filter is a nonlinear digital filter that removes noise from data and images. Each targeted noisy pixel is replaced by the median value of its neighbors, where the number of neighbors is determined by the filtering window size. In a sorted sequence, the median value is simply the middle value.

$$I_{median} = \begin{cases} I_j\left(\dfrac{k+1}{2}\right), & k \text{ is odd} \\ \dfrac{1}{2}\left[I_j\left(\dfrac{k}{2}\right) + I_j\left(\dfrac{k}{2} + 1\right)\right], & k \text{ is even} \end{cases} \tag{8}$$

where $I_1, I_2, I_3, \ldots, I_k$ are the image pixels in the window, sorted before the filter is applied. The closing operator is applied to a binary or grayscale image using the MORPH CLOSE operation, which performs a dilation followed by an erosion. Closing is idempotent, which means that applying it more than once has no further effect. The result is shown in Fig. 3.
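Returning to the thresholding step, the exhaustive search over thresholds that minimizes the within-class variance of Eq. (4) can be sketched as follows. This is a minimal plain-Python sketch under stated assumptions: the function name `otsu_threshold` is hypothetical, pixels are integers in 0–255, pixels at or below the threshold are treated as background, and the class variance divides by the class size rather than by N − 1 as in Eq. (7), which avoids a division by zero for single-pixel classes.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that minimises the within-class variance.

    Pixels with value <= t form the background, the rest the foreground.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)

    best_t, best_var = 0, float("inf")
    for t in range(levels - 1):
        bg, fg = hist[: t + 1], hist[t + 1:]
        n_bg, n_fg = sum(bg), sum(fg)
        if n_bg == 0 or n_fg == 0:
            continue  # one class is empty, so t does not split the image
        # Foreground bin i corresponds to grey level i + t + 1.
        mean_bg = sum(i * h for i, h in enumerate(bg)) / n_bg
        mean_fg = sum((i + t + 1) * h for i, h in enumerate(fg)) / n_fg
        var_bg = sum(h * (i - mean_bg) ** 2 for i, h in enumerate(bg)) / n_bg
        var_fg = sum(h * (i + t + 1 - mean_fg) ** 2
                     for i, h in enumerate(fg)) / n_fg
        # Eq. (4): class proportions weight the two class variances.
        within = (n_bg / total) * var_bg + (n_fg / total) * var_fg
        if within < best_var:
            best_t, best_var = t, within
    return best_t
```

Because ties are resolved in favor of the first threshold found, an image with two well-separated grey levels returns the lower of the two levels as the threshold.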


Fig. 3 After applying filtering and morphological close operation
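The filtering and closing steps behind Fig. 3 can be illustrated with a small sketch on images stored as lists of lists. The helper names (`median`, `median_filter`, `closing`), the 3 × 3 window, and the choice to clip windows at the image border are all assumptions made for illustration; the paper does not state its window size or border handling.

```python
def median(values):
    """Median following Eq. (8); positions in the equation are 1-based."""
    s = sorted(values)
    k = len(s)
    if k % 2 == 1:
        return s[(k + 1) // 2 - 1]
    return (s[k // 2 - 1] + s[k // 2]) / 2


def _filter(img, func, size=3):
    """Replace each pixel by func(neighbourhood); windows are clipped at the border."""
    h, w, r = len(img), len(img[0]), size // 2
    return [[func([img[j][i]
                   for j in range(max(0, y - r), min(h, y + r + 1))
                   for i in range(max(0, x - r), min(w, x + r + 1))])
             for x in range(w)]
            for y in range(h)]


def median_filter(img, size=3):
    """Median-filter a 2-D image with a size x size window."""
    return _filter(img, median, size)


def closing(img, size=3):
    """Morphological closing: dilation (local max) followed by erosion (local min)."""
    return _filter(_filter(img, max, size), min, size)
```

On a blank image with a single noisy pixel, `median_filter` removes the pixel, and calling `closing` on an already closed image returns it unchanged, which is the idempotence property mentioned in the text.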

Apply Distance Transform Operation and Applying Copy Make Border

As illustrated in Fig. 4, this operation replaces the gray-level intensity of each point inside the foreground regions with its distance from the closest 0 value (the region border). The function computes the Euclidean distance transform of a binary image: each value of the output matrix is the distance between that pixel and the nearest zero-valued pixel. The information at the boundaries of Fig. 4 may be lost if the image is used directly for further processing, so an extra border is created around the image to avoid the problem.

Fig. 4 After applying the copy make a border
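A brute-force sketch of these two operations is given below. It assumes the convention that each pixel receives its Euclidean distance to the nearest zero pixel and that the added border is a constant value; the function names `distance_transform` and `add_border` are hypothetical, and the approach is only practical for small images.

```python
from math import hypot


def distance_transform(binary):
    """Euclidean distance from each pixel to the nearest 0 pixel (brute force).

    Assumes the image contains at least one 0 pixel.
    """
    zeros = [(y, x) for y, row in enumerate(binary)
             for x, v in enumerate(row) if v == 0]
    return [[min(hypot(y - zy, x - zx) for zy, zx in zeros)
             for x in range(len(row))]
            for y, row in enumerate(binary)]


def add_border(img, width=1, value=0.0):
    """Surround a 2-D image with a constant border of the given width."""
    w = len(img[0]) + 2 * width
    top = [[value] * w for _ in range(width)]
    bottom = [[value] * w for _ in range(width)]
    body = [[value] * width + list(row) + [value] * width for row in img]
    return top + body + bottom
```

For a 3 × 3 block of ones surrounded by zeros, the distance peaks at the block center, which is why the maxima of the transformed image mark the approximate centers of rounded shapes.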


Applying the Otsu, Filtering, Morphing, and Distance Transform Operations on a Single Crop or Plant

The thresholding, filtering, morphing, and distance transform procedures that are performed on the whole crop image are likewise applied to the single plant image. The resulting image is shown in Fig. 5. It serves as the reference against which the entire crop image is compared to obtain closed rounded shapes, which in turn aids in determining the crop count.

Find min-max Locations

Next, the min–max locations are found, the image is thresholded, and the maximum values are converted to absolute values to find the number of dots. The number of dots is returned as the plant count of the crop, as shown in Fig. 6.

Fig. 5 Applying Otsu thresholding and median filter on a single plant

Fig. 6 Finding min and max locations
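One way to read the min-max step is: locate the extreme values of a response map, threshold the map relative to its maximum, and count the separate groups of pixels ("dots") that remain. The sketch below follows that reading. The function names, the use of 8-connectivity, and the threshold of half the maximum in the usage note are assumptions for illustration, since the paper does not give these details.

```python
def min_max_loc(grid):
    """Return (min value, max value, min position, max position) as (row, col)."""
    cells = [(v, (y, x)) for y, row in enumerate(grid) for x, v in enumerate(row)]
    lo = min(cells, key=lambda c: c[0])
    hi = max(cells, key=lambda c: c[0])
    return lo[0], hi[0], lo[1], hi[1]


def count_blobs(grid, threshold):
    """Count 8-connected groups of cells whose value is >= threshold."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if seen[y][x] or grid[y][x] < threshold:
                continue
            count += 1  # a new, unvisited dot starts here
            seen[y][x] = True
            stack = [(y, x)]
            while stack:  # flood-fill the rest of this dot
                cy, cx = stack.pop()
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if not seen[ny][nx] and grid[ny][nx] >= threshold:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count
```

A typical call would take the maximum from `min_max_loc` and pass a fraction of it, for example half, as the threshold to `count_blobs`.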


3.2 Flowchart The proposed methodology’s successive steps are depicted in Fig. 7. Both the crop view and single plant images of the UAV RGB image are preprocessed at first. The preprocessed photos are then subjected to template matching, with any matches being counted.

Fig. 7 Flowchart
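The template matching stage of the flowchart can be sketched as a sliding-window comparison between the single-plant template and the crop image. The paper does not state which similarity score is used, so normalized cross-correlation (NCC), the measure mentioned in the literature survey for [2], is assumed here; the function names and the 0.99 match threshold in the usage note are likewise hypothetical.

```python
from math import sqrt


def ncc(patch, template):
    """Normalised cross-correlation of two equally sized 2-D grids, in [-1, 1]."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0  # flat patches carry no pattern


def match_template(image, template):
    """Slide the template over the image and return the map of NCC scores."""
    th, tw = len(template), len(template[0])
    return [[ncc([row[x:x + tw] for row in image[y:y + th]], template)
             for x in range(len(image[0]) - tw + 1)]
            for y in range(len(image) - th + 1)]


def count_matches(image, template, threshold=0.99):
    """Count the positions whose NCC score exceeds the threshold."""
    scores = match_template(image, template)
    return sum(score > threshold for row in scores for score in row)
```

With real images, neighboring positions around one plant would all score highly, so the score map would normally be reduced to separate peaks (for example by counting connected groups above the threshold) rather than counting raw positions.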


4 Architecture Initially, RGB crop images are captured using a UAV. These images are given as the input to the model. This RGB image has been converted into an HSV image to avoid disturbances in the field. These images are further filtered to distinguish the trees/plants from unwanted areas of the crop other than the trees/plants. Now, the image is sharpened by using various techniques. By applying the morphing techniques to the resultant image, the skeleton structure of each plant/tree in the field is obtained. Here, the processed image of the field is obtained as shown in Fig. 8.

Fig. 8 Architecture

The procedure used to process the UAV image of the field is applied both to a single plant/tree and to the complete UAV crop image. The processed single-plant image is compared with the processed field image to find the matching points. These points are counted to give the number of plants/trees in the crop.

5 Results

Table 2 shows the results of applying the proposed technique to a variety of images rather than to images of a single type. For image i.(a), the actual count of plants is 222, while the predicted count is 220. Despite the enormous number of plants in the photograph, the accuracy is high. In image ii.(a), small trees are engulfed by large trees, which has a significant impact on the prediction's accuracy. A road can be seen in the middle of image ii.(a), yet no part of the road is mistaken for a plant. This implies that other objects are not treated as trees. The accuracy is good when the trees are aligned in rows, as in iii.(a), although a slight decline in accuracy is observed even though the trees are perfectly in line. Despite the shadows in image iv.(a), the model provided excellent accuracy. Because the model converts RGB images to HSV, it can distinguish a real tree from its shadow even if they have the same shape. As a result, the tree's shadow is not counted as a tree in the result. The proposed method can count the plants that lie along the image's edge. This is due to the image having a white edge on one side. As a result, as indicated in Table 2, incomplete plants can be spotted. Trees that are barely visible in the image are not counted in the final crop count.

Table 2 Results of various input crop images

Input image | Output image | Predicted plant count | Actual plant count
i.(a) | i.(b) | 220 | 222
ii.(a) | ii.(b) | 47 | 60
iii.(a) | iii.(b) | 87 | 91
iv.(a) | iv.(b) | 64 | 62
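The paper reports raw counts without defining an accuracy figure. As one possible reading, the relative counting error for each image can be computed from the predicted and actual counts in Table 2. The metric below (absolute difference divided by the actual count) is an assumption, not the authors' stated measure.

```python
# (predicted, actual) plant counts taken from Table 2.
results = {"i": (220, 222), "ii": (47, 60), "iii": (87, 91), "iv": (64, 62)}


def relative_error(predicted, actual):
    """Absolute counting error as a percentage of the actual count."""
    return abs(predicted - actual) / actual * 100


for name, (predicted, actual) in results.items():
    print(f"{name}: {relative_error(predicted, actual):.2f}% error")
```

Under this measure, image ii, where small trees are hidden by large ones, has by far the largest error (about 21.67%), while the other three images stay below 5%.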

6 Conclusions and Future Work The model was able to count the plants and can be useful for the farmers to estimate the count before harvesting. This crop count can be used to calculate crop density which would be a very good parameter for yield prediction rather than predicting using area. The future work will be based on developing a yield prediction model using crop density and climatic conditions such as temperature, humidity, and rainfall as parameters.

References

1. Ribera J, Chen Y, Boomsma C, Delp EJ (2017) Counting plants using deep learning. In: 2017 IEEE global conference on signal and information processing (GlobalSIP), pp 1344–1348. https://doi.org/10.1109/GlobalSIP.2017.8309180
2. García-Martínez H, Flores-Magdaleno H, Khalil-Gardezi A, Ascencio-Hernández R, Tijerina-Chávez L, Vázquez-Peña MA, Mancilla-Villa OR (2020) Digital count of corn plants using images taken by unmanned aerial vehicles and cross correlation of templates. Agronomy 10(4):469. https://doi.org/10.3390/agronomy10040469
3. Karami A, Crawford M, Delp EJ (2020) Automatic plant counting and location based on a few-shot learning technique. IEEE J Sel Top Appl Earth Obs Remote Sens 13:5872–5886. https://doi.org/10.1109/JSTARS.2020.3025790
4. Djerriri K, Ghabi M, Karoui MS, Adjoudj R (2018) Palm trees counting in remote sensing imagery using regression convolutional neural network. In: IGARSS 2018 - 2018 IEEE international geoscience and remote sensing symposium, pp 2627–2630. https://doi.org/10.1109/IGARSS.2018.8519188
5. Aparna P, Hebbar R, Harshita MP, Sounder H, Nandkishore K, Vinod PV (2018) CNN based technique for automatic tree counting using very high resolution data. In: 2018 International conference on design innovations for 3Cs compute communicate control (ICDI3C), pp 127–129. https://doi.org/10.1109/ICDI3C.2018.00036
6. Mukhtar H, Khan MZ, Usman Ghani Khan M, Saba T, Latif R (2021) Wheat plant counting using UAV images based on semi-supervised semantic segmentation. In: 2021 1st International conference on artificial intelligence and data analytics (CAIDA), pp 257–261. https://doi.org/10.1109/CAIDA51941.2021.9425252
7. Gnädinger F, Schmidhalter U (2017) Digital counts of maize plants by unmanned aerial vehicles (UAVs). Remote Sens 9(6):544. https://doi.org/10.3390/rs9060544
8. Lavania S, Matey PS (2015) Novel method for weed classification in maize field using Otsu and PCA implementation. In: 2015 IEEE international conference on computational intelligence and communication technology, pp 534–537. https://doi.org/10.1109/CICT.2015.71
9. Bazi Y, Malek S, Alajlan N, AlHichri H (2014) An automatic approach for palm tree counting in UAV images. In: 2014 IEEE geoscience and remote sensing symposium, pp 537–540. https://doi.org/10.1109/IGARSS.2014.6946478
10. Valente J, Sari B, Kooistra L et al (2020) Automated crop plant counting from very high-resolution aerial imagery. Precis Agric 21:1366–1384. https://doi.org/10.1007/s11119-020-09725-3

Comparative Analysis on Deep Learning Algorithms for Detecting Retinal Diseases Using OCT Images G. Muni Nagamani and S. Karthikeyan

Abstract Retinal diseases can affect people at any age. In the early stages, many people suffering from retinal diseases have very mild symptoms. Retinal diseases mainly damage the blood vessels, which causes fluid to leak. The accumulation of fluid can affect the retina and cause vision changes. Retinal diseases such as diabetic retinopathy (DR), age-related macular degeneration (AMD), diabetic macular edema (DME), drusen, and choroidal neovascularization (CNV) are complex diseases that have a major impact on retinal health. If retinal diseases are not detected in the early stages, they may lead to permanent vision loss. These diseases also have other side effects, such as brain disorders. Preventing these diseases may stop permanent vision loss in patients. Machine learning (ML) algorithms are widely used to detect retinal diseases in their early stages. The main disadvantage of ML is that these algorithms take more time to process the data and involve complex methods. In this paper, deep learning (DL) algorithms are discussed, and various optical coherence tomography (OCT) datasets are used for experiments. Experimental results show the performance of various ML and DL approaches applied to OCT and retinal image datasets, and the algorithms are compared.

Keywords Age-related macular degeneration (AMD) · Diabetic macular edema (DME) · Drusen · Choroidal neovascularization (CNV)

1 Introduction

Accurate diagnosis of retinal disorders has been a significant public health concern in recent years. Manual localization of retinal disease by trained human experts involves identifying fine points of interest in OCT images and classifying them into the relevant disease using a grading system. To overcome the limitations of manual identification, automated retinal disease detection models are necessary [1]. The burden of retinal diseases is increasing globally as the older population continues to grow [2]. Artificial intelligence systems that may not have access to training examples for all retinal disorders in all phenotypic presentations could be used to develop anomaly detectors for retinal diagnoses [3]. Nowadays, many people are affected by retinal diseases, which include diabetic retinopathy (DR), DME, drusen, AMD, and CNV [4]. These diseases require different types of treatment. They are identified based on grading, and they may lead to permanent loss of vision. Retinal diseases are classified from sample images using pixel-level classification of very high resolution (VHR) images, which is a highly complex task [5]. Identifying retinal diseases using 3D optical coherence tomography angiography (OCTA) is another important task [6]; this newer method analyzes retinal diseases very efficiently. Segmentation is one of the better techniques for analyzing retinal OCT images. Several retinal diseases are detected using segmentation techniques such as retinal vessel (RV) and foveal avascular zone (FAZ) segmentation [7]. ML and artificial intelligence (AI) are used to detect and diagnose retinal diseases. Deep learning (DL) is an advanced technique that can solve several issues in disease prediction [8]. Among the retinal diseases, cataract affects a person's vision if it is not detected in the early stages. Several automated techniques have been developed to detect diseases using various features, such as Haar and visible structure features merged and fed to multilayer perceptrons with discrete state transition (DST-MLP) [9]. A deep neural network (DNN) can analyze retinal disease based on eye movement [10]. Figure 1 shows the early signs of diabetic retinopathy (DR), which is more dangerous and can cause permanent blindness [11]. Fundus images are also used to detect diseases like DR and DME. Several segmentation approaches are applied to fundus and OCTA images. These approaches are used to find the retinal fluid to manage image-guided treatment [12, 13] (Figs. 2 and 3). This paper focuses on detecting and diagnosing several retinal diseases from OCT images collected from various Internet sources. Various benchmark datasets are used to show the performance of deep learning algorithms in detecting these diseases.

G. Muni Nagamani (B) · S. Karthikeyan
Computer Science and Engineering, VIT-AP University, Amaravati, Andhra Pradesh, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_36

Fig. 1 Early signs of DR and DME [4]

Fig. 2 Sample OCT image

Fig. 3 Sample fundus image on eye

2 Literature Survey

Li et al. [14] developed a self-supervised learning model (SSLM) for detecting and diagnosing retinal diseases. The model is applied to two public datasets for the classification of retinal diseases. It detects disinfectant myopia (DiM), which is close to age-related macular degeneration (AMD). Abdelmaksoud et al. [15] proposed a multi-label computer-aided diagnosis (ML-CAD) system that visualizes the various pathological changes and detects the DR grades given by the experts. This model follows several steps, including preprocessing and classifying the DR based on grades. It also extracts four changes, namely exudates, microaneurysms, hemorrhages, and blood vessels, using a DL technique (U-Net). Li et al. [16] introduced a multi-modal method integrated with a self-supervised feature learning technique to diagnose retinal disease. This system focuses on learning modality-invariant and patient-similarity features. With this technique, visual similarity between patients is identified across several modalities. The model is applied to two benchmark datasets to detect retinal diseases, and it achieves high accuracy compared with other self-supervised feature learning methods. Zang et al. [17] proposed a neural network approach built on DcardNet for DR classification. This approach processes the datasets in three stages and focuses on classifying eyes as NPDR, DR, or normal. Ngo et al. [18] proposed a DNN model based on an adaptive normalized intensity score (ANIS) that utilizes various features from the OCTA datasets. The experiments are conducted on a new dataset consisting of 114 images, with a computation time of 10.578 s per image to find the 8 borders and a training computation time of 30 s. The proposed model achieved an accuracy of 96.6%. Seebock et al. [19] introduced a novel model with an advanced post-processing technique that works efficiently on OCT images. This approach achieved a Dice index of 0.789 for the AMD cases. Classification is done for AMD, GA, DME, and RVO. Tennakoon et al. [20] proposed an advanced classification technique to solve the volumetric image classification problem.
To increase the performance of the proposed model, extreme value theory is adopted to find accurate features. The experiments are based on classifying three types of diseases from OCT images. The proposed model achieved high accuracy compared with existing approaches. Li et al. [21] proposed an integrated framework focused on the classification of retinal diseases. In this approach, feature extraction is used to achieve high accuracy, and a novel ribcage network (RCNet) is used as the middle layer to integrate the new feature extraction method. Rong et al. [22] introduced the surrogate-assisted classification (SAC) approach, which classifies retinal diseases from OCT images using convolutional neural networks (CNNs). In this approach, the authors focus on denoising the inputs. Thresholding and morphological dilation are used to extract masks from the given OCT image, and the denoised masks are used to generate the surrogate images on which the CNN is trained. Two datasets are used, a local dataset and the Duke dataset. The proposed approach achieved better results for retinal diseases. Xiang et al. [23] introduced an automatic approach that segments the layers and fluid in 3D OCT retinal images of eyes with central serous retinopathy. It extracts 24 features that are fed to a random forest classifier. The proposed approach shows better results compared with state-of-the-art approaches. Luo et al. [24] proposed a new approach for the automatic segmentation of retinal vessels. This approach, Attention-Dense-UNet (AD-UNet), is applied to the DRIVE and STARE fundus image datasets and shows better segmentation results.

3 Methodology

3.1 Role of Deep Learning

Deep learning is widely used for disease detection and prediction. DL is a fast-growing and powerful field that builds on artificial neural networks (ANNs) with multiple layers and can make accurate predictions from many types of data. In particular, DL algorithms perform well in detecting retinal diseases from various OCT image datasets. In DL models, disease image datasets are used to classify or detect retinal or eye diseases. For every input dataset, image preprocessing techniques are applied to reduce the noise in the input images and prepare them for the next step, feature extraction. To learn the classification rules, the preprocessed image is taken as input, and the DL architecture dynamically extracts the features and their related weights. Good feature extraction gives better classification results. Deep learning plays a significant role in training on OCT images to extract optimized features. These models require a lot of training to produce accurate output, and limited training degrades the output. Qummar et al. [25] introduced an ensemble deep learning algorithm that extracts features from a Kaggle dataset; the dataset is used to train five CNN models to increase the accuracy. He et al. [26] proposed a category attention block (CAB) that extracts features for each DR grade, together with a global attention block (GAB) that extracts features from fundus image datasets for the detection of small lesions.

3.2 Segmentation Approaches in Detecting Retinal Diseases Segmentation plays a major role in processing the images into segments based on the properties and features of the images. The images are divided into several parts, and these image parts are considered image objects. Chen et al. [27] proposed a new segmentation approach that segments the retinal blood vessels through deep learning. Retinal blood vessels show an impact on the health status of patients that can be diagnosed with retinal diseases with the proposed segmentation model. Yan et al. [28] introduced the automated retinal vessel segmentation that is used to detect

G. Muni Nagamani and S. Karthikeyan

eye-related diseases. This method is a three-stage vessel segmentation that extracts features to locate non-vessel pixels and recover the full thickness of the vessels. Xiuqin et al. [29] presented a new segmentation scheme for retinal vascular fundus images that addresses performance issues using a DNN. Sarki et al. [30] discussed the retinal diseases caused by diabetes and analyzed the performance of IP techniques, DL models, and other performance-improved models. Khan et al. [31] introduced a classification method for diabetic retinopathy (DR). The proposed system uses VGG16 for training; integrating VGG16 with a spatial pyramid pooling (SPP) layer and network-in-network (NiN) yields the VGG-NiN model, for which high performance is reported. He et al. [26] proposed a model called the category attention block (CAB). The main aim of this approach is to detect the grade of DR, and it is applied to three publicly available datasets. Gao et al. [32] proposed an approach that detects DR from fundus images. It is built on a deep CNN and assigns a grade to the DR, achieving an accuracy of 88.72% and a consistency rate of 91.8%. Sarhan et al. [33] discussed the types of medical image data analysis, covering tasks such as retinal vessel, layer, and fluid segmentation, and reported high accuracy for the analysis of medical data. van Grinsven et al. [34] introduced a CNN model with an efficient training scheme, focused on detecting hemorrhages in color fundus images. The proposed approach reduced the training and testing time while improving performance. Greenspan et al. [35] focused on medical image analysis using deep learning approaches. Liskowski et al. [36] proposed a segmentation approach for retinal blood vessels, with an accuracy of up to 97%. Costa et al. [37] implemented an auto-encoder to analyze the retinal vessels. The proposed approach generates vessel trees to detect the stage of the vessels from the retinal images. Gopinath et al. [38] proposed an automated approach for the accurate segmentation of cysts in retinal OCT images. Qummar et al. [25] proposed automating DR detection using fundus images. The proposed system is applied to the Kaggle dataset and consists of five CNN models: Resnet50, Inceptionv3, Xception, Dense121, and Dense169. The DR stages are divided based on the extracted features. Soomro et al. [39] compared various DL algorithms for detecting the retinal blood vessels. Girshick et al. [40] proposed a simple and scalable approach that improves the mean average precision (mAP) by more than 50%. Shelhamer et al. [41] introduced a segmentation approach that extracts the features of the datasets. Ren et al. [42] developed a feature learning method for segmenting drusen in retinal images. Two datasets, STARE and DRIVE, are used.

Comparative Analysis on Deep Learning Algorithms for Detecting …

Fig. 4 Confusion matrix

3.3 Data Preprocessing Techniques in Detecting Retinal Images Preprocessing is one of the significant steps to remove the noise from the given dataset samples. Here, the dataset samples are in any format, such as JPG images and PNG images. Several preprocessing techniques are applied to ML and DL.

4 Performance Metrics The performance is calculated by using several metrics such as sensitivity, specificity, and accuracy. These metrics calculate not only correct classification but also incorrect classification. These metrics are defined as follows (Fig. 4):

$$\text{Sensitivity (Se)} = \frac{TP}{TP + FN}$$

$$\text{Specificity (Sp)} = \frac{TN}{TN + FP}$$

$$\text{Accuracy (ACC)} = \frac{TP + TN}{TP + FP + FN + TN}$$
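As an illustration only, the three metrics above can be computed directly from the four confusion-matrix counts. The sketch below is not taken from the paper; the function name `confusion_metrics` and the example counts are hypothetical, and the zero-denominator guards are an added assumption.

```python
def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        # Se = TP / (TP + FN): share of diseased cases that are detected
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Sp = TN / (TN + FP): share of healthy cases that are cleared
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # ACC = (TP + TN) / all samples
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Hypothetical counts for a binary screening task (not from the paper).
metrics = confusion_metrics(tp=80, fp=10, fn=20, tn=90)
```

Reporting sensitivity and specificity next to accuracy matters when the classes are imbalanced, since accuracy alone can hide a low detection rate for the rarer class.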

5 Comparative Analysis In this section, the comparisons are discussed among the several existing approaches or models. These approaches are applied to various real-time datasets by using

popular programming languages such as Python and MATLAB, as well as traditional programming languages such as Java and C#.NET. Table 1 shows the performance of various approaches and models based on retinal diseases. Some algorithms are used to diagnose retinal diseases early with the integration of tools. The performance is measured by using various metrics such as sensitivity, specificity, and accuracy (Fig. 5).

Table 1 Segmentation algorithms for detecting retinal diseases, the datasets they are applied to, and their performance

| Author | Algorithm/model | Disease | Dataset | Results |
|---|---|---|---|---|
| Zeng et al. [43] | Binocular model | Diabetic Retinopathy | Kaggle | Se-82.2%, Sp-70.7%, AUC-95.1% |
| Bogunović et al. [44] | RETOUCH | AMD, RVO | OCT Imaging Dataset | Dice Score (DSC)-80% and Absolute Volume Difference (AVD)-95% |
| Romo-Bucheli et al. [45] | Automated segmentation approach | Neovascular AMD (nAMD) | Curated dataset | Se-82%, Sp-69%, Acc-72%, AUC-85% |
| Luo et al. [46] | Self-supervised fuzzy clustering network (SFCN) | Diabetic Retinopathy (DR) | MESSIDOR Dataset | Acc-87.6%, AUC-85% |
| He et al. [47] | Novel modality-specific attention network (MSAN) | AMD, DR | Hunan Multi-modal Retinal Image (HMRI) dataset | P-76.85, RE-69.99, F1-70.42, AUC-85.52, Time-1.38 |
| Hassan et al. [48] | Incremental cross-domain adaptation | DME, AMD, CSR | Zhang dataset | Acc-98.26, F1-Score-98.46 |
| Hassan et al. [49] | A deep retinal analysis and grading framework (RAG-FW) | Diabetic Retinopathy (DR) | OCT Image Dataset | Acc-98.70 |
| Girish et al. [50] | Fully convolutional network (FCN) | Retinal disorders | OPTIMA cyst Dataset | P-66, RE-79.0 |

* Se Sensitivity; Sp Specificity; Acc Accuracy; AUC Area Under the Curve; P Precision; RE Recall

Fig. 5 Performance of segmentation algorithm

6 Conclusion

Retinal disorders are generally diagnosed by ophthalmologists, and diagnosis with traditional approaches is time-consuming. In this paper, automated techniques are analyzed based on the features of retinal diseases. The classification of retinal diseases is based on detection and diagnosis from OCT images, which are used to find diseases based on the features of each disease. Diseases such as DR, AMD, DME, drusen, CNV, and other retinal diseases are discussed, together with the DL models proposed for diagnosing them. Several pretrained DL models such as ResNet, VGG16, VGG19, and CNN are used for training the systems. This paper also surveyed the use of numerous deep learning algorithms on diverse OCT image and retinal datasets to detect diverse eye diseases, and it presented a comparison of various DL algorithms and their performance.

References

1. Subramanian M, Sandeep Kumar M, Sathishkumar VE, Prabhu J, Karthick A, Sankar Ganesh S, Meem MA (2022) Diagnosis of retinal diseases based on Bayesian optimization deep learning network using optical coherence tomography images. Comput Intell Neurosci 2022:15. Article ID 8014979. https://doi.org/10.1155/2022/8014979
2. Ben-Arzi A, Ehrlich R, Neumann R (2022) Retinal diseases: the next frontier in pharmacodelivery. Pharmaceutics 14(5):904. https://doi.org/10.3390/pharmaceutics14050904
3. Burlina P, Paul W, Liu TYA, Bressler NM (2022) Detecting anomalies in retinal diseases using generative, discriminative, and self-supervised deep learning. JAMA Ophthalmol 140(2):185–189. https://doi.org/10.1001/jamaophthalmol.2021.5557
4. Li X, Hu X, Yu L, Zhu L, Fu C-W, Heng P-A (2020) CANet: cross-disease attention network for joint diabetic retinopathy and diabetic macular edema grading. IEEE Trans Med Imaging 39(5):1483–1493
5. Yan L, Fan B, Liu H, Hua C, Xiang S, Pan C (2020) Triplet adversarial domain adaptation for pixel-level classification of VHR remote sensing images. IEEE Trans Geosci Remote Sens 58(5):3558–3573

6. Zhang J et al (2020) 3D shape modeling and analysis of retinal microvasculature in OCT-angiography images. IEEE Trans Med Imaging 39(5):1335–1346
7. Tjoa E, Guan C (2020) A survey on explainable artificial intelligence (XAI): toward medical XAI. IEEE Trans Neural Netw Learn Syst
8. Zhou Y, Li G, Li H (2020) Automatic cataract classification using deep neural network with discrete state transition. IEEE Trans Med Imaging 39(2):436–446
9. Mao Y, He Y, Liu L, Chen X (2020) Disease classification based on synthesis of multiple long short-term memory classifiers corresponding to eye movement features. IEEE Access 8:151624–151633
10. Kou C, Li W, Yu Z, Yuan L (2020) An enhanced residual U-Net for microaneurysms and exudates segmentation in fundus images. IEEE Access 8:185514–185525
11. Bogunovic H, Venhuizen F, Klimscha S et al (2019) RETOUCH–the retinal OCT fluid detection and segmentation benchmark and challenge. IEEE Trans Med Imaging 1–1
12. Gu Z et al (2019) CE-net: context encoder network for 2D medical image segmentation. IEEE Trans Med Imaging 38(10):2281–2292
13. Seebock P et al (2019) Unsupervised identification of disease marker candidates in retinal OCT imaging data. IEEE Trans Med Imaging 38(4):1037–1047. https://doi.org/10.1109/TMI.2018.2877080
14. Li X et al (2021) Rotation-oriented collaborative self-supervised learning for retinal disease diagnosis. IEEE Trans Med Imaging 40(9):2284–2294. https://doi.org/10.1109/TMI.2021.3075244
15. Abdelmaksoud E, El-Sappagh S, Barakat S, Abuhmed T, Elmogy M (2021) Automatic diabetic retinopathy grading system based on detecting multiple retinal lesions. IEEE Access 9:15939–15960
16. Li M et al (2020) Image projection network: 3D to 2D image segmentation in OCTA images. IEEE Trans Med Imaging 39(11):3343–3354
17. Zang P, Gao L, Hormel TT, Wang J, You Q, Hwang TS, Jia Y (2020) DcardNet: diabetic retinopathy classification at multiple levels based on structural and angiographic optical coherence tomography. IEEE Trans Biomed Eng
18. Ngo L, Cha J, Han J-H (2020) Deep neural network regression for automated retinal layer segmentation in optical coherence tomography images. IEEE Trans Image Process 29:303–312
19. Seebock P et al (2020) Exploiting epistemic uncertainty of anatomy segmentation for anomaly detection in retinal OCT. IEEE Trans Med Imaging 39(1):87–98
20. Tennakoon R et al (2020) Classification of volumetric images using multi-instance learning and extreme value theorem. IEEE Trans Med Imaging 39(4):854–865
21. Li X, Shen L, Shen M, Qiu CS (2019) Integrating handcrafted and deep features for optical coherence tomography-based retinal disease classification. IEEE Access 7:33771–33777
22. Rong Y et al (2019) Surrogate-assisted retinal OCT image classification based on convolutional neural networks. IEEE J Biomed Health Inf 23(1):253–263
23. Xiang D et al (2019) Automatic retinal layer segmentation of OCT images with central serous retinopathy. IEEE J Biomed Health Inf 23:283–295
24. Luo Z, Zhang Y, Zhou L, Zhang B, Luo J, Wu H (2019) Micro-vessel image segmentation based on the AD-UNet model. IEEE Access 7:143402–143411
25. Qummar S, Khan FG, Shah S, Khan A, Shamshirband S, Rehman ZU, Ahmed Khan I, Jadoon W (2019) A deep learning ensemble approach for diabetic retinopathy detection. IEEE Access 7:150530–150539
26. He A, Li T, Li N, Wang K, Fu H (2021) CABNet: category attention block for imbalanced diabetic retinopathy grading. IEEE Trans Med Imaging 40(1):143–153
27. Chen C, Chuah JH, Ali R, Wang Y (2021) Retinal vessel segmentation using deep learning: a review. IEEE Access 9:111985–112004. https://doi.org/10.1109/ACCESS.2021.3102176
28. Yan Z, Yang X, Cheng K-T (2019) A three-stage deep learning model for accurate retinal vessel segmentation. IEEE J Biomed Health Inf 23(4):1427–1436. https://doi.org/10.1109/JBHI.2018.2872813

29. Xiuqin P, Zhang Q, Zhang H, Li S (2019) A fundus retinal vessels segmentation scheme based on the improved deep learning U-Net model. IEEE Access 7:122634–122643. https://doi.org/10.1109/ACCESS.2019.2935138
30. Sarki R, Ahmed K, Wang H, Zhang Y (2020) Automatic detection of diabetic eye disease through deep learning using fundus images: a survey. IEEE Access 8:151133–151149
31. Khan Z et al (2021) Diabetic retinopathy detection using VGG-NIN a deep learning architecture. IEEE Access 9:61408–61416. https://doi.org/10.1109/ACCESS.2021.3074422
32. Gao Z, Li J, Guo J, Chen Y, Yi Z, Zhong J (2019) Diagnosis of diabetic retinopathy using deep neural networks. IEEE Access 7:3360–3370. https://doi.org/10.1109/ACCESS.2018.2888639
33. Sarhan MH et al (2020) Machine learning techniques for ophthalmic data processing: a review. IEEE J Biomed Health Inf 24(12):3338–3350. https://doi.org/10.1109/JBHI.2020.3012134
34. van Grinsven MJ, van Ginneken B, Hoyng CB, Theelen T, Sánchez CI (2016) Fast convolutional neural network training using selective data sampling: application to hemorrhage detection in color fundus images. IEEE Trans Med Imaging 35(5):1273–1284
35. Greenspan H, van Ginneken B, Summers RM (2016) Guest editorial deep learning in medical imaging: overview and future promise of an exciting new technique. IEEE Trans Med Imaging 35(5):1153–1159
36. Liskowski P, Krawiec K (2016) Segmenting retinal blood vessels with deep neural networks. IEEE Trans Med Imaging 35(11):2369–2380
37. Costa P et al (2017) End-to-end adversarial retinal image synthesis. IEEE Trans Med Imaging 37(3):781–791
38. Gopinath K, Sivaswamy J (2018) Segmentation of retinal cysts from optical coherence tomography volumes via selective enhancement. IEEE J Biomed Health Inf 23(1):273–282
39. Soomro TA et al (2019) Deep learning models for retinal blood vessels segmentation: a review. IEEE Access 7:71696–71717. https://doi.org/10.1109/ACCESS.2019.2920616
40. Girshick R, Donahue J, Darrell T, Malik J (2016) Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans Pattern Anal Mach Intell 38(1):142–158
41. Shelhamer E, Long J, Darrell T (2017) Fully convolutional networks for semantic segmentation. IEEE Trans Pattern Anal Mach Intell 39(4):640–651
42. Ren X et al (2018) Drusen segmentation from retinal images via supervised feature learning. IEEE Access 6:2952–2961. https://doi.org/10.1109/ACCESS.2017.2786271
43. Zeng X, Chen H, Luo Y, Ye W (2019) Automated diabetic retinopathy detection based on binocular siamese-like convolutional neural network. IEEE Access 7:30744–30753
44. Bogunović H et al (2019) Retouch-the retinal OCT fluid detection and segmentation benchmark and challenge. IEEE Trans Med Imaging 38(8):1858–1874
45. Romo-Bucheli D, Erfurth US, Bogunovic H (2020) End-to-end deep learning model for predicting treatment requirements in neovascular AMD from longitudinal retinal OCT imaging. IEEE J Biomed Health Inf 24:3456–3465
46. Luo Y, Pan J, Fan S, Du Z, Zhang G (2020) Retinal image classification by self-supervised fuzzy clustering network. IEEE Access 12(8):92352–92362
47. He X, Deng Y, Fang L, Peng Q (2021) Multi-modal retinal image classification with modality-specific attention network. IEEE Trans Med Imaging 40(6):1591–1602
48. Hassan T, Hassan B, Akram MU, Hashmi S, Taguri AH, Werghi N (2021) Incremental cross-domain adaptation for robust retinopathy screening via Bayesian deep learning. IEEE Trans Instrum Meas 70:1–14
49. Hassan T, Akram MU, Werghi N, Nazir MN (2021) RAG-FW: a hybrid convolutional framework for the automated extraction of retinal lesions and lesion-influenced grading of human retinal pathology. IEEE J Biomed Health Inf 25(1):108–120
50. Girish GN, Thakur B, Chowdhury SR, Kothari AR, Rajan J (2019) Segmentation of intra-retinal cysts from optical coherence tomography images using a fully convolutional neural network model. IEEE J Biomed Health Inf 23(1):296–304

PCB-LGBM: A Hybrid Feature Selection by Pearson Correlation and Boruta-LGBM for Intrusion Detection Systems Seshu Bhavani Mallampati, Seetha Hari, and Raj Kumar Batchu

Abstract Nowadays, Internet of Things (IoT) applications are growing and gaining popularity. This accelerated development encounters several obstacles, including the sheer volume of data generated, network scalability, and security concerns. Distributed Denial of Service (DDoS) attacks are widespread in IoT systems because security is frequently overlooked in detection systems. Therefore, an efficient intrusion detection system (IDS) is needed to detect DDoS attacks in immense network traffic. The present study suggests a hybrid feature selection model that combines filter-based Pearson Correlation and wrapper-based Boruta with the Light Gradient Boosting Machine (LGBM) as a base classifier (PCB-LGBM). In addition, the hyperparameters of the model are tuned with the shap-hypetune approach to retrieve more informative features. The proposed architecture was tested on the CICIDS-2017 dataset. The experimental results reveal that the suggested PCB-LGBM model achieves better results with the LGBM classifier than with other methods such as Decision Tree (DT), Multi-Layer Perceptron (MLP), and K-Nearest Neighbors (KNN).

Keywords IoT · DDoS attacks · CICIDS-2017 · Feature selection

1 Introduction The Internet of Things (IoT) is an extensively utilized technology that substantially impacts our lives in many ways, covering social, commercial, and economic elements. According to Markets and Markets [1], the global IoT market will reach $561 billion by 2022. Furthermore, according to Fortune Business Insights, by 2028, this figure

S. B. Mallampati · R. K. Batchu School of Computer Science and Engineering, VIT-AP University, Amaravati, Andhra Pradesh, India H. Seetha (B) Center of Excellence, AI and Robotics, VIT-AP University, Amaravati, Andhra Pradesh, India e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_37

S. B. Mallampati et al.

may go to $1.85 trillion. However, cyber security experts are concerned that this rapid increase and use of connected devices lead to many security issues. An intrusion detection system (IDS) is a comprehensive ecosystem that monitors network traffic for malicious activity detection such as DDoS and provides alert messages when an attack arises. The intrusion detection mechanism is categorized into misuse-based and anomaly-based intrusion detection [2]. The misuse-based model detects attacks by scanning for specific patterns in network data. However, this model cannot identify unknown patterns when a new attack arises. Therefore, the database has to be updated whenever a recent attack occurs. On the other hand, anomaly-based IDS learns regular network activity before detecting abnormalities that are not part of regular traffic and detects unknown attacks. Due to a significant increase in network traffic over the last few years, feature selection (FS) has emerged as an essential preprocessing step for numerous machine learning tasks. Because most intrusion detection datasets contain many features, effective extraction of possible risk factors using FS approaches is necessary and challenging. The goal of FS is to find and remove irrelevant features from the dataset. As a result, the learning model performs better with less computing time. FS is mainly categorized into filter, wrapper, and embedded [3]. Filter models choose features independent of any classifier based on the general properties of the training data. At the same time, wrapper and embedded methods generate candidate feature subsets by iteratively exploring the entire feature space with a specific classifier performance. Filter methods are computationally low, but wrapper methods produce efficient results [4]. Furthermore, by combining the FS methods such as filter and wrapper, the classifier’s performance can be improved by removing correlated and weak features [5]. 
Therefore, this article proposes a hybrid feature selection by combining filter-based Pearson correlation and wrapper-based Boruta feature selection to select appropriate features to improve classification accuracy. The article is structured as follows. Section 2 outlines related work. Section 3 describes the proposed hybrid feature selection model, and Sect. 4 describes performance metrics along with detailed experimental results and analysis. Section 5 provides comparative analysis. Finally, conclusions and future work are described in Sect. 6.

2 Related Works

This section reviews some of the most recent and widely used methods for detecting intrusions. Osanaiye et al. [5] explored a filter-based ensemble FS using the gain ratio, Chi-squared, ReliefF, and information gain methods. This model ranks features using the four filter techniques and selects the 13 common features that meet the chosen threshold. They then applied the J48 classifier to the selected features to identify intrusions effectively.

Balkanli et al. [6] proposed a feature selection based on filter techniques such as Chi-square and Symmetrical Uncertainty (SU). The Chi-square model retrieves the more informative features in the initial stage. SU is then used as a ranker-based feature selection to select an appropriate feature set. Finally, the attacks were identified using the C4.5 Decision Tree classifier. Gu et al. [7] suggested a model that determines the most informative features with a hybrid Hadoop-based feature selection technique. To find the local minima and mitigate the outliers, they used k-means clustering with an enhanced density-based approach for selecting the initial cluster centers. A semi-supervised weighted k-means technique then employs the hybrid FS technique to improve detection performance. Batchu et al. [8] proposed a hybrid feature selection that combines Spearman rank correlation and Random Forest (RF) and selects nine informative features. These features are fed into learning classifiers such as Gradient Boost (GB), support vector machine (SVM), K-Nearest Neighbor (KNN), Decision Tree (DT), and Logistic Regression (LR) to detect DDoS attacks effectively. Kasim et al. [9] suggested an IDS based on deep learning, which uses an autoencoder to extract the relevant features and feeds them into an SVM to detect intrusions effectively. Watson et al. [10] explored a hybrid feature selection that combines correlation feature selection with three search methods: genetic, greedy stepwise, and best-first. A wrapper-based RF is then used to evaluate the features obtained from the filter methods, and the 12 selected features are fed to the RF classifier to mitigate intrusions. Jaw et al. [11] explored a feature selection model that uses a genetic algorithm, CfsSubsetEval, and a rule-based engine to identify appropriate features, reducing model complexity. An ensemble classification method built from One-Class SVM, k-means, Expectation–Maximization, and DBSCAN is then used to classify the probability distributions between malicious and non-malicious instances.

The following are the important observations from the literature:
• The majority of the studies examined their models without addressing the problem of data imbalance.
• As new attack behaviors evolve day by day, identifying the most relevant features becomes more challenging.

We addressed the above issues in the proposed work with less training time and better performance.

3 Methodology This section describes the proposed technique, categorized into preprocessing, hybrid feature selection, and classification. The proposed workflow is shown in Fig. 1.

Fig. 1 Proposed workflow

3.1 Dataset

In this article, we used the publicly available benchmark dataset CICIDS-2017, provided by the Canadian Institute for Cybersecurity (CIC), to test the performance of the proposed model. It was developed by Sharafaldin et al. [12] and includes updated families of attacks that meet real-world criteria, as shown in Table 1. The data were captured from Monday, 03-07-2017, to Friday, 07-07-2017. Monday contains only normal traffic; web attacks, DoS, DDoS, infiltration, and other attack traffic were captured on the remaining days. We used a subset of the CICIDS-2017 dataset with 52,502 records, of which 70% is used for training and 30% for testing. The data have two class labels: attack and normal.

Table 1 CICIDS-2017 attack types

| Day | Labels | Total records |
|---|---|---|
| Monday | Benign | 2,273,097 |
| Tuesday | BForce, SFTP, and SSH | 445,909 |
| Wednesday | DoS, Heartbleed attacks, Slowloris, Slowhttptest, Hulk, and GoldenEye | 692,703 |
| Thursday | Web and infiltration attacks | 458,968 |
| Friday | DDoS, Botnet, PortScans | 703,245 |

3.2 Preprocessing

In any machine learning framework, preprocessing plays a vital role. It is primarily used to prepare, organize, and clean data so that it is suitable for building and training models. Initially, the dataset was analyzed with statistical measures, which showed that it contains missing values, outliers, class imbalance, and redundant features. In the proposed work, outliers are removed using the interquartile range (IQR), and redundant features are removed. The KNN imputer fills each missing value with the mean value of the 'n' nearest neighbors found in the training set using the k-nearest neighbors (KNN) algorithm. The Euclidean distance is the default metric for finding the nearest neighbors [13]. Class imbalance occurs when the instances of one class (the majority class) outnumber the instances of the other class (the minority class). When an imbalanced dataset is fed to traditional classification methods, they tend to misclassify minority-class data points because of a bias toward majority-class samples. In the proposed work, we handled the class imbalance of CICIDS-2017 using k-means SMOTE, which has three stages:
• Clustering.
• Filtering.
• Oversampling.
In the clustering stage, the dataset is grouped into k groups using k-means clustering. Next, the filtering stage chooses clusters for oversampling, keeping those with a significant number of minority-class samples. The number of synthetic instances to be generated is then distributed, with more samples assigned to clusters where minority instances are sparsely distributed. Finally, SMOTE is applied in each selected cluster during the oversampling step to obtain the required ratio of majority and minority samples [14]. Once the data are balanced, they have to be scaled, because the features can have different value ranges. Training on features with widely varying scales is harder, and the model's performance may decrease. Therefore, we used a Min–Max scaler to rescale each feature to the range [0, 1] using Eq. 1:

$$X_{new} = \frac{X_i - \min(X)}{\max(X) - \min(X)}, \tag{1}$$

where $X_{new}$ is the scaled data, $X_i$ is the cell value, $\min(X)$ is the minimum value of the column $X$, and $\max(X)$ is the maximum value of the column $X$.
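Two of the preprocessing steps described in this section, interquartile-range outlier removal and the min–max scaling of Eq. 1, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the fence multiplier `k=1.5`, the linear-interpolation quantile, the function names, and the toy column are all hypothetical choices that the paper does not specify.

```python
def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is the common rule-of-thumb value (an assumption here).
    """
    s = sorted(values)

    def quantile(q):
        # Linear interpolation between the two closest ranks.
        pos = q * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def min_max_scale(values):
    """Eq. 1: (x - min) / (max - min), mapping the column to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature column with one extreme value.
column = [2.0, 4.0, 6.0, 8.0, 10.0, 500.0]
low, high = iqr_bounds(column)
kept = [v for v in column if low <= v <= high]
scaled = min_max_scale(kept)
```

Filtering outliers before scaling matters for min–max scaling in particular, because a single extreme value would otherwise compress all remaining values into a narrow band near 0.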

3.3 Hybrid Feature Selection

Once the dataset is preprocessed, it can be fed into any traditional machine learning model. A hybrid feature selection is proposed, as shown in Algorithm 1, to improve the classifier's performance with less computational time. CICIDS-2017 has 79 features, and not every feature is essential for predicting an attack, so irrelevant features have to be removed. Initially, filter-based Pearson correlation is used to remove correlated features based on a threshold of 0.8. Because Pearson correlation identifies linearly correlated features, it helps only in removing redundant features [15]; features that are irrelevant to the task at hand can still remain. To overcome this problem, the selected features are passed to the wrapper-based Boruta approach, which evaluates the importance of features based on model performance. The Boruta method removes non-essential features and highlights significant features, resulting in a precise classification and a reliable model. We used Boruta-LGBM with parameters tuned by shap-hypetune to extract informative features. The working mechanism of Boruta is as follows:
• To begin, it adds randomness to the given dataset by making shuffled copies of all features, which are referred to as shadow features.
• The LGBM is then trained on this expanded dataset, and the importance of each feature is evaluated using a feature importance metric such as accuracy.
• At each iteration, the Boruta algorithm tests whether an actual feature is more important than the strongest of its shadow features, and it constantly eliminates features that are deemed extremely unimportant.
• Finally, the Boruta algorithm terminates either when all features are confirmed or rejected or when it hits a predetermined limit of LGBM runs.
• Boruta thus finds all features that are either strongly or weakly related to the response variable.
Wrapper approaches are generally the most successful because they capture dependencies and correlations between features. They do, however, appear to be more prone to overfitting. To handle this problem, the LGBM classifier is tuned using shap-hypetune, a Python package for tuning hyperparameters and selecting features simultaneously.
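The shadow-feature idea behind Boruta can be illustrated with a small, dependency-free sketch. It deliberately swaps in a much simpler importance score, the absolute Pearson correlation with the label, in place of the LGBM importances used in the paper, and it replaces Boruta's statistical test with a plain majority rule. The names `abs_corr`, `boruta_sketch`, `n_iter`, and the toy data are hypothetical, so this should be read as a sketch of the mechanism rather than the authors' implementation.

```python
import math
import random


def abs_corr(x, y):
    """Absolute Pearson correlation; a stand-in for a model-based importance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return abs(cov / (sx * sy)) if sx and sy else 0.0


def boruta_sketch(features, target, n_iter=20, seed=0):
    """Count how often each real feature beats the strongest shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        # Shadow features: shuffled copies that keep the value distribution
        # but break any relationship with the target.
        shadow_best = 0.0
        for values in features.values():
            shadow = list(values)
            rng.shuffle(shadow)
            shadow_best = max(shadow_best, abs_corr(shadow, target))
        # A real feature scores a hit when it beats the strongest shadow.
        for name, values in features.items():
            if abs_corr(values, target) > shadow_best:
                hits[name] += 1
    # Simplified decision rule (assumption): keep majority winners.
    selected = [name for name, h in hits.items() if h > n_iter / 2]
    return selected, hits


# Toy data: one feature that tracks the label and one that does not.
target = [0] * 20 + [1] * 20
toy_features = {
    "informative": [float(i) for i in range(40)],
    "noise": [float((i * 7) % 5) for i in range(40)],
}
selected, hits = boruta_sketch(toy_features, target)
```

Comparing against shuffled copies gives each feature a data-driven baseline for "importance by chance", which is why Boruta does not need a hand-picked importance threshold.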

Algorithm 1
(1) Input: pre-processed CICIDS-2017 dataset $S = \{s_1, s_2, \ldots, s_n\}$.
(2) Initialize subset $P = \phi$.
(3) Remove duplicate features: if $(s_g == s_y)$, where $g, y \in \{1, 2, 3, \ldots, n\}$; $g \neq y$, update the new feature set $S' = S - s_g$; $S' = \{s_1, s_2, \ldots, s_y, \ldots, s_m\}$; $m \leq n$.
(4) Remove correlated features: if $\rho(s_a, s_b) \geq 0.8$, $(a, b) \in \{1, 2, 3, \ldots, m\}$, $a \neq b$, remove $s_a$ from $S'$ and update the feature set $S'$ to $P' = \{s_1, s_2, \ldots, s_b, \ldots, s_l\}$, $l \leq m$, where $\rho = \frac{\operatorname{cov}(P, Y)}{\sigma_P \sigma_Y}$; $\rho$ is the Pearson correlation.
(5) The updated subset $P'$ is fed into wrapper-based Boruta, and the hyperparameters are tuned with shap-hypetune.
(6) Remove the subset of features from $P'$ that are weak based on step 5.
(7) The final feature subset is placed in $P = \{s_1, s_2, s_3, \ldots, s_k\}$, where $k \leq l$. Further, $P$ is passed to machine learning classifiers such as DT, LGBM, MLP, and KNN to determine the model's performance.
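Steps (3) and (4) of Algorithm 1 can be sketched as a greedy correlation filter. The version below is only illustrative and makes several assumptions that are not in the paper: it uses the absolute value of the correlation, it keeps the first feature of a correlated pair rather than the second, and the feature names and values in `toy` are invented for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation rho = cov(x, y) / (sigma_x * sigma_y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:  # constant column: correlation undefined
        return 0.0
    return cov / (sx * sy)


def drop_correlated(features, threshold=0.8):
    """Keep a feature only if |rho| with every kept feature is below threshold.

    features: dict mapping feature name -> list of values.
    Exact duplicates have rho = 1, so they are dropped by the same rule.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical columns: the second is a rescaled copy of the first.
toy = {
    "flow_duration": [1.0, 2.0, 3.0, 4.0, 5.0],
    "flow_duration_ms": [10.0, 20.0, 30.0, 40.0, 50.0],
    "fwd_packets": [2.0, 1.0, 4.0, 1.0, 3.0],
}
selected = drop_correlated(toy)
```

The surviving features would then be handed to the wrapper stage (steps 5 and 6), which is where task relevance, rather than redundancy, is assessed.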

4 Results and Analysis
The proposed work was carried out on an Intel® Xeon® W-2125 CPU @ 4.00 GHz processor, Core i7-10750H CPU @ 2.60 GHz processor, Windows 10 Pro for Workstations operating system with 64 GB RAM, and 2 GB of GeForce GTX 1080 Ti graphics. The experiments were implemented in a Python environment. The classifier's performance is summarized by a confusion matrix, whose components are True Negative (TN), True Positive (TP), False Negative (FN), and False Positive (FP). Metrics such as recall (RC), accuracy (AC), precision (PR), F1-score, and area under the ROC curve are computed from these components.

$$\mathrm{AC} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{2}$$

$$\mathrm{PR} = \frac{TP}{TP + FP}, \tag{3}$$

$$\mathrm{RC} = \frac{TP}{TP + FN}, \tag{4}$$

$$\text{F1-Score} = 2 \times \frac{\mathrm{PR} \times \mathrm{RC}}{\mathrm{PR} + \mathrm{RC}}, \tag{5}$$

where
• TP is the result where the model accurately predicts the positive class.
• TN is the result where the model accurately predicts the negative class.


Table 2 Without class balancing and without feature selection

| Model | AC | PR | RC | F1-score | ROC-AUC | Time (s) |
|---|---|---|---|---|---|---|
| DT | 99.94 | 99.73 | 99.19 | 99.46 | 99.59 | 0.31 |
| LGBM | 99.98 | 100 | 99.59 | 99.79 | 99.79 | 0.44 |
| MLP | 99.86 | 99.05 | 98.13 | 98.59 | 99.04 | 15.39 |
| KNN | 99.94 | 99.20 | 99.73 | 99.46 | 99.84 | 0.85 |

Table 3 Features selected by a proposed hybrid model

| Proposed feature selection method | Optimal features obtained |
|---|---|
| Pearson + Boruta-LGBM (PCB-LGBM) | "Destination Port", "Flow Duration", "Total Fwd Packets", "Total Length of Fwd Packets", "Bwd Packet Length Max", "Flow Packets/s", "Flow IAT Min", "Fwd IAT Min", "Fwd PSH Flags", "Bwd packets/s", "Down/Up Ratio" |

• FP is the result where a normal sample is misclassified as an attack.
• FN is the result where an attack sample is predicted as non-attack.
The proposed feature selection is evaluated with machine learning models such as DT, LGBM, MLP, and KNN. The experiments are done in two ways: without and with feature selection and class balancing. Table 2 shows the results obtained from data without balancing and without feature selection. This setting provides good accuracy but weaker PR, RC, F1-score, and ROC results. This may be due to irrelevant features, and models may overfit when data are not balanced. Feature selection and class balancing are used to address this problem. Table 3 lists the features selected by the proposed PCB-LGBM, which improves the model's performance in terms of PR, RC, F1-score, and ROC. As the results in Table 4 show, LGBM performs best, with an accuracy of 99.99%, PR of 100%, recall of 99.99%, F1-score of 99.99%, and ROC of 99.99%, at a computational time of 0.31 s.
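The measures in Eqs. (2) to (5) follow directly from the four confusion-matrix counts. The sketch below shows the arithmetic; the function name `classification_metrics` and the zero-division guards are assumptions added for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (e.g. no predicted positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"AC": accuracy, "PR": precision, "RC": recall, "F1": f1}
```

With hypothetical counts such as TP = 1, TN = 98, FP = 0, FN = 1, the accuracy is 0.99 while the recall is only 0.5, which illustrates why accuracy alone can look good on imbalanced data.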

5 Comparison with Existing Methods
The effectiveness of the proposed technique is compared with that of existing techniques, as shown in Table 5. Yulianto et al. [16] proposed an IDS model by combining


Table 4 With class balancing and feature selection

| Model | AC | PR | RC | F1-score | ROC-AUC | Time (s) |
|---|---|---|---|---|---|---|
| DT | 99.11 | 99.28 | 98.95 | 99.11 | 99.11 | 0.10 |
| LGBM | 99.99 | 100 | 99.99 | 99.99 | 99.99 | 0.31 |
| MLP | 99.57 | 99.16 | 99.99 | 99.57 | 99.57 | 146.48 |
| KNN | 99.96 | 99.94 | 99.99 | 99.96 | 99.93 | 0.28 |

Table 5 Comparison of the proposed technique with existing techniques

| Model | No. of features | AC | PR | RC | F1-score | ROC-AUC | Time |
|---|---|---|---|---|---|---|---|
| AdaBoost + EFS + SMOTE [16] | 25 | 81.83 | 81.83 | 100 | 90.01 | 86.4 | NA |
| MC-CNN [17] | All | 98.87 | NA | NA | NA | NA | NA |
| Pearson correlation + DNN [18] | 35 | 99.73 | 96.68 | 99.58 | 98.05 | 100 | 6.46 s |
| Proposed model | 11 | 99.99 | 100 | 99.99 | 99.99 | 99.99 | 0.31 s |

ensemble-based feature selection to select more informative features. Further, SMOTE was used for balancing the CICIDS-2017 dataset. Finally, the AdaBoost classifier was trained to detect the attacks, with an accuracy of 81.83%, precision of 81.83%, recall of 100%, F1-score of 90.01%, and ROC-AUC of 86.4%. Chen et al. [17] suggested a deep learning technique based on a multi-channel convolutional neural network to mitigate DDoS attacks, with an accuracy of 98.87%. Siddiqi and Pak [18] proposed an optimized filter-based IDS to mitigate intrusions effectively. Initially, the dataset is preprocessed. Further, they utilized filter-based Pearson correlation to select more informative features from CICIDS-2017. Finally, a deep neural network model is trained to detect intrusions, attaining an accuracy of 99.73%, precision of 96.68%, recall of 99.58%, F1-score of 98.05%, and ROC-AUC of 100%. Compared with these existing models, the proposed model attains outstanding performance with LGBM: an accuracy of 99.99%, PR of 100%, recall of 99.99%, F1-score of 99.99%, and ROC of 99.99%, with a training time of 0.31 s.

6 Conclusion The article proposes an efficient IDS to mitigate DDoS attacks over IoT networks. Initially, we preprocessed the CICIDS-2017 dataset. Then, to enhance the machine learning model’s performance, we proposed a hybrid feature selection PCB-LGBM to select the relevant features. Further, the selected features are fed into machine learning models such as DT, LGBM, MLP, and KNN. Our experimental results revealed that


the LGBM technique gives good accuracy, F1-score, precision, recall, and ROC-AUC with less training time. Finally, we evaluated our proposed system using a number of performance metrics and compared it with state-of-the-art procedures to show that the proposed approach outperforms existing methods. As a result, the proposed model can be valuable for identifying DDoS attacks effectively in IoT network security research. Furthermore, in the future, we plan to compare our proposed model with PCA feature selection and extend our model to multi-class classification.

References
1. IoT security in 2022: defending data during the rise of ransomware. https://www.perle.com/articles/iot-security-in-2022-defending-data-during-the-rise-of-ransomware-40193618.shtml. Accessed 23 April 2022
2. Putra D, Kadnyanana IGAGA (2021) Implementation of feature selection using information gain algorithm and discretization with NSL-KDD intrusion detection system. JELIKU (Jurnal Elektron. Ilmu Komput. Udayana) 9(3):359. https://doi.org/10.24843/jlk.2021.v09.i03.p06
3. Batchu RK, Seetha H (2022) On improving the performance of DDoS attack detection system. Microprocess Microsyst 93(December 2021):104571. https://doi.org/10.1016/j.micpro.2022.104571
4. Sánchez-Maroño N, Alonso-Betanzos A, Tombilla-Sanromán M (2007) Filter methods for feature selection - a comparative study. Lect Notes Comput Sci (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics) 4881 LNCS:178–187. https://doi.org/10.1007/978-3-540-77226-2_19
5. Osanaiye O, Cai H, Choo KKR, Dehghantanha A, Xu Z, Dlodlo M (2016) Ensemble-based multi-filter feature selection method for DDoS detection in cloud computing. Eurasip J Wirel Commun Netw 1:2016. https://doi.org/10.1186/s13638-016-0623-3
6. Balkanli E, Nur Zincir-Heywood A, Heywood MI (2015) Feature selection for robust backscatter DDoS detection. Proc - Conf Local Comput Netw, LCN, 2015:611–618. https://doi.org/10.1109/LCNW.2015.7365905
7. Gu Y, Li K, Guo Z, Wang Y (2019) Semi-supervised k-means ddos detection method using hybrid feature selection algorithm. IEEE Access 7:64351–64365. https://doi.org/10.1109/ACCESS.2019.2917532
8. Batchu RK, Seetha H (2021) A generalized machine learning model for DDoS attacks detection using hybrid feature selection and hyperparameter tuning. Comput Netw 200:108498. https://doi.org/10.1016/j.comnet.2021.108498
9. Kasim Ö (2020) An efficient and robust deep learning based network anomaly detection against distributed denial of service attacks. Comput Netw 180:107390. https://doi.org/10.1016/J.COMNET.2020.107390
10. Kamarudin MH, Maple C, Watson T (2019) Hybrid feature selection technique for intrusion detection system (January, 2019). https://doi.org/10.1504/IJHPCN.2019.097503
11. Jaw E (2021) Feature selection and ensemble-based intrusion detection system: an efficient and comprehensive approach. Symmetry, pp 1–34
12. Sharafaldin I, Lashkari AH, Ghorbani AA (2018) Toward generating a new intrusion detection dataset and intrusion traffic characterization. Cic:108–116. https://doi.org/10.5220/0006639801080116
13. Beretta L, Santaniello A (2016) Nearest neighbor imputation algorithms: a critical evaluation. BMC Med Inform Decis Mak. https://doi.org/10.1186/s12911-016-0318-z
14. Last F, Douzas G, Bacao F (2017) Oversampling for imbalanced learning based on K-means and SMOTE, pp 1–19. https://doi.org/10.1016/j.ins.2018.06.056


15. Brahmam MV, Sravan KR, Bhavani MS Pearson correlation based outlier detection in spatial-temporal data of IoT networks, pp 1–10
16. Yulianto A, Sukarno P, Suwastika NA (2019) Improving AdaBoost-based intrusion detection system (IDS) performance on CIC IDS 2017 dataset. J Phys Conf Ser 1192(1). https://doi.org/10.1088/1742-6596/1192/1/012018
17. Chen J, Tao Yang Y, Ke Hu K, Bin Zheng H, Wang Z (2019) DAD-MCNN: DDoS attack detection via multi-channel CNN. ACM Int Conf Proc Ser, Part F1481(February 2019):484–488. https://doi.org/10.1145/3318299.3318329
18. Siddiqi MA, Pak W (2020) Optimizing filter-based feature selection method flow for intrusion detection system. Electron 9(12):1–18. https://doi.org/10.3390/electronics9122114

Extractive Summarization Approaches for Biomedical Literature: A Comparative Analysis
S. LourduMarie Sophie, S. Siva Sathya, and Anurag Kumar

Abstract
Text summarization is one of the key applications of Natural Language Processing that has become more prominent for information condensation. Text summarization minimizes the content of a source document to generate a coherent summary while maintaining vital information from the source. Automatic text summarization has emerged as a key area of research in information extraction, notably in the biomedical field. Keeping pace with recent research findings has become an incredibly difficult challenge for medical specialists and researchers as the number of publications in medical research increases over time. Automatic text summarization approaches for biomedical literature may aid researchers in quickly reviewing research findings by extracting significant information from recent publications. Typical techniques for medical text summarization necessitate domain knowledge. The effectiveness of such systems depends on resource-intensive clinical domain-specific knowledge bases and preprocessing mechanisms for generating relevant information. This research compares three text summarization approaches: word frequency, cosine similarity, and the Luhn algorithm. The comparative results are generated for 20 biomedical articles and evaluated using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric; the outcome indicates the best-suited technique for summarization of biomedical literature.
Keywords Natural language processing · Automatic text summarization · Medical text processing · Biomedical informatics · Extractive summarization

S. LourduMarie Sophie (B) · S. Siva Sathya · A. Kumar
Department of Computer Science, Pondicherry University, Pondicherry, India
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_38

1 Introduction
Natural Language Processing (NLP) involves computer–human interaction. NLP is the process of constructing a language-analyzing system; as Internet use worsens


information overload [1], the need for NLP systems for computer–human interaction rises. The abundance and diversity of data cause problems with data management and with relevance to the consumer. When a person needs to limit the amount of data to go through, the text has to be shortened. Manually summarizing a huge volume of data is time-consuming. Each person's comprehension and learning ability vary; hence, many different human-generated summaries of the same document are possible. As a document's information becomes more complex, it becomes harder to understand and summarize. Document summarization requires prior knowledge [2]. A person with limited domain knowledge may be unable to produce a good summary of complicated documents, especially clinical documents, biomedical publications, scientific articles, legal records, etc. Because inputs come from different domains, summarizing documents is difficult. Therefore, a system that can automatically collect, extract, and summarize content is needed. One potential solution to this situation is automatic summarization. Summarization is the task of finding a subset of a document that represents its overall content [3]; its types include image, video, and text. Image summarization selects the most pertinent images. In video summarization, the system removes repetitive scenes and produces a concise version. Text summarization chooses the most important and relevant sentences. Automated text summarization tools provide crisp, fluent summaries that retain the source content. This article presents a comparative analysis of the frequency-based, cosine similarity, and Luhn summarization approaches on 20 biomedical articles. All of these approaches are extractive, and the summary produced by each approach is evaluated using ROUGE metrics.

2 Automatic Text Summarization
Automatic text summarization recognizes the most important information in a document or group of related documents and condenses it. It cuts down on reading time and storage space [4]. The standard summarization process includes three stages: analyze, transform, and generate. Any text-based summarizer first extracts notable elements from the text. Next, it aggregates the initial findings and feeds them to the generate stage, which creates a user-tailored summary at the end [5]. Text summarization systems can be categorized according to the following criteria, illustrated in Fig. 1:
i. Input: A summarization system can accept a single document or a group of documents [1, 6]. A single-document system uses only one document. Single-document summaries are easier to produce than multi-document ones. Multi-document systems are fed several related documents and are more complex to implement, since various materials must be summarized together.
ii. Approaches: Summarization can be accomplished by extraction, abstraction, or a hybrid-based approach [1, 6, 7]. The extraction process extracts significant phrases from the source material. Abstraction conveys the document's main ideas in newly worded, straightforward terms [3].
iii. Language: Summarization systems can be classified as monolingual or multilingual depending on the language of the input and of the summary [1, 6]. Monolingual systems take documents written in a single language and generate a summary exclusively in that language. Multilingual systems can accept documents in several languages and produce summaries in multiple languages.
iv. Source representation: The summarizer might be generic or query-oriented, depending on the source material [1, 5, 6]. The generic summarization method is subject-neutral and can be used by anyone. A query-based summary is a question-response type in which the summary responds to the user's question.
v. Summary representation: Depending on the summary's key features, the system might be indicative or informative [1, 6]. Indicative summarization systems present the user with the document's main idea, motivating them to read the entire text. Informative summarization systems provide concise summaries of the primary text and can even replace it.
vi. Training data: Text summarization systems can be supervised or unsupervised [5]. Supervised systems use machine learning to construct a textual summary from past summaries. Unsupervised systems build the summary from hypotheses derived from the input material, as there is no human input.
vii. Evaluation methods: Performance can be evaluated using intrinsic and extrinsic indicators [8, 9]. A summary's correctness, relevance, readability, and comprehensibility are evaluated intrinsically. The extrinsic technique assesses summarization by analyzing its effects on Information Retrieval, decision-making, etc. [8].
viii. Limitation: The summarizer might be genre-specific, domain-dependent, or domain-independent, based on the input text [1, 6]. Genre-specific systems only accept text that follows a template (scientific articles, user manuals, etc.); the template is vital for constructing the summary. Domain-dependent systems summarize only text from a particular domain. Domain-independent systems accept any text input.

Fig. 1 Automatic text summarization classification based on different criteria


Automatic text summarization is challenging: when humans summarize a paragraph, they read it carefully to understand it and then highlight the main points [10, 11]. Summarizing content is difficult for machines because they lack human knowledge and language understanding [12, 13]. Automatic summarization has existed since the 1950s, but there is still room for improvement. Summarizing biomedical literature has become necessary, and it often relies on external domain knowledge bases such as SNOMED-CT [14], MeSH [15], and UMLS [16] to provide rich semantic interpretations of the texts to be summarized [17].

3 Related Works
Automatic text summarization, a prominent topic in information extraction research, especially in the medical and biomedical domains, condenses original materials while retaining their most useful aspects [18]. Summaries can be either extractive or abstractive. Extractive summarization selects the most important sentences from the source, whereas abstractive summaries are created by paraphrasing. This section reviews past publications that use different summarization methods. Most studies focus on sentence extraction rather than on generating new summary text. Early techniques emphasize features of the source content [19], which are determined by frequency measures over the source document. Extractive summarization uses Information Retrieval (IR) heuristics, such as the document's term frequency [20, 21], the word/sentence position [22–25], the existence of certain key terms [26, 27], or the similarity of document phrases to the title and abstract [28–30]. Simple scoring rules are then used to evaluate, rank, and retrieve phrases for the summary. In [31], a multi-document generic summarizer is constructed using the aforementioned heuristics. In [32], the summarizer is extended by attaching suitable features to each phrase to help extract query-relevant phrases. Another widely used technique is to employ statistical properties of the phrases to generate extractive summaries. Statistical methodologies have advanced considerably over the years. In Latent Semantic Analysis (LSA)-based summarization [33], a word-by-sentence matrix is created, and Singular Value Decomposition (SVD) is performed. The kth singular vector denotes the kth most important concept, and the summary chooses the sentences that best represent the singular vectors. TextRank is a graph-based technique developed in the early 2000s [34], inspired by PageRank [35]. Variants of TextRank build phrase relationships using cosine similarity [36]. As shown in [37, 38], removing stop words improves the summary. ROUGE [39] is an assessment metric that compares machine-generated summaries with human-generated summaries.


4 Text Summarization Algorithms
Extractive summaries retrieve the most relevant phrases from the source text. Extractive text summarization involves preprocessing the input, extracting words or phrases based on their features, and then selecting and merging the phrases [40]. Figure 2 depicts the biomedical text summarizer's workflow. Before summarizing, preprocessing the text is essential: it converts the text into a format that can be processed by machine learning algorithms. Basic text preprocessing includes:
• Tokenization: divides paragraphs into sentences and sentences into words, called tokens.
• Lowercasing: converts each word to lower case.
• Normalization:
  – Stemming: reduces a word to its root form by removing the suffix; occasionally, however, the resulting root is not a meaningful word.
  – Lemmatization: similar to stemming, but the root word produced after removing the suffix is meaningful and exists in the dictionary.
• Removing stop words and whitespace: stop words are terms that occur very often in texts (a, an, the, etc.). These terms carry little significance, since they do not help in differentiating two texts.
There are several methods for creating extractive summaries. This section discusses the approaches implemented in this article.
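The preprocessing steps listed above can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the stop-word set is a tiny hand-picked sample, `crude_stem` is a toy suffix stripper rather than a real stemmer such as Porter's, and all function names are hypothetical.

```python
import re

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "and", "to", "for", "on", "with"}


def split_sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def crude_stem(token):
    """Strip one common suffix; the result may not be a dictionary word."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    result = []
    for sentence in split_sentences(text):
        tokens = [crude_stem(t) for t in tokenize(sentence) if t not in STOP_WORDS]
        result.append(tokens)
    return result
```

For the input "The cells are dividing. Treated mice recovered!", this yields `[["cell", "divid"], ["treat", "mice", "recover"]]`; the token "divid" shows how stemming can produce a root that is not a meaningful word.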

Fig. 2 Generic workflow of biomedical summarizer

4.1 Frequency-Based Approach
This approach is the simplest and most direct technique for producing an extractive summary. After preprocessing, each source phrase is scored based on the relative frequency of its terms [41]. The most frequent terms are chosen, and their weighted frequency is calculated by dividing their frequency by the total number of terms in the document. Then, each phrase's score is computed by adding its terms' weights, and the phrases are ranked by score. High scores indicate relevant phrases. Python's NLTK library was used to build this method. Figure 3 depicts the overall flow of a frequency-based biomedical text summarizer. Extractive summarization algorithms based on word frequency are simple to build and produce consistent results across languages. However, this approach neglects context, covers document topics unevenly, may repeat words in the summary, and sometimes yields summaries that are disorganized and difficult to comprehend.

Fig. 3 Workflow of frequency-based biomedical summarizer
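The scoring scheme described in this subsection can be sketched as follows. The sketch uses only the standard library instead of NLTK, weights every non-stop-word term (rather than only the most frequent ones), and uses a small sample stop-word list; these choices and the name `frequency_summary` are assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list (assumption).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "and", "to", "for", "on", "with"}


def frequency_summary(text, n_sentences=2):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [
        [w for w in re.findall(r"[a-z0-9]+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    counts = Counter(w for words in tokenized for w in words)
    total = sum(counts.values())
    if total == 0:
        return []
    # Weighted frequency: term count divided by the total number of terms.
    weights = {w: c / total for w, c in counts.items()}
    # Sentence score: sum of the weights of its terms.
    scores = [sum(weights[w] for w in words) for words in tokenized]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]
```

Because a sentence score is a plain sum, longer sentences tend to score higher, which is one reason the resulting summaries can be repetitive.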

4.2 Similarity-Based Approach
Several similarity measures can be used to identify the degree of similarity between phrases, based on their overlapping content [36, 42, 43]. In this study, a cosine similarity-based technique is used, which scores the similarity of two sentences or documents from their vector representations with a value in [−1, 1]. Following the preprocessing step, a vector representation of each phrase is created. Subsequently, a similarity matrix is formed by calculating the cosine similarity score between each pair of phrases, as per Eq. (1).

$$\text{Cosine similarity}(X_1, X_2) = \frac{\vec{X_1} \cdot \vec{X_2}}{\lVert X_1 \rVert \, \lVert X_2 \rVert}, \tag{1}$$

where $X_1$ and $X_2$ represent sentence 1 and sentence 2, respectively, and $\lVert X_1 \rVert \lVert X_2 \rVert$ denotes the product of the magnitudes of the two sentence vectors. The similarity matrix is then transformed into a graph, with nodes representing textual phrases and edges expressing the similarity relationships between them [37]. After computing similarity scores, the phrases are ranked using PageRank, and the resulting summary includes the top-ranked phrases. Figure 4 illustrates the workflow of the approach adopted in this research.


Fig. 4 Workflow of similarity-based biomedical summarizer
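A minimal sketch of the similarity-based pipeline is given below: cosine similarity as in Eq. (1), followed by a basic PageRank power iteration over the similarity graph. Raw term-count vectors, the damping factor of 0.85, the fixed 50 iterations, the omission of stop-word removal, and the function names are all assumptions made for illustration, not details taken from the paper.

```python
import math
import re
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two bag-of-words Counters, as in Eq. (1)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pagerank(matrix, damping=0.85, n_iter=50):
    """Weighted PageRank by power iteration on a similarity matrix."""
    n = len(matrix)
    ranks = [1.0 / n] * n
    row_sums = [sum(row) for row in matrix]
    for _ in range(n_iter):
        new_ranks = []
        for i in range(n):
            # Each node j passes its rank to i in proportion to edge weight j -> i.
            incoming = sum(
                ranks[j] * matrix[j][i] / row_sums[j]
                for j in range(n)
                if row_sums[j] > 0
            )
            new_ranks.append((1 - damping) / n + damping * incoming)
        ranks = new_ranks
    return ranks


def similarity_summary(sentences, n_sentences=2):
    """Return the top-ranked sentences, in their original order."""
    vectors = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    n = len(sentences)
    matrix = [
        [cosine_similarity(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    ranks = pagerank(matrix)
    top = sorted(range(n), key=lambda i: ranks[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]
```

Sentences that share vocabulary with many other sentences accumulate rank, while a sentence with no overlap keeps only the baseline score.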

4.3 Luhn Approach
Luhn's methodology, proposed in 1958, is based on the Term Frequency-Inverse Document Frequency (TF-IDF) idea: terms that are either very rare or very common (such as stop words) are treated as insignificant. The Luhn approach rates sentences based on the number and proximity of keywords within a phrase, with the highest-ranking phrases included in the summary [44]. In this approach, the sentence weight ($X_{wt}$) is computed as the sum of the term scores divided by the length of the sentence, as given in Eq. (2). Each term in the input manuscript is scored by counting the number of times it appears in the manuscript.

$$X_{wt} = \frac{\sum_{j=1}^{|X|} \mathrm{score}_{t_j}}{|X|}, \tag{2}$$

where $t_j$ is the jth term of the sentence $X$ and $|X|$ is the cardinality of the sentence. Here, dividing by the sentence length serves as a normalization factor that prevents longer phrases from being picked over those with more significant terms. After computing each phrase's weight, the summary is constructed by selecting the x highest-weight phrases and reordering them. Also, phrases near the beginning of a text are given more weight. Figure 5 shows Luhn's workflow.


Fig. 5 Workflow of Luhn biomedical summarizer
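The sentence weight of Eq. (2) can be sketched as follows. The sketch covers only the length-normalized term-score sum; keyword proximity and the extra weight for early sentences are left out, and the names `sentence_weights` and `luhn_style_summary` are hypothetical.

```python
import re
from collections import Counter


def sentence_weights(sentences, stop_words=frozenset()):
    """Weight each sentence by the mean document frequency of its terms (Eq. 2)."""
    tokenized = [
        [w for w in re.findall(r"[a-z0-9]+", s.lower()) if w not in stop_words]
        for s in sentences
    ]
    # Term score: number of times the term appears in the whole document.
    term_score = Counter(w for words in tokenized for w in words)
    return [
        sum(term_score[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]


def luhn_style_summary(sentences, n_sentences=2, stop_words=frozenset()):
    """Return the n highest-weight sentences, in their original order."""
    weights = sentence_weights(sentences, stop_words)
    top = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]
```

Dividing by the sentence length means that a short sentence made up of frequent terms can outrank a long sentence that contains the same terms plus many rare ones.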

5 Comparative Analysis of Text Summarization Approaches
This research assesses three summarization approaches on 20 PubMed biomedical articles using ROUGE metrics. Since each of the 20 documents is given to three text-based techniques, the outputs are also text files. Each method yields 20 summaries; in total, 60 summaries are produced. Although article lengths vary, the output summaries are nearly identical in length. Traditional evaluation of summaries involves human assessment. The reference summaries used here are human-generated summaries of the medical literature based on a common inquiry. ROUGE [39] determines the degree of similarity between system-generated summaries and reference summaries. ROUGE has five different measures: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, and ROUGE-SU. ROUGE-2 is a variant of ROUGE-N that counts the N-grams shared by the reference and system summaries. In this article, ROUGE-2 is used to evaluate the system-generated summaries. ROUGE metrics produce three evaluation measures per summary [45]: recall, precision, and F-measure. Recall divides the number of n-grams that occur in both the system and reference summaries by the total number of n-grams in the reference summary, indicating how much of the reference's content the system summary retains. It is expressed mathematically in Eq. (3).

$$\text{Recall} = \frac{\text{Count of n-grams in system and reference summary}}{\text{Total count of n-grams in reference summary}} \tag{3}$$

Precision is computed in nearly the same way as recall, except that the denominator is the n-gram count of the system summary rather than that of the reference summary. It is expressed mathematically in Eq. (4).

$$\text{Precision} = \frac{\text{Count of n-grams in system and reference summary}}{\text{Total count of n-grams in system summary}} \tag{4}$$


The F-measure/F-score is defined as the harmonic mean of precision and recall. It is expressed in Eq. (5).

$$\text{F-score} = 2 \times \left( \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \right) \tag{5}$$

Performance evaluation utilizing the ROUGE metric is carried out by comparing the output summary with the manual summary for each biomedical article, and their results are tabulated in Table 1. Figure 6 illustrates a visual depiction of Table 1. The findings indicate that the Luhn and similarity-based approaches perform better than the frequency-based method for biomedical text summarization. A higher value of precision, recall, and F-score indicates a better performance of the approach used. From the comparative analysis carried out in this research, it is found that the Luhn method performs better for most of the inputs given as it produces a higher precision, recall, and F-score; hence, it is recommended for extractive biomedical text summarization.

Table 1 Average performance measure comparison of Luhn, cosine, and frequency approach for 20 biomedical literatures

| ROUGE metric | Luhn approach | Cosine similarity | Frequency-based |
|---|---|---|---|
| Recall | 0.84 | 0.80 | 0.83 |
| Precision | 0.80 | 0.76 | 0.74 |
| F-score | 0.82 | 0.78 | 0.78 |

(The three approach columns report the average performance measure.)

Fig. 6 Visual representation of Rouge scores for biomedical literature
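The recall, precision, and F-score of Eqs. (3) to (5) can be computed for a single system/reference pair as sketched below. This is a simplified illustration, not the ROUGE toolkit used in the study: tokenization is a plain whitespace split, matching n-grams are counted with clipping (each reference n-gram can be matched at most as often as it occurs), and the name `rouge_n` is hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(system, reference, n=2):
    """Return (recall, precision, f_score) based on n-gram overlap."""
    sys_ngrams = ngrams(system.lower().split(), n)
    ref_ngrams = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((sys_ngrams & ref_ngrams).values())
    ref_total = sum(ref_ngrams.values())
    sys_total = sum(sys_ngrams.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / sys_total if sys_total else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f_score
```

For the system summary "the drug reduced fever quickly" and the reference "the drug reduced fever in children", three of the five reference bigrams are matched, giving a recall of 0.6, a precision of 0.75, and an F-score of about 0.67.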


6 Conclusion
With the plethora of textual information accessible on the Web, text summarization has emerged as a critical and appropriate tool for aiding the comprehensive interpretation of text information. Given the importance of biomedical text summarization, this research compares three approaches on 20 biomedical papers to determine the most suitable one. ROUGE metrics are used to evaluate each system's recall, precision, and F-measure relative to a reference summary. The experimental assessment shows that the Luhn approach is better; however, both Luhn and cosine produce equivalent results and much higher accuracy than the frequency-based technique. This investigation will assist NLP researchers in determining the best method for extractive text summarization and in improving current methods.

References
1. Munot N, Govilkar SS (2014) Comparative study of text summarization methods. Int J Comput Appl 102(12):33–37. https://doi.org/10.5120/17870-8810
2. Rani U, Bidhan K (2021) Comparative assessment of extractive summarization: TextRank, TF-IDF and LDA. J Sci Res 65(01):304–311. https://doi.org/10.37398/jsr.2021.650140
3. Bhalla S, Verma R, Madaan K (2017) Comparative analysis of text summarisation techniques 5(10):1–6
4. Madhuri JN, Ganesh Kumar R (2019) Extractive text summarization using sentence ranking. In: 2019 International conference on data science and communication IconDSC 2019, pp 1–3. https://doi.org/10.1109/IconDSC.2019.8817040
5. Jabar A (2020) Generating extractive document summaries using weighted undirected graph and page rank algorithm. May. https://doi.org/10.13140/RG.2.2.16261.99048
6. Gholamrezazadeh S (2009) A comprehensive survey on text summarization systems
7. Ma C, Zhang WE, Guo M, Wang H, Sheng QZ (2022) Multi-document summarization via deep learning techniques: a survey. ACM Comput Surv. https://doi.org/10.1145/3529754
8. Inderjeet M (2009) Summarization evaluation: an overview. Pflege Z 62(6):337–341
9. Saziyabegum S, Sajja PS (2017) Review on text summarization. Indian J Comput Sci Eng 8(4):497–500
10. Basheer S, Anbarasi M, Sakshi DG, Vinoth Kumar V (2020) Efficient text summarization method for blind people using text mining techniques. Int J Speech Technol 23(4):713–725. https://doi.org/10.1007/s10772-020-09712-z
11. Aone C, Okurowski ME, Gorlinsky J (1998) Trainable, scalable summarization using robust NLP and machine learning, p 62. https://doi.org/10.3115/980451.980856
12. Barzilay R, Elhadad N, McKeown KR (2002) Inferring strategies for sentence ordering in multidocument news summarization. J Artif Intell Res 17(April):35–55. https://doi.org/10.1613/jair.991
13. Barzilay R, Lee L (2003) Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In: Proceedings of the 2003 human language technology conference of the North American Chapter of the Association for Computational Linguistics HLT-NAACL 2003, June, pp 16–23
14. Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT). http://www.ihtsdo.org/snomed-ct/
15. Medical Subject Headings. http://www.nlm.nih.gov/mesh/


16. Bodenreider O (2004) The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res 32(DATABASE ISS):267–270. https://doi.org/10.1093/nar/gkh061
17. Plaza L (2014) Comparing different knowledge sources for the automatic summarization of biomedical literature. J Biomed Inform 52:319–328. https://doi.org/10.1016/j.jbi.2014.07.014
18. Nasr Azadani M, Ghadiri N, Davoodijam E (2018) Graph-based biomedical text summarization: an itemset mining and sentence clustering approach. J Biomed Inform 84(April):42–58. https://doi.org/10.1016/j.jbi.2018.06.005
19. Luhn HP (1958) The automatic creation of literature abstracts. IBM J Res Dev 2:159–165. https://doi.org/10.1147/rd.22.0159
20. Yadav VJ, Pandey TM, Rathore HM, Pandey AR (2019) Text summarization using word frequency 4(6):4–6
21. Kumar GK, Rani DM (2021) Paragraph summarization based on word frequency using NLP techniques. 060001(February)
22. Ko Y, Park J, Seo J (2004) Improving text categorization using the importance of sentences. 40:65–79. https://doi.org/10.1016/S0306-4573(02)00056-0
23. Acad U, Tianguistenco P (2020) Determining the importance of sentence position for automatic text summarization. https://doi.org/10.3233/JIFS-179902
24. Ouyang Y (2010) A study on position information in document summarization (August):919–927
25. Rautray R, Rakesh C (2015) Document summarization using sentence features. August. https://doi.org/10.4018/IJIRR.2015010103
26. Thomas JR Automatic keyword extraction for text summarization in
27. Motwani D, Saxena AS (2016) Multiple document summarization using text-based keyword extraction, pp 187–197. https://doi.org/10.1007/978-981-10-0448-3
28. Aliguliyev RM (2009) A new sentence similarity measure and sentence based extractive technique for automatic text summarization. Expert Syst Appl 36(4):7764–7772. https://doi.org/10.1016/j.eswa.2008.11.022
29. Abujar S, Hasan M, Hossain SA (2017) Sentence similarity estimation for text summarization using deep learning (February 2019)
30. Jain M (2020) Automatic text summarization using soft-cosine similarity and centrality measures, pp 1021–1028
31. Ahuja R, Anand W (2017) Multi-document text summarization using sentence extraction. In: Artificial intelligence and evolutionary computations in engineering systems, Advances in Intelligent Systems and Computing, vol 517. Springer, Singapore, pp 235–242. https://doi.org/10.1007/978-981-10-3174-8_21
32. Afsharizadeh M (2018) Query-oriented text summarization using sentence extraction technique, April. https://doi.org/10.1109/ICWR.2018.8387248
33. Gong Y (2001) Generic text summarization using relevance measure and latent semantic analysis. In: SIGIR'01, 2001
34. Mihalcea R, Tarau P (2004) TextRank: bringing order into text. 4:404–411
35. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine, vol 30
36. Barrios F, Federico L, Variations of the similarity function of TextRank for automated summarization.
37. Manalu SR (2017) Stop words in review summarization using TextRank. In: 14th International conference on electrical engineering/electronics, computer, telecommunications and information technology, pp 846–849. https://doi.org/10.1109/ECTICon.2017.8096371
38. Qaiser S, Ali R (2018) Text mining: use of TF-IDF to examine the relevance of words to documents, July. https://doi.org/10.5120/ijca2018917395
39. Lin C, Rey M (2004) ROUGE: a package for automatic evaluation of summaries, 1
40. Yadav AK, Maurya AK, Ranvijay, Yadav RS (2021) Extractive text summarization using recent approaches: a survey. Ing des Syst d'Information 26(1):109–121. https://doi.org/10.18280/isi.260112


S. LourduMarie Sophie et al.

41. Moradi M, Dashti M, Samwald M (2020) Summarization of biomedical articles using domain-specific word embeddings and graph ranking. J Biomed Inform 107(May):103452. https://doi.org/10.1016/j.jbi.2020.103452
42. Pawar S, Rathod S (2021) Text summarization using cosine similarity and clustering approach. Int J Curr Eng Technol 2020(8):669–673
43. Givchi A, Ramezani R, Baraani-Dastjerdi A (2022) Graph-based abstractive biomedical text summarization. J Biomed Inform 132(July 2021):104099. https://doi.org/10.1016/j.jbi.2022.104099
44. Uçkan T, Karcı A (2020) Extractive multi-document text summarization based on graph independent sets. Egypt Informatics J 21(3):145–157. https://doi.org/10.1016/j.eij.2019.12.002
45. Anand D, Wagh R (2022) Effective deep learning approaches for summarization of legal texts. J King Saud Univ - Comput Inf Sci 34(5):2141–2150. https://doi.org/10.1016/j.jksuci.2019.11.015

SMS Spam Detection Using Federated Learning

D. Srinivasa Rao and E. Ajith Jubilson

Abstract Despite all technological advancements, the biggest issue tech giants face is mining data while keeping user privacy intact. According to an article in Quartz, Google spent “hundreds of years of human time complying with Europe’s privacy rules”. A lot of important data cannot be accessed because of these privacy rules. Many methods are being researched to harness the power of ML while keeping privacy intact. In this work, we implemented one such method, federated learning. In the federated learning paradigm, the data is not moved off the device; instead, the model is moved to the device, where it is trained on the local data, and only the parameters are shared with the main server. This keeps the data private while still allowing it to be used to train models. This could prove to be an innovative technology in the current scenario, where most tech giants are driven by data but, because of privacy policies, are unable to make the best of the data being collected.

Keywords Spam detection · Network security · Phishing · Federated learning

1 Introduction

Short Message Service (SMS) is one of the most prevalent ways for people to communicate; messages are transmitted using standard communication protocols. Text classification algorithms are needed to classify messages as either ham or spam. Ham messages are written by genuine users, whereas spam messages are unwanted. Spam messages are typically sent by companies promoting themselves, so they must be detected and removed once they reach the mobile station. SMS spam wastes time, resources, network bandwidth, and money; however, the availability of spam filtering software for SMS is very limited.

D. Srinivasa Rao · E. Ajith Jubilson (B) VIT-AP University, Amaravathi, Andhra Pradesh 522237, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_39


Various techniques are used for SMS spam detection, including support vector machine (SVM), k-nearest neighbor (KNN), Naïve Bayes (NB), artificial neural network, decision tree, and random forest. Experiments and comparisons have been made using different datasets and different techniques. SVM and NB achieved the highest accuracy in these comparisons, whereas decision tree, logistic regression, and Bayesian classification techniques suffer from being time-consuming. SMS has become one of the most popular means of communication for millions of people worldwide because of its convenience and easy-to-use interface. One drawback of the service, however, is the unwanted spam messages that users receive. These spam messages result from the use of the service either by organizations to promote their products and services or by scammers to trick users into giving up private information in order to defraud them. Thus, there is a need for classification tools that identify whether an SMS received by a user is spam or genuine and warn the user against malicious spam messages. Although many classification models exist in the industry, they need access to users' SMS data, which is another significant concern today. A lot of attention is currently being given to user privacy, and a great deal of research is under way on privacy-preserving techniques and standards. Federated learning is one such privacy-preserving paradigm. Although it is at an early stage, it appears promising for protecting user privacy because of the decentralized nature of its ML approach.

Considering the above two points, we have built an SMS spam detector application using ReactJS in which users can send messages to one another. The key feature of the application is that if a user receives a message that our application considers spam, the user is warned in advance. The spam detection model is trained using the Gated Recurrent Unit (GRU) architecture in a simulated federated environment provided by the PySyft framework.

1.1 Objectives

To design a messaging application that allows different users to interact with one another, classifies messages received by a user as spam or genuine, and gives users the option to manually mark an SMS as spam or genuine if the model has not classified it correctly. To build a deep learning model using the Gated Recurrent Unit architecture that can classify messages as spam or genuine, and to train this model on the SMS Spam Detection Dataset using differential privacy and privacy-preserving methods in a federated setup. We will deploy our trained model as an API and integrate it into our messaging application as a message classification service, evaluate the performance of the trained model, and discuss the trade-offs between model performance and privacy protection costs.

2 Background and Literature Survey

Venturing into the field of federated learning (FL) was a learning process for us, so before beginning work on the application we wanted a good understanding of how FL actually works and what existing work has been done in the field. We therefore consulted several research papers and articles that helped us learn the concept and its practical implementation. The first article we referred to was “Federated Learning with PySyft” by Saransh Mittal [1]. It introduces FL and one of its major frameworks, PySyft, discusses the applications and working procedure, and gives a detailed guide to implementing the FL approach in practice with PySyft through a demo project on the MNIST dataset. Another paper we studied was “Federated Learning: Challenges, Methods, and Future Directions” by Li et al. [2], which discusses the unique characteristics and challenges of FL, provides a broad overview of current approaches, and outlines several directions of future work that are relevant to a wide range of research communities. We also went through “SMS Spam Detection Based on Long Short-Term Memory and Gated Recurrent Unit” by Poomka et al. [3] to understand existing deep learning tools for spam detection. That paper discusses two deep learning algorithms, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), explains the methodology for building a spam detection system using deep learning together with the necessary preprocessing techniques for text mining, and gives a detailed comparison of the performance of both algorithms. Based on its findings, we chose GRU as the primary algorithm for spam detection in our task.

Cormack et al. investigated content-based spam filtering for short text messages. Such messages arise in three contexts: SMS, blog comments, and email summary information such as a low-bandwidth client might display. The main takeaway is that short messages contain too few words to properly support spam classifiers based on a bag of words or word bigrams. As a result, filter performance suffered, although it improved significantly when the feature set was expanded to include character bigrams and trigrams as well as orthogonal sparse word bigrams. The technique based on Dynamic Markov Compression outperformed all other approaches on short messages and message fragments.
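To illustrate the character-level features mentioned above, the following minimal Python sketch counts character bigrams and trigrams in a message. The function names `char_ngrams` and `ngram_features` are hypothetical and are not taken from Cormack et al.; lower-casing the text is an assumption made here for simplicity.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of `text` (empty if too short)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_features(text: str, sizes: tuple[int, ...] = (2, 3)) -> Counter:
    """Count character n-grams of the given sizes in a lower-cased message."""
    normalized = text.lower()  # assumption: case is not informative
    features: Counter = Counter()
    for n in sizes:
        features.update(char_ngrams(normalized, n))
    return features


# Even a two-word message yields a usable number of features.
features = ngram_features("Win cash")
```

Because every position in the message contributes a feature, a very short message still produces many more character n-grams than it has words, which is one plausible reading of why the expanded feature set helped.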


3 Federated Learning

3.1 Federated Learning Overview

Federated learning is a new machine learning (ML) paradigm that allows numerous devices to train models locally and then combine them to produce a global model without disclosing the clients' data. An FL system can be thought of as a large-scale distributed system made up of several components and participants with varying needs and constraints. As a result, building an FL system requires expertise in both software system architecture and ML. The increasing use of industrial-scale IoT platforms and smart devices has led to exponential growth in data volumes [4], enabling AI and ML research and applications. However, the advancement of AI and ML has heightened data privacy concerns, and the General Data Protection Regulation (GDPR) mandates several data protection measures that many of these IoT platforms and smart device systems must adhere to. This is especially challenging for ML systems because the data available for model training is often insufficient, and such systems regularly suffer from data hungriness. Because data privacy is currently the most important ethical principle of ML systems [5], a solution is required that can provide enough data for training while protecting the privacy of data owners. McMahan et al. proposed FL [6] in 2016 as a solution to this problem. FL is a type of distributed ML that allows model training over a large number of client devices. A key feature of FL is that models are trained on locally collected data, without sending the data out of the client devices. A global model is created on a central server and distributed to the participating client devices for local training. The central server collects the locally trained model parameters and aggregates them to update the global model parameters. For the next training round, the global model parameters are distributed once more. Each local training round typically performs a step of gradient descent. Figure 1 depicts a high-level overview of the FL procedure.
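The training round described above can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not the implementation used in this work: the model is a single weight w in y = w * x, each client takes one gradient-descent step on its own data, and the server averages the results weighted by local dataset size, in the spirit of FedAvg [6]. The names `local_update` and `federated_round` and the sample data are illustrative.

```python
def local_update(w: float, data: list[tuple[float, float]], lr: float) -> float:
    """One gradient-descent step on mean squared error for the model y = w * x."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad


def federated_round(w_global: float,
                    client_data: list[list[tuple[float, float]]],
                    lr: float = 0.05) -> float:
    """Send the global weight to every client, then average the local results."""
    updates = [(local_update(w_global, data, lr), len(data)) for data in client_data]
    total = sum(n for _, n in updates)
    # Weighted average: clients with more examples contribute more.
    return sum(w * n for w, n in updates) / total


# Two simulated clients whose private data follows y = 2 * x.
clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
w = 0.0
for _ in range(50):
    w = federated_round(w, clients, lr=0.05)
```

Only the scalar weight crosses the client boundary in this sketch; the (x, y) pairs never leave the list that represents each client.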

3.2 Federated Learning Reference Architecture

FLRA is a pattern-oriented reference architecture for FL systems that we present here. The complete reference architecture is depicted in Fig. 2. The foundation of an FL system contains two basic components: (1) a central server and (2) client systems. The central server initiates an ML task and supervises the federated training process, while the client systems train models using local data and computation resources. We summarize all of the reference architecture's mandatory and optional components together with the functionality and responsibility of each, and describe each component in relation to the steps of the FL pipeline.


Fig. 1 Overview of federated learning [7]

Fig. 2 Federated learning reference architecture


Job Creation

The FL process begins when the job creator on the central server creates a model training job (containing the initial model and training configurations). Three optional components can be considered within the job creation component: the client registry (client database), the client cluster, and the client selector. In an FL system, client devices may be owned by different parties and constantly join and leave the system. As a result, keeping track of all participating client devices, including dropouts and dishonest devices, is difficult. This differs from distributed or centralized ML systems, in which a single party typically owns and manages both the clients and the server [8]. A client registry is required to keep track of all the information about the registered client devices (e.g., ID, local model performance, resource information, number of participating rounds). Because the client registry lets the system manage devices efficiently and promptly detect problematic ones, both the IBM FL framework and doc.ai included it in their designs to increase system maintainability and reliability. Another FL benchmarking and simulation framework has included a client manager module that performs the same role as the client registry. A consequence of this design is that device information is recorded on the central server. Model performance is also hampered by the characteristics of local raw data and by the data-sharing restriction [6, 9–11]. A client cluster module can be added to improve the global model's generality and speed up model convergence by grouping client devices according to their data distribution, gradient loss, and feature similarities. The IFCA algorithm, the TiFL system [12], and the patient system at Massachusetts General Hospital all use this design. A side effect of the client cluster is the additional computation cost of quantifying client relationships.
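A client registry of the kind described above could be sketched as follows. This is a hypothetical in-memory structure written for illustration only; it does not reproduce the design of the IBM FL framework or doc.ai, and the field and method names are assumptions chosen to mirror the items listed in the text (ID, local model performance, number of participating rounds).

```python
from dataclasses import dataclass


@dataclass
class ClientRecord:
    """Information the server keeps about one registered client device."""
    client_id: str
    local_accuracy: float = 0.0
    rounds_participated: int = 0
    active: bool = True


class ClientRegistry:
    """Tracks registered devices, their participation, and dropouts."""

    def __init__(self) -> None:
        self._records: dict[str, ClientRecord] = {}

    def register(self, client_id: str) -> ClientRecord:
        # Registering an existing client returns its current record unchanged.
        return self._records.setdefault(client_id, ClientRecord(client_id))

    def record_round(self, client_id: str, local_accuracy: float) -> None:
        record = self._records[client_id]
        record.local_accuracy = local_accuracy
        record.rounds_participated += 1

    def mark_dropout(self, client_id: str) -> None:
        self._records[client_id].active = False

    def active_clients(self) -> list[str]:
        return sorted(cid for cid, rec in self._records.items() if rec.active)
```

A client selector could then be built on top of `active_clients()`, for example by filtering on `local_accuracy` or `rounds_participated`.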
The large number of client devices that interact with the central server are both statistically and systemically heterogeneous. The number of client devices is also several times larger than in distributed ML systems [9]. To improve model and system performance, a client selector component can be used to select client devices according to predetermined criteria (e.g., resource availability, data, or performance). This has been taken into account in Google's FedAvg [6] and IBM's Helios [13] algorithms.

Model Deployment

The global model evaluator evaluates the global model's performance once it has been aggregated. TensorFlow Extended (TFX), for example, includes a model validator function for evaluating the performance of FL models. If the global model performs well, the model deployer component deploys it to the client devices, where the decision-maker component uses it for decision-making. TensorFlow Lite, for example, prepares the final validated model for deployment to client devices for data inference. Two optional components are available within the model deployer component: the incentive registry and the deployment selector. The deployment selector component examines the client devices and chooses which ones should receive the global model depending on their data or applications. The deployment selector pattern has been employed in Azure ML, Amazon SageMaker, and Google Cloud to increase model performance. To encourage clients to contribute to the training, the incentive registry component keeps track of all client devices' rewards based on individual contributions and agreed-upon rates. FLChain and DeepChain both use blockchain to implement incentive registries.

Model Aggregation

Based on the collected local models, the model aggregator creates a new global model. Four optional aggregator-related components are available within the model aggregator component: the secure aggregator, asynchronous aggregator, decentralized aggregator, and hierarchical aggregator. A secure aggregator component uses multiparty computation protocols, such as cryptographic or differential privacy approaches, to prevent adversarial parties from accessing the models during model exchanges. These methods provide security proofs, ensuring that each participant is aware only of its own input and output. Communication security between clients and servers is not a major concern in centralized and distributed ML environments that use centralized system orchestration. HybridAlpha [14] and the TensorFlow Privacy library, on the other hand, apply these best practices in FL environments. The asynchronous aggregator component allows the global model to be aggregated asynchronously whenever a local model update is received; ASO-fed, AFSGD-VP, and FedAsync are examples of asynchronous aggregators. Similar strategies have been used in distributed ML approaches such as iHadoop and have been shown to reduce overall training time. The traditional design of an FL system, which depends on a central server to orchestrate the learning process, may result in a single point of failure. To improve system reliability, a decentralized aggregator performs model exchanges and aggregation in a decentralized way. BrainTorrent and FedPGA are two examples of decentralized aggregators in use. Blockchain can also be used as a decentralized approach for FL systems.
In distributed ML systems, a peer-to-peer (P2P) network topology is used in MapReduce [8] to reduce the single-point-of-failure risk on parameter servers. In addition to the aggregator-related optional components, a model co-versioning registry component can be integrated into the model aggregator component to map all local models to their respective global models. This enhances system accountability and enables model governance. The model co-versioning registry pattern is derived from the version control approaches of DVC, Replicate.ai, and Pachyderm.

Data Collector and Preprocessor

Because of the data-sharing constraint, each client device collects data using its own sensors through the data collector component and processes the data locally through the data preprocessor component (i.e., data cleaning, feature extraction, augmentation, etc.). This is in contrast to centralized or distributed ML systems, where non-IID data is normally shuffled and processed on a central server. As a result, an optional heterogeneous data handler is used within the data preprocessor to cope with non-IID and skewed data distribution issues using data augmentation approaches. Astraea, the FAug technique, and the Federated Distillation (FD) method are some of the known uses of this component.
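To give an intuition for the secure aggregator discussed under Model Aggregation, the sketch below uses pairwise additive masking: every pair of clients shares a random mask that one adds and the other subtracts, so the masks cancel in the sum and the server sees only the aggregate. This is a toy illustration and not the protocol used by HybridAlpha or the TensorFlow Privacy library. In a real protocol each pair of clients would agree on its mask privately; here a single seeded generator stands in for that step, and the function name `masked_updates` is hypothetical.

```python
import random


def masked_updates(updates: list[float], seed: int = 0) -> list[float]:
    """Hide each client's update behind pairwise masks that cancel in the sum."""
    rng = random.Random(seed)
    masked = list(updates)
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            mask = rng.uniform(-100.0, 100.0)  # stand-in for a secret shared by i and j
            masked[i] += mask  # client i adds the shared mask
            masked[j] -= mask  # client j subtracts the same mask
    return masked


updates = [0.1, 0.2, 0.3]          # true local updates, never sent in the clear
masked = masked_updates(updates)   # what the server would receive
aggregate = sum(masked)            # equals sum(updates) up to rounding error
```

The individual entries of `masked` reveal little about the corresponding entries of `updates`, yet the server still recovers the sum it needs for averaging.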


Model Training

After the client receives the job from the central server, the model trainer component trains the model according to the configured hyperparameters (learning rate, number of epochs, etc.). In McMahan's standard FL training procedure [6], only the model parameters (i.e., weights or gradients) are sent from the central server; in this reference architecture, however, the model comprises not only the model parameters but also the hyperparameters. A multi-task model trainer component can be used to train task-related models in multi-task ML scenarios to improve model performance and learning efficiency. Multi-task learning is an ML technique for transferring and sharing information while training individual models. It improves model generalization by using the domain knowledge contained in the parameters of related tasks as an inductive bias. This is accomplished by learning tasks in parallel while retaining a shared representation, so that what is learned for one task can help in learning other tasks. This strategy is especially useful in FL scenarios that deal with non-IID data, as it can lead to personalized models that outperform the best feasible shared global model [9]. Google's multimodal architecture and Microsoft's MT-DNN architecture were used to identify this best-practice solution.

Model Evaluation

The local model evaluator component measures the performance of the local model and, if the performance criteria are met, uploads the model to the model aggregator on the central server. In distributed ML systems, the performance of client devices is not evaluated locally; instead, only the aggregated server model is evaluated. In FL systems, local model performance evaluation is necessary for system processes such as model co-versioning, client selection, contribution calculation, incentive provision, and client clustering.

Model Uploading

For model aggregation, the qualified local model parameters or gradients are uploaded to the central server. The cost of transmitting model parameters or gradients between bandwidth-limited client devices and the central server is high once the system scales up, unlike in centralized ML systems, which perform model training on a central server, or distributed ML systems, which deal with a relatively small number of client nodes [9]. To boost communication efficiency, a message compressor component can be introduced.

Model Monitoring

After the models have been deployed for real data inference, a model monitor continuously monitors model performance. If performance drops below a predefined threshold value, the model replacement trigger component notifies the model trainer to refine the model locally or sends an alert to the job creator to generate a new model. Microsoft Azure ML Designer, Amazon SageMaker, and Alibaba ML Platform are all known to use the model replacement trigger pattern.
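The threshold check performed by the model monitor can be sketched as below. This is a minimal illustration under assumptions that the text does not specify: performance is a single accuracy-like score, and the trigger fires when the average over a small sliding window falls below the threshold. The class name `ModelMonitor` and the window size are hypothetical.

```python
from collections import deque


class ModelMonitor:
    """Signals that a model should be replaced when recent performance is too low."""

    def __init__(self, threshold: float, window: int = 3) -> None:
        self.threshold = threshold
        self.recent: deque[float] = deque(maxlen=window)

    def observe(self, score: float) -> bool:
        """Record a new score; return True when replacement should be triggered."""
        self.recent.append(score)
        window_full = len(self.recent) == self.recent.maxlen
        average = sum(self.recent) / len(self.recent)
        # Wait for a full window so one noisy score does not trigger retraining.
        return window_full and average < self.threshold


monitor = ModelMonitor(threshold=0.90, window=2)
signals = [monitor.observe(score) for score in (0.95, 0.93, 0.80)]
```

Averaging over a window is only one possible design; a production monitor might instead compare against a baseline or use a statistical drift test.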


4 Proposed System Architecture-Federated Learning

The proposed solution is split into two parts: a messaging application and a backend that includes Flask APIs and the FL architecture.

4.1 Application Development for Messaging

Spam prediction is performed through a Flask-based API, using the locally trained model provided by our FL network. For each message, the API returns a flag indicating whether or not the message is spam. A message flagged as spam is shown with a red border when it is delivered on the selected channel to the other members of that channel. The message and the flag are also saved in the device's local storage. If our model misclassifies a message, the user can manually label it as spam or not spam, and the message is then saved in local storage with the corrected flag. The data collected in this way is used to train the local model, and only the updated parameters are sent to the master server at sync time, when the master parameters are updated according to the device weights and shared with the local devices.
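The local storage behaviour described above (save each message with its flag, let the user correct the flag, then reuse the stored data for local training) can be sketched as follows. This is a hypothetical in-memory stand-in, not the storage code of the application; the names `StoredMessage` and `LocalStore` are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class StoredMessage:
    text: str
    is_spam: bool            # flag returned by the classification API
    corrected: bool = False  # True once the user has overridden the flag


class LocalStore:
    """Keeps messages and their spam flags on the device."""

    def __init__(self) -> None:
        self.messages: list[StoredMessage] = []

    def save(self, text: str, is_spam: bool) -> int:
        """Store a message with the predicted flag and return its index."""
        self.messages.append(StoredMessage(text, is_spam))
        return len(self.messages) - 1

    def correct(self, index: int, is_spam: bool) -> None:
        """Apply a manual correction made by the user."""
        message = self.messages[index]
        message.is_spam = is_spam
        message.corrected = True

    def training_examples(self) -> list[tuple[str, int]]:
        """Return (text, label) pairs for the next local training run."""
        return [(m.text, int(m.is_spam)) for m in self.messages]


store = LocalStore()
idx = store.save("You won a free prize, reply now", is_spam=False)
store.correct(idx, is_spam=True)  # the user marks the missed spam
```

Only the labels derived from this store feed local training; in the federated setup the stored text itself stays on the device.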

4.2 Backend

To train our spam detection model we first needed a dataset, and this data has to be preprocessed. The dataset is passed through various filters and is tokenized, sequenced, truncated, and padded. Once the data is preprocessed, the model is trained on it and the initial parameters are calculated. These parameters are then shared with the virtual workers, which in the prototype are simply servers running on localhost. The training procedure for the Gated Recurrent Unit (GRU) model architecture is as follows:
1. The model and its parameters are shared with the virtual workers, and the model then classifies messages based on the master server's initial parameters.
2. The model is trained locally on the devices at a predetermined time with the data kept in local storage.
3. At sync time, the master server is informed of each device's weight and updated parameters.
4. The parameters are aggregated according to the device weights (in this case all devices are given equal weights).
5. The updated parameters are then shared with the devices, allowing the model to use the data more efficiently while still adhering to privacy rules.
6. The process is repeated at set intervals.
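The preprocessing mentioned at the start of this section (tokenizing, sequencing, truncating, and padding) can be sketched with the standard library alone. This is an illustrative sketch, not the pipeline used in this work: the tokenizer regex, the reserved ids 0 (padding) and 1 (unknown word), and padding at the end of the sequence are all assumptions.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(messages: list[str]) -> dict[str, int]:
    """Assign an integer id to every token; 0 is padding, 1 is unknown."""
    vocab: dict[str, int] = {}
    for message in messages:
        for token in tokenize(message):
            vocab.setdefault(token, len(vocab) + 2)
    return vocab


def encode(text: str, vocab: dict[str, int], max_len: int) -> list[int]:
    """Convert text to a fixed-length id sequence (truncate, then pad with 0)."""
    ids = [vocab.get(token, 1) for token in tokenize(text)][:max_len]
    return ids + [0] * (max_len - len(ids))


vocab = build_vocab(["Free prize now", "see you now"])
short = encode("Free now please", vocab, max_len=5)
long_ = encode("free prize now see you now", vocab, max_len=4)
```

Fixed-length sequences of this kind are what an embedding layer followed by a GRU would typically consume in batches.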


4.3 Working Methodology

In this work, the SMS spam classification model (Fig. 3) is built on a deep learning algorithm, GRU, and the FL paradigm. We used NLP methods to preprocess the SMS text into sequences through word tokenization, padding, truncation, and word embedding. We distributed the preprocessed data to multiple virtual devices to simulate a federated setup. We then trained our model using FL techniques on the virtual workers and saved the final master model to be deployed as an API service. Based on these results and observations, we can infer that we successfully trained our model in a virtual federated setup without compromising the privacy of our data, and we were thus able to maintain good performance while using data protection methods. Furthermore, we created a messaging application that lets users communicate with one another and integrated our model's API service with the application to check each message received by a user. If a message is marked as spam, the application warns the user in advance by showing the message inside a red border. In addition, if a user finds that a message is spam even though our model marked it as genuine, or vice versa, the user can re-mark the message, and the correction is sent to our database, which is hosted on MongoDB Atlas.

Fig. 3 System architecture

Standards

The standards used in this project are:
HTTP: All communication between the frontend and backend is carried over HTTP. Protecting this traffic against interception additionally requires running HTTP over TLS (HTTPS).
CORS: Cross-origin resource sharing (CORS) is a mechanism that allows restricted resources on a web page to be requested from a domain other than the one from which the first resource was served. Restricting the allowed origins helps keep our data safe from unauthorized web clients.
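For reference, the GRU cell on which the classifier in this section is based can be written as the following update equations. The formulation below follows the convention used in the PyTorch documentation; the symbols (input x_t, hidden state h_t, weight matrices W and biases b) are notation introduced here and do not appear in the original text, and other references swap the roles of z_t and 1 - z_t.

```latex
\begin{aligned}
r_t &= \sigma\left(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}\right) \\
z_t &= \sigma\left(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}\right) \\
n_t &= \tanh\left(W_{in} x_t + b_{in} + r_t \odot \left(W_{hn} h_{t-1} + b_{hn}\right)\right) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
```

Here r_t is the reset gate, z_t the update gate, n_t the candidate state, σ the logistic sigmoid, and ⊙ the element-wise product. For spam classification, the final hidden state would typically be passed to a small output layer that produces the spam probability.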

4.4 System Details Software Details: The technologies used to build this system are React, NodeJS, ExpressJS, PySyft and PyTorch framework, Flask, and MongoDB.

4.5 Messaging Application

The frontend of the messaging application is built using ReactJS, while the backend is built using NodeJS and ExpressJS. The backend is connected to MongoDB, which is hosted on Atlas clusters.

Developing Mobile Application

A React boilerplate project was created with Create React App. The user first logs in through an authentication page (Fig. 4). After successfully logging in, the user is redirected to the conversation channel (Fig. 5). Messages that the model predicts as spam are bordered in red; if a message is incorrectly predicted, the user can mark it correctly (Fig. 6). Once the frontend was developed, the backend was developed using NodeJS and an Express server for user authentication and message transfer. The APIs developed included a message posting API, a message retrieval API, and a user authentication API.

4.6 Database

To simulate local storage, the project uses a cloud database. MongoDB Atlas is a global cloud database service for modern applications that deploys fully managed MongoDB across AWS, Google Cloud, and Azure, with automation and practices intended to provide availability, scalability, and compliance with demanding data security and privacy standards. Because of these features, we deployed our database on MongoDB Atlas to make our application globally accessible over the internet. The database was hosted as a three-member cluster so that the data is preserved in replicas (Fig. 7).

Connecting Application with MongoDB

The database was configured in the cloud, and the mongoose library was then used to connect it to the Node application. The configuration keys were added for authentication, and once the secured connection was established, the application could connect to the database (Fig. 8).

Fig. 4 Login page

Spam Detection Service

After training, the model was deployed and accessed through the Flask API. Whenever a user sent a message, the API was called and returned a flag stating whether the message was spam or not (Fig. 9).

5 Results

The developed APIs included a user login API, a message posting API, and a message retrieval API. MongoDB Atlas, the global cloud database service for modern applications, deploys fully managed MongoDB across AWS, Google Cloud, and Azure with automation and compliance with stringent data security and privacy regulations. The database was set up in the cloud, and the mongoose library was used to link it to the Node application. Once the configuration keys for authentication had been entered and the secured connection had been established, the application could connect to the database.

Fig. 5 Chat user interface

Fig. 6 User correction

The model was trained and then deployed and made available via the Flask API. When a user sent a message, the API was called and returned a flag indicating whether or not the message was spam. The model was trained successfully on the initial data and gave an Area Under the Curve (AUC) value of close to 97% after 15 epochs (Fig. 10).
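For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen spam message receives a higher score than a randomly chosen genuine one. The sketch below computes it directly from that definition. It is an illustration with made-up scores, not the evaluation code or data behind the reported result, and the function name `auc` is hypothetical.

```python
def auc(labels: list[int], scores: list[float]) -> float:
    """Area under the ROC curve via pairwise comparison (1 = spam, 0 = genuine)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one spam and one genuine example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # spam ranked above genuine: correct ordering
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: one spam message (0.2) is ranked below a genuine one (0.3).
example = auc([1, 0, 1, 0], [0.9, 0.1, 0.2, 0.3])
```

The pairwise form is quadratic in the number of messages, which is fine for a sketch; library implementations use a sort-based computation instead.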

Fig. 7 Replica sets

Fig. 8 Mongo Atlas Console

Fig. 9 Response from flask API


Fig. 10 Model performance

6 Conclusion and Future Scope

In this work, we have presented a novel technique for spam detection using FL. The work also serves as a case study in applying deep learning while adhering to privacy laws: the SMS spam classification model relies on deep learning algorithms trained with privacy-preserving techniques. In the future, we plan to improve the performance of the model by using a more specific method for model aggregation in FL and by considering other aspects of federated learning that are currently regarded as challenges for building a robust and secure federated learning application. This model could also help in building admin panels that do not access clients' information, which supports end-to-end privacy, expected to be a key requirement of future customer applications.


Data Extraction and Visualization of Form-Like Documents Dipti Pawade, Darshan Satra, Vishal Salgond, Param Shendekar, Nikhil Sharma, and Avani Sakhapara

Abstract While the world moves toward digitization and most vital information is stored, processed, and understood in computers, there are still organizations, businesses, and companies serving employees and audiences that prefer physical forms to online ones. In addition, there are forms and documents that are filled in online as a Word document or PDF, such as NDAs and resumes. Manually going through each physical or digital copy of a form and then typing the data into an Excel sheet or database costs many man-hours, much energy, and considerable computer resources. Our solution, Form Analyzer, aims to solve this problem by being a one-stop platform where a large batch of forms is accepted, segmented, and parsed, and the extracted data are stored and visualized. The system considers physical forms that are filled in manually and soft copy (or digital) forms that are filled in by typing and then printed. A data visualization dashboard helps the organization gain insights from the available information, which lowers time and resource consumption. The system also analyzes the sentiment of the comments or information provided in the form, which can be helpful for further strategic planning of the organization. The paper discusses the architecture, working, and performance evaluation of the system on handwritten as well as typed forms.

D. Pawade · D. Satra · V. Salgond · P. Shendekar (B) · N. Sharma · A. Sakhapara
Department of Information Technology, K. J. Somaiya College of Engineering, Vidyavihar, Mumbai, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023. N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3_40


D. Pawade et al.

Keywords Boundary recognition · Form processing · Data extraction · Data visualization · Sentiment analysis

1 Introduction
Technologies have been evolving, and so have the methods of completing work. People organize themselves so as to gain an advantage over others by choosing the right technology and using it correctly. Data are now entered, stored, and processed digitally, but there is still traditional paperwork that needs to be examined. One example is the form documents that people fill out in offices, government organizations, etc. The current method involves a person manually going through each form, reading the content, and simultaneously typing it into a database. Repeated for hundreds or thousands of forms, this is not only redundant manual labor but also a waste of time, energy, and money spent on human as well as compute resources [1]. To address this issue, a system is needed that removes this process entirely and reduces it to a few button clicks. A market and literature survey showed that such a solution is needed by companies in a wide range of domains, including education and marketing [2, 3]. Handwritten forms are plentiful, and organizations also use editable documents, PDFs, etc., online to have customers fill in the required data. Examples of handwritten forms are government office forms, bank forms, restaurant review forms, and faculty feedback forms, while forms edited online include non-disclosure agreements, curricula vitae in specifically required templates, and passport application forms. All of these data need to be extracted, stored, and understood to be useful; otherwise, they remain unstructured or semi-structured data. This paper presents a system that accepts scanned copies of handwritten as well as printed forms, extracts data from them, and then stores the data in a database.
The algorithm follows a series of steps to segment the document and extract the key and value pair for each field; a different technique is employed for each type of value. For ease of data interpretation, a data visualization dashboard is provided. The descriptive section of a form, where a person writes comments, reviews, or suggestions, remained a challenge. To address it, the polarity of the comments/reviews is predicted using sentiment analysis. For evaluating the algorithm, a custom self-prepared dataset containing both scanned and unscanned images was used, and accuracy was determined by matching the extracted data with the actual data. The paper closes with the future scope for making the platform better and more robust. The major objectives of this research are as follows:
Obj 1: Extracting information from bulk forms in one click, which reduces time and manpower.
Obj 2: Providing a user-friendly data visualization facility.
Obj 3: Analyzing the descriptive content in the form and predicting the sentiment polarity.


2 Previous Work
Much research has been done in the fields of boundary detection, area segmentation, optical character recognition (OCR), sentiment analysis, and data visualization, and researchers have built upon it over the years to provide better systems. However, these systems were often found to be monetized, and there were hardly any open-source solutions that were iteratively improved by the community. Form segmentation and boundary detection are crucial steps when extracting data from a document. Srihari et al. [4] discuss line segmentation based on a clustering algorithm, hand-print segmentation by iterative segmentation, and machine-print segmentation by analyzing font spacing and standard font sizes. Ebin Zacharias et al. [5] present an approach of segmentation followed by OCR using Tesseract in two steps: detection of the text area to increase the accuracy of the Tesseract OCR engine, and OCR using Tesseract V5. Our system takes motivation from this paper, as it uses contours to obtain the bounding box and then applies OCR. Preprocessing is of utmost importance for accurate results. Vaishali Aggarwal et al. [6] describe how to preprocess the image and segment characters. The main idea of their paper is also to extract data from a form, but it uses a one-vs-all logistic regression model. Sharma et al. [7] address the issue of text overlapping with the text-field boundary. Their paper presents an algorithm for distinguishing between the text-field boundary and overlapping written text, and its focus on boundary detection sheds light on how to improve that process. There are also CNN approaches that can detect features in a form. Pawel Forczmanski et al. [8] discuss in depth the detection of different elements in a document, such as stamps, logos, printed text blocks, signatures, and tables, using the YOLOv2 architecture. Although our paper did not follow the CNN approach, it could increase the scope of the system if implemented. OCR is the most important step in extracting data from a form, and several ready-made solutions exist, such as Tesseract, Google Document AI, and Amazon Textract. Thomas Hegghammer et al. [9] ran a benchmarking experiment in which around 14,000 English-language documents and 4,400 Arabic documents were prepared to test Tesseract, Document AI, and Amazon Textract. The results for English showed that Document AI and Textract perform almost the same while Tesseract lags behind; for Arabic, Document AI is much better, since Textract does not support Arabic. Based on this experiment, Tesseract was chosen for typed forms and Amazon Textract for handwritten forms. Kashif Imran et al. [10] give a detailed summary of every feature of Amazon Textract and the wrappers created around those features for use in various languages, which helped us considerably in implementing Textract for handwritten text in our pipeline. After reading the paper by Ray Smith et al. [11], which describes Tesseract and how it performs OCR, it was decided to use Tesseract for typed forms only, as it does not have good accuracy for handwritten text.

3 Methodology
The proposed system accepts scanned forms and processes them. For experimentation, we considered two types of form. The first is the printed form, where the information is filled in on a computer and a printout is then taken (printed/typed form); the second is the form filled in by hand (handwritten form). For both types, a scanned copy of the form is the input to the system. To analyze system performance, we created our own dataset for each type. The typed form generation process was automated using Excel and the Mail Merge feature of Microsoft Word [12]. This allowed us to create as many forms as needed once there was an Excel representation of the data that goes into each form. For handwritten forms, 100 copies of the template were taken, and each form was filled in by hand in bold, capitalized, and separated lettering. Each dataset therefore contains scanned copies of 100 forms, with a size of around 300–500 KB per form. Figures 1 and 2 show the sample forms. Figure 3 depicts the workflow of the system. The system is broadly divided into three modules: image preprocessing and segmentation, text extraction, and sentiment analysis.

Fig. 1 Scanned copy of printed form


Fig. 2 Scanned copy of handwritten form

3.1 Image Preprocessing and Segmentation
Image preprocessing cleans the image and improves its quality so that it becomes easier to analyze and process. It is a crucial step in any image processing task. In this system, the input image is resized to a width of 720 px and a height of 1080 px. This standardization of image size helps in filtering out erroneous boxes that could otherwise be recognized (explained in the subsequent steps). After the image has been resized, the following preprocessing steps are performed on the image:


Fig. 3 Methodology flowchart

1. The raw image is converted into a grayscale image. Grayscale images have a single channel in which each pixel carries information only about the amount of light and not the actual color.
2. After that, an inverse binary threshold is applied. In inverse binary thresholding, the destination pixel is set to zero if the original pixel value is greater than the threshold value; otherwise, it is set to a maximum value. In our case, the threshold value is 0 and the maximum value is 255.
3. The image is then blurred using the Gaussian blur. The Gaussian blur, also known as Gaussian smoothing, is widely used in image preprocessing to remove random noise that might be present in the image.


Fig. 4 Image generated after preprocessing

4. Finally, a binary threshold is applied to the image. This is the opposite of the inverse binary threshold used earlier: the destination pixel is set to the maximum value if the original pixel value is greater than the threshold; otherwise, it is set to zero. Again, the threshold value and maximum value were set to 0 and 255 in our case.

The segmentation module [13, 14] takes the preprocessed image of the form as input and fetches all the keys and values from it. These keys and values then act as input to the text extraction module. The input to this module is the image of the form in jpg/png format (a sample image is shown in Fig. 4). After receiving the input, the following steps are carried out:
1. After the preprocessing is complete, the next step is to find the contours present in the image. A contour is a curve that joins all the continuous points along a boundary having the same color or intensity. The result of this step is a list of contours containing the coordinates of the shapes that were formed. Note that a grayscale image is required to find the contours, because in a colored image it becomes difficult to detect the borders of an object properly and the difference in intensity between pixels is not well defined.
2. Using the contours that define the exact points surrounding a field, bounding rectangles are formed around them. The X coordinate, Y coordinate, width, and height are taken into account when forming these bounding rectangles.
3. Bounding rectangles around keys add redundancy, so a way to take only value fields into consideration was devised. The bounding rectangles around key fields are filtered out, and the rectangles that remain are the ones of interest. Testing on many different sizes and templates showed that an area of 1000 pixels is a general value that represents the size of value fields well on most forms; hence, the filter keeps boxes with an area (width × height) greater than 1000 pixels. For this to work, the image must first be resized to the standard size described at the beginning of this section.
4. After filtering out the unnecessary boxes, what remains are boxes containing value fields. The forms were designed in such a way that every key is horizontally aligned with the corresponding value box. The image formed at this step is depicted in Fig. 5.
5. The key field is on the left side, while the value field is on the right inside a box. To separate the two, the image is split into two parts, where the left part contains the key and the right part contains the value. The reference line for splitting every such pair is the left-hand boundary of the value field box. In the case of checkboxes (Fig. 6), the first box is used for this split.
6. The key and value are then passed to the text extraction module.
7. The process is slightly different for values of checkbox type because their layout differs from that of other value fields. Here, there is not a single value box but a group of boxes, exactly one of which contains a tick. In this case, an image is first created for each option in the checkbox, and the option containing the tick is returned. The returned option is the one whose box has the maximum pixel sum.

Fig. 5 Cropped image of key–value pair

Fig. 6 Cropped image of checkbox type field
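The thresholding rules and the segmentation filters described above can be sketched in a few lines. This is a dependency-free illustration on nested lists rather than the authors' implementation; in practice an image library such as OpenCV would normally supply thresholding, contour finding, and bounding rectangles. All function names here are hypothetical.

```python
def threshold(gray, thresh=0, max_value=255, inverse=False):
    """Binary (or inverse binary) threshold of a 2-D list of gray levels."""
    def convert(pixel):
        above = pixel > thresh
        if inverse:
            return 0 if above else max_value
        return max_value if above else 0
    return [[convert(p) for p in row] for row in gray]


def filter_value_boxes(boxes, min_area=1000):
    """Keep bounding boxes (x, y, w, h) whose area exceeds `min_area`."""
    return [b for b in boxes if b[2] * b[3] > min_area]


def split_key_value(row_image, value_box_x):
    """Split a row at the left edge of the value box: (key part, value part)."""
    key = [line[:value_box_x] for line in row_image]
    value = [line[value_box_x:] for line in row_image]
    return key, value


def ticked_option(option_images):
    """Index of the checkbox crop with the largest pixel sum."""
    sums = [sum(sum(line) for line in img) for img in option_images]
    return sums.index(max(sums))


# Hypothetical boxes: a small key-text box and a larger value box.
boxes = [(10, 10, 40, 12), (200, 10, 300, 30)]
print(filter_value_boxes(boxes))  # only the 300 x 30 box passes the area filter
```

The area filter only makes sense on images resized to a common size, which is why the fixed 720 × 1080 resize precedes it: the same 1000-pixel cut-off would mean different things on scans of different resolutions.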

3.2 Text Extraction
As the name suggests, the text extraction module is responsible for extracting text from the input image. The input to this module is a cropped image of a key or a value, and the output is the extracted text in the appropriate format: string, int, date, etc. Based on the background study, Tesseract was selected as the open-source OCR engine for typed text. Here, the Pytesseract wrapper of the Tesseract module is used. Pytesseract supports various image formats such as jpeg, png, gif, bmp, and tiff. For better performance, we configured it to run only a subset of layout analysis and to assume a certain form of image. The major downside of Tesseract is that it does not work well for handwritten text; it produced very poor results for handwritten forms. Hence, there was a need for a more robust OCR engine that can recognize handwritten text. Many OCR APIs are available; the one used in this module is Amazon Textract. Because Amazon Textract is an API and making an API call over the internet can be time consuming, the API calls are made only for handwritten forms, and only for their value fields: even in handwritten forms the keys are printed text and can be easily recognized by Tesseract.
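The routing rule above (remote API only for handwritten values, local OCR for everything else) and the conversion of OCR text into typed values might look roughly like this. The function names, the engine labels, and the `%d/%m/%Y` date format are assumptions for illustration; the paper does not specify them, and the actual OCR calls are deliberately left out.

```python
from datetime import datetime


def choose_engine(form_type, field_role):
    """Pick an OCR engine label for one cropped field.

    Only handwritten *values* go to the remote API; typed forms and the
    printed keys of handwritten forms stay with the local engine.
    """
    if form_type == "handwritten" and field_role == "value":
        return "textract"
    return "tesseract"


def coerce_value(text, date_format="%d/%m/%Y"):
    """Convert OCR text to int or date when possible, else a stripped string."""
    cleaned = text.strip()
    try:
        return int(cleaned)
    except ValueError:
        pass
    try:
        return datetime.strptime(cleaned, date_format).date()
    except ValueError:
        return cleaned


print(choose_engine("handwritten", "key"))
print(coerce_value(" 42\n"), coerce_value("12/05/2022"), coerce_value("Mumbai "))
```

Routing per field rather than per form keeps the number of network calls proportional to the number of handwritten values only, which is the time saving the section describes.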

3.3 Sentiment Analysis
One of our core analysis features is the sentiment analysis of the form contents. Sentiment analysis helps us understand the polarity of the comments written by the user [15]. To do so, CountVectorizer is used to transform the text into a vector based on the frequency of each word occurring in the text. It builds a vocabulary of known words and encodes new texts using that vocabulary. The encoded text is fed as input to the sentiment analysis model, which is based on the Multinomial Naive Bayes (NB) algorithm [16]. Multinomial NB is a Bayesian learning approach that is popular in natural language processing. Using Bayes' theorem, it estimates the tag of a text, i.e., it calculates which tag is the most likely fit for the given sample and outputs it. The Naive Bayes classifier assumes that every feature is independent of all other features given the class: the presence or absence of one feature does not depend on the presence or absence of another. Before finalizing the Multinomial Naïve Bayes algorithm, we explored other approaches, namely Bidirectional LSTM [17], Embedding + SpatialDropout1D + LSTM + Dense [18], Simple Polarity, and LSTM [19], as shown in Table 1. If our dataset were large, with about 1 million samples, Bidirectional LSTM would have been the best approach. For this experimental setup, however, the Multinomial Naïve Bayes algorithm was observed to be the most appropriate approach, with an accuracy of 80%. The main challenge with our approach is double negatives: our model is unable to identify double negatives and predicts them as negative sentences. To solve this problem, we will need a larger and more diverse dataset, and we will have to use a more complex model such as Bidirectional LSTM.
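To make the word-count encoding and the Naive Bayes decision rule concrete, the following is a small dependency-free sketch. It is a stand-in for the CountVectorizer plus Multinomial NB pipeline named above, not the authors' code: the class name `TinyMultinomialNB`, the whitespace tokenizer, the add-one smoothing, and the four training sentences are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Bag-of-words Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # log prior plus smoothed log likelihood of each known word
            score = math.log(self.doc_counts[label] / total_docs)
            denominator = sum(counts.values()) + len(self.vocabulary)
            for word in text.lower().split():
                if word in self.vocabulary:
                    score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data, for illustration only.
texts = ["the food was great", "great service and tasty food",
         "the food was bad", "bad service and rude staff"]
labels = ["positive", "positive", "negative", "negative"]
model = TinyMultinomialNB().fit(texts, labels)

print(model.predict("great tasty food"))  # positive
print(model.predict("not bad"))           # negative
```

The second prediction illustrates the double-negative limitation discussed above: because each word contributes independently, "not bad" is scored on the word "bad" alone and comes out negative.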

Table 1 Comparison of different approaches to sentiment analysis
Approach | Accuracy | Problems
Multinomial Naïve Bayes | 0.8 | Double negatives
Bidirectional LSTM | 0.52 | Low accuracy
Embedding + SpatialDropout1D + LSTM + Dense | 0.93 | Data skewed so model did not perform well on custom data
Simple Polarity | 0.7 | Double negatives and complex sentences
LSTM | 0.6 | Low accuracy

4 Results and Discussion
We tested our system against two datasets: one a collection of handwritten forms and the other a collection of printed forms. Each dataset consists of 100 forms. Testing was done in batches of different sizes to get more information about execution time and accuracy. Table 2 shows that for a batch size of 1 the accuracy is 77.57%, whereas for a batch size of 100 it is 90.18%. Therefore, we can conclude that for batches of 10 or more handwritten forms, the accuracy is above 90%. In Table 3, for batch sizes 10, 25, 50, and 100, the accuracy is around 89%. Hence, we can conclude that for computer-typed forms the accuracy is about 89%. Visualization of the extracted information is an important feature of this application. Using visualization, data extracted from each field in a form can be presented as graphs and charts. To demonstrate this feature, we took the use case of the Student Resume depicted in Fig. 1, which has a field named departments. Extracting each department manually and visualizing it would be a hassle. Using this algorithm, however, a batch of Student Resumes can be visualized with just one click. Figure 7 depicts the visualization of the departments from those resumes.

Table 2 Results for handwritten forms

Batch size | Total execution time (s) | Average file size (KB) | Accuracy (%)
1 | 8.69 | 206.90 | 77.57
10 | 54.18 | 203.075 | 96.26
25 | 65.79 | 203.82 | 93.38
50 | 86.66 | 409.19 | 91.71
100 | 122.30 | 380.01 | 90.18

Table 3 Results for printed forms
Batch size | Total execution time (s) | Average file size (KB) | Accuracy (%)
1 | 6.10 | 377.48 | 80.68
10 | 40.73 | 376.47 | 89.72
25 | 62.94 | 376.38 | 89.50
50 | 99.47 | 376.19 | 88.97
100 | 213.71 | 341.711 | 88
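Two small calculations help in reading Tables 2 and 3. The sketch below shows one plausible way to score extraction accuracy by exact field match (an assumption; the paper only says extracted data were matched with actual data) and derives the per-form processing time from the handwritten-form totals in Table 2. The helper names and the sample field values are hypothetical.

```python
def field_accuracy(extracted, expected):
    """Percentage of expected fields whose extracted text matches exactly."""
    matches = sum(1 for key, value in expected.items()
                  if extracted.get(key, "").strip() == value.strip())
    return 100.0 * matches / len(expected)


def seconds_per_form(batch_size, total_seconds):
    """Average processing time per form in a batch."""
    return total_seconds / batch_size


# Total execution times for handwritten forms, taken from Table 2.
handwritten = {1: 8.69, 10: 54.18, 25: 65.79, 50: 86.66, 100: 122.30}
for size, total in handwritten.items():
    print(size, round(seconds_per_form(size, total), 2))

# Hypothetical extraction result: two of three fields match.
truth = {"name": "RAVI", "dept": "IT", "city": "MUMBAI"}
output = {"name": "RAVI", "dept": "IT ", "city": "MUMBAY"}
print(round(field_accuracy(output, truth), 2))
```

On the Table 2 figures, the per-form time falls from 8.69 s for a single form to about 1.22 s per form in a batch of 100, which suggests that fixed start-up costs are amortized over larger batches.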


Fig. 7 Pie chart for departments
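The data behind a chart such as Fig. 7 is simply a frequency count of one extracted field across a batch of forms. A minimal sketch of that aggregation, assuming the extracted values arrive as plain strings, is given below; the function name `category_shares`, the normalization to upper case, and the department labels are assumptions for illustration.

```python
from collections import Counter


def category_shares(values):
    """Map each distinct value to its percentage share, largest first."""
    counts = Counter(v.strip().upper() for v in values if v.strip())
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.most_common()}


# Hypothetical department values extracted from six resumes.
departments = ["IT", "COMP", "it", "EXTC", "IT ", "COMP"]
print(category_shares(departments))
```

Normalizing case and whitespace before counting matters for OCR output, where the same department can be read with small variations; the resulting shares can be passed to any plotting library to draw the pie chart.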

Another distinguishing contribution is the sentiment analysis of reviews/feedback, which helps the user understand the sentiment polarity of the textual information in the form. We tested our sentiment model on a dataset containing the Restaurant feedback field from the form depicted in Fig. 2. The sentiment analysis module achieved an accuracy of around 80%. The performance metrics for sentiment analysis are given in Table 4. The purpose of sentiment analysis is to give organizations a quick overview of what customers think about their product or service and to help them make calculated decisions based on that. Organizations often use the platform to gather feedback or suggestions from customers; going through each and every feedback or suggestion might then be impossible or very time consuming. With sentiment analysis, they can quickly learn the overall sentiment of their customers. To demonstrate the need for this application, it is compared with existing applications such as Online OCR [20] and OCRSpace [21]. These use a simple OCR-based approach that can only extract the data from scanned documents, whereas our platform provides a systematic, end-to-end solution in which organizations can extract data from scanned documents in a structured way, i.e., in key-value format, store the data in persistent storage, and view visualizations based on that data to make data-driven decisions.

Table 4 Accuracy of the sentiment model
Class | Precision | Recall | F1-score
Negative | 0.79 | 0.81 | 0.80
Positive | 0.81 | 0.79 | 0.80
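The F1-scores in Table 4 follow from the precision and recall columns, since F1 is the harmonic mean of the two. The short check below uses the standard definition; only the function name is an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


negative_f1 = f1_score(0.79, 0.81)  # negative class from Table 4
positive_f1 = f1_score(0.81, 0.79)  # positive class from Table 4
print(f"{negative_f1:.2f} {positive_f1:.2f}")
```

Both classes come out at roughly 0.80, matching the table; the symmetry between the two rows is expected for a two-class problem in which the errors of one class are the errors of the other.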


5 Future Work
We have only scratched the surface of the application's capabilities. Our future work will focus on two major directions: polishing existing features to make them more general and accurate, so that as little human intervention as possible is required, and adding new features that enable organizations to seamlessly visualize and interact with data while keeping business needs in mind. Currently, our model works with only a limited number of form formats; we would like to create a single model that can handle any form format. To do so, we will rely heavily on ML and NLP, ignoring the form format and instead relying on intelligent mapping of questions and answers based on the context and relative position of the texts. The sentiment analysis model is the next item we would like to improve. Right now, we are using Multinomial NB, which has an accuracy of about 80%, but we are having trouble with double negatives. To solve this, we will use more intricate deep learning models and a bespoke dataset. We also want to enhance our visualization model by allowing users to customize and interact with it. Users will be able to select their own theme, download visualizations, and customize the axis input. Our OCR model uses Tesseract, which has limitations in detecting words in different languages and in handwritten text. We will replace it with a deep learning model that recognizes handwritten text and add the ability to choose the language of the form, so that we can support more languages such as Hindi, Russian, and Arabic, among others.

6 Conclusion
In this paper, we have presented a simple, easy-to-understand, open-source, end-to-end solution that attempts to remove the hassle of manual work and the unnecessary use of compute resources in manual data entry from forms to databases. Our system takes a large batch of forms as input and outputs the cleaned data in Excel format as well as a visualization of the data. The system works well on handwritten as well as typed forms. OCR is performed by open-source Tesseract for typed forms and by AWS Textract for handwritten forms. The system achieved an accuracy of 89% for typed forms and 90% for handwritten forms. The system's scope can be expanded by allowing it to accept templates other than the two-column layout. We can also make the system completely open-source by training Tesseract on our custom dataset of handwritten text. There is still room for improvement regarding the double-negative problem of sentiment analysis and interactive visualizations. Nevertheless, the system is suitable for enterprise as well as personal use.


References
1. Rehman A, Saba T (2014) Neural networks for document image preprocessing: state of the art. Artif Intell Rev 42:253–273. https://doi.org/10.1007/s10462-012-9337-z
2. Qureshi R, Khurshid K, Yan H (2019) Hyperspectral document image processing: applications, challenges and future prospects. Pattern Recogn 90:12–22. https://doi.org/10.1016/j.patcog.2019.01.026
3. El Bahi H, Zatni A (2019) Text recognition in document images obtained by a smartphone based on deep convolutional and recurrent neural network. Multimed Tools Appl 78:26453–26481. https://doi.org/10.1007/s11042-019-07855-z
4. Srihari SN, Shin Y-C, Ramanaprasad V, Lee D-S (1995) Name and Address Block Reader system for tax form processing. In: Proceedings of 3rd international conference on document analysis and recognition, vol 1, pp 5–10. https://doi.org/10.1109/ICDAR.1995.598932
5. Zacharias E, Teuchler M, Bernier B (2020) Image processing based scene-text detection and recognition with Tesseract
6. Aggarwal V, Jajoria S, Sood A (2018) Text retrieval from scanned forms using optical character recognition. Springer Nature Singapore Pte Ltd.
7. Sharma DV, Lehal GS (2009) Form field frame boundary removal for form processing system in Gurmukhi Script. In: 10th International conference on document analysis and recognition
8. Forczmański P, Smoliński A, Nowosielski A, Małecki K (2020) Segmentation of scanned documents using deep-learning approach. In: Burduk R, Kurzynski M, Wozniak M (eds) Progress in computer recognition systems. Advances in Intelligent Systems and Computing, vol 977. Springer, pp 141–152
9. Hegghammer T (2021) OCR with Tesseract, Amazon Textract, and Google Document AI: A Benchmarking Experiment
10. Imran K, Schade M (2019) Automatically extract text and structured data from documents with Amazon Textract
11. Smith R (2008) An overview of the Tesseract OCR engine. Google Inc.
12. Taljaard M, Chaudhry SH, Brehaut JC et al (2015) Mail merge can be used to create personalized questionnaires in complex surveys. BMC Res Notes 8:574. https://doi.org/10.1186/s13104-015-1570-5
13. Pawade D, Sakhapara A, Parab S, Raikar D, Bhojane R, Mamania H (2018) Automatic HTML Code generation from graphical user interface image. In: 2018 3rd IEEE international conference on recent trends in electronics, information & communication technology (RTEICT), 2018, pp 749–753. https://doi.org/10.1109/RTEICT42901.2018.9012284
14. Pawade D, Sakhapara A, Parab S, Raikar D, Bhojane R et al (2018) i-Manager's J Comput Sci; Nagercoil 6(2)(Jun/Aug 2018):34. https://doi.org/10.26634/jcom.6.2.15005
15. Dipti P, Kushal D, Khushboo R, Shruti S, Harshada S (2007) Product review analysis tool. Int J Recent Innov Trends Comput Commun 4(4):960–963
16. Abbas M, Ali K, Memon S, Jamali A, Memon S, Ahmed A (2019) Multinomial Naive Bayes classification model for sentiment analysis. https://doi.org/10.13140/RG.2.2.30021.40169
17. Long F, Zhou K, Ou W (2019) Sentiment analysis of text based on bidirectional LSTM with multi-head attention. IEEE Access 7:141960–141969. https://doi.org/10.1109/ACCESS.2019.2942614
18. Gopalakrishnan K, Salem FM (2020) Sentiment analysis using simplified long short-term memory recurrent neural networks. arXiv preprint arXiv:2005.03993
19. Erkartal B, Yılmaz A (2022) Sentiment analysis of Elon Musk's Twitter data using LSTM and ANFIS-SVM. In: Kahraman C, Tolga AC, Cevik Onar S, Cebi S, Oztaysi B, Sari IU (eds) Intelligent and fuzzy systems. INFUS 2022. Lecture Notes in Networks and Systems, vol 505. Springer, Cham. https://doi.org/10.1007/978-3-031-09176-6_70
20. https://www.onlineocr.net/
21. https://ocr.space/


© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023 N. Chaki et al. (eds.), Proceedings of International Conference on Computational Intelligence and Data Engineering, Lecture Notes on Data Engineering and Communications Technologies 163, https://doi.org/10.1007/978-981-99-0609-3

Lavika Goel, 231
Leelavathy, N., 203
Logeswari, G., 155
LourduMarie Sophie, S., 535
Luma, Artan, 295

M
Mary Anita Rajam, V., 217
Menon Adarsh Sivadas, 63
Mohamed Basheer, K. P., 37
Monika, A., 447
Muneer, V. K., 37
Muni Nagamani, G., 511

N
Nannapaneni Chandra Sekhara Rao, 189
Nasheeda, V. P., 413
Naveena, S., 155
Nikhil Sharma, 563
Nishanth Krishnaraj, 349
Nitin Singh Rajput, 307
Nivitha, K., 269

O
Osho Sharma, 171, 427

P
Pabitha, P., 269
Panduranga Reddy, G., 123
Param Shendekar, 563
Peter, Geno, 335
Prabhu, D., 155
Prabu, M., 243
Pradyumna Rahul, K., 469
Praghash, K., 335
Praveen, R., 269

R
Raj Kumar Batchu, 523
Rajeswari Rajesh Immanuel, 141
Rama Krushna Rath, 307
Rama Reddy, T., 93
Ramesh Kumar, P., 497
Ramkumar Jayaraman, 361
Ramya, B., 401
Ritik Shah, 485
Rizwana Kallooravi Thandil, 37
Roopa Tirumalasetti, 81
Ruban Nersisson, 349

S
Sailaja, K. L., 497
Sandeep Varma, N., 469
Sangeetha, S. K. B., 141
Sansparsh Singh Bhadoria, 63
Santhi Thilagam, P., 51
Santosh Kumar Satapathy, 307
Saritha Hepsibha Pilli, 387
Sayantan Bhattacharjee, 461
Seshu Bhavani Mallampati, 523
Shanmukha Sainadh Gadde, 497
Shrinibas Pattnaik, 307
Siva Sathya, S., 535
Sivakumar, N., 373
Soma Sekhar Reddy, 461
Sowmya Sree, V., 123
Sreedharani, M. P., 401
Sridevi, S., 461
Srinivasa Rao, C., 123
Srinivasa Rao, D., 547
Subashri Sivabalan, 17
Sudheer Kolli, 497
Sudipta Mukhopadhyay, 109
Sujatha, 203
Sujatha Kamepalli, 189
Sundar, S., 255
Sunil Kumar Singh, 81
Swastik Singh, 447

T
Tirumala Rao, K., 203

U
Uma Priya, D., 51

V
Vaishnavi Sinha, 469
Vaithilingam, C., 63
Valli Kumari Vatsavayi, 387
Vijayan, P. M., 255
Vijayarajan Rajangam, 349, 413
Vijay Jeyakumar, 17
Vijay Kumari, 231
Vinayak Singh, 109
Vishal Salgond, 563

Y
Yashvardhan Sharma, 231
Yuvashree, R. M., 401