Innovations in Machine and Deep Learning: Case Studies and Applications (Studies in Big Data, 134) [1st ed. 2023] 3031406877, 9783031406874


English · 523 pages [506] · 2023


Table of contents :
Preface
Contents
Analytics-Oriented Applications
Recursive Multi-step Time-Series Forecasting for Residual-Feedback Artificial Neural Networks: A Survey
1 Introduction
2 Residual-Feedback ANNs: A Systematic Review
2.1 Systematic Review Planning and Execution
2.2 Overview of the Systematic Review Findings
3 The Existing Recursive Multi-step Forecast Strategy Solution
4 Limitation
5 Conclusions and Future Works
References
Feature Selection: Traditional and Wrapping Techniques with Tabu Search
1 Introduction
2 Related Work
3 Methodology
3.1 Data Description
3.2 Entropy-Based Feature Selection
3.3 Feature Selection Using Principal Component Analysis
3.4 Correlation-Based Feature Selection
4 Tabu Search
4.1 Initial Solution
4.2 Neighborhood
4.3 Objective Function
4.4 Memory Structures
5 Results
6 Discussion
7 Conclusions and Future Work
References
Pattern Classification with Holographic Neural Networks: A New Tool for Feature Selection
1 Introduction
2 Holographic Neural Networks
2.1 Basic Theory
2.2 Learning and Prediction Methods
2.3 Explainability and Optimization of Holographic Models
3 Feature Selection with Holographic Neural Networks
3.1 Previous Works
3.2 Pythagorean Membership Grades
4 Pattern Classification
4.1 Iris Dataset
4.2 NIPS Feature Selection Challenge
5 Conclusions and Future Works
References
Reusability Analysis of K-Nearest Neighbors Variants for Classification Models
1 Introduction
2 The K-Nearest Neighbors Algorithm
3 The Parameter K
4 Closeness Metrics
5 Analysis of KNN Variants
5.1 Heuristics for Class Assignment
5.2 Reduction of Dataset Records
5.3 Estimation of Dataset Variables
5.4 Discussion
6 Conclusions
References
Speech Emotion Recognition Using Deep CNNs Trained on Log-Frequency Spectrograms
1 Introduction
2 Literature Survey
2.1 Motivation
2.2 Contributions
3 Proposed Methodology
3.1 Data Augmentation
3.2 Extraction of Log-Frequency Spectrograms
3.3 Motivation Behind Using Spectrograms
3.4 Log-Frequency Spectrogram Extraction
3.5 Understanding What a Spectrogram Conveys
4 The Deep Convolutional Neural Network
4.1 Architecture
4.2 Training
5 Observations
5.1 Dataset Used
5.2 Performance Metrics Used
5.3 Results Obtained
5.4 Comparison Study
6 Conclusion
References
Text Classifier of Sensationalist Headlines in Spanish Using BERT-Based Models
1 Introduction
2 Background
2.1 Sensationalism
2.2 BERT-Based Models
3 Related Work
4 Dataset and Methods
4.1 Data Gathering and Data Labeling
4.2 Data Analysis
4.3 Model Generation and Fine-Tuning
5 Results
6 Conclusion
References
Arabic Question-Answering System Based on Deep Learning Models
1 Introduction
2 Natural Language Processing (NLP)
2.1 Difficulties in NLP
2.2 Natural Language Processing Phases
3 Question Answer System
3.1 Usage Deep Learning Models in Questions Answering System
3.2 Different Questions Based on Bloom’s Taxonomy
3.3 Question-Answering System Based on Types
3.4 Wh-Type Questions (What, Which, When, Who)
4 List-Based Questions
5 Yes/No Questions
6 Causal Questions [Why or How]
7 Hypothetical Questions
8 Complex Questions
8.1 Question Answering System Issues
9 Arabic Language Overview
9.1 Arabic Language Challenges
10 Related Work
11 Proposed Methodology
11.1 Recurrent Neural Networks (RNNs)
11.2 Long Short-Term Memory (LSTM)
11.3 Gated Recurrent Unit (GRU)
12 Prepare the Dataset
12.1 Collecting Data
13 Data Preprocessing
14 Results and Discussion
15 Conclusion and Future Work
References
Healthcare-Oriented Applications
Machine and Deep Learning Algorithms for ADHD Detection: A Review
1 Introduction
2 Research Methodology
3 Related Work
3.1 Machine Learning Approaches
3.2 Deep Learning Approaches
4 Approaches for ADHD Detection Using AI Algorithms
4.1 Machine Learning-Based Approaches
4.2 Deep Learning-Based Approaches
5 Datasets for ADHD Detection
5.1 Hyperaktiv
5.2 Working Memory and Reward in Children with and Without ADHD
5.3 Working Memory and Reward in Adults
5.4 EEG Data for ADHD
6 Machine Learning and Deep Learning Classifiers for ADHD Detection
7 Trends and Challenges
7.1 New Types of Sensors or Biosensors
7.2 Multi-Modal Detection and/or Diagnosis of ADHD
7.3 The Use of Biomarkers as Variables for Diagnosis
7.4 Interpretability
7.5 Building of Standardized and Accurate Public Datasets
7.6 Different Classification Techniques
8 Conclusion
References
Mosquito on Human Skin Classification Using Deep Learning
1 Introduction
2 Literature Review
3 Methodology
3.1 Dataset Description
3.2 Deep Convolutional Neural Networks and Transfer Learning
3.3 Hyperparameter Tuning
3.4 Proposed Workflow
4 Experiments and Results
5 Conclusion and Future Work
References
Analysis and Interpretation of Deep Convolutional Features Using Self-organizing Maps
1 Introduction
2 Materials
2.1 Convolutional Neural Networks
2.2 Self-organizing Maps
3 Proposed Method
3.1 Stage A: Training of CNN
3.2 Stage B: Extraction of Features
3.3 Stage C: SOM Training
3.4 Stage D: Analysis and Interpretation
4 Application Example
4.1 Experimental Setup
4.2 Result Analysis
5 Conclusions
References
A Hybrid Deep Learning-Based Approach for Human Activity Recognition Using Wearable Sensors
1 Introduction
2 Literature Analysis
3 OPPORTUNITY Dataset
4 MHEALTH Dataset
5 HARTH Dataset
6 Materials and Methods
6.1 Some Preliminaries
6.2 Basic Architecture of CNN
7 Long-Short Term Memory (LSTM)
7.1 Working Principle of LSTM
8 Proposed Model Architecture
9 Dataset Description
9.1 MHEALTH Dataset
9.2 OPPORTUNITY Dataset
9.3 HARTH Dataset
10 Experimental Results
10.1 Evaluation Metrics Used
10.2 Results Analysis on MHEALTH Dataset
10.3 Results Analysis on OPPORTUNITY Dataset
10.4 Results Analysis on HARTH Dataset
10.5 Result Summary and Comparison
11 Conclusion and Future Works
References
Predirol: Predicting Cholesterol Saturation Levels Using Big Data, Logistic Regression, and Dissipative Particle Dynamics Simulation
1 Introduction
2 Related Works
2.1 Models for the Simulation of Fluids
2.2 Data Mining Application for Prevention of Cardiovascular Diseases
2.3 Comparative Analysis
3 PREDIROL Architecture
3.1 Big Data Model
3.2 Cholesterol Saturation Level Prediction Module
3.3 Cholesterol Levels Simulation Module with Dissipative Particle Dynamics
4 Case Study: Prediction of Cholesterol Levels of a Hospital Patients
5 Conclusions and Future Work
References
Convolutional Neural Network-Based Cancer Detection Using Histopathologic Images
1 Introduction
2 Image Processing Techniques
2.1 Statistical-Based Algorithms
2.2 Learning-Based Algorithms
2.3 Hyper-Parameters of CNN
2.4 Evaluation Metrics
2.5 Implementation
3 Stage 3: CNN Algorithm Training
3.1 Model Training Phase
3.2 Model Optimization Phase
4 Conclusion
References
Artificial Neural Network-Based Model to Characterize the Reverberation Time of a Neonatal Incubator
1 Introduction
2 Materials and Methods
2.1 Artificial Neural Networks Using the Levenberg–Marquardt Algorithm
3 Results
3.1 Data Analysis
3.2 Artificial Neural Network-Based Model Training
4 Conclusions
References
A Comparative Study of Machine Learning Methods to Predict COVID-19
1 Introduction
2 Related Works
3 Background
3.1 Covid-19
3.2 Machine Learning
4 Materials and Methods
4.1 Dataset Pre-processing
4.2 Machine Learning Models
5 Results and Discussions
6 Conclusions
References
Sustainability-Oriented Applications
Multi-product Inventory Supply and Distribution Model with Non-linear CO2 Emission Model to Improve Economic and Environmental Aspects of Freight Transportation
1 Introduction
2 Literature Review and Contributions
3 Development of the Integrated Routing Model
3.1 Inventory Planning with Non-deterministic Demand and Multiple Products
3.2 Non-linear Emission for Heterogeneous Fleet
3.3 Association of Variables
4 Assessment of the Model
4.1 Numerical Data and Solving Method
4.2 Analysis of Results
5 Future Work
6 Statement
References
Convolutional Neural Networks for Planting System Detection of Olive Groves
1 Background
1.1 Evolution of Production Techniques in Olive Groves
1.2 Current Situation of Modern Olive Cultivation Systems
1.3 Application of Remote Sensing Techniques for Image Analysis
1.4 Scope of the Present Chapter
2 Materials and Experimental Methods
2.1 Area of Study and Image Acquisition
2.2 Methodology
3 Results and Discussion
4 Conclusions and Future Lines
References
A Conceptual Model for Analysis of Plant Diseases Through EfficientNet: Towards Precision Farming
1 Introduction
2 Related Study
3 Deep Learning in Keras
4 Overview of Plant Diseases
5 Materials and Methods
5.1 Dataset Used in the Study
5.2 Overview of Convolutional Neural Network Models
5.3 Overview of EfficientNet
5.4 B0 to B7 Variants of EfficientNet
6 Proposed Methodology for Plant Disease Detection from Leaf Images
6.1 Experimental Setup
6.2 Training
7 Results and Discussion
7.1 Evaluation of Model
7.2 Image Analysis
8 Conclusion
References
Ginger Disease Detection Using a Computer Vision Pre-trained Model
1 Introduction
2 Related Work
3 Data Preparation
4 Pre-trained Model Description
5 Methodology
6 Hyper-Parameter Setting
7 Experimental Result
8 Conclusion
References
Anomaly Detection in Low-Cost Sensors in Agricultural Applications Based on Time Series with Seasonal Variation
1 Introduction
2 Related Work
3 Problem Statement
4 Anomaly Detection Methodology
4.1 Methodology Basis
4.2 Anomaly Detector Enhancements
5 Evaluation of the Proposed Approach
5.1 Data Generation
5.2 Experimental Results
6 Conclusion
References
Coconut Tree Detection Using Deep Learning Models
1 Introduction
2 Related Studies
3 Proposed Work
3.1 Datasets
4 Training and Classification
4.1 Model Selection
4.2 Evaluation Metrics
5 Experiments and Results
5.1 Model Trained with Low-Resolution Images
5.2 Model Trained with High-Resolution Images
5.3 Graphical User Interface—GUI
6 Conclusion and Future Work
References
Hybrid Neural Network Meta-heuristic for Solving Large Traveling Salesman Problem
1 Introduction
2 Clustering for Dimensionality Reduction
2.1 Self-organizing Maps
2.2 Reduction of Location Data
3 Structure of the Hybrid ANN-CW-LS Meta-heuristic
4 Results and Assessment of Performance
4.1 Test Data and Experiment Settings
4.2 Test of Average Performance
4.3 Test of Best Performance
4.4 Speed Performance Versus Dimensionality Reduction
5 Conclusions and Future Work
References


Studies in Big Data 134

Gilberto Rivera · Alejandro Rosete · Bernabé Dorronsoro · Nelson Rangel-Valdez, Editors

Innovations in Machine and Deep Learning Case Studies and Applications

Studies in Big Data Volume 134

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Big Data” (SBD) publishes new developments and advances in the various areas of Big Data, quickly and with a high quality. The intent is to cover the theory, research, development, and applications of Big Data, as embedded in the fields of engineering, computer science, physics, economics and life sciences. The books of the series refer to the analysis and understanding of large, complex, and/or distributed data sets generated from recent digital sources coming from sensors or other physical instruments as well as simulations, crowd sourcing, social networks or other internet transactions, such as emails or video click streams, and others. The series contains monographs, lecture notes and edited volumes in Big Data spanning the areas of computational intelligence including neural networks, evolutionary computation, soft computing, fuzzy systems, as well as artificial intelligence, data mining, modern statistics and operations research, as well as self-organizing systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. The books of this series are reviewed in a single blind peer review process. Indexed by SCOPUS, EI Compendex, SCIMAGO and zbMATH. All books published in the series are submitted for consideration in Web of Science.

Gilberto Rivera · Alejandro Rosete · Bernabé Dorronsoro · Nelson Rangel-Valdez Editors

Innovations in Machine and Deep Learning Case Studies and Applications

Editors

Gilberto Rivera
División Multidisciplinaria de Ciudad Universitaria, Universidad Autónoma de Ciudad Juárez, Chihuahua, Mexico

Alejandro Rosete
Universidad Tecnológica de La Habana “José Antonio Echeverría”, La Habana, Cuba

Bernabé Dorronsoro
School of Engineering, University of Cadiz, Cádiz, Spain

Nelson Rangel-Valdez
Instituto Tecnológico de Ciudad Madero, Tecnológico Nacional de México, Tamaulipas, Mexico

ISSN 2197-6503 ISSN 2197-6511 (electronic)
Studies in Big Data
ISBN 978-3-031-40687-4 ISBN 978-3-031-40688-1 (eBook)
https://doi.org/10.1007/978-3-031-40688-1

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

Machine Learning (ML) is a branch of Artificial Intelligence (AI) that continuously changes and adapts to address emerging challenges. Since the early conceptions of computers, humanity has strived to develop more capable thinking machines. It began with the proposal of the Turing test, which suggests that a computer could be considered “intelligent” if it convinced a human that it is a person. The first models mimicked human thought processes but quickly developed into algorithms that acquire knowledge or skills through experience. Nowadays, this behavior is improved through Deep Learning (DL) techniques, a branch of ML that distinguishes itself by overcoming the finite capacity to absorb knowledge by integrating more experiences into its learning process.

The creation of inference engines scales up innovation in software capabilities. Mainly, ML & DL contribute to the software transition from human-driven to inference-driven. Furthermore, AI R&D plays an essential role in any enterprise innovation department, whether to aid decision-makers in making informed decisions as part of a Business Intelligence tool or as part of the developed technology that will be commercialized.

Innovations in Machine Learning and Deep Learning: Case Studies and Applications aims to collect the latest technological applications in the field of ML & DL to innovate tasks related to decision-making, forecasting, information retrieval, interpretable AI, risk management, healthcare, human activity recognition, sustainability, and logistics, among other topics related to this field. This book consists of twenty-two chapters, organized into three main areas:

First Part: Analytics-Oriented Applications. This seven-chapter part is devoted to the systematic analysis of data. The studied cases’ domains include time-series forecasting, feature selection, pattern classification, reusability frameworks, speech recognition, text classification, and question-answering systems.
The original research in this part involves analyzing a rich set of ML & DL techniques such as residual-feedback artificial neural networks, tabu search, holographic neural networks, k-nearest neighbors, convolutional neural networks, and text analytics models. Some concrete applications include emotion recognition through speech, sensationalism detection in news headlines, and Arabic language processing. Lastly, this part shows that almost any data is subject to analysis and that ML & DL tools can aid during the process.

Second Part: Healthcare-Oriented Applications. These eight chapters highlight tools based on ML & DL techniques oriented to supporting the medical care of individuals. This part analyzes applications for attention deficit hyperactivity disorder, classification of mosquitoes on human skin, pneumonia classification from X-ray images, human activity recognition, cholesterol prediction, cancer detection, characterization of the reverberation time of neonatal incubators, and COVID-19 prediction. The issues presented in these applications were tackled using strategies such as convolutional neural networks, dissipative particle dynamics simulation, and several classifiers (e.g., random forest and stochastic gradient). Throughout these chapters, the relevance of ML & DL to healthcare is seen in how such techniques can support the prevention of a wide variety of illnesses or conditions, or provide a better understanding of them.

Third Part: Sustainability-Oriented Applications. This part encompasses seven chapters oriented to developing means to support the balance between society, the environment, and economic growth. Humanity must coexist with the environment to be sustainable, i.e., its actions must not compromise its future, and they must support the persistence of ecosystems. ML & DL techniques make these tasks more manageable, and this part provides rich, supportive evidence on this topic. The analyzed case studies are CO2 emissions prediction in freight transportation, planting system detection, ginger disease detection, coconut tree detection, logistics, and anomaly detection in low-cost sensors. In addition, a wide range of artificial neural networks lies within the approaches used to solve these applications.
The chapters of this part offer a clear view of the tremendous impact that ML & DL techniques have on the green industry.

All chapters, rigorously analyzed by the book editors, resulted from a stringent double-blind peer-review process by field experts. The contributions of all the authors enrich the reader’s experience and knowledge of ML & DL techniques and applications. The main focus is on the role that AI tools play in the solution of real-life issues. Hence, this book is expected to motivate readers to implement these technologies on the way to becoming a Smart Business or an Industry 4.0 environment. Innovations in Machine Learning and Deep Learning: Case Studies and Applications represents a channel to examine our knowledge about how ML & DL influence the solution of daily emerging needs. Lastly, we hope readers find inspiration in this book (or any of its chapters) that motivates research in developing intelligent solutions for real-world problems using ML & DL and related disciplines.

Chihuahua, Mexico
La Habana, Cuba
Cádiz, Spain
Tamaulipas, Mexico

Gilberto Rivera
Alejandro Rosete
Bernabé Dorronsoro
Nelson Rangel-Valdez

Contents

Analytics-Oriented Applications

Recursive Multi-step Time-Series Forecasting for Residual-Feedback Artificial Neural Networks: A Survey . . . . . . . . . . 3
Waddah Saeed and Rozaida Ghazali

Feature Selection: Traditional and Wrapping Techniques with Tabu Search . . . . . . . . . . 21
Laurentino Benito-Epigmenio, Salvador Ibarra-Martínez, Mirna Ponce-Flores, and José Antonio Castán-Rocha

Pattern Classification with Holographic Neural Networks: A New Tool for Feature Selection . . . . . . . . . . 39
Luis Diago, Hiroe Abe, Atsushi Minamihata, and Ichiro Hagiwara

Reusability Analysis of K-Nearest Neighbors Variants for Classification Models . . . . . . . . . . 63
José Ángel Villarreal-Hernández, María Lucila Morales-Rodríguez, Nelson Rangel-Valdez, and Claudia Gómez-Santillán

Speech Emotion Recognition Using Deep CNNs Trained on Log-Frequency Spectrograms . . . . . . . . . . 83
Mainak Biswas, Mridu Sahu, Maroi Agrebi, Pawan Kumar Singh, and Youakim Badr

Text Classifier of Sensationalist Headlines in Spanish Using BERT-Based Models . . . . . . . . . . 109
Heber Jesús González Esparza, Rogelio Florencia, José David Díaz Román, and Alejandra Mendoza-Carreón

Arabic Question-Answering System Based on Deep Learning Models . . . . . . . . . . 133
Samah Ali Al-azani and C. Namrata Mahender

Healthcare-Oriented Applications

Machine and Deep Learning Algorithms for ADHD Detection: A Review . . . . . . . . . . 163
Jonathan Hernández-Capistran, Laura Nely Sánchez-Morales, Giner Alor-Hernández, Maritza Bustos-López, and José Luis Sánchez-Cervantes

Mosquito on Human Skin Classification Using Deep Learning . . . . . . . . . . 193
C. S. Ayush Kumar, Advaith Das Maharana, Srinath Murali Krishnan, Sannidhi Sri Sai Hanuma, V. Sowmya, and Vinayakumar Ravi

Analysis and Interpretation of Deep Convolutional Features Using Self-organizing Maps . . . . . . . . . . 213
Diego Sebastián Comas, Gustavo Javier Meschino, Agustín Amalfitano, and Virginia Laura Ballarin

A Hybrid Deep Learning-Based Approach for Human Activity Recognition Using Wearable Sensors . . . . . . . . . . 231
Deepak Sharma, Arup Roy, Sankar Prasad Bag, Pawan Kumar Singh, and Youakim Badr

Predirol: Predicting Cholesterol Saturation Levels Using Big Data, Logistic Regression, and Dissipative Particle Dynamics Simulation . . . . . . . . . . 261
Reyna Nohemy Soriano-Machorro, José Luis Sánchez-Cervantes, Lisbeth Rodríguez-Mazahua, and Luis Rolando Guarneros-Nolasco

Convolutional Neural Network-Based Cancer Detection Using Histopathologic Images . . . . . . . . . . 287
Jayesh Soni, Nagarajan Prabakar, and Himanshu Upadhyay

Artificial Neural Network-Based Model to Characterize the Reverberation Time of a Neonatal Incubator . . . . . . . . . . 305
Virginia Puyana-Romero, Lender Michael Tamayo-Guamán, Daniel Núñez-Solano, Ricardo Hernández-Molina, and Giuseppe Ciaburro

A Comparative Study of Machine Learning Methods to Predict COVID-19 . . . . . . . . . . 323
J. Patricia Sánchez-Solís, Juan D. Mata Gallegos, Karla M. Olmos Sánchez, and Victoria González Demoss

Sustainability-Oriented Applications

Multi-product Inventory Supply and Distribution Model with Non-linear CO2 Emission Model to Improve Economic and Environmental Aspects of Freight Transportation . . . . . . . . . . 349
Santiago Omar Caballero-Morales, Jose Luis Martinez-Flores, and Irma Delia Rojas-Cuevas


Convolutional Neural Networks for Planting System Detection of Olive Groves . . . . . . . . . . 373
Cristina Martínez-Ruedas, Samuel Yanes Luis, Juan Manuel Díaz-Cabrera, Daniel Gutiérrez Reina, Adela P. Galvín, and Isabel Luisa Castillejo-González

A Conceptual Model for Analysis of Plant Diseases Through EfficientNet: Towards Precision Farming . . . . . . . . . . 401
Roneeta Purkayastha and Subhasish Mohapatra

Ginger Disease Detection Using a Computer Vision Pre-trained Model . . . . . . . . . . 419
Olga Kolesnikova, Mesay Gemeda Yigezu, Atnafu Lambebo Tonja, Michael Meles Woldeyohannis, Grigori Sidorov, and Alexander Gelbukh

Anomaly Detection in Low-Cost Sensors in Agricultural Applications Based on Time Series with Seasonal Variation . . . . . . . . . . 433
Adrián Rocha Íñigo, José Manuel García Campos, and Daniel Gutiérrez Reina

Coconut Tree Detection Using Deep Learning Models . . . . . . . . . . 469
Deepthi Sudharsan, K. Harish, U. Asmitha, S. Roshan Tushar, H. Theivaprakasham, V. Sowmya, V. V. Sajith Variyar, Krishnamoorthy Deva Kumar, and Vinayakumar Ravi

Hybrid Neural Network Meta-heuristic for Solving Large Traveling Salesman Problem . . . . . . . . . . 489
Santiago Omar Caballero-Morales, Gladys Bonilla-Enriquez, and Diana Sanchez-Partida

Analytics-Oriented Applications

Recursive Multi-step Time-Series Forecasting for Residual-Feedback Artificial Neural Networks: A Survey Waddah Saeed and Rozaida Ghazali

Abstract Residual-feedback artificial neural networks are a type of artificial neural network (ANN) that has shown better forecasting performance on some time series. One of the challenges of residual-feedback ANNs is that, because they utilize the previous time step’s observed value, they are only capable of predicting one step ahead. Therefore, it would not be possible to apply them directly in a recursive multi-step forecast strategy. To shed light on this challenge, a systematic literature review was conducted in this paper to find answers to the following three research questions: What are the main motivations behind introducing residual feedback to ANNs? How good are the existing residual-feedback ANNs compared to other forecasting methods in terms of forecasting performance? And what are the existing solutions for recursive multi-step time series forecasting using residual-feedback ANNs? An analysis of 19 studies was conducted to answer these questions. Furthermore, several potential solutions that can be further explored in practice are suggested in an attempt to overcome this challenge.

W. Saeed (B)
School of Computer Science and Informatics, De Montfort University, Leicester LE1 9BH, UK
e-mail: [email protected]

R. Ghazali
Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Batu Pahat, 86400 Johor, Malaysia
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
G. Rivera et al. (eds.), Innovations in Machine and Deep Learning, Studies in Big Data 134, https://doi.org/10.1007/978-3-031-40688-1_1

1 Introduction

The use of artificial neural networks (ANNs) for time series forecasting is quite widespread [20]. Generally, ANNs are trained and applied with lagged observations (or auto-regressive terms) of one or more time series. In contrast, external events/shocks or innovations cannot be modeled directly because they are not observable [3]. One way to approximate the innovations with any degree of accuracy is by using a sufficient number of auto-regressive terms [3]. However, giving the model more degrees of freedom than it requires could degrade its performance [3]. Therefore, feeding residuals to an ANN can provide a more direct, and thus more parsimonious, estimation of the innovations [3]. In addition, it has also been suggested that using residuals as inputs to an ANN might provide the network with information about its predictive capabilities, thus enhancing them during training [21, 26]. A residual (i.e., error) is calculated by taking the difference between the target and the network output. According to the results in [3, 6, 8, 19, 20, 26], residual-feedback ANNs outperformed some forecasting methods on some time series.

ANNs have some advantages that make them a popular choice for forecasting [15, 21, 25]. As residual-feedback ANNs are a type of ANN, they share these advantages. One major advantage is their non-linear input-output mapping nature, which allows for approximating any continuous function with an arbitrary degree of accuracy. Further, this mapping is generated with little prior knowledge of the non-linearity in the data; therefore, they are less prone to model misspecification than other parametric non-linear methods. Additionally, residual-feedback ANNs offer better performance in modeling and forecasting time series with a moving-average process compared to non-residual-feedback ANNs (as found in this work). On the other hand, residual-feedback ANNs, like many ANNs, are prone to overfitting, are computationally expensive and time-consuming to train to optimal performance, and are black-box models, so their outputs need to be justified for some applications [16].

One of the challenges with residual-feedback ANNs is that, by utilizing the previous time step’s observed value, they are only capable of predicting one step ahead [17, 19]. Therefore, it would not be possible to apply them in a recursive multi-step forecast strategy [23]. The following example can illustrate this. Suppose we want to develop a temperature forecasting system based on a residual-feedback ANN model.
Using this system, the user needs to know the temperature forecasts for the next three days. Let us assume the temperatures for the last four days can be used as inputs to the model. Since it is a residual-feedback ANN model, the error produced by the system (i.e., the difference between the actual temperature and the forecast temperature of the last day) will be used as an extra input to the model. Obtaining the forecast for the first following day is possible because the error can be calculated using the last observed temperature. However, we cannot get the forecasts for the next two days because we cannot calculate the errors until the actual temperatures are observed. In order to shed light on this challenge, a systematic literature review was conducted in this paper with the aim of finding answers to three research questions focused on the main motivations behind introducing residual feedback to ANNs, the forecasting performance of residual-feedback ANNs compared to other forecasting methods, and the existing solutions for multi-step time series forecasting using residual-feedback ANNs. To the best of our knowledge, this is the first study that focuses on this challenge. Addressing this challenge can help utilize residual-feedback ANNs in different forecasting problems that requires obtaining recursive multi-step forecasts. The remainder of the paper is structured as follows. The systematic review’s planning and execution are given in Sect. 2. Following that, an overview of the systematic review findings is detailed in Sects. 2.1, and 3 discusses the existing

Recursive Multi-step Time-Series Forecasting for Residual-Feedback …


solution. The limitation of the research is mentioned in Sect. 4. Finally, conclusions and future research are given in Sect. 5.

2 Residual-Feedback ANNs: A Systematic Review

In this section, we describe the protocol used for the systematic literature review (SLR) on residual-feedback ANNs for time series forecasting and give an overview of its findings.

2.1 Systematic Review Planning and Execution

In this study, a systematic literature review (SLR) was conducted using the methodology introduced by Kitchenham and Charters [12]. This methodology is widely accepted and has influenced how scientists perform literature reviews today (e.g., [4]). Three research questions were specified:
• RQ1: What are the main motivations behind introducing residual feedback to ANNs?
• RQ2: How good are the existing residual-feedback ANNs compared to other forecasting methods in terms of forecasting performance?
• RQ3: What are the existing solutions for multi-step time series forecasting using residual-feedback ANNs?
With RQ1, we aim to identify the main reasons behind introducing residual feedback to ANNs for time series forecasting. For RQ2, the aim is to gauge how well the existing residual-feedback ANNs perform compared to other forecasting methods (e.g., non-residual-feedback ANNs and statistical models). The last research question is aimed at identifying any existing solutions for multi-step time series forecasting using residual-feedback ANNs. With the RQs established, the search strings and search strategy were defined. The search terms were identified based on the research questions and divided into four groups as follows:
• Error keywords: past prediction error, past forecast error, past forecasting error, previous prediction error, previous forecast error, previous forecasting error, previous error, past error, residual error, residual feedback, error feedback, lagged variables of error, error correction, error lag, nonlinear auto-regressive moving average, NARMA, Nonlinear ARMA, residuals modeling, error modeling, residual time series, error time series, residual series, error series.
• Neural networks keywords: neural network, recurrent network.
• Time series keywords: time series.
• Forecasting keywords: forecasting, prediction.


W. Saeed and R. Ghazali

Search strings were then built from these search terms using Boolean ANDs between groups and ORs within each group. For example, (“past prediction error” OR “past forecast error”) AND (“neural network” OR “recurrent network”) AND “time series” AND (forecasting OR prediction). After that, the following relevant electronic databases were selected:
• Scopus: https://www.scopus.com/
• Science Direct: https://www.sciencedirect.com/
• Institute of Electrical and Electronics Engineers Xplore® Digital Library (IEEEXplore): https://ieeexplore.ieee.org/Xplore/home.jsp
• Springer Link: https://link.springer.com/
Note that the Title, Abstract, and Keywords fields were used in the search, and search strings sometimes had to be adapted to the specific needs of the digital libraries. The last search using these databases was conducted on 1 February 2022. After obtaining the search results, all studies were analyzed individually to assess their relevance in the context of this SLR. Inclusion/exclusion criteria were applied to the retrieved studies as follows:
• Excluding non-English studies (e.g., [5]).
• Excluding duplicate studies retrieved by more than one database.
• Excluding studies that did not use residual feedback as additional input(s) to the ANN time series forecasting models (e.g., [27]). Residual feedback is calculated by taking the difference between the target and the network output.
• If two papers from the same study were published in different venues, only the most recent or most complete one was included (e.g., excluding [18] because [20] is the more complete and recent one). However, if the results differed, both studies were included.
The selected studies were further checked to include other papers that may not have been retrieved from the selected electronic databases. This was done by checking the reference lists of the selected studies, which resulted in retrieving three more studies.
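As a sketch, the group-wise Boolean combination described above can be expressed programmatically. The keyword lists below are abbreviated for illustration, and the function name is ours, not part of the protocol:

```python
def build_query(*groups):
    """Join terms inside a group with OR; join groups with AND.
    Multi-word terms are quoted so databases treat them as phrases."""
    clauses = []
    for group in groups:
        quoted = [f'"{t}"' if " " in t else t for t in group]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)

# Abbreviated keyword groups (the full lists are given above)
error_terms = ["past prediction error", "residual feedback", "error feedback"]
network_terms = ["neural network", "recurrent network"]
series_terms = ["time series"]
task_terms = ["forecasting", "prediction"]

query = build_query(error_terms, network_terms, series_terms, task_terms)
print(query)
```

In practice, a generated string like this still needs manual adaptation to each database's query syntax, as noted above.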
In total, nineteen studies were selected for inclusion in the review, but none were published in 2021, as shown in Table 1. The number of research papers per paper type is shown in Fig. 1.

2.2 Overview of the Systematic Review Findings

Before answering the research questions, it is important to introduce several neural-network-based forecasting models currently used in the literature. These models are: nonlinear auto-regressive (NAR), nonlinear auto-regressive with exogenous inputs (NARX), nonlinear auto-regressive moving average (NARMA), and nonlinear auto-regressive moving average with exogenous inputs (NARMAX), with NARMAX being the most general form of the models.


Table 1 Selected titles based on the systematic literature review

1994  [6]   Recurrent neural networks and robust time series prediction
1997  [7]   A real-time short-term load forecasting system using functional link network
1999  [17]  Time series prediction with artificial neural networks
1999  [3]   Modelling non-linear moving average processes using neural networks with error feedback: An application to implied volatility forecasting
2000  [14]  Neural network and time series identification and prediction
2002  [13]  A short-term temperature forecaster based on a state space neural network
2002  [28]  Modeling dynamical systems by error correction neural networks
2005  [10]  NARMAX time series model prediction: feed-forward and recurrent fuzzy neural network approaches
2009  [24]  BP neural network with error feedback input research and application
2015  [8]   Recurrent multiplicative neuron model artificial neural network for non-linear time series forecasting
2016  [21]  Ridge polynomial neural network with error feedback for time series forecasting
2018  [1]   An ARMA type Pi-Sigma artificial neural network for nonlinear time series forecasting
2019  [19]  Forecasting the behavior of gas furnace multivariate time series using ridge polynomial based neural network models
2019  [22]  Nonlinear auto-regressive moving-average (NARMA) time series forecasting using neural networks
2019  [2]   An improved pi-sigma neural network with error feedback for physical time series prediction
2020  [20]  A novel error-output recurrent neural network model for time series forecasting
2020  [26]  Inhibition of long-term variability in decoding forelimb trajectory using evolutionary neural networks with error-correction learning
2020  [23]  Ridge polynomial neural network with error feedback for recursive multi-step forecast strategy: A case study of carbon dioxide emissions forecasting
2020  [9]   Finding an accurate early forecasting model from small dataset: A case of 2019-nCoV novel coronavirus outbreak


Fig. 1 Number of research papers per paper type

Table 2 Neural network forecasting models and their equations

Model               Equation
NAR(p, p̂)           ŷ_{t+h} = f̂(y_t, …, y_{t−p}, ŷ_t, …, ŷ_{t−p̂})
NARMA(p, p̂, q)      ŷ_{t+h} = f̂(y_t, …, y_{t−p}, ŷ_t, …, ŷ_{t−p̂}, e_t, …, e_{t−q})
NARX(p, p̂, n)       ŷ_{t+h} = f̂(y_t, …, y_{t−p}, ŷ_t, …, ŷ_{t−p̂}, x_t, …, x_{t−n})
NARMAX(p, p̂, q, n)  ŷ_{t+h} = f̂(y_t, …, y_{t−p}, ŷ_t, …, ŷ_{t−p̂}, e_t, …, e_{t−q}, x_t, …, x_{t−n})

f̂ is an approximation of an unknown nonlinear function f, ŷ_{t+h} is the needed h-step forecast, y represents the auto-regressive terms from the time series, x the external inputs, and e the moving-average terms (i.e., residual feedback); p is the order of the auto-regressive part, p̂ the order of the network-output part, n the order of the external-input part, and q the order of the moving-average part.

Generally, for the NAR model, the inputs are any combination of auto-regressive terms from the time series and the output(s) of the network. For the NARMA model, moving-average terms (i.e., residual feedback) are included in addition to the auto-regressive terms and the output(s) of the network. For NARX and NARMAX, there are extra external inputs to the model. These external inputs are used to help predict and explain the variable of interest, for example, an extra input separating weekdays from weekends in a product's daily sales forecasting model. Table 2 illustrates the equations for all these models. As a further clarification, Table 3 compares the inputs between these models. NARMA and NARMAX models are considered residual-feedback ANN models. A block diagram for examples of a NARMA model is shown in Fig. 2. As shown in Fig. 2a, the inputs to the model are auto-regressive inputs and a residual feedback,


Table 3 Type of inputs that can be used by neural network forecasting models

Model               Auto-regressive terms   Network-output   Error      External
                    from time series        feedback         feedback   inputs
NAR(p, p̂)           ✓                       ✓                ✗          ✗
NARMA(p, p̂, q)      ✓                       ✓                ✓          ✗
NARX(p, p̂, n)       ✓                       ✓                ✗          ✓
NARMAX(p, p̂, q, n)  ✓                       ✓                ✓          ✓

simply NARMA(p, 0, 1), as in the models proposed by [7, 21]. Another example is shown in Fig. 2b, where the inputs consist of auto-regressive inputs, a network output, and a residual feedback, simply NARMA(p, 1, 1), as in the model proposed by [20]. We now turn to answering the research questions. With regard to RQ1, the main motivations for introducing residual feedback to ANN models have been discussed in [3, 21, 26, 28]. In [3], the authors note that in time series modeling with ANNs, lagged observations (or auto-regressive terms) are frequently used as inputs. By contrast, external events/shocks or innovations cannot be modeled directly because they are not observable. One way to approximate the innovations to any degree of accuracy is to use a sufficient number of auto-regressive terms. However, this risks adding more degrees of freedom to the model than it needs, thus degrading its performance. Therefore, feeding residual feedback to an ANN can serve as a more direct, and hence more parsimonious, estimate of the innovations. The authors also highlighted that, in the case of cointegrated time series [11], equilibrium and error-correction models are combined into a single model including error feedback, instead of the conventional two-stage approach. The author in [28] argued that we cannot expect to have complete knowledge of external forces, and the available observations might be noisy. By using residual feedback, the learning process can interpret the model misfit as an external shock, thus guiding the model dynamics. It is also highlighted by [21, 26] that utilizing residual feedback as inputs to an ANN might provide the network with information about its predictive capabilities, allowing it to improve them during training.
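To make the input layouts of Tables 2 and 3 concrete, the following minimal sketch assembles the input vector for each model family at a given time step; the helper function and the toy data are ours, not taken from any of the reviewed papers:

```python
def build_inputs(y, y_hat, e, x, p, p_hat=None, q=None, n=None):
    """Assemble the model inputs at time t (the last index of each list),
    following Table 2: p auto-regressive lags of y, p_hat lags of the
    network output y_hat, q residual (moving-average) lags of e, and n lags
    of the external input x. Leaving q and/or n as None disables those
    parts, recovering NAR and NARX as special cases."""
    inputs = list(y[-(p + 1):])                 # y_t, ..., y_{t-p}
    if p_hat is not None:
        inputs += list(y_hat[-(p_hat + 1):])    # network-output feedback
    if q is not None:
        inputs += list(e[-(q + 1):])            # residual feedback
    if n is not None:
        inputs += list(x[-(n + 1):])            # external inputs
    return inputs

# Toy series: observations, network outputs, residuals, and an external input
y     = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 2.1, 3.1, 4.1, 5.1]
e     = [-0.1, -0.1, -0.1, -0.1, -0.1]
x     = [0.0, 0.0, 0.0, 0.0, 0.0]

nar    = build_inputs(y, y_hat, e, x, p=2, p_hat=1)            # NAR(2, 1)
narma  = build_inputs(y, y_hat, e, x, p=2, p_hat=1, q=1)       # NARMA(2, 1, 1)
narmax = build_inputs(y, y_hat, e, x, p=2, p_hat=1, q=1, n=0)  # NARMAX(2, 1, 1, 0)
print(len(nar), len(narma), len(narmax))  # 5 7 8
```

The input-vector lengths grow exactly by the extra groups each family adds, mirroring the checkmarks in Table 3.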
Table 4 shows the ANN base models used to develop the existing residual-feedback ANN models, with the ridge polynomial neural network and the multi-layer perceptron being the most used. For RQ2, the following points highlight the main findings on the overall performance of the residual-feedback ANN models in the selected papers:
• According to [3, 6], NARMA(p, 0, q) is better than NAR(p, 0) in modeling and forecasting linear MA(q), NARMA(p, q), and NMA(q). Adding more auto-regressive terms to the NAR model can help enhance its performance. However,


Fig. 2 A block diagram for examples of residual-feedback ANNs: (a) NARMA(p, 0, 1); (b) NARMA(p, 1, 1)


Table 4 ANN base models used to develop the existing residual-feedback ANN models

ANN base model                                          Reference(s)
Ridge polynomial neural network                         [19–23]
Multi-layer perceptron                                  [6, 14, 17, 24]
Pi-Sigma neural network                                 [1, 2]
Multiplicative neuron model artificial neural network   [8]
Functional link network                                 [7]
Fully connected neural networks                         [3]
State space neural network                              [13]
Time-delay recurrent neural network                     [28]
Generalized fuzzy neural network                        [10]
Evolutionary constructive and pruning neural network    [26]
Polynomial neural network                               [9]

Table 5 Forecasting performance metrics used in papers used to answer RQ2

Forecasting performance metric                       References
Mean Squared Error (MSE)                             [2, 3, 6, 8, 10]
Root Mean Squared Error (RMSE)                       [1, 9, 10, 19–22]
Normalized Mean Squared Error (NMSE)                 [2, 20, 21, 24]
Mean Absolute Error (MAE)                            [2, 21]
Mean Absolute Percentage Error (MAPE)                [1, 8]
Signal-to-Noise Ratio (SNR)                          [2, 21]
The ratio between accumulated model return and
maximum possible accumulated return                  [28]

this should be done if useful data is available. That implies fewer parameters and better input representation can help in better generalization performance.
• According to [6], NARMA(p, 0, q) is good at modeling and forecasting linear auto-regressive processes AR(p).
• According to [10], NARMAX(0, p̂, 0, n) and NARMAX(p, 0, q, n) do not have much advantage over NARX(p, 0, n) on forecasting NARX time series.
• According to [1, 2, 8, 9, 19, 21, 22, 24, 26], it was found that NARMA(p, 0, q) is better than NAR(p, 0) on some time series data.
• According to [2, 19, 21, 22], NARMA(p, 0, q) was found better than NAR(p, p̂) on some time series data.
• According to [19, 20, 22], it was highlighted that NARMA(p, p̂, q) is better than NAR(p, 0) and NARMA(p, 0, q) on some time series data.


Table 6 Non-residual-feedback ANNs used in papers used to answer RQ2

Non-residual-feedback ANN                               References
Evolutionary constructive and pruning neural network    [26]
Jordan Pi-Sigma neural network                          [2, 20]
Pi-Sigma neural network                                 [1, 2, 20]
Multi-layer perceptron                                  [1, 6, 8, 24, 28]
Elman recurrent ANN                                     [1, 8]
Multiplicative neuron model ANN                         [1, 8]
Radial basis function (RBF) network                     [8]
Multiplicative linear and non-linear ANN                [8]
Multiplicative seasonal ANN                             [8]
Feedforward generalized fuzzy neural network            [10]
Recurrent generalized fuzzy neural network              [10]
RBF-based adaptive fuzzy system                         [10]
Fuzzy neural network system using similarity analysis   [10]
Orthogonal-least-squares-based RBF network              [10]
Dynamic fuzzy neural network                            [10]
Fully recurrent neural network                          [6]
Polynomial neural network                               [9]
Ridge polynomial neural network                         [19–22]
Dynamic ridge polynomial neural network                 [19–22]

• According to [1, 8, 9, 19, 20, 22, 28], on some time series, residual-feedback ANN models showed better forecasting performance than some other forecasting methods, e.g., statistical methods, MLP, and naive methods.
To sum up the overall performance of the residual-feedback ANN models: they can produce better performance than non-residual-feedback ANNs in modeling and forecasting time series that possess a moving-average process (e.g., MA(q), NARMA(p, q), and NMA(q)). On the other hand, they do not have much advantage over non-residual-feedback ANNs on time series that do not possess a moving-average process (e.g., AR(p) and NARX time series). Additionally, residual-feedback ANNs can produce better performance than some statistical and benchmark forecasting methods. Tables 5 and 6 show, respectively, the forecasting performance metrics and the non-residual-feedback ANNs used in the papers that answer RQ2. Finally, for RQ3, after analyzing the selected papers, we found that the majority of the residual-feedback ANNs fed the calculated error directly to the network, except for [9, 24, 26], where the absolute error was used. It seems reasonable to state that one-step-ahead forecasting was used in the majority of the selected papers because the testing sets were not small; otherwise, the testing set's error would be large. That is because using network predictions and network errors to produce the forecasts introduces


additional errors, because the real observations are not available. There is only one solution among the selected papers, in [23], where the authors used the last calculated error during the generalization step. In addition, the works in [1, 8] stated that in the early stages of training, errors were taken as zero because the network output had not yet been calculated; however, it is not clear whether this was also done during the generalization (i.e., testing) step. Based on the answers to the third research question, there is a clear gap in the literature: solutions are needed for recursive multi-step time series forecasting using residual-feedback ANNs. Closing this gap would allow residual-feedback ANNs to be used in forecasting problems that require recursive multi-step forecasts rather than only 1-step forecasts, thus benefiting from the good performance such models can achieve.

3 The Existing Recursive Multi-step Forecast Strategy Solution

Without a recursive strategy, one can overcome this challenge by training h models to generate h-step forecasts [23]. Even so, this is not always simple, especially when there are many time series to deal with and when the models need to be retrained with newly collected observations [23]. According to [23], for the recursive strategy, after training a residual-feedback ANN to perform one-step-ahead forecasting, the trained model is used to produce the forecasts. For simplicity, we will consider a NARMA(p, 0, 1) model that has only one residual feedback. Therefore, the inputs used to generate the forecasts are the lagged values from a time series and the last network error. The error, e_t, is given by:

e_t = d_t − y_t    (1)

The first produced forecast value is given by:

ŷ_{t+1} = f̂(y_t, …, y_{t−d+1}, e_t)    (2)

where d refers to the embedding dimension. Next, ŷ_{t+1} is used to generate the second forecast value:

ŷ_{t+2} = f̂(ŷ_{t+1}, y_t, …, y_{t−d+2}, e_{t+1})    (3)

We continue in this manner until the entire forecast horizon has been covered:

ŷ_{t+h} = f̂(ŷ_{t+h−1}, …, ŷ_{t+h−d}, e_{t+h−1})    (4)


Fig. 3 Training data in the last epoch

However, since the real observations (i.e., actual values) from d_{t+1} until d_{t+h−1} are not yet observed, it is not possible to calculate the errors from e_{t+1} to e_{t+h−1}. The work in [23] proposed using the last error calculated during the generalization step (Naive). For simplicity, let us assume the error values in the last epoch are as shown in Fig. 3, and that we need to forecast the next three values using the last observed data and the forecasts from the ANN model, as shown in Fig. 4. Based on that, the last error value calculated in the last training epoch (i.e., −0.03, as shown in Fig. 3) is used for all errors from e_{t+1} to e_{t+h−1}, as shown in Fig. 5. This solution (i.e., the last calculated error) was used with a residual-feedback ANN [21] to forecast carbon dioxide emissions for three countries. Its results were compared to those of some benchmark statistical forecasting methods: naive, auto-regressive integrated moving-average (ARIMA), three exponential smoothing methods, and Theta. The results showed that the solution produces reasonable forecasts compared to these benchmark methods, as shown in Fig. 6. Therefore, it was pointed out that this solution can be used with residual-feedback ANNs for a recursive multi-step forecast strategy.
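The Naive solution can be sketched as a recursion over a trained one-step model. The code below is our own illustration: a stand-in `toy_model` replaces the trained residual-feedback ANN, and the function name is ours:

```python
def recursive_forecast(model, history, last_error, d, h):
    """Recursive multi-step forecasting for a NARMA(d, 0, 1)-style model.
    `model` maps d lagged values plus one error term to a one-step forecast.
    Following the Naive solution of [23], the last error calculated during
    training is reused for every step, since the true errors e_{t+1}, ...,
    e_{t+h-1} cannot be computed out of sample."""
    window = list(history[-d:])         # most recent d observations
    forecasts = []
    for _ in range(h):
        y_next = model(window + [last_error])
        forecasts.append(y_next)
        window = window[1:] + [y_next]  # feed the forecast back as an input
    return forecasts

# Stand-in "trained model": the average of its d lag inputs (illustrative only)
toy_model = lambda inputs: sum(inputs[:-1]) / (len(inputs) - 1)
fcasts = recursive_forecast(toy_model, [1.0, 2.0, 3.0, 4.0],
                            last_error=-0.03, d=4, h=3)
print(fcasts)  # [2.5, 2.875, 3.09375]
```

Note how each forecast is pushed back into the lag window while the error input stays frozen at −0.03, exactly the substitution shown in Fig. 5.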

4 Limitation

Although many error keywords were used in our search, other researchers might have used error keywords not covered here. Further, researchers might not have used at least one keyword from every group; for example, they might have used “exchange rate” instead of “time series.”


Fig. 4 Feeding out-of-sample data to the trained residual-feedback ANN

Fig. 5 Naive error solution [23]

5 Conclusions and Future Works

This paper reviews the existing literature on residual-feedback ANNs for time series forecasting. From the analysis of the 19 selected papers, only one solution was found that addresses the existing challenge of using residual-feedback ANNs for recursive multi-step time series forecasting. Building on the discussion of the existing solution, this paper proposes the following potential solutions for further practical exploration:
• Using a zero error value (Zero): zeros may be used for all errors from e_{t+1} to e_{t+h−1}, as shown in Fig. 7a.
• Using the absolute Naive (|Naive|): the absolute Naive error value may be used for all errors from e_{t+1} to e_{t+h−1}, as shown in Fig. 7b.
• Using the average of the error values calculated in the last epoch (Average): the average of the error values calculated in the last training epoch may be used for all errors from e_{t+1} to e_{t+h−1}, as shown in Fig. 7c.


Fig. 6 Average ranking between the Naive error solution and other forecasting methods. The lower, the better. Adapted from [23]

• Using the absolute Average (|Average|): similar to the Average solution, except that the absolute value may be used for all errors from e_{t+1} to e_{t+h−1}, as shown in Fig. 7d.
• Using the last h calculated error values in the last epoch (h-error): if h forecast points are needed, the last h errors calculated in the last training epoch may be used when generating the h forecasts. For example, if we need to forecast the next three future values, the last three errors calculated in the last epoch (as shown in Fig. 3) are used, as shown in Fig. 7e.
• Using the absolute h-error (|h-error|): similar to the h-error solution, except that the absolute values may be used for all errors, as shown in Fig. 7.
Given these potential solutions, an important question remains open: which solution is suitable for a given time series, and when is it not? An extensive analysis involving many time series and various residual-feedback ANNs is needed to answer this question. Furthermore, explainable AI (XAI) [16] methods can be explored with residual-feedback ANNs to enhance their performance (e.g., residual importance), understand their decisions, and select the best models (i.e., model comparison).
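The proposed strategies can be summarized in a small dispatch function. This is our own sketch: the last-epoch error values below are toy numbers, and the function name is ours:

```python
def substitute_errors(strategy, epoch_errors, h):
    """Return substitute values for the unobservable errors needed when
    producing h recursive forecasts, one entry per proposed strategy.
    `epoch_errors` are the errors recorded in the last training epoch."""
    last = epoch_errors[-1]
    avg = sum(epoch_errors) / len(epoch_errors)
    table = {
        "zero":        [0.0] * h,                          # Zero
        "naive":       [last] * h,                         # Naive [23]
        "abs_naive":   [abs(last)] * h,                    # |Naive|
        "average":     [avg] * h,                          # Average
        "abs_average": [abs(avg)] * h,                     # |Average|
        "h_error":     list(epoch_errors[-h:]),            # h-error
        "abs_h_error": [abs(v) for v in epoch_errors[-h:]],  # |h-error|
    }
    return table[strategy]

epoch_errors = [0.10, -0.07, 0.02, -0.03]  # toy last-epoch errors
for s in ("zero", "naive", "average", "h_error"):
    print(s, substitute_errors(s, epoch_errors, h=3))
```

Each strategy fills the same h slots, so they can be swapped into the recursive forecast loop without changing the trained model itself.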


Fig. 7 Examples of out-of-sample data using the proposed solutions

References

1. Akdeniz, E., Egrioglu, E., Bas, E., Yolcu, U.: An ARMA type Pi-Sigma artificial neural network for nonlinear time series forecasting. J. Artif. Intell. Soft Comput. Res. 8(2), 121–132 (2018)
2. Akram, U., Ghazali, R., Ismail, L., Zulqarnain, M., Husaini, N., Mushtaq, M.: An improved Pi-Sigma neural network with error feedback for physical time series prediction. Int. J. Adv. Trends Comput. Sci. Eng. 8, 1–7 (2019)
3. Burgess, A., Refenes, A.P.: Modelling non-linear moving average processes using neural networks with error feedback: An application to implied volatility forecasting. Signal Process. 74(1), 89–99 (1999). https://doi.org/10.1016/S0165-1684(98)00202-3
4. Cisneros, L., Rivera, G., Florencia, R., Sánchez-Solís, J.P.: Fuzzy optimisation for business analytics: A bibliometric analysis. J. Intell. Fuzzy Syst. 44(2), 2615–2630 (2023). https://doi.org/10.3233/JIFS-221573
5. Cogollo, M.R., Velasquez, J.D.: Are neural networks able to forecast nonlinear time series with moving average components? IEEE Lat. Am. Trans. 13(7), 2292–2300 (2015). https://doi.org/10.1109/TLA.2015.7273790
6. Connor, J., Martin, R., Atlas, L.: Recurrent neural networks and robust time series prediction. IEEE Trans. Neural Netw. 5(2), 240–254 (1994). https://doi.org/10.1109/72.279188
7. Dash, P., Satpathy, H., Liew, A., Rahman, S.: A real-time short-term load forecasting system using functional link network. IEEE Trans. Power Syst. 12(2), 675–680 (1997). https://doi.org/10.1109/59.589648


8. Egrioglu, E., Yolcu, U., Aladag, C.H., Bas, E.: Recurrent multiplicative neuron model artificial neural network for non-linear time series forecasting. Neural Process. Lett. 41(2), 249–258 (2015). https://doi.org/10.1007/s11063-014-9342-0
9. Fong, S.J., Li, G., Dey, N., Crespo, R.G., Herrera-Viedma, E.: Finding an accurate early forecasting model from small dataset: A case of 2019-nCoV novel coronavirus outbreak. Int. J. Interact. Multimed. Artif. Intell. 6(1), 132–140 (2020). https://doi.org/10.9781/ijimai.2020.02.002
10. Gao, Y., Er, M.J.: NARMAX time series model prediction: feedforward and recurrent fuzzy neural network approaches. Fuzzy Sets Syst. 150(2), 331–350 (2005). https://doi.org/10.1016/j.fss.2004.09.015
11. Granger, C.W.: Some properties of time series data and their use in econometric model specification. J. Econ. 16(1), 121–130 (1981)
12. Kitchenham, B., Charters, S.: Guidelines for performing systematic literature reviews in software engineering. Technical report, Ver. 2.3 EBSE Technical Report. EBSE. University of Durham (2007)
13. Lanza, P.A.G., Cosme, J.M.Z.: A short-term temperature forecaster based on a state space neural network. Eng. Appl. Artif. Intell. 15(5), 459–464 (2002). https://doi.org/10.1016/S0952-1976(02)00089-1
14. Neji, Z., Beji, F.M.: Neural network and time series identification and prediction. In: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), vol. 4, pp. 461–466 (2000). https://doi.org/10.1109/IJCNN.2000.860814
15. Panda, C., Narasimhan, V.: Forecasting exchange rate better with artificial neural network. J. Policy Model. 29(2), 227–236 (2007)
16. Saeed, W., Omlin, C.: Explainable AI (XAI): A systematic meta-survey of current challenges and future opportunities. Knowl.-Based Syst. 110273 (2023). https://doi.org/10.1016/j.knosys.2023.110273
17. Shlens, J.: Time series prediction with artificial neural networks. In: Computer Science Program. Swarthmore College, Los Angeles (1999)
18. Waheeb, W., Ghazali, R.: Multi-step time series forecasting using ridge polynomial neural network with error-output feedbacks. In: International Conference on Soft Computing in Data Science, pp. 48–58. Springer, Berlin (2016)
19. Waheeb, W., Ghazali, R.: Forecasting the behavior of gas furnace multivariate time series using ridge polynomial based neural network models. Int. J. Interact. Multimed. Artif. Intell. 5(5), 126–133 (2019). https://doi.org/10.9781/ijimai.2019.04.004
20. Waheeb, W., Ghazali, R.: A novel error-output recurrent neural network model for time series forecasting. Neural Comput. Appl. 32(13), 9621–9647 (2020). https://doi.org/10.1007/s00521-019-04474-5
21. Waheeb, W., Ghazali, R., Herawan, T.: Ridge polynomial neural network with error feedback for time series forecasting. PLOS ONE 11(12), 1–34 (2016). https://doi.org/10.1371/journal.pone.0167248
22. Waheeb, W., Ghazali, R., Shah, H.: Nonlinear autoregressive moving-average (NARMA) time series forecasting using neural networks. In: 2019 International Conference on Computer and Information Sciences (ICCIS), pp. 1–5 (2019). https://doi.org/10.1109/ICCISci.2019.8716417
23. Waheeb, W., Shah, H., Jabreel, M., Puig, D.: Ridge polynomial neural network with error feedback for recursive multi-step forecast strategy: A case study of carbon dioxide emissions forecasting. In: 2020 2nd International Conference on Computer and Information Sciences (ICCIS), pp. 1–6 (2020). https://doi.org/10.1109/ICCIS49240.2020.9257685
24. Wan, D., Hu, Y., Ren, X.: BP neural network with error feedback input research and application. In: Second International Conference on Intelligent Computation Technology and Automation (ICICTA '09), vol. 1, pp. 63–66. IEEE (2009). https://doi.org/10.1109/ICICTA.2009.24
25. Wong, W.K., Xia, M., Chu, W.: Adaptive neural network model for time-series forecasting. Eur. J. Oper. Res. 207(2), 807–816 (2010). https://doi.org/10.1016/j.ejor.2010.05.022


26. Yang, S.H., Wang, H.L., Lo, Y.C., Lai, H.Y., Chen, K.Y., Lan, Y.H., Kao, C.C., Chou, C., Lin, S.H., Huang, J.W., Wang, C.F., Kuo, C.H., Chen, Y.Y.: Inhibition of long-term variability in decoding forelimb trajectory using evolutionary neural networks with error-correction learning. Front. Comput. Neurosci. 14, 22 (2020). https://doi.org/10.3389/fncom.2020.00022
27. Zemouri, R., Gouriveau, R., Zerhouni, N.: Defining and applying prediction performance metrics on a recurrent NARX time series model. Neurocomputing 73(13–15), 2506–2521 (2010)
28. Zimmermann, H.G., Neuneier, R., Grothmann, R.: Modeling dynamical systems by error correction neural networks. In: Soofi, A.S., Cao, L. (eds.) Modelling and Forecasting Financial Data: Techniques of Nonlinear Dynamics, pp. 237–263. Springer US, Boston, MA (2002). https://doi.org/10.1007/978-1-4615-0931-8_12

Feature Selection: Traditional and Wrapping Techniques with Tabu Search Laurentino Benito-Epigmenio, Salvador Ibarra-Martínez, Mirna Ponce-Flores, and José Antonio Castán-Rocha

Abstract Feature selection is an important step in improving the performance of machine learning algorithms. This paper describes a comparative study of traditional feature selection techniques and a wrapping technique with tabu search. To validate our wrapper with tabu search approach, we implemented three feature selection techniques: correlation, entropy, and principal component analysis. Nevertheless, to evaluate their performance, we used three classification algorithms: a J48 decision tree, a random forest, and an artificial neural network. We selected five datasets with a large number of features from public repositories. The experimental results showed that the subsets provided by tabu search have high performance with a J48 decision tree. Additionally, tabu search was better ranked than the other feature selection techniques. Finally, we consider that tabu search is a good alternative for feature selection paired with a simple classification algorithm. Keywords Feature selection · Tabu search · Dimensionality reduction · Machine learning

1 Introduction

Tabu search is a metaheuristic used to solve combinatorial optimization problems. It iteratively explores the solution space by making small changes to the current solution and accepting or rejecting those changes based on a set of criteria [3, 31, 40]. A key feature of tabu search is the use of a “tabu list,” which keeps track of recently explored solutions and prevents the algorithm from revisiting them. This helps ensure that the algorithm explores a wide range of solutions and avoids getting stuck in local optima [4, 36].

L. Benito-Epigmenio · S. Ibarra-Martínez (B) · M. Ponce-Flores · J. A. Castán-Rocha
Departamento de Posgrado e Investigación, Facultad de Ingeniería, Universidad Autónoma de Tamaulipas (UAT), Tampico Tamaulipas, Mexico
e-mail: [email protected]
J. A. Castán-Rocha
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
G. Rivera et al. (eds.), Innovations in Machine and Deep Learning, Studies in Big Data 134, https://doi.org/10.1007/978-3-031-40688-1_2



In this paper, we propose applying tabu search to feature selection, which is the selection of relevant features from a larger set to improve the performance of a machine learning model while minimizing processing time. In data science, filter and wrapper methods are the main approaches to feature selection [29]. The filter method uses statistical techniques to identify the most relevant characteristics. It ranks the features according to their correlation with the target variable, that is, their relevance to the model. The main advantage of the filter method is its efficiency: it usually requires fewer computational resources and can be applied to high-dimensional datasets [2, 30, 38]. However, it does not consider the interaction between features for a particular model [11]. The wrapper method involves building a model and evaluating its performance with different subsets of features. It is an iterative process that selects the best subset of features by testing various combinations in a search space [22, 32]. The wrapper method can select the optimal subset of features but may overfit the model by making a feature selection that is relevant only to the training set. This method is computationally expensive, but it has the advantage of being able to identify non-linear relationships between features [41]. Therefore, the choice of method depends on the specific problem and the characteristics of the data. The tabu search algorithm can be used as a wrapper approach for feature selection by exploring the space of possible feature subsets and selecting the subset that gives the best performance. One advantage of using tabu search is that it can handle a large number of features and explore a wide range of feature subsets, which can be particularly useful in high-dimensional datasets [9, 10, 18].
Tabu search can also capture non-linear relationships between features and the target variable, which simpler feature selection methods may miss. Another advantage of tabu search for feature selection is that it can be customized to fit specific problem constraints. For example, we can set constraints on the maximum number of features to include in the final subset or on the minimum correlation between features. Additionally, we can use domain-specific knowledge to inform the search process, such as prior knowledge about which features are likely to be important. However, there are also some potential disadvantages to using tabu search for feature selection. One is that it can be computationally expensive, especially when searching over a large number of features, which can make it difficult to apply to large datasets. Additionally, the effectiveness of tabu search depends on the quality of the evaluation function used to assess the performance of each feature subset. If the evaluation function is not well-designed or is too computationally expensive, the search may be slow or may not converge at all. Therefore, we propose to use the accuracy of a J48 tree as the objective function in the tabu search algorithm. The J48 decision tree is a widely used and well-established machine-learning method for classification [35, 39]. It is a deterministic approach that consistently produces the same results, which makes the objective function reliable and allows the search to converge. In this case, it can be used as an objective function in the tabu search algorithm by evaluating the quality of a solution based on the accuracy of the J48 tree. The use of a J48 decision tree can have several advantages.

Feature Selection: Traditional and Wrapping Techniques with Tabu Search


First, it is easy to interpret and understand and can handle both categorical and numerical data. This helps to ensure that the final feature subset is not only predictive but also interpretable. Second, a J48 decision tree is a computationally efficient and fast classification method. It can handle large datasets and high-dimensional feature spaces, which can be important for feature selection tasks involving a large number of features. Third, the J48 decision tree can handle missing data and noisy features, which is important in real-world applications where data may be incomplete or noisy [13, 27]. Hence, using the accuracy of a J48 decision tree as the objective function for feature selection in a tabu search approach offers several advantages: computational efficiency, determinism, convergence, interpretability, robustness to missing data and noise, and the ability to capture complex relationships between the features and the target variable. This paper proposes a feature selection method based on tabu search and compares it with some traditional techniques found in the literature. The rest of the paper is organized as follows: Section 2 reviews the related work on feature selection, providing a comprehensive overview of the existing methods and the importance of feature selection. Sections 3 and 4 present a detailed description of the proposed methodology, including the algorithmic steps and its features, and mention the characteristics of the datasets used in the experiments. Section 5 presents the classification performance results of the feature selection techniques, where the proposed approach is compared with traditional techniques found in the literature. Section 6 provides a discussion of the results and an analysis of the implications of the findings. Finally, Sect. 7 presents conclusions and future work, summarizing the paper's contributions and suggesting future research.

2 Related Work

In the literature, we find works that address the issue of feature selection and propose innovative approaches to improve the process. Here we present a review of some previous work related to feature selection. Liu et al. [21] in 2019 proposed a feature selection method using a weighted Gini index to solve a class imbalance problem. The work presents a feature selection method called neighborhood relationship preserving score for multi-label classification, which implements a backward elimination approach for feature selection with embedded support vector machines. The authors compared their proposal with three feature selection techniques: Chi2, F-statistic, and Gini index. They used two datasets with 36 and 16 features for the experiments. The results show that when there are few features, F-statistic and Chi2 perform well, and as the number of features increases, the proposed feature selection method performs best.


Lima et al. [20] in 2020 showed the importance of machine learning algorithms as support for early diagnosis to prevent deaths due to lack of early medical treatment. Machine learning techniques can aid the healthcare field by minimizing errors and providing useful information for diagnosis; therefore, it is important that these techniques be as efficient and accurate as possible. To address this problem, the authors proposed a new feature selection algorithm combined with a twin-bounded support vector machine. They experimented with eight datasets commonly used in the medical field. The results showed that the proposed method is very robust, achieving high performance with few attributes. Ghosh et al. [14] in 2020 proposed a wrapper-filter feature selection technique based on ant colony optimization. Here the authors introduce subset evaluation using a filter method instead of a wrapper method to reduce computational complexity. They used a structured memory to keep the best ants and a feature-dimension-dependent pheromone update to perform feature selection as a multi-objective task. The authors also evaluated their approach on several real-life datasets using a K-nearest neighbor and a multilayer perceptron classifier. Experimental results clearly show that the proposed method outperforms most of the state-of-the-art algorithms used for feature selection. Alazzam et al. [6] in 2020 proposed a new feature selection algorithm for intrusion detection systems based on a pigeon-inspired optimizer. They used three popular datasets to evaluate the algorithm. The proposed algorithm significantly reduced the number of attributes in each dataset; due to this reduction, the machine learning algorithms reduced their construction time and increased their accuracy. Zhou et al. [42] in 2021 proposed the concept of feature weight as a standard for the feature selection process in decision tree construction.
In this case, to build each tree layer, they first calculate the weight of each feature and choose the feature with the highest weight. Additionally, the authors proposed a new algorithm to pre-filter the features before building the decision tree. They used twelve datasets from the UCI machine learning repository for the experiments. The pre-filtering algorithm improved the performance of the classification algorithms and reduced their computational cost; it performs better when working with discrete features but requires more processing time. Omuya et al. [26] in 2021 presented a new hybrid filter model for feature selection based on principal component analysis and information gain. The authors perform a sequential feature selection at two levels, reducing or reorganizing the number of evaluations in the model. In the experiments, they used a database with nine attributes indicating the presence or occurrence of breast cancer and divided it into two parts, 75% for training and 25% for testing. The results show an increase in feature selection accuracy and overall performance. Therefore, the authors concluded that feature selection does improve the performance of the classification model. Got et al. [15] in 2021 considered feature selection a multi-objective problem and noted that most approaches treat it as a single-objective one. Feature selection maximizes classification accuracy while the algorithm works with a minimum number of features; the trade-off between the number of features and the accuracy is what makes it a multi-objective problem. The authors propose a novel hybrid filter-wrapper feature selection approach using a whale optimization algorithm. They considered the algorithm multi-objective because it seeks the simultaneous optimization of a filter and a wrapper. To validate the proposed method, they implemented seven state-of-the-art algorithms and used twelve datasets. The proposed algorithm achieves good classification accuracy with a smaller number of features. However, they observed that performance degrades as the number of classes increases. According to the recent works analyzed, the main focus is the search for efficient and accurate feature selection methods applicable to different real-life problems. The experimentation shows a performance increase in the classification algorithms after feature selection, independently of the technique used. However, the wrapper technique has a higher computational cost than filtering techniques but tends to produce better results because it tests its feature selections with a particular classification approach. In most cases, the proposed algorithms performed better than the classical techniques but showed problems when the number of classes increased considerably. Overall, these papers show the importance of feature selection and the need for efficient and accurate methods for different applications. The results of these studies indicate that there is still much to improve and research in the field of feature selection.

3 Methodology As mentioned above, this research proposes using a tabu search algorithm for feature selection and compares its performance with other known techniques in the literature. Figure 1 shows the steps of the methodology used to evaluate the different feature selection approaches.

Fig. 1 Methodology


In the first step, we selected five datasets from public repositories and performed preprocessing on each dataset. Data preprocessing is very important to improve the quality of the data; it may include data standardization, handling nominal data, identifying outliers, and carrying out data imputation. The goal of data preprocessing is to make the data more suitable for analysis, enabling more accurate performance of the classification algorithms. In the second step, we applied various techniques to select a subset of features for each dataset. These techniques included correlation analysis, entropy-based feature selection, principal component analysis (PCA), and our proposed algorithm based on tabu search. In the third step, we evaluated the performance of each subset of features using the five-fold cross-validation accuracy obtained with three different classification techniques: a J48 decision tree, a random forest, and an artificial neural network. We chose these algorithms because they are widely used in the machine learning literature and provide good performance across diverse datasets. Finally, we used Weka software to perform this process.

3.1 Data Description

The experiments were carried out using five datasets from public repositories, each with a relatively large number of features and intended for classification. The information for each dataset is shown in Table 1.

Table 1 Description of the datasets

Dataset      Features  Instances  Classes
Students     32        395        4
Divorce      54        170        2
Flags        30        194        8
Celeb-Faces  39        20000      2
Credit       20        1000       2

3.2 Entropy-Based Feature Selection

Feature selection by means of entropy is a filter method that measures the information level of a set of attributes and selects those with a lower degree of disorder. Features with low entropy provide more useful information to machine learning algorithms [1, 33]. This method does not require high computational cost. In this research, the following steps were followed for attribute selection using entropy:

1. We calculate the entropy of each feature. For this process, we use the entropy package from the R programming language, which provides functions for estimating the Kullback-Leibler divergence, Shannon entropy, and mutual information of two variables, as well as tools for estimating differential Shannon entropy and the differential coefficient of variation.
2. We create a sorted list of the features. This list helps us identify the features with low entropy, which provide more information and are considered more relevant.
3. Finally, we establish a threshold on the maximum entropy value for feature selection. In this case, we select the features with an entropy value lower than or equal to 0.65.
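The steps above can be sketched in Python; note that the chapter uses R's entropy package, so the histogram-based Shannon estimator, the bin count, and the toy data below are illustrative assumptions rather than the authors' exact procedure:

```python
import numpy as np

def shannon_entropy(column, bins=10):
    """Estimate the Shannon entropy (in bits) of one feature via a histogram."""
    counts, _ = np.histogram(column, bins=bins)
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

def select_by_entropy(X, threshold=0.65):
    """Rank features by entropy and keep those at or below the threshold."""
    entropies = [shannon_entropy(X[:, j]) for j in range(X.shape[1])]
    ranked = sorted(range(len(entropies)), key=lambda j: entropies[j])
    return [j for j in ranked if entropies[j] <= threshold], entropies

# Toy data: a near-constant feature (low entropy) next to a uniform one.
rng = np.random.default_rng(0)
X = np.column_stack([
    np.concatenate([np.zeros(95), np.ones(5)]),  # low-disorder feature
    rng.uniform(size=100),                       # high-disorder feature
])
selected, ents = select_by_entropy(X)            # only feature 0 survives
```

With this toy data, the near-constant first feature falls below the 0.65 threshold, while the uniform feature does not.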

3.3 Feature Selection Using Principal Component Analysis

Principal component analysis (PCA) is a statistical technique used to reduce the dimensionality of data or to perform feature selection [16, 34], where the goal is to identify the most relevant features that best explain the variation in the data. The feature selection procedure using PCA involves several steps:

1. We calculate the PCA. In this step, we use the R programming language and select the first two principal components (PC1 and PC2). These components are linear combinations of the original features, where the first component explains the maximum amount of variation in the data, followed by the second component, and so on. Generally, the number of principal components chosen depends on the desired amount of data reduction, as more principal components provide more information but take longer to calculate.
2. We calculate the Euclidean distance between the PC1 and PC2 values of each feature. These procedures are commonly used in many areas of data analysis and are essential for gaining insights and making informed decisions based on data.
3. We group the features into quartiles. This approach allows us to identify similarities among features and helps simplify the analysis.
4. Finally, two subsets of data are created based on the feature quartiles. The first subset includes only the features in quartile 1, while the second subset includes those in quartiles 1 and 2.
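One plausible reading of these steps, sketched in Python with a plain SVD standing in for R's PCA routines; the loading-distance interpretation and the direction of the quartile cut are assumptions, since the chapter does not fully specify them:

```python
import numpy as np

def pca_quartile_subsets(X):
    """Group features by the Euclidean norm of their (PC1, PC2) loadings
    and return the quartile-1 and quartile-1+2 subsets (hypothetical sketch)."""
    Xc = X - X.mean(axis=0)                  # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:2].T                      # one (PC1, PC2) pair per feature
    dist = np.linalg.norm(loadings, axis=1)  # Euclidean distance per feature
    q1, q2 = np.quantile(dist, [0.25, 0.50])
    subset_q1 = [j for j, d in enumerate(dist) if d <= q1]
    subset_q12 = [j for j, d in enumerate(dist) if d <= q2]
    return subset_q1, subset_q12

# Toy run on random data with 8 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
pca1_features, pca2_features = pca_quartile_subsets(X)
```

By construction, the quartile-1 subset is always contained in the quartile-1+2 subset, matching the nested PCA1/PCA2 subsets used in the experiments.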

3.4 Correlation-Based Feature Selection

In this section, we propose a correlation-based feature selection that identifies redundancy among features, instead of the usual selection of features highly correlated with the target class. Correlation-based feature selection is a filter method that


seeks to identify features with high correlation. If two features are highly correlated, both provide similar behavior for the learning model; this means redundant information, so it is better to delete one of them. In the literature, correlation-based feature selection is a technique for selecting a subset of features from a dataset with a strong correlation to the target variable. The objective is to identify features with a high correlation with the target variable, as these features are the most informative in predicting it [5, 12, 19]. The steps used in this research for feature selection through correlation are as follows:

1. We create a correlation matrix. This matrix captures the pairwise correlation between all the features in the dataset. The correlation coefficient ranges between –1 and 1, where values close to 1 indicate a strong positive correlation, values close to –1 indicate a strong negative correlation, and values close to 0 indicate no correlation between the two features.
2. The correlation coefficients are ranked in descending order, forming a list. This list allows us to identify the features with the highest correlation with the target variable. It also enables us to detect feature pairs that are highly correlated with each other.
3. Finally, we eliminate one of the attributes in each pair with a correlation value equal to or above 0.65. This threshold indicates a relatively high degree of correlation between the two features; therefore, we remove one of them to reduce redundant information and obtain a subset of features highly correlated with the target variable.

This proposal is still in development regarding parameter tuning and the selection of the attribute to be removed.
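A minimal Python sketch of this redundancy filter; since the chapter leaves open which feature of a correlated pair to remove, the greedy keep-the-first rule below is an assumption:

```python
import numpy as np

def drop_redundant(X, threshold=0.65):
    """Drop one feature from each pair whose absolute pairwise correlation
    is at or above the threshold (keeps the earlier feature of each pair)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    n = X.shape[1]
    keep = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if i in keep and j in keep and corr[i, j] >= threshold:
                keep.remove(j)  # feature j is redundant with feature i
    return keep

# Toy data: feature 2 is a noisy copy of feature 0 and should be dropped.
rng = np.random.default_rng(2)
a = rng.normal(size=200)
X = np.column_stack([a, rng.normal(size=200), a + 0.05 * rng.normal(size=200)])
kept = drop_redundant(X)
```

On this toy data, the noisy copy correlates with feature 0 well above 0.65 and is removed, leaving features 0 and 1.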

4 Tabu Search

The tabu search (TS) algorithm is a local search procedure used in optimization problems. Glover first proposed the algorithm, based on the principles of adaptive memory and responsive exploration, in 1986 [17]. The main objective of TS is to escape from local optima, and it does so by using a memory called the tabu list to record the recent search history [7]. The difference between tabu search and other metaheuristics is its reliance on systematic and deterministic methods that minimize the involvement of any random process. The use of memory allows TS to avoid cycling in the search space, even while allowing moves that worsen the solution. After some time, the solutions stored in the tabu list lose their tabu status and can be visited again from other areas of the search space.


4.1 Initial Solution

The algorithm starts by creating a single solution, a binary vector that represents whether each specific feature is used: 1 means that the feature is enabled in the solution, and 0 means it is ignored. The decision is made at random, with a probability of 70% of being 1 and 30% of being 0. Figure 2 shows the initial solution. In practice, we first produce ten solutions randomly, evaluate their classification performance using the accuracy reached by a J48 tree, and choose the best among them to continue the rest of the algorithm.
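A small Python sketch of this initialization; the `evaluate` callback stands in for the J48 accuracy used in the chapter, and the toy objective below is only for illustration:

```python
import random

def random_solution(n_features, p_enable=0.7, rng=random):
    """Binary vector: 1 (prob. 0.7) enables a feature, 0 (prob. 0.3) ignores it."""
    return [1 if rng.random() < p_enable else 0 for _ in range(n_features)]

def initial_solution(n_features, evaluate, k=10, rng=random):
    """Draw k random solutions and keep the best one under `evaluate`
    (a stand-in for the J48 cross-validation accuracy)."""
    return max((random_solution(n_features, rng=rng) for _ in range(k)),
               key=evaluate)

# Toy objective for illustration only: prefer smaller feature subsets.
start = initial_solution(8, evaluate=lambda s: -sum(s), rng=random.Random(42))
```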

4.2 Neighborhood

To create a neighborhood, we first need to define a move that is applied multiple times to a solution to produce its neighborhood [23, 37]. We propose an exchange movement; Fig. 3 represents this process. Once the exchange move is defined, we create a neighborhood: as stated before, it is produced by applying the defined move to each element of the original solution, producing one new solution per element. Then the best solution is selected, and the process continues from that neighbor. Figure 4 shows the neighborhood created from an initial solution.
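Assuming the exchange movement is a bit flip at a single position (one reading of Fig. 3), the neighborhood of a solution can be sketched as:

```python
def neighborhood(solution):
    """One neighbor per position: exchange (flip) the bit at that position."""
    neighbors = []
    for i in range(len(solution)):
        neighbor = solution.copy()
        neighbor[i] = 1 - neighbor[i]   # 0 <-> 1 exchange on feature i
        neighbors.append(neighbor)
    return neighbors

# A 3-feature solution yields three neighbors, one per flipped position.
nbrs = neighborhood([1, 0, 1])
# nbrs == [[0, 0, 1], [1, 1, 1], [1, 0, 0]]
```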

Fig. 2 Initial solution

Fig. 3 Exchange movement


Fig. 4 Initial solution neighborhood

4.3 Objective Function

The tabu algorithm requires an objective function, which here is the classification performance of the dataset for a specific feature selection. For this purpose, we use a J48 decision tree because it is a simple algorithm with no random elements; therefore, it is deterministic and reliable. This contrasts with other classification algorithms, such as neural networks, which, due to their internal processes, do not always produce the same results even when trained with the same solution.

4.4 Memory Structures

The concept of memory is specific to a set of metaheuristic algorithms, the tabu algorithm being one of them. This algorithm uses historical information to escape from local optima, to explore the search space, and also to carry out an exploitation strategy within it [8, 28]. In Fig. 5, we can see this process represented. In this implementation, the tabu list blocks changes in the selection of specific features for a period of n iterations, targeting features that have improved the performance of the current solution. Therefore, when we choose a new solution from the neighborhood, the tabu list blocks the selected or unselected attribute for n iterations. This blocking forces the search to explore neighboring solutions while, at the same time, keeping it near solutions with a good selection of features. Finally, a feature is unlocked when the current iteration exceeds the number stored in the tabu list for that feature.
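Putting the pieces together, a minimal Python sketch of the described loop; the `tabu_until` bookkeeping and the toy `sum` objective are illustrative assumptions, since in the chapter the objective is the J48 cross-validation accuracy:

```python
import random

def tabu_search(n_features, evaluate, tenure=5, max_iters=50, seed=0):
    """Minimal tabu search sketch: `tabu_until` maps a feature index to the
    iteration at which flipping it becomes allowed again."""
    rng = random.Random(seed)
    current = [1 if rng.random() < 0.7 else 0 for _ in range(n_features)]
    best, best_score = current[:], evaluate(current)
    tabu_until = {}
    for it in range(max_iters):
        moves = []
        for i in range(n_features):
            if tabu_until.get(i, 0) <= it:        # skip tabu-blocked features
                neighbor = current[:]
                neighbor[i] = 1 - neighbor[i]     # exchange move on feature i
                moves.append((evaluate(neighbor), i, neighbor))
        if not moves:
            continue
        score, i, current = max(moves, key=lambda m: m[0])
        tabu_until[i] = it + tenure               # block feature i for `tenure` iterations
        if score > best_score:                    # remember the best solution seen
            best, best_score = current[:], score
    return best, best_score

# Toy run: with `sum` as objective, the search should enable every feature.
best, score = tabu_search(6, evaluate=sum, max_iters=40)
```

Note that the search may move to worsening neighbors when better ones are tabu-blocked, which is exactly the mechanism that lets it leave local optima; the `best` variable preserves the best solution found along the way.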

Fig. 5 Memory structures

5 Results

The experiments were carried out using correlation, entropy, PCA, and tabu search. We used three classification algorithms to evaluate the feature selection techniques: a J48 decision tree, a random forest, and an artificial neural network. In this section, we present the number of features produced with each technique for each original dataset (see Table 2); Figs. 6, 7, 8, 9 and 10 show the corresponding performance, i.e., the percentage of correctly classified cases for each of the classification algorithms used to evaluate the feature selection.

Table 2 Number of features produced with each technique for each dataset

Dataset      Original  Correlation  Entropy  PCA1  PCA2  Tabu
Students     32        29           8        8     16    21
Divorce      54        30           46       14    28    27
Flags        29        28           12       7     14    17
Celeb-Faces  39        36           39       10    20    24
Credit       20        20           20       5     10    10

Fig. 6 Students


Fig. 7 Divorce

Fig. 8 Flags

To analyze the performance of the feature selection techniques, we ranked the techniques for each dataset separately. The first technique that immediately outperforms the original dataset receives 1 point, the one that outperforms that technique receives 2 points, and so on; therefore, the best-performing technique is the one that receives the most points. Finally, we sum the points obtained by each technique across datasets: the higher the score, the better the performance. The results show that tabu search is better than the other feature selection techniques, with 12 points, while the second best is correlation, with 8.5 points. Table 3 shows the score of all the techniques: the first column shows the name of the technique, the second column the points obtained by the ranking, and the third column the performance of the technique, where more "+" symbols mean better performance. Finally, PCA1 and entropy are the feature selection techniques with the lowest performance.
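One reading of this scoring scheme, sketched in Python; the accuracies below are hypothetical and not taken from the experiments:

```python
def rank_points(accuracies, original):
    """Techniques that beat the original dataset's accuracy earn 1, 2, ...
    points in ascending order of accuracy, so the best technique earns
    the most points (one interpretation of the ranking in this section)."""
    beaten = sorted((acc, name) for name, acc in accuracies.items()
                    if acc > original)
    return {name: rank + 1 for rank, (_, name) in enumerate(beaten)}

# Hypothetical accuracies for a single dataset.
points = rank_points({"Tabu": 0.91, "Correlation": 0.88,
                      "Entropy": 0.79, "PCA1": 0.75}, original=0.84)
```

Techniques that do not beat the original dataset receive no points, which matches the observation below that entropy obtained zero points in the first ranking.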


Fig. 9 Celeb-Faces

Fig. 10 Credit

Table 3 Feature selection techniques with better performance

Feature selection technique  Points  Performance
Tabu search                  12      +++++
Correlation                  8.5     ++++
PCA2                         4.5     +++
PCA1                         3       ++
Entropy                      0       +

We apply a second ranking, similar to the previous one, to identify the lowest-performing techniques. Here, the first technique that is immediately inferior to the original dataset receives 1 point, the one immediately inferior to that receives 2 points, and so on; therefore, the worst-performing technique is the one that receives the most points. Once again, we sum the points for each feature selection technique: the higher the score, the worse the performance. The results show that PCA1 is the worst feature selection technique, with 23 points, followed by PCA2, with 16 points. Table 4 shows the score of all the feature selection techniques: the first column shows the name of the technique, the second column the points obtained by the ranking, and the third column the performance, where more "–" symbols mean worse performance.

Table 4 Feature selection techniques with the worst performance

Feature selection technique  Points  Performance
Correlation                  2       –
Tabu search                  5       ––
Entropy                      14      –––
PCA2                         16      ––––
PCA1                         23      –––––

It is important to highlight that both rankings consider all the techniques; nevertheless, entropy could not produce better performance than the original dataset in the first ranking, thus obtaining zero points. Additionally, we applied the Friedman test [24]; the results show that tabu search has the highest ranking, with 4.70, while PCA1 has the lowest, with 2.10. In the test statistics, we obtained a p-value of 0.000, and since this value is lower than 0.05, we claim that there is a statistical difference among the feature selection techniques. However, the Wilcoxon test [25] comparing the tabu search feature selection with the original feature set showed a p-value of 0.132, meaning there is no statistical difference between them, although it suggests an 86.8% certainty that both results are different. We believe this result can be improved by testing with more datasets. Regarding the classification algorithms, the J48 tree obtained the best performance among all the classification techniques used in the experiments, with a maximum improvement of 13.88% on the Students dataset with the features selected by tabu search.

6 Discussion

The results of the experiments showed that the tabu search algorithm is the best technique for feature selection, scoring 12 points and ranking first among the feature selection techniques. The Friedman test also confirmed that tabu search has the highest ranking while PCA1 has the lowest. This indicates that tabu search can effectively select the most relevant features from the datasets and improve the performance of the classification algorithms used in the experiments. The correlation technique also performed well, with a score of 8.5 points. Although it did not outperform tabu search, it is faster to implement and does not require a high computational cost. We also noted that using simple classification algorithms, such as the J48 decision tree, after feature selection is more effective than using more complex algorithms. For instance, the random forest performs an internal feature selection of its own, which can lead to using fewer features than those previously selected; likewise, a neural network may internally disable some information inputs and affect the training process. Finally, these results suggest that the performance of the feature selection techniques may vary depending on the datasets and the classification algorithms used.

7 Conclusions and Future Work

In this paper, we present a new wrapper method based on tabu search for solving feature selection problems. The experiments show that the tabu search algorithm for feature selection is a promising method that yields significant improvements in the classification algorithms' performance. However, the results also show that the performance of feature selection techniques varies depending on the datasets and classification algorithms used. On the other hand, we observed that using a simple classification algorithm after feature selection is more effective than using more complex algorithms. Finally, the study emphasizes the importance of feature selection in machine learning and the need for continued research to produce more efficient and accurate feature selection methods. In future work, we recommend evaluating the proposed method with more datasets to re-test the statistical differences. We also recommend trying other deterministic and low-cost objective functions, such as support vector machines, logistic regression, or k-nearest neighbors. Different classification algorithms, such as chi-squared automatic interaction detection, naive Bayes, or ensemble methods, can provide more information about the performance of the proposed method. Finally, we recommend carrying out hyperparameter tuning to improve the performance of the proposed method.

Acknowledgements Thanks to CONACYT for support under grant number 848089.

References

1. Adeel, A., Khan, M.A., Akram, T., Sharif, A., Yasmin, M., Saba, T., Javed, K.: Entropy-controlled deep features selection framework for grape leaf diseases recognition. Expert Syst. 39 (2022). https://doi.org/10.1111/exsy.12569
2. Agrawal, U., Rohatgi, V., Katarya, R.: Normalized mutual information-based equilibrium optimizer with chaotic maps for wrapper-filter feature selection. Expert Syst. Appl. 207, 118107 (2022). https://doi.org/10.1016/j.eswa.2022.118107
3. Ahmed, Z.E., Saeed, R.A., Mukherjee, A., Ghorpade, S.N.: Energy optimization in low-power wide area networks by using heuristic techniques. In: LPWAN Technologies for IoT and M2M Applications, pp. 199–223 (2020). https://doi.org/10.1016/B978-0-12-818880-4.00011-9


4. Ahmed, Z.H., Yousefikhoshbakht, M.: An improved tabu search algorithm for solving heterogeneous fixed fleet open vehicle routing problem with time windows. Alex. Eng. J. 64, 349–363 (2023). https://doi.org/10.1016/j.aej.2022.09.008
5. Al-Batah, M., Zaqaibeh, B., Alomari, S.A., Alzboon, M.S.: Gene microarray cancer classification using correlation based feature selection algorithm and rules classifiers. Int. J. Online Biomed. Eng. 15, 62–73 (2019). https://doi.org/10.3991/ijoe.v15i08.10617
6. Alazzam, H., Sharieh, A., Sabri, K.E.: A feature selection algorithm for intrusion detection system based on pigeon inspired optimizer. Expert Syst. Appl. 148 (2020). https://doi.org/10.1016/j.eswa.2020.113249
7. Alidaee, B., Wang, H.: Uncapacitated (facility) location problem: A hybrid genetic-tabu search approach. IFAC-PapersOnLine 55(10), 1619–1624 (2022). https://doi.org/10.1016/j.ifacol.2022.09.622. 10th IFAC Conference on Manufacturing Modelling, Management and Control MIM 2022
8. Alotaibi, Y.: A new meta-heuristics data clustering algorithm based on tabu search and adaptive search memory. Symmetry 14 (2022). https://doi.org/10.3390/sym14030623
9. Bentsen, H., Hoff, A., Hvattum, L.M.: Exponential extrapolation memory for tabu search. EURO J. Comput. Optim. 10, 100028 (2022). https://doi.org/10.1016/j.ejco.2022.100028
10. Bolívar, A., García, V., Florencia, R., Alejo, R., Rivera, G., Sánchez-Solís, J.P.: A preliminary study of SMOTE on imbalanced big datasets when dealing with sparse and dense high dimensionality. In: Pattern Recognition: 14th Mexican Conference, MCPR 2022, Ciudad Juárez, Mexico, Proceedings, pp. 46–55. Springer, Berlin (2022). https://doi.org/10.1007/978-3-031-07750-0_5
11. Bommert, A., Sun, X., Bischl, B., Rahnenführer, J., Lang, M.: Benchmark for filter methods for feature selection in high-dimensional classification data. Comput. Stat. Data Anal. 143 (2020). https://doi.org/10.1016/j.csda.2019.106839
12. Chen, P., Li, F., Wu, C.: Research on intrusion detection method based on Pearson correlation coefficient feature selection algorithm. J. Phys. Conf. Ser. 1757 (2021). https://doi.org/10.1088/1742-6596/1757/1/012054
13. Ghane, M., Ang, M.C., Nilashi, M., Sorooshian, S.: Enhanced decision tree induction using evolutionary techniques for Parkinson's disease classification. Biocybern. Biomed. Eng. 42(3), 902–920 (2022). https://doi.org/10.1016/j.bbe.2022.07.002, www.sciencedirect.com/science/article/pii/S0208521622000663
14. Ghosh, M., Guha, R., Sarkar, R., Abraham, A.: A wrapper-filter feature selection technique based on ant colony optimization. Neural Comput. Appl. 32, 7839–7857 (2020). https://doi.org/10.1007/s00521-019-04171-3
15. Got, A., Moussaoui, A., Zouache, D.: Hybrid filter-wrapper feature selection using whale optimization algorithm: A multi-objective approach. Expert Syst. Appl. 183 (2021). https://doi.org/10.1016/j.eswa.2021.115312
16. Gárate-Escamila, A.K., Hassani, A.H.E., Andrès, E.: Classification models for heart disease prediction using feature selection and PCA. Inform. Med. Unlocked 19 (2020). https://doi.org/10.1016/j.imu.2020.100330
17. Hanafi, S., Wang, Y., Glover, F., Yang, W., Hennig, R.: Tabu search exploiting local optimality in binary optimization. Eur. J. Oper. Res. (2023). https://doi.org/10.1016/j.ejor.2023.01.001
18. He, Y., Jia, T., Zheng, W.: Tabu search for dedicated resource-constrained multiproject scheduling to minimise the maximal cash flow gap under uncertainty. Eur. J. Oper. Res. (2023). https://doi.org/10.1016/j.ejor.2023.02.029
19. Khaire, U.M., Dhanalakshmi, R.: Stability of feature selection algorithm: A review. J. King Saud Univ. Comput. Inf. Sci. 34, 1060–1073 (2022). https://doi.org/10.1016/j.jksuci.2019.06.012
20. de Lima, M.D., de Oliveira Roque e Lima, J., Barbosa, R.M.: Medical data set classification using a new feature selection algorithm combined with twin-bounded support vector machine. Med. Biol. Eng. Comput. 58, 519–528 (2020). https://doi.org/10.1007/s11517-019-02100-z
21. Liu, H., Zhou, M., Liu, Q.: An embedded feature selection method for imbalanced data classification. IEEE/CAA J. Automatica Sinica 6, 703–715 (2019). https://doi.org/10.1109/JAS.2019.1911447

Feature Selection: Traditional and Wrapping Techniques with Tabu Search

37

22. Liu, Z., Chang, B., Cheng, F.: An interactive filter-wrapper multi-objective evolutionary algorithm for feature selection. Swarm Evol. Comput. 65, 100925 (2021). https://doi.org/10.1016/ j.swevo.2021.100925 23. Lucay, F.A., Gálvez, E.D., Cisternas, L.A.: Design of flotation circuits using tabu-search algorithms: Multispecies, equipment design, and profitability parameters. Minerals 9 (2019). https:// doi.org/10.3390/min9030181 24. Ma, J., Xia, D., Guo, H., Wang, Y., Niu, X., Liu, Z., Jiang, S.: Metaheuristic-based support vector regression for landslide displacement prediction: a comparative study. Landslides 19, 2489–2511 (2022). https://doi.org/10.1007/s10346-022-01923-6 25. Momenzadeh, M., Sehhati, M., Rabbani, H.: A novel feature selection method for microarray data classification based on hidden markov model. J. Biomed. Inf. 95 (2019). https://doi.org/ 10.1016/j.jbi.2019.103213 26. Omuya, E.O., Okeyo, G.O., Kimwele, M.W.: Feature selection for classification using principal component analysis and information gain. Expert Syst. Appl. 174 (2021). https://doi.org/10. 1016/j.eswa.2021.114765 27. Panigrahi, R., Borah, S.: Rank allocation to j48 group of decision tree classifiers using binary and multiclass intrusion detection datasets. Procedia Comput. Sci. 132, 323–332 (2018). https:// doi.org/10.1016/j.procs.2018.05.186. International Conference on Computational Intelligence and Data Science 28. Prajapati, V.K., Jain, M., Chouhan, L.: Tabu search algorithm (tsa): A comprehensive survey. In: Proceedings of 3rd International Conference on Emerging Technologies in Computer Engineering: Machine Learning and Internet of Things, ICETCE 2020, pp. 222–229. Institute of Electrical and Electronics Engineers Inc. (2020). https://doi.org/10.1109/ICETCE48199.2020. 9091743 29. Pudjihartono, N., Fadason, T., Kempa-Liehr, A.W., O’Sullivan, J.M.: A review of feature selection methods for machine learning-based disease risk prediction. Frontiers Bioinf. 2 (2022). 
https://doi.org/10.3389/fbinf.2022.927312 30. Rivera, G., Florencia, R., García, V., Ruiz, A., Sánchez-Solís, J.P.: News classification for identifying traffic incident points in a spanish-speaking country: A real-world case study of class imbalance learning. Appl. Sci. 10(18), 6253 (2020). https://doi.org/10.3390/app10186253 31. Shanthi, S., Akshaya, V.S., Smitha, J.A., Bommy, M.: Hybrid tabu search with sds based feature selection for lung cancer prediction. Int. J. Intell. Netw. 3, 143–149 (2022). https://doi.org/10. 1016/j.ijin.2022.09.002 32. Singh, N., Singh, P.: A hybrid ensemble-filter wrapper feature selection approach for medical data classification. Chemom. Intell. Lab. Syst. 217, 104396 (2021). https://doi.org/10.1016/j. chemolab.2021.104396 33. Solorio-Fernández, S., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F.: A review of unsupervised feature selection methods. Artif. Intell. Rev. 53, 907–948 (2020). https://doi.org/10. 1007/s10462-019-09682-y 34. Spencer, R., Thabtah, F., Abdelhamid, N., Thompson, M.: Exploring feature selection and classification methods for predicting heart disease. Digit. Health 6 (2020). https://doi.org/10. 1177/2055207620914777 35. Tambake, N., Deshmukh, B., Patange, A.: Development of a low cost data acquisition system and training of j48 algorithm for classifying faults in cutting tool. Mater. Today: Proc. 72, 1061– 1067 (2023). https://doi.org/10.1016/j.matpr.2022.09.163. 2nd International Conference and Exposition on Advances in Mechanical Engineering (ICoAME 2022) 36. Venkateswarlu, C.: A metaheuristic tabu search optimization algorithm: Applications to chemical and environmental processes. In: Tsuzuki, M.S., Takimoto R.Y., Sato, A.K., Saka, T., Barari, A., Rahman, R.O.A., Hung, Y.T. (eds.) Engineering Problems, Chap. 10. IntechOpen, Rijeka (2021). https://doi.org/10.5772/intechopen.98240 37. Venkatraman, P., Levin, M.W.: A congestion-aware tabu search heuristic to solve the shared autonomous vehicle routing problem. J. 
Intell. Transp. Syst.: Technol., Plan., Oper. 25, 343–355 (2021). https://doi.org/10.1080/15472450.2019.1665521

38

L. Benito-Epigmenio et al.

38. Vommi, A.M., Battula, T.K.: A hybrid filter-wrapper feature selection using fuzzy knn based on bonferroni mean for medical datasets classification: A covid-19 case study. Expert Syst. Appl. 218, 119612 (2023). https://doi.org/10.1016/j.eswa.2023.119612 39. Weinberg, A.I., Last, M.: Enhat—synergy of a tree-based ensemble with hoeffding adaptive tree for dynamic data streams mining. Inf. Fusion 89, 397–404 (2023). https://doi.org/10.1016/ j.inffus.2022.08.026 40. Yu, C., Lahrichi, N., Matta, A.: Optimal budget allocation policy for tabu search in stochastic simulation optimization. Comput. Oper. Res. 150, 106046 (2023). https://doi.org/10.1016/j. cor.2022.106046 41. Zhang, J., Xiong, Y., Min, S.: A new hybrid filter/wrapper algorithm for feature selection in classification. Anal. Chim. Acta 1080, 43–54 (2019). https://doi.org/10.1016/j.aca.2019.06. 054 42. Zhou, H.F., Zhang, J.W., Zhou, Y.Q., Guo, X.J., Ma, Y.M.: A feature selection algorithm of decision tree based on feature weight. Expert Syst. Appl. 164 (2021). https://doi.org/10.1016/ j.eswa.2020.113842

Pattern Classification with Holographic Neural Networks: A New Tool for Feature Selection

Luis Diago, Hiroe Abe, Atsushi Minamihata, and Ichiro Hagiwara

Abstract This chapter aims to develop classification algorithms based on holographic neural networks (HNN) that can explain the inferences made in the classification of driver states during conditional driving automation. HNNs are a type of associative memory, developed more than thirty years ago, whose main advantage is learning speed: being a simple linear model in the complex domain, an HNN can be updated at high speed during online learning. After a chronological review of the technology applied to the pattern classification problem, a new tool for feature selection is proposed. The new tool maps the complex numbers in the holographic memory to the interval [0, 1] using Pythagorean membership grades. The performance of holographic classifiers is compared on the Iris dataset, and the explainability of the models is analyzed through the predictive power and stability of the selected features. The holographic classifiers with fuzzy quantization achieve an accuracy above 90% on the Iris dataset and are the most explainable according to the extracted rules, with the highest stability of the features selected by the proposed tool.

Keywords Holographic neural networks · Feature selection · Explainable models · Complex fuzzy sets · Pattern classification

L. Diago (B) · H. Abe · I. Hagiwara Meiji University, Room 408M, 1-1 Kanda Surugadai, Chiyoda-ku, Tokyo 101-8301, Japan e-mail: [email protected] H. Abe e-mail: [email protected] I. Hagiwara e-mail: [email protected] A. Minamihata Kansai University of International Studies, 1-18 Aoyama Shijimicho Miki, Hyogo 673-0521, Japan e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 G. Rivera et al. (eds.), Innovations in Machine and Deep Learning, Studies in Big Data 134, https://doi.org/10.1007/978-3-031-40688-1_3


1 Introduction

Autonomous cars have been developed in recent years thanks to advances in object recognition and machine learning algorithms [7, 11, 14]. However, their practical introduction has been delayed due to social concerns about their safety and regulatory compliance, mainly at level 3 of the Society of Automotive Engineers (SAE), which requires collaboration between the vehicle and the driver to perform the driving task [32]. In this context, it is important not only that the vehicle can recognize or classify the driver's states but also that it can explain the inferences made during driver monitoring. This problem has not been solved and is the main motivation for this research. Deep neural networks [7, 11] show the most promising results for solving this problem, but they are not interpretable models, and it is difficult to develop online learning methods for them. That is why this chapter focuses on using HNNs to solve the above problems [69, 70]. Two mathematical models of neural networks [52, 53, 69, 70] based on holographic associative memories [21] have emerged from Gabor's work on holography [20, 21]. The first is based on a photorefractive crystal to record optical input-output interconnection strengths [52, 53], and the second focuses on the theoretical basis for modeling systems' dynamics in the complex domain [69, 70]. Complex-valued neural networks (CVNNs) have gained popularity in recent years [4, 31, 51]. However, unlike CVNNs, the governing equations of HNNs [69, 70] form a non-connectionist approach in which a large number of input-output associations can be enfolded onto a single memory element. This chapter focuses on the HNNs [69, 70] presented as linear models in the complex domain, which do not have a layer structure and whose learning algorithm is based on the solution of a system of linear equations.

2 Holographic Neural Networks

Holographic neural networks (HNN) [69, 70] are based on the fact that the input-output relationships are linearized according to a mapping of inputs into outputs in the complex plane.

2.1 Basic Theory

Suppose two real vectors x and y, represented as input vector $\mathbf{x} = (x_1, x_2, \ldots, x_k)^T$ and output vector $\mathbf{y} = (y_1, y_2, \ldots, y_m)^T$, of dimensions k and m respectively. A set of n input-output vector pairs may be represented as matrices X and Y according to (1):

$$
X = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}, \quad
Y = \begin{pmatrix} \mathbf{y}_1^T \\ \mathbf{y}_2^T \\ \vdots \\ \mathbf{y}_n^T \end{pmatrix} = \begin{pmatrix} y_{11} & y_{12} & \cdots & y_{1m} \\ y_{21} & y_{22} & \cdots & y_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{nm} \end{pmatrix} \quad (1)
$$

Each element of X and Y is changed into angles $\theta_{ai}$ $(a = 1, \ldots, n;\ i = 1, \ldots, k)$ and $\phi_{aj}$ $(a = 1, \ldots, n;\ j = 1, \ldots, m)$ by input and output mapping functions $f_x$ and $f_y$:

$$\theta_{ai} = f_x(x_{ai}), \quad \phi_{aj} = f_y(y_{aj}) \quad (2)$$

Commonly used types of data conversion for $f_x$ and $f_y$ are sigmoid or arctan conversion (real values from an unknown or infinite range) and linear conversion (real values from a known finite range) [69, 70]. Next, the angles are mapped onto the complex plane by the exponential function:

$$s_{ai} = \lambda_{ai}\, e^{\hat{i}\theta_{ai}}, \quad r_{aj} = \gamma_{aj}\, e^{\hat{i}\phi_{aj}} \quad (3)$$

In (3), $\hat{i}$ is the imaginary unit of a complex number ($\hat{i}^2 = -1$), and $\lambda_{ai}$ and $\gamma_{aj}$ are the magnitudes of the complex numbers (usually set to one). Through the operations represented in (2) and (3), the arrays X and Y are mapped onto the complex plane and named the stimulus S and response R, respectively:

$$
S = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1k} \\ s_{21} & s_{22} & \cdots & s_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ s_{n1} & s_{n2} & \cdots & s_{nk} \end{pmatrix}, \quad
R = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1m} \\ r_{21} & r_{22} & \cdots & r_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{n1} & r_{n2} & \cdots & r_{nm} \end{pmatrix} \quad (4)
$$

If the relationship between S and $R = [\mathbf{r}_1, \ldots, \mathbf{r}_m]$ is stored in a matrix $H = [\mathbf{h}_1, \ldots, \mathbf{h}_m]$ called the holographic memory [69, 70], H can be computed as

$$H = (S^* \cdot S)^{-1} \cdot S^* \cdot R \quad (5)$$

The difference between R and the product $S \cdot H$ can be expressed by the error matrix:

$$\Delta = [\mathbf{d}_1, \ldots, \mathbf{d}_m] = R - S \cdot H \quad (6)$$

The column vectors $\mathbf{h}_j$ $(j = 1, \ldots, m)$ of the matrix H have k elements and are computed so that the squared norm of each vector $\mathbf{d}_j$ of the error matrix $\Delta$ is minimal. That is,

$$\min_{\mathbf{h}_j} \mathbf{d}_j^* \mathbf{d}_j = \min_{\mathbf{h}_j} \left\{ (\mathbf{r}_j - S \cdot \mathbf{h}_j)^* \cdot (\mathbf{r}_j - S \cdot \mathbf{h}_j) \right\} \quad (7)$$


where $*$ denotes the conjugate transpose. The value of $\mathbf{h}_j$ that minimizes the expression in (7) is $\mathbf{h}_j = (S^* \cdot S)^{-1} \cdot S^* \cdot \mathbf{r}_j$; writing all the $\mathbf{h}_j$ in matrix form yields Eq. (5). The matrix H is determined from the input S and output R during learning (i.e., training or encoding), and the output V for a new input U (a.k.a. query pattern), unused during training, is predicted by

$$V = U \cdot H \quad (8)$$

To recover the real vector $\mathbf{v}$ for the new output V, it is necessary to reverse the output mapping in (2).
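The encode/recall cycle of Eqs. (2)–(8) can be sketched in a few lines of NumPy. The linear phase mapping, the slight shrink below 2π (so a column's minimum and maximum do not collide at the same phase), and the assumption that outputs are already scaled to [0, 1) are illustrative choices for this sketch, not prescriptions from the chapter:

```python
import numpy as np

def encode(X, Y):
    """Holographic learning: map real data onto the unit circle (Eqs. 2-3)
    and solve the least-squares problem of Eq. (7), whose solution is the
    holographic memory H = (S* S)^(-1) S* R of Eq. (5)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    # linear phase mapping, scaled slightly below 2*pi (illustrative choice)
    theta = 1.9 * np.pi * (X - lo) / (hi - lo + 1e-12)
    S = np.exp(1j * theta)                      # stimulus matrix, Eq. (3)
    R = np.exp(2j * np.pi * Y)                  # response; assumes Y in [0, 1)
    H, *_ = np.linalg.lstsq(S, R, rcond=None)   # minimizes Eq. (7) column-wise
    return H, (lo, hi)

def recall(H, scale, Xnew):
    """Prediction V = U H (Eq. 8), then reverse the output mapping of Eq. (2)."""
    lo, hi = scale
    U = np.exp(1.9j * np.pi * (Xnew - lo) / (hi - lo + 1e-12))
    V = U @ H
    return np.angle(V) % (2 * np.pi) / (2 * np.pi)
```

With n ≤ k training patterns the least-squares fit is generically exact, so recalling a training stimulus returns its encoded response.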

2.2 Learning and Prediction Methods

Learning and prediction methods reported in the literature are based on variations of the mapping functions $f_x$ and $f_y$ in (2) [18, 29, 30, 34–37, 49, 69, 70], expansions of the basis functions in (3) [12, 13, 34–37, 49, 69, 70], and different ways of calculating the holographic memory matrix H in (5) [19, 69, 70]. This chapter focuses on the pattern classification problem and presents these methods chronologically, as they have been used in the literature.

2.2.1 Mapping Functions

Sigmoid Function (Sutherland 1990–1992)
Sutherland [69, 70] suggested using a sigmoid function to obtain an ideally symmetrical distribution for input data that follow a normal distribution. The sigmoid mapping function is:

$$\theta_{ai} = \frac{2\pi}{\left(1 + e^{-f(x_{ai}, \sigma_i, \mu_i)}\right)^b} \quad (9)$$

where generally $f(x_{ai}, \sigma_i, \mu_i) = A \frac{x_{ai} - \mu_i}{\sigma_i}$, with $A, b \in \mathbb{R}$. Here, $x_{ai}$ is the $i$th input stimulus element, and $\mu_i$ and $\sigma_i$ are the mean and standard deviation of the stimulus field distribution. The success of HNN learning lies in having evenly distributed (symmetric) vectors in the complex domain, evaluated by the asymmetry (Sym) [69, 70]:

$$Sym = \frac{\left| \sum_{a=1}^{n} \lambda_a e^{\hat{i}\theta_a} \right|}{\sum_{a=1}^{n} \lambda_a} \quad (10)$$

The value of Sym is in the range 0–1 (zero means completely symmetric). The values A = 3 and b = 1 are used in the experiments based on the resulting values of Sym.
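A minimal sketch of the sigmoid mapping and the Sym measure follows. Since the printed form of Eq. (9) is partly garbled, reading b as an exponent on the logistic term is an assumption of this sketch:

```python
import numpy as np

def sigmoid_phase(x, A=3.0, b=1.0):
    # Eq. (9): column-wise standardisation f = A*(x - mu)/sigma, squashed
    # onto (0, 2*pi); placing b as an exponent is an assumed reading.
    mu, sigma = x.mean(axis=0), x.std(axis=0) + 1e-12
    f = A * (x - mu) / sigma
    return 2 * np.pi / (1.0 + np.exp(-f)) ** b

def asymmetry(theta, lam=None):
    # Sym of Eq. (10): |sum_a lam_a e^{i theta_a}| / sum_a lam_a (0 = symmetric)
    lam = np.ones_like(theta) if lam is None else lam
    return np.abs(np.sum(lam * np.exp(1j * theta))) / np.sum(lam)
```

Evenly spaced phases give Sym ≈ 0, while phases clustered in one direction push Sym toward 1.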


Output Linear Mapping (Sutherland 1990–1992)
Within the realm of pattern classification [69, 70], the complex plane for the response is divided into an arbitrary number of areas, permitting the network to generate a proportionate number of separate classifications. Figure 1a considers a system in which the response plane has been divided into 8 areas of equivalent size. Each area corresponds to one class (C), as shown for areas C0, C1, …, C7.

Input Symmetric Associator (Franich 1993)
Although satisfactory in many cases, the standard sigmoid transform is not appropriate if the original data distribution differs from Gaussian [18, 49]. Thus, Franich [18] proposed the "Symmetric Associator", in which the external domain for a holographic network is composed of real numbers $x_i$ in the range $-\infty < x_i < +\infty$ with an arbitrary distribution $f_1(x)$. This external domain should be converted into the internal phase representation in the range $0 < \theta < 2\pi$, and the conversion should produce a uniform distribution $f_2(\theta)$. According to [18, 49], the proper conversion (transformation) curve $g(x)$ is

$$g(x) = \int f_1(x)\,dx = F_1(x) \quad (11)$$

Applying an external sequence $x_i$ to the transfer function $g(x) = F_1(x)$ generates a new sequence $\theta_i$ with the uniform probability density function $f_2(\theta)$. Implementations of the symmetric holographic associator developed in [18, 49] estimate the corresponding distribution of the phase elements and compute its cumulative distribution.

Spiral Function (Khan 1995–1998)
Since the query pattern and the encoded patterns must be mapped using the same function, the response pattern may not be retrieved correctly if the query pattern differs significantly from the encoded patterns, e.g., if they have a different mean and standard deviation. For this reason, Khan [34–37] uses the following spiral mapping function to achieve a symmetrical distribution of the stimulus pattern elements:

$$\theta_{ai} = \left( k_{spread} \frac{2\pi\, x_{ai}}{(x_i)_{max} - (x_i)_{min}} \right) \bmod 2\pi \; - \; \pi \quad (12)$$

This function transforms the input data into the $-\pi$ to $+\pi$ range. Here, $k_{spread}$ is the spreading coefficient, chosen to be any number that is not a factor of $(x_i)_{max} - (x_i)_{min}$; otherwise, the mapped values will not be unique.
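The spiral mapping of Eq. (12) is straightforward to implement; the value of k_spread below is an arbitrary illustrative choice:

```python
import numpy as np

def spiral_phase(x, k_spread=7.0):
    """Spiral mapping of Eq. (12): phases in [-pi, pi). k_spread must not be
    a factor of (max - min), otherwise mapped values repeat."""
    span = x.max() - x.min()
    return np.mod(k_spread * 2 * np.pi * x / span, 2 * np.pi) - np.pi
```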


Fig. 1 Division of the complex plane into classification regions: (a) linear mapping, (b) RMWC

Output Reverse Modulus Weight Code (Khan 1995–1998)
The Reverse Modulus Weight Code (RMWC) function used by Khan [34–37] to obtain a more symmetrical distribution of vectors is illustrated in Fig. 1b, with 8 intervals. As depicted, Index 1 is 180° away from Index 0, Index 3 is 180° away from Index 2, etc. This indicates that even if only four patterns were mapped (Index 0, 1, 2, and 3), they would yield a symmetrical vector distribution.

Input Reverse Modulus Weight Code (Hendra 1999)
The RMWC mapping function used by Khan [34–37] for mapping response patterns was adapted for stimulus pattern mapping [29, 30]. Assume that the $0$–$2\pi$ range is divided into 256 intervals, corresponding to the range of intensity values. The mapping of an intensity value $I_i$ (in the range 0–255) to a phase value $\theta_i$ using the RMWC function is performed in two steps:

1. $I_i$ is represented as an eight-digit binary number (since $2^8 = 256$): $I_i = b_7 2^7 + b_6 2^6 + b_5 2^5 + b_4 2^4 + b_3 2^3 + b_2 2^2 + b_1 2^1 + b_0$, where $b_0, \ldots, b_7$ are the weight codes, each 0 or 1.
2. The corresponding phase value $\theta_i$ is obtained by reversing the weight codes:

$$\theta_i = \left( \frac{b_0 2^7 + b_1 2^6 + \cdots + b_6 2^1 + b_7}{256} \right) 2\pi \quad (13)$$

In [29, 30], Hendra compared the linear, spiral, and RMWC mappings for input data with uniform, normal, or right-skewed normal distributions, showing that RMWC achieves a relatively lower asymmetry in all cases.
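The input RMWC mapping of Eq. (13) amounts to a bit reversal followed by rescaling; a sketch:

```python
import math

def rmwc_phase(intensity, bits=8):
    """Input RMWC mapping of Eq. (13): reverse the binary weight codes of an
    integer intensity (0 .. 2**bits - 1) and rescale to [0, 2*pi)."""
    rev = int(format(intensity, f"0{bits}b")[::-1], 2)
    return rev / (1 << bits) * 2 * math.pi
```

For example, intensity 1 (00000001) reverses to 10000000 = 128, giving a phase of π, which matches the 180° spacing between consecutive indices described above.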

2.2.2 Expansion Methods

For the classification problem, difficulties may arise if the number of attributes used for classification is small. A further, and possibly more effective, means of obtaining a highly symmetrical state over a wide class of input data distributions is the expansion of the input field into higher-order combinatorial product terms.

Higher Order Statistics (Sutherland 1990–1992)
These higher-order "statistics" represent the generation of Mth-order product terms over elements within the raw stimulus data field, allowing for repeating factors. The generation of a higher-order product term is illustrated by [69, 70]:

$$\prod_{l_i}^{M} \lambda_{l_i} e^{\hat{i}\theta_{l_i}} \quad (14)$$

where M indicates the "order" of the product. The number of unique Mth-order product terms that can be generated from a data field of size N is given by the following factorial relationship:

$$\frac{(N + M - 1)!}{(N - 1)!\,M!} \quad (15)$$
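The Mth-order expansion of Eq. (14), together with the term count of Eq. (15), can be sketched as:

```python
import math
from itertools import combinations_with_replacement
import numpy as np

def expand_higher_order(s, M):
    """Generate all Mth-order product terms of Eq. (14) over the complex
    stimulus vector s, allowing repeated factors."""
    return np.array([np.prod(s[list(idx)])
                     for idx in combinations_with_replacement(range(len(s)), M)])

# a stimulus of N = 4 unit-magnitude elements expanded to order M = 2
s = np.exp(1j * np.array([0.3, 1.1, 2.0, 2.9]))
expanded = expand_higher_order(s, M=2)
assert len(expanded) == math.comb(4 + 2 - 1, 2)   # Eq. (15): 10 terms
```

Products of unit-magnitude elements stay on the unit circle, so the expansion only combines phases.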

In this manner, stimulus input fields of relatively small size may be expanded into extremely large data sets.

Sines and Cosines (Manger 1993–1997)
The standard expansion procedure [69, 70] based on higher-order product terms may prove inadequate for an extremely short stimulus. A more suitable expansion method, based on sines and cosines, has been proposed [49, 50]. The sine-cosine procedure interprets the original stimulus $s_i$ as an angle between 0 and $2\pi$, i.e., instead of $s_i$ it uses $\alpha_i = 2\pi s_i/(p + 1)$. The expansion procedure then computes $\sin(l_i \alpha_i)$ and $\cos(l_i \alpha_i)$ for $l_i = 1, 2, \ldots, K/4$. These $K/2$ numbers (internally represented), together with their conjugates, form the expanded vector of K elements. The authors compared the sine-cosine expansion with higher-order statistics and found the higher-order-statistics expansion to be more successful [46]. They therefore proposed combining the sine-cosine method with higher-order statistics [49]: the sine-cosine method expands an extremely short stimulus vector to a moderate length, and higher-order statistics are then applied to reach the desired final vector length.

Multi-dimensional Expansion (Khan 1995–1998)


The input model in (3), which is a two-dimensional scheme, has been extended by Khan [34–37] into a multi-dimensional scheme:

$$s_i = \lambda_i e^{\hat{i}\theta_i} \;\rightarrow\; \lambda_i e^{\sum_{l_i}^{d_i - 1} \hat{i}_{l_i} \theta_{i l_i}}, \qquad (\lambda_i, \theta_i) \rightarrow (\lambda_i, \theta_{i1}, \theta_{i2}, \ldots, \theta_{i[d_i - 1]}) \quad (16)$$

Using this model, each $\theta_i$ is mapped onto a set of phase elements $\theta_{i l_i}$. Here, $s_i$ becomes a vector within a unit hyper-sphere in a d-dimensional space. Each $\theta_{i l_i}$ is the spherical projection, or phase component, of the vector along the dimension $\hat{i}_{l_i}$. Based on this representation, each piece of information $s_i$ may be composed of other sub-information. For example, in a color image, the $i$th pixel information may be built up from the values of the three colors in RGB format, corresponding to red, green, and blue; the $l_i$th color of the $i$th pixel is mapped to $\theta_{i l_i}$.

Fuzzy Expansion and Quantification (Diago 2008–2011)
Khan [34–37] presented the idea of using an expansion to a multi-dimensional space according to Eq. (16) and applied it to color images, but left open the question of how to compute the dimension of the hyper-complex numbers in other applications. In [12], the expansion step was substituted by a fuzzy expansion step to avoid the trial-and-error approach of defining the number of unique Mth-order product terms in (14). Each input parameter $x_i$ is divided into $d_i$ categories $C_{i l_i}$ $(l_i = 1 \ldots d_i)$ with semantic meaning (e.g., small, medium, large). Dividing the input values into several equiprobable intervals ensures lower values of Sym in (10). The paper [12] showed that, by using $d_i = 3$ categories per parameter, the classification accuracy is comparable with support vector machines when working with small-sample datasets, because the fuzzy expansion works like the "Symmetric Associator" in (11) without estimating the distribution of each stimulus field. In [13], an optimization algorithm was proposed to solve the generalized eigenvalue problem $A\mathbf{c} = \eta^2 B\mathbf{c}$ and automatically compute the dimension $d_i$ in (16) from the eigenvector $\mathbf{c}$ corresponding to the largest eigenvalue $\eta^2$. The matrices A and B are computed from the membership functions (MFs) $\mu_{C_{i l_i}}(x_i)$ $(i = 1 \ldots k;\ l_i = 1 \ldots d_i)$ of the input fuzzy groups and $\mu_{G_j}(y_j)$ $(j = 1 \ldots m)$ of the output fuzzy groups. HNN models were improved by computing the fuzzy quantification (FQ) coefficients $\mathbf{c}$, reducing the number of input parameters, creating output MFs, and extracting fuzzy rules from the holographic memory H [13].
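The fuzzy-expansion idea can be illustrated as follows. Quantile-based category boundaries and triangular membership functions are assumptions of this sketch; [12, 13] tune the MFs by optimization rather than fixing them in advance:

```python
import numpy as np

def fuzzy_expand(x, d=3):
    """Fuzzy expansion sketch: split each input variable into d equiprobable
    categories via quantiles and use triangular membership degrees as the
    expanded features (degrees in [0, 1])."""
    qs = np.quantile(x, np.linspace(0, 1, d + 2), axis=0)  # category centres
    feats = []
    for j in range(x.shape[1]):
        for c in range(1, d + 1):
            left, centre, right = qs[c - 1, j], qs[c, j], qs[c + 1, j]
            rise = np.clip((x[:, j] - left) / (centre - left + 1e-12), 0, 1)
            fall = np.clip((right - x[:, j]) / (right - centre + 1e-12), 0, 1)
            feats.append(np.minimum(rise, fall))
    return np.stack(feats, axis=1)          # shape (n, d * k)
```

Because the category boundaries are quantiles, each category covers roughly the same number of samples, which is what keeps Sym in (10) low.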

2.2.3 Computing the Holographic Memory

Basic and Enhanced Learning (Sutherland 1990–1992)
In the basic learning [69, 70], several stimulus-response pattern associations can be superimposed onto the same matrix space as follows:

$$H_0 = \sum_{a=1}^{n} S_a^{*T} R_a \quad (17)$$

Theoretically, the number of patterns (n) that can be stored in the holographic memory $H_0$ for reliable performance never exceeds the number of stimulus pattern elements (k). This characteristic is expressed as the load factor $L = n/k$ (for $n \le k$), which lies in the range 0–1. In practice, the enhanced learning algorithm [69, 70], adapted from Hebbian differential learning, is applied to all patterns $(a = 1, \ldots, n)$ as follows:

$$H_a = H_{a-1} + \alpha\, S_a^{*T} \left( R_a - \frac{1}{c_a} S_a H_{a-1} \right) \quad (18)$$

where $\alpha$ is a learning constant in the range 0–1 and $c_a = \sum_{i=1}^{k} \lambda_{ai}$. The enhanced learning (18) avoids the computation of the inverse matrix in (5), but it has to be repeated several times for each association a in the set in order to minimize the difference in (6); we thus end up with a form of iterative training.

Pseudoinverse and Penalty Functions (Hagiwara 2004)
Increasing the number of expansion terms decreases the learning error, but in some cases the resulting matrix in (5) may not be invertible. Fukushima et al. [19] proposed computing the matrix H using the Moore-Penrose pseudoinverse:

$$H_{MP} = S^* \cdot S \cdot (S^* \cdot S \cdot S^* \cdot S)^{-1}_G \cdot S^* \cdot R \quad (19)$$

The symbol $()^{-1}_G$ denotes the computation of the Moore-Penrose pseudoinverse. Moore-Penrose generalized inverse matrices allow solving such systems even with rank deficiency, and they provide minimum-norm vectors of the learning error, which contribute to the regularization of the input-output mapping. As the generalized inverse matrix minimizes the learning error, it could be thought that the matrix $H_{MP}$ correctly expresses any input-output relation. However, when $H_{MP}$ is used to predict the output for a new vector (not used during the learning phase), the results are not good. To increase the generalization performance, Fukushima et al. [19] proposed the use of a penalty function:

$$H_P = (S^* \cdot S + \rho_1 O + \rho_2 I)^{-1} \cdot S^* \cdot R \quad (20)$$

Here, $\rho_1$ and $\rho_2$ are penalty parameters, O is a square matrix with all elements equal to one, and I is the identity matrix. If $\rho_2$ is not zero, the rank of the matrix $(S^* \cdot S + \rho_1 O + \rho_2 I)$ equals its size, so the inverse matrix always exists. When $\rho_1$ and $\rho_2$ are near zero, the first part of (20) has greater weight, and the matrix approaches the Moore-Penrose solution. Fukushima et al. [19]


proposed the values of ρ1 = 1 and ρ2 = 1 to achieve the maximum generalization performance of their MPOD method.
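Eq. (20) can be sketched directly in NumPy; ρ1 = ρ2 = 1 are the values recommended in [19], and as both penalties approach zero the result converges to the least-squares (Moore-Penrose) solution:

```python
import numpy as np

def holographic_memory_penalty(S, R, rho1=1.0, rho2=1.0):
    """Regularised holographic memory of Eq. (20):
    H_P = (S* S + rho1 * O + rho2 * I)^(-1) S* R,
    with O the all-ones matrix and I the identity."""
    k = S.shape[1]
    G = S.conj().T @ S + rho1 * np.ones((k, k)) + rho2 * np.eye(k)
    return np.linalg.solve(G, S.conj().T @ R)
```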

2.3 Explainability and Optimization of Holographic Models

Once the holographic memory is obtained, we can predict new cases (not seen during training) with high accuracy, but user-understandable explanations must accompany these predictions for them to be helpful in applications [3, 72]. The purpose of explainable artificial intelligence (XAI) [24] is to make model behavior more intelligible to humans by providing explanations; XAI thus constitutes a challenge and a trend in current research [10, 62, 65, 72]. A brief analytical review [2] of the state of the art in XAI, in the context of recent advances in machine learning and deep learning, defines two types of explainable models: transparent ones (e.g., decision trees, Bayesian models, linear regression models, or rule-based models) and opaque ones (e.g., support vector machines, multilayer neural networks, recurrent neural networks, convolutional neural networks). The former are more interpretable [10, 62, 72] but do not necessarily guarantee explainability. The latter are more accurate but must be processed post-hoc (that is, after the model has been trained) to explain their inferences. HNN classifiers can be seen as linear regression models in the complex domain. Although linear regression models are transparent, the regression process in the complex domain does not guarantee the explainability of the inferences, so post-hoc processing is necessary. Several representations in the context of CVNNs [4] could be adapted as mapping functions to optimize the application of HNN models. As mentioned in [4], input representations can be complex either naturally or by design. In the former case, a complex number represents the data domain (e.g., the Fourier transform of images). In the latter case, inputs have magnitude and phase, which are statistically correlated (see, e.g., radio frequency or wind data applications in [31, 51]). On the other hand, expansion methods have been less explored because CVNNs are based on layered structures.
For HNNs, expansion methods serve the same purpose as layers in traditional CVNNs, but the absence of layers allows the learning problem to be solved with well-established tools from linear algebra (e.g., systems of linear equations or eigenvalue problems), avoiding the optimization problem that arises in CVNN learning. There is a trade-off between accuracy and interpretability for expansion methods: the larger the number of model parameters, the higher the model's accuracy; conversely, the smaller the number of model parameters, the more interpretable the model is and the easier it is to provide user-understandable explanations. There are other advances reported in the literature, such as learning with dynamic attention (Khan 1995–1998 [34–37]) and neural plasticity (Sutherland 1990–1992 [69, 70]), which are not analyzed here because they are not directly related to the mapping functions, the expansion methods, or the way of calculating the memory matrix H. The main contribution of Khan is related to the reduction of the network


error when assigning values to the magnitudes in (3) that are different from one and are adjusted according to the application. Neural plasticity also promotes the variation of the magnitudes of the holographic memory in order to optimize learning. However, the way to select the memory values to be optimized is incomplete. In most of the examples reported in the literature [13, 38], only the magnitude of the complex values in memory is used instead of using the complete information of the complex numbers to select the elements to be optimized. Hence, this chapter focuses on improving the way of selecting the features from the holographic memory matrix H in order to increase its interpretability [10, 62, 72] and explainability [2, 24, 65].

3 Feature Selection with Holographic Neural Networks

Feature selection algorithms used to reduce data dimensionality may be divided into filter, wrapper, and embedded approaches [25]. Filters select subsets of inputs as a pre-processing step, independently of the chosen predictor. Wrappers use the learning machine of interest as a black box to score subsets of inputs according to their predictive power. Embedded methods perform input selection as part of the training process and are usually specific to a given learning machine.

3.1 Previous Works

Two filters, called hnn-filter and fq-filter, were used to reduce the number of parameters in [13]. The optimal dimension $d_i$ for the FQ-expansion was computed by tuning the mean and standard deviation of the MFs $\mu_{C_{il}}(x_i)$ and $\mu_{G_j}(y_j)$ and the category weights $c_i$ $(i = 1 \ldots K)$ to maximize the value of $\eta^2$ in the real domain. The maximum eigenvalue $\eta^2$ and its corresponding eigenvector $\mathbf{c}$ provide the maximum degree of separation of the fuzzy groups that maintains the classification accuracy higher than in previous models. The fq-filter sorts the variables based on the category weights $\mathbf{c} = [c_1, \ldots, c_K]$, and the hnn-filter ranks the variables using the level of input-output correlation measured directly from the magnitude $\rho_i$ of the complex value $h_i = \rho_i e^{\hat{i}\varphi_i}$ stored in the holographic memory matrix H. Since the generalized eigenvalue problem is ill-conditioned, the solution proposed by the fq-filter may not be reliable. At the same time, the hnn-filter only uses the magnitude of the complex numbers to select the most important parameters, thus losing the phase information. This chapter explores the advantages of performing the feature selection entirely in the complex domain. According to [60, 61, 71, 75], the basis functions in (3) are complex fuzzy sets by definition, where $\lambda(x)$, $\theta(x)$, $\gamma(y)$ and $\phi(y)$ are real-valued functions. If the MF-value of the complex number $h_i$ denotes the degree to which the feature $s_i$ belongs to the set of selected features, the MF-value could give an order to these features. Complex membership functions (CMFs) have been used as activation functions at the nodes of complex-valued neuro-fuzzy systems (see, e.g., [75] and references therein). However, both the inputs and outputs of a CMF are complex numbers interpreted as two real-valued outputs (the "dual-output property") that cannot be used for feature selection. Any function mapping complex numbers to the unit interval [0, 1] (e.g., [73, 74]) could provide a way of ordering the features in the holographic memory matrix H.

3.2 Pythagorean Membership Grades

In [73], Yager introduced a new class of nonstandard fuzzy sets called Pythagorean fuzzy sets (PFS) and called the membership grades associated with these sets Pythagorean membership grades (PMGs). A PMG is a pair of values (a, b) such that a, b ∈ [0, 1] and a² + b² ≤ 1. Yager and Abbasov [74] discussed the connection between PMGs and complex numbers. They noticed that not all complex numbers of the form z = a + bi = re^{iθ} are interpretable as PMGs. The requirement for a pair (a, b) to be a PMG is that a = A_Y(z) = r cos θ and b = A_N(z) = r sin θ be in the unit interval and A_Y²(z) + A_N²(z) ≤ 1. Here, A_Y(z) and A_N(z) are the support for membership of z in A (i.e., z satisfies the concept A) and the support against membership of z in A (i.e., z does not satisfy the concept A). These conditions require that r ∈ [0, 1] and θ ∈ [0, π/2]. So complex numbers z = re^{iθ} having the properties r ∈ [0, 1] and θ ∈ [0, π/2] are examples of PMGs called "Π-i numbers". As the complex values h_i = ρ_i e^{iϕ_i} stored in the holographic memory matrix Ĥ are not Π-i numbers, the function proposed in [73, 74] was modified to map the complex values h_i to the interval [0, 1]: F(ρ_i, ϕ_i) =

1/2 + ρ_i^(π/(2ϕ_m)) (1 − ϕ_i/(2ϕ_m))                                    (21)

where ϕ_m = max[ϕ_1, ..., ϕ_K]. Note that the new function performs a uniform scaling of all values in the holographic memory matrix Ĥ and maintains the relationships between the orientations of all its vectors.
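The Π-i conditions above are easy to check numerically. A minimal sketch (Python; the function name and test values are ours, not from the chapter):

```python
import math

def is_pi_i_number(r: float, theta: float) -> bool:
    """Check Yager's conditions for z = r*exp(1j*theta) to be a
    Pythagorean membership grade: a = r*cos(theta) (support for
    membership) and b = r*sin(theta) (support against membership)
    must both lie in [0, 1] with a^2 + b^2 <= 1."""
    a = r * math.cos(theta)
    b = r * math.sin(theta)
    return 0.0 <= a <= 1.0 and 0.0 <= b <= 1.0 and a * a + b * b <= 1.0

print(is_pi_i_number(0.8, 0.5))    # True: r in [0,1], theta in [0, pi/2]
print(is_pi_i_number(0.8, -0.44))  # False: negative phase gives b < 0
```

The second call mirrors the situation in the holographic matrix, where phases can be negative, which is precisely why the mapping in (21) is needed.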

4 Pattern Classification

Holographic neural networks can be applied to the same problems as the other network types. Reported applications are mainly within image and object recognition [12, 13, 22, 23, 29, 30, 34–37, 43, 44, 55–58, 63, 68, 70] and automobile optimal design [19, 28, 39, 40, 66, 67]. There are also other isolated applications to data compression [46], currency exchange rate prediction [47], credit scoring [54], computer-based molecular design [64], robotics [5, 6], chemical applications [8]


and software engineering [41]. Most of the above applications [5, 6, 8, 39–41, 46, 47, 54, 55, 63, 64, 69, 70] are based on AND Corporation's Holographic Neural Technology (HNeT) [1] and compare HNeT with traditional classifiers (e.g., discriminant analysis [8, 41, 48], logistic regression [41], logical classification [41, 48], multi-layered neural networks [8, 39–41, 48, 54, 63, 66], neuro-fuzzy systems [13] and support vector machines [12]). As the above methods were developed and tested for specific applications using datasets that are not available for research, this chapter compares the performance of HNN classifiers on the standard Iris flower dataset [15–17], based on the accuracy results already reported in the literature [46–50].

4.1 Iris Dataset

The Iris flower dataset is a well-known database in the pattern recognition literature [15] and one of the earliest datasets used for the evaluation of classification methodologies. The problem consists of grouping iris flowers into three classes (named Setosa, Versicolour, and Virginica, respectively), with 50 instances each, according to four attributes (sepal length and width, petal length and width). According to the UCI Machine Learning Repository [16], there are 351 papers that cite this dataset, and the average accuracy reported by the baseline model classifiers is between 89.47% (random forest classification) and 97.3% (multi-layered neural networks). The database shows an overlap between some of its classes, which is also characteristic of driver state classification during conditional automation [32], where traditional classifiers show very poor results [7, 11, 14]. That is why this chapter focuses only on comparing the accuracy results and explanatory possibilities of the holographic classifiers with the Iris dataset.

4.1.1 Comparison of HNN Classifiers

Table 1 shows the results of the HNN classifiers in Sect. 2.2 with the Iris data set. The reference to each classifier is shown in the first column. The results of Khan 1995–1998 [34–37] and Hendra 1999 [29, 30] are not included because they are closely related to the image retrieval problem, and there is no evidence that they are useful for solving the classification problem. The second and third columns show the input mapping functions and expansion methods used by the classifiers, respectively. In all cases, a linear mapping function (Sutherland 1990–1992 [69, 70]) was used in the output to represent the 3 flower classes. The fourth column shows the equation used by the classifier to calculate the holographic memory matrix H. The last two columns show the accuracy of the classifier with the training and test data sets, respectively. The first two rows show the results of the HNeT system from references [46–50]. In the first row, the network cannot learn even when the expansion method proposed by Sutherland [69, 70] is applied using 2nd-order statistics with 400 terms (6 unique).


Only 60% accuracy is obtained in the training set and 56% in the test set. Row 2 shows the main result reported in [46–50]. The size of the stimulus is increased from length 4 to 20 using a 2nd-order expansion with 400 terms (190 unique) by applying the sines and cosines pre-processing explained in Sect. 2.2. Thus, the network was correctly trained (100% accuracy) and achieved 95% accuracy in the test set. According to the references [46–50], in both cases the training in (18) was used to obtain the holographic matrix Ha. As it is not clear which samples were used for training and testing in [46–50], the methods [12, 13, 19] were reprogrammed in Matlab and tested by randomly selecting 45 samples of each type (135 in total) for training and the rest (15 in total) for testing. The experiments were repeated 20 times, and the average results of the 20 runs are reported in the last two rows. In both cases, Eq. (20) was used to obtain the holographic memory matrix HP using ρ1 = 1 and ρ2 = 1. For Hagiwara (2004), as for the Sutherland 1990–1992 method, when the sigmoid function is used, the prediction results are poor. The network has a higher accuracy for training with the HP function without the expansion method than with Ha (74% vs. 60%). Training times have not been compared, but since the method using Ha is iterative and the method using HP only requires the computation of an inverse matrix, the latter should be faster. However, in both cases, the prediction results on the test set are not good (53.3% and 56%, respectively). With the fuzzy expansion (Diago 2008–2011 [12, 13]), a considerable improvement is achieved (93.3% accuracy in the prediction of the test set), although only 96.3% learning accuracy of the training set is achieved. The t-test showed statistically significant differences (p = 7 × 10⁻¹⁰) between the last two classifiers in Table 1.
The accuracy is lower than that of Manger's method, but the models' explainability is far superior to that of the other holographic classifiers. Note that the method does not use an optimal expansion but a fixed value of di = 3 (K = 12) to obtain these results, while Manger's method uses K = 190 unique terms, with which it is very difficult to explain the models.
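The random 45/15-per-class split protocol used in these experiments can be sketched as follows (NumPy; the classifier training itself is omitted and the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def split_per_class(y, n_train=45):
    """Pick n_train random samples of each class for training and
    leave the rest for testing (135/15 in the Iris experiments)."""
    train_idx, test_idx = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.where(y == c)[0])
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

y = np.repeat([0, 1, 2], 50)          # 150 labels, 50 per class, as in Iris
train_idx, test_idx = split_per_class(y)
print(len(train_idx), len(test_idx))  # 135 15
```

Repeating this split 20 times with different seeds and averaging the accuracies reproduces the shape of the protocol behind the last two rows of Table 1.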

Table 1 Comparison of the accuracy (in %) of different HNN classifiers

Reference                     | Mapping (pre-processing)     | Expansion method                  | Computing matrix H | Train set (%) | Test set (%)
Sutherland 1990–1992 [69, 70] | Sigmoid                      | 2nd order, 400 terms (6 unique)   | Ha, Eq. (18)       | 60            | 56
Manger 1993–1997 [46–50]      | Sin-cos, from length 4 to 20 | 2nd order, 400 terms (190 unique) | Ha, Eq. (18)       | 100           | 95
Hagiwara 2004 [19]            | Sigmoid                      | None                              | HP, Eq. (20)       | 74.0          | 53.3
Diago 2008–2011 [12, 13]      | Linear                       | FQ (di = 3)                       | HP, Eq. (20)       | 96.3          | 93.3

4.1.2 Explainability of the Models

The post-hoc processes reported in the literature [2] to guarantee the explainability of the models fall basically into 3 groups: (1) analyzing the influence of the input parameters (feature-oriented); (2) analyzing the model parameters (for example, the weights associated with each of the inputs in the case of neural networks); and (3) visualizing the model and the properties of the classes. The following discusses, for illustrative purposes, the ways to explain one of the random models obtained with the fuzzy expansion in the last row of Table 1. Since 20 experiments were performed by randomly selecting the training and test sets, the model obtained for each experiment is different, so this section shows the results of one of the random experiments (trial 3/20) in the last row of Table 1. The results are shown in Tables 2, 3, 4 and 5. First, to analyze the influence of the input parameters in an understandable way, the fuzzy quantification (FQ) method divides each parameter into di = 3 categories with an understandable meaning, i.e., Small (S), Medium (M), and Large (L). Table 2 shows the boundaries of each category after sorting the values of the 135 training samples of each parameter and selecting as boundaries of the medium category the average values of elements 45–46 (lower) and 90–91 (upper). The lower boundaries of the small category and the upper boundaries of the large category were selected

Table 2 Boundary value of each category

Category (di = 3) | x1 sepal length | x2 sepal width | x3 petal length | x4 petal width
Small (S)         | 4.3–5.4         | 2–2.9          | 1–2.45          | 0.1–0.8
Medium (M)        | 5.4–6.3         | 2.9–3.2        | 2.45–4.85       | 0.8–1.55
Large (L)         | 6.3–7.7         | 3.2–4.4        | 4.85–6.9        | 1.55–2.5

Table 3 FQ-eigenvector (c), holographic matrix Ĥ values (ρe^{iϕ} = a + bi) and F(ρ, ϕ)

    | c        | ρ        | ϕ        | a         | b         | F(ρ, ϕ)
x1S | 0.12669  | 0.10203  | −1.8318  | −0.026327 | −0.098575 | 0.36196
x1M | 0.22062  | 0.36869  | −2.9488  | −0.36186  | −0.070631 | 0.45075
x1L | −0.34731 | 0.10196  | −1.3484  | 0.022485  | −0.099447 | 0.3174
x2S | 0.22487  | 0.093526 | −2.1271  | −0.049388 | −0.079423 | 0.39173
x2M | −0.10338 | 0.37196  | −2.9414  | −0.36453  | −0.073972 | 0.44967
x2L | −0.1215  | 0.095439 | −1.2342  | 0.031518  | −0.090084 | 0.31023
x3S | 0.030217 | 0.37117  | −0.37331 | 0.34561   | −0.13537  | 0.11579
x3M | 0.3877   | 0.30155  | 2.5883   | −0.25656  | 0.15846   | 0.54511
x3L | −0.41792 | 0.6065   | −0.43934 | 0.5489    | −0.25797  | 0.072009
x4S | 0.030201 | 0.37117  | −0.37331 | 0.34561   | −0.13537  | 0.11579
x4M | 0.43844  | 0.29887  | 2.5908   | −0.25467  | 0.15643   | 0.5447
x4L | −0.46864 | 0.60761  | −0.45171 | 0.54667   | −0.26523  | 0.073635


Table 4 Learning and predicting accuracy of the selected features

          | [x1, x2, x3, x4] | x1 (%)  | x2 (%)  | x3 (%)  | x4 (%)
Train set | 96.2963          | 73.3233 | 57.3827 | 95.0617 | 95.9506
Test set  | 93.33 (14/15)    | 69.3333 | 59.1111 | 91.5556 | 94.6667

from elements 1 and 135, respectively. The upper boundary of the small category coincides with the lower boundary of the medium category, and the lower boundary of the large category coincides with the upper boundary of the medium category. Second, to analyze the influence of the model parameters, Table 3 shows the FQ-eigenvector (c), the values of the holographic matrix Ĥ, and the PMGs computed by the function F(ρ, ϕ). Table 3 shows in bold the values with the highest absolute values for each feature selector. It is difficult to explain with c, ϕ, a, and b which feature is responsible for the variance in the response variable, because they take both positive and negative values. Hence, the value of ρ is the most widely used in the literature as a feature selector [13, 36, 70]. Table 4 shows the learning and predictive power of the 4 parameters together (column 2) compared to each independent parameter (columns 3–6). The results in Table 3 reported by ρ and F(ρ, ϕ) in this trial agree with those in Table 4, where the two most important parameters (x3 and x4) maintain prediction accuracy above 90% when used independently. However, we do not know if the same is true for the other trials. Therefore, the stability [33] of the feature selectors should be analyzed together with their prediction accuracy. Finally, to visualize the model and the properties of the classes, rules are extracted, as in [13], to explain the model's inferences. Table 5 shows the 15 test set cases from the previous experiment (trial 3/20). The five test cases of the Setosa class were described by 2 rules (1–1 and 1–2), the five cases of the Versicolour class by 3 rules (2–1, 2–2 and 2–3), and 4 of the 5 cases of the Virginica class were described by 3 rules (3–1, 3–2, 3–3). The probability of each rule in the training set is also shown in parentheses in the table. Only 1 of the 15 test cases was incorrect (marked with an X in the "Correct" column). The combination of this test case's antecedents was not found in the training set. As rule 2–1 has a higher probability than rule 3–2 (0.14 vs. 0.06), the case was incorrectly included in the Versicolour class. Hence, the prediction accuracy for this test set was 93.3% (14/15).
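The boundary computation described above (sorted training values, with the medium category delimited by the averages of elements 45–46 and 90–91) can be sketched as follows (NumPy; the function name is ours and the data is synthetic):

```python
import numpy as np

def fq_boundaries(values):
    """S/M/L boundaries (di = 3) from 135 training values: the medium
    category runs between the averages of sorted elements 45-46 and
    90-91 (1-based); S and L extend to the observed min and max."""
    v = np.sort(np.asarray(values, dtype=float))
    lo = float((v[44] + v[45]) / 2.0)   # average of elements 45 and 46
    hi = float((v[89] + v[90]) / 2.0)   # average of elements 90 and 91
    return {"S": (float(v[0]), lo), "M": (lo, hi), "L": (hi, float(v[-1]))}

# Synthetic data: 135 evenly spaced values in [0, 134]
b = fq_boundaries(np.arange(135))
print(b["M"])   # (44.5, 89.5)
```

Applying this per attribute to the 135 Iris training samples would yield boundaries of the kind shown in Table 2.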

4.1.3 Stability of the Feature Selectors

Stability [33] is the robustness of the feature preferences a selection method produces to perturbations of the training samples. Stability indicates the reproducibility power of the feature selection method. High stability of a feature selection algorithm is just as important as high classification accuracy when evaluating feature selection performance. Pearson's correlation coefficient (PCC) is one of the stability measures reported in [33]; it calculates the correlation between the weights of the selected

Table 5 Prediction percent of the rules extracted from the holographic memory: 93.33% (14/15)

Flower type   | x1 value | x1 cat. | x2 value | x2 cat. | x3 value | x3 cat. | x4 value | x4 cat. | Rule (probability) | Correct
1 Setosa      | 4.7      | S       | 3.2      | M       | 1.3      | S       | 0.2      | S       | 1–2 (0.10)         | O
1 Setosa      | 4.6      | S       | 3.1      | M       | 1.5      | S       | 0.2      | S       | 1–2 (0.10)         | O
1 Setosa      | 5.4      | S       | 3.9      | L       | 1.3      | S       | 0.4      | S       | 1–1 (0.17)         | O
1 Setosa      | 5.1      | S       | 3.5      | L       | 1.4      | S       | 0.3      | S       | 1–1 (0.17)         | O
1 Setosa      | 5        | S       | 3.5      | M       | 1.3      | S       | 0.3      | S       | 1–1 (0.17)         | O
2 Versicolour | 5.7      | M       | 2.8      | S       | 4.5      | M       | 1.3      | M       | 2–1 (0.14)         | O
2 Versicolour | 6.1      | M       | 2.8      | S       | 4        | M       | 1.3      | M       | 2–1 (0.14)         | O
2 Versicolour | 6.4      | L       | 2.9      | S       | 4.3      | M       | 1.3      | M       | 2–3 (0.03)         | O
2 Versicolour | 6.6      | L       | 3        | M       | 4.4      | M       | 1.4      | M       | 2–2 (0.03)         | O
2 Versicolour | 5.6      | M       | 2.7      | S       | 4.2      | M       | 1.3      | M       | 2–1 (0.14)         | O
3 Virginica   | 6.3      | L       | 3.3      | L       | 6        | L       | 2.5      | L       | 3–3 (0.03)         | O
3 Virginica   | 7.2      | L       | 3.6      | L       | 6.1      | L       | 2.5      | L       | 3–3 (0.03)         | O
3 Virginica   | 6.8      | L       | 3        | M       | 5.5      | L       | 2.1      | L       | 3–1 (0.11)         | O
3 Virginica   | 6.1      | M       | 2.6      | S       | 5.6      | L       | 1.4      | M       | -                  | X
3 Virginica   | 6.3      | L       | 2.5      | S       | 5        | L       | 1.9      | L       | 3–2 (0.06)         | O


Table 6 Pearson's correlation coefficient (PCC) of different feature selectors

        | c       | ρ       | ϕ      | a       | b       | F(ρ, ϕ)
c       | 1.0000  | −0.3776 | 0.3368 | −0.6740 | 0.8193  | 0.7251
ρ       | −0.3776 | 1.0000  | 0.1720 | 0.4862  | −0.4026 | −0.4810
ϕ       | 0.3368  | 0.1720  | 1.0000 | 0.0792  | 0.5427  | 0.1348
a       | −0.6740 | 0.4862  | 0.0792 | 1.0000  | −0.7705 | −0.9561
b       | 0.8193  | −0.4026 | 0.5427 | −0.7705 | 1.0000  | 0.8680
F(ρ, ϕ) | 0.7251  | −0.4810 | 0.1348 | −0.9561 | 0.8680  | 1.0000
PCC     | 0.3049  | 0.0662  | 0.3776 | −0.1392 | 0.3428  | 0.2151

subsets of the features. PCC returns a result in the range [−1, 1]: 1 means the weight vectors are perfectly correlated, −1 indicates the weight vectors are anti-correlated, and 0 indicates no correlation between the weight vectors. Table 6 shows the average PCC matrix computed from the PCC matrices of the 20 experiments. The last row shows the stability of each feature selector computed from the average PCC. As mentioned before, c, ϕ, a, and b are difficult to use as feature selectors because they take both positive and negative values. Therefore, comparing ρ and F(ρ, ϕ), F(ρ, ϕ) has higher stability (0.2151 vs. 0.0662) while also allowing feature selection with high prediction accuracy.
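A PCC-based stability score of this kind can be sketched as follows (NumPy; the function name is ours): it averages the pairwise Pearson correlations between the weight vectors produced by repeated runs of a selector.

```python
import numpy as np

def stability_pcc(weight_runs):
    """Average pairwise Pearson correlation between feature-weight
    vectors from repeated runs (rows: runs, columns: features)."""
    W = np.asarray(weight_runs, dtype=float)
    n = W.shape[0]
    corrs = [np.corrcoef(W[i], W[j])[0, 1]
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(corrs))

# Identical weight vectors across runs: perfect stability (close to 1.0).
print(stability_pcc([[0.2, 0.5, 0.9],
                     [0.2, 0.5, 0.9],
                     [0.2, 0.5, 0.9]]))
```

Feeding it the ρ or F(ρ, ϕ) vectors from each of the 20 trials would reproduce the kind of per-selector scores shown in the last row of Table 6.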

4.2 NIPS Feature Selection Challenge

The NIPS 2003 Feature Selection Challenge [26, 27] introduced a methodology with datasets that allow feature selection algorithms to be evaluated in a controlled manner, avoiding different datasets or different data splits in the comparison of published results in the field of feature selection. In the challenge, high-dimensional datasets were chosen to cover a wide variety of domains, with a sufficient number of examples to create a test set large enough to obtain statistically significant results. Several selection methods were analyzed in [26], and many of the top-ranked ones were simple filtering strategies such as the one shown in this chapter. However, as already discussed, in order to solve the problem of classifying the state of drivers during conditional driving, it is necessary to develop models that can explain their online inferences from a small number of human-understandable features. The choice of the appropriate approach to feature representation is a matter of continuing debate among researchers [59]. The current trend [9] is to use raw, high-dimensional data obtained by sensors and to use deep neural networks to extract a set of features whose information has no human-understandable meaning. On the other hand, if we stop explaining black-box machine learning models and start from a small set of features extracted from human knowledge [62], the process of selecting the most important ones results in a set of features that is already understandable. The proposed tool was developed in the context of conditional autonomous driving, but in future work its efficiency could be tested on the NIPS 2003 Feature Selection Challenge [26, 27], or it could be applied to the online feature selection problem, where features change constantly and the stability of the method plays an important role.

5 Conclusions and Future Works

HNNs have been developed for more than thirty years, but there has been no literature review of the technology like the one presented in this chapter. Existing reviews in the literature focus on CVNNs, and although HNNs are complex-domain networks, they are not referenced there because they are not based on a layered architecture. In the development of HNNs, three periods can be distinguished: the beginning of the 90s (1990–1994), the late 90s (1995–1999), and the beginning of the 21st century (2000–2011). In the first period, the works reported in the literature include Sutherland's holographic associative memory proposal, and its applications focus on the HNeT system, which is a proprietary system. In the second period, the most relevant works are those reported by Khan, where the memory is extended to multiple dimensions and the concept of learning with variable attention is introduced. In the third period, there is a tendency to learn from the efficient solution of the system of linear equations in the complex domain and to combine it with traditional fuzzy logic to improve the interpretability of the models. In the chapter, the basic theory of HNNs is briefly introduced, and the main contributions to its development are presented chronologically through the 3 basic elements of the technology: the mapping functions from the real domain to the complex domain, the expansion methods of the basis functions in the complex domain, and the ways of calculating the holographic memory matrix. HNNs can be used to solve the same problems as other types of networks, but their application is difficult in some cases due to the need for processing in the complex domain.
Using the classification problem as an example, this chapter proposes a tool for feature selection in the complex domain whose stability is superior to that of the traditional feature selection method for holographic networks, while maintaining the accuracy of its models (explainable by learning rules) above 90% with the Iris data set. With the current development of deep learning networks and their complex-domain variants [45], it would be interesting to demonstrate that HNNs can be a variant to consider when it comes to performing online learning, due to their proven learning speed and the possibility of continuously updating the models without retraining them. In order to develop deep models of HNNs, it would be essential to continue the development of the feature selection tool proposed in this chapter, for which several questions arise to be investigated. For example: Is there any advantage if other membership functions are used instead of Pythagorean membership functions (e.g., Intuitionistic vs. Pythagorean vs. Fermatean)? It would also be interesting to find some connection with Laplace networks [42] to work with images directly and not with already extracted features. Testing the proposed selection tool with other standard data sets could be a short-term job before using the tool for learning with dynamic attention or applying it to HNN optimization (neural plasticity).

References

1. AND Corp.: AND Corporation: HNeT2005 application development system. http://www.andcorporation.com/home.html. Accessed 15 Jan 2023
2. Angelov, P.P., Soares, E.A., Jiang, R., Arnold, N.I., Atkinson, P.M.: Explainable artificial intelligence: an analytical review. WIREs Data Min. Knowl. Discov. 11(5) (2021). https://doi.org/10.1002/widm.1424
3. Banegas-Luna, A.J., Peña-García, J., Iftene, A., Guadagni, F., Ferroni, P., Scarpato, N., Zanzotto, F.M., Bueno-Crespo, A., Pérez-Sánchez, H.: Towards the interpretability of machine learning predictions for medical applications targeting personalised therapies: A cancer case survey. Int. J. Mol. Sci. 22(9), 4394 (2021). https://doi.org/10.3390/ijms22094394
4. Bassey, J., Qian, L., Li, X.: A survey of complex-valued neural networks (2021). arXiv:2101.12249
5. Boudreau, R., Darenfed, S., Turkkan, N.: Etude comparative de trois nouvelles approches pour la solution du probleme geometrique direct des manipulateurs paralleles. Mech. Mach. Theory 33(5), 463–477 (1998)
6. Boudreau, R., Levesque, G., Darenfed, S.: Parallel manipulator kinematics learning using holographic neural network models. Robot. Comput.-Integr. Manuf. 14(1), 37–44 (1998). www.scopus.com
7. Braunagel, C., Rosenstiel, W., Kasneci, E.: Ready for take-over? A new driver assistance system for an automated classification of driver take-over readiness. IEEE Intell. Transp. Syst. Mag. 9(4), 10–22 (2017). https://doi.org/10.1109/MITS.2017.2743165
8. Burden, F.R.: Holographic neural networks as nonlinear discriminants for chemical applications. J. Chem. Inf. Comput. Sci. 38(1), 47–53 (1998)
9. Cai, J., Luo, J., Wang, S., Yang, S.: Feature selection in machine learning: A new perspective. Neurocomputing 300, 70–79 (2018). https://doi.org/10.1016/j.neucom.2017.11.077
10. Carvalho, D.V., Pereira, E.M., Cardoso, J.S.: Machine learning interpretability: A survey on methods and metrics. Electronics 8(8) (2019). https://doi.org/10.3390/electronics8080832
11. Deo, N., Trivedi, M.M.: Looking at the driver/rider in autonomous vehicles to predict take-over readiness. IEEE Trans. Intell. Veh. 5(1), 41–52 (2020). https://doi.org/10.1109/TIV.2019.2955364
12. Diago, L.A., Kitaoka, T., Hagiwara, I.: Development of a system for automatic facial expression analysis. J. Comput. Sci. Technol. 2(4), 401–412 (2008). https://doi.org/10.1299/jcst.2.401
13. Diago, L.A., Kitaoka, T., Hagiwara, I., Kambayashi, T.: Neuro-fuzzy quantification of personal perceptions of facial images based on a limited data set. IEEE Trans. Neural Netw. 22, 2422–2434 (2011)
14. Diago, L.A., Abe, H., Adachi, K., Hagiwara, I.: Exploración de redes neuronales holográficas con cuantificación difusa para el monitoreo de conductores en conducción autónoma condicional. Revista Cubana de Transformación Digital 2(1), 46–65 (2021). https://rctd.uic.cu/rctd/article/view/104
15. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. A Wiley-Interscience Publication. Wiley, Nashville, TN (2000)
16. Fisher, R.: Iris. UCI Machine Learning Repository (1988). https://archive-beta.ics.uci.edu/dataset/53/iris. Accessed 18 Jan 2023
17. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. 7(2), 179–188 (1936). https://doi.org/10.1111/j.1469-1809.1936.tb02137.x


18. Franich, A., Soucek, B., Visaggio, G.: The symmetric holographic associator. Neural Netw. World 3(1), 61–67 (1993)
19. Fukushima, H., Kamada, Y., Hagiwara, I.: Optimum engine mounting layout using MPOD. Nippon Kikai Gakkai Ronbunshu, C Hen/Trans. Jpn. Soc. Mech. Eng., Part C 70(1), 54–61 (2004). www.scopus.com
20. Gabor, D.: A new microscopic principle. Nature 161(4098), 777–778 (1948). https://doi.org/10.1038/161777a0
21. Gabor, D.: Associative holographic memories. IBM J. Res. Dev. 13(2), 156–159 (1969). https://doi.org/10.1147/rd.132.0156
22. Gopalan, R., Lee, G.: Indexing of image databases using untrained 4D holographic memory model. In: McKay, R., Slaney, J. (eds.) Proceedings of the 15th Australian Joint Conference on Artificial Intelligence, LNAI 2557, pp. 237–248. Springer, Berlin (2002)
23. Gopalan, R.P., Hendra, Y.: Retrieval characteristics of an untrained holographic index for image databases. In: IEEE International Conference on Image Processing, vol. 2, pp. 555–558 (2003). www.scopus.com
24. Gunning, D., Stefik, M., Choi, J., Miller, T., Stumpf, S., Yang, G.Z.: XAI–Explainable artificial intelligence. Sci. Robot. 4(37), eaay7120 (2019). https://doi.org/10.1126/scirobotics.aay7120
25. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3, 1157–1182 (2003)
26. Guyon, I., Gunn, S., Ben-Hur, A., Dror, G.: Result analysis of the NIPS 2003 feature selection challenge. In: Saul, L., Weiss, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems, vol. 17. MIT Press (2004)
27. Guyon, I., Gunn, S., Hur, A.B., Dror, G.: Design and analysis of the NIPS2003 challenge. In: Guyon, I., Nikravesh, M., Gunn, S., Zadeh, L.A. (eds.) Feature Extraction: Foundations and Applications, pp. 237–263. Springer, Berlin (2006). https://doi.org/10.1007/978-3-540-35488-8_10
28. Hagiwara, I., Shi, Q., Ichikawa, T.: Optimization for noise-vibration problem by response surface methodology using holographic neural network. Am. Soc. Mech. Eng., Press. Vessel. Pip. Div. (Publication) PVP 370, 35–42 (1998)
29. Hendra, Y.: Content-based retrieval of information from image and video databases using a holographic memory model. Master's thesis, Curtin University of Technology, Kent St, Bentley, Western Australia 6102 (1999)
30. Hendra, Y., Gopalan, R., Nair, M.: A method for dynamic indexing of large image databases. In: Systems, Man, and Cybernetics, vol. I, pp. 302–307. IEEE Press (1999)
31. Hirose, A.: Complex-Valued Neural Networks. World Scientific (2003). https://doi.org/10.1142/5345
32. ISO/SAE-J3016_202104: Taxonomy and definitions for terms related to driving automation systems for on-road motor vehicles. Standard, International Organization for Standardization, Geneva, CH (2021)
33. Khaire, U.M., Dhanalakshmi, R.: Stability of feature selection algorithm: A review. J. King Saud Univ. Comput. Inf. Sci. 34(4), 1060–1073 (2022). https://doi.org/10.1016/j.jksuci.2019.06.012
34. Khan, J.I.: Attention modulated associative computing and content-associative search in image archive. Ph.D. thesis, University of Hawaii, Hawaii (1995)
35. Khan, J.I.: Intermediate annotationless dynamical object-index-based query in large image archives with holographic representation. J. Vis. Commun. Image Represent. 7(4), 378–394 (1996)
36. Khan, J.I.: Characteristics of multidimensional holographic associative memory in retrieval with dynamically localizable attention. IEEE Trans. Neural Netw. 9(3), 389–406 (1998)
37. Khan, J.I., Yun, D.: A parallel, distributed and associative approach for searching image patterns with holographic dynamics. J. Vis. Lang. Comput. 8(3), 303–331 (1997)
38. Khan, J.I., Yun, D.Y.Y.: Feature based contraction of sparse holographic associative memory. In: Proceedings of World Congress on Neural Networks, vol. 4, pp. 26–33 (1994)


39. Kozukue, W., Miyaji, H.: Control of vehicle suspension using neural network. Veh. Syst. Dyn. 41(SUPPL.), 153–161 (2004). www.scopus.com
40. Kozukue, W., Miyaji, H.: Force identification using neural network. Am. Soc. Mech. Eng., Press. Vessel. Pip. Div. (Publication) PVP 482, 195–199 (2004). www.scopus.com
41. Lanubile, F., Visaggio, G.: Evaluating predictive quality models derived from software measures: Lessons learned. J. Syst. Softw. 38(3), 225–234 (1997)
42. Limmer, S., Stańczak, S.: Optimal deep neural networks for sparse recovery via Laplace techniques (2017). arXiv:1709.01112
43. Loo, C.K., Perus, M., Bischof, H.: Associative memory based image and object recognition by quantum holography. Open Syst. Inf. Dyn. 11(3), 277–289 (2004). www.scopus.com
44. Loo, C.K., Perus, M., Bischof, H.: Simulated quantum-optical object recognition from high-resolution images. Opt. Spectrosc. (English translation of Optika i Spektroskopiya) 99(2), 218–223 (2005)
45. Lopez-Pacheco, M., Yu, W.: Complex valued deep neural networks for nonlinear system modeling. Neural Process. Lett. 54(1), 559–580 (2021). https://doi.org/10.1007/s11063-021-10644-1
46. Manger, R.: Holographic neural networks and data compression. Informatica (Ljubljana) 21(4), 665–673 (1997)
47. Manger, R., Mauher, M.: Using holographic neural networks for currency exchange rates prediction. In: International Conference on Information Technology Interfaces (ITI94), pp. 143–150 (1994)
48. Manger, R., Pantamura, V.L., Soucek, B.: Classification with holographic neural networks. In: Pantamura, V.L., Soucek, B., Visaggio, G. (eds.) Frontier Decision Support Concepts: Help Desk, Learning, Fuzzy Diagnoses, Quality Evaluation, Prediction, Evolution, pp. 91–106. Wiley-Interscience (1994)
49. Manger, R., Pantamura, V.L., Soucek, B.: Stimulus preprocessing for holographic neural networks. In: Pantamura, V.L., Soucek, B., Visaggio, G. (eds.) Frontier Decision Support Concepts: Help Desk, Learning, Fuzzy Diagnoses, Quality Evaluation, Prediction, Evolution, pp. 79–90. Wiley-Interscience (1994)
50. Manger, R., Souček, B.: New preprocessing methods for holographic neural networks. In: Artificial Neural Nets and Genetic Algorithms, pp. 190–197. Springer, Vienna (1993). https://doi.org/10.1007/978-3-7091-7533-0_29
51. Nitta, T.: Complex-Valued Neural Networks. Information Science Reference, Hershey, PA (2009)
52. Paek, E.G., Patel, J.R.W.I.: Holographic on-line learning machine for multicategory classification. Jpn. J. Appl. Phys. 29(7A), L1332 (1990). https://doi.org/10.1143/JJAP.29.L1332
53. Paek, E.G., Wullert, J.R., Patel, J.S.: Holographic implementation of a learning machine based on a multicategory perceptron algorithm. Opt. Lett. 14(23), 1303–1305 (1989). https://doi.org/10.1364/OL.14.001303
54. Pantamura, V.L., Soucek, B., Visaggio, G.: The holographic fuzzy learning for credit scoring. In: International Joint Conference on Neural Networks, pp. 729–732 (1993)
55. Peniche-Ruiz, L., Mendoza, P., Elias, R.P.: Holographic technology applied to face location and identification. ICCIMA 00, 469 (1999). https://doi.org/10.1109/ICCIMA.1999.798576
56. Perus, M.: Image processing and becoming conscious of its result. Informatica (Ljubljana) 25(4), 575–592 (2001)
57. Perus, M., Bischof, H., Caulfield, H.J., Loo, C.K.: Quantum-implementable selective reconstruction of high-resolution images. Appl. Opt. 43(33), 6134–6138 (2004)
58. Perus, M., Dey, S.K.: Quantum systems can realize content-addressable associative memory. Appl. Math. Lett. 13(8), 31–36 (2000)
59. Pudjihartono, N., Fadason, T., Kempa-Liehr, A.W., O'Sullivan, J.M.: A review of feature selection methods for machine learning-based disease risk prediction. Front. Bioinform. 2 (2022). https://doi.org/10.3389/fbinf.2022.927312
60. Ramot, D., Friedman, M., Langholz, G., Kandel, A.: Complex fuzzy logic. IEEE Trans. Fuzzy Syst. 11, 450–461 (2003)


61. Ramot, D., Milo, R., Friedman, M., Kandel, A.: Complex fuzzy sets. IEEE Trans. Fuzzy Syst. 10(2), 171–186 (2002). https://doi.org/10.1109/91.995119
62. Rudin, C.: Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1(5), 206–215 (2019). https://doi.org/10.1038/s42256-019-0048-x
63. Sandirasegaram, N., English, R.: Comparative analysis of feature extraction (2D FFT and wavelet) and classification (lp metric distances, MLP NN, and HNeT) algorithms for SAR imagery. In: Zelnio, E.G., Garber, F.D. (eds.) Algorithms for Synthetic Aperture Radar Imagery XII. SPIE (2005). https://doi.org/10.1117/12.597305
64. Schneider, G., Wrede, P.: Artificial neural networks for computer-based molecular design. Prog. Biophys. Mol. Biol. 70(3), 175–222 (1998)
65. Shevskaya, N.V.: Explainable artificial intelligence approaches: Challenges and perspectives. In: 2021 International Conference on Quality Management, Transport and Information Security, Information Technologies. IEEE (2021). https://doi.org/10.1109/itqmis53292.2021.9642869
66. Shi, Q., Hagiwara, I.: Optimal design method to automobile problems using holographic neural network's approximation. Jpn. J. Ind. Appl. Math. 17(3), 321–339 (2000)
67. Shi, Q., Hagiwara, I., Azetsu, S., Ichikawa, T.: Holographic neural network approximations for acoustics optimization. JSAE Rev. 19(4), 361–363 (1998)
68. Stoop, R., Buchli, J., Keller, G., Steeb, W.: Stochastic resonance in pattern recognition by a holographic neuron model. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 67(6), 061918/1–061918/6 (2003)
69. Sutherland, J.G.: A holographic model of memory, learning and expression. Int. J. Neural Syst. 01(03), 259–267 (1990). https://doi.org/10.1142/s0129065790000163
70. Sutherland, J.G.: The holographic neural method. In: Soucek, B. (ed.) Fuzzy, Holographic and Parallel Intelligence, Chap. 1, pp. 30–63. Wiley, Nashville, TN (1992)
71. Tamir, D.E., Rishe, N.D., Kandel, A.: Complex fuzzy sets and complex fuzzy logic: an overview of theory and applications. In: Tamir, D.E., Rishe, N.D., Kandel, A. (eds.) Fifty Years of Fuzzy Logic and its Applications, pp. 661–681. Springer International Publishing, Cham (2015). https://doi.org/10.1007/978-3-319-19683-1_31
72. Vollert, S., Atzmueller, M., Theissler, A.: Interpretable machine learning: A brief survey from the predictive maintenance perspective. In: 2021 26th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), pp. 01–08 (2021). https://doi.org/10.1109/ETFA45728.2021.9613467
73. Yager, R.R.: Pythagorean membership grades in multicriteria decision making. IEEE Trans. Fuzzy Syst. 22(4), 958–965 (2014). https://doi.org/10.1109/TFUZZ.2013.2278989
74. Yager, R.R., Abbasov, A.M.: Pythagorean membership grades, complex numbers, and decision making. Int. J. Intell. Syst. 28(5), 436–452 (2013). https://doi.org/10.1002/int.21584
75. Yazdanbakhsh, O., Dick, S.: A systematic review of complex fuzzy sets and logic. Fuzzy Sets Syst. 338, 1–22 (2018). https://doi.org/10.1016/j.fss.2017.01.010

Reusability Analysis of K-Nearest Neighbors Variants for Classification Models José Ángel Villarreal-Hernández, María Lucila Morales-Rodríguez, Nelson Rangel-Valdez, and Claudia Gómez-Santillán

Abstract The K-nearest neighbors (KNN) algorithm is a powerful example-based method that is easy to implement in classification tasks. As a result, in the literature, there are proposals for KNN variants, such as KNN with weights, which are meant to improve KNN without straying too far from the original algorithm. To select a KNN variant using the design of experiments, a complete implementation of each KNN variant must be stored. Combining these variants with a range of K values and a set of distance metrics can produce a set of configurations to choose from. Even with these options, there is a chance that none of the possible configurations will suit the research scenario. Finding patterns in KNN variants would allow their features to be separated so that they can be reused and recombined, expanding the range of possible configurations. It is proposed to analyze the variants of KNN to determine the essence of their building patterns, which can be exploited for reusability purposes. This analysis will make it clear that there are patterns among the strategies that these variants use. The main contribution of this work is to distinguish the compositional differences between KNN variants and to identify strategies that can be adapted as part of a reusable approach. Keywords K-nearest neighbors · Classification problems · Algorithm design · Reusable components

J. Á. Villarreal-Hernández · M. L. Morales-Rodríguez (B) · C. Gómez-Santillán Instituto Tecnológico de Ciudad Madero, Tecnológico Nacional de México, Tamaulipas, Mexico e-mail: [email protected] J. Á. Villarreal-Hernández e-mail: [email protected] C. Gómez-Santillán e-mail: [email protected] N. Rangel-Valdez IxM CONACyT, Tecnológico Nacional de México, Tamaulipas, Mexico e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 G. Rivera et al. (eds.), Innovations in Machine and Deep Learning, Studies in Big Data 134, https://doi.org/10.1007/978-3-031-40688-1_4


1 Introduction
The KNN algorithm is one of the most popular approaches in supervised learning [1–3]. Different variants of the KNN algorithm have been proposed; each of these variants is meant to overcome a lack of precision or exploit some advantage of this algorithm. One of the reasons that these variants exist is that there are different application situations in which the original KNN algorithm could be improved. Some of these situations are illustrated in Fig. 1. In the KNN algorithm, the parameter K determines the number of members of the reference neighborhood. Once the neighborhood is formed, the majority class among its members is identified, and this class is the result of the algorithm. In this way, the parameter K influences the algorithm’s final result, and choosing the correct value of K is a problem in itself. For example, in Fig. 1a, a K value of 1 is used, which makes the algorithm highly sensitive to noise. Suppose that the nearest neighbor of the point that we seek to classify is an anomaly or a measurement error. In that case, the point’s classification will be that of the anomaly, even though there are many other sample points around it that might indicate the correct class. The opposite case is found in Fig. 1b. In this example, a much higher value of K is used. Because of this, many more neighbors are included, resulting in a majority class. However, a visual inspection of the distribution of the points indicates that around the point , there is an island of a different class. Mechanically, the algorithm has operated correctly, but in principle, the KNN algorithm surveys the surroundings of point  to determine which class it is most similar to. Since the island is more than just a pair of outliers, it can be argued that the less similar neighbors (◦) are outshining the more similar neighbors (+). In the third example in Fig. 1, K = 6, and therefore the most frequent class is +.
It could be said that this is the ideal behavior of the algorithm since the neighborhood does not include members that are very distant or different from . However, the visual inspection of the larger picture suggests that the classes are arranged in concentric rings. If the ring shape trend in the data is the correct way to define classes,  could be part of the inner ring (◦). Similar situations would occur on the boundaries

Fig. 1 Examples of KNN with different values of K.


between classes, giving rise to “acceptable” predictions from the algorithm that do not correspond to reality. These situations led researchers to study how to form the neighborhood and how to determine the class based on this neighborhood: what value of K is appropriate? What is the relevance of the ◦ points near  in the example? Should all selected neighbors be counted equally? Are the variables/dimensions that describe the examples sufficient or perhaps excessive? In response, a whole family of algorithms has been created based on the original KNN idea. The members of this family, called KNN variants, represent modifications to the original algorithm that affect how it operates. For example, these variants may be focused on modifying the dataset that will be used to form a neighborhood or perhaps on how to determine the resulting class given a neighborhood. It is common for KNN variants to be published as complete algorithms, even if changes to the original algorithm occur only in a specific part of the variant. An example of this is KNN with weights [4, 5], in which the procedure of the original KNN is followed up to the moment that the neighborhood is formed; this variant only modifies the calculation of the resulting class. Some aspects must be established to select a KNN algorithm configuration for a given work context using the design of experiments. A range of K values must be proposed for the algorithm, as well as a set of distance metric functions and a set of KNN variants to test. Although the differences between implementations of KNN variants may not be extensive, it is common for each variant to be stored as individual, isolated code. Surveys can be found in the literature that address the general mechanics of KNN variants, as well as comparisons of the performances of these variants [6, 7]. 
However, even though these surveys offer a valuable perspective on the state of the art, they do not clarify the differences between the construction of the original KNN and the treated variants or provide the construction patterns of these variants. This work proposes to analyze variants of the KNN algorithm that are used for classification. This analysis focuses on determining the construction patterns and differences among the KNN variants. The variants discussed in this document are compared to the standard KNN, with an emphasis on the substantially different components and those that are a particular case of the KNN algorithm. The remainder of this document is organized as follows. Section 2 describes the standard KNN and the components that interact within it. Sections 3 and 4 discuss the value of K and the proximity metrics. Section 5 continues with the identification of patterns in the KNN variants, which are ordered according to their main contributions. The findings obtained are discussed, as well as the contributions of this work regarding the reusability of the components of the KNN variants. Finally, Sect. 6 presents the overall conclusions.


2 The K-Nearest Neighbors Algorithm Historically, the parameter K and the distance function used in the KNN algorithm have been considered independent of the algorithm itself. The parameter K is considered a configuration parameter since it can be changed if necessary and since determining K is not part of the algorithm. Regarding the distance function, the Euclidean distance function has been established as a standard for the KNN algorithm, but other distance functions have also been used to obtain good results. There is a line of reasoning that causes us to consider the choice of the parameter K and the distance function as an external mechanism that is not part of the KNN algorithm. To extend this reasoning to the variants of the algorithm, we must look at the components of the original KNN. For the KNN algorithm to operate, an expert must provide a training dataset that is interpreted as points in an n-dimensional space. To classify a new element, the algorithm calculates the distance from this new point to each point in the training dataset. The algorithm takes the K neighbors with the minimum distance, which represent the neighbors most similar to the new point, and uses them as the reference neighborhood. The class assigned to the new point is usually the majority class in the reference neighborhood. Given this brief description of the KNN algorithm, we now define some details of this algorithm. According to Steinbach et al. [3], the classification process of the KNN algorithm (see Algorithm 1) proceeds as follows. Two datasets are required; the first is a set of elements that have already been classified (C). The second dataset contains elements with no known classification (S). In both sets, the elements are described with the same variables, which are considered spatial dimensions in the KNN algorithm. A distance metric M must also be defined; it will be used to compare each member of S with all members of C. 
These measurements are used to build the list of distances between each Ci and a particular Sj; this list is referred to as DCS. This establishes how similar to (close) or different from (far) the known instances an element with no known class is. Then, the K neighbors of Sj are selected; that is, the Ci with the shortest distances to Sj are selected. To decide to which class the point Sj belongs, the number of neighbors belonging to each class is counted, and the majority class is assigned to the point Sj. The process builds a new distance list DCS and a different reference neighborhood for each element with no known classification in S.

Algorithm 1 Standard KNN.
Require: K ≥ 1
for Sj in S do
    for Ci in C do
        DCS ← distance(M, Ci, Sj)
    end for
    neighbors ← selectNearby(K, DCS)
    Sj.class ← majorityClass(neighbors)
end for
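Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration of the standard procedure; the function and variable names (knn_classify, euclidean) are our own, not from the source:

```python
from collections import Counter
import math

def euclidean(a, b):
    # Metric M: straight-line distance between two n-dimensional points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(C, S, K, metric=euclidean):
    """C: list of (point, class) pairs with known classes.
    S: list of points with unknown classes.
    Returns one predicted class per point in S."""
    predictions = []
    for s in S:
        # Distance list DCS between every Ci and this particular Sj
        dcs = sorted((metric(point, s), label) for point, label in C)
        # Reference neighborhood: the K entries with the shortest distances
        neighborhood = [label for _, label in dcs[:K]]
        # Heuristic H: assign the majority class of the neighborhood
        predictions.append(Counter(neighborhood).most_common(1)[0][0])
    return predictions

C = [((0, 0), "o"), ((0, 1), "o"), ((5, 5), "+"), ((5, 6), "+"), ((6, 5), "+")]
print(knn_classify(C, [(5.5, 5.5), (0.2, 0.3)], K=3))  # ['+', 'o']
```

Note that the distance list is rebuilt for every query point, which is what makes the size of C the dominant cost of the algorithm.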


We find that the original KNN algorithm is composed of the following key elements:
• A training dataset with known classes (C);
• A set of elements with unknown classes (S);
• A metric M with which closeness/similarity can be established between the unknown-class element Sj and the instances in C;
• A number of reference neighbors K;
• A heuristic H that uses the reference neighborhood to set the class of the unknown-class element Sj.
These elements play an essential role within the algorithm. However, they are not part of the KNN algorithm in an immovable way; they can be altered within different variants. This is particularly easy to see for the parameter K and the distance function. In the following sections, we will determine whether these elements appear, and how they are implemented, in the different variants of the KNN algorithm.

3 The Parameter K
The KNN algorithm uses the K value to form the neighborhood of the unclassified element. Generally, this neighborhood consists of the K neighbors with the shortest distances to the point. The choice of the number of reference neighbors involves trade-offs: if K is too small, the result may be sensitive to noise, although, if enough known elements are available, larger values of K become noise-resistant [3]. On the other hand, if K is too large, the neighborhood can include members of other, less similar classes that dilute the class that should be selected or directly replace it. The recommendations found in the literature are compiled below according to the application scenario. If these recommendations do not fit a given scenario, cross-validation can be used to select a better value of K [8], as long as the computational cost is not too high. One way to simplify the calculations required by the KNN algorithm is to choose K = 1. Choosing this value eliminates the calculations related to the class decision policy, changing the frequency-count policy to simply assigning the class of the nearest neighbor. Several known-element-set reduction methods and some similarity metrics have been designed with the K = 1 case in mind. According to Ejazuddin [9], if the above considerations are not taken into account, K = 1 provides unstable results because of its high variance and sensitivity. Traditionally, the value of K is chosen as the square root of the total population of the set of known data [10]. While this K value can be a starting point for experimenting with the algorithm, in retrospect, it is often far too large for real-life and even academic applications. Consider, for example, the dataset of iris flowers [11] that is included in some development libraries. It is small, with 150 records, and its integer square root is 12, which, as we will see later, is approximately twice as large


Table 1 Rules for the value of K, according to Enas et al. [12]

                                         Difference between class sizes
Difference between covariance matrices   Small      Large
Small                                    n^(3/8)    n^(2/8)
Large                                    n^(2/8)    n^(3/8)
as the recommended value. However, this method is a quick way to propose an upper bound for the value of K. Enas et al. [12] study the effects of the K value for the case in which one decides between two possible classes using uniform weights. The authors report the following guidelines based on the sizes of the classes and their covariance matrices. When the classes have similar sizes, the covariance matrix of each class is calculated. If the difference between the covariance matrices is small, K can take the value n^(3/8). If the difference between the covariance matrices is large, K can take the value n^(2/8). When the classes have unequal sizes, the covariance matrix for each class is again calculated. If the similarity between the underlying covariance structures is small, K can take the value n^(3/8). If the similarity between the underlying covariance structures is large, K can take the value n^(2/8). The aforementioned rules are given in Table 1. Consider that most classification problems can be translated to the two-class case. For example, a four-class situation (A, B, C, D) can be treated as the classification of meta-classes (A − B, C − D) or simply as A and not A. Although Enas et al. use covariance structures for these rules, the procedures to obtain these structures can be computationally expensive [13], and it can be more economical to simply experiment with the two recommendations. Various experimental studies can be found in the literature that relate the precision of KNN classifications to the metrics and K values used. Tables 2 and 3 contain some good precision results from these publications, organized by metric and publication. Note that these publications used both standard and original, uncharacterized datasets, so the results may be inconsistent. By considering the published results, we can extract some broad notions about the value of K:
• Using K = 1 may be a viable option as a starting point.
• Although odd values of K are recommended to avoid ties, they do not appear to be associated with particularly high precision.
• For values of K greater than 10, good precision results can be found, but they are not significantly better than those achieved with lower values.
• In general, K values in the range from 1 to 6 present better precision results than higher K values.
In Tables 2 and 3, the items in the precision row correspond to the items in the row for the value of K. In the row dedicated to the Minkowski distance, the results that do

Table 2 Published precision results associating metrics and K value (Euclidean, Manhattan, and Minkowski distances; articles [14–18]). [The body of this table was not recoverable from this copy; the key entries are cited in the discussion below.]
not coincide with the Euclidean (p = 2) or Manhattan (p = 1) distances are given. In Sect. 4, the calculation of distances with these metrics is discussed. The Euclidean distance row of Table 2 shows three cases [14–16] in which a K value of 1 is used, and variable results are obtained. This variability could be attributed to K = 1, which has been identified as sensitive to noise and highly variable. However, it could also be explained by factors such as the total number of records in the dataset, the size difference between the classes, and the distance metric used. A large number of records can better describe the classes in a dataset; in the three cases discussed above, the datasets have different sizes. In [14], a dataset of 100 records was used, while in [16] the dataset consisted of 448 records; in [15], 1,074,992 points were used. The number of records in the dataset used in [16] also does not appear to be responsible for the change in precision. In all three publications [14–16], the classification application involves only two classes, but the size ratios between the classes are different. In [14], each class makes up 50% of the population, while in [16] the smallest class represents 45.3% of the population; in [15], the smallest class contains only 24.39% of the data points. We can note that although [16] has the lowest precision of the three, the difference between its classes is not as vast as the difference in [15] nor as small as the difference in [14]. In this case, the size ratio of the classes and the precision are unrelated. At least for the case of [16], in which the Euclidean distance is used, the difference in precision seems to be more related to the value of K, since the precision grows to 85.5% when K = 3 is used. Therefore, a more extensive reference neighborhood


Table 3 Published precision results associating metrics and K value (cosine similarity, Chebyshev distance, and chi-square; articles [14–18]). [The body of this table was not recoverable from this copy.]
is required in this application scenario. However, the distance metric’s effect on the ranking process should not be dismissed. In the results of [16] for the Manhattan distance, we observe that the precision of K = 1 exceeds that of K = 3, and the same applies when the Minkowski distance with p = 1.5 is used. Considering that the Minkowski distance is equivalent to the Euclidean distance when p = 2 and to the Manhattan distance when p = 1, tuning p below 2 could lead to more precise results.
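The rules of thumb for K surveyed in this section can be computed directly. The sketch below gathers the square-root heuristic [10] and the two exponents from Enas et al. [12]; the function name and the rounding choices are ours, not prescribed by the cited works:

```python
import math

def k_candidates(n):
    """Candidate K values for a training set of n records."""
    return {
        "sqrt(n)": round(math.sqrt(n)),    # classic upper-bound heuristic
        "n^(3/8)": round(n ** (3 / 8)),    # Enas et al.: small covariance difference
        "n^(2/8)": round(n ** (2 / 8)),    # Enas et al.: large covariance difference
    }

# For the iris dataset (150 records): sqrt gives 12, the Enas rules give 7 and 3,
# consistent with the observation that sqrt(n) tends to be too large.
print(k_candidates(150))
```

These values are only starting points for the design of experiments; cross-validation over a range around them remains the safer choice.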

4 Closeness Metrics
One of the key elements of the KNN algorithm is the metric with which the similarity between the known and unknown elements is calculated. This similarity is typically computed as a numerical distance, like that used for Euclidean space. Thus, techniques have been created that modify how the distances between points are calculated, maintaining the principle that closeness should indicate the extent to which two points are similar and, therefore, the probability that these points belong to the same class. On the other hand, similarity functions usually work in the range [0, 1], where two points x and y are maximally similar when s(x, y) = 1 [19]. The most common and simple distance types used in KNN are the Euclidean and Manhattan distances. For both of these distances, the differences between the n dimensions in which the points x and y are found are accumulated in a single value. In addition, the Euclidean distance has a resistance to discrimination that increases with the number of dimensions [3]. The Minkowski distance can be considered a generalization of both the Euclidean distance and the Manhattan distance [19]. When p = 1 is used in the Minkowski distance formula, the exponents behave like the absolute value of the Manhattan distance, and when p = 2, they represent the power and square root of the Euclidean distance. The Mahalanobis distance (also known as the generalized squared distance between points) follows an ensemble approach. It calculates the distance between a point and a cloud of points [19, 20], which in the KNN algorithm can be the members of a class. The calculation of this distance requires the definition of the set X composed of n points xi, whose variables are referred to as xij. We also need to find the midpoint μ of the set X and calculate the covariance matrix Σ. The covariance matrix is a square matrix containing the covariances between the elements of a vector; it is the generalization of the concept of the variance of a scalar random variable. Finally, the Mahalanobis distance between a point x and the cloud X is calculated using the inverse of the covariance matrix, as d(x, X) = sqrt((x − μ)ᵀ Σ⁻¹ (x − μ)).
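The relationships among these distances can be sketched directly. The following minimal illustration (our own formulation; numpy is used for the matrix algebra in the Mahalanobis case) shows how the Minkowski distance collapses to the Manhattan and Euclidean distances for p = 1 and p = 2:

```python
import numpy as np

def minkowski(x, y, p):
    # p = 1 -> Manhattan distance, p = 2 -> Euclidean distance
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def mahalanobis(x, X):
    """Distance from point x to the cloud of points X (one point per row)."""
    mu = X.mean(axis=0)                            # midpoint of the cloud
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance matrix
    d = x - mu
    return float(np.sqrt(d @ inv_cov @ d))

print(minkowski((0, 0), (3, 4), p=2))  # 5.0 (Euclidean)
print(minkowski((0, 0), (3, 4), p=1))  # 7.0 (Manhattan)
```

Values of p between 1 and 2 (e.g., the p = 1.5 case discussed in Sect. 3) interpolate between the two behaviors, which is why p can be treated as a tunable parameter.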
The similarity technique known as the Gaussian kernel is based on the Euclidean distance as a negative exponential. This makes the function take values between 0 and 1, with 1 indicating the greatest similarity [19]. Given two character strings, the Hamming distance is measured by determining at how many positions the strings differ [19]. For example, in Table 4, we can see differences in positions 2 and 3. Therefore, the Hamming distance is 2. Note that under this system, the strings must have the same length. The Levenshtein distance or editing distance is a method designed to establish a distance between two character strings using three operations: replacement (*), insertion (+), and deletion (-). This method considers the minimal operations required to convert one string into the other. For example (see Table 5), “pato” can be converted

Table 4 Examples of Hamming distance.
Position   0    1    2    3
String 1   C    A    S*   A*
String 2   C    A    Z*   O*


Table 5 Levenshtein distance example.
Position   0    1    2    3    4    5    Distance
String 1   P    A    T    O              –
String 2   G*   A    T    O              1
String 3   P*   L+   A    T    O         2
String 4   P    L    A    N+   T    A*   2

into “gato” by replacing “p” with “g” (one operation); on the other hand, going from “gato” to “plato” requires two operations, as does going from “plato” to “planta.” We can use the cosine similarity metric to calculate the similarity between two documents x and y. To do this, two arrays of size n are created, where n is the number of unique words occurring in the documents; each array position is associated with a unique word. Each array stores the number of occurrences of the corresponding word within its document. With these data, the cosine similarity can be calculated according to the formula given in [19]. Thanks to the distance metrics and the similarity functions, we have tools that allow us to establish the closeness between two elements. However, it is still not completely clear when it is convenient to use a distance metric and when it makes sense to use a similarity function; to determine this, a few different aspects must be considered. Distance metrics indicate in real terms how far one element is from another in space. On the other hand, similarity functions usually tell us the degree of likeness between two elements. Additionally, the nature of the elements to be compared can cause us to gravitate toward one of these two types. For example, when comparing two text documents, the first choice is likely to be the cosine similarity. It should be noted that the cosine similarity establishes a relationship between documents, whereas the distance metrics indicate the distance that separates two points in space, so they operate inversely. Although the KNN algorithm requires the distance relationship between pairs to be established, the cosine similarity procedure extracts descriptors (the unique words) from a document using numerical values. If, on the other hand, the working context of the dataset uses only numerical data, it is possible to use a distance metric.
If KNN with weights is used, the weighting mechanism can be a similarity function that gives more weight to the most similar neighbors.
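The two string metrics above can be reproduced with short routines. This sketch (our own code, using the standard dynamic-programming formulation of the edit distance) checks the values from Tables 4 and 5:

```python
def hamming(s1, s2):
    """Count mismatched positions; the strings must have equal length."""
    assert len(s1) == len(s2)
    return sum(a != b for a, b in zip(s1, s2))

def levenshtein(s, t):
    """Minimum number of replacements, insertions, and deletions
    needed to convert s into t (Wagner-Fischer dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # replacement
        prev = curr
    return prev[-1]

print(hamming("CASA", "CAZO"))        # 2: positions 2 and 3 differ (Table 4)
print(levenshtein("pato", "gato"))    # 1 (Table 5)
print(levenshtein("gato", "plato"))   # 2
print(levenshtein("plato", "planta")) # 2
```

For use inside KNN, either distance can be plugged in as the metric M whenever the attributes are strings rather than numbers.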

5 Analysis of KNN Variants 5.1 Heuristics for Class Assignment In the original KNN algorithm, once the reference neighborhood is formed, the next step is to determine to which class the unknown point belongs. The process of determining the majority class in the standard KNN is itself a heuristic. There are


many variants whose strategies affect various components of the standard KNN to overcome ambiguity or imprecision. Different approaches modify how the reference neighborhood members are weighted, and we find that these heuristics can be applied separately from the rest of the strategy. The nearest neighbor KNN variant [21] takes the principle of similarity between KNN points and proposes that, in any case, the immediate neighbor is more similar to the point than any other neighbor. This means that only the nearest neighbor is part of the reference neighborhood, and no other neighbor is considered. Searching for the majority class in a group of size 1 does not make practical sense, so a direct assignment is made. That is, the class of the unknown point will be that of the selected neighbor. This heuristic is commonly applied by different variants that use some kind of simplification of the dataset. The variant KNN with weights [4, 5] consists of assigning a weight w to each member of the reference neighborhood. The modifier w can be any function, including, for example, the inverse of the distance. This is a modification to the majority class heuristic, which interprets the class counts as a point system. In this way, more representation is given to the nearest neighbors while the bigger picture is still considered. There are variants of KNN that do not return a classification if some conditions are not met. For example, in KNN with rejection [22], the count of the majority class must exceed a minimum threshold U related to the total number of selected neighbors. Absolute-majority mechanics have also been included in the count, where majority class > (U + second majority class). As illustrated, it is possible to identify a common component in the KNN variants that determines the resulting class. This heuristic for class assignment takes the reference neighborhood as input, even if a neighborhood of size 1 is defined.
The output of this heuristic is the resulting class of the point to be classified, which is considered the output of the KNN algorithm. In between these two steps, there can be techniques of any level of complexity that consider the reference neighborhood and extract a class from it.
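The shared interface described above (neighborhood in, class or rejection out) can be sketched as follows. The inverse-distance weighting follows the idea of KNN with weights [4, 5], and the threshold check follows KNN with rejection [22]; the function names and the small epsilon guard are our own illustrative choices:

```python
def weighted_vote(neighbors):
    """neighbors: list of (distance, class) pairs.
    Inverse-distance weighting: closer neighbors contribute more."""
    scores = {}
    for dist, label in neighbors:
        scores[label] = scores.get(label, 0.0) + 1.0 / (dist + 1e-9)
    return max(scores, key=scores.get)

def majority_with_rejection(neighbors, U):
    """Return the majority class only if its count exceeds the threshold U;
    otherwise return None to signal a rejected (unclassified) point."""
    counts = {}
    for _, label in neighbors:
        counts[label] = counts.get(label, 0) + 1
    best = max(counts, key=counts.get)
    return best if counts[best] > U else None

nbrs = [(0.5, "+"), (1.0, "o"), (1.1, "o")]
print(weighted_vote(nbrs))               # '+': the single closest neighbor dominates
print(majority_with_rejection(nbrs, 2))  # None: 'o' has only 2 votes, not more than U
```

Because both functions consume the same neighborhood structure, they can be swapped into the standard KNN loop without touching the distance computation, which is exactly the reusability pattern this section identifies.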

5.2 Reduction of Dataset Records The KNN algorithm requires the calculation of the distance from the unclassified element to each known element, so the calculations made by the algorithm increase as the number of known elements increases. Although having many points may mean that the selected neighbors will be closer to the unknown element and may more accurately describe the region [8], an adequate reduction of the set of known elements can reduce the computations required without sacrificing performance. In his article [21], Wilson explores editing the example dataset for the nearest neighbor variant. This editing process consists of selecting and discarding known elements by excluding them from a KNN run. The editing process starts by extracting the first element C1 , and a KNN run is made with the temporary training set T =


C − C1. Then, C1 is classified using T, and the resulting class is compared with the known class: if the classes are different, C1 is eliminated. This process is then repeated for all members of C. In this way, a dataset with fewer members than the original, which is considered sufficient for the nearest neighbor variant, is produced. The use of this dataset is intended to reduce some of the disadvantages associated with this mechanism. Hart’s condensed nearest neighbor method [23] also exploits the nearest neighbor mechanic by selecting useful points from the training dataset. The condensation procedure loops through the ordered elements of C, and a KNN run classifies each one using a temporary set T made up of the elements of C that precede it. Then, the resulting class and the known class are compared: if they are different, the case is added to the condensed dataset. In the reduced nearest neighbor variant [24], Gates proposes a technique that further reduces the example dataset. The algorithm starts with a condensed dataset created using Hart’s technique, extracts the first point from it, and tries to classify the points of C using the modified condensed dataset. If the classification is defective, the extracted point is returned to the dataset. This is repeated for all points in the original condensed dataset. In the K-nearest centroid neighbor variant [25, 26], the members of the neighborhood are grouped by class. The average distance is obtained for each class present, and the class with the lowest average distance is assigned. In other words, for the class decision, the number of members per class and their distances are simplified to a single value. In his article [27], Angiulli presents the fast condensed nearest neighbor variant, which was inspired by Hart’s work. His method has some advantages over Hart’s condensation method; for example, it is not dependent on the order of appearance of the elements in the dataset.
The first step of the fast condensation algorithm is to produce seed elements of each class in C, which represent the class centroids. This method then checks that the training dataset can be correctly classified with these seed elements. If it detects that this is not the case, it adds a representative point that allows the training dataset to be classified correctly. The practice of reducing the size of the dataset is a common denominator of the contributions seen in this subsection. It is observed that the procedures that implement this approach are executed before the reference neighborhood is formed. While several of these techniques are intended for the nearest neighbor approach, there are other techniques [28, 29] that produce edited datasets to use with values of K ≥ 1.
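Wilson's editing procedure can be sketched as a leave-one-out pass over the training set. This is our own minimal rendering of the idea (using the 1-NN rule, as in the original formulation); the names wilson_edit and dist are illustrative:

```python
def dist(a, b):
    # Euclidean distance between two points
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def wilson_edit(C, metric):
    """C: list of (point, class) pairs. Keep only the points whose class
    matches a 1-NN prediction made from the rest of the set (T = C - {Ci})."""
    kept = []
    for i, (p, label) in enumerate(C):
        rest = C[:i] + C[i + 1:]
        _, nearest_label = min(rest, key=lambda cl: metric(cl[0], p))
        if nearest_label == label:
            kept.append((p, label))
    return kept

# A point mislabeled '+' sits inside the 'o' cluster and is edited out
C = [((0, 0), "o"), ((0.1, 0), "o"), ((5, 5), "+"), ((5.1, 5), "+"), ((0.2, 0.1), "+")]
print(wilson_edit(C, dist))
```

Note that the editing logic only consumes the (point, class) pairs and the metric, so it can be composed with any of the class-assignment heuristics from Sect. 5.1.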

5.3 Estimation of Dataset Variables

Classically, bodies of data are stored for computational processing in tables. The convention is that, within the table, the columns represent the variables that describe each member, and each member has a single entry in the table (a row). Variable estimation refers to the process of weighting variables in such a way that the results

Reusability Analysis of K-Nearest Neighbors Variants for Classification Models


of an algorithm are improved. To carry out this process, we start from a dataset and seek to optimize the influence of these variables (attributes), which are treated as spatial dimensions by the KNN algorithm. Variable estimation techniques usually take either a variable-weighting approach (Pearson correlation, chi-square, relevant component analysis) or a minimal subset selection approach [9] (greedy set cover, the OBLIVION algorithm, linear discriminant analysis) that selects the variables that will come into play in the distance calculation, excluding the rest. Different strategies for performing the estimation of variables are reviewed below.

The Pearson correlation method evaluates the relevance of each variable by performing an individual evaluation for each attribute. This evaluation is usually used to assign weights to the variables that make up the dataset. In this method, the correlation is computed between each variable that describes the points and the class that the points belong to.

Another technique that is used to weight attributes is the chi-square test, in which higher values indicate that an attribute is more closely associated with the class. In general terms, this technique verifies the null hypothesis that there is no association between the attribute and the class. The chi-square test is designed to work with categorical data and does not work with quantitative data. The test is based on comparing the actual distribution of the data and the expected distribution of the data using a contingency table.

The greedy set cover method progressively adds attributes that increase the quality of the predictions [30]. This method is performed in two main stages. (1) An initial base attribute selection is performed, and a disjunction list with the attributes is generated. Then, for each disjunctive variant, classifications of known elements are performed, and the accuracy of these classifications is recorded.
Attributes that did not participate in incorrect classifications are then grouped. (2) Once the set of safe attributes is formed, each attribute is used individually to classify the same test set in order to locate the most successful attribute. That attribute is kept fixed by separating it from the safe set; then, it is temporarily combined with each of the remaining safe attributes individually to classify the same test set. Again, the attribute that most increases the success of the classifications is fixed. The cycle of temporarily combining and fixing attributes is repeated until no member of the set of safe attributes increases the quality of the classifications.
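The fix-and-combine cycle of the second stage can be sketched as a greedy forward-selection loop. Here `evaluate` is a stand-in for classifying the test set with a candidate subset and returning its accuracy; the function name and structure are illustrative, not taken from [30]:

```python
def greedy_select(safe_attributes, evaluate):
    """Greedy fix-and-combine cycle: repeatedly fix the attribute whose
    addition most improves `evaluate(subset)`; stop when nothing helps."""
    selected = []
    remaining = list(safe_attributes)
    best = float('-inf')
    while remaining:
        # Temporarily combine each remaining attribute with the fixed ones.
        scored = [(evaluate(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(scored)
        if top_score <= best:
            break                        # no attribute improves the classifications
        best = top_score
        selected.append(top_attr)        # fix the most successful attribute
        remaining.remove(top_attr)
    return selected
```

With a scoring function whose per-attribute gain is additive, the loop simply fixes attributes in decreasing order of usefulness and stops at the first attribute that contributes nothing.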


J. Á. Villarreal-Hernández et al.

The OBLIVION method [31], developed by Langley and Sage, consists of progressively removing attributes from the dataset such that the quality of the predictions does not diminish. The general idea of the algorithm is that by removing a redundant or irrelevant attribute, the quality of the predictions can only be maintained or improved. This principle also covers the case in which two attributes present a synergy or dependence during the classification process. The algorithm starts with the full set of attributes, then creates subsets of size n − 1 and evaluates them using cross-validation. This process is repeated with the subsets that did not decrease in quality, forming sub-subsets of size n − 2 in a tree-like fashion. While this method can take attribute dependencies into account, it has a higher risk of leading to overfitting than other techniques. It is also computationally expensive, since the classifier, a KNN configuration, is part of the evaluation function.

Linear discriminant analysis [11] aims to project the features of a high-dimensional space onto a space with fewer dimensions. To do this, the between-class variance is first calculated: in other words, the distances between the means of the different classes present in the dataset are determined. Then, the within-class variance, which measures the distances between the samples of each class and their mean, is calculated. This method then builds the lower-dimensional space that maximizes the between-class variance and minimizes the within-class variance. This projection process is known as Fisher's criterion.

Relevant component analysis [32] is a technique that produces a weighting for attributes according to their relevance. It uses the concept of a chunk, a subset of points that are known to belong to the same class. Each chunk is processed by subtracting the sample mean of the chunk from each point in that chunk.
Next, the covariance matrix is calculated, and directions that are "irrelevant" for classification are found. These are directions in which the variability of the data is due to within-class variation. The techniques used for the estimation of variables often have a statistical background. The reviewed techniques have been related to KNN classification work, although they have also been used in other contexts. It was found that, like the reduction of dataset records, the estimation of variables is part of the preprocessing of the training dataset.
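The removal principle behind OBLIVION can be sketched as a backward-elimination loop. For brevity, this sketch greedily drops one attribute per pass instead of exploring the full tree of sub-subsets, and `evaluate` stands in for the cross-validated KNN score; all names are illustrative:

```python
def oblivion_prune(attributes, evaluate):
    """Backward elimination: drop an attribute whenever doing so keeps the
    (cross-validated) score returned by `evaluate(subset)` from decreasing."""
    current = list(attributes)
    score = evaluate(current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        for a in list(current):
            trial = [x for x in current if x != a]
            trial_score = evaluate(trial)
            if trial_score >= score:     # quality maintained or improved
                current, score = trial, trial_score
                improved = True
                break                    # restart the pass without `a`
    return current
```

Given an evaluator that penalizes an irrelevant attribute, the loop discards it and keeps the informative one.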

5.4 Discussion

This subsection summarizes the two main facts determined from the research presented in this section concerning the studied KNN variants. First, some roles recur across variants. The analysis revealed that these roles are the definition of dataset variables, the reduction of dataset records, and the heuristics for class assignments. The definition of dataset variables has the role of preparing the dataset by identifying the value of the variables that compose it.



The reduction of dataset records simplifies the dataset by either removing redundant members or generating representative replacements for them. Heuristics for class assignments analyze the reference neighborhood and extract from it the class of the unknown point. Second, different configuration options are available for the aforementioned roles and affect the way that these roles are implemented. Together with the K parameter and the closeness metric, the previous roles (the definition of dataset variables, the reduction of dataset records, and the heuristics for class assignment) constitute the five types of customizable options that any KNN implementation requires. Hence, almost all KNN implementations result from different combinations of the identified customizable options. Table 6 compiles the KNN variants described in this document and their associated roles.
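To illustrate this component view, the sketch below wires the five customizable options into one classifier: the K value, the closeness metric, the class-assignment heuristic, and optional hooks for record reduction and variable estimation. It is an illustrative skeleton under our own naming, not an implementation of any single surveyed variant:

```python
from collections import Counter
from math import dist  # default closeness metric: Euclidean distance

def weighted_euclidean(weights):
    """Closeness-metric component: Euclidean distance with per-variable
    weights, in the spirit of the attribute-weighting variants."""
    def metric(p, q):
        return sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q)) ** 0.5
    return metric

def frequency_count(neighbors):
    """Class-assignment heuristic component: majority vote."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def knn_classify(point, data, k=3, metric=dist, assign=frequency_count,
                 reduce_records=None, estimate_variables=None):
    """KNN built from the five customizable components identified above.
    `reduce_records` and `estimate_variables` are optional preprocessing
    hooks (e.g., a condensation routine or correlation-based weights)."""
    if estimate_variables is not None:   # variable estimation -> weighted metric
        metric = weighted_euclidean(estimate_variables(data))
    if reduce_records is not None:       # record reduction -> smaller store
        data = reduce_records(data)
    neighborhood = sorted(data, key=lambda ex: metric(point, ex[0]))[:k]
    return assign(neighborhood)
```

Swapping `assign` for a distance-weighted vote, or `reduce_records` for a condensation routine, reproduces the role combinations compiled in Table 6 without touching the rest of the code.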

6 Conclusions

This work carried out an analysis of several KNN-based classification models. It focused on identifying the main building patterns that are present in KNN implementations. As a result, the following conclusions can be drawn.

First, the KNN variants used for classification problems are usually implemented for specific purposes, such as the minimization of the computational cost associated with large datasets or the exploitation of a particular feature. Even though these approaches offer valuable contributions by solving a concrete problem, their definitions do not highlight the main modifications of the original KNN design that support the good results that they provide.

Second, a KNN design always requires certain roles. These roles might affect one or more specific components in the implementation of a variant. This work identified three key roles:
1. The heuristic for class assignment;
2. The reduction of dataset records;
3. The estimation of dataset variables.
Let us point out that different improvements can be achieved with these roles, such as improvements in the classification precision or in the performance of the algorithm.

Third, any KNN implementation has at least three customizable components: the value of the K parameter, the closeness metric, and the heuristic for class assignment (these components are included in the original KNN definition). In addition, this work reveals that two more customizable components might be present: the reduction of dataset records and the estimation of dataset variables (see Table 6).

Finally, the five identified customizable components are indeed used to build the patterns present in KNN variants. These patterns appear as isolated components, which allows them to be reused and combined without conflicts within the required



Table 6 KNN variants as components

| Name (published) | Reduction of dataset records | Estimation of dataset variables | Supported K value | Heuristic for class assignment |
|---|---|---|---|---|
| KNN [33] (1951) | – | – | Any odd | Frequency count |
| Nearest Neighbor [34] (1967) | – | – | K = 1 | Direct assignment |
| Condensed NN [23] (1968) | Hart condensation | – | Any odd | Frequency count |
| Edited NN [21] (1972) | Wilson's edit | – | Any odd | Frequency count |
| Reduced NN [24] (1972) | Gates reduction | – | K = 1 | Direct assignment |
| KNN with weights [35] (1993) | – | – | Any odd > 1 | Neighbors with weights |
| OBLIVION [31] (1994) | – | OBLIVION selection | Any odd | Frequency count |
| KNN of minimum distance [36] (1996) | Reduction to centroids | – | K = 1 | Direct assignment |
| Fast Condensed NN [27] (2005) | Fast condensation | – | K = 1 | Direct assignment |
| KNN with mean distance [37] (2006) | Reduction to mean | – | K = 1 | Direct assignment |
| KNN of attributes with weights [38] (2007)* | – | Weights by Pearson correlation | Any odd | Frequency count |
| KNN with rejection [22] (2009) | – | – | Any odd | Probability threshold |

*Note: in this KNN variant, the calculation of the Euclidean distance is modified with the calculated weights. In the other KNN variants, the Euclidean distance was used as the standard for comparisons.

roles for a particular KNN implementation. Hence, it is possible to develop a KNN variant with a design based on reusable components; this characteristic not only allows the easy implementation of existing KNN designs but also potentially allows the development of new KNN variants. A future line of research is the creation of a framework that can guide the development of reusable KNN designs and considers the five customizable components identified in this work.



Acknowledgements JAVH acknowledges CONACyT Ph.D. scholarship (Grant 794328). Also, the authors acknowledge the support obtained from: IxM CONACyT project 3058-“Optimización de Problemas Complejos”, Laboratorio Nacional de Tecnologías de Información (LaNTI) at TecNM/Instituto Tecnológico de Ciudad Madero, and TecNM research networks “Tecnologías Computacionales Aplicadas” and “Electromovilidad”.

References

1. García, V., Sánchez, J.S., Marqués, A., Florencia, R., Rivera, G.: Understanding the apparent superiority of over-sampling through an analysis of local information for class-imbalanced data. Expert Syst. Appl. 158, 113026 (2020). https://doi.org/10.1016/j.eswa.2019.113026
2. Rivera, G., Florencia, R., García, V., Ruiz, A., Sánchez-Solís, J.P.: News classification for identifying traffic incident points in a Spanish-speaking country: a real-world case study of class imbalance learning. Appl. Sci. 10(18), 6253 (2020). https://doi.org/10.3390/app10186253
3. Steinbach, M., Tan, P.N.: kNN: K-nearest neighbors. In: Wu, X., Kumar, V. (eds.) The Top Ten Algorithms in Data Mining, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, vol. 1, 1st edn., pp. 151–162. CRC Press, Boca Raton, FL (2009)
4. Hechenbichler, K., Schliep, K.: Weighted k-nearest-neighbor techniques and ordinal classification. Sonderforschungsbereich 386(399), 17 (2004). https://doi.org/10.5282/ubm/epub.1769
5. Tan, S.: Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Syst. Appl. 28(4), 667–671 (2005). https://doi.org/10.1016/j.eswa.2004.12.023
6. Bhatia, N., Vandana: Survey of nearest neighbor techniques. Int. J. Comput. Sci. Inf. Secur. 8(2) (2010). https://doi.org/10.48550/ARXIV.1007.0085
7. Uddin, S., Haque, I., Lu, H., Moni, M.A., Gide, E.: Comparative performance analysis of k-nearest neighbour (KNN) algorithm and its different variants for disease prediction. Sci. Rep. 12(1), 6256 (2022). https://doi.org/10.1038/s41598-022-10358-x
8. Russell, S.J., Norvig, P., Davis, E.: Artificial Intelligence: A Modern Approach, Prentice Hall Series in Artificial Intelligence, vol. 1, 3rd edn. Prentice Hall, Upper Saddle River, NJ (2010)
9. Syed, M.E.: Attribute weighting in k-nearest neighbor classification. Master thesis, University of Tampere (2014). https://core.ac.uk/download/pdf/250135847.pdf
10. Hassanat, A.B., Abbadi, M.A., Altarawneh, G.A.: Solving the problem of the k parameter in the KNN classifier using an ensemble learning approach 12(8), 7 (2014). https://doi.org/10.48550/arXiv.1409.0919. Preprint number 1409.0919
11. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. 7(2), 179–188 (1936). https://doi.org/10.1111/j.1469-1809.1936.tb02137.x
12. Enas, G.G., Choi, S.C.: Choice of the smoothing parameter and efficiency of k-nearest neighbor classification. Comput. Math. Appl. 12(2), 235–244 (1986). https://doi.org/10.1016/0898-1221(86)90076-3
13. Arboleda Quintero, L.L.: Estimación de modelos de estructura de covarianza mediante algoritmos genéticos. Master thesis, Universidad Nacional de Colombia (2017)
14. Angreni, I.A., Adisasmita, S.A., Ramli, M.I., Hamid, S.: Pengaruh nilai k pada metode k-nearest neighbor (KNN) terhadap tingkat akurasi identifikasi kerusakan jalan. Rekayasa Sipil 7(2), 63 (2019). https://doi.org/10.22441/jrs.2018.v07.i2.01
15. Mulak, P., Talhar, N.: Analysis of distance measures using k-nearest neighbor algorithm on KDD dataset. Int. J. Sci. Res. 4(7) (2015)
16. Wahyono, W., Trisna, I.N.P., Sariwening, S.L., Fajar, M., Wijayanto, D.: Comparison of distance measurement on k-nearest neighbour in textual data classification. Jurnal Teknologi dan Sistem Komputer 8(1), 54–58 (2020). https://doi.org/10.14710/jtsiskom.8.1.2020.54-58



17. Iswanto, I., Tulus, T., Sihombing, P.: Comparison of distance models on k-nearest neighbor algorithm in stroke disease detection. Appl. Technol. Comput. Sci. J. 4(1), 63–68 (2021). https://doi.org/10.33086/atcsj.v4i1.2097
18. Hu, L.Y., Huang, M.W., Ke, S.W., Tsai, C.F.: The distance function effect on k-nearest neighbor classification for medical datasets. SpringerPlus 5(1), 1304 (2016). https://doi.org/10.1186/s40064-016-2941-7
19. Metcalf, L., Casey, W.: Cybersecurity and Applied Mathematics, 1st edn. Syngress/Elsevier, Cambridge, MA (2016). https://doi.org/10.1016/C2015-0-01807-X
20. Mahalanobis, P.C.: Reprint of: Mahalanobis, P.C. (1936) "On the generalised distance in statistics". Sankhya A 80, 1–7 (2018). https://doi.org/10.1007/s13171-019-00164-5
21. Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. SMC-2(3), 408–421 (1972). https://doi.org/10.1109/TSMC.1972.4309137
22. Dalitz, C.: Reject options and confidence measures for kNN classifiers. Schriftenreihe des Fachbereichs Elektrotechnik und Informatik der Hochschule Niederrhein 8, 16–38 (2009). http://lionel.kr.hs-niederrhein.de/~dalitz/data/publications/sr09-knn-rejection.pdf
23. Hart, P.: The condensed nearest neighbor rule (corresp.). IEEE Trans. Inf. Theory 14(3), 515–516 (1968). https://doi.org/10.1109/TIT.1968.1054155
24. Gates, G.: The reduced nearest neighbor rule (corresp.). IEEE Trans. Inf. Theory 18(3), 431–433 (1972). https://doi.org/10.1109/TIT.1972.1054809
25. Gou, J., Yi, Z., Du, L., Xiong, T.: A local mean-based k-nearest centroid neighbor classifier. Comput. J. 55(9), 1058–1071 (2012). https://doi.org/10.1093/comjnl/bxr131
26. Zapata-Tapasco, A., Pérez-Londoño, S., Mora-Flórez, J.: Método basado en clasificadores kNN parametrizados con algoritmos genéticos y la estimación de la reactancia para localización de fallas en sistemas de distribución. Revista Facultad de Ingeniería Universidad de Antioquia 70, 220–232 (2014)
27. Angiulli, F.: Fast condensed nearest neighbor rule. In: Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pp. 25–32. Association for Computing Machinery, New York, NY, USA (2005). https://doi.org/10.1145/1102351.1102355
28. Jiang, Y., Zhou, Z.H.: Editing training data for kNN classifiers with neural network ensemble. In: Yin, F.L., Wang, J., Guo, C. (eds.) Advances in Neural Networks – ISNN 2004, pp. 356–361. Springer, Berlin (2004). https://doi.org/10.1007/978-3-540-28647-9_60
29. Kanj, S., Abdallah, F., Denœux, T., Tout, K.: Editing training data for multi-label classification with the k-nearest neighbor rule. Pattern Anal. Appl. 19(1), 145–161 (2016). https://doi.org/10.1007/s10044-015-0452-8
30. Blum, A.L., Langley, P.: Selection of relevant features and examples in machine learning. Artif. Intell. 97(1), 245–271 (1997). https://doi.org/10.1016/S0004-3702(97)00063-5
31. Langley, P., Sage, S.: Oblivious decision trees and abstract cases. In: Hayes-Roth, B., Korf, R. (eds.) Proceedings of the AAAI-94 Workshop on Case-Based Reasoning, vol. 12, pp. 113–117. AAAI Press, Seattle, WA (1994)
32. Bar-Hillel, A., Hertz, T., Shental, N., Weinshall, D.: Learning distance functions using equivalence relations. In: Fawcett, T., Mishra, N. (eds.) Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), p. 8. AAAI Press, Washington, DC (2003)
33. Fix, E., Hodges, J.: Nonparametric discrimination: consistency properties. In: Discriminatory Analysis, International Statistical Review, vol. 1. USAF School of Aviation Medicine, Randolph Field, TX (1951)
34. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 13(1), 21–27 (1967). https://doi.org/10.1109/TIT.1967.1053964
35. Cost, S., Salzberg, S.: A weighted nearest neighbor algorithm for learning with symbolic features. Mach. Learn. 10(1), 57–78 (1993). https://doi.org/10.1007/BF00993481
36. Chaudhuri, B.: A new definition of neighborhood of a point in multi-dimensional space. Pattern Recognit. Lett. 17(1), 11–17 (1996). https://doi.org/10.1016/0167-8655(95)00093-3



37. Mitani, Y., Hamamoto, Y.: A local mean-based nonparametric classifier. Pattern Recognit. Lett. 27(10), 1151–1159 (2006). https://doi.org/10.1016/j.patrec.2005.12.016
38. Vivencio, D.P., Hruschka, E.R., do Carmo Nicoletti, M., dos Santos, E.B., Galvao, S.D.: Feature-weighted k-nearest neighbor classifier. In: 2007 IEEE Symposium on Foundations of Computational Intelligence, pp. 481–486. IEEE, Honolulu, HI, USA (2007). https://doi.org/10.1109/FOCI.2007.371516

Speech Emotion Recognition Using Deep CNNs Trained on Log-Frequency Spectrograms

Mainak Biswas, Mridu Sahu, Maroi Agrebi, Pawan Kumar Singh, and Youakim Badr

Abstract Speech serves as the most important means of communication between humans. Every phrase a person speaks has certain emotions intertwined with it. Therefore, a natural desire would be to build a system that understands the mood and feelings of the speaker. Speech emotion detection may have many real-life applications, ranging from improving recommendation systems (which adapt to the emotion the user is experiencing) to monitoring people with chronic depression and suicidal tendencies. In this chapter, we propose a model for the recognition of emotions from speech data using log-frequency spectrograms and a deep convolutional neural network (CNN). We supplement our data with noise of varied loudness obtained from various contexts with the aim of making our model resilient to noise. The augmented data is used for the extraction of spectrograms. These spectrogram images are used to train the deep CNNs proposed in this paper. The model is independent of linguistic features, speaker-dependent features, the gender of speakers, and the intensity of the expressed emotion. This has been guaranteed by using the RAVDESS dataset, where the same sentences were spoken by 24 speakers (12 male and 12 female) with different expressions (in two levels of intensity). The model obtained an accuracy of 98.13% on this dataset. The experimental results show that our proposed model is quite capable of classifying emotions from human speech. The source code of the proposed model can be accessed using the following link: https://github.com/mainak-biswas1999/Spoken_Emotion_classification.git.

Keywords Speech emotion recognition · Log-frequency spectrogram · MFCC · Deep learning · RAVDESS

M. Biswas · P. K. Singh (B)
Department of Information Technology, Jadavpur University, Jadavpur University Second Campus, Plot No. 8, Salt Lake Bypass, LB Block, Sector III, Salt Lake City, Kolkata, West Bengal 700106, India
e-mail: [email protected]
M. Biswas
e-mail: [email protected]
M. Sahu
Department of Information Technology, National Institute of Technology, Raipur, India
e-mail: [email protected]
M. Agrebi
Department of Computer Science, Université Polytechnique Hauts-de-France, 59313 Valenciennes, France
e-mail: [email protected]
Y. Badr
Pennsylvania State University, Great Valley, 30 East Swedesford Rd, Malvern, PA 19355, USA
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
G. Rivera et al. (eds.), Innovations in Machine and Deep Learning, Studies in Big Data 134, https://doi.org/10.1007/978-3-031-40688-1_5

1 Introduction

The most common mode in which humans communicate is through speech. Verbal communication has existed since roughly 500,000 BCE (when the first forms of language emerged). Besides conveying messages, speech expresses emotions and feelings. The delivery of a person's speech reflects his/her mood, emotion, attitude towards others, and the notion he/she has about a particular topic. Loudness, timbre, and frequency may vary for a particular utterance depending on the emotion the speaker is experiencing. Fluctuations of the amplitude or frequency over time are other parameters that reflect the mood of the speaker. Humans have the capacity to understand the emotion a speaker is experiencing, even if we do not understand the language. Since humans have this capability, a desire to build an automatic emotion detection system (from speech) is natural. This task is known as Speech Emotion Recognition (SER) in the literature.

The major task here is to identify what features must be extracted to discriminate between different emotions. Some of the widely used feature extraction techniques for speech include Linear Predictive Coding (LPC) [1], Linear Predictive Cepstral Coefficients (LPCC), Perceptual Linear Predictive Coefficients (PLPC) [2], and Mel-Frequency Cepstral Coefficients (MFCC) [3, 4]. These features prove to be effective in tasks like spoken language identification, transcript generation, translation, etc., where both micro/low-level (phonetics, acoustics, etc.) and macro/high-level (morphology and overall characteristics like sentence syntax, etc.) features are essential to understand the characteristics of a language. But in the case of an emotion detection task, the global picture of the speech signal is of most importance.
This is because the same utterances can be spoken with different expressions, and we do not want our machine learning model to pick up words/phonemes as discriminative features (these are characteristics of the language, not expressions of the speaker). To capture the global picture, log-frequency spectrograms [5] are an obvious choice, because through this process the speech data (which is time-series data) is converted to an image. The images (spectrograms) portray the global characteristics of the corresponding clips. This is expected to perform better, as neural networks like convolutional neural networks (CNNs) would be able to capture these global differences instead of the phonetics that time-series data and recurrent neural networks (RNNs) would capture. Thus, an efficient emotion detection model should be able to capture the relevant features required for emotion detection from



an utterance in any adverse environment (it should not be affected by noise) [6, 7]. Log-frequency spectrograms model acoustic features by plotting the magnitudes of the frequencies present in the speech signal as a function of time. This helps us identify regions of silence and fluctuations in amplitude and frequency, which is instrumental in the task of emotion detection [8]. Thus, macro-level features are more important in this task, and CNNs trained on spectrograms appear to be an ideal solution. Although the above theory appears attractive, this model will only be useful if conventional systems like RNNs, Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRU) trained with features like MFCC fail in this task [9]. Therefore, this paper also shows experimentally that the latter methods do not perform as well as the former one (discussed in Sect. 2.1). Besides being able to understand emotions, the model must be robust to external noise. Otherwise, the model may fail when deployed (as surrounding noise is an integral part of our daily life). Therefore, we augmented our data with a combination of noises encountered in day-to-day life. Augmentation not only increased our training and testing data but also made the model extremely robust to noise. It has been observed that the model was not affected by the number of speakers, the gender of speakers, or the intensity with which they were expressing themselves (there were 12 male and 12 female speakers expressing the same emotion in two intensities). The highlighting features of our proposed methodology are as follows:
1. Accurate emotion recognition: The proposed model achieves a high accuracy of 98.13% on the RAVDESS dataset, indicating its effectiveness in classifying emotions from speech.
2. Language-independent approach: By using log-frequency spectrograms, the model focuses on capturing the global attitude and expression of the speaker rather than relying on language-specific features.
This allows the model to work across different languages.
3. Robustness to noise: The augmentation of the speech data with noise helps the model become resilient to the noise present in real-world scenarios, enhancing its applicability in practical settings.
4. Independence from speaker characteristics: The model is designed to be independent of the number of speakers, their gender, and the intensity of the expressed emotion. This adds to its versatility and generalizability.

The remainder of the paper is organized as follows. Section 2 provides a brief summary of research works in the SER domain, followed by a brief explanation of our motivation and the contributions of our proposed work. Section 3 presents an overview of the proposed methodology, covering the augmentation of the speech data by adding background noise and the process of extracting log-frequency spectrograms, and explains how to read and understand them. Section 4 describes the proposed architecture of the deep CNN model used. Section 5 reports the results obtained on testing the model. Finally, Sect. 6 concludes the work along with some future scope of work.
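To make the preprocessing pipeline concrete, the following NumPy sketch illustrates the two steps described above: mixing background noise into a clip at a chosen signal-to-noise ratio, and turning the result into a log-magnitude spectrogram image. It is a simplified stand-in (plain FFT frames on a linear frequency axis rather than the log-frequency scale used in the chapter), and all names and parameter values are ours, not those of the proposed model:

```python
import numpy as np

def add_noise(clip, noise, snr_db=10.0):
    """Mix background noise into `clip` at the target signal-to-noise ratio
    (in dB), mimicking the augmentation step."""
    noise = np.resize(noise, clip.shape)          # tile/trim noise to clip length
    gain = np.sqrt(np.mean(clip ** 2) /
                   (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return clip + gain * noise

def log_spectrogram(clip, frame=256, hop=128, eps=1e-10):
    """Log-magnitude spectrogram: windowed FFT magnitudes on a dB-like
    scale, one column per frame (shape: freq_bins x time_frames)."""
    window = np.hanning(frame)
    frames = [clip[i:i + frame] * window
              for i in range(0, len(clip) - frame + 1, hop)]
    mags = np.abs(np.fft.rfft(np.stack(frames), axis=1))
    return 20 * np.log10(mags + eps).T
```

An image like this, computed per augmented clip, is what the deep CNN would consume as input.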



2 Literature Survey

Emotion detection from speech has been a well-explored area of research [10]. Time-series features (like MFCC), which are commonly the choice for audio data, have been used to train machine learning models. Spoken language identification is a similar problem and has been widely researched too. In this section, we mention some previous works related to SER that have been done by researchers.

The work described in [11] extracted two different sets of features from the audio clips for the task of speech emotion recognition (SER) and reported the performance achieved. First, a 42-dimensional feature vector is used. It has 39 Mel-Frequency Cepstral Coefficient components (13 each of MFCC, ΔMFCC, and ΔΔMFCC), and one component each from the Zero Crossing Rate (ZCR), Harmonic to Noise Rate (HNR), and Teager Energy Operator (TEO). Second, the authors proposed an auto-encoder method for the selection of relevant parameters from the previously extracted feature vector. The approach uses a support vector machine (SVM) (with different similarity functions) as the classifier. The experiments are conducted on the Ryerson Multimedia Lab (RML) dataset. The best performance with the former set of features is achieved by using an SVM with the radial basis function (RBF) as the kernel, which obtained an accuracy of 64.19% on the dataset (with six different emotions). After feature selection, the SVM (with RBF) achieved an accuracy of 74.07%.

Kerkeni et al. [12] study several models that can be used in the task of SER and compare them. They also propose a solution based on the combination of these approaches. They use features like modulation spectral features (MSFs) and MFCC to train a variety of machine-learning models, including RNNs, SVMs, and multivariate linear regression (MLR), and report the performance of these classifiers when trained on MFCC, MSF, and a combination of MFCC and MSF.
The best result is obtained by an RNN trained on the combination of MFCC and MSF features, which achieved an accuracy of 90.05% on the Spanish emotion dataset and 82.41% on the Berlin emotion dataset.

Avots et al. [13] attempted to use both verbal and nonverbal communication channels to develop a method that more clearly and understandably expresses the emotional state. Their work offers audio-visual information analysis to identify human emotions. Three separate databases (SAVEE, eNTERFACE'05, and RML) are used as the training set, and AFEW (a database replicating real-world settings) as the testing set for a cross-corpus evaluation. MFCC features are extracted from the voice signal, and the SVM classifier is employed for classification. Faces in key frames are located using the Viola–Jones face detection algorithm, and facial image mood is categorized with AlexNet. On the RML dataset, an accuracy of 69.3% is attained.

Lech et al. [14] study the effect of reduced speech bandwidth and the µ-law companding procedure used in transmission systems on the accuracy of speech emotion recognition, discussing real-time speech emotion recognition in detail. They use the pre-trained AlexNet for this recognition task, achieving an accuracy of 82% on the Berlin Emotional Speech (EMO-DB) database (which has seven different emotions). It



observed that decreasing the sampling frequency from 16 to 8 kHz reduced the accuracy of the system by 3.3%.

An and Ruan's [15] in-depth analysis of the algorithms already in use in SER revealed that they suffer from issues such as low utilization of human-crafted features, basic feature extraction methods, large model complexity, and poor recognition accuracy for particular emotions. Using additive Gaussian white noise, their experiment quadrupled the RAVDESS dataset, producing a dataset of 5760 audio snippets. A network structure consisting of two parallel CNNs has been employed to extract spatial features, while a transformer encoder network has been used to extract temporal characteristics. An accuracy of 80.46% on the hold-out test set is achieved on the RAVDESS dataset.

Tang et al. [16] proposed a SER system that is capable of learning sufficiently long temporal dependencies in speech signals. They have suggested a novel end-to-end neural network architecture based on the idea of dilated causal convolution with context stacking to achieve this goal. The model can be processed in parallel because it only has layers that can be parallelized; therefore, it keeps computational costs reasonably low. It also introduces a context-stacking structure that helps the model exploit long-term temporal dependencies, which provides an alternative to RNN layers and does away with their lack of parallelism. The authors demonstrated that the suggested model greatly improves SER performance (on the RECOLA dataset) while only requiring a third of the model parameters used in the state-of-the-art model.

Jiang et al. [17] propose a novel deep neural architecture to extract informative feature representations from heterogeneous acoustic feature groups containing low-level acoustic details, such as voice quality, speech rate, MFCC, and LPCC, and high-level feature details extracted using neural networks.
When fused together without any analysis, these features may contain redundant and unrelated information, leading to low emotion recognition performance. For improved performance, these features were fused to obtain a set of highly informative features, and non-redundant features were selected. The authors defined a fusion network to learn a discriminative acoustic feature representation and used a Support Vector Machine (SVM) for the recognition task. Experimental results on the IEMOCAP dataset demonstrate that the proposed architecture improved recognition performance (achieving an accuracy of 64%). Two types of characteristics (MFCCs and formant frequencies extracted using LPCC) were combined by Manchala et al. [18]; ten languages were then classified using GMM, with a maximum accuracy of 98.8%. Gupta et al. [19] analyzed the effectiveness of SVM and Random Forest as classifiers and suggested a method based on LPC and MFCC characteristics. For six Indian languages, it was found that using both attributes together produces the best results (achieving an accuracy of 92.6%). Similarly, SVM and LDA classifiers are compared by Anjana and Poorna [20]. They use MFCC and formant feature vectors from the IIIT-H database to categorize seven Indian languages. SVM is outperformed by the LDA classifier, which obtains a classification accuracy of 93.88%.


M. Biswas et al.

For language identification, Sarthak et al. [21] suggest a two-dimensional CNN with attention and a bi-dimensional GRU model with dropout regularization (rate of 0.1). The model is trained and tested using log-Mel spectrogram images as input. The research shows how raw, unprocessed audio waveforms may function well as features for identifying spoken languages. When trained and tested on six languages (German, Italian, French, Russian, Spanish, and English) taken from the VoxForge dataset, the model's accuracy is 95.4%; even without the Russian and Italian languages, it produced 96.3% accuracy. Recently, in order to process and recognize speech emotions, Kumar et al. [22] suggest using two attention-based deep learning approaches: a CNN-LSTM model and a Vision Transformer model based on the Mel spectrogram. They achieved a best accuracy of 88.50% on the RAVDESS dataset; however, the authors do not test their models on other standard speech recognition datasets. Rudregowda et al. [23] develop a visual speech recognition dataset for the Kannada language. The dataset is benchmarked using different deep learning models such as Bi-LSTM, ResNet-LSTM, and VGG-16, attaining a highest accuracy of 91.90% on the proposed dataset. However, the dataset is limited to the Kannada language only.

2.1 Motivation

When we have a classification problem with speech signals (audio files) as the training data, the first models that come to mind are recurrent neural networks such as RNN [24], LSTM [25], and GRU [26]. These networks were invented to achieve end-to-end learning of time-series data, and they generally perform very well whenever the order of the data matters (as in speech). Therefore, our first approach was to model this problem using LSTMs and GRUs. The first 13 components of the MFCC features [27] were used as the features; in other words, if there are T samples in a training clip, the corresponding training example has dimension (T × 13). The spectrogram-CNN model (proposed in Sect. 3) and this experiment have an identical number of examples in the train and test sets. Table 1 clearly shows that recurrent neural networks trained on MFCC time-series features do not perform particularly well in classifying all the different emotions present in the RAVDESS dataset. Several research papers (mentioned in Sect. 2) agree with this observation. RNNs may have failed to produce the desired result due to their tendency to capture phonetics that are typical of languages. Since (in the dataset) the same sentences were spoken with different tones, timbre, and expressions (corresponding to different emotions), the recurrent networks may have picked up phonetics corresponding to particular words in the sentences, thereby confusing the classifier and degrading the classification accuracy [28, 29]. The poor performance of the recurrent networks serves as the motivation for building a model that performs better at identifying the emotion depicted by the speaker. The intuition that led us to build the proposed model is the idea that emotion is an overall characteristic of speech and does not depend a lot upon


Table 1 Comparison of the performances of basic RNN, LSTM, and GRU models for identifying human emotions depicted through a particular clip. The neural networks use MFCC features as the time series for training

| Neural network used (model) | Features used        | Test accuracy (in %) (8 different emotions) |
|-----------------------------|----------------------|---------------------------------------------|
| Basic RNN                   | MFCC (13 components) | 61.50                                       |
| LSTM                        | MFCC (13 components) | 71.40                                       |
| GRU                         | MFCC (13 components) | 72.30                                       |

the phonetics or the sequence in which phones appear (which are typically related to language). This demands a model that captures the attitude of the speaker by relating a set of features that are independent of language. Thus, the spectrogram (described in Sect. 3.3) corresponding to an audio clip is the perfect candidate for this task.

2.2 Contributions

The contributions of the proposed SER work can be summarized as follows:

• We have plotted the spectrograms of the clips in the dataset and used them to learn a deep CNN for the task of emotion identification from human speech. This proposes a general framework for all emotion detection tasks (from speech signals), independent of language and other phonetic-dependent features.
• Our model has been trained and tested on the RAVDESS dataset (for emotion detection from speech).
• As the data in the RAVDESS dataset is not large, we had to augment the data. Before the training and testing phases, we added a large variety of noises (heard commonly in day-to-day life) to the audio clips (increasing the data 11 folds). Despite that, the results obtained are not affected, which proves that this model is robust to noise.
• The model appears to be independent of the number of speakers and the intensity of the emotion. There are 24 speakers (12 female and 12 male actors), each speaking the same set of sentences in 8 distinct emotions, with each emotion having two intensity levels ('strong' and 'normal'). Our model seems to be independent of all these variables.
• Our model performed better than all other state-of-the-art models when we compared its performance to that of other studies on the same dataset.

Figure 1 shows a block diagram of our proposed model. The following three steps are followed in implementing this model:

1. Extract log-frequency spectrograms from the audio data. In doing so, we obtain an image matrix (corresponding to the spectrogram) of dimension (1025 × 160) for each clip.


Fig. 1 Block diagram of the model proposed in this paper

2. Train a deep CNN model using these spectrogram images.
3. Test the performance of the proposed model.

3 Proposed Methodology

Figure 2 shows the methodology followed in this work. Firstly, we take raw speech data and augment it with noise. This increases the size of the dataset considerably, which is beneficial for both the training and testing phases; augmentation not only increases the training data but also makes the model more robust to noise. In the second phase, we extract the log-frequency spectrogram (as discussed in Sect. 3.2) from the audio files. This gives us an image matrix (of size 1025 × 160) for every clip. After the spectrogram matrices are extracted (and normalized), they are randomly grouped into two disjoint sets (train and test sets). Finally, we use the train set to train a deep CNN (described in Sect. 4) and report its performance on the test set. As mentioned earlier, the aim of this work is to build a model that ignores temporal dependence in speech, preventing it from picking up words or phrases as distinctive features; instead, the model should be able to understand the overall attitude of the speaker. As with many deep learning models, humans serve as the major inspiration for building this model: humans can understand the emotion of a speaker irrespective of whether they know the language. For example, the utterance of an angry speaker is quite different from that of a speaker who is overjoyed or sad. The model appears to be independent of the number of speakers, temporal dependence, gender of the speakers, and the level of intensity in which the emotion is depicted. The results obtained (mentioned in Sect. 5) agree with this claim.


Fig. 2 A flowchart representing the procedure followed in this work

3.1 Data Augmentation

Data augmentation is the act of adding slightly altered copies of already existing data to increase the quantity and variety of data. The dataset has a maximum of 1440 clips, each running over 3–5 s, and each clip serves as a single training/test example. Clearly, 1440 is quite a small number for training and testing a deep CNN; hence, the data needs to be augmented. The original audio samples have been enhanced by the addition of real-world noise. For this task, 50 distinct noise clips from the realms of the house (doorbells, air conditioners, etc.), the natural world (thunderstorms, cats, dogs, etc.), and the street (cars, building sites, etc.) are employed. To simulate real-world sounds, they are blended at random and added to the original audio; for instance, we grouped the sounds of a ceiling fan and a doorbell because they fall under the same category. A variety of noises have been combined, and single noise clips have also been used without combining them. Using random combinations, as mentioned above, we generate ten different instances of noise. Let N_i, ∀i ∈ {0, 1, 2, 3, 4, 5} (where N_0 is the zero vector), be a certain noise clip added to an audio clip A, and let E_A and E_N be the average energies of A and N_i, respectively. A random number s (between 0.02 and 0.2) is used as a scaling factor to scale down the amplitude of the noise. Then, noise N_i is added to A in order to get the corresponding A_i, ∀i ∈ {0, 1, 2, 3, 4, 5}, by using the following:

A_i = A + s · √(E_A / E_N) · N_i    (1)


From a single audio sample, the system produces six training instances (the original and five augmented). As a result, this form of data augmentation multiplies the amount of data by six. The method increases the size of the dataset, which is very helpful during training, and makes the model more noise-resistant, which is essential for real-world emotion detection systems.
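The energy-normalized mixing of Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the signals here are random placeholders, and the energy is taken as the mean squared amplitude.

```python
import numpy as np

def add_noise(clip: np.ndarray, noise: np.ndarray, s: float) -> np.ndarray:
    """Mix a noise clip into an audio clip following Eq. (1):
    A_i = A + s * sqrt(E_A / E_N) * N_i, with E_A, E_N the average energies."""
    e_a = np.mean(clip ** 2)        # average energy of the speech clip
    e_n = np.mean(noise ** 2)       # average energy of the noise clip
    if e_n == 0.0:                  # N_0 is the zero vector -> A is unchanged
        return clip.copy()
    return clip + s * np.sqrt(e_a / e_n) * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(48_000)      # 1 s of audio at 48 kHz (placeholder)
noise = 0.5 * rng.standard_normal(48_000)
s = rng.uniform(0.02, 0.2)                # scaling factor range from the text
augmented = add_noise(speech, noise, s)
```

Because the noise is rescaled by the energy ratio before the factor s is applied, a quiet noise clip is not drowned out by a loud speech clip and vice versa.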

3.2 Extraction of Log-Frequency Spectrograms

After augmentation, the audio clips are ready for spectrogram extraction, and a spectrogram is extracted for each clip. Spectrograms are one of the best ways to adequately convert a speech signal into a visual representation. They differ from features calculated by simply taking the Fast Fourier Transform [30] of a speech signal: in features of the latter type, the sequence of the speech signal (in the time domain) is not preserved after taking the transform (which is now in the frequency domain). A spectrogram, on the other hand, preserves the sequence of the signal while extracting the same frequencies an FFT would. It plots the magnitude of each frequency (using color codes) along the y-axis and time along the x-axis. Thus, at a time instant t, the vertical line x = t gives a vector containing the amplitudes of the frequencies present in the speech signal at that instant. The next subsection discusses how we can read and understand a spectrogram.

3.3 Motivation Behind Using Spectrograms

To convince ourselves that spectrograms would be reasonable for the task of emotion detection from speech, we plotted the spectrograms of some clips from different classes of emotion. It was evident (as shown in Fig. 3) that every emotion has a specific form of spectrogram: the characteristics of the spectrograms are considerably different for the eight distinct emotions in the dataset. Besides having distinct visual characteristics, the objective of capturing the global attitude, expression, and emotion of the speaker is also achieved. Moreover, we no longer need to work with time-series data and instead have a graphical representation of the audio that can be used for end-to-end learning, i.e., we can feed a CNN with the spectrograms and their corresponding labels and expect it to capture features that discriminate amongst the classes.

3.4 Log-Frequency Spectrogram Extraction

Here, we follow the process depicted in Fig. 4. On receiving a speech signal, pre-emphasis is done first, followed by the framing and windowing phase; Biswas et al. [27] discuss these steps in detail.

Fig. 3 Spectrograms obtained from audio clips portraying the eight distinct emotions in the RAVDESS dataset

Fig. 4 The algorithm involved in the extraction of spectrogram features from speech signals

At time instant t, upon using a window of size t_0 s (which is shifted by a fixed hop at every step), we get an amplitude vector f(t) with (t_0 × f_s) components, where f_s is the sampling frequency. The N-point Fast Fourier Transform [29] of the input signal f(t) is computed to obtain F(s) (in the frequency domain). Since the frequency components of F(s) are complex numbers, their amplitudes are computed and squared to give P(s), i.e., the power spectrum of the input signal. Formally, if z is a component of F(s), its corresponding component in P(s) is given by z*·z, where z* is the complex conjugate of z. This is done for all the windows (shifted by the fixed hop in every step), and all the power spectrum vectors are vertically stacked to get a 2D power spectrum matrix M. On obtaining M, all its entries are mapped between 0 and 1, i.e., M is normalized by dividing by its maximum value. The final log-frequency spectrogram I_spect is obtained by taking the logarithm of all the entries of the normalized M (as human sensitivity to sound energy is logarithmic). The I_spect of all the input audio clips have been used to train a deep CNN, which will, in turn, learn to distinguish between different speech emotions.

For plotting purposes, a color code is generated to represent a particular log-frequency amplitude. For example, in Fig. 3, the yellow/white color represents higher amplitudes, and darker shades of violet represent lower amplitudes. Note that this step is required only for displaying the spectrogram; when training a neural network, we may directly use I_spect, which reduces computation time and memory requirements.
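The pipeline above (frame, window, N-point FFT, power spectrum, normalize, log) can be sketched directly in NumPy. This is an illustrative sketch, not the paper's librosa-based code: the Hann window, the 2048-point FFT (which yields the 1025 frequency bins seen in the paper's spectrograms), and the small epsilon guarding log(0) are our assumptions.

```python
import numpy as np

def log_frequency_spectrogram(signal, fs, win_s=0.025, hop_s=0.010,
                              n_fft=2048, eps=1e-10):
    """Frame the signal, window each frame, take an N-point FFT, square the
    magnitudes to get the power spectrum P(s) = z* . z, normalize the stacked
    matrix M to [0, 1], and take the logarithm to get I_spect."""
    win = int(win_s * fs)                # samples per window (t_0 * f_s)
    hop = int(hop_s * fs)                # window shift per step
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * np.hanning(win)   # windowing
        spec = np.fft.rfft(frame, n=n_fft)                    # F(s)
        frames.append((spec * spec.conj()).real)              # P(s)
    m = np.stack(frames, axis=1)         # frequency bins x time frames
    m = m / m.max()                      # normalize entries to [0, 1]
    return np.log(m + eps)               # log-frequency spectrogram I_spect

rng = np.random.default_rng(1)
clip = rng.standard_normal(48_000)       # 1 s placeholder clip at 48 kHz
spect = log_frequency_spectrogram(clip, fs=48_000)
```

With n_fft = 2048, each column has 2048/2 + 1 = 1025 entries, matching the 1025-row spectrogram images used as CNN input.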


3.5 Understanding What a Spectrogram Conveys

As mentioned in Sect. 3.3, the color of a pixel in the spectrogram represents the magnitude of a particular frequency at a particular instant (yellow/white represents higher amplitudes, and violet represents lower amplitudes). Colors are used instead of the actual magnitudes to make the spectrogram comprehensible to the human eye. For example, in all the spectrograms of Fig. 3, black pixels are dominant at the beginning and the end, which signifies that these are regions of silence (zero amplitude for all frequencies). The yellow and white regions in the spectrograms indicate the frequencies where the relative amplitude is large, and the distribution of these pixels conveys the attitude in which the speech was delivered. A calm expression (Fig. 3a) therefore has a uniform amplitude across frequencies, i.e., there are no sudden fluctuations of frequencies, which would be a sign of anxiety. Major fluctuations of frequencies are indeed seen in Fig. 3f, as the speaker was afraid and anxious. For all these reasons, the CNN is expected to differentiate the classes based on the spectrograms.

Note: The input spectrograms must first be normalized (by dividing by the maximum magnitude in the entire dataset) before being fed to the neural network.

4 The Deep Convolutional Neural Network

4.1 Architecture

The architecture of the deep CNN has been inspired by the dimensions of the input spectrogram images (1025 × 160), which are very large. This demands a CNN that not only captures the features in the image that distinguish the classes (like edges) but also reduces the size of the data over the layers of the CNN. This helps the deep layers have a lower number of inputs, which in turn reduces the number of parameters, accelerates the training phase, and makes the model memory efficient. To fulfill the objectives mentioned above, we use strided convolution [31] with stride length s > 1, which reduces a dimension by a factor of roughly s. Although the strategy is effective, we face a serious practical problem: if we use the same stride s along the horizontal and the vertical dimensions, the horizontal dimension will shrink rapidly, and information will be lost. Therefore, we cannot use the same stride length s; we need to take large strides along the vertical dimension and short strides along the horizontal dimension, which finally makes the dimensions comparable. The stride s is thus defined as a vector (S_V, S_H), where S_V and S_H represent the strides taken by the filter along the vertical and horizontal dimensions, respectively (while convolving with the image). The filters in the earlier layers of a convolutional neural network typically learn weights that are known to extract micro-level features in an image (like edges).


As we go deeper, a CNN recognizes more complex features. It is intuitive that spectrograms contain fewer complex features than an image of a human being or an animal, so it is logical to start with a CNN that is not too deep but is able to reduce the first two dimensions of the input image and learn discriminative features to separate the classes. Therefore, it might be optimal to use only two convolutional layers and two max pooling layers [32] before feeding the flattened output to a dense layer. Having two sets of convolutional and max pooling layers helps us compress the input image gradually. This achieves higher compression (in turn reducing the number of parameters in the dense layers) and helps the CNN learn more subtle discriminative patterns in the spectrograms (compared to a CNN with only a single set). The output of the final max pooling layer (as shown in Fig. 5) is flattened to obtain a vector X ∈ ℝ^3264. X is fed to a fully connected layer with 256 neurons, which is further connected to the output layer. To prevent the neural network from overfitting, a 20% dropout is applied at both fully connected layers [33].

Let the volume V^[i−1], the input to a convolutional/max pooling layer (i) in the CNN, have dimensions (n_H^[i−1] × n_W^[i−1] × n_c^[i−1]) (representing height, width, and number of channels, respectively). The trainable parameters of the convolutional layers are the filters. For convolutional layer L_CV^[i], ∀i ∈ {1, 2}, we define square filters F_CV^[i](j), ∀j ∈ {1, 2, …, n_CV^[i]}, with dimensions (f_CV^[i] × f_CV^[i] × n_c^[i−1]), where n_CV^[i] is the number of filters in the layer (and all the filters have the same dimension). The output of this convolutional layer has dimensions (n_H^[i] × n_W^[i] × n_c^[i]). The strides vector S^(type)[i] is defined as (S_V^(type)[i], S_H^(type)[i]), ∀type ∈ {CV, MP}. The relation between the dimensions of the input and output volumes is given by Eqs. 2a–c:

n_H^[i] = ⌊(n_H^[i−1] − f_CV^[i]) / S_V^(CV)[i]⌋ + 1    (2a)

n_W^[i] = ⌊(n_W^[i−1] − f_CV^[i]) / S_H^(CV)[i]⌋ + 1    (2b)

n_c^[i] = n_CV^[i]    (2c)

Fig. 5 The architecture of the deep convolutional neural network used, showing the dimensions of the input, the hidden volumes, the filters, the dense layers, and the output layer (provided in Table 2). The input to the deep CNN is a 1025 × 160 × 1 spectrogram image (of the input clip), which passes through two convolutional layers and two max pooling layers to give the feature vector X = (X_1, X_2, …, X_3264) (upon flattening the last volume). The feature vector is passed through dense layers to get the output vector Ŷ, which contains the probabilities corresponding to the labels; the maximum entry of Ŷ gives the most probable label for an input

The max pooling layer L_MP^[i] is defined similarly. The job of the max pooling layer L_MP^[i] is to select the maximum value in the submatrix covered by its filter of size (f_MP^[i] × f_MP^[i]) (there is only one channel of this filter). Therefore, there are no trainable parameters, and the number of channels of the input and output volumes is identical, i.e., n_c^[i] = n_c^[i−1]. The change in the other dimensions is identical to Eqs. 2a and 2b (with S^(MP)[i] used instead of S^(CV)[i], and f_MP^[i] in place of f_CV^[i]).

For the fully connected layers, we define the layer L^[i], ∀i ∈ {1, 2}, having l^[i] units (neurons). Clearly, L^[1] is the hidden layer and L^[2] is the output layer. Each layer outputs an activation vector A^[i] when provided an input A^[i−1] (the components of an activation vector are defined as a_j^[i], ∀j ∈ {1, 2, …, l^[i]}). Therefore, the output of a layer is the input to the next layer. Clearly, A^[0] is synonymous with the feature vector X, and A^[2] with Ŷ. For layer L^[i], we define parameters W^[i] and b^[i]; W^[i] has dimension (l^[i] × l^[i−1]), and b^[i] is an l^[i]-dimensional vector. Table 2 contains all the information relevant to the neural network, and Fig. 5 displays the design of the deep CNN that has been employed. On both the train and test sets, the proposed architecture achieved near-perfect accuracy; therefore, it was decided that adding more hidden layers is not essential.
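As a sanity check, the output-size relation of Eqs. 2a–b (the standard floor-plus-one formula for unpadded filtering) can be applied layer by layer to reproduce the volume sizes listed in Table 2. The sketch below uses the filter sizes and stride vectors from the table.

```python
from math import floor

def out_dim(n: int, f: int, s: int) -> int:
    # Eq. 2a/2b: output size for a filter of size f moved with stride s (no padding)
    return floor((n - f) / s) + 1

def layer(h, w, f, sv, sh):
    return out_dim(h, f, sv), out_dim(w, f, sh)

h, w = 1025, 160                  # input spectrogram dimensions
h, w = layer(h, w, 5, 5, 3)       # conv 1: 5x5 filter, strides (5, 3) -> (205, 52)
h, w = layer(h, w, 2, 2, 2)       # max pool 1: 2x2, strides (2, 2)    -> (102, 26)
h, w = layer(h, w, 3, 3, 2)       # conv 2: 3x3 filter, strides (3, 2) -> (34, 12)
h, w = layer(h, w, 2, 2, 2)       # max pool 2: 2x2, strides (2, 2)    -> (17, 6)
flattened = h * w * 32            # 32 channels after conv 2 -> 3264 features
```

Note how the asymmetric strides (5, 3) and (3, 2) keep the two spatial dimensions comparable as the volume shrinks, exactly as motivated above.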

4.2 Training

4.2.1 Forward Propagation

There are m training examples (spectrograms) available. (I^(j), Y^(j)), ∀j ∈ {1, 2, …, m}, are the labeled training examples, where I^(j) represents the spectrogram image of the jth training example and Y^(j) is its output vector. Y^(j) is defined as a one-hot vector:

Y_k^(j) = 1 if k is the required label, 0 otherwise.    (3)
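The one-hot encoding of Eq. (3) can be sketched as follows; the eight-class setup matches the RAVDESS emotions used later, while the zero-based label indexing is our assumption.

```python
import numpy as np

def one_hot(label: int, n_classes: int = 8) -> np.ndarray:
    """Eq. (3): Y_k = 1 if k is the required label, 0 otherwise."""
    y = np.zeros(n_classes)
    y[label] = 1.0
    return y

y = one_hot(3)   # a clip whose true class index is 3
```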

Table 2 Details of the neural network used: the types of layers, dimensions of filters, trainable parameters, activation functions used in each layer, etc. The neural network inputs a spectrogram image of dimension (1025, 160, 1) and outputs the corresponding output vector Ŷ, effectively classifying the input into an emotion class

| Layer    | Type of layer       | Input dimension            | Filter dimension, number                          | Strides (S_V, S_H) | Parameter dimension (W^[i], b^[i]) | Trainable parameters    | Activation function | Output dimension | Dropout (in %) |
|----------|---------------------|----------------------------|---------------------------------------------------|--------------------|------------------------------------|-------------------------|---------------------|------------------|----------------|
| L_CV^[1] | Convolutional layer | Spectrogram (1025, 160, 1) | Filter: (5, 5); n_CV^[1] = 16; b_CV^[1]: (16, 1)  | (5, 3)             | –                                  | (5 × 5 × 16 + 16) = 416 | –                   | (205, 52, 16)    | –              |
| L_MP^[1] | Max pooling layer   | (205, 52, 16)              | (2, 2)                                            | (2, 2)             | –                                  | –                       | –                   | (102, 26, 16)    | –              |
| L_CV^[2] | Convolutional layer | (102, 26, 16)              | Filter: (3, 3, 16); n_CV^[2] = 32; b_CV^[2]: (32, 1) | (3, 2)          | –                                  | 4640                    | –                   | (34, 12, 32)     | –              |
| L_MP^[2] | Max pooling layer   | (34, 12, 32)               | (2, 2)                                            | (2, 2)             | –                                  | –                       | –                   | (17, 6, 32)      | –              |
| Flatten  | Flattening layer    | (17, 6, 32)                | –                                                 | –                  | –                                  | –                       | –                   | (3264, 1)        | –              |
| L^[1]    | Dense layer         | (3264, 1)                  | –                                                 | –                  | (256, 3264), (256, 1)              | 835,840                 | ReLU                | (256, 1)         | 20%            |
| L^[2]    | Output layer        | (256, 1)                   | –                                                 | –                  | (8, 256), (8, 1)                   | 2056                    | Softmax             | (8, 1)           | 20%            |
| Total    |                     |                            |                                                   |                    |                                    | 842,952                 |                     |                  |                |
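The trainable-parameter counts in Table 2 can be reproduced directly. This small sketch simply applies the usual counting rules (weights plus one bias per filter or unit); the helper names are ours.

```python
def conv_params(f: int, c_in: int, n_filters: int) -> int:
    # f*f*c_in weights per filter, plus one bias per filter
    return f * f * c_in * n_filters + n_filters

def dense_params(n_in: int, n_out: int) -> int:
    # weight matrix (n_out x n_in) plus bias vector of length n_out
    return n_out * n_in + n_out

conv1 = conv_params(5, 1, 16)        # first convolutional layer
conv2 = conv_params(3, 16, 32)       # second convolutional layer
dense1 = dense_params(3264, 256)     # hidden dense layer
dense2 = dense_params(256, 8)        # output layer
total = conv1 + conv2 + dense1 + dense2
```

The max pooling and flattening layers contribute nothing, so the total is the sum of the four rows above.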

The convolutional layer L_CV^[l] converts the volume V^[l−1] to the volume V^[l]. The pixel value V^[l](i_0, j_0, k_0), where k_0 is the channel number corresponding to the filter F_CV^[l](k_0) with dimensions (f_CV^[l] × f_CV^[l] × n_c^[l−1]) and scalar bias b_{k_0}^[l], is computed using the following weighted sum:

V^[l](i_0, j_0, k_0) = b_{k_0}^[l] + Σ_{k=0}^{n_c^[l−1]−1} Σ_{j=0}^{f_CV^[l]−1} Σ_{i=0}^{f_CV^[l]−1} V^[l−1](i_0 + i, j_0 + j, k) · F_CV^[l](k_0)(i, j, k)    (4)

Note: (i_0 + f_CV^[l] − 1) and (j_0 + f_CV^[l] − 1) must not exceed the dimensional limits of the volume V^[l−1]; the same holds for the max pooling layer. The changes in dimension have already been described in Sect. 4.1.

The max pooling layer L_MP^[l] converts the volume V^[l−1] to the volume V^[l]. The pixel value V^[l](i_0, j_0, k_0) is computed using the filter F_MP^[l] with dimensions (f_MP^[l] × f_MP^[l]), which finds the maximum value in the submatrix:

V^[l](i_0, j_0, k_0) = max_{0 ≤ i, j < f_MP^[l]} V^[l−1](i_0 + i, j_0 + j, k_0)    (5)

Before feeding a volume to a dense layer, the 3D volume is flattened to a feature vector X. This is done by the flattening layer, which takes a volume and stacks the rows one after the other (the channels are taken in order while flattening). Now, X acts as A^[0] for the dense layers. For dense layer L^[i], ∀i ∈ {1, 2}, the parameters (W^[i] and b^[i]), the input provided (A^[i−1]), and the output generated (A^[i]) are related as follows (the same calculation is done for all the training examples):

A^[i] = g^[i](W^[i] A^[i−1] + b^[i])    (6)

A^[0] = X and A^[2] = Ŷ    (7, 8)

where g^[i] is the activation function associated with the layer L^[i]; the activation functions used in each layer are shown in Table 2. Ŷ^(j) is the output generated by the neural network for an input j, where Ŷ_k^(j) ∈ [0, 1], ∀k ∈ {1, 2, …, 8}, represents the probability corresponding to the kth label (i.e., the conditional probability of label k given the input):

Ŷ_k^(j) = P(X^(j) ∈ class k | X^(j))    (9)

The label corresponding to the maximum entry of the vector Ŷ^(j) is the decision taken by the neural network.
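The dense head of Eqs. (6)–(9) can be sketched in NumPy. This is an illustrative forward pass only: the weights here are random placeholders (the real ones are learned), and the max-subtraction in the softmax is a standard numerical-stability trick not spelled out in the text.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())          # shift by the max for numerical stability
    return e / e.sum()

def forward_dense(x, w1, b1, w2, b2):
    """Eqs. (6)-(8): A[1] = ReLU(W[1] A[0] + b[1]);
    Yhat = A[2] = softmax(W[2] A[1] + b[2])."""
    a1 = relu(w1 @ x + b1)
    return softmax(w2 @ a1 + b2)

rng = np.random.default_rng(2)
x = rng.standard_normal(3264)                       # flattened feature vector X
w1, b1 = 0.01 * rng.standard_normal((256, 3264)), np.zeros(256)
w2, b2 = 0.01 * rng.standard_normal((8, 256)), np.zeros(8)
y_hat = forward_dense(x, w1, b1, w2, b2)            # probabilities over 8 emotions
label = int(np.argmax(y_hat))                       # Eq. (9): most probable class
```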

4.2.2 Initialization

All parameters W^[i], b^[i], ∀i ∈ {1, 2}, have been initialized by the Xavier initialization method [34, 35].

4.2.3

Cost Function, Mini-Batch Size, and Back-Propagation

Gradient descent is carried out by computing gradients based on mini-batches of batch size b (we had chosen b = 64 as the batch size). The cost function is J defined as follows for each batch: 1 − b i=1 b

J=

nclasses(=8) 



Y j(i) loglogY j(i)

(10)

j=1

Now, partial derivatives of J are computed with respect to all the parameters using back-propagation. Adam optimization [36] of gradient descent is used for learning the parameters. Parameters P have been updated as follows: P = P −α

∂J ∂P

(11)

The learning rate, α is an important hyperparameter that needs to be tuned. We used α = 10−3 and a decay of 10−5 [37], and there was not any oscillation which is clear from Fig. 6. The decay is used to define the amount of decrease in α after every epoch. The learning rate is gradually decreased to prevent oscillation during gradient descent. Fig. 6 Plot showing the value of the cost function versus the number of epochs when α = 10−3 . As there is no oscillation it is an optimal value of the learning rate. For the model to learn better (and avoid oscillations when gradient descent is performed), a decay rate of 10−5 is used
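The mini-batch cross-entropy of Eq. (10) can be sketched as below. The small epsilon inside the logarithm is our addition to guard against log(0); the toy batch is a placeholder.

```python
import numpy as np

def batch_cost(y_true, y_pred, eps=1e-12):
    """Eq. (10): mean categorical cross-entropy over a mini-batch.
    y_true: (b, 8) one-hot rows; y_pred: (b, 8) predicted probabilities."""
    b = y_true.shape[0]
    return -np.sum(y_true * np.log(y_pred + eps)) / b

y_true = np.eye(8)[[0, 3, 5, 7]]          # batch of 4 one-hot labels
cost_perfect = batch_cost(y_true, y_true)  # perfect prediction -> near-zero cost
uniform = np.full((4, 8), 1.0 / 8.0)
cost_uniform = batch_cost(y_true, uniform)  # uniform prediction -> log(8) per example
```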


5 Observations

5.1 Dataset Used

We have used the RAVDESS dataset to build and gauge the performance of our model. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [38] contains a total of 7356 files (including audio-only, audio-visual, and video-only files), with a total size of approximately 24.8 GB. The audio-only speech files have been used in this work to both train and test the proposed model. These files (215 MB in total) contain 24 professional actors (12 female, 12 male) vocalizing two lexically matched statements in a neutral North American accent. The utterances include 'neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'surprise', and 'disgust' expressions. Each expression is produced at two levels of emotional intensity ('normal', 'strong'), except the 'neutral' expression. Thus, we have 192 examples for each expression (except 'neutral') and 96 examples for the neutral expression (1440 clips in total). Each clip runs over 3–5 s, and the files are 16-bit, 48 kHz *.wav files.
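The clip counts follow from the dataset's structure. The two repetitions per condition are not stated above and are inferred from the totals (96 neutral clips, 192 per other emotion), so treat that factor as an assumption.

```python
# Clip counts in the RAVDESS speech subset.
actors = 24
statements = 2         # two lexically matched statements
repetitions = 2        # inferred from the stated totals (assumption)
intensities = 2        # 'normal' and 'strong' (every emotion except 'neutral')

neutral = actors * statements * repetitions                       # one intensity only
per_other_emotion = actors * statements * repetitions * intensities
total = neutral + 7 * per_other_emotion
```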

5.2 Performance Metrics Used

If a clip C_k belongs to class p and is predicted to belong to class p̂, then the counts for p and p̂ are updated as follows: if p = p̂, then TP_p = TP_p + 1 (TP means true positive); otherwise, FP_p̂ = FP_p̂ + 1 and FN_p = FN_p + 1 (FP means false positive, FN means false negative), while the true-negative (TN) counts of the remaining classes increase accordingly.

(i) Accuracy:

A (in %) = (Number of correctly predicted examples / Total number of examples) × 100%    (12)

(ii) Precision:

P_p = TP_p / (TP_p + FP_p)    (13)

(iii) Recall:

R_p = TP_p / (TP_p + FN_p)    (14)

(iv) F1 Score: It combines recall and precision into a single metric. It is defined as:

F1 = 2PR / (P + R)    (15)

In a multi-class classification problem, the precision, recall, and F1 score have to be defined for every class.

(v) Confusion Matrix: It is a p × p matrix, with CM_ij indicating how many times a test example of class i is predicted to be a clip of class j. Therefore, Σ_i CM_ii (the sum of the diagonal) gives the total number of correct predictions.
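The per-class metrics of Eqs. (13)–(15) and the confusion matrix of (v) can be sketched together; the three-class toy labels below are placeholders, and the guard against division by zero is our addition.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes):
    """Build the confusion matrix CM_ij and derive per-class precision,
    recall (Eqs. 13-14), F1 (Eq. 15), and overall accuracy (Eq. 12)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                        # true class t predicted as p
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                 # predicted as the class but wrong
    fn = cm.sum(axis=1) - tp                 # belong to the class but missed
    precision = tp / np.maximum(tp + fp, 1)  # guard empty denominators
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / cm.sum()
    return cm, precision, recall, f1, accuracy

# toy example with 3 classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm, prec, rec, f1, acc = per_class_metrics(y_true, y_pred, 3)
```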

5.3 Results Obtained

The implementation was done in Python 3.6. The experiments were carried out on Google Colaboratory with 12.69 GB of memory (RAM) and the GPU that Colab provides for training and testing neural networks. The log-frequency spectrograms are extracted using Python's librosa (version 0.8.1) library [39]. The neural network is implemented using TensorFlow 1.6.0 [40] and Keras 2.5.0 [41] and is trained and tested on Google Colab's GPU. The clips are sampled as follows:

1. Sampling rate f_s: 44,100 Hz; therefore, the largest frequency present in the audio signals is (44,100 × 0.5) = 22,050 Hz, i.e., the Nyquist frequency.
2. Minimum clip size: 2.94 s
3. Maximum clip size: 5.28 s
4. Average clip size: 3.70 s
5. Window size (for spectrogram computation): 25 ms, with the window shifted by 10 ms.

For training a CNN, the dimensions of the spectrograms need to be consistent. Therefore, prior to the extraction of spectrograms, all the clips are converted to a runtime of 3.70 s (the average clip size in the dataset) by padding smaller clips with zeros at the end and truncating larger clips. The spectrograms (with identical dimensions) are divided into two disjoint sets (train and test); the deep CNN is trained on the train set, and its performance is reported on the test set. We have used all 1440 speech clips in the RAVDESS dataset. Each clip is augmented to produce six training examples (as mentioned in Sect. 3.1), so a dataset of 8640 clips is obtained; the duration of the augmented dataset is approximately 8.11 h. After extraction of the spectrograms, they are divided into train and test sets as shown in Table 3.

The training accuracy achieved is 99.87%, which shows that the model fits the training data very well. Further, a test accuracy of 98.13% (736 correctly
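The pad-or-truncate step that fixes every clip to a 3.70 s runtime can be sketched as follows (a minimal version of the preprocessing described above; the example clip lengths are the dataset's stated minimum and maximum).

```python
import numpy as np

def fix_length(clip: np.ndarray, fs: int, target_s: float = 3.70) -> np.ndarray:
    """Pad shorter clips with zeros at the end and truncate longer clips so
    every clip has the same 3.70 s runtime before spectrogram extraction."""
    target = int(round(target_s * fs))
    if len(clip) >= target:
        return clip[:target]                       # truncate larger clips
    return np.pad(clip, (0, target - len(clip)))   # zero-pad smaller clips

fs = 44_100
short = np.ones(int(round(2.94 * fs)))    # shortest clip in the dataset
long = np.ones(int(round(5.28 * fs)))     # longest clip in the dataset
a, b = fix_length(short, fs), fix_length(long, fs)
```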


Table 3 The number of clips (per emotion class) obtained from the RAVDESS dataset (after augmentation) for training and testing Sl.No

Emotion

Number of clips (before augmentation)

Number of clips (after Number of augmentation) train examples

Number of test examples

1

Neutral

96

576

522

54

2

Calm

192

1152

1060

92

3

Happy

192

1152

1062

90

4

Sad

192

1152

1048

104

5

Angry

192

1152

1058

94

6

Fearful

192

1152

1042

110

7

Disgust

192

1152

1042

110

8

Surprised

192

1152

1056

96

1440

8640

7890

750

Total
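The preprocessing described above — fixing every clip to 3.70 s, then sliding a 25 ms window shifted by 10 ms — can be sketched as follows. This is an illustrative NumPy version, not the chapter's librosa code, and a plain magnitude STFT stands in for the log-frequency transform:

```python
import numpy as np

SR = 44100                 # sampling rate (Hz)
TARGET = int(3.70 * SR)    # every clip fixed to the average duration
WIN = int(0.025 * SR)      # 25 ms analysis window
HOP = int(0.010 * SR)      # 10 ms window shift

def fix_length(y):
    """Zero-pad short clips at the end; truncate long ones."""
    return y[:TARGET] if len(y) >= TARGET else np.pad(y, (0, TARGET - len(y)))

def spectrogram(y):
    """Magnitude spectrogram from a Hann-windowed STFT (frames x freq bins)."""
    frames = [y[i:i + WIN] * np.hanning(WIN)
              for i in range(0, len(y) - WIN + 1, HOP)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

# A short (2.94 s) and a long (5.28 s) clip end up with identical dimensions,
# which is what the CNN requires.
short = spectrogram(fix_length(np.random.randn(int(2.94 * SR))))
long_ = spectrogram(fix_length(np.random.randn(int(5.28 * SR))))
```

Because every padded/truncated clip has the same number of samples, every spectrogram has the same frame count, so the whole augmented dataset can be stacked into one tensor for training.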

predicted clips out of 750) illustrates that the proposed model generalizes well to unseen data. Tables 4 and 5 contain the performance results and the confusion matrix produced by the proposed model, respectively.

Table 4 Performance evaluation of the proposed model for identifying eight emotion classes from speech data of the RAVDESS dataset

Sl.No  Emotion    Test clips  Precision (%)  Recall (%)  F1 score (%)
1      Neutral     54         100.00          96.30       98.12
2      Calm        92          97.87         100.00       98.92
3      Happy       90          97.82         100.00       98.90
4      Sad        104          96.08          94.23       95.15
5      Angry       94         100.00         100.00      100.00
6      Fearful    110         100.00         100.00      100.00
7      Disgust    110         100.00          94.55       97.20
8      Surprised   96          94.12         100.00       96.97
Total             750          98.24          98.13       98.16

Table 5 Confusion matrix obtained by the proposed model on the RAVDESS dataset (rows: actual class, columns: predicted class)

           Neutral  Calm  Happy  Sad  Angry  Fearful  Disgust  Surprised
Neutral      52      0     0      0    0      0        0        2
Calm          0     92     0      0    0      0        0        0
Happy         0      0    90      0    0      0        0        0
Sad           0      2     0     98    0      0        0        4
Angry         0      0     0      0   94      0        0        0
Fearful       0      0     0      0    0    110        0        0
Disgust       0      0     2      4    0      0      104        0
Surprised     0      0     0      0    0      0        0       96

The classifier achieved 100% recall on clips expressing the emotions 'calm', 'happy', 'angry', 'fearful', and 'surprised': it never misclassifies a test example belonging to any of these classes. This may be because these emotions are the least correlated, i.e., the expression of a 'happy' person is completely different from that of an 'angry' one, and the same holds for the 'calm', 'fearful', and 'surprised' emotions. The classifier does make mistakes on clips belonging to the 'neutral', 'sad', and 'disgust' classes. The fact that all the emotions (except 'neutral') were recorded at two intensities may have played a role in misguiding the classifier. A few clips from the 'disgust' class were identified by the classifier as belonging to the 'sad' class, possibly because a 'moderate' disgust expression is strongly correlated with a 'sad' expression; similarly, a 'strong' expression of sadness may sound like being 'surprised'. On the other hand, the skewed nature of the 'neutral' class (having half the number of clips compared to the others) may have contributed to its misclassification. Despite a few errors (which may be due to ambiguity of expression), the classifier performed very well, achieving an accuracy of 98.13%. The performance suggests that the deep CNN model trained on log-frequency spectrograms is highly accurate and effective at the task of emotion detection from speech data.
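The reported figures can be checked directly against the confusion matrix of Table 5 (a quick NumPy verification, not part of the original chapter):

```python
import numpy as np

# Confusion matrix from Table 5 (rows = actual, cols = predicted); class order:
# neutral, calm, happy, sad, angry, fearful, disgust, surprised.
cm = np.array([
    [52,  0,  0,  0,  0,   0,   0,  2],
    [ 0, 92,  0,  0,  0,   0,   0,  0],
    [ 0,  0, 90,  0,  0,   0,   0,  0],
    [ 0,  2,  0, 98,  0,   0,   0,  4],
    [ 0,  0,  0,  0, 94,   0,   0,  0],
    [ 0,  0,  0,  0,  0, 110,   0,  0],
    [ 0,  0,  2,  4,  0,   0, 104,  0],
    [ 0,  0,  0,  0,  0,   0,   0, 96],
])
correct = np.diag(cm).sum()              # 736 correctly predicted clips
accuracy = correct / cm.sum()            # 736 / 750 = 98.13%
neutral_recall = cm[0, 0] / cm[0].sum()  # 52 / 54 = 96.30%
```

The diagonal sums to 736 out of 750 test clips, reproducing the 98.13% test accuracy, and the 'neutral' row gives the 96.30% recall of Table 4.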

5.4 Comparison Study

In the present work, we compared our proposed model with other recent speech emotion recognition works on the RAVDESS dataset; the comparison results are shown in Table 6. It can be observed that our proposed deep CNN with log-frequency spectrograms outperforms all the previous works on the speech emotion recognition problem for the RAVDESS dataset.

Table 6 Performance comparison of our proposed model with recently proposed speech emotion recognition works for the RAVDESS dataset

Authors [Refs.]        Year of publication  Model used                                           Accuracy (%)
Yadav et al. [42]      2020                 CNN and Bi-LSTM                                      73
Mustaqeem et al. [43]  2020                 Spectrograms and CNN                                 77.02
Kanwal et al. [44]     2021                 Feature optimization with genetic algorithm          82.50
Proposed work          2023                 Customized deep CNN with log-frequency spectrograms  98.16

6 Conclusion

This work proposes a model for emotion detection from human speech that can capture the global attitude/expression of a speaker, and shows how this model outperforms conventional recurrent neural networks trained on time-series data. It describes the conversion of an audio clip into a visual representation called a spectrogram, and proposes a deep CNN architecture trained on the extracted spectrograms. The performance of the model was tested on the RAVDESS dataset, where a test accuracy of 98.13% was obtained. Moreover, the proposed model is shown to be robust to background noise. Its performance is apparently independent of the number and gender of the speakers, and it can identify the emotion of a speaker even if the degree of the expressed emotion varies.

Although the proposed model outperforms conventional models, it was evaluated on a dataset containing only one language. As mentioned before, humans can decipher emotions in speech even if they do not know the language; therefore, an ideal emotion detection model should be largely language-independent. This remains a challenging goal for future work. The chapter also showed that recurrent networks trained on MFCC features (used as time series) are not good enough for this task; an interesting problem for future work would be to find recurrent architectures that perform better.

References

1. Makhoul, J.: Linear prediction: a tutorial review. Proc. IEEE 63(4), 561–580 (1975). https://doi.org/10.1109/PROC.1975.9792
2. Hermansky, H.: Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87(4), 1738–1752 (1990). https://doi.org/10.1121/1.399423
3. Mermelstein, P.: Distance measures for speech recognition, psychological and instrumental. Pattern Recogn. Artif. Intell. 374–388 (1976)
4. Davis, S., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980). https://doi.org/10.1109/TASSP.1980.1163420
5. Jin, F., Sattar, F., Krishnan, S.: Log-frequency spectrogram for respiratory sound monitoring. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2012). https://doi.org/10.1109/ICASSP.2012.6287954
6. Dey, A., Chattopadhyay, S., Singh, P.K., Ahmadian, A., Ferrara, M., Sarkar, R.: A hybrid meta-heuristic feature selection method using golden ratio and equilibrium optimization algorithms for speech emotion recognition. IEEE Access 8, 200953–200970 (2020). https://doi.org/10.1109/ACCESS.2020.3035531
7. Sahoo, K.K., Dutta, I., Ijaz, M.F., Woźniak, M., Singh, P.K.: TLEFuzzyNet: fuzzy rank-based ensemble of transfer learning models for emotion recognition from human speeches. IEEE Access 9, 166518–166530 (2021). https://doi.org/10.1109/ACCESS.2021.3135658
8. Biswas, M., Rahaman, S., Ahmadian, A., Subari, K., Singh, P.K.: Automatic spoken language identification using MFCC based time series features. Multimed. Tools Appl. 1–31 (2022). https://doi.org/10.1007/s11042-021-11439-1
9. Garain, A., Singh, P.K., Sarkar, R.: FuzzyGCP: a deep learning architecture for automatic spoken language identification from speech signals. Expert Syst. Appl. 168, 114416 (2021). https://doi.org/10.1016/j.eswa.2020.114416
10. Pazos-Rangel, R.A., Florencia-Juarez, R., Paredes-Valverde, M.A., Rivera, G. (eds.): Handbook of Research on Natural Language Processing and Smart Service Systems. IGI Global (2021). https://doi.org/10.4018/978-1-7998-4730-4
11. Aouani, H., Ayed, Y.B.: Speech emotion recognition with deep learning. Procedia Comput. Sci. 176, 251–260 (2020). https://doi.org/10.1016/j.procs.2020.08.027
12. Kerkeni, L., Serrestou, Y., Mbarki, M., Raoof, K., Mahjoub, M.: Speech emotion recognition: methods and cases study (2018). https://doi.org/10.5220/0006611601750182
13. Avots, E., Sapiński, T., Bachmann, M., Kamińska, D.: Audiovisual emotion recognition in wild. Mach. Vis. Appl. 30(5), 975–985 (2019). https://doi.org/10.1007/s00138-018-0960-9
14. Lech, M., Stolar, M., Best, C., Bolia, R.: Real-time speech emotion recognition using a pre-trained image classification network: effects of bandwidth reduction and companding. Front. Comput. Sci. 2, 14 (2020). https://doi.org/10.3389/fcomp.2020.00014
15. An, X.D., Ruan, Z.: Speech emotion recognition algorithm based on deep learning algorithm fusion of temporal and spatial features. J. Phys: Conf. Ser. 1861(1), 012064 (2021). https://doi.org/10.1088/1742-6596/1861/1/012064
16. Tang, D., Kuppens, P., Geurts, L., van Waterschoot, T.: End-to-end speech emotion recognition using a novel context-stacking dilated convolution neural network. EURASIP J. Audio Speech Music Process. 2021(1), 18 (2021). https://doi.org/10.1186/s13636-021-00208-5
17. Jiang, W., Wang, Z., Jin, J.S., Han, X., Li, C.: Speech emotion recognition with heterogeneous feature unification of deep neural network. Sensors 19(12), 2730 (2019). https://doi.org/10.3390/s19122730
18. Manchala, S., Prasad, V.K., Janaki, V.: GMM based language identification system using robust features. Int. J. Speech Technol. 17(2), 99–105 (2014). https://doi.org/10.1007/s10772-013-9209-1
19. Gupta, M., Bharti, S.S., Agarwal, S.: Implicit language identification system based on random forest and support vector machine for speech. In: 2017 4th International Conference on Power, Control & Embedded Systems (ICPCES), pp. 1–6 (2017). https://doi.org/10.1109/ICPCES.2017.8117624
20. Anjana, J.S., Poorna, S.S.: Language identification from speech features using SVM and LDA. In: 2018 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), pp. 1–4 (2018). https://doi.org/10.1109/WiSPNET.2018.8538638
21. Sarthak, Shukla, S., Mittal, G.: Spoken language identification using ConvNets. In: Lecture Notes in Computer Science, vol. 11912, pp. 252–265 (2019). https://doi.org/10.1007/978-3-030-34255-5_17
22. Kumar, C.S.A., Maharana, A.D., Krishnan, S.M., Hanuma, S.S.S., Lal, G.J., Ravi, V.: Speech emotion recognition using CNN-LSTM and vision transformer. In: Abraham, A., Bajaj, A., Gandhi, N., Madureira, A.M., Kahraman, C. (eds.) Innovations in Bio-inspired Computing and Applications. IBICA 2022. Lecture Notes in Networks and Systems, vol. 649. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-27499-2_8
23. Rudregowda, S., Patil Kulkarni, S., HL, G., Ravi, V., Krichen, M.: Visual speech recognition for Kannada language using VGG16 convolutional neural network. Acoustics 5(1), 343–353 (2023). https://doi.org/10.3390/acoustics5010020
24. Lipton, Z.: A critical review of recurrent neural networks for sequence learning (2015). https://doi.org/10.48550/arXiv.1506.00019
25. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
26. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling, pp. 1–9 (2014). https://doi.org/10.48550/arXiv.1412.3555
27. Biswas, M., Rahaman, S., Kundu, S., Singh, P.K., Sarkar, R.: Spoken language identification of Indian languages using MFCC features. In: Kumar, P., Singh, A.K. (eds.) Machine Learning for Intelligent Multimedia Analytics: Techniques and Applications, pp. 249–272. Springer, Singapore (2021). https://doi.org/10.1007/978-981-15-9492-2_12
28. Garain, A., Ray, B., Giampaolo, F., Velasquez, J.D., Singh, P.K., Sarkar, R.: GRaNN: feature selection with golden ratio-aided neural network for emotion, gender and speaker identification from voice signals. Neural Comput. Appl. (2022). https://doi.org/10.1007/s00521-022-07261-x
29. Marik, A., Chattopadhyay, S., Singh, P.K.: A hybrid deep feature selection framework for emotion recognition from human speeches. Multimed. Tools Appl. 82, 11461–11487 (2023). https://doi.org/10.1007/s11042-022-14052-y
30. Strang, G.: Linear Algebra and Its Applications, 4th edn, pp. 211–221 (Chapter 3.5) (n.d.)
31. Ayachi, R., Afif, M., Said, Y., Atri, M.: Strided convolution instead of max pooling for memory efficiency of convolutional neural networks, pp. 234–243 (2020). https://doi.org/10.1007/978-3-030-21005-2_23
32. Albawi, S., Mohammed, T.A., Al-Zawi, S.: Understanding of a convolutional neural network. In: 2017 International Conference on Engineering and Technology (ICET), pp. 1–6 (2017). https://doi.org/10.1109/ICEngTechnol.2017.8308186
33. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014)
34. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. J. Mach. Learn. Res. 9, 249–256 (2010)
35. Kumar, S.K.: On weight initialization in deep neural networks, pp. 1–9 (2017). https://doi.org/10.48550/arXiv.1704.08863
36. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations (ICLR 2015), Conference Track Proceedings, pp. 1–15 (2015). https://doi.org/10.48550/arXiv.1412.6980
37. You, K., Long, M., Wang, J., Jordan, M.I.: How does learning rate decay help modern neural networks? (2019). https://doi.org/10.48550/arXiv.1908.01878
38. Livingstone, S.R., Russo, F.A.: The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English. PLoS One 13(5), e0196391 (2018). https://doi.org/10.1371/journal.pone.0196391
39. McFee, B., Lostanlen, V., McVicar, M., Metsai, A., Balke, S., Thomé, C., Raffel, C., Malek, A., Lee, D., Zalkow, F., Lee, K., Nieto, O., Mason, J., Ellis, D., Yamamoto, R., Seyfarth, S., Battenberg, E., Morozov, B., Bittner, R., et al.: librosa/librosa (2020). https://doi.org/10.5281/ZENODO.3606573
40. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Kaiser, L., Kudlur, M., Levenberg, J., Zheng, X.: TensorFlow: large-scale machine learning on heterogeneous distributed systems (2015). https://doi.org/10.48550/arXiv.1603.04467
41. Chollet, F.: Keras (2015). Accessed 13 May 2023. https://github.com/fchollet/keras
42. Yadav, A., Vishwakarma, D.K.: A multilingual framework of CNN and Bi-LSTM for emotion classification. In: 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp. 1–6 (2020). https://doi.org/10.1109/ICCCNT49239.2020.9225614
43. Mustaqeem, Sajjad, M., Kwon, S.: Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM. IEEE Access 8, 79861–79875 (2020). https://doi.org/10.1109/ACCESS.2020.2990405
44. Kanwal, S., Asghar, S.: Speech emotion recognition using clustering based GA-optimized feature set. IEEE Access 9, 125830–125842 (2021). https://doi.org/10.1109/ACCESS.2021.3111659

Text Classifier of Sensationalist Headlines in Spanish Using BERT-Based Models

Heber Jesús González Esparza, Rogelio Florencia, José David Díaz Román, and Alejandra Mendoza-Carreón

Abstract Information technologies play a crucial role in keeping society informed during global events like pandemics. However, sensationalist headlines can negatively impact public perception and trust in institutions. In this chapter, several BERT-based text classifiers were developed to classify sensationalist and non-sensationalist health-related headlines in Spanish. The models were fine-tuned on almost 2000 headlines from major Mexican newspapers, achieving up to 94% F1-Score and accuracy. This demonstrates the effectiveness of machine learning techniques in detecting sensationalism in news headlines.

Keywords Natural language processing · Machine learning · Deep learning · Sensationalism · Short-text classification

H. J. González Esparza · R. Florencia (B) · J. D. Díaz Román · A. Mendoza-Carreón
División Multidisciplinaria de Ciudad Universitaria, Universidad Autónoma de Ciudad Juárez, Av. José de Jesús Macías Delgado #18100, C.P. 32000, Ciudad Juárez, Chihuahua, México
e-mail: [email protected]
H. J. González Esparza
e-mail: [email protected]
J. D. Díaz Román
e-mail: [email protected]
A. Mendoza-Carreón
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. G. Rivera et al. (eds.), Innovations in Machine and Deep Learning, Studies in Big Data 134, https://doi.org/10.1007/978-3-031-40688-1_6

1 Introduction

Global pandemics are among the most significant challenges society faces. One of the most important roles when a health crisis occurs is to inform and educate the public on ways of mitigating it [1]. Still, several studies suggest that news coverage of health-related issues does not always pursue this objective [2–5].

In recent years, with more competition than ever, there has been a growing trend among news outlets of packaging news articles in headlines that attempt to capture


people's attention. This inclination has sometimes been pushed to the limit, resulting in headlines that are inaccurate, misleading, or overly emotional. A sensationalist headline emphasizes elements that could provoke emotional responses rather than focusing on factual information that is valuable to readers [6]. This presents a problem because headlines can significantly shape the reader's worldview [7]. When dealing with health-related issues, for example, an emotional or misleading headline can change people's decision-making and their trust in institutions [8, 9].

One of the most important branches of computer science is Natural Language Processing (NLP). Some NLP techniques take advantage of Machine Learning (ML) algorithms, which can be faster and cheaper than manual approaches [10]. This makes ML models very effective tools for tasks involving human languages, such as sentiment analysis or text classification [10]. Bidirectional Encoder Representations from Transformers (BERT) is a simple and powerful language representation model that can be used for many different NLP tasks, like text classification [11]. BERT is one of the most popular deep learning-based language models and has been described as a 'quantum leap' in the artificial intelligence and NLP fields [12]. Other models based on its architecture have also been proposed, improving performance in some scenarios.

Only a few projects in Spanish have taken advantage of NLP techniques for automatically classifying news headlines, and even fewer deal with sensationalism. This chapter presents three text classifiers for detecting sensationalist headlines in Spanish. These models were generated using several BERT-based models and fine-tuned with data collected and labeled manually for this project.

This chapter is structured as follows: Sect. 2 describes the theoretical background of the project. Other projects with similar objectives and comparable approaches are presented in Sect. 3. Section 4 describes the process of collecting, labeling, and analyzing the data used to train and validate the models; it also describes the methods used to build and fit the models. The performance of the classifiers can be found in Sect. 5. Finally, Sect. 6 concludes the chapter by describing the strengths and limitations of the project.


2 Background

This Section defines crucial concepts necessary for a better understanding of the problem and, therefore, of the proposed solution. Section 2.1 describes the usage of the term 'sensationalism' in various contexts and the way it can be a harmful kind of journalism. Section 2.2 presents a brief explanation of BERT-based models.

2.1 Sensationalism

When dealing with problems as big as global pandemics, news coverage of these subjects becomes a critical element for solving them properly. The emotional tone of news stories can be a decisive factor in influencing people's risk perceptions, attitudes, and behaviors towards health-related topics [8]. The current market-driven state of mass media has led to health-related news being exaggerated or 'sensationalized', aiming to attract and hold the public's attention.

The term 'sensationalism' is used in many different contexts. Some authors define it as a type of journalism aimed at the popular classes, where violence, pornography, and tragedy are the norm; the most important thing for a sensationalist journalist is to gain the reader's attention, even if the facts themselves are compromised [13, 14]. The term is also used to describe the kind of journalism that uses stylistic techniques to provoke emotions, including excitement, fear, and astonishment, as well as the 'discursive strategy' of packaging information into headlines with the intention of making the news look more interesting, extraordinary, or relevant [6].

Given that the definition of 'sensationalism' depends heavily on context, a more concise description was needed. To limit the scope of the term, a list of criteria was compiled from the opinions of several authors [5, 6, 15–18]. A headline is considered sensationalist if it meets at least one of the following:

• Omit words that can represent uncertainty (such as 'could' or 'might').
• Dramatize the information by implying that it was previously hidden from the public (e.g., 'Are we about to witness the end of Britain?').
• Make use of superlatives or other extreme words to exaggerate the headline (such as 'miracle' or 'revolutionary').
• Make use of vivid metaphors to gain the reader's attention (e.g., referencing zombies in the case of an infectious outbreak).
• Make use of capital letters to emphasize words that can gain the reader's attention (such as 'HISTORIA' in Spanish or 'HISTORY' in English).
• Make use of narratives where two groups are opposed to each other (e.g., 'them' versus 'us').


• Make use of words that are likely to alarm people even when they are unnecessary, such as (in Spanish): 'Apocalyptic' ('Apocalíptico'), 'Emergency' ('Emergencia'), 'Alarm' ('Alarma'), 'Crisis' ('Crisis'), 'Chaos' ('Caos'), etc.

As a disclaimer, none of the characteristics listed above should be considered definitive. The idea of which headlines can be regarded as sensationalist is far from universally accepted. All the criteria written here are subject to verification, as the classification of a sensationalist headline depends heavily on the perception and opinion of each reader.
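As an illustration only (not the chapter's method), two of the criteria above that are mechanically checkable — all-caps emphasis and alarm vocabulary — can be turned into a crude rule-based baseline. The word list here is a hypothetical subset of the alarm terms mentioned:

```python
import re

# Hypothetical subset of the alarm vocabulary listed above (Spanish terms).
ALARM_WORDS = {"apocalíptico", "emergencia", "alarma", "crisis", "caos"}

def looks_sensationalist(headline):
    """Flag a headline that uses all-caps emphasis or alarm vocabulary.
    A learned classifier replaces this crude heuristic in the chapter."""
    words = re.findall(r"\w+", headline)
    all_caps = any(len(w) > 3 and w.isupper() for w in words)
    alarm = any(w.lower() in ALARM_WORDS for w in words)
    return all_caps or alarm

print(looks_sensationalist("CAOS total por la nueva variante"))            # True
print(looks_sensationalist("Se registran 100 nuevos contagios en la ciudad"))  # False
```

A baseline like this makes the limitation of hand-written rules obvious — it cannot score metaphors, omitted hedges, or 'us versus them' framing — which motivates the learned BERT-based approach.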

2.2 BERT-Based Models

BERT is a very popular deep-learning model developed by Google that is based on a neural network architecture known as the Transformer [11]. The Transformer, first described in [19], combines two different concepts, Convolutional Neural Networks (CNN) and the attention mechanism, making Transformer-based models able to learn contextual relationships between words in a text. Transformers were developed as an alternative to Recurrent Neural Networks (RNN) for NLP tasks, primarily because of RNNs' inability to parallelize work and their poor performance on long sentences [20]. Although Transformers were first intended for translation tasks, several studies have shown that many other NLP tasks, such as text classification, sentiment analysis, emotion recognition, and spam detection, can be performed with very good results using the Transformer architecture, and more specifically the BERT model [12, 21, 22].

One of the key features of BERT is its ability to process text bidirectionally: it considers the context of a word in relation to the words that precede and follow it, allowing it to capture the full meaning and context of a sentence. This contrasts with traditional NLP models, which only consider the context of a word in relation to the words that precede it.

The BERT framework involves two steps: pre-training and fine-tuning. For pre-training the English version of BERT, BookCorpus and the English Wikipedia were used as the corpus (around 3,300M words combined), enabling it to grasp language patterns. This is done with two unsupervised tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). In MLM, the model must predict the 15% of the words of a sentence that were randomly masked at the beginning of the process; this characteristic is what allows BERT to learn a bidirectional representation of the sentence. In the NSP task, the model takes two masked sentences as input and must predict whether they were adjacent in the original text. For fine-tuning, task-specific inputs are needed to get better results on the target NLP task. In this case, the fine-tuning step was carried out using the sensationalist headlines dataset presented in this chapter.
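The MLM masking step can be sketched as follows. This is a simplification: real BERT replaces only 80% of the selected tokens with [MASK], keeps 10% unchanged, and swaps 10% for random tokens.

```python
import random

def mask_for_mlm(tokens, mask_rate=0.15, seed=0):
    """Select ~15% of token positions and replace them with [MASK];
    return the corrupted sequence and the targets the model must predict."""
    rng = random.Random(seed)
    masked = list(tokens)
    n = max(1, round(len(tokens) * mask_rate))
    targets = {}
    for i in rng.sample(range(len(tokens)), n):
        targets[i] = masked[i]      # remember the original token
        masked[i] = "[MASK]"
    return masked, targets

tokens = "the model learns a bidirectional representation of the sentence".split()
masked, targets = mask_for_mlm(tokens)
```

During pre-training, the loss is computed only over the stored target positions, so the model is forced to reconstruct a word from both its left and right context.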


Fig. 1 Diagram representing the BERT architecture

When first released in 2018, BERT had two versions, BERT-Base and BERT-Large, both with 'uncased' and 'cased' variants (in the uncased version, 'John Smith' becomes 'john smith'; in the cased version it does not). BERT-Large uses a 24-layer model with 16 attention heads, whereas BERT-Base's model is 'only' 12 layers with 12 attention heads.

Like the original Transformer encoder, BERT takes a series of words as input. Each input must start with the special token [CLS], and sentences are separated by [SEP]. Each layer of encoders applies the self-attention mechanism before passing its results to the next encoder. This generates a vector of size hidden_size (768 in BERT-Base), as seen in Fig. 1. The resulting vector can then be used in many different NLP tasks, including text classification.

Many other models based on the BERT architecture have been proposed to improve performance on specific tasks. In this chapter, experimentation was carried out using three pre-trained models based on BERT:

• Multilingual BERT-Base [23]: Pre-trained on the top 104 languages with the largest Wikipedias (including several Romance languages, such as French, Portuguese, and Spanish), it delivers slightly worse results than the single-language models of BERT (such as the English or Chinese versions). Everything previously stated about BERT is also true for this model, as it is only a multilingual version.
• BETO-Base (Spanish pre-trained BERT model) [24]: A variant of BERT-Base pre-trained on a large Spanish corpus using the Whole-Word Masking (WWM) technique. In some cases, it has provided better results than the multilingual version of BERT [25].
• XLM-RoBERTa-Base (Robustly Optimized BERT Pre-training Approach) [26]: A multilingual model pre-trained on 100 different languages in a self-supervised way [27]. It is based on RoBERTa and XLM. RoBERTa is a variation of the BERT model that tries to optimize it by using more data, larger batches, and longer training than its predecessor [28]. On the other hand, XLM


is an extension of BERT with additional cross-lingual objectives such as Translation Language Modeling (TLM). This allows the model to be trained on parallel sentences in different languages as well as on monolingual sentences in each language.
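The input format described earlier — [CLS] at the start, [SEP] after each sentence — can be sketched like this (a real tokenizer would first split words into WordPiece sub-tokens, so the whitespace split here is an illustrative assumption):

```python
def build_bert_input(sentence_a, sentence_b=None):
    """Assemble the token sequence a BERT-style model expects:
    [CLS] tok ... tok [SEP] (then tok ... tok [SEP] for a sentence pair)."""
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"]
    # Segment ids tell the model which sentence each token belongs to.
    segments = [0] * len(tokens)
    if sentence_b is not None:
        extra = sentence_b.split() + ["[SEP]"]
        tokens += extra
        segments += [1] * len(extra)
    return tokens, segments

tokens, segments = build_bert_input("titular sensacionalista", "titular informativo")
```

For single-sentence classification, as with the headlines in this chapter, only the first form is needed; the final hidden vector at the [CLS] position is what a classification head reads.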

3 Related Work

This Section presents other projects with goals similar to those described in this chapter. Most NLP efforts over the years have focused on the English language; for this reason, only a few advancements in this area have been accomplished in other languages, like Mandarin or Spanish [29–32]. Still, several projects tackle text classification in Spanish with ML techniques such as Bag of Words or Word2Vec, and some even use a pre-trained BERT-based model to solve their problems.

In [32], sensitive data in several clinical datasets in Spanish was automatically anonymized. They used two datasets, both comprising plain text containing clinical narratives with manual annotations of sensitive information. To generate the model, they used the Base Multilingual Cased version of BERT and the PyTorch library. Following two different experiments, each with different tasks, F1-Scores between 0.925 and 0.979 were obtained.

Another project that used BERT to solve a natural-language problem can be found in [31]. It consisted of a sentiment analysis task on comments from the Google Play Store. For fine-tuning, 15,985 comments were gathered from the 15 most downloaded education apps on the store, such as Google Classroom and Duolingo. The generated model reached a 0.8 F1-Score when classifying between negative, neutral, and positive comments.

Also, in [33], aggressiveness in Mexican social media was detected using the same three models as the text classifiers presented in this chapter: Multilingual BERT, XLM-RoBERTa, and BETO. Their model was fine-tuned using the MEXA3T dataset, which consists of more than 7,000 tweets written by Mexican Spanish speakers (2,110 of which were labeled as aggressive), as well as the OffensEval [34] data and the HatEval [35] Spanish subset for some experimentation. Using a test set of 3,143 elements, their best model achieved an F1-Score of 79%.

Even though those projects do not tackle the same problem as the one described in this chapter, solutions that do can be found for the English language. In [36], a model was built for quantifying the scientific quality and sensationalism of news records mentioning pandemics. The sensationalism of news records was measured with a tool that identifies sensationalism through surveys and focus groups, which was also partially used in this project [6]. They built a maximum entropy model using a random sample of 500 news records as a training set and then applied the regression to 10,000 randomly selected news records that mentioned pandemics.


The resulting model reached an accuracy of 73% when scoring sensationalism on a testing set of 200 records.

Another project that tackles the same problem was developed during the 2017 edition of Google's annual program, Summer of Code. In this solution [37], a non-linear Support Vector Machine (SVM) model was built. It used features such as punctuation counts, average sentence length, and the number of capital letters in a news record to classify it as sensationalist or non-sensationalist. Using a dataset of 16,000 elements, an F1-Score of 0.82 was achieved through 5-fold cross-validation when training the model only on headlines.
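The handcrafted features named above (punctuation counts, average sentence length, capital-letter count) are straightforward to extract. A sketch — the function and feature ordering are our own, not those of the cited SVM project:

```python
import re
import string

def headline_features(text):
    """Punctuation count, average sentence length (in words), and
    capital-letter count: features like those of the SVM baseline above."""
    punct = sum(ch in string.punctuation for ch in text)
    caps = sum(ch.isupper() for ch in text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_len = (sum(len(s.split()) for s in sentences) / len(sentences)
               if sentences else 0.0)
    return [punct, avg_len, caps]

features = headline_features("SHOCKING! Nobody expected this. Read now")
```

Vectors like this can be fed to any off-the-shelf classifier; the contrast with the BERT-based approach of this chapter is that here the representation is fixed by hand rather than learned from text.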

4 Dataset and Methods

Searching the web for a dataset of sensationalist and non-sensationalist headlines was unfruitful: no dataset that fitted the needs of the project was found, even less so one containing elements in Spanish. As is well known, a large quantity of data is needed to develop an ML text classifier, so with no suitable data available on the internet, the decision was taken to collect and manually label the data. Section 4.1 presents the procedures followed for gathering and labeling the data necessary to train and validate the model, while Sect. 4.2 describes the generated dataset and its content.

4.1 Data Gathering and Data Labeling

The first requirement before gathering the headlines was to understand what patterns can be found in sensationalist headlines. As the classification was binary (whether or not a headline can be considered sensationalist), a headline was labeled as sensationalist if it met one or more of the criteria listed in Sect. 2. The first step in the gathering process was selecting which news outlets to consider. In this case, two factors were used to choose the news sources:

1. Reputation of the news outlet.
2. How easy its website makes it to search and collect headlines.

Among the most important news outlets in Mexico, the ones whose websites facilitated the data gathering task were Milenio (www.milenio.com) and El Universal (www.eluniversal.com.mx), so both were selected for those reasons. Even though a good reputation does not preclude an outlet from being somewhat sensationalist, a news outlet popularly known for its sensationalism could be helpful when gathering this kind of headline. One of the most prominent and 'colorful' news outlets in Mexico is 'La Prensa' (www.la-prensa.com.mx), and it was also selected.

116

H. J. González Esparza et al.

Following the collecting task, a range of dates for the headlines had to be set. To capture as much of the recent events regarding the COVID-19 pandemic as possible, the selected date range started on the first of January of 2020 and ended on the first of June of 2022. As the main objective of this project was to build a machine learning model capable of classifying sensationalist and non-sensationalist headlines regarding health-related issues, the list of topics considered when gathering the headlines focused on (but was not limited to) the following subjects:

• New cases and accumulated cases of a disease or medical condition.
• New deaths and accumulated deaths caused by a disease or medical condition.
• Discoveries regarding health-related issues (such as new treatments).
• Measures taken by government officials regarding a disease or medical condition.
• Economic or social consequences caused by a disease or medical condition.
• Vaccines.

Web scraping techniques were considered, but they were ruled out because manual selection would still be needed. The tasks regarding the collection and labeling process varied from site to site but can be summarized in the following steps:

1. Use the site's search engine to look for articles containing different keywords, such as 'coronavirus', 'cancer', 'vaccines', or 'dengue'.
2. Set the date range for the search. This was usually done month by month, starting with January of 2020 and ending with June of 2022.
3. If the site allowed it, the search listed the most popular articles first; otherwise, chronological order was used.
4. Read each headline, manually determine whether it met any of the criteria described in Sect. 2, and label it accordingly.
5. Copy the data to a Comma-Separated Values (CSV) file.

All collected data was organized into five columns in the CSV file, four of which were gathered from the news outlets' websites:

• 'encabezado': The headline, in plain text.
• 'fecha': The date the article was published (in format DD-MM-YYYY).
• 'fuente': The news outlet from which the headline was gathered, in plain text ('universal' for El Universal, 'milenio' for Milenio, and 'laprensa' for La Prensa).
• 'enlace': A link to the article featuring the headline.

The fifth and last column, 'clase', can take two values: '1' if the headline was considered sensationalist, and '0' otherwise. Given that the COVID-19 pandemic spanned the entire date range selected for this project, and because of its social, political, and medical relevance, most of the collected headlines referenced, to some degree, the Coronavirus disease and its consequences.
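The five-column CSV layout described above can be sketched with pandas. The rows below are invented for illustration (the real file and its contents are not reproduced here); only the headline and label columns are kept, since those are the two columns used to train the classifiers.

```python
import pandas as pd

# Invented sample rows following the five-column CSV layout described above.
rows = [
    {"encabezado": "Coronavirus genera temor en Europa", "fecha": "26-02-2020",
     "fuente": "universal", "enlace": "https://example.com/nota-1", "clase": 1},
    {"encabezado": "OMS reúne a 400 expertos para estudiar coronavirus",
     "fecha": "11-02-2020", "fuente": "milenio",
     "enlace": "https://example.com/nota-2", "clase": 0},
]
df = pd.DataFrame(rows)
# Only the headline and the label are needed to train the classifiers.
df = df.drop(columns=["fecha", "fuente", "enlace"])
print(df.columns.tolist())  # ['encabezado', 'clase']
```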


Table 1 Samples from the dataset

Headline (original) | Headline (translated to English) | Class
"Es un trabajo difícil, muy triste y desgarrador": enfermera de Hubei sobre brote de coronavirus | "It's a difficult job, very sad and heartbreaking": Hubei nurse on coronavirus outbreak | 1
Coronavirus genera temor en Europa y paraliza a Italia | Coronavirus generates fear in Europe and paralyzes Italy | 1
"¡Mátenla!", pide hombre en Honduras para presunta enferma de coronavirus | "Kill her!", asks a man in Honduras for an alleged coronavirus patient | 1
¿México…inmune al COVID-19? | Mexico…immune to COVID-19? | 1
Parecía el Apocalipsis zombie: tampiqueño que regresó de China por coronavirus | It looked like the zombie apocalypse: man from Tampico who returned from China due to coronavirus | 1
Por coronavirus, intensifican medidas preventivas y de higiene en el Metro | Due to coronavirus, preventive and hygiene measures are intensified in the Metro | 0
Ebrard anuncia red de América Latina para investigar coronavirus | Ebrard announces Latin American network to investigate coronavirus | 0
OMS reúne a 400 expertos para estudiar coronavirus | WHO brings together 400 experts to study coronavirus | 0
Episcopado mexicano hace recomendaciones ante Coronavirus | Mexican Episcopate makes recommendations against Coronavirus | 0
La Organización Mundial de la Salud abre cuenta en TikTok para informar sobre el coronavirus | The World Health Organization opens an account on TikTok to report on the coronavirus | 0

The data collection task only aimed to gather sensationalist and non-sensationalist headlines from the selected newspapers, in no particular order or pattern. So, as a disclaimer, the data gathered, labeled, and present in the dataset is not intended to represent the quality of the news coverage each outlet offers. Table 1 contains ten samples drawn from the dataset. The first column shows the original headline (in Spanish), the second column shows an English translation of each headline, and the third column indicates whether the headline was considered sensationalist (1 for sensationalist and 0 otherwise). Columns 1 and 3 were used to train the text classification models.

4.2 Data Analysis

In total, 2,200 headlines were collected and labeled: 1,080 were labeled as sensationalist and 1,120 as non-sensationalist. The dataset therefore has a nearly 50/50 ratio between its classes and can be considered balanced, as shown in Fig. 2.


Fig. 2 Ratio between sensationalist and non-sensationalist headlines in the dataset

Figure 3 shows the distribution of sensationalist and non-sensationalist headlines among the three news outlets selected for this project. 502 headlines were collected from 'La Prensa' (www.la-prensa.com.mx), representing 23% of the dataset; 320 of them were manually classified as sensationalist (64%), while the remaining 182 were not (36%). In the case of 'El Universal' (www.eluniversal.com.mx), which makes up 37% of the dataset, 395 of the gathered headlines were classified as sensationalist (47.6%) and 434 as non-sensationalist (52.4%). Regarding the headlines gathered from 'Milenio' (www.milenio.com), which represent 40% of the dataset, 366 were classified as sensationalist (42%) and the remaining 503 were not (58%).

Fig. 3 Class distribution among the news outlets

After lemmatization and the removal of stop words, Figs. 4 and 5 display the most used words found in each class. Both classes feature a large number of headlines from news articles covering the COVID-19 pandemic. Proof of this is that, in both cases, the two most used words are 'coronavirus' and 'covid', adding up to 1,675 occurrences combined. In sensationalist headlines, the words 'pandemia' ('pandemic'), 'riesgo' ('risk'), and 'contagio' ('contagion') make up the rest of the five most used words; 'caso' ('case'), 'México' ('Mexico'), and 'vacuna' ('vaccine') do the same for the non-sensationalist headlines. Figures 6 and 7 show the word clouds of each class, which highlight the most frequent words in each class by size.

Fig. 4 Most used words in sensationalist headlines from the dataset
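The counts behind Figs. 3 to 7 can be sketched with pandas and a plain word counter. The frame below is a toy stand-in with invented rows (the real pipeline also lemmatizes before counting, and its stopword list is not reproduced here).

```python
import pandas as pd
from collections import Counter

# Toy stand-in for the real frame (columns as in Sect. 4.1); values are invented.
df = pd.DataFrame({
    "fuente": ["laprensa", "laprensa", "universal", "milenio", "milenio"],
    "clase": [1, 0, 1, 0, 0],
    "encabezado": [
        "coronavirus genera temor",
        "medidas preventivas por coronavirus",
        "riesgo de contagio en mexico",
        "oms estudia el coronavirus",
        "vacuna llega a mexico",
    ],
})
# Counts behind Fig. 3: headlines per outlet and class.
print(pd.crosstab(df["fuente"], df["clase"]))

# Word frequencies behind Figs. 4-7 (toy stopword list; the real pipeline
# also lemmatizes before counting).
stopwords = {"de", "el", "en", "a", "por"}
for label in (1, 0):
    words = [w for h in df.loc[df["clase"] == label, "encabezado"]
             for w in h.split() if w not in stopwords]
    print(label, Counter(words).most_common(3))
```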


Fig. 5 Most used words in non-sensationalist headlines from the dataset

Fig. 6 Word cloud of sensationalist headlines from the dataset


Fig. 7 Word cloud of non-sensationalist headlines from the dataset

4.3 Model Generation and Fine-Tuning

This section presents a detailed explanation of the procedures used to generate, train, and validate the text classification models. For the development of the text classifiers, different Python libraries were used to manipulate the data and generate the ML models, such as pandas and sklearn. Besides those, one of the most important libraries for this project is ktrain, a wrapper around other libraries, such as TensorFlow and transformers, that aims to help build, train, and deploy machine learning models [38]. Algorithm 1 describes the implementation of the multilingual BERT text classifier, and Algorithm 2 describes the operation of the BETO and XLM-RoBERTa text classifiers. They comprise all the tasks needed to manipulate the data and generate the models: creating a data frame object, splitting the data into three different sets, preprocessing the data, and generating, training, and validating the text classifier.1

1 https://github.com/gonzalezheber/Text-classifier-of-sensationalist-headlines-in-Spanish-usingBERT-based-models.


Line 1 of Algorithm 1 deals with creating a data frame object using the pandas library. This object, named df, contained all the collected headlines, as well as a label determining whether each headline was considered sensationalist. The link, date, and publisher of each news article were also included in the dataset. Since the only two columns necessary to train and validate the proposed ML model were the headline itself and the label, Line 2 drops the rest of the described columns from the df object.

As seen in Line 3, the data frame object is divided into three different sets: a training set, a validation set, and a testing set. The testing set contained 10% of the proposed dataset (220 headlines) and was used to evaluate the model once the fine-tuning phase was finished. The training and validation sets comprise the remaining 90% (1,980 headlines). This split is done using the sklearn function train_test_split, and its purpose is to allow a proper validation process.

In Lines 4-6, the ktrain library is used for the first time through the texts_from_array function, which loads and preprocesses text data from arrays. This function has several parameters, two of which determine that the task at hand is text classification using the BERT model: class_names (if empty, a regression task is assumed) and preprocess_mode (with three possibilities: 'standard', 'distilbert', and 'bert'). The function also takes as parameters the training and validation sets generated in Line 3. The 'lang' parameter defines the language; it can be autodetected, but in this case it was manually set to 'es' (Spanish). This function preprocesses the text using a method called WordPiece tokenization, a technique that splits words into smaller elements (called wordpieces). Other preprocessing, such as lemmatization or stop word removal, is generally unnecessary when dealing with models based on the Transformer architecture, as it can cause a loss of context.

Lines 7 and 8 make use of two very important functions in the development of a text classifier with the ktrain library: text_classifier and get_learner. The text_classifier function builds and returns a text classification model and takes as arguments the type of text classifier needed (in this case, 'bert'), the training data, and a preprocessing variable (preproc) generated in Line 6. The get_learner function returns a Learner instance that can be used to train and tune Keras models (such as in this case).
It takes as arguments the model just generated (model) and the training and validation data. One of the advantages of using ktrain to generate, train, and validate a machine learning model is the availability of the lr_find function, which simulates training and plots the loss as the learning rate is increased. The function works by training the model on a small portion of the data using a range of learning rates; it then plots the losses and determines the learning rate that results in the lowest loss. According to the ktrain library documentation [39], the highest learning rate that still corresponds to a falling loss (as displayed in the resulting plot) should be chosen. In this case, with the attribute suggest set to true, the resulting plot can be seen in Fig. 8, with the red and purple dots representing the learning rates suggested by the function for the training phase.

To train the generated model, the autofit function is used, as seen in Line 10. In this project, following the suggestion in Fig. 8, a learning rate of 4.01e-6 was chosen. The autofit function automatically enables early_stopping with patience = 5, meaning that training stops after five epochs with no improvement in validation loss. Likewise, reduce_on_plateau is automatically enabled with patience = 2, which reduces the learning rate when validation loss does not improve after two epochs. Both early_stopping and reduce_on_plateau are optional parameters of the autofit function and can be edited if needed.
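The split and training steps described above can be sketched with ktrain's documented API. The helper below is illustrative: the class names, random seed, and validation fraction are assumptions, and the training function is defined but not executed here because calling it downloads a pretrained BERT model.

```python
from sklearn.model_selection import train_test_split

def split_dataset(texts, labels, seed=42):
    """Hold out 10% for testing, then carve a validation set from the
    remaining 90%, as described for Line 3 of Algorithm 1. The seed and
    the validation fraction are assumptions made for illustration."""
    x_rest, x_test, y_rest, y_test = train_test_split(
        texts, labels, test_size=0.10, random_state=seed, stratify=labels)
    x_train, x_val, y_train, y_val = train_test_split(
        x_rest, y_rest, test_size=0.10, random_state=seed, stratify=y_rest)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)

def train_bert_classifier(x_train, y_train, x_val, y_val):
    """Sketch of Lines 4-14 of Algorithm 1 using ktrain's documented API.
    Defined but not called here: running it downloads a pretrained BERT."""
    import ktrain
    from ktrain import text
    # Lines 4-6: load and preprocess the arrays (WordPiece tokenization).
    trn, val, preproc = text.texts_from_array(
        x_train=x_train, y_train=y_train, x_test=x_val, y_test=y_val,
        class_names=["0", "1"], preprocess_mode="bert", lang="es")
    # Lines 7-8: build the BERT classifier and wrap it in a Learner.
    model = text.text_classifier("bert", train_data=trn, preproc=preproc)
    learner = ktrain.get_learner(model, train_data=trn, val_data=val)
    # Line 9: simulate training to pick a learning rate; Line 10: train.
    learner.lr_find(show_plot=True, suggest=True)
    learner.autofit(4.01e-6)
    # Lines 11-13: confusion matrix / classification report on validation data.
    learner.validate(class_names=preproc.get_classes())
    # Line 14: predictor instance for unlabeled headlines.
    return ktrain.get_predictor(learner.model, preproc)
```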


Fig. 8 Resulting plot of the lr_find function, which shows loss as learning rate is increased

Lines 11–13 deal with the validation process, which is done with the validate function, also part of the ktrain library. This function takes as an argument the validation set generated in Line 6 and returns a confusion matrix. Using the resulting confusion matrix, the metrics selected for validating this project (precision, recall, and F1-Score) were calculated and displayed in a classification report, as seen in Line 13.

Finally, in Line 14, a predictor instance is created for making predictions on unlabeled data. The get_predictor function takes as arguments the previously trained Keras model and the same preproc variable declared in Line 6, which was also used in several other lines during the process. This predictor instance can be saved to disk and reloaded as part of other applications using the load_predictor function.

The predictor instance is also used in the last part of Algorithm 1, Lines 15–17, whose objective is to test the resulting model using the testing set generated in Line 3. The predict function in Line 15 stores an array of predictions in the predictions variable, which can then be compared to the actual labels of the testing set. This comparison is done in Line 16 and, following an approach similar to the one described for Line 12, the chosen metrics are displayed in a classification report in Line 17. Results of the testing process can be found in Sect. 5.

Algorithms 1 and 2 are similar, and the first few lines of Algorithm 2 follow the same idea as those of Algorithm 1. The first change is introduced in Line 5, where the Transformer function is used to create a Transformer object. It takes as arguments the name of the Hugging Face pretrained model to use and the class names, along with some hyperparameters such as batch_size and maxlen.
After some experimentation, maxlen was set to 128 in both text classifiers, while batch_size was set to 16.


In this case, preprocessing was done in Lines 6 and 7 using the preprocess_train and preprocess_test functions. These two functions return objects that can be used by the classifier object model created in Line 8 with the get_classifier function. The model object is based on the Transformer object transformer created earlier, so it uses the previously selected pre-trained model, maxlen, batch_size, and class_names. Lines 9–18 are very similar to Algorithm 1, except for Line 15, where the predictor instance is created using the transformer object created in Line 5 instead of the preproc variable used in Algorithm 1. The resulting plots from the lr_find function for both classifiers are displayed in Figs. 9 and 10.

Fig. 9 Resulting plot of the lr_find function for the BETO classifier

Fig. 10 Resulting plot of the lr_find function for the RoBERTa classifier
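Algorithm 2's Transformer-based workflow can be sketched similarly. The function below is illustrative: the default model name is BETO's Hugging Face identifier from [24] (XLM-RoBERTa is selected by swapping model_name), the class names and learning rate are placeholders, and the function is defined but not executed because calling it downloads the pretrained weights.

```python
def train_transformer_classifier(x_train, y_train, x_val, y_val,
                                 model_name="dccuchile/bert-base-spanish-wwm-cased"):
    """Sketch of Algorithm 2 using ktrain's Transformer wrapper. Defined but
    not called here: running it downloads the pretrained weights."""
    import ktrain
    from ktrain import text
    # Line 5: Transformer object with the hyperparameters reported above.
    transformer = text.Transformer(model_name, maxlen=128,
                                   class_names=["0", "1"], batch_size=16)
    # Lines 6-7: preprocess the training and validation arrays.
    trn = transformer.preprocess_train(x_train, y_train)
    val = transformer.preprocess_test(x_val, y_val)
    # Line 8: build the classifier and wrap it in a Learner.
    model = transformer.get_classifier()
    learner = ktrain.get_learner(model, train_data=trn, val_data=val,
                                 batch_size=16)
    # Find a learning rate, then train (the rate here is a placeholder; each
    # classifier used the value suggested by its own lr_find plot).
    learner.lr_find(show_plot=True, suggest=True)
    learner.autofit(4.01e-6)
    # Line 15: the predictor is created from the transformer object itself.
    return ktrain.get_predictor(learner.model, transformer)
```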

5 Results

Performance was evaluated using the test set containing 10% of the dataset (220 headlines). Table 2 shows the results obtained by each of the models. The first column shows the name of the three models (Multilingual BERT, BETO, and XLM-RoBERTa). Columns 2, 3, and 4 show each class's F1 score, precision, and recall, where 1 represents sensationalism (the class of interest) and 0 otherwise. The experimentation results showed that the three text classifiers reached an accuracy above 90%. The classifier that obtained the highest accuracy was


Table 2 Results of the BERT-based models after fine-tuning

Model | F1 score (1) | F1 score (0) | Precision (1) | Precision (0) | Recall (1) | Recall (0) | Accuracy
Multilingual BERT | 0.93 | 0.93 | 0.92 | 0.94 | 0.94 | 0.92 | 0.93
BETO | 0.93 | 0.93 | 0.95 | 0.91 | 0.91 | 0.95 | 0.93
XLM-RoBERTa | 0.94 | 0.94 | 0.92 | 0.96 | 0.96 | 0.92 | 0.94

XLM-RoBERTa with 94%, while BETO and Multilingual BERT both reached 93%. When predicting class 1, XLM-RoBERTa achieved 96% recall, Multilingual BERT 94%, and BETO 91%; when predicting class 0, BETO achieved 95%, and XLM-RoBERTa and Multilingual BERT 92%. Regarding the precision metric, XLM-RoBERTa and Multilingual BERT reached 92% and BETO 95% in class 1; in class 0, they reached 96%, 94%, and 91%, respectively. In the F1-Score metric, the best performance in class 1 was XLM-RoBERTa with 94%, while BETO and Multilingual BERT achieved 93%; in class 0, they achieved 94%, 93%, and 93%, respectively. The results are shown graphically in Figs. 11, 12 and 13, which present the confusion matrices of the 220 predictions made by each model during the testing phase.
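The per-class metrics in Table 2 follow directly from a confusion matrix of this kind. A minimal sketch, using invented counts for a 220-prediction test set (not the actual values from Figs. 11, 12 and 13):

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Precision, recall, and F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Invented counts for a 220-prediction test set (not taken from Figs. 11-13).
p, r, f1, acc = metrics_from_confusion(tp=104, fp=9, fn=4, tn=103)
print(round(p, 2), round(r, 2), round(f1, 2), round(acc, 2))  # 0.92 0.96 0.94 0.94
```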

Fig. 11 Confusion matrix of the multilingual BERT model


Fig. 12 Confusion matrix of the BETO model

Fig. 13 Confusion matrix of the XLM-RoBERTa model

6 Conclusion

This project explored the problem of detecting sensationalism in written media using three pre-trained models based on the BERT architecture. Previous projects with similar goals focused on detecting sensationalism in news articles written in English; however, with the performance that BERT-based models offer in other languages, good results can also be obtained when classifying news headlines in Spanish. The experimental results showed that the best classifier presented in this chapter achieved accuracy and F1-Score metrics of 94%, indicating that the models can identify both sensationalist and non-sensationalist headlines. Additionally, the classifiers performed well on other metrics, such as precision and recall.

The results demonstrate that the tools presented in this chapter have the potential to be used in real-world applications that help users discern between credible and sensationalist sources of information, allowing them to make informed decisions about the sources they trust and consume.

As future work, multiclass classification could be used: instead of relying on a binary label, different levels of sensationalism could be determined depending on a set of criteria. This could be helpful if the model's goal is to separate the most harmful headlines from the merely somewhat sensationalist ones. As one of the main goals of this project was to automatically detect sensationalism in health-related topics, the criteria selected for gathering and labeling headlines were targeted in that direction; different criteria could be applied to classify news headlines on other subjects, such as politics or sports.

Despite its limitations (such as the binary classification and the somewhat limited size of the dataset for this type of task), the work presented in this chapter represents an important step toward promoting the consumption of credible and trustworthy information. It also demonstrates the potential of machine learning in combating the spread of sensationalism in written media.

References

1. Laing, A.: The H1N1 crisis: roles played by government communicators, the public and the media. J. Prof. Commun. 1(1) (2011). https://doi.org/10.15173/jpc.v1i1.88
2. Mach, K.J., et al.: News media coverage of COVID-19 public health and policy information. Humanit. Soc. Sci. Commun. 8(1), 220 (2021). https://doi.org/10.1057/s41599-021-00900-z
3. Pieri, E.: Media framing and the threat of global pandemics: the Ebola crisis in UK media and policy response. Sociol. Res. Online 24(1), 73–92 (2019). https://doi.org/10.1177/1360780418811966
4. Frangogiannis, N.G.: The significance of COVID-19-associated myocardial injury: how overinterpretation of scientific findings can fuel media sensationalism and spread misinformation. Eur. Heart J. 41(39), 3836–3838 (2020). https://doi.org/10.1093/eurheartj/ehaa727
5. Ottwell, R., Puckett, M., Rogers, T., Nicks, S., Vassar, M.: Sensational media reporting is common when describing COVID-19 therapies, detection methods, and vaccines. J. Investig. Med. 69(6), 1256–1257 (2021). https://doi.org/10.1136/jim-2020-001760
6. Molek-Kozakowska, K.: Towards a pragma-linguistic framework for the study of sensationalism in news headlines. Discourse Commun. 7(2), 173–197 (2013). https://doi.org/10.1177/1750481312471668
7. Waage, H.: Hyper-reading headlines: how social media as a news-platform can affect the process of news reading. University of Stavanger (2018)
8. Nabi, R.L., Prestin, A.: Unrealistic hope and unnecessary fear: exploring how sensationalistic news stories influence health behavior motivation. Health Commun. 31(9), 1115–1126 (2016). https://doi.org/10.1080/10410236.2015.1045237
9. van Scoy, L.J., et al.: Public anxiety and distrust due to perceived politicization and media sensationalism during early COVID-19 media messaging. J. Commun. Healthc. 14(3), 193–205 (2021). https://doi.org/10.1080/17538068.2021.1953934
10. Pedrycz, W., Martínez, L., Espin-Andrade, R.A., Rivera, G., Gómez, J.M. (eds.): Preface. In: Computational Intelligence for Business Analytics, pp. v–vi. Springer (2021). https://doi.org/10.1007/978-3-030-73819-8
11. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. https://github.com/tensorflow/tensor2tensor
12. Koroteev, M.V.: BERT: a review of applications in natural language processing and understanding
13. Pedroso, R.: Elementos para una teoría del periodismo sensacionalista. Comun. y Soc. 21, 139–157 (1994)
14. Torrico, E.: El sensacionalismo: algunos elementos para su comprensión y análisis. Sala de prensa 2(45) (2002)
15. Lin, L.: Semantic Comparisons for Natural Language Processing Applications. University of Washington (2021)
16. Doherty, J.-F.: When fiction becomes fact: exaggerating host manipulation by parasites. Proc. R. Soc. B: Biol. Sci. 287(1936), 20201081 (2020). https://doi.org/10.1098/rspb.2020.1081
17. Costa-Sánchez, C.: Tratamiento informativo de una crisis de salud pública: Los titulares sobre gripe A en la prensa española. Revista de Comunicación de la SEECI 0(25), 29 (2011). https://doi.org/10.15198/seeci.2011.25.29-42
18. Alonso-González, M.: coronavirus a través de los titulares de El Mundo y La Vanguardia. Revista de Comunicación y Salud 10(2), 503–524 (2020). https://doi.org/10.35669/rcys.2020.10(2).503-524
19. Vaswani, A., et al.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
20. Giacaglia, G.: Transformers. Medium (2019). https://towardsdatascience.com/transformers-141e32e69591. Accessed 21 Sep. 2022
21. Özçift, A., Akarsu, K., Yumuk, F., Söylemez, C.: Advancing natural language processing (NLP) applications of morphologically rich languages with bidirectional encoder representations from transformers (BERT): an empirical case study for Turkish. Automatika 62(2), 226–238 (2021). https://doi.org/10.1080/00051144.2021.1922150
22. González-Carvajal, S., Garrido-Merchán, E.C.: Comparing BERT against traditional machine learning text classification (2021)
23. bert-base-multilingual-cased · Hugging Face. https://huggingface.co/bert-base-multilingual-cased. Accessed 06 June 2023
24. dccuchile/bert-base-spanish-wwm-cased · Hugging Face. https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased. Accessed 06 June 2023
25. Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., Pérez, J.: Spanish pre-trained BERT model and evaluation data. https://github.com/josecannete/spanish-corpora. Accessed 06 Mar. 2023
26. xlm-roberta-base · Hugging Face. https://huggingface.co/xlm-roberta-base. Accessed 06 Mar. 2023
27. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale (2019). https://doi.org/10.48550/arxiv.1911.02116
28. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach (2019)
29. Dai, Z., Wang, X., Ni, P., Li, Y., Li, G., Bai, X.: Named entity recognition using BERT BiLSTM CRF for Chinese electronic health records. In: 2019 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), pp. 1–5 (2019). https://doi.org/10.1109/CISP-BMEI48845.2019.8965823
30. Liu, H., et al.: Use of BERT (bidirectional encoder representations from transformers)-based deep learning method for extracting evidences in Chinese radiology reports: development of a computer-aided liver cancer diagnosis framework. J. Med. Internet Res. 23(1), e19689 (2021). https://doi.org/10.2196/19689
31. López Condori, J.J., Gonzales Saji, F.O.: Análisis de sentimiento de comentarios en español en Google Play Store usando BERT. Ingeniare. Revista chilena de ingeniería 29(3), 557–563 (2021). https://doi.org/10.4067/S0718-33052021000300557
32. García-Pablos, A., Perez, N., Cuadros, M.: Sensitive data detection and classification in Spanish clinical text: experiments with BERT (2020). https://doi.org/10.48550/arXiv.2003.03106
33. Tanase, M.-A., Zaharia, G.-E., Cercel, D.-C., Dascalu, M.: Detecting aggressiveness in Mexican Spanish social media content by fine-tuning transformer-based models (2020). https://www.facebook.com/communitystandards/hate_speech. Accessed 07 Mar. 2023
34. Zampieri, M., et al.: SemEval-2020 task 12: multilingual offensive language identification in social media (OffensEval 2020), pp. 1425–1447 (2020). http://sites.google.com/site/offensevalsharedtask/offenseval2019. Accessed 07 Mar. 2023
35. Basile, V., et al.: SemEval-2019 task 5: multilingual detection of hate speech against immigrants and women in Twitter, pp. 54–63. http://evalita.org. Accessed 07 Mar. 2023
36. Hoffman, S.J., Justicz, V.: Automatically quantifying the scientific quality and sensationalism of news records mentioning pandemics: validating a maximum entropy machine-learning model. J. Clin. Epidemiol. 75, 47–55 (2016). https://doi.org/10.1016/j.jclinepi.2015.12.010
37. Ivenskaya, M.: Fake news detection. Google Summer of Code (2017). https://summerofcode.withgoogle.com/archive/2017/projects/5547741878943744. Accessed 28 Mar. 2022
38. Maiya, A.S.: ktrain: a low-code library for augmented machine learning (2020). https://github.com/amaiya/ktrain
39. Maiya, A.: ktrain API documentation. https://amaiya.github.io/ktrain/index.html. Accessed 27 Oct. 2022

Arabic Question-Answering System Based on Deep Learning Models

Samah Ali Al-azani and C. Namrata Mahender

Abstract A question-answering system (QAS) is an information retrieval mechanism with particular capabilities: it provides pertinent answers, in natural language, to posed questions. QASs are becoming increasingly important in education, notably as a method of automated subjective exam scoring. Most QASs comprise three basic parts: question classification, information retrieval, and answer retrieval. Question classification plays a crucial role by categorizing submitted questions according to their nature. Information retrieval is used to locate documents that can answer a question, since no further processing can find a solution if a document lacks the right answer. Answer retrieval then obtains the answer to the user's query. This research article provides a review of NLP as well as the relevant background on question-answering systems. Millions of pieces of knowledge are available, but making the proper information accessible when needed is very important: finding the right documents to read and, further, extracting a direct answer to one's question from a set of documents is a challenging task. An answering system automatically answers questions posed by humans in a natural language, and is usually programmed to pull answers from a structured database or an unstructured collection of natural language documents. This paper addresses long answers and introduces a new way of grading using simple RNN models and advanced RNN models, such as LSTM and GRU. After training, we were able to build RNN, LSTM, and GRU neural networks for predicting the correct answers. We obtained training accuracies of up to 94% for the RNN, 91% for the LSTM, and 99% for the GRU, and testing accuracies of up to 85% for the RNN, 84% for the LSTM, and 95% for the GRU.

Keywords Natural language processing · Neural network · Arabic question answering system · Deep learning models

S. A. Al-azani (B) · C. Namrata Mahender
Department of C.S. and I.T., Dr. Babasaheb Ambedkar Marathwada University, Aurangabad, Maharashtra, India
e-mail: [email protected]

C. Namrata Mahender
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
G. Rivera et al. (eds.), Innovations in Machine and Deep Learning, Studies in Big Data 134, https://doi.org/10.1007/978-3-031-40688-1_7


1 Introduction Natural Language Processing (NLP) is a subfield of computer science and artificial intelligence that deals with the interactions between human language and computers. It involves the development of algorithms and computational models that enable computers to understand, interpret, and generate human language. NLP techniques are used to analyze and process large amounts of natural language data, including text, speech, and other forms of communication. Some of the applications of NLP include language translation, sentiment analysis, chatbots, speech recognition, and text summarization [1, 2]. The goal of NLP is to create intelligent systems that can communicate with humans in natural language, making it easier for people to interact with technology. Reference [3] presented a comprehensive survey of question-answering technology from the perspective of information retrieval. The authors discuss various aspects of question-answering systems, including question classification, answer extraction, and answer validation. They also explore different approaches to building question-answering systems, such as rule-based, templatebased, and statistical methods. The article then delves into the challenges faced by question-answering systems, such as the difficulty of understanding natural language and the need for accurate information retrieval. The authors also discuss the evaluation metrics used to assess the performance of question-answering systems. Reference [4] provides a comprehensive review of state-of-the-art question-answering systems. The authors discuss various types of question-answering systems, such as open-domain, closed-domain, factoid, and non-factoid question-answering systems. They also discuss the challenges faced by these systems, such as knowledge representation, natural language processing, and scalability. 
The review then explores current trends in question-answering systems, such as the use of deep learning techniques and the integration of knowledge graphs. The authors conclude by highlighting the potential benefits of question-answering systems, such as improving search engines, customer service, and education [5].

Neural networks are receiving increasing attention in NLP-related tasks. In recent years, neural networks have reemerged as powerful machine-learning models, producing cutting-edge results in disciplines such as speech processing and image recognition. More recently, neural network models have begun to be applied to textual natural language signals, and the results are once again highly encouraging [6]. Traditional machine learning requires feature engineering; neural networks do not. What takes place within a neural network's hidden layers is not directly interpretable: the network learns every feature it finds useful, possibly including ones that humans have not even considered. Because of their ability to learn, neural networks can perform tasks such as non-linear classification and non-Markovian modeling of trees and sequences.

Reference [7] explores the application of deep learning techniques to answer selection in question-answering systems. The authors present a study comparing various deep learning models for answer selection, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory (LSTM) networks. They also propose an open

Arabic Question-Answering System Based on Deep Learning Models


task for answer selection in the context of community question-answering (CQA) systems. The authors evaluate their approach using the SemEval-2015 Task 3 dataset, which consists of questions and answers from the CQA website Yahoo! Answers. They show that their proposed deep learning models outperform traditional machine learning approaches for answer selection. The work highlights the potential benefits of using deep learning techniques for answer selection in question-answering systems, particularly in the context of CQA systems. The authors also discuss the challenges faced by their approach, such as the need for large amounts of annotated data and the difficulty of handling noisy and ambiguous input. Overall, the research provides a valuable contribution to the field of natural language processing and question-answering systems, particularly in the area of applying deep learning techniques to answer selection [8].

2 Natural Language Processing (NLP)

NLP is a field of artificial intelligence in which machines analyze, comprehend, and draw meaning from human language. Developers can use natural language processing to organize and arrange knowledge for tasks such as text summarization, translation, named entity recognition (NER), relationship extraction, sentiment analysis, and speech recognition. By interpreting language for meaning, NLP systems have long performed valuable tasks such as grammar checking, converting speech into text, and automatic translation between languages. NLP is a text analysis technique that makes it possible for computers to understand spoken and written language. This human-computer interaction enables automatic summarization, sentiment classification, topic identification, named entity recognition, part-of-speech tagging, relation detection, stemming, and other practical applications. NLP is frequently used to analyze text [9].

2.1 Difficulties in NLP

Through the use of NLP, there have been significant advancements in the capacity of computers to comprehend human language. However, implementation can be challenging in a variety of situations due to the magnitude and wide variation of the data sets. The following are challenges that firms and engineers may face while utilizing natural language processing.
• Development Time: The development of an NLP system takes time. An AI must examine millions of data points, and processing all of the data on an underpowered PC could take an impractically long time. A distributed deep network and several GPUs working together can cut training times in half. Unless you are using NLP technology that is already available, you will need to budget time to construct the solution from scratch.

• Misspellings: People can easily correct spelling errors. When a word is spelled incorrectly, we can rapidly grasp the rest of a sentence by relating the word to its proper counterpart. For a machine, however, identifying misspellings can be more difficult. Use natural language processing (NLP) technologies that can identify and go beyond common word errors [10].
• Words with Multiple Meanings: There is no such thing as an ideal language, and most languages contain words that, depending on the context, can mean several things. Users who asked "What's up?" had considerably different goals from users who asked "How do I connect the bank card?" Using parameters, the best NLP system must be able to distinguish between these utterances.
• Phrases with Multiple Intentions: No language is perfect, and most contain words that, depending on the context, can mean a variety of things. Users who asked "How are you?" had completely different goals in mind from users who asked "How do I connect the bank card?" Parameters should allow the best NLP algorithms to distinguish between such phrases [11].
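To make the misspelling point concrete, the sketch below uses Python's standard difflib to snap an unknown word onto a toy vocabulary. The vocabulary and the 0.7 similarity cutoff are illustrative assumptions, not recommendations from the chapter.

```python
import difflib

# Toy in-domain vocabulary; a real system would use a large lexicon.
VOCAB = ["connect", "bank", "card", "account", "balance"]

def correct(word, vocab=VOCAB):
    """Return the closest vocabulary word, or the input unchanged if
    nothing is similar enough (cutoff is a similarity ratio in [0, 1])."""
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=0.7)
    return matches[0] if matches else word
```

For example, `correct("conect")` maps to "connect", while an out-of-vocabulary word such as "zzz" is passed through untouched.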

2.2 Natural Language Processing Phases

There are five main stages or steps in natural language processing, starting with simple word processing and progressing to the recognition of complex phrase meanings (see Fig. 1).

Fig. 1 Natural Language Processing phases

• Lexical Analysis: The lexical stage analyzes the text word by word. The morpheme is the word's smallest building block. For instance, it is possible


to break down the words "irrationally" and "rationally" into their component elements (suffixes). Lexical analysis establishes the relationship between these morphemes and transforms each word into its root form. The probable part of speech (POS) of the word is also determined. The lexicon of the language is taken into consideration.
• Syntactic Analysis: An important component of natural language processing, used in various applications such as machine translation, question answering, and sentiment analysis. It helps in understanding the meaning of a sentence by identifying the grammatical relationships between its constituent parts [12]. Rule-based or statistical approaches may involve parsing algorithms such as chart parsing, dependency parsing, or constituency parsing. The output of syntactic analysis is usually represented as a tree structure called a parse tree, which shows the hierarchical relationships between the words in a sentence.
• Semantic Analysis: The process of understanding the meaning of a sentence or text by analyzing its context and the relationships between words and phrases. Unlike syntactic analysis, which focuses on the grammatical structure of a sentence, semantic analysis aims to capture the intended meaning of the sentence. Semantic analysis involves a range of techniques, such as named entity recognition, entity linking, semantic role labeling, and sentiment analysis, among others. These techniques help in identifying and extracting important concepts, entities, and relationships from the text. One of the key challenges in semantic analysis is dealing with ambiguity and understanding the contextual meaning of words [13].
• Discourse Analysis: A research method that aims to understand how language is used in social contexts to create meaning, convey information, and establish social relationships.
It involves the systematic study of texts, spoken or written, in order to identify patterns of language use, power relations, and social norms that shape communication. Discourse analysis is rooted in various theoretical frameworks, including structuralism, post-structuralism, and critical theory. These frameworks provide different lenses through which researchers can analyze language use, ranging from a focus on the linguistic structures of texts to an examination of the social and cultural contexts in which they are produced and consumed.
• Pragmatic Analysis: An approach to studying language use that focuses on how speakers use language to accomplish specific communicative goals in social interactions. It examines the contextual factors that shape language use, including the identity of the speakers, the social setting, and the cultural norms and expectations that inform communicative behavior.
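The lexical-analysis phase described above can be caricatured as a toy morpheme stripper. The suffix inventory below is a made-up illustration, not a real lexicon; a real lexical analyzer would consult the lexicon of the language, as the chapter notes.

```python
import re

# Toy suffix inventory (an assumption for illustration only)
SUFFIXES = ["ally", "ly", "ing", "ed", "s"]

def lexical_analyze(text):
    """Tokenize text and strip a recognizable suffix from each token,
    approximating the morpheme-to-root mapping of the lexical phase."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    analyses = []
    for tok in tokens:
        root = tok
        for suffix in SUFFIXES:
            # Only strip when a plausible root (>= 3 letters) remains
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 3:
                root = tok[: -len(suffix)]
                break
        analyses.append((tok, root))
    return analyses
```

Running `lexical_analyze("rationally")` yields the (token, root) pair ("rationally", "ration"), mirroring the suffix decomposition in the example above.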


3 Question Answer System

A question-answering (QA) system is a type of artificial intelligence application that uses natural language processing (NLP) techniques to understand and answer questions posed by humans in natural language. The system typically includes several components, such as a language parser, a knowledge base, and a reasoning engine. When a user enters a question, the system parses the question to identify the key concepts and relationships within it, searches the knowledge base to find relevant information, and uses the reasoning engine to generate a response [14].

QA systems can be categorized into different types, including fact-based QA systems that provide direct answers to factual questions and open-domain QA systems that attempt to answer a broader range of questions by generating responses based on information available on the internet or other external sources. Overall, QA systems have a wide range of applications, from customer service chatbots to virtual personal assistants, and they continue to improve in accuracy and capability with advancements in NLP and AI technologies [15].

The question-answering architecture is broken down into three distinct modules: question processing, document processing, and answer processing (see Fig. 2).

Fig. 2 Question answering system framework
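The three modules in Fig. 2 can be caricatured in a few lines of Python. The keyword-overlap retrieval below is a deliberately simple stand-in for the chapter's question-processing, document-processing, and answer-processing components; the stopword list is an illustrative assumption.

```python
import re

# Toy stopword list (an assumption for illustration)
STOPWORDS = {"what", "which", "when", "who", "where", "how",
             "is", "are", "the", "a", "an", "of", "in", "to", "do", "does"}

def process_question(question):
    # Question-processing module: keep only content keywords
    tokens = re.findall(r"\w+", question.lower())
    return [t for t in tokens if t not in STOPWORDS]

def process_documents(keywords, sentences):
    # Document-processing module: score candidate sentences by keyword overlap
    scored = []
    for sent in sentences:
        tokens = set(re.findall(r"\w+", sent.lower()))
        scored.append((sum(1 for k in keywords if k in tokens), sent))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def answer(question, sentences):
    # Answer-processing module: return the best-supported sentence, if any
    keywords = process_question(question)
    score, best = process_documents(keywords, sentences)[0]
    return best if score > 0 else None
```

In a real system each module would be far richer (parsing, retrieval over an index, answer validation), but the division of labor is the same.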


3.1 Use of Deep Learning Models in Question-Answering Systems

Deep learning models are used in question-answering systems because they can learn complex patterns and relationships in data, allowing them to perform well in tasks such as natural language processing (NLP) and understanding. Deep learning models, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformer models, can process large amounts of data, extract relevant information from a question, and use that information to generate accurate answers. In addition, deep learning models can be trained on large amounts of data, which helps to improve their accuracy and robustness. With the growth of big data, deep learning models have become increasingly popular in question-answering systems because they can handle the large amounts of text data generated by modern applications. Overall, deep learning models are powerful tools for question-answering systems because they can learn complex patterns in data and improve their accuracy and robustness through training on large datasets.

3.2 Different Questions Based on Bloom's Taxonomy

Bloom's Taxonomy is a framework used to categorize different levels of cognitive skills involved in learning. Different types of questions for each level of Bloom's Taxonomy are given in Table 1 [16].

QA systems are classified according to the scope they support [17]:
Open-domain QA: This sort of QA deals with questions pertaining to a wide range of topics and may rely only on general ontologies and knowledge of the outside world. On the other hand, such systems usually have much more data available from which to retrieve an answer. The system takes a natural-language (NL) question as input rather than a list of terms.
Closed-domain QA: This kind of QA responds to inquiries that require specific background knowledge (for instance, questions about banking or education). It appears to be a more straightforward task because the system can exploit the precise domain data of the organization; alternatively, a closed domain may admit only a limited set of question types.

3.3 Question-Answering System Based on Types

The many categories of question-answering systems are divided as follows:
1. Wh-Type Queries
2. List-Based Questions
3. Yes/No Questions
4. Causal Questions
5. Hypothetical Questions
6. Complex Queries


Table 1 Summarizing the different types of questions based on Bloom's Taxonomy

Level of Bloom's taxonomy | Description | Example questions
Remembering | Recall previously learned information | What is…?, Who wrote…?, Name…?
Understanding | Comprehend and explain concepts or ideas | How would you describe…?, Can you explain…?, What is the main theme…?
Applying | Use knowledge and concepts to solve problems or complete tasks | How would you use…?, If a recipe calls for…, Can you design…?
Analyzing | Break down complex information into smaller parts and identify relationships between them | What are the causes and effects…?, How does the structure…?, Can you identify…?
Evaluating | Make judgments and form opinions based on criteria or evidence | Do you think…? Why or why not?, Which of the two… do you think is more valid? Why?, Can you assess…?
Creating | Generate new ideas, products, or solutions based on existing knowledge and concepts | Can you design…?, How would you create…?, Can you write…?

3.4 Wh-Type Questions (What, Which, When, Who)

Wh-type inquiries typically begin with a wh-word. These are straightforward questions with factual requirements that can be answered in a single sentence or brief phrase. "What is the capital of Yemen?", for instance, is a Wh-type query: it demands the name of a city, and because such factoids are difficult to answer, the potential for extracting useful information is reduced. Named entities are the best answer types for Wh-type queries [18]. Wh-type inquiries do not require handling sophisticated natural language in order to obtain the right answers. One of the challenging problems within a question-answering architecture is identifying Wh-type questions and their sub-categories. Short expressions work well as answers to wh-based queries [19].

4 List-Based Questions

The answer to a list-based question is an enumeration, usually of entities or facts. Factoid techniques can handle list questions in successful applications; further factual answer extraction does not require deep NLP. Since the answers to list-type queries typically consist of entities, the answer can be accurately extracted.


5 Yes/No Questions

Confirmation questions require yes/no responses. For instance, the confirmation question "Is Ahmed a good boy?" calls for an appropriate yes/no response.

6 Causal Questions [Why or How]

Unlike Wh-type questions, the correct responses to causal questions are not named entities. Causal questions demand descriptions of the entities in the responses. Users express causal questions when they demand explanations of causes, interpretations, elaborations, and other reasons related to particular things or events.

7 Hypothetical Questions

Hypothetical questions ask for information about a hypothetical event and have no specific answers. They typically begin with a phrase like "what may happen if." These inquiries have poor reliability and precision, which depend on the user and the context, and the expected answer type is open-ended. As a result, answering speculative queries is not very precise [20].

8 Complex Questions

A complex question is a question with a compound presupposition. The presupposition is a statement that is assumed to be true by the respondent at the time the question is posed. The moment the respondent makes any kind of direct response, he commits to this presupposition. Because the presupposition is neither a simple conjunctive nor a disjunctive statement, the question is labeled "complex."
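The question types above lend themselves to a simple rule-based classifier. The sketch below is a toy illustration in Python; the regular expressions are assumptions and certainly not exhaustive, and a real system would use the deep learning models discussed later.

```python
import re

def classify_question(q):
    """Map a question string onto one of the types from the taxonomy above.
    Rule order matters: hypothetical patterns must be checked before wh-words."""
    q = q.strip().lower()
    if re.match(r"^(what|who)\s+(may|might|would|could)\s+happen\s+if", q) or "what if" in q:
        return "hypothetical"
    if re.match(r"^(why|how)\b", q):
        return "causal"
    if re.match(r"^(list|name|enumerate)\b", q):
        return "list"
    if re.match(r"^(is|are|was|were|do|does|did|can|could|will|would|has|have|had)\b", q):
        return "yes/no"
    if re.match(r"^(what|which|when|who|where|whose|whom)\b", q):
        return "wh"
    return "complex"
```

For example, "What is the capital of Yemen?" classifies as a Wh-type question, while "Is Ahmed a good boy?" classifies as yes/no.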

8.1 Question Answering System Issues

1. Question Classes: Various approaches to answering questions must be taken into account, and a given strategy may only provide answers to specific kinds of queries. We must understand the specific category of a question in order to respond to it correctly.


2. Data Sources and QA: It is crucial to have the right source available to respond to inquiries drawn from existing knowledge. To find the right answer, a dataset, the web, or documents may be used as the source.
3. Answer Extraction: Answer retrieval depends on the difficulty of the question, the type of answer identified by question processing, the actual data being searched for the answer, the search strategy, and the focus and context of the question.
4. Question Processing: The same question may be posed in a variety of ways. Therefore, it is important to understand what it means and to which class it belongs.
5. Formulation of the Answer: A QA system's output should be provided as naturally as feasible. For instance, when the question classification shows that the answer type is a name (of a person, organization, etc.), a quantity (length, distance, etc.), or a date, the retrieval of a single datum suffices. In other situations, fusion techniques that merge partial answers from different documents may be necessary for presenting the answer.
6. Real-Time Question and Answer Systems: Regardless of the complexity, size, heterogeneity, or ambiguity of the question, real-time question and answer systems must be able to extract answers from enormous data sets in seconds.
7. Complex Justification for QA: More knowledgeable users expect responses that go beyond what can be found in written documents or structured databases. Improving a QA system with these capabilities would require incorporating reasoning components working over several knowledge domains, encoding world knowledge, sensible reasoning techniques, and knowledge relevant to particular fields. The capacity to answer a question posed in one language using an answer corpus in another language (or even several) is known as multilingual question answering; thanks to this, users are able to consult information that they are unable to use directly.
8. Interactive QA: It frequently happens that a QA system does not adequately capture the necessary information; for example, the question-processing component may not appropriately classify the question, or the information required for extracting and creating the answer may be difficult to locate. In such circumstances, the asker might want to engage in conversation with the system in addition to reformulating the query.
9. User Profile: The user profile records information about the user, including context data, the user's area of interest, frequent reasoning patterns, and common knowledge developed during various interactions between the user and the system. A prepared template with slots for various profile features could be used to represent the profile.
10. Information Clustering in QA Systems: This is a recent development intended to improve the accuracy of question-answering systems by reducing the search space [21].


9 Arabic Language Overview

Arabic is a Semitic language [18] spoken by over 422 million people worldwide. It is the first language of the Arabs and one of the official languages of the United Nations [22], and it is the third most important international language after English and French. The Arabic language has a very rich combination of special features that are difficult for computers to handle [23].

Arab nations speak Arabic as their mother tongue, which gives the language enormous global significance. Arabic is an exceptionally rich language that belongs to a linguistic family separate from the Indo-European languages, namely the Semitic languages. Anyone who has even a slight understanding of Arabic may read and comprehend a work that was penned fourteen centuries ago. Arabic has its roots in Classical or Qur'anic Arabic, but the language has evolved over the centuries into what is now known as Modern Standard Arabic (MSA). MSA is a simplified form of Classical Arabic that keeps its grammar. Arabic is written from right to left and has 28 letters. Arabic is taught in many schools and institutions and used in the workplace and in the media; although practically all written Arabic sources use MSA, readers must infer missing diacritical markings from the context because they are frequently left out. MSA has 34 phonemes: 28 consonants and six vowels, three short and three long, with long vowels lasting almost twice as long as short vowels; the short vowels /a/, /i/, and /u/ are called Fatha, Kasra, and Damma. Every Arabic word or sentence must begin with a consonant.

9.1 Arabic Language Challenges

The Arabic language is challenging. The main challenges are as follows:
1. A major problem is the lack of a significant corpus.
2. Named items are commonly translated from Arabic.
3. The two main internal issues the language faces today are obscurantism and modernization; another challenge the Arabic language has is interaction with other languages.
4. Arabic has a wide range of forms, but the colloquial variations are the most frequently used. In addition to these styles, many Internet users write Arabic utilizing English letters; this practice is known as Arabizi [24].
5. It is difficult to distinguish proper nouns in Arabic because there are no upper- and lowercase letters, in contrast to English, where proper nouns may be recognized because they start with a capital letter.
6. In Arabic, there are additional infixes beyond the suffixes and prefixes at the beginning and end of words.
7. Because Arabic is highly derivational and inflectional, morphological analysis can be extremely difficult.


8. It is challenging to distinguish tokens and parse written texts because they lack the diacritics that usually indicate vowels, which causes ambiguity.
9. The text is written from right to left, and a number of the letters take different shapes depending on their position within a word.
10. It may be difficult to discern among proper names, abbreviations, and acronyms because Arabic does not use capitalization [25].
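The missing-diacritics problem above (challenge 8) is commonly handled in Arabic NLP pipelines by normalizing text so that vocalized and unvocalized forms match. A minimal sketch in Python follows; the Unicode range used covers the common tashkeel marks (U+064B through U+0652).

```python
import re

# Arabic short-vowel and related diacritical marks (tashkeel)
TASHKEEL = re.compile("[\u064B-\u0652]")

def strip_diacritics(text):
    """Remove tashkeel marks so a fully vocalized word matches its
    unvocalized form, as found in most written Arabic sources."""
    return TASHKEEL.sub("", text)
```

For example, the vocalized word kataba ("he wrote"), written with three Fatha marks, reduces to its bare three-consonant skeleton, while non-Arabic text passes through unchanged.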

10 Related Work

Vaswani et al. presented the Transformer, a deep learning architecture that revolutionized natural language processing tasks, including machine translation and question answering. The paper proposed a self-attention mechanism that allows the model to capture global dependencies between input and output sequences, eliminating the need for recurrent or convolutional neural networks. The Transformer model achieved state-of-the-art results on machine translation tasks, surpassing the performance of traditional sequence-to-sequence models that used recurrent neural networks. The paper's introduction of the Transformer architecture had a significant impact on the development of deep learning techniques for natural language processing, and it has since been widely adopted in various NLP tasks [26].

Devlin et al. introduced the Bidirectional Encoder Representations from Transformers (BERT) model, a deep learning architecture for natural language processing tasks. The paper proposed a pre-training method for BERT that enables the model to learn contextual representations of words by predicting missing words in a text corpus. BERT uses a multi-layer bidirectional Transformer encoder to capture the context of words in a sentence, and the pre-training process allows the model to capture complex relationships between words and their surrounding context. The BERT model achieved state-of-the-art results on a wide range of natural language processing tasks, including question answering, text classification, and named entity recognition, and has since been widely adopted by the research community and industry. The paper's introduction of BERT has had a significant impact on the development of deep learning techniques for natural language processing and continues to be an active area of research [26].

Antoun and Hajj created AraELECTRA, pre-training text discriminators for Arabic language comprehension.
The BERT model’s architecture and layering are the same as those of the discriminator network. They adjusted the whole model using the additional layer on reading comprehension tasks in order to fine-tune their method by adding a linear classification layer on top of ELECTRA’s output. Numerous Arabic NLP tasks, including reading comprehension, were used to test the model. QA systems in the Arabic language are developing relatively slowly when compared to English-language QA. This is because there aren’t enough NLP tools and datasets for Arabic QA. The sparse technique is used in the Arabic OpenQA study for passage retrieval [27].


Almiman et al. highlighted the difficulty of answering questions in the Arabic community-QA setting. They investigated the impact of preprocessing and used a variety of similarity features. Using semantic and lexical similarity features, they created a novel deep neural network ensemble model that incorporates the BERT model, which has recently made breakthroughs. The model achieved an MRR value of 68.86% [28].

Karpukhin et al. focused on merging a BERT pre-trained model with a dual-encoder architecture to produce a dense embedding model using only pairs of questions and answers. Their dense passage retriever creates an index for each passage to be retrieved and employs a dense encoder to transform any text into a real-valued vector. On a variety of QA datasets, including SQuAD, Natural Questions, and TriviaQA, their proposed model outperformed numerous open-domain QA methods [29].

Guu et al. proposed a successful method that blends a pre-trained language model with a learned textual neural knowledge retriever. By requiring the model to decide which knowledge to retrieve and use during inference, this technique explicitly exposes the role of world knowledge, in contrast to models that store knowledge in their parameters [30].

Huang et al. helped make Arabic Open-QA systems run more efficiently. They used a two-stage (Retriever-Reader) design to build Open-QA systems because it is the most effective and fruitful approach [31].

11 Proposed Methodology

Figure 3 presents the methodology for the proposed deep learning models.

11.1 Recurrent Neural Networks (RNNs)

A Recurrent Neural Network (RNN) is a type of artificial neural network designed to process sequential data, such as time-series data or natural language sentences. Unlike feedforward neural networks, which process data inputs in a fixed order and produce an output, RNNs have a feedback loop that allows them to maintain a state, or memory, of the previous inputs they have processed. This enables RNNs to process sequences of variable length and learn temporal dependencies in the input data.

In an RNN, each neuron or "unit" is connected to itself through a time-delayed connection, creating a loop (see Fig. 4). This allows the network to pass information from one time step to the next, effectively capturing temporal dependencies in the data. RNNs can be trained using backpropagation through time (BPTT), a variant of backpropagation that takes into account the time dimension of the data. RNNs suffer from the vanishing gradient problem, which limits their ability to capture long-term dependencies in the data. To address this drawback, variants of RNNs such


Fig. 3 Proposed deep learning models

as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) have been developed [32]. Given a sequence x = (x_1, x_2, …, x_t), the RNN updates its recurrent hidden state h_t by


Fig. 4 Recurrent neural network

h_t = { 0,                  t = 0
      { ϕ(h_{t−1}, x_t),    otherwise                                    (1)

where ϕ is a non-linear function, such as an affine transformation followed by a logistic sigmoid. Optionally, the RNN may have an output y = (y_1, y_2, …, y_t) that is variable in length. The recurrent hidden-state update in Eq. (1) can be written as in Eq. (2):

h_t = g(W x_t + U h_{t−1})                                               (2)

where g is a sigmoid function. A generative RNN factorizes the probability of a sequence as

p(x_1, …, x_T) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ⋯ p(x_T | x_1, …, x_{T−1})    (3)

An output from a generative RNN is a probability distribution over the sequence's next element. By employing a special symbol to denote the conclusion of the series, with a specific end-of-sequence value as the final element, this generative model can capture a distribution over sequences of varied lengths given its current state h_t. Each conditional probability distribution is represented using Eq. (4), where


h_t is from Eq. (1):

p(x_t | x_1, …, x_{t−1}) = g(h_t)                                        (4)
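Equations (1) and (2) translate almost directly into code. Below is a minimal NumPy sketch, assuming a logistic sigmoid for g and small dense weight matrices W and U; it is an illustration of the recurrence, not the chapter's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(x_t, h_prev, W, U):
    # Eq. (2): h_t = g(W x_t + U h_{t-1}), with g a logistic sigmoid
    return sigmoid(W @ x_t + U @ h_prev)

def rnn_forward(xs, W, U):
    # Eq. (1): h_0 = 0; afterwards h_t = phi(h_{t-1}, x_t)
    h = np.zeros(W.shape[0])
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W, U)
        states.append(h)
    return states
```

A softmax layer over each h_t would then give the per-step conditional distributions of Eqs. (3) and (4).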

11.2 Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) was first proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997. They designed LSTM to address the issue of vanishing gradients in traditional RNNs, which makes it difficult for the network to learn long-term dependencies. They introduced a new type of memory cell that uses gating mechanisms to control the flow of information, allowing the network to selectively remember or forget information over many time steps, and they demonstrated the effectiveness of LSTM in applications including speech recognition and artificial grammar learning. Since then, LSTM has become one of the most widely used RNN architectures and has been applied in a wide range of fields, including natural language processing, computer vision, and robotics [33].

LSTM is a recurrent neural network architecture designed to address the vanishing-gradient issue and the inability of plain RNNs to capture long-term dependencies in sequential data. LSTMs use a memory cell, an input gate, a forget gate, and an output gate to selectively store, discard, and output information at each time step (see Figs. 5 and 6). This allows the network to remember important information for a longer period and forget irrelevant information. The memory cell provides a pathway for information to flow through the network over many time steps, enabling the model to capture long-term dependencies. LSTMs have been successfully used in many applications, including speech recognition, natural language processing, and image captioning [34].

One of the main challenges in using LSTM is determining the optimal architecture and hyperparameters for a given problem. The architecture of an LSTM can be quite complex, with multiple layers, gates, and memory cells, and selecting the right combination of these components can be a difficult task.
Another challenge in using LSTM is dealing with overfitting, which can occur when the model is too complex or when there is insufficient training data. Overfitting can lead to poor generalization performance and make the model less effective in real-world applications. Finally, LSTM can suffer from the issue of exploding gradients, where the gradients used in training become too large and cause the model to become unstable. Techniques such as gradient clipping and weight regularization can be used to mitigate this issue [35].
• Forget Gate: The primary function of the forget gate in the LSTM architecture is to determine when input from a previous time step should be maintained or discarded.

Arabic Question-Answering System Based on Deep Learning Models

149

Fig. 5 Long short-term memory model

Fig. 6 Long short-term memory gates

f[t] = σ_g(W_f x[t] + U_f h[t − 1] + b_f)    (5)

where σ_g(·) is some nonlinearity such that 0 ≤ σ_g(·) ≤ 1.
• Input Gate: to update the cell state, the input gate performs the following operations. First, the second sigmoid function receives two arguments: the current input x[t] and the previous hidden state h[t − 1]. The transformed values range from 0 (not significant) to 1 (significant). The cell accepts input only when the input gate is open, i.e., i[t] = 1.

c[t] = f[t] c[t − 1] + i[t] σ_h(W_c x[t] + U_c h[t − 1] + b_c)    (6)

where t = time step, i[t] = input gate at t, W_i = weight matrix of the sigmoid operator for the input gate, b_i = the corresponding bias vector, c[t − 1] = cell state at the previous time step, W_c = weight matrix of the tanh operator between the cell-state information and the network output, and b_c = the bias vector w.r.t. W_c.
• Output Gate: the current input and the previous output determine the output gate’s operation.

150

S. A. Al-azani and C. Namrata Mahender

o[t] = σ_g(W_o x[t] + U_o h[t − 1] + b_o)    (7)

where t = time step, o[t] = output gate at t, W_o = weight matrix of the output gate, b_o = the bias vector w.r.t. W_o, and h[t] = the LSTM output.
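To make Eqs. (5)–(7) concrete, the following NumPy sketch runs a single LSTM cell over a short random sequence. The parameter names (W_f, U_f, b_f, …) and the small dimensions are illustrative assumptions for demonstration, not the chapter's actual implementation.

```python
# Minimal sketch of one LSTM time step (Eqs. (5)-(7)); weights are random.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: returns the new hidden state h_t and cell state c_t."""
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])      # forget gate, Eq. (5)
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])      # input gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                                 # cell-state update, Eq. (6)
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])      # output gate, Eq. (7)
    h_t = o_t * np.tanh(c_t)                                           # LSTM output
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3                                  # toy dimensions
p = {f"W_{g}": rng.normal(size=(d_hid, d_in)) for g in "fico"}
p.update({f"U_{g}": rng.normal(size=(d_hid, d_hid)) for g in "fico"})
p.update({f"b_{g}": np.zeros(d_hid) for g in "fico"})

h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):                # a sequence of 5 time steps
    h, c = lstm_step(x, h, c, p)
```

Since h_t = o_t · tanh(c_t) with o_t in (0, 1), each component of the hidden state is bounded in magnitude by 1, which keeps the recurrence numerically stable.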

11.3 Gated Recurrent Unit (GRU)

A gated recurrent unit (GRU) is a type of recurrent neural network (RNN) architecture that was introduced in 2014. It is similar to the better-known long short-term memory (LSTM) architecture but has fewer parameters and is faster to train. Like other RNNs, GRUs are designed to process sequential data, such as time series or natural language sentences. The key difference between GRUs and other RNNs is the use of gating mechanisms to control the flow of information through the network. A GRU unit has two gates, an update gate and a reset gate, that control the flow of information through the unit (see Fig. 7). The update gate determines how much of the previous hidden state should be preserved, while the reset gate determines how much of the new input should be incorporated into the new hidden state [35]. The equations that govern the operation of a GRU unit are as follows:

Update gate: z_t = sigmoid(W_z [h_{t−1}, x_t])    (8)

Reset gate: r_t = sigmoid(W_r [h_{t−1}, x_t])    (9)

Candidate hidden state: h̃_t = tanh(W_h [r_t ∗ h_{t−1}, x_t])    (10)

Hidden state: h_t = (1 − z_t) ∗ h_{t−1} + z_t ∗ h̃_t    (11)

where h_t is the hidden state at time t, x_t is the input at time t, z_t is the update gate at time t, r_t is the reset gate at time t, and h̃_t is the candidate hidden state at time t.

Fig. 7 Gated Recurrent Unit model
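As a hedged illustration of Eqs. (8)–(11), the NumPy sketch below implements one GRU step on the concatenated vector [h_{t−1}, x_t]; the weight names and dimensions are invented for the example and are not taken from the chapter's code.

```python
# Minimal sketch of one GRU time step (Eqs. (8)-(11)); weights are random.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step: returns the new hidden state h_t."""
    xh = np.concatenate([h_prev, x_t])                  # [h_{t-1}, x_t]
    z_t = sigmoid(p["W_z"] @ xh)                         # update gate, Eq. (8)
    r_t = sigmoid(p["W_r"] @ xh)                         # reset gate, Eq. (9)
    h_tilde = np.tanh(p["W_h"] @ np.concatenate([r_t * h_prev, x_t]))  # Eq. (10)
    return (1 - z_t) * h_prev + z_t * h_tilde            # interpolation, Eq. (11)

rng = np.random.default_rng(1)
d_in, d_hid = 4, 3                                      # toy dimensions
p = {k: rng.normal(size=(d_hid, d_hid + d_in)) for k in ("W_z", "W_r", "W_h")}

h = np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):                    # a sequence of 5 time steps
    h = gru_step(x, h, p)
```

Note that Eq. (11) is a convex combination of the previous hidden state and the candidate state, so with only two gates (and no separate cell state) the GRU needs fewer parameters than the LSTM above.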


12 Prepare the Dataset

We used questions answered by students from grade 4 to grade 9, acquired from Arabic subject books. The dataset consists of sets of answers written by students, around 10,000 answers in total, and is stored as a tab-separated values file consisting of answer labels and the answer set.

12.1 Collecting Data

We acquired our data from different Arabic books (as mentioned in the section above). We collected a total of around 10,000 subjective-type answers (long answers), as in the examples mentioned below.


13 Data Preprocessing

1. Punctuation Removal: we first eliminated all integers and extra whitespace, then removed all punctuation marks from the answer sentences.
2. Tokenization: this is the operation of partitioning a series of document text; a paragraph is a stream of sentences, and each sentence is split into word tokens. The sentence tokenization is as follows:


The sentence was then divided into tokens after cleanup. We then extract features from it, such as word count, character count, average word length, misspelled words, and POS tags.
3. Padding Sequences: not all sequences have the same length; some are naturally longer than others. Since the inputs must all be of the same size, padding is applied where required.
4. Feature Extraction: to produce feature vectors, we used the word2vec model design. Word2vec is a natural language processing (NLP) technique first published in 2013. To learn word associations from a large corpus of text, the word2vec algorithm uses a neural-network model. Once trained, such a model can suggest additional words for a partial sentence. The word2vec variant used here (Skip-Gram) takes a text corpus as input and produces feature vectors as output. The word2vec model creates vectors that are numerical representations of word features, including the context of the individual word. Here is an example of a longer Arabic answer sentence that can be used with the skip-gram model:

This sentence means “The company has many branches in different cities around the world and works to provide innovative technological solutions to customers in various industries, including medical, electronic, and automotive industries.” The skip-gram model can learn the relationship between different words in this sentence, such as “company,” “branches,” “technological,” and “customers.” By training on a large corpus of text data, the skip-gram model can generate high-quality word embeddings that can be used to analyze and understand complex Arabic answer sentences (see Fig. 8).
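To illustrate how skip-gram forms its training signal, the sketch below generates (target, context) pairs from a tokenized sentence with a context window of 2. The English gloss of the example sentence stands in for the tokenized Arabic text, and the window size is an assumed hyperparameter, not one stated in the chapter.

```python
# Hedged sketch of skip-gram (target, context) pair generation.
def skipgram_pairs(tokens, window=2):
    """For each target word, pair it with every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# English stand-in for the tokenized Arabic sentence above.
sentence = "the company has many branches in different cities".split()
pairs = skipgram_pairs(sentence, window=2)
```

Each pair becomes one training example for the network, which learns to predict the context word from the target word; words sharing many contexts end up with similar embedding vectors.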


Fig. 8 Skip-Gram model

5. Data Splitting: we divide the data into two subsets, one used to train the model and the other to assess (test) it. Data splitting is an important aspect of data science, especially when creating data-driven models; the train/test procedure is a technique for estimating the accuracy of our models. We use 80% of the data for training and 20% for testing: the training set is used to train the model, and the testing set is used to test it.
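Preprocessing steps 1–3 and 5 above can be sketched in plain Python as follows; the sample sentences, the <PAD> token, and the fixed random seed are illustrative assumptions rather than the chapter's actual code.

```python
# Hedged sketch of cleanup -> tokenize -> pad -> 80/20 split.
import random
import re

def clean(text):
    text = re.sub(r"\d+", "", text)            # step 1: drop integers
    text = re.sub(r"[^\w\s]", "", text)        # step 1: drop punctuation
    return re.sub(r"\s+", " ", text).strip()   # normalize whitespace

def tokenize(text):
    return clean(text).split()                  # step 2: word tokens

def pad(seqs, pad_token="<PAD>"):
    max_len = max(len(s) for s in seqs)         # step 3: equal-length inputs
    return [s + [pad_token] * (max_len - len(s)) for s in seqs]

def train_test_split(data, test_ratio=0.2, seed=42):
    idx = list(range(len(data)))                # step 5: reproducible 80/20 split
    random.Random(seed).shuffle(idx)
    cut = int(len(data) * (1 - test_ratio))
    return [data[i] for i in idx[:cut]], [data[i] for i in idx[cut:]]

answers = ["The company has 12 branches.", "Innovative solutions!"]
padded = pad([tokenize(a) for a in answers])
train, test = train_test_split([f"answer_{i}" for i in range(10000)])
```

Shuffling before the split keeps the two subsets disjoint but similarly distributed; with 10,000 answers this yields 8,000 training and 2,000 testing samples.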

14 Results and Discussion

Deep learning networks make it possible to develop and apply learning models to various question-answering tasks. In the current work, three deep learning-based question-answering models (RNN, LSTM, and GRU) are applied and their effectiveness is measured. Training and testing performance is measured on all three models, as Figs. 9–12 show. To measure accuracy, precision, recall, and F1-score, Eqs. (12)–(15) are used to compute the metrics presented in Table 2.

Accuracy = (TP + TN) / (TP + FP + TN + FN)    (12)

Precision = TP / (TP + FP)    (13)

Recall = TP / (TP + FN)    (14)

F1-score = (2 × Precision × Recall) / (Precision + Recall)    (15)
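The four metrics can be computed directly from confusion-matrix counts, as in the sketch below; the counts (TP = 80, FP = 10, TN = 85, FN = 25) are made up for illustration and are not the chapter's results.

```python
# Hedged sketch of Eqs. (12)-(15) from confusion-matrix counts.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)           # Eq. (12)
    precision = tp / (tp + fp)                            # Eq. (13)
    recall = tp / (tp + fn)                               # Eq. (14)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (15)
    return accuracy, precision, recall, f1

# Illustrative counts only.
acc, prec, rec, f1 = metrics(tp=80, fp=10, tn=85, fn=25)
```

Precision penalizes false positives while recall penalizes false negatives; the F1-score is their harmonic mean, so it is high only when both are high.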


Fig. 9 Performance metrics on RNN model

Fig. 10 Performance metrics on LSTM model

Fig. 11 Performance metrics on GRU model


Fig. 12 Accuracy and Loss validation

Table 2 Difference in accuracy, precision, recall, and F1-score in RNN, LSTM, and GRU models

Measurements    RNN     LSTM    GRU
Precision       0.78    0.81    0.79
Recall          0.82    0.76    0.80
F1-score        0.79    0.77    0.78
Accuracy        0.80    0.82    0.85

15 Conclusion and Future Work

The first part of this chapter covered the concepts of natural language processing and details of the question-answering system. Question answering is built using natural language processing and information retrieval techniques. The process is divided into three modules, namely the question processing module, the document processing module, and the answer extraction module. Arabic is nowadays an official language in more than 20 countries and the mother tongue of more than 300 million native speakers. This chapter highlights the important challenges of Arabic natural language and question-answering systems. The second part covered the field of neural networks. We obtained good learning with the implementation of


neural networks using TensorFlow and Keras, with additional libraries to support the Arabic language. We were able to build RNN, LSTM, and GRU neural networks for predicting the correct answers after training. RNN, LSTM, and GRU are all types of recurrent neural networks designed to process sequential data. While plain RNNs are the simplest form of RNN and can be used for simple tasks, they suffer from the problem of vanishing gradients when processing long sequences. LSTM and GRU were developed to address this problem by introducing gating mechanisms that regulate the flow of information. LSTM uses three gates (input, forget, and output) and a memory cell to selectively remember or forget previous inputs. GRU, on the other hand, uses only two gates (reset and update), making it computationally less expensive than LSTM while still being able to model long-term dependencies. We obtained training accuracy of up to 94% for the RNN, 91% for the LSTM, and 99% for the GRU, and testing accuracy of up to 85% for the RNN, 84% for the LSTM, and 95% for the GRU. To improve accuracy further, we want to thoroughly analyze the current model and make additional modifications. To track changes in the system’s performance, we will put a stacked, bidirectional LSTM-based RNN network into use.

References

1. Pazos-Rangel, R.A., Rivera, G., Martínez, J., Gaspar, J., Florencia-Juárez, R.: Natural language interfaces to databases: a survey on recent advances. In: Handbook of Research on Natural Language Processing and Smart Service Systems, pp. 1–30. IGI Global (2021). https://doi.org/10.4018/978-1-7998-4730-4.ch001
2. Pazos-Rangel, R.A., Florencia-Juarez, R., Paredes-Valverde, M.A., Rivera, G.: Preface. In: Handbook of Research on Natural Language Processing and Smart Service Systems, pp. xxv–xxx. IGI Global (2021). https://doi.org/10.4018/978-1-7998-4730-4
3. Ishwari, K.S.D., Aneeze, A.K.R.R., Sudheesan, S., Karunaratne, H.J.D.A., Nugaliyadde, A., Mallawarrachchi, Y.: Advances in natural language question answering: a review (2019). https://doi.org/10.48550/arXiv.1904.05276
4. Kolomiyets, O., Moens, M.-F.: A survey on question answering technology from an information retrieval perspective. Inf. Sci. 181(24), 5412 (2011). https://doi.org/10.1016/j.ins.2011.07.047
5. Kodra, K., Kajo, E.: Question answering systems: a review on present developments, challenges and trends. Int. J. Adv. Comput. Sci. Appl. 8(9) (2017)
6. Ray, S.K., Shaalan, K.: A review and future perspectives of Arabic question answering systems. IEEE Trans. Knowl. Data Eng. 28, 3169–3190 (2016). https://doi.org/10.1109/TKDE.2016.2607201
7. Ahmed, W., PV, A., Babu Anto, P.: Web-based Arabic question answering system using machine learning approach. Int. J. Adv. Res. Comput. Sci. 8(1) (2017)
8. Feng, M., Xiang, B., Glass, M.R., Wang, L., Zhou, B.: Applying deep learning to answer selection: a study and an open task. In: IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) (2015). https://doi.org/10.1109/ASRU.2015.7404872
9. Otter, D.W., Medina, J.R., Kalita, J.K.: A survey of the usages of deep learning for natural language processing. IEEE Trans. Neural Netw. Learn. Syst. (2020). https://doi.org/10.1109/TNNLS.2020.2979670


10. Farghaly, A., Shaalan, K.: Arabic natural language processing: challenges and solutions. ACM Trans. Asian Lang. Inf. Process. (TALIP) 8(4), 1–22 (2009). https://doi.org/10.1145/1644879.1644881
11. Khurana, D., Koli, A., Khatter, K., Singh, S.: Natural language processing: state of the art, current trends and challenges. Multimed. Tools Appl. 82(3), 3713–3744 (2023). https://doi.org/10.1007/s11042-022-13428-4
12. Salloum, S.A., Khan, R., Shaalan, K.: A survey of semantic analysis approaches. In: Proceedings of the International Conference on Artificial Intelligence and Computer Vision, pp. 61–70. Springer Nature Switzerland AG (2020). https://doi.org/10.1007/978-3-030-44289-7_6
13. Xie, X., Song, W., Liu, L., Du, C., Wang, H.: Research and implementation of automatic question answering system based on ontology. In: The 27th Chinese Control and Decision Conference (2015 CCDC), pp. 1366–1370 (2015). https://doi.org/10.1109/CCDC.2015.7162131
14. Ojokoh, B., Adebisi, E.: A review of question answering systems. J. Web Eng. 17(8), 717–758 (2019). https://doi.org/10.13052/jwe1540-9589.1785
15. Haris, S.S., Omar, N.: Bloom’s taxonomy question categorization using rules and N-gram approach. J. Theor. Appl. Inf. Technol. 76(3) (2015)
16. Poonguzhali, R., Lakshmi, D.R.K.: Analysis on the performance of some standard deep learning network models for question answering task. Networks 7(14) (2020)
17. Mishra, V., Khilwani, N.: Recent trends in natural language question answering systems: a survey. IJEDR 7(4) (2019). ISSN: 2321-9939
18. Kumar, S.G., Zayaraz, G.: Concept relation extraction using Naive Bayes classifier for ontology-based question answering systems. J. King Saud Univ. (2014). https://doi.org/10.1016/j.jksuci.2014.03.001
19. Kumari, V., Keshari, S., Sharma, Y., Goel, L.: Context-based question answering system with suggested questions. In: 2022 12th International Conference on Cloud Computing, Data Science & Engineering (Confluence) (2022). https://doi.org/10.1109/Confluence52989.2022.9734207
20. Nakov, P., Hoogeveen, D., Màrquez, L., Moschitti, A., Mubarak, H., Baldwin, T., Verspoor, K.: SemEval-2016 Task 3: community question answering. In: Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16–17, 2016, pp. 525–545 (2016). https://doi.org/10.18653/v1/S17-2051
21. Mishra, A., Jain, S.K.: A survey on question answering systems with classification. J. King Saud Univ. Comput. Inf. Sci. 28, 345–361 (2016). https://doi.org/10.1016/j.jksuci.2014.10.007
22. Setio, B., Ayu, P.: Statistical-based approach for Indonesian complex factoid question decomposition. Int. J. Electr. Eng. Inform. 8(2), 356–373 (2016)
23. Yaghan, M.A.: “Arabizi”: a contemporary style of Arabic slang. Des. Issues 24(2), 39–52 (2008). https://doi.org/10.1162/desi.2008.24.2.39
24. Ray, S.K., Shaalan, K.: A review and future perspectives of Arabic question answering systems. IEEE Trans. Knowl. Data Eng. 28, 3169–3190 (2016). https://doi.org/10.1109/TKDE.2016.2607201
25. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017). https://doi.org/10.48550/arXiv.1706.03762
26. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). https://doi.org/10.48550/arXiv.1810.04805
27. Antoun, W., Baly, F., Hajj, H.: AraBERT: transformer-based model for Arabic language understanding. In: Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pp. 9–15. European Language Resources Association, Marseille (2020)


28. https://doi.org/10.48550/arXiv.2003.00104
29. Almiman, A., Osman, N., Torki, M.: Deep neural network approach for Arabic community question answering. Alex. Eng. J. 59(6), 4427–4434 (2020). https://doi.org/10.1016/j.aej.2020.07.048
30. Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., Yih, W.T.: Dense passage retrieval for open-domain question answering. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781. Association for Computational Linguistics (2020)
31. https://doi.org/10.48550/arXiv.2004.04906
32. Guu, K., Lee, K., Tung, Z., Pasupat, P., Chang, M.W.: REALM: retrieval-augmented language model pre-training
33. Huang, Z., Xu, S., Hu, M., Wang, X., Qiu, J., Fu, Y., Zhao, Y., Peng, Y., Wang, C.: Recent trends in deep learning based open-domain textual question answering systems. IEEE Access 8, 94341–94356 (2020). https://doi.org/10.1109/ACCESS.2020.2988903
34. Vinyals, O., Le, Q.: A neural conversational model. arXiv:1506.05869 (2015). https://doi.org/10.48550/arXiv.1506.05869
35. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
36. Wang, D., Nyberg, E.: A long short-term memory model for answer sentence selection in question answering. In: ACL-IJCNLP 2015, July 26–31, 2015, Beijing, China, Volume 2: Short Papers, pp. 707–712 (2015)
37. Nowak, J., Taspinar, A., Scherer, R.: LSTM recurrent neural networks for short text and sentiment classification. In: Artificial Intelligence and Soft Computing: 16th International Conference, ICAISC 2017, Zakopane, Poland, June 11–15, 2017, Proceedings, Part II. Springer International Publishing (2017). https://doi.org/10.1007/978-3-319-59060-8_50
38. Minh, D., et al.: Deep learning approach for short-term stock trends prediction based on two-stream gated recurrent unit network. IEEE Access 6, 55392–55404 (2018). https://doi.org/10.1109/ACCESS.2018.2868970

Healthcare-Oriented Applications

Machine and Deep Learning Algorithms for ADHD Detection: A Review

Jonathan Hernández-Capistran, Laura Nely Sánchez-Morales, Giner Alor-Hernández, Maritza Bustos-López, and José Luis Sánchez-Cervantes

Abstract Attention Deficit Hyperactivity Disorder (ADHD) is a neurodevelopmental disorder common in childhood. In 2017, a report from the WHO (World Health Organization) indicated that the prevalence of ADHD in childhood was a median of 2.2%, with a range of 0.1–8.1% and an interquartile range of 0.9–2.9%. The ADHD diagnosis is usually made by psychiatrists or specialized pediatricians verifying the DSM-5 diagnostic criteria. Early and timely diagnosis provides the possibility of access to appropriate treatment. Adequate treatment can reduce the impairments generated by this disorder in areas such as social relationships, concentration, attention and interest, and school performance. In recent years, systems for the automated detection of ADHD have been developed using machine learning (ML) and deep learning (DL) algorithms. In this chapter, we present a review of ML and DL algorithms related to ADHD detection. The contribution of this review is to provide a guide that enables specialists to identify different Artificial Intelligence (AI) techniques for ADHD detection, including the different techniques applied for the extraction and selection of characteristics used for ADHD detection.

Keywords ADHD detection · Artificial intelligence · Deep learning · Machine learning

J. Hernández-Capistran · G. Alor-Hernández (B) · M. Bustos-López Tecnológico Nacional de México/ I. T. Orizaba, Oriente 9 #852, Colonia Emiliano Zapata, 94320 Orizaba, VER, México e-mail: [email protected] J. Hernández-Capistran e-mail: [email protected] M. Bustos-López e-mail: [email protected] L. N. Sánchez-Morales · J. L. Sánchez-Cervantes CONAHCYT-Instituto Tecnológico de Orizaba, Oriente 9 #852, Colonia Emiliano Zapata, 94320 Orizaba, VER, México e-mail: [email protected] J. L. Sánchez-Cervantes e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 G. Rivera et al.

Abstract Attention Deficit Hyperactivity Disorder (ADHD) is a neurodevelopmental disorder common in childhood. In 2017, a report from the WHO (World Health Organization) indicated that the prevalence of ADHD in childhood was a median of 2.2% with a range of 0.1–8.1% and an interquartile range of 0.9–2.9%. The ADHD diagnosis is usually made by psychiatrists or specialized pediatricians verifying the DSM-5 diagnostic criteria. Early and timely diagnosis provides the possibility of access to appropriate treatment. Adequate treatment can improve the affectations generated by this disorder, such as social relationships, lack of concentration, lack of attention or interest, and school performance. In recent years, systems for the automated detection of ADHD have been developed using machine learning (ML) and deep learning (DL) algorithms. In this chapter, we present a review of ML and DL algorithms related to the ADHD detection. The contribution of this review is to provide a guide to enable specialists to identify different Artificial Intelligence (AI) techniques for the ADHD detection. This guide provides different techniques applied for the extraction and selection of characteristics used for the ADHD detection. Keywords ADHD detection · Artificial intelligence · Deep learning · Machine learning J. Hernández-Capistran · G. Alor-Hernández (B) · M. Bustos-López Tecnológico Nacional de México/ I. T. Orizaba, Oriente 9 #852, Colonia Emiliano Zapata, 94320 Orizaba, VER, México e-mail: [email protected] J. Hernández-Capistran e-mail: [email protected] M. Bustos-López e-mail: [email protected] L. N. Sánchez-Morales · J. L. Sánchez-Cervantes CONAHCYT-Instituto Tecnológico de Orizaba, Oriente 9 #852, Colonia Emiliano Zapata, 94320 Orizaba, VER, México e-mail: [email protected] J. L. Sánchez-Cervantes e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 G. Rivera et al. 
(eds.), Innovations in Machine and Deep Learning, Studies in Big Data 134, https://doi.org/10.1007/978-3-031-40688-1_8


1 Introduction

Attention Deficit Hyperactivity Disorder (ADHD), according to the World Health Organization (WHO), is one of the most common neurodevelopmental disorders of childhood. ADHD is characterized by a persistent lack of attention or hyperactivity-impulsivity. It is usually diagnosed in childhood but often lasts into adolescence and even adulthood. In 2017, WHO reported surveys from 20 representative countries that assessed adult ADHD. The surveys included 11 countries classified by the World Bank as high-income, five as upper-middle-income, and four as low-/lower-middle-income [1]. According to this information, the prevalence of ADHD in childhood averaged 2.2%, with a range of 0.1–8.1% and an interquartile range of 0.9–2.9%. High-income countries showed a prevalence of 3.3%, while upper-middle- and low-/lower-middle-income countries showed prevalences of 2.2% and 0.6%, respectively. ADHD is detected more frequently in boys than in girls [2–4] because hyperactivity symptoms are more marked in boys than in girls; the male gender bias of ADHD is estimated at between 2:1 and 3:1 [5].

The ADHD detection is primarily made clinically. This means that an expert in the diagnosis of ADHD (usually a psychiatrist or a specialized pediatrician) performs the necessary clinical evaluations to determine whether or not the individual meets the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) diagnostic criteria. Early diagnosis can provide the possibility of accessing appropriate clinical and/or therapeutic treatment. From this perspective, early treatment can help improve symptoms or skills such as social relationships, concentration, attention or interest, and school performance, to name a few. On the other hand, early attention to ADHD symptoms can help reduce school rejection, child abuse, and social isolation, and even prevent drug use.
Therefore, it is necessary to address a specific review of novel algorithms, techniques, and approaches, such as ML and DL, useful in the diagnosis/treatment of ADHD. ML and DL algorithms, due to their characteristics, can facilitate and simplify diagnosis/treatment tasks of ADHD. In this sense, if medical specialists have adequate diagnostic tools, it is possible to reduce the gap for patients to receive early and timely treatment. This represents one of the important challenges in the study of ADHD. In addition, a timely review of ML and DL algorithms for the diagnosis/treatment of ADHD will allow researchers to have a catalog of algorithms to develop new tools. According to the reviewed literature, systems for the automated detection of ADHD have been proposed using ML and DL algorithms, such as those reported by [6–9]. Reviews such as [10–14] presented different AI algorithms useful for the diagnosis/treatment of ADHD. Other proposals, as in [15], addressed a review of tools focused on the diagnosis and discussed from a general approach. Although a review of ML and DL algorithms is addressed in [15], it is important to point out that some aspects were not considered. In contrast to those proposed in [15], the reasons for our review are: (1) to address in greater depth the research work not related to the MRI diagnostic tool. That is, we


present a particular description of each work, mentioning the algorithm and evaluation metrics used, the target population, the databases, and the methodology. (2) We added other research works that had not been considered in the search methodology used in [15]. (3) We described the areas of the body where the different tools are used and the reason for this; in addition, we included the classifiers used for each algorithm, separated into ML and DL. (4) We include an explanation of different datasets useful for ADHD diagnosis. (5) We describe the advantages and disadvantages of the different ML and DL algorithms and indicate the tool with which they have been implemented. Therefore, the motivation of our review is to serve as a basis for research focused on solving the problems still present in the diagnosis of ADHD using AI techniques, so as to enable, in the future, the use of AI techniques in the daily practice of ADHD diagnosis. Accordingly, we present an analysis of ML and DL algorithms related to ADHD detection, including approaches different from MRI applications. The contributions of this review are: (1) to provide a guideline that allows specialists to identify different AI techniques for ADHD detection; (2) to help eliminate the subjectivity and extreme dependence on the level of medical expertise in the diagnosis and treatment of ADHD; and (3) to provide a wide scope of the different techniques applied for the extraction and selection of features used for the diagnosis of ADHD. This chapter is structured as follows: Sect. 2 presents the search methodology. Section 3 discusses the most related works for ADHD detection. Section 4 presents a set of approaches for ADHD detection. Section 5 describes a set of datasets used for ADHD diagnosis. Section 6 includes a review of the main ML and DL classifiers for ADHD detection.
Section 7 describes the main trends and challenges of ADHD diagnosis/detection from the perspective of deep learning and ML algorithms. Finally, Sect. 8 describes the conclusions.

2 Research Methodology

Our review is based on Arksey and O’Malley’s [16] methodological framework for conducting studies, as well as Levac’s [17] recommendations on that framework. Therefore, the methodology for our review is divided into three main stages: (1) identify relevant studies, (2) select relevant studies, and (3) summarize and report findings. The first stage is the search for scientific papers through the main sources related to the topic, as well as the exclusion of certain studies according to defined criteria. In the second stage, the studies resulting from the previous stage are evaluated; exclusion is made according to title, abstract, etc., and the request and retrieval of the studies is performed. In the last stage, all the scientific articles resulting from the previous stages are presented. These stages are described in more detail below.


The first stage of the search strategy consisted of identifying the repositories in which the search for scientific articles would be carried out. Thus, we adopted the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) model proposed by Page et al. [18] to analyze the most relevant studies on ADHD diagnosis using ML or DL algorithms. The PRISMA statement was developed to facilitate transparent and complete reporting of systematic reviews and has been updated (to PRISMA 2020) to reflect recent advances in systematic-review methodology and terminology. The PRISMA statement is the most used (with 1,900+ citations) for elaborating systematic reviews and bibliometric analyses (e.g., [19]). Consequently, six primary computer science digital libraries were selected, (1) ACM Digital Library, (2) IEEE, (3) PubMed, (4) Springer, (5) Science Direct, and (6) Wiley Online Library, as well as the web source of (7) Google Scholar.

The second stage consisted of performing a keyword-based search using the keywords shown in Table 1.

Table 1 Keywords and related concepts

Area    Keywords                                           Related concepts
ADHD    ADHD; attention deficit hyperactivity disorder;    Diagnosis; screening;
        learning; artificial intelligence; EEG; body;      instrumental diagnosis
        pupillometry; eye; multi-modal; actigraphy;
        wearables; gait

All publications from the period 2018–2022 were systematically searched. Then, one query was performed as follows to search for studies in each selected repository:

(“ADHD” OR “attention deficit hyperactivity disorder”) AND (“machine learning” OR “deep learning”) AND (“EEG” OR “body” OR “pupillometry” OR “eye” OR “multi-modal” OR “actigraphy” OR “wearables” OR “gait”)

At the end of the search process, we found 2188 relevant results, as illustrated in the PRISMA flowchart of Fig. 1. We reduced this to 863 after removing 1312 publications with duplicated titles and 13 reviews. Subsequently, we screened the titles and abstracts of the publications and removed 467 articles focusing on topics other than the diagnosis of ADHD using AI. We were left with 396 articles, which were sought for retrieval, but only 232 could be downloaded.

Our main topic is the diagnosis of ADHD using ML and DL, focusing mainly on diagnostic tools other than MRI. Therefore, after reading, all papers not related to this topic were eliminated, along with those written in a language other than English


Fig. 1 PRISMA flow diagram for systematic filtering of articles

and those not relevant to this review. Finally, 37 journal articles were found eligible for inclusion in this review. Sections 3–7 refer to the last stage of the methodology used in this review, which is summarizing and reporting of findings.

3 Related Work

In this section, we present the most relevant works reported in the literature on machine and deep learning for ADHD detection. First, we discuss the works that have used different ML approaches; second, we discuss the works that have used different deep learning techniques.


3.1 Machine Learning Approaches

Riaz et al. [18] presented the classification of ADHD subjects against control subjects. Two types of features, non-image and image features, were used to create a single feature vector employed for classification. The non-image features are age, gender, verbal IQ, performance IQ, and full IQ. For the classification stage, a Support Vector Machine (SVM) was used, and the best result was obtained on the KKI sub-dataset of the ADHD-200 dataset, with an accuracy of 86.7%, sensitivity of 77.2%, and specificity of 90.1%. Shao et al. [20] focused on the imbalanced nature of the ADHD-200 dataset. To do this, they proposed a cost-sensitive three-objective classification model based on an SVM classifier to handle the problem of imbalanced data. The model considered the empirical errors for the positive and negative samples separately so that the class-imbalance problem could be handled effectively. The best accuracy was obtained on the Peking-2 sub-dataset, with 86.04%. Ahmed Salman et al. [21] reported that their major contribution was in the preprocessing stage, where three average time-series parcellations, Craddock 400, Craddock 200, and Automated Anatomical Labeling (AAL), were used to extract various sets of brain regions of interest in the functional analysis. In the classification stage, the best-performing classifier was the Hierarchical Extreme Learning Machine (ELM) with a hidden layer, with an accuracy of 99.72% for identifying control subjects and 98.06% for ADHD.

Pupillometry is a diagnostic test that measures the size of the pupil and its behavior in response to certain stimuli. A few papers use this method for the diagnosis and/or classification of ADHD. Silva et al. [22] used a rule-based system for the detection of ADHD based on fixation and saccade movements of the eye. These movements were obtained using the Tobii eye tracker from 14 participants aged 18–65 years.
From these data, the main features were extracted and fed into different classifiers. The classifier that obtained the best results was Random Forest (RF), with a maximum accuracy of 85.31%. Khanna et al. [23] proposed a Web-based application in which an Ensemble Voting model was integrated into the backend; the application records the biometric data of the pupils in real time, yielding a diagnostic probability. This method was tested with 50 participants and obtained sensitivity, specificity, and Area Under the Receiver Operating Characteristic (AUROC) of 0.821, 0.727, and 0.856, respectively. It is worth mentioning that a Convolutional Neural Network (CNN) was used for head and iris detection. Das et al. [24] extracted 22 features acquired from 50 subjects aged 10–12. The best accuracy was from the SVM classifier, with 0.762, along with sensitivity, specificity, and AUROC of 0.773, 0.753, and 0.856, respectively. Lev et al. [25] utilized a continuous performance test (MOXO-dCPT) as a platform into which an eye tracker was integrated. For the classification of the 70 participants, the algorithm that obtained the best results was the second stage of hierarchical logistic regression, with a correct classification of 75.76% for ADHD and 81.82% for controls.

Electroencephalogram (EEG) signals provide extensive information about cognitive abilities, so these signals can be used to detect ADHD. Alchalabi et al. [26] recorded these signals via a wireless device while the subject played a video game

Machine and Deep Learning Algorithms for ADHD Detection: A Review

169

that was played using an EMOTIV EPOC+ kit. For the classification of the signals coming from the control and ADHD subjects, an SVM algorithm was used. The test was performed with four subjects diagnosed with ADHD, aged between 18 and 23 years. The best mean accuracy obtained was 98.62% on the EMOTIV data. Tor et al. [27] achieved an accuracy of 97.88% with a K-Nearest Neighbors (KNN) classifier using a ten-fold validation strategy. EEG signals were acquired from 123 children, and Empirical Mode Decomposition (EMD) and Discrete Wavelet Transform (DWT) methods were used by the authors to decompose the EEGs. In turn, Holker et al. [28] proposed a method that extracts a collection of EEG signal features from the functional connectivity, spectral, rEEG, and amplitude domains. For this, they developed a database, available online [29], of 121 children with a mean age of 9.62 ± 1.75 years for controls and 9.85 ± 1.77 years for those diagnosed with ADHD. The highest accuracy, 81.82%, was obtained with the RF classifier. Another approach to detecting ADHD is analyzing facial and/or body movements; the latter is commonly known as actigraphy. Along these lines, Ochab et al. [30] acquired actigraphy data over seven days from 29 male subjects with a mean age of 9.89 ± 0.92 years. From these data, 47 characteristics were extracted, and the authors proposed a methodology to reduce them to four main characteristics. These four characteristics were used as input for a nearest neighbors classifier, obtaining 69.4 ± 1.6% accuracy, 78 ± 2.2% sensitivity, and 60.8 ± 2.6% specificity. It is worth mentioning that the DSM-5 was used for the diagnosis. Khademi et al. [31] evaluated personalized sleep–wake state classifiers, one per individual, and compared them with generalized sleep–wake state classifiers trained on data from the entire population.
A dataset of 54 participants was used, where the actigraphy data were captured simultaneously with polysomnography (PSG) recordings every 30 s. They concluded that personalized ML models outperform generalized models in terms of sleep parameter estimation and are indistinguishable from PSG-labeled sleep–wake states. Similarly, Choi et al. [32] designed a robot-assisted test that guided children ranging in age from 5 to 12 years through a series of activities. In addition, the robot tracked the children's movements using an RGB-D sensor. A total of 19 characteristics were extracted, 13 from the robot and three from a questionnaire. The best accuracy on the test set, 96.9%, was obtained using the RF classifier. On the other hand, Jiang et al. [33] designed a system named WeDA (Wearable Diagnostic Assessment) for children with ADHD. They used the RF model to reduce the dimension of the features to no more than 10. In addition, they used a Bayes network (BN) model and obtained a prediction accuracy of 82% for all symptoms. This system was tested with 162 participants aged 7 to 13 years. Hicks et al. [34] proposed an openly available online actigraphy dataset of 103 individuals aged 17 to 67 years. In addition, they provided a set of benchmark ML experiments to compare against the published dataset and assess its technical validity. The best results were obtained with the RF classifier, with an accuracy of 72%. A few papers discuss the detection and/or diagnosis of ADHD using questionnaires/surveys. For instance, Uluyagmur-Ozturk et al. [35] proposed a system that classified subjects with ADHD based on their performance during a

170

J. Hernández-Capistran et al.

face-image emotion recognition questionnaire. Participant responses and response times were used in the classification process for 61 participants with an average age of 9.72 years. A 90% accuracy rate was obtained with the AdaBoost classifier. Meanwhile, Trognon et al. [36] designed a questionnaire based on the DSM-5, which was applied to 220 subjects with a mean age of 27.8 years and a standard deviation of 9.2. From the responses, they extracted characteristics that were reduced by means of multiple regression procedures. They obtained an accuracy of 98%, with a sensitivity of 97% and specificity of 100%, for the XGBoost classifier. Finally, Maniruzzaman et al. [37] proposed a method to extract the most prominent risk factors of ADHD patients. For this, they used data from the National Survey of Children’s Health (NSCH, 2018–2019) containing information on 45,779 children aged 3–17 years. They showed that the RF classifier provided the highest accuracy of 85.5%, sensitivity of 84.4%, specificity of 86.4%, and AUC of 0.94.
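The studies surveyed above report accuracy, sensitivity, and specificity. As a reminder of how these figures are derived from a binary confusion matrix, the following is a minimal Python sketch; the label vectors are invented for illustration and do not come from any cited study.

```python
# Sensitivity, specificity, and accuracy from a binary confusion matrix.
# Labels: 1 = ADHD, 0 = control. These toy vectors are illustrative only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

sensitivity = tp / (tp + fn)      # true positive rate: ADHD correctly flagged
specificity = tn / (tn + fp)      # true negative rate: controls correctly cleared
accuracy = (tp + tn) / len(y_true)

print(sensitivity, specificity, accuracy)  # → 0.75 0.8333… 0.8
```

Reporting sensitivity and specificity alongside accuracy matters here precisely because of the class imbalance that Shao et al. [20] highlight: on a skewed dataset a classifier can score high accuracy while missing most positive cases.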

3.2 Deep Learning Approaches

The fMRI data are 3D data with a time component, so there has been great interest in using 3D CNNs to analyze the spatial components without the loss of information that occurs when projecting the data onto a 2D CNN. Shao et al. [38] used fMRI images for ADHD detection, testing the proposed method on the ADHD-200 global competition datasets. Each brain region was used to calculate an average time series, from which two features were obtained: a functional connection matrix and an Amplitude of Low-Frequency Fluctuation (ALFF) image. The best result was an 82.73% average accuracy using the gcForest classifier. Peng et al. [39] proposed a diagnostic tool applied to functional MRI (fMRI) and structural MRI (sMRI) images. The methodology was inspired by one of the methods used by medical experts for diagnosis: superimposing the fMRI color map on the sMRI, which makes it possible to localize neuronal activity in a specific brain region. They developed a dual 3D CNN deep learning model, one for fMRI and the other for sMRI; the two 3D CNNs were then combined using a proposed summation-based combination algorithm. Once combined, a dense neural network produced the ADHD prediction. The accuracy obtained was 72.89% on the ADHD-200 dataset. Similarly, Sims [40] obtained 99.69% accuracy using a multi-modal 3D CNN with two different Recurrent Neural Networks (RNNs), a Gated Recurrent Unit (GRU) and a Long Short-Term Memory (LSTM); deepfake images were generated with a 3D StyleGAN. All of this was also tested on the ADHD-200 dataset. The only work that applies deep learning techniques to pupillometry is by Ko et al. [41], who obtained gaze position data from the eyes of 25 subjects using VR games. The data are x and y coordinates with a duration between 3 and 7 min per player. They used the Zoom-in Neural Network (ZNN) classifier, obtaining a maximum average precision of 83.575%.
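The functional connection matrices used as features in works such as [38] are, in essence, correlation matrices between per-region average time series. A minimal NumPy sketch with synthetic signals follows; the region count, time-series length, and coupling strength are illustrative assumptions, not values taken from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
n_regions, n_timepoints = 6, 200   # illustrative sizes, not from [38]

# Synthetic per-region average BOLD time series (regions x timepoints).
ts = rng.standard_normal((n_regions, n_timepoints))
ts[1] = 0.7 * ts[0] + 0.3 * ts[1]  # make regions 0 and 1 co-fluctuate

# Functional connectivity: pairwise Pearson correlation between regions.
fc = np.corrcoef(ts)

print(fc.shape)  # (6, 6): one correlation per pair of regions
```

The resulting symmetric matrix (ones on the diagonal, strong off-diagonal entries for coupled regions) is exactly the kind of fixed-size, image-like object that downstream classifiers such as gcForest or a CNN can consume.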


In turn, Altınkaynak et al. [42] analyzed the EEG signal in the time and time–frequency domains. The Multilayer Perceptron method provided the best classification, with an accuracy rate of 91.3% and a Kappa index of 0.82, tested on a dataset of 46 participants aged 7–12 years. Vahid et al. [43] recorded the EEG signals of 144 subjects (8.5–13.5 years old) through electrodes using a BrainAmp amplifier. These signals were input into a compact CNN called EEGNet, and LOOS (Leave One Out Subject) validation was used to evaluate classification performance in distinguishing ADHD/ADD (Attention Deficit Disorder) patients from controls. The classification accuracy was 83 ± 23%. Chen et al. [44] used the connectivity matrix obtained from multichannel EEG input data as an image-like input for training CNN models. The signals were obtained from 107 subjects aged 9.69 to 11.61 years using a HydroCel Geodesic Sensor Net. The authors obtained an accuracy of 98.17% on the validation data and 94.67% on the test data. Dubreuil-Vall et al. [45] acquired EEG signals with the Starstim system from 40 participants with a mean age of 43.85 years. They implemented a CNN in TensorFlow, where the input was the Wavelet transform of the EEG data. The authors achieved a classification accuracy of 88% ± 1.12%. In turn, Ahmadi et al. [46] proposed a computer-aided diagnosis system tested on 40 children between 6 and 11 years old. They recorded 5 min of EEG signals under eyes-open conditions using a Mitsar amplifier. They fed the combination of three EEG bands to a CNN, obtaining the best accuracy, precision, and Kappa index of 99.46%, 99.48%, and 99%, respectively. Zhou et al. [47] obtained all data from children 6–16 years old diagnosed with ADHD, using 24 h of video from CADWELL’s video EEG monitoring system.
The best results were obtained by implementing a CNN, with an average model accuracy on test data reaching 97.7% and an average false negative rate of 2.2%. In turn, Joy et al. [48] used a dataset composed of 10 subjects between 7 and 12 years old, categorized into two resting-state groups, eyes-closed and eyes-open. The EEGs were decomposed into subbands using the Tunable Q-Factor Wavelet Transform (TQWT) technique, and these subbands were used for feature extraction with the Katz and Higuchi algorithms. An Artificial Neural Network (ANN) classifier obtained the best accuracy value of 100%. Jaiswal et al. [49] presented a methodology to aid in the diagnosis of the presence/absence of ADHD and ASD through automatic visual analysis of a person’s behavior. Approximately 12 min of video per subject from a total of 55 people over the age of 18 were analyzed. The videos were acquired with a Kinect device that captured depth and RGB images, both at high resolution. Six features were extracted from these videos and used as inputs for a CNN classifier, obtaining a 96% accuracy rate. Similarly, Amado-Caballero et al. [50] designed, trained, and deployed a CNN able to diagnose combined ADHD from a 24-h actigraphic recording of a child on a normal school day. The group consisted of 148 subjects between the ages of 6 and 15 years. The best-performing classifier was a 2D CNN with inputs in the time–frequency domain, with an accuracy of 98.57%, sensitivity of 97.62%, specificity of 99.52%, and an Area Under the Curve (AUC) of 99.93%. On the other hand, Hammam et al. [51] used an RGB camera to analyze the gait of 27 children between 6 and 10 years of age. To pre-train the classifier


model, the publicly available NTU RGB+D 120 database was used. A 4-layer 1D CNN was used as the encoding network. An accuracy of 72.46% was obtained using 90% of the dataset for testing and the Momentum Contrast (MoCo) method to update the network parameters during training. There are also some works that combine two or more diagnostic techniques for ADHD detection. Zhang et al. [52] designed a system that used a camera to capture limb movements, facial expressions, linguistic expressions, patient reaction skills, and eye movements, all while the patient performed requested tasks. A computer vision algorithm was then used to extract features, and a deep learning algorithm detected specific behaviors of the children. Testing and evaluation were completed with hundreds of children with ADHD. Meanwhile, Silva et al. [53] proposed a system that used fMRI data and eye movement data. The fMRI data were classified with a CNN that obtained an accuracy of 82%, while for the eye movement data the best-performing classifier was the RF, with 84.48% accuracy. Finally, they proposed an algorithm that combines both data sources to produce a prediction probability for ADHD patients. Lastly, Qin et al. [54] presented a method for feature extraction from fMRI images and phenotypic data. For the former, a CNN was used, while for the latter, an RF classifier was used. Finally, an ensemble learning approach takes the feature vectors from the two classifiers and trains a second-stage classifier to obtain the final classification result, with 74.5% accuracy on the test set. A special case was found where Hamedi et al. [55] used magnetoencephalogram (MEG) images recorded while the patient was in a resting state with eyes open. For this, they proposed a stacked ensemble learning method that combines base classifiers to improve the results obtained by each individual classifier. They obtained an accuracy of 89.92% on the Open MEG Archive (OMEGA) database [56].
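The second-stage (stacking) ensemble idea used by Qin et al. [54] and Hamedi et al. [55] can be sketched with scikit-learn's generic StackingClassifier on synthetic data. This is a generic illustration of stacking, not a reproduction of either cited pipeline; the base learners, meta-learner, and data are all assumptions chosen for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for extracted feature vectors (not real ADHD data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Base classifiers are combined by a second-stage (meta) classifier that
# learns from their cross-validated predictions.
stack = StackingClassifier(
    estimators=[("svm", SVC(probability=True, random_state=0)),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(),
)
stack.fit(X_tr, y_tr)
print(stack.score(X_te, y_te))
```

The design rationale matches the multi-modal works above: each modality (or base model) contributes its own prediction, and the meta-classifier learns how to weight them rather than relying on a fixed voting rule.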

4 Approaches for ADHD Detection Using AI Algorithms

As with many mental disorders, ADHD is still not fully understood. The diagnostic criteria for ADHD have improved with new research, but the tools and assessments have remained largely unchanged. Therefore, ADHD remains, to date, primarily a clinical diagnosis. Clinical diagnostic testing for ADHD includes an anamnesis of prenatal, perinatal, and family history. Environmental factors, school performance, and a detailed physical examination are also taken into account. During the latter, special attention is paid to vital signs and to the cardiovascular, cutaneous, thyroid, and neurological systems, including assessment of motor coordination. In addition, comorbid conditions should be looked for. Thus, different diagnostic tools such as MRI, actigraphy, and questionnaires, to mention but a few, are usually used. The diagnosis of ADHD is mostly based on the DSM, whose most recent version, DSM-5, was published in May 2013 [57]. Figure 2 presents a summary of the body parts from which data are collected for ADHD detection. It is divided into five main parts: brain, eyes, face, wrist, and general body movement. For the brain, data are acquired in two ways, by MRI images and


EEG signals. In the eyes, data are obtained from fixation and saccadic movements, as well as from pupil variation. Actigraphy data are mostly acquired at the wrist. The patient’s facial expressions have also been analyzed. Also, body movement data are acquired during everyday activities using sensors at the wrist, ankle, waist, and head; these movements, as well as the gait, are analyzed through videos using computer vision. For the brain, one of the most commonly used methods for ADHD detection is analyzing the structural and functional connectivity of the brain.

Fig. 2 The main human body parts where the data for ADHD detection are acquired

Structural connectivity defines the existence of white matter tracts that physically interconnect brain regions, while functional connectivity describes the dependencies between a pair of brain regions and their correlation over time. Tools such as EEG and fMRI are used to measure functional connectivity. The latter uses non-ionizing radiation to generate high-contrast images of soft tissues. Most of the body is made up of water molecules, which contain hydrogen atoms, whose nuclei have a single proton. MRI uses magnets that produce a powerful magnetic field that forces the protons in the body to align with that field. In addition, a radio frequency signal is applied through the body; the protons can absorb the radio frequency energy and move from a lower energy state to a higher one, that is, into parallel or antiparallel orientations with respect to the applied magnetic field. The protons absorb some of the radio frequency energy as they enter magnetic resonance (MR). When the radio frequency signal is turned off, the absorbed energy is re-emitted in the form of a radio frequency signal, which is measured by a tuned coil. Spinning protons from different tissues release energy at different rates because different tissues in the body have different chemical compositions and physical states [58]. Medical practitioners are able to identify the difference between the various types of tissues based on these magnetic properties. fMRI identifies active areas of the brain by detecting the level of oxygenated blood and is commonly used to monitor neuronal activity. In addition, a perfusion tracer can be used in these images to produce differential contrast in the tissues, usually to measure cerebral blood flow. fMRIs are essentially 3D images of the human brain captured over time. Each image consists of a large number of 3D pixels, better known as voxels.
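The 3D-plus-time voxel structure just described can be pictured as a 4D array. A minimal NumPy sketch follows; all dimensions, the voxel index, and the region-of-interest mask are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# fMRI as a 4D array: three spatial axes (voxels) plus a time axis.
x, y, z, t = 8, 8, 8, 50               # illustrative dimensions
fmri = rng.standard_normal((x, y, z, t))

# A single voxel's intensity over time approximates its BOLD signal.
voxel_signal = fmri[3, 4, 2, :]         # shape (50,)

# Averaging the voxels inside a region of interest gives a regional time
# series, the usual starting point for functional connectivity analysis.
roi_mask = np.zeros((x, y, z), dtype=bool)
roi_mask[2:5, 2:5, 2:5] = True          # a 3x3x3 toy "region"
roi_series = fmri[roi_mask].mean(axis=0)  # shape (50,)

print(voxel_signal.shape, roi_series.shape)
```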
The intensity of a voxel generates a signal known as the BOLD (blood-oxygen-level-dependent) signal. On the other hand, structural MRI (sMRI) is used to examine the anatomy and pathology of the brain. By measuring the amount of water at a given location, sMRI acquires a detailed anatomical image of the brain, which makes it possible to accurately distinguish between different types of tissue, such as gray and white matter. Also related to the brain, another method for ADHD detection uses EEG data. The EEG is a set of signals that allows the analysis of brain function by acquiring the electrical activity of the brain in a basal situation and in other situations such as photo-stimulation and hyperventilation; it is even often recorded during sleep. The collected electrical signal is amplified and presented in the form of waves, reflecting the activity of different brain areas over time. There are normal and abnormal patterns in these waves, and the abnormal ones raise suspicion of disorders. For data capture, the patient may, for example, be playing a video game specially designed for ADHD therapy [26]. In another scheme, data are captured during activities performed on a daily basis over 24 h [47]. Other approaches obtained these data under a visual attention task or attention experiments, such as the oddball paradigm and the Eriksen flanker task [45], or captured the signals in a resting state with eyes open [46]. The main characteristics extracted are in the time and frequency domains, and computer vision is used to detect patterns in the waves [44]. For the eye, the two most commonly used diagnostic tools are pupillometry and the detection of gaze fixation and saccadic movements. The pupil is


connected to a large set of neurological processes, which makes it responsive to various stimuli; that is, it is correlated with emotion, cognition, and arousal. Therefore, the pupillary response is related to neurological diseases. Similarly, saccadic movements and fixation reflect mental processes such as attention and anticipation, which are often impaired in people with ADHD. Different methods were used for data acquisition, such as attention and concentration tests [25], video games [41], visuospatial memory tasks [23, 24], and reading [22]. Another body part that reflects this disorder is the face, specifically the facial expressions a person makes when subjected to certain conditions. It is also known that the gait, the posture when performing certain activities, and the movements made while sleeping allow the detection of ADHD. Therefore, the analysis of body movements is usually made by actigraphy [30, 31, 34, 50], computer vision [32, 49], wearables [33], or a combination [51]. A variety of devices have been used for data capture, such as actigraphs, motion sensors, Kinect devices, and cameras; the data can therefore be images, sensor signals, or a combination of both. Movements have been captured in different situations, such as sleeping, daily activities, the classroom, and controlled experiments. On the other hand, AI can be broadly defined as the development of algorithms capable of solving cognitive problems commonly associated with human intelligence, such as problem-solving, pattern recognition, and learning. AI is mainly divided into two areas, machine learning (ML) and deep learning (DL) [59]. ML is a set of approaches that allows a system to learn from data instead of relying on explicit programming. In other words, it is about teaching the machine using previously processed data and information so that the machine learns to make decisions.
ML has the ability to modify itself when exposed to more data. When it is said that the machine “learns,” it means that ML algorithms try to minimize errors and maximize the likelihood that the predictions will be true. DL is a subset of ML that uses a collection of computational models and algorithms inspired by how the human brain works, known as artificial neural networks (ANNs). These neural networks learn in the same way as ML algorithms, from a set of data, but DL deals with a larger amount of data. Another difference is that DL does not require recurrent human intervention, since it learns from past errors. ANNs are divided into three types of layers: the input layer that receives the data, the output layer that produces the results of the processed data, and the hidden layers that extract the patterns existing in the data. The word “deep” in DL refers to the fact that it contains a large number of hidden layers, which allows it to perform more complex tasks. DL, in most cases, achieves higher accuracy than ML, but it requires a large volume of training data as well as substantial software and hardware resources. Because of its high accuracy, DL has been used in many areas, including computer vision, where in recent years a particular method known as the CNN has been used. In general, DL and ML perform classification, i.e., they decide to which set an unknown input belongs according to what has previously been learned from the training data set. To perform this classification, the two most commonly used types of learning models are supervised and unsupervised. The main difference


is that in supervised learning, the input and output training data are labeled, and this labeling is usually performed by a human, whereas in unsupervised learning the labeling process is not performed; that is, it does not require human intervention. The two approaches found for the automatic detection of ADHD using AI are detailed below.
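The supervised/unsupervised distinction can be illustrated in a few lines of scikit-learn. The data are synthetic; the two blobs merely stand in for two diagnostic groups and carry no clinical meaning.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Two separable groups standing in for "control" vs "ADHD" feature vectors.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Supervised: the labels y are given to the learner along with X.
clf = LogisticRegression().fit(X, y)
supervised_acc = clf.score(X, y)

# Unsupervised: only X is given; KMeans discovers the grouping by itself,
# without any human-provided labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(supervised_acc, np.unique(clusters))
```

Note that the unsupervised cluster indices (0/1) have no inherent diagnostic meaning; mapping clusters back to clinical categories still requires labeled examples or expert interpretation.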

4.1 Machine Learning-Based Approaches

For the MRI-based diagnostic tool, the SVM [18, 20, 24, 26] and Hierarchical ELM [21] algorithms stand out. For the automatic detection of ADHD from EEG, the SVM, KNN, and RF classifiers have been used, while for the eye, the ML classifiers used are RF, the Ensemble Voting model, SVM, and hierarchical logistic regression. Regarding the use of different parts of the body as diagnostic tools, the algorithms used are KNN, RF, and BN. One of the diagnostic tools that stands out most in the clinical literature is the use of questionnaires; however, in the literature review, only [35–37] were found to use this diagnostic tool for the detection of ADHD with machine learning. The questionnaire data are mainly based on the DSM-IV and DSM-5, with the exception of [35], which is mostly based on responses about which feeling is presented in a given image. These three works used AdaBoost, XGBoost, and RF classifiers.
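A common way to compare classifiers such as SVM, KNN, and RF on a fixed feature set is cross-validated accuracy. The following scikit-learn sketch runs on synthetic features (not ADHD data); the classifier settings are library defaults, not the tuned configurations of the reviewed studies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic feature vectors standing in for extracted diagnostic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

classifiers = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=0),
}
# Five-fold cross-validated accuracy for each candidate classifier.
results = {name: cross_val_score(clf, X, y, cv=5).mean()
           for name, clf in classifiers.items()}
for name, acc in results.items():
    print(name, round(acc, 3))
```

Cross-validation of this kind (the ten-fold variant is used by Tor et al. [27], and leave-one-subject-out elsewhere) is what makes the accuracy figures reported across the reviewed papers comparable.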

4.2 Deep Learning-Based Approaches

The main classifier used in the approaches utilizing MRI is the CNN. Some authors [39] transform the 3D image into a 2D or 1D representation of the main features, while others use a 3D CNN [60]. The Deep Forest classifier (gcForest) [38] has also been used. Many works [18–21, 30] use one or both types of MRI images (fMRI and sMRI), and in most cases the ADHD-200 database is used for the analysis. In EEG studies, the two classifiers used are the CNN and the multilayer perceptron. For the eye-based diagnostic tool, as far as is known, only one work was found, using a neural network called the Zoom-In Neural Network, which is based on a neuro-fuzzy algorithm. For the different parts of the body as diagnostic tools, only the CNN has been used. Finally, a few works [52–54] performed multi-modal detection, that is, detection using more than one diagnostic instrument. The combinations of instruments found were fMRI and pupillometry, fMRI and phenotypic data, and, finally, the combination of facial expressions, pupillometry, limb movements, linguistic expressions, and children’s reaction skills. All of these research works use the CNN classifier: the classification for each instrument is performed independently, and then ensemble learning is applied. The following section describes the most commonly used public databases for ADHD detection.
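The CNNs that dominate these approaches are built on one core operation: sliding a small filter over an image-like input, for example a connectivity matrix treated as an image, as in Chen et al. [44]. A toy NumPy implementation of that operation follows; it illustrates the mechanism only and is not any cited architecture (the input values and kernel are invented).

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution (no kernel flip, i.e. the cross-correlation
    actually computed by CNN layers)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# An image-like input, e.g. an EEG connectivity matrix treated as an image.
image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[1.0, -1.0], [1.0, -1.0]])  # crude vertical-edge filter

fmap = conv2d(image, edge_kernel)
print(fmap.shape)  # (5, 5): a feature map slightly smaller than the input
```

A real CNN stacks many such filtered feature maps, interleaved with nonlinearities and pooling, and learns the kernel values from data rather than fixing them by hand.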


5 Datasets for ADHD Detection

There are different datasets used for ADHD detection. The most used are HYPERAKTIV, ADHD-200, Working Memory and Reward in Children with and without ADHD, Working Memory and Reward in Adults, EEG DATA FOR ADHD, and FOCUS. Each dataset contains information suitable for ADHD detection.

5.1 Hyperaktiv

HYPERAKTIV [34] is a publicly available dataset with heart rate-related data from 51 adult ADHD patients and 52 clinical controls. It includes activity data, health attributes such as age and sex, output data from a computerized neuropsychological test, and information on the patient’s mental status. The HYPERAKTIV dataset was used in [61] for the diagnosis of ADHD in adults using ML algorithms. Meanwhile, ADHD-200 [62] offers a preprocessed dataset from the ADHD-200 World Competition. It consists of 776 anatomical and resting-state fMRI datasets, including 285 from children and adolescents with ADHD and the rest from typically developing individuals. ADHD-200 includes data on age, gender, diagnostic status, medication status, dimensional measures of ADHD symptoms, and intelligence quotient (IQ). The ADHD-200 dataset was utilized for the development of ADHD classification methods based on a 3D CNN in [63] and on multichannel deep learning neural networks in [64]; the results indicated accuracies of 69.15% and 95%, respectively.

5.2 Working Memory and Reward in Children with and Without ADHD

On the other hand, the dataset entitled “Working Memory and Reward in Children with and without ADHD” [65] is available at OpenNeuro.org. This dataset resulted from functional MRI (fMRI) data recorded during n-back tasks, whose goal was to explore monetary compensation, working memory, and feedback processing in typically developing children and children diagnosed with ADHD. Data were collected from 79 children aged 8.6–12 years; 35 of them had a formal diagnosis of ADHD [66].


5.3 Working Memory and Reward in Adults

In a subsequent study, a dataset entitled “Working Memory and Reward in Adults” [67] was created, which includes data from 24 individuals who performed the same n-back tasks. This dataset is also available at OpenNeuro.org. The dataset “Working memory and reward in children with and without attention deficit hyperactivity disorder (ADHD)” has antecedents in the studies presented in [68] and [69]. Booth et al. [68] analyzed functional MRI data obtained from four visuospatial working memory (VSWM) tasks to detect cases of ADHD in children; the results indicated 92.5% accuracy in the classification of ADHD. Hammer et al. [69] tested the interactive effect of feedback and reward on visuospatial working memory in children with ADHD. The tests collected MRI data from 17 ADHD children and 17 normal control children as they performed spatial monitoring of letters on a screen. The results indicated that the performance of ADHD children relative to normal controls is similar only when they were given feedback with a large reward.

5.4 EEG Data for ADHD

Meanwhile, EEG DATA FOR ADHD [70] is a dataset containing information from 61 children diagnosed with ADHD and 60 healthy controls. EEG recording was performed during visual attention tasks that consisted of showing a set of cartoon images; the children were asked to count the number of characters in each image, and each image was shown immediately after the child’s response. The recording was based on the 10–20 standard 19-channel EEG montage, with the A1 and A2 electrodes placed on the earlobes. The EEG DATA FOR ADHD dataset has been used in other studies for the detection of ADHD from EEG features. In this regard, Mohammadi et al. [71] classified children with and without ADHD based on nonlinear EEG features recorded during attention tasks; the results confirmed a deficit in the forebrain region of children with ADHD. Barua et al. [72] used EEG signals to propose a new hand-modeled classification model to differentiate individuals with ADHD from those without. The proposed model used the Tunable Q Wavelet Transform (TQWT) to generate wavelet subbands and a new ternary motif pattern (TMP). Cross-validation yielded classification accuracies of 95.57% and 77.93%. Table 2 presents the relevant information from the datasets reviewed in this section. The main characteristics presented in Table 2 are: (a) the number of attributes that can be applied to the diagnosis of ADHD, (b) the number of records from which the information was collected, and (c) whether the dataset is used for the diagnosis or the prediction of ADHD.


Table 2 Review of ADHD diagnostic datasets

| Dataset | Number of attributes | Number of records | Prediction/Diagnosis |
|---|---|---|---|
| HYPERAKTIV | +25 | 51 adults and 52 clinical controls | Diagnosis |
| ADHD-200 | 23 | 776 | Diagnosis |
| Working Memory and Reward in Children with and without Attention Deficit Hyperactivity Disorder (ADHD) | 8 | 79 children | Diagnosis |
| Working Memory and Reward in Adults | 8 | 24 adults | Diagnosis |
| EEG DATA FOR ADHD | 12 | 61 children and 60 healthy controls | Diagnosis |

6 Machine Learning and Deep Learning Classifiers for ADHD Detection

There is a wide variety of classifiers in the literature. The most used ML-based classifiers are SVM, Random Forest, the Ensemble Voting model, AdaBoost, XGBoost, Hierarchical ELM, hierarchical logistic regression, Multilayer Perceptron, KNN, and the Bayes network. Table 3 presents the advantages and limitations of the ML classifiers used in the reviewed papers, as well as a brief description of each classifier and the diagnostic tools on which it has been tested. On the other hand, the most used Deep Learning classifiers are the CNN and the Deep Forest (gcForest); Table 4 presents their description, advantages, disadvantages, and the diagnostic tools on which they have been tested. In this review, the efforts made by several researchers toward the automated diagnosis of ADHD have been presented. The following section presents trends and challenges related to the detection and/or diagnosis of ADHD.

7 Trends and Challenges

We have identified two current trends in the detection/diagnosis of ADHD using ML and DL algorithms.


Table 3 Review of Machine Learning Classifiers (MLC). Each entry lists the classifier and its references, a description, its advantages, its disadvantages/limitations, and the diagnostic tools on which it was tested.

Random Forest [22, 28, 32, 34, 37]
Description: Using controlled variance, a random selection of features is performed to generate the decision trees.
Advantage: Reduction in over-fitting.
Disadvantages/Limitations: Slow in real-time prediction; complex algorithm.
Diagnostic tools: Eye movement, EEG, Body movement.

KNN [27, 30]
Description: Generates a classification of objects according to the closest training sample in the feature space.
Advantage: It is used in many classification applications in the field of data mining, statistical pattern recognition, and many others.
Disadvantages/Limitations: Slower at classification time.
Diagnostic tools: EEG, Body movement.

SVM [19, 20, 24, 26]
Description: A method based on statistical learning theory and the structural risk minimization principle; it aims to determine the location of the decision boundaries (hyperplanes) that produce the optimal separation of classes.
Advantage: One of the most robust and accurate methods among all well-known algorithms.
Disadvantages/Limitations: SVMs are extremely slow in learning, requiring a large amount of training time.
Diagnostic tools: Eye movement, EEG, Body movement, Questionnaires, fMRI.

AdaBoost [35]
Description: Through iterative training of the base classifiers, it assigns greater importance to the previously misclassified data, thus obtaining a new classifier.
Advantage: In real problems, it is possible to build compositions that are superior in quality to the basic algorithms.
Disadvantages/Limitations: Prone to overfitting when there is significant noise in the data; requires sufficiently long training samples.
Diagnostic tools: Questionnaires.

XGBoost [36]
Description: Performs the consecutive assembly of decision trees; trees are added sequentially in order to learn from the previous trees and correct the error they produce, iterating until the error can no longer be reduced.
Advantage: It works well with large and complex datasets by using various optimization methods.
Disadvantages/Limitations: If a large number of trees are handled, overfitting may occur; it can consume a lot of computational resources on large databases.
Diagnostic tools: Questionnaires.

Hierarchical ELM [21]
Description: Based on the ELM algorithm; HELM consists of an unsupervised hierarchical feature selection based on ELM sparse autoencoding and a supervised classification based on ELM.
Advantage: Fast and efficient learning speed, fast convergence, good generalization ability, and ease of implementation.
Disadvantages/Limitations: More complex than ELM.
Diagnostic tools: fMRI.

Ensemble Voting model [23]
Description: Works by combining the predictions of multiple models. For numeric prediction, the average of the models' predictions is calculated; for classification, the votes for each label are summed and the most voted label is predicted.
Advantage: Better results are obtained than when using a single model.
Disadvantages/Limitations: If any of the prediction models is incorrect or performs badly, the ensemble voting model will perform incorrectly.
Diagnostic tools: Eye movement.

Hierarchical logistic regression [25]
Description: Similar to logistic regression, except that it is used to study grouped data with a binary response variable.
Advantage: Appropriate when the variance of a criterion variable is explained by predictor variables that are correlated with each other.
Disadvantages/Limitations: Can only be applied to study simple relationships with a limited number of variables.
Diagnostic tools: Eye movement.

Bayes network [33]
Description: Consists of a structure that is a directed acyclic graph (DAG) expressing the conditional independencies and dependencies between the variables associated with the nodes, plus parameters that are conditional probability distributions associated with each node.
Advantage: Training times are short.
Disadvantages/Limitations: Poor performance on high-dimensional data.
Diagnostic tools: Body movement.

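To make the voting scheme described in Table 3 concrete, the following minimal sketch (a hypothetical example in plain Python, not code from any of the reviewed papers) sums the label votes of several base classifiers and predicts the most voted label:

```python
from collections import Counter

def majority_vote(classifiers, sample):
    """Hard-voting ensemble: collect one label per base classifier,
    sum the votes for each label, and return the most voted label."""
    votes = [clf(sample) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three toy base classifiers (hypothetical stand-ins for, e.g., trained
# SVM, Random Forest, and KNN models) that label a 1-D risk score.
clf_a = lambda x: "ADHD" if x > 0.5 else "control"
clf_b = lambda x: "ADHD" if x > 0.4 else "control"
clf_c = lambda x: "ADHD" if x > 0.7 else "control"

print(majority_vote([clf_a, clf_b, clf_c], 0.6))  # two of three vote "ADHD"
```

In practice the base models would be trained classifiers over real feature vectors; the voting logic itself is unchanged.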
7.1 New Types of Sensors or Biosensors

In recent years, there has been a trend towards the development and use of biosensors for the detection of ADHD [73], where nanomaterial- and biomolecule-assisted dopamine sensors are used to measure dopamine levels in subjects with ADHD. Likewise, there is a trend towards the use of wearables, e.g., actigraphy units, wristbands, fitness trackers, skin sensors, impulse-radio ultra-wideband (IR-UWB) radar sensors, and smartwatches, together with smartphones, for the detection of this disorder, as in [74]. These biosensors and wearables can be integrated into the concept of smart clothing [75].

Machine and Deep Learning Algorithms for ADHD Detection: A Review


Table 4 Review of Deep Learning Classifiers (DLC). Each entry lists the classifier and its references, a description, its advantages, its disadvantages/limitations, and the diagnostic tools on which it was tested.

gcForest [38]
Description: Generates a set of deep forests with a cascading structure.
Advantage: Does not require a great effort in the adjustment of hyperparameters.
Disadvantages/Limitations: The classification capabilities of different forests differ; multi-grain scanning produces feature vectors of different scales, so when it is used to train random forests, the classification capability of each trained RF is different.
Diagnostic tools: fMRI.

Multilayer Perceptron [42]
Description: Based on the perceptron, which is the simplest neural network.
Advantage: Requires short training times.
Disadvantages/Limitations: The number of total parameters can grow excessively, so there may be redundancy.
Diagnostic tools: EEG.

CNN [39, 40, 43–47, 49–51, 53, 54]
Description: Applies a set of convolutional filters to the input images, each of which activates certain features of the images.
Advantage: Can be retrained for new recognition tasks, allowing it to leverage pre-existing networks.
Disadvantages/Limitations: As a black box, there is no full control over the decisions made by the algorithm; a lot of training data are needed.
Diagnostic tools: Eye movement, EEG, Body movement, Questionnaires, fMRI, sMRI.

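The core operation behind the CNN row of Table 4, a convolutional filter that "activates" on certain image features, can be sketched in plain Python. This is a hypothetical, minimal illustration: a real CNN stacks many learned filters with pooling and dense layers, and frameworks implement the sliding-window product far more efficiently.

```python
def conv2d(image, kernel):
    """Valid 2-D convolution (strictly, cross-correlation, as in most
    deep-learning frameworks): slide the kernel over the image and sum
    the element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A toy 4x4 "image" containing a vertical edge, and a hand-crafted
# vertical-edge filter: the output is large exactly where the
# intensity changes left-to-right, i.e., the filter "activates" there.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge_filter = [[-1, 1],
               [-1, 1]]
print(conv2d(image, edge_filter))  # the middle column responds to the edge
```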
7.2 Multi-Modal Detection and/or Diagnosis of ADHD

Although few studies have considered this detection method so far, the technique appears to be more readily accepted by medical experts in this type of disorder. The tools included in such a multi-modal method should prioritize data from questionnaires and movements, because these are the most widely used diagnostic methods in medical standards. In addition, this type of data has been used for other disorders [76] but has rarely been applied to the detection of ADHD using AI, even though the benefits of using multi-modal data have been demonstrated [77].
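A simple way to realize such a multi-modal method is early (feature-level) fusion: normalize each modality and concatenate the features into one vector before classification. The sketch below is hypothetical, with made-up feature ranges, and is not the pipeline of any reviewed paper.

```python
def early_fusion(questionnaire, movement):
    """Feature-level (early) fusion: scale each modality to [0, 1] and
    concatenate into a single vector for a downstream classifier.
    The feature meanings and ranges here are illustrative assumptions."""
    def min_max(values, lo, hi):
        return [(v - lo) / (hi - lo) for v in values]
    q = min_max(questionnaire, 0.0, 3.0)    # e.g., rating-scale items scored 0-3
    m = min_max(movement, 0.0, 100.0)       # e.g., accelerometer statistics
    return q + m

fused = early_fusion([3, 1, 2], [40.0, 90.0])
print(fused)  # one 5-dimensional vector feeding a single classifier
```

Late fusion (combining per-modality classifier outputs) is the main alternative; early fusion lets one model learn cross-modality interactions at the cost of a larger input space.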


There are still several challenges to be overcome, among which the following stand out.

7.3 The Use of Biomarkers as Variables for Diagnosis

From the literature review, no papers were found that considered any biomarker as a variable for detection (except for EEG). A biomarker is defined as "a characteristic that is measured objectively and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacological responses to a therapeutic intervention" [78]. A number of biomarkers that may contribute to the diagnosis of ADHD have been studied, such as genetic, peripheral, neurochemical, and micro-ribonucleic acid (miRNA)-based biomarkers [79, 80], so that relationships between the different types of biomarkers can be established. Nevertheless, these biomarkers have not yet been incorporated into computational methods for ADHD diagnosis.

7.4 Interpretability

Interpretability is an important prerequisite for trusting AI-based models and is necessary for understanding the differences between controls and ADHD patients. This aspect is lacking in most of the reviewed papers and is an area of improvement that needs further attention. There has been a rise in the use of Explainable AI (XAI) in the detection/diagnosis of mental disorders [81], as well as recommendations on its importance in health care [82]. Thus, XAI has become a current research topic in biomedical and healthcare applications that must be considered for ADHD. There is therefore an opportunity to improve the quality of predictive models by employing, for instance, transfer learning methods, model-agnostic methods, or techniques such as SHAP (Shapley Additive Explanation) or LIME (Local Interpretable Model-Agnostic Explanations) that can improve the explainability of AI models. On the other hand, some works [18, 23, 24, 30, 36, 37, 50] reported sensitivity or specificity, performance measures that are important for interpretation, especially when the patient and control groups have different sizes. In short, current models mostly lack interpretability, or there is no standard of comparison.
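The Shapley values that SHAP approximates can be computed exactly for a tiny model by enumerating all feature coalitions, which makes the additive-attribution idea concrete. The sketch below is a hypothetical illustration (the model and baseline are invented, and real SHAP implementations use far more efficient approximations); for a linear model the attribution of feature i reduces to w_i * (x_i - baseline_i).

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values by enumerating every coalition S of features.
    'Absent' features are replaced by their baseline (e.g., dataset mean)
    value -- the additive-attribution scheme that SHAP approximates."""
    n = len(x)
    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for s in combinations(others, size):
                # Standard Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += weight * (value(set(s) | {i}) - value(set(s)))
    return phi

# Hypothetical linear "risk score" over two features; for a linear model
# the Shapley value of feature i equals w_i * (x_i - baseline_i).
w = [2.0, -1.0]
f = lambda z: 0.5 + w[0] * z[0] + w[1] * z[1]
print(shapley_values(f, x=[3.0, 1.0], baseline=[1.0, 0.0]))  # [4.0, -1.0]
```

The exact enumeration is exponential in the number of features, which is why libraries such as SHAP rely on sampling or model-specific shortcuts.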

7.5 Building of Standardized and Accurate Public Datasets

As previously presented, few public databases are available for ADHD screening research, and the existing ones are based exclusively on MRI [62]. Also, it was


observed that most reported results are validated against non-public databases, which makes a systematic comparison with other works very difficult. Therefore, research should be directed toward building standardized and accurate publicly available datasets that can help researchers validate their proposed models; this can facilitate the advancement of computational diagnostic systems. These databases should focus on physiological signals (e.g., ECG), questionnaires, and motion data (accelerometer), because of their widespread use in the medical field [15], and should contain multi-modal data.

7.6 Different Classification Techniques

As far as is known, there is no specific method to identify ADHD subtypes, which involve impulsivity, inattention, and hyperactivity. It is therefore necessary to implement new classification techniques that allow the identification of subtypes [83–85], since most studies deal only with binary classification. Similarly, among deep learning methods the most commonly used algorithm is the CNN, so the testing of other algorithms, such as Generative Adversarial Networks (GANs), Radial Basis Function Networks (RBFNs), Self-Organizing Maps (SOMs), Deep Belief Networks (DBNs), Restricted Boltzmann Machines (RBMs), and Autoencoders, has not been diversified, especially in works that do not use MRI for detection.
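One way to move beyond binary detection toward subtype identification is a one-vs-rest scheme: one binary scorer per subtype, with the highest-scoring subtype winning. The sketch below is entirely hypothetical (the scorers are toy functions standing in for trained binary classifiers, and the feature names are invented), shown only to illustrate the multi-class decision rule.

```python
def one_vs_rest(scorers, features):
    """Multi-class decision from per-class binary scorers: each scorer
    returns a confidence that the sample belongs to its class, and the
    highest-scoring class is predicted."""
    scores = {label: score(features) for label, score in scorers.items()}
    return max(scores, key=scores.get)

# Toy per-subtype scorers over two hypothetical symptom-severity features.
scorers = {
    "inattentive": lambda f: f["inattention"],
    "hyperactive-impulsive": lambda f: f["hyperactivity"],
    "combined": lambda f: min(f["inattention"], f["hyperactivity"]),
}
print(one_vs_rest(scorers, {"inattention": 0.9, "hyperactivity": 0.2}))
```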

8 Conclusion

In recent years, the number of children and adults affected by ADHD has increased considerably, and it is one of the most complex disorders to diagnose. In this review, we analyzed various ADHD diagnostic tools, as well as works that used ML and DL techniques to perform diagnosis and/or detection of ADHD. The most representative classifiers found in this review were also presented. For the ML approach, the most used classifier is the SVM; it has been used in conjunction with MRI, EEG, questionnaires, and eye movements, but not body movements. The Hierarchical ELM classifier is the one reported with the highest accuracy, 99.72%, using the MRI diagnostic tool. It should be noted that the only diagnostic tool on which the different studies can be compared so far is MRI, because the papers that use the other diagnostic tools do not use public databases that would allow an objective comparison. In the deep learning approach, the most used classifier is the CNN, used in conjunction with all the diagnostic tools; the 3D CNN-GRU classifier reports the best accuracy, 99.69%, for the MRI tool. From this review, a list of opportunities to be addressed in future work was compiled. For example, most of the papers were biased towards the EEG modality after MRI, while the other modalities were reported by very few studies. However, in clinical practice,


the most commonly used tools for the initial diagnosis of ADHD are the observation of movements and the use of questionnaires. Except for MRI, there are few or no publicly available datasets for most modalities, which impedes their use in clinical practice. For that reason, the development of a database covering these modalities is imperative. In addition, it is common in clinical practice to use different tools at the same time for the diagnosis of this disorder, i.e., a multi-modal approach. As future work, we plan to review the relationships that exist among a wide variety of biomarkers. Also, XAI techniques should be implemented to improve the interpretability of the algorithms, since this is an indispensable requirement in clinical practice. It is therefore necessary to develop a framework based on AI techniques for the detection of ADHD, focusing efforts on implementation in the clinical setting: a framework that will assist the clinical expert and whose algorithms follow the current methods and standards used in clinical practice for the diagnosis of ADHD.

Acknowledgements This work was supported by Mexico's National Technological Institute (TecNM) and sponsored by both Mexico's National Council of Science and Technology (CONACYT) and the Secretariat of Public Education (SEP) through the PRODEP project (Programa para el Desarrollo Profesional Docente). Also, this research was funded by Mexico's National Council of Science and Technology (CONACYT) through postdoctoral grant number 2256876 for the research project titled "Diagnostic level identification of ADHD using Artificial Intelligence techniques and the Internet of Things paradigm."

References

1. Fayyad, J., Sampson, N.A., Hwang, I., Adamowski, T., Aguilar-Gaxiola, S., Al-Hamzawi, A., Andrade, L.H.S.G., Borges, G., Girolamo, G., Florescu, S., Gureje, O., Haro, J.M., Hu, C., Karam, E.G., Lee, S., Navarro-Mateu, F., O'Neill, S., Pennell, B., Piazza, M., Posada-Villa, J., Have, M., Torres, Y., Xavier, M., Zaslavsky, A., Kessler, R.: The descriptive epidemiology of DSM-IV Adult ADHD in the World Health Organization World Mental Health Surveys. Atten Defic Hyperact Disord, vol. 9, no. 1, p. 47, Mar (2017). https://doi.org/10.1007/S12402-016-0208-3 2. Quinn, P.O., Madhoo, M.: A Review of Attention-Deficit/Hyperactivity Disorder in Women and Girls. Prim Care Companion CNS Disord (2014). https://doi.org/10.4088/PCC.13r01596 3. Skounti, M., Philalithis, A., Galanakis, E.: Variations in prevalence of attention deficit hyperactivity disorder worldwide. Eur. J. Pediatrics, vol. 166, no. 2, pp. 117–12 (2006). https://doi.org/10.1007/S00431-006-0299-5 4. Bener, A.R., et al.: The Prevalence of ADHD Among Primary School Children in an Arabian Society. J Atten Disord, vol. 10, no. 1, pp. 77–82 (2006). https://doi.org/10.1177/1087054705284500 5. American Psychiatric Association: Diagnostic and statistical manual of mental disorders, text revision (DSM-IV-TR®). American Psychiatric Association (2010) 6. Cicek, G., Akan, A.: Deep Learning Approach Versus Traditional Machine Learning for ADHD Classification. TIPTEKNO 2021 - Tip Teknolojileri Kongresi - 2021 Medical Technologies Congress (2021). https://doi.org/10.1109/TIPTEKNO53239.2021.9632940 7. Salgotra, K., Khullar, V., Singh, H.P., Khan, S.A.: Diagnosis of Attention Deficit Hyperactivity Disorder: An Intelligent Neuroimaging Perspective, pp. 31–44. https://doi.org/10.4018/978-1-7998-7511-6.CH003


8. Rasti, J., Torabi, A., Sarrami-Foroushani, N., Amiri, G., Malekifar, N.: Design and Validation of an Eye-Tracker-Based Software to Improve Attention in Attention Deficit Hyperactivity Disorder (ADHD): A Val-idation Study,” Journal of Research in Rehabilitation Sciences, vol. 15, no. 3, pp. 137–143, (2019). https://doi.org/10.22122/JRRS.V15I3.3439 9. Berrezueta-Guzman, J., Krusche, S., Serpa-Andrade, L., Martín-Ruiz, M.L.: Artificial Vision Algorithm for Behavior Recognition in Children with ADHD in a Smart Home Environment. Lecture Notes in Networks and Systems, vol. 542 LNNS, pp. 661–671 (2023). https://doi.org/ 10.1007/978-3-031-16072-1_47/COVER 10. Periyasamy, R., Vibashan, V., Varghese, G., Aleem, M.: Machine Learning Techniques for the Diagnosis of Attention-Deficit/Hyperactivity Disorder from Magnetic Resonance Imaging: A Concise Review. Neurol India 69(6), 1518 (2021). https://doi.org/10.4103/0028-3886.333520 11. Biswas, S.D., Chakraborty, R., Pramanik, A.: A Brief Survey on Various Prediction Models for Detection of ADHD from Brain-MRI Images,” in ACM International Conference Proceeding Series (2020), vol. Part F165625. https://doi.org/10.1145/3369740.3372775 12. Eslami, T., Almuqhim, F., Raiker, J.S., Saeed, F.: Machine Learning Methods for Diagnosing Autism Spectrum Disorder and Attention- Deficit/Hyperactivity Disorder Using Functional and Structural MRI: A Survey,” Frontiers in Neuroinformatics, vol. 14. Frontiers Media S.A. (2021). https://doi.org/10.3389/fninf.2020.575999 13. Quaak, M., van de Mortel, L., Thomas, R.M., van Wingen, G.: Deep learning applications for the classification of psychiatric disorders using neuroimaging data: systematic review and meta-analysis. Neu-roImage: Clinical, vol. 30. Elsevier Inc. (2021). https://doi.org/10.1016/j. nicl.2021.102584 14. Alam, S., Raja, P., Gulzar, Y.: Investigation of Machine Learning Methods for Early Prediction of Neurodevelopmental Disorders in Children. Wirel Commun Mob Comput, vol. 2022, pp. 
1– 12 (2022). https://doi.org/10.1155/2022/5766386 15. Loh, H.W., Ooi, C.P., Barua, P.D., Palmer, E.E., Molinari, F., Acharya, U.R.: Automated detection of ADHD: Current trends and future perspective. Comput Biol Med, vol. 146, p. 105525 (2022). https://doi.org/10.1016/J.COMPBIOMED.2022.105525 16. Arksey, H., O’Malley, L.: Scoping studies: towards a methodological framework. Int J Soc Res Methodol, vol. 8, no. 1, pp. 19–32 (2005). https://doi.org/10.1080/1364557032000119616 17. Levac, D., Colquhoun, H., O’Brien, K.K.: Scoping studies: advancing the methodology. Implementation Science, vol. 5, no. 1, p. 69, Dec. (2010). https://doi.org/10.1186/1748-59085-69 18. A. Riaz, M. Asad, E. Alonso, and G. Slabaugh: Fusion of fMRI and non-imaging data for ADHD classification. Computerized Medical Imaging and Graphics, vol. 65, pp. 115–128, Apr. (2018). https://doi.org/10.1016/j.compmedimag.2017.10.002 19. Cisneros, L., Rivera, G., Florencia, R., Sánchez-Solís, J.P.: Fuzzy optimisation for business analytics: A bibliometric analysis. Journal of Intelligent & Fuzzy Systems 44(2), 2615–2630 (2023). https://doi.org/10.3233/JIFS-221573 20. L. Shao, Y. You, H. Du, and D. Fu: Classification of ADHD with fMRI data and multi-objective optimization. Comput Methods Programs Biomed, vol. 196, p. 105676, Nov. (2020). https:// doi.org/10.1016/j.cmpb.2020.105676 21. S. Ahmed Salman, Z. Lian, M. Saleem, and Y. Zhang: Functional Con-nectivity Based Classification of ADHD Using Different Atlases. in 2020 IEEE International Conference on Progress in Informatics and Computing (PIC), Dec. (2020), pp. 62–66. https://doi.org/10.1109/PIC50277. 2020.9350749 22. De Silva, S., Dayarathna, S., Ariyarathne, G., Meedeniya, D., Jayarathna, S., Michalek, A. M., Jayawardena: A Rule-Based System for ADHD Identification us-ing Eye Movement Data. in 2019 Moratuwa Engineering Research Conference (MERCon), Jul. (2019), pp. 538–543. https://doi.org/10.1109/MERCon.2019.8818865 23. S. Khanna and W. 
Das: A Novel Application for the Efficient and Ac-cessible Diagnosis of ADHD Using Machine Learning (Extended Ab-stract). in 2020 IEEE / ITU International Conference on Artificial In-telligence for Good, AI4G 2020, Sep. (2020), pp. 51–54. https:// doi.org/10.1109/AI4G50087.2020.9311012


24. W. Das and S. Khanna: A Robust Machine Learning Based Frame-work for the Automated Detection of ADHD Using Pupillometric Bi-omarkers and Time Series Analysis. Sci Rep, vol. 11, no. 1, Dec. (2021). https://doi.org/10.1038/s41598-021-95673-5 25. A. Lev, Y. Braw, T. Elbaum, M. Wagner, Y. Rassovsky: Eye Tracking During a Continuous Performance Test: Utility for Assessing ADHD Patients. J Atten Disord, vol. 26, no. 2, pp. 245– 255, Jan. (2022). https://doi.org/10.1177/1087054720972786 26. A. E. Alchalabi, S. Shirmohammadi, A. N. Eddin, M. Elsharnouby: FOCUS: Detecting ADHD patients by an EEG-based serious game. IEEE Trans Instrum Meas, vol. 67, no. 7, pp. 1512– 1520, Jul. (2018). https://doi.org/10.1109/TIM.2018.2838158 27. Tor, H. T., Ooi, C. P., Lim-Ashworth, N. S., Wei, J. K. E., Jahmunah, V., Oh, S. L., Fung: Automated detection of conduct disorder and atten-tion deficit hyperactivity disorder using decomposition and nonlinear techniques with EEG signals. Comput Methods Programs Biomed, vol. 200, Mar. (2021). https://doi.org/10.1016/j.cmpb.2021.105941 28. Holker, R., Susan, S.: Computer-Aided Diagnosis Framework for ADHD Detection Using Quantitative EEG. in Lecture Notes in Com-puter Science (including subseries Lecture Notes in Artificial Intelli-gence and Lecture Notes in Bioinformatics), (2022), vol. 13406 LNAI, pp. 229–240. https://doi.org/10.1007/978-3-031-15037-1_19 29. Motie Nasrabadi, A., et al.: EEG data for ADHD / Control children. IEEE Dataport (2020). https://doi.org/10.21227/rzfh-zn36 30. Ochab, J.K., Gerc, K., Fafrowicz, M., Gudowska-Nowak, E., Marek, T., Nowak, M. A.,Chialvo D R.: Classifying attention deficit hyperactivity disorder in children with non-linearities in actigraphy (2019) 31. Khademi, A., El-Manzalawy, Y., Master, L., Buxton, O.M., Honavar, V.G.: Personalized sleep parameters estimation from actigraphy: A machine learning approach. Nat Sci Sleep 11, 387– 399 (2019). https://doi.org/10.2147/NSS.S220716 32. 
Choi, M.T., Yeom, J., Shin, Y., Park, I.: Robot-Assisted ADHD Screening in Diagnostic Process. Journal of Intelligent and Robotic Systems: Theory and Applications, vol. 95, no. 2, pp. 351– 363, Aug. (2019). https://doi.org/10.1007/s10846-018-0890-9 33. Jiang, X., Chen, Y., Huang, W., Zhang, T., Gao, C., Xing, Y., Zheng, Y.: WeDA: Designing and Evaluating A Scale-driven Wearable Diagnostic Assessment System for Children with ADHD. in Conference on Human Factors in Computing Systems - Proceedings (2020). https://doi.org/ 10.1145/3313831.3376374 34. Hicks, S. A., Stautland, A., Fasmer, O. B., Førland, W., Hammer, H. L., Halvorsen, P., Jakobsen, P.: HYPERAKTIV: An Activity Dataset from Patients with Attention-Deficit/Hyperactivity Disorder (ADHD). In Proceedings of the 12th ACM Multimedia Systems Conference (2021), pp. 314–319. https://doi.org/10.1145/3458305.3478454 35. Uluyagmur-Ozturk, M., Arman, A. R., Yilmaz, S. S., Findik, O. T. P., Genc, H. A., CarkaxhiuBulut, G., Cataltepe, Z.: ADHD and ASD classification based on emotion recognition data. In 2016 15th IEEE International Conference on Machine Learning and Applications (2017), pp. 810–813. https://doi.org/10.1109/icmla.2016.0145 36. A. Trognon and M. Richard: Questionnaire-based computational screening of adult ADHD. BMC Psychiatry, vol. 22, no. 1, Dec. (2022). https://doi.org/10.1186/s12888-022-04048-1 37. M. Maniruzzaman, J. Shin, M. A. M. Hasan: Predicting Children with ADHD Using Behavioral Activity: A Machine Learning Analy-sis. Applied Sciences (Switzerland), vol. 12, no. 5, Mar. (2022). https://doi.org/10.3390/app12052737 38. Shao, L., Zhang, D., Du, H., Fu, D.: Deep Forest in ADHD Data Clas-sification. IEEE Access 7, 137913–137919 (2019). https://doi.org/10.1109/ACCESS.2019.2941515 39. J. Peng, M. Debnath, A. K. Biswas: Efficacy of novel Summation-based Synergetic Artificial Neural Network in ADHD diagnosis. Ma-chine Learning with Applications, vol. 6, p. 100120, Dec. (2021). 
https://doi.org/10.1016/j.mlwa.2021.100120 40. C. Sims: Highly Accurate FMRI ADHD Classification using time dis-tributed multi modal 3D CNNs. ArXiv, May (2022). https://doi.org/10.48550/arxiv.2205.11993 41. H. Ko, B. Wang, J. S. Lim: A Study for ADHD Identification using Eye Movement Data. in 2022 International Conference on Electron-ics, Information, and Communication, ICEIC 2022, (2022). https://doi.org/10.1109/ICEIC54506.2022.9748230


˙ 42. Altınkaynak, M., Dolu, N., Güven, A., Pekta¸s, F., Özmen, S., Demirci, E., & Izzeto˘ glu, M.: Diagnosis of Attention Deficit Hyperactivity Disorder with combined time and frequency features. Biocybern Bi-omed Eng, vol. 40, no. 3, pp. 927–937, Jul. (2020). https://doi.org/10. 1016/j.bbe.2020.04.006 43. A. Vahid, A. Bluschke, V. Roessner, S. Stober, C. Beste: Deep learning based on event-related EEG differentiates children with ADHD from healthy controls. J Clin Med, vol. 8, no. 7, Jul. (2019). https://doi.org/10.3390/jcm8071055 44. H. Chen, Y. Song, X. Li: A deep learning framework for identify-ing children with ADHD using an EEG-based brain network. Neuro-computing, vol. 356, pp. 83–96, Sep. (2019). https://doi. org/10.1016/j.neucom.2019.04.058 45. L. Dubreuil-Vall, G. Ruffini, J. A. Camprodon: Deep Learning Convolutional Neural Networks Discriminate Adult ADHD From Healthy Individuals on the Basis of Event-Related Spectral EEG. Front Neurosci, vol. 14, Apr. (2020). https://doi.org/10.3389/fnins.2020.00251 46. A. Ahmadi, M. Kashefi, H. Shahrokhi, M. A. Nazari: Computer aided diagnosis system using deep convolutional neural networks for ADHD subtypes. Biomed Signal Process Control, vol. 63, Jan. (2021). https://doi.org/10.1016/j.bspc.2020.102227 47. Zhou, D., Liao, Z, Chen, R.: Deep Learning Enabled Diagnosis of Children’s ADHD Based on the Big Data of Video Screen Long-Range EEG. J Healthc Eng, vol. 2022 (2022). https:// doi.org/10.1155/2022/5222136 48. Joy, R.C. et al.: Detection and Classification of ADHD from EEG Sig-nals Using Tunable QFactor Wavelet Transform. J Sens, vol. 2022, pp. 1–17 (2022). https://doi.org/10.1155/2022/ 3590973 49. Jaiswal, S., Valstar, M.F., A. Gillott, D. Daley: Automatic Detec-tion of ADHD and ASD from Expressive Behaviour in RGBD Data. in 2017 12th IEEE International Conference on Automatic Face & Ges-ture Recognition (FG 2017), May (2017), pp. 762–769. https://doi.org/ 10.1109/FG.2017.95 50. 
Amado-Caballero, P., Casaseca-de-la-Higuera, P., Alberola-Lopez, S., Andres-de-Llano, J. M., Villalobos, J. A. L., Garmendia-Leiza, J. R., Alberola-Lopez, C.: “Objective ADHD Diagnosis Using Convo-lutional Neural Networks over Daily-Life Activity Records. IEEE J Biomed Health Inform, vol. 24, no. 9, pp. 2690–2700, Sep. (2020). https://doi.org/10.1109/JBHI.2020. 2964072 51. Hammam, N., Sadeghi, D., Carson, V., Tamana, S. K., Ezeugwu, V. E., Chikuma, J., Mandhane, P. J.: The relationship between machine-learning-derived sleep parameters and behavior problems in 3- And 5-year-old children: Results from the CHILD Cohort study. Sleep, vol. 43, no. 12, Dec. (2020). https://doi.org/10.1093/sleep/zsaa117 52. Zhang, Y., Kong, M., Zhao, T., Hong, W., Zhu, Q., Wu, F.: ADHD In-telligent Auxiliary Diagnosis System Based on Multimodal Information Fusion. in MM 2020 - Proceedings of the 28th ACM Interna-tional Conference on Multimedia (2020), pp. 4494–4496. https://doi.org/ 10.1145/3394171.3414359 53. De Silva, S., Dayarathna, S., Ariyarathne, G., Meedeniya, D., Jayarathna, S., Michalek, A.M.: Computational Decision Support System for ADHD Identification. International Journal of Automation and Computing, vol. 18, no. 2, pp. 233–255 (2021). https://doi.org/10.1007/s11 633-020-1252-1 54. Qin, Y., Lou, Y., Huang, Y., Chen, R., Yue, W.: An Ensemble Deep Learning Approach Combining Phenotypic Data and fMRI for ADHD Diagnosis. J Signal Process Syst (2022). https://doi.org/10.1007/s11265-022-01812-0 55. Hamedi, N., Khadem, A., Vardast, S., Delrobaei, M., Babajani-Feremi, A.: An Effective Connectomics Approach for Diagnosing ADHD using Eyes-open Resting-state MEG. in ICCKE 2021 - 11th Interna-tional Conference on Computer Engineering and Knowledge, (2021) pp. 110–114. https://doi.org/10.1109/ICCKE54056.2021.9721443 56. Niso, G., Rogers, C., Moreau, J. T., Chen, L. Y., Madjar, C., Das, S., ... & Baillet, S.: OMEGA: The Open MEG Archive. Neuroimage, vol. 124, pp. 1182–1187, Jan. 2016. 
https://doi.org/10. 1016/j.neuroimage.2015.04.028.


57. Wolraich, M.L. et al.: ADHD Diagnosis and Treatment Guidelines: A Historical Perspective. Pediatrics, vol. 144, no. 4, (2019). https://doi.org/10.1542/peds.2019-1682 58. Polzehl, J., Tabelow, K.: Magnetic Resonance Brain Imaging. Cham: Springer International Publishing (2019). https://doi.org/10.1007/978-3-030-29184-6 59. Jakhar, D., Kaur, I.: Artificial intelligence, machine learning and deep learning: definitions and differences. Clin Exp Dermatol, vol. 45, no. 1, pp. 131–132 (2020). https://doi.org/10.1111/ CED.14029 60. Sims, C.: Highly Accurate FMRI ADHD Classification using time dis-tributed multi modal 3D CNNs (2022). https://doi.org/10.48550/ARXIV.2205.11993 61. Kaur, A., Kahlon, K.S.: Accurate Identification of ADHD among Adults Using Real-Time Activity Data. Brain Sci, vol. 12, no. 7, p. 831, (2022). https://doi.org/10.3390/brainsci1207 0831 62. Nichols Nolar: ADHD200. https://data.world/nicholsn/adhd-200 (2016) 63. Zou, L., Zheng, J., Miao, C., Mckeown, M.J., Wang, Z.J.: 3D CNN Based Automatic Diagnosis of Attention Deficit Hyperactivity Disor-der Using Functional and Structural MRI. IEEE Access 5, 23626–23636 (2017). https://doi.org/10.1109/ACCESS.2017.2762703 64. Chen, M., Li, H., Wang, J., Dillman, J.R., Parikh, N.A., He, L.: A Multichannel Deep Neural Network Model Analyzing Multiscale Functional Brain Connectome Data for Attention Deficit Hyperactivity Disorder Detection. Radiol Artif Intell, vol. 2, no. 1, p. e190012, (2019). https:// doi.org/10.1148/ryai.2019190012 65. Lytle, M.N., Hammer, R., Booth, J.R.: Working Memory and Reward in Children with and without Attention Deficit Hyperactivity Disorder (ADHD). OpenNeuro (2021). https://doi. org/10.18112/openneuro.ds002424.v1.2.0 66. Lytle, M.N., Hammer, R., Booth, J.R.:A neuroimaging dataset on working memory and reward processing in children with and without ADHD, Data Brief, vol. 31, p. 105801 (2020). https:// doi.org/10.1016/J.DIB.2020.105801 67. Booth, J., Cooke, G., Gayda, E., Hammer, J. 
R., Lytle, M., N., Stein, M., A., Tennekoon, M.: Working Memory and Reward in Adults. https://openneuro.org/datasets/ds002687/versions/ 1.2.0. OpenNeuro, (2021). https://doi.org/10.18112/openneuro.ds002424.v1.1.0 68. Hammer, R., Cooke, G.E., Stein, M.A., Booth, J.R.: Functional neuroimaging of visuospatial working memory tasks enables accurate detection of attention deficit and hyperactivity disorder. Neuroimage Clin 9, 244–252 (2015). https://doi.org/10.1016/j.nicl.2015.08.015 69. Hammer, R., Tennekoon, M., Cooke, G.E., Gayda, J., Stein, M.A., Booth, J.R.: Feedback associated with expectation for larger-reward improves visuospatial working memory performances in children with ADHD. Dev Cogn Neurosci, vol. 14, pp. 38–49 (2015). https://doi. org/10.1016/j.dcn.2015.06.002 70. Ali Motie Nasrabadi, Armin Allahverdy, Mehdi Samavati, Mo-hammad Reza Mohammadi, EEG data for ADHD / Control children IEEE Dataport, Jun. 10, (2020). https://doi.org/10. 21227/rzfh-zn36 71. Mohammadi, M.R., Khaleghi, A., Nasrabadi, A.M., Rafieivand, S., Begol, M., Zarafshan, H.: EEG classification of ADHD and normal children using non-linear features and neural network. Biomed Eng Lett, vol. 6, no. 2, pp. 66–73 (2016). https://doi.org/10.1007/s13534-016-0218-2 72. Barua, P.D. et al.: TMP19: A Novel Ternary Motif Pattern-Based ADHD Detection Model Using EEG Signals. Diagnostics, vol. 12, no. 10, p. 2544 (2022). https://doi.org/10.3390/dia gnostics12102544 73. Xing, J., Zhang, Y., Xu, S., Zeng, X.: Nanomaterial assisted diagno-sis of dopamine to determine attention deficit hyperactivity disorder - ‘An issue with Chinese children,’” Process Biochemistry, vol. 118, pp. 112–120 (2022). https://doi.org/10.1016/J.PROCBIO.2022.01.012 74. Lee, W.H., Cho, S.H., Park, H.K., Cho, S.H., Lim, Y.H., Kim, K.R.: Movement Measurement of Attention-Deficit/Hyperactivity Disorder (ADHD) Patients Using IR-UWB Radar Sensor. 
Proceedings of 2018 6th IEEE International Conference on Network Infrastructure and Digital Content, IC-NIDC 2018, pp. 214–217 (2018). https://doi.org/10.1109/ICNIDC.2018.8525709 75. Wang, J., Lin, C.C., Yu, Y.S., Yu, T.C.: Wireless Sensor-Based Smart-Clothing Platform for ECG Monitoring. Comput Math Methods Med, vol. 2015 (2015). https://doi.org/10.1155/2015/ 295704


76. Ceccarelli, F., Mahmoud, M.: Multimodal temporal machine learn-ing for Bipolar Disorder and Depression Recognition. Pattern Analy-sis and Applications 2021 25:3, vol. 25, no. 3, pp. 493–504 (2021). https://doi.org/10.1007/S10044-021-01001-Y 77. Shakur, A.H., Sun,T., Kim, J.-E., Huang, S.: A rule-based explora-tory analysis for discovery of multimodal biomarkers of ADHD using eye movement and EEG data. IISE Trans Healthc Syst Eng, pp. 1–15, (2022). https://doi.org/10.1080/24725579.2022.2126036 78. Bough, B.J., Lerman, C., Rose, J.E., McClernon, F.J., Kenny, P.J., Tyndale, R.F., David, R.S., Stein, E.A., Uhl, G.R., Conti, D.V., Green, C., Amur, S.: Biomarkers and surrogate endpoints: preferred definitions and conceptual framework. Clin Pharmacol Ther, vol. 69, no. 3, pp. 89–95 (2001). https://doi.org/10.1067/MCP.2001.113989 79. Takahashi, N., Ishizuka, K., Inada, T.Peripheral biomarkers of at-tention-deficit hyperactivity disorder: Current status and future perspective. J Psychiatr Res, vol. 137, pp. 465–470, May (2021). https://doi.org/10.1016/J.JPSYCHIRES.2021.03.012 80. Michelini, G., Norman, L.J., Shaw, P., Loo, S.K.: Treatment bi-omarkers for ADHD: taking stock and moving forward. Translational Psychiatry 12:1, vol. 12, no. 1, pp. 1–30 (2022). https://doi.org/10.1038/s41398-022-02207-2 81. Sudar, K.M., Nagaraj, P., Nithisaa, S., Aishwarya, R., Aakash, M., Lakshmi, S.I.: Alzheimer’s Disease Analysis using Explainable Artificial Intelligence (XAI). International Conference on Sustainable Compu-ting and Data Communication Systems, ICSCDS 2022 - Proceedings, pp. 419–423, (2022). https://doi.org/10.1109/ICSCDS53736.2022.9760858 82. Biswas, M., Kaiser, M.S., Mahmud, M., Al Mamun, S., Hossain, M.S, Rahman, M.A.: An XAI Based Autism Detection: The Context Behind the Detection. Lecture Notes in Computer Science (including subse-ries Lecture Notes in Artificial Intelligence and Lecture Notes in Bio-informatics), vol. 12960 LNAI, pp. 448–459, (2021). 
https://doi.org/10.1007/978-3-03086993-9_40/COVER 83. Zerón-Rugerio, M. F., Carpio-Arias, T.V., Ferreira-García, E., Díez-Noguera, A., Cambras, T., Alda, J. Á., Izquierdo-Pulido, M.: ADHD subtypes are associated differently with circadian rhythms of motor activity, sleep disturbances, and body mass index in children and adolescents: a case–control study. Euro-pean Child & Adolescent Psychiatry 30:12, 1917–1927 (2020). https://doi.org/10.1007/S00787-020-01659-5 84. Mu, S., Wu, H., Zhang, J., Chang, C.: Structural Brain Changes and Associated Symptoms of ADHD Subtypes in Children. Cerebral Cortex, 32(6), 1152–1158 (2022). https://doi.org/10. 1093/CERCOR/BHAB276 85. Slater, J., Joober, R., Koborsy, B.L., Mitchell, S., Sahlas, E., Palmer, C.: Can electroencephalography (EEG) identify ADHD subtypes? A systematic review. medRxiv, p. 2022.03.25.22272900, Mar (2022). https://doi.org/10.1101/2022.03.25.22272900

Mosquito on Human Skin Classification Using Deep Learning

C. S. Ayush Kumar, Advaith Das Maharana, Srinath Murali Krishnan, Sannidhi Sri Sai Hanuma, V. Sowmya, and Vinayakumar Ravi

Abstract Mosquitoes are the greatest cause of death worldwide each year. Identifying them is vital in order to take appropriate action to eradicate them in a particular location. Our aim is to create a state-of-the-art machine learning model to accurately identify and classify mosquitoes on human skin. This task is crucial in the field of public health, as it can help to identify potential disease vectors and facilitate the implementation of prevention and control measures. We explored various pre-trained and deep convolutional neural network (DCNN) models for classification and evaluated the impact of hyperparameter tuning using the Hyperband optimization strategy. We also conducted preprocessing experiments and found data augmentation to be necessary. Our results demonstrated that both DCNNs and pre-trained models were effective in classifying mosquito species on human skin with high accuracy and F1 scores. We proposed an automated model update and build based on input images, which can be established in real-time environments by automating the hyperparameter selection on the best configuration. The use of the Hyperband optimization strategy was effective in improving model performance, with a significant increase in accuracy and F1 scores compared to models that were not tuned using Hyperband. By automating the process of selecting optimal hyperparameters

C. S. A. Kumar · A. D. Maharana · S. M. Krishnan · S. S. S. Hanuma · V. Sowmya (B)
Center for Computational Engineering and Networking (CEN), Amrita Vishwa Vidyapeetham, Coimbatore, India
e-mail: [email protected]
C. S. A. Kumar, e-mail: [email protected]
A. D. Maharana, e-mail: [email protected]
S. M. Krishnan, e-mail: [email protected]
S. S. S. Hanuma, e-mail: [email protected]
V. Ravi
Prince Mohammad Bin Fahd University, Khobar, Saudi Arabia
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
G. Rivera et al. (eds.), Innovations in Machine and Deep Learning, Studies in Big Data 134, https://doi.org/10.1007/978-3-031-40688-1_9


for the pre-trained models using the Hyperband optimization strategy, we improved their performance on the classification of mosquito species on human skin, resulting in the best performance of this work, with an accuracy of 91%. Our study provides valuable insights into the development of an artificial intelligence-based model for the termination of harmful mosquitoes through categorization. This work has important implications for the prevention and control of vector-borne diseases and the identification of potential vectors.

Keywords Mosquito on human skin · Deep convolutional neural networks · Pre-trained models · Transfer learning · Feature extraction · Hyperband optimization · Hypermodels

1 Introduction

Mosquito surveillance is the process of monitoring and tracking the presence and behavior of mosquitoes in a particular area. This can be done through various methods, such as trapping, netting, and analyzing mosquito habitats. Mosquito surveillance is important because certain species of mosquitoes are vectors for diseases such as malaria, West Nile virus, and Zika virus. The mosquito species Aedes aegypti, Aedes albopictus, and Culex quinquefasciatus are the main vectors of these diseases. By tracking the presence and distribution of these mosquitoes, public health officials can predict the risk of disease outbreaks and implement control measures to prevent or mitigate the spread of disease [1]. Disease outbreak modeling is the use of mathematical and statistical tools to predict the likelihood, severity, and potential impact of a disease outbreak [2]. Both mosquito surveillance and disease outbreak modeling are critical components of public health efforts to control and prevent the spread of mosquito-borne diseases. This information can be used to predict the risk of a disease outbreak and inform control measures that can be taken to reduce the population of mosquitoes or protect people from mosquito bites. For example, if mosquito surveillance data indicates that there is an increased presence of mosquitoes that carry the West Nile virus in a particular area, public health officials may implement control measures such as distributing mosquito nets or insect repellent or conducting mosquito control operations such as applying larvicides to breeding sites or releasing sterilized males [3]. These measures can help reduce the number of disease-carrying mosquitoes in the area, which in turn can lower the risk of a disease outbreak.
By continuously monitoring and tracking mosquito populations, public health officials can quickly identify any changes in the presence or behavior of mosquitoes that may increase the risk of a disease outbreak and take timely action to prevent or mitigate the spread of the disease. Adult mosquitoes are typically sampled in the field, and data are collected to monitor the mosquito population; subsequently, the species are identified and counted in a laboratory [4, 5]. Key restrictions, including labor, time, and cost, are present


with these conventional methods for obtaining mosquito population statistics. Because physical traps are less successful than human-as-bait traps, and because people frequently come into contact with mosquitoes on their own, the concept of involving the general public in a mosquito monitoring program offers a great substitute for gathering spatial-temporal mosquito data. We must provide the community with the skills necessary to identify mosquito pests in order to make the concept of community-based mosquito surveillance workable. A smartphone app or identification system that can categorize mosquitoes at the household level can provide the necessary support [6]. The mosquito conditions that the community is most likely to notice are whether it lands on human skin alive, or is dead or damaged; in the latter two conditions the mosquito loses its physical features, making classification a challenging task. The classification of mosquitoes in these conditions has never been addressed in the literature. The categorization of mosquitoes in these conditions can be achieved through machine and deep learning algorithms that analyze images of mosquitoes and accurately identify the species based on their physical characteristics. This can be useful in mosquito surveillance efforts, as different mosquito species have different ecological preferences and may be more or less likely to carry certain diseases. By accurately identifying the species of mosquito, public health officials can better understand the distribution and behavior of different mosquito populations, which can inform control measures and help predict the risk of disease outbreaks. Classifying mosquito species based on whether they have landed on or been smashed on human skin can be important for a number of reasons. First, identifying the species of mosquito that has landed on or been smashed on human skin can help determine the potential risk of disease transmission.
Different mosquito species have different ecological preferences and may be more or less likely to carry certain diseases. For example, if a mosquito species that is known to carry West Nile virus lands on or is smashed on human skin, there may be a higher risk of disease transmission compared to a mosquito species that is not known to carry the virus. By accurately identifying the species of mosquito, public health officials and individuals can better understand the potential risk of disease transmission and take appropriate precautions. This can help researchers understand mosquito behavior and the factors that influence mosquito-human interactions. For example, by analyzing the types of mosquitoes that land on or are smashed on human skin, researchers can learn more about the factors that influence mosquito host preference, such as the type of skin secretions, body temperature, or clothing worn by the individual. This information can be valuable in developing strategies to prevent or reduce mosquito-human interactions and the transmission of disease. Overall, classifying mosquito species based on their landing or smashed status on human skin can provide important information about the potential risk of disease transmission and the factors that influence mosquito behavior, which can be valuable in public health and research efforts. There are several advantages to using automated methods for mosquito classification and categorization over human-based methods:


1. Speed: Automated methods can process large amounts of data quickly, allowing for efficient analysis of large datasets.
2. Consistency: Automated methods can provide more consistent results, as they are not subject to human error or bias.
3. Scalability: Automated methods can be easily scaled up to handle larger datasets, whereas human-based methods may be limited by the availability and capacity of human analysts.
4. Reduced costs: Automated methods can reduce the costs associated with data collection and analysis, as they do not require the hiring and training of human analysts.
5. Increased accuracy: Automated methods can potentially provide more accurate results, as they are not subject to human error or subjectivity.

Automated methods can therefore provide a more efficient and cost-effective approach to mosquito classification and categorization while also potentially increasing the accuracy and consistency of the results [7]. In this research, we present the use of pre-trained models for feature selection and classification tasks, with various feature selection algorithms and their applicability to the task of classifying mosquitoes on human skin in landing, damaged, or dead conditions. The resulting predictive models were evaluated against state-of-the-art methods, including a Google workbench implementation using deep learning image recognition techniques on human skin images in healthy and damaged conditions. Results showed that our model outperformed commercial state-of-the-art methods in both training and testing scenarios, demonstrating its potential value as a powerful tool for this task, with an accuracy of 91% for the hyper-tuned EfficientNetB7 pre-trained transfer learning model [8] and 86% for a simple DCNN model. The paper is divided into several sections that are organized to provide a clear and comprehensive understanding of the research conducted.
Firstly, the literature review section evaluates the existing literature and benchmarks related to deep convolutional neural networks and transfer learning. This section is crucial as it provides a foundation for the proposed methodology and serves to highlight the gaps and limitations in the current research. The proposed methodology is presented in the following sections, beginning with a comprehensive description of the dataset used for the experiments in the Dataset Description section. This is followed by the Deep Convolutional Neural Networks and Transfer Learning section, where the architecture and pre-trained models used in the experiments are presented in detail. This section is important as it presents the key components of the proposed methodology, including the use of transfer learning to enhance the performance of the model. The Hyperparameter Tuning section is dedicated to explaining the optimization process used to fine-tune the model’s hyperparameters to achieve the best possible performance. This section is critical as it demonstrates the methodology used to obtain the results presented in the experiments and results section. The experiments and results section covers the experimental setup, including the training and validation process, the evaluation metrics used to measure the model’s


performance, and the results obtained. This section presents the key findings of the research, including the model’s accuracy and its ability to generalize to new data. Finally, the conclusion and future work section summarizes the main contributions of the paper and highlights its limitations. This section provides insight into future research directions, including possible improvements to the proposed methodology and suggestions for future work that can build upon the findings presented in the paper. Overall, the paper’s organization aims to present a clear and concise understanding of the research conducted and its potential implications for future research in the field.

2 Literature Review

In the existing works, mosquito classification was carried out in cases where the major species had subcategory labels. Pratik et al. [9] classified mosquitoes based on their wing movement patterns. Their study utilized various models such as Multi-layer CNN, ResNet34, ResNet50, DenseNet121, and XGBoost, achieving classification accuracy ranging from 80% to 86%. Song-Quan et al. describe the creation of a unique DSLR image dataset of mosquitoes. Ae. aegypti, Ae. albopictus, and Cx. quinquefasciatus were the major classes taken into account. Each species was further categorized into non-feeding, partially repleted, and fully repleted circumstances, each forming one of three sub-classes [10]. It was observed that the use of transfer learning led to an improvement in the classification accuracy of images depicting damaged conditions. This finding highlights the potential of transfer learning as a useful technique for enhancing the performance of image classification models in scenarios where data availability may be limited. The best performance on unseen data was achieved by the Xception model (see Table 1). To find tiger mosquitoes, a ResNet50 model was trained using data from the Mosquito Alert digital observatory by Armin et al. [11]. By eliminating classification bias and using photos of non-mosquito insects from the IP102 dataset as negative examples, performance was marginally enhanced. The model's area under the ROC curve was 0.96 [11]. Another study provides an app-based, community-based digital observatory where users may share photos of mosquitoes they encounter. This system is referred to as "Mosquito Alert citizen science". Experts in entomology then geotag and classify the publicly submitted photos. A recent citizen science study on tiger mosquitoes utilized deep learning techniques to develop a model for mosquito detection [12].
Specifically, they used a well-known deep learning architecture called ResNet50 to train their model on a dataset of mosquito images. The model achieved a high ROC score of 0.94, indicating its effectiveness in distinguishing between mosquito and non-mosquito images. This approach could have important implications for mosquito control efforts, as it could potentially help automate the detection of these disease-carrying insects. A CNN was developed by Song-Quan et al. [13] using Google Teachable Machine’s


Table 1 Literature review

| Study | Methodology | Species | Result |
| Pratik et al. [9] | Wing movement pattern-based classification | Ae. (aegypti, albopictus, arabiensis), An. gambiae, Cx. (pipiens, quinquefasciatus) | Multi-layer CNN: 86% |
| Song-Quan et al. [10] | Transfer learning applied to unique DSLR image dataset of mosquitoes | Ae. aegypti, Ae. albopictus, and Cx. quinquefasciatus | Xception model (transfer learning): 77.75% |
| Armin et al. [11] | Eliminating bias and leveraging non-mosquito insects as negative examples | Mosquito Alert digital observatory data, IP102 dataset (non-mosquito) | ResNet50: ROC AUC 0.96 |
| The citizen science [12] | Mosquito detection using ResNet50 architecture | Tiger mosquitoes | ResNet50: ROC AUC 0.94 |
| Daniel et al. [15] | Exploring CNN architectures for accurate mosquito classification | Ae. aegypti, Ae. albopictus, and Cx. quinquefasciatus | Aedes: 100%, Culex: 90% |

no-code platform to categorize the two mosquito species Ae. aegypti (Linnaeus) and Ae. albopictus (Skuse). The MobileNet weights were adopted by the network. The model's accuracy (98.33 ± 0.69%) was comparable to the experts' manual labeling accuracy (98.00 ± 0.88%). In order to categorize mosquitoes, Kazushige et al. compared the performance of handcrafted features versus DL-extracted features. Speeded-up robust features (SURF), scale-invariant feature transform (SIFT), dense SIFT, histogram of oriented gradients (HOG), co-occurrence HOG (CoHOG), extended CoHOG, and local binary pattern were the handcrafted feature extraction methods employed [14]. The ML classifiers were then given the retrieved characteristics. On the other hand, AlexNet, VGGNet, and ResNet were the DL architectures utilized. Only after undertaking data augmentation was it noticed that the performance of DL architectures outperformed that of handcrafted features. The handcrafted feature extraction (82.4%) and deep learning (95.5%) methods with the highest accuracy were the SIFT algorithm and ResNet, respectively. In a study by Daniel et al. [15], three subspecies of mosquitoes—Ae. aegypti, Ae. albopictus, and Cx. quinquefasciatus—were classified using various CNN architectures. The models used in the study were LeNet, AlexNet, and GoogleNet, with accuracy ranging from 57.5% to 83.9%. Aedes and Culex mosquitoes were also classified separately, achieving accuracy rates of 100% and 90%, respectively. In their study, Junyoung et al. [16] employed transfer learning, a technique where a pre-trained model is used as a starting point for a new task, to classify a bespoke dataset of


Fig. 1 Sample images—Mosquito Species recognition system on human skin dataset

over 8 mosquito subspecies. The dataset was collected using a mobile phone camera and consisted of 3600 photos, with each subspecies having at least 200 images. The authors utilized advanced CNNs, such as Inception-v3 and Xception, for classification, achieving an accuracy of 82%. The study demonstrated the potential of utilizing transfer learning and cutting-edge CNNs for mosquito classification. Moreover, the use of a mobile phone camera for data collection and labeling highlighted the potential for developing cost-effective and efficient methods for building mosquito vector datasets. These findings could have important implications for mosquito control and disease prevention, as accurate and efficient classification of mosquito species is essential for effective vector control programs. The studies currently accessible in this area focus only on particular subspecies of mosquitoes in their natural surroundings, not on human skin. There are no models available to recognize or categorize mosquitoes in broken or damaged conditions, and there are no deep learning- or machine learning-based studies on these conditions. It is therefore essential to study and develop a suitable, robust, and deployable model for the efficient classification of mosquito species.


3 Methodology

3.1 Dataset Description

In this chapter, we utilize "An annotated image dataset for training mosquito species recognition system on human skin" [17]. The dataset provides information on the three mosquito species Ae. aegypti, Ae. albopictus, and Cx. quinquefasciatus. There are six classes that represent different mosquito species' impacts on human skin, in both their natural landing position and afterward, to represent damaged states. There are 10,000 images in the collection. Separate folders hold the 4200 training images, 3600 validation images, and 1799 test images. The images, which were taken using a DSLR camera, were originally 5184 × 3456 in size but were later resized to 224 × 224 for usability. Chinese, Indian, and Malay were the ethnicities of the volunteers for the human skin images. This points to a lack of diversity in skin tones, which leaves room for further experimentation. Ae. aegypti and Ae. albopictus are two species that share a visual similarity in the dataset. This raises a number of questions during the classification task because it may result in several misclassifications (Fig. 1). The data is already split into a training set, a validation set, and a test set, which are provided by the creators. The training set is used to train a machine learning model, while the test set is used to evaluate the model's performance. Using the test-train split provided by the database can help ensure that the data is representative of the overall population and that the model is trained on a diverse range of examples. This can be especially important if the dataset has been carefully curated to include a balanced representation of different classes or features, as it helps reduce the risk of bias in the model. When splitting the data again, there is a risk of introducing bias if the data is not split randomly or if the split is not representative of the overall distribution, which can lead to a model that performs poorly on unseen data.
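The caution above about re-splitting can be made concrete. The sketch below shows one reproducible, class-stratified way to split a labeled dataset using only the Python standard library; the split fractions (chosen to roughly mirror the dataset's 4200/3600/1799 division), the seed, and all names are illustrative assumptions, not the dataset creators' procedure.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.44, val_frac=0.37, seed=42):
    """Split sample indices into train/val/test, preserving class ratios.

    The test share is the remainder after train and validation.
    """
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    rng = random.Random(seed)          # fixed seed -> reproducible split
    train, val, test = [], [], []
    for lab, idxs in by_class.items():
        rng.shuffle(idxs)              # random, but deterministic per seed
        n = len(idxs)
        n_train = round(n * train_frac)
        n_val = round(n * val_frac)
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test

# Toy labels standing in for the six-class mosquito dataset.
labels = ["aegypti"] * 100 + ["albopictus"] * 100 + ["quinquefasciatus"] * 100
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 132 111 57
```

Because each class is shuffled and sliced independently, every class keeps the same train/val/test proportions, which is exactly the property that a naive random re-split can lose.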

3.2 Deep Convolutional Neural Networks and Transfer Learning

Shallow machine learning models rely on hand-engineered features that are designed by the developer. These features are typically based on domain knowledge and may not capture all of the relevant information in the data. As a result, shallow models may perform poorly on tasks that require complex feature extraction, such as image classification [18]. In our research, we propose to use deep CNNs and pre-trained CNNs for the purpose of mosquito species classification on human skin in damaged or landing conditions. They provide a technique for extracting the essential features of mosquito species based on the location and type of damage on human skin and landing conditions.


Convolutional neural networks (CNNs) are a type of neural network that is particularly well-suited for image analysis tasks. They are designed to process data with a grid-like structure, such as an image, and can learn to identify and extract features from the data by applying a set of filters to the input data. They are able to learn a hierarchy of features from the data, starting with simple edge and color patterns and progressing to more complex shapes and objects. This allows CNNs to automatically extract useful features from the data and make more accurate predictions compared to shallow models. The network is trained to identify and extract features from the input data that are relevant to the task at hand. Pre-trained CNNs are CNNs that have already been trained on a large dataset for a specific task, such as image classification or object detection. These models have already learned useful features from the data and can be used as a starting point for training a new model for a different task [19]. Pre-trained CNNs are applied to the task at hand, i.e., image classification, through transfer learning. Transfer learning involves using a pre-trained CNN as a starting point for training a new model for a different task, rather than training the model from scratch [20]. First, we select a suitable pre-trained model and fine-tune it for our specific task by adjusting the model's architecture and training it on the dataset. This can involve adding or removing layers, freezing or unfreezing layers, and retraining the model using the dataset. Deep CNNs are CNNs with a deeper architecture, containing more layers compared to shallow CNNs. These models can be trained from scratch or fine-tuned using transfer learning, just like pre-trained CNNs.
By using transfer learning with pre-trained CNNs or deep CNNs, we leverage the knowledge and features learned by the model on a large dataset and apply it to the specific task, which improves the model's performance and reduces the amount of data and computational resources required for training [21]. Pre-trained models can also be used as feature extractors. Some advantages of using them for feature extraction include their ability to learn from the data and identify relevant features, their efficiency at processing large amounts of data, their robustness in handling noise and variability in the data, and their potential for transferability to other tasks through the use of transfer learning. The extracted features then serve as input to a classifier, such as a support vector machine (SVM), decision tree, random forest, etc., to predict the class of an input image. Using a pre-trained model as a feature extractor often yields good performance on various image analysis tasks, even with high-dimensional or complex data. These models have already learned to extract useful features from images and can be fine-tuned for a specific task by training a classifier on top of the extracted features. This approach can be particularly useful when working with limited data or computational resources, as the pre-trained CNN can provide a strong baseline for feature extraction, and the classifier can be trained relatively quickly on the extracted features. Popular pre-trained CNN models, such as XceptionNet, InceptionResNet, and EfficientNetB7, are trained on the ImageNet dataset and have been widely used for image classification and other tasks. Other popular pre-trained CNN models include ResNet, Inception, and MobileNet. These models can be easily loaded and used as feature extractors in Python using deep-learning libraries such as Keras and PyTorch.
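The extractor-plus-classifier pattern described above can be illustrated without any deep learning library: below, a fixed (frozen) feature map stands in for the pre-trained CNN, and a nearest-centroid rule stands in for the SVM or random forest trained on top of the extracted features. The feature map, toy images, and all names are assumptions for illustration only, not the authors' implementation.

```python
import math

# Stand-in for a frozen pre-trained CNN: a fixed, non-trainable mapping
# from a raw "image" (2-D list of pixel values) to a feature vector.
def extract_features(image):
    flat = [p for row in image for p in row]
    mean = sum(flat) / len(flat)
    var = sum((p - mean) ** 2 for p in flat) / len(flat)
    return (mean, var)

# Lightweight classifier trained only on the extracted features:
# nearest-centroid, playing the role of the SVM / random forest head.
def fit_centroids(features, labels):
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        s = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in s) for y, s in sums.items()}

def predict(centroids, feature):
    # Assign the class whose centroid is closest in feature space.
    return min(centroids, key=lambda y: math.dist(centroids[y], feature))

# Toy "images": dark vs. bright 2x2 patches.
train = [[[0, 0], [0, 1]], [[9, 8], [9, 9]], [[1, 0], [1, 1]], [[8, 9], [8, 8]]]
labels = ["dark", "bright", "dark", "bright"]
centroids = fit_centroids([extract_features(x) for x in train], labels)
print(predict(centroids, extract_features([[1, 1], [0, 0]])))  # dark
```

Note that `extract_features` is never updated during "training"; only the classifier on top of it is fitted, which is the essence of freezing a pre-trained backbone.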


When using a pre-trained CNN model as a feature extractor, it is often helpful to freeze the model’s weights and only train the classifier on top of the extracted features, as this can help prevent overfitting and improve generalization.

3.3 Hyperparameter Tuning

In image classification tasks, the performance of a convolutional neural network (CNN) model can depend on a wide range of hyperparameters, such as the learning rate, batch size, dropout rate, activation function, number of layers, neurons in a layer, and epochs. Tuning these hyperparameters can be time-consuming and resource-intensive, especially when the hyperparameter space is large [22, 23]. The Hyperband optimization strategy is an efficient method for searching a large hyperparameter space that is based on the idea of using early stopping to identify the best-performing models [24]. Hyperband defines a budget for the amount of time or resources that can be allocated to each model; the algorithm trains models with different combinations of hyperparameters and uses early stopping to identify the best-performing models within the budget. This allows for efficient exploration of a large hyperparameter space and identification of the optimal hyperparameters for the task. Keras Tuner is a library that can automate the process of hyperparameter tuning for CNN models in Keras by generating hypermodels. Hyperparameter tuning is the process of selecting the optimal set of hyperparameters for a CNN model to achieve the best performance on a specific task, such as the classification of mosquito species on human skin (see Fig. 3). The Hyperband optimization algorithm modifies the successive halving procedure used for hyperparameter optimization. This technique uniformly distributes a budget of computing resources among a collection of hyperparameter configurations and assesses the performance of each configuration. The worst-performing combinations are discarded after the initial examination. This procedure is repeated until a single configuration remains, which is chosen as the best set of hyperparameters.
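The successive halving procedure described above can be sketched in a few lines of plain Python. The evaluation function below is a toy stand-in for training a CNN configuration under a given resource budget; the candidate configurations, budget schedule, and keep fraction are illustrative assumptions, not Keras Tuner's implementation.

```python
def successive_halving(configs, evaluate, budget=1, rounds=3, keep=0.5):
    """One successive-halving bracket: evaluate every configuration on a
    small budget, discard the worst half, and repeat with a doubled budget.

    `evaluate(config, budget)` returns a validation score (higher is better).
    """
    survivors = list(configs)
    while len(survivors) > 1 and rounds > 0:
        scores = {c: evaluate(c, budget) for c in survivors}
        survivors.sort(key=lambda c: scores[c], reverse=True)
        survivors = survivors[:max(1, int(len(survivors) * keep))]
        budget *= 2          # promising configs earn more resources
        rounds -= 1
    return survivors[0]

# Toy objective: "accuracy" improves with budget and peaks at lr = 0.01.
def toy_evaluate(lr, budget):
    return budget / (budget + 1) - abs(lr - 0.01)

best = successive_halving([0.1, 0.05, 0.01, 0.005, 0.001, 0.0005],
                          toy_evaluate)
print(best)  # 0.01
```

Because badly performing learning rates are eliminated after only a small budget, most of the compute is concentrated on the few promising configurations, which is the efficiency argument made above.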
The Hyperband method can outperform other optimization algorithms by utilizing computing resources more uniformly and effectively. Repeatedly eliminating configurations that perform poorly helps to concentrate computing resources on the most promising configurations, which speeds up the optimization process and results in faster convergence to an ideal solution. In recent years, deep learning has become a popular approach for solving complex problems in various domains. However, building an effective deep-learning model can be challenging due to the high number of hyperparameters involved. The Hyperband optimization strategy is a powerful technique that has been developed to optimize these hyperparameters efficiently. In our experiment, we used the EfficientNetB7 pre-trained model, which is a state-of-the-art architecture that has shown excellent performance on various computer vision tasks. We initialized this model with weights from the ImageNet dataset,


which is a large-scale database of labeled images that has been widely used in computer vision research. To explore the optimal number of layers and neurons in the model, we varied the number of layers between 1 and 5 and the number of neurons between 32 and 512. The activation functions varied between ReLU, tanh, ELU, and LeakyReLU. Each layer was tuned with dropout regularization to prevent overfitting of the model. The final output layer consisted of 6 neurons with softmax activation, a popular choice for multi-class classification tasks. We compiled the model using a polynomial decay learning rate scheduler on the Adam optimizer and the categorical cross-entropy loss function. The polynomial decay learning rate scheduler gradually reduces the learning rate as training progresses, allowing the model to converge to the optimal solution more effectively. By employing these techniques, we were able to optimize the performance of our model effectively. Our results demonstrated that the hyperparameters we chose improved the model's accuracy and reduced the training time significantly. Overall, our experiment highlights the effectiveness of the Hyperband optimization strategy in building high-performance deep learning models.
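The polynomial decay schedule mentioned above has a simple closed form, lr = (lr0 - lr_end) * (1 - step/decay_steps)**power + lr_end. A minimal sketch follows; the initial rate, end rate, power, and number of decay steps are illustrative values, since the chapter does not report the ones used.

```python
def polynomial_decay(step, initial_lr=1e-3, end_lr=1e-5,
                     decay_steps=1000, power=1.0):
    """Learning rate after `step` updates:
    lr = (lr0 - lr_end) * (1 - step / decay_steps) ** power + lr_end
    """
    step = min(step, decay_steps)      # hold the end rate once fully decayed
    frac = 1.0 - step / decay_steps
    return (initial_lr - end_lr) * frac ** power + end_lr

print(polynomial_decay(0))     # ~0.001, the initial rate
print(polynomial_decay(500))   # ~0.000505, halfway with power=1
print(polynomial_decay(1500))  # ~1e-05, held at the end rate
```

With power = 1 the decay is linear; larger powers front-load the decay, keeping the rate higher early in training and dropping it faster near the end.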

3.4 Proposed Workflow

The proposed methodology for using CNNs and a pre-trained CNN model as a feature extractor and classifier for the task of mosquito on human skin classification can be implemented as follows (Fig. 2):

1. Preprocess the data as needed, such as resizing and normalizing.
2. Choose a pre-trained CNN model: Select a pre-trained CNN model that is suitable for the task. Popular options include XceptionNet, ResNet, InceptionNet, and MobileNet.

Fig. 2 Workflow for model building and optimization


3. Load the pre-trained model: Use a deep learning library such as Keras or PyTorch to load the pre-trained CNN model.
4. Extract features from the training data: Use the pre-trained CNN model to extract features from the training data by running the data through the model and extracting the output of one of the hidden layers.
5. Train a classifier on the extracted features: Use the extracted features as input to a classifier, such as a support vector machine (SVM), decision tree, random forest, or neural network, and train the classifier on the training data.
6. Evaluate the model on the validation and test sets: Use the trained classifier to make predictions on the validation and test sets and evaluate the performance of the model using appropriate metrics, such as accuracy, precision, and recall.
7. Fine-tune the model: If the performance of the model is not satisfactory, fine-tune the pre-trained CNN model by unfreezing some of the layers and training them along with the classifier.
8. Analyze the results and draw conclusions: Analyze the results of the experiment and draw conclusions about the effectiveness of using a pre-trained CNN model.

By using a pre-trained CNN for the classification of mosquito species on human skin, we can leverage the knowledge and features learned by the model on a large dataset and fine-tune it for the specific task, which can improve the performance of the model and reduce the amount of data and computational resources required for training.
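The evaluation metrics named in step 6 can be computed directly from the predicted and true labels. A minimal standard-library sketch of accuracy and macro-averaged precision/recall for a multi-class problem follows; the toy label names are illustrative.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_precision_recall(y_true, y_pred):
    """Macro-averaged precision and recall: compute each metric per class
    from true/false positives and false negatives, then average over the
    classes (six in the mosquito-on-skin task)."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class p, but it was wrong
            fn[t] += 1   # true class t was missed
    prec = sum(tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
               for c in classes) / len(classes)
    rec = sum(tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
              for c in classes) / len(classes)
    return prec, rec

y_true = ["aegypti", "aegypti", "albopictus", "quinq", "quinq", "quinq"]
y_pred = ["aegypti", "albopictus", "albopictus", "quinq", "quinq", "aegypti"]
print(accuracy(y_true, y_pred))              # ~0.667
print(macro_precision_recall(y_true, y_pred))
```

Macro averaging weights every class equally, which matters for this dataset since per-class image counts are not identical across the six classes.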

4 Experiments and Results

The preliminary experiments were conducted using simple convolutional neural network (CNN) architectures because they provide baseline performance for the task of classifying mosquito species on human skin and serve as a reference point for comparison with more complex models [25]. A simple CNN architecture typically has fewer layers and fewer parameters than a deep CNN, which means it requires less data and fewer computational resources to train. This makes the model easier to train and evaluate, and it also makes the results of the experiment easier to interpret. Such architectures can also help to reduce the risk of overfitting, i.e., the model learning patterns in the training data that do not generalize to unseen data. This is especially useful when the dataset is small or has limited diversity, as it helps the model generalize better to new data.

Based on empirical analysis of the experimental results, the optimal input dimension for the neural network was determined to be 60 × 60 pixels. This input dimension yielded the best overall performance as measured by various evaluation metrics, suggesting that the size of the input layer had a significant impact on the network's ability to learn and generalize patterns from the input data.

Mosquito on Human Skin Classification Using Deep Learning

Table 2 Simple DCNN model architecture

Layer type             Output shape     No. of parameters
Conv2D                 58 × 58 × 32     896
MaxPooling2D           29 × 29 × 32     0
Dropout                29 × 29 × 32     0
Conv2D                 27 × 27 × 64     18,496
MaxPooling2D           13 × 13 × 64     0
Dropout                13 × 13 × 64     0
Conv2D                 11 × 11 × 128    73,856
MaxPooling2D           5 × 5 × 128      0
Dropout                5 × 5 × 128      0
Flatten                1 × 3200         0
Dense                  1 × 64           204,864
Dense                  1 × 128          8,320
Dense                  1 × 64           8,256
Output                 1 × 6            390
Trainable parameters                    315,078

Table 2 represents the architecture of the CNN used. It performed fairly well on the testing set, with an accuracy of 86.10%, better than the performance of the works in [9, 10] (for further metrics, refer to Table 3), giving the best performance score in our research. This model is directly compared with existing pre-trained models, fine-tuned by freezing and unfreezing layers and adding more trainable layers.

Using different pre-trained convolutional neural network (CNN) models can be beneficial for the classification of mosquito species on human skin because it allows us to compare the performance of different models and select the one most suitable for the task. Each pre-trained CNN model has been trained on a large dataset and has learned different features and patterns that can be useful for image classification tasks. We can evaluate the performance of each model on the specific task and select the one that performs best, which can help us achieve better accuracy and generalization performance. In addition, using different pre-trained models allows us to compare their computational requirements and efficiency. Some models are more computationally intensive and require more data and resources to train, while others are more efficient and require fewer resources. By comparing the performance and computational requirements of different models, we select the one most suitable for the task and the available resources.

We obtained comparable results, which require more fine-tuning. All pre-trained models performed best when more trainable dense layers were added to the pre-trained architectures. After fine-tuning, we observed the results depicted in Table 4, which are comparable to the benchmark scores of [9, 10]. Most of the pre-trained models were evaluated experimentally by rescaling the image dimension between 60 × 60 and 224 × 224.
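The parameter counts in Table 2 can be audited directly from the layer shapes. A short sketch, assuming 3 × 3 kernels and a 60 × 60 × 3 RGB input (consistent with the 58 × 58 × 32 output of the first convolution); pooling, dropout, and flatten layers contribute no parameters:

```python
def conv2d_params(k, in_ch, out_ch):
    # kernel weights plus one bias per output channel
    return (k * k * in_ch + 1) * out_ch

def dense_params(n_in, n_out):
    # weight matrix plus one bias per output neuron
    return (n_in + 1) * n_out

# Only the parameterized layers of Table 2 (pooling/dropout/flatten add none).
params = [
    conv2d_params(3, 3, 32),        # Conv2D -> 58 x 58 x 32: 896
    conv2d_params(3, 32, 64),       # Conv2D -> 27 x 27 x 64: 18,496
    conv2d_params(3, 64, 128),      # Conv2D -> 11 x 11 x 128: 73,856
    dense_params(5 * 5 * 128, 64),  # Flatten (3200) -> Dense 64: 204,864
    dense_params(64, 128),          # Dense 128: 8,320
    dense_params(128, 64),          # Dense 64: 8,256
    dense_params(64, 6),            # Output layer, 6 classes: 390
]
total = sum(params)  # 315,078 trainable parameters, matching Table 2
```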


This is because commonly used image classification datasets, such as ImageNet, have a default image size of 224 × 224 pixels. Larger inputs allow better preservation of image details and can result in higher accuracy; however, they also require more computational resources and can increase training time. The choice of input size for pre-trained models depends on the specific application and the available resources, and experimentally evaluating models over a range of input sizes can provide insight into the optimal input size for a given task.

EfficientNetB7 is one of the best-performing models, with an accuracy of 84%. The model takes an input of 600 × 600 pixels, the original size the network was trained with, so we upscaled our input data to reproduce the best possible result. Normalization and data augmentation are among the steps we took to achieve this result.

Choosing the right pre-trained CNN model for feature selection is important for tasks such as the classification of mosquito species on human skin because it can impact the performance and efficiency of the model. When selecting a pre-trained CNN model for feature selection, it is important to consider the characteristics of the dataset and the specific requirements of the task. For example, if the dataset is large and diverse, a more complex model with more layers and parameters may be more suitable, as it may be able to learn more complex features and patterns. On the other hand, if the dataset is small or has limited diversity, a simpler model with fewer layers and parameters may be preferable, as it may be less prone to overfitting and may generalize better to new data. Apart from the complexity of the model, it is also important to consider its computational requirements and efficiency.
Some models are more computationally intensive and require more data and resources to train, while others are more efficient and require fewer resources. Depending on the available resources and the specific requirements of the task, it may be preferable to choose a model that is more efficient and requires fewer resources.

We primarily conducted experiments with four models: InceptionResNetV2, InceptionV3, XceptionNet, and ResNet50. For each of these models, bottleneck-pipelined dense layers of dimensions 1000, 500, and 250 were concatenated to generate intermediate features after training the models. Feature selection techniques such as variance thresholding were then used to reduce the dimension further; the optimal threshold value of 0.02333 was found by an iterative process through cross-validation. This method introduced considerable noise into the data and required further training, which is computationally expensive. It also significantly reduced model performance, as seen in Table 5. We therefore conclude that using pre-trained models purely as feature extractors is not a viable technique here.

Moving on to optimizing the best-performing model (EfficientNetB7 with ImageNet weights as the base model, followed by global average pooling to flatten the layers), we adopt the Hyperband optimization strategy, which can be applied to tasks such as the classification of mosquito species on human skin by using it to search for the optimal set of hyperparameters for a CNN model. To use the Hyperband optimization strategy for our task, we follow these steps (Fig. 3):


Table 3 Classification report—simple deep convolutional network

                     Precision   Recall   F1-Score   Support
Aegypti landing      0.76        0.93     0.84       300
Aegypti smashed      0.83        0.79     0.81       300
Albopictus landing   0.87        0.77     0.82       299
Albopictus smashed   0.93        0.82     0.87       300
Culex landing        0.89        0.95     0.92       300
Culex smashed        0.93        0.93     0.93       300
Accuracy                                  0.86       1799
Macro avg            0.87        0.86     0.86       1799
Weighted avg         0.87        0.86     0.86       1799

Table 4 Model performance comparison

Model            Model accuracy   Model loss
Simple DCNN      0.86             0.35
EfficientNetB7   0.84             0.47
MobileNetV2      0.82             0.49
DenseNet121      0.81             0.51
XceptionNet      0.80             0.54
ResNet152V2      0.71             0.94

Table 5 Pretrained models as feature extractors

Model               Classifier            No. of features   Model accuracy
InceptionResNetV2   SVM                   1000              0.78
InceptionV3         Logistic Regression   1000              0.70
XceptionNet         Random Forest         1000              0.64
ResNet50            Logistic Regression   1000              0.51
InceptionResNetV2   Logistic Regression   500               0.78
InceptionResNetV2   Logistic Regression   250               0.70

1. Define the hyperparameter search space: Specify the range of values or options for each hyperparameter to tune.
2. Define a model-building function: Create a function that builds and compiles a CNN model based on the hyperparameters passed to it. This function should return the compiled model.
   • The number of hidden layers ranges between 0 and 5, each layer with n neurons, where n is between 32 and 512.


   • The activation function is passed as a choice from the list of ReLU, tanh, ELU, and LeakyReLU.
   • After each layer, a dropout rate is set, ranging from 0 to 0.5 in steps of 0.1.
   • The final layer is the output layer, with 6 neurons and softmax activation.
   • The learning rate is initialized to 0.00005 and decayed to an end rate of 0.0 by the polynomial decay scheduler available in the Keras module; the model is compiled with categorical cross-entropy loss and accuracy as the metric.
3. Instantiate a Hyperband object: Create an instance of the Hyperband class from Keras Tuner, specifying the hyperparameter search space and the model-building function.
4. Search for the best hyperparameters: Call the search method of the Hyperband object, passing in the training data and target labels. The Hyperband optimization strategy performs the hyperparameter search using early stopping to identify the best-performing models within the defined budget of time or resources.
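Hyperband's budget allocation is easiest to picture through its successive-halving core: start many configurations on a small budget, keep only the best 1/η at each rung, and multiply the budget by η for the survivors. A toy sketch, in which `toy_val_accuracy` is a hypothetical objective standing in for actual model training (Keras Tuner performs the real search):

```python
import math

def successive_halving(configs, evaluate, min_budget=1, eta=3, rounds=3):
    """Evaluate all configs on a small budget, keep the best 1/eta,
    and re-evaluate the survivors on an eta-times larger budget."""
    budget = min_budget
    for _ in range(rounds):
        scores = {c: evaluate(c, budget) for c in configs}
        keep = max(1, len(configs) // eta)
        configs = sorted(configs, key=scores.get, reverse=True)[:keep]
        budget *= eta
    return configs

# Hypothetical objective: stands in for "validation accuracy after training
# with this learning rate for `budget` epochs"; best near lr = 1e-3.
def toy_val_accuracy(lr, budget):
    return -abs(math.log10(lr) + 3) + 0.01 * budget

best = successive_halving([10 ** -e for e in range(1, 7)], toy_val_accuracy)
```

Full Hyperband runs several such brackets with different trade-offs between the number of configurations and the starting budget [24].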

Fig. 3 Proposed hyperparameter tuning using hyperband optimization strategy workflow


Fig. 4 EfficientNetB7 hypermodel training and validation plots

By using the Hyperband optimization strategy, we automated the process of selecting the optimal hyperparameters for the CNN model and improved its performance on the classification of mosquito species on human skin, obtaining the best result of this work with an accuracy of 91% (for further metrics, refer to Table 6), better than the results obtained in [9–11].

The training and validation plots (see Fig. 4) indicate that the model is learning effectively and generalizing well to new data. The model learns at a consistent rate and does not show issues such as overfitting or underfitting, which is a good sign that the model is suitable for the task and that the hyperparameters are well-tuned. Interpreting the confusion matrix (see Fig. 5), the diagonal values represent the number of correct predictions made by the model; their high values imply a large number of correct predictions. There are a few misclassifications, which is tolerable for the purpose of generalizing between the same species in landing and smashed conditions. The model makes a similar number of correct and incorrect predictions for each class, suggesting a balanced performance. Hence, we have successfully completed the process of building and optimizing an efficient classification and recognition model for mosquitoes on human skin.
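The per-class figures reported in Tables 3 and 6 follow mechanically from the confusion matrix. A minimal sketch of that derivation:

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j.
    Returns (precision, recall, F1) per class, as in Tables 3 and 6."""
    n = len(cm)
    metrics = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                          # missed class-i samples
        fp = sum(cm[r][i] for r in range(n)) - tp     # wrongly assigned to i
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        metrics.append((p, r, f1))
    return metrics

def overall_accuracy(cm):
    # fraction of samples on the diagonal
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
```

The macro average is the unweighted mean of the per-class values, while the weighted average weights each class by its support.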

5 Conclusion and Future Work

In this chapter, we aimed to correctly classify mosquitoes found in public spaces using machine and deep learning techniques. We performed various experiments with the dataset for the task of mosquito-on-human-skin classification, with the goal of communal well-being. First, we conducted pilot experiments with simple deep convolutional neural network architectures, which performed fairly well. Several experiments with pre-trained models were then carried out in order to establish benchmarks for the dataset that improve on existing ones. Experimentation on the pre-trained models included fixing and training the base model, exploring the initialization of weights, and freezing the base model while training additional dense layers. Further,


Fig. 5 EfficientNetB7 hypermodel confusion matrix

Table 6 Classification report—pretrained model after hyperparameter tuning

                     Precision   Recall   F1-Score   Support
Aegypti landing      0.87        0.92     0.89       300
Aegypti smashed      0.86        0.86     0.86       300
Albopictus landing   0.95        0.85     0.90       299
Albopictus smashed   0.90        0.90     0.90       300
Culex landing        0.94        0.96     0.95       300
Culex smashed        0.95        0.96     0.95       300
Accuracy                                  0.91       1799
Macro avg            0.91        0.91     0.91       1799
Weighted avg         0.91        0.91     0.91       1799


we explored the application of transfer learning and its prospects for the classification task, where intermediate features extracted from pre-trained architectures were fed into shallow machine learning algorithms; this did not yield satisfactory results. Finally, we tuned the parameters of the best-performing pre-trained architecture to build a more robust and scalable model and explored the feasibility of automatically building hypermodels, with the weights of the model fine-tuned and optimized for the best results.

For further studies, we propose an ensemble boosting model such as Random Forest or XGBoost for feature selection using feature importance scores, followed by a classifier, comprising either a deep convolutional neural network or shallow machine learning, to improve on the performance achieved in this study. Preprocessing images in different spectral representations is also an experiment to be tested: transformations such as RGB to HSV, Fourier pixel distribution plots, or image augmentations can be employed to push the results further.

We have successfully built a model to identify and classify the three most common mosquito species (Culex, Aegypti, and Albopictus) in landing and smashed conditions based on image data. Our study provides valuable insights into the development of an AI-based model for the termination of these harmful mosquitoes in surveillance systems.

References

1. Roiz, D., et al.: Integrated Aedes management for the control of Aedes-borne diseases. PLoS Neglected Trop. Dis. 12(12), e0006845 (2018)
2. Centers for Disease Control and Prevention (US), National Center for Infectious Diseases (US): Addressing emerging infectious disease threats: a prevention strategy for the United States. Centers for Disease Control and Prevention (1994)
3. Petersen, L.R., Marfin, A.A.: West Nile virus: a primer for the clinician. Ann. Intern. Med. 137(3), 173–179 (2002)
4. Kweka, E.J., Mahande, A.M.: Comparative evaluation of four mosquitoes sampling methods in rice irrigation schemes of lower Moshi, northern Tanzania. Malar. J. 8(1), 1–5 (2009)
5. Gao, Q., et al.: Comparison of mosquito population composition and dynamics between human-baited landing and CO2-baited trapping monitoring methods. Chin. J. Hyg. Insect. Equip. 21, 254–258 (2015)
6. Lima, J.B.P., et al.: MosqTent: an individual portable protective double-chamber mosquito trap for anthropophilic mosquitoes. PLoS Neglected Trop. Dis. 11(3), e0005245 (2017)
7. Stone, C., Mohammed, C.: Application of remote sensing technologies for assessing planted forests damaged by insect pests and fungal pathogens: a review. Curr. For. Rep. 3(2), 75–92 (2017)
8. Rajanbabu, K., et al.: Ensemble of deep transfer learning models for Parkinson's disease classification. In: Soft Computing and Signal Processing, pp. 135–143. Springer, Singapore (2022)
9. Mulchandani, P., Siddiqui, M.U., Kanani, P.: Real-time mosquito species identification using deep learning techniques. Int. J. Eng. Adv. Technol. 2249–8958 (2019)
10. Ong, S.-Q., et al.: Community-based mosquito surveillance: an automatic mosquito-on-human-skin recognition system with a deep learning algorithm. Pest Manag. Sci. 78(10), 4092–4104 (2022)
11. Pataki, B.A., et al.: Deep learning identification for citizen science surveillance of tiger mosquitoes. Sci. Rep. 11(1), 1–12 (2021)


12. Mishra, P., Sarawadekar, K.: Polynomial learning rate policy with warm restart for deep neural network. In: TENCON 2019-2019 IEEE Region 10 Conference (TENCON). IEEE (2019)
13. Ong, S.-Q., et al.: Implementation of a deep learning model for automated classification of Aedes aegypti (Linnaeus) and Aedes albopictus (Skuse) in real time. Sci. Rep. 11(1), 1–12 (2021)
14. Okayasu, K., et al.: Vision-based classification of mosquito species: comparison of conventional and deep learning methods. Appl. Sci. 9(18), 3935 (2019)
15. Motta, D., et al.: Application of convolutional neural networks for classification of adult mosquitoes in the field. PLoS One 14(1), e0210829 (2019)
16. Park, J., et al.: Classification and morphological analysis of vector mosquitoes using deep convolutional neural networks. Sci. Rep. 10(1), 1–12 (2020)
17. Ong, S.-Q., Ahmad, H.: An annotated image dataset for training mosquito species recognition system on human skin. Sci. Data 9(1), 1–6 (2022)
18. Alzubaidi, L., et al.: Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J. Big Data 8(1), 1–74 (2021)
19. Hussain, M., Bird, J.J., Faria, D.R.: A study on CNN transfer learning for image classification. In: UK Workshop on Computational Intelligence. Springer, Cham (2018)
20. Anand, R., et al.: Modified VGG deep-learning architecture for COVID-19 classification using chest radiography images. Biomed. Biotechnol. Res. J. (BBRJ) 5(1), 43 (2021)
21. Tammina, S.: Transfer learning using VGG-16 with deep convolutional neural network for classifying images. Int. J. Sci. Res. Publ. (IJSRP) 9(10), 143–150 (2019)
22. Seshu Babu, G., et al.: Tuberculosis classification using pre-trained deep learning models. In: Advances in Automation, Signal Processing, Instrumentation, and Control, pp. 767–774. Springer, Singapore (2021)
23. Mar-Cupido, R., García, V., Rivera, G., Sánchez, J.S.: Deep transfer learning for the recognition of types of face masks as a core measure to prevent the transmission of COVID-19. Appl. Soft Comput. 125, 109207 (2022)
24. Li, L., et al.: Hyperband: a novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res. 18(1), 6765–6816 (2017)
25. Kumar, V.S., et al.: Mosquito type identification using convolution neural network. In: 2022 3rd International Conference on Smart Electronics and Communication (ICOSEC). IEEE (2022)

Analysis and Interpretation of Deep Convolutional Features Using Self-organizing Maps

Diego Sebastián Comas, Gustavo Javier Meschino, Agustín Amalfitano, and Virginia Laura Ballarin

Abstract Deep learning has defined a new paradigm for data analysis. In image processing, Convolutional Neural Networks (CNN) have a vast number of applications and do not require prior extraction of features, as these are "learned" directly from training images. The interpretation of how a CNN works is an open problem, and any method for interpreting the features extracted from a CNN can lead to removing the black-box concept, which is a significant contribution to the field of machine learning. In the present chapter, an approach based on Self-Organizing Maps (SOM) is proposed for the analysis and interpretation of features extracted from CNN. The main characteristics are: (i) CNN are trained from an initial image dataset with different sets of hyperparameters; (ii) new datasets containing different representations of the initial dataset are generated and then analyzed using SOM, visualization tools, and quality measures; (iii) it is possible to select features suitable for classification, to describe complexity and diversity in the classes, and to extract additional information about the images in the training datasets. An application example considering chest X-ray images for the classification of pneumonia is analyzed, identifying good features from CNN trained from scratch and giving some interpretation of them, both in the classification of normal versus pneumonia and in viral versus bacterial pneumonia.

Keywords Deep-learning · Convolutional neural networks · Features interpretation · Self-organizing maps

D. S. Comas (B) · G. J. Meschino · A. Amalfitano · V. L. Ballarin Institute of Scientific and Technological Research in Electronics (ICyTE), National University of Mar del Plata-CONICET, Mar del Plata, Argentina e-mail: [email protected] G. J. Meschino e-mail: [email protected] A. Amalfitano e-mail: [email protected] V. L. Ballarin e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 G. Rivera et al. (eds.), Innovations in Machine and Deep Learning, Studies in Big Data 134, https://doi.org/10.1007/978-3-031-40688-1_10


D. S. Comas et al.

1 Introduction

In recent years, deep learning has defined a new paradigm for data analysis [1]. Convolutional Neural Networks (CNN) have had a vast number of applications during the last decade. The application of CNN is widely extended in science and business: facial recognition, medical imaging, document analysis, autonomous driving, and biometric authentication, among others. They can model relations between inputs and outputs by learning features to feed a classification or regression model. Particularly in the field of image processing, and unlike traditional machine learning approaches, CNN do not require prior extraction of features by image processing methods (such as edge detection, segmentation, shape characterization, object counting, or frequency and spatial transformations), as these are "learned" directly from training images through a sequence of operations implemented within the network [2]. However, the interpretation of how a CNN works is an open problem currently being explored with great impetus [3]. A method for interpreting the features extracted from a CNN can lead to removing the black-box concept of the CNN, with a significant contribution to the machine-learning field.

In this context, Self-Organizing Maps (SOM), which are competitive and unsupervised neural networks, have shown excellent capabilities for data analysis using different visualizations [4, 5] and by means of data clustering [6–8]. They have been applied from their creation to the present in a wide spectrum of fields, with recent applications including time-series analysis [9], cell classification [10], seismic and driving-maneuver data analysis [9, 11], and sustainability performance assessment [12], among many others. The SOM must be fed with a set of computed feature vectors that depend on the field of application. It performs a dimensionality reduction to a 2D grid that helps understand and comprehend the data.
Data can be seen represented in a position of the grid. SOM have some limitations with categorical data, but this is not a problem for the application proposed in this work.

Considering the need to interpret features in CNN and the capabilities of the SOM for data analysis, in the present work a SOM-based approach for the analysis and interpretation of features extracted from CNN is proposed. The main contributions of this work are: (a) assessing the class discrimination of different sets of features extracted by CNN; (b) describing complexity and diversity in the classes of a dataset according to their features; (c) selecting the best CNN for a classification task; and (d) extracting additional information about the image classes in the training datasets. To achieve these contributions: (1) CNN are trained from an initial image dataset with different hyperparameters; (2) new datasets containing different representations of the initial dataset are generated; (3) the new datasets are analyzed using SOM visualization tools and quality measures. An application example is provided using chest X-ray images for the classification of pneumonia.

The rest of this chapter is structured as follows. In Sect. 2, the most important definitions related to both CNN and SOM are presented, focusing on those concepts required for understanding the proposed approach. In Sect. 3, the method proposed


for analyzing and interpreting features generated from CNN is presented in detail. Section 4 presents an application example of the proposed approach considering images from the Chest X-Ray Images of Pneumonia dataset, containing images with and without pneumonia. Finally, conclusions are stated in Sect. 5.

2 Materials

In this section, concepts related to both CNN and SOM are revised. As both topics are well known, only the most important concepts are presented; however, proper references are indicated.

2.1 Convolutional Neural Networks

A CNN is a type of deep neural network originally conceived and designed to work with images, introduced by Fukushima [13] in the eighties. It has been used extensively in various kinds of imaging applications, including medical imaging. A typical CNN consists of two phases: the feature extraction phase and the classification phase [2, 14]:

• Feature extraction phase: It alternates several layers that apply convolutional filters and subsampling layers, subsequently changing the representation and reducing the information. Filter weights are optimized during training. As a result, the output of this phase can be interpreted as a set of features associated with an input image.
• Classification phase: It is typically a fully connected feedforward network [15] whose architecture is adjusted according to the classification to be performed. The number of output neurons equals the number of classes.

The computational cost of training a CNN is high due to the large number of parameters that must be optimized and the number of operations required to obtain the network output for a given input image. However, the great technological advances of recent years in manufacturing high-capacity Graphics Processing Units (GPU) for the parallel operations specific to deep-learning models have substantially reduced the complexity of implementing CNN while expanding their field of application.

An approach frequently used in the practical implementation of CNN is the process known as transfer learning [16–18]. In such an approach, a part of a previously trained CNN, typically the feature extraction layers, is reutilized, followed by ad-hoc new layers.
This approach has two main advantages: (a) by using the feature extraction phase of an existing CNN, it takes advantage of the information captured during its training, whose training dataset usually consists of a huge number of example images, much larger than that typically available for a specific problem; and (b)


since only some layers must be trained, much less computational effort is required during training. In the present work, both CNN trained from scratch and CNN obtained by transfer learning are considered. CNN are trained from an initial image dataset, and then they are used as feature extractors, generating new representations of the initial one. A description of each part of the proposed approach is presented in Sect. 3.
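Advantage (b) is easy to quantify: with frozen feature-extraction layers, only the new head contributes trainable parameters. A small sketch with hypothetical layer sizes:

```python
def trainable_parameters(layer_params, frozen):
    """Sum the parameters of the layers not in `frozen`; frozen
    feature-extraction layers contribute nothing to training cost."""
    return sum(p for i, p in enumerate(layer_params) if i not in frozen)

# Hypothetical five-layer network: three convolutional layers reused from a
# pre-trained model, followed by two new dense layers trained from scratch.
layers = [896, 18_496, 73_856, 204_864, 390]
full = trainable_parameters(layers, frozen=set())           # train everything
head_only = trainable_parameters(layers, frozen={0, 1, 2})  # transfer learning
```

In this toy case transfer learning still trains roughly two thirds of the parameters (205,254 of 298,502) because the head is large; with realistic backbones of tens of millions of parameters, the saving is far greater.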

2.2 Self-organizing Maps

The SOM was proposed in 1982 by Kohonen [19]. It consists of a regular grid of cells mapping an input space (pattern space) to a cell space (topological space), preserving the topology of the input space and having remarkable capabilities for removing noise, detecting outliers, and completing missing values [7]. Each cell of a SOM has an associated vector, called the prototype vector, which has the same dimension as the input space. The set of all the prototype vectors is called the codebook. To proceed with the training stage, the codebook is initialized using linear, random, or data-analysis-based initialization [20]. During training, the codebook is adjusted in order to map close data in the input space to close cells. If a SOM is well-trained, its codebook represents the training dataset, preserving its characteristics, i.e., having a similar probability density function.

To quantify the quality of a trained SOM, i.e., how well its codebook represents the training dataset statistics, measures must be defined. SOM quality depends on the map size, topology, neighborhood function, and training type, which are hyperparameters set heuristically or methodologically. In the present work, based on the approach used in [7] for obtaining optimal SOM by computing quality measures after training, three quality measures are considered: quantization error, topographic error, and topographic product. While formal definitions for these measures can be found in [7], the following conceptual assertions are given:

• The quantization error quantifies whether the prototype vectors are close to the training dataset, considering only distance measures in the pattern space.
• The topographic error is a measure of how well the topology of the pattern space is preserved, i.e., whether very close prototype vectors of the codebook have been assigned to adjacent cells in the map space.
• The topographic product combines distances both in the pattern space and in the map space, allowing to assess how well the neighborhood relations are preserved.

If the prototype vectors perform an organized projection of the training data according to a similarity criterion, preserving the data topology, then these three errors tend to be minimized. As the SOM is used in the present work to interpret the information in the training datasets, it is important to generate SOM with good data-representation capabilities. Details of the implementation are given in Sect. 3.

In addition to the definition of the optimal SOM, visualization tools are required to analyze and interpret the information contained in the codebooks. The definition


of the Best Matching Unit (BMU) is considered: the cell whose prototype vector is the nearest to an input data vector. The following visualizations are adopted [4, 5]:

• Map of labels: A map in which each cell is labeled with the name of the class that is most frequent among the data for which the cell is the BMU. In the same map, a color scheme can be used, with similar colors for near cells gradually changing as the cells become distant.
• Map of hits: A map in which each cell is labeled with the number of times it was the BMU. It follows the same color criteria as the map of labels.
• U-Matrix: A representation of the map in which the distances between the prototype vectors of two adjacent cells are represented with a color code.
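The BMU, the map of hits, and the quantization error can be sketched directly from these definitions. For simplicity, the codebook is treated here as a flat list of prototype vectors; the topographic measures would additionally need each cell's 2D grid coordinates:

```python
import math

def bmu(codebook, x):
    """Best Matching Unit: index of the cell whose prototype vector
    is nearest to the input vector x."""
    return min(range(len(codebook)), key=lambda i: math.dist(codebook[i], x))

def quantization_error(codebook, data):
    """Mean distance from each data vector to the prototype of its BMU."""
    return sum(math.dist(codebook[bmu(codebook, x)], x) for x in data) / len(data)

def hits(codebook, data):
    """Map of hits: how many times each cell was the BMU."""
    counts = [0] * len(codebook)
    for x in data:
        counts[bmu(codebook, x)] += 1
    return counts
```

The map of labels follows the same pattern, counting per cell the most frequent class label among the data vectors for which the cell is the BMU.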

3 Proposed Method

The proposed method consists of four stages: (A) training of CNN; (B) extraction of features; (C) training of SOM; (D) analysis and interpretation. In Stage A, CNN with different hyperparameter settings are trained from the initial image dataset. In order to evaluate the generalization capabilities of the networks, the hold-out approach is applied, as widely used with CNN. In Stage B, the feature-extraction stage of each CNN is used to extract features from the initial image dataset, defining new datasets (one per CNN) representing the information in the initial dataset. In Stage C, the datasets defined in Stage B are used to train SOM considering the automatic approach described in Sect. 2.2. As a result, a reliable SOM is obtained for each of the generated datasets, each SOM being a compressed representation of the information in its codebook. Finally, in Stage D, an analysis is performed on the SOM to give some interpretation of the features generated by the CNN and also of the classes and images in the initial dataset. In Fig. 1, a pipeline of the proposed method is shown, which is presented in detail in the rest of this section.


Fig. 1 Pipeline of the method proposed. Stage A: Training of CNN. Stage B: Extraction of features. Stage C: Training of SOM. Stage D: Analysis and interpretation

D. S. Comas et al.

3.1 Stage A: Training of CNN

From an initial dataset of labeled images, i.e., where each image is associated with a label defining the class to which it belongs, the first stage consists of training a set of CNNs on this dataset. The number of CNNs to train depends on the number of hyperparameters and approaches to be studied for the problem addressed. For example, it may be required to assess the features obtained by CNNs with 3, 5, and 7 internal layers plus two more based on transfer learning (for example, VGG16 and AlexNet), giving 5 CNNs to consider. The CNN hyperparameters can be set according to heuristic knowledge about the dataset, considering, among others, the following aspects:
• The number of samples and labels in the dataset: a decision must be made between using transfer learning or training a CNN from scratch. As explained in the next stages, transfer-learning approaches with frozen feature-extraction layers will not provide features specific to the dataset, as those layers were adjusted on an external dataset (the one used during pre-training) and, therefore, the features are generic rather than specific. However, these generic features can still provide some interpretation of the labels or the classification problem through the SOM trained in Stage C.
• Hyperparameters: for CNNs trained from scratch, they should be defined using heuristic knowledge when available, or recommendations from the literature, including existing papers in the scope of the problem addressed [21–23]. In addition, the error to be optimized during training, the optimizer function, the end-of-training criteria, the validation-subset proportion, and the activation functions, among other hyperparameters, must be defined, which requires knowledge about CNNs.

Once the CNNs (each with a specific architecture, learning approach, and hyperparameters) are defined, each CNN is trained and validated. The proportions used to define the training, validation, and test subsets are chosen according to the number of samples in the initial dataset. Validation subsets are used only when end-of-training criteria are applied. Each CNN should be interpreted as a specific way of solving the classification problem. Moreover, the feature-extraction phase of each CNN defines a specific set of features to be extracted (to be observed) from the initial dataset in order to solve the problem. As each CNN is different, the features computed by each CNN are also different. In addition, the generalization errors computed are estimations of the probability of error of the CNN, which includes both the feature-extraction phase and the classification phase.
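The hold-out protocol of Stage A can be sketched as an index split. This is a hedged illustration, not the authors' code: the 80/5/15 proportions match those reported later in the application example, while the seed and sample count are arbitrary.

```python
import random

def hold_out_split(n_samples, train=0.80, val=0.05, seed=0):
    """Shuffle sample indices once and carve out train/validation/test subsets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = int(n_samples * train)
    n_val = int(n_samples * val)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])  # remainder is the test subset

tr, va, te = hold_out_split(1000)
print(len(tr), len(va), len(te))  # 800 50 150
```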

3.2 Stage B: Extraction of Features

Each CNN generated in the previous stage can be understood as an interpretation of how the classification problem can be solved. While the CNNs may differ in classification performance, their internal parameters (essentially the weights of the synaptic connections) contain enough information to interpret the classification. In other words, if one of the images in the initial dataset is fed to one of the CNNs generated in Stage A, the output of the feature-extraction phase of that CNN is a vector of features describing the image, and those features should be sufficient for classification. On that basis, in the present stage the following steps are applied to generate new datasets from the initial dataset:
1. Each image in the initial dataset is used as input to the CNNs generated in Stage A. It is necessary to properly set the neural network in inference mode and, more importantly, to use the last layer of the feature-extraction phase (in general a global-average-pooling layer or a flatten layer) as the output. As a result, a feature vector is generated for each image and each CNN. The length of the vector is the size of the layer before the classification stage; this layer is usually a flatten operation or a global average of the last convolutional layer [1]. It is important to note that each vector describes the content of the image at the input of the CNN, and this description is made in terms of what the CNN analyzes, layer by layer, in the image to compute the probability of each label. This image analysis is learned during training, as the weights of the convolutional filters are adjusted.
2. For each CNN, the feature vectors obtained in the previous step for all the images in the initial dataset are gathered into one dataset. As a result, each dataset contains a transformation of the information contained in the images, representing what each CNN observes when making the classification.

The number of samples in each generated dataset equals the number of images in the initial dataset. On the other hand, the number of features can differ from CNN to CNN, as it is defined by the size of the last layer of the feature-extraction phase. As a result of the previous steps, new datasets containing representations of the images in the initial dataset are generated, one for each CNN trained in Stage A. While the number of features generated can be high (a hundred or more), the information in the datasets can be very important and must be analyzed to interpret how each CNN solves the classification problem (what it focuses on when assigning classes), which features are more relevant, and how diverse the classes are both inter- and intra-class, among other aspects. Because of this, an approach suitable for pattern analysis must be applied in order to visualize and analyze the information described by the features in the datasets. In the present work, the use of SOMs and their visualization tools is proposed for this purpose.
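The two steps above can be summarized as: run every image through a CNN's feature extractor and stack the resulting vectors into a new labeled dataset. In the sketch below, `extract` is a hypothetical stand-in for the output of a CNN's last feature-extraction layer (e.g. global-average pooling); the toy extractor and data are invented for illustration.

```python
def build_feature_dataset(images, labels, extract):
    """Return (feature_vectors, labels) for one CNN's feature extractor."""
    features = [extract(img) for img in images]
    return features, list(labels)

# Stub extractor: a 2-feature description of each "image" (mean and max).
def toy_extract(img):
    return (sum(img) / len(img), max(img))

images = [[0, 1, 2], [3, 3, 3]]
feats, labs = build_feature_dataset(images, ["vir", "bac"], toy_extract)
print(feats)  # [(1.0, 2), (3.0, 3)]
```

With a real framework, `extract` would be the trained network truncated at its last feature-extraction layer, applied in inference mode.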


3.3 Stage C: SOM Training

As a result of the previous stage, several datasets containing representations of the information in the initial dataset were generated. In the present stage, a new compressed representation of the data is obtained using SOMs; one SOM is finally defined for each dataset generated in Stage B. As mentioned before, a SOM is a competitive neural network with hyperparameters that must be defined according to the training data. Just as with CNNs, there are no general criteria for setting up a SOM. However, it is possible to apply an automatic approach for finding an optimal SOM for a given dataset using the SOM error measures described in Sect. 2.2. To ensure that the SOMs are good representations of the datasets, the codebook vectors must follow probability density functions consistent with the datasets used for training. To this end, several SOMs with different combinations of number of cells and topologies are trained for each dataset, and an optimal SOM is then selected in each case by minimizing the sum of the quantization error, the topographic error, and the topographic product. The application of this procedure to the approach proposed here is as follows:
1. For each dataset resulting from Stage B, SOMs with different combinations of number of cells and topologies are trained, and the optimal one is selected. Let η = 5√N be the number of cells estimated with this heuristic formula [24], where N is the number of data points in the dataset. SOMs with η, 2η, 3η, and 4η cells are considered for each dataset, in each case with both hexagonal and rectangular grid topologies on 2D maps. As a result, 8 distinct SOMs are trained for each dataset, 4 for each type of grid topology, all with the same training dataset. In all cases, Gaussian neighborhood functions and batch training are used. As fast convergence is required, all SOMs are initialized by linear initialization, which consists of an ordered initial state of the codebook [7, 8, 25, 26].
2. For each dataset, the SOM for which the sum of the quantization error, the topographic error, and the topographic product is lowest is selected. As a result, an optimal SOM is generated for each dataset.

Therefore, it is possible to apply the analysis tools available for SOMs to obtain an interpretation of how each CNN solves the classification problem. It is important to note that one of the major limitations in analyzing the datasets generated from the CNNs in Stage B is the high number of features. This limitation is overcome here, since SOMs are intrinsically a visual feature-reduction method. The SOMs are analyzed in the next stage of the method.
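A minimal sketch of the Stage C search: the heuristic map size η = 5√N, the four candidate sizes, and selection by minimum sum of the three error measures. `train_som` is a hypothetical callable standing in for a real SOM trainer returning (quantization error, topographic error, topographic product); it is stubbed here, not a real library call.

```python
import math

def candidate_sizes(n_data):
    """Heuristic eta = 5 * sqrt(N) [24] and its multiples 2, 3, 4."""
    eta = round(5 * math.sqrt(n_data))
    return [eta, 2 * eta, 3 * eta, 4 * eta]

def select_optimal(candidates, train_som):
    """Train every (size, topology) combination; keep the one with the
    lowest sum of the three SOM error measures."""
    trials = [(size, topo) for size in candidates
              for topo in ("hexagonal", "rectangular")]
    return min(trials, key=lambda t: sum(train_som(*t)))

print(candidate_sizes(400))  # [100, 200, 300, 400]
```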


3.4 Stage D: Analysis and Interpretation

In this stage, graphical representations of the SOMs (map of labels, map of hits, and U-matrix) are used to achieve the main goal of this work: assessing the class-discrimination abilities of different sets of features. Figure 2 presents examples of maps of labels, maps of hits, and U-matrices. In these examples, the labels are “vir” and “bac”, and the SOM size is 10 × 10 cells. The first example represents a SOM whose features are not good (they seem unsuitable for classification), and the second represents a better case. The following paragraphs detail how the information given by these visualization tools is interpreted, introducing the approach proposed in the present work. The following criteria are proposed for analyzing the SOMs generated in Stage C:
• If the prototype vectors in the SOM codebooks form an organized projection of the training data and preserve the data topology, it is expected that data that are close in pattern space, i.e., images with similar features (close feature vectors), activate close cells in map space (ideally the same cell or adjacent ones). This characteristic is guaranteed since the final SOM for each CNN is optimal.
• In the map of labels, if the intra-class diversity is high, the cells occupied by the label will cover a large portion of the map and the colors associated with those cells will vary (this is the case of example #2 in Fig. 2).

Fig. 2 Two examples of the visualization tools used in the present work for SOM analysis (map of labels, map of hits, and U-matrix for examples #1 and #2). While example #1 represents a case where the features are not good, example #2 represents a case with better features in terms of class description


• Observing the map of labels, it is possible to judge whether the topological separation of the labels is adequate: the colors of the cells should differ between labels, and the cells of a same label should lie in the same area of the map. When a label has isolated cells, i.e., some of its cells are surrounded by cells of another label, the topological separation of the classes is poor, which may be related to a poor quality of the features extracted by the CNN. In this sense, in Fig. 2, example #1 is inadequate, while example #2 shows better separation between classes and, also, high diversity for the class “bac”.
• Considering the U-matrix, it is possible to describe how good the inter-class separation is by examining the distance values between cells lying at the class boundaries in the map of labels. In addition, large distances between cells of a same class give some idea of the diversity or complexity of the class, since the prototype vectors are a reduced sample of the training dataset [8]. Considering the examples in Fig. 2, the U-matrix of example #1 does not show a good separation between classes except for some cells in the upper-right corner.
• The maps of hits (where each cell is labeled with the number of times its prototype vector was the BMU) give some idea of the frequency with which data appear in the dataset. Again, very similar data are mapped to the same BMU or to adjacent BMUs. Cells with high numbers of hits contain prototype vectors that are very representative of the training data. For the examples in Fig. 2, the frequency of hits is more homogeneous in example #2.

Based on these considerations and, also, by comparing the validation error of the CNNs and the SOM error measures, it is possible to draw conclusions about the quality of each CNN and the features it extracts. As a rule, the best set of features is the one for which a good topological separation between classes is observed (both on the maps of labels and the U-matrices), with small values of the SOM error measures and good generalization performance of the CNN. The same criteria can be applied to select between CNNs. In summary, by combining the visual analysis of the U-matrix, the map of labels, and the map of hits, it is possible to determine whether a set of features is appropriate for classification: the U-matrix should present as many regions of low distances as there are classes to discriminate, and the labels of each class should be placed together. Moreover, the regions of the map covered by the hits indicate the variability of the data for each class (extended areas mean large variability and small areas mean low variability). Concerning the case where transfer learning is used in Stage A to generate the CNNs, if the feature-extraction layers are frozen during the entire training period, the features generated by these CNNs are generic; consequently, they depend on the images used for pre-training. Although the features are generic, they can still provide a good solution for the classification problem and can be used with the method proposed in this work. However, if the same feature-extraction phase is shared by more than one CNN generated in Stage A, it makes no sense to compare the results of Stages B and C, as the structure used for extracting features is the same.
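As an illustration of the U-matrix used throughout this stage, the sketch below computes, for each cell of a rectangular-grid SOM, the mean Euclidean distance between its prototype and those of its 4-connected neighbours (one common U-matrix variant; the chapter does not specify its exact formulation). The toy codebook is invented.

```python
import math

def u_matrix(codebook, rows, cols):
    """U-matrix for a rectangular grid: per cell, the mean distance between
    its prototype and the prototypes of its 4-connected neighbours."""
    def d(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            nbrs = [(r + dr, c + dc)
                    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= r + dr < rows and 0 <= c + dc < cols]
            row.append(sum(d(codebook[r * cols + c], codebook[nr * cols + nc])
                           for nr, nc in nbrs) / len(nbrs))
        out.append(row)
    return out

# 1-by-2 toy map: each cell has one neighbour at distance 1.
print(u_matrix([(0.0,), (1.0,)], 1, 2))  # [[1.0, 1.0]]
```

High values in this matrix correspond to the class borders discussed above; low values mark homogeneous regions.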


4 Application Example

In this section, an application example of the proposed approach is presented.

4.1 Experimental Setup

The proposed methodology was applied to the Chest X-Ray Images of Pneumonia dataset [27], containing a total of 5863 grayscale chest X-ray images in 3 classes: normal (1584 cases), bacterial pneumonia (2780 cases), and viral pneumonia (1493 cases). The images come from studies of pediatric patients aged 1 to 5 years at Guangzhou Women and Children’s Medical Center, acquired as part of the patients’ routine clinical care and initially reviewed for quality control by removing all low-quality or illegible scans. Two medical experts then made the diagnoses. The size of the images is not uniform, but in all cases exceeds 1500 × 1500 pixels. An example of each class of images used is shown in Fig. 3. Two classification problems were considered from the dataset:
• Problem #1: classification of normal versus pneumonia (merging bacterial pneumonia and viral pneumonia into a single class named pneumonia).
• Problem #2: classification of bacterial pneumonia versus viral pneumonia (not considering the class normal available in the dataset).

For all cases, CNNs trained from scratch were used, starting from an initial architecture with 5 blocks of convolutional layers with kernel sizes and numbers of filters: 11 × 11 × 32; 5 × 5 × 54; 3 × 3 × 128; 3 × 3 × 246; and 3 × 3 × 492, all with ReLU activation and interleaved max-pooling layers of pool size = 2. A global-average-pooling layer was included at the end. The classification phase consisted of a multilayer network with 2048 neurons + dropout + 4096 neurons + dropout + an output layer of size 2. The size of the input layer was set to 200 × 200 × 1. Variants with/without data augmentation and with/without fine-tuning were tested, defining 4 CNNs for each problem considered. Mini-batch training, the stochastic gradient descent (SGD) optimizer, and early stopping were used. The generalization error

Fig. 3 Examples of the dataset Chest X-Ray Images of Pneumonia: normal, bacterial pneumonia, and viral pneumonia


was estimated by hold-out with 80% training, 5% validation, and 15% test, using data balancing by oversampling the minority class in the training dataset (the minority class is resampled until the number of samples of the majority class is reached). The accuracy, the False Positive Rate (FPR), and the False Negative Rate (FNR) were estimated by averaging over 5 training-test iterations. Data augmentation was performed with random transformations including rotation between −5 and +5°, translation along both axes of up to 5%, and zoom of up to 15%. Once each CNN was trained, the methodology introduced in Sect. 3 was applied by taking the output of the last feature-extraction layer, obtaining a feature vector for each image and thus defining new datasets from the original one. The results of this methodology are presented in the next paragraphs. All experiments were conducted using the Keras API in Python on a computer with a Ryzen 9 5900X processor, 64 GB of RAM, and an NVIDIA® Titan V GPU.
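The data-balancing step can be sketched as follows. Sampling with replacement is assumed here, since the minority class must be grown to the majority size; the labels, data, and seed are illustrative, not taken from the chapter.

```python
import random

def oversample_minority(samples, labels, seed=0):
    """Resample each under-represented class (with replacement) until every
    class has as many samples as the majority class."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    n_max = max(len(group) for group in by_class.values())
    out_s, out_y = [], []
    for y, group in by_class.items():
        extra = [rng.choice(group) for _ in range(n_max - len(group))]
        for s in group + extra:
            out_s.append(s)
            out_y.append(y)
    return out_s, out_y

s, y = oversample_minority([1, 2, 3, 4], ["a", "a", "a", "b"])
print(y.count("a"), y.count("b"))  # 3 3
```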

4.2 Result Analysis

For problem #1, normal versus pneumonia, the best results were obtained with data augmentation and fine-tuning, with accuracy = 0.949 ± 0.009, FPR = 0.040 ± 0.006, and FNR = 0.056 ± 0.013. The SOM trained on the features extracted from the best CNN for this case is represented in Fig. 4, following the methodology proposed in Sect. 3. For this SOM, the quality measures were: quantization error = 4.055; topographic error = 0.0283; topographic product = −0.0023. The map on the left is the map of labels corresponding to normal (NOR) and pneumonia (NEU). There is an adequate topological separation of the labels, with pneumonia corresponding to most of the cells of the map and normal occupying the cells at the bottom left. It is concluded that: (a) the features extracted are descriptive enough to differentiate the two classes, even though the SOM is an unsupervised method; (b) pneumonias have more distinct BMUs, evidencing variability among the data of this class; (c) normal cases are similar in the feature space found by the network, i.e., they are grouped in a small part of the map in close cells. On the right of Fig. 4, the U-matrix is represented. For the problem addressed here, it is possible to visualize the separation between normal cases and pneumonias (red area). In the region of pneumonias there are cells with very similar cases (blue and light-blue zones), while other pneumonias differ (top left and bottom right). For problem #2, bacterial pneumonia versus viral pneumonia, the best results were obtained with data augmentation, with accuracy = 0.767 ± 0.019. Representations of the SOM trained on the features extracted from the best CNN are presented in Fig. 5. In this case, an analysis similar to the previous one can be made. Considering the separation of the labels (map of labels on the left of Fig. 5), bacterial pneumonia cases correspond to most of the cells of the map and viral pneumonias occupy the lower left. It is concluded that: (a) the features are descriptive enough to differentiate the two classes of pneumonia; (b) bacterial pneumonias cover more cells of the map, evidencing variability among the data of this class; (c) cases


Fig. 4 Analysis of the SOM trained with features extracted from the best CNN for problem #1, normal versus pneumonia (NOR = normal; NEU = pneumonia)

of viral pneumonia also show diversity. In the U-matrix (the graph on the right of Fig. 5), the separation between the classes considered is not clearly visualized. Although some viral pneumonias are similar to each other, they are not markedly so (light-blue colors). In addition, the diversity of colors in the cells corresponding to bacterial pneumonia (representing distances between prototype vectors) confirms what was stated in point (b) above. Continuing with the examples, Figs. 6 and 7 show results for SOMs other than the optimal ones, for problem #1 and problem #2 respectively. Considering the analysis given in this section and the criteria defined in Stage D of the method, the following aspects of these SOMs can be observed:
• Problem #1 (Fig. 6): while groups of cells with the same class are observed, there is no clear separation between the classes in the U-matrix.

Fig. 5 Analysis of the SOM trained with the features extracted from the best CNN for problem #2, bacterial pneumonia versus viral pneumonia (bac = bacterial pneumonia; vir = viral pneumonia)


Fig. 6 Analysis of a non-optimal SOM trained with features extracted from the best CNN for problem #1, normal versus pneumonia (NOR = normal; NEU = pneumonia)

Fig. 7 Analysis of a non-optimal SOM trained with the features extracted from the best CNN for problem #2, bacterial pneumonia versus viral pneumonia (bac = bacterial pneumonia; vir = viral pneumonia)

• Problem #2 (Fig. 7): in the map of labels (left of the figure) there are some unlabeled cells and cells mixed between classes. From the U-matrix, it is not possible to identify any separation between classes.

5 Conclusions

In this chapter, an approach for interpreting features extracted from CNNs is proposed. It allows us to analyze features generated by CNNs and to select among CNNs based on what is observed in the SOMs. Given a set of images and their labels for classification, the method begins by extracting features from CNNs trained with them, obtaining a labeled dataset that


represents the images and their classes. These CNNs may be trained from scratch (modifying the weights of all layers), or transfer learning may be considered (modifying only the last layers, i.e., the classification stage, or fine-tuning some additional layers). After feature extraction, the dataset obtained is used to train a SOM that is optimized by searching for the best hyperparameters with a previously proposed approach based on the analysis of quality measures. The trained SOM is visualized through its U-matrix representation, its map of labels, and its map of hits. By considering the regions of the U-matrix and the clustering of labels on the map of labels, the goodness of the features for classification can be assessed. By observing the map of hits, the variability of the classes can be determined. If several CNNs are considered, the one that gives the best separation of the classes would be the best one to use for image classification, which constitutes one of the main contributions of this chapter. Considering the results obtained from the application example, it was possible to identify good features from CNNs trained from scratch and to derive some interpretation from them, both for the classification of normal versus pneumonia and for viral pneumonia versus bacterial pneumonia. As immediate future work, it is planned to continue working on new interpretation methods for deep features obtained by CNNs, considering the visualization of the outputs obtained in intermediate layers.

Acknowledgements Diego S. Comas acknowledges support from Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina. The authors acknowledge support from NVIDIA Corporation, who donated the GPU Titan V used in this work.

References

1. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015). https://doi.org/10.1038/nature14539
2. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks (2014). https://doi.org/10.1007/978-3-319-10590-1_53
3. Montavon, G., Samek, W., Müller, K.R.: Methods for interpreting and understanding deep neural networks. Digit. Signal Process. 73, 1–15 (2018). https://doi.org/10.1016/j.dsp.2017.10.011
4. Hussain, M., Eakins, J.P.: Component-based visual clustering using the self-organizing map. Neural Netw. 20, 260–273 (2007). https://doi.org/10.1016/J.NEUNET.2006.10.004
5. Taşdemir, K., Merényi, E.: Exploiting data topology in visualization and clustering of self-organizing maps. IEEE Trans. Neural Networks 20, 549–562 (2009). https://doi.org/10.1109/TNN.2008.2005409
6. Meschino, G.J., Passoni, L.I., Scandurra, A.G., Ballarin, V.L.: Representación automática pseudo color de imágenes médicas mediante Mapas Autoorganizados. In: Simposio Argentino de Informática y Salud - SIS 2006, pp. 105–115. Ciudad de Mendoza, Argentina (2006)


7. Meschino, G.J., Comas, D.S., Ballarin, V.L., Scandurra, A.G., Passoni, L.I.: Automatic design of interpretable fuzzy predicate systems for clustering using self-organizing maps. Neurocomputing 147 (2015). https://doi.org/10.1016/j.neucom.2014.02.059
8. Comas, D.S., Pastore, J.I., Bouchet, A., Ballarin, V.L., Meschino, G.J.: Interpretable interval type-2 fuzzy predicates for data clustering: a new automatic generation method based on self-organizing maps. Knowl.-Based Syst. 133, 234–254 (2017). https://doi.org/10.1016/j.knosys.2017.07.012
9. Lakshminarayanan, S.: Application of self-organizing maps on time series data for identifying interpretable driving manoeuvres. Eur. Transp. Res. Rev. 12, 1–11 (2020). https://doi.org/10.1186/s12544-020-00421-x
10. Yuan, E., Matusiak, M., Sirinukunwattana, K., Varma, S., Kidziński, Ł., West, R.: Self-organizing maps for cellular in silico staining and cell substate classification. Front. Immunol. 12, 4437 (2021). https://doi.org/10.3389/fimmu.2021.765923
11. Meyer, S.G., Reading, A.M., Bassom, A.P.: The use of weighted self-organizing maps to interrogate large seismic data sets. Geophys. J. Int. 231, 2156–2172 (2022). https://doi.org/10.1093/GJI/GGAC322
12. Nilashi, M., Asadi, S., Abumalloh, R.A., Samad, S., Ghabban, F., Supriyanto, E., Osman, R.: Sustainability performance assessment using self-organizing maps (SOM) and classification and ensembles of regression trees (CART). Sustainability 13, 3870 (2021). https://doi.org/10.3390/SU13073870
13. Fukushima, K.: Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36, 193–202 (1980). https://doi.org/10.1007/BF00344251
14. Bishop, C.: Neural Networks for Pattern Recognition. Oxford Press, Oxford (2005)
15. Mejía, J., Ochoa-Zezzatti, A., Contreras-Masse, R., Rivera, G.: Intelligent system for the visual support of caloric intake of food in inhabitants of a smart city using a deep learning model. In: Applications of Hybrid Metaheuristic Algorithms for Image Processing, pp. 441–455 (2020). https://doi.org/10.1007/978-3-030-40977-7_19
16. Akcay, S., Kundegorski, M.E., Devereux, M., Breckon, T.P.: Transfer learning using convolutional neural networks for object classification within X-ray baggage security imagery. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 1057–1061. IEEE (2016). https://doi.org/10.1109/ICIP.2016.7532519
17. Lucena, O., Junior, A., Moia, V., Souza, R., Valle, E., Lotufo, R.: Transfer learning using convolutional neural networks for face anti-spoofing (2017). https://doi.org/10.1007/978-3-319-59876-5_4
18. Mar-Cupido, R., García, V., Rivera, G., Sánchez, J.S.: Deep transfer learning for the recognition of types of face masks as a core measure to prevent the transmission of COVID-19. Appl. Soft Comput. 125, 109207 (2022). https://doi.org/10.1016/j.asoc.2022.109207
19. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59–69 (1982)
20. Attik, M., Bougrain, L., Alexandre, F.: Self-organizing map initialization (2005). https://doi.org/10.1007/11550822_56
21. Mammadli, R., Wolf, F., Jannesari, A.: The art of getting deep neural networks in shape. ACM Trans. Archit. Code Optim. 15 (2019). https://doi.org/10.1145/3291053
22. Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., van der Laak, J.A.W.M., van Ginneken, B., Sánchez, C.I.: A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017). https://doi.org/10.1016/J.MEDIA.2017.07.005
23. Cimpoi, M., Maji, S., Vedaldi, A.: Deep filter banks for texture recognition and segmentation. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 7–12 June 2015, pp. 3828–3836 (2015). https://doi.org/10.1109/CVPR.2015.7299007
24. Vesanto, J.: Data Exploration Process Based on the Self-Organizing Map (2002)
25. Kohonen, T.: Self-Organizing Maps. Springer (1997)


26. Kohonen, T.: MATLAB Implementations and Applications of the Self-Organizing Map. Unigrafia Oy, Helsinki, Finland (2014)
27. Kermany, D.S., Goldbaum, M., Cai, W., Valentim, C.C.S., Liang, H., Baxter, S.L., McKeown, A., Yang, G., Wu, X., Yan, F., Dong, J., Prasadha, M.K., Pei, J., Ting, M., Zhu, J., Li, C., Hewett, S., Dong, J., Ziyar, I., Shi, A., Zhang, R., Zheng, L., Hou, R., Shi, W., Fu, X., Duan, Y., Huu, V.A.N., Wen, C., Zhang, E.D., Zhang, C.L., Li, O., Wang, X., Singer, M.A., Sun, X., Xu, J., Tafreshi, A., Lewis, M.A., Xia, H., Zhang, K.: Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172, 1122–1131.e9 (2018). https://doi.org/10.1016/j.cell.2018.02.010

A Hybrid Deep Learning-Based Approach for Human Activity Recognition Using Wearable Sensors Deepak Sharma , Arup Roy , Sankar Prasad Bag , Pawan Kumar Singh , and Youakim Badr

Abstract Human Activity Recognition (HAR) is a branch of computer science that uses raw time-series data from embedded smartphone sensors and wearable devices to infer human actions. It has aroused considerable interest in various smart-home contexts, particularly for constantly monitoring human behavior in an ecologically friendly atmosphere for elderly people and rehabilitation. Data collection, feature extraction from noise and distortion, feature selection, and preprocessing and categorization are among the operating components of a typical HAR system. Feature extraction and selection strategies have recently been developed using both cutting-edge approaches and traditional machine learning classifiers. The majority of these solutions, however, rely on simple feature extraction algorithms that are unable to detect complex behaviors. Due to advancements in computational capabilities, deep learning algorithms are now often utilized in HAR methods to efficiently extract meaningful features that can successfully categorize sensor data. In this chapter, we present a hybrid deep learning-based classification model comprising a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network, named CNN-LSTM. The proposed hybrid deep learning model has been tested on three benchmark HAR datasets: MHEALTH, OPPORTUNITY, and HARTH, on which it obtained classification accuracies of 99.07%, 95.2%, and 94.68%, respectively. The source code of the proposed work can be accessed at the following link: https://github.com/DSharma05/Human-Activity-Recognition-usinghybrid-Deep-learning-approach.

Keywords Human activity recognition · Hybrid deep learning model · Wearable sensors · MHEALTH · OPPORTUNITY · HARTH

D. Sharma · P. K. Singh (B)
Department of Information Technology, Jadavpur University, Jadavpur University Second Campus, Plot No. 8, Salt Lake Bypass, LB Block, Sector III, Salt Lake City, Kolkata 700106, West Bengal, India
e-mail: [email protected]

A. Roy
School of Computing and Information Technology, Reva University, Bengaluru 560064, Karnataka, India

S. P. Bag
Department of Medical Biotechnology, College of Life Science and Biotechnology, Dongguk University, Seoul, Republic of Korea

Y. Badr
Pennsylvania State University, Great Valley, 30 East Swedesford Road, Malvern, PA 19355, USA
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
G. Rivera et al. (eds.), Innovations in Machine and Deep Learning, Studies in Big Data 134, https://doi.org/10.1007/978-3-031-40688-1_11

1 Introduction

Human activity recognition (HAR) is among the most active and intriguing research domains in computer vision and human–computer interaction. In the fields of ubiquitous computing, human–computer interaction, and human behavior analysis, automatically identifying an individual's physical activities has emerged as a major issue. As a consequence of tremendous developments in micro-electronics during the preceding decade, several intricate, high-computational-power gadgets capable of accomplishing more difficult functions than ever before have been built. These devices are becoming a part of everyone's daily lives due to their small size, low cost, massive processing capabilities, and low energy usage. A significant area of research has focused on understanding human behavior, particularly as it relates to applications in medicine, the armed services, and security. Recognizing human activities is accomplished by analyzing signals obtained in real time from a number of body-worn motion sensors. Sensor nodes, which include accelerometers, gyroscopes, and many more, are utilized in smart devices. Most smartphones now include accelerometers, magnetometers, and gyroscopes as standard features. Once data from multiple wearable sensing devices have been analyzed and evaluated with a classification method, physical human activity can be readily identified. Predicting activity is useful since it enables individuals to keep a record of their regular schedule. Our cell phones, which have become an integral part of our lives, feature sensors that can detect human movement. We can use these sensor data to recognize various movements like walking, lying, running, sitting, walking downstairs, walking upstairs, and so on. Depending on the actions we wish to forecast in accordance with our needs, different sensing devices are employed.
Additionally, wearable-sensor data, pictures, or videos are employed to recognize human behaviors [1]. HAR can be used in a variety of research areas, including the following. In order to prevent crimes and fatal terrorist activities in public places, HAR supports monitoring: a platform can deliver a complete, deployable HAR system based on real-time video input from security cameras in public locations, with real-time, on-demand activity identification.


HAR is widely used in a variety of settings, including residential care, hospitals, and rehabilitation. Smart devices combined with HAR can monitor the everyday activities of senior citizens who live at home or in a rehabilitation center. This not only keeps elderly people healthy, but also helps them avoid ailments by monitoring their pulse rate, blood oxygen level, calorie consumption, calories burnt, and many other things. By tracking these devices and the activities they record, and helping to maintain the patient's health accordingly, HAR has become one of the most effective ways to regulate a patient's daily physical activities, minimizing the chances of life-threatening conditions such as diabetes and cardiovascular disease [2]. Although HAR is a classic Pattern Recognition (PR) problem, constructing a suitable and highly accurate model has proven to be challenging. Input data for a HAR model can be obtained via sensors such as an accelerometer and gyroscope connected to smart gadgets like smartphones. These sensors can measure the acceleration of the entire body, of specific body parts, or both, with respect to a number of linear or rotational axes. Predicting human behavior from these few known features, however, is a tough undertaking: numerous other features are required to deal with the human body's complicated system of motion when performing a particular activity, and it is challenging to extract relevant hand-crafted traits with human understanding alone. Research scholars are therefore turning to deep learning-based methods as an alternative, which can be a feasible solution for this sort of difficult PR challenge. Various issues are driving the development of new technology in order to increase precision in more realistic settings.
Some of these difficulties include: (1) selecting the attributes to be measured, (2) constructing a data acquisition system that is portable, inconspicuous, and low-cost, (3) designing a method for feature extraction and inference, (4) obtaining accurate data, (5) flexibility to accommodate new users without requiring retraining, and (6) putting it into practice in a real-world setting. After reading a number of contemporary articles and classic works on HAR, as well as analyzing the outcomes of various proposed models, it has been revealed that previous HAR models achieve different recognition accuracies on different datasets. As a result, we have worked with a hybrid deep learning model on three different HAR datasets: MHEALTH [3], OPPORTUNITY [4], and HARTH [5]. The following are the contributions of this chapter:
1. We have implemented a CNN-LSTM-based hybrid deep learning model for the HAR problem using wearable sensors.
2. The proposed model has been applied to three publicly available benchmark datasets: MHEALTH [3], OPPORTUNITY [4], and HARTH [5].
3. The above-mentioned datasets have been trained and tested using this hybrid model, and the performance results are evaluated accordingly.
4. The proposed model has been found to outperform all the state-of-the-art HAR models.
Figure 1 shows the general representation of HAR using raw sensor data.


Fig. 1 Schematic diagram of HAR using raw sensor input data

2 Literature Analysis

In the area of computer vision, HAR is a difficult research problem, and academics from all around the world have been attempting to create a nearly flawless recognition system for quite some time. HAR has already been the focus of in-depth research; Singh et al. [6] have performed a comprehensive survey of this field. Keeping up with the most recent procedures and the results they produce is necessary because this sector is expanding quickly. This section is largely concerned with summarizing previous work, both in general and with respect to the datasets we have chosen. The benefits of deep learning over classical machine learning techniques piqued research scholars' interest in using deep learning models for HAR. Certain techniques categorized the proposed models based on the raw sensory data acquired (time-series signals), while others translated the signals to visual information such as spectrum or digital pictures for action identification, as seen in Fig. 2.

Fig. 2 Taxonomy of suggested deep learning-based models applied for solving HAR problem


Recently, Convolutional Neural Networks (CNNs) have proven to be a powerful technique for extracting features from and categorizing large-scale images [7]. Furthermore, scholars have been fascinated by visual representations of time-series signals, which have been used to categorize time-series data employing CNNs. As a result, several HAR systems turn raw time-series information into visual cues such as digital pictures and spectrum images, which may subsequently be classified using a CNN. Lawal and Banu [8] addressed multi-modal sensor activity detection by fusing numerous time-sequence (motion) data streams supplied by worn devices into a single frequency picture for categorization. Qin et al. [9] converted the time-domain signal into a 2-channel picture combined with a residual fusion net to cope with diverse input from numerous sensors; in this technique, the layers proposed by Lawal and Banu [8] in 2020 are changed using residual fusion nets rather than just a third convolution block. For the evaluation process, a significant experiment was carried out utilizing the different datasets. Deep learning methods are extremely capable of managing time-series signals for feature extraction and classification due to the benefits of local dependence and scaling invariance. More researchers have lately been drawn to remarkable deep learning approaches such as CNNs, Long-Short Term Memory (LSTM), and hybrid models for improved recognition in HAR applications [10]. Because CNN-based models need a lot of training data and are time-consuming and resource-inefficient, researchers have also started to use Graph Neural Networks (GNNs) and other ensemble models in this field for better performance and higher accuracy. To overcome the aforementioned problems, Mondal et al.
[11] built a GNN model which used a structural graph to depict time-series data. Bhattacharya et al. [12] proposed ensemble stacking prediction using base classifier models: CNN, LSTM, and hybrid models. Das et al. [13] proposed MMHAR-EnsemNet, and the analysis of the ensemble revealed a considerable improvement in their model's efficiency by combining the results of models trained on sensor data and RGB photos. By combining existing deep learning classifiers in several ways, such as majority voting, the sum rule, and score fusion, Mukherjee et al. [14] suggested an ensemble of three basic classification methods: CNN-Net, Encoded-Net, and CNN-LSTM, which improved overall model performance. For the identification of 3D skeletal actions, Banerjee et al. [15] suggested a fuzzy-integral fusion built on a CNN-based model, in order to recognize the daily actions of humans using mobile sensors to monitor health [16]. CNNs can categorize raw time-series data signals in addition to images. Several researchers have capitalized on this by employing CNN models to process raw multi-sensor information and identify human movements. CNN feature extraction provides several benefits over traditional shallow and superficial HAR learning approaches, including local dependence and scale invariance, as well as the ability to capture all imaginable complicated non-linear relationships between several features [17]. A CNN model for identifying simple activities from smartphone tri-axial accelerometer information was proposed by Chen and Xue [18]. To handle tri-axial


accelerometer data, the CNN model was constructed, and the convolutional kernels were upgraded. Ronao and Cho [19] used a previously studied convolution operation layer with a modest pooling size to construct a unique CNN architecture for extracting complicated information. The suggested technique was evaluated by utilizing raw information and the temporal features of Fast-Fourier transformed signals generated by the CNN. In the next subsections, we will discuss the previous works related to the HAR datasets that we have chosen for our study.

3 OPPORTUNITY Dataset

CNN models often need considerable computational resources for extracting and classifying features, since they employ a significant number of filters and feature maps. A CNN model for localized feature extraction from a three-axis smartphone signal [20] was created to lower the cost of power consumption and equipment. Lego filters were used rather than standard convolutional filters to minimize the memory and processing costs of the CNN [21]; the proposed minimalist model needs no special network architecture or computational resources and thus enhances experimental performance and scalability. Similar to [21], Cheng et al. [22] employed conditionally parameterized convolution to construct a novel, computationally efficient HAR model for smartphones and wearable sensor devices; experiments were carried out to demonstrate that it matches the efficiency of larger baseline networks. Smartphone-based human activity detection systems generally depend on the position of the mobile phone during experiments, where it is fixed in relation to the human participant; smartphones are, for example, fastened vertically to a subject's belt. Rather than categorizing images like CNN models, LSTM algorithms excel at forecasting raw time-series sequences: CNNs classify pictures using spatial correlations, whereas an LSTM classifies time-series sequence data by evaluating it via feedback connections. Regarding the advantages of LSTM models over CNN models, authors have offered alternative methodologies for LSTM-based HAR models. Rashid et al. [23] extended the CNN technique by developing an adaptive CNN with low power consumption, which is both energy- and memory-efficient; complex actions were tested on the suggested system's datasets to demonstrate its memory and energy efficiency. Zhao et al.
[24] built a bidirectional residual LSTM system with forward and backward states, capturing both temporal directions. Gradient disappearance is prevented by using residual connections between the stacked cells. Local characteristics acquired via heuristic procedures are the most popular solution for HAR systems, but traditional machine learning methods tend to settle in local minima rather than finding a globally optimal answer. A deep hybrid


approach, built on CNN convolutions integrated with LSTM recurrent units and ELM-classified operations, has been proposed [25]. The suggested framework was tested on the OPPORTUNITY dataset and outperformed non-recurrent neural network models. The CNN-LSTM method reduces model complexity and does not need costly feature engineering, while simultaneously improving the accuracy of predicting people's activity from raw sensor data. The CNN-LSTM network spans space and time: a CNN-LSTM method of human action recognition combines the reliability of a CNN's feature extraction with the time-series analysis and categorization of an LSTM model to enhance activity-classification efficiency. Across a variety of situations, including limited training data, confusable behaviors, and varying sensor locations, hybrid systems are expected to be more successful than typical deep learning techniques. A deep hybrid model combining LSTM models coupled with a CNN and global average pooling (GAP) was proposed by Xia et al. [26]; the GAP layer is employed instead of a standard fully connected layer, and batch normalization is employed after GAP to quicken the proposed system's convergence. The majority of HAR devices are designed for basic physical activity and have difficulties recognizing complex physical behavior. Qi et al. [27] proposed a framework for adaptive recognition for monitoring and detecting dynamic and complex human behavior. In addition to complex actions, a deep learning system for the recognition of short- and long-term transitional activities has been established [28]; the study constructs a CNN to extract features as well as an LSTM network to manage inter-dependencies across parameters, improving the HAR identification rate. The wearable sensory model identifies actions and their transitions precisely.
The HAR techniques in [27, 28] encountered a number of challenges, primarily in recognizing similar and confusable behaviors such as walking and climbing upstairs. To overcome these obstacles, Lv et al. [29] proposed a margin-based method for deriving discriminative classification traits. The proposed margin mechanism altered four neural networks, and their results were compared with traditional models on several benchmark datasets, aiming to demonstrate how multi-modal HAR can be made more interpretable. Recently, a novel dual-attention system based on the CNN model, named DanHAR, which combines channel and temporal attention, was developed by Gao et al. [30]. Ordóñez and Roggen [31] proposed the DeepConvLSTM model, which was based on recurrent neural networks' capacity to discriminate between multimodal activities and particularly characterizes the temporal dynamics of feature activations. Table 1 summarizes the previous HAR work done on the OPPORTUNITY [4] dataset.


Table 1 Performance evaluation of some of the most recent HAR methods on the OPPORTUNITY dataset [4]

Author               Publication year   Classification models            Accuracy (in %)
Zhao et al. [24]     2018               Residual bidir-LSTM              90.5
Sun et al. [25]      2018               CNN + LSTM + ELM                 90.6
Tang et al. [21]     2019               CNN + Lego bricks                86.10
Cheng et al. [22]    2020               CNN with batch normalization     81.18
Xia et al. [26]      2020               LSTM + CNN + GAP                 92.63
Lv et al. [29]       2020               Hybrid model                     92.30
Rashid et al. [23]   2021               Adaptive CNN                     91.57

4 MHEALTH Dataset

Khatun and Morshed [32] explore an ensemble approach combining tree algorithms with a leave-one-subject-out strategy to discover the optimal way of dealing with null values while recognizing frequent actions of significant interest in a computerized platform. In addition, the authors focus on using little data and device sensors to enable practical applications. Singh et al. [33] suggest a deep neural network design that, in addition to capturing the spatio-temporal characteristics of input from various sensors over time, also selects and learns critical time points via a self-attention mechanism. Khatun et al. [34] developed a hybrid classification model that integrates a CNN-LSTM and is motivated by the self-attention technique to enhance the system's prediction abilities. Gumaei et al. [35] offered an efficient multi-sensor architecture for human activity detection based on a hybrid deep learning model that integrates Simple Recurrent Units (SRUs) and Gated Recurrent Units (GRUs). The authors employ deep SRUs to handle multi-modal input sensor data sequences by using their internal storage states. In order to address oscillations, consistency concerns, and vanishing-gradient issues, they also employed deep GRUs to grasp and remember how much information from the past is conveyed to the desired state in the future. Using both accelerometer and gyroscope inputs, Debache et al. [36] suggested a new machine learning-based approach for categorizing human activities: they used a reduced set of features extracted from the filtered signals and then applied a novel hierarchical logistic-regression classifier system. The approach also significantly reduces computing costs and does not require feature selection or hyperparameter adjustment. Using tri-axial inertial sensors, it is possible to monitor continuous sequences of physical human motions; Jalal et al.
[37] offer a unique HAR system with various integrated characteristics. To determine the ideal wearable sensor data, the suggested HAR system applies a notch filter to 1D signals and looks at the lower/upper cut-off frequencies. Following that, it computes a number of composite features, including


Table 2 Performance evaluation of some of the most recent HAR methods on the MHEALTH [3] dataset

Author                Publication year   Classification models                                           Accuracy (in %)
Jordao et al. [39]    2018               CNN                                                             83
Debache et al. [36]   2020               Logistic regression                                             98.2
Jalal et al. [37]     2020               Decision tree classifier + binary grey wolf optimization        93.95
Tahir et al. [38]     2020               Hybrid model                                                    90.91
Singh et al. [33]     2020               CNN + LSTM + SAL + SL                                           94.86
Khatun et al. [34]    2022               Deep CNN + LSTM + self attention                                98.76

statistical characteristics, Mel-frequency cepstral coefficients, and Gaussian Mixture Model features. Classification and identification are then performed using the Binary Grey Wolf Optimization algorithm as the classifier. To improve the values of the ideal characteristics, Tahir et al. [38] presented a multi-fused model; adaptive moment estimation and a maximum-entropy Markov model are then used to optimize and categorize the given feature values. Over the MHEALTH dataset [3], this approach had an accuracy of 90.91%. Jordao et al. [39] constructed convolutional neural networks that recognize human movement from wearable sensor data; the accuracy rate was 83% when the authors tested the approach on the MHEALTH [3] dataset utilizing the leave-one-participant-out validation protocol. Table 2 summarizes some of the previous HAR works on the MHEALTH dataset.

5 HARTH Dataset

Recently, Logacjov et al. [5] developed the Human Activity Recognition Trondheim (HARTH) dataset in 2022 and made it freely accessible to the research community. We have also used this dataset in the current study. As the second contribution of their research, the dataset was utilized for training seven different baseline machine-learning models for HAR: k-nearest neighbors, random forests, extreme gradient boosting, a CNN, a bidirectional LSTM, a CNN with multi-resolution blocks, and a support vector machine. With an F1-score of 0.81 (standard deviation: 0.18), recall of 0.85 ± 0.13, and accuracy of 0.79 ± 0.22, the support vector machine produced the best results in leave-one-subject-out cross-validation.


6 Materials and Methods

6.1 Some Preliminaries

CNNs, a subset of deep neural networks, are often employed in a variety of image analysis and recognition applications [40]. Convolution is described theoretically as a mathematical operation performed on two functions that yields a third function illustrating how the shape of one function is influenced or transformed by the second function [10].
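As an illustration of this definition, a minimal valid-mode 1D discrete convolution can be written in a few lines. This is a hypothetical helper for intuition only, not the chapter's implementation (which relies on Keras convolution layers):

```python
def conv1d_valid(signal, kernel):
    """Valid-mode 1D convolution: slide the flipped kernel over the signal."""
    k = kernel[::-1]  # convolution flips the kernel (cross-correlation does not)
    n_out = len(signal) - len(kernel) + 1
    return [sum(signal[i + j] * k[j] for j in range(len(k)))
            for i in range(n_out)]

# A moving-sum kernel applied to a short signal:
print(conv1d_valid([1, 2, 3, 4, 5], [1, 1, 1]))  # -> [6, 9, 12]
```

Each output value shows how the kernel "transforms" the local shape of the signal, which is exactly what a convolutional layer learns to exploit.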

6.2 Basic Architecture of CNN

A CNN architecture is composed of two key components:
• Feature extraction: a convolution method that extracts and detects a picture's distinctive characteristics for analysis.
• A fully connected layer that takes the convolution output and forecasts the picture's class based on the characteristics gathered before.
CNN layers are classified into three types: convolutional, pooling, and fully-connected (FC) [15]. Figure 3 shows the schematic diagram of the CNN architecture.

Fig. 3 General overview of the CNN representation


7 Long-Short Term Memory (LSTM)

LSTM networks were created primarily to address the long-term dependency issue that recurrent neural networks (RNNs) have (because of the vanishing gradient problem). The addition of feedback connections distinguishes LSTMs from standard feedforward neural networks [26]. This characteristic allows LSTMs to analyze full data sequences (such as time-series sequences) without needing to examine each point in the sequence independently, by maintaining crucial details about past data points to help in the processing of incoming data units. LSTMs are therefore particularly adept at processing data sequences such as text, general time series, and audio. Figure 4 shows the LSTM representation.

7.1 Working Principle of LSTM

To start, there are three variables that affect how an LSTM performs at any given time:
• The cell state: the network's current long-term memory.

Fig. 4 Architectural overview of a generalized LSTM model


• The previous hidden state: the result from the preceding point in time.
• The input data for the current time step.
To regulate how data in a sequence enters, remains in, and leaves the network, LSTMs employ a set of "gates." An LSTM is basically made up of three gates: a forget gate, an input gate, and an output gate. These gates function like filters, and each is effectively its own neural network [31].
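In a standard textbook formulation (a sketch, not taken from the chapter), with input x_t, previous hidden state h_{t-1}, cell state c_t, sigmoid σ, elementwise product ⊙, and learned weights W, U, b, the three gates and the state updates compute:

```latex
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)         % forget gate
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)         % input gate
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)         % output gate
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)  % candidate cell state
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t   % cell-state update
h_t = o_t \odot \tanh(c_t)                        % new hidden state
```

The additive cell-state update is what lets gradients flow over long sequences, mitigating the vanishing-gradient problem mentioned above.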

8 Proposed Model Architecture

The suggested CNN-LSTM-based deep learning model's architecture is given in Fig. 5. In this work, a combined approach for automatically predicting human activity was built using different datasets. The CNN network was utilized to extract complex features from the sensor data, while the LSTM network served as a classifier in the architecture created by integrating the two networks. A total of 11 layers make up the network: 2 convolution layers, 2 batch normalization layers, 2 ReLU layers, 1 LSTM layer, 1 MaxPooling layer, and one output layer with a Softmax function. In order to extract features, the ReLU function activates the convolutional layer, which has a 3 × 3 kernel size. One method for reducing input dimensions is to utilize the max-pooling layer with a pool size of 2. In the last section of the architecture, the feature map is provided to the LSTM layer to extract temporal information, which is then delivered to the dense layer and activation function to obtain the output. Two 1D-convolutional layers make up the CNN feature-extraction block (see Fig. 5a for more details). Between the two successive convolution layers, we added Batch Normalization and Rectified Linear Unit (ReLU) layers. Numerous convolutional layers in a deep learning model enable the initial layers to learn low-level properties of the applied input, since convolutional operations are extremely efficient. The outcome of the convolutional layers, the feature map, has the limitation of recording the exact position of the input features: even small changes in the positions of the input features will produce a different feature map. Following the convolutional layers, a pooling layer is therefore frequently added to reduce this sensitivity of the feature map while enhancing the model's ability to learn complex structures.
We incorporated a MaxPooling layer in our model, a down-sampling method that reduces the spatial dimension of the feature maps by a factor of 2, lessening the overall computing cost. Researchers typically utilize the ReLU activation function to make the network more trainable, since it is resistant to gradient vanishing. When creating a CNN model, it is usual practice to use a coarse-to-fine strategy; because there are so many trainable parameters, this structure brings more computational effort. We employed one LSTM layer with 64 neurons in the sequence-learning block. Finally, we used a dense layer of 128 units. Figure 5b shows the

Fig. 5 Design of the suggested a CNN model, b LSTM layer, and c deep learning framework based on CNN-LSTM for solving the HAR problem

Table 3 Hyperparameters of the proposed CNN-LSTM model

Parameters               Parameter value
Optimizer                Adam
Learning rate            0.01
Batch size               32
Loss function            Sparse categorical cross entropy
Epoch (MHEALTH)          30
Epoch (OPPORTUNITY)      100
Epoch (HARTH)            20

proposed LSTM layer. Figure 5c shows the pictorial representation of our proposed hybrid model. Table 3 shows the hyperparameters of our proposed hybrid model.
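Putting the pieces above together, the layer stack can be sketched in tf.keras as follows. This is a minimal, hedged sketch rather than the exact published implementation: the window length (100 samples), channel count (23, matching MHEALTH), and the filter count of 64 are illustrative assumptions not fixed by the chapter, and the 1D kernel size of 3 stands in for the stated 3 × 3 kernel. Optimizer, learning rate, and loss follow Table 3.

```python
import tensorflow as tf

def build_cnn_lstm(window_len=100, n_channels=23, n_classes=12):
    """CNN-LSTM stack: 2x(Conv1D + BatchNorm + ReLU), MaxPool, LSTM(64), Dense."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(window_len, n_channels)),
        tf.keras.layers.Conv1D(64, kernel_size=3),   # filter count is an assumption
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv1D(64, kernel_size=3),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.MaxPooling1D(pool_size=2),   # halves the temporal dimension
        tf.keras.layers.LSTM(64),                    # sequence-learning block
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_cnn_lstm()  # ready for model.fit(X_train, y_train, batch_size=32)
```

Swapping `n_classes` and `n_channels` adapts the same sketch to the OPPORTUNITY and HARTH inputs.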

9 Dataset Description

9.1 MHEALTH Dataset

In this work, the UCI Machine Learning Repository's MHEALTH [3] dataset has been employed. Data were obtained from 10 people with various profiles while undertaking 12 physical activities, namely: Standing still, Sitting and relaxing, Lying down, Walking, Climbing stairs, Waist bends forward, Frontal elevation of arms, Knees bending (crouching), Cycling, Jogging, Running, and Jump front and back. The data were collected using wearable sensors (accelerometer, gyroscope, and magnetometer); see Table 4 for more details. The sensors were affixed to the subject's chest, left ankle, and right wrist with elastic bands. Of the ten subjects, data from the first eight are considered the training set, while the remaining two are kept for testing. Both test and train data were segmented by fixed-sized sliding windows, each consisting of 100 time stamps. The data were collected at a sample rate of 50 Hz. A video recording device was used to record and categorize each session. The data were obtained in an out-of-lab setting with no restrictions on the method.

9.2 OPPORTUNITY Dataset

The OPPORTUNITY dataset [4], which contains 17 complex gestures and types of movement, was acquired in a sensor-rich environment. It includes recordings of four persons performing morning tasks in everyday-life scenarios. Sensors of various modalities have been placed in the surroundings, on objects, and on the human body. In terms of sensor configuration, the OPPORTUNITY [4] challenge rules

Table 4 Data device location and source

Sensors/device    Positions          Orientation
Accelerometer     Chest              XYZ
                  Left-ankle         XYZ
                  Right-lower-arm    XYZ
Gyroscope         Left-ankle         XYZ
                  Right-lower-arm    XYZ
Magnetometer      Left-ankle         XYZ
                  Right-lower-arm    XYZ

Table 5 Activities of OPPORTUNITY dataset

Open Door 1       Open Drawer 1
Open Door 2       Close Drawer 1
Close Door 1      Open Drawer 2
Close Door 2      Close Drawer 2
Open Fridge       Open Drawer 3
Close Fridge      Close Drawer 3
Clean Table       Open Dishwasher
Toggle Switch     Close Dishwasher
Drink from Cup

were followed. We considered only the body-worn sensors, including 12 Bluetooth 3-axis acceleration sensors, 2 InertiaCube3 sensors on the feet, and five inertial measurement units on the sports jacket. During the recording, each participant went through five activities of daily living (ADL) sessions and one drill session. Because each sensor axis is handled as a separate channel, the input space has 113 channels. The sampling frequency of these sensors, in particular, is 30 Hz. In this article, our focus is solely on identifying irregular movements; as a result, this is a segmentation and classification task with 17 activity classes. Table 5 summarises the gestures contained in this dataset. The data are segmented using the sliding-window approach: the whole dataset is segmented into fixed-sized windows of 32 data points, each with 50% overlap. The train set makes up 70% of the segmented dataset, whereas the test set makes up 30%.
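The fixed-size, overlapping windowing used for both datasets can be sketched as follows. The helper `sliding_windows` is hypothetical, with 32-sample windows and 50% overlap (the OPPORTUNITY settings) as defaults; it assumes the signal is a (timesteps × channels) array:

```python
import numpy as np

def sliding_windows(signal, window_len=32, overlap=0.5):
    """Segment a (timesteps, channels) array into fixed-size overlapping windows."""
    step = int(window_len * (1 - overlap))  # 50% overlap -> step of 16 samples
    starts = range(0, signal.shape[0] - window_len + 1, step)
    return np.stack([signal[s:s + window_len] for s in starts])

# A 113-channel OPPORTUNITY-style stream, 320 timesteps long:
x = np.zeros((320, 113))
windows = sliding_windows(x)
print(windows.shape)  # -> (19, 32, 113)
```

Setting `window_len=100, overlap=0.0` reproduces the non-overlapping 100-sample windows described for MHEALTH.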

9.3 HARTH Dataset

This dataset consists of acceleration data from 22 subjects who wore two three-axial Axivity AX3 accelerometers on the lower back and thigh. Using the visual stream from a camera, experts independently annotated the data with good inter-rater


agreement. A set of 12 activities is annotated for each participant, including walking, running, shuffling, stairs (up), stairs (down), standing, sitting, and lying down, as well as cycling (sit), cycling (stand), cycling (sit, inactive), and cycling (stand, inactive).

10 Experimental Results

We trained and tested the model on a system with an Intel® Core™ i3-3120M CPU @ 2.50 GHz (4 logical CPUs), 12 GB of RAM, and an NVD7/Intel® HD Graphics 4000 (IVB GT2) GPU. The machine runs a 64-bit Fedora 32 operating system. We utilized Python 3.9, TensorFlow 2.7.0, Keras, and scikit-learn to create the suggested model.

10.1 Evaluation Metrics Used

For the assessment, the data is split into two groups: training and testing. The model is fitted to the training set, and predictions are made on the test set. We used the performance metrics Precision, Recall, F1-score, and Accuracy to analyze our HAR model. Beginning with accuracy, this is calculated by dividing the number of correctly classified instances by the total sample count:

Accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn) × 100%    (1)

Tp (True Positive) is the number of correctly classified records of the positive class, whereas Tn (True Negative) is the number of correctly classified records of the negative class. Fp (False Positive) and Fn (False Negative) denote the numbers of records mistakenly classified as positive and as negative, respectively. Precision is the ratio of correctly identified positive samples to all samples predicted as positive:

Precision = Tp / (Tp + Fp)    (2)

A Hybrid Deep Learning-Based Approach for Human Activity …

Recall is calculated as the ratio of correctly classified positive samples to all samples that are actually positive; it gauges how effectively the model can identify positive samples:

Recall = Tp / (Tp + Fn)    (3)

The F1-measure (F1 score) is another essential statistic since it combines precision and recall into a single metric and thus gives a more realistic representation of model performance. The F1-score is regarded as the best choice in circumstances of unbalanced classes, since classes are weighted according to their sample proportion. The F1-score is written as:

F1-score = (2 × Precision × Recall) / (Precision + Recall)    (4)
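Equations (1)-(4) can be computed directly from the four confusion counts. The following sketch uses made-up counts purely to mirror the formulas above; in practice, Scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute the same quantities from label arrays.

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs. (1)-(4): accuracy (%), precision, recall, and F1-score."""
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for a single positive class.
acc, p, r, f1 = classification_metrics(tp=90, tn=85, fp=10, fn=15)
print(round(acc, 1), round(p, 2), round(r, 3), round(f1, 3))
# 87.5 0.9 0.857 0.878
```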

10.2 Results Analysis on MHEALTH Dataset

Pre-processed samples from the MHEALTH [3] dataset are divided into training (70%) and testing (30%) sets for training and testing the aforementioned model, respectively. A total of 1,215,745 records of all the activities are considered for the experiment. A set of 851,000 records is used for training, whereas the remaining 364,700 records are used for testing. Table 6 shows the precision, recall, F1-score, and accuracy of each activity class of the MHEALTH dataset.

Table 6 Performance measured with respect to precision, recall, F1-score, and accuracy for each activity class on the MHEALTH dataset

| Activity | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|
| Nothing | 1 | 1 | 1 | 1 |
| Standing still | 1 | 0.992 | 0.996 | 0.992 |
| Sitting and relaxing | 0.976 | 1 | 0.988 | 1 |
| Lying down | 0.992 | 0.992 | 0.992 | 0.992 |
| Walking | 1 | 0.992 | 0.996 | 0.992 |
| Climbing stairs | 0.943 | 0.976 | 0.959 | 0.976 |
| Waist bends forward | 1 | 0.990 | 0.995 | 0.991 |
| Frontal elevation of arms | 1 | 1 | 1 | 1 |
| Knees bending (crouching) | 0.982 | 0.957 | 0.969 | 0.957 |
| Cycling | 1 | 0.992 | 0.996 | 0.992 |
| Jogging | 0.978 | 1 | 0.989 | 1 |
| Running | 1 | 0.981 | 0.990 | 0.981 |
| Jump front and back | 1 | 1 | 1 | 1 |
| Average | 0.990 | 0.989 | 0.989 | 0.990 |


Figure 6 shows the confusion matrix generated after the trained proposed model was evaluated on the test data. Our proposed hybrid deep learning model classifies all 12 activity types with high accuracy, and the total accuracy is found to be 99.07%, according to the resulting confusion matrix. As the confusion matrix (shown in Fig. 6) indicates, all activities are categorized almost perfectly (accuracy close to 100%) by our model except the 'Climbing stairs,' 'Waist bends forward,' and 'Jump front and back' activities. Around 5.95% of samples belonging to 'Climbing stairs' are misclassified as 'Knees bending (crouching),' and a total of around 1.88% of 'Waist bends forward' samples are misclassified as 'Standing still' (around 0.94%) and 'Walking' (around 0.94%), since these are similar kinds of motion. Lastly, around 3.70% of 'Jump front and back' samples are misclassified as 'Running' due to the similarity in motion. Figure 7a, b shows the training versus validation accuracy and the training versus validation loss curves on the MHEALTH [3] dataset, respectively; it is clear from Fig. 7a that the accuracy increases with the number of epochs.
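The per-class accuracies and misclassification percentages read off the confusion matrix can be reproduced as follows. The labels here are hypothetical stand-ins for three activity classes, and `sklearn.metrics.confusion_matrix` offers an equivalent ready-made routine.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count how often true class t is predicted as class p."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical ground-truth and predicted labels.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 2])
cm = confusion_matrix(y_true, y_pred, 3)

# Row-normalising gives the per-class accuracy on the diagonal and
# the misclassification percentages on the off-diagonals, which is
# how the figures quoted in the text are obtained.
pct = cm / cm.sum(axis=1, keepdims=True) * 100
print(pct.round(1))
```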

Fig. 6 Confusion matrix obtained from evaluating our proposed HAR model on the MHEALTH dataset


Fig. 7 Graphs showing (a) training versus validation accuracy and (b) training versus validation loss on the MHEALTH dataset

From Fig. 7a, it can be observed that the accuracy is low initially, but after approximately 5 epochs the training accuracy reaches above 95%. As training continues, the accuracy keeps improving and stabilizes at approximately 99% after around 15-20 epochs. Similarly, the validation accuracy reaches approximately 90% after around 5-8 epochs, then dips and rises again above 95%; as the epochs increase, it grows and becomes stable after a certain number of epochs. From Fig. 7b, it can be observed that the training loss starts at around 1 and decreases as the model trains, becoming close to 0.0 after around 20 epochs. Similarly, the validation loss starts above 2 but drops to around 0.5 as the epochs increase, and after around 5-8 epochs it comes close to 0.1; after some fluctuation, it finally settles at approximately 0.1.
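Curves such as those in Fig. 7 are typically drawn from the history object returned by Keras's `model.fit`. The sketch below uses a mock history dict with illustrative values, not the chapter's actual training run.

```python
import matplotlib
matplotlib.use("Agg")          # headless backend; no display needed
import matplotlib.pyplot as plt

# Mock of the dict held in `model.fit(...).history`; the numbers
# here are illustrative only.
history = {
    "accuracy":     [0.62, 0.88, 0.95, 0.97, 0.99, 0.99],
    "val_accuracy": [0.55, 0.80, 0.90, 0.88, 0.95, 0.96],
    "loss":         [1.00, 0.45, 0.20, 0.10, 0.05, 0.03],
    "val_loss":     [2.10, 0.50, 0.30, 0.15, 0.12, 0.10],
}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history["accuracy"], label="training")
ax1.plot(history["val_accuracy"], label="validation")
ax1.set(xlabel="epoch", ylabel="accuracy")
ax1.legend()
ax2.plot(history["loss"], label="training")
ax2.plot(history["val_loss"], label="validation")
ax2.set(xlabel="epoch", ylabel="loss")
ax2.legend()
fig.savefig("training_curves.png")   # write the two-panel figure to disk
```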

10.3 Results Analysis on OPPORTUNITY Dataset

Pre-processed samples from the OPPORTUNITY [4] dataset are divided into training (70%) and testing (30%) sets for training and testing the aforementioned model, respectively. A total of 1,432,340 records of all the activities are considered for the experiment. A set of 1,002,628 records is used for training, whereas the remaining 429,702 records are used for testing. Table 7 shows the precision, recall, F1-score, and accuracy of each activity class of the OPPORTUNITY dataset. Figure 8 shows the confusion matrix generated after the trained proposed model is evaluated on the test data. Our proposed model's


classification accuracy is found to be high across all the activity classes, with a total accuracy of 95.2%, according to the confusion matrix produced by the proposed hybrid deep learning model. The top three activities classified virtually perfectly are "Clean Table," "Drink from Cup," and "Toggle Switch," with 100%, 99.41%, and 97.93% accuracy, respectively, according to the confusion matrix (shown in Fig. 8). The most poorly classified activities are "Open Drawer 3," "Close Drawer 3," "Open Dishwasher," and "Close Dishwasher." It can be observed from the figure that around 12.50% of 'Open Drawer 3' samples are misclassified as 'Close Drawer 3,' and around 2.11% of 'Close Drawer 3' samples are wrongly classified as 'Open Drawer 3,' since these are opposing activities. Similarly, 3.80% of 'Open Dishwasher' samples are misclassified as 'Close Dishwasher,' and around 10.48% of 'Close Dishwasher' samples are misclassified as 'Open Dishwasher,' again a pair of opposing activities.

Table 7 Performance measured with respect to precision, recall, F1-score, and accuracy for each activity class of the OPPORTUNITY dataset

| Activity | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|
| Open Door 1 | 0.956 | 0.946 | 0.951 | 0.946 |
| Open Door 2 | 0.942 | 0.980 | 0.961 | 0.980 |
| Close Door 1 | 0.96 | 0.955 | 0.958 | 0.955 |
| Close Door 2 | 0.967 | 0.915 | 0.941 | 0.916 |
| Open Fridge | 0.894 | 0.959 | 0.925 | 0.959 |
| Close Fridge | 0.968 | 0.906 | 0.936 | 0.906 |
| Open Dishwasher | 0.887 | 0.899 | 0.893 | 0.899 |
| Close Dishwasher | 0.898 | 0.855 | 0.876 | 0.855 |
| Open Drawer 1 | 0.881 | 0.927 | 0.904 | 0.927 |
| Close Drawer 1 | 0.891 | 0.875 | 0.883 | 0.875 |
| Open Drawer 2 | 1 | 0.901 | 0.948 | 0.9012 |
| Close Drawer 2 | 0.967 | 0.935 | 0.951 | 0.935 |
| Open Drawer 3 | 0.929 | 0.875 | 0.901 | 0.875 |
| Close Drawer 3 | 0.818 | 0.947 | 0.878 | 0.947 |
| Clean Table | 0.995 | 1 | 0.998 | 1 |
| Drink from Cup | 0.996 | 0.994 | 0.995 | 0.994 |
| Toggle Switch | 0.953 | 0.979 | 0.966 | 0.979 |
| Average | 0.953 | 0.952 | 0.952 | 0.952 |

Fig. 8 Confusion matrix obtained after evaluating our proposed hybrid deep learning model on the OPPORTUNITY dataset

Figure 9a, b shows the training versus validation accuracy and training versus validation loss curves on the OPPORTUNITY [4] dataset, respectively. From Fig. 9a, it is observed that the accuracy is low at the start of training, but after approximately 10 epochs the training accuracy reaches above 95%. As training continues, the accuracy keeps improving and stabilizes at approximately 99% after around 60 epochs. Similarly, the validation accuracy reaches approximately 90% after around ten epochs, then dips and rises again above 90%; as the epochs increase, it grows and becomes stable after a certain number of epochs. From Fig. 9b, it is observed that both the training and validation losses are high initially but decrease as the model trains; after approximately 20 epochs the loss becomes minimal and the model's accuracy increases, giving better performance. The training loss starts at around 1 and drops to almost 0.1 after around 20 epochs; as the epochs increase further, the loss value falls below 0.1 and stabilizes at its minimum.


Fig. 9 Graphs showing (a) training versus validation accuracy and (b) training versus validation loss on the OPPORTUNITY dataset

10.4 Results Analysis on HARTH Dataset

Pre-processed samples from the HARTH [5] dataset are divided into training (70%) and testing (30%) sets for training and testing the aforementioned model, respectively. A total of 6,310,200 records of all the activities are considered for the experiment. A set of 4,417,140 records is used for training, whereas the remaining 1,893,060 records are used for testing. Table 8 shows the precision, recall, F1-score, and accuracy of each activity class of the HARTH dataset.

Table 8 Performance measured with respect to precision, recall, F1-score, and accuracy for each activity class of the HARTH dataset

| Activity | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|
| Walking | 0.960 | 0.987 | 0.973 | 0.987 |
| Running | 0.996 | 0.997 | 0.997 | 0.997 |
| Shuffling | 0.850 | 0.840 | 0.944 | 0.840 |
| Stairs (ascending) | 0.896 | 0.834 | 0.864 | 0.834 |
| Stairs (descending) | 0.856 | 0.863 | 0.747 | 0.864 |
| Standing | 0.865 | 0.974 | 0.857 | 0.974 |
| Sitting | 0.995 | 0.922 | 0.957 | 0.922 |
| Lying | 1.000 | 0.997 | 0.999 | 0.997 |
| Cycling (sit) | 0.925 | 0.970 | 0.947 | 0.969 |
| Cycling (stand) | 0.967 | 0.985 | 0.866 | 0.785 |
| Cycling (sit, inactive) | 0.832 | 0.978 | 0.804 | 0.878 |
| Cycling (stand, inactive) | 0.908 | 0.893 | 0.925 | 0.893 |
| Average | 0.943 | 0.947 | 0.943 | 0.947 |

Figure 10 shows the confusion matrix generated after the trained proposed model is evaluated on the test data. The total accuracy of our proposed model is found to be 94.68%, according to the confusion matrix produced by the proposed hybrid deep learning model. From the confusion matrix (shown in Fig. 10), it can be seen that the top three activities classified particularly well are walking, running, and lying, with 98.74%, 99.72%, and 99.72% accuracy, respectively. The most poorly classified activities are 'cycling (stand, inactive),' 'shuffling,' and 'cycling (sit, inactive).' It can be observed from Fig. 10 that around 42.15% of 'cycling (sit, inactive)' samples are misclassified as 'walking' (0.45%), 'standing' (1.79%), 'cycling (sit)' (37.67%), and 'cycling (stand, inactive)' (2.24%), as these activities involve similar motion. Around 91.08% of 'cycling (stand, inactive)' samples are misclassified as 'standing' (55.36%), 'cycling (sit)' (8.93%), 'cycling (stand)' (8.93%), and 'cycling (sit, inactive)' (17.86%), since these are closely related activities. Around 65.94% of 'shuffling' samples are misclassified as 'walking' (37.12%), 'stairs (ascending)' (0.87%), 'standing' (22.27%), 'sitting' (0.44%), 'cycling (stand, inactive)' (1.75%), 'cycling (stand)' (0.44%), and 'cycling (sit)' (3.06%) due to their comparable linear acceleration.

Figure 11a, b shows the training versus validation accuracy and training versus validation loss curves on the HARTH [5] dataset, respectively. From Fig. 11a, it is observed that the training accuracy is low initially, but after approximately 5 epochs it reaches above 94%; as training continues, the accuracy improves almost linearly and reaches around 96% after around 40 epochs. Similarly, the validation accuracy starts at around 93%, fluctuates over the following epochs, and finally settles at around 94.68%. From Fig. 11b, it is observed that the training loss starts at around 0.4 and decreases as training continues, becoming minimal (around 0.1) after about 40 epochs. The validation loss behaves differently: it is minimal (around 0.25) at first, but after a few epochs it increases and finally reaches around 0.30. Many of the activities in this dataset involve very similar motion; for example, cycling (sit, inactive) resembles sitting or standing, and cycling (stand, inactive) closely resembles standing. Because these activities share similar motion, the model is unable to predict them accurately. As observed from the confusion matrix (shown in Fig. 10), the 'cycling (sit, inactive)' activity is misclassified as 'walking,' 'standing,' 'cycling (sit),' and 'cycling (stand, inactive),' given the similarity in motion of these activities. Furthermore, the 'cycling (stand, inactive)' activity is misclassified into activities like 'standing,'


Fig. 10 Confusion matrix obtained after our proposed hybrid deep learning-based HAR model is tested on the HARTH [5] dataset

Fig. 11 Graphs showing (a) training versus validation accuracy and (b) training versus validation loss on the HARTH dataset

A Hybrid Deep Learning-Based Approach for Human Activity …

255

‘cycling (sit),’ ‘cycling (stand),’ and ‘cycling (sit, inactive).’ Hence, the validation loss increases.
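As a quick sanity check, the per-activity misclassification percentages quoted above can be summed to recover the quoted totals; for the 'shuffling' row of Fig. 10:

```python
# Off-diagonal percentages quoted in the text for the 'shuffling'
# row of the HARTH confusion matrix (Fig. 10).
shuffling_errors = {
    "walking": 37.12,
    "stairs (ascending)": 0.87,
    "standing": 22.27,
    "sitting": 0.44,
    "cycling (stand, inactive)": 1.75,
    "cycling (stand)": 0.44,
    "cycling (sit)": 3.06,
}

total = sum(shuffling_errors.values())
print(round(total, 2))   # 65.95, matching the ~65.94% quoted in the text
```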

10.5 Result Summary and Comparison

Table 9 summarises the overall outcomes produced by our proposed hybrid deep learning-based HAR model on all three HAR datasets in terms of four evaluation metrics, namely Accuracy, Precision, Recall, and F1-score. On the other hand, Table 10 compares our proposed hybrid deep learning-based HAR model with some earlier HAR methods implemented on the MHEALTH, OPPORTUNITY, and HARTH datasets.

Table 9 Summarization of the results produced by the proposed HAR model on the MHEALTH, OPPORTUNITY, and HARTH datasets

| Dataset | Accuracy (in %) | Precision (weighted average) | Recall (weighted average) | F1-score (weighted average) |
|---|---|---|---|---|
| MHEALTH [3] | 99.07 | 0.991 | 0.991 | 0.991 |
| OPPORTUNITY [4] | 95.2 | 0.953 | 0.952 | 0.952 |
| HARTH [5] | 94.68 | 0.942 | 0.946 | 0.943 |
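The "weighted average" columns in Table 9 weight each class's score by its share of the samples. A minimal sketch with hypothetical per-class values follows; Scikit-learn's `precision_recall_fscore_support(..., average="weighted")` computes the same from label arrays.

```python
import numpy as np

# Hypothetical per-class recall and class sample counts (supports).
recall_per_class = np.array([0.99, 0.95, 0.90])
support = np.array([500, 300, 200])

# Weighted average: each class contributes in proportion to its support.
weighted_recall = np.average(recall_per_class, weights=support)
print(round(weighted_recall, 3))   # 0.96
```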

Table 10 Performance comparison of our proposed hybrid deep learning-based HAR model with some recently developed HAR models for all three datasets

| Dataset | Author | Model/classifier | Accuracy (in %) |
|---|---|---|---|
| MHEALTH [3] | Singh et al. [33] | CNN + LSTM + self-attention layer + Softmax layer | 94.86 |
| | Khatun et al. [32] | Deep CNN + LSTM + self-attention | 98.76 |
| | Jalal et al. [37] | Decision tree classifier + binary grey wolf optimization algorithm | 93.95 |
| | Debache et al. [36] | Logistic regression | 98.2 ± 2.7 |
| | Proposed | Hybrid CNN + LSTM model | 99.07 |
| OPPORTUNITY [4] | Zhao et al. [24] | Residual bi-directional LSTM | 90.5 |
| | Rashid et al. [23] | Adaptive CNN | 91.57 |
| | Lv et al. [29] | Hybrid model | 92.30 |
| | Xia et al. [26] | LSTM + CNN + GAP | 92.63 |
| | Proposed | Hybrid CNN + LSTM model | 95.2 |
| HARTH [5] | Logacjov et al. [5] | SVM | 0.79 ± 0.22 |
| | Proposed | Hybrid CNN + LSTM model | 94.68 |


11 Conclusion and Future Works

In this research, we have developed a hybrid deep learning strategy for wearable sensor-based HAR problems. Our proposed model was tested on three publicly available standard datasets: MHEALTH, OPPORTUNITY, and HARTH. On these datasets, our model performed very well and predicted activities with recognition accuracies of 99.07%, 95.2%, and 94.68%, respectively, which are quite satisfactory. Additionally, it outperforms various cutting-edge techniques on the datasets under consideration. However, there is still room for improvement, and several directions could be investigated in continuing this research. A more complex hybrid model could be tried to test whether efficiency can be increased further. Furthermore, a dataset collected with our own system could be included; in the real world, a lot of noise and error is merged with the data, in which case filters must be used to eliminate the noise. In the future, researchers could also work on composite activity recognition, for example distinguishing a person standing in the bathroom, the kitchen, or the bedroom; since these scenarios differ, additional sensors could be added to locate exactly where the person is standing. Furthermore, time-series sequence data can be analyzed prior to inclusion in a system. We may also employ graph-based mathematical transforms such as Markov Transition Fields (MTF) and Gramian Angular Fields (GAF) to convert raw time-series sensor data into images or matrices [41]. These images may subsequently be used in CNN-based and other deep learning models [42], as well as in transfer learning methodologies [2].
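The Gramian Angular Field mentioned above rescales a series to [-1, 1], maps each value to a polar angle, and takes pairwise cosines. A minimal NumPy sketch of the summation variant (GASF) follows; libraries such as pyts provide ready-made implementations.

```python
import numpy as np

def gasf(series):
    """Gramian Angular Summation Field of a 1-D time series."""
    x = np.asarray(series, dtype=float)
    # Rescale to [-1, 1] so that arccos is defined.
    x = 2 * (x - x.min()) / (x.max() - x.min()) - 1
    phi = np.arccos(x)                           # polar angle per time step
    return np.cos(phi[:, None] + phi[None, :])   # pairwise-cosine image

img = gasf([0.0, 0.5, 1.0, 0.5])
print(img.shape)    # (4, 4) image, ready to feed a CNN
```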

References

1. Bhattacharya, S., Shaw, V., Singh, P.K., Sarkar, R., Bhattacharjee, D.: SV-NET: a deep learning approach to video based human activity recognition. In: Proceedings of the 11th International Conference on Soft Computing and Pattern Recognition, vol. 1182, pp. 10–20 (2019). https://doi.org/10.1007/978-3-030-49345-5_2
2. Chakraborty, S., Mondal, R., Singh, P.K., Sarkar, R., Bhattacharjee, D.: Transfer learning with fine tuning for human action recognition from still images. Multimed. Tools Appl. 80, 20547–20578 (2021). https://doi.org/10.1007/s11042-021-10753-y
3. Banos, O., Villalonga, C., Garcia, R., Saez, A., Damas, M., Holgado-Terriza, J.A., Lee, S., Pomares, H., Rojas, I.: Design, implementation and validation of a novel open framework for agile development of mobile health applications. Biomed. Eng. Online 14, 1–20 (2015). https://doi.org/10.1186/1475-925X-14-S2-S6
4. Roggen, D., Calatroni, A., Rossi, M., Holleczek, T., Tröster, G., Lukowicz, P., Pirkl, G., Bannach, D., Ferscha, A., Doppler, J., Holzmann, C., Kurz, M., Holl, G., Chavarriaga, R., Sagha, H., Bayati, H., Millán, J.R.: Collecting complex activity data sets in highly rich networked sensor environments. In: Proceedings of the 7th International Conference on Networked Sensing Systems, pp. 233–240 (2010). https://doi.org/10.1109/INSS.2010.5573462


5. Logacjov, A., Bach, K., Kongsvold, A., Bårdstu, H.B., Mork, P.J.: HARTH: a human activity recognition dataset for machine learning. Sensors 21(23), 1–19 (2021). https://doi.org/10.3390/s21237853
6. Singh, P.K., Kundu, S., Adhikary, T., Sarkar, R., Bhattacharjee, D.: Progress of human action recognition research in the last ten years: a comprehensive survey. Arch. Comput. Methods Eng. 29(1), 2309–2349 (2022). https://doi.org/10.1007/s11831-021-09681-9
7. Um, T.T., Babakeshizadeh, V., Kulić, D.: Exercise motion classification from large-scale wearable sensor data using convolutional neural networks. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2385–2390 (2017). https://doi.org/10.1109/IROS.2017.8206051
8. Lawal, I.A., Bano, S.: Deep human activity recognition with localisation of wearable sensors. IEEE Access 8(1), 155060–155070 (2020). https://doi.org/10.1109/ACCESS.2020.3017681
9. Qin, Z., Zhang, Y., Meng, S., Qin, Z., Choo, K.K.R.: Imaging and fusing time series for wearable sensor-based human activity recognition. Inf. Fusion 53(1), 80–87 (2020). https://doi.org/10.1016/j.inffus.2019.06.014
10. Mondal, R., Mukhopadhyay, D., Barua, S., Singh, P.K., Sarkar, R., Bhattacharjee, D.: A study on smartphone sensor-based human activity recognition using deep learning approaches. In: Handbook of Computational Intelligence in Biomedical Engineering and Healthcare, pp. 343–369 (2021). https://doi.org/10.1016/B978-0-12-822260-7.00006-6
11. Mondal, R., Mukherjee, D., Singh, P.K., Bhateja, V., Sarkar, R.: A new framework for smartphone sensor based human activity recognition using graph neural network. IEEE Sens. J. 21(10), 11461–11468 (2021). https://doi.org/10.1109/JSEN.2020.3015726
12. Bhattacharya, D., Sharma, D., Kim, W., Ijaz, M.F., Singh, P.K.: Ensem-HAR: an ensemble deep learning model for smartphone sensor-based human activity recognition for measurement of elderly health monitoring. Biosensors 12, 1–25 (2022). https://doi.org/10.3390/bios12060393
13. Das, A., Sil, P., Singh, P.K., Bhateja, V., Sarkar, R.: MMHAR-EnsemNet: a multi-modal human activity recognition model. IEEE Sens. J. 21(10), 11569–11576 (2021). https://doi.org/10.1109/JSEN.2020.3034614
14. Mukherjee, D., Mondal, R., Singh, P.K., Sarkar, R., Bhattacharjee, D.: EnsemConvNet: a deep learning approach for human activity recognition using smartphone sensors for healthcare applications. Multimed. Tools Appl. 79(41), 31663–31690 (2020). https://doi.org/10.1007/s11042-020-09537-7
15. Banerjee, A., Singh, P.K., Sarkar, R.: Fuzzy integral based CNN classifier fusion for 3D skeleton action recognition. IEEE Trans. Circuits Syst. Video Technol. 31, 2206–2216 (2021). https://doi.org/10.1109/TCSVT.2020.3019293
16. Ghosh, R., Chattopadhyay, S., Singh, P.K.: Recognizing human activities of daily living using mobile sensors for health monitoring. In: Internet of Things and Data Mining for Modern Engineering and Healthcare Applications (2022). https://doi.org/10.1201/9781003217398-3
17. Wang, J., Chen, Y., Hao, S., Peng, X., Hu, L.: Deep learning for sensor-based activity recognition: a survey. Pattern Recogn. Lett. 119(1), 3–11 (2019). https://doi.org/10.1016/j.patrec.2018.02.010
18. Chen, Y., Xue, Y.: A deep learning approach to human activity recognition based on a single accelerometer. In: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, pp. 1488–1492 (2015). https://doi.org/10.1109/SMC.2015.263
19. Ronao, C.A., Cho, S.B.: Human activity recognition with smartphone sensors using deep learning neural networks. Expert Syst. Appl. 59(1), 235–244 (2016). https://doi.org/10.1016/j.eswa.2016.04.032
20. Wan, S., Qi, L., Xu, X., Tong, C., Gu, Z.: Deep learning models for real-time human activity recognition with smartphones. Mob. Netw. Appl. 25(2), 743–755 (2020). https://doi.org/10.1007/s11036-019-01445-x
21. Tang, Y., Teng, Q., Zhang, L., Min, F., He, J.: Efficient convolutional neural networks with smaller filters for human activity recognition using wearable sensors (2020). arXiv preprint arXiv:2005.03948


22. Cheng, X., Zhang, L., Tang, Y., Liu, Y., Wu, H., He, J.: Real-time human activity recognition using conditionally parametrized convolutions on mobile and wearable devices. arXiv preprint arXiv:2006.03259, 1–10 (2020). https://doi.org/10.48550/arXiv.2006.03259
23. Rashid, N., Demirel, B.U., Faruque, M.A.A.: AHAR: adaptive CNN for energy-efficient human activity recognition in low-power edge devices. IEEE Internet Things J. 9(15), 13041–13051 (2021). https://doi.org/10.1109/jiot.2022.3140465
24. Zhao, Y., Yang, R., Chevalier, G., Xu, X., Zhang, Z.: Deep residual bidir-LSTM for human activity recognition using wearable sensors. Math. Probl. Eng. 2018(1), 1–13 (2018). https://doi.org/10.1155/2018/7316954
25. Sun, J., Fu, Y., Li, S., He, J., Xu, C., Tan, L.: Sequential human activity recognition based on deep convolutional network and extreme learning machine using wearable sensors. J. Sens. 2018(1), 1–10 (2018). https://doi.org/10.1155/2018/8580959
26. Xia, K., Huang, J., Wang, H.: LSTM-CNN architecture for human activity recognition. IEEE Access 8(1), 56855–56866 (2020). https://doi.org/10.1109/ACCESS.2020.2982225
27. Qi, W., Su, H., Aliverti, A.: A smartphone-based adaptive recognition and real-time monitoring system for human activities. IEEE Trans. Hum.-Mach. Syst. 50(5), 414–423 (2020). https://doi.org/10.1109/THMS.2020.2984181
28. Wang, H., Zhao, J., Li, J., Tian, L., Tu, P., Cao, T., An, Y., Wang, K., Li, S.: Wearable sensor-based human activity recognition using hybrid deep learning techniques. Secur. Commun. Netw. 2020(1), 1–12 (2020). https://doi.org/10.1155/2020/2132138
29. Lv, T., Wang, X., Jin, L., Xiao, Y., Song, M.: Margin-based deep learning networks for human activity recognition. Sensors 20(7), 2–19 (2020). https://doi.org/10.3390/s20071871
30. Gao, W., Zhang, L., Teng, Q., He, J., Wu, H.: DanHAR: dual attention network for multimodal human activity recognition using wearable sensors. Appl. Soft Comput. 111(7), 1–12 (2021). https://doi.org/10.1016/J.ASOC.2021.107728
31. Ordóñez, F.J., Roggen, D.: Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors 16(1), 1–25 (2016). https://doi.org/10.3390/s16010115
32. Khatun, S., Morshed, B.I.: Fully-automated human activity recognition with transition awareness from wearable sensor data for mHealth. In: Proceedings of the IEEE International Conference on Electro/Information Technology, pp. 0934–0938 (2018). https://doi.org/10.1109/EIT.2018.8500135
33. Singh, S.P., Sharma, M.K., Lay-Ekuakille, A., Gangwar, D., Gupta, S.: Deep ConvLSTM with self-attention for human activity decoding using wearable sensors. IEEE Sens. J. 21, 8575–8582 (2020). https://doi.org/10.1109/JSEN.2020.3045135
34. Khatun, M.A., Yousuf, M.A., Ahmed, S., Uddin, M.Z., Alyami, S.A., Ashha, S.A., Akhdar, H.F., Khan, A., Azad, A., Moni, M.A.: Deep CNN-LSTM with self-attention model for human activity recognition using wearable sensor. IEEE J. Transl. Eng. Health Med. 10(1), 1–16 (2022). https://doi.org/10.1109/jtehm.2022.3177710
35. Gumaei, A., Hassan, M.M., Alelaiwi, A., Alsalman, H.: A hybrid deep learning model for human activity recognition using multimodal body sensing data. IEEE Access 7(1), 99152–99160 (2019). https://doi.org/10.1109/ACCESS.2019.2927134
36. Debache, I., Jeantet, L., Chevallier, D., Bergouignan, A., Sueur, C.: A lean and performant hierarchical model for human activity recognition using body-mounted sensors. Sensors 20(11), 1–12 (2020). https://doi.org/10.3390/s20113090
37. Jalal, A., Batool, M., Kim, K.: Stochastic recognition of physical activity and healthcare using tri-axial inertial wearable sensors. Appl. Sci. 10(20), 120 (2020). https://doi.org/10.3390/app10207122
38. Tahir, S.B., Jalal, A., Kim, K.: Wearable inertial sensors for daily activity analysis based on Adam optimization and the maximum entropy Markov model. Entropy 22(5), 1–19 (2020). https://doi.org/10.3390/e22050579
39. Jordao, A., Nazare, A.C., Sena, J., Schwartz, W.R.: Human activity recognition based on wearable sensor data: a standardization of the state-of-the-art (2018). arXiv preprint arXiv:1806.05226, 1–11. https://doi.org/10.48550/arXiv.1806.05226


40. Mejía, J., Ochoa-Zezzatti, A., Contreras-Masse, R., Rivera, G.: Intelligent system for the visual support of caloric intake of food in inhabitants of a smart city using a deep learning model. In: Applications of Hybrid Metaheuristic Algorithms for Image Processing, pp. 441–455 (2020). https://doi.org/10.1007/978-3-030-40977-7_19
41. Wang, Z., Oates, T.: Encoding time series as images for visual inspection and classification using tiled convolutional neural networks. AAAI Workshop-Tech. Rep. 2015, 40–46 (2015)
42. Banerjee, A., Bhattacharya, R., Bhateja, V., Singh, P.K., Lay-Ekuakille, A., Sarkar, R.: COFENet: an ensemble strategy for computer-aided detection for COVID-19. Measurement 187, 1–14 (2022). https://doi.org/10.1016/j.measurement.2021.110289

Predirol: Predicting Cholesterol Saturation Levels Using Big Data, Logistic Regression, and Dissipative Particle Dynamics Simulation

Reyna Nohemy Soriano-Machorro, José Luis Sánchez-Cervantes, Lisbeth Rodríguez-Mazahua, and Luis Rolando Guarneros-Nolasco

Abstract Four out of ten Mexican adults have high cholesterol, according to the National Institute of Cardiology. Cholesterol is essential for the production of substances in our body, such as hormones, and for vitamin D metabolism; it is also essential for the absorption of calcium and for bile acids. However, excess cholesterol causes hardening and narrowing of the walls of the arteries and can form a clot that causes a heart attack or stroke. Taking this problem into account, in this chapter we present PREDIROL (Predicting Cholesterol Saturation Levels Using Big Data, Logistic Regression, and Dissipative Particle Dynamics Simulation), an approach that combines Big Data with mesoscopic simulation based on the Dissipative Particle Dynamics (DPD) method. Parallel computing using CUDA was implemented to build the DPD model representing the cholesterol and blood molecules. However, considering that the quantity of cholesterol and blood molecules generated in 3D required high computing power, we opted for the 3Dmol.js library, based on WebGL, for rendering 3D graphics within any web browser. PREDIROL seeks to raise awareness about monitoring cholesterol concentration levels, since high levels are detrimental to health, while at very low levels the body cannot produce the cells it needs. It is a tool for preventive medicine, intended to improve users' lifestyles before they develop more serious ailments, including heart attacks or strokes.

R. N. Soriano-Machorro · L. Rodríguez-Mazahua Tecnológico Nacional de México/Instituto Tecnológico de Orizaba, Orizaba, Veracruz, México e-mail: [email protected] L. Rodríguez-Mazahua e-mail: [email protected] J. L. Sánchez-Cervantes (B) CONACYT-Instituto Tecnológico de Orizaba, Orizaba, Veracruz, México e-mail: [email protected] L. R. Guarneros-Nolasco Universidad Tecnológica del Centro de Veracruz, Veracruz, México e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 G. Rivera et al. (eds.), Innovations in Machine and Deep Learning, Studies in Big Data 134, https://doi.org/10.1007/978-3-031-40688-1_12


R. N. Soriano-Machorro et al.

Keywords Big Data · Mesoscopic simulation · Cholesterol · Logistic regression

1 Introduction

Cholesterol is a waxy, fat-like substance. It is a steroid that constitutes an essential component of the cell membrane and a precursor of steroid hormones, and it is found in all cells of the human body. Cholesterol is important for good health: it is needed to make cell walls, tissues, hormones, vitamin D, and bile acid. This lipid comes from the consumption of foods of animal origin, such as egg yolks, meat, and dairy products made from whole milk. There are two types of cholesterol carriers in the human body: HDL (High-Density Lipoprotein, composed of lipids and proteins), commonly known as good cholesterol, and LDL (Low-Density Lipoprotein), known as bad cholesterol. When there is too much cholesterol in the blood carried by the latter, it can build up in the walls of blood vessels, blocking blood flow to tissues and organs and thereby increasing the risk of heart disease and stroke; according to the National Institute of Cardiology, four out of ten Mexican adults have high cholesterol [1]. HDL particles have a diameter between 10 and 25 nm, and phospholipids are their main lipid. They are produced by the liver (30%) and the intestine (70%), and their main function is to extract excess cholesterol from the cells and transport it to the liver for elimination in the form of bile acids and cholesterol in the feces. LDL, in contrast, are cholesterol-rich particles with a diameter of 20-25 nm that are taken up by the cells of the body, which thus obtain the cholesterol they require [2]. In relation to this, cardiovascular disease encompasses a wide range of disorders, including diseases of the heart muscle and of the vascular system that supplies the heart, brain, and other vital organs, in which cholesterol levels play a prominent role. It is the leading cause of death worldwide [3].
Mesoscopic simulation is an alternative, clean, and inexpensive way to study systems without exposure to physical experiments; it surpasses the limitations of molecular simulation, which is directed at the behavior of individual atoms [4]. Dissipative particle dynamics is a mesoscopic simulation method that was developed to study the hydrodynamic behavior of complex fluids; it is also useful for calculating dynamic properties [5]. High cholesterol can be inherited, although it is often the result of unhealthy lifestyle choices, so it can be prevented and treated: eating a healthy diet, exercising regularly, and sometimes taking medication help lower high cholesterol. Big Data offers alternatives for analyzing large volumes of data, for example: (1) High-Performance Computing (HPC) to run advanced applications quickly, efficiently, and reliably; (2) supercomputing that uses coprocessors and accelerators; (3) increasingly accessible HPC resources in the cloud, offered as a service to consumers; and (4) data analysis performed by a specific analyst or a group of specialists [6, 7]. In the medical domain, Big Data helps specialists maintain a communicative, intelligent, and well-oriented relationship to meet specific objectives, eliminating information that is irrelevant to the treatment of diseases. On the other hand, the mesoscopic simulation method located between the atomistic and

Predirol: Predicting Cholesterol Saturation Levels Using Big Data …


macroscopic scales is Dissipative Particle Dynamics (DPD), which is capable of capturing processes that occur in microseconds, allowing analysis of the streams of molecules and their motions based on predefined capacities and velocity-density functions. Individual molecules can be tracked to identify agglomerations or clusters that cause some effect depending on the application domain. For this reason, this chapter presents PREDIROL, a system for predicting cholesterol concentration in the blood, intended to make the population aware of the care and prevention of cardiovascular diseases through a Big Data approach and mesoscopic simulation with the DPD method. Among the expected benefits are showing cholesterol saturation levels in the blood through computational simulation and supporting people in the prevention of heart attacks or cerebrovascular accidents. The remainder of this chapter is organized into four main sections: Sect. 2 covers the related works, including a comparative analysis of PREDIROL and similar initiatives; Sect. 3 presents the architecture of PREDIROL; Sect. 4 describes the results obtained through a case study; and Sect. 5 presents the general conclusions and future work, along with some limitations.
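To illustrate the mesoscopic level of description, the standard DPD pairwise interaction (conservative, dissipative, and random terms in the Groot-Warren formulation) can be sketched as follows. This is a minimal sketch, not PREDIROL's CUDA implementation; the parameter values (a = 25, gamma = 4.5, k_B T = 1, r_c = 1) are common illustrative defaults, not the chapter's.

```python
import math
import random

def dpd_pair_force(r_vec, v_vec, a=25.0, gamma=4.5, kBT=1.0, rc=1.0, dt=0.04,
                   rng=random.Random(0)):
    """Force on particle i from particle j in standard DPD.

    r_vec = r_i - r_j, v_vec = v_i - v_j. Returns the sum of the
    conservative, dissipative, and random contributions as a 3-tuple.
    """
    r = math.sqrt(sum(c * c for c in r_vec))
    if r >= rc or r == 0.0:
        return (0.0, 0.0, 0.0)            # all forces vanish beyond the cutoff
    e = tuple(c / r for c in r_vec)       # unit vector from j to i
    w = 1.0 - r / rc                      # weight w^R(r); w^D(r) = w^2
    f_cons = a * w                        # soft conservative repulsion
    ev = sum(ec * vc for ec, vc in zip(e, v_vec))
    f_diss = -gamma * (w * w) * ev        # friction along the pair axis
    sigma = math.sqrt(2.0 * gamma * kBT)  # fluctuation-dissipation relation
    f_rand = sigma * w * rng.gauss(0.0, 1.0) / math.sqrt(dt)
    mag = f_cons + f_diss + f_rand
    return tuple(mag * ec for ec in e)

# With gamma = 0 the dissipative and random parts vanish, leaving only
# the conservative term a*(1 - r/rc): here 25 * (1 - 0.5) = 12.5 along x.
fx, fy, fz = dpd_pair_force((0.5, 0.0, 0.0), (0.0, 0.0, 0.0), gamma=0.0)
print(fx, fy, fz)  # → 12.5 0.0 0.0
```

Because the particles represent groups of molecules with soft interactions, much larger time steps are possible than in atomistic molecular dynamics, which is what makes the microsecond processes mentioned above reachable.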

2 Related Works

This section briefly describes the analysis of a set of initiatives in the literature related to PREDIROL. The analyzed initiatives are classified into two groups: (1) models for the simulation of fluids, and (2) applications of data mining for the prevention of cardiovascular diseases.

2.1 Models for the Simulation of Fluids

This subsection summarizes and analyzes works related to the simulation of fluids. In [8], in the search for new methods to eliminate excess cholesterol molecules, molecular dynamics (MD) simulation was used to study a system composed of b-cyclodextrin molecules and cholesterol groups, both in isolation and in an aqueous environment. The MD simulations were performed with the NAMD software (Not (just) Another Molecular Dynamics program) using the standard Brünger-Brooks-Karplus integrator, and the simulation setup was visualized with the VMD molecular modeling program. The simulations used the TIP3P CHARMM model (Chemistry at Harvard Macromolecular Mechanics, a set of force fields widely used for molecular dynamics), adjusted because the applied water model does not correctly reproduce the translational diffusion of water. The authors carried out a series of b-cyclodextrin-cholesterol MD simulations in the presence and absence of water. The influence of water is quite significant because the cholesterol molecules group together


with water. The results of the simulations for the removal of excess cholesterol in the body were left as future considerations on the potential use of b-cyclodextrin in combating atherosclerosis. High-density lipoprotein (HDL) was studied in [9] because it is a powerful risk factor for cardiovascular disease. The authors analyzed 47 subjects with different HDL levels (low and high) and processed their analyses with the R language for statistical computing. To perform the simulations, they used the GROMACS package (v. 4.0). The standard MARTINI lipid force field (a model for biomolecular simulations) was used for the lipid and apolipoprotein components. The molecular profiles of HDL particles were combined with dynamic structural modeling, which revealed marked differences in HDL lipidomic profiles, as well as in related clinical and biochemical characteristics, between subjects with low and high HDL-C. The approach made it possible to demonstrate that changes in lipid composition also induce specific spatial distributions of lipids within HDL particles. In [10], an efficient simulation algorithm based on dissipative particle dynamics (DPD) was proposed and used to study electrohydrodynamic phenomena in electrolytic fluids. The fluid flow is mimicked with DPD particles, while the evolution of the ionic species concentration is described using Brownian pseudo-particles. The algorithm was derived from the set of electrokinetic equations for electrolyte fluids and, according to the authors, is designed to optimize the computational efficiency of electrolyte simulations at high physiological salt concentrations. The authors noted that their algorithm, called Condiff-DPD, is not restricted to electrolyte solutions, since the same idea applies to other mesoscale fluid-flow simulations where the diffusion of a minor component is important, for example, in microreactors.
Terrón-Mejía et al. [11] studied the interfacial and structural properties of fluids confined by different walls at the mesoscopic scale using DPD simulations in the grand canonical ensemble. The entire methodology was implemented in a simulation code called SIMES, designed to study complex systems at the mesoscopic scale using a graphics processing unit (GPU). DPD was used for the confined fluids because, in the DPD model, the particles do not represent molecules but rather groups of molecules with smooth boundaries; it was hybridized with the Grand Canonical Monte Carlo (MC) methodology to simulate fluids confined by walls at the mesoscopic scale and to compute their interfacial properties. Four different models of confined fluids were analyzed: fluids confined between smooth walls, symmetrical rough walls, non-symmetrical rough walls, and a combination of a smooth wall and a symmetrical rough wall. In their conclusions, the authors described the different behaviors of the confined fluids depending on the walls used in the simulations and established that the value of the interfacial tension is determined not by the geometry of the walls but by the intensity of the solid-fluid wall interaction. In [12], a simulation package for red blood cell (RBC) flow with GPU-accelerated chemical transport, based on an adaptation of the transport dissipative particle dynamics (tDPD) method, was presented; tDPD was used because it captures reaction kinetics at the mesoscopic level. For programming, the languages C/C++, CUDA, and MPI (Message


Passing Interface) were used. The simulation package processes all computational workloads in parallel on the GPU and incorporates multithreaded programming and non-blocking MPI communications to improve inter-node scalability. The authors used GPUs for their processing speed compared to the Central Processing Unit (CPU). The red blood cell model they presented correctly recovers cell membrane viscosity, elasticity, bending stiffness, and chemical transport across the membrane. The lack of fluid-flow simulation models of complex RBC structures through complex capillary vessels led Phan-Thien et al. [13] to develop a method for simulating red blood cell flows in different tubes. The methods used were smoothed dissipative particle dynamics (SDPD) and dissipative particle dynamics (DPD), with parameters that have a specific physical meaning, combined with thermal fluctuations in a mesoscale simulation, and the immersed boundary method (IBM), a preferred method for handling fluid-structure interaction problems that has been widely used to handle the fluid-RBC interaction in simulations. The authors' objective was to couple SDPD and IBM to perform RBC simulations on complex flow problems. They first developed the SDPD-IBM model and then demonstrated, with CFD (Computational Fluid Dynamics) simulations, the ability of the SDPD-IBM method to simulate red blood cell flows in rectangular, cylindrical, curved, bifurcated, and constricted tubes. Similarly, an overview of MD simulations of nHDL lipid nanodiscs was presented by Pourmousa and Pastor [14]. For the simulations, all nanodiscs were solvated with water molecules and 0.15 M NaCl in cubic boxes of 15-16 nm lateral length using the CHARMM-GUI web interface (Effective Simulation Input Generator and More).
After comparing the nHDL simulations, the authors conclude that understanding nHDL maturation is challenging for future studies because it not only requires large computational resources but also depends on high-resolution structures of the other proteins involved; the apolipoprotein APOA1 first interacts with the ATP Binding Cassette A1 (ABCA1) transporter to acquire lipids. Little is known about the details of this first stage, although single-molecule imaging revealed a dimeric form of the interaction of ABCA1 with APOA1. On the other hand, a GPU-accelerated package for the simulation of flow in nano-to-micropore networks with a mesoscale many-body dissipative particle dynamics (mDPD) model was described in [15]. Mesoscopic simulations of hydrocarbon flow in shales are a challenge due to the heterogeneous pores in shales, with sizes ranging from a few nanometers to a few micrometers. The authors used the mDPD method, a mesoscopic model for fluids and solid molecules; the wall is modeled with a no-slip boundary condition that prevents fluid particles from penetrating the walls. They also used the Velocity-Verlet algorithm, and the programming languages were CUDA C/C++ with MPI and OpenMP. The authors mentioned that a CPU simulation takes 15 times longer than parallel GPU programming; the use of GPUs effectively reduces the communication overhead between ranks/nodes. Using non-obstructive angioscopy with the NOGA device, Kojima et al. [16] aimed to study NOGA-derived aortic ruptured plaque (RP) and the stereographic distribution and regional increase in wall shear stress (WSS)


using computational fluid dynamics (CFD) modeling, for which the PHOENICS-CFD Works application was used. This model calculated the blood flow velocity distribution and revealed the three-dimensional distribution of WSS within the lumen of the aortic arch. The authors studied 45 patients who underwent three-dimensional computed tomography (3D-CT) before coronary angiography. The WSS in the aortic arch was measured by CFD analysis based on the finite element method using uniform inflow and outflow conditions, and aortic RP was detected by NOGA. They concluded that aortic RP detected by NOGA (transendocardial injection with an electromechanical mapping system) was strongly associated with a higher maximum WSS in the aortic arch derived by CFD using 3D-CT. The maximum value of WSS plays an important role in the mechanism not only of aortic atherosclerosis but also of aortic RP.

2.2 Data Mining Application for Prevention of Cardiovascular Diseases

This part of the section continues the descriptions of initiatives similar to our work; the use of data mining techniques for the prevention of cardiovascular diseases is briefly described below. Liu et al. [17] investigated the association between cholesterol efflux capacity and the risk of mortality in patients with a cardiovascular problem (coronary artery disease). Studying 1737 patients with coronary artery disease, the authors found that cholesterol efflux capacity is not simply explained by circulating HDL cholesterol or apolipoprotein A-I levels; it is also independently related to the presence and extent of atherosclerosis. They conclude by suggesting that cholesterol efflux capacity is a predictor of mortality from all cardiovascular problems in both acute coronary syndrome (ACS) and diabetic ketoacidosis (DKA). Reference [18] performed a meta-analysis of data from patients with no history of cardiovascular disease, using two methods of increasing complexity to model repeated measurements: cumulative mean values, and individual-specific intercepts and slopes obtained for each individual from linear mixed-effects regression models. The objective was to quantify the change in discrimination and risk stratification of individuals, according to their expected 5-year CVD risk, when information from repeated measurements of risk predictors was added to single measurements of the risk-factor levels used in standard risk scores. They used 38 studies with emerging risk factors. The events identified for cardiovascular disease risk were non-fatal myocardial infarction and any cerebrovascular accident, and different measurements of HDL and blood pressure, among others, were used in the models.
The authors conclude that incorporating a mean of repeated post-baseline measurements of systolic blood pressure, total cholesterol, and HDL cholesterol into cardiovascular disease risk prediction models results in slight improvements in discrimination and risk reclassification. In the health industry, large amounts of data are


collected that contain information not observable at first glance, which is useful for decision-making in the early detection of diseases; that is why Singh et al. [19] used advanced data mining techniques to develop an effective cardiac disease prediction system using a neural network. The authors' system used fifteen medical parameters for prediction (including age, sex, blood pressure, cholesterol, and obesity). The tool they used was Weka 3.6.11, with a data set of 303 records, of which 40% served as the training set and 60% as the test set. The authors state that their model achieves the best results and helps experts in the medical field plan a better and timelier diagnosis for the patient, since the system behaved realistically and the results showed that it predicts cardiac diseases with 100% accuracy using neural networks. Heart disease is a type of ailment directly related to the human heart and blood circulation; for this reason, the objective of [20] was to collect information on six parameters, namely age, chest pain, electrocardiogram, systolic blood pressure, fasting blood sugar, and serum cholesterol, to detect cardiovascular disease. The model they proposed for the diagnosis of cardiovascular diseases (DCD) using the Mamdani fuzzy inference system (DCD-MFIS) showed an accuracy of 87.05%. They also proposed models for the diagnosis of cardiovascular diseases that include fuzzy logic using the Mamdani Fuzzy Inference System (DCD-MFIS), an Artificial Neural Network (DCD-ANN), and a Deep Extreme Machine Learning (DCD-DEML) approach, which achieved higher precision and accuracy: the proposed DCD deep extreme machine learning approach reaches an accuracy of 92.45%, higher than the previously proposed solutions. To verify performance, they calculated several indicators of predictive performance. The training of the neural networks uses hidden layers of 10-20 neurons.
A DEML configuration with a hidden layer containing ten neurons showed the best result. In [21], an efficient system for the prediction of cardiac diseases was presented; the authors used databases with the clinical information of cardiac patients and applied data mining to process the data set, with input values such as age, sex, chest pain, resting blood pressure, serum cholesterol, and fasting blood sugar, among others. They used the Weka tool and KEEL (Knowledge Extraction based on Evolutionary Learning), a set of open-source Java (GPLv3) tools for implementing development processes for data mining problems, together with the C4.5 algorithm. When the system was tested, it obtained a prediction accuracy of 86.3% in the test phase and 87.3% in the training phase. This tool predicts heart disease early; the main causes of heart disease are high blood pressure and high cholesterol. In [22], the authors use the human eye to detect high levels of cholesterol in the blood, because when thick plaques form in the arteries, an occlusion of the retinal-vein blood flow is observed. The authors developed a mobile application that uses a machine learning algorithm and data such as height, weight, age, sex, eating habits, and an iris photo capture, among others. The result that the application presents to the user, indicating whether there is high cholesterol in the blood, is a simple "Yes" or "No." The authors use Heroku as a server and Python modules for image processing. The challenges they found are that the images vary from person to person, smartphone camera conditions differ, and the lighting situation, due to light reflection, affects the diagnosis.


2.3 Comparative Analysis

Tables 1 and 2 present a comparative analysis showing differences and similarities among the analyzed works, in addition to the methodologies used by the authors to solve the different problems. After analyzing the articles addressed in this chapter, it was found that the works [11, 12, 15] use the GPU architecture, which offers advantages such as execution time, being up to 15 times faster than a CPU architecture; this is why GPU-accelerated simulation was chosen for this work, as it allows parallel processing across one or more GPUs and CPUs. The works [10, 11] use the DPD method, and two variants of DPD are used in [12, 13] and [15] for the simulation of simple and complex fluids; these works justify the use of DPD in this work, in which different levels of cholesterol in the blood will be simulated, because it is the method that best describes complex fluids, offering better control of the transport properties. In [8] and [9], cholesterol was simulated with different simulation methods and software, and works on the simulation of red blood cells were found in [12] and [13]; unlike those works, this one will simulate different levels of cholesterol in the blood. In the medical field, it was also found that [16] simulates the flow of the aorta for angioscopy using data from medical studies. The works [18-21] used data mining techniques to detect and predict cardiovascular diseases, and [17-21] used cholesterol as a risk predictor for heart disease, while [22] proposes an application to detect cholesterol in the blood using machine learning and retinal images, among other data; unlike that work, here cholesterol will be predicted only with data provided by the user. PREDIROL uses data mining techniques to predict cholesterol levels and the risk of heart disease.

3 PREDIROL Architecture

This section presents the architecture of the PREDIROL application, which meets the required needs. Figure 1 shows the architecture used for the development of the application. The layers, modules, and flows that make up the system architecture are described below.

Layers of the PREDIROL Architecture

• Presentation: This layer provides the interface between the user and the system, where the user's information and the predictions of cholesterol levels are displayed.
• 3D Models: This layer contains the 3D model generator component for cholesterol levels.


Table 1 Comparison of work related to simulation

Author | Contribution/Solution | Models | Methodology | Architecture/Software for simulation
--- | --- | --- | --- | ---
Makieła et al. [8] | The system composed of b-cyclodextrin molecules and cholesterol groups was studied | Simulation of b-cyclodextrin molecules and cholesterol groups | TIP3P CHARMM | Software NAMD
Yetukuri et al. [9] | HDL simulation with studies of 47 subjects with high and low cholesterol | Simulation of Molecular Dynamics on a large scale | The MARTINI standard lipid force field | GROMACS simulation package (v. 4.0)
Medina et al. [10] | Simulation method specially designed for systems with high salt concentrations | DPD algorithm | DPD for electrolytes | Does not use
Terrón-Mejía et al. [11] | Method for the simulation of confined fluids with rough walls | Simulation of confined fluids | DPD and MC | GPU
Blumers et al. [12] | GPU-accelerated RBC (Red Blood Cell) simulation package | Simulation of transport with red blood cells | Transport dissipative particle dynamics (tDPD) method | GPU
Phan-Thien et al. [13] | Hybrid numerical method with the immersed boundary method and smoothed dissipative particle dynamics | Numerical model | Smoothed Dissipative Particle Dynamics (SDPD) | CFD (Computational Fluid Dynamics)
Pourmousa and Pastor [14] | Comparison of models for the simulation of HDL nanodiscs | Molecular dynamics simulation | Molecular dynamics with replica exchange and Monte Carlo | CHARMM-GUI
Xia et al. [15] | GPU-accelerated mesoscopic pore-flow simulation package based on a many-body dissipative particle dynamics model (mDPD) | Mesoscopic model for solid and fluid molecules | Many-body dissipative particle dynamics model (mDPD) | GPU
Kojima et al. [16] | Study where the maximum value of WSS plays an important role in the mechanism not only of aortic atherosclerosis but also of aortic RP | Computational fluid dynamics modeling | Fluid dynamics | PHOENICS-CFD Works application


Table 2 Comparative analysis of articles related to Data mining

Author | Approach/Issue | Contribution/Solution | Methods/tools used
--- | --- | --- | ---
Liu et al. [17] | They investigate the association between cholesterol efflux capacity and all-cause and cardiovascular mortality in patients with coronary artery disease | Cholesterol efflux capacity serves as an independent measure to predict cardiovascular mortality | Statistics
Paige et al. [18] | Lack of evaluation of repeated measurements of blood pressure and cholesterol to predict the risk of cardiovascular diseases | Statistical study | Linear Regression
Singh et al. [19] | There is a collection of data that contains useful information to predict the level of risk of heart disease | Algorithm development for an effective heart disease prediction system (EHDPS) using a neural network | Data mining techniques (Neural networks)
Siddiqui et al. [20] | They collect information on six parameters (age, chest pain, electrocardiogram, systolic blood pressure, fasting blood sugar, and serum cholesterol) that a Mamdani fuzzy expert system uses to detect cardiovascular diseases | Artificial neural networks and extreme deep machine learning approaches for the automated diagnosis of cardiovascular diseases | Data mining techniques (Neural networks)
Purushottam et al. [21] | They propose an efficient system for the prediction of cardiac diseases with a data set provided by the clinical information of cardiac patients | Efficient system for predicting heart disease | Data mining techniques (Decision Tree C4.5) with the Weka and Keel tools
Alhasawi et al. [22] | They propose a mobile application that detects blood cholesterol with iris photography and other data about the user | Mobile application for the detection of cholesterol in the blood | Machine Learning and Heroku server with Python modules to process images

• Predictions: This layer contains a component that calculates the cholesterol prediction.
• Data model: This layer contains the user profile component.
• Repository: The repository is based on MongoDB and contains all the user profile data.


Fig. 1 PREDIROL system architecture

Architecture Components

• Data sources: This module is in charge of storing the analyses of the patients, which contain the levels of cholesterol in the blood.
• Repository: The repository is responsible for saving the patient's data in a database manager.
• Predictions: This module receives the data from each user's profile in the repository to predict the cholesterol levels that the user currently has and will have over future time periods. The module will also predict the thickness of the artery walls at different periods of time.
• 3D models: In this module, three-dimensional models of cholesterol levels are generated for different periods of time with data from the cholesterol saturation predictor module.
• Presentation: This module shows the user the 3D models of cholesterol saturation in the different periods of time simulated by the 3D cholesterol-level model generator module, as well as the user profile.


Workflow

1. From the data sources (clinical analyses), the information will be extracted and filtered to store the data in the repository.
2. Based on the information stored in the repository, the predictions of cholesterol saturation in the blood will be made for different periods of time and stored in the repository.
3. The predictions of cholesterol levels (3A) will be passed to the component that generates the positions of the molecules for the 3D models of the different time periods; these models will be stored in the repository (3B).
4. From the positions of the molecules, the component in charge will generate the three-dimensional models of cholesterol saturation for the time periods calculated by the predictor.
5. The user will be able to visualize, through the application, the different levels of cholesterol saturation in different periods of time in 3D.

3.1 Big Data Model

Figure 2 shows the data model implemented in MongoDB for the storage of user data and for prediction. Five documents are observed in the data model; Table 3 shows the model documents with their descriptions. Each of the documents that make up the data model provides persistence for the different information PREDIROL uses for its proper functioning. Among the most important information are the records provided by the laboratory of the "Sanatorio Escudero" in Orizaba, Veracruz, Mexico, in order to validate PREDIROL. The specialists in the laboratory of the Sanatorio Escudero provided us with 10,355 patient records with three fields essential for the prediction of cholesterol saturation levels: "age," "gender," and "cholesterol." It is important to mention that, for privacy reasons, first and last names have been changed. The following section describes how the cholesterol saturation level prediction module works using the fields "age" and "cholesterol."
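As a minimal sketch of how one of these documents could be assembled before being persisted, the following Python fragment builds an "Analysis" record with the essential fields. The helper name is a hypothetical illustration, not PREDIROL's actual code, and the field names follow the data model of Fig. 2 (including its spelling of "_tipeAnalysis").

```python
import json

def make_analysis_doc(patient_id, age, total_cholesterol,
                      analysis_type=1, analysis_date="2021-07-05",
                      route="archivo.pdf"):
    """Hypothetical helper mirroring the 'Analysis' document of Fig. 2."""
    return {
        "_id_patient": patient_id,
        "_tipeAnalysis": analysis_type,          # field spelled as in the data model
        "_age": age,
        "_date": analysis_date,
        "_totalCholesterol": total_cholesterol,  # mg/dl
        "_route": route,
    }

doc = make_analysis_doc("p-001", 35, 150)
print(json.dumps(doc, indent=2))
# With pymongo, persisting it would look roughly like:
#   db.Analysis.insert_one(doc)
```

Because MongoDB stores schemaless documents, records imported from the laboratory's files can be inserted directly in this form, and the prediction module can query only the "_age" and "_totalCholesterol" fields it needs.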

3.2 Cholesterol Saturation Level Prediction Module

Logistic regression was used in the prediction module with the variables age versus total cholesterol in mg/dl. Figure 3 shows a graph with the data used for model training.


documento General_data: { _id, _id_patient, _name: "Peter", _last_name: "Parker", _last_name2: "Stark", _phone: "2720000000", _date_of_birth: 1996-05-17, _city: "Orizaba", _gender: "M" }

documento Patient (local): { _id, local: { user: "[email protected]", password: "*******" } }

documento Patient (facebook): { _id, facebook: { email: "[email protected]", name: "Peter", last_names: "Parker", photo: "foto_perfil" } }

documento Analysis: { _id_a, _id_patient, _tipeAnalysis: 1, _age: 35, _date: 2021-07-05, _totalCholesterol: 150, _route: "archivo.pdf" }

documento DPDmodels: { _id: 200, _totalCholesterol: 220, _totalBlood: 880 }

documento molecule: { _id_particle, _position_x: 22.534, _position_y: 8.534, _position_z: 6.534, _type_molecule: C }
Fig. 2 Big Data model

Table 3 Description of the data model in Fig. 2

Document | Description
--- | ---
Patient | This document stores the patient's data to log in, either using a registered email or through the social network Facebook
General_data | This stores the patient's data to complement the record, such as full name and age, among others
Analysis | The document stores the amount of cholesterol, as well as the date of the analysis and the type of analysis that the patient underwent
DPDmodels | This document stores the data for the construction of the DPD models of the concentration of cholesterol in the blood
Molecule | This document stores the positions of the molecules and the type to which they belong

The model was created in the R language. Listing 1 shows a fragment of the code for creating the model, which was later trained. A 70-30 partitioning (line 3 of Listing 1) was used for training; that is, 70% of the sample was used for training and 30% for predictions. Additionally, we chose a size of 0.9 to obtain a better visualization of the results plot using the ggplot2 plotting package.
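The same training scheme can be sketched outside R. The following Python version is illustrative only: the data are synthetic, and the 240 mg/dl cutoff for labeling a "high" cholesterol case is an assumption for the example, not the chapter's model. It reproduces the 70-30 partition and fits a logistic curve by gradient descent.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=3000):
    """Logistic regression P(y=1|x) = sigmoid(w*x + b) via gradient descent."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

random.seed(1)
chol = [random.uniform(120.0, 320.0) for _ in range(200)]   # mg/dl, synthetic
label = [1 if c >= 240.0 else 0 for c in chol]              # assumed "high" cutoff

idx = list(range(len(chol)))
random.shuffle(idx)
cut = int(0.7 * len(idx))                                   # 70-30 partition
train_idx, test_idx = idx[:cut], idx[cut:]

scale = lambda c: (c - 200.0) / 50.0                        # crude standardization
w, b = train_logistic([scale(chol[i]) for i in train_idx],
                      [label[i] for i in train_idx])

predict = lambda c: sigmoid(w * scale(c) + b)               # P(high | cholesterol)
acc = sum((predict(chol[i]) >= 0.5) == (label[i] == 1)
          for i in test_idx) / len(test_idx)
print(predict(300.0), predict(150.0), acc)
```

A cholesterol value well above the cutoff yields a probability near 1 and a value well below it a probability near 0, which is the behavior the prediction module exploits when projecting saturation levels.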


Fig. 3 Cholesterol graph by age

Listing 1: Model created in R language

# Loading data
data